Tuesday, April 28, 2009

Two months using RSP with EVAs and going strong


This blog post is a recap of my painful migration to RSP, which is now almost finished for the systems I care for: EVAs, Integrity Servers running HP-UX, and ProLiant servers running ESX. This post covers the EVAs; the other systems will be covered later.

When it came time to migrate from ISEE to RSP, I was very fearful of doing the migration for my EVAs. Why? Because most server components rarely fail nowadays, which leaves us mostly with mechanical disks, power supplies and batteries. Since disk arrays such as EVAs have a bunch of those, they are the most prone to failure and require the most maintenance, so they are where broken event reporting would hurt the most.

I started testing RSP as early as last October, with disastrous results. There's a price to pay for being on the bleeding edge. It took quite a while for me to understand all the RSP components, but now the puzzle is fairly complete. And now that many Jack Bauer-style customers like me have started migrating without the help of HP, I'm seeing an increasing number of disguised complaints from disgruntled users in the ITRC forums, and even from HP employees.

I was so pissed off by all the bugs I had in my initial test run of RSP that I decided to go back to square one and put myself in the shoes of HP's QA people, thinking about how they must actually test their software. I said to myself: "chances are they start with freshly installed environments"... so I did the same. I reinstalled the CMS from scratch, even went as far as zapping my Storage Management Servers, and took great care to RTFM the boring prerequisites guide, which is so clinical that it would make reading medical transcriptions a funny adventure. Every small dependency, from ELMC to the MC3 components to manually configuring WEBES to use the "CommandView" protocol, has been taken care of.

And it works!

Up until now, I've had three events on two different EVAs, and they were all forwarded to WEBES, then the ISEE client, then HP. HP then called back the contact person to schedule what needed to be done. What's fun now is that everything is centralized rather than dispersed across multiple SMSes, so configuring the contacts for the EVAs is easier. We're spread out over multiple data centers across the province, and this proves much easier to manage.

Using a centralized CMS, however, has its drawbacks. For instance, what will happen if there's an event while the CMS is down? Will it be queued until the CMS gets back up? I haven't tested this yet. Furthermore, should we invest in a highly available CMS? That's quite a sum of money. And what will happen when my CMS or SMS is out of contract? I wouldn't be surprised if it stopped working.

So many questions, so little time...
