Wednesday, January 28, 2009

Goodbye ISEE: Monitoring an EVA with RSP: day 1

I have deferred upgrading the monitoring of my EVAs to RSP until Q1 2009 since I had a great deal of trouble with RSP last fall and was fed up.

HP Services proposed coming to help me (we have 6 EVAs and 6 SMSes), but I thought I'd try for myself for the first one to at least understand what they'll be doing, and be able to troubleshoot it once they're gone.

To increase my chances, I decided to start everything from scratch on the CMS and SMS. As far as these two servers are concerned, it doesn't get more "standard" than this:

  • The SIM administrator and I installed SIM 5.2 on a freshly reinstalled Windows server, then we restored our database successfully (see one of my previous posts for my recommendations on this)
  • I completely zapped my test SMS and reinstalled a vanilla Windows 2003 Enterprise, along with CV 6.0.2 and nothing more (we stick with 6.0.2 since it's the only version certified with Metrocluster).
Now does it work? Partly.

  • SIM must have at least spoken to SMI-S, since EVAs appeared automagically in the system list. But there's not much information I can get from them.
  • As the RSP "prerequisite" documentation that explains how to set everything up has no fucking example screenshot, who knows if the EVA entries in SIM are supposed to be in this state or not. Message to whoever's writing these guides: I'm sure you are allowed to put images there. Please do it!!!
  • WEBES is still not able to communicate with CV since it sees the server as a generic "Proliant", and not a "CommandView Server". Is it because I'm running on a generic server instead of a real SMS? Maybe, but a generic server is officially supported. I don't know how to change this yet. More work needs to be done.
Since there are still problems and SIM 5.3 was released last Monday along with WEBES 5.4, I'll try to have SIM upgraded first and start from there. This will probably be the last straw. If it still doesn't work, I'm converting my SSSU monitoring script to a nagios plugin and I'll give it for free to everyone who's interested. If HP is happy to give me crappy software that doesn't work, then I'll let them handle the overhead of paying a human to manage the service calls that I'll log manually. I have wasted too much time and energy on this already.

Thursday, January 22, 2009

The idiot's guide to (re-)installing SIM on Windows and making it actually work

My colleague and I have been busy in the last few days doing a complete re-install of SIM and RSP since we were running into problems with our server that would be tough to explain. To make a long story short, we decided that a fresh reinstall would fix things, and it looks like it did. Why are we running on Windows and not on HP-UX? Basically because 1) SIM was initially installed on Windows in our shop, 2) RSP only works on Windows and 3) My colleague is a Windows guy. :)

Here are my 10 suggestions if you want to do this. This might seem stupid for a Windows admin but I'm an HP-UX guy, remember.

1. Have a good backup
First of all, we made sure we had a good backup of the SIM database. HP has a whitepaper on the subject. It says what to back up, but not necessarily how to back it up automatically. This was my first MS-SQL experience, and I ended up writing a custom script to back it up. I run it each day to dump the database, so that it can be backed up consistently.

2. Before reinstalling, confirm first that your data can be restored
Which I did by setting up a dummy VM running Windows and restoring the data to a dummy SIM. It worked.

3. Use the Smart Start CD to Install Windows Server
I'm always sceptical of software that's self-labeled as "smart" and thought that we could just install a vanilla Windows server, then add all appropriate drivers and stuff... waste of time. Smart Start does all of this for you, and can install Windows from a CIFS-accessible .iso file.

4. Don't use a localized Windows and other software
Use a plain, honest-to-goodness U.S. English version of Windows. If and when you run into problems, Google will be a much better friend if you paste it error messages that are in English. If your company has a policy of installing software in a localized language, screw 'em.

5. Use the defaults to install *EVERYTHING*
Even if you don't like the defaults, at least they will work. We ran into a few bugs, especially with the database, and ended up thinking "if we were the QA guys at HP, how would we set up our server?" Chances are the answer to this is using the defaults! So don't try to tweak install options, whether in SIM, RSP or MSSQL, unless you really know what you're doing. We didn't.

6. Don't run the software in your own account
Have it run with a generic account. If you use your personal account, SIM and MSSQL will work, but expect problems when your account gets deleted once you a) quit your job or b) get fired. Of course doing this is a good way to leave a time bomb at work in the case of b).

7. Update your server with Windows update between each software install
You'll probably end up going there 3-4 times.

8. Run the SIM installer on the console
No need to use the iLO, you can type "mstsc /console" to do a terminal session. If you don't use the console, the RSP installer could fail miserably. Trust me.

9. Be patient when RSP is installing
It often asks you to wait "a few minutes" but experience here has shown me that it should rather be "a few hours" since it's downloading a lot of software in the background. Looks like the development team at HP tested this only on their gigabit network. In the real world, downloading hundreds of megabytes of bloated data through the internet can actually take quite some time.

10. Be prepared to reinstall everything, even Windows, if it doesn't work
There's an expression in French, un mal pour un bien, which means a bad thing for a good thing. We had problems with MSSQL which would have been impossible to fix cleanly, and decided that reinstalling Windows would be actually quicker than trying to make it work. It's not that bad, since by reinstalling Windows, yours truly actually took notes this time, and is sharing them with you!

Good luck

Tuesday, January 13, 2009

One liner: count the total uncompressed space of a gzipped tarball on HP-UX

gzcat u01.tar.gz | tar tvf - | awk '{tot+=$3; print; printf ("total = %10.0f\n", tot);}'
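
The awk program above prints a running total after every archive entry. If only the final figure matters, a small variant with an END block does the job. This is a minimal sketch, not part of the original one-liner: the function name tar_listing_total is made up for illustration, and it assumes, like the one-liner, that the size in bytes is the third field of the tar tvf listing.

```shell
# Hypothetical helper: sums the third field (size in bytes) of a
# "tar tvf" listing read from stdin and prints only the grand total.
tar_listing_total() {
    awk '{ tot += $3 } END { printf("total = %.0f\n", tot) }'
}

# Intended use (not run here):
#   gzcat u01.tar.gz | tar tvf - | tar_listing_total
```

Since the function only reads a listing from stdin, it can be tried on any saved tar tvf output before pointing it at a large tarball.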

Friday, January 9, 2009

BladeSystem Virtual Connect Support Utility reports "TCP Port 21 in Use"


When this happens, don't waste time looking for TCP issues on the VC-FC as I did... it means that you have a running FTP server on the machine from which you're running vcutil. Stop it and it will work.
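
A quick way to confirm this is to look for a listener on port 21 in the output of netstat -an. The filter below is only a sketch: the function name ftp_listeners is made up, and it assumes a Unix-like shell with awk on the machine running vcutil. It accepts both the "*.21" address style used by HP-UX and the "0.0.0.0:21" style used by Linux and Windows.

```shell
# Hypothetical helper: reads "netstat -an" output on stdin and prints
# the listening sockets whose local address ends in port 21.
ftp_listeners() {
    awk '/LISTEN/ {
        for (i = 1; i <= NF; i++)
            if ($i ~ /[.:]21$/) { print; break }
    }'
}

# Intended use (not run here):
#   netstat -an | ftp_listeners
```

If this prints anything, something is already bound to port 21 on that machine, most likely an FTP server.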

Monday, December 8, 2008

Reduce vxfsd usage

If you're seeing high usage of vxfsd on 11iv2 (I don't know about 11iv3), chances are it's wasting time managing the vxfs inode cache. Depending on your situation, setting a static cache can help. I've been doing this for years on a particular system with good results and had to do it again this morning, so I thought I'd post about it. The procedure is documented here:
http://docs.hp.com/en/5992-0732/5992-0732.pdf

Simply put, you have to do this:
# kctune vxfs_ifree_timelag=-1

Don't credit me for finding this one out. I owe it to Doug Grumann and Stephen Ciullo.

Wednesday, December 3, 2008

Using DDR in a mixed SAN environment under 11iv3

Update Feb 10th 2009: I wrote a script to help manage DDR.

A little-known feature of the HP-UX 11iv3 storage stack is DDR, which stands for Device Data Repository. It lets you set "scope attributes" for the storage driver which apply to specific disk types. As far as I know, there is no whitepaper on this yet, so you have to read the scsimgr(1m) manpage to know about it. In my case, I learned about this feature during a lab in Mannheim (which was worth the trip in itself). The scsimgr whitepaper on docs.hp.com does give out a few bits of info but doesn't show the real deal. I'll try to do this here.

Simply put, creating a scope enables you to use the -N option with scsimgr set_attr and scsimgr get_attr, which applies attributes to a whole set of devices that share common characteristics rather than to one specific device.

For example, if you have a server that has EVA disks along with MPT devices, you will probably want to set the SCSI queue length of the EVA devices to something bigger than 8, which is the default. But MPT devices have to remain at 8. Doing this with DDR is easy; simply set a scope attribute that will automatically adjust the queue length only for HSV210 devices.

Here's an example.

First of all, let's define a scope. Start by getting the DDR name that applies to your EVA device:
# scsimgr ddr_name -D /dev/rdisk/disk93 pid
SETTABLE ATTRIBUTE SCOPE
"/escsi/esdisk/0x0/HP /HSV210 "


You can go down further to the bone and even include the revision of your controller:
# scsimgr ddr_name -D /dev/rdisk/disk93 rev
SETTABLE ATTRIBUTE SCOPE
"/escsi/esdisk/0x0/HP /HSV210 /6110"


Once you have your scope, add it to the device data repository - the DDR. You have to do some cut and paste here, as blanks between the quotes are important.
# scsimgr ddr_add \
-N "/escsi/esdisk/0x0/HP /HSV210 "
scsimgr:WARNING: Adding a settable attribute scope may impact system operation if some attribute values are changed at this scope.
Do you really want to continue? (y/[n])? y
scsimgr: settable attribute scope '/escsi/esdisk/0x0/HP /HSV210 ' added successfully

Finally, use the -N option of scsimgr to set your attribute on the entire scope. In this example, I'll set max_q_depth:
# scsimgr set_attr \
-N "/escsi/esdisk/0x0/HP /HSV210 " -a max_q_depth=32

Don't forget to save it if you want to keep it across reboots:
# scsimgr save_attr \
-N "/escsi/esdisk/0x0/HP /HSV210 " -a max_q_depth=32

And voilà. All your EVA disks, running on an HSV210, now have a queue depth of 32. Furthermore, any new EVA device you present on the server that matches your scope will inherit the new attribute. Does it really work across reboots? I don't know yet, but most probably.

Another example would be to set a specific load balancing policy for MSA devices:
# scsimgr ddr_add \
-N "/escsi/esdisk/0x0/HP /COMPAQ MSA1000 VOLUME"
# scsimgr set_attr \
-N "/escsi/esdisk/0x0/HP /COMPAQ MSA1000 VOLUME" \
-a load_bal_policy=preferred_path
# scsimgr save_attr \
-N "/escsi/esdisk/0x0/HP /COMPAQ MSA1000 VOLUME" \
-a load_bal_policy=preferred_path

Get the picture? DDR is very powerful in mixed SAN environments. With it you don't have to bother about setting attributes for each specific disk.
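
Both examples follow the same three steps (ddr_add, set_attr, save_attr), so they can be wrapped in a small shell function. This is only a sketch, not the script mentioned in the update above: the name ddr_apply and the SCSIMGR override variable are made up for illustration. It uses only the scsimgr invocations shown in this post and stops at the first step that fails. I'm assuming ddr_add still asks for its y/n confirmation, and I haven't checked how it reacts to a scope that already exists.

```shell
# Hypothetical wrapper around the three DDR steps shown in this post.
# $1 = scope string (quote it, the blanks matter), $2 = attribute=value
# Set SCSIMGR="echo scsimgr" for a dry run that only prints the commands.
ddr_apply() {
    scope=$1
    attr=$2
    ${SCSIMGR:-scsimgr} ddr_add -N "$scope" &&
    ${SCSIMGR:-scsimgr} set_attr -N "$scope" -a "$attr" &&
    ${SCSIMGR:-scsimgr} save_attr -N "$scope" -a "$attr"
}

# Intended use (not run here), with the scope pasted from ddr_name:
#   ddr_apply "<scope string>" max_q_depth=32
```

Doing a dry run first is a cheap way to check that the scope string survived the cut and paste before anything is changed on the system.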

Have fun.

Tuesday, December 2, 2008

RSP still sucks... but not big time anymore

The blog entry where I said that RSP sucks has attracted some attention, both in and out of the comments area. An update is in order. First of all, I won't censor that entry; it represents my initial feeling about RSP, a software bundle which made me waste lots of time, and what I think of it has not changed.

On the upside, following my rant on the ITRC forums (which was deleted quickly), some people at HP Canada noticed and put me in contact with colleagues in Colorado who were glad to listen to my comments, and they promised to address some of the issues. Some of my concerns were: no support for VMs, no cookbook for HP-UX admins, lack of feedback from SWM, etc. I also had a quick talk with Brian Cox in Mannheim a few weeks later and he was aware of the problems HP-UX shops are facing with ISEE going away, as some of them don't want to install Windows. Personally I don't care, but I would rather have run this on HP-UX if I could; I'm no Windows admin and feel more at home on Unix systems.

I've been running RSP as the only notification mechanism for a few Proliant (ESX) and Integrity (HP-UX) servers for over a month now, and it seems to work. All the events are sent to HP, and closed. I've also been able to have my C7000 blade chassis monitored, although I couldn't find any documentation for this. I just set up the CMS as the trap destination, crossed my fingers, and test traps generate RSP events.

I estimate that installing, debugging (and trying to understand) SIM and all the components that replace ISEE has taken me over 20 hours. That's a lot of work. So when a component breaks in the future, I expect a phone call or e-mail from HP Support. If I don't get anything, I won't be in a good mood. I have many EVAs of different generations that will be migrated sometime in early 2009. They require more preventive maintenance, so this will be the real test.

In the meantime I'm asking all the support personnel to take a walk in their data center (we have 6) once in a while, looking for red lights. I thought these days were over, but RSP is a stack of multiple monitoring software solutions, and I haven't had proof yet that it can be trusted.

O.