Friday, August 29, 2008

Multi-homing under IPFilter: A gotcha with HP-UX

Over the last year I've been running into some weird problems with IP Filter on multi-homed HP-UX servers. I've managed to work around them until now, but I think I've hit a particular problem when running under ServiceGuard with floating IPs.

Take the following steps if your TCP sessions lock up after a while, with no indication in syslog that packets are being blocked:

1. Stop IP Filter (easy, but probably not what you want)

2. If you're running IP Filter on a multi-homed system, take great care to prevent any asymmetric routing (i.e. make sure that whatever comes in on one interface goes out on the same one).
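To see why asymmetric routing hangs sessions without logging any blocks, here's a toy Python model of stateful filtering in general. This is pure illustration, not IP Filter code or actual HP-UX behavior: the point is that a stateful filter ties its state entry to the interface the connection went out on, so a reply that comes back on a different interface matches nothing and is silently dropped.

```python
# Toy model of a stateful packet filter on a multi-homed host.
# Illustration only -- not actual IP Filter internals.

class StatefulFilter:
    def __init__(self):
        # State entries for allowed sessions: (interface, src, dst)
        self.state = set()

    def outbound(self, iface, src, dst):
        """An outgoing packet creates a state entry tied to its interface."""
        self.state.add((iface, src, dst))

    def inbound(self, iface, src, dst):
        """A reply passes only if it matches state on the SAME interface."""
        return (iface, dst, src) in self.state  # a reply reverses src/dst

fw = StatefulFilter()
fw.outbound("lan0", "10.0.0.1", "192.168.1.50")   # request leaves via lan0

print(fw.inbound("lan0", "192.168.1.50", "10.0.0.1"))  # True: symmetric path
print(fw.inbound("lan1", "192.168.1.50", "10.0.0.1"))  # False: reply arrived
                                                       # on lan1, dropped
```

The dropped reply never hits a block rule, which is why syslog stays quiet while the TCP session just hangs.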

I'll try to make a comprehensive post on this particular problem soon.

Tuesday, August 26, 2008

Great blog entry that lists useful ESX tools

MCS StorageView is particularly useful!!

Monday, August 25, 2008

Building a (cheap) NAS for ESX using OpenFiler

In the last days before my vacation, I spent some time rebuilding an old ProLiant DL380 G2 attached to an MSA500 to make a cheap NAS to use as an ESX datastore.

Using OpenFiler, it is possible to make a cheap, iSCSI-based server that could store non-critical data such as ESX clones and templates. I tried it, and it seems to work well. Two problems surfaced, though:

  • There is no easy way to install the ProLiant Insight Agents on OpenFiler, as RPM packages can't be installed (and I didn't push my luck with rpm2cpio). When reusing old hard drives, this is a necessity: you really need to be able to monitor them.
  • I left it up and running for a few weeks, and a networking glitch made it unresponsive on the network; my take is that NIC teaming doesn't work well. That's weird, since I had tested it by unplugging cables. That server doesn't have an iLO, and it's located in our downtown datacenter, which I don't visit that often, so I'm screwed.

So I'm ditching this for the time being. I would prefer having a CentOS-based solution, so that the RHEL Proliant Insight Agent works. But AFAIK nothing seems as easy to set up as OpenFiler. I'm no Red Hat admin, so making all these features work on a vanilla system would take me too much time. If anybody has any suggestions, drop me a note.

Adopting a conservative ESX patching strategy

HP-UX system administrators are familiar with the two "patching strategies": conservative or aggressive. Needless to say, on the mission-critical systems I manage, I've always adopted the conservative one. It's hard to get downtime for a reboot anyway, so one might as well be sure the patches work.

With my previous, lone ESX 2.x server, I almost never installed any patches, since it was complicated: VMware simply didn't have any tool to make the patch inventory easy.

With ESX 3.5, up until now I've been delighted by VMotion and Update Manager's ease of use. Patching ESX servers is now simple: just use Update Manager, remediate your servers, and everything happens unnoticed by the users. UM downloads the patches to your VirtualCenter server, puts the host in maintenance mode, VMotions away any VMs it may be running, then runs esxupdate. It's simple, no questions asked.

That was until the ESX 3.5 Update 2 release.

Most ESX admins will know about the August 12th timebomb in that release. All of this happened while I was on vacation. Thank God nothing broke: had anyone shut down a VM, it would have been impossible to restart, and I might have been paged while on vacation.

Needless to say, I spent some time fixing this. Had I waited a few weeks before applying this update, as I should have, I would have been spared this unpleasant experience.

Experienced sysadmins will tell me I've been too aggressive. That's true. I was too excited by Update Manager and VMotion. I'll be more careful now.

Sunday, August 3, 2008

Taking time off... and VM snapshots

Anyone reading this blog won't notice much activity, as I'm taking three weeks of vacation from work. Anything I could write about during that time would mostly concern home improvement, child care and leisure destinations, and this, my friends, I don't intend to post about. :)

On a side note, be careful with these darn ESX snapshots. It turns out their logic is reversed from what I'm used to. I might be wrong, but all the snapshot technologies I've seen until now, such as VxFS snapshots and EVA snapshots/snapclones, create a separate data area and store the deltas accumulated since the moment of the snapshot. When there's no more space left, for example when the LV fills up with a VxFS snapshot, the logical thing happens: no more deltas can be logged, so your snapshot is lost.

That's not how it works with ESX. Under ESX, the original .vmdk is frozen and made read-only, and all new writes are logged to another .vmdk file, aptly named xxxxx-delta.vmdk. So the original .vmdk holds the state at the time of the snapshot, not the current state of the disk.

When you "delete" a snapshot, you're actually committing the delta to the original file, a process that takes time since all the changes must be merged back into the original file. So anyone intending to use snapshots must consider the time it takes to get rid of them.

I don't know why ESX makes snapshots like this; I haven't found an explanation yet (although I'm sure there is one; there might be a performance gain in doing so). But what happens if there's no more space left to hold your snapshot? You'll actually be losing current data, not past data. That sucks. Your VM will crash. And since your snapshot, or should I say current state, will be corrupted, the only thing you can do is go back to its parent.
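To make the contrast concrete, here's a simplified Python model of the two snapshot styles, with dicts standing in for disk blocks. Pure illustration under my own assumptions, not actual VxFS, EVA or ESX internals:

```python
# Simplified model of two snapshot styles (illustration only).
# A "disk" is a dict of block -> data.

def cow_snapshot_write(base, snap_area, block, data, snap_capacity):
    """Traditional copy-on-write (VxFS/EVA style): the base stays current;
    the ORIGINAL block content is copied to the snapshot area before the
    overwrite. If the snapshot area fills up, only the snapshot is lost."""
    if block not in snap_area:
        if len(snap_area) >= snap_capacity:
            snap_area.clear()           # snapshot invalidated...
            snap_area["LOST"] = True    # ...but the base (current data) is intact
        else:
            snap_area[block] = base.get(block)  # preserve the old content
    base[block] = data                  # current data always lands in the base

def esx_snapshot_write(base, delta, block, data, delta_capacity):
    """ESX redo-log style: the base .vmdk is frozen read-only; ALL new
    writes go to the delta. If the delta fills up, it's the CURRENT
    data that can no longer be written."""
    if len(delta) >= delta_capacity and block not in delta:
        raise IOError("delta full: current write lost, VM would crash")
    delta[block] = data                 # the base is never touched

def esx_delete_snapshot(base, delta):
    """'Deleting' the snapshot commits the delta back into the base --
    the merge step whose duration grows with the size of the delta."""
    base.update(delta)
    delta.clear()
```

In the copy-on-write model, running out of space sacrifices the snapshot; in the redo-log model, it sacrifices the running VM, which is consistent with the behavior described above.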

So be careful.