Wednesday, January 27, 2010

Cold-Updating small ServiceGuard clusters -- FAST!

Here's my guerrilla procedure to cold-update small ServiceGuard clusters without doing an official rolling upgrade.

I'm currently migrating many small two-node ServiceGuard clusters, scattered across different sites, from SG 11.18 / HP-UX 11.23 to SG 11.19 / HP-UX 11.31. I decided to upgrade not only the OS but also the clustering software, for the simple reason that I didn't want to stick with 11.18 and have to update SG later down the road... With 11.19, I should be good for a few years.

The "rolling upgrade" procedure documented in the Admin Guide doesn't work in such a scenario as last time I checked, it only supports running an update-ux on the nodes one after another. I don't do update-ux, I prefer cold-reinstalling my systems with my heavily customized Golden Image. And since I wanted to take advantage of the downtime to move to 11.19, I fell in the "unsupported" arena.

Here's how I'm pulling it off with a procedure that takes a mere 60 seconds more downtime than a straight failover:

1. Update the failover node
1a) reconfigure the packages to be runnable only on the main node
1b) reconfigure the cluster to remove the failover node (you'll end up with a one node cluster)
1c) dump the golden image on the failover node
1d) install and configure the requirements for SG 11.19 on the failover node (it takes maybe 10 minutes if you've documented the process correctly, I know it for a fact)
1e) set up a configuration file for a brand new one-node cluster on the failover node. If using lock disks, you can either use new lock disks and start it right away, or prepare config files which you're sure will work and start the cluster at step 2b.
1f) bring in the package configuration files and volume groups on the failover node, and configure these packages to be runnable only on the failover node. Run a cmcheckconf on them but do NOT run cmapplyconf yet because they're still in use on the other cluster! (See the command sketch right after this list.)
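
For 1e and 1f, the commands boil down to something like this. The node, VG and package names (node2, vg_app, pkg_app) are placeholders and the paths assume the usual /etc/cmcluster layout, so adapt to your own setup:

    # on the failover node: build a brand new one-node cluster (1e)
    cmquerycl -v -C /etc/cmcluster/cluster.ascii -n node2
    vi /etc/cmcluster/cluster.ascii            # cluster name, lock disk, timeouts...
    cmcheckconf -C /etc/cmcluster/cluster.ascii
    cmapplyconf -C /etc/cmcluster/cluster.ascii
    cmruncl -v                                 # only if you can start it right away (new lock disks)

    # bring in the VGs and package files (1f)
    mkdir /dev/vg_app
    mknod /dev/vg_app/group c 64 0x010000      # pick a free minor number
    vgimport -s -m /tmp/vg_app.map /dev/vg_app # map file made with vgexport -p -s -m on the main node
    cmcheckconf -P /etc/cmcluster/pkg_app/pkg_app.ascii   # check only -- NO cmapplyconf yet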

2. Move the packages to the failover node
2a) stop the packages on the cluster running on the main node
2b) remove the cluster bit on the VGs (vgchange -c n) to prevent SG from identifying the disks as part of a cluster
2c) cmapplyconf the packages on the failover node (you might need to run vgchange -c again)
2d) start the packages
Total downtime: maybe a few minutes more than a standard failover, but not much. With a well-prepared scenario with pastable commands (like the sketch below), it takes me less than 60 seconds to do 2b and 2c.
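
Here's roughly what 2a through 2d look like, again with placeholder names (node2, vg_app, pkg_app):

    # on the main node
    cmhaltpkg pkg_app                          # 2a: stop the package (downtime starts here)
    vgchange -a n /dev/vg_app                  # make sure the VG is deactivated
    vgchange -c n /dev/vg_app                  # 2b: clear the cluster bit

    # on the failover node
    cmapplyconf -P /etc/cmcluster/pkg_app/pkg_app.ascii   # 2c
    vgchange -c y /dev/vg_app                  # only if cmapplyconf complains about the cluster bit
    cmrunpkg -n node2 pkg_app                  # 2d: start the package
    cmmodpkg -e pkg_app                        # re-enable package switching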

3. Upgrade the main node
3a) dump the golden image on the main node
3b) install SG on the main node
3c) have that node join the cluster running on the failover node
3d) configure the packages to be runnable on both nodes (see the sketch after this list)
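
For 3c and 3d, assuming the one-node cluster is called new_cluster and the reinstalled main node is node1 (placeholder names again), it goes something like this:

    # on the failover node, once the main node is reinstalled and SG is back on it
    cmgetconf -c new_cluster /etc/cmcluster/cluster.ascii   # dump the running config
    vi /etc/cmcluster/cluster.ascii            # add node1 back to the cluster
    cmcheckconf -C /etc/cmcluster/cluster.ascii
    cmapplyconf -C /etc/cmcluster/cluster.ascii
    cmrunnode node1                            # 3c: node1 joins the cluster

    # 3d: allow the packages to run on both nodes
    vi /etc/cmcluster/pkg_app/pkg_app.ascii    # add node1 to the NODE_NAME list
    cmapplyconf -P /etc/cmcluster/pkg_app/pkg_app.ascii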

4. Bring back the packages to the main node
Simply move the packages back as you would in a normal cluster (sketched below). Downtime will be the same as during a standard failover.
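
In command form, that's simply (placeholder names again):

    cmhaltpkg pkg_app                          # downtime starts
    cmrunpkg -n node1 pkg_app                  # start the package back on the main node
    cmmodpkg -e pkg_app                        # re-enable automatic failover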

O.
