Thursday, February 25, 2010

Fun dealing with 'at'

I'm stuck with a design decision taken before my time by the development team: using the native OS services as the task scheduler for our custom SCADA application. The decision was purely logical; they were migrating away from a previous timesharing OS that had, from what I heard, enterprise-class batch and scheduling services, and it was taken for granted that "Unix" (HP-UX, to be precise) would handle task scheduling just as well.

The result is that, to save a programmer a few days, task scheduling was delegated to the operating system: when the software needs to run something at a later date, it spawns the scheduler and leaves the responsibility to the OS to run the job.

Where that design decision hurts is that nobody realized that on Unix, as far as scheduling goes, you're pretty much limited to the stock cron or at (both handled by the same facility, by the way), and these can be a real pain in the butt to manage on modern systems.

I have nothing against at in itself. There's nothing wrong with it. As a bare-bones task scheduler, it does the job and has been doing it for what, over 30 years now. Many system administrators have learned to depend on at to schedule nightly jobs. But it shows its age, and it has nothing that should appeal to a developer in need of a task scheduler: it doesn't do dependencies; running at -l doesn't show much; its logging features are, to be honest, close to nonexistent; and jobs are saved under a file name representing an epoch offset which, while clever, isn't a nice way of presenting data.
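To illustrate that last point, here is a sketch of turning such a file name back into a readable date. The file names, their `.a` suffix, and the exact naming scheme are assumptions for illustration (it varies by platform), and GNU date syntax is used for the conversion:

```shell
#!/bin/sh
# Illustrative: decode an at(1) spool file whose name encodes the scheduled
# run time as seconds since the Unix epoch. The naming scheme and the ".a"
# queue suffix are assumptions; GNU date syntax is used below.
at_job_time() {
    epoch=${1%%.*}                          # strip any queue suffix
    date -u -d "@$epoch" '+%Y-%m-%d %H:%M'  # epoch seconds -> readable date
}

at_job_time 1266940800.a    # a job scheduled for 2010-02-23 16:00 UTC
```

Not pretty, but it beats squinting at raw epoch values in the spool directory.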

As a sysadmin, I ran into a lot of trouble over the years when trying to support a bunch of application-level at jobs. Here are some examples:
  • At saves all its tasks under /var/spool/cron/atjobs. That's nice, but what do you do with clustered applications that are packaged with ServiceGuard? There is no easy way to migrate the jobs across nodes when a failover occurs. I had to write a special daemon that monitors the atjobs directory just to handle that.
  • Support personnel were used to holding, releasing, and rescheduling jobs on the fly on their previous OS. At doesn't support that. When you want to reschedule a job with at, you need to extract what that job runs, delete it, then reschedule it yourself. That's not nice. I had to write a complete wrapper around at just to do that.
  • You don't know what a task consists of, except which user owns it and what epoch-offset name it has. That's not very useful when you have an application that scheduled 50 different jobs over a week. I had to change my wrapper to be able to show a few lines of the contents of each job.
  • When cold-reinstalling a server, you have to be sure you saved the jobs somewhere as the users will expect you to recover them. Sure, nobody forgets the crontab, but that darn atjobs directory needs to be saved, too.
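The reschedule dance described in the second bullet boils down to something like this sketch. Since at has no native reschedule, the wrapper has to capture the job body, remove the job, then resubmit it; the job id and time spec below are made up:

```shell
#!/bin/sh
# Sketch of rescheduling an at(1) job, which at cannot do natively:
# dump the job, remove it, then resubmit the dump at the new time.
# Usage: reschedule <job_id> <new_at_time_spec>
reschedule() {
    job_id=$1; new_time=$2
    tmp=/tmp/atjob.$$
    at -c "$job_id" > "$tmp" || return 1  # capture the job's environment+script
    atrm "$job_id"                        # delete the original job
    at "$new_time" < "$tmp"               # resubmit at the new time
    rm -f "$tmp"
}
```

A "hold" is the same dance minus the final resubmission, with the dump kept somewhere until the release.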
I'm so fed up with this that I'm thinking of writing my own distributed task scheduler that would address most of the issues above, while still keeping a standard at front-end so as not to break any application depending on its format. What do you think?

N.B. Yes, I took a look at vixie-cron a few years ago but didn't think it would be worth trying to make it work on HP-UX, as I didn't gain much using its at front-end over the one shipped with HP-UX. If anyone thinks otherwise, drop me a note.

Monday, February 15, 2010

Steps to take in SIM/RSP when upgrading HP-UX Servers

When cold-reinstalling an HP-UX server from 11.23 to 11.31, steps need to be taken to make sure it stays correctly linked to SIM and Remote Support.

Here are the steps I take without needing to delete the server in SIM; this way I keep all its past events. This is the quickest sequence I've found over the last year:

1. Go into SIM. Find the server and open its system properties. Uncheck "Prevent Discovery [...] from changing these system properties".

2. Run an "Identify Systems" on the server. Once this is done, it should now show 11.31 as the OS version.

3. SIM won't subscribe to WBEM events when doing an identify, only a discovery. So you need to manually subscribe to WBEM events on the CMS (mxwbemsub -a -n hostname).

4. WEBES will not resubscribe its WBEM events either. To force it, you need to log into WEBES (http://cmsaddress:7906), click the "Configure Managed Entities" icon, find your server, check it, and delete it (that's right, delete it). Then, restart WEBES by doing "net stop desta_service" and "net start desta_service" on the CMS. Within a few minutes it will resubscribe automagically to the HP-UX server.

5. You can confirm you have SIM and WEBES subscriptions on your server by running "evweb subscribe -b external -L"
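The CLI half of the steps above (the GUI deletion in step 4 still has to be done by hand first) can be strung together in a small sketch run from the CMS; the host name is a placeholder:

```shell
#!/bin/sh
# Sketch: CLI portion of re-linking a reinstalled HP-UX server, run on
# the CMS. The WEBES managed entity must already have been deleted in
# the "Configure Managed Entities" screen. Host name is a placeholder.
resubscribe_host() {
    host=$1
    mxwbemsub -a -n "$host" || return 1  # step 3: re-add SIM's WBEM subscriptions
    net stop desta_service               # step 4: restart WEBES so it
    net start desta_service              #         resubscribes on its own
}
# Step 5, run on the HP-UX server itself, verifies the subscriptions:
#   evweb subscribe -b external -L
```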

Good luck


Wednesday, February 10, 2010

Thumbs up to ServiceGuard Manager

Being a CLI kind of guy, I've never been really attracted to ServiceGuard Manager, especially the first web-based versions. However, since it got a map view again, I find myself increasingly proposing it to support personnel, who find it more intuitive than the CLI. Training time is cut at least threefold by using SG Manager.

Today, I decided to try building a small package from scratch using the GUI instead of writing the config files manually, and was delighted by its ease of use. I won't publish too many screenshots, as those I took contain confidential data and it would take me a while to obfuscate them. But here are two teasers:

The general look is polished, and very intuitive. The interface is responsive. Online help is readily available, with a question mark icon and sometimes with pop-on bubbles. This makes creating packages an easy task which is done in minutes without needing to go through the ServiceGuard manual.

Behind the scenes, SG Manager takes care of migrating the configuration files on all nodes itself. You don't need to copy them manually. Furthermore, they're very easy to read. Here is an example of a config file generated by SG Manager:

# module name and version
operation_sequence $SGCONF/scripts/sg/
operation_sequence $SGCONF/scripts/sg/
package_description Quorum Server
module_name sg/basic
module_version 1
module_name sg/package_ip
module_version 1
module_name sg/priority
module_version 1
module_name sg/monitor_subnet
module_version 1
module_name sg/failover
module_version 1
module_name sg/service
module_version 1
package_type FAILOVER
NODE_NAME mtlrelux00
NODE_NAME mtlprdux00
auto_run yes
node_fail_fast_enabled no
run_script_timeout no_timeout
halt_script_timeout no_timeout
successor_halt_timeout no_timeout
script_log_file $SGRUN/log/$SG_PACKAGE.log
log_level 0

failover_policy CONFIGURED_NODE
failback_policy MANUAL

# Package monitored subnets...
local_lan_failover_allowed yes

# Package subnets and relocatable IP addresses ...

# Package services...
service_name qs
service_cmd /usr/lbin/qs >>/var/adm/qs/qs.log 2>&1
service_restart 3
service_fail_fast_enabled no
service_halt_timeout 0

Instead of a sea of comments, there are only a few well-placed ones, which makes re-editing and fine-tuning configuration files an easy task.
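For reference, here is what SG Manager is doing for you behind the scenes, sketched with the standard ServiceGuard CLI. The file name qs.conf and package name qs are assumptions for the configuration shown above:

```shell
#!/bin/sh
# Sketch: validating and applying a package configuration by hand,
# which SG Manager automates. "qs.conf" and "qs" are assumed names.
apply_package() {
    conf=$1; pkg=$2
    cmcheckconf -P "$conf" || return 1  # validate the package file
    cmapplyconf -P "$conf" || return 1  # compile and distribute to all nodes
    cmrunpkg "$pkg"                     # start the package
}
```

With the GUI, all three steps happen when you click through the package creation wizard.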

Nice piece of work! I think I've been converted to ServiceGuard Manager.


Tuesday, February 9, 2010

Remote Support Advanced 5.40 has been released

Version 5.40 was released last week. I was waiting for it to show up in RSSWM, thinking it would update itself automatically. It hasn't done so yet, and it is not clear whether RSSWM will eventually take care of updating to 5.40. The Release Notes indicate that to upgrade, current customers must download a package from the HP Software Depot, so my take is that it won't be pushed by RSSWM.

This time I do not intend to force the update. Version 5.30 has been running fine for me for a while now; I've found it mature and stable. It might be a better idea for mission-critical customers to wait until they update to HP SIM 6.0 and do RSP at the same time (unless RSSWM pushes it without warning). That is probably what I will end up doing. However, I don't know any experienced SIM admin who risks upgrading to a new SIM release before a service pack shows up a few months later, so I'm actually NOT planning to update to SIM 6.0 / RSP 5.40 before next summer.

Here is a list of the main new features. The most significant one, judging from what users have been asking for in the ITRC forums, is official support for CMSs running in virtual machines.
  • Added virtualization support for the Central Management Server
  • Support for HP Systems Insight Manager 6.0
  • Improved scalability of the Central Management Server
  • New Basic Configuration collections for MSA2000 storage and OpenVMS on Integrity servers
  • Introduction of Unified Communications monitoring
  • Windows 2008 operating system support for the HP Remote Support Network Component
  • Web-Based Enterprise Services (WEBES) v5.6 and WEBES v5.6 Update 2 are the most current supported analysis engines
WEBES 5.6U2 is required to monitor the most recent HP hardware. Current users who do not wish to move to RSP 5.40 right away can install WEBES 5.6U2 from RSSWM and delay the 5.40 update until later.


Monday, February 8, 2010

Using olrad to remotely flag PCI slots

Many rack-mountable Integrity servers, from the rx3600 up, support OLAR, an acronym for "online addition and replacement", which in many cases applies to PCI cards. Cell-based servers also support OLAR of complete cells. The System Management Homepage offers some OLAR-related commands, but over time I've learned to use the CLI-based olrad command, which I trust more than the GUI.

The olrad command can be used not only to replace cards, but also to flash an LED under a specific PCI slot. This is very useful when you send an operator on site to plug in wires: using olrad, you can flag the exact slot where you want a cable plugged, and save time.

Here is a quick procedure to see how to do this:

1. Run ioscan to show the hardware path of your device


# ioscan -kfnC lan
Class  I  H/W Path        Driver  S/W State  H/W Type   Description
lan    0  0/0/1/1/0/6/0   igelan  CLAIMED    INTERFACE  HP A9784-60002 PCI/PCI-X 1000Base-T FC/GigE Combo Adapter
lan    1  1/0/1/1/0/6/0   iether  CLAIMED    INTERFACE  HP AB290-60001 PCI/PCI-X 1000Base-T 2-port U320 SCSI/2-port 1000B-T Combo Adapter
lan    2  1/0/1/1/0/6/1   iether  CLAIMED    INTERFACE  HP AB290-60001 PCI/PCI-X 1000Base-T 2-port U320 SCSI/2-port 1000B-T Combo Adapter
lan    3  1/0/12/1/0/6/0  igelan  CLAIMED    INTERFACE  HP A9784-60002 PCI/PCI-X 1000Base-T FC/GigE Combo Adapter

2. Run "olrad -q" to obtain a table matching hardware paths with slot numbers.


# olrad -q
Slot     Path      Bus  Max  Spd  Pwr  Occu  Susp  OLAR  OLD  Max    Mode
                   Num  Spd                                   Mode
0-0-0-1  0/0/8/1   140  133  133  Off  No    N/A   N/A   N/A  PCI-X  PCI-X
0-0-0-2  0/0/10/1  169  133  133  Off  No    N/A   N/A   N/A  PCI-X  PCI-X
0-0-0-3  0/0/12/1  198  266  266  Off  No    N/A   N/A   N/A  PCI-X  PCI-X
0-0-0-4  0/0/14/1  227  266  266  Off  No    N/A   N/A   N/A  PCI-X  PCI-X
0-0-0-5  0/0/6/1   112  266  266  Off  No    N/A   N/A   N/A  PCI-X  PCI-X
0-0-0-6  0/0/4/1   84   266  266  Off  No    N/A   N/A   N/A  PCI-X  PCI-X
0-0-0-7  0/0/2/1   56   133  133  Off  No    N/A   N/A   N/A  PCI-X  PCI-X
0-0-0-8  0/0/1/1   28   133  133  On   Yes   No    Yes   Yes  PCI-X  PCI-X
0-0-1-1  1/0/8/1   396  133  133  Off  No    N/A   N/A   N/A  PCI-X  PCI-X
0-0-1-2  1/0/10/1  425  133  133  Off  No    N/A   N/A   N/A  PCI-X  PCI-X
0-0-1-3  1/0/12/1  454  266  133  On   Yes   No    Yes   Yes  PCI-X  PCI-X
0-0-1-4  1/0/14/1  483  266  266  Off  No    N/A   N/A   N/A  PCI-X  PCI-X
0-0-1-5  1/0/6/1   368  266  266  Off  No    N/A   N/A   N/A  PCI-X  PCI-X
0-0-1-6  1/0/4/1   340  266  266  Off  No    N/A   N/A   N/A  PCI-X  PCI-X
0-0-1-7  1/0/2/1   312  133  133  Off  No    N/A   N/A   N/A  PCI-X  PCI-X
0-0-1-8  1/0/1/1   284  133  133  On   Yes   No    Yes   Yes  PCI-X  PCI-X

3. Run "olrad -I ATTN slot_number" to flash the LED under the desired slot.


# olrad -I ATTN 0-0-0-8

4. When you're done, turn off the LED on your slot using "olrad -I OFF slot_number"


# olrad -I OFF 0-0-0-8
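Putting the steps together: given an interface's hardware path from ioscan, the matching slot can be pulled out of "olrad -q" and flagged in one go. This is a sketch; it assumes, as in the tables above, that the slot's path is a prefix of the device's hardware path:

```shell
#!/bin/sh
# Sketch: flag the PCI slot holding a given device, by matching the
# device's hardware path (from ioscan) against the slot paths reported
# by "olrad -q" (the slot path is a prefix of the device path).
flag_slot_for_path() {
    hw_path=$1    # e.g. 0/0/1/1/0/6/0 from "ioscan -kfnC lan"
    slot=$(olrad -q | awk -v p="$hw_path" \
        'index(p, $2"/") == 1 { print $1 }')
    [ -n "$slot" ] || { echo "no slot found for $hw_path" >&2; return 1; }
    olrad -I ATTN "$slot"    # light the attention LED on that slot
}
```

Remember to turn the LED back off with "olrad -I OFF" once the operator is done.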