Thursday, February 25, 2010

Fun dealing with 'at'

I'm stuck with a design decision taken before my time by the development team: using the native OS facilities as the task scheduler for our custom SCADA application. The decision was purely logical; they were migrating away from a previous timesharing OS that had, from what I heard, enterprise-class batch and scheduling services, and it was taken for granted that "Unix" (HP-UX, to be precise) would handle task scheduling just as well.

The result is that to save a programmer a few days, the decision was taken to delegate task scheduling to the Operating System and be done with it: when the software needs to run something at a later date, it spawns the scheduler and leaves the responsibility to the OS to run the job.

Where that design decision hurts is that nobody realized that on Unix, as far as scheduling goes, you're pretty much limited to the stock cron or at (both part of the same software, by the way), and these can be a real pain in the butt to manage on modern systems.

I have nothing against at in itself. There's nothing wrong with it. As a bare-bones task scheduler, it does the job and has been doing it for what, over 30 years now. Many system administrators have learned to depend on at to schedule nightly jobs. But it shows its age, and it has nothing that should appeal to a developer in need of a task scheduler: it doesn't do dependencies; running at -l doesn't show much; its logging features are, to be honest, close to nonexistent; and jobs are saved under a file name representing an epoch offset which, while clever, isn't really a nice way of presenting data.
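To illustrate that epoch-offset naming: the job files can be decoded back into a human date easily enough. Here's a quick sketch, assuming the System V-style convention of seconds-since-epoch plus a queue letter — the exact format varies by Unix flavour (some encode minutes in hex), so treat this as illustrative rather than the HP-UX format gospel:

```python
from datetime import datetime, timezone

def decode_atjob_name(name):
    # Hypothetical decoder. System V-style at implementations name each job
    # file after its run time, e.g. "1267056000.a": seconds since the epoch,
    # then a queue letter. Other implementations encode minutes in hex, so
    # check your own spool directory before relying on this.
    seconds = int(name.split(".")[0])
    return datetime.fromtimestamp(seconds, tz=timezone.utc)

print(decode_atjob_name("1267056000.a"))  # 2010-02-25 00:00:00+00:00
```

Clever for the implementation, sure, but not something you want support personnel squinting at in a spool listing.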

As a sysadmin, I ran into a lot of trouble over the years when trying to support a bunch of application-level at jobs. Here are some examples:
  • At saves all its tasks under /var/spool/cron/atjobs. That's nice, but what do you do with clustered applications that are packaged with ServiceGuard? There is no easy way to migrate the jobs across nodes when a failover occurs. I had to write a special daemon that monitors the atjobs directory just to handle that.
  • Support personnel were used to being able to hold, release, and reschedule jobs on the fly on their previous OS. At doesn't support that. To reschedule a job with at, you need to extract what that job runs, delete it, then resubmit it yourself. That's not nice. I had to write a complete wrapper around at just to do that.
  • You don't know what a task consists of, except which user owns it and what epoch-offset name it has. That's not very useful when an application has scheduled 50 different jobs over a week. I had to extend my wrapper to show a few lines of the contents of each job.
  • When cold-reinstalling a server, you have to be sure you saved the jobs somewhere as the users will expect you to recover them. Sure, nobody forgets the crontab, but that darn atjobs directory needs to be saved, too.
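The spool-copying part of that failover daemon boils down to something like this — a minimal Python sketch, where sync_atjobs is an illustrative name and not actual code from it; the real thing also has to watch the directory for changes and prune jobs that were deleted on the source node:

```python
import os
import shutil

def sync_atjobs(spool_dir, backup_dir):
    """Copy every at job file from the spool to a backup/shared directory.

    A minimal sketch of the failover/backup idea above. HP-UX keeps jobs
    under /var/spool/cron/atjobs, but the path (and required permissions)
    may differ on your system.
    """
    os.makedirs(backup_dir, exist_ok=True)
    copied = []
    for name in sorted(os.listdir(spool_dir)):
        src = os.path.join(spool_dir, name)
        if os.path.isfile(src):
            # copy2 preserves mtime/mode, which matters for spool files
            shutil.copy2(src, os.path.join(backup_dir, name))
            copied.append(name)
    return copied
```

Point it at shared storage (as Greg suggests below in the comments) or at a backup location before a cold reinstall, and the jobs survive either way.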
I'm so fed up with this that I'm thinking of writing my own distributed task scheduler, one that would address most of the issues above while keeping a standard at front-end so as not to break any application that depends on its format. What do you think?
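To give an idea of what I mean: behind a standard at front-end, the job store could keep each job as a readable record instead of an epoch-named script. A hypothetical sketch (submit_job, the JSON layout, and the field names are all made up for illustration):

```python
import json
import os
import time
import uuid

def submit_job(queue_dir, run_at_epoch, script_text, user="nobody"):
    """Persist a job record for a (hypothetical) scheduler daemon to pick up.

    Unlike stock at, the record keeps human-readable metadata -- who
    scheduled it, when, and what it runs -- instead of encoding everything
    in the file name.
    """
    os.makedirs(queue_dir, exist_ok=True)
    job = {
        "id": uuid.uuid4().hex,
        "user": user,
        "submitted": int(time.time()),
        "run_at": run_at_epoch,
        "script": script_text,
        "state": "pending",   # hold/release would just flip this field
    }
    with open(os.path.join(queue_dir, job["id"] + ".json"), "w") as f:
        json.dump(job, f)
    return job["id"]
```

With something like that underneath, holding, releasing, and rescheduling become metadata updates rather than the delete-and-resubmit dance at forces on you today.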

N.B. Yes, I took a look at vixie-cron a few years ago but didn't think it would be worth trying to make it work on HP-UX, as I didn't gain much using its at front-end over the one shipped with HP-UX. If anyone thinks otherwise, drop me a note.

2 comments:

Greg Baker said...

You could put /var/spool/cron/atjobs on to shared storage. Then you don't need to move the at jobs around.

The best strategy is often to have a scheduled jobs package which then uses ssh to log in to the IP address of the package where the job should run.

You can see the contents of an "at" job with "-d", which I presume is what you used in your wrapper.

But yes, at/batch are a bit lame, aren't they?

Unknown said...

Have you thought about writing an at-like front-end to an existing scheduler like CA's AutoSys? With AutoSys you can create batch schedules, run jobs, put them on hold, etc. from a *NIX or DOS shell, but it is not exactly easy. It also has a web-based GUI for managing and tracking your batch schedule, graphing batch history, etc. Unfortunately, the license doesn't come cheap.