Wednesday, December 1, 2010

Integrating IEDs inside your IT infrastructure - Part 1

Image: jscreationzs / FreeDigitalPhotos.net



Introduction


This should be a multi-part blog series that will introduce you to the control and data acquisition of substation-grade IEDs (Intelligent Electronic Devices) all the way to the data center. I'll write it as I have the time. If you have any comments or corrections, feel free to leave me a note.

Being a systems architect, not a control engineer, the emphasis of my writings will be on the IT side. I don't have any deep knowledge in the control field, having been exposed to these technologies only recently. When I tried looking for some information on the internet, there wasn't much to start with except Wikipedia entries that didn't fit together linearly. May this series help anyone who happens to follow my footsteps.


Part 1: It all starts at the IED

Wikipedia defines an IED quite well:

An Intelligent Electronic Device (IED) is a term used in the electric power industry to describe microprocessor-based controllers of power system equipment, such as circuit breakers, transformers, and capacitor banks.

Okay, so let me make my own definition, and it's all IT folks like me ought to know:

An IED is either a sensor that returns data, a control device installed in a substation, or "something that impacts the grid".

All these devices need to provide an interface to communicate back their data, and also some means to be configured by a control engineer or technician. A lot of this stuff has traditionally relied on the serial RS-232 point-to-point interface (unless you're under 25, you've probably heard of RS-232 before; it is the standard 9-pin or 25-pin serial port on PCs to which you can hook up serial devices). Many IEDs also rely on RS-422 and RS-485 networks, which have more features than the basic, low-speed RS-232: RS-422 is a "multi-drop" network, where one sender can be heard by up to 10 slave receivers, while RS-485 is a "multi-point" network that allows up to 32 arbitrary connections.

The upper-layer protocol that IEDs use seems to be, a lot of the time, Modbus or DNP3. Another interesting fact is that clock synchronization with these devices is often done using the IRIG-B time code, which has a lot more history than the usual (S)NTP protocol many network administrators are already familiar with. For one, IRIG-B can work on serial interfaces.
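To give IT folks a feel for what travels on such a serial link, here is a minimal Python sketch of a Modbus RTU "read holding registers" request. The function names are my own, and the unit address and register numbers are placeholders, not values from any particular IED. The frame is simply the unit address, a function code, two 16-bit fields, and a CRC-16 sent low byte first.

```python
import struct


def crc16_modbus(data: bytes) -> int:
    """CRC-16 as used by Modbus RTU: initial value 0xFFFF, reflected polynomial 0xA001."""
    crc = 0xFFFF
    for byte in data:
        crc ^= byte
        for _ in range(8):
            if crc & 1:
                crc = (crc >> 1) ^ 0xA001
            else:
                crc >>= 1
    return crc


def rtu_read_holding_registers(unit: int, start: int, count: int) -> bytes:
    """Build a Modbus RTU request frame for function 0x03 (read holding registers)."""
    # Address, function code, starting register and quantity are big-endian.
    body = struct.pack(">BBHH", unit, 0x03, start, count)
    # The CRC is appended low byte first.
    return body + struct.pack("<H", crc16_modbus(body))


if __name__ == "__main__":
    # Hypothetical device at unit address 1, one register starting at 0.
    print(rtu_read_holding_registers(1, 0, 1).hex(" "))
```

The same bytes could be written to an RS-232 or RS-485 port with any serial library; that part is left out since it depends on the hardware at hand.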

Many IEDs have recently started to rely on Ethernet media and routable TCP/IP networks instead of point-to-point communications. TCP/IP can carry DNP3, Modbus and others, but the IEC 61850 protocol is also slowly becoming a leading standard. Using TCP/IP basically enables you to access the device from anywhere -- a nice feature, but a double-edged sword nonetheless: introducing a routable network in the substation, and hooking up IEDs to it, brings up many obvious security issues that weren't there before.
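As an illustration of how a serial-era protocol gets carried over TCP/IP, here is a small Python sketch of a Modbus/TCP request. It assumes the standard MBAP header (transaction id, protocol id 0, length, unit id) in front of the usual function code and data; the function name and the values used are placeholders of mine.

```python
import struct


def modbus_tcp_read_request(transaction_id: int, unit: int, start: int, count: int) -> bytes:
    """Build a Modbus/TCP 'read holding registers' (function 0x03) request."""
    # The PDU is the same function code + data that a serial link would carry.
    pdu = struct.pack(">BHH", 0x03, start, count)
    # MBAP header: transaction id, protocol id (always 0 for Modbus),
    # length of what follows (unit id + PDU), unit id.
    mbap = struct.pack(">HHHB", transaction_id, 0, len(pdu) + 1, unit)
    return mbap + pdu


if __name__ == "__main__":
    # Hypothetical request: transaction 1, unit 1, one register starting at 0.
    print(modbus_tcp_read_request(1, 1, 0, 1).hex(" "))
```

Note that there is no checksum and no authentication in the frame itself: anyone who can reach the device's TCP port can send it, which is exactly the security concern raised above.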

Remote control and data acquisition from these IEDs could be done using plain modems and dedicated POTS lines. But now that this telecommunication architecture is slowly moving towards routable, high speed IP networks, new ways to remotely manage the substation become available using commoditized IT technology. Many analysis and archiving possibilities spring up once this scattered data is centralized, which involves software that will be unheard of to many IT admins.

We see here that while it all starts at the IED, that IED needs a way to send back its data to the data center. So what once used to be the sole business of the "Control Guys" is becoming one that also requires some assistance from the "IT Guys".

And yes, as I admitted at the top of this article, I'm an IT Guy.

O.

----

The next article will describe how to concentrate a bunch of IEDs together, and securely send their data to the data center.

Friday, November 5, 2010

Word clouds: a nice way to review your documents

I recently discovered a picture of a word cloud on Kieren McCarthy's blog. In essence, it's a cloud of words where each word occupies a space proportional to its weight, that is, how often it appears in the text.

The site Wordle lets you create such clouds for free, with a terrific design. I've pasted whole architecture documents of mine into this site to check out the results. Here is an example of the result:


In the above cloud, a few elements stand out: Acquisition (Data collection), PI, Données (data), PI-SMP, and points (tags). Anyone familiar with PI or Cooper's SMP will easily guess what that document is talking about.

Such clouds can be used to make a pro-looking cover page (or back page) for any official document; in my case, I just hang them in my cubicle. One can also use them to size up a document, any document, at a quick glance: by looking at which words have the most weight, it is possible to get a quick idea of what the document is about. And yes, it does look really nice.
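The "weight" behind each word is nothing more than a word count. Here is a minimal Python sketch of that idea; it is not how Wordle itself works internally (I don't know that), and the stopword list is a tiny placeholder of my own.

```python
import re
from collections import Counter

# A deliberately tiny, hypothetical stopword list; a real one would be much longer.
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "it"}


def word_weights(text: str, top: int = 5):
    """Return the `top` most frequent words of `text` with their counts."""
    # Runs of letters only, so accented words such as "Données" stay whole.
    words = re.findall(r"[^\W\d_]+", text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS)
    return counts.most_common(top)


if __name__ == "__main__":
    sample = "PI tags and PI points: the PI server stores points"
    print(word_weights(sample, top=2))
```

A word cloud then simply draws each word at a font size that grows with its count.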

O.

Friday, October 22, 2010

Thoughts on HP-UX, AIX, and Integrity

Those who know me personally and have been following this blog know that I ditched my HP-UX admin career to become a systems architect in another division at my company. The reasons for this were mostly personal, although maybe 10% of my choice to move on was due to other reasons. Bottom line is that even though I miss the technical side, I'm glad I've made the switch to the "clouds" and became an architect.

I'm now mostly attached to stuff made by Cooper Power Systems and OSISoft, which is fine. Expect the content of this blog to switch to these two vendors over time. HP still has a place too, as I've spent so much time using their products. But here we're not an HP shop; we have some HP products of course, but also IBM, EMC, and other vendors on the floor.

Some HP reps came in this week to show us what was new this year. While I used to be very fond of HP since relying on a limited number of manufacturers was part of my one-sysadmin-for-all strategy, I must now step back and try to be as objective as possible.

That being said, there was the usual presentation about let-HP-shove-its-converged-infrastructure-down-your-throat, then a half-hour presentation on Integrity systems (among others) during which I kept my mouth shut. I couldn't help wondering what the future of this great platform is. All talk about running Linux and Windows on the platform is now gone, which is a good thing, as we all know it can no longer be an option. Only HP-UX and OpenVMS are left. The BCS rep, knowing my division runs RHEL and AIX, told us to "please challenge us", meaning that we should evaluate HP-UX as a contender as much as possible.

I've worked with HP-UX for 10 years and love it. But, in my opinion, AIX is roughly equivalent. Although I haven't been administering AIX systems for 10+ years, I know enough that it is a mission-critical OS backed by a manufacturer who won't let my company down, same as HP-UX. If we were running Solaris, things would be very different. But for the moment, as an architect, I consider both HP-UX and AIX as equals: these are the last true "Enterprise" UNIX options available.

So, what's left for the Integrity platform? Not much. As AIX is the same as HP-UX at a glance, I can only think that Integrity is the same as POWER systems, give or take. So it's not a bad platform per se, but a niche one for sure.

It's too bad HP lost the big bet they've made on the Itanium. I still remember all the talk about Merced when I started my sysadmin career in the late 90s. Things didn't turn out as expected for sure, but even if HP had stayed on the PA-RISC bandwagon, they would be at the same spot they are right now. That doesn't mean HP-UX has no future - the OS still has a big place in my heart. I can only hope HP will eventually port the HP-UX kernel to x86, or make an HP-UX ecosystem and support infrastructure revolving around a Linux kernel. This is probably the best thing to do to this operating system to ensure its long term viability.

Have any comments? Please leave me feedback.

O.

Wednesday, October 20, 2010

Announced: An official Cooper EAS web forum

I've been informed that Cooper EAS has the intention to build a community around their grid automation products such as IMS, a process to which I'll be glad to contribute when the time comes.

A first step into building a community is, in my opinion, to set up a web forum. To my surprise, EAS just announced one today. It is available here:


http://216.17.94.116/vbulletin/index.php


There is no DNS name yet, but I'm sure they'll fix this soon. Also note that access is limited to current customers only.

I think that a critical mass of their current customers is mostly interested in DR, which means that I don't expect to see many grid automation subjects in that forum to start. But it is a really, really good first step.

O.

Tuesday, October 5, 2010

Cooper Power Systems EAS, Stuxnet and control vs. IT

I'm currently at the Cooper Power Systems EAS (Energy Automation Solutions) user conference in Minneapolis. I don't know much about DR, AMI, Smart Grids and such, but had to go there to at least learn the basics and be able to do a better architecture job.

I'm almost ashamed to admit that I'm an "IT guy". It seems that most who work in control don't like IT, and I can't blame them. Many control systems are increasingly being linked to Ethernet and IP-based networks, along with remote and consolidated interfaces, and this brings many challenges which only IT can address. Enhancing the security of these systems is especially important, and many control users don't seem to view security as that important.

I've had an interesting chat with EAS's security guru about the Stuxnet worm. Many technical details have been leaking through Slashdot and elsewhere for a few weeks, thus I won't speculate on its possible origins or intents. But the bottom line is that Stuxnet does exist, and it is a staggering proof that even though its engineering is not within the reach of just anyone, SCADA systems are not immune to security threats.

Like we IT people have been disgusted by the security guys for years now, it's now the turn of control people to have to live with IT. Nice threesome. Looks like I'm stuck in the middle position. FML.

O.

Tuesday, September 28, 2010

REST: "understandable" web services

Most of my experience as an ex-sysadmin with web services is with SOAP. And I can hardly call it experience; the only thing I had to do back in those days was to set up Tomcat and Axis to host mysterious "SOAP" applications.

As soon as I started reading the manuals of Tomcat and Axis at each of these installations, I quickly ran away -- not only couldn't I understand anything, but these docs all started with the idea that whoever was reading them had a strong knowledge of Web Services and/or Java. Which, of course, I didn't.

At a high level, I understand what SOAP is. But since I've jumped in system architecture, what I've been hearing recently is this:

"We're modeling our web services architecture on REST".

Oh no, not a new buzzword. Wasn't SOAP complex enough? What the heck is REST?

Fear no more. It turns out REST has been out there for a while. A long, long while. It was on the web before many even knew what the web was.

Anyone who knows the basics of HTTP will understand REST quickly. REST is not a heavy, XML-based protocol like SOAP. It's a way of doing things the simple way, using straight HTTP to transfer information in - sorry if I interpret things a little too much here - an ad hoc manner. No freaking XML and standards. Thus, your web browser is a REST client. If you want to go a little deeper, another example would be Amazon's S3, which is a more complex service based on REST. In fact, if a service doesn't use SOAP, it probably uses REST.

REST is how anyone with a good-enough basic knowledge of HTTP, but who doesn't give a shit about XML and layered protocols, will naturally think web services should work. The way things should be. I'm hardwired to think this way, and SOAP always had me thinking that it was way overkill for a lot of uses.
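To make the "straight HTTP" idea concrete, here is a minimal Python sketch using only the standard library: a tiny server that exposes a resource at a URL, and a client that fetches it with a plain GET. The /routers/<id> URL scheme, the data, and the plain-text reply format are all made up for the illustration; nothing here comes from the tutorial below.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical resources; the URL path alone identifies each one.
ROUTERS = {"1": "WRT54GL", "2": "RNX-GX4"}


class RouterHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Expect /routers/<id>; the HTTP verb (GET) says what to do with it.
        parts = self.path.strip("/").split("/")
        if len(parts) == 2 and parts[0] == "routers" and parts[1] in ROUTERS:
            status, body = 200, ROUTERS[parts[1]].encode()
        else:
            status, body = 404, b"not found"
        self.send_response(status)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo quiet


def demo():
    """Start the server on a free local port, GET one resource, shut down."""
    server = HTTPServer(("127.0.0.1", 0), RouterHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        port = server.server_address[1]
        with urllib.request.urlopen(f"http://127.0.0.1:{port}/routers/2") as resp:
            return resp.status, resp.read().decode()
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    print(demo())
```

There is no envelope and no toolkit: the URL names the thing, the HTTP verb says what to do with it, and the status code says how it went.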

I'm currently reading the following tutorial:
http://rest.elkstein.org/

Go read it, it is very well written.

Also take a look at the comment page here. There, Dr. Elkstein gives a rough comparison between RPC, SOAP, and REST.

O.

Friday, September 17, 2010

Building a low-power FreeNAS Server: Part 3

My FreeNAS Server has been up for a week.

I assembled everything in my parts list and I'm glad to report that all components worked as predicted. I had no surprises.

Assembling a homebrew computer is a nice activity for kids. My 7-year-old helped me plug in all the headers and assemble the case. Counting all the explanations I had to give him, we were done in a little over 30 minutes.

I then used m0n0wall's physdiskwrite to write an embedded amd64 image of FreeNAS directly to a leftover 1GB flash card I had from an HP tradeshow. Then I plugged it into my USB-header-to-USB-plug thingy directly inside the case. The two hard disks are not used to boot FreeNAS at all; their purpose is only to hold data.

I happen to no longer have any computer screens; I only have laptops at home. So I hooked up the PC to my flat screen TV to configure the BIOS and do the initial FreeNAS configuration.

There are two things about using a Mini-ITX desktop board I don't like:
  • The Intel Desktop BIOS does not support console redirection (at least I didn't see any mention of this anywhere) so you need to configure it the old way with a screen and a keyboard;
  • The built-in video card, needed by the BIOS, cannot be deactivated. Its video memory is, naturally, shared with the system memory and the lowest memory footprint is 128MB. That ended up taking about 12% of the 1GB of RAM in my server. Not a problem since I'll never need that gig, but a hassle anyway.
But for the price I paid for that board, these were things I was ready to live with.

Now I've got a bootable system, and it works well. My next task is to ensure that it consumes the least power possible. I already configured hard disk spindowns but I'll try to see if I can do something with Wake-on-LAN. More to come later!
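For reference, waking a machine over the LAN only takes one UDP broadcast. Here is a minimal Python sketch of the standard "magic packet" (six 0xFF bytes followed by the target MAC address repeated 16 times). I have not tested it against this board; the MAC address in the example is a placeholder, and the NIC and BIOS must have Wake-on-LAN enabled for anything to happen.

```python
import socket


def magic_packet(mac: str) -> bytes:
    """Build a Wake-on-LAN magic packet for a MAC like '00:11:22:33:44:55'."""
    hw = bytes.fromhex(mac.replace(":", "").replace("-", ""))
    if len(hw) != 6:
        raise ValueError("expected a 6-byte MAC address")
    # Six 0xFF bytes, then the MAC address repeated 16 times.
    return b"\xff" * 6 + hw * 16


def wake(mac: str, broadcast: str = "255.255.255.255", port: int = 9) -> None:
    """Send the magic packet as a UDP broadcast (port 9 is a common choice)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
        sock.sendto(magic_packet(mac), (broadcast, port))


if __name__ == "__main__":
    # Placeholder MAC address; replace it with the NAS's own.
    wake("00:11:22:33:44:55")
```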

Tuesday, September 14, 2010

Adding a DynDNS client to OpenWRT's LuCI

OpenWRT is terrific. The more I play with it, the more I like it. Even though I used to be a UNIX administrator (and a good-enough one, I think), I prefer using the LuCI interface as much as I can, keeping the CLI for repetitive tasks or debugging. LuCI is completely modular and lets you add packages depending on your specific needs. This is a good thing; it removes clutter from the interface and saves some precious space on your flash.

As an example, the following is a quick procedure that shows you how to add DynDNS client functionality to LuCI.

Log into LuCI and switch to Administration mode.

Go into Overview -> LuCI Components.

In the Available Packages panel, install the luci-app-ddns package.


A Dynamic DNS menu entry will appear automagically.

Inside this menu, you can then add your DynDNS settings as usual. It might take a while before OpenWRT updates your status; be patient.

O.

Friday, September 3, 2010

The RNX-GX4, a well-priced Broadcom router

Remember how I blogged about OpenWRT and Broadcom routers a few months ago? Turns out I've been looking for a new router and I found one which did what I wanted at a good price:
I introduce you to Rosewill's RNX-GX4. This is a rebranded Netcore NW618, a Chinese router which is not available this side of the Pacific. What makes this router special is that it is tested with, and officially supports, DD-WRT at a reasonable price -- I got mine on sale, at almost a quarter of the price of a WRT54GL. It has the same quantity of RAM and flash as the WRT54GL, but a faster CPU. It's not a straight WRT54GL clone; there are some differences such as a serial flash chip, which isn't as common as parallel flash. But the patches have been submitted to the various distributions, and up until now, it has been working well.

What further sets it apart from many other low-cost routers is how Rosewill openly brags about how it runs a BCM5354KFBG 240MHz CPU in the technical specs online, and even on the back of the retail box. When you see things like this, you know it is made for geeks. I also appreciate that the antennas can be removed. I don't intend to use the wireless radio on this device, so I'll put them out of the way.

I've started playing with it tonight. My intention is to replace my m0n0wall PC with an RNX-GX4 running OpenWRT. While I like m0n0wall very much, running it on a generic PC takes some real estate and electricity, and using a smaller device would be a better fit. I've investigated purchasing an Alix board, to keep running m0n0wall, but they are many, many times the price of the RNX-GX4 so I decided against using one.

O.

Building a low-power FreeNAS Server: Part 2

Part 2 of my series will be a shopping list of the materials I've picked to build my FreeNAS server.

Everything starts with the motherboard. I was looking for a low-power Mini-ITX board that had a built-in gigabit Ethernet controller and I've decided to pick an Intel D510MO. This is a desktop board based on the Atom processor, which is cheap and low power. There were third-party Atom boards that were maybe 10-15$ less than the Intel-branded one, but some reviewers complained about excessive heat and I didn't want to have heat problems. I was initially looking into the VIA C7 platform but it can't beat the Atom in terms of performance. I still don't know if the Intel BIOS on this board can work on a serial console. That would be surprising, but I will keep you posted.

I purchased two 1TB Seagate 7200 RPM SATA disks -- nothing special here, except that 7200 RPM was an important factor for me. I want these disks to be as fast as possible when I'm copying large amounts of data.

For the case, I picked a cheap MicroATX one. Why MicroATX? Because Mini-ITX cases are expensive, and usually can't fit more than one hard disk. I selected a R102-P from Rosewill which happened to be 20$ on Newegg. That case is not only cheap, but it can hold 4 hard disks (which doesn't seem too common on MicroATX cases) and the front is very well ventilated with a lot of air holes right in front of the disks.

The case doesn't come with a power supply. And I didn't want it to; since I was looking to be as power-efficient as possible, I picked an 80 Plus 250W power supply from Sparkle. 250W for an ATX form factor is also not that common, but I was really looking into getting what I need and not more -- an idling 500W PS consumes more power than one rated at 250W.

The last thing I bought is a gizmo made by Koutech - an adapter that converts a 10-pin USB header into a standard USB plug. Using this, I can put FreeNAS on a small USB key, and plug that key directly into the motherboard inside the case -- no dangling key outside the box. The motherboard doesn't have an IDE header, so I couldn't use a more common IDE-to-CF card converter. We'll see how this goes.

That's it for now. I'm currently in the process of having all this shipped to me and I will soon see how things work out.

Monday, August 30, 2010

Building a low-power FreeNAS Server: Part 1

I've been looking into having a small file server in my home, to store my photographs and iTunes library. The most important aspects of that file server are, in order:
  • Low power requirements: It is on 24/7 and I don't want to consume too much power
  • RAID-1: I want my data to be protected in case the hard disk crashes
  • Low cost
  • Good performance
  • Expandability: Nice features such as a bittorrent client are a plus, and I want to be able to experiment with DLNA in the future, so the "hackability" factor is important.
There were two contenders here that met most requirements: DLink's DNS-323 and Synology's DS-210j (Buffalo has some too but they are hard to find online). They each had a drawback: the DNS-323 is reported by many to be subpar in terms of performance, while the DS-210j is expensive.

So I decided to build my own NAS instead. It will most probably be based on FreeNAS, assuming it is hackable enough for my taste. The overall price is below that of the DS-210j, and I expect performance to be good enough. Low power being paramount, I had to hand-pick all the components, and my next posts will detail what I've chosen and how I'll be building it. I'll try to post some nice pictures.

O.

Thursday, August 19, 2010

wuauclt.exe and svchost.exe taking memory on XP

Update 2010-08-21: The thread here indicates that Microsoft is investigating this as a priority 1 issue.

Normally I don't post PC and Windows-related stuff but there are not many recent posts on this August 2010 issue, so here it is.

I support my own PCs and those used by the extended family. For a few days a staggering issue has been happening. When configuring a Windows XP PC to use Microsoft Update (rather than the good old Windows Update), svchost.exe and wuauclt.exe take so much memory that a low-memory PC (512MB) will start swapping enough to freeze the whole system. I saw this happening on SP2 and SP3.

Yes, 512MB is not a lot, but before calling me stupid, note that 512MB used to be the standard configuration a few years ago for many low-end systems and it is officially supported by XP. That bug makes any low-memory PC unusable.

You should note that installing Microsoft Security Essentials switches any PC silently from WU to MU. That's how I introduced the problem and noticed it the first time on a laptop. I was about to reinstall it but found out that there have been some user reports here and here. Microsoft has not released a patch yet but I am sure many corporate users have stumbled on this problem and reported it officially.

The workaround in the meantime consists of connecting to Microsoft Update and choosing to stop using Microsoft Update. The PC will revert to using Windows Update and everything will work normally again.

O.

Wednesday, July 21, 2010

My top 2 articles on the internals of broadcom-based routers

Embedded devices are energy efficient, and their limited memory and storage present some satisfying challenges. How I missed this for all these years, I don't know, but it is time to play catch-up. For a few weeks, I've been spending some of my free time reading up on Openwrt, the WRT54GL, and other devices of similar design.

Of course, flashing a custom firmware can be daunting enough for an average user, but an average geek like myself will want to know how exactly these devices work. There is a published book on the subject but upon reading the table of contents, I decided against buying it as I deemed it not technical enough.

So, I assumed there was information somewhere for people like me, info that didn't require me to read source code. And there is, but you have to search for it.

And what are the top two articles I found on the subject? Here they are.

1. First of all, there is a three year-old post entitled "Everything you need to know about Broadcom hardware" here in the Openwrt forums: https://forum.openwrt.org/viewtopic.php?id=11304

It is extremely informative on the boot process of these devices and it finally clarified for me how the CFE bootloader works. I didn't understand the details of the SquashFS + JFFS2 combination; thanks to this post, now I do. You should also read about union mounts if you're not familiar with the topic (I wasn't, HP-UX does not support this!).
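If union mounts are new to you as well, the idea can be modeled in a few lines of Python. This is only a toy model of the semantics as I understand them (a writable layer stacked on top of a read-only one), not a description of how OpenWrt actually implements it; the class and method names are mine.

```python
class UnionMount:
    """Toy model: a writable upper layer stacked on a read-only lower layer."""

    _WHITEOUT = object()  # marks a lower-layer file as deleted

    def __init__(self, lower):
        self._lower = dict(lower)  # stands in for the read-only SquashFS image
        self._upper = {}           # stands in for the writable JFFS2 partition

    def read(self, path):
        # The upper layer always wins; fall through to the lower layer otherwise.
        if path in self._upper:
            data = self._upper[path]
            if data is self._WHITEOUT:
                raise FileNotFoundError(path)
            return data
        if path in self._lower:
            return self._lower[path]
        raise FileNotFoundError(path)

    def write(self, path, data):
        # Writes never touch the lower layer.
        self._upper[path] = data

    def delete(self, path):
        self.read(path)  # raises FileNotFoundError if the file is not visible
        if path in self._lower:
            self._upper[path] = self._WHITEOUT  # hide the read-only copy
        else:
            del self._upper[path]

    def factory_reset(self):
        # Wiping the writable layer brings back the stock files.
        self._upper.clear()


if __name__ == "__main__":
    fs = UnionMount({"/etc/banner": "stock banner"})
    fs.write("/etc/banner", "my banner")
    print(fs.read("/etc/banner"))
    fs.factory_reset()
    print(fs.read("/etc/banner"))
```

This is why a firmware can ship as a compact read-only image and still let you edit any file: your changes live in the small writable layer.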

2. Then you should read about how VLANs and network interfaces work on the Broadcom platform -- this article is from the old OpenWRT wiki but well done, too. It is specifically for the Asus WL-500g but being a Broadcom design I can only suspect other routers are very similar, if not identical:
http://wiki.openwrt.org/oldwiki/openwrtdocs/networkinterfaces

This article explains why you have a bridge interface on your router. It especially shines in explaining how these routers isolate the interfaces cheaply using VLANs. In fact, there is only one twisted pair interface (eth0), the other one being the wireless antenna (eth2). I was under the impression that firewalls needed different and isolated interfaces, but the VLAN trick lets you do something similar on cheaper designs. And I guess it is good enough!
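To see what the VLAN trick amounts to on the wire, here is a small Python sketch that inserts a standard 802.1Q tag into an Ethernet frame and reads it back. The VLAN numbers are placeholders of mine, not necessarily the ones these routers use; check the article for the real assignments.

```python
import struct

TPID_8021Q = 0x8100  # EtherType value that announces an 802.1Q tag


def add_vlan_tag(frame: bytes, vid: int, pcp: int = 0) -> bytes:
    """Insert an 802.1Q tag after the destination and source MAC addresses."""
    if not 0 <= vid <= 0xFFF:
        raise ValueError("VLAN id must fit in 12 bits")
    if not 0 <= pcp <= 7:
        raise ValueError("priority must fit in 3 bits")
    tci = (pcp << 13) | vid  # priority (3 bits), DEI (1 bit, left 0), VLAN id (12 bits)
    return frame[:12] + struct.pack("!HH", TPID_8021Q, tci) + frame[12:]


def vlan_id(frame: bytes):
    """Return the VLAN id of a tagged frame, or None if the frame is untagged."""
    tpid, tci = struct.unpack("!HH", frame[12:16])
    if tpid != TPID_8021Q:
        return None
    return tci & 0x0FFF


if __name__ == "__main__":
    # Dummy frame: zeroed MAC addresses, IPv4 EtherType, fake payload.
    untagged = bytes(12) + b"\x08\x00" + b"payload"
    # Hypothetical numbering: 1 for the LAN ports, 2 for the WAN port.
    print(vlan_id(add_vlan_tag(untagged, 1)), vlan_id(add_vlan_tag(untagged, 2)))
```

The switch chip adds or strips such tags per port, so a single eth0 can carry LAN and WAN traffic while the kernel still sees them as separate interfaces.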

I might replace my m0n0wall PC soon in order to reclaim real estate in my utility room and save a few bucks on electricity. I just need a router with the wireless radio disabled. But I will NOT be purchasing a WRT54GL. Why? I'll tell you in a future post!

O.

Monday, July 12, 2010

Ah, Mom, I bricked my router!

Remember I was running OpenWRT on a Buffalo WLI-TX4-G54HP? After playing with OpenWrt a bit I spent some time setting up an Ubuntu-based build environment to be able to build my own custom firmware.

There's no particular reason for building a custom firmware since the package I want to ultimately run fits on the JFFS2 partition. I just wanted to get my hands dirty. But gasp... the documentation for OpenWRT is sparse and disseminated in four areas: an unfinished HTML manual, an old Wiki, a new Wiki being slowly migrated from the old one, and the OpenWRT forums. One really has to spend some time reading through all of them to understand how everything works, and I've covered maybe 10% of that. And that's okay. I don't pay a cent for OpenWRT and it's a distribution targeted to power-users... If I wasn't looking for a challenge, I'd be running Tomato instead.

Anyway. It turns out that I bricked my device yesterday after flashing that darn custom firmware. I didn't solder a serial or JTAG port, and wasn't really looking into doing this, so I couldn't do much to troubleshoot. My planned way out was the 2-3 second window during which the CFE bootloader answers pings on 192.168.1.1, which is when I can TFTP a correct firmware. Since it no longer answered, I couldn't use that to unbrick it.

I was about to throw it in the garbage but found somewhere in the DD-WRT forums that some Buffalo routers listen to 192.168.11.1. I tried this address and it worked! Now as to why it decided to listen to 192.168.11.1 while the first time it was 192.168.1.1, I don't know. I must have pressed the reset button 50 times on this thing so who knows what it ended up doing.

Back to square one, I have a router that works, but no custom firmware.

O.

Friday, July 2, 2010

VMware player... what have I been missing?

Up until now, after all these years, I never considered using VMware Player because, as its name implies, it used to be a player for virtual appliances. That's what it was initially designed to do, but it looks like it can do much more than I thought.

Since I used to be an ESX admin, I don't feel the need to learn to use something else and the only solution I knew to virtualize random stuff for free on my low-end laptop, if I stuck with VMware's offering, was to use VMware Server. I tried Server a few times over the years, but must admit that it isn't really interesting when compared to ESX. The web interface is clunky, and it does not integrate well on a "home" PC.

Little did I know that versions 3 and up of VMware Player now let you create virtual machines to your liking, without needing to use pre-made appliances. I've tried it and it works well, and it is much easier to use than Server's web interface.

I've been "converted".

O.

Thursday, June 24, 2010

Looking for my next technical venture

My regular readers know that I used to be an HP-UX admin until a few months ago, when I threw in the towel on my sysadmin career.

I'm left with an interesting-enough IT-related job at work, but I can't blog about it easily. The work I'm doing cannot be easily contributed back in a generic way. And the technologies I'm learning are mostly specific industrial stuff and I don't think anyone would be interested in reading about this. Case in point: Is there anybody here, except perhaps me, interested in IRIG-B? Yeah, I thought so.

So I'm slowly investigating where I should spend maybe 2 hours a week learning the ins and outs of a new technical thing, and possibly start contributing to the community in whatever means I can, once I feel good enough with it (which might end up taking years). Since I work with embedded devices at work, I've been getting interested in the embedded market and it looks like my next venture might be with OpenWrt but as I don't have many practical uses for it at home, I'm still not sure.

I just found out that a smaller-than-netbook device named běn NanoNote went into production recently, and it runs OpenWrt. What is special with the NanoNote is that both the software and the hardware schematics are completely open designs. At 99$, it's as cheap as it can get. It's almost the same price as a dumb HP digital picture frame I purchased a few days ago, yet I can do whatever I want with it.


Whatever I want with it... I've spent half an hour staring at the NanoNote specs and pictures, thinking about what I could use it for. Remember that it's very limited in memory and flash, and you can't fatten it up with thousands of bloated applications. In our iPhone/Android era, I simply don't know what people would think of this. But I'll admit it has one advantage over smartphones up its sleeve: it is cheap, and can be mass-produced unencumbered by patents and legal restrictions (at least, I think so).

One can only hope that someone smarter than me will invent a proper use for this device.

O.

Monday, June 21, 2010

What running backfire on a WLI-TX4-G54HP *really* looks like

The 10.03 release of OpenWrt is named "Backfire", which is a cocktail made of the following ingredients according to the login banner:

I happen to have these three drinks at home, so I fixed myself one to celebrate yesterday's experience with this device. Here's what it looks like, with a WLI-TX4-G54HP:



I thought it would have been a three-layer drink, but it gets mixed up quickly in the glass. I never tried a Backfire before, and it does taste good. Both in the glass, and in the router.

O.

Sunday, June 20, 2010

Running OpenWrt on a WLI-TX4-G54HP

I have on hand a Buffalo WLI-TX4-G54HP. This is a wireless-to-ethernet bridge. What that bridge does is actually the reverse of an access point: it lets you plug in any device that doesn't support wireless, such as an old Xbox, and connect it to a wireless network. I actually used it with my locked-down corporate laptop which had its wireless function "deactivated for security reasons". :-)

I was thinking of purchasing a WRT54GL or an Alix board. The WRT54GL, being a hobbyist device, is pricey for what you get (even on eBay) and I was hesitant. Since I had that Buffalo bridge doing nothing, I thought that I might as well hack it with an alternate firmware and see what I can do with it.

The WLI-TX4-G54HP is not specifically documented as able to run third-party firmware; my take is that not many of these bridges have been sold, so nobody reported it. Yet I found some specs hinting that it runs on a Broadcom 5352, which is the same chip used in the WRT54GL. It also has the same amount of RAM and flash, which is a good thing. Sure enough, there were some Buffalo routers based on the 5352 that were officially supported by OpenWrt, but no mention of the WLI-TX4-G54HP. I decided to take a chance and flash it anyway. And it worked:


The only method of flashing OpenWRT on this device is to use the TFTP method. There is no signed firmware available that you can install from the router's webpage. It worked on my first try using tftp.
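For the curious, TFTP is simple enough that the packets a client sends during such a flash can be sketched in a few lines of Python. This only builds the packets described in the TFTP specification (a write request, then numbered 512-byte data blocks); it does not talk to a router, and the file name is a placeholder of mine. In practice I just used a stock tftp client.

```python
import struct

OP_WRQ = 2   # write request
OP_DATA = 3  # data block


def tftp_wrq(filename: str, mode: str = "octet") -> bytes:
    """Build the write request that opens an upload."""
    return (struct.pack("!H", OP_WRQ)
            + filename.encode("ascii") + b"\x00"
            + mode.encode("ascii") + b"\x00")


def tftp_data_blocks(payload: bytes, block_size: int = 512):
    """Yield DATA packets; a block shorter than block_size ends the transfer."""
    block = 1
    # The +1 forces a final empty block when the payload is an exact multiple
    # of block_size, as the protocol requires.
    for offset in range(0, len(payload) + 1, block_size):
        chunk = payload[offset:offset + block_size]
        # Block numbers are 16-bit; this sketch does not handle wrap-around.
        yield struct.pack("!HH", OP_DATA, block) + chunk
        block += 1


if __name__ == "__main__":
    print(tftp_wrq("firmware.trx"))  # hypothetical file name
    print(sum(1 for _ in tftp_data_blocks(bytes(4 * 1024 * 1024))), "data blocks for a 4MB image")
```

A real client sends each block over UDP and waits for the matching acknowledgment before sending the next one, which is why the short window at boot matters so much.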

That's it for now. I'll do some more tests as time permits, and will see how I can submit that device in OpenWrt's compatibility lists.

Why OpenWrt? And why not Tomato or DD-WRT? Because from what I've seen until now, OpenWrt seems to be the most "open" solution available. All source code is available and GPL-licensed. Furthermore, it has a lot of command-line interfaces and is targeted at experienced Linux admins.

N.B. If you're running Windows 7, you'll notice there is no longer a tftp or telnet client included as with Windows XP. Look here for a quick fix to include them:
http://www.leateds.com/2009/telnet-for-windows-vista-windows-7/

O.

Tuesday, June 15, 2010

Thoughts on NTP and a possible homebrew project

After two months without working with UNIX I'm already missing it, and I'm looking for a project to have fun at home and remain technical -- all I do at work is use Word and Excel, and it's starting to make me crazy.

After accumulating them for over 20 years, I recently gave or threw away most of my computer parts (which included two vintage 5 1/4 drives, what a shame). The only PCs remaining are various laptops which are used by my family. I also have a rock solid m0n0wall appliance hidden in my utility room, and I don't want to zap it right away because all my network depends on it.

Those who know me well are aware that I've had a personal fascination for years with NTP. Being a licenced ham operator, building a public stratum-1 server synchronized to CHU using Linux's CHU driver would have been a kickass project to undertake 10 years ago, with the satisfaction later down the road of contributing to the NTP pool when it became mainstream. However, owning a fixed IP address is costly, not counting bandwidth, and limiting the usefulness of a time server to my own internal network wouldn't give me much. And what's the purpose of using CHU or WWVB when you can sync using a GPS anyway; the only situation I can think of is if you can't have a clear path to a satellite, or need a cheap solution to extract the time from a reliable source. That's the premise those nice radio clocks that set their time automatically are built on.

My second fascination is with embedded devices. They don't consume much power, they're small and fun to work with. The cheapest way to own one to play with would logically be to buy a WRT54GL and flash it with a third party power-user firmware such as OpenWrt. However, the WRT54GL is based on old technology (2002), and thus fairly expensive for what you get. To add basics such as a serial port, you need to crack open the case and solder wires. Fixed storage is limited to 4 MB, which is not a lot of space in 2010 numbers. Bottom line, using the WRT54GL for a homebrew project can get expensive and cumbersome quickly, and paying big bucks for an underpowered device doesn't excite me much. Thus, I'm almost ready to purchase an ALIX 2d13. It's around double the price of a WRT54GL when you count shipping, a case and a CF card. But it packs a LOT more power and expandability, and this embedded device should be able to offer enough power to last 10 years.

One thing I was thinking about is to combine both by adding a soundcard (miniPCI or USB) to the ALIX board, plugging in a shortwave radio, and building a homemade CHU-compatible NTP time server. And why not try WWVB as a "part two". It could have been used in areas where satellite is not accessible. But I quickly found out I wouldn't have been the first one to think of this -- Meinberg already has one. Their only drawback is that it's fixed to one station (they're German, so of course they offer one synced to DCF77). But once your choice is made, you can't change it.

So back to square one. I won't invest tons of money to recreate what has already been done. And this is where I'm at, as of today, with my research on embedded x86 devices.

I think I'll end up building a generic server on the ALIX. Some tasks such as PXE installs don't seem to be well documented on this platform, and I think that testing and documenting a PXE environment could be of benefit to whoever has a bunch of these to flash. We'll see.

Take care

O.

Monday, May 31, 2010

The Information Paradox

Almost every IT shop has a methodology. Where I'm currently working, they're using Macroscope which, from what I see in the IT market in Montreal, has been in good use around many places here for two decades (I even studied one of its ancestors in College in 1994).

Fujitsu is now in charge of that methodology, and they offer for free an electronic version of a book named The Information Paradox which, I thought, could have helped me understand better that methodology.



I was dead wrong. I tried reading the first two chapters and couldn't finish a paragraph without phasing out and thinking about, oh, various subjects such as home improvement, Star Trek, Bangladesh or Yonge Street. Yep, that's right, a technical guy like me just cannot read this book and remain sane. It's all talk about IT Portfolio, governance, and some other nonsense which doesn't ring a bell. But who actually reads this book? Lots of people, it seems, and I figure they're all working for what I used to call the IT Gestapo. Now that I jumped the barrier to architecture, they're supposed to be my friends now. Yet this friendship is only on paper; I think I'll never be able to share a beer with such people who would talk about IT portfolio the same way I talk about, say, the latest FreeBSD release.

So it looks like with a B.Sc., I'm not qualified to read that damn book. And it's fair game: people with B.A.'s could probably never understand why APUE is one of my favorite books.

O.

Tuesday, May 25, 2010

QLogic disabled BIOS in a blade can cause problems

Today I was called for help concerning a weird problem with blades that "didn't work" on the Virtual Connect FC. I'm no longer a sysadmin and shouldn't know anything but since I've worked with blades for three years, I must have become some kind of hot property at my new workplace. :)

In this case, the WWNs of the blades never showed up in Brocade's fabric manager, even though the configuration of the VC domain seemed correct with all profiles set correctly. I double checked everything -- most of the configuration in the VC-FC was correct, except a few missing things but nothing spectacular.

These blades had no OS installed yet, so the QLogic HBA driver couldn't be brought up to initialize the HBAs... thus the fabrics could never detect them. That should be expected to be "normal", but not so: how are you supposed to boot on SAN if you can't even install the OS in the first place?

Turns out that the QLogic BIOS was disabled on all the blades. Calling up the BIOS configuration with CTRL-Q, and enabling it for all SAN-connected blades, fixed the problem.

O.

Friday, April 30, 2010

So, what's going on?

So what's up with me? Now that UNIX and SAN admin are behind me, what am I doing as a system architect? There's obviously no further HP-UX administration going on, which means the readership of this blog will drop dramatically. I do have a link for you admins, though: look at this fascinating discussion in the ITRC that has been going on for a while about its future.

Deep technical content for The ex-sysadmin blog can only be written if I get my hands dirty, which I'm not sure I'll be doing much for a while.

I probably can't disclose exactly what I'm doing, but I'll limit myself to saying that my current project consists of laying the grounds of data acquisition systems that use industrial-grade devices and software manufactured by Cooper Power Systems. Some of the win32 software parts are encapsulated using XenApp, which reminds me a lot of X-Windows. :) Everything needs to be bound to NERC security standards (one of the many standards we need to comply with) and it is my understanding that many legacy systems are in the process of being "NERC'ed", which means I'll have lots of interesting work to come.

It would be hard to blog about all the implementation details while still remaining generic enough to separate myself from my workplace; thus I won't speak too much, at least for the moment. Once I have a better understanding of the technologies I'm working with, I might come back with technical posts and recipes similar to what I did for Technocrat-UX.

O.

Tuesday, April 27, 2010

Details on the future EVAs

I couldn't say much since I was bound by a CDA, but I can link to The Register team who has posted some details along with what should be the new name for the future EVAs:
http://www.theregister.co.uk/2010/04/27/hp_eva_p6000/

Concerning release dates, sorry, I have to keep my mouth shut.

The new Integrity line was also announced yesterday. It uses a modular, or should I say stackable, blade design. I'm no longer involved with them but if I was still an HP-UX admin, you bet I'd be excited. Note the new Superdome 2; this is a major redesign which will most probably mean the end of the cell-based servers. I didn't have the time to check if and what OLAR features will still be available to customers using the bl890c, as it was a selling point for the rx7640 and rx8640.

Yet this might be too late. RHEL no longer supports Integrity, nor will Windows soon. For customers looking for 32-core systems or more, this will still make the huge Proliants DL7xxx (and upcoming DL9xxx) interesting alternatives to the smaller blades.

O.

Thursday, April 22, 2010

A visit at the "ESS CEC"

Earlier this week, I went for one day to HP's Enterprise Storage & Servers Customer Experience Center, which is located on HP's campus in Houston. You'll probably deduce that I was there to get some info on current and future storage offerings from HP, and I did. But I signed a CDA and can't disclose what exactly I learned there... however some new products have just begun shipping, such as LTO5 drives. As for dates for other future announcements, well... of course I can't give dates and specifics. But the HPTF usually serves as a platform to announce new products, doesn't it?

One treat that topped the day pretty well was a visit of the Factory Express shop floor where all servers and blades are assembled, and optionally pre-racked. I also saw some PODs being assembled. Once again, I can't say much without getting my ass kicked but whew! They have a nice operation there, I was impressed. Of course, I couldn't take any picture either, but a fellow who probably wasn't under a CDA leaked some pictures via Twitter a few weeks ago during a storage-related event and you can try searching to see if they're still available.

Speaking of the HPTF, I won't be there this year.

O.

Tuesday, April 13, 2010

The ITech Summit, Compellent, and cheap storage

The Infrastructure Technology Summit will be held in Montreal and Toronto in the next two weeks, and Calgary and Vancouver are coming in September. This event seems to be the descendant of the older SAN/NAS summit which was held yearly. HP won't be there... which is too bad because I remember seeing Chet Jacobs present right here in Montreal at the SAN/NAS summit maybe 7 years ago. And those who know StorageWorks know Chet; his presentation style is unforgettable.

HP aside, one SAN manufacturer I've been very interested in since last year is Compellent. As an ex-SAN administrator, I was having a lot of problems with unreasonable disk space demands that were reaching into chunks of terabytes, and Compellent's copy-on-write thin provisioning sure seemed like an easy to implement solution that didn't require putting a gazillion agents on servers. The 4/6/8x400 EVAs are, to say the least, average when compared to what Compellent makes. Let's hope the next-generation EVAs will be better. However, not everyone is jumping onto Compellent's bandwagon as I personally feel it still has some mileage to do to prove itself as a reliable brand name in the enterprise where long-term commitments and support are very important. A recent article at the Register tends to show what effect having no household name in that world is having on Compellent.

Speaking about terabytes, users think storage is cheap because they can purchase a Hitachi 1 TB USB drive at Best Buy for, what, 100$ now?? Unbelievable. They can even choose to get a Seagate 500 GB hard drive filled with popular movies for the same price! Asking for terabytes is therefore no big deal, right? Well, boys and girls, Enterprise Storage ain't cheap. And there are good, valid reasons for this. I'll make a blog post about that subject soon.

O.

Wednesday, April 7, 2010

Some last HP-UX tips before I go

I'm officially finishing my job as an HP-UX/SAN/VMware/Blade admin tomorrow, and I'm moving up the food chain next Monday to my new job. I'm supposed to come back to train my replacement but don't know when it will happen.

Here are some tidbits I found in the last two weeks:

swfixrealm(1m) - Appeared in HP-UX 11.31. It lets you correct the default realm in SD ACL files in one shot. Very useful if you create a depot on a system that has its hostname changed, such as one that was ignited with a golden image.

The new 11.31 Bastille has a parameter named AccountSecurity.unowned_files that wasn't there with 11.23. It is enabled by default and will silently chown files that aren't owned by anyone (i.e. belonging to a uid which is undefined in /etc/passwd) to the bin user. Same for groups. Be careful with this on a server that serves a bunch of home directories, or an NFS server. It might be normal that some files aren't owned by anyone.
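Before letting Bastille loose on a file server, it can be worth listing what it would touch. Here is a minimal sketch that walks a directory tree and reports entries whose uid or gid is not defined on the system. It is not how Bastille does it internally (I have no idea how it does); the function names are mine, and it only sees the users and groups that Python's pwd and grp modules can enumerate on the host where it runs.

```python
import grp
import os
import pwd

def known_ids():
    """Collect every uid and gid the system can enumerate."""
    uids = {entry.pw_uid for entry in pwd.getpwall()}
    gids = {entry.gr_gid for entry in grp.getgrall()}
    return uids, gids

def find_unowned(root, uids, gids):
    """Yield (path, uid, gid) for entries whose owner or group is undefined."""
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.lstat(path)   # lstat: report symlinks themselves
            except OSError:
                continue              # vanished or unreadable, skip it
            if st.st_uid not in uids or st.st_gid not in gids:
                yield path, st.st_uid, st.st_gid
```

Typical use would be something like "uids, gids = known_ids()" followed by a loop over find_unowned("/home", uids, gids), printing each hit. On an NFS server, expect a long list: files owned by uids that only exist on the clients are exactly the ones you don't want silently handed over to bin.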

Newer versions of Ignite dropped the "Wizard" installation screens which, while braindead, were the ones I was suggesting support personnel use to install HP-UX using the Golden DVDs I made. They're in remote offices, so no remote igniting is possible, which is why I burn them custom DVDs. Losing that interface means they'll have to use the Advanced TUI, which is less user-friendly.

O.

Monday, April 5, 2010

Windows on Itanium: no more.

Corrected April 6th: I forgot NonStop

I just saw on Slashdot that Microsoft is dropping the ball on Windows on the Itanium platform. While I'm not surprised, that pretty much gives more weight to my prediction that the future of the Integrity platform, at least as we currently know it, is uncertain.

This sure isn't good news for BCS. Of course, everyone knows that most Integrity customers are using HP-UX, OpenVMS and NonStop so that shouldn't mean HP will lose that much revenue following the canning of Windows on ia64. But HP will lose potential customers for sure, and can say goodbye within a few years to the ones currently running SQL Server on Superdomes. I've heard through the grapevine that many Superdomes are actually running Windows, and fewer installed domes means higher engineering costs pushed down to whoever will remain to purchase high-end and midrange Integrity systems.

That's really too bad. Being a really-soon-to-be-ex-HP-UX admin I can only feel sad when I see such news. But let's be clear: HP still has three strong operating systems to run on Integrity, so that doesn't mean the end of it all.

O.

Wednesday, March 31, 2010

The sysadmin dilemma

I just stumbled upon Matt Simmons' "Culture of Quitting" post where, besides talking about the fascinating concept of Up or Out, he puts in a nutshell his motivations for being a systems administrator:

No, I (and probably you), have intrinsic motivation. I don’t expect direct rewards (or even outward appreciation, typically) from doing my job. The reward is that my infrastructure works the way it should. Sure, I have certain long term goals, but I can’t accomplish them if I don’t accomplish my short term goals first.


That's where I got my motivation, too, for quite some time. I don't know any sysadmin who wouldn't be proud of putting in place a high quality, resilient infrastructure. But the problem with this, though, is that not only can it get expensive, it never breaks. And when nothing goes sour once in a while, it's hard to get noticed (and further motivated) by all levels of management unless you're lucky enough to work for people who are perfectly aware of what you're doing.

So what should be done to get a tap on the shoulder? Be a below-average sysadmin? In other words, don't produce results too quickly, don't try to optimize everything right away so that performance issues are apparent, constantly say no to user demands... so you can come back later as a hero by finding solutions to "complex problems" to save the day? That might not make any sense, yet I'm slowly starting to think it does: Under some circumstances, the only way to actually show you're doing something productive is to spend a great deal of your time addressing issues which are visible to management.

This is the base of what I've learned to mockingly call "the sysadmin dilemma": If you're doing good work, you won't get noticed too much, and will risk either staying where you are for years with no chance of being gratified, or worse, you'll end up having to justify your job. On the other side of the dilemma, do bad work which costs big bucks to your employer, and you'll be shown the door quickly.

So I think the best path to take if you want to avoid a sysadmin dilemma is to put your target on being average. Be necessary, but don't be too good. But it is obvious to me that being an "average" kind of person will probably not be a fitting motivation for types like me and Mr. Simmons.

Up or Out.

That sucks, but life in the enterprise does, doesn't it.

O.

Monday, March 29, 2010

See a candidate coding live during a remote interview

Following a discussion on Slashdot, I stumbled upon a site named See[Mike]Code where an online temporary "interview" room can be set up (for free) and you can use it to evaluate coders in real-time.


Simple and clever. The site brings up a unique URL for the interviewer and another one for the candidate. Everything the interviewee types is replicated, realtime, to the interviewer. That's an easy way to evaluate the technical merits of someone without the trouble of bringing him/her in for a formal interview.

O.

Wednesday, March 24, 2010

The redundant IT team

Following my resignation as the senior sysadmin at my division, I've had many of my colleagues come into my office, saying (amicably) "Darn! You've just put us in deep shit!". That's a matter of perspective; nothing should start breaking apart the moment I leave, and many servers can live without any maintenance for a while. No, I'm really not leaving them in distress, as for many years, I've been thinking the following:

Everyone is replaceable.

...a quote which probably makes me fit for management, because that's exactly what managers think. And they aren't wrong, not at all. They are right. No key IT staff, be it a sysadmin, developer or tech support person, should hold all the information and knowledge to keep any part of your business running. This is especially true with small IT teams of less than 50 resources where there is no implicit redundancy that covers everyone.

Yet, that's the position that many IT managers will put their staff into. When you're counting beans, the only thing that matters is keeping a predefined quality of service at the lowest cost. While hardware manufacturers will be hard-pressed to justify various expenses due to all the redundancies they pack into their stuff, remember that people need to be redundant too! And when the shit hits the fan, no amount of preparation will be of any help if you can't count on support staff who is experienced with your systems. Which will introduce my second quote:

If you can afford redundant hardware, you should be able to have redundant people too.

By "redundant" I don't mean to double up your resources 2 to 1. But you really, really have to reduce the chances of someone being the sole owner of a key role without an assigned teammate.

These rules should then be followed to put into place a redundant IT staff:

1. Make sure no one in your staff has a unique set of skills and knowledge. Someone must be able to replace any missing resource, to some level, quickly. This is valid for everyone in your IT organization, as nobody answering the phone on your tech support hotline (or giving wrong information) could end up being as bad as somebody else inadvertently initiating an IPL on the mainframe.

2. Require a standard and consistent set of documentation for hardware and software solutions before putting anything into production. Docs that are formatted with a standard set of sections will be easy to skim through by anyone who needs it. The UNIX man pages are a stellar example of consistency. If you don't have the time to mess with recognized processes and standards, then simply don't; a word processor template is all you need to get your guys going.

3. Organize regular lunch'n'learn sessions where someone presents a technology subject to their peers. Not a deep dive, but just an overview so they know about it. Insist on quality presentation documents as they can become a great reference later down the road when training new recruits. And don't be a cheap bastard: if you want to motivate people to come and listen, you better pay for the lunch.

4. Treat your key resources well, so that they aren't tempted to go elsewhere. By "key" resources, I mean anyone who has deep knowledge of proprietary systems, and for which there is a shortage of qualified workforce on the street. For that matter, enforce rule #2 with them. Even if you can count on someone else to fill in when they say their goodbyes, you'll still need to hire someone further down the road, so it's best to make sure they don't leave.

5. Don't hire smart asses and douche bags. And I'm serious about this. The smart ass is easy to spot in an interview; that guy will claim that he knows everything, sometimes amalgamating nonsensical buzzwords, so just grill him with very technical questions prepared in advance by your staff and he'll fail miserably. The second kind is harder to spot. In my career, I've crossed a few computer science folks who were very talented but also impossible to work with; these aren't team players, they don't trust anyone, and keep everything for themselves. They must be avoided at all cost as they have the ability to sink your operation. Besides a personality test, which they might be able to trick, there's not much you can do to find out except asking for references. So do you get me here? You need to find a balance between social and technical skills depending on the type of job you're offering; low social skills don't fit well with tech support but might be acceptable for a senior developer. Hell, that point is getting so long, I think I'll make it a blog post all by itself one day.

6. Hire people that are ready to hold many hats. Let the truth be told: some people are happy to be single task-oriented and will make sure it stays that way. These do not fit well in a small IT team, as you can't motivate them to learn about new technology, especially if it's under one of their colleague's responsibility. They're NOT autodidacts and the first thing they'll always do is ask to be "trained" for just about anything. If you have someone who can use a jig saw, but refuses to even try using a reciprocating saw without going a few years to the School of Reciprocating Saw Professionals, then you have a problem. Of course, when working with a unionized crew, the rules are very different and I don't think that blog post will be of any help to you.

7. And last, nobody likes change management and everyone avoids it like the plague unless they're forced to implement it to follow some crapass compliance rule. But change management, if done well with a minimal amount of red tape, can be very beneficial to your team. Any time someone changes something, it will be documented, which makes fixing any mistake easier if that person is not reachable. Up until now, most "enterprise" change management I've seen seems to be a bunch of expensive, incomprehensible software stacks. I've yet to find a simple and easy-to-use web-based system so I can only encourage you to spend a few days making your own.

This set of rules is by no means scientific. They're the ones I would do my best to apply, were I in a management position for an IT team. Fortunately for me, I'm not. Managing an IT team presents its own set of challenges: with limited money you have to keep your employees happy, the users happy, while at the same time ensuring that your enterprise's survival isn't in peril by ensuring that a reasonable risk management is done. Running a redundant team is one of the best ways to lower that risk.

O.

Tuesday, March 23, 2010

The page has turned. Let's get the new blog going!

Let's get the new blog going! Why not right here, right now! I might have decided to stop being a hardcase sysadmin, but that doesn't mean I have to stop blogging. Technocrat-UX is dead, long live Technocrat-UX! And let's welcome The ex-sysadmin!

Is there a life after the systems administrator? You tell me! In my blog, you'll follow an ex-sysadmin's endeavour into a world of business process intrigues, vague specs and internal politics. Experience has shown me that sysadmins don't always defeat the bad guys and get the girl at the end of the story. The question is, do system architects? This blog will try to find out the truth!

Expect pragmatic answers to the most elaborate questions. Simple solutions to complex problems. And of course, my own vision of market trending and analysis that you'll love to hate. Who says systems architecture has to be bleak?

But seriously, I first need some time to know what I'm doing. That should take a few weeks, if not months. So I'll document the process I'm going through to leave the sysadmin behind, and prepare the carpet for the architect.

O.

No more Technocrat-UX


Dear readers,

Instead of letting this blog die, I thought it would be better to make one last post to say goodbye, at least for the time being.

After almost 10 years as a full-time system administrator, I decided to change my career path to become a systems architect. Which, at its bottom line, means that I'll stop writing shell scripts to write plans and documents instead. I will no longer be dedicated to HP technology, and HP-UX in particular. The decision has been hard to make, as I love my job as a sysadmin. Over time, I've participated in many HP-related events and met very interesting people at HP. I will miss this, a lot.

Over time, I've been saddened to see that my workplace doesn't put a lot of, er, "financial" value to its technical staff, no matter how much effort I've put into developing the best technical career I could. The only way to keep going up the salary ladder while remaining technical would have been to quit and try my chances with consulting, and I wasn't interested in doing this for personal reasons. So, I decided to stay with my public sector employer, relinquishing my purely technical job, and time will tell if I will like it.

I'm keeping this blog open for the time being, since the content should be relevant for a few years. If I ever start producing content related to my new experiences, I'll keep the same URL (omasse.blogspot.com) and switch the name from Technocrat-UX to something else.

Thanks to everyone who has been reading this blog since 2008. You've sent me many comments and suggestions that kept me going. Writing for Technocrat-UX has been a very good experience and I sure hope that the content has been useful to you.

Take care.

Olivier S. Massé

Thursday, March 11, 2010

HP-UX 11iv3 U6 will include an optional parallel rc sequencer

One of the questions users ask me most is How come HP-UX takes such a long time to boot and shut down?

I always reply that the startup rc sequencer works serially, thus any subsystem that takes a long time to either start (CIM) or stop (OVPA) will have a negative impact as everyone else will be waiting in queue. And yes, I also add that Linux's been doing it in parallel for years.
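To put rough numbers on the serial versus parallel argument, here is a minimal sketch. The subsystem names, durations and dependencies below are invented for illustration; they are not measurements from a real HP-UX box, and the code says nothing about how HP's sequencer is actually implemented. It only shows the arithmetic: a serial sequencer pays the sum of all startup times, while an ideal parallel one pays only for the longest dependency chain.

```python
from functools import lru_cache

# Hypothetical subsystems: name -> (startup seconds, what must start first)
SERVICES = {
    "net":  (5,  []),
    "nfs":  (10, ["net"]),
    "cim":  (60, ["net"]),
    "ovpa": (40, ["net"]),
    "app":  (15, ["nfs"]),
}

def serial_time(services):
    """A serial sequencer runs one script after another, so times add up."""
    return sum(duration for duration, _ in services.values())

def parallel_time(services):
    """With unlimited parallelism, boot time is the longest dependency chain."""
    @lru_cache(maxsize=None)
    def finish(name):
        duration, deps = services[name]
        # A service can only start once its slowest dependency has finished.
        return duration + max((finish(dep) for dep in deps), default=0)
    return max(finish(name) for name in services)
```

With these made-up figures, serial_time(SERVICES) gives 130 seconds and parallel_time(SERVICES) gives 65, the net-then-cim chain. One slow subsystem still sets the floor, but everyone else no longer waits in queue behind it.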

In the old days, HP-UX used to stand above many other BSD-derived Unix flavors with its really nice startup checklist. Yet, as startup times are getting longer and longer, a new parallel sequencer was needed and HP announced one a few days ago through a whitepaper. I expect startup and shutdown times to decrease at least twofold in the long term with this.

Details are in this whitepaper here:
http://bizsupport2.austin.hp.com/bc/docs/support/SupportManual/c02036939/c02036939.pdf

The RCEnhancement bundle is available in the software depot and it sure looks promising. At first glance, I'm not sure I like the way it's being implemented with the "rcutil" command when compared to some Linux offerings I've seen which use config files. On the upside, administrators used to the current SYSV sequence will feel comfortable right away using this one.

I know nothing of what's going on in the labs but my take is that the parallel sequencer is a backport of what's being developed for 11iv4. If that is the case, chances are strong that HP will rewrite many of their startup scripts under 11iv4 to use the new sequencer, thus promoting "now boots 200% faster!" as a marketing incentive to encourage users to migrate to v4 when the time comes.

O.

Thursday, February 25, 2010

Fun dealing with 'at'

I'm stuck with a design decision taken before my time by the development team which consists of using the native OS service as the task scheduler for our custom SCADA application. The decision was purely logical; they were migrating away from a previous timesharing OS that had, from what I heard, enterprise-class batch and scheduling services and it was taken for granted that "Unix" (HP-UX to be precise) would be able to handle task scheduling well.

The result is that to save a programmer a few days, the decision was taken to delegate task scheduling to the Operating System and be done with it: when the software needs to run something at a later date, it spawns the scheduler and leaves the responsibility to the OS to run the job.

Where that design decision hurts is that nobody realized that on Unix, as far as scheduling goes, you're pretty much limited to the stock cron or at (both being the same software, by the way) and these can be a real pain in the butt to manage on modern systems.

I have nothing against at in itself. There's nothing wrong with it. As a bare bones task scheduler, it does the job and has been doing it for what, maybe 40 years now. Many system administrators have learned to depend on at to schedule nightly jobs. But it shows signs of its age, and has nothing that should appeal to a developer in need of a task scheduler: It doesn't do dependencies; running at -l doesn't show much; its logging features are, to be honest, close to nonexistent; jobs are saved with a file name representing an epoch offset which, while clever, isn't really a nice way of presenting data.

As a sysadmin, I ran into a lot of trouble over the years when trying to support a bunch of application-level at jobs. Here are some examples:
  • At saves all its tasks under /var/spool/cron/atjobs. That's nice, but what do you do with clustered applications that are packaged with ServiceGuard? There is no easy way to migrate the jobs across nodes when a failover occurs. I had to write a special daemon that monitors the atjobs directory just to handle that.
  • Support personnel were used to holding, releasing, and rescheduling jobs on the fly on their previous OS. At doesn't support that. When you want to reschedule a job with at, you need to extract what that job runs, delete it, then reschedule it yourself. That's not nice. I had to write a complete wrapper around at just to do that.
  • You don't know what a task consists of, except which user is running it, and what epoch-offset name it has. That's not very useful when you have an application that scheduled 50 different jobs over a week. I had to change my wrapper to be able to show a few lines of the contents of each job.
  • When cold-reinstalling a server, you have to be sure you saved the jobs somewhere as the users will expect you to recover them. Sure, nobody forgets the crontab, but that darn atjobs directory needs to be saved, too.
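To give an idea of what the "show me what each job does" part of such a wrapper can look like, here is a minimal sketch. It is not the wrapper described above. It rests on two assumptions that should be checked on your own system: that the part of each job file name before the first dot is the run time in seconds since the epoch, and that the commands sit at the end of the file, after the environment that at saved. The function name and the dictionary layout are mine.

```python
import os
from datetime import datetime, timezone

def describe_jobs(spool_dir, preview_lines=3):
    """Return one summary dict per job file in spool_dir, soonest first."""
    jobs = []
    for name in os.listdir(spool_dir):
        stamp = name.split(".", 1)[0]
        if not stamp.isdigit():
            continue  # not a job file under the naming assumption above
        path = os.path.join(spool_dir, name)
        with open(path, errors="replace") as handle:
            lines = [line.rstrip("\n") for line in handle]
        jobs.append({
            "job": name,
            "run_at": datetime.fromtimestamp(int(stamp), tz=timezone.utc),
            # Assumed layout: saved environment first, commands last,
            # so only the tail of the file is worth showing.
            "preview": lines[-preview_lines:],
        })
    return sorted(jobs, key=lambda job: job["run_at"])
```

Pointed at the atjobs directory (as root, since the spool is not world-readable), this gives a readable schedule instead of a list of epoch-named files. The same listing, dumped to a file on shared storage, is also a cheap starting point for the failover and cold-reinstall problems in the list above.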
I'm so fed up with this that I'm thinking of writing my own distributed task scheduler that would address most of the issues above, while still keeping a standard at front-end that would not mess up any application depending on its format. What do you think?

N.B. Yes, I took a look at vixie-cron a few years ago but didn't think it would be worth trying to make it work on HP-UX, as I didn't gain much using its at front-end over the one shipped with HP-UX. If anyone thinks otherwise, drop me a note.

Monday, February 15, 2010

Steps to take in SIM/RSP when upgrading HP-UX Servers

When cold-reinstalling an HP-UX server from 11.23 to 11.31, steps need to be taken to be sure that it is correctly linked to SIM and Remote Support.

Here are the steps I take. They don't require deleting the server in SIM, so I keep all its past events. These are the quickest I've found over the last year:

1. Go in SIM. Find the server and open its system properties. Uncheck "Prevent Discovery [...] from changing these system properties"

2. Run an "Identify Systems" on the server. Once this is done, it should now show 11.31 as the OS version.

3. SIM won't subscribe to WBEM events when doing an identify, only a discovery. So you need to manually subscribe to WBEM events on the CMS (mxwbemsub -a -n hostname).

4. WEBES will not resubscribe its WBEM events either. To force it, you need to log into WEBES (http://cmsaddress:7906), click the "Configure Managed Entities" icon, find your server, check it, and delete it (that's right, delete it). Then, restart WEBES by doing "net stop desta_service" and "net start desta_service" on the CMS. Within a few minutes it will resubscribe automagically to the HP-UX server.

5. You can confirm you have SIM and WEBES subscriptions on your server by running "evweb subscribe -b external -L"

Good luck

O.

Wednesday, February 10, 2010

Thumbs up to ServiceGuard Manager

Being a CLI kind of guy, I've never been really attracted to ServiceGuard Manager, especially the first web-based versions. However, since it started having a map view again, I find myself increasingly proposing it to support personnel who find it more intuitive than using the CLI. Training time is decreased at least threefold by using SG Manager.

Today, I decided to try to build a small package from scratch using the GUI instead of making the config files manually and was delighted by its ease of use. I won't publish too many screenshots, as those I took contain confidential data and it would take me a while to obfuscate them. But here are two teasers:



The general look is polished, and very intuitive. The interface is responsive. Online help is readily available, with a question mark icon and sometimes with pop-on bubbles. This makes creating packages an easy task which is done in minutes without needing to go through the ServiceGuard manual.

Behind the scenes, SG Manager takes care of migrating the configuration files on all nodes itself. You don't need to copy them manually. Furthermore, they're very easy to read. Here is an example of a config file generated by SG Manager:

# module name and version
operation_sequence $SGCONF/scripts/sg/package_ip.sh
operation_sequence $SGCONF/scripts/sg/service.sh
package_description Quorum Server
module_name sg/basic
module_version 1
module_name sg/package_ip
module_version 1
module_name sg/priority
module_version 1
module_name sg/monitor_subnet
module_version 1
module_name sg/failover
module_version 1
module_name sg/service
module_version 1
package_type FAILOVER
NODE_NAME mtlrelux00
NODE_NAME mtlprdux00
auto_run yes
node_fail_fast_enabled no
run_script_timeout no_timeout
halt_script_timeout no_timeout
successor_halt_timeout no_timeout
script_log_file $SGRUN/log/$SG_PACKAGE.log
log_level 0
PRIORITY NO_PRIORITY

failover_policy CONFIGURED_NODE
failback_policy MANUAL

# Package monitored subnets...
monitored_subnet 1.2.3.0
local_lan_failover_allowed yes

# Package subnets and relocatable IP addresses ...
ip_subnet 1.2.3.0
ip_address 1.2.3.10

# Package services...
service_name qs
service_cmd /usr/lbin/qs >>/var/adm/qs/qs.log 2>&1
service_restart 3
service_fail_fast_enabled no
service_halt_timeout 0

Instead of a sea of comments, there are only a few well-placed ones, which make re-editing and fine-tuning configuration files an easy task.
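
Because the generated file is a plain list of keyword/value lines, it is also easy to query from a script. The following is only a sketch: pkg_param is a made-up helper name, and the matching ignores case simply because the file above mixes keywords such as NODE_NAME and auto_run.

```shell
# Sketch: print every value of one parameter in a package configuration file.
# Usage: pkg_param CONFIG_FILE KEYWORD
# pkg_param is a hypothetical helper, not a ServiceGuard command.
pkg_param() {
    awk -v key="$2" '
        /^[ \t]*#/ { next }                      # skip comment lines
        tolower($1) == tolower(key) {
            sub(/^[ \t]*[^ \t]+[ \t]+/, "")      # drop the keyword, keep the value
            print
        }' "$1"
}
```

For instance, if the file above were saved as qs.conf (a hypothetical name), pkg_param qs.conf node_name would list the two nodes, one per line.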

Nice piece of work! I think I've been converted to ServiceGuard Manager.

O.

Tuesday, February 9, 2010

Remote Support Advanced 5.40 has been released

Version 5.40 was released last week. I was waiting for 5.40 to show up in RSSWM and thought it would update itself automatically. It hasn't done so yet, and it is not clear if RSSWM will eventually take care of updating to 5.40. The Release Notes indicate that for current customers to upgrade, a package must be downloaded from the HP Software Depot, so my take is that it won't be pushed by RSSWM.

This time I do not intend to forcibly update. Version 5.30 has been running fine for me for a while now, and I have found it to be mature and stable. It might be a better idea for current mission-critical customers to wait until they update to HP SIM 6.0 to do RSP at the same time (unless RSSWM pushes it without warning). That is probably what I will end up doing. However, I don't know any experienced SIM admins who risk upgrading to a new SIM release before a service pack is released a few months later. So I'm actually NOT planning to update to SIM 6.0 / RSP 5.40 before next summer.

Here is a list of the main new features. The most significant one, from what I've seen users asking for in the ITRC forums, is the official support for CMS's running in virtual machines.
  • Added virtualization support for the Central Management Server
  • Support for HP Systems Insight Manager 6.0
  • Improved scalability of the Central Management Server
  • New Basic Configuration collections for MSA2000 storage and OpenVMS on Integrity servers
  • Introduction of Unified Communications monitoring
  • Windows 2008 operating system support for the HP Remote Support Network Component
  • Web-Based Enterprise Services (WEBES) v5.6 and WEBES v5.6 Update 2 are the most current supported analysis engines
WEBES 5.6U2 is required to monitor most recent HP hardware. Current users who do not wish to update to RSP 5.40 right away can install WEBES 5.6U2 from RSSWM and delay updating to 5.40 until later.

O.

Monday, February 8, 2010

Using olrad to remotely flag PCI slots

Many rack-mountable Integrity servers from the rx3600 and up support OLAR, which is an acronym for "online addition and replacement" that applies in many cases to PCI cards. Cell-based servers also support OLAR of complete cells. The System Management Homepage offers some OLAR-related commands, but over time I've learned to use the CLI-based olrad command, which I trust more than the GUI.

The olrad command can be used not only to replace cards, but also to flash an LED under specific PCI slots. This is very useful when you send an operator on site to plug wires; using olrad, you can flag the exact slot where you want a cable to be plugged, and save time.

Here is a quick procedure to see how to do this:

1. Run ioscan to show the hardware path of your device

Example:

# ioscan -kfnC lan
Class I  H/W Path        Driver  S/W State H/W Type   Description
=========================================================================
lan   0  0/0/1/1/0/6/0   igelan  CLAIMED   INTERFACE  HP A9784-60002 PCI/PCI-X 1000Base-T FC/GigE Combo Adapter
lan   1  1/0/1/1/0/6/0   iether  CLAIMED   INTERFACE  HP AB290-60001 PCI/PCI-X 1000Base-T 2-port U320 SCSI/2-port 1000B-T Combo Adapter
lan   2  1/0/1/1/0/6/1   iether  CLAIMED   INTERFACE  HP AB290-60001 PCI/PCI-X 1000Base-T 2-port U320 SCSI/2-port 1000B-T Combo Adapter
lan   3  1/0/12/1/0/6/0  igelan  CLAIMED   INTERFACE  HP A9784-60002 PCI/PCI-X 1000Base-T FC/GigE Combo Adapter


2. Run "olrad -q" to obtain a table matching hardware paths with slot numbers.

Example:

# olrad -q
                                                 Driver(s)
                                                 Capable
Slot     Path      Bus  Max  Spd  Pwr  Occu Susp OLAR OLD  Max   Mode
                   Num  Spd                                Mode
0-0-0-1  0/0/8/1   140  133  133  Off  No   N/A  N/A  N/A  PCI-X PCI-X
0-0-0-2  0/0/10/1  169  133  133  Off  No   N/A  N/A  N/A  PCI-X PCI-X
0-0-0-3  0/0/12/1  198  266  266  Off  No   N/A  N/A  N/A  PCI-X PCI-X
0-0-0-4  0/0/14/1  227  266  266  Off  No   N/A  N/A  N/A  PCI-X PCI-X
0-0-0-5  0/0/6/1   112  266  266  Off  No   N/A  N/A  N/A  PCI-X PCI-X
0-0-0-6  0/0/4/1   84   266  266  Off  No   N/A  N/A  N/A  PCI-X PCI-X
0-0-0-7  0/0/2/1   56   133  133  Off  No   N/A  N/A  N/A  PCI-X PCI-X
0-0-0-8  0/0/1/1   28   133  133  On   Yes  No   Yes  Yes  PCI-X PCI-X
0-0-1-1  1/0/8/1   396  133  133  Off  No   N/A  N/A  N/A  PCI-X PCI-X
0-0-1-2  1/0/10/1  425  133  133  Off  No   N/A  N/A  N/A  PCI-X PCI-X
0-0-1-3  1/0/12/1  454  266  133  On   Yes  No   Yes  Yes  PCI-X PCI-X
0-0-1-4  1/0/14/1  483  266  266  Off  No   N/A  N/A  N/A  PCI-X PCI-X
0-0-1-5  1/0/6/1   368  266  266  Off  No   N/A  N/A  N/A  PCI-X PCI-X
0-0-1-6  1/0/4/1   340  266  266  Off  No   N/A  N/A  N/A  PCI-X PCI-X
0-0-1-7  1/0/2/1   312  133  133  Off  No   N/A  N/A  N/A  PCI-X PCI-X
0-0-1-8  1/0/1/1   284  133  133  On   Yes  No   Yes  Yes  PCI-X PCI-X


3. Run "olrad -I ATTN slot_number" to flash the LED under the desired slot.

Example:

# olrad -I ATTN 0-0-0-8



4. When you're done, turn off the LED on your slot using "olrad -I OFF slot_number"

Example:

# olrad -I OFF 0-0-0-8
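
Steps 1 and 2 can be combined with a small filter, sketched below. slot_for_path is a made-up helper name, and the sketch assumes what the sample outputs above suggest: that the Path column of a slot is the leading part of the hardware path of the device plugged into it.

```shell
# Sketch: read "olrad -q" output on stdin and print the slot whose Path
# is a leading part of the given hardware path.
# Usage: olrad -q | slot_for_path HW_PATH
# slot_for_path is a hypothetical helper, not part of HP-UX.
slot_for_path() {
    awk -v hw="$1" '
        # Only consider data rows, whose first field looks like 0-0-0-8.
        $1 ~ /^[0-9]+(-[0-9]+)+$/ {
            # Match the whole path, or the slot path followed by "/",
            # so that 0/0/1/1 does not match 0/0/10/1 or 0/0/12/1.
            if (hw == $2 || index(hw, $2 "/") == 1) print $1
        }'
}
```

With the outputs shown above, olrad -q | slot_for_path 0/0/1/1/0/6/0 prints 0-0-0-8, which is the slot flagged in steps 3 and 4.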

Wednesday, January 27, 2010

Cold-Updating small ServiceGuard clusters -- FAST!

Here's my guerilla procedure to cold-update small ServiceGuard clusters without doing an official rolling upgrade.

I'm currently migrating many small two-node ServiceGuard clusters which are scattered in different sites from SG 11.18 / HP-UX 11.23 to SG 11.19 / HP-UX 11.31. I decided to upgrade not only the OS, but the clustering software too for the simple reason that I didn't want to stick with 11.18 and have to update SG later down the road... With 11.19, I should be good for a few years.

The "rolling upgrade" procedure documented in the Admin Guide doesn't work in such a scenario as last time I checked, it only supports running an update-ux on the nodes one after another. I don't do update-ux, I prefer cold-reinstalling my systems with my heavily customized Golden Image. And since I wanted to take advantage of the downtime to move to 11.19, I fell in the "unsupported" arena.

Here's how I'm pulling it off with a procedure that takes a mere 60 seconds more downtime than a straight failover:

1. Update the failover node
1a) reconfigure the packages to be runnable only on the main node
1b) reconfigure the cluster to remove the failover node (you'll end up with a one node cluster)
1c) dump the golden image on the failover node
1d) install and configure the requirements for SG 11.19 on the failover node (it takes maybe 10 minutes if you've documented the process correctly, I know it for a fact)
1e) set up a configuration file for a brand new one-node cluster on the failover node. If using lock disks, you can either use new lock disks and start it right away, or prepare config files which you're sure will work and start the cluster at step 2b.
1f) bring in the package configuration files and volume groups on the failover node, and configure these packages to be runnable only on the failover node. Run a cmcheckconf on them but do NOT run cmapplyconf yet because they're still used on the other cluster!

2. Move the packages to the failover node
2a) stop the packages on the cluster running on the main node
2b) remove the cluster bit on the VGs (vgchange -c n) to prevent SG from identifying the disks as part of a cluster
2c) cmapplyconf the packages on the failover node (you might need to run vgchange -c again)
2d) start the packages
Total downtime: maybe a few minutes more than a standard failover but not much. With a well-prepared scenario with pastable commands, it takes me less than 60 seconds to do 2b and 2c.

3. Upgrade the main node
3a) dump the golden image on the main node
3b) install SG on the main node
3c) have that node join the cluster running on the failover node
3d) configure the packages to be runnable on both nodes

4. Bring back the packages to the main node
Simply move back the packages as you would in a normal cluster. Downtime will be the same as during a standard failover.
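
As an illustration of the "pastable commands" idea from step 2, here is a rough sketch that prints the commands for 2a to 2d without running anything. The helper name print_move_commands and the package, file and volume group names in the usage line are hypothetical, and the exact options (vgchange -c n, cmapplyconf -P) are my assumption of the usual invocations, so check them against your own environment before pasting.

```shell
# Sketch: print (do not run) the commands for steps 2a-2d, ready to paste.
# Usage: print_move_commands PACKAGE PACKAGE_CONF VG...
# print_move_commands is a hypothetical helper; nothing here touches the cluster.
print_move_commands() {
    pkg=$1
    conf=$2
    shift 2
    echo "# 2a) on the main node"
    echo "cmhaltpkg $pkg"
    echo "# 2b) clear the cluster bit on each volume group"
    for vg in "$@"; do
        echo "vgchange -c n $vg"
    done
    echo "# 2c) on the failover node"
    echo "cmapplyconf -P $conf"
    echo "# 2d) on the failover node"
    echo "cmrunpkg $pkg"
}
```

For example, print_move_commands pkg1 /etc/cmcluster/pkg1/pkg1.conf /dev/vg01 /dev/vg02 prints one block per step, which can be reviewed ahead of the maintenance window and pasted when the time comes.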

O.