Monday, March 5, 2018

Installing Solaris 11.4 beta on a Proliant G4

I've been trying to install Solaris 11.4 beta on an extremely old x86 server, in part because I do not have access to a scratch VMware environment and also to see if I could pull it off.

I had access to a bunch of unused HP Proliant DL360 G4s. They are listed as reported to work on the Hardware Compatibility List, so I said to myself "Why not". So I scavenged memory and CPUs and tried to install the OS.

I was able to boot the install media using a USB key, but the graphics card didn't seem to be compatible, as I got the message "Compatible fb not found". Specifying -B console=force-text didn't work; it switched to graphical mode anyway.

It took multiple tries and reboots to find a combination that worked. I found out that it is possible to install on a serial console. There are GRUB menu entries that let you boot the OS using ttya or ttyb, but they are hidden. I'm not sure how I got into this menu, but I think it was by pressing ESC at the GRUB prompt that gives you 5 seconds before booting the OS.

I attached a laptop with a serial cable to the server and ran screen in an xterm. I was able to access the text installer successfully and install the OS.

My system now boots. I'm waiting for my network patch request to come through before continuing.

I'm especially interested in trying the new Solaris Analytics interface. I'll keep you posted.

Thursday, May 11, 2017

Revisiting the restricted shell

I've been administering Unix boxes since the mid-90s and I've always been told that using restricted shells (rsh, rksh, rbash) was a bad idea because they are easily hackable. Indeed, there are countless known methods to get out of a restricted shell: from finding an application that allows a shell escape, to trying to compile your own, to doing clever hacks with the history file.

I've recently been in a corner case where I was dealing with an embedded product which requires a specific set of commands and also uses some bracket commands that are difficult to wrap with our usual SSH command authenticator. So I decided to revisit using a restricted shell to jail this user and I think I managed to make the jail shatterproof enough.

Here is how I did it:

Create Bob's home directory, but assign it to root:
# mkdir /home/bob
# chown root:root /home/bob
# chmod 755 /home/bob

Force a .bashrc and .profile that change Bob's PATH to a limited set of commands:
# echo "export PATH=/home/bob/allowed_commands" > /home/bob/.bashrc
# ln -s .bashrc /home/bob/.profile

The reason for having both a .profile and a .bashrc is to ensure that this profile will be loaded both for interactive and non-interactive sessions.

If the user needs to write stuff somewhere, create a directory for Bob, e.g.
# mkdir /home/bob/writable
# chown bob /home/bob/writable
# chmod 755 /home/bob/writable

Create the allowed_commands directory and put symlinks in it pointing to allowed binaries:
# mkdir /home/bob/allowed_commands
# ln -s /bin/mycmd /home/bob/allowed_commands/mycmd

Now you must be sure of the following:

1. Bob must NOT have any writable access to /home/bob/.profile or /home/bob/.bashrc, else he can change the PATH value
2. Bob must NOT have any writable access to /home/bob, to prevent any modification of .profile and .bashrc
3. Investigate ANY command that ends up in the allowed_commands jail to be sure that there is NO known way of executing another command from it, showing files or escaping the shell. If there is one, either drop the command or write a wrapper around it (see below).
4. See the jail escape methods linked above, log in as Bob and see if you can use them to escape the jail.
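Points 1 and 2 of the checklist above lend themselves to an automated check. The following is a minimal sketch in Python, not part of the original setup: the function name audit_jail is hypothetical, and it deliberately errs on the cautious side by flagging anything that is group- or world-writable, whether or not Bob belongs to the group.

```python
import os
import stat


def audit_jail(home, user_uid, names=(".profile", ".bashrc")):
    """Return a list of paths the jailed user could modify.

    A path is flagged when it is owned by user_uid or is writable by
    group or others. Symlinks are followed, as the shell would do.
    """
    problems = []
    for path in [home] + [os.path.join(home, n) for n in names]:
        try:
            st = os.stat(path)
        except FileNotFoundError:
            problems.append(path + " (missing)")
            continue
        if st.st_uid == user_uid or st.st_mode & (stat.S_IWGRP | stat.S_IWOTH):
            problems.append(path)
    return problems
```

Calling audit_jail("/home/bob", uid_of_bob) should return an empty list on a correctly built jail. It does not replace points 3 and 4, which still require testing by hand.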

Example of a wrapper script with scp

Let's say I want to allow Bob to scp files into his account using scp's undocumented -t (i.e. -to) option. I would normally do this:
# ln -s /bin/scp /home/bob/allowed_commands/scp

This is wrong, as scp can be coerced with -S to execute arbitrary commands.

A solution is to put the following in the allowed_commands jail instead:
lrwxrwxrwx. 1 root root   14 May  5 10:02 scp ->
-rwxr-xr-x. 1 root root  382 May  5 13:54

The wrapper script contains this:
if [[ "$1" = "-t" && "$2" != "-"* ]]; then
        /bin/scp -t "$2"
        returncode=$?
else
        echo "scp_wrapper: Refused SCP command: '$*'"
        returncode=1
fi
exit ${returncode}

Using this wrapper, scp will only allow -t and no other option.

Good luck.

Thursday, April 6, 2017

Applications crash on SLES12 due to lock elision

This issue has been discussed in other places, but mostly related to specific applications and I think it needs its own post here for those who would stumble on this following a Google search.

glibc 2.18, released in 2013, came with a new feature named TSX Lock Elision.

Briefly, this feature changes the way glibc handles mutexes on some specific processors that support hardware lock elision. Intel Xeon CPUs, in particular, have supported TSX since around 2013 or so. Lock elision offers significant performance gains for some software such as databases.

You can see if your Linux server's CPU supports lock elision by checking /proc/cpuinfo. If it mentions "hle" (hardware lock elision), it does.
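If you need to run this check across many servers, it can be scripted. Here is a minimal sketch in Python; the function names are mine, and it simply looks for the "hle" word on the "flags" lines of /proc/cpuinfo, as described above.

```python
def cpu_flags(cpuinfo_text):
    """Collect the CPU feature flags listed in /proc/cpuinfo-style text."""
    flags = set()
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            flags.update(value.split())
    return flags


def supports_hle(cpuinfo_text):
    """True when the "hle" flag is present on a flags line."""
    return "hle" in cpu_flags(cpuinfo_text)
```

On a Linux box, you would feed it the contents of the file, e.g. supports_hle(open("/proc/cpuinfo").read()).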

RHEL 7 does not support this as of now. It comes with glibc 2.17, so lock elision is not enabled on these systems. As for SLES12, it comes with glibc 2.19, which means that SLES12 systems will use lock elision if the CPU supports it.

However, if an application unlocks a mutex twice, this can cause problems if lock elision is enabled. This is explained in detail in an LWN article. Let me quote an important paragraph in this article:

pthread_mutex_unlock() detects whether the current lock is executed transactionally by checking if the lock is free. If it is free it commits the transaction, otherwise the lock is unlocked normally. This implies that if a broken program unlocks a free lock, it may attempt to commit outside a transaction, an error which causes a fault in RTM. In POSIX, unlocking a free lock is undefined (so any behavior, including starting World War 3 is acceptable). It is possible to detect this situation by adding an additional check in the unlock path. The current glibc implementation does not do this, but if this programming mistake is common, the implementation may add this check in the future.

The "programming mistake" here is double-unlocking mutexes. I've made a sample C program that does exactly this, and although it works fine with glibc 2.17, it will crash on glibc 2.19 with a segmentation fault in __lll_unlock_elision(), if, and only if, the server's cpuinfo reports "hle".

I've stumbled upon a few applications, which I will not name here, that crash on SLES12. Upon analyzing their cores, I found that they have this same exact problem with __lll_unlock_elision(). So, one can assume that they might double-unlock some mutexes.

The bottom line is that if you have an app that does this, your best bet is to contact the vendor, and ask them to remove double mutex unlocks in their code, if they have any.

If that is not possible, there are two workarounds:

1. The first is to patch /etc/ to override libpthread 2.19 with a version that is compiled with lock elision disabled. This is documented in Novell's KB here.

2. The second (and preferred) solution is to adjust LD_LIBRARY_PATH to override it on a per-application basis. You could therefore change its startup script to add this:


Hope this helps.

Friday, September 2, 2016

Sending text logfiles from Windows to a syslog server, reliably

I'm in the following situation:
  1. I have a Windows application, let's name it MyApp
  2. MyApp creates important log files on my server without using the Event Log. These log files are simply textfiles (i.e. logfile.txt)
  3. For compliance purposes, I have to send these log files to a remote syslog server.
  4. The compliance auditor wants me to ensure that these log files are always sent no matter what.
It doesn't matter what the application is (as long as it creates a text file somewhere) and whether the receiving end is an ArcSight appliance, a Splunk box, or syslog-ng: this post will describe a generic way to achieve this, with the added bonus of reliability.

The two products that I used to implement this are Neologger and NSSM:
  • Neologger reads a file (as a matter of fact, it tails it) and sends it to a syslog server.
  • NSSM is a tool that lets you wrap any application (in our case, Neologger) in a Windows service.
What is "tailing" a file?
Unix administrators are familiar with the tail command: it follows a text file, grabbing new entries at the end as they come in. Neologger basically replicates what "tail -f file.log | logger" would do on Unix.
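To make the idea concrete, here is a rough sketch in Python of what tailing a file into syslog involves. This is not Neologger's code; the function names are made up for illustration, and the message format is a bare "<priority>text" UDP datagram, which is an assumption about what your syslog server accepts.

```python
import os
import socket


def read_new_lines(path, offset):
    """Return (complete new lines, new offset) for a file being tailed.

    If the file is now shorter than offset, it was truncated or rotated,
    so reading restarts from the beginning.
    """
    size = os.path.getsize(path)
    if size < offset:
        offset = 0
    with open(path, "rb") as f:
        f.seek(offset)
        data = f.read()
    # Keep a trailing partial line for the next call.
    end = data.rfind(b"\n") + 1
    lines = data[:end].decode("utf-8", "replace").splitlines()
    return lines, offset + end


def send_syslog(sock, server, port, line, pri=14):
    """Send one line as a UDP syslog datagram (pri 14 = user.info)."""
    sock.sendto(("<%d>%s" % (pri, line)).encode("utf-8"), (server, port))
```

A tailing loop would call read_new_lines() every second or so, pass each returned line to send_syslog() with a socket.socket(socket.AF_INET, socket.SOCK_DGRAM) socket, and carry the returned offset over to the next call.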

Using Neologger

Neologger is, in essence, a simple and reliable tool. It will tail a text file endlessly, and it automatically detects if that file is deleted, shrunk or rotated, which ensures a reliable operation. To use it, simply try:

# neolog.exe -r logfile.txt -tail -t syslog_server -d

This will tail the file logfile.txt and send it to syslog_server. Many other command-line options are available. Note the -d option: this is a debug option that lets you see what Neologger does, and you normally would not want it there.

The first thing you need to do is therefore to craft a command-line as above, but specific for your application. Here is a more complete example:

# "C:\Program Files\neolog\neolog.exe" -r "C:\ProgramData\My App\logfile.txt" -tail -t -p 1234 -d

This will tail logfile.txt and send it to the syslog server on port 1234. Once it works for you, remove the -d option.

Wrapping Neologger with NSSM

Now, the next question is, how do I ensure that neolog.exe runs reliably? The answer is to configure Neologger as a service under Windows. It's easier to manage as a service and the operating system will ensure that it restarts appropriately if it ever crashes. That's where NSSM comes into play. NSSM (Non-Sucking Service Manager) is a tool that lets you wrap almost any application as a service.

To create a service to wrap Neologger, run NSSM like this:

# nssm install MyApp-Syslog

This will create a new service named MyApp-Syslog. Then, fill the Path, Startup directory, and Arguments as appropriate (don't forget to remove -d as it is not required here). Here is an example:

You don't need to change anything in the other tabs, but you can take a look in case you need to fine-tune something.

Now you can try starting MyApp-Syslog via the service panel, and see if it works.
What happens if the log file isn't there in the first place? While neologger will "wait" if the file disappears once it starts tailing it, it will gracefully exit if it's not initially there. NSSM will then try to restart neolog.exe using its throttling settings. This ensures that the service will loop neolog.exe, slowly, until the file appears again. During that time, the service is labeled as "Paused" in the service panel.

Going a step further with dependencies

The last step, which can be important for compliance reasons, is not only to help Neologger run reliably (which is done by configuring it as a service), but ensure that it always runs when your application runs, too. This is done with dependencies.

If your application doesn't run as a service, you're out of luck. But let's say MyApp runs under a Windows Service named MyApp-Service. It then becomes trivial to make MyApp-Service depend on MyApp-Syslog. 

To change dependencies, you have to edit MyApp-Service directly. First, query MyApp-Service to see if it has other dependencies:

# sc qc MyApp-Service

[SC] QueryServiceConfig SUCCESS

        TYPE               : 10  WIN32_OWN_PROCESS
        START_TYPE         : 2   AUTO_START
        ERROR_CONTROL      : 1   NORMAL
        BINARY_PATH_NAME   : "C:\Program Files\MyApp\MyApp.exe"
        LOAD_ORDER_GROUP   :
        TAG                : 0
        DISPLAY_NAME       : MyApp Service
        DEPENDENCIES       : tcpip

You can see here that MyApp-Service depends on tcpip. It is important to keep this in mind. Next, change the dependencies on MyApp-Service by configuring it to depend on both tcpip and MyApp-Syslog. Note here that you have to explicitly state that tcpip is still a dependency, and separate it with a slash to add MyApp-Syslog.

# sc config MyApp-Service depend= tcpip/MyApp-Syslog
[SC] ChangeServiceConfig SUCCESS

# sc qc MyApp-Service

[SC] QueryServiceConfig SUCCESS

        TYPE               : 10  WIN32_OWN_PROCESS
        START_TYPE         : 2   AUTO_START
        ERROR_CONTROL      : 1   NORMAL
        BINARY_PATH_NAME   : "C:\Program Files\MyApp\MyApp.exe"
        LOAD_ORDER_GROUP   :
        TAG                : 0
        DISPLAY_NAME       : MyApp Service
        DEPENDENCIES       : tcpip
                           : MyApp-Syslog

Once this is done, start MyApp-Service. You'll notice that it starts MyApp-Syslog automatically. The same logic applies if you stop MyApp-Syslog before MyApp-Service: both will stop at the same time.
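Because sc config replaces the whole dependency list, it is easy to drop an existing dependency by accident. If you have several services to update, a small helper can build the command for you. This is a sketch in Python with made-up function names; it assumes that additional dependencies appear in the sc qc output on continuation lines starting with a colon, right below the DEPENDENCIES line.

```python
def parse_dependencies(sc_qc_output):
    """Extract the DEPENDENCIES entries from `sc qc` output.

    Assumes extra dependencies appear on continuation lines that start
    with a colon, directly below the DEPENDENCIES line.
    """
    deps, in_deps = [], False
    for line in sc_qc_output.splitlines():
        stripped = line.strip()
        if stripped.startswith("DEPENDENCIES"):
            in_deps = True
            stripped = stripped[len("DEPENDENCIES"):].strip()
        elif not (in_deps and stripped.startswith(":")):
            in_deps = False
            continue
        name = stripped.lstrip(":").strip()
        if name:
            deps.append(name)
    return deps


def depend_command(service, existing, new_dep):
    """Build the sc config command that keeps existing dependencies."""
    deps = list(existing)
    if new_dep not in deps:
        deps.append(new_dep)
    return "sc config %s depend= %s" % (service, "/".join(deps))
```

Fed with the sc qc output of MyApp-Service, depend_command("MyApp-Service", deps, "MyApp-Syslog") produces the same command line as the one typed by hand above.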

Putting it all together

To conclude, let's restate what we just did. First, we used Neologger to tail a text file on Windows, generated by an application named MyApp and sent it, live, to a syslog server. Then, we used NSSM to configure Neologger as a Windows service to help us manage its startup and shutdown. Finally, we created a dependency between the service that runs MyApp and the new service we've just created, to reassure our compliance auditor that Neologger always runs when MyApp runs, too.

Good luck.

Tuesday, August 23, 2016

Running "MRPE" check_mk scripts asynchronously on Windows

I have a corner case on Windows where I need to execute classic Nagios NRPE scripts within check_mk, but in asynchronous mode. These scripts can, in certain circumstances such as a network timeout, take a significant time to execute and they cannot be run from the check_mk agent.

It's possible to have honest-to-goodness check_mk scripts execute asynchronously, using the async directive in check_mk.ini. I tried it and it works. However, this is not supported by the agent with classic Nagios plugins.

So, I wrote a wrapper named mrpe_async_wrapper that does just that. It's not rocket science; the wrapper is simply a Windows batch file that:

  1. Creates a scheduled task (on its first run) that executes the check script at 5-minute intervals;
  2. When invoked by the scheduled task, runs the check script and saves its output in a status file;
  3. When run directly, reports the contents of the status file instead of executing the script. This returns quickly, so you can run it every minute if you want, but the status it reports can be up to 5 minutes old.

This lets you run slow or unpredictable NRPE scripts from check_mk without fear. I've been running this for a few days and it seems to do the job for me.
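The two halves of that logic can be sketched as follows. This is a Python illustration of the idea only, not the batch file itself; the function names and the status file layout (exit code on the first line, plugin text on the second) are assumptions made for the example.

```python
import subprocess


def run_and_store(command, status_path):
    """Scheduled-task side: run the slow check and save its result."""
    proc = subprocess.run(command, capture_output=True, text=True)
    lines = proc.stdout.strip().splitlines()
    first_line = lines[0] if lines else ""
    with open(status_path, "w") as f:
        f.write("%d\n%s\n" % (proc.returncode, first_line))


def report(status_path):
    """Agent side: return (exit code, text) from the status file at once."""
    try:
        with open(status_path) as f:
            code, text = f.read().split("\n", 1)
        return int(code), text.strip()
    except (OSError, ValueError):
        # No usable status yet: 3 is the Nagios plugin code for UNKNOWN.
        return 3, "UNKNOWN - no status file yet"
```

The point of the split is that report() never waits on the check: the agent only ever reads a small file, while the slow part runs on the scheduler's own time.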

To configure it, simply add a directive to the [mrpe] section of check_mk.ini like this (on the same line):

check = check_gizmo C:\tools\mrpe_async_wrapper.bat check_gizmo C:\tools\check_gizmo.bat

This defines an MRPE check named "check_gizmo", which instructs the wrapper to create a scheduled task named "check_gizmo" that runs c:\tools\check_gizmo.bat asynchronously.

Here is the code for the wrapper:

Have fun.

Monday, June 27, 2016

Getting UFO2 failover status from an OSIsoft PI Interface


I'm currently deploying an OSIsoft PI Interface node at my workplace.

Being a "Systems" Administrator, and not a "PI" Administrator per se, I was looking for a way to get high-availability status directly from that interface node. My objective was to provide IT Operations with an easy-to-use procedure that answers the following question: Which interface node is currently active and which one is currently in standby? It is useful for them to know the answer to this when scheduling maintenance such as Windows patches.

Unfortunately, there is no easy way to find out which of the two interfaces is currently active. I've looked everywhere in OSIsoft's KB and I guess nobody asked. :-)

Some information on UFO 

Many, if not all, PI interfaces are based on UniInt (Universal Interface). UniInt supports two failover levels named UFO (UniInt FailOver):

  • UFO phase 1 (UFO1) which is based on PI points
  • UFO phase 2 (UFO2) which uses a shared file located on a separate file server

Not all interfaces support UniInt failover; check your Interface documentation. Mine only supports UFO2.

You can look at the following KBs for more information:

UFO2 is preferred to UFO1, and KB00446 even mentions that UFO1 is deprecated. That might be due to the fact that I see one major drawback with UFO1: if one node loses access to the PI Server, it cannot know the status of the other node. Using a shared file on a file server (a highly-available one, that is!) is deemed more reliable.

Finding what interface is active, the PI Admin Way

There seems to be one official way, the "PI Admin Way", which involves looking up points stored in the PI Server.

While my interface is UFO2, it seems to create PI points anyway. These points are created directly from ICU, and they all have "UFO2" in their names. It is therefore trivial to check their values from the PI SDK Utility tool. For example:

PRO TIP: It's also possible to find out these values at the command line using apisnap.

While this is sufficient from a PI admin perspective, from a systems administrator perspective, it's not great. For instance, it's not an easy task for IT Operations to fire up that tool and query PI points, it cannot be automated in a script (except if using apisnap) and lastly it will not work at all if the nodes cannot speak to the PI server. It is thus preferable to ask them to run a simple command.

Finding what interface is active, the born-again Sysadmin Way

It was a simple task to somewhat reverse-engineer the binary UFO2 .dat file created by the interface and write a simple program to extract basic data. I've named it readdat.

C:\tools>readdat \\myfileserver\myfile.dat

Active Node (0 = None, 1 = Node 1 is primary, 2 = Node 2 is primary)
Active ID: 1

Device Status (0 = Good, 99 = OFF, any value in between results in a failover)
Node 1: 0
Node 2: 0

Works well enough for me. Readdat.exe can then be wrapped in a batch file or a PowerShell script to make it easier to use.

As a bonus, you can run it like this:
C:\tools>readdat \\myfileserver\myfile.dat -activeid

This will set ERRORLEVEL to the ID number.
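Any scripting language that can read a process exit code can consume this. As an illustration, here is a minimal Python sketch; the function name and the label table are mine, with the labels taken from the readdat output shown above.

```python
import subprocess

# Labels follow the legend printed by readdat.
NODE_LABELS = {0: "None", 1: "Node 1 is primary", 2: "Node 2 is primary"}


def active_node(command):
    """Run the command and interpret its exit code as the active node ID."""
    code = subprocess.run(command).returncode
    return code, NODE_LABELS.get(code, "unexpected exit code")
```

You would call it with the same arguments as on the command line, e.g. active_node([r"C:\tools\readdat.exe", r"\\myfileserver\myfile.dat", "-activeid"]).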

The source code for readdat is here:

Here is also a Win32 executable:

Good luck!

Monday, September 14, 2015

Configuring vsftpd to support proxy FTP

I've had to deal with a legacy application that is hard coded to use proxy ftp sessions. These are initiated by using the "proxy" command in a stock ftp client.

It was giving us trouble with vsftpd refusing to transfer files when using "proxy get" to initiate a passive session between the vsftpd server and another server.

What is a proxy FTP? In a nutshell, a proxy session lets you open a connection to a second FTP server, so that you can transfer files directly between both servers instead of between the primary server and your FTP client.

The ftp(1) man page documents what "proxy" does. It is important to read it and understand what happens when you use this:
     proxy ftp-command
                 Execute an ftp command on a secondary control connection.
                 This command allows simultaneous connection to two remote ftp
                 servers for transferring files between the two servers.  The
                 first proxy command should be an open, to establish the sec-
                 ondary control connection.  Enter the command "proxy ?" to
                 see other ftp commands executable on the secondary connec-
                 tion.  The following commands behave differently when pref-
                 aced by proxy: open will not define new macros during the
                 auto-login process, close will not erase existing macro defi-
                 nitions, get and mget transfer files from the host on the
                 primary control connection to the host on the secondary con-
                 trol connection, and put, mput, and append transfer files
                 from the host on the secondary control connection to the host
                 on the primary control connection.  Third party file trans-
                 fers depend upon support of the ftp protocol PASV command by
                 the server on the secondary control connection.
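The last sentence of the man page is the key. As far as I understand it, for a "proxy get" the client sends PASV to the secondary server, takes the address and port from the 227 reply, and hands them to the primary server in a PORT command. The primary server is thus asked to open a data connection to a host that is not its own client. The sketch below, in Python with made-up function names and an example address, shows that translation step only; it is not what the ftp client literally runs.

```python
import re

# Matches the h1,h2,h3,h4,p1,p2 field of a 227 reply.
_HOSTPORT = re.compile(r"(\d+,\d+,\d+,\d+),(\d+),(\d+)")


def pasv_to_port(pasv_reply):
    """Turn a 227 reply into the PORT command for the other server."""
    match = _HOSTPORT.search(pasv_reply)
    if not pasv_reply.startswith("227") or match is None:
        raise ValueError("not a PASV reply: %r" % pasv_reply)
    return "PORT %s,%s,%s" % match.groups()


def endpoint(pasv_reply):
    """Decode the address and port a 227 reply advertises."""
    host, p1, p2 = _HOSTPORT.search(pasv_reply).groups()
    return host.replace(",", "."), int(p1) * 256 + int(p2)
```

For instance, a reply of "227 Entering Passive Mode (192,168,0,7,195,80)" from the secondary server becomes "PORT 192,168,0,7,195,80" on the primary control connection, pointing at 192.168.0.7 port 50000.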

So how does this impact vsftpd when using it to handle the primary control connection?

The first thing that might happen is that if you issue a proxy get, it might fail with the following message:
500 Illegal PORT command

This is fixed by adding the following parameter to vsftpd.conf:
What this parameter does is authorize vsftpd to open a data connection with the proxy server, instead of limiting it to connections between vsftpd and the FTP client.

Then, you might get:
500 OOPS: vsf_sysutil_bind

This happens because the vsftpd process is trying to bind port 20 on the server's IP address. By stracing the process, I found out that this fails because the vsftpd process that handles communication with clients is unprivileged. This privilege separation is by design. The workaround I found is to add this to vsftpd.conf:

This makes vsftpd bind to another port (I didn't even check which one), but it works. This setting defaults to "NO", but the example configuration file sets it to "YES", which is why it was there in the first place.

Good luck