Parallel ssh

Tuesday, August 18th, 2009 | linux, useful tools

I’m increasingly working on clusters of systems, be they traditional HPC clusters running MPI-based software or less traditional clusters running software such as Hadoop’s HDFS and MapReduce.

In both cases, the underlying operating systems are largely the same – pretty standard Linux systems running one of the main Linux distributions (Debian, Ubuntu, Red Hat Enterprise Linux, CentOS, SuSE Linux Enterprise Server or OpenSUSE).

There are various tools for creating standard system images and pushing those to each of the cluster nodes, and I use those (more in a future post). Often, though, you need to perform the same task on a bunch of the cluster nodes, or maybe all of them. This task is best achieved by simply ssh’ing into each of the nodes and running some command (be it a status command such as uptime or ps, or a command to install a new piece of software).

Normally, one or more users on the cluster will have been configured to use password-less logins with ssh, so a first pass at running ssh commands on multiple systems would be to script the ssh calls from a management node. The following is an example script for checking the uptime on each node of our example cluster (which has nodes cluster02 to cluster20; I’m assuming we’re running on cluster01).

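#!/bin/bash
# Print the uptime of every node from cluster02 to cluster20.
for addr in {2..20}
do
 num=$(printf "%02d" "$addr")   # zero-pad the node number: 2 -> 02
 echo -n "cluster${num}:" && ssh "cluster${num}" uptime
done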

The script works, but the downside is that you have to create a new script each time you have a new command to run, or a slightly different sequence of actions you want to perform (you could improve the above by passing the command to be run as an argument to the script, but even then the approach is limited).
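As a rough sketch of that improvement, the loop can be wrapped in a function that takes the command to run as its arguments. The name run_on_nodes is made up for illustration, and the node range is the same cluster02 to cluster20 assumed above. It still visits the nodes one at a time, so it is no faster than the original script.

```shell
#!/bin/bash
# run_on_nodes: hypothetical helper, not part of any tool.
# Runs the given command on cluster02..cluster20, one node after another.
run_on_nodes() {
    local num
    # seq -w zero-pads to equal width: 02 03 ... 20
    for num in $(seq -w 2 20); do
        printf 'cluster%s: ' "$num"
        # -n keeps ssh from reading stdin, so it cannot swallow
        # input intended for the surrounding script
        ssh -n "cluster${num}" "$@"
    done
}

# Example: run_on_nodes uptime
```

Every node still produces its own line of output, so spotting the odd one out among twenty nodes is left to the reader.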

What you really need at this stage is a parallel ssh, an ssh command which can be instructed to run the same command against multiple nodes. Ideally, the ssh command can merge the output from multiple systems if the output is the same – making it easier for the person running the parallel ssh command to understand which cluster nodes share the same status.
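To make the idea of merging identical output concrete, here is a small sketch using awk. It assumes each node contributes exactly one line of the form "host: output", and the function name group_by_output is invented for this example. It is only an illustration of the idea, not how any of the tools discussed below actually work.

```shell
# group_by_output: hypothetical helper.
# Reads "host: output" lines on stdin and prints each distinct output once,
# preceded by a comma-separated list of the hosts that produced it.
# Only handles one line of output per host.
group_by_output() {
    awk '{
        host = $1; sub(/:$/, "", host)        # drop the trailing colon
        line = $0; sub(/^[^ ]+ /, "", line)   # keep the text after "host: "
        hosts[line] = (line in hosts) ? hosts[line] "," host : host
    }
    END {
        # awk does not guarantee the order in which groups are printed
        for (line in hosts) printf "%s\n  %s\n", hosts[line], line
    }'
}

# Example: printf 'cluster01: a\ncluster02: b\ncluster03: a\n' | group_by_output
```

Nodes that report the same thing collapse into a single entry, so any node that differs stands out immediately.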

A quick search through Debian’s packages and a Google for parallel ssh turns up a few candidates,

This linux.com article reviews a number of these shells.

After looking at a few of these, I’ve settled on using pdsh. Each of these tools uses a slightly different approach: some provide multiple xterms in which to run commands, while others provide a lot of flexibility in how the output is combined. What I like about pdsh is that it provides a pretty straightforward syntax for invoking commands and, most importantly for me, it cleanly merges the output from multiple hosts, allowing me to very quickly see the differences in a command’s output across hosts.

You will need to configure your password-less ssh operation as normal. Once you have done that, on Debian or Ubuntu, edit /etc/pdsh/rcmd_default and change the contents of this file to a single line containing the following,

ssh

(create the file if it doesn’t exist).
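If you would rather script this step, something along these lines should do it. The helper name set_rcmd_default is hypothetical, and the default path is simply the Debian/Ubuntu location given above; writing under /etc needs root.

```shell
# set_rcmd_default: hypothetical helper that writes the single line "ssh"
# to pdsh's default-rcmd file, creating the directory and file if needed.
# With no argument it targets the Debian/Ubuntu path named above.
set_rcmd_default() {
    local file="${1:-/etc/pdsh/rcmd_default}"
    mkdir -p "$(dirname "$file")" && echo ssh > "$file"
}

# Example (as root): set_rcmd_default
```

pdsh can also be told which remote command module to use without touching this file, either per invocation with -R ssh or by setting the PDSH_RCMD_TYPE environment variable to ssh.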

Now you can run a command, such as date (to verify if NTP is working correctly) on multiple hosts with the following,

pdsh -w cluster[01-05] date

This runs date on cluster01, cluster02, cluster03, cluster04 and cluster05 and returns the output. To consolidate the output from multiple nodes into a compact display format, pdsh comes with a second tool called dshbak, used as follows,

pdsh -w cluster[01-05] date | dshbak -c

Personally, I find this output most readable. On Debian and Ubuntu systems, to invoke dshbak by default, edit /usr/bin/pdsh (which is a shell script wrapper) and change the invocation line from

exec -a pdsh /usr/bin/pdsh.bin "$@"

to

exec -a pdsh /usr/bin/pdsh.bin "$@" | /usr/bin/dshbak -c

Now when you invoke pdsh, its output will be piped through dshbak by default.
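One caveat is that a package upgrade may well overwrite an edited /usr/bin/pdsh. A gentler alternative, sketched here under the assumption that you use bash and are happy with a separate command name, is a small shell function in your ~/.bashrc. The name pdshc is made up for this example.

```shell
# pdshc: hypothetical wrapper; runs pdsh with the given arguments and
# consolidates identical output from different hosts via dshbak -c.
pdshc() {
    pdsh "$@" | dshbak -c
}

# Example: pdshc -w 'cluster[01-05]' date
```

This leaves plain pdsh untouched for the occasions when you want the raw, per-host output.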
