Parallel ssh
I’m increasingly working on clusters of systems – be they traditional HPC clusters running some MPI-based software or less traditional clusters running software such as Hadoop‘s HDFS and MapReduce.
In both cases, the underlying operating systems are largely the same – pretty standard Linux systems running one of the main Linux distributions (Debian, Ubuntu, Red Hat Enterprise Linux, CentOS, SuSE Linux Enterprise Server or OpenSUSE).
There are various tools for creating standard system images and pushing those to each of the cluster nodes – and I use those (more in a future post) – but often you need to perform the same task on a bunch of the cluster nodes, or maybe all of them. This task is best achieved by simply ssh’ing into each of the nodes and running some command (be it a status command such as uptime or ps, or a command to install a new piece of software).
Normally, one or more users on the cluster will have been configured for password-less logins with ssh, so a first pass at running ssh commands on multiple systems would be to script the ssh calls from a management cluster node. The following is an example script for checking the uptime on each node of our example cluster (which has nodes cluster02 to cluster20; I’m assuming we’re running on cluster01).
#!/bin/bash
#
for addr in {2..20}
do
  num=`printf "%02d" $addr`
  echo -n "cluster${num}:" && ssh cluster${num} uptime
done
The script works; the downside is that you have to create a new script each time you have a new command to run, or a slightly different sequence of actions you want to perform (you could improve the above by passing the command to be run as an argument to the script, but even then the approach is limited).
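A version along those lines (just a sketch – the node range is copied from the example above, and the script name below is hypothetical) might look like,

#!/bin/bash
# Sketch only: run whatever command is passed as arguments on each node
# (node names and range copied from the example above)
for addr in {2..20}
do
  num=`printf "%02d" $addr`
  echo -n "cluster${num}:" && ssh cluster${num} "$@"
done

and be invoked as, say, ./cluster-run.sh uptime.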
What you really need at this stage is a parallel ssh, an ssh command which can be instructed to run the same command against multiple nodes. Ideally, the ssh command can merge the output from multiple systems if the output is the same – making it easier for the person running the parallel ssh command to understand which cluster nodes share the same status.
A quick search through Debian’s packages and a Google search for parallel ssh turns up a few candidates; this linux.com article reviews a number of these shells.
After looking at a few of these, I’ve settled on using pdsh. Each of the tools listed above takes a slightly different approach – some provide multiple xterms in which to run commands, some provide a lot of flexibility in how the output is combined. What I like about pdsh is that it provides a pretty straightforward syntax for invoking commands and, most importantly for me, it cleanly merges the output from multiple hosts – allowing me to very quickly see the differences in a command’s output from different hosts.
You will need to configure password-less ssh as normal. Once you have done that, on Debian or Ubuntu, edit /etc/pdsh/rcmd_default and change the contents of this file to a single line containing the following,
ssh
(create the file if it doesn’t exist).
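If you prefer, you can create that file from the command line, or skip it entirely and select the ssh module per session via pdsh’s PDSH_RCMD_TYPE environment variable – something like the following (sudo assumed for writing under /etc),

echo ssh | sudo tee /etc/pdsh/rcmd_default
# or, without touching the config file, for the current shell session only:
export PDSH_RCMD_TYPE=ssh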
Now you can run a command, such as date (to verify that NTP is working correctly), on multiple hosts with the following,
pdsh -w cluster[01-05] date
This runs date on cluster01, cluster02, cluster03, cluster04 and cluster05 and returns the output. To consolidate the output from multiple nodes into a compact display format, pdsh comes with a second tool called dshbak, used as follows,
pdsh -w cluster[01-05] date | dshbak -c
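pdsh also has a few options that come in handy on larger clusters – for example -x to exclude hosts and -f to limit how many concurrent ssh connections are used (the node names below are just our example cluster),

# run uptime everywhere except cluster03
pdsh -w cluster[01-05] -x cluster03 uptime | dshbak -c

# limit pdsh to 4 concurrent ssh connections across the larger range
pdsh -f 4 -w cluster[01-20] uptime | dshbak -c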
Personally, I find this output most readable. On Debian and Ubuntu systems, to invoke dshbak by default, edit /usr/bin/pdsh (which is a shell script wrapper) and change the invocation line from
exec -a pdsh /usr/bin/pdsh.bin "$@"
to
exec -a pdsh /usr/bin/pdsh.bin "$@" | /usr/bin/dshbak -c
Now when you invoke pdsh, its output will be piped through dshbak by default.
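If you’d rather not modify a file owned by the package, an alternative (just a suggestion – the function name below is made up) is a small shell function in your ~/.bashrc,

# pipe pdsh output through dshbak -c without editing /usr/bin/pdsh
pdshc() { pdsh "$@" | /usr/bin/dshbak -c; }

after which pdshc -w cluster[01-05] date gives the consolidated output, while plain pdsh keeps its normal behaviour.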
The top 10 supercomputers in the world run Linux
Pingdom (a Swedish uptime monitoring company) took the results of the latest Top500.org list (Top500 is a site that publishes a list, usually twice a year, of the top-performing supercomputers in the world, ranked against the LINPACK benchmark) and published a study of what operating systems the top 20 supercomputers in the world are running. It turns out that 19 of the top 20 are running some form of Linux (with the top 10 all running Linux).
While most of you reading this probably don’t plan on purchasing a supercomputer on the scale of Roadrunner or Jaguar in the near future (if you do, we’d love to assist you in the process), our own experience of deploying smaller HPC clusters for organisations is that Linux is equally suitable for smaller clusters. The reasons include the cost of deployment, the flexibility of Linux as a server OS, the wide range of HPC software that is developed on or targeted at the Linux operating system, and the excellent support provided by the Linux community for new hardware (the recent announcement of Linux being the first operating system to support USB 3.0 devices is a good example).
Stress testing a PC revisited
I’m still using mostly the same tools for stress testing PCs as when I last wrote about this topic, and memtest86+ in particular continues to be very useful. In practice, the instrumentation in most PCs still isn’t good enough, most of the time, to identify which DIMM is failing (mcelog sometimes makes a suggestion about which DIMM has failed, and EDAC can also be helpful, but in my experience there is a lot of hardware out there which doesn’t support these tools well). The easiest approach I’ve found to date is to take out one DIMM at a time and re-run memtest86+ – when the errors go away you’ve found your problematic DIMM; put it back in again and re-run to make sure you’ve identified the problem. If you keep getting errors regardless of which DIMMs are installed, you may be looking at a problem with the memory controller (either on the processor or the motherboard, depending on which type of processor you are using) – if you have identical hardware available, look at swapping the components into it for further testing.
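Where EDAC is supported, the kernel exposes corrected and uncorrected error counts under sysfs, which can at least confirm that errors are accumulating even when they can’t be traced to a specific DIMM. A rough sketch (the sysfs layout varies between kernel versions, so adjust the paths as needed),

# print corrected (ce) and uncorrected (ue) error counts per EDAC memory controller
for f in /sys/devices/system/edac/mc/mc*/ce_count /sys/devices/system/edac/mc/mc*/ue_count
do
  [ -r "$f" ] && echo "$f: $(cat $f)"
done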
Breakin is a tool recently announced on the Beowulf mailing list which also looks like it has a lot of potential, and I plan on adding it to my stress testing toolkit the next time I run into what looks like a possible hardware problem. What looks nice about Breakin is that it tests all of the usual suspects – processor, memory and hard drives – and it includes support for temperature sensors, MCE logging and EDAC. This is attractive from the perspective of being able to fire it up, walk away and come back to check on progress 24 hours later.
Finally, we’ve found the Intel MPI Benchmarks (IMB, previously known as the Pallas MPI benchmark) to be pretty good at stress testing systems. Anyone conducting any kind of qualification or UAT on PC hardware, particularly hardware intended to be used in HPC applications, should definitely be including an IMB run as part of their tests.
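For what it’s worth, a typical run looks something like the following (the binary name, benchmark list and mpirun flags are from a typical Open MPI / IMB build and may differ on your installation; the hostfile name is just an example),

# run a few IMB point-to-point and collective benchmarks over 16 processes
mpirun -np 16 --hostfile cluster-nodes ./IMB-MPI1 PingPong Sendrecv Allreduce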