Linux Labs – Beowulf Distribution: Codename Nimbus

Cluster Management Overview

I. Architecture Overview

II. Important Cluster utilities

  1. All commands accept a node specification (nodespec) syntax.
  2. bpsh is the primary user interface to bproc. It is a shell-like program, similar to bash or tcsh, that lets you issue commands on all nodes on the network or on selected nodes, as follows:
    • bpsh <nodespec> command
    • bpsh -h (help)
    • bpsh -n : do not read stdin (input is taken from /dev/null), like rsh -n.
    • bpsh accepts rsh syntax, so an rsh-based script can be converted by plain substitution. For example, to convert the rsh-based script inscript into the bpsh-based outscript: sed -e "s/rsh/bpsh/g" < inscript > outscript
  3. bpcp is the bproc equivalent of rcp.
  4. bpstat displays node status.
  5. bpctl changes node status.
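
As a hedged illustration of the rsh-to-bpsh conversion mentioned in item 2, the sketch below wraps the sed substitution in a shell function. The function name convert_rsh_to_bpsh and the sample lines (including the node names node1 and node2) are assumptions made for illustration; only the sed expression comes from the text above. The plain substitution rewrites every occurrence of the string rsh, including inside longer words, so review the output before running it.

```shell
#!/bin/sh
# convert_rsh_to_bpsh: read an rsh-based script on stdin and write the
# bpsh version to stdout, using the substitution given above.
# (Hypothetical helper name; it is not part of bproc.)
convert_rsh_to_bpsh() {
    sed -e 's/rsh/bpsh/g'
}

# Demonstration on two made-up lines; node1 and node2 are placeholder names.
printf 'rsh node1 uptime\nrsh node2 df\n' | convert_rsh_to_bpsh
```

To convert a whole file as in the text, the same function could be used as: convert_rsh_to_bpsh < inscript > outscript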

Slave Configuration utilities

Master boots like a typical Red Hat system

Slave Booting procedure and sequence

Supermon System

Wakinyan monitor

Net console

Batch Scheduling

Advice on booting behavior

  1. Are all the ports on your GigE switch (if you have one) flashing? This is GOOD! It means the ARPs are working.
  2. (NOTE: the most common error is a failed attempt to mount an unavailable NFS share.)
  3. IMPORTANT NOTE: Booting a cluster always seems to take longer than it actually does. Don’t despair! Just stand by a bit. Wait a minute. Get a cup of coffee. All is well, 99% of the time!
  4. Would you like to watch a node boot? This is also good for debugging nodes. To do so, monitor the serial console:
    • Minicom on master is ready to go. Run it.
    • The settings should already be ttyS0, 115200, 8N1, vt100.
    • Find your null modem cable. A null modem cable is shipped with every cluster.
    • The leftmost serial port on the master plugs into the leftmost serial port on the target slave.

Other important Config files

Other info

bpsh determines the execution path strictly by canonical directory name, as follows:

  1. Are you in /home/sysadmin on the master?
  2. Does this directory exist on the slave?
  3. If so, the process you are executing runs in that same current working directory (cwd).
  4. Are you in /home? /home always exists, as an NFS mount.
  5. The same holds for directories that are not NFS mounts. For example, if you are in /scratch on the master, you will execute in /scratch on the slave (/scratch exists on all machines).
  6. If your directory has no counterpart with the same canonical name on the slave, your path on the slave will be /.
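
The rules above can be modeled with a small shell function. This is only a sketch of the decision logic, not how bpsh is implemented: the function name resolve_slave_cwd is hypothetical, and the list of directories assumed to exist on the slave is passed in by hand rather than queried from a node.

```shell
#!/bin/sh
# resolve_slave_cwd MASTER_CWD [SLAVE_DIR...]
# Prints the directory a process would run in on the slave under the rules
# above: the master's cwd if the same canonical path exists on the slave,
# otherwise /.
# (Hypothetical helper; SLAVE_DIR arguments stand in for the slave's filesystem.)
resolve_slave_cwd() {
    master_cwd=$1
    shift
    for slave_dir in "$@"; do
        if [ "$master_cwd" = "$slave_dir" ]; then
            printf '%s\n' "$master_cwd"
            return 0
        fi
    done
    # No matching directory on the slave: fall back to the root directory.
    printf '/\n'
}

# Assume the slave has /home/sysadmin (via NFS) and /scratch (local).
resolve_slave_cwd /home/sysadmin /home/sysadmin /scratch   # prints /home/sysadmin
resolve_slave_cwd /scratch       /home/sysadmin /scratch   # prints /scratch
resolve_slave_cwd /opt/build     /home/sysadmin /scratch   # prints /
```

On a live cluster, running a command such as pwd through bpsh from each of these directories on the master should show whether the behavior matches this model.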

Mirroring the Master to the secondary master.

Failover procedure for secondary masters:

  1. Connect any RAID devices to the secondary master.
  2. Connect the external network connection of the master to eth0 on the secondary.
  3. Connect eth1 to the booting switch network, and attach a monitor and keyboard.
  4. Reboot.
  5. If necessary, hit the spacebar to skip PXE boot errors in this procedure.
  6. Some BIOSes require hitting F2 to turn off PXE in the BIOS (boot menu) and to make the hard disk the primary boot device.

Want to run PVM?

Want to run MPI?

Want to run the Lahey FORTRAN compiler?

Partitioning of nodes:

Rebuilding from source RPMs

Supplemental materials:

External resources