Linux Labs – Beowulf Distribution: Codename Nimbus

Cluster Management Overview

I. Architecture Overview

  • The Beostat daemon has been replaced with the supermon utilities from LANL. Supermon is a very lightweight, /proc-based system that uses virtually no system resources, unlike its rather onerous predecessor.
  • Wakinyan Monitor: A graphical monitor that both saves on screen space and has ambient temperature output.
  • 2.4.19 Linux kernel for cutting-edge stability and the latest feature set (e.g. hyperthreading capability for Xeon-based clusters).
  • bproc has been updated to the advanced LANL version with the following features:
    • The unified process ID (PID) space is more complete. Bproc system daemons are fully hidden once a node boots.
      1. OLD: when a process spawns on a slave node, it initializes, then a new PID is issued
      2. NEW: all system processes disappear, and PIDs are global on all nodes
    • Access control is now available on a node-by-node basis:
      1. User / Group / Other permissions (as with chmod u, g, o) on the slave nodes themselves
      2. Permissions are checked on nodes for job eligibility by users.
      3. This is useful in a shared cluster where not everyone can use all the nodes.
  • rarpcatcher replaces beosetup
    • The status info is put into /etc/beowulf/config
    • This process runs on startup
    • It always runs as a daemon, so when nodes are added on the fly, it HUPs (restarts) the beowulf system and adds the new nodes.
  • NEW: node data is lost only if your filesystem is completely trashed, thanks to the ext3 filesystem. We have experienced ZERO corruption in extensive testing. Older versions of the software, by contrast, took a rather cavalier attitude towards node filesystem data.
  • ALSO: Regarding I. above, this makes booting much cleaner.
  • NOTE: If one or more of your nodes has important data, issue a sync command before you power cycle
  • REMEMBER: the only non-persistent data stored on nodes are the libraries and system files that are copied to the node at boot time.
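
The node-by-node access control listed above follows the familiar user/group/other model. As a rough illustration only, the Python sketch below models job eligibility as an execute-permission test on a node. The Node record, the function name may_run_on_node, and the precedence rules are assumptions borrowed from ordinary Unix file permissions; they are not taken from the bproc source.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """Hypothetical record of a slave node's ownership and mode bits."""
    owner_uid: int
    group_gid: int
    mode: int  # octal permission bits, e.g. 0o110


def may_run_on_node(node, uid, gids):
    """Return True if the user may start jobs on the node.

    Assumption: eligibility is decided by the execute bit of the first
    matching class (user, then group, then other), as for Unix files.
    """
    if uid == node.owner_uid:
        return bool(node.mode & 0o100)
    if node.group_gid in gids:
        return bool(node.mode & 0o010)
    return bool(node.mode & 0o001)


# A node reserved for uid 500 and members of gid 100, closed to everyone else.
reserved = Node(owner_uid=500, group_gid=100, mode=0o110)
print(may_run_on_node(reserved, uid=500, gids={500}))  # the owner
print(may_run_on_node(reserved, uid=501, gids={100}))  # a group member
print(may_run_on_node(reserved, uid=502, gids={502}))  # anyone else
```

In this model the owner and the group member are eligible and the third user is not, which matches the shared-cluster case where not everyone can use all the nodes.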

II. Important Cluster utilities

  1. All commands accept a Node specification syntax
  2. bpsh is the primary user interface into bproc. This is a sort of shell program similar to bash or tcsh that allows you to issue commands across all nodes on the network, or to selected nodes as described herein:
    • bpsh <nodespec> command
    • bpsh -h (help)
    • bpsh -n : no redirect, like rsh.
    • bpsh accepts all rsh syntax, so an rsh-based script (here called inscript) can be converted to bpsh with: sed -e "s/rsh/bpsh/g" < inscript > outscript
  3. bpcp is the bproc equivalent of rcp.
  4. bpstat displays node status.
  5. bpctl changes node status.
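
The sed substitution given for converting rsh scripts rewrites every occurrence of the letters rsh, including ones inside longer words. As a hedged alternative, this Python sketch performs the same conversion on whole words only. The function name rsh_to_bpsh and the sample script are made up for illustration.

```python
import re


def rsh_to_bpsh(script_text):
    """Replace the whole word 'rsh' with 'bpsh'.

    Words that merely contain the letters, such as 'marshal', are left alone.
    """
    return re.sub(r"\brsh\b", "bpsh", script_text)


# Hypothetical two-line script: only the first line invokes rsh.
sample = "rsh n3 uptime\n./marshal_results.sh\n"
print(rsh_to_bpsh(sample), end="")
```

Reviewing the converted script by hand is still advisable before running it across the cluster.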

Slave Configuration utilities

Master boots like typical RedHat system

Slave Booting procedure and sequence

Supermon System

  • Communicates over TCP/IP
  • mon daemon
  • supermon daemon
  • Lightweight
  • Data format
    • Lisp like
    • Human readable
    • Extensible
  • Kernel modules
    • supermon_proc
    • sensors
  • mon embedded in beoboot
  • libsexpr

Wakinyan monitor

  • This program lives in /usr/bin/wakinyanmon
  • Part of Supermon system.
  • A GTK application.
  • Node Status Display
    • Horizontal yellow line: the node is down
    • Diagonal yellow line: the node is booting
    • Green check: the node is up
    • Red X: the node has an error condition
  • CPU load
  • Disk load
  • Memory used
  • Swap
  • Net
  • Temperatures
    • CPU 0
    • CPU 1
    • Northbridge

Net console

Batch Scheduling

Advice on booting behavior

  1. Are all the ports on your GigE switch (if you have one) flashing? This is GOOD! It means the ARPs are working.
  2. NOTE: the most common error is a failed attempt to mount an unavailable NFS share.
  3. IMPORTANT NOTE: Booting a cluster always seems to take longer than it actually does. Don’t despair! Just stand by a bit. Wait a minute. Get a cup of coffee. All is well, 99% of the time!
  4. Would you like to watch a node boot? This is also good for debugging nodes. You are going to monitor the Serial Console!
    • Minicom on master is ready to go. Run it.
    • The settings should already be ttyS0, 115200, 8N1, vt100
    • Find your null modem cable. A null modem cable is shipped with every cluster.
    • The leftmost serial port on the master plugs into the leftmost serial port on the target slave.

Other important Config files

  • /etc/beowulf/config.boot: the file of last resort. It lists the PCI IDs and driver names.
  • The command “beoboot -p” grabs the kernel from /etc/beowulf/config/ and creates new images in the /tftpboot/slave boot directory.
  • If you have problems unsolvable with reboot or halt, toggle the power on/off manually.

Other info

bpsh chooses the execution path strictly by canonical directory name, as follows:

  1. Are you in /home/sysadmin on the master?
  2. Does this directory exist on the slave?
  3. If so, the process you are executing runs in that same current working directory (cwd).
  4. Are you in /home? /home always exists as an NFS mount.
  5. The same holds for directories that are not NFS mounts. For example, if you are in /scratch on the master, you will execute in /scratch on the slave (/scratch exists on all machines).
  6. If your directory has no counterpart with the same canonical name on the slave, your working directory on the slave will be /.
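
The working-directory rules above can be summarized in a small Python sketch. This is a model of the described behavior, not bpsh code. The function name slave_cwd and the slave_roots argument are hypothetical, and the sketch assumes every subdirectory under a listed root also exists on the slave, which is true for an NFS mount such as /home but not guaranteed for local directories.

```python
import posixpath


def slave_cwd(master_cwd, slave_roots):
    """Return the directory a process would start in on the slave.

    slave_roots lists directory trees assumed to exist on the slave under
    the same canonical name (for example /home via NFS, /scratch locally).
    """
    cwd = posixpath.normpath(master_cwd)
    for root in slave_roots:
        root = posixpath.normpath(root)
        # Match the root itself or anything below it, but not /homework for /home.
        if cwd == root or cwd.startswith(root.rstrip("/") + "/"):
            return cwd
    return "/"  # no counterpart on the slave: fall back to the root directory


roots = ["/home", "/scratch"]
for path in ("/home/sysadmin", "/scratch", "/opt/tools"):
    print(path, "->", slave_cwd(path, roots))
```

A practical consequence is that jobs relying on relative paths should be launched from /home or /scratch rather than from a directory that exists only on the master.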

Mirroring the Master to the secondary master.

Failover procedure for secondary masters:

  1. Connect any RAID devices to the secondary master.
  2. Connect the external net connection of the master to eth0 on the secondary.
  3. Connect eth1 to the booting switch network (plus monitor, keyboard).
  4. Reboot.
  5. If necessary, hit the spacebar to skip PXE boot errors in this procedure.
  6. Some BIOSes require hitting F2 to turn off PXE in the bios (boot menu) and to make the HD the primary boot method.

Want to run PVM ?

  • Simply run start-pvm; this launches PVM on all nodes for legacy apps.

Want to run MPI ?

  • The newer MPI (1.5) uses all_cpus=1 rather than “MPI=” for using all CPUs.
  • Example: all_cpus=1 progname params (where progname is linked against MPI)
  • Note that the master is node -1
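
When an MPI job is started from a script instead of an interactive shell, the all_cpus=1 setting has to be passed through the environment. The Python sketch below shows one way to do that, assuming all_cpus is read as an ordinary environment variable, as the shell-style example suggests. The function name run_with_all_cpus is hypothetical, and a child Python process stands in for an MPI-linked binary so the sketch runs anywhere.

```python
import os
import subprocess
import sys


def run_with_all_cpus(argv):
    """Run argv with all_cpus=1 added to a copy of the current environment."""
    env = dict(os.environ)
    env["all_cpus"] = "1"
    return subprocess.run(argv, env=env, capture_output=True, text=True, check=True)


# Stand-in for an MPI-linked program: a child process that echoes the variable.
child = [sys.executable, "-c", "import os; print(os.environ['all_cpus'])"]
print(run_with_all_cpus(child).stdout.strip())
```

On a real cluster the argument list would name the MPI program and its parameters instead of the stand-in child.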

Want to run the Lahey FORTRAN compiler?

  • Lahey resides in /usr/local/lf95
  • PGI is in /usr/pgi (pgi c, pgi f90, etc.). flexlm and the environment variables are set by default to just work. See the docs for more info.

Partitioning of nodes:

  • nodes use /dev/hda1 as a single filesystem
  • /dev/hda2 is swap
  • primary and secondary masters use /dev/hda3 (/), /dev/hda5 (/var), and /dev/hda6 (/usr)
  • Depending on your cluster specification, the primary is pre-setup to also be a slave node.

Rebuilding from source RPMs

Supplemental materials:

External resources

nimbus.txt · Last modified: 2010/04/15 15:18 (external edit)
 
Except where otherwise noted, content on this wiki is licensed under the following license: GNU Free Documentation License 1.3