===== Linux Labs – Beowulf Distribution: Codename Nimbus =====

=== Cluster Management Overview ===

=== I. Architecture Overview ===

  * The **Beostat** daemon has been replaced with the **supermon** utilities from LANL. This is a very lightweight __/proc__-based system that uses virtually no system resources, unlike its rather onerous predecessor.
  * **Wakinyan** Monitor: a graphical monitor that saves screen space and displays ambient temperature.
  * **2.4.19 Linux kernel** for cutting-edge stability and the latest feature set (e.g. hyperthreading capability for Xeon-based clusters).
  * **bproc** has been updated to the advanced LANL version with the following features:
    * The unified **__P__**rocess **__ID__**entification space is more complete. Bproc system daemons are fully hidden once a node boots.
      - __OLD__: when a process spawns on a slave node, it initializes, then a new PID is issued
      - __NEW__: all system processes disappear, and PIDs are global on all nodes
    * Access control is now available on a node-by-node basis:
      - User / Group / Other (i.e. chmod-style u/g/o permissions) on the slave nodes themselves
      - Permissions are checked on nodes to decide which users' jobs are eligible to run there.
      - This is useful in a shared cluster where not everyone can use all the nodes.
  * **rarpcatcher** replaces **beosetup**
    * The status info is put into __/etc/beowulf/config__
    * This process runs on startup
    * It always runs as a daemon, so when nodes are added on the fly, the process HUPs (restarts) the beowulf system and adds in the new nodes.
  * **NEW**: node data is lost only if your filesystem is completely trashed; this is due to the ext3 filesystem. We have experienced ZERO corruption in extensive testing. Older versions of the software, by contrast, took a rather cavalier attitude towards node filesystem data.
  * **ALSO**: Regarding I. above, this makes boot much cleaner
  * **NOTE**: If one or more of your nodes has important data, issue a sync command before you power cycle
  * **REMEMBER**: the only non-persistent data stored on nodes are the libraries and system files that are copied to the node at boot time.

=== II. Important Cluster utilities ===

  - All commands accept a [[nod|Node specification syntax]]
  - **[[bpsh|bpsh]]** is the primary user interface into bproc. It is a shell-like program, similar to bash or tcsh, that lets you issue commands across all nodes on the network, or on selected nodes as described here:
    * **bpsh** command
    * **bpsh** -h (help)
    * **bpsh** -n : no redirect, like __rsh__.
    * **bpsh** accepts all rsh syntax, so to convert inscript, an rsh-based script, to bpsh you could issue this command and expect everything to work: (//sed -e "s/rsh/bpsh/g" < inscript > outscript//)
    * [[bpsh|man page]]
    * [[bprun|bpsh run environment]]
  - [[bpcp|bpcp]] is the bproc equivalent of **rcp**.
  - [[bpstat|bpstat]] Display node status.
  - [[bpctl|bpctl]] Change node status

=== Slave Configuration utilities ===

  * [[flash|flash_tool]]
  * [[cmos|cmos_util]]

=== Master boots like typical RedHat system ===

[[slave|Slave Booting procedure and sequence]]

[[http://supermon.sourceforge.net/|Supermon System]]

  * Communicates over TCP/IP
  * mon daemon
  * supermon daemon
  * Lightweight
  * Data format
    * Lisp-like
    * Human readable
    * Extensible
  * Kernel modules
    * supermon_proc
    * sensors
  * mon embedded in beoboot
  * libsexpr

=== Wakinyan monitor ===

  * This program lives in __/usr/bin/wakinyanmon__
  * Part of the [[http://supermon.sourceforge.net/|Supermon]] system.
  * A [[http://www.gtk.org/|GTK]] application.
  * Node Status Display
    * A horizontal yellow line means the node is down
    * A diagonal yellow line means the node is booting
    * A green check means the node is up
    * A red X means the node has an error condition
  * CPU load
  * Disk load
  * Memory used
  * Swap
  * Net
  * Temperatures
    * CPU 0
    * CPU 1
    * Northbridge

=== Net console ===

  * [[netcons|netconsole]]

=== Batch Scheduling ===

  * [[raccoon|raccoon-Maui]]
  * BJS

=== Advice on booting behavior ===

  - Are all the ports on your GIG-E switch (if you have one) flashing? This is GOOD! It means the **arps** are working.
  - (NOTE: the most common error is a failed attempt to mount an unavailable NFS share)
  - IMPORTANT NOTE: Booting a cluster always seems to take longer than it actually does. Don't despair! Just stand by a bit. Wait a minute. Get a cup of coffee. All is well, 99% of the time!
  - Would you like to watch a node boot? This is also good for debugging nodes. You are going to monitor the __Serial Console__!
    * **Minicom** on the master is ready to go. Run it.
    * The settings should already be __TTYS0, 115200, N81 vt100__
    * Find your //null modem cable//. A null modem cable is shipped with every cluster.
    * The leftmost serial port on the __master__ plugs into the leftmost serial port on the target __slave__.

=== Other important Config files ===

  * __/etc/beowulf/config.boot__ is the file of last resort. It gives a list of the PCI IDs and driver names.
  * Command "**beoboot -p**": this program grabs the kernel from __/etc/beowulf/config/__ and creates new images in the **/tftpboot/slave** boot directory
  * If you have problems that cannot be solved with reboot or halt, toggle the power off and on manually.

=== Other info ===

**bpsh** command execution paths are determined strictly by canonical directory names, as follows:

  - Are you in __/home/sysadmin__ on the master?
  - Does this exist on the slave?
  - __Then the process you are executing runs in this current working directory (**cwd**)__.
  - Are you in __/home__? __/home__ always exists in an NFS mount.
  - You will also be in the same directory when it is not an NFS mount, as long as it exists on the slave. For example, if you are in /scratch on the master, you will execute in /scratch on the slave (/scratch exists on all machines)
  - **If you are not in a similar canonical directory your path will be / on the slave**.

=== Mirroring the Master to the secondary master. ===

  * [[mirr|Imaging the master to a secondary master]]

=== Failover procedure for secondary masters: ===

  - Connect any RAID devices to the secondary __master__.
  - Connect the external net connection of the __master__ to eth0 on the __secondary__.
  - Connect eth1 to the booting switch network (plus monitor, keyboard).
  - Reboot.
  - If necessary, hit the spacebar to skip PXE boot errors in this procedure.
  - Some BIOSes require hitting F2 to turn off PXE in the BIOS (boot menu) and to make the HD the primary boot method.

=== Want to run PVM ? ===

  * Simply run **start-pvm**; this launches PVM on all nodes for legacy apps.

=== Want to run MPI ? ===

  * The newer **[[mpi|MPI]]** (1.5) uses all_cpus=1 rather than "MPI=" to use all CPUs.
  * Example: **all_cpus=1 progrname params** (linked against MPI)
  * Note that the __master__ is node -1

=== Want to run Lahey FORTRAN compiler? ===

  * Lahey resides in __/usr/local/lf95__
  * PGI is in __/usr/pgi__ (pgi c, pgi f90, etc.)

**flexlm** and the environment variables are set by default to just work. See the docs for more info.

=== Partitioning of nodes: ===

  * nodes are /dev/hda1 (one fs)
  * /dev/hda2 is swap
  * primary and secondary masters are /dev/hda3 (/) ; /dev/hda5 (/var) ; /dev/hda6 (/usr)
  * Depending on your cluster specification, the primary master is preconfigured to also act as a slave node.
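
As an illustration of the **bpsh** working-directory rule described under "Other info", the decision can be sketched in a few lines of Python. This is only a model of the rule as stated on this page, not bproc's actual implementation: the function name ''slave_cwd'' and the set of slave directories are hypothetical, and the lexical ''normpath'' clean-up stands in for whatever canonicalization bproc really performs.

```python
import posixpath


def slave_cwd(master_cwd, slave_dirs):
    """Model of the rule: keep the master's cwd if the same canonical
    directory exists on the slave, otherwise fall back to "/".

    slave_dirs is a hypothetical set of directories known to exist on
    the slave (NFS mounts such as /home included).
    """
    # Lexical normalization only (an assumption): strips trailing
    # slashes and resolves "." / ".." without touching the filesystem.
    cwd = posixpath.normpath(master_cwd)
    return cwd if cwd in slave_dirs else "/"


if __name__ == "__main__":
    # Hypothetical slave layout for demonstration purposes.
    dirs = {"/", "/home", "/home/sysadmin", "/scratch"}
    for path in ("/home/sysadmin", "/scratch/", "/root/build"):
        print(path, "->", slave_cwd(path, dirs))
```

The fallback to ''/'' is the case to watch for in practice: a job that relies on relative paths will behave differently when launched from a directory that exists only on the master.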
=== Rebuilding from source RPMs ===

  * [[rebuild|Summary of RPM build from SRPM]]

=== Supplemental materials: ===

  * [[http://www.rpm.org/max-rpm/|Maximum RPM]]
  * [[http://rfc.net/rfc1350.html|RFC 1350: TFTP]]
  * [[http://rfc.net/rfc2131.html|RFC 2131: DHCP]]

=== External resources ===

  * [[http://www.clustermatic.org|Clustermatic]]
  * [[http://supermon.sourceforge.net/|Supermon]]
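
A note on the rsh-to-bpsh conversion mentioned in section II: the ''sed'' one-liner there replaces every occurrence of the string ''rsh'', including inside longer words. The sketch below shows a more conservative variant that only rewrites ''rsh'' when it stands as a whole word. The function name ''rsh_to_bpsh'' and the sample script lines are made up for illustration, and whether a converted script actually runs still depends on **bpsh** accepting the rsh syntax it uses.

```python
import re

# Match "rsh" only as a whole word, so identifiers that merely
# contain the letters (e.g. "marshal") are left untouched.
_RSH_WORD = re.compile(r"\brsh\b")


def rsh_to_bpsh(script_text):
    """Return script_text with each standalone 'rsh' replaced by 'bpsh'."""
    return _RSH_WORD.sub("bpsh", script_text)


if __name__ == "__main__":
    # Hypothetical script content for demonstration purposes.
    sample = "rsh n1 uptime\nmarshal_data=1\n"
    print(rsh_to_bpsh(sample), end="")
```

Reading ''inscript'', passing its text through the function, and writing the result to ''outscript'' reproduces the workflow of the original one-liner.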