mpi

Running

Nimbus comes pre-loaded with a slightly modified version of MPICH that knows about bproc. These modifications allow MPI applications to run conveniently on a Nimbus cluster without the need for various wrapper applications (such as mpirun).

The number of processes spawned, and where they are run, is controlled through environment variables that may be set by the user, or by the Maui scheduler through Raccoon. In brief, they are (an example invocation follows the list):

  • NP
    • specifies the number of processes the MPI job is to contain.
  • BEOWULF_JOB_MAP
    • specifies which ranks are to run on which nodes of the cluster. BEOWULF_JOB_MAP should be set to a colon (:) separated list of node numbers to run the job on. Ranks will be assigned in the order listed. Thus, BEOWULF_JOB_MAP="-1:0:0:1:2" will run the root rank on the master, ranks 1 and 2 on node 0, rank 3 on node 1, and rank 4 on node 2. NOTE that unless otherwise specified, the job will run immediately, regardless of how busy those CPUs are.
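For example, assuming an MPI executable named ./a.out (a placeholder name), a five-rank job using the mapping above could be launched directly from a shell on the master:

  NP=5 BEOWULF_JOB_MAP=-1:0:0:1:2 ./a.out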

Should something go wrong and you need to kill the job, all of its processes can be seen with the ps command on the master. You can then use the kill command as if the processes were local. This is in sharp contrast to 1st generation Beowulf, where you would need to rsh around the cluster killing off runaway processes.
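For instance (the process name and PID below are hypothetical), cleaning up a runaway job is simply:

  ps aux | grep a.out
  kill 12345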

In any case, a management process will be running on the master. This process is responsible for redirecting stdout and stderr for the child processes, and for cleaning up should any process exit unexpectedly (abend). This process DOES NOT take part in the computation, and has no rank in MPI_COMM_WORLD.

The behaviour of MPICH has changed since the previous version of Nimbus: all job scheduling functionality has been moved to Raccoon/Maui.

MPICH has also been relocated to /usr/share/mpich-<version>-<arch>, where arch is a combination of the communication type and the compiler used. This has been done so that clusters may have multiple simultaneous versions available to meet differing user needs. None of the MPI libraries are stored in /usr/lib any longer (that is, there is no 'default' version).

Compiling and linking

You may use the mpicc wrapper just as the unmodified MPICH normally does. Otherwise, all that is necessary is to link your application with bproc and MPI by adding -lbproc -L /usr/share/<mpich directory>/lib -lmpi to your command line. That is essentially all that mpicc does for you anyway.
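For example (file names are placeholders, and the include path is an assumption based on the lib path above):

  mpicc -o myapp myapp.c

or, doing the same thing by hand:

  cc -o myapp myapp.c -I/usr/share/<mpich directory>/include -lbproc -L/usr/share/<mpich directory>/lib -lmpi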

Programming

For maximum compatibility, the bproc-modified MPICH is meant to meet the normal assumptions of an MPI program. It also offers a few features that can be quite useful, though caution is advised to avoid breaking compatibility with old-school systems.

The primary difference is in the exact behaviour of MPI_Init. On 1st generation Beowulf systems, and many others, some sort of wrapper (mpirun, for example) is responsible for using rsh to run a copy of the program on all desired nodes. The command line arguments are prepended with MPI arguments that assign rank and specify how the child processes should connect back to the root rank.

On a Nimbus system, the program simply starts executing on the master as any non-MPI program would. Inside MPI_Init, a call is made to beomap to get a list of available CPUs that meet the requirements of the environment variables documented above. Then, a series of bproc_rfork calls are made to fork child processes and migrate them to their intended CPUs. MPI_Init will return in each process of the job.
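From the program's point of view this is largely invisible. A minimal sketch (it assumes the bproc library's bproc_currnode() call and its sys/bproc.h header are available, which is not guaranteed elsewhere) looks like any other MPI program:

  #include <stdio.h>
  #include <mpi.h>
  #include <sys/bproc.h>   /* header location is an assumption */

  int main(int argc, char **argv)
  {
      int rank, size;

      /* On Nimbus, MPI_Init itself forks the remaining ranks with bproc_rfork
         and migrates them; every rank returns from this call on its own CPU. */
      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &size);

      /* bproc_currnode() reports the node the process is running on
         (-1 conventionally meaning the master). */
      printf("rank %d of %d on node %d\n", rank, size, bproc_currnode());

      MPI_Finalize();
      return 0;
  }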

As long as MPI_Init is the first thing done in the program (as it should be), there is no difference. If other initialization is done first, there may be subtle differences in behaviour, and a few of them may be fatal. Foremost, files must NOT be opened before MPI_Init: the act of migration causes all open file handles other than stdin, stdout, and stderr to become invalid. Another case where a subtle behaviour change may be noticed is fetching a random seed value from /dev/random. On a bproc system, that fetch will only happen on the root rank; the other ranks will get a copy of the seed value instead. Don't do this unless you would then bcast the value anyway.
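For example, a portable way to obtain a shared random seed is to read it on the root rank after MPI_Init and broadcast it explicitly (a sketch; error handling omitted):

  #include <stdio.h>
  #include <stdlib.h>
  #include <mpi.h>

  int main(int argc, char **argv)
  {
      int rank;
      unsigned int seed = 0;

      MPI_Init(&argc, &argv);               /* always first */
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      if (rank == 0) {
          /* Open the file only AFTER MPI_Init, and only on the root rank. */
          FILE *f = fopen("/dev/random", "r");
          fread(&seed, sizeof(seed), 1, f);
          fclose(f);
      }

      /* Hand every rank the same seed explicitly rather than relying
         on pre-MPI_Init side effects. */
      MPI_Bcast(&seed, 1, MPI_UNSIGNED, 0, MPI_COMM_WORLD);
      srand(seed);

      MPI_Finalize();
      return 0;
  }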

It may be tempting to do some setup steps like the above before calling MPI_Init as an elegant way to skip a bcast or 12. While it IS elegant, it is not portable, and thus should probably be avoided.

To emphasize: MAKE MPI_Init the first thing your program does unless you have thought very carefully about unintended consequences.

Note that the semantics of MPI_Init in Nimbus meet the requirements of the MPI standard; the differences lie in the undefined grey areas.
