How to run ESTEL in parallel on a cluster

This article describes how to run parallel jobs in ESTEL on HPC clusters.

This page covers Beowulf clusters, i.e. dedicated high performance facilities such as Blue Crystal. If you plan to run ESTEL on a network of workstations, use the article about networks of workstations instead.

Prerequisites

  • TELEMAC system installed and configured for MPI. See the installation article if necessary.
  • PBS queuing system on the cluster.
  • Fortran compiler and local directory in the PATH (see below).
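A quick way to check the last point from an interactive shell (the compiler name below is only an example; substitute the one your TELEMAC installation is configured for):

$ which f90       # the Fortran compiler should be found
$ echo $PATH      # the list should contain "." (the local directory)

Bear in mind that an interactive shell does not guarantee the same environment for batch jobs, as explained in the next section.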

Adjusting the PATH for batch jobs

When you submit a job to a queue on a cluster, the environment available when the job eventually runs is not necessarily the one you had when you submitted it. This is particularly important for the TELEMAC system, as the job submitted to the PBS queue needs access to the Fortran compiler and the local directory (./) to run. Various solutions exist depending on the platform. On Blue Crystal, the easiest is to load all required modules in the .bashrc file.
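For example, you could append something like the following to your .bashrc (the module names are only placeholders; run module avail on your cluster to find the exact names):

# Tools TELEMAC needs when the batch job runs
module load fortran-compiler    # placeholder: your Fortran compiler module
module load mpi                 # placeholder: your MPI module
export PATH=.:$PATH             # make the local directory (./) available

Since the shell started for the batch job reads .bashrc, the batch environment then matches your interactive one.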

Submitting a job

When all prerequisites are satisfied, submitting a TELEMAC job to the PBS queue is straightforward. The qsub-telemac script in the /path/to/systel90/bin/ directory takes care of most aspects of the job submission.

The syntax is as follows:

$ qsub-telemac jobname nbnodes walltime code case

where:

  • jobname is a name for the job you are submitting
  • nbnodes is the number of nodes to use. Only one processor per node is supported at the moment.
  • walltime is the PBS walltime, i.e. the simulation will stop if this time is reached. The syntax is standard PBS syntax: hh:mm:ss or a single number in seconds.
  • code is the name of the TELEMAC code to run, i.e. probably estel2d or estel3d.
  • case is the name of the steering file to use for the simulation.

Note that qsub-telemac is clever enough to adjust the number of parallel processors in the steering file automatically to match the nbnodes argument. However, the keyword PARALLEL PROCESSORS must already be present in the steering file, as only its value is adjusted at the moment.
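For instance, the steering file could simply contain the line below before submission; the value itself does not matter, as qsub-telemac rewrites it to match nbnodes:

PARALLEL PROCESSORS : 1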

For instance, for ESTEL-3D one could use:

$ qsub-telemac test 12 10:00:00 estel3d cas

This would submit a job on 12 nodes using one processor on each node, with a PBS walltime of 10 hours, to run a case named "cas" with the code ESTEL-3D. You would end up with a new file called test, which is the file actually submitted to PBS, and a new steering file, cas-test, which is a copy of cas with the correct number of parallel processors.
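Once the job is submitted, the standard PBS commands can be used to monitor it, for example:

$ qstat -u username      # list the status of your jobs
$ qdel jobid             # remove a job from the queue if necessary

where username is your login on the cluster and jobid is the identifier reported by qstat.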

Limitations

All these limitations are being worked on and hopefully will be addressed soon.

  • Cannot add the keyword PARALLEL PROCESSORS if not present
  • Cannot deal with multiple processors and/or cores per node
  • Cannot generate a list of the processes to kill on each processor in case of crash