How to run ESTEL in parallel
[[Category:Estel]]
'''This article describes how to run parallel jobs in [[Estel|ESTEL]] on "simple" networks of workstations.'''

Note that the methodology differs slightly for real high performance facilities such as [http://www.acrc.bris.ac.uk Blue Crystal] or other [http://en.wikipedia.org/wiki/Beowulf_cluster Beowulf clusters]. Therefore, there is a [[Run ESTEL in parallel on a cluster | dedicated article for clusters]].

We call a network of workstations a set of workstations which can "talk" to each other via Intra/Internet.
= Prerequisites =
* You need a working MPI configuration on the network of workstations. See the [[Install and configure MPI | article about installing MPI]].
* The parallel library in the TELEMAC tree needs to have been compiled. See the article about [[Install_the_TELEMAC_system#parallel | installing the TELEMAC system]].

= The <code>mpi_telemac.conf</code> file =
The TELEMAC scripts look for a file called <code>mpi_telemac.conf</code> for the MPI configuration. This file can either be (a) a data file in the directory containing the steering file for the simulation, or (b) a global configuration file. If the global configuration is chosen, the file needs to be installed as <code>/path/to/systel90/install/HOSTTYPE/mpi_telemac.conf</code>, where <code>HOSTTYPE</code> is the string entered in the <code>systel.ini</code> [[Install_the_TELEMAC_system#systel.ini | configuration file]]. Note that if you have a global <code>mpi_telemac.conf</code>, you can override it with a local one in the folder of the steering file for the simulation.
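The lookup order described above can be sketched as a small shell snippet. This is only an illustration of the local-overrides-global rule, not TELEMAC code; the install root and <code>HOSTTYPE</code> value are assumed examples:

```shell
# Sketch of how mpi_telemac.conf is resolved: a local copy beside the
# steering file overrides the global one under systel90/install/HOSTTYPE.
SYSTEL=/path/to/systel90   # assumed install root (adjust to your tree)
HOSTTYPE_STR=linux         # assumed value of HOSTTYPE from systel.ini
if [ -f ./mpi_telemac.conf ]; then
    conf=./mpi_telemac.conf                                # local file wins
else
    conf="$SYSTEL/install/$HOSTTYPE_STR/mpi_telemac.conf"  # global fallback
fi
echo "Using $conf"
```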

<code>mpi_telemac.conf</code> contains a simple list of hosts with their number of processors. The total number of processors is written at the top of the file. An example is provided in the <code>config-template</code> of the TELEMAC tree:
+ | |||
+ | <code><pre> | ||
+ | # Configuration for MPI | ||
+ | #----------------------- | ||
+ | # | ||
+ | # Number of processors : | ||
+ | 5 | ||
+ | # | ||
+ | # For each host : | ||
+ | # hostname number_of_processors_on_the_host | ||
+ | # | ||
+ | master 1 | ||
+ | slave1 2 | ||
+ | slave2 1 | ||
+ | </pre></code> | ||

When running '''[[ESTEL]]''' in parallel mode, the number of processors requested in the steering file must be smaller than or equal to the number of processors in <code>mpi_telemac.conf</code>.
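As a quick sanity check, the total on the first non-comment line of <code>mpi_telemac.conf</code> can be compared with the value you plan to put in the steering file. A self-contained sketch (it writes the example file from above into the current directory first):

```shell
# Write the example mpi_telemac.conf from above, then check that a
# requested processor count does not exceed the declared total.
cat > mpi_telemac.conf <<'EOF'
# Number of processors :
5
#
# hostname number_of_processors_on_the_host
master 1
slave1 2
slave2 1
EOF
total=$(grep -v '^#' mpi_telemac.conf | head -n 1 | tr -d ' ')
requested=3   # the value of PARALLEL PROCESSORS in the steering file
if [ "$requested" -le "$total" ]; then
    echo "OK: $requested of $total processors requested"
else
    echo "Too many processors requested ($requested > $total)"
fi
```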
= Run a parallel job on one machine =
Before running distributed parallel jobs, it is easier to get things working on one machine first.
== Using one process ==
Before running '''[[ESTEL]]''' in parallel, you need to start an <code>mpd</code> process. Details are given in the [[Install and configure MPI | MPI article]]. Just start <code>mpd</code> with:
<code><pre>
master $ mpd &
</pre></code>

Before trying to run real parallel jobs, it is worth checking that the <code>parallel</code> library and [[Estel | '''ESTEL''']] are playing nicely together. This can be achieved by running an existing test case in which you add the following keyword to the steering file:
<code><pre>
PARALLEL PROCESSORS = 1
</pre></code>
Using the keyword <code>PARALLEL PROCESSORS</code> will force '''[[ESTEL]]''' to use the <code>parallel</code> library instead of the <code>paravoid</code> library. As we request one processor only, no MPI calls will be made.

If this does not work, stop here and try to understand what is going wrong. You can email the error messages (in full) to [[User:Jprenaud | JP Renaud]], who will help if necessary.
== Using multiple processes ==
If it works fine with one process, you can try with several. First make sure that the <code>mpi_telemac.conf</code> file contains enough entries for the number of processes you will request. As there is just one host available to MPI, simply repeat its entry several times. For instance, if asking for three processes, the <code>mpi_telemac.conf</code> file should contain:
<code><pre>
# Number of processors :
3
#
# For each host :
#
# hostname number_of_processors_on_the_host
#
master 1
master 1
master 1
</pre></code>
Now edit the steering file of your test case to ask for 3 processors:
<code><pre>
PARALLEL PROCESSORS = 3
</pre></code>

If '''[[ESTEL]]''' ran properly, you should have several new files in your directory (plus the required result files). The meaning of these files is explained [[Run_ESTEL_in_parallel#Note_about_parallel_outputs | further down]].

Remember to end the <code>mpd</code> ring after the computation has finished:
<code><pre>
master $ mpdallexit
</pre></code>
= Run a parallel job on several machines =
If MPI has been set up properly, running '''[[ESTEL]]''' on several machines is not very complicated.

Start an <code>mpd</code> ring with the right number of hosts:
<code><pre>
master $ mpdboot -n 3 -f ~/mpd.hosts
master $ mpdtrace
master
slave1
slave2
</pre></code>
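The <code>~/mpd.hosts</code> file passed to <code>mpdboot</code> is simply a list of hostnames, one per line. A sketch matching the ring above (hostnames assumed; <code>mpdboot</code> counts the local machine, here <code>master</code>, towards <code>-n</code>, so only the remote hosts need listing):

```
slave1
slave2
```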

Then adjust <code>mpi_telemac.conf</code> to match the ring and run '''[[ESTEL]]''', requesting no more processors than listed in the <code>mpi_telemac.conf</code> file. Note that your <code>mpi_telemac.conf</code> file can contain many more processors than there are hosts in the ring. For instance, if you use dual-processor machines, you could have a ring with three machines but six processors in <code>mpi_telemac.conf</code>:
<code><pre>
# Number of processors :
6
#
# For each host :
#
# hostname number_of_processors_on_the_host
#
master 2
slave1 2
slave2 2
</pre></code>

Remember to close the ring after the simulation with <code>mpdallexit</code>.

= Note about parallel output =
When you run '''ESTEL-2D''' or '''ESTEL-3D''' in parallel, you will obtain some new files in the directory where the simulation was run:
* <code>partel.log</code> contains the log of the domain decomposition step at the very beginning of the simulation.
* <code>gretel.log</code> contains the log of the domain recomposition step at the end of the simulation (not for '''ESTEL-3D''', see note below).
* <code>mpirun.txt</code> contains the hosts that have been used by MPI.
* a series of <code>peN-M.log</code> files. Each of these files is the listing output of ESTEL on a compute node, where N is the total number of processors minus one (numbering starts at zero) and M is the number of the host the log comes from. The log for the master node (M=0) is not kept, as it is what you see on the screen. Therefore, for 4 processors, there would be 3 files named:
<code><pre>
pe003-001.log
pe003-002.log
pe003-003.log
</pre></code>
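The naming scheme can be sketched in a couple of lines of shell (here for 4 processors; both fields are zero-padded to three digits):

```shell
# Print the slave log names ESTEL produces for N processors.
# Numbering starts at zero; the master log (M=0) is only shown on screen.
N=4
for M in $(seq 1 $((N - 1))); do
    printf 'pe%03d-%03d.log\n' $((N - 1)) "$M"
done
```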

There are extra files for '''ESTEL-3D''', see below.

= Note about ESTEL-3D =
For '''ESTEL-3D''', you will also have:
* a series of <code>name-of-3d-results-fileN-M</code> files
* a series of <code>name-of-mesh-results-fileN-M</code> files ('''these are empty at the moment, BUG''')

This is because '''ESTEL-3D''' does not recompose the solution on one mesh, due to a limitation in the binary Tecplot library. You will need to load all these files at once in Tecplot (option "Load Multiple Files") to see the full solution (or the full mesh). Note that this creates interpolation artefacts for P0 variables. This will be "fixed" in the next version of ESTEL, which will use a native format instead of the Tecplot format and will therefore be able to do domain recomposition. In Tecplot, be careful to change the default file filter, as the numbering of the files hides the typical <code>.plt</code> or <code>.dat</code> extension.

Also, as there is no domain recomposition, there is no <code>gretel.log</code> for '''ESTEL-3D'''.
= Note about ESTEL-2D =
'''To finish, Fabien??'''
* No "validation" possible; it will crash with no warning. '''Probably a bug!!'''
* Problem with particle tracking:
** a second keyword is required
** the dictionary needs to be changed

Latest revision as of 09:37, 7 September 2007