Revision as of 15:37, 17 December 2012

Benchmarking UM Version4.5 on different Architectures

Preamble

Cluster/Parallel file systems are often a bottleneck.
If the model is not filesystem-bound, it is often bound by MPI message latency.
Only the master process writes output, which can lead to load-balance issues that hinder scaling.
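
The effect of master-only output on scaling can be illustrated with a simple Amdahl-style estimate. This is a rough sketch, not a model of the UM itself: the function name is illustrative, and the 5% serial output fraction is a hypothetical value chosen for the example, not a measurement from these benchmarks.

```python
def amdahl_speedup(cores: int, serial_fraction: float) -> float:
    """Ideal speedup on `cores` cores when `serial_fraction` of the
    single-core runtime (e.g. output written by the master process
    alone) cannot be parallelised."""
    if cores < 1:
        raise ValueError("cores must be >= 1")
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must lie in [0, 1]")
    # Serial part takes the same time; parallel part is divided by cores.
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / cores)


if __name__ == "__main__":
    SERIAL_FRACTION = 0.05  # hypothetical share of runtime spent in master-only output
    for cores in (1, 2, 4, 8, 16, 32, 64):
        speedup = amdahl_speedup(cores, SERIAL_FRACTION)
        efficiency = speedup / cores
        print(f"{cores:3d} cores: speedup {speedup:5.2f}, efficiency {efficiency:4.0%}")
```

Under this assumption the speedup can never exceed 1 / 0.05 = 20, however many cores are added, which is why reducing the time the master spends writing output is worth trying before re-running at higher core counts.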

IMB ping-pong message latency
	0 bytes	128 bytes	1024 bytes
between sockets	~2.0us	~2.4us	~4.7us

MPI message latency
	0 bytes	128 bytes	1024 bytes
between sockets	~0.70us	~1.15us	~2.0us

The last line of this table shows a real problem scaling beyond 16 cores, possibly caused by load imbalance.
The next step is to try to improve file-writing performance and re-run the benchmark.

@@ Line 15: / Line 15: @@
 * QDR Infiniband (non-RoCE)
 * GCOMv3.1
+{| border="1" cellpadding="10"
+!colspan=4|IMB ping-pong message latency
+|-
+||  || 0 bytes || 128 bytes || 1024 bytes
+|-
+|| between sockets || ~2.0us || ~2.4us || ~4.7us
+|-
+|}
 ==FAMOUS==