Revision as of 15:27, 17 December 2012

Benchmarking UM Version 4.5 on different Architectures

Preamble

Cluster/Parallel file systems are often a bottleneck.
If the model is not filesystem-bound, it is often bound by MPI message latency.
Only the master process writes output; this can lead to load-balance issues, which hinder scaling.

MPI message latency
	0 bytes	128 bytes	1024 bytes
between sockets	~0.70us	~1.15us	~2.0us

The last row of this table shows a real problem scaling beyond 16 cores, possibly a load-balance issue.
I would like to improve file-writing performance and re-run.
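
To put a number on the scaling problem, the throughput figures from the tables on this page can be turned into speedup and parallel efficiency relative to the smallest run in each table. This is only a rough sketch: the figures are the approximate ("~") values from the tables, and the names RUN_A, RUN_B and scaling_report are hypothetical labels chosen for illustration, not part of the UM or its tooling.

```python
# Approximate throughput figures copied from the two tables on this page.
# Each entry: (domain decomposition, number of cores, model-years/day).
# RUN_A / RUN_B are illustrative labels only.
RUN_A = [("4x2", 8, 327.0), ("8x2", 16, 450.0), ("8x4", 32, 480.0)]
RUN_B = [("8x2", 16, 48.0), ("8x4", 32, 65.0)]


def scaling_report(rows):
    """Return (decomposition, cores, speedup, efficiency) per row.

    Speedup and efficiency are measured relative to the first row,
    i.e. the run with the fewest cores in that table.
    """
    _, base_cores, base_rate = rows[0]
    report = []
    for decomp, cores, rate in rows:
        speedup = rate / base_rate      # measured gain in throughput
        ideal = cores / base_cores      # gain under perfect scaling
        report.append((decomp, cores, speedup, speedup / ideal))
    return report


if __name__ == "__main__":
    for name, rows in (("first table", RUN_A), ("second table", RUN_B)):
        print(name)
        for decomp, cores, speedup, eff in scaling_report(rows):
            print(f"  {decomp:>4} {cores:3d} cores  "
                  f"speedup {speedup:4.2f}  efficiency {eff:5.1%}")
```

With these figures, doubling from 16 to 32 cores in the first table raises throughput by only about 7% (450 to 480 model-years/day), which puts the efficiency relative to the 8-core run at roughly 37%.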

@@ Line 30: / Line 30: @@
 {| border="1" cellpadding="10"
-|| Domain Decomposition || Model-years/day
+|| Domain Decomposition || Number of Cores || Model-years/day
 |-
-|| 4x2 || ~327
+|| 4x2 || 8 || ~327
 |-
-|| 8x2 || ~450
+|| 8x2 || 16 || ~450
 |-
-|| 8x4 || ~480
+|| 8x4 || 32 || ~480
 |-
 |}
@@ Line 46: / Line 46: @@
 {| border="1" cellpadding="10"
-|| Domain Decomposition || Model-years/day
+|| Domain Decomposition || Number of Cores || Model-years/day
 |-
-|| 8x2 || ~48
+|| 8x2 || 16 || ~48
 |-
-|| 8x4 || ~65
+|| 8x4 || 32 || ~65
 |-
 |}