Benchmarking UM Version 4.5 on Different Architectures
Preamble
- Cluster/Parallel file systems are often a bottleneck.
 
- If the model is not filesystem-bound, it is often latency-bound (MPI message latency).
 
- Only the master process writes output. This can lead to load-balance issues, which hinder scaling.
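
As a rough illustration of why master-only output limits scaling, the sketch below applies Amdahl's law. This is a generic textbook model, not a measurement of the UM. The 5% serial fraction is a purely hypothetical value chosen for illustration, and `amdahl_speedup` is an illustrative helper name.

```python
def amdahl_speedup(serial_fraction: float, n_procs: int) -> float:
    """Upper bound on speedup when serial_fraction of the runtime stays serial."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_procs)


if __name__ == "__main__":
    # Hypothetical assumption: 5% of single-process runtime is master-only output.
    for n in (8, 16, 32):
        print(f"{n:3d} processes: speedup bound {amdahl_speedup(0.05, n):.2f}")
```

Under this assumption, each doubling of the process count buys progressively less speedup, because the serial output time does not shrink.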
 
AMD Bulldozer
Intel Westmere
- Emerald.
 
- QDR Infiniband (non-RoCE)
 
- GCOMv3.1
 
FAMOUS
| Domain Decomposition | Number of Cores | Model-years/day |
|---|---|---|
| 4x3 | 12 | ~313 |
| 6x4 | 24 | ~360 |
| 12x3 | 36 | ~424 |
HadCM3
| Domain Decomposition | Number of Cores | Model-years/day |
|---|---|---|
| 4x3 | 12 | ~24 |
| 6x4 | 24 | ~40 |
| 12x3 | 36 | ~60 |
Intel SandyBridge
- Test system: quad-socket, 8-core E5-4650L (2.60GHz; L denotes low power)
 
- 20MB L3 cache
 
| MPI message latency | 0 bytes | 128 bytes | 1024 bytes |
|---|---|---|---|
| between sockets | ~0.70µs | ~1.15µs | ~2.0µs |
FAMOUS
| Domain Decomposition | Number of Cores | Model-years/day |
|---|---|---|
| 4x2 | 8 | ~327 |
| 8x2 | 16 | ~450 |
| 8x4 | 32 | ~480 |
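- The last row of this table shows a real problem scaling beyond 16 cores: doubling from 16 to 32 cores raises throughput only from ~450 to ~480 model-years/day. Load balance is a possible cause.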
 
- Next step: try to improve file-writing performance and re-run.
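
To put a number on the scaling problem, the sketch below computes speedup and parallel efficiency relative to the smallest run, using the FAMOUS SandyBridge figures from the table above. The throughput values are approximate (marked "~" in the table), so the results are indicative only. `scaling_table` is an illustrative helper name, not part of the UM.

```python
def scaling_table(results):
    """Return (cores, speedup, efficiency) relative to the first entry.

    results: list of (cores, model_years_per_day), smallest core count first.
    """
    base_cores, base_rate = results[0]
    rows = []
    for cores, rate in results:
        speedup = rate / base_rate                 # throughput gain over baseline
        efficiency = speedup / (cores / base_cores)  # 1.0 means ideal scaling
        rows.append((cores, speedup, efficiency))
    return rows


# Approximate FAMOUS throughput on the SandyBridge test system.
famous_sandybridge = [(8, 327), (16, 450), (32, 480)]

if __name__ == "__main__":
    for cores, speedup, eff in scaling_table(famous_sandybridge):
        print(f"{cores:3d} cores: speedup {speedup:.2f}, efficiency {eff:.0%}")
```

Relative to the 8-core run, efficiency falls to roughly 69% at 16 cores and roughly 37% at 32 cores.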
 
HadCM3
| Domain Decomposition | Number of Cores | Model-years/day |
|---|---|---|
| 8x2 | 16 | ~48 |
| 8x4 | 32 | ~65 |