Benchmarking UM Version 4.5 on Different Architectures
Preamble
- Cluster/Parallel file systems are often a bottleneck.
- If the model is not filesystem-bound, it is often bound by MPI message latency.
- Only the master process writes output; this can lead to load-balance issues, which hinder scaling.
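The effect of a single writing process on scaling can be illustrated with Amdahl's law. The sketch below is not a model of the UM: the 5% serial fraction is a hypothetical value chosen purely for illustration, and `amdahl_speedup` is a name introduced here.

```python
def amdahl_speedup(serial_fraction: float, cores: int) -> float:
    """Upper bound on speedup when `serial_fraction` of the run time
    (for example, output written by one process) does not parallelise."""
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must be in [0, 1]")
    if cores < 1:
        raise ValueError("cores must be >= 1")
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / cores)


if __name__ == "__main__":
    SERIAL_FRACTION = 0.05  # hypothetical value, not measured on any system here
    for cores in (12, 24, 36):
        bound = amdahl_speedup(SERIAL_FRACTION, cores)
        print(f"{cores:3d} cores: speedup <= {bound:.1f}")
```

Under this assumption the speedup can never exceed 1 / serial_fraction, however many cores are added.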
AMD Bulldozer
Intel Westmere
- Emerald.
- QDR Infiniband (non-RoCE)
- GCOMv3.1
IMB ping-pong message latency
| | 0 bytes | 128 bytes | 1024 bytes |
|---|---|---|---|
| between sockets | ~2.0 µs | ~2.4 µs | ~4.7 µs |
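A common first-order reading of ping-pong figures like these is the linear cost model t(n) = alpha + beta * n, where alpha is the zero-byte latency and beta the per-byte cost. The sketch below estimates both from the 0-byte and 1024-byte entries of the table above and compares the prediction with the 128-byte entry. The linear model and the function names are assumptions made here for illustration; they are not part of the IMB output.

```python
# Ping-pong latencies from the table above: message size (bytes) -> time (us).
WESTMERE_LATENCY_US = {0: 2.0, 128: 2.4, 1024: 4.7}


def fit_alpha_beta(samples: dict) -> tuple:
    """Fit t(n) = alpha + beta * n through the smallest and largest message."""
    n_min, n_max = min(samples), max(samples)
    beta = (samples[n_max] - samples[n_min]) / (n_max - n_min)  # us per byte
    alpha = samples[n_min] - beta * n_min                       # us
    return alpha, beta


def predict_us(alpha: float, beta: float, nbytes: int) -> float:
    """Predicted one-way time in microseconds for a message of `nbytes`."""
    return alpha + beta * nbytes


if __name__ == "__main__":
    alpha, beta = fit_alpha_beta(WESTMERE_LATENCY_US)
    print(f"alpha = {alpha:.2f} us, beta = {beta * 1e3:.2f} ns/byte")
    print(f"predicted t(128) = {predict_us(alpha, beta, 128):.2f} us, "
          f"measured ~{WESTMERE_LATENCY_US[128]} us")
```

For these three points the model predicts about 2.34 us at 128 bytes against a measured ~2.4 us, so a linear fit is a reasonable approximation over this size range.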
FAMOUS
| Domain Decomposition | Number of Cores | Model-years/day |
|---|---|---|
| 4x3 | 12 | ~313 |
| 6x4 | 24 | ~360 |
| 12x3 | 36 | ~424 |
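To make the scaling in the table above easier to judge, the following sketch computes speedup and parallel efficiency relative to the smallest run (12 cores). The helper name `relative_scaling` is introduced here for illustration, and the throughput values are the approximate figures from the table.

```python
# FAMOUS on Westmere, from the table above: cores -> model-years/day.
FAMOUS_WESTMERE = {12: 313.0, 24: 360.0, 36: 424.0}


def relative_scaling(throughput: dict) -> list:
    """Return (cores, speedup, efficiency) tuples relative to the smallest run."""
    base_cores = min(throughput)
    base_rate = throughput[base_cores]
    rows = []
    for cores in sorted(throughput):
        speedup = throughput[cores] / base_rate
        # Ideal speedup is cores / base_cores; efficiency is the achieved share.
        efficiency = speedup / (cores / base_cores)
        rows.append((cores, speedup, efficiency))
    return rows


if __name__ == "__main__":
    for cores, speedup, efficiency in relative_scaling(FAMOUS_WESTMERE):
        print(f"{cores:3d} cores: speedup {speedup:.2f}, efficiency {efficiency:.0%}")
```

With these figures, efficiency relative to 12 cores drops to roughly 58% at 24 cores and 45% at 36 cores.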
HadCM3
| Domain Decomposition | Number of Cores | Model-years/day |
|---|---|---|
| 4x3 | 12 | ~24 |
| 6x4 | 24 | ~40 |
| 12x3 | 36 | ~60 |
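Throughput can also be read as cost. The sketch below converts each row of the HadCM3 table above into core-hours per model year, assuming all listed cores are occupied for the whole run; the function name is introduced here for illustration.

```python
# HadCM3 on Westmere, from the table above: cores -> model-years/day.
HADCM3_WESTMERE = {12: 24.0, 24: 40.0, 36: 60.0}

HOURS_PER_DAY = 24.0


def core_hours_per_model_year(cores: int, model_years_per_day: float) -> float:
    """Core-hours consumed to simulate one model year."""
    if model_years_per_day <= 0:
        raise ValueError("model_years_per_day must be positive")
    return cores * HOURS_PER_DAY / model_years_per_day


if __name__ == "__main__":
    for cores, rate in sorted(HADCM3_WESTMERE.items()):
        cost = core_hours_per_model_year(cores, rate)
        print(f"{cores:3d} cores: {cost:.1f} core-hours per model year")
```

On these figures the cost is 12.0 core-hours per model year at 12 cores and 14.4 at both 24 and 36 cores, so going from 24 to 36 cores costs nothing extra per model year.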
Intel SandyBridge
- Test system: quad-socket, 8-core Xeon E5-4650L (2.60GHz; L denotes low power)
- 20MB L3 cache
- GCOMv3.1
IMB ping-pong message latency
| | 0 bytes | 128 bytes | 1024 bytes |
|---|---|---|---|
| between sockets | ~0.70 µs | ~1.15 µs | ~2.0 µs |
FAMOUS
| Domain Decomposition | Number of Cores | Model-years/day |
|---|---|---|
| 4x2 | 8 | ~327 |
| 8x2 | 16 | ~450 |
| 8x4 | 32 | ~480 |
- The last row of this table shows a real problem scaling beyond 16 cores: doubling from 16 to 32 cores raises throughput by only about 7% (~450 to ~480 model-years/day). Load balance?
- Next step: try to improve file-writing performance and re-run.
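One rough way to quantify the scaling problem is to assume that wall-clock time per model year splits into a fixed part and a perfectly parallel part, t(p) = a + b / p, and to fit a and b through the 16-core and 32-core rows of the FAMOUS table above. This two-term model is an assumption made here, not something measured; the fixed part could be serial output, but it could equally be communication or load imbalance. The function name is hypothetical.

```python
# FAMOUS on SandyBridge, from the table above: cores -> model-years/day.
FAMOUS_SANDYBRIDGE = {8: 327.0, 16: 450.0, 32: 480.0}

SECONDS_PER_DAY = 86400.0


def fit_fixed_plus_parallel(p1: int, rate1: float, p2: int, rate2: float) -> tuple:
    """Solve t(p) = a + b / p for (a, b), with t in seconds per model year."""
    t1 = SECONDS_PER_DAY / rate1
    t2 = SECONDS_PER_DAY / rate2
    b = (t1 - t2) / (1.0 / p1 - 1.0 / p2)  # parallel work, core-seconds
    a = t1 - b / p1                        # non-scaling time, seconds
    return a, b


if __name__ == "__main__":
    a, b = fit_fixed_plus_parallel(16, FAMOUS_SANDYBRIDGE[16],
                                   32, FAMOUS_SANDYBRIDGE[32])
    print(f"fixed part a = {a:.0f} s, parallel part b = {b:.0f} core-seconds")
    print(f"throughput ceiling = {SECONDS_PER_DAY / a:.0f} model-years/day")
    predicted_8 = SECONDS_PER_DAY / (a + b / 8)
    print(f"predicted at 8 cores = {predicted_8:.0f}, "
          f"measured ~{FAMOUS_SANDYBRIDGE[8]:.0f} model-years/day")
```

The fit gives a fixed part of 168 s out of the 180 s per model year at 32 cores. It also predicts 400 model-years/day at 8 cores against a measured ~327, so the two-term model does not describe all three rows and should be read only as an order-of-magnitude indication.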
HadCM3
| Domain Decomposition | Number of Cores | Model-years/day |
|---|---|---|
| 8x2 | 16 | ~48 |
| 8x4 | 32 | ~65 |
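Because the core counts tested on the two Intel systems differ, a direct row-by-row comparison is not possible. As a rough normalisation, the sketch below computes per-core throughput for the HadCM3 results reported on this page for Westmere and SandyBridge. The function name is introduced here, and the comparison ignores differences in interconnect, node layout and file system.

```python
# HadCM3 results from this page: architecture -> {cores: model-years/day}.
HADCM3 = {
    "Westmere": {12: 24.0, 24: 40.0, 36: 60.0},
    "SandyBridge": {16: 48.0, 32: 65.0},
}


def per_core_throughput(results: dict) -> dict:
    """Model-years/day delivered per core, for every run of every architecture."""
    return {
        arch: {cores: rate / cores for cores, rate in runs.items()}
        for arch, runs in results.items()
    }


if __name__ == "__main__":
    for arch, runs in per_core_throughput(HADCM3).items():
        for cores, rate in sorted(runs.items()):
            print(f"{arch:12s} {cores:3d} cores: {rate:.2f} model-years/day per core")
```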