UM version4.5 benchmarks
Latest revision as of 14:36, 24 May 2013
Benchmarking UM Version4.5 on different Architectures
Preamble
- Cluster/Parallel file systems are often a bottleneck. Timings are for writing to local disk, unless specified otherwise.
- If the model is not filesystem-bound, it is often bound by MPI message latency.
- Only the master process writes output; this can lead to load-balance issues, which hinder scaling.
- Worst-case message latencies for a cohort of processors are what matter for scaling. The vast majority of messages are either ~100 bytes or ~1KB in size, so latencies are reported for message sizes close to these (128 and 1024 bytes).
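The point about master-only output can be illustrated with Amdahl's law, which bounds the speedup when part of the runtime stays serial. This is a minimal sketch, not a model of the UM itself: the function name `amdahl_speedup` and the 5% serial fraction are assumptions chosen only for illustration, and no serial fraction was measured for these benchmarks.

```python
def amdahl_speedup(serial_fraction: float, cores: int) -> float:
    """Upper bound on speedup when `serial_fraction` of the runtime cannot be parallelised."""
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must be between 0 and 1")
    if cores < 1:
        raise ValueError("cores must be at least 1")
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / cores)


if __name__ == "__main__":
    # Hypothetical serial fraction (5%), e.g. time spent while only the
    # master process writes output. Not a measured value.
    for cores in (8, 16, 32, 64):
        bound = amdahl_speedup(0.05, cores)
        print(f"{cores:3d} cores: speedup at most {bound:.1f}x")
```

Whatever the true serial fraction is, the bound flattens as cores are added, which is the qualitative behaviour to look for in the throughput tables.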
Emerald
- Intel Westmere E5649 (2.53GHz)
- QDR InfiniBand (non-RoCE)
- GCOMv3.1
IMB ping-pong message latency
Path | 0 bytes | 128 bytes | 1024 bytes |
---|---|---|---|
between nodes | ~2.0us | ~2.4us | ~4.7us |
FAMOUS
Domain Decomposition | Number of Cores | Model-years/day |
---|---|---|
4x3 | 12 | ~313 |
6x4 | 24 | ~360 |
12x3 | 36 | ~424 |
HadCM3
Domain Decomposition | Number of Cores | Model-years/day |
---|---|---|
4x3 | 12 | ~24 |
6x4 | 24 | ~40 |
12x3 | 36 | ~60 |
Intel SandyBridge Test System
- Test system: quad-socket, 8-core Intel Xeon E5-4650L (2.60GHz; "L" denotes low power)
- 20MB L3 cache
- GCOMv3.1
IMB ping-pong message latency
Path | 0 bytes | 128 bytes | 1024 bytes |
---|---|---|---|
between sockets | ~0.70us | ~1.15us | ~2.0us |
FAMOUS
Domain Decomposition | Number of Cores | Model-years/day |
---|---|---|
4x2 | 8 | ~327 |
8x2 | 16 | ~450 |
8x4 | 32 | ~480 |
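- The last row of this table shows a real scaling problem beyond 16 cores: doubling from 16 to 32 cores raises throughput only from ~450 to ~480 model-years/day (about 7%). Load balance is a suspect, since the latencies here are much better than those of QDR InfiniBand.
- To do: improve file-writing performance and re-run.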
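The scaling problem can be quantified as speedup and parallel efficiency relative to the smallest run. The sketch below uses the FAMOUS figures from the table above; the helper name `scaling_table` is hypothetical, and the 8-core run is assumed to be a fair baseline.

```python
def scaling_table(runs):
    """Return (cores, speedup, efficiency) per run, relative to the first run.

    `runs` is a list of (cores, model_years_per_day) pairs, smallest run first.
    """
    base_cores, base_rate = runs[0]
    rows = []
    for cores, rate in runs:
        speedup = rate / base_rate                  # throughput gain over the baseline
        efficiency = speedup / (cores / base_cores)  # gain per unit of added cores
        rows.append((cores, speedup, efficiency))
    return rows


# FAMOUS on the SandyBridge test system (approximate values from the table).
famous_sandybridge = [(8, 327.0), (16, 450.0), (32, 480.0)]

for cores, speedup, efficiency in scaling_table(famous_sandybridge):
    print(f"{cores:2d} cores: speedup {speedup:.2f}x, efficiency {efficiency:.0%}")
```

With these inputs the efficiency drops to roughly 69% at 16 cores and roughly 37% at 32 cores, relative to the 8-core run.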
HadCM3
Domain Decomposition | Number of Cores | Model-years/day |
---|---|---|
8x2 | 16 | ~48 |
8x4 | 32 | ~65 |
Polaris
- Intel E5-2670 @ 2.60GHz
- InfiniBand: Mellanox Technologies MT27500 Family [ConnectX-3]
- Lustre
FAMOUS
Domain Decomposition | Number of Cores | Model-years/day |
---|---|---|
4x4 | 16 | ~330 |
8x4 | 32 | ~330 |
HadCM3
Domain Decomposition | Number of Cores | Model-years/day |
---|---|---|
4x4 | 16 | ~51 |
8x4 | 32 | ~73 |
16x4 | 64 | ~73 |
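When throughput stops improving, extra cores only add cost. One way to see this is to convert each row into core-hours per simulated model year. The sketch below does so for the Polaris HadCM3 figures above; the function name `core_hours_per_model_year` is hypothetical, and the calculation assumes all cores are occupied for the full 24 hours.

```python
HOURS_PER_DAY = 24.0


def core_hours_per_model_year(cores: int, model_years_per_day: float) -> float:
    """Core-hours consumed to simulate one model year at the given throughput."""
    if model_years_per_day <= 0:
        raise ValueError("model_years_per_day must be positive")
    return cores * HOURS_PER_DAY / model_years_per_day


# HadCM3 on Polaris (approximate values from the table).
hadcm3_polaris = [(16, 51.0), (32, 73.0), (64, 73.0)]

for cores, rate in hadcm3_polaris:
    cost = core_hours_per_model_year(cores, rate)
    print(f"{cores:2d} cores: {cost:.1f} core-hours per model year")
```

Because 32 and 64 cores give the same ~73 model-years/day, the 64-core run costs twice as many core-hours per model year for no gain in turnaround.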