Latest revision as of 14:36, 24 May 2013

Benchmarking UM Version 4.5 on Different Architectures

Preamble

Cluster/Parallel file systems are often a bottleneck. Timings are for writing to local disk, unless specified otherwise.
If the model is not filesystem-bound, it is often bound by MPI message latency.
Only the master process writes output; this can lead to load-balance issues, which hinder scaling.
Worst-case message latencies across a cohort of processors are what matter for scaling. The vast majority of messages are either ~100 bytes or ~1KB in size. Latencies are reported for these key message sizes.
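
The effect of master-only output on scaling can be illustrated with a simple Amdahl-style model. This is a rough sketch, not a measurement: the compute and write times below are hypothetical numbers chosen for illustration, and the model assumes that compute time divides perfectly across cores while output time stays fixed.

```python
def modelled_speedup(cores, compute_seconds, write_seconds):
    """Speedup over one core when compute parallelises but output is serial."""
    serial_total = compute_seconds + write_seconds
    parallel_total = compute_seconds / cores + write_seconds
    return serial_total / parallel_total


# Hypothetical workload: 950 s of compute and 50 s of output on one core.
for cores in (8, 16, 32, 64):
    s = modelled_speedup(cores, compute_seconds=950.0, write_seconds=50.0)
    print(f"{cores:3d} cores: speedup {s:5.2f}, efficiency {s / cores:4.2f}")
```

In this model the speedup can never exceed (compute + write) / write, here 20, however many cores are added.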

Emerald

Intel Westmere E5649 (2.53GHz)
QDR Infiniband (non-RoCE)
GCOMv3.1

IMB ping-pong message latency
	0 bytes	128 bytes	1024 bytes
between nodes	~2.0us	~2.4us	~4.7us

FAMOUS

Domain Decomposition	Number of Cores	Model-years/day
4x3	12	~313
6x4	24	~360
12x3	36	~424

HadCM3

Domain Decomposition	Number of Cores	Model-years/day
4x3	12	~24
6x4	24	~40
12x3	36	~60

Intel SandyBridge Test System

Test system: quad-socket, 8-core Intel Xeon E5-4650L (2.60GHz; the L suffix denotes low power)
20MB L3 cache
GCOMv3.1

IMB ping-pong message latency
	0 bytes	128 bytes	1024 bytes
between sockets	~0.70us	~1.15us	~2.0us

FAMOUS

Domain Decomposition	Number of Cores	Model-years/day
4x2	8	~327
8x2	16	~450
8x4	32	~480

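The last row of this table shows a real problem scaling beyond 16 cores: doubling from 16 to 32 cores raises throughput only from ~450 to ~480 model-years/day. This is possibly a load-balance issue, since the message latencies here are much better than over QDR IB.
The next step is to try to improve file-writing performance and re-run.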
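
To put a number on the scaling problem, the throughput figures in the table can be converted to relative speedup and parallel efficiency. A minimal sketch, taking the 8-core run as the baseline (an assumption, since no single-core timing is reported) and using the approximate values from the table above:

```python
# (cores, model-years/day) for FAMOUS on the SandyBridge test system, from the table.
famous_sandybridge = [(8, 327.0), (16, 450.0), (32, 480.0)]


def scaling_report(rows):
    """Return (cores, speedup, efficiency) tuples relative to the first row."""
    base_cores, base_rate = rows[0]
    report = []
    for cores, rate in rows:
        speedup = rate / base_rate
        # Efficiency is the achieved speedup divided by the increase in core count.
        efficiency = speedup / (cores / base_cores)
        report.append((cores, speedup, efficiency))
    return report


for cores, speedup, efficiency in scaling_report(famous_sandybridge):
    print(f"{cores:2d} cores: speedup {speedup:.2f}x, efficiency {efficiency:.0%}")
```

On these figures, efficiency relative to 8 cores falls from about 69% at 16 cores to about 37% at 32 cores. The same function can be applied to the other tables on this page by swapping in their rows.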

HadCM3

Domain Decomposition	Number of Cores	Model-years/day
8x2	16	~48
8x4	32	~65

Polaris

Intel E5-2670 @ 2.60GHz
Infiniband: Mellanox Technologies MT27500 Family [ConnectX-3]
Lustre

FAMOUS

Domain Decomposition	Number of Cores	Model-years/day
4x4	16	~330
8x4	32	~330

HadCM3

Domain Decomposition	Number of Cores	Model-years/day
4x4	16	~51
8x4	32	~73
16x4	64	~73

