Difference between revisions of "UM version4.5 benchmarks"
		
		
		
		
		
Revision as of 15:27, 17 December 2012
Benchmarking UM Version4.5 on different Architectures
Preamble
- Cluster/parallel file systems are often a bottleneck.
- If the model is not filesystem-bound, it is often bound by MPI message latency.
- Only the master process writes output. This can lead to load-balance issues, which hinder scaling.
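The effect of a serial phase such as master-only output on scaling can be illustrated with Amdahl's law. This is a minimal sketch of a simplified model, not a measurement of the UM: `serial_fraction` (the share of run time that cannot be spread across processes) is a hypothetical parameter, and the value 0.05 is chosen purely for illustration.

```python
def amdahl_speedup(serial_fraction: float, n_procs: int) -> float:
    """Ideal speedup on n_procs when serial_fraction of the work stays serial."""
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must lie in [0, 1]")
    if n_procs < 1:
        raise ValueError("n_procs must be at least 1")
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_procs)


if __name__ == "__main__":
    # 0.05 is an assumed, illustrative serial fraction (e.g. time spent in
    # master-only output); it is not derived from the benchmarks on this page.
    for n in (8, 16, 32):
        print(f"{n:3d} processes: speedup {amdahl_speedup(0.05, n):.2f}")
```

In this model the speedup can never exceed `1 / serial_fraction`, however many processes are added, which is one way a serial output phase could cap throughput.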
 
AMD Bulldozer
Intel Westmere
Intel SandyBridge
- Test system: quad-socket, 8-core Xeon E5-4650L (2.60GHz; L for low power)
- 20MB L3 cache
 
| MPI message latency | 0 bytes | 128 bytes | 1024 bytes |
|---|---|---|---|
| between sockets | ~0.70us | ~1.15us | ~2.0us |
FAMOUS
| Domain Decomposition | Number of Cores | Model-years/day |
|---|---|---|
| 4x2 | 8 | ~327 | 
| 8x2 | 16 | ~450 | 
| 8x4 | 32 | ~480 | 
- The last row of this table shows a real problem scaling beyond 16 cores: doubling from 16 to 32 cores raises throughput only from ~450 to ~480 model-years/day (about 7%). Load balance?
- Would like to improve file-writing performance and re-run.
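The scaling problem can be quantified as speedup and parallel efficiency relative to the smallest run. The sketch below uses the approximate FAMOUS figures from the table above; the function name `scaling` and the choice of the 8-core run as the baseline are assumptions made for illustration.

```python
# Approximate throughput from the FAMOUS table: (cores, model-years/day).
FAMOUS = [(8, 327.0), (16, 450.0), (32, 480.0)]


def scaling(results):
    """Return (cores, speedup, efficiency) rows relative to the first entry."""
    base_cores, base_rate = results[0]
    rows = []
    for cores, rate in results:
        speedup = rate / base_rate                  # throughput gain over the baseline
        efficiency = speedup / (cores / base_cores)  # gain per unit of extra cores
        rows.append((cores, speedup, efficiency))
    return rows


if __name__ == "__main__":
    for cores, speedup, efficiency in scaling(FAMOUS):
        print(f"{cores:3d} cores: speedup {speedup:.2f}, efficiency {efficiency:.0%}")
```

With these figures, efficiency relative to 8 cores is about 69% on 16 cores and about 37% on 32 cores.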
 
HadCM3
| Domain Decomposition | Number of Cores | Model-years/day |
|---|---|---|
| 8x2 | 16 | ~48 | 
| 8x4 | 32 | ~65 |
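One way to compare the two HadCM3 configurations is the cost in core-hours per simulated year. The sketch below is a minimal calculation from the approximate table values; the helper name `core_hours_per_model_year` is hypothetical, and it assumes all cores are occupied for the full 24 hours.

```python
# Approximate throughput from the HadCM3 table: (cores, model-years/day).
HADCM3 = [(16, 48.0), (32, 65.0)]


def core_hours_per_model_year(cores: int, model_years_per_day: float) -> float:
    """Core-hours consumed per simulated year, assuming the job holds all cores for 24 h."""
    if model_years_per_day <= 0:
        raise ValueError("model_years_per_day must be positive")
    return cores * 24.0 / model_years_per_day


if __name__ == "__main__":
    for cores, rate in HADCM3:
        cost = core_hours_per_model_year(cores, rate)
        print(f"{cores:3d} cores: {cost:.1f} core-hours per model-year")
```

On these figures, the 32-core run costs roughly 11.8 core-hours per model-year against 8.0 on 16 cores, so the extra throughput comes at a higher cost per simulated year.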