This week I had the opportunity to run the STREAM memory benchmark on a Microway 2U NumberSmasher server, which supports up to 3 DIMMs per channel. In practice, this system is typically configured with 768GB or 1.5TB of DDR4 memory.
A key goal of this benchmarking was to examine how RAM quantity and clock frequency affect memory bandwidth. With all three DIMMs per channel populated, the memory frequency defaults to 1600MHz. At two DIMMs per channel, the default memory frequency increases to 1866MHz. With one DIMM per channel, the frequency maxes out at 2133MHz.
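As a rough point of reference, the theoretical peak bandwidth at each of these memory speeds can be estimated from the transfer rate, the 64-bit (8-byte) DDR4 data bus, and the number of memory channels. The sketch below assumes four memory channels per socket and two sockets, which is typical of the Xeon E5-2600 v3 platform but is an assumption rather than a measured property of this system; the function name is illustrative only.

```python
# Rough theoretical peak memory bandwidth for a dual-socket system.
# Assumptions (not taken from the benchmark itself): 4 DDR4 channels per
# socket, 2 sockets, and a 64-bit (8-byte) data bus per channel.

BYTES_PER_TRANSFER = 8      # 64-bit DDR4 data bus
CHANNELS_PER_SOCKET = 4     # assumed for Xeon E5-2600 v3
SOCKETS = 2                 # assumed dual-socket server


def peak_bandwidth_gbs(mega_transfers_per_sec: float) -> float:
    """Return the theoretical peak bandwidth in GB/s (1 GB = 1e9 bytes)."""
    bytes_per_sec = (mega_transfers_per_sec * 1e6 * BYTES_PER_TRANSFER
                     * CHANNELS_PER_SOCKET * SOCKETS)
    return bytes_per_sec / 1e9


if __name__ == "__main__":
    for speed in (1600, 1866, 2133):
        print(f"DDR4-{speed}: {peak_bandwidth_gbs(speed):.1f} GB/s peak")
```

Under these assumptions the ceilings work out to about 102.4, 119.4, and 136.5 GB/s. Measured STREAM rates should be expected to land below such figures.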
The Test System
System: NumberSmasher 2U Server based on SYS-6028U-TR4+
Motherboard: X10DRU-i+
Processors x 2: Intel(R) Xeon(R) CPU E5-2637 v3 @ 3.50GHz
DIMMs: 32GB DDR4-2133 ECC/Registered Samsung M393A4K40BB0-CPB0Q
Operating System: CentOS Linux release 7.2.1511 (Core)
Kernel Version: 3.10.0-327.10.1.el7.x86_64
Compiler: Intel Parallel Studio XE 2016
Benchmark Compilation and Execution
When compiling STREAM with the Intel compiler, I used the following compiler knobs in the makefile:
CC = icc
CFLAGS = -O3 -xHost -openmp -DSTREAM_ARRAY_SIZE=64000000 -opt-streaming-cache-evict=0 -opt-streaming-stores always -opt-prefetch-distance=64,8
Information on compiling STREAM can be found in an Intel Developer Zone article on STREAM Triad Optimization. Reading through the STREAM FAQ at the University of Virginia site can also be helpful.
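To put the -DSTREAM_ARRAY_SIZE=64000000 setting in context, the sketch below estimates the memory footprint of the three STREAM arrays and compares one array against the combined last-level cache. It assumes 8-byte double elements (the STREAM default), 15MB of L3 cache per E5-2637 v3, and the STREAM guideline that each array be at least four times the total last-level cache size. The cache figure and the helper names are assumptions, not values taken from this test.

```python
# Sanity check for the STREAM array size used in the makefile.
ARRAY_ELEMENTS = 64_000_000          # from -DSTREAM_ARRAY_SIZE=64000000
BYTES_PER_ELEMENT = 8                # STREAM's default element type is double
NUM_ARRAYS = 3                       # STREAM allocates arrays a, b, and c

L3_BYTES_PER_SOCKET = 15 * 1024**2   # assumed 15MB L3 per processor
SOCKETS = 2                          # assumed dual-socket server


def array_bytes(elements: int = ARRAY_ELEMENTS) -> int:
    """Size of a single STREAM array in bytes."""
    return elements * BYTES_PER_ELEMENT


def total_footprint_bytes(elements: int = ARRAY_ELEMENTS) -> int:
    """Combined size of all three STREAM arrays in bytes."""
    return NUM_ARRAYS * array_bytes(elements)


def meets_cache_guideline(elements: int, total_cache_bytes: int) -> bool:
    """True if one array is at least 4x the total last-level cache."""
    return array_bytes(elements) >= 4 * total_cache_bytes


if __name__ == "__main__":
    cache = L3_BYTES_PER_SOCKET * SOCKETS
    print(f"One array:    {array_bytes() / 1e6:.0f} MB")
    print(f"All arrays:   {total_footprint_bytes() / 1e6:.0f} MB")
    print(f"Guideline OK: {meets_cache_guideline(ARRAY_ELEMENTS, cache)}")
```

With these inputs each array is 512 MB and the full working set is about 1.5 GB, far larger than the assumed caches, so the benchmark should be exercising main memory rather than cache.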
I set the KMP_AFFINITY and OMP_NUM_THREADS environment variables before running STREAM:
export KMP_AFFINITY=granularity=core,compact
export OMP_NUM_THREADS=8
./stream_intel
On a system that has hyper-threading turned on, I could have used the GOMP_CPU_AFFINITY environment variable to pin threads to physical cores, but I elected to turn off hyper-threading in the BIOS instead.
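For repeated runs, the same environment can be set from a small wrapper script instead of an interactive shell. This is a minimal sketch: the variable values and the ./stream_intel binary name come from the commands above, while the helper names stream_env and run_stream are hypothetical.

```python
import os
import subprocess


def stream_env(num_threads: int,
               affinity: str = "granularity=core,compact") -> dict:
    """Return a copy of the current environment with the OpenMP settings added."""
    env = dict(os.environ)                  # copy, so the caller's environment is untouched
    env["KMP_AFFINITY"] = affinity          # thread pinning for the Intel OpenMP runtime
    env["OMP_NUM_THREADS"] = str(num_threads)
    return env


def run_stream(binary: str = "./stream_intel", num_threads: int = 8) -> str:
    """Run the STREAM binary once and return its standard output as text."""
    result = subprocess.run(
        [binary],
        env=stream_env(num_threads),
        capture_output=True,
        text=True,
        check=True,                         # raise if STREAM exits with an error
    )
    return result.stdout
```

Calling run_stream() from the directory containing the compiled benchmark would launch it with eight threads, matching one thread per physical core on this system.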
STREAM Performance Results
Results with 3 DIMMs per Channel – 768GB RAM @ 1600MHz
Task | Best Rate MB/s | Avg time | Min time | Max time |
---|---|---|---|---|
Copy | 73,876.7 | 0.013882 | 0.013861 | 0.013905 |
Scale | 73,430.8 | 0.013967 | 0.013945 | 0.013989 |
Add | 70,320.2 | 0.021891 | 0.021843 | 0.022147 |
Triad | 70,555.8 | 0.021859 | 0.021770 | 0.022379 |
Results with 2 DIMMs per Channel – 512GB RAM @ 1866MHz
Task | Best Rate MB/s | Avg time | Min time | Max time |
---|---|---|---|---|
Copy | 88,413.8 | 0.011661 | 0.011582 | 0.011900 |
Scale | 87,867.6 | 0.011765 | 0.011654 | 0.012166 |
Add | 90,289.8 | 0.017417 | 0.017012 | 0.018789 |
Triad | 89,756.5 | 0.017596 | 0.017113 | 0.018941 |
Results with 1 DIMM per Channel – 256GB RAM @ 2133MHz
Task | Best Rate MB/s | Avg time | Min time | Max time |
---|---|---|---|---|
Copy | 89,242.5 | 0.011479 | 0.011468 | 0.011495 |
Scale | 87,724.0 | 0.011699 | 0.011673 | 0.011757 |
Add | 90,363.3 | 0.017031 | 0.016998 | 0.017057 |
Triad | 90,411.5 | 0.017006 | 0.016989 | 0.017027 |
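The Best Rate column follows directly from the Min time column: STREAM divides the bytes each kernel moves per iteration by the fastest observed time. The sketch below reproduces that arithmetic, assuming 8-byte elements, the 64,000,000-element arrays used here, and STREAM's convention of counting two arrays for Copy and Scale and three for Add and Triad, with 1 MB taken as 10^6 bytes. The function and dictionary names are illustrative.

```python
# Reproduce STREAM's "Best Rate MB/s" from the reported minimum time.
ARRAY_ELEMENTS = 64_000_000   # from -DSTREAM_ARRAY_SIZE=64000000
BYTES_PER_ELEMENT = 8         # double precision

# Number of arrays read or written by each kernel.
ARRAYS_TOUCHED = {"Copy": 2, "Scale": 2, "Add": 3, "Triad": 3}


def best_rate_mbs(kernel: str, min_time_s: float,
                  elements: int = ARRAY_ELEMENTS) -> float:
    """Return bandwidth in MB/s (1 MB = 1e6 bytes) for one kernel."""
    bytes_moved = ARRAYS_TOUCHED[kernel] * BYTES_PER_ELEMENT * elements
    return bytes_moved / min_time_s / 1e6


if __name__ == "__main__":
    # Min times copied from the 3 DIMMs per channel table.
    for kernel, min_time in (("Copy", 0.013861), ("Triad", 0.021770)):
        print(f"{kernel}: {best_rate_mbs(kernel, min_time):,.1f} MB/s")
```

The computed values agree with the table to within the rounding of the printed times, which also explains why Add and Triad take about 1.5 times as long as Copy and Scale at a similar bandwidth.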
Summary of Results
Notice in the chart how rapidly performance improves moving from 3 DIMMs per channel (768GB at 1600MHz) to 2 DIMMs per channel (512GB at 1866MHz): Copy and Scale gain roughly 20%, and Add and Triad gain roughly 27 to 28%. Going from 2 DIMMs per channel to 1 DIMM per channel (256GB at 2133MHz), by contrast, changes each result by less than 1%.
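The size of each step can be checked against the Best Rate columns in the tables above. A short sketch, with the rates copied from those tables and helper names that are purely illustrative:

```python
# Best Rate (MB/s) values copied from the three default-frequency tables.
RATES = {
    "3 DPC @ 1600MHz": {"Copy": 73876.7, "Scale": 73430.8,
                        "Add": 70320.2, "Triad": 70555.8},
    "2 DPC @ 1866MHz": {"Copy": 88413.8, "Scale": 87867.6,
                        "Add": 90289.8, "Triad": 89756.5},
    "1 DPC @ 2133MHz": {"Copy": 89242.5, "Scale": 87724.0,
                        "Add": 90363.3, "Triad": 90411.5},
}


def percent_change(old: float, new: float) -> float:
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100.0


def compare(before: str, after: str) -> dict:
    """Per-kernel percent change between two configurations."""
    return {kernel: percent_change(RATES[before][kernel], RATES[after][kernel])
            for kernel in RATES[before]}


if __name__ == "__main__":
    steps = (("3 DPC @ 1600MHz", "2 DPC @ 1866MHz"),
             ("2 DPC @ 1866MHz", "1 DPC @ 2133MHz"))
    for before, after in steps:
        print(f"{before} -> {after}")
        for kernel, change in compare(before, after).items():
            print(f"  {kernel:<5} {change:+.1f}%")
```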
This is significant when deciding how much RAM to spec on a new system, or how much to add when upgrading. Outfitting a server with eight or sixteen DIMMs results in excellent performance. Outfitting a server with twenty-four DIMMs provides exceptional memory capacity, but results in reduced performance. Thus, there is a trade-off between memory capacity and memory performance.
Keep in mind that using the E5-2637 v3 processors, with only 4 physical cores each, reduces the STREAM performance results. Had I used something like the E5-2690 v3 processors, with 12 physical cores each, the peak STREAM throughput results would be roughly 110GB/sec.
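One way to see how much headroom the core count leaves is to express a measured Triad rate as a fraction of the theoretical peak. The sketch below assumes four DDR4 channels per socket, two sockets, and an 8-byte data bus per channel (platform assumptions, not measurements from this test); the function names are illustrative.

```python
def peak_mbs(mega_transfers_per_sec: float,
             channels_per_socket: int = 4,   # assumed for Xeon E5-2600 v3
             sockets: int = 2,               # assumed dual-socket server
             bytes_per_transfer: int = 8) -> float:
    """Theoretical peak bandwidth in MB/s (MT/s times bytes per transfer)."""
    return (mega_transfers_per_sec * bytes_per_transfer
            * channels_per_socket * sockets)


def efficiency(measured_mbs: float, mega_transfers_per_sec: float) -> float:
    """Measured bandwidth as a fraction of the theoretical peak."""
    return measured_mbs / peak_mbs(mega_transfers_per_sec)


if __name__ == "__main__":
    # Triad at 1 DIMM per channel (2133MHz), from the table above.
    print(f"Measured Triad:  {efficiency(90411.5, 2133):.1%} of peak")
    # The roughly 110GB/sec estimate for a 12-core processor.
    print(f"110GB/s estimate: {efficiency(110_000.0, 2133):.1%} of peak")
```

Under these assumptions, the 1 DIMM per channel Triad result is about two thirds of the theoretical peak, while the 110GB/sec estimate would correspond to about 80%.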
Results with 2 DIMMs per Channel – 512GB RAM @ 2133MHz (Forced in BIOS)
The best overall performance for the day (though not graphed above) came from forcing the 512GB configuration to 2133MHz in BIOS:
Task | Best Rate MB/s | Avg time | Min time | Max time |
---|---|---|---|---|
Copy | 89,510.2 | 0.011477 | 0.011440 | 0.011605 |
Scale | 88,981.7 | 0.011523 | 0.011508 | 0.011539 |
Add | 92,473.6 | 0.016640 | 0.016610 | 0.016665 |
Triad | 92,403.3 | 0.016674 | 0.016623 | 0.016710 |
Be careful though – a configuration like this needs to be heavily tested to ensure stability. Call us at Microway if you are not sure or have questions about memory configuration on your next server.