This article provides in-depth details of the NVIDIA Tesla M-series GPU accelerators (codenamed “Maxwell”). “Maxwell” GPUs improve upon the previous-generation “Kepler” architecture, although they do not necessarily replace all “Kepler” models.
Important changes available in the “Maxwell” GPU architecture include:
- Energy efficiency – Maxwell GPUs deliver nearly twice the power efficiency of Kepler GPUs.
- SMM architecture – the Maxwell Multiprocessor (SMM) provides power-efficient performance, with 40% higher performance per CUDA core. Each SMM contains 128 CUDA cores (changed from 192 cores in Kepler).
- Larger, dedicated shared memory in each SMM. The L1 cache is now separate from Shared Memory (they competed for space on Kepler).
- Larger L2 caches are available on Maxwell GPUs (ranging from 2MB to 3MB, which is two to four times the size of L2 on Kepler).
- Reduced latencies on GPU instructions improve utilization and throughput. The throughput of many integer instructions has also been improved.
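- Native shared memory atomics – Maxwell adds hardware support for atomic operations on 32-bit integers in shared memory. Kepler offered fast atomics in device memory, but implemented shared memory atomics with a slower lock/update/unlock sequence.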
- Maximum active thread blocks are increased from 16 to 32 per SMM.
- Dual NVENC H.264 encoders for increased throughput of video workloads. H.265 support is also added.
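The per-multiprocessor figures in the list above can be combined into a rough comparison of one SMM against one Kepler SMX. The sketch below is illustrative only: it assumes the quoted 40% per-core gain applies uniformly and at equal clock speeds, which real workloads will not match exactly, and the function name is hypothetical.

```python
# Back-of-the-envelope comparison of one Maxwell SMM against one Kepler SMX,
# using only the figures quoted in the list above.
# Assumption: the 40% per-core gain applies uniformly and at equal clocks.

KEPLER_CORES_PER_SM = 192
MAXWELL_CORES_PER_SM = 128
MAXWELL_PER_CORE_GAIN = 1.40  # "40% higher performance per CUDA core"


def relative_sm_throughput(cores, per_core_gain=1.0):
    """Throughput of one multiprocessor, in 'Kepler CUDA core' equivalents."""
    return cores * per_core_gain


kepler_sm = relative_sm_throughput(KEPLER_CORES_PER_SM)
maxwell_sm = relative_sm_throughput(MAXWELL_CORES_PER_SM, MAXWELL_PER_CORE_GAIN)

print(f"Kepler SMX : {kepler_sm:.1f} Kepler-core equivalents")
print(f"Maxwell SMM: {maxwell_sm:.1f} Kepler-core equivalents")
print(f"SMM / SMX  : {maxwell_sm / kepler_sm:.0%} of the throughput with "
      f"{MAXWELL_CORES_PER_SM / KEPLER_CORES_PER_SM:.0%} of the cores")
```

Under these assumptions an SMM delivers most of an SMX's throughput with two thirds of the cores, which is consistent with the efficiency claims above, although it does not prove them.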
“Maxwell” Tesla GPU Specifications
The table below summarizes the features of the available Tesla GPU Accelerators.
Feature | Tesla M40 | Tesla M60 |
---|---|---|
GPU Chip(s) | Maxwell GM200 | 2x Maxwell GM204 |
Recommended Workload | Machine Learning & Single-Precision apps | Virtualized Desktops (VDI) |
Peak Single Precision (GPU Boost) | 6.84 TFLOPS | 9.64 TFLOPS (both GPUs combined) |
Peak Double Precision (GPU Boost) | 0.213 TFLOPS | 0.301 TFLOPS (both GPUs combined) |
Onboard GDDR5 Memory¹ | 12GB or 24GB | 16GB (8GB per GPU) |
Memory Bandwidth¹ | 288 GB/s | 160 GB/s per GPU |
L2 Cache | 3MB | 2MB per GPU |
PCI-Express Generation | 3.0 | 3.0 |
Achievable PCI-E transfer bandwidth | 12 GB/s | 12 GB/s |
# of SMM Units | 24 | 32 (16 per GPU) |
# of CUDA Cores | 3072 | 4096 (2048 per GPU) |
Memory Clock | 3004 MHz | 2505 MHz |
GPU Base Clock | 948 MHz | 899 MHz |
GPU Boost Support | Yes – Dynamic | Yes – Dynamic |
GPU Boost Clocks | 23 levels between 532 MHz and 1114 MHz | 25 levels between 532 MHz and 1177 MHz |
Compute Capability | 5.2 | 5.2 |
Workstation Support | – | – |
Server Support | Yes | Yes |
Wattage (TDP) | 250W | 300W |
1. Measured with ECC disabled. Memory capacity and performance are reduced by 6.25% with ECC enabled.
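The peak-throughput rows in the table can be reproduced from the core counts and top boost clocks. The sketch below rests on two assumptions that the article does not state: each CUDA core completes one fused multiply-add (2 FLOPs) per clock, and double precision runs at 1/32 of the single-precision rate on GM200 and GM204. The helper names are hypothetical. The results agree with the table to within rounding.

```python
# Reproduce the peak-throughput rows of the table from core counts and clocks.
# Assumptions (not stated in the article): each CUDA core completes one fused
# multiply-add (2 FLOPs) per clock, and double precision runs at 1/32 of the
# single-precision rate on GM200 and GM204.

FLOPS_PER_CORE_PER_CLOCK = 2
DP_TO_SP_RATIO = 1 / 32
ECC_OVERHEAD = 0.0625  # footnote 1: 6.25% reduction with ECC enabled


def peak_tflops(cuda_cores, boost_clock_mhz):
    """Peak single-precision TFLOPS at the given boost clock."""
    return cuda_cores * FLOPS_PER_CORE_PER_CLOCK * boost_clock_mhz * 1e6 / 1e12


def with_ecc(value):
    """Memory capacity or bandwidth remaining once ECC is enabled."""
    return value * (1 - ECC_OVERHEAD)


# (name, total CUDA cores, top boost clock in MHz), taken from the table
boards = [("Tesla M40", 3072, 1114), ("Tesla M60", 4096, 1177)]

for name, cores, boost in boards:
    sp = peak_tflops(cores, boost)
    dp = sp * DP_TO_SP_RATIO
    print(f"{name}: {sp:.2f} TFLOPS single, {dp:.3f} TFLOPS double")

# Effect of ECC on the 12GB Tesla M40 (capacity in GB, bandwidth in GB/s)
print(f"M40 with ECC: {with_ecc(12):.2f} GB, {with_ecc(288):.0f} GB/s")
```

The same `with_ecc` helper applies to the Tesla M60 figures, since the footnote covers both boards.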
Comparison between “Kepler” and “Maxwell” GPU Architectures
Feature | Kepler GK104 | Kepler GK110(b) | Kepler GK210 | Maxwell GM200 | Maxwell GM204 |
---|---|---|---|---|---|
Compute Capability | 3.0 | 3.5 | 3.7 | 5.2 | 5.2 |
Threads per Warp | 32 | 32 | 32 | 32 | 32 |
Max Warps per SM | 64 | 64 | 64 | 64 | 64 |
Max Threads per SM | 2048 | 2048 | 2048 | 2048 | 2048 |
Max Thread Blocks per SM | 16 | 16 | 16 | 32 | 32 |
32-bit Registers per SM | 64 K | 64 K | 128 K | 64 K | 64 K |
Max Registers per Thread Block | 64 K | 64 K | 64 K | 64 K | 64 K |
Max Registers per Thread | 63 | 255 | 255 | 255 | 255 |
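Max Threads per Thread Block | 1024 | 1024 | 1024 | 1024 | 1024 |
L1 Cache Configuration | split with shared memory | split with shared memory | split with shared memory | 24KB dedicated L1 cache | 24KB dedicated L1 cache |
Shared Memory Configurations | 16KB + 48KB L1 Cache; 32KB + 32KB L1 Cache; 48KB + 16KB L1 Cache (64KB total) | 16KB + 48KB L1 Cache; 32KB + 32KB L1 Cache; 48KB + 16KB L1 Cache (64KB total) | 16KB + 112KB L1 Cache; 32KB + 96KB L1 Cache; 48KB + 80KB L1 Cache (128KB total) | 96KB dedicated | 96KB dedicated |
Max Shared Memory per Thread Block | 48KB | 48KB | 48KB | 48KB | 48KB |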
Max X Grid Dimension | 2^31-1 | 2^31-1 | 2^31-1 | 2^31-1 | 2^31-1 |
Hyper-Q | No | Yes | Yes | Yes | Yes |
Dynamic Parallelism | No | Yes | Yes | Yes | Yes |
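The per-SM limits in the table above determine how many thread blocks can be resident at once, and therefore how fully an SM can be occupied. The following is a simplified sketch, not NVIDIA's occupancy calculator. It assumes registers and shared memory are allocated exactly as requested (real hardware rounds allocations up), so the result is an upper bound. The class and function names are hypothetical, and the default of 32 registers per thread is an arbitrary illustrative value.

```python
# Simplified occupancy estimate from the per-SM limits in the table above.
# Assumption: registers and shared memory are allocated exactly as requested;
# real hardware rounds allocations up, so treat the result as an upper bound.
from dataclasses import dataclass


@dataclass(frozen=True)
class SMLimits:
    max_threads: int        # Max Threads per SM
    max_blocks: int         # Max Thread Blocks per SM
    registers: int          # 32-bit Registers per SM
    shared_mem_bytes: int   # shared memory available per SM


# Kepler GK110 with its largest shared memory split (48KB) vs. Maxwell GM200
GK110 = SMLimits(max_threads=2048, max_blocks=16,
                 registers=64 * 1024, shared_mem_bytes=48 * 1024)
GM200 = SMLimits(max_threads=2048, max_blocks=32,
                 registers=64 * 1024, shared_mem_bytes=96 * 1024)


def resident_blocks(sm, threads_per_block, regs_per_thread, smem_per_block):
    """Thread blocks that fit on one SM under the simplified model."""
    limits = [sm.max_blocks, sm.max_threads // threads_per_block]
    if regs_per_thread > 0:
        limits.append(sm.registers // (regs_per_thread * threads_per_block))
    if smem_per_block > 0:
        limits.append(sm.shared_mem_bytes // smem_per_block)
    return min(limits)


def occupancy(sm, threads_per_block, regs_per_thread=32, smem_per_block=0):
    """Fraction of the SM's maximum resident threads that are in use."""
    blocks = resident_blocks(sm, threads_per_block, regs_per_thread,
                             smem_per_block)
    return blocks * threads_per_block / sm.max_threads


# Small 64-thread blocks: Kepler hits its 16-block cap first, Maxwell does not.
for name, sm in [("GK110", GK110), ("GM200", GM200)]:
    print(f"{name}: {occupancy(sm, threads_per_block=64):.0%} occupancy")
```

In this model, kernels launched with small thread blocks benefit most from the higher block limit, while kernels that use a lot of shared memory per block benefit from the larger dedicated shared memory.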
Additional Tesla “Maxwell” GPU products
NVIDIA has also released the Tesla M4, Tesla M6, and Tesla M10 GPUs. These products primarily target embedded and hyperscale deployments and are not expected to be used in the HPC space.