This article provides in-depth details of the NVIDIA Tesla K-series GPU accelerators (codenamed “Kepler”). “Kepler” GPUs improve upon the previous-generation “Fermi” architecture.
Important changes in the “Kepler” GPU architecture include:
- Dynamic parallelism supports GPU threads launching new threads. This simplifies parallel programming and avoids unnecessary communication between the GPU and the CPU.
- Hyper-Q enables up to 32 work queues per GPU. Multiple CPU cores and MPI processes can therefore submit work to the GPU concurrently, which keeps the GPU's resources more fully utilized.
- SMX architecture provides a new streaming multiprocessor design optimized for performance per watt. Each SMX contains 192 CUDA cores (up from 32 cores per SM in Fermi).
- PCI-Express generation 3.0 doubles data transfer rates between the host and the GPU, compared with PCI-Express generation 2.0.
- GPU Boost increases the clock speed of all CUDA cores, providing a performance boost of 30% or more for many common applications.
- Each SMX contains twice as many registers as a Fermi SM (64K versus 32K, with another 2X on Tesla K80). On GK110 and GK210, each thread may address four times as many registers (255, up from 63).
- Shared memory bank width is doubled, which likewise doubles shared memory bandwidth. Tesla K80 features an additional 2X increase in shared memory size.
- Shuffle instructions allow threads to share data without use of shared memory.
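To illustrate the last point, the following is a minimal pure-Python emulation of the data movement in a shuffle-based warp reduction. It is a sketch, not CUDA code: the helper names `shfl_down` and `warp_reduce_sum` are hypothetical, and the list simply stands in for one register per lane of a 32-thread warp. The point is that each step only moves values from lane to lane, so no shared memory buffer is needed.

```python
WARP_SIZE = 32  # threads per warp on both Fermi and Kepler


def shfl_down(values, delta):
    """Emulate a shuffle-down: lane i reads the value held by lane i + delta.

    Lanes whose source lane would fall outside the warp keep their own value.
    """
    n = len(values)
    return [values[i + delta] if i + delta < n else values[i] for i in range(n)]


def warp_reduce_sum(values):
    """Sum one warp's values using only lane-to-lane exchanges."""
    assert len(values) == WARP_SIZE
    vals = list(values)
    offset = WARP_SIZE // 2
    while offset > 0:
        received = shfl_down(vals, offset)
        # Every lane adds what it received; only the lower lanes stay meaningful.
        vals = [own + other for own, other in zip(vals, received)]
        offset //= 2
    return vals[0]  # lane 0 ends up holding the sum of all 32 inputs


if __name__ == "__main__":
    print(warp_reduce_sum(list(range(WARP_SIZE))))
```

The reduction finishes in five exchange steps (offsets 16, 8, 4, 2, 1), and only lane 0 is guaranteed to hold the complete sum at the end.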
“Kepler” Tesla GPU Specifications
The table below summarizes the features of the available Tesla GPU Accelerators. To learn more about any of these products, or to find out how best to leverage their capabilities, please speak with an HPC expert.
Comparison between “Fermi” and “Kepler” GPU Architectures
Feature | Fermi GF100 | Fermi GF104 | Kepler GK104 | Kepler GK110(b) | Kepler GK210 |
---|---|---|---|---|---|
Compute Capability | 2.0 | 2.1 | 3.0 | 3.5 | 3.7 |
Threads per Warp | 32 | 32 | 32 | 32 | 32 |
Max Warps per SM | 48 | 48 | 64 | 64 | 64 |
Max Threads per SM | 1536 | 1536 | 2048 | 2048 | 2048 |
Max Thread Blocks per SM | 8 | 8 | 16 | 16 | 16 |
32-bit Registers per SM | 32 K | 32 K | 64 K | 64 K | 128 K |
Max Registers per Thread Block | 32 K | 32 K | 64 K | 64 K | 64 K |
Max Registers per Thread | 63 | 63 | 63 | 255 | 255 |
Max Threads per Thread Block | 1024 | 1024 | 1024 | 1024 | 1024 |
Shared Memory Configurations (shared memory + L1 Cache) | 16KB + 48KB L1 Cache; 48KB + 16KB L1 Cache (64KB total) | 16KB + 48KB L1 Cache; 48KB + 16KB L1 Cache (64KB total) | 16KB + 48KB L1 Cache; 32KB + 32KB L1 Cache; 48KB + 16KB L1 Cache (64KB total) | 16KB + 48KB L1 Cache; 32KB + 32KB L1 Cache; 48KB + 16KB L1 Cache (64KB total) | 112KB + 16KB L1 Cache; 96KB + 32KB L1 Cache; 80KB + 48KB L1 Cache (128KB total) |
Max Shared Memory per Thread Block | 48KB | 48KB | 48KB | 48KB | 48KB |
Max X Grid Dimension | 2^16-1 | 2^16-1 | 2^31-1 | 2^31-1 | 2^31-1 |
Hyper-Q | – | – | – | Yes | Yes |
Dynamic Parallelism | – | – | – | Yes | Yes |
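
The per-SM limits in the table interact: the number of thread blocks that can be resident on one SM is capped by whichever of the block, warp, or register limits is reached first. The following rough sketch estimates that number from the table's values. The names `SM_LIMITS`, `resident_blocks`, and `occupancy` are hypothetical, and the model is deliberately simplified: it ignores shared memory usage and the hardware's register allocation granularity, so real occupancy may be lower.

```python
WARP_SIZE = 32
MAX_THREADS_PER_BLOCK = 1024

# Per-SM limits taken from the comparison table; "K" means 1024.
SM_LIMITS = {
    "Fermi GF100":     {"warps": 48, "blocks": 8,  "regs": 32 * 1024,
                        "regs_per_block": 32 * 1024, "regs_per_thread": 63},
    "Fermi GF104":     {"warps": 48, "blocks": 8,  "regs": 32 * 1024,
                        "regs_per_block": 32 * 1024, "regs_per_thread": 63},
    "Kepler GK104":    {"warps": 64, "blocks": 16, "regs": 64 * 1024,
                        "regs_per_block": 64 * 1024, "regs_per_thread": 63},
    "Kepler GK110(b)": {"warps": 64, "blocks": 16, "regs": 64 * 1024,
                        "regs_per_block": 64 * 1024, "regs_per_thread": 255},
    "Kepler GK210":    {"warps": 64, "blocks": 16, "regs": 128 * 1024,
                        "regs_per_block": 64 * 1024, "regs_per_thread": 255},
}


def warps_per_block(threads_per_block):
    """Number of warps needed for one block (rounded up)."""
    return -(-threads_per_block // WARP_SIZE)


def resident_blocks(gpu, threads_per_block, regs_per_thread):
    """Estimate how many blocks fit on one SM; 0 means the launch cannot fit."""
    lim = SM_LIMITS[gpu]
    warps = warps_per_block(threads_per_block)
    regs_needed = regs_per_thread * warps * WARP_SIZE  # registers for one block
    if (threads_per_block > MAX_THREADS_PER_BLOCK
            or regs_per_thread > lim["regs_per_thread"]
            or regs_needed > lim["regs_per_block"]):
        return 0
    by_warps = lim["warps"] // warps
    by_regs = lim["regs"] // regs_needed
    return min(lim["blocks"], by_warps, by_regs)


def occupancy(gpu, threads_per_block, regs_per_thread):
    """Resident warps as a fraction of the SM's maximum warp count."""
    blocks = resident_blocks(gpu, threads_per_block, regs_per_thread)
    return blocks * warps_per_block(threads_per_block) / SM_LIMITS[gpu]["warps"]


if __name__ == "__main__":
    for gpu in SM_LIMITS:
        print(gpu, resident_blocks(gpu, 256, 32), occupancy(gpu, 256, 32))
```

Under this simplified model, a kernel launched with 256 threads per block and 64 registers per thread is register-limited on GK110(b), while the larger register file of GK210 removes that limit.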