In-Depth Comparison of NVIDIA Tesla Kepler GPU Accelerators

Brett Newman

November 18, 2013

This article provides in-depth details of the NVIDIA Tesla K-series GPU accelerators (codenamed “Kepler”). “Kepler” GPUs improve upon the previous-generation “Fermi” architecture.

For more information on other Tesla GPU architectures, please refer to:

In-Depth Comparison of NVIDIA Tesla “Maxwell” GPU Accelerators

In-Depth Comparison of NVIDIA Tesla “Volta” GPU Accelerators

Important changes available in the “Kepler” GPU architecture include:

Dynamic parallelism allows threads running on the GPU to launch new work themselves. This simplifies parallel programming and avoids unnecessary communication between the GPU and the CPU.
Hyper-Q enables up to 32 work queues per GPU, so multiple CPU cores and MPI processes can address the GPU concurrently. This greatly improves utilization of the GPU resources.
SMX architecture provides a new streaming multiprocessor design optimized for performance per watt. Each SMX contains 192 CUDA cores (up from 32 cores per SM in Fermi).
PCI-Express generation 3.0 doubles data transfer rates between the host and the GPU.
GPU Boost increases the clock speed of all CUDA cores, providing a performance boost of more than 30% for many common applications.
Each SMX contains twice as many 32-bit registers as a Fermi SM (64K versus 32K), with another 2X on Tesla K80 (128K). On GK110 and GK210, each thread may address four times as many registers (255 versus 63).
Shared memory bank width is doubled, which doubles shared memory bandwidth. Tesla K80 additionally doubles the combined shared memory and L1 cache capacity of each SMX (128KB versus 64KB).
Shuffle instructions allow threads to share data without use of shared memory.
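
To illustrate the data movement behind the shuffle instructions mentioned above, here is a minimal pure-Python model of a warp-level sum reduction. It is only a sketch of the communication pattern, not GPU code: the function names `shfl_down` and `warp_reduce_sum` are hypothetical helpers invented for this illustration, and each list element stands in for the register value held by one of the 32 threads in a warp.

```python
WARP_SIZE = 32  # threads per warp, as listed in the architecture comparison table


def shfl_down(values, delta):
    """Model a shuffle-down: lane i reads the value held by lane i + delta.

    Lanes whose source lane would fall outside the warp keep their own value.
    """
    return [
        values[i + delta] if i + delta < len(values) else values[i]
        for i in range(len(values))
    ]


def warp_reduce_sum(values):
    """Sum one value per lane using only lane-to-lane exchanges."""
    if len(values) != WARP_SIZE:
        raise ValueError("expected exactly one value per lane")
    vals = list(values)
    offset = WARP_SIZE // 2
    while offset > 0:
        received = shfl_down(vals, offset)
        # Every lane adds the value it received to its own running total.
        vals = [own + other for own, other in zip(vals, received)]
        offset //= 2
    # After log2(32) = 5 exchange steps, lane 0 holds the sum of all lanes.
    return vals[0]


total = warp_reduce_sum(list(range(WARP_SIZE)))
print("sum held by lane 0:", total)
```

The point of the pattern is that no intermediate buffer is shared between lanes: each step is a direct register-to-register exchange, which is what lets a real kernel skip shared memory for this kind of reduction.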

“Kepler” Tesla GPU Specifications

The table below summarizes the features of the available Tesla GPU Accelerators. To learn more about any of these products, or to find out how best to leverage their capabilities, please speak with an HPC expert.

Currently-shipping Tesla ‘Kepler’ GPUs

Feature	Tesla K80	Tesla K40
GPU Chip(s)	2x Kepler GK210	Kepler GK110b
Peak Single Precision (base clocks)	5.60 TFLOPS (both GPUs combined)	4.29 TFLOPS
Peak Double Precision (base clocks)	1.87 TFLOPS (both GPUs combined)	1.43 TFLOPS
Peak Single Precision (GPU Boost)	8.73 TFLOPS (both GPUs combined)	5.04 TFLOPS
Peak Double Precision (GPU Boost)	2.91 TFLOPS (both GPUs combined)	1.68 TFLOPS
Onboard GDDR5 Memory¹	24GB (12GB per GPU)	12GB
Memory Bandwidth¹	480 GB/s (240 GB/s per GPU)	288 GB/s
PCI-Express Generation	3.0	3.0
Achievable PCI-E transfer bandwidth	12 GB/s	12 GB/s
# of SMX Units	26 (13 per GPU)	15
# of CUDA Cores	4992 (2496 per GPU)	2880
Memory Clock	2500 MHz	3004 MHz
GPU Base Clock	560 MHz	745 MHz
GPU Boost Support	Yes – Dynamic	Yes – Static
GPU Boost Clocks	23 levels between 562 MHz and 875 MHz	810 MHz, 875 MHz
Architecture features	SMX, Dynamic Parallelism, Hyper-Q	SMX, Dynamic Parallelism, Hyper-Q
Compute Capability	3.7	3.5
Workstation Support	–	Yes
Server Support	Yes	Yes
Wattage (TDP)	300W (plus Zero Power Idle)	235W

1. Measured with ECC disabled. Memory capacity and performance are reduced with ECC enabled.
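
The peak throughput figures in the table can be reproduced approximately from the core counts and clock speeds. The sketch below does this for the GPU Boost rows. It rests on two assumptions that are not stated in the table: each unit completes one fused multiply-add (counted as two floating-point operations) per clock, and each GK110/GK210 SMX has 64 double-precision units. The helper `peak_tflops` is a hypothetical name used only for this illustration.

```python
FLOPS_PER_FMA = 2       # assumption: one fused multiply-add counts as two operations
DP_UNITS_PER_SMX = 64   # assumption: double-precision units per SMX on GK110/GK210


def peak_tflops(units, clock_mhz):
    """Peak throughput in TFLOPS for `units` arithmetic units at `clock_mhz`."""
    return units * FLOPS_PER_FMA * clock_mhz * 1e6 / 1e12


# Core counts, SMX counts, and the top boost clock are taken from the table above.
gpus = {
    "Tesla K80": {"cores": 4992, "smx": 26, "boost_mhz": 875},
    "Tesla K40": {"cores": 2880, "smx": 15, "boost_mhz": 875},
}

for name, gpu in gpus.items():
    single = peak_tflops(gpu["cores"], gpu["boost_mhz"])
    double = peak_tflops(gpu["smx"] * DP_UNITS_PER_SMX, gpu["boost_mhz"])
    print(f"{name}: {single:.3f} TFLOPS single, {double:.3f} TFLOPS double")
```

Under these assumptions the results agree with the GPU Boost rows to within rounding. The base-clock rows can be checked the same way by substituting the base clock.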

The models listed below are still available for sale in certain scenarios, but are not generally recommended. They offer lower performance than Tesla K40 or K80 (and do not cost any less).

Feature	Tesla K20X	Tesla K20	Tesla K10
GPU Chip(s)	Kepler GK110	Kepler GK110	2x Kepler GK104
Peak Single Precision	3.95 TFLOPS	3.52 TFLOPS	2.3 TFLOPS per GPU
Peak Double Precision	1.32 TFLOPS	1.17 TFLOPS	95 GFLOPS per GPU
Onboard GDDR5 Memory¹	6GB	5GB	4GB per GPU
Memory Bandwidth¹	250 GB/s	208 GB/s	160 GB/s per GPU
PCI-Express Generation	2.0	2.0	3.0
Achievable PCI-E transfer bandwidth	6 GB/s	6 GB/s	11 GB/s
# of SMX Units	14	13	8 per GPU
# of CUDA Cores	2688	2496	1536 per GPU
Memory Clock	2600 MHz	2600 MHz	2500 MHz
GPU Base Clock	732 MHz	705 MHz	745 MHz
GPU Boost Support	Limited	–	–
GPU Boost Clocks	758 MHz, 784 MHz	–	–
Architecture features	SMX, Dynamic Parallelism, Hyper-Q	SMX, Dynamic Parallelism, Hyper-Q	SMX
Compute Capability	3.5	3.5	3.0
Workstation Support	–	Yes	–
Server Support	Yes	Yes	Yes
Wattage (TDP)	235W	225W	225W

1. Measured with ECC disabled. Memory capacity and performance are reduced with ECC enabled.
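
The gap between PCI-Express transfer bandwidth and on-board memory bandwidth is one reason to avoid unnecessary traffic between host and GPU. The following sketch compares the two using figures from the tables. The 4 GB payload and the helper name `transfer_seconds` are hypothetical and chosen only for illustration; the calculation ignores latency and assumes the quoted bandwidths are sustained.

```python
def transfer_seconds(gigabytes, gb_per_second):
    """Idealized time to move `gigabytes` of data at a sustained rate."""
    return gigabytes / gb_per_second


# Achievable PCI-E bandwidth and on-board memory bandwidth (GB/s) from the tables.
boards = {
    "Tesla K20X": {"pcie": 6.0, "memory": 250.0},
    "Tesla K40": {"pcie": 12.0, "memory": 288.0},
}

payload_gb = 4.0  # hypothetical working set copied from host to GPU

for name, bw in boards.items():
    over_pcie = transfer_seconds(payload_gb, bw["pcie"])
    in_memory = transfer_seconds(payload_gb, bw["memory"])
    ratio = bw["memory"] / bw["pcie"]
    print(f"{name}: {over_pcie:.3f} s over PCI-E, "
          f"{in_memory:.4f} s from GPU memory ({ratio:.1f}x faster)")
```

Even with PCI-Express 3.0, reading data that already resides in GPU memory is more than an order of magnitude faster than fetching it from the host, so keeping data on the device between kernels usually pays off.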

Comparison between “Fermi” and “Kepler” GPU Architectures

Feature	Fermi GF100	Fermi GF104	Kepler GK104	Kepler GK110(b)	Kepler GK210
Compute Capability	2.0	2.1	3.0	3.5	3.7
Threads per Warp	32	32	32	32	32
Max Warps per SM	48	48	64	64	64
Max Threads per SM	1536	1536	2048	2048	2048
Max Thread Blocks per SM	8	8	16	16	16
32-bit Registers per SM	32 K	32 K	64 K	64 K	128 K
Max Registers per Thread Block	32 K	32 K	64 K	64 K	64 K
Max Registers per Thread	63	63	63	255	255
Max Threads per Thread Block	1024	1024	1024	1024	1024
Shared Memory Configurations (remainder is configured as L1 Cache)	16KB + 48KB L1 Cache; 48KB + 16KB L1 Cache (64KB total)	16KB + 48KB L1 Cache; 48KB + 16KB L1 Cache (64KB total)	16KB + 48KB L1 Cache; 32KB + 32KB L1 Cache; 48KB + 16KB L1 Cache (64KB total)	16KB + 48KB L1 Cache; 32KB + 32KB L1 Cache; 48KB + 16KB L1 Cache (64KB total)	80KB + 48KB L1 Cache; 96KB + 32KB L1 Cache; 112KB + 16KB L1 Cache (128KB total)
Max Shared Memory per Thread Block	48KB	48KB	48KB	48KB	48KB
Max X Grid Dimension	2^16-1	2^16-1	2^31-1	2^31-1	2^31-1
Hyper-Q	–	–	–	Yes	Yes
Dynamic Parallelism	–	–	–	Yes	Yes
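
The register rows of this table determine how many threads can be resident on a multiprocessor at once. The sketch below estimates that number from a kernel's per-thread register use. It is a simplified model under stated assumptions: it rounds down to whole warps but ignores the hardware's register allocation granularity as well as the shared memory and thread block limits, so real occupancy may be lower. The `ARCH` dictionary and the function `resident_threads` are hypothetical names introduced for this illustration.

```python
THREADS_PER_WARP = 32

# Per-multiprocessor limits taken from the comparison table (1 K = 1024 registers).
ARCH = {
    "Fermi GF100": {"regs_per_sm": 32 * 1024, "max_threads": 1536, "max_regs_per_thread": 63},
    "Kepler GK104": {"regs_per_sm": 64 * 1024, "max_threads": 2048, "max_regs_per_thread": 63},
    "Kepler GK110": {"regs_per_sm": 64 * 1024, "max_threads": 2048, "max_regs_per_thread": 255},
    "Kepler GK210": {"regs_per_sm": 128 * 1024, "max_threads": 2048, "max_regs_per_thread": 255},
}


def resident_threads(arch, regs_per_thread):
    """Estimate resident threads per multiprocessor, limited by registers only."""
    limits = ARCH[arch]
    if not 1 <= regs_per_thread <= limits["max_regs_per_thread"]:
        raise ValueError(f"{arch} cannot give each thread {regs_per_thread} registers")
    by_registers = limits["regs_per_sm"] // regs_per_thread
    # Threads are scheduled in whole warps, so round down to a multiple of 32.
    whole_warps = (by_registers // THREADS_PER_WARP) * THREADS_PER_WARP
    return min(limits["max_threads"], whole_warps)


for arch in ("Kepler GK110", "Kepler GK210"):
    threads = resident_threads(arch, 64)
    share = threads / ARCH[arch]["max_threads"]
    print(f"{arch}: {threads} resident threads at 64 registers/thread ({share:.0%})")
```

In this model, a kernel that needs 64 registers per thread fills only half of a GK110 multiprocessor but all of a GK210 multiprocessor, which shows where the doubled register file on Tesla K80 helps.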
