
In-Depth Comparison of NVIDIA Tesla Kepler GPU Accelerators

This article provides in-depth details of the NVIDIA Tesla K-series GPU accelerators (codenamed “Kepler”). “Kepler” GPUs improve upon the previous-generation “Fermi” architecture.

Important changes available in the “Kepler” GPU architecture include:

  • Dynamic parallelism supports GPU threads launching new threads. This simplifies parallel programming and avoids unnecessary communication between the GPU and the CPU.
  • Hyper-Q enables up to 32 work queues per GPU, allowing multiple CPU cores and MPI processes to issue work to the GPU concurrently. This greatly improves utilization of GPU resources.
  • SMX architecture provides a new streaming multiprocessor design optimized for performance per watt. Each SM contains 192 CUDA cores (up from 32 cores in Fermi).
  • PCI-Express generation 3.0 doubles data transfer rates between the host and the GPU.
  • GPU Boost increases the clock speed of all CUDA cores, providing a performance boost of 30% or more for many common applications.
  • Each Kepler SM contains twice as many registers as a Fermi SM (with another 2X on Tesla K80). Each thread may address four times as many registers.
  • Shared memory bank width is doubled from 4 bytes to 8 bytes, which likewise doubles shared memory bandwidth. Tesla K80 features an additional 2X increase in shared memory size.
  • Shuffle instructions allow threads within a warp to exchange data without using shared memory.
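As a sketch of the shuffle instructions mentioned above, the warp-level reduction below sums values across a warp without touching shared memory. Function and kernel names are illustrative; note that `__shfl_down` is the original Kepler-era intrinsic, superseded by `__shfl_down_sync` in CUDA 9 and later.

```cuda
// Warp-level sum reduction using Kepler shuffle instructions.
__inline__ __device__ float warpReduceSum(float val) {
    // Each step halves the number of participating lanes;
    // after log2(32) = 5 steps, lane 0 holds the warp's sum.
    for (int offset = 16; offset > 0; offset /= 2)
        val += __shfl_down(val, offset);
    return val;
}

__global__ void sumKernel(const float *in, float *out, int n) {
    float v = 0.0f;
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) v = in[i];
    v = warpReduceSum(v);
    // Lane 0 of each warp contributes its partial sum.
    if ((threadIdx.x & 31) == 0) atomicAdd(out, v);
}
```

Because the data never leaves the register file, this avoids both the shared memory traffic and the `__syncthreads()` barriers that a conventional tree reduction requires.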

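The dynamic parallelism feature listed above can be sketched as follows (kernel names are hypothetical; requires a device of compute capability 3.5 or later and compilation with relocatable device code, e.g. `nvcc -arch=sm_35 -rdc=true`):

```cuda
// Child kernel: ordinary device work.
__global__ void childKernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

// Parent kernel: a GPU thread launches further work directly,
// with no round trip to the CPU.
__global__ void parentKernel(float *data, int n) {
    if (threadIdx.x == 0 && blockIdx.x == 0)
        childKernel<<<(n + 255) / 256, 256>>>(data, n);
}
```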
“Kepler” Tesla GPU Specifications

The table below summarizes the features of the available Tesla GPU Accelerators. To learn more about any of these products, or to find out how best to leverage their capabilities, please speak with an HPC expert.

Comparison between “Fermi” and “Kepler” GPU Architectures

Feature                              Fermi GF100   Fermi GF104   Kepler GK104   Kepler GK110(b)   Kepler GK210
Compute Capability                   2.0           2.1           3.0            3.5               3.7
Threads per Warp                     32            32            32             32                32
Max Warps per SM                     48            48            64             64                64
Max Threads per SM                   1536          1536          2048           2048              2048
Max Thread Blocks per SM             8             8             16             16                16
32-bit Registers per SM              32 K          32 K          64 K           64 K              128 K
Max Registers per Thread Block       32 K          32 K          64 K           64 K              64 K
Max Registers per Thread             63            63            63             255               255
Max Threads per Thread Block         1024          1024          1024           1024              1024
Max Shared Memory per Thread Block   48 KB         48 KB         48 KB          48 KB             48 KB
Max X Grid Dimension                 2^16 - 1      2^16 - 1      2^31 - 1       2^31 - 1          2^31 - 1
Hyper-Q                              No            No            No             Yes               Yes
Dynamic Parallelism                  No            No            No             Yes               Yes

Shared Memory Configurations (the remainder of each SM's on-chip memory is configured as L1 cache):

  • Fermi GF100 / GF104 (64KB total): 16KB shared + 48KB L1 cache, or 48KB shared + 16KB L1 cache
  • Kepler GK104 / GK110(b) (64KB total): 16KB + 48KB L1, 32KB + 32KB L1, or 48KB + 16KB L1
  • Kepler GK210 (128KB total): 16KB + 112KB L1, 32KB + 96KB L1, or 48KB + 80KB L1
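The per-SM limits in the table above determine how many thread blocks can be resident on each SM at once (occupancy). A minimal sketch of querying this with the CUDA occupancy API (available since CUDA 6.5; `myKernel` is a hypothetical kernel used only for illustration):

```cuda
#include <cstdio>

__global__ void myKernel(float *data) { data[threadIdx.x] += 1.0f; }

int main() {
    int numBlocks = 0;
    const int blockSize = 256;  // 8 warps per block
    // Reports how many blocks of this kernel fit on one SM, given
    // its register, shared memory, and thread-count limits.
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &numBlocks, myKernel, blockSize, 0 /* dynamic shared memory */);
    printf("Resident blocks per SM: %d\n", numBlocks);
    return 0;
}
```

For example, on GK110 (64 warps per SM max), eight resident blocks of eight warps each would be full occupancy, provided the kernel's register and shared memory usage do not impose a lower limit first.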
