This article provides in-depth details of the NVIDIA Tesla P-series GPU accelerators (codenamed “Pascal”). “Pascal” GPUs improve upon the previous-generation “Kepler” and “Maxwell” architectures. Pascal GPUs were announced at GTC 2016 and began shipping in September 2016. Note: these have since been superseded by the NVIDIA Volta GPU architecture.
Important changes available in the “Pascal” GPU architecture include:
- Exceptional performance with up to 5.3 TFLOPS double- and 10.6 TFLOPS single-precision floating-point performance.
- NVLink enables a 5X increase in bandwidth between Tesla Pascal GPUs and from GPUs to supported system CPUs (compared with PCI-E).
- High-bandwidth HBM2 memory provides a 3X improvement in memory performance compared to Kepler and Maxwell GPUs.
- Pascal Unified Memory allows GPU applications to directly access the memory of all GPUs as well as all of system memory (up to 512TB); see the Unified Memory sketch after this list.
- An L2 cache of up to 4MB is available on Pascal GPUs (compared to 1.5MB on Kepler and 3MB on Maxwell).
- Native ECC Memory detects and corrects memory errors without any capacity or performance overhead.
- Energy-efficiency – Pascal GPUs deliver nearly twice the FLOPS per watt of Kepler GPUs.
- Efficient SM units – each Pascal GP100 SM contains half as many CUDA cores as a Maxwell SM but the same number of registers, doubling the registers available per core.
- Improved atomics in Pascal add a native atomic add instruction for FP64 values in global memory (earlier GPUs had to emulate double-precision atomic adds with compare-and-swap loops). Atomics can also be performed on the memory of other GPUs in the system; see the atomics sketch after this list.
- Half-precision (FP16) support improves performance for low-precision operations (frequently used in neural network training); see the FP16 sketch after this list.
- INT8 support improves performance for low-precision integer operations (frequently used in neural network inference); see the DP4A sketch after this list.
- Compute Preemption allows higher-priority tasks to interrupt currently-running tasks.
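Below is a minimal Unified Memory sketch (the kernel and variable names are illustrative, not from NVIDIA's documentation). On Pascal, a single cudaMallocManaged allocation is visible to both the CPU and the GPU, and pages migrate on demand when either side touches them.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Illustrative kernel: increments every element of a managed array.
__global__ void incrementKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] += 1.0f;
}

int main()
{
    const int n = 1 << 20;
    float *data;
    cudaMallocManaged(&data, n * sizeof(float));         // one allocation, one pointer for CPU and GPU

    for (int i = 0; i < n; i++)                          // first touched on the CPU
        data[i] = 0.0f;

    incrementKernel<<<(n + 255) / 256, 256>>>(data, n);  // pages migrate to the GPU on demand
    cudaDeviceSynchronize();

    printf("data[0] = %f\n", data[0]);                   // pages migrate back on CPU access
    cudaFree(data);
    return 0;
}
```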
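The atomics sketch below (again with illustrative names) uses the double-precision overload of atomicAdd, which is only available on compute capability 6.0 and newer; compile with, e.g., nvcc -arch=sm_60. On earlier GPUs the same operation had to be written by hand as an atomicCAS loop.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Illustrative kernel: sums an array of doubles using the native FP64 atomicAdd
// introduced with Pascal (compute capability 6.0+).
__global__ void sumKernel(const double *in, double *result, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicAdd(result, in[i]);   // double-precision overload: sm_60 and newer only
}

int main()
{
    const int n = 1 << 20;
    double *in, *result;
    cudaMallocManaged(&in, n * sizeof(double));
    cudaMallocManaged(&result, sizeof(double));
    for (int i = 0; i < n; i++) in[i] = 1.0;
    *result = 0.0;

    sumKernel<<<(n + 255) / 256, 256>>>(in, result, n);
    cudaDeviceSynchronize();
    printf("sum = %.1f\n", *result);   // expect 1048576.0

    cudaFree(in);
    cudaFree(result);
    return 0;
}
```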
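A minimal FP16 sketch (illustrative kernel name): the half2 type packs two FP16 values so that GP100 can execute two half-precision operations per instruction. All conversions are done on the device so the host code stays in plain float; build with, e.g., nvcc -arch=sm_60.

```cuda
#include <cstdio>
#include <cuda_fp16.h>
#include <cuda_runtime.h>

// Illustrative kernel: scales pairs of floats using packed half2 arithmetic.
__global__ void scaleHalf2(int n, float alpha, const float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (2 * i + 1 < n) {
        __half2 a = __float2half2_rn(alpha);                    // broadcast alpha into both halves
        __half2 v = __floats2half2_rn(x[2 * i], x[2 * i + 1]);  // pack two floats into one half2
        __half2 r = __hmul2(a, v);                              // two FP16 multiplies in one instruction
        y[2 * i]     = __low2float(r);                          // unpack back to float
        y[2 * i + 1] = __high2float(r);
    }
}

int main()
{
    const int n = 1024;
    float *x, *y;
    cudaMallocManaged(&x, n * sizeof(float));
    cudaMallocManaged(&y, n * sizeof(float));
    for (int i = 0; i < n; i++) x[i] = (float)i;

    scaleHalf2<<<(n / 2 + 255) / 256, 256>>>(n, 0.5f, x, y);
    cudaDeviceSynchronize();
    printf("y[10] = %f\n", y[10]);   // expect 5.0

    cudaFree(x);
    cudaFree(y);
    return 0;
}
```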
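INT8 acceleration is exposed through the DP4A instruction (the __dp4a intrinsic), which computes a four-way dot product of packed 8-bit integers with a 32-bit accumulator. Note that DP4A is a compute capability 6.1 feature (GP102/GP104 parts), not GP100; the illustrative sketch below would be built with nvcc -arch=sm_61.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Illustrative kernel: INT8 dot product. Each 32-bit int carries four packed
// signed 8-bit values; __dp4a multiplies them pairwise and adds the result to
// a 32-bit accumulator, avoiding overflow of the 8-bit range.
__global__ void dotInt8(int n, const int *a, const int *b, int *result)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        int partial = __dp4a(a[i], b[i], 0);   // 4-element INT8 dot product (sm_61+)
        atomicAdd(result, partial);
    }
}

int main()
{
    const int n = 256;
    int *a, *b, *result;
    cudaMallocManaged(&a, n * sizeof(int));
    cudaMallocManaged(&b, n * sizeof(int));
    cudaMallocManaged(&result, sizeof(int));
    for (int i = 0; i < n; i++) {
        a[i] = 0x01010101;   // four packed INT8 values, each equal to 1
        b[i] = 0x02020202;   // four packed INT8 values, each equal to 2
    }
    *result = 0;

    dotInt8<<<(n + 255) / 256, 256>>>(n, a, b, result);
    cudaDeviceSynchronize();
    printf("dot product = %d\n", *result);   // expect 256 * (4 * 1 * 2) = 2048

    cudaFree(a); cudaFree(b); cudaFree(result);
    return 0;
}
```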
Tesla “Pascal” GPU Specifications
The table below summarizes the features of the available Tesla Pascal GPU Accelerators. To learn more about any of these products, or to find out how best to leverage their capabilities, please speak with an HPC expert.
Comparison between “Kepler”, “Maxwell”, and “Pascal” GPU Architectures
Feature | Kepler GK210 | Maxwell GM200 | Maxwell GM204 | Pascal GP100 | Pascal GP102 |
---|---|---|---|---|---|
Compute Capability | 3.7 | 5.2 | 5.2 | 6.0 | 6.1 |
Threads per Warp | 32 | 32 | 32 | 32 | 32 |
Max Warps per SM | 64 | 64 | 64 | 64 | 64 |
Max Threads per SM | 2048 | 2048 | 2048 | 2048 | 2048 |
Max Thread Blocks per SM | 16 | 32 | 32 | 32 | 32 |
Max Concurrent Kernels | 32 | 32 | 32 | 128 | 32 |
32-bit Registers per SM | 128 K | 64 K | 64 K | 64 K | 64 K |
Max Registers per Thread Block | 64 K | 64 K | 64 K | 64 K | 64 K |
Max Registers per Thread | 255 | 255 | 255 | 255 | 255 |
Max Threads per Thread Block | 1024 | 1024 | 1024 | 1024 | 1024 |
L1 Cache Configuration | split with shared memory | 24KB dedicated L1 cache | 24KB dedicated L1 cache | 24KB dedicated L1 cache | 24KB dedicated L1 cache |
Shared Memory Configurations | 16KB + 112KB L1 Cache, 32KB + 96KB L1 Cache, or 48KB + 80KB L1 Cache (128KB total) | 96KB dedicated | 96KB dedicated | 64KB dedicated | 96KB dedicated |
Max Shared Memory per Thread Block | 48KB | 48KB | 48KB | 48KB | 48KB |
Max X Grid Dimension | 2³¹-1 | 2³¹-1 | 2³¹-1 | 2³¹-1 | 2³¹-1 |
Hyper-Q | Yes | Yes | Yes | Yes | Yes |
Dynamic Parallelism | Yes | Yes | Yes | Yes | Yes |
For a complete listing of Compute Capabilities, refer to the NVIDIA CUDA Documentation.
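As a quick way to check these limits on an installed GPU (rather than reading them from a table), the CUDA runtime reports each device's compute capability and per-SM resources via cudaGetDeviceProperties; the short sketch below simply prints a few of the fields used in the comparison above.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int d = 0; d < count; d++) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, d);
        printf("Device %d: %s (compute capability %d.%d)\n",
               d, prop.name, prop.major, prop.minor);
        printf("  Threads per SM: %d, Registers per SM: %d, Shared memory per SM: %zu KB\n",
               prop.maxThreadsPerMultiProcessor,
               prop.regsPerMultiprocessor,
               prop.sharedMemPerMultiprocessor / 1024);
    }
    return 0;
}
```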
Additional Tesla “Pascal” GPU products
NVIDIA has also released Tesla P4 GPUs. These GPUs are primarily for embedded and hyperscale deployments, and are not expected to be used in the HPC space.
Hardware-accelerated video encoding and decoding
All NVIDIA “Pascal” GPUs include one or more hardware units for video encoding and decoding (NVENC / NVDEC). For complete hardware details, reference NVIDIA’s encoder/decoder support matrix. To learn more about GPU-accelerated video encode/decode, see NVIDIA’s Video Codec SDK.