This article provides in-depth details of the NVIDIA Tesla V-series GPU accelerators (codenamed “Volta”). “Volta” GPUs improve upon the previous-generation “Pascal” architecture. Volta GPUs began shipping in September 2017 and were updated to 32GB of memory in March 2018; Tesla V100S was released in late 2019. Note: these have since been superseded by the NVIDIA Ampere GPU architecture.
This page is intended as a quick reference for the key specifications of these GPUs. For more extended discussion, see our Tesla V100 Price Analysis and Tesla V100 GPU Review.
Important features available in the “Volta” GPU architecture include:
- Exceptional HPC performance with up to 8.2 TFLOPS double- and 16.4 TFLOPS single-precision floating-point performance.
- Deep Learning training performance with up to 130 TFLOPS FP16 half-precision floating-point performance.
- Deep Learning inference performance with up to 62.8 TeraOPS INT8 8-bit integer performance.
- Simultaneous execution of FP32 and INT32 operations improves the overall computational throughput of the GPU.
- NVLink enables an 8X to 10X increase in bandwidth between the Tesla GPUs and from GPUs to supported system CPUs (compared with PCI-E).
- High-bandwidth HBM2 memory provides a 3X improvement in memory performance compared to previous-generation GPUs.
- Enhanced Unified Memory allows GPU applications to directly access the memory of all GPUs as well as all of system memory (up to 512TB).
- Native ECC Memory detects and corrects memory errors without any capacity or performance overhead.
- Combined L1 Cache and Shared Memory provides additional flexibility and higher performance than Pascal.
- Cooperative Groups, a new programming model introduced in CUDA 9, organizes groups of communicating threads.
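The peak floating-point figures in the list above follow from simple arithmetic: execution units × fused multiply-adds (FMAs) per clock × 2 operations per FMA × clock rate. The Python sketch below walks through that calculation. The unit counts, the Tensor Core FMA rate, and the boost clock are assumptions (commonly published Tesla V100S figures that are not stated in this article), so treat the results as approximate theoretical peaks rather than measured performance.

```python
# Theoretical peak-throughput arithmetic for a Volta GV100-based Tesla V100S.
# ASSUMPTION: none of the constants below appear in this article; they are
# commonly published V100S figures and are used here for illustration only.
FP64_UNITS = 2560          # assumed FP64 units
FP32_CORES = 5120          # assumed FP32 CUDA cores
TENSOR_CORES = 640         # assumed Tensor Cores
BOOST_CLOCK_GHZ = 1.601    # assumed boost clock in GHz

OPS_PER_FMA = 2            # one fused multiply-add counts as two operations
FMA_PER_TENSOR_CORE = 64   # assumed: one 4x4x4 matrix FMA per clock


def peak_tflops(units: int, fma_per_unit_per_clock: int, clock_ghz: float) -> float:
    """Return theoretical peak throughput in TFLOPS."""
    gflops = units * fma_per_unit_per_clock * OPS_PER_FMA * clock_ghz
    return gflops / 1000.0


fp64_tflops = peak_tflops(FP64_UNITS, 1, BOOST_CLOCK_GHZ)
fp32_tflops = peak_tflops(FP32_CORES, 1, BOOST_CLOCK_GHZ)
tensor_tflops = peak_tflops(TENSOR_CORES, FMA_PER_TENSOR_CORE, BOOST_CLOCK_GHZ)

print(f"FP64 peak:        {fp64_tflops:.2f} TFLOPS")
print(f"FP32 peak:        {fp32_tflops:.2f} TFLOPS")
print(f"Tensor FP16 peak: {tensor_tflops:.1f} TFLOPS")
```

With these assumed inputs the results come out at roughly 8.2, 16.4, and 131 TFLOPS, in line with the "up to" figures quoted above. Sustained application performance will be lower.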
Tesla “Volta” GPU Specifications
The table below summarizes the features of the available Tesla Volta GPU Accelerators. To learn more about these products, or to find out how best to leverage their capabilities, please speak with an HPC expert.
Comparison between “Kepler”, “Pascal”, and “Volta” GPU Architectures
Feature | Kepler GK210 | Pascal GP100 | Volta GV100 |
---|---|---|---|
Compute Capability ^ | 3.7 | 6.0 | 7.0 |
Threads per Warp | 32 | 32 | 32 |
Max Warps per SM | 64 | 64 | 64 |
Max Threads per SM | 2048 | 2048 | 2048 |
Max Thread Blocks per SM | 16 | 32 | 32 |
Max Concurrent Kernels | 32 | 128 | 128 |
32-bit Registers per SM | 128 K | 64 K | 64 K |
Max Registers per Thread Block | 64 K | 64 K | 64 K |
Max Registers per Thread | 255 | 255 | 255 |
Max Threads per Thread Block | 1024 | 1024 | 1024 |
L1 Cache Configuration | split with shared memory | 24KB dedicated L1 cache | 32KB to 128KB (dynamic with shared memory) |
Shared Memory Configurations | 16KB + 112KB L1 Cache; 32KB + 96KB L1 Cache; 48KB + 80KB L1 Cache (128KB total) | 64KB | configurable up to 96KB; remainder for L1 Cache (128KB total) |
Max Shared Memory per Thread Block | 48KB | 48KB | 96KB* |
Max X Grid Dimension | 2^31-1 | 2^31-1 | 2^31-1 |
Hyper-Q | Yes | Yes | Yes |
Dynamic Parallelism | Yes | Yes | Yes |
Unified Memory | No | Yes | Yes |
Pre-Emption | No | Yes | Yes |
^ For a complete listing of Compute Capabilities, reference the NVIDIA CUDA Documentation.
* Allocations above 48KB per thread block require dynamic shared memory.
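The per-SM limits in the table above determine how many thread blocks can be resident on one SM at a time. The Python sketch below estimates that number, and the resulting occupancy, from the Volta GV100 column. It is a simplified upper bound: the names `SMLimits`, `resident_blocks`, and `occupancy` are hypothetical helpers, the shared memory carve-out is assumed to be set to its 96KB maximum, and hardware allocation granularity for registers and shared memory is ignored. For authoritative numbers, use NVIDIA's occupancy tooling, such as the `cudaOccupancyMaxActiveBlocksPerMultiprocessor` runtime call.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SMLimits:
    """Per-SM limits; defaults follow the Volta GV100 column of the table."""
    threads_per_warp: int = 32
    max_warps: int = 64
    max_blocks: int = 32
    registers: int = 64 * 1024
    max_threads_per_block: int = 1024
    max_registers_per_thread: int = 255
    # ASSUMPTION: shared memory carve-out configured to its 96KB maximum.
    shared_bytes: int = 96 * 1024


def warps_in_block(threads_per_block: int, sm: SMLimits) -> int:
    """Number of warps needed for one block (rounded up)."""
    return -(-threads_per_block // sm.threads_per_warp)


def resident_blocks(threads_per_block: int, regs_per_thread: int,
                    shared_per_block: int = 0, sm: SMLimits = SMLimits()) -> int:
    """Estimate how many blocks fit on one SM (ignores allocation granularity)."""
    if not 1 <= threads_per_block <= sm.max_threads_per_block:
        raise ValueError("threads_per_block out of range")
    if not 1 <= regs_per_thread <= sm.max_registers_per_thread:
        raise ValueError("regs_per_thread out of range")
    if not 0 <= shared_per_block <= sm.shared_bytes:
        raise ValueError("shared_per_block out of range")

    warps = warps_in_block(threads_per_block, sm)
    regs_per_block = regs_per_thread * warps * sm.threads_per_warp
    if regs_per_block > sm.registers:
        raise ValueError("block needs more registers than one SM provides")

    # Each resource imposes its own cap; the tightest one wins.
    caps = [sm.max_blocks, sm.max_warps // warps, sm.registers // regs_per_block]
    if shared_per_block:
        caps.append(sm.shared_bytes // shared_per_block)
    return min(caps)


def occupancy(threads_per_block: int, regs_per_thread: int,
              shared_per_block: int = 0, sm: SMLimits = SMLimits()) -> float:
    """Fraction of the SM's warp slots occupied by resident blocks."""
    blocks = resident_blocks(threads_per_block, regs_per_thread, shared_per_block, sm)
    return blocks * warps_in_block(threads_per_block, sm) / sm.max_warps


# 256-thread blocks with increasing register and shared memory pressure.
print(occupancy(256, 32))             # limited by warps and registers
print(occupancy(256, 64))             # limited by registers
print(occupancy(256, 32, 48 * 1024))  # limited by shared memory
```

Under these assumptions, a 256-thread block at 32 registers per thread fills the SM, doubling the register count halves occupancy, and requesting 48KB of shared memory per block leaves room for only two blocks.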
Hardware-accelerated video encoding and decoding
All NVIDIA “Volta” GPUs include one or more hardware units for video encoding and decoding (NVENC / NVDEC). For complete hardware details, reference NVIDIA’s encoder/decoder support matrix. To learn more about GPU-accelerated video encode/decode, see NVIDIA’s Video Codec SDK.