This article provides details on the NVIDIA A-series GPUs (codenamed “Ampere”). “Ampere” GPUs improve upon the previous-generation “Volta” and “Turing” architectures. Ampere A100 GPUs began shipping in May 2020 (with other variants shipping by end of 2020).
Note that not all “Ampere” generation GPUs provide the same capabilities and feature sets. Broadly speaking, there is one version dedicated solely to computation and a second version dedicated to a mixture of graphics/visualization and compute. The specifications of both versions are shown below; speak with one of our GPU experts for a personalized summary of the options best suited to your needs.
Computational “Ampere” GPU architecture – important features and changes:
- Exceptional HPC performance:
  - 9.7 TFLOPS of FP64 double-precision floating-point performance
  - Up to 19.5 TFLOPS of FP64 double-precision performance via Tensor Core FP64 instruction support
  - 19.5 TFLOPS of FP32 single-precision floating-point performance
- Exceptional AI deep learning training and inference performance:
  - TensorFloat-32 (TF32) instructions improve performance without loss of accuracy (see the cuBLAS sketch after this list)
  - Sparse matrix optimizations can potentially double training and inference performance
  - Speedups of 3x to 20x for network training with sparse TF32 Tensor Cores (vs. Tesla V100)
  - Speedups of 7x to 20x for inference with sparse INT8 Tensor Cores (vs. Tesla V100)
  - Tensor Cores support many data types: FP64, TF32, BF16, FP16, INT8, INT4, and binary (INT1)
- High-speed HBM2/HBM2e memory delivers 40GB or 80GB of capacity at 1.6TB/s or 2TB/s of throughput
- Multi-Instance GPU (MIG) allows each A100 GPU to be partitioned into as many as seven isolated GPU instances, each running a separate application
- 3rd-generation NVLink doubles transfer speeds between GPUs
- 4th-generation PCI-Express doubles transfer speeds between the system and each GPU
- Native ECC memory detects and corrects memory errors without any capacity or performance overhead
- Larger and faster L1 cache and shared memory for improved performance
- Improved L2 cache is twice as fast and nearly seven times as large as the L2 on Tesla V100
- Compute Data Compression accelerates compressible data patterns, delivering up to 4x faster DRAM bandwidth, up to 4x faster L2 read bandwidth, and up to 2x L2 capacity
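As a concrete illustration of the TF32 feature, the sketch below opts a standard FP32 GEMM into TF32 Tensor Core math via cuBLAS (CUDA 11 or newer). This is a minimal sketch, not a benchmark: error checking is omitted and the matrices are left uninitialized, since only the API flow is being shown.

```cuda
// Minimal sketch: running an FP32 GEMM on Ampere's TF32 Tensor Cores via cuBLAS.
// Compile with: nvcc tf32_gemm.cu -lcublas
#include <cublas_v2.h>
#include <cuda_runtime.h>

int main(void) {
    cublasHandle_t handle;
    cublasCreate(&handle);

    // Allow cuBLAS to use TF32 Tensor Core math for FP32 GEMMs (Ampere and newer).
    // The default (CUBLAS_DEFAULT_MATH) keeps full FP32 precision on the CUDA cores.
    cublasSetMathMode(handle, CUBLAS_TF32_TENSOR_OP_MATH);

    const int n = 1024;
    float *A, *B, *C;
    cudaMalloc(&A, n * n * sizeof(float));
    cudaMalloc(&B, n * n * sizeof(float));
    cudaMalloc(&C, n * n * sizeof(float));

    // Inputs and outputs remain FP32; the multiply internally rounds
    // operands to TF32 (same 8-bit exponent range as FP32, 10-bit mantissa).
    const float alpha = 1.0f, beta = 0.0f;
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                &alpha, A, n, B, n, &beta, C, n);

    cudaDeviceSynchronize();
    cudaFree(A); cudaFree(B); cudaFree(C);
    cublasDestroy(handle);
    return 0;
}
```

Because TF32 preserves the FP32 exponent range, this mode typically requires no changes to loss scaling or hyperparameters, which is why frameworks enable it by default on Ampere.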
Visualization “Ampere” GPU architecture – important features and changes:
- Double FP32 processing throughput with upgraded Streaming Multiprocessors (SMs) that support FP32 computation on both datapaths (previous generations provided one dedicated FP32 path and one dedicated integer path)
- 2nd-generation RT Cores provide up to a 2x increase in raytracing performance
- 3rd-generation Tensor Cores with TF32 and support for sparsity optimizations
- 3rd-generation NVLink provides up to 56.25 GB/s of bandwidth between pairs of GPUs in each direction
- GDDR6 memory providing up to 768 GB/s of GPU memory throughput
- 4th-generation PCI-Express doubles transfer speeds between the system and each GPU
As stated above, the feature sets vary between the “computational” and the “visualization” GPU models. Additional details on each are shared in the tabs below, and the best choice will depend upon your mix of workloads. Please contact our team for additional review and discussion.
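One quick way to tell the two variants apart on a live system is to query the compute capability: the computational GA100 chip reports 8.0, while the visualization-oriented GA102 chip reports 8.6 (see the specification tables below). A minimal sketch using the standard CUDA runtime API:

```cuda
// Sketch: enumerate GPUs and report which Ampere variant each one is.
#include <cuda_runtime.h>
#include <stdio.h>

int main(void) {
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int i = 0; i < count; ++i) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        // GA100 (A30/A100) reports compute capability 8.0;
        // GA102 (RTX A5000/A6000, A40) reports 8.6.
        printf("GPU %d: %s, compute capability %d.%d, %d SMs, %.1f GB memory\n",
               i, prop.name, prop.major, prop.minor,
               prop.multiProcessorCount,
               prop.totalGlobalMem / (1024.0 * 1024.0 * 1024.0));
    }
    return 0;
}
```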
NVIDIA “Ampere” GPU Specifications
[tabby title=”High Performance Computing & Deep Learning GPUs”]
The table below summarizes the features of the NVIDIA Ampere GPU accelerators designed for computation and deep learning/AI/ML. Note that the PCI-Express versions of the NVIDIA A100 GPU feature a much lower TDP than the SXM4 version (250W or 300W vs 400W). For this reason, the 250W PCI-Express GPU is not able to sustain peak performance in the same way as the higher-power parts. Thus, the performance values of the A100 40GB PCI-E GPU are shown as a range, and actual performance will vary by workload.
To learn more about these products, or to find out how best to leverage their capabilities, please speak with an HPC/AI expert.
Feature | NVIDIA A30 PCI-E | NVIDIA A100 40GB PCI-E | NVIDIA A100 80GB PCI-E | NVIDIA A100 SXM4
---|---|---|---|---
GPU Chip | Ampere GA100 | Ampere GA100 | Ampere GA100 | Ampere GA100
Tensor Core Performance*† | 10.3 TFLOPS FP64, 82 TFLOPS TF32, 165 TFLOPS FP16, 330 TOPS INT8 | 19.5 TFLOPS FP64, 156 TFLOPS TF32, 312 TFLOPS FP16, 624 TOPS INT8 | 19.5 TFLOPS FP64, 156 TFLOPS TF32, 312 TFLOPS FP16, 624 TOPS INT8 | 19.5 TFLOPS FP64, 156 TFLOPS TF32, 312 TFLOPS FP16, 624 TOPS INT8
Double Precision (FP64) Performance* | 5.2 TFLOPS | 8.7 ~ 9.7 TFLOPS | 9.7 TFLOPS | 9.7 TFLOPS
Single Precision (FP32) Performance* | 10.3 TFLOPS | 17.6 ~ 19.5 TFLOPS | 19.5 TFLOPS | 19.5 TFLOPS
Half Precision (FP16) Performance* | 41 TFLOPS | 70 ~ 78 TFLOPS | 78 TFLOPS | 78 TFLOPS
Brain Floating Point (BF16) Performance* | 20 TFLOPS | 35 ~ 39 TFLOPS | 39 TFLOPS | 39 TFLOPS
GPU Memory | 24GB HBM2 | 40GB HBM2 | 80GB HBM2e | 40GB HBM2 or 80GB HBM2e
Memory Bandwidth | 933 GB/s | 1,555 GB/s | 1,940 GB/s | 1,555 GB/s (40GB) or 2,039 GB/s (80GB)
L2 Cache | 40MB | 40MB | 40MB | 40MB
Interconnect | NVLink 3.0 (4 bricks) + PCI-E 4.0; NVLink is limited to pairs of directly-linked cards | NVLink 3.0 (12 bricks) + PCI-E 4.0; NVLink is limited to pairs of directly-linked cards | NVLink 3.0 (12 bricks) + PCI-E 4.0; NVLink is limited to pairs of directly-linked cards | NVLink 3.0 (12 bricks) + PCI-E 4.0
GPU-to-GPU transfer bandwidth (bidirectional) | 200 GB/s | 600 GB/s | 600 GB/s | 600 GB/s
Host-to-GPU transfer bandwidth (bidirectional) | 64 GB/s | 64 GB/s | 64 GB/s | 64 GB/s
# of MIG instances supported | up to 4 | up to 7 | up to 7 | up to 7
# of SM Units | 56 | 108 | 108 | 108
# of Tensor Cores | 224 | 432 | 432 | 432
# of INT32 CUDA Cores | 3,584 | 6,912 | 6,912 | 6,912
# of FP32 CUDA Cores | 3,584 | 6,912 | 6,912 | 6,912
# of FP64 CUDA Cores | 1,792 | 3,456 | 3,456 | 3,456
GPU Base Clock | 930 MHz | 765 MHz | 1065 MHz | 1095 MHz
GPU Boost Support | Yes – Dynamic | Yes – Dynamic | Yes – Dynamic | Yes – Dynamic
GPU Boost Clock | 1440 MHz | 1410 MHz | 1410 MHz | 1410 MHz
Compute Capability | 8.0 | 8.0 | 8.0 | 8.0
Workstation Support | no | no | no | no
Server Support | yes | yes | yes | yes
Cooling Type | Passive | Passive | Passive | Passive
Wattage (TDP) | 165W | 250W | 300W | 400W
* theoretical peak performance based on GPU boost clock
† an additional 2X performance can be achieved via NVIDIA’s new sparsity feature
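The starred peak values follow directly from the core counts and boost clocks in the table above: each CUDA core can retire one fused multiply-add (two floating-point operations) per clock. A short sketch of that arithmetic, using the A100 figures:

```cuda
// Sketch: deriving the A100's theoretical peak TFLOPS from the table's figures.
#include <stdio.h>

int main(void) {
    const double boost_ghz  = 1.41;    // 1410 MHz GPU boost clock
    const double fp64_cores = 3456.0;  // double-precision CUDA cores
    const double fp32_cores = 6912.0;  // single-precision CUDA cores
    const double flops_per_clock = 2.0;  // one fused multiply-add = 2 FLOPs

    // cores x FLOPs/clock x GHz = GFLOPS; divide by 1000 for TFLOPS
    printf("A100 FP64 peak: %.1f TFLOPS\n", fp64_cores * flops_per_clock * boost_ghz / 1000.0);
    printf("A100 FP32 peak: %.1f TFLOPS\n", fp32_cores * flops_per_clock * boost_ghz / 1000.0);
    return 0;
}
```

Running this prints 9.7 and 19.5 TFLOPS, matching the table; the same arithmetic applies to the A30 using its 1440 MHz boost clock and core counts.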
[tabby title=”Visualization & Ray Tracing GPUs”]
The table below summarizes the features of the NVIDIA Ampere GPU Accelerators designed for visualization and ray tracing. Note that these GPUs will not necessarily be connected directly to a display device; they may instead perform remote rendering from a datacenter.
To learn more about these GPUs and to review which are the best options for you, please speak with a GPU expert.
Feature | NVIDIA RTX A5000 | NVIDIA RTX A6000 | NVIDIA A40
---|---|---|---
GPU Chip | Ampere GA102 | Ampere GA102 | Ampere GA102
Tensor Core Performance*† | 55.6 TFLOPS TF32, 111.1 TFLOPS FP16 | 77.4 TFLOPS TF32, 154.8 TFLOPS FP16 | 74.8 TFLOPS TF32, 149.7 TFLOPS FP16
Double Precision (FP64) Performance* | 0.4 TFLOPS | 0.6 TFLOPS | 0.6 TFLOPS
Single Precision (FP32) Performance* | 27.8 TFLOPS | 38.7 TFLOPS | 37.4 TFLOPS
Integer (INT32) Performance* | 13.9 TOPS | 19.4 TOPS | 18.7 TOPS
GPU Memory | 24GB GDDR6 | 48GB GDDR6 | 48GB GDDR6
Memory Bandwidth | 768 GB/s | 768 GB/s | 696 GB/s
L2 Cache | 6MB | 6MB | 6MB
Interconnect | NVLink 3.0 + PCI-E 4.0; NVLink is limited to pairs of directly-linked cards | NVLink 3.0 + PCI-E 4.0; NVLink is limited to pairs of directly-linked cards | NVLink 3.0 + PCI-E 4.0; NVLink is limited to pairs of directly-linked cards
GPU-to-GPU transfer bandwidth (bidirectional) | 112.5 GB/s | 112.5 GB/s | 112.5 GB/s
Host-to-GPU transfer bandwidth (bidirectional) | 64 GB/s | 64 GB/s | 64 GB/s
# of MIG instances supported | N/A | N/A | N/A
# of SM Units | 64 | 84 | 84
# of RT Cores | 64 | 84 | 84
# of Tensor Cores | 256 | 336 | 336
# of INT32 CUDA Cores | 8,192 | 10,752 | 10,752
# of FP32 CUDA Cores | 8,192 | 10,752 | 10,752
# of FP64 CUDA Cores | 128 | 168 | 168
GPU Base Clock | not published | not published | not published
GPU Boost Support | Yes – Dynamic | Yes – Dynamic | Yes – Dynamic
GPU Boost Clock | not published | not published | not published
Compute Capability | 8.6 | 8.6 | 8.6
Workstation Support | yes | yes | no
Server Support | no | no | yes
Cooling Type | Active | Active | Passive
Wattage (TDP) | 230W | 300W | 300W
* theoretical peak performance based on GPU boost clock
† an additional 2X performance can be achieved via NVIDIA’s new sparsity feature
Several lower-end graphics cards and datacenter GPUs are also available, including the RTX A2000, RTX A4000, A10, and A16. These GPUs offer similar capabilities, but at lower levels of performance and at lower price points.
[tabbyending]
Comparison between “Pascal”, “Volta”, and “Ampere” GPU Architectures
Feature | Pascal GP100 | Volta GV100 | Ampere GA100
---|---|---|---
Compute Capability* | 6.0 | 7.0 | 8.0
Threads per Warp | 32 | 32 | 32
Max Warps per SM | 64 | 64 | 64
Max Threads per SM | 2048 | 2048 | 2048
Max Thread Blocks per SM | 32 | 32 | 32
Max Concurrent Kernels | 128 | 128 | 128
32-bit Registers per SM | 64 K | 64 K | 64 K
Max Registers per Block | 64 K | 64 K | 64 K
Max Registers per Thread | 255 | 255 | 255
Max Threads per Block | 1024 | 1024 | 1024
L1 Cache Configuration | 24KB dedicated cache | 32KB ~ 128KB (dynamic with shared memory) | 28KB ~ 192KB (dynamic with shared memory)
Shared Memory Configurations | 64KB | configurable up to 96KB; remainder for L1 cache (128KB total) | configurable up to 164KB; remainder for L1 cache (192KB total)
Max Shared Memory per SM | 64KB | 96KB | 164KB
Max Shared Memory per Thread Block | 48KB | 96KB | 160KB
Max X Grid Dimension | 2^32 - 1 | 2^32 - 1 | 2^32 - 1
Tensor Cores | No | Yes | Yes
Mixed Precision Warp-Matrix Functions | No | Yes | Yes
Hardware-accelerated async-copy | No | No | Yes
L2 Cache Residency Management | No | No | Yes
Dynamic Parallelism | Yes | Yes | Yes
Unified Memory | Yes | Yes | Yes
Preemption | Yes | Yes | Yes
* For a complete listing of Compute Capabilities, reference the NVIDIA CUDA Documentation
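Several of the Ampere-only rows above are programmable features rather than raw specs. As one illustration, here is a minimal, hedged sketch of the hardware-accelerated async-copy (compute capability 8.0+) using the cooperative groups memcpy_async API from CUDA 11. The kernel and its names are our own example, and it assumes n is a multiple of the block size:

```cuda
#include <cooperative_groups.h>
#include <cooperative_groups/memcpy_async.h>

namespace cg = cooperative_groups;

// Example kernel: stage a tile of input through shared memory, then scale it.
// On compute capability 8.0+, memcpy_async maps to the hardware async-copy path,
// moving data from global to shared memory without staging through registers.
__global__ void scaleTile(const float* in, float* out, float factor, int n) {
    extern __shared__ float tile[];
    cg::thread_block block = cg::this_thread_block();

    int base = blockIdx.x * blockDim.x;  // assumes n is a multiple of blockDim.x
    cg::memcpy_async(block, tile, in + base, sizeof(float) * blockDim.x);
    cg::wait(block);  // block until the copies issued by this block have landed

    int i = base + threadIdx.x;
    if (i < n) out[i] = tile[threadIdx.x] * factor;
}

// Launch example:
//   scaleTile<<<n / 256, 256, 256 * sizeof(float)>>>(d_in, d_out, 2.0f, n);
```

On pre-Ampere GPUs the same API still works but falls back to ordinary loads and stores; note also that using shared memory allocations above 48KB per block (up to Ampere's 160KB) requires opting in via cudaFuncSetAttribute with cudaFuncAttributeMaxDynamicSharedMemorySize.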
Hardware-accelerated raytracing, video encoding, video decoding, and image decoding
The NVIDIA “Ampere” datacenter GPUs designed for computational workloads do not include graphics acceleration features: RT cores for accelerated raytracing and hardware video encoding units (NVENC) are absent from the A30 and A100 GPUs.
To accelerate computational workloads that require processing of image or video files, the A100 includes five JPEG decoding (NVJPG) units and five video decoding (NVDEC) units. Details are described in NVIDIA’s A100 for computer vision blog post.
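For illustration, below is a hedged sketch of driving JPEG decode through NVIDIA's nvJPEG library (link with -lnvjpeg). The helper name and the omitted error handling are ours; nvJPEG can route decodes to the hardware engine where available (the hardware backend can also be requested explicitly via nvjpegCreateEx with NVJPEG_BACKEND_HARDWARE).

```cuda
// Sketch: decoding one JPEG to RGB planes on the GPU with nvJPEG.
#include <nvjpeg.h>
#include <cuda_runtime.h>

void decode_jpeg(const unsigned char* jpeg_bytes, size_t jpeg_size) {
    nvjpegHandle_t handle;
    nvjpegJpegState_t state;
    nvjpegCreateSimple(&handle);
    nvjpegJpegStateCreate(handle, &state);

    // Query image dimensions/channels so the output buffers can be sized.
    int channels, widths[NVJPEG_MAX_COMPONENT], heights[NVJPEG_MAX_COMPONENT];
    nvjpegChromaSubsampling_t subsampling;
    nvjpegGetImageInfo(handle, jpeg_bytes, jpeg_size,
                       &channels, &subsampling, widths, heights);

    // Allocate one device plane per output RGB channel.
    nvjpegImage_t out;
    for (int c = 0; c < 3; ++c) {
        cudaMalloc((void**)&out.channel[c], (size_t)widths[0] * heights[0]);
        out.pitch[c] = widths[0];
    }

    // The decode itself runs on the GPU (hardware JPEG engine where available).
    nvjpegDecode(handle, state, jpeg_bytes, jpeg_size,
                 NVJPEG_OUTPUT_RGB, &out, /*stream=*/0);
    cudaDeviceSynchronize();

    for (int c = 0; c < 3; ++c) cudaFree(out.channel[c]);
    nvjpegJpegStateDestroy(state);
    nvjpegDestroy(handle);
}
```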
For additional details on NVENC and NVDEC, reference NVIDIA’s encoder/decoder support matrix. To learn more about GPU-accelerated video encode/decode, see NVIDIA’s Video Codec SDK.