
Comparison of NVIDIA Tesla/Quadro and NVIDIA GeForce GPUs

This resource was prepared by Microway from data provided by NVIDIA and trusted media sources. All NVIDIA GPUs support general-purpose computation (GPGPU), but not all GPUs offer the same performance or support the same features. The consumer GeForce and RTX lines of GPUs may be attractive to some users running GPU-accelerated applications. However, it’s wise to keep in mind the differences between the products, as many features are only available on the professional Datacenter, RTX Professional, and Tesla GPUs.

FP64 64-bit (Double Precision) Floating Point Calculations

Many applications require higher-accuracy mathematical calculations. In these applications, data is represented by values that are twice as large (using 64 binary bits instead of 32). These larger values are called double-precision (64-bit); the smaller, less precise values are called single-precision (32-bit). Although almost all NVIDIA GPU products support both single- and double-precision calculations, the performance for double-precision values is significantly lower on most consumer-level GeForce GPUs. Here is a comparison of the double-precision floating-point performance of GeForce and Tesla/Quadro GPUs:

NVIDIA GPU Model | Double-precision (64-bit) Floating Point Performance
GeForce GTX Titan X (Maxwell) | up to 0.206 TFLOPS
GeForce GTX 1080 Ti | up to 0.355 TFLOPS
GeForce Titan Xp | up to 0.380 TFLOPS
GeForce Titan V | up to 6.875 TFLOPS
GeForce RTX 2080 Ti | estimated ~0.44 TFLOPS
Titan RTX | estimated ~0.51 TFLOPS
RTX 4090 | ~1.29 TFLOPS
Tesla K80 | 1.87+ TFLOPS
Tesla P100* | 4.7 ~ 5.3 TFLOPS
Quadro GP100 | 5.2 TFLOPS
Tesla V100* | 7 ~ 7.8 TFLOPS
Quadro GV100 | 7.4 TFLOPS
Quadro RTX 6000 and 8000 | ~0.5 TFLOPS
Tesla T4 | estimated ~0.25 TFLOPS
NVIDIA A100 | 9.7 TFLOPS (19.5 TFLOPS for FP64 Tensor Core operations)

* Exact value depends upon PCI-Express or SXM2 SKU
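
To make the precision difference concrete, here is a minimal CPU-side sketch (using Python and NumPy, our choice for illustration; nothing GPU-specific) showing how single precision silently loses small updates that double precision retains:

```python
import numpy as np

# float32 carries a 24-bit significand (~7 decimal digits); float64 carries
# 53 bits (~16 digits). At a magnitude of 1e8, adjacent float32 values are
# spaced 8 apart, so adding 1 is rounded away entirely:
print(np.float32(1e8) + np.float32(1) == np.float32(1e8))  # True  (the +1 is lost)
print(np.float64(1e8) + np.float64(1) == np.float64(1e8))  # False (the +1 is kept)

# Machine epsilon quantifies the relative spacing between representable values:
print(np.finfo(np.float32).eps)  # ~1.19e-07
print(np.finfo(np.float64).eps)  # ~2.22e-16
```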

FP16 16-bit (Half Precision) Floating Point Calculations

Some applications do not require as high an accuracy (e.g., neural network training/inference and certain HPC uses). Support for half-precision FP16 operations was introduced in the “Pascal” generation of GPUs. FP16 was previously the standard for Deep Learning/AI computation; however, Deep Learning workloads have since moved on to more complex operations (see Tensor Cores below). Although all NVIDIA “Pascal” and later GPU generations support FP16, performance is significantly lower on many gaming-focused GPUs. Here is a comparison of the half-precision floating-point performance of GeForce and Tesla/Quadro GPUs:

NVIDIA GPU Model | Half-precision (16-bit) Floating Point Performance
GeForce GTX Titan X (Maxwell) | N/A
GeForce GTX 1080 Ti | less than 0.177 TFLOPS
GeForce Titan Xp | less than 0.190 TFLOPS
GeForce Titan V | ~27.5 TFLOPS
GeForce RTX 2080 Ti | 28.5 TFLOPS
Titan RTX | up to 32.6 TFLOPS**
RTX 4090 | up to 82.6 TFLOPS
Tesla K80 | N/A
Tesla P100* | 18.7 ~ 21.2 TFLOPS*
Quadro GP100 | 20.7 TFLOPS
Tesla V100* | 28 ~ 31.4 TFLOPS*
Quadro GV100 | 29.6 TFLOPS
Quadro RTX 6000 and 8000 | 32.6 TFLOPS
Tesla T4 | 16.2 TFLOPS
NVIDIA A100 | 78 TFLOPS

* Exact value depends upon PCI-Express or SXM2 SKU

** Value is estimated and calculated based upon theoretical FLOPS (clock speeds x cores)
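
As a quick illustration of why FP16 suits neural networks but not general scientific arithmetic, this NumPy sketch (CPU-side, for demonstration only) shows FP16’s narrow range and coarse resolution:

```python
import numpy as np

# float16 has a 10-bit significand (~3 decimal digits) and a maximum value
# of 65504, so it overflows early and drops small increments:
print(np.finfo(np.float16).max)               # 65504.0
print(np.float16(70000))                      # inf  (overflow)
print(np.float16(1.0) + np.float16(0.0004))   # 1.0  (increment below resolution)
```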

TensorFLOPS and Deep Learning Performance

A new, specialized Tensor Core unit was introduced with the “Volta” generation of GPUs. It combines the multiplication of two FP16 values (into a full-precision product) with an FP32 accumulate operation, which are exactly the operations used in Deep Learning training computation. NVIDIA now rates GPUs with Tensor Cores using a dedicated deep learning performance metric: a unit called TensorTFLOPS.

Tensor Cores are only available on “Volta” GPUs and newer. Where no TensorFLOPS value exists, we provide for reference the maximum known deep learning performance at any precision. We consider it very poor scientific methodology to compare performance across varied precisions; however, we also recognize the desire for at least an order-of-magnitude comparison of the Deep Learning performance of diverse generations of GPUs.

NVIDIA GPU Model | TensorFLOPS (or max DL performance)
GeForce GTX Titan X (Maxwell) | N/A TensorTFLOPS; ~6.1 TFLOPS FP32
GeForce GTX 1080 Ti | N/A TensorTFLOPS; ~11.3 TFLOPS FP32
GeForce Titan Xp | N/A TensorTFLOPS; ~12.1 TFLOPS FP32
GeForce Titan V | 110 TensorTFLOPS
GeForce RTX 2080 Ti | 56.9 TensorTFLOPS; 455.4 TOPS INT4 for Inference
Titan RTX | 130 TensorTFLOPS; 520 TOPS INT4 for Inference
RTX 4090 | 660.6/1321.2 FP8 TensorTFLOPS***; 1321.2/2642.4 TOPS INT4*** for Inference
Tesla K80 | N/A TensorTFLOPS; 5.6 TFLOPS FP32
Tesla P100* | N/A TensorTFLOPS; 18.7 ~ 21.2 TFLOPS FP16
Quadro GP100 | N/A TensorTFLOPS; 20.7 TFLOPS FP16
Tesla V100* | 112 ~ 125 TensorTFLOPS
Quadro GV100 | 118.5 TensorTFLOPS
Quadro RTX 6000 and 8000 | 130.5 TensorTFLOPS; 522 TOPS INT4 for Inference
Tesla T4 | 65 TensorTFLOPS; 260 TOPS INT4 for Inference
NVIDIA A100 | 312 FP16 TensorTFLOPS; 1248 TOPS INT4 for Inference

* Exact value depends upon PCI-Express or SXM2 SKU

*** Value given with and without Sparsity feature
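
The arithmetic contract behind these TensorTFLOPS figures can be emulated on the CPU. The NumPy sketch below is illustrative only (no Tensor Cores involved): inputs are rounded to FP16, but products and sums are carried in FP32, which is why accuracy holds up far better than an all-FP16 computation would:

```python
import numpy as np

# Emulate the Tensor Core pattern D = A*B + C: FP16 inputs, FP32 accumulation.
rng = np.random.default_rng(0)
a = rng.standard_normal((128, 128)).astype(np.float16)  # FP16 inputs
b = rng.standard_normal((128, 128)).astype(np.float16)

# Promote to float32 *before* the matmul so all intermediate sums stay FP32.
d = a.astype(np.float32) @ b.astype(np.float32)

# Compare against a float64 reference: the remaining error comes only from
# the initial FP16 rounding of the inputs plus FP32 accumulation.
ref = a.astype(np.float64) @ b.astype(np.float64)
print("max abs error vs FP64 reference:", np.max(np.abs(d - ref)))
```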

Error Detection and Correction

On a GPU running a computer game, one memory error typically causes no issues (e.g., one pixel color might be incorrect for one frame), and the user is very unlikely to even be aware of it. However, technical computing applications rely on the accuracy of the data returned by the GPU. For some applications, a single error can cause the simulation to be grossly and obviously incorrect. For others, a single-bit error may not be so easy to detect (returning incorrect results which appear reasonable).

Titan GPUs do not include error correction or error detection capabilities. Neither the GPU nor the system can alert the user to errors should they occur; it is up to the user to detect them (whether they cause application crashes, obviously incorrect data, or subtly incorrect data). Such issues are not uncommon: our technicians regularly encounter memory errors on consumer gaming GPUs. NVIDIA Tesla GPUs, by contrast, are able to correct single-bit errors and detect and alert on double-bit errors. On the latest NVIDIA A100, Tesla V100, Tesla T4, Tesla P100, and Quadro GV100/GP100 GPUs, ECC support is included in the main HBM2 memory, as well as in the register files, shared memories, L1 cache, and L2 cache.
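
For completeness, ECC state can be queried programmatically through NVML. The sketch below uses the pynvml Python bindings (the nvidia-ml-py package, our choice for illustration) and assumes an NVIDIA GPU with a recent driver; on GeForce cards the ECC query raises a not-supported error:

```python
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
try:
    current, pending = pynvml.nvmlDeviceGetEccMode(handle)
    print("ECC enabled (current/pending):", bool(current), bool(pending))
    # Corrected (single-bit) error count since the last driver reload:
    corrected = pynvml.nvmlDeviceGetTotalEccErrors(
        handle,
        pynvml.NVML_MEMORY_ERROR_TYPE_CORRECTED,
        pynvml.NVML_VOLATILE_ECC,
    )
    print("Corrected ECC errors:", corrected)
except pynvml.NVMLError_NotSupported:
    print("This GPU does not expose ECC (typical for GeForce).")
finally:
    pynvml.nvmlShutdown()
```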

Warranty and End-User License Agreement

NVIDIA’s warranty on GeForce GPU products explicitly states that the GeForce products are not designed for installation in servers. Running GeForce GPUs in a server system will void the GPU’s warranty and is at a user’s own risk. From NVIDIA’s manufacturer warranty website:

Warranted Product is intended for consumer end user purposes only, and is not intended for datacenter use and/or GPU cluster commercial deployments (“Enterprise Use”). Any use of Warranted Product for Enterprise Use shall void this warranty.

The license agreement included with the driver software for NVIDIA’s GeForce products states, in part:

No Datacenter Deployment. The SOFTWARE is not licensed for datacenter deployment, except that blockchain processing in a datacenter is permitted.

GPU Memory Performance

Computationally intensive applications require high-performance compute units, but fast access to data is also critical. For many HPC applications, an increase in compute performance does not help unless memory performance is also improved. For this reason, the Tesla GPUs provide better real-world performance than the GeForce GPUs:

NVIDIA GPU Model | GPU Memory Bandwidth
GeForce GTX Titan X (Maxwell) | 336 GB/s
GeForce GTX 1080 Ti | 484 GB/s
GeForce Titan Xp | 548 GB/s
GeForce Titan V | 653 GB/s
GeForce RTX 2080 Ti | 616 GB/s
Titan RTX | 672 GB/s
RTX 4090 | 1,008 GB/s
Tesla K80 | 480 GB/s
Tesla P40 | 346 GB/s
Tesla P100 12GB | 549 GB/s
Tesla P100 16GB | 732 GB/s
Quadro GP100 | 717 GB/s
Tesla V100 16GB/32GB | 900 GB/s
Quadro GV100 | 870 GB/s
Quadro RTX 6000 and 8000 | 624 GB/s
Tesla T4 | 320 GB/s
NVIDIA A100 | 1,555 GB/s (40GB); 2,039 GB/s (80GB)

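A quick way to see why these numbers matter: for a memory-bound kernel, the floor on runtime is simply bytes moved divided by bandwidth. The small Python sketch below runs this back-of-envelope calculation (the 8 GB working set is a hypothetical example):

```python
# Minimum time to stream a working set once, given memory bandwidth.
def min_stream_time_ms(gigabytes: float, bandwidth_gb_s: float) -> float:
    return gigabytes / bandwidth_gb_s * 1000.0

working_set_gb = 8.0  # hypothetical working set read once per iteration
for name, bw in [("GeForce GTX 1080 Ti", 484),
                 ("Tesla V100", 900),
                 ("NVIDIA A100 80GB", 2039)]:
    print(f"{name}: at least {min_stream_time_ms(working_set_gb, bw):.2f} ms per pass")
```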

GPU Memory Quantity

In general, the more GPU memory a system has, the faster it will run. For some HPC applications, it’s not even possible to perform a single run unless there is sufficient memory. For others, the quality and fidelity of the results will be degraded unless sufficient memory is available. Tesla GPUs offer as much as twice the memory of GeForce GPUs:

NVIDIA GPU Model | GPU Memory Quantity
GeForce GTX 1080 Ti | 11GB
GeForce Titan Xp | 12GB
GeForce Titan V | 12GB
GeForce RTX 2080 Ti | 11GB
Titan RTX | 24GB
RTX 4090 | 24GB
Tesla K80 | 24GB
Tesla P40 | 24GB
Tesla P100 | 12GB or 16GB*
Quadro GP100 | 16GB*
Tesla V100 | 16GB or 32GB*
Quadro GV100 | 32GB*
Quadro RTX 6000 | 24GB*
Quadro RTX 8000 | 48GB*
Tesla T4 | 16GB*
NVIDIA A100 | 40GB or 80GB*

* Note that Tesla/Quadro Unified Memory allows GPUs to share each other’s memory to load even larger datasets

PCI-E vs NVLink – Device-to-Host and Device-to-Device Throughput

One of the largest potential bottlenecks is waiting for data to be transferred to the GPU. Additional bottlenecks appear when multiple GPUs operate in parallel. Faster data transfers directly result in faster application performance. The GeForce GPUs connect via PCI-Express, which has a theoretical peak throughput of 16 GB/s. NVIDIA Tesla/Quadro GPUs with NVLink are able to leverage much faster connectivity. NVLink in NVIDIA’s “Pascal” generation allows each GPU to communicate at up to 80 GB/s (160 GB/s bidirectional). NVLink 2.0 in NVIDIA’s “Volta” generation allows each GPU to communicate at up to 150 GB/s (300 GB/s bidirectional). Third-generation NVLink in NVIDIA’s “Ampere” generation allows each GPU to communicate at up to 300 GB/s (600 GB/s bidirectional). NVLink connections are supported between GPUs, between GPUs and NVIDIA NVSwitches, and also between CPUs and GPUs on supported OpenPOWER platforms.
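
These theoretical figures can be checked empirically on any system. The sketch below, using PyTorch (our choice for illustration; it assumes a CUDA-enabled build), times a pinned host-to-device copy and reports the achieved throughput, which on a PCI-Express GPU will land somewhat below the theoretical peak:

```python
import torch

assert torch.cuda.is_available()
x = torch.empty(256 * 1024 * 1024, dtype=torch.uint8).pin_memory()  # 256 MiB, pinned

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)

start.record()
y = x.to("cuda", non_blocking=True)  # host-to-device DMA transfer
end.record()
torch.cuda.synchronize()             # wait so the timing is valid

seconds = start.elapsed_time(end) / 1000.0  # elapsed_time() returns milliseconds
print(f"H2D throughput: {x.numel() / seconds / 1e9:.1f} GB/s")
```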

Application Software Support

While some software programs are able to operate on any GPU which supports CUDA, others are designed and optimized for the professional GPU series. Most professional software packages only officially support the NVIDIA Tesla and Quadro GPUs. Using a GeForce GPU may be possible, but will not be supported by the software vendor. In other cases, the applications will not function at all when launched on a GeForce GPU (for example, the software products from Schrödinger, LLC).

Operating System Support

Although NVIDIA’s GPU drivers are quite flexible, there are no GeForce drivers available for Windows Server operating systems. GeForce GPUs are only supported on Windows 7, Windows 8, and Windows 10. Groups that use Windows Server should look to NVIDIA’s professional Tesla and Quadro GPU products. The Linux drivers, on the other hand, support all NVIDIA GPUs.

Product Life Cycle

Due to the nature of the consumer GPU market, GeForce products have a relatively short lifecycle (commonly no more than a year between product release and end of production). Projects which require a longer product lifetime (such as those which might require replacement parts 3+ years after purchase) should use a professional GPU. NVIDIA’s professional Tesla and Quadro GPU products have an extended lifecycle and long-term support from the manufacturer (including notices of product End of Life and opportunities for last buys before production is halted). Furthermore, the professional GPUs undergo a more thorough testing and validation process during production.

Power Efficiency

GeForce GPUs are intended for consumer gaming usage and are not usually designed for power efficiency. In contrast, Tesla GPUs are designed for large-scale deployments where power efficiency is important, which makes them a better choice for larger installations. For example, the GeForce GTX Titan X is popular for desktop deep learning workloads, while in server deployments the Tesla P40 GPU provides matching performance and double the memory capacity; put side by side, the Tesla also consumes less power and generates less heat.

DMA Engines

The Direct Memory Access (DMA) Engine of a GPU allows for speedy data transfers between the system memory and the GPU memory. Because such transfers are part of any real-world application, the performance is vital to GPU-acceleration. Slow transfers cause the GPU cores to sit idle until the data arrives in GPU memory. Likewise, slow returns cause the CPU to wait until the GPU has finished returning results.

GeForce products feature a single DMA Engine* which is able to transfer data in one direction at a time. If data is being uploaded to the GPU, any results computed by the GPU cannot be returned until the upload is complete. Likewise, results being returned from the GPU will block any new data which needs to be uploaded to the GPU. The Tesla GPU products feature dual DMA Engines to alleviate this bottleneck. Data may be transferred into the GPU and out of the GPU simultaneously.

* one GeForce GPU model, the GeForce GTX Titan X, features dual DMA engines
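
The pattern that benefits from dual DMA engines looks like the following PyTorch sketch (our choice for illustration; assumes a CUDA build and sufficient GPU memory). The upload and download are issued on separate streams; hardware with two copy engines can run them simultaneously, while a single-engine GPU serializes them:

```python
import torch

n = 64_000_000  # ~256 MB of float32 in each direction
host_in = torch.randn(n).pin_memory()      # pinned host source (required for async)
dev_out = torch.randn(n, device="cuda")    # results already resident on the GPU
host_out = torch.empty(n).pin_memory()     # pinned host destination

up_stream, down_stream = torch.cuda.Stream(), torch.cuda.Stream()

with torch.cuda.stream(up_stream):
    dev_in = host_in.to("cuda", non_blocking=True)   # H2D copy on one engine
with torch.cuda.stream(down_stream):
    host_out.copy_(dev_out, non_blocking=True)       # D2H copy on the other engine

torch.cuda.synchronize()  # wait for both transfers before touching the buffers
```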

GPU Direct RDMA

NVIDIA’s GPU-Direct technology allows for greatly improved data transfer speeds between GPUs. Various capabilities fall under the GPU-Direct umbrella, but the RDMA capability promises the largest performance gain. Traditionally, sending data between the GPUs of a cluster required three memory copies (once to the GPU’s system memory, once to the CPU’s system memory, and once to the InfiniBand driver’s memory). GPU Direct RDMA removes the system memory copies, allowing the GPU to send data directly through InfiniBand to a remote system. In practice, this has resulted in up to 67% reductions in latency and 430% increases in bandwidth for small MPI message sizes [1]. In CUDA 8.0, NVIDIA introduced GPU Direct RDMA ASYNC, which allows the GPU to initiate RDMA transfers without any interaction with the CPU.

GeForce GPUs do not support GPU-Direct RDMA. Although the MPI calls will still return successfully, the transfers will be performed through the standard memory-copy paths. The only form of GPU-Direct which is supported on the GeForce cards is GPU Direct Peer-to-Peer (P2P). This allows for fast transfers within a single computer, but does nothing for applications which run across multiple servers/compute nodes. Tesla GPUs have full support for GPU Direct RDMA and the various other GPU Direct capabilities. They are the primary target for these capabilities and thus have the most testing and use in the field.
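
From the application side, the code looks the same either way. The hedged sketch below uses mpi4py with CuPy arrays (our choices for illustration): with a CUDA-aware MPI build on Tesla hardware the buffer can travel via GPU Direct RDMA, while on GeForce the identical call falls back to staging through host memory:

```python
# Run with e.g.: mpirun -np 2 python gpu_sendrecv.py  (filename is illustrative)
from mpi4py import MPI
import cupy as cp

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

buf = cp.arange(1_000_000, dtype=cp.float32)  # buffer resident in GPU memory
if rank == 0:
    comm.Send(buf, dest=1, tag=0)    # mpi4py detects the CUDA array interface
elif rank == 1:
    comm.Recv(buf, source=0, tag=0)
    print("received on GPU; first element:", float(buf[0]))
```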

Hyper-Q

Hyper-Q Proxy for MPI and CUDA Streams allows multiple CPU threads or processes to launch work on a single GPU. This is particularly important for existing parallel applications written with MPI, as these codes have been designed to take advantage of multiple CPU cores. Allowing the GPU to accept work from each of the MPI processes running on a system can offer a significant performance boost. It can also reduce the amount of source-code re-architecting required to add GPU acceleration to an existing application. However, the only form of Hyper-Q supported on GeForce GPUs is Hyper-Q for CUDA Streams. This allows a GeForce GPU to efficiently accept and run parallel calculations from separate CPU cores, but applications running across multiple computers will be unable to efficiently launch work on the GPU.
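
Hyper-Q for CUDA Streams in action looks like the following CuPy sketch (our choice for illustration): several independent streams feed work to one GPU, which may execute the kernels concurrently when resources allow:

```python
import cupy as cp

streams = [cp.cuda.Stream(non_blocking=True) for _ in range(4)]
arrays = [cp.random.random(2_000_000) for _ in range(4)]

results = []
for stream, a in zip(streams, arrays):
    with stream:                      # kernels below are issued on this stream
        results.append(cp.sqrt(a) * 2.0)

cp.cuda.Device().synchronize()        # wait for all four streams to finish
```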

GPU Health Monitoring and Management Capabilities

Many health monitoring and GPU management capabilities (which are vital for maintaining multiple GPU systems) are only supported on the professional Tesla GPUs. Health features which are not supported on the GeForce GPUs include:

  • NVML/nvidia-smi for monitoring and managing the state and capabilities of each GPU. This enables GPU support from a number of 3rd-party applications and tools such as Ganglia. Perl and Python bindings are also available.
  • OOB (out-of-band monitoring via IPMI) allows the system to monitor GPU health, adjust fan speeds to appropriately cool the devices, and send alerts when an issue is seen.
  • InfoROM (persistent configuration and state data) provides the system with additional data about each GPU.
  • NVHealthmon utility provides cluster administrators with a ready-to-use GPU health status tool.
  • TCC allows GPUs to be specifically set to display-only or compute-only modes.
  • ECC (memory error detection and correction).

Cluster tools rely upon the capabilities provided by NVIDIA NVML. Roughly 60% of the capabilities are not available on GeForce – this table offers a more detailed comparison of the NVML features supported in Tesla and GeForce GPUs:

Feature | Tesla | GeForce
Product Name | yes | yes
Show GPU Count | yes | yes
PCI-Express Generation (e.g., 2.0 vs 3.0) | yes | no
PCI-Express Link Width (e.g., x4, x8, x16) | yes | no
Current Fan Speed | yes | yes
Current Temperature | yes | yes*
Current Performance State | yes | no
Clock Throttle Status | yes | no
Current GPU Usage (percentage) | yes | no
Current Memory Usage (percentage) | yes | yes
GPU Boost Capability | yes | yes^
ECC Error Detection/Correction Support | yes | no
List Retired Pages | yes | no
Current Power Draw | yes | no
Set Power Draw Limit | yes | no
Current GPU Clock Speed | yes | no
Current Memory Clock Speed | yes | no
Show Available Clock Speeds | yes | no
Show Available Memory Speeds | yes | no
Set GPU Boost Speed (core clock and memory clock) | yes | no
Show Current Compute Processes | yes | no
Card Serial Number | yes | no
InfoROM image and objects | yes | no
Accounting Capability (resource usage per process) | yes | no
PCI-Express IDs | yes | yes
NVIDIA Driver Version | yes | yes
NVIDIA VBIOS Version | yes | yes

* Temperature reading is not available to the system platform, which means fan speeds cannot be adjusted.
^ GPU Boost is disabled during double-precision calculations. Additionally, GeForce clock speeds will be automatically reduced in certain scenarios.
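
The kind of query these cluster tools issue can be reproduced in a few lines with the pynvml bindings (a sketch for illustration; a Tesla-class GPU exposes the full set of fields, while GeForce raises not-supported errors for several of them):

```python
import pynvml

pynvml.nvmlInit()
h = pynvml.nvmlDeviceGetHandleByIndex(0)

print("Name:", pynvml.nvmlDeviceGetName(h))
print("Temperature (C):", pynvml.nvmlDeviceGetTemperature(h, pynvml.NVML_TEMPERATURE_GPU))
util = pynvml.nvmlDeviceGetUtilizationRates(h)
print("GPU / memory utilization (%):", util.gpu, util.memory)
try:
    print("Power draw (W):", pynvml.nvmlDeviceGetPowerUsage(h) / 1000.0)  # reported in mW
    print("Serial number:", pynvml.nvmlDeviceGetSerial(h))  # professional GPUs only
except pynvml.NVMLError_NotSupported:
    print("Field not exposed on this GPU (typical for GeForce).")
pynvml.nvmlShutdown()
```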

GPU Boost

All of the latest NVIDIA GPU products support GPU Boost, but their implementations vary depending upon the intended usage scenario. GeForce cards are built for interactive desktop usage and gaming. Tesla GPUs are built for intensive, constant number crunching with stability and reliability placed at a premium. Given the differences between these two use cases, GPU Boost functions differently on Tesla than on GeForce.

How GPU Boost Works on GeForce/RTX Consumer GPUs

In GeForce’s case, the graphics card automatically determines clock speed and voltage based on the temperature of the GPU. Temperature is the appropriate independent variable, as heat generation affects fan speed. For less graphically intense games or for general desktop usage, the end user can enjoy a quieter computing experience. When playing games that demand serious GPU compute, however, GPU Boost automatically cranks up the voltage and clock speeds (in addition to generating more noise).

How GPU Boost Works on Tesla

Tesla’s GPU Boost level, on the other hand, can also be determined dynamically by voltage and temperature, just as on consumer GPUs, but it need not always operate this way.

If preferred, the boost level may be specified by the system administrator or computational user: the desired clock speed may be set to a specific frequency. Rather than floating the clock speed at various levels, the desired clock speed is statically maintained unless the power consumption threshold (TDP) is reached. This is an important consideration, because accelerators in an HPC environment often need to be in sync with one another. The optional deterministic aspect of Datacenter GPU Boost allows system administrators to determine optimal clock speeds and lock them in across all GPUs.

For applications that require additional performance and determinism, Datacenter GPUs can be set to Auto Boost within synchronous boost groups. With Auto Boost Groups enabled, each group of GPUs will increase clock speeds when headroom allows, while keeping clocks in sync with each other to ensure matching performance across the group. Groups may be set with the NVIDIA DCGM tools.
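
Locking clocks programmatically goes through the same NVML interface that nvidia-smi -ac uses. The pynvml sketch below (for illustration only; it requires administrative privileges and a Tesla/Quadro-class GPU, and the chosen clock pair is arbitrary) lists the supported clocks and pins one combination:

```python
import pynvml

pynvml.nvmlInit()
h = pynvml.nvmlDeviceGetHandleByIndex(0)

# Enumerate supported memory clocks, then the graphics clocks valid for one of them.
mem_clocks = pynvml.nvmlDeviceGetSupportedMemoryClocks(h)
gfx_clocks = pynvml.nvmlDeviceGetSupportedGraphicsClocks(h, mem_clocks[0])

# Pin a deterministic clock pair (an arbitrary choice here) across runs.
pynvml.nvmlDeviceSetApplicationsClocks(h, mem_clocks[0], gfx_clocks[0])
pynvml.nvmlShutdown()
```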

[1] Support for GPUs with GPUDirect RDMA in MVAPICH2, by D.K. Panda (The Ohio State University)
