The A100 introduces double-precision Tensor Cores, the biggest milestone for GPU-accelerated HPC since the introduction of double-precision GPU computing. This enables researchers to reduce a 10-hour double-precision simulation running on NVIDIA V100 Tensor Core GPUs to just four hours on A100. HPC applications can also leverage TF32 precision in A100’s Tensor Cores to achieve up to 10X higher throughput for single-precision dense matrix multiply operations.
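To give a concrete sense of the kind of operation this accelerates, the short sketch below assumes a PyTorch-based workflow (an illustrative assumption, not something described above) and issues an ordinary FP64 matrix multiply; on A100, the cuBLAS GEMM behind it can be serviced by the double-precision Tensor Cores without any change to the calling code.

```python
import torch

# Minimal sketch: a plain FP64 matrix multiply. On an A100, the cuBLAS call
# behind torch.matmul can be routed through the double-precision Tensor Cores
# automatically; the calling code stays exactly the same.
a = torch.randn(8192, 8192, dtype=torch.float64, device="cuda")
b = torch.randn(8192, 8192, dtype=torch.float64, device="cuda")
c = a @ b
torch.cuda.synchronize()
```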
5 Innovations Driving Performance
The NVIDIA Ampere architecture, designed for the age of elastic computing, delivers the next giant leap by providing unmatched acceleration at every scale. The A100 GPU brings massive amounts of compute to datacentres. To keep those compute engines fully utilised, it has a class-leading 1.6TB/sec of memory bandwidth, a 67 per cent increase over the previous generation. In addition, the A100 has significantly more on-chip memory, including a 40MB Level 2 cache, 7x larger than the previous generation, to maximise compute performance.
TF32 running on Tensor Cores in A100 GPUs can provide up to 10x speedups compared to single-precision floating-point math (FP32) on Volta GPUs. Combining TF32 with structured sparsity on the A100 enables performance gains over Volta of up to 20x. Applications using NVIDIA libraries enable users to harness the benefits of TF32 with no code change required. TF32 Tensor Cores operate on FP32 inputs and produce results in FP32. Non-matrix operations continue to use FP32.
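As a minimal sketch of what "no code change required" looks like in practice, the snippet below assumes a PyTorch workload (the TF32 switches shown are PyTorch-specific and are an illustrative addition): the inputs and results stay in FP32, while the library decides whether the multiply runs through TF32 Tensor Cores.

```python
import torch

# Recent PyTorch releases expose explicit TF32 switches; with them enabled,
# the underlying cuBLAS/cuDNN kernels on A100 use TF32 Tensor Cores for
# FP32 matrix multiplies and convolutions.
torch.backends.cuda.matmul.allow_tf32 = True   # TF32 for matrix multiplies
torch.backends.cudnn.allow_tf32 = True         # TF32 for cuDNN convolutions

a = torch.randn(4096, 4096, device="cuda")     # ordinary FP32 inputs
b = torch.randn(4096, 4096, device="cuda")
c = a @ b                                      # FP32 result, Tensor Core math
```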
Modern AI networks are big and getting bigger, with millions and in some cases billions of parameters. Not all of these parameters are needed for accurate predictions and inference, and some can be converted to zeros to make the models ‘sparse’ without compromising accuracy. Tensor Cores in A100 can provide up to 2x higher performance for sparse models. While the sparsity feature more readily benefits AI inference, it can also be used to improve the performance of model training.
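The pattern the hardware accelerates is 2:4 structured sparsity, in which at most two of every four consecutive weights are non-zero. The NumPy sketch below is purely illustrative (the actual pruning and Tensor Core acceleration are handled by NVIDIA's libraries and tools) and simply shows how a dense weight matrix can be reduced to that pattern by keeping the two largest-magnitude values in each group of four.

```python
import numpy as np

def prune_2_4(weights: np.ndarray) -> np.ndarray:
    """Zero the two smallest-magnitude values in every group of four,
    producing the 2:4 structured-sparse pattern A100 Tensor Cores exploit."""
    w = weights.reshape(-1, 4).copy()
    drop = np.argsort(np.abs(w), axis=1)[:, :2]   # two smallest per group
    np.put_along_axis(w, drop, 0.0, axis=1)
    return w.reshape(weights.shape)

dense = np.random.randn(8, 16).astype(np.float32)
sparse = prune_2_4(dense)
assert (np.count_nonzero(sparse.reshape(-1, 4), axis=1) <= 2).all()
```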
Multi-Instance GPU (MIG) expands the performance and value of each NVIDIA A100 GPU. MIG can partition the A100 GPU into as many as seven instances, each fully isolated with its own high-bandwidth memory, cache, and compute cores. Now administrators can support every workload, from the smallest to the largest, offering a right-sized GPU with guaranteed quality of service (QoS) for every job, optimising utilisation and extending the reach of accelerated computing resources to every user.
Expand GPU access to more users
With MIG, you can achieve up to 7X more GPU resources on a single A100 GPU. MIG gives researchers and developers more resources and flexibility than ever before.
Optimise GPU utilisation
MIG provides the flexibility to choose from many different instance sizes, which allows a right-sized GPU instance to be provisioned for each workload, ultimately delivering optimal utilisation and maximising datacentre investment.
Run simultaneous mixed workloads
MIG enables inference, training, and high-performance computing (HPC) workloads to run at the same time on a single GPU with deterministic latency and throughput.
Simultaneous workload execution with guaranteed quality of service: all MIG instances run in parallel with predictable throughput and latency.
Right-sized GPU allocation: different sized MIG instances can be provisioned based on target workloads.
Flexibility: any type of workload can run on a MIG instance.
Diverse deployment environments: supported on bare metal, Docker, Kubernetes and virtualised environments.
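To show how MIG instances appear to software, here is a hedged sketch using the NVML Python bindings (the nvidia-ml-py package; exact function names can vary between binding versions). It enumerates the MIG devices on the first GPU and prints their UUIDs, each of which can then be handed to a job via CUDA_VISIBLE_DEVICES.

```python
import pynvml

pynvml.nvmlInit()
gpu = pynvml.nvmlDeviceGetHandleByIndex(0)

# nvmlDeviceGetMigMode reports the current and pending MIG mode of the GPU.
current, pending = pynvml.nvmlDeviceGetMigMode(gpu)
if current == pynvml.NVML_DEVICE_MIG_ENABLE:
    for i in range(pynvml.nvmlDeviceGetMaxMigDeviceCount(gpu)):
        try:
            mig = pynvml.nvmlDeviceGetMigDeviceHandleByIndex(gpu, i)
        except pynvml.NVMLError:
            continue                      # this slot has no MIG device
        print(pynvml.nvmlDeviceGetUUID(mig))

pynvml.nvmlShutdown()
```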
Scaling applications across multiple GPUs requires extremely fast movement of data. The third generation of NVIDIA NVLink in A100 doubles the GPU-to-GPU direct bandwidth to 600GB/s, almost 10x more than PCI-E 4.0. When paired with the latest generation of NVIDIA NVSwitch, all GPUs in the server can communicate with each other at full NVLink speed for incredibly fast training.
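For a sense of what this means in application code, the sketch below (PyTorch, assuming at least two peer-connected GPUs, which is an assumption rather than anything stated above) copies a tensor directly from one GPU to another; on an NVLink-connected pair, that peer-to-peer transfer travels over NVLink rather than PCI-E.

```python
import torch

# Minimal sketch of a direct GPU-to-GPU transfer. Whether the copy actually
# rides NVLink depends on the system topology; can_device_access_peer only
# confirms that peer-to-peer access between the two devices is possible.
assert torch.cuda.device_count() >= 2
if torch.cuda.can_device_access_peer(0, 1):
    src = torch.randn(1 << 28, device="cuda:0")   # roughly 1 GiB of FP32 data
    dst = src.to("cuda:1", non_blocking=True)     # peer-to-peer copy
    torch.cuda.synchronize()
```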
NVLink and NVSwitch are essential building blocks of the complete NVIDIA datacentre solution that incorporates hardware, networking, software, libraries, and optimised AI models and applications from NVIDIA GPU Cloud (NGC).
Flexible Configurations
In addition to the NVIDIA DGX A100 universal system for AI infrastructure, we have also developed a range of training systems powered by the NVIDIA A100 GPU. These systems are customisable to your requirements and are available in a variety of configurations in both the high-density SXM4 and industry-standard PCI-E form factors.
SXM4 Servers

4 GPUs: Supermicro 2124GQ-NART
4x NVIDIA A100 GPUs with NVLink
2x AMD EPYC 7002 CPUs
Up to 8TB of DDR4 3200 ECC Registered
4x 2.5in HDDs/SSDs
Up to 4x Mellanox NICs
Redundant 2200W PSUs
2U 19in rack server

8 GPUs
8x NVIDIA A100 GPUs with NVLink and NVSwitch
2x AMD EPYC 7002 CPUs
Up to 8TB of DDR4 3200 ECC Registered
6x 2.5in HDDs/SSDs
Up to 2x Mellanox NICs
Redundant 2200W PSUs
4U 19in rack server
PCI-E Servers

4 GPUs
4x NVIDIA A100 GPUs with NVLink
AMD EPYC 7002 CPU
Up to 2TB of DDR4 3200 ECC Registered
1x M.2 SSD, 8x 3.5in HDDs/SSDs
Up to 2x Mellanox NICs
Redundant 1600W PSUs
2U 19in rack server

8 GPUs
8x NVIDIA A100 GPUs with NVLink
AMD EPYC 7002 CPU
Up to 8TB of DDR4 3200 ECC Registered
24x 2.5in HDDs/SSDs
Up to 2x Mellanox NICs
Redundant 2000W PSUs
5U 19in rack server