High-Performance Training

A100 introduces double-precision Tensor Cores, delivering the biggest milestone since the introduction of double-precision GPU computing for HPC. This enables researchers to reduce a 10-hour, double-precision simulation running on NVIDIA V100 Tensor Core GPUs to just four hours on A100. HPC applications can also leverage TF32 precision in A100’s Tensor Cores to achieve up to 10X higher throughput for single-precision dense matrix-multiply operations.
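
To make the TF32 claim concrete: TF32 keeps FP32's 8-bit exponent, and therefore its dynamic range, but carries only 10 mantissa bits. The sketch below illustrates that reduced precision; it is not NVIDIA's exact hardware behaviour (the hardware rounds inputs rather than truncating them), and the function name is ours.

```python
import numpy as np

def tf32_truncate(x: np.ndarray) -> np.ndarray:
    # TF32 keeps FP32's 8 exponent bits (same range) but only 10 of its
    # 23 mantissa bits; zeroing the low 13 mantissa bits approximates
    # that precision via truncation.
    bits = np.asarray(x, dtype=np.float32).view(np.uint32)
    return (bits & np.uint32(0xFFFFE000)).view(np.float32)

x = np.array([1.0 + 2**-12], dtype=np.float32)
print(tf32_truncate(x))  # [1.0] -- detail finer than 2**-10 is lost
```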

5 Innovations Driving Performance

The NVIDIA Ampere architecture, designed for the age of elastic computing, delivers the next giant leap by providing unmatched acceleration at every scale. The A100 GPU brings massive amounts of compute to datacentres. To keep those compute engines fully utilised, it has a class-leading 1.6TB/sec of memory bandwidth, a 67 per cent increase over the previous generation. In addition, the A100 has significantly more on-chip memory, including a 40MB Level 2 cache (7x larger than the previous generation) to maximise compute performance.

Tensor Cores

TF32 running on Tensor Cores in A100 GPUs can provide up to 10x speedups compared to single-precision floating-point math (FP32) on Volta GPUs. Combining TF32 with structured sparsity on the A100 enables performance gains over Volta of up to 20x. Applications using NVIDIA libraries enable users to harness the benefits of TF32 with no code change required. TF32 Tensor Cores operate on FP32 inputs and produce results in FP32. Non-matrix operations continue to use FP32.
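
A minimal sketch of that no-code-change path, assuming PyTorch 1.7 or later on an A100: these two documented flags route FP32 matrix multiplies and cuDNN convolutions through TF32 Tensor Cores, while the surrounding code keeps using ordinary FP32 tensors.

```python
import torch

# Allow FP32 GEMMs and cuDNN convolutions to run on TF32 Tensor Cores.
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True

# Unmodified FP32 code: inputs and results stay FP32, as described above.
a = torch.randn(4096, 4096, device="cuda")
b = torch.randn(4096, 4096, device="cuda")
c = a @ b  # executes on TF32 Tensor Cores on an A100
```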

Modern AI networks are big and getting bigger, with millions and in some cases billions of parameters. Not all of these parameters are needed for accurate predictions and inference, and some can be converted to zeros to make the models 'sparse' without compromising accuracy. Tensor Cores in A100 can provide up to 2x higher performance for sparse models. While the sparsity feature more readily benefits AI inference, it can also be used to improve the performance of model training.
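
The pattern behind this feature is 2:4 structured sparsity: in every group of four consecutive weights, two are zero. The sketch below shows a magnitude-based pruning of a weight matrix to that pattern; it is illustrative only (NVIDIA's tooling, such as the ASP library, handles pruning and retraining in practice).

```python
import torch

def prune_2_to_4(weight: torch.Tensor) -> torch.Tensor:
    # Zero the two smallest-magnitude weights in every group of four,
    # producing the 2:4 pattern A100's sparse Tensor Cores accelerate.
    w = weight.reshape(-1, 4)
    keep = w.abs().topk(2, dim=1).indices          # 2 largest per group
    mask = torch.zeros_like(w, dtype=torch.bool).scatter_(1, keep, True)
    return (w * mask).reshape(weight.shape)

w = torch.randn(8, 16)
print(prune_2_to_4(w))  # exactly two non-zeros in each group of four
```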

Acceleration

Multi-Instance GPU (MIG) expands the performance and value of each NVIDIA A100 GPU. MIG can partition the A100 GPU into as many as seven instances, each fully isolated with its own high-bandwidth memory, cache, and compute cores. Now administrators can support every workload, from the smallest to the largest, offering a right-sized GPU with guaranteed quality of service (QoS) for every job, optimising utilisation and extending the reach of accelerated computing resources to every user.
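
As an illustrative sketch, assuming an A100 host with a MIG-capable driver and root privileges (profile availability and the requirement for the GPU to be idle vary by driver version), the documented nvidia-smi MIG commands can be scripted to carve GPU 0 into seven 1g.5gb instances:

```python
import subprocess

def run(cmd: str) -> str:
    # Execute a command and return its stdout, raising on failure.
    return subprocess.run(cmd.split(), check=True,
                          capture_output=True, text=True).stdout

# Enable MIG mode on GPU 0 (assumption: GPU is idle; may need a reset).
run("nvidia-smi -i 0 -mig 1")

# Create seven 1g.5gb GPU instances plus matching compute instances (-C).
run("nvidia-smi mig -i 0 -cgi " + ",".join(["1g.5gb"] * 7) + " -C")

# List the resulting MIG devices; each is schedulable like its own GPU.
print(run("nvidia-smi -L"))
```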

Expand GPU access to more users

With MIG, you can achieve up to 7X more GPU resources on a single A100 GPU. MIG gives researchers and developers more resources and flexibility than ever before.

Optimise GPU utilisation

MIG provides the flexibility to choose from many different instance sizes, allowing a right-sized GPU instance to be provisioned for each workload, ultimately delivering optimal utilisation and maximising datacentre investment.

Run simultaneous mixed workloads

MIG enables inference, training, and high-performance computing (HPC) workloads to run at the same time on a single GPU with deterministic latency and throughput.

Up to 7 GPU instances in a single A100: dedicated SMs, memory, L2 cache, and bandwidth for hardware QoS and isolation

Simultaneous workload execution with guaranteed quality of service: all MIG instances run in parallel with predictable throughput and latency

Right-sized GPU allocation: different-sized MIG instances based on target workloads

Flexibility: run any type of workload on a MIG instance

Diverse deployment environments: supported on bare metal, Docker, Kubernetes, and virtualised environments

NVLINK

Scaling applications across multiple GPUs requires extremely fast movement of data. The third generation of NVIDIA NVLink in A100 doubles the GPU-to-GPU direct bandwidth to 600GB/s, almost 10x more than PCI-E 4.0. When paired with the latest generation of NVIDIA NVSwitch, all GPUs in the server can communicate with each other at full NVLink speed for incredibly fast training.
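
The collective at the heart of multi-GPU training is the all-reduce, which rides NVLink and NVSwitch automatically when run over NCCL. A minimal sketch, assuming a PyTorch build with NCCL on a multi-GPU node (the address, port, and tensor size are placeholders):

```python
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def worker(rank: int, world_size: int) -> None:
    os.environ["MASTER_ADDR"] = "127.0.0.1"   # placeholder rendezvous address
    os.environ["MASTER_PORT"] = "29500"       # placeholder port
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)
    # Each GPU contributes a gradient-like tensor; NCCL sums them in
    # place across all GPUs, using NVLink/NVSwitch where available.
    grad = torch.ones(1024, device="cuda") * (rank + 1)
    dist.all_reduce(grad, op=dist.ReduceOp.SUM)
    dist.destroy_process_group()

if __name__ == "__main__":
    n = torch.cuda.device_count()
    mp.spawn(worker, args=(n,), nprocs=n)
```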

NVLink and NVSwitch are essential building blocks of the complete NVIDIA datacentre solution that incorporates hardware, networking, software, libraries, and optimised AI models and applications from NVIDIA GPU Cloud (NGC).

Flexible Configurations

In addition to the NVIDIA DGX A100 universal system for AI infrastructure, we have also developed a range of training systems powered by the NVIDIA A100 GPU. These systems are customisable to your requirements and are available in a variety of configurations in both the high-density SXM4 and industry-standard PCI-E form factors.

SXM4 Servers

Supermicro 2124GQ-NART

4x NVIDIA A100 GPUs with NVLink
2x AMD EPYC 7002 CPUs
Up to 8TB of DDR4 3200 ECC Registered
4x 2.5in HDDs/SSDs
Up to 4x Mellanox NICs
Redundant 2200W PSUs
2U 19in rack server

RESERVE NOW

Supermicro SYS-420GP-TNAR

8x NVIDIA A100 GPUs with NVLink and NVSwitch
2x AMD EPYC 7002 CPUs
Up to 8TB of DDR4 3200 ECC Registered
6x 2.5in HDDs/SSDs
Up to 2x Mellanox NICs
Redundant 2200W PSUs
4U 19in rack server

RESERVE NOW

PCI-E Servers

Asus ESC4000A-E10

4x NVIDIA A100 GPUs with NVLink
AMD EPYC 7002 CPU
Up to 2TB of DDR4 3200 ECC Registered
1x M.2 SSD, 8x 3.5in HDDs/SSDs
Up to 2x Mellanox NICs
Redundant 1600W PSUs
2U 19in rack server

BUY NOW

Supermicro 4124GS-TNR

8x NVIDIA A100 GPUs with NVLink
AMD EPYC 7002 CPU
Up to 8TB of DDR4 3200 ECC Registered
24x 2.5in HDDs/SSDs
Up to 2x Mellanox NICs
Redundant 2000W PSUs
5U 19in rack server

RESERVE NOW

Gigabyte G492-Z51T

10x NVIDIA A100 GPUs
2x AMD EPYC 7002 CPUs
Up to 4TB of DDR4 3200 ECC Registered
12x 3.5in HDDs/SSDs
Up to 2x Mellanox NICs
Redundant 3200W PSUs
4U 19in rack server

RESERVE NOW

Proof of Concept

Sign up to try one of the AI & Deep Learning solutions available from Scan Computers

Register for PoC >