
Networking Shapes Performance

Explore the relationship between the measurable performance of an AI model and the networks that power it.

  • Model performance is measured in FLOPS, and the scale of these workloads is growing at an unprecedented rate.
  • MLPerf™ benchmarks are performance measures specifically designed for AI.
  • Chip hardware choices directly impact the speed of training and inference.

Network architecture, meaning the bandwidth, topology, and protocols, has as much impact on the performance of an AI-grade HPC cluster as the model design or the chips themselves. Workloads shared across thousands of densely interconnected chips demand increasingly innovative ways to design and measure networks.

The limitations of conventional network performance measures have therefore given rise to new, AI-specific ways of evaluating performance.

Understanding AI Performance Metrics

In the network world, we are used to seeing megabits and gigabits per second as the measure of the rate at which the network can move data from one place to another. Whether the network is Ethernet or InfiniBand, data rate is the main performance measure on traditional networks. Today, 400 Gbps transceivers are the workhorse of HPC AI data centers, and 800 Gbps is just being deployed. While high-speed Ethernet and InfiniBand are critical in the AI data center, data rate alone does not tell us how fast the AI network can perform.

The measure of the AI cluster’s performance is floating-point operations per second (FLOPS): how quickly a computer can carry out the high-precision floating-point arithmetic at the heart of these models. Precision matters here. Carried out to ten decimal places (3.1415926536), pi can be used to calculate Earth’s circumference to within 1 mm, and many AI models rely on even more precise numbers.
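
For the curious, a minimal Python check of that precision claim follows; the Earth radius it uses (a mean radius of 6,371 km) is an assumption made purely for the sake of the arithmetic.

```python
# A quick sanity check of the precision claim above (a sketch, not a rigorous
# error analysis): how far off is Earth's circumference if pi is rounded to
# ten decimal places?
import math

EARTH_RADIUS_M = 6_371_000          # assumed mean Earth radius, in meters
pi_rounded = 3.1415926536           # pi rounded to ten decimal places

exact = 2 * math.pi * EARTH_RADIUS_M
approx = 2 * pi_rounded * EARTH_RADIUS_M
print(f"error: {abs(exact - approx) * 1000:.2f} mm")   # well under 1 mm
```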

Today, AI models perform at speeds of one quadrillion FLOPS, called a petaflop: a 1 followed by fifteen zeros.

1,000,000,000,000,000 operations per second is an almost inconceivable number. In meters, it is roughly 6,700 times the distance from the Earth to the Sun; in drops of water, it is about 50 million cubic meters, roughly the volume of a decent-sized lake; in dollars, it is 10X the world’s GDP in 2022 (about $100T).
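
Those analogies can be verified with a quick back-of-the-envelope sketch; the conversion constants (an astronomical unit of about 1.496 × 10^11 m, a 0.05 mL water drop, and 2022 world GDP of roughly $100T) are assumptions noted in the comments.

```python
# Back-of-the-envelope check of the scale analogies, under stated assumptions.
PETA = 1e15

AU_M = 1.496e11                 # assumed Earth-Sun distance, meters
DROP_M3 = 5e-8                  # assumed ~0.05 mL per drop, in cubic meters
WORLD_GDP_USD = 100e12          # assumed 2022 world GDP, ~$100T

print(f"{PETA / AU_M:,.0f} times the Earth-Sun distance")             # ~6,700x
print(f"{PETA * DROP_M3 / 1e6:,.0f} million cubic meters of water")   # ~50
print(f"{PETA / WORLD_GDP_USD:.0f}x the 2022 world GDP")              # 10x
```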

AI is advancing at such an enormous rate that we may very well see exascale systems in the next generation of supercomputers: a compute rate of one quintillion FLOPS, or a billion billion floating-point operations per second. The Frontier supercomputer at the Oak Ridge Leadership Computing Facility (OLCF) in Tennessee debuted in 2022; it was the first exascale computer and remains the fastest in operation as of this writing in December 2023.

10¹⁸
An exaflop of capacity: only two computers in the world have Rpeak capacity measured in exaflops, Aurora and Frontier.

MLPerf™ benchmarks, created by a group of AI experts from universities, research institutions, and the tech industry (collectively known as MLCommons), are performance measures specifically designed for AI.

They help in fairly assessing the efficiency of various hardware, software, and services in both learning new tasks (training) and applying learned knowledge to new data (inference). These benchmarks are carried out under specific conditions to ensure consistency.

To keep up with the latest advancements in AI, MLPerf is regularly updated. It includes new tests and covers a range of AI tasks that reflect the most advanced technologies, like understanding human language, recognizing and creating images, and interpreting speech. These benchmarks measure how quickly systems can learn and make decisions or predictions across different real-world applications.

Consider MLPerf benchmarks in the context of Ethernet networking: The Internet Engineering Task Force (IETF) established RFC 2544, a standard outlining benchmarking methods for network interconnect devices. This standard includes six key performance areas: throughput, latency, frame loss, back-to-back frames (burstability), system reset, and system recovery. Similarly, the Institute of Electrical and Electronics Engineers (IEEE) sets the standards for Ethernet technology, guiding developers and manufacturers in creating devices that adhere to these benchmarks.

In the AI domain, the International Organization for Standardization (ISO) and the International Electrotechnical Commission (IEC) have developed ISO/IEC 25059, which outlines standards for software quality across various dimensions, including design, testing, maintenance, and evaluation. Similarly, MLPerf benchmarks, developed by MLCommons, establish performance criteria and scenarios specifically for AI systems.

 

Distinguishing between scenarios and benchmark testing is crucial. In MLPerf, various scenarios are designed to assess specific aspects of AI performance, particularly focusing on inference throughput and latency. For example, in offline testing, large batches of data are fed into the AI model to measure maximum throughput. In contrast, the server scenario involves processing incoming queries under a defined latency constraint, and the performance metric is the throughput achieved within this latency budget. These tailored tests provide a nuanced view of an AI system’s performance, mirroring the specificity seen in network performance testing under RFC 2544 standards.
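
To make the distinction concrete, here is a deliberately simplified Python sketch of the two scenarios. It is not MLPerf’s actual LoadGen harness; run_model, the batch size, and the latency budget are all hypothetical stand-ins.

```python
# Toy illustration of MLPerf-style inference scenarios (NOT the official harness).
# `run_model` is a hypothetical stand-in for any model under test.
import time
import random

def run_model(batch):
    time.sleep(0.001 * len(batch))      # pretend inference cost scales with batch size
    return [x * 2 for x in batch]

def offline_scenario(samples, batch_size=64):
    """Feed large batches as fast as possible and report raw throughput."""
    start = time.perf_counter()
    for i in range(0, len(samples), batch_size):
        run_model(samples[i:i + batch_size])
    elapsed = time.perf_counter() - start
    return len(samples) / elapsed       # samples per second

def server_scenario(samples, latency_budget_s=0.005):
    """Process queries one at a time; count only queries answered within the budget."""
    within_budget = 0
    start = time.perf_counter()
    for sample in samples:
        t0 = time.perf_counter()
        run_model([sample])
        if time.perf_counter() - t0 <= latency_budget_s:
            within_budget += 1
    elapsed = time.perf_counter() - start
    return within_budget / elapsed      # compliant queries per second

queries = [random.random() for _ in range(512)]
print(f"offline throughput: {offline_scenario(queries):,.0f} samples/s")
print(f"server throughput within latency budget: {server_scenario(queries):,.0f} queries/s")
```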

While we often talk about the role of GPUs (Graphics Processing Units) and CPUs (Central Processing Units) in these processes, MLPerf’s evaluations wisely consider the performance of the entire network of a supercomputer. For instance, when looking at NVIDIA, the tests would include not just the H100 GPU, but also their high-speed Quantum-2 InfiniBand network and NVLink connections that help different parts of the supercomputer communicate efficiently.

A powerful, densely connected network is an essential component of an AI solution.

Exascale Networks for Exascale Problems

  •   Parameters / Weights
  •   Tokens / Context Window
  • CPU – GPU – TPU 

The creation of a foundational model like a Large Language Model (LLM) is an extremely heavy compute lift that requires acting on massive amounts of data.  Consider that GPT-3 uses 175 billion parameters in its LLM.  In LLMs, parameters are typically the weights in the neural network that the model adjusts during training to learn patterns in data, giving the model an enormous capacity to act on incoming data.

Couple that scale of operation with the number of tokens, the actual data on which the model is executed.  For example, GPT-3 supports roughly 2,000 tokens (2,048) per interaction with the model, a limit also known as the context window.  Think about the amount of compute required to support 175 billion parameters acting on thousands of tokens at a time.

GPT-4 reportedly grew the parameter space 10x, to an astonishing 1.76 trillion parameters, acting on a far larger context window of 32,000 tokens per interaction.  This is the ideal use case for exascale supercomputers.
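
To get a feel for why exascale matters here, consider a rough back-of-the-envelope estimate using the commonly cited approximation that training a dense transformer costs about 6 FLOPs per parameter per training token; the training-token count below is an assumed, illustrative figure.

```python
# Rough training-compute estimate using the common ~6 * parameters * tokens
# approximation for dense transformers. The training-token count is an assumed,
# illustrative figure, not a published number for any specific model.
PARAMS = 175e9                    # GPT-3-scale parameter count
TRAIN_TOKENS = 300e9              # assumed training tokens (illustrative)

total_flops = 6 * PARAMS * TRAIN_TOKENS
petaflops_sustained = 1e15        # one sustained petaflop of compute

seconds = total_flops / petaflops_sustained
print(f"~{total_flops:.2e} FLOPs total")
print(f"~{seconds / 86_400:.0f} days at a sustained petaflop")   # roughly a decade
```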

Imagine slogging through the enormous training data set or ever-growing token counts with 1.76T parameters serially on CPUs: the time required would be tremendous, to say nothing of the number of CPUs involved. This need for parallel processing has led to the use of Graphics Processing Units (GPUs), allowing multiple data sets to be processed simultaneously to accelerate the process.  High-speed I/O optimizes the data flow into and out of the GPU, and further acceleration is provided by software enhancements and network architectures.  GPUs are directly interconnected with one another, transmitting data directly from the memory of one to another via RDMA, without CPU involvement.

The network is an essential component of capitalizing on the parallelization edge GPUs have over CPUs. Without a dense, many-to-many fiber fabric weaving the work of the chipset together, operators cannot enjoy the additional power GPUs provide to the process.
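
As a concrete illustration of what that fabric carries, the minimal sketch below runs the all-reduce collective that keeps data-parallel workers in sync. It uses PyTorch’s CPU-only gloo backend so it runs anywhere; on a production cluster the same call would ride NCCL over NVLink and InfiniBand with RDMA. The process count and tensor contents are illustrative.

```python
# Minimal sketch of the all-reduce collective used in data-parallel training.
# On a real cluster this would use the NCCL backend over NVLink / InfiniBand
# with RDMA; the "gloo" backend keeps this demo CPU-only and runnable anywhere.
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def worker(rank: int, world_size: int):
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group(backend="gloo", rank=rank, world_size=world_size)

    # Each worker holds its own "gradient"; all-reduce sums them across workers
    # without routing the data through a central coordinator.
    grad = torch.full((4,), float(rank))
    dist.all_reduce(grad, op=dist.ReduceOp.SUM)
    print(f"rank {rank}: reduced gradient = {grad.tolist()}")

    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = 4    # stand-in for thousands of interconnected accelerators
    mp.spawn(worker, args=(world_size,), nprocs=world_size, join=True)
```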

The needs of training and inference have also given rise to custom silicon, designed specifically for the use cases of AI. Tensor Processing Units (TPUs) are specialized chips designed by Google to accelerate machine learning workloads. They are specifically optimized for matrix multiplication, a common operation in neural networks, and are highly performant for inference.
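
Because matrix multiplication dominates these workloads, a quick sketch of its FLOP count (roughly 2·M·K·N operations per matmul) helps connect chip choices back to the FLOPS metric; the matrix sizes below are arbitrary.

```python
# Rough sketch of why matmul dominates the FLOP budget: an (M, K) x (K, N)
# multiply costs about 2*M*K*N floating-point operations. Timing one with NumPy
# gives a crude achieved-FLOPS figure for whatever hardware runs it.
import time
import numpy as np

M = K = N = 2048                     # arbitrary sizes
a = np.random.rand(M, K).astype(np.float32)
b = np.random.rand(K, N).astype(np.float32)

start = time.perf_counter()
c = a @ b
elapsed = time.perf_counter() - start

flops = 2 * M * K * N                # one multiply and one add per inner-product term
print(f"~{flops / elapsed / 1e9:.1f} GFLOPS achieved on this matmul")
```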

Google is not the only player creating custom silicon. AWS has developed two different chips: Trainium and Inferentia, each of which is designed for specific phases of the AI/ML lifecycle of training & inference. Tesla’s D1, Meta’s MTIA, and potentially Microsoft’s Athena are all examples of published activity in the design and development of chips, each tightly optimized for their company’s specific use cases and expanding their supply chain options.

Modeling Life & Death

100x
growth in NOAA's available compute.

The National Oceanic and Atmospheric Administration (NOAA) runs one of the largest HPC clusters on the planet, partnering with similar installations in other regions to predict our weather; as of 2022, it runs an aggregated 40 petaflops of HPC across production and R&D.

The models running on NOAA’s HPC are not stochastic like LLMs; they are based on a mathematical representation of elemental physics. These models today are designed around the fluid mechanics of heat and water vapor, subdividing the atmosphere into small slices, and predicting over time how weather forms and moves across the country. The numerical and physical models run constantly, attempting to provide life-saving early warning to agencies and residents at risk for catastrophic and dangerous weather: floods that could destroy crops and homes, hurricanes that could suddenly change course, and heat waves that endanger lives.

As of 2022, NOAA flagged the need for a hundred-fold increase in available compute in the coming years, totaling some 9.6 peak exaflops of desired capacity by 2031. The current models run for hours and require fine-grained sensor data from a massive deployment of data collection devices.
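
As a toy illustration of that grid-based, time-stepped approach (and nothing like NOAA’s actual models), the sketch below advects a warm bubble of air along a one-dimensional strip of atmosphere with a simple finite-difference update; every constant in it is an arbitrary choice.

```python
# Toy 1-D advection model: subdivide a strip of atmosphere into cells and step a
# temperature field forward in time with an upwind finite-difference scheme.
# This only illustrates the grid-and-timestep structure of numerical weather
# models; grid size, wind speed, and time step are arbitrary.
import numpy as np

N_CELLS = 200                     # number of atmospheric slices
DX = 1_000.0                      # cell width, meters
WIND = 10.0                       # constant wind speed, m/s
DT = 50.0                         # time step, seconds (WIND*DT/DX < 1 for stability)

# Initial condition: a warm "bubble" in the middle of the domain.
x = np.arange(N_CELLS) * DX
temp = 15.0 + 10.0 * np.exp(-((x - x.mean()) ** 2) / (2 * (5 * DX) ** 2))

for _ in range(100):
    # Upwind update with a periodic boundary: each cell pulls from its upwind neighbor.
    temp -= WIND * DT / DX * (temp - np.roll(temp, 1))

print(f"peak temperature has moved to x = {x[np.argmax(temp)] / 1000:.0f} km")
```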

That compute, however, is designed around NOAA's specific use case. Clusters and their networks are, as we have mentioned, purpose-built. There have been exciting recent developments in using Graph Neural Networks (GNNs) for weather prediction, such as Google DeepMind's GraphCast; while it carries the recognizably massive training burden we've come to expect from the latest generations of models, its inference is impressively lightweight: it can run on a laptop.

More impressively? It can run in under a minute.

The potential transition from numerical-physical models to statistical-predictive models could fundamentally challenge the underlying design of the exascale compute at NOAA, and it serves as a reminder that this space is evolving rapidly. The core competency of an operator in this environment is being nimble and positioned to change with the times.

Stay tuned for more in-depth insights in the next installments of our Architecture Series on network design.

We'd love to hear how we can help you.

