
Understanding Training & Inference

Artificial intelligence models have a complex life cycle — each step demands specialized networking capabilities.

  • Infrastructure is purpose-built, and training & inference have different purposes.
  • Throughput and latency needs for training and running the model shape the solution design.
  • Network and hardware choices directly impact the speed of training and running solutions.

The computational and networking demands of training have captured the limelight and breathless coverage in the tech press: Thousands of highly-networked chips burning 24/7 for months at a time, millions spent in CapEx.

While we will explore the HPC needs of training in this post, we also need to highlight a second, under-examined phase: running inference on trained models. Inference, the act of making the predictions the model was designed for, holds long-tail opportunities for infrastructure providers but presents unique, largely under-reported challenges.

Infrastructure is purpose-built, and these two phases do not share a purpose.

 

From Data Noise to Real Value

Nearly 90% of cloud HPC spend will be on inference.

Training

Training a model from an algorithm often spans days or even months and incurs substantial costs, reaching into the millions of dollars. Research firm Epoch AI estimated that the computing power necessary to train a cutting-edge model was doubling every six to ten months.1
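
To put that doubling rate in perspective, here is a back-of-envelope sketch of how quickly steady doubling compounds; the five-year horizon and the assumption of a constant rate are our own illustration, not Epoch AI's figures:

```python
# Back-of-envelope sketch (assumes a steady doubling rate for illustration):
# if training compute doubles every 6 to 10 months, how much does it grow
# over, say, five years?

def growth_factor(years: float, doubling_months: float) -> float:
    """Total multiplier on compute after `years` of steady doubling."""
    return 2 ** (years * 12 / doubling_months)

for doubling_months in (6, 10):
    print(f"doubling every {doubling_months} months -> "
          f"{growth_factor(5, doubling_months):,.0f}x the compute in 5 years")
```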

Extensive networking infrastructure is necessary to parcel out training data and to recombine the efforts of chips distributed throughout a datacenter, located for optimal power and heat management.2 For instance, OpenAI’s ChatGPT was trained on an Azure supercomputer utilizing a massive assembly of 285,000 InfiniBand-connected CPU cores and 10,000 NVIDIA Tensor Core GPUs. Foundation models like GPT, Meta’s LLaMa, and Amazon’s Titan are trained to be generally adaptable to many circumstances and use cases. The amount of data required to achieve that level of adaptability is so huge that most people don’t have a working frame of reference for it.

LLaMa was trained on “1.4 trillion tokens” – a single human would take 22,000 years to read it all.3 Similarly, Google’s PaLM was trained on a dataset 3,000,000x larger than those used to train models merely a decade ago. The High-Performance Computing (HPC) infrastructure required to move that much data demands dense fiber and specialized connection technologies like InfiniBand RDMA.
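
For a rough sense of where a figure like 22,000 years comes from, here is a back-of-envelope check; the words-per-token ratio, reading speed, and hours per day are our own illustrative assumptions, and they land on the same order of magnitude as the quoted number:

```python
# Order-of-magnitude sanity check of the "22,000 years" claim.
# Assumptions (illustrative only): ~0.75 words per token,
# ~250 words per minute, ~8 hours of reading per day.
tokens = 1.4e12
words = tokens * 0.75
reading_minutes = words / 250
years = reading_minutes / 60 / 8 / 365        # 8 reading-hours per day
print(f"roughly {years:,.0f} years of reading")  # on the order of the quoted figure
```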

The largest software and technology companies in the world are producing competing foundation models, shouldering the capital-intensive process of bringing new ones to market. However, they aren’t the only players in the space: Compute-as-a-service providers like Inflection AI and CoreWeave are developing massive capacity positions in the specialized chips necessary for re-training or fine-tuning foundation models, together building out a cloud comprising 22,000 densely connected NVIDIA H100s.4

Inference

Inference is harder to speak about in generalities. While training workloads vary widely, there are as many inference techniques as there are questions answered by data science. Training the model is just the ante you pay to sit at the table: The game doesn’t start until the trained model gets into the wild and does its job. Inference is the day-to-day reality of artificial intelligence.

Inference is how the trained model expresses its value. For every time a model is trained, it may run millions upon millions of inferences before it’s ever trained again. Every time ChatGPT guesses the next word, it performs inference, often multiple times per word; how many essays, legal briefs,5 or lines of code have these AI tools already written?

Inference is the true high-volume activity in this new age of generative AI applications. Dave Brown, VP of EC2 at AWS, estimates that 90% of the cloud spend in AI will be on inference. Providers like Box have already rolled out specialized pricing to support the elevated costs of performing inference at scale.6

There are many ways application providers perform inference. Simpler models, like image recognition, can run at the edge: think of the painting and plant identifiers built into Apple’s Photos app. Others are so complex that they rely on dedicated HPC mesh networks collating responses and evaluations from many specialized chips, as with OpenAI’s ChatGPT infrastructure. As model evaluation becomes increasingly real-time, throughput and latency are critical differentiators for infrastructure providers.

Speed is King

Throughput is the measurement of how much data can pass through the AI network in a specified time, while latency describes how long a model takes to make a prediction from the data fed into it.

Both throughput and latency are key factors that determine the usability of an AI application – whether a deep-learning LLM like ChatGPT or the smaller ML models Yelp uses to sort through and classify millions of photos and read menus. These metrics are often benchmarked during training – how many examples the model can process per second, for instance – and have meaningful real-world consequences for the design of the HPC network needed to support training and inference for these models, and for the experience of using them.
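
To make the distinction concrete, here is a minimal benchmarking sketch. The model is a stand-in stub with invented per-call costs, not any real inference API, but it shows how batching trades per-request latency for throughput:

```python
import time

def fake_model(batch):
    """Stand-in for a real model call (hypothetical costs, not a real API):
    a fixed ~10 ms overhead per call plus ~1 ms of compute per item."""
    time.sleep(0.010 + 0.001 * len(batch))
    return [x * 2 for x in batch]

def benchmark(batch_size, n_batches=50):
    latencies, items = [], 0
    start = time.perf_counter()
    for _ in range(n_batches):
        batch = list(range(batch_size))
        t0 = time.perf_counter()
        fake_model(batch)
        latencies.append(time.perf_counter() - t0)
        items += batch_size
    elapsed = time.perf_counter() - start
    p50 = sorted(latencies)[len(latencies) // 2] * 1000
    print(f"batch={batch_size:3d}  p50 latency={p50:6.1f} ms  "
          f"throughput={items / elapsed:7.1f} items/s")

# Bigger batches improve throughput, but each request waits longer.
for bs in (1, 8, 32):
    benchmark(bs)
```

The same trade-off shapes real deployments: batching requests keeps expensive accelerators busy, but every request in the batch inherits the batch's latency.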

For training, which can already run for months, time is money. Parallel processing is the best technique to shorten the training lifecycle and is a critical component of an effective training architecture. GPUs are directly interconnected with one another so that, via RDMA, data can be transmitted from the memory of one to another without CPU involvement. Hundreds or thousands of chips divvy up the work of training a model, leading to explosive growth in fiber density and in the complexity of the connection schema needed to achieve high I/O data flow. These ultra-high-bandwidth, many-to-many connections directly between GPUs or TPUs maximize throughput across the training lifecycle.
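
The pattern those GPU-to-GPU links exist to serve can be sketched in a few lines. This toy NumPy version is our own illustration, not any vendor's stack: each "worker" computes gradients on its shard of the batch, and an all-reduce averages them before the shared weights are updated.

```python
import numpy as np

# Toy sketch of data-parallel training: the averaged-gradient exchange below
# ("all-reduce") is the traffic that RDMA fabrics carry between accelerators
# at every training step.

rng = np.random.default_rng(0)
true_w = np.array([1.0, -2.0, 0.5, 3.0])
X = rng.normal(size=(64, 4))
y = X @ true_w + 0.1 * rng.normal(size=64)

def local_gradient(w, X_shard, y_shard):
    """Gradient of mean squared error on one worker's shard."""
    err = X_shard @ w - y_shard
    return 2 * X_shard.T @ err / len(y_shard)

n_workers = 4
w = np.zeros(4)                                   # shared model weights
for step in range(100):
    shards = zip(np.array_split(X, n_workers), np.array_split(y, n_workers))
    grads = [local_gradient(w, Xs, ys) for Xs, ys in shards]  # parallel in practice
    g = np.mean(grads, axis=0)                    # the "all-reduce" across workers
    w -= 0.05 * g                                 # every worker applies the same update

print("recovered weights:", np.round(w, 2))       # close to [1.0, -2.0, 0.5, 3.0]
```

In production this exchange happens across hundreds or thousands of accelerators at every step, which is why the fabric's bandwidth sets the pace of training.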

In inference, we face different parameters; the use case of the model directly steers the design of the solution.

Waymo, the self-driving car and driverless robotaxi service from Google, completed one million miles of driving on public roads without a human present in January 2023. In fully autonomous mode, a Waymo vehicle makes billions of decisions each day, many of them through predictive inference as it encounters the messy reality of driving. A child chasing a ball into the street, or a sudden encounter with lost wildlife, is the classic scenario where low inference latency is critical. Even obeying the speed limit, the onboard computer has only fractions of a second to take in the new data and decide how to respond.
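
A quick bit of arithmetic shows why those moments are so short; the speed and latency budgets below are our own illustrative numbers, not Waymo's:

```python
# How far a vehicle travels while the onboard inference stack is still "thinking".
# Illustrative numbers only: a 35 mph city street and a few hypothetical
# end-to-end decision latencies.
MPH_TO_M_PER_S = 1609.34 / 3600

speed_m_per_s = 35 * MPH_TO_M_PER_S           # ~15.6 m/s
for latency_ms in (10, 50, 100, 250):
    meters = speed_m_per_s * latency_ms / 1000
    print(f"{latency_ms:4d} ms of decision latency -> {meters:4.1f} m traveled")
```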

Providers need multiple hardware solutions connected with unprecedented densities.

The basic difference between the two chip families is that GPUs process tasks in parallel whereas CPUs have traditionally performed them serially. However, the advent of multicore processors enables CPUs to process data in parallel too; because CPU inference can be lower latency and less energy-hungry than GPU inference for many workloads, there is a growing case for optimizing models to use them.

Model latency is only part of the measurement, and it is directly the result of model design – the more parameters the model accepts, generally the more intensive the workload to determine the answer. There is also network latency to consider: marshalling the data to the chips and streaming back their responses are serious concerns for application providers.
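
One way to see why network latency matters as much as model latency is to lay out a hypothetical end-to-end budget; every number below is an assumption for illustration, not a measurement:

```python
# Hypothetical end-to-end latency budget for a single inference request.
# All figures are illustrative assumptions, not measurements.
budget_ms = {
    "client to inference endpoint": 20,   # network transit in
    "queueing / batching wait":     15,   # waiting for a slot on the accelerator
    "model forward pass":           40,   # the "model latency" itself
    "endpoint back to client":      20,   # network transit out
}
total = sum(budget_ms.values())
for stage, ms in budget_ms.items():
    print(f"{stage:30s} {ms:3d} ms ({100 * ms / total:4.1f}% of the total)")
print(f"{'end-to-end':30s} {total:3d} ms")
```

In a budget like this, more than half of what the user experiences as "slowness" never touches the model at all.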

Inference also rarely exists in a vacuum. A model that predicts home values on a realtor search website runs right alongside the everyday lifecycle of a web request, requiring bridged network solutions that allow inference meshes to play well within spine-leaf architectures.

Training and inference are both critical phases in an AI model’s lifecycle, each presenting its own unique challenges and requirements. Throughout that lifecycle, the tug-of-war between throughput and latency plays a pivotal role in ensuring optimal performance, be it during the resource-intensive training phase or the high-frequency inference stage.

By understanding their interdependencies and the high-density HPC network architectures tailored for each phase, we can better harness the capabilities of AI models and pave the way for innovations that will revolutionize industries.

Stay tuned for more in-depth insights in the next installments of our Architecture Series on training and inference.

We'd love to hear how we can help you.


  1. The Economist: https://www.economist.com/science-and-technology/2023/06/21/the-bigger-is-better-approach-to-ai-is-running-out-of-road
  2. Microsoft Azure: https://youtu.be/Rk3nTUfRZmo?t=266
  3. 20VC: https://www.thetwentyminutevc.com/yann-lecun/
  4. Inflection AI: https://inflection.ai/inflection-ai-announces-1-3-billion-of-funding
  5. Reuters: https://www.reuters.com/legal/new-york-lawyers-sanctioned-using-fake-chatgpt-cases-legal-brief-2023-06-22/
  6. TechCrunch: https://techcrunch.com/2023/10/11/box-unveils-unique-ai-pricing-plan-to-account-for-high-cost-of-running-llms/amp/