
Inside the Data Center: Exploring AI White Space

Next-gen AI applications like ChatGPT, Gemini, and others can require the resources of an entire data center to support a single purpose, forcing changes in standard practice for deploying and operating DC infrastructure:

  • The chip itself is subject to questions of design longevity and availability.
  • The chips and compute for AI have specific needs for power and cooling, and they are best served by different transmission protocols.
  • Revisiting transmission protocols impacts the network topologies.
  • Even the physical connectivity design is changing.

We can only speculate about the impacts of a fully realized, mature artificial intelligence capability on our society: new protein-folding algorithms for custom gene therapies, live translation of a conversation into the local idiom and accent, fully bespoke video game play-throughs, eerily indiscernible deepfakes, a new wave of convincing phishing attacks.

The impact on data center infrastructure to enable that future is significant, but it relies on readily available technologies.

From a network perspective, transport will use either InfiniBand, which came out in the year 2000, or Ethernet, which has been around since the 1980s, as a base for RDMA over Converged Ethernet (RoCE). The network architecture of choice is a three-tiered CLOS network, which has also been around for years. Each will be adapted to the new machine learning data center network, but we are not yet looking at entirely new architectures.

Deploying these architectures and supporting the networking demands of the AI workloads are changing the contents of cable conveyances across the world.

Why Not CPUs?

Graphics Processing Units (GPUs) were originally designed to draw and render graphics. With this specialization in vector math and support for massively parallel processing, GPUs are a superior choice over CPUs for machine learning operations. Because CPUs process operations serially per core, they are poorly suited to the many-to-many workloads of model training and inference.

While GPUs are the most common type of chip used for training Gen AI models, their original design for graphics processing means there is still room for further optimizations aimed specifically at artificial intelligence workloads.

Chip technology is constantly evolving — Tensor Processing Units from Google, Language Processing Units from Groq, new ASIC designs from Meta and more are all trying to optimize for the data science needs of their workloads, conserve power, and outrun Moore’s Law.

Designing, Powering, and Cooling Next-gen Chips

2025
is TSMC's current estimate for shipping 2nm chips.

Will chip manufacturers be able to keep up with chip design and availability? The entire supply chain for these intricate chips – lithography machines, shrinking nanometer architectures, custom-tailored designs – is constantly under pressure and increasingly the subject of regulatory regimes. Access to and availability of GPUs are squarely in the critical path of AI data center builds.

“I believe AI is going to change the world more than anything in the history of humanity. More than electricity.”

— Kai-Fu Lee, Chairman & CEO of Sinovation Ventures

Cooling the GPU-heavy racks will also be a significant hurdle, especially as we approach draws of 100 kW+ per rack. Just like the network transport layer, cooling technologies that have been around for years are being pushed to maximum efficiency in the AI data center. As the limits of air cooling (including rear-door heat exchangers) are reached, direct-to-chip cooling is finding favor; it was developed in 1971. Immersion cooling, another strategy, was developed in the 1960s to cool high-voltage transformers.

From a power perspective, the major concern is not whether enough power can be brought to the rack; CoreWeave, Lambda Labs, and each of the hyperscalers are building AI/ML data centers today. The harder hurdle is distributing that power within the rack while bringing PUE back down to today's levels, and this one will take a bit more time to resolve. We still do not have a clear enough understanding of ML workloads to optimize the network infrastructure and adapt power delivery from the data hall down to the rack level.
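For reference, PUE (power usage effectiveness) is total facility power divided by the power delivered to the IT equipment. A minimal sketch with purely illustrative load figures, not measurements from any particular facility:

```python
# PUE = total facility power / IT equipment power. The load values below are
# illustrative assumptions, not measurements from any particular data center.
it_load_kw = 1000          # power delivered to IT equipment (GPUs, servers, switches)
facility_load_kw = 1350    # total facility draw, including cooling and distribution losses

pue = facility_load_kw / it_load_kw
print(f"PUE = {pue:.2f}")  # 1.35 in this example; lower is better, 1.0 is the ideal
```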

Understanding Transmission Protocols

The incredible growth in the volume of fiber infrastructure is not the only change needed to support HPC networks: we also have to revisit the protocols we deploy.

Widely used, Ethernet is the most flexible network protocol, offering scalability, easy management, and cost efficiency. However, Ethernet alone does not address the complete problem set required to train AI applications. For this reason, the Remote Direct Memory Access (RDMA) protocol, supported by InfiniBand technology, has been adopted by infrastructure providers, enabling high data transfer rates with low latency.

Providers are also exploring protocols that meet the need for high throughput and low latency in the data center while retaining the scalability and cost effectiveness of Ethernet. The general sense is that RDMA over Converged Ethernet (RoCE) will become the HPC transmission protocol of choice for hyperscale data centers supporting AI applications; Oracle has publicly shared that it is deploying RoCE in its data centers.

InfiniBand RDMA

InfiniBand can scale to 16K GPUs in a single 3-tier CLOS network
  • High-speed, low-latency transport technology that is scalable and efficient without the use of CPU resources.
  • Natively supports Remote Direct Memory Access (RDMA), enabling data to be transferred directly from the memory of one node to the memory of another without utilizing CPU resources.
  • Dramatically accelerates transport of data between server clusters while reducing latency.

RoCE

RoCE can scale to 32K GPUs in a single 3-tier CLOS network
  • RDMA over Converged Ethernet leverages RDMA via a converged Ethernet transport network.
  • Additional congestion management mechanisms (ECN, CNP, and PFC) are used to prioritize and protect RDMA traffic.
  • Combines the performance and low latency of RDMA with the high data rates and flexibility of Ethernet.
  • 3 modes of RoCE can be configured – Lossless, Semi-lossless, or Lossy.
  • From a people perspective, far more engineers are proficient with Ethernet than with IB. This could favor the growth of RoCE moving forward.

In addition to deploying more fiber and new transmission protocols to support Generative AI, there are also changes to consider in the topology of the network infrastructure. InfiniBand is not managed the same way Ethernet is.

Data center topologies greatly affect performance, reliability, and cost. Choosing a specific topology requires a strategic choice between those three considerations. No design is perfectly balanced and whichever is deployed will shape the output and performance of the data center.

Before we dig into InfiniBand topologies, it is important to understand a basic principle of transmission: Ethernet was designed as a shared medium with traffic managed across individual links, while InfiniBand is a switched fabric. Its nodes are connected directly through dedicated serial, switched paths rather than a shared medium.

This is critical when designing the network and physical-layer infrastructure that interconnects clusters of GPUs so that processing for a single application can be scheduled efficiently. To scale InfiniBand clusters reliably, you cannot simply add switches ad hoc.

It requires a scalable fiber infrastructure that makes it easy to connect additional InfiniBand switches when they are needed.

A 3-tiered CLOS network architecture is a hierarchical design that consists of core switches connected to spine switches, which are in turn connected to a number of leaf switches. The leaf switches are responsible for connecting to the end nodes, such as servers or workstations. CLOS networks are well suited to high-performance computing (HPC) and other applications that require high bandwidth and low latency. They are also relatively easy to scale, as additional leaf switches can be added to the network without disrupting existing connections.
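As a rough illustration of how a 3-tier CLOS (fat-tree) design scales, the sketch below computes switch and endpoint counts for a non-blocking k-ary fat-tree built from identical radix-k switches. The radix values are assumptions for illustration; real InfiniBand and RoCE fabrics cap cluster sizes below this theoretical ideal for routing and management reasons.

```python
# Minimal sketch: switch and endpoint counts for a non-blocking k-ary fat-tree
# (3-tier CLOS) built from identical k-port switches. Illustrative only; real
# fabrics are sized below this theoretical maximum.

def fat_tree_capacity(k: int) -> dict:
    """Return switch and host counts for a k-ary fat-tree."""
    pods = k                          # a k-ary fat-tree has k pods
    leaf_per_pod = k // 2             # half of each pod's switches are leaves
    spine_per_pod = k // 2            # the other half are spines
    core_switches = (k // 2) ** 2     # tier-3 (core) switches
    max_hosts = (k ** 3) // 4         # k/2 hosts per leaf * k/2 leaves * k pods
    return {
        "pods": pods,
        "leaf_switches": pods * leaf_per_pod,
        "spine_switches": pods * spine_per_pod,
        "core_switches": core_switches,
        "max_hosts": max_hosts,
    }

for radix in (32, 64):                # e.g. 32- or 64-port switches
    print(radix, fat_tree_capacity(radix))
```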

For clusters spanning multiple locations, a dragonfly topology may be used to connect remote clusters. A dragonfly network is hierarchical in a different way: switches are organized into groups, switches within a group are densely interconnected, and each group connects to the other groups through a small number of global links, which keeps hop counts low between clusters. Any topology can be used within each cluster that dragonfly ties together.

Because latency is critical when connecting nodes working on a shared workload, shorter distances are required between clusters. When designing an InfiniBand physical layer infrastructure, multiple options are available for interconnecting nodes and switches.

Regardless of the design, each meter of fiber adds roughly 5 ns of latency.
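A quick back-of-the-envelope way to budget that latency, taking the ~5 ns per meter figure above as given (light propagates at roughly two-thirds of c in glass); the run lengths are illustrative assumptions:

```python
# Minimal sketch: one-way propagation delay for fiber runs, assuming ~5 ns per
# meter as stated above. Run lengths are illustrative, not from any design.
NS_PER_METER = 5

def propagation_delay_ns(length_m: float) -> float:
    """One-way propagation delay in nanoseconds for a fiber run."""
    return length_m * NS_PER_METER

for length_m in (2, 30, 100):   # in-rack, end-of-row, across the data hall
    print(f"{length_m:>4} m -> {propagation_delay_ns(length_m):>4.0f} ns one way")
```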

Even the Physical Layer is Changing

150
miles of fiber cable in the DGX GH200 supercomputer.

With ever-greater fiber density, new transmission protocols, and changing topologies, hyperscalers are facing an unprecedented level of change to support the newest advancements in Generative AI. Even the physical layer of connectivity has competing technologies. DAC is most prevalent for short distances, given its low cost and power consumption, but these chips are not packed close together: they have to be distributed throughout the data center for power and thermal management, which puts many links beyond DAC's reach. DAC alone is not a robust enough solution for Generative AI, and there are viable alternatives.

“The computing fabric is one of the most vital systems of the AI supercomputer. 400 Gbps ultra-low latency NVIDIA Quantum InfiniBand with in-network processing connects hundreds and thousands of DGX nodes into an AI supercomputer.”

— NVIDIA CEO Jensen Huang, GTC 2023

OSFP

  • Up to 800 Gbps for a maximum distance of 500 m
  • Can be used as single 800 Gbps or broken out to 2×400 Gbps
  • Dual MPO8 interface over singlemode fiber
  • Highest flexibility in patching using singlemode jumpers for long or short runs

DAC

  • Active and passive twinax copper cable
  • OSFP/QSFP-DD 800G transceiver attached to each end of cable
  • Utilized for very short patching – max length of 2 meters
  • Lowest power consumption

AOC

  • 800 Gbps over both singlemode and multimode fiber cable
  • Single 800G AOC or 800G – 2x400G breakout cable options
  • MMF lengths up to 100 meters

Complexity of Connectivity

AOC

PROS

  • Cheaper initial installation cost


DRAWBACKS

  • Entire assembly needs to be replaced for a single bad optic
  • Entire assembly needs to be replaced when upgrading data rates
  • Shorter distances than structured cabling
  • Ongoing management of long cables from servers to leaf switches

Structured Cabling with Transceiver

PROS

  • Data rate upgrade requires replacing transceivers only
  • High fiber count trunks reduce cabling runs
  • Longer distances than AOC
  • Manage only short fiber jumpers from panel to server and panel to switch


DRAWBACKS

  • More expensive initial installation

Direct Attached Cables (DAC) are ideal for short-distance connectivity, based on their low latency and low power consumption. The real decision is whether to use transceivers with structured fiber cabling or Active Optical Cables (AOC).

In a typical InfiniBand topology using a 3-tier CLOS architecture, the leaf switches reside in an end-of-row or middle-of-row cabinet, or in a leaf/spine switch pod; top-of-rack switches are not used. The result is that DACs are used only where switches are in close proximity, to stay within the distance limitations of passive DACs. Cabling between the server cabinets and the end-of-row cabinets or leaf/spine pods must be fiber optic because of the distances involved.

A fiber optic cabling solution must be designed for the day-zero install as well as ease of use and management at day 1000.

The optics play a big part in designing the fiber infrastructure. A 400G QSFP transceiver uses 8 fibers per optic, and an NVIDIA DGX server contains (8) 400G InfiniBand connections, or 64 fibers per server just for InfiniBand. A single cabinet often scales to 4 servers and 256 fibers in total.
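The arithmetic behind those counts is simple enough to sketch; the per-cabinet server count is the illustrative figure from the text, not a hard limit:

```python
# Minimal sketch of the fiber math above: 8 fibers per 400G QSFP optic and
# (8) 400G InfiniBand connections per DGX-class server. The 4-servers-per-
# cabinet figure is the illustrative value from the text, not a hard limit.
FIBERS_PER_400G_OPTIC = 8
IB_CONNECTIONS_PER_SERVER = 8
SERVERS_PER_CABINET = 4

fibers_per_server = FIBERS_PER_400G_OPTIC * IB_CONNECTIONS_PER_SERVER
fibers_per_cabinet = fibers_per_server * SERVERS_PER_CABINET

print(f"InfiniBand fibers per server:  {fibers_per_server}")    # 64
print(f"InfiniBand fibers per cabinet: {fibers_per_cabinet}")   # 256
```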

Fiber density is one of the biggest complexities in the world of ML data centers. In a traditional LC connector-centered infrastructure, the densest panel on the market is 192 fibers per 1U of rack space. With the number of fibers required to support InfiniBand in the ML data center, traditional fiber connectors just don't cut it.

Getting Even Denser: VSFF

Very Small Form-Factor (VSFF) connectors enable a maximum density of 3,456 fibers in that same 1U panel. These panels use 16-fiber VSFF connectors, such as the USCONEC MMC or Senko SN-MT16. A consequence of the 16-fiber connector is a mismatch between the ribbon count of a traditional fiber cable (12 fibers) and the connector (16 fibers).
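To make the density difference concrete, here is a rough comparison of the patch-panel rack units needed under each connector family, using the 192 and 3,456 fibers-per-1U figures above; the 64-cabinet cluster size is an assumption for illustration only:

```python
# Minimal sketch: rack units of patch panels needed for a given fiber count,
# comparing LC panels (192 fibers/1U) with 16-fiber VSFF panels (3,456/1U).
# The 64-cabinet cluster below is an illustrative assumption.
import math

LC_FIBERS_PER_RU = 192
VSFF_FIBERS_PER_RU = 3456

def rack_units_needed(total_fibers: int, fibers_per_ru: int) -> int:
    """Rack units of patch panels required to land a given fiber count."""
    return math.ceil(total_fibers / fibers_per_ru)

total_fibers = 64 * 256   # 64 cabinets at 256 InfiniBand fibers each (see above)
print(f"LC panels:   {rack_units_needed(total_fibers, LC_FIBERS_PER_RU)} RU")    # 86 RU
print(f"VSFF panels: {rack_units_needed(total_fibers, VSFF_FIBERS_PER_RU)} RU")  # 5 RU
```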

For the longest time, cable assemblies have used 12 fibers for an 8-fiber connector and 24 fibers for a 16-fiber connector. At the densities of ML data centers, that means running cables in which a third of the fibers are unused just to meet demand. Now imagine how quickly the cable conveyances fill up and overflow.

The solution for cable construction is a multi-pronged approach. The first prong is to reduce the overall size of the cable by shrinking each individual fiber and reducing the spacing between fibers.

The individual fibers have been reduced from 250 microns to 200 microns in diameter, and the separation, or pitch, between fibers has likewise been reduced from 250 microns to 200 microns. The result is a 35% reduction in the outside diameter (OD) of the cable.

The second prong is changing the ribbon count from the standard 12 fibers to 16 fibers. This results in 100% utilization of every fiber, whether terminating an 8-fiber connector for use with a QSFP transceiver or a 16-fiber connector used with an OSFP transceiver, as the sketch below illustrates.
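A minimal sketch of the utilization math behind both points, the stranded fibers in legacy ribbon counts and the 100% utilization of 16-fiber ribbons; the ribbon/connector pairings are the ones named above:

```python
# Minimal sketch: fraction of fibers left unterminated when a ribbon feeds
# connectors of a given size. Legacy 12- and 24-fiber ribbons strand a third
# of their fibers on 8- and 16-fiber connectors; 16-fiber ribbons strand none.

def unused_fraction(ribbon_fibers: int, connector_fibers: int) -> float:
    """Fraction of ribbon fibers that cannot be terminated on whole connectors."""
    used = (ribbon_fibers // connector_fibers) * connector_fibers
    return 1 - used / ribbon_fibers

for ribbon, connector in [(12, 8), (24, 16), (16, 8), (16, 16)]:
    pct = unused_fraction(ribbon, connector) * 100
    print(f"{ribbon:>2}-fiber ribbon -> {connector:>2}-fiber connector: {pct:.0f}% unused")
```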

Fiber optic cabling infrastructure will increase in both complexity and density. OM4 and singlemode fiber will still be deployed; multimode and singlemode fiber have been around since 1989 and 1972, respectively. With the necessity of GPU interconnects, the amount of fiber will double, or in some cases quadruple, depending on the optics in the servers.

New developments in the construction of the fiber optic cable itself will reduce its overall outside diameter (OD), and new Very Small Form-Factor (VSFF) connectors will allow higher fiber densities per panel and per rack.

The next two years will bring a significant learning curve as existing technologies are pushed to their limits and new ones are created to support the AI data center.

We'd love to hear how we can help you.