
Rethinking High-performance Computing

The high performance computing landscape needs to change dramatically to support next-gen AI.

  • Architectures & technology demands are not the same as traditional HPC.
  • An unprecedented change in volume puts pressure on underdeveloped infrastructure and requires ultra-high density networking.
  • Providers will need multiple solutions to support whitespace and retrofits across multiple generations.

No single chip, not even a flagship GPU, can power a Generative AI solution.

Thousands of machines collaborate throughout the development and runtime of these models, mesh-networked with many times the amount of fiber used today in traditional data center applications.

Established corporations and industry newcomers alike are eager to capture the efficiency and automation potential promised by next-gen applications. Early-stage experiments are rapidly maturing into practical use cases and applications.

 

The gap between the resources required for developing and running these artificial intelligence solutions and the available on-demand compute supply poses a significant challenge: Cloud providers are not yet equipped to meet the rising demands of Gen AI.

Gen AI needs new architectures and technologies, including ultra-high density networking, to achieve its promise. Cloud providers need multiple solutions and a rapid design partner to capture market share in this dynamic environment.

TRADITIONAL HPC

Predictable datasets with stable operations (often even the exact same calculations) in batch.

  • Overwhelmingly On-premise: 80% of HPC workloads are executed in a private or on-prem environment, often in batch or offline processing.
  • Reasonable Networking Requirements: The amount of connectivity scales exponentially with the compute distribution and size of the cluster, but batch processing reduces the pressure for data transfer and network complexity. 
  • Cluster Computing: Collaborative, tightly coupled, and long-running processing like model training, genome processing, ballistic simulations, or weather predictions. Sometimes, HPC is a good fit for embarrassingly parallel workloads (like processing a major retailer’s credit card transactions).

WHAT AI NEEDS

Proliferating datasets, many novel use cases, highly variable operations with low latency.

  • Inference in the Cloud: The value of AI lies in getting answers from trained models. Inference is 90% of the market activity, and it will require cloud ubiquity, if not edge deployment, to support low latency.
  • Miles of Fiber: Model inference won’t happen in isolation: it’s part of a larger request that must bridge traditional and mesh networks to respond in near-real time across hundreds or thousands of networked chips.
  • Unimaginable Scale: Today’s HPC customers are specialized enterprises: particle accelerators, Hollywood studios, aerospace manufacturers. Don’t imagine the infrastructure to support a single self-driving car; what if every car were self-driving?

We Need to Reinvent Cloud HPC

Merely upgrading the patterns and technologies of today’s HPC will fall far short of what next-gen AI requires.

Only 20% of HPC workloads are executed in the cloud today, less than half of other applications

Early experiments in AI did not challenge the hardware and network designs of existing HPC — especially for model training, they were even a good fit.

This is no longer the case. Advances in primary research across model development and vector databases, new use cases, growing data scale, and the emergence of specialized compute for training and inference* all threaten a straightforward “upgrade” or “expand” mentality.

This is why we have to rethink HPC systems: develop designs that incorporate compute accelerator technologies, software-defined solutions, high-performance data storage, and storage-class memory, and that adopt networking technologies like InfiniBand.**

Trying to bolt these onto traditional HPC systems creates a whole new set of hardware challenges and results in stranded compute resources. Even accommodating the 4-5x growth in fiber density makes managing the cabling and conveyances a challenge.

This mismatch extends further into problems of scale: Cloud adoption has soared over the last decade, and cloud providers consistently expand their capacity to match the demand.1

By 2021, 33% of enterprises were running over 50% of their workloads on the cloud.2 However, HPC has resisted full cloud adoption and seriously lags these trends. According to Hyperion Research, despite doubling in the lead-up to 2020, only 20% of HPC workloads are executed in the cloud today, less than half the share of other applications.3

Enterprise buyers were trapped in long lock-in cycles and were still capitalizing on existing investments in supercomputer hardware; cloud providers scaled their deployment of on-demand HPC accordingly.

The underlying consequence is significant: just as demand for specialized computing resources accelerates, the current footprint of the cloud is neither designed for Gen AI nor built at the right scale.

* Our next article is focused specifically on the emerging needs of training vs. inference workloads and network designs to support them.
** The adoption of specific technologies like InfiniBand and RoCE is covered in our next article as well.

Connectivity Solutions for High-density AI Networks

Our engineers are designing critical new offerings for ultra-high density deployments to support the most advanced providers in AI.

We are world-class partners for next-gen data halls.

Unbeatable density

Ultra-high density solutions bring an incredible 432, 3,456, or 5,184 fibers in just 1U of space
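
To put these densities in context, the short sketch below estimates how many rack units are needed to terminate a given number of fibers at each of the three quoted densities. The function name and the 20,000-fiber example are hypothetical and chosen only for illustration; they do not describe any specific product configuration.

```python
import math

# Fibers per 1U of rack space, as quoted above.
FIBERS_PER_U = (432, 3456, 5184)


def rack_units_needed(total_fibers: int, fibers_per_u: int) -> int:
    """Return the rack units (U) needed to terminate total_fibers, rounded up."""
    if total_fibers < 0 or fibers_per_u <= 0:
        raise ValueError("total_fibers must be >= 0 and fibers_per_u must be > 0")
    return math.ceil(total_fibers / fibers_per_u)


if __name__ == "__main__":
    example_fibers = 20_000  # hypothetical fiber count for one data hall zone
    for density in FIBERS_PER_U:
        units = rack_units_needed(example_fibers, density)
        print(f"{density:>5} fibers/U -> {units} U for {example_fibers} fibers")
```

Under these assumptions, the same 20,000 fibers occupy 47U at the lowest density and 4U at the highest.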

Very Small Form Factor

Our solution supports the newest VSFF connectors, designed for maximum and manageable capacity: SN, SN-MT, MDC, MMC

Flexibility by Design

HyperReach™ brings complete flexibility with a mix-and-match, modular design

Emerging AI Volume: Not a Hype Cycle

Organizations from MSCI to McKinsey are predicting titanic impacts from artificial intelligence and automation, reaching into the trillions of dollars annually and affecting an average of 25% of job responsibilities.4 5 These predictions have accelerated this year over those made just half a decade ago. Despite the confusing chatter and speculation, one thing is exceedingly clear:

We are at the very beginning of the race.

The lure of such immense potential is propelling application providers to capture a share of the burgeoning market, while savvy pick-and-shovel operators are capitalizing on providing the underlying infrastructure.6 This surge in demand extends to bespoke and non-traditional cloud providers, who are converting underutilized cryptocurrency infrastructure into new model-friendly cloud capacity.

This pressure to capture trillions in value exceeds even the forces that catalyzed the shift to cloud computing.

To better serve the needs of next-gen AI, new primary research, database technologies, storage-class memory, and network designs are being developed and deployed rapidly into the enterprise.7

Providers need multiple solutions at unprecedented densities.

The real estate market and construction cycle are too slow and unreliable for cloud providers to rely on whitespace-only solutions.

Some portion of existing capacity must be retrofitted to ensure continued market relevance and ability to capture a fair share of customer usage. 

These redeployments are fraught with complexity; the power and cooling requirements for next-gen HPC overburden existing designs and make simple swap-ins unachievable. New fluid dynamics studies, new optimizations of electricity throughput, and more are all necessary to ensure that the physical infrastructure of retrofitted data halls functions sustainably.

Even something as simple as putting two NVIDIA DGX H100 servers in a single rack requires liquid cooling—for only 16 GPUs. (Many Gen AI solutions require hundreds or thousands to train.)
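
As a rough illustration of that arithmetic, the sketch below estimates how many racks a training cluster would occupy at 16 GPUs per rack (two 8-GPU servers). The function name and the example cluster sizes are hypothetical; they are not figures from any specific deployment.

```python
import math

GPUS_PER_SERVER = 8    # one DGX H100 server holds 8 GPUs
SERVERS_PER_RACK = 2   # the two-server rack described above


def racks_needed(total_gpus: int,
                 gpus_per_server: int = GPUS_PER_SERVER,
                 servers_per_rack: int = SERVERS_PER_RACK) -> int:
    """Return the number of racks needed to host total_gpus, rounded up."""
    if total_gpus < 0:
        raise ValueError("total_gpus must be >= 0")
    gpus_per_rack = gpus_per_server * servers_per_rack
    return math.ceil(total_gpus / gpus_per_rack)


if __name__ == "__main__":
    # Hypothetical cluster sizes, from a single rack up to "thousands" of GPUs.
    for cluster in (16, 256, 1024):
        print(f"{cluster:>5} GPUs -> {racks_needed(cluster)} liquid-cooled racks")
```

At this density, a hypothetical 1,024-GPU cluster already spans 64 racks, each of which needs its own cooling and fiber plan.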

The fiber architectures cannot simply be bolted on. To drive the tremendous number of interconnections between GPUs, Gen AI designs require incredible connectivity density at the rack level. Beyond adopting InfiniBand, the disaggregation of hardware necessitates a flexible any-to-any fiber cabling design.
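
To give a feel for how quickly interconnections grow, the sketch below counts the point-to-point links needed if every endpoint were wired directly to every other one. This is a simplifying assumption and an upper bound: real fabrics typically use switched topologies rather than a direct full mesh, and the function name is hypothetical.

```python
def any_to_any_links(endpoints: int) -> int:
    """Count the direct links needed to connect every endpoint to every other.

    Assumes exactly one link per pair of endpoints (a full mesh), which is
    n * (n - 1) / 2 for n endpoints.
    """
    if endpoints < 0:
        raise ValueError("endpoints must be >= 0")
    return endpoints * (endpoints - 1) // 2


if __name__ == "__main__":
    # Doubling the endpoint count roughly quadruples the number of links.
    for n in (16, 32, 64):
        print(f"{n:>3} endpoints -> {any_to_any_links(n)} direct links")
```

Even under this simplified model, going from 16 to 64 endpoints raises the link count from 120 to 2,016, which is why cabling cannot be treated as an afterthought.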

Short-term tactics for altering existing capacity will compete with long-term strategies for AI-friendly designs at scale. Even “long-term” is potentially misleading; requirements will evolve just as rapidly as needs and use cases.

Considering the capital investment and interest AI is attracting, nimble providers will be clear winners in an uncertain future. Fast iteration and shorter time to market are key to keeping pace with this rapidly evolving landscape.

Organizations that mean to compete need an engineering & manufacturing partner who can design multiple generations of fiber & connectivity solutions and deliver them to market quickly and with impeccable quality.

Learn more about our HyperReach™ Solutions →

 

Designing with viaPhoton

Missing Products

Most providers are missing solutions they need to support next-gen networks.

Design Together

 

Nearly a third of viaPhoton's full-time staff are engineers, working with you to discover the best possible design.

Rapid Prototyping

Prototypes in days and weeks, not months.

Immediate Scale

The speed and quality of a Made in the USA manufacturer with global scale.

We'd love to hear how we can help you.

Gen AI Thought Leadership

Rethinking High-performance Computing

Prepare for the design challenges to meet the rising demand for Gen AI

Gen AI Inside the DC (coming Fall 2023)

Designing a network for both training & inference

Operating Gen AI DCs (coming Q4 2023)

Heating, Cooling, Memory, & Storage

  1. JLL, Data Center Outlook 2023 Report, https://www.us.jll.com/en/trends-and-insights/research/data-center-outlook 
  2. https://www.zippia.com/advice/cloud-adoption-statistics/
  3. https://www.datacenterknowledge.com/cloud/cios-guide-migrating-hpc-workloads-cloud
  4. https://www.visualcapitalist.com/sp/ranking-industries-by-their-potential-for-ai-automation/
  5. https://www.mckinsey.com/capabilities/mckinsey-digital/our-insights/the-economic-potential-of-generative-ai-the-next-productivity-frontier
  6. https://a16z.com/2023/01/19/who-owns-the-generative-ai-platform/
  7. https://www.forbes.com/sites/forbestechcouncil/2021/01/19/how-ai-is-reshaping-hpc-and-what-this-means-for-data-center-architects/?sh=17383d6c7371