OpenAI Introduces MRC (Multipath Reliable Connection): A New Open Networking Protocol for Large-Scale AI Supercomputer Training Clusters


Training frontier AI models is not just a compute problem — it is increasingly a networking problem. And OpenAI just introduced its solution.

OpenAI announced the release of MRC (Multipath Reliable Connection), a novel networking protocol developed over the past two years in partnership with AMD, Broadcom, Intel, Microsoft, and NVIDIA. The specification was published through the Open Compute Project (OCP), enabling the broader industry to use and build on it.

Why Networking is the Hidden Bottleneck in AI Training

To understand why MRC matters, consider what happens inside a supercomputer during model training. A single training step for a large AI model can involve many millions of data transfers, and one transfer arriving late can ripple through the entire job, leaving GPUs sitting idle.

Network congestion and link or device failures are the most common sources of delay and jitter in these transfers, and both become more frequent, and harder to solve, as the size of the cluster increases. This is the compounding infrastructure challenge OpenAI set out to fix.

According to OpenAI, more than 900 million people use ChatGPT every week. Sustaining and improving models at that scale means every second of GPU idle time represents real cost and lost capability. OpenAI states its goal as “not just to build a fast network, but also to build one that delivers very predictable performance, even in the presence of failures, to keep training jobs moving.”

What MRC Actually Does: Three Core Mechanisms

MRC is not a ground-up invention. It extends RDMA over Converged Ethernet (RoCE) — an InfiniBand Trade Association (IBTA) standard that enables hardware-accelerated remote direct memory access among GPUs and CPUs. It draws on techniques developed by the Ultra Ethernet Consortium (UEC) and extends them with SRv6-based source routing to support large-scale AI networking fabrics.

RoCE is a protocol that allows one machine to read or write memory on another machine directly over an Ethernet network, bypassing the CPU for maximum throughput. SRv6 (Segment Routing over IPv6) takes this further — the sending machine encodes the exact route the packet should follow directly inside the packet header, so switches no longer need to run complex routing calculations. This reduces the processing load on switches and saves power — a meaningful factor at data center scale.
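To make the source-routing idea concrete, here is a minimal Python sketch of how an SRv6-style segment list works. This is an illustration under assumptions, not OpenAI’s implementation: the class and function names are hypothetical, and real SRv6 headers carry IPv6 segment identifiers rather than strings.

```python
from dataclasses import dataclass, field

@dataclass
class SRv6Packet:
    """Toy packet: the sender embeds the full route as a segment list.
    As in real SRv6, segments are stored in reverse order and a
    'segments left' counter points at the active one."""
    payload: bytes
    segments: list[str]            # chosen entirely by the sending NIC
    segments_left: int = field(init=False)

    def __post_init__(self):
        self.segments_left = len(self.segments) - 1

def switch_forward(packet: SRv6Packet) -> str:
    """All a switch has to do: read the active segment and forward.
    No route computation, no routing protocol, no per-flow state."""
    next_hop = packet.segments[packet.segments_left]
    if packet.segments_left > 0:
        packet.segments_left -= 1  # advance toward the destination
    return next_hop

# The sender picks the exact path: leaf, spine, leaf, destination NIC.
pkt = SRv6Packet(b"gradient shard", ["dst-nic", "leaf-2", "spine-7", "leaf-1"])
print(switch_forward(pkt))  # leaf-1; later hops follow spine-7, leaf-2, dst-nic
```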

1. Adaptive Packet Spraying to Eliminate Congestion

Instead of sending each transfer over a single network path, MRC spreads packets across hundreds of paths simultaneously, reducing congestion in the core of the network. With traditional RoCEv2, every packet of a connection is pinned to a single path from point A to point B, which concentrates traffic and contributes to congestion. MRC instead introduces Intelligent Packet-Spray Load Balancing: if a packet’s current path is unusable, its packets can be steered onto other paths on the network. This enables higher bandwidth utilization, reduced tail latency, and fine-grained load balancing at the packet level, as the sketch below illustrates.
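A minimal sketch of per-packet spraying, assuming a simple round-robin rotation (the names here are hypothetical, and the real MRC load balancer is more sophisticated than this):

```python
from itertools import cycle

def spray(packets, paths, unusable=frozenset()):
    """Assign each packet of a transfer to the next healthy path,
    so no single path carries the whole message."""
    healthy = [p for p in paths if p not in unusable]
    if not healthy:
        raise RuntimeError("no usable path to destination")
    rotation = cycle(healthy)
    return [(pkt, next(rotation)) for pkt in packets]

packets = [f"pkt-{i}" for i in range(6)]
paths = [f"path-{i}" for i in range(4)]

# If path-2 becomes unusable, its share is spread over the survivors
# instead of stalling the whole transfer.
for pkt, path in spray(packets, paths, unusable={"path-2"}):
    print(pkt, "->", path)
```

The essential point is that path choice happens per packet rather than per connection; in practice the decision would also weigh per-path congestion signals.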

2. Microsecond-Level Failure Recovery via SRv6 Static Source Routing

When network paths, links, or switches fail, MRC can detect the problem and route around it on a microsecond timescale. Conventional network fabrics can take seconds or even tens of seconds to stabilize after failures. A key architectural decision makes this possible: the switches don’t need to recompute routes or do anything other than blindly follow the static routes they were configured with. All routing intelligence lives at the NIC level, not the switch level. This is a deliberately unconventional design — disabling dynamic routing in the switches entirely to prevent two adaptive mechanisms from interfering with each other.

Before MRC, if a link between a GPU’s network interface and a tier-0 switch failed, the training job would fail. With MRC, the job survives with reasonable performance. If an 8-port network interface loses one port, the maximum rate is reduced by one eighth. MRC detects this, recalculates paths to avoid the failed plane, and immediately tells peers not to use that plane for inbound traffic. Most failed links recover within a minute, at which point MRC brings the plane back into use.
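The following sketch illustrates the plane-masking behavior described above. It is a simplification under assumptions (the classes and peer-notification calls are hypothetical); the real failover happens in NIC hardware and firmware on a microsecond timescale.

```python
class Peer:
    """Remote NIC that must be told not to send inbound traffic on a failed plane."""
    def __init__(self, name: str):
        self.name = name
        self.avoided_planes: set[int] = set()

    def avoid_plane(self, plane: int) -> None:
        self.avoided_planes.add(plane)

    def restore_plane(self, plane: int) -> None:
        self.avoided_planes.discard(plane)

class MultiPlaneNIC:
    """8-port NIC with one port per network plane."""
    def __init__(self, num_planes: int = 8):
        self.num_planes = num_planes
        self.healthy = set(range(num_planes))

    def max_rate_fraction(self) -> float:
        # Losing one of eight planes costs exactly one eighth of peak rate.
        return len(self.healthy) / self.num_planes

    def on_link_down(self, plane: int, peers: list[Peer]) -> None:
        self.healthy.discard(plane)     # stop selecting paths on this plane
        for peer in peers:
            peer.avoid_plane(plane)     # and stop receiving on it

    def on_link_up(self, plane: int, peers: list[Peer]) -> None:
        self.healthy.add(plane)         # bring the plane back into use
        for peer in peers:
            peer.restore_plane(plane)

nic, peers = MultiPlaneNIC(), [Peer("gpu-17"), Peer("gpu-42")]
nic.on_link_down(3, peers)
print(nic.max_rate_fraction())  # 0.875: the job survives at 7/8 bandwidth
```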

3. Multi-Plane Networks with Fewer Switch Tiers and Lower Cost

This is where MRC changes cluster architecture fundamentally. Instead of treating each network interface as one 800Gb/s link, MRC splits it into multiple smaller links; one interface can, for example, connect to eight different switches. A switch that can connect 64 ports at 800Gb/s can instead connect 512 ports at 100Gb/s. This makes it possible to build a network fully connecting about 131,000 GPUs with only two tiers of switches, where a conventional 800Gb/s network would require three or four tiers.
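The arithmetic behind these numbers can be checked directly. This is a back-of-the-envelope sketch assuming an idealized two-tier Clos built from radix-R switches, which fully connects R²/2 endpoints at full bisection bandwidth; the paper’s actual topology has details this ignores.

```python
interface_gbps = 800
planes = 8
plane_gbps = interface_gbps // planes     # 100 Gb/s per plane

radix_at_800g = 64
effective_radix = radix_at_800g * planes  # 64 ports x 8 = 512 ports at 100 Gb/s

# Idealized two-tier Clos from radix-R switches: each leaf dedicates R/2
# ports to hosts and R/2 to spines, so the fabric fully connects R^2 / 2
# endpoints at full bisection bandwidth.
max_gpus = effective_radix ** 2 // 2
print(plane_gbps, effective_radix, max_gpus)  # 100 512 131072
```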

The savings compound further: the research team quantifies that, for full bisection bandwidth, the two-tier multi-plane design requires 2/3 of the optics and 3/5 of the switches of a three-tier network. Fewer switch tiers also mean lower latency, since the longest path traverses only three switches rather than five or seven, and a smaller blast radius when any individual component fails. The hop counts are sketched below.
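The hop-count claim follows from the shape of a folded-Clos fabric: the worst-case route climbs to the top switch tier and back down, crossing one switch per tier in each direction minus the shared top switch. A one-line check:

```python
# Worst-case path in an n-tier folded Clos crosses 2n - 1 switches.
for tiers in (2, 3, 4):
    print(f"{tiers}-tier fabric: longest path traverses {2 * tiers - 1} switches")
# -> 3, 5, and 7 switches, matching the latency comparison above.
```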

Hardware: Which NICs and Switches Run MRC

According to the research paper, MRC is already running in production on specific, named hardware. It is implemented across 400 and 800Gb/s RDMA NICs, including NVIDIA ConnectX-8, AMD Pollara, AMD Vulcano, and Broadcom Thor Ultra, with SRv6 switch support on NVIDIA Spectrum-4 and Spectrum-5 (running Cumulus and SONiC) and on Broadcom Tomahawk 5 via Arista EOS. On the protocol side, AMD contributed the NSCC congestion control algorithm, now part of the UEC Congestion Control specification, along with IB/RDMA transport semantic layer extensions that let MRC integrate with existing RDMA programming models while adding the multipath capabilities that set it apart from traditional transports.

Already in Production: From Stargate to Fairwater

MRC is not just a prototype. It is already deployed across all of OpenAI’s largest NVIDIA GB200 supercomputers used to train frontier models, including the Stargate site built with Oracle Cloud Infrastructure (OCI) in Abilene, Texas, and Microsoft’s Fairwater supercomputers in Atlanta and Wisconsin. MRC has been used to train multiple OpenAI models on hardware from NVIDIA and Broadcom.

MRC has been used specifically to train frontier large language models for ChatGPT and Codex. During the training of a recent frontier model, OpenAI had to reboot four tier-1 switches. With MRC, the company did not need to coordinate the reboot with the teams running training jobs in the cluster.

Key Takeaways

  • OpenAI Introduces MRC — OpenAI partnered with AMD, Broadcom, Intel, Microsoft, and NVIDIA to release MRC (Multipath Reliable Connection) through the Open Compute Project (OCP).
  • Packet Spraying Kills Congestion — MRC spreads packets across hundreds of paths simultaneously, eliminating core congestion and reducing tail latency during large-scale GPU training.
  • Microsecond Failure Recovery — MRC detects link and switch failures and reroutes traffic in microseconds, keeping training jobs alive through failures that would previously have caused full job termination.
  • Two-Tier Topology for 131,000+ GPUs — By splitting 800Gb/s interfaces into eight 100Gb/s planes, MRC supports supercomputers with over 100,000 GPUs using only two tiers of switches instead of three or four.
  • Already used for ChatGPT and Codex — MRC is already deployed across OpenAI’s largest NVIDIA GB200 supercomputers and has been used to train frontier large language models for ChatGPT and Codex.

Check out the Paper and Technical details.



Michal Sutter is a data science professional with a Master of Science in Data Science from the University of Padova. With a solid foundation in statistical analysis, machine learning, and data engineering, Michal excels at transforming complex datasets into actionable insights.


