Introduction to Decoupled DiLoCo
Decoupled DiLoCo builds on two earlier advances: Pathways and DiLoCo. Pathways introduced a distributed AI system based on asynchronous data flow, while DiLoCo reduced the bandwidth required between distributed data centers, making it practical to train large language models across distant locations. Decoupled DiLoCo introduces techniques for failure resilience to achieve high goodput and near 100% uptime, even when training across data centers.
The research team validated Decoupled DiLoCo at production scale by successfully training a 12 billion parameter model across four separate U.S. regions using just 2–5 Gbps of wide-area networking.
Decoupled DiLoCo eliminates the single point of failure in large-scale AI training by dividing the run across asynchronous, fault-isolated “islands” of compute called learner units. A chip or cluster failure in one island therefore does not stall the rest of the training run.
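To make the island structure concrete, here is a minimal Python sketch under illustrative assumptions (the island count, the toy SGD inner loop, and the plain-averaging outer step are ours, not the paper's implementation). The point is structural: each island produces its contribution independently, and the outer step proceeds with whichever islands are alive:

```python
import numpy as np

NUM_LEARNERS = 4   # hypothetical island count
INNER_STEPS = 50   # local steps between outer synchronizations

def inner_train(params: np.ndarray, steps: int, seed: int) -> np.ndarray:
    """Stand-in for an island's local (inner) optimization loop."""
    rng = np.random.default_rng(seed)
    for _ in range(steps):
        grad = rng.normal(size=params.shape) * 0.01  # placeholder gradient
        params = params - 0.1 * grad                 # toy SGD inner step
    return params

global_params = np.zeros(1_000)
alive = [True, True, False, True]  # island 2 has failed this round

# Each live island trains independently; a failed island contributes
# nothing this round but never blocks the others.
deltas = [global_params - inner_train(global_params.copy(), INNER_STEPS, seed=i)
          for i in range(NUM_LEARNERS) if alive[i]]

# The outer step averages whatever pseudo-gradients did arrive.
global_params -= 1.0 * np.mean(deltas, axis=0)
```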
Key Features of Decoupled DiLoCo
Decoupled DiLoCo is self-healing. In chaos-engineering experiments that injected realistic hardware failures, the system maintained 88% goodput under high failure rates, compared to just 27% for standard Data-Parallel training, and it seamlessly reintegrated offline learner units when they came back online.
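The following toy simulation sketches why the gap is so large (the per-step failure probability and failure model are assumptions; it does not reproduce the reported 88% vs. 27%, which also reflect checkpoint-restore and recovery overheads). In synchronous data-parallel training, any failure stalls every replica for that step, while decoupled islands lose only the failed islands' work:

```python
import random

STEPS, ISLANDS = 1_000, 8
FAIL_P = 0.02  # hypothetical per-island failure probability per step

random.seed(0)
dp_useful = island_useful = 0
for _ in range(STEPS):
    failed = sum(random.random() < FAIL_P for _ in range(ISLANDS))
    # Synchronous data-parallel: any failure stalls all replicas this step
    # (and in practice also forces a restart from checkpoint).
    if failed == 0:
        dp_useful += ISLANDS
    # Decoupled islands: only the failed islands lose this step's work.
    island_useful += ISLANDS - failed

total = STEPS * ISLANDS
print(f"Data-Parallel goodput: {dp_useful / total:.0%}")
print(f"Island goodput:        {island_useful / total:.0%}")
```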
Decoupled DiLoCo retains the massive bandwidth reduction of its predecessor, Streaming DiLoCo, because learners and the syncer exchange parameter fragments, rather than full model copies, over the data-center network.
At each outer optimization step, the syncer shards perform an all-reduce over only a single fragment rather than the whole model.
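A minimal sketch of that fragment-wise synchronization (the fragment count, round-robin schedule, and in-memory averaging here are illustrative assumptions; in the real system the syncer shards perform this reduction over the network):

```python
import numpy as np

NUM_FRAGMENTS = 4  # hypothetical; the real schedule depends on the model

def outer_step(learner_params: list[np.ndarray], step: int) -> None:
    """Average one parameter fragment across learners, round-robin.

    Only 1/NUM_FRAGMENTS of the model crosses the network per outer
    step, which is where the bandwidth reduction comes from.
    """
    k = step % NUM_FRAGMENTS
    # array_split yields views, so the writes below update each learner in place.
    frags = [np.array_split(p, NUM_FRAGMENTS) for p in learner_params]
    mean_k = np.mean([f[k] for f in frags], axis=0)  # the "all-reduce"
    for f in frags:
        f[k][:] = mean_k

learners = [np.random.randn(1_000_000) for _ in range(8)]
for step in range(NUM_FRAGMENTS):  # one full pass over all fragments
    outer_step(learners, step)
```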
The impact of hardware failures on training dynamics is twofold: (1) for both data-parallel and Decoupled DiLoCo, the loss of chips and their corresponding slices reduces the effective batch size, and (2) specific to Decoupled DiLoCo, recovering learners rejoin the system with stale parameters and optimizer states.
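The paper only states that recovering learners rejoin with stale state; the snippet below sketches one plausible reintegration policy (overwrite the stale parameters with the syncer's current copy and reset the optimizer state), which is our illustration rather than the paper's actual mechanism:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class LearnerState:
    params: np.ndarray
    momentum: np.ndarray   # stand-in for the optimizer state
    outer_step: int        # last outer step this learner participated in

def rejoin(stale: LearnerState, syncer_params: np.ndarray,
           current_step: int) -> LearnerState:
    """Hypothetical reintegration: pull fresh parameters from the syncer
    and zero the optimizer state before the learner resumes training."""
    staleness = current_step - stale.outer_step
    print(f"learner rejoining with {staleness} outer steps of staleness")
    return LearnerState(params=syncer_params.copy(),
                        momentum=np.zeros_like(stale.momentum),
                        outer_step=current_step)
```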
Experiments and Results
The research team reported the downstream performance of a standard data-parallel (DP) baseline versus Decoupled DiLoCo (M = 8 learners) on dense models at 2B, 5B, and 9B parameters, trained for 26B, 72B, and 141B tokens, respectively.
Under high hardware failure rates, Decoupled DiLoCo sustained 88% goodput where standard Data-Parallel training managed only 27%, and it did so while training across geographically distant data centers. The results show that Decoupled DiLoCo can train large language models efficiently and effectively even in the presence of frequent hardware failures.
- 88% goodput under high failure rates
- 12B parameter model trained
- 2–5 Gbps of wide-area networking used

Conclusion and Future Work
Decoupled DiLoCo is a promising approach for resilient AI training at scale. By enabling training of large language models across geographically distant data centers while maintaining high goodput under hardware failures, it is an attractive option for researchers and practitioners alike.
Future work includes exploring the application of Decoupled DiLoCo to other areas of AI research, such as computer vision and reinforcement learning.
Additionally, the development of more efficient and effective algorithms for Decoupled DiLoCo is an area of ongoing research.
How this compares
| Aspect | Decoupled DiLoCo | Standard Data-Parallel |
|---|---|---|
| Fault isolation | Asynchronous, fault-isolated learner islands | Single point of failure |
| Goodput under high failure rates | 88% | 27% |
| Cross-region bandwidth | 2–5 Gbps of WAN (fragment exchange) | Full-model synchronization every step |
🔑 Key Takeaway
Decoupled DiLoCo makes resilient AI training at scale practical: fault-isolated learner islands let large language models train across geographically distant data centers while maintaining high goodput, 88% versus 27% for Data-Parallel, under heavy hardware failure rates.