Introduction to Decoupled DiLoCo
Decoupled DiLoCo builds on two earlier advances: Pathways and DiLoCo. Pathways introduced a distributed AI system based on asynchronous data flow, while DiLoCo reduced the bandwidth required between distributed data centers, making it practical to train large language models across distant locations. Decoupled DiLoCo introduces techniques for failure resilience to achieve high goodput and near 100% uptime, even when training across data centers.
The research team validated Decoupled DiLoCo at production scale by successfully training a 12 billion parameter model across four separate U.S. regions using just 2–5 Gbps of wide-area networking.
Decoupled DiLoCo eliminates the single point of failure in large-scale AI training by dividing the run across asynchronous, fault-isolated “islands” of compute called learner units. A chip or cluster failure in one island therefore does not stall the rest of the training run.
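To make the island structure concrete, here is a minimal Python sketch under illustrative assumptions (the island count, the toy SGD inner loop, and the plain-averaging outer step are ours, not the paper's implementation). The point is structural: each island produces its contribution independently, and the outer step proceeds with whichever islands are alive:

```python
import numpy as np

NUM_LEARNERS = 4   # hypothetical island count
INNER_STEPS = 50   # local steps between outer synchronizations

def inner_train(params: np.ndarray, steps: int, seed: int) -> np.ndarray:
    """Stand-in for an island's local (inner) optimization loop."""
    rng = np.random.default_rng(seed)
    for _ in range(steps):
        grad = rng.normal(size=params.shape) * 0.01  # placeholder gradient
        params = params - 0.1 * grad                 # toy SGD inner step
    return params

global_params = np.zeros(1_000)
alive = [True, True, False, True]  # island 2 has failed this round

# Each live island trains independently; a failed island contributes
# nothing this round but never blocks the others.
deltas = [global_params - inner_train(global_params.copy(), INNER_STEPS, seed=i)
          for i in range(NUM_LEARNERS) if alive[i]]

# The outer step averages whatever pseudo-gradients did arrive.
global_params -= 1.0 * np.mean(deltas, axis=0)
```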
Key Features of Decoupled DiLoCo
Decoupled DiLoCo is self-healing. In chaos-engineering experiments that injected realistic hardware failures, the system maintained 88% goodput under high failure rates, compared to just 27% for standard Data-Parallel training, and it seamlessly reintegrated offline learner units when they came back online.
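The following toy simulation sketches why the gap is so large (the per-step failure probability and failure model are assumptions; it does not reproduce the reported 88% vs. 27%, which also reflect checkpoint-restore and recovery overheads). In synchronous data-parallel training, any failure stalls every replica for that step, while decoupled islands lose only the failed islands' work:

```python
import random

STEPS, ISLANDS = 1_000, 8
FAIL_P = 0.02  # hypothetical per-island failure probability per step

random.seed(0)
dp_useful = island_useful = 0
for _ in range(STEPS):
    failed = sum(random.random() < FAIL_P for _ in range(ISLANDS))
    # Synchronous data-parallel: any failure stalls all replicas this step
    # (and in practice also forces a restart from checkpoint).
    if failed == 0:
        dp_useful += ISLANDS
    # Decoupled islands: only the failed islands lose this step's work.
    island_useful += ISLANDS - failed

total = STEPS * ISLANDS
print(f"Data-Parallel goodput: {dp_useful / total:.0%}")
print(f"Island goodput:        {island_useful / total:.0%}")
```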
Decoupled DiLoCo retains the massive bandwidth reduction of its predecessor, Streaming DiLoCo, because learners and the syncer exchange parameter fragments, rather than full model copies, over the data-center network.
At each outer optimization step, the syncer shards perform an all-reduce over only a single fragment rather than the whole model.
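A minimal sketch of that fragment-wise synchronization (the fragment count, round-robin schedule, and in-memory averaging here are illustrative assumptions; in the real system the syncer shards perform this reduction over the network):

```python
import numpy as np

NUM_FRAGMENTS = 4  # hypothetical; the real schedule depends on the model

def outer_step(learner_params: list[np.ndarray], step: int) -> None:
    """Average one parameter fragment across learners, round-robin.

    Only 1/NUM_FRAGMENTS of the model crosses the network per outer
    step, which is where the bandwidth reduction comes from.
    """
    k = step % NUM_FRAGMENTS
    # array_split yields views, so the writes below update each learner in place.
    frags = [np.array_split(p, NUM_FRAGMENTS) for p in learner_params]
    mean_k = np.mean([f[k] for f in frags], axis=0)  # the "all-reduce"
    for f in frags:
        f[k][:] = mean_k

learners = [np.random.randn(1_000_000) for _ in range(8)]
for step in range(NUM_FRAGMENTS):  # one full pass over all fragments
    outer_step(learners, step)
```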
The impact of hardware failures on training dynamics is twofold: (1) for both data-parallel and Decoupled DiLoCo, the loss of chips and their corresponding slices reduces the effective batch size, and (2) specific to Decoupled DiLoCo, recovering learners rejoin the system with stale parameters and optimizer states.
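The paper only states that recovering learners rejoin with stale state; the snippet below sketches one plausible reintegration policy (overwrite the stale parameters with the syncer's current copy and reset the optimizer state), which is our illustration rather than the paper's actual mechanism:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class LearnerState:
    params: np.ndarray
    momentum: np.ndarray   # stand-in for the optimizer state
    outer_step: int        # last outer step this learner participated in

def rejoin(stale: LearnerState, syncer_params: np.ndarray,
           current_step: int) -> LearnerState:
    """Hypothetical reintegration: pull fresh parameters from the syncer
    and zero the optimizer state before the learner resumes training."""
    staleness = current_step - stale.outer_step
    print(f"learner rejoining with {staleness} outer steps of staleness")
    return LearnerState(params=syncer_params.copy(),
                        momentum=np.zeros_like(stale.momentum),
                        outer_step=current_step)
```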
Experiments and Results
The research team reported the downstream performance of a standard data-parallel (DP) baseline versus Decoupled DiLoCo (M = 8 learners) on dense models at 2B, 5B, and 9B parameters, trained for 26B, 72B, and 141B tokens, respectively.
Under high hardware failure rates, Decoupled DiLoCo sustained 88% goodput where standard Data-Parallel training managed only 27%, and it did so while training across geographically distant data centers. The results show that Decoupled DiLoCo can train large language models efficiently and effectively even in the presence of frequent hardware failures.
- 88% goodput under high failure rates
- 12B parameter model trained
- 2–5 Gbps of wide-area networking used

Conclusion and Future Work
Decoupled DiLoCo is a promising approach for resilient AI training at scale. By enabling training of large language models across geographically distant data centers while maintaining high goodput under hardware failures, it is an attractive option for researchers and practitioners alike.
Future work includes exploring the application of Decoupled DiLoCo to other areas of AI research, such as computer vision and reinforcement learning.
Additionally, the development of more efficient and effective algorithms for Decoupled DiLoCo is an area of ongoing research.
How this compares
| Aspect | Decoupled DiLoCo | Standard Data-Parallel |
|---|---|---|
| Fault isolation | Asynchronous, fault-isolated learner islands | Single point of failure |
| Goodput under high failure rates | 88% | 27% |
| Cross-region bandwidth | 2–5 Gbps of WAN (fragment exchange) | Full-model synchronization every step |
🔑 Key Takeaway
Decoupled DiLoCo makes resilient AI training at scale practical: fault-isolated learner islands let large language models train across geographically distant data centers while maintaining high goodput, 88% versus 27% for Data-Parallel, under heavy hardware failure rates.