Introduction to Decoupled DiLoCo
Decoupled DiLoCo is a distributed training architecture designed to make large language model training more resilient and efficient across geographically separated data centers. It decouples compute into asynchronous, fault-isolated 'islands', enabling large language model pre-training without tight synchronization, and it supports mixing different hardware generations within a single training run, lowering the barrier to entry for smaller players in the AI field. Under high hardware failure rates, Decoupled DiLoCo has been shown to achieve 88% goodput, compared to just 27% for standard Data-Parallel methods.
Technical Overview of Decoupled DiLoCo
Decoupled DiLoCo is built on asynchronous data flow: different compute resources proceed at their own pace without blocking on one another. The architecture consists of decoupled compute islands, each processing its own mini-batches of data. Learners and the syncer exchange parameter fragments over the data-center network, and at each outer optimization step the syncer shards perform an all-reduce over only a single fragment rather than the whole model. This reduces per-step communication overhead and isolates hardware failures, so the system sustains high goodput even when failure rates are high.
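The outer loop above can be sketched in a few lines. This is a minimal simulation, not the actual Decoupled DiLoCo implementation: the names (`Island`, `outer_step`) and constants (`NUM_FRAGMENTS`, the learning rates) are illustrative assumptions, and the all-reduce is stood in for by a plain average over local pseudo-gradients.

```python
import numpy as np

NUM_PARAMS = 1024
NUM_FRAGMENTS = 4      # parameters are split into fragments for syncing
OUTER_LR = 0.7

class Island:
    """A fault-isolated compute island holding a full model replica."""
    def __init__(self, global_params):
        self.params = global_params.copy()

    def inner_steps(self, rng, steps=10):
        # Stand-in for local SGD on the island's own mini-batches.
        for _ in range(steps):
            self.params -= 0.01 * rng.standard_normal(NUM_PARAMS)

    def pseudo_gradient(self, global_params):
        # The outer "gradient": how far local training has drifted
        # from the last globally synced parameters.
        return global_params - self.params

def outer_step(global_params, islands, step):
    # Key decoupling detail: each outer step all-reduces only ONE
    # fragment, cycling through fragments over successive steps,
    # instead of synchronizing the whole model at once.
    size = NUM_PARAMS // NUM_FRAGMENTS
    lo = (step % NUM_FRAGMENTS) * size
    hi = lo + size
    # "All-reduce" stand-in: average pseudo-gradients for this fragment.
    avg = np.mean(
        [isl.pseudo_gradient(global_params)[lo:hi] for isl in islands], axis=0
    )
    global_params[lo:hi] -= OUTER_LR * avg
    for isl in islands:  # islands pick up the freshly synced fragment
        isl.params[lo:hi] = global_params[lo:hi]
    return global_params

rng = np.random.default_rng(0)
global_params = np.zeros(NUM_PARAMS)
islands = [Island(global_params) for _ in range(3)]
for step in range(8):
    for isl in islands:
        isl.inner_steps(rng)   # islands run at their own pace
    global_params = outer_step(global_params, islands, step)
```

Because each outer step only touches one fragment, a slow or failed island delays at most one fragment's synchronization rather than stalling the entire model exchange.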
Benefits of Decoupled DiLoCo
Decoupled DiLoCo offers several benefits over traditional data-parallel methods. Because islands synchronize asynchronously, a slow or failed island does not stall the others, which keeps goodput high in large-scale runs. Support for mixed-generation hardware within a single training run means organizations are not locked into a homogeneous accelerator fleet, reducing the barrier for smaller players in the AI field.
Key figures:

- 2-5 Gbps: standard internet-level bandwidth
- 1.2 million chips: number of chips used in simulations
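As a rough illustration of why per-fragment synchronization matters at internet-level bandwidth, the back-of-envelope below estimates transfer times over a 2 Gbps link. The model size (10B parameters in bf16) and fragment count (16) are assumptions for illustration, not figures from the source.

```python
# Back-of-envelope: parameter-exchange time at internet-level bandwidth.
# Model size (10B params, 2 bytes each) and fragment count (16) are
# illustrative assumptions, not figures reported for Decoupled DiLoCo.

PARAMS = 10e9
BYTES_PER_PARAM = 2       # bf16
BANDWIDTH_BPS = 2e9       # low end of the 2-5 Gbps range
NUM_FRAGMENTS = 16

model_bytes = PARAMS * BYTES_PER_PARAM
link_bytes_per_s = BANDWIDTH_BPS / 8

full_sync_s = model_bytes / link_bytes_per_s    # whole-model exchange
fragment_sync_s = full_sync_s / NUM_FRAGMENTS   # one fragment per outer step

print(f"full-model sync: {full_sync_s:.0f} s")    # prints 80 s
print(f"single fragment: {fragment_sync_s:.0f} s")  # prints 5 s
```

Under these assumptions, exchanging the full model each outer step would take on the order of a minute, while a single fragment moves in seconds, which is what lets the outer loop proceed at a reasonable cadence over slow links.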

Conclusion and Future Work
Decoupled DiLoCo makes large language model training more resilient and efficient across geographically separated data centers, sustaining high goodput even under high hardware failure rates. Future work will focus on further improving the efficiency and resilience of the architecture, as well as exploring its applications in other areas of AI research.
Comparison of Distributed Training Architectures
| Aspect | Decoupled DiLoCo | Standard Data-Parallel |
|---|---|---|
| Goodput under high failure rates | 88% | 27% |
| Synchronization | Asynchronous, fault-isolated islands | Tight lockstep across all workers |
| Hardware support | Mixed-generation hardware | Single-generation hardware |
🔑 Key Takeaway
Decoupled DiLoCo enables faster, more resilient AI training across data centers: asynchronous, fault-isolated islands sustain high goodput even under high hardware failure rates, and support for mixed-generation hardware within a single training run lowers the barrier for smaller players in the AI field.