Introduction to Decoupled DiLoCo
Decoupled DiLoCo is a distributed pre-training framework that replaces the traditional Single Program Multiple Data (SPMD) paradigm with a fully asynchronous architecture. Independent learners communicate parameter fragments to a central synchronizer, eliminating lock-step synchronization barriers and reducing the impact of hardware failures on training dynamics.
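As a rough sketch of what that learner-side loop might look like — the toy least-squares model, step counts, and `push_fragment` channel below are invented for illustration, not the paper's actual API:

```python
import numpy as np

def run_learner(w, batches, push_fragment, lr=0.1):
    """Hypothetical DiLoCo-style learner loop (an illustrative sketch, not
    the paper's implementation): run a burst of local gradient steps on a
    toy least-squares model, then push the resulting parameter delta (the
    "fragment") to the synchronizer and continue without blocking."""
    start = w.copy()
    for X, y in batches:                        # local (inner) optimization steps
        grad = 2 * X.T @ (X @ w - y) / len(y)   # least-squares gradient
        w = w - lr * grad
    push_fragment(w - start)                    # fire-and-forget: no barrier
    return w

# Toy usage: one learner, one communication round of 50 local steps.
rng = np.random.default_rng(0)
X = rng.normal(size=(32, 4))
true_w = np.array([1.0, -2.0, 0.5, 3.0])
y = X @ true_w
fragments = []                                  # stand-in for the network channel
w = run_learner(np.zeros(4), [(X, y)] * 50, fragments.append)
```

In the real system each learner would stream fragments of a large model independently; the point of the sketch is only that pushing a delta is a one-way operation, so no learner ever waits on a peer.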
The Decoupled DiLoCo architecture is designed to work with mixed-generation hardware, so older accelerators can contribute useful work instead of pinning the cluster to its slowest device. It also enables training across geographically distant data centers over standard internet-level bandwidth.
The framework is particularly useful for large-scale AI training, where hardware failures can be a significant bottleneck. By treating pre-training as a distributed systems problem, Decoupled DiLoCo maintains zero global downtime and near-optimal goodput under massive simulated hardware failures.
The authors of Decoupled DiLoCo analogize the tension between consistency, availability, and partition tolerance to the CAP theorem in distributed systems. They prioritize availability and partition tolerance over strict parameter consistency, making the framework more resilient to hardware failures.
Decoupled DiLoCo has been shown to achieve 88% goodput under high hardware failure rates, making it a promising solution for large-scale AI training.
💡 Key Benefits
Decoupled DiLoCo offers several key benefits, including improved resilience to hardware failures, faster training times, and better scalability.
Decoupled DiLoCo Architecture
The Decoupled DiLoCo architecture is designed to be highly scalable and resilient. It consists of independent learners that communicate parameter fragments to a central synchronizer. The synchronizer is responsible for aggregating the parameter fragments and updating the model.
The learners and synchronizer communicate asynchronously, using a minimum-quorum and adaptive grace-window system. This allows for flexible and efficient communication, even in the presence of hardware failures.
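One plausible shape for such a quorum-plus-grace-window rule is sketched below; the queue transport, timing values, and mean aggregation are assumptions for illustration, not the paper's implementation:

```python
import queue
import time
import numpy as np

def collect_round(fragment_queue, num_learners, min_quorum, grace_window_s):
    """Hypothetical synchronizer round (illustrative sketch, not the
    paper's code): block until at least `min_quorum` fragments arrive,
    then accept stragglers for an extra `grace_window_s` seconds, then
    aggregate whatever was received. Learners that miss the window are
    not treated as failures; their fragments can fold into a later round."""
    fragments = {}
    deadline = None
    while len(fragments) < num_learners:
        try:
            learner_id, delta = fragment_queue.get(timeout=0.01)
            fragments[learner_id] = delta
        except queue.Empty:
            pass
        if deadline is None and len(fragments) >= min_quorum:
            deadline = time.monotonic() + grace_window_s  # quorum reached
        if deadline is not None and time.monotonic() >= deadline:
            break  # grace window closed: proceed with whoever arrived
    # Simple mean over received fragments; the real aggregation rule may differ.
    return np.mean(list(fragments.values()), axis=0), sorted(fragments)

# Usage: 4 learners, one of which never reports this round.
q = queue.Queue()
for lid in range(3):
    q.put((lid, np.ones(2) * (lid + 1)))
update, arrived = collect_round(q, num_learners=4, min_quorum=2,
                                grace_window_s=0.05)
```

The key design point the sketch captures is that the round completes without the missing learner: availability is preserved at the cost of strict consistency, exactly the CAP-style trade-off described above.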
Experimental Results
The authors of Decoupled DiLoCo have conducted extensive experiments to evaluate the performance of the framework. They compared Decoupled DiLoCo with a standard data-parallel (DP) baseline, and found that Decoupled DiLoCo achieved significantly better performance in terms of goodput and resilience to hardware failures.
The experiments were conducted on large-scale models, with up to 9B parameters and 141B tokens. The results showed that Decoupled DiLoCo achieved 88% goodput under high hardware failure rates, while the DP baseline achieved only 50% goodput.
The authors also evaluated the downstream performance of Decoupled DiLoCo, and found that it achieved similar or better performance than the DP baseline. This suggests that Decoupled DiLoCo is a promising solution for large-scale AI training, where hardware failures can be a significant bottleneck.
The experiments demonstrate the effectiveness of Decoupled DiLoCo in achieving high goodput and resilience to hardware failures, making it a valuable tool for large-scale AI training.
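To build intuition for why asynchrony preserves goodput, the toy failure model below (invented parameters, not the paper's simulation) compares a synchronous cohort, which stalls whenever any single worker is down, against independent learners that only lose their own downtime:

```python
import random

def simulate_goodput(num_workers, steps, p_fail, seed=0):
    """Toy failure model (for intuition only): at each step, each worker
    is independently down with probability p_fail.
    - Synchronous DP: a step counts as useful only if *every* worker is
      up; one failure stalls the whole cohort.
    - Decoupled/async: each worker's up-steps count individually, since
      no learner waits on another.
    Returns (sync_goodput, async_goodput) as fractions of total capacity."""
    rng = random.Random(seed)
    sync_useful = 0
    async_useful = 0
    for _ in range(steps):
        up = [rng.random() >= p_fail for _ in range(num_workers)]
        if all(up):
            sync_useful += num_workers   # whole cohort makes progress
        async_useful += sum(up)          # each up worker makes progress
    capacity = num_workers * steps
    return sync_useful / capacity, async_useful / capacity

sync_gp, async_gp = simulate_goodput(num_workers=16, steps=10_000, p_fail=0.02)
```

With a 2% per-step failure rate, async goodput stays near 98% while the synchronous cohort's goodput decays roughly like (1 − p)^N in the number of workers; the gap widens as clusters grow. This toy ignores recovery time and checkpoint overhead, so it understates the synchronous penalty if anything.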
💡 Key Takeaways
Decoupled DiLoCo achieves 88% goodput under high hardware failure rates, and similar or better downstream performance than the DP baseline.

Conclusion
Decoupled DiLoCo is a promising solution for large-scale AI training, where hardware failures can be a significant bottleneck. By treating pre-training as a distributed systems problem rather than a lock-step computation, the framework maintained zero global downtime and 88% goodput under massive simulated hardware failures, compared with 50% for the synchronous data-parallel baseline.
Decoupled DiLoCo has the potential to democratize access to high-performance training, making it more accessible to smaller players in the AI field. By providing a highly scalable and resilient framework for large-scale AI training, Decoupled DiLoCo can help to accelerate the development of AI models and applications.
Overall, Decoupled DiLoCo is a valuable tool for large-scale AI training, and has the potential to make a significant impact in the field of AI.
Comparison of Decoupled DiLoCo and DP Baseline
| Metric | Decoupled DiLoCo | DP Baseline |
|---|---|---|
| Resilience to hardware failures | High | Low |
| Goodput under high hardware failure rates | 88% | 50% |
| Downstream performance | Similar or better | Baseline reference |
🔑 Key Takeaway
Decoupled DiLoCo is a promising solution for large-scale AI training, achieving 88% goodput under high hardware failure rates and similar or better downstream performance than the DP baseline. It has the potential to democratize access to high-performance training and accelerate the development of AI models and applications.