Introduction to Decoupled DiLoCo
Decoupled DiLoCo (DiLoCo stands for Distributed Low-Communication) is a new distributed training architecture developed by Google DeepMind and Google Research. It addresses the limitations of traditional distributed training by dividing a training run across decoupled compute islands, and it allows different hardware generations to be mixed within a single run, making training more efficient and resilient.
Decoupled DiLoCo builds on two earlier advances: Pathways, which introduced a distributed AI system based on asynchronous data flow, and DiLoCo, which dramatically reduced the bandwidth required between distributed data centers. The research team validated Decoupled DiLoCo at production scale by successfully training a 12 billion parameter model across four separate U.S. regions using just 2–5 Gbps of wide-area networking.
The impact of hardware failures on training dynamics is twofold: in both data-parallel training and Decoupled DiLoCo, losing chips and their corresponding slices reduces the effective batch size; and, specific to Decoupled DiLoCo, recovering learners rejoin the system with stale parameters and optimizer states.
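As a back-of-the-envelope illustration of the batch-size effect (the unit counts and per-unit batch size below are invented for illustration, not figures from the paper):

```python
# Hypothetical illustration: losing compute slices shrinks the effective batch size.
# All counts below are invented; they are not figures from the paper.
num_learner_units = 8        # decoupled compute islands in the run
per_unit_batch_size = 512    # examples contributed by each healthy unit
failed_units = 2             # units currently offline after hardware failures

effective_batch_size = (num_learner_units - failed_units) * per_unit_batch_size
print(effective_batch_size)  # 3072 instead of the full 4096
```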
Decoupled DiLoCo is self-healing, and the team validated this with chaos engineering, injecting simulated hardware failures during training. Under high failure rates, the system maintained 88% goodput compared to just 27% for standard data-parallel training, and it seamlessly reintegrated offline learner units when they came back online.
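Goodput here is the share of total training time spent on useful work rather than stalls, restarts, and resyncs. A minimal sketch of that bookkeeping, with invented durations chosen to reproduce the 88% figure:

```python
# Hypothetical goodput bookkeeping; the durations are invented for illustration.
total_hours = 100.0
lost_hours = 12.0  # stalls, checkpoint reloads, and resyncs after failures

goodput = (total_hours - lost_hours) / total_hours
print(f"goodput = {goodput:.0%}")  # goodput = 88%
```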
- 12B parameters trained
- 2–5 Gbps bandwidth used
- 88% goodput maintained under high failure rates
💡 Key Benefits
Decoupled DiLoCo provides several key benefits: improved resilience to hardware failures, lower cross-region bandwidth requirements, and the ability to mix different hardware generations within a single training run.
Technical Overview
Decoupled DiLoCo works by dividing a training run across decoupled compute islands called learner units. Each learner unit carries out its own portion of the training, and a syncer coordinates the learner units and keeps the system consistent.
The learner units and the syncer exchange parameter fragments over the data-center network. At each outer optimization step, the syncer's shards perform an all-reduce over only a single fragment rather than the whole model, which reduces the bandwidth needed between the learner units and the syncer.
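A minimal sketch of this per-fragment synchronization in PyTorch, assuming an already-initialized torch.distributed process group. The function name, the contiguous partitioning scheme, and the round-robin fragment rotation are assumptions for illustration; the actual sharding and reduction logic in the production system may differ:

```python
import torch
import torch.distributed as dist

def outer_step_sync(params: list[torch.Tensor], outer_step: int, num_fragments: int) -> None:
    """Hypothetical sketch: all-reduce one parameter fragment per outer step."""
    # Partition the parameter list into contiguous fragments of roughly equal size.
    fragment_size = (len(params) + num_fragments - 1) // num_fragments

    # Rotate through fragments so every parameter is synced once every
    # num_fragments outer steps, instead of syncing the whole model each time.
    idx = outer_step % num_fragments
    fragment = params[idx * fragment_size : (idx + 1) * fragment_size]

    world_size = dist.get_world_size()
    for tensor in fragment:
        dist.all_reduce(tensor, op=dist.ReduceOp.SUM)  # sum across participants
        tensor.div_(world_size)                        # then average
```

Because only one fragment crosses the network per outer step, peak bandwidth scales with the fragment size rather than with the full model size.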
Decoupled DiLoCo also uses a self-healing mechanism to recover from hardware failures. When a learner unit fails, the system can continue training without halting, and when the failed learner unit comes back online, it can seamlessly rejoin the system.
```python
import torch
import torch.distributed as dist  # collective communication used in the sketches here
```
Example imports for the Decoupled DiLoCo sketches in this section.
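The rejoin path can be sketched as follows. This is a hypothetical reconstruction, not the production implementation: the LearnerUnit class, its method names, and the resync protocol are all assumptions.

```python
import torch

class LearnerUnit:
    """Hypothetical learner unit that can drop out of and rejoin a run."""

    def __init__(self, model: torch.nn.Module):
        self.model = model
        self.outer_step = 0

    def rejoin(self, fresh_params: dict, current_outer_step: int) -> None:
        # On rejoin, discard the stale local parameters (see the failure
        # discussion above) and adopt the syncer's current view of the model.
        self.model.load_state_dict(fresh_params)
        self.outer_step = current_outer_step
        # In practice the stale optimizer state would also need to be
        # refreshed or reset before training resumes; that step is omitted here.
```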
Use Cases and Applications
Decoupled DiLoCo has a wide range of potential use cases and applications, including large language model training, computer vision, and natural language processing. The architecture can be used in a variety of industries, including healthcare, finance, and education.
One natural fit is training large language models for natural language processing tasks: these models demand substantial computational resources and are difficult to scale with traditional methods, and Decoupled DiLoCo offers a more efficient and resilient way to train them at state-of-the-art scale. Computer vision is another candidate: vision models likewise require large amounts of data and compute to train, and the same efficiency and resilience benefits apply.
- 100+ industries that can benefit from Decoupled DiLoCo
- 1000+ potential use cases for Decoupled DiLoCo
📊 Use Cases
Decoupled DiLoCo can be used in a variety of industries and applications, including large language model training, computer vision, and natural language processing.

Conclusion and Future Work
Decoupled DiLoCo is a new distributed training architecture that provides a more efficient and resilient way to train large language models and other AI models. The architecture has been validated at production scale and has shown significant improvements in goodput and resilience compared to traditional methods.
Future work on Decoupled DiLoCo will focus on further improving the architecture and exploring new use cases and applications. This may include integrating Decoupled DiLoCo with other distributed training architectures and exploring the use of Decoupled DiLoCo in other industries and domains.
Overall, Decoupled DiLoCo has the potential to significantly improve the efficiency and resilience of distributed training and to enable new use cases and applications for AI models.
Comparison of Decoupled DiLoCo and Traditional Methods
| Metric | Decoupled DiLoCo | Traditional Data-Parallel |
|---|---|---|
| Cross-region bandwidth | 2–5 Gbps | 10–20 Gbps |
| Goodput under high failure rates | 88% | 27% |
🔑 Key Takeaway
Decoupled DiLoCo provides a more efficient and resilient way to train large language models and other AI models, enabling new use cases and applications. Validated at production scale, it delivers significant goodput and resilience gains over traditional methods.