Introduction to Continuous Batching
Continuous batching is a widely adopted technique in LLM inference serving that groups multiple requests into a single batch to improve computational efficiency. In its default form, however, it has two primary sources of waste: the cost of preparing each batch, and the time the GPU sits idle while the CPU prepares the next one.
This waste matters because GPU time is expensive and the cost adds up quickly. An hour on a GPU may be relatively cheap, but a full day can cost upwards of $140, so every idle stretch is money spent on no work.
Moreover, the default synchronous nature of continuous batching means that the GPU waits while the CPU prepares the next batch, leading to significant idle time. This problem is exacerbated in loops that run hundreds of steps per second, where these idle gaps can account for nearly a quarter of total runtime.
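To make the stakes concrete, the two figures above can be combined in a quick back-of-the-envelope calculation. This is a rough sketch only: the function name `idle_cost` is hypothetical, and the 25% idle fraction is an assumption that rounds "nearly a quarter" up for illustration.

```python
def idle_cost(daily_cost_usd: float, idle_fraction: float) -> float:
    """Return the part of a day's GPU spend that pays for idle time."""
    if not 0.0 <= idle_fraction <= 1.0:
        raise ValueError("idle_fraction must be between 0 and 1")
    return daily_cost_usd * idle_fraction


daily = 140.0  # daily GPU cost quoted in the text
idle = 0.25    # assumption: "nearly a quarter" of runtime, rounded up

wasted = idle_cost(daily, idle)
print(f"${wasted:.2f} of ${daily:.2f} per day pays for an idle GPU")
# prints "$35.00 of $140.00 per day pays for an idle GPU"
```

Under these assumptions, roughly a quarter of the daily bill buys no useful compute, and the loss scales linearly with the number of GPUs in the deployment.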
Asynchronous Batching: The Solution
Asynchronous batching addresses the issue of idle GPU time by decoupling CPU batch preparation from GPU batch compute. The CPU and GPU then operate in parallel, so the GPU spends far less time waiting for its next batch.
To achieve this, we can use non-blocking CPU-to-GPU transfers, so that the CPU prepares the next batch while the GPU computes the current one. Non-blocking transfers alone are not enough, however: if the GPU operations are scheduled on the default stream, the CPU will still block until they have all finished.
Therefore, we need to schedule GPU operations on a separate stream to ensure true asynchronicity.
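The effect of overlapping preparation with compute can be illustrated without a GPU. The sketch below is a simulation, not the CUDA mechanism itself: `time.sleep` stands in for both CPU batch preparation and GPU compute, and a background thread with a one-slot queue plays the role that non-blocking transfers and a separate stream would play on real hardware. All function names (`prepare_batch`, `compute`, `run_sync`, `run_async`) are hypothetical.

```python
import queue
import threading
import time


def prepare_batch(i: int, prep_s: float) -> str:
    time.sleep(prep_s)  # stand-in for CPU-side batch preparation
    return f"batch-{i}"


def compute(batch: str, compute_s: float) -> str:
    time.sleep(compute_s)  # stand-in for the GPU forward pass
    return batch


def run_sync(n: int, prep_s: float, compute_s: float) -> float:
    """Prepare, then compute, one batch at a time; return elapsed seconds."""
    start = time.perf_counter()
    for i in range(n):
        compute(prepare_batch(i, prep_s), compute_s)
    return time.perf_counter() - start


def run_async(n: int, prep_s: float, compute_s: float) -> float:
    """Prepare batch i+1 in the background while batch i is computed."""
    ready: queue.Queue = queue.Queue(maxsize=1)  # at most one batch queued

    def producer() -> None:
        for i in range(n):
            ready.put(prepare_batch(i, prep_s))
        ready.put(None)  # sentinel: no more batches

    start = time.perf_counter()
    worker = threading.Thread(target=producer, daemon=True)
    worker.start()
    while (batch := ready.get()) is not None:
        compute(batch, compute_s)
    worker.join()
    return time.perf_counter() - start


if __name__ == "__main__":
    print(f"sync:  {run_sync(5, 0.02, 0.02):.3f}s")
    print(f"async: {run_async(5, 0.02, 0.02):.3f}s")
```

In the synchronous loop every step pays for preparation plus compute. In the overlapped loop only the first preparation is exposed, and each later step costs roughly the longer of the two phases.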
Technical Architecture and Deep Dive
Building asynchronous continuous batching from the ground up requires a modular approach to handle complex data retrieval and processing. This involves designing a system that can efficiently manage multiple streams of data and computations.
The system should handle each request within a batch independently, so that a finished request can leave the batch and a waiting one can take its place without stalling the others. This keeps processing efficient and minimizes idle time.
Additionally, the system should be scalable and able to handle large volumes of data and computations.
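The idea of handling requests independently within a batch can be sketched as a toy scheduler. This is a minimal illustration under stated assumptions, not the architecture of any particular serving system: the `Request` class, the `tokens_left` field, and `run_continuous_batching` are hypothetical names, and each loop iteration stands in for one decoding step.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Request:
    name: str
    tokens_left: int  # decoding steps this request still needs


def run_continuous_batching(requests, max_batch_size):
    """Return the names of the active requests at each decoding step."""
    waiting = deque(requests)
    active = []
    schedule = []
    while waiting or active:
        # Refill free slots before every step, not only when the batch drains.
        while waiting and len(active) < max_batch_size:
            active.append(waiting.popleft())
        schedule.append([r.name for r in active])
        # One decoding step for every active request.
        for r in active:
            r.tokens_left -= 1
        # Finished requests leave immediately and free their slots.
        active = [r for r in active if r.tokens_left > 0]
    return schedule


if __name__ == "__main__":
    reqs = [Request("A", 3), Request("B", 1), Request("C", 2)]
    for step, names in enumerate(run_continuous_batching(reqs, 2), start=1):
        print(step, names)
    # prints:
    # 1 ['A', 'B']
    # 2 ['A', 'C']
    # 3 ['A', 'C']
```

Request B finishes after one step and C takes its slot right away, so the batch stays full while A is still running. A scheduler that waited for the whole batch to drain would need a fourth and fifth step to serve C.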

Comparison with Other Approaches
Asynchronous continuous batching offers several advantages over other approaches. Because CPU batch preparation overlaps with GPU compute, idle time is kept to a minimum.
In contrast, traditional continuous batching is synchronous: the GPU waits on the CPU between steps, which leaves it idle and reduces productivity.
The asynchronous design is also more scalable and can handle large volumes of data and computations.
Headline figures: 25% reduction in idle time, 30% increase in productivity.
Asynchronous Continuous Batching vs Traditional Approaches
| Component | Open / This Approach | Proprietary Alternative |
|---|---|---|
| Model Provider | Any — OpenAI, Anthropic, Ollama | Single vendor lock-in |
| Scalability | Highly scalable | Limited scalability |
| Productivity | Increased productivity | Reduced productivity |
🔑 Key Takeaway
Asynchronous continuous batching removes the idle gaps of the synchronous loop by letting the CPU prepare the next batch while the GPU computes the current one. By minimizing idle time in this way, it represents a significant architectural shift for high-performance AI deployment.