Introduction to Slurm and Kubernetes
Slurm is an open source cluster management and job scheduling system for Linux. It manages job scheduling for over 65% of TOP500 systems. Kubernetes is a container orchestration system that automates the deployment, scaling, and management of containerized applications. By combining Slurm and Kubernetes, we can create a powerful platform for managing large-scale GPU workloads.
Slinky Slurm-Operator
Slinky, an open-source project from SchedMD, integrates Slurm with Kubernetes to manage GPU infrastructure at scale. The Slinky slurm-operator represents Slurm clusters as Kubernetes Custom Resource Definitions (CRDs) and runs each Slurm component (`slurmctld` for scheduling, `slurmdbd` for accounting, `slurmd` for compute workers, `slurmrestd` for REST API access) under the operator's control. This allows Slurm to be deployed and operated with standard Kubernetes tooling.
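As a sketch of what this looks like in practice, a Slurm cluster can be declared as a Custom Resource and applied with `kubectl`. The API group, kind, and field names below are illustrative assumptions rather than the operator's exact schema; consult the slurm-operator documentation for the real CRD spec:

```shell
# Declare a minimal Slurm cluster as a Custom Resource (schema is illustrative)
kubectl apply -f - <<'EOF'
apiVersion: slinky.slurm.net/v1alpha1   # assumed API group/version
kind: Cluster                           # assumed kind defined by the operator's CRDs
metadata:
  name: slurm-demo
  namespace: slinky
spec:
  controller:        # slurmctld: scheduling
    replicas: 1
  accounting:        # slurmdbd: job accounting
    enabled: true
  restapi:           # slurmrestd: REST API access
    replicas: 1
  compute:           # slurmd: compute workers
    nodesets:
      - name: gpu-workers
        replicas: 3
EOF
```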
An example Slurm job script, submitted the same way as on a conventional Slurm cluster (the `srun` line is illustrative; substitute your own MPI binary):

```bash
#!/bin/bash
#SBATCH -J mpi-ping-pong
#SBATCH -o /shared_storage/mpi-ping-pong-%j.out
#SBATCH -e /shared_storage/mpi-ping-pong-%j.err
#SBATCH -t 0-2:00:00
#SBATCH --nodes=3
#SBATCH --ntasks-per-node=1

# Launch one MPI rank per node (binary name is illustrative)
srun ./mpi-ping-pong
```
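Once the cluster is up, jobs are submitted with the standard Slurm client commands; for example, assuming the script above is saved as `mpi-ping-pong.sbatch`:

```shell
# Submit the job script; Slurm replies with "Submitted batch job <jobid>"
sbatch mpi-ping-pong.sbatch

# Check queue state for your user
squeue -u "$USER"

# Inspect accounting records once the job finishes (requires slurmdbd)
sacct -j <jobid> --format=JobID,JobName,State,Elapsed
```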
Deployment and Scaling
The slurm-operator can be installed through Helm, and Slurm clusters can then be defined as Custom Resources, which makes deploying and scaling Slurm clusters on Kubernetes straightforward. NVIDIA runs the Slinky slurm-operator in production across multiple clusters, with some deployments scaling to over 8,000 GPUs.
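A typical installation looks like the following; the chart location, namespace, and release name are assumptions based on the Slinky project's published charts, so check the slurm-operator documentation for current values:

```shell
# Install the slurm-operator from its Helm chart (chart URL is an assumption)
helm install slurm-operator oci://ghcr.io/slinkyproject/charts/slurm-operator \
  --namespace slinky --create-namespace

# Verify the operator pods are running
kubectl get pods -n slinky
```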

Conclusion and Future Work
Combining Kubernetes and Slurm provides a powerful platform for managing large-scale GPU workloads. The slurm-operator provides a scalable and flexible solution for resource allocation and job scheduling. As the demand for large-scale GPU workloads continues to grow, the importance of efficient workload management will only increase.
🔑 Key Takeaway
Integrating Slurm with Kubernetes yields a scalable, flexible way to manage large-scale GPU workloads, suited to everything from small research projects to large enterprise deployments.