Together GPU Clusters releases SlLink v1.0 for Slurm on Kubernetes
AI Impact Summary
Together GPU Clusters has released SlLink v1.0, a new Slurm-on-Kubernetes stack that improves reliability for newly provisioned clusters. The release adds self-healing worker daemons, durable job accounting, and improved process tracking, which address common issues such as orphaned processes and inconsistent GPU state. Existing clusters can be migrated in place, and the new version introduces per-cluster GPU utilization metrics via DCGM.
Affected Systems
- Date: not specified
- Change type: incident
- Severity: medium