llama.cpp server adds router mode for dynamic multi-model management
AI Impact Summary
llama.cpp server now supports a router mode that lets operators load, unload, and switch between multiple models dynamically, without restarting the server. Each model runs in its own process, so a crash in one model does not take down the router, and model swaps are more stable. Auto-discovery from the llama.cpp cache or a user-specified models-dir, on-demand loading, and an LRU eviction cap (--models-max) give operators explicit control over memory and latency, facilitating A/B testing and multi-tenant deployments via the API or Web UI. Expect a first-use delay for models not already loaded; subsequent requests avoid that delay because the model stays resident until evicted.
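As a rough sketch of what driving router mode over the server's OpenAI-compatible HTTP API could look like: the base URL, endpoint paths, and model names below are assumptions for illustration, not taken from the release notes, so consult the llama.cpp server docs for the exact routes and flags.

```python
# Minimal sketch: selecting models through a router-mode llama-server.
# Assumptions: server at localhost:8080, OpenAI-compatible
# /v1/chat/completions endpoint, and hypothetical model names.
import json
import urllib.request

BASE = "http://localhost:8080"  # assumed default llama-server address


def chat(model: str, prompt: str) -> str:
    """Send a chat completion; in router mode the `model` field is
    presumed to select which model serves the request, loading it
    on demand if it is not already resident."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    req = urllib.request.Request(
        f"{BASE}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


# First call to a cold model pays the one-time load delay; repeat
# calls are fast until the model is evicted under the --models-max
# LRU cap. Model names here are hypothetical placeholders.
print(chat("qwen2.5-7b-instruct", "Say hello."))
print(chat("llama-3.2-3b-instruct", "Say hello."))  # may trigger a load
```

Because each model lives in its own process, the second call above would, per the summary, at worst spawn or evict a worker process rather than disturb requests already in flight against other models.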
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info