llama.cpp server adds router mode for dynamic multi-model management
AI Impact Summary
llama.cpp server now supports a router mode that lets operators load, unload, and switch between multiple models dynamically, without restarting the server. Each model runs in its own process, so a crash in one model does not take down the router, and model swaps are more stable. Auto-discovery from the llama.cpp cache or a user-specified models-dir, on-demand loading, and an LRU eviction cap (--models-max) give operators explicit control over memory and latency, facilitating A/B testing and multi-tenant deployments via the API or Web UI. Expect a first-use delay for models not already loaded; subsequent requests avoid that delay because the model stays resident until evicted.
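As a rough sketch of what driving router mode over the server's OpenAI-compatible HTTP API could look like: the base URL, endpoint paths, and model names below are assumptions for illustration, not taken from the release notes, so consult the llama.cpp server docs for the exact routes and flags.

```python
# Minimal sketch: selecting models through a router-mode llama-server.
# Assumptions: server at localhost:8080, OpenAI-compatible
# /v1/chat/completions endpoint, and hypothetical model names.
import json
import urllib.request

BASE = "http://localhost:8080"  # assumed default llama-server address


def chat(model: str, prompt: str) -> str:
    """Send a chat completion; in router mode the `model` field is
    presumed to select which model serves the request, loading it
    on demand if it is not already resident."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    req = urllib.request.Request(
        f"{BASE}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


# First call to a cold model pays the one-time load delay; repeat
# calls are fast until the model is evicted under the --models-max
# LRU cap. Model names here are hypothetical placeholders.
print(chat("qwen2.5-7b-instruct", "Say hello."))
print(chat("llama-3.2-3b-instruct", "Say hello."))  # may trigger a load
```

Because each model lives in its own process, the second call above would, per the summary, at worst spawn or evict a worker process rather than disturb requests already in flight against other models.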
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info