llama.cpp server adds router mode for dynamic model management
AI Impact Summary
llama.cpp server introduces a router mode that enables dynamic model management: multiple models can be loaded, unloaded, and switched at runtime without restarting the server. Each model runs in its own process, isolating failures and improving reliability when running concurrent variants. Models are auto-discovered from the default llama.cpp cache or a custom --models-dir and load on first use, with LRU eviction once the --models-max limit (default 4) is reached. This enables A/B testing, multi-tenant deployments, and rapid development iteration, but teams should monitor VRAM usage and cold-start latency for large GGUF models such as ggml-org/gemma-3-4b-it-GGUF.
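For teams evaluating the feature, the sketch below shows what per-request model selection could look like against llama-server's OpenAI-compatible endpoint. The host, port, and the use of the request's "model" field to drive routing are assumptions inferred from the behavior described above; the release notes here do not specify the request interface.

```python
# Minimal sketch: exercising router mode via llama-server's OpenAI-compatible
# HTTP API. Assumes the server was started with router mode enabled and the
# flags mentioned above, e.g.:
#   llama-server --models-dir ./models --models-max 4
# Selecting a model per request through the "model" field is an assumption
# based on the routing behavior described in this summary.
import json
import urllib.request

SERVER = "http://localhost:8080"  # assumed default host/port


def chat(model: str, prompt: str) -> str:
    """Send one chat request; the router loads `model` on first use."""
    payload = {
        "model": model,  # a discovered GGUF, e.g. ggml-org/gemma-3-4b-it-GGUF
        "messages": [{"role": "user", "content": prompt}],
    }
    req = urllib.request.Request(
        f"{SERVER}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


# The first call to each model pays the cold-start (load) cost; with
# --models-max 4, a fifth distinct model would evict the least recently
# used one.
print(chat("ggml-org/gemma-3-4b-it-GGUF", "Summarize LRU eviction in one line."))
```

A simple way to surface the cold-start cost this summary warns about is to time the first and second call to the same model and compare.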
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info