llama.cpp server adds router mode for dynamic model management
AI Impact Summary
llama.cpp server introduces a router mode that enables dynamic model management: multiple models can be loaded, unloaded, and switched at runtime without restarting the server. Each model runs in its own process, isolating failures and improving reliability when running concurrent variants. Models are auto-discovered from the default llama.cpp cache or a custom --models-dir and load on first use, with LRU eviction once the --models-max limit (default 4) is reached. This enables A/B testing, multi-tenant deployments, and rapid development iteration, but teams should monitor VRAM usage and cold-start latency for large GGUF models such as ggml-org/gemma-3-4b-it-GGUF.
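For teams evaluating the feature, the sketch below shows what per-request model selection could look like against llama-server's OpenAI-compatible endpoint. The host, port, and the use of the request's "model" field to drive routing are assumptions inferred from the behavior described above; the release notes here do not specify the request interface.

```python
# Minimal sketch: exercising router mode via llama-server's OpenAI-compatible
# HTTP API. Assumes the server was started with router mode enabled and the
# flags mentioned above, e.g.:
#   llama-server --models-dir ./models --models-max 4
# Selecting a model per request through the "model" field is an assumption
# based on the routing behavior described in this summary.
import json
import urllib.request

SERVER = "http://localhost:8080"  # assumed default host/port


def chat(model: str, prompt: str) -> str:
    """Send one chat request; the router loads `model` on first use."""
    payload = {
        "model": model,  # a discovered GGUF, e.g. ggml-org/gemma-3-4b-it-GGUF
        "messages": [{"role": "user", "content": prompt}],
    }
    req = urllib.request.Request(
        f"{SERVER}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


# The first call to each model pays the cold-start (load) cost; with
# --models-max 4, a fifth distinct model would evict the least recently
# used one.
print(chat("ggml-org/gemma-3-4b-it-GGUF", "Summarize LRU eviction in one line."))
```

A simple way to surface the cold-start cost this summary warns about is to time the first and second call to the same model and compare.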
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info