Apriel-H1-15b-Thinker-SFT achieves 2.1x throughput with Mamba-based distillation in Fast-LLM
AI Impact Summary
Apriel-H1 demonstrates retrofit distillation of a 15B reasoning model into a Mamba-based hybrid, delivering 2.1x throughput with minimal quality loss by training on teacher reasoning traces rather than generic pretraining data. It uses a staged distillation workflow (LOO-based layer replacements, MIL/MMR, and end-to-end SFT) and a reverse KL objective to preserve structured reasoning while attention layers are swapped for linear (Mamba) mixers. This offers a practical path to scaling inference for reasoning-heavy workloads without rebuilding from scratch, contingent on access to high-quality SFT data and a capable training framework such as Fast-LLM.
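To make the objective concrete, here is a minimal, illustrative sketch of a reverse KL distillation loss over a single token's logits. This is an assumption-laden toy (function names and the plain-Python softmax are ours, not from the Apriel-H1 or Fast-LLM codebases); it only shows the direction of the divergence, KL(student || teacher), whose mode-seeking behavior penalizes the student for placing probability mass on tokens the teacher rules out.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def reverse_kl(student_logits, teacher_logits):
    # Reverse KL: KL(q_student || p_teacher) = sum_i q_i * (log q_i - log p_i).
    # Mode-seeking: the penalty grows sharply where the student assigns mass
    # but the teacher does not, which is the argued mechanism for preserving
    # the teacher's structured reasoning during layer replacement.
    q = softmax(student_logits)
    p = softmax(teacher_logits)
    return sum(qi * (math.log(qi) - math.log(pi)) for qi, pi in zip(q, p))
```

Identical distributions give zero loss; a student concentrated on a token the teacher disfavors gives a large positive loss.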
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info