Universal Assisted Generation enables cross-tokenizer acceleration for any target/assistant model pair in Hugging Face Transformers 4.46.0
AI Impact Summary
Intel Labs and Hugging Face's Universal Assisted Generation (UAG) extends speculative decoding by decoupling the target and assistant models across tokenizer boundaries, enabling 1.5x-2.0x latency improvements for a broad set of model pairs whose tokenizers are not aligned. The approach performs two-way tokenizer translation with additional re-encoding steps, and is integrated into Transformers 4.46.0, demonstrated with models such as gemma-2-9b, Mixtral-8x22B-Instruct-v0.1, and vicuna-68m. Production teams can apply this to accelerate inference across diverse model families (e.g., Llama-3.1-70B, Qwen, Phi-3) without reworking deployments, though current support is limited to multinomial sampling; speculative sampling is still pending.
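The two-way tokenizer translation described above can be illustrated with a toy example: the assistant drafts tokens in its own id space, the draft is decoded to text and re-encoded with the target model's tokenizer for verification, and the verified continuation is translated back so the assistant can keep drafting. This is a minimal sketch, assuming simple word-level stand-in tokenizers (`ToyTokenizer` is hypothetical); the real implementation in Transformers handles subword mismatches and is invoked via `model.generate(..., assistant_model=...)` with the respective tokenizers supplied.

```python
class ToyTokenizer:
    """Word-level stand-in for a real subword tokenizer (illustrative only)."""
    def __init__(self, vocab):
        self.vocab = vocab                          # token -> id
        self.inv = {i: t for t, i in vocab.items()} # id -> token

    def encode(self, text):
        return [self.vocab[w] for w in text.split()]

    def decode(self, ids):
        return " ".join(self.inv[i] for i in ids)

# Target and assistant use *different* vocabularies, so ids are incompatible.
target_tok = ToyTokenizer({"the": 0, "cat": 1, "sat": 2, "down": 3})
assistant_tok = ToyTokenizer({"down": 7, "sat": 8, "cat": 9, "the": 10})

def translate(ids, src, dst):
    """Re-encoding step: decode with the source tokenizer, encode with the other."""
    return dst.encode(src.decode(ids))

# 1. The assistant drafts candidate tokens in its own id space.
draft = assistant_tok.encode("the cat sat")            # [10, 9, 8]

# 2. Translate the draft into the target's id space for verification.
candidates = translate(draft, assistant_tok, target_tok)  # [0, 1, 2]

# 3. Translate the verified context back so the assistant continues drafting.
back = translate(candidates, target_tok, assistant_tok)   # [10, 9, 8]
```

The extra decode/encode round trips are what UAG adds on top of standard assisted generation; they are cheap relative to a forward pass of the target model, which is why the reported speedups survive the translation overhead.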
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info