Scaling BERT Inference on CPU with Hugging Face Transformers — multi-instance, NUMA-aware optimization (Part 1)
AI Impact Summary
This post shows how to scale BERT-like model inference on modern CPUs by running multiple independent model instances, each pinned to its own subset of CPU cores (multi-inference streams), with baseline measurements on an AWS c5.metal instance (Intel Xeon Platinum 8275CL). It covers cross-framework considerations (PyTorch, TensorFlow, TorchScript, ONNX Runtime), relevant hardware features (AVX-512, VNNI, oneDNN), and quantization to int8/float16 for additional throughput. For production workloads, the approach yields higher CPU throughput and lower per-request cost when serving many concurrent inferences, provided deployment is NUMA-aware and core counts and batch sizes are tuned carefully.
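To make the multi-instance idea concrete, here is a minimal sketch of running two independent BERT instances on dedicated core subsets using PyTorch and Hugging Face Transformers. The model name, core ranges, and worker layout are illustrative assumptions, not the post's exact setup; `os.sched_setaffinity` is Linux-only.

```python
import os
import multiprocessing as mp

import torch
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "bert-base-uncased"  # illustrative; any BERT-like checkpoint works


def serve_on_cores(cores, texts):
    # Pin this worker to its dedicated core subset (Linux-only API).
    os.sched_setaffinity(0, cores)
    # Match the intra-op thread count to the cores this instance owns.
    torch.set_num_threads(len(cores))

    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    model = AutoModel.from_pretrained(MODEL_NAME).eval()

    with torch.no_grad():
        inputs = tokenizer(texts, padding=True, return_tensors="pt")
        outputs = model(**inputs)
    return outputs.last_hidden_state.shape


if __name__ == "__main__":
    # Two independent instances, each owning 4 cores. Ranges are illustrative;
    # on a multi-socket machine, keep each range within a single NUMA node.
    core_sets = [set(range(0, 4)), set(range(4, 8))]
    batch = ["Scaling BERT inference on CPU."] * 8
    with mp.Pool(len(core_sets)) as pool:
        results = pool.starmap(serve_on_cores, [(c, batch) for c in core_sets])
    print(results)
```

For the int8 path mentioned above, one option is to wrap the loaded model with `torch.quantization.quantize_dynamic(model, {torch.nn.Linear}, dtype=torch.qint8)` before inference, which quantizes the linear layers that dominate BERT's compute.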
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info