Scaling BERT Inference on CPU with Hugging Face Transformers — multi-instance, NUMA-aware optimization (Part 1)
AI Impact Summary
This post shows how to scale BERT-like model inference on modern CPUs by running multiple independent model instances, each pinned to its own subset of CPU cores (multi-inference streams), with baseline measurements on an AWS c5.metal instance (Intel Xeon Platinum 8275CL). It covers cross-framework considerations (PyTorch, TensorFlow, TorchScript, ONNX Runtime), relevant hardware features (AVX-512, VNNI, oneDNN), and quantization to int8/float16 for additional throughput. For production workloads, the approach yields higher CPU throughput and lower per-request cost when serving many concurrent inferences, provided deployment is NUMA-aware and core counts and batch sizes are tuned carefully.
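To make the multi-instance idea concrete, here is a minimal sketch of running two independent BERT instances on dedicated core subsets using PyTorch and Hugging Face Transformers. The model name, core ranges, and worker layout are illustrative assumptions, not the post's exact setup; `os.sched_setaffinity` is Linux-only.

```python
import os
import multiprocessing as mp

import torch
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "bert-base-uncased"  # illustrative; any BERT-like checkpoint works


def serve_on_cores(cores, texts):
    # Pin this worker to its dedicated core subset (Linux-only API).
    os.sched_setaffinity(0, cores)
    # Match the intra-op thread count to the cores this instance owns.
    torch.set_num_threads(len(cores))

    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    model = AutoModel.from_pretrained(MODEL_NAME).eval()

    with torch.no_grad():
        inputs = tokenizer(texts, padding=True, return_tensors="pt")
        outputs = model(**inputs)
    return outputs.last_hidden_state.shape


if __name__ == "__main__":
    # Two independent instances, each owning 4 cores. Ranges are illustrative;
    # on a multi-socket machine, keep each range within a single NUMA node.
    core_sets = [set(range(0, 4)), set(range(4, 8))]
    batch = ["Scaling BERT inference on CPU."] * 8
    with mp.Pool(len(core_sets)) as pool:
        results = pool.starmap(serve_on_cores, [(c, batch) for c in core_sets])
    print(results)
```

For the int8 path mentioned above, one option is to wrap the loaded model with `torch.quantization.quantize_dynamic(model, {torch.nn.Linear}, dtype=torch.qint8)` before inference, which quantizes the linear layers that dominate BERT's compute.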
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info