Encrypted LLM Inference with FHE: GPT-2 on Hugging Face via Concrete-Python and PBS
AI Impact Summary
The post outlines an architecture for running parts of a GPT-2 inference pipeline over Fully Homomorphic Encryption (FHE), using TFHE programmable bootstrapping (PBS) and Concrete-Python to operate on encrypted data while protecting both user privacy and model IP. It describes replacing a single head of the first multi-head attention block with FHE-friendly operators: the client encrypts intermediate activations, the server performs the selected attention steps under encryption, and the encrypted results are returned for the client to decrypt and continue inference locally. It notes quantization to 4 bits with roughly 96% accuracy retained, and highlights that PBS operations dominate latency, indicating a heavy compute and hardware-acceleration requirement for production deployment.
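The 4-bit quantization mentioned above can be illustrated with a minimal sketch. This is symmetric per-tensor quantization in pure Python, not the exact scheme Concrete-Python applies; the scale computation and clipping range here are illustrative assumptions.

```python
def quantize_4bit(values):
    """Symmetric per-tensor quantization of floats to 4-bit signed integers.

    4 bits give 16 levels; a symmetric scheme maps values to [-8, 7].
    (Illustrative sketch only, not Zama's actual quantizer.)
    """
    max_abs = max(abs(v) for v in values) or 1.0
    scale = max_abs / 7.0  # the largest magnitude maps to +/-7
    quantized = [max(-8, min(7, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from 4-bit integers."""
    return [q * scale for q in quantized]


# Example: quantize a small weight vector and measure round-trip error.
weights = [0.12, -0.5, 0.33, 0.9, -0.27]
q, scale = quantize_4bit(weights)
restored = dequantize(q, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
```

Each restored value differs from the original by at most half a quantization step (scale / 2), which is why accuracy degrades only modestly at 4 bits.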
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info