Encrypted LLM Inference with FHE: GPT-2 on Hugging Face via Concrete-Python and PBS
AI Impact Summary
The post outlines an architecture for running parts of a GPT-2 inference pipeline over Fully Homomorphic Encryption (FHE), using TFHE programmable bootstrapping (PBS) and Concrete-Python to operate on encrypted data while protecting both user privacy and model IP. It describes replacing a single head of the first multi-head attention block with FHE-friendly operators: the client encrypts intermediate activations, the server performs the selected attention steps under encryption, and the encrypted results are returned for the client to decrypt and continue inference locally. It notes quantization to 4 bits with roughly 96% accuracy retained, and highlights that PBS operations dominate latency, indicating a heavy compute and hardware-acceleration requirement for production deployment.
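The 4-bit quantization mentioned above can be illustrated with a minimal sketch. This is symmetric per-tensor quantization in pure Python, not the exact scheme Concrete-Python applies; the scale computation and clipping range here are illustrative assumptions.

```python
def quantize_4bit(values):
    """Symmetric per-tensor quantization of floats to 4-bit signed integers.

    4 bits give 16 levels; a symmetric scheme maps values to [-8, 7].
    (Illustrative sketch only, not Zama's actual quantizer.)
    """
    max_abs = max(abs(v) for v in values) or 1.0
    scale = max_abs / 7.0  # the largest magnitude maps to +/-7
    quantized = [max(-8, min(7, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from 4-bit integers."""
    return [q * scale for q in quantized]


# Example: quantize a small weight vector and measure round-trip error.
weights = [0.12, -0.5, 0.33, 0.9, -0.27]
q, scale = quantize_4bit(weights)
restored = dequantize(q, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
```

Each restored value differs from the original by at most half a quantization step (scale / 2), which is why accuracy degrades only modestly at 4 bits.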
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info