Arc Virtual Cell Challenge: ST/SE transformers with Llama backbone and ESM2 embeddings for gene perturbation simulation
AI Impact Summary
Arc Institute's Virtual Cell Challenge asks participants to train models that predict how a cell's transcriptome changes when a gene is silenced, with an emphasis on context generalization: predictions should hold for cell types not seen during training. The baseline architecture comprises two transformer-based components. The State Transition Model (ST) uses a Llama backbone and operates on encodings of covariate-matched control cells and of the perturbation. The State Embedding Model (SE) is a BERT-like autoencoder that leverages ESM2 protein embeddings to produce rich cell representations. ST is trained with a Maximum Mean Discrepancy (MMD) objective on a dataset of ~300k single-cell RNA-seq profiles; the training set holds ~220k cells, of which ~38k are unperturbed controls. That composition highlights the need to separate true perturbation effects from baseline heterogeneity and technical noise. This setup offers a concrete path for engineering teams to build in silico perturbation simulators, but success will require robust handling of biological heterogeneity and of the observer effect (sequencing destroys the cell, so the same cell is never measured both before and after perturbation); otherwise apparent generalization can be misleading.
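To make the Maximum Mean Discrepancy objective concrete, here is a minimal NumPy sketch of a squared-MMD estimate between a set of predicted cells and a set of observed cells. The RBF kernel, the bandwidth choice, the synthetic data, and the names `rbf_kernel` and `mmd2` are all assumptions made for illustration; the summary does not say which kernel or estimator the baseline uses.

```python
import numpy as np


def rbf_kernel(x, y, bandwidth):
    """Gaussian (RBF) kernel matrix between rows of x and rows of y."""
    # Pairwise squared Euclidean distances via the expansion |a-b|^2 = |a|^2 + |b|^2 - 2ab.
    sq_dists = (
        np.sum(x**2, axis=1)[:, None]
        + np.sum(y**2, axis=1)[None, :]
        - 2.0 * x @ y.T
    )
    sq_dists = np.maximum(sq_dists, 0.0)  # clip tiny negatives from rounding
    return np.exp(-sq_dists / (2.0 * bandwidth**2))


def mmd2(predicted, observed, bandwidth=1.0):
    """Biased (V-statistic) estimate of squared MMD between two cell sets."""
    k_pp = rbf_kernel(predicted, predicted, bandwidth)
    k_oo = rbf_kernel(observed, observed, bandwidth)
    k_po = rbf_kernel(predicted, observed, bandwidth)
    return k_pp.mean() + k_oo.mean() - 2.0 * k_po.mean()


# Synthetic stand-in for expression data (assumption: 64 cells x 20 genes).
rng = np.random.default_rng(0)
n_cells, n_genes = 64, 20
control = rng.normal(0.0, 1.0, size=(n_cells, n_genes))
perturbed = rng.normal(0.0, 1.0, size=(n_cells, n_genes))
perturbed[:, 0] -= 3.0  # hypothetical knockdown: shift gene 0 downward

bandwidth = np.sqrt(n_genes)  # assumed heuristic, not taken from the source
same = mmd2(control, control.copy(), bandwidth)
shifted = mmd2(control, perturbed, bandwidth)
print(f"MMD^2 identical sets: {same:.4f}")
print(f"MMD^2 shifted sets:   {shifted:.4f}")
```

Because MMD compares whole populations rather than paired cells, it suits a setting where the same cell cannot be measured before and after perturbation: the loss is small when the predicted and observed distributions match, regardless of which individual cells are drawn.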
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info