Nat Protoc. 2026 Apr 23. doi: 10.1038/s41596-026-01364-8. Online ahead of print.
ABSTRACT
Mapping the connections between genes enables the identification of networks disrupted in disease. The approach, however, requires large amounts of data, making the discovery of therapeutic targets difficult when data are scarce. We recently developed a foundational artificial intelligence model, Geneformer, pretrained on a large-scale corpus of single-cell transcriptomes (initially ~30 million, now >100 million) to enable context-aware predictions in network biology with limited data. Here, we cover the methodology for using Geneformer through a combination of zero-shot inference, fine-tuning and in silico perturbation. The procedure includes the tokenization of raw gene expression counts into rank value encodings aligned with the model's pretrained vocabulary. Separability of relevant phenotypes in the pretrained embedding space is first assessed with zero-shot embeddings. Fine-tuning is then performed either with a single task, for example, disease prediction within a specific cell type, or with multiple tasks to jointly learn cross-informative features, such as cell types and disease states. Performance is evaluated with confusion matrices, macro F1 scores and embedding analysis. Subsequently, in silico perturbation simulates gene repression or activation and quantifies the shift in cell state embeddings, prioritizing candidate targets by statistical and biological metrics. The approach also supports perturbation using a quantized model to enhance computational efficiency. Outputs include predictive models fine-tuned for context-specific cell state representations and rank-ordered predictions of perturbations to induce each target state. The full pipeline typically completes in under 2 days on a standard GPU workstation and requires only moderate Python experience.
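Two ideas from the abstract can be illustrated with a minimal, self-contained sketch: (1) rank value encoding, in which each cell's expressed genes are ordered by expression normalized to a corpus-wide per-gene median, so that tokens reflect a gene's relative importance in that cell rather than raw counts; and (2) the in silico perturbation readout, which scores a perturbation by how far it shifts a cell's embedding toward a target state. The gene IDs, counts, median values and function names below are illustrative toy assumptions, not the protocol's actual API; the published Geneformer package provides dedicated tokenizer and perturbation classes for these steps.

```python
def rank_value_encode(counts, medians):
    """Toy rank value encoding: order expressed genes by median-normalized
    expression, highest first. counts: gene_id -> raw count for one cell;
    medians: gene_id -> corpus-wide nonzero median for that gene."""
    norm = {g: c / medians[g] for g, c in counts.items() if c > 0}
    return sorted(norm, key=norm.get, reverse=True)

def embedding_shift(cell_emb, perturbed_emb, goal_emb):
    """Toy perturbation score: change in cosine similarity to a goal-state
    embedding after perturbation (positive = shifted toward the goal)."""
    def cos(a, b):
        num = sum(x * y for x, y in zip(a, b))
        den = (sum(x * x for x in a) ** 0.5) * (sum(y * y for y in b) ** 0.5)
        return num / den
    return cos(perturbed_emb, goal_emb) - cos(cell_emb, goal_emb)

# Illustrative values only: normalizing by the corpus median deprioritizes
# ubiquitously highly expressed genes and drops unexpressed ones entirely.
counts = {"ENSG0001": 6.0, "ENSG0002": 0.0, "ENSG0003": 20.0, "ENSG0004": 10.0}
medians = {"ENSG0001": 1.0, "ENSG0002": 2.0, "ENSG0003": 40.0, "ENSG0004": 2.0}
tokens = rank_value_encode(counts, medians)
# -> ['ENSG0001', 'ENSG0004', 'ENSG0003']

# Toy 2D embeddings: the perturbed cell moves toward the goal state.
shift = embedding_shift([1.0, 0.0], [0.6, 0.8], [0.0, 1.0])
# -> 0.8
```

In the actual protocol these operations run over full transcriptomes and transformer embeddings; candidate perturbations are then rank-ordered by such shift statistics alongside biological metrics.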
PMID:42026145 | DOI:10.1038/s41596-026-01364-8

