Protein language model rescue mutations highlight variant effects and structure in clinically relevant genes
This repo contains the scripts and metadata used in our work presented at the NeurIPS 2022 Learning Meaningful Representations of Life (LMRL) workshop.
Abstract: Despite being self-supervised, protein language models have shown remarkable performance in fundamental biological tasks such as predicting the impact of genetic variation on protein structure and function. The effectiveness of these models on a diverse set of tasks suggests that they learn a meaningful representation of the fitness landscape that can be useful for downstream clinical applications. Here, we interrogate the use of these language models in characterizing known pathogenic mutations in curated, medically actionable genes through an exhaustive search of putative compensatory mutations on each variant's genetic background. Systematic analysis of the predicted effects of these compensatory mutations reveals unappreciated structural features of proteins that are missed by other structure predictors like AlphaFold. While deep mutational scan experiments provide an unbiased estimate of the mutational landscape, we encourage the community to generate and curate rescue mutation experiments to inform the design of more sophisticated co-masking strategies and leverage large language models more effectively for downstream clinical prediction tasks.
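The exhaustive search described in the abstract can be pictured with a small sketch. This is an illustration only, not the repo's actual script: the function names, the toy sequence, and the choice to skip the pathogenic site itself (so that a simple reversion is not counted as a second-site rescue) are all assumptions made here.

```python
# Hypothetical sketch: enumerate every single-residue substitution on the
# background of a known pathogenic variant. Names and data are illustrative.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def apply_substitution(seq, pos, new_aa):
    """Return seq with the residue at 1-based position pos replaced by new_aa."""
    return seq[:pos - 1] + new_aa + seq[pos:]


def enumerate_rescue_candidates(wild_type, variant):
    """Yield (label, sequence) pairs for candidate rescue mutations.

    variant is a (wt_aa, pos, mut_aa) tuple, e.g. ("R", 3, "W") for R3W.
    Each candidate carries the pathogenic variant plus one extra substitution.
    """
    wt_aa, pos, mut_aa = variant
    if wild_type[pos - 1] != wt_aa:
        raise ValueError(f"expected {wt_aa} at position {pos}")
    background = apply_substitution(wild_type, pos, mut_aa)
    for i, ref in enumerate(background, start=1):
        if i == pos:
            continue  # assumption: the variant site itself is not a rescue site
        for alt in AMINO_ACIDS:
            if alt != ref:
                yield f"{ref}{i}{alt}", apply_substitution(background, i, alt)


# Toy 5-residue "protein" with a made-up pathogenic variant R3W.
toy_wild_type = "MKRLA"
candidates = list(enumerate_rescue_candidates(toy_wild_type, ("R", 3, "W")))
```

Each candidate sequence would then be scored by a protein language model against the variant background. For a protein of length N this produces 19 × (N − 1) candidates per variant.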
Model | Number of layers | Number of parameters | Training dataset | Implementation in our work |
---|---|---|---|---|
ESM-2 | 33 | 650M | UR50/D | Single model with wt-marginals scoring strategy |
ESM-1v | 33 | 650M | UR90/S | Ensemble of 5 models with the same scoring strategy as ESM-2 |
ESMFold | 48 | 690M | PDB + UR50 | Structure prediction for BAG3 |
AlphaFold2 | | | | AlphaFold2 structural model prediction for BAG3 |
Cross-protein transfer | | | | Zero-shot prediction scores for all 53 ACMG genes except MAX and HNF1A |
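As a rough illustration of the wt-marginals scoring strategy named in the table: the model sees the unmasked reference sequence once, and a substitution is scored as the difference in log-probability between the alternate and reference residues at that position. The sketch below uses plain Python and toy logits rather than a real ESM model. The function names, the 20-letter alphabet ordering, and the use of a simple mean to combine ensemble members are assumptions for illustration; the repo's implementation may differ.

```python
# Hypothetical sketch of wt-marginals scoring on toy logits (no model download).
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def log_softmax(logits):
    """Numerically stable log-softmax over one position's logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def wt_marginal_score(logits, sequence, pos, alt):
    """Score a substitution at 1-based pos as log p(alt) - log p(ref).

    logits holds one list of 20 values per residue, taken from a single
    forward pass on the unmasked reference sequence.
    """
    log_probs = log_softmax(logits[pos - 1])
    ref = sequence[pos - 1]
    return log_probs[AMINO_ACIDS.index(alt)] - log_probs[AMINO_ACIDS.index(ref)]


def ensemble_score(per_model_logits, sequence, pos, alt):
    """Assumption: combine ensemble members by averaging their scores."""
    scores = [wt_marginal_score(lg, sequence, pos, alt) for lg in per_model_logits]
    return sum(scores) / len(scores)


def toy_logits(sequence, ref_logit):
    """Build fake logits that favour the reference residue at every position."""
    rows = []
    for aa in sequence:
        row = [0.0] * len(AMINO_ACIDS)
        row[AMINO_ACIDS.index(aa)] = ref_logit
        rows.append(row)
    return rows


toy_sequence = "MK"
single = wt_marginal_score(toy_logits(toy_sequence, 2.0), toy_sequence, 1, "A")
ensemble = ensemble_score(
    [toy_logits(toy_sequence, 2.0), toy_logits(toy_sequence, 4.0)],
    toy_sequence, 1, "A",
)
```

Negative scores mean the model considers the substitution less likely than the reference residue. When scoring rescue candidates, the "reference" sequence passed in would be the pathogenic variant background rather than the wild type.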
Description | Data source |
---|---|
List of clinically actionable genes | ACMG v3.1 |
Allele frequency | gnomAD v2 GRCh38 liftover |
ClinVar annotations | Accessed on 09/17/2022 |
Multiple sequence alignments | UCSC multiz-100 way CDS alignment (Placental mammals) |
If you find this work useful, please cite it as follows:
@misc{soylemez2022rescue,
url = {https://arxiv.org/abs/2211.10000},
author = {Soylemez, Onuralp and Cordero, Pablo},
keywords = {Machine Learning (cs.LG), Genomics (q-bio.GN), FOS: Computer and information sciences, FOS: Biological sciences},
title = {Protein language model rescue mutations highlight variant effects and structure in clinically relevant genes},
publisher = {arXiv},
year = {2022},
copyright = {Creative Commons Attribution 4.0 International}
}
If you have any questions or comments, or would like to collaborate, please feel free to reach out.