From Text to Translation: Using Language Models to Resolve and Classify Variants
Full Description
Project Summary: Deep learning methods toward resolving uncertain variant classifications
Genomic sequencing can substantially improve clinical management by optimizing surveillance and treatment
options and improving risk assessment. As more genetic variants are interpreted, thousands of new
variant interpretations are entering variant databases each month. Most variants in these databases have
insufficient evidence to be classified as pathogenic or benign, and as a result are classified as Variants of
Uncertain Significance (VUSs). Although some of these variants may increase disease risk, information about
them cannot be communicated to providers or patients due to a lack of structured evidence. This translational
gap prevents the many patients who carry such variants from benefiting from genomic medicine.
ClinVar, a large diagnostic variant database, contains a unique abundance of predictive information that has
been curated by clinical experts over many years. This includes over 1.1 million plaintext diagnostic reports
that often describe case data, literature review, and an analysis of computational predictions or functional
assay data. We will use these clinical reports to make predictions of pathogenicity, and to identify which
specific sources of evidence of pathogenicity are provided in each report. This project will enhance the value of
data in ClinVar, a public resource used by thousands of investigators, clinicians, and bioinformatic pipelines.
We will first optimize a text classification model to make predictions from diagnostic summaries, evaluating and
fine-tuning a set of large language models that have been trained on different text corpora. Using clinical
reports and known classifications from ClinVar variant submissions, we will evaluate different filtering criteria
used in the training process. We will measure performance on high-confidence labeled data that have been
previously reviewed by expert panels, as well as on bona fide VUSs, using expert-panel-curated variant
interpretations as ground truth validation data.
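The filtering and evaluation steps described above can be sketched as follows. This is only an illustrative outline, not the project's actual pipeline: the `Submission` record, its `review_stars` field, the `min_stars` threshold, and the label strings are all assumptions introduced for the example, and the model itself is left out so that only the data filtering and scoring logic are shown.

```python
from dataclasses import dataclass

@dataclass
class Submission:
    # Hypothetical record layout; field names are assumptions for this sketch.
    variant_id: str
    report_text: str
    label: str         # "pathogenic", "benign", or "vus"
    review_stars: int  # assumed numeric confidence level of the submission

def filter_training(subs, min_stars=1):
    """Keep only non-VUS submissions at or above an assumed review threshold."""
    return [
        s for s in subs
        if s.label in ("pathogenic", "benign") and s.review_stars >= min_stars
    ]

def evaluate(predictions, truth):
    """Score predictions against expert-panel labels.

    Both arguments map variant_id -> "pathogenic" or "benign".
    "pathogenic" is treated as the positive class.
    """
    tp = fp = tn = fn = 0
    for variant_id, true_label in truth.items():
        predicted = predictions.get(variant_id)
        if predicted is None:
            continue  # no prediction available for this variant
        if true_label == "pathogenic":
            if predicted == "pathogenic":
                tp += 1
            else:
                fn += 1
        else:
            if predicted == "pathogenic":
                fp += 1
            else:
                tn += 1
    total = tp + fp + tn + fn
    return {
        "sensitivity": tp / (tp + fn) if (tp + fn) else None,
        "specificity": tn / (tn + fp) if (tn + fp) else None,
        "accuracy": (tp + tn) / total if total else None,
    }
```

Varying `min_stars` is one simple way to compare filtering criteria: a stricter threshold yields fewer but cleaner training examples, and the same `evaluate` function can then be applied to each resulting model's predictions on the held-out expert-panel set.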
Next, we will identify the information in these reports that drives predictions using post-hoc explainability
methods (attention mapping, representation probing, and causal mediation analysis), and then map this
evidence to biomedical concepts related to variant interpretation and pathogenicity, using a knowledge graph
that is refined to highlight the concepts relevant to diagnostic review criteria.
Finally, we will measure the extent to which these approaches can identify complementary evidence across
reports on the same variant generated by different clinical labs, which can be used to reclassify VUSs or
resolve variants with conflicting interpretations. We will manually review a set of clinical reports to
evaluate the accuracy of the sources of information that have been recovered. If evidence is sufficient, we will
identify up to 100 variants that are carried by participants in the Mass General Brigham biobank, and attempt
to update their variant classifications so that these results can be communicated to patients.
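As a rough illustration of what pooling evidence across labs could look like, the sketch below groups submissions by variant and checks whether different labs contribute different evidence types. The tuple layout, the evidence-type strings, and the simplified definition of a conflict (at least one pathogenic and one benign call) are assumptions made for this example and are not taken from the project or from ClinVar's own rules.

```python
from collections import defaultdict

def summarize_variants(records):
    """Summarize cross-lab evidence per variant.

    records: iterable of (variant_id, lab, classification, evidence_types),
    where evidence_types is a list of strings such as "case_data",
    "literature", "computational", or "functional_assay" (assumed labels).
    """
    by_variant = defaultdict(list)
    for variant_id, lab, classification, evidence in records:
        by_variant[variant_id].append((lab, classification, set(evidence)))

    summary = {}
    for variant_id, subs in by_variant.items():
        classes = {classification for _, classification, _ in subs}
        pooled = set().union(*(evidence for _, _, evidence in subs))
        largest_single = max(len(evidence) for _, _, evidence in subs)
        summary[variant_id] = {
            # Simplified conflict rule: both extremes are present.
            "conflicting": "pathogenic" in classes and "benign" in classes,
            "pooled_evidence": pooled,
            # Complementary when pooling yields more evidence types
            # than any single lab reported on its own.
            "complementary": len(pooled) > largest_single,
        }
    return summary
```

Variants flagged as `complementary` would be natural candidates for the manual review step, since no single report contains all of the pooled evidence.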
Grant Number: 1R21HG014015-01
NIH Institute/Center: NIH
Principal Investigator: Christopher Cassa