AI and deep-learning models in genomics


Human genome and genetic differences

As sequencing technologies have evolved, millions of people have had their genomes sequenced over the last few decades. This effort has produced an impressive database of sequencing information, revealing a vast range of small genetic differences that distinguish one human individual from another. However, what most of these genetic differences, or variants, actually mean remains largely unknown. A better understanding of these variants may help us advance faster and more confidently in the field of personalized genomic medicine. It might seem obvious that the more human genomes are sequenced, the more data scientists and clinicians have for studying these variants and, eventually, for linking one or more variants to a disease. As it turns out, simply sequencing more human genomes may not be enough.

Primate genomes to the rescue

According to Kyle Farh, vice president of Artificial Intelligence at Illumina, “The main issue is that humans are pretty bottlenecked. Even though there are 8 billion of us, our genetic diversity still looks like the original population of 10,000 common ancestors we’re all descended from. There just isn’t enough information to glean from the human species. It became clear several years ago that, to really understand the human genome, the data contained in human genome sequencing was not enough.” To circumvent this, scientists decided to cast a wider net and study Homo sapiens’ more distant evolutionary relatives: the other primates. The order Primates, which includes humans, consists of around 12 families and 60 genera (the numbers vary depending on the zoological classification consulted). It might sound somewhat counterintuitive, but to learn more about ourselves and our genetic content, we need to look more closely at genomes from other primate lineages.

Deep-learning AI model — PrimateAI-3D

As part of a highly collaborative international project, scientists have sequenced genomes from more than 800 individuals from 233 species of nonhuman primates, representing all primate families and over 86% of living genera. To analyze and interpret the resulting sequencing data, they developed PrimateAI-3D, a deep-learning model that can be thought of as a genomic counterpart to ChatGPT. The large language model (LLM) ChatGPT was launched on November 30, 2022, and has been one of the most prominent artificial intelligence (AI) topics of 2023 so far because of its ability to generate human-like responses to questions on virtually any topic. ChatGPT was trained on massive collections of text data, including books, articles, and web pages. PrimateAI-3D relies on deep-learning language architectures similar to those used in ChatGPT, but instead of taking in text and producing linguistic output, it works in the domain of genomics.

How to predict pathogenic variants?

To train PrimateAI-3D, the developers fed it sequencing data containing variants that were not linked to disease states in macaques and orangutans. From this information, the model learns to predict which genomic regions are likely to cause disease when mutated. Essentially, by analyzing which variants do or do not cause disease in humans’ relatives within the order Primates, PrimateAI-3D learns to predict pathogenic variants in humans themselves. It is a striking example of collaboration between biology and AI! To estimate the accuracy of the resulting model, PrimateAI-3D was compared with other existing machine-learning methods across six clinical benchmarks and produced significantly more accurate results. For a more detailed description of how the model was trained, how it compares with other methods, and further biological insights, refer to the original articles published in Science.
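The intuition behind this training signal can be illustrated with a deliberately simple toy. The sketch below is not PrimateAI-3D and does not reflect its architecture; it only shows the underlying idea that regions where healthy primates carry many variants look tolerant, while regions with no primate variation look constrained. The gene name, positions, window size, and function names are all hypothetical and chosen purely for illustration.

```python
# Toy illustration only: hypothetical data, not real primate variants.
# Each entry is (gene, protein position) of a variant seen in healthy primates.
PRIMATE_VARIANTS = [
    ("GENE_A", 10), ("GENE_A", 11), ("GENE_A", 12), ("GENE_A", 14),
    ("GENE_A", 55),
]


def region_tolerance(variants, gene, start, end):
    """Fraction of positions in [start, end] that carry a primate variant."""
    seen = {pos for g, pos in variants if g == gene and start <= pos <= end}
    return len(seen) / (end - start + 1)


def toy_pathogenicity_score(variants, gene, position, window=5):
    """Crude score in [0, 1]: one minus the tolerance of the surrounding window.

    Windows depleted of primate variation score close to 1 (assumed to be
    more likely to matter when mutated); variant-rich windows score lower.
    """
    start = max(1, position - window)
    end = position + window
    return 1.0 - region_tolerance(variants, gene, start, end)


# Position 12 sits in a region with several primate variants;
# position 80 sits in a region with none.
tolerant = toy_pathogenicity_score(PRIMATE_VARIANTS, "GENE_A", 12)
constrained = toy_pathogenicity_score(PRIMATE_VARIANTS, "GENE_A", 80)
print(f"variant-rich region score: {tolerant:.2f}")
print(f"variant-free region score: {constrained:.2f}")
```

A real model replaces this counting heuristic with a trained deep-learning network and far richer inputs, but the direction of the signal is the same: variation tolerated in primates is evidence that similar variation is benign in humans.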

Scientific and clinical impact

Precision genomic medicine is a fast-developing research field with high potential for discovery. A better understanding of the interplay between genetic variants and diseases is an important step in advancing this field. One of the main conclusions from this series of Science articles is that most individuals, even healthy ones, carry genetic variants that could help us better understand the causal relationship between variants and diseases. This means that the genomic information of healthy individuals may be just as important as that of patients, because it may provide the data needed to predict diseases more accurately.


This article was written by Alexey Vorobev and edited by Courtney Thomas.

 
