Leveraging artificial intelligence techniques, researchers have demonstrated that mutations in so-called 'junk' DNA can cause autism. The study, published May 27 in Nature Genetics, is the first to functionally link such mutations to the neurodevelopmental condition.
The research was led by Olga Troyanskaya in collaboration with Robert Darnell. Troyanskaya is deputy director for genomics at the Flatiron Institute's Center for Computational Biology (CCB) in New York City and a professor of computer science at Princeton University. Darnell is the Robert and Harriet Heilbrunn Professor of Cancer Biology at Rockefeller University and an investigator at the Howard Hughes Medical Institute.
Their team used machine learning to analyze the whole genomes of 1,790 individuals with autism and their unaffected parents and siblings. These individuals had no family history of autism, meaning the genetic cause of their condition was probably spontaneous mutations rather than inherited mutations.
The analysis predicted the ramifications of genetic mutations in parts of the genome that do not encode proteins, regions often mischaracterized as 'junk' DNA. The number of autism cases linked to the noncoding mutations was comparable to the number of cases linked to protein-coding mutations that disable gene function.
The implications of the work extend beyond autism, Troyanskaya says. "This is the first clear demonstration of non-inherited, noncoding mutations causing any complex human disease or disorder."
Scientists can apply the same techniques used in the new study to explore the role noncoding mutations play in diseases such as cancer and heart disease, says study co-author Jian Zhou of CCB and Princeton. "This enables a new perspective on the cause of not just autism, but many human diseases."
Only 1 to 2 percent of the human genome is made up of genes that encode the blueprints for making proteins. Those proteins carry out tasks throughout our bodies, such as regulating blood sugar levels, fighting infections and sending communications between cells. The other 98 percent of our genome isn't genetic dead weight, though. The noncoding regions help regulate when and where genes make proteins.
Mutations in protein-coding regions account for at most 30 percent of autism cases in individuals without a family history of autism. Evidence suggested that autism-causing mutations must happen elsewhere in the genome as well.
Uncovering which noncoding mutations may cause autism is tricky. A single individual may have dozens of noncoding mutations, most of which will be unique to the individual. This makes the traditional approach of identifying common mutations among affected populations nonviable.
Troyanskaya and her colleagues took a new approach. They trained a machine learning model to predict how a given sequence would affect gene expression.
"This is a shift in thinking about genetic studies that we're introducing with this analysis," says Chandra Theesfeld, a research scientist in Troyanskaya's lab at Princeton. "In addition to scientists studying shared genetic mutations across large groups of individuals, here we're applying a set of smart, sophisticated tools that tell us what any specific mutation is going to do, even those that are rare or never observed before."
The researchers studied the genetic basis of autism by applying the machine learning model to a treasure trove of genetic data called the Simons Simplex Collection. The Simons Foundation, the Flatiron Institute's parent organization, produced and maintains the repository. The Simons Simplex Collection contains the whole genomes of nearly 2,000 'quartets' made up of a child with autism, an unaffected sibling and their unaffected parents.
These foursomes had no previous family history of autism, meaning that non-inherited mutations were probably responsible for the affected child's condition. (Such mutations occur spontaneously in sperm and egg cells as well as in embryos.)
The researchers used their model to predict the impact of non-inherited, noncoding mutations in each child with autism. They then compared those predictions with the predicted effects of the same, unmutated stretch of DNA in the child's unaffected sibling.
"The design of the Simons Simplex Collection is what allowed us to do this study," says Zhou. "The unaffected siblings are a built-in control."
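The logic of using siblings as a built-in control can be illustrated with a toy calculation. The sketch below is not the study's method or code. It assumes each child has already been assigned a single predicted "impact score" by some model, pairs each affected child with the unaffected sibling, and uses a simple sign-flip permutation test to ask whether the affected children's scores are higher than chance alone would explain. The function name, the scores and the choice of test are all hypothetical and chosen only for illustration.

```python
import random
from statistics import mean


def paired_permutation_test(proband_scores, sibling_scores,
                            n_permutations=10_000, seed=0):
    """One-sided paired sign-flip permutation test.

    Asks whether affected children (probands) have higher predicted
    impact scores than their unaffected siblings. Returns the observed
    mean within-family difference and an estimated p-value.
    """
    if len(proband_scores) != len(sibling_scores):
        raise ValueError("each proband needs exactly one paired sibling")

    # Within-family differences: the sibling acts as the control.
    diffs = [p - s for p, s in zip(proband_scores, sibling_scores)]
    observed = mean(diffs)

    rng = random.Random(seed)
    hits = 0
    for _ in range(n_permutations):
        # Under the null hypothesis the proband/sibling labels are
        # interchangeable, so each difference may flip sign at random.
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        if mean(flipped) >= observed:
            hits += 1

    # Add-one correction keeps the estimated p-value above zero.
    p_value = (hits + 1) / (n_permutations + 1)
    return observed, p_value


# Hypothetical scores for eight families (made-up numbers, not study data).
proband = [0.62, 0.48, 0.71, 0.55, 0.66, 0.59, 0.73, 0.51]
sibling = [0.41, 0.50, 0.39, 0.47, 0.52, 0.44, 0.49, 0.46]

observed, p_value = paired_permutation_test(proband, sibling)
print(f"mean proband-minus-sibling difference: {observed:.3f}")
print(f"one-sided permutation p-value: {p_value:.4f}")
```

Because each comparison is made within a family, differences shared by both siblings cancel out, which is the sense in which the unaffected sibling serves as a control.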
Noncoding mutations in many of the children with autism altered gene regulation, the analysis suggested. Moreover, the results suggested that the mutations affected both gene expression in the brain and genes already linked to autism, such as those responsible for neuron migration and development. "This is consistent with how autism most likely manifests in the brain," says study co-author Christopher Park, a research scientist at CCB. "It's not just the number of mutations occurring, but what kind of mutations are occurring."
The researchers tested the effects of some of the noncoding mutations in laboratory experiments. They inserted predicted high-impact mutations found in children with autism into cells and observed the resulting changes in gene expression. These changes affirmed the model's predictions.
Troyanskaya says she and her colleagues will continue improving and expanding their method. Ultimately, she hopes the work will improve how genetic data are used for diagnosing and treating diseases and disorders. "Right now, 98 percent of the genome is usually being thrown away," she says. "Our work allows you to think about what we can do with the 98 percent."
ABOUT THE FLATIRON INSTITUTE
The Flatiron Institute is the research division of the Simons Foundation. The institute's mission is to advance scientific research through computational methods, including data analysis, theory, modeling and simulation. The institute's Center for Computational Biology develops new and innovative methods of examining data in the biological sciences whose scale and complexity have historically resisted analysis. The center's mission is to develop modeling tools and theory for understanding biological processes and to create computational frameworks that will enable the analysis of the large, complex data sets being generated by new experimental technologies.