LA JOLLA, CA - May 10, 2016 - Call them professional "data wranglers."
A team of scientists at The Scripps Research Institute (TSRI) is expanding web services to make biomedical research more efficient. With their free, public projects, MyGene.info and MyVariant.info, researchers around the world have a faster way to spot new connections between genes and disease.
"This is about how to deliver information quickly to biologists," said Chunlei Wu, associate professor of molecular medicine at TSRI.
Wu and TSRI Associate Professor Andrew Su co-led a new study published in the journal Genome Biology reporting on progress in setting up these services and the positive response from users so far.
Good News, Bad News
Here's the good news: Genetic sequencing is faster and more affordable these days, giving scientists a better understanding of mechanisms behind many diseases. The bad news? This flood of genetic data means scientists have to wade through multiple databases and PDF files to gather useful information.
Wu said he has spent hours downloading and parsing data, often running into problems when he discovers that the original data creators didn't annotate information in a standard way.
With support from the National Institutes of Health's (NIH) "Big Data to Knowledge" (BD2K) initiative, Wu, Su and their colleagues have begun to tame this problem by creating a data-harvesting platform that automatically imports and updates data from a variety of public databases. The aggregated data are then structured and delivered via two high-performance web search services, MyGene.info and MyVariant.info, powered by the latest cloud-computing technology.
"Now researchers can focus on their own work instead of going through the data-wrangling effort," said Wu.
MyGene.info and MyVariant.info are also powerful because of their ability to scale up as the user base and datasets grow.
MyGene.info holds information on more than 13 million genes from about 15,000 species. The service receives four to five million user "queries" each month, and the researchers are prepared to accommodate even more by expanding their use of Amazon cloud servers. MyVariant.info currently covers more than 316 million unique variants gathered from 14 community data sources.
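For readers curious what such a "query" looks like in practice, here is a minimal Python sketch. It is an illustration rather than the team's own code: the base URLs, API versions and parameter names (q, species, fields, size) are assumptions drawn from the services' public web interfaces and may change, and the helper function names are hypothetical.

```python
import json
from urllib.parse import urlencode, quote
from urllib.request import urlopen

# Assumed base URLs and API versions; check the services' documentation.
MYGENE_BASE = "https://mygene.info/v3"
MYVARIANT_BASE = "https://myvariant.info/v1"


def gene_query_url(term, species="human", fields="symbol,name,entrezgene", size=5):
    """Build a gene search URL (hypothetical helper; parameter names assumed)."""
    params = {"q": term, "species": species, "fields": fields, "size": size}
    return f"{MYGENE_BASE}/query?{urlencode(params)}"


def variant_url(hgvs_id, fields="all"):
    """Build a lookup URL for one variant, identified by an HGVS-style ID."""
    # Percent-encode characters such as '>' but keep ':' readable.
    encoded_id = quote(hgvs_id, safe=":")
    return f"{MYVARIANT_BASE}/variant/{encoded_id}?{urlencode({'fields': fields})}"


def fetch_json(url, timeout=10):
    """Send a GET request and decode the JSON response."""
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)


if __name__ == "__main__":
    # Requires network access; the response layout is not assumed here.
    print(json.dumps(fetch_json(gene_query_url("cdk2")), indent=2))
```

Building the URL separately from sending the request keeps the sketch easy to inspect without a network connection; only the final block contacts the service.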
The services have received positive feedback from the research community so far, said Ginger Tsueng, scientific outreach project manager in the Su lab and co-author of the new study. So far this year, MyVariant.info has received more than four million hits, while MyGene.info has handled more than 17 million.
A Foundation for Future Applications
The researchers have made these services open source to encourage others to use the data and develop their own applications.
For example, researchers at the University of Washington have built an interface that retrieves data from MyGene.info and contributes additional information to run MyGene2.org, a site that aims to connect patients who share rare genetic diseases. MyGene.info also provides the backbone for BioGPS, a resource for learning about gene and protein function, run by Su, Wu and TSRI programmer Max Nanis.
Another project in the pipeline is an app built on the MyVariant.info platform that displays variants when a user scans a gene name, for example from a poster at a scientific conference.
"Bioinformatics tools and analyses are highly dependent on having solid foundations of other tools on which to build," said Su. "MyGene.info and MyVariant.info are key pieces of infrastructure that many bioinformaticians are using every day."
###
In addition to Wu, Su, Xin and Tsueng, authors of the study, "MyGene.info and MyVariant.info: Gene and Variant Annotation Query Services," were Cyrus Afrasiabi (co-first author), Gregory S. Stupp and Timothy E. Putman of TSRI; Adam Mark (co-first author), formerly at TSRI, now at the Avera Cancer Institute; Moritz Juchler, Nikhil Gopal and Sean D. Mooney of the University of Washington; Benjamin J. Ainscough and Obi L. Griffith of Washington University School of Medicine; Ali Torkamani of TSRI and the Scripps Translational Science Institute (STSI); Patricia L. Whetzel of the University of California, San Diego; and Christopher J. Mungall of Lawrence Berkeley National Laboratory.
The study was supported by the National Institutes of Health (grants U01HG008473, GM083924, U54GM114833, U01HG006476 and K22CA188163) and an NIH-NCATS Clinical and Translational Science Award (CTSA; 5 UL1 RR025774).
Journal
Genome Biology