The information in the sequence of the human genome has a paramount importance in biomedical research. However, the value of this information is very limited in absence of a detailed map of the genes encoded in the genome. The genes are the basic biological units responsible for the biological traits of organism. Detailed information already exists, on the genomic regions that contain the genes that code for proteins, but the information about non-coding DNA regions - also known as DNA "dark matter" - lags behind. Here are found poorly-understood genes called "long non-coding RNAs" (lncRNAs), which are amongst the most numerous of all, and have been linked to a variety of diseases.
In a paper published today in Nature Genetics, an international team of scientists led by researchers at the Centre for Genomic Regulation (CRG) in Barcelona, in collaboration with researchers at Cold Spring Harbor in New York, the Wellcome Trust Sanger Institute in Hinxton, and qGenomics in Barcelona, sheds new light on this topic. In order to better identify, map, and characterize those "dark matter" genes, they have developed a new method that improves throughput and accuracy of current methods, and applied it in human and mouse.
"98% of our DNA does not encode for proteins. These DNA regions contain thousands of uncharacterised non-protein-coding genes, but there is still a long way to go until we fully understand their functions and their roles in disease. Reaching this goal will require complete gene maps. Our method represents an important step in this direction," explained Rory Johnson, CRG alumnus currently Principal Investigator at the University of Bern, and co-leader of this paper.
The key feature of the new method, named RNA Capture Long Seq (CLS), is that it focuses specifically on the non-coding regions of the genome, that are amplified and analysed using the most advanced sequencing. "In this way we could produce a detailed map of 3,500 long non-coding RNAs in human and mice (about 20% of all those known so far) As a result, we characterized the genomic features of long-non-coding RNAs to better understand how these genes work," stated Julien Lagarde and Barbara Uszczynska, co-first authors and CRG researchers.
Researchers used this method to improve one of the most important genomic databases, GENCODE, which is the worldwide reference for the genes encoded in the human and mouse genomes. "Scientists around the globe are using GENCODE for their research projects as reference sets, so improving it means contributing to biomedical research worldwide", said Roderic Guigó, coordinator of the Bioinformatics and Genomics Programme at the CRG and co-leader of this work. GENCODE was initiated in year 2003 by Roderic Guigó as part of the ENCODE 'The Encyclopedia of the DNA elements' project. Now, thanks to this new method, Guigó and collaborators have substantially improved the gene catalogues, in particular, for long non-coding RNA genes. "We have found a cheaper, faster and more accurate method that results in an improved catalogue and that will first benefit scientists worldwide and, ultimately, society," concludes Guigó.
Almost 20 years after the Human Genome Project this work illustrates how our understanding on the information encoded in the genome is still evolving thanks to the development of increasingly powerful technologies. The better understanding of genomic function will lead, in turn, to a better understanding of health and disease.