A biologist with the Commerce Department's National Institute of Standards and Technology (NIST) led the research team, which reported its findings in the March 10 online edition of Molecular Biology and Evolution. The results are based on a systematic, statistically rigorous analysis of publicly available genetic data carried out with bioinformatics software developed at the Center for Advanced Research in Biotechnology (CARB).
In humans, there is so much apparent "junk" DNA (sections of the genome with no known function) that it takes up more space than the functional parts. Much of this junk consists of "introns," which appear as interruptions plopped down in the middle of genes. Discovered in the 1970s, introns mystify scientists but are readily accounted for by cells: when the cellular machinery transcribes a gene in preparation for making a protein, introns are simply spliced out of the transcript.
Research from the CARB group appears to resolve a debate over the "early versus late" timing of the appearance of introns. Since introns were discovered in 1978, scientists have debated whether genes were born split (the "introns-early" view), or whether they became split after eukaryotic cells (the ones that gave rise to animals and their relatives) diverged from bacteria roughly 2 billion years ago (the "introns-late" view). Bacterial genomes lack introns. Although the study did not attempt to propose a function for introns, or determine whether they are beneficial or harmful, the results appear to rule out the "introns-early" view.
The CARB analysis shows that the probability of a modern intron's presence in an ancestral gene common to the genes studied is roughly 1 percent, indicating that the vast majority of today's introns appeared subsequent to the origin of the genes. This conclusion is supported by the findings regarding placement patterns for introns within genes. It long has been observed that, in the sequences of nitrogen-containing compounds that make up our DNA genomes, introns prefer some sites more than others. The CARB study indicates that these preferences are side effects of late-stage intron gain, rather than side effects of intron-mediated gene formation.
The CARB results are based on an analysis of carefully processed data for 10 families of protein-coding genes in animals, plants, fungi and their relatives (see sidebar for details of the method used). A variety of statistical modeling, theoretical, and automated analytical approaches were used; while most were conventional, their combined application to the study of introns was novel. The CARB study also is unique in using an evolutionary model as the basis for inferring the presence of ancestral introns. The research was made possible in part by the increasing availability, over the past decade, of massive amounts of genetic sequence data.
The lead researcher is Arlin B. Stoltzfus of NIST; collaborators include Wei-Gang Qiu, formerly of CARB and the University of Maryland and now at Hunter College in New York City, and Nick Schisler, currently at Furman University, Greenville, S.C.
CARB is a cooperative venture of NIST and the University of Maryland Biotechnology Institute.
CARB's Approach to Understanding the Origins of 'Junk' DNA
Scientists long have compared the sequences of chemical compounds in different proteins, genes and entire genomes to derive clues about structure and function. The most sophisticated comparative methods are evolutionary and rely on matching similar sequences from different organisms, inferring family trees to determine relationships, and reconstructing changes that must have occurred to create biologically relevant differences.
This type of analysis is usually done with one sequence family at a time. The Center for Advanced Research in Biotechnology (CARB), a cooperative venture of the Commerce Department's National Institute of Standards and Technology (NIST) and the University of Maryland Biotechnology Institute, developed software to automate the analysis of dozens, and eventually perhaps hundreds, of sequence families at a time. The automated methods also assess the reliability of all the information, so that conclusions are based on the most reliable parts of the analysis.
The CARB method has two parts. The first part consists of a combination of manual and automated processing of gene data from public databases. The data are clustered into families through matching of similar sequences, first in pairs and then in groups. Then family trees are developed indicating how the genes are related to each other. A file is developed for each family that includes data on sequence matches, intron locations, family trees and reliability measures.
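The grouping step described above (matches found first in pairs, then merged into groups) can be pictured as single-linkage clustering. The sketch below is only an illustration under that assumption and is not the CARB software: the function name, the gene labels and the idea that a "match" is a simple pair of gene identifiers are all hypothetical.

```python
def cluster_families(genes, pairwise_matches):
    """Group genes into families by merging every pair reported as a match.

    Illustrative single-linkage clustering with a union-find structure;
    the actual CARB clustering criteria are not described in the text.
    """
    parent = {gene: gene for gene in genes}

    def find(gene):
        # Follow parent links to the family representative, shortening the path.
        while parent[gene] != gene:
            parent[gene] = parent[parent[gene]]
            gene = parent[gene]
        return gene

    for a, b in pairwise_matches:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two groups

    families = {}
    for gene in genes:
        families.setdefault(find(gene), set()).add(gene)
    # Return a stable, sorted list of sorted families.
    return sorted(sorted(members) for members in families.values())


if __name__ == "__main__":
    genes = ["g1", "g2", "g3", "g4", "g5"]          # hypothetical gene labels
    matches = [("g1", "g2"), ("g2", "g3"), ("g4", "g5")]
    print(cluster_families(genes, matches))
```

In this picture, two genes land in the same family whenever a chain of pairwise matches connects them, which is one simple way to go from pairs to groups.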
These datasets then are loaded into the second part of the system, which is fully automated. It consists of a relational database combined with software that computes probabilities for introns being present in ancestral genes using a method developed at CARB. Each gene is assigned to a kingdom (plants, animals, fungi and others), and a matrix of intron presence/absence data is determined for each family based on the sequence alignments. This matrix, along with the family tree, is used to estimate ancestral states of introns, as well as rates of intron loss and gain. Additional software is used for analysis and visualization of results.
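To make the idea of combining a presence/absence matrix with a family tree more concrete, here is a minimal sketch. It does not reproduce the CARB method, whose details are not given here. It assumes a generic two-state (absent/present) gain/loss Markov model and the standard pruning calculation on a tree; the rates, branch lengths, gene names and function names are hypothetical.

```python
from math import exp


def presence_matrix(intron_positions):
    """Build a 0/1 matrix: one row per gene, one column per intron site."""
    sites = sorted({p for ps in intron_positions.values() for p in ps})
    matrix = {gene: [1 if s in ps else 0 for s in sites]
              for gene, ps in intron_positions.items()}
    return sites, matrix


def transition(gain, loss, t):
    """P[i][j]: probability of going from state i to j along a branch of length t."""
    total = gain + loss
    pi1, pi0 = gain / total, loss / total   # stationary presence / absence
    decay = exp(-total * t)
    return [[pi0 + pi1 * decay, pi1 - pi1 * decay],
            [pi0 - pi0 * decay, pi1 + pi0 * decay]]


def partial_likelihood(node, column, gain, loss):
    """Likelihood of the leaf data below `node`, for node state 0 and 1.

    A leaf is a gene name; an internal node is a list of (child, branch_length).
    """
    if isinstance(node, str):
        return [1.0, 0.0] if column[node] == 0 else [0.0, 1.0]
    like = [1.0, 1.0]
    for child, t in node:
        p = transition(gain, loss, t)
        below = partial_likelihood(child, column, gain, loss)
        for state in (0, 1):
            like[state] *= p[state][0] * below[0] + p[state][1] * below[1]
    return like


def root_presence_probability(tree, column, gain, loss, prior_present=None):
    """Posterior probability that the intron was present at the root."""
    if prior_present is None:
        prior_present = gain / (gain + loss)  # assumption: stationary prior
    l0, l1 = partial_likelihood(tree, column, gain, loss)
    numerator = prior_present * l1
    return numerator / (numerator + (1.0 - prior_present) * l0)


if __name__ == "__main__":
    # Hypothetical family of three genes with introns at illustrative positions.
    sites, matrix = presence_matrix({"a": {10, 55}, "b": {10}, "c": set()})
    tree = [([("a", 0.1), ("b", 0.1)], 0.1), ("c", 0.2)]
    for index, site in enumerate(sites):
        column = {gene: row[index] for gene, row in matrix.items()}
        p = root_presence_probability(tree, column, gain=0.1, loss=1.0)
        print(site, round(p, 3))
```

In this toy setting the gain and loss rates are fixed by hand. Estimating them from the data, as the text describes, would mean choosing the rates that maximize the likelihood across all sites and families.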
The CARB study analyzed data for 10 families of protein-coding genes in multi-celled organisms, encompassing 1,868 introns at 488 different positions.