Researchers in the bioinformatics program at Cold Spring Harbor Laboratory have now developed a computer program that is especially good at finding these first segments and "on" switches of genes. The program is tailored toward detecting these features in the human genome sequence, but it will also be useful for annotating other mammalian genomes.
The program--called "First Exon Finder" or "FirstEF"--was developed by Michael Zhang and his colleagues. A paper describing the program is published in the December issue of Nature Genetics.
"FirstEF is the first program that can readily and accurately detect a class of gene segments that has previously been extraordinarily difficult to find," says Zhang. "It's like looking for buried treasure."
The gene segments Zhang is referring to occur at the very beginning of genes, and are called "non-coding first exons." Because they do not encode protein segments, non-coding first exons are undetectable by conventional computer programs that rely on protein coding patterns found in DNA.
Instead, FirstEF recognizes five other DNA "signatures" that betray the presence and location of first exons in genes. The biological basis of some of these telltale genetic signatures is unknown, says Zhang. "But they are real, and perhaps someday biology will explain why they are there." One such signature is the frequency with which two building blocks of DNA, C and G, occur next to each other.
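To make the C-and-G signature concrete, here is a minimal Python sketch that counts how often C is immediately followed by G in a stretch of DNA. It is an illustration only, not FirstEF's actual code or scoring method. The function name `cpg_stats` and the example sequences are hypothetical. The observed/expected ratio is a commonly used measure of CG enrichment, and its relevance to FirstEF is an assumption here.

```python
def cpg_stats(seq: str) -> dict:
    """Summarize how often C is immediately followed by G in a DNA string.

    Illustrative only: this is NOT how FirstEF scores sequences.
    """
    seq = seq.upper()
    n = len(seq)
    c = seq.count("C")
    g = seq.count("G")
    # "CG" cannot overlap with itself, so str.count finds every occurrence.
    cpg = seq.count("CG")

    # Fraction of adjacent base pairs (n - 1 positions) that are "CG".
    freq = cpg / (n - 1) if n > 1 else 0.0

    # Observed/expected ratio: observed CG count divided by the count
    # expected if C and G were placed independently (c * g / n).
    obs_exp = (cpg * n) / (c * g) if c and g else 0.0

    return {"length": n, "cpg_count": cpg, "cpg_freq": freq, "obs_exp": obs_exp}


if __name__ == "__main__":
    # Hypothetical toy sequences, not real genomic data.
    for name, s in [("CG-rich", "CGCGCG"), ("CG-poor", "ATATAT")]:
        print(name, cpg_stats(s))
```

A ratio well above what is typical for the rest of the genome suggests a CG-rich region; how FirstEF weighs this signature against the other four is not described here.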
Despite the fact that they do not encode protein, non-coding first exons are essential components of gene structure and function. Consequently, the ability to detect non-coding first exons is crucial for scientists wishing to study genes for a wide variety of biological and biomedical applications.
"The results Michael Zhang is getting with FirstEF are very exciting," says James Kent, a graduate student at the University of California at Santa Cruz. Kent's own computer program called "GigAssembler" caused a sensation in the world of genome research when he used it to generate the first and only publicly-available assembly of the human genome sequence in June of last year. Kent hopes to add a FirstEF "track" to the Human Genome Browser he has created (available at http://genome.
When Zhang used FirstEF to analyze the DNA sequences of human chromosomes 21 and 22, he found that the program correctly pinpointed the location of 90 percent of known first exons on those chromosomes. According to Zhang, FirstEF was nearly twice as sensitive as a program available from DoubleTwist, Inc. and Genomatix Software GmbH called "PromoterInspector." Zhang was joined in this study by postdoctoral researchers Ramana Davuluri (now on the faculty at Ohio State University) and Ivo Grosse.
Later, Zhang and his colleagues used FirstEF to analyze the entire human genome. They identified some 68,000 first exons. This result does not necessarily mean that there are 68,000 or so human genes, because a single gene can use alternative first exons. Moreover, the total number of genes in an organism's genome depends on other, subtle definitions of what constitutes a gene. Nevertheless, Zhang believes there are 50,000 to 60,000 human genes and that previous estimates of 30,000 to 40,000 human genes are too low.
One bonus of the way FirstEF operates is that it identifies not only first exons of genes, but also the "on" switches of genes called "promoters."
"A significant bottleneck in current DNA research is finding the promoters of genes. Because gene promoters and first exons are related, FirstEF kills two birds with one stone," says Zhang.