Human genome analyzed using supercomputer
The human genome has 100,000 genes. One gene makes one protein. Humans and
bacteria have entirely different genes.
These common beliefs were shattered earlier this year by the findings of the
International Human Genome Sequencing Consortium, which includes the Department
of Energy Joint Genome Institute (JGI) to which ORNL contributes computational
analysis. On February 15, 2001, three days after a major announcement, the
consortium published the paper "Initial Sequencing and Analysis of the Human
Genome" in the journal Nature. The paper states that the human genome has "about
30,000 to 40,000 protein-coding genes, only about twice as many as in worm or fly";
each gene codes for an average of three proteins; and it is possible that hundreds of
genes were transferred from bacteria to the human genome.
Ed Uberbacher, head of the Computational Biology Section in
ORNL's Life Sciences Division, was one of the hundreds of
contributors to this landmark paper. He and his ORNL
colleagues performed computational analysis and annotation of
the DNA data produced by JGI to uncover evidence of the
existence of genes about which little or nothing was known.
Uberbacher and his colleagues also performed an analysis of
the complete, publicly available human genome. The analysis,
funded by DOE, was performed by ORNL, University of
Tennessee, and University of Pennsylvania researchers using
three computational methods, the GenBank database, and the
IBM RS/6000 SP supercomputer at DOE's Center for
Computational Sciences at ORNL. One of the computational
methods used was the latest version of the Gene Recognition
and Analysis Internet Link (GRAIL), which was developed by
Uberbacher and others at ORNL in 1990 and rewritten as
GrailEXP for parallel supercomputers.
"We have found experimental and computational evidence for some 35,000 genes,"
says Uberbacher. "We have also provided information on how many genes are
expressed in different tissues and organs of the body. For example, we determined that
more than 20,000 genes are expressed in the central nervous system. About 10% of all
human genes are expressed only in the brain."
The researchers found 728 cell-signaling genes that tell cells when to divide and when to
grow. They identified "zinc fingers," regulatory proteins that bind to a gene's DNA to
turn the gene on or off. These cell-signaling genes and zinc fingers are
unique to the human genome.
GrailEXP located almost 2600 genes exhibiting "alternative splicing"—the ability to
produce two or more proteins by combining the gene's dispersed protein-coding regions
(exons) in different ways.
Most human genes contain multiple exons separated by noncoding regions called
introns. Cellular machinery called the spliceosome strips out the introns and joins the
exons together. Sometimes certain exons are skipped, so the same gene can yield
different proteins.
"We found a gene with 10 exons, but in different human tissues different individual
exons are not read, so part of the code is left out that directs the cell to make a protein,"
Uberbacher says. "This gene could have 10 different protein products."
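As a rough illustration of the arithmetic behind that figure, the sketch below lists the
transcripts obtained when exactly one exon at a time is left out of a 10-exon gene. The
reading that each product skips a single exon, the exon labels E1 to E10, and the function
name skip_exons are assumptions made for illustration; they do not come from the ORNL
analysis or from GrailEXP.

```python
from itertools import combinations


def skip_exons(exons, n_skipped=1):
    """Return every transcript formed by leaving out n_skipped exons.

    Exon order is preserved, as it is when exons are joined after splicing.
    """
    keep = len(exons) - n_skipped
    return [list(kept) for kept in combinations(exons, keep)]


# Hypothetical labels for the 10 exons of the gene described in the quote.
exons = [f"E{i}" for i in range(1, 11)]

isoforms = skip_exons(exons)
print(len(isoforms))  # prints 10: one transcript per skipped exon

for isoform in isoforms:
    skipped = (set(exons) - set(isoform)).pop()
    print(f"skip {skipped}: " + "-".join(isoform))
```

Under this simple model, letting two exons be skipped at once (n_skipped=2) would raise the
count to 45, which suggests how quickly the number of possible products can grow.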
Some of the genes identified in the analysis are already known, and detailed information
on their sequences is available in GenBank. Other genes are less well characterized but are similar to genes found in
model organisms, such as the mouse. Still other genes are inferred based on expressed
sequence tags (ESTs). An EST is a unique stretch of DNA within a coding region of a
gene that can be used to identify full-length genes. ESTs were used in computational
predictions to locate additional genes and predict the makeup, structure, and function of
the proteins they encode.
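The idea of using a short tag to point at a longer gene can be sketched with a toy lookup.
The example below searches made-up DNA strings for an exact copy of a made-up EST and
reports where it occurs. The sequences, the contig names, and the function locate_est are
all hypothetical, and exact matching is a simplification: a real EST comparison would have
to tolerate sequencing errors and introns, and the article does not describe how the
researchers' methods handle either.

```python
def locate_est(est, contigs):
    """Return (contig_name, offset) for every exact occurrence of est.

    contigs maps a name to a DNA string; offsets are 0-based.
    """
    hits = []
    for name, sequence in contigs.items():
        start = sequence.find(est)
        while start != -1:
            hits.append((name, start))
            # Continue searching just past the previous hit.
            start = sequence.find(est, start + 1)
    return hits


# Made-up sequences, for illustration only.
contigs = {
    "contig_A": "ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG",
    "contig_B": "TTGACCGTAAGGCTTAACGGA",
}
est = "GGGCCGCTGAAAGG"

for name, offset in locate_est(est, contigs):
    print(f"EST found in {name} at offset {offset}")
```

Because the tag occurs in only one place, a single hit is enough to flag that region as a
candidate gene location for closer analysis.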
In addition to genes, the researchers found many DNA sequences that are repeated in
the human genome. This "junk" DNA may have a purpose: It lowers the probability
that random mutations in DNA strike the coding sections of important genes. "Although
the human and mouse genome are about the same size," Uberbacher says, "we found
longer stretches of repeated DNA sequences, making up 40 to 48% of the human
genome, separating clusters of genes like vast deserts between metropolitan areas."