Protein prediction tool has good prospects
The international competition to predict the three-dimensional
(3D) structures of 43 proteins, using computational tools, was
intense. Of the 123 groups competing in the fourth Critical
Assessment of Techniques for Protein Structure Prediction
(CASP-4) competition, which was held from June through September 2000, an ORNL
group placed sixth, putting it in the top 4%. In fact, ORNL placed ahead of all other
Department of Energy national laboratories in the contest.
The actual structures of the 43 target proteins were determined experimentally by
nuclear magnetic resonance (NMR) spectroscopy and X-ray crystallography, and the
data were unpublished at the time of the competition. "The computational groups were
provided with the identity and order of amino acids making up each protein and the
length of the one-dimensional amino-acid sequence," says Ying Xu, leader of the
Computational Protein Structure Group in the Computational Biology Section of
ORNL's Life Sciences Division. "From this information our team predicted protein
structure." Other team members were Dong Xu, Oakley Crawford, and Phil LoCascio.
Motivating this competition is the search
by biologists to discover the function of
individual proteins that work together in
"protein machines" to form an organism
or keep it alive. They also want to
understand how these functions are
performed at the molecular level.
Proteins often do their work by docking
with another protein. Because the
function of a protein is related to its
shape, it is essential to learn the 3D
structure of each protein. Using the
details of a protein's shape, a chemical
compound can be custom designed to fit
precisely in the protein, like a hand in a
glove, blocking or enhancing the
protein's activity. In this way, a highly
effective drug with no side effects could
be created for each individual.
"The demand for rapid protein structure
determination will grow drastically
because information that could be used
for rational drug design is becoming
available rapidly," Xu says. "Traditional
experimental methods for determining
protein structure may not be able to
keep up with the pace at which
amino-acid sequences are being
generated. Computational techniques, in
conjunction with experimental methods,
could more rapidly determine protein
structures on a genome scale."
For the CASP-4 competition, the ORNL
researchers used a computer package
that they developed and continue to
improve. It is called the Protein
Structure Prediction and Evaluation
Computer Toolkit (PROSPECT) and is
one of only a few dozen
protein-threading computer programs in
"In the CASP competition, you get a 0 if you fail to identify the correct structural
template," Xu says. "You get a 4 if your alignment between the target protein and the
template is perfect. You get scores of 1 to 3 depending on how close you are to being
correct. The scores are added up for all 43 proteins. We recognized two-thirds of the
correct templates, which is the most among all the competing teams, and one-third of
our alignments were off."
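The scoring scheme Xu describes can be tallied in a few lines of Python. This is only an illustrative sketch: the function name and the five sample scores are hypothetical and are not actual CASP-4 results.

```python
def total_casp_score(per_target_scores):
    """Add up per-target scores, each an integer from 0 to 4.

    0 = correct template not identified, 4 = perfect alignment,
    1 to 3 = partially correct alignment.
    """
    for score in per_target_scores:
        if score not in (0, 1, 2, 3, 4):
            raise ValueError(f"score out of range: {score!r}")
    return sum(per_target_scores)


# Hypothetical scores for five targets (illustration only, not real data).
example_scores = [4, 0, 3, 2, 0]
example_total = total_casp_score(example_scores)

# Best possible total if all 43 targets were aligned perfectly.
max_total_43 = 4 * 43
```

Under this scheme a team that scored a 4 on every one of the 43 targets would reach the maximum total of 172.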
Recently, the ORNL team attended a conference in Asilomar, California, and learned
how PROSPECT's results in CASP-4 compared with those of the other teams.
"Some 10,000 protein structures have been determined experimentally out of the
100,000 or so proteins believed to exist, and the information is stored in the Protein
Data Bank," Xu says. "To keep up with the production rate at which protein sequences
are being generated by the genome projects, computational methods are clearly needed.
Structure predictions have been made by the conventional ab initio technique in which
a supercomputer is used to predict how an amino-acid chain can fold itself into a final
shape based on first principles of physics and chemistry. Unfortunately, it takes weeks
to months to predict the structure of even the smallest protein using this approach and
the prediction reliability is poor."
The ORNL group uses template-based methods of protein structure prediction. These
methods rely on experimentally determined 3D structures in the Protein Data Bank.
The ORNL group uses PROSPECT to do "protein threading," in which a string of
amino acids is computationally aligned along different protein templates—like an
embroidery thread drawn through a printed design—to determine which template gives
the best fit. In a perfect alignment, the amino-acid atoms are at their preferred lowest
energy levels and are compatible with neighboring atoms and the protein's environment.
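The idea of threading can be illustrated with a toy Python sketch. It is not PROSPECT's algorithm or energy function, which the article does not describe in detail. The sketch assumes a made-up two-state template (each position is "B" for buried or "E" for exposed), a simplified set of hydrophobic residues, an arbitrary gap penalty, and hypothetical template names. It aligns the sequence to each template by dynamic programming and keeps the template with the lowest total energy.

```python
HYDROPHOBIC = set("AVLIMFWC")  # simplified hydrophobic residue set (assumption)


def fit_energy(residue, environment):
    """Toy energy: -1 if the residue suits the template position, +1 if not."""
    buried = environment == "B"  # "B" = buried, "E" = exposed
    return -1.0 if (residue in HYDROPHOBIC) == buried else 1.0


def thread(sequence, template, gap_penalty=2.0):
    """Lowest total energy of aligning the sequence along the template."""
    n, m = len(sequence), len(template)
    # best[i][j] = lowest energy for the first i residues and first j positions
    best = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        best[i][0] = i * gap_penalty
    for j in range(1, m + 1):
        best[0][j] = j * gap_penalty
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best[i][j] = min(
                best[i - 1][j - 1] + fit_energy(sequence[i - 1], template[j - 1]),
                best[i - 1][j] + gap_penalty,  # residue left unaligned
                best[i][j - 1] + gap_penalty,  # template position skipped
            )
    return best[n][m]


def best_template(sequence, templates):
    """Return (name, energy) for the template with the lowest energy."""
    scored = ((name, thread(sequence, t)) for name, t in templates.items())
    return min(scored, key=lambda pair: pair[1])


# Hypothetical templates and target sequence, for illustration only.
templates = {"fold_a": "BBEEBB", "fold_b": "EEBBEE"}
name, energy = best_template("AVKDLI", templates)
```

In this example the sequence has hydrophobic residues at both ends and polar residues in the middle, so it fits "fold_a" at every position and that template is selected.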
The ORNL group also uses homology modeling to fine-tune the predicted structure. In
this technique, if two amino-acid sequences are similar and one sequence has a known
structure, researchers can use this information to help determine the structure of the
unknown protein sequence. By calculating the detailed forces between atoms and
adjusting the final predicted structure to minimize the atoms' energies, the researchers
computationally tweak the predicted structure of the target protein to make it
energetically more favorable.
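The energy-minimization step can be sketched in Python with steepest descent on a toy energy. Everything in the sketch is an assumption made for illustration: the only energy term is a harmonic spring between consecutive points, the rest length of 3.8 angstroms and the step size are arbitrary choices, and the starting coordinates are invented. A real refinement would use a far more detailed force field.

```python
import math

REST_LENGTH = 3.8  # assumed ideal spacing between consecutive points, angstroms


def distance(a, b):
    """Euclidean distance between two 3D points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def chain_energy(coords, k=1.0):
    """Toy energy: harmonic springs between consecutive points."""
    return sum(0.5 * k * (distance(a, b) - REST_LENGTH) ** 2
               for a, b in zip(coords, coords[1:]))


def minimize(coords, steps=200, step_size=0.1, k=1.0):
    """Steepest descent: repeatedly move each point along the force on it."""
    coords = [list(p) for p in coords]
    for _ in range(steps):
        forces = [[0.0] * 3 for _ in coords]
        for i in range(len(coords) - 1):
            a, b = coords[i], coords[i + 1]
            d = distance(a, b)
            if d == 0.0:
                continue  # direction undefined for coincident points
            scale = k * (d - REST_LENGTH) / d
            for axis in range(3):
                pull = scale * (b[axis] - a[axis])
                forces[i][axis] += pull      # stretched spring pulls a toward b
                forces[i + 1][axis] -= pull  # and b toward a
        for point, force in zip(coords, forces):
            for axis in range(3):
                point[axis] += step_size * force[axis]
    return coords


# Invented starting coordinates with distorted spacings.
start = [(0.0, 0.0, 0.0), (5.0, 0.0, 0.0), (9.0, 1.0, 0.0)]
relaxed = minimize(start)
```

After the loop, the spacing between neighboring points is close to the assumed rest length, and the toy energy of `relaxed` is lower than that of `start`.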
"It is believed that about 1000 unique protein structural folds exist in nature and that
many proteins share each of these unique structural folds," Xu says. "Some 600 unique
protein structures have been determined experimentally. Once the 1000 unique
structural folds are determined by NMR and X-ray crystallography, the rest of the
100,000 protein structures can be accurately modeled computationally."
Xu's group, which has four staff researchers and two postdoctoral scientists, is involved
in the National Institutes of Health's Structural Genome Initiative, which is dedicated to
finding the structures of 100,000 human proteins. As part of this effort, NIH has funded
seven pilot centers for experimentally determining protein structures. They include
centers at DOE's Argonne, Brookhaven, Lawrence Berkeley, and Los Alamos national
laboratories. At Lawrence Berkeley, David Eisenberg, a pioneer in protein threading, is
trying to determine the structures of proteins in the genome of the rod-shaped bacterium
that causes tuberculosis.
"A new trend in structure prediction is the incorporation of partial experimental data as
constraints in the computation process, to make structure prediction closer in accuracy
to the experimental structures," Xu says. "PROSPECT is ideally suited for
incorporating data from local researchers and from these pilot centers. The data include
measurements of distances between amino acids and information on which amino acids
tend to be found on the surface of a protein and which don't."
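One simple way to picture the use of distance measurements as constraints is a penalty that grows when a candidate model violates them. The Python sketch below is a hypothetical illustration, not a description of how PROSPECT handles such data. The restraint format (two residue indices and a maximum distance), the quadratic penalty, and the coordinates are all assumptions.

```python
import math


def distance(a, b):
    """Euclidean distance between two 3D points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def restraint_penalty(coords, restraints, weight=1.0):
    """Sum of squared violations of upper-bound distance restraints.

    restraints: list of (i, j, max_distance), where i and j index coords.
    """
    penalty = 0.0
    for i, j, max_distance in restraints:
        excess = distance(coords[i], coords[j]) - max_distance
        if excess > 0:
            penalty += weight * excess ** 2
    return penalty


# Two invented candidate models of a three-residue fragment.
model_a = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (6.0, 0.0, 0.0)]
model_b = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (3.0, 4.0, 0.0)]

# Hypothetical measurement: residues 0 and 2 are at most 5 angstroms apart.
restraints = [(0, 2, 5.0)]

penalty_a = restraint_penalty(model_a, restraints)
penalty_b = restraint_penalty(model_b, restraints)
```

Here the extended `model_a` violates the restraint by 1 angstrom and is penalized, while the bent `model_b` satisfies it, so a scoring function that includes this term would favor `model_b`.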
Recently, the ORNL group modeled a protein complex using PROSPECT and
experimental data provided by Cynthia Peterson, a University of Tennessee researcher
who has identified a number of disulfide bonds between amino acids in certain parts of
the protein. PROSPECT is also being used to incorporate experimental data provided by
Greg Hurst and Jim Stephenson of ORNL's Organic Mass Spectrometry Group. They
are using electrospray ionization ion trap mass spectrometry and a cross-linking
chemical to determine the distance between two amino acids, both lysines, in a
protein. Their initial studies found that the lysines linked by this chemical, whose
length is known, are 4 angstroms apart. (See Protein Identification by Mass Spectrometry.)
Because NMR data provide distances between amino acids in a protein, the ORNL
group will gladly accept partial NMR data from the pilot centers. Such data would
otherwise go unused because they are insufficient to determine a whole protein
structure. "This amount of data is good enough to help PROSPECT reliably predict a
protein structure," Xu says. He notes that the uncertainty in a protein structure is
within 1 to 1.5 angstroms for X-ray crystallography, within 2 to 2.5 angstroms for NMR,
and within 4 angstroms for
The ORNL group is focused not only on predicting protein structures more accurately
but also on doing it much faster than current computational techniques allow. "We now
use PROSPECT and 20 other computational tools to determine protein structures in a
semi-automatic fashion," Xu says. "Using the IBM RS/6000 SP supercomputer at
DOE's Center for Computational Sciences at ORNL, we can now thread 100 or more
proteins a day against 2000 possible template structures. We are seeking funding to
develop software to build an expert system and an automated protein structure pipeline
to run on the IBM supercomputer. The expert system will mimic the human
decision-making process to automate the computational tools. Our goal is to predict
about 100 protein structures a day.
"If the proposal is funded, our first project will be to predict the structure of proteins of
the Prochlorococcus marinus genome, a bacterium with about 1600 genes. We hope to
show that we can predict these protein structures."
In the next three years, the ORNL group expects to perform computer simulations that
will be of interest to the pharmaceutical industry. This work could enable more rapid
design of drugs that are safe and effective.
"To do this, you have to know whether a ligand, which is a group of molecules typical
of a new drug, will dock with a particular protein to inhibit or stimulate its activity," Xu
says. "We will be doing computer modeling to determine whether and how a ligand
binds with various proteins to cause a healing or harmful effect."
PROSPECT is a copyrighted computer program. It is being used by over 20 academic
organizations, including MIT, Columbia University, the University of Michigan, and the
University of Texas. Millennium Pharmaceuticals is interested in licensing the program.
ORNL's Technology Transfer and Economic Development Directorate seeks to license
this computer toolkit for commercial use because its recent successes suggest it has very