Form follows sequence
"Form follows function," said architect
Louis Sullivan, arguing that a building's
purpose should determine its design. If
Sullivan had been a biologist he might
have put it the other way around. With
no designer except the rough and
tumble of evolution, a protein's function
is largely determined by its form; to find
out what an unknown protein does, it's
often essential to work out its shape.
Ever since James Watson and Francis Crick solved the
double helix structure of DNA in 1953, biology's most
formidable structural challenge has been the "protein
folding problem": learning how nature gets from a gene, a
length of DNA that encodes the order of amino-acid
residues in a string, to a working protein, that same
string intricately folded into all the pockets and creases
and knobs essential to the physics and chemistry of life.
While protein structures are being collected at a steadily
increasing pace, knowledge of gene sequences is
exploding. The Human Genome Project, begun by the
Department of Energy and the National Institutes of
Health less than ten years ago, will finish a draft of all
50,000 to 100,000 human genes (all three billion
base pairs) sometime this year. The majority of the
proteins these myriad genes code for do not resemble
any already known.
"The more information you have, the more kinds of
information you need to make sense of it," says Daniel
Rokhsar, head of the Computational and Theoretical
Biology Department in the Lab's Physical Biosciences
Division and a professor of physics at the University of
California at Berkeley. "Without a simultaneous explosion
in computation (powerful computers and flexible
programs), we'll be overwhelmed."
The Garden of Converging Paths
One way to test ideas about
how proteins fold is to start
with a shape smaller and less
intricate than most proteins,
made from units less
complicated than amino acids.
Supercomputers simulate the
behavior of model polymers,
which in their native
structure (analogous to the
conformation of a fully folded
protein) resemble jungle gyms
made from Tinker-Toy-like
sticks and balls.
Instead of the varying angles
between amino-acid residues
in a real protein, the
stick-and-ball units, or mers,
in a lattice model bond to
their neighbors only at right
angles or straight ahead;
instead of a real amino acid's
complex of properties, a mer
can be assigned just a few.
"Lattice models aren't meant
to model specific proteins,"
says Rokhsar, "but they give a
good representation of certain
aspects of real processes in
manageable time." Using the
Cray T3E at the National
Energy Research Scientific Computing Center (NERSC), Rokhsar and Vijay Pande, an
assistant professor of chemistry at Stanford University, discovered unsuspected
regularities in the folding pathways of model polymers.
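The geometric rule of a lattice model, that each mer bonds to its neighbors only at right angles or straight ahead, can be written down in a few lines. The following is a minimal sketch on a simple cubic lattice, not the researchers' code; the function names `is_valid_lattice_chain` and `bond_turns` are hypothetical and chosen only for illustration.

```python
from typing import List, Tuple

Site = Tuple[int, int, int]  # integer coordinates of one mer on a cubic lattice


def is_valid_lattice_chain(chain: List[Site]) -> bool:
    """Return True if bonded mers sit on adjacent sites and no site is used twice."""
    if len(set(chain)) != len(chain):
        return False  # two mers would occupy the same lattice site
    for a, b in zip(chain, chain[1:]):
        # A Manhattan distance of 1 means the bond runs along a single lattice axis.
        if sum(abs(x - y) for x, y in zip(a, b)) != 1:
            return False
    return True


def bond_turns(chain: List[Site]) -> List[str]:
    """Label each interior mer of a valid chain as 'straight' or 'right-angle'."""
    turns = []
    for a, b, c in zip(chain, chain[1:], chain[2:]):
        step_in = tuple(y - x for x, y in zip(a, b))
        step_out = tuple(y - x for x, y in zip(b, c))
        turns.append("straight" if step_in == step_out else "right-angle")
    return turns
```

For a valid chain the two labels are the only possibilities, since doubling back would put two mers on the same site.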
When the simulated temperature was raised high enough, their lattice model unfolded
completely; when the temperature was lowered, the model refolded, writhing through
almost a million different positions before settling into its native, low-energy structure.
Even with a 48-mer model, roughly equivalent to a small protein, the number of possible initial
conformations is astronomical, and each path to stability is potentially unique.
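The article does not say how the simulated temperature enters the calculation. One common choice in lattice simulations is the Metropolis rule, sketched below as an assumption rather than a description of the researchers' program: moves that lower the energy are always accepted, and moves that raise it are accepted with a probability that shrinks as the temperature falls. The names `metropolis_accept` and `acceptance_rate` are hypothetical, and energy and temperature are in arbitrary matching units.

```python
import math
import random


def metropolis_accept(delta_e: float, temperature: float, rng: random.Random) -> bool:
    """Decide whether to accept a trial move that changes the energy by delta_e.

    Assumes temperature > 0.
    """
    if delta_e <= 0:
        return True  # downhill or neutral moves are always accepted
    # Uphill moves are accepted with Boltzmann probability exp(-dE / T).
    return rng.random() < math.exp(-delta_e / temperature)


def acceptance_rate(delta_e: float, temperature: float,
                    trials: int = 10000, seed: int = 0) -> float:
    """Estimate the fraction of identical uphill moves accepted at a given temperature."""
    rng = random.Random(seed)
    accepted = sum(metropolis_accept(delta_e, temperature, rng) for _ in range(trials))
    return accepted / trials
```

Under this rule a hot chain accepts most energy-raising moves and unfolds, while a cold chain rejects nearly all of them and settles toward low-energy structures.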
Rokhsar and Pande analyzed movies made by grabbing snapshots of the writhing polymer
every 10,000 iterations. At first the unfolded polymer fluctuates wildly through hundreds
of thousands of configurations, then suddenly settles into a partially folded intermediate
state, in which a stable core is accompanied by flailing loops and dangling ends. After
another couple of hundred thousand iterations, the polymer abruptly locks into its native
state.
The model exhibited more than one distinct class of transition state: different
substructures that achieve temporary stability at an energy higher than the native state
and represent different folding pathways. To see how different properties of the
components may affect transition states and pathways, Rokhsar, Pande, and graduate
student Nicholas Putnam designed three other small, 27-unit polymers with the same
native-state conformation, based on three widely used types of lattice models.
In the simplest version, only mers that touched in the native state attracted each
other; all other pairs were energetically neutral. A more complex model had three kinds of mers
in competition, with like types attracting one another more strongly than unlike types.
The most complicated lattice model used mers with 20 discrete values derived from those
of real amino-acid residues.
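The three kinds of lattice model differ only in how they score a contact between two mers. The sketch below shows one way to express each scoring rule; every numerical value (`eps`, `like`, `unlike`, and the entries of `pair_energy`) is a placeholder invented for illustration, not a parameter from the study.

```python
from itertools import combinations


def contacts(chain):
    """Return index pairs (i, j) of non-bonded mers sitting on adjacent lattice sites."""
    pairs = set()
    for i, j in combinations(range(len(chain)), 2):
        touching = sum(abs(a - b) for a, b in zip(chain[i], chain[j])) == 1
        if j - i > 1 and touching:  # skip mers that are bonded neighbors in the chain
            pairs.add((i, j))
    return pairs


def native_contact_energy(chain, native_contacts, eps=1.0):
    """Simplest model: only contacts that exist in the native state attract."""
    return -eps * len(contacts(chain) & native_contacts)


def typed_energy(chain, types, like=-1.0, unlike=-0.3):
    """Few-type model: like types attract more strongly than unlike types."""
    return sum(like if types[i] == types[j] else unlike for i, j in contacts(chain))


def matrix_energy(chain, sequence, pair_energy):
    """Many-valued model: each contact looks up an energy for its pair of residues."""
    return sum(pair_energy[frozenset((sequence[i], sequence[j]))]
               for i, j in contacts(chain))
```

A four-mer chain bent into a square, `[(0, 0, 0), (1, 0, 0), (1, 1, 0), (0, 1, 0)]`, has exactly one contact, between its first and last mers, so each function scores just that pair.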
"In the two simpler cases, we found that folding pathways could pass through just two
distinct core transition states," says Rokhsar. "The more complex model had only a single
transition state. Both these behaviors are observed in the folding of some small natural
proteins."
Lattice models created by Daniel Rokhsar and Vijay Pande, using the Cray T3E at the
National Energy Research Scientific Computing Center (NERSC), have revealed
unexpected regularities in the folding pathways of protein-like structures.
Knowing more about the transitional structures that a folding protein must pass through
sheds light on which positions in the chain of amino-acid residues are most critical for a
flawless fold: those positions where mutations that substitute one amino acid for another
are likely to have the greatest effect on a protein's shape, for better or worse.
Water, water, everywhere
Proteins don't exist as ideal Platonic forms; their real environment consists mostly of a
warm solvent, namely water. By combining theoretical and computational approaches,
such as lattice models, with data from experiments, physical chemist Teresa
Head-Gordon of the Physical Biosciences Division and her colleagues have detailed
water's essential role in driving protein folding and stabilization.
"Protein folding in a water environment is hard to model, but it's unavoidable if we're to
understand what really happens in nature," says Head-Gordon. "Water itself is an unusual
liquid, still far from completely understood. And the behavior of water molecules close to
amino acids is markedly different from the behavior of water in bulk."
One important measure of amino acids is their varying degrees of hydrophobicity, or "fear
of water." Oil is hydrophobic, which is why oil drops remain separate in water, while
hydrophilic ("water-loving") substances readily dissolve in it. Many proteins have a
hydrophobic core and a hydrophilic surface.
"Nine of the 20 amino acids that form proteins are hydrophobic," says Head-Gordon. "We
started by studying leucines, which are found in the hydrophobic cores of many proteins,
and used x-ray scattering experiments to determine what correlations exist among
leucine molecules dissolved in water, first at low concentrations, then at higher
concentrations."
By measuring the intensities of x-rays or neutrons scattered by water molecules
alone, and then by leucine molecules dissolved in water, Head-Gordon and her colleagues
were able to analyze the structure of water near the leucine. They conjectured that
these water structures, much more highly ordered than water in bulk, give rise to forces
that differ among different kinds of amino acids and thus influence folding pathways.
"Hydrophobic amino acids like to be in contact," says Head-Gordon, "but we found that,
rather than water being instantly driven out as leucine molecules come together, there is
also a preference for the leucines to be separated by a structured layer of water
molecules. This forms a gel-like intermediate state which allows forces other than
hydrophobic forces to come into play" (other forms of atomic and molecular attraction and
repulsion) "as the core takes shape and the protein folds into its native state."
When Head-Gordon and her colleagues applied what they had learned from scattering
experiments to lattice models of polymers, they found that by including accurate
solvation forces they could go a long way toward making the models more realistic mimics
of actual proteins. Some models were swiftly eliminated, and the performance of others
was improved to exhibit faster folding and more cooperative folding transitions. In
addition to a basic understanding of the folding of all proteins, such studies may lead to
specific insight into classic sequences such as the "leucine zipper" that joins secondary
protein structures into dimers through hydrophobic attraction, a sequence that, when
mutated, may play a prominent role in activating cancer-causing genes.
SCOPing out folds
Simple theoretical models bolstered by experimental
data are one approach to faster protein-structure
prediction. Another way to use computers to
translate DNA sequences into protein structures is
to work directly from a growing library of known
structures.
"Structure is of purely scientific interest; function is
why people care," says Inna Dubchak, a computer
scientist in NERSC's Center for Bioinformatics and
Computational Genomics. "I use knowledge-based
methods to apply what we already know about the
properties of molecules to predict the structures of
unknown proteins, information that biologists can
use to deduce their functions."
Describing her method of predicting the folds of unknown proteins, Dubchak explains that
"traditional methods compare unknown gene sequences to known protein sequences or
structures residue by residue, searching for correspondences. But what happens when no
similar sequence exists? I decided to tackle the problem differently, from a taxonometric
Dubchak assessed the physical properties of each of the 20 amino acids found in
proteins, such as hydrophobicity, polarity, van der Waals radius (size),
and the like, and reduced these to a number of vectors representing the residue's
cooperative influence on a fold.
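Turning a sequence of any length into a short vector of property-based numbers can be sketched as follows. This is an illustration under stated assumptions, not Dubchak's actual feature set: the three-way grouping of residues by hydrophobicity in `HYDRO_CLASS` and the two descriptors computed here (class composition and the frequency of switches between classes) are chosen only to show the idea, and all names are hypothetical.

```python
from collections import Counter

# Assumed three-way grouping of the 20 residues by hydrophobicity (illustrative only).
GROUPS = {
    "polar": "RKEDQN",
    "neutral": "GASTPHY",
    "hydrophobic": "CLVIMFW",
}
HYDRO_CLASS = {aa: name for name, letters in GROUPS.items() for aa in letters}
CLASSES = ("polar", "neutral", "hydrophobic")
CLASS_PAIRS = (("polar", "neutral"), ("polar", "hydrophobic"), ("neutral", "hydrophobic"))


def composition_vector(sequence):
    """Fraction of residues in each class, in the order given by CLASSES."""
    counts = Counter(HYDRO_CLASS[aa] for aa in sequence)
    return [counts[c] / len(sequence) for c in CLASSES]


def transition_vector(sequence):
    """Fraction of adjacent residue pairs that switch between two given classes.

    Assumes the sequence has at least two residues.
    """
    labels = [HYDRO_CLASS[aa] for aa in sequence]
    switches = Counter(frozenset((a, b)) for a, b in zip(labels, labels[1:]) if a != b)
    steps = len(sequence) - 1
    return [switches[frozenset(pair)] / steps for pair in CLASS_PAIRS]
```

Because the output length is fixed regardless of sequence length, vectors like these can be fed to a classifier without aligning the unknown sequence to any known one.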
Taken together, the vectors of an unknown sequence do not specify an exact shape so
much as they suggest one that may or may not resemble a fold already included in the
Structural Classification of Proteins (SCOP), a library of experimentally observed folds
developed by the Medical Research Council's Laboratory of Molecular Biology in
Cambridge, England.
Dubchak "trains" neural networks, built with computer processors, to recognize
sequences that produce SCOP-like folds; at present, about a fourth of new sequences
can be matched confidently to folds already in the library. Those that don't match known
shapes represent folds that have not yet been discovered (or they signal that the neural
network doesn't have enough information or hasn't yet learned to recognize the
Armed with the knowledge that the fold of a new protein resembles familiar folds,
biologists can hypothesize the new protein's evolutionary relationships and biological
functions, as well as how it may bind to other proteins and to specific chemicals,
"We want to focus on the most important projects from the biologists' point of view,"
Dubchak says. "We want to help biologists solve their hardest problems by applying
However, because entirely different DNA sequences may produce structures of similar
topology, large uncertainties remain. For example, the resolution of a neural-network fold
prediction may be limited to several times the typical distance between atoms, and two
structures possessing the same fold may be significantly different in size.
Teresa Head-Gordon seeks to
reduce these uncertainties by
invoking the GOSPEL, that is,
"global optimization strategies to
probe energy landscapes."
Head-Gordon's goal is to find,
within the range of possibilities,
the protein structure
corresponding to a specific
sequence that has the lowest
energy.
Neural-network predictions such
as Dubchak's supply "soft constraints" on shape and specify known secondary structures
such as alpha helices and beta sheets. By applying GOSPEL, using force-field models such
as AMBER and CHARMM and descriptions of aqueous solvation learned from theory and
experiment, vaguely defined "coil" structures, which are more challenging, can also be
In the course of comparing candidates, the algorithm applies these empirically derived
functions to areas of the fold accessible to water; it imposes an extra energy penalty on
structures with exposed hydrophobic surfaces. Repeated perturbations of amino-acid
positions use GOSPEL to lower the energy further, homing in on the lowest possible total
energy.
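The idea of charging a candidate structure extra energy for hydrophobic surface left open to water can be illustrated on a toy lattice chain. The sketch below is a stand-in built on assumptions, not the algorithm or force field described above: exposure is counted as the number of empty lattice sites next to a hydrophobic mer, and `weight` is an arbitrary placeholder value.

```python
# The six nearest-neighbor directions on a simple cubic lattice.
NEIGHBOR_STEPS = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]


def exposed_faces(chain, index):
    """Count the empty lattice sites adjacent to one mer (a crude proxy for exposure)."""
    occupied = set(chain)
    x, y, z = chain[index]
    return sum((x + dx, y + dy, z + dz) not in occupied for dx, dy, dz in NEIGHBOR_STEPS)


def solvation_penalty(chain, hydrophobic, weight=0.5):
    """Energy penalty proportional to the exposed faces of hydrophobic mers only."""
    return weight * sum(exposed_faces(chain, i)
                        for i, is_hydrophobic in enumerate(hydrophobic) if is_hydrophobic)


def total_energy(base_energy, chain, hydrophobic, weight=0.5):
    """Combine a candidate's base energy with the exposure penalty."""
    return base_energy + solvation_penalty(chain, hydrophobic, weight)
```

When candidates are ranked by `total_energy`, a conformation that buries its hydrophobic mers beats one with the same base energy that leaves them exposed.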
Global optimization is a voracious consumer of computer power and time. Using the Cray
T3E-900 at NERSC, Head-Gordon and her colleagues have tested their algorithm against
simple "target" proteins. In the case of 1POU, for example, a DNA-binding protein with 72
amino acids arranged as several alpha helices, the structure predicted by GOSPEL from
sequence gave a reasonable estimate of the fold but had some six percent higher binding
energy than the known structure derived from nuclear magnetic resonance imaging.
"We have still not reached crystal structure energy yet, so further improvements in
structure are still possible!" Head-Gordon exclaims.
Nevertheless, while improvements in the underlying model are needed, global-optimization
results have been sufficiently encouraging to attempt larger proteins with more complex
structures, including pure beta sheets and mixed alpha-helix, beta-sheet proteins.
Bundles and beads and barrels and saddles
Proteins are like strings of beads wound into bundles. Their
structure is described at increasingly intricate levels.
Primary structure is a chain of amino-acid residues, chemical
units linked to their neighbors by peptide bonds, like
snap-together plastic beads. The 20 amino acids that can
form proteins differ in size, shape, electric charge and
polarity (which affects interaction with water),
hydrophobicity ("oiliness"), and other properties.
Researchers have assigned single-letter designations to
each, from A for alanine through Y for tyrosine; thus primary
structure, the polypeptide chain, is given by a string of
letters, e.g., MEIMKKQNSQINEINKDEIFV. . . .
Secondary structure results from the angles between amino
acids, plus the hydrogen bonds that may form from one
residue to another. Repeating bonds and angles commonly
form alpha helices and beta sheets (or sometimes variations
of these) and their hairpin or crossover connections, plus a
variety of turns, which often expose active chemical groups
on the protein surface, and a few other structures such as
loops and "paperclips."
Tertiary structures are made from helices, sheets, and other
secondary elements. A particular configuration of these is
called a fold. There are roughly 500 known folds, a dozen of
which occur very commonly, some with names like "barrel" or
"sandwich" or "saddle," out of some 6,000 to 10,000
predicted to exist. Remarkably, many proteins that have
completely different sequences of amino acids are
structurally identical, a strong hint that this structure has
inherent evolutionary advantage.
While a protein may consist of a single polypeptide strand
incorporating a particular fold, others are built from separate
strands. A famous example of quaternary structure is
hemoglobin, which combines two pairs of identically folded
chains in a single molecule capable of snapping up, carrying,
and releasing oxygen in the bloodstream and tissues of the
body.
In vivo, in vitro, in silico
Models that derive values from real amino-acid residues and realistic watery environments
can help us understand the folding of real proteins, and the shapes and functions of
many unknown proteins can be deduced from libraries of known folds. These and yet
more sophisticated and powerful computer techniques are essential, for a functioning
protein is dynamic, while the protein structures determined by crystallography are
static, and even at the present rapid experimental clip it could take another century to
decipher the full atomic structures of all the proteins in cells by experiment alone.
Daniel Rokhsar and his colleagues have also studied the molecular dynamics of a real
protein structure, not under natural conditions or in an experimental set-up, but in silico,
using a fully realistic "all-atom" computer model in which the properties of every atom in
every amino acid are represented, and thousands of water molecules are explicitly
"Even in long runs on powerful computers, with all-atom calculations it's only practical to
model a few nanoseconds of real time," says Rokhsar, "yet real proteins typically fold up
in a few milliseconds" (a million times longer). "So we modeled a very small part of a real
protein, a common structure called a beta hairpin. Instead of trying to watch it fold up,
we watch it unfold, which at the high temperatures of the simulation is a much quicker
process."
Unfolding occurs in a series of discrete steps which always happen in the same order.
Each represents the dissolution of a specific part of the hairpin structure, recalling the
transition states of lattice models.
Much faster and more manageable supercomputers will be needed to study larger protein
structures at the atomic level. The largest yet studied in silico, with 36 residues and
12,000 atoms, was tracked over the course of a single microsecond by researchers at
the University of California at San Francisco; the simulation took a Cray T3D and a Cray
T3E-600 running for two months each, and the model did not reach the real protein's
native state.
To rationally design drugs that
can attack specific disease
mechanisms, to create novel
industrial enzymes, to engineer
new organisms that can increase
food production, clean up waste,
and restore the environment: these
benefits all depend upon accurate,
intimate knowledge of a wide
range of protein structures and
their possible mutations. Every
scrap of experimental knowledge,
every advance in calculating the
molecular dynamics of model
proteins, all are essential to the solution of the protein folding problem, a goal that still
glimmers in the future.