High Resolution Structure Prediction in Prime: Better Physics Yields Better Prediction of Long Loops and Side Chains
Professor Friesner
is a co-founder of Schrödinger and Director of the Center for
Biomolecular Simulation at Columbia University. As Chairman of
Schrödinger's Scientific Advisory Board, Professor Friesner provides
strategic vision and guidance for Schrödinger's scientific
advancements. In this installment of Rich's column, he describes
ongoing research to improve the prediction of long loops and side
chains in Prime.
Predicting
the structure of loop regions in proteins has been a central objective
of biomolecular modeling for the past several decades. Loop structures
can often assume several different low-energy conformations, some of
which differ considerably from the others; for example, the DFG-in
and DFG-out conformations of the activation loop in kinases such as ABL and p38.
Structure-based drug design against these targets can benefit from the
ability to accurately access the different loop conformations,
particularly in the absence of a crystal structure. Similarly, the loop
regions of homologous proteins show significant variations, and methods
capable of modeling these variations will yield superior structures for
virtual screening and lead optimization. In what follows, I focus on
loop prediction in the context of the native protein environment;
applications to homology modeling, which are in progress, will be the
subject of a future article.
From its inception, Prime
has demonstrated substantial improvements in loop predictions when
compared to alternative methods in the literature. Initially, accurate
results were limited to loops of roughly 10 residues. As we have improved both
the sampling algorithms and the energy model in Prime, reliability in
predicting these short loops has improved considerably, and the
ability to model longer loops has also advanced
significantly. Table 1 presents a comparison of our latest development version of Prime with results from Prime version 1.5, taken from a paper [1] in the Journal of Chemical Theory and Computation.
The dramatic reduction in errors for longer loops is in large part due
to breakthroughs in developing more accurate models of continuum
solvation. These models are briefly explained below; those interested
in further details can consult refs. [1] and [2].
Table 1. Average RMSDs (Å) for loop prediction on 6, 8, 10 and 13 residue loops.
| Loop length | Uniform Dielectric | Variable Dielectric | Uniform Dielectric + Hydrophobic | Variable Dielectric + Hydrophobic | Variable Dielectric + OptHydrophobic |
|---|---|---|---|---|---|
| 6 residue | 0.48 | 0.40 | 0.46 | 0.41 | 0.39 |
| 8 residue | 0.84 | 0.79 | 0.76 | 0.74 | 0.68 |
| 10 residue | 1.27 | 0.73 | 1.05 | 0.76 | 0.80 |
| 13 residue | 2.73 | 1.62 | 1.29 | 1.08 | 1.00 |
The RMSD is the loop backbone RMSD computed after superimposing the rest of the
protein. The first two columns show the results with the uniform dielectric
model and the variable dielectric model. The next two columns show the
results when these two models are combined with the hydrophobic term.
The last column shows the results of optimizing the hydrophobic
term in the variable dielectric model by excluding lysines from the
hydrophobic term. Hydrophobic and OptHydrophobic denote the original
hydrophobic term and the optimized hydrophobic term, respectively.
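The RMSD convention used in Table 1 (superimpose the rest of the protein, then measure the deviation of the loop backbone) can be illustrated with a minimal sketch. It assumes NumPy and uses the standard Kabsch algorithm for the superposition; the function names and array layout are hypothetical and are not taken from Prime.

```python
import numpy as np

def kabsch_transform(mobile, target):
    """Rotation R and translation t that best map `mobile` onto `target`
    in the least-squares sense (Kabsch algorithm). Arrays are (N, 3)."""
    mobile_center = mobile.mean(axis=0)
    target_center = target.mean(axis=0)
    covariance = (mobile - mobile_center).T @ (target - target_center)
    u, _, vt = np.linalg.svd(covariance)
    # Guard against an improper rotation (reflection).
    sign = 1.0 if np.linalg.det(vt.T @ u.T) >= 0 else -1.0
    rotation = vt.T @ np.diag([1.0, 1.0, sign]) @ u.T
    translation = target_center - rotation @ mobile_center
    return rotation, translation

def loop_rmsd(pred_frame, native_frame, pred_loop, native_loop):
    """Loop backbone RMSD after superimposing only the non-loop atoms.

    *_frame: (N, 3) coordinates of the rest of the protein.
    *_loop:  (M, 3) coordinates of the loop backbone atoms.
    """
    rotation, translation = kabsch_transform(pred_frame, native_frame)
    moved_loop = pred_loop @ rotation.T + translation
    squared = ((moved_loop - native_loop) ** 2).sum(axis=1)
    return float(np.sqrt(squared.mean()))
```

Because the fit uses only the non-loop atoms, any rigid-body displacement of the loop relative to the protein body counts toward the error, which makes this a stricter measure than superimposing the loop on itself.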
Our older results used a previous generation of the generalized
Born/surface area (GB/SA) model, one that is substantially similar to
those currently used in other programs. In these results there is a
noticeable increase in the average RMSD from experiment beginning at
~10 residues, which becomes a serious accuracy problem at 13 residues.
Similar difficulties are observed by other groups. The origin of this
increased error is not difficult to understand. For shorter loops the
difference between the loop length and the end-to-end distance between
the loop endpoints is typically rather small; thus the loop has little
“play” and the conformational space is greatly constrained by the need
to satisfy the attachment points. In contrast, at around 10 residues
the length of the average loop begins to significantly exceed the distance
between the attachment points, and at 13 residues there is typically
considerable excess length, which leads to an explosion in the size of
the phase space available to the loop. This explosion makes the sampling
problem much more difficult, and also creates the possibility of a
much larger number of incorrect structures, any of which may score
better than the native structure due to problems with the scoring
function.
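The notion of “play” can be made concrete with a rough estimate: compare the maximum span of a fully extended loop with the distance between its attachment points. The sketch below is only illustrative. It assumes the usual spacing of about 3.8 Å between consecutive Cα atoms, and the function name and the anchor coordinates in the usage note are hypothetical.

```python
import math

# Approximate distance between consecutive C-alpha atoms (trans peptide), in Å.
CA_CA_DISTANCE = 3.8

def loop_slack(n_loop_residues, anchor_a, anchor_b):
    """Excess length (Å) of a loop relative to the gap it must bridge.

    anchor_a, anchor_b: C-alpha coordinates of the residues flanking the loop.
    A loop of n residues between two anchors spans n + 1 virtual bonds.
    """
    end_to_end = math.dist(anchor_a, anchor_b)
    max_span = (n_loop_residues + 1) * CA_CA_DISTANCE
    return max_span - end_to_end
```

For two anchors 20 Å apart, `loop_slack(6, (0, 0, 0), (20, 0, 0))` gives about 6.6 Å of slack, while `loop_slack(13, (0, 0, 0), (20, 0, 0))` gives about 33.2 Å, which is consistent with the much larger conformational space described above for 13-residue loops.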
To eliminate these incorrect structures, we have made two major
modifications to the “standard” GB/SA continuum model. First, we
recognized that the surface area component of the model, while
adequate for hydrophobic effects in small molecule solvation free energy
calculations, can yield very large errors in the context of larger
scale structures such as proteins. Specifically, removing a loop from
the body of the protein pulls out the hydrophobic side chains of the loop
that were “docked” into the hydrophobic core of the protein. When the
loop is removed, hydrophobic holes on the Ångström scale are therefore
left behind. The “standard” GB/SA model grossly underestimates the free
energy penalty associated with these holes; if one, or a few, water
molecules were to occupy such holes, they would be unable to make one
or more of their normal complement of hydrogen bonds. A continuum model
cannot compute this correctly because it models water molecules as
infinitesimal dipoles. Such structures simply do not appear in
calculations of small molecule solvation free energies, because small
molecules generally do not form cavities of this type. There are a number
of ways to approach this problem, and we chose the simplest: the addition of
an empirical hydrophobic term to the energy function, similar to what
is used to score protein-ligand docking. As shown in ref. [2], this yielded greatly improved prediction of long loops.
The
second problem we identified with “standard” GB/SA (and PB/SA) models
is the treatment of the internal dielectric constant of the protein.
Various groups have used values ranging from 1 to 20, but none has
proven entirely satisfactory. In ref. [1]
we argue that the internal dielectric of the protein should depend upon
which residues are interacting; charged residues induce a higher degree
of polarization in their surroundings, and hence interactions involving
a charged residue should have a correspondingly higher internal
dielectric constant. We refer to our implementation of this idea as a
variable dielectric model, and in this model an effective internal
dielectric is defined for each pair of interacting residues. Using this
model we obtained substantial improvements in the prediction of charged
side chains, without reducing effectiveness in predicting neutral side
chains, and this in turn resulted in further improvement in long loop
prediction. The most dramatic effect of the new model is shown in Figure 1 below, which presents the distribution of NH4+—COO-
distances, from the lysine and carboxylate residues, respectively,
obtained from experimental crystal structures, and compared with single
side-chain predictions from the fixed and variable dielectric models.
The standard single dielectric model drastically overestimates the
formation of salt bridges, as well as the N—O distance observed in
these salt bridges. The variable dielectric model, while not perfect,
nevertheless represents a dramatic improvement.
Figure 1. The distribution of NH4+—COO-
distances from the lysine and carboxylate residues. The predictions of
uniform dielectric 1 and the variable dielectric model are compared
with native structures. The variable dielectric model eliminates the
overprediction of salt bridges seen with the uniform dielectric model.
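The idea of a pair-dependent internal dielectric can be sketched in a few lines. The example below is a deliberately simplified illustration and not the model of ref. [1]: the dielectric values, the rule of taking the larger of the two residue values, and all names are assumptions, and only a bare Coulomb term is shown rather than a full GB treatment.

```python
# Coulomb constant in kcal·Å/(mol·e²).
COULOMB_CONSTANT = 332.06

# Hypothetical per-residue internal dielectric values (illustrative only;
# these are NOT the parameters published in ref. [1]).
RESIDUE_DIELECTRIC = {"LYS": 4.0, "ARG": 4.0, "ASP": 4.0, "GLU": 4.0}
NEUTRAL_DIELECTRIC = 1.0

def pair_dielectric(residue_a, residue_b):
    """Effective internal dielectric for a pair of interacting residues.

    Assumed combination rule: the more polarizing (charged) partner
    sets the dielectric for the pair.
    """
    eps_a = RESIDUE_DIELECTRIC.get(residue_a, NEUTRAL_DIELECTRIC)
    eps_b = RESIDUE_DIELECTRIC.get(residue_b, NEUTRAL_DIELECTRIC)
    return max(eps_a, eps_b)

def coulomb_energy(charge_a, charge_b, distance, dielectric):
    """Screened Coulomb energy (kcal/mol) of two point charges.

    Charges are in units of e, distance in Å.
    """
    return COULOMB_CONSTANT * charge_a * charge_b / (dielectric * distance)
```

With these illustrative numbers, a lysine and a glutamate separated by 3 Å interact four times more weakly than they would under a uniform dielectric of 1, which is the qualitative effect that reduces the overprediction of salt bridges, while a pair of neutral residues is left unchanged.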
When
these new energetic terms are coupled to the increasingly powerful
conformational sampling algorithms being built into Prime, prediction
of increasingly long loop structures becomes possible. The next
release of Prime will contain technology that is reliable for loops of up to 13
residues in length. New developments in my academic group at Columbia
have been successful in predicting 15 residue loops with good
robustness, and we have had significant success for loops in the 18-20
residue range. Thus, continued technological progress in the coming
years can be expected, and will be delivered in the integrated
Schrödinger software suite.
[1] Zhu, K.; Shirts, M.; Friesner, R. “Improved Methods for Side Chain and Loop Predictions via the Protein Local Optimization Program: Variable Dielectric Model for Implicitly Improving the Treatment of Polarization Effects.” J. Chem. Theory Comput. 2007, 3, 2108-2119.
[2] Zhu, K.; Pincus, D.L.; Zhao, S.; Friesner, R.A. “Long loop prediction using the protein local optimization program.” Proteins 2006, 65, 438-452.
Comments and questions on Dr. Friesner's column are welcome. Please send these via email to ask-rich@schrodinger.com, and we'll address particularly interesting topics in future newsletters.