Protein Structure Prediction by Global Optimization and its Application to Biological Systems
Jooyoung Lee http://lee.kias.re.kr
Center for In Silico Protein Science Korea Institute for Advanced Study
Seoul, Korea Oct.12, 2009
• Physics-Based Protein Modeling vs. Template-Based Modeling -- CASP7 & CASP8
• High-Accuracy Protein Modeling by Global Optimization
• Accurate Protein 3D Modeling Better Understanding of Biology???
KIAS Protein Folding Laboratory http://lee.kias.re.kr
Proteins are important
Proteins
sequence
(genome) structure
function
(post-genome)
scientific bottleneck
The most challenging problem of this century
Control all cellular processes Enzymes
Anti-bodies
hormones
(Protein Folding Problem)
Human hemoglobin
Protein folding problem
• For a given amino acid sequence (of size find the native structure of the protein.n),
• Total # of protein structures: 10n
• mathematically NOT well defined problem
sequence structure
function
KIAS Protein Folding Laboratory http://lee.kias.re.kr
Protein folding problem
1. Protein Structure Prediction:
For a given protein sequence, to determine its 3D structure by computation
2. Protein-Folding Mechanisms:
By what process does a protein folds into its native and biologically active conformation?
3. Inverse Folding:
For a given protein structure, to design its 1D sequence
Protein Structure Prediction
1. Physics-based approaches: Principle based-modeling
① Accurate potential energy function
② Powerful global optimization method what we can do better than others
③ Ab initio, de novo, new fold targets (10-20%)
2. Informatics-based approaches: Template based-modeling
① Map the original problem to a problem with solution
mapping problem (alignment problem)
② Use templates (problems with solutions) to obtain the solution of the original problem (multiple alignment)
③ Comparative modeling, fold recognition (80-90%)
Narrows the search
while maintaining diversity of sampling.
Annealing in conformational space”.
Conformational Space Annealing
J Comput Chem 18 1222 (1997) Phys Rev Lett 91 080201 (2003)
Examples of successful optimizations
• Optimization of ECEPP/3 for a 20-residue membrane-bound portion of me littin [Biopolymers 46, 103-115 (1998)]
• Unbiased global optimization of Lennard Jones clusters up to N =201 [Ph ys Rev Lett 91, 080201 (2003)]
• Ground state in the frustrated XY model and lattice coulomb gas with f =1 /6, [Physica A 315 314-320 (2002)]
• Conformational space annealing and an off-lattice frustrated model protei n, [J Chem Phys 119 10274-10279 (2003)]
• Structure optimization of an off-lattice AB protein model [Phys Rev E 72 011916 (2005), Submitted]
• Efficient molecular docking using conformational space annealing, [J Co mput Chem 26 78-87 (2005)]
• Ground-state energy and energy landscape of the Sherrington-Kirkpatrick spin glass [Phys. Rev. B 76, 184412 (2007)]
• Successful High Accuracy Template Based Modeling in the CASP7 ex periments [Proteins, Vol. 69, 83-89 Suppl. 8 (2007)]
• Multiple sequence alignment by conformational space annelaing [Biophys ical J. 95 4813-4819 (2008)]:
att532
What is CASP?
• Critical Assessment of Techniques for Protein Structu re Prediction (http://predictioncenter.gc.ucdavis.ed u/).
• Goal is to help advance the methods of identifying pr otein structure from sequence.
• Community-wide experiments held every two years st arting 1994 to prepare the post-genomic era
• Blind prediction (and blind assessment).
• Since CASP1 (1994), there are a total of 514 protein sequences predicted.
• Since CASP5 (2002), ~200 methods have been teste d for each CASP.
Protein Structure Prediction
1. Physics-based approaches: Principal based modeling
① Accurate potential energy function
② Powerful global optimization method
③ Ab initio, de novo, new fold targets (10-20%)
2. Informatics-based approaches: Template based modeling
① Map the original problem to a problem with solution
mapping problem (alignment problem)
② Use templates (problems with solutions) to obtain the solution of the original problem (multiple alignment)
③ Comparative modeling & fold recognition (80-90%)
HDEA
RMSD=4.2 Å for 61 residues (80%,
residues 25-85)
HDEA Segment RMSD=2.9 Å for 27 residues (36%, residues 16-42)
PNAS 96,5482
CASP6 example: T0199_D3 (FR/A, Nres=82, 145-226)
Native structure Model4
Past CASP Performances of KIAS protein folding lab
- CASP5 (2002): 18th out of 165 team in new-fold category
- CASP6 (2004): selected as a member of 12 elite teams in new-fold
Physics & Protein Structure Prediction (I)
1. Proteins are polypeptide chains containing many atoms, and the interaction between atoms is con- sidered to be reasonably well described by physics and chemistry.
2. However, there are only a few anecdotal examples of successful physics-based protein modeling
(compared to the informatics-based method).
3. Currently, protein structure prediction methods rely- ing only on physics-based approaches do not work as well as informatics-based methods.
Protein Structure Prediction
1. Physics-based approaches: Principal based modeling
① Accurate potential energy function
② Powerful global optimization method
③ Ab initio, de novo, new fold targets (10-20%)
2. Informatics-based approaches: Template based modeling
① Map the original problem to a problem with solution mapping problem (alignment problem)
② Use templates (problems with solutions) to obtain the solution of the original problem (multiple alignment)
③ Comparative modeling & fold recognition (80-90%)
Physics & Protein Structure Prediction (II)
1. The goal is to achieve better protein modeling by fusing informatics-based methods with a principle of
physics (global optimization)
2. The task was to map protein modeling using tem- plates into a series of combinatorial optimization problem
3. The reality was to learn TBM (template-based model- ing) by making lots of mistakes in a real situation (CASP7)
CASP7 Experiment
• 2006, May -- August
• About 200 prediction methods are tested
• Total of 104 targets (9 cancelled)
• Three major categories:
– High Accuracy Template Based Modeling (28 domains)
• Use fine resolution measures for backbone assessment
• Side-chains are also assessed
• Only model 1s are considered
– Template Based Modeling (108 domains) – Free Modeling (16 domains)
• Physics-based methods have chances for providing competiti ve protein models
• Official results are available from CASP7 conference hom epage (11/26-11/30/2006) and Proteins CASP7 issue
Homology modeling (template-based modelin g) methods in the literature
• Conventional methods:
– Minimal amount of computing resources
– Human power intensive (several days per target)
– A series of decision making procedures require human expertise.
• More advanced (and successful) methods:
– Requires some/significant computing power – Fragments are reassembled
– Complicated score functions (not available to others) are optimized – TASSER by Zhang and Skolnick & ROSETTA by Baker
• Our approach:
– Problems are all mapped onto combinatorial optimization problems – Computing power intensive (CSA is used)
– Requires no human expertise (this is our first-ever TMB attempt in the CAS P)
– Score functions are made up with those available in public – Goal was to learn TBM by making mistakes in a real situation
We formulate protein modeling as a series of combin atorial optimization problems:
• Multiple Sequence Alignment (MSA) optimization of a frustrate system [Biophysical J. 95 4813-4819 (2008)]:
– generate pair-wise alignments between all pairs
– from each pair-wise alignment, generate residue-to-residue restr aints a library of restraints a frustrated system
• All-atom chain building from MSA another combinatori al problem of the modeller energy function [Proteins 75 1010-1023 (2009)]:
– modeller energy is a collection of competing terms including dist ant restraint terms from MSA and stereo-chemistry terms inhe rent frustration when dealing with more than one template
– modeller energy is treated as a black box for optimization
• Side-chain modeling is a combinatorial optimization of ro tamers for a given backbone structure
CASP Strategy
Proteins, Vol. 69, 83-89 Suppl. 8 (2007)Biophysical J.
95, 4813 (2008) Proteins 75
1010-1023 (2009)
CASP7 High Accuracy
Template Based Modelingz
Proteins 69, Issue S8, 27 – 37 (2007) 0.995
CASP7 High Accuracy
Template Based Modeling
Proteins 69, Issue S8, 27 – 37 (2007)
Group nHA GDT-HA AL0 1 1/2 nMR LLG Sum
TS556 (LEE) 26 0.995 0.727 1.427 1.290 12 0.842 3.127 TS020 (Baker) 26 0.746 0.684 1.242 1.307 12 0.738 2.792 TS249 (taylor) 6 0.590 0.351 0.348 0.349 4 1.731 2.670 TS186 (CaspIta-FOX) 27 0.349 0.289 1.280 1.311 12 0.874 2.534 TS004 (ROBETTA) 28 0.432 0.382 1.405 1.290 12 0.792 2.515 TS671 (fams-multi) 28 0.654 0.657 0.876 0.933 12 0.616 2.203 TS010 (SAM-T06) 28 0.464 0.562 1.187 1.185 12 0.487 2.136 TS234 (McCormack-
Okazaki)
2 0.414 0.338 0.865 0.672 2 1.028 2.115 TS664 (CIRCLE-FAMS) 28 0.588 0.630 0.907 0.924 12 0.510 2.022 TS209 (NanoDesign) 26 0.447 0.353 0.997 0.687 12 0.883 2.016 TS568 (CHIMERA) 28 0.574 0.636 0.768 0.752 12 0.688 2.015 TS559 (GSK-CCMM) 4 0.448 0.484 0.396 0.449 2 1.105 2.001 TS338 (UCB-SHI) 28 0.604 0.522 0.271 0.333 12 1.016 1.954 TS024 (Zhang) 28 0.838 0.795 0.561 0.679 12 0.411 1.928
• • •
A total of 174 groups
Conclusion of the official CASP7 assessment for HA/TBM tar- gets (Proteins 69, Issue S8, 38 – 56 (2007) reads:
“A number of groups did well in the HA/TBM cate- gory. Group 556 (LEE) stood out as the only
group that performed near the top according to all criteria investigated: fold quality (particularly
GDT-HA), side-chain rotamer quality, and molecular
replacement model quality”.
Template Based Modeling
Proteins 69, Issue S8, 38 – 56 (2007)
(Skolnick)
CASP8
• 2008, May 5 – Aug 23.
• Over 200 prediction methods are tested.
• Total of 128 targets (6 cancelled).
• We tried 2 methods: LEE and LEE-SERVER (server)
• Partial assessment data released during the CASP8 meetin g (Sardinia, Italy, Dec 3-7, 2008).
• Highlights of LEE & LEE-SERVER prediction:
– 50 HA-TBM targets: LEE & LEE-SERVER are 2 best methods
– Binding site prediction: LEE & LEE-SERVER are 2 best methods.
– Refinement category: Best model1 prediction by LEE
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 0
0.5 1 1.5 2 2.5 3 3.5 4 4.5 5
Chi_(1+2 )
Chi_1 HB-score GDT-HA
LEE-SERVER LEE
BAKER-ROBETTA
Zhang-Server keasar-server
Top 20 methods sorted by GDT-HA for HA-TBM targets http://casp.kias.re.kr
SAM-T08 YASARA
Predictor Group Rankings (CASP8 TBM category)
What can one do better with more accu rate protein models?
• Predict protein functions:
– CASP8 performance (best HA-TBM prediction by LEE and LEE-S) – Best Binding site prediction protein design
• Suggest working mechanism of proteins at atomic resoluti on (insulin analogs collaboration with Prof H Shin)
• Screen natural proteins to find more efficient enzymes:
– Discovery of more efficient amino-transferases by protein modelin g and docking simulation confirmed by wet experiments where 30-60 folds increased in the reaction rate is validated (collaborati on with Prof BG. Kim)
• Determine a protein complex structure by combining X-ra y diffraction data and protein modeling (Cell 136 85-96 J an 9 2009 in collaboration with Prof BH Oh)
Cell 136 85-96, Jan 9 2009
X-tal structure of condensin complex MukBEF
Cell 136 85-96, 2009
Screening of w-aminotransferase for the asymmetric s ynthesis of chiral amine
Sequence name Lowest distance among 100 docking
poses (Å)
Initial rate for for- ward reaction (Uf, μmol/mg∙min)
Initial rate for re- verse reaction
(Ur, μmol/mg∙min) Ur / Uf Produced (S)-- MBA (mM)
Atu4761 2.87 0.38 0.24 0.63 14.5
SAV2612 3.00 0.32 0.35 1.1 15.5
Atu3407 3.09 0.088 0.058 0.66 6.15
w-ATVf 3.21 3.4 0.062 0.018 9.13
1. Caulobacter w-TA were selected and PSI-BLAST was run: 250 se- quences were selected
2. 250 sequences were multiply-aligned and 4 subgroups were iden- tified.
3. 51 sequences belong to w-TA and all the sequences were used for model building.
4. The models were docked with aminodiphenylmethane(ADPM), and the distance between PLP and the N atom of ADPM was mea- sured.
CASP8 Binding Site Prediction
T0391 (Human/Server)
Accuracy = 9/10 = 90 % Coverage = 9/9 = 100 %
Magenta X-ray (3d89A) Blue LEE
Ligand: FES complex
PDB code: 3d89
HETERO ATOMS: FES
FES 57 59 60 61 62 80 82 83 85
Prediction: FES Binding
GDT-HA: 50.73
CSAPDB
Better Structure qualities
Comparable quality
NMR Protein Structure Determination
Protein Structure Determination
by X-ray crystallography & MR
Conclusions
• We have successfully mapped the template-based protein modeling into three layers of combinatorial optimizatio n problems: MSACSA, ModellerCSA and ROTCSA.
• We have demonstrated that high accuracy protein 3D mod eling can be achieved simply by rigorous optimization of relevant score functions.
• The proposed method requires a large amount of computati onal resources (100 CPU days per 300aa protein), but prod uces significantly better results.
• There are rooms for improvement for better template det ection and loop modeling
• Application to real/experimental systems is in the prelimin ary stage but quite promising.
Acknowledgements
3D modeling: Keehyoung Joo,
Jinwoo Lee, Dept. of Math., Kwangwoon U.
Sung Jong Lee, Dept. of Phys., Suwon U.
Function: Mina Oh
NMR Structure Optimization: Jinhyuk Lee Collaboration with experimental groups:
Byung-Gee Kim, School of Chemical and Biological Engineering, SNU Byung-Ha Oh, POSTECH (moved to KAIST)
H Shin, Soongsil U.
DH Shin, Ewha Wemen’s Univ.
Cluster computers: KIAS
http://casp.kias.re.kr/ for head-to-head comparison between servers