Protein Structure Prediction by Global Optimization and its Application to Biological Systems

(1)

Protein Structure Prediction by Global Optimization and its Application to Biological Systems

Jooyoung Lee http://lee.kias.re.kr

Center for In Silico Protein Science Korea Institute for Advanced Study

Seoul, Korea Oct.12, 2009

• Physics-Based Protein Modeling vs. Template-Based Modeling -- CASP7 & CASP8

• High-Accuracy Protein Modeling by Global Optimization

• Accurate Protein 3D Modeling  Better Understanding of Biology???

(2)

KIAS Protein Folding Laboratory http://lee.kias.re.kr

Proteins are important

Proteins

sequence

(genome) ^structure

function

(post-genome)

scientific bottleneck

The most challenging problem of this century

Control all cellular processes Enzymes

Anti-bodies

hormones

(Protein Folding Problem)

(3)

Human hemoglobin

Protein folding problem

• For a given amino acid sequence (of size find the native structure of the protein.n),

• Total # of protein structures: 10ⁿ

• mathematically NOT well defined problem

sequence structure

function

KIAS Protein Folding Laboratory http://lee.kias.re.kr

(4)

Protein folding problem

1. Protein Structure Prediction:

For a given protein sequence, to determine its 3D structure by computation

2. Protein-Folding Mechanisms:

By what process does a protein folds into its native and biologically active conformation?

3. Inverse Folding:

For a given protein structure, to design its 1D sequence

(5)

Protein Structure Prediction

1. Physics-based approaches: Principle based-modeling

① Accurate potential energy function

② Powerful global optimization method  what we can do better than others

③ Ab initio, de novo, new fold targets (10-20%)

2. Informatics-based approaches: Template based-modeling

① Map the original problem to a problem with solution

 mapping problem (alignment problem)

② Use templates (problems with solutions) to obtain the solution of the original problem (multiple alignment)

③ Comparative modeling, fold recognition (80-90%)

(6)

Narrows the search

while maintaining diversity of sampling.

Annealing in conformational space”.

Conformational Space Annealing

J Comput Chem 18 1222 (1997) Phys Rev Lett 91 080201 (2003)

(7)

Examples of successful optimizations

• Optimization of ECEPP/3 for a 20-residue membrane-bound portion of me littin [Biopolymers 46, 103-115 (1998)]

• Unbiased global optimization of Lennard Jones clusters up to N =201 [Ph ys Rev Lett 91, 080201 (2003)]

• Ground state in the frustrated XY model and lattice coulomb gas with f =1 /6, [Physica A 315 314-320 (2002)]

• Conformational space annealing and an off-lattice frustrated model protei n, [J Chem Phys 119 10274-10279 (2003)]

• Structure optimization of an off-lattice AB protein model [Phys Rev E 72 011916 (2005), Submitted]

• Efficient molecular docking using conformational space annealing, [J Co mput Chem 26 78-87 (2005)]

• Ground-state energy and energy landscape of the Sherrington-Kirkpatrick spin glass [Phys. Rev. B 76, 184412 (2007)]

• Successful High Accuracy Template Based Modeling in the CASP7 ex periments [Proteins, Vol. 69, 83-89 Suppl. 8 (2007)]

• Multiple sequence alignment by conformational space annelaing [Biophys ical J. 95 4813-4819 (2008)]:

(8)

att532

(9)

What is CASP?

• Critical Assessment of Techniques for Protein Structu re Prediction (http://predictioncenter.gc.ucdavis.ed u/).

• Goal is to help advance the methods of identifying pr otein structure from sequence.

• Community-wide experiments held every two years st arting 1994 to prepare the post-genomic era

• Blind prediction (and blind assessment).

• Since CASP1 (1994), there are a total of 514 protein sequences predicted.

• Since CASP5 (2002), ~200 methods have been teste d for each CASP.

(10)

1. Physics-based approaches: Principal based modeling

② Powerful global optimization method

2. Informatics-based approaches: Template based modeling

① Map the original problem to a problem with solution

 mapping problem (alignment problem)

③ Comparative modeling & fold recognition (80-90%)

(11)

HDEA

RMSD=4.2 Å for 61 residues (80%,

residues 25-85)

HDEA Segment RMSD=2.9 Å for 27 residues (36%, residues 16-42)

PNAS 96,5482

(12)

CASP6 example: T0199_D3 (FR/A, Nres=82, 145-226)

Native structure Model4

Past CASP Performances of KIAS protein folding lab

- CASP5 (2002): 18^th out of 165 team in new-fold category

- CASP6 (2004): selected as a member of 12 elite teams in new-fold

(13)

Physics & Protein Structure Prediction (I)

1. Proteins are polypeptide chains containing many atoms, and the interaction between atoms is considered to be reasonably well described by physics and chemistry.

2. However, there are only a few anecdotal examples of successful physics-based protein modeling

(compared to the informatics-based method).

3. Currently, protein structure prediction methods rely- ing only on physics-based approaches do not work as well as informatics-based methods.

(14)

1. Physics-based approaches: Principal based modeling

② Powerful global optimization method

2. Informatics-based approaches: Template based modeling

① Map the original problem to a problem with solution  mapping problem (alignment problem)

③ Comparative modeling & fold recognition (80-90%)

(15)

Physics & Protein Structure Prediction (II)

1. The goal is to achieve better protein modeling by fusing informatics-based methods with a principle of

physics (global optimization)

2. The task was to map protein modeling using templates into a series of combinatorial optimization problem

3. The reality was to learn TBM (template-based model- ing) by making lots of mistakes in a real situation (CASP7)

(16)

CASP7 Experiment

• 2006, May -- August

• About 200 prediction methods are tested

• Total of 104 targets (9 cancelled)

• Three major categories:

– High Accuracy Template Based Modeling (28 domains)

• Use fine resolution measures for backbone assessment

• Side-chains are also assessed

• Only model 1s are considered

– Template Based Modeling (108 domains) – Free Modeling (16 domains)

• Physics-based methods have chances for providing competiti ve protein models

• Official results are available from CASP7 conference hom epage (11/26-11/30/2006) and Proteins CASP7 issue

(17)

Homology modeling (template-based modelin g) methods in the literature

• Conventional methods:

– Minimal amount of computing resources

– Human power intensive (several days per target)

– A series of decision making procedures require human expertise.

• More advanced (and successful) methods:

– Requires some/significant computing power – Fragments are reassembled

– Complicated score functions (not available to others) are optimized – TASSER by Zhang and Skolnick & ROSETTA by Baker

• Our approach:

– Problems are all mapped onto combinatorial optimization problems – Computing power intensive (CSA is used)

– Requires no human expertise (this is our first-ever TMB attempt in the CAS P)

– Score functions are made up with those available in public – Goal was to learn TBM by making mistakes in a real situation

(18)

We formulate protein modeling as a series of combin atorial optimization problems:

• Multiple Sequence Alignment (MSA)  optimization of a frustrate system [Biophysical J. 95 4813-4819 (2008)]:

– generate pair-wise alignments between all pairs

– from each pair-wise alignment, generate residue-to-residue restr aints  a library of restraints  a frustrated system

• All-atom chain building from MSA  another combinatori al problem of the modeller energy function [Proteins 75 1010-1023 (2009)]:

– modeller energy is a collection of competing terms including dist ant restraint terms from MSA and stereo-chemistry terms  inhe rent frustration when dealing with more than one template

– modeller energy is treated as a black box for optimization

• Side-chain modeling is a combinatorial optimization of ro tamers for a given backbone structure

(19)

CASP Strategy

Proteins, Vol. 69, 83-89 Suppl. 8 (2007)

Biophysical J.

95, 4813 (2008) Proteins 75

1010-1023 (2009)

(20)

CASP7 High Accuracy

Template Based Modelingz

Proteins 69, Issue S8, 27 – 37 (2007) 0.995

(21)

CASP7 High Accuracy

Template Based Modeling

Proteins 69, Issue S8, 27 – 37 (2007)

Group n_HA GDT-HA AL0 1 1/2 n_MR LLG Sum 　

TS556 (LEE) 26 0.995 0.727 1.427 1.290 12 0.842 3.127 TS020 (Baker) 26 0.746 0.684 1.242 1.307 12 0.738 2.792 TS249 (taylor) 6 0.590 0.351 0.348 0.349 4 1.731 2.670 TS186 (CaspIta-FOX) 27 0.349 0.289 1.280 1.311 12 0.874 2.534 TS004 (ROBETTA) 28 0.432 0.382 1.405 1.290 12 0.792 2.515 TS671 (fams-multi) 28 0.654 0.657 0.876 0.933 12 0.616 2.203 TS010 (SAM-T06) 28 0.464 0.562 1.187 1.185 12 0.487 2.136 TS234 (McCormack-

Okazaki)

2 0.414 0.338 0.865 0.672 2 1.028 2.115 TS664 (CIRCLE-FAMS) 28 0.588 0.630 0.907 0.924 12 0.510 2.022 TS209 (NanoDesign) 26 0.447 0.353 0.997 0.687 12 0.883 2.016 TS568 (CHIMERA) 28 0.574 0.636 0.768 0.752 12 0.688 2.015 TS559 (GSK-CCMM) 4 0.448 0.484 0.396 0.449 2 1.105 2.001 TS338 (UCB-SHI) 28 0.604 0.522 0.271 0.333 12 1.016 1.954 TS024 (Zhang) 28 0.838 0.795 0.561 0.679 12 0.411 1.928

• • •

A total of 174 groups

(22)

Conclusion of the official CASP7 assessment for HA/TBM targets (Proteins 69, Issue S8, 38 – 56 (2007) reads:

“A number of groups did well in the HA/TBM cate- gory. Group 556 (LEE) stood out as the only

group that performed near the top according to all criteria investigated: fold quality (particularly

GDT-HA), side-chain rotamer quality, and molecular

replacement model quality”.

(23)

Template Based Modeling

Proteins 69, Issue S8, 38 – 56 (2007)

(Skolnick)

(24)

CASP8

• 2008, May 5 – Aug 23.

• Over 200 prediction methods are tested.

• Total of 128 targets (6 cancelled).

• We tried 2 methods: LEE and LEE-SERVER (server)

• Partial assessment data released during the CASP8 meetin g (Sardinia, Italy, Dec 3-7, 2008).

• Highlights of LEE & LEE-SERVER prediction:

– 50 HA-TBM targets: LEE & LEE-SERVER are 2 best methods

– Binding site prediction: LEE & LEE-SERVER are 2 best methods.

– Refinement category: Best model1 prediction by LEE

(25)

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 0

0.5 1 1.5 2 2.5 3 3.5 4 4.5 5

Chi_(1+2 )

Chi_1 HB-score GDT-HA

LEE-SERVER LEE

BAKER-ROBETTA

Zhang-Server keasar-server

Top 20 methods sorted by GDT-HA for HA-TBM targets http://casp.kias.re.kr

SAM-T08 YASARA

(26)

Predictor Group Rankings (CASP8 TBM category)

(27)

What can one do better with more accu rate protein models?

• Predict protein functions:

– CASP8 performance (best HA-TBM prediction by LEE and LEE-S) – Best Binding site prediction  protein design

• Suggest working mechanism of proteins at atomic resoluti on (insulin analogs collaboration with Prof H Shin)

• Screen natural proteins to find more efficient enzymes:

– Discovery of more efficient amino-transferases by protein modelin g and docking simulation  confirmed by wet experiments where 30-60 folds increased in the reaction rate is validated (collaborati on with Prof BG. Kim)

• Determine a protein complex structure by combining X-ra y diffraction data and protein modeling (Cell 136 85-96 J an 9 2009 in collaboration with Prof BH Oh)

(28)

Cell 136 85-96, Jan 9 2009

(29)

X-tal structure of condensin complex MukBEF

Cell 136 85-96, 2009

(30)

Screening of w-aminotransferase for the asymmetric s ynthesis of chiral amine

Sequence name Lowest distance among 100 docking

poses (Å)

Initial rate for for- ward reaction (U_f, μmol/mg∙min)

Initial rate for re- verse reaction

(U_r, μmol/mg∙min) U_r / U_f Produced (S)-- MBA (mM)

Atu4761 2.87 0.38 0.24 0.63 14.5

SAV2612 3.00 0.32 0.35 1.1 15.5

Atu3407 3.09 0.088 0.058 0.66 6.15

w-ATVf 3.21 3.4 0.062 0.018 9.13

1. Caulobacter w-TA were selected and PSI-BLAST was run: 250 sequences were selected

2. 250 sequences were multiply-aligned and 4 subgroups were iden- tified.

3. 51 sequences belong to w-TA and all the sequences were used for model building.

4. The models were docked with aminodiphenylmethane(ADPM), and the distance between PLP and the N atom of ADPM was mea- sured.

(31)

CASP8 Binding Site Prediction

(32)

T0391 (Human/Server)

Accuracy = 9/10 = 90 % Coverage = 9/9 = 100 %

Magenta X-ray (3d89A) Blue LEE

Ligand: FES complex

PDB code: 3d89

HETERO ATOMS: FES

FES 57 59 60 61 62 80 82 83 85

Prediction: FES Binding

GDT-HA: 50.73

(33)

CSAPDB

Better Structure qualities

Comparable quality

NMR Protein Structure Determination

(34)

Protein Structure Determination

by X-ray crystallography & MR

(35)

Conclusions

• We have successfully mapped the template-based protein modeling into three layers of combinatorial optimizatio n problems: MSACSA, ModellerCSA and ROTCSA.

• We have demonstrated that high accuracy protein 3D mod eling can be achieved simply by rigorous optimization of relevant score functions.

• The proposed method requires a large amount of computati onal resources (100 CPU days per 300aa protein), but prod uces significantly better results.

• There are rooms for improvement for better template det ection and loop modeling

• Application to real/experimental systems is in the prelimin ary stage but quite promising.

(36)

Acknowledgements

3D modeling: Keehyoung Joo,

Jinwoo Lee, Dept. of Math., Kwangwoon U.

Sung Jong Lee, Dept. of Phys., Suwon U.

Function: Mina Oh

NMR Structure Optimization: Jinhyuk Lee Collaboration with experimental groups:

Byung-Gee Kim, School of Chemical and Biological Engineering, SNU Byung-Ha Oh, POSTECH (moved to KAIST)

H Shin, Soongsil U.

DH Shin, Ewha Wemen’s Univ.

Cluster computers: KIAS

http://casp.kias.re.kr/ for head-to-head comparison between servers

(37)

Protein Structure Prediction by Global Optimization and its Application to Biological Systems

Proteins are important

function

(Protein Folding Problem)

function

Narrows the search

while maintaining diversity of sampling.

Annealing in conformational space”.

Conformational Space Annealing

Examples of successful optimizations

att532

What is CASP?

Physics & Protein Structure Prediction (I)

Physics & Protein Structure Prediction (II)

CASP7 Experiment

Homology modeling (template-based modelin g) methods in the literature

CASP Strategy

CASP7 High Accuracy

Template Based Modelingz

CASP7 High Accuracy

Template Based Modeling

“A number of groups did well in the HA/TBM cate- gory. Group 556 (LEE) stood out as the only

group that performed near the top according to all criteria investigated: fold quality (particularly

GDT-HA), side-chain rotamer quality, and molecular

replacement model quality”.

Template Based Modeling

CASP8

What can one do better with more accu rate protein models?

Cell 136 85-96, Jan 9 2009

CASP8 Binding Site Prediction

T0391 (Human/Server)

NMR Protein Structure Determination

Protein Structure Determination

by X-ray crystallography & MR

Conclusions

Acknowledgements

Thank You!