Fast and Exhaustive

MISSING IMAGE

Material Information

Title:
Fast and Exhaustive Development of New Protein-Ligand Scoring Algorithms and Sampling Methods
Physical Description:
1 online resource (133 p.)
Language:
english
Creator:
Zheng, Zheng
Publisher:
University of Florida
Place of Publication:
Gainesville, Fla.
Publication Date:

Thesis/Dissertation Information

Degree:
Doctorate ( Ph.D.)
Degree Grantor:
University of Florida
Degree Disciplines:
Chemistry
Committee Chair:
MERZ,KENNETH MALCOLM,JR
Committee Co-Chair:
DEUMENS,ERIK
Committee Members:
SMITH,BEN W
BOWERS,CLIFFORD RUSSELL
DUNN,BEN M

Subjects

Subjects / Keywords:
binding -- energy -- ligand -- protein -- sampling -- scoring
Chemistry -- Dissertations, Academic -- UF
Genre:
Chemistry thesis, Ph.D.
bibliography   ( marcgt )
theses   ( marcgt )
government publication (state, provincial, terriorial, dependent)   ( marcgt )
born-digital   ( sobekcm )
Electronic Thesis or Dissertation

Notes

Abstract:
Accurately computing the free energy for biological processes like protein folding or protein-ligand association is a central problem in structural-based drug design. This research focuses on developing a series of scoring algorithms that estimate the binding affinity of a protein-ligand complex given a three-dimensional structure, and one exhaustive sampling method that generate protein-ligand complexes poses for docking and end-point free energy calculation. The first project includes LISA (Ligand Identification Scoring Algorithm) and LISA+ use empirical scoring functions to describe the binding free energy. Interaction terms have been designed to account for van der Waals (VDW) contacts, hydrogen bonding, desolvation effects, and metal chelation to model the dissociation equilibrium constants using a linear model. As the second project, a new generation of Knowledge-based scoring function, KECSA (Knowledge-based and Empirical Combined Scoring Algorithm), is developed. KECSA introduces a new concept of reference state, in order to relate potential of mean force (PMF) to Lennard-Jones potential.Therefore, LJ potential parameters are generated from structural data of protein-ligand complexes. In the third project, we address a exhaustive sampling method together with its associated scoring method, using a novel methodology we term“movable type”. This method involves the identification of all atom pairs seen in protein-ligand complexes and then creating two databases: one with their associated pairwise distant dependent energies and another associated with the probability of how these pairs can combine. Combining these two data bases coupled with the principles of statistical mechanics allows us to accurately estimate binding free energies as well as the pose of a ligand in a receptor.This method, by its mathematical construction, samples all of configuration space of a selected region (the protein active site here) in one shot without resorting to brute force sampling schemes involving Monte Carlo, genetic algorithms or molecular dynamics simulations making the methodology extremely efficient. Importantly, this method explores the free energy surface eliminating the need to estimate the enthalpy and entropy components individually. Finally, low free energy structures can be obtained via a free energy minimization procedure yielding all low free energy poses on a given free energy surface.
General Note:
In the series University of Florida Digital Collections.
General Note:
Includes vita.
Bibliography:
Includes bibliographical references.
Source of Description:
Description based on online resource; title from PDF title page.
Source of Description:
This bibliographic record is available under the Creative Commons CC0 public domain dedication. The University of Florida Libraries, as creator of this bibliographic record, has waived all rights to it worldwide under copyright law, including all related and neighboring rights, to the extent allowed by law.
Statement of Responsibility:
by Zheng Zheng.
Thesis:
Thesis (Ph.D.)--University of Florida, 2013.
Local:
Adviser: MERZ,KENNETH MALCOLM,JR.
Local:
Co-adviser: DEUMENS,ERIK.

Record Information

Source Institution:
UFRGP
Rights Management:
Applicable rights reserved.
Classification:
lcc - LD1780 2013
System ID:
UFE0046229:00001


This item is only available as the following downloads:


Full Text

PAGE 1

1 FAST AND EXHAUSTIVE: DEVELOPMENT OF NEW PROTEIN LIGAND SCORING ALGORITHM S AND SAMPLING METHOD S By ZHENG ZHENG A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY UNIVERSITY OF FLORIDA 2013

PAGE 2

2 2013 Zheng Zheng

PAGE 3

3 To my beloved parents wife and the coming baby

PAGE 4

4 ACKNOWLEDGMENTS I would love to give my sincerest thanks to my advisor Prof. Kenneth M. Merz, Jr. and guidance he gave me during the last five years. He not only provides me with inspiring discussions about research but also, more importantly, teaches me to think professionally, allows me to grow in the group and teaches me many things far beyond the scope of research, from which I believe I will benefit for my entire lifetime. I would also like to thank all my com mittee members, Dr. Erik Deumens, Dr. Ben W. Smith, Dr. Clifford R. Bowers and Dr. Ben M. Dunn. Thank you for agreeing to be on my committee and the time you spent on my oral proposal and dissertation, and also all the precious discussion and advi ce I als o want to acknowledge Dr. Adrian E. Roitberg, who gave me sincere and helpful advi ce I ow e thanks to my colleague s for numerous beneficial discussions my parents for understanding and supporting. With my deepest love, I also want to say thank you to my wife Hao Liu, whose magical smile can clean up all my displeasure and tiredness Finally I want to express my appreciation for the research funding provided by NIH. Thank you, I love you all.

PAGE 5

5 TABLE OF CONTENTS page ACKNOWLEDGMENTS ................................ ................................ ................................ .. 4 LIST OF TABLES ................................ ................................ ................................ ............ 7 LIST OF FIGURES ................................ ................................ ................................ .......... 9 LIST OF ABBREVIATIONS ................................ ................................ ........................... 11 ABSTRACT ................................ ................................ ................................ ................... 12 CHAPTER 1 INTRODUCTION ................................ ................................ ................................ .... 14 Computer Aided Drug Design ................................ ................................ ................. 14 Scorin g Function ................................ ................................ ................................ ..... 15 Docking and Sampling ................................ ................................ ............................ 18 Development of Scoring Functions ................................ ................................ ......... 19 Development of Molecular Conformation Sampling Methods ................................ 20 2 LISA SCORING FUNCTION SERIES ................................ ................................ ..... 22 LISA/LISA+ Method ................................ ................................ ................................ 22 Training set ................................ ................................ ................................ ....... 22 Scoring Function ................................ ................................ .............................. 23 van der Waals interactions ................................ ................................ ......... 24 Hydrogen bonding ................................ ................................ ...................... 25 Desolvation ................................ ................................ ................................ 28 Metal chelation ................................ ................................ ........................... 29 The final expres sion ................................ ................................ ................... 32 LISA Calculation Result and Discussion ................................ ................................ 32 Model Training Results ................................ ................................ ..................... 32 Validation of LISA ................................ ................................ ............................. 35 LISA+ Developm ent and Validation ................................ ................................ ........ 39 3 KECSA SCORING FUNCTION ................................ ................................ .............. 55 Methods and Results ................................ ................................ .............................. 55 Model Valida tion and Discussion ................................ ................................ ............ 65 Validation for Avoidance of Over Fitting and Ligand Size Dependence. .......... 65 Validation Benchmark for Comparison with LISA, LISA+ and Other Scoring Methods ................................ ................................ ................................ ........ 67 4 THE MOVABLE TYPE METHOD APPLIED TO PROTEIN LIGAND BINDING ...... 84

PAGE 6

6 Methodology ................................ ................................ ................................ ........... 84 The Movable Type Method Applied to Protein Ligand Binding ......................... 84 Construction of the Movable Type System: Atom Pairwise Interaction Energy and Probability Databases ................................ ................................ 87 Binding Free Energies from the Movable Type Method ................................ ... 91 Performance of MT KECSA as a Scoring Function for Protein Ligand Binding Affinity Prediction ................................ ................................ ................................ 98 Extracting Heat Maps from the Movable Type Method ................................ ......... 100 Extracting Structu re from the Movable Type Method ................................ ............ 101 5 CONCLUSION AND OUTLOOK ................................ ................................ ........... 111 APPENDIX SUPPORTING INFORMATION DATA ................................ ................................ .. 114 LIST OF REFERENCES ................................ ................................ ............................. 125 BIOGRAPHICAL SKETCH ................................ ................................ .......................... 133

PAGE 7

7 LIST OF TABLES Table page 2 1 List of 20 atom types in LISA ................................ ................................ .............. 42 2 2 List of atom types for the van der Waals interaction ................................ ........... 43 2 3 Some example complexes showing the relations between sp 3 hybridized Carbon & sp 2 hybridized Carbon VDW interaction, Zn chelation and p K d ......... 43 2 4 Parameters and fitting results derived from linear fitting after training data normalization (without chelation terms). ................................ ............................. 44 2 5 Comparison between AB derived in this work and AB derived by Cornell, et al ................................ ................................ ................................ ........................ 45 2 6 Parameters and fitting results for Zinc chelation terms. ................................ ...... 45 2 7 Comparison of R 2 and RMSD between LISA and other scoring functions for ................................ ................................ ....... 46 2 8 Comparison of r SD and ME from LISA and other scoring functions to the work of Wang et al ................................ ................................ ............................. 47 2 9 ANN vs. LISA training and test results for the test set with 41 samples we have designed ................................ ................................ ................................ .... 47 2 10 and ligand molecular properties (with standard errors, SE) in the LISA test set with 1399 complexes. ................................ ................................ ................................ ......... 48 2 11 Parameters derived from linear fitting for four different sets of scoring functions. The interaction weights are given along with the lower and upper bounds of the 95% confidence interval. ................................ .............................. 49 3 1 List of the chosen atom types. ................................ ................................ ............ 73 3 2 Parameters for all 49 pairwise potentials. ................................ ........................... 74 3 3 Entropy parameters and their 95% confidence intervals ................................ .... 76 3 4 Leave one out cross validation of KECSA. ................................ ......................... 76 3 5 Statistical results for KECSA, KECSA LJ, KECSA entropy and Ligand MW correlated with experimental binding affinity. ................................ ...................... 76 3 6 Statistical results of KECSA, LISA+ and LISA from large scale validation studies. ................................ ................................ ................................ ............... 77

PAGE 8

8 3 7 Statistical results of KECSA, scaled KECSA, LISA+ and LISA from four small test sets. ................................ ................................ ................................ ............. 78 4 1 Statistical results for MT KECSA and original KECSA correlated with experimental binding affinities. ................................ ................................ ......... 10 5 4 2 RMSD values () and binding scores (p K d ) of the global and local minima. .... 105

PAGE 9

9 LIST OF FIGURES Figure page 2 1 A simple description of the hydrogen bonding model. ................................ ........ 51 2 2 four examples of protein ligand complexes with Zn N ligand chelation. Fragments are shown as ball and stick models ................................ .................. 51 2 3 Plots of p K d vs. logP p K d vs. molecul ar mass, and binding energy vs.the distance between the ligand nitrogen and Zn for the 38 complexes. ............... 52 2 4 VDW potential well depth ( AB ) by LISA vs. VDW potential well depth ( AB ) by QM. ................................ ................................ ................................ ..................... 52 2 5 LISA calculated p K i /p K d vs the experimental p K i /p K d for the PDBbind v2010 test set of 1399 protein ligand complexes. ................................ ......................... 53 2 6 With the test set built by Wang binding affinity comparison was done for LISA and some well known scoring functions ................................ .................... 54 2 7 Plot of LISA+ and the original LISA calculated p K d or p K i vales vs. Experimental p K d or p K i values. ................................ ................................ ......... 54 3 1 A protein ligand structural illustration of how the KECSA s tatistical potential is modeled. ................................ ................................ ................................ ............. 79 3 2 Ratio of the observed atom pairs to t he total interacting atom pairs vs the new reference state II potential. ................................ ................................ .......... 80 3 3 Plot of KECSA, LISA+ and LISA calculated p K d or p K i vs. Experimental p K d or p K i in obtained in validation studies. ................................ ............................... 81 3 4 Plot of KECSA, scaled KECSA, LISA+ and LISA calculated p K d or p K i vs. Experimental p K d or p K i in four small test sets. ................................ .................. 82 3 5 With the test set built by Wang, binding affinity comparison was done for KECSA, LISA, LISA+ and several other well known scoring functions .............. 83 4 1 The thermodynamic cycle used to formulate the free energy of protein ligand binding in solution. ................................ ................................ ............................ 106 4 2 Modeling of ligand solvent polar interaction using a Boltzmann factor multiplier. A carbonyl oxygen a tom is used and an example here ................... 107 4 3 Plot of MT KECSA and the original KECSA model calculated p K d or p K i val u es vs. Experimental p K d or p K i values. ................................ ....................... 108

PAGE 10

10 4 4 The MT energy maps optimization mechanism to derive the final docking pose in one protein ligand complex. ................................ ................................ 108 4 5 Contact map of the 1LI2 protein ligand complex binding region. ...................... 109 4 6 Heat maps for sp 3 oxygen and aromatic carbon Grid points with lighter color indicate energetically favorable locations for certain atom types within the binding pocket. ................................ ................................ ................................ 109 4 7 T he binding pocket of protein ligand complex 1LI2. ................................ ......... 110

PAGE 11

11 LIST OF ABBREVIATIONS IC 50 The half maximal inhibitory concentration KECSA Knowledge based and Empirical Combined Scoring Algorithm LISA Ligand Identification Scoring Algorithm LISA+ Ligand Identification Scoring Algorithm Plus Version LJ potential Lennard Jones potential MD Molecular dynamics MT method "Movable T ype method p K d Logarithm of t he dissociation constant for a protein and a corresponding ligand p K i Logarithm of t he dissociation constant of bond tightness between an enzyme and a corresponding enzyme inhibitor PMF Potential of mean force R The Pearson product moment correlation coefficient R 2 The coefficient of determination RMSE R oot mean square error VDW van der Waals

PAGE 12

12 Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy FAST AND EXHAUSTIVE: DEVELOPMENT OF NEW PROTEIN LIGAND SCORING ALGORITHMS AND SAMPLING METHODS By Zheng Zheng December 2013 Chair: Kenneth Malcolm Merz, Jr. Cochair: Erik Deumens Major: Chemistry Accurately computing the free energy for biological processes like protein folding or protein ligand association is a central problem in structural based drug design This research focuses on developing a series of scoring algorithms that estimate the binding affinity of a protein ligand complex given a three dimensional structure, and one exhaustive sampling method that generate protein ligand complexes poses for docking and end point free energy calculation. The first project includes LISA ( Ligand Identification Scoring Algorithm ) and LISA+ use empirical scoring function s to describe the binding free energy. Intera ction terms have been designed to account for van der Waals (VDW) contacts, hydrogen bonding, desolvation effects, and metal chelation to model the dissociation equilibrium constants using a linear model. As the second project, a new generation of Knowledg e based scoring function, KECSA (Knowledge based and Empirical Combined Scoring Algorithm), is developed. KECSA introduces a new concept of reference state, in order to relate potential of mean

PAGE 13

13 force (PMF) to Lennard Jones potential. Therefore, LJ potentia l parameters are generated from structural data of protein ligand complexes. In the third project, we address a exhaustive sampling method together with its associated scoring method, This method involves t he identification of all atom pairs seen in protein ligand complexes and then creating two databases: one with their associated pairwise distant dependent energies and another associated with the probability of how these pairs can combine. Combining these two databases coupled with the principles of statistical mechanics allows us to accurately estimate binding free energies as well as the pose of a ligand in a receptor. This method, by its mathematical construction, samples all of configuration space of a selected region (the protein active site here) in one shot without resorting to brute force sampling schemes involving Monte Carlo, genetic algorithms or molecular dynamics simulations making the methodology extremely efficient. Importantly, this method ex plores the free energy surface eliminating the need to estimate the enthalpy and entropy components individually. Finally, low free energy structures can be obtained via a free energy minimization procedure yielding all low free energy poses o n a given fre e energy surface.

PAGE 14

14 CHAPTER 1 INTRODUCTION Computer Aided Drug Design In drug development, discovery of new lead compounds is a crucial step, usually involving of two processes: new structure design and receptor ligand binding affinity evaluation. 1 In the ligand design process, scientist s have developed several schemes which can generally be classified as two main categories: ligand based 1 8 an d receptor based drug design 1, 9 15 For the former scheme, new leads are identified based on information gathered from known active ligands, while for the latter scheme, new leads are derived from known receptor structures and their binding sites, althoug h sometimes information about known binding ligands is also needed. This article mainly focuses on the latter sch eme, receptor based drug design In ligand based drug design a large number of potential ligand molecules from any database ar e screened to compare molecular shape, electrostatic effects and interaction points etc. to an active ligand in order to identify those which fit the receptor binding pocket based on molecular similarity to the active ligand. On the other hand, the key p rinciple of r eceptor based drug design is that ligands with novel structures, or known structures are suggested based on protein ligand complex structural information This feature provides scientists a much more open field to devise schemes for drug desi gn. Receptor based drug design has made significant strides in the past several decades. B ased on the structure s of proteins and by applying the principles of molecular recognition such as optimization of van der Waals interactio ns and hydrogen bonding, n ew lead compounds are designed using either docking known structures into the protein binding site (virtual screening), or assembling atoms/fragments within the

PAGE 15

15 protein binding site in a stepwise manner (de novo drug design). This leads to two main challenges: (1) A reliable evaluation method to calculate the binding affinity between ligands and receptors. (2) An efficient and exhaustive searching method to generate ligand conformations (poses). This article covers both of the se topics and introduces in detail our research on development of a series of "scoring" methods for binding affinity evaluation and a novel exhaustive ligand poses searching method. Scoring Function A likely drug candidate should have an appropriate bindin g affinity fo r its target receptor, typically in the low nanomolar (nM) range. p K i or p K d (negative logarithm of dissociation constant) values for reported ligands usually range between 3 and 10. 16 B inding affinities weaker than the nM range may result in modest efficacy while affinity for multiple targets could lead to undesired side effects Thus, b eing able to accurately predict the binding affinity for these molecules is a central problem in receptor based drug design and remains a very significant sci entific challenge T here are three types of scoring function parameterization methods, based on the means of derivation: physics based, 17 20 knowledge based 21 29 and empirical 30 34 Physics based scoring functions are parameterized using quantum mechanical calculation s as well as available experimental information. These scoring functions are in which each term has a physical significance. Force field scoring methods have the advantage of dir ectly generating absolute binding free energies, rather than relative binding scores. On the other hand, physics based scoring functions are computationally intensive especially when t reatment of solvation effect is added, and the associated energy landsc apes are

PAGE 16

16 generally expensive to evaluate. 35, 36 In addition, some physics based scoring functions involve empirical weighting coefficients that are difficult to generalize Knowledge based protein ligand scoring functions, also known as potential of mean f orce (PMF) scoring functions, are derived from structural information regarding protein ligand complexes. Their pairwise interaction parameters are directly obtained from the frequency of occurrence of the given atom pairs contained in a large database of complexes. The statistical atom pair potential derived by PMF scoring methods is less expensive than the force field generated potential, because PMF scores use less complicated potential terms (especially desolvation terms). The results are more reliable than empirical scoring methods, because only structural data are needed for PMF parameterization instead of binding affinity data, making PMF scoring methods less training set dependent. One limitation arises from the inaccessib i l ity of the reference state where the interatomic interactions are zero. Moreover, knowledge based scoring functions assume t h at all of the distance space is taken into account which given the limited dataset available is unlikely to be the case for all chemotypes encountered in the drug discovery process. Empirical scoring function s are computationally efficient, because of their simple energy functions but this also highlights their major limitation, training set dependent parameterization. Energy functions with simple forms can mask the relationship between the binding affinity and the crystal structures used to build the model Hence, the training process only derives parameters that represent a compromise between the simplicity of the energy expression and the nuances of th e interaction s observed in protein ligand complexes

PAGE 17

17 Although a large and ever increasing number of scoring fu nctions have been developed for protein ligand binding in the past two decades, the scor ing problem remains challenging. For any scoring function, it is difficult for one general set of parameters to fit for the huge diversity of ligand structures. Another bottleneck in scoring function development is the limitation of each kind of scoring fu nction to a compromise among precise physical expression, simple calculation and accurate binding affinity prediction The goal of this research is to generate a new mechanic to build a scoring function which combines the advantage s of physics based, k nowledge based and empirical scoring functions and overcomes t heir limitations. The first concern is how to build the scoring function formula. B hm's concept of the "Master Equation" 37 is very useful because, when binding enthalpy is divided into independent interaction terms, it can be formulated separately, making description of the total enthalpy much easier. The independent energy terms can be categorized into certain atom pair or functional group pair based interactions. However, not all types of atom or functional group pairs within a certain distance need to b e taken into account in the calculation of total enthalpy. Therefore, the second step is to check the occurrence frequency of each type of atom pairwise interaction from our training set database. Those significant interaction types are selected to build t he enthalpy terms. Parameterization is always a challenge for scoring function design. A quantum based method is always time consuming sometimes involving empirical weighting coefficients that lack of physical definition. The traditional fitting method is widely used in knowledge based and empirical scoring function development. However, the difficulty of finding a reference state for knowledge based scoring functions and the training set

PAGE 18

18 dependency for empirical scoring functions both make fitting methods very susceptible to parameterization errors. Docking and Sampling In addition to the parameterization of scoring functions, sampling the configuration space of complex biomolecules is als o a major hurdle impeding our ability to advance the understanding of accurate prediction of ligand binding to a biological receptor. 1 10 Major advances have been made in computer hardware, which has allowed molecular dynamics (MD) simulations to reach the millisecond barrier, but this method is brute force in nature and requires highly sophisticated hardware and software. 2,10,11 14 Moreover, a major hurdle in the modeling of biological systems is associated with how the inter and intra molecular energies a re modeled. Modern force fields are highly evolved, but still need to be further refined to reach chemical accuracy in many applications. 9,14,15 17 Predicting how a ligand (drug) binds its receptor and predicting its associated binding affinity is a highly challenging problem, which if solved, would singularly advance modern receptor based drug design. 8,15,17 31 This approach has largely employed so called end point methods that dock (place) a candidate molecule into a receptor and compute the binding free energy using a range of scoring functions. From an analysis of the error propagation properties in the statistical mechanics based prediction of protein ligand binding affinities it was shown that the end point class of approaches maximizes energy function uncertainties. 32 34 This can be alleviated through the use of sampling approaches including MD methods or methods that exhaustively sample the configuration space associated with protein ligand binding. 1 3,5 7,9,11 14 These methods have shown that they ca n be successful, but are brute force in nature, which

PAGE 19

19 lead us to consider ways in which we can use ideas more akin to the end point methods, but incorporating sampling at the same time. The concept being that this approach would give us the best of both wo rlds, while mitigating the effects of energy function deficiencies. Development of Scoring Functions In o ur early work we develop a new empirical scoring function that is readily applicable to drug design This function, Ligand Identification Scoring Algorithm (LISA) 83 aims to compensate for the common disadvantage s of empirical scoring functions with a focus towards Zn metalloproteins Because van der Waals (VDW) interactions and hydrogen bonding are very important in protein ligand complexes, diffe rent atom types have been introduced in order to simulate these interactions between different atoms. A desolvation term has also been included in order to capture solvation changes resulting from protein ligand complexation Among protein ligand complexes with high binding affinity (p K d >7), metal chelation between active site zinc ions and metal binding "warheads" ( e.g. carboxylate, sulphonamides, etc. ) in ligands is widely observed ; hence, we have also built a Zn chelation term in to LISA to capture this class of interactions For an empirical scoring function, how the formula is built and how the training set is chosen for parameterization is extremely crucial. The i nteraction terms for LISA were designed to account for van der Waals (VDW) contacts, hydro gen bonding, desolvation effects and metal chelation to model the dissociation equilibrium constants using a linear model. Atom types were introduced to differentiate the parameters for VDW, H bonding interactions and metal chelation between different atom pairs. A training set of 492 protein ligand complexes was selected for the fitting process.

PAGE 20

20 We also introduce an alte rnate form of LISA, named LISA+. Though with the same modeling and training scheme, LISA+ classifies protein/ligand complexes into one of four classes based on ligand molecular weight and hydrophobicity prior to scoring By narrowing the usage of a parameter set to more specific ligand genre, the classification scheme is expected to offer reduced errors in binding affinity predictions in sev eral test sets. Knowledge based protein ligand scoring functions 56 64 building on the idea of potential of mean force (PMF), 19 are derived from structural information regarding protein ligand complexes. Their pairwise interaction parameters are directly c onverted from the frequency of occurrence of given atom pairs contained in a large database of complexes. We developed a novel knowledge based protein ligand scoring function that employs a new definition for the reference state, allowing us to relate a st atistical potential to a Lennard Jones (LJ) potential. In this way, the LJ potential parameters were generated from protein ligand complex structural data contained in the PDB. Development of Molecular Conformation Sampling Method s Using MD or exhaustive sampling procedure s to evaluate protein ligand binding is conceptually similar to wood block printing technology where all the words (molecules) are carefully placed on a board (receptor site) and the whole book can be printed (binding free energy determine d) While a more advanced printing te chnology, movable type printing, (which was invented in China in the 11th century and introduced by Gutenberg into the alphabetic language system ) uses a "database" of letters that is pre constructed and then the printi ng of any word involves a data base search followed by the appropriate combination from the movable type system. Using a typical pairwise potential the molecular energy of a system can be decomposed into atom pairwise

PAGE 21

21 interaction energies including bond, an gle, torsion, and long range non covalent forces (van der Waals and electrostatic forces) which by analogy to the MT systems is our occurrence along an atom pairwise coor dinate axis. Herein, we develop the mathematics necessary to bring end point methods up to the "movable type printing level", via building a database of energy intensities and probabilities for all atom type pair interactions found in protein ligand comple xes. Using this information we then demonstrate that the MT approach enhances our ability to predict protein ligand binding free energies and also allows us to extract the associated low energy poses all at a rce sampling methods. Moreover, the docking and scoring problem is an example of a broad class of problems in computational biology that involve both the computation of the free energy and structure of a biological system, which includes challenges like th e prediction of protein folds, protein protein interactions and protein design all of which the MT method can address.

PAGE 22

22 CHAPTER 2 LISA SCORING FUNCTION S ERIES LISA/LISA+ Method Training set It is well appreciated that b oth the size and the quality of a protein ligand training set will affect the final form and effectiveness of a scoring function. Since our scoring function focuses on the evaluation of binding affinity for de novo drug design, the ligands in our training set are required to have binding a ffinity data and structure s that reflect what drug candidates experience upon complexation Hence in this work, we have chosen our training set from the PDBbind v20 10 refined data set 28,29 and have further curated this set to further ensure quality. (1) Ev ery complex in the training set is a crystal structure with an overall X ray Complexes solved by NMR techniques are currently not included in our selection; (2) Crystal c ontacts are interchain or intermolecular contacts that occur as th e result of the protein crystallization process and they are found in all X ray structures. Crystal contacts may influenc e the conformation of a protein and where a ligand binds to the protein, by providing further crystal contacts or even un natural bind ing pockets. I n order to avoid protein ligand complexes affected by crystal contacts in our training set, we eliminated all complexes in our training set that Sndergaard et al 30 found to have crystal contacts affecting the ligand. (3) Only the complexes with K d and K i values have been introduced in to our training set. Complexes with only IC 50 values were not included. (4) We focused on complexes with p K i or p K d 's distributed from 3 to 11 because binding affinities in this range ar e pharmaceutically relevant. Potential inconsistencies in K d value s due to experimental conditions such as pH, temperature, etc. were not taken into account (5)

PAGE 23

23 Based on pharmacokinetic consideration s, as embodied by Lipinski's rule of five an orally administered drug should have no more than 5 hydrogen bond donors no more than 10 hydrogen bond acceptors a n octanol water partition coefficient log P of less than 5 and a molecular weight under 500 daltons 4 In the present research, we did not focus on the first three rules in order to not limit our training set. For example, in prodrug design, the hydrogen bond donor or acceptor atoms can be masked, and the logP value can change once in vivo On the other hand, we did retain the MW rule because comple xes with high MW ligands could have "fooled" our scoring function to become artificially dependent on MW through the van der Waals terms Hence, we decided that complexes containing ligands with MW> 600 would be excluded. 2061 complexes from PDBbind v20 10 r efined data set were filtered based on the above 5 criteria and 4 9 2 complexes were left in our training set. Both ligand and protein atoms were classified (atom typed) as defined in Table 2 1 Scoring Function model 31 where overall binding free energy ( G bind see Equation 2 1) can be decomposed into independent free energy contributions (see Equation 2 2 ) Each component is the sum of a certain type of structure related empirical energy terms (the f i ( x,y,z ) term) multiplied by a weighting coefficient c i (see Equation 2 3) The Master Equation represents the linear combination of these components. (2 1) (2 2)

PAGE 24

24 (2 3) In the current research, our scoring function was decomposed into the following interaction categories : van der Waals interactions, hydrogen bonding, desolvation ( hydrop hobic effect) and metal chelation. van der Waals i nteractions van der Waals interactions are one of the most important interactions present in protein ligand complexes. The computed potential energy depends on the distances between pairs of atoms. The Lennard Jones 6 12 term is emplo yed in this work to reflect van der Waals interaction s when two atoms approach during the binding process between a protein and a ligand. (2 4) (2 5) In Equation 2 4 and 2 5 r ij is the distance between atom i in the protein and atom j in the ligand. ij is the interatomic separation at which repulsive and attractive forces balance (the sum of the van der Waals radii of atom i and atom j ). is the potential well depth, subscript s A and B refer to atom type A and B AB can be expressed using A a nd B according to Equation 2 6 Thus, based on the A parameters taken from the work of Cornell et al 32 we can compare our well depth values obtained via regression with force field based values. (2 6)

PAGE 25

25 We scanned for all of the atom contacts between the ligands and their protein receptors from the 2061 complexes found in the PDBbind v20 10 data set and categor ized those contacts based on the atom types listed in Table 2 1. Some interaction types were not included in this scoring function because (1) they rarely appeared as protein/ligand contacts; (2) they added little substantive improvement to our model based on linear regression results, whose 95% confidence intervals included 0. In the present work, VDW interactions between all pairs of atoms are calculated according to the atom types listed in Table 2 2 We set a distance cutoff of 3 to 5.5 to avoid non p hysical attractive or repulsive forces. Moreover, t o avoid the large repulsion s introduced by overlapped atom pairs, we set an upper limit for f i j ( x,y,z ) C utoff s of 0.5, 1, and 5 p K d units have all been tested, and a value of 0.5 was found to yield the best model Hence, for any pair of atoms, if f i j ( x,y,z ) exceeds the 0.5 p K d cutoff it is set to 0.5 Finally, the AB parameter wa s obtained by linear fitting using our training set Hydrogen b onding Hydrogen bonding is a nother important interaction found in most protein ligand complexes. Such an interaction occurs when a lone pair on a (typically) polar group approaches a hydrogen atom bound (typically) to a polar atom like N or O The principle variable associated with hydrogen bonding is the distance between the hydrogen bond donor and hydrogen bond acceptor, d HA the bond angle between the hydrogen bond donor and acceptor, D H A and the H --A AA angle defined by the hydrogen bond acceptor H A AA In the present work, we modeled hydro gen bonding as defined by Equation 2 7 which is an adaptation from earlier work of Vedani and co workers (see Equation 2 7 ) 33,34 In this description of hydrogen bonding d HA D H A an d H A AA have

PAGE 26

26 defined optimal values. Departure of d HA D H A an d H A AA from these optimal values destabilizes the hydrogen bond interaction (2 7) The f 1 ( d HA ) distance function is modeled as a Lennard Jones 6 12 potential with the well depths to be obtained from linear regression. In the two angle functions, the optimal angle for D H A is 1 8 0 while for H A AA the optimal angle depend s on the type of acceptor atom and on the nature of the molecule in which it is embedded. Based on previous research 33,34 o is 135 for carbonyl, carboxyl, and sulfonamide oxygen atoms, 109.5 (sp 3 ) or 120 (sp 2 ) for hydroxyl oxygen atoms The spatial orientation s of hydrogen atoms are normally not revealed by X ray crystallography at the resolutions typically seen for protein ligand complexes (>1. 5 ). Although hydrogen atoms can be added later, energy minimization is usually required to optimally position them. Adding hydrogen atoms could become problematic especially when hydrogen atoms could be placed into multiple positions, as in cases where a drug molecule has multiple tautomeric states. 35 Therefore, we modified Equation 2 7 in order to resolve the hydrogen atom position ing problem (see Figure 2 1) For the bond distance, d HA we use the distance between the hydrogen bond donor and the hydroge n bond acceptor d DA in place of d HA For the two angle variables, we use the following approximations: Based on an analysis of the geometry of hydrogen bonds in which imidazole, serine, threonine, tyrosine, adenine, cytosine, water, and sulfonamide

PAGE 27

27 fragmen ts act as hydrogen bond acceptors, 34,36 up to 90% of the angle s D H A range within 180 30 which means that cos 2 ( D H A 0 ) range s from 0.75 to 1. Hence the assumption was made that D H A c an be set to 1 80 which results in f 2 ( D H A ) = 1 Moreover, as a result of this approximation, the value of H A AA is equal to that of D A AA With the se simplifying approximations with regards to the hydrogen atom positioning the hydrogen bond interaction between a ligand and its target protein can be quantified with explicit variables. In summary, hydrogen bonding in our model is described as follows: (2 8) For carbonyl, carbxyl and sulfonic oxygen atoms, 0 =135 and for hydroxyl oxygen atoms 0 =109.5 . The AB t erms a re decided upon based on the atom types found in the hydrogen bond donor and acceptor pair. Two atom types for hydrogen bonds are used in this work, O and N. So the O O and O N terms need to be derived in the fitting process. Hydrogen bond lengths are always shorter than the sum of VDW radii and longer than covalent bonds. Based on O H O and N H O bond length s r 0 for the O O distance was set to 2.8 and the N O distance was set to 2.9 in our model, with an upp er limit for r ij of 5 Furthermore hydrogen bond "saturation" was also considered in our model One lone electron pair can form one hydrogen bond with only one polar hydrogen atom and vice versa In order to avoid over saturation in our

PAGE 28

28 computation s the program scans the atoms in the ligand and protein. With the labels we have assigned to each atom, the program determines the number of polar hydrogen atoms within 3. 5 of a potential H bond acceptor. Next the program determine s the formation of a hydrogen bond using the following principle s : When a n H bond donor has n hydrogen atoms bonded ( n ), it will form hydrogen bond s with the nearest n H bond acceptors; When a n H bond acceptor has n lone pairs ( n ), it will form hydrogen bond s with the nearest n H bond donors. Desolvation Desolvation cause s changes in the entropy as well as in enthalpy of the ligand and its target protein. This effect is very difficult to accurately characterize since it involves complicated ligand water, protein water, and water water interactions before and after binding. Different algorithms have been used in other empirical scoring functions In this work, we associated the free energy change caused by the desolvation effect with the binding surface area. Significant advancements have been made over the last several decades in the computation of molecular surface s 37,38 but most are computationally too expensive for this work because we will evaluate thousands of protein ligand complexes. Thus, a novel method was crea ted to reflect the binding surface area with a grid based algorithm First, the effective distance between the ligand and its target protein, within which the desolvation effect occurs, is set to 5 An atom from the ligand (protein) would be judged to be within the binding surface if any atom from the protein (ligand) is less than 5 from it. In the second step of the computation, the program define s a box to cover the atoms from both the ligand and protein marked as within the binding surface and create regular ly spaced grids within the box. The grid spacing used is 0.5

PAGE 29

29 Distances between the grids and every single atom in the box are computed. If a distance between a grid and an atom is less than the van der Waals (VDW) radius of the atom, the g rid is marked as within the atom otherwise the grid is marked as outside the atom Third, grid points marked as within the atom are translated by 0.5 along the Cartesian axes and if a grid point is re identified as outside the atom after one of these translations, the grid point is labeled as a b oundary atom of either the ligand or protein. Because the grid points are closely spaced, we identify the sum of the grid points marked as b oundary atom as qualitatively reflecting the binding surfa ce area of the either ligand or protein. Hence, the mean value of the sum of boundary atom grid point s of both the ligand and protein represent s the binding surface area used in this work (2 9) Metal chelation Metal chelates are observed in numerous metallo protein ligand complexes as metal binding warheads" 39 Scanning the pdbbind v20 10 database, numerous chelates between ligands and Cu, Fe, Mg can be found for protein ligand complexes where the p K d is higher t han 6, but interestingly these metal binding warheads do not show as a significant effect on the binding affinity as was observed in the case of Zn Ligands use different "war head s" to chelate zinc ions such as O, N, S. We organize d the ligands into different categories and assign ed different parameters for their corresponding models. It is very important to design an appropriate model for the Zn chelation term. We started with ligands with N as warhead because they show the strongest Zn chelation e ffect

PAGE 30

30 When the observed p K d is higher than 7 for a metalloenzyme ligand complex, N Zn chelation is generally observed For p K d value s ranging from 7 to 11, we found 38 acceptable (no crystal contacts, etc. ) complexes containing Zn N chelation interactions in the PDBbind v20 10 database. Ultimately, all these complexes consist ed of Carbonic anhydrase II as receptor and a sulfonamide as ligand. The structure of the tetrahedral active site is formed by three nitrogen atoms from i midazole group s (from His resid ues) and a nitrogen atom from sulfonamide Some examples are shown in Figure 2 2. Although VDW interactions, hydrogen bonding and desolvation effect s also exist in these complexes, Zn chelation is still a significant effect in binding From the calculation of th e interactions listed above, we found that for complexes without Zn chelation, who have high binding affinity (p K d >7), the sum of VDW interactions between C3 and C 2 ( ) was always high (see atom types in Tables 2 1 and 2 2) However, for those complexes with Zn chelation whose p K d ranged above 7, VDW interactions between C3 and C 2 was not as high (see Table 2 3) This phenomena shows that Zn N chelation is balanced with VDW interactions in high binding affinity complexes, whic h to some degree demonstrat es that Zn chelation may be the dominant interaction in metalloenzyme complexes with high p K d s. A more detailed comparison is shown in Table 2 3 Given the importance of Zn chelation in metallo protein ligand complexes, a mathematical model for Zn ligand chelation needs to be built. We need to figure out the relation ship between all factors and their contributions to binding affinity. First, a direct comparison between the chelate structure and binding affinity were made. F rom Figure 2 3C we see that when the distance between the ligand nitrogen and Zn is around 2

PAGE 31

31 binding energy is likely to reach its maximum, and decreases when N --Zn distance moves away from 2 This has also been shown by Vedani and Huhta in their res earch 39 Then, we also examined the relation between binding affinity and ligands logP and molecular mass. From Figure 2 3A and 2 3B, we cannot see clear trends of p K d changing with logP nor molecular mass of ligands. This indicated that ligands hydrophilicity and molecular weight factors on binding affinity were excluded. Given this insight a function was constructed to model Zn chelation : (2 10) where r is the distance between the nitrogen in a ligand and Zn, is the distance at which chelation affinity reach es its maximum. We applied this model to all other Zn ligand chelation cases Besides Zn N chelation, Zn O chelations are also widely seen among metallo protein ligand complexes. They include monodentate lig ands as carbonyl group s and phosphate group s, and bidentate ligands like hydroxamic acid groups. There are also other ligands employing Zn S chelation, but the available examples were few From pdbbind v20 10 we found 38 complexes with Zn N chelation ( sulf onamide as ligands), 20 complexes with Zn O chelation, 23 complexes with Zn O ( hydroxamic acid as ligands) bidentate chelation and only 6 complexes with Zn S chelation. The equilibrium distances for different chelates were obtained from Vedani and Huhta ; 39 Zn O for a monodentate ligand was set to 1.961 Zn O for a bidentate ligand was set to 2.068 Zn N was set to 2.041 and interaction between Zn and S were ignored because of too small training set

PAGE 32

32 T he final expression In summary our score has t he following functional form: (2 11) and for a detailed explanation of the terms see the discussion below. LISA Calculation Result and Discussion Model Training Results LISA's mathematical model included 17 descriptors to describe corresponding interactions between certain types of atoms and 1 descriptor for the solvent effect. For the 2nd ( ), 3rd ( ) and 5th ( ) terms, we combined multiple interact ion types and allowed them share one common weight, in order to decrease the number of parameters to be fitted. Merging these interactions in this way is sensible because they represent similar interacting atom types (In 2nd term, sp 3 carbon aromatic car bon interaction was combined with sp 3 carbon sp 2 carbon interaction, etc .), so our observation is sensible Based on the mathematical model described in detail above, we performed linear fitting to our training set of 492 complexes. Our scoring function is able to reproduce the binding affinities of the entire training set with a n R 2 of 0.536 and an RMSD of 1.32 p K d corresponding to 1. 86 kcal/mol in binding affinity at physiological temperature (310 K). Leave one out cross validation was also done for the training set with a Q 2 of 0.503 and RMSD of 1.38 p K d (1.94 kcal/mol ).

PAGE 33

33 Goodness of fit and the resultant parameters are listed in Table 2 4. Each parameter derived from fitting reflects the weight factor for each term. W e also provide the parameters for the normalized data in order to eliminate the difference in scaling for all of the terms, because (1) in most cases, carbons are more prevalent than other atoms in l igands, causing the scale of the VDW pot ential for carbon carbon interactions to be larger than that of other interactions; (2) SASA was described as surface area in LISA, while other terms in energy units, so they have different variable scalings. Each term was scaled according to the following formula : (2 12) where X i and X i s are the raw and normalized ith term values for interaction type i ; X i,min and X i,max are the minimum and maxi mum values for ith descriptor, respectively From the result shown in Table 2 4, we can compare the contribution of each interaction to the predicted binding affinity. We find that the VDW interaction between sp 3 hybridized carbon atoms and unsaturated carbon atoms is the most important term in predicting binding affinity. However, they outnumber other interactions by a factor of two or more. This reflects the crucial effect of hydrophobic interactions in protein ligand binding. However, contributions of other interactions are not proportional to their contact number. Some types of interactions, which are not so obvious, like VDW C3_N4 and VDW Car_Car, etc. have considerable contributions to the binding affinity. This shows that for the design of new liga nds, simply increasing contact numbers in some cases may not increase binding affinity effectively VDW interactions between all other atom pairs are neglect ed in this work for three reasons: (1) AB s for some atom pairs show an unphysical negative contrib ution to

PAGE 34

34 the total binding affinity based on preliminary fitting result s and as a result adding them in to the scoring function significantly lower ed the correlation coefficient R 2 Hence, they were neglected for both a lack of p hysical interpretation and a low er ing the goodness of fit (2) 95% confidence intervals derived for each descriptor should not include 0, otherwise the descriptor would be regarded as statistically insignificant in the mathematical model. ( 3 ) VDW interactions between some atom pairs like O and S, N and S etc are rare in the training set. A dding them in to the scoring function would lead to the derivation of misleading parameters Presented in kcal/mol AB can be compared between our parameters and those derived by Cornell et al for use in a f orce f ield designed for the s imulation of p roteins and n ucleic a cids 32 From Tabl e 2 5 we see that except for the well depths for VDW C3_N3, VDW C_S, VDW C2_Npl3 and VDW Car_N3 most of our AB parameters are generally very similar to those derived by Cornell and co workers Graphical comparison s are shown in Figure 2 4. This demonstrates that the VDW parameters derived herein are physically meaningful and reliable to be used to estimate binding energies of protein ligand complexes. The SASA term is also important to binding affinity prediction To further test the SASA term we tried grid densities of 0.1 and 1 When using 1 as our grid density, the correlation coefficient ( R 2 ) fell to 0. 502 and RMSD fell to 1. 38 for our training set ( R 2 = 0.536 and the RMSD was 1.32 with our default 0.5 grid density). Using 0.1 as our grid density, the correlation coefficient ( R 2 ) become s 0. 542 with a standard deviation of 1. 31 Fitting results for the two grid densities (0.5 and 0.1 ) were similar,

PAGE 35

35 but a 0.1 grid density was much more computationally intensive than a 0.5 grid density. Hence in LISA we used 0.5 grid density throughout to simulate SASA Parameters for our chelation model are listed in Table 2 6. W e can see both zinc oxygen and zinc nitrogen chelation show significant contribution s to the observed binding affinity. However, compared to Zn O chelation, Zn N chelation plays a bigger role in the binding affinity when the latter interaction is present The r eason for this is unclear, but geometrically most of complexes containing zinc nitrogen coordination retain structures closer to tetrahedral than zinc oxygen complexes. For Zinc sulfur chelation, we couldn't obtain reliable parameters due to the limited tr aining set Validation of LISA Lisa was validated for its ability to predict experimentally measured binding affini ties using three benchmarks. Then Artificial Neural Network (ANN) analysis was employed to determine whether cross terms were needed or can i mprove LISA. ANN help s to analyze correlation s between LISA terms, and whether each term in LISA was well defined First, we introduced the entire PDBbind v20 10 database with a total of 6772 protein ligand complexes. Using the same 5 criteria we used to ch oose our training set ( see above ), 2047 complexes were selected Eliminating the complexes we used in our training set and other test sets 1399 complexes were left in our test set. Using this test set, L ISA gave a Pearson correlation coefficient r of 0.534 with a RMSD of 2.65 kcal/mol (see Figure 2 5) We utilized two already constructed and widely used test sets built by Wang et al 41 and by Muegge and Martin 12 Wang s test set contains 100 diverse protein ligand complexes and we obtained r =0. 72 RMSD=2. 32 kcal/mol using LISA Comparison of

PAGE 36

36 LISA to other score functions is presented in Figure 2 6, showing that LISA performs quite well on this test set. Muegge and Martin s test set contains 77 diverse protein ligand complexes from five protein class es. Using our model we obtained an R 2 of 0.6 8 and a RMSD of 1.42 or 2.01 kcal/mol. For a detailed breakdown of each protein class, see Table 2 7. Complexes from class 1 were chosen from serine protease s, where carbon carbon interactions dominate. LISA yiel ded a good correlation coefficient of R 2 =0.91, and a good RMSD value of 0. 97 expressed in p K d unit s or 1.38 kcal/mol. This shows that LISA models protein ligand complexes dominated by carbon carbon contacts. Class 2 consisted of 15 Zn O monodentate chelation complexes. The correlation coefficient R 2 derived from LISA was 0. 93 and the RMSD was 0.89 expressed in p K d unit or 1.26 kcal/mol. Compared wi th other scoring functions, we believe that LISA is more reliable in predicting m etalloprotein ligand binding affinities. Muegge and Martin s class 3 test set consisted of complexes with very small ligands ( L arabinose s), while class 4 contains large r liga nds. LISA had comparatively poor result s for these two classes ( R 2 = 0.43 and RMSD= 1.87 or 2.65 kcal/mol for class 3, R 2 = 0.33 and RMSD= 1.81 or 2.5 7 kcal/mol for class 4). This result to some degree proved that ligand size affect the prediction of protein ligand binding affinity and suggests future modifications of LISA should consider very small and very large fragments in the training set Several other score functions struggled with these two classes, so this seems to be a general problem. For c lass 5, we obtained R 2 =0. 83 and RMSD=1. 89 or 2.35 kcal/mol We note that there is an overlap between the Wang and Muegge and Martin test set s and our training set. For Wang 's test set 12 complexes overlap and for the

PAGE 37

37 Muegge and Martin test set 10 complexe s overlap Both test sets have less than 10% overlap with the training set, ensuring the reliability of our test set Following their earlier study, Wang and co workers published another test of several widely used scoring functions with a larger test set, the PDBbind v2002 refined set of 800 protein ligand complexes 54 in 2004. We examined this test set to make a more comprehensive comparison with other scoring functions using a much larger data set. Binding affinity data was calculated by LISA and the perf ormance was evaluated using the Pearson correlation coefficient r standard deviation (SD) and unsigned mean error (ME). Comparison between the performance of LISA and other scoring functions can be seen in Table 2 8. In addition, we found 123 complexes in cluding metal ligand contacts in this test set. For these m etalloprotein ligand complexes, LISA reproduced an r of 0.77, SD of 1.09 and ME of 0.69. From the overall three test result comparisons, we believe that LISA perfoms well in predicting binding affi nity, especially for m etalloprotein ligand complexes We also employed an Artificial Neural Network (ANN) to test the energy terms in our scoring function. ANN training result can clearly reflect how much the interaction terms in a scoring function correlate with binding affinity when judiciously controlling the number of nodes in the hidden layers. While the ANN lacks a direct physical interpretation of the interaction term s, and the training results do not provide access to the parameters, it can b e very useful in validating a scoring function model. B ecause ANN tries all possible combinations between the input values to build up a relationship with the output value, if ANN training shows a far better result than a linear fitting method, we can dete rmine that some cross terms need to be introduced in to the scoring

PAGE 38

38 function. Thus, by observing the convergence speed and comparing the goodness of fit, we can judge whether the energy terms are well designed in our scoring function We used the LISA train ing set of 492 complexes to train our network, and designed a new test set to test both LISA and the trained network. For the new test set, (1) We used the same 5 criteria in choosing complexes for the training set (2) p K d values for the complexes in the test set were distributed evenly from 3 to 1 1 Because for any fitting process, the goodness of fit for the function varies in differe nt regions of the output range, the test set should be distributed evenly across its range of applicability in order to av oid false positive test result s We searched for protein ligand complexes related to the complexes in our training set, which had similar ligand s or pocket structures but different binding affinities compared to the complexes in our training set. This was done with Binding MOAD. 55 Finally we selected 41 complexes to be our test set. ANN with a Broyden Fletcher Goldfarb Shanno (BFGS) algorithm, a model of the nonlinear fitting process, was employed in this work. Briefly, the ANN model is composed of connections of the processing elements (nodes). The processing elements transfer data from one bias to the next through activation functions until the output bias. A three bias model was used in the present work for its higher tolera nce to error and a comparatively simple structure for training. For the number of hidden nodes, we employ the so called self generation of hidden nodes method to include the fewest hidden nodes while maintaining training precision This analysis determin ed that five hidden nodes was the optimal choice. The network passes through activation functions defined as sigmoid function s ; the t an s igmoid f unction was selected as the activation function

PAGE 39

39 that defines the transference from the input bias to the hidden bias. The log s igmoid f unction was selected as the activation function that defines the transference from the hidden bias to the output bias Using our new test set, LISA can reproduce p K d 's with an R 2 of 0. 5 2 and RMSD of 1.20 or 1.71 kcal/mol Trained ANN test result got R 2 of 0. 48 and RMSD of 1. 92 kcal/mol From Table 2 9 we see that the ANN test result is no better than that of LISA, from which we can conclude that the energy terms in LISA were well chosen and that no cross terms were neede d. LISA+ Development and Validation Since LISA has made considerable improvement in predicting protein ligand binding affinities, we developed the method further based on its achievement. 42 LISA training result indicated that LISA had relatively poor predi ctive ability when ranking ligands within the low affinity (p K d /p K i <5) region as well as these within the high affinity (p K d /p K i >=8) region. Table 2 1 0 displays the RMSE (root mean square error) values of LISA predictions in different binding regions in the first t est set for LISA validation. 38 There was a general increase in both ligand molecular weight (MW) from the low binding region to the high binding region. This suggested that ligand polarity and size were potential factors in our scoring function prediction. In order to improve the prediction ability of LISA, we classify ligands into groups based on the ir size and polarity and use different parameter sets to evaluate the binding affinity So, in LISA+, before the program does any scoring, it first categorizes ligands into different groups ba sed on the ratio of carbon atoms in the whole ligand and the mol ecular weight (MW) The first group has a low carbon ratio and low molecular weight

PAGE 40

40 (carbon ratio < =0.65 and MW <=350), the second has a low carbon ratio, high molecular weight (carbon ratio <=0.65 and MW >350), the third has a high carbon ratio, low molec ular weight (carbon ratio >0.65 and MW <=350), and the last has a high carbon ratio, high molecular weight (carbo n ratio >0.65 and MW >350). Four different sets of scoring parameters are applied to thes e four groups. The previous data set with 1399 complex es were used for LISA+ parameterization. The linear fitting results are shown in Table 2 11 For some of the four classes, certain terms were found to be insignificant in determining the binding affinity and these terms are neglected in evaluating the LISA + score For validation of LISA+, 486 protein ligand complexes were analyzed with both LISA and LISA+, and the two scores were compared (Figure 2 7 ) LISA gave an R 2 of 0.4221 RMSE of 1.577 and a Standard Error of 0.0715 while LISA+ yielded an R 2 of 0.4717, RMSE of 1.419 and a Standard Error of 0.0644 From this analysis w e could see that LISA+ offers minor improvement over LISA. As knowledge based scoring functions, LISA and LISA+ both have acceptable ability to predict protein ligand binding af finities However, the detailed atom type assignment used in the LISA+ scoring function is important to improve overall prediction ability. From the test results above, we could see that LISA+ showed some improvement compared to LISA. These two scoring met hods were successful in SAMPL3 challenge (first ranking in all scoring methods), a blind test for docking and scoring methods. As fitting based scoring functions, LISA and LISA+ both had good predicting ability compared with other scoring functions. This i mplied that detailed atom type assignment in the scoring function is important to improve the prediction ability. However, based on

PAGE 41

41 our test result, LISA+ does not show a significant improvement to approve the advantage of categorizing ligands based on mol ecular weight and hydrophobicity followed by using more specific empirical scoring functions It indicates that protein ligand binding may not be sensitive enough to ligand's molecular weight and hydrophobicity change. In addition, ligands in the training set with too much similarity in molecular weight or hydrophobicity may lower the robustness of the fitted energy function. For instance, a scoring function fitted with too many hydrophobic ligands would over estimate the non polar interacti ons for protein ligand binding, thus easily falling in to over fitting.

PAGE 42

42 Table 2 1. List of 20 atom types in LISA Atom Type Description Hydrogen Bonding Donor/Acceptor? VDW radii C3 sp 3 hybridized carbon N/A 1.94 C2 sp 2 hybridized carbon N/A 1.9 0 C1 sp hybridized carbon N/A 1.8 0 Car aromatic carbon N/A 1.85 N4 positively charged nitrogen N/A (if no H bonded) / Donor (if H bonded) 1.83 N3 sp 3 hybridized nitrogen N/A (if no H bonded) / Donor (if H bonded) 1.87 N2 sp 2 hybridized nitrogen N/A (if no H bonded) / Donor (if H bonded) 1.86 N1 sp hybridized nitrogen N/A 1.85 Nar aromatic nitrogen N/A 1.86 Nam amide nitrogen N/A (if no H bonded) / Donor (if H bonded) 1.83 Npl3 trigonal planar nitrogen N/A (if no H bonded) / Donor (if H bonded) 1.86 O3 sp 3 hybridized oxygen Acceptor (if no H bonded) / Donor Acceptor (if H bonded) 1.74 O2 sp 2 hybridized oxygen Acceptor 1.66 S sulfur N/A 2.09 P phosphor N/A 2.03 F fluorine N/A 1.55 Cl chlorine N/A 2 .00 Br bromine N/A 2.2 0 I Iodine N/A 2.4 0 Zn zinc cation N/A 1. 20

PAGE 43

43 Table 2 2. List of atom types for the van der Waals interaction Atom Type Description C3 sp 3 hybridized carbon C2 sp 2 hybridized carbon Car aromatic carbon N4 positively charged nitrogen N3 sp3 hybridized nitrogen N2 sp2 hybridized nitrogen Npl3 trigonal planar nitrogen O3 sp 3 hybridized oxygen O2 sp 2 hybridized oxygen S sulfur Table 2 3. Some example complexes show ing the relations between sp 3 hybridized Carbon & sp 2 hybridized Carbon VDW interaction, Zn chelation and p K d PDB Code VDW C3_C2 Zn Chelation p K d 1vkj 16.885 N/A 4.85 1a99 12.084 N/A 5.70 1apb 15.76 N/A 5.82 1b58 37.926 N/A 6.59 1lrh 28.539 N/A 6.82 1h2t 23.873 N/A 7.89 1fcy 60.245 N/A 8.52 1kdk 27.797 N/A 9.05 2fgu 45.694 N/A 9.18 1mrw 39.35 N/A 9.7 1df8 42.823 N/A 9.92 1xpz 12.559 Y 7.08 1cny 15.87 Y 7.85 1ydb 3.7781 Y 8.24

PAGE 44

44 Table 2 3 Continued PDB Code VDW C3_C2 Zn Chelation p K d 1cim 12.885 Y 8.82 1cil 14.889 Y 9.43 1bnt 14.195 Y 9.89 1bnn 3.683 Y 10.00 Table 2 4. Parameters and fitting results derived from linear fitting after training data normalization (without chelation terms) Interaction Type Weight 95% confidence interval Normalized Weight no. of contacts in training set VDW C3_C3 0.1184 0.0848 0.1520 0.5878 22556 VDW C3_C2 0.0910 0.0791 0.1029 0.5938 53562 VDW C3_N3 0.2457 0.1553 0.3361 0.2569 4980 VDW C3_N4 0.4111 0.1907 0.6316 0.2363 1099 VDW C_S 0.2708 0.2071 0.3345 0.2576 2926 VDW C2_C2 0.1197 0.0491 0.1903 0.2123 4140 VDW C2_O3 0.1658 0.0830 0.2487 0.2775 5161 VDW C2_O2 0.0991 0.0226 0.1757 0.1800 8779 VDW C2_Npl3 0.2402 0.0936 0.3868 0.0987 1781 VDW Car_Car 0.0660 0.0498 0.0821 0.4020 6091 VDW Car_O2 0.0765 0.0243 0.1286 0.1714 10340 VDW Car_N3 0.2807 0.0249 0.5364 0.2790 1790 VDW Car_N2 0.1398 0.0391 0.2405 0.1363 1572 VDW O_N 0.1658 0.0611 0.2705 0.2075 7861 H bond between O & O 0.2716 0.1179 0.4252 0.2074 9421 H bond between O & N 0.2383 0.0578 0.4188 0.1693 5370 SASA 0.0120 0.0063 0.0176 0.0453

PAGE 45

45 Table 2 5. Comparison between AB derived in this work and AB derived by Cornell, et al Interaction Type AB (kcal/mol) a AB (kcal/mol) b Interaction Type AB (kcal/mol) AB (kcal/mol) VDW C3_C3 0.11842 0.1094 VDW C2_O2 0.09914 0.134 VDW C3_C2 0.091029 0.097 VDW C2_Npl3 0.24021 0.121 VDW C3_N3 0.24569 0.1364 VDW Car_Car 0.06596 0.086 VDW C3_N4 0.41112 VDW Car_O2 0.076465 0.134 VDW C_S 0.27081 0.1654 VDW Car_N3 0.25577 0.121 VDW C2_C2 0.1197 0.086 VDW Car_N2 0.13977 0.121 VDW C2_O3 0.16581 0.134 VDW O_N 0.16579 0.1889 a. AB is derived in this work b. AB is derived by W D Cornell et al 32 Table 2 6. Parameters and fitting results for Zinc chelation terms interaction type weight 95% confidence interval normalized weight for chelation term normalized weight for sum of other terms Zn O monodentate 0.60154 0.26073 0.94235 0.88654 0.2189 Zn O bidentate 0.45576 0.29547 0.61605 0.77997 0.31438 Zn N monodentate 1.3063 1.1673 1.4454 1.1666 0.18926

PAGE 46

46 Table 2 7. Comparison of R 2 and RMSD between LISA and other scoring functions for s test set correlation ( R 2 ) No set no. of complexes LISA ITScore /SE 20 ITScore 19 PMF99 12 DrugScore 15 BLEEP 52 SMoG0 1 16 SCORE1 31 SMOG 53 1 serine protease 16 0.91 0.89 0.87 0.87 0.86 0.79 0.81 0.76 0.76 2 metalloprotease 15 0.93 0.71 0.70 0.58 0.70 0.59 0.64 0.41 0.58 3 L arabinose binding protein 18 0.43 0.48 0.49 0.48 0.22 0.14 0.06 0.00 0.04 4 endothiapepsin 11 0.33 0.36 0.35 0.22 0.30 0.04 0.03 0.39 0.05 5 others 17 0.83 0.8 0.7 0.69 0.43 0.49 0.50 0.53 0.25 6 sets 1 5 77 0.68 0.76 0.65 0.61 n/a 0.28 0.46 0.30 0.21 RMSD No set no. of complexes LISA ITScore /SE 20 ITScore 19 PMF99 12 DrugScore 15 BLEEP 52 SMoG0 1 16 SCORE1 31 SMOG 53 1 serine protease 16 0.97 n/a n/a 0.96 0.95 n/a 1.09 1.39 1.34 2 metalloprotease 15 0.89 n/a n/a 2.31 1.53 n/a 1.62 3.27 2.29 3 L arabinose binding protein 18 1.87 n/a n/a 0.86 0.75 n/a 0.8 69.7 4.06 4 endothiapepsin 11 1.81 n/a n/a 1.89 0.94 n/a 1 1.26 4.18 5 others 17 1.66 n/a n/a 1.56 1.85 n/a 1.63 2.21 4.05 6 sets 1 5 77 1.42 n/a n/a 1.84 n/a n/a 1.69 3.47 4.43

PAGE 47

47 Table 2 8 Comparison of r SD and ME from LISA and other scoring functions to the work of Wang et al Scoring Functions r SD ME LISA 0.610 1.92 1.47 X Score::HPScore 0.514 1.89 1.47 X Score::HMScore 0.566 1.82 1.42 X Score::HSScore 0.506 1.90 1.48 DrugScore::Pair 0.473 1.94 1.51 DrugScore::Surf 0.463 1.95 1.53 DrugScore::Pair/Surf 0.476 1.94 1.50 Sybyl::D Score 0.322 2.09 1.67 Sybyl::PMF Score 0.147 2.16 1.74 Sybyl::G Score 0.443 1.98 1.56 Sybyl::ChemScore 0.499 1.91 1.50 Sybyl::F Score 0.141 2.19 1.77 Cerius2::LigScore 0.406 2.00 1.57 Cerius2::PLP1 0.458 1.96 1.52 Cerius2::PLP2 0.455 1.96 1.53 Cerius2::PMF 0.253 2.13 1.71 Cerius2::LUDI1 0.334 2.08 1.66 Cerius2::LUDI2 0.379 2.04 1.62 Cerius2::LUDI3 0.331 2.08 1.67 GOLD::GoldScore 0.285 2.16 1.72 GOLD::GoldScore_opt 0.365 2.06 1.63 GOLD::ChemScore 0.423 2.00 1.56 GOLD::ChemScore_opt 0.449 1.96 1.52 HINT 0.330 2.08 1.65 Table 2 9 ANN vs. LISA training and test result s for the test set with 41 samples we have designed training R 2 test R 2 training RMSD (kcal/mol) test RMSD (kcal/mol) ANN 0.60 0.48 1.44 1.92 LISA 0.54 0.52 1.86 1.71

PAGE 48

48 Table 2 1 0 properties (with standard errors, SE) in the LISA test set with 1399 complexes. p K d or p K i Number of ligands RMS Error Standard Error Average C atom ratio C atom ratio SE Average molecula r weight (g/mol) Molecular weight SE < 5 301 1.795 0.044 0.5783 0.0076 365.78 6.71 5 8 743 1.589 0.031 0.6678 0.0052 417.40 4.66 >8 355 2.419 0.057 0.7257 0.0056 474.68 7.21

PAGE 49

49 Table 2 11. Parameters derived from linear fitting for four different sets of scoring functions. The interaction weights are given along with the lower and upper bounds of the 95% confidence interval. Interaction Type weight 95% confidence interval Interaction Type weight 95% confidence interval low carbon ratio and low molecular weight low carbon ratio, high molecular weight sp3 C sp2 C 0.2365 0.0207 0.4524 sp3 C sp2 C 0.2460 0.0374 0.4546 sp3 C sp2 O 0.2056 0.0706 0.3405 sp3 C sp3 O 0.0989 0.0022 0.1955 sp3 C sp3 N 0.4360 0.0189 0.8531 sp3 C sp2 O 0.1223 0.0012 0.2433 sp3 C sp2 N 0.1343 0.2497 0.0189 sp3 C sp2 N 0.2792 0.5439 0.0144 sp3 C N cation 3.4010 1.1635 5.6385 sp3 C N cation 3.7510 1.1302 6.3719 sp3 C S 1.2208 0.3358 2.1057 sp3 C S 0.5418 0.0459 1.0376 sp2 C sp2 C 0.1228 0.0132 0.2325 sp2 C sp2 O 0.3239 0.0010 0.6468 sp2 C sp3 O 0.0941 0.0025 0.1857 sp2 C sp3 N 0.1276 0.0420 0.2131 sp2 C sp2 O 0.3247 0.0322 0.6172 sp2 C sp2 N 0.4712 0.0206 0.9218 sp2 C N cation 1.5279 2.5542 0.5016 sp2 C N cation 0.9372 1.8463 0.0280 HB O 1.9492 0.0662 3.8322 HB O 0.9482 0.1412 1.7552 HB N 1.1315 0.3392 1.9239 HB O 0.8587 0.0217 1.6957 Surface Area 0.0449 0.0628 0.0271 HB N 2.6710 1.9054 3.4365 Surface Area 0.0229 0.0355 0.0103 high carbon ratio, low molecular weight high carbon ratio, high molecular weight sp3 C sp3 C 0.3559 0.1568 0.5551 sp3 C sp3 C 0.1701 0.0334 0.3067 sp3 C sp2 C 0.2168 0.1056 0.3280 sp3 C sp2 C 0.0803 0.0012 0.1593 sp3 C sp3 O 0.1627 0.0129 0.3125 sp3 C sp3 O 0.1218 0.2328 0.0108 sp3 C sp2 N 0.2194 0.0233 0.4155 sp3 C sp3 N 0.3575 0.0988 0.6162 sp3 C N cation 1.6176 0.1159 3.1193 sp3 C S 1.1676 0.7129 1.6223 sp3 C S 1.7532 0.9358 2.5707 sp2 C sp2 C 0.1480 0.0061 0.2899 sp2 C sp2 C 0.0859 0.0054 0.1664 sp2 C sp3 O 0.6735 0.3280 1.0190 sp2 C sp3 O 0.2253 0.0137 0.4370 sp2 C sp2 O 0.0842 0.0011 0.1673 sp2 C sp3 N 0.2278 0.0097 0.4459 sp2 C sp3 N 0.1699 0.0061 0.3337 sp2 C N cation 1.6420 2.7613 0.5228 sp2 C N cation 0.8892 1.7745 0.0038 sp2 C S 1.2274 0.4191 2.0357 sp2 C S 0.6735 0.0036 1.3433 HB O 1.3705 0.3657 2.3753 HB O 0.6379 0.0365 1.2393

PAGE 50

50 Table 2 11 Continued Interaction Type weight 95% confidence interval Interaction Type weight 95% confidence interval HB N 1.1544 0.0241 2.2847 HB O 3.9650 0.7266 7.2033 Surface Area 0.0469 0.0561 0.0377 HB N 0.6306 0.0688 1.1925 Surface Area 0.0371 0.0440 0.0302

PAGE 51

51 Figure 2 1. A simple description of the hydrogen bonding model Figure 2 2. four examples of protein ligand complexes with Zn N ligand chelation. Fragments are shown as ball and stick models

PAGE 52

52 Figure 2 3. Plots of p K d (y axis) vs. logP (logP values are derived using xlogP program 40 ), p K d (y axis) vs. molecular mass, and binding energy vs. the distance (x axis) between the ligand nitrogen and Zn for the 38 complexes containing Zn chelation Figure 2 4. VDW potential well depth ( AB ) by LISA vs VDW potential well depth ( AB ) by QM 0 0.05 0.1 0.15 0.2 0.25 0.3 LISA P. A. Kollman et al.

PAGE 53

53 Figure 2 5. LISA calculated p K i / p K d vs the experimental p K i / p K d for the PDBbind v2010 test set of 1399 protein ligand complexes.

PAGE 54

54 Figure 2 6. With the test set built by Wang 41 binding affinity comparison was done for LISA and some well known scoring functions, ITScore/SE, 20 ITScore, 19 X Score, 5 DFIRE, 42 DrugScoreCSD, 18 DrugScorePDB, 15 Cerius2/PLP, 43,44 SYBYL/G Score, 45 SYBYL/D Score, 46 SYBYL/ChemScore, 47 Cerius2/PMF, 12 DOCK/FF, 46 Cerius2/LUDI, 31 ,48 Cerius2/LigScore, 49 SYBYL/F Score, 50 AutoDock 51 Figure 2 7 Plot of LISA+ (le ft) and the original LISA (right) calculated p K d or p K i vales vs. Experimental p K d or p K i values.

PAGE 55

55 CHAPTER 3 KECSA S CORING FUNCTION Methods and Results The concept of the potential of mean force can be illustrated by a simple fluid system of N particles whose positions are r 1 r N The average potential (n) ( r 1 N ) is expressed as: (3 1) where g ( n ) is called a correlation function. = 1 /k B T and k B is the Boltzmann constant and T is the system temperature. Hence the mean potential of the system with N particles is strictly the potential that gives the average force ove r all the configurations of the n + 1 ... N particles acting on a particle at any fixed configuration keeping the 1...n particles fixed. The mean potential can be described as follows: (3 2) where U is the total potential energy of the system. Described by Sippl and others, 1 5 the average potential is expressed as Equation 3 3 for the special case of a system with an observed particle number of n =2, as is the case herein (pairwise atoms from the prot ein and ligand). (3 3) w here g (2) (r) is the pair distribution function ij (r) is the number density for the atom pairs of types i and j observed in the known protein structures and ij (r) is the number density of the corresponding pair in a reference state. In order to obtain the pure interaction potential between atoms, a reference state is required to remove the

PAGE 56

56 contribution of the ideal gas state potential So, in the reference state, the syst em of particles is like an ideal gas state defined by fundamental statistical mechanics, in which particles would be evenly distributed in the binding site. Equation 3 3 can also be expressed as : (3 4) where n ij (r) and n ij (r) are number s of atom pairs of type i and j respectively, at distance r for the observed structures and the reference state. In potential of mean force methods, the number of the corresponding pairs in the reference state cannot be exactly obtained for protein ligand system s due to the effects of connectivity, excluded volume, composition, etc 6 Therefore, the pairwise interaction potential cannot be accurately calculated. Nonetheless, this idea of PMF scoring has advantages over empirical scoring, becau se it directly relates pairwise interaction to structur al data instead of fitting to known binding affinity data. Additionally, the PMF is more efficient than force field scoring due to the avoidance of higher expense computations. Our intent is to introdu ce a new concept of the reference state, in order to relate the statistical potential to atomic pairwise interaction potential. Hence the atomic pairwise interaction model can be parameterized exclusively from structural data instead of binding data or qua ntum calculations. Construction of traditional statistical potentials starts by collecting structural information from large numbers of protein ligand complexes, in order to simulate a "mean force" state in which the protein ligand atomic pairwise radial d istribution arises from all possible interactions in the binding site. Various reference states have been designed to remove the non interacting energy from the "mean force" state in order to

PAGE 57

57 correlate the pairwise radial distribution to the interaction po tential between selected atoms of a specific atom type i j with all other atoms in the protein ligand binding site Our goal is to equate the statistical potential to the Lennard Jones potential for each pairwise interaction. However, the LJ potential reflects pairwise interactions between two types of atoms, while a statistical potential is an average potential contributed by all atoms within the binding region. In this case, when trying to equate the statistical potential to a pairwise interaction pot ential, we need to remove all interactions except the pairwise interaction between atoms of type i and j in the binding region by defining a new reference state (denominator) in the PMF model. Unlike the traditional reference state, in which the selected a tom pairs i and j are at an infinite separation where the interaction energy is zero (as in the ideal gas state), within this new reference state (which we will call reference state II), a system of particles is under an average force contributed by all at oms in the binding region excluding the interaction force between the selected atom pairs i and j In other words, the only difference between the mean force state and the reference state II is that the latter state does not contain the pairwise interactio n potential between the selected atoms of type i and j Figure 3 1 provides a graphic illustration of the KECSA statistical potential model When equated to the LJ potential, the statistical potential can be expressed as: (3 5) where is the distance at which the inter particle potential is zero and is the well depth. The exponents for the repulsive term and attractive term are and

PAGE 58

58 alues ( i.e. 12 6), because (1) the repulsive and attractive forces change with different types of pairwise interactions and (2) E ij (r) in Equation 3 5 includes both van der Waals and electrostatic interactions, which means the LJ potential formula on the right hand side of Equation 3 5 accounts for two components : (3 6 ) The reason we use the LJ formula on the left hand side of Equation 3 6 instead of partitioning them into van der Waals and electrostatic potentials is that the LJ potential reaches 0 at and R while reaching its minimum value when r is Based on these properties, equations can be derived in order to determine the unknown parameters. In Equation 3 5, and are the number of protein ligand atomic pairwise interactions in the bin ( r r + ), with the volume 4 a in reference state II and in the training set that mimics the mean force state, respectively. is defined as 0.005. We introduce a to be determined parameter a for the shell volume because of the inaccessible volume present in protein ligand systems, and because of the devia tion of in the training set from the "perfect" pairwise number. Hence, the expectation is the parameter a will adopt values other than 2 The central issue in the KECSA model construction is to build up the radial distribution fu nction of the selected atom pairs in reference state II. A way in which to do

PAGE 59

59 this is to measure the similarity of the reference state II with two known states: the mean force state, and ideal gas state. Then build the radial distribution function with inf ormation collected from these two states. In reference state II, the radial distribution of a certain atom pair i and j is associated with a certain "background interaction" which is related to the total number of selected atom pairs N ij Because the "background interaction" potential contains all atom pairwise interactions in the binding site excluding the "selected interactions" between atom pairs i and j the difference in energy between the mean force state and reference state II for ea ch atom pair type depends on the total number of the selected atom pairs N ij and the total number of atom pairwise interactions N found in the binding site. The "background interaction" energy approaches the "mean force" state energy as becomes smaller, while the energy difference increases when becomes larger. Hence we decide to make the "background interaction" potential, as well as the radial distribution function, a function of The modeling of the atom pairwise number distribution function in reference state II starts from the two extreme situations for : (1) When N ij approaches zero ( N ij resulting in (2) When N ij approaches to N ( N ij N gas energy, resulting in resembling an ideal gas state radial distribution. The radial distribution function for an ideal gas state is defined as implying that the number of the selected atom pairs i and j is evenly distributed in the binding site which has an average

PAGE 60

60 volume V The average V of protein l igand binding site is given as with the same to be determined parameter a as introduced above. Hence, the two extreme situations for reference state II can be defined as : (3 7) (3 8) At certain distance r is a function of with a range from to In addition, with N ij tending towards 0 or N the reference state II would be more similar to the mean force state or the ideal gas state, respectively. Hence, is defined as a weighted combination of both the ideal gas state and the mean force state radial distribution functions. Due to the fact that the integral of from 0 to R (cutoff distance where the atomic interaction is regarded as zero) is N ij a linear combination (Equation 3 9) of the weighted radial distribution functions for both the ideal gas and mean force state meets all the necessary conditions : (3 9) In this way the new reference state is designed as state intermediate between the ideal gas and the mean force state. At a certain distance between the atom pairs of type i and j the total energy of reference state II ( ) or the "background interaction" energy is : (3 10)

PAGE 61

61 A plot illustrating the relationship between the number fraction and the new reference state potential is shown in Figure 3 2, to better illustrate the differences in en ergy between the different states. Combining Equations 3 5 and 3 9 we obtain: (3 11) Using V = and using the fact that the Lennard Jones potential is zero at r ij = and at r ij = R we arrive at: and (3 12) (3 13) In addition, the LJ potential reaches its minimum value when r is So the first derivative (D) of the statistical energy term with respect to r is zero at (3 14)

PAGE 62

62 To simplify the resultant expressions the factor is given as (3 15) Simplifying Equations 3 12, 3 13 and 3 15 yields: (3 16) and (3 17) (3 18) Although we don't know the values of and yet, we do know that the value of is unique for eac h combination of and Appendix Table A 1 lists all values for each integer combination of and from 2 1 to 15 14. Different values will be chosen for every pairwise interaction, to satisfy the well depth distance at ( e.g. is 2 1/6 or r ij (well depth at the minimum) for the 12 6 potential) In order to find the a and values within Equations 3 16 to 3 18, we still need to determine the value of R the cutoff distance. We introduce a nonlinear programming method to find a reasonable R for each pairwise interaction type instead of assigning a fixed R value. Ideally, R should be as large as possible since the LJ potential approaches 0 when the distan ce approaches infinity. Meanwhile, for any r between and R the potential value is below 0. Here we use the following inequality constraint in our nonlinear programming approach :

PAGE 63

63 (3 19) which can be simplified as: (3 20) With the goal of maximizing the value of R coupled with the three constraint equations (Equations 3 16 to 3 18) and an inequality constraint (Equation 3 20), a and can be determined. The values of obtained in this wa y can then be compared with the values in Appendix Table A 1 in order to determine the closest and pair. Inserting these values into Equation 3 11 we can calculate all of the corresponding values One important issue in the parameterization of KECSA is that the LJ potential parameters for each type of pairwise interaction should be independent of the other types of interactions, instead of being dependent. Derivation of a , and R comes from Equations 3 16 to 3 1 8 and 3 20, none of which contains the total interaction number N This indicates that for each type of pairwise interaction, the average volume ( ), the distances at which the LJ potential reaches zero and has a minimum ( and ), the relative strength of the repulsive and attractive forces in the LJ potential ( and ), and the long range cutoff distance ( R ) are independently derived in KECSA. The only issue lies in the derivation of the values from Equation 3 11, where the probability of occurrence ( ) is included in the

PAGE 64

64 calculation. In order to avoid relative energies generated for each interaction type based on their probability of occurrence in protein ligand binding sites we used a normalized for each interaction type. Thereby the number fractions for all interaction types are identical. In the present work, all pairwise interactions among 18 atom types (listed in Table 3 1) were examined result ing in 49 significant interaction types being identified. The remaining interaction types were abandoned or merged into similar interaction types due of the paucity of data to fit to or because they are randomly distributed across the observed distance ran ge. The chosen interaction types included 38 van der Waals and 11 hydrogen bonding interaction types. In this case, all interaction types share the same probability of occurrence ( ) in protein ligand binding sites. Equation 3 11 can be rewritten as follows in order to generate the values. All derived parameters are listed in Table 3 2 (3 21) With all of the enthalpy terms determined in the analytical manner described above, the entropy terms are then decided upon in an empirical manner. Structural information such as the number of rotatable bonds, number of double and aromatic bonds, molecular mass, counts of carbon/oxygen/nitrogen atoms, buried surface area, etc were collected f or all ligands contained in the training set. The selection of entropy

PAGE 65

65 terms is based on their contribution to our linear regression model, whose 95% confidence interval should not include 0. Finally, 9 entropy terms are selected: number of rotatable bonds in the ligand, the molecular mass of the ligand, number of aromatic bonds in the ligand, number of oxygen atoms in the ligand, number of nitrogen atoms in the ligand, the nonpolar buried surface area, total buried surface area, the ratio of the nonpolar buried s urface and total ligand surface area and, finally, the ratio of the total buried surface area and the total ligand surface area. The PDBbind v2010 data set 21,22 including 5054 protein ligand complexes is chosen as the training set for the parameterization of the enthalpy terms. We chose 1982 protein ligand complexes found in the PDBbind v2011 refined data set as the training set for the selection and parameterization of the entropy terms Model Validation and Discussion Validation for Avoidance of Over Fitt ing and Ligand Size Dependence. Several validation benchmarks were introduced to test the performance of KECSA. The first benchmark included two test sets in order to examine the dependence of KECSA on the training set. Because the entropy term was obtained by fitting to experimental binding free energies care was taken to ensure that the resultant model was not over fit. First, a leave one out cross validation was used against the training set, which includes 1982 protei n ligand complexes used in the KECSA entropy term parameterization. Comparison of the Pearson correlation coefficient r RMSE (root mean square error) and Kendall between the training set and the leave one out cross validation are shown in Table 3 4. The three statistical measures all showed small differences between the training set and the leave one out prediction, indicating that the KECSA entropy model was properly built

PAGE 66

66 Second, we introduced a test set including 1934 protein ligand complexes chosen from the PDBbind v2011 dataset, with no overlap with the training set used above. From the total of 6051 protein ligand complexes with binding affinity data in the PDBbind v2011 dataset, complexes for this test set were selected following four criteria: (1 ) Crystal structures of all selected complexes had X ray resolutions 2.5 (2) Only complexes with p K i or p K d values distributed between 2 to 8 were selected, mimicking what might be found in a virtual screening database or a pharmaceutically relevant li gand database. (3) Only complexes with molecular weights (MWs) distributed from 80 to 800 were selected, to avoid ligand size dependent prediction results. (4) Complexes used in the KECSA entropy term training set were excluded The second test set was use d to verify the robustness of KECSA against an external dataset as well as investigating the contributions of the enthalpy and entropy KECSA was modeled as a linear comb ination of several ligand properties including ligand size/mass information this test demonstrated that KECSA was not ligand size dependent. We split the KECSA scoring function and used both LJ potential and entropy terms for binding affinity prediction an d then compared with the full KECSA scoring function. We also calculated the correlation coefficient between the experimental p K i or p K d with ligand MW. The predictions and ranking results are listed in Table 3 5 KECSA produces a Pearson's r of 0.590 and a Kendall of 0.404. Comparison to the training set result (a Pearson's r of 0.610 and a Kendall of 0.442) indicates that KECSA gives robust predictions against an unknown binding affinity dataset. When just

PAGE 67

67 using the ligand MW for binding affinity pred iction, the Pearson's r drops by 0.128 and the Kendall drops by 0.08 when compared to the LJ potential only prediction, while Pearson's r drops by 0.140 and the Kendall drops by 0.08 when compared to the entropy term prediction. The drop in prediction is even greater when compared to the full KECSA score function. This result suggests that the KECSA prediction is minimally affected by MW considerations. The LJ potential and entropy only predictions are quite similar, but lower than the full KECSA predic tion (see Table 3 5). None of the two independent parts, enthalpy or entropy, shows a significant performance over the other, suggesting that both enthalpy and entropy play important roles in the KECSA scoring function. RMSE is not listed for comparison be cause the LJ potential is generated from a statistical potential while entropy is derived by fitting to p K d or p K i values, which results in different scales making comparison difficult Validation Benchmark for Comparison with LISA, LISA+ and Other Scoring Methods KECSA is the second generation scoring function we have developed after LISA 65 and LISA+. 83 The latter two were developed for the fast calculation of protein ligand binding affinity (p K d and p K i ), and were successful in the SAMPL3 challenge (first rank for all scoring methods), which was a blind test for docking and scoring methods. 24 Comparisons between KECSA and other scoring methods including LISA and LISA+ are necessary for further understanding its performance. The second validation benchmark consists of two large scale test sets and four smaller test sets for set 25 ) with 100 diverse protein ligand complexes for comparison of KECSA and several other well know n scoring methods

PAGE 68

68 First, two large scale test sets both with more than 1000 complexes were introduced to KECSA, LISA and LISA+. The first test set contained 1399 complexes from the PDBbind v2010 database which was previously used for LISA validation. KEC SA reproduces a Pearson correlation coefficient r of 0.55 3, an RMSE of 2.46 kcal/mol and a Kendall of 0.401 while LISA reproduces a Pearson correlation coefficient r of 0.5 34, an RMSE of 2. 65 kcal/mol and a Kendall of 0.378 LISA+ was trained based on this data set, so it was excluded from this validation benchmark. A larger test set was applied for all three scoring functions, including 2456 protein ligand complexes from the PDBbind v2011 refined data set, of out which 290 complexes had Zn ligand bind ing. For those 2166 non metal containing complexes, KECSA gets an r of 0.589, an RMSE of 2.31 kcal/mol and a Kendall of 0.429 while LISA gets an r of 0.542, an RMSE of 3.06 kcal/mol and a Kendall of 0.39 7, LISA+ yields an r of 0.572, an RMSE of 2.81 k cal/mol and a Kendall of 0.41 9. For those complexes with Zn ligand binding, KECSA has an r of 0.415, an RMSE of 2.33 kcal/mol and a Kendall of 0.26 7, LISA has an r of 0.409, an RMSE of 3.08 kcal/mol and a Kendall of 0.252 and LISA+ has an r of 0.420, an RMSE of 3.00 kcal/mol and a Kendall of 0.257. Calculation and statistical results of KECSA, LISA+ and LISA for all large scale validation studies are shown in Table 3 6 and Figure 3 3. In the large scale test, KECSA yields a better prediction t han our two first generation scoring functions. It produces better predicted results based on RMSE and more reliable binding affinity ranking based on Kendall compared with the other two scoring methods. LISA+ can compete with KECSA based on correlation coefficient, and even achieves better r in the subset of complexes with Zn ligand binding. We believe

PAGE 69

69 that the improvement of LISA+ compared with LISA is because the complexes in the training set are categorized based on ligand properties (mass and hydroph obicity ) and different models are trained for each category. This proves that a multi model scheme can improve the predictive ability of empirical scoring functions. Our validation studies indicate an improvement from LISA to KECSA. Introducing PMF theory for non bonding interaction modeling shows its advantages over simply fitting to binding affinity data. However, metal ligand binding prediction remains a challenge for classical mechanics based or statistical scoring methods. KECSA improves the binding affinity predicting ability mostly in RMSE for this subgroup of complexes. However, although KECSA shows improvement in both correlation coefficient r and RMSE, predictions in the low and hi gh binding regions are poor. Seen from the linear regression functions in Fig ure 3 3 the slopes of LISA and LISA+ generated data vs. experimental data are 0.66 and 0.64, while that for KECSA is 0.41; the intercepts for LISA and LISA+ are 2.02 and 2.06, whi le that for KECSA result is 3.72. The reason is that 3314 complexes, which comprised 65.3% of the whole training set, are in the mid binding region (p K d or p K i between 4 and 8). Hence the scoring function tends to overestimate binding affinity of the low b inding region while underestimating that of the high binding region if there is no significant decrease or increase in contact number for the protein ligand complexes from these two regions We used a scaling procedure to help improve the prediction of bin ding affinity in the high and low binding regions. The KECSA generated results were fit to a linear model to reproduce the PDBbind v2011 refined data set p K d values. So we g e t : (3 22)

PAGE 70

70 Next, four test sets containing 427 pr otein ligand complexes from four protein families was examined. The list of complexes, their families and binding constants are given in Appendix Table A 2 Statistical and calculation results are shown in Table 3 7 and Figure 3 4. KECSA improves the RMSE for all test sets, indicating that KECSA makes a more robust prediction with respect to experimental binding affinity data than do LISA and LISA+. However, LISA and LISA+ both give a better correlation coefficient and Kendall for the serine protease e ndothiapepsin and HIV 1 protease test sets, showing that they perform better in binding affinity ranking in these small test sets. Results for each test set were carefully examined. The s erine protease test set contains many complexes with low mass ligands while most ligands are relatively larger in the e ndothiapepsin test set. Statistical results of the three scoring methods' predictions are similar in these two test sets: KECSA generates a better RMSE, while LISA and LISA+ yielded both a better correlati on coefficient and Kendall In the s erine protease test set, 11 out of 96 protein ligand complexes had small ligands with molecular mass lower than 200 Daltons KECSA prediction overestimates most of their p K d or p K i values ( Appendix Table A 2 ), while LI SA and LISA+ to some degree underestimates these values. Hence, KECSA decreases the binding affinity differences between these low binding complexes and other high binding complexes and LISA/LISA+ increases these differences. The implication is that LISA a nd especially LISA+ differentiate the binding affinity values of the complexes better in this test set and give higher r and values, but they have worse RMSEs. In the e ndothiapepsin test set, on the other hand, all complexes have ligands with molecular m ass higher than 500 Daltons. LISA and LISA+ overestimated binding affinities with larger errors compared with KECSA, while

PAGE 71

71 distinguishing the complexes better, generating a better linear correlation towards the experimental data. These test results suggest that LISA and LISA+ are more sensitive to ligand mass changes, while KECSA makes more precise prediction with smaller error, but has difficulty in ranking complexes with similar p K d s or p K i s. The HIV 1 protease test set is a significant challenge for all three scoring methods. Most of the complexes have high mass ligands and high binding affinity. Because the ligands in this test set have similar structures binding affinity predictions from all thr ee scoring methods are not able to rank these complexes. LISA+ does better in correlation coefficient and ranking than others, which indicates that training scoring functions for different complex categories based on ligand or binding pocket properties may help scoring methods improve their ability to identify subtle changes in ligand structure. The carbonic anhydrase II test set includes 100 out of 110 complexes with Zn chelation, and contains more polar and charged interactions. KECSA demonstrates better performance in the correlation coefficient, RMSE and Kendall pointing to its advantage over LISA and LISA+ on reproducing binding affinity data of complexes with more hydrophilic interactions and metal chelation. For the reproduction of the binding affi nity for the four combined test sets all three scoring methods give a similar correlation coefficient r while LISA+ has a small advantage. KECSA does better in both RMSE and Kendall For all four test sets, scaling of the KECSA score does help to improv e the slope of calculated data vs. experimental data, but does not improve the statistical tests. Overall, KECSA g i ve better RMSE values and better binding affinity ranking of the complexes belonging to different protein families.

PAGE 72

72 For the last test set, we 25 with 100 diverse protein ligand complexes. The purpose was to compare binding affinity prediction ability of KECSA not only with LISA and LISA+, but also with other well known scoring functions. We obtain Pearson's r = 0.69, R MSE = 2.25 kcal/mol using KECSA, compared with Pearson's r = 0.72, RMSE = 2.32 kcal/mol using LISA, Pearson's r = 0.67, RMSE = 2.80 kcal/mol using LISA+. This result coincides with the conclusion gained from the second validation benchmark, that among thes e three scoring functions, KECSA prediction has the smallest RMSE Pearson correlation coefficients of KECSA together with other score functions are presented in Figure 3 5, showing its performance on this test set.

PAGE 73

73 Table 3 1 List of the chose n atom types. Atom Type Description C3 sp 3 hybridized carbon C2 sp 2 hybridized carbon Car aromatic carbon C1 sp hybridized carbon N4 positively charged nitrogen Nam amide nitrogen N3 sp3 hybridized nitrogen Nar nitrogen aromatic N2 sp2 hybridized nitrogen Npl3 trigonal planar nitrogen O3 sp 3 hybridized oxygen O2 sp 2 hybridized oxygen S sulfur P phosphorus F fluorine Cl chlorine Br bromine I iodine

PAGE 74

74 Table 3 2 Parameters for all 49 pairwise potentials. interaction type C2C2 C2Car C2N2 C2N3 C2N4 C2Nam C2Nar C2Npl3 C2O2 C2O3 4.145 3.630 3.450 3.285 3.215 3.505 3.575 3.505 3.370 3.135 a 3.375 2.224 3.085 2.810 3.089 4.296 2.662 2.273 3.298 2.992 R 5.900 6.535 4.755 4.220 4.235 4.265 5.390 6.485 4.430 5.345 0.091 0.041 0.388 0.035 0.133 1.003 0.296 0.769 1.735 0.071 LJ model 12 5 11 1 10 9 12 8 14 12 15 6 12 11 15 14 12 11 13 4 interaction type C2S C3C2 C3C3 C3Car C3N2 C3N3 C3N4 C3Nam C3Nar C3Npl3 4.350 3.940 4.290 3.850 3.580 3.650 4.570 4.470 3.455 3.815 a 2.505 3.049 2.759 2.237 2.404 1.759 2.988 3.581 2.990 2.347 R 6.425 6.210 6.840 6.775 6.130 6.945 6.850 6.165 5.435 6.160 0.387 0.085 0.364 0.454 0.053 0.123 0.022 0.071 0.067 0.129 LJ model 12 11 14 3 5 4 5 3 15 9 13 12 12 7 12 7 12 9 4 3 interaction type C3O2 C3O3 C3S CarCar CarN2 CarN3 CarN4 CarNam CarNar CarNpl3 3.200 3.325 3.940 3.700 3.600 3.700 4.360 3.720 3.565 3.665 a 2.742 3.164 1.965 1.898 2.079 2.032 1.089 3.655 1.389 1.736 R 4.515 5.650 6.630 6.855 6.440 6.845 6.980 6.030 6.865 6.675 0.343 0.038 0.016 0.249 0.013 0.005 0.056 0.279 0.206 0.016 LJ model 9 6 13 7 14 1 4 3 11 1 8 1 9 5 14 13 15 14 15 6 interaction type CarO2 CarO3 CarS N2O2HB N2O3HB N3O2HB N3O3HB NamO2HB NamO3HB Npl3O2HB 3.430 3.690 3.920 2.640 2.670 2.550 2.605 2.610 2.625 2.585 a 2.840 2.204 1.627 2.056 2.365 0.989 1.788 2.057 3.475 1.377 R 6.600 6.505 6.975 6.420 6.465 6.745 4.585 4.765 4.160 4.995 0.120 0.030 0.050 0.062 0.036 0.196 0.217 1.700 0.172 0.219 LJ model 12 10 6 2 9 5 15 8 15 5 14 10 13 9 12 10 11 8 13 8

PAGE 75

75 Table 3 2 Continued interaction type Npl3O3HB O2N2 O2Nam O2Nar O2O2 O3N2HB O3O2HB O3O2 O3O3HB 2.635 2.570 4.125 3.380 3.065 2.510 2.445 3.365 2.080 a 1.899 2.397 2.784 2.292 2.767 1.345 1.998 3.250 2.408 R 6.755 6.845 6.065 6.070 6.055 4.395 6.065 6.480 6.990 0.272 0.010 0.008 0.073 0.034 0.116 2.002 0.024 0.038 LJ model 15 12 7 1 13 3 11 7 4 1 15 8 14 13 3 2 11 3

PAGE 76

76 Table 3 3 Entropy parameters and their 95% confidence intervals parameter 95% confidence interval enthalpy 0.0928 0.0650 0.1206 number of rotatable bond s 0.0900 0.0601 0.1200 molecular mass 0.0170 0.0191 0.0149 N_num ber 0.2455 0.1838 0.3072 O_num ber 0.3131 0.2528 0.3733 Number of aromatic bond s 0.0359 0.0130 0.0588 nonpolar buried surface area 0.0152 0.0047 0.0257 total buried surface area 0.0089 0.0167 0.0012 nonpolar buried surface area /total surface area 4.1454 6.9496 1.3412 total buried surface area /total surface area 6.2438 8.1509 4.3368 Table 3 4. Leave one out cross validation of KECSA. Pearson's r RMSE(kcal/mol) Training 0.601 2.20 0.442 Leave One Out Calculation 0.594 2.22 0.437 Table 3 5. Statistical results for KECSA, KECSA LJ, KECSA entropy and Ligand MW correlated with experimental binding affinity. Pearson's r Kendall KECSA Scoring Function 0.590 0.404 LJ Potentials in KECSA 0.509 0.352 Entropy in KECSA 0.521 0.349 Ligand Molecular Weight 0.381 0.272

PAGE 77

77 Table 3 6. Statistical results of KECSA, LISA+ and LISA from large scale validation studies. KECSA Test set 1 a Test set 2 b Test set 3 c Correlation Coefficient 0.553 0.589 0.415 RMSE (kcal/mol) 2.46 2.31 2.33 Kendall 0.401 0.429 0.267 LISA+ Correlation Coefficient 0.572 0.420 RMSE (kcal/mol) 2.81 3.00 Kendall 0.419 0.257 LISA Correlation Coefficient 0.534 0.542 0.409 RMSE (kcal/mol) 2.65 3.06 3.08 Kendall 0.378 0.397 0.252 a. Test set 1 contains 1399 complexes from PDBbind v2010 database. b. Test set 2 contains 2166 non metal complexes from PDBbind v2011 refined data set. c. Test set 3 contains 290 metalloprotein ligand complexes from PDBbind v2011 refined data set.

PAGE 78

78 Table 3 7. Statistical results of KECSA, scaled KECSA, LISA+ and LISA from four small test sets. KECSA Serine protease Endothiapepsin HIV 1 protease Carbonic anhydrase II All Correlation Coefficient 0.568 0.441 0.244 0.495 0.591 RMSE (kcal/mol) 0.894 1.700 1.678 1.405 1.466 Kendall 0.423 0.202 0.139 0.246 0.420 Scaled KECSA Serine protease Endothiapepsin HIV 1 protease Carbonic anhydrase II All Correlation Coefficient 0.568 0.547 0.245 0.487 0.607 RMSE (kcal/mol) 1.299 1.987 1.677 1.478 1.562 Kendall 0.421 0.067 0.123 0.228 0.390 LISA+ Serine protease Endothiapepsin HIV 1 protease Carbonic anhydrase II All Correlation Coefficient 0.692 0.484 0.334 0.427 0.603 RMSE 0.970 2.171 1.781 1.814 1.661 Kendall 0.508 0.353 0.195 0.147 0.407 LISA Serine protease Endothiapepsin HIV 1 protease Carbonic anhydrase II All Correlation Coefficient 0.645 0.496 0.256 0.492 0.586 RMSE 1.131 2.880 2.089 1.821 1.884 Kendall 0.479 0.269 0.118 0.228 0.392

PAGE 79

79 Figure 3 1. A protein ligand structural illustration (using PDBID 1xbc) of how the KECSA statistical potential is modeled. The protein binding site is shown as a grey surface with the ligand located within the binding site surrounded by protein residues which it make s contacts with. The pink dashed lines indicate interactions between certain atom pair types i and j ( i.e. carbonyl oxygen with amine nitrogens in this example) which are defined as "selected interactions" in this manuscript. Green dashed lines indicate all other non covalent interactions between the protein and ligand atoms in the binding pocket, defined as background interactions". ( A ) In the mean force state, the system is filled with all types of interactions. ( B ) The reference state II contains all the background interactions. ( C ) Removing all the background interactions from total interactions results in a state with only the selected interactions for each i and j combination.

PAGE 80

80 Figure 3 2. Ratio of the observed atom pairs to the total interacting atom pairs vs the new reference state II potential.

PAGE 81

81 Figure 3 3. Plot of KECSA, LISA+ and LISA calculated p K d or p K i vs. Experimental p K d or p K i in obtained in validation studies.

PAGE 82

82 Figure 3 4. Plot of KECSA, scaled KECSA, LISA+ and LISA calculated p K d or p K i vs. Experimental p K d or p K i in four small test sets.

PAGE 83

83 Figure 3 5 With the test set built by Wang 25 binding affinity comparison was done for KECSA, LISA, LISA+ and several other well known scoring functions, ITScore/SE, 18 ITScore, 17 X Score, 26 DFIRE, 27 DrugScoreCSD, 12 DrugScorePDB, 11 Cerius2/PLP, 28,29 SYBYL/G Score, 30 SYBYL/D Score, 31 SYBYL/ChemScore, 32 Cerius2/PMF, 8 DOCK/FF, 31 Cerius2/LUDI, 33,34 Cerius2/LigScore, 35 SYBYL/F Score, 36 AutoDock 37

PAGE 84

84 CHAPTER 4 T HE MOVABLE TYPE METHOD APPLIED TO PROTEIN LIGAND BINDING Methodology The Movable Type Method Applied to Protein Ligand Binding A thermodynamic cycle modeling the binding free energy b s in solution (shown in Figure 4 1) is typically employed in end point methods: (4 1) where P and L indicate the protein and ligand, s and g represent the behavior in solution and the gas phase, respectively, solv is the solvation free energy, and b is the binding free energy in gas ( g ) and solution ( s ), respectively Using Equation 4 1 becomes : (4 2) The binding free energy in solution is now separated in to two terms: The binding free energy in the gas phase and the change in the solvation free energy during the complexation process At this point we introduce the moveable type algorithm to model both terms each with its own design The binding free energy in the gas phase is the most important term to evaluate in order to predict the protein ligand binding affini ty because it contains all interactions between the protein and ligand. When approximated as the Helmholtz free energy (NVT) the Gibbs (we use the canonical ensemble throughout, but will predominantly use the G notation) binding free energy in the gas ph ase can be generated using the ratio of the partition funct ions describing the protein ligand complex the protein, and the ligand.

PAGE 85

85 (4 3) where Z represent s the canonical ensemble partition function, and is the reciprocal of the thermodynamic temperature k B T Partition functions are integrals over all possible coordinates of the protein ligand complex the protein, and the ligand. Equation 4 3 can be manipulated into the following form: (4 4) where the partition functions are expressed as the Boltzmann weighted average of the pose energies (shown in brackets) multiplied the volume of configuration space available to each state shown as F in Equation 4 4. is approxima ted as the product of the external degrees of freedom ( DoFs ) of the bound protein and ligand ( including the rotational and transla tional DoFs), and the internal DoFs of the bound protein and ligand ( including the relative position al and vibration al DoFs), given as (4 5) Similarly, the DoFs of the free protein and ligand molecules are also separate d into the external and internal components. Internal DoFs are identical for bound and free protein/ligand structures and the b oun d and free protein s are also assumed to share the same internal and external DoFs. Only the external DoFs of the ligand are differentiated between the bound and free systems. The r otational DoF of a free ligand 2 on a normalized unit sphere. However, because of the inaccessible volume present in protein ligand systems the r otational DoFs of bound ligands are designated as a 2 with a to be determined average volume factor a less than 8. The transla tional

PAGE 86

86 DoFs are treated as a constant C which is assumed to be identical for all free ligands, while the transla tional DoF for bound ligands is the volume of the binding pocket V pocket in which the ligands' center of mass can translate Thereby, in the protein ligand binding process, changes in the DoFs can be modeled as a constant with respect to the different volume s of the binding pocket s. Applying these approximations we obtain. (4 6) The gas phase protein ligand binding free energy can then be further manipulated into the following form: (4 7) Again using the Helmholtz free energy approximation (Equation 4 3), the solvation free energy can be correlated to the partition function of the solute (protein, ligand or protein ligand complex) and solute solvent bul k interactions In this way, the solvation free energy, using as an example, is modeled as in Equation 4 8, and the DoFs are approximated as being the same for the solute and the solute solvent bulk terms. (4 8) Finally, the remaining solvation terms given in Equation 4 1 ( and ) can be modeled in an analogous manner yielding the change in the solvation free energy as ligand bindi ng occurs which then can be used to evaluate the overall free energy of ligand binding in aqueous solution.

PAGE 87

87 Construction of the Movable Type System: Atom Pairwise Interaction Energy and Probability Databases With pose energies sampled over all possible DoF s for the bound and free protein/ligand system, the gas phase protein ligand binding free energy can be generated using molecular dynamics, Monte Carlo, genetic algorithms, etc by sampling over a large number of poses of the protein, ligand and protein li gand complex. Using the canonical ensemble the Helmho l tz free energy can be obtained as the arithmetic mean (sum of the energies of all ligand poses divided by the total number of all poses along with an estimate of integration volume ) of Boltzmann factors: (4 9) However, the problem of pose based energy sampling lies in the fact that pose selection and sample size significantly affect the final result, not to mention that calculating many unique poses is very time consuming. Different ligand poses have different e nergy preferences for the binding site, which leads to a range of binding probabilities When calculating the averaged partition functions in E quation 4 7, one can assign probabilities ( Q ) as weights to different Boltzmann factors in order to differentiate the binding pocket preferences against ligand poses, rather than just simply using an arithmetic mean of all Boltzmann factors (4 10) (4 11)

PAGE 88

88 The challenge in deriving the canonical partition function (as the denominator in E quation 4 10) for a protein ligand system is that it is difficult to include all relevant ligand pose energies within the binding pocket using brute force sampling schemes. However, the task becomes much easier when a protein ligand system is reduced to the atom pair level. In this way the "p ose" sampling problem can then can be cast as a 1 D rather than a 3 D problem by deriving the canonical partition function as a sum of the Boltzmann factor products of all atom pairwise energies included in the system over all atom pairwise separation distance ranges. (4 12) The canonical partition function can be derived following E quation 4 12, where the index i refers t sampling scheme. When the protein ligand system is broken down to the atom pair level, q indicates all atom pairs in the molecular system, and p indicates each possible combination of all a tom pairs each of which is at a pre chosen distance. a b c and d refer to each atom pair as a bond, angle, torsion or long range (van der Waals or electrostatic) interaction in the canonical system, respectively, and , and refers to each sampled separation distance between the corresponding atom pair.

PAGE 89

89 Probabilities of all the atom pairwise distributions on the right hand side of E quation 4 12 are normalized as ( ): (4 13) Hence our MT method is designed to decompose the molecular energy into atom pairwise energies, which then simplifies the energy sampling problem to the atom pair level. The advantage of this idea lies in that (1) atom pairs can be categorized based on atom types and interaction types, e.g. bond, angle, torsion, and long range non covalent interactions; (2) Calculation of atom pairwise energies is extremely cheap. Thereby, it is easy to build an atomic pairwise interaction matrix of energy vs. distance for each interaction type a nd atom pair type i j Hence, the energy calculation for each molecule is no more than a combination of elements from di fferent energy matrices. In addition, the MT method is a template by which any pairwise decomposable energy function can be used. In th e current work, the energy for each interaction type between a certain atom type pair i j is calculated using the Knowledge based and Empirical Combined Scoring Algorithm (KECSA) potential function. 87 I n KECSA, the protein ligand statistical potential is modified and equated to an atom pairwise energy in order to generate force field parameters for bond stretching, angle bending, dihedral torsion angles and long range non covalent interactions. Please se e the detailed rationale and justification for KECSA and its parameterization in the relevant literature 87

PAGE 90

90 Along with the distance based energy, each atom pair type also has a distance preference encoded in its distribution, resulting in different probabi lities associated with Boltzmann factors for each sampled atom pairwise distance. Atom pair radial distributions were collected from a protein ligand structure training set ( i.e the PDBbind v2011 data set with 6019 protein ligand structures) 87 93 and uti lized in the current model. The atom pairwise radial distribution function is modeled as: (4 14) where n ij ( r ) is the number of protein ligand pairwise interactions between a certain atom pair type i and j in the bin (r, a from the training set in the denominator mimics the number o f protein ligand atom type pairs i and j ground distribution from the protein ligand system. 0.005. N ij is the total number of atom pairs of type i and j The average volume V of the protein ligand binding site s is given as with the same to be determined parameter a as described above (Equations 4 7 and 4 14). A cutoff distance R is assigned to each atom type pair defining the distance at which the atom pairwise interaction energy can be regarded as zero. Both a and R can be derived using a previously introduced method. 87 The radial distribution frequency is then normalized by dividing the sum of radial distributions of all the atom pairs in the system (Equation 4 15)

PAGE 91

91 (4 15) In this way, th e energy and distribution frequency vs. distance is calculated for any interaction type, and atom pair type, thereby, forming our MT database for later use. Binding Free Energies from the Movable Type Method Based on Equation 4 4, the binding free energy is defined as a ratio of partition functions of the different molecules involved in the binding process i.e., the protein, ligand and the protein ligand complex. Instead of sampling over poses of one molecule, the MT method simplifies the partition functi on of each system into a collection of partition function s ( c ) over each observed atom pair, which are equal to the normalized distribution probability of the atom type pair along the distance ( q ), multiplied by the corresponding atom pairwise partition fu nction ( z ) : (4 16) By combining the partition functions c over all atom pairs in one molecule the partition function of one molecule averaged over all possible conformations is derived (Equation 4 1 7 ). (4 17) where the averaged molecular partition function is given as a sum of atom pairwise partition functions c sampled over distance intervals ( M ) of all combination of N atom pairs at all possible distances Starting from the protein ligand complex database, we constructed the partition function matrices for the MT algorithm. When converted into a partition function matrix,

PAGE 92

92 the atom pairwise energy multiplier sampled as a function of distance is the basic elem ent needed to assemble the total energy, as shown in Equation 4 18, using the protein bond energy as an example. (4 18) where subscript k indicates a bonded atom pair i and j and each distance increment between any r a and r a + 1 is 0.005 Using this scheme the distance sampling size is given by: where r 1 and r n are the lower and upper bounds for distance sampling, which varies depending on the each atom pair and interaction type. The product over all bond linked atom pairs derives the total bond partition function in the protein: (4 1 9) (4 20) In Equations 4 19 and 4 20, m indicates the total number of atom pairs that need to have their bond stretch term computed ( i.e. number of covalent bonds), and n is the distance sampling size. T indicates the transpose. Thus, the matrix has a total of

PAGE 93

93 n m elements, and includes all combinations of the sampled atom pairwise distances and atom pairs (see Equation 4 20). Energy matrices for othe r kinds of atom pairwise interactions are assembled in the same way ( i.e. bond, angle, torsion, and long range interactions). Products over these matrices generate the entire protein partition function matrix (Equation 4 21), representing all possible com binations of the protein internal energies with different atom pairwise distances (4 21) where (4 22) The KECSA van der Waals electrostatic interaction models and hydrogen bond models 87 are applied to the protein, ligand and protein ligand complex systems. Similarly, the ligand energy (Equation 4 23) and protein ligand interaction energy matrices (Equation 4 24) can be obtained. (4 23) (4 24) The d istribution frequency matrix is built in the same way, with the de rived from Equation 4 15 as elements in each multiplier (also using the protein bond term as an example): (4 25) (4 26)

PAGE 94

94 (4 27) w here (4 28) The d istribution frequency matrix for the protein is derived using Equations 4 26 through 4 28, and the d istribution frequen cy matrices of the ligand and protein ligand intermolecular interactions are analogously derived as in Equations 4 29 and 4 30. (4 29) (4 30) We chose the same range and distance increment in both the energy and d istribution frequency calculations, which means that any r x 18 is the same as corresponding r x in Equation 4 25. Thus, the corresponding elements in all energy and d istribution frequency matrices correlate with e ach other. The pointwise product over all matrices ensures that the energies and distribution frequencies with the same range and distance increment are combined into one element in the final matrix of the probability weighted partition function of the pro tein ligand complex ( in Equation 4 31 ). (4 31) In the final matrix each element of is a value of the partition function of the protein ligand complex multiplied by its probability based on its radial distribution forming the ensemble average. Finally, the sum of all elements of the matrix give s us the averaged partition function of t he protein ligand complex:

PAGE 95

95 (4 32) where the first equation is the normalization statement for the probabilities. In this manner, the normalized averaged partition function of the protein ligand complex is derived in Equation 4 32. Following the same procedure, the averaged partition functions for the protein and ligand are generated as well: (4 33) (4 3 4) Expanding the matrices, the protein ligand binding free energy in the gas phase is defined as in Equation 4 35, using the averaged partition functions of all three systems (protein, ligand, protein ligand complex) derived above. (4 35) In Equation 4 35 Q is the radial distribution frequency and E is the energy. i j k are the indices of the protein, ligand and protein ligand complex, while I J K are the total number of protein, ligand and protein ligand complex samples, respectively. , and are standard distribution frequency matrices normalized over al l three systems, in order to satisfy In this

PAGE 96

96 way the protein ligand binding free energy in the gas phase is derived using our MT algorithm. Determination of the change in the solvation energy as a function of the binding process is computed in a similar manner. To illustrate this we describe how we obtain the solvation free energy of the ligand, which is one component of G solv and by extensi on the other terms can be derived. The ligand solvation free energy is obtained by decomposing the l igand solvent bulk energy into the free ligand energy the ligand solvent polar interaction energy an d the ligand solvent non polar interaction energy : (4 36) Solvent was approximated as a shell of even thickness around the ligand, in which the water molecules were evenly distributed. The solvent shell thickness was 6 , and the inner surface of the shell was 1.6 away from the ligand surface, which approximates the radius of a water molecule. Herein, for simplicity, the ligand solvent polar interaction was considered as a surface (solvent accessible polar surface of the ligand) surface (solvent bulk layer surface at a certain distance away from ligand) interaction, instead of a point point interaction, i.e. atom pairwise interaction. Using this model the ligand polar atom solvent interacti on energy was modeled as a solvent accessible buried area (SABA) of the ligand polar atoms multiplied by the polar atom oxygen interaction energy terms taken from KECSA, 87 to simulate the ligand solvent surface interaction energy. All SABA weighted interac tion energies along the solvent shell thickness, with a 0.005 increment were collected and stored. The ligand solvent polar interaction Boltzmann factor multiplier was modeled using Equation 4 37 with k

PAGE 97

97 indicating each polar atom in the ligand, r 1 =1.6 which is the inner layer of the solvent shell and r n =6 +1.6 =7.6 , which is the outer b oundary layer of the solvent shell. (4 37) The ligand solvent polar interaction Boltzmann factor matrix is then derived using Equation 4 38, covering all ligand polar atoms up to m The distribution frequency matrices were not included in ligand solvent energy calculation because the radial distrib ution function is approximated as being identical along all ligand solvent distances ( i.e. a featureless continuum). Figure 4 2 further illustrates the modeling of the ligand solvent polar interaction. (4 38) The non polar a tom buried area (NABA) is used to simulate the interactions between the non polar atoms and aqueous solvent, because the interaction energy between non polar atoms and water molecules has a weaker response to changes in distance. (4 39) The ligand energy is the same as was introduced in the gas phase protein ligand binding free energy calculation. So, the matrix for the ligand solvent interaction energy is: (4 40)

PAGE 98

98 (4 41) The solvation free energy was not fit to experimental solvation free energies and was found to have a small influence of the final binding free energies for the protein ligand complexes. Nonetheless, future work will fit these models to small molecule solvation free energies, but for the present application the solvation model was used as formulated above. With all necessary components constructed, t he binding free energy in solution can be generated using: (4 42) Performance of MT KECSA as a Scoring Function for Protein Ligand Binding Affinity Prediction Using the MT method we performed binding free energy calculation s with the KECSA model and its associated parameters. This validation study was performed to illustrate (1) the general performance of MT method when used to predict protein ligand binding affinities and (2) whether sampling along atom pairwise distance improves

PAGE 99

99 scoring performance as done in MT KECSA i mpro ves the prediction over the original KECSA method. A test set containing 795 protein ligand complexes was chosen from the PDBbind v2011 refined dataset based on the following criteria : (1) Crystal structures of all selected complexes had X ray resolutions of < 2.5 . (2) Complexes with molecular weights (MWs) distributed from 100 to 900 were selected, to avoid ligand size dependent prediction results. (3) Complexes with ligands who have more than 20 hydrogen donors and acceptors, more than one phosphorus at om, and complexes with metalloproteins were excluded. (Root Mean Square Error) when compared to the original KECSA model (Table 4 1). Importantly, judging from the slope and interce pt of both calculations versus experimental data, MT KECSA (with slope of 0.85 and intercept of 0.14) better reproduces the binding affinities in the low and high affinity regions than the original KECSA model (with slope of 0.27 and intercept of 3.57). In the original KECSA approach, the entropy terms were empirically trained, thus, its test results demonstrate training set dependence to some degree. Because complexes with medium binding affinities are more commonly seen in the PDB database when compared t o complexes with high or low binding affinities, they become the majority in a large training set (1982 protein ligand complexes were used to fit the original KECSA entropy terms). This causes the trained scoring functions to overestimate the binding affin ity of the low binding complexes while underestimating that of the high binding complexes. On the other hand, MT KECSA, using canonical partition functions to compute the binding free

PAGE 100

100 energies, bypasses the difficulty of empirically building the entropy te rm, and, thereby, better reproduces the binding affinity in low and high binding free energy regions Extracting Heat Maps from the Movable Type Method Grid based methods and their graphical representation have had a long tradition in computer aided drug design. 66 86 For example COMFA 92 creates a field describing the chemical nature of the active site pocket and the GRID algorithm 94 uses a grid based approach to aid in molecular docking and has been adopted by other docking programs ( e.g. GLIDE 93 94 ). By the very nature of the MT describing the chemical nature of the grid points created in the MT method. These can be used to describe pairwise interactions between the grid point and the protein environment ( e.g. a mide hydrogen with a carbonyl oxygen) or interactions can be lumped into nonpolar or polar interactions describing the aggregate collection of polar and non polar pairwise interactions. Not only does this describe the nature of the grid points it also indi cates regions where specific atoms should be placed to optimize binding affinity. In contrast to energy heat maps, the MT heat maps represent the probability weighted interaction energy on each grid point. Knowledge based data ( i.e., the probability distri bution along the interacting distance) will affect the location of both unfavorable and favorable interactions depending on the nature of the system. Moreover, energy gradient maps can be generated based on heat map energy calculation s which facilitates l igand docking as described below

PAGE 101

101 Extracting Structure from the Movable Type Method The advantage of the MT method is that the energy and the free energy (when introducing the partition function) can be derived using only atomic linkage information coupled with the databases of atom pairwise distance distributions along with their corresponding energies. This offers us a new approach for protein ligand docking without resorting to exhaustive pose sampling. Our initial efforts utilized the frozen receptor mo del, but the incorporation of receptor flexibility is, in principle, straightforward and will be explored in the future. In a docking exercise, the best docked pose for the ligand is usually obtained based on the highest binding affinity, which can be regarded as an optimization problem. With the frozen binding pocket approximation, generation of the "best" docking pose is a gradient optimization of the ligand atoms within the binding pocket, subject to the constraints of the ligand topology. Molecular internal linkages including bond lengths and angles only slightly deviate from their optimized values, making the m constraint s in the ligand energy optimization within the binding pocket. These ligand atom connectivities reduce the dimensionality of the problem in that atomic collections that do not have the correct connectivity are eliminated from further consideration. On the other hand, energies of the torsions and long range interactions between ligands and proteins vary over comparatively large distance ranges and, thereby, are regarded as the objective functions. Hence, in order to do the optimization we need to obtain the first and second derivatives of the ligand torsion and the protein ligand long range interaction partition functions (shown in E quation 4 43 and 4 44 ), which can be readily seen in the gradient maps of the individual atom type pairs

PAGE 102

102 (4 43) (4 44) Optimum ligand atom locations are obtained when the calculation satisfies the minimum values for all the objective functions (ligand torsions and protein ligand long range interactions) and all l igand bonds and angle constraints In our optimization algorithm we obtain numerical derivatives of the probability distribution and analytical derivatives for the energy expression via pairwise partial derivatives of the modified Lennard Jones potentials used in KECSA. 87 With the ligand topology and first and second derivatives we used a Newton Raphson algorithm to optimize the ligand in the pocket. A nice feature of this method is we can identify both the lowest free energy binding mode along with all oth er possible local minima with higher free energies. Moreover, we can extract saddle point and higher order transitions describing the connectivity between the local minima, but these are quite numerous detail. Figure 4 4 introduces the process of the heatmap docking. To illustrate the method in detail we will touch on just one example whose structure is 1LI 2. We have also carried out heatmap docking against the previously introduced test set of 795 prot ein ligand complexes, which will be summarized below. The protein ligand complex with PDB ID 1LI2 is used as an example to illustrate in detail the process of heatmap docking. 1LI2 is a T4 Lysozyme mutant bound to phenol with a modest binding affinity of 4.04 (pK d ). The binding pocket region is larger

PAGE 103

103 than the small phenol ligand structure (see Figure 4 5), potentially allowing several ligand poses that represent local minima. On the other hand, phenol, as the ligand, has a simple enough structure to clear ly show the differences in protein ligand contacts between low energy poses. Judging from the crystal structure, phenol forms a hydrogen bond with GLN102A, and several hydrophobic contacts with VAL87A, TYR88A, ALA99A, VAL111A and LEU118A in the binding poc ket (shown in Figure 4 5 ). There are two atom types (sp 3 oxygen and aromatic carbon atoms) in phenol. Based on the MT KECSA calculation, heat maps for both of the atom types within the binding pocket can be generated (Figure 4 6 ). The heatmap docking progr am then generated one sp 3 oxygen and six aromatic carbons to their optimized position following the gradients on their corresponding energy heatmaps while satisfying the linkage constraints of phenol. As a result, together with the energetic global minimum ligand pose ( GM ), three more local minimum poses (pose a b and c ) were generated using the heatmap docking method. RMSD values () and binding scores ( p K d ) are shown in Table 4 2 As can be seen in Figure 4 7 the GM pose slightly deviates from the crystal structure ( CS ) because of the adjustment of the hydrogen bond distance between the phenol oxygen and the sp 2 oxygen on GLN102A in the MT KECSA calculation. The phenol benzene ring balances the contacts with ALA99A a nd TYR88A on one side and the contacts with LEU118A, VAL87A and LEU84A on the other. The local minimum pose c and b have close binding scores when compared to the GM pose. They form hydrogen bonds with different hydrogen acceptors (ALA99A backbone oxygen f or pose

PAGE 104

104 c and LEU84A backbone oxygen for pose b ) while maintaining very similar benzene ring locations. The local minimum pose a is trying to form a hydrogen bond with ALA99A backbone oxygen. However, the benzene ring of local minimum pose a is tilted towa rds the LEU118A, VAL87A and LEU84A side chain carbons, weakening the hydrogen bond with the ALA99A backbone oxygen with the net result being a reduction in binding affinity. Using this procedure a docking study was applied to the test set of 795 protein li gand complexes. The heatmap docking generated an average RMSD of 1.97 with 1.27 standard deviation compared to the ligand crystal structure. The result for each individual system studied is given in the Appendix Table A 3

PAGE 105

105 Table 4 1. Statistical results for MT KECSA and original KECSA correlated with experimental binding affinities. Pearson's r RMSE(p K d ) MT KECSA 0.72 1.88 0.53 original KECSA 0.62 2.03 0.46 Table 4 2. RMSD values () and binding scores ( p K d ) of the global and local minima RMSD () Binding Affinity (p K d ) Global Minimum 0.937 3.329 Local Minimum a 2.667 2.255 Local Minimum b 2.839 2.975 Local Minimum c 2.342 3.299

PAGE 106

106 Figure 4 1. The thermodynamic cycle used to formulate the free energy of protein ligand binding in solution.

PAGE 107

107 Figure 4 2. Modeling of ligand solvent polar interaction using a Boltzmann factor multiplier. A carbonyl oxygen atom is used a s an example here. 1) The gre en surface shows the solvent accessible surface of the ligand (inner layer of the solvent shell). The surface consisting of blue dots represents the outer boundary surface of the solvent shell. 2) A close up view of a selected polar atom (carbonyl oxygen) with its solvent shell. 3) Monte Carlo sampling along carbonyl oxygen solvent shell layer distance

PAGE 108

108 Figure 4 3. Plot of MT KECSA (left) and the original KECSA model (right) calculated p K d or p K i val u es vs. Experimental p K d or p K i values. Figure 4 4. The MT energy maps optimization mechanism to derive the final docking pose in one protein ligand complex.

PAGE 109

109 Figure 4 5. Contact map of the 1LI2 protein ligand complex binding region. Hydrophobic contacts are shown as green dashed lines and the one hydrogen bond is shown as a pink dashed line. The binding pocket cavity is shown in dark yellow Figure 4 6. Heat maps for sp 3 oxygen (left) and aromatic carbon (right). Grid points with lighter color indicate energetically favorable locations for certa in atom types within the binding pocket.

PAGE 110

110 Figure 4 7. In the binding pocket of protein ligand complex 1LI2, ligand crystal structure (marked as CS ) is shown as a stick & ball, the global minimum pose (marked as GM ) is shown as a stick along with the three other identified local minimum (marked as a b and c ). Red bubbles on the protein atoms indicate potential contacts with the ligand sp 3 oxygen. Grey bubbles on the protein atoms indicate potential contacts with a romatic carbons.

PAGE 111

111 CHAPTER 5 CONCLUSION AND OUTLOOK We have developed a new empirical based scoring function to evaluate binding affinity between different protein and ligand pairs. The scoring function uses different models to simulate van der W aals interactions, hydrogen bonds between different atom types and desolvation. We have also included an explicit term to model metal ion chelation between zinc and coordinating N and O ligand atoms. Our analysis suggests that we obtained acceptable parame ters for van der Waals interactions (using a 6 12 Lennard Jones model) through a comparison with standard force field parameters. We also collected data to represent the chelation between Zn and different type of ligands. Interaction between Zn and ligands containing N chelators was shown to be quite important is defining the observed binding affinity. Zn O chelation also contributed to the observed binding affinity, especially in the case of bidentate chelators like hydroxamic acids. This is an important f eature in our model that is largely ignored in other score functions. We also included desolvation into the current model using a fast and novel surface area algorithm. When compared with other scoring functions, LISA generally shows improved performance An Artificial Neural Network analysis was also used to confirm the goodness of fit and to demonstrate that cross terms were not needed to improve our scoring function. The result suggests that differentiating van der Waals interactions and hydrogen bonding by atom types may be a good choice for empirical scoring functions, and that adding a metal chelation term significantly improved the prediction of binding affinity to m etalloprotein Introducing more specific scoring functions for ligands categorized usi ng molecular weight and hydrophobicity criteria to some degree improves the binding affinity reproducing ability of LISA based on our

PAGE 112

112 study, but further proof is still needed to illustrate the advantage of using more specific scoring functions for ligands of certain categories. Based on atom pairwise interactions, interaction enthalpy terms in KECSA were parameterized by combining PMF theory with the Lennard Jones potential, without fitting to any binding affinity data. This procedure parameterizes the LJ p otential with neither QM calculations nor binding affinity data, hence lowering the computational expense while improving the prediction accuracy relative to empirical scoring functions. Generally, KECSA improves the binding affinity RMSE, when compared to LISA and LISA+, especially for complexes dominated by polar and charged interactions. With respect to ranking predictions, KECSA better distinguishes complexes in the large scale test sets. KECSA yields the lowest RMSE values illustrated by its superior p erformance in all test sets for this measure of quality. It is less responsive, however, to minor structural changes of the ligand or binding pocket, reducing its ability to rank complexes from the same or similar protein families. In the KECSA model, the solvent accessible surface area is introduced to describe the desolvation effect and entropy terms were empirically modeled. Since we have formulated and parameterized the enthalpy component of non covalent interactions with this alternative method, an int eresting possibility is to use similar procedures to build a force field model solely based on experimental structural data. In this way, desolvation and entropy terms can also be included instead of using empirical models. We believe that more accurate an d effective scoring methods can be developed using this concept. The prediction of the free energies associated with a wide range of biological problems remains a very daunting task. Balancing the sampling of the relevant degrees

PAGE 113

113 of freedom with accurate e nergy computation makes this a very difficult problem. Herein we describe a new approach that in one shot samples all the relevant degrees of freedom in a defined region directly affording a free energy without resorting to ad hoc modeling of the entropy a ssociated with a given process. This is accomplished by converting ensemble assembly from a 3 D to a 1 D problem by using pairwise energies of all relevant interactions in a system coupled with their probabilities. We call this approach the moveable type ( MT) method and in conjunction with Knowledge based and Empirical Combined Scoring Algorithm (KECSA) potential function we demonstrated the application of this approach to protein ligand pose and binding free energy prediction. The resultant MT KECSA model out performs the original KECSA model showing the power of this approach. Importantly, the present MT model can be applied to any pairwise decomposable potential which will allow us to attack a wide range of problems in biology including the validation of potential functions.

PAGE 114

114 APPENDIX SUPPORTING INFORMATION DATA Table A 1. Lennard LJ model LJ model LJ model LJ model LJ model LJ model LJ model 1.0714 15 14 1.0742 15 13 1.0769 14 13 1.0772 15 12 1.0801 14 12 1.0833 13 12 1.0806 15 11 1.0837 14 11 1.0871 13 11 1.0909 12 11 1.0845 15 10 1.0878 14 10 1.0914 13 10 1.0954 12 10 1.1000 11 10 1.0889 15 9 1.0924 14 9 1.0963 13 9 1.1006 12 9 1.1055 11 9 1.1111 10 9 1.0940 15 8 1.0978 14 8 1.1020 13 8 1.1067 12 8 1.1120 11 8 1.1180 10 8 1.1250 9 8 1.1000 15 7 1.1041 14 7 1.1087 13 7 1.1138 12 7 1.1196 11 7 1.1262 10 7 1.1339 9 7 1.1072 15 6 1.1117 14 6 1.1168 13 6 1.1225 12 6 1.1289 11 6 1.1362 10 6 1.1447 9 6 1.1161 15 5 1.1212 14 5 1.1269 13 5 1.1332 12 5 1.1404 11 5 1.1487 10 5 1.1583 9 5 1.1277 15 4 1.1335 14 4 1.1399 13 4 1.1472 12 4 1.1555 11 4 1.1650 10 4 1.1761 9 4 1.1435 15 3 1.1503 14 3 1.1579 13 3 1.1665 12 3 1.1763 11 3 1.1877 10 3 1.2009 9 3 1.1676 15 2 1.1760 14 2 1.1855 13 2 1.1962 12 2 1.2085 11 2 1.2228 10 2 1.2397 9 2 1.2134 15 1 1.2251 14 1 1.2383 13 1 1.2535 12 1 1.2710 11 1 1.2915 10 1 1.3161 9 1 LJ model LJ model LJ model LJ model LJ model LJ model LJ model 1.1429 8 7 1.1547 8 6 1.1667 7 6 1.1696 8 5 1.1832 7 5 1.2000 6 5 1.1892 8 4 1.2051 7 4 1.2247 6 4 1.2500 5 4 1.2167 8 3 1.2359 7 3 1.2599 6 3 1.2910 5 3 1.3333 4 3 1.2599 8 2 1.2847 7 2 1.3161 6 2 1.3572 5 2 1.4142 4 2 1.5000 3 2 1.3459 8 1 1.3831 7 1 1.4310 6 1 1.4953 5 1 1.5874 4 1 1.7321 3 1 2.0000 2 1 Table A 2. Predicted pKd or pKi vs. experimental pKd or pKi for four small scale test sets. Serine protease test set PDB ID exp. Data KECSA scaled KECSA LISA LISA+ PDB ID exp. Data KECSA scaled KECSA LISA LISA+ 1aq7 6.05 6.46 6.65 8.59 7.06 1o39 7.13 6.23 6.24 6.90 7.00 1bju 4.80 5.50 4.84 5.20 5.07 1o3b 7.13 6.18 6.15 6.32 5.52 1bjv 5.54 5.52 4.89 5.01 5.00 1o3d 7.13 6.51 6.75 7.11 7.22

PAGE 115

115 1c1r 7.63 7.06 7.79 8.68 8.94 1o3e 7.96 6.53 6.79 6.79 6.20 1c5p 4.68 5.23 4.35 3.23 3.41 1o3f 7.96 6.60 6.92 7.13 6.78 1c5q 6.36 6.89 7.48 5.17 5.39 1o3g 7.96 6.60 6.92 7.13 6.78 1c5s 6.00 5.77 5.36 5.54 5.47 1o3h 7.30 5.63 5.10 5.69 6.20 1c5t 4.10 5.26 4.40 5.03 4.70 1o3i 7.30 5.71 5.25 5.67 6.23 1ce5 4.74 5.92 5.65 3.86 4.02 1o3j 6.77 5.59 5.03 5.61 5.43 1eb2 6.01 7.32 8.28 7.70 7.67 1o3k 6.77 5.27 4.41 5.66 6.10 1f0t 6.00 6.33 6.28 6.49 5.60 1oyq 6.96 6.21 6.20 6.28 6.31 1f0u 7.16 6.71 7.14 7.55 6.85 1ppc 6.16 6.98 7.64 6.56 6.08 1g36 7.17 6.46 6.66 6.51 6.46 1pph 5.92 6.32 6.30 5.66 5.20 1g3b 5.74 4.58 3.11 4.15 4.42 1qb1 6.77 6.72 7.15 5.79 6.38 1g3c 5.80 4.81 3.54 5.40 4.87 1qb6 6.06 5.92 5.64 5.57 5.93 1g3d 5.55 4.33 2.64 3.84 4.24 1qb9 7.44 7.06 7.80 5.62 6.07 1g3e 5.38 4.26 2.51 4.16 4.55 1qbn 5.85 5.63 5.09 6.77 5.81 1ghz 4.80 4.98 3.86 4.19 4.43 1qbo 7.74 6.58 6.88 5.98 5.94 1gi1 5.47 5.59 5.03 5.35 4.81 1ql7 5.00 7.53 8.68 6.52 6.94 1gi4 7.19 7.32 8.29 9.36 9.68 1tps 8.00 7.76 9.12 11.81 9.91 1gi6 6.22 5.79 5.40 4.99 4.74 1tx7 4.60 5.18 4.25 3.96 3.91 1gj6 7.00 6.83 7.36 6.95 7.70 1v2k 6.19 7.77 9.14 7.53 7.90 1k1i 6.58 6.51 6.75 5.65 4.96 1v2l 4.29 5.51 4.86 3.47 3.75 1k1j 7.55 7.02 7.73 7.71 6.72 1v2m 4.29 4.94 3.79 3.11 3.00 1k1l 6.90 6.88 7.46 6.36 6.13 1v2n 5.90 6.63 6.98 4.90 6.38 1k1m 7.40 7.01 7.71 7.36 6.44 1v2o 4.73 5.07 4.03 5.29 5.11 1k1n 6.82 6.42 6.58 5.80 5.52 1v2p 4.73 5.42 4.71 5.79 5.10 1lqe 5.82 6.78 7.27 7.51 6.36 1v2q 4.13 4.94 3.80 4.52 4.30 1o2h 7.17 6.54 6.82 6.13 7.08 1v2r 3.55 5.07 4.04 5.06 5.03 1o2j 6.92 5.67 5.16 5.45 5.65 1v2s 4.17 5.74 5.30 3.62 4.05 1o2k 6.92 5.68 5.19 5.65 5.88 1v2t 4.71 4.96 3.82 4.54 4.36 1o2n 6.09 6.55 6.83 6.12 6.76 1v2w 4.01 5.18 4.24 5.07 4.84 1o2o 6.36 6.13 6.05 4.73 5.67 1xug 7.05 6.27 6.27 7.34 7.54 1o2q 7.68 6.73 7.17 6.60 7.59 1y3v 5.64 6.95 7.58 6.95 6.69 1o2s 5.47 5.39 4.64 4.88 4.51 1y3w 5.77 6.81 7.32 6.99 6.28 1o2u 5.85 4.66 3.27 5.20 5.91 1y3x 5.14 6.54 6.81 6.51 6.32 1o2v 5.85 5.01 3.93 5.24 5.75 1y3y 5.64 6.17 6.13 4.52 5.29 1o2w 5.85 4.85 3.62 5.24 5.64 1yp9 5.11 6.82 7.35 7.10 6.96 1o2x 5.85 4.99 3.90 5.67 6.23 2zq2 6.56 7.16 7.99 7.75 6.92 1o2y 5.85 4.95 3.81 5.64 6.08 3aas 4.51 4.82 3.56 2.66 3.50 1o2z 6.11 5.84 5.48 6.90 7.05 3aau 5.05 5.18 4.25 5.76 4.87 1o30 6.77 6.25 6.29 6.76 6.80 3gy2 5.74 5.52 4.90 5.84 5.44 1o32 5.74 4.67 3.27 4.42 4.63 3gy3 5.63 5.95 5.70 7.03 7.17 1o33 5.74 5.38 4.62 4.93 5.45 3gy4 5.10 5.13 4.16 3.50 3.52 1o34 5.74 5.54 4.93 4.68 4.67 3gy7 4.52 5.75 5.33 3.21 3.66 1o36 5.96 6.11 6.01 7.12 7.02 3ljj 6.52 6.57 6.86 6.24 6.85 1o37 6.82 6.24 6.28 6.48 7.20 3ljo 6.62 6.18 6.13 5.97 6.72

PAGE 116

116 1o38 6.82 6.33 6.27 6.95 6.87 3nkk 4.70 5.95 5.70 3.93 4.29 Endothiapepsin tes set PDB ID exp. Data KECSA scaled KECSA LISA LISA+ PDB ID exp. Data KECSA scaled KECSA LISA LISA+ 4er2 9.30 7.97 9.51 8.68 7.85 1gvx 7.22 8.46 8.46 10.27 9.26 1eed 4.79 7.67 8.96 8.86 7.65 2er6 7.22 8.46 8.46 10.60 9.84 1ent 6.96 7.54 8.70 8.02 6.71 2er9 7.40 9.03 9.03 11.01 10.13 1epo 7.96 8.58 8.58 10.47 9.90 2v00 3.66 5.41 4.68 5.51 4.99 1epp 7.16 7.99 9.55 9.67 8.65 3er3 7.09 8.51 8.51 9.60 7.99 1epq 8.19 7.68 8.97 9.16 8.15 4er1 6.62 10.38 10.38 12.02 11.69 1gvu 9.00 8.18 8.18 10.89 10.02 5er1 6.02 6.18 6.13 6.56 5.88 1gvw 6.96 7.69 8.98 7.94 7.03 5er2 6.57 9.46 9.46 11.44 9.92 HIV 1 protease PDB ID exp. Data KECSA scaled KECSA LISA LISA+ PDB ID exp. Data KECSA scaled KECSA LISA LISA+ 1a30 4.30 5.00 3.90 4.46 4.58 1z1r 9.22 7.86 9.31 8.28 7.88 1a94 7.85 8.09 8.09 9.77 8.53 1zp8 8.77 8.18 8.18 9.03 7.99 1a9m 6.92 6.83 7.37 7.81 7.17 1zpa 8.40 8.03 8.03 8.61 8.26 1aaq 8.40 6.91 7.51 7.09 6.99 1zsf 9.92 8.18 8.18 8.65 8.47 1aid 4.82 7.06 7.80 3.77 3.10 1zsr 9.82 8.45 8.45 9.61 8.66 1ajv 7.72 9.18 9.18 10.14 9.28 2aoc 4.89 7.42 8.48 10.58 10.16 1ajx 7.91 8.65 8.65 8.60 8.19 2aod 5.66 7.54 8.70 10.69 10.33 1b6j 7.92 7.81 9.21 9.11 8.96 2aoe 7.62 8.02 8.02 9.60 8.99 1b6k 8.74 7.72 9.03 7.59 7.94 2aog 6.28 7.29 8.24 10.41 8.83 1b6l 8.30 7.03 7.73 8.08 7.73 2aqu 9.32 7.94 9.46 9.73 9.16 1b6m 8.40 7.34 8.33 7.41 8.06 2avm 5.70 7.45 8.54 10.20 9.06 1b6p 8.52 7.76 9.12 8.13 8.22 2avo 8.85 8.63 8.63 10.38 10.01 1bv7 9.30 11.06 11.06 14.27 12.44 2avq 4.39 7.41 8.46 10.62 9.00 1bv9 9.96 11.02 11.02 12.53 11.41 2avs 7.57 8.67 8.67 10.49 10.41 1bwa 8.60 11.01 11.01 11.39 10.34 2avv 9.26 8.64 8.64 9.82 9.37 1bwb 8.42 11.26 11.26 12.56 11.31 2az8 8.22 6.20 6.19 5.62 4.72 1c70 10.30 9.34 9.34 11.87 10.38 2az9 7.66 5.87 5.55 4.23 3.80 1cpi 7.41 7.75 9.10 9.12 8.92 2bbb 8.62 9.30 9.30 10.46 10.31 1d4h 10.00 9.32 9.32 10.31 10.14 2bpv 7.67 7.77 9.13 7.61 7.64 1d4i 8.85 9.73 9.73 10.65 10.73 2bpy 7.40 7.92 9.42 7.80 8.03 1d4j 8.36 9.37 9.37 9.89 9.85 2bqv 8.05 8.17 8.17 8.28 8.00 1d4k 9.22 8.76 8.76 9.67 8.00 2cej 8.62 8.80 8.80 9.96 9.90 1d4l 8.77 7.76 9.12 8.58 7.78 2cem 7.92 9.06 9.06 10.43 10.28 1d4y 11.10 8.80 8.80 7.43 7.51 2cen 8.30 9.19 9.19 10.05 9.95 1dif 10.66 9.50 9.50 10.31 10.52 2f3k 7.89 8.50 8.50 10.07 9.40 1dmp 9.55 8.92 8.92 9.56 8.57 2f80 8.18 7.97 9.52 9.01 8.51 1ebw 9.05 7.87 9.33 10.03 9.30 2f81 10.52 7.97 9.52 8.28 7.82 1eby 9.70 9.66 9.66 10.19 10.21 2f8g 8.70 7.95 9.49 8.79 8.46 1ebz 9.40 8.65 8.65 9.52 9.59 2fgu 9.18 8.71 8.71 10.02 9.47 1ec0 8.49 10.13 10.13 11.01 11.17 2fgv 6.12 8.68 8.68 10.34 9.73

PAGE 117

117 1ec1 8.92 9.28 9.28 10.28 9.54 2fle 7.85 8.79 8.79 10.81 11.12 1ec2 10.00 8.62 8.62 10.13 9.39 2hb3 11.35 8.04 8.04 9.76 9.10 1ec3 9.04 8.58 8.58 10.36 9.63 2hs1 8.48 7.74 9.09 8.04 7.80 1g2k 7.96 9.06 9.06 9.79 8.90 2hs2 8.31 7.88 9.35 8.03 7.64 1g35 8.14 8.99 8.99 9.74 9.21 2i0a 11.40 7.97 9.51 8.77 8.23 1gnm 6.25 7.00 7.69 7.93 6.65 2i0d 12.10 8.11 8.11 8.47 7.82 1gnn 5.68 7.08 7.84 8.59 7.44 2i4d 11.68 7.97 9.51 8.92 8.64 1gno 7.70 6.68 7.07 6.59 6.88 2i4u 11.57 7.96 9.50 8.30 8.43 1hbv 6.37 7.09 7.85 8.64 7.48 2i4v 11.20 7.87 9.33 9.17 9.14 1hef 9.00 8.00 9.57 10.78 9.71 2i4w 11.59 8.28 8.28 10.26 10.51 1heg 7.74 6.63 6.98 6.36 6.57 2i4x 11.72 8.24 8.24 10.59 10.32 1hih 8.05 7.51 8.64 8.54 8.33 2idw 8.89 7.60 8.81 9.14 8.22 1hiv 9.00 8.96 8.96 10.98 10.52 2ien 9.00 7.97 9.52 8.11 7.97 1hos 8.55 8.83 8.83 7.98 7.38 2ieo 8.49 7.67 8.94 8.41 7.96 1hpo 9.22 7.78 9.15 5.74 5.79 2o4k 10.46 8.35 8.35 11.29 10.37 1hps 9.22 8.44 8.44 9.86 8.85 2o4l 9.22 9.06 9.06 9.97 9.11 1hpv 9.22 7.34 8.31 7.69 7.12 2o4p 10.72 9.07 9.07 8.77 8.73 1hpx 11.26 8.19 8.19 11.11 9.89 2o4s 10.51 8.71 8.71 10.44 9.95 1hsg 9.42 8.46 8.46 9.40 9.26 2p3b 8.48 9.15 9.15 9.70 9.17 1htf 6.83 7.78 9.15 7.80 6.91 2pk5 10.70 10.44 10.44 13.53 11.92 1htg 8.42 9.68 9.68 10.65 10.10 2pk6 10.89 10.17 10.17 12.51 11.37 1hvh 7.96 8.13 8.13 7.69 6.82 2pqz 5.67 9.14 9.14 9.46 7.86 1hvi 10.92 8.87 8.87 9.60 9.82 2psu 7.62 7.65 8.92 8.36 7.45 1hvj 11.40 8.83 8.83 9.38 9.73 2psv 7.24 7.14 7.94 8.00 7.08 1hvk 10.96 9.05 9.05 10.71 10.59 2pwc 6.57 9.06 9.06 9.24 7.77 1hvl 9.95 8.79 8.79 9.38 9.56 2pwr 6.59 9.83 9.83 10.99 9.80 1hvr 9.51 10.16 10.16 9.55 8.73 2pym 6.82 8.10 8.10 9.15 8.98 1hvs 10.30 8.48 8.48 8.29 9.28 2pyn 7.57 8.03 8.03 9.06 9.05 1hwr 8.33 6.93 7.55 6.04 5.80 2q54 9.01 8.42 8.42 8.13 8.23 1hxb 9.92 8.63 8.63 10.29 9.34 2q55 8.69 8.70 8.70 9.58 9.27 1hxw 10.82 8.96 8.96 10.75 9.47 2q5k 11.30 8.60 8.60 9.93 9.70 1iiq 7.48 8.55 8.55 9.94 9.49 2q63 7.18 8.28 8.28 9.57 9.56 1izh 7.70 8.93 8.93 11.36 10.93 2q64 8.52 8.23 8.23 9.55 9.04 1izi 6.59 8.43 8.43 9.83 9.36 2qci 8.34 8.11 8.11 9.93 9.12 1k6c 7.48 8.18 8.18 8.54 8.37 2qd6 8.68 7.94 9.46 8.61 8.24 1k6p 7.36 8.68 8.68 9.99 10.35 2qd7 9.10 7.97 9.51 9.55 8.84 1k6t 7.62 8.34 8.34 8.87 9.01 2qd8 9.07 7.88 9.35 8.67 8.37 1k6v 6.92 8.68 8.68 9.67 10.21 2qhy 7.48 8.80 8.80 9.50 8.02 1kzk 10.39 8.95 8.95 11.06 9.94 2qhz 7.28 9.13 9.13 9.33 7.98 1lzq 8.39 8.26 8.26 9.68 9.54 2qi0 7.38 9.23 9.23 9.33 8.59 1m0b 8.82 8.35 8.35 9.51 9.70 2qi1 7.30 8.71 8.71 10.36 9.14 1mes 7.70 8.61 8.61 9.54 8.74 2qi3 10.20 7.48 8.59 6.92 6.55 1met 9.40 8.84 8.84 9.67 9.06 2qi4 10.44 8.64 8.64 9.09 8.64 1meu 6.10 8.78 8.78 8.98 8.53 2qi5 10.85 8.22 8.22 8.52 8.48 1mrw 9.70 7.87 9.33 8.93 8.65 2qi6 10.57 8.62 8.62 8.66 8.35

PAGE 118

118 1mrx 7.26 7.83 9.25 8.77 7.99 2qi7 10.21 7.51 8.65 8.06 7.49 1msm 10.48 8.98 8.98 10.97 9.90 2qnn 7.15 10.86 10.86 10.34 8.99 1msn 9.09 8.71 8.71 11.09 9.89 2qnp 6.41 12.12 12.12 9.17 7.47 1mtr 8.40 7.33 8.31 7.33 8.03 2qnq 6.11 9.36 9.36 8.69 7.46 1nh0 9.74 8.56 8.56 10.74 9.24 2r38 7.44 9.48 9.48 10.43 9.03 1npa 8.70 9.04 9.04 9.36 8.79 2r3t 5.85 9.50 9.50 10.35 8.98 1npv 8.24 8.58 8.58 8.45 8.77 2r3w 6.89 9.07 9.07 10.55 8.68 1npw 8.70 9.04 9.04 9.42 8.84 2r43 6.00 8.84 8.84 9.60 7.97 1ody 8.10 9.29 9.29 10.18 9.69 2uxz 8.48 8.46 8.46 9.52 9.55 1ohr 8.70 8.07 8.07 8.76 8.36 2uy0 6.92 8.89 8.89 10.70 10.03 1pro 11.30 8.70 8.70 9.27 8.95 2wkz 8.77 9.25 9.25 10.80 10.16 1qbr 10.57 11.12 11.12 12.79 11.02 2wl0 7.80 9.07 9.07 11.01 10.46 1qbs 9.47 8.73 8.73 10.01 8.87 2z4o 9.57 8.20 8.20 8.70 8.50 1qbt 10.62 11.57 11.57 12.88 12.07 3a2o 9.08 8.07 8.07 10.89 9.78 1qbu 10.24 8.94 8.94 8.47 7.95 3bc4 5.35 4.38 2.73 2.33 2.62 1sbg 7.74 7.57 8.75 8.49 8.36 3bgb 6.05 7.30 8.25 7.44 6.90 1sdt 9.27 8.73 8.73 10.71 10.35 3bgc 5.02 9.03 9.03 7.73 7.88 1sdu 10.07 8.68 8.68 9.79 9.60 3bva 6.52 7.33 8.31 10.05 8.71 1sdv 8.74 8.63 8.63 9.81 9.85 3bvb 5.49 7.91 9.40 9.06 8.04 1t7j 8.70 7.00 7.69 7.52 7.25 3cyw 7.77 7.94 9.45 8.46 8.08 1tcx 6.95 7.41 8.45 7.94 7.66 3cyx 8.00 8.68 8.68 9.67 9.84 1vij 9.52 11.83 11.83 12.60 11.53 3d1x 8.66 8.73 8.73 10.00 9.47 1vik 9.52 12.19 12.19 12.33 12.74 3d1y 8.22 8.73 8.73 10.31 9.94 1w5v 8.15 10.32 10.32 11.40 11.33 3d1z 8.80 7.99 9.54 9.01 8.47 1w5w 8.80 10.57 10.57 11.18 11.29 3d20 8.30 8.13 8.13 8.82 8.64 1w5x 8.40 10.72 10.72 11.30 11.28 3djk 10.59 7.58 8.78 8.97 8.54 1w5y 8.48 10.50 10.50 10.36 10.63 3dk1 9.74 7.80 9.19 9.04 8.69 1z1h 8.40 7.64 8.89 8.30 7.64 Carbonic anhydrase II test set PDB ID exp. Data KECSA scaled KECSA LISA LISA+ PDB ID exp. Data KECSA scaled KECSA LISA LISA+ 1if7 10.52 7.53 8.68 7.60 8.33 2h4n 8.70 6.72 7.15 7.64 7.92 1avn 3.90 2.71 2.71 0.77 0.83 2hd6 7.80 7.40 8.43 7.92 8.39 1bcd 8.70 6.97 7.62 6.05 6.00 2nng 5.36 7.10 7.87 7.49 7.68 1bn1 9.34 8.09 8.09 8.23 9.49 2nno 6.40 6.88 7.45 7.46 7.95 1bn3 9.89 7.69 8.98 7.25 6.04 2nns 6.05 6.84 7.37 7.31 7.75 1bn4 9.31 7.97 9.52 8.12 6.65 2nnv 7.52 6.78 7.27 7.50 7.88 1bnn 10.00 8.24 8.24 8.28 6.59 2osf 3.83 5.16 4.22 3.06 3.25 1bnq 9.49 7.91 9.40 9.90 8.26 2osm 6.20 4.65 3.25 2.95 3.09 1bnt 9.80 8.02 8.02 8.18 7.28 2pou 7.42 7.96 9.49 7.92 8.15 1bnu 9.70 8.35 8.35 10.02 6.56 2pov 7.12 7.66 8.93 7.93 8.18 1bnv 8.77 8.12 8.12 7.99 7.50 2pow 7.20 7.90 9.38 7.92 8.09 1bnw 9.08 7.88 9.35 8.35 8.21 2q1q 8.15 7.54 8.70 8.01 8.17 1cil 9.43 7.91 9.39 8.83 9.34 2qo8 8.47 7.32 8.28 8.49 8.66 1cim 8.82 7.76 9.11 8.21 8.25 2qoa 7.28 7.53 8.69 8.17 8.45

PAGE 119

119 1cin 8.73 7.98 9.53 8.04 8.50 2qp6 6.68 7.57 8.76 7.77 7.96 1cnw 7.72 6.74 7.18 8.58 7.81 2wd3 7.07 7.53 8.68 7.95 8.45 1cnx 7.37 6.49 6.71 7.69 7.92 2weg 6.50 7.22 8.10 6.86 7.00 1cny 7.85 6.27 6.27 6.66 5.73 2weh 5.95 7.18 8.03 6.52 6.49 1eou 7.44 7.44 8.52 9.29 6.32 2wej 6.08 7.19 8.04 7.14 7.23 1g1d 9.44 7.01 7.70 7.13 7.34 2weo 6.93 7.26 8.17 6.94 7.19 1g45 8.64 6.80 7.30 7.27 7.43 2x7s 5.78 5.42 4.71 2.94 3.28 1g46 8.80 7.04 7.76 7.36 7.73 2x7t 6.63 5.91 5.63 3.91 3.14 1g48 8.41 7.05 7.77 7.06 7.21 2x7u 5.27 5.99 5.77 2.98 3.83 1g4j 8.70 6.72 7.15 6.08 6.18 3b4f 8.14 8.37 8.37 9.72 10.05 1g4o 8.25 6.57 6.87 6.91 6.95 3bet 5.82 8.47 8.47 8.42 8.25 1g52 9.54 7.48 8.59 7.72 8.03 3bl0 6.92 6.59 6.91 7.10 6.68 1g53 9.04 6.73 7.17 5.73 5.91 3bl1 5.60 8.18 8.18 8.64 8.25 1g54 8.82 7.68 8.97 7.00 7.31 3c7p 10.00 8.88 8.88 9.30 8.91 1i8z 8.90 8.91 8.91 10.45 9.63 3caj 8.10 7.42 8.48 8.61 8.66 1i90 8.89 6.95 7.59 8.45 6.57 3d8w 8.15 7.35 8.35 8.12 8.25 1i91 8.94 7.87 9.33 7.50 6.77 3d9z 8.70 7.22 8.09 7.68 6.98 1i9l 8.48 6.39 6.52 6.12 6.00 3daz 7.93 6.17 6.12 7.04 6.32 1i9m 8.48 6.57 6.86 6.42 6.08 3dbu 8.06 7.30 8.26 8.28 8.22 1i9n 8.66 6.55 6.84 5.91 5.59 3dc3 8.31 6.77 7.24 8.07 7.62 1i9o 8.42 7.11 7.89 6.22 6.26 3dcc 8.96 7.25 8.16 8.04 7.56 1i9p 8.41 7.71 9.03 7.63 7.30 3dcs 7.94 7.14 7.95 8.45 7.92 1i9q 8.41 6.72 7.15 6.11 6.25 3dcw 9.00 7.53 8.69 8.98 8.79 1if8 9.64 7.03 7.73 6.70 6.85 3dd0 9.00 7.41 8.46 8.70 8.53 1kwr 6.60 6.08 5.94 5.19 5.49 3dd8 7.85 8.03 8.03 8.64 8.98 1okl 6.03 8.32 8.32 8.73 8.95 3eft 7.92 8.02 8.02 8.34 9.11 1oq5 7.68 8.77 8.77 9.22 9.52 3f8e 7.23 4.68 3.30 2.52 3.65 1ttm 7.35 7.60 8.81 6.83 7.48 3ffp 7.14 7.83 9.26 10.61 9.20 1xpz 7.08 6.33 6.27 6.12 5.92 3hkn 5.35 7.68 8.97 7.81 8.28 1xq0 6.34 7.16 7.98 8.05 6.64 3hs4 8.00 7.06 7.79 8.41 7.94 1yda 6.55 4.74 3.41 2.96 3.33 3ibi 8.59 6.68 7.07 7.48 8.24 1ydb 8.24 6.80 7.30 8.29 7.09 3ibl 7.84 6.45 6.63 8.22 7.65 1ydd 7.07 6.12 6.03 6.47 5.59 3ibn 7.61 6.23 6.24 7.42 7.37 1ze8 7.68 7.64 8.89 9.58 9.89 3ibu 9.15 6.92 7.53 8.39 8.89 2aw1 7.37 7.93 9.44 8.11 8.84 3ieo 6.48 3.58 3.58 0.20 1.66 2ez7 4.36 3.10 3.10 1.44 2.75 3igp 7.02 7.14 7.94 7.32 7.51 2f14 9.19 9.09 9.09 10.82 10.10 3k2f 7.20 7.97 9.52 10.99 8.83 2fou 7.10 5.27 4.41 4.23 3.36 3m1k 5.39 5.30 4.48 4.24 4.23 2fov 7.55 7.29 8.24 8.39 8.18 3m67 6.74 8.55 8.55 9.25 8.69 2gd8 7.41 7.62 8.85 7.73 7.43 3mhc 7.96 8.11 8.11 9.42 9.16 2h15 5.67 7.11 7.89 7.41 7.93 3mnu 7.32 7.40 8.44 7.51 8.51

PAGE 120

120 Table A 3 Heatmap docking RMSD result against a test set with 795 protein ligand complexes. PDB ID RMSD PDB ID RMSD PDB ID RMSD PDB ID RMSD 2aac 0.77 1pb8 2.27 1izi 1.46 1erb 4.94 1ajp 0.57 1s38 1.38 2pvh 2.53 1stc 0.94 1uto 0.37 1thz 0.04 2pwr 1.00 2pvj 2.75 1qi0 0.10 2f5t 0.08 1g7g 3.43 1nvr 1.88 6rnt 0.82 3b7j 3.81 2j4g 2.72 1g35 1.89 1oba 1.94 1ocq 2.22 1ii5 4.65 1yqj 0.13 2i2b 2.46 1rnt 2.67 2zb0 0.25 1fkh 1.26 1tok 1.06 1j17 0.53 3f8f 0.31 1w5v 1.65 1utl 0.17 1o5c 0.35 2d3z 2.39 2fxu 0.66 1gyx 1.16 1w4o 3.81 1o3p 2.41 3bu1 2.36 1ai4 1.04 2hxm 0.40 1pxp 0.69 2f80 0.48 2r5a 1.16 1fh7 1.71 1c5x 3.50 2jds 2.79 1ajn 1.16 2jfz 3.39 2fqw 3.57 1efy 3.31 3bra 2.63 1w4q 2.59 2iwx 2.49 1h1s 2.84 2vyt 1.65 1b46 3.53 2qbs 2.29 1w11 1.64 2bet 0.72 1c88 0.58 1b0h 0.25 2cji 1.65 2bza 0.29 1a69 0.97 1dfo 1.48 2zcs 2.22 3c2u 1.99 1pr1 2.11 1eld 0.04 3d1y 2.46 2fw6 1.63 1srg 2.35 2oxx 3.65 3e5a 2.79 2fwp 1.53 1vyq 0.12 3eko 3.68 1nvq 0.99 1tng 0.47 1xbb 0.16 3gst 1.83 2e2r 4.37 1f4e 0.54 2oyk 2.32 1usi 3.41 1tom 2.13 1ws4 0.61 1lvu 0.07 1o30 0.49 2cen 3.29 2v3u 0.95 1y20 2.53 1o3j 3.24 2v95 1.57 1utm 1.73 2c80 1.43 1o3k 3.70 3d20 0.02 1ogd 1.53 1zgi 0.33 1ony 0.14 2hs2 1.13 1ws5 0.83 2fxv 0.71 1qb1 0.35 1hwr 3.70 2ha3 1.92 1adl 2.99 2rcb 3.66 2qci 0.10 1uj6 0.45 1sgu 0.87 1fl3 0.56 1h23 0.26 1y1z 0.41 1qbv 2.75 1nq7 3.45 1mj7 1.93 1ugw 1.42 2h6b 3.93 1p1n 4.48 1d4j 1.71 3buf 3.07 1bp0 3.67 3d94 2.23 3d83 1.51 3c2r 0.05 1r9l 2.41 1drk 3.17 2nmy 0.51 1lbk 2.13 2c1p 0.16 1fhd 2.94 2nmz 0.76 1br6 4.12 2f35 3.41 1k1n 2.51 2nnk 1.15 1e6s 0.46 2i80 0.04 1uwf 2.14 2nnp 1.32 1tog 0.47 2qg2 0.25 2oxd 3.45 1k21 3.72

PAGE 121

121 2ri9 3.52 3b4p 1.90 1bhx 0.33 1mu6 2.55 1v2j 0.44 2j47 1.13 1ele 0.34 1fd0 0.27 5yas 2.72 1gja 1.15 1g30 0.95 1k22 2.35 1bcu 0.34 1syi 0.38 1os5 3.23 1mtr 0.46 1phw 3.26 2iko 0.02 1q72 3.89 1w5x 1.65 3c2o 0.67 1lv8 0.43 1xk9 2.02 3b68 0.31 1tnh 1.17 1q65 1.33 1xt8 2.87 1w96 3.21 1v2u 0.68 1gi1 1.88 1xka 0.80 2oag 2.28 1toj 1.34 1o2s 1.75 1fh8 4.83 1w5y 1.18 1uz4 0.33 1g85 4.29 1sb1 1.29 2hs1 0.94 2sim 0.28 1pot 2.18 2dri 3.76 2r5p 2.19 1c5o 1.40 2h4k 2.26 2f34 2.61 2uxz 0.66 1utn 1.47 2o0u 3.21 2r3w 2.85 1ec0 0.85 1jys 0.84 3bvb 3.94 1swr 4.09 2ieo 1.62 1v2r 1.53 1a8i 3.43 2d3u 3.36 2ihq 3.62 2cle 1.66 1oe8 3.20 2g79 3.95 1fcy 0.88 2clf 1.57 1ydr 1.92 2uy0 3.11 1i5r 1.09 1nli 0.04 2cht 3.12 1tcx 0.47 1ta2 1.27 1d7i 3.56 1bjv 1.85 1c5c 3.83 1xkk 1.38 3jdw 3.44 1y3n 2.07 1ent 1.01 2boh 0.17 1pb9 2.05 2iuz 2.64 1gvw 1.63 2bxt 0.50 1gyy 0.33 1jak 2.19 1oyq 0.43 2qu6 0.28 2azr 1.47 1njd 2.91 1x8j 3.45 2vw5 1.69 2v00 0.10 3bex 3.74 1fki 3.64 1hos 0.70 3buh 2.43 1f4x 2.76 1fpc 0.90 1ppl 1.20 1c3x 1.05 1mfi 1.44 1gj6 1.34 2byr 0.80 2a8g 1.20 1vfn 2.13 2j7g 3.70 1bwa 0.51 3clp 1.87 1a08 0.93 3cdb 0.24 3b66 0.96 1loq 0.78 1ejn 0.50 1b1h 2.25 6std 5.48 1wht 1.27 2bmk 2.29 2bq7 0.60 3d1x 0.90 2bvr 2.37 2ojg 1.48 1lpg 2.59 2qd6 1.35 2fx6 0.23 1nhu 3.56 2bz6 1.33 2q55 1.06 2vfk 0.57 2pqz 3.58 2il2 5.24 1jgl 0.95 1ai5 0.40 1ow4 1.13 3e64 3.51 1lnm 4.45 1nje 0.73 2cbu 3.27 1b05 0.15 1ohr 1.87 2hb1 0.81 1g98 3.14 1nz7 1.07 1t7j 3.18 1j16 1.97 1jsv 2.36 1o3d 2.82 1w13 2.96 1utj 2.34 2f2h 1.26 1zog 3.72 1xap 3.17 1n1t 0.33 1n7m 1.92 2fqx 3.95 2f8g 1.40 1aj7 0.72 2qg0 1.28 2qnn 1.15 2fr3 2.07 1x9d 0.64 1o33 2.59 1f0u 0.27 2jh0 3.27

PAGE 122

122 2fpz 0.44 1o5e 0.80 1g36 1.23 2ojj 3.94 2wec 3.07 1qy2 2.59 2q63 1.40 2pcp 3.07 2g8r 2.03 2b1v 4.35 1fcx 0.96 2uwo 1.82 1lgw 2.97 2pgz 1.05 2j7h 2.42 3brn 4.41 1tni 1.14 1p1o 3.26 3c79 3.60 3eqr 0.52 2qrl 4.30 1o5b 3.98 1sw2 3.76 1b6k 0.01 1c5z 2.15 1q66 1.08 2epn 2.32 1mjj 0.90 1v2w 0.29 1hmt 3.50 1g2l 1.44 1sdv 1.40 3bxh 1.46 1kpm 1.25 1oyt 2.39 2vkm 1.87 1toi 2.44 2q88 2.88 2psv 0.26 1d4l 1.01 1tuf 1.68 1urg 3.05 1jet 0.17 1zp8 0.33 2clh 1.79 3gss 3.10 1dhi 2.88 3b5r 2.39 1n2v 0.98 1k4g 1.97 1mrx 2.17 1w5w 1.78 1c5t 1.54 1o2w 1.23 1b40 2.74 3d1z 1.90 1odj 1.30 1o2x 1.10 1kel 3.96 3e93 0.69 2bvs 4.12 1ogz 4.52 2pql 3.68 1d4i 3.85 1p19 1.41 1q63 1.27 2qhz 1.38 2avo 1.14 2exm 1.29 1u1w 3.45 3ccz 2.31 3b67 2.00 1v2q 0.05 2r3t 3.80 1o3i 3.56 1nfy 3.22 1v2s 2.19 3e5u 4.35 2qi1 1.21 1nt1 1.79 1c5y 1.25 1e1x 3.67 1ydt 1.33 2idw 2.52 1c87 0.45 1ydk 1.30 2j7e 2.98 1ec1 2.98 1ghw 0.88 2brm 2.35 2j2u 0.97 1kzn 1.91 1odi 1.57 1v2n 0.96 2r2m 1.15 1nfw 2.31 1pzi 2.41 1ugx 4.94 1uw6 3.36 1mq5 0.65 1ua4 0.32 1a4w 1.81 2gv6 4.11 1mu8 2.78 2bes 2.58 1kyv 1.43 1swg 3.27 2ien 1.54 1li3 3.68 1pph 1.49 1bxq 1.54 2j4i 1.02 2qrk 0.43 1yds 2.64 2qi0 0.96 2p4y 2.85 1e2l 0.27 1fv0 1.45 1k1m 2.06 2q54 0.36 1v2l 2.54 1b9j 2.86 1oar 2.44 1ebw 0.92 1det 2.02 1nl9 0.48 1owh 2.75 1ezq 1.94 1w3k 1.01 1o36 1.35 1r6n 3.00 1kdk 4.13 2cli 0.73 1c5s 4.62 2bpy 0.79 2qd8 1.59 1ajq 2.28 1ceb 3.33 1q5k 3.03 2rkf 0.65 2tpi 1.11 1hn2 3.67 1t32 1.39 1msn 1.28 2ha6 0.16 1ssq 4.10 2bak 3.39 1h22 1.95 4rsk 2.75 1xgj 4.02 1vzq 1.71 2qd7 1.63 1e3v 2.79 2bvd 3.07 2qbq 3.30 1ta6 1.86 1z6s 1.79 2izl 3.17 2r38 3.51 2qmg 2.98 2ha5 4.08 2r43 3.66 1sqo 2.87 2fgu 0.58

PAGE 123

123 1ghv 0.10 1eb2 2.60 2qhy 1.37 1d4k 0.45 1sv3 3.42 1tcw 0.11 2hhn 2.10 1fcz 2.41 1bhf 4.42 5er1 0.18 1lah 3.25 1hps 0.50 2hjb 1.36 1n8v 3.69 1lf2 0.94 1hpv 3.79 1p57 0.04 3bgb 1.40 1lke 2.29 1z1r 0.67 2pwd 2.72 1jmg 3.55 3ekr 3.46 1db1 2.01 1no6 0.40 2gst 2.05 2f1g 2.24 2avv 0.51 2pu2 0.70 1dzk 2.08 1k1j 2.01 1sdt 0.92 2r2w 1.32 1m2q 2.88 1lpk 0.37 3b65 3.66 1qan 4.44 2e7f 0.80 1v0l 1.28 1xow 3.21 2ogy 0.63 2flr 2.98 1vj9 0.27 2ayr 0.94 1j14 2.88 2p7a 4.45 3b50 3.11 1bv7 4.40 1gi7 0.87 2wed 0.70 3cct 1.14 2aqu 0.50 1gu1 1.85 1g32 2.78 1q95 2.15 2r6w 3.83 2a4m 2.67 1o2z 0.43 2avs 3.42 2z4b 5.88 1fj4 0.89 2qnq 3.37 2pyn 0.38 2p95 1.52 1l2s 1.55 1nc1 1.62 3cd0 1.89 1ebz 1.22 2jiw 0.46 2fgv 1.55 3cda 0.22 1ksn 0.25 1pa9 1.49 1ppc 0.32 1lpz 1.47 1met 1.25 1upf 4.92 2i4j 2.30 2psu 0.67 2o4r 3.11 2ha2 1.11 1d6v 0.33 2pvl 2.63 1hsh 1.14 2nsl 0.66 1yc1 2.35 3be9 2.85 1qbs 1.92 1c5n 3.15 1xhy 2.86 1f0r 0.48 2r6y 4.46 1c86 0.06 2qhm 3.52 1nny 2.16 1hvr 0.39 2j9n 2.96 1v2k 3.34 1ppk 2.06 2bys 1.90 2qfo 4.50 2bt9 4.27 2bpv 0.39 2g94 3.20 3kiv 3.96 2qbu 2.25 1o2q 2.44 1dmp 1.22 1v2t 1.23 1owe 3.24 1t4v 3.54 1df8 2.05 1v2o 0.83 2aj8 0.72 1w0z 2.12 1eby 2.36 1ce5 2.96 1wcq 3.92 3f8c 1.32 1mrw 2.57 1w4p 3.22 2ok1 0.53 1izh 0.23 2vh6 1.05 1q8t 1.45 3bgz 2.43 1mes 0.83 1z6e 0.95 2b4l 3.27 1pbq 3.80 1oau 5.47 2std 4.32 1oss 3.83 3bfu 4.71 1uou 2.05 2drc 3.78 2nsj 0.13 1lag 4.22 1vja 0.25 1hxb 1.42 1bju 0.43 1wdn 4.69 1ajv 2.71 1hvl 0.32 2nta 0.74 1wm1 3.39 1j4r 0.44 1bv9 2.73 1aid 2.65 2q89 3.76 1lee 1.05 1bxo 2.47 1v1j 2.31 3ebl 3.25 1nfu 1.43 1d4h 1.17 1xff 1.91 2bal 2.22 1qbo 2.81 1sdu 0.96 2gl0 0.40 1w3j 4.31 1qiw 3.23 1sr7 2.60

PAGE 124

124 2pwg 2.65 2qbr 1.51 1rd4 2.02 2o4j 3.66 1c83 0.97 3cj5 3.18 1sbg 0.63 2p16 2.12 1ecv 0.43 3ckp 2.98 2jh6 2.46 1y6r 2.55 1xgi 1.09 1c5q 3.90 2qm9 3.10 2qi3 2.41 3cj2 2.93 1ela 0.09 3cyw 0.90 2qi7 1.69 2brb 0.10 1o2o 1.21 1v48 3.86 1qbu 3.94 1p1q 1.82 1hbv 1.96 2ra0 3.16 1hvs 1.37 2j77 2.93 1hms 3.37 1b6h 0.46 1kzk 1.91 3cf8 1.75 1i7z 2.30 1nf8 2.61 2qi4 3.30 1bzc 2.97 1usk 4.64 1njs 0.01 2o4k 0.30 1e1v 0.92 2o2u 2.51 1nvs 3.34 1msm 0.89 1h1p 1.70 1syh 2.04 2j34 1.51 5std 3.96 1jqy 0.11 2cgf 4.46 3cd7 0.87 2o4s 0.79 1e2k 1.10 1o5a 1.05 1laf 2.56 1n46 3.16 2gss 4.01 1fh9 2.88 1lst 1.42 2f81 1.57 1cea 4.52 2b07 0.58 2gv7 1.89 1qbr 1.05 1o5g 2.19 1gt3 4.14 3ccw 3.49 2qi6 3.52 1tsy 1.97 1m2r 2.58 3cd5 1.52 3djk 3.03 3bxg 2.75 1r5y 3.41 1ajx 0.16 3cs7 3.91 1c84 0.55 1f4g 1.17 3d7z 0.73 2pk5 3.78 1nc3 1.75 1gpn 2.21 1epo 1.32 7std 4.46 2oym 0.30 1rpj 3.37 1g2k 0.27 1hxw 2.24 2glp 1.18 1dl7 4.28 1hvh 1.81 2qi5 3.94 2d0k 0.73 3cj4 3.88 2qe4 2.75 2pk6 1.49 3bgc 2.12 1ct8 0.51 3bgq 3.32 1hvi 1.21 2qt5 0.37 1g74 3.99 3e92 1.32 1hvk 0.02 2c3l 1.44 1gt4 2.41 1a4k 3.49 1mq6 2.73 1enu 2.49 1qy1 3.57 1apw 1.55 2i4v 1.14 2qtg 0.06 1zhy 3.34 1fkg 2.34 1hpx 1.47 1m48 1.14 2nt7 1.74 1gaf 2.79 2q5k 2.57 1nw7 0.37 2oxy 3.86 1r0p 1.03 2hb3 1.27 1od8 1.42 3d0b 0.20 1vyg 2.23 1hvj 1.97 1onz 1.47 1dhj 1.79 2vnt 1.46 2i0a 0.99 1v0k 2.64 1hmr 3.58 3cyx 0.78 2i4u 0.87 1w7g 0.73 1n4h 3.94 1b7h 0.44 2i4w 2.36 1k4h 0.91 1yc4 3.91 1vyf 3.88 2i4d 0.86 2web 1.46 2bok 0.58 2bqv 0.19 1y6q 2.82 1gcz 0.40 2pwc 0.68 1gni 4.03 2i4x 1.49 2br1 0.30 1k1i 1.92 2rkg 0.57

PAGE 125

125 LIST OF REFERENCES (1) Liu, T.; Lin, Y.; Wen, X.; Jorissen, R. N.; Gilson, M. K. BindingDB: A Web Accessible Database of Experimentally Determined Protein Ligand Binding Affinities. Nucl. Acids Res. 2007 35 198 201. (2) Congreve, M.; Chessari, G.; Tisi, D.; Woodhead, A. J. Recent Developments in Fragment Based Drug Discovery. J. Med. Chem. 2008 51 3661 3680. (3) Mobley, D. L.; Dill, K. A. Binding of Small Structure 2009 17 489 498. (4) Lipinski, C. A.; Lombardo, F.; Dominy, B. W.; Feeney P. J. Experimental and computational approaches to estimate solubility and permeability in drug discovery and development settings. Adv. Drug. Del iv. Rev. 2001 46 3 26. (5) Wang, R.; Lai, L.; Wang, S. Further development and validation of empirical scoring functions for structure based binding affinity prediction. J. Comput. Aided Mol. Des. 2002 16 11 26. (6) Merz, K. M. Limits of Free Energy Co J. Chem. Theory Comput. 2010 6 1769 1776. (7) Kolb, P.; Irwin, J. J. Docking Screens: Right for the Right Reasons? Current Topics in Medicinal Chemistry 2009 9 755 770. (8) Ewing, T. J. A.; Makino, S.; Skillma n, A. G.; Kuntz, I. D. DOCK 4.0: Search Strategies for Automated Molecular Docking of Flexible Molecule Databases. J. Comput. Aided Mol. Des. 2001 15 411 428. (9) Morris, G. M.; Goodsell, D. S.; Halliday, R. S.; Huey, R.; Hart, W. E.; Belew, R. K.; Olson A. J. Automated docking using a Lamarckian genetic algorithm and an empirical binding free energy function. J. Comput. Chem. 1998 19 1639 1662. (10) Weiner, S J.; Kollman, P. A.; Case, D. A. A new force field for molecular mechanical simulation of nucleic acids and proteins. J. Am. Chem. Soc. 1984 106 765 784. (11) Weiner, S. J.; Kollman, P. A.; Nguyen, D. T.; Case, D. A. An all atom force field for simulati ons of proteins and nucleic acids. J. Comput. Chem. 1986 7 230 252. Interactions: A Simplified Potential Approach. J. Med. Chem. 1999 42 791 804. (13) Muegge, I. A kn owledge based scoring function for protein ligand interactions: Probing the reference state. Perspect. Drug Discovery Des. 2000 20 99 114.

PAGE 126

126 (14) Muegge, I. Effect of ligand volume correction on PMF scoring. J. Comput. Chem. 2001 22 418 425. (15) Gohlke, H.; Hendlich, M.; Klebe, G. Knowledge based scoring function to predict protein ligand interactions. J. Mol. Biol. 2000 295 337 356. (16) Ishchenko, A. V.; Shakhnovich, E. I. Small molecule growth 2001 (SMoG2001): An improved knowledge based scoring fun ction for protein ligand interactions. J. Med. Chem. 2002 45 2770 2780. (17) Mitchell, J. B. O.; Laskowski, R. A.; Alex, A.; Thornton, J. M. BLEEP potential of mean force describing protein ligand interactions: I. Generating potential. J. Comput. Chem. 1 999 20 1165 1176. (18) Velec, H. F. G.; Gohlke, H.; Klebe, G. DrugScore CSD knowledgebased scoring function derived from small molecule crystal data with superior recognition rate of near native ligand poses and better affinity prediction. J. Med. Chem. 2 005 48 6296 6303. (19) Huang, S. Y.; Zou, X. An iterative knowledge based scoring function to predict protein ligand interactions: II. Validation of the scoring function. J. Comput. Chem. 2006 27 1876 1882. (20) Huang, S. Y.; Zou, X. Inclusion of Solvation and Entropy in the Knowledge Based J. Chem. Inf. Model. 2010 50 262 273. (21) Jones, G.; Wilett, P.; Glen, R. C.; Leach, A. R.; Taylor, R. Development and validation of a genetic algorithm for flexible docking. J. Mol. Biol. 1997 267 727 748. (22) Rarey, M.; Kramer, B.; Lengauer, T.; Klebe, G. A. A Fast Flexible Docking Method using an Incremental Construction Algorithm. J. Mol. Biol. 1996 261 470 489. (23) Friesner, R. A.; Banks, J. L.; Murphy, R. B.; Halgren, T.A.; Klicic, J. J.; Mainz, D. T.; Repasky, M. P.; Knoll, E.H.; Shelley, M., Perry, J. K.; Shaw, D. E.; Francis, P.; Shenkin, P. S. Glide: A New Approach for Rapid, Accurate Docking and Scoring. 1 Method and Assessment of Docking Accuracy. J. Med. Chem. 2004 47 1739 1749. (24) Wang, R.; Liu, L.; Lai, L.; Tang, Y. SCORE: A New Empirical Method for Estimating the Binding Affinity of a Protein Ligand Complex. J. Mol. Model. 1998 4 379 394. (25) Korb, O.; Sttzle, T.; Exner, T. E. Empirical Scoring Functions for Advanced J. Chem. Inf. Model. 2009 49 84 96.

PAGE 127

127 (26) Teramoto, R.; Fukunishi, H. Supervised Consensus Scoring for Docking and Virtual Screening. J. Chem. Inf. Model. 2007 47 526 534. (27) Deng, Y.; Roux, B. Computations of Standard Binding Free Energies with Molecular Dynamics Simulations. J. Phys. Chem. B 2009 113 2234 2246. (28) Wang, R.; Fang, X.; Lu, Y.; Yang, C. Y.; Wang, S. The PDBbind Database: Methodologies and Updates. J. Med. Chem 2005 48 4111 4119. (29) Wang, R.; Fang, X.; Lu, Y.; Wang, S. The PDBbind Database: Collection of Binding omplexes with Known Three Dimensional Structures. J. Med. Chem. 2004 47 2977 2980. (30) Sndergaard, C. R.; Garrett, A. E.; Carstensen, T.; Pollastri, G.; Nielsen J. E. ray Structures: Implications for the Develop ment of Docking Scoring Functions. J. Med. Chem. 2009 52 5673 5684. binding constant for a protein ligand complex of known three dimensional structure. J. Comput. Aid ed Mol. Des. 1994 8 243 256. (32) Cornell, W. D.; Cieplak, P.; Bayly, C. I.; Gould, I. R.; Merz, K. M.; Ferguson, D. M.; Spellmeyer, D. C.; Fox, T.; Caldwell, J. W.; Kollman, P. A. A Second Generation Force Field for the Simulation of Proteins, Nucleic A cids, and Organic Molecules. J. Am. Chem. Soc. 1995 117 5179 5197. (33) Vedani, A. YETI: An interactive molecular mechanics program for small molecule protein complexes. J. Comput. Chem. 2004 9 269 280. (34) Vedani, A.; Dunitz, J. D. Lone pair directi onality in hydrogen bond potential functions for molecular mechanics calculations: the inhibition of human carbonic anhydrase II by sulfonamides. J. Am. Chem. Soc. 1985 107 7653 7658. J. Comput. Aided Mol. D es. 2009 23 693 704. (36) Sarkhel, S.; Desiraju, G. R. N protein ligand complexes: Strong and weak interactions in molecular recognition. Proteins: Struct. Funct. Bioinf. 2004 54 247 259. (37) Connolly, M. L. The molecular surface package. J. Mol. Graphics 1993 11 ,139 141. (38) Huang, S. Y.; Kuntz, I. D.; Zou, X. Pairwise GB/SA Scoring Function for Structure based Drug Design. J. Phys. Chem. B 2004 108 5453 5462.

PAGE 128

128 (39) Vedani, A.; Huhta, D.W. A new force field for modeling metalloproteins. J. Am. Chem. Soc. 1990 112 4759 4767. (40) Wang, R.; Fu, Y.; Lai, L. A New Atom Additive Method for Calculating Partition Coefficients. J. Chem. Inf. Comput. Sci. 1997 37 615 621. (41) Wang, R.; Lu, Y.; Wang, S. Comparative Evaluation of 11 Scoring Functions for Molecular Docking. J. Med. Chem. 2003 46 2287 2303. (42) Zhang, C.; Liu, S.; Zhu, Q.; Zhou, Y. A Knowledge Based Energy Function for es. J. Med. Chem. 2005 48 2325 2335. (43) Gehlhaar, D. K.; Verkhivker, G. M.; Rejto, P. A.; Sherman, C. J.; Fogel, D. B.; Freer, S. T. Molecular recognition of the inhibitor AG 1343 by HIV 1 Protease: Conformationally flexible docking by evolutionary pro gramming. Chem. Biol. 1995 2 317 324. (44) Gehlhaar, D. K.; Bouzida, D.; Rejto, P. A. In Rational Drug Design: Novel Methodology and Practical Applications ; Parrill, L., Reddy, M. R., Ed.; American Chemical Society: Washington, DC, 1999; Vol.719, pp 292 311. (45) Jones, G.; Willett, P.; Glen, R. C.; Leach, A. R.; Talor, R. Development and validation of a genetic algorithm for flexible docking. J. Mol. Biol. 1997 267 727 748. (46) Meng, E. C.; Shoichet, B. K.; Kuntz, I. D. Automated docking with grid based energy approach to macromolecule ligand interactions. J. Comput. Chem. 1992 13 505 524. (47) Eldridge, M. D.; Murray, C. W.; Auton, T. R.; Paolini, G. V.; Mee, R. P. Empirical scoring functions: I. The development of a fast empirical scoring funct ion to estimate the binding affinity of ligands in receptor complexes. J. Comput. Aided Mol. Des. 1997 11 425 445. polarization of hits obtained from de novo desig n or 3D database search programs. J. Comput. Aided Mol. Des. 1998 12 309 323. (49) CERIUS2 LigandFit (50) Rarey, M.; Kramer, B.; Lengauer, T.; Klebe, G. A fast flexible docking method using an inc remental construction algorithm. J. Mol. Biol. 1996 261 470 489. (51) Morris, G. M.; Goodsell, D. S.; Halliday, R. S.; Huey, R.; Hart, W. E.; Belew, R. K.; Olson, A. J. Automated docking using a Lamarckian genetic algorithm and an empirical binding free energy function. J. Comput. Chem. 1998 19 1639 1662.

PAGE 129

129 (52) Nobeli, I.; Mitchell, J. B. O.; Alex, A.; Thornton, J. M. Evaluation of a knowledge based potential of mean force for scoring docked proteinligand complexes. J. Comput. Chem. 2001 22 673 688. (53) DeWitte, R. S.; Shakhnovich, E. I. SMoG: de novo design method based on simple, fast, and accurate free energy estimates. 1. Methodology and supporting evidence. J. Am. Chem. Soc. 1996 118 11733 11744. (54) Wang, R., Lu, Y., Fang, X., Wang S. An Ext ensive Test of 14 Scoring Functions Using the PDBbind Refined Set of 800 Protein Ligand Complexes. J. Chem. Inf. Comput. Sci. 2004 44 2114 2125. (55) Hu, L., Benson, M. L., Smith, R. D., Lerner, M. G., Carlson, H. A. Binding MOAD (Mother Of All Databases ). Proteins 2005 60 333 400. (56) Sippl, M. J. Calculation of conformational ensembles from po tentials of mean force. J. Mol. Biol. 1990 213, 859 883. (57) Miyazawa S.; Jernigan R. L. Estimation of effective interresidue contact energies from protein crystal structures: quasi chemical approximation. Macromolecules 1985 18, 534 552. (58) Hendlich, M.; Lackner, P.; Weitckus, S.; Floeckner, H.; Froschauer, R.; Gottsbacher, K.; Casari, G.; Sippl, M. J. Identification of native protein folds amongst a lar ge number of incorrect models. The calculation of low energy conformations from potentials of mean force. J. Mol. Biol 1990 216, 167 180. (59) Jones, D. T.; Taylor, W. R.; Thornton, J. M. A new approach to protein fold recognition. Nature 1992 358, 86 89. (60) Thomas, P. D.; Dill, K. A. An iterative method for extracting energy like quantities from protein structures. Proc. Natl. Acad. Sci. USA 1996 93, 11628 11633. (61) Thomas, P. D.; Dill, K. A. Statistical potentials extracted from protein structures: How accurate are they? J. Mol. Biol. 1996 257, 457 469. (62) Lu, H.; Skolnick, J. A Distance Dependent Atomic Knowledge Based Potential for Improved Protein Structure Selection. Proteins: Struct., Funct., Genet. 2001 44, 223 232. (63) Kirkwoo d, J. G. Statistical Mechanics of fluid Mixtures. J. Chem. Phys. 1935, 3, 300 313. (64) Fan, H.; Schneidman Duhovny, D.; Irwin, J.J.; Dong, G.; Shoichet, BK.; Sali, A. Statistical potential for modeling and ranking of protein ligand interactions. J. Chem. Inf. Model. 2011 51(12), 3078 3092.

PAGE 130

130 (65) Zheng Z.; Merz, K M. Ligand Identification Scoring Algorithm (LISA) J. Chem. Inf. Model. 2011 51, 1296 1306 (66) Jorgensen, W. L.; Tirado Rives, J., Monte Carlo vs molecular dynamics for conformational sampling. J. Phys. Chem. 1996 100, 14508 14513. (67) Cancs, E.; Legoll, F.; Stoltz, G., Theoretical and numerical comparison of some sampling methods for molecu lar dynamics. ESAIM, Math. Model. Numer. Anal. 2007 41, 351 389. (68) Gallicchio, E.; Levy, R. M., Advances in all atom sampling methods for modeling protein ligand binding affinities. Curr. Opin. Struct. Biol. 2011 21, 161 16 6. (69) Halperin, I.; Ma, B. ; Wolfson, H.; Nussinov, R., Principles of docking: An overview of search algorithms and a guide to scoring functions. Proteins. 2002 47, 409 4 43. (70) Limongelli, V.; Marinelli, L.; Cosconati, S.; La Motta, C.; Sartini, S.; Mugnaini, L.; Da Settimo, F.; Novellino, E.; Parrinello, M., Sampling protein motion and solvent effect during ligand binding. Proc Natl Acad Sci U S A 2012 109, 1467 147 2. (71) Clark, M. ; Guarnieri, F. ; Shkurko, I. ; Wiseman, J. Grand Canonical Monte Carlo J. Chem. Inf. Model. 2006 46, 231 242 (72) Okamoto, Y., Generalized ensemble algorithms: enhanced sampling techniques for Monte Carlo and molecular dynamics simulations. J. Mol. Graphics Modell. 2004 22, 425 439. (73) Jones Hertzog, D. K.; Jorgensen, W. L., Binding affinities for sulfonamide inhibitors with human thrombin using Monte Carlo simulations with a linear response method J. Med. Chem. 1997 40, 1539 15 49. (74) Shaw, D. E.; Maragakis, P.; Lindorff Larsen, K.; Piana, S.; Dror, R. O.; Eastwood, M. P.; Bank, J. A.; Jumper, J. M.; Salmon, J. K.; Shan, Y.; Wriggers, W., Atomic Level Characterization of the Structural Dynamics of Proteins. Science. 2010 330, 341 346. (75) Kannan, S.; Zacharias, M., Simulated annealing coupled replica exchange molecular dynamics An efficient conformational sampling method. J. Struct. Biol. 2009 166, 288 294. (76) Klepeis, J. L.; Lindorff Larsen, K.; Dror, R. O.; Shaw, D. E., Long timescale molecular dynamics simulations of protein structure and function. Curr. Opin. Struct. Biol. 2009 19, 120 12 7.

PAGE 131

131 (77) Schlick, T., Molecular dynamics based approaches for enhanced sampling of long time, large scale conformational changes in biomole cules. F1000 Biol. Rep. 2009 1, 51. (78) Bonomi, M.; Branduardi, D.; Bussi, G.; Camilloni, C.; Provasi, D.; Raiteri, P.; Donadio, D.; Marinelli, F.; Pietrucci, F.; Broglia, R. A.; Parrinello, M., PLUMED: A portable plugin for free energy calculations with molecular dynamics. Comput. Phys. Commun. 2009 180, 1961 1972. (79) Gilson, M. K.; Zhou, H. X., Calculation of Protein Ligand Binding Affinities. Annu. Rev. Biophys. Biomol. Struct. 2007 36, 21 42. (80) Karney, C. F.; Ferrara, J. E.; Brunner, S., Method for computing protein binding affinity. J. Comput. Chem. 2005 26, 243 251. (81) Michel, J.; Essex, J. W., Prediction of protein ligand binding affinity by free energy simulations: assumptions, pitfalls and expectations. J. Comput. Aided Mol. Des. 2010 24, 639 6 58. (82) Tuffery, P.; Derreumaux, P., Flexibility and binding affinity in protein ligand, protein protein and multi component protein interactions: limitations of current computational approaches. J. R. Soc. Interface. 2012 9, 20 33. (83) Benson, M. L.; Faver, J. C.; Ucisik, M. N.; Dashti, D. S.; Zheng, Z.; Merz Jr, K. M., Pred iction of trypsin/molecular fragment binding affinities by free energy decomposition and empirical scores. J. Comput. Aided Mol. Des. 2012 26, 647 659. (84) Faver, J. C.; Zheng, Z.; Merz, K. M., Jr., Model for the fast estimation of basis set superposition error in biomolecular systems. J. Chem. Phys. 2011 135, 144110. (85) Faver, J. C.; Zheng, Z.; Merz, K. M., Jr., Statistics based model for basis set superp osition error correction in large biomolecules. Phys. Chem. Chem. Phys. 2012 14, 7795 779 9. (86) Faver, J. C.; Yang, W.; Merz, K. M., Jr., The Effects of Computational Modeling Errors on the Estimation of Statistical Mechanical Variables. J. Chem. Theory Comput. 2012 8, 3769 3776. (87) Zheng, Z.; Merz Jr, K. M., Development of the Knowledge Based and Empirical Combined Scoring Algorithm (KECSA) To Score Protein Ligand Interactions. J. Chem. Inf. Model. 2013 53, 1073 1083. (88) Goodsell, D. S.; Morris, G. M.; Olson, A. J., Automated docking of flexible ligands: applications of AutoDock. J. Mol. Recognit. 1996 9, 1 5. (89) Leis, S.; Zacharias, M., Efficient inclusion of receptor flexibility in grid based protein ligand docking. J. Comput. Chem. 2011 32, 3 433 343 9.

PAGE 132

132 (90) Taylor, R. D.; Jewsbury, P. J.; Essex, J. W., A review of protein small molecule docking methods. J. Comput. Aided Mol. Des. 2002 16, 151 1 66. (91) Wu, G.; Robertson, D. H.; Brooks, C. L., 3rd; Vieth, M., Detailed analysis of grid based molecular docking: A case study of CDOCKER A CHARMm based MD docking algorithm. J. Comput. Chem. 2003 24, 1549 15 62. (92) Kubinyi, H., Comparative molecular fiel d analysis (CoMFA). Handbook of Chemoinformatics: From Data to Knowledge in 4 Volumes; Wiley VCH: Weinheim, 2008 1555 1574. (93) Goodford, P. J., A computational procedure for determining energetically favorable binding sites on biologically important mac romolecules. J. Med. Chem. 1985 28, 849 8 57. (94) Halgren, T. A.; Murphy, R. B.; Friesner, R. A.; Beard, H. S.; Frye, L. L.; Pollard, W. T.; Banks, J. L., Glide: a new approach for rapid, accurate docking and scoring. 2. Enrichment factors in database scr eening. J. Med. Chem. 2004 47, 1750 175 9.

PAGE 133

133 BIOGRAPHICAL SKETCH Zheng Zheng was born in Beijing, China in 1984. He graduated from Beijing No. 4 Hi gh School in 2002, after which he entered Peking University in China. group later in the same year. Zheng was qualified in physical chemistry division as a Ph D candidate in May 201 2 Zheng defended his research and graduated in the fall of 2013 from the University of Florida with a Ph D in chemistry, specialized in computational chemistry.