Computational Studies of Protein-Ligand Interaction through QM/MM Methods and Virtual Screening

Material Information

Computational Studies of Protein-Ligand Interaction through QM/MM Methods and Virtual Screening
Hayik, Seth
Place of Publication:
[Gainesville, Fla.]
University of Florida
Publication Date:
Physical Description:
1 online resource (190 p.)

Thesis/Dissertation Information

Doctorate ( Ph.D.)
Degree Grantor:
University of Florida
Degree Disciplines:
Committee Chair:
Merz, Kenneth Malcolm
Committee Members:
Deumens, Erik
Richards, Nigel G.
Hagen, Stephen J.
Dunbrack, Roland
Graduation Date:


Subjects / Keywords:
Active sites ( jstor )
Atoms ( jstor )
Docking ( jstor )
Docks ( jstor )
Entropy ( jstor )
Free energy ( jstor )
Ligands ( jstor )
Molecules ( jstor )
Solvation ( jstor )
Solvents ( jstor )
Chemistry -- Dissertations, Academic -- UF
amber, autodock, binding, dock, qm, qmmm, screening, semiempirical, virtual
Electronic Thesis or Dissertation
born-digital ( sobekcm )
Chemistry thesis, Ph.D.


Computational methods have been successfully used to speed the drug discovery process and to provide a more robust search of chemical space for new drug candidates. These goals are accomplished through calculation of free energy upon binding using various methods or the use of virtual screening. Both methods can be used to theoretically predict a ligand?s effectiveness before proceeding with the more costly experimental phase. This dissertation will discuss the development and use of these methods and their applicability to a drug discovery application. The first two chapters of this dissertation will provide a background for computational techniques used in drug discovery. The techniques presented range from knowledge-based scoring functions from a statistical analysis of the PDB to using quantum mechanical parameters from a semi-empirical method. These chapters will also provide information on mixed Quantum Mechanical/Molecular Mechanical (QM/MM) methods as well as calculating important terms when computationally determining binding free energies such as solvation and entropy. The next two chapters detail the development and implementation of a QM/MM solvation model and scoring function. DivCon was integrated with the AMBER molecular mechanics program to produce a QM/MM package that can perform linear scaling semi-empirical calculations using AMBER?s force fields. The solvation model is based on a finite-difference Poisson-Boltzmann solver available in the DivCon program. This method was validated through calculation of solvation energy on a set of pentapeptides as well as small proteins. The QM/MM solvation energy was compared to the full QM solvation energy for various sized QM regions and only small differences were found. The QM/MM scoring function made use of this solvation method to calculate the solvation free energy of 23 zinc containing proteins. This QM/MM scoring function was validated against the set of proteins by comparing the predicted binding free energies of these proteins to the experimental binding free energies. Following this, the next chapter discusses the use of virtual screening methods to find a new inhibitor of a deubiquitinating cysteine protease protein, HAUSP/USP7. DOCK and AutoDock were utilized to screen a commercially available subset of compounds from the ZINC database. First, a potential active site was identified, and then the ligands were docked and ranked. These ranked compounds were then analyzed, purchased and tested experimentally to determine their potency for the target protein. Several ligands bound with low micromolar efficiency and the results were examined. Future directions for this study are also suggested at the end of this chapter to determine a more specific stronger binding ligand. ( en )
General Note:
In the series University of Florida Digital Collections.
General Note:
Includes vita.
Includes bibliographical references.
Source of Description:
Description based on online resource; title from PDF title page.
Source of Description:
This bibliographic record is available under the Creative Commons CC0 public domain dedication. The University of Florida Libraries, as creator of this bibliographic record, has waived all rights to it worldwide under copyright law, including all related and neighboring rights, to the extent allowed by law.
Thesis (Ph.D.)--University of Florida, 2009.
Adviser: Merz, Kenneth Malcolm.
Electronic Access:
Statement of Responsibility:
by Seth Hayik.

Record Information

Source Institution:
University of Florida
Holding Location:
University of Florida
Rights Management:
Copyright Hayik, Seth. Permission granted to the University of Florida to digitize, archive and distribute this item for non-profit research and educational purposes. Any reuse of this item in excess of fair use or other copyright exemptions requires permission of the copyright holder.
Embargo Date:
Resource Identifier:
665066375 ( OCLC )
LD1780 2009 ( lcc )


This item has the following downloads:

Full Text




2009 Seth A. Hayik 2


To Sara and my parents 3


ACKNOWLEDGMENTS First, I would like to thank my wife Sara for providing motivation and support throughout my graduate studies. She has had the foresight to anticipate what would truly make me happy and the follow-through to get me to do it. I would also like to thank my pa rents for their support for everything I have done. They have always provided a kind ear and sympathetic nod when necessary, along with important advi ce to drive me to complete my goals even when I was ready to give up. I would also like to thank Kennie for his suggestions, s upport and understanding over the years. For accepting me into his group, allowing me to fulfill my personal, as well as professional, goals I will always be grateful to Roland. Finally, I would like to thank the members of the Merz group for all of the conve rsations and advice, especially Ken, Bryan, Kevin, Duane, Kaushik and Guanglei for their support and friendship. 4


TABLE OF CONTENTS page ACKNOWLEDGMENTS...............................................................................................................4 LIST OF TABLES................................................................................................................. ..........8 LIST OF FIGURES.......................................................................................................................10 ABSTRACT...................................................................................................................................13 CHAPTER 1 INTRODUCTION................................................................................................................. .15 2 THEORY AND METHODS..................................................................................................20 Protein-Ligand Binding..........................................................................................................22 Calculation of Free Energy..............................................................................................23 Calculation of Entropy....................................................................................................24 Vibrational Contribution to Entropy...............................................................................26 Computational Drug Design...................................................................................................26 Structure Based Drug Design..........................................................................................27 Ligand Based Drug Design.............................................................................................29 Scoring Functions............................................................................................................30 Empirical scoring functions.....................................................................................30 Physical chemical methods......................................................................................32 Knowledge based functions.....................................................................................34 Molecular Mechanics............................................................................................................ ..35 Semi-Empirical Quantum Mechanics.....................................................................................36 Divide-and-Conquer Semi-Empir ical Quantum Mechanics...................................................38 The QM/MM Method.............................................................................................................39 QM/MM Implementation................................................................................................39 Link Atoms and the QM/MM Boundary.........................................................................40 Solvation Methods.............................................................................................................. ....41 The Poisson-Boltzmann Equation...................................................................................42 Finite Difference Poisson-Boltzmann.............................................................................43 Generalized Born Solvation............................................................................................44 3 A MIXED QM/MM SOLVATION METHOD......................................................................49 Introduction................................................................................................................... ..........49 Methods..................................................................................................................................51 QM/MM Implementation................................................................................................51 Poisson-Boltzmann Combined with DivCon..................................................................52 Solvation Free Energy in a QM/MM Calculation...........................................................53 Preparation of Test Cases................................................................................................55 5


Results and Discussion ......................................................................................................... ..57 Increasing QM Size.........................................................................................................58 Solvation Free Energies of Small Proteins......................................................................60 Timings............................................................................................................................60 Standard....................................................................................................................61 Divide and conquer..................................................................................................62 Conclusions.............................................................................................................................63 4 A MIXED QM/MM SCORING FUNCTI ON TO PREDICT PROTEIN-LIGAND BINDING AFFINITY............................................................................................................75 Introduction................................................................................................................... ..........75 Methods..................................................................................................................................78 QM/MM Implementation................................................................................................78 Preparation of QM/MM Calculations..............................................................................79 Preparation of Proteins....................................................................................................80 Calculation of Binding Affinity.......................................................................................82 Regression Analysis........................................................................................................87 Results and Discussion......................................................................................................... ..88 Use of ESCF Versus Total Energy..................................................................................88 Importance of Long-Range Cutoff..................................................................................89 Affect of Minimization on Binding Affinity...................................................................91 Binding Affinity Predictions Through MLR...................................................................94 Vibrational Contributio n to Binding Affinity..................................................................98 Ligand only vibrations.............................................................................................99 Use of rotatable bond estimate...............................................................................102 Charge Analysis.............................................................................................................103 Conclusions...........................................................................................................................104 5 DISCOVERY OF A NOVEL INHIBITOR OF A UBIQUITIN SPECIFIC PROTEASE USING VIRTUAL SCREENING........................................................................................126 Introduction................................................................................................................... ........126 Methods................................................................................................................................131 CADD Screening...........................................................................................................131 Preparation of Protein Structure....................................................................................132 Identification of Binding Site........................................................................................133 Screening Protocol.........................................................................................................134 Preparation of Dataset...................................................................................................136 Results and Discussion......................................................................................................... 136 Identification of Potential Active Site...........................................................................136 Sphere clustering....................................................................................................136 AutoDock blind docking........................................................................................139 Testing of the Putative Binding Site..............................................................................142 DOCK.....................................................................................................................142 AutoDock...............................................................................................................143 Inhibitors from Vi rtual Screening.................................................................................144 6


Analysis of Experim entally Active Compounds...........................................................147 Conclusions...........................................................................................................................150 6 CONCLUSIONS.................................................................................................................. 173 APPENDIX SAMPLE INPUT FILES......................................................................................................175 QM/MM Input......................................................................................................................175 DivCon Input........................................................................................................................176 LIST OF REFERENCES.............................................................................................................179 BIOGRAPHICAL SKETCH.......................................................................................................190 7


LIST OF TABLES Table page 3-1 Solvation free energy of pentapeptides with varying QM region size. The signed difference between full QM DivCon solvat ion and the QM/MM solvation energy are listed in parentheses with the average uns igned difference listed at the bottom...............69 3-2 Solvation free energy of increasing QM size for four different protein systems...............70 3-3 Solvation free energy differences for four proteins...........................................................71 3-4 Average time for a SCF/SCRF cycle in 1usm and number of diagonalizations needed for each cutoff distance in a st andard calculation in DivCon............................................72 3-5 Average time for a SCF/SCRF cycle in 1usm and the number of diagonalizations needed for each cutoff distance in a di vide and conquer calc ulation in DivCon...............73 4-1 Number of atoms in the QM region for the protein, ligand and protein-ligand complex for each of the metalloenzymes.........................................................................110 4-2 PDB ID, resolution (in ), inhi bitor, type and experimental G of the 23 complexes used in this study. CA re presents carbonic anhydrase proteins, CPA represents carboxypeptidase proteins and the experime ntal binding free energy was calculated from the Ki as (RTln(Ki))...............................................................................................111 4-3 Square of the correlation coefficients a nd standard deviations (in kcal/mol) for one and two minimization steps using ESCF and Total Energy for the heat of interaction...117 4-4 Comparison of squared correlation coefficients and standard deviations between Full QM results from Raha32 and QM/MM method described here.......................................119 4-5 Breakdown of each component in the scor ing function for each protein after MLR weighting. All units in kcal/mol.......................................................................................120 4-6 Time, in minutes, to calculate the QM region and the ligand only vibrational frequencies with the QM/MM method for the protein-ligand complex..........................121 4-7 Charge transfer from the protein to th e ligand, in electrons, for Mulliken and CM1 charges in vacuum............................................................................................................12 4 4-8 Charge transfer from the protein to th e ligand, in electrons, for Mulliken and CM1 charges in solution...........................................................................................................1 25 5-1 van der Waals(VDW), electrostatic (ES) and total score from DOCK for ZINC1637098 and ZINC1636516...................................................................................161 8


5-2 AutoDock lowest energy scores and cluster sizes for ZINC1637098 and ZINC1636516..................................................................................................................164 5-3 AutoDock largest cluster scores and cluster sizes for ZINC1637098 and ZINC1636516..................................................................................................................165 5-4 Structures, scores and experimental IC50 values (in M) of the five active compounds selected from virtual screening.....................................................................169 9


LIST OF FIGURES Figure page 2-1 Thermodynamic cycle used in protein-ligand binding calculations..................................46 2-2 Subdivision of global system in to a core region and two buffers.60..................................47 2-3 A) Schematic view of QM/MM cu t (curved line) between alpha carbon (CA) and amide nitrogen with hydrogen link atom added (HL) along bond vector. B) Structure, including link atom, as seen by AMBER..........................................................48 3-1 Representative view of QM/MM region and its cut off points. This also shows an example of the peptide designation us ed in Table 3-1 (NME cap plus one Ala residue)....................................................................................................................... ........65 3-2 Schematic view of SCRF calculation procedure...............................................................66 3-3 The terms involved in obtaining the free energy of solvation...........................................67 3-4 Example alanine pentapeptide, capped with NME and ACE............................................68 3-4 Average time per SCF/SCRF cycl e in 1usm for cutoff distances of 3 -24 D&C stands for a divide a nd conquer calculation.......................................................................74 4-1 Simple representation of a QM/MM calcu lation. The QM area (pink) extends 5 from the ligand while the rest of the protein (blue) is treated with a molecular mechanics force field.......................................................................................................108 4-2 Graphical representation of QM/MM pa rtitioning scheme. The area in brackets represents the QM region. A) The r ough QM region with a PRO residue at the boundary, and cuts the peptide bonds between the C and N. B) The refined structure which extends the QM boundary to include the PRO residue and move the boundary to cut the C-CA bond to avoid spurio us polarization at the boundary.............................109 4-3 Schematic view of the thermodynamic cycle to calculate binding affinity. The cycle calculates the protein, ligand and complex in vacuum and then transfers them to solvent to find the solvation free energy..........................................................................112 4-4 Calculated binding affinity for ESCF and total energy versus the experimental binding free energy without fitting. The squares re present the ESCF energy (R2=0.601) and the diamonds represent the total energy (R2=0.608)..............................113 4-5 Calculated versus experimental binding affinities for the zinc proteins to examine the effect of long-range cutoff. The diamonds represent the calculated binding affinity with a cutoff of 100 (R2=0.61). The squares repres ent the calculated binding affinity with a cutoff of 10 (R2=0.39)..........................................................................114 10


4-6 Calculated versus experimental binding free energy for one minimization step and two minimization steps using the total ener gy of the protein. The squares represent one minimization step (R2=0.49) and the diamonds represent two steps (R2=0.60).......115 4-7 Calculated versus experimental binding free energy for one minimization step and two minimization steps using the ESCF energy. The squares represent one minimization step (R2=0.60) and the diamonds repres ent two minimization steps (R2=0.61)..........................................................................................................................116 4-8 Calculated vs. experimental G of binding for the 23 zinc protein-ligand complexes. Pink squares represent the binding affin ity calculated without any fitting to experimental data (R2=0.61). Blue diamonds represen t calculated binding affinity after fitting components of th e scoring function using MLR (R2=0.75)..........................118 4-9 Binding affinity prediction using the nu mber or rotatable bonds on the ligand to estimate the vibrational entr opy with and without MLR fitti ng. Squares represent the prediction without fitting (R2=0.59) and the diamonds represent the prediction with fitting (R2=0.69)...............................................................................................................122 4-10 Correlation of T Svib with the number of rotatable bonds for each ligand in the set from ligand only frequency calculations (R2=0.80).........................................................123 5-1 USP7 finger, thumb and palm domains illustrated.181.....................................................153 5-2 Conformational change of catalytic triad upon ubiquitin binding...................................154 5-4 A) 2D and B) 3D structure of ZINC107402 used for blind docking...............................156 5-5 Ligand docking position on USP7 using chosen sphere set............................................157 5-6 Clustered ligand positions from blind docking with a coarse grid in AutoDock.............158 5-7 Ligand docking position from fine grid calculation in AutoDock...................................159 5-8 2D structures of A) ZINC1637098 and B) ZINC1636516..............................................160 5-9 A) ZINC1637098 binding mode predicted by DOCK. B) ZINC1636516 binding mode predicted by DOCK...............................................................................................162 5-10 Largest cluster(blue) and lowest energy(green) binding modes predicted by AutoDock for ZINC1637098...........................................................................................163 5-11 Largest cluster(blue) and lowest energy(green) binding modes predicted by AutoDock for ZINC1636516...........................................................................................166 5-12 Top 250 structures ranked by tota l interaction energy from DOCK...............................167 5-13 Top 250 structures ranked by total energy from AutoDock............................................168 11


5-14 Binding modes of the five experimental hits on USP7....................................................171 5-15 Structure of ZINC12339422 bound to USP7 from AutoDock(blue) and DOCK(green)...................................................................................................................172 A-1 Sample AMBER input for a QM/MM minimization.......................................................176 A-2 Sample DivCon input when running a QM/MM vacuum calculation in AMBER..........178 A-3 Sample DivCon input when running a QM/MM Poisson-Boltzmann calculation..........178 A-4 Sample DivCon input for a frequency calculation...........................................................178 12


Abstract of Dissertation Pres ented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy COMPUTATIONAL STUDIES OF PROTEIN-LIGAND INTERACTION THROUGH QM/MM METHODS AND VIRTUAL SCREENING By Seth A. Hayik May 2009 Chair: Kenneth M. Merz Major: Chemistry Computational methods have been successfully used to speed the drug discovery process and to provide a more robust search of chemical space for new drug candidates. These goals are accomplished through calculation of free energy upon binding using various methods or the use of virtual screening. Both methods can be used to theoretically predict a ligands effectiveness before proceeding with the more costly experime ntal phase. This dissertation will discuss the development and use of these methods and their applicability to a drug discovery application. The first two chapters of this dissertation will provide a background for computational techniques used in drug discovery. The techniques presen ted range from knowledge-based scoring functions from a statisti cal analysis of the PD B to using quantum mechanical parameters from a semi-empirical method. These chapters will also provide information on mixed Quantum Mechanical/Molecular Mechanical (QM/MM) met hods as well as calculating important terms when computationally determining binding free energies such as solvation and entropy. The next two chapters detail the development and implementation of a QM/MM solvation model and scoring function. DivCon was integrated with the AMBER molecular 13


14 mechanics program to produce a QM/MM packag e that can perform linear scaling semiempirical calculations using AMBERs force fiel ds. The solvation model is based on a finitedifference Poisson-Boltzmann solver availabl e in the DivCon program. This method was validated through calculation of solvation energy on a set of pe ntapeptides as well as small proteins. The QM/MM solvation energy was compared to the full QM solvation energy for various sized QM regions and only small differe nces were found. The QM/MM scoring function made use of this solv ation method to calculate the solvat ion free energy of 23 zinc containing proteins. This QM/MM scoring function was validated against th e set of proteins by comparing the predicted binding free energies of these prot eins to the experiment al binding free energies. Following this, the next chapter discusses the use of virtual screeni ng methods to find a new inhibitor of a deubiquitinating cysteine protease protein, HAUSP/USP7. DOCK and AutoDock were utilized to screen a commerciall y available subset of compounds from the ZINC database. First, a potential active site was id entified, and then the ligands were docked and ranked. These ranked compounds were then analy zed, purchased and tested experimentally to determine their potency for the target protei n. Several ligands bound with low micromolar efficiency and the results were examined. Future directions for this study are also suggested at the end of this chapter to determine a more specific stronger binding ligand.


CHAP TER 1 INTRODUCTION The process of discovering and validating a new drug compound is difficult, time consuming and expensive. Previously it was a matte r of trial and error, but the field has evolved to a point where trial and error will no longer suffice when billions of dollars are at stake. The primary reason for this guesswork was a lack of understanding of the interaction between a protein and its inhibitor. There are many different experimental and theoretical fields of research involved in better identifying and characterizing these interactions but they are still far from perfect. X-Ray crystallography has allowed the structures of vari ous proteins and complexes to be determined and has provided a great insight into what is physically happening. These structures are readily availa ble in the Protein Data Bank1 for general use, and provide a structural insight for many different fields of study. In addition to this, the advent of high-throughput screening (HTS) has produced a large library of small molecules with a known binding affinity for many different targets. Computational methods have been able to take advantage of the emergence of these and other fields to provide fu rther insight into interm olecular interactions. The field of computational chemistry ha s also made some significant progress in understanding how a small molecule interacts with a protein. This field has taken advantage of an abundance of experimental data to better grasp important inte ractions. Various methods are available to predict and descri be binding affinities and impor tant geometric interactions including virtual screening, docki ng and scoring which often serv e as a good starting point for further experimental testing. These methods all va ry in cost and serve different purposes in the computational drug discovery process. Virtual screening is cheap and can be used on a large database of small molecules to produce a lead structure for medicinal chemists. Scoring and docking are generally more expensiv e and therefore carried out on smaller data sets to predict the 15


binding affinity and docked pose of a sm all molecu le. These interactions can all be measured experimentally, but using computational methods reduces the time and cost associated with drug discovery by screening molecu les computationally and narrowi ng the number of compounds to test experimentally. Mixed quantum mechanics/molecular mech anics (QM/MM) methods are becoming more popular for understanding reactions and interactions for a prot ein as computational power increases. These methods split the system in to two regions, a QM region treated with the Hamiltonian selected, and a MM region treated with a classical force fiel d method. This method allows a region of interest to be treated at a high level of theory while treating the surroundings at a lower level. Using this method the region of interest, typically an active site, does not have to be completely removed from the field of the protein allowing elec tronic effects from the surroundings to influence the QM region. This has been used to study the mechanism of various reactions in proteins2-4 and has also been used to score vari ous ligand interactions with a protein.5 Another advantage of the QM/MM method is th at the accuracy of th e calculation can be increased by changing the Hamiltonian used, depending on the desired properties to be studied and computational re sources available. This dissertation describes the developmen t and application of various computational methods to characterize protein-ligand interactions. These met hods all take advantage of the abundance of 3D structures and experimental data available to st udy, predict and quantify favorable interactions between a small molecule and a protein target. The following chapters describe the development of a QM/MM solvati on method, the application of that method to predict a small molecules ability to bind to a pr otein and finally an effort to screen a large database of small molecules to identify potential inhibitors of a protein target. 16


Chapter two outlines the theory behind several computational methods. These include a brief outline of molecular mechanics and linear sc aling DivCon. This chapter also provides a more detailed review of the theory behind bi nding free energy includi ng the thermodynamics and equations used in both computational a nd experimental methods. Commonly used computational methods are also discussed including scoring functions, molecular mechanics (MM), quantum mechanics (QM), mixed QM/MM and a few virtual screening methods. The drug design process is also discussed to illustra te where some of these computational methods may prove to be useful. Chapter three discusses the implementa tion and application of a QM/MM PoissonBoltzmann solvation method. This method wa s implemented in the AMBER 10 molecular dynamics package and takes advantage of DivCons semi-empirical linear scaling ability. This ability allows larger than usual QM regions to be included in the QM re gion giving electronic information on a larger region than is usually prac tical. This method allows the area of interest to be polarized in the solvent field while keepin g the rest of the protein at the fixed charges assigned by AMBER, thus reduci ng the overall cost of the calc ulation while providing more detail about the QM region. A test of this me thod is conducted on 20 pentapeptides as well as a few small proteins. Finally, chapter three disc usses the timing involved per SCF cycle for these solvent calculations and demonstrates the advantage of linear s caling within DivCon. Chapter four demonstrates the utility of a mixed QM/MM method to provide accurate details of protein-ligand inter actions so that binding free energies may be calculated. This method uses the QM/MM method so that the active site and ligand can be calculated at the AM1 semi-empirical level and takes advantage of the solvent method discussed in chapter two. This method captures the electrostatic interactions that take place wh en a ligand binds to a protein, 17


and therefore captures several properties that a non-polarizab le m ethod, such as molecular mechanics, would miss. This chapter also examines the effects of different variables to calculate components of the scoring functi on and their impact on the predic tive ability of the function. The goal of this research was to accurately pred ict binding free energies ; its predictive ability was validated against 23 metalloprot eins from the Protein Data Bank1 that contain zinc in the active site, which would prove difficult for a non-QM method. The fifth chapter outlines the efforts invol ved in screening a large database of small molecules to identify potential inhibitors of a protein target. This chapter uses DOCK6 and AutoDock7 along with the ZINC database8. Experimental inhibiti on data for a few compounds were provided by collaborators. Using these da ta a putative active site was identified through blind docking within AutoDock. Following th is, a parallel DOCK calcu lation was run using a dataset of 63,000 commercially available small mo lecules. This initial DOCK screen took advantage of DOCKs parallel abilities to redu ce the dataset used. The top 250 small molecules from this initial DOCK screen were first visua lly inspected and then used in a more rigorous AutoDock calculation. The top 40 molecules fro m this screen were then forwarded for experimental testing. Out of these 40 sma ll molecules five were found to bind with low micromolar affinity. Finally, chapter six provides a brief summary of the work a nd results presented in this dissertation. Computationally exploring these biological systems can give great insight into the interactions and mechanisms these proteins undergo biologically. Thes e insights can provide researchers with valuable information that lowe r drug discovery costs in terms of both time and money. The first two studies presented show th e advantages and applicability of the QM/MM method in understanding and predic ting these interactions. The third study demonstrates the use 18


19 and suitability of virtual scr eening methods to identify lead compounds in a drug discovery effort. None of these methods would be viable without the continued su pport of the theoretical and experimental community, but have proven to be valuable and useful tools in the drug discovery process.


CHAP TER 2 THEORY AND METHODS Finding and developing a new dr ug is very costly and time c onsuming. This cost arises from researching and testing enough compounds to get a representative sample of chemical space and can take 10-15 years to accomplish.9 There are several different ways to lower the cost associated with discovering a new drug in cluding high-throughput screening (HTS), virtual screening (VS) and structure ba sed drug design (SBDD). These methods are computational and experimental in nature, and can be used individua lly or together to speed discovery of new leads out of all of the options available in chemical space. The first step in the drug discovery process is usually to identify and validate a target. These are generally proteins, but can include things such as ion ch annels and DNA. At this point the challenge of successfully exploring chemical space is very apparent. A small molecule lead must be developed that will give the desired results when acting on the target. It is at this point that HTS and VS can become very important to quickly and methodically explore chemical space. HTS uses many welled plates to simu ltaneously test small amounts of hundreds of compounds in an assay developed to measure the desired effect. Large commercial compilations of small, drug-like molecules ex ist to aid an HTS effort. HT S is estimated to compose around 14% of the total cost of resear ch and development in a new drug.10 Using VS is also another option to explore chemical sp ace methodically. Vast databa ses of compounds are kept commercially and academically to provide small molecules for these methods8,11,12 and are often used in lieu of a first round of experimental HTS. These starting co mpounds are then docked into the targets active site and their energies are compared and ranked to give an optimal lead. While the hits from virtual screening may not be strong binders, even af ter optimization, they 20


m ay lead to a new class of compounds to expl ore further. Virtual screening allows many compounds to tested, often hundreds at once, usin g computational resources instead of human and physical ones saving time and money in the drug discovery process. Whichever method is used, one or several hits are found and enter an optimization st age to refine the compounds so that they are more selective, better binding or whatever aspect n eeds improvement. This step is also costly as it includes synthe sis of many different compounds and an iterative process to test them all experimentally. This is also the s econd step in which computational screening is extremely useful. The interactions of the small molecule and the target can be modeled, changed and tested in silico before attempting them in vitro. These methods also allow shape differences to be explored due to easy visualization of the ac tive site. It is often at this stage that more expensive computational methods can be employed, such as QM methods13-15 or MM-PBSA16,17 to get more detailed information on protein-liga nd interactions. Following this stage the drug enters development, trials and the market. This is the most expensive and often longest phase of drug development, as many compounds that make it this far do not make it to market, and thorough in vivo tests must be performed.18 Finally, assuming successful testing, the drug is released to the market and monitored for ne w side effects and the drugs effectiveness. Throughout the drug discovery process a lo t of time and money is invested. Computational methods can give insight into th e interactions between a ligand and its receptor that might not be worth the effort, or impossible, to m easure experimentally. While experimental determination of these propertie s might be costly, they can be determined computationally quite rapidly for a large number of ligands. These interactions might include electrostatics, dipole interacti ons, van der Waals interactions and hydrophobic interactions along with a myriad of others that computational met hods allow access to. Use of these methods limits 21


the chem ical space that needs to be experimental ly explored, therefore saving time and money in an already extremely expensive drug discovery regimen. Protein-Ligand Binding A ligand (L) binding to a protei n (P), in a simple 1:1 ratio, to form a complex (C) can be described by the chemical equation seen in Equation 2-1. CLP (2-1) The Gibbs equation allows the free energy of a reaction to be calcula ted, and therefore the likelihood of the reaction to proceed spontaneous ly. The Gibbs free energy equation depends on the enthalpy, entropy and te mperature represented by H, S and T respectively in Equation 2-2. (2-2) STHG The free energy change of the chem ical reacti on (Equation 2-1) can be calculated from the concentrations of the reactants a nd product, using Equation 2-3 where G0 is the free energy of the reaction at 1atm pressure, 298K and 1M concentration. ]][[ ][ ln0LP C RTGG (2-3) At equilibrium the reaction is reversible, m eaning the ligand binds to the receptor at the same rate as it dissociates. This means the total free energy change, G, is zero and gives the overall free energy changes of the system (Equation 2-4). ]][[ ][ ln0LP C RTG (2-4) In these equations, the final term is the equilibrium constant, Ka or Kd, which can be measured experimentally and are defined in Equation 2-5. d aKLP C K 1 ]][[ ][ (2-5) 22


Ka has units of (concentration)-1, whereas Kd has units of just con centration, making it more practical and more commonly used. Kd is the concentration of ligand at which half of the protein binding sites are occupied at e quilibrium. The smaller this Kd value the lower the concentration needed to fill half the binding sites and theref ore the higher the affinity of the ligand for the receptor. Calculation of Free Energy Computationally, Kd cannot be calculated directly so a relationship between Kd and G is established by Equation 2-6. (2-6) dKRTG ln0 Because free energy is a state function, and theref ore path independent, it can be calculated in parts and compared to experimental values. The free energy for the reaction is broken down into the difference between the free energy of comple x and the free energy of the protein and ligand (Equation 2-7). In this equation, Gbind is the binding free energy and Gcomplex, Gprotein and Gligand are the overall free energy of the reaction, and the free energies of the complex, protein and ligand respectively. ) (ligand protein complex bindGG GG (2-7) To accurately calculate free energies solvent a ffects must be included. This can be done using a thermodynamic cycle, like that seen in Fig. 2-1. In this cycle the reactants and products are calculated in vacuum and then in solvent to give the free ener gy of binding in solvent seen in Equation 2-8, where is the binding free energy in the gas phase and Gsolv is the solvation free energy of binding and is calculated using Equation 2-9. gas bindG solv gas bind solv bindGGG (2-8) ) (ligand solv protein solv complex solv solvG G GG (2-9) 23


Calculation of Entropy The entropy and enthalpy terms of the free energy are calculated from the partition functions related to translational, rotational and vibrati onal components of binding from statistical mechanics. The entropy of the whole system is described by Equation 2-10, where q(V,T) is the partition function of the entropy components. The enthalpy of the same system is calculated using Equation 2-11. VT q RTeTVqRS ln ),(ln (2-10) V BT q TNkE ln2 (2-11) Each system is composed of a translation, rotation, vibration and electronic partition function to describe the various degrees of freedom. The rotational partition function is represented by Equation 2-12. rr rT q1 (2-12) Where r=h2/8 2IkB and I is the moment of inertia. Th is partition function can be used to calculate the entropy and enthalpy of the nonlinear system using Eqs. 2-13 and 2-14 respectively. V r r rT q TqRS ln ln (2-13) RTEr2 3 (2-14) The translational partition functi on can be seen in Equation 2-15. P Tk h Tk qB B t 2 3 22 (2-15) 24


This partition function can be us ed to calcu late the translatio nal contribution to entropy and energy by using Eqs. 2-16 and 2-17. T TqRSt t2 3 ) ln( (2-16) RTEt2 3 (2-17) Finally, the vibrational partition function can be seen in Equation 2-18. K T T vKv Kve e q2 2, ,1 (2-18) This can be used to find the vibrational contri bution to the entropy and en ergy of a system using Eqs. 2-19 and 2-20. K T T Kv vKv Kve e T RS, ,1ln 1, (2-19) 1 1 2 1,1,T K v vKve RE (2-20) In order to calculate the vibr ational contribution to energy and entropy the frequencies of the molecule must be calculated. This is done using the Hessian, which contains the second derivatives of the molecule from molecula r dynamics (MD) or quantum mechanics. Another important addition to a systems energy is its zero point energy (ZPE). Computational methods often assume stationary nuclei with a swarm of electrons surrounding the nucleus. While a good approxima tion, this is not exactly correct so a ZPE correction is often applied from the vibrational frequencies of the system. The ZPE compensates for the fact that the nuclei are not completely stationary by applyi ng a simple harmonic oscillator to the atoms. This leads to Equation 2-21 for a non-linear molecule, where the ZPE is one half the sums of the vibrational frequencies, N is the number of atoms and i are the vibrational frequencies. 25


63 12 1N i iZPE (2-21) Computational methods inherently overestimate fre quencies, so they must be scaled by a certain factor, 0.95320 in the case of AM1 calculations, to compensate for this.19 Vibrational Contribution to Entropy As seen in the previous section, entropy is a part of the Gibbs free energy and must be included to get a good estimate of the binding fr ee energy. Upon ligand binding there is an unfavorable loss of entropy due to the loss of rotational and tran slational degrees of freedom. This entropy loss creates a barr ier to binding that must be overcome by favorable enthalpy, electrostatic and solvatio n entropy factors. This barrier, due to loss of rotational and translational degrees of freedom, was found to range from 16.5 to 18.0 kcal/mol for several small, relatively rigid trypsin inhibitors by Schwarzl et al.20 There is also a slightly favorable entropy change upon binding, as the ligand gains vibrational degrees of freedom within the active site. Such an entropic change has been calcu lated by Tidor and Karplus to contribute -7.2 kcal/mol to the dimerization of insulin.21 This favorable entropic term has al so been calculated by Schwarzl et al for small trypsin inhibitors and rang ed from -3.8 to -2.5 kcal/mol.20 In calculating binding free energies this represents an or der of magnitude or more differe nce in experimental equilibrium constants, and therefore presents an important factor to consider when predicting binding affinity. Computational Drug Design Computational drug design is generally employe d early in the drug discovery process to speed discovery, lower costs and give insights into protein ligand in teractions and is now considered an essential part of the discovery process.22 Computational drug design can be broken up into two distinct ar eas. The first is Structure Based Drug Design (SBDD) where 26


knowledge of the recep tors structure is necessary generally from X-Ray diffraction or NMR. The second area is Ligand Based Drug Design (LBDD) where drug design is primarily based on properties of the ligand. Both of these methods will be discussed in more detail below. The first step in the drug design process is to limit the number of compounds closely examined. This may involve screening databases such as ZINC8, the NCI database23 or others. Such a screen will limit the chemical space to be further examined by separating active from inactive compounds. These screens can involve several di fferent techniques, such as docking24 or similarity searches.25 Even the databases used in the screeni ng can potentially be intractable, but physical properties can be exploited to limit the number of compounds in a database. Lipinskis rule of five26,27 can be employed here to filter out non-drug like compounds. Lipinskis first rule is a drug must not have more than five hydrogen bon d donors. Second, a drug must not have more than 10 hydrogen bond acceptors. Third, a drug mu st not have a molecular weight under 500 g/mol. Finally, a drug must have a log P less th an 5 for solubility. These guidelines provide a very rough criteria to follow. A potential drug must not break more than one of these rules to be considered drug like. The rules allow many co mpounds to be automatically discounted from a ligand database, as they are fairly easy to meas ure and calculate. Using Lipinskis rules allows us to reduce the size of a data base by eliminating compounds not following these rules, and therefore makes a first pass screeni ng more computationally tractable. Structure Based Drug Design One common method of computational drug design is Structure Based Drug Design. This method requires knowledge of the structure of the target of interest. This information is typically from X-ray crystallography, but stru ctures can be obtained from NMR or homology modeling as well. Using this structure a ligand can be ranked based on interactions with the target. Knowledge of the structure allows MD simulations and large-scale virtual screening 27


along with other comm on methods to be performed. If an active site is not known there are methods of performing blind docking to locate a potential active site.28 Identifying an active site is important so that the correct ligand-recep tor interactions can be studied and enhanced for the drug molecule. After identification of an active site, programs such as DOCK29 or other similar molecular docking programs are commonly employed30-32 allowing for large-scale virtual screening by making use of a sc oring function to calculate an energy for each ligand. These programs typically also sample the conformational space of the ligand in the receptor to incorporate flexibility in the energy calculation. The energies of all of the ligands in the screening run are then compiled and ranked against one another to assemble a list of strongest binders according to these calculati ons. Scoring functions are genera lly similar to a force field in the terms they use, and will be discussed in more detail below. Scoring functions incorporating QM methods have also been developed and validated improving on pure force field methods.13,15,33 Another method of identifying a ligand is through de novo, or fragment, design34-36 which makes use of a library of functional gr oups that are then fit into the active site and scored. As each functional group is fit into the active site an overall molecule is assembled from these functional groups to identify the compounds of interest. SBDD has been aided recently by new methods for creating crystals of many proteins very quickly providing more structures for computational screening and pr esenting more targets to be computationally screened against. A main problem with SBDD is that it is dependent on the quality of the structure provided by the crystallographer. If the crystallographer has made bad assumptions, functional groups on the target may not be correctly positioned, creating falsely identifi ed contacts or even residues within the protein. A crystal structure is also predicated on a force field which helps 28


refi ne the structure37, but still relies on input from the crystallographer. Recently, methods have been developed to refine better crystal structur es by integrating QM methods with the structure refinement to produce better overall structures.38,39 SBDD is also depend ent on the resolution of the crystal produced. In a lower resolution crysta l the positions of atoms are not as conclusive, so further refinement methods may lead to mu ch better structures. In an NMR generated structure, the structure used in SBDD is of ten based on an averaging of several captured structures. This can lead to uncertainties in the structure and problems in the drug design process. These problems can be at least partially solved or account ed for in several ways either when obtaining the structure or pr eparing it for computational wo rk. However, these potential problems do not eliminate the usefulness of SBDD and further refinement methods are currently in use to produce better structures. Ligand Based Drug Design A second method of computational drug design is Ligand Based Drug Design. Using LBDD knowledge of the structure is not necessa ry, and the drug is based on properties of the ligand. This is done by employing various Quantit ative Structure-Activity Relationship (QSAR) methods.40-42 Each QSAR model is based on a set of compounds with measured values of interest, such as the inhibition constant or IC50 values. For each compound in the set, a number of variables, called descriptors, are calculated. These descriptor s include things such as number of hydrogen bonders, solubility and many other prope rties of the compounds. If a QM method is being used, even electronic descriptor s can be included in the QSAR model14,43,44. After a set of descriptors are calculated, statistical tec hniques such as Multiple Linear Regression45, Partial Least Squares Regression46 and Computer Neural Networks47 are used to combine descriptors to create an accurate model. These methods use the descriptors of compounds with known activities to create pr edictive models for compounds with unknown activities. The aim of QSAR 29


m ethods is to pick a set of descriptors that wi ll accurately predict activity of compounds that are not measured experimentally. This method is yet another way to limit the chemical space to search in drug design and provide leads to test experimentally. Scoring Functions Scoring functions are necessary to rank each conformation of a lig and according to its interactions with the receptor. Without a sc oring function a program may be able to produce various conformations of the ligand, but would ha ve no way to distinguish between true and false binding modes or provide information on which li gand might bind better. The use of scoring functions in virtual screening has been quite successful in drug discovery13,34,48-50, and often consensus scoring is done that uses more than one scoring function to rank a set of ligands.51 There are many papers comparing different programs and their scor ing functions on their ability to predict protein-ligand binding modes34,52,53 and the proper scoring function may be a matter of experimentation. In general a scoring function is th e equation a program uses to rank a ligand based on different terms which are sometimes we ighted through statis tical methods. These equations can be based on molecular mechanics fo rce fields, empirical observations, statistical potentials or even quantum mechanically derived information. These scoring functions generally emphasize easily calculated properties and discard or ignore mo re difficult contributions to the binding free energy. This is due to the fact that many of these algor ithms are developed to quickly score and rank thousands of compounds, and therefore certain properties must be approximated or ignored for the sake of expediency Each of these methods has their advantages and each scoring function method will be discussed in more detail below. Empirical scoring functions Empirical scoring functions are based on the assumption that the binding free energy can be decomposed into separate components, a nd components can be added or removed based upon 30


their im portance for prediction.54 Work on this type of scori ng function was pioneered by Bohm, whose function consisted of five weighted parts such as hydrogen bonds and lipophilic interactions.48,49 Empirical functions weight each term by using a training set of many ligands to one receptor or a few ligands to many receptors to determine their importance within the equation. These weighted terms are combined in to an overall equation th at will predict binding affinities of a prediction set, using statistical methods such as linear and non-linear regression. The terms within the scoring function are generally fairly simple, such as van der Waals interactions or properties based on distances so that these methods are extremely inexpensive. Often these potentials are based on a force field method, but are simplified by applying a grid potential or some other means of approximation. Such a scoring function allows a set of ligands to be scored and ranked very quickly, as detailed terms do not need to be calculated, and examination of the weighting coefficients can give insight into favorable interactions to modify further on in the drug design process.54 The main drawback of this type of scoring function, however, is that they may not be appropriate to use on ligands out side of the training set, and may provide some difficulty in predicting accurate binding affinities.52 One example of a widely used empirical sc oring function is AutoDock, which strives to create an empirical relati onship between molecular stru cture and binding free energy.7 AutoDocks scoring function can be seen in Equation 2-22, where the G coefficients are empirically determined using linear regression. 222 1012 612)( )( )(ijr ji ijjisol tortor ji ijij ji elec ji ij ij ij ij hbond ji ij ij ij ij vdWeVSVSGNG rr qq G r D r C tEG r B r A GG (2-22) 31


This f irst three terms in the scoring function co ntain terms from vacuum calculations, a LennardJones 12-6 repulsion, a 12-10 di rectional hydrogen bonding term, wh ere E(t) is a weight based on the angle, t, between the probe and target55 and a Coulomb potential. The Ntor term is an approximation of entropy loss upon ligand binding and the Gsol term is a measure of the desolvation energy based on the surface area. This scoring function has pr oven to be accurate in predicting binding modes and affinities7,52,56 and is only one of many em pirical scoring functions available. The defining characte ristic of an empirical scoring function is the weighting factors based on a training set. Weigh ting these different terms provi des a way to exclude certain expensive terms while providing reas onable accuracy in prediction. Physical chemical methods This class of scoring function is generall y based on information calculated from the protein, ligand and the complex they form These methods can include free energy perturbation57,58, ligand interaction energy approach59, Molecular Mechanics/Poisson-Boltzmann Surface Area (MM/PBSA)16 and other methods that make use of a force field or QM Hamiltonian13,15. They are generally more accurate, but also more expensive because the cost associated with the generated terms is often very high compared to other methods. Due to this cost these methods are seldom used as a first step in virtual screening, but are reserved for when there are a few compounds of intere st after an initial screen or to optimize a ligand after a first round of experimental docking studies. One very commonly used physically based sc oring function is MM/PBSA. This method uses values calculated from a gas phase force field calculation, then uses GB or PB methods to determine the surface area and solvation energy, as seen in Equation 2-23. solute SA PBGB MMTSGGEG / (2-23) 32


This overall equation is based on individual term s th at can be separately calculated. In this equation GGB/PB is the solvation free energy from Gene ralized Born or Poisson-Boltzmann, GSA free energy contribution from surface area, TSsolute is a solute entropy term and EMM is a sum of molecular mechanical energies that can be broken up into electrostatic, Ees, van der Waals, Evdw and internal energies, su ch as bond angle strain, Eint, as seen in Equation 2-24. (2-24) intEEEEvdw es MM EMM can also be substituted with EQM/MM to eliminate the step of developing parameters for the ligand.60 These terms can then be combined to get an overall free energy of binding for the ligand and complex. The coordinates of th e systems for a MM/PBSA calculation are often obtained from either explicit solvent minimizatio n or from an ensemble of snapshots of a molecular dynamics run, both of which would be done in explicit solvent. Using an ensemble of MD snapshots allows an averaging over structur es as the protein and ligand move in a room temperature solution. Running simulations in ex plicit solvent allows so lvent effects to be accounted for, such as hydrogen bonding, but removes the explicit solvent atoms when calculating the binding free energy to save time. Th ese minimizations or simulations are also the cause of the expense associated w ith MM/PBSA, as the process to get coordinates is very costly. MM/PBSA methods have been used successfully in the literature and provide an accurate binding free energy, but come at some cost due to how the me thod obtains information on the system.17 Another example of a physical chemical method is a quantum mechanically based scoring function developed by Raha and Merz.15,33 This method is somewhat similar to the MM/PBSA method, but an ensemble is not calculat ed. The interaction en ergy of the system was calculated using the AM1 level of theory within the linear scaling program, DivCon61 and 33


included a Lennard-Jones potenti al from the AMBER force field. The solvation term was calculated using a PB method that allowed polari zation of the solute in the presence of the solvent. The entropy terms were calculated by including a surface area term, to account for solvent entropy changes, and by counting the lost degrees of freedom upon ligand binding. This method accounts for electronic effects on ligand bindi ng, but is fairly expensive as the entire protein, ligand and complex must be converged to calculate accurate electronic properties. However, it has the advantage of not needing para meters for the ligand because it uses a semiempirical method and accounts for electronic effects in the system. It was also demonstrated in this method that charge transf er between the ligand and prot ein upon binding was significant and may play an important role in corr ectly calculating binding free energies. Knowledge based functions Knowledge-based scoring functions are based on information that can be gathered from available structures. This information is gathered from a database of protein-ligand complexes and compiled into a statistical potential that can be used for protein structure prediction as well as protein-ligand interactions.62-64 This type of scoring functi on creates a probability distribution of interatomic distances. This probability distri bution is then used to create a distance-dependent interaction free energy of atom pairs. Using th is probability distribution a form of score is derived for each ligand based on th e statistical preferen ces of atom pairs. One disadvantage of this type of scoring function is that the probability distribution of atom pairs, and therefore the scoring function, depends on the se t of structures the function was made from. Therefore, if a ligand is not geometrically similar to any of the ligands in th e training set the ligand will not score well. However, this type of scoring function has been proven to be useful and is extremely fast after the probability density has been established. 34


One exam ple of a knowledge-based scoring function is DrugScore.65-67 DrugScores probability density uses protein-ligand co mplexes in the Protein Data Bank (PDB)1 and scales them by their solvent accessible surface area. Th en they use the Cambridg e Structural Database to fill in protein-ligand interactions that are not very populous in the PDB so that the probability density produces a better representation.67 DrugScore uses Eqs. 2-25, 2-26 and 2-27 to create its interaction database, where gij(r) is the normalized radial pair distribution function and g(r) is the normalized mean radial pair distribution function. )( )( ln)( rg rg rWij ij (2-25) ij rg rgij ij )( ) ( (2-26) r ij ij ijrrN rrN rg )4)(( 4/)( )(2 2 (2-27) In these equations, gij(r) is calculated from the frequencies, N, of atom pairs of type i and j. In building this function only distances greater than 1.0 and less than 6.0 were considered. This function is then used to score protein-ligand interactions by summ ing all atom-atom interactions between the protein and the ligand. This method has been successfully tested to predict binding modes within 2 for 80-90% of test set compounds, which is in line with se veral other scoring functions.52 Molecular Mechanics Molecular mechanics (MM) force fields are based on Newtons Law and are often employed in the study of biological systems.68-73 These are based on a set of parameters fit to match high level quantum mechanics (QM) valu es for things such as bond distances, bond 35


angles, d ihedral angles and several other properties.74 There are many different force fields available, such as AMBER74,75, CHARMM76 and MMFF77-80 which model different parameters or have different parameter values depending on what the force field was developed to most accurately represent. In all cases, each paramete r is used to help determine the overall energy of the system investigated and to calculate for ces acting on individual atoms based on the ideal values that the parameters provide. The basic AMBER force field equation, Equa tion 2-28, is composed of bond, angle, dihedral and non-bond components. The bond and angle terms are harmonic potentials where the force constant is represented by Kb(kcal/mol2) and K(kcal/mol Rad2) respectively, and beq and eq are the equilibrium bond distan ce and angle. The dihedral term is represented by a Fourier series expansion where Vn is the rotation barrier heig ht, n is the periodicity, is the calculated dihedral angle and is the phase difference74,81. The final term represents the nonbond interactions included in the energy calculation. First, the 6-12 Lennard-Jone s potential is represented with the 121 r term accounting for repulsion and the 61 r accounts for attraction. In the Lennard-Jones term, A and B are parameters that define the shape of the potential. The final term accounts for the interactions of non-bonded, charged particles in a medium with dielectric constant, where q represents the charges and r the distance between them. ji ij ji ij ij ij ij dihedrals n angles eq bonds eq b totalr qq r B r A n V KbbKE 612 2 2)]cos(1[ 2 )()( (2-28) Semi-Empirical Quantum Mechanics Most force field methods do not model electr onic interactions such as polarization and charge transfer due to their parameterization a nd point charge models. To accurately include 36


these effects a quantum mechanics method must be used instead. There are many such methods ranging in computational cost a nd accuracy including Hartree-Fock (HF), Density Functional Theory (DFT) and semi-empirical (SE) methods. Generally, due to their cost ab initio methods like HF and DFT methods are restricted to use on small systems such as chemical reactions. One way to overcome this cost barrier is through the use of semi-empirical methods. These methods are commonly based on the Modified Neglect of Differential Overlap (MNDO)82,83 method, based on the Roothaan-Hall equations. SE methods also rely on a few parameters to reproduce experimental data or ab initio results and only consider valence electrons, treating the co re as one potential.82 Some of these more common SE methods include Austin Model 1 (AM1)84, Parametric Model 3 (PM3)85,86 and MNDO/d.87 These methods all use the basic equation seen in Equation 2-29 to calc ulate the total energy of the molecule, where Eel is the electronic energy de scribed in Equation 2-30 and is the interaction between the core potentials on A and B described in Equation 2-31. ABE1AB A AB el totE EE (2-29) These equations describe the elect ronic interactions where P is the density, H is the one electron matrix and F is the Fock matrix. EAB describes the interactions between the approximated core potentials where ZA and ZB are the core charges and RAB is the distance between them. )( 2 1 FHP Eel (2-30) AB BA ABR eZZ E2 (2-31) Use of these SE methods allows larger quantum mechanical calculati ons to be conducted, but this advantage comes at the cost of accu racy. Due to the parameterization and core 37


approxim ations needed for SE methods, all of th e electron interactions are not calculated and electron correlation and exchange can not be modeled accurately. Divide-and-Conquer Semi-Empirical Quantum Mechanics In general, protein systems are much too large to use a quantum Hamiltonian to get information on the system. Typically QM systems are limited to tens of atoms, whereas protein systems may contain hundreds or even thousands of atoms. However, use of available linearscaling divide and conquer (D&C ) techniques allow semi-empirical QM calculations on an entire protein to be performed at reasonable cost.61,88,89 These D&C calculations allow the system to be split up into a number of overlapping subsystems. Typically the subsystems are individual residues, but clusters of residues can also be included into a single subsystem for accuracy. In these D&C calculations a global Fock matrix is built and then split up into each subsystems local Fock matrix, givi ng Equation 2-32 where F, C and E are the Fock, coefficient and energy matrices respectively. ECCF (2-32) For ab initio calculations the most expensive part of the calculation is the calculation of the two-center two-electr on interaction. This is overcome in semi-empirical methods by using the Neglect of Differential Diatomic Overlap (NDDO) approximation to eliminate these terms. This means that the slowest part of semi-empirical calculations is diagonalizing the Fock matrix. In this D&C approach the Fock matrix is broke n up into local subsyste ms and the local Fock matrices are diagonalized, as opposed to the global Fock matrix. Generally, each of these local subsystems also has one or more overlap buffers from nearby subsystems, as seen in Figure 2261, so that truncation effects at the boundary do not give an incorrect density matrix. It has been found that two buffer regions with sizes of 4.5 for the first and 2.0 for the second are 38


sufficient for accu racy and efficiency. Diagonalizi ng these local matrices, rather than the global matrix, gives linear scaling w ith the number of subsystems for roughly equally sized subsystems.61 The QM/MM Method The original mixed QM/MM method was implemented by Warshel and Levitt to explore the catalytic mechanism of lysozyme.90 The QM/MM approach regain ed interest in 1990 thanks to application and valida tion work by Field, Bash and Karplus and the basic QM/MM application to chemical problems.91 This method is now widely available in many programs such as AMBER81 and CHARMM76. QM/MM methods have proven usef ul to accurately study reaction mechanisms in a protein environment using molecular dynamics92-94 and occasionally for protein ligand binding studies.60,95 These methods allow the accurate de scription of electronic effects in a region of interest without the intractable co st of modeling the entire system with a QM Hamiltonian. QM/MM Implementation In the QM/MM method the system must first be split into two separate subsystems. One region will be treated with the chosen QM Ham iltonian while the second region will be treated with the chosen force field. Dividing the subsystems into regions produces the effective Hamiltonian seen in Equation 2-33. MMQM QM MM effHHHH/ (2-33) This Hamiltonian is a description of the three parts of the system, one each for the QM region, the MM region and the QM and MM intera ctions. In this implementation HMM is calculated from the force field using the chosen parameters, HQM is calculated using the chosen QM Hamiltonian and HQM/MM is calculated from Equation 2-34. 39


iM MA AM AM AM AM AM AM AM iM M MMQMR B R A R Zq r q H6 12 / (2-34) Equation 2-34 encompasses three terms to descri be the QM interactions with the MM point charges and vice versa, where the subscripts i, M and A refer to the electrons, classical nuclei and quantum nuclei respectively. The first term describes the affect of the MM region on the QM density, the second is the interaction of the QM atoms on the MM atoms and the final term is a Lennard-Jones correction. Th e effect of this Hamiltonian is to allow the QM region and MM region to interact, affecting the ge ometry of the entire system as well as the electronic structure of the QM region. Link Atoms and the QM/MM Boundary The boundary between the QM and MM regions mu st be carefully chosen so that there are no spurious interactions between the QM and MM region s, such as polarized bonds96, and in many systems a covalent bond must often be cut. This leaves an unfill ed valence in the QM region which is typically filled with a hybrid orbital97,98, a capping potential99 or a link atom100102. The link atom method is the most widely used, but there is even variab ility in this method as to how the charges are treated.96 This link atom is included in the QM SCF calculation and its electronic properties are variable like a regular QM atom. In the AMBER implementation the link atom is placed 1.09 on the bond vector between the QM and MM atoms and can be adjusted. There are some modifications at the QM-MM boundary, so that the interactions between the two regions can be correctly described. Firs t, the bond that the link at om represents and any angle or dihedral containing a classical atom is treated classically using AMBER parameters, as if the link atom was not present to maintain the bond between th e atoms. Second, the LennardJones parameters between a quantum and a classical atom are treated as described in Equation 240


34. The physical interactions of the link atom are relatively easily calculated. However, the electrostatic interactions of the link atom are more difficult and will be described next. The electrostatics surrounding the link atom are challenging due to the boundary interactions and the replacement of an atom w ith hydrogen in the QM region. The link atom interacts with all of the classi cal point charges, except those di rectly bound to a QM atom, as suggested by Field et al.91, whereas van der Waals interactions are not calculated for the link atom, but are calculated for all of the MM atom s and the non-link QM atoms. Omitting this link atom-MM atom interaction leads to incorrect pol arization of the link atom because the carbon is affected by the MM field while the link atom is not.103 Another electrostatic consideration is the overall charge of the system. This fluctuat es in QM/MM calculations due to the difference between the QM calculated charge, which must be an integer, and the parameterized charges these are replacing. Two methods of charge conservation are av ailable in AMBER. One method distributes the charge difference to the atoms adj acent to the classical atom that is replaced. The second, and preferred, method dist ributes the charge equally over the entire classical system so that each atom takes only a small fraction of the charge difference, but allows charge conservation of the system. This is the defa ult method and the one generally used in AMBER QM/MM calculations. Solvation Methods There are a number of ways to model solven t for a biological system. The effect of solvent on a biological system is important to cons ider whenever an accurate model is desired. This solvent effect is essential in binding free energy calculations and is an important part of the thermodynamic cycle. Solvent changes the pola rization of the overall protein as well as stabilizing and forming hydrogen bonds with re sidues on the surface and interactions with foreign bodies such as ligands and DNA.104-107 41


The simplest, but most expensive, way to model solvent interactions is explicitly.108 This is simply adding water molecules and their a ssociated Hamiltonian, generally a force field method, to the calculation. This is very costly for a protein system because the number of water atoms needed to envelope the entire system is typically in the thousa nds. Along with a much larger number of atoms to simulate, averaging must be done using a MD method to capture an appropriate sampling of configurat ions. The advantage of this method is that any stabilizing affects from solvent are explic itly included in the energy of th e system. This includes hydrogen bonding with surface residues as well as conformations that become stabilized in the presence of water. A second general class of solv ation effects is called implicit, or continuum, solvation. Implicit solvent models add a term to the energy function of the system to simulate the effects of water. These methods were developed to a pproximate the statistical sampling of solvent configurations between a solute and a solvent.109-112 Continuum methods also reduce errors in statistical averaging from explicit models, while giving the effect of solvent on the solute.111 This reduces the cost over explicit models because thousands of atoms do not have to be added to the system and reduces the need for sampling. However, these methods do not allow hydrogen bonding to be properly represented because the wate r molecules are not actually present113 and can cause problems for reaction mech anisms where explicit water molecules impact hydrogen bonding.114 These implicit methods include Poisson-Boltzmann115-117 (PB) and the generalized Born118 (GB) method. Generally the PB me thod is too expensive for use in MD simulations119 and so is saved for post processing, but rece nt work has show that it is possible to reduce this cost.120 The Poisson-Boltzmann Equation The PB method relies on the Poi sson-Boltzmann equation (Equation 2-35). 42


[ (r) (r)] (r)2sinh ( (r)) = -4 (r) (2-35) In this equation (r) is the charge dist ribution of the solute, (r) is the electrostatic potential that needs to be determined, (r) is the dielectric constant and (r) is a modified Debye-Hckel parameter to reflect salt concentration and temp erature. Generally, the protein is assigned a uniform, low dielectric constant, while the solven t is assigned the dielectric constant of water (approximately 80).116 These represent the objects polar izability in solvent and there is discussion as to the best dielectric constant to use for a protein, but a value of three or four is the general consensus.116 The charges used in the PB equation depend on the Hamiltonian and charge model chosen. Generally, for a semi-empirical calculation, the CM1121 or CM2122 charge models would be used to get good accuracy with low expense. The nonlinear PB equation is often converted into the linear PB equation to make it easier and less expensive to solve.123 Finite Difference Poisson-Boltzmann In order to get any information from thes e equations, they must be solved. A common method to use to solve this equation is the Finite Difference Method (FDM).124-127 The FDM has been used in Delphi120, in DivCon61,88 along with other programs to solve the PB equation. Using the Poisson-Boltzmann/Self-C onsistent Reaction Field (PB/SCRF)128 method the electrostatic term of the solv ation free energy is largely fr om the external reaction field polarizing the solute electron density. In the FDM method a cubic lattice is constructed containing the protein and a region of solvent at the surface. The surface is defined by a probe, typically 1.4 to represent water, and is representati ve of the solvent acce ssible area on the protein. Each point on the lat tice is assigned a charge density, dielectric constant and ionic strength parameter for use in the PB equation. Th e number of points in this lattice is often user variable, and with a fine enough la ttice atomic resolution can be obt ained for the PB calculation. 43


However, increas ed resolution comes at a cost of greater computational expense, so a compromise must be made between accuracy an d expense. This calculation results in the electrostatic potential for the macromolecule. From this poten tial, virtual surface charges are generated located on the solven t accessible surface. These vi rtual charges represent the electrostatic field generated by the presence of solv ent. These polarizing charges are then solved self consistently until they reach convergence and their effect on the charges, and forces if desired, is incorporated into the calculation for the entire system. Using this method the solvation free energy, Gsolv is determined from Equation 2-36 using the electrostatic, Gelec, and non-polar, Gnon-polar. (2-36) polarnon elec solvGGG In this equation the electrostatic term is derived from the wave function and the non-polar term is obtained from parameterized surface area terms. Generalized Born Solvation The second, widely used solvation mode l is the Generalized Born (GB) method.118 The GB method approximates the PB method and the solvation free energy is calculated from Equation 2-37 and 2-38. ij ijij ji solvrf qq G )( 1 1 2 1 (2-37) )4( 22)(jiijr jiij ijijerrf (2-38) In these equations fij is a function relating the interacting atoms, i and j, their Born radii and the distance separating the atoms. The Born radius is a measure of the solvent exposure of an atom which approximates the distance of the atom from the dielectric boundary. This method is quicker than the Poisson-Boltzmann method, but is not as accurate. It is useful for molecular 44


dynam ics simulations because solvent effects can be included for a reasonable cost and is often used in tandem with MD in order to efficiently model solvent in a simulation.129-131 45


P + L P + L PL Figure 2-1. Therm odynamic cycle used in protein-ligand binding calculations. PL P solvG P solvG P solvG 46


HIS GLY PRO TYR ALA Outer Buffer Core Region Inner Buffer Figure 2-2. Subdivision of global system into a core region and two buffers.61 47


48 CAC N O H CA R R ... ... H H CAC NH O H CAR R ... ... H H HL A B Figure 2-3. A) Schem atic view of QM/ MM cut (curved line) between alpha carbon (CA) and amide nitrogen with hydrogen link atom added (HL) along bond vector. B) Structure, including link atom, as seen by AMBER


CHAP TER 3 A MIXED QM/MM SOLVATION METHOD Introduction The effect of solvent on biological molecu les is known to be very significant 115,116,128,132,133, and in order to model biomacromolecules acc urately one has to in clude this effect to properly account for electrostatic interactions. These solvation effects are often necessary in molecular dynamics simulations and protein ligan d binding studies to get accurate free energies of macromolecular systems where a difference of one or two kcal/mol is substantial134-136. Computationally, the effect of solvent can be included explicitly by including water molecules surrounding the solute108, or implicitly by adding terms to the Hamiltonian using a continuum method111 as discussed in Chapter 2. Inclusion of solvation effects allows a more accurate representation of the system, sin ce most biological systems do not exist in a vacuum. Both explicit and implicit methods have been proven to be useful and each has their own strengths and weaknesses, and is essential for accurately modeling biomolecules 115,116,128,132,133. Continuum solvation methods were designed to cheaply and accurately approximate the statistical sampling of solvent configurations at the interface between a solute and solvent109-112. There are many continuum solvent models availabl e, but this chapter will focus on the PoissonBoltzmann (PB) finite difference method as origin ally implemented in Delphi by Nicholls and Honig124, extended to QM methods by Friesner, Goddard, Honig and co-workers137, adapted for the linear scaling semiempirical program DivCon by Gogonea and Merz128 and further elaborated on by Liao and Merz138. This method was chosen b ecause it is suitable for use on biological macromolecules in a quantum mechan ical framework both in terms of cost and accuracy128. These methods have been effectively used to account for solvation in previous 49


studies of binding free energy of small m olecules, and they provide an accurate estimate for solvation free energies60,134,139. The PB method is an efficient solvation model, making it useful in calculations involving large macromolecules, and has b een shown to give good solvation energies when used with a quantum mechanical method137,140,141. Combined with DivCons linear scaling QM capabilities the PB method presents an effici ent and accurate way to calculate the solvation free energy of an entire macromolecular system w ith a reasonable amount of time invested in the calculation. Mixed QM/MM methods are widely used now in a number of programs such as AMBER81 and CHARMM76, and can help reduce the cost of large calculations while allowing important areas of the protein to be treated with a QM Hamiltonian90,91,96. This has proven extremely useful in the study of reaction mech anisms when incorporated with molecular dynamics92-94 and also for protein ligand studies in some cases60,95. Another advantage of these methods is that they can be used to save parameterization time in the case of nonstandard residues such as metal ions or drug molecules. In these instances the nonstandard residue can be included in the QM region so that the step of generating accurate force field parameters for the residue(s) in questions does not have to be undertaken. A third advantage of a mixed QM/MM method is that the QM region can properly repr esent the polarization e ffects of the surrounding MM and solvent regions through the QM-MM intera ctions. This polarization changes the QM regions electronic properties and can have a large impact on a reaction pathway or ligand binding95. In our QM/MM implementation the linea r scaling semiempirical QM program DivCon61,88 has been integrated into AMBER81. This allows the solvation energy for an entire protein to be calculated utiliz ing atomic charges from semiempi rical calculations in the QM 50


region and fixed MM point charges in the surroun ding protein. These can be Mulliken, CM1121 or CM2122 charges, but for accuracy CM1 or CM2 ar e typically used in DivCon calculations. This results in a method that ca n predict solvation free energies at a reduced cost compared to full QM calculations by centering specifically on a region of interest, an active site for example, for the expensive calculations and using the less expensive MM force calculations and charges on the rest of the protein. Our approach also takes advantage of DivCon s linear scaling, which allows a large QM region to be chosen in a QM/MM calculation without becoming exceedingly computationally expensive. Methods QM/MM Implementation. In the QM/MM method the entire system is first divided into two subsystems. One region (QM) contains a limited number of atoms to be treated with the chosen QM potential, while the second region (MM) is tr eated with an appropriate force field. In this case, the QM region will be treated semi-empirically while th e MM region will be treated using a force field available within AMBER In a protein system, covalent bonds are cut at the interface between these two regions and these bonds are treated by the addition of a hydrogen link atom to the QM region in order to insure a closed shell calculatio n. This is a widely used method and is reliable given proper placement of the link atom away from polarized bonds96. In the AMBER implementation the interactions with atoms in the MM region are included, except those bonded to a QM atom (called the MM link pair), so that artificially high forces and polarization effects are not introduced into the calculation. For MM atoms directly bonded to QM atoms, electrostatic interactions are re placed by those of the link atom and the VDW interactions remain unchanged. A representative view of the QM region cut is shown in Fig. 3-1. In our 51


applications the peptide bonds are included in the QM region to avoid cutting these polarizable bonds. Poisson-Boltzmann Combined with DivCon. Implicit solvent models, like those based on the Poisson-Boltzmann approach, involve a trade off between accuracy and computational efficien cy. In the PB model, the system is divided into regions with different diel ectric constants, separated by the solvent accessible surface of the solute molecule. The solute is then described as a collection of discrete point charges, which are calculated by the QM/MM method. The potential of the electrostatic field (called the reaction field) generated by the polarized solvent due to the introduction of the solute is obtained by solving the Poisson-Boltzmann e quation as discussed in Chapter 2. For the PB equation the charge distribution of the solute is calculated from the electron density generated by the chosen semi-empirical Hamiltonian. The density is converted to atomic partial charges, using Mulliken or CM1121/CM2122 charge models for the QM region, or carried over from the AMBER force field used for the MM region. The PB solver in our semiempirical QM pr ogram, DivCon, uses the Finite Difference Method (FDM)125-127 to solve the PB equation. After solv ing the PB equation, the electrostatic potential is added into the QM Ha miltonian used in the treatment of the solute. In this way, the polarization of the solute by th e reaction field is incorporated into the QM calculation. The polarized solute now yields a new set of atom ic point charges, which will repolarize the surrounding solvent generating a new electrostati c potential upon solution of the PB equation again. This iterative Self Consistent Reaction Field (SCRF) calculation is continued until convergence is reached. Upon convergence the electro static potential is us ed to calculate the electrostatic or polar part of th e solvation free energy, while the nonpolar part is calculated from the solvent accessible surface area of the solute. The advantage of this QM/MM method is that 52


it lowers the overall number of charges to polariz e every SCRF cycle. Th e QM regions atomic charges are allowed to fluctuate according to ou tside influences, but the MM regions charges are fixed at the charge assigned by the chosen for ce field. This results in a significant decrease in the number of atomic charges to be recalculat ed every SCRF cycle, therefore greatly speeding up the calculation. Solvation Free Energy in a QM/MM Calculation A QM/MM implementation of the PB model within DivCon allows specific regions of the protein to be polarized by solvent dynamical ly, while ignoring the solvents influence on regions of lesser importance. The implementatio n described is briefly mentioned by Murphy et al.142, but is further elaborated upon and applied to our own program in the following. A similar method using DFT methods is discussed by Li et al143 where the surrounding protein and solvent are continua. They explore the redox potential of two-sulfur, two-iron clusters and find that including the protein field with the solvent ma kes a large contribution towards an accurate description of the potential in the QM region. The QM/MM PB implementation described herein will, in general, reduce the overall computational expense versus a full QM treatment, because the Hartree-Fock equations are solved for a mu ch smaller region. The current implementation also presents the opportunity to use the Divide and Conquer linear scalin g semiempirical method for the calculation of both the protein and solven t while including the effects of all of the MM atoms on the QM region. A schematic view of the PB procedure can be seen in Fig. 3-2. After the normal QM/MM setup of the protei n system, the point charges of the protein are built up from the QM and MM regions. The MM charges are treated as fixed atomic point charges and do not fluctuate dur ing the SCRF process. These are obtained from the AMBER atom types and passed into DivCon through the QM/MM interface. The QM charges are 53


calculated from the density matrix of the QM region using AM184 and are updated every SCRF cycle. DivCon allows the use of Mulliken, CM1121 or CM2122 charges. The entire collection of atomic point char ges, from both the QM and MM regions, are then passed to the PB solver in DivCon, together with the coordinate s of the protein. The dielectric constants are assigned based on the grid point location e ither in the solvent or protein as determined by the solvent accessible surface of the protein. The solver then solves the PB equation numerically based on the finite diffe rence method and obtains the electrostatic potential. The resulting electrostatic potential is used to generate a set of virtual surface charges located on the solvent accessible surface of the protein. This set of virtual surface charges represent the electrostatic field (reaction field) produced by the polarized solvent. The Hamiltonian of the QM region in vacuum(H0) is then perturbed by the interaction of the virtual surface charges with the solute electrons and nuclei and integrat ed over the solvent accessible surface of the protein, giving equation 6, the ne w potential energy operator. In this equation, (r) is the virtual surface charge lo cated at r and a is an area of the solute surface(S): srr r daHH)( '0 (3-3) The perturbed Hamiltonian results in a new set of atomic point charges in the QM region, which reflects the polarization of the QM solute. This updated set of QM charges, together with the fixed MM charges, are then fed into the PB solver for another cycle of the SCRF calculation and this continues until the polar ization between solute and solvent achieves convergence. The converged reaction field (represented by the vi rtual surface charges) and solute charge distribution (wave functions in the QM representatio n) are used to calculate the polar part of the solvation free energy (Grf)128. 54


s v rfrr r rdrdrG)( )( 2 1 (3-4) A schematic view of a solvation free energy ca lculation, along with the terms represented in the above equations, is given in Figure 3-3. Du ring the dissolution of th e protein into solvent, the solvent is polarized and generates a reaction field. The solute is also polarized by this reaction field, hence, the wave function of the QM portion (MM portion is fixed) is modified. We describe the energy difference between the two states (the vacuum and dissolved states) of the solute via the wave function distortion (wfd) term given by Gwfd. The electrostatic interaction between the reaction field and the dissolved state of the solute is called the reaction field (rf) energy (Grf). The Gwfd and Grf constitute the polar part of the solvation free energy. There is also an entropy penalty associated with making a cavit y in the solvent, which is usually combined together with the non-polar (np) interaction betw een the solute and solven t. These two terms are both proportional to the solvent accessible surface area and are lumped together in the Gnp term. Further technical details of how to calculate these terms are given in Gogonea and Merz128. In summary, the QM/MM PB method makes it possible to study the protein polarized by solvent only in the ar ea of interest without the expens e of having to perform the SCRF calculations on the entire protein while still yieldi ng a reliable solvation free energy for the entire macromolecule. Using the SCRF method within Di vCon allows a large number of atoms to be treated with a semi-empirical Hamiltonian, thus allowing atoms directly in contact with the ligand and a buffer around the ligand to be included in the QM calculation. Preparation of Test Cases In order to validate the QM/MM DivPB method several test cases had to be developed. So that the viability of the QM/MM method, and not just the DivPB method, was tested they had to be easily divided into QM and MM re gions. It would also aid in th e study if the size of the QM 55


region could be easily expanded to exam ine how the scaling works. It would also help if these systems were relatively small, so that the tests could be conducted quickly. The first method of validation was a series of pentapeptides capped with N-methylamine (NME) and acetyl (ACE) groups were constructed in LeAP in the AMBER package, as seen in Fig. 3-4. This was done using the sequence command, starting with the ACE cap, followed by five residues, ending with the NME cap. This re sulted in the capped pentapeptide systems, with hydrogens added from LeAPs residue library from the ff99SB144 force field, and these coordinate and parameter files were saved. Th ese structures were then minimized in AMBER for a maximum of 9000 steps of conjugate gradient, with a cu toff of 100 Following this minimization, the pentapeptides were split into QM and MM regions. This procedure was done as described above, cutting the two regions afte r the peptide bond so that the link atom was not placed in a highly charged region. This split wa s repeated to create the one through 5 peptide QM regions that were later used. The input fi les were prepared from these splits and the solvation energy was calculated. The same structures were also used for the full QM solvation energy calculations so that the structures were identical between full QM and QM/MM. In addition to the pentapeptides above, f our small proteins with X-Ray structure resolutions less than 2.0 were obtained from the PDB. In or der to examine how this QM/MM method performs on proteins, a series of solva tion free energy calculations were done on small proteins Hepatocyte nucl ear factor 1-alpha (DCoH)145, SARS-coronavirus accessory protein (Orf7a)146, bovine pancreatic tr ypsin inhibitor (BPTI)147 and T4 Lysozyme148 containing 1284, 1034, 892 and 2603 total atoms respectively (PDB ids: 1usm, 1xak, 4pti, 182l). First, the system was minimized within DivCon using the AM1 Hamiltonian and the LBFGS149 method. After this the solvation free energy of the entire mi nimized system was calculated with the AM1 56


Ham iltonian. Following these calculations, a sm all QM region was picked and expanded in 23 increments giving QM sizes as small as 33 atoms and as large as 1734 atoms. This QM region was based on the first heavy atom identifie d in the PDB file. The QM region was then expanded to include entire residues within a given distance from the first atom, producing various numbers of atoms in the QM region. This allows the effect of th e QM regions size to be examined, which is important to consider because a QM region is supposed to remain relatively small in relation to the overall system. This solvation method should be able to closely reproduce full QM solvation energies with a small QM region in order to save calculation time. The solvation energy of these various sized QM/MM systems was then calculated and compared to full QM calculations of the same structur es. This not only provided an estimate of the accuracy of the method and the dependence on QM region size, but also provided information about how the method scales by comparing the co mputational cost across different sized QM regions. This also provided a test of the met hod on protein systems, which is what the methods intended use is, and therefore provides important performance and accuracy information. Results and Discussion We evaluated the QM/MM PB method by co mparing QM/MM calculations of various sizes of the QM region with full QM SCRF calcu lations. The Poisson-Boltzmann solver within our semiempirical QM program, DivCon, has been described and validated elsewhere128,138, so we will compare our QM/MM results with full QM results from this approach. DivCon allows the internal and external dielectric values to be set to user-defined values. For these calculations we will use the values of 1.0 for the internal dielectric (the QM and MM regions) and 80.0 for external dielectric value, the surrounding water environment. A grid spacing of 2.5 gridpts/ and a probe radius of 1.40 are used throughout. These parameters were found to give the best cost to performance be nefit in these calculati ons. We use 1.0 for the 57


internal d ielectric, instead of 4.0 which is comm on in proteins. Since the QM atoms are already polarized by the environment through the PB approach, a higher dielectric to approximate the polarization effect, as is util ized in fixed point charge PB approaches, is unnecessary. Pentapeptides, capped with acetyl (ACE) and Nmethylamine (NME) on thei r respective termini, were calculated and their solvation free energy was compared to a full QM calculation using the AM184 Hamiltonian on the same conformation. The QM region cut was made between the carbonyl carbon and the alpha carbon of the next residue, or between the nitrogen and the alpha carbon of the previous residue so that the peptide bond was not cut. This is an important consideration due to the nature of link atoms. If a double bond or highly polarizable bond is cut to define the boundary the link atom will not accur ately capture the electronic effects of the bond and will therefore affect the entire system. This is a consideration in all QM/MM calculations, so the farther the boundary from the region of intere st the better. This is another advantage of DivCons linear scaling benefits, allowing the boundary region to be defined away from the region of interest. An example of the QM/MM cut, and a demonstration of the peptide designation found below can be seen in Figure 3-1. Increasing QM Size Results from increasing QM region size on th e pentapeptides used above can be found in Table 3-1. All energies reported in this table are in kcal/mol. The number of peptides, across the top, refers to the number of peptides included in the QM region. For example, five peptides refers to five glycine residues plus the capping ACE and NME. Four peptides refers to four glycine residues plus the capping ACE. DivCon refers to a full QM calculation of the peptide using DivCon with the AM1 Hamiltonian. This resulted in the number of atoms included in the QM region gradually decreasing so the accu racy of the QM/MM PB method could be determined. As the QM region gets smaller it is expected that the accuracy, compared to a full 58


AM1 treatm ent of the entire pentapeptide, should decrease. Since only the QM atomic charges are allowed to fluctuate, the number of atoms th at are allowed to polarize in the solvent field decreases, and the number of fixed charges provided by the force field increases potentially leading to less accurate solvati on free energies. This means that the amount of quantum information in the SCRF calculations shrinks and a gradual shift out of agreement with full AM1 calculations might be expected. For these syst ems the solvation energy remains accurate with the 1 peptide systems, with the smallest QM region, having an unsigned average difference of only 4.9 kcal/mol. Not unexpectedly, the size of the QM region has a greater impact on charged residues than aliphatic ones. Charged or highly polar residues, such as histidine, are more polarizable and as these residues/atoms ar e assigned fixed MM charges the polarization experienced by the SCRF procedure is diminished. Hence, the solvation free energy is expected to change the most as the number of QM residues is decreased. This is certainly true for the positively charges ARG and LYS residues, while the negatively charged GLU and ASP residues are affected to a lesser extent. In this study the proline pent apeptide also presents its own problems in where to place the QM/MM cut. In a regular protein the PRO residue would normally be included in the calculation, but this was not an option here so the 5-membered ring was cut when the peptide bond was included in the QM region. This presents problems calculating the ring interactions and gi ving spurious QM/MM interactions. Table 3-1 shows the signed differences be tween the QM/MM calculation and the full QM DivCon solvation free energy in pa rentheses with the unsigned aver age shown also. All of these values are in kcal/mol using the AM1 Hamiltoni an. As expected, the accuracy of the QM/MM PB results decrease as the QM region decrease s. This demonstrates the importance of the polarizable charges in the QM region against the fixed MM charges. As the MM region grows 59


the solv ation energy becomes less accurate relative to the full QM calculations, but can still get pertinent information regarding the solvation free energy. Applying this to a protein could give valuable insights into active site polarizati on, while still producing reliable solvation free energies. Solvation Free Energies of Small Proteins This QM/MM method is intended to be used on biomolecular systems, more specifically proteins. While the previous test shows the methods ability to perform on small systems, tests on protein systems need to be performed to gaug e its efficacy on the systems of interest. The results of the solvation calculations are given in Table 3-2 with differences between full QM solvation energy and QM/MM solvation energies shown in Table 3-3. In Tables 3-2 and 3-3 the size, in ngstroms, of the QM region is shown along with the number of atoms in that region in parentheses, not counting link atoms and all ener gies are in kcal/mol. The proteins give unsigned average differences from the full AM1 QM calculations of 19.13 kcal/mol, 19.67 kcal/mol 11.02 kcal/mol and 9.08 kcal/mol for DC oH, Orf7a, BPTI and Lysozyme respectively. These are in good agreement with the expected solvation free energies given from full QM calculations. There is only a w eak to nonexistent trend between in cluding more atoms in the QM region and increased accuracy in the solvation free energy as might be expected for these systems. The solvation free energy differences ar e generally acceptable in most cases and show that even reasonably sized QM regions (less than 100atoms) in a QM/MM calculation should be expected to give solvation free energies in good accord with full QM calculations. Timings QM/MM methods make assumptions and approximations to save computational cost by excluding large regions of the proteins from th e QM calculation. These approximations must 60


provide an accurate answer, while saving tim e to make them cost effective. The QM/MM solvation method above is not excluded from this and the accuracy has already been examined on the pentapeptides and small proteins using various QM region sizes. Below, the timing difference of the method will be compared to that of a full QM solv ation calculation within DivCon. The average time per SCF cycle in va cuum and solvent was calculated and tabulated for PDB ID 1usm. This was done for each of th e QM region sizes and for the full QM solvation calculation. These timings will give an idea of the time savings gained by using the mixed QM/MM PB solver. Standard Table 3-4 shows the results of timings of various QM region sizes and a full QM calculation on 1usm. The average time for each SC F cycle is reported as well as the number of cycles required to reach convergence. The av erage time per SCF is reported because some individual SCF cycles may take longer depending on how long convergence takes, so an average is the only fair estimate. The first column tells the radius of the QM region, in ngstroms, and the number of atoms present in the QM region in parentheses. These are important metrics when a study of the active site is desired. This table shows that, as expected, the gas phase SCF takes less time than the solvation SCRF because there is no polarization across the QM region. Also, the time differences between different QM region sizes can be seen, and since QM size does not have much influence on accuracy the size versus co st can be estimated. The results in Table 3-4 are from a standard (non-divid e and conquer) calculation and theref ore do not linear ly scale with system size. Figure 3-3 shows this relationshi p and demonstrates how much longer the SCRF takes over the SCF. 61


Divide and conquer Table 3-5 is similar to Table 3-4, but shows the results for a divide and conquer calculation. It can be seen that the smaller QM sizes take longer for a divide and conquer calculation than a standard calculation. This resu lt is expected because the setup time required to break up the calculation does not compensate for the time savings while diagonalizing the Fock matrix, since these systems are so small. Ho wever, for the large systems the SCF and SCRF times are significantly shorter with the full QM calculation taking six times less in vacuum with a divide and conquer calculation than in a standa rd calculation. The timi ng of the average SCRF cycle is also interesti ng and demonstrates that the divide and conquer algorithm affects the SCRF diagonalization as well with the same pitfalls as a vacuum calcu lation. Large systems are where the divide and conquer algorithm is really useful, and this SCRF implementation is no exception. The crossover for when to use the divide a nd conquer calculation in this QM/MM method appears to be approximately 600 atoms, which confirms informal results obtained earlier. In Table 3-5 it is also seen that the full QM calculation takes less time than the largest QM/MM calculation, which contains fewer atoms th an the full QM calculation. This is likely caused by memory management involving the link atom s. In order to properly include the link atoms in the divide and conquer sub-regions they must be near the sy stem in memory. In AMBER the link atoms are placed at the end of th e molecules array by default. This will not work for a divide and conquer run, so the link atoms must be identified and moved into the correct sub-region, added to that sub-regions atom list and incl uded in the calculation. This takes a bit of time and memory manipulation and this cost will increase the more link atoms a QM/MM calculation has. This is only done on ce at the beginning of a divide and conquer calculation, but can add some time to th e calculation as seen in the table. 62


Figure 3-4 g raphs the average time per SCF a nd SCRF cycle per QM region size. This graph provides and excellent illust ration of the divide and conque r algorithms linear scaling for both the SCF and SCRF diagonaliza tions. This graph demonstrates the usefulness of the divide and conquer method and also gives an idea of when to apply this method over a standard calculation. Conclusions We have presented a QM/MM implementation of the SCRF Poisson-Boltzmann solver available in DivCon to calculate solvation free energies of large molecules within a semiempirical framework. A full QM version of this Poisson-Boltzmann solver has been validated by Liao and Merz138 and we find that it is also app licable using a QM/MM approach as implemented in the Sander program of the AMBE R suite of programs. This method allows the solvation free energy of an entire protein to be calculated, incl uding only part of the system in the quantum region, thus allowing only these charges to be polarized by the solvent while keeping the remaining system at fixed charges given by the force field. Using a divide and conquer technique, a large quantum region may be se lected due to DivCons linear scaling nature allowing the link atoms to be farther from the active site than other methods may allow due to computational expense. This also allows th e solvation free energy to be calculated using Mulliken, CM1 or CM2 charges for the QM region A brief study of this methods time saving capabilities has also been done. A time saving is realized for large syst ems when utilizing the divide and conquer algorithm, but the average SCF and SCRF cycle time increases for small systems due to setup times. This graph shows the applicability of this method to large and small systems while not compromising accuracy. This me thod might be applied, in particular, to drug design due to its ability to allow polarization and charge transfer effects in a region of interest while including solvation terms and finding the solvat ion free energy of a solu te. It could also be 63


used to determ ine pKa, thermodynamic quan tities, generate improved force fields for nonstandard ligands/residues and many other app lications relevant to biological systems. 64


H3C NC C N C O O H CH3 H QMH CN C O CH3 H H ... Figure 3-1. Representative view of QM/MM region and its cut off points. This also shows an example of the peptide designation us ed in Table 3-1 (NME cap plus one Ala residue). 65


Conver g e SCF in vacuum Setu p Link Atoms/Read j ust char g es Pass vacuum charges/coordinates t o PB So l ver Solve PB Equation with FDM from Vacuum Inform ation Perturb Hamiltonian of QM region with virtual charges Update charges of Q M re g ion Calculate virtual surface charges Re p eat until conver g e d Q M/MM In p ut file Figure 3-2. Schem atic view of SCRF calculation procedure. 66


Solvent MM Gsolv Gnp Solvent + Gwf d MM 0MM Cavity Grf Figure 3-3. The term s involved in obtaining the f r ee energy of solvation. 67


H N CH C CH3 O N H CH C CH3 O H N CH C CH3 O N H CH C CH3 O H N CH C CH3 CH3 O H3C Figure 3-4. Example alanine pentapeptide, capped with NME and ACE. 68


Table 3-1. Solvation free energy of pentapeptid es with varying QM region size. The signed difference between full QM DivCon solvat ion and the QM/MM solvation energy are listed in parentheses with the average uns igned difference listed at the bottom. Residue DivCon 5 Peptides 4 Peptides 3 Peptides 2 Peptides 1 Peptide ALA -30.5 -30.4(0.0) -29.0(1.5) -28.8(1.6) -28.7(1.7) -28.6(1.9) ARG -572.6 -572.6(0.0) -572.9(-0.2) -572.0(0.6) -570.2(2.4) -560.1(12.6) ASN -55.9 -55.9(0.0) -55.7(0.3) -57.7(-1.7) -58.9(-3.0) -61.0(-5.1) ASP -698.2 -698.4(-0.1) -694.2(4.0) -695.0(3.2) -698.0(0.2) -701.6(-3.4) CYS -29.1 -29.1(0.0) -29.0(0.1) -30.5(-1.4) -31.1(-2.0) -32.5(-3.4) GLU -653.9 -653.9(0.) -652.2(1.8) -652.9(1.1) -653.3(0.7) -653.5(0.4) GLN -69.3 -69.4(0.0) -71.5(-2.2) 72.3(-2.9) -70.5(-1.1) -71.9(-2.5) GLY -39.1 -39.1(0.0) -37.7(1.4) 37.1(2.0) -36.5(2.6) -35.5(3.6) HIS -81.0 -80.9(0.1) -74.5(6.6) 70.8(10.2) -67.9(13.1) -64.3(16.7) ILE -24.9 -24.9(0.0) -24.9(0.0) -25.0(-0.1) -24.1(0.8) -24.3(0.6) LEU -24.2 -24.2(0.1) -24.1(0.1) -24.7(-0.4) -24.4(-0.1) -24.8(-0.5) LYS -624.1 -624.0(0.0) -619.6(4.4) -616.8(7.3) -607.4(16.6) -602.9(21.2) MET -34.7 -34.6(0.1) -35.0(-0.3) -33.9(0.8) -34.3(0.4) -34.3(0.4) PHE -32.6 -32.6(0.0) -32.6(-0.1) -32.7(-0.1) -33.0(-0.4) -32.0(0.6) PRO -39.5 -39.4(0.1) -36.8(2.7) 34.1(5.4) -30.0(9.5) -27.4(12.1) SER -41.1 -41.2(-0.1) -39.6(1.5) 41.5(-0.4) -43.3(-2.2) -45.2(-4.1) THR -52.6 -52.6(0.0) -53.5(-0.9) -51.5(1.1) -52.0(0.5) -52.2(0.3) TRP -53.4 -53.4(0.0) -50.8(2.6) 52.3(1.1) -49.8(3.7) -48.3(5.1) TYR -46.4 -46.5(-0.1) -46.9(-0.6) 47.6(-1.2) -47.4(-1.0) -48.4(-2.0) VAL -22.1 -22.1(0.0) -22.8(-0.7) -21.9(0.2) -22.9(-0.8) -23.7(-1.6) Average 0.0 1.6 2.1 3.2 4.9 69


Table 3-2. Solvation free energy of increasing QM size for four different protein systems. (# QM atoms) 1usm 1xak 4pti 182l Solvation Free Energy Solvation Free Energy Solvation Free Energy Solvation Free Energy 3(33,40,62,40) -452.62 -400.19 -780.44 -1411.92 5(59,98,107,54) -467.23 -399.38 -779.70 -1412.87 8(187,161,176,182) -462.01 -399.72 -805.40 -1391.85 10(236,243,308,262) -459.19 -402.66 -802.21 -1403.46 12(421,281,373,342) -464.64 -402.19 -777.16 -1399.31 15(696,405,490,630) -458.71 -416.09 -818.12 -1387.74 18(959,525,621,925) -495.69 -416.11 -804.70 -1385.38 21(1169,685,676,1290) -456.99 -396.71 -801.81 -1401.54 24(1260,859,789,1734) -442.26 -386.32 -798.18 -1417.58 27(913) -392.52 AMBER Full QM -443.24 -381.52 -803.50 -1401.69 70


Table 3-3. Solvation free energy differences for four proteins. (# QM atom s) 1usm 1xak 4pti 182l 3(33,40,62,40) -9.38 -18.67 23.06 -10.23 5(59,98,107,54) -23.99 -17.86 23.80 -11.18 8(187,161,176,182) -18.77 -18.20 -1.90 9.84 10(236,243,308,262) -15.95 -21.14 1.29 -1.77 12(421,281,373,342) -21.40 -20.67 26.34 2.38 15(696,405,490,630) -15.47 -34.57 -14.62 13.95 18(959,525,621,925) -52.45 -34.59 -1.20 16.31 21(1169,685,676,1290) -13.75 -15.19 1.69 0.15 24(1260,859,789,1734) 0.98 -4.80 5.32 -15.89 27(913) -11.00 71


Table 3-4. Average time for a SCF/SCRF cycle in 1usm and number of diagonalizations needed for each cutoff distance in a standard calculation in DivCon. Standard (# of QM atoms) Avg. Gas SCF (sec/cycle) # of diags Avg. Solvent SCRF (sec/cycle) # of diags Total SCF Time in DivCon(sec) Total Run Time(sec) 3(33) 0.01 4234.879314.24 314.58 5(59) 0.06 14 35.48 10 355.54 356.19 8(187) 0.87 63 41.51 9 428.10 430.27 10(236) 1.74 52 43.53 9 482.10 485.08 12(421) 8.66 55 53.97 10 1015.15 1020.61 15(696) 36.23 45 89.13 10 2521.53 2530.61 18(959) 89.01 22 148.94 11 3613.30 3625.70 21(1169) 158.25 21 243.98 10 5763.12 5778.05 24(1260) 199.42 22 267.38 10 7061.05 7077.28 FullQM 310.11 22 326.09 10 10083.24 10100.08 72


Table 3-5. Average time for a SCF/SCRF cy cle in 1usm and the number of diagonalizations needed for each cutoff distance in a di vide and conquer calc ulation in DivCon. Divide and Conquer Cuttoff (Angstroms) Avg. Gas SCF (sec/cycle) # of diags Avg. Solvent SCRF (sec/cycle) # of diags Total SCF Time in DivCon(sec) Total Run Time(sec) 3(33) 0.03 4218.5532594.70 595.07 5(59) 0.14 54 22.71 21 484.42 485.16 8(187) 1.42 119 25.44 28 880.63 882.75 10(236) 2.20 51 27.49 32 991.81 994.53 12(421) 6.93 74 49.63 13 1158.20 1163.58 15(696) 16.29 49 67.54 13 1676.02 1684.86 18(959) 30.70 56 90.45 13 2895.28 2907.22 21(1169) 44.02 21 105.06 15 924.43 2514.86 24(1260) 59.48 20 117.00 15 2944.57 2960.29 FullQM 51.73 20 115.50 15 2767.10 2783.21 73


74 Average SCF/SCRF Time-50.00 0.00 50.00 100.00 150.00 200.00 250.00 300.00 0510152025 Angstrom CutoffAvg. Time(sec) Standard Solvation Divide and Conquer Solvation Standard Vacuum Divide and Conquer Vacuum Figure 3-4. Average time per SCF/SCRF cycle in 1usm for cutoff distances of 3 -24 D&C stands for a divide a nd conquer calculation.


CHAP TER 4 A MIXED QM/MM SCORING FUNCTION TO PRE DICT PROTEIN-LIGAND BINDING AFFINITY Introduction Protein-ligand interactions are complicated and therefore difficult to model and predict accurately. Methods are still being developed to predict binding affinity with emphasis on cost and accuracy in these predictions. All of th ese methods vary in the scoring function and structure prediction methods used with different advantages and disadvantages.49,150,151 It is important to accurately model these interactions so that in silico screening can be effectively applied to lower the cost and time involved in drug discovery. Often these schemes sacrifice accuracy to gain speed, or improve accuracy at slower speeds but then are too complicated and slow to be used for large scale general applica tions. For these reasons research in this area continues to thrive, searching for a model that achieves a good compromise between speed and accuracy to predict binding affinities and score different poses. Typically scoring functions are composed of different terms calcu lated through classical or empirical methods. These functions perform ad mirably, but have problems and their own set of limitations. These problems come from the way these scoring functions are derived and the training sets used to construct them. Empiri cal and knowledge-based fu nctions are dependent on the size and variety of the trai ning set used. These functions may fail if the systems being examined are too different from any ligands in th e training set. This leaves the scoring function without knowledge of the systems being examin ed, and therefore it essentially must draw conclusions from estimates of other ligands. The main advantage of these scoring functions, once they are established, is their speed because only a few, easily calculated, parameters must be found and the scoring function can be used to predict binding affinity. Another common scoring function is constructed using a classical potential. These functions use a force field 75


based approach and therefore are dependent on the force field chosen, each of which has its own set of problem. All of these m ethods either excl ude electronic effects or account for them with an empirical parameter that may not always apply. These methods also have problems accurately predicting properties fo r metalloenzymes, due to the presence of a metal ion in the protein, which are difficult to de scribe with classical methods. Quantum Mechanics (QM) methods have begun to demonstrate their usefulness in scoring functions, with semi-empirical methods used to increase speed. Until recently, QM methods were primarily used only for smaller sy stems because the cost associated with these methods was untenable for large scale screening. However, these methods are showing their potential in predicting binding a ffinities for metalloenzymes, which are notoriously difficult to predict with force field approaches. The problem with metalloenzymes arises from their not quite covalent bond nature, the high charge of the metal atom and the charge transfer associated with ligand binding.152 A recent enhancement to scoring func tions is a full QM potential named QMScore15,33,44 which uses a linear scaling se mi-empirical method within DivCon61,88 to calculate properties of the protein, ligand and co mplex, and combine them into a scoring function to give a binding free energy. QMScore allows th e entire protein to be calculated with a QM Hamiltonian, with the drawback that it is time cons uming to calculate the entire protein at a QM level. QMScore takes advantage of DivCons lin ear scaling capability to keep the computational cost tractable, but is still quite expensive. QMScores scoring function uses the QM calculated heat of interaction, the dispersive Lennard-Jone s interaction, a solvent term, an entropy term dependent on the number of rotatable bonds to estimate vibrational entropy and the solvent entropy change measured by surface area changes. Using these parameters QMScore was able to accurately predict binding affinities for a small set of zinc-containing enzymes and later a larger 76


set of diverse proteins. QMScore was also com p ared with several other scoring functions that are readily available and was found to provide bette r binding affinity correl ations than most of the tested functions showing that improved accuracy may be worth the additional cost. QM methods have also been incorporated in to MM methods in recent years as computing power has increased and have proven quite succes sful for MD and reactio n mechanism studies. QM/MM methods allow a small region of interest to be explored in more de tail while treating the surroundings with a less accurate method to save ti me. Some more recent screening methods use QM in a QM/MM5,60 or QM/QM153 manner to calculate protein-lig and binding affinities. These methods take advantage of the mixed methods ability to specify a region of interest, usually the ligand, with an expensive Hamiltonian, while treating the remaining system with a cheaper Hamiltonian to get more accuracy for the ligand and its surroundings while not becoming prohibitively expensive. These methods also have an advantage over many other methods in that parameters are not necessary for the ligand or many common metals that may be present. Defining the ligand within the QM region will preclude the need for parameter definitions therefore saving time in setup and eliminating th e concern that parameters may be incorrect while providing more accurate information for the ligand and its surroundings. The QM/MM method in AMBER 10 will be used in our implementation to calculate the binding free energy of various protein ligand complexes. These calculated binding affinities will then be compared to the experime ntal binding affinities of the co mplexes to assess the fitness of the scoring function chosen. The scoring function will incorporate use of an interaction energy term, a solvation term and entropic terms to determine the overall binding energy. The results of this scoring function will be presented and analyzed. 77


Methods QM/MM Implementation The QM/MM method used in this study is simila r to that used above in Chapter 2 so it will only be covered briefly. The linear scaling program DivCon, integrated into AMBER, was used to select the QM region around the ligand as described above. This region was included in the semi-empirical QM calculation, while atoms outside of this region were treated with the classical AMBER ff99SB144 force field. This gives the effective Hamiltonian of the system shown in Equation 4-1, which is a sum of the MM (HMM), QM (HQM) and QM/MM interaction (HQM/MM) Hamiltonians. MMQM QM MM effHHHH/ (4-1) Splitting the system into regions like this leaves any covalent bonds between the two regions cut, resulting in false free radicals in the QM region and charge imbalances in both regions. This is addressed by adding a hydrogen link atom to the QM atom in the broken bond that is formed, similar to the implementation in Dynamo154. The link atom is forced to lie along the bond vector so that no extra degrees of freedom are introduced to the system, and it is treated like a regular QM atom in most respects throughout the calculation. The forces on the link atom are then distributed to the QM and MM atoms that make up the bonding pair and any interactions involving the MM atom ar e treated classically155. The electrostatics of the link atom must be carefully considered to avoid false polarization of the QM boundary atom and to maintain a constant charge for the QM region. This is accomplished by spreading the charge of the QM region across all of the MM atoms in the system on ce at the beginning of a r un. This initial setup step adds a slight cost to the QM/MM calculati on, but is not prohibitivel y expensive as it is a one-time setup. In order to avoid overpolariz ation of the QM/MM boundar y the classical atom 78


involved in the link atom bond is ignored by the li nk atom and the van Der Waals interactions of the bonding pair are treated classically. This resu lts in more stable char ge distributions for the atoms in the QM region155 and eliminates any falsely high repul sive forces between the link atom and the MM bonding atom. The QM/MM method in AMBER also allows for the minimization of a system of interest using the conjugate gradient or steepest decent methods. In a QM/MM calculation the forces of the QM region are calculated by the QM program being used and are then transferred and added to the MM forces on the QM atoms and vice ve rsa, while the MM forces are calculated by AMBER. This allows a fairly simple minimi zation to be undertaken for a QM/MM calculation while still removing the need to calculate parame ters for an organic molecule or metal ion and taking advantage of the parameters available in the force field for defined atom types. Preparation of QM/MM Calculations In a QM/MM calculation proper setup of the QM and MM region and their boundary is often one of the most difficult parts. The boundary must be carefully chosen so that the approximations and assumptions made in a QM/ MM job do not significan tly alter the accuracy of the calculation. Another cons ideration is to get the boundary region, where the most errors will be, as far from the region of interest as possibl e. This is often difficult because the cost of the calculations often increases exponentially with atom number, but this problem is reduced using DivCon in AMBER due to its linear scalin g abilities. This allows the QM region to expand so that problem boundary regions can be included in the QM region without greatly increasing the cost of the calculation. In this study the QM region was selected to include not only the ligand, which is standard, but also the first shell of the active site to capture elect ronic changes from binding. A 5 sphere centered on the ligand wa s created and the entire residue of any amino acid within the 79


sphere was included in the QM region, cutting the peptide bond. This gave a rough QM region for the QM/MM calcu lations as seen in the simple representation in Fig. 4-1, but needed to be refined in order to correctly cut the boundary regions covalent bonds as seen in Fig. 4-2. This included finding any proline resi dues and including them in th e QM region, finding any crosslinked cysteines and expanding the boundary to include the entire peptide bond. After these steps were taken the QM region was properly de fined for the calculation. Once the QM region was defined the charge of the QM region was calculated dependent on what residues were present and the charge of the ligand and the activ e site was visually inspected to confirm the accuracy of the charge. As mentioned above one of Di vCons strongest abilities is its linear scaling nature allowing larger than usual systems to be consider ed for semi-empirical calculations. The cutoff established for using standard versus divide a nd conquer calculations was determined to be 600 atoms in the QM region. Fewer than 600 atoms would mean a standard calculation should be used as the divide and conquer ca lculations have more overhead so need more atoms to properly scale. More than 600 atoms mean t that the divide and conquer me thod should be used to provide linear scaling. This was all incorporated in th e setup of these QM regions and a standard or divide and conquer calculation pe rformed as needed depending on the number of atoms in the QM region. These proteins are all relatively sma ll, so only standard calcul ations were required. The number of QM atoms in each protein, liga nd and protein ligand complex can be seen in Table 4-1. Preparation of Proteins The proteins, ligands and complexes used in this study were taken from a full QM study undertaken by Raha and Merz.33 In this study 18 carbonic anhydrase (CA) and 5 carboxypeptidase (CPA) complexes with known e xperimental binding free energies and 80


resolutions under 2.5 were downloade d from the Protein Data Bank (PDB).1 A table of these proteins and their ligands, resolution and experiment al binding affinity can be seen in Table 4-2. Protons were then added to the heavy atoms using the LEAP module in AMBER 5.0 based on standard geometries and physiological pH, and energy minimization was performed to relax the added protons. Each of these complexes contai ned a zinc atom in the active site and a zincbound water molecule was added to the uncomplexed proteins to fill the va lence as performed by Raha. These structures were used as the starting point in this study. From this point the structures were sp lit into QM and MM regions for the QM/MM calculations as described above. After this was accomplished the structures were minimized for 500 cycles using steepest decent and then a maximum of another 1500 cycles using the conjugate gradient method in version 10 of AMBER.75 The long-range cutoff for both the QM and MM regions was set to 100 to better include long range effects in the binding affinity calculations. After the initial structures were minimized the solvation free energy of the protein, ligand and complex were calculated. This was done by setting divpb equal to one in the AMBER input file and specifying the SCRF and DIVPB keywords within the DivCon input. CM1121 charges were used for the solvation calculations and the grid space was set to a st atic 2.5 grid points per with a long range cutoff of 100 An internal dielectric of 1.0 was used with an external dielectric of 80. In addition to solvation free energy the vibr ational frequencies of the ligand alone and in complex with the protein were calculated us ing DivCon to provide and estimate of the vibrational entropy associated with binding. Al ong with the vibrational entropy, the zero-point energy correction and vibrational energy were cal culated from the normal mode analysis. For this the convergence criterion was increased to 1E-6 for the changes in the energy and density matrix and the long range cutoff was again se t to 100 This was done to improve the 81


convergence of the wave function at this geom et ry and improve the frequencies obtained through this calculation. Calculation of Binding Affinity The binding affinity of a protein-ligand complex can be determined using a thermodynamic cycle as seen in Figure 43 which essentially becomes Equation 4-2. solv gas bind solv bindGGG (4-2) ) (ligand solv protein solv complex solv solvG G GG (4-3) gas bind gas bind gas bindSTHG (4-4) The free energy of binding can be broken down into solvation free energy (Equation 4-3) and gas phase free energies (Equation 44). The solvation free energy is composed of the difference between the free energy of solvati on of the complex and that of th e protein and ligand separately. The gas phase free energy contains an enthal pic term from the electrostatic and non-polar interaction energies and an entrop ic term from the degrees of free dom of the individual systems. The enthalpy term is calculated in DivCon at the AM184 semi-empirical level and is shown in Equation 4-5. (4-5) atoms f corecore elec fH EEH In determining the enthalpy, Eelec is the electronic energy, Ecore-core is the core-core repulsion from the wave function and the final term is the heat of formation for all the atoms to calculate the heat of formation of the entire system. De tails of calculation of th e entropic and solvation terms in Eqs. 4-2 and 4-4 will be discussed later in this section. The nature of the binding free energy and these equations allow a master equation for the binding affinity to be pieced together from different components so that it can be calc ulated without major difficulties. The equation used in this study can be seen in Equation 4-6 an d is composed of five independent variables to 82


describe the dependent vari able. In the equation HI is the heat of interaction, -STvib is the vibrational entropy penalty, Gsolv is the solvation free energy change and Ssolv is the solvation entropy change estimated by th e change in surface area. vib solv solv I bindSTSGHG (4-6) (4-7) ligand ESCF protein ESCF complex ESCF IH H HH Another important consideration for protei n-ligand binding is the change in entropy of the ligand upon binding. Proteinligand binding is entropically unfa vorable in most cases due to loss of degrees of rotational and vibrational freedom.5,156-158 Several scoring functions estimate this conformational entropy component of the free energy on the number of rotatable bonds in the ligand or the protein and ligand together.15,33,159 This measure provides a good estimate to the degrees of freedom lost by both the protein a nd ligand on binding. Another way to calculate this conformational entropy component is to calculate the vibr ational frequencies of only the ligand in complex60 or the frequencies of the entire QM region. Capturing these effects the change in translational, rota tional and vibrational degrees of freedom upon binding can be calculated, and the entropic effect on the ligand in the protein field or the ligand and the protein side chains in the active site that interact with the ligand can be determined. This gives a more accurate estimate of the entropic penalty at the cost of an increased expense to calculate the vibrational frequencies. Using th e frequencies alone is an estimate for this entropic change and does not necessarily capture the translational and rotational changes that accompany binding. In this implementation, after QM/MM conjugate gr adient optimization, the frequencies of the optimized ligand alone and in complex with the protein were calculated at the AM1 level using DivCon and a partial Hessian vibrational analysis (PHVA).160 This allows the frequencies of only the minimized region, either the ligand alone or the entire QM region, to be calculated, 83


excluding th e rest of the system and has b een found to accurately reproduce frequencies.160,161 Many of the systems being considered here are fair ly large; therefore the minimization steps will have difficulty finding the global minimum. This is partially due to the minimization scheme used and partially to keep the binding affi nity prediction quick enough for a large scale application. Due to this, a small number of imaginary frequencies may appear from diagonalizing the Hessian matrix, especially when calculating the frequenc ies of the entire QM region, comprised of the ligand a nd protein side chains. These ar e especially prevalent in the protein side chains due to thei r available conformational space. In these cases, any imaginary vibrational frequencies found are not included when calculating the vibrational entropy. This may disregard some low frequency vibrati onal modes which were found to be important156, but can still provide a good estimate of the vibra tional entropy component due to binding. From these frequencies the vibrational en tropy, energy and zero-point energy can be calculated from the normal mode frequencies us ing Eqs. 4-8, 4-9 and 4-10 respectively. )1ln( 1, ,63 ,T n j T jv vibjv jve e T RS (4-8) j Tjv jv Ve RE1 1 2 1, (4-9) jvZPE,2 1 (4-10) In these equations vj is the vibrational temperature found from the 3n-6 normal modes, for a non-linear molecule, at temperature T. The vibrational entropy, Svib, component accounts for the change in entropy due to the gain of six vibrational degrees of freedom and loss of translational and rotational degrees of freedom when the ligand binds to the protein. The vibrational energy, Ev, component represents the in ternal thermal energy change from molecular vibrations upon 84


ligand binding. Here, the vibrational entropy is ca lculated either by finding the vibrations of the ligand in complex and alone, which has been sh own to be a good approxim ation of the degrees of freedom of the protein and ligand system156 or by calculating the frequencies of the entire QM region. The zero-point energy (ZPE) corrects the energy of the system up from the bottom of the harmonic oscillator well to the lowest vibra tional quantum level, acc ounting for vibrations occurring even at 0K. In this study the vibra tional energy and ZPE are al ready included in the calculation through the semi-empirical vacuum energy and therefore are excluded from the scoring function to avoid doubl e counting these properties. There is also an entropic solvation term to be considered on binding which can be described as the entropy gained by the displacemen t of water molecules from the active site, and plays an important role in binding.5,162 It has been shown that changes in solvent exposed surface area upon ligand binding have a correlation to solvent entropy163 and some scoring functions successfully use the term as a m easure of solvent entropy in their scoring functions.5,52,60,67 In this study the solvent accessible su rface area of the heavy atoms, regardless of the polarity of the atoms, was used to estima te the solvation entropy of the binding process. This was done by running a 1.4 probe over the surface of the protein, ligand and complex, using SASA164 which yielded the surface area of each respective piece of the protein-ligand binding calculation. This surface area difference gave an estimate of th e solvent entropy gained upon complexation. This entropy term was then combined with the vibrational entropy described above to represents a reasonable ap proximation to the overall entropy change of the system due to ligand binding. The final part of the binding free energy to c onsider is the solvation free energy change of the system. This is an extremely important fact or in the overall binding free energy and must be 85


addressed, either im plicitly or explicitly. As discussed above, including solvent explicitly is extremely expensive, so an imp licit solvation method would be idea l for an application like this where time is an important consideration. It has been shown that an implicit model of solvation appropriately describes the solvent interactions that occur in protein-ligand binding to account for the solvation effect.5,165 This term can roughly be describe d as the free energy associated with the desolvation of the activ e site and the ligand when the lig and binds, and in this study the implicit Poisson-Boltzmann (PB) model was used. A QM/MM self-consistent reaction field (SCRF) method was used to calculate the solvati on free energy, which is described in Chapter 2 in detail. This method essentially calculates the energy of the solute with and without the presence of solvent, allowing only the QM region to change polarization when the solvent is added to the calculation. This gives an appr oximate solvation free energy for the active site, allowing the charges in that area to fluctuate in the solvent field, while holding the remaining charges in the system fixed. This saves time in the solvation calculati on and, as demonstrated above, does not sacrifice a great deal of accuracy to achieve reasonable solvation energies. Using a Poisson-Boltzmann method also means that an internal and external dielectric must be set to properly capture the polarization of the QM region.166 In this study an external dielectric constant of 80 and an internal dielectric of 1. 0 were used, as in the study in Chapter 2. The dielectric constant is used to estimate the polar izability of the region it is assigned to. In the fixed charge methods values of 3.0 and 4.0 are co mmonly used for an estimate of polarizability. However, here a value of 1.0 was used because the QM region is permitted to polarize in the solvent field, therefore using a higher value may lead to over estimating the polarizability of the protein environment. Using this QM/MM PB method allowed the solvation free energy of the 86


protein, ligand and complex to be determ ined yi elding the overall cost of desolvating the ligand and protein active site. Regression Analysis A method of measuring a scori ng functions predictive abilitie s must be used to quantify how well it performs. Here, this was done by co mparing experimental bi nding free energies to the calculated binding free energies and comp aring through use of multiple linear regression (MLR) to predict the binding free energies. ML R tries to determine a relationship between a dependent and independent vari able using a least squares me thod. MLR produces a linear equation (Equation 4-9) where Xi is an independent variable, he re a component of the scoring function, and Y is the dependent variable, the binding affinity. ...22110i iXX Y (4-9) Using these methods not only can the predictive ability of a scoring function be estimated, but weights for individual terms in the scoring func tion can be determined to give them more influence in the overall score and improve a scoring function.33,167 Using MLR, the square of the correlation coefficient (R2) and standard deviation can also be determined, giving a good assessment of the scoring functions fitness in binding affinity prediction. The standard deviation (SD) calculation for this study followed the equation given by Raha and can be seen in Equation 4-10. 2 exp calcGGSD (4-10) In this standard deviation equation, Gexp is the experimental binding free energy and Gcalc is the calculated binding free energy for each proteinligand system. Regression can also be used to compare the raw score of the scoring function to the fitted score by adjusting the raw score through regression, but not using the weighti ng coefficients on the individual terms. 87


Results and Discussion It is difficult to create a scoring function b ecause it can be difficult to select appropriate terms to use in the scoring function. In the following, several different possible variables and scoring function components will be examined and compared to experimental binding free energies. This procedure will produce the best components for use in th e scoring function while also providing details on important considerations for the scoring function. We will also examine the use of MLR weights to see the infl uence this statistical method will have on the predictions. Use of ESCF Versus Total Energy In the case of QM/MM an important consideration is whether to use the total energy of the system or only the QM regions energy, ESCF. The ESCF energy encompasses only the energy of the QM region, includ ing all the changes in the elec tronic terms associated with binding, while the total energy encompasses the ESCF energy as well as the MM energy terms, such as bond and angle terms. Either of thes e terms may be appropriate to describe proteinligand binding and their affect on binding affinity prediction must be examined to determine the best one to use. Figure 4-4 shows the una djusted calculated bindi ng free energy versus experimental binding free ener gy for both the total energy (d iamonds) and the ESCF energy (squares). In this case there is only a slight improve ment between the ESCF energy and the total energy. The square of the correlation coefficient for the total energy is 0.601 with a standard deviation of 1.98 kcal/mol while the square of th e correlation coefficient for the ESCF energy is 0.608 with a standard deviation of 1.97 kcal/mol. These binding affinities were calculated after two minimization runs with a non -bond cutoff of 100 It can be argued that only the ESCF energy is essential in a QM/MM scoring function because the most detail is focused on the area 88


of greatest interest. Since the active site and liga nd, when present, is defined as the QM region it can be assumed that the greatest electronic ch anges will be situated locally within the QM region, and a QM/MM method will capture these becau se there will not be much polarization of atoms out of the first shell of the protein. In cluding the total energy of the protein may have no effect, or lower the predictive ability of a sc oring function as random movements unassociated to binding may occur in the MM region introducing s purious energy terms to the total energy. These terms may originate especially at the outs ide of the protein, which will have the least affect on protein-ligand binding, assuming these motions are unnecessary for binding which is a whole different issue. These distant changes may reduce the predictive ability through a false steric or long range interaction. For this reason it may be prudent to disr egard the total energy of the protein in favor of using ju st the ESCF energy of the smalle r QM region. Using just the ESCF can also present the argumen t that the interaction energy is a purely physical term and is not based on molecular mechanics parameters, wh ich are largely empirica l. Another possible explanation for this very minor discrepancy is that the total energy al ready includes the ESCF energy, and the additional terms involved in the total energy may not contain much more information useful for binding affinity prediction. In this case there seems to be little difference between using ESCF energy and total energy, but al l of these proteins are relatively small and it may become a more important considerat ion to monitor for larger proteins. Importance of Long-Range Cutoff Another important consideration that may not be readily apparent is the effect of the long range non-bond cutoff. Atoms throughout the protei n may affect the electronic distribution of the active site and ligand which may in turn e nhance or inhibit protei n-ligand binding to some degree. It is important to remember and acc ount for this when calculating the binding free energy of the protein-ligand complexes. The binding affinity of the protein set was calculated 89


using the scoring function with two steps of m inimization, the ESCF energy and a cutoff of 10 and 100 Figure 4-5 demonstrates the perfor mance of the function at predicting binding affinities dependent on the cutoff assigned. The long-range cutoff has a large impact on the predictive ability of th e scoring function. Using a cutoff, for both the QM and MM regions, of 10 gives a correlation coefficient squared of 0.39 with a standard deviation of 2.46 kcal/m ol. However, using a much larger cutoff of 100 yields a squared correlation coeffi cient of 0.61 with a standard deviation of 1.97 kcal/mol. For these systems this cutoff includes the entire prot ein in the non-bond cutoff calculation. It can be seen in Fig 4-4 that there is an outlier in the prediction, representing PDB ID 7cpa, whose error is increased when considering the short cutoff di stance. This QM/MM scoring function provides poor predictions for this compound in all variants without fitting. This complex is calculated to have the highest interaction energy in vacuum but was found to be the strongest binder experimentally. This may indicate the importan ce of an accurate vacuum energy calculation in binding affinity predictions. This is partially co rrected by a larger intera ction cutoff, but could perhaps be improved through more geometry mi nimization or a more rigorous minimization algorithm. Upon removal of this outlier the square of the correl ation coefficient increases to 0.60, approximately the same as the long-range cu toff with the outlier, from 0.39. Removing the outlier from the data set with a long-rang cutoff, the correlation increases to 0.66. Removal of this outlier results in a standard deviation of 1.80 kcal/mol for the 10 cutoff and 1.66 for the 100 cutoff. The importance of long-range cutoffs is well known and appears to be no less important in binding affinity calculations than mol ecular dynamics simula tions, though periodic interactions are still diffi cult and expensive to model.155 This is another advantage of the 90


QM/MM method over a purely force field based m ethod. The MM regions electronic effects from the static charges can still be included in the calculation to affect the QM regions electron density without greatly increasing the overall cost. This allows the QM region to be influenced by some of the polarization from the outer region, which in turn gives a more accurate description of the elec tronic configuration in the active site when the ligand is bound and unbound. This may cause certain intermolecular in teractions to become more pronounced, such as hydrogen bonding, and a QM/MM method will allow this effect to be captured where a force field based method would have to estimate this contribution or skip it entirely. Affect of Minimization on Binding Affinity In these calculations the prot ein, ligand and protein-ligand complex are minimized with the QM/MM method in AMBER. This takes a dvantage of the built-in steepest decent and conjugate gradient minimizers already present. The starting geometries have the hydrogens minimized while the heavy atoms are held fixed, but the rest of the protein st ructure is that of the crystal structure saved in the PDB. The crys tal structure is produced to minimize overlap without hydrogens, but is still a rough approximation of the over all protein structure at room temperature. Minimizing the protein lets th e QM and MM regions relax in their molecular environment, lowering the overall energy of the protein and providing a good starting point for QM/MM calculations. These QM/MM minimizations also relax the ligand in and out of complexation giving a better idea of structural changes in the ligand than performing a single point calculation. Minimization also makes it possible to properly calc ulate the vibrational frequencies of the ligand by relaxing it and remo ving all of the negative frequencies. The predictive ability of the scoring function on unminimized structures was also examined. In these structures the heavy atoms are directly from th e PDB while the protons are added with leap and allowed to minimize. From these structures the vacuum interaction energy, solvation free energy 91


and solvation entropy are calculate d. For these structures the vibrational entropy com ponent was estimated using the number of rotational bonds in the ligand, assigning a 1 kcal/mol penalty to each bond. The number of rotatable bonds is used here because using the vibrational frequencies without minimization would result in mostly imag inary frequencies and not provide an accurate representation of the actual entropy change. Using the best parameters found above, namely the ESCF energy and a long range cutoff of 100 the predictive ability of the scoring function was test ed depending on how many steps of minimization were done, each set comprising a 500 step steepest decent followed by a maximum 1500 step conjugate gradient minimization. The results of this using total energy for the heat of interaction, can be seen in Fig. 4-6 while the resu lts using the ESCF energy can be seen in Fig. 4-7, both of which are raw sums and not fit. The nu mber of minimization steps used has an interesting effect on the predictive ability of the scoring function. If the total energy of the protein is used, one step of minimization re sults in a squared correlation coefficient of 0.49 with a standard deviation of 2.24 kcal/mol. Performing two minimization steps and using the total energy of the system increas es the squared correlation coeffi cient to 0.60 with a standard deviation of 1.98 kcal/mol. Using the ESCF energy, the correlation increases to 0.60 for one minimization step and to 0.61 for two minimization steps, while th e standard deviations are 2.00 kcal/mol and 1.97 respectively. The results of these predictions using structures without minimization and the vibrational entropy estima ted by the number of rotatable bonds gives a correlation of 0.20 with a standard deviation of 2.80 kcal/mol. This poor correlation is due mainly to the complex in 1bn1 which gives a po sitive ESCF energy, making this a very large outlier in the prediction. This is probably due to the structure of the protein deposited in the PDB and even a short minimization of 10 steep est decent steps on each protein increases the 92


correlation prediction to 0.29 with 1bn1 rem aining a nd outlier, but not by as wide of a margin. Table 4-3 compiles the results of each of th ese minimization runs for easier comparison. The use of the total energy for the heat of inte raction term results in a large difference in the correlations. The minimization steps allow the protein to relax, and including all of the atoms in determining the heat of interaction may l ead to spurious results as including more atoms in a minimization makes it more difficult to determine the best configuration. The total energy may also be less suitable at one minimization step because residues at the protein exterior may have more difficulty fully relaxing because of all the degrees of freedom involved whereas additional minimization steps will allow the protein to more fully relax. These results suggest that using the total energy, but only performing one round of minimization may be inappropriate to properly describe intermolecular interactions between the protein and ligand. The correlation for the ESCF energy suggests that the QM region is sufficiently minimized after one minimization step. This may be because the QM forces are more accurate than the MM forces, even considering the QM/MM bou ndary. Using the ESCF energy also excludes atoms away from the active site, which have little geometri c impact on protein-ligand binding, from the heat of interaction term of the scor ing function and removes some of the potential complications from using the total energy. It is interesting to note that the ESCF corre lation coefficient far exceeds the total energy coefficient for one minimization step, but the two binding affinity correlations are quite close for two minimization steps. This suggests that using the ESCF en ergy is an adequate measure of the heat of interaction and that a single minimization may be appropriate to predict binding affinities to save time. While this was examined a bove, the results are more profound here when comparing different levels of mi nimization, and validate the use of the ESCF energy for the heat 93


of interaction com ponent of the sc oring function. Another consider ation in this matter is the time savings, since the aim of these sc oring functions is to screen large numbers of protein-ligand complexes. Using one minimization cycle would obviously save time over performing two cycles, so the ability to accura tely calculate binding affinity at one minimization cycle is an important consideration. Clearly, Table 4-3 de monstrates that using the total energy of the system at one cycle is not as predictive as the ESCF energy while there is minimal difference between one and two cycles for the ESCF ener gy. Therefore, it is acceptable to use one minimization cycle when using the ESCF energy of the system, resulting in a large time saving and avoidance of potential problems over using the total energy. Binding Affinity Predictions Through MLR The binding affinity of the 23 zinc containing pr oteins was calculated using Equation 4-6. The components of the scoring function were calculated by the best representations as determined above. In this case, the ESCF ener gy was used for the heat of interaction, a long range cutoff of 100 and, for thoroughness, two minimization cycles even though one was shown to be adequate. For these binding affin ity predictions, CM1 charges were used in the SCRF solvation calculation. The results of the binding affinity calculation predictions can be seen in Fig 4-7. The scoring function was able to predict the gene ral trend of the binding affinity for these 23 zinc proteins. As in the work by Raha, the scoring function was tested with and without fitting any of the indi vidual contributions using MLR. Figure 4-8 demonstrates the results of both the simple sum and the fitted function. Using the MLR method the different components of the scoring function can be fit to better predict the binding affinity, as described above. These results show promise for this mixed quantum mechanics/molecular mechanics method to be used in predictions of binding affinity to include a large regi on of the active site in 94


a sem i-empirical Hamiltonian. Calculating the binding affinities of thes e zinc metalloproteins yields an R2 of 0.61 with a standard deviation of 1. 97kcal/mol, without any fitting, demonstrating that this method has good predictive ability for this set. The pred ictive ability of this method has also been examined by weigh ting the different components of the scoring function through multiple linear regression, as mentioned above. Upon fitting, the R2 increases to 0.75 with a standard deviation of only 1.56 kcal/mol. This fitting gives a weighting coefficient to each term of the scoring function for each protein-ligand co mplex. These are good results for the method with and without fitting and are at least partially due to the ab ility of the method to capture electronic effects in the active site, where they will change the most upon ligand binding. This ability is especially pertinent in metalloenzymes where a force field method will have extreme difficulty capturing these effects. Parameters will either have to be derived for the metal ions and small molecules, which is very time c onsuming and will still not accurately capture electronic effects, or estimations must be made to take the pol arization into account for a broad set of ligands. It is useful to compare the predictive abilit ies of this QM/MM study with those of the full QM study performed by Raha and Merz.33 Both of these were done using a semi-empirical Hamiltonian on the same set of metalloenzymes. This comparison will allow us to assess the QM/MM methods predictive ability in comparison to the full QM method, and determine if this might be a viable method. Raha and Merz re ported the results without fitting and by fitting the solvation entropy term in the scoring function and calculated a correlation of 0.69 without fitting, while they report a higher correlation of 0. 8 when using MLR on the surface area term. The QM/MM method here obtains a correlation of 0.75 using MLR fitting for all the terms, which is fairly close to the full QM predictive ability, while the predictions without fitting yield a 95


correlation of 0.61. Uns urprisingly, the QM/M M method does not do as well as the full QM method, but it is encouraging that it is competitiv e, especially when considering a cost-benefit analysis. The correlations and standard deviations of these two studies are compiled in Table 44. When performing the full minimization cycl e as outlined above, the QM/MM method is more expensive than the QM method for this set. This is an im portant consideration because the purpose of the method is to perform more ch eaply than the full QM method while giving comparable results. For this set, all of the prot eins being studied are rela tively small, therefore including the entire protein in a linear scaling se mi-empirical calculation is not completely cost prohibitive. In addition, the QM/MM method has additional overhead accompanying it due to memory manipulation to add and place the link atoms and their coordinates in the proper places in the arrays. This involves moving large amounts of data in an array, which is a quite slow process, but only needs to be done on the firs t run. The benefit of the QM/MM method will become more profound as the size of the protein and protein-ligand complex increase, and more savings will be realized wh en using the QM/MM method on these larger systems. The full QM method implemented by Raha also made use of a single point calculation, with no minimization of the entire protein, to pr edict binding affinities with reasonable accuracy, in order to cut computation time. Performing a si ngle point calculation on the starting structure using the QM/MM method provides a large time sa vings versus the full QM method as these calculations only take a few mi nutes, whereas the full QM method takes a few hours. In the QM/MM and full QM calculations the solvation term is quite expensive compared to the single point calculations, but the QM/MM method is st ill quicker when consid ering the vacuum and solvation terms. Predicting the binding affinity from the starti ng structure without fitting only 96


provides a correlation coefficient of 0.29 with a st andard deviation of 2.65 kcal/m ol. This does not provide good predictive ability, especially when compared to the full QM method or the QM/MM method after minimization. However, when fitting through MLR the correlation increases to 0.73 for the single point QM/MM calcu lations with a standard deviation of 1.62, which almost as good as the fully minimized correl ation and close to the full QMs correlation. It is worth noting that these correlations were obtained using the vibrational entropy component from the fully minimized structures to estim ate the entropy component present in the other calculations. This is because calculating vibrational frequenc ies of unminimized structures would result in negative frequencie s, giving unreliable vibrational entr opy results. This result is produced in less time than the full QM calculation and this time difference will only increase as the proteins being studied become larger. This promising result shows that for a large set it may be beneficial to use the QM/MM MLR fit si ngle point ESCF energy in predicting binding affinities because it is cheaper than the fu ll QM and minimized QM/MM methods and provides comparable results. A potential problem for this method is the calculation of vibrational frequencies since the ligand will not be minimized and negative frequencies may be encountered. This can potentially be overcome by using the nu mber of rotatable bonds as an estimate of vibrational entropy, instead of calcula ting it through norma l mode analysis. It is also instructive to examine the contribution of each individual component to the overall calculated binding affinity. Through these examinations, potentially important interactions and the terms related to them, can be determined and potentially used to improve the scoring functions predictive abil ity. Table 4-5 shows the contri bution of each term after MLR weighting is applied. In the table Gvac is the vacuum heat of interaction from the ESCF energies, Gsolv is the solvent free energy calcul ated from the QM/MM PB method, Ssolv is the 97


solvent entropy derived from the solvent accessible surface area and T Svib is the contribution of the vibrational entropy change to the free energy. It can be seen that th e largest contributor to the binding affinity is the solv ent entropy. This energy is gained when the ligand binds to the protein and displaces the water mo lecules that are present. This term roughly correlates to the size of the ligand involved as the larger the ligand the more water is displaced. This finding is similar to the conclusions drawn by Coi et al wh en performing binding affinity calculations using MD snapshots.168 It is also interesting to note that the T Svib contributes an average signed penalty of 0.03 kcal/mol while the unsigned av erage contribution is 0.5 kcal/mol. This represents the penalty to bindi ng from losing translational and rotational degrees of freedom. These degrees of freedom are transferred into new vibrational modes w ithin the ligand which somewhat compensates for this penalty. While a change of 0.5 kcal/mol may not appear to be much, it can change the predicted binding affini ty of a ligand significantly and have a large influence on the met hods predictive ability. Vibrational Contribution to Binding Affinity In general, as a way to accelerate these calcu lations, the vibrational entropy component is ignored completely or roughly estimated. This component is added to scoring functions to account for the loss of transla tional and rotational degrees of freedom within the ligand upon binding. It has been found that a good estimate can be achieved using th e number of rotatable bonds in the ligand48,49 or the change in the number of freely rotatable bonds on the ligand and protein based on solvent exposure.15,33,169 These two methods used to estimate vibrational entropy are extremely quick, as th e number of rotatable bonds is easily calculat ed, but are not always as accurate as desired and do not necessarily reflect actual properties of the ligand. In this situation a compromise must again be ma de between accuracy and computation time to decide if this estimate is accurate enough. 98


To account for this term in this scoring function the vibrationa l frequencies of some or all of the QM atoms can be calculated. This provi des a more physical representation of what is happening upon ligand binding than a simple count of the rotatable bonds in the ligand. It also provides the additional advantage of calculating the frequencies in the field of the entire protein, allowing potential long range intera ctions to influence the results. As mentioned above, due to limitations in the minimization scheme used imaginary frequencies were found upon normal mode analysis. These imaginary frequencies were more prevalent for larger QM regions, but still composed only a very small portion of the ov erall frequencies calculated; subsequently all imaginary frequencies were left out of the vi brational entropy calcula tions. Excluding these frequencies, may leave out important low mode contributions, but can still provide a good estimate of the vibrational entropy change associat ed with binding. The previous sections used the vibrational entropy calculated from the entire QM region. These freq uencies capture effects in both the ligand and the proteins active site where all of the vibrational change will occur from binding. While the most accurate, this method is also the most expensive because the active site and ligand must be fully minimized. This met hod may work for small proteins or small QM regions, but may prove to be much more difficult and time consuming for larger proteins or QM regions as more atoms are included in the calculati on. It is for these reasons that this section will examine various ways of calculating this vibr ational entropy component and the impact they have on the scoring func tions predictive ability. Ligand only vibrations An alternative afforded by this QM/MM method is to calculate the vibrational frequencies of just the ligand, as su ccessfully applied by Grater et al.60 In this calculation the entire protein is frozen and the vibrational frequenc ies of the ligand, in the electronic field of the protein, are calculated giving the frequencies of just the ligand in complex. These frequencies 99

PAGE 100

are th en compared to those calculated for the ligand alone to find the change in vibrational entropy upon ligand binding. Since the ligand will be a more manageable size than the entire QM region, this makes the minimization require ment easier to meet, as well as reducing the number of atoms whose frequencies need to be ca lculated. These predictions are made with the same criteria as above, using the two step mini mization and the ESCF energy of the QM region while varying the manner in which the frequencies are calculated. In the case of these 23 zinc metalloenzymes, using the frequency of just the lig and results in a correlation coefficient of 0.60 with a standard deviation of 1.98 k cal/mol without fitting. This can be compared to the results of the PHVA of the entire QM region which yielded results of 0.61 with a standard deviation of 1.97 kcal/mol. With fitting, the correlation coeffi cient of the ligand only vibrations increases to 0.70 with a standard deviation of 1.73 kcal/mol while the frequencies of the full QM region yield a correlation of 0.75 and a standard deviation of 1.56 kcal/mol. These fitted results provide a wider gap in the predictive ability than without fitting, showing that the extra information obtained from a full QM PHVA might provide mo re insight into the pr otein-ligand binding. Specifically, capturing the entropy changes of the residues in th e active site provides more information on the total entropy cha nge of complexation. However, these results must be viewed in terms of a cost-benefit an alysis as calculating the vibra tional frequencies can be quite expensive. In order to save time, it might be well worth a slight reduction in correlation, especially if the protein or ligand being studied is large. Table 4-6 shows the time n eeded to calculate the vibrational frequencies of the protein-ligand complex using the entire QM region and the ligand only approach. From this table it can be seen th at using the ligand only approach greatly reduces the time necessary to calculate the vibrational entropy component of the scoring function. While 100

PAGE 101

the resu ltant prediction might not be quite as accurate, the time savings will more than likely be worth a slight reduction unl ess the protein-ligand bi nding is particularly difficult to model. The range of vibrational entropies calculated here closely match those from Schwarzl et al.5 with a minimum of -9.85 kcal/mol and a maximum of 1.44 kcal/mol. The ligands in this study contain more rotatable bonds than those used in the study by Schwarzl, accounting for the larger range of entropies found here. Including the proteins vibrational entr opy change moves the range of values to a minimum of -13.91 kcal/mol to a ma ximum of 4.91 kcal/mol. The appearance of this penalty in these vibrational calc ulations indicates a penalty of binding incurred due to loss of vibrational degrees of freedom in protein side chains. This is similar to the degrees of freedom lost in the ligand, but cannot be fully recovered with the liga nds new vibrational degrees of freedom, often leading to an overall binding penalty depending on the size of the ligand and composition of side chains. It is also worth noting that the calculation time listed is that necessary for just one component of the vibrational entropy. When including the proteins QM region in the vibrational analysis an additiona l calculation must be undertaken to calculate the frequencies of the protein without the ligand a nd the ligand alone, which are a dditional steps that are quite expensive. These proteins are also relatively small, with small QM regions, so the time would only increase even more with larger QM regions as the number of QM atoms increases. This is an important consideration in the use of this sc oring function, and could easily be modified to include the entire QM region frequencies or those of the ligand alone depending on the computational resources available and the amount of detail required. Overall, this analysis indicates that using the vibr ations of the ligand alone may be enough to produce a good correlation, but more accuracy can be gained usi ng the frequencies of the entire QM region. 101

PAGE 102

Use of rotatable bond estimate As mentioned above, it is fairly common to use the number of rotatable bonds to estimate the vibrational entropy penalty. The use of this estimate was also examined in this study by counting the number of rotatable bonds on the li gand using the Autotors tool from AutoDock.7 After counting the number of rotatable bonds on th e ligand, it was assumed that each of these bonds would be held fixed upon binding and each bond was responsible for 1 kcal/mol, representing the conformational degrees of freedom lost on complexation. Using this estimate the binding affinities for this set were calculated and can be seen in Fig. 4-9. These calculations resulted in a correlation coeffici ent of 0.59 and a standard deviation of 2.00 kcal/mol without fitting. The correlation increases to 0.69 and the standard deviation decreases to 1.74 kcal/mol with fitting. Again, the results without fitting are close to th e results obtained by calculating the frequencies of the entire QM region, while the re sults with fitting are slightly worse than those calculated using only the ligand freq uency analysis. This is an interesting result, confirming that using the number of rotatable bonds in the liga nd as an estimate of the conformational entropy change may be sufficient to capture its effect on the predictive ability of this scoring function. The rotatable bond method may be used instead of a frequency calculation in order to save time if, for example, the ligand or data set are particularly large. It is also interesting to note that, at least for this set of 23 ligands the number of rotatable bonds correlates with the vibrational entropy of the ligand calculated by Di vCon with a correlation coefficien t of 0.81, as seen in Fig. 4-10. This presents more evidence that using th e number of rotatable bonds may in fact be an acceptable estimate for the vibrational entropy of th e ligand, and this can be capitalized upon to reduce the time down from several hours for a fu ll QM region vibrationa l calculation to a few seconds for the number of rotatable bonds if desired. 102

PAGE 103

Charge Analysis An advantage of a full quantum or QM/MM method for these binding affinity calculations is that a QM method will be able to capture polari zation and charge transfer (CT) effects. Polarization is generally considered an intramolecular term, representing the internal rearrangement of electrons, whereas charge tran sfer is an intermolecular term, which is an external sources electronic effects on the protein. In the case of binding CT has been shown to play an important role, and it is an interesting ex ercise to determine the degree of CT that occurs upon binding. When using a classical method, this term would generally be missed or estimated with a parameterized term or monopole approx imation, but a QM method allows the CT and polarization to have a more physic al affect on the binding of the lig and. These CT effects can be especially relevant in a metallo enzyme complex due to the nature of metal ions and their bonds. Table 4-7 presents the charge transfer that occu rs to the ligand, in vacuum, when binding to the protein while Table 4-8 shows the same charge transfer in implicit solution. The table shows the charge tr ansfer to the ligand involved in the binding process. CT was shown to contribute significantly to the interaction energy wh en solvating a protein, and it can safely be assumed that this is also an importa nt factor in pr otein-ligand binding. Most of the carbonic anhydrase inhibitors gain approximately 0.9 electrons upon binding when considering the CM1 charges, with the exception being 1am6 which only gains 0.71. This is probably because the inhibitor for 1am6 is significantly smaller than most of the other CA inhibitors and therefore has fewer atoms to pick up and di stribute electronic influences. For the carboxypeptidase proteins the ligands give up some of their charge to the pr otein in all cases but 3cpa. This is due to the nature of the ligands, the ligand for 3cpa has a neutral charge while the rest of the compounds have a nega tive charge, and therefore an abundance of electrons to donate to the protein. Comparing the vacuum and soluti on charge transfer effects one can see there is 103

PAGE 104

very little dif ference. More than likely this is due, not to the fact that there is little charge transfer difference in solution, but to the fact that this is an implicit solvation model. Implicit models do not allow hydrogen bonding networks to establish, as would happen in an explicit model, and therefore may end up being less accurate for charge transfer effects. In both cases, this is a significant change in the electron dist ribution of the ligand and can have serious affects on the interactions between the ligand and the pr otein. A standard molecular mechanics model would not be able to properly represent thes e, while a higher level QM method would better capture the effects, but become much too cost ly. This semi-empirical QM/MM method provides a compromise for capturing CT effects, allowing them to be included while keeping the overall cost reasonable. Conclusions We have presented a QM/MM method to calcu late the binding free energy of proteinligand complexes. This method takes advantag e of the linear scaling nature of DivCon to include a large number of atoms near the ligand in the semi-empirical QM region. This allows electronic effects which would be missed in a cl assical calculation, su ch as polarization and charge transfer, to be properly represented wh ile keeping the computational cost low enough to be performed on a large library of protein-ligand complexes. This approach was successful at predicting the binding free energies of a set of 23 Zn metalloenzymes. The function performs well without any fitting, but through multiple lin ear regression the function can be fit to experimental data and the predic tive performance increases greatly from a correlation coefficient squared of 0.57 to 0.77 respectively. This may only be the case within a single protein family, and an overall set of weights for general use may be necessary as in empirical scoring functions, but it shows that the method has potential. This method also takes into account the vibrational entropy change of the ligand upon binding. Th e QM/MM method allows a unique perspective 104

PAGE 105

on this in that a norm al mode analysis can be conducted in the field of the proteins charges while charges further from the ligand remain fixed. This gives a more accurate representation of the vibrational modes available to the ligand, an d therefore a more accurate representation of the vibrational entropy contribution to the ove rall binding affinity of the complex. The contributions of various parameters fo r the binding affinity were also examined including long range cutoff and the use of the to tal energy of the system versus the QM energy for the heat of interaction. These studies in dicated that the long ra nge cutoff used makes a significant difference in the predicted binding a ffinity. This may be more pronounced in a QM/MM calculation than a purely classical calcul ation because the electr onic effects of atoms far from the active site may influence the ligand. It was also found that the use of the ESCF energy to calculate the heat of in teraction was preferable to the use of the total energy of the system, mainly to eliminate potential sources of error. Using the ESCF energy the number of minimization steps does not make as much of an impact, whereas increased minimization cycles improve the predictive ability of the function when using the total energy. This may be due to the smaller QM region being able to minimize itself in fewer steps than the entire protein. Also, the total energy takes into account the movements and interaction of the surface residues, which may be quite mobile and difficult to minimize, where the residues considered in the QM region generally have much less exposure to the solvent. Although the scoring function presented is relatively good at predicting the binding affinities of these Zn metalloenzymes, there is room for improvement. Depending on the acceptable costs, a larger QM region can easily be chosen. This will provide a second shell of residues to interact with the ligand in the QM re gion, giving a better idea of the electronic effects involved in binding. Not only will this present a better representation of the charge transfer to 105

PAGE 106

and f rom the ligand, it will allow a better pict ure of the polarizati on due to the protein environment. This could easily be implemen ted in the current AMBER QM/MM method, and thanks to the linear scaling nature may not be cost prohibitive depending on the systems examined and computational resources available. Another improvement could be made in the solvent model implemented. While the Poisson-Boltzmann method is an excellent impl icit model, an explicit solvent model would permit a better representation of the protein and li gand interactions with solvent. As it is, no charge transfer between the solute and the solven t can take place with the implicit model and an explicit model would allow a better representation Explicit solvent would also allow modeling of hydrogen bonding networks in and around the activ e site, which might in fluence the strength of the protein-ligand interaction and provide better results for mo re polar proteins and ligands. This comes at a cost, however, as including explicit solvent adds a lot of atoms to the calculation and a lot more QM/MM interactions to be considered. Finally, a simple, but more expensive, way to potentially improve this scoring function is to perform sampling of the systems being studied. For this study a single protein, ligand and complex structure was considered following minimi zation. However, this is hardly the true nature of a protein, which is always in motion, and these motions affect the ability of the ligand to bind. It is possible for a molecular dynamics simulation to be performed on the systems of interest with various snapshots being taken as the simulation progresses. This will provide several different conformations of the systems and provide an idea of some of the motions involved. Using a purely classical simulation, th is might present problems because each ligand would need to be parameterized properly, which is quite costly, to accurately model it. However, a QM/MM method would allow the para meterization step to be skipped and a QM 106

PAGE 107

region could be for mulated to make a sampling of the different configurations tractable. A QM/MM simulation could also be constructed to include only the ligand in the QM region, allowing the protein to be samp led while removing the parameterization needs of the ligand. Such sampling increases the cost greatly, however, as separate binding affi nity calculations must be performed on each snapshot produced from the simulation. This may prove to be beneficial enough to be worth it, but the cost would have to be further studied. Overall, the QM/MM scoring method presents good predic tive trends at a palatable cost but does present some areas for future improvement. 107

PAGE 108

Active Site MM QM Figure 4-1. Sim ple representation of a QM/MM calculation. The QM area (pink) extends 5 from the ligand while the rest of the prot ein (blue) is treated with a molecular mechanics force field. 108

PAGE 109

H NC H C CH3 O H NCH2 C O C N C O H NCA2 O CA1 H NCH C CH3 O H NCH2 C O C N C O H NCA2 O CA1 R1 R1 R2 R2 ALA GLY PRO A B Figure 4-2. Graphica l representation of QM/ MM partitioning scheme. The area in brackets represents the QM region. A) The r ough QM region with a PRO residue at the boundary, and cuts the peptide bonds between the C and N. B) The refined structure which extends the QM boundary to include the PRO residue and moves the boundary to cut the C-CA bond to avoid spur ious polarization at the boundary. 109

PAGE 110

Table 4-1. Num ber of atoms in the QM re gion for the protein, ligand and protein-ligand complex for each of the metalloenzymes. PDB ID Protein Ligand Complex 1a42 210 44 250 1am6 190 10 196 1bcd 210 10 216 1bn1 210 35 241 1bn3 210 35 241 1bn4 210 36 242 1bnn 210 37 243 1bnq 210 41 247 1bnt 210 38 244 1bnu 210 34 240 1bnv 210 42 248 1bnw 210 29 235 1cbx 160 25 182 1cil 210 35 241 1cim 210 29 235 1cin 210 33 239 1cnw 264 51 311 1cnx 210 44 250 1cny 210 64 270 3cpa 123 31 150 6cpa 160 57 214 7cpa 141 73 211 8cpa 123 54 174 110

PAGE 111

Table 4-2. PDB ID, resolution (in ) inhibitor, type and experim ental G of the 23 complexes used in this study. CA represents ca rbonic anhydrase proteins, CPA represents carboxypeptidase proteins and the experime ntal binding free en ergy was calculated from the Ki as (RTln(Ki)). PDB ID Resolution Inhibitor Type G(exp) 1A42 2.25 Brinzolamide CA -13.66 1AM6 2.1 Methyl-Hydroxamate CA -5.98 1BCD 1.9 Methyl-Sulfonamide CA -5.39 1BN1 2.1 AL5917 CA -12.9 1BN3 2.2 AL6528 CA -13.66 1BN4 2.1 AL5927 CA -12.86 1BNN 2.3 AL7183 CA -13.82 1BNQ 2.4 AL4623 CA -13.11 1BNT 2.15 AL5424 CA -13.54 1BNU 2.15 AL5300 CA -13.4 1BNV 2.4 AL7099 CA -12.12 1BNW 2.25 AL5415 CA -12.54 1CIL 1.6 ETS CA -12.9 1CIM 2.1 PTS CA -12.19 1CIN 2.1 MTS CA -12.06 1CNW 2 EG1 CA -10.67 1CNX 1.9 EG2 CA -10.12 1CNY 2.3 EG3 CA -10.85 1CBX 1.54 L-benzylsuccinate CPA -8.77 3CPA 1.54 GY CPA -5.37 6CPA 1.54 ZAAP(O)F CPA -15.93 7CPA 2 BZ-FVP(O)F CPA -19.3 8CPA 1.54 BZ-AGP(O)f CPA -12.66 111

PAGE 112

PL P + L Figure 4-3. Schem atic view of the thermodynamic cycle to calculate binding affinity. The cycle calculates the protein, ligand and complex in vacuum and then transfers them to solvent to find the solvation free energy. P + L PL P solvG P solvG P solvG 112

PAGE 113

-16 -14 -12 -10 -8 -6 -4 -2 0 -20-15-10-50 G(exp) G(calc) Figure 4-4. Calculated binding a ffinity for ESCF and total en ergy versus the experimental binding free energy without f itting. The squares represent the ESCF energy (R2=0.601) and the diamonds represent the total energy (R2=0.608). 113

PAGE 114

-16 -14 -12 -10 -8 -6 -4 -2 0 -20-15-10-50 G(exp) G(calc) Figure 4-5. Calculated versus experimental bi nding affinities for the zinc proteins to examine the effect of long-range cutoff. The diamonds represent the calculated binding affinity with a cutoff of 100 (R2=0.61). The squares represent the calculated binding affinity with a cutoff of 10 (R2=0.39). 114

PAGE 115

-18 -16 -14 -12 -10 -8 -6 -4 -2 0 -20-18-16-14-12-10-8-6-4-20 G(exp) G(calc) Figure 4-6. Calculated versus experimental binding free energy for one minimization step and two minimization steps using the total energy of the protein. The squares represent one minimization step (R2=0.49) and the diamonds represent two steps (R2=0.60). 115

PAGE 116

-18 -16 -14 -12 -10 -8 -6 -4 -2 0 -20-15-10-50 G(exp) G(calc) Figure 4-7. Calculated versus experimental binding free energy for one minimization step and two minimization steps using the ESCF energy. The squares represent one minimization step (R2=0.60) and the diamonds repr esent two minimization steps (R2=0.61). 116

PAGE 117

Table 4-3. Square of the correlation coefficien ts and standard deviati ons (in kcal/mol) for one and two minimization steps using ESCF and To tal Energy for the heat of interaction. Method Correlation Coefficient Std. Dev (kcal/mol) Total Energy 1 step 0.49 2.24 ESCF Energy 1 step 0.60 2.00 Total Energy 2 steps 0.60 1.98 ESCF Energy 2 steps 0.61 1.97 117

PAGE 118

-18 -16 -14 -12 -10 -8 -6 -4 -2 0 -20-18-16-14-12-10-8-6-4-20 G(exp) G(calc) Figure 4-8. Calculat ed vs. experimental G of binding for the 23 zinc protein-ligand complexes. Pink squares represent the binding affin ity calculated without any fitting to experimental data (R2=0.61). Blue diamonds represent calculated binding affinity after fitting components of th e scoring function using MLR (R2=0.75). 118

PAGE 119

Table 4-4. Com parison of square d correlation coefficients and st andard deviations between Full QM results from Raha33 and QM/MM method described here. Method Correlation Coefficient Standard Deviation QM/MM no fitting 0.61 1.97 Full QM no fitting 0.69 1.50 QM/MM fitting 0.75 1.56 Full QM fitting 0.80 1.18 119

PAGE 120

Table 4-5. Breakdown of each component in th e scoring fu nction for each protein after MLR weighting. All units in kcal/mol. PDB ID G(exp) G(calc) Gvac Gsolv Ssolv T Svib 1a42 -13.66 -14.38 -1.94 1.24 -14.92 -0.14 1am6 -5.98 -7.07 -1.83 0.68 -6.36 -0.52 1bcd -5.39 -4.45 -1.77 0.93 -5.41 0.22 1bn1 -12.90 -10.93 -1.58 1.47 -12.07 -0.03 1bn3 -13.66 -12.50 -0.40 0.24 -12.61 -0.82 1bn4 -12.86 -11.55 -2.06 1.25 -11.57 -0.33 1bnn -13.82 -13.90 -2.06 1.04 -13.24 -0.77 1bnq -13.11 -14.00 -1.86 1.23 -14.93 0.13 1bnt -13.54 -11.29 -1.91 1.02 -12.30 0.25 1bnu -13.40 -13.36 -2.07 1.34 -13.67 -0.18 1bnv -12.12 -12.81 -2.13 1.74 -13.48 -0.29 1bnw -12.54 -12.06 -1.87 1.15 -11.83 -0.43 1cbx -8.77 -10.52 -2.17 1.90 -10.16 -0.22 1cil -12.90 -12.85 -1.70 0.58 -12.74 -0.19 1cim -12.19 -11.58 -1.75 -0.08 -9.60 -0.85 1cin -12.06 -10.54 -0.64 1.21 -12.81 0.38 1cnw -10.67 -11.53 -0.34 0.21 -14.20 1.09 1cnx -10.12 -11.36 -0.59 0.59 -12.94 0.32 1cny -10.85 -12.45 -0.21 0.27 -13.53 -0.05 3cpa -5.37 -7.30 -0.89 0.97 -11.30 1.29 6cpa -15.93 -14.02 -0.57 1.10 -18.13 2.42 7cpa -19.30 -17.06 0.19 0.99 -17.74 -0.59 8cpa -12.66 -16.26 -0.68 1.15 -17.14 -0.02 Average -1.34 0.97 -12.72 0.03 120

PAGE 121

Table 4-6. Tim e, in minutes, to calculate the QM region and the ligand only vibrational frequencies with the QM/MM method for the protein-ligand complex. PDB ID QM Region Ligand Only 1a42 868.54 8.94 1am6 416.45 0.37 1bcd 580.17 0.44 1bn1 804.06 29.05 1bn3 802.76 6.50 1bn4 794.45 6.74 1bnn 818.37 8.38 1bnq 825.55 8.09 1bnt 846.12 7.30 1bnu 785.09 10.03 1bnv 857.38 10.12 1bnw 735.04 5.76 1cbx 378.02 3.43 1cil 797.64 5.57 1cim 721.65 6.68 1cin 759.12 5.96 1cnw 1614.17 18.85 1cnx 866.89 13.80 1cny 1074.23 34.21 3cpa 199.51 4.97 6cpa 576.84 18.70 7cpa 555.22 33.83 8cpa 302.63 16.84 Average 738.26 11.50 121

PAGE 122

-18 -16 -14 -12 -10 -8 -6 -4 -2 0 -20-18-16-14-12-10-8-6-4-20 G(exp) G(calc) Figure 4-9. Binding affinity prediction using th e number or rotatable bonds on the ligand to estimate the vibrational entr opy with and without MLR fitti ng. Squares represent the prediction without fitting (R2=0.59) and the diamonds represent the prediction with fitting (R2=0.69). 122

PAGE 123

0 2 4 6 8 10 12 14 16 0102030405060 Figure 4-10. Correlation of T Svib with the number of rotatable bonds for each ligand in the set from ligand only frequency calculations (R2=0.80). 123

PAGE 124

Table 4-7. Charge transfer from the protein to the ligand, in electrons, for Mulliken and CM1 charges in vacuum. Protein Type PDB ID Mulliken CM1 CA 1a42 -0.58 -0.87 CA 1am6 -0.60 -0.71 CA 1bcd -0.66 -0.91 CA 1bn1 -0.69 -0.84 CA 1bn3 -0.55 -0.87 CA 1bn4 -0.59 -0.88 CA 1bnn -0.58 -0.88 CA 1bnq -0.58 -0.87 CA 1bnt -0.58 -0.88 CA 1bnu -0.58 -0.88 CA 1bnv -0.59 -0.88 CA 1bnw -0.58 -0.87 CA 1cil -0.58 -0.87 CA 1cim -0.58 -0.87 CA 1cin -0.60 -0.89 CA 1cnw -0.58 -0.88 CA 1cnx -0.58 -0.88 CA 1cny -0.58 -0.88 CPA 1cbx 0.39 0.38 CPA 3cpa -0.56 -0.61 CPA 6cpa 0.43 0.41 CPA 7cpa 0.47 0.45 CPA 8cpa 0.27 0.26 124

PAGE 125

Table 4-8. Charge transfer from the protein to the ligand, in electrons, for Mulliken and CM1 charges in solution. Protein Type PDB ID Mulliken CM1 CA 1a42 -0.57 -0.87 CA 1am6 -0.60 -0.72 CA 1bcd -0.64 -0.91 CA 1bn1 -0.68 -0.84 CA 1bn3 -0.58 -0.88 CA 1bn4 -0.58 -0.88 CA 1bnn -0.58 -0.88 CA 1bnq -0.57 -0.87 CA 1bnt -0.58 -0.88 CA 1bnu -0.58 -0.88 CA 1bnv -0.57 -0.87 CA 1bnw -0.58 -0.87 CA 1cil -0.57 -0.87 CA 1cim -0.57 -0.87 CA 1cin -0.61 -0.90 CA 1cnw -0.58 -0.88 CA 1cnx -0.58 -0.88 CA 1cny -0.58 -0.88 CPA 1cbx 0.37 0.35 CPA 3cpa -0.56 -0.61 CPA 6cpa 0.43 0.41 CPA 7cpa 0.49 0.46 CPA 8cpa 0.27 0.26 125

PAGE 126

CHAP TER 5 DISCOVERY OF A NOVEL INHIBITOR OF A UBIQUITI N SPECIFI C PROTEASE USING VIRTUAL SCREENING Introduction Ubiquitin(Ub) is a relatively small protein of 76 amino acids that is somehow involved in most aspects of eukaryotic biology.170 The ubiquitin protein is highly conserved throughout eukaryotes, meaning that its sequence is largely identical in every orga nism, and is generally known for its role in protein degradation by the proteasome, but is involved in many different processes.171-173 These processes include signal tran sduction, transport across the plasma membrane, growth control and many others. A de fect in ubiquitin or ubiquitin-like(Ubl) protein regulation manifests itself in diseases such as developmental problems, heart disease and cancer.174-177 The ubiquitin protein is classically seen as a precursor to protei n degradation. It behaves as a marker that binds covalently to a protein, which tags it for destruction by the proteasome.178 Many viruses take advantage of this ubiquitin binding process to overcome cell defenses by redirection Ub bi nding to antiviral proteins.179 It has also been noted to signal destruction of membrane proteins, so it has po tential for a cell wide marker independent of protein structure. The process of attaching ubiquiti n covalently to a protein is extremely complicated. This cycle involves at least thr ee enzymes and three steps.180 The first step is an initial activation step by a protein called E1. This is a Ubl specific enzyme that activate s the C-terminus of the Ubl. This is done in two steps and results in a E1-Ubl thiol ester which prepares the Ub for the final reaction. The next step is processed by the E2 enzyme which transfers the Ubl from E1 to E2. The E2 family is very large and it is thought th at this size helps to diversify ubiquitination specificity.181 The final step in ubiquitination involves the E3 protein. The E3 protein is an 126

PAGE 127

extrem ely diverse collection of enzy mes, which allows Ub to have a wide variety of intercellular applications.182 E3 acts as a bridge between E2 and the substrate to be ubiquitinated, thus allowing the transfer of Ub to the target protein. A more detailed description of the ubiquitination process is pr ovided by Pickart and Eddins.181 The final result of this process is the formation of an amide bond between a lysine on th e target protein and the terminal glycine of ubiquitin or a ubiquitin oligomer. It is the numbe r of ubiquitin monomers a ttached to the protein and how they are attached to each other that determine its next step.183 An oligomer chain of four or more Ub units targets the protein for de gradation, while attaching one Ub unit may have no effect.184 Also, attaching the Ub units to different lysine residues on the adjacent Ub results in functional differences between oligomer chains.185,186 There are also many ubiquitin-like(Ubl) proteins that have been identified that change larger protei n behavior through bonding.175,187-189 Once the proper Ub oligomer is bound to the ta rget protein, the target is taken to the proteasome for degradation. Here, the protein is broken into short pe ptide chains, but the attached Ub monomers are spared and recycled resulting in a long life cycle.190,191 This removal is done by a class of proteins called deubiquitinat ing enzymes (DUBs) befo re proteolysis of the protein.180 It is important that deubiquitination occurs at the correct time; otherwise the protein substrate may not have been committed to proteolysis yet, and therefore may not be broken down. This results in a negative regulation of protein degradation, where proteins that should be destroyed are not, and may cause many problems dow n stream as they co ntinue their function.192 However, it is important that deubiquitination occur to keep Ub levels within the cell at appropriate concentrations.180 There are several different classes of DUBs which cleave the ubiquitin covalent bond after the terminal carbonyl of the last ubiquitin residue.193 Most of the DUB families are 127

PAGE 128

cysteine pro teases and a few are metalloproteases.179 We will focus on a cysteine protease family named ubiquitin specific protease (USP), specifically Herpesvirus-associated Ubiquitinspecific Protease (HAUSP)/USP7.183 We focus on USP7 because it has been implicated as a therapeutic target for cancer through regulation of a cell signaling pathway.194,195 USP7 is composed of three domains called the finger, thumb and palm as seen in Figure 5-1. Ubiquitin binds in a large pocket created be tween the three domains. The cat alytic area of the protein lies between the Palm and the Thumb with a channe l running to the catalytic triad from the Ub binding pocket. The cysteine protease families all contain two well conserved motifs which contain a catalytic triad as well as other well conserved residues in the active site.196 This catalytic triad is composed of a cysteine, a hi stidine and an aspartate wh ich polarizes the histadine.197 The cysteine in the triad is responsible for br eaking the amide bond formed by Ub by undergoing nucleophilic attack on the carbonyl.197 This creates an oxyanion wh ich is stabilized by the surrounding residues, reacts with water and eventua lly results in the rele ase of free enzyme and Ub. Structures for USP7 with Ub bound a nd unbound are available from the PDB. These structures show a large conformational change in the catalytic triad of USP7 upon Ub binding which can be seen in Figure 5-2. In this figure the unbound USP7 structure is colored green, the Ub bound structure is colored orange and the ubiquitin itself is colored light blue. A large shift in the catalytic cysteine is noted, as it moves 5.1 upon Ub binding. This allows it to be in proximity to H464, along with H464 shifting slightly so that it can be stabilized by this interaction. Upon Ub binding D481 also shifts very slightly and rotates to accommodate a stronger hydrogen bonding interactio n, allowing more polarization of the histidine residue. These conformational changes occur with little e ffect on the rest of the protein and may help 128

PAGE 129

account for the specificity of USP7. Hu et al. have noted that this ca talytic triad m ust be conserved to retain function and th at mutation of any residue in the triad eliminates its ability to cleave Ub.183 They also noted that there are two key hydrogen bonding residues on the surface largely responsible for successfully docking Ub. These residues, H456 and D295, form hydrogen bonds to the Ub tail and mutation of eith er residue results in undetectable activity of USP7. There are also two highly conserved resi dues near the active site channel, R408 and F409, which form a hydrophobic pocket to stabilize Leucine residues from ubiquitin. This entire process results in the destructi on of a protein while releasing a ubiquitin to be recycled. This entire process is important for regulation and se veral other cellular mechanisms and can be the root cause of many different cellular problems. Drug design based off of a known target is discussed in detail above. However, knowing the target does not mean that determination of an active ligand will be easy. Earlier, individual databases to screen would have to be built re quiring careful screening, much computational effort and a very large infrastructure. To fac ilitate discovery and remove a large initial barrier, large databases of compounds have been constr ucted academically and commercially. These readily available large databases remove this barrier to SBDD, eliminating the need for the infrastructure and expertise required to build and maintain smaller databases for individual use. The Zinc Is Not Commercial (ZINC) database8 is a large, free data base consisting of over 8 million compounds. These compounds are ready for use in different formats containing different tautomers with an internet base d interface. These compounds are easily used in a variety of programs with little to no conversion required. Often the compounds are from smaller sets and are identified as such and broken into subsets making it easier to isolate and use only individual subsets. Another advantage of ZINC is that al l of the compounds contai ned in the database are 129

PAGE 130

commercially available. This rem oves the proble m of having a good binder that is hard to obtain or synthesize and can speed discovery by eliminati ng this step. ZINC is often used in virtual screening to provide the compound lib rary and has been successfully used to iden tify inhibitors of many different targets.198,199 There are many programs that are commonly used for virtual screening which use different scoring functions and algorithms to ra nk ligands. This is where the aforementioned databases are used, to provide th e ligands of interest to the do cking programs. These programs fit the ligands into the chosen site on the protei n target. DOCK does this by attempting to fit the ligand into a negative image of the binding site. As a preprocessing step spheres are fit into the binding site and the ligand is f it into the spheres. DOCK 4.0 includes a method to incorporate ligand flexibility into the docking calculation through an anchor and grow algorithm.29 This, along with a parallel processing capability, makes DOCK very quick for each ligand allowing large databases of compounds to be screened e fficiently. Another commonly used program is AutoDock.7 AutoDock builds a precalculated grid of pairwise interaction energies to be used during the scoring. It then uses a genetic algorithm to optimize the ligand in the active site and scores the ligands with an empi rical scoring function, as discu ssed above. Any of these methods will result in a ranked list of ligands in vari ous conformations, one or more of which may provide at least a starting point for drug design applications. Below we will detail a virtual screening effort that uses the methods mentioned above. We have used a cysteine protease DUB as the target for virtual screening with an unknown active site. Blind docking in AutoDock was used to identify potential binding sites for ligands. The sites identified through AutoDock were then examined and experimental mutagenic data was used to verify our identific ation of putative binding sites. After identifying putative binding 130

PAGE 131

sites with blind docking and experimental evidence a database of approxim ately 64 thousand ligands was screened against the target using DOCK. Following this, the top 250 ligands from DOCK were secondarily screened against the target using Au toDock. These ligands were individually examined for favor able contacts and to determine if experimentally determined residues important for ubiquitin bi nding were blocked. The top 50 compounds were then tested experimentally in the assay described by Nicholson et al.200 to determine if they bound to the target DUB as well as blocked Ub binding. Of these two were identified to bind to ubiquitin stronger than a reporter enzyme and one was able to successfully block ubiquitin cleavage at a low micromolar concentration. Methods In order to perform accurate virtual screeni ng there is some preparation that must be accomplished. Proteins from the PDB do not have hydrogen atoms, so they must be added somehow. Also, an area of intere st, usually an active site, must be chosen to try to fit the ligands into. If an active site is not known, one must be identified using various methods such as blind docking28,201,202 or DOCKs sphere builder.24 Once the structure is prepared and the active site identified the virtual screening can proceed to identify potential compounds for the target. CADD Screening The crystal structure of HAUSP/USP7183 was obtained for this investigation. There are several USP7 structures availabl e in the PDB with different subs trates bound, but we chose the structure with Ub bound (PDB ID 1NBF). The following steps were used to identify ligands with a high probability of binding to USP7. First, the protein was prepared as described below. Following this, a potential active site was identi fied for ligand binding and used in the docking studies. A relatively small s ubset of approximately 63,000 com pounds was taken from the ZINC database representing the drug-like molecules available in the commercial Maybridge database. 131

PAGE 132

These com pounds were scored in DOCK, followed by an AutoDock ranking on the top 250 hits. These hits were examined individually for favorable interactions, and 40 were tested experimentally for IC50 values. Preparation of Protein Structure The crystal structure of HAUSP/USP7 with ubiquitin bound (PDB ID 1NBF), obtained from the PDB, was truncated. In the original structure each asymmetric unit contains two USP7Ubiquitin aldehyde (Ubal) complexes. Ubal is ubiquitin with the C-terminal carboxylate replaced by an aldehyde and is used as an inhibitor of DUBs because it forms a covalent bond with the catalytic cysteine. One of the two U SP7-Ubal structures was discarded and water and Ubal were removed from the remaining struct ure. Hydrogen atoms were then added using standard residue definitions defined by leap within AMBER.75 Leap contains a library of standard residues and their protona tion states. Using this librar y the residues from the crystal structure have protons assigned to the heavy atoms and place the according to the standard definitions. This allows standard protons to be added in a simple manner to the protein, while more complex protonation states, su ch as histidine, must be considered more carefully. Charges were added to the protein using Chimeras203,204 dock prep option, whil e the protonation state of histidine residues was determined by hand based on the surrounding environment. The dock prep option performs many functions necessary to prepare the target protein for DOCK. These functions include deleting crysta llographic solvent molecules, a dding protons and adding partial charges to the atoms based on standard AMBER at om types. Specifically, the histidine involved in the catalytic triad was protona ted on the epsilon carbon, to allow the hydrogen transfer step of the reaction to occur. These ch arges and this structure were us ed throughout in all DOCK and AutoDock calculations. 132

PAGE 133

Identification of Binding Site A probable binding site was iden tified using two methods. The first method was use of the sphere generator from DOCK29 along with knowledge of poten tially important residues on the target protein. These resi dues included the cata lytic triad as well as the hydrophobic pocket, R408 and F409, and the necessary hydrogen bonde rs H456 and D295 because mutation of these residues inhibits activity it was assumed that blocking hydrogen bonding to these residues will also inhibit activity because the Ub chain will not be properly held to the DUB. Blocking the active site and the channel leadi ng to the active site was also a ssumed to inhibit activity because the Ub chain cannot access the cataly tic triad. Blockage of the cat alytic triad was also examined so that the cysteine protease mechanism coul d not occur, and theref ore would inhibit DUB activity. The hydrophobic pocket was targeted as a way to stabilize the interactions of potential ligands with the DUB and also provided a deep cavity for the ligand to bind. First the hydrogen atoms were removed to calculated th e Connolly solvent accessible surface205 using the DMS subroutine packaged with DOCK.29 This uses a probe of 1.4 approximately the size of a water molecule, to find the surface of the protein and ge nerate vectors normal to that surface. Next, SPHGEN, packaged with DOCK, was used to build spheres of various sizes between 1.4 -4.0 with the normal vectors calculated previously en ding in the center of the sphere. The program then filtered all of these sphere s to keep only the large sphere a ssociated with each surface atom and ultimately creating a negative image of the pr otein. After these spheres were generated, spheres within 10 of the catalytic triad were selected for inclusion in the DOCK calculations, which can be seen in Figure 5-3. An attempt to confirm the active site produced from SPHGEN wa s made by using the blind docking method suggested by Hetenyi et al.202 This method builds a coarse grid of precalculated protein interactions over the proteins surface within AutoDock. Finer grids are 133

PAGE 134

advantageous, but coarse grids have also pr oved accurate enough to identify putative binding sites.201,202 Using this grid an attempt to dock a ligand in an unknown conformation to an unknown active site is done in or der to identify the active site. We used this method on an experimentally strong binding ligand in order to identify the probable binding site. AutoDock is extremely reliable and its parameter set from the AMBER force field, genetic algorithm and scoring function combination make it well suited for blind docking.201 Hetenyi et al use a grid spacing of 0.55 to successfully locate binding sites for both rigid and flexible ligands. Here, a grid spacing of 0.497 was used in generating the coarse grid with AutoGrid for the entire protein. Following grid generating, the torsions of the ligand were generated with the AUTOTORS module, with no frozen rotatable bonds. The Lamarckian genetic algorithm (LGA) and pse udo-Solis and Wets methods were used to minimize the ligands using the default parameters, except as follows. The maximum number of energy evaluations was set to 25 million, the population size was set to 200 and the number of runs was set to 150. To test the identified binding site a grid was constructed with a grid spacing of 0.336 to include all regions identified through blind docking. The docking procedure for this test was much shorter and less extensive with only 50 GA r uns, 2.5 millions energy evaluations and a population size of 150. All other parameters for the fine grid docking of the ligand were the same as those of the coarse grid. Screening Protocol The program DOCK 6.16 was used to screen a database of approximately 63,000 compounds from the ZINC database8 as described below. Spheres were created to represent the solvent accessible surface area, as described above, and used to define the docking site on the protein. Interaction energies were de termined based off of the grid method169,206 using a grid with dimensions of 38x37x41 3 based on placing the edge of the grid 8 away from the 134

PAGE 135

spheres. Scoring was done using both the van d er Waals and electrostatic energy calculated by DOCK. Screening was done usi ng flexible ligands based on th e anchor-and-grow algorithm within DOCK.9 These ligands were then minimized using a simplex minimizer within DOCK using default parameters except 1000 maximum orientations and 20 atoms for the minimum anchor size. DOCK was then run in a parallel computation over 30 processors to score and rank each ligand against the target protein. Following DOCK ranking on the set of approxima tely 63,000 ligands an AutoDock run was performed on the top 250 compounds as ranked by DOCK. This is an attempt to narrow the selection of potential ligands us ing DOCK, which has better scali ng and parallel algorithms, and then perform more rigorous docking using AutoDo ck on the top ranked h its. The target binding site for AutoDock was obtained using blind docking and sphere methods mentioned above. Once the site was identified a grid was built to include all potential binding sites. The grid spacing was finer than that of blind docking, 0.375 to account for more potential interactions. The receptor was held fixed while the ligands were allowed to be flexible around bonds identified as rotatable by AUTOTORS using the Lamarckian genetic algorithm. LGA adds a local minimization of the ligand to the genetic algorithm which allows modifications of the ligand to enter the gene population. For each ligand in the set a ra ndom starting position and conformation was used. Along with this, 75 doc kings, a population size of 200, 10 million energy evaluations, 2 translational step size, 50 torsion step, mutation rate of 0.02 and a crossover rate of 0.8 were used. Final structures from AutoDock were clustered with a tolerance of 2.0 root-mean-square deviation (RMSD). Fo llowing docking, the binding modes were visualized individually. The t op 32 compounds plus eight that appeared to have favorable interactions with the hydrophobic pocket we re selected for the inhibition assay. 135

PAGE 136

Preparation of Dataset The ligands screened against USP7 were obtained from the ZINC database.8 This resulted in ligands with charges already added and determined to be drug like, saving time in creating a compound library. The su bset of ZINC from the vendor Maybridge was used so that the compounds could quickly and easily be iden tified, obtained and scre ened experimentally. Zinc derives its partial atomic charges from AMSOL207 and includes them in the structure files. This eliminates the expense of calculating charge s for each ligand and saves a lot of time. This resulted in approximately 63,000 compounds, a fe w of which were the same molecule in different conformations. These different confor mations were also included in the DOCK and AutoDock calculations, but should have no impact on results. This library was then used in the DOCK calculations as described above, allowing a compound library to be assembled very quickly. Results and Discussion Identification of Potential Active Site Before a library of ligands can be docked to a compound a binding site must be identified. This binding site is typically at the active site of the pr otein and blocks protein activity by preventing the natural target from bindi ng. The binding site is usually identified from a crystal structure with a ligand bound, but such a complex is not available for every crystal structure. Therefore, sometimes a binding site must be determined computationally using blind docking methods. Sphere clustering For a first pass the SPHGEN routine from DOCK was used. This is a very quick calculation and needs to be done in order to proceed with a DOCK calculation, so it was an obvious first choice. This routine maps spheres into pockets on the protein surface, as discussed 136

PAGE 137

above, and can give hin ts to potential binding site s by clearly identifying cavities on the surface. This is not an extremely accurate method, but can give a starting point fo r docking calculations. Also, if there is an area of inte rest to specifically target this method can be used to provide additional evidence for the site by confirming the presence of a cavity in which a ligand may fit. Spheres were generated for USP7 as described above The sphere clusters were then visualized on the surface of the protein to s ee potential binding sites. After the spheres were calculated they were clustered into large clusters based on the nu mber of spheres in the cluster. Typically the largest cluster indicates the active site, but sometimes that is not completely accurate. The sphere calculations for the first cluster of s pheres showed a large group of spheres immediately adjacent to the active site channel. This i ndicates a pocket where a ligand may bind and block the channel, potentially preven ting ubiquitin from being cleaved. The second cluster indicated another large invagination on the op posite side of the binding cha nnel from the first, which has been identified as the hydrophobic pocket, according to Hu et al.183 The remainder of the spheres from the calculation were scattered over the surface of the protein i ndividually or in very small groups showing few places for a ligand to bi nd. The first and second major clusters of spheres were combined, along w ith other spheres within 10 of the catalytic triad, to give the DOCK target sites, seen in Fig. 5-3. The spheres indicated in Fig. 5-3 appear to be promising for blocking Ub binding as well as strongly binding the USP7. There are two pocke ts for a ligand to bind into that could block the active site channel or even prevent Ub fr om binding due to steric interference with the protein. Along with this, the hydrophobic pocke t, which contains two very highly conserved residues among seven proteins of this family, indicating their importance in binding Ub is included in these spheres. Also, near this hydrophobic pocket and leadi ng into the active site 137

PAGE 138

channel are the hydrogen bonding residues H456 and D295, which could easily be blocked by a ligand binding in either of these two pockets. Included in these sphere s are a few small groups that directly block the catalytic triad as well as a f ew individual spheres. However, it is doubtful that these play as important a role as bl ocking Ub binding through bl ocking of any of the aforementioned interactions, but were included in the DOCK calculations for completeness. After the spheres were generated and c hosen, DOCK was run using a single ligand, ZINC107402 seen in Fig. 5-4. This was done to determine if the spheres suggested by DOCK would be acceptable upon visual inspection and could be anticipat ed to possibly block USP7s catalytic function, thereby valida ting the target site chosen by DOC K. The result of this docking can be seen in Fig. 5-5. The ligand is in gr een sticks, the highly c onserved residues R408 and F409 that compose part of the hydrophobic pocke t are magenta, the e ssential hydrogen bonders H456 and D295 are colored red and the C-terminal of the Ubal is light blue. Upon inspection the ligand is seen to have a conjugated ring that is intruding into the hydropho bic pocket. The ligand also clearly blocks the channe l leading to the catalytic triad as well as blocking H456 and possibly interacting with D295, which would interfere with Ub hydrogen bonding. This would affect ubiquitins ability to bond to USP7, eliminating at least one of the essential hydrogen bonds for activity and possibly inhibiting hydrogen bond formati on with the other essential residue. This ligand would also block ubi quitin access to the h ydrophobic pocket, which stabilizes two leucine residues on the ubiquitin protein, possibly preventing Ub binding. Also, the hydrophobic pocket is composed of a phenylalanin e that is present in all 7 members of the family examined by Hu et al.183 as well several other residues that contain conjugated rings which may contribute to stabiliz ing the ligand through pi-stacking interactions. Overall, based 138

PAGE 139

on this docking and other inform ation, this select ion of spheres was deemed appropriate to where the ligand should bind to inhibit the activity of USP7 deubiquitination. AutoDock blind docking As a check on the identification of the active site through SPHGEN, AutoDock was employed. Sphere generation within DOCK finds pockets on the surface of the target, but does not take into account any form of favorable or unfavorable interacti on between protein and ligand. This can be rectified using AutoDock by building a grid encompassing the entire protein and trying to dock the ligand over the whole surface. AutoDock will allow protein-ligand interactions to be included in considering the active site as well as different ligand conformations, whereas the sphere generation knows nothing of the lig and at all. A coarse grid reduces the number of grid poi nts, and therefore the calculat ion time, but also reduces the accuracy of the docking. For the purposes of blin d docking a relatively coarse grid is used to save time, as the entire protein is included in the calculation, but not overly coarse. To use AutoDock for blind docking a ligand must be provi ded that is known to bind to the protein. In this instance a ligand with an experime ntally measured IC50 of approximately 1 M was used. This ligand was included in a small set of appr oximately 40 ligands to experimentally screen against USP7 as a first attempt at finding a li gand before virtual scr eening was utilized and presented the best binding affinity of the set. Blind docking was carried out as described in the previous section. Hetenyi and van der Spoel carried out an extensive evaluation of blind docking using vari ous genetic algorithm parameters within AutoDock. They found that three parameters, population size, number of energy evaluations and the number of trials have the biggest influence on correctly identifying the docking site. They validate this by perf orming blind docking on ligands with a crystal structure of the complex and comparing the blind results to the crystal structure. The above 139

PAGE 140

param eters are at or above the recommended va lues by Hetenyi and van der Spoel and are used to identify the binding site of this ligand. The results of the blind docking calculation in AutoDock can be seen in Figure 5-6. AutoDock clusters ligands based on conformationa l similarity, then creates a histogram of the clusters by energy. In this case the similarity had to be within 2 to create a cluster. These clusters are often used to select a ligand in AutoDock, over pure energy ranking, because it means that a large number of genetic algorithm runs were finding similar binding modes. The position of the largest cluster is shown in purple in Fig. 5-6. In Fig. 5-6 the positions of the highly conserved residues are shown in magenta, the essential hydrogen bonders in red and the Ubal in light blue. The ligand from this cluste r occupies one of the sh allow pockets identified by SPHGEN adjacent to the catalytic tr iad channel. This pocket is also occupied by a ligand from the fourth largest cluster, in blue, which is in a slightly different conformation than the ligand from the largest cluster. In th e current binding modes neither of th ese seem to either inhibit Ub binding or block the active site channel, so the viab ility of this pocket is in question. The third ligand present in Fig. 5-6, in red, is from a cluster thats also fourth largest in size, but is lower in energy than the blue ligand. Th is ligand occupies a very shallow series of bumps on the protein surface near the hydrophobic pocke t unidentified by SPHGEN. Th is ligand would sterically interrupt Ub binding, but would not affect the proteins mechanism if Ub could take a slightly different conformation. Clusters two and three were on the opposite side of the protein from UB bind and the catalytic triad a nd were therefore disregarded because any mode of potential inhibition from these binding mode s would most likely be due to conformational changes in the protein, which cannot easily be accurately mode led through virtual screening. These results 140

PAGE 141

from a coarse grid confirm the spheres gene rated by SPHGEN for the shallow pocket near the catalytic triad channel. An attempt to validate the blind docking results from AutoDock was carried out, as with the results of SPHGEN with DOCK. A finer, smaller grid was used on the protein to encompass both sites indicated by blind docking within AutoDock. A finer grid will allow a more accurate docking because more interaction points are calcu lated, but will increase computation time by increasing the number of points. This is c ountered by making the grid smaller and only calculating points on specific areas of interest to keep the overall number of grid points lower. The AutoDock calculation with a finer grid was carried out with ZINC107402 and Fig. 5-7 shows the results of the two larges t clusters. The largest cluster is in blue and the lowest energy cluster is in green. The highl y conserved hydrophobic residues ar e in magenta, the essential hydrogen bonders are in red and the Ubal is in light blue again. The binding modes of these molecules both involve interactions in the hydrophobic pocket and, as above for the DOCK calculation, may block the important hydrogen bondi ng residues as well as the catalytic triad channel. The largest clusters binding mode is very similar to the binding mode from DOCK, with the conjugated ring inte racting with the hydrophobic po cket and its conjugated ring residues. It also adopts a conf ormation that seems to effectively block the active site channel. However, in the lowest energy cluster the tr ifluorinated carbon is in the hydrophobic pocket while the ring structure is blocking the active site channel. It is also interesting to note that, even though the grid contained the binding site above the hydrophobic pocket, as indicated by blind docking, there is only one small cluster occupyin g that binding site and it has the highest energy of all the clusters. There are a few clusters that occupy the shallow pocket adjacent to the hydrophobic pocket, but most interact with the pocket and only one is above it. This suggests 141

PAGE 142

that perhaps a finer grid is needed to properly cap ture whatever impor tant interactions are occurring in the hydrophobi c pocket to rank the ligands appropriately. Testing of the Putative Binding Site As an initial test of the chosen binding targ et two ligands were docked to USP7 that had experimental percent inhibitions and IC50 values These ligands were from an experimental screen of a very small ligand libra ry done before virtual screening began, so experimental results were immediately available, and these ligands were not used to determine the target site in blind docking. The ligands used, ZINC1637098 and ZINC 1636516, are seen in Fig. 5-8. These two ligands are very similar in structure with the main differences being that ZINC1636516 is missing two hydroxyl groups from the phenyl group a nd has sulfur in place of a carbon in the five-membered ring. These two ligands were used in DOCK and AutoDock to gauge the chosen target site. ZINC1636516 binds mo re strongly to USP7 and inhi bits its activity more than ZINC1637098 according to experiment. Therefore, if this binding site is to be assumed as correct, DOCK and AutoDock should predict the same trend for the two ligands. DOCK The binding mode results from DOCK are seen in Fig. 5-9 with these two ligands while Table 5-1 shows the energies. The terminal benzene ring creates interactions within the hydrophobic pocket for both ligands, and they also both block the active site channel. This supports the hypothesis that the hyd rophobic pocket is important in st abilizing a ligand to block deubiquitination by USP7. However, DOCK pr edicts that ZINC1637098, the weaker binder experimentally, will be the stronger binder. Th ere is only a 3% energy difference between the two binding energies. This may be due to a slightly incorrect conformation from DOCKs anchor-and-grow algorithm. In any case, the very small difference in binding energy was not 142

PAGE 143

enough to discourage use of this binding site for docking. The interactions of the ligands with the hydrophobic pocket also add to the argum ent that this target site is appropriate. AutoDock The same ligands used above were also docked into USP7 using AutoDock. This was done to gauge the appropriateness of the target si te as well as the grid size used. For these 10 million energy evaluations, 100 runs and a population size of 200 were used. As before, it is expected that AutoDocks ranke d list should show ZINC1637098 as having a higher energy, and therefore being a weaker binder than ZINC1636516. Tables 5-2 and 5-3 show the results of the docking calculation and Figs. 5-10 and 5-11 show the binding modes generated by AutoDock w ith the largest cluster in blue and the lowest energy in green. The energies of the lowest energy compounds from AutoDock are only 0.35 kcal/mol different, but are reversed from what is expected from experiments. This tight margin may be due to a simple error in the docking or lack of sampling and again is not taken as a dismissal of this binding site. Another reason not to discount this binding site is that the lowest energy has a very small cluster size and often the la rgest cluster of ligands is used instead of the lowest energy due to the fact that, unless the la rgest cluster and lowest energy are the same, the clusters give information on how likely a given binding mode is, so the ligand with the lowest energy is not necessarily the most likely binder, especially when the energies are within a standard deviation of the AutoDock force field (2.5 kcal/mol). In these cases it is common to use the largest cluster data instead of the lowe st energy as the predicted binding mode.50,208,209 In any case, the lowest energy mode for this ligand has interactions within the hydrophobic pocket. The presence of two polar hydroxyl groups may re nder these interactions more unfavorable and therefore contribute to its lowe r binding affinity. The binding mode of the largest cluster for ZINC1637098 is above the hydrophobic pocket, near the site iden tified by blind docking of 143

PAGE 144

ZINC107402. This is interesting because it provides evidence to wards suggesting that inclusion of that area into the target grid selection is valid. In this m ode the ligand forms hydrogen bonds to surface residues of the protein, which may stab ilize binding. This position of this ligand may affect Ub binding by interfering with the main part of the protein, preventing it from binding to USP7. However, these contacts are very tentative and may be avoidable with a slight change in conformation of either Ub or USP7. This steric interaction may explain why there is inhibition from this ligand, but only very weakly. Figure 5-11 shows the lowest energy and largest cluster bindin g modes of ZINC1636516 from the AutoDock calculations. These modes ar e extremely similar, with an RMSD of only 0.34 Both ligands bind with the benzene ring in the hydrophobic pocket as might be expected to stabilize the interaction, and sterically block the active site channel. The entrance to the active site channel is also slightly hydrophobic, so the ring systems can be placed in there and the carboxyl groups can be solvent accessible. In th e case of this ligand the cluster size for both the largest energy and largest cluster are fairly larg e, 18 and 22 respectively, and the energies are close also. This is more evidence that the hyd rophobic pocket plays an important role in ligand binding and adds to the conclusion th at this binding site is acceptable for the rest of the study. Based on the blind docking results of these tw o ligands as well as ZINC107402 the binding area outlined by the spheres in Fig. 5-4 was accepted and used for the rest of the docking study. Inhibitors from Virtual Screening Based on the blind docking methods above a putative binding site was chosen for HAUSP/USP7 to be targeted for in silico database screening. The first step of the screen was to run DOCK on a database of approximately 63, 000 ligands, comprising the Maybridge catalog available through ZINC. This allowed easy acc ess to the compounds so that experimental screening could be done easily, qui ckly and cheaply while providing only drug like molecules. 144

PAGE 145

The overall score, a combination of van der W aals and electrostatic energy, was used to produce a ranked list of the ligands pr esented for docking. The top 250 unique ligands from the Maybridge library are shown in Fig. 5-12 where Ubal is light blue, USP7 is orange, the hydrophobic residues are in magenta and the essentia l hydrogen bonders are in red. Ligands that are in the library twice as different conformers are included here, but resu lt in identical scores. This is due to DOCKs conformer search algor ithm, where multiple conformers of the same molecule in the database are given the same conformation by the DOCK algorithm. This result is a good measure of the DOCK algorithms reliab ility because essentially these different conformers are the same molecu le, and therefore should give you the same binding result. All but one of the ligands take a binding mode that blocks the hydrophobic pocket, essential hydrogen bonders or the active site channel. The one ligand that does not potentially interferes with the cysteine protease mechanism by blocki ng access to the active site or the oxyanion hole or may be a false positive. A large por tion of the compounds in trude upon the hydrophobic pocket to stabilize binding. Most of the ligands in the library also block H456, while fewer block or interact with D295. This may be the one of the main reasons behind HAUSP/USP7 inhibition, according to mutation st udies, and may be an important interaction to block to prevent deubiquitination of a protein. Se veral of these ligands also extend into the active site channel, which is slightly hydrophobic. Extending into this channel might provide more stability to binding while preventing the Ub tail from reaching the catalytic triad, thus blocking deubiquitination. As indicated in the Methods section a sec ondary screening was performed on the top 250 compounds from DOCK using AutoDock. DOCK is better suited for screen ing very large ligand libraries in a parallel capacity while AutoDock can only be run serially and is slower, but more 145

PAGE 146

accurate. Since only app roximately 50 compounds we re to be tested experimentally the choice of ligands had to be narrowed from 63,000 to the top 50. Once DOCK gave its top ligands AutoDock jobs were set up using scripts include d with the AutoDockTools program. The USP7 structure used in the DOCK calculations was used in the AutoDock study and a grid was set up to encompass the hydrophobic pocket, active site channel and a secondary pocket adjacent to the hydrophobic pocket. The compounds were then ranked according to lowest energy in the largest cluster. Fig 5-13 shows the result of the Auto Dock calculation on the 250 ligands from DOCK. The DOCK and AutoDock structures block th e same general area of USP7, namely the hydrophobic pocket, essential hydrogen bonders and ac tive site channel. There also remains a high concentration of ligands that extend into the hydrophobic pocket. There are more ligands that extend directly into the active site channel from AutoDock than for DOCK. However, a large difference between DOCK and AutoDock is seen for the ligands in that AutoDock poses occupy the secondary pocket much more frequen tly than DOCK poses. Also, the structure that bound over the catalytic triad is no longer in the same position and is now more involved in the anticipated region. This figure lends more credence to the importance of targeting the hydrophobic pocket to block Ub binding. Not only does this pocket potentially block Ub binding, it provides an area for the ligand to stabilize its inte ractions and therefore increase binding strength. Following this docking procedure the rankings from DOCK, AutoDock lowest energy and AutoDock lowest energy in the largest cluste r were compiled. The ligands from AutoDock were examined individually to check their binding mode within USP7 and identify any potentially good binders based on chemical intu ition or errors in the docking process that resulted in unphysical structures. From this list 40 ligands were selected to order for 146

PAGE 147

experim ental testing, eight were selected due to potential favo rable interactions upon visualization and 32 others from the lowest energy in the largest cluster rankings. Analysis of Experiment ally Active Compounds Using experimental assays200 five compounds were found to have a favorable percent inhibition for USP7 over the reporter enzyme used in the assay. Following a percent inhibition study two separate trials were done to determin e the IC50 of the ligand with USP7. The IC50 value of each ligand on the reported enzyme as we ll as the target was tested and ligands that were stronger binders to the targ et moved on to the next phase. In this final phase chains of ubiquitin, ranging from 2-7 Ub units, are cut by the target, USP7. This measures the ligands ability to prevent deubiquitination by USP7 and is an excellent test of the inhibitory effects of the ligand. The top five experimental inhibitors can be seen in Table 5-4 along with their screening score from DOCK and AutoDock and the average e xperimental IC50 values from the two trials. From Table 5-4 it can be seen that AutoDo ck predicts the experimental binding free energy at this binding site well for most compounds. AutoDock has one major outlier, ZINC623749, which it predicts to be a much stronger binder than it was experimentally measured to be. This may be due to an in correct binding mode prediction from AutoDock, conformer issues from the ZINC database or some experimental issu es that may not be identifiable. In any case, AutoDock does fairly well with an R2 of 0.26 for the lowest energy in the largest cluster and 0.19 for the lowest ener gy for the five ligands above. Removing the outlier, AutoDocks predictive co rrelation increases to 0.997 for th e lowest energy in the largest cluster and 0.96 for the lowest energy mode. Th is provides at least a st arting point for further virtual screening calculations, and helps to validat e the choice of binding site used. Since the binding site was initially unknown the fact that th ese predictions match so well helps reinforce our active site prediction. 147

PAGE 148

DOCK does not do as well at predicting the binding free energies of these ligands as AutoDock based on total score. The im portan ce of DOCK in this study is to narrow the compound library from several thousand down to a few hundred, so for this particular application its ability to predict is not an ab solute necessity. The differences between DOCK scores are only approximately 3% of the total energy so this result is not completely discouraging. It is also of not e that it has been suggested usi ng the total score from DOCK is not as accurate as using just the van der Waals score.24 In this case, using the van der Waals score alone provides a significant increase in the predictive abilities using virtual screening. There is almost no correlation between the total score calculated by DOCK and the experimental binding free energies of these five compounds. However, if the van der Waals score alone is used the correlation increases to an R2 of 0.31, which is even better than the prediction given by AutoDock. This may present an argument to use DOCK and the van der Waals score alone in this particular study. This may be used as either a first screen to be passed into AutoDock or perhaps even the only method of virtual screen ing. Since AutoDock only ranked the top 250 compounds from DOCK based on total score it is possible that some hits were excluded from AutoDock based on DOCKs total score. Th is reranking may provide a simple way to recommend more compounds for experimental testing and generate more leads for further drug development. Figure 5-14 shows the lowest en ergy in the largest cluster binding modes of the five ligands that were experimentally screened. Tw o of the ligands reported much stronger binding to the target protein, USP7, than to the reporter enzyme. These two ligands, ZINC12339422 and ZINC1036512, were then used in another assay to determine the ability of the ligand to inhibit deubiquitination by USP7. It is interesting to recall the blind docki ng performed above and 148

PAGE 149

exam ine these binding modes to apply what was l earned. According to these binding modes all of the experimentally active ligands clearly block th e active site channel. This leads to the idea that preventing deubiquitination can be accomplis hed by blocking the Ub access to the catalytic triad through this channel. This would prevent the peptide bond to be cleaved access to the catalytic triad and therefore stop activity of USP7 Another interesting feature of these active ligands is occupation of the small pocket adjacen t to the hydrophobic pocket. Three of the five active ligands occ upy this pocket which is slightly hydr ophobic. Ligands making use of this pocket must be fairly large to affect deubiquitination, but this pocket provides another potential site of stabilization for a possible drug candida te. ZINC623749 occupies the site identified by blind docking above the hydrophobic pocket. Only a few of the top 250 ligands occupy this site and its interesting to see a ligand occupying this site in the top five from experiments. However, even this ligand manages to block potential Ub interactions with the two essential hydrogen bonders as well as block the active site channe l. Finally, the strongest binder, ZINC12339422, has a benzene ring that extends into the hydrophobic pocket as well as a benzene ring that interacts slightly with the hydr ophobic active site channel. Ther e is only one more feature on this ligand to form strong interactions, one hydr ogen bond forming nitrogen that interacts with one of the essential hydrogen bonding residues. Th is shows that it may be important to get interactions with the hydrophobic pocket to stabilize ligand bindi ng to USP7 and may indicate that interactions with they hydrophobic channel ar e also important to cap ture. The poses of ZINC12339422 from DOCK and AutoDock can be s een in Figure 5-15 and are very similar. Also, upon further testing it was found that this ligand strongly inhibited deubiquitination by USP7 so it was not only a strong binder. ZINC1036512, colored yellow in Figure 5-14, was found to be the third strongest inhibitor thr ough experimental assays measuring the ligands 149

PAGE 150

ability to bind to the target, not inhibit activity. Upon furthe r testing it was found that while AutoDock correctly predicted that it would bi nd strongly to USP7, it doe s not strongly inhibit deubiquitina tion. This is an interesting situati on and can provide insight into the mode of action of a potential ligand. Looking at the structure from AutoDock the ligand does not block the essential hydrogen bonders or really block the active site channel. Therefore, it is not surprising that this ligand does not inhibit USP7 activity. This structure may provide more information for use of the secondary pocket as a site of lig and stabilization and may be worthy of further investigation. Conclusions The aim of this study was to find novel inhib itors for a new protein target to prevent deubiquitination by USP7. Using a combination of computational identification of the binding site with experimental evidence from the crysta l structure and mutagenic tests we have provided a cogent argument for the binding site chosen. Following this identification, a two stage virtual screening process using DOCK and AutoDock was undertaken to narrow a library of 63 thousand compounds down to 40 to be experimental ly assayed. From these 40, five of the ligands were determined to bind with low micromol ar affinity to the target, while one of those five was identified as strongly i nhibiting activity of the target protein. While more of the compounds may have been able to bind strongly to USP7 or inhibit its activity only two were found to bind much more strongly to the target than the repor ter enzyme and were therefore tested in all available assays. ZINC12339422, the best inhibitor for USP7 was later found to also inhibit another cysteine protease and therefore will not proceed in further testing. These compounds provide a starting point for future st ructure based drug design studies and provide important insights into how a ligand ma y interact with this protein. 150

PAGE 151

In order to further drug developm ent for this target several options are available. The first, and easiest, might be to utilize a larger ligand library. The Maybridge subset from ZINC used in this study consisted of 63 thousand compounds, which might be considered small compared to other ligand databases. A library of compounds might be compiled from various available chemical and commercial databases, or the entire ZINC database might be used for virtual screening. This could quickly and eas ily provide upwards of several million compounds to screen, vastly increasing th e chemical space screened and providing more potential hits. A second option in this vein is to perform a similarity search based on the current successful hits. This will provide structurally similar compounds and provide a library of compounds similar to known binders. This library of bi nders could then undergo another round of virtual screening or be tested experimentally, depending on the size of the library generated. Using structurally similar compounds takes advantage of what was learned from this screening to narrow chemical space, while providing more potential structures. Other options for future studies are more i nvolved than extending a database. Perhaps the best option along these lines would be to perform some conformational sampling of the target. The purpose of this is to capture multiple conformations of the target instead of a fixed structure as is commonly used. In such a study the same process as described in this study is performed, with the inclusi on of screening the top 250 DOCK or AutoDock hits on several snapshots from a molecular dynamics simulation. This provides several different conformations for the residues in the protein, ideally the residue s near the active site. This might be especially pertinent for this target due to the large c onformational change observed upon ubiquitin binding as mentioned. There is also a glutamine residue that bridges the active site channel in the bound conformation. Sampling of this residue might allow new areas for a ligand to access and 151

PAGE 152

theref ore increase binding affinity, specificity or the number of positive hits. Another more drastic measure would be to locate alternative bi nding sites that might affect ubiquitin binding. As mentioned by Hu et al.183 there are several hydrogen bonding residues in the finger region of USP7 that help bind Ub. It might be possible to target these or other residues in the finger for virtual screening and present these for experimental tests to increase specificity. However, this may prove fruitless due to the conserved nature of this family, but it is certainly a possibility. Here virtual screening has proven useful to iden tify successful micromolar inhibitors and using these results and further testing a successf ul drug candidate may yet be discovered. 152

PAGE 153

Figure 5-1. USP7 finger, thumb and palm domains illustrated.183 153

PAGE 154

H464 D481 C223 2.5 5.1 3.6 Figure 5-2. Conformational change of catalytic triad upon ubiquitin binding. 154

PAGE 155

Hydrophobic Pocket Catalytic Triad Figure 5-3. Generated sphe res from SPHGEN for USP7. 155

PAGE 156

O N N S N N N CH3 CF3 B A Figure 5-4. A) 2D and B) 3D struct ure of ZINC107402 used for blind docking. 156

PAGE 157

Figure 5-5. Ligand docki ng position on USP7 using chosen sphere set. 157

PAGE 158

Figure 5-6. Clustered ligand positions from blind docking with a coarse grid in AutoDock. 158

PAGE 159

Figure 5-7. Ligand docking position from fine grid calculation in AutoDock. 159

PAGE 160

O NH N HO HO O O O O H N S N O ZINC1637098 IC50( M): >50 % Inhibition: 27 A ZINC1636516 IC50( M):2 7 B % Inhibition: 55 Figure 5-8. 2D structures of A) ZINC1637098 and B) ZINC1636516. 160

PAGE 161

Table 5-1. van der W aals(VDW), electrost atic(ES) and total score from DOCK for ZINC1637098 and ZINC1636516. Score VDW ES EC50( M) % Inhibition ZINC1637098 -48.139 -47.119 -1.020 >50 27 ZINC1636516 -46.380 -45.428 -0.952 27 55 161

PAGE 162

A B Figure 5-9. A) ZINC1637098 binding m ode predicted by DO CK. B) ZINC1636516 binding mode predicted by DOCK. 162

PAGE 163

Figure 5-10. Largest cluster(blue) and lowe st energy(green) binding modes predicted by AutoDock for ZINC1637098. 163

PAGE 164

Table 5-2. AutoDock lowest energy sc ores and cluster sizes for ZINC1637098 and ZINC1636516. AutoDock Score(LE) Number in Cluster EC50( M) % Inhibition ZINC1637098 -6.03 8 >50 27 ZINC1636516 -5.68 18 27 55 164

PAGE 165

Table 5-3. AutoDock largest cluster sc ores and cluster si zes for ZINC1637098 and ZINC1636516. AutoDock Score(LC) Number in Cluster EC50( M) % Inhibition ZINC1637098 -4.25 23 >50 27 ZINC1636516 -5.2 22 27 55 165

PAGE 166

Figure 5-11. Largest cluster(blue) and lowe st energy(green) binding modes predicted by AutoDock for ZINC1636516. 166

PAGE 167

Figure 5-12. Top 250 structures ranked by total interaction energy from DOCK. 167

PAGE 168

Figure 5-13. Top 250 structures rank ed by total energy from AutoDock. 168

PAGE 169

Table 5-4. Structures, scores and experim ental IC50 values (in M) of the five active compounds selected from virtual screening. Chemical Structure and ID DOCK Score AutoDock Score IC50(M) Z INC12339422 -62.91 -5.81 3.72 Z INC8618162 -64.85 -4.60 13 Z INC1036512 -61.80 -4.49 13.7 NH+NH S O N 169

PAGE 170

Table 5-4 C ontinued Chemical Structure and ID DOCK Score AutoDock Score IC50(M) Z INC623749 -64.50 -6.08 14.4 Z INC8622547 -62.41 -4.44 16 170

PAGE 171

Figure 5-14. Binding modes of the five experimental hits on USP7. 171

PAGE 172

Figure 5-15. Structure of ZINC12339422 bound to USP7 from AutoDock(blue) and DOCK(green). 172

PAGE 173

CHAP TER 6 CONCLUSIONS This chapter provides an overview of the work presented in the disse rtation above. These projects represent the utility of various computational methods to study and predict the proteinligand binding process. A combination of comput ational and experimental progress has allowed these detailed studies to pr ogress, improving the drug disc overy process along the way. Chapter two discussed the drug discovery proc ess in broad terms. Several computational methods to calculate binding free energy were di scussed. These included different ways of scoring protein-ligand interac tions including knowledge based and physically based methods. Also presented in this chapter was an overview of the virtual screening process to scan a large library of small molecules for potential interactions with a specific protein target. The QM/MM method was introduced and described also, includ ing careful selection of the boundary and the link atom method to cap the QM region. Finally, implicit and explicit solvation methods were discussed. These are important for any biological calculation, but their contribution to binding free energy especially cannot be ignored. Chapter three discussed the development and validation of a QM/MM solvation method implemented in AMBER. This method allowed an ar ea of interest to be pol arized in a solvents electrostatic field, keeping the remaining area at fixed charge. This was done using a finite difference method Poisson-Boltzmann solver implem ented in the semi-empirical linear scaling program DivCon. The purpose of this method was to model solvation of an active site while reducing the overall cost by not polar izing all the atoms in the protei n or protein-ligand complex. Thanks to DivCons linear scali ng nature this method allowed larger than usual QM regions to be considered, up to several hundred atoms, moving the QM/MM boundary further from the area 173

PAGE 174

174 of interest, decreasing errors due to QM/MM approximations. Th e method was then validated by comparing the solvation energies calculat ed using the QM/MM method of 20 different pentapeptides as well as several small proteins to the solvation energy of the entire protein calculated with a full QM method. The time save d between the two methods was also discussed The fourth chapter described the calculati on and use of a QM/MM scoring function to predict protein-ligand binding free energies. The functions aim was to save time over a full QM implementation but gain accuracy over a purely classical method. A QM/MM method allowed the polarization and charge transfer of the ligand and protein upon bindi ng to be properly captured, something that most scoring functions do not take in to account. The QM/MM function was then used to predict the binding affinities of 23 zinc metalloenzymes with and without multiple linear regression. Various possible parameters were tested for the components of the scoring function and the results show a good ab ility to predict binding affinity for these metalloenzymes. Chapter five detailed the implementation of DOCK and AutoDock in a virtual screening effort to find a lead compound. A library of small molecules was compiled from the ZINC database and screened agains t the target using DOCK. Fo llowing this the top 250 compounds were visually inspected and screened using AutoDock, which has been found to be more accurate. The top 40 were then visually inspecte d again and experimentally tested, resulting in 5 low micomolar hits for the targ et. These hits were found to provide a good starting point for further investigations and several potential extensions of this study were suggested.

PAGE 175

APPENDIX SAMPLE INPUT FILES QM/MM Input The QM/MM input file is a st andard AMBER input file with a few modifications. The ifqnt flag must be set to one to turn on a QM /MM calculation. The $qmmm section is defined in the input to define QM region atoms and speci fy QM/MM specific keywords. The qmtheory keyword denotes the QM Hamiltonian to use on the QM region. The next keyword, qmcharge, specifies the charge of just the QM region while idc specifies an AMBE R, standard DivCon or divide and conquer DivCon calculation. In orde r to calculate the solvation using the QM/MM Poisson-Boltzmann solver in DivCon a simple flag, divpb, must be set to one in the $qmmm section. The presence of this keyword in th e AMBER input file and the accompanying keywords in the file are all that is necessary to implement the solvation calculation. No additional keywords are necessary in the AMBE R input in order to perform a vibrational analysis. 175

PAGE 176

&cntrl im in=1, maxcyc=2000, drms=0.0005, scee=1.2, ntpr=10, ntb=0, cut=100, ifqnt=1, ncyc=500, / $qmmm iqmatoms= 1426, 1427, 1428, 1429, 1430, 1431, 1432, 1433, 1434, 1435, 1436, 1437, 1438, 1439, 1440, 1441, 1442, 1443, 1444, 1445, 1446, 1463, 1464, 1465, 1466, 1467, 1468, 1469, 1470, 1471, 1472, 1473, 1474, 1475, 1476, 1477, 1478, 1479, 1480, 1481, 1482, 1483, 1595, 1596, 1597, 1598, 1599, 1600, 1601, 1602, 1603, 1604, 1605, 1606, 1607, 1608, 1609, 1610, 1611, 1612, 1613, 1811, 1812, 1813, 1814, 1815, 1816, 1817, 1818, 1819, 1820, 1821, 1822, 1823, 1824, 1825, 1826, 1827, 1828, 1829, 1830, 1831, 3039, 3040, 3041, 3042, 3043, 3044, 3045, 3046, 3047, 3048, 3049, 3050, 3051, 3052, 3053, 3054, 3055, 3056, 3057, 3058, 3059, 3060, 3061, 3062, 3063, 3064, 3065, 3066, 3067, 3068, 3069, 3070, 3071, 3072, 3073, 3074, 3075, 3076, 3077, 3078, 3079, 3080, 3081, 3082, 3083, 3084, 3085, 3086, 3087, 3088, 3089, 3090, 3091, 3092, 3093, 3094, 3095, 3096, 3097, 3098, 3189, 3190, 3191, 3192, 3193, 3194, 3195, 3196, 3197, 3198, 3199, 3200, 3201, 3202, 3203, 3204, 3205, 3206, 3207, 3208, 3209, 3210, 3211, 3212, 3213, 3214, 3215, 3216, 1447, 1448, 1449, 1450, 1451, 1452, 1453, 1454, 1455, 1456, 4080, 4081, 4082, 4083, 4084, 4085, 4086, 4087, 4088, 4089 qmtheory=2 qmcharge=0 idc=1 / Figure A-1. Sample AMBER input for a QM/MM minimization. DivCon Input A file must accompany the AMBE R QM/MM input file when performing a DivCon calculation. This file contains keywords specific to DivCon a nd is read in by the DivCon library. Options here that overlap between the QM/MM and DivCon input, such as idc=1 in the QM/MM input and STANDARD in th e, must match or the program will fail. 176

PAGE 177

The file m ust also be specified for the solvation and vibration calculations. The addition of a few keywords within the file allows these various types of calculations to be performed. As seen in Fig. A-3, the PB calculation input is similar to the vacuum, but has the addition SCRF and DIVPB keywords, instructin g DivCon to use its PB solver. The CM1 calculation causes the PB solver to use CM1 ch arges, and the SCALE keyword sets the grid spacing to the desired value, in th is case 2.5 grid points/. In the case of vibrational frequency calculations the FREQ keyword mu st be set to calculate the frequencies while the THERMO keyword calculates the thermodynamic paramete rs. In these calcula tions the convergence criteria were also set to values lower than the default values and are specified by DESCF and DPSCF for the energy and density convergence respectively. A sa mple of the frequency inputs can be seen in Fig. A-4. 177

PAGE 178

178 DIRECT CARTESIAN AM1 CHARGE=0.0 & STANDARD CUTBOND=100 MAXIT=500 END_COORD Figure A-2. Sample DivCon input when r unning a QM/MM vacuum calculation in AMBER. DIRECT CARTESIAN AM1 CHARGE=0.0 & STANDARD CUTBOND=100 MAXIT=500 SCRF DIVPB CM1 SCALE=2.5 Figure A-3. Sample DivCon input when r unning a QM/MM Poisson-Boltzmann calculation. DIRECT CARTESIAN AM1 CHARGE=0.0 & STANDARD CUTBOND=100 MAXIT=500 FREQ THERMO DESCF=0.000001 DPSCF=0.000001 Figure A-4. Sample DivCon input for a frequency calculation.

PAGE 179

LIST OF REFERENCES (1) Bernstein, F. C.; Koetzle, T. F.; Williams, G. J.; Meyer, E. F., Jr.; Brice, M. D.; Rodgers, J. R.; Kennard, O.; Shimanouchi, T.; Tasumi, M. J. Mol. Biol. 1977, 112, 535-42. (2) Sulpizi, M.; Laio, A.; VandeVondele, J.; Cattaneo, A.; Rothlisberger, U.; Carloni, P. Proteins 2003, 52, 212-24. (3) Park, H.; Brothers, E. N.; Merz, K. M., Jr. J. Am. Chem. Soc. 2005, 127, 4232-41. (4) Zhang, Y.; Kua, J.; McCammon, J. A. J. Am. Chem. Soc. 2002, 124, 10572-7. (5) Schwarzl, S. M.; Tschopp, T. B.; Smith, J. C.; Fischer, S. J. Comput. Chem. 2002, 23, 1143-9. (6) Ewing, T. J.; Makino, S.; Skillman, A. G.; Kuntz, I. D. J Comput Aided Mol Des 2001, 15, 411-28. (7) Morris, G. M.; Goodsell, D. S.; Halliday, R. S.; Huey, R.; Hart, W. E.; Belew, R. K.; Olson, A. J. J. Comput. Chem. 1998, 19, 1639-1662. (8) Irwin, J. J.; Shoichet, B. K. J. Chem. Inf. Model. 2005, 45, 177-82. (9) Moustakas, D. T.; Lang, P. T.; Pegg, S.; Pettersen, E.; Kuntz, I. D.; Brooijmans, N.; Rizzo, R. C. J. Comput.-Aided Mol. Des. 2006, 20, 601-619. (10) Bajorath, J. Nat. Rev. Drug Discovery 2002, 1, 882-94. (11) von Grotthuss, M.; Pas, J.; Rychlewski, L. Bioinformatics 2003, 19, 1041-2. (12) Strausberg, R. L.; Schreiber, S. L. Science 2003, 300, 294-5. (13) Anisimov, V. M.; Bugaenko, V. L. J. Comput. Chem. 2009, 30, 784-98. (14) Peters, M. B.; Merz, K. M. J. Chem. Theory Comput. 2006, 2, 383-399. (15) Raha, K.; Merz, K. M. J. Med. Chem. 2005, 48, 4558-4575. (16) Kollman, P. A.; Massova, I.; Reyes, C.; K uhn, B.; Huo, S. H.; Chong, L.; Lee, M.; Lee, T.; Duan, Y.; Wang, W.; Donini, O.; Cieplak, P.; Srinivasan, J.; Case, D. A.; Cheatham, T. E. Acc. Chem. Res. 2000, 33, 889-897. (17) Kuhn, B.; Gerber, P.; Sc hulz-Gasch, T.; Stahl, M. J. Med. Chem. 2005, 48, 4040-4048. (18) Bleicher, K. H.; Bohm, H. J.; Muller, K.; Alanine, A. I. Nat. Rev. Drug Discovery 2003, 2, 369-78. (19) Scott, A. P.; Radom, L. J. Phys. Chem. A 1996, 100, 16502-13. 179

PAGE 180

(20) Schwarzl, S. M.; Tschopp, T. B.; Smith, J. C.; Fischer, S. J. Comput. Chem. 2002, 23, 1143-1149. (21) Tidor, B.; Karplus, M. J. Mol. Biol. 1994, 238, 405-414. (22) Jorgensen, W. L. Science 2004, 303, 1813-1818. (23) Shi, L. M.; Fan, Y.; Myers, T. G.; O' Connor, P. M.; Paull, K. D.; Friend, S. H.; Weinstein, J. N. J. Chem. Inf. Comput. Sci. 1998, 38, 189-99. (24) Zhong, S.; Chen, X.; Zhu, X.; Dziegielew ska, B.; Bachman, K. E.; Ellenberger, T.; Ballin, J. D.; Wilson, G. M.; Tomkinson, A. E.; MacKerell, A. D., Jr. J. Med. Chem. 2008, 51, 4553-62. (25) Sheridan, R. P.; Kearsley, S. K. Drug Discov Today 2002, 7, 903-11. (26) Lipinski, C. A.; Lombardo, F. ; Dominy, B. W.; Feeney, P. J. Adv. Drug Delivery Rev. 1997, 23, 3-25. (27) Lipinski, C. A.; Lombardo, F. ; Dominy, B. W.; Feeney, P. J. Adv. Drug Delivery Rev. 2001, 46, 3-26. (28) Hetenyi, C.; Kortvelyesi, T.; Penke, B. Bioorg. Med. Chem. 2002, 10, 1587-93. (29) Ewing, T. J.; Makino, S.; Skillman, A. G.; Kuntz, I. D. J. Comput.-Aided Mol. Des. 2001, 15, 411-28. (30) Rarey, M.; Kramer, B.; Lengauer, T.; Klebe, G. J. Mol. Biol. 1996, 261, 470-89. (31) Sprous, D. G.; Lowis, D. R.; Leonard, J. M.; Heritage, T.; Burkett, S. N.; Baker, D. S.; Clark, R. D. J. Comb. Chem. 2004, 6, 530-9. (32) Jones, G.; Willett, P.; Glen, R. C.; Leach, A. R.; Taylor, R. J. Mol. Biol. 1997, 267, 72748. (33) Raha, K.; Merz, K. M. J. Am. Chem. Soc. 2004, 126, 1020-1021. (34) Taylor, R. D.; Jewsbur y, P. J.; Essex, J. W. J. Comput.-Aided Mol. Des. 2002, 16, 15166. (35) Feher, M.; Gao, Y.; Baber, J. C.; Shirley, W. A.; Saunders, J. Bioorg. Med. Chem. 2008, 16, 422-427. (36) Jorgensen, W. L.; Ruiz-Caro, J.; Tirado-Ri ves, J.; Basavapathruni, A.; Anderson, K. S.; Hamilton, A. D. Bioorg. Med. Chem. Lett. 2006, 16, 663-7. (37) Brunger, A. T.; Adams, P. D.; Clore, G. M.; DeLano, W. L.; Gros, P.; Grosse-Kunstleve, R. W.; Jiang, J. S.; Kuszewski, J.; Nilges M.; Pannu, N. S.; Read, R. J.; Rice, L. M.; 180

PAGE 181

Simonson, T.; Warren, G. L. Acta Crystallogr., Sect. D: Biol. Crystallogr. 1998, 54, 905921. (38) Yu, N.; Li, X.; Cui, G.; Hayik, S. A.; Merz, K. M., 2nd Protein Sci. 2006, 15, 2773-84. (39) Yu, N.; Yennawar, H. P.; Merz, K. M., Jr. Acta Crystallogr., Sect. D: Biol. Crystallogr. 2005, 61, 322-32. (40) Klebe, G.; Abraham, U.; Mietzner, T. J. Med. Chem. 1994, 37, 4130-46. (41) Cramer, R. D.; Patterson, D. E.; Bunce, J. D. J. Am. Chem. Soc. 1988, 110, 5959-5967. (42) Cramer, R. D.; Bunce, J. D.; Patterson, D. E.; Frank, I. E. Quant. Struct.-Act. Relat. 1988, 7, 18-25. (43) Peters, M. B.; Raha, K.; Merz, K. M. Curr. Opin. Drug Discovery Dev. 2006, 9, 370-379. (44) Raha, K.; Peters, M. B.; Wang, B.; Yu, N.; WollaCott, A. M.; Westerhoff, L. M.; Merz, K. M. Drug Discovery Today 2007, 12, 725-731. (45) Arakawa, M.; Hasegawa, K.; Funatsu, K. Curr. Comput. Aided Drug Des. 2007, 3, 254262. (46) Liang, H.; Wu, X.; Yalowich, J. C.; Hasinoff, B. B. Mol. Pharmacol. 2008, 73, 686-696. (47) Guha, R.; Jurs, P. C. J. Chem. Inf. Comput. Sci. 2004, 44, 2179-2189. (48) Bohm, H. J. J. Comput.-Aided Mol. Des. 1994, 8, 243-256. (49) Bohm, H. J. J. Comput.-Aided Mol. Des. 1998, 12, 309-323. (50) Li, C.; Xu, L.; Wolan, D. W.; Wilson, I. A.; Olson, A. J. J. Med. Chem. 2004, 47, 668190. (51) Baber, J. C.; William, A. S.; Gao, Y. H.; Feher, M. J. Chem. Inform. Model. 2006, 46, 277-288. (52) Ferrara, P.; Gohlke, H.; Price, D. J.; Klebe, G.; Brooks, C. L. J. Med. Chem. 2004, 47, 3032-3047. (53) Liebeschuetz, J. W. J. Comput.-Aided Mol. Des. 2008, 22, 229-38. (54) Wang, R. X.; Lai, L. H.; Wang, S. M. J. Comput.-Aided Mol. Des. 2002, 16, 11-26. (55) Goodford, P. J. J. Med. Chem. 1985, 28, 849-857. (56) Bartolucci, C.; Perola, E.; Pilger, C.; Fels, G.; Lamba, D. Proteins 2001, 42, 182-91. (57) Guimaraes, C. R. W.; Cardozo, M. J. Chem. Inform. Model. 2008, 48, 958-970. 181

PAGE 182

(58) Guimaraes, C. R. W.; de Alencastro, R. B. J. Med. Chem. 2002, 45, 4995-5004. (59) Aqvist, J.; Medina, C.; Samuelsson, J. E. Protein Eng. 1994, 7, 385-91. (60) Grater, F.; Schwarzl, S. M.; Dejaegere, A.; Fischer, S.; Smith, J. C. J. Phys. Chem. B 2005, 109, 10474-83. (61) Dixon, S. L.; Merz, K. M. J. Chem. Phys. 1997, 107, 879-893. (62) Muegge, I. Perspects. Drug Discovery Des. 2000, 20, 99-114. (63) Koppensteiner, W. A.; Sippl, M. J. Biochemistry-Moscow 1998, 63, 247-252. (64) DeWitte, R. S.; Shakhnovich, E. I. J. Am. Chem. Soc. 1996, 118, 11733-11744. (65) Gohlke, H.; Hendlich, M.; Klebe, G. J. Mol. Biol. 2000, 295, 337-56. (66) Sotriffer, C. A.; Gohlke, H.; Klebe, G. J. Med. Chem. 2002, 45, 1967-70. (67) Velec, H. F.; Gohlke, H.; Klebe, G. J. Med. Chem. 2005, 48, 6296-303. (68) Bomble, Y. J.; Case, D. A. Biopolymers 2008, 89, 722-731. (69) Dixit, S. B.; Beveridge, D. L.; Case, D. A.; Cheatham, T. E.; Giudice, E.; Lankas, F.; Lavery, R.; Maddocks, J. H.; Osman, R.; Sklenar, H.; Thayer, K. M.; Varnai, P. Biophys. J. 2005, 89, 3721-3740. (70) Payne, R. J.; Ficht, S.; Tang, S.; Brik, A.; Yang, Y. Y.; Case, D. A.; Wong, C. H. J. Am. Chem. Soc. 2007, 129, 13527-13536. (71) Nutt, D. R.; Smith, J. C. Journal of Chemical Theory and Computation 2007, 3, 15501560. (72) Savelyev, A.; Papoian, G. A. Mendeleev Commun. 2007, 17, 97-99. (73) Witt, M.; Slusarz, M. J.; Ciarkowski, J. QSAR Comb. Sci. 2008, 27, 684-693. (74) Cornell, W. D.; Cieplak, P.; Bayly, C. I. ; Gould, I. R.; Merz, K. M.; Ferguson, D. M.; Spellmeyer, D. C.; Fox, T.; Ca ldwell, J. W.; Kollman, P. A. J. Am. Chem. Soc. 1995, 117, 5179-5197. (75) Case, D. A.; Cheatham, T. E.; Darden, T. ; Gohlke, H.; Luo, R.; Merz, K. M.; Onufriev, A.; Simmerling, C.; Wang, B.; Woods, R. J. J. Comput. Chem. 2005, 26, 1668-1688. (76) Brooks, B. R.; Bruccoleri, R. E.; Olafson, B. D.; States, D. J.; Swaminathan, S.; Karplus, M. J. Comput. Chem. 1983, 4, 187-217. (77) Halgren, T. A. J. Am. Chem. Soc. 1992, 114, 7827-7843. 182

PAGE 183

(78) Halgren, T. A. J. Comput. Chem. 1996, 17, 490-519. (79) Halgren, T. A. J. Comput. Chem. 1999, 20, 720-729. (80) Halgren, T. A. J. Comput. Chem. 1999, 20, 730-748. (81) D.A. Case, T. A. D., T.E. Cheatham, III, C.L. Simmerling, J. Wang, R.E. Duke, R. Luo, K.M. Merz, D.A. Pearlman, M. Crowly, R.C. Walker, W. Zhang, B. Wang, S. Hayik, A. Roitberg, G. Seabra, K.F. Wong, F. Paesani, X. Wu, S. Brozell, V. Tsui, H. Gohlke, L. Yang, C. Tan, J. Mongan, V. Hornak, G. Cui, P. Beroza, D.H. Mathews, C. Schafmeister, W. S. Ross, and P.A. Kollman; Universi ty of California, San Francisco: 2006. (82) Dewar, M. J. S.; Thiel, W. J. Am. Chem. Soc. 1977, 99, 4899-4907. (83) Dewar, M. J. S.; Thiel, W. J. Am. Chem. Soc. 1977, 99, 4907-4917. (84) Dewar, M. J. S.; Zoebisch, E. G.; Healy, E. F.; Stewart, J. P. J. Am. Chem. Soc. 1985, 107, 3902-3909. (85) Stewart, J. J. P. J. Comput. Chem. 1989, 10, 209-220. (86) Stewart, J. J. P. J. Comput. Chem. 1989, 10, 221-264. (87) Thiel, W.; Voityuk, A. A. J. Phys. Chem. 1996, 100, 616-626. (88) Dixon, S. L.; Merz, K. M. J. Chem. Phys. 1996, 104, 6643-6649. (89) Yang, W. T.; Lee, T. S. J. Chem. Phys. 1995, 103, 5674-5678. (90) Warshel, A.; Levitt, M. J. Mol. Biol. 1976, 103, 227-249. (91) Field, M. J.; Bash, P. A.; Karplus, M. J. Comput. Chem. 1990, 11, 700-733. (92) Li, G. H.; Cui, Q. J. Phys. Chem. B 2004, 108, 3342-3357. (93) Noodleman, L.; Lovell, T.; Han, W. G.; Li, J.; Himo, F. Chem. Rev. 2004, 104, 459-508. (94) Ranaghan, K. E.; Ridder, L.; Szefczyk, B.; Sokalski, W. A.; Hermann, J. C.; Mulholland, A. J. Org. Biomol. Chem. 2004, 2, 968-980. (95) Hensen, C.; Hermann, J. C.; Nam, K.; Ma, S.; Gao, J.; Holtje, H. D. J. Med. Chem. 2004, 47, 6673-80. (96) Senn, H. M.; Thiel, W. Top. Curr. Chem. 2007, 268, 173-290. (97) Gao, J. L.; Amara, P.; Alhambra, C.; Field, M. J. J. Phys. Chem. A 1998, 102, 4714-4721. (98) Philipp, D. M.; Friesner, R. A. J. Comput. Chem. 1999, 20, 1468-1494. 183

PAGE 184

(99) Zhang, Y. K.; Lee, T. S.; Yang, W. T. J. Chem. Phys. 1999, 110, 46-54. (100) Singh, U. C.; Kollman, P. A. J. Comput. Chem. 1986, 7, 701-763. (101) Bersuker, I. B.; Leong, M. K.; Boggs, J. E.; Pearlman, R. S. Int. J. Quantum Chem. 1997, 63, 1051-1063. (102) Maseras, F.; Morokuma, K. J. Comput. Chem. 1995, 16, 1170-1179. (103) Reuter, N.; Dejaegere, A. ; Maigret, B.; Karplus, M. J. Phys. Chem. A 2000, 104, 17201735. (104) Anderson, C. F.; Record, M. T. Annu. Rev. Phys. Chem. 1982, 33, 191-222. (105) Ollis, D. L.; White, S. W. Chem. Rev. 1987, 87, 981-995. (106) Warwicker, J.; Engelman, B. P.; Steitz, T. A. Proteins-Structure Function and Genetics 1987, 2, 283-289. (107) Warwicker, J.; Ollis, D.; Richards, F. M.; Steitz, T. A. J. Mol. Biol. 1985, 186, 645-649. (108) Jorgensen, W. L.; Chandrasekhar, J.; Madura, J. D.; Impey, R. W.; Klein, M. L. J. Chem. Phys. 1983, 79, 926-935. (109) Cramer, C. J.; Truhlar, D. G. Chem Rev 1999, 99, 2161-2200. (110) Fogolari, F.; Brigo, A.; Molinari, H. J. Mol. Recognit. 2002, 15, 377-92. (111) Roux, B.; Simonson, T. Biophys. Chem. 1999, 78, 1-20. (112) Tomasi, J.; Persico, M. Chem. Rev. 1994, 94, 2027-2094 (113) Dinner, A. R., Lopez, X., Karplus, M. Theor. Chem. Acc. 2003, 109, 118-124. (114) Perakyla, M.; Kollman, P. A. J. Am. Chem. Soc. 1997, 119, 1189-1196. (115) Friesner, R. A. Adv. Protein Chem. 2005, 72, 79-104. (116) Honig, B.; Nicholls, A. Science 1995, 268, 1144-1149. (117) Luty, B. A.; Davis, M. E.; Mccammon, J. A. J. Comput. Chem. 1992, 13, 1114-1118. (118) Still, W. C.; Tempczyk, A.; Hawley, R. C.; Hendrickson, T. J. Am. Chem. Soc. 1990, 112, 6127-6129. (119) David, L.; Luo, R.; Gilson, M. K. J. Comput. Chem. 2000, 21, 295-309. (120) Prabhu, N. V.; Zhu, P. J.; Sharp, K. A. J. Comput. Chem. 2004, 25, 2049-2064. 184

PAGE 185

(121) Storer, J. W.; Giesen, D. J. ; Cramer, C. J.; Truhlar, D. G. J. Comput.-Aided Mol. Des. 1995, 9, 87-110. (122) Li, J. B.; Zhu, T. H.; Cramer, C. J.; Truhlar, D. G. J. Phys. Chem. A 1998, 102, 18201831. (123) Warwicker, J. J. Theor. Biol. 1986, 121, 199-210. (124) Nicholls, A.; Honig, B. J. Comput. Chem. 1991, 12, 435-445. (125) Warwicker, J.; Watson, H. C. J. Mol. Biol. 1982, 157, 671-9. (126) Davis, M. E.; Mccammon, J. A. J. Comput. Chem. 1989, 10, 386-391. (127) Holst, M.; Kozack, R. E.; Saied, F.; Subramaniam, S. Proteins 1994, 18, 231-45. (128) Gogonea, V.; Merz, K. M. J. Phys. Chem. A 1999, 103, 5171-5188. (129) Tsui, V.; Case, D. A. Biopolymers 2000, 56, 275-291. (130) Calimet, N.; Schaefer, M.; Simonson, T. Proteins 2001, 45, 144-58. (131) Chen, J.; Im, W.; Brooks, C. L., 3rd J. Am. Chem. Soc. 2006, 128, 3728-36. (132) Sharp, K. A.; Honig, B. Annu. Rev. Biophys. Biophys. Chem. 1990, 19, 301-32. (133) Schaefer, P.; Riccardi, D.; Cui, Q. J. Chem. Phys. 2005, 123, 014905. (134) Simonson, T.; Archontis, G.; Karplus, M. Acc. Chem. Res. 2002, 35, 430-7. (135) Simonson, T. Curr. Opin. Struct. Biol. 2001, 11, 243-52. (136) Ferrara, P.; Gohlke, H.; Price, D. J.; Klebe, G.; Brooks, C. L., 3rd J. Med. Chem. 2004, 47, 3032-47. (137) Tannor, D. J.; Marten, B.; Murphy, R.; Fr iesner, R. A.; Sitkoff, D.; Nicholls, A.; Ringnalda, M.; William A. Goddard, I.; Honig, B. J. Am. Chem. Soc. 1994, 116, 1187582. (138) Liao, N.; Merz, K. M. In Preparation. (139) Archontis, G.; Simonson, T.; Karplus, M. J. Mol. Biol. 2001, 306, 307-27. (140) Baker, N. A.; Sept, D.; Joseph, S.; Holst, M. J.; McCammon, J. A. Proc. Natl. Acad. Sci. U. S. A. 2001, 98, 10037-41. (141) Smith, P. E.; Pettitt, B. M. J. Phys. Chem. 1994, 98, 9700-9711. (142) Murphy, R. B.; Philipp, D. M.; Friesner, R. A. J. Comput. Chem. 2000, 21, 1442-1457. 185

PAGE 186

(143) Li, J.; Nelson, M. R.; Peng, C. Y.; Bashford, D.; Noodleman, L. J. Phys. Chem. A 1998, 102, 6311-6324. (144) Hornak, V.; Abel, R.; Okur, A.; Stro ckbine, B.; Roitberg, A.; Simmerling, C. Proteins 2006, 65, 712-25. (145) Tahirov, T. H.; Inagaki, E.; Vol. 2007. (146) Nelson, C. A.; Lee, C. A.; Fremont, D. H.; Joachimiak, A. Structure 2005, 13, 75-85. (147) Huber, R.; Kukla, D.; Ruehlmann, A. ; Epp, O.; Formanek, H.; Deisenhofer, J.; Steigemann, W. Acta Crystallographica 1983, 39, 480. (148) Morton, A.; Matthews, B. W. Biochemistry 1995, 34, 8576-88. (149) Liu, D. C.; Nocedal, J. Math. Prog. 1989, 45, 503-528. (150) Clark, R. D.; Strizhev, A.; Leonard, J. M.; Blake, J. F.; Matthew, J. B. J. Mol. Graphics Modell. 2002, 20, 281-95. (151) Wang, R.; Lu, Y.; Wang, S. J. Med. Chem. 2003, 46, 2287-303. (152) Irwin, J. J.; Raushel, F. M.; Shoichet, B. K. Biochemistry 2005, 44, 12316-12328. (153) Anisimov, V. M.; Bugaenko, V. L. J. Comput. Chem. 2008. (154) Field, M. J.; Albe, M.; Bret, C.; Proust-De Martin, F.; Thomas, A. J. Comput. Chem. 2000, 21, 1088-1100. (155) Walker, R. C.; Crowley, M. F.; Case, D. A. J. Comput. Chem. 2008, 29, 1019-1031. (156) Fischer, S.; Smith, J. C.; Verma, C. S. J. Phys. Chem. B 2001, 105, 8050-8055. (157) Chang, C. E. A.; Chen, W.; Gilson, M. K. Proc. Natl. Acad. Sci. U. S. A. 2007, 104, 1534-1539. (158) Murray, C. W.; Verdonk, M. L. J. Comput.-Aided Mol. Des. 2002, 16, 741-53. (159) Ishchenko, A. V.; Shakhnovich, E. I. J. Med. Chem. 2002, 45, 2770-2780. (160) Li, H.; Jensen, J. H. Theor. Chem. Acc. 2002, 107, 211-219. (161) Calvin, M. D.; Head, J. D.; Jin, S. Q. Surf. Sci. 1996, 345, 161-172. (162) Vajda, S.; Weng, Z. P.; Rosenfeld, R.; Delisi, C. Biochemistry 1994, 33, 13977-13988. (163) Bardi, J. S.; Luque, I.; Freire, E. Biochemistry 1997, 36, 6588-96. (164) Legrand, S. M.; Merz, K. M. J. Comput. Chem. 1993, 14, 349-352. 186

PAGE 187

(165) Zou, X. Q.; Sun, Y. X.; Kuntz, I. D. J. Am. Chem. Soc. 1999, 121, 8033-8043. (166) Archontis, G.; Simonson, T.; Karplus, M. J. Mol. Biol. 2001, 306, 307-27. (167) Eldridge, M. D.; Murray, C. W.; Aut on, T. R.; Paolini, G. V.; Mee, R. P. J Comput Aided Mol Des 1997, 11, 425-45. (168) Coi, A.; Tonelli, M.; Ganadu, M. L.; Bianucci, A. M. Bioorg. Med. Chem. 2006, 14, 2636-41. (169) Meng, E. C.; Shoichet, B. K.; Kuntz, I. D. J. Comput. Chem. 1992, 13, 505-524. (170) Weissman, A. M. Nat. Rev. Mol. Cell Biol. 2001, 2, 169-78. (171) d'Azzo, A.; Bongiova nni, A.; Nastasi, T. Traffic 2005, 6, 429-41. (172) Haglund, K.; Dikic, I. EMBO J. 2005, 24, 3353-9. (173) Welchman, R. L.; Gordon, C.; Mayer, R. J. Nat. Rev. Mol. Cell Biol. 2005, 6, 599-609. (174) Devoy, A.; Soane, T.; Welchman, R.; Mayer, R. J. Essays Biochem 2005, 41, 187-203. (175) Ritchie, K. J.; Zhang, D. E. Semin. Cell Dev. Biol. 2004, 15, 237-46. (176) Seeler, J. S.; Bischof, O.; Nacerddine, K.; Dejean, A. Curr. Top. Micro. Immun. 2007, 313, 49-71. (177) Seeler, J. S.; Dejean, A. Nat. Rev. Mol. Cell Biol. 2003, 4, 690-9. (178) Kloetzel, P. M. Nat. Rev. Mol. Cell Biol. 2001, 2, 179-87. (179) Lindner, H. A. Virology 2007, 362, 245-56. (180) Amerik, A. Y.; Hochstrasser, M. Biochim. Biophys. Acta 2004, 1695, 189-207. (181) Pickart, C. M.; Eddins, M. J. Biochim. Biophys. Acta 2004, 1695, 55-72. (182) Hershko, A.; Ciechanover, A. Annu. Rev. Biochem. 1998, 67, 425-79. (183) Hu, M.; Li, P.; Li, M.; Li, W.; Yao, T.; Wu, J. W.; Gu, W.; Cohen, R. E.; Shi, Y. Cell 2002, 111, 1041-54. (184) Thrower, J. S.; Hoffman, L.; Rechsteiner, M.; Pickart, C. M. EMBO J. 2000, 19, 94-102. (185) Finley, D.; Sadis, S.; Monia, B. P.; Bouc her, P.; Ecker, D. J.; Crooke, S. T.; Chau, V. Mol. Cell. Biol. 1994, 14, 5501-9. (186) Reindle, A.; Belichenko, I.; Bylebyl, G. R.; Chen, X. L.; Gandhi, N.; Johnson, E. S. J. Cell Sci. 2006, 119, 4749-57. 187

PAGE 188

(187) Komatsu, M.; Chiba, T.; Tatsumi, K.; Iemura, S.; Tanida, I.; Okazaki, N.; Ueno, T.; Kominami, E.; Natsume, T.; Tanaka, K. EMBO J. 2004, 23, 1977-86. (188) Furukawa, K.; Mizushima, N.; Noda, T.; Ohsumi, Y. J. Biol. Chem. 2000, 275, 7462-5. (189) Dittmar, G. A.; Wilkinson, C. R.; Jedrzejewski, P. T.; Finley, D. Science 2002, 295, 2442-6. (190) Haas, A. L.; Bright, P. M. J. Biol. Chem. 1987, 262, 345-51. (191) Swaminathan, S.; Amerik, A. Y.; Hochstrasser, M. Mol. Biol. Cell 1999, 10, 2583-94. (192) Li, M.; Chen, D.; Shiloh, A.; Luo, J.; Nikolaev, A. Y.; Qin, J.; Gu, W. Nature 2002, 416, 648-53. (193) Wilkinson, K. D.; Hochstrasser, M. In Ubiquitin and the biology of the cell; Peters, J. M., Finley, J. R. H. D., Eds.; Plenum Press: New York, 1998, p 99-125. (194) Holowaty, M. N.; Zeghouf, M.; Wu, H.; Te llam, J.; Athanasopoulos, V.; Greenblatt, J.; Frappier, L. J. Biol. Chem. 2003, 278, 29987-94. (195) van der Horst, A.; de Vries-Smits, A. M. ; Brenkman, A. B.; van Triest, M. H.; van den Broek, N.; Colland, F.; Mauric e, M. M.; Burgering, B. M. Nat. Cell Biol. 2006, 8, 106473. (196) Chapman, H. A.; Riese, R. J.; Shi, G. P. Annu. Rev. Physiol. 1997, 59, 63-88. (197) Nijman, S. M.; Luna-Vargas, M. P.; Veld s, A.; Brummelkamp, T. R.; Dirac, A. M.; Sixma, T. K.; Bernards, R. Cell 2005, 123, 773-86. (198) Irwin, J. J.; Raushel, F. M.; Shoichet, B. K. Biochemistry 2005, 44, 12316-28. (199) Liu, X. H.; Chen, P. Q.; Guo, W. C.; Wang, S. H.; Li, Z. M. Phosphorus, Sulfur Silicon Relat. Elem. 2008, 183, 775-778. (200) Nicholson, B.; Leach, C. A.; Goldenberg, S. J.; Francis, D. M.; Kodrasov, M. P.; Tian, X.; Shanks, J.; Sterner, D. E.; Bernal, A.; Mattern, M. R.; Wilkinson, K. D.; Butt, T. R. Protein Sci. 2008, 17, 1035-43. (201) Iorga, B.; Herlem, D.; Barre, E.; Guillou, C. J. Mol. Model. 2006, 12, 366-372. (202) Hetenyi, C.; van der Spoel, D. Protein Sci. 2002, 11, 1729-37. (203) Meng, E. C.; Pettersen, E. F.; Couch, G. S.; Huang, C. C.; Ferrin, T. E. Bmc Bioinformatics 2006, 7, 339. (204) Pettersen, E. F.; Goddard, T. D.; Huang, C. C.; Couch, G. S.; Greenblatt, D. M.; Meng, E. C.; Ferrin, T. E. J. Comput. Chem. 2004, 25, 1605-12. 188

PAGE 189

(205) Connolly, M. L. Science 1983, 221, 709-13. (206) Shoichet, B. K.; Bodian, D. L.; Kuntz, I. D. J. Comput. Chem. 1992, 13, 380-397. (207) Cramer, C. J.; Truhlar, D. G. J. Comput.-Aided Mol. Des. 1992, 6, 629-66. (208) Espinoza-Fonseca, L. M.; Trujillo-Ferrara, J. G. Bioorg. Med. Chem. Lett. 2004, 14, 3151-4. (209) Rao, G. S.; Ramachandran, M. V.; Bajaj, J. S. J. Biomol. Struct. Dyn. 2006, 23, 377-84. 189

PAGE 190

190 BIOGRAPHICAL SKETCH Seth Allen Hayik was born in 1981 in a sm all Pennsylvania town. He attended school in the Daniel Boone School District, and it was in this system that his interest for science was ignited along with a strong founda tion to support it. In May 2003 he received his B.S. in biochemistry from Philadelphia University, form erly the Philadelphia College of Textiles and Sciences. While there, he worked for Dr. Ch arles Bock, who introduced him to computational chemistry and got him started on his career path. Following this, he enrolled in the PhD program at Penn State University (PSU) and began work ing with Prof. Kenneth M. Merz Jr. using computational techniques to study protein-ligand interactions. In August 2005 he followed Prof. Merz to the University of Florida (UF) to c ontinue his work towards a doctoral degree in chemistry. In January 2007 he joined Dr. Rola nd Dunbrack Jr.s lab at Fox Chase Cancer Center and began a project doing virtual screening for Progenra Inc. After graduating he will be continuing his research as a postdoctoral researcher.