Citation
Computational Studies on the Energetics and Dynamics of Biomolecular Systems

Material Information

Title:
Computational Studies on the Energetics and Dynamics of Biomolecular Systems
Creator:
Ucisik, Melek Nihan
Place of Publication:
[Gainesville, Fla.]
Publisher:
University of Florida
Publication Date:
Language:
english
Physical Description:
1 online resource (13 p.)

Thesis/Dissertation Information

Degree:
Doctorate ( Ph.D.)
Degree Grantor:
University of Florida
Degree Disciplines:
Chemistry
Committee Chair:
MERZ,KENNETH MALCOLM,JR
Committee Co-Chair:
DEUMENS,ERIK
Committee Members:
BOWERS,CLIFFORD RUSSELL
HORENSTEIN,NICOLE ALANA
PHILLPOT,SIMON R
Graduation Date:
5/3/2014

Subjects

Subjects / Keywords:
Additivity ( jstor )
Atoms ( jstor )
Free energy ( jstor )
Ligands ( jstor )
Modeling ( jstor )
Proteins ( jstor )
Simulations ( jstor )
Solvation ( jstor )
Solvents ( jstor )
Trajectories ( jstor )
Chemistry -- Dissertations, Academic -- UF
additivity -- affinity -- binding -- chemistry -- computational -- design -- drug -- dynamics -- inhibitor -- ligand -- mechanics -- molecular -- protein -- quantum -- structure -- uncertainty
Genre:
Electronic Thesis or Dissertation
born-digital ( sobekcm )
Chemistry thesis, Ph.D.

Notes

Abstract:
Computational chemistry offers many avenues to investigate physical phenomena at the molecular level which is usually not totally captured by experiments. Its applications on biological problems present a whole new perspective to living organisms at a micro scale. Folding mechanisms of proteins into their functional forms, assembly formation mechanisms of multiple proteins, signal transduction pathways through a series of proteins and lipids, interactions of proteins and nucleic acids, catalysis pathways of enzymes, and binding principles of small molecules to enzymes belong to a long list of areas to be explored with computational chemistry to make sense of observations made at macroscopic scale.This dissertation features discussions pertaining protein structure, dynamics, and ligand binding. Quantum mechanics and molecular dynamics are employed to gain insights into sample problems in these areas. The first chapter introduces how computational chemistry might aid in the understanding of physical phenomena. The second chapter summarizes the basic theory behind the methodologies utilized in the projects presented herein. The next two chapters deal with ligand binding to proteins. In the third chapter, we prove the validity of the commonly used fragment interaction energy additivity assumption. The fourth chapter underlines the need of assessing the uncertainty of calculated properties and demonstrates a protocol to place error bars on binding affinity predictions for a set of protein-ligand complexes. The last two chapters investigate the structure and dynamics of a metal binding membrane fusion protein, the periplasmic piece of Cu(I)-binding protein CusB. This dissertation contributes to the field of computational drug design by demonstrating the soundness of the pairwise interaction energy additivity approximation and encouraging the use of uncertainty assessment for computed physical properties. Moreover, it is expected to aid in elucidating the working mechanisms of intrinsically disordered protein domains. ( en )
General Note:
In the series University of Florida Digital Collections.
General Note:
Includes vita.
Bibliography:
Includes bibliographical references.
Source of Description:
Description based on online resource; title from PDF title page.
Source of Description:
This bibliographic record is available under the Creative Commons CC0 public domain dedication. The University of Florida Libraries, as creator of this bibliographic record, has waived all rights to it worldwide under copyright law, including all related and neighboring rights, to the extent allowed by law.
Thesis:
Thesis (Ph.D.)--University of Florida, 2014.
Local:
Adviser: MERZ,KENNETH MALCOLM,JR.
Local:
Co-adviser: DEUMENS,ERIK.
Electronic Access:
RESTRICTED TO UF STUDENTS, STAFF, FACULTY, AND ON-CAMPUS USE UNTIL 2014-11-30
Statement of Responsibility:
by Melek Nihan Ucisik.

Record Information

Source Institution:
University of Florida
Holding Location:
University of Florida
Rights Management:
Copyright by Melek Nihan Ucisik. Permission granted to University of Florida to digitize and display this item for non-profit research and educational purposes. Any reuse of this item in excess of fair use or other copyright exemptions requires permission of the copyright holder. .
Embargo Date:
11/30/2014
Resource Identifier:
907294982 ( OCLC )
Classification:
LD1780 2014 ( lcc )

Downloads

This item has the following downloads:


Full Text

PAGE 1

COMPUTATIONAL STUDIES ON THE ENERGETICS AND DYNAMICS OF BIOMOLECULAR SYSTEMS By MELEK NIHAN UCISIK A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY UNIVERSITY OF FLORIDA 2014

PAGE 2

2014 Melek Nihan Ucisik

PAGE 3

To my dearest grandmother Yildiz Can and my awesome family

PAGE 4

4 ACKNOWLEDGMENTS I would like to thank Dr. Kennie M erz for being the best adviser that one c ould have hoped for. He was extremely helpful, supportive and understanding during the entire time I spent in his group. I also want to acknowledge all my coworkers who always took the time to help me out whenever I needed S ome of them should be especially mentioned: Dr. Yue Yang made the beginning of my graduate student life so much easier by kindly answering my endless questions. I had very fruitful collaborations with Dr. John C. Faver, Dr. Zheng (John) Zheng, a nd Dr. Danial S abri Da shti. To work with them was a pleasure and never a burden. I also learned a lot from my seniors Dr. Dhruva K. Chakravorty, Dr. Mark L. Benson, and Dr. Michael N. Weaver. Additionally, I want to thank my undergraduate adviser Dr. Vikto rya Aviyente for getting me interested in computational chemistry and my high school chemistry teache rs Ms. Gertrud Rahvali and Dr. Klaus C remer for making me like chemistry. Countless thanks go to my family. Although there was a time zone difference of se ven hours between us, they always made me a constant part of their lives and I never felt alone. I am so grateful to have them. I also want to acknowledge my dear grandmother Yildiz Can who taught me how to read and write. Words cannot describe my deep lov e and appreciation for her. Regardless of her health conditions, she was always there for me. I know she saved the last bits of her energy to be able to talk to me for two minutes even on her last days. She demonstrated the fun of seeking knowledge and her curiosity for new things never ceased My love for her grows constantly. Last but not least, I would have never survived graduate school without my awesome friends here in Gainesville. Gulsah, Yasemin, Burcu, Ruchan, Betul, Ays egul, Mona, Aycan, Sevnur, Tuba Y. Ahu, Aysun and so many others will never be forgotten.

PAGE 5

5 TABLE OF CONTENTS page ACKNOWLEDGMENTS ................................ ................................ ................................ ... 4 LIST OF TABLES ................................ ................................ ................................ ............. 7 LIST OF FIGURES ................................ ................................ ................................ ........... 8 LIST OF ABBREVIATIONS ................................ ................................ ............................ 11 ABSTRACT ................................ ................................ ................................ .................... 16 CHAPTER 1 INTRODUCTION ................................ ................................ ................................ ..... 18 2 THEORY AND METHODS IN MOLECULAR MODELING ................................ ...... 23 Quantum Mechanics ................................ ................................ ................................ 23 Wave Functions ................................ ................................ ................................ 23 Basis Sets ................................ ................................ ................................ .......... 26 The Hartree Fock Approximation ................................ ................................ ...... 31 Second Order Mller Plesset Pe rturbation Theory ................................ ........... 33 Density Functional Theory ................................ ................................ ................. 35 Semiempirical Theory ................................ ................................ ........................ 38 Molecular Mechanics ................................ ................................ ............................... 39 Force Fields ................................ ................................ ................................ ....... 39 Potential Energy Surface ................................ ................................ ................... 41 Energy Minimization ................................ ................................ .......................... 42 Molecular Dynamics ................................ ................................ ................................ 44 Finite Difference Methods and Time Steps ................................ ....................... 45 Temperature Regulation ................................ ................................ .................... 48 Pressure Regulation ................................ ................................ .......................... 51 Solvation Models ................................ ................................ ............................... 52 3 INVESTIGATING THE CONCEPT OF ADDITIVITY OF INTERACTION ENERGIES FOR PROTEIN LIGAND COMPLEXES ................................ ............... 56 Background ................................ ................................ ................................ .............. 56 Molecular System: HIV Protease II Indinavir Complex ................................ ............ 60 Methods ................................ ................................ ................................ ................... 61 Theory ................................ ................................ ................................ ...................... 62 Results and Discussion ................................ ................................ ............................ 68 Pairwise Interactions ................................ ................................ ......................... 68 Many body Interactions ................................ ................................ ..................... 70 Conclusions ................................ ................................ ................................ ............. 74

PAGE 6

6 4 ESTIMATION OF FREE ENERGY OF BINDING FOR PROTEIN LIGAND COMPLEXES CONCOMITANT WITH ERROR ESTIMATION ................................ 87 Background ................................ ................................ ................................ .............. 87 Methods ................................ ................................ ................................ ................... 91 Molecular System: L99A Mutant of T4 Lysozyme ................................ ............. 91 Ensemble Creation with the "Blur" Program ................................ ...................... 92 Energy Scoring and Error Estimation ................................ ................................ 93 Theory ................................ ................................ ................................ ...................... 94 Results ................................ ................................ ................................ ................... 102 Binding Affinity Estimates ................................ ................................ ................ 102 Er ror Analysis ................................ ................................ ................................ .. 106 Convergence of Binding Affinity Estimates and Error Reduction .................... 109 Conclusions ................................ ................................ ................................ ........... 110 5 DYNAMIC AND STRUCTURAL STUDIES OF THE N TERMINUS OF THE INTRINSICALLY DISORDERED Cu(I) BINDING PROTEIN CusB ....................... 128 Background ................................ ................................ ................................ ............ 128 Methods ................................ ................................ ................................ ................. 131 Initial Models ................................ ................................ ................................ .... 131 Computational Methods and Post simulation Analyses ................................ .. 132 Results ................................ ................................ ................................ ................... 135 Effect of Cu(I) Binding ................................ ................................ ..................... 135 Decomposing the Disorder ................................ ................................ .............. 146 Conclusions ................................ ................................ ................................ ........... 148 6 MOLECULAR DYNAMICS STUDY OF THE N TERMINUS OF CusB AND THE METALLOCHAPERONE CusF ................................ ................................ .............. 165 Background ................................ ................................ ................................ ............ 165 Con structing the Complex of Cu(I) b ound CusB N Terminus and the Metallochaperone CusF ................................ ................................ ...................... 1 66 Molecular Dynamics Simulation Methods and Post Simulation Analyses ............. 167 Results ................................ ................................ ................................ ................... 169 Disappearing of Disorder in Cu sB NT ................................ ............................. 169 Conclusions ................................ ................................ ................................ ........... 174 LIST OF REFERENCES ................................ ................................ .............................. 191 BIOGRAPHICAL SK ETCH ................................ ................................ ........................... 202

PAGE 7

7 LIST OF TABLES Table page 3 1 E bind E bind ', and the total correction terms for various orders of correction estimated using M06 L, HF and PM6 DH2 ................................ .......................... 77 4 1 The experimental and calculated binding free energies using microstate energies with ff94/GB, ff99SB/GB, and PM6 DH2/COSMO .............................. 113 4 2 Comparison of the solvation energies with GBSA, COSMO and SMD solvation models against the experimental solvation energies .......................... 114 4 3 The experimental and calculated binding fr ee energies using microstate energies from gas phase ff99SB and PM6DH2 scoring with experimental ("exp") free energies of solvation for the particular ligand ................................ 115 4 4 The experimental and calcu lated binding free energies using microstate energies from gas phase ff99SB and PM6DH2 scoring with SMD estimates for free energy of solvation ................................ ................................ ................ 116

PAGE 8

8 LIST OF FIGURES Figure page 3 1 A representative double mutant cycle involving protein (P) ligand (L) interactions. ................................ ................................ ................................ ......... 78 3 2 Visualization of the components of Equation 3 5. ................................ ............... 79 3 3 Visualization of the components summing up to tot al correct ion terms given in Equation 3 6. ................................ ................................ ................................ ... 80 3 4 The distribution of the 2 body corrections to the ligand binding pocket interaction energies at the HF, M06 L, and PM6 DH2 level s of theory .............. 81 3 5 The distribution of the three body correction terms to the ligand binding pocket interactions at HF M06 L, and PM6 DH2 levels of theory ...................... 82 3 6 The distributions of the n body terms for n =3 (left), n =4 (middle) and n =5 (rig ht) at the PM6 DH2 level of theory. ................................ ................................ 83 3 7 The distributions of the pairwise corrections to all the interactions present within the Indinavir binding pocket complex at the HF, M06 L, and PM6 DH2 levels of theory. ................................ ................................ ................................ ... 84 3 8 The distributions of n body correction terms for n =3, n =4, and n =5 at the PM6 DH2 level of theory. ................................ ................................ .................... 85 3 9 Magnitude of individual three body corrections vs. the distance between the nearest atoms of the two receptor fragments involved in the three body interaction. ................................ ................................ ................................ ........... 86 4 1 The process of "blurring", i.e. systematically perturbing the ligand within the protein binding pocket, shown for n butyl benzene. ................................ .......... 117 4 2 Experimental free energies of binding plotted against calculated free energies of binding ................................ ................................ ............................ 118 4 3 Thermodynamic cycle for protein ligand binding in the gas and aqueous phases. ................................ ................................ ................................ .............. 119 4 4 Systematic and random errors contained in the binding affinity estimates obtained from the ff99SB/GB and ff99SB/SMD microstate energies. ............... 120 4 5 Systematic and random errors contained in the binding affinity estimates o btain ed from the PM6 DH2/COSMO and PM6 DH2/SMD microstate energies. ................................ ................................ ................................ ............ 121

PAGE 9

9 4 6 Free energy of binding estimates obtained with a Gedanken experiment, which assumes the static protein binding pocket approximation introduces a strain en ergy per PL complex pose scored with ff99SB/GB. ............................. 122 4 7 Free energy of binding estimates obtained with a Gedanken experiment, which assumes the static protein binding pocket approximation introduces a strain energy per PL com plex pose scored with PM6 DH2/COSMO. ............... 123 4 8 Gedanken experiments 1 and 2 combined for ff99SB/GB scoring ................... 124 4 9 Gedanken experiments 1 and 2 combined for PM6 DH2/COSMO scoring ...... 125 4 10 Convergence of binding free energy calculat ions for the ff99SB/GB test set ... 126 4 11 Error bar propagation with growing sample size in ff99SB/GB calculations. ..... 127 5 1 The Cus CBFA pump in E. coli ................................ ................................ .......... 151 5 2 The predicted level of order for each residue by the online server SPINE D. ... 152 5 3 The Cu(I) bound version and the apo version of the CusB N terminal region .. 153 5 4 The STRIDE secondary structure predictions for the apo CusB NT and C u(I) bound CusB NT from MD trajectories and aMD trajectories. ............................ 154 5 5 Distribution of RMSD values over 10 microseconds of aMD f or apo and Cu(I) bound versions of the CusB NT chain. ................................ ..................... 155 5 6 Atomic fluctuations for the apo and Cu(I) bound versions of the N terminal region of CusB. ................................ ................................ ................................ .. 156 5 7 Average NMR Shifts for backbone N's and their H's for the apo and Cu(I) bound predicted by SPARTA+ and their standard deviations. .......................... 157 5 8 Distribution of the distances between the two beta sheets forming the central beta motif. ................................ ................................ ................................ .......... 158 5 9 The dihedral angle formed by L15, I14, K43 and P42 as a measure of the relative beta sheet positions in the central motif. ................................ ............... 159 5 10 The distribution of t he dihedral invo lving the center of mass of F16, M21, M36 and the c ommon center of mass of the residues D28, K29 and P30. ...... 160 5 11 Distance analysis plot for residues in and adjacent to the big loop connecting residues M21 and M36. ................................ ................................ ..................... 161 5 12 Distance between the C u(I) ion and the centers of mass of each residue constituting the loop connecting M21 to M36 shown with error bars. ................ 162

PAGE 10

10 5 13 The RMSF profiles of apo and holo CusB N terminal revisited for subdomains. ................................ ................................ ................................ ...... 163 5 14 The three suggested subdomains for the N terminal domain of CusB .............. 164 6 1 The top hit from the HADDOCK docking of CusB NT to the open conformation of CusF ................................ ................................ ........................ 178 6 2 The two model metal binding sites MMM and MMH ................................ ......... 179 6 3 Secondary structure assignments of the CusB NT/ CusF complex for both the MMM and the MMH models ................................ ................................ ............. 180 6 4 The RMSD profiles of the MMM and MMH models over 4 !s of simulation time. ................................ ................................ ................................ ................... 181 6 5 The RMSF profiles of the MMM and MMH models over 4 !s of simulation time. ................................ ................................ ................................ ................... 182 6 6 Distances of the M21 M36 loop of CusB NT and the open flap of CusF ........... 183 6 7 Histograms for the distances of the M21 M36 loop of CusB NT and the open flap of CusF ................................ ................................ ................................ ....... 184 6 8 The angle involving the center of mass of CusB NT's M21 M36 loop, Cu(I), and the center of mass of CusF's open flap ................................ ...................... 185 6 9 The histograms for the angle involving the center of m ass of CusB NT's M21 M36 loop, C u(I), and the center of mass of CusF's open flap ........................... 186 6 10 Bifurcated and paired hydrogen bonding inte ractions between CusB NT's R26 and CusF's D46 ................................ ................................ ......................... 187 6 11 Distances of possible hydrogen bonding partners on the CusB NT M21 M36 loop and the CusF open flap. ................................ ................................ ............ 188 6 12 Histograms for the distance betw een the centroids of the phenyl ring of CusB NT's F35 and of the phenyl part of the indole ring of CusF's W44. ......... 189 6 13 " stacking interaction between the side chains of CusB NT's F35 and CusF's W44. ................................ ................................ ................................ ..... 190

PAGE 11

11 LIST OF ABBREVIATIONS 1D 1 (One) Dimensional 3D 3 (Three) Dimensional A Alanine  Angstrom AM1 Austin Model 1 AM1 BCC Austin Mode l 1 Bond Charge C orrections AMBER Assisted Model Building with Energy Refinement aMD Accelerated Molecular Dynamics AUC Area Under the Curve aug cc pVDZ Augmented, correlation consisted, polarized, valence only double # basis set BFDb Biomolecular Fra gment Database BioLiP Biologically relevant Ligand Protein binding database BSSE Basis Set Superposition Error CASP Critical Assessment of protein Structure Prediction CBS Complete Basis Set CCSD(T) Coupled Cluster Single Double (T riple) COSMO Conduc tor like Screening Model CusB NT CusB N Terminus D Aspartate DFT Density Functional Theory DOF Degree of Freedom E. coli Escherichia coli F Phenylalanine

PAGE 12

12 FBDD Fragment based drug design FEP Free Energy Perturbation ff94 Force field 94 ff99SB Forc e field 99 Stony Brook GAFF Generalized Amber Force Field GB Generalized Born GGA Generalized Gradient Approximation H Histidine HADDOCK High Ambiguity Driven protein protein Docking HF Hartree Fock His Histidine HIV Human Immunodeficiency Virus I Isoleucine I TASSER Iterative Threading Assembly R efinement ID Identification number K Kelvin K Lysine kcal/mol kilocalories per mole L Leucine LANL2DZ Los Alamos National Laboratory 2 double # basis set LCPO Linear Combinations of Pairwise O verlaps M Molar M Methionine MC Monte Carlo MCPB Metal Center Parameter Builder

PAGE 13

13 MCY Matsuoka Clementi Yoshimine MD Molecular Dynamics Met Methionine MFP Memb rane Fusion Protein MM Molecular Mechanics MM/GB SA Molecular Mechanics/Generalized Born Surface Area MM/PB SA Molecular Mechanics/Poisson Boltzmann Surface Area MMH Methionine Methionine Histidine MMM Methionine Methionine Methionine MNDO Modified Neglect of Diatomic Overlap MOPAC Molecular Orbital Package MP2 Mller Plesset 2 nd order NDDO Neglect of the Diatomic Differential Overlap NMR Nuclear Magnetic Resonance NPT Isobaric, isothermal ensemble NVT Canonical ensemble P Proline PB Poisson Boltzmann PDB Protein Databank PDDG Pairwise Distant Directed Gaussian pH Power of Hydrogen PL Protein Ligand complex PM3 Parametric Model 3 PM6 Parametric Model 6 PM6 DH2 Parametric Model 6 with Dispersion and H bonding correction t erms

PAGE 14

14 PME Pa rticle Mesh Ewald QM Quantum Mechanics R Arginine RESP Re strained Electrostatic Potential RMSD Root Mean Square Deviation RMSF Root Mean Square Fluctuation RND Resistance, Nodulation, Division ROS Reactive Oxygen Species S Serine SCF Self Consiste nt Field SMD Solvation Model D SP Standard Precision SPARKS X Sequence, secondary structure Profiles and Residue level Knowledge based energy Score Extra with mixing profiles SPARTA+ Shift Prediction from Analogy in Residue type and Torsion Angle impr oved version SPC Simple Point Charge SPINE D Sequence bas ed Prediction with Integrated Ne ural network for Disordered residues STO Slater Type Orbital STRIDE Structural Identification T Threonine TIP3P Transferable Intermolecular P otential function 3 Point TIP4P Transferable Intermolecular P otential function 4 Point

PAGE 15

15 VMD Visual Molecular Dynamics XP Extra Precision G Gibbs free energy change

PAGE 16

16 Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy COMPUTATIONAL STUDIES ON THE ENERGETICS AND DYNAMICS OF BIOMOLECULAR SYSTEMS By Melek Nihan Ucisik May 2014 Chair: Kenneth M. Merz Major: Chemistry Computationa l chemistry offers many avenues to investigate physical phenomena at the molecular level which is usually not totally captured by experiments. Its applications on biological problems present a whole new perspective to living organis ms at a micro scale. F ol ding mechanisms of proteins into their functional forms, assembly formation mechanisms of multiple proteins, signal transduction pathways through a series of proteins and lipids, interactions of proteins and nucleic acids, catalysis pathways of enzymes, an d binding principles of small molecules to enzymes belong to a lo ng list of areas to be explored wi th computational chemistry to make sense of observat ions made at macroscopic scale. This dissertation features discussions pertaining protein structure, dyna mics, and ligand binding. Quantum mechanics and molecular d ynamics a re employed to gain insigh ts into sample problems in the se areas. The first chapter introduces how computational chemistry might aid in the understanding of physical phenomena. The second chapter summarizes the basic theory behind the methodologies utilized in the projects presented herein. The next two chapters deal with ligand binding to proteins. In

PAGE 17

17 the third chapter, we prove the validity of the commonly used fragment interaction energ y additivity assumption. The fourth chapter underlines the need of assessing the uncertainty of calculated properties and demonstrates a protocol to place error bars on binding affinity predictions for a set of protein ligand complexes. The last two chapte rs investigate the structure and dynamics of a metal binding membrane fusion protein, CusB, the periplasmic piece of Cu(I) extrusion pump Cus C B A This dissertation contribute s to the field of computational drug design by demonstrating the soundness of the pairwise interaction energy additivity approximation and encouraging the use of uncertainty assessment for computed physical properties Moreover, it is expected to aid in elucidating the working mechanisms of intrinsically disordered protein domains.

PAGE 18

18 CHAP TER 1 INTRODUCTION The year 2013 was a remarkable year for computational chemists. The recent Nobel Prize in Chemistry honored Arieh Warshel of University of Southern California, Martin Karplus of Harvard Univers ity and UniversitÂŽ de Strasbourg and Mich a e l Levi tt of Stanford University for "development of multiscale models for complex chemical s ystems" 1 T heir work was attributed to be ground breaking because it success fully combined classical physics and quantum mechanics to model large, complex chemical systems and reactions. The fact that computers are "just as important as the test tube" 1 for today's chemists was acknowledged and appreciated through the most prestigious award in the fi eld of physical sciences This thesis does not present any studies which particularly utilized the aforementioned method of focusing on the chemicall y significant regions of a large molecular system and treating it with quantum mechanical methods, and handling its remainder with more crude classical mechanical techniques. However, it offers a vast spectrum of different computational chemistry methods r anging from quantum mechanics calculations to molecular dynamics making use of pure classical physics employed in numerous projects. Thus, it demonstrates several aspects of how calculations are very complementary to experiment and how they might guide or complete experiment, just as emphasized by the Nobel committee. 1 Chemical timescales are extremely fast. It is impossible to capture all the molecular details of chemi cal processes experimentally. The power of computational chemistry comes from the fact that it establishes models of the chemical processes from scratch making use of physics which allows scientists to virtually slow down those ultra fast timescales to mor e conceivable levels. At this point, another factor comes into play:

PAGE 19

19 the cost. Building up and simulating chemical systems from first principles in the form of delocalized wave functions is much more expensive than employing crude models composed of molecu lar or atomic constituents which limits the use of the former to much smaller systems although it is exempt from the rough approximations of the latter and hence much more accurate. 2 Those rough approximations stem from the purely classical physi cal treatment of atoms, molecules, and their interactions. Quantum physics is totally out of the picture which leads to oversimplification of many molecular phenomena. Coulomb interactions, namely the interaction between the cores and the electrons are acc ounted for while the purely quantum mechanical exchange interactions, which have been formulated by Pauli's exclusion principle, and other quantum mechanical effects like charge transfer and dispersion are entirely dismissed. Parameterization is also an in gredient to this approach. The predictions are fit to a characteristic set of experimental results and that might lead to biased outcomes which over or underestimate the probed values of a particular test system based on its similarity to the original tra ining set. Molecular mechanics and molecular dynamics are of this type of computational approach, whereas ab initio quantum mechanical calculations sit on the other end of the spectrum. Thus, molecular mechanics and molecular dynamics are the main methods to examine large biomolecular systems either in explicit solvent models or in a dielectric continuum. Studies which seek solutions to more fundamental questions on smaller model systems concerned with certain concepts req uire a higher level of accuracy. This level of accuracy is obtained through the use of ab initio quantum methods. A compromise between cost and accuracy is achieved by semi empirical methods which take advantage of an

PAGE 20

20 approximate wave function and parameterization to experimental data. Sy stem size and the available computing resources usually dictate which computational method is going to be chosen for a particular project to maximize the efficiency. 3 In this thesis, there is at least one example to all of these computational methods. After the background information about the theory and methods used to examine biomolecular systems given in Chapter 2, which introduces quantum mechanics, molecular mechanics and molecular dynam ics, a widely used concept in biomolecular models is question ed in Chapter 3: we investigate the additivity of pairwise in teraction energies for protein ligand complexes on the fragmented binding pocket of HIV Protease II bound to the inhibitor Indinavir. Chapter 4 moves on from the interaction energies to the ultimate property that many computational and medicinal chemists a re after: binding free energy. It demonstrates an application of a systematic, local perturbation methodology to create ensembles for unbound ligands and protein ligand complexes in static receptors on a congeneric series of aromatic inhibitors bound to th e engineered binding pocket of T4Lysozyme L99A. The constituents of the created ensembles are scored using molecular mechanics and semiempirical methods while also being assessed for their random and systematic errors. The scores are utilized to estimate b inding free energies directly from statistical mechanics equations with calculated uncertainties on the fly. The study shows clearly how the precision of binding affinity estimates improves with employed ensembles rather than single structures. In Chapter 5 we touch on the problem of protein folding with molecular dynamics by examining an intrinsically disordered pr otein domain, which contains a metal binding site and whose structure could not be resolved by crystallography because of being

PAGE 21

21 extremely disor dered. Twenty five distinct models of the N terminal region of the Cu(I) binding protein CusB in Eschecihia coli ( E. coli ) are simulated in explicit solvent for a total time of 15 !s on each of its apo and Cu(I) bound versions. The impacts of metal binding are explored with a very detailed post trajectory analysis. Some repeated structural elements emerge in both versions and the disorder is mostly traced back to a certain s ubdomain. In the final Chapter 6 a representative CusB N terminus structure is dock ed to its functional counterpart metallochaperone CusF. Two models for the metal binding site are created considering the potential metal transfer process and each model is simulated with molecular dynamics in explicit solvent for 4 !s. It is observed that the disorder in the N terminal region of CusB ceases when it is complexed with the metallochaperone CusF. It becomes even more stable when the Cu(I) ligating residues are predominantly from CusB which might suggest a clue about the direction of the metal transfer towards the extracellular milieu. I hope this thesis would serve as a good reference to many researchers in the drug design field who assume pa irwise energy additivity applies to their systems of interest. It would lay the groundwork for the use of error bars of free energy estimations no matter with which protocol they are calculated. It would encourage the establishment of enhanced exhaustive sampling methods to reach larger ensembles which would in turn increase both the accuracy and the precis ion of the binding affinity calculations. It would inspire more methodologies to directly calculate free energy from statistical mechanics and to avoid decomposing it to its thermodynamic components enthalpy and especially entropy, whose effects are specif ically hard to incorporate into computation. Last but not least, I truly wish the working mechanism of E. c oli 's CusCFBA metal

PAGE 22

22 extrusion pump would be solved and that would reflect on getting a better insight to the antibiotic resistance mechanisms in bact eria. Our very extensive molecular dynamics studies should certainly have accessed some molecular details on this system which were impossible to catch otherwise and they would definitely contribute to solving the full mechanism showing the fact that compu ters are "just as important as the test tube", as described by the Nobel committee. 1

PAGE 23

23 CHAPTER 2 THEORY AND METHODS IN MOLECULAR MODELING In this chapter the computational chemistry methods, which were employed by the presented studies are briefly discussed with the theory behind them. Quantum Mechanics Quantum mechanics was developed to account properly for the particle like and the wave like properties of m atter. Wave Functions The fundamental postulate of quantum mechanics is that the wave function exists for any (chemical) system and functions acting on it return observable properties of the system as shown in Equation 1 1. These functions are referred to as "operators". ! ! ! (2 1) where is an operator and is a scalar for some property of the system. When this holds, is an eigenfunction and is an eigenvalue. The product of the wave function with its complex conjugate ! has units of probability f unction. In chemistry, we always deal with real and not complex wave functions, hence we drop the symbol *'. The probability that a chemical system will be found at a certain region of spac e is given by the integral of | 2 | over that region of space. Thes e factors dictate some characteristics on an acceptable wave function. First, the normalized integral of | 2 | over all space must be unity: the system of interest should exist somewhere in space. Second, it must be finite and single valued everywhere. Thir d, it must be continuous and continuously differentiable.

PAGE 24

24 Designating the wave function as an "oracle" is common: its nature is not very obvious but it will return answers when inquired with operators about the observable properties of the system at hand. The operator returning the energy E of a system as an eigenvalue is called the Hamiltonian operator H which we plug in the Equation 2 1 to obtain the time independent form of the Schršdinger Equation: ! ! (2 2) The Hamiltonian operator queries the energy of a chemical system, or more specifically a molecule, with respect to the sum of its potential and kinetic energies. The kinetic energies of the electrons and nuclei constitute the kinetic en ergy of the molecule, whereas the attraction between the nuclei and electrons along with the internuclear and interelectronic repulsions comprise its potential energy. For each of these contributors, there is a term in the Hamiltonian operator as shown in Equation 2 3. H = ! 2 2 m e i 2 i # ! 2 2 m k k 2 k # e 2 Z k r i k + e 2 r i j + e 2 Z k Z l r k l k < l # i < j # k # i # (2 3) where i and j span electrons, k and l represent nuclei, is the Planck's constant divided by 2 m e is the mass of the electron, m k is the mass of the nucleus k 2 is the Laplacian operator, e is the charge on the electron, Z is the atomic number, and r ab is the distance of the particle a from the particle b The wave function is a function of 3 n coordinates where n stands for the number of particles (nuclei and electrons) because all the x y z coordinates associated with the particular molecule are involved. The Laplacian operator takes the following form in Cartesian coordinates:

PAGE 25

25 i 2 = 2 x i 2 + 2 y i 2 + 2 z i 2 (2 4) Th e last three summation terms in Equation 2 3 retain their appearance in classical mechanics where they display the attraction of the electrons to the nuclei, the electron electron repulsion, and the repulsion between t he nuclei, respectively. The first two terms are the adapted versions of the kinetic energy operator yielding the kinetic energy for a QM particle. There are many acceptable eigenfunctions for a certain molecule and each of those are associated with a cha racteristic eigenvalue E which adds up to a complete set of # with eigenvalues E i These solutions are usually required to be orthogonal to each other. If they are also normalized, that is, the integral of the probability of finding the particle over all space is one, then they establish an orthonormal set. The q uality check of wave functions is carried out by making use of the Variational Principle. In the set of the possible eigenfunctions # $! there must be one associated with a lowest energy value E i This enforces a boundary on the set from below and that ene rgy is taken to be the ground state energy E 0 When we arbitraril y guess a wave function, we can vary it repeatedly and every variation decreasing the concomitant eigenvalue E i is favorable because it means we are improving our guess to approach the E 0 W hy do we even need to guess a wave function? It would be very nice if we could systematically build it up but unfortunately, accurate wave functions for many particle molecular systems are extremely challenging to formulate due to the correlated motions of particles. In fact, exact solutions to the Schršdinger E quation exist only for simple model problems like the particle in a box, the harmonic oscillator, the particle on

PAGE 26

26 a ring, the particle on a sphere, and one electron species like the hydrogen atom. Co rrelations in the many particle molecular systems give rise to pairwise attraction and repulsion terms indicating that no particle moves in an independent fashion. To simplify the situation, the Born Oppenheimer approximation is used which states nuclear m otions are much slower than electronic motions and thus their motions could be decoupled for convenience: nuclear masses are so much bigger than electrons so they could be assumed to be stationary with respect to electrons which adjust to any nuclear mot ion instantaneously. Computing the electronic energies for fixed nuclear positions causes the nuclear kinetic energy term in Equation 2 3 to be independent of electrons. It also eliminates the correlation in attractive electron nuclear potential energy ter m and converts the repulsive nuclear nuclear potential energy term to a simple constant for a set geometry. Wave functions are insensitive to constant terms in the Hamiltonian. So the pure electronic energy is calculated first, and then the constant nuclea r energy is added to it to give the total electronic energy E el The Born Oppenheimer approximation has very strong implications. The concept of a potential energy surface, which is spanned by the electronic energy E el over all possible nuclear coordinates would be impossible without it. Any critical points on the potential energy surface, like the equilibrium or transition s tate geometries, would also die as concepts. Instead, we would be limited to consider high probability regions of the nuclear wave fu nctions. Basis Sets The Born Oppenheimer approximation removes the e lectron nuclear correlation. T hat certainly is a big simplification but the interelectronic correlation remains and it is

PAGE 27

27 very difficult to incorporate it in the molecular wave function s, aka molecular orbitals. Their calculations involve evaluation of integrals which are cumbersome, if not impossible. One way to construct wave functions, which are rather easier to work with, may involve representing those as a combination of more conven ient functions. These more convenient functions constitute a basis set. Projecting this on chemistry, we could conceive that atomic wave functions would be useful in constructing molecular wave functions. The purpose here would be to provide the molecular wave function with enough flexibility to allow electrons to delocalize and thus decrease their energy, namely the eigenfunctions E which would be achieved through the use of ideally an infinite basis set. However, that would hamper the practicability of this approach. Hence, employing a larger basis set is more likely to span the true molecular orbital space and to produce lower energies which are preferred. Linear combination of atomic wave functions presents an avenue to create molecular orbitals from a tomic orbitals and the larger the basis set, the closer we get to the actual properties of the molecular orbitals. The approximate atomic wave functions we construct should satisfy the following criteria: at large distances from the nucleus the electron de nsity in an atom decreases exponentially with the distance r and the electron density at the nucleus has a cusp. A common choice satisfying these criteria is the Slater type orbitals, which have the following form: n l m = R n l ( r ) Y l m ( ) (2 5)

PAGE 28

28 where Y ( ) is called a spherical harmonic, R is the radial part, n is the principal quantum number, l is the azimuthal quantum number and m is the magnetic quantum number. Switching to polar coordinates fo r atoms makes sense because atoms possess spherical symm etry. Manipulating Slater functions within molecular orbital calculations is mostly difficult, particularly when they are centered on different nuclei. That is why they are usually replaced by functions based on Gaussians in ab inito calculations. Gaussian functions are really useful since the product of two Gaussians can be expressed as a single Gaussian, where the spread of the function can be adjusted by handling the exponential coefficients. However, unlike the Slater functions, they lack the cusp at th e origin, namely the nucleus, and they decay towards zero more quickly. These disadvantages are overcome by utilizing their linear combinations to express the atomic orbitals. A Gaussian expansion can be altered in two parameters: the coefficient and the e xponent. The most flexibility is obtained by allowing both of these to vary in ab initio calculations, which is referred to as employing uncontracted or primitive Gaussians. These increase the computational effort required in these calculations which in tu rn makes the use of contracted Gaussians more common. The contraction coefficients and exponents are retained in their pre determined forms during these calculations. Containing only the functions that are necessary to accommodate the filled orbitals in each atom of a molecule could be a simplification to the problem of constructing molecular wave functions. Such a basis set is called a minimal basis set. The basis sets STO n G ( n $3) are of this type in which n Gaussian functions are used to

PAGE 29

29 describe each orbital. n =3 is found to be the absolute minimum that should be used in an ab initio molecular orbital calculation because at least three Gaussian functions are needed to represent each Slater type orbital. The computational expense rises with increasing n the number of Gaussian functions in the expansion. The minimal basis sets contain only one contraction per atomic orbital and the radial exponents are forbidden from expanding and contracting in size according to their molecular environment. Moreover, t hey cannot display the non spherical aspects of the electronic distribution. These aspects render them deficient in expressing molecular wave functions. Employing multiple functions per atomic orbital can suggest a solution to these deficiencies. This wou ld create a mixture of contracted and diffuse functions to describe the best molecular wave function. Double zeta basis sets are examples to this type and they double the number of functions in the minimal basis set. An alternative to this could be doublin g the basis functions which describe the valence electrons but keep a single function for the inner shells because the core orbitals are almost constant from one molecule to another and do not affect the chemical properties by much. These basis sets are de signated as split valence double zeta basis sets. An example to this could be the very common Pople type 3 21G orbital where three Gaussian functions describe the core orb itals. The valence orbitals are also presented via three Gaussians two of which repre sent the contracted p art and the remaining one is for the diffuse part.

PAGE 30

30 Simply increasing the number of basis functions does not necessarily account for perturbed charge distributions in molecules. Introducing polarization functions into the basis set re medies anisotropic effects. Since these have higher angular quantum numbers, they correspond to p orbitals for hydrogen and d orbitals for the first and second row elements. Their use is notified by an asterisk *. The most commonly used basis set 6 31G* e xemplifies this type of basis sets and it contains polarization functions on heavy (i.e. non hydrogen) atoms. Two asterisks (as in 6 31G**) show the use of polarization functions for both hydrogen and helium. None of these basis sets is able to deal with molecules containing lone electron pairs or anionic species. These encompass a significant amount of electron density distant from the nuclear centers. As the amplitudes of the Gaussian basis functions decay with increasing di stance from the nuclei, this failure emerges. Adding highly diffuse functions to the basis set cures it. These basis sets are designated with a +' in Pople and "aug" in Dunning basis set terminology For instance, the 3 21+G basis set includes an additional set of diffuse s and p ki nd Gaussian functions. Adding diffuse functions for hydrogen as well as for heavy atoms is announced by the use of ++' in Pople style basis sets Employing complete basis sets would p resent the only way to obtain the actual molecular wave function which is impossible due to computational limitations. Instead the results from finite basis sets are extrapolated to the Complete Basis Set (CBS)

PAGE 31

31 limit. The difference of the actual result and the extrapolated result is denoted as the basis set incompleteness error. The Hartree Fock Approximation Another attempt to solving the many electron problems encountered in the endeavor of finding approximate solutions to the Schršdinger equation is the Hartree Fock (HF) approximation. In addition to its historical con tributions to modern chemistry, it usually lays out the first step towards more accurate approximations. Fulfilling the anti symmetry requirement of wave functions is very important as it implies Pauli's exclusion principle which states that no more than one electron can occupy a spin orbital. The simplest anti symmetric wave function for the ground state of an N electron system is given by a single Slater determinant: ! ! ! ! ! ! (2 6) where # stan ds for spin orbitals, which are spatial wave functions multiplied by a spin function. The best wave function of this form is dictated by the variational principle to minimize the ground state energy E 0 when the full electronic Hamiltonian acts upon it. The choice of the spin orbitals in Equation 2 6 provide the variational flexibility through which one can collect the lowest E 0 and derive the Hartree Fock equation, which determines the optimal spin or bitals. The Hartree Fock equation is an eigenvalue equation employing an effective one electron operator, called the Fock operator, of the form f ( i ) = 1 2 i 2 Z A r i A H F ( i ) A = 1 M # (2 7)

PAGE 32

32 where M represent the nuclei, i stands for electrons and $ HF is the average p otential felt by the i th electron due to the presence of the other electrons. Thus, the complicated many electron problem is substituted by a one electron problem which treats the interelectronic repulsion in an average way. The Hartree Fock potential $ HF or the "field" experienced by the i th electron depends on the spin orbitals of the other electrons which makes the Hartree Fock equation nonlinear and brings the necessity of solving it iteratively. This iterative procedure is named the self consistent fi eld (SCF) method. The average field felt $ HF by each electron is calculated with an initial guess at the spin orbitals and then the Hartree Fock eigenvalue equation for a new set of spin orbitals. New fields are obtained by these new spin orbitals and this procedure is repeated until self consistency is achieved. This means the fields should no longer change and the spin orbitals utilized in the Fock operator should become identical as its eigenfunctions for the iteration to stop. The solution produces a se t of orthogonal Hartree Fock spin orbitals N of which with the lowest energies constitute the occupied spin orbitals. The Slater determinant established with these is the Hartree Fock ground state wave function which emerges as the best variational approxi mation to the ground state of the molecular system. The rest of the product set are the virtual spin orbitals. Although there is an infinite number of solutions to the Hartree Fock equation and an infinite number of virtual spin orbitals, practically a fi nite K membered set of

PAGE 33

33 spatial basis functions is introduced to solve the Hartree Fock equation which leads to a set of 2K spin orbitals. As discussed previously, the larger and more complete the basis set, the greater the degree of flexibility in the expa nsions for the spin orbitals and the lower will be the eigenvalue E 0 Increasing the size of the basis set will keep lowering the energy E 0 until a certain limit, which is referred to as the Hartree Fock limit. Any finite basis set will result in an energy somewhat above the Hartree Fock limit. Second Order Mller Plesset Perturbation Theory The electron correlation problem was attacked in an alternative way by Mller and Plesset who based their method on Rayleigh Schršdinger perturbation theory. In the Mller Plesset perturbation theory, t he true' Hamiltonian operator H is handled as the sum of a zeroth order Hamiltonian operator H 0 and a perturbation V : H = H 0 + V (2 8) A set of molecular orbitals can be obtained for H 0 in Equation 2 8. The eigenfunct ions of the true Hamiltionian H are i with corresponding energies E i The eigenfunctions of the zeroth order Hamiltonian H 0 are referred to as 0 (0) with energy E 0 (0) as their eigenvalues In order t o gradually improve the eige nfunctions and eigenvalues o f H 0 to make them approach the eigenfunctions and eigenvalues of the true Hamiltonian H we express the true Hamiltonian H as: ! ! !" (2 9) where % is an order parameter and can vary between 0 and 1. The eigenfunctions i and eigenvalues E i of the true Hamiltonian H are written in a Taylor series in % :

PAGE 34

34 i = i 0 ( ) + ! i 1 ( ) + 2 i 2 ( ) + . = n i n ( ) n 0 # (2 10 ) E i = E i 0 ( ) + E i 1 ( ) + 2 E i 2 ( ) + . = n E i n ( ) n 0 (2 11) E i (n) is called the n th order en ergy or the n th order correction to the energy It can be calculated from the eigenfunctions as: E i ( 0 ) = i 0 ( ) H i 0 ( ) d E i ( 1 ) = i 0 ( ) V i 0 ( ) d E i ( 2 ) = i 0 ( ) V i 1 ( ) d E i ( 3 ) = i 0 ( ) V i 2 ( ) d (2 12 ) which requires to determine the w ave functions for a given order In Mller Plesset perturbation theory the unperturbed Hamiltonian H 0 is the sum of the one electron Fock operators for N electrons. Additionally, the Hartree Fock wavefunction, which is the Slater determinant formed with the occupied orbitals, is taken to be 0 (0) Hence, applying the H 0 on 0 (0) yields E 0 (0) the sum of the occupied orbital energies as the eigenvalue. To obtain the higher order wave functions, we need to realize that the difference between the true' Hamiltonian H and the zeroth order Hamiltonian H 0 is going to be the perturbation V The sum of the nuclear attractio n terms and electronic repulsion terms should establish the true Hamiltonian By s ubtracting the H 0 in the form of the sum of the one electron Fock operators from the true H we obtain V = 1 r i j J i j 1 2 K i j # $ % & j o c c ( i o c c ( j > i o c c ( i o c c ( (2 13 )

PAGE 35

35 where the first term on the right is the proper way to c alculate the electronic repulsion between the electrons i and j and the seco nd term is how it is computed from summing over the Fock operators for the occupied orbitals where J stands for the Coulomb operator and K represents the exchange operator. Solvin g for the zeroth and first order energy values shows that their sum corresponds to the Hartee Fock energy. Thus, to improve the energy result beyond this makes it necessary for us to consider at least the second order correction. This level of theory is na med MP2 The higher order wave function 0 (1) is written as linear combinations of solutions to the zeroth order Hamiltonian 0 (0) which will include single, double, etc. excitations obtained from promoting electrons into the virtual orbitals from a Hart ree Fock calculation. The Mller Plesset perturbation theory presents the advantage of being size consistent. However, it is not variational which sometimes results in computation of lower energies tha n the correct ground state energy It works best when the perturbation is small, that is, when the T aylor expansions in Equation 2 10 and 2 11 converge quickly. Third and fourth order corrections are also applied i n certain cases which are refe rred to as MP3 and MP4 levels of theory respectively. The cost of the calculations increases once one goes to higher n which confines the practica bi lity of higher order MP n level calculations to small systems. Density Functional Theory Obviously, it is not easy to construct a representative wave function and a good H amiltonian operator for real chemical systems with multiple nuclei, especially

PAGE 36

36 considering the difficulties of correctly incorporating the interelectronic interactions. Density Functional Theory (DFT) follows an alternative approach where it investigates t he electronic structure using functionals, i.e. functions of another function: the e lectron density & (r ) which is a function of spatial coordinates This property depends naturally on the total number of electrons in the system as shown in Equation 2 14 : N = ( r ) d r (2 14 ) When integrated over all space, the electron density returns the total number of electrons : N Additionally, since the nuclei are effective point charges, their positions would correspond to local maxima in the electron density. Inst ead of calculating the many electron wave function as a Slater determinant from a set of N single electron wave functions as in the Hartree Fock theory, DFT only attempts to calculate the total electronic energy and the overall electron density distributio n by considering single electron functions. DFT methodology works with a uniform non zero charge distribution, named uniform electron gas', subject to an external potential V ext ( r ) which includes th e effects of nuclear attraction which gives rise to th e following energy functional: E r ( ) # $ = V e x t r ( ) % r ( ) d r + F r ( ) # $ (2 15 ) F [ & ( r )] represents the sum of the contribution from interelectronic interactions and the kinetic energy of the electrons. DFT found widespread use in solid state physics communities at the beginning and it was not applied to quantum chemistry until Hohenberg and Kohn established their existence and variation theorems. Accordingly, the ground state electron density

PAGE 37

37 determines the external potential, which in turn determines the Hamiltonian operator and a c andidate wave function. The density also obeys a variational principle, implying the best solution for the density corresponds to true ground state density associated with the minimum value of the energy. Any incorrect density yields an energy above the tr ue energy. The d ifficulty with the Equation 2 15 is that the form of the F [ & ( r )] is unknown. As a breakthrough, Kohn and Sham suggested that it should be approximated as the sum of three terms: F r ( ) # $ = E K E r ( ) # $ + E H r ( ) # $ + E X C r ( ) # $ (2 16 ) where E KE [ & ( r )] is the kinetic energy, E H [ & ( r )] is the interelectronic Coulombic energy, and E XC [ & ( r )] expresses the contributions from exchange and correlation. These are not known exactly, which presents the biggest problem with DFT. However, several approximations for it exist and these have been ordere d by Perdew with increasing sophistication and computational cost by an analogy to Jacob's ladder. The simplest of these is the local density approximation, where the local density at the coordinate where the functional is evaluated is the only determinant for the functional. The local spin density approximation generalizes it by incorporating the electron spin. More accurate approximations have been gained through quantum Monte Carlo simulations of jellium'. The gradient of the electron density is conside red in addition to the local density in generalized gradient approximations (GGA) which takes the approximations beyond being only local. Meta GGA DFT functionals provided a further improvement of the exchange correlation functional by including the Laplac ian for the density along with its gradient. The exchange part can be aided by including a component of the exact

PAGE 38

38 exchange energy calculated from the Hartree Fock theory, which makes the functional a so called hybrid' functional. Semiempirical Theory Sem iempirical methods make up the second major category of quantum mechanical molecular orbital calculations besides the ab initio calculations. They attack t he computational intensiveness and thus the slow speed of the Hartree Fock calculations by omitting c ompletely or parameterizing the two electron integrals of the H F calculations to experimental information Since t hese numerous integrals are very demanding to solve numeri cally, approximating or neglecting them is a sensible strategy to make calculations accessible to m olecular systems of larger size As a result, semiempiricial methods are so fast that they are applicable to even biomolecular systems. Their accuracy is also acceptable when the training set for the parameterization includes similar molecul es to the probed systems. They also replace the entire core, namely the inner shell nuclei and electrons, with parameterized functions, which is a further simplification. The rationale behind this is that the valence electrons cause all the chemical phenom ena in that s cientists might be interested Modern semiempirical methods are based on the Neglect of the Diatomic Differential Overlap (NDDO) method where the overlap matrix is replaced by the identity matrix. This simplifies the HF secular equation by e quating the matrix elements that correspond to the overlap of two atomic orbitals on different atoms to zero, although this does not mean that all overlap integrals are set to zero in the calculation of Fock matrix elements. Modified Neglect of Diatomic Ov erlap (MNDO), Austin Model 1 (AM1), Parametric Model 3 (PM3) Parametric Model 6 (PM6) and Pairwise Distant Directed Gaussian (PDDG) use the NDDO formalism while each of them adds their specific

PAGE 39

39 approximations with regards to evaluating the integrals and p arameterization philosophy on top of it. Molecular Mechanics So far we touched on the methods which are mainly concerned with the electrons in a molecular system. This is extremely complicated for larger system sizes. Hence, molecular mechanics (MM) work s on the molecular modeling problems that are too large to be attacked by quantum mechanics by means of classical mechanics It ignores electronic motions and computes the energy of a system as a function of the nuclear positions only. This is another ap plication of the Born Oppenheimer approximation, without which we would not be able to express the energy as a function of the nuclear coordinates. In all molecular mechanics methods, atoms are simulated as single particles; a radius, polarizability, and a constant net charge is assigned to each particle; and bonded interactions are regarded as springs with an equilibrium distance equal to the experimental or calculated bond length. Force Fields Since we are interested in energy of chemical systems and in the molecular mechanics case this is a function of stationary nuclei only, we assume there is no contribution to the total energy from electronic motions The potential energy functionals are called force fields and they describe the total energy of the s ystem using four contributors from intra and intermolecular forces contained in the system. Within this framework, deviation of bonds and angles away from their reference values causes energetic penalties. A function describes the change of the energy whe n bonds rotate, and there are terms describing the interaction between non bonded parts of the system. Additional terms may appear in various force fields depending on their sophistication but

PAGE 40

40 these terms are contained invariably in all force fields. The f unctional form of the AMBER force field 4 which was exclusively used in the projects presented hereby, is as follows: V r N ( ) = k b l l 0 ( ) 2 + k a ! 0 ( ) 2 a n g l e s b o n d s + V n 2 1 + c o s n ! ( ) # $ % & t o r s i o n s + i j r 0 i j r i j ( ) ) + , 1 2 2 r 0 i j r i j ( ) ) + , 6 # $ % & . + q i q j 4 0 r i j / 0 1 2 1 3 4 1 5 1 i = j + 1 N j = 1 N 1 (2 17 ) V ( r N ) stands for the potential e nergy, a function of the positions ( r ) of N particles. The first term denotes the interaction between pairs of bonded atoms, modeled with a harmonic potential, whose energy increases as the bond length l deviates from the reference bond length l 0 The seco nd term sums up the energies of all the valence angles in the molecule by using another harmonic potential which w ould penalize the angles deviating from the equilibrium value 0 The third term, th e torsional potential, gives the energy change as a bond r otates. The fourth term is the non bonded term which is evaluated for all atom pairs ( i and j ) that are in different molecules or that are in the same molecule but apart by at least three bonds. The first part in the brackets is the Lennard Jones potential which models the van der Waals interactions and the second part is the Coulomb potential model for electrostatic interactions. Although the number of individual terms to be computed amounts to a lot, it is fewer than the number of cumbersome integrals tha t would have to be calculated in an equivalent ab initio investigation. The functional form is not the only ingredient to a force field, parameters like V n k b and r 0i j must be specified as well. These are borrowed from experiments or ab initio calcula tions, whichever improves the overall energy result to a greater extent. The form

PAGE 41

41 and the parameters are transferable which is a very important feature of force fields. This translates into being able to use the same set of parameters for multiple systems without having to assign new parameters each time we switch systems. Thus, we can make predictions on the systems that were never probed before. Another conc ept that did not exist in quantum mechanical calculations is atom type. An atom type needs to be assigned to each atom in the system which specifies the atomic number, the hybridization state and sometimes information about the atom's environment. For instance, in AMBER there are three sets of atom types for the histidine residue depending on its prot onation state. Likewise, the carbon atom type s in a histidine ring and in the five membered ring of a tryptophan indole ring are different. 4 6 Force fields desig ned for modeling molecules of specific classes use more specific atom type descriptions than the general purpose force fields. Potential Energy Surface The concept of potential energy surface was mentioned earlier as an outcome of the Born Oppenheimer app roximation. Since it separates the electronic and nuclear motions, the energy of a molecule in its ground state remains as a function of the nuclear coordinates only and is reduced to be the potential energy. In even the simplest chemical systems, the pote ntial energy is obviously a very complicated, multi dimensional function. The energy of an N atom system will respond to the changes in its nuclear 3 N Cartesian or 3 N 6 internal coord inates and the way it changes with coordinates is denoted as the potentia l energy surface. Of special interest are the stationary points on this hypersurface'. These have no slope in any di rection, i.e. have zero first derivatives. Local minima, namely the minimum energy configurations of the atoms, correspond to the most st able states of

PAGE 42

42 the system. Moving away from these minimum points results in arrangements with higher energy. The very lowest energy on the surface is associated with the most stable structural configuration and is referred to as the global energy minimum. Another interesting point of investigations is how local minima are connected to each other and how the system changes from one minimum to the other. This might, for example, correlate with a chemical reaction or a conformational change. The highest point between two minima is called the saddle point and it represents the lowest energy barrier on the path connecting the two minima. The atomic arrangement in these points is the transition state structure. These cannot be isolated in experiments but can be ob served and examined with computational tools. Energy Minimization As stated before, modelers are especially interested in minima on the potential energy surface as these points coincide with the stable states of the system. A minimization algorithm is emp loyed to locate the minimum points on the surface. Saddle points on the surface are also significant because they represent the transition states along the path from one local minimum to another. At both minima and saddle points the first derivative of the energy function is zero with respect to every variable. Hence, given a function f of independent variables x 1 x 2 % x i the following is required: !" ! ! ! ! ! ! ! ! (2 18 ) Molecular mechanics minimizations mostly involve 3 N variable s in Cartesian coordinates for N atoms. If the energy functional is of analytical form, th en the derivatives can be found with the aid of standard calculus operations, which is not too likely for molecular

PAGE 43

43 systems due to the complicated dependencies of the energy functional on the many variables. Instead, minima are detected using numerical methods that gradually vary the coordinates to generate configurations with lower and lower energies until a minimum is reached. Thus, there are two groups of minimization algorithms: those which use derivatives and those which do not. There is no universal algorithm to solve all the molecular model ing minimization problems. Most program packages offer a number of them and the user should take into account the specifics of the given problem to choose the algorithm in order to obtain the answer as quickly as possible, using the least amount of memory. While minimizing, being trapped in an energy well on the potential energy surface is very common because most methods can only decrease the energy outcome. This, of course, might prevent the algorithm from reaching the global minimum. Therefore, a means of producing several starting points and minimizing each of these is usually required so that multiple local minima and potentially the global minimum can be discovered. For the molecular mechanics problems presented in this thesi s w e employed the steepes t descent me thod and the conjugate gradient method of minimization. Both are first order minimization algorithms which change the coordinates of the atoms gradually while they approach the minimum. Each iteration starts with the molecular configuration gat hered from the previous step. The first step of minimization is performed on the initial configuration provided by the user. The steepest descents algorithm takes steps in the direction of the negative o f the gradient g k at the point k which is where the outcome of the function decreases

PAGE 44

44 most quickly, until a minimum is achieved. For 3 N Cartesian coordinates, the 3 N dimensional unit vector s k represents the direction of the movement: s k = g k / g k (2 19 ) Since the steps are taken in a linear fashion, this move ment is referred to as a line search. Then the gradient is recomputed at the located minimum and the process is repeated in the orthogonal direction because the energy was minimized in the last direction. This method becomes very slow towards the actual mi nimum of interest. That is why it is mostly used to quickly relieve the initial strains on a system which was also what we did. Then we switched to conjugate grad ient algorithm like most users in which the gradients are orthogonal at each point as in the s teepest descent method but the directions are conjugate. The direction of movement from point x k is given as the vector v k which is computed from the gradient at the point and the previous direction vector v k 1 : v k = g k + k v k 1 (2 20 ) where k is a scalar obtained from the gradient at the current and the previous points. Molecular Dynamics If we could determine all the minima on a potential energy surface, then we could apply statistical mechanical formulae to develop a partition function from which we could obta in the thermodynamic properties. However, this is not possible for most of the systems in which modelers are interested bec ause of the large system sizes These usually possess an enormous number of local minima. Where a full characterization of the potent ial energy surface is impossible, modelers make use of simulations which enable them to study those systems via smaller replications thereof. Representative configurations of these smaller replications are created in simulations such that

PAGE 45

45 accurate estimati ons of structural and thermodynamic properties can be collected through a viable amount of calculations. They also provide the modelers with the time dependent behavior of atomic and molecular systems, displaying an elaborate picture of t he way in which a system varies from one configuration to another. Molecular dynamics (MD) computes the dynamics of a given system from which time averages of thermodynamic properties can be evaluated by integrating the instant values of these properties over time These time averages approximate the actual values for the probed properties. They mostly cannot exactly reproduce those actual values because the phase space associated with the system is extremely likely to be not sampled fully. The 6 N dimensional p hase space i s an important concept of computer simulations and is defined as the collection of all the points spanned by the combination of six components (three positional coordinates and three momenta) for each of the N atoms in a system. During an MD simulation, se ts of atomic positions are acquired in sequence by applying Newton's equations of motions. By solving the differential equations manifested in Newton's second law, a trajectory is obtained: d 2 x i d t 2 = F x i m i (2 21 ) Here, the motion of a particle of mass m i along the c oordinate x i is described. F xi is the force on the particle in that direction. Moreover, the state of a system at any future time can be predicted from its current state, which renders MD deterministic. Finite Difference Methods and Time Steps In realist ic models of intermolecular interactions, the force acting on each particle changes depending on their positions and the positions of their interaction partners. This continuous potential makes the motions of all particles coupled which results in a

PAGE 46

46 many b ody problem that cannot be solved analytic ally. Instead, the equations of motions are integrated by severing the calculation into a series of very short time steps ( t The vector sum of the interactions of a particle with other particles gives the total f orce on it at a time t From the force on each particle in a configuration, their accelerations ca n be obtained which are then combined with the positions and velocities at time t for computation of the positions and velocities at time t + ( t During the tim e step ( t the f orce is assumed to be fixed. Then the forces on the particles are calculated in their new positions, resulting in updated positions and velocities at time t + ( t and so on. There is a number of algorithms to integrate the equations of moti on using finite difference methods all of which assume the positions and dynamic properties can be approximated by Taylor series expansions: r t + t ( ) = r t ( ) + t v t ( ) + 1 2 t 2 a t ( ) + 1 6 t 3 b t ( ) + 1 2 4 t 4 c t ( ) + . v t + t ( ) = v t ( ) + t a t ( ) + 1 2 t 2 b t ( ) + 1 6 t 3 c t ( ) + . a t + t ( ) = a t ( ) + t b t ( ) + 1 2 t 2 c t ( ) + . (2 22 ) where r is the position, v is the velocity (the first derivative of the position with respect to time), a is the acceleration (the second derivative), b is the third derivative and so on. In MD, the Verlet algorithm is probably the most frequently implemented one. It calculates the new positions at t + ( t from the positions and accelerations at time t and the p ositions from the previous step t ( t : r t + t ( ) = r t ( ) + v t ( ) + 1 2 t 2 a t ( ) + . r t ! t ( ) = r t ( ) ! v t ( ) + 1 2 t 2 a t ( ) . (2 23 )

PAGE 47

47 and these sum up to give: r t + t ( ) = 2 r t ( ) r t ! t ( ) + t 2 a t ( ) (2 24 ) The velocities do not come up explicitly in the Verlet algorithm but are calculated in various ways. A simple way is to divide the difference in positions at t + ( t and t ( t by 2 ( t The implementation of the Verlet algorithm emerges as straightforward and it only stores two sets of positions and the accelerations as seen in Equation 2 24 The lack of the velocity in the equations is a drawback because it cannot be calculat ed until the positions at the next step are collected. Additionally, the positions are obtained by adding the small term of ( t 2 + a ( t ) to the difference of two mu ch larger terms in Equation 2 24 which might compromise the precision. It is also not a self sta rting algorithm, which requires the positions of the previous step to calculate the next step. This gives rise to use of some other means to calculate the positions of the previous step at the initial time t =0 Modifications of the Verlet algorithms prod uced other commonly used finite difference algorithms like the leap frog algorithm, the velocity Verlet algorithm, and the Beeman's algorithm. An efficient integration features speed, modest mem ory requirements, and ease of implementation. However, t he cal culation of the force on each particle in a system is always the most demanding part of an MD simulation. Compared to that, integration of the motion equations is usually trivial. Yet, the algorithm should conserve energy and momentum, be time reversible, and should allow a long time step, ( t to be employed. The use of a longer time step would translate to less iterations to cover a particular amount of phase space. One needs to be careful when choosing the length, though, because if the time step is too l ong, instabilities

PAGE 48

48 might occur due to high energy overlaps between atoms and they would prevent the conservation of energy and linear momentum. If the time step is too short, then the trajectory will be able to cover only a small portion of the phase space using the same amount of computational effort. The correct balance need s to be found. For flexible molecules, a useful guide is that the step should be one tenth the time of the shortest period of motion. In flexible molecules, bond stretching is the fast est type of motion, especially the bonds involving hydrogen. Temperature Regulation In Molecular Dynamics simulations, we are usually interested in observing a system's behavior at fixed temperatures such as the room temperature or the temperature settin g of the experiment to which the calculation will be compared. The temperature of the system correlates with the time average of the kinetic energy and is given by the following formula if no constraints are acting on the system: K N V T = 3 2 N k B T (2 25 ) Hence, if we manipulated the velocities, we could alter the temperature of the system. A scaling factor % can be applied on the velocities and the resulting temperature change can be computed as: T = 1 2 2 3 m i v i ( ) 2 N k B i = 1 N # 1 2 2 3 m i v i ( ) 2 N k B i = 1 N # (2 26 ) T = 2 1 ( ) T ( t ) (2 27 ) = T n e w T ( t ) (2 28 )

PAGE 49

49 where v i is the velocity an d m i is the mass of the i th particle in a N particle system, k B is the Boltzmann constant and T(t) stands for the temperature of the system at time t Multiplying the velocities by the scaling factor % defined in Equation 2 28 emerges as the simplest way t o control the temperature. Alternatively, the system could also be coupled to an external heat bath fixed at the desired temperature as Berendsen et al. suggested, where the bath supplies to or removes heat from the system when needed. The rate of change of temperature is proportional to the temperature difference of the system and the heat bath acting as a source of thermal energy. The velocities are scaled at each step with the aid of a coupling parameter ) determining how tightly the bath and the syste m are coupled to each other: d T ( t ) d t = 1 T b a t h T ( t ) ( ) (2 29 ) ! which leads to an exponential decay towards the target temperature. The temperature variation between successive time steps ( t is expressed by: T = t T b a t h T ( t ) ( ) (2 30 ) Thus, the velocities are scaled by the scaling factor % which is given by: 2 = 1 + t T b a t h T ( t ) 1 # $ % & (2 31 ) In this method, large values for imply weak coupling. If the coupling parameter equals the time step, this algorithm turns into the simple velocity scaling method. The advantage of the Berendsen method is that it allows the system to fluctuate about the target temperature.

PAGE 50

50 In general, velocity scaling artificially pronounces any temperature differences between the system components, which can lead to uneven temperature distributions; a phenomeno n also called "hot solvent, cold solute". To generate proper canonical ensembles, other methods like stochastic collisions aka the Anderson thermostat, or the extended system method which is also referred to as the NosÂŽ Hoover thermostat, can be implemen ted. In the MD projects presented in this dissertation, the Langevin thermostat was used to maintain the temperature at desired values. In this type of simulations, Newton's equations of motions are modified to approximate the effects of neglected degrees of freedom. At each time step, a random force acts on all particles and they are slowed down by constant friction giving rise to the Langevin equation of motion: m i d 2 x i ( t ) d t 2 = F i x i ( t ) { } ! i d x i ( t ) d t m i + R i ( t ) (2 32 ) R i ( t ) represents the random force acting on the i th particle at time t resulting fr om the interactions of the particle with solvent molecules. It is often assumed not to correlate with the particle velocities, positions and the forces acting on them F i is the frictional force on the particle i and is proportional to the velocity of the particle v i : F i = ! v i (2 33 ) where is the friction coefficient and is related to the collision frequency by = / m i (2 34 ) The collision frequency is mostly assumed to be independent of time and position as a simplifying factor. In Equation 2 32 th e friction term F causes a drop in the average kinetic energy and thus the temperature because the friction coefficient is a positive

PAGE 51

51 constant. This decrease is balanced with the random force R ( t ) counteracting the frictional force and the temperature of the systems is kept around the target value. It actually fluctuates around the set value because the random force R ( t ) is randomly chosen from a Gaussian distribution with zero mean. Pressure Regulation Simulations in the isothermal isobaric ensemble dir ectly mimic the most common experimental conditions and therefore are of interest to modelers. Furthermore, constant pressure may favor certain structural rearrangements which are of interest and make sampling from an isobaric ensemble more advantageous ov er sampling from a constant volume ensemble. A macroscopic system reacts to fluctuations in pressure by changing its volume which in turn keeps the pressure constant. This projects on the simulations of isobaric isothermal ensembles as changing the volume of the simulation cell. The isothermal compressibility, + is a measure of the amount of volume fluctuations: = 1 V V P # $ % & ( T (2 35 ) Larger values of + imply easier compressions of the substance of interest. Hence at a given pressure, larger volume fluctu ations accompany these easily compressible substances. It is the pressure analogue of the heat capacity Pressure regulation makes use of similar methods as those i n temperature control. So, the pressure can be kept constant by simply scaling the volume or by hooking the system up with a pressure bath'. The rate of change of pressure is expressed as follows:

PAGE 52

52 d P ( t ) d t = 1 p P b a t h P ( t ) ( ) (2 36 ) where ) p is the coupling constant, also known as the relaxation constant, P bath is the pressure of the bath and P ( t ) is the pressure at time t The scaling factor % applies on the volume of the simulation box which corresponds to scaling the atomic coordinates by a factor of % 1/3 Hence: = 1 # t p P P b a t h ( ) (2 37 ) The scaling factor can be equal for all t hree directions and this is referred to as isotro pic scaling of the coordinates. The scaling factor can also be calculated independently in each direction and that would be anisotropic scaling. In our simulations, isotropic scaling was employed. Solvation Models One of the most predominant uses of molec ular dynamics is simulating the structural and dynamical properties of biological systems. As we all know, no biomolecular system exists naturally in a vacuum environment which makes including solvation effects into molecular dynamics simulations essential Solvation affects the conformations adapted by macromolecules and binding of small molecules to those. For any solvated system subject to simulation, the solute, namely the protein or the protein ligand complex, remains as the primary focus but water mo lecules are explicitly added to the simulation box. Ideally, we would want this box to be as large as possible so that the system is approximated to be in bulk solvent but this would compromise the feasibility of the calculation. Most of the simulation tim e should be spent on the solute rather than solvent because we are not interested in the bulk

PAGE 53

53 properties of the solvent but mainly in the behavior of the solute. Periodic boundary conditions allow the system in the simulation box to experience forces as if th ey were in an infinite system The particles leaving the cell from one side of the simulation box enter it simultaneously from the opposite side. This approach decreases the number of the solvent molecules to be simulated although it imitates the effect s of bulk solv ent and is very commonly employed in molecular dynamics studies. As a simpler alternative one can utilize a solvent shell with even less solvent molecules and if the shell is thick enough, it becomes analogous to a drop of solvent surroundin g the solute. This is, however, less realistic than simulating a system in accordance with periodic boundary conditions. Since the most important solvent for biomolecular systems is water, we are mostly interested in water models for explicit solvation p urposes. All water models define the interaction s ites of the water mole cule, the partial charges on it, as these are needed to account for the electrostatic interactions, and the Lennard Jones parameters for it to express the dispersion and repulsion forc es. There are various types of water models like rigid, flexible or polarizable. They are also classified by the number of the interaction points on the model water molecule where this number ranges fr om three to six. Modelers use the rigid three site mode ls of TIP3P and S P C along with the four site model TIP4P most frequently. An alternative to explicit solvation is offered by implicit solvent models which are computationally less demanding and thus cheaper. Especially for the cases where the solvent doe s not directly interact with the che mical entity of interest but creates a milieu impacting the behavior of the solute with its dielectric properties, implicit solvation can

PAGE 54

54 be preferred over explicit solvation to decrease the computational cost and to con centrate on the behavior of the solute. In implicit solvent models, the solvent is expressed as a continuum portrayed by macroscopic parameters like dielectric constant, surface tension, and density. The thermodynamic background is based on the solvation f ree energy , G sol defined as the free energy change to transfer a molecule from vacuum to solvent: G s o l = G e l e c + G v d w + G c a v (2 38 ) where G elec is the electrostatic contribution which is significant for polar and charged species because of polarization of th e solvent. This term is modeled with a uniform medium of constant dielectric. The second term G vdw represents the van der Waals interaction between the solvent and the solute which can further be divided into an attractive, a repulsive, and a dispersion t erm. The last term G cav is the free energy cost of creating a hole for the solute in the solvent. It contains the entropic penalty of solvent reorganization around the solute and work done against the solvent pressure in constructing the hole. In molecul ar mechanics, the Generalized Born (GB) and the Pois son Boltzmann (PB) models are very popular choices for implicit solvation. The G elec term in Equation 2 38 is described by the Generalized Born equation in the former and the Poisson Boltzmann equation i n the latter model respectively. The remaining two terms, G vdw and G cav are combined into a non polar, solvent accessible surface area based term to complete the implementation of these models in the approximate computation of a fre e energy of solvatio n co ntribution. Application of these are referred to as Molecular

PAGE 55

55 Mechanics Generalized Born Surface Area (MM/GB SA) and Molecular Mechanics Poisson Boltzmann Surface Area (MM/PB SA).

PAGE 56

56 CHAPTER 3 INVESTIGATING THE CONCEPT OF ADDITIVITY OF INTERACTION ENERG IES FOR PROTEIN LIGAND COMPLEXES Background Fragment based drug design (FBDD), which focuses on using small molecular "fragments" to form larger molecules, 7 has gained considerable popularity over the past decade and forms a complementary approach to methods such as high throughput screening of drug like molecules. 8 Many ac a demic institutions and major pharmaceutical and biotechnol ogy companies have put an emphasis on FBDD with their efforts placing drug candidates into clinical trials. 9 FBDD scans small molecule librari es for activity using biophysi cal techniques such as surface plasmon resonance, protein ligand NMR spectroscopy, isothermal titration calorimetry, x ray crystallography, or with bioassays. 8 In addition, tech nological advancements have allowed for the development of various com putational tools, which aid in different phases of a FBDD effort such as the generation of fragment libraries and in silico screening of these libraries against targets to identify potential hits. The hope that successful binding mode prediction would faci litate the evolution of a hit to a potent lead molecule underscores the need for molecular docking studies 10 15 and this has given rise to the use of computa tional screening as a starting point in FBDD. 11 The underlying assumption in FBDD is a n additive (or nearly additive) enhancement of binding affinity from each of the fragment molecules constituting the fully assembled inhibitor. Our study assesses the validity of this assumpti on in a realistic model by exploring the additivity of quantum mechanics derived fragment energies Reprinted with permission from Ucisik, M. N.; Dashti, D. S.; Faver, J. C.; Merz, K. M. J. Chem. Phys. 2011, 135 085101. Copyright 2011, AIP Publishing LLC.

PAGE 57

57 relative to interaction energies computed for an entire protein ligand complex. The model system is the activ e site of the human immunodefi ciency virus (HIV) II protease bound to the commercially available drug Indinavir, 16 18 which has been isolated from the rest of the enzyme and a bstracted into a series of rep resent ative protein ligand fragments. The interaction energy expansion functional adopted from Xantheas's study of water clusters 19 has be en utilized to compute the n body corrections for the interaction energy of the fragment model of the bind ing pocket with the inhibitor and these have been compared to the binding energy of the ligand to the pocket as a whole. Although we anticipated a non additivity level of up to 30% of the total binding energy as observed in Xantheas's water cluster studies, we found large ly an additive interaction pat tern for the protein ligand model system we studied. Before proceeding, the definition for additivity an d non additivity must be establ ished. Benos and co workers ex pressed additivity as the in dependent contribution of frag ments to binding. 20 Thus, the total interac tion energy would equal to the sum of the energies of the individual contacts. The simplest unit of a contact involves two bodies, where only one interaction is present between the fragment pairs. Hence, the real simplifying as sumption for combinatorial pu r poses would involve pairwise additivity, which ignores n body interactions with n > 2 because of their (presumed) very low contribution to the total interaction energy. However, in the case of water clusters Xantheas found contributions as high as 30% fo r three body interactions. 19 Shimizu and Chan emphasize the pairwise character of additivity by comparing the total free energy of a ssociation among a set of N solutes to the sum of free energies of all two body combinations of these N solutes, specifically to the N(N 1)/2 possible pairings of the

PAGE 58

58 solutes calculated one pair at a time for two solutes while the effects of the remaining N 2 solutes were ignored. 21 Here, we should make a distinction between our work and this approach we do not predict anything specifically about free energies but only electronic energies, whi ch is further discussed in the Theory section of this chapter Nonetheless, this study constitute s a vali dation of pairwise additivity of interaction energies computed in FBDD studies. Biochemists examine additivity concepts through double mutant cycles. 22 Basically, constructing a double mutant cycl e corresponds to construc ting a thermodynamic cycle involving the wild type protein, the protein with a particular point muta tion in the region of interest, the protein with a different point mutation again in the region of interest and the protein with both single mutations appli ed simultaneously as shown in Figure 3 1 Thus, if free energy additivity holds true, the impact of the double mutation should be equal to the sum of the two single mutations which allows for the prediction of the free energy associated with a certain func tionality or structural el ement of any protein in the cycle from measurements of the other three proteins in the cycle. This practice resembles our strategy, whereby the ligand is kept intact and the binding pocket of the protein is varied, but has focused on addressing the (non ) additivity of free energies. Free energy additivity of point mutations has been obse rved at enzyme substrate inter faces in several studies. 23 25 However, the observed additivity was largely traced back to the remoteness of the point mutation sites, which implies no interaction between the two sites. Any means of interaction between the two mutated residues, both via direct contact and indi rect electrostatic interactions were described by Wells as factors which would cause the simple additivity model to collapse. 25 Schreiber and

PAGE 59

59 Fersht examined coupling free energies between two mutated sites, , G int which correspond to non additivity in the thermody na mic cycle of singly and doubly mutated proteins. 26 They found that res idues separated by less than 7  showed non additivity and this non additive behavior was interpreted to imply cooperativity between those residues. It was concluded that at greater separations the effects of the point mutations were additive implying tha t the mutual interaction between these point mutations was minimal. In contrast, our studies have explored the additivity of interaction energies beyond these distance boundaries. Some fragments that we examined were separated by less than 7 , but only ha d interaction en ergies of a few hundredths of a kcal/mol. Nevertheless, the observed free energy additivity behavior of fragment energies located in close proximity can be complex, as demonstrated by Wells. 25 Establishment of an inverse correlation between additivity and cooperativity is a very common conclusion. That is, most studies connect the non add itivity to coopera tivity among the examined fragments. 21 25 26 Accordingly a non zero coupling energy, w hich is accepted to be the mea sure of the cooperativity bet ween interacting fragments, im plies either a direct interp lay mediated by steric, electrostatic, hydrogen bonding, or hydrophobic interactions, or an indirect interplay through structural changes in the protein or solvent shell. The non additive character of free energies associated with fragments is largely the resu lt of the non additivity of en tropic terms. However, for enthalpies or energies, in addition to being theoretically justified, 27 additivity has been observed in, for example, the isothermal calorimetry experiments of Baum et al and Olejniczak et al 28 29 Geometrical changes in duced by point mutations can contribute to the non additivity of observed free energies, but this point is not a major con cern in the present work

PAGE 60

60 becau se of the fixed geometry we are using and due to the additivity of enthalpies experimentally observed in References 28 and 29 In order to understand the r ole additivity plays, it is im portant to select the appropriate computational model. The method to be employed should have a good balance between accuracy, cost efficiency, and feasibility in terms of memory and computer time requirements. Recent work examining the accuracy of several computational methods when compared to complete basis set (CBS) r esults acquired at the coupled cluster single double (triple) (CCSD(T)) level of theory suggested that the M06 L meta GGA (generalized gra dient approximation) functional 30 is such an appropriate level of theory. 31 Even in conjunc tion with the 6 31G* basis set 32 it yielded a narrow error distribution for the bound Indina vir system. We note that in spite of its relative accuracy and speed the M06 L functional has difficulty with convergence for charged systems. 33 Thus, substantial effort was expended to deal with convergence problems for fragments contain ing acetate and methyl guanidinium. For comparison purposes HF/6 31G* and PM6 DH2 34 37 calc ulations were performed as well. This chapter is organized as follows : first, the model system and the computational details are reviewed. Next, the justification for using the computed energies as a measure of additivity validation is presented. The energy decomposition scheme for the cluster analysis of the binding site o f the en zyme complexed with its inhibitor is then introduced. Then the contributions of the n body interactions to the total energy of the system are evaluated at different levels of theory and the conclusions are given. Finally, the ramifications of our w ork on FBDD and force fields are discussed. Molecular System: HIV Protease II Indinavir Complex The protein ligand mod el system is based on the crys tal structure of the HIV II

PAGE 61

61 protease obtained from Protein Data Bank at 1.9  resolution (PDB ID: 1HSG). 16 The bind ing pocket was decomposed previously into a total of 21 fragments by Faver et al. in a study, 31 for which the enzyme ligand complex structure was obtained from the PDB, hydrogen atoms were added to the structure with the program Reduce 38 and were subsequently optimized with the AMBER ff 99SB force field. 39 These 21 receptor fragments and the ligand were combined to model the binding site. Overlapping receptor fragm ents were joined to yield a total of 18 fragments from the origin al 21 fragments. The re sultant cluster structure was r etained for all subsequent cal culations. The final model contained short aliphatic alkanes including ethane, propane, isobutane, and butane along with polar species consisting of acet ate, acetic acid, m ethyl guani dinium, and four peptide chains containing up to 35 atoms. Two tightly coordinated water molecules in the crystal struc ture were retained and treated as two distinct fragments. The ligand L 735 524, 17 18 an orally bio available inhibitor of HIV proteases with the commercial name Indinavir, was kept intact in all calculations. Methods Single point cal culations w ere carried out at the HF, den sity functional theory (DFT), and semiempirical levels. The 6 31G* Pople type basis set 32 was used throughout. The M06 L/6 31G* combination of level of theory was chosen because of its narrow systematic and ran dom error distribution with re spect to a CCSD(T)/CBS reference. 40 Moreover, through its pa rameterization the M06 L functional gives a good account of intermolecular interactions as evidenced by the low system atic error associated with the polar and non polar interactions relative to CCSD(T)/CBS reference calculations. On the other hand, HF/6 31G* calculations represent another extreme in that this method has both large random and systematic er rors largely due to the

PAGE 62

62 incorrect treatment of dispersion. 31 40 The large size of the quantum regio n was the major obstacle to up grading to much larger basis sets and prompted our choice of the 6 31G* basis set. In order t o examine higher order multi body interaction energies the semiempirical PM6 DH2 level of theory was em ployed due to its accurate per formance with regards to the CCSD(T)/CBS results and its fast computational speed. Considering the in dividual accura cies fo r polar and non polar in teractions gave us further con fidence in the use of PM6 DH2 method in our calculations. 31 40 Convergence problems were encountered for some of the acetate and methyl guanidinium containing systems, which were add ressed by using quadratic convergence methods, 41 Fermi temperature broadening, 42 and shifting of orbital ener gies. In order to correct for the basis set superposition error the counterpoise method was applied. 43 In each calculation the nu clei of the atoms which did not belong to that particu lar calculation were deleted while their basis functions were retained. The DFT and HF calculations were performed with the Gaussian 09 suite of programs 44 while MOPAC2009, 45 was used for the semiempirical computa tions. Visualizations were done using the Visual Molecular Dynamics (VMD) program, 46 while the density plots were obtained using the statistical software package R. 47 Theory The process of partition ing a larger molecule into con stituent fragments and treating those fragments as unique bonding units within the fram ework of the larger system for mally splits the Hamiltonian of the larger molecule into in dividual fragment Hamilt onians. The total binding energy is given by the ensemble average of the Hamiltonian H of the full ligand containing N particles, at a volume V and at a temperature T : 27

PAGE 63

63 E N V T ( ) = H N V T (3 1) Now, let us assume that the Hamiltonians associated with the distinct fragments sum to model the Hamiltonian of the full protein ligand system. Thus, for a system of n fragments, H = H i i = 1 n (3 2) By combining Equation 3 1 and Equation 3 2, we can define an energy expression: E = H = H i i = 1 n = E i i = 1 n (3 3) suggesting that the energy and enthalpy of the system is additive as long as the Hamiltonian is additive. The additivity of e nthalpies was experimentally studied by Baum et al using isothermal titration calorimetry for a series of thrombin inhibitors. 28 Incor poration of a particular functionality into the inhibito r always corresponded to a spe cific , H which underlines the independent (additive) be havior of the enthalpy component, whereas the free energies and the entropy components for the same structural ch anges were much more variable sugg esting non additivity. Further more, this finding also supports the idea of associating a func tional group present within a molecule with a given enthalpy change, which bolsters the noti on of attributing certain ener gies or enthalpies to fragments of a larger molecule. In this study we empl oy an approximation for the en ergy decomposition of the binding pocket of HIV II protease bound to the ligand Indinavir. 16 18 The decomposition scheme we used was adopted from Xantheas's formulation for water clusters. 19 The t otal energy E n of an n body cluster can be expanded into one two thr ee four % n

PAGE 64

64 body terms via the formula below: E n = E 1234... n ( ) E i ( ) i = 1 n # + $ 2 E ij ( ) j > i n # i = 1 n % 1 # + $ 3 E ijk ( ) k > j n # j > i n % 1 # i = 1 n % 2 # + $ 4 E ijkl ( ) l > k n # k > j n % 1 # j > i n % 2 # i = 1 n % 3 # + ... + $ n E 1234... n ( ) (3 4) Here, the first term corresponds to the one body term, the sec ond to the two body, th e third to the th ree body, % and even tually the last term to the n body term. E ( i ) denotes the ener gies of the single fragments or the ligand, E ( ij ) represents the energies of all possible combinations involving two bodies out of the pool of fragments and the ligand and E ( ijk ) E ( ijkl ) describe the energies of the multi body combinations out of the same pool. The total binding energy of the Indinavir ligand to the HIV II protease binding pocket, which was split into 18 fragments, was first calculated using Equation 3 5: bp l bp l bind E E E E ! = + (3 5) where E bind stands for the total binding energy, E l+bp is the energy of the entire system consisting of the ligand and the fragmented binding pocket, E l indicates the energy of the ligand, and E bp refers to the energy of the b inding pocket comprising its 18 fragments. The total binding energy E bind is compared to the binding energy E bind which was obtained from the expansion shown in Equation 3 4 for the same system. If we write E bind explicitly using Equation 3 4 and l ab el the ligand as the 19 th frag ment, we arrive at Equation 3 6 :

PAGE 65

65 ! # $ $ $ % & + + + + + ( ( ! # $ $ $ % & + + + + + = ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) ) = > > > = = = > > > = > > > = = = > > > ) 18 ... 1234 ( ) ( ) ( ) ( ) ( ) 19 ( ) 19 ... 1234 ( ) ( ) ( ) ( ) ( 18 15 1 16 17 18 4 18 1 17 1 16 1 17 18 3 18 2 19 16 1 17 18 19 4 19 1 18 1 17 1 18 19 3 19 2 E ijkl E ijk E ij E i E E E ijkl E ijk E ij E i E E i i j j k k l i i i i j j k i j i i j j k k l i i i i j j k i j bind (3 6) The collection of terms in the first row of the expansion in Equation 3 6 corresponds to E l+bp in Equat ion 3 5, E(19) is equivalent to E l and the expression in t he third row excludes the 19 th fragment, namely, the ligand, leaving the energy of the whole binding pocket composed of 18 fragments ( n = 1 18) in the absence of the ligand. The difference between E bind and E bind will help us address the issue of additivi ty or non additivity of the interaction energy. As more terms in Equation 3 6 are considered, the difference between E bind and E bind will approach zero. Truncation at lower order terms (e.g., two body) will allow us to investigate the contribution of the higher order multi body terms to E bind Let us turn our attention to the individual n body terms. The two three four and . n body terms are defined as follows: { } { } { } { } { } { } ) ( ) ( ) ( ) ( ) ( ) ( ) ( ) ( ) ( ) ( ) ( ) ( ) ( ) ( ) ( ) ( ) ( ) ( ) ( ) ( ) ( ) ( ) ( ) ( ) ( ) ( ) ( ) ( 3 3 3 3 2 2 2 2 2 2 4 2 2 2 3 2 ijl E jkl E ikl E ijk E kl E jl E jk E il E ik E ij E l E k E j E i E ijkl E ijkl E jk E ik E ij E k E j E i E ijk E ijk E j E i E ij E ij E + + + + + + + + + + + = ! + + + + = + = (3 7)

PAGE 66

66 The number of m body terms out of a system of n bodies, where m # n is simply given by the number of m combinations out of a set of n bodies. Thus, for our cluster with 19 bodies, there are 19 2 # $ % & = 19 2 *17 = 171 E(ij) val ues. However, using Equation 3 6 and noting that the 18 receptor fragments are comm on to both expansions in the first and the third rows, we observe that only the E(ij) values involving the ligand, the 19 th fragment, will survive. Thus, the total two body term 2 E ij ( ) # encountered in Equation 3 6 involves 19 2 # $ % & ( 18 2 # $ % & = 18 2 E(ij) values which is actually obvious from the fact that there exist 18 fragments to pair with the ligand. The same logic applies to the 3 E(ijk) 4 E(ijkl) % terms, which will be referred to as n body correction terms in the sense that these term s supplement the total 2 E ij ( ) # term in converging to E bind In other words, they correct E bind to approach E bind Hence, 153 3 E(ijk) values are required to calculate the total three body correction term, while 816 and 3,060 4 E(ijkl) 's and 5 E(ijkl m ) 's compose the total four body and five body correction terms, respectively. As indicated above, the many body energy values in the correction terms always involve the ligand since the interactions among only the receptor fragments are cancelled in the total decomposition scheme. The sum of the two body terms, namely the 2 E ij ( ) # is defined as the additive part of the binding energy decomposition, while the higher multi body correction terms represent the non additive part. Accordin g to this definition, the neglect of higher order multi body corrections might be expected to produce a significant difference between E bind and E bind demonstrating the importance of non additivity. This is what was observed

PAGE 67

67 in the work of Xantheas for wa ter clusters, where the three body terms and above represented 30% of the total binding energy. On the other hand, for the energy decompositio n scheme given by Equation 3 3 the additive part must be much bigger than the non additive many body correction t erms. This manuscript aims to elucidate the extent of non additivity for protein ligand binding clusters. Having chosen our cluster we need to address the basis set superposition error (BSSE) because we will employ finite basis sets for our calculations. We applied the counterpoise method to account for BSSE 43 Accordingly, the energies of the many body subsystems, e.g. E(ij k ) for the subsystem (ijk) at t he cluster geometry were calculated in the full basis of the entire system, which was denoted by E ijk | ijk ... n ( ) That is, in all the multi body energy calculations, the basis functions centered on each of the 323 nuclei were kept, while the nuclei which do not participate in that particular calculation were deleted. The inclusion of these so called "ghost" orbitals removes the effects of BSSE. This approach completes the formulation of the individual many bo dy correction terms given in Equation 3 7 by con verting the expressions into Equation 3 8 : 2 E ( i j i j k l . n ) = E ( i j i j k l . n ) E ( i i j k l . n ) + E ( j i j k l . n ) { } 3 E ( i j k i j k l . n ) = E ( i j k i j k l . n ) E ( i i j k l . n ) + E ( j i j k l . n ) + E ( k i j k l . n ) { } 2 E ( i j i j k l . n ) + 2 E ( i k i j k l . n ) + 2 E ( j k i j k l . n ) { } 4 E ( i j k l i j k l . n ) = E ( i j k l i j k l . n ) E ( i i j k l . n ) + E ( j i j k l . n ) + E ( k i j k l . n ) + E ( l i j k l . n ) { } 2 E ( i j i j k l . n ) + 2 E ( i k i j k l . n ) + 2 E ( i l i j k l . n ) + 2 E ( j k i j k l . n ) + 2 E ( j l i j k l . n ) + 2 E ( k l i j k l . n ) # $ % & % ( % ) % 3 E ( i j k i j k l . n ) + 3 E ( i k l i j k l . n ) + 3 E ( j k l i j k l . n ) + 3 E ( i j l i j k l . n ) { } . (3 8) Figure 3 2 and 3 3 visually demonstrate the procedure used to obtain the binding energies E bind and E bind Insertion of the individu al correction terms given in E quation 3 8

PAGE 68

68 in to Equation 3 6 yields the terms that must be computed to obtain the two binding energies. In Figure 3 2, Equation 3 5 is visualized for the HIV II protease ligand system. The atoms containing ghost orbitals are transparent, while the darker atoms designa te the nuclei present in that particular calculation. In Figure 3 3 the components of the energy decomposit ion scheme given by Equation 3 6 producing E bind are represented. The same color coding scheme used for Figure 3 2 applies to Figure 3 3 Results a nd Discussion Pairwise Interactions First, the total binding energy E bind of the HIV II protease to the Indinavir li gand was computed according to Equation 3 5 with the constituent terms shown in Figure 3 2 It was found to amount to 23.57 kcal/mol, 111 .60 kcal/mol and 131.68 kcal/mol for the HF/6 31G*, M06 L/6 31G* and PM6 DH2 levels of theory, respectively as shown in Table 3 1 Next, the correction terms n E ij ... n ( ) # given by Equation 3 8 were obtained. The large jump in the number of the e nergy values needed to generate the total many body correction terms set t he limit to where we truncated Equation 3 6 The limit is mostly forced by the computational expense of the various methods. Table 3 2 shows the individual total correction terms fo r the three levels of theory, HF, M06 L and PM6 DH2. The first point to note about these total correction terms is that as n increases, the absolute value of the total correction term associated with n n E ij ... n ( ) # decreases. That is, the contr ibution of the total correction terms associated with that particular n to E bind diminishes as n rises. For HF, it dropped from $ 26.86 $ kcal/mol to $ 4.24 $ kcal/mol, which corresponds to a fall of 84.2%. Thus, the three body total correction produces 18.7 % of the E bind At the M06 L level, this decline

PAGE 69

69 is much sharper: the total two body correction term was $ 111.49 $ kcal/mol, while the total three body correction equaled $ 4.45 $ kcal/mol, which yielded a reduction of 96.0%. The 3 E ijk ( ) # correction contributed 3.84% to the overall E bind The semiempirical PM6 DH2 level was found to show an even sharper drop of 98.6%, from $ 132.86 $ kcal/mol to $ 1.92 $ kcal/mol. The contribution of the three body correction terms to E bind was 1. 46%. Due to the speed of semiempirical methods higher order correction terms up to n =5 were included in the computed E bind for PM6 DH2 and were found to make small contributions. The four body corrections add up to 0.76 kcal/mol, which corresponds to 0. 58% of the total E bind and 0.57% of the total two body correction. The total five body correction term is 0.08 kcal/mol, or 0.06% of the total E bind and 0.06% of the overall two body correction. This equality of percentages in the presence or absence of the higher order correction terms n E ij ... n ( ) # with n >2 demonstrates that their effect on the energetics within the active site of the HIV II protease bound to Indinavir is negligible. In other words, the energetics of this model system is additi ve. In contrast to the present cluster, Xantheas observed a non additivity of the order of 30% for small water clusters 19 This find ing suggested that pairwise additive potentials ( e.g. MCY 48 TIP3P 49 etc. ) of water may be less accurate than previously thought (previously it was claimed to be ~10% 50 ). In our case, the pairwise additivity corresponds to 86.0% of the total E bind at the HF level, to 99.9% at the M06 L level and to 99.1% at th e PM6 DH2 level of theory where the numbers were calculated by the following formula: 1 # 2 E ( ij ) E bind $ E bind % & ' ( ) * + 100% (3 9)

PAGE 70

70 Thus, in the case of the present protein ligand system pairwise additivity appears to be a realistic model. Many body Interactions Havin g analyzed the cumulative effect of the total n body correction terms with n % 2, let us turn our attention to a deeper analysis of the individual multi body correction terms. The distributions of individual correction terms for both ligand binding pocket in teractions and for the interactions among the fragments forming the binding pocket were examined. First of all, they reveal again an n dependent behavior similar to the magnitudes of the total correction terms. As n rises, the number of terms included in t he expansion jumps very rapidly, as mentioned previously and thus the populations of the distributions increase very rapidly as well. However, as one goes to higher n the distribution gets consistently narrower. At all three levels of theory, for the lig and binding pocket interactions, the majority of the eighteen two body corrections have mag nitudes falling into the [0.00: 20.00] kcal/mol range. This range shrinks for all the methods when the three body corrections to the ligand binding pocket interaction s are considered. At the M06 L level, the majority of the terms lie in the [ 0.52: 0.16] kcal/mol range at 80% confidence level, while at the HF level this range is [ 0.34: 0.23] kcal/mol. The PM6 DH2 level had the narrowest distribution for these interacti ons with a range of [ 0.25: 0.13] kcal/mol. For higher order n body corrections to the ligand binding pocket interactions, where n equals 4 and 5, only the semiempirical PM6 DH2 results are available for assessment. For the four body corrections to the liga nd binding pocket the range is [ 0.007: 0.004] kcal/mol, while the range for the five body corrections is [ 0.0002: 0.0003] kcal/mol. Thus, even with the large number of terms for n >3 the sum

PAGE 71

71 does not yield a significant impact on the total energy E bind Fi gure 3 4 Figure 3 5, and Figure 3 6 summarize these observations. When we expand the examination of these ranges over the entire collection of interactions within the binding pocket, that is, over all the interactions among the eighteen fragments making up the binding pocket and those between the ligand and the binding pocket, the same observation is made: As n rises, the range for the magnitudes of the interaction energy corrections shrinks. By looking at the distributions shown in Figure 3 7 we see the majority of the pairwise interactions, that is for n =2, fall into the range of [ 1.00: 1.00] kcal/mol for the HF, M06 L and PM6 DH2 levels of theory. It should be kept in mind that the interactions between the binding pocket fragments cancel out in the ove rall energy decomposition scheme upon subtraction of the third ro w from the first in Equation 3 6 when considering a particular n th total correction term, n E ij ... n ( ) # However, these interactions between the binding pocket fragments survive in t he subsequent higher order correction terms and thus, their magnitudes affect the value of E bind As we have seen, the energy ranges including the majority of the n body corrections for n =2 5 of all the interactions arising within Indinavir binding pocke t complex narrow down as n increases. For n =2, the range of the magnitudes containing most of the interactions is [ 0.00 :1.00] kcal/mol as shown in Figure 3 7 while this range loses one order of magnitude for n =3 as in Figure 3 8). For n =4 and n =5, the cor responding range decreases one and three orders of magnitude further relative to the three body correction terms, respectively. Hence, there is an overall decline in the magnitudes of the interactions as more "parties" are involved in one particular

PAGE 72

72 intera ction. To reframe the picture, in spite of the higher number of interaction terms involving increasing number of parties, the cumulative energy correction arising from the entire group of interactions among n bodies shows such a strong decay that at cases for n % 4, it reaches below the limits of accurate computation. Having discussed the limits of accuracy, it is necessary to acknowledge the uncertainty within the consecutiv e correction terms. Revisiting Equation 3 7 the ( n +1) th total correction term of th e energy decomposit ion scheme given in Equation 3 6 n + 1 E ijk .. ( ) # includes the lower n th ( n 1) th ( n 2) th % and eventually first order correction terms. At this point, any n th order correction term with n >1 would have some level of uncertainty The number of lower order terms contributing to the n th order correction term rises with increasing n which results in accumulation of uncertainty within the n body total correction term. Hence, although the decreasing magnitudes of the n th order correc tions with increasing n has been confirmed by three different levels of theory, we do not claim that the numbers presented for higher n body corrections are free of uncertainty. What is certain here is that the contribution of corrections to the total ener gy expansion yielding E bind decreases as one includes higher terms in the expansion given in Equation 3 6 Although it was found that the pairwise interactions represent the full interaction energy E bind to 86.0% at the HF level, to 99.9% at the M06 L lev el and to 99.1% at the PM6 DH2 level of theory, the slight contribution of the higher order corrections was another point of interest. The cumulative impact of the three body corrections was very little. However, the individual 3 E ijk ( ) value s associated with some of the receptor fragments and the ligand had magnitudes of several kcal/mol. In order to understand

PAGE 73

73 whether the magnitudes of these individual three body corrections could be categorized according to the proximity of the two receptor fragments interacting with the ligand, the distance between the nearest atoms of the non ligand fragments has been plotted against the corresponding three body correction at the M06 L level of theory as shown in Figure 3 9. These data reveal that the sign ificant three body corrections stem from interactions involving polarization of the ligand or the surrounding receptor fragments. The red data points symbolize the charged fragments, whereas the blue data points stand for the two water fragments and the bl ack data points represent the remaining peptide and hydrocarbon fragments. The charged and hydrogen bonding fragments significantly polarize their environment including both the ligand and the non ligand parties involved in the three body interaction. Each polarizing entity affects the electron densities on the remaining parties involved in that particular interaction which evokes a larger magnitude for the individual three body correction. At distances <4 , the impacts of polarization are stronger due to the proximity of the receptor fragments. Only the charged fragments are capable of polarizing the remaining interacting parties over greater distances (>4 ) and influence the interaction to result in a considerable three body correction. Taken altogether we conclude that to a good approximation, the protein ligand complex studied herein behaves additively. This observation supports the use of pairwise additive force fields and enthalpy computation in fragment based drug design. Most force fields define n on bonded interactions as a double summation, which can be decomposed into van der Waals and electrostatic energies over all interacting atom pairs 4 51 This pairwise treatment of non bonded interactions in protein ligand

PAGE 74

74 complexes is supported by the enormous drop of contributions from the n body correction terms, where n >2, at multiple levels of theory. Se condly, as mentioned in the introduction, the observed additivity for the ligand binding pocket energetics provides a powerful approximation for the potential additivity of the binding enthalpies of the fragments when they are unified into a larger molecul e. This observed additivity has another significant application for the improvement of scoring algorithms based on energy. Faver et al. have suggested that systematic errors propagate as a simple sum of the errors contained within the individual interacti ons associated with each fragment pair. 31 52 Thus, if the sum of the energies of the fragment pairs approxima tes the total binding energy well enough, then the error propagation formula of Error Systematic = Err 1 + Err 2 + Err 3 + ... represents the total systematic error to a considerable accuracy. This finding confirms the proposed idea of accounting for the systematic errors in a ph ysics based score function by constructing a reference library comprised of numerous unique interacting fragments and developing accurate error probability density functions based on those interaction libraries. The effective application of a systematic er ror correction scheme is facilitated through additivity with respect to the fragment energies. Conclusions In this chapter the goal was to show that additivity principles are applicable to electronic energies of fragments making up a larger molecular entit y. We employed an energy expansion, which decomposes protein ligand interaction energies into n terms where each term designates the contributions originating from m interacting fragments with m & n In our scenario involving the HIV II protease active site complexed to the

PAGE 75

75 inhibitor Indinavir, m ranged from 2 to 5. The reliability of this decomposition scheme was confirmed at the HF, M06 L and PM6 DH2 levels of theory. For all three levels, inclusion of higher order terms with m >2 brought the interaction ene rgy expansion closer to E bind the exact binding energy of the system, which was obtained by subtraction of energies of the ligand and the binding pocket from that of the full complex. Additivity is theoretically and experimentally unsupported in the liter ature with regards to free energies 27 29 However, for enthalpies and electronic energies, additivity is largely supported in the case studied herein. The narrowin g of the distributions for higher order interactions (for m >2) supports additivity for energies and enthalpies. The energy ranges encompassing the m th order corrections become consistently narrower with increasing m Moreover, the ranges get exceedingly na rrow such that they cannot be accurately detected with the present computational methods. Although the number of correction terms summing up to the total m body correction term increases very rapidly and amounts to thousands at m =4, the magnitudes of these corrections are so small, that they do not accumulate to significant values. In other words, the quick decrease in the magnitudes of the correction terms with rising m overwhelms the increase in the number of single correction terms to be added to yield t he total m body correction term for a particular m This observation is advantageous from a computational perspective, since the thousands of terms in the m > 3 corrections may safely be neglected in the energy expansion for analogous systems. From the pr esent chapter we can arrive at three main conclusions. Firstly, many force fields evaluate non bonded interactions in a pairwise fashion for protein ligand systems. Now, having confirmed that the two body corrections to the overall energy

PAGE 76

76 expansion constit ute more than 95% of the total protein ligand interaction energy, the use of pairwise potentials in drug design applications is supported. Secondly, our results place fragment based drug design on a firmer footing especially with regards to interaction ene rgy computation. Finally, if the additive model provides a good approximation for the total binding energy of the ligand to the receptor, then the systematic errors in each of the receptor ligand fragment interactions accumulate as a simple sum, which accu rately reveals the total systematic error for the whole binding eve nt. Thus, the observed additivity of fragment energies supports the idea of the post hoc correction of systematic errors in physics based score functions.

PAGE 77

77 Table 3 1 E bind E bind ', and the total correction terms for various orders of correction estima ted using M06 L, HF and PM6 DH2 For HF and M06 L the calculations were completed up to the 3 rd degree ( n =3), while with PM6 DH2, terms up to the 5 th degree were ( n =5) achiev ed. All energy value entries are in kcal/mol. Level of Theory Total Interaction Energy for the Order n E bind upto n (kcal/mol) E bind n =2 n =3 n =4 n =5 n =3 n =5 HF/6 31G* 26.86 4.24 22.62 23.57 M06 L/6 31G* 111.49 4.45 115.94 111.60 PM6 DH2 132.86 1.92 0.76 0.08 130.94 131.62 131.68

PAGE 78

78 Figure 3 1. A representative double mutant cycle involving protein (P) ligand (L) interactions. X and Y stand for two residues, which are mutated individually and together in the wild type protein (P XY ). The free energy change associated with each mutation is inserted into the following equations in order to obtain the free energy of coupling, ! G int : ! G int = G P XY P X G P Y P = G P XY P Y G P X P Y X L P G P-XY P-Y G PY P G P-X P G P-XY P-X Y L P X L P L P

PAGE 79

79 Figure 3 2. Visualization o f the components of Equation 3 5. Upper panel shows the whole system with all the 18 fragments of the binding pocket and the Indinavir ligand. The middle panel displays the Indinavir ligand with the "ghosted" binding pocket. The lower panel shows the entire binding pocket with Indinavir in the "ghosted" state. !"#$%&'()*+*(,*+-.*/0$#'1 /0$#'1 2* !"#$%&'()*+*(,*+-*3,')(0(4(0'$*(5&* 60'10'$*7,38&(*

PAGE 80

80 Figure 3 3. Visualization of the components summing up to to tal correction terms given in Equation 3 6 The panels display the ligand and the particular fragments' nuclei participating in the calculations. Transp arent atoms are the "ghosted" ones, while the solid atoms are the ones whose nuclei take part in that particular calculation.

PAGE 81

81 Figure 3 4. The distribution of the 2 body corrections to the ligand binding pocket interaction energies at the Black) HF lev el. Red) M06 L level. Blue) PM6 DH2 level. The y axis denotes the density for all three plots (the AUC=1). Most pairwise interactions fall into a range of [ 5.06: 2.70 ] kcal /mol for HF, [ 11.10: 0.31 ] kcal/mol for M06 L and [ 16.00: 1.08] kcal/mol for PM6 DH2 level.

PAGE 82

82 Figure 3 5. The distribution of the three body correction terms to the ligand binding pocket interactions at M06 L, HF and PM6 DH2 levels are given in panels (red), (black) and (blue), respectively. The inset shows the peak region in more detail, where the y axis is the density (the AUC=1). !"!#" "!#"#! "#$%&'()*'+ $%&'()* ,-. ,-/,-,,-/,-. ,01,10

PAGE 83

83 Figure 3 6. The distributions of the n body terms for n =3 (left), n =4 (middle) and n =5 (right) at the PM6 DH2 level. These distributions apply to ligand receptor interactions. !"# #"$#"##"$!"# #$!#!$ 3 E(kcal/mol) %&'()*+ #"#,#"###"#-#"#, #-##,##.##/##!### 4 E(kcal/mol) #"##,#"####"##, #!###-###0###,### 5 E(kcal/mol)

PAGE 84

84 Figure 3 7. Th e distributions of the pairwise corrections to all the interactions present within the Indinavir binding pocket complex at the (black) HF level, (red) M06 L level and (blue) PM6 DH2 level. One interaction lies around 110 kcal/mol at both HF and M06 L leve ls, which involved adjacent opposite charges and has been omitted for the purpose of visual clarity. The detailed peak region is shown in the inset. !" #""#"!" "$#%!& "#$%&'()*'+ '()*+,, -.-!, .-!,/0

PAGE 85

85 Figure 3 8. The distributions of n body correction terms for n =3 (left), n =4 (middle) and n =5 (right) at the PM6 DH2 level of theory. These correction terms apply to all the interactions involving 3 to 5 parties out of the members of the Indinavir binding pocket complex system, including all the 18 fragments constituting the binding pocket and the Indina vir ligand. !"# #"$#"##"$!"# #$#!##!$#%##%$#&## 3 E(kcal/mol) '()*+,#"#.#"###"#%#"#. #%###.###/###0###!#### 4 E(kcal/mol) #"##.#"####"##. #$###!####!$###%#### 5 E(kcal/mol)

PAGE 86

86 Figure 3 9. Magnitude of individual three body corrections vs. the distance betwe en the nearest atoms of the two receptor fragments involved in the three body interaction. Interactions involving charged fragments are color coded with red water fragments with blue and other fragments with black.

PAGE 87

87 CHAPTER 4 ESTIMATION OF FREE ENERGY OF BINDING FOR PROTEIN LIGAND COMPLEXES CONCOMITANT WITH ERROR ESTIMATION Background Determination of binding affinities for protein ligand complexes present s one of the most complicated and attractive problems of computational chemistry. 14 53 The simplest methods attempting to suggest a solution to this problem, e.g docking, mostly rely on end point methods which estimate energies of single static complex structures. 54 They usually neglect factors like receptor flexibility, ligand strain upon binding, as well as various entropy effects. 55 58 Some sampling is added on top of t his strategy in methods like Molecular Mechanics Poisson Boltzmann/Surface Area (MM PBSA) and Molecular Mechanics Generalized Born/Surface Area (MM GBSA) which evaluate the absolute free energies of the protein and the ligand before and after binding. 59 62 Both docking and the latter methods decompose the free energy into enthalpy and entropy contributions. In MM PBSA and MM GBSA the enthalpies are obtained from the average energies from the molecular mechanics trajectories, which when combined with an entropy and a solvation free energy estimate using a continuum solvent model results in the final free energy estimate. 59 63 In order to predict the free energy of binding accurately with this protocol, the first requirement is to correctly predict the absolute free e nergies of the bound and unbound species. However, these are typically large quantities and we are interested in the small differences between them, so even a small error (percentage wise) in their prediction might have a significant impact on the free ene rgy difference &G. Hence, force field accuracy is very important in this exercise. Reprinted with permission from Ucisik, M. N.; Zheng, Z.; Faver, J. C.; Merz, K. M. J. Chem. Theory Comput. 2014 10 1314. Copyright 2014 American Chemical Society.

PAGE 88

88 Another tool to calculate relative free energies of binding is alchemical free energy calculations, which modify a molecular entity into another via non physical ("alchemic al") pathways. Via thermodynamic cycles, incorporating these alchemical transformations, free energy changes of physical processes can be evaluated. 64 68 It avo ids decomposition of free energy change into individual thermodynamic terms by utilizing the ratio of the partition functions of the involved species, which eliminates the need to evaluate enthalpy and entropy contributions explicitly. One issue associated with alchemical calculations is obtaining enough sampling to generate a sufficient number of uncorrelated configurations and this is done using force fields coupled with molecular dynamics (MD) or Monte Carlo (MC) sampling. 67 Thus, the sampling would always be based on the biases inherent in the force field employed. Moreover, to collect enough uncorrelated configurations to feed into the partition functions would be costly, especially for larger biological systems. 67 Ending up with biased results is also possible with insufficient sampling when computed free energies depend on the choice of the initial receptor or ligand structure. 69 71 The Mining Minima algorithm is worth noting here because it ties many aspects of the aforementioned free energy methods together. The attempt to systematically search for multiple local potential energy wells makes it unique among o ther end point methods. Moreover, it computes the free energies of binding directly from the configuration integral contributions of the sampled local minima. However, just like the MM PBSA and MM GBSA protocols, it estimates the absolute free energies of the protein, ligand and their complex, which runs the risk of introducing errors when the difference between large numbers is taken. 72 73

PAGE 89

89 Recently, various types of restrained or unrestrained MD 74 79 and potential of mean force simulations 80 82 have appeared where observed ligand binding or ligand removal events 74 76 provide another avenue to study protein ligand binding. These simulations are very expensive and are subject to model uncertainties, but do provide unique insight into binding events. Regardless of their sophistication and pra cticability levels, all of these methods come with their intrinsic errors, which can be classified as systematic and random. Systematic errors in any measured or calculated quantity are rather easy to handle: They shift the result to a certain direction an d, thus, are correctable. The random errors, on the other hand, may impact the determined energies in both fashions, up or down, and there is no way to correct for them in a post hoc manner. 31 We have shown that one way to reduce the amount of this type of error is to include multiple microstates in the calculations. In other words, local sampling of the potential ene rgy surface of interest decreases the random errors statistically. 52 In this work we propose a way to calculate the binding free energy of protein ligand comple xes directly from microstate energies utilizing a ratio of configuration integrals and we then assess the uncertainty of our estimates with previously derived error propagation formulas. 31 52 We predict the binding affinities directly from statistical mechanical definitions without splitting the free energy into enthalpic and entropic terms. This presents a huge advantage in terms of accuracy and uncertainty evaluation because introducing more terms into these calculations tends to propagate the errors originating from each component and understanding how these component errors do indeed propagate becomes a much m ore complicated task. Additionally, methods to

PAGE 90

90 directly calculate entropy are very challenging for a number of reasons. 83 85 Moreover, commonly used entropy ca lculation methods like normal mode analysis utilized in methods like MM PBSA, and MM GBSA rely on minimized snapshots from trajectories, which also introduce further uncertainties. 84 Estimating the free energy of binding directly from microstate energies provides a straightforward way to estimate uncertainties in free energies directly from the computed energies. We understand how to propagate the errors contained in microstate energies which allows us to directly determine the systematic and random errors associated with the calculated free energy of binding. 31 52 85 With these aspects of uncertainty determination in mind, we aimed to create an unbiased ensemble of structures involved in modeling protein ligand binding ev ents and explored the effects of introducing such an ensemble on binding affinity prediction. Similar to many other studies aiming to calculate ligand binding free energies to protein receptors 71 86 90 we apply our protocol to predict the binding affinities and the associated uncertainties to the experimentally well characterized, engineered T4 Lysozyme L99A s ystem. 91 92 We are not the first ones to use this series of T4Lysozyme L99A inhibitors to explore the power of binding free energy determination methods. It was subject to several earlier docking and free energy perturbation (FEP) molecular dynamics (MD) studies. 64 71 86 88 93 94 Moreover, an exhaustive docking study by Purisima et al. was recently conducted on this system using a very similar systematic ligand perturbation method to construct the protein ligand complex and free ligand ensembles. 90 The authors employed continuum solvent models that they developed and this sets a limitation on the applicabilit y and reproducibility of their protocol by others.

PAGE 91

91 Our work is novel in that it is the first study to offer error bars for binding free energy predictions, which are also corrected for their systematic errors. It uses the conventional force fields ff99SB 39 and ff94 4 in the AMBER12 95 package, and the semiempirical PM6 DH2 35 37 method in MOPAC2012 96 in conjunction with the readily available continuum solvent models Generalized Born (GB) 97 Conductor like Screening Model (COSMO) 98 and SMD 99 for scoring purposes which enhances the extensibility of the present approach. Methods Molecular System: L99A Mutant of T4 Lyso zyme We calculated the binding free energies and their associated uncertainties on a series of T4 Lysozyme L99A inhibitors. 91 92 T4 Lysozyme L99A is a very convenient system for this type of study due to the hydrophobic, entirely closed binding pocket facilitated by the L99A mutation. This isolated cavity accommodates a number of small hydrophobic molecules wit h experimentally measured binding affinities. Known experimental binding free energies allow us to judge the accuracy of our estimation which in turn makes it possible for us to improve our methodology and workflow. Additionally, dealing with a completely closed, hydrophobic pocket mitigates some computational artifacts, which could emerge from computing solvation effects. The protein ligand complexes employed have the Protein Data Bank (PDB) ID's 181L, 182L, 183L, 184L, 185L, 186L, 187L and 188L and these contain benzene, 2,3 benzofuran, indene, i butyl benzene, indole, n butyl benzene, p xylene, and o xylene as ligands, respectively. 92

PAGE 92

92 Ensemble Creation with the "Blur" Program After downloading the protein li gand complex structures from the PDB, we separated each complex into their ligand and protein parts. The protonation states for each protein chain were obtained with the web server H++ at neutral pH. 100 To prepare the binding pocket for each of these complexes, the amino acid residues within 5  of the ligand in its native PDB complex pose were relaxed using the ff99SB force field in the AMBER12 package while weak positional restraints of 10 kcal/mol* 2 were applied on the rest of the structure. The relaxation protocol consisted of 25,000 steps of steepest descent minimization 3 in the gas phase and was performed in the absence of the ligand to prevent any structural bias which would favor certai n poses over others. These minimized protein structures were kept rigid in the remainder of the analysis. The ligand structures, on the other hand, were optimized at the density functional theory (DFT) level using the M06L functional 30 in conjunction with the Dunning type aug cc pVDZ basis set. 101 The optimized geometries were utilized to obtain AM1 BCC partial charges and were used in the creation of the Generalized Amber Force Field (GAFF) parameters 5 6 102 103 wh ich were employed in conjunction with b oth ff99SB and ff94 force fields during scoring. The optimized ligand structures were docked into their corresponding minimized binding pocket and their positions in the pocket were optimized using the ff99SB force field with the generalized Born (GB) sol vent model. The ligands were then systematically translated and rotated using an in house code. This process of systematically moving the ligand consisted of rotations of the whole ligand about its center of mass, rotating rotatable bonds within the ligand by 15¡ increments, and translating the ligand's center of mass on an imaginary grid placed in the binding pocket

PAGE 93

93 with a spacing of 0.5  We termed this process "blurring". If the perturbation yielded a new, chemically meaningful pose that does not place the ligand atoms on top of the receptor atoms or induce other significant clashes, it was then appended to the ensemble of protein ligand geometries. The free ligand structures, i.e the unbound ligands, were also "blurred" by having their functional group s rotated incrementally, just as in complex structures and this lead to an unbiased ensemble for the ligands as well. For strictly aromatic ligands with no rotatable bonds only one single ligand pose made up the free ligand ensemble. Also, bond to methyl g roups were not treated as rotatable bonds. In this way, we ended up with the following: an ensemble of protein ligand complexes, an unbiased "ensemble" of free ligands, and a rigid protein. Energy Scoring and Error Estimation The energies of each protein complex pose, each free ligand pose, and each rigid protein geometry were calculated utilizing several protocols: (1) with the ff99SB force field coupled with the GB implicit solvent model, (2) with the ff94 force field and GB implicit solvent model, (3) with the ff99SB force field in the gas phase, (4) with the PM6 DH2 semiempirical method in conjunction with the COSMO solvent model, (5) with the PM6 DH2 semiempirical method in the gas phase. Molecular mechanics scoring was carried out with the AMBER12 su ite of programs while the PM6 DH2 calculations were done with the MOPAC2012 package. The uncertainty and accuracy assessment for a particular free energy of binding relied on determining the uncertainty and accuracy of individual interaction energies assoc iated with each protein ligand complex pose making up that particular protein ligand complex ensemble. The polar and nonpolar interactions between the ligand and the binding pocket in every pose were counted and the systematic and random errors

PAGE 94

94 per interac tion were collected from the Biomolecular Fragment Database (BFDb). 40 We used the "hsg" dataset and acquired the systematic and random errors with respect to the "gold standard" CCSD(T)/CBS level of theory. 31 104 105 Then these individual errors were propagated to yield the cumulative error in the binding free en ergy as described in reference 5 2. 52 Theory The free energy of binding can be obtained directly with statistical mechanics without decomposing it into separate enthalpic and entropic terms. This is an advantage because it eliminate s the need to estimate the errors introduced by individual enthalpic and entropic contributions, some of whic h are chal lenging to determine. Our algorithm samples from a canonical ensemble in which the volume, temperature, and number and composition of the atoms in the system remain constant in both the bound and unbound states of the protein and the ligand system The partition function, which depends on positions and momenta of the particles in the system, is separable into a product of two separate integ rals, one over pos i tions and one over momenta as the velocities of the particles are independent of the potential energy function. The integral over momenta can be calculated analytically and the result depends only on the particle masses, the numbers of the particles, and temperautre of the system As these variables do not change upon transitioning from the unbound state to the bound state, the integrals over momenta cancel in the ratio of partition functions of these two states 3 Hence, we used the ratio of the configuration al integrals of the protein ligand complex ( PL ), ligand ( L ) and protein ( P ) to define the free energy of the protein ligand binding, G bind :

PAGE 95

95 ( 4 1) For the configuration al integral affiliated with the PL complex, we assumed that the entire set of degrees of freedom (DOF) could be classified into six groups representing various physical contributions: (a) Rigid translations ( RT ) of the PL complex: the entire complex could translate while keeping constant internal and rotational DOFs, (b) Rigid rotations (RR) of the PL comp lex: the entire complex could rotate while keeping a constant position of center of mass and internal DOFs, (c) Rigid docking translations (RDT): the ligand could translate while the ligand's rotational and internal DOFs were constant and protein's DOFs we re fixed, (d) Rigid docking rotations (RDR): the ligand could rotate while its center of mass, internal DOFs and protein's DOFs were constant, (e) Internal protein DOFs ( IP ), and (f) Internal ligand DOFs ( IL ). In this expansion of DOFs, RT and RR deal with DOFs specific to the entire complex. RDT and RDR handle DOFs regarding the positioning of the ligand relative to the protein. IP and IL span the remaining DOFs and are specific to the protein and ligand DOFs, respectively. Hence, the total configuration s pace of the PL complex should be spanned by these six classes of DOFs, which when combined form the following integral: ( 4 2) Similarly, the configurational integral associated with the ligand was assumed to span (a) RT of the free ligand: it coul d translate while keeping constant internal and rotational DOFs, (b) RR of the free ligand: it could rotate while keeping a constant position of G b i n d = R T l n e E P L ( r ) d r # e E L ( r ) d r # ( ) e E P ( r ) d r # ( ) $ % & & ( ) ) e ! E P L ( r ) d r = e ! E P L d r I L b o u n d I P b o u n d R D R R D T R R R T

PAGE 96

96 center of mass and internal DOFs, and (c) IL : internal DOFs of the free ligand. This yields: ( 4 3) Likewi se, three classes were identified to collect the protein's DOFs: (a) RT of the unbound protein: it could translate while keeping constant internal and rotational DOFs, (b) RR of the unbound protein: it could rotate while keeping a constant position of cent er of mass and internal DOFs, and (c) IP : internal DOFs of the unbound protein which results in: ( 4 4) In all of these integrals RT and RR do not affect the energy, so the energy is constant with respect to them. We can evaluate the integrals as the volume of phase space associated with these DOFs times the value of the remainder of the integral. The PL complex, protein, and ligand can translate in a cubic box with the length equal to the average distance to another entity of their kind in a homogenou s solution which allows us to assume a inverse concentration as the value of the RT phase space volume. These integrals are thus reduced to: ( 4 5) for the PL complex, ( 4 6) for the free ligand, and ( 4 7) e ! E L ( r ) d r = e ! E L d r I L R R L R T L e ! E P ( r ) d r = e ! E P d r I P R R P R T P e ! E P L ( r ) d r = 1 C e ! E P L d r I L b o u n d I P b o u n d R D R R D T R R e ! E L ( r ) d r = 1 C e ! E L d r I L R R L e ! E P ( r ) d r = 1 C e ! E P d r I P R R P

PAGE 97

97 for the unbound protein, respect ively. If we placed an arbitrary unit vector on the center of mass of each of these species, this unit vector could rotate to point anywhere on a unit sphere with a surface area of 4". Each of these possible vectors could serve as a rotational axis about which the molecule could rotate 2". So the volume of phase space for RR terms is 8" 2 which transforms the above integrals into ( 4 8) for the PL complex, ( 4 9) for the free ligand, and ( 4 10) for the unbound protein, respectively. We a lso assumed the protein's internal DOFs would be constant as dictated by the rigid receptor approximation. The IP bound piece would give the volume of phase space V P enclosing the internal DOFs affiliated with the protein within the PL complex times the re mainder of the integral: ( 4 11) Monte Carlo (MC) integration was applied to the remaining DOFs for RDT RDR and IL in the bound form. The volume of phase space comprising the RDT DOFs was the binding pocket volume in which the ligand's center o f mass could translate: V p ocket The RDR DOFs span a volume of 8 2 with the same reasoning described above for the RR term. Finally, the volume of phase space containing the internal DOFs of the ligand e ! E P L ( r ) d r = 8 2 C e ! E P L d r I L b o u n d I P b o u n d R D R R D T e ! E L ( r ) d r = 8 2 C e ! E L d r I L e ! E P ( r ) d r = 8 2 C e ! E P d r I P e ! E P L ( r ) d r = 8 2 C V P e ! E P L d r I L b o u n d R D R R D T

PAGE 98

98 in bound form was V L Here, we emphasize that V L is n ot an actual volumetric entity but an abstract quantity, which expresses the phase space occupied by the internal ligand DOFs. Thus: ( 4 12) The pocket volume V pocket was obtained by evaluating the smallest volume, which would fit the superimposition of the centers of mass of all the ligand conformations generated by the blur code within the binding pocket. Applying MC integration to E qua tion 4 9 leads to the following final expression for the configuration integral of the free ligand: ( 4 13) The same operation on Equation 4 10 results in ( 4 14) for the unbound protein. Since we employed a single, static protein structure in our calculations, there was no need for an average: ( 4 15) By inserting Equation 4 13, 4 14, and 4 15 into Equ ation 4 1 we obtain: ( 4 16) which reduces to our final free energy of binding expression: e ! E P L ( r ) d r = ( 8 2 ) 2 C V p V p o c k e t V L e ! E P L ( r ) R D T R D R I L b o u n d e ! E L ( r ) d r = 8 2 C V L e ! E L ( r ) I L e ! E P ( r ) d r = 8 2 C V P e ! E P ( r ) I P e ! E P ( r ) d r = 8 2 C V P e ! E P ( r ) G b i n d = R T l n ( 8 2 ) 2 C V p V p o c k e t V L e " E P L ( r ) R D T R D R I L b o u n d 8 2 C V L e " E L ( r ) I L # $ % & ( 8 2 C V p e " E P ( r ) # $ % & ( ) + + + + . .

PAGE 99

99 ( 4 17) In this expression, we sample over RDT RDR and IL DOF's of the PL complex and the IL DOF's of the free ligand. Then using the conventional standar d deviation formula 106 the sa mpling error associated with this free energy of binding expression would be: ( 4 18) The single point energies calculated for each pose making up the PL complex and free ligand ensembles contained both systematic and random errors as proposed earlier by Faver et al 31 107 These individual errors would accumulate in the final free energy of binding estimati on and reduce the reliability of our results Faver et al. attacked this problem by first classifying and quantifying the molecular interactions present between the protein binding pocket and the ligand, then assessing the individual fragment based errors with a probability density function built with a reference database of molecular fragment interactions, and finally propagating these errors as E r r o r S y s t e m a t i c = N k k k E r r o r R a n d o m = N k k 2 k ( 4 19) where k stands for different interaction types in the reference database ( e.g. polar and nonpolar), N k is the associated interaction count, k and k 2 represent the mean error per interaction and variance about the mean error for interaction type k in the database. A generic function f(x i ) has a systematic error which could be, in its simplest model, evaluated as G b i n d = R T l n V p o c k e t C e E P L ( r ) R D T R D R I L b o u n d e E L ( r ) I L e E P ( r ) ( ) # $ % % & ( ( 8 2 V p o c k e t V L ( ) 2 e 2 E P L ( r ) e E P L ( r ) 2 N P L + V L ( ) 2 e 2 E L ( r ) e E L ( r ) 2 N L # $ $ % & ' 1 2

PAGE 100

100 ( 4 20) and a random e rror which could be estimated as ( 4 21) where # x i expresses the uncertainty in the input variables. Faver et al. applied these operations in E quations 4 20 and 4 21 to the configuration integral Z and determined the total error of the free energies t o be given as: G b i n d = E i S y s e E i i Z e E i Z E i R a n d # $ % & ( 2 i ( 4 22) where Z is the configuration integral, E i stands for the microstate energies (here: energies of distinct PL poses), E i Sys represents the systematic error within each microstate energy, a nd E i Rand designates the random error contai ned in each microstate energy. This can also be written as ( 4 23) where P i is the normalized weight of the microstate i That is, the first part of this expression provides the systematic error in free energ y of binding as the Boltzmann weighted average of the systematic error of each microstate, while the second part gives the accompanying random error in form of a Pythagorean sum of the weighted random errors of each microstate. Since systematic errors shi ft the results only in one, known direction, they can be corrected in a post hoc manner once their magnitudes are determined. Random errors, on the other hand, cause alterations in both directions, which are not predictable. This f = f x i x i i f = f x i x i # $ % & 2 i ( G b i n d = P i E i S y s i P i E i R a n d ( ) 2 i

PAGE 101

101 eliminates the possibility of a post hoc correction. The formula shown in Equation 4 23 however, suggests that if multiple microstates were considered in the binding free energy estimate, the second part related to the random errors would decay significantly because of the probabi lity P i within the Pythagorean sum. The impact of inclusion of numerous microstates on the cumulative systematic error is not as big as on the cumulative random error since the P i values would add up to one eventually in the summation of the first part whe reas the random error part would always get a coefficient which is less than one due to the Pythagorean sum. This was demonstrated with several thought experiments: The higher the number of states included, the less the random error encountered. 52 Our "blur" code accomplishes an exhaustive local sampling of the ligand within the protein binding pocket, in other words it creates the largest plausible PL complex en semble given the input criteria, so that the uncertainty in the estimated free energy of binding is minimized as much as possible. As stated earlier, systematic errors can be accounted for completely assuming our reference "gold standards" are perfectly ac curate. Hence, each pose making up the PL complex ensemble was analyzed for their protein ligand interactions. Due to the strictly nonpolar nature of the binding pocket of interest, all the interactions detected were of van der Waals type. Then utilizing t he Biomolecular Fragment Database, the systematic and random errors present in the energy of each PL complex pose were estimated. Systematic errors were removed from these microstate energies and then the corrected PL complex energies were used to estimate the free energy of binding in the absence of systematic errors. The uncertainty of a particular binding free energy was computed as follows: once an error bar was

PAGE 102

102 obtained for each pose making up the PL complex ensemble, a random value of error was select ed from the associated error distribution and added to the pose's energy which was already corrected for the systematic errors. These microstate energies were inserted into the final bindin g free energy formula shown in Equation 4 17 to yield an estimate o f the binding free energy. A distribution of binding free energies was acquired by calculating the free energy in this manner 10,000 times. Its standard deviation was used to measure the imprecision in the computed free energy due to microstate energy unce rtainties. Results Binding Affinity Estimates The protocol of exhaustively sampling the ligand configurations within and without the PL complex, i.e. "blurring", calculating the free energy of binding for the system of interest dir ectly from E quation 4 17 and estimating the uncertainty contained in the calculated binding free energy was tested on a congeneric series of eight T4Lysozyme L99A inhibitors: benzene, 2,3 benzofuran, indene, i butyl benzene, indole, n butyl benzene, p xylene, and o xylene with t he PDB ID's of 181L, 182L, 183L, 184L, 185L, 186L, 187L ,and 188L, respectively. A sample blurring process involving n butyl benzene (186L) is represented in Figure 4 1 using the Visual Molecular Dynamics (VMD) program. 46 The "blurred" ensembles were checked for duplicate poses in order to prevent double counting, which alters the final binding affinity estimates. Assuming unique poses would have distinct energies, recurring energy scores were filtered out from the PL ensemble scores along with their error estimations. Thus, only structurally unique poses were retained in the ensemble.

PAGE 103

103 One disadvantage of this set is that the measured experimental binding affinities span a very narrow range of 2.1 kcal/mol and this is beyond the accuracy levels of most computational methods. 91 Table 4 1 reports free energy of binding estimates obtained from force field scoring in implicit GB solvent a nd the semiempirical PM6 DH2 method with the COSMO solvation model. Interestingly, we ended up with much more negative free energies of binding estimates regardless of the method employed. As seen in Figure 4 2 the binding free energy estimates acquired from PM6 DH2 calculations correlated better with the experimental values indicated by the higher R 2 values of 0.56 and 0.68 in conjunction with SMD and COSMO, respectively, compared to the calculated free energies of binding from the ff99SB microstate ener gies which gave R 2 values of 0.45 and 0.53 utilizing the GB and SMD solvent models, respectively. Considering how well PM6DH2 is parameterized for interaction energies, this is not surprising. 31 The force fields ff94 and ff99SB showed very similar performances where the results changed by at most 0.8% from one to t he other. That is why only one plot is given for the to force field approaches. Only the results from the more recent ff99SB force field are given. A correlation of 0.68 with the PM6 DH2/COSMO method is quite reasonable in the face of the approximations m ade. We employed a static binding pocket conformation, in other words, we did not account for induced fit interactions between the protein and its ligand. Mobley et al. found that considering a single, static protein conformation yields poorer binding free energy estimates rather than when several protein conformations are included. 89 They remedied this problem to some extent by performing independent local geomet ry optimizations of the complex for each ligand

PAGE 104

104 and then simulating a rigid protein in its optimized geometry. For them this was a more severe problem because they were sampling with MD methods and ligands were prone to be energetically trapped in certain conformations if conformational changes in the protein were required to overcome the local trapping. In our procedure, ligands are less likely to get trapped in particular orientations because the phase space is searched systematically. Another source of i naccuracy might have been the physical variables of "blurring". If we made use of an infinitely small grid size for the translations within the binding pocket, we would end up with a larger ensemble covering a larger phase space. Likewise, if the rotation increments were decreased to infinitesimally small values, a higher number of conformations would be produced ensuring a better sampling. Both of these fine tuning steps may yield improvements. The origin of the consistent negative shift in the results f rom ff99SB /GB, ff94/GB and PM6 DH2/COSMO microstate energies might also have stemmed from the inaccuracies of the continuum solvation models we used, namely GB and COSMO. 85 According to the thermodynamic cycle shown in Figure 4 3 for protein ligand binding in the gas phase and solvent, a way to analyze this problem can be developed as follows: ( 4 24) Assuming the free energies of solvation for the protein P a nd the protein ligand complex PL would have nearly identical solvation free energies due to the closed binding pocket, we can write: ( 4 25) We tested this hypothesis of almost identical solvation free energies of the protein P and the PL complex on the protein structure used for the benzene system and a sample G s o l v P + G s o l v L + G b i n d s o l v = G b i n d g a s + G s o l v P L G b i n d s o l v = G b i n d g a s G s o l v L

PAGE 105

105 benzene protein complex structure out of the benzene PL ensemble. Because we are concerned with single poses for both, free energy and energy would be interchangeable in this case. We calcu lated the solvation energies by extracting the polar contribution to their ff99SB /GB energies and summing that up with the non polar contribution obtained from the solvent accessible surface area using the LCPO algorithm 108 The solvati on energy for the benzene protein complex was 3075 kcal/mol, whereas the solvation energy for the unbound protein was 3073 kcal/mol. Hence, their difference is, indeed, negligible, which supports our hypothesis. Hence, the microstate energies could be co mputed in the gas phase and then the free energy of solvation for the ligand could be subtracted from the free energy of binding in the gas phase. We utilized the experimental solvation free energies for the ligands, if they were available. The results col lected using this protocol employed the same equations as before and are shown in Table 4 2. Using these approximations, we obtained estimates for the binding free energies from in vacuo microstate energies and these calculated binding affinities were more negative than those we gathered from microstate energies incorporating a solvation model. Experimental solvation free energies were available only for benzene and benzene derivatives out of the set of eight ligands, which decreased our test set size. The solvation free energies were obtained from the Minnesota Solvation Database version 2012. 109 We are unable to report free energy of binding estimates for benzofuran, indene, and indole because of the missing ligand solvation free energies. The experimental solvation free energy values were smaller than the magnitude of the

PAGE 106

106 GB and COSMO solvation values shown in Table 4 2 yielding more negative f ree energies of binding as seen in Table 4 3. In addition to these, we evaluated the free energy of solvation for all eight ligands with another continuum solvation model: the qu antum mechanical SMD method by Truhlar et al 99 These values have been subtracted from the free energy of bind ing estimates in accordance to Equation 4 25 to produce the final estimates for the free energy of bi nding for the whole set as shown in Table 4 4 Akin to the previous set of results using the experimental free energies of solvation for ligands, the free energies of binding turned out to again be too negative. The correlation to the experimental binding affinities improved considerably from 0.45 to 0.53 when the SMD values were incorporated along with the gas phase free energy of binding estimates from the ff99SB microstate energies compared to the results obtained with the GB solvent mod el shown in Table 4 1. The PM6 DH2/COSMO combination, however, reproduced the trend in the measured binding affinities significantly better than the PM6 DH2/SMD pair as the R 2 values show: 0.68 vs. 0.56. Figure 4 2 summarizes the information contained in Tables 4 1, 4 3, a nd 4 4 in the form of a plot. For comparison purposes, we also performed a standard docking protocol on these systems and examined the correlation of the scores for the top hits to the experimental free energies. The Glide Standard Precision (SP) and Extra Precision (XP) algorithms 110 112 were applied which lead to R 2 values of 0.28 and 0.37, respectively. Error Analysis Systematic error correction and uncerta inty determination was conducted on the results obtained with ff99SB /GB, PM6 DH2/COSMO, ff99SB /SMD, PM6 DH2/SMD scoring methods. Using the Biomolecular Fragment Database and the iterative error bar

PAGE 107

107 assignment protocol, the systematic errors and uncertainti es plotted in Figures 4 4 and 4 5 were collected. For the binding affinity estimates obtained from force field energy scoring, accounting for the systematic errors moved the binding free energy estimates in the wrong direction although these corrected erro rs were tiny. Depending on the V pocket value, the completely entropic term RT ln (V poc ket C ) contributes 0.5 to 1.5 kcal/mol to the free energy of binding at 300 K and a concentration of C = 1 M which were used throughout in our calculations. Thus its possi ble miscalculation cannot fully rationalize this systematic negative shift. There must be another source of inaccuracy arising from our binding free energy evaluation scheme which would shift the results to much more negative values. The calculated binding free energies with PM6 DH2/COSMO suffer even from a bigger shift to more negative values as displayed in Figure 4 5 while they have considerably higher precision, which is not surprising given our earlier observations. 31 In addition to insufficient sampling, we suggest inaccuracies pa r tially originate from the static protein binding pocket approximation and inaccurac ies of the solvation free energy of the ligand since the inaccuracies in the solvation free energies of the PL complex and the free protein would mostly cancel out in this particular system. To test the effects of these two possible sources of error, we de signed Gedanken experiments. First, we assumed omitting the local relaxation of the binding pocket upon its interaction with the ligand would result in an energetic penalty of 2.00 kcal/mol. This destabilization of 2 kcal/mol was reflected in the systemati c errors of the PL microstate energies as 2 kcal/mol so that its application wo uld yield a more stable microstate energy. We employed the same Monte Carlo error propagation protocol and observed a positive

PAGE 108

108 shift in all the energy values both with ff99SB /G B and PM6 DH2/COSMO. We increased the hypothetical strain energy from 2 to 3 kcal/mol, which lead to a further improvement, i.e. closer estimates to the experimental results. The estimates obtained are displayed in Figures 4 6 and 4 7. A second Gedanken e xperiment aimed to reduce the errors arising from the solvation models employed in this study. As we pointed out earlier, due to the completely closed nature of the receptor pocket, the solvation free energy of the PL complex and of the free protein P larg ely cancel out. Thus, the only source of solvation model errors could be the solvation energy of the free ligand L To quantify its inaccuracy, we compared the ligand's experimental free energy of solvation to the solvation contribution to the absolute lig and energy. The difference was assumed to be the systematic error and this was combined with a hypothetical error bar of 1.00 kcal/mol. If the experimental solvation free energies did not exist, we assumed the average of the systematic errors obtained for the ligands with experimental solvation free energies would give an acceptable estimate of the systematic error. Monte Carlo error estimation was extended such that it also iteratively assigns errors to the ligand piece in Equation 4 17 The results were s urprising in the amount of improvement they yielded. Figures 4 8 and 4 9 demonstrate the trends. This solvation correction to the ligand was as effective as accounting for strain in the binding pocket. Both experiments were coupled where a hypothetical rec eptor strain of 3 kcal/mol and solvation free energy corrections for the ligand L were considered. Consequently, the binding free energy estimates were significantly enhanced to the point where half of the ff99SB /GB estimates contained the experimental res ults in their final error bars. With these two

PAGE 109

109 small thought exercises we identified qualitatively two significant sources of error in our procedure. Convergence of Binding Affinity Estimates and Error Reduction As stated earlier, the precision of calcula ted binding free energies benefits from sampling. Since uncertainties cannot be eliminated in a post hoc manner, sampling actually presents the only way to improve uncertainties of calculated quantities, in this case, protein ligand binding affinities. Thi s was already demonstrated by Faver et al. via a thought experiment where they showed even increasing the sample size from N =1 to N =2 enhances the precision considerably. 52 We applied this to our ff99SB /GB test set where we varied the sample size N from 1 to 5, 10, 25, 50, 75, 100 and increased it by 25 from there on. The free energies of binding were calculated every time along with the associated uncertainty, which demonstrated the convergence of our binding affinity est imates and also displayed the rapid decay of the error bars with growing sample size. For our convergence analysis, the energies for each pose of the "blurred" ensembles were ranked and the poses associated with lowest energies were assigned to the smalle st samples. In other words, the sample of N =1 consisted of the minimum energy pose, the N =5 sample contained the lowest five energies, the N =10 comprised the lowest ten energies and so on. Therefore the free energy of binding estimates rose as we included more and more microstates in our calculation. The maximum sample size was dictated by how compact the particular ligand was. For this system, N =50 seems to be an acceptable sample size for a good binding affinity assessment. Unfortunately, the sample size of n butyl benzene reaches only N =42, which suggests

PAGE 110

110 that we might have under sampled it or that it was packed very tightly into the binding pocket With its large side chain, most of the produced poses lead to serious collisions with the binding pocket re sidues and hence had to be filtered out. Both the binding free energies and the error bars do not appear to change after N =50 as seen in Figure 4 10. A better presentation of the effect of sample size on the error bars is given in Figure 4 11. The biggest impact on the random errors occurs when the ensemble size changes from N =1 to N >1 and the size of the error ba rs decrease drastically until N =25. After N =75, the uncertainties converge to their final values which are at least 3 kcal/mol less than what the y were using only a single microstate. These two analyses prove the benefits of utilizing sampling rather than using single microstates. Over or underestimating the final free energy values is almost inevitable and the obtained precision is the poorest wi th an ensemble of size N =1. Even adding just a few microstates with local sampling would result in a more accurate and precise binding free energy. If one found the most stable conformation of a particular ligand bound to the receptor and locally sampled f rom that minimum energy conformation such that only the bottom of the deepest potential energy well was sampled, even this practice would account for a large part of the significant contributions to the configuration integral of the PL complexes in E quatio n 4 1 and we find that this would yield a much better free energy of binding estimate. Conclusions We evaluated the binding free energies and their associated uncertainty for a congeneric series of T4Lysozyme L99A inhibitors. These were calculated directl y from the ratio of the configuration integrals of the protein ligand ( PL ) complex, the protein ( P ),

PAGE 111

111 and the ligand ( L ). Decomposition into entropic and enthalpic terms was not performed which presents an advantage: we did not have to work around the uncer tainties arising from these separate terms. Especially computing entropy accurately would be a challenge. Importantly, we introduced for the first time a way to estimate the systematic and random errors in computed free energies of binding on the fly. This is a step forward for protein ligand binding affinity calculation efforts because it allows for the understanding of the reliability of the computed quantity of interest. A workflow was laid out which involves creation of unbiased ensembles for protein li gand complexes and unbound ligands via our in house "blur" code, a direct binding free energy calculation formula from statistical mechanics, systematic error correction for microstate energies, and an error bar estimation protocol for the final binding fr ee energy estimates. The microstate energies were obtained with the ff99SB force field and the PM6 DH2 semiempirical method. Continuum solvent models were employed where the ff99SB force field was combined with the Generalized Born (GB) model and the PM6 D H2 approach with the Conductor like Screening Model (COSMO) model. In vacuo binding free energies were also calculated with the same methods and they were combined with the experimental and SMD solvation free energies of the ligands. The systematic and ran dom errors for each microstate were collected from the Biomolecular Fragment Database (BFDb) of Faver et al. and they were propagated as described elsewhere. 52 T he results were promising: they yielded reasonable correlation values (R 2 =0.45 with ff99SB /GB and 0.68 with PM6 DH2/COSMO) to the experimental binding affinities. However, they were all shifted to more negative values. We suggest that this artifact arises from the use of a single, static protein conformation, and the inaccuracies

PAGE 112

112 contained within the solvent models as demonstrated via Gedanken experiments which showed improvements once these two sources of error were approximately accounted for. Insufficie nt sampling due to the dependency on the initial pose is another probable source of error. Hence, there is definitely room to improve our protocol. "Blurring" the protein would likely improve our estimates to include receptor flexibility. Likewise, exploit ing explicit solvent for the force field scoring would enhance the results. Moreover, a bigger ensemble could be developed by utilizing finer settings during "blurring", i.e systematic perturbation of the ligand in or out of the binding pocket and potenti ally of the protein receptor itself. This work demonstrates the significant advantages of local sampling in enhancing precision of binding affinity estimates. It establishes the grounds for more sophisticated protein ligand binding free energy calculations which correct for the systematic errors contained in microstate energies, hence leading to a more accurate free energy of binding estimation, and at the same time determining error bars for binding free energies by propagating the random errors contained in microstate energies. We believe presenting error bars with binding affinity calculations should become common practice in our community. Hopefully, this study would contribute to improving the reliability of the calculated binding free energies.

PAGE 113

113 Table 4 1. The experimental and calculated binding free energies using microstate energies obtained with ff94 scoring in conjunction with the implicit GB solvent model, ff99SB scoring with the GB solvent model, and semiempirical PM6DH2 s coring with COSMO solven t model PDB Ligand Experimental G bind G bind ff 94 /GB ff 99 SB/GB PM6DH2/ COSMO 181L benzene 5.2 8.63 8.64 10.45 182L 2,3 benzofuran 5.5 10.43 10.48 15.51 183L indene 5.1 12.17 12.10 15.86 184L i butyl benzene 6.5 19.18 19.24 19.90 185L indole 4.9 9.70 9.68 14 .04 186L n butyl benzene 6.7 20.60 20.55 19.84 187L p xylene 4.7 15.78 15.66 13.93 188L o xylene 4.6 14.09 14.06 13.68 The systematic and random errors for each level of theory are also shown. All the numbers are in kcal/mol.

PAGE 114

114 Table 4 2. Comparison of the solvation energies with GBSA, COSMO and SMD solvation models against the experimental solvation energies fo r the congeneric set of ligands Ligand GBSA solvation energy COSMO Solvation energy SMD experimental benzene 2.18 2.01 0.4 0.9 2,3 benzofuran 5.69 3.41 0.8 indene 3.48 3.06 1.3 i butyl benzene 0.84 2.35 0.7 0.4 indole 7.54 6.49 4.4 n butyl benzene 0.74 2.47 0.4 0.2 p xylene 1.44 2.89 0.2 0.8 o xylene 1.59 2.92 0.1 0.9

PAGE 115

115 Table 4 3. The experime ntal and calculated binding free energies using microstate energies from gas phase ff99SB and PM6DH2 scoring with experimental ("exp") free energies of solvation for the particular ligand Ligand Exp. G bind Calculated in vacuo G bind ff99SB Calculated in vacuo G bind PM6DH2 G solv /exp for ligand G bind [ G solv /exp + ff99SB] G bind [ G solv /exp +PM6DH2] benzene 5.2 13.27 14.76 0.9 12.4 13.9 2,3 benzofuran 5.5 19.20 21.84 indene 5.1 18 .22 22.58 i butyl benzene 6.5 23.23 26.35 0.4 22.8 26.0 indole 4.9 19.39 23.26 n butyl benzene 6.7 25.46 27.25 0.2 25.7 27.5 p xylene 4.7 20.29 20.82 0.8 19.5 20.0 o xylene 4.6 18.48 20.88 0.9 17.6 20.0

PAGE 116

116 Tabl e 4 4. The experimental and calculated binding free energies using microstate energies from gas phase ff99SB and PM6DH2 scoring with SMD estimates for free energy of solvation Ligand Exp. G bind G solv /SMD for ligand G bind ff99SB/SMD G bind PM6DH2/SMD benzene 5.2 0.4 12.9 14.4 2,3 benzofuran 5.5 0.8 18.4 21.0 indene 5.1 1.3 16.9 21.3 i butyl benzene 6.5 0.7 23.9 27.1 indole 4.9 4.4 15.0 18.9 n butyl benzene 6.7 0. 4 25.9 27.7 p xylene 4.7 0.2 20.5 21.0 o xylene 4.6 0.1 18.4 20.8 "Exp." stands for experimental.

PAGE 117

117 Figure 4 1 The process of "blurring", i.e. systematically perturbing the ligand within the protein binding pocket, shown for n butyl benzene (186L). A ) Before blurring: Single confo rmation in the binding pocket, B ) After blurring, numerous conformations superimposed in t he binding pocket front view, C ) side view. Carbon atoms are displayed in grey and hydrogen atoms in white while red symbolizes oxygen and blue represents nitrogen. A B C

PAGE 118

118 Figure 4 2 Experimental free energies of binding plotted against calculated free energies of binding employing: (black) PM6 DH2 with SMD solvation model, ( red) PM6 DH2 with COSMO solvation model, (green) ff99SB force field with GB implicit solvent model, (blue) ff99SB force field with SMD solvation model. -7-6.5-6-5.5-5-4.5-4 Experimental G bind (kcal/mol) -30 -25 -20 -15 -10 -5 Calculated G bind (kcal/mol) PM6DH2/SMD PM6DH2/COSMO FF99SB/GB FF99SB/SMD R 2 =0.5620 R 2 =0.6798 R 2 =0.4532 R 2 =0.5273

PAGE 119

119 Figure 4 3 Thermodynamic cycle for protein ligand binding in the gas and aqueous phases. P stand s for protein, while L and PL represent ligand and protein ligand complex, respectively.

PAGE 120

120 Figur e 4 4 Systematic and random errors contained in the binding affinity estimates obtained from the ff99SB /GB (left) and ff99SB /SMD (right) micr ostate energies. -7-6.5-6-5.5-5-4.5-4 Experimental G bind (kcal/mol) -30 -25 -20 -15 -10 -5 0 Calculated G bind (kcal/mol) ff99SB/GB corrected for systematic error ff99SB/GB uncorrected y=x -7-6.5-6-5.5-5-4.5-4 Experimental G bind (kcal/mol) -30 -25 -20 -15 -10 -5 0 Calculated G bind (kcal/mol) ff99SB/SMD corrected for systematic error ff99SB/SMD uncorrected y=x

PAGE 121

121 Figure 4 5 Systematic and random errors contained in the binding affinity estimates obtained from the PM6 DH2/COSMO (left) and PM6 DH2/SMD (right) microstate energies. -7-6.5-6-5.5-5-4.5 Experimental G bind (kcal/mol) -20 -15 -10 -5 Calculated G bind (kcal/mol) PM6DH2/COSMO corrected for systematic error PM6DH2/COSMO uncorrected y=x -7-6.5-6-5.5-5-4.5-4 Experimental G bind (kcal/mol) -30 -25 -20 -15 -10 -5 0 Calculated G bind (kcal/mol) PM6DH2/SMD corrected for systematic error PM6DH2/SMD uncorrected y=x

PAGE 122

122 Figure 4 6. Free energy of binding estimates obtaine d with a Gedanken experiment, which assumes the static protein binding pocket approximation introduces a strain energy of 2 kcal/mol (blue) and 3 kcal/mol (orange) per PL complex pose. The original results with the static receptor are shown in black. Micr ostate scoring was done with ff99SB /GB. -7-6.5-6-5.5-5-4.5-4 Experimental G bind (kcal/mol) -30 -25 -20 -15 -10 -5 0 Calculated G bind (kcal/mol) ff99SB/GB corrected for systematic error ff99SB/GB uncorrected y=x ff99SB/GB corrected, PL strain=-2 kcal/mol ff99SB/GB corrected, PL strain=-3 kcal/mol

PAGE 123

123 Figure 4 7. Free energy of binding estimates obtained with a Gedanken experiment, which assumes the static protein binding pocket approximation introduces a strain energy of 2 kcal/mol (blue) and 3 kcal/mol (ora nge) per PL complex pose. The original results with the static receptor are shown in black. Microstate scoring was done with PM6 DH2/COSMO. -7-6.5-6-5.5-5-4.5 Experimental G bind (kcal/mol) -20 -15 -10 -5 Calculated G bind (kcal/mol) PM6DH2/COSMO corrected for systematic error PM6DH2/COSMO uncorrected y=x PM6DH2/COSMO corrected, PL strain=-2 kcal/mol PM6DH2/COSMO corrected, PL strain=-3 kcal/mol

PAGE 124

124 Figure 4 8 Gedanken experiments 1 and 2 combined. Blue represents the results with corrected solvation energy for the ligand and orange displays the results with both corrected solvation energy of the ligand and the receptor strain accounted for. Scoring was done with ff99SB /GB. -7-6.5-6-5.5-5-4.5-4 Experimental G bind (kcal/mol) -30 -25 -20 -15 -10 -5 0 Calculated G bind (kcal/mol) ff99SB/GB corrected for systematic error ff99SB/GB uncorrected y=x Corrected ff99SB/GB and G solv,L Corrected ff99SB/GB and G bind ; PL strain=-3 kcal/mol

PAGE 125

125 Figure 4 9 Gedanken experiments 1 and 2 combined. Blue represents the results with corrected solvation energy for the ligand and orange displays the results with both corrected solvation energy of the ligand and the receptor strain accounted for. Scoring was done with PM6 DH2/COSMO. -7-6.5-6-5.5-5-4.5 Experimental G bind (kcal/mol) -20 -15 -10 -5 0 Calculated G bind (kcal/mol) PM6DH2/COSMO corrected for systematic error PM6DH2/COSMO uncorrected y=x Corrected PM6DH2/COSMO, and G solv,L Corrected PM6DH2/COSMO and G solv,L ;PL strain=-3 kcal/mol

PAGE 126

126 Figure 4 10 Convergence of binding free energy calcu lations for the ff99SB /GB test set. Blurred poses were sorted as such the ones with the most negative energies were included in the smallest samples, hence the increase in the binding free energy estimate. Sample size was increased gradually and its maximu m depends on the compactness of the particular ligand. 050100150200 Number of microstates included in the G calculation -30 -25 -20 -15 -10 -5 0 Calculated G bind Benzene Benzofuran Indene Indole i-butylbenzene n-butylbenzene o-xylene p-xylene T4Lysozyme L99A Inhibitors with ff99SB/GB

PAGE 127

127 Figure 4 11. Error bar propagation with growing sample size in ff99SB /GB calculations. Sample sizes of N=(1,5,10,25,50,75,100,125,% ) were employed. N reached larger numbers for more compact ligand s as the "blurred" ensembles for those were naturally bigger. After N=75, the random errors did not change by much in size. Enlarging the sample size decreased the uncertainties in the final free energy estimation by at least 3 kcal/mol. The biggest improv ement, namely the sharpest decrease, occurred when transiting from N=1 to N>1. 050100150200 Number of microstates included in G calculation 3 4 5 6 7 8 9 10 Uncertainty (kcal/mol) Benzene Benzofuran Indene Indole i-butyl benzene n-butyl benzene o-xylene p-xylene T4Lysozyme L99A Inhibitors with ff99SB/GB

PAGE 128

128 CHAPTER 5 DYNAMIC AND STRUCTURAL STUDIES OF THE N TERMINUS OF THE INTRINSICALLY DISORDERED Cu(I) BINDING PROTEIN CusB Background In most organisms, copper binding enzymes ar e u sually membrane bound or found in the periplasm 113 Escherichia coli is one of these : In E. coli the periplasm can contain significantly higher copper concentrations than the cytoplasm since copper is needed for periplasmic or inner membrane enzymes such as multi copper oxidases, Cu,Zn superoxide dismutases, or cytochrome c oxidase, the f inal enzyme of the respiratory chain 114 Still, periplasmic copper con centrations must be kept at low levels because Cu(I) can initiate the generation of reactive oxygen species (ROS) under aerobic conditions 114 This delicate balance is achieved through various metal resistance systems against excess metal concentrations including the RND type (resistance, nodulation, division) transporters, which are common defense mechanisms not only in E. coli but all Gram negative bacteria. Thus, they play major roles in the intrinsic and acquired antibiotic resistance of Gram negative bacteria by facilitating their survival in otherwise lethal concentrati ons of drugs and metal ions 115 They also aid in expelling bacterial products such as siderophores, peptides, and quorum sensing signals 116 117 Clearer insight into these efflux mechanisms is of significant importance considering how antibiotic resistant pathogens represent a growing threat to human health. RND type efflux systems consist of three fundamental components: an energy requiring inner membrane protein 11 8 an outer membrane fac tor, and a periplasmic Reprinted with permission from Ucisik, M. N.; Chakravorty, D. K.; Merz, K. M. Biochemistry 2013, 52 6911. Copyright 20 13 American Chemical Society.

PAGE 129

129 component 119 Proton substrate an tiporters of the RND protein superfamily serve as the inner membrane components and their exported substrate determines the subclass to which they belong. Accordingly the inner membrane proteins ejecting heavy metals make up the heavy metal efflux subfami ly and they are highly substrate specific with the ability of differentiating even between the charge of the ions. 118 The cytoplasm or the periplasm provide the substrates, depending on the properties of the particular efflux system and the substrate 120 The CusC FBA efflux system in E. coli is responsible for extrusion of Cu(I) and Ag(I) and consists of CusA, the inner membrane proton/substrate antiporter of the heavy metal efflux RND family, CusB, the periplasmic protein and CusC, the outer membrane protein as s hown in Figure 5 1 121 123 The Cus system has a n additional fourth component, CusF, which acts as a periplasmic Cu(I)/Ag(I) metallochaperone and is vital for ma ximal metal resistance 122 124 The periplasmic protein CusB is a member of the membrane fusion protein (M FP) family 125 It is hypothesized to stabilize the tripartite intermembrane complex through its interactions with CusA and CusC which is similar to the role that AcrA plays in the well characterized multidrug efflux pump AcrAB TolC 126 129 Four domains were identified from the available crystal structures: the membrane proximal, beta barrel, lipoyl and alpha helical domains. However, the most impo rtant part containing the three conserved metal binding Met residue s could not be resolved with crystallographic techniques 126 130 Mutation studies conducted on these three Met residues have resulted in the loss of metal resistance in vivo 131 Moreover, CusB and CusF inter act temporarily in the presence of metal and have similar metal binding affinities inferred from experimental findings where the metal in the medium gets

PAGE 130

130 distributed approximately equally between them when mixed in equimolar concentrations in vitro 131 132 Direct metal transfer between these proteins has also been shown experimentally. 133 Chemical cross linking/mass spectrometry experiments captured the CusB CusF interaction, and underlined the significance of the N terminal region in terms of protein protein int eractions and metal transfer. 134 Recently the N terminal region of CusB (CusB NT) was experimentally found to exhibit metal transfer from CusF by itself, although not as effective ly as the full length CusB, which rules out the hypothesis that CusB acts simply as a metal che lator. It mu st be induc ing a structural change in the rest of the chain which gives rise t o higher resistance. This change was actually captured experimentally, implying conformational flexibility 135 Yet the retention of the metal transfer ability by truncated CusB encouraged us to examine this region by itself in order to unravel the dynamics and structure of this highly disordered entity. Being highly disordered in their N and C terminal regions is not an uncommon feature of membrane fusion proteins. The N and C termini of the multidrug efflux MFP's MexA and MacA could not be crystallographically resolved either. 136 138 These regions appear to be of flexible and dynamic nature as the observed disorder suggests. Structural and dynamic properties of proteins can be probed simultaneously by molecular dynamics (MD) simulations, w hich make them especially useful for studying protein structure and folding. 139 Recent attempts to examine the folding pathways of several peptides including villin headpiece 140 149 bovine acyl coenzyme A 150 Trp cage 151 156 and Alzheimer amyloid beta 10 35 peptide 157 resulted in success suggesting this

PAGE 131

131 technique has the potential to provi de molecular level insights into the structure of CusB As stated above, the available crystal structures of the periplasmic protein CusB lack their N termini. The absence of this very important piece hampers the ultimate modeling of the full metal efflux pump CusCFBA where metal uptake in the periplasm cannot be modeled. By determining a viable structural ensemble for CusB, whose members could transfer the metal from the open form of apo CusF, remains essential to reach our ultimate goal: simulating the en tire metal extrusion process via CusCBA. Shedding light on this highly disordered domain would provide researchers with an ensemble of full model structures of CusB, which has been missing so far. Moreover, the insights gained would facilitate further insi ght on other disordered proteins 158 In this chapter we examine the structural and dynamic changes of the N terminal region of CusB upon Cu(I) binding by simulating the apo and holo CusB NT using MD simulations and analyzing the sampled phase space in each case with various analysis tools Methods Initial Models We ran extensive MD simulations of 25 different initial models of the N terminus region of CusB which were obtained with the protein structure prediction tools Quark 159 I Tasser 160 161 and Sparks X. 162 Quark is an ab inito protein folding and structure prediction algorithm, which is accessible through a web server. It constructs the 3D protein model from the amino acid sequence only with no global templat e information. Thus, it is useful for proteins without homologous templates. It was ranked as the leading server in free modeling in the CASP9 experiment. 159 163 164 The second folding

PAGE 132

132 server we used, I Tasser, aims to predict protein structure and function by building 3D models based on multiple threading alignme nts and assembly simulations. The predicted models are then matched with the BioLiP protein function database 165 The latest CASP experiments ranked it as the number 1 server for protein structure prediction in addition to its first rank for fu nction prediction in CASP9. 166 The third server we employed was Sparks X, which is one of the best performers in the CASP9 experiment for single method fold recognition. It is a template based modeling algorithm which was established by weighted matching of multiple profiles and provides an improved scoring after taki ng into account errors in the predicted 1D structural properties 162 We refe r to residues 29 through 79 of chain C in the crystal structure 3NE5 130 as the N terminus region of CusB. Residues 1 to 28 constitute the signaling region and were not included in this study. The residue numbering is reported after subtraction of the signaling peptide. Computational Methods and Post simulation Analyses Both apo and Cu(I) bound versions of the N terminal region of CusB were simulated where all the protein and solvent atoms were treated explicitly. Each model system was solvated with the TIP3P triangulated water model 49 in a p eriodically replicated rectangular water box whose sides were at least 10  away from the solute atoms. The systems were neutralized by addition of Na + ions and charged amino acids were modeled in the protonated states obtained with the H++ protonation sta te server at physiological pH. 100 The apo models were run through an energy minimization protocol of seven stages involving the minimi zation of only the solvent atoms and the counter ions (stage 1), minimization of the hydrogen atoms (stage 2), minimization of the side chains by gradually decreasing the harmonic positional restraints acting on them

PAGE 133

133 (stages 3 6), and finally the energy mi nimization of the whole system with no positional restraints (stage 7). A total of 157,000 energy minimization steps were performed in total, 47,000 of which used the steepest descent protocol. 3 The remaining 110,000 steps utilized the conjugate gradient method for minimization. 3 A two stage equilibration protocol followed: first, the systems were heated slowly from 0 t o 300 K over 200 ps of MD within the canonical ensemble (NVT) by maintaining a weak harmonic restraint on the protein, and then they were simulated for 10 ns at 300 K to check for the stability of the peptide chains after removal of the harmonic restraints at a constant pressure of 1.0 bar for an isobaric, isothermal ensemble (NPT) using Langevin dynamics with a collision frequency of 1.0 ps 1 167 Periodic boundary conditions were imposed on the systems during the calculation of non bonded interactions at all minimization and equilibration stages while the lengths of the c ovalent bonds involving hydrogen were constrained and their interactions were omitted with the SHAKE algorithm while the systems were heated. 168 All the restraints on the systems were released before we started with the MD production runs with the apo models. The metal coordination site o f CusB was experimentally found to consist of three Met residues (M21, M36, and M38) aligned in almost an equilateral triangle through their sulfur atoms where the S Cu(I) distance was measured to be 2.3  in EXAFS experiments. 133 169 To correctly orient the metal binding residues for Cu(I) binding in the apo systems, we employed 2 ns of steered MD on each of the 25 systems to bring the metal binding residues (M21, M36, and M38) into an hypothetical pre organized state where the sulfur atoms would be 4.15  distant from each other in an equilateral triangular alignment using a harmonic restraint of 1000 kcal/mol* 2 This distance value

PAGE 134

134 was obtained from our optimization efforts involving methionine side chains and a Cu(I) ion at the density functional theory (DFT) QM level which were performed with Gaussian'09 44 using the M06L DFT functional 30 in conjunction with t he double # quality LANL2DZ pseudop otential basis set for Cu(I) 170 and the Pople type basis set 6 31G* for all the othe r atoms 32 In addition to providing the needed S S distances and a general positioning in th e metal site for the steered MD, these calculations confirmed the experimentally observed Cu(I) S distance of 2.3 . Once we obtained the individual "pre organized" geometries for the metal site on 25 models, we introduced the Cu(I) ion in the center of ma ss of the three sulfur atoms of M21, M36, and M38 after we had stripped all the solvent and counter ions from the final recorded snapshot at the end of the 2 ns simulation time which yielded 25 holo models. The metal bound systems were represented by a bon ded model where the metal parameters were obtained with MTK++/MC PB functionality 171 of AmberTools version 1.5. 172 A frequency calculation was run on the optimized structure (M062X/LANL2DZ 6 31G*) of the binding site using Gaussian09 to collect the bond and angle parameters. 173 The charges for the Cu(I) bound ligands were calculated using RESP (M062X/6 31G*) within the MTK++ program. 174 175 Having equilibrated the apo models and holo models, we proceeded with 200 ns of conventional MD and 400 ns of accelerated MD (aMD) 176 for each of these 50 sys tems which adds up to a total simulation time o f 5 microseconds of MD and 10 microseconds of aMD for both apo and holo models The MD simulations were run with the ff99SBildn force field 177 of the AMBER11 package 172 where for the aMD simulations, the same force field in the AMBER12 suite wa s utilized 95 The aMD parameters were

PAGE 135

135 obtained from t he 200 ns MD runs for each of the 50 systems as described in the AMBER12 manual and an extra boost was appl ied to the torsions Throughout the production simulations the SHAKE algorithm was used to constrain covalent bonds involving hydrogen. The Particle mesh Ewald (PME) method was utilized for long range electrostatic interactions and an 8  non bonded cutoff was applied to limit the direct space sum in PME 178 The temperature of the systems was maintained at 300 K with Langevin dynamics (collision frequency of 1.0 ps 1 ). Frames were collected every 2 ps. A selected subset of these snapshots were employed in distance, angle, di hedral, root mean square deviation (RMSD), root mean square fluctuation (RMSF), correlation analyses, entropy calculations and clustering of trajectory frames with the ptraj utility of AmberTools version 1.5. The average linkage algorithm with a maximum ec centricity of 5  was used in clustering the frames. NMR chemical shifts were assigned to backbone atoms extracted from snapshots separated by 0.5 ns with SPARTA+ 179 A smaller subset of these snapshots with a frame separation of 20 ns was employed in secondary structure predictions with the Stride program 180 Visual molecular dynamics (VMD) was utilized to visualize the molecular structures 46 while Gnuplot 181 and Grace software were used to plot the data. Results Effect o f Cu(I) Binding The N terminus of CusB was found to be mostly disordered in previous circular dichroism and NMR experiments involving its apo and Ag(I) bound versions. 135 Before we started with the MD simulations, we assessed the expected level of disorder in the CusB N terminus with SPINE D, an online artificial neural network server which classifies regions of a given protein chain as ordered or disordered bas ed on the protein

PAGE 136

136 sequence only 165 The results are shown in Figure 5 2 and CusB NT is predicted to have significant disorder while some order was estimated for two regions around the residues 16 25 and 38 41, which roughly corresponded to the metal coordination sites. This result, combined with the experimental findings, prepared us for the fact that a very careful analysis was needed to extricate insights into the intrinsically disordered protein. Both in the apo and holo trajectories of CusB N terminus domain, there are some motifs which were visually observed repeatedly: an anti parallel beta sheet configuration in the central part, a small beta sheet structure at the N terminus of the CusB NT chain which sometimes aligns with the antiparalle l beta sheets in the center to form a triple antiparallel beta sheet conformation, the N and C termini tails folding into short helices, and the 1 turn ( helix in the loop connecting the metal binding residues M21 and M36 Figure 5 3 gives sample structures observed in the simulations containing some of these features. Naturally, the Cu(I) ion bound to the three Met residues M21, M36 and M38 provides some degree of constraint to the chain motion which renders the holo form slightly more ordered around the metal binding region. Hence, the apo protein adopts a larger number of conformations. For both forms, we observed the maximum mobility in the CusB NT ch ain ends. The N terminal end, (residue numbers 1 10), is even more mobile than the C terminal end and the latter would normally be attached to the rest of the CusB chain whose crystal structure had been already resolved by X ray crystallography techniques. Both in the holo and the apo forms, the N terminal of the CusB NT chain often folds into a small alpha helix in addition to the beta sheet it forms occasionally. As our RMSD analysis suggests though (see Figure 5 5 and discussion below), substantial struc tural changes occur during the simulations. We observe

PAGE 137

137 numerous structural motifs, ranging from rather extended to more globular where the termini and the M21 M36 loop fold against the antiparallel beta sheet motif in the center. Hence, the beta sheets act like a core region leading to a rather compact shape along with the M21 M36 loop and the transient secondary structure (beta strand or alpha helix) in the C terminus of the CusB NT. We also made use of secondary structure predictions to quantify the obser ved structural similarities between the apo and Cu(I) bound protein models. These were established using the Structural Identification (STRIDE) algorithm which uses atomic coordinates for assignment 180 The frames were isolated from the MD and aMD trajectories and the se condary structure elements assigned to each frame were combined into a matrix, which was displayed with the color coding as seen in Figure 5 4 We analyzed the MD and aMD data separately in each case. Since we simulated the holo version for a longer time u sing aMD, we ended up with more frames contributing to this matrix because the frame separation was kept at 20 ns at all times. It must be noted that the covered phase space with aMD going from one frame to the next, is larger due to the enhanced sampling of aMD. Hence, greater structural changes are to be expected. Actually, our test runs indicate an acceleration of three to forty five fold in sampling the phase space depending on the starting point as tracked from the backbone RMSD profiles of arbitrary m odel structures. Thus, the MD trajectories tend to retain the secondary structure elements over consecutive frames. On the other hand, the aMD trajectories gives a more striped appearance due to the increased conformational variability of the snapshots and do not stack up as seen in MD matrices.

PAGE 138

138 The bottom panels reveal this effect very clearly: the longer the separation between the snapshots, the less continuous the matrix representation. For both apo and holo simulations, turns and coils shown in white do minate in terms of the observed secondary structures, which implies a large extent of disorder throughout the protein structure In addition to these, the persistence of the beta sheet motif can be seen in the red spots around residues 13 16 and 40 44. In the aMD trajectories, the alpha helix in the beginning part of the chain (first ten residues) dominates the picture. Helices are known to require time frames around 200 ns to form, which in turn makes them rarer in our MD trajectories 182 183 in comparison to the aMD trajectories. 180 181 The small helix forming in the loop region connecting M21 and M36 sometimes appears in the apo aMD simulation, but it's not a persistent structural element for the holo version. The most significant insight gained throug h these matrix representations is how much disorder is seen throughout the protein chain. Nonetheless, this analysis gives structural and timescale insights into the formation and unraveling of seconda ry structural elements in a dis ordered protein. 184 RMSD data served as another tool to assess the extent of structural variability of the apo and holo versions of the Cus B N terminal. The RMSD ranges and distributions observed in the MD trajectories for individual apo and holo models do not differ drastica lly as seen in Figure 5 5 The mean RMSD for the apo aMD trajectories is 10.86 2.11  while for the holo aMD traject ories it is 12.32 2.76 . The RMSDs of the apo models span a range of 5 18  where changes within one trajectory are minor: usually about 2  although it reaches 6  for some models.

PAGE 139

139 Surprisingly, the RMSD spread for the holo models is 6 22  while the RMSD within one trajectory remains mostly at much lower values of 2 3  Here we experience the benefits of having started our MD runs from twenty five distinct starting points to end up with good overall sampling. The RMSD values for each starting structu re with respect to an arbitrary starting structure were computed to determine how dissimilar the starting models were and they turned out to have an RMSD of at least 5  with respect to an arbitrary model The aMD data broadens the observed RMSD ranges sli ghtly: 5 23  for apo and 4 24  for the holo models. More importantly, the RMSD ranges spanned by single trajectories exceed 10  in certain cases which indicates better sampling, as expected, and would push the different conformations from different port ions of the potential energy surface to other regions. In this way "energetic traps" were avoided by switching to different starting points. The Cu(I) bound structures experience greater structural changes both in the MD and aMD trajectories compared to t heir apo counterparts which is counterintuitive with what one would anticipate because of the reasonable hypothesis that the metal ion would restrain the chain motion. This is observed both in the individual RMSD profiles and in the distribution of the RMS D values for the apo and holo trajectories. In Figure 5 5 the distribution of the RMSD values associated with the holo trajectories is broader and its mode is greater than the one for the apo distribution. Although the analyzed trajectories are from the p roduction phase, it appears that the polypeptide chain is still adapting to this extra constraint by fluctuating to a greater extent and searching for stable conformations This is in contrast to recent work on CzrA, where metal binding

PAGE 140

140 (Zn(II)) lead to a reduction in the number of states sampled by CzrA relative to its apo state 185 The RMSF profile is a measure of the mobility of the alpha carbons of the protein chain. In this case, the analysis shows that the end residues are more mobile compared to the rest of the molecule for both apo and holo conformations. The individual plots obtained from the individual MD trajectories contain various minima and loc al maxima in the interi or parts of the chain. The RMSF profiles from the aMD trajectories provided useful insight where some trends are observable: for the apo models the most stable, i.e showing the lowest atomic fluctuations, residues are T9, R12, I14, F27, S33, and K43 as seen in Figure 5 6 The reduced mobility around residue T9 is mostly associated with the formation of the alpha helix in the N terminus tail of CusB NT, which ceases to some extent upon addition of metal as indicated in the lower panel s of Figure 5 4 I14 and K43 fall into the beta sheets making up the central beta sheet motif, which emphasizes the tendency for this secondary structural element to form and stay stable over long time frames. R12 is adjacent to the first beta strand and i s stabilized by the central beta motif. F27 and S33, on the other hand, are involved in extensive hydrogen bonding with other residues in the M21 M36 loop (Figure 5 1 1 see discussion below). This intermolecular interaction network reduces the mobility of these two residues. The loop region between M21 and M36 becomes almost as mobile as the tails of the chain towards the middle for som e of th e trajectories Incorporation of the metal introduces a constraint to free motion and diminishes the atomic fluctuat ions of the metal ligating residues (indicated in Figure 5 6 ). The minima of the averaged atomic fluctuations for the holo version of the CusB N terminal region are found at M21 and M36. Interestingly,

PAGE 141

141 the Cu(I) binding residues M21, M36, and M38 in the ho lo models do not necessarily show minimal fluctuations in some of the individual trajectories. We ensured that this does not trace back to the disruption of the metal site; the S Cu(I) distances have been analyzed throughout and were confirmed to conserve the bonding distances at all times. Visual inspection aids in confirming these seemingly contradictory observations: The loop connecting the M21 and M36 residues moves to both sides of the central beta sheet motif similar to a flap which enhances the mobil ity of the Met's bound to Cu. Comparing the atomic fluctuation profiles of single models obtained from the aMD trajectories reveals that the minima are less spread in the holo chain. The fact that the immobilization effect remains only local around the met al binding residues becomes even more obvious in the cumula tive RMSF profiles in Figure 5 6 Looking at the individual profiles of the apo models, on the other hand, the motions seem to be dispers ed along the chains more evenly. However, this artifact disa ppears in the cumulative graph. We next calculated NMR chemical shifts for the apo and holo forms of CusB NT using SPARTA+. 179 These shifts were calculated for a total of 20,000 snapshots of the apo and holo models and averaged structures. The program could only calculate 43 backbone N's and their associated H's because the first amino acid residue is exempt from predictions and there are seven Pro residues without an amide hydrogen leaving 43 residues in total. The average shifts for the backbone nitr ogens (N) and their hydrogens (HN) were plotted with and without their standard deviations (Figure 5 7 ). The conclusion, which can be drawn from these calculations, is that no striking structural differences other than the local organization of the metal b inding site occur

PAGE 142

142 between the apo and the Cu(I) bound versions of the CusB N terminal, which is consistent with the experimental observations involving Ag(I). 135 Additionally, this resembles the metallochaperone CusF where the apo and the metal bound structures were similar to each other. 124 186 188 In addition to the above, we investigated numerous structural properties in order to identify the structural changes induced by the metal binding. To determine these properties, we focused on the prominent structur al motifs. The first was the beta sheet motif in the center of the chain. The distance between the two antiparallel beta strands measured from the center of mass of the residues 13 16 (strand 1) to the center of mass of the residues 41 44 (strand 2) emerge d as a good coordinate to estimate the extent to which these two beta sheets maintained their antiparallel alignment. This distance is measured to be around 5 6  when the two beta sheets are found to be in the antiparallel configuration. T he inset of the Figure 5 8 further illustrates this distance. The histogram showing the distribution of this distance over the apo and holo trajectories shows that the majority of the observed snapshots locates the two beta strands 5 6  apart from each other which implie s that this central motif is one of the key structural elements in this disordered protein chain (Figure 5 2). The range for this specific distance spans values from 3 to 66  suggesting that most, if not all of the available phase space is visited. Secon dly, to check the relative positioning of the two beta strands belonging to the central beta motif the dihedral angle formed by the centers of mass of L15, I14, K43, and P42 residues was probed (Figure 5 9 right). Dihedral angles of 0 degrees would imply an antiparallel conformation. As seen in the histogram in Figure 5 9 the global

PAGE 143

143 maximum occurs at around 0 degrees for both the apo and the holo trajectories of CusB NT suggesting that the antiparallel beta conformation is the most favored although it com prises only ca. 10% of the whole spectrum. Other peaks occur around 120, 30, 60, and 90 degrees indicate other common alignments in the trajectories. The similarity in behavior of the apo and holo models underlines the fact that this beta motif is one of the key structural features of the CusB N terminal region, as suggested by the data of Figure 5 8 Further analysis aimed to explore the conformational preferences of the loop connecting M21 and M36 relative to the central beta motif, whose importance we have highlighted. The dihedral involving the center of mass of F16 (in the central beta sheet), M21, M36 and the common center of mass of the residues D28, K29 and P30, which falls on the midpoint of the big loop connecting M21 and M36, was exami ned for th is purpose (Figure 5 10 inset). Not surprisingly, the apo and Cu(I) bound models do not differ significantly. It appears that this loop tends to locate itself somewhere on the range of 0 120 degrees from the central beta motif while for the holo models, a ngles of 0 30, 70 80, and 90 100 degrees dominate. For the apo models, dihedrals of 0 60 deg rees are more common (Figure 5 10 ). As more sampling is included in the analysis with the aMD trajectories, the peaks covering the range of 0 60 degrees flatten out for both the apo and the holo models giving rise to a more dispersed distribution. The distribution of the data obtained from the holo aMD trajectories shows two maxima at around 0 30 and 90 degrees. The one associated with the apo aMD data peaks at aroun d 30 70 degrees.

PAGE 144

144 To further understand the spatial alignment of the loop, a distance analysis was conducted on the residues, which constitute the loop (residues 22 35) and the ones close to the loop (residues 17, 19, 37, and 39 ) which may interact with the loop residues. Figure 5 11 shows these distance trends: hydrogen bonding between the backbone oxygen of S33 and the amide hydrogen of F35 occurs in almost 80% of all the snapshots taken both from the apo and holo trajectories. This is by far the most p revalent interaction inferred from this analysis In 50% of the snapshots this hydrogen bond is a strong one since the carbonyl oxygen amide nitrogen distance is less than 3 . S33 and F35 are located at the hinge of the loop when it folds against the bet a motif, and favor this type of folded alignment. The side chain oxygen of the S33 and the amide hydrogen of F35 also interact in ~25% of all the frames in the apo and Cu(I) bound simulations. This interaction alternates with the previous S33O F35NH intera ction from time to time, emphasizing the importance of F35 and S33 in determining the spatial arrangement of the loop. Alternatively, the side chain oxygen of S33 interacts with the amide hydrogen of M36 in almost 10% of the snapshots taken from both the a po and holo trajectories. This adds to the structural importance of residue M36: binding the Cu(I) ion and aiding in structural organization of the domain by facilitating the folding of the big loop with hydrogen bonds at the hinge point. In addition to th ese interactions, the backbone carbonyl oxygen of D19 and the amide H of Y22 form a backbone hydrogen bond on the other end of the loop and this acts as a hinge as well which is observed in ~20% of the apo and ~40% of the holo conformations. Very rarely, t he amide H of D19 and the backbone O of Y22 form hydrogen bonds. Another interesting point revealed in the distance analysis is the formation of the 1 turn helix within the M21 M36 loop whose

PAGE 145

145 formation is made possible through the backbone hydrogen bonding involving the residues P23 R26 and N24 F27. This is observed in almost 20 30% of the apo and holo models. An additional structural property was examined to gain insight to the general shape of the M21 M36 loop. The distance between the Cu(I) ion and the c enter of mass of each residue constituting the loop connecting M21 to M36 was calculated for every snapshot taken from the holo trajectories (Figure 5 12 ). This analysis reveals that the loop is mostly oval shaped. Standard deviations increase for residues closer to the middle of the loop implying that they are more mobile as they move further away from the hinge points. Finally, t o examine the reversibility of the structural effects caused by metal binding, we randomly picked one of the 25 models an d remov ed the Cu(I) ion from snapshots at 100 ns ("snapshot 1") and 200 ns ("snapshot 2") into the conventional MD simulation. We solvated, minimized, and equilibrated these two structures using the protocol described in the methods section and ran 200 ns of conv entional MD on them. We tracked the backbone motions of the new trajectories both visually and with root mean square fluctuation (RMSF) and root mean square deviation (RMSD) analyses. These were compared to the corresponding analyses of the original apo tr ajectory of the same model. Accordingly, both t he RMSD and RMSF profiles of snapshot 1 track the backbone motions of the original apo trajectory very closely, which suggest s they may be covering similar regions of phase space. However, for snapshot 2 the r ecovery of apo behavior was not as clear Although the original apo behavior of this particular model is not obtained in the snapshot 2 trajectory the main secondary structural

PAGE 146

146 elements like the central beta motif and the loop between M21 and M36 emerge i n the middle of the trajectory, which is a point visited by other apo models, as the STRIDE analysis qualitatively demonstrates Decomposing the Disorder Our original hypothesis envisioned that we would be able to extract probable structures for the Cu(I) bound version of the CusB N terminal region, which would further guide experimental analysis of this domain. Through visual inspection and the RMSF analyses, we observed the chain ends to be, not surprisingly, more mobile than the rest of the chain. This l ed to the idea of splitting the chain into three "subdomains" and analyzing these individually: The tails, the central beta motif, and the Cu(I) site along with the big loop. These are sh own in the insets of Figure 5 13 and Figure 5 14 Considering the ato mic fluctuations of the alpha carbon atoms presents a useful quantitative way to justify how we determined the su bdomains. As seen in Figure 5 13 the fluctuations behave consistently in the designated subdomains. Both tails show elevated backbone motion i n the Cu(I) bound version compared to the apo which is designated as the first subdomain and displayed with a green background. In the central beta motif there is a changing trend in the RMSF: the beta region of the holo models become less mobile towards the metal binding site while this is inverted for the apo models. In other words, the chain fluctuations decrease in the holo models as one approaches the Cu(I) site while the opposite behavior is observed in the apo models. Thus, the region can also be co nsidered as a distinct subdomain and is indicated with a grey background in Figure 5 13 The M21 M36 loop region constitutes the third subdomain as it displays a local stabilization upon metal binding, and a higher mobility in the apo models (pink backgrou nd).

PAGE 147

147 In this partitioning, the most stable subdomain was the Cu(I) site and the M21 M36 loop followed by the central beta sheet motif and then the tails as seen in the RMSD distribution in Figure 5 14 The RMSD distribution profile shows that the tails ha ve the highest extent of structural changes yielding the widest RMSD distribution with the greatest mode value of almost 10  further confirming that the most significant chain motion happens there. Interestingly, the biggest change in the RMSD distributi ons upon metal binding is detected in the beta motif region which points to the largest acquired ordering in contrast to what might have been expected to happen in the third subdomain containing the metal site and the loop. The beta motif is likely to assi st in generating a pre organized state suitable for metal binding and this hypothesis emphasizes its structural significance. The average RMSF profiles in Figure 5 13 also support this finding where the alpha carbons of the residues constituting the Cu(I) site and the big loop display the lowest atomic fluctuations while the highest atomic fluctuations stem from the tails of the protein chain. Additionally, clustering the split trajectories yielded a much lower number of clusters in general, while the lowes t number of clusters were produced by the most stable subdomain constituting of the Cu(I) site and the M21 M36 loop. More specifically, a subset of 6 000 frames from each subdomain was clustered using the ptraj utility of AmberTools version 1.5 using the a verage linkage algorithm, which resulted in 303, 22, and 4 clusters for subdomains 1, 2 and 3, respectively. Hence, we could observe that the tails are mostly responsible for the disordered state of the N terminal region of CusB. This led us to isolate the tails (subdomain 1) from the CusB NT and cluster the remainder, namely the subdomains 2 and 3 This protocol provided us with 261 clusters, which is a large decrease from the

PAGE 148

148 1646 clusters when the entire CusB NT chain was analyzed in an analogous manner Thus, the observed disorder can be largely attributed to subdomain 1, the N and C termini, of the CusB NT domain. Based on this observation we propose that even the smaller construct of subdomains 2 and 3 might form a functionally competent metal binding domain. Conclusions In this study we explored the conformational dynamics of the apo and Cu(I) bound CusB N terminal region to assess the impact of metal binding on this mostly disordered protein domain, which lacked an experimentally resolved structure. To obtain viable starting points, we utilized online protein folding servers Quark 159 I Tasser 160 161 and Sparks X 162 which gave promising results in CASP competitions. 25 star ting models were obtained and used in extensive MD simulations to study the structure and dynamics of the CusB N terminal region. Moreover, we applied the accelerated MD enhanced sampling method of Pierce et al 176 for another 10 microsecon ds of sampling in addition to 5 microseconds of classical MD simulations. The Cu(I) ion was inserted into the apo models after generating pre organized metal binding sites according to the available experimental information regarding the binding site. A bo nded force field was created which attached the three coordinating Met residues (M21, M36 and M38) to the Cu(I) ion in a trigonal planar (triangular) alignment over the course of these simulations. The holo models generated in this way were simulated for t he same amount of simulation time, which provided us with sufficient conformational dynamics and structural information to compare the apo and metal bound versions of the CusB N terminal domain.

PAGE 149

149 Our analysis of the secondary structure assignments, RMSD, a tomic fluctuations of alpha carbons, NMR chemical shift assignments, and various structural coordinates focusing on the more prominent motifs like the central beta sheet and the M21 M36 loop showed that significant disorder was present in both the apo and Cu(I) bound forms of the protein. Metal binding only led to a local ordering around the metal site causing a modest impact on the disorder of the whole domain. To reduce the complexity of this general disorder, the protein was broken up into three subdomai ns utilizing logically selected structural elements. The first subdomain consisted of both tails (residues 1 12, 46 51); the second contained the two beta strands in the center of the structure (residues 13 20, 39 45) and the third subdomain consisted of t he M21 M36 loop in the apo chain along with the Cu(I) ion in the holo version. The RMSD analysis carried out on these regions showed that the first subdomain with the N and C terminal tails showed the most disorder and was not affected by metal binding. Su rprisingly, the biggest structural impact upon Cu(I) binding occurs in the beta motif region, namely the second subdomain, in contrast to our anticipation where the binding site and the loop (the third subdomain) would be influenced the most. This finding suggests that the beta motif region is of utmost structural importance in the function of CusB as a metal chelator in the CusCFBA metal extrusion mechanism. Indeed, our simulations suggest that residues 13 45 would also function in a similar manner to the entire chain since the two tail regions do not form any significant s tructure during the simulations, and this prediction can be tested experimentally. The role of CusB NT for metal delivery offers a conundrum. The concept of preorganization 189 suggests that in order to have tight and selective metal ion binding

PAGE 150

150 the receptor should adopt a structure similar to that of th e complexed structure. Experimentally, CusB NT appears disordered both in the apo and metal loaded forms as determined by NMR 135 Computationally there appears to be a modest level of structural preorganization in CusB NT in residues 13 45, which was also seen for CusF 124 186 188 but it is still remarkable that this protein, which preforms an important and essential function, has characteristics that suggest that it might not sequester Cu(I) from the environment at levels typically seen in Cu(I) binding prote ins. 190 This disorder appears to be carried over into the intact complex as observed in the X ray structures of the CusBA system. 130 191 Perhaps association with CusF induces further order or once metal is bound in the intact complex interactions between the metal binding domain and the intact transporter induces further stabilization. To summarize, our simulations concluded that CusB NT largely remains disordered with only the central region of this protein (residues 13 45) forming secondary structure elements. I nterestingly, this situation is not significantly affected by metal binding in opposition to our initial biases. This study has set the stage to simulate this domain with its functional companion CusF, the metallochaperone of the CusCFBA Ag(I)/Cu(I) pump. Another goal is to attach the CusB N terminal to the remainder of CusB to see how it interacts with the intact copper transport system. The presence of these companion proteins might result in an increased order of this region and further insights into the metal ion transport mechanism.

PAGE 151

151 Figure 5 1. The CusCBFA pump in E. coli The green area represents the cytoplasm, the white area stands for the periplasm, and the red area shows the exterior of the cell. The Cu(I) ion is displayed with a dark orange sp here.

PAGE 152

152 Figure 5 2 The predicted level of order for each residue by the online server SPINE D. The scale for order probability is from +1 to 1 where +1 implies a maximum probability for order, namely complete order, and 1 expresses a maximum probabi lity for disorder, in other words, total disorder. 5101520253035404550 Residue Number -1 -0.8 -0.6 -0.4 -0.2 0 0.2 0.4 Order Probability

PAGE 153

153 Figure 5 3 The Cu(I) bound version of the CusB N terminal region is shown on the left and its apo version on the right. The central beta motif with the two antiparallel beta strands, the loop connecti ng the metal binding residues M21 and M36 and the disordered tail regions are indicated in these sample structures. The short alpha helix on the N terminal end of CusB NT forms very frequently in both the holo and apo versions.

PAGE 154

154 Figure 5 4 The STRID E secondary structure predictions for the (top left) apo CusB NT, MD trajectories; (top right) Cu(I) bound CusB NT, MD trajectories; (bottom left) apo CusB NT, aMD trajectories; (bottom right) Cu(I) bound CusB NT, aMD trajectories. Color scale: 1/green alp ha helix, 2/red beta sheet, 3/white turn, coil or not assigned.

PAGE 155

155 Figure 5 5 Distribution of RMSD values over 10 microseconds of aMD for apo and Cu(I) bound versions of the CusB NT chain. The means and the standard deviations for both versions do not d iffer significantly implying comparable structural mobilities before and after metal binding.

PAGE 156

156 Figure 5 6 Atomic fluctuations for the apo (blue) and Cu(I) bound (red) versions of the N terminal region of CusB. Data were collected over a total of 10 microseconds of aMD simulations from 25 different starting points. Cu(I) binding seems to impact only the metal binding residues, M21, M36 and M38.

PAGE 157

157 Figure 5 7 Average NMR Shifts for backbone N's and their H's for the apo (blue) and Cu(I) bound (red) predicted by SPARTA+ (right) and their standard deviations (left).

PAGE 158

158 Figure 5 8 Distribution of the distances between the two beta sheets forming the central beta motif. The distribution from the apo MD/aMD trajectories are shown in blue and the data fr om the holo MD/aMD trajectories are displayed in red. The inset visualizes this property. In this particular conformational alignment, the distance measures exactly 5.0 . The vast range of distances suggests that we have extensively sampled the available phase space. 010203040 Distance (Angstroms) 0.05 0.1 0.15 0.2 Normalized Population Apo (10 microseconds of aMD) Holo (10 microseconds of aMD ) Apo (5 microseconds of MD) Holo (5 microseconds of MD)

PAGE 159

159 Figure 5 9 The dihedral angle formed by L15, I14, K43 and P42 as a measure of the relative beta sheet positions in the central motif (right). The distribution of this property is shown in the plot over the apo (blue) and holo (red) tr ajectories. -180-120-60060120180 Dihedral Angle 0 0.002 0.004 0.006 0.008 0.01 Normalized Population Holo (5 microseconds of MD) Apo (5 microseconds of MD) Holo (10 microseconds of aMD) Apo (10 microseconds of aMD)

PAGE 160

160 Figure 5 10 The dihedral involving the center of mass of F16 (in the central beta conformation), M21, M36 and the common center of mass of the residues D28, K29 and P30 is shown in the inset. The distribution of this property is shown in the plot over the apo (blue) and holo (red) trajectories. -180-120-60060120180 Dihedral Angle 0.005 0.01 Normalized Population Holo (5 microseconds of MD) Apo (5 microseconds of MD) Holo (10 microseconds of aMD) Apo (10 microseconds of aMD)

PAGE 161

161 Figure 5 11 Distance analysis plot for residues in and adjacent to the big loop connecting residues M21 and M36. The triangles indicate data obtained from the apo trajectories while the dots a re for the holo trajectories. The numbers in the inset are the residue numbers. "O" represents the backbone oxygen, "N" the amide nitrogen and "OG", the side chain oxygen of S33. 23456 Distance (Angstroms) 0 0.2 0.4 0.6 0.8 Normalized Population 19O-22N holo 19O-22N apo 19N-22O holo 19N-22O apo 23O-26N holo 23O-26N apo 27N-24O holo 27N-24O apo 33O-35N holo 33O-35N apo 33OG-36N holo 33OG-36N apo 33OG-35N holo 33OG-35N apo

PAGE 162

162 Figure 5 12 Distance between the Cu(I) ion and the centers of mass of e ach residue constituting the loop connecting M21 to M36 along with error bars. 2224262830323436 Residue Number 8 10 12 14 16 18 20 Distance (Angstroms) MD aMD

PAGE 163

163 Fi gure 5 13 The RMSF profiles of apo and holo CusB N terminal revisited with background color coding. Green is for the tails, grey for the central beta motif and pink the M21 M36 loop. The inset figure shows these three subdomains.

PAGE 164

164 Figure 5 1 4 (right) The N terminal domain of CusB was split into three subdomains to decompose the structural disorder: 1. The tails (green), 2. The central beta motif (grey), 3. The Cu(I ) site and the M21 M36 loop (magenta). (left)The RMSD distributions associated with these subdomains extracted from 10 microseconds of aMD runs on the Cu(I) bound protein: green background subdomain 1, gray background subdomain 2, and pink background sub domain 3. 0.1 0.2 Apo CusB Cu(I) CusB 0.1 0.2 0.3 0.4 Normalized Population 0510152025 RMSD (Angstroms) 0 0.1 0.2 0.3 0.4

PAGE 165

165 CHAPTER 6 MOLECULAR DYNAMICS STUDY OF THE N TERMINUS OF CusB AND THE METALLOCHAPERONE CusF Background The CusCFBA efflux system in E. coli expels Cu(I) and Ag(I) when the concentrations of these heavy metals reach lethal levels within the cell It is comprised of CusA, the inner membrane proton/substrate antiporter of the heavy metal efflux RND family, CusB, the periplasmic protein and CusC, the outer membrane protein 121 123 The Cus system is completed by a n additional fourth component, CusF, which serves as a periplasmic Cu(I)/Ag(I) metallochaperone and is crucial for maximal metal resistance 122 124 The periplasmic protein CusB belongs to the membrane fusion protein (MFP) family 125 It is hypothesized to stabilize the tripartite in termembrane complex by interacting with CusA and CusC Its available crystal structures resulted in identification of f our domains: the membrane proximal, beta barrel, lipoyl and alpha helical domains. However, its N terminus, which contains three conserv ed metal binding Met residue s, could not be resolved with crystallographic techniques because of its disorder 126 1 30 Recent experiments showed that the N terminal region of CusB (CusB NT) is capable of exhibiting metal transfer from CusF by itself, although not as effective ly as the full length CusB, which weakens the hypothesis that CusB behaves simply as a metal c he lator. A structural change should be induced on the entire CusB chain, which results in higher Cu(I)/Ag(I) resistance by possibly increasing the metal ion transfer rate. 135 Yet the retention of the metal transfer ability by truncated CusB led us to carry out molecular dynamics studies concerning the disordered N terminal region of CusB which was extracted from the rest of the CusB protein. 192 We found that some structural elements show up repeatedly over the microsecond time frame both in apo and Cu(I) bound

PAGE 166

166 versions of this protein chain which points to the formation of transient order over this timescale. 192 Additionally CusB and CusF interact in the presence of metal ion. They display similar m etal binding affinities which was determined from equal distribution of the metal ion between them when they were mixed in equimolar c oncentrations in vitro 131 132 Direct metal trans fer between these proteins was also shown experimental ly. 133 Chemical cross linking/mass spectrometry experiments captured the CusB CusF interaction which highlighted the significance of the N terminal region in ter ms of protein protein interactions and the metal transfer 134 We hypothesize that the metal transfer between CusB and CusF might induce motional restraints on the very mobile N terminal region of CusB. To explore the structural impacts of this protein protein interaction, we conducted molecular dynamics studies on two models of the CusB NT / CusF complex bound to Cu(I) ion: the MMM model and the MMH model. In the MMM model, three Met residues ligate the metal, whereas in the MMH model, one of the ligating Met residues is replaced by a His residue. This computational experiment, which involves protein protein docking followed by extensive MD simulations, also brings us one step closer to our ultimate aim of simulating the metal extrusion process through the CusCFBA pump. Con structing the Complex of Cu(I) bound CusB N T erminus and the Metallochaperone CusF Our docking effort yielded two plausible models which could facilitate the transfer of the Cu(I) ion from the metal chaperon CusF to CusB. The two models differ in their metal binding res idues: the first model involves a Cu(I) binding site consisting of the Met residues M36 and M38 of CusB NT, and M49 of CusF while the second model has M36 of CusB NT, H36 of CusF, and M49 of CusF bound to Cu(I). These potential metal

PAGE 167

167 binding sites were bas ed on predictions made by the protein protein docking server HADDOCK (High Ambiguity Driven protein protein DOCKing ) 193 194 It presents an i nformation driven flexible docking approach for the modeling of biomolecular complexes in which it generates models, clusters the generated models, and returns the most populated clusters as docking results. The mos t populated cluster from HADDOCK docking of Cu(I) bound CusB NT 192 with the open conformation of CusF 195 resulted in plausible metal transfer environ ments as shown in Figures 6 1 and 6 2. Both candidates involve residues from both molecules, CusB NT and CusF, which would be expected to be involved in the process of metal transfer. We introduced the Cu(I) ion in the center of mass of the sulfur atoms o f the three Met residues making up the MMM model ( M36 and M38 of CusB NT, and M49 of CusF ) and of the two sulfur atoms of the metal binding Met's in the MMH model along with the epsilon nitrogen of the His residue ( M36 of CusB NT, H36 of CusF, and M49 of C usF ). The needed metal parameters were previously obtained in studies on CusB NT 192 and CusF 186 respectively using the MTK++/MC PB functionality 171 of AmberTools version 1.5. 172 Molecular Dynamics Simulation Methods and Post Simulation Analyses In b oth models all the protein and solvent atoms were treated explicitly. Each model system was solvated with the TIP3P tri angulated water model 49 in a periodically r eplicated rectangular water box whose sides were at least 10  away from the solute atoms. The charge of the model systems were neutralized by addition of Na + ions and charged amino acids were modeled in the protonated states obtained with the H++ protonat ion state server at pH= 7. 100 A similar minimization and equi libration protocol to what we used in our CusB NT simulations 192 was employed: an energy minimization

PAGE 168

168 series of seven stages involving the minimization of only the solvent atoms and the counter ions (stage 1), minimization of the hydrogen atoms (stage 2), minimization of the side chains by gradually decreasing the harmonic positional restraints acting on them (stages 3 6), and finally the energy minimization of the whole system with no positional restraints (stage 7). A total of 53,000 energy minimization steps were performed in total, 2 4,000 of which used the steepest descent protocol. 3 The re maining 29,000 steps utilized the conjugate gradient method for minimization. 3 A two stage equilibration protocol followed the minimization: first, the systems were heated slowly from 0 to 300 K over 200 ps of MD within the canonical ensemble (NVT) by maintaining a weak harmonic restraint on the protein, and then they were simulated for 10 ns at 300 K to check for the stability of the peptide chains after removal of the harmonic restraints at a constant pressure of 1.0 bar for an isobaric, isothermal ensemble (NPT) using Langevin dynamics with a collision frequenc y of 1.0 ps 1 167 Periodic boundary conditions were imposed on the systems during the calculation of non bonded interactions at all minimization and equilibration stages. The lengths of the covalent bonds involving hydrogen were constrained and their interactions were omitted with the SHAKE algorithm while the systems wer e heated. 168 All the restraints on the systems were released before we started with the MD production runs. We then ran 4 !s of conventional MD using the ff99SBildn force field 177 in th e AMBER12 suite 95 Throughout the production simulati ons the SHAKE algorithm was used to constrain covalent bonds involving hydrogen. The Particle mesh Ewald (PME) method was utilized for long range electrostatic interactions. A n 8  non bonded cutoff was applied to limit the direct space sum in PME 178 The temperature of the systems was kept at 300 K with Langevin

PAGE 169

169 dynamics (collision frequency of 1.0 ps 1 ). Frames were collected every 2 ps. A selected subset of these snapshots were employed in distance, angle, dihedral, root mean square deviation (RMSD), root mean square fluctuation (RMSF) analyses with the aid of the ptraj utility in Amber Tools 1.5 and secondary structure assign ments were obtained using the STRIDE program. 180 Results In the course of the simulations of the MMM model, the sulfur atoms of M36 and M38 of CusB NT, M49 of CusF, and the Cu(I) ion presented a distorted tetrahedral geometry with slightly longer bonds than in the optimi zed and the experimental geometries (by 0.1 0.2  on average for both CusB NT and CusF geometries). 133 169 Th e bonding sulfur and nitrogen distances in the metal site of the MMH model remained near the optimized distances for the most part, which we obtained through optimization calculations involving methionine and histidine side chains and a Cu(I) ion at the de nsity functional theory (DFT) QM level which were performed with Gaussian'09 44 using the B3LY P 196 and M06L 30 DFT functionals in conjunction with the double # quality LANL2DZ pseudop otential basis set for Cu(I) 170 and the Pople type basis set 6 31G* for all the othe r atoms 32 The Cu(I) ion and the ligating atoms placed themselves again in a distorted tetra hedron, akin to the MMM case. Disappearing of Disorder in CusB NT The first observation we made on both the MMM and the MMH models was that this complexation provided the CusB NT chain with a considerably increased structural stability. Upon visual inspec tion of the simulations, we observed the beta motif in the center of the CusB NT, which was already present in the s tarting geometries, was preserved throughout the trajectory. The tail regions were the most mobile parts akin to

PAGE 170

170 the apo and holo non comple xed CusB NT runs. We assigned secondary structures to snapshots saved from both trajectories using the STRIDE program. The 400 consecutive frames we employed were separated by 10 ns, yielding the total simulation time of 4 !s. As inferred from visual insp ection, the beta motif in CusB NT was always retained and the tails remained mostly disordered. The M21 M36 loop was also a random coil. On the CusF side, the largest random coil region proved to be the flap region, namely the CusF residues 36 to 48 as see n in Figure 6 3, which corresponds to 75 to 87 in the cumulative numbering which was used in the y axis of the STRIDE matrices. Over the course of 4 !s of MD simulations, an interesting difference arose between the two models, MMM and MMH. In the MMM model the very mobile, flap like M21 M36 loop of the CusB NT interacts with the open flap of CusF and these two are geometrically "locked" for over ~50% of the simulat ion time. The interaction appears to be reversible, because we observed the M21 M36 loop and the CusF flap coming together and separating repeatedly during the course of the simulation. For the metal to be expelled, it must be first fetched by CusF in the periplasm and passed to CusB. This suggests a metal binding site, which first largely involv es CusF components, then a mixture of CusF and CusB components that becomes gradually more dominated by CusB NT components, and finally only CusB residues. Hence, observing first the MMH model with two constituents from CusF and one constituent from CusB a nd then switching to the MMM model with one CusF component and two CusB components is a reasonable pathway. In the hypothesized earlier MMH model, the total chain motion slightly exceeds what it is found in the proposed later MMM model. These differ

PAGE 171

171 consi derably in the extent of motion observed in the M21 M36 loop of the CusB NT where in the MMH model, this loop and the open flap of CusF move freely and do not strongly interact. On the other hand, in the MMM model, an interplay between these two regions no ticeably restrains the motions of the CusB NT M21 M36 loop. The first analysis we carried out was the comparison of the root mean square deviations (RMSD) of the protein chain backbone atoms in the MMM and MMH models. The first frame of each production ru n was used as the reference. The average RMSD for the MMM model amounted to 7.40 , while for the MMH model it was 7.83 . Hence, the overall structural changes are slightly more pronounced in the MMH model as displayed in Figure 6 4. To check whether the se changes were localized or spread along the entire protein complex, we performed a root mean square fluctuations (RMSF) analysis, which indicated that the difference in the RMSDs was really a consequence of local mobility in the MMH model. As seen in Fig ure 6 5, the mobilities of the two models track each other very well, except for the M21 M36 loop of CusB NT which we already observed to interact with the open loop of CusF more strongly in the MMM model. This stronger interaction, which exists in prolong ed time frames, stabilizes the loop and renders it less mobile than in the MMH model which in turn gives rise to higher RMSD values for the MMH model. A closer look at the interaction of the M21 M36 loop of CusB and the open flap of CusF was subsequently c arried out. We developed various structural coordinates to quantify the properties of this interaction. Visual inspection of the simulation along with the RMSF analysis showed that these two regions of CusB and CusF affect each other significantly due to t heir proximity. Thus, the first coordinate we used was the distance

PAGE 172

172 between the centers of mass of the M21 M36 loop of CusB and the CusF flap. At this point we should clarify that we consider the residues from 36 to 48 of CusF as the open flap (75 to 87 in the cumulative numbering). We found 10  to be the most prevalent value for this distance in the MMM model, whereas in the MMH model it averaged around 20 . As a further coordinate, we examined the distance between the alpha carbon atoms of K32 of CusB a nd P45 of CusF (84 in cumulative numbering). These two residues appear to be especially proximal to each other compared to the centers of mass of the loops, which hints that there might be a stronger interaction between them. In the MMM model, their alpha carbons are separated by ~6.0  while in most of the snapshots saved from the MMH run, this separation measures ~10.0 . Figure 6 6 displays these distances throughout the simulation and Figure 6 7 presents the data in the form of histograms. Closer chain s would also correspond to smaller angles of the center of mass of CusB NT's M21 M36 loop, the Cu(I) ion, and the center of mass of CusF's open flap. Indeed, we probed this angle and saw it was much wider for the MMH model than in the MMM model, in which C usB NT's loop is found to interact with CusF's flap. In the MMH model the average value for this property was 93.8¡ in contrast to 50.9¡ for the MMM model. Its fluctuations and population densities are shown in Figures 6 8 and 6 9. We further analyzed thi s loop flap interplay at the level of the molecular interactions present. Previously we identified a hydrogen bonding network at the "hinge" points of the CusB NT M21 M36 loop which facilitated its folding against the core of the structure. A detailed visu al analysis of likely hydrogen bonding partners on the loop and flap showed some hydrogen bonding occurs between these two although it is of various

PAGE 173

173 types unlike the CusB NT case. The stronger hydrogen bonds are formed between one of the guanidinium nitrog ens of R26 of CusB NT and the two side chain oxygen atoms of D46 of CusF and this sometimes leads to a bifurcat ed geometry (Figure 6 10, part A ). 197 In 13% of the saved snapshots, both guanidinium hydrogens of R26 interact with both side chain carboxyl oxygen atoms of D46 to create a salt bridge (Figure 6 10, part B ). The same salt bridge inte ractions exist only in 0.02% of the snapshots saved from the MMH simulation. Additionally, weaker hydrogen bonds might also contribute significantly to the loop flap synergy. The most prominent hydrogen bonds are observed between the side chain oxygen atom of CusB NT's Y22 and the backbone nitrogen of CusF's T48. The side chain oxygen of CusB NT's Y22 switches partners to form another hydrogen bond to the side chain oxygen of CusF's T48 which comes up as the second most common hydrogen bonding interaction. The donor acceptor distances in these interactions are longer than the ones associated with the R26 D46 hydrogen bonds suggesting they are weaker but they occur more frequently. They rarely, only in 0.02% of the saved frames, position themselves in a bide ntate geometry. As shown in Figure 6 11, all the MMH corresponding interactions have distances that are much longer suggesting very little or no stabilization. Another factor which might facilitate this geometrical locking of the CusB NT loop and the CusF flap is the " stacking interaction between the side chains of CusB NT's F35 and CusF's W44 (83 in cumulative numbering), which actively contributes to metal binding in the closed form of CusF through a cation interaction with cation being either Cu(I) or Ag(I) in this case. 186 We measured the distance between the center of mass of the six carbon atoms making up the phenyl ring of CusB NT's F35 and the

PAGE 174

174 center of mass of six carbon atoms constituting the phenyl part of the indole side chain of CusF's W44. This interaction is observed in both the MMM and MMH models, but arises more frequently in the MMM model, wh ich suggests a slightly more favorable interaction. As seen in Figure 6 12, the most prevalent distance appears at 6  for both the MMM and MMH models. According to McGaughey et al. distances below 7.5  stabilize these conformations energetically. 198 56% of the snapshots saved from the MMM simulations have values smaller than or equal to 6, compared to 40% in the MMH simulations. Both parallel and T shaped conformations occur, as shown in Figure 6 13. Conclusions The study focused on the dynamics of the two proposed complexes formed by two constituents of the CusCFBA pump, the N terminal region of the membrane fusion protein CusB and the metallochaperone CusF. This aggregate is of sign ificant importance because it performs a crucial step in metal extrusion from the cell in E. coli : CusF transfers the metal it fetches in the periplasm to CusB, which passes the metal to CusC to be expelled from the cell. The N terminal region of CusB cont ains the metal binding site and w as found to be so disordered that its structure could not be resolved spectroscopically. The general disorder was also confirmed by our earlier MD studies although some key secondary structures were observed. 192 However, the assembly of the N terminal region of the CusB and CusF present an interesting case where the order in the be ta barrel protein CusF seems to be embraced by the originally disordered CusB NT. The secondary structu res, which were present in the starting geometries, were retained throughout the MD simulations, namely in two parallel runs of 4 !s each.

PAGE 175

175 This is not new for CusF but definitely a significant observation for CusB NT in terms of its overall stability. Th e complex structures were obtained by docking the protein chain of holo CusB to the open conformation of Cus F 195 using the protein protein docking web server HADDOCK. The docking pose with th e best score positioned the metal binding residues of both CusB and CusF in an appropriate manner through which we built two model Cu(I) bind ing sites. The first model binding site employed M36 and M38 from CusB NT, and M49 from CusF which more closely res embles the metal site of CusB. The second model binding site was similar to the one from CusF which involves two Met residues (M36 of CusB and M49 of CusF) and one His residue (H36 of CusF). 4 !s of classical MD simulations on each model resulted in differ ences in the mobility of CusB NT's M21 M36 loop and CusF's open flap. In the model ligating the Cu(I) ion with the three Met residues, the M21 M36 loop of CusB NT and the open flap of CusF interact for over most of the simulation time and they become geome trically locked with each other which in turn decreases the extent of their motion. In contrast, the model where two Met residues and one His residue coordinate the Cu(I) ion shows greater mobility in CusB NT's M21 M36 loop and CusF's open flap. The M21 M 36 loop of CusB NT and the open flap of CusF stay together as a result of a hydrogen bonding network and " stacking interactions. The hydrogen bonding network utilizes various types of hydrogen bonds including salt bridges (bifurcated) and paired ones. M ore abundant are the weaker hydrogen bonds that place the donor and acceptor atoms at distances greater than 3. A " stacking interaction is proposed also to play a role in stabilizing this interaction. It is the most abundant interaction between these t wo regions and is found

PAGE 176

176 in both models although it is more common in the model involving three Met residues in its metal site. All these observations lead us to hypothesize that both models might occur in order: first the model with two methionines and one histidine, and then the model with three methionines We hypothesize these r epresent steps of the metal transfer process. This suggests a metal coordination site, which is gradually dominated by CusB residues and is slightly more ordered. The observed sta bility in the rest of the CusB chain is already surprising when in complex with CusF. Some transient order for the CusB NT seems likely while CusF delivers the metal to CusB so that the two proteins align themselves around the metal and exchange the metal ion successfully. The proposed sequence for the transition metal site geometries is in line with this because it also follows an increased structural order in the course of the transfer. The gained transient order in the model with three metal coordinating methionines could be a driving force for the direction of the metal transfer: towards the extracellular milieu through CusB and then CusC. Considering the experimental finding where the metal gets distributed evenly among CusB NT and CusF when introduced to a solution containing equal amounts of these proteins, which implies similar metal binding affinities for CusB and CusF, gaining a higher extent of order during the metal transfer process might determine to which direction the metal ion is moved. In th is case, the model providing more order is dominated by CusB residues and resembles the metal environment of CusB with its three methionines. To complete the transfer from CusF to CusB NT would then proceed from the MMM case to the full CusB NT coordinated mode where CusF's

PAGE 177

177 M49 would be replaced by CusB NT's M21. These two residues are very proximal to each other already in the MMM model facilitating this final transfer step. It is important to note that our initial hypothesis, about the possible disappeara nce of disorder in CusB NT upon its complexation with its functional companion CusF, was supported up by the present simulations. This work might aid in gaining a deeper insight into the process of the metal transfer from CusF to CusB along with other meta l transfer processes involving disordered protein chains.

PAGE 178

178 Figure 6 1 The top hit from the HADDOCK docking of CusB NT (grey) to the open conformation of CusF (red) The inset shows the positioning of the metal binding residues on CusB (black) and CusF (burgundy )

PAGE 179

179 Figure 6 2 The two model metal binding sites were based on the positioning of the labeled residues. The black triangle shows the initial positions of M36 and M38 of CusB NT, and M49 of CusF, in the middle of which we introduced Cu(I) to yi eld the first model binding site, MMM. The metal binding site of the second model, MMH, is comprised of the residues at the corners of the dark red triangle: M36 of CusB NT, and M49 and H75 of CusF.

PAGE 180

180 Figure 6 3 Secondary structure assignments of the C usB NT/CusF complex for both the MMM (upper) and the MMH models (below). Residues 1 51 make up the CusB NT whereas the rest is the CusF. Color coding: green alpha helix, red beta strand, white: randomly coiled or disordered. 20 40 60 80 100 120 50 100 150 200 250 300 350 400 Cumulative Residue Number Frame number Secondary Structure Predictions for Holo CusB-NT/CusF: MMM "matrix.txt" matrix 1 2 3 20 40 60 80 100 120 50 100 150 200 250 300 350 400 Cumulative Residue Number Frame number Secondary Structure Predictions for Holo CusB-NT/CusF: MMH "matrix.txt" matrix 1 2 3

PAGE 181

181 Figure 6 4 The RMSD profil es of the MMM (black) and MMH (red) models over 4 !s of simulation time. Frame separation is 10 ns. 0100200300400 Frame number 0 5 10 15 RMSD (Angstroms) MMM MMH

PAGE 182

182 Figure 6 5 The RMSF profiles of the MMM (black) and MMH (red) models over 4 !s of simulation time. Frame separation is 10 ns. 50100 Residue number 0 5 10 15 20 RMSF (Angstroms) MMM MMH

PAGE 183

183 Figure 6 6 Distan ces of the M21 M36 loop of CusB NT and the open flap of CusF expressed in terms of the separation between their centers of mass (red) and the separation between the CusB NT residue K32 and the CusF residue P45 (black). Solid lines stand for the results fro m the MMM model and the dashed lines represent those from the MMH model. The separation expressed by both of these properties is smaller in the MMM model compared to the MMH model. 0100200300400 Frame number 0 5 10 15 20 25 30 Distance (Angstroms) K32CA-P84CA distance/ MMM Distance between the COM's of the residues 22-35 and 75-87/ MMM K32CA-P84CA distance/ MMH Distance between the COM's of the residues 22-35 and 75-87/ MMH

PAGE 184

184 Figure 6 7 Histograms for the d istances of the M21 M36 loop of CusB NT and the open flap of CusF expressed in terms of the separation between their centers of mass (red) and the separation between the CusB NT residue K32 and the CusF residue P45 (black). Solid lines stand for the results from the MMM model and the dashed li nes represent those from the MMH model. 510152025 Distance (Angstroms) 0 0.1 0.2 0.3 0.4 Normalized Population Density K32CA-P84CA distance/ MMM Distance between the COM'S of the residues 22-35 and 75-87/MMM K32CA-P84CA distance/ MMH Distance between the COM's of the residues 22-35 and 75-87/ MMH

PAGE 185

185 Figure 6 8. The angle involving the center of mass of Cus B NT's M21 M36 loop, C u(I), and the center of mass of CusF's open flap over the entire production run of 4 !s. Black represents the values from the MMM mod el and red from the MMH model. 0100200300400 Frame number 50 100 Angle MMM MMH

PAGE 186

186 Figure 6 9 The histograms for the angle involving the center of mass of CusB NT's M21 M36 loop, cu(I), and the center of mass of CusF's open flap over the entire production run of 4 !s. Black represents the values from the MMM model and red from the MMH model. 50100 Angle 0 0.02 0.04 0.06 0.08 Normalized Population Density MMM MMH

PAGE 187

187 Figure 6 10 A) Bifurcated and B) paired hydrogen bonding interactions between CusB NT's R26 and CusF's D46. CusB NT is displayed in grey and CusF is shown in violet. A B

PAGE 188

188 Figure 6 11. Distances of possible hy drogen bonding partners on the CusB NT M21 M36 loop and the CusF open flap. Solid lines express the probed distances in the MMM model and the dashed lines the ones in the MMH model. Involved residues are the following: K32 of CusB NT ("32"), D46 of CusF (" 85"), N43 of CusF ("82"), R26 of CusB NT ("26"), S33 of CusB NT ("33"), W44 of CusF ("83"), Y22 of CusB NT ("22"), and T48 of CusF ("87"). "NZ", "NH1", "NH2", and "NE1" repres ent side chain nitrogens while "OE1", "OE2", "OG", and "OH" stand for side chai n oxygens. "N" is a backbone nitrog en and "O" is a backbone oxygen. 2468101214 Distance (Angstroms) 0 0.1 0.2 0.3 0.4 0.5 Normalized Population Density 32NZ-85OE1 32NZ-85OE2 32NZ-82O 26NH1-85OE1 26NH1-85OE2 33N-82O 33OG-83NE1 26NH2-85OE1 26NH2-85OE2 22OH-87N 22OH-87OG1

PAGE 189

189 Figure 6 12. H istograms for the distance between the centroids of the phenyl ring of CusB NT's F35 and of the phenyl part of the indole ring of CusF's W44. Results for the MMM model are shown in black and for the MMH in red. 2468101214 Distance (Angstroms) 0 0.1 0.2 0.3 0.4 0.5 Normalized Population Density MMM MMH

PAGE 190

190 Figure 6 13 " stacking interaction between the side chains of CusB NT's F35 and CusF's W44. A) The common parallel and B) T shaped configurations are both observed. CusB NT is shown in grey and CusF is displayed in violet. A B

PAGE 191

191 LIST OF REFERENCES 1 Nobelprize.org "The Nobel Prize in Chemistry 2013 Press Release". http://www.nobelprize.org/nobel_prizes/chemistry/laureates/2013/press.html (accessed 29 January 2014). 2. Cramer, C. J., Essentials of Computational Chemistry: Theories and Models Second ed.; John Wiley & Sons Ltd.: The Atrium, Southern Gate, Chichester, West Sussex, England, 2004. 3. Leach, A. R., Molecular Modelling: Principles and Applications Second ed.; Pearson Education Limited: Essex, 2001. 4. Cornell, W. D.; Cieplak, P.; Bayly, C. I.; Gould, I. R.; Merz, K. M.; Ferguson, D. M.; Spellmeyer, D. C.; Fox, T.; Caldwell, J. W.; Kollman, P. A. J. Am. Chem. Soc. 1995, 117 5179. 5. Wang, J. M.; Wan g, W.; Kollman, P. A.; Case, D. A. J. Mol. Graphics Modell. 2006, 25 247. 6. Wang, J. M.; Wolf, R. M.; Caldwell, J. W.; Kollman, P. A.; Case, D. A. J. Comput. Chem. 2004, 25 1157. 7. Makara, G. M. J. Med. Chem. 2007, 50 3214. 8. Alex, A. A.; Flocco, M. M. Curr. Top. Med. Chem. 2007, 7 1544. 9. Hajduk, P. J.; Greer, J. Nat. Rev. Drug Discovery 2007, 6 211. 10. Brooijmans, N.; Kuntz, I. D. Annu. Rev. Biophys. Biomol. Struct. 2003, 32 335. 11. Congreve, M.; Chessari, G.; Tisi, D.; Woodhead, A. J. J. Med. Chem. 2008, 51 3661. 12. Guvench, O.; MacKerell, A. D. Curr. Opin. Struct. Biol. 2009, 19 56. 13. Kolb, P.; Irwin, J. J. Curr. Top. Med. Chem. 2009, 9 755. 14. Leach, A. R.; Shoichet, B. K.; Peishoff, C. E. J. Med. Chem. 2006, 49 5851. 15. Morra, G.; Genoni, A.; Neves, M. A. C.; Merz, K. M.; Colombo, G. Curr. Med. Chem. 2010, 17 25. 16. Chen, Z. G.; Li, Y.; Chen, E.; Hall, D. L.; Darke, P. L.; Culberson, C.; Shafer, J. A.; Kuo, L. C. J. Biol. Chem. 1994, 269 26344.

PAGE 192

192 17. Dorsey, B. D.; Levin, R. B.; Mc Daniel, S. L.; Vacca, J. P.; Guare, J. P.; Darke, P. L.; Zugay, J. A.; Emini, E. A.; Schleif, W. A.; Quintero, J. C.; Lin, J. H.; Chen, I. W.; Holloway, M. K.; Fitzgerald, P. M. D.; Axel, M. G.; Ostovic, D.; Anderson, P. S.; Huff, J. R. J. Med. Chem. 1994, 37 3443. 18. Vacca, J. P.; Dorsey, B. D.; Schleif, W. A.; Levin, R. B.; McDaniel, S. L.; Darke, P. L.; Zugay, J.; Quintero, J. C.; Blahy, O. M.; Roth, E.; Sardana, V. V.; Schlabach, A. J.; Graham, P. I.; Condra, J. H.; Gotlib, L.; Holloway, M. K.; Lin, J .; Chen, I. W.; Vastag, K.; Ostovic, D.; Anderson, P. S.; Emini, E. A.; Huff, J. R. Proc. Natl. Acad. Sci. U. S. A. 1994, 91 4096. 19. Xantheas, S. S. J. Chem. Phys. 1994, 100 7523. 20. Benos, P. V.; Bulyk, M. L.; Stormo, G. D. Nucleic Acids Res. 2002, 3 0 4442. 21. Shimizu, S.; Chan, H. S. J. Chem. Phys. 2001, 115 1414. 22. Horovitz, A. Folding Des. 1996, 1 R121. 23. Carter, P. J.; Winter, G.; Wilkinson, A. J.; Fersht, A. R. Cell 1984, 38 835. 24. Horovitz, A.; Rigbi, M. J. Theor. Biol. 1985, 116 149 25. Wells, J. A. Biochemistry 1990, 29 8509. 26. Schreiber, G.; Fersht, A. R. J. Mol. Biol. 1995, 248 478. 27. Mark, A. E.; Vangunsteren, W. F. J. Mol. Biol. 1994, 240 167. 28. Baum, B.; Muley, L.; Smolinski, M.; Heine, A.; Hangauer, D.; Klebe, G. J. Mol. Biol. 2010, 397 1042. 29. Olejniczak, E. T.; Hajduk, P. J.; Marcotte, P. A.; Nettesheim, D. G.; Meadows, R. P.; Edalji, R.; Holzman, T. F.; Fesik, S. W. J. Am. Chem. Soc. 1997, 119 5828. 30. Zhao, Y.; Truhlar, D. G. J. Chem. Phys. 2006, 125 194101. 31. Faver, J. C.; Benson, M. L.; He, X. A.; Roberts, B. P.; Wang, B.; Marshall, M. S.; Kennedy, M. R.; Sherrill, C. D.; Merz, K. M. J. Chem. Theory Comput. 2011, 7 790. 32. Ditchfie, R.; Hehre, W. J.; Pople, J. A. J. Chem. Phys. 1971, 54 724. 33. Wheele r, S. E.; Houk, K. N. J. Chem. Theory Comput. 2010, 6 395. 34. Stewart, J. J. P. J. Mol. Model. 2007, 13 1173. 35. Fanfrlik, J.; Bronowska, A. K.; Rezac, J.; Prenosil, O.; Konvalinka, J.; Hobza, P. J. Phys. Chem. B 2010, 114 12666.

PAGE 193

193 36. Korth, M.; Pitona k, M.; Rezac, J.; Hobza, P. J. Chem. Theory Comput. 2010, 6 344. 37. Rezac, J.; Fanfrlik, J.; Salahub, D.; Hobza, P. J. Chem. Theory Comput. 2009, 5 1749. 38. Word, J. M.; Lovell, S. C.; Richardson, J. S.; Richardson, D. C. J. Mol. Biol. 1999, 285 1735. 39. Hornak, V.; Abel, R.; Okur, A.; Strockbine, B.; Roitberg, A.; Simmerling, C. Proteins: Struct., Funct., Bioinf. 2006, 65 712. 40. Faver, J. C. Biomolecular Fragment Database (BFDb). via http://www.merzgroup.org (accessed 04/01/2013). 41. Bacskay, G. B. Chem. Phys. 1981, 61 385. 42. Rabuck, A. D.; Scuseria, G. E. J. Chem. Phys. 1999, 110 695. 43. Boys, S. F.; Bernardi, F. Mol. Phys. 1970, 19 553. 44. Frisch, M. J. T., G. W.; Schlegel, H. B.; Scuseria, G. E.; Robb, M. A.; Cheeseman, J. R.; Scalmani, G.; Barone, V.; Mennucci, B.; Petersson, G. A.; Nakatsuji, H.; Caricato, M.; Li, X.; Hratchian, H. P.; Izmaylov, A. F.; Bloino, J.; Zheng, G.; Sonnenberg, J. L.; Hada, M.; Ehara, M.; Toyota, K.; Fukuda, R.; Hasegawa J.; Ishida, M.; Nakajima, T.; Honda, Y.; Kitao, O.; Nakai, H.; Vreven, T.; Montgomery, Jr., J. A.; Peralta, J. E.; Ogliaro, F.; Bearpark, M.; Heyd, J. J.; Brothers, E.; Kudin, K. N.; Staroverov, V. N.; Kobayashi, R.; Normand, J.; Raghavachari, K.; Rendel l, A.; Burant, J. C.; Iyengar, S. S.; Tomasi, J.; Cossi, M.; Rega, N.; Millam, N. J.; Klene, M.; Knox, J. E.; Cross, J. B.; Bakken, V.; Adamo, C.; Jaramillo, J.; Gomperts, R.; Stratmann, R. E.; Yazyev, O.; Austin, A. J.; Cammi, R.; Pomelli, C.; Ochterski, J. W.; Martin, R. L.; Morokuma, K.; Zakrzewski, V. G.; Voth, G. A.; Salvador, P.; Dannenberg, J. J.; Dapprich, S.; Daniels, A. D.; Farkas, Â….; Foresman, J. B.; Ortiz, J. V.; Cioslowski, J.; Fox, D. J. Gaussian 09, Revision A.02 Gaussian, Inc.: Wallingfor d CT, 2009. 45. Stewart, J. J. P. MOPAC2009 Stewart Computational Chemistry: Colorado Springs, CO, USA, 2008. 46. Humphrey, W.; Dalke, A.; Schulten, K. J. Mol. Graphics 1996, 14 33. 47. Team, R. D. C. R: A Language and Environment for Statistical Computi ng R Foundation for Statistical Computing: Vienna, Austria, 2010. 48. Matsuoka, O.; Clementi, E.; Yoshimine, M. J. Chem. Phys. 1976, 64 1351. 49. Jorgensen, W. L.; Chandrasekhar, J.; Madura, J. D.; Impey, R. W.; Klein, M. L. J. Chem. Phys. 1983, 79 926.

PAGE 194

194 50. Kistenmacher, H.; Lie, G. C.; Popkie, H.; Clementi, E. J. Chem. Phys. 1974, 61 546. 51. Jorgensen, W. L.; Tiradorives, J. J. Am. Chem. Soc. 1988, 110 1657. 52. Faver, J. C.; Yang, W.; Merz, K. M. J. Chem. Theory Comput. 2012, 8 3769. 53. Gilson, M. K.; Zhou, H. X. Annu. Rev. Biophys. Biomol. Struct. 2007, 36 21. 54. Halperin, I.; Ma, B. Y.; Wolfson, H.; Nussinov, R. Proteins: Struct., Funct., Genet. 2002, 47 409. 55. Kitchen, D. B.; Decornez, H.; Furr, J. R.; Bajorath, J. Nat. Rev. Drug Discovery 2004, 3 935. 56. Sotriffer, C. A. Curr. Top. Med. Chem. 2011, 11 179. 57. Yuriev, E.; Agostino, M.; Ramsland, P. A. J. Mol. Recognit. 2011, 24 149. 58. Yuriev, E.; Ramsland, P. A. J. Mol. Recognit. 2013, 26 215. 59. Kollman, P. A.; Massova, I.; Reyes, C.; Kuhn, B.; Huo, S. H.; Chong, L.; Lee, M.; Lee, T.; Duan, Y.; Wang, W.; Donini, O.; Cieplak, P.; Srinivasan, J.; Case, D. A.; Cheatham, T. E. Acc. Chem. Res. 2000, 33 889. 60. Kuhn, B.; Kollman, P. A. J. Med. Chem. 2000, 43 3786. 61. Li, Y.; Liu, Z. H .; Wang, R. X. J. Chem. Inf. Model. 2010, 50 1682. 62. Rastelli, G.; Del Rio, A.; Degliesposti, G.; Sgobba, M. J. Comput. Chem. 2010, 31 797. 63. Homeyer, N.; Gohlke, H. Mol. Inf. 2012, 31 114. 64. Chodera, J. D.; Mobley, D. L.; Shirts, M. R.; Dixon, R. W.; Branson, K.; Pande, V. S. Curr. Opin. Struct. Biol. 2011, 21 150. 65. Christ, C. D.; Mark, A. E.; van Gunsteren, W. F. J. Comput. Chem. 2010, 31 1569. 66. Rodinger, T.; Pomes, R. Curr. Opin. Struct. Biol. 2005, 15 164. 67. Shirts, M. R.; Mobley, D. L.; Chodera, J. D. Annu. Rep. Comput. Chem. 2007, 3 41. 68. Woo, H. J.; Roux, B. Proc. Natl. Acad. Sci. U. S. A. 2005, 102 6825. 69. Mobley, D. L.; Chodera, J. D.; Dill, K. A. J. Chem. Phys. 2006, 125

PAGE 195

195 70. Wang, J. Y.; Deng, Y. Q.; Roux, B. Biophys. J. 2006, 91 2798. 71. Deng, Y. Q.; Roux, B. J. Chem. Theory Comput. 2006, 2 1255. 72. Huang, Y. M. M.; Chen, W.; Potter, M. J.; Chang, C. E. A. Biophys. J. 2012, 103 342. 73. Chen, W.; Gilson, M. K.; Webb, S. P.; Potter, M. J. J. Chem. Theory Comput. 2010, 6 3540. 74. Buch, I.; Giorgino, T.; De Fabritiis, G. Proc. Natl. Acad. Sci. U. S. A. 2011, 108 10184. 75. Giorgino, T.; De Fabritiis, G. J. Chem. Theory Comput. 2011, 7 1943. 76. Shan, Y. B.; Kim, E. T.; Eastwood, M. P.; Dror, R. O.; Seeliger, M. A.; S haw, D. E. J. Am. Chem. Soc. 2011, 133 9181. 77. Brannigan, G.; LeBard, D. N.; Henin, J.; Eckenhoff, R. G.; Klein, M. L. Proc. Natl. Acad. Sci. U. S. A. 2010, 107 14122. 78. Buch, I.; Harvey, M. J.; Giorgino, T.; Anderson, D. P.; De Fabritiis, G. J. Chem Inf. Model. 2010, 50 397. 79. Wu, C.; Biancalana, M.; Koide, S.; Shea, J. E. J. Mol. Biol. 2009, 394 627. 80. Buch, I.; Sadiq, S. K.; De Fabritiis, G. J. Chem. Theory Comput. 2011, 7 1765. 81. Essex, J. W.; Severance, D. L.; TiradoRives, J.; Jorgensen W. L. J. Phys. Chem. B 1997, 101 9663. 82. Gumbart, J. C.; Roux, B.; Chipot, C. J. Chem. Theory Comput. 2013, 9 794. 83. Genheden, S.; Ryde, U. Phys. Chem. Chem. Phys. 2012, 14 8662. 84. Hou, T. J.; Wang, J. M.; Li, Y. Y.; Wang, W. J. Chem. Inf. Model 2011, 51 69. 85. Singh, N.; Warshel, A. Proteins: Struct., Funct., Bioinf. 2010, 78 1705. 86. Boresch, S.; Tettinger, F.; Leitgeb, M.; Karplus, M. J. Phys. Chem. B 2003, 107 9535. 87. Deng, Y. Q.; Roux, B. J. Phys. Chem. B 2009, 113 2234. 88. Hermans J.; Wang, L. J. Am. Chem. Soc. 1997, 119 2707. 89. Mobley, D. L.; Graves, A. P.; Chodera, J. D.; McReynolds, A. C.; Shoichet, B. K.; Dill, K. A. J. Mol. Biol. 2007, 371 1118.

PAGE 196

196 90. Purisima, E. O.; Hogues, H. J. Phys. Chem. B 2012, 116 6872. 91. Morton, A.; Baase, W. A.; Matthews, B. W. Biochemistry 1995, 34 8564. 92. Morton, A.; Matthews, B. W. Biochemistry 1995, 34 8576. 93. Wei, B. Q. Q.; Baase, W. A.; Weaver, L. H.; Matthews, B. W.; Shoichet, B. K. J. Mol. Biol. 2002, 322 339. 94. Graves, A. P.; B renk, R.; Shoichet, B. K. J. Med. Chem. 2005, 48 3714. 95. Case, D. A.; Darden, T. A.; T.E. Cheatham, I.; Simmerling, C. L.; Wang, J.; Duke, R. E.; Luo, R.; Walker, R. C.; Zhang, W.; Merz, K. M.; Roberts, B.; Hayik, S.; Roitberg, A.; Seabra, G.; Swails, J .; Goetz, A. W.; Kolossvai, I.; Wong, K. F.; Paesani, F.; Vanicek, J.; Wolf, R. M.; Liu, J.; Wu, X.; Brozell, S. R.; Steinbrecher, T.; Gohlke, H.; Cai, Q.; Ye, X.; Wang, J.; Hsieh, M. J.; Cui, G.; Roe, D. R.; Mathews, D. H.; Seetin, M. G.; Salomon Ferrer, R.; Sagui, C.; Babin, V.; Luchko, T.; Gusarov, S.; Kovalenko, A.; Kollman, P. A. AMBER 12 University of California, San Francisco: 2012. 96. Stewart, J. J. P. MOPAC2012 Stewart Computational Chemistry: Colorado Springs, CO, USA, 2012. 97. Tsui, V.; Case, D. A. Biopolymers 2001, 56 275. 98. Klamt, A.; Schuurmann, G. J. Chem. Soc., Perkin Trans. 2 1993 799. 99. Marenich, A. V.; Cramer, C. J.; Truhlar, D. G. J. Phys. Chem. B 2009, 113 6378. 100. Gordon, J. C.; Myers, J. B.; Folta, T.; Shoja, V.; Heath, L. S.; Onufriev, A. Nucleic Acids Res. 2005, 33 W368. 101. Woon, D. E.; Dunning, T. H. J. Chem. Phys. 1993, 98 1358. 102. Jakalian, A.; Bush, B. L.; Jack, D. B.; Bayly, C. I. J. Comput. Chem. 2000, 21 132. 103. Jakalian, A.; Jack, D. B.; Bayly, C. I. J. C omput. Chem. 2002, 23 1623. 104. Bartlett, R. J.; Musial, M. Rev. Mod. Phys. 2007, 79 291. 105. Halkier, A.; Helgaker, T.; Jorgensen, P.; Klopper, W.; Koch, H.; Olsen, J.; Wilson, A. K. Chem. Phys. Lett. 1998, 286 243. 106. Wonnacott, R. J.; Wonnacott, T. H., Introductory Statistics John Wiley & Sons, Inc.: New York, NY, 1985. 107. Faver, J. C.; Benson, M. L.; He, X.; Roberts, B. P.; Wang, B.; Marshall, M. S.; Sherrill, C. D.; Merz, K. M. PLoS One 2011, 6 e18868.

PAGE 197

197 108. Weiser, J.; Shenkin, P. S.; Still, W. C. J. Comput. Chem. 1999, 20 217. 109. Marenich, A. V.; Kelly, C. P.; Thompson, J. D.; Hawkins, G. D.; Chambers, C. C.; Giesen, D. J.; Winget, P.; Cramer, C. J.; Truhlar, D. G., Minnesota Solvation Database version 2012. University of Minnesota Minne apolis, 2012. 110. Friesner, R. A.; Banks, J. L.; Murphy, R. B.; Halgren, T. A.; Klicic, J. J.; Mainz, D. T.; Repasky, M. P.; Knoll, E. H.; Shelley, M.; Perry, J. K.; Shaw, D. E.; Francis, P.; Shenkin, P. S. J. Med. Chem. 2004, 47 1739. 111. Friesner, R. A.; Murphy, R. B.; Repasky, M. P.; Frye, L. L.; Greenwood, J. R.; Halgren, T. A.; Sanschagrin, P. C.; Mainz, D. T. J. Med. Chem. 2006, 49 6177. 112. Halgren, T. A.; Murphy, R. B.; Friesner, R. A.; Beard, H. S.; Frye, L. L.; Pollard, W. T .; Banks, J. L. J. Med. Chem. 2004, 47 1750. 113. Zhang, Y.; Gladyshev, V. N. Chem. Rev. 2009, 109 4828. 114. Macomber, L.; Rensing, C.; Imlay, J. A. J. Bacteriol. 2007, 189 1616. 115. Poole, K.; Srikumar, R. Curr. Top. Med. Chem. 2001, 1 59. 116. Pidd ock, L. J. V. Nat. Rev. Microbiol. 2006, 4 629. 117. Yang, S.; Lopez, C. R.; Zechiedrich, E. L. Proc. Natl. Acad. Sci. U. S. A. 2006, 103 2386. 118. Tseng, T. T.; Gratwick, K. S.; Kollman, J.; Park, D.; Nies, D. H.; Goffeau, A.; Saier, M. H. J. Mol. Micr obiol. Biotechnol. 1999, 1 107. 119. Dinh, T.; Paulsen, I. T.; Saier, M. H. J. Bacteriol. 1994, 176 3825. 120. Zgurskaya, H. I.; Nikaido, H. Mol. Microbiol. 2000, 37 219. 121. Franke, S.; Grass, G.; Nies, D. H. Microbiology (Reading, U.K.) 2001, 147 96 5. 122. Franke, S.; Grass, G.; Rensing, C.; Nies, D. H. J. Bacteriol. 2003, 185 3804. 123. Munson, G. P.; Lam, D. L.; Outten, F. W.; O'Halloran, T. V. J. Bacteriol. 2000, 182 5864. 124. Loftin, I. R.; Franke, S.; Roberts, S. A.; Weichsel, A.; Heroux, A.; Montfort, W. R.; Rensing, C.; McEvoy, M. M. Biochemistry 2005, 44 10533. 125. Saier, M. H.; Tam, R.; Reizer, A.; Reizer, J. Mol. Microbiol. 1994, 11 841. 126. Su, C. C.; Yang, F.; Long, F.; Reyon, D.; Routh, M. D.; Kuo, D. W.; Mokhtari, A. K.; Van Ornam J. D.; Rabe, K. L.; Hoy, J. A.; Lee, Y. J.; Rajashankar, K. R.; Yu, E. W. J. Mol. Biol. 2009, 393 342.

PAGE 198

198 127. Tikhonova, E. B.; Zgurskaya, H. I. J. Biol. Chem. 2004, 279 32116. 128. Zgurskaya, H. I.; Nikaido, H. Proc. Natl. Acad. Sci. U. S. A. 1999, 96 7190. 129. Zgurskaya, H. I.; Nikaido, H. J. Bacteriol. 2000, 182 4264. 130. Su, C. C.; Long, F.; Zimmermann, M. T.; Rajashankar, K. R.; Jernigan, R. L.; Yu, E. W. Nature 2011, 470 558. 131. Bagai, I.; Liu, W.; Rensing, C.; Blackburn, N. J.; McEvoy, M. M. J. Biol. Chem. 2007, 282 35695. 132. Kittleson, J. T.; Loftin, I. R.; Hausrath, A. C.; Engelhardt, K. P.; Rensing, C.; McEvoy, M. M. Biochemistry 2006, 45 11096. 133. Bagai, I.; Rensing, C.; Blackburn, N. J.; McEvoy, M. M. Biochemistry 2008, 47 11408. 134. Mealman, T. D.; Bagai, I.; Singh, P.; Goodlett, D. R.; Rensing, C.; Zhou, H.; Wysocki, V. H.; McEvoy, M. M. Biochemistry 2011, 50 2559. 135. Mealman, T. D.; Zhou, M. W.; Affandi, T.; Chacon, K. N.; Aranguren, M. E.; Blackburn, N. J.; Wysocki, V. H.; McEvoy, M. M. Biochemistry 2012, 51 6767. 136. Akama, H.; Matsuura, T.; Kashiwagi, S.; Yoneyama, H.; Narita, S. I.; Tsukihara, T.; Nakagawa, A.; Nakae, T. J. Biol. Chem. 2004, 279 25939. 137. Higgins, M. K.; Bokma, E.; Koronakis, E.; Hughes, C.; Koronaki s, V. Proc. Natl. Acad. Sci. U. S. A. 2004, 101 9994. 138. Yum, S. W.; Xu, Y. B.; Piao, S. F.; Sim, S. H.; Kim, H. M.; Jo, W. S.; Kim, K. J.; Kweon, H. S.; Jeong, M. H.; Jeon, H. S.; Lee, K.; Ha, N. C. J. Mol. Biol. 2009, 387 1286. 139. Dill, K. A.; MacC allum, J. L. Science 2012, 338 1042. 140. Beauchamp, K. A.; Ensign, D. L.; Das, R.; Pande, V. S. Proc. Natl. Acad. Sci. U. S. A. 2011, 108 12734. 141. Bowman, G. R.; Beauchamp, K. A.; Boxer, G.; Pande, V. S. J. Chem. Phys. 2009, 131 124101. 142. Bowman, G. R.; Pande, V. S. Proc. Natl. Acad. Sci. U. S. A. 2010, 107 10890. 143. Bowman, G. R.; Voelz, V. A.; Pande, V. S. Curr. Opin. Struct. Biol. 2011, 21 4. 144. Duan, Y.; Kollman, P. A. Science 1998, 282 740.

PAGE 199

199 145. Freddolino, P. L.; Harrison, C. B.; Liu, Y. X.; Schulten, K. Nat. Phys. 2010, 6 751. 146. Jayachandran, G.; Vishal, V.; Pande, V. S. J. Chem. Phys. 2006, 124 147. Lei, H. X.; Wu, C.; Liu, H. G.; Duan, Y. Proc. Natl. Acad. Sci. U. S. A. 2007, 104 4925. 148. Pande, V. S.; Baker, I.; Chapman, J. ; Elmer, S. P.; Khaliq, S.; Larson, S. M.; Rhee, Y. M.; Shirts, M. R.; Snow, C. D.; Sorin, E. J.; Zagrovic, B. Biopolymers 2003, 68 91. 149. Piana, S.; Lindorff Larsen, K.; Shaw, D. E. Proc. Natl. Acad. Sci. U. S. A. 2012, 109 17845. 150. Voelz, V. A.; J ager, M.; Yao, S. H.; Chen, Y. J.; Zhu, L.; Waldauer, S. A.; Bowman, G. R.; Friedrichs, M.; Bakajin, O.; Lapidus, L. J.; Weiss, S.; Pande, V. S. J. Am. Chem. Soc. 2012, 134 12565. 151. Juraszek, J.; Bolhuis, P. G. Proc. Natl. Acad. Sci. U. S. A. 2006, 103 15859. 152. Lei, H. X.; Wang, Z. X.; Wu, C.; Duan, Y. J. Chem. Phys. 2009, 131 153. Ozkan, S. B.; Wu, G. A.; Chodera, J. D.; Dill, K. A. Proc. Natl. Acad. Sci. U. S. A. 2007, 104 11987. 154. Paschek, D.; Nymeyer, H.; Garcia, A. E. J. Struct. Biol. 2007 157 524. 155. Qiu, L. L.; Pabit, S. A.; Roitberg, A. E.; Hagen, S. J. J. Am. Chem. Soc. 2002, 124 12952. 156. Zhou, R. H. Proc. Natl. Acad. Sci. U. S. A. 2003, 100 13280. 157. Baumketner, A.; Shea, J. E. J. Mol. Biol. 2007, 366 275. 158. Babu, M. M.; Kriwacki, R. W.; Pappu, R. V. Science 2012, 337 1460. 159. Xu, D.; Zhang, Y. Proteins: Struct., Funct., Bioinf. 2012, 80 1715. 160. Roy, A.; Kucukural, A.; Zhang, Y. Nat. Protoc. 2010, 5 725. 161. Zhang, Y. BMC Bioinf. 2008, 9 162. Yang, Y. D.; Faragg i, E.; Zhao, H. Y.; Zhou, Y. Q. Bioinformatics 2011, 27 2076. 163. Kinch, L.; Shi, S. Y.; Cong, Q.; Cheng, H.; Liao, Y. X.; Grishin, N. V. Proteins: Struct., Funct., Bioinf. 2011, 79 59.

PAGE 200

200 164. Xu, D.; Zhang, J.; Roy, A.; Zhang, Y. Proteins: Struct., Funct ., Bioinf. 2011, 79 147. 165. Yang, J. Y.; Roy, A.; Zhang, Y. Nucleic Acids Res. 2013, 41 D1096. 166. Zhang, Y. Proteins: Struct., Funct., Bioinf. 2009, 77 100. 167. Allen, M. P.; Tildesley, D. J., Computer Simulations of Liquids Clarendon Press: Oxfor d, UK, 1987. 168. Ryckaert, J. P.; Ciccotti, G.; Berendsen, H. J. C. J. Comput. Phys. 1977, 23 327. 169. Kim, E. H.; Rensing, C.; McEvoy, M. M. Nat. Prod. Rep. 2010, 27 711. 170. Hay, P. J.; Wadt, W. R. J. Chem. Phys. 1985, 82 299. 171. Peters, M. B.; Y ang, Y.; Wang, B.; Fusti Molnar, L.; Weaver, M. N.; Merz, K. M. J. Chem. Theory Comput. 2010, 6 2935. 172. Case, D. A.; Darden, T. A.; T.E. Cheatham, I.; Simmerling, C. L.; Wang, J.; Duke, R. E.; Luo, R.; Walker, R. C.; Zhang, W.; Merz, K. M.; Roberts, B. ; Wang, B.; Hayik, S.; Roitberg, A.; Seabra, G.; Kolossvai, I.; Wong, K. F.; Paesani, F.; Vanicek, J.; Liu, J.; Wu, X.; Brozell, S. R.; Steinbrecher, T.; Gohlke, H.; Cai, Q.; Ye, X.; Wang, J.; Hsieh, M. J.; Cui, G.; Roe, D. R.; Mathews, D. H.; Seetin, M. G .; Sagui, C.; Babin, V.; Luchko, T.; Gusarov, S.; Kovalenko, A.; Kollman, P. A. AMBER 11 University of California, San Francisco: 2010. 173. Zhao, Y.; Truhlar, D. G. Theor. Chem. Acc. 2008, 120 215. 174. Bayly, C. I.; Cieplak, P.; Cornell, W. D.; Kollman P. A. J. Phys. Chem. 1993, 97 10269. 175. Cieplak, P.; Cornell, W. D.; Bayly, C.; Kollman, P. A. J. Comput. Chem. 1995, 16 1357. 176. Pierce, L. C. T.; Salomon Ferrer, R.; de Oliveira, C. A. F.; McCammon, J. A.; Walker, R. C. J. Chem. Theory Comput. 20 12, 8 2997. 177. Lindorff Larsen, K.; Piana, S.; Palmo, K.; Maragakis, P.; Klepeis, J. L.; Dror, R. O.; Shaw, D. E. Proteins: Struct., Funct., Bioinf. 2010, 78 1950. 178. York, D. M.; Darden, T. A.; Pedersen, L. G. J. Chem. Phys. 1993, 99 8345. 179. She n, Y.; Bax, A. J. Biomol. NMR 2010, 48 13. 180. Heinig, M.; Frishman, D. Nucleic Acids Res. 2004, 32 W500. 181. T. Williams; Kelley, C. Gnuplot 4.4; 2010.

PAGE 201

201 182. Duan, L. L.; Mei, Y.; Zhang, D. W.; Zhang, Q. G.; Zhang, J. Z. H. J. Am. Chem. Soc. 2010, 132 11159. 183. Williams, S.; Causgrove, T. P.; Gilmanshin, R.; Fang, K. S.; Callender, R. H.; Woodruff, W. H.; Dyer, R. B. Biochemistry 1996, 35 691. 184. Uversky, V. N. Protein Sci. 2013, 22 185. Chakravorty, D. K.; Wang, B.; Lee, C. W.; Giedroc, D. P.; Merz, K. M. J. Am. Chem. Soc. 2012, 134 3367. 186. Chakravorty, D. K.; Wang, B.; Ucisik, M. N.; Merz, K. M. J. Am. Chem. Soc. 2011, 133 19330. 187. Loftin, I. R.; Franke, S.; Blackburn, N. J.; McEvoy, M. M. Protein Sci. 2007, 16 2287. 188. Xue, Y.; Davi s, A. V.; Balakrishnan, G.; Stasser, J. P.; Staehlin, B. M.; Focia, P.; Spiro, T. G.; Penner Hahn, J. E.; O'Halloran, T. V. Nat. Chem. Biol. 2008, 4 107. 189. Cram, D. J. Science 1988, 240 760. 190. Reyes Caballero, H.; Campanello, G. C.; Giedroc, D. P. Biophys. Chem. 2011, 156 103. 191. Su, C. C.; Long, F.; Lei, H. T.; Bolla, J. R.; Do, S. V.; Rajashankar, K. R.; Yu, E. W. J. Mol. Biol. 2012, 422 429. 192. Ucisik, M. N.; Chakravorty, D. K.; Merz, K. M. Biochemistry 2013, 52 6911. 193. Dominguez, C.; B oelens, R.; Bonvin, A. M. J. J. J. Am. Chem. Soc. 2003, 125 1731. 194. De Vries, S. J.; van Dijk, A. D. J.; Krzeminski, M.; van Dijk, M.; Thureau, A.; Hsu, V.; Wassenaar, T.; Bonvin, A. M. J. J. Proteins: Struct., Funct., Bioinf. 2007, 69 726. 195. Chakr avorty, D. K.; Li, P.; Merz, K. M. 2014, In preparation 196. Becke, A. D. J. Chem. Phys. 1993, 98 5648. 197. Rozas, I.; Alkorta, I.; Elguero, J. J. Phys. Chem. A 1998, 102 9925. 198. McGaughey, G. B.; Gagne, M.; Rappe, A. K. J. Biol. Chem. 1998, 273 15 458.

PAGE 202

202 BIOGRAPHICAL SKETCH Melek Nihan Ucisik was born in Istanbul, Turkey in 1986 as the first child to her parents Ayse Canan and Abdulhakim Mehmet. She spent most of her childhood at home with her sister Fehime Eymen and grandmother Hati ce Velide until she started elementary school. School did not appeal to her whatsoever, so she resisted by literally crying for more than a month every day, all day. Then she accepted the inevitable fact, went with the flow and graduated from Deutsche Schule Istanbul in 2005 with both Turkish and German high school degrees. She obtained her bachelor's degree in Chemistry from the Bogazici University of Istanbul in 2009. There she conducted undergraduate research with Dr. Rana Sanyal on dendrimer synthesis for hydrogels an d with Dr. Viktorya Aviyente on quantum mechanical calculations of transition states of Diels Alder cyclo additions She also worked at Dr. Wolfgang Kaim's lab at the University of Stuttgart on synthesis of air sensitive metal ligands. She moved to Gainesv ille, Florida in 2009 and joined Dr. Kennie Merz's lab to pursue her doctoral degree in the Quantum Theory Project at the University of Florida where her scientific efforts concentrated on biological applications of computational chemistry. She was awarded her Ph.D. in May 2014. Nihan enjoys arts and being outdoors.


xml version 1.0 encoding UTF-8
REPORT xmlns http:www.fcla.edudlsmddaitss xmlns:xsi http:www.w3.org2001XMLSchema-instance xsi:schemaLocation http:www.fcla.edudlsmddaitssdaitssReport.xsd
INGEST IEID ENU0HD2TZ_KHCH4W INGEST_TIME 2014-10-03T21:58:51Z PACKAGE UFE0046515_00001
AGREEMENT_INFO ACCOUNT UF PROJECT UFDC
FILES



PAGE 1

TheMovableTypeMethodAppliedtoProtein LigandBindingZhengZheng,MelekN.Ucisik,andKennethM.Merz *DepartmentofChemistryandtheQuantumTheoryProject,UniversityofFlorida,2328NewPhysicsBuilding,P.O.Box118435, Gainesville,Florida32611-8435,UnitedStates*SSupportingInformation ABSTRACT: Accuratelycomputingthefreeenergyforbiological processeslikeproteinfoldingorprotein ligandassociationremainsa challengingproblem.Bothdescribingthecomplexintermolecularforces involvedandsamplingtherequisitecon gurationspacemakeunderstandingtheseprocessesinnatelydi cult.Herein,weaddressthe samplingproblemusinganovelmethodologyweterm “ movabletype ” (MT).Conceptuallyitcanbeunderstoodbyanalogywiththeevolution ofprintingand,hence,thenamemovabletype.Forexample,acommon approachtothestudyofprotein ligandcomplexationinvolvestakinga databaseofintactdrug-likemoleculesandexhaustivelydockingtheminto abindingpocket.Thisisreminiscentofearlywoodblockprintingwhere eachpagehadtobelaboriouslycreatedpriortoprintingabook.However, printingevolvedtoanapproachwhereadatabaseofsymbols(letters, numerals,etc.)wascreatedandthenassembledusingaMTsystem,whichallowedforthecreationofallpossiblecombinationsof symbolsonagivenpage,thereby,revolutionizingthedisseminationofknowledge.OurMTmethodinvolvesidentifyingalloftheatom pairsseeninprotein ligandcomplexesandthencreatingtwodatabases:onewi ththeirassociatedpairwisedistantdependentenergies andanotherassociatedwiththeprobabilityofhowthesepairs cancombineintermsofbonds,angles,dihedrals,andnonbonded interactions.Combiningthesetwodatabasescoupledwiththeprincip lesofstatisticalmechanicsallow sustoaccuratelyestimatebinding freeenergiesaswellastheposeofaligandinareceptor.This method,byitsmathematicalconstruction,samplesallofthe con gurationspaceofaselectedregion(theproteinactivesitehere )inoneshotwithoutresortingtobruteforcesamplingschemes involvingMonteCarlo,geneticalgorithms,ormoleculardy namicssimulationsmakingthemethodologyextremelye cient. Importantly,thismethodexploresthefreeenergysurfaceelimin atingtheneedtoestimatetheenthalpyandentropycomponents individually.Finally,lowfreeenergystructurescanbeobtainedvi aafreeenergyminimizationprocedureyieldingalllowfreeenergy posesonagivenfreeenergysurface.Besidesrevolutionizingtheprotein liganddockingandscoringproblem,thisapproachcanbe utilizedinawiderangeofapplicationsincomputationalbiology whichinvolvethecomputationoffreeenergiesforsystemswith extensivephasespacesincludingproteinfolding,protein proteindocking,andproteindesign.INTRODUCTIONSamplingthecon gurationspaceofcomplexbiomoleculesisa majorhurdleimpedingourabilitytoadvancetheunderstandingofadiverserangeofprocessesincludingproteinfolding andtheaccuratepredictionofligandbindingtoabiological receptor.1 10Majoradvanceshavebeenmadeincomputer hardware,whichhasallowedmoleculardynamics(MD) simulationstoreachthemillisecondbarrier,butthismethod isbruteforceinnatureandrequireshighlysophisticated hardwareandsoftware.2,10 14Moreover,amajorhurdleinthe modelingofbiologicalsystemsisassociatedwithhowtheinter andintramolecularenergiesaremodeled.Modernforce elds arehighlyevolvedbutstillneedtobefurtherre nedtoreach chemicalaccuracyinmanyapplications.9,14 17Predictinghowaligand(drug)bindsitsreceptorand predictingitsassociatedbindinga nityisahighlychallenging problem,whichifsolved,wouldsingularlyadvancemodern structure-baseddrugdesign.8,15,17 31Thisapproachhaslargely employedso-calledend-pointmethodsthatdock(place)a candidatemoleculeintoareceptorandcomputethebinding freeenergyusingarangeofphysics-basedorempirical “ scoring ” functions.Fromananalysisoftheerrorpropagation propertiesinthestatisticalmechanicsbasedpredictionofprotein ligandbindinga nitiesitwasshownthattheendpointclassofapproachesmaximizesenergyfunction uncertainties.32 34Thiscanbealleviatedthroughtheuseof samplingapproachesincludingMDmethodsormethodsthat exhaustivelysamplethecon gurationspaceassociatedwith protein ligandbinding.1 3,5 7,9,11 14Thesemethodshave shownthattheycanbesuccessfulbutarebruteforceinnature, whichleadustoconsiderwaysinwhichwecanuseideasmore akintotheend-pointmethodsb utincorporatesamplingatthe sametime.Theconceptbeingthat thisapproachwouldgiveusthe bestofbothworlds,whilemitigatingthee ectsofenergyfunction de ciencies. UsingMDorexhaustivesamplingprocedurestoevaluate protein ligandbindingisconceptuallysimilartowoodblock Received: July10,2013 Published: October28,2013 Article pubs.acs.org/JCTC 2013AmericanChemicalSociety5526dx.doi.org/10.1021/ct4005992 | J.Chem.TheoryComput. 2013,9,5526 5538

PAGE 2

printingtechnologywhereallthewords(molecules)are carefullyplacedonaboard(receptorsite)andthewholebook canbeprinted(bindingfreeenergydetermined).Whileamore advancedprintingtechnology,movabletype(MT)printing, (whichwasinventedinChinainthe11thcenturyand introducedbyGutenbergintothealphabeticlanguagesystem) usesa “ database ” oflettersthatispreconstructedandthenthe printingofanywordinvolvesadatabasesearchfollowedbythe appropriatecombinationfromtheMTsystem.Usingatypical pairwisepotentialthemolecularenergyofasystemcanbe decomposedintoatompairwisein teractionenergiesincluding bond,angle,torsion,andlong-rangenoncovalentforces(vander Waalsandelectrostaticforces),whichbyanalogytotheMT systemsisourdatabaseof “ letters ” .Eachinteractionhasadi erent intensityandprobabilityofoccurrencealonganatompairwise coordinateaxis.Herein,wedevelopthemathematicsnecessaryto bringend-pointmethodsuptothe “ MTprintinglevel ” ,via buildingadatabaseofenergyintensitiesandprobabilitiesforall atomtypepairinteractionsfoundinprotein ligandcomplexes. UsingthisinformationwethendemonstratethattheMTapproach enhancesourabilitytopredictprotein ligandbindingfreeenergies andalsoallowsustoextracttheassociatedlow-energyposesallata fractionofthecostassociatedwith “ brute ” forcesamplingmethods. Moreover,thedockingandscoringproblemisanexampleofa broadclassofproblemsincomputationalbiologythatinvolveboth thecomputationofthefreeenergyandthestructureofabiological system,whichincludeschallengeslikethepredictionofprotein folds,protein proteininteractions,andproteindesignallofwhich theMTmethodcanaddress.METHODOLOGYTheMTMethodAppliedtoProtein LigandBinding. Athermodynamiccyclemodelingthebindingfreeenergy Gb sinsolution(showninFigure1)istypicallyemployedinendpointmethods: =+ŠŠ GGGGGb s b g solv PL solv L sol v P (1)wherePandLindicatetheproteinandligand,sandgrepresent thebehaviorinsolutionandthegasphase,respectively, Gsolvisthesolvationfreeenergy,and Gbisthebindingfreeenergy ingas(g)andsolution(s),respectively. Using Gsolv= Gsolv PL Gsolv L Gsolv P,eq1becomes =Š GGGb s b g sol v (2)Thebindingfreeenergyinsolutionisnowseparatedinto twoterms:Thebindingfreeenergyinthegasphaseandthe changeinthesolvationfreeenergyduringthecomplexation process.AtthispointweintroducetheMTalgorithmtomodel bothtermseachwithitsowndesign. Thebindingfreeenergyinthegasphaseisthemost importanttermtoevaluateinordertopredicttheprotein ligandbindinga nitybecauseitcontainsallinteractions betweentheproteinandligand.Whenapproximatedasthe Helmholtzfreeenergy(NVT),theGibbs(weusethecanonical ensemblethroughout,butwillpredominantlyusethe G notation)bindingfreeenergyinthegasphasecanbegenerated usingtheratioofthepartitionfunctionsdescribingtheprotein ligandcomplex,theprotein,andtheligand. =Š=Š Š ŠŠ GART Z ZZ RT dr drdr lnln e eeEr ErEr b g b g PL PL () ()()PL PL (3)where Z representsthecanonicalensemblepartitionfunction, and isthereciprocalofthethermodynamictemperature kBT Partitionfunctionsareintegralsoverallpossiblecoordinatesof theprotein ligandcomplex,theprotein,andtheligand. Equation3canbemanipulatedintothefollowingform: =Š Š ŠŠ GRT F FF ln e eeEr ErEr b g PL () P () L ()PL PL (4)wherethepartitionfunctionsareexpressedastheBoltzmannweightedaverageoftheposeenergies(showninbrackets) multipliedthevolumeofcon gurationspaceavailabletoeach state,shownas F ineq4. FPLisapproximatedastheproductof theexternaldegreesoffreedom(DoFs)oftheboundprotein andligand(includingtherotationalandtranslationalDoFs), andtheinternalDoFsoftheboundproteinandligand (includingtherelative-posit ionalandvibrationalDoFs), givenas = FFFFFPLboundP external boundL external boundP internal boundL internal (5)Similarly,theDoFsofthefreeproteinandligandmolecules arealsoseparatedintotheexternalandinternalcomponents. InternalDoFsareidenticalforboundandfreeprotein/ligand structures,andtheboundandfreeproteinsarealsoassumedto sharethesameinternalandexternalDoFs.Onlytheexternal DoFsoftheligandaredi erentiatedbetweentheboundand freesystems.TherotationalDoFofafreeligandis8 2ona normalizedunitsphere.However,becauseoftheinaccessible volumepresentinprotein ligandsystems,therotationalDoFs ofboundligandsaredesignatedas a 2withato-be-determined averagevolumefactor a lessthan8.ThetranslationalDoFsare treatedasaconstant C ,whichisassumedtobeidenticalforall freeligands,whilethetranslationalDoFforboundligandsis thevolumeofthebindingpocket Vpocketinwhichtheligands ’ centerofmasscantranslate.Thereby,intheprotein ligand bindingprocess,changesintheDoFscanbemodeledasa constantwithrespecttothedi erentvolumesofthebinding pockets.Applyingtheseapproximationsweobtain: = = == F FF FFFF FFFF F F aV C aV C 8 8PL PL boundP external boundL external boundP internal boundL internal freeP external freeP internal freeL external freeL internal boundL external freeL external 2 pocket 2 pocket (6) Figure1. Thethermodynamiccycleusedtoformulatethefreeenergy ofprotein ligandbindinginsolution. JournalofChemicalTheoryandComputation Articledx.doi.org/10.1021/ct4005992 | J.Chem.TheoryComput. 2013,9,5526 55385527

PAGE 3

Thegas-phaseprotein ligandbindingfreeenergycanthenbe furthermanipulatedintothefollowingform: =Š Š ŠŠ GRT aV C ln e 8eeEr ErEr b g pocket () ()()PL PL (7)AgainusingtheHelmholtzfreeenergyapproximation(eq3), thesolvationfreeenergycanbecorrelatedtothepartition functionofthesolute(protein,ligand,orprotein ligand complex)andsolute solventbulkinteractions.Inthisway,the solvationfreeenergy,using Gsolv Lasanexample,ismodeledas ineq8,andtheDoFsareapproximatedasbeingthesamefor thesoluteandthesolute solventbulkterms. =Š=Š =Š Š Š Š Š GART Z Z RT dr dr RT lnln e e ln e eEr Er Er Er solv L solv L LS L () () () ()LS L LS L (8)Finally,theremainingsolvationtermsgivenineq1( Gsolv Pand Gsolv PL)canbemodeledinananalogousmanneryielding thechangeinthesolvationfreeenergyasligandbindingoccurs whichthencanbeusedtoevaluatetheoverallfreeenergyof ligandbindinginaqueoussolution. ConstructionoftheMTSystem:AtomPairwise InteractionEnergyandProbabilityDatabases. With poseenergiessampledoverallpossibleDoFsforthebound andfreeprotein/ligandsystem,thegas-phaseprotein ligand bindingfreeenergycanbegeneratedusingmoleculardynamics, MonteCarlo,geneticalgorithms,etc.bysamplingoveralarge numberofposesoftheprotein,ligand,andprotein ligand complex.UsingthecanonicalensembletheHelmholtzfree energycanbeobtainedasthearithmeticmean(sumofthe energiesofallligandposesdividedbythetotalnumberofall posesalongwithanestimateofintegrationvolume)of Boltzmannfactors: =Š=Š =Š Š Š GARTZRT RT N ln[]ln[e ] ln eE i Ei (9)However,theproblemofpose-basedenergysamplingliesin thefactthatposeselectionandsamplesizesigni cantlya ect the nalresult,nottomentionthatcalculatingmanyunique posesisverytime-consuming.Di erentligandposeshave di erentenergypreferencesforthebindingsite,whichleadsto arangeofbindingprobabilities.Whencalculatingtheaveraged partitionfunctionsineq7,onecanassignprobabilities( Q )as weightstodi erentBoltzmannfactorsinordertodi erentiate thebindingpocketpreferencesagainstligandposes,ratherthan justsimplyusinganarithmeticmeanofallBoltzmannfactors. = Š Š Q e ei E i Ei i (10) =Š=Š=Š ŠŠGARTZRTRTQ ln[]ln[e]ln[e ] E i i Ei (11)Thechallengeinderivingthecanonicalpartitionfunction(as thedenominatorineq10)foraprotein ligandsystemisthatit isdi culttoincludeallrelevantligandposeenergieswithinthe bindingpocketusingbruteforcesamplingschemes.However, thetaskbecomesmucheasierwhenaprotein ligandsystemis reducedtotheatom-pairlevel.Inthiswaythe “ pose ” sampling problemcanthencanbecastasa1-Dratherthana3-D problembyderivingthecanonicalpartitionfunctionasasum oftheBoltzmannfactorproductsofallatompairwiseenergies includedinthesystemoverallatompairwiseseparation distanceranges. = = = Š Š Š Š Š Š Š Š ZQ Q Q Q Q Q e e e e e ei i E p q pq E a a E b b E c c E d d E all poses all combinations all atom pairs bond distance range no. of bonds angle distance rangeno. of angles torsion distance range no. of torsions vdWelec distance range no. of vdWelec interactionsi pq a b c d (12)Thecanonicalpartitionfunctioncanbederivedfollowing eq12,wheretheindex “ i ” referstoeachligandpose (microstate)ina “ traditional ” bruteforcesamplingscheme. Whentheprotein ligandsystemisbrokendowntotheatompairlevel, “ q ” indicatesallatompairsinthemolecularsystem, and “ p ” indicateseachpossiblecombinationofallatompairs eachofwhichisataprechosendistance. a b c and d referto eachatompairasabond,angle,torsionorlong-range(vander Waalsorelectrostatic)interactioninthecanonicalsystem, respectively,and , and referstoeachsampledseparation distancebetweenthecorrespondingatompair.Probabilitiesof alltheatompairwisedistributionsontheright-handsideof eq12arenormalizedas iQi= i(e Ei/( ie Ei))=1: = Š ŠQQ Q Q 1a a b b c c d d bond distance range no. of bonds angle distance rangeno. of angles torsion distance range no. of torsions vdWelec distance range no. of vdWelec interactions (13)HenceourMTmethodisdesignedtodecomposethe molecularenergyintoatompairwiseenergies,whichthensimpli estheenergysamplingproblemtotheatom-pairlevel. Theadvantageofthisidealiesinthatatompairscanbe categorizedbasedonatomandinteractiontypes,e.g.bond, angle,torsion,andlong-rangenoncovalentinteractionsandthat calculationofatompairwiseenergiesisextremelycheap. Thereby,itiseasytobuildanatomicpairwiseinteraction matrixofenergyvsdistanceforeachinteractiontypeandatom pairtype i j .Hence,theenergycalculationforeachmoleculeis nomorethanacombinationofelementsfromdi erentenergy matrices.Inaddition,theMTmethodisatemplatebywhich anypairwisedecomposableenergyfunctioncanbeused.Inthe currentwork,theenergyforeachinteractiontypebetweena certainatomtypepair i j iscalculatedusingtheknowledgebasedandempiricalcombinedscoringalgorithm(KECSA) JournalofChemicalTheoryandComputation Articledx.doi.org/10.1021/ct4005992 | J.Chem.TheoryComput. 2013,9,5526 55385528

PAGE 4

potentialfunction.35InKECSA,theprotein ligandstatistical potentialismodi edandequatedtoanatompairwiseenergyin ordertogenerateforce eldparametersforbondstretching, anglebending,dihedraltorsionangles,andlong-range noncovalentinteractions.Pleaseseethedetailedrationaleand justi cationforKECSAanditsparametrizationinthe SupportingInformationandtherelevantliterature.35Alongwiththedistance-basedenergy,eachatompairtype alsohasadistancepreferenceencodedinitsdistribution, resultingindi erentprobabilitiesassociatedwithBoltzmann factorsforeachsampledatompairwisedistance.Atom-pair radialdistributionswerecollectedfromaprotein ligand structuretrainingset(i.e.,thePDBbindv2011datasetwith 6019protein ligandstructures)36,37andutilizedinthecurrent model.Theatompairwiseradialdistributionfunctionis modeledas = = g r nr nr nr rr () () () () 4ij ij ij ij N V aij (14)where nij( r )isthenumberofprotein ligandpairwise interactionsbetweenacertainatompairtype i and j inthe bin( r r + r ),withthevolume4 ra r collectedfromthe trainingset,and n *ij( r )inthedenominatormimicsthenumber ofprotein ligandatomtypepairs i and j inthesamedistance bininanidealgasstate.Thisremovesthe “ non-interacting ” backgrounddistributionfromtheprotein ligandsystem; ris de nedas0.005,and Nijisthetotalnumberofatompairsof type i and j .Theaveragevolume V oftheprotein ligand bindingsitesisgivenas4/( a +1) Ra +1,withthesameto-bedeterminedparameter a asdescribedabove(eqs7and14).A cuto distance R isassignedtoeachatomtypepairde ningthe distanceatwhichtheatompairwiseinteractionenergycanbe regardedaszero.Both a and R canbederivedusinga previouslyintr oducedmethod.35Theradialdistribution frequencyisthennormalizedbydividingthesumofradial distributionsofalltheatompairsinthesystem(eq15). = = + ++ +qr gr gr () () ()ij ij i j ij Rnr aNrr i j Rnr aNrr () (1) () (1)a ij ij a a ij ij a 1 1 (15)Inthisway,theenergyanddistributionfrequencyvsdistance iscalculatedforanyinteractiontypeandatompairtype, thereby,formingourMTdatabaseforlateruse. BindingFreeEnergiesfromtheMTMethod. Basedon eq4,thebindingfreeenergyisde nedasaratioofpartition functionsofthedi erentmoleculesinvolvedinthebinding process,i.e.,theprotein,ligand,andtheprotein ligand complex.Insteadofsamplingoverposesofonemolecule,the MTmethodsimpli esthepartitionfunctionofeachsystem intoacollectionofpartitionfunctions( c )overeachobservedatom pair,whichareequaltothenormalizeddistributionprobabilityof theatomtypepairalongthedistance( q ),multipliedbythe correspondingatompairwisepartitionfunction( z ): = cq z (16)Bycombiningthepartitionfunctions c overallatompairsin onemoleculethepartitionfunctionofonemoleculeaveraged overallpossibleconformationsisderived(eq17). = Šcr e()Er j M i N ij () (17)wheretheaveragedmolecularpartitionfunctionisgivenasa sumofatompairwisepartitionfunctions c sampledover distanceintervals( M )ofallcombinationof N atompairsatall possibledistances. Startingfromtheprotein ligandcomplexdatabase,we constructedthepartitionfunctionmatricesfortheMTalgorithm. Whenconvertedintoapartitionfunctionmatrix,theatom pairwiseenergymultipliersamp ledasafunctionofdistanceisthe basicelementneededtoassemblethetotalenergy,asshownin eq18,usingtheproteinbondenergyasanexample. = = Š Š ŠŠ zr zr zrzr () () ()() e e eek k k kakn Er Er ErEr bond bond 1 bond 2 bond bond () () ()()k k kakn bond 1 bond 2 bond bond (18)wheresubscript k indicatesabondedatompair i and j ,andeach distanceincrementbetweenany raand ra +1is0.005.Usingthis schemethedistancesamplingsizeisgivenby: n =( rn r1)/ (0.005),where r1and rnarethelowerandupperboundsfor distancesampling,whichvariesdependingontheeachatompair andinteractiontype.Theproductoverallbond-linkedatompairs derivesthetotalbondpartitionfunctionintheprotein: = = Z ()()()Pm m bond 1 bond 2 bond 3 bondbon d 1 bond 2 bondT 3 bondTbondT (19) = Z zrzrzrzrzrzrzrzrzr zrzrzrzrzrzrzrzrzr zrzrzrzrzrzrzrzrzr ()()()()()()()()() ()()()()()()()()() ()()()()()()()()()mmnmn mmnmn nmnmnnmn P bond 1 bond 12 bond 1 bond 11 bond 12 bond 1 bond 21 bond 12 bondbond 1 bond 22 bond 1 bond 11 bond 22 bond 1 bond 21 bond 22 bondbond 1 bond 2 bond 1 bond 11 bond 2 bond 1 bond 21 bond 2 bondbond (20)Ineqs19and20, m indicatesthetotalnumberofatompairs thatneedtohavetheirbondstretchtermcomputed(i.e., numberofcovalentbonds),and n isthedistancesamplingsize. Tindicatesthetranspose.Thus,thematrix ZP bond hasatotalof nmelements,andincludesallcombinationsofthesampledatom pairwisedistancesandatompairs(seeeq20).Energymatrices forotherkindsofatompairwiseinteractionsareassembledin thesameway(i.e.,bond,angle,torsion,andlong-range interactions).Asimpleexam pleisgiveninSupporting Information(butane methaneinteraction),whichillustrates JournalofChemicalTheoryandComputation Articledx.doi.org/10.1021/ct4005992 | J.Chem.TheoryComput. 2013,9,5526 55385529

PAGE 5

themethodinmoredetail.Productsoverthesematrices generatetheentireproteinpartitionfunctionmatrix(eq21), representingallpossiblecombinationsoftheproteininternal energieswithdi erentatompairwisedistances. = ZZZZP PP bondangle P torsion P longrange (21)where =ŠZZZP longrange P vdWelec P Hbon d (22)TheKECSAvanderWaals electrostaticinteractionmodelsand hydrogenbondmodels35areappliedtotheprotein,ligandand protein ligandcomplexsystems.Similarly,theligandenergy(eq23) andprotein ligandinteractionenergymatrices(eq24)canbe obtained. = ZZZZLL bond L angle L torsion L longrang e (23) = ZZZZZZ ZZZPLP bond P angle P torsion P longrange L bond L angl e L torsion L longrange PL longrange (24)Thedistributionfrequencymatrixisbuiltinthesameway, withthe qij( r )derivedfromeq15aselementsineachmultiplier (alsousingtheproteinbondtermasanexample): = qr qrqr qr () ()() ()k bond k kk k n bond 1 bond 2 bond 3 bond (25) = Q k m P bond 1 bond 2 bond 3 bondbond bond (26) = QQQQP P bond P angle P torsion P longrang e (27)where =Š Q QQP longrange P vdWelec P Hbond (28)Thedistributionfrequencymatrixfortheproteinisderived usingeqs26 28,andthedistributionfrequencymatricesofthe ligandandprotein ligandintermolecularinteractionsare analogouslyderivedasineqs29and30. = QQQQL L bond L angle L torsion L longrang e (29) == = Š QQQ QQQQQ QQQQPL PL longrange PL vdWelec PL Hbond PL P bond P angle P torsion P longrange L bond L angle L torsion L longrange PL longrange (30)Wechosethesamerangeanddistanceincrementinboththe energyanddistributionfrequencycalculations,whichmeans thatany rx( x =1,2,3,...)ineq18isthesameascorresponding rxineq25.Thus,thecorrespondingelementsinallenergyand distributionfrequencymatricescorrelatewitheachother.The pointwiseproductoverallmatricesensuresthattheenergies anddistributionfrequencieswiththesamerangeanddistance incrementarecombinedintooneelementinthe nalmatrixof theprobability-weightedpartitionfunctionoftheprotein ligandcomplex( P L ineq31). == PLPLPL (31)Inthe nalmatrixeachelementof P L isavalueofthe partitionfunctionoftheprotein ligandcomplexmultipliedby itsprobabilitybasedonitsradialdistributionformingthe ensembleaverage.Finally,thesumofallelementsofthematrix P L givesustheaveragedpartitionfunctionoftheprotein ligandcomplex: = == Š Sum()1; Sum()Sum()eEr PL ()PL (32)wherethe rstequationisthenormalizationstatementforthe probabilities.Inthismanner,thenormalizedaveragedpartition functionoftheprotein ligandcomplexisderivedineq32. Followingthesameprocedure,theaveragedpartitionfunctions fortheproteinandligandaregeneratedaswell = Š eSum()Er () PP (33) = Š eSum()Er () LL (34)Expandingthematrices,theprotein ligandbindingfree energyinthegasphaseisde nedasineq35,usingthe averagedpartitionfunctionsofallthreesystems(protein, ligand,protein ligandcomplex)derivedabove. =Š =Š + Š++ Š Š Š ŠŠ ++ GRT aV C RT aV C QQQEEE QEQE ln e 8ee ln 8 ln (exp[()]) (exp[])(exp[])Er ErEr ijk IJK ijk ijk i I i i j J j j b g pocket () ()() pocket PLPL PLPL PPLLPL PL (35)Ineq35, Q istheradialdistributionfrequencyand E isthe energy. i j k aretheindicesoftheprotein,ligandandprotein ligandcomplex,while I J K arethetotalnumberofprotein, ligand,andprotein ligandcomplexsamples,respectively. j IQiP= j IQJL= j KQjPL=1. Q i P, Q j L,and Q k PLarestandard distributionfrequencymatricesnormalizedoverallthree systems,inordertosatisfy ijk I+J+K. Q i PQ j LQ k PL=1.Inthisway theprotein ligandbindingfreeenergyinthegasphaseis derivedusingourMTalgorithm. Determinationofthechangeinthesolvationenergy asafunctionofthebindingprocessiscomputedina similarmanner.Toillustratethiswedescribehowweobtainthesolvationfreeenergyoftheligand,whichisone componentof Gsolvandbyextensiontheotherterms canbederived. Theligandsolvationfreeenergyisobtainedbydecomposing theligand solventbulkenergyintothefreeligandenergy EL( r ),theligand solventpolarinteractionenergy Epsol( r ),and theligand solventnonpolarinteractionenergy Enpsol( r ): =++ ErErErEr ()()()()LSLpsolnpsol (36)Solventwasapproximatedasashellofeventhicknessaround theligand,inwhichthewatermoleculeswereevenly distributed.Thesolventshellthicknesswas6,andthe innersurfaceoftheshellwas1.6awayfromtheligand JournalofChemicalTheoryandComputation Articledx.doi.org/10.1021/ct4005992 | J.Chem.TheoryComput. 2013,9,5526 55385530

PAGE 6

surface,whichapproximatestheradiusofawatermolecule. Herein,forsimplicity,theligand solventpolarinteraction wasconsideredasasurface(solventaccessiblepolarsurfaceof theligand) surface(solventbulklayersurfaceatacertain distanceawayfromligand)interaction,insteadofapoint pointinteraction,i.e.atompairwiseinteraction.Usingthis modeltheligandpolaratom solventinteractionenergywas modeledasasolventaccessibleburiedarea(SABA)ofthe ligandpolaratomsmultipliedbythepolaratom oxygen interactionenergytermstakenfromKECSA,35tosimulatethe ligand solventsurfaceinteractionenergy.AllSABA-weighted interactionenergiesalongthesolventshellthickness,witha 0.005incrementwerecollectedandstored.Theligand solventpolarinteractionBoltzmannfactormultiplierwas modeledusingeq37,with k indicatingeachpolaratominthe ligand, r1=1.6,whichistheinnerlayerofthesolventshell and rn=6+1.6=7.6,whichistheouterboundarylayerof thesolventshell. = = Š ŠŠ Š zr zrzr zr () ()() () e ee ek k kk k n Er ErEr Er psol psol 1 psol 2 psol 3 psol (SABA)() (SABA)() (SABA)() (SABA)()k k k k k k k k n psol 1 psol 2 psol 3 psol (37)Theligand solventpolarinteractionBoltzmannfactor matrixisthenderivedusingeq38,coveringallligandpolar atomsupto m .Thedistributionfrequencymatriceswerenot includedinligand solventenergycalculationbecausetheradial distributionfunctionisapproximatedasbeingidenticalalong allligand solventdistances(i.e.,afeaturelesscontinuum). Figure2furtherillustratesthemodelingoftheligand solvent polarinteraction. = ()()()()psol k m 1 psol 2 psolT 3 psolT psol Tpsol T (38) Figure2. Modelingofligand solventpolarinteractionusingaBoltzmannfactormultiplier.Acarbonyloxygenatomisusedasanexamplehere. (1)Thegreensurfaceshowsthesolventaccessiblesurfaceoftheligand(innerlayerofthesolventshell).Thesurfaceconsistingofbluedots representstheouterboundarysurfaceofthesolventshell.(2)Aclose-upviewofaselectedpolaratom(carbonyloxygen)withitssolventshell.(3) MonteCarlosamplingalongcarbonyloxygen solventshelllayerdistance. JournalofChemicalTheoryandComputation Articledx.doi.org/10.1021/ct4005992 | J.Chem.TheoryComput. 2013,9,5526 55385531

PAGE 7

Thenonpolaratomburiedarea(NABA)isusedtosimulatethe interactionsbetweenthenonpolaratomsandaqueoussolvent, becausetheinteractionenergybetweennonpolaratomsandwater moleculeshasaweakerresponsetochangesindistance. = Š [e]npsol NABA (39)Theligandenergyisthesameaswasintroducedinthegasphaseprotein ligandbindingfreeenergycalculation.So,the matrixfortheligand solventinteractionenergyis: = L L solvpsolnpsol (40) == Š S um()Sum()eLEr solvLpsolnpsol ()LS (41)Thesolvationfreeenergywasnot ttoexperimental solvationfreeenergiesandwasfoundtohaveasmall in uenceofthe nalbindingfreeenergiesfortheprotein ligandcomplexes.Nonetheless,futureworkwill tthese modelstosmallmoleculesolvationfreeenergies,butfor thepresentapplicationthesolvationmodelwasusedas formulatedabove. Withallnecessarycomponentsconstructed,thebindingfree energyinsolutioncanbegeneratedusing: =Š=+ŠŠ=Š Š + + =Š =Š+ Š++++ Š++ Š++ Š ŠŠ Š Š Š Š Š Š Š ŠŠ ++ + GGGGGGGRT aV C RT RTRTRT aV C RT aV C QQQEEEE QEEQEE ln e 8ee ln e e ln e e ln e e ln e 8ee ln 8 ln (exp[(NABA)]) (exp[NABA])(exp[(NABA)])Er ErEr Er Er Er Er Er Er Er ErEr ijk IJK ijk ijks i I i is js JS j js b s b g solv b g solv PL solv P solv L pocket () ()() () () () () () () pocket () ()() pocket PLPL PLPLPLpsolPL PPPpsolPLLLpsolLPL PL PLS PL PS P LS L PLS PSLS (42)PerformanceofMTKECSAasaScoringFunctionfor Protein LigandBindingA nityPrediction. Usingthe MTmethodweperformedbindingfreeenergycalculationswith theKECSAmodelanditsassociatedparameters.Thisvalidation studywasperformedtoillustrat e(1)thegeneralperformanceof MTmethodwhenusedtopredictprotein ligandbinding a nitiesand(2)whethersamplingalongatompairwisedistance improvesscoringperformance,asdoneinMTKECSA,improves thepredictionovertheoriginalKECSAmethod. Atestsetcontaining795protein ligandcomplexeswas chosenfromthePDBbindv2011re neddatasetbasedonthe followingcriteria:(1)Crystalstructuresofallselected complexeshadX-rayresolutionsof<2.5;(2)complexes withmolecularweights(MWs)distributedfrom100to900 wereselected,toavoidligandsize-dependentpredictionresults; and(3)complexeswithligandswhohavemorethan20 hydrogendonorsandacceptors,morethanonephosphorus atom,andcomplexeswithmetalloproteinswereexcluded. MTKECSAcalculationsshowimprovementsinPearson ’ s r Kendall androot-mean-squareerror(RMSE)whencompared totheoriginalKECSAmodel(Table1).Importantly,judging fromtheslopeandinterceptofbothcalculationsversus experimentaldata,MTKECSA(withslopeof0.85and interceptof0.14)betterreproducesthebindinga nitiesinthe low-andhigh-a nityregionsthantheoriginalKECSAmodel (withslopeof0.27andinterceptof3.57).Intheoriginal KECSAapproach,theentropytermswereempiricallytrained, thusitstestresultsdemonstratetrainingsetdependenceto somedegree.Becausecomplexeswithmedium-bindinga nitiesaremorecommonlyseeninthePDBdatabasewhen comparedtocomplexeswithhigh-orlow-bindinga nities, theybecomethemajorityinalargetrainingset(1982protein ligandcomplexeswereusedto ttheoriginalKECSAentropy terms).Thiscausesthetrainedscoringfunctionstooverestimatethebindinga nityofthelow-bindingcomplexes whileunderestimatingthatofthehigh-bindingcomplexes.On theotherhand,MTKECSA,usingcanonicalpartition functionstocomputethebindingfreeenergies,bypassesthedi cultyofempiricallybuildingtheentropytermand,thereby, betterreproducesthebindinga nityinlow-andhigh-binding freeenergyregions. ExtractingHeatMapsfromtheMTMethod. Gridbased methodsandtheirgraphicalrepresentationhavehadalong traditionincomputer-aideddrugdesign.4,38 41Forexample, COMFA42createsa elddescribingthechemicalnatureofthe activesitepocket,andtheGRIDalgorithm43usesagridbased approachtoaidinmoleculardockingandhasbeenadoptedby otherdockingprograms(e.g.,GLIDE).44,45BytheverynatureoftheMTmethodwecanreadily generate “ heatmaps ” describingthechemicalnatureofthegrid pointscreatedintheMTmethod.Thesecanbeusedto describepairwiseinteractionsbetweenthegridpointandthe proteinenvironment(e.g.,amidehydrogenwithacarbonyl oxygen)orinteractionscanbelumpedintononpolarorpolar interactionsdescribingtheaggregatecollectionofpolarand nonpolarpairwiseinteractions.Notonlydoesthisdescribethe natureofthegridpointsbutalsoindicatesregionswherespeci catomsshouldbeplacedtooptimizebindinga nity. Incontrasttoenergyheatmaps,theMTheatmapsrepresent theprobability-weightedinteractionenergyoneachgridpoint. Knowledge-baseddata(i.e.,theprobabilitydistributionalong theinteractingdistance)willa ectthelocationofboth unfavorableandfavorableinteractionsdependingonthenature ofthesystem.Moreover,energygradientmapscanbe generatedbasedonheatmapenergycalculations,which facilitatesliganddockingasdescribedbelow. ExtractingStructurefromtheMTMethod. The advantageoftheMTmethodisthattheenergyandthefree Table1.StatisticalResultsforMTKECSAandOriginal KECSACorrelatedwithExperimentalBindingA nitiesPearson ’ s r RMSE(p Kd)Kendall MTKECSA0.721.880.53 originalKECSA0.622.030.46 JournalofChemicalTheoryandComputation Articledx.doi.org/10.1021/ct4005992 | J.Chem.TheoryComput. 2013,9,5526 55385532

PAGE 8

energy(whenintroducingthepartitionfunction)canbe derivedusingonlyatomiclinkageinformationcoupledwiththe databasesofatompairwisedistancedistributionsalongwith theircorrespondingenergies.Thiso ersusanewapproachfor protein liganddockingwithoutresortingtoexhaustivepose sampling.Ourinitiale ortsutilizedthefrozenreceptormodel, buttheincorporationofreceptor exibilityis,inprinciple, straightforwardandwillbeexploredinthefuture. Inadockingexercise,thebest-dockedposefortheligandis usuallyobtainedbasedonthehighestbindinga nity,which canberegardedasanoptimizationproblem.Withthefrozen bindingpocketapproximation,generationofthe “ best ” docking poseisagradientoptimizationoftheligandatomswithinthe bindingpocket,subjecttotheconstraintsoftheligand topology. Molecularinternallinkagesincludingbondlengthsand anglesonlyslightlydeviatefromtheiroptimizedvalues,making themconstraintsintheligandenergyoptimizationwithinthe bindingpocket.Theseligandatomconnectivitiesreducethe dimensionalityoftheprobleminthatatomiccollectionsthatdo nothavethecorrectconnectivityareeliminatedfromfurther consideration.Ontheotherhand,energiesofthetorsionsand long-rangeinteractionsbetweenligandsandproteinsvaryover comparativelylargedistancerangesand,thereby,areregarded astheobjectivefunctions.Hence,inordertodothe optimizationweneedtoobtainthe rstandsecondderivatives oftheligandtorsionandtheprotein ligandlong-range interactionpartitionfunctions(shownineq43and44), whichcanbereadilyseeninthegradientmapsoftheindividual atomtypepairs. = =+= dcr dr dqrzr dr dqr dr zrqr dzr dr () (()())(()) ()() (()) 0 (43) = =+ +> dcr dr dqrzr dr dqr dr zr dqr dr dzr dr zr dqr dr () (()())(()) ()2 (()) (()) () (()) 02 2 2 2 2 2 2 2 (44) Figure3. TheMTenergymapsoptimizationmechanismtoderivethe naldockingposeinoneprotein ligandcomplex. Figure4. Contactmapofthe1LI2protein ligandcomplexbindingregion.Hydrophobiccontactsareshownasgreendashedlines,andtheone hydrogenbondisshownasapinkdashedline.Thebindingpocketcavityisde nedbythedarkregion. JournalofChemicalTheoryandComputation Articledx.doi.org/10.1021/ct4005992 | J.Chem.TheoryComput. 2013,9,5526 55385533

PAGE 9

Optimumligandatomlocationsareobtainedwhenthe calculationsatis estheminimumvaluesforalltheobjective functions(ligandtorsionsandprotein ligandlong-range interactions)andallligandbondsandangleconstraints. Inouroptimizationalgorithmweobtainnumericalderivativesof theprobabilitydistributionandanalyticalderivativesfortheenergy expressionviapairwisepartialderivativesofthemodi edLennardJonespotentialsusedinKECSA.35Withtheligandtopologyand rstandsecondderivativesweusedaNewton Raphson algorithmtooptimizetheligandinthepocket.Anicefeatureof thismethodiswecanidentifyboththelowestfreeenergybinding modealongwithallotherpossiblelocalminimawithhigherfree energies.Moreover,wecanextractsaddlepointandhigher-order transitionsdescribingtheconnec tivitybetweenthelocalminima, butthesearequitenumerousand,hence,complicatedandwillnot bediscussedhereindetail. Figure5. Heatmapsforsp3oxygen(left)andaromaticcarbon(right).Gridpointswithlightercolorindicateenergeticallyfavorablelocationsfor certainatomtypeswithinthebindingpocket. Figure6. Inthebindingpocketofprotein ligandcomplex1LI2,theligandcrystalstructure(markedas CS )isshownasaballandstick gure,the globalminimumpose(markedas GM )isshownasastickalongwiththethreeotheridenti edlocalminimum(markedas a b ,and c ).Redbubbles ontheproteinatomsindicatepotentialcontactswiththeligandsp3oxygen.Graybubblesontheproteinatomsindicatepotentialcontactswith aromaticcarbons. JournalofChemicalTheoryandComputation Articledx.doi.org/10.1021/ct4005992 | J.Chem.TheoryComput. 2013,9,5526 55385534

PAGE 10

Figure3introducestheprocessoftheheatmapdocking.To illustratethemethodindetailwewilltouchonjustone examplewhosestructureis1LI2.Wehavealsocarriedout heatmapdockingagainstthepreviouslyintroducedtestsetof 795proteinligandcomplexes,whichwillbesummarizedbelow. Theprotein ligandcomplexwithPDBID1LI2isusedasan exampletoillustrateindetailtheprocessofheatmapdocking. 1LI2isaT4Lysozymemutantboundtophenolwithamodest bindinga nityof4.04(p Kd).Thebindingpocketregionis largerthanthesmallphenolligandstructure(seeFigure4), potentiallyallowingseveralligandposesthatrepresentlocal minima.Ontheotherhand,phenol,astheligand,hasasimple enoughstructuretoclearlyshowthedi erencesinprotein ligandcontactsbetweenlowenergyposes. Judgingfromthecrystalstructure,phenolformsahydrogen bondwithGLN102Aandseveralhydrophobiccontactswith VAL87A,TYR88A,ALA99A,VAL111A,andLEU118Ainthe bindingpocket(showninFigure4). Therearetwoatomtypes(sp3oxygenandaromaticcarbon atoms)inphenol.BasedontheMTKECSAcalculation,heat mapsforbothoftheatomtypeswithinthebindingpocketcan begenerated(Figure5). Theheatmapdockingprogramthengeneratedonesp3oxygen andsixaromaticcarbonstotheiroptimizedpositionfollowingthe gradientsontheircorrespondingenergyheatmapswhilesatisfying thelinkageconstraintsofphenol.Asaresult,togetherwiththe energeticglobalminimumligandpose(GM),threemorelocal minimumposes(pose a b ,and c )weregeneratedusingthe heatmapdockingmethod.RMSDv alues()andbindingscores (p Kd)areshowninTable2. AscanbeseeninFigure6,theGMposeslightlydeviates fromthecrystalstructure( CS )becauseoftheadjustmentofthe hydrogen-bonddistancebetweenthephenoloxygenandthe Table3.RMSDValues()andStandardDeviationsof RMSD()fromtheMTHeatmapDockingResults, ComparedWithGlideSPandXPResultsRMSD()RMSDstandarddeviation() MTKECSA1.971.27 GlideSP2.072.72 GlideXP1.872.01 Table2.RMSDvalues()andbindingscores(p Kd)ofthe globalandlocalminimaRMSD()bindinga nity(p Kd) globalminimum0.9373.329 localminimum a 2.6672.255 localminimum b 2.8392.975 localminimum c 2.3423.299 Figure7. PlotofMTKECSA(A),theoriginalKECSAmodel(B)calculatedp Kdorp Kivalues,GlideSPscore(C)andGlideXPscore(D)vs experimentalp Kdorp Kivalues. JournalofChemicalTheoryandComputation Articledx.doi.org/10.1021/ct4005992 | J.Chem.TheoryComput. 2013,9,5526 55385535

PAGE 11

sp2oxygenonGLN102AintheMTKECSAcalculation.The phenolbenzeneringbalancesthecontactswithALA99Aand TYR88AononesideandthecontactswithLEU118A, VAL87A,andLEU84Aontheother.Thelocalminimumpose c and b hasclosebindingscoreswhencomparedtothe GM pose.Theyformhydrogenbondswithdi erenthydrogen acceptors(ALA99Abackboneoxygenforpose c andLEU84A backboneoxygenforpose b )whilemaintainingverysimilar benzeneringlocations.Thelocalminimumpose a istryingto formahydrogenbondwithALA99Abackboneoxygen. However,thebenzeneringoflocalminimumpose a istilted towardtheLEU118A,VAL87A,andLEU84Asidechain carbons,weakeningthehydrogenbondwiththeALA99A backboneoxygenwiththenetresultbeingareductionin bindinga nity. FurtherValidationoftheMTMethod. UsingtheMT methodafurtherdockingstudywascarriedoutonthetestset of795protein ligandcomplexes.Inaddition,inordertobetter evaluatetheperformanceofMTscoringandheatmapdocking, wealsocarriedoutaGlidescoringanddockingstudyforcomparison.44 46FortheGlidedockingandscoringstudy,proteinstructures werepreparedwiththeProteinPreparationWizardutilityof theSchrodinger2013-2SuiteusingEpikstatepenalties.44,47ProtonationstateswereassignedusingPROPKAatpH7.0.48,49HydrogenpositionswereoptimizedwithOPLS2005.50,51All thecrystalwaterswereremoved.DockingrunsusingtheGlide version5.9StandardPrecision(SP)andExtraPrecision(XP) algorithmsfollowedthepreparationstep. TheMTheatmapdockinggeneratedanaverageRMSDof 1.97witha1.27standarddeviationwhencomparedtothe protein ligandcrystalstructure,whileGlideSPdocking generatedanaverageRMSDof2.07witha2.72standard deviation,GlideXPdockinggeneratedanaverageRMSDof 1.87witha2.01standarddeviationagainstthesameset (Table3).Theresultforeachindividualsystemstudiedherein isgivenintheSupportingInformation.Basedonthistestresult, MTheatmapdockingshowedacomparableperformanceto Glideresults,allofwhichhave 2.0RMSDresults.However, thestandarddeviationoftheposeRMSDgeneratedusing heatmapdockingisinamorenarrowrangethanwhatisseen usingSPandXPGlide. Wealsocomparedthebindinga nitycomputedbytheMT method(actuallyp Kd/p Kivalues)withtheGlideSPandXP scores.WeshowtheGlidescoresvstheexperimentalp Kdor p Kivaluesacrossthe795protein ligandcomplexesinthetest setinFigure7,togetherwiththecalculatedp Kdorp Kivalues fromMTKECSAandtheoriginalKECSAmodelvsthe experimentalp Kdorp Kivaluesofthesametestsetfor comparison.Duetothedi erentscalesusedbyMTKECSA andtheGlidescores,thecomparisononlyincludesPearson Â’ s r andKendall values.MTKECSAandtheoriginalKECSA,as discussedabove,generatedPearson Â’ s r Â’ sof0.72and0.62, Kendall Â’ sof0.53and0.46.GlideSPyieldsaPearson Â’ s r of 0.46andaKendall of0.52,whileGlideXPyieldsaPearson Â’ s r of0.42andaKendall of0.29.Overall,weconcludethatMT KECSAshowsadvantagesoverscoringwithGlide.CONCLUSIONSThepredictionofthefreeenergiesassociatedwithawiderange ofbiologicalproblemsremainsaverydauntingtask.Balancing thesamplingoftherelevantdegreesoffreedomwithaccurate energycomputationmakesthisaverydi cultproblem.Herein wedescribeanewapproachthatinone-shotsamples,allthe relevantdegreesoffreedominade nedregiondirectly a ordingafreeenergywithoutresortingtoadhocmodelingof theentropyassociatedwithagivenprocess.Thisis accomplishedbyconvertingensembleassemblyfroma3-D toa1-Dproblembyusingpairwiseenergiesofallrelevant interactionsinasystemcoupledwiththeirprobabilities.Wecall thisapproachtheMTmethod,andinconjunctionwithKECSA potentialfunctionwedemonstratedtheapplicationofthis approachtoproteinligandposeandbindingfreeenergy prediction.TheresultantMT-KECSAmodelout-performsthe originalKECSAmodelshowingthepowerofthisapproach. Importantly,thepresentMTmodelcanbeappliedtoany pairwisedecomposablepotentialwhichwillallowustoattacka widerangeofproblemsinbiologyincludingthevalidationof potentialfunctions.ASSOCIATEDCONTENT*SSupportingInformationAdescriptionofamethane butanesystemasanexamplethat illustrates,indetail,thebindingfreeenergycalculationusing theMTmethodwiththeKECSAenergyfunction.An introductiontoafastapproximatealgorithmformatrix multiplicationinMTcomputationaswellasthepredictedp Kdorp Kivsexperimentalp Kdorp Kiforourtestsetandthe heatmapdockingRMSDresultagainstthesametestset.This informationisavailablefreeofchargeviatheInternetathttp://pubs.acs.org/.AUTHORINFORMATIONCorrespondingAuthor* E-mail:kmerz1@gmail.comNotesTheauthorsdeclarenocompeting nancialinterest.ACKNOWLEDGMENTSWewouldliketothanktheNIH(GM044974andGM066859) forsupportingthepresentresearch.Drs.ErikDeumensand JohnFaverarealsoacknowledgedfornumeroushelpfuldiscussions.REFERENCES(1)Jorgensen,W.L.;Tirado-Rives,J.MonteCarlovs.molecular dynamicsforconformationalsampling. J.Phys.Chem. 1996 100 14508. (2)Cance s,E.;Legoll,F.;Stoltz,G.Theoreticalandnumerical comparisonofsomesamplingmethodsformoleculardynamics. ESAIM:Math.Modell.Numer.Anal. 2007 41 ,351. (3)Gallicchio,E.;Levy,R.M.Advancesinallatomsampling methodsformodelingprotein-ligandbindingaffinities. Curr.Opin. Struct.Biol. 2011 21 ,161. (4)Halperin,I.;Ma,B.;Wolfson,H.;Nussinov,R.Principlesof docking:Anoverviewofsearchalgorithmsandaguidetoscoring functions. Proteins 2002 47 ,409. (5)Limongelli,V.;Marinelli,L.;Cosconati,S.;LaMotta,C.;Sartini, S.;Mugnaini,L.;DaSettimo,F.;Novellino,E.;Parrinello,M. Samplingproteinmotionandsolventeffectduringligandbinding. Proc.Natl.Acad.Sci.U.S.A. 2012 109 ,1467. (6)Clark,M.;Guarnieri,F.;Shkurko,I.;Wiseman,J.Grand CanonicalMonteCarloSimulationofLigand ProteinBinding. J. Chem.Inf.Model. 2006 46 ,231. (7)Okamoto,Y.Generalized-ensemblealgorithms:enhanced samplingtechniquesforMonteCarloandmoleculardynamics simulations. J.Mol.GraphicsModell. 2004 22 ,425. JournalofChemicalTheoryandComputation Articledx.doi.org/10.1021/ct4005992 | J.Chem.TheoryComput. 2013,9,5526 55385536

PAGE 12

(8)Hendlich,M.;Lackner,P.;Weitckus,S.;Floeckner,H.; Froschauer,R.;Gottsbacher,K.;Casari,G.;Sippl,M.J.Identification ofnativeproteinfoldsamongstalargenumberofincorrectmodels. Thecalculationoflowenergyconformationsfrompotentialsofmean force. J.Mol.Biol. 1990 216 ,167. (9)Jones-Hertzog,D.K.;Jorgensen,W.L.Bindingaffinitiesfor sulfonamideinhibitorswithhumanthrombinusingMonteCarlo simulationswithalinearresponsemethod. J.Med.Chem. 1997 40 1539. (10)Shaw,D.E.;Maragakis,P.;Lindorff-Larsen,K.;Piana,S.;Dror, R.O.;Eastwood,M.P.;Bank,J.A.;Jumper,J.M.;Salmon,J.K.;Shan, Y.;Wriggers,W.Atomic-LevelCharacterizationoftheStructural DynamicsofProteins. Science 2010 330 ,341. (11)Kannan,S.;Zacharias,M.Simulatedannealingcoupledreplica exchangemoleculardynamics Anefficientconformationalsampling method. J.Struct.Biol. 2009 166 ,288. (12)Klepeis,J.L.;Lindorff-Larsen,K.;Dror,R.O.;Shaw,D.E. Long-timescalemoleculardynamicssimulationsofproteinstructure andfunction. Curr.Opin.Struct.Biol. 2009 19 ,120. (13)Schlick,T.Moleculardynamics-basedapproachesforenhanced samplingoflong-time,large-scaleconformationalchangesin biomolecules. F1000Biol.Rep. 2009 1 ,51. (14)Bonomi,M.;Branduardi,D.;Bussi,G.;Camilloni,C.;Provasi, D.;Raiteri,P.;Donadio,D.;Marinelli,F.;Pietrucci,F.;Broglia,R.A.; Parrinello,M.PLUMED:Aporta blepluginforfree-energy calculationswithmoleculardynamics. Comput.Phys.Commun. 2009 180 ,1961. (15)Gilson,M.K.;Zhou,H.-X.CalculationofProtein-Ligand BindingAffinities. Annu.Rev.Biophys.Biomol.Struct. 2007 36 ,21. (16)Karney,C.F.;Ferrara,J.E.;Brunner,S.Methodforcomputing proteinbindingaffinity. J.Comput.Chem. 2005 26 ,243. (17)Michel,J.;Essex,J.W.Predictionofprotein-ligandbinding affinitybyfreeenergysimulations:assumptions,pitfallsandexpect-ations. J.Comput.-AidedMol.Des. 2010 24 ,639. (18)Sippl,M.J.Calculationofconformationalensemblesfrom potentialsofmenaforce:anapproachtotheknowledge-based predictionoflocalstructuresinglobularproteins. J.Mol.Biol. 1990 213 ,859. (19)Tuffery,P.;Derreumaux,P.Flexibilityandbindingaffinityin protein-ligand,protein-proteinandmulti-componentproteininteractions:limitationsofcurrentcomputationalapproaches. J.R.Soc. Interface 2012 9 ,20. (20)Fan,H.;Schneidman-Duhovny,D.;Irwin,J.J.;Dong,G.; Shoichet,B.K.;Sali,A.Statisticalpotentialformodelingandranking ofprotein-ligandinteractions. J.Chem.Inf.Model. 2011 51 ,3078. (21)Gohlke,H.;Hendlich,M.;Klebe,G.Knowledge-basedscoring functiontopredictprotein-ligandinteractions. J.Mol.Biol. 2000 295 337. (22)Huang,S.Y.;Zou,X.Aniterativeknowledge-basedscoring functiontopredictprotein-ligandinteractions:II.Validationofthe scoringfunction. J.Comput.Chem. 2006 27 ,1876. (23)Huang,S.Y.;Zou,X.Advancesandchallengesinprotein-ligand docking. Int.J.Mol.Sci. 2010 11 ,3016. (24)Huang,S.-Y.;Zou,X.InclusionofSolvationandEntropyinthe Knowledge-BasedScoringFunctionforProtein LigandInteractions. J.Chem.Inf.Model. 2010 50 ,262. (25)Muegge,I.;Martin,Y.C.Ageneralandfastscoringfunctionfor protein-ligandinteractions:asimplifiedpotentialapproach. J.Med. Chem. 1999 42 ,791. (26)Muegge,I.Aknowledge-basedscoringfunctionforproteinligandinteractions:Probingthereferencestate. Perspect.Drug DiscoveryDes. 2000 20 ,99. (27)Muegge,I.EffectofligandvolumecorrectiononPMFscoring. J. Comput.Chem. 2001 22 ,418. (28)DeWitte,R.S.;Shakhnovich,E.I.SMoG:deNovodesign methodbasedonsimple,fast,andaccutatefreeenergyestimate.1. Methodologyandsupportingevidence. J.Am.Chem.Soc. 1996 118 11733. (29)Velec,H.F.;Gohlke,H.;Klebe,G.DrugScore(CSD)knowledge-basedscoringfunctionderivedfromsmallmoleculecrystal datawithsuperiorrecognitionrateofnear-nativeligandposesand betteraffinityprediction. J.Med.Chem. 2005 48 ,6296. (30)Zheng,Z.;Merz,K.M.,Jr.Ligandidentificationscoring algorithm(LISA). J.Chem.Inf.Model. 2011 51 ,1296. (31)Benson,M.L.;Faver,J.C.;Ucisik,M.N.;Dashti,D.S.;Zheng, Z.;Merz,K.M.,Jr.Predictionoftrypsin/molecularfragmentbinding affinitiesbyfreeenergydecompositionandempiricalscores. J. Comput.-AidedMol.Des. 2012 26 ,647. (32)Faver,J.C.;Zheng,Z.;Merz,K.M.,Jr.Modelforthefast estimationofbasissetsuperpositionerrorinbiomolecularsystems. J. Chem.Phys. 2011 135 ,144110. (33)Faver,J.C.;Zheng,Z.;Merz,K.M.,Jr.Statistics-basedmodel forbasissetsuperpositionerrorcorrectioninlargebiomolecules. Phys. Chem.Chem.Phys. 2012 14 ,7795. (34)Faver,J.C.;Yang,W.;Merz,K.M.,Jr.TheEffectsof ComputationalModelingErrorsontheEstimationofStatistical MechanicalVariables. J.Chem.TheoryComput. 2012 8 ,3769. (35)Zheng,Z.;Merz,K.M.,Jr.DevelopmentoftheKnowledgeBasedandEmpiricalCombinedScoringAlgorithm(KECSA)To ScoreProtein LigandInteractions. J.Chem.Inf.Model. 2013 53 1073. (36)Wang,R.;Fang,X.;Lu,Y.;Wang,S.ThePDBbinddatabase: collectionofbindingaffinitiesforprotein-ligandcomplexeswith knownthree-dimensionalstructures. J.Med.Chem. 2004 47 ,2977. (37)Wang,R.;Fang,X.;Lu,Y.;Yang,C.Y.;Wang,S.ThePDBbind database:methodologiesandupdates. J.Med.Chem. 2005 48 ,4111. (38)Goodsell,D.S.;Morris,G.M.;Olson,A.J.Automateddocking offlexibleligands:applicationsofAutoDock. J.Mol.Recognit. 1996 9 1. (39)Leis,S.;Zacharias,M.Efficientinclusionofreceptorflexibilityin grid-basedprotein-liganddocking. J.Comput.Chem. 2011 32 ,3433. (40)Taylor,R.D.;Jewsbury,P.J.;Essex,J.W.Areviewofproteinsmallmoleculedockingmethods. J.Comput.-AidedMol.Des. 2002 16 151. (41)Wu,G.;Robertson,D.H.;Brooks,C.L.,3rd;Vieth,M.Detailed analysisofgrid-basedmoleculardocking:AcasestudyofCDOCKERACHARMm-basedMDdockingalgorithm. J.Comput.Chem. 2003 24 ,1549. (42)Kubinyi,H.Comparativemolecular eldanalysis(CoMFA). HandbookofChemoinformatics:FromDatatoKnowledgein4Volumes ; Wiley-VCH:Weinheim,2008;p1555. (43)Goodford,P.J.Acomputationalprocedurefordetermining energeticallyfavorablebinding sitesonbiologicallyimportant macromolecules. J.Med.Chem. 1985 28 ,849. (44)Friesner,R.A.;Banks,J.L.;Murphy,R.B.;Halgren,T.A.; Klicic,J.J.;Mainz,D.T.;Repasky,M.P.;Knoll,E.H.;Shelley,M.; Perry,J.K.Glide:anewapproachforrapid,accuratedockingand scoring.1.Methodandassessmentofdockingaccuracy. J.Med.Chem. 2004 47 ,1739. (45)Halgren,T.A.;Murphy,R.B.;Friesner,R.A.;Beard,H.S.; Frye,L.L.;Pollard,W.T.;Banks,J.L.Glide:anewapproachforrapid, accuratedockingandscoring.2.Enrichmentfactorsindatabase screening. J.Med.Chem. 2004 47 ,1750. (46)Friesner,R.A.;Murphy,R.B.;Repasky,M.P.;Frye,L.L.; Greenwood,J.R.;Halgren,T.A.;Sanschagrin,P.C.;Mainz,D.T. ExtraPrecisionGlide:DockingandScoringIncorporatingaModelof HydrophobicEnclosureforProtein-LigandComplexes. J.Med.Chem. 2006 49 ,6177. (47)Sastry,G.M.;Adzhigirey,M.;Day,T.;Annabhimoju,R.; Sherman,W.Proteinandligandpreparation:parameters,protocols, andinfluenceonvirtualscreeningenrichments. J.Comput.-AidedMol. Des. 2013 27 ,221. (48)Rostkowski,M.;Olsson,M.H.;Sndergaard,C.R.;Jensen,J.H. GraphicalanalysisofpH-dependentpropertiesofproteinspredicted usingPROPKA. BMCStruct.Biol. 2011 11 ,6. (49)Olsson,M.H.M.;Sndergard,C.R.;Rostkowski,M.;Jensen,J. H.PROPKA3:ConsistentTreatmentofInternalandSurfaceResiJournalofChemicalTheoryandComputation Articledx.doi.org/10.1021/ct4005992 | J.Chem.TheoryComput. 2013,9,5526 55385537

PAGE 13

duesinEmpiricalpKapredictions. J.Chem.TheoryComput. 2011 7 525. (50)Jorgensen,W.L.;Tirado-Rives,J.TheOPLS[optimized potentialsforliquidsimulations]potentialfunctionsforproteins, energyminimizationsforcrystalsofcyclicpeptidesandcrambin. J.Am. Chem.Soc. 1988 110 ,1657. (51)Kaminski,G.A.;Friesner,R.A.;Tirado-Rives,J.;Jorgensen,W. L.EvaluationandreparametrizationoftheOPLS-AAforcefieldfor proteinsviacomparisonwithaccuratequantumchemicalcalculations onpeptides. J.Phys.Chem.B. 2001 105 ,6474. JournalofChemicalTheoryandComputation Articledx.doi.org/10.1021/ct4005992 | J.Chem.TheoryComput. 2013,9,5526 55385538