UFDC Home  myUFDC Home  Help 



Full Text  
PAGE 1 1 ERROR ESTIMATION IN BIOMOLECULAR MODELING By JOHN C. FAVER A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY U NIVERSITY OF FLORIDA 2012 PAGE 2 2 2012 John C. Faver PAGE 3 3 To my supportive friends and family PAGE 4 4 ACKNOWLEDGMENTS I would like to thank my committee members: Prof. Kenneth Merz Jr. Prof. Christopher Stanton, Prof. A drian Roitberg, Dr. Benjamin Smith, and Dr. Erik Deumens I appreciate the ir willingness to s pend time examining my research, help me refine my ideas, and offer advice along the way. I especially thank my adviser Prof. Merz, who has been a wonderful mentor to me for the last five years. Under his direction, I have undoubtedly become a better problem solver, technical writer, and scientist. I also thank my friends and colleagues in the Quantum Theory Project and Department of Chemistry at the University of Florida. I have had plenty of helpful discussions and exciting collaborations with many of them over the years. An incomplete list of them includes: Dr. Xiao He, Dr. Mark Benson, Dr. Ben Roberts, Dr. Bing Wang, Zheng Zheng, Nihan and Danial Dashti. In addition, much of this work was done in collaboration with Prof. David Sherrill and his group at Georgia Tech. Much of the funding for this work was provided by the UF alumni award and the NIH. I finally thank my parents, who have been completely supp ortive of me throughout my whole life. PAGE 5 5 TABLE OF CONTENTS page ACKNOWLEDGMENTS ................................ ................................ ................................ .. 4 LIST OF TABLES ................................ ................................ ................................ ............ 8 LIST OF FIGURES ................................ ................................ ................................ .......... 9 LIST OF ABBREVIATIONS ................................ ................................ ........................... 11 ABSTRACT ................................ ................................ ................................ ................... 12 CHAPTER 1 COMPUTATIONAL MODELS OF BIOMOLECULAR SYSTEMS ........................... 14 Introduction ................................ ................................ ................................ ............. 14 Thermodynamic Models ................................ ................................ .......................... 15 Direct Ensemble Methods ................................ ................................ ................ 15 Thermodynamic Cycles ................................ ................................ .................... 16 Energy Models ................................ ................................ ................................ ........ 17 Force Fields ................................ ................................ ................................ ...... 18 Quantum Methods ................................ ................................ ............................ 19 Wave function methods ................................ ................................ ............. 19 Density functional methods ................................ ................................ ........ 20 Basis sets ................................ ................................ ................................ ... 20 Linear scaling methods ................................ ................................ .............. 22 2 STATISTICAL ERROR ANALYSIS ................................ ................................ ......... 24 Introduction ................................ ................................ ................................ ............. 24 Types of Errors ................................ ................................ ................................ ....... 24 Probability Density Functions ................................ ................................ .................. 25 Error Propagation ................................ ................................ ................................ ... 26 Taylor Series Approach ................................ ................................ .................... 26 Monte Carlo Approach ................................ ................................ ..................... 28 3 MOLECULAR FRAGMENT BASED METHOD FOR ERROR ESTIMATION ......... 29 Introduction ................................ ................................ ................................ ............. 29 Additive Fragment Based Energy Models ................................ ............................... 29 Measuring Nonadditivity in Energy ................................ ................................ ......... 30 Propagation of Errors in Fragment Based Energies ................................ ............... 33 Estimating Fragment Based Errors ................................ ................................ ......... 33 Construction of Molecular Fragment Interaction Databases ................................ ... 34 PAGE 6 6 4 PROTEIN LIGAND INTERACTIONS ................................ ................................ ...... 38 Introduction ................................ ................................ ................................ ............. 38 Error Propagation in the Estimation of Protein Ligand Binding Affinities ................ 39 Practical Error Estimation for Protein Ligand Interactions ................................ ....... 42 Methods ................................ ................................ ................................ ............ 43 Results and Discussion ................................ ................................ .................... 45 Conclusions ................................ ................................ ................................ ............ 50 5 THE PROTEIN FOLDING PROBLEM ................................ ................................ .... 58 Introduction ................................ ................................ ................................ ............. 58 Error Propagation in Protein Folding Simulations ................................ ................... 60 Practical Error Estimation for Protein Folding Simulations ................................ ...... 62 Methods ................................ ................................ ................................ ............ 64 Results and Discussion ................................ ................................ .................... 65 Conclusions ................................ ................................ ................................ ............ 70 6 BASIS SET SUPERPOSITION ERROR ................................ ................................ 81 Introduction ................................ ................................ ................................ ............. 81 Statistics based Model for Estimating BSSE ................................ .......................... 83 Applications of the BSSE Mode l ................................ ................................ ............. 86 Protein Ligand Interactions ................................ ................................ ............... 86 Protein Folds ................................ ................................ ................................ .... 87 Conclusions ................................ ................................ ................................ ............ 88 7 STATISTICAL ENSEMBLES ................................ ................................ .................. 97 Introduction ................................ ................................ ................................ ............. 97 First order Error Analysis ................................ ................................ ........................ 97 Free Energy ................................ ................................ ................................ .... 100 Average Energy ................................ ................................ .............................. 102 Entropy ................................ ................................ ................................ ........... 103 Higher Order Error Analysis ................................ ................................ .................. 104 Proposed General Error Handling Protocol ................................ .......................... 107 Conclusions ................................ ................................ ................................ .......... 109 APPENDIX: DERIVATION OF ERROR PROPAGATION FORMULAS ..................... 121 Derivatives of Helmholtz Free Energy through Fo urth Order ................................ 121 First Order Error Propagation in Helmholtz Free Energy ................................ ...... 122 First Order Error Propagation in Average Energy ................................ ................. 122 First Order Error Propagation in Entropy ................................ .............................. 122 Second Order Error Propagation in Helmholtz Free Energy ................................ 125 Fourth Order Error Propagation in Helmholtz Free Energy ................................ ... 126 LIST OF REFERENCES ................................ ................................ ............................. 128 PAGE 7 7 BIOGRA PHICAL SKETCH ................................ ................................ .......................... 141 PAGE 8 8 LIST OF TABLES Table page 4 1 Error metrics for various energy models over the set of 1HSG fragments. ......... 52 5 1 Interaction energy error statistics for the 1UBQ fragment database. .................. 72 5 2 Analysis of ff99SB on the Rosetta Decoy Set ................................ ..................... 73 5 3 Analysis of PM6 on the Rosetta Decoy Set. ................................ ....................... 74 5 4 Analysis of PM6 DH2 on the Rosetta Decoy Set. ................................ ............... 75 6 1 Optimized BSSE Model Parameters for MP2/aDZ ................................ ............. 90 6 2 Optimized BSSE Model Parameters for MP2/6 31G* ................................ ......... 90 6 3 BSSE predi ctions for the HIV II protease/indinavir complex at MP2/aDZ. .......... 90 6 4 BSSE predictions for the HIV II protease/indinavir complex at MP2/6 31G*. ..... 91 7 1 Propagated errors in free energy at different levels of truncation at the isoenergetic point for small microstate energy errors. ................................ ...... 111 7 2 Propagated errors in free energy a t different levels of truncation at the isoenergetic point for M1. ................................ ................................ ................. 111 7 3 Free energy values from the simulated ensemble. ................................ ........... 111 PAGE 9 9 LIST OF FIGURES Figure page 1 1 T hermodynamic cycle for the protein ligand binding process. ............................ 23 3 1 Error distribution viewer for the molecular fragmen t interaction energy database. ................................ ................................ ................................ ............ 36 3 2 Fragment viewer for the molecular fragment interaction energy database. ........ 37 4 1 LigPlot dia gram of the HIV II protease/indinavir complex. ................................ .. 53 4 2 Example fragmentation of the HIV II protease/indinavir complex. ...................... 54 4 3 Mol ecular representations of the 21 independent chemical contacts present in the HIV II protease/indinavir complex. ................................ ............................ 55 4 4 Histogram of absolute electronic interaction energy errors for several m ethods. ................................ ................................ ................................ ............. 56 4 5 Histogram and Gaussian probability density function describing errors in the GAFF energy model. ................................ ................................ .......................... 57 5 1 Example of model systems used to build molecular interactions in proteins. ..... 76 5 2 Distortions in computed free energy landscapes due to modeling errors. .......... 77 5 3 Histogram and Gaussian probability density function describing errors in B97 D/TZVP over 1UBQ fragments. ................................ ................................ ... 78 5 4 Systematic error magnitudes versus chain length in the Ros etta Decoy Set. ..... 79 5 5 Random error magnitudes versus chain length in the Rosetta decoy set. .......... 80 6 1 BSSE magnitudes i n the fragm ent database at MP2/6 31G* and MP2/aug cc pVDZ ................................ ................................ ................................ ............. 92 6 2 Measured distance dependence curves of BSSE at MP2/6 31G* ...................... 93 6 3 Distribut ion of IBSSE estimates for the Pin1 WW domain native/decoy fold set. ................................ ................................ ................................ ...................... 94 6 4 Relative FMO MP2/6 31G*+PCM energies of each of the Pin1 WW domain folds before and after IBSSE corrections. ................................ ........................... 95 6 5 The relative energies of the Pin1 WW domain folds ordered by BSSE corrected energies.. ................................ ................................ ............................ 96 PAGE 10 10 7 1 Propagation of errors in free ener gy for the model two state system M1. ........ 112 7 2 Analysis of ensemble size dependence of propagated errors. ......................... 113 7 3 Propagation of errors in average energy fo r the model two state system M1. .. 114 7 4 Propagation of errors in entropy for the model two state system M1. ............... 115 7 5 Error propagation in a two state system with small microstate energy errors. .. 116 7 6 Monte Carlo estimation of propagated errors in the free energy of M1 ............. 117 7 7 Monte Carlo estimation of free energy errors as a function of increased local sampling of the example Lennard Jones surface. ................................ ............ 118 7 8 Simulated ensemble M3. ................................ ................................ .................. 119 7 9 Energies and free energies of the simulated ensemble M3. ............................. 120 PAGE 11 11 LIST OF ABBREVIATION S BSIE Basis Set Incompleteness Error BSSE Basis Set Sup erposition Error CBS Complete Basis Set limit CCSD(T) Coupled Cluster Theory (Singles Doubles and Perturbative Triples) CI Configuration Interaction D&C Divide and Conquer methods DFT Density Functional Theory FBDD Fragment Based Drug Design FMO Fragment M olecular Orbital method GAFF Generalized AMBER Force Field HF Hartree Fock IBSSE Intramolecular Basis Set Superposition Error MC Monte Carlo MD Molecular Dynamics MFCC Molecular Fragmentation with Conjugate Caps MM Molecular Mechanics MNDO Modified Neglec t of Differential O verlap MPX Mller Plesset Perturbation Theory (to X Order) PDB Protein Data Bank PDF Probability Density Function QM Quantum Mechanics SCF Self Consistent Field PAGE 12 12 Abstract of Dissertation Presented to the Grad uate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy ERROR ESTIMATION IN BIOMOLECULAR MODELING By John C. Faver August 2012 Chair: Kenneth M. Merz Major: Chemistry Computational m odeling of che mical processes is becomi ng an increasingly powerful means of gaining atomic level insight t hat can help interpret guide, or even predict outcomes of experimental work I t is anticipated that computational chemical models will be able to tackle larger and more complex systems over time with ever improving hardware accelerating the rate of discoveries in materials, biological, and medical science s A t the heart of all computational studies are theoretical models which make various assumptions about the und erlying physics of a problem Thus every theoretical model in computational chemistry has its own unique s et of advantages and disadvantages Of utmost concern are computational complexity and predictive accuracy While the most computationally demanding q uantum chemical models are able to match observed energies and spectra to within experimenta l error, only the simpler, more approximate models are currently able to simulate the dynamics of thousands of atoms over biochemical timescales. For this reason, r esearchers must always choose energy models that sufficiently balance speed and accuracy for their specific problem PAGE 13 13 This work aims to develop general methods for estimating intrinsic errors in energy models which allow for automated error correction and u ncertainty reporting in biomolecular modeling Such methods are desirable for two reasons. First, simple low cost energy models could be used while simultaneously accounting for their inaccuracies, allowing for improved agreement with experiment. Secondly, it has been shown that even small modeling errors propagate to yield large r magnitudes as system size increases implying that error estimation will be necessary as larger systems are studied. This dissertation describes a statistics and molecular fragm ent based approach for error estimation in biomolecular modeling and its application to several types of studies Chapter 1 serves as a brief overview of common computational models used for studying biomolecular systems. Chapter 2 describes elements of statistical error analysis that serve as the basis of the proposed method, which is then outlined in Chapter 3 The remaining chapters apply the methodology to specific problems, namely protein ligand binding (Chapter 4 ), protein folding (Chapter 5 ), basis set superposition error (Chapter 6 ), and the estimation of statistical mechanical variables (Chapter 7 ). PAGE 14 14 CHAPTER 1 COMPUTATIONAL MODELS OF BIOMOLECULAR SYST EMS Introduction Much of the molecular basis of known life can be represented by only a few type s of molecules and their interactions with one an other For instance, n ucleic acid polymers store genetic information that guides the specific linking of amino acid molecules to form proteins. Proteins then provide most of the metabolic functionality to th e biological cell by interacting with other proteins, nucleic acids, or small molecules (e.g. oxygen, sugar s, aspirin etc. ) Astonishingly, this basic paradigm of molecular biology has developed and evolved into its modern understanding within the span of a single human lifetime. Yet m any challenges remain to be solved in this information rich world of chemical biology: How do amino acid chains fold into functional proteins so quickly and precisely? How are intra and intercellular communication conducted? How can we d esign more effective disease combating drugs with little or no side effects? A s with most other scientific fields, the computer revolution of the last half century has accelerated the rate at which researchers can obtain, model, and analyze d ata. Whereas in the 1920s the quantum mechanical wave function for the hydrogen atom had to be worked out by hand, now the fully ab initio quantum treatment of proteins containing thousands of atoms is feasible Moreover computing power, and thus our abil ity to model more complex biomolecular systems, can only be expected to increase. It is clear that computation will remain an integral part of scientific progress in the future, serving to bridge the gap between theory and experiment. This introductory cha pter will describe some of the commonly used computational methodologies for simulating PAGE 15 15 biomolecular systems today and introduce some of the problems they are used to solve. Thermodynamic Models All physical systems evolve over time according to the law s of thermodynamics. Biomol ecular systems are no exception; their processes evolve in such a manner to minimize total free energy. Thus one of the most basic goal s of in silico biomolecular modeling is to accurately predict relative free energies of the st ates of any given system. Direct Ensemble Methods The most general free energy estimation methods are derived directly from the equations of statistical mechanics, and require adequate knowledge of the potential energy surface of the system under study T his involves extensive sampling of molecular configurations estimating the potential energy of each point 1 4 The free energy difference F between two observable states A and B can then be calculated with Equation 1 1 in which k b is the Boltzmann constant and T is the absolute temperature The quantity Q s is the partition function for state s the form of which depends on the chosen statistical ensemble. The discrete form of th e partition function for the canonical ensemble is given in Equation 1 2 where the energies E i of each micro state in state s are Boltzmann weighted and summed (1 1) (1 2) PAGE 16 16 While methods for direct estimati on of Equation 1 1 are the most theoretically justified ab initio method s for calculating thermodynamic variables, they are rarely used for complex systems due to the so called sampling problem As system size increases, the number of pos sible molecula r configurations grows very rapidly Popular alternatives to the direct estimation of the partition function of Equation 1 2 involve sampling only the highest contributing (i.e. low energy ) regions of phase space. These methods can be roughly divided into two main categories: molecular dynamics (MD) and Monte Carlo methods (MC) which incorporate random configurational perturbations and generate Markov chains with spec ified acceptance criteria 5 Thermodynamic Cycles In order to compare various states of a sy stem one can use thermodynamic cycles which exploit the additive nature of state functions in thermodynamics. Free energies of processes can be deconstructed into several terms, as illustrated in Figure 1 1 for the process of protein ligand binding. By fo cusing on the relevant a process (e.g. a free ligand, unbound protein receptor, and the bound complex), one can compute the Gibbs free energy of binding with Equation 1 3. It is possible to further decompose the expression by desolvating th e three systems and separating entropic and thermal effects as demonstrated in Equation 1 4. A benefit of using the free energy decomposition of Equation 1 4 is that the different terms c an be estimated with any combination of methods. 6 In addition, t hermodynamic cycles have the advantage of being able to bypass most of the sampling problem (by focusing only on specified states of the system ) and many docking and scoring protocols bypass it completely by utilizing o nly one microstate per state (the presumed local minima) It should be noted that such PAGE 17 17 single end point methods lack the built in error reducing behavior of ensemble methods as discussed in Chapter 7 (1 3) (1 4) Energy Models Regardless of which thermodynamic scheme is used, energies of molecular configurations must be estimated with some energy model. There exist s a large collection of applicable energy models, each one with its own unique set of a dvantages and disadvantages. For example, quantum methods like coupled cluster theory with complete basis set extrapolations can model energies and spectra of small molecules to within ex perimental error bars 7 but this method is currently far to o computationally demanding for very large systems Semiempirical quantum methods are able to treat larger systems and naturally treat effects like polarization and electron delocalization, but still rely on parameterization and some of these models are known treat non bonded interactions poorly. Force fi elds are computationally simpler still, and can model the dynamics of systems containing many thousands of atoms, but need to be well parameterized to yield agreement with experimental results. Because of these trade offs, every study in computational chem istry requires some sort of cost vs. accuracy analysis at the outset to find the optimal level of theory for the particular problem at hand. Two broad families of physics based energy models will be discussed here. Force fields will be described first, fol lowed by an introduction to various quantum mechanics based methods. PAGE 18 18 Force Fields All atom force fields generally model molecular systems as electric monopoles (representing atoms) bound by springs (representing their bonds and angles) To model dispersion forces, a term such as a Lennard Jones potential is used. A term modeling torsional angle preferences is also utilized The functional form of the AMBER potential 8 is shown as an example in Equation 1 5. The first three terms represent bonded interactio ns while the last double sum represents nonbonded interactions (van der Waals and electrostatic interactions) There are large datasets of parameters for this function covering common atom types observed in proteins and other molecules When nonstandard mo lecular systems are studied, the parameters for Equation 1 5 must first be derived usually via quantum mechanics calculations (1 5) While the energy function above neglects electronic degrees of freedom there are several availa ble polarizable force field models These methods include adding optimizable dipoles or N multipoles adding massless charge carrying virtual particles, or allowing atomic charges to fluctuate in order to equalize electronegativity among neighboring atoms These force fields are generally more difficult to parameterize, but offer additional variation in electronic degrees of freedom to model important polarization effects. PAGE 19 19 Quantum Methods Q uantum mech anics based energy models attempt to estimate the soluti on of the time independent Schrdinger equation for molecular systems under the Born Oppenheimer approximation (shown in Equation 1 6 for electrons i interacting with fixed nuclei a and other electrons j in atomic units ) These methods explicitly treat el ectronic degrees of freedom and thereby naturally model effects like polarization delocalization and charge transfer The set of common ly utilized quantum methods in chemistry can be roughly divi ded into two groups: wave function based methods and densit y functional methods. (1 6) Wave function methods Wave function methods include the canonical Hartree Fock (HF) method, which replaces the electron electron repulsion t erm of the Hamiltonian with a mean field approximation. 9 This results in an iterative procedure (called the self consistent field or SCF procedure) in which the energy of a proposed wave fu nction is minimized under restraints. The r esulting energies neglect the effects of correlation of motion between electrons with opposite spin, and thus much contemporary research i s focused on so post SCF methods Among these are co nfiguration interaction (CI), Mller cluster theory (e.g. CCSD(T)) 10 There also exist several different semiempirical methods which simplify the HF calculation by neglecting specific integral calculations and estimating others with parameter ized functions A common approximation is the modified neglect of d ifferential PAGE 20 20 overlap (MNDO) 11 used in methods including AM1 12 PM3 13 PM6 14 and PDDG 15 These methods have been known to treat nonbonded interactions (e.g. hydrogen or halogen bonds) poorly, so empirical correction terms have been constructed for some of these methods Examples include PM6 DH2 16 PM6 DH2X 17 and A M1 FS1 18 Density functional methods Density functional methods recast the problem of solving for a quantum mechanical wave function into optimizing the g round state electron density of a system. 19 This results in Equation 1 7, where the ground state energy of a system E is a functional of electron dens ity that can be written as the sum of a kinetic energy term T, electron nuclear attraction term V, electron electr on repulsion term J, and an exchange correlation term E xc 20 (1 7) There are many different proposed functional forms for the E xc functional, which have been organized into a hierarchy of models referred to by John Perdew as a s ladder of DFT. 21 Basis sets Both wave function and density functional me thods normally require basis sets to build approximate wave functions or electron densities. Basis functions used in quantum chemistry are usually linear combinations of Gaussian functions centered at nuclear positions Gaussian functions qualitatively mod el the three dimensional shape of atomic orbitals (as solved analytically for the hydrogen atom) and in addition they have mathematical properties which make them computationally efficient in integral calculations Since both methods use basis sets as var iables in an optimization PAGE 21 21 problem, a good rule of thumb is that more flexible basis sets result in more accurate energy estimates. This is rigorously true for methods like HF, for which the variational theorem applies. 9 A common example is the Pople style 6 31G* basis set for which a linear combination of six Gaussian functions is used for inner shell atomic orbitals, a combination of three and a comb ination of one are used for valence atomic orbitals, and the asterisk represents added d functions to second row elements and higher to allow for increased polarization of atomic densities. A set using N basis functions for a single atom ic orbital is called an N zeta set, and thus generally higher N represents increased variational flexibility. Ideally infinitely large basis sets should be used in quantum calculations to obtain the most accurate energies but computational hardware and t ime limits prohibit this option. Instead, models for extr apolating to the complete basis set limit (CBS) have become a popular alternative where two or more high level basis set results are used to estim ate infinite basis set results. 22 High level quantum calculations like CCSD(T) with CBS extrapolations are often used for benchmark studies of other energy models The use of finite and incomplete basis sets introduces what is called basis set incompleteness error (BSIE), which arises from modeling the wave function of a system with an approximate one For variational methods (those for which the variational principle applies) such as HF and CI, BSIE is always positive, i.e. calculated energ ies are always too positive A consequence of BSIE is that energies of interacting molecular systems suffer from basis set superposition error (BSSE) which arises when basis functions centered on neighboring atoms can be used to variationally lower the en ergy contribution of a given molecular fragment by compensating for the inflexibility of the PAGE 22 22 finite basis set. This leads to erroneous interaction energies or conformational energies when energy differences are computed, since this basis set compensation e ffect is much larger when the basis functions are more proximal. The effects of BSSE can be corrected with the counterpoise procedure for intermolecular interactions, in which monomer calculations are conducted with the additional basis functions of the in teracting partner. 23 The intramolecular BSSE case (IBSSE) is more complex, and there have been several a pproaches to correct IBSSE as well. These are discussed further in Chapter 6. Linear scaling meth ods Quantum chemical methods are generally much more computationally demanding than force field methods. For example, HF formally scales as N 4 MP2 as N 5 and CCSD(T) as N 7 Because of this, much work has been done on methods of applying quantum chemistry to large systems. Most of these methods exploit the local nature of quantum effects in insulating systems by breaking the molecular system into smaller subsystems. Among these are the fragment molecular orbital method (FMO) 24 molecular fragmentation with conjugate caps (MFCC) 25 and divide and conquer schemes (D&C) 26 28 By performing several calculations on the smaller subsystems, usually in parallel 29 30 linear scaling in quantum calculati ons can be achieved. Due to the emergence of these algorithms, quantum methods are now routinely used on large biomolecular systems. PAGE 23 23 Figure 1 1. Representation of a thermodynamic cycle for the protein ligand binding process. PAGE 24 24 CHAPTER 2 STATISTICAL ERROR ANALYSIS Introduction Any kind of measurement should be assumed to contain some amount of error. For example when weighing a chemical sample on an analytical balance, fluctuations in air d ensity, the ambient temperature humidity, the method of reading t he scale and many other factors can all have an effect on the data that is recorded. Due to the presence of these factors, calibration and replicated measurements are performed. Calibration is used to estimate and remove systematic errors, while repeated measurements can provide a measure of variance to estima te the effects of random errors. These practices are commonplace in the experimental laboratory, but thus far have not had a large presence in computational chemical modeling. This chapter serves as a summary of error estimation techniques which will be applied to specific problems in computational c hemical modeling in later chapters. Types of Errors Errors inherent in measurements can be characterized as one or a combination of two extremes: systemati c and random errors. 31 Systematic errors represent con stant urement. If a series of repeated measurements is conducted, and each contain s the same systemati c error, estimated differences between measured values will benefit from complete error cancelation. A simple laboratory example might be weighing chemical samp les on an analytical balance along with a standard sheet of weighing paper with a known mass Random errors, on the other hand, do not have definite values, but take on random values determined by some probability density function, usually assumed to be a normal PAGE 25 25 distribution. variability of repeated measurements Random errors, unlike systematic errors, are not guaranteed to cancel completely, but their behavior can be described and predi cted in a statistical manner. Probability Density Functions Imagine a researcher who records some observed quantity in a series of repeated measurements, and then constructs a histogram of the recorded data This histogram yields some very useful infor mat ion it illustrates the most common magnitudes of the observable and suggests the relative likelihood of measuring the same or another value in the next measurement Histograms are not easy to manipulate mathematically, so one can find a mathematical funct ion that fits th is histogram of recorded values Such a mathematical function f(x) which models the relative frequencies of observing x is called the limiting distribution of the population If this functi on is normalized to one it is called the probabi l it y density function (PDF), and the PDF P(x) yields the probability of measuring a value between x and x+dx. 32 W hen multiple types of measurements are made simultaneously each described by their own PDF s the PDFs can be combined in defined ways, which makes them especially useful for analysis of er ror propagation. Of particular interest is the normal distribution, defined by Equation 2 1, which is (the mean of the distribution) 2 (the variance of the distribution). (2 1) PAGE 26 26 The normal distribution has favorable properties for describing fun ctions of random variables. For example, in the case of the sum of t wo independent random variables x (described by the PDF P x ) and y (described by the PDF P y ) the PDF of the sum z is evaluated as the convolution of P x and P y (Equation 2 3 ). It can be sho wn that i f P x and P y are both normal distributions then P z is also a normal distribution and its mean and variance are equal to the sums of means and variances of P x and P y respectively (Equation 2 4 and 2 5 ) 32 This property allows for very fast estimations of PDFs for simple sums of normally distr ibuted random variables. (2 2) (2 3) (2 4) (2 5) Error Propagation Taylor Series Approach A general function f(x) that is infinitely differentiable about point a can be represented by the infinite series in Equat ion 2 6 (2 6 ) One can let x represent a measured input variable containing error and a represent its true error free value. The error in the measurement is then the quantity = ( x a ). By r earranging terms in Equation 2 6 we can now write an expression for t he error in the output variable, in terms of error in the input variable, (Equation 2 7 ). PAGE 27 27 When there are N input variables, the deviation f from the vector a due to the errors in the input vector x is given b y Equation 2 8 (2 7 ) (2 8 ) at low orders to yield si mple error propagation formulas. Equations 2 9 and 2 1 0 represent the simplest first order model of error propagation. First order systematic errors (Equation 2 9 ) are simple sums of component systematic errors ( which can be estimated by the means of the component error i ) weighted by the first order d erivatives of the output variable f with respect to each input variable x i First order independent random errors (Equation 2 10 ) propagate as Pythagorean sums of the individual random errors ( which can be estimated by the standard deviation of their PDFs, i ) weighted by the first derivative of the output variable. (2 9 ) (2 10 ) When higher accuracy is needed, higher orders of error propagation can be used. The equations for second order er ror pro pagation are given below, assuming that each input variable x i is independent from the others. 33 Note that the variance of the input variables affe ct propagated systematic errors. T h is important e ffect is neglected by first PAGE 28 28 order formulas. Note that if f( x ) is a simple sum of x i Equations 2 11 and 2 12 are identical to Equations 2 4 and 2 5. (2 11 ) (2 12 ) Monte Carlo Approach If the errors or uncertainties in the input variables are very high, higher orders of error propagation formulas would need to be used. Unfortunately these formulas quickly become complicated and computationally demanding especially for high dimensional prob lems. In the regimes of large errors and high dimensionality, an alternative method of error propagation is a Monte Carlo or brute force approach. In Monte Carlo (MC) error estimation, each of the input variables is assigned a PDF from which random samples are drawn many random selections of input values, which yields a distribution of values of the output variable. the standard deviation of this distribution is a measure of its uncertainty. MC error estimation gives a straightforward measure of propagated random error, but limited information about propagated systematic error. Thus low order error propagation formulas or direct observation should be used to estimate overall systematic errors in these regimes. PAGE 29 29 CHAPTER 3 MOLECULAR FRAGMENT BASED METHOD FOR ERR OR ESTIMATION Introduction This chapter will describe the general methodology used in this work for error estimation in biomolecular simulations. The method makes use of some important assumptions which need to be addressed. First, fragment interaction energies are assumed to contribute to overall energies independently in an additive manner. Secondly, errors in entropy are neglected, which has important practical consequences for computational methods using free energy decomposition schemes and thermodynamic cycles. Lastly, it relies on a chosen method for computing reference or level quantum chemical methods. Additive Fragment Based Energy Models Fragment based drug design (FBDD) has become a widely adopted paradigm for computer aided drug discovery efforts, largely due its ability to quickly scan large databases of molecular fragmen ts (and their combinations) for expected properties like binding affinity for a protein receptor target 34 35 In the framework of FBDD, the properties of molecular fragments constituting the overall molecule are oft en assumed to add linearly. For example, a potential drug molecule might have a hydrogen donatin g hydroxyl group and a hydrogen accepting amine group, both enhancing the overall stabiliz ation of the ligand in a receptor 36 In mathematical terms, this implies that the overall binding free energy can be decomposed into a sum of separate terms for each molecular fragment interaction. (3 1) PAGE 30 30 Although this additive free energy model is appealing in its simplicity, Mark and van Gunsteren showed that it is generally unreliable. 37 As discussed in Chapter 1, free energy depend s on the volume of accessible phase space available to a system, which is a global quantity not easily decomposed into molecular fragment contributions. The authors show that if a Hamiltonian can be expressed as a sum of components (Equation 3 2) then energy or enthalpy can be similarly expressed (Equation 3 3) but free energies will always contain term s representative of the correlation of motion in phase space between the components This is shown via the Taylor series expansion of G. (Equation 3 4) The difference between free energy and the sum of fragment components or the correlation between fragment energies, is called nonadditivity in free energy If the Hamiltonians in Equation 3 2 are not truly separable, then the difference between the total energy and the sum of component energies is the nonadditivity in energy. (3 2) (3 3) (3 4) Measuring Nonadditivity in Energy Mark and van Gunsteren showed that additivity principles are valid for enthalpy when the global Hamiltonian is separable. Computational i nvestigations of the validity of this approximation have been published for at least two different types of systems Xanthea s analyzed a many body expansion of the total energy of water clusters to estimate the relative contributions from each term. 38 T he total energy of the water PAGE 31 31 clusters was represented by Equation 3 5 which decomposed the total energy of the cluster into a sum of one body terms, two body correction terms (Equation 3 6) three body correction terms (Equation 3 7) and so on. (3 5 ) (3 6) (3 7) If the additivity approximation holds, then the sum of one and two body terms should be equal to the total interaction energy of the system. This would mean that the sum of interaction energies between each water molecule in the cluster fully captures the stabilization energy of the cluster geometry. The difference between the computed full energy and the se two terms is the magnitude of nonadditivity. Using this method, Xantheas estimated magnitud es of nonadditivity up to 25% of the total interaction energy of small water clusters with MP2/aug cc pVDZ. An analogous procedure has been used to investigate n onadditivity effects in protein ligand systems, where delocalization and polarization effects are expected to be generally smaller. Ucisik et al. studied the HIV II protease/indinavir complex by fragmenting the receptor into eighteen separate molecular fra gments. 39 By using a slightly modified form of Equation 3 5 for protein ligand interaction energies, it was observed that magnitudes of nonadditivity depend greatly on the energy model. For HF /6 31G*, it accounted for 14% of the total interaction energy. F or the M06 L density functional 40 with the 6 31G* basis set it was less than 0.1%. F or the PM6 DH2 semiempirical method it was estimated to be 0.9%. It was also observed that PAGE 32 32 nonadditivity is increased when highly charged and polarizing /polarizable fragments are proximal to each other In addi tion to computational validation, fragment additivity in enthalpy has been observed empirically in the isothermal titration calorimetry (ITC) experiments of Baum et al 41 In their work, the binding affinities of a congeneric series of thrombin inhibitors was measured with ITC, which allow ed for the estimation of free energy of binding along with enthalpic and entropic components. The authors observed that each single functional group change within the series was ass ociated with a near constant biding enthalpy change. The largest deviation in the enthalpy changes within the series was about 0.07 kcal/mol, which is smaller than the precision of most computational methods. On the other hand, the largest deviation s in fr ee energy and entropy changes for any fragment transformation were over 10 times as large. This experimental observation suggests that the a dditive fragment approximation can be justified for enthalpy but not for free energy or entropy In summary, it se ems that additivity of fragment contributions in interaction energies is an acceptable approximation for biomolecular systems. The validity is slightly dependent on the computational method, and is affected by the presence of highly polarizing molecular fr agments in the system. The approximation is especially beneficial for error correction schemes, because the overall stabilization energy of a system can be represented as a sum of molecular fragment interaction energies (Equation 3 8). (3 8) PAGE 33 33 Propagation of Errors in Fragment Based Energies With the linear model for interaction energy, simple error propagation formulas can be derived. By applying Equation 2 9 and 2 10 to the energy model of Equation 3 8, we find that total mode ling errors can be represented by Equation 3 9 for systematic errors and Equation 3 10 for random errors. (3 9) (3 10) As an example, consider a protein ligand complex comprised of two important molec ular contacts, a hydrogen bond and a van der Waals interaction Suppose t he interaction energies of both of these contacts are estimated with a given energy model. Furthermore, it is known that the energy model generally scores these types of hydrogen bond s with an error of 0.5 0.8 kcal/mol. Here, 0.5 kcal/mol represents a systematic error while 0.8 kcal/ mol represents a random error of uncertainty. At the same time, the energy model typically models similar van der Waals contacts with an error of 0.3 0. 5 kcal/mol. Given this information, the interaction energy between the protein and small molecule can be computed with the energy model, and an error would be estimated to be 0.2 0.94 kcal/mol. The systematic error of 0.2 kcal/mol could then be subtracted from the calculated energy, and the 0.94 kcal/mol represents an overall uncertainty in the total interaction energy. Estimating Fragment Based Errors Clearly, the additive, linear interaction energy model allows for very simple formulas of error propaga tion in single point energies of even large biomolecular systems. The challenge then becomes the correct estimation the in div idual fragment PAGE 34 34 based error s. These can be approximated using statistical methods and databases of previously measured errors in the interaction energies of related molecular complexes. As an example, consider the task of calc ulating the interaction energy error b etween two aromatic rings stacking interactions. Suppose the researcher lacks the resources to conduct a full ab initio quantum calculation, and must use a simpler more approximate energy model. However, he or she has previous knowledge about the accuracy and precision of the approximate energy model with respect to higher level theories or experiment. Based on a sample population of similar bimolecular complexes, from either previous calculations or published data the energy model tends to estimate energies with an error of +0.5 kcal/mol 1.0 kcal/mol. Thus this approximate energy model tends to underestima te the stability of most of these types of complexes, but over estimates a portion of them as described by the 1.0 kcal/mol spread of energy stacking interactions comes about from slight differences i n geometry, stoichiometry, ring constituents, and chemical environment. stacking complex under study falls into this population of previously computed complex energies, the researcher can estimate its error using properties of t he distribution of previously measured errors Construction of Molecular Fragment Interaction Databases Practically, it would not be desirable to require an extensive study of similar bimolecular interaction energies for each new calculation with an approx imate energy model. Rather, it would be preferable to have a general data resource for a broad spectrum of biomolecular interaction classes with many representative systems for each class. The database would also need to have calculated energies with many popular energy models to be applicable to a wide range of studies. With an ideal fragment PAGE 35 35 energy database, error distributions could be estimated for any specific bimolecular interaction class and energy model. The following t wo chapters describe the initi al steps in t he development and use of such database s Chapter 4 is concerned with applications specific to protein ligand interactions, which is especially relevant to computer aided drug design. Chapter 5 describes the problem of protein intramolecular i nteractions, which are important in understanding how proteins fold. In addition to these relatively small interacting molecular fragment databases, we have begun the construction of a larger, more general interacting fragment database. The molecular data is generated with in house fragmentation code applied to a subset of the PDB and the PDBBind protein ligand complex database. The interaction energy data is then generated with various computational chemistry software packages, and is freely available onl ine. The collection of data is loaded into an SQLite database for consists of a PDF viewer for a selected energy model, dataset, interaction class, and reference metho d (See Figure 3 1 for an example). Parameters for Gaussian PDFs are given along with bootstrapped error estimates. Possible outliers on either side of a PDF are highlighted, which can be viewed in the fragment viewer application (Figure 3 2), which supplie s the user with a molecular representation of the system, its database information, and its calculated interaction energies with various methods. It is hoped that with this project, researchers will gain easy access to the data needed for improving their c omputational results with automated error correction and uncertainty reporting. PAGE 36 36 Figure 3 1. Error distribution viewer for the molecular fragment interaction energy database. PAGE 37 37 Figure 3 2. Fragment v iewer for the molecular fragment interaction energy database. PAGE 38 38 CHAPTER 4 PROTEIN LIGAND INTERACTIONS Introduction One of the goals of computer aided drug design is to be able to accurately predict the binding affinity of a small molecule ligand to a protein receptor. 4 2 45 The general solution to this problem would expedite the drug discovery process by allowing for cheap in silico screening of drug candidates rather than expensive in vitro binding assays. Furthermore, c omputational screening of molecules early in the pipeline can reduce wasted experimental effort on non binders later in the process Success in this endeavor has been rather limited, however, due to many factors. Structure based method s rely on accurate protein receptor model s and usually neglect thermal motions of the protein. 46 The role s of water molecules in binding processes are difficult to predict. 47 The accessible conformational space of each ligand in a screening library must be adequately sampled. 4, 48 Tautomeric and protonation states of each ligand must be correctly predicted. 49 In addition, molecular interactions between the ligand and receptor must be accurately modeled to yield estimated binding free energies to within an acceptable amount of error, usually 1.0 kcal/mol. 50 Considering all of these factors from an error propagation perspective, it seems surprising that any success has been achieved in binding affinity estimation at all Algorithms for binding affinity prediction have traditionally b een divided into two separate tasks: docking and scoring. Docking is the process of correctly placing and energetically favorable complex structure. Scoring refers to the bindi ng energy (or pseudo energy) estimation. Many different score functions have been developed, PAGE 39 39 including empirical scores, knowledge based potentials, force fields, and some even incorporating quantum mechanical calculations. 51 53 Analyses of common docking and scoring protocols have shown that many of them are able to correctly identify ligand poses (i.e. successful docking), but have only modest success in estimating free energies of binding. 54 58 Due to the inconsistencies of scoring functions, consensus scoring has become a popular error handling procedure in which several score functions are used simultaneously to predict binding affinities. 59 60 The reasoning is that errors inherent in the individual score functions cancel toward zero when averages across several scores are used. This error reduction phenomenon has been observed both empirically and through gedanken experiments. 61 Other examples of successful prediction s come from free energy differences computed with free energy perturbation (FEP) calculations. 62 65 In these studies, similar ligands are compared by slowly transforming one into the other during an MD or MC simulat ion and computing the free energy change of this process. Usually only one molecular interaction is modified at a time, which reduces the dimensionality of error sources to one. The method is well suited for ordering a congeneric series of ligands by affin ity but still requires a great deal of computational effort for conformational sampling in the MD or MC simulations. Error Propagation in the Estimation of Protein Ligand Binding Affinities A simple analysis of error propagation in computational modeling of protein ligand binding process es was described in a gedanken experiment by Merz. 50 The analysis is based on the thermodynamic cycle of Figure 1 1, where the overall free energy of the molecular as sociation process is broken down into gas phase binding and solvation free energy terms as described by Equation 4 1. Here a superscript s represents free energy PAGE 40 40 in solution while a superscript g represents free energy in the gas phase, a subscript b repre sents the binding process while the s the solvation process, and P and S represent the protein and substrate respectively. (4 1) Equation 4 1 can be decomposed into enthalpic and entropic components a s shown in Equation 4 2. int (Equation 4 3) r epresents the total electronic interaction energy corr (Equation 4 4) represents the change in bind (Equation 4 5) represents solv (Equation 4 6) is a collection of the solvation related free energy terms of Equation 4 1. (4 2) (4 3) (4 4) (4 5) (4 6) As a first approximate model for error propagation, we assume that errors in corr bind solv are zero. This approximation leads to error estimates na med (BCS) error estimates, because they represent lower bound estimates for total random error in free energy. While systematic errors in the three neglected terms may cancel with systematic error in electronic interaction energy, random errors o nly increase with increasing number of error terms according to Equation 2 10 We also invoke the additive interaction energy model of Equation 3 8, by PAGE 41 41 examining the protein ligand complex and estimating the error in interaction energy for each distinct ch emical interaction, e.g. hydrogen bond or van der Waals contact. By doing so, we arrive at the following expressions for overall systematic and random error estimates for protein ligand binding: (4 7) (4 8) ligand complex, namely the HIV II protease/indinavir complex, was deconstructed into 29 distinct chemical fragment interactions. These chemical contacts in the bound complex are illustrated in the LigPlot 66 diagram of Figure 4 1 The complex comprises 18 distinct hydro phobic contacts between the ligand and protein receptor, and 11 hydrogen bonds between either the ligand and protein or ligand and crystallographic water molecules. Each of set ting, and each contributes to the overall protein ligand interaction. If the interaction energy between the ligand and protein is calculated with a model Hamoltonian having intrinsic modeling errors of 0.5 kcal/mol, then by using Equation 4 8 we would est imate an overall expected error bar of 2.7 kcal/mol for the enthalpy of binding. Likewise, if a less accurate model Hamiltonian was used with intrinsic modeling errors of 1.0 kcal/mol, error propagation leads to an estimated error bar of 5.4 kcal/mol. C onsidering that the experimental binding free energy of this system has been reported to be 12.8 kcal/mol, these are very large error bars indeed. Merz points out that the computed binding affinity could range from sub picomolar to micromolar or worse, ma king this PAGE 42 42 HIV II protease binder a false positive leading to wasted laboratory resources in the future, or a false negative leading to a missed opportunity for drug discovery. Of particular importance in this framework is the dependence of propagated erro r on system size. As more chemical contacts are modeled, each with their own associated uncertainty, the overall propagated random error for the association energy increases. This observation is evidenced by the success in free energy perturbation (FEP) me thods which aim to compute free energy differences in binding free energy. Since these methods typically examine single pharmacophore changes at a time, the number of random error sources is limited, and overall errors are determined by the modeling errors in the single mutating pharmacophores. Meanwhile, methods that model the entire binding process directly (e.g. MM/PBSA or docking scores ), have had less success, especially for larger ligands with diverse chemical contacts, each contributing to overall er ror. Practical Error Estimation for Protein Ligand Interactions Following was conducted on the HIV II protease/indinavir complex. 67 The crystal structure was downloaded from the Protein Data Bank, and the bound complex was manual ly deconstructed into 21 distinct chemical fragment interactions. For example, a pyridyl moiety in the ligand forms a van der Waals complex with a portion of the backbone of HIV II protease as shown in Figure 4 2 This fragmentation process resulted 15 non polar/van der Waals type interactions and 6 polar/hydrogen bond interactions between indinavir and the receptor. The final set of fragm ents is displayed in Figure 4 3 Each of these fragment systems was analyzed for the gas phase interaction energy with ma ny different energy models including ab initio wave function based quantum PAGE 43 43 Hamiltonians, density functional methods, semiempirical quantum methods, and polarizable and standard atomic force fields. With this collection of data, distributions of errors with respect to reference values were examined and fit to Gaussian probability distribution functions. In addition, metrics related to error propagation for the binding process were evaluated and compared between the various methods. For these metrics, energie s from CCSD(T)/CBS were used as a reference, since the method is known to consistently match experimental observations to within experimental error bars in many other benchmark studies. 7 Methods The crystal structure of the HIV II protease/indinavir complex was downloaded from the Protein Data B ank (PDBID: 1HSG) 68 Hydrogen atoms were added to the structure with the program Reduce 69 followed by optimization of their positions with th e AMBER FF99SB force field 70 Close contacts (less than 3 ) in the resulting structure were highlighted in the visualization program Chimera 71 which were then manually examined. In many cases, one member of an interacting fragm ent pair would be adjacent to a polarizing group such as the peptide bond along the protein backbone. These polarizing groups, although not directly interacting with the other fragment partner, were included in the fragment structures to model the polariza tion effect on the fragments. Any time an aromatic fragment was involved the entire aromatic ring was included. After manually determining all the chemically important fragment pairs, the fragments were generated by cutting the covalent bonds joining them to the remainder of the molecule, and replacing the bonds with linker hydrogen atoms. The hydrogen bond distances were set to 1.1 for carbon hydrogen bonds and 1.0 for nitrogen hydrogen bonds PAGE 44 44 The gas phase electronic interaction energies o f the 21 in dependent chemica l fragment complexes (Figure 4 3 ) were calculated using the following computational methods: the generalized AMBER force field (GAFF) 72 AMOEBA 73 MMFF 74 AM 1 12 AM1 FS1 18 PM3 13 PM6 14 PM6 DH2 16 PDDG 15 Hartree Fock (HF), second order Mller Plesset perturbation theory (MP2), the M06 L density functional 40 B97 D 75 and coupled cluster with single, double, and perturbative triple excitations (CCSD(T)) 10 Basis sets for the quantum mechanical methods ranged from the 6 31G* Pople type basis set 76 to the aug cc pVXZ (X=D,T,Q hereafter referred to as aXZ) Dunning type basis sets. 77 In addition, extrapolation to the complete basis set (CBS) limit was also used. 22 The CCSD(T) calculations were evaluat and haTZ basis sets, which place diffuse functions only on heavy (non hydrogen) atoms 78 79 In our error analysis, the CCSD(T)/CBS energies were used as our reference, which were evaluated as shown in Equations 4 9 and 4 10. (4 9 ) (4 10 ) T he notation CBS[aTZ aQZ] represents the complete basis set extrapolation from energies calculated with the aTZ and aQZ basis sets. In each q uantum based method, basis set superposition error (BSSE) was accounted for using the counterpoise correction. 23 The Hartree Fock and MP2 interaction energies were calculated with the Gaussian 03 progra m. 80 The M06 L and PM6 interaction energies were calculated with Gaussian 09. 81 AM1, PM3, and PDDG interaction energies were evaluated with the DivCon program. 82 The GAFF energies were calcul ated using the AMBER 10 suite. 83 Most of the CCSD(T) computations were performed using MOLPRO. 84 The larger PAGE 45 45 CCSD(T) computations were performed using NWChem 85 on the Jaguar supercomputer at Oak Ridge National Laboratory. MMFF calculations were performed with the Schr dinger suite. 86 Results and Discussion The deviations from reference energies over the set of 21 fragment interactions were used to evaluate error metrics for each method RMSE, Max Err BCS error and propagated systematic and random errors. R MSE (root mean square error) was simply the square root of the mean of square deviations from the reference as given by Equation 4 1 1, and Max Err was the maximum deviation from reference in the data set (Equation 4 12) (4 11) (4 12) BCS error (best case scenario error) is an error propagation estimate that assumes that each component error is purely random, and that the method contains no systematic errors. Thus BCS error is the Pythagorean sum of the individual error components as given by Equation 4 8. It can be approximated with RMSE as shown by Equation 4 13. This relationship with RMSE provides a means of estimating BCS error values in new complexes with N similar fragment interactions. (4 13) Finally, after fitting the set of measured errors for each method to Gaussian probability density functions (PDFs), propagated systematic and random errors were PAGE 46 46 estimated by using each PDF parameters in Equations 4 7 and 4 8. Each systematic error is estimated by the mean of the PDF, while each random error is estimated by the standard deviation of the PDF. Thus the propagated systematic and random errors were evaluated as Equation 4 14 and Equation 4 15 resp ectively. (4 14 ) (4 15 ) The resulting error metrics for the various energy methods are displ ayed in Table 4 1, which shows some interesting relationships between different types of energy mode ls and basis sets. It also highlights the magnitudes of errors expected in protein ligand complexes as large as this one. BCS error values range from below 1 to over 20 kcal/mol, depending on the method used. Semiempirical quantum methods performed fairly p oorly by this metric, likely due to their poor modeling of hydrogen bonds and dispersion. This is evidenced by the relatively improved performance of PM6 DH2, which contains empirical corrections for nonbonded interactions fit to CCSD(T)/CBS interaction en ergies of other biomolecular fragment complexes. 87 Among the force fields, the polarizable force field AMOEBA performed the best by BCS error but this was based on data for only 12 of the 21 data points due to a la ck of needed AMOEBA parameters. Thus adding polarization to force fields via multipolar expansions can improve estimated energies, but still requires much additional effort to correctly parameterize the score function. Among the ab initio quantum methods, Hartree Fock performed very poorly, presumably due to its incorrect modeling of electron dispersion, which should introduce systematic error. However, since BCS error neglects systematic errors, it greatly underestimates the precision of HF interaction ener gies. MP2 PAGE 47 47 performed well only when large basis sets were used, i.e. beyond aTZ. In addition, BCS error of both MP2 and M06 L had a very strong dependence on basis set. Upon examination of individual absolute interaction energy error magnitudes, it was obse rved that most were positive, i.e. they usually underestimated the stabilization energy of each fragment complex Figure 4 4 illustrates the absolute errors of the 21 fragment complexes for 6 methods as an example. This observation suggested that the energ y models had intrinsic systematic errors that were not explicitly considered by the BCS error or RMSE metrics. In order to distinguish between systematic and random errors, we fit the data to Gaussian probability density functions with means representing random error per interaction (Equation 4 16). (4 16) For each energy model studied, the sample means and sample variances of the absolute in teraction energy errors were used to construct a PDF for each method leading to the parameters listed in columns 3 and 4 of Table 4 1. The PDF for GAFF is shown as an example in Figure 4 5. In order to propagate these fragment based errors and estimate the error of the full protein ligand interaction, we assume that each fragment interaction has an error drawn from the PDF associated with the given energy model. By doing so, we can propagate errors according to Equations 4 14 and 4 15. This process leads to overall protein ligand interaction energy systematic and random error estimates listed in columns 5 and 6 of Table 4 1. PAGE 48 48 Indeed, there were large systematic errors in most methods, and most of them were positive shifts in total interaction energy. Some of these systematic shifts were energies, MP2 would not be expected to capture enough correlation energy with very small basis sets like 6 31G*, and older generation semiempiric al quantum methods fail to model the strength of hydrogen bonds properly, yielding large systematic errors. Surprisingly, systematic errors in parameterized force fields were also significant. GAFF had an overall systematic error of +19.7 kcal/mol, for examp le. In agreement with relatively small errors for each fragment interaction ( see Figure 4 5 ), but when these are propagated over the set of 21 independent fragment systems, the errors become prohibitive for accura tely estimating binding free energies. In fact, this systematic shift is larger than the experimental binding free en ergy itself ( 12 .8 kcal/mol). The AM1 FS1 semiempirical quantum method had the lowest magnitude of systematic error among the lower level m ethods, suggesting that it is very well paramet erized for this system as individual errors tended to cancel toward zero. Estimating systematic err ors is of interest because they are correctable errors. After estimating an interaction energy and associated systematic error, the systematic error can be subtracted to yield an energy estimate which is probably closer to the reference value. Furthermore, when comparing a series of similar ligands, these errors would mostly cancel when taking energy di fferences. Systematic errors are thus not the true bottleneck in correctly predicting binding energies, but rather random error, or imprecision, should be the ultimate barrier. The correlated wave function PAGE 49 49 quantum methods were the only energy models to bre ak the 1.0 kcal/mol propagated The semiempirical methods generally performed the worst, again with the exception of PM6 DH2 which performed as well as HF and GAFF. The DFT methods genera lly had overall random errors falling in between large basis set MP2 and GAFF. Of particular importance is the correct interpretation of these final error estimates. Each fragment based interaction was assigned an error contribution equal to the mean of t he PDF fit to the measured errors for a given method. That is, each molecular contact contributing to the overall binding process is assumed to have the most probable amount of error associated with it. Within the additive fragment approximation, the overa ll systematic error is then a sum of these estimates, and the overall systematic error is the most probable error in overall interaction energy. Random errors for each fragment are assigned based on the standard deviations of the PDFs and give an indicatio n of the most probable range of errors. One standard deviation about a 68% probability. Two standard deviations correspond to a 95% probability window. As an illustrative example, consider the case of GAFF and its propagated errors in columns 5 and 6 of Table 4 1. Suppose that one estimates the HIV II protease/indinavir interaction energy with the GAFF energy model and obtains the result E G The reference CCSD( T)/CBS interaction energy of the system could then be estimated to fall in the range (E G 19.74 5.9 0 E G 19.74+5.9 0 ) with 68% probability, or (E G 19.74 11.8 E G 19.74+11.8 ) with 95% probability. Thus o nly the information about the PDF PAGE 50 50 describing the intrins ic modeling errors of GAFF are needed to estimate (within some confidence interval) the CCSD(T)/CBS result for the same system Conclusions The magnitudes of errors estimated in this work highlight the need for improved accuracy and precision in energy fu nctions for the estimation of protein ligand binding affinities. Many of the estimated systematic and random errors for the protein/ligand interaction energy were larger than or a significant fraction of the experimental binding affinity of indinavir to HI V II protease. The proposed method offers a way to estimate correctable systematic errors, which should improve the ability of energy models to match experimental observations. In addition, propagated random errors provide a way to examine error bars in ad dition to raw scores when comparing a set of ligands. In this way, it is easy to determine if a given energy model is able to distinguish between t wo ligands by binding affinity. Furthermore, confidence intervals can be included in results of in silico scr eening of possible drug candidates. The method makes several important assumptions including additivity of fragment based interaction energies, that the distributions of fragment based energy errors are well described by Gaussian PDFs, and that enthalpy c orrections, entropy, and solvation free energy errors are zero. The last approximation is the most important for binding free energies, where each of the mentioned terms significantly contributes to bi nding affinity. A similar error propagation scheme for entropy or solvation free energy would likely fail due to greater nonadditivity in both of those terms. 88 90 Systematic errors in energy could add or cancel with systematic errors in entropy or solvation free energy depending on sign. Random errors, due to the error pro pagation formula of Equation 2 10 will only increase with the addition of terms to energy toward free energy, as PAGE 51 51 prescribed by Equation 4 2. Thus random errors in interaction energy pr edicted by the proposed method should be con sidered lower bound estimates of the free energy random error. The error estimates of this work are limited by the small data set (N=21) of fragments found in the HIV II protease/indinavir complex. For this reason, the generate d PDF parameters for each method are very approximate, and would likely not be suitable for general use. The method does suggest a means of producing more generally applicable error estimates, however. By generating a large set of protein ligand fragment i nteractions representing different common interaction types, the researcher could generate PDFs specific to each energy model and each interaction type. For interactions would be ex pected to more precisely estimate the magnitudes of modeling errors with GAFF in protein/ligand complexes. This would offer more accurate systematic error corrections as well as narrower propagated uncertainty estimates The main bottleneck of such an ende avor would be the calculation of accurate reference energies. PAGE 52 52 Table 4 1. Error metrics for various energy models over the set of 1HSG fragments Method BCS error 2 ) 2 ) 1/2 RMSE Max Err GAFF AMOEBA a MMFF 7.41 2.58 20.81 0.94 0.36 0.61 1. 66 0.24 20.25 19.74 7.56 12.81 5.90 2.24 20.62 1.62 0.75 4.54 4.35 2.23 11.43 AM1 AM1 FS1 16.39 11.44 2.57 0.03 5.89 6.23 53.97 0.63 11.12 11.44 3.58 2.50 10.69 7.97 PM3 14.70 1.82 6.67 38.22 11.84 3.21 8.24 PDDG 20.46 1.69 16.29 35.49 18.50 4.47 14.8 PM6 PM6 DH2 B97 D/ TZVP 10.94 4.89 2.53 1.56 0.25 0.05 3.13 1.11 0.30 32.76 5.25 1.05 8.11 4.83 2.51 2.39 1.07 0.55 6.99 2.94 2.12 M06 L/ 6 31G* aTZ 10.94 3.25 0.44 0.62 0.48 0.11 9.24 13.02 3.17 1.52 0.83 0.71 1.92 1.16 HF/ 6 31G* aDZ aTZ aQz 15.22 16.25 16.22 16.16 3.18 3.33 3.32 3.31 0.87 1.42 1.42 1.39 66.78 69.93 69.72 69.51 4.27 5.46 5.46 5.40 3.32 3.55 3.54 3.53 4.91 6.68 6.77 6.70 MP2/ 6 31G* aDZ aTZ aQZ CBS[aDZ aTZ] CBS[aTZ aQZ] 7.45 2.89 1.35 1.02 1.07 1.02 1.58 0.41 0.09 0.02 0.04 0.05 0.13 0.22 0.08 0.05 0.05 0.04 33.18 8.61 1.89 0.42 0.84 1.05 1.65 2.15 1.30 1.02 1.02 0.92 1.63 0.63 0.30 0.22 0.23 0.22 2.35 1.83 0.77 0.53 0.57 0. 56 CCSD(T)/ haDZ haTZ CBS[haDZ haTZ] b 3.80 1.28 0.28 0.69 0.23 0.03 0.20 0.03 0.00 14.49 4.83 0.63 2.05 0.74 0.00 0.83 0.28 0.06 2.08 0.74 0.14 Various scores (based on kcal/mol units) measuring the error of gas phase electronic interaction energies of the 21 independent chemical fragments involved in the binding of indinavir to 1HSG using CCSD(T)/CBS calculations as reference. The error scores are (from left to right): BCSerror, mean error, variance about the mean error, overall expected systematic error, overall expected random error, root mean square error, and maximum error. a AMOEBA results were calculated for 12 of the 21 fragments, due to lack of parameters for atoms in the remaining 9 systems. b haDZ haTZ extrapolation to t he CCSD(T) complete basis set limit; this approximates our reference CCSD(T)/CBS values but lacks the MP2 estimate of basis set effects beyond haTZ. PAGE 53 53 Figure 4 1 LigPlot diagram of the HIV II protease/indinavir complex PAGE 54 54 Figure 4 2 Example fragmen tation of the HIV II protease/indinavir complex. PAGE 55 55 Figure 4 3 Molecular representations of t he 21 independent chemical contacts present in the HIV II protease/indinavir complex. PAGE 56 56 Figure 4 4 Histogram of absolute electronic interaction energy errors for several methods. PAGE 57 57 Figure 4 5 Histogram and Gaussian probability density function describing errors in the GAFF energy model. PAGE 58 58 CHAPTER 5 THE PROTEIN FOLDING PROBLEM Introduction An important unsolved problem in computational biology is the ab init io protein folding problem dimensional structure given only its amino acid sequence. 91 92 The problem has been assumed to be solvable since Anfinsen proposed that proteins natura lly fold into a structure representing their free energy minimum. 93 a few proteins 94 95 and thus it has served as a reliable guide in the development of algorithms for protein fold prediction. The problem is then transformed into a search through phase state. However, a systematic search through phase a systematic search through even a modest subset of the conformational states of a 100 residue protein at 1 0 12 conformation s per second would take an amount of time equal to 10 21 times the ag e of the universe. 96 conformational space much earlier, and contrasted it wi th the observation that most proteins find their native folds in seconds to milliseconds or even faster (an apparent 97 This led him to conclude that proteins natural ly move through phase space along folding pathways that minimize their free energy, on surface s that have been described as a high dimensional funnel. 98 99 There has been great effort toward the prediction of native pr otein folds by using physics based energy functions and algorithms that seek to minimize total free energy. These typically model each inter and intramolecular interaction in a chemical system with a force field model and sample conformations along either a molecular dynamics PAGE 59 59 (MD) 100 105 or Monte Carlo (MC) 106 trajectory. Unfortunately, the tr ajectories needed to fold some proteins are st ill prohibitively long, an d littl e success has been achieved for large proteins (greater tha n ~100 amino acid residues) 107 108 A commonly cited explanation for failures in fold prediction is that "the primary obsta cle to de novo protein structure prediction is conformational sampling 109 and considering the vastness of conformational s pace, this should indeed be an issue. However, if the hypothesis of a mostly funnel shaped free energy surface is correct, then clearly not all configurations of a protein need to be evaluated but rather only those tha t decrease overall free energy. Of course additional sampling of higher energy states may be required to overcome local minima on the free energy surface. 110 Also of great importance in protein fold prediction are the accuracy and precision of the energy model used. Typical free energy differences between native and nonnative protein folds are around 10 20 kcal/mol, and energy functions must be at least this precise to distinguish between them. Most energy models used in ab initio protein folding studied are derivatives of the classical atomic force field of Levi tt et al (Equation 1 5, for example) 111 112 These energy models are usually parameterized to yield agreement with experimental observations or ab initio quantum calculations for smaller systems displaying interact ions like dispersion, hydrogen bonds, salt bridges, etc. as shown in Figure 5 1. It has been generally assumed that the ability to correctly model these types of small molecular interactions to within an acceptable level of error (typically 1.0 kcal/mol) c orresponds to similar error magnitudes in much larger composite protein systems. Furthermore it has been thought that, after proper parameterization, energy errors are purely random. Contrarily, we show that error PAGE 60 60 propagation in energy terms leads to very large magnitudes of both systematic and random errors in composite protein systems. These errors grow with system size so that as larger proteins are studied, more difficulty should be introduced in correctly predicting protein folds. Some e vidence for th is claim can be found in previously published work. For example, there is the annual Critical Assessment of protein Structure Prediction or (CASP) competition 113 in which researcher s are asked to prospectively predict, among other things, protein folds from primary sequence information. In these large scale experiments, predictions have tended to fail more often for larger proteins with multiple domains. Secondly, th e sensitivity of force field accuracy to parameterization is well known. 114 117 Schulten and co workers attempted to fold Pin1 using the CHARMM force field in a 10 s (a relatively long timescale) MD simulation, but were unsuccessfu l. 118 This fail ure was later shown to be due to bias in the force field parameters used. 119 Shaw and co workers later had success in folding the same system with a modification of the ff99SB force field (ff99SBildn). 120 The two force fields are similar in construction, but are parameteri zed differently. Thus force field energy models contain a range of uncertainty due to para meterization which can greatly affect the outcome of a protein folding simulation. This has also been evidenced by the sens itivity analyses of force field parameters for specific applications by Rabitz and others. 115 116 Error Propagation in Protein Folding Simulations Errors inherent in a calculation or measurement can be described as either systematic or random. Systematic er rors are predictable in both sign and magnitude, while random errors are not predictable in sign or magnitude. Because of this PAGE 61 61 difference, systematic errors propagate as a simple sum, while random errors propagate as the square root of sum of squares. Prop agated systematic error is correctable, since it describes an overall predictable shift in the measured value. Propagated random errors are not easily correctable, and are measured and reported as the accumulation of error from all random error sources. La rge random error is a characteristic of a very imprecise method of measurement. In searching the energy surface of a protein for the global free energy minimum, total error from all sources should be minimized. To illustrate the effects of error propagatio n on the protein folding problem, structure (folded state). In general these surfaces are not smooth, but often contain many local minima outside of the native state. If the fold ed protein has, for example, 100 independent chemical contacts (e.g. van der Waals interactions and hydrogen bonds between residues), and each is computationally modeled to chemical accuracy (i.e., within 1 kcal/mol random error with respect to an experi mental measurement), then random error propagation yields a total error of 10 kcal/mol. This would imply that our hypothetical computational model would have difficulty in distinguishing the native state from any other state with overlapping error bars wi thin 10 kcal/mol. Thus our computational model may correctly find several local minima, but if they differ in energy by less than the error bar magnitudes, the position of the global minimum could not be determined with much certainty These errors therefo re lead to distortions in the comp uted energy surface which yields a different ordering of folds by energy (see Figure 5 2 ). Dill realized the issue of error propagation, leading him to suggest that a PAGE 62 62 target of 0.1 kcal/mol per amino acid would be an accep table level of error for a protein of 100 amino acids yielding an overall random error bar of only 1 kcal/mol. 90 Given that each amino acid can have several intramolecular contacts, each with an associated error, achieving this level of accuracy would indeed be a very challenging endeavor. Practical Error Estimation for Protein Folding Simulations In our attempt to estimate and cor rect for errors in physics based energy scores used in protein folding, we have taken an approach similar to that previously described by Merz for protein ligand binding calculations 50 Intramolecular interactions involved in protein folds are deconstructed into chemical fragments and associated with reference interaction energies obtained using converged quantum chemical calculations or experimental measurements, if available. Energies from more appro ximate theories are then compared to the reference interaction energies to form fragment based error estimates. Estimating the error of a series of these fragment based interactions contained within a protein fold and then propagating these errors througho ut all fragments in the fold yields an estimate of the total error associated with a computed total energy for the protein. The error is broken down into a systematic portion which can be corrected for and a random portion which cannot, but can be reported as an error bar. It is important to remember that the thermodynamic principle of protein folding applies to the free energy, not the interaction energy (differences of total electronic E folding is just one folding shown in Equation 5 1 which is obtained using the thermodynamic cycle for the computation of the folding free energy from a fully unfolded reference structure. Here correction represents e nthalpic corrections to the PAGE 63 63 folding solvation is the difference in the solvation free energy of the folded and unfolded states. ( 5 1) In order to reliably calculate t he free energies of native and nonnative protein folds, the errors associated with each term of Equation 5 1 must be minimized. In view or lower limit free energy err or estimates because they neglect the error coming from the enthalpy, entropy, and solvation energy terms. Nonetheless, if the potential energy surface is not well modeled this will impact the quality of any estimate of entropy because a poor quality poten tial energy surface can have effects on entropy estimates that can be difficult to predict (e.g., sampling of non physical states). To obtain a reliable estimate. While it is true that systematic errors in the three remaining terms of Equation 5 not yet been studied in detail. Random error estimates, however, will only increase with the a ddition of terms with nonzero uncertainties. This method of error estimation assumes that electronic interaction energies are additive, even though free energies of interacting fragments are not. 41, 90 This approxim ation is supported by both statistical mechanics 37 and isothermal calorimetry experiments 41 involving protein ligand interactions, but deviations from additivity will impact the overall quality of our error estimates. Nonetheless, it is v ery instructive to folding term because, within our model, we can compare any physics based score function to chemically accurate quantum chemical PAGE 64 64 methods 7 providing reliable estimates of the magnitudes of energy errors and their contribution to folding folding errors are small, their impact on the uncertainty in folding will be small; otherwise they will have a significant impact on the accuracy of folding and the ability t o predict native protein folds. Methods In order to gener ate a reference database of fragment based interacting systems involved in protein folds, we examined a native fold of ubiquitin (PDBID: 1UBQ) 121 in detail. 122 After adding and optimizing hydrogens in AMBER with the ff99SB force field, the structure was viewed using Chimera, which was used to highlight van der Waals contacts and hydrogen bon ds resulting in a total of 42 of the former and 50 of the latter. Each resulting fragment interaction was evaluated in terms of gas phase interaction energy using a number of different methods. In generating these fragments, hydrogen to replace the severed bonds with the remainder of the protein. Energie s were evaluated with the ff99SB force field, the Generalized Amber Force Field (GAFF) 72 ff03 123 AM1 12 PM3 13 PM6 14 PDDG 15 PM6 DH2 16 HF, MP2, B97 D 75 M06, and M06 L 40 Th e ff99SB and ff03 force field methods underwent an atomic charge scaling procedure to produce correct net charges on the database fragments. This was necessary because the sum of parameterized force field charges on one fragment often did not equal the tot al charge used when calculating electronic energy with a QM reference method. Unless the force field charges are scaled properly, additional errors due to the lack of charge conservation are introduced. Several charge scaling protocols were examined, and w e found that scaling only positive charges to yield the correct integer net charge best matched reference data. PAGE 65 65 The ab initio quantum based methods employed several basis sets and included the counterpoise correction for basis set superposition error. Mll er Plesset perturbation theory through second order (MP2) with complete basis set extrapolations (CBS) 22 from the aug cc pVTZ and aug cc pVQZ basis sets (hereafter abbreviated as aXZ: X=D,T ,Q) were used for most of the reference values. Based on previous reports 124 and our experience with error analysis on protein ligand systems 67 the coupled cluster method with single, double, and perturbative triple excitations (CCSD(T)) CBS energies showed the largest improvement from MP2/CBS for systems containing aromatic groups. However, the present case contained no aromatic aromatic interactions and only eight total aromatic nonpolar contacts. Hence, CCSD(T)/CBS reference energies were computed (as des cribed in our previous work) only for these fragments. The addition of more protein molecules and specific interaction types to our reference database would further improve our ability to estimate errors, but this is a time consuming endeavor requiring a l arge number of high level quantum mechanical energy calculations. The ff99SB and GAFF calculations were conducted with the AMBER 11 8 suite of programs, and the ff03 energies were calculated with the Schrdinger package 86 AM1, PM3, PM6, and PDDG energies were calculated with DivCon 82 and PM6 DH2 energies were calculated with MOPAC2009 125 HF, MP2, B97 D, M06, and M06 L energies were calculated with Gaussian 09 81 and the CCSD(T)/CBS corrections used to generate reference values were calculated with Molpro 2009 84 and NWChem 5.1. 85 Results and Discussion A summary of the fragment based interaction energy deviations from reference energies is displayed in Table 5 1. The absolute deviations from our reference data were fitted to Gaussian error probability den sity functions with parameters (mean error PAGE 66 66 D/TZVP is shown as an example in Figure 5 3. The fragments were divided into two classes: nonpolar (van der Waals blue) and polar (hydrogen bonding red) interactions. In the case of B97 D/TZVP, the mean error, representing the correctable systematic error, is 0.29 kcal/mol and 0.59 kcal/mol per interaction for nonpolar and polar interactions, respectively. The variance, representing random error, is 0.02 (kcal/mol) 2 for nonpolar and 0.158 (kcal/mol) 2 for polar interactions. Thus this computational model has a relatively precise description of van der Waals interactions with only a slight offset, but it has a wider distri bution of errors for polar interactions. With this database of interaction energy errors in place, lower limit estimates of both systematic and random errors can be obtained for protein fold energies. Along with calculating the energy of a protein fold, an analysis of its interacting fragment components can be made. By determining the interaction type, an estimate for a overall error is then propagated as shown in Equations 5 2 and 5 3. ( 5 2) ( 5 3) Here i runs over all interaction types (e.g. polar, nonpolar) stored in the database, N i is the number of interactions of type i found in the analyzed protein fold, a nd i and i 2 are the mean error per interaction and variance about the mean error for interaction type i Note that total systematic error depends on the number of each type of interaction and thus will not exactly cancel when comparing different protein folds, sin ce PAGE 67 67 the folds may have different numbers of interaction types. The evaluated total systematic error should be subtracted from the evaluated energy and the evaluated total random error can be reported along with the corrected energy value. In the case of B9 7 D/TZVP for ubiquitin (Figure 5 3), by using the appropriate error probability density functions we estimate the total systematic error to be 17.3 kcal/mol and the random error to be 8.9 kcal/mol. Hence, the estimated systematic error is comparable to a t ypical folding free energy of a protein, but this error can be corrected. Unfortunately, the remaining random error is still a significant portion of a typical folding free energy. Any other protein fold with a calculated energy within this 8.9 kcal/mol er ror bar should be considered indistinguishable from the native fold by the computational method. The B97 D/TZVP case represents a favorable example with small mean errors and relatively tight error distributions (see Table 5 1 for other examples), but it w ould be computationally intractable to use it to study hundreds or thousands of decoys for a system of the size of ubiquitin. More approximate and computationally accessible methods yield higher esti mated errors. For example, ff99SB yields a systematic err or of 66.0 kcal/mol and a random error of 18.4 kcal/mol. Such magnitudes of error bars are disturbing, since any non native structure lying within the 18.4 kcal/mol range could not be distinguished from the native structure with ff99SB Furthermore, these error bars become even larger as larger proteins with more molecular contacts are examined. The magnitudes of these errors lead us to predict that current physics based scoring functions used in ab initio protein folding studies can have total energy erro rs much larger than the folding free energies of typical proteins. Hence, we conclude that accurate energy computation and error reduction represents a major stumbling block PAGE 68 68 along with conformational sampling in the achievement of an ultimate solution to t he ab initio protein folding problem. However, we are now in a position to correct for systematic errors, thereby improving our computational outcomes. In order to test our error hypothesis, we performed energy calculations and error corrections on a port ion of the Rosetta decoy set, which contained 49 protein systems. Each one comprised a crystal structure from the Protein Databank, 20 versions of the crystal structure that were relaxed with the Rosetta score function, and 100 low energy decoys produced by Rosetta. 126 The protein structures ranged from 50 to 146 residues in chain length, so semiempirical QM calculations (utilizing modern linear scaling methods) were feasible. Energies of all 5929 protein structures were calculated with ff99SB, PM6, and PM6 DH2. The ff99SB calculations were performed with the generalized Born implicit solvent model 127 and the PM6 and PM6 DH2 calculations used the COSMO 128 solvent model in MOPAC. Each structure was then analyzed for the number of nonpolar and polar interactions and the corresponding energy was corrected according to the appropri ate error probability density functions. To measure improvement due to energy corrections, three values were monitored: EGAP, z score, and EBO. The EGAP (energy gap) was defined as the difference between the energies of the lowest energy decoy and the lowe st native structure. The z score is the ratio of the difference between the lowest native fold energy and the average energy of all folds to the standard deviation of all fold energies. EBO (error bar overlap) yields true if the native structure is found t o lie within the lowest energy error bar; otherwise it is false. The results for the Rosetta decoy set analysis can be found in Tables 5 2, 5 3, and 5 4. By measuring EGAP, improvements due to error corrections were observed in 27, PAGE 69 69 36, and 31 o f the prote in datasets for ff99SB PM6, and PM6 DH2, respectively. By measuring z scores, improvements were seen in 38, 41, and 39 of the systems. The native structures were found within the lowest energy error bar in 45, 49, and 49 of the systems. The magnitudes of systematic errors in the decoy set for all three methods are shown in Figure 5 4. Overall, systematic error correction offers some benefit, but improvement was not uniform. We observed that, while the magnitudes of energy corrections varied over a set of d ecoys, that variation was small That is, both native and decoy folds had significant systematic errors, but the changes in relative energies after error correction were usually small. The decoy set for PDB ID: 1H6Z, for example, had an average energy corr ection of 51.3 5.2 kcal/mol for ff99SB and the native structure had an error correction of 56.9 kcal/mol. Although much of the systematic error canceled when measuring EGAP, the improvements due to error correction can still be significant compared to f olding free energies. This would especially be true if we had included more non native decoy folds in our analysis. The decoys in this set are very native like and have roughly the same number of intramolecular contacts, leading to similar magnitudes of er ror corrections. The difference in systematic errors between a native structure and a partially unfolded structure is expected to be much greater. While systematic error can be estimated and removed, random error in energy nd represents poor precision in scoring functions. After energy corrections, the native structure of 1H6Z was not the lo west energy structure with ff99SB but its corrected energy was found to lie within the lowest energy error bar. This result highlights a main disadvantage of using a method with large random error since the native and decoy folds could not be distinguished due to overlapping random PAGE 70 70 error bars. To highlight the dependence of the total random error on system size, we estimated the total ran dom error s of the 5929 protein folds with the three scoring functions. The relationship is shown in Figure 5 5 As we examine larger proteins, their potential energy surfaces should become increasingly more distorted due to increased random error in the en ergy functions, which can have unpredictable effects on the free energy landscape. Conclusions Folded proteins are characterized by numerous van der Waals and hydrogen bonding interactions that need to be accurately accounted for when using physics based s core functions. Even small errors in calculated energies between interacting partners within a protein quickly accumulate to produce large overall uncertainties in calculated total energies. This effect of error propagation distorts the calculated potentia l energy surface of a protein in a very complicated way, and therefore alters the shape of the folding funnel in ways that are difficult (if possible) to predict. One is only able to distinguish protein folds by energy when energy differences are larger th an their individual error bars. Rather than having few native like structures at the bottom of the folding funnel, it now should be extended to include any fold within the lowest energy error bar at the bottom. The bottom of the folding funnel may even be populated with non native states predicted to be native by the scoring function, with the true native states higher in calculated energy but perhaps with error bars overlapping with the incorrectly predicted native states. Since the free energy difference between a native and denatured protein fold can be on the order of 10 20 kcal/mol, the errors in interaction energies of the magnitude predicted herein suggest that we are a long way from computing energies between PAGE 71 71 native and decoy folds at a level of accu racy necessary to generally solve the ab initio protein folding problem, especially as larger proteins are examined. We have presented and demonstrated the use of a method to estimate the magnitude of errors in computed energies of proteins and shown that these can be corrected for in part, thereby improving results obtained from physics based scoring functions. Systematic error correction can be applied as an endpoint calculation or it can be computed on the fly, for example, in interactive protein foldin g gaming exercises. 129 In addition, the generation of error probability density functions provides a straightforward method of a nalyzing and comparing different score functions in terms of their ability to accurately model molecular interactions. The research outlined herein brings a new level of sophistication to energy computation that has largely been lacking in computational bi ology and chemistry, opening the door for novel ways to compare and improve modern scoring functions used in studying complex systems with large numbers of inter and intramolecular interactions. PAGE 72 72 Table 5 1 Interaction energy error statistics for the 1 UBQ fragment database. Method All 2 All VdW 2 VdW Polar 2 Polar R factor a GAFF 0.36 3.26 0.25 0.36 0.46 5.64 0.127 FF99SB b 0.73 4.04 0.12 1.27 1.22 5.83 0.170 FF03 b 0.83 6.61 0.18 0.81 1.36 10.86 0.259 AM1 3.15 9.50 1.04 0.70 4.85 10.28 0.373 PM3 2 .65 7.89 0.14 0.77 4.67 4.59 0.352 PM6 1.67 2.24 0.84 0.32 2.34 2.82 0.211 PM6 DH2 0.30 1.23 0.09 0.10 0.62 1.95 0.071 PDDG 3.21 16.23 0.62 0.90 6.30 7.48 0.484 HF/6 31G* 1.94 1.32 2.27 1.14 1.68 1.30 0.153 HF/aDZ 2.14 1.22 2.29 1.11 2.02 1.29 0.176 HF/aTZ 2.10 1.17 2.28 1.10 1.95 1.17 0.171 HF/aQZ 2.08 1.16 2.28 1.10 1.93 1.15 0.170 MP2/6 31G* 1.24 0.64 1.12 0.28 1.34 0.91 0.146 MP2/aDZ 0.48 0.16 0.21 0.01 0.69 0.19 0.061 MP2/aTZ 0.16 0.02 0.05 0.00 0.24 0.02 0.023 B97 D/TZVP 0.20 1.06 0.29 0 .02 0.60 1.58 0.087 M06/6 31G* 0.75 0.42 0.63 0.12 0.85 0.64 0.104 M06/aTZ 0.73 0.16 0.57 0.08 0.85 0.19 0.090 M06 L/6 31G* 0.71 0.43 0.40 0.10 0.96 0.57 0.103 M06 L/aTZ 0.75 0.14 0.55 0.07 0.91 0.14 0.096 Mean s and variance s of interaction energy dev iations (kcal/mol) from reference energies (a mix of MP2/CBS and CCSD(T)/CBS) for the interacting fragment molecules present in ubiquitin. The set of fragments was divided into 42 van der Waals interactions and 50 polar interactions. a The calculated R fact or serves as an analogy to the residual minimized in crystallographic structure refinement. A desirable value of R fa ctor would be less than 0.1. b The force field based atomic charge parameters were scaled to yield correct net charge on each fragment syst em. PAGE 73 73 Table 5 2. Analysis of ff99SB on the Rosetta Decoy Set NAME EGAP1 EGAP2 IMPROVE? ZSCORE1 ZSCORE2 IMPROVE? EBO 1a32 8.20 8.84 FALSE 0.87 1.11 TRUE TRUE 1a68 9.20 38.28 TRUE 3.47 3.67 TRUE TRUE 1acf 20.10 28.58 TRUE 1.25 1.64 TRUE TRUE 1ai l 18.30 24.28 FALSE 1.79 1.55 FALSE TRUE 1aiu 15.00 45.08 TRUE 3.57 3.80 TRUE TRUE 1b3a 7.10 6.26 TRUE 1.72 1.88 TRUE TRUE 1bgf 44.20 34.84 FALSE 3.99 3.56 FALSE TRUE 1bk2 30.60 33.08 TRUE 2.65 2.94 TRUE TRUE 1bkr 26.10 16.96 FALSE 2.57 2.60 TRUE TRUE 1bq9 11.00 24.74 TRUE 2.49 2.78 TRUE TRUE 1c8c 17.40 17.52 FALSE 0.44 0.49 TRUE TRUE 1c9o 11.40 16.54 TRUE 2.39 2.57 TRUE TRUE 1cc8 11.40 12.60 TRUE 2.74 2.73 FALSE TRUE 1cei 36.90 34.40 FALSE 3.94 3.61 FALSE TRUE 1ctf 22 .10 32.20 TRUE 1.37 1.79 TRUE TRUE 1dhn 4.50 49.88 TRUE 3.50 3.46 FALSE TRUE 1e6i 15.30 63.68 TRUE 2.77 2.92 TRUE TRUE 1enh 2.20 0.66 FALSE 2.20 2.56 TRUE TRUE 1ew4 4.70 14.42 TRUE 0.10 0.10 TRUE TRUE 1eyv 49.60 22.58 TRUE 2.23 2.55 TRUE TRUE 1fkb 59.90 145.70 TRUE 5.56 6.09 TRUE TRUE 1gvp 51.10 48.14 FALSE 3.36 4.32 TRUE TRUE 1hz6 23.60 27.40 FALSE 0.88 0.65 FALSE TRUE 1ig5 14.20 19.20 FALSE 1.25 1.34 TRUE TRUE 1iib 59.20 52.00 TRUE 3.41 4.50 TRUE TRUE 1kpe 4.30 3. 56 TRUE 2.13 2.21 TRUE TRUE 1lou 64.40 70.24 TRUE 3.59 4.38 TRUE TRUE 1opd 57.00 61.86 TRUE 4.33 4.74 TRUE TRUE 1pgx 2.60 2.76 TRUE 1.43 1.77 TRUE TRUE 1ptq 14.90 31.36 FALSE 0.35 0.06 FALSE FALSE 1r69 25.40 37.24 FALSE 1.14 0.86 FALSE FALSE 1scj 20.90 26.54 FALSE 0.61 0.76 TRUE TRUE 1shf 27.10 33.92 TRUE 2.15 2.73 TRUE TRUE 1ten 50.30 49.42 FALSE 4.82 4.77 FALSE TRUE 1tig 19.20 25.68 FALSE 0.75 0.74 FALSE TRUE 1tul 96.10 112.60 TRUE 3.25 5.90 TRUE TRUE 1ugh 9.70 19. 08 TRUE 2.81 3.07 TRUE TRUE 1urn 31.40 36.38 TRUE 3.30 3.32 TRUE TRUE 1utg 13.10 4.62 TRUE 0.59 1.87 TRUE TRUE 1vcc 29.00 28.68 FALSE 3.46 3.60 TRUE TRUE 1vie 48.00 57.52 TRUE 4.18 4.90 TRUE TRUE 1vls 81.50 100.06 FALSE 0.29 1.11 FALSE F ALSE 1who 71.30 86.84 TRUE 4.63 4.90 TRUE TRUE 256b 13.60 10.18 FALSE 1.84 2.11 TRUE TRUE 2acy 102.40 93.24 FALSE 4.68 5.19 TRUE TRUE 2ci2 33.20 32.84 TRUE 0.22 0.26 TRUE FALSE 2tif 25.00 22.30 FALSE 1.41 1.51 TRUE TRUE 4ubp 10.00 15.52 FALSE 1.21 1.78 TRUE TRUE 5cro 3.30 1.46 FALSE 0.76 2.30 TRUE TRUE NUMBER TRUE RATIO TRUE 27 38 45 0.55 0.78 0.92 EGAP1 and ZSCORE1 are the EGAP and z score metrics before error correction, and EGAP2 and ZSCORE2 are their values after error correction. PAGE 74 74 Table 5 3. Analysis of PM6 on the Rosetta Decoy Set. NAME EGAP 1 EGAP 2 IMPROVE ? ZSCORE 1 ZSCORE 2 IMPROVE ? EBO 1a32 107.81 110.99 TRUE 6.78 5.97 FALSE TRUE 1a68 62.40 211.24 TRUE 5.52 6.05 TRUE TRUE 1acf 132.49 139.44 TRUE 2.68 4.45 TR UE TRUE 1ail 112.76 91.22 FALSE 4.36 4.04 FALSE TRUE 1aiu 179.23 209.72 TRUE 6.03 6.02 FALSE TRUE 1b3a 99.74 92.57 FALSE 5.41 5.40 FALSE TRUE 1bgf 141.82 140.71 FALSE 5.45 5.49 TRUE TRUE 1bk2 72.57 80.39 TRUE 4.81 5.24 TRUE TRUE 1bkr 111.35 130.64 TRUE 4.60 4.76 TRUE TRUE 1bq9 105.05 126.36 TRUE 5.49 5.80 TRUE TRUE 1c8c 86.85 82.88 FALSE 3.93 5.58 TRUE TRUE 1c9o 113.18 111.58 FALSE 5.57 5.68 TRUE TRUE 1cc8 109.70 113.70 TRUE 5.39 5.57 TRUE TRUE 1cei 206.05 221.24 TRUE 5.99 6 .13 TRUE TRUE 1ctf 120.29 141.01 TRUE 3.63 5.97 TRUE TRUE 1dhn 112.90 230.78 TRUE 5.26 5.54 TRUE TRUE 1e6i 170.28 267.16 TRUE 6.46 6.65 TRUE TRUE 1enh 89.33 94.99 TRUE 5.54 5.66 TRUE TRUE 1ew4 108.29 171.75 TRUE 4.68 5.49 TRUE TRUE 1eyv 81. 40 214.98 TRUE 4.92 5.41 TRUE TRUE 1fkb 189.90 200.29 TRUE 5.57 5.48 FALSE TRUE 1gvp 141.93 142.87 TRUE 4.54 5.56 TRUE TRUE 1hz6 29.03 79.73 TRUE 5.72 6.50 TRUE TRUE 1ig5 91.35 85.87 FALSE 4.23 4.67 TRUE TRUE 1iib 27.28 179.44 TRUE 5.79 7. 02 TRUE TRUE 1kpe 154.82 168.39 TRUE 5.70 5.80 TRUE TRUE 1lou 135.57 169.92 TRUE 5.04 5.35 TRUE TRUE 1opd 148.43 163.36 TRUE 5.65 6.45 TRUE TRUE 1pgx 103.11 96.98 FALSE 5.28 7.09 TRUE TRUE 1ptq 51.33 24.26 FALSE 2.97 3.13 TRUE TRUE 1r69 95. 45 76.35 FALSE 4.03 5.56 TRUE TRUE 1scj 83.97 103.10 TRUE 3.76 4.53 TRUE TRUE 1shf 137.77 148.63 TRUE 4.95 5.95 TRUE TRUE 1ten 143.41 159.38 TRUE 6.33 6.29 FALSE TRUE 1tig 104.61 107.17 TRUE 4.82 4.63 FALSE TRUE 1tul 156.14 219.97 TRUE 4.54 6.53 TRUE TRUE 1ugh 166.78 181.46 TRUE 6.14 6.29 TRUE TRUE 1urn 136.44 154.85 TRUE 5.14 5.24 TRUE TRUE 1utg 127.32 127.60 TRUE 4.50 6.03 TRUE TRUE 1vcc 136.74 153.11 TRUE 5.41 5.57 TRUE TRUE 1vie 113.97 125.12 TRUE 5.26 5.87 TRUE TRUE 1vl s 103.00 94.97 FALSE 3.97 4.28 TRUE TRUE 1who 178.15 212.28 TRUE 6.54 6.89 TRUE TRUE 256b 212.86 209.13 FALSE 7.48 7.23 FALSE TRUE 2acy 165.18 205.92 TRUE 5.46 6.21 TRUE TRUE 2ci2 69.96 82.75 TRUE 4.07 4.99 TRUE TRUE 2tif 25.11 33.66 TRUE 2 .84 3.02 TRUE TRUE 4ubp 160.09 156.15 FALSE 3.61 5.32 TRUE TRUE 5cro 103.52 89.96 FALSE 2.82 5.92 TRUE TRUE NUMBER TRUE RATIO TRUE 36 41 49 0.73 0.84 1.0 EGAP1 and ZSCORE1 are the EGAP and z score metrics before error correction, and EGAP2 and ZSCORE2 are their values after error correction. PAGE 75 75 Table 5 4. Analysis of PM6 DH2 on the Rosetta Decoy Set. NAME EGAPB1 EGAP2 IMPROVE? ZSCORE1 ZSCORE2 IMPROVE? EBO 1a32 112.03 111.50 FALSE 6.56 6.36 FALSE TRUE 1a68 60.89 258.41 TRUE 6.03 6.80 TRU E TRUE 1acf 224.41 218.04 FALSE 4.07 5.68 TRUE TRUE 1ail 90.15 108.59 TRUE 4.47 4.57 TRUE TRUE 1aiu 202.09 231.09 TRUE 6.49 6.55 TRUE TRUE 1b3a 92.92 90.78 FALSE 6.03 6.05 TRUE TRUE 1bgf 228.20 219.92 FALSE 7.20 7.06 FALSE TRUE 1bk2 102.64 104.67 TRUE 6.20 6.27 TRUE TRUE 1bkr 201.32 191.15 FALSE 6.09 5.90 FALSE TRUE 1bq9 140.93 140.63 FALSE 6.69 6.64 FALSE TRUE 1c8c 90.30 93.22 TRUE 4.22 6.20 TRUE TRUE 1c9o 127.84 130.51 TRUE 6.14 6.19 TRUE TRUE 1cc8 149.24 150.55 TRUE 6.38 6.35 FALSE TRUE 1cei 220.90 218.40 FALSE 6.46 6.42 FALSE TRUE 1ctf 147.59 147.95 TRUE 4.14 6.35 TRUE TRUE 1dhn 94.75 280.59 TRUE 5.81 6.33 TRUE TRUE 1e6i 153.62 276.03 TRUE 6.64 7.08 TRUE TRUE 1enh 95.44 93.78 FALSE 5.99 5.94 FALSE TRUE 1ew 4 96.90 258.66 TRUE 6.47 7.26 TRUE TRUE 1eyv 108.10 248.22 TRUE 5.95 6.35 TRUE TRUE 1fkb 323.35 326.24 TRUE 7.10 7.15 TRUE TRUE 1gvp 142.98 143.24 TRUE 5.28 6.62 TRUE TRUE 1hz6 10.22 94.03 TRUE 5.77 6.79 TRUE TRUE 1ig5 98.28 93.85 FALSE 4.4 4 4.84 TRUE TRUE 1iib 57.61 182.30 TRUE 6.43 7.40 TRUE TRUE 1kpe 133.86 229.44 TRUE 6.43 6.88 TRUE TRUE 1lou 208.89 210.30 TRUE 6.39 6.31 FALSE TRUE 1opd 187.01 190.17 TRUE 6.60 7.07 TRUE TRUE 1pgx 97.79 97.71 FALSE 5.64 7.22 TRUE TRUE 1pt q 42.40 33.89 FALSE 2.72 3.65 TRUE TRUE 1r69 107.94 101.82 FALSE 4.15 6.20 TRUE TRUE 1scj 118.72 120.65 TRUE 4.63 5.46 TRUE TRUE 1shf 173.11 176.47 TRUE 5.91 6.63 TRUE TRUE 1ten 217.61 228.70 TRUE 7.48 7.51 TRUE TRUE 1tig 82.75 121.29 TRUE 5.23 5.36 TRUE TRUE 1tul 247.93 255.86 TRUE 6.30 7.31 TRUE TRUE 1ugh 146.99 226.13 TRUE 6.70 7.02 TRUE TRUE 1urn 122.83 184.92 TRUE 6.27 6.47 TRUE TRUE 1utg 119.92 120.90 TRUE 4.33 6.38 TRUE TRUE 1vcc 129.31 160.58 TRUE 5.69 5.75 TRUE TRUE 1vie 139.17 143.78 TRUE 6.21 6.45 TRUE TRUE 1vls 139.49 134.13 FALSE 4.87 5.26 TRUE TRUE 1who 225.83 235.18 TRUE 7.33 7.37 TRUE TRUE 256b 217.06 222.23 TRUE 7.57 7.64 TRUE TRUE 2acy 252.84 242.82 FALSE 7.42 7.31 FALSE TRUE 2ci2 87.63 87.34 FALSE 4.80 5.37 TRUE TRUE 2tif 56.24 53.22 FALSE 3.60 3.51 FALSE TRUE 4ubp 189.71 189.53 FALSE 3.79 5.56 TRUE TRUE 5cro 88.94 87.70 FALSE 2.61 6.09 TRUE TRUE NUMBER TRUE 31 39 49 RATIO TRUE 0.63 0.80 1.00 EGAP1 and ZSCORE1 are the EGAP and z score metrics before error correction, and EGAP2 and ZSCORE2 are their values after error correction. PAGE 76 76 Figure 5 1 Example of model systems used to build molecular interactions in proteins. PAGE 77 77 Figure 5 2 Distortions in computed free energy la ndscapes due to modeling errors. PAGE 78 78 Figure 5 3 Histogram and Gaussian probability density function describing errors in B97 D/TZVP over 1UBQ fragments. PAGE 79 79 Figure 5 4 Systematic error magnitudes versus chain length in the Rosetta Decoy Set. PAGE 80 80 Figure 5 5 Random error magnitudes versus chain length in the Rosetta decoy set. PAGE 81 81 CHAPTER 6 BASIS SET SUPERPOSIT ION ERROR Introduction Q uantum mechanical calculations are now routinely being applied to large biomolecular systems due to the introduction of linear scaling methods such as fragment molecular orbital methods (FMO) 24 molecular fragmentation with conjugate caps (MFCC) 25 the adjustable density matrix assembler method (ADMA) 130 and divide and conquer schemes 26 28 Th ese methods all share the core idea of dividing a large system into fragment components and performing quantum calculations on the smaller fragment systems The total energy of the supersystem is then approximated by using the fragment energies or densitie s. All of these methods are linear scaling with respect to fragment number, and have the advantage of being very easily parallelizable. Thus even large biomolecular systems could in principle be treated quantum mechanically within reasonable calculation ti mes with sufficiently parallel computational architectures. Linear scaling quantum methods tackle just one challenge in the quantum treatment of large biomolecular systems, namely the high number of orbital degrees of freedom. Another issue is propagation of energy model errors as discussed in Chapters 3 5. Each of the fragment based interactions contributing to the overall energy of a supersystem also contributes to the overall modeling error. These individual errors propagate over the supersystem and can lead to large total energy errors as the system size increases. Methods for correcting these energy modeling errors will need to be improved to achieve accurate computational modeling of biomolecular systems. There is another source of error introduced t hat is specific to quantum calculations performed with finite basis sets: basis set superposition error (BSSE). BSSE is an PAGE 82 82 artificial stabilization of interacting systems due to the additional variational flexibility afforded by the proximity of the basis functions centered on them. For example, in the interaction energy between A and B. The trivial solution for correcting BSSE is to use a more complete basis set, but this i s often undesirable since most ab initio quantum methods formally scale as N 4 or greater. The most commonly used solution is the counterpoise procedure of Boys and Bernardi, 23 which involves performing calculations of the energies of the separated molecular systems both with (E A B (E A and E B ) the additional basis functions of the interacting partner in the associated complex (AB) geometry. The magnitude of BSSE is then estimated via E quation 6 1 and this quantity is subtracted from the calculated interaction energy between systems A and B ( as shown in Equation 6 2) (6 1) (6 2) While the counterpoise procedure offers a straig htforward method of estimating BSSE for intermolecular interactions, the case of intramolecular BSSE (IBSSE) is still an unsolved problem. IBSSE has been observed in calculations for systems as small as benzene, for which it leads MP2, MP3, CISD, and CCSD all to predict nonplanar minimum geometries with Pople type basis sets. 131 132 Furthermore, Balabin has estimated magnitudes of IBSSE in small alkanes and peptides and reported magnitudes greater than estimated conf ormational energy differences. 133 134 This implies that IBSSE can prevent quantum methods with incomplete basis sets from accurately modelling PAGE 83 83 potential energy surfaces and thereby preclude agreement with experiment al observations. 135 136 Most current proposed methods for estimating IBSSE are simply intramolecular analogues of the counterpoise method. The overall system is deconstructed into molecular fragments (or even indivi dual atoms in some implementations 137 ), and each each fragment. In such a scheme, the fragmentation method is nonunique (the fragmentation rules are ar bitrary), and its implementation necessitates 2N additional quantum calculations (unless the isolated fragment/atomic energies are stored in a database, in which case there are a minimum of N additional calculations needed). Statistics based Model for Esti mating BSSE In order to bypass the additional 2N calculations required for estimating IBSSE with the traditional fragment counterpoise method, we have developed a linear model for estimating fragment contributions to IBSSE, which requires only a geometric analysis and classification of the molecular interactions present in a biomolecular system. 138 139 In order to develop such a model, we have generated a database of 773 molecular fragments with an in house fragmentation program using high resolution protein structures from the Protein Data Bank. The set was divided into four interaction classes: 354 nonpolar complexes, 312 backbone hydrogen bonds, 63 negatively charged complexes, and 44 positively charged complexes. The magnitudes of BSSE at the MP2/6 31G* and MP2/aug cc pVDZ (MP2/aDZ) level were calculated with the GAUSSIAN 09 program 81 These two basis sets with MP2 were chosen as example methods only for their relatively low computational cost and expectedly high magnitudes of BSSE. The resulting distributions of BSSE magnitudes are displayed in Figure 6 1, PAGE 84 84 which shows interesting distinctions between the different interaction classes and basis sets. Nonpolar/van der Waals type interactions generally had lower BSSE magnitudes than hydrogen bonds or charged interactions. Fragment interactions including ani onic species had the widest distribution of MP2/6 31G* BSSE magnitudes, while those including cationic species had a distribution more similar to the hydrogen bonded complexes. The difference between the two basis sets was the largest in the anionic system s, where the MP2/6 31G* estimates showed a much wider distribution than MP2/aDZ. Interestingly, the MP2/aDZ BSSE magnitudes were generally larger than MP2/6 31G* for nonpolar complexes but generally smaller for hydrogen bonds. Since BSSE depends on geomet ric orientation in addition to interaction class, a bimolecular proximity descriptor P AB was introduced to quickly measure the proximity of two interacting fragments A and B which is then used to es timate BSSE with a linear model described by Equations 6 3 and 6 4, where i and j run over the heavy (non hydrogen) atoms in fragments A and B respectively, r ij is the distance between atoms i and j and a, b, and c are optimizable parameters. Thus only proximal non hydrogen atoms from different fragments signifi cantly contribute to the overall proximity score. The functional form of Equation 6 3 was chosen to qualitatively model measured BSSE distance dependence curves (see Figure 6 2). Using other powers of distance in the exponent of P AB was investigated, but n o significant improvement over square distances was observed during model training. (6 3) (6 4) PAGE 85 85 The BSSE model was trained for the two basis sets and four interaction classes independently, by fitting to the traditional counterpoise calculated BSSE magnitudes in the fragment database. Since the fragments are in geometries taken from the PDB, they represent near minimum geometries. Thus the model reproduces BSSE values of near minimum structures better than very closely interacting fragments as seen in Figure 6 2. The resulting parameters and square correlation coefficients (over the training set) for the four interaction classes at the MP2/aDZ and MP2/6 31G* levels are shown in Tables 6 1 and 6 2. The overall procedure for estimating the IBSSE with this model is as follows: (1) Detect and classify the comprising molecular interactions of the supersystem. In this work, van der Waals interactions are sought with a distance cutoff of 4 between hea vy atoms, while polar interactions are sought with an interaction distance cutoff of 3.5 interacting sp 3 carbons in each case. If a subsystem displays qualities common to multiple c lasses ( e.g. a hydrogen bond with a carboxylate anion), then a hierarchy is used based on relative contributions to BSSE. The presence of anionic fragments takes the highest precedence, followed by cationic fragments, and then hydrogen bonds. Lacking all o f these features, the interaction is considered nonpolar. (2) Compute the proximity score and BSSE estimate (Equations 6 3 6 4 ) of each interaction using the appropriate parameter set for the interaction class and quantum method. (3) Compute the uncertain ty in the BSSE estimate. Since the BSSE model is linear with respect to the bimolecular proximity descriptor, error estimates are readily calculated with the expression in Equation 6 5 value depending on the interaction PAGE 86 86 class sam ple size N and the desired confidence limit, P AB is the newly estimated proximity descriptor value, P bar is the average proximity descriptor value for the interaction class in the database, and P i are the individual proximity descriptor values stored in t he database for the interaction class. The quantity S is given by Equation 6 6, where y i are the measured BSS E values of the database, and b and a are the mod el parameters used in Equation 6 4. (6 5) (6 6) (4) Propagate the errors throughout the supersystem. Predicted BSSE magnitudes are assumed to behave as independent systematic errors and propagate as a simple sum, while predicted uncertainties are assumed to behave as independent random error s, which propagate as a Pythagorean sum (Equation 6 7 ). (6 7) Applications of the BSSE Model Protein Ligand Interactions The method was first tested on the HIV II protease/indinavir protein/ligand complex (PDBID: 1HSG). The B SSE using both 6 31G* and aDZ basis sets with MP2 was estimated with the present method as well as the traditional fragment counterpoise method. The complex had been decomposed previously into 21 fragment interactions comprising classes of 15 nonpolar, 1 c ationic, 4 anionic, and 1 polar. The BSSE PAGE 87 87 estimations fo r MP2/aDZ are shown in Table 6 3. Although the Pearson correlation coefficient between the two methods was a low r=0.65, all of the deviations from the traditional counterpoise method were less than  1.0 kcal/mol and overall there was favorable cancelation of errors toward zero in the overall BSSE in the protein/ligand confidence), which agreed with the traditional counterpoi se calculated BSSE of 41.79 kcal/mol. For MP2/6 31G* (Table 6 4 ) the model yielded much better correlation fragment counterpoise BSSE values was 47.1 kcal/mol. Protein Fold s The model was also tested by estimating the IBSSE in the crystal structure of a helix conformation). The structure was downloaded and hydrogen atoms were added, followed by an optimization of their positions with the ff99SB force field in AMBER The resulting structure was fragmented with the same in house program used to generate the fragment database, which resulted in 10 hydrogen bond complexes forming the helical backbone a nd 3 nonpolar complexes. At MP2/6 31G*, the traditional counterpoise method yielded an overall IBSSE estimate of 31.6 kcal/mol, which lay MP2/aDZ, the model predicted 26.68 0.75 kcal/mol which agrees with the traditional counterpoise method value of 26.30 kcal/mol. The method was also used on a set of 9 native NMR derived structures and 33 sheet structure), which has been studied previously with FMO MP2/6 31G* with the PCM PAGE 88 88 solvent model. 140 The previous study showed that correlation energy from MP2 was able to distinguish between native and decoy folds, yet total MP2 energies were not. In order to investigate the effect of IBSSE on this observation, we estimated the IBSSE of the entire set. The magnitudes of the IBSSE estimates are plotted in Figure 6 3, which shows that the native structures had generally higher values of IBSSE than the decoy folds. We also observed a wide spread of estimated IBSSE values, which ranged from around 50 to 120 kcal/mol. This suggests that IBSSE cannot be expected to cancel fully when comparing different conformations of the same molecule, and must be corrected in order to obtain correct relative energies. The relative raw and corrected FMO MP2/6 31G*+PCM energies of the set are plotted in Figure 6 4. Due to the variance in the IBSSE estimates, the ordering of folds by energy changed after IBSSE corrections. This is perhaps more clear in Figure 6 5, which plots both sets of relative energies in order of increasing BSSE corrected energies. To quantify this difference in ordering, the Kendall tau rank coefficient (ranging from 1 for reverse ordering to 1 for identical ordering) between the corrected and uncorrected sets was measured to be 0.431. Although IBSSE correction did not improve fold discrimination by FMO MP2/6 31G*+PCM energy, the differences in relative energies was significant. This disagreement between energy rankings before and after IBSSE correction suggests that IBSSE can greatly distort potential energy surfaces and thereby affect QM calculated quantities of large biomolecular systems. Conclusions Novel alg orithms and increased computing power are allowing for the growing application of quantum chemistry to large biomolecular systems such as proteins and PAGE 89 89 protein/ligand complexes. As system size increases, the magnitudes of errors such as basis set superposit ion error also increase, which will require new methods to address the accumulation of modelling errors. We have introduced a fast statistics and fragment based method which bypasses the many additional quantum calculations required in the traditional fra gment counterpoise method for intramolecular BSSE. The method yielded BSSE estimates comparable to the traditional counterpoise method, which yielded estimates within predicted error bars. The model is easily extendable to other basis sets and quantum meth ods by generating the appropriate the training data. PAGE 90 90 Table 6 1. Optimized BSSE Model Parameters for MP2/aDZ Interaction Class N a b c R 2 Nonpolar 354 0.643 0.440 0.0864 0.86 Hydrogen Bond 312 0.00821 2.53 0.170 0.89 Cationic 44 0.570 0.458 0.0900 0 .88 Anionic 63 0.691 0.655 0.121 0.69 Table 6 2 Optimized BSSE Model P arameters for MP2/6 31G* Interaction Class N a b c R 2 Nonpolar 354 0.522 9.10 0.285 0.85 Hydrogen Bond 312 0.254 3.88 0.191 0.89 Cationic 44 0.983 29.4 0.423 0.68 Anionic 63 1.57 29.28 0.346 0.77 Table 6 3. BSSE predictions for the HIV II protease/indinavir complex at MP2/ aDZ. Number a Class b Predicted BSSE Measured BSSE Predicted Error Measured Error 1 np 1.15 1.31 0.35 0.17 2 np 2.19 2.34 0.35 0.16 3 pc 2.18 2.25 0.30 0.07 4 nc 2.73 2.73 1.01 0.01 5 nc 2.36 2.78 1.00 0.42 6 nc 1.33 1.35 1.00 0.02 7 p 2.52 2.18 0.35 0.34 8 np 2.42 1.80 0.35 0.62 9 np 2.41 1.51 0.35 0.91 10 np 2.14 2.25 0.14 0.10 11 np 2.33 1.61 0.35 0.71 12 np 1.39 1.91 0.35 0.52 13 np 2.11 2.36 0.35 0.25 14 np 2.25 2.97 0.35 0.72 15 np 1.20 1.34 0.35 0.14 16 np 1.29 1.14 0.35 0.14 17 np 1.85 2.20 0.35 0.35 18 np 1.26 1.78 0.35 0.52 19 np 1.94 2.48 0.35 0.55 20 np 1.41 1.88 0.35 0.47 21 nc 1.30 1.61 1.00 0.30 Arithme tic Sum 39.74 41.79 2.04 Total Error Bar 2.45 a The fragments are indentified by complex number from Reference X. b The fragments are classified as one of: np(nonpolar), p(polar), nc(negatively charged), or pc(positively charged). PAGE 91 91 Table 6 4 B SSE predictions for the HIV II protease/indinavir complex at MP2/ 6 31G* Number a Class b Predicted BSSE Measured BSSE Predicted Error Measured Error 1 np 1.49 2.05 0.30 0.56 2 np 0.83 0.90 0.30 0.07 3 pc 1.66 1.45 0.38 0.21 4 nc 10.40 8.27 1.01 2.14 5 nc 6.97 6.90 0.93 0.07 6 nc 2.10 3.47 0.93 1.37 7 p 2.62 3.07 0.17 0.44 8 np 1.45 2.13 0.30 0.68 9 np 1.41 1.98 0.30 0.56 10 np 2.08 1.90 0.30 0.19 11 np 1.26 1.33 0.30 0.06 12 np 1.26 1.66 0.30 0.40 13 np 1.43 1.72 0.30 0.30 14 np 0.91 0.70 0.30 0.21 15 np 0.84 1.10 0.30 0.26 16 np 0.73 0.63 0.30 0.11 17 np 1.51 1.15 0.30 0.36 18 np 1.33 1.28 0.30 0.05 19 np 0.88 1.05 0.30 0.18 20 np 1.62 1.57 0.30 0.05 21 nc 3.26 2.77 0.92 0.49 Arithmetic Sum 46.06 47.08 1.02 Total Error Bar 2.26 a The fragments are indentified by complex number from Reference X. b The fragments are classified as one of: np(nonpolar), p(polar), nc(negatively charged), or pc(positively charged). PAGE 92 92 Figure 6 1. BSSE magnitudes (calculated with the co unterpoise procedure) of the four interaction classes in the fragment database at MP2/6 31G* and MP2/aug cc pVDZ (MP2/aDZ). PAGE 93 93 Figure 6 2. Measured distance dependence curves of BSSE at MP2/6 31G* for a backbone hydrogen bond (red) and a van der Waals (b lue) complex. The measured BSSE values are plotted as circles along lines, and the model estimates are plotted as squares with associated error bars. The PDB (near minimum) geometries are designated with asterisks PAGE 94 94 Figure 6 3. Distribution of IBSSE est imates for the Pin1 WW domain native/decoy fold set. The IBSSE estimate for each fold is shown along the horizontal axis along with its associated estimated uncertainty. The native NMR observed folds are designated with asterisks. PAGE 95 95 Figure 6 4. Relative FMO MP2/6 31G*+PCM energies of each of the Pin1 WW domain folds before and after IBSSE corrections. The wide distribution of estimated BSSE magnitudes resulted in a different ordering of folds by energy after correction. PAGE 96 96 Figure 6 5. The relative energ ies of the Pin1 WW domain folds ordered by BSSE corrected energies. The NMR derived structures are marked with filled circles. The folds are ordered along the horizontal axis to yield monotonically increasing BSSE corrected relative energies. In contrast, the same ordering of folds leads to a very jagged plot of uncorrected energies, suggesting that intramolecular BSSE can greatly distort potential energy surfaces. PAGE 97 97 CHAPTER 7 STATISTICAL ENSEMBLES Introduction The previous chapters describe methods for erro r estimation and correction for single point energies of protein ligand complexes (Chapter 4), protein folds (Chapter 5), and quantum calculations of large biomolecular systems (Chapter 6) In general, single point structures are not adequate for simulatio ns of biomolecular phenomena; rather, ensembles of microstates collectively contribute to observable thermodynamic quantities of such systems. In this chapter, error analysis is applied to statistical thermodynamic quantities namely free energy, average e nergy, and entropy. 141 We first investigate the effect of modeling errors with a first order analysis on simulated systems. We then extend the a nalysis to higher orders for free energy. Finally, we propose a general procedure for on the fly error handling in the estimation of free energy differences in a simulated system. First order Error Analysis In this analysis we examine the canonical ensem ble (constant N, V, T), in which t he Helmholtz free energy, A (Equation 7 1), is the key quantity of interest. 142 This quantity depends on the r b T) 1 where k b is the Boltzmann constant), and partition function Q which sums over the microstates of the ensemble (Equation 7 2). (7 1 ) (7 2 ) PAGE 98 98 A general function f of several independent variables, f( x 1 x 2 N ), has an uncer tainty that ca n be estimated by Equation 7 3 wh ere i are uncer tainties in the input variables ( 7 3) Equation 7 3 is derived from a sum o f truncated Taylor series of f in each independent variable x i We truncate at the first power of i to yield the simplest (first order) model of error propagation. Similar expressions have been used in the past for the sensitivity ana lysis of force field parameters. 114 117 In the case of systematic errors, the inequality becomes an equality and the absolute value is dropped. If the errors i are random and uncorrelated, then the first order propagated error in the function f can be e stimated by the Pythagorean sum as given by Equation 7 4. ( 7 4) We have proposed previously that single point energies of large systems contain both systematic and random errors. Our approach to the esti mation of single point energy errors is as follows: 1) Identify and classify each distinct molecular interaction present (e.g. hydrogen bond or van der Waals contact), 2) Estimate the individual fragment based errors with a probability density function con structed with a reference database of molecular fragment interacti ons, 3) Propagate the errors using the first order formulas of Equations 7 5 and 7 6. ( 7 5) PAGE 99 99 ( 7 6) H ere the summations run ove r all interaction types k in the reference database, N k is the number of detected interactions of type k and k and 2 k represent the mean error per interaction for interaction type k and variance about the mean error for interaction type k in the database. Interaction types can be as coarsely defined as van der Waals and polar contacts or as specific as classify ing individual amino acid side chain interactions. Th is method assumes additivity (independence) of fragment interaction energies and that the probability density function s describing modeling errors for each interaction class is a normal distribution with should be interpreted as a measure of uncertainty in our error estimate for a given fragment interaction. They are inversely proportional to the precision of an energy model in predicting fragment based interaction energies. Thus if an example protein containing 50 hydrogen bonds and 40 distinct van der Waals contacts is modeled with an energy function with intrinsic modeling errors of 0.50 1.0 kcal/mol (note: here 0.50 kcal/mol is the value of the systematic error and 1.0 kcal/mol is the value of the random error) for hydrogen bonds and 0.20 1.0 kcal/mol for van der Waals contacts, we would evaluate the energy with the model and estimate an error of 17 9.5 kcal/mol. Rather than simply reporting the calculated ene rgy, one would correct systematic errors by subtracting 17 kcal/mol from the calculated energy, and report an error bar with model serving as a reference) has about a 68% chance (1 kcal/mol window. PAGE 100 100 Free Energy If the energies E i of each microstate i in a statistical ensemble contain systematic and random error components as estimated by Eq uations 7 5 and 7 6, we can propagate them over the entire ensemble using error propagation fo rmulas. By applying Equations 7 3 and 7 4 to the free energy A we obtain the f ollowing first order expressions (Equations 7 7 and 7 8). ( 7 7) ( 7 8) H ere P i is the normal ized weight of the microstate i Thus the systematic error in A is the Boltzmann weighted average of the systematic error of each microstate, and the random error in A is the Pythagorean sum of the weighted random errors from each microstate. If we suppose (as a thought experiment) that our ensemble consists of one dominant microstate (e.g., a stable folded protein, or a docked ligand pose), then P o and we find the seemingly obvious result of Equation 7 9 which indicates that the total error in the free energy is determined by the error contained in the single dominant microstate. Thus endpoint energy methods commonly used in docking and protein structure predictions, which consider only one microstate at a time, will always yield the full value of the e rror contained in the single microstate considered. ( 7 9) In a second scenario, suppose that the ensemble comprises several distinct contributing microstates, but each contains the same error estimate (constant E i Sys and PAGE 101 101 E i Rand ). In this case, the propagated error in free energy is given by Equation 7 10 which yields the constant systematic error of the microstates, but a decreased random error due to the scalin g factor in the last term in Equation 7 10, which must be less than one. It is worth noting that the scaling factor, and thus the propagated random error, decreases with increasing number of nearly isoenergetic microstates. Hence, by including additional contributing microstates via a local sampling of th e potential energy surface, the random error in free energy estimation is naturally reduced, while the systematic error is unaffected. ( 7 10) As a numerical example, consider a hypothetical two state protein (hereafter referr ed to as M1) modeled with a force field containing 0.21 0.6 kcal/mol error per interaction for van der Waals contacts and 0.41 2.29 kcal/mol error per interaction for hydrogen bonds (this error profile is similar to what we have found for the generalized AMBER force field i.e., GAFF). Assume that state 1 comprises 50 van der Waals and 65 hydrogen bonding interactions while state 2 comprises 60 and 70 of each, leading to overall microstate energy calculation errors of 37.2 18.9 kcal/mol for state 1 and 4 1.3 19.7 kcal/mol for stat e 2. As shown in Figure 7 1, the first order error propagation is strongly dependent on the energy gap between the two states. Propagated systematic error is dominated by the error of the lower energy microstate until the relativ e energies are within approximately 2 kcal/mol (assuming T=300K), where its value approaches the arithmetic mean of the two microstate systematic error values. In contrast, propagated random error begins to decrease slowly at the 2 kcal/mol region, and is minimized when the energies of the two states are equal. PAGE 102 102 To demonstrate the effect of ran dom error reduction due to increasing number of included microstates, we constructed a series of hypothetical ensembles ( model system M2) in which the number of mi crostates is incremented by one up to one thousand. The discrete microstates were randomly selected from the potential energy surface, which was modeled as a Lennard Jones type function (Eq uation 7 =1.0, as plotted in Figure 7 2A always included as a permanent global minimum microstate, and the remaining microstates were randomly selected from the potential energy surfac e. Each of the microstates was assigned zero systematic error and a random error o f 1.0 kcal/mol. As shown in Figure 7 2 B the first order propagated random error in free energy is 1.0 kcal/mol when the ensemble contains a single microstate. When 10 20 mic rostates are added, the error drops below 0.2 kcal/mol. When 50 60 microstates are included, the benefit of increased local sampling begins to diminish as the propagated random error approaches zero. ( 7 11) Average Energy Si milar analysis can be done for the macroscopic average energy E expressed by Eq uation 7 12 for which first order error propagation yields the formula in Equation 7 13 (See Appendix for the derivation). ( 7 12) PAGE 103 103 ( 7 13) The values of propagated errors in average energy depend on (1) the probability (normalized weight) of each microstate, (2) the energy value of each microstate, (3) and the average energy itself. The expected behavior of modeling error s in average energy is shown for the M1 system in Figure 7 3. W e observed similar error propagation behaviors for average energy as for free energy in the regions where the energy gap is f increased propagated errors now appear. The effect is more significant in the systematic error, where the propagated error value rises by about 8 kcal/mol. In addition, the width of the r for average energy than for free energy. Entropy Similarly we can investigate the propagation of systematic and random errors in entropy 14). ( 7 14) First order error propagati on in entropy yields the formula in Equation 7 1 5 (See Appendix for a derivation). ( 7 15) Upon applying Equation 7 15 to M1 (See Figure 7 4) we found that at large energy gaps, the propagation of microstate systematic and rando m errors nears zero, and increases significantly as the energy difference decreases, but to a much smaller PAGE 104 104 magnitude than the individual microstate energy errors. As the energy gaps approach zero, so do the propagated errors in entropy, for both error type s. Interestingly, the first order propagation of errors in entropy was symmetric about isoenergetic points for both and entropy should partially cancel when combining the two terms to calculate free energy (via E TS). Higher Order Error Analysis In order to establish more practical error analysis methods, which can be generally applied to even large microstate modeling error magnitudes, we have extended our error propa gation analysis to higher order schemes. Two strategies were employed. First, we continued the Taylor series expansion based approach, where analytical forms for second (Equation 7 16) and fourth (Equation 7 17) order error propagation were derived by assu ming independent microstate energies and considering only up to the second moments of the error probability densi ty functions. 143 Second, we conducted Monte Carlo (MC) error simulations, in which errors were randomly drawn from a Gaussian probability density function centered at a given syst ematic error with a standard deviation equal to the given random error. The simulated errors were then added to the given energies and free energy was computed 10 7 times to ensure convergence. The resulting shift in sample mean and standard deviation were used to represent propagated systematic and random error in the free energy. ( 7 16) PAGE 105 10 5 ( 7 17) We first examined a two state system where the two microstates have zero intrinsic systematic error but 1.0 kcal/m ol of random error. The error estimation results are summarized in Table 7 1 and Fig ure 7 5. We observed that, although the individual microstates contained no systematic errors, the microstate random errors lead to a propagated systematic error in the com puted free energy. The first order error analysis completely neglects this effect, yet here it yields a random free energy error estimate closest to the MC estimate. Systematic free energy errors nearly converge at fourth order, but random errors seem to c onverge more slowly. The MC estimates validate the previous observation that random errors are diminished as the two microstates are more equally weighted in the ensemble. In Figure 7 5, we see how increasing levels of truncation begin to converge toward M C results for small error magnitudes. However, the magnitudes of microstate energy errors for sizeable proteins and common force fields are usually much larger. To investigate the behavior of larger microstate energy errors, we re visited model system M1. As introduced earlier, this system includes two microstates with sizeable microstate energy errors of 37.1 18.9 and 41.3 19.7 kcal/mol respectively. As shown in Figure 7 PAGE 106 106 6 the MC error propagation results show very different behavior than predicted by f irst order formulas. For instance, contrary to the prediction from the first order analysis, the propagated systematic errors (Fig ure 7 6A ) are lower in magnitude than both of the microstate energy systematic errors, even at high energy gaps. The propaga te d random errors (Figure 7 6B ) also fall below the microstate random errors but to a lesser extent than what is predicted by first order analysis (compare to Figure 7 1B ). In Table 7 2, the MC estimates at the isoenergetic point are compared with the corres ponding low order error propagation formulas. It is clear that low order error analysis is not sufficient to estimate propagated errors when microstate energy errors are large. Thus for large microstate energy errors, the MC method should be employed. The discrepancy between low order formulas and MC estimates led us to revisit order analysis. To examine the generality of this effect, we again examined the model system M2 described by the Lennard Jones surface of Eq uation 7 11. As detailed earlier, each microstate sampled in the ensembles of increasing size included no systematic error but 1.0 kcal/mol random error. From the MC results (Fig ure 7 7), we found that random errors indeed f ollow similar error reduct ion behavior (compare with Figure 7 2B ). modeling errors in c omparison with single global minimum microstate energies, and therefore should be preferred as energetic score functions, provided that systematic errors are estimated and removed. PAGE 107 107 Proposed General Error Handling Protocol To demonstrate a possible error ha ndling protocol based on the observations of this work, we built a model system (M3) resembling various states of a small protein undergoing folding. The model ensemble comprised 5 (macro)states, each represented by 100 microstates. Such a dataset might ar ise from a clustering of snapshots from molecular dynamics simulations as done in the development of Markov models of protein folding, for example as seen in Bowman et al 144 In this model system, the 5 states have their average energies randomly selected from a uniform distribution in the range ( 100, 0) kcal/mol. Each component microstate was assigne randomly selected from a normal distribution centered at the corresponding state average energy with a standard deviation of 3.0 kcal/mol. In addition, each state was given a random average number of hydrogen bonds, selected from a unifor m distribution in the range (25, 55). The number of hydrogen bonds of each microstate was then randomly selected from a normal distribution centered on the average value of the corresponding state with a standard deviation of 1. The same procedure was empl oyed to assign van der Waals contacts, with the states randomly assigned in the range of (40, 80) and microstate deviations selected from a normal distribution with a standard deviation of 2. The energy function is assumed to contain errors of 0.21 0.60 f or van der Waals contacts and 0.41 2.29 for hydrogen bonds. These parameters lead based on the propagated microstate energy errors. We now aim to compute free energy diff erences between the 5 states given the set of erroneous microstate energies. The procedure is illustrated in Figure 7 8. The first step in an error handling free energy computation would be to analyze each microstate PAGE 108 108 for its intramolecular interactions. Th e microstate contact counts are displayed in Figure 7 8A which highlights the similarity of molecular contacts within each of the five states. In step two, by following our fragment based error propagation procedure, we calculate the systematic and random errors of the microstate energies which are shown in Figure 7 8B In the third step, the energies of all microstates are corrected by removing systematic errors. The resulting sets of corrected microstate energies and associated err or bars are displayed in Figure 7 9A Finally, independent MC error simulations are conducted for each of the 5 states. These simulations give rise to the final estimates of the free energies of all five states along with their associated uncertainties. The black dots and the b lack error bars in Figure 7 9B show the final estimated free energy values and the corresponding random errors when all 500 microstates are sampled. For example, the free energy of state 3 is estimated to be 110.5 6.97 and the free energy of state 5 is 102.6 7.24. The free energy difference between these two states is calculated to be 7.86 10.1. Here, the uncertainty of the free energy difference is evaluated as the Pythagorean sum of the error bars of the two states. In this model system, we also obs erve the random error reduction effect. We analyzed the ensemble results with the number of the sampled microstates per state equal to 1, 25, 50, 75, and 100. As shown in Figure 7 9B if only one microstate is sampled for each state, the error bars are ver y large, but when more than 25 microstates are included, the error bars are significantly reduced. Unfortunately, MC error propagation is unable to estimate the introduction of systematic shifts in free energy due to microstate random errors. Systematic sh ifts in free energy estimations depend on both the magnitudes of microstate random errors PAGE 109 109 and the distribution of calculated microstate energies. In this instance these shifts in free energy had a range of magnitudes from 0.25 to 4.5 kcal/mol, which sugges ts that these systematic errors would not cancel completely when taking free energy differences. Furthermore, low order systematic error formulas failed to estimate these shifts (See Table 7 3). For this reason, it is important to be attentive to the magni tudes of the final propagated error bars, since they give a direct value of random uncertainty and a rough indication of the presence systematic shifts. Conclusions This work serves as an extension of the methods proposed in earlier chapters for estimating microstate energy errors by describing their propagation when statistical mechanical variables such as free energy are estimated. The main benefits of such an analysis are that (1) estimated systematic errors can be corrected, and (2) random errors can pr ovide uncertainty estimates when simulation results are reported. Interestingly, we found that propagated random errors in computed free energies can be reduced by increasing the local sampling of potential energy surfaces. This suggests that, analogous to the enthalpy entropy compensation effect, there exist correlated interplays between energy modeling errors and state sampling. As errors in an energy model are introduced, one can take advantage of state sampling to minimize the associated errors in free energy estimations. We also observed the effect of partial systematic error cancelation between energy and entropy in free energy (Figures 7 1, 7 3, and 7 diminished when entrop y and its associated error are subtracted to produce free energy and its associated error. PAGE 110 110 The error propagation formulas derived in this work approximate the accumulation of microstate energy errors by using truncated Taylor series. Higher orders of Tayl or series expansions could be utilized, but this is expected to yield more complicated error propagation formulas than presented here. For very large microstate energy errors, Monte Carlo error estimation may be necessary to correctly estimate propagated r andom error magnitudes. A second assumption made in this work is that microstate energies are independent from one another, which may not be a valid approximation when sampling along a molecular dynamics trajectory. Thus special care will need to be taken to ensure at least quasi independent state sampling in future applications. PAGE 111 111 Table 7 1. Propagated errors in free energy at different levels of truncation at the isoenergetic point for small microstate energy errors. 1 st order 2 nd order 4 th order MC res ults Systematic Error 0 0.420 0.272 0.293 Random Error 0.707 0.569 0.653 0.780 Table 7 2. Propagated errors in free energy at different levels of truncation at the isoenergetic point for M1. 1 st order 2 nd order 4 th order MC results Systematic Err or 39.23 117.6 20480 28.60 Random Error 13.67 156.2 20440 15.94 Table 7 3. Free energy values (kcal/mol) from the simulated ensemble. State True Calculated a Error Calculated Uncertainty in a True Shift in A Calculated Shift in A a,b 1 25.0 30.2 5.24 8.89 0.702 3730 2 84.5 85.0 0.50 9.40 4.52 6590 3 0 0 0 9.88 0.258 2840 4 88.0 93.5 5.55 8.98 1.36 2880 5 7.86 5.09 2.77 1 0.0 3.47 9860 a calculated in a real computational experiment while the remaining quantities come from the simulated ensemble and are generally not known in a real computational e xperiment. b The calculated shifts in A come from Taylor series estimation up to 4 th PAGE 112 112 A B Figure 7 1. Propagation of errors in free energy for the model two state system M1. A) Systematic error propagation at different energy gaps. B) Random error propagat ion at different energy gaps. PAGE 113 113 A B Figure 7 2. Analysis of ensemble size dependence of propagated errors. A) The Lennard Jones type potential energy surface of model system M2. B) Propagated random errors versus number of states included in the ensembl e. PAGE 114 114 A B Figure 7 3. Propagation of errors in average energy for the model two state system M1. A) Systematic error propagation at different energy gaps. B) Random error propagation at different energy gaps. PAGE 115 115 A B Figure 7 4. Propagation of errors in entropy for the model two state system M1. A) Systematic error propagation at different energy gaps. B) Random error propagation at different energy gaps. PAGE 116 116 A B Figure 7 5. Error propagation in a two state system with small microstate energy errors. A) Propagated systematic errors at first order (red), second order (green), fourth order (blue), and the M onte Carlo estimation (black). B ) Propagated random errors at the same levels of truncation. PAGE 117 117 A B Figure 7 6. Monte Carlo estimation of propagated err ors in the free energy of M1, a hypothetical 2 state protein system with large microstate energy errors. A ) Propagated systematic error. B ) Propagated random error. PAGE 118 118 Figure 7 7. Monte Carlo estimation of free energy errors as a function of increased loc al sampling of the example Lennard Jones surface. As more states are sampled, each with its own 1.0 kcal/mol random energy error, the propagated free energy random error is decreased, while systematic error is introduced PAGE 119 119 A B Figure 7 8. Simulated ense mble M3. A) Distributions of molecular contacts in the ensemble. The (macro)states are distinguished by color. Note the similarity of structures within each state. B) Distributions of systematic and random microstate energy errors in the simulated ensemble PAGE 120 120 A B Figure 7 9. Energies and free energies of the simulated ensemble M3. A) The energies and random error estimates of each microstate after systematic error correction. The microstates are colored according their associated (macro)state. B) Monte C arlo free energy estimates of each state along with Monte Carlo estimated uncertainty. Uncertainties are estimated by including either 1, 25, 50, 75, or 100 microstates in each ensemble. PAGE 121 121 A PPENDIX DERIVATION OF ERROR PROPAGATION FORMULAS Derivatives of Hel mholtz Free Energy through Fourth Order (A 1) (A 2) (A 3) (A 4) (A 5) (A 6) (A 7) (A 8) (A 9) (A 10) (A 11) (A 12) PAGE 122 122 First Order Error Propaga tion in Helmholtz Free Energy (A 13) (A 14) (A 15 ) First Order Error Propagation in Average Energy (A 16) (A 17) (A 18) (A 19) (A 20) (A 21) First Order Error Propagation in Entropy (A 22) (A 23) Since each term in the outer sum in S depends on every state (via the partition functions in the denominators), we break S into two parts: S i for the outer sum term for state i, and S a for all the remaining terms PAGE 123 123 (A 24) (A 25) (A 26) We first evaluate the derivative of S i (A 27) (A 28) (A 29) (A 30) Now we compute the derivative of S a PAGE 124 124 (A 31) (A 32) (A 33) (A 34) (A 35) (A 36) (A 37) Now add the derivatives following Equation A 26. (A 38) PAGE 125 125 (A 39) Equation A 39 can then be simplified by using Equation A 40. (A 40) (A 41) (A 42) (A 43) (A 44) (A 45) (A 46) (A 47) Second Ord er Error Propagation in Helmholtz Free Energy The following expressions in Equation A 48 and A 49 were adapted from Reference 138 (A 48) (A 49) By replacing the partial derivatives with the express ions from Equations A 2 and A 3 we arrive at Equation A 50. PAGE 126 126 (A 50) Fourth Order Error Propagation in Helmholtz Free Energy The following expressions in Equa tion A 51 and A 52 were adapted from Reference 138 (A 51) (A 52) By replacing the partial derivatives with the expressions from Equations A 2 through A 9 we arrive at Equation A 53. PAGE 127 127 (A 53) PAGE 128 128 L IST OF REFERENCES 1. Chen, W.; Gilso n, M. K.; Webb, S. P.; Potter, M. J. Modeling Protein Ligand Binding by Mining Minima. J. Chem. Theory Comput. 2010, 6 (11), 3540 3557. 2. Head, M. S.; Given, J. A.; Gilson, M. K. ''Mining minima'': Direct computation of conformational free energy. J. Phys Chem. A 1997, 101 (8), 1609 1618. 3. Kolossvary, I. Evaluation of the molecular configuration integral in all degrees of freedom for the direct calculation of conformational free energies: Prediction of the anomeric free energy of monosaccharides. J. Phy s. Chem. A 1997, 101 (51), 9900 9905. 4. Purisima, E. O.; Hogues, H. Protein Ligand Binding Free Energies from Exhaustive Docking. J. Phys. Chem. B 2012 ASAP Article. DOI: 10.1021/jp212646s 5. Cramer, C. J. In Essentials of Computational Chemistry: Theori es and Models; 2 ed.; John Wiley and & Sons, Ltd: Chichester, 2004. 6. Fanfrlik, J.; Bronowska, A. K.; Rezac, J.; Prenosil, O.; Konvalinka, J.; Hobza, P. A Reliable Docking/Scoring Scheme Based on the Semiempirical Quantum Mechanical PM6 DH2 Method Accurat ely Covering Dispersion and H Bonding: HIV 1 Protease with 22 Ligands. J. Phys. Chem. B 2010, 114 (39), 12666 12678. 7. Helgaker, T.; Klopper, W.; Tew, D. P. Quantitative quantum chemistry. Mol. Phys. 2008, 106 (16 18), 2107 2143. 8. Case, D. A.; Darden, T A.; T.E. Cheatham, I.; Simmerling, C. L.; Wang, J.; R.E. Duke, R.; Luo, R. C. W. AMBER 11 University of California, San Francisco, 2010. 9. Szabo, A.; Ostlund, N. S. In Modern Quantum Chemistry; MacMillan Publishing Co.: New York, 1982. 10. Bartlett, R. J.; Musial, M. Coupled cluster theory in quantum chemistry. Rev. Mod. Phys. 2007, 79 (1), 291 352. 11. Dewar, M. J. S.; Thiel, W. Ground States of Molecules. 38. The MNDO Method, Approximations and Parameters. J. Am. Chem. Soc. 1977, 99 (15), 4899 4907. 1 2. Dewar, M. J. S.; Zoebisch, E. G.; Healy, E. F.; Stewart, J. J. P. AM1: A New General Purpose Quantum Mechanical Molecular Model. J. Am. Chem. Soc. 1985, 107 (13), 3902 3909. 13. Stewart, J. J. P. Optimization of Parameters for Semiempirical Methods I. M ethod. J. Comp. Chem. 1989, 10 (2), 209 220. PAGE 129 129 14. Stewart, J. J. P. Optimization of parameters for semiempirical methods V: Modification of NDDO approximations and application to 70 elements. J. Mol. Model 2007, 13 (12), 1173 1213. 15. Repasky, M. P.; Chand rasekhar, J.; Jorgensen, W. L. PDDG/PM3 and PDDG/MNDO: Improved semiempirical methods. J. Comput. Chem. 2002, 23 (16), 1601 1622. 16. Korth, M.; Pitonak, M.; Rezac, J.; Hobza, P. A Transferable H Bonding Correction for Semiempirical Quantum Chemical Method s. J. Chem. Theory Comput. 2010, 6 (1), 344 352. 17. Rezac, J.; Hobza, P. A halogen bonding correction for the semiempirical PM6 method. Chem. Phys. Lett. 2011, 506 286 289. 18. Foster, M. E.; Sohlberg, K. A New Empirical Correction to the AM1 Method for Macromolecular Complexes. J. Chem. Theory Comput. 2010, 6 (7), 2153 2166. 19. Parr, R. G.; Yang, W. In Density functional theory of atoms and molecules; Oxford University Press New York, 1989. 20. Perdew, J.; Kurth, S., Density Functionals for Non relativi stic Coulomb Systems in the New Century. In A Primer in Density Functional Theory Fiolhais, C.; Nogueira, F.; Marques, M., Eds. Springer Berlin / Heidelberg: 2003; Vol. 620, pp 1 55. 21. Perdew, J. P.; Schmidt, K. Jacob's ladder of density functional appr oximations for the exchange correlation energy. AIP Conf. Proc. 2001, 577 (1), 1 20. 22. Halkier, A.; Helgaker, T.; Jorgensen, P.; Klopper, W.; Koch, H.; Olsen, J.; Wilson, A. K. Basis set convergence in correlated calculations on Ne, N 2, and H2O. Chem. P hys. Lett. 1998, 286 (3 4), 243 252. 23. Boys, S. F.; Bernardi, F. Calculation of Small Molecular Interactions by Differences of Separate Total Energies Some Procedures with Reduced Errors. Mol. Phys. 1970, 19 (4), 553 566. 24. Fukuzawa, K.; Kitaura, K.; Uebayasi, M.; Nakata, K.; Kaminuma, T.; Nakano, T. Ab initio quantum mechanical study of the binding energies of human estrogen receptor alpha with its ligands: An application of fragment molecular orbital method. J. Comput. Chem. 2005, 26 (1), 1 10. 25. Chen, X. H.; Zhang, J. Z. H. Molecular fractionation with conjugated caps density matrix with pairwise interaction correction for protein energy calculation. J. Chem. Phys. 2006, 125 (4), 044903 26. Yang, W.; Lee, T. S. A Density Matrix Form of the Divi de and Conquer Approach for Electronic Structure Calculations of Large Molecules. J. Chem. Phys. 1995, 103 (13), 5674 5678. PAGE 130 130 27. Dixon, S. L.; Merz Jr., K. M., The Divide and Conquer Methodology Applied to Semiempirical SCF MO Calculations. In Encyclopedia of Computational Chemistry Schleyer, P. v. R., Ed. Wiley & Sons Ltd: Baffins Lane, Chichester, 1998; Vol. 1, pp 762 776. 28. He, X.; Merz, K. M. Divide and Conquer Hartree Fock Calculations on Proteins. J. Chem. Theory Comput. 2010, 6 (2), 405 411. 29. H e, X.; Wang, B.; Merz, K. M. Protein NMR Chemical Shift Calculations Based on the Automated Fragmentation QM/MM Approach. J. Phys. Chem. B 2009, 113 (30), 10380 10388. 30. Vincent, J. J.; Merz, K. M., Jr. Parallel Implementation of the Divide and Conquer A lgorithm. Theor. Chem. Acc. 1998, 99 220 223. 31. Taylor, J. R. In An introduction to error analysis: the study of uncertainties in physical measurements; McGuire, A., Ed.; 2nd ed.; University Science Books: Sausalito, Calif., 1997. 32. Papoulis, A. In Pr obability, random variables, and stochastic processes; 3rd ed.; McGraw Hill: New York, 1991. 33. Seiler, F. A. Error Propagation for Large Errors. Risk Analysis 1987, 7 (4), 509 518. 34. Alex, A. A.; Flocco, M. M. Fragment based drug discovery: What has it achieved so far? Curr. Top. Med. Chem. 2007, 7 (16), 1544 1567. 35. Makara, G. M. On sampling of fragment space. J. Med. Chem. 2007, 50 (14), 3214 3221. 36. Bissantz, C.; Kuhn, B.; Stahl, M. A Medicinal Chemist's Guide to Molecular Interactions. J. Med. C hem. 2010, 53 (14), 5061 5084. 37. Mark, A. E.; Vangunsteren, W. F. Decomposition of the Free Energy of a System in Terms of Specific Interactions Implications for Theoretical and Experimental Studies. J. Mol. Biol. 1994, 240 (2), 167 176. 38. Xantheas, S. S. Ab Initio Studies of Cyclic Water Clusters (H2o)(N), N=1 6 .2. Analysis of Many Body Interactions. J. Chem. Phys. 1994, 100 (10), 7523 7534. 39. Ucisik, M. N.; Dashti, D. S.; Faver, J. C.; Merz, K. M. Pairwise Additivity of Energy Components for Prot ein Ligand Binding: HIV II Protease Indinavir Case. J. Chem. Phys. 2011, 135 085101. PAGE 131 131 40. Zhao, Y.; Truhlar, D. G. The M06 suite of density functionals for main group thermochemistry, thermochemical kinetics, noncovalent interactions, excited states, and t ransition elements: two new functionals and systematic testing of four M06 class functionals and 12 other functionals. Theor. Chem. Account 2008, 120 215 241. 41. Baum, B.; Muley, L.; Smolinski, M.; Heine, A.; Hangauer, D.; Klebe, G. Non additivity of Fun ctional Group Contributions in Protein Ligand Binding: A Comprehensive Study by Crystallography and Isothermal Titration Calorimetry. J. Mol. Biol. 2010, 397 (4), 1042 1054. 42. Zhou, H. X.; Gilson, M. K. Theory of Free Energy and Entropy in Noncovalent Bi nding. Chem. Rev. 2009, 109 (9), 4092 4107. 43. Gilson, M. K.; Zhou, H. X. Calculation of protein ligand binding affinities. Annu. Rev. Bioph. Biom. 2007, 36 21 42. 44. Leach, A. R.; Shoichet, B. K.; Peishoff, C. E. Prediction of Protein Ligand Interactio ns. Docking and Scoring: Successes and Gaps. J. Med. Chem. 2006, 49 (20), 5851 5855. 45. Foloppe, N.; Hubbard, R. Towards predictive ligand design with free energy based computational methods? Curr. Med. Chem. 2006, 13 (29), 3583 3608. 46. Mobley, D. L.; G raves, A. P.; Chodera, J. D.; McReynolds, A. C.; Shoichet, B. K.; Dill, K. A. Predicting Absolute Ligand Binding Free Energies to a Simple Model Site. J. Mol. Biol. 2007, 371 (4), 1118 1134. 47. Abel, R.; Young, T.; Farid, R.; Berne, B. J.; Friesner, R. A. Role of the active site solvent in the thermodynamics of factor Xa ligand binding. J. Am. Chem. Soc. 2008, 130 (9), 2817 2831. 48. Deng, Y. Q.; Roux, B. Computations of Standard Binding Free Energies with Molecular Dynamics Simulations. J. Phys. Chem. B 2 009, 113 (8), 2234 2246. 49. Martin, Y. C. Let's not forget tautomers. J. Comput. Aided Mol. Des. 2009, 23 (10), 693 704. 50. Merz, K. M. Limits of Free Energy Computation for Protein Ligand Interactions. J. Chem. Theory Comput. 2010, 6 (5), 1769 1776. 51. Raha, K.; Peters, M. B.; Wang, B.; Yu, N.; WollaCott, A. M.; Westerhoff, L. M.; Merz, K. M. The role of quantum mechanics in structure based drug design. Drug Discovery Today 2007, 12 (17 18), 725 731. 52. Peters, M. B.; Raha, K.; Merz, K. M. Quantum mech anics in structure based drug design. Current Opinion in Drug Discovery & Development 2006, 9 (3), 370 379. PAGE 132 132 53. Raha, K.; Merz, K. M. Large scale validation of a quantum mechanics based scoring function: Predicting the binding affinity and the binding mode of a diverse set of protein ligand complexes. J. Med. Chem. 2005, 48 (14), 4558 4575. 54. Cummings, M. D.; DesJarlais, R. L.; Gibbs, A. C.; Mohan, V.; Jaeger, E. P. Comparison of Automated Docking Programs as Virtual Screening Tools. J. Med. Chem. 2005, 4 8 (4), 962 976. 55. Guvench, O.; MacKerell, A. D. Computational evaluation of protein small molecule binding. Curr. Opin. Struc. Biol. 2009, 19 (1), 56 61. 56. Taylor, R. D.; Jewsbury, P. J.; Essex, J. W. A review of protein small molecule docking methods. J. Comput. Aided Mol. Des. 2002, 16 (3), 151 166. 57. Kontoyianni, M.; McClellan, L. M.; Sokol, G. S. Evaluation of Docking Performance: Comparative Data on Docking Algorithms. J. Med. Chem. 2003, 47 (3), 558 565. 58. Warren, G. L.; Andrews, C. W.; Capell i, A. M.; Clarke, B.; LaLonde, J.; Lambert, M. H.; Lindvall, M.; Nevins, N.; Semus, S. F.; Senger, S.; Tedesco, G.; Wall, I. D.; Woolven, J. M.; Peishoff, C. E.; Head, M. S. A critical assessment of docking programs and scoring functions. J. Med. Chem. 200 6, 49 (20), 5912 5931. 59. Charifson, P. S.; Corkery, J. J.; Murcko, M. A.; Walters, W. P. Consensus Scoring: A Method for Obtaining Improved Hit Rates from Docking Databases of Three Dimensional Structures into Proteins. J. Med. Chem. 1999, 42 (25), 5100 5109. 60. Oda, A.; Tsuchida, K.; Takakura, T.; Yamaotsu, N.; Hirono, S. Comparison of consensus scoring strategies for evaluating computational models of protein ligand complexes. J. Chem. Inf. Model. 2006, 46 (1), 380 391. 61. Wang, R. X.; Wang, S. M. How does consensus scoring work for virtual library screening? An idealized computer experiment. J. Chem. Inf. Comp. Sci. 2001, 41 (5), 1422 1426. 62. Kollman, P. A. Free Energy Calculations: Applications to Chemical and Biochemical Phenomena. Chem. Rev. 1993 93 (7), 2395 2417. 63. Huang, N.; Kalyanaraman, C.; Bernacki, K.; Jacobson, M. P. Molecular mechanics methods for predicting protein ligand binding. Phys. Chem. Chem. Phys. 2006, 8 (44), 5166 5177. 64. Essex, J. W.; Severance, D. L.; TiradoRives, J.; Jor gensen, W. L. Monte Carlo simulations for proteins: Binding affinities for trypsin benzamidine complexes via free energy perturbations. J. Phys. Chem. B 1997, 101 (46), 9663 9669. PAGE 133 133 65. Lybrand, T. P.; McCammon, J. A.; Wipff, G. Theoretical calculation of re lative binding affinity in host guest systems. P. Natl. Acad. Sci. USA 1986, 83 (4), 833 835. 66. Wallace, A. C.; Laskowski, R. A.; Thornton, J. M. Ligplot a Program to Generate Schematic Diagrams of Protein Ligand Interactions. Protein Eng. 1995, 8 (2), 127 134. 67. Faver, J. C.; Benson, M. L.; He, X.; Roberts, B. P.; Wang, B.; Marshall, M. S.; Kennedy, M. R.; Sherrill, D. C.; Merz, K. M. Formal Estimation of Errors in Computed Absolute Interaction Energies of Protein ligand Complexes. J. Chem. Theory Co mput. 2011, 7 (3), 790 797. 68. Chen, Z. G.; Li, Y.; Chen, E.; Hall, D. L.; Darke, P. L.; Culberson, C.; Shafer, J. A.; Kuo, L. C. Crystal Structure at 1.9 Angstrom Resolution of Human Immunodeficiency Virus (Hiv) Ii Protease Complexed with L 735,524, an O rally Bioavailable Inhibitor of the Hiv Proteases. J. Biol. Chem. 1994, 269 (42), 26344 26348. 69. Word, J. M.; Lovell, S. C.; Richardson, J. S.; Richardson, D. C. Asparagine and glutamine: Using hydrogen atom contacts in the choice of side chain amide ori entation. J. Mol. Biol. 1999, 285 (4), 1735 1747. 70. Hornak, V.; Abel, R.; Okur, A.; Strockbine, B.; Roitberg, A.; Simmerling, C. Comparison of multiple amber force fields and development of improved protein backbone parameters. Proteins 2006, 65 (3), 712 725. 71. Pettersen, E. F.; Goddard, T. D.; Huang, C. C.; Couch, G. S.; Greenblatt, D. M.; Meng, E. C.; Ferrin, T. E. UCSF chimera A visualization system for exploratory research and analysis. J. Comput. Chem. 2004, 25 (13), 1605 1612. 72. Wang, J. M.; W olf, R. M.; Caldwell, J. W.; Kollman, P. A.; Case, D. A. Development and testing of a general amber force field. J. Comput. Chem. 2004, 25 (9), 1157 1174. 73. Ponder, J. W.; Wu, C. J.; Ren, P. Y.; Pande, V. S.; Chodera, J. D.; Schnieders, M. J.; Haque, I.; Mobley, D. L.; Lambrecht, D. S.; DiStasio, R. A.; Head Gordon, M.; Clark, G. N. I.; Johnson, M. E.; Head Gordon, T. Current Status of the AMOEBA Polarizable Force Field. J. Phys. Chem. B 2010, 114 (8), 2549 2564. 74. Halgren, T. A.; Murphy, R. B.; Jorgens en, W. L.; Friesner, R. A. Extending the OPLS AA force field for ligand functionality. Abstr. Pap. Am. Chem. S. 2000, 220 U277 U277. 75. Grimme, S. Semiempirical GGA type density functional constructed with a long range dispersion correction. J. Comput. C hem. 2006, 27 (15), 1787 1799. PAGE 134 134 76. Ditchfie.R; Hehre, W. J.; Pople, J. A. Self Consistent Molecular Orbital Methods .9. Extended Gaussian Type Basis for Molecular Orbital Studies of Organic Molecules. J. Chem. Phys. 1971, 54 (2), 724 728. 77. Kendall, R. A .; Dunning, T. H.; Harrison, R. J. Electron Affinities of the 1st Row Atoms Revisited Systematic Basis Sets and Wave Functions. J. Chem. Phys. 1992, 96 (9), 6796 6806. 78. Sherrill, C. D.; Takatani, T.; Hohenstein, E. G. An Assessment of Theoretical Meth ods for Nonbonded Interactions: Comparison to Complete Basis Set Limit Coupled Cluster Potential Energy Curves for the Benzene Dimer, the Methane Dimer, Benzene Methane, and Benzene H2S. J. Phys. Chem. A 2009, 113 (38), 10146 10159. 79. Elsohly, A. M.; Tsc humper, G. S. Comparison of Polarization Consistent and Correlation Consistent Basis Sets for Noncovalent Interactions. Int. J. Quantum Chem. 2009, 109 (1), 91 96. 80. Frisch, M. J.; Trucks, G. W.; Schlegel, H. B.; Scuseria, G. E.; Robb, M. A.; Cheeseman, J. R.; Montgomery, J., J. A.; Vreven, T.; Kudin, K. N.; Burant, J. C.; Millam, J. M.; Iyengar, S. S.; Tomasi, J.; Barone, V.; Mennucci, B.; Cossi, M.; Scalmani, G.; Rega, N.; Petersson, G. A.; Nakatsuji, H.; Hada, M.; Ehara, M.; Toyota, K.; Fukuda, R.; Has egawa, J.; Ishida, M.; Nakajima, T.; Honda, Y.; Kitao, O.; Nakai, H.; Klene, M.; Li, X.; Knox, J. E.; Hratchian, H. P.; Cross, J. B.; Bakken, V.; Adamo, C.; Jaramillo, J.; Gomperts, R.; Stratmann, R. E.; Yazyev, O.; Austin, A. J.; Cammi, R.; Pomelli, C.; O chterski, J. W.; Ayala, P. Y.; Morokuma, K.; Voth, G. A.; Salvador, P.; Dannenberg, J. J.; Zakrzewski, V. G.; Dapprich, S.; Daniels, A. D.; Strain, M. C.; Farkas, O.; Malick, D. K.; Rabuck, A. D.; Raghavachari, K.; Foresman, J. B.; Ortiz, J. V.; Cui, Q.; B aboul, A. G.; Clifford, S.; Cioslowski, J.; Stefanov, B. B.; Liu, G.; Liashenko, A.; Piskorz, P.; Komaromi, I.; Martin, R. L.; Fox, D. J.; Keith, T.; Al Laham, M. A.; Peng, C. Y.; Nanayakkara, A.; Challacombe, M.; Gill, P. M. W.; Johnson, B.; Chen, W.; Won g, M. W.; Gonzalez, C.; and Pople, J. A. Gaussian 03, Revision E.01 Revision E.01; Gaussian, Inc.: Wallingford, CT, 2004. PAGE 135 135 81. Frisch, M. J.; Trucks, G. W.; Schlegel, H. B.; Scuseria, G. E.; Robb, M. A.; Cheeseman, J. R.; Scalmani, G.; Barone, V. M., B.; P etersson, G. A.; Nakatsuji, H.; Caricato, M.; Li, X.; Hratchian, H. P.; Izmaylov, A. F.; Bloino, J.; Zheng, G.; Sonnenberg, J. L.; Hada, M.; Ehara, M.; Toyota, K.; Fukuda, R.; Hasegawa, J.; Ishida, M.; Nakajima, T.; Honda, Y.; Kitao, O.; Nakai, H.; Vreven, T.; Montgomery, Jr., J. A.; Peralta, J. E.; Ogliaro, F.; Bearpark, M.; Heyd, J. J.; Brothers, E.; Kudin, K. N.; Staroverov, V. N.; Kobayashi, R.; Normand, J.; Raghavachari, K.; Rendell, A.; Burant, J. C.; Iyengar, S. S.; Tomasi, J.; Cossi, M.; Rega, N.; M illam, N. J.; Klene, M.; Knox, J. E.; Cross, J. B.; Bakken, V.; Adamo, C.; Jaramillo, J.; Gomperts, R.; Stratmann, R. E.; Yazyev, O.; Austin, A. J.; Cammi, R.; Pomelli, C.; Ochterski, J. W.; Martin, R. L.; Morokuma, K.; Zakrzewski, V. G.; Voth, G. A.; Salv ador, P.; Dannenberg, J. J.; Dapprich, S.; Daniels, A. D.; Farkas, .; Foresman, J. B.; Ortiz, J. V.; Cioslowski, J.; Fox, D. J. Gaussian 09 Gaussian, Inc.: Wallingford, CT, 2009. 82. Dixon, S. L.; van der Vaart, A.; Gogonea, V.; Vincent, J. J.; Brothers, E. N.; Surez, D.; Westerhoff, L. M.; Merz, K. M., Jr. DIVCON99 The Pennsylvania State University: 1999. 83. Case, D. A.; Darden, T. A.; Cheatham, T. E.; Simmerling, C. L.; Wang, J.; Duke, R. E.; Luo, R.; Crowley, M.; Ross C. Walker, W. Z., K.M. Merz, B. Wang, S. Hayik, A. Roitberg, G. Seabra, I. Kolossvry, K.F.Wong, F. Paesani, J. Vanicek, X.Wu, S.R. Brozell, T. Steinbrecher, H. Gohlke, L. Yang, C. Tan, J. Mongan, V. Hornak, G. Cui, D.H. Mathews, M.G. Seetin, C. Sagui, V. Babin, and P.A. Kollman AMBER 10 University of California, San Francisco San Francisco, CA, 2008. 84. Werner, H. J.; Knowles, P. J.; Lindh, R.; Manby, F. R.; Schtz, M.; P. Celani; G. Knizia; T. Korona; R. Lindh; A. Mitrushenkov; G. Rauhut; T. B. Adler; R. D. Amos; A. Bernhardsson; A. B erning; D. L. Cooper; M. J. O. Deegan; A. J. Dobbyn; F. Eckert; E. Goll; C. Hampel; A. Hesselmann; G. Hetzer; T. Hrenar; G. Jansen; C. Kppl; Y. Liu; A. W. Lloyd; R. A. Mata; A. J. May; S. J. McNicholas; W. Meyer; M. E. Mura; A. Nicklass; P. Palmieri; K. P flger; R. Pitzer; M. Reiher; T. Shiozaki; H. Stoll; A. J. Stone; R. Tarroni; T. Thorsteinsson; M. Wang; A. Wolf MOLPRO 2009. 85. Kendall, R. A.; Apra, E.; Bernholdt, D. E.; Bylaska, E. J.; Dupuis, M.; Fann, G. I.; Harrison, R. J.; Ju, J. L.; Nichols, J. A.; Nieplocha, J.; Straatsma, T. P.; Windus, T. L.; Wong, A. T. High performance computational chemistry: An overview of NWChem a distributed parallel application. Comput. Phys. Comm. 2000, 128 (1 2), 260 283. 86. Macromodel version 9.8 Schrodinger, LLC: New York, NY, 2010. 87. Riley, K. E.; Hobza, P. Assessment of the MP2 method, along with several basis sets, for the computation of interaction energies of biologically relevant hydrogen bonded and dispersion bound complexes. J. Phys. Chem. A 2007, 111 (33 ), 8257 8263. PAGE 136 136 88. Fennell, C.; Dill, K. Physical Modeling of Aqueous Solvation. J. Stat. Phys. 2011, 145 (2), 209 226. 89. Hu, C. Y.; Kokubo, H.; Lynch, G. C.; Bolen, D. W.; Pettitt, B. M. Backbone additivity in the transfer model of protein solvation. Pro tein. Sci. 2010, 19 (5), 1011 1022. 90. Dill, K. A. Additivity principles in biochemistry. J. Biol. Chem. 1997, 272 (2), 701 704. 91. Anfinsen, C. B. Influences of 3 Dimensional Configuration on Chemical Reactivity and Stability of Proteins. J. Polym. Sci. 1961, 49 (151), 31 49. 92. Dill, K. A.; Ozkan, S. B.; Shell, M. S.; Weikl, T. R. The protein folding problem. Ann. Rev. Biophys. 2008, 37 289 316. 93. Anfinsen, C. B. Principles That Govern Folding of Protein Chains. Science 1973, 181 (4096), 223 230. 94 Sohl, J. L.; Jaswal, S. S.; Agard, D. A. Unfolded conformations of alpha lytic protease are more stable than its native state. Nature 1998, 395 (6704), 817 819. 95. Baker, D.; Agard, D. A. Kinetics Versus Thermodynamics in Protein Folding. Biochemistry U s 1994, 33 (24), 7505 7509. 96. Sindhikara, D. J. In A sampling of molecular dynamics [electronic resource] / by Daniel Jon Sindhikara; University of Florida: Gainesville, Fla, 2009. 97. Levinthal, C. Are There Pathways for Protein Folding. J. Chim. Phys. Pcb. 1968, 65 (1), 44 45. 98. Leopold, P. E.; Montal, M.; Onuchic, J. N. Protein Folding Funnels a Kinetic Approach to the Sequence Structure Relationship. P. Natl. Acad. Sci. USA 1992, 89 (18), 8721 8725. 99. Dill, K. A.; Chan, H. S. From Levinthal to p athways to funnels. Nat. Struct. Biol. 1997, 4 (1), 10 19. 100. Liwo, A.; Khalili, M.; Scheraga, H. A. Ab initio simulations of protein folding pathways by molecular dynamics with the united residue model of polypeptide chains. P. Natl. Acad. Sci. USA 2005 102 (7), 2362 2367. 101. Snow, C. D.; Nguyen, N.; Pande, V. S.; Gruebele, M. Absolute comparison of simulated and experimental protein folding dynamics. Nature 2002, 420 (6911), 102 106. PAGE 137 137 102. Snow, M. E. Powerful Simulated Annealing Algorithm Locates Glo bal Minimum of Proten Folding Potentials from Multiple Starting Conformations. J. Comput. Chem. 1992, 13 (5), 579 584. 103. Duan, Y.; Kollman, P. A. Pathways to a protein folding intermediate observed in a 1 microsecond simulation in aqueous solution. Scie nce 1998, 282 (5389), 740 744. 104. Duan, Y.; Wang, L.; Kollman, P. A. The early stage of folding of villin headpiece subdomain observed in a 200 nanosecond fully solvated molecular dynamics simulation. P. Natl. Acad. Sci. USA 1998, 95 (17), 9897 9902. 105 Zagrovic, B.; Snow, C. D.; Shirts, M. R.; Pande, V. S. Simulation of folding of a small alpha helical protein in atomistic detail using worldwide distributed computing. J. Mol. Biol. 2002, 323 (5), 927 937. 106. Li, Z. Q.; Scheraga, H. A. Monte Carlo Min imization Approach to the Multiple Minima Problem in Protein Folding. P. Natl. Acad. Sci. USA 1987, 84 (19), 6611 6615. 107. Kryshtafovych, A.; Fidelis, K.; Moult, J. CASP8 results in context of previous experiments. Proteins 2009, 77 217 228. 108. Dill, K. A.; Ozkan, S. B.; Weikl, T. R.; Chodera, J. D.; Voelz, V. A. The protein folding problem: when will it be solved? Curr. Opin. Struc. Biol. 2007, 17 (3), 342 346. 109. Kim, D. E.; Blum, B.; Bradley, P.; Baker, D. Sampling Bottlenecks in De novo Protein S tructure Prediction. J. Mol. Biol. 2009, 393 (1), 249 260. 110. Ghosh, K.; Ozkan, S. B.; Dill, K. A. The ultimate speed limit to protein folding is conformational searching. J. Am. Chem. Soc. 2007, 129 (39), 11920 11927. 111. Levitt, M.; Lifson, S. Refinem ent of Protein Conformations Using a Macromolecular Energy Minimization Procedure. J. Mol. Biol. 1969, 46 (2), 269 279. 112. Levitt, M. The birth of computational structural biology. Nat. Struct. Biol. 2001, 8 (5), 392 393. 113. Kryshtafovych, A.; Krysko, O.; Daniluk, P.; Dmytriv, Z.; Fidelis, K. Protein structure prediction center in CASP8. Proteins 2009, 77 5 9. 114. Gilson, M. K. Sensitivity analysis and charge optimization for flexible ligands: Applicability to lead optimization. J. Chem. Theory Comput 2006, 2 (2), 259 270. 115. Ho, T.; Rabitz, H. Reconstruction of Intermolecular Potentials at Fixed Energy Functional Sensitivity Analysis Approach. J. Chem. Phys. 1988, 89 (9), 5614 5623. PAGE 138 138 116. Susnow, R.; Nachbar, R. B.; Schutt, C.; Rabitz, H. Sensitiv ity of Molecular Structure to Intramolecular Potentials. J. Phys. Chem. 1991, 95 (22), 8585 8597. 117. Wong, C. F. Systematic Sensitivity Analyses in Free Energy Perturbation Calculations. J. Am. Chem. Soc. 1991, 113 (8), 3208 3209. 118. Freddolino, P. L.; Liu, F.; Gruebele, M.; Schulten, K. Ten microsecond molecular dynamics simulation of a fast folding WW domain. Biophys. J. 2008, 94 (10), L75 L77. 119. Freddolino, P. L.; Park, S.; Roux, B.; Schulten, K. Force Field Bias in Protein Folding Simulations. Bi ophys. J. 2009, 96 (9), 3772 3780. 120. Lindorff Larsen, K.; Piana, S.; Palmo, K.; Maragakis, P.; Klepeis, J. L.; Dror, R. O.; Shaw, D. E. Improved side chain torsion potentials for the Amber ff99SB protein force field. Proteins 2010, 78 (8), 1950 1958. 12 1. Vijaykumar, S.; Bugg, C. E.; Cook, W. J. Structure of Ubiquitin Refined at 1.8 a Resolution. J. Mol. Biol. 1987, 194 (3), 531 544. 122. Faver, J. C.; Benson, M. L.; He, X.; Roberts, B. P.; Wang, B.; Marshall, M. S.; Sherrill, D. C.; Merz, K. M. The Ener gy Computation Paradox and ab initio Protein Folding. PloS ONE 2011, 6 (4), e18868. 123. Duan, Y.; Wu, C.; Chowdhury, S.; Lee, M. C.; Xiong, G. M.; Zhang, W.; Yang, R.; Cieplak, P.; Luo, R.; Lee, T.; Caldwell, J.; Wang, J. M.; Kollman, P. A point charge fo rce field for molecular mechanics simulations of proteins based on condensed phase quantum mechanical calculations. J. Comput. Chem. 2003, 24 (16), 1999 2012. 124. Molnar, L. F.; He, X.; Wang, B.; Merz, K. M. Further analysis and comparative study of inter molecular interactions using dimers from the S22 database. J. Chem. Phys. 2009, 131 (6), 065102. 125. Stewart, J. J. P. MOPAC2009 Stewart Computational Chemistry: Colorado Springs, CO, 2008. 126. Tsai, J.; Bonneau, R.; Morozov, A. V.; Kuhlman, B.; Rohl, C A.; Baker, D. An improved protein decoy set for testing energy functions for protein structure prediction. Proteins 2003, 53 (1), 76 87. 127. Still, W. C.; Tempczyk, A.; Hawley, R. C.; Hendrickson, T. Semianalytical Treatment of Solvation for Molecular M echanics and Dynamics. J. Am. Chem. Soc. 1990, 112 (16), 6127 6129. 128. Klamt, A.; Schuurmann, G. Cosmo a New Approach to Dielectric Screening in Solvents with Explicit Expressions for the Screening Energy and Its Gradient. J. Chem. Soc. 1993, (5), 799 805. PAGE 139 139 129. Cooper, S.; Khatib, F.; Treuille, A.; Barbero, J.; Lee, J.; Beenen, M.; Leaver Fay, A.; Baker, D.; Popovic, Z.; Players, F. Predicting protein structures with a multiplayer online game. Nature 2010, 466 (7307), 756 760. 130. Mezey, P. G. Macromo lecular density matrices and electron densities with adjustable nuclear geometries. J. Math. Chem. 1995, 18 (2 4), 141 168. 131. Asturiol, D.; Duran, M.; Salvador, P. Intramolecular basis set superposition error effects on the planarity of benzene and othe r aromatic molecules: A solution to the problem. J. Chem. Phys. 2008, 128 (14), 144108. 132. Moran, D.; Simmonett, A. C.; Leach, F. E.; Allen, W. D.; Schleyer, P. V.; Schaefer, H. F. Popular theoretical methods predict benzene and arenes to be nonplanar. J Am. Chem. Soc. 2006, 128 (29), 9342 9343. 133. Balabin, R. M. Enthalpy difference between conformations of normal alkanes: Effects of basis set and chain length on intramolecular basis set superposition error. Mol. Phys. 2011, 109 (6), 943 953. 134. Bala bin, R. M. Communications: Is quantum chemical treatment of biopolymers accurate? Intramolecular basis set superposition error (BSSE). J. Chem. Phys. 2010, 132 (23), 231101. 135. Balabin, R. M. Enthalpy Difference between Conformations of Normal Alkanes: R aman Spectroscopy Study of n Pentane and n Butane. J. Phys. Chem. A 2009, 113 (6), 1012 1019. 136. Balabin, R. M. Enthalpy difference between conformations of normal alkanes: Intramolecular basis set superposition error (BSSE) in the case of n butane and n hexane. J. Chem. Phys. 2008, 129 (16), 164101. 137. Jensen, F. An Atomic Counterpoise Method for Estimating Inter and Intramolecular Basis Set Superposition Errors. J. Chem. Theory Comput. 2010, 6 (1), 100 106. 138. Faver, J. C.; Zheng, Z.; Merz, K. M. S tatistics based model for basis set superposition error correction in large biomolecules. Phys. Chem. Chem. Phys. 2012 Advance Article. DOI: 10.1039/C2CP23715F 139. Faver, J. C.; Zheng, Z.; Merz, K. M. Model for the Fast Estimation of Basis Set Superposit ion Error in Biomolecular Systems. J. Chem. Phys. 2011, 135 144110. 140. He, X.; Fusti Molnar, L.; Cui, G. L.; Merz, K. M. Importance of Dispersion and Electron Correlation in ab Initio Protein Folding. J. Phys. Chem. B 2009, 113 (15), 5290 5300. PAGE 140 140 141. Fav er, J. C.; Yang, W.; Merz, K. M. The Effects of Computational Modeling Errors on the Estimation of Statistical Mechanical Variables. J. Chem. Theory Comput. 2012 ASAP Article. DOI: 10.1021/ct300024z 142. McQuarrie, D. A. In Statistical mechanics; Universi ty Science Books: Sausalito, Calif., 2000. 143. Winzer, P. J. Accuracy of error propagation exemplified with ratios of random variables. Rev. Sci. Instrum. 2000, 71 (3), 1447 1454. 144. Bowman, G. R.; Beauchamp, K. A.; Boxer, G.; Pande, V. S. Progress and challenges in the automated construction of Markov state models for full protein systems. J. Chem. Phys. 2009, 131 (12), 124101 PAGE 141 141 BIOGRAPHICAL SKETCH John Cleveland Faver was born in Little Rock, Arkansas to Perry and Kathy Faver in 1985. He has a young er brother, Benjamin Faver. He grew up in Benton, Arkansas and graduated from Benton High School. He then attended the University of Arkansas in Fayetteville where he became interested in drug design and conducted research in organic synthesis in the lab of Professor Matthias McIntosh He also spent a summer in the experimental pharmacology lab of Professor Michael Owens at the University of Arkansas for Medical Sciences While at the University of Arkansas he was i ntroduced to the field of computational c hemistry by Professor Peter Pulay. He graduated with a B.S. in Chemistry in 2003 then moved to Gainesville Florida to enter the Ph. D. program in physical chemistry at the University of Florida U nder the advisement of Professor Kenneth Merz in the Quantu m Theory Project he worked on various projects related to the computational modeling of biomolecules. After his time at the University of researcher. In addition to his scienti fic pursuits, John is an amateur musician under the name thnkftsfnnl. 