CONSTANT pH REPLICA EXCHANGE MOLECULAR DYNAMICS STUDY OF PROTEIN STRUCTURE AND DYNAMICS

By YILIN MENG

A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA 2010

© 2010 Yilin Meng

To my family

ACKNOWLEDGMENTS

At the completion of my graduate study at the University of Florida, I take great pleasure in acknowledging the people who have supported me over these years. First and foremost, I thank my advisor, Professor Adrian E. Roitberg. Throughout the years working in his group, I have learned a tremendous amount from him. His guidance and encouragement helped me overcome obstacles not only in research but also in my personal life. I could not have achieved my goal without his support and help.

I am thankful for the support and guidance of my committee members, Professors Kenneth M. Merz Jr., Nicolas C. Polfer, Stephen J. Hagen, and Arthur S. Edison. I would also like to thank Professors So Hirata, Joanna R. Long, Carlos L. Simmerling, and Wei Yang for their guidance in my research.

I am very grateful for the assistance and helpful discussions from my colleagues in the Roitberg group, especially Dr. Daniel Sindhikara, Dr. Gustavo Seabra, Dr. Lena Dolghih, Dr. Seonah Kim, Jason Swails, Danial Dashti, Billy Miller, Dwight McGee, and Sung Cho. I appreciate all my friends at the Quantum Theory Project and the Departments of Chemistry and Physics.

I thank the sources of funding that supported my graduate study. My research was supported by the National Institutes of Health under Contract 1R01 AI073674. Computer resources and support were provided by the Large Allocations Resource Committee through grant TG-MCA05S010 and by the University of Florida High-Performance Computing Center.

I want to acknowledge my wife, Xian, who encouraged and supported me in completing this work.
Finally, I am very grateful for my whole family for their love and encouragement.

TABLE OF CONTENTS

ACKNOWLEDGMENTS
LIST OF TABLES
LIST OF FIGURES
LIST OF ABBREVIATIONS
ABSTRACT

CHAPTER

1 INTRODUCTION
  1.1 Acid-Base Equilibrium
  1.2 Amino Acids and Proteins
  1.3 Ionizable Residues in Proteins and the Effect of pH on Proteins
  1.4 Measuring pKa Values of Ionizable Residues
  1.5 Molecular Modeling
  1.6 Potential Energy Surface
  1.7 Molecular Dynamics, Monte Carlo Methods and Ergodicity
  1.8 Theoretical Protein Titration Curves and pKa Calculations Using Poisson-Boltzmann Equation
  1.9 Computing pKa Values by Free Energy Calculations
  1.10 pKa Prediction Using Empirical Methods
  1.11 Constant-pH Molecular Dynamics (Constant-pH MD) Methods

2 THEORY AND METHODS IN MOLECULAR MODELING
  2.1 Potential Energy Functions and Classical Force Fields
    2.1.1 Potential Energy Surface
    2.1.2 Force Field Models
    2.1.3 Protein Force Field Models
  2.2 Molecular Dynamics (MD) Method
    2.2.1 MD Integrator
    2.2.2 Thermostats in MD Simulations
    2.2.3 Pressure Control in MD Simulations
  2.3 Monte Carlo (MC) Method
    2.3.1 Canonical Ensemble and Configuration Integral
    2.3.2 Markov Chain Monte Carlo (MCMC)
    2.3.3 The Metropolis Monte Carlo Method
    2.3.4 Ergodicity and the Ergodic Hypothesis
  2.4 Solvent Models
    2.4.1 Explicit Solvent Model
    2.4.2 The Poisson-Boltzmann (PB) Implicit Solvent Model
    2.4.3 The Generalized Born (GB) Implicit Solvent Model
  2.5 pKa Calculation Methods
    2.5.1 The Continuum Electrostatic (CE) Model
    2.5.2 Free Energy Calculation Methods
    2.5.3 Constant-pH MD Methods
  2.6 Advanced Sampling Methods
    2.6.1 The Multicanonical Algorithm (MUCA)
    2.6.2 Parallel Tempering
  2.7 Replica Exchange Molecular Dynamics (REMD) Methods
    2.7.1 Temperature REMD (TREMD)
    2.7.2 Hamiltonian REMD (HREMD)
    2.7.3 Technical Details in REMD Simulations

3 CONSTANT-pH REMD: METHOD AND IMPLEMENTATION
  3.1 Introduction
  3.2 Theory and Methods
    3.2.1 Constant-pH REMD Algorithm in AMBER Simulation Suite
    3.2.2 Simulation Details
    3.2.3 Global Conformational Sampling Comparison Using Cluster Analysis
    3.2.4 Local Conformational Sampling and Convergence to Final State
  3.3 Results and Discussion
    3.3.1 Reference Compounds
    3.3.2 Model peptide ADFDA
    3.3.3 Heptapeptide derived from OMTKY3
  3.4 Conclusions

4 CONSTANT-pH REMD: STRUCTURE AND DYNAMICS OF THE C-PEPTIDE OF RIBONUCLEASE A
  4.1 Introduction
  4.2 Methods
    4.2.1 Simulation Details
    4.2.2 Cluster Analysis
    4.2.3 Definition of the Secondary Structure of Proteins (DSSP) Analysis
    4.2.4 Computation of the Mean Residue Ellipticity
  4.3 Results and Discussion
    4.3.1 Testing Structural Convergence
    4.3.2 pKa Calculation and Convergence
    4.3.3 The Mean Residue Ellipticity of the C-peptide
    4.3.4 Helical Structures in the C-peptide
    4.3.5 The Two-Dimensional Probability Densities
    4.3.6 Important Electrostatic Interactions: Lys1-Glu9 and Glu2-Arg10
    4.3.7 Important Electrostatic Interactions: Phe8-His12
    4.3.8 Cluster Analysis Results
  4.4 Conclusions

5 CONSTANT-pH REMD: pKa CALCULATIONS OF HEN EGG WHITE LYSOZYME
  5.1 Introduction
  5.2 Simulation Details
  5.3 Protein Conformational and Protonation State Equilibrium Model
  5.4 NMR Chemical Shift Calculations
  5.5 Results and Discussions
    5.5.1 Structural Stability and pKa Convergence
    5.5.2 pKa Predictions
    5.5.3 Constant-pH REMD Simulations with a Weaker Restraint
    5.5.4 Active Site Ionizable Residue pKa Prediction: Asp52
    5.5.5 Active Site Ionizable Residue pKa Prediction: Glu35
    5.5.6 Correlation between Conformation and Protonation
    5.5.7 Conformation-Protonation Equilibrium Model
    5.5.8 Theoretical NMR Titration Curves
  5.6 Conclusions

LIST OF REFERENCES
BIOGRAPHICAL SKETCH

LIST OF TABLES

Table
1-1 Intrinsic pKa values of ionizable residues in proteins.²⁶
3-1 The REMD pKa predictions of reference compounds
3-2 pKa predictions and Hill coefficients fitted from the Hill's Plot
3-3 Correlation coefficients between MD and REMD cluster populations
4-1 Correlation coefficients between two sets of cluster populations
5-1 Simulation details of constant-pH REMD runs
5-2 Predicted pKa values and their RMS errors relative to experimental measurements from the restrained REMD simulations
5-3 Predicted pKa values and their RMS errors relative to experimental measurements from weakly restrained REMD simulations
5-4 Distance between Glu35 carboxylic oxygen atoms and neighboring residue side-chain atoms in 1AKI crystal structure

LIST OF FIGURES

Figure
1-1 A) Structure of an amino acid named alanine.
An amino group (NH2), a carboxylic acid group (COOH), a side chain (R, in this case, a methyl group) and a hydrogen atom are bonded to a central carbon atom (Cα). B) Dihedral angles φ and ψ of alanine dipeptide.
1-2 A Ramachandran plot (a contour plot showing the probability density of (φ, ψ) pairs) of tyrosine generated from the simulation of a heptapeptide which will be described later in Chapter 3. In this figure, a left-handed α-helix is also shown.
1-3 A diagram showing the cartoon representation of an enzyme at low pH (acidic) and at around the optimal pH value. EH indicates the structure at low pH and E stands for the zwitterion form, which is the active species in our model.¹³
1-4 The reaction schemes showing the enzyme reactions at pH values smaller than the optimal pH value. KS, KS′, K1, and K2 are equilibrium constants of the corresponding reactions and kcat is the rate constant of the rate-determining step. This model can be used to explain how pH affects enzyme catalysis in the pH range that is larger than the optimal pH.¹³,¹⁴
1-5 A) An example of a titration curve. B) An example of a Hill's plot on the basis of the titration described in Figure 1-5A. The two plots are generated from constant-pH MD simulations of an aspartic acid in a pentapeptide.
1-6 13C NMR titration curves of aspartate residues in the HIV-1 protease/KNI-272 complex taken from Wang et al., 1996.²⁷ In this figure, Asp Cγ chemical shifts are plotted as a function of pD. Asp25 and Asp125 do not change protonation states in this pD range. But isotope shift experiments show that Asp25 is protonated and Asp125 is deprotonated in this pD range. "Reprinted with permission from Wang, Y. X.; Freedberg, D. I.; Yamazaki, T.; Wingfield, P. T.; Stahl, S.
J.; Kaufman, J. D.; Kiso, Y.; Torchia, D. A. Biochemistry 1996, 35, 9945-9950. Copyright 1996 American Chemical Society."
1-7 Thermodynamic cycle used to compute pKa shift. Both acid dissociation reactions occur in aqueous solution. A thermodynamic cycle is a series of thermodynamic processes that eventually returns to the initial state. A state function, such as the reaction free energy in this case, is path-independent and hence unchanged through a cyclic process.
1-8 Thermodynamic cycle utilized to calculate the difference between ΔG1 and ΔG2. In Figure 1-7 and Figure 1-8, protein-AH represents the ionizable residue in the protein environment. AH represents the reference compound, which is usually the ionizable residue with two termini capped. In practice, a proton does not disappear but instead becomes a dummy atom. The proton has its position and velocity. The bonded interactions involving the proton are still effective. However, there is no nonbonded interaction for that proton. The change in protonation state is reflected by changes of partial charges in the ionizable residue.
2-1 A diagram showing bond-stretching coupled with angle-bending. A cross term calculating the coupling energy is adopted when evaluating the total potential energy.
2-2 A diagrammatic description of the TIP3P and TIP4P water models. A) TIP3P model. The red circle is the oxygen atom and the black circles are the hydrogen atoms. Experimental bond length and bond angle are adopted. B) TIP4P model. Oxygen and hydrogen atoms are labeled with the same colors as in the TIP3P model. The TIP4P model also employs the experimental O-H bond length and H-O-H bond angle. A fourth site (green circle), which carries a negative partial charge, has been added to the TIP4P model.
3-1 Methods to perform exchange attempts. A) Only molecular structures are attempted to exchange. The protonation states are kept the same. B) Both molecular structures and protonation states are attempted to exchange.
3-2 Titration curves of blocked aspartate amino acid from 100 ns MD at 300 K and REMD runs. Agreement can be seen between MD and REMD simulations.
3-3 Cumulative average protonation fraction of aspartic acid reference compound vs Monte Carlo (MC) steps at pH = 4.
3-4 The titration curves of the model peptide ADFDA at 300 K from both MD and REMD simulations. MD simulation time was 100 ns and 10 ns were chosen for each replica for REMD runs.
3-5 Cumulative average protonation fraction of Asp2 in model peptide ADFDA vs Monte Carlo (MC) steps at pH = 4.
3-6 Backbone dihedral angle (φ, ψ) normalized probability density (Ramachandran plots) for Asp2 at pH 4 in ADFDA. Ramachandran plots at other solution pH values are similar. For Asp2, constant-pH MD and REMD sampled the same local backbone conformational space. Phe3 and Asp4 Ramachandran plots also display the same trend.
3-7 Cluster populations of ADFDA at 300 K. A) MD vs REMD at pH 4. Trajectories from MD and REMD simulations are combined first. By clustering the combined trajectory, the MD and REMD structural ensembles will populate the same clusters. The fraction of the conformational ensemble corresponding to each cluster (fractional population of each cluster) was calculated for the MD and REMD simulations, respectively. Two sets of fractional populations of clusters were generated and plotted against each other. B) Two REMD runs from different starting structures at pH 4. The large correlation shown in Figure 3-7B suggests that the REMD runs are converged. Large correlations between two independent REMD runs are also observed at other solution pH values. Correlations between MD and REMD simulations can be found in Table 3-3.
3-8 A) Titration curves of Asp3 in the heptapeptide derived from protein OMTKY3. B) Titration curves of Lys5 and Tyr7 in the heptapeptide derived from protein OMTKY3. C) The Hill's plots of Asp3. The pKa values of Asp3 are found through Hill's plots.
3-9 A) Cumulative average protonation fraction of Asp3 of the heptapeptide derived from OMTKY3 vs MC steps. B) and C) Cumulative average protonation fraction of Tyr7 and Lys5 in the heptapeptide vs MC steps, respectively. Faster convergence is achieved in constant-pH REMD simulations.
3-10 Dihedral angle (φ, ψ) probability densities of Asp3 at pH 4. A) Constant-pH MD results. B) Constant-pH REMD results. The two probability densities are almost identical, indicating that constant-pH MD and REMD sample the same local conformational space. All others also show a very similar trend.
3-11 The root-mean-square deviations (RMSD) between the cumulative (φ, ψ) probability density up to the current time and the (φ, ψ) probability density produced by the entire simulation. (φ, ψ) probability density convergence behaviors at other pH values also show that REMD runs converge to the final distribution faster.
3-12 Cluster population at 300 K from constant-pH MD and REMD simulations at pH = 4. Cluster analysis is performed using the entire simulation. The populations in each cluster from the first and second half of the trajectory are compared and plotted. Ideally, a converged trajectory should yield a correlation coefficient of 1. A) Constant-pH MD. B) Constant-pH REMD. A much higher correlation coefficient can be seen in the constant-pH REMD simulation, suggesting that much better convergence is achieved by the constant-pH REMD run.
4-1 Cluster population at 300 K from constant-pH REMD simulations at pH 2. A) Cluster analysis is performed on the trajectory initiated from the fully extended structure. The populations in each cluster from the first and second half of the trajectory are compared and plotted. B) Two REMD runs from different starting structures at pH 2. Correlation coefficients at other pH values can be found in Table 4-1.
4-2 Cumulative average fraction of protonation vs Monte Carlo (MC) steps. Only the two glutamate residues are shown here and the histidine residue is found to show the same trend. The pH values are selected such that the overall average fraction of protonation is close to 0.5.
4-3 Computed mean residue ellipticity at 222 nm as a function of pH. A bell-shaped curve at 300 K is obtained with a maximum at pH 5. The effect of temperature on the mean residue ellipticity at 222 nm is also demonstrated.
4-4 Helical content as a function of residue number.
4-5 A) Time series of Cα RMSDs vs the fully helical structure at pH 5. The first two residues at each end are not selected because the ends are very flexible. B) Probability densities of the Cα RMSDs. The structural ensemble at pH 5 contains more structures similar to the fully helical structure. C) Time series of Cα radius of gyration at pH 5. D) Probability density of the Cα radius of gyration. More compact structures are found at pH 5.
4-6 A) Probability densities of number of helical residues in the C-peptide.
B) Probability densities of the number of helical segments in the C-peptide. A helical segment contains continuous helical residues. The probability of forming the second helical segment is very low at all three pH values, thus only the first helical segment is further studied. C) Probability densities of the starting position of a helical segment. D) Probability densities of the length of a helical segment (number of residues in a helical segment).
4-7 2D probability density of helical starting position and helical length, pH = 2.
4-8 2D probability density of helical starting position and helical length, pH = 5.
4-9 2D probability density of helical starting position and helical length, pH = 8.
4-10 2D probability density of helical length and Cα RMSD at pH = 2.
4-11 2D probability density of helical length and Cα RMSD at pH = 5.
4-12 2D probability density of helical length and Cα RMSD at pH = 8.
4-13 A) Probability density of Lys1-Glu9 distance (Å). The distance is the minimum distance between the side-chain nitrogen atom of Lys1 and the side-chain carboxylic oxygen atoms of Glu9. B) Probability density of Glu2-Arg10 distance (Å). The distance is the minimum distance between side-chain carboxylic oxygen atoms of Glu2 and guanidinium nitrogen atoms of Arg10.
4-14 Two-dimensional probability density of Lys1-Glu9 and Glu2-Arg10 at pH 5. The Lys1-Glu9 and Glu2-Arg10 salt bridges apparently cannot be formed simultaneously.
4-15 A) Two-dimensional probability density of Glu2-Arg10 salt-bridge formation and helical length at pH 5. According to the plot, the Glu2-Arg10 salt bridge can be found in four-residue, six-residue and non-helical structures. B) Two-dimensional probability density of the Glu2-Arg10 salt bridge and the helix starting position at pH 5. If a helix begins from Thr3, it cannot have a Glu2-Arg10 salt bridge. Thus, one role of the Glu2-Arg10 salt bridge is to prevent helix formation from Thr3.
4-16 A) Probability density of Phe8 backbone to His12 ring distance. The distance is the minimum distance between the Phe8 backbone carbonyl oxygen atom and His12 imidazole nitrogen atoms. B) Probability density of Phe8 ring to His12 ring distance. The distance is the minimum distance between Phe8 aromatic ring carbon atoms and His12 imidazole nitrogen atoms.
4-17 A) Two-dimensional probability density of Glu2-Arg10 distance and Phe8-His12 backbone-to-ring distance at pH 5. B) Correlations between the Glu2-Arg10 salt bridge and the Phe8-His12 contact at pH 5.
4-18 A) Two-dimensional probability density of helical segment length and Phe8-His12 interaction. B) Two-dimensional probability density of helical segment starting position and Phe8-His12 interaction. Phe8-His12 also stabilizes four-residue and six-residue structures. Helices beginning at Lys7 and the Phe8-His12 interaction are coupled. Unlike Glu2-Arg10, Phe8-His12 stabilizes helices starting from Thr3.
4-19 A) Top 20 populated clusters and average helical percentage. B) Probability densities of the Cα RMSD vs the fully helical structure of the top 2 populated clusters. C) Helical percentage as a function of residue number of the top 2 populated clusters. D) Probability density of the Glu2-Arg10 and Phe8 backbone-His12 ring interactions in the second most populated cluster.
5-1 Crystal structure of HEWL (PDB code 1AKI). Residues in red represent aspartate and residues in blue are glutamate.
5-2 A simple schematic view of the conformation-protonation equilibrium in a constant-pH simulation.
5-3 Cα RMSD vs crystal structure (PDB code: 1AKI). A) Cα RMSD vs 1AKI from REMD without restraint on Cα. B) Cα RMSD vs 1AKI from REMD with restraint on Cα. The restraint strength is 1 kcal/mol·Å².
5-4 pKa prediction error as a function of time. The predicted pKa at a given time is a cumulative result. For each ionizable residue, the time series of its pKa error is generated at a pH where the average predicted pKa is closest to that pH value. In this way, we try to eliminate any bias toward the energetically favored state. A flat line is an indication of convergence. Glu35 is not shown here due to poor convergence.
5-5 A) pKa prediction convergence to its final value. Similarly, the pKa value at a given time is a cumulative average. A flat line having a y-value of 0 is expected when pKa calculation convergence is reached. The same pH values are chosen for each ionizable residue as in Figure 5-4. B) Asp52 pKa prediction convergence to its final value at multiple pH values. The pH values are selected in such a way that the pKa calculated at this pH will be used to compute the composite pKa.
5-6 RMS error between predicted and experimental pKa vs pH value. A minimum of the pKa RMS error can be found near the pH at which the 1AKI crystal structure is resolved.
5-7 A) Cα RMSD of HEWL from weaker restraint REMD simulations. The RMSDs are larger than those with stronger restraints. When comparing RMSDs at different pH for simulations using the weaker restraint, RMSDs are greater at pH 3 and 4 than those at pH 4.5.
B) pKa prediction deviation from final value at pH 4.5 from constant-pH REMD with a 0.1 kcal/mol·Å² restraint.
5-8 Asp52 in the crystal structure of 1AKI. Its neighbors that have strong electrostatic interactions with it are also shown.
5-9 A) Time series of Asp52 carboxylic oxygen atom OD1 to Asn59 and Asn44 ND2 distances at pH 3 in the 1 kcal/mol·Å² constant-pH REMD run. B) Time series of Asp52 carboxylic oxygen atom OD2 to Asn59 and Asn44 ND2 distances under the same condition. Hydrogen bonds that stabilize deprotonated Asp52 are formed to a large extent even at a low pH.
5-10 A) Time series of the Glu35 heavy atoms (excluding two carboxylic oxygen atoms) RMSD relative to crystal structure 1AKI. B) Probability distribution of the RMSD. The conformation centered at RMSD ~0.1 Å is labeled as conformation 1. The one centered at ~0.6 Å is named conformation 2. An extra conformation (conformation 3) is visited by the weakly restrained REMD simulation.
5-11 A) Representative structure of conformation 1. B) Representative structure of conformation 2. The structure ensemble is generated from REMD simulations with the stronger restraining potential. The carboxylic group of Glu35 in conformation 2 is clearly pointing toward the amide group of Ala110. The deprotonated form of Glu35 tends to decrease the electrostatic energy. Furthermore, conformation 1 does not particularly favor the protonated Glu35. No significant stabilizing factor is found for the protonated Glu35.
5-12 Representative structure of conformation 3 from cluster analysis. Glu35 is in the hydrophobic region, consisting of Gln57, Trp108 and Ala110. Conformations 1 and 2 in the weakly restrained simulations are basically the same as those demonstrated in Figure 5-11.
5-13 A) Correlation between side-chain dihedral angle χ1 and protonation states. B) Correlation between side-chain dihedral angle χ2 and protonation states.
5-14 Minimal distance between Asp119 side-chain carboxylic oxygen atoms (OD1 and OD2) and Arg125 guanidinium nitrogen atoms. Since the guanidinium group has three nitrogen atoms, the minimal distance is the shortest distance between Asp119 OD1 (or OD2) and those three nitrogen atoms.
5-15 A) Probability distribution of Asp119 CG to Arg125 CZ distances. The Asp119 CG to Arg125 CZ distance is used to distinguish conformations. B) Coupling between conformations and protonation states.
5-16 K12/K12,h as a function of pH and its dependence on pKa,1 and pKa,2.
5-17 A) Fraction of each species as a function of pH (titration curves) obtained from equations based on the conformation-protonation equilibrium. The effect of K12,h is tested. B) Comparison of titration curves derived from actual simulations and from the equilibrium equations.
5-18 Theoretical NMR chemical shifts as a function of pH. It is plotted to see if the conformation-protonation equilibrium model can reproduce the experimental titration curve based on NMR chemical shift measurements.
LIST OF ABBREVIATIONS

ACE       Analytical Continuum Electrostatic
BAR       Bennett Acceptance Ratio
CD        Circular Dichroism
CE        Continuum Electrostatic
CPHMD     Continuous Constant-pH Molecular Dynamics
CPL       Circularly Polarized Light
DOF       Degree of Freedom
DOS       Density of States
DSSP      Definition of the Secondary Structure of Proteins
EAF       Exchange Attempt Frequency
EFP       Effective Fragment Potential
FEP       Free Energy Perturbation
FDPB      Finite Difference Poisson-Boltzmann
GB        Generalized Born
HEWL      Hen Egg White Lysozyme
HH        Henderson-Hasselbalch
HREMD     Hamiltonian Replica Exchange Molecular Dynamics
LCPL      Left Circularly Polarized Light
MC        Monte Carlo
MCMC      Markov Chain Monte Carlo
MCCE      Multiconformation Continuum Electrostatic
MD        Molecular Dynamics
MDFE      Molecular Dynamics based Free Energy (calculation)
MM        Molecular Mechanics
MUCA      Multicanonical
NMR       Nuclear Magnetic Resonance
NPT       Isothermal-isobaric Ensemble
NVE       Microcanonical Ensemble
NVT       Canonical Ensemble
PB        Poisson-Boltzmann
PBC       Periodic Boundary Condition
PES       Potential Energy Surface
PDF       Probability Distribution Function
PMF       Potential of the Mean Force
QM        Quantum Mechanics
QM/MM     Hybrid Quantum Mechanical Molecular Mechanical
RCPL      Right Circularly Polarized Light
REM       Replica Exchange Method
REMD      Replica Exchange Molecular Dynamics
REXCPHMD  Replica Exchange Continuous Constant-pH Molecular Dynamics
RF        Radio-Frequency
RMSD      Root-Mean-Square Deviation
TI        Thermodynamic Integration
TREMD     Temperature Replica Exchange Molecular Dynamics
VREMD     Viscosity Replica Exchange Molecular Dynamics

Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

CONSTANT pH REPLICA EXCHANGE MOLECULAR DYNAMICS STUDY OF PROTEIN STRUCTURE AND DYNAMICS

By Yilin Meng

August 2010

Chair: Adrian E. Roitberg
Major: Chemistry

Solution pH is a very important thermodynamic variable that affects protein structure, function and dynamics.
Enormous effort has been made experimentally and computationally to understand the effect of pH on proteins. One category of computational methods for studying the effect of pH is constant-pH molecular dynamics (constant-pH MD). Constant-pH MD employs dynamic protonation in simulations and correlates protein conformations with protonation states. Therefore, constant-pH MD algorithms are able to predict the pKa value of an ionizable residue as well as to study pH dependence directly.

A replica exchange constant-pH molecular dynamics (constant-pH REMD) method is proposed and implemented to improve coupled protonation and conformational state sampling. By mixing conformational sampling at constant pH (with discrete protonation states) with a temperature ladder, this method avoids conformational trapping. Our method was tested on seven different biological systems. Constant-pH REMD not only predicted pKa values correctly for model peptides but also converged faster than constant-pH MD. Furthermore, constant-pH REMD showed its advantage in the efficiency of conformational sampling. The advantage of utilizing constant-pH REMD is clear.

We have studied the effect of pH on the structure and dynamics of the C-peptide from ribonuclease A by constant-pH REMD. The mean residue ellipticity at 222 nm at each pH value is computed for direct comparison with experimental measurements. The C-peptide conformational ensembles at pH 2, 5, and 8 are studied. The Glu2-Arg10 and Phe8-His12 interactions and their roles in helix formation are also investigated.

The constant-pH REMD method is applied to the study of hen egg white lysozyme (HEWL). pKa values are calculated and compared with experimental values. Factors that could affect pKa prediction, such as the hydrogen bond network and interactions between ionizable residues, are discussed.
Structural features such as the coupling between conformation and protonation states are demonstrated in order to emphasize the importance of accurate sampling of the coupled conformations and protonation states.

CHAPTER 1
INTRODUCTION

1.1 Acid-Base Equilibrium

Acids and bases are common in our daily lives. For example, vinegar is acidic and ammonia is basic. According to the Brønsted-Lowry definition, an acid is a chemical compound that can donate protons and a base is a chemical compound that can accept protons. An acid is converted to its conjugate base by transferring a proton to a base, and a base is converted to its conjugate acid by accepting a proton. For simplicity, the conversion between an acid and its conjugate base can be described by the reaction:

$$\mathrm{HA} \rightleftharpoons \mathrm{H^+} + \mathrm{A^-}$$

where HA is an acid, A- is its conjugate base, and H+ represents a proton (in an aqueous environment, H+ exists as the hydronium ion, H3O+). There exists an equilibrium state between any acid-base conjugate pair. At equilibrium, the concentration of each species is constant. In an acid-base reaction, an acid dissociation constant is used to describe this equilibrium. The acid dissociation constant is defined by Eq. 1-1:

$$K_a = \frac{a_{\mathrm{H^+}}\, a_{\mathrm{A^-}}}{a_{\mathrm{HA}}} \tag{1-1}$$

Here Ka is the acid dissociation constant and $a_{\mathrm{H^+}}$, $a_{\mathrm{A^-}}$ and $a_{\mathrm{HA}}$ represent the activities of the respective species. In Eq. 1-1, the activity of each individual species (taking $a_{\mathrm{HA}}$ as an example) can be expressed as:

$$a_{\mathrm{HA}} = \gamma_{\mathrm{HA}} \frac{[\mathrm{HA}]}{c^{\circ}} \tag{1-2}$$

In Eq. 1-2, $\gamma_{\mathrm{HA}}$ is the activity coefficient of HA, [HA] is the concentration of HA, and $c^{\circ}$ is the standard concentration, which is 1 M. In an ideal solution, the activity coefficients are unity. The concentration of each species is divided by the standard concentration in order to make the acid dissociation constant dimensionless. For simplicity, the acid dissociation constant is expressed using the concentration of each species from now on. The Ka indicates the strength of an acid: the stronger the acid, the larger its Ka.
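As a small worked illustration of Eq. 1-1 in its concentration form (activity coefficients taken as unity), the sketch below solves Ka = x^2 / (C - x) for the hydronium concentration x of a monoprotic weak acid. The function name, the Ka of 1.8e-5 (roughly that of acetic acid) and the 0.10 M total concentration are illustrative assumptions, and water autoionization is neglected.

```python
import math

def hydronium_concentration(ka, c_total):
    """Return x = [H+] = [A-] (in M) for a monoprotic weak acid HA.

    Assumes ideal behavior (concentrations instead of activities) and
    neglects water autoionization, so that Ka = x*x / (c_total - x).
    """
    # Positive root of the quadratic x^2 + Ka*x - Ka*c_total = 0.
    return (-ka + math.sqrt(ka * ka + 4.0 * ka * c_total)) / 2.0

ka = 1.8e-5      # assumed example value, close to acetic acid
c_total = 0.10   # assumed total acid concentration in M

h = hydronium_concentration(ka, c_total)
ka_check = h * h / (c_total - h)   # plug back into the definition of Ka
fraction_dissociated = h / c_total

print(f"[H+] = {h:.3e} M, dissociated fraction = {fraction_dissociated:.3%}")
```

Substituting the solution back into the expression for Ka (the `ka_check` line) is a quick way to confirm that the root satisfies the equilibrium condition.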
The order of magnitude of Ka can span a broad range. Therefore, a logarithmic (base 10) measure of Ka is more frequently adopted:

$$\mathrm{p}K_a = -\log_{10} K_a \tag{1-3}$$

Combining Eq. 1-1 and Eq. 1-3, we can express the pKa value as:

$$\mathrm{p}K_a = \mathrm{pH} - \log_{10}\left(\frac{[\mathrm{A^-}]}{[\mathrm{HA}]}\right) \tag{1-4}$$

Eq. 1-4 is the Henderson-Hasselbalch (HH) equation. It allows one to solve directly for pH values instead of calculating the concentration of hydronium ions first. When [A-] = [HA], the HH equation becomes pKa = pH. Therefore, the pKa value of an acid is numerically equal to the pH value at which the acid and its conjugate base have the same concentration. The acid dissociation constant represents the thermodynamics of an acid dissociation reaction because the pKa value is proportional to the Gibbs free energy of the reaction. For simple compounds such as acetic acid, temperature is the most important factor that affects the pKa value. However, for complex molecules such as proteins and peptides, the effect of the environment is also crucial and will be discussed in this dissertation.

1.2 Amino Acids and Proteins

The goal of this dissertation is to study the acid-base equilibrium in peptide and protein systems and its effect on peptide and protein conformations by the constant-pH REMD method. Thus, an introduction to peptides and proteins, especially their structures, will be helpful. Amino acids have the generic structure shown in Figure 1-1A. Each amino acid consists of an amino group (NH2), a carboxylic acid group (COOH) and a distinctive side chain (R). All three groups are connected to a carbon atom called the alpha carbon (Cα). There are twenty naturally occurring side chains, and they can be divided into groups based on their physical or chemical properties. For example, one way to categorize the twenty side chains is by their acid/base properties in aqueous solution. In this scheme, aspartic acid is an acidic amino acid and lysine is a basic amino acid.
The carboxylic group of one amino acid can react with the amine group of another amino acid. This condensation reaction forms a peptide bond, which links the two amino acids, and yields a water molecule. Proteins are formed through repeated condensation reactions of this kind. A protein is a string of amino acids connected by peptide bonds and folded into a globular structure. A protein often consists of a minimum of 30 to 50 amino acids [1]. Shorter chains of amino acids are often called peptides. Each amino acid in a protein or peptide is called a residue. The peptide bonds form the backbone of a protein.

Figure 1-1. A) Structure of an amino acid named alanine. An amino group (NH2), a carboxylic acid group (COOH), a side chain (R, in this case a methyl group) and a hydrogen atom are bonded to a central carbon atom (Cα). B) Dihedral angles (φ and ψ) of alanine dipeptide.

A protein usually has four levels of structure, called primary structure, secondary structure, tertiary structure and quaternary structure. The primary structure is the sequence of amino acids. The folding of a protein is determined by its primary structure. Next, the secondary structure (e.g. α-helix, β-strand, or loop) is the three-dimensional structure of local segments of a protein. As mentioned earlier, proteins fold into functional structures after they are formed. After folding, protein backbones often adopt certain types of fold or alignment, and the term secondary structure describes these local three-dimensional arrangements. The two most common secondary structures found in proteins are α-helices and β-strands. The local secondary structure of a particular residue in a protein can be described by a Ramachandran plot, which is a two-dimensional histogram (or probability distribution) of the backbone dihedral angle pair (φ,ψ). As demonstrated in Figure 1-1B, backbones can rotate around the N-Cα and Cα-C bonds, forming dihedral angles φ and ψ.
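To make the idea of a Ramachandran plot as a two-dimensional histogram concrete, the following sketch computes a dihedral angle from four atomic positions and bins (φ,ψ) pairs into a normalized histogram. It is only an illustration: the function names, the 30-degree bin width and the sample angle pairs are assumptions, and the sign of the returned angle depends on the atom ordering convention used.

```python
import math

def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def dihedral_deg(p0, p1, p2, p3):
    """Dihedral angle (degrees, in [-180, 180]) defined by four points."""
    b1, b2, b3 = _sub(p1, p0), _sub(p2, p1), _sub(p3, p2)
    n1, n2 = _cross(b1, b2), _cross(b2, b3)   # normals of the two planes
    y = math.sqrt(_dot(b2, b2)) * _dot(b1, n2)
    x = _dot(n1, n2)
    return math.degrees(math.atan2(y, x))

def ramachandran_histogram(pairs, bin_width=30.0):
    """Normalized 2D histogram of (phi, psi) pairs given in degrees."""
    n_bins = int(360.0 / bin_width)
    counts = [[0] * n_bins for _ in range(n_bins)]
    for phi, psi in pairs:
        i = min(int((phi + 180.0) // bin_width), n_bins - 1)
        j = min(int((psi + 180.0) // bin_width), n_bins - 1)
        counts[i][j] += 1
    total = float(len(pairs))
    return [[c / total for c in row] for row in counts]

# Planar test geometries: a cis arrangement (0 degrees) and a trans one (180).
cis = dihedral_deg((1, 0, 0), (0, 0, 0), (0, 1, 0), (1, 1, 0))
trans = dihedral_deg((1, 0, 0), (0, 0, 0), (0, 1, 0), (-1, 1, 0))

# Made-up (phi, psi) samples, for illustration only.
samples = [(-57.0, -47.0), (-60.0, -45.0), (-125.0, 150.0), (-75.0, 145.0)]
hist = ramachandran_histogram(samples)
```

In practice the (φ,ψ) pairs would come from the frames of a simulation trajectory, and a finer bin width would be used.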
Backbone conformations of a residue can be described by specifying (φ,ψ). Three main regions are generally populated in a Ramachandran plot, corresponding to the three main stable conformations of a residue: the right-handed α-helix region near (φ = -57°, ψ = -47°), the β-strand region near (φ = -125°, ψ = 150°) and the polyproline II region near (φ = -75°, ψ = 145°). The most populated region indicates the most stable conformation of a residue. An example of a Ramachandran plot is shown in Figure 1-2. Furthermore, the tertiary structure is the three-dimensional arrangement of all atoms in a protein. Tertiary structures yield information about protein side chains, for example, salt bridges. Finally, the quaternary structure defines the positions of all atoms in a protein containing multiple peptide chains, for example, the hemoglobin tetramer. It is the highest level of protein structure.

[Figure 1-2 image: contour plot with labeled regions β-sheet, PPII, α-helix and left-handed α-helix; both axes run from -150 to 150.]

Figure 1-2. A Ramachandran plot (a contour plot showing the probability density of (φ,ψ) pairs) of tyrosine generated from the simulation of a heptapeptide, which will be described later in Chapter 3. In this figure, a left-handed α-helix is also shown.

Proteins perform vital functions that are important to our lives. Almost all cell activities depend on proteins. For example, hemoglobin transports oxygen molecules from the lungs to cells [1]; many chemical reactions occurring in living organisms are catalyzed by proteins called enzymes; and proteins are also involved in cell signaling. Mutations, aggregation and misfolding of proteins can cause many diseases. For example, many cancers result from mutations in the tumor suppressor p53 [2,3]. Thus, understanding protein structures and functions is important.
1.3 Ionizable Residues in Proteins and the Effect of pH on Proteins

An ionizable residue in a protein is a residue with a side chain that can donate or accept proton(s). There are seven ionizable residues: ASP, GLU, HIS, CYS, TYR, LYS and ARG. Ionizable residues define the acid-base properties of a protein. Consequently, the solution pH value becomes an important thermodynamic variable affecting protein structure, dynamics, folding mechanism, and function [4]. Many biological phenomena such as protein folding/misfolding [5-8], substrate docking [9] and enzyme catalysis are pH-dependent [10-12].

A good example of how the pH value affects proteins is the pH dependence of enzyme kinetics. Most enzymes possess an optimal pH value, at which the reaction rate is largest. Enzyme catalysis is pH-dependent because the active sites of enzymes in general contain important acidic or basic residues. Only one form (acidic or basic) of the ionizable residue is catalytically active; thus the concentration of the catalytically active species affects the kinetics. Consider a simple reaction model (Figure 1-3 and Figure 1-4) to demonstrate how the pH value affects the enzyme reaction rate. In this model, only the zwitterion form is active; no intermediate exists for the enzyme reaction, and the protonation-deprotonation steps are faster than the catalysis steps. Furthermore, the rate-determining step does not depend on the pH value.

[Figure 1-3 image: EH, the enzyme carrying HOOC and NH3+ groups; E, the enzyme carrying -OOC and NH3+ groups.]

Figure 1-3. A diagram showing the cartoon representation of an enzyme at low pH (acidic) and at around the optimal pH value. EH indicates the structure at low pH and E stands for the zwitterion form, which is the active species in our model [13].

[Figure 1-4 scheme: E + S ⇌ ES (Ks), followed by ES → E + P (kcat); EH + S ⇌ EHS (Ks'); EH ⇌ E + H+ (K1); EHS ⇌ ES + H+ (K2).]

Figure 1-4. The reaction schemes showing the enzyme reactions at pH values smaller than the optimal pH value. Ks, Ks', K1 and K2 are equilibrium constants of the corresponding reactions and kcat is the rate constant of the rate-determining step.
This model can be used to explain how the pH value affects enzyme catalysis in the pH range below the optimal pH [13,14]. The equilibrium constants shown in Figure 1-4 are not independent of each other. The relationship among them is given by:

$$K_s K_2 = K_s' K_1 \tag{1-5}$$

According to the above equation, if K1 = K2, then substrate binding will not be affected by the pH value of the solution. Otherwise, the binding is pH-dependent. After applying the steady-state approximation to [ES], the reaction rate can be written as:

$$v = \frac{k_{cat}[\mathrm{E}]_0[\mathrm{S}]}{K_s + [\mathrm{S}]\left(1 + [\mathrm{H^+}]/K_2\right) + K_s[\mathrm{H^+}]/K_1} \tag{1-6}$$

where [E]0 is the initial concentration of the enzyme and [H+] is the concentration of hydronium ions. At low pH, increasing the concentration of hydronium ions (decreasing the pH value) decreases the reaction rate. The same kind of model can also be applied to derive the effect of pH on the reaction rate when the pH is higher than optimal. Likewise, only the zwitterion form is catalytically active. The conclusion is that a pH value that is too high or too low will lower the enzyme-catalyzed reaction rate.

Given the importance of the solution pH, knowing the pKa value of an ionizable residue in a protein is important because it indicates the average protonation state of that residue at a given pH value. However, the pKa value of an ionizable residue is strongly affected by its protein environment [15,16]. Two major factors affect protein pKa values: one is the desolvation effect and the other is electrostatic interaction. Other factors such as hydrogen bonding and structural rearrangement can also affect protein pKa values.
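Returning to the rate law of Eq. 1-6, a short numerical sketch can show the trend described above, namely that the rate falls as the hydronium concentration rises on the acidic side of the optimum. All parameter values below are hypothetical and chosen only for illustration; the function name is likewise an assumption.

```python
def enzyme_rate(h, s, e0, kcat, ks, k1, k2):
    """Reaction rate from Eq. 1-6.

    h    : hydronium concentration [H+] (M)
    s    : substrate concentration [S] (M)
    e0   : total enzyme concentration [E]0 (M)
    kcat : rate constant of the rate-determining step (1/s)
    ks   : dissociation constant of ES (M)
    k1   : acid dissociation constant of EH (M)
    k2   : acid dissociation constant of EHS (M)
    """
    return kcat * e0 * s / (ks + s * (1.0 + h / k2) + ks * h / k1)

# Hypothetical parameters (not taken from any experiment).
params = dict(s=1.0e-3, e0=1.0e-6, kcat=100.0, ks=1.0e-3, k1=1.0e-6, k2=1.0e-6)

ph_values = [7.0, 6.0, 5.0, 4.0]
rates = [enzyme_rate(10.0 ** (-ph), **params) for ph in ph_values]

for ph, v in zip(ph_values, rates):
    print(f"pH {ph:.1f}: v = {v:.3e} M/s")

# With no protons present, the expression reduces to the familiar
# Michaelis-Menten-like form kcat*[E]0*[S] / (Ks + [S]).
v_limit = enzyme_rate(0.0, **params)
```

Because K1 = K2 in this example, substrate binding itself is pH-independent (Eq. 1-5), and the drop in rate comes only from depletion of the active, deprotonated enzyme forms.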
An ionizable side chain in the interior of a protein can have a pKa value different from that of the isolated amino acid in solution because of the dehydration effect [17-19]. For example, Asp26 of thioredoxin, which lies in a deep pocket of the protein, has a pKa value of 7.5 [17], while the pKa value of a water-exposed aspartic acid is 4.0 [20]. The García-Moreno group has employed site-directed mutagenesis to study the effect of desolvation [18,19,21-23]; their work is described later in this chapter. Their research on buried ionizable residues provides a probe of the dielectric constant inside the protein, which is an important parameter for pKa prediction based on the Poisson-Boltzmann equation.

Electrostatic interactions such as salt bridges are also able to affect pKa values. For example, His31 and Asp70 form a salt bridge in T4 lysozyme [24]. The formation of this salt bridge shifts the pKa of Asp70 to 0.5 and changes the pKa of His31 to 9.1. Interestingly, Asp26 in thioredoxin has been shown to form a salt bridge with Lys57 when it is in the deprotonated form [25]. The formation of a salt bridge should reduce the pKa value of Asp26. Therefore, the pKa value of 7.5 is the combined result of the desolvation effect and electrostatic interaction.

Each ionizable residue has its own intrinsic pKa value. The intrinsic pKa value of an ionizable residue is defined as the pKa value measured when the residue is fully solvent exposed and is not interacting with any other groups [20], for example, an aspartate residue with both termini blocked. This kind of dipeptide is often used as the reference (or model) compound in theoretical calculations of protein pKa values. The intrinsic pKa values are reported in Table 1-1:

Table 1-1.
Intrinsic pKa values of ionizable residues in proteins [26].

Residue name    Intrinsic pKa value
ASP             4.0
GLU             4.4
HIS             6.7
CYS             8.0
TYR             9.6
LYS             10.4
ARG             12.0

1.4 Measuring pKa Values of Ionizable Residues

A general way to determine the pKa value of an acid experimentally is through titration. In experiments, the pH values are measured by a pH meter as a function of the volume of base added to the solution. A titration curve is thereby obtained (Figure 1-5A shows an example), and the pKa value is the pH value at which the deprotonated and protonated species have the same concentration. Another way of presenting a titration curve is to plot the fraction of deprotonation (or protonation) versus the pH value. A Hill plot (an example is shown in Figure 1-5B), which is obtained by plotting log([A-]/[HA]) as a function of pH, is used to study titration behavior. After fitting to the modified HH equation:

$$\mathrm{pH} = \mathrm{p}K_a + \frac{1}{k}\log_{10}\frac{[\mathrm{A^-}]}{[\mathrm{HA}]}$$

the x intercept is the pKa value and the slope (k) is the Hill coefficient, which reflects interactions between ionizable residues. The HH equation is represented as a straight line in a Hill plot, with a slope of unity. If only one ionizable residue is present in the system of interest, or an ionizable residue does not couple with other ionizable residue(s), the HH equation should be reproduced. In that case, a slope that deviates from unity reflects statistical (random) error. Interacting ionizable residues demonstrate non-HH behavior and possess a non-unity slope in a Hill plot. When k > 1, proton binding is positively cooperative, which means that binding of the first proton increases the binding affinity for the other one. When k < 1, proton binding is negatively cooperative, which means that binding of one proton decreases the affinity for the other proton.

[Figure 1-5 image: panel A, titration points from constant-pH MD runs plotted against solution pH; panel B, Hill plot against solution pH with a linear fit (slope = 0.89, R² = 1.0).]

Figure 1-5.
A) An example of a titration curve. B) An example of a Hill plot based on the titration described in Figure 1-5A. The two plots are generated from constant-pH MD simulations of an aspartic acid in a pentapeptide.

However, determining the pKa values of protein ionizable residues by measuring the solution pH as a function of the volume of base is difficult because a protein generally contains multiple ionizable residues. An experimental technique that is site-specific is preferred. Nuclear Magnetic Resonance (NMR) is one of the most frequently employed spectroscopic methods in chemistry, physics and biological science. One application of the NMR method is to measure pKa values of individual ionizable residues. NMR spectroscopy measures the absorption of radio-frequency (RF) radiation by a nucleus in a magnetic field. Only a nucleus with a nonzero spin quantum number is able to generate an NMR signal. Furthermore, the absorption is affected by the chemical environment around that nucleus. Electron density around a nucleus shields the nucleus from the external magnetic field. Thus, a different chemical environment (electron density) around a nucleus changes its resonance frequency, resulting in a chemical shift. Changes in protonation state can change the chemical shifts of the nuclei around the ionizable site (for example, Cγ of Asp, Cδ of Glu, and Nδ and Nε of His). Consequently, at a given pH value, the equilibrium between the protonated and deprotonated species yields a weighted average chemical shift,

$$\delta_{obs} = \delta_p + \frac{\Delta\delta}{1 + 10^{\,n(\mathrm{p}K_a - \mathrm{pH})}} \tag{1-7}$$

Here $\delta_{obs}$, $\delta_p$ and $\Delta\delta$ are the observed chemical shift, the chemical shift of the protonated species, and the change in chemical shift caused by titration, respectively, and n is the Hill coefficient. In Eq. 1-7, the HH equation is implied. Therefore, chemical shifts are measured at different pH values and a titration curve is obtained.
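The connection between Eq. 1-7 and a Hill plot can be sketched numerically. In the example below, synthetic chemical shifts are generated from Eq. 1-7, converted back to deprotonated fractions, and fitted with a straight line of log10(f/(1-f)) against pH; the slope recovers the Hill coefficient and the x intercept recovers the pKa. The parameter values (pKa = 4.0, n = 0.9, shifts of 176.0 and 3.0) and function names are assumptions made for illustration, and a plain least-squares line is used rather than any particular fitting package.

```python
import math

def observed_shift(ph, pka, n, delta_p, delta_change):
    """Weighted-average chemical shift of Eq. 1-7."""
    return delta_p + delta_change / (1.0 + 10.0 ** (n * (pka - ph)))

def hill_fit(ph_values, shifts, delta_p, delta_change):
    """Fit log10(f/(1-f)) = n*(pH - pKa) by linear least squares.

    Returns (n, pKa): the slope and the x intercept of the Hill plot.
    """
    xs, ys = [], []
    for ph, shift in zip(ph_values, shifts):
        f = (shift - delta_p) / delta_change   # deprotonated fraction
        if 0.0 < f < 1.0:                      # the log is undefined otherwise
            xs.append(ph)
            ys.append(math.log10(f / (1.0 - f)))
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, -intercept / slope

# Synthetic, noise-free data from assumed parameters.
true_pka, true_n = 4.0, 0.9
delta_p, delta_change = 176.0, 3.0
ph_values = [2.0 + 0.5 * i for i in range(9)]   # pH 2.0 to 6.0
shifts = [observed_shift(ph, true_pka, true_n, delta_p, delta_change)
          for ph in ph_values]

n_fit, pka_fit = hill_fit(ph_values, shifts, delta_p, delta_change)
print(f"fitted Hill coefficient = {n_fit:.3f}, fitted pKa = {pka_fit:.3f}")
```

With real data, the end-point shifts are usually not known in advance, so a nonlinear fit of Eq. 1-7 with all four parameters free is the more common choice; the linearized form above is shown only because it mirrors the Hill plot.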
Figure 1-6 demonstrates a titration curve generated by NMR spectroscopy. However, in practice, one-dimensional NMR spectra of proteins are often too complicated to be interpreted. Introducing a new spectral dimension simplifies the spectra and yields more useful information. In two-dimensional NMR spectroscopy, the sample is excited by one or more pulses in the so-called "preparation time". Then the resulting magnetization is allowed to evolve for a time t1, during which the signal is not recorded. Following the evolution time, one or more pulses are applied to the sample and the resulting signal is measured as a function of a new time variable t2. 1H, 13C and 15N NMR are frequently employed in experiments to determine protein pKa values [14]. Proton NMR has been shown to be particularly useful in studying histidine pKa values. It is also employed to study the acid-base equilibrium of tyrosine residues. 13C NMR experiments can be performed to determine the pKa values of lysine and aspartate.

[Figure 1-6 image: 13C chemical shift (axis ticks from 175 to 182) plotted against pD (2 to 7), with one curve per labeled aspartate residue.]

Figure 1-6. 13C NMR titration curves of aspartate residues in the HIV-1 protease/KNI-272 complex, taken from Wang et al., 1996 [27]. In this figure, Asp Cγ chemical shifts are plotted as a function of pD. Asp25 and Asp125 do not change protonation states in this pD range. But isotope shift experiments show that Asp25 is protonated and Asp125 is deprotonated in this pD range. "Reprinted with permission from Wang, Y. X.; Freedberg, D. I.; Yamazaki, T.; Wingfield, P. T.; Stahl, S. J.; Kaufman, J. D.; Kiso, Y.; Torchia, D. A. Biochemistry 1996, 35, 9945-9950. Copyright 1996 American Chemical Society."
One example of measuring the pKa value of an ionizable residue using NMR is the determination of the pKa value of Asp26 in Escherichia coli thioredoxin [17,25,28-30]. NMR methods, especially 2D NMR, have been used intensively to investigate the pKa value of Asp26. Escherichia coli thioredoxin has two redox forms. The oxidized form has a disulfide bond linking Cys32 and Cys35, while the two cysteine residues are not bonded in the reduced form. Hence, the two cysteine residues are ionizable in the reduced form, which makes the investigations more complicated. Asp26 is located at the bottom of a hydrophobic cavity near the active site disulfide and is completely buried in the protein. In 1991, Dyson et al. investigated the pH effect on thioredoxin in the vicinity of the active site using 2D NMR [28]. Both oxidized and reduced thioredoxin were studied. The CαH and CβH chemical shifts of Cys32 and Cys35, and the NH, CαH and CβH chemical shifts of Asp26, were measured as a function of pH. Those chemical shifts were found to titrate with a pKa value of 7.5. Since the cysteine residues in oxidized thioredoxin are not ionizable, they proposed that the apparent pKa is the pKa value of Asp26. In the same year, Langsetmo et al. measured the electrophoretic mobility of the wild-type and the D26A mutant of oxidized thioredoxin as a function of pH. A pKa of 7.5 was obtained from their experiments [17]. In 1995, Wilson et al. measured the chemical shifts of the CβH1, CβH2 and Cβ atoms of Cys32 and Cys35 using the reduced form of thioredoxin [30]. Both the wild-type and the D26A mutant were studied. Comparing the titration curves of the wild-type and the D26A mutant, a titration with a pKa value > 9 was found to be missing in the D26A thioredoxin experiment.
Assuming that the cysteine residues in reduced thioredoxin have pKa values of 7.1 and 7.9, as derived from Raman spectroscopy, they concluded that Asp26 has an apparent pKa greater than 9. However, their results were challenged by later pKa determinations of Cys32 and Cys35 in the reduced form of thioredoxin. In 1995, Jeng et al. studied the titration behavior of Cys32 and Cys35 in the reduced form of thioredoxin by 13C NMR experiments [29]. The pKa values were found to be 7.5 and 9.5, which contradicted the assumption underlying the results obtained by Wilson et al. In order to elucidate the pKa value of Asp26 in reduced thioredoxin, Jeng and Dyson measured the pKa value of Asp26 in 1996 using 2D NMR [29]. The 13C chemical shift of the carboxylic group, which is bonded to the titrating site, as well as the CβH1 and CβH2 proton chemical shifts, was measured as a function of pH. The authors argued that the pH effect on the 13C chemical shift of the carboxylic group should result from titration because of its close distance to the titrating site. The apparent pKa value obtained from their experiments lies between 7.3 and 7.5, which is the same as the pKa value of Asp26 in the oxidized form.

Fluorescence spectroscopy can be utilized to determine pKa values as well. Fluorescence is the emission of light by a substance as it relaxes from an electronic excited state (S1) to the electronic ground state (S0). In fluorescence spectroscopy, the substance is first excited from S0 to one of many vibrational states of S1 by absorbing a photon. Following the excitation, relaxation to the vibrational ground state of S1 occurs through collisions with other molecules. Once in the ground vibrational state of S1, the substance returns to one of many vibrational states of S0 by emitting a photon. Since the substance can return to various vibrational states in the electronic ground state, a band of emission wavelengths is observed.
The absorption and emission wavelengths are different (emitted photons have a longer wavelength), and the difference in wavelength is called the Stokes shift. The average time the substance stays in its electronic excited state is called the fluorescence lifetime. In biophysical chemistry, tryptophan fluorescence is frequently employed to study conformational changes in proteins. In general, tryptophan has a maximal absorption wavelength of 280 nm [31] and a maximal emission wavelength of 300~350 nm [32,33]. Changes in the environment of a tryptophan residue affect the emission wavelength and/or intensity. Furthermore, it has been noticed that tryptophan fluorescence is sensitive to the polarity of the local environment. One advantage of tryptophan fluorescence spectroscopy is that the chromophore is intrinsic; no change is made to the protein. If the change in protonation state of an ionizable residue affects the spectrum of a neighboring tryptophan residue, which is the main fluorescent species in a protein, then fluorescence spectroscopy can be employed to generate a titration curve, from which the pKa value is obtained. One example of determining a pKa value by fluorescence spectroscopy is the measurement of the pKa of Glu35 in HEWL performed by the Imoto group [34]. Trp108 is in van der Waals contact with Glu35. Changes in the protonation state of Glu35 can induce a large shift in the intensity of the Trp108 fluorescence signal.

Another way of obtaining a titration curve is the potentiometric method. Potentiometric titration measures the pH value as a function of the volume of titrant added. The volume of titrant added at each dosing can be used to calculate the moles of hydrogen ions released from (or bound by) a peptide or protein, and hence the number of hydrogen ions released (or bound) per molecule. Plotting the number of hydrogen ions released (or bound) per molecule as a function of pH generates a titration curve.
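The bookkeeping behind such a potentiometric titration curve can be sketched in a few lines. The example below is a deliberately simplified illustration: it assumes that every mole of base titrant neutralizes exactly one mole of hydrogen ions released by the protein, and it ignores the blank (solvent) titration that a real analysis would subtract. The function name and all numbers are hypothetical.

```python
def protons_released_per_molecule(titrant_molarity, dose_volumes_l, protein_moles):
    """Cumulative number of H+ released per protein molecule after each dose.

    titrant_molarity : concentration of the base titrant (mol/L)
    dose_volumes_l   : volume of titrant added at each dosing step (L)
    protein_moles    : amount of protein in the sample (mol)
    """
    released = []
    total_moles = 0.0
    for volume in dose_volumes_l:
        # Simplifying assumption: moles of base added equal moles of H+ released.
        total_moles += titrant_molarity * volume
        released.append(total_moles / protein_moles)
    return released

# Hypothetical experiment: 0.01 M base, three 10-microliter doses,
# and 1e-7 mol of protein in the cell.
doses = [1.0e-5, 1.0e-5, 1.0e-5]
per_molecule = protons_released_per_molecule(0.01, doses, 1.0e-7)
```

Pairing each entry of `per_molecule` with the pH measured after the corresponding dose gives the points of the titration curve described above.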
By utilizing potentiometric titration, a titration curve of the entire peptide or protein can be obtained. The García-Moreno group has been utilizing the potentiometric method, combined with other experimental techniques and protein pKa calculations, to investigate pKa values of ionizable residues buried deep in a protein [18,19,21-23]. As mentioned in the previous section, the protein environment can shift the pKa value of an ionizable residue. In nature, a small portion of ionizable residues are buried in deep pockets of the protein, inaccessible to water [22,35]. Those buried ionizable residues are crucial to protein functions such as catalysis [12,36] and ion or electron transport [37,38]. Determining and understanding the pKa values of buried ionizable residues is therefore important for biological research. The García-Moreno group performed site-directed mutagenesis experiments, mutating a nonpolar residue that is inaccessible to water to an ionizable residue. The pKa value of the introduced ionizable residue is determined experimentally and predicted theoretically. By combining experimental and theoretical determinations, the dielectric effect and electrostatic interactions can be elucidated.

One example of such a mutagenesis experiment is the mutation of Val66 in a "hyperstable variant" of staphylococcal nuclease (SNase) to glutamate [19,21]. The original and mutated forms of the "hyperstable variant" of SNase are called PHS and PHS/V66E. The PHS nuclease is made by mutating three residues of wild-type SNase: P117G, H124L, and S128A. Val66 is located in the core region of SNase and is inaccessible to the aqueous environment. Potentiometric titrations were performed on both PHS and PHS/V66E. The difference between the two titration curves represents the Glu66 titration plus other titrations affected by the mutation, although it is assumed that the latter effect is not significant.
The difference in hydrogen ions ($\Delta\nu_i$) bound to PHS and PHS/V66E was fitted to the following equation,

$$\Delta\nu_i = \frac{10^{\,n(\mathrm{pH} - \mathrm{p}K_a)}}{1 + 10^{\,n(\mathrm{pH} - \mathrm{p}K_a)}} \tag{1-8}$$

where n is the Hill coefficient, pH is the solution pH value, and pKa in this case is the pKa value of Glu66. The pH dependence of PHS and PHS/V66E stability was also demonstrated by guanidine hydrochloride denaturation free energy profiles. The Trp140 fluorescence was recorded as a probe of the denaturation. The difference in denaturation free energy profiles was also fitted nonlinearly to obtain the pKa value of Glu66. The pKa value of Glu66 was determined to be 8.8 from potentiometric titration and 8.5 from the protein stability study. The pKa shift of 4.4 (based on the potentiometric measurements and the intrinsic pKa value of 4.4 for glutamate) is among the largest for acidic ionizable residues. Once the experimental pKa value has been accurately obtained, a "reverse pKa prediction" can be performed to investigate the dielectric constant inside the protein, which is an important parameter in the continuum electrostatic model and will be explained later in this chapter. In fact, direct potentiometric measurements were first carried out by the García-Moreno group on PHS and PHS/V66K [18]. A pKa value of 6.38 was found for Lys66, while the pKa value of the lysine model compound is 10.4.

Recent site-directed mutagenesis studies on PHS have been extended to Leu38 [22]. Mutations to aspartate, glutamate and lysine were made. As in their treatment of the Val66 mutations, the García-Moreno group used potentiometric titration and protein denaturation experiments to determine pKa values. For PHS/L38E, NMR was employed to facilitate the Glu38 pKa measurement. PHS/L38K showed a pKa value close to the intrinsic value of lysine. After mutation, the lysine was found to adjust its side chain to let water molecules penetrate. However, L38D and L38E showed elevated pKa values.
Both Asp38 and Glu38 were still inaccessible to water, although structural rearrangement was also observed. Their pKa values were further perturbed by electrostatic interactions with surface carboxylic groups. These investigations have unveiled how conformational changes, desolvation and electrostatic interactions affect pKa values.

1.5 Molecular Modeling

Experimental techniques such as spectroscopy are fundamental to the study of protein structure and function. For example, NMR spectroscopy is frequently employed in biological science, X-ray crystallography can be applied to resolve protein structures, and circular dichroism (CD) spectrometry is employed to determine the secondary structure of a protein. However, advances in computational power combined with progress in theory mean that experiments are no longer the only way to understand biological molecules. Molecular modeling offers another way to investigate the structures and properties of biological molecules. It combines theories developed in the fields of physics, chemistry and biology with computer resources to simulate the behavior of molecules. Results from simulations are often compared to experimental observations in order to validate the method and to understand the behavior of biological molecules at an atomistic level.

1.6 Potential Energy Surface

Molecules generally possess more than one stable configuration. In principle, all possible molecular configurations need to be considered in order to simulate a molecule correctly. A potential energy surface (PES), which is a surface defined by the potential energies of all possible configurations, can be utilized to fulfill this requirement. The local minima of a PES indicate stable conformations of a molecule. There are multiple ways to generate a PES. Quantum mechanical calculations offer the most accurate way to construct a PES. By solving the Schrödinger equation, one can obtain the energies and wave function of the molecule.
In the field of chemistry, electronic structure theory utilizes quantum mechanics to describe the motion of electrons in the framework of the Born-Oppenheimer approximation. The Born-Oppenheimer approximation states that the electronic relaxation caused by nuclear motion is instantaneous because of the huge difference in the masses of electrons and nuclei. Thus, electronic motion and nuclear motion are decoupled. The eigenvalue of the electronic Schrödinger equation at each nuclear configuration is the potential energy of the nuclei at that geometry. Solving the Schrödinger equation at different configurations yields the PES of a molecule. However, electronic structure calculations are very expensive, which hinders the use of a high level of theory when studying large biological molecules. Because of this cost, an alternative way to describe a PES is to use a classical mechanical model. One commonly used approach is the all-atom force field, in which the PES is computed without solving the Schrödinger equation. In an all-atom force field model, no electrons are present and each atom is represented by a single particle (in contrast to the united-atom force field model, where a functional group is represented by a particle). Atoms interact with each other via bonded and nonbonded potential energy terms. Equation 1-9 shows an example of an all-atom force field model that is frequently adopted in simulations of proteins:

$$U(q^N) = \sum_{\mathrm{bonds}} \frac{1}{2} k_b (r - r_0)^2 + \sum_{\mathrm{angles}} \frac{1}{2} k_a (\theta - \theta_0)^2 + \sum_{\mathrm{dihedrals}} \sum_{n} \frac{V_n}{2} \left[ 1 + \cos(n\phi - \gamma) \right] + \sum_{i=1}^{N-1} \sum_{j=i+1}^{N} \left\{ \frac{q_i q_j}{r_{ij}} + 4\varepsilon_{ij} \left[ \left( \frac{\sigma_{ij}}{r_{ij}} \right)^{12} - \left( \frac{\sigma_{ij}}{r_{ij}} \right)^{6} \right] \right\} \tag{1-9}$$

The first three summations are bonded terms and they represent bond stretching, valence angle bending, and torsions, respectively. In Eq. 1-9, bond stretching and angle bending are described by harmonic potentials. The torsion term is expressed as a Fourier series because of the periodic nature of a dihedral angle. The latter two summation terms are the nonbonded interaction terms.
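The individual terms of the force field expression above can be written down in a few lines. The following is only a minimal sketch under stated assumptions: the function names are hypothetical, the factor of 1/2 in the harmonic and torsion terms is one common convention, the Coulomb term carries no unit prefactor, and a single well depth and contact distance are shared by every atom pair.

```python
import math

def harmonic(k, x, x0):
    """Harmonic term 0.5 * k * (x - x0)**2 (assumed form for bonds and angles)."""
    return 0.5 * k * (x - x0) ** 2

def torsion(v_n, n, phi, gamma):
    """One Fourier term of a dihedral: (V_n / 2) * [1 + cos(n*phi - gamma)]."""
    return 0.5 * v_n * (1.0 + math.cos(n * phi - gamma))

def coulomb(q_i, q_j, r_ij):
    """Coulomb interaction q_i * q_j / r_ij, with no unit conversion factor."""
    return q_i * q_j / r_ij

def lennard_jones(eps_ij, sigma_ij, r_ij):
    """12-6 Lennard-Jones potential with well depth eps_ij."""
    sr6 = (sigma_ij / r_ij) ** 6
    return 4.0 * eps_ij * (sr6 * sr6 - sr6)

def nonbonded_energy(coords, charges, eps, sigma):
    """Double sum over unique pairs i < j of Coulomb plus Lennard-Jones.

    Toy assumption: the same eps and sigma apply to every pair.
    """
    total = 0.0
    n_atoms = len(coords)
    for i in range(n_atoms - 1):
        for j in range(i + 1, n_atoms):
            r = math.dist(coords[i], coords[j])
            total += coulomb(charges[i], charges[j], r)
            total += lennard_jones(eps, sigma, r)
    return total
```

In this form the Lennard-Jones term vanishes at a separation equal to the contact distance and reaches its minimum, equal to minus the well depth, at 2^(1/6) times that distance.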
The two components in the double summation represent electrostatic interactions and van der Waals interactions, respectively. The electrostatic potential is represented by the Coulomb interaction, where $q_i$ and $q_j$ are the partial charges on atoms i and j, respectively, and $r_{ij}$ is the distance between the two atoms. In Eq. 1-9, the van der Waals interaction is calculated by the Lennard-Jones potential, in which $\varepsilon_{ij}$ is the well depth and $\sigma_{ij}$ is the distance at which the repulsive and attractive potentials are equal. Solvent effects are also considered when an implicit solvent such as the Generalized Born (GB) model [39,40] is adopted (solvent models will be briefly described in the next chapter). The cost of the all-atom force field model is low compared with ab initio methods because it utilizes predefined parameters when calculating potential energies. Those parameters are generated by fitting to experimental data and quantum mechanical calculations. Note that the parameters of a force field are internally consistent, which means that parameters are in general not transferable between different force fields. All-atom force field models are utilized much more frequently than quantum mechanical methods when simulating large systems such as proteins. However, force fields such as Eq. 1-9 do not allow bond breaking or forming, so they cannot be used to study reactions. Linear scaling techniques in electronic structure theory have been developed in order to fill the gap between force fields and high-accuracy ab initio methods [41,42]. One example of a linear scaling algorithm is the DivCon program developed by the Merz group [43]. The balance between computational accuracy and cost is the main theme in computational chemistry [44]. One category of schemes attempting to achieve this balance is the so-called hybrid quantum mechanical/molecular mechanical (QM/MM) methods [41,45-47]. The basic idea of the QM/MM methods is that different regions of a system may play different roles.
For example, if one wants to study an enzymatic reaction, the potential energy calculation involving the active site should be done by a quantum mechanical model because a classical force field is not able to describe bond forming and breaking. On the other hand, the bulk water (assuming no water molecule participates in the enzymatic reaction) and the protein environment of the enzyme can be represented by the force field in order to save simulation time. In the QM/MM methodology, different regions of a system are treated at different levels of theory and interact with each other. The QM/MM approaches have become a key area in the simulation of proteins [48,49].

1.7 Molecular Dynamics, Monte Carlo Methods and Ergodicity

Accurately simulating the behavior of a molecule requires more than knowing the PES. A molecule often has more than one minimum on the PES. Finding the correct probability distribution of molecular conformations is also important because the majority of experiments measure molecular properties as averages over molecular structures. Sampling algorithms such as molecular dynamics (MD) and the Metropolis Monte Carlo (MC) method are therefore crucial to molecular modeling. For a system containing N particles, there are 6N degrees of freedom (DOF). Half of the DOF come from the coordinates and the other half represent the momenta of all particles. The 6N-dimensional space defined by those DOF is called the phase space. Both MD and MC methods sample the molecular phase space. Over time, the system generates a trajectory in the phase space. MD utilizes the equation of motion to propagate a system in the phase space (the details of molecular dynamics will be presented in the next chapter). Each particle in the system has a velocity and a position, and Newton's second law (Eq. 1-10) is applied to control the dynamics:

$$\mathbf{F} = m\mathbf{a} = -\nabla U \tag{1-10}$$

The force on any particle in the system is given by the negative gradient of the potential energy.
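As an illustration of how Newton's second law with a force equal to the negative gradient of the potential can be propagated step by step, the sketch below integrates a one-dimensional harmonic oscillator. The velocity Verlet scheme is an assumption chosen for illustration (no particular integrator is specified here), and the function names, the potential U(x) = k x^2 / 2 and the reduced units are all hypothetical.

```python
def force(x, k=1.0):
    """Force from the toy potential U(x) = 0.5 * k * x**2, i.e. F = -dU/dx."""
    return -k * x

def total_energy(x, v, m=1.0, k=1.0):
    """Kinetic plus potential energy of the toy oscillator."""
    return 0.5 * m * v * v + 0.5 * k * x * x

def velocity_verlet(x, v, m, dt, n_steps, k=1.0):
    """Propagate position and velocity with the velocity Verlet scheme.

    Returns the trajectory as a list of (x, v) tuples, including the start.
    """
    trajectory = [(x, v)]
    f = force(x, k)
    for _ in range(n_steps):
        v_half = v + 0.5 * dt * f / m      # half-step velocity update
        x = x + dt * v_half                # full-step position update
        f = force(x, k)                    # force at the new position
        v = v_half + 0.5 * dt * f / m      # complete the velocity update
        trajectory.append((x, v))
    return trajectory
```

For unit mass and unit force constant the exact solution starting from x = 1, v = 0 is x(t) = cos(t), which gives a simple check on the numerical trajectory; the total energy should also stay close to its initial value for a sufficiently small time step.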
The equation of motion is usually solved numerically. By propagating the equation of motion, the phase space is explored and a probability distribution for the DOF is obtained. Therefore, molecular properties can be computed by averaging over time:

$$\langle A \rangle_{\mathrm{MD}} = \lim_{t \to \infty} \frac{1}{N} \sum_{i=0}^{N} A(t_i) \tag{1-11}$$

In Eq. 1-11, A is the property of interest, t is the total simulation time, and N is the size of the sample taken during the entire simulation. The angle brackets denote an average. $A(t_i)$ is the value of A at time $t_i$ in the simulation. In contrast to MD, the Metropolis MC method (from now on simply called the MC method unless otherwise mentioned) does not utilize the equation of motion. The MC method samples the phase space through a Markov chain (the details of the Monte Carlo method will be presented in the next chapter). In the MC algorithm, a new state (for example, a new molecular configuration) is randomly selected, and the transition probability relationship between the current state and the new state is calculated from the detailed balance equation. Then a Metropolis criterion [50] is applied to accept or reject the transition to the new state. The Markov chain can be applied because the system is assumed to be at equilibrium. Likewise, after a sufficient number of transitions, the phase space is explored and molecular properties can be computed by averaging over the ensemble:

$$\langle A \rangle_{\mathrm{MC}} = \int A(\mathbf{x})\, p(\mathbf{x})\, d\mathbf{x} \tag{1-12}$$

Here $A(\mathbf{x})$ is the value of A in state $\mathbf{x}$, and $p(\mathbf{x})$ is the normalized probability density of state $\mathbf{x}$. The MD and MC methods represent two different ways of sampling phase space and computing average molecular properties. According to the ergodic hypothesis, the time average is equal to the ensemble average:

$$\langle A \rangle_{\mathrm{MC}} = \int A(\mathbf{x})\, p(\mathbf{x})\, d\mathbf{x} = \lim_{t \to \infty} \frac{1}{N} \sum_{i=0}^{N} A(t_i) = \langle A \rangle_{\mathrm{MD}} \tag{1-13}$$

The ergodic hypothesis is often assumed to be true in molecular simulations. This hypothesis makes the MD and MC methods equivalent in sampling phase space.
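The Metropolis sampling described above can be illustrated on a toy problem whose ensemble average is known exactly. The sketch below is a minimal example under stated assumptions: a single coordinate in a hypothetical harmonic potential U(x) = k x^2 / 2, reduced units, symmetric uniform trial displacements, and a function name chosen only for this illustration. For this potential the canonical average of x^2 is kT / k.

```python
import math
import random

def metropolis_harmonic(n_steps, kT=1.0, k=1.0, max_disp=2.0, seed=0):
    """Estimate <x**2> for U(x) = 0.5 * k * x**2 by Metropolis Monte Carlo."""
    rng = random.Random(seed)
    x = 0.0
    u = 0.5 * k * x * x
    sum_x2 = 0.0
    for _ in range(n_steps):
        # Symmetric trial move: uniform displacement around the current state.
        x_new = x + rng.uniform(-max_disp, max_disp)
        u_new = 0.5 * k * x_new * x_new
        # Metropolis criterion: always accept downhill moves, accept uphill
        # moves with probability exp(-(U_new - U_old) / kT).
        if u_new <= u or rng.random() < math.exp(-(u_new - u) / kT):
            x, u = x_new, u_new
        # The current state is counted whether or not the move was accepted.
        sum_x2 += x * x
    return sum_x2 / n_steps
```

With enough steps the running average approaches kT / k, which is the sense in which a sufficiently long Markov chain reproduces the ensemble average of Eq. 1-12.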
If the system is ergodic, the phase spaces generated by MD and MC should be the same because the phase space does not depend on the sampling technique. The same behavior should also extend to any observable property. Conformational sampling in an MD or MC simulation is essential in the study of complex systems such as polymers and proteins. One major concern is that the PES of a complex system is very rugged and contains many local energy minima [51]. Thus, kinetic trapping occurs as a result of the low rate of potential energy barrier crossing, especially when the barrier is high. In order to overcome this kinetic trapping, generalized ensemble methods (advanced sampling methods) [52,53] are frequently employed in molecular simulations. Popular generalized ensemble methods include the multicanonical algorithm [54,55], the simulated tempering method [56,57], the parallel tempering method [58-60], and the replica exchange molecular dynamics (REMD) method [61,62]. A more thorough description of MD, MC and the advanced sampling methods will be presented in the next chapter.

1.8 Theoretical Protein Titration Curves and pKa Calculations Using Poisson-Boltzmann Equation

Studying protein titration curves theoretically has a long history. As early as 1957, Tanford and Kirkwood presented their study of protein titration curves [63]. In their model, proteins were considered to be low-dielectric spheres with discrete unit charges on ionizable residues. They proposed that the pKa value of an ionizable residue can be calculated from its intrinsic pKa value and pairwise electrostatic interactions with other ionizable residues. Calculating the pairwise electrostatic interactions involves using empirical parameters. A protein titration curve showing average charge as a function of pH value was plotted.
The Tanford-Kirkwood model was further extended and utilized to study lysozyme by Tanford and Roxby [64]. The equations used to generate a titration curve in the Tanford and Roxby paper were the same as those Tanford and Kirkwood used. However, they employed an iterative approach to generate titration curves and pKa values for all ionizable residues. In their approach, each ionizable residue was initially assigned a pKa value equal to its intrinsic value. At a given pH, the average charge on each site (representing the fraction of deprotonation or protonation) can be computed. Those average charges were then employed to update the pKa values. This process was repeated until a self-consistent average charge and pKa value were obtained for each site. A titration curve can then be produced by plotting the average charge as a function of pH value. In 1990, Bashford and Karplus utilized the finite difference Poisson-Boltzmann (FDPB) equation in the calculation of pKa values [65]. A detailed description of the FDPB method will be presented in the next chapter. The pKa shift of an ionizable residue relative to a model compound is calculated (in their paper, the intrinsic pKa is defined as the pKa value of an ionizable residue when all other sites are neutral, that is, when there are no interactions between ionizable sites). Given a molecular configuration, three terms are calculated by the FDPB equation for each ionizable site: the Born solvation free energy, the pairwise electrostatic interactions with nonionizable residues (represented by partial charges), and the pairwise electrostatic interactions between ionizable sites. Summing the three terms yields the electrostatic work of charging the ionizable side chain, and hence the pKa shift. A protein titration curve is represented by plotting the fraction of protonation versus pH value.
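The self-consistent iteration attributed above to Tanford and Roxby can be sketched for a toy system. This is only a mean-field illustration under stated assumptions, not their original equations: all sites are taken to be acids that carry a charge only when deprotonated, `w[i][j]` is a hypothetical repulsive interaction energy between two charged sites (same units as `kT`), and the function names are invented for this example.

```python
import math

LN10 = math.log(10.0)

def protonated_fraction(ph, pk):
    """Fraction of protonated acid at a given pH for an effective pKa."""
    return 1.0 / (1.0 + 10.0 ** (ph - pk))

def tanford_roxby(pk_int, w, ph, kT=0.593, tol=1e-10, max_iter=1000):
    """Iterate effective pKa values until they are self-consistent.

    pk_int : intrinsic pKa of each site
    w      : w[i][j], interaction energy between charged sites i and j
    Returns (protonated fractions, effective pKa values).
    """
    n = len(pk_int)
    pk = list(pk_int)                      # start from the intrinsic values
    for _ in range(max_iter):
        theta = [protonated_fraction(ph, p) for p in pk]
        # Repulsion from the average charge (1 - theta) on the other sites
        # makes deprotonation harder, which raises the effective pKa.
        pk_new = [
            pk_int[i]
            + sum(w[i][j] * (1.0 - theta[j]) for j in range(n) if j != i)
            / (kT * LN10)
            for i in range(n)
        ]
        converged = max(abs(a - b) for a, b in zip(pk_new, pk)) < tol
        pk = pk_new
        if converged:
            break
    theta = [protonated_fraction(ph, p) for p in pk]
    return theta, pk
```

Calling the function over a range of pH values and plotting the average charge of each site gives a titration curve of the kind described above; with all interaction energies set to zero the result reduces to independent sites titrating at their intrinsic pKa values.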
For a protein with N ionizable sites, each of which can have two states (protonated and deprotonated), there are $2^N$ possible macrostates, and each macrostate can be represented by an N-dimensional vector. Once the FDPB equation is solved, the free energy difference of each vector relative to the completely deprotonated state is computed. Thus, the fraction of protonation of an ionizable site can be calculated by taking the Boltzmann-weighted average over the $2^N$ macrostates. The FDPB method forms the foundation of the continuum electrostatic (CE) models, which are frequently utilized when studying protein pKa values [16,65-71]. The FDPB method has been implemented in many modeling software packages such as UHBD [72] and DELPHI [73]. Many modifications have been made to improve its performance. In 1991, Beroza et al. employed the Metropolis MC method to sample the $2^N$ protonation states, instead of calculating the protonation fraction at a given pH value directly [74]. With MC sampling of protonation states, the number of ionizable residues included in the simulation can increase dramatically. Solving the FDPB equation requires the dielectric constant of the protein as an input parameter, and this parameter is very important because the electrostatic energy is inversely proportional to it. It is considered the most important adjustable parameter in FDPB-based pKa calculations [16]. Thus, one question arising from the use of the FDPB method is how to choose the dielectric constant for proteins. Values between 4 and 20 are typically adopted in FDPB calculations [67]. Direct experimental determination of the interior dielectric constant is extremely difficult. In practice, protein dielectric constants are measured on protein powders, which causes problems in interpreting the resulting dielectric constants [18,75,76]. Research has been performed to find an optimal interior dielectric constant for protein pKa predictions.
However, considering the differences in protein environment, no single dielectric constant can reproduce experimental pKa values for both internal and surface residues in a protein [77]. In 1996, Simonson and Brooks studied the charge screening effect and the protein dielectric constant by MD simulations [78]. They found that the protein dielectric constant can range from ~4 in the interior of the protein to a much higher value (~30) in the region near the surface. As mentioned in section 1.4, the García-Moreno group conducted site-directed mutagenesis experiments in the deep pocket of a protein where water is inaccessible and measured the pKa value of the mutated ionizable residue [18,19,21-23,77]. The experimental pKa value was then put back into the FDPB equation in order to examine the protein interior dielectric constant. The protein interior dielectric constant was found to be ~11 [18]. Mehler and his coworker employed a sigmoidal screened electrostatic interaction to treat the protein dielectric environment [79,80]. Their method was applied to Glu35 and Asp66 in hen egg white lysozyme and obtained satisfactory results [80]. Another problem in FDPB-based pKa calculations is that the FDPB equation is often solved on the basis of a single structure, such as an X-ray crystal structure. The entropic effect is missing when a single structure is used. To improve the performance of the CE model in pKa calculations, protein conformational sampling is also considered in order to incorporate conformational flexibility into pKa calculations [81-86]. In the 1990s, You and Bashford developed an algorithm in which 36 side chain conformations of ionizable residues are adopted in the calculation of pKa values [86]. In 1997, Alexov and Gunner proposed to use the Monte Carlo method to sample $(2M)^N L^K$ possible states instead of just $2^N$ protonation states [81]. Here N is the number of ionizable residues and each one can have M possible conformations.
Furthermore, each one of the K nonionizable residues possesses L possible conformations. The Gunner group further extended this algorithm to the so-called multiconformation continuum electrostatic (MCCE) method [83]. Recently, Barth et al. proposed a rotamer repacking technique combined with the FDPB method, which was given the name FDPB_MF [82]. In the FDPB_MF method, the conformational space of the side chains of ionizable residues was defined by a rotamer probability distribution. Each rotamer was given a weight and interacted with other ionizable residues in a mean-field scheme.

1.9 Computing pKa Values by Free Energy Calculations

MD-based free energy (MDFE) calculations [87,88] have also been employed to predict pKa values. MDFE calculations combine free energy calculation algorithms with MD propagation. MD propagation samples phase space and generates a conformational ensemble. Free energy calculation methods compute the free energy difference between two states on the basis of the phase space sampled by MD. Free energy perturbation (FEP) and thermodynamic integration (TI) are two frequently employed free energy calculation methods and will be explained in more detail in the next chapter. Free energy calculation algorithms such as the FEP and TI methods can be used to compute pKa because Ka is associated with the free energy of reaction. Early pKa calculations utilizing free energy calculations were conducted by Warshel et al. [89,90], Jorgensen et al. [91], and Merz [92] with the FEP method and classical force fields. In the 1980s, Warshel et al. proposed a protein dipole Langevin dipole (PDLD) model for pKa calculations [90]. In the PDLD model, proteins were treated as particles having partial charges and polarizable dipoles, while the nearby solvent molecules were viewed as Langevin dipoles. The bulk water far away from the ionizable residues was still treated as a dielectric continuum.
Electrostatic interactions between charges and dipoles, and between dipoles and dipoles, were computed. Jorgensen et al. combined ab initio quantum mechanical calculations and classical FEP calculations in 1989 [91]. They calculated the pKa difference between two acids, AH and BH. The gas-phase dissociation free energies of AH and BH were computed by quantum mechanical methods. The solvation free energy calculations were conducted using the MC FEP method for the neutral molecules and the anions. One shortcoming of their calculations is that only small organic molecules were investigated because of the computational cost of quantum mechanical methods. In 1991, Merz performed classical FEP calculations for three glutamate residues in two proteins (HEWL and human carbonic anhydrase II) [92]. The glutamate dipeptide was utilized as a model compound to eliminate the gas-phase dissociation free energy calculations. When MDFE calculations utilizing classical force fields are performed, quantum effects such as bond forming and breaking cannot be simulated. Thus, the pKa shift of an ionizable residue relative to its intrinsic pKa value (the pKa value of the reference compound, which is defined in section 1.3 of this dissertation) is computed by the free energy calculations. A diagrammatic explanation of the pKa shift calculation utilizing the MDFE method is given in Figure 1-7 and Figure 1-8.

[Figure 1-7 diagram: upper leg, model compound reaction AH → A⁻ + H⁺ ($\Delta G_{\mathrm{model}}$); lower leg, protein-AH → protein-A⁻ + H⁺ ($\Delta G_{\mathrm{protein}}$); vertical legs labeled $\Delta G_1$, $\Delta G_2$ and $\Delta G_3$.]

Figure 1-7. Thermodynamic cycle used to compute the pKa shift. Both acid dissociation reactions occur in aqueous solution. A thermodynamic cycle is a series of thermodynamic processes that eventually returns to the initial state. A state function, such as the reaction free energy in this case, is path-independent and hence unchanged through a cyclic process.

[Figure 1-8 diagram: upper leg, AH → A⁻ ($\Delta G_{\mathrm{model}}$(AH → A⁻)); lower leg, protein-AH → protein-A⁻ ($\Delta G_{\mathrm{protein}}$(protein-AH → protein-A⁻)); vertical legs labeled $\Delta G_1$ and $\Delta G_2$.]

Figure 1-8. Thermodynamic cycle utilized to calculate the difference between $\Delta G_1$ and $\Delta G_2$.
In Figure 1-7 and Figure 1-8, protein-AH represents the ionizable residue in the protein environment. AH represents the reference compound, which is usually the ionizable residue with its two termini capped. In practice, a proton does not disappear but instead becomes a dummy atom. The proton keeps its position and velocity, and the bonded interactions involving the proton remain in effect. However, there is no nonbonded interaction for that proton. The change in protonation state is reflected by changes in the partial charges of the ionizable residue. Equations 1-14 to 1-20 explain how pKa values are computed from free energy calculations using force fields:

$$\mathrm{p}K_{a,\mathrm{protein}} = \frac{1}{2.303\, k_B T}\, \Delta G_{\mathrm{protein}} \tag{1-14}$$

$$\mathrm{p}K_{a,\mathrm{model}} = \frac{1}{2.303\, k_B T}\, \Delta G_{\mathrm{model}} \tag{1-15}$$

In Eq. 1-14 and Eq. 1-15, $\Delta G_{\mathrm{protein}}$ and $\Delta G_{\mathrm{model}}$ are the acid dissociation reaction free energies of the ionizable residue in the protein and of the reference compound, respectively. Therefore, the pKa shift between the ionizable residue in the protein environment and the reference compound can be calculated as $\mathrm{p}K_{a,\mathrm{protein}} - \mathrm{p}K_{a,\mathrm{model}} = \frac{1}{2.303\, k_B T} (\Delta G_{\mathrm{protein}} - \Delta G_{\mathrm{model}})$. According to the thermodynamic cycle shown in Figure 1-7, $(\Delta G_{\mathrm{protein}} - \Delta G_{\mathrm{model}}) = \Delta G_1 + \Delta G_2 + \Delta G_3$. Here, $\Delta G_1$ and $\Delta G_2$ are the free energy differences between the two protonated species and between the two deprotonated species, respectively. $\Delta G_3$ is equal to zero because the free energy difference between two protons that are in the same environment is zero. However, calculating $\Delta G_1$ and $\Delta G_2$ directly by MDFE calculations is not preferable because the difference between the reference compound and the protein system is very large. A simpler way to determine the difference between $\Delta G_1$ and $\Delta G_2$ is needed. Therefore, the thermodynamic cycle shown in Figure 1-8 is employed.
By utilizing that thermodynamic cycle, $(\Delta G_1 + \Delta G_2)$ can be expressed as $\Delta G(\text{protein-AH} \to \text{protein-A}^-) - \Delta G(\mathrm{AH} \to \mathrm{A}^-)$, where $\Delta G(\text{protein-AH} \to \text{protein-A}^-)$ and $\Delta G(\mathrm{AH} \to \mathrm{A}^-)$ are the free energy differences between the protonated and deprotonated forms of the ionizable residue in the protein and of the reference compound, respectively. These two quantities can be further expressed as:

$$\Delta G(\text{protein-AH} \to \text{protein-A}^-) = \Delta G_{\mathrm{QM}}(\text{protein-AH} \to \text{protein-A}^-) + \Delta G_{\mathrm{MM}}(\text{protein-AH} \to \text{protein-A}^-) \tag{1-16}$$

and

$$\Delta G(\mathrm{AH} \to \mathrm{A}^-) = \Delta G_{\mathrm{QM}}(\mathrm{AH} \to \mathrm{A}^-) + \Delta G_{\mathrm{MM}}(\mathrm{AH} \to \mathrm{A}^-) \tag{1-17}$$

In Eq. 1-16 and Eq. 1-17, the subscript MM denotes the free energy differences that are calculated by classical force fields. The quantum mechanical contributions (labeled by the subscript QM) to the free energy difference of an ionizable residue in the protein environment and of its reference compound are assumed to be the same:

$$\Delta G_{\mathrm{QM}}(\text{protein-AH} \to \text{protein-A}^-) = \Delta G_{\mathrm{QM}}(\mathrm{AH} \to \mathrm{A}^-) \tag{1-18}$$

Combining the derivations and the assumption above, the difference between the two acid dissociation reaction free energies can be written as:

$$\Delta G_{\mathrm{protein}} - \Delta G_{\mathrm{model}} = \Delta G_{\mathrm{MM}}(\text{protein-AH} \to \text{protein-A}^-) - \Delta G_{\mathrm{MM}}(\mathrm{AH} \to \mathrm{A}^-) \tag{1-19}$$

Thus, subtracting Eq. 1-15 from Eq. 1-14 yields:

$$\mathrm{p}K_{a,\mathrm{protein}} = \mathrm{p}K_{a,\mathrm{model}} + \frac{1}{2.303\, k_B T} \left( \Delta G_{\mathrm{MM}}(\text{protein-AH} \to \text{protein-A}^-) - \Delta G_{\mathrm{MM}}(\mathrm{AH} \to \mathrm{A}^-) \right) \tag{1-20}$$

$\Delta G_{\mathrm{MM}}(\text{protein-AH} \to \text{protein-A}^-)$ and $\Delta G_{\mathrm{MM}}(\mathrm{AH} \to \mathrm{A}^-)$ are computed by MDFE calculations (for example, TI). A more detailed description of the MDFE methodology and of how to compute $\Delta G(\text{protein-AH} \to \text{protein-A}^-)$ and $\Delta G(\mathrm{AH} \to \mathrm{A}^-)$ will be given in the next chapter. An example of using classical force field MDFE calculations to study pKa values is given by Simonson et al. [15]. The pKa values of Asp20 (experimental pKa of 2, which is lower than the intrinsic Asp pKa value) and Asp26 (experimental pKa of 7.5) in thioredoxin, and of Asp14 (with an experimental pKa around 4) in ribonuclease A, were evaluated by TI calculations. The aspartate dipeptide was taken as the model compound; both explicit and implicit water models were used in their simulations.
Proton dissociation was represented by changes in the partial charges of the carboxylic group only. The free energy change caused by the disappearance of the proton van der Waals interaction was not considered because the van der Waals radius of the proton in aspartate is zero in the AMBER force field. Correct protonation free energies were obtained, and entropic and enthalpic effects were also correctly reproduced. However, several problems have also been found with MDFE-based pKa calculations. For example, interactions between ionizable sites cannot be incorporated directly. Furthermore, the free energy differences have shown dependence on the force fields and solvation models. Hybrid quantum mechanical/molecular mechanical (QM/MM) methods can be coupled with free energy simulations [48,93]. Recently, the Cui group has conducted pKa calculations using FEP calculations coupled with the SCC-DFTB method [94,95]. A detailed description of QM/MM free energy calculations of pKa values can be found in a recent review by Kamerlin et al. [48].

1.10 pKa Prediction Using Empirical Methods

Empirical models are also employed to study protein pKa values. According to Lee and Crippen [16], the seemingly most accepted empirical method is PROPKA, which was developed by the Jensen group [96-101]. The PROPKA method involves 30 parameters obtained from 314 residues in 44 proteins. QM calculations and the effective fragment potential (EFP) method [102,103], which is a QM/MM method, are employed to generate those parameters. In the PROPKA method, a pKa value is calculated by adding "perturbations" to the intrinsic pKa value. Three types of perturbations are considered: hydrogen bonding, desolvation effects and charge-charge interactions. A detailed description of the PROPKA method can be found in a review by Jensen et al. [97].

1.11 Constant-pH Molecular Dynamics (Constant-pH MD) Methods

Traditionally, MD simulations have been performed with constant protonation states.
The protonation state of an ionizable residue is assigned before an MD simulation is started. Moreover, the protonation states are not allowed to change during MD propagation. Performing constant protonation state MD simulations requires knowing the pKa values of all ionizable residues beforehand. Not knowing a pKa value may result in a wrong assignment of the protonation state. In addition, if pKa values are near the solution pH value, constant protonation state MD simulations are not able to reflect this situation. More importantly, constant protonation state MD simulations cannot be employed to study the coupling between conformations and protonation states. Thus, constant-pH MD algorithms were developed in order to correlate protein conformation and protonation state [104]. The purpose of constant-pH MD is to describe the protonation equilibrium correctly at a given pH value. Therefore, its applications include pKa predictions and the study of pH effects. One category of constant-pH MD methods uses a continuous protonation parameter [105-115]. Earlier models include a grand canonical MD algorithm developed by Mertz and Pettitt in 1994 [115] and a method introduced by Baptista et al. in 1997 [106]. In the Mertz and Pettitt model, protons are allowed to be exchanged between a titratable side chain and water molecules. Baptista et al. used a potential of mean force to treat protonation and conformation simultaneously. Later, Börjesson and Hünenberger developed a continuous protonation variable model in which the protonation fraction is adjusted by weak coupling to a proton bath, using an explicit solvent [107,108]. More recently, the continuous protonation state model has been further developed by the Brooks group [109-114]. They developed a constant-pH MD algorithm named continuous constant-pH molecular dynamics (CPHMD). In the CPHMD method, Lee et al. [114] applied λ-dynamics [116] to the protonation coordinate and used the Generalized Born (GB) [40,117] implicit solvent model.
They chose a λ variable to control the protonation fraction and introduced an artificial potential barrier between the protonated and deprotonated states. This biasing potential increases the residency time close to the protonated and deprotonated states, and it is centered at the midpoint of titration (λ = 1/2). The CPHMD method was then extended by incorporating an improved GB model and the REMD algorithm for better sampling. The applications of CPHMD and replica exchange CPHMD include predicting pKa values of various proteins [110,114] and studying proton tautomerism [109] and pH-dependent protein dynamics such as folding [112,113] and aggregation [111]. In addition to continuous protonation state models, discrete protonation state methods have also been developed to study the pH-dependence of protein structure and dynamics [118-131]. The discrete protonation state models utilize a hybrid molecular dynamics and Monte Carlo (hybrid MD/MC) method. Protein conformations are sampled by molecular dynamics, and protonation states are sampled periodically during an MD simulation using a Monte Carlo scheme. A new protonation state is selected after a user-defined number of MD steps, and the free energy difference between the old and the new state is calculated. The Metropolis criterion is used to accept or reject the protonation change. Various solvent models and protonation state energy algorithms have been used in discrete protonation state constant-pH MD simulations. Bürgi et al. [130] presented a constant-pH MD method using the discrete protonation state model and applied it to hen egg white lysozyme (HEWL). The lysozyme was solvated in explicit water. Short TI calculations (20 ps of dynamics) were carried out to provide the classical free energy difference between the old and new protonation states at each MC attempt.
The MC move is evaluated based on the following free energy difference:

$$\Delta G = k_B T \ln 10\, (\mathrm{pH} - \mathrm{p}K_{a,\mathrm{ref}}) + \Delta G_{\mathrm{prot,MM}} - \Delta G_{\mathrm{ref,MM}} \tag{1-21}$$

In the above equation, pH is a parameter that represents the pH value of the solution, $\mathrm{p}K_{a,\mathrm{ref}}$ is the pKa value of the model compound (reference compound), and $\Delta G_{\mathrm{prot,MM}}$ and $\Delta G_{\mathrm{ref,MM}}$ are the classical force field proton dissociation free energies given by TI for the protein and the reference compound, respectively. One pitfall of the method developed by Bürgi et al. is the choice of the TI simulation time. The 20 ps TI calculation represents neither a single-structure protonation free energy nor an average over the entire ensemble. The Baptista group proposed a constant-pH MD method that uses the FDPB method to calculate protonation energies, with the MD performed in explicit solvent [118,123-126]. The MD propagation is conducted at fixed protonation states. The MC moves in the protonation states are performed at fixed molecular configurations. The MD propagation generates a conditional probability density function (PDF) of coordinates and momenta given the protonation states, while the MC sampling yields a conditional PDF of protonation states given the molecular configuration. Baptista et al. proved that the hybrid MD and MC method is able to generate an ergodic Markov chain [118]. Hence, the conditional probability distributions yielded by MD and MC are able to generate a joint probability distribution satisfying the semigrand canonical ensemble. The work done by Baptista et al. provides the theoretical justification for combined MD and MC sampling in the discrete protonation state constant-pH methods. In practice, MD simulations are conducted in explicit water to sample conformational space. A new protonation state is selected and the free energy difference is calculated using the structure at that moment and the continuum electrostatic model. The MC transition is evaluated and, if the move is accepted, a short MD run is performed to relax the solvent.
After solvent relaxation, MD steps continue for solute and solvent. The Baptista group applied their constant-pH MD method to the study of the protonation-conformation coupling effect [123], the pH-dependent conformational states of kyotorphin [124], pKa predictions of HEWL [125] and the redox titration of cytochrome c3 [126]. Walczak and Antosiewicz also employed the FDPB method to determine protonation energies, but they used Langevin dynamics to propagate coordinates between MC steps [128]. This method was further extended by Długosz and Antosiewicz [119-122,128]. The extended method combines conventional MD simulation using the analytical continuum electrostatic (ACE) [132] scheme to sample conformations with the FDPB method for the MC moves. Succinic acid [119] and a heptapeptide derived from ovomucoid third domain (OMTKY3) [122] have been studied by Długosz and Antosiewicz. This heptapeptide corresponds to residues 26-32 of OMTKY3 and has the sequence acetyl-Ser-Asp-Asn-Lys-Thr-Tyr-Gly-methylamine. Nuclear magnetic resonance (NMR) experiments indicated that the pKa of Asp is 3.6 [122], 0.4 pKa unit lower than the value of the blocked Asp dipeptide. In their studies, conventional molecular dynamics (MD) simulations were carried out to sample peptide conformations. Their method predicted the pKa to be 4.24. Mongan et al. developed a method combining the GB model and the discrete protonation state model and implemented it in the AMBER simulation suite [127]. In Mongan's method, the GB model is used in the protonation state transition energy calculations as well as in the solvation free energy calculations. Therefore, the solvent models in conformational and protonation state sampling are consistent and the computational cost is small. More recently, the accelerated molecular dynamics (AMD) [133,134] method was combined with Mongan's constant-pH algorithm to enhance conformational sampling [129]. This model has been utilized to calculate pKa values of an enzyme and to explore the protonation-conformation coupling.
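The Monte Carlo step shared by the discrete protonation state methods above can be illustrated for a single titratable site. The sketch below is a toy model under loud assumptions: the MD propagation between attempts is omitted, the force field term `ddg_mm` (standing for the protein minus reference difference) is held constant instead of being recomputed for each structure, the free energy expression is interpreted as the cost of the deprotonated to protonated move, and all function and parameter names are hypothetical.

```python
import math
import random

LN10 = math.log(10.0)

def protonation_dg(ph, pka_ref, ddg_mm=0.0, kT=0.593):
    """Free energy of the deprotonated -> protonated move (assumed sign).

    ddg_mm stands for the force field term of the protein minus that of the
    reference compound; it is zero for the reference compound itself.
    """
    return kT * LN10 * (ph - pka_ref) + ddg_mm

def sample_protonation(ph, pka_ref, n_attempts, ddg_mm=0.0, kT=0.593, seed=0):
    """Return the protonated fraction from Metropolis protonation moves."""
    rng = random.Random(seed)
    dg_prot = protonation_dg(ph, pka_ref, ddg_mm, kT)
    protonated = True
    n_protonated = 0
    for _ in range(n_attempts):
        # Each attempt proposes flipping the current protonation state.
        dg = -dg_prot if protonated else dg_prot
        if dg <= 0.0 or rng.random() < math.exp(-dg / kT):
            protonated = not protonated
        n_protonated += protonated
    return n_protonated / n_attempts
```

Under these assumptions the sampled fraction follows the Henderson-Hasselbalch curve of the reference compound when `ddg_mm` is zero, and a constant `ddg_mm` equal to -kT ln 10 moves the apparent pKa up by one unit, which mirrors the way a force field free energy difference translates into a pKa shift.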
The continuous protonation state model developed by the Brooks group and the discrete protonation state models proposed by Baptista et al. and by Mongan et al. will be further explained in Chapter 2.

CHAPTER 2
THEORY AND METHODS IN MOLECULAR MODELING

Molecular modeling, or molecular simulation, is a way to study molecules using theories developed in physics, chemistry and biology together with computer resources. With the growth of computer power and parallel computation, molecular modeling is increasingly involved in research in biology, chemistry and physics.42 Understanding the underlying theory and methods of molecular modeling is necessary in order to perform simulations and analyze the data generated. In this chapter, the basic theory and methods of constant-pH replica exchange molecular dynamics and of protein pKa calculations are described.

2.1 Potential Energy Functions and Classical Force Fields

2.1.1 Potential Energy Surface

Molecular modeling studies molecules, which in general possess more than one configuration for a given chemical formula. In principle, all possible molecular configurations need to be considered in order to simulate a molecule correctly. A potential energy surface (PES), which is the surface defined by the potential energies of all possible configurations, can be utilized to fulfill this requirement. The concept of a PES is a result of the Born-Oppenheimer approximation. The Born-Oppenheimer approximation states that the electronic relaxation caused by nuclear motion is instantaneous because of the huge difference between the masses of electrons and nuclei. Thus, electronic motion and nuclear motion are decoupled. The electronic energy, which is computed at a fixed nuclear geometry (molecular structure), is the potential energy of the nuclei at that structure. Local minima on the PES indicate stable conformations of a molecule.
Quantum mechanics forms the foundation for understanding molecular behavior and offers the most accurate way to construct a PES. Ideally, the Schrödinger equation is solved for the electronic energy at all possible nuclear configurations, which yields the PES of a molecule.

2.1.2 Force Field Models

Although quantum mechanical calculations generate very accurate energies, performing a molecular simulation with a quantum mechanical method is too time consuming even with parallel computation, especially for large systems such as polymers and proteins. Force field (equivalently, molecular mechanics) models have been designed to solve this problem. Force field models ignore electrons and calculate the potential energy of a system based on nuclear geometry only. Force field calculations are fast because the potential energy functions are simple and parameterized. In a force field model, the potential energy of a system generally has the following contributions: bond stretching (vibration), angle bending, bond rotation (torsion), electrostatic interaction, and van der Waals interaction. The first three contributions are often called the bonded interactions and the last two are the non-bonded interactions. In many force field models, such as the AMBER force field,135 the bond stretching energy between atoms i and j is the second-order truncation of the Taylor expansion of the potential energy function about the equilibrium distance and hence can be formulated as a harmonic potential:

\[ U_{\mathrm{bond}} = k_{ij}\,(r_{ij} - r_{ij,\mathrm{eq}})^2 \tag{2-1} \]

where $k_{ij}$ is the force constant, $r_{ij}$ is the distance between the two atoms and $r_{ij,\mathrm{eq}}$ is the equilibrium distance between the two atoms. One drawback of this function is that a bond cannot be broken: the energy becomes infinite when the two atoms are infinitely far apart. Therefore, such a potential energy function can be applied to bond stretching near the equilibrium distance only.
The simplest remedy is to include higher-order Taylor expansion terms, but this increases the computation time. For example, expansions up to fourth order are adopted in the general organic force field MM3.136 The same Taylor expansion strategy is also employed in deriving angle-bending potential functions. Torsions (or dihedral angles) are periodic and hence a Fourier series is adopted as the torsion potential energy function. One example of the formula for the torsion potential energy is displayed in Eq. 1-9.

The van der Waals interaction in a force field model should reproduce the repulsion and attraction between two particles that have no permanent charges. The attractive interaction is generally called dispersion. Quantum mechanics indicates that the dispersion energy is inversely proportional to the sixth power of the distance between two particles (say, atoms) i and j (under the dipole-dipole interaction approximation):137

\[ U_{\mathrm{dispersion}} = -\frac{b_{ij}}{r_{ij}^{6}} \tag{2-2} \]

where $b_{ij}$ is a constant specific to i and j and $r_{ij}$ is the distance between i and j. There is no theoretical derivation for the repulsive interaction. However, for computational simplicity, the repulsive energy is taken to be inversely proportional to the twelfth power of the distance. A simple way to combine the repulsive and attractive potentials is to add them. Thus, the van der Waals interaction is governed by the Lennard-Jones potential shown in Eq. 1-9. Because the van der Waals interaction decays very fast as a function of interparticle distance, it is often called a "short-range interaction".

The electrostatic interaction is often considered a "long-range interaction". The simplest model of electrostatic interaction is the point-charge model, which is adopted in the AMBER force field. Partial charges are assigned to each atom and Coulomb's law is applied to calculate the interaction energy.
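The three pairwise terms described so far (harmonic bond stretching, 12-6 Lennard-Jones, and point-charge Coulomb) can be sketched as below. This is only an illustrative sketch: the function names, the A/B parameterization of the Lennard-Jones term, and the electrostatic conversion constant of about 332.06 kcal Å/(mol e²) are assumptions, not values or code from the dissertation or from AMBER itself.

```python
# Approximate Coulomb conversion factor in kcal*Angstrom/(mol*e^2);
# an assumed value commonly used with kcal/mol, Angstrom and e units.
COULOMB_KCAL = 332.0637


def bond_energy(r, r_eq, k):
    """Harmonic bond stretching, U = k * (r - r_eq)^2 (no factor of 1/2)."""
    return k * (r - r_eq) ** 2


def lennard_jones_energy(r, a, b):
    """12-6 Lennard-Jones: repulsive a/r^12 plus attractive dispersion -b/r^6."""
    return a / r ** 12 - b / r ** 6


def coulomb_energy(r, q_i, q_j, dielectric=1.0):
    """Point-charge Coulomb interaction between partial charges q_i and q_j."""
    return COULOMB_KCAL * q_i * q_j / (dielectric * r)


if __name__ == "__main__":
    # Hypothetical parameters chosen only to exercise the functions.
    print(bond_energy(r=1.10, r_eq=1.00, k=300.0))      # stretched bond
    print(lennard_jones_energy(r=3.5, a=1.0e6, b=600.0))
    print(coulomb_energy(r=3.0, q_i=0.417, q_j=-0.834))
```

The harmonic term grows without bound as the bond is stretched, which is the limitation noted above, while the Lennard-Jones term has a single minimum of depth b²/(4a) at r = (2a/b)^(1/6).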
More complicated models, such as calculating the electrostatic energy through dipole moment-dipole moment interactions, have also been employed.137

Bond, angle and torsion interactions are coupled. Thus, the coupling effects (cross terms) should be incorporated into force fields. Mathematically, cross terms are generated from multi-dimensional Taylor expansions. For example, angle bending accompanied by two bond-stretching motions (shown in Figure 2-1) is formulated (as in MM3) as:

\[ U_{\mathrm{bond\text{-}angle}} = k_{ijk}\left[(r_{ij} - r_{ij,\mathrm{eq}}) + (r_{ik} - r_{ik,\mathrm{eq}})\right](\theta_{ijk} - \theta_{ijk,\mathrm{eq}}) \tag{2-3} \]

Figure 2-1. A diagram showing bond stretching coupled with angle bending (stretching-bending coupling). A cross term calculating the coupling energy is adopted when evaluating the total potential energy.

A force field is simply a function together with the corresponding parameters. Thus, obtaining parameters is crucial for force field development. Given a potential energy function, the parameters are required to reproduce experimental data or quantum mechanical calculation results as closely as possible.

2.1.3 Protein Force Field Models

Computer simulations of biological molecules often involve thousands of atoms or even more,138 especially when using explicit solvent models. Many simulations of proteins use force fields to reduce computational cost. Popular protein force fields include (but are not limited to) AMBER99SB,139 CHARMM22,140 GROMOS96,141 and the OPLS force fields.142 In general, a simple potential energy function like Eq. 1-9 is employed in protein force fields. Protein force field parameters are in general optimized on the basis of small molecules. Take the AMBER force field (Eq. 1-9) as an example; it contains bonded and non-bonded terms. In the non-bonded terms, the partial charges are fitted to quantum mechanical calculations at the Hartree-Fock/6-31G* level of theory in vacuum.
This level of theory typically overestimates the dipole moment, and hence the resulting partial charges can satisfactorily approximate the condensed-phase charge distribution. The Lennard-Jones parameters have been obtained by reproducing liquid properties, following the work of Jorgensen et al.142 After the partial charges are assigned, the Lennard-Jones parameters are fitted to reproduce experimental data such as heat capacity, liquid density, and heat of vaporization. The bond stretching and angle bending parameters are derived by fitting to structural and vibrational experimental data of the small molecules that make up proteins. The bond and angle parameters should ensure that the geometries of simple protein fragments are close to experimental data. The torsion (dihedral angle) parameters can be obtained from quantum mechanical conformational energy calculations. Determining the torsion parameters is often the last step of force field parameter optimization. Given the previously obtained parameter sets for the other energy terms, the torsion parameters are adjusted to best fit quantum mechanical conformational energies, for example, the Ramachandran plot of a model compound. Detailed descriptions of protein force field parameter determination can be found in the papers of Cornell et al.,143 MacKerell et al.,140 and Hornak et al.139

2.2 Molecular Dynamics (MD) Method

2.2.1 MD Integrator

As mentioned in the introduction, MD samples the phase space by means of the equations of motion. A trajectory in phase space is generated over time. The ergodic hypothesis is assumed to be true; that is, the time average of any property at equilibrium is equivalent to the ensemble average. Thus, given a set of initial positions and momenta and a method to compute forces, an MD simulation can be applied to any system.
For a simple system such as a harmonic oscillator moving along one axis, there exists an analytical solution for the trajectory (the coordinate and momentum can be expressed analytically as functions of time). However, it is almost impossible to find the analytical solution for complex systems such as polymers or proteins. Therefore, numerical integrators are implemented to propagate the positions and velocities of particles. One of the frequently used integrators is the leapfrog algorithm:41,144

\[ q(t + \Delta t) = q(t) + v\!\left(t + \tfrac{1}{2}\Delta t\right)\Delta t \tag{2-4} \]

\[ v\!\left(t + \tfrac{1}{2}\Delta t\right) = v\!\left(t - \tfrac{1}{2}\Delta t\right) + a(t)\,\Delta t \tag{2-5} \]

\[ a(t) = \frac{F(t)}{m} = -\frac{\nabla U(t)}{m} \tag{2-6} \]

Here, q and v stand for the position and velocity of a particle, respectively; a(t), F(t) and U(t) represent the acceleration, the force and the potential energy at time t; m is the mass of the particle; and $\Delta t$ is the time step used in the MD simulation. One frequently employed potential energy function is the force field model introduced in the previous section. According to Eqs. 2-4, 2-5 and 2-6, the leapfrog algorithm propagates positions and velocities in a coupled way, with the velocities evaluated at half steps. The velocity at time t can be calculated from the velocities at $t + \tfrac{1}{2}\Delta t$ and $t - \tfrac{1}{2}\Delta t$ by the following equation:

\[ v(t) = \tfrac{1}{2}\left[v\!\left(t + \tfrac{1}{2}\Delta t\right) + v\!\left(t - \tfrac{1}{2}\Delta t\right)\right] \tag{2-7} \]

One important issue in MD propagation is choosing a time step that balances the speed of propagation against the accuracy of the simulation. Too small a time step wastes simulation time sampling the same conformation, whereas too large a time step can bring two atoms too close together and hence cause instability of the trajectory. In general, the time step is about a tenth of the period of the fastest motion. In biological molecules, the fastest motion is bond stretching, in particular that of bonds involving hydrogen atoms. Thus, one way to increase the time step without reducing accuracy is to remove the degrees of freedom (DOF) with the highest frequency. One commonly employed algorithm to achieve this goal is the SHAKE algorithm.145 When the SHAKE algorithm is used to remove the heavy-atom-to-hydrogen DOF, the heavy-atom-to-hydrogen bond lengths are fixed.
The fixed bond lengths act as distance constraints between heavy atoms and hydrogen atoms. Lagrange multipliers are utilized to keep the bond lengths constant. By employing the SHAKE algorithm, a larger time step such as 2 fs can be used. Methods that integrate the equations of motion more efficiently are a popular area of research.

2.2.2 Thermostats in MD Simulations

Before describing thermostats in MD simulations, the concept of a thermodynamic ensemble (statistical ensemble) should be introduced. An ensemble is a large collection of replicas of the system of interest (it may contain an infinite number of replicas). All replicas in an ensemble are considered at once, and each replica represents the system in one possible state. Thermodynamic ensembles are characterized by macroscopic thermodynamic properties. Several frequently employed thermodynamic ensembles are the microcanonical ensemble (NVE ensemble), the canonical ensemble (NVT ensemble), the isothermal-isobaric ensemble (NPT ensemble), and the grand canonical ensemble.

MD simulations are governed by Newton's second law. An MD simulation therefore conserves the total energy and represents a system in the microcanonical (NVE) ensemble, where the number of particles (N), the volume (V), and the total energy (E) are constant. However, our system of interest is in the canonical (NVT) ensemble, in which the number of particles (N), the volume (V), and the temperature (T) are constant. Therefore, maintaining a constant temperature in an MD simulation is necessary. Any algorithm that maintains a constant temperature and approximates the NVT ensemble is called a thermostat. Popular thermostats include the Berendsen thermostat,146 Langevin dynamics147 and the Nosé-Hoover thermostat.148 The Berendsen thermostat and Langevin dynamics are utilized in our MD simulations and are thus explained here.
In an MD simulation, the temperature can be written as:

\[ T = \frac{1}{(3N - n)\,k_B}\sum_{i=1}^{N} m_i v_i^{2} \tag{2-8} \]

Here N is the number of particles, n is the number of constrained degrees of freedom, and $m_i$ and $v_i$ are the mass and velocity of particle i. Thus, the temperature is a function of the velocities of all particles. The simplest way to control the temperature is to rescale the velocities at each time step. However, this causes discontinuities in the momentum trajectory in phase space. Berendsen et al. introduced a method that weakly couples the MD simulation to an external heat bath. The heat bath can add heat to or remove heat from the system in order to maintain a constant temperature. The rate of temperature change is governed by Eq. 2-9:

\[ \frac{dT(t)}{dt} = \frac{1}{\tau_T}\left(T_0 - T(t)\right) \tag{2-9} \]

where $T_0$ is the temperature of the bath and $\tau_T$ is the coupling time, which indicates the time scale on which the system relaxes to the target value. By employing a coupling time, the MD propagation avoids sudden changes in the velocities. Since the temperature is computed from the velocities of all the atoms, what the Berendsen thermostat really does is multiply all velocities by a scaling factor $\lambda$ (shown in Eq. 2-10) in order to move the current temperature T toward the target value $T_0$:

\[ \lambda = \left[1 + \frac{\Delta t}{\tau_T}\left(\frac{T_0}{T} - 1\right)\right]^{1/2} \tag{2-10} \]

By rescaling velocities, the Berendsen thermostat controls the temperature in MD simulations. As mentioned before, the coupling time $\tau_T$ determines how tightly the system and the heat bath are coupled. A large $\tau_T$ means the coupling is weak, and it takes a long time for the system to relax from the current temperature to the target temperature. As $\tau_T \to \infty$, the internal energy is conserved and the microcanonical ensemble is restored. If $\tau_T$ is small, the coupling between the system and the heat bath is strong and the velocity scaling factor deviates strongly from one. However, a scaling factor far from one causes a large disruption in the momentum part of the phase space trajectory. The further the scaling factor is from one, the less natural the trajectory is.
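To make the leapfrog update of Section 2.2.1 and the Berendsen velocity scaling factor concrete, the following one-dimensional sketch implements both. All names, the unit-free harmonic oscillator test system, and the chosen time step are illustrative assumptions; a production MD code would operate on arrays of atoms and would compute the temperature from all velocities.

```python
import math


def leapfrog_step(q, v_half, force, mass, dt):
    """Advance one leapfrog step.

    q      : position at time t
    v_half : velocity at time t - dt/2
    Returns the position at t + dt and the velocity at t + dt/2.
    """
    a = force(q) / mass              # a(t) = F(t) / m
    v_new = v_half + a * dt          # v(t + dt/2) = v(t - dt/2) + a(t) dt
    q_new = q + v_new * dt           # q(t + dt)   = q(t) + v(t + dt/2) dt
    return q_new, v_new


def berendsen_scale(t_current, t_target, dt, tau):
    """Berendsen velocity scaling factor for coupling time tau."""
    return math.sqrt(1.0 + (dt / tau) * (t_target / t_current - 1.0))


def max_amplitude(n_steps=10000, dt=0.01):
    """Propagate a unit harmonic oscillator (k = m = 1) and return max |q|."""
    q, v = 1.0, 0.0
    largest = abs(q)
    for _ in range(n_steps):
        q, v = leapfrog_step(q, v, lambda x: -x, 1.0, dt)
        largest = max(largest, abs(q))
    return largest


if __name__ == "__main__":
    print("max |q| over trajectory:", max_amplitude())
    # Weak coupling (tau >> dt) gives a factor close to one.
    print("scaling factor:", berendsen_scale(290.0, 300.0, dt=0.002, tau=1.0))
```

With a coupling time much larger than the time step the scaling factor stays very close to one, which is the weak-coupling regime described above; setting the coupling time equal to the time step reduces the scheme to direct velocity rescaling.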
Langevin dynamics belongs to the category of stochastic thermostats.137 It mimics the Brownian motion of a particle. Instead of Newton's second law, the equation of motion of the MD method when using a stochastic thermostat becomes:

\[ \frac{dv_i}{dt} = -\frac{1}{m_i}\frac{\partial U}{\partial q_i} - \gamma v_i + A_i(t) \tag{2-11} \]

In Eq. 2-11, $v_i$, $q_i$ and $m_i$ are the velocity, position and mass of particle i, respectively, U is the potential energy, $\gamma$ is the friction coefficient and $A_i(t)$ is a random force at time t. The amplitude of this force is determined by the fluctuation-dissipation theorem (Eq. 2-12):

\[ \langle A_i(t_1)\,A_j(t_2)\rangle = 2\gamma k_B T\,\delta_{ij}\,\delta(t_1 - t_2) \tag{2-12} \]

Here $\langle A_i(t_1)A_j(t_2)\rangle$ is the time correlation of A on particle i at time $t_1$ with A on particle j at time $t_2$, $\gamma$ is the friction coefficient, $k_B$ is the Boltzmann constant, T is the temperature, $\delta_{ij}$ is the Kronecker delta and $\delta(t_1 - t_2)$ is the Dirac delta function. Langevin dynamics can be used as a thermostat because the equation of motion depends on the temperature via the random force term.

2.2.3 Pressure Control in MD Simulations

Most biological experiments are performed at constant pressure and constant temperature (NPT ensemble). Therefore, pressure control techniques (barostats) should be used in simulations to maintain the system pressure, which is done by adjusting the system volume. Since the number of particles is constant during a simulation, maintaining the pressure also regulates the system density, which should stay at an appropriate value. A generally employed barostat is the Berendsen barostat.146 The pressure of a system in a simulation is calculated using the virial theorem of Clausius and can be expressed as:

\[ P = \frac{1}{V}\left[N k_B T - \frac{1}{3}\sum_{i}\sum_{j>i} r_{ij}\,\frac{dv(r_{ij})}{dr_{ij}}\right] \tag{2-13} \]

In the above equation, P is the pressure, V is the volume, N is the number of particles, and T is the temperature. $r_{ij}$ and $v(r_{ij})$ are the distance and the interaction energy between atoms i and j, respectively.
Analogous to temperature control, the pressure can be maintained simply by rescaling the volume at each time step, although this disrupts the system volume too much. The Berendsen barostat was developed in order to smooth the change in volume. The Berendsen barostat, whose algorithm is the same as that of the Berendsen thermostat, utilizes a pressure bath. The rate of pressure change is governed by the following equation:

\[ \frac{dP(t)}{dt} = \frac{1}{\tau_P}\left(P_0 - P(t)\right) \tag{2-14} \]

where $\tau_P$ is the coupling time and $P_0$ is the pressure of the bath. The change in pressure is achieved by adjusting the system volume. The coordinates of all particles in the system are scaled by a factor $\lambda^{1/3}$, where $\lambda$ is formulated as:

\[ \lambda = 1 - \kappa\,\frac{\Delta t}{\tau_P}\left(P_0 - P\right) \tag{2-15} \]

The $\kappa$ in the above equation is the isothermal compressibility. It represents the volume change caused by a change in pressure:

\[ \kappa = -\frac{1}{V}\left(\frac{\partial V}{\partial P}\right)_T \tag{2-16} \]

2.3 Monte Carlo (MC) Method

2.3.1 Canonical Ensemble and Configuration Integral

In statistical mechanics, an ensemble is a collection of a very large number of systems, each of which is a replica (on a thermodynamic level) of a particular thermodynamic system of interest. If the thermodynamic system of interest has volume V, number of particles N and temperature T, then an ensemble containing a very large number of such systems is called the canonical ensemble. The canonical ensemble is important because it best represents systems of interest in practice. Because each system of the canonical ensemble is not isolated, the energy of each system is not fixed. Thus, there is a probability of finding a system with energy $E_i$, and the probability distribution of systems in the canonical ensemble is the so-called Boltzmann distribution (Eq. 2-17):

\[ p_i = \frac{1}{Q}\,e^{-E_i/(k_B T)} \tag{2-17} \]

Here Q is the partition function, which is essentially a normalization factor, and $E_i$ is the quantum energy of a system.
\[ Q = \sum_i e^{-E_i/(k_B T)} \tag{2-18} \]

In classical mechanics, the Hamiltonian function H is employed to describe the total energy of a system and can be expressed as H(p, q), where p and q are the momenta and positions, respectively. In general, the Hamiltonian can be separated into a kinetic energy, which depends only on the momenta, and a potential energy, which depends only on the positions. In addition to the Hamiltonian replacing the quantum energy, the energy levels become continuous in the classical limit. Thus, the partition function is written as an integral:

\[ Q = \iint e^{-\beta H(p,q)}\,dp\,dq \tag{2-19} \]

Here $\beta = 1/(k_B T)$. After integrating over the kinetic energy term, the partition function has the form of Eq. 2-20 and is called the configuration integral:

\[ Z = \int e^{-\beta U(q)}\,dq \tag{2-20} \]

Thus, the Boltzmann distribution in the classical limit is given by Eq. 2-21:

\[ P(q) = \frac{e^{-\beta U(q)}}{Z} \tag{2-21} \]

2.3.2 Markov Chain Monte Carlo (MCMC)

The definition of a Markov chain is crucial to the MCMC methods, so it is explained first in this section. Consider a stochastic process at discrete steps $(t_1, t_2, \ldots)$ for a system that has a finite set of states $(S_1, S_2, \ldots)$. We say that the system is in state $X_t$ at step t. The conditional probability of $X_{t_n} = S_j$, given that $X_{t_{n-1}}$ is in state $S_i$, and so on, is:

\[ P(X_{t_n} = S_j \mid X_{t_{n-1}} = S_i, X_{t_{n-2}} = S_k, \ldots, X_{t_1} = S_h) \tag{2-22} \]

A Markov process is a process of the kind described by Eq. 2-22 with the property that the conditional probability of $X_{t_n} = S_j$ depends only on the previous state $X_{t_{n-1}} = S_i$:

\[ P(X_{t_n} = S_j \mid X_{t_{n-1}} = S_i, X_{t_{n-2}} = S_k, \ldots, X_{t_1} = S_h) = P(X_{t_n} = S_j \mid X_{t_{n-1}} = S_i) \tag{2-23} \]

The corresponding sequence of states $(X_1, X_2, \ldots)$ is called a Markov chain. The conditional probability $P(X_{t_n} = S_j \mid X_{t_{n-1}} = S_i)$ is essentially the transition probability from state $S_i$ to $S_j$ and is denoted $w(i \to j)$. From probability theory, a transition probability has the properties $w(i \to j) \ge 0$ and $\sum_j w(i \to j) = 1$.
Thus, the probability of $X_{t_n} = S_j$ can be written as:

\[ P(X_{t_n} = S_j) = \sum_i P(X_{t_n} = S_j \mid X_{t_{n-1}} = S_i)\,P(X_{t_{n-1}} = S_i) = \sum_i w(i \to j)\,P(X_{t_{n-1}} = S_i) \tag{2-24} \]

The change in $P(X_{t_n} = S_i)$ with respect to the step is governed by the master equation:

\[ \frac{dP(X_{t_n} = S_i)}{dt} = \sum_j w(j \to i)\,P(X_{t_n} = S_j) - \sum_j w(i \to j)\,P(X_{t_n} = S_i) \tag{2-25} \]

At equilibrium (or under the steady-state approximation), $P(X_{t_n} = S_i)$ should not change with the step. This leads to:

\[ \sum_j w(j \to i)\,P(X_{t_n} = S_j) = \sum_j w(i \to j)\,P(X_{t_n} = S_i) \tag{2-26} \]

Since the Markov chain introduced above possesses a discrete and finite number of states, the transition probabilities can be arranged as a matrix, which is called the transition matrix. The (i, j)th element of the transition matrix represents $w(i \to j)$. The probability distribution can be represented by a row vector. Multiplying a probability distribution by the transition matrix generates a new probability distribution. If a Markov chain is time-homogeneous (where "time" essentially means the step index, owing to the stochastic nature of a Markov chain), the elements of the transition matrix are constants (time-independent). When a probability distribution vector is not changed by multiplication with the transition matrix, the distribution is said to be stationary. The equilibrium distribution is a (left) eigenvector of the transition matrix with an eigenvalue of 1. Hence, multiplying the equilibrium probability distribution by the transition matrix does not change it.

Properties of a Markov chain include the following: a Markov chain is irreducible if all states communicate with each other; a Markov chain is called aperiodic if the number of steps needed to return to a state is not restricted to multiples of some period greater than one; and it is positive recurrent if the expectation value of the return time to a state is finite. These properties are closely related to the ergodicity of a Markov chain.
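The statement that the stationary distribution is left unchanged by the transition matrix can be checked numerically with a small example. The two-state transition matrix below is hypothetical and chosen only for illustration; the sketch repeatedly multiplies a row vector by the matrix until the distribution stops changing.

```python
def propagate(p, w):
    """One Markov step: new_p[j] = sum_i p[i] * w[i][j] (row vector times matrix)."""
    n = len(p)
    return [sum(p[i] * w[i][j] for i in range(n)) for j in range(n)]


def stationary_distribution(w, n_iter=500):
    """Approximate the stationary distribution by repeated propagation,
    starting from a uniform distribution."""
    n = len(w)
    p = [1.0 / n] * n
    for _ in range(n_iter):
        p = propagate(p, w)
    return p


# Hypothetical two-state chain; each row sums to one.
W = [[0.9, 0.1],
     [0.5, 0.5]]

if __name__ == "__main__":
    pi = stationary_distribution(W)
    print("stationary distribution:", pi)
    print("after one more step:    ", propagate(pi, W))
```

For this matrix the balance condition between the two states gives probabilities of 5/6 and 1/6. The chain is irreducible and aperiodic, so the iteration converges to this distribution from any starting vector.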
The MCMC methods are Monte Carlo samplings from a probability distribution that employ a Markov chain whose equilibrium probability distribution is the intended probability distribution. The states sampled by the Monte Carlo method form a Markov chain. The transitions in MCMC must satisfy the detailed balance equation:

\[ w(j \to i)\,P(X_{\mathrm{eq}} = S_j) = w(i \to j)\,P(X_{\mathrm{eq}} = S_i) \tag{2-27} \]

A Markov chain is said to be reversible when it satisfies the detailed balance equation.

2.3.3 The Metropolis Monte Carlo Method

In 1953, Metropolis et al.50 proposed an algorithm to sample the phase space of a system at equilibrium by the MC method. According to the Metropolis algorithm, at configuration i a new configuration j is chosen, both configurations are weighted by the Boltzmann distribution (Eq. 2-21), and the detailed balance condition (Eq. 2-27) is employed to evaluate the transitions (MC moves) between configurations:

\[ P(i)\,w(i \to j) = P(j)\,w(j \to i) \tag{2-28} \]

In the above equation, P(i) is the Boltzmann weight of configuration i and $w(i \to j)$ is the transition probability from configuration i to j. Inserting Eq. 2-21 into Eq. 2-28 and rearranging yields:

\[ \frac{w(i \to j)}{w(j \to i)} = \frac{P(j)}{P(i)} = e^{-\beta\,(U(j) - U(i))} = e^{-\Delta} \tag{2-29} \]

where $\Delta = \beta\,[U(j) - U(i)]$. The transition probability from configuration i to j can then be written as:

\[ w(i \to j) = \min\{1, e^{-\Delta}\} \tag{2-30} \]

In practice, the new configuration is accepted if $\Delta < 0$. If $\Delta > 0$, a random number between zero and one is generated and compared with $e^{-\Delta}$. If the random number is less than or equal to $e^{-\Delta}$, the new configuration is accepted. Otherwise, the current configuration is kept and is added to the configuration ensemble again. This accept/reject criterion is the so-called Metropolis criterion. MC sampling with the Metropolis criterion generates a Markov chain whose equilibrium PDF is the Boltzmann distribution.
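The Metropolis criterion described above can be sketched for a one-dimensional system as follows. The harmonic test potential, the uniform trial displacement, the fixed random seed, and all function names are assumptions made for illustration and are not part of the original algorithm description.

```python
import math
import random


def metropolis_accept(delta, u):
    """Metropolis criterion.

    delta : beta * (U_new - U_old)
    u     : uniform random number in [0, 1)
    """
    return delta <= 0.0 or u <= math.exp(-delta)


def metropolis_sample(potential, x0, beta, n_steps, max_disp, seed=0):
    """Generate a Markov chain of positions distributed as exp(-beta * U(x))."""
    rng = random.Random(seed)
    x, u_x = x0, potential(x0)
    chain = []
    for _ in range(n_steps):
        x_trial = x + rng.uniform(-max_disp, max_disp)   # symmetric trial move
        u_trial = potential(x_trial)
        if metropolis_accept(beta * (u_trial - u_x), rng.random()):
            x, u_x = x_trial, u_trial
        # A rejected move counts the current configuration again.
        chain.append(x)
    return chain


if __name__ == "__main__":
    chain = metropolis_sample(lambda x: 0.5 * x * x, 0.0, beta=1.0,
                              n_steps=100000, max_disp=1.5)
    mean_x2 = sum(x * x for x in chain) / len(chain)
    print("<x^2> estimate:", mean_x2)
```

For the harmonic potential U(x) = x²/2 at β = 1, the Boltzmann distribution is a unit Gaussian, so the sampled average of x² should approach one as the chain grows. Note that the temperature enters only through β in the acceptance test, with no thermostat required.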
Comparing Metropolis MC with MD, the MC method simulates a system in the canonical ensemble without any explicit temperature control. The bottleneck of MC sampling is the potential energy difference between configurations, while the bottleneck of MD is the energy barrier.

2.3.4 Ergodicity and the Ergodic Hypothesis

In statistical mechanics, ergodic (the adjective form of ergodicity) describes a system that satisfies the ergodic hypothesis. The ergodic hypothesis states that, over a long period of time, the time average and the ensemble average of a property are the same. In our simulations, the ergodic hypothesis is often assumed to be true. Ergodicity breaking (the ergodic hypothesis does not hold) often means that the system is trapped in a local region of the phase space. One example in which the ergodic hypothesis does not hold is the spontaneous magnetization of a ferromagnetic system below the Curie temperature. The ensemble average of the net magnetization is zero, since spin up and spin down are degenerate states and the populations of the two states should be the same. However, a net magnetization exists when the temperature is below the Curie temperature. Ergodicity is also often discussed for Markov chains. A Markov chain is called ergodic when it is irreducible, aperiodic and positive recurrent.

2.4 Solvent Models

Because proteins are stable and perform their functions in the condensed phase, especially in aqueous solution, representing the solvation effect is of great importance. Water is the solvent most frequently modeled in MD simulations. Two ways of representing aqueous solution are presented here: the explicit and the implicit solvent models. As its name indicates, the explicit water model employs water molecules in the simulation, whereas the implicit water model treats water as a dielectric continuum.

2.4.1 Explicit Solvent Model

Different water models such as SPC/E,149 TIP3P,150 and TIP4P150 have been developed.
Water model parameters are fitted to bulk water properties such as density, heat of vaporization, and dipole moment.150 The density of liquid water is an important physical quantity for checking water models. The density of liquid water shows a maximum at 4 °C, and water models should reflect this correctly. TIP3P fails to do so, while TIP4P, TIP5P151 and their variants are able to reproduce this trend.

Take the TIP3P and TIP4P water models as examples. A simple diagrammatic description of the TIP3P and TIP4P water models is shown in Figure 2-2. The TIP3P water model has one oxygen atom and two hydrogen atoms. The geometry of TIP3P water is the same as the experimental geometry, with an O-H bond length of 0.9572 Å and an H-O-H angle of 104.52°. Only the oxygen atom has a van der Waals radius. Thus, the van der Waals interactions occur only between oxygen atoms. Partial charges are placed on the oxygen atom and the hydrogen atoms. The partial charge on the oxygen atom is -0.834e and the partial charge on each hydrogen atom is +0.417e, where e is the magnitude of the electron charge. When computing the interactions (Coulomb and Lennard-Jones) between two TIP3P water molecules, 3 x 3 = 9 distances need to be calculated.

The TIP4P water model, as its name implies, has four sites. As in the TIP3P water model, the experimental geometry (bond length and bond angle) is adopted in the TIP4P model. Oxygen is again the only atom in the TIP4P molecule that has a van der Waals interaction. However, in the TIP4P model the negative partial charge is located on the fourth site instead of on the oxygen atom, as it is in the TIP3P model. Using a fourth site that carries the negative charge improves the electrostatic properties of water such as the dipole moment. The positive partial charges are still placed on the hydrogen atoms. The new partial charges are -1.04e on the fourth site and +0.52e on each hydrogen atom.
New Lennard-Jones potential parameters have also been employed for the TIP4P water model to achieve a better fit. Computing the interactions between a pair of TIP4P molecules requires 9 distances for the electrostatic interactions and 1 distance for the Lennard-Jones potential. Therefore, using the TIP4P model in a simulation is computationally more expensive than using the TIP3P model. For a five-site water model such as TIP5P, 17 distances are needed in order to calculate the water-water interactions.

When simulating a molecule with explicit water molecules, the periodic boundary condition (PBC) is utilized in order to mimic the bulk environment.152 Otherwise, water molecules would evaporate into vacuum. Ewald summation153 or Particle-Mesh Ewald (PME) summation154 is employed to compute the long-range electrostatics efficiently when the PBC is employed. One advantage of employing the explicit water model is that specific solvent-solute interactions can be represented. For example, studying the hydrogen bonding between water molecules and proteins requires the explicit water model. However, the explicit model suffers from high computational cost, since CPU time is approximately proportional to the number of interatomic interactions.

Figure 2-2. A diagrammatic description of the TIP3P and TIP4P water models. A) TIP3P model. The red circle is the oxygen atom (q = -0.834e) and the black circles are the hydrogen atoms (q = +0.417e each). The experimental bond length and bond angle are adopted. B) TIP4P model. Oxygen and hydrogen atoms are colored as in the TIP3P model, and each hydrogen atom carries q = +0.52e. The TIP4P model also employs the experimental O-H bond length and H-O-H bond angle. The fourth site (green circle), which carries the negative partial charge, has been added in the TIP4P model.

2.4.2 The Poisson-Boltzmann (PB) Implicit Solvent Model

An alternative way of representing the solvation effect is to reproduce the PES of a molecule after it is dissolved in solvent.
The solution-phase potential energy of a molecule can be computed by adding the solvation free energy to the gas-phase potential energy. Given the correct solution-phase PES, correct forces can be generated for the equations of motion. Thus, the key issue is finding an accurate free energy of solvation. A dielectric continuum model can be employed to calculate the free energy of solvation. In the dielectric continuum model, the free energy (work) of assembling a charge distribution is expressed as:

\[ G = \frac{1}{2}\int \rho(\mathbf{r})\,\phi(\mathbf{r})\,d\mathbf{r} \tag{2-31} \]

Here $\rho(\mathbf{r})$ is the charge density of the molecule and $\phi(\mathbf{r})$ is the electrostatic potential. The Poisson-Boltzmann model utilizes the Poisson-Boltzmann equation to describe the electrostatic potential as a function of the charge density. In practice, the linearized PB equation (Eq. 2-32), which uses the first-order truncation of the Taylor series expansion of the hyperbolic sine, is often employed:

\[ \nabla\cdot\left[\varepsilon(\mathbf{r})\,\nabla\phi(\mathbf{r})\right] = -4\pi\rho(\mathbf{r}) + \varepsilon(\mathbf{r})\,\lambda(\mathbf{r})\,\kappa^{2}\,\phi(\mathbf{r}) \tag{2-32} \]

In the above equation, $\varepsilon$ is the dielectric constant, $\lambda$ is a switching function that is zero where the electrolyte is inaccessible and one otherwise, and $\kappa^{2}$ is the Debye-Hückel parameter.

For simple cases such as spherical charge distributions, the solutions to the PB equation are analytical and simple. Consider dissolving a sphere of charge q and radius a, with the charge uniformly distributed on the surface. The charge density on the surface can be expressed as:

\[ \rho(\mathbf{x}) = \frac{q}{4\pi a^{2}} \tag{2-33} \]

Here $\mathbf{x}$ is any point on the surface. Outside the sphere, the electrostatic potential at distance r from the center is:

\[ \phi(r) = \frac{q}{\varepsilon r} \tag{2-34} \]

Evaluating the right-hand side of Eq. 2-31 with Eq. 2-33 and Eq. 2-34 yields $G = q^{2}/(2\varepsilon a)$. The free energy of solvation is the difference between the solution-phase and gas-phase free energies. Thus, it can be written as:

\[ \Delta G_{\mathrm{solv}} = -\frac{q^{2}}{2a}\left(1 - \frac{1}{\varepsilon}\right) \tag{2-35} \]

This is the so-called Born equation and is the basis of the generalized Born (GB) method, which will be introduced later.
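As a worked numerical instance of the Born equation, the sketch below evaluates the solvation free energy of a single spherical ion. The conversion constant of about 332.06 kcal Å/(mol e²), the water dielectric constant of 80, and the 2 Å radius are illustrative assumptions, not values taken from this work.

```python
# Approximate Coulomb conversion factor in kcal*Angstrom/(mol*e^2);
# an assumed value for use with charges in e and radii in Angstrom.
COULOMB_KCAL = 332.0637


def born_solvation_energy(charge, radius, eps_solvent, eps_reference=1.0):
    """Born solvation free energy (kcal/mol) of a charged sphere.

    Transfers the sphere from a medium with dielectric eps_reference
    (vacuum by default) to one with dielectric eps_solvent:
        dG = -(q^2 / 2a) * (1/eps_reference - 1/eps_solvent)
    """
    return (-0.5 * COULOMB_KCAL * charge ** 2 / radius
            * (1.0 / eps_reference - 1.0 / eps_solvent))


if __name__ == "__main__":
    # Hypothetical monovalent ion with a 2 Angstrom radius in a
    # water-like continuum (dielectric constant assumed to be 80).
    print(born_solvation_energy(charge=1.0, radius=2.0, eps_solvent=80.0))
```

The energy scales with the square of the charge and inversely with the radius, so small, highly charged groups are the most strongly stabilized by a high-dielectric solvent. This is why burying an ionizable group in a low-dielectric protein interior shifts its pKa.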
For complex systems such as proteins, there is no analytical solution to the linearized PB equation.73 Therefore, the equation is solved iteratively until self-consistency is achieved between the charge density and the electrostatic potential.

2.4.3 The Generalized Born (GB) Implicit Solvent Model
Solving the linearized PB equation is computationally expensive. The GB method has been proposed as an approximation to the PB implicit solvent model.39,117 Using the GB implicit solvent can greatly shorten the simulation time, which is why GB is frequently employed in molecular simulations. Similar to Eq. 2-35, the free energy of solvation in the GB method is given by:
$$\Delta G_{solv} = -\frac{1}{2}\left(1 - \frac{1}{\varepsilon}\right)\sum_{i,j}\frac{q_i q_j}{f_{GB}} \tag{2-36}$$
Here $q_i$ and $q_j$ are the charges on atoms i and j. $f_{GB}$ is calculated by:
$$f_{GB} = \left[r_{ij}^2 + \alpha_i\alpha_j \exp\left(-\frac{r_{ij}^2}{4\alpha_i\alpha_j}\right)\right]^{1/2} \tag{2-37}$$
Here $\alpha_i$ is the effective Born radius of charge $q_i$, and $r_{ij}$ is the distance between the two charges. Another approximation in the GB method is the Coulomb field approximation.40 This approximation estimates the effective Born radius by integrating the energy density of a Coulomb field over the molecular volume. The integral is often evaluated numerically. One should notice that the GB theory involves two approximations to reproduce the PB results. The first consists of Eq. 2-36 and Eq. 2-37. The second is the Coulomb field approximation. In practice, further approximations are often introduced to reduce the time spent computing the effective Born radii. The pairwise approximation155 is often applied. In this approximation, the effective Born radius of an atom is computed from its van der Waals radius and a function that depends on the positions and van der Waals radii of atom pairs.

2.5 pKa Calculation Methods
2.5.1 The Continuum Electrostatic (CE) Model
The basic idea of the CE model is also given in Figure 1-6.
Since computing the pKa value of an ionizable residue in a protein directly is difficult (it involves breaking a bond and dissolving all species in water), a model compound is utilized and the pKa shift is calculated via the thermodynamic cycles shown in Figure 1-7 and Figure 1-8. Like the MDFE calculations, the CE model computes the pKa value of an ionizable residue relative to its intrinsic value (or model compound value, according to the definition of Bashford and Karplus; the definition of the intrinsic pKa can be found in Section 1.3). The pKa value of an ionizable residue is written as:
$$pK_a(\text{protein}) = pK_a(\text{model}) + \frac{1}{2.303\,k_BT}\left[\Delta G_{AH\to A}(\text{protein}) - \Delta G_{AH\to A}(\text{model})\right] \tag{2-38}$$
In the above equation, pKa(model) is the intrinsic pKa value of an ionizable residue and can be found in Table 1-1. $\Delta G_{AH\to A}(\text{protein})$ and $\Delta G_{AH\to A}(\text{model})$ are the free energy differences between the protonated and deprotonated species for that ionizable residue and for its reference compound, respectively. The reference compound utilized in the CE model is an isolated ionizable residue with both ends capped and fully exposed to the aqueous environment. Eq. 2-38 is essentially the same as Eq. 1-20. The difference between MDFE methods and the CE model lies in how the free energy differences between the protonated and deprotonated species on the right-hand side of Eq. 1-20 are generated. MDFE methods compute the two free energy differences via free energy calculation algorithms, while the CE model calculates them via the FDPB method. In this continuum electrostatic model, proteins are treated as low-dielectric regions surrounded by a high-dielectric continuum representing water. Protonation is represented by adding a unit charge to the ionizable site. In the continuum electrostatic model, $\Delta G_{AH\to A}(\text{protein})$ and $\Delta G_{AH\to A}(\text{model})$ are assumed to differ only in their electrostatic contributions. This assumption results in the cancellation of the non-electrostatic free energy contributions.
Thus, one needs to calculate the electrostatic work of charging a site from zero to unit charge, both in the ionizable residue and in the reference compound. For any ionizable site in a fixed protein structure, this electrostatic work consists of three terms: the Born solvation free energy ($\Delta G_{Born}$), the background free energy, which is the interaction of the ionizable site with non-ionizable charges ($\Delta G_{back}$), and the interaction with other ionizable sites ($\Delta G_{interact}$). For the reference compound, only the first two terms exist. Thus, $\Delta G_{AH\to A}(\text{protein})$ can be written as:
$$\Delta G_{AH\to A}(\text{protein}) = \Delta G_{Born}(\text{protein}) + \Delta G_{back}(\text{protein}) + \Delta G_{interact}(\text{protein}) \tag{2-39}$$
And $\Delta G_{AH\to A}(\text{model})$ can be written as:
$$\Delta G_{AH\to A}(\text{model}) = \Delta G_{Born}(\text{model}) + \Delta G_{back}(\text{model}) \tag{2-40}$$
The linearized PB equation (described in Section 2.4.2) is solved for the electrostatic potentials using the finite difference method. For an ionizable site i, the Born solvation free energy is determined by Eq. 2-35. The background free energy is calculated using Eq. 2-41:
$$\Delta G_{back} = \sum_k q_i\,q_k\,\phi(\mathbf{r}_i, \mathbf{r}_k) \tag{2-41}$$
Here $q_k$ is a non-ionizable partial charge and $\phi(\mathbf{r}_i, \mathbf{r}_k)$ stands for the electrostatic potential produced at $\mathbf{r}_k$ by a unit charge placed at $\mathbf{r}_i$. The electrostatic interaction with other ionizable sites can also be evaluated by Eq. 2-41, except that the charges on the ionizable sites must be used. After computing all components on the right-hand sides of Eq. 2-38 and Eq. 2-39, the pKa of ionizable residue i is obtained. To produce a titration curve, consider a protein containing N ionizable residues. Each ionizable residue has two states: protonated and deprotonated. Thus, there are $2^N$ macrostates for that protein. Each macrostate can be represented by a vector $\mathbf{x} = (x_1, x_2, \ldots, x_N)$, whose element $x_i$ is 0 or 1 according to whether ionizable site i is deprotonated or protonated.
The free energy of $\mathbf{x}$ relative to the vector whose components are all zero (this is equivalent to the free energy change when charging the nonzero components of the vector) is given by Eq. 2-42:
$$\Delta G(\mathbf{x}) = \sum_{i=1}^{N} x_i\,\Delta G_i + \frac{1}{2}\sum_{i=1}^{N}\sum_{j\neq i} W_{ij}\,(q_i^0 + x_i)(q_j^0 + x_j) \tag{2-42}$$
Here $\Delta G_i = \Delta G_{Born}(\text{protein}) + \Delta G_{back}(\text{protein}) - \Delta G_{Born}(\text{model}) - \Delta G_{back}(\text{model})$ for ionizable site i, $W_{ij}$ is the electrostatic interaction between unit charges at ionizable sites i and j, and $q_i^0$ is the charge of site i when it is in the deprotonated state. Thus, $\theta_i$, the fraction of protonation of site i, can be written as (Eq. 2-43):
$$\theta_i = \frac{\sum_{\mathbf{x}} x_i\,\exp\left[-\beta\Delta G(\mathbf{x}) - 2.303\,\nu(\mathbf{x})\,\text{pH}\right]}{\sum_{\mathbf{x}} \exp\left[-\beta\Delta G(\mathbf{x}) - 2.303\,\nu(\mathbf{x})\,\text{pH}\right]} \tag{2-43}$$
Here $\beta = 1/k_BT$ and $\nu(\mathbf{x})$ is the number of nonzero components in $\mathbf{x}$. Summing the individual $\theta_i$ generates the titration curve of the entire protein.

2.5.2 Free Energy Calculation Methods
As mentioned previously, the pKa value is proportional to the standard free energy of reaction. Therefore, free energy calculation methods can be employed to compute the pKa value of the ionizable residue of interest. In this section, two frequently used free energy calculation methods are described: thermodynamic integration (TI)156,157 and free energy perturbation (FEP).158 Both TI and FEP belong to the so-called "slow-growth" or equilibrium methods and can be employed to compute the free energy difference between two states. In other words, each transition should be reversible. In the TI method, the initial state A (having potential energy $U_A(q)$, where q is the molecular structure) and the final state B (having potential energy $U_B(q)$) are connected by a reaction coordinate $\lambda$ (this reaction coordinate does not necessarily have any physical significance). The simplest scheme for constructing the potential energy as a function of $\lambda$ is:
$$U(\lambda) = (1 - \lambda)\,U_A + \lambda\,U_B \tag{2-44}$$
Slowly transforming $\lambda$ from zero to one converts state A into state B; the intermediate values of $\lambda$ correspond to a mixed system without physical meaning.
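The Helmholtz free energy A in the canonical ensemble (or the Gibbs free energy G in the isothermal-isobaric ensemble) is formulated as:
$$A = -k_BT \ln Q = -k_BT \ln Z \tag{2-45}$$
where Q is the partition function and Z is the configuration integral. From now on, the derivation focuses on the canonical ensemble and the Helmholtz free energy, but it can be extended to the isothermal-isobaric ensemble and the Gibbs free energy in the same manner (this statement also holds when the free energy perturbation method is described later). Following Eq. 2-45, the Helmholtz free energy as a function of $\lambda$ is:
$$A(\lambda) = -k_BT \ln Z(\lambda) = -k_BT \ln \int e^{-U(q,\lambda)/k_BT}\,dq \tag{2-46}$$
Here, U is the potential energy function and q is the molecular structure. The free energy difference can be written as:
$$\Delta A_{A\to B} = A_B - A_A = \int_0^1 \frac{\partial A(\lambda)}{\partial\lambda}\,d\lambda \tag{2-47}$$
Then,
$$\frac{\partial A(\lambda)}{\partial\lambda} = -k_BT\,\frac{\partial \ln Z(\lambda)}{\partial\lambda} = -\frac{k_BT}{Z(\lambda)}\,\frac{\partial Z(\lambda)}{\partial\lambda} \tag{2-48}$$
Plugging the explicit form of the configuration integral into the derivative leads to:
$$\frac{\partial Z(\lambda)}{\partial\lambda} = \frac{\partial}{\partial\lambda}\int e^{-U(q,\lambda)/k_BT}\,dq = \int \frac{\partial}{\partial\lambda}\left(e^{-U(q,\lambda)/k_BT}\right)dq \tag{2-49}$$
$$\frac{\partial}{\partial\lambda}\left(e^{-U(q,\lambda)/k_BT}\right) = e^{-U(q,\lambda)/k_BT}\left(-\frac{1}{k_BT}\right)\frac{\partial U(q,\lambda)}{\partial\lambda} \tag{2-50}$$
Therefore,
$$-\frac{k_BT}{Z(\lambda)}\,\frac{\partial Z(\lambda)}{\partial\lambda} = -\frac{k_BT}{Z(\lambda)}\int e^{-U(q,\lambda)/k_BT}\left(-\frac{1}{k_BT}\right)\frac{\partial U(q,\lambda)}{\partial\lambda}\,dq \tag{2-51}$$
Since $Z(\lambda)$ does not depend on the integration variable q, it can be moved inside the integral. Eq. 2-51 now becomes:
$$\frac{\partial A(\lambda)}{\partial\lambda} = -\frac{k_BT}{Z(\lambda)}\,\frac{\partial Z(\lambda)}{\partial\lambda} = \int \frac{e^{-U(q,\lambda)/k_BT}}{Z(\lambda)}\,\frac{\partial U(q,\lambda)}{\partial\lambda}\,dq \tag{2-52}$$
The first factor in the integrand is the Boltzmann weight factor $P(q,\lambda)$. Rewriting Eq. 2-52 yields:
$$\frac{\partial A(\lambda)}{\partial\lambda} = \int P(q,\lambda)\,\frac{\partial U(q,\lambda)}{\partial\lambda}\,dq = \left\langle \frac{\partial U(\lambda)}{\partial\lambda} \right\rangle_\lambda \tag{2-53}$$
Thus, the final form of $\Delta A_{A\to B}$ is:
$$\Delta A_{A\to B} = \int_0^1 \frac{\partial A(\lambda)}{\partial\lambda}\,d\lambda = \int_0^1 \left\langle \frac{\partial U(\lambda)}{\partial\lambda} \right\rangle_\lambda d\lambda \tag{2-54}$$
In both Eq. 2-53 and Eq. 2-54, the bracket represents an ensemble average generated at $\lambda$. In pKa calculations, state A (or B) represents the protonated species and the other represents the deprotonated species. Each intermediate value of $\lambda$ corresponds to a mixed protonated and deprotonated state without any physical meaning. When classical force fields are applied, the proton becomes a dummy atom in the deprotonated state but retains its position and velocity in the protein (or model compound).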
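The thermodynamic integration formula (Eq. 2-54) can be sketched on a toy problem with a known answer. The example below is an illustration under stated assumptions, not the protocol used in this work: states A and B are one-dimensional harmonic wells with hypothetical force constants, exact Gaussian sampling stands in for an MD simulation at each $\lambda$, and the integral is evaluated with the trapezoidal rule. For this system the exact result is $\Delta A = \frac{1}{2}k_BT\ln(k_B/k_A)$.

```python
import math
import random


def exact_harmonic_dA(k_a, k_b, kT):
    """Analytical free energy difference between two 1-D harmonic wells."""
    return 0.5 * kT * math.log(k_b / k_a)


def ti_free_energy(k_a, k_b, kT=0.596, n_lambda=11, n_samples=20000, seed=0):
    """Thermodynamic integration for U(lambda) = (1 - lambda) U_A + lambda U_B.

    U_A = k_a x^2 / 2 and U_B = k_b x^2 / 2 (toy system, assumed for illustration).
    """
    rng = random.Random(seed)
    lambdas = [i / (n_lambda - 1) for i in range(n_lambda)]
    mean_dudl = []
    for lam in lambdas:
        k_mix = (1.0 - lam) * k_a + lam * k_b
        sigma = math.sqrt(kT / k_mix)  # Boltzmann distribution of x at this lambda
        total = 0.0
        for _ in range(n_samples):
            x = rng.gauss(0.0, sigma)
            total += 0.5 * (k_b - k_a) * x * x  # dU/dlambda = U_B - U_A
        mean_dudl.append(total / n_samples)     # ensemble average at this lambda
    # Trapezoidal rule over the lambda grid
    d_a = 0.0
    for i in range(n_lambda - 1):
        d_a += 0.5 * (mean_dudl[i] + mean_dudl[i + 1]) * (lambdas[i + 1] - lambdas[i])
    return d_a


estimate = ti_free_energy(1.0, 4.0)
print(f"TI estimate: {estimate:.3f}, exact: {exact_harmonic_dA(1.0, 4.0, 0.596):.3f}")
```

The residual difference between the two numbers comes from the finite number of samples and from the quadrature error of the $\lambda$ grid; a finer grid or Gaussian quadrature reduces the latter.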
Furthermore, states A and B differ only in their charge distributions. The dissociation free energy can be computed by evaluating Eq. 2-54 with a numerical integration method (such as the trapezoidal rule or Gaussian quadrature). As explained in the previous chapter, the quantum mechanical contributions to the proton dissociation free energy are assumed to be the same for the protein and the model compound. Therefore, subtracting the dissociation free energy of the model compound from that of the protein yields the pKa shift relative to the pKa value of the model compound. The FEP method, which was initially introduced by Zwanzig in 1954,158 is another frequently employed free energy calculation method. Consider two states (A and B) with partition functions $Q_A$ and $Q_B$ and Helmholtz free energies $A_A$ and $A_B$, respectively. The free energy difference from A to B can be expressed as:
$$\Delta A_{A\to B} = A_B - A_A = -k_BT \ln(Q_B/Q_A) \tag{2-55}$$
Suppose the configuration integrals Z are adopted instead of the partition functions. The potential energy functions of states A and B are $U_A(q)$ and $U_B(q)$, respectively, where q is the molecular structure. Thus,
$$\Delta A_{A\to B} = -k_BT \ln(Z_B/Z_A) = -k_BT \ln\left(\int \frac{e^{-U_B(q)/k_BT}}{Z_A}\,dq\right) \tag{2-56}$$
According to Zwanzig, $U_B(q)$ can be written as the sum of $U_A(q)$ and a perturbation term $\Delta U(q)$.
$$U_B(q) = U_A(q) + \Delta U(q) \tag{2-57}$$
$$\Delta A_{A\to B} = -k_BT \ln\left(\int \frac{e^{-(U_A(q)+\Delta U(q))/k_BT}}{Z_A}\,dq\right) \tag{2-58}$$
$$\Delta A_{A\to B} = -k_BT \ln\left(\int \frac{e^{-U_A(q)/k_BT}\,e^{-\Delta U(q)/k_BT}}{Z_A}\,dq\right) \tag{2-59}$$
The Boltzmann weight factor of state A has the form:
$$P_A(q) = \frac{e^{-U_A(q)/k_BT}}{Z_A} \tag{2-60}$$
Therefore,
$$\Delta A_{A\to B} = -k_BT \ln\left(\int P_A(q)\,e^{-\Delta U(q)/k_BT}\,dq\right) = -k_BT \ln\left\langle e^{-\Delta U(q)/k_BT}\right\rangle_A \tag{2-61}$$
The bracket with subscript A stands for the ensemble average over the structural ensemble generated from state A. Substituting $\Delta U(q)$ with $U_B(q) - U_A(q)$, Eq. 2-61 becomes:
$$\Delta A_{A\to B} = -k_BT \ln\left\langle e^{-(U_B(q)-U_A(q))/k_BT}\right\rangle_A \tag{2-62}$$
In order to compute $\Delta A_{A\to B}$, one simulation of state A is performed. Once a configuration q is generated, the potential energy difference at configuration q is computed.
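The exponential average of Eq. 2-62 can be sketched as follows. This is a minimal illustration under stated assumptions: the two states are hypothetical one-dimensional harmonic wells, the configurations of state A are drawn by exact Gaussian sampling rather than by MD, and all function names are introduced here for the example only. The analytical answer for this toy system is $\frac{1}{2}k_BT\ln(k_B/k_A)$.

```python
import math
import random


def fep_free_energy(u_a, u_b, configs_a, kT):
    """Zwanzig estimator: -kT ln < exp(-(U_B - U_A)/kT) >_A.

    u_a, u_b  -- callables returning the potential energy of a configuration
    configs_a -- configurations sampled from state A
    """
    boltz = [math.exp(-(u_b(q) - u_a(q)) / kT) for q in configs_a]
    return -kT * math.log(sum(boltz) / len(boltz))


def harmonic(k):
    """Toy potential U(x) = k x^2 / 2 (assumed for illustration)."""
    return lambda x: 0.5 * k * x * x


def sample_harmonic(k, kT, n, seed=0):
    """Exact Boltzmann sampling of a harmonic well; stands in for an MD run."""
    rng = random.Random(seed)
    sigma = math.sqrt(kT / k)
    return [rng.gauss(0.0, sigma) for _ in range(n)]


KT = 0.596  # kcal/mol, roughly 300 K
configs = sample_harmonic(1.0, KT, 50000)
estimate = fep_free_energy(harmonic(1.0), harmonic(2.0), configs, KT)
exact = 0.5 * KT * math.log(2.0)
print(f"FEP estimate: {estimate:.3f}, exact: {exact:.3f}")
```

The estimator behaves well here because the two wells overlap strongly; widening the gap between the force constants degrades the average, which is the motivation for inserting intermediate states.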
The ensemble average of $e^{-(U_B(q)-U_A(q))/k_BT}$ can then be calculated easily and hence $\Delta A_{A\to B}$ is obtained. According to Eq. 2-62, if the potential energy difference between the two states (the perturbation) is too large, the free energy difference given by an FEP calculation can be unreasonably large. Thus, FEP calculations cannot accurately reflect the true free energy difference for large changes in the Hamiltonian (basically, the potential energy); the estimate is reliable only when the two Hamiltonians are similar. In order to compute the free energy difference between two very different systems (such as the free energy difference from benzene to toluene), intermediate systems mixing the two end points are adopted in such a way that the differences between neighbors can be treated as perturbations. To be specific, a coupling parameter can be adopted in the same fashion as in TI. The sum of the free energy differences between intermediate systems (each intermediate state has a specific coupling parameter $\lambda_i$) is the targeted free energy difference. In practice, computing $\Delta A_{A\to B}$ (the forward free energy difference) is as easy (or hard) as computing $\Delta A_{B\to A}$ (the backward free energy difference), and in principle one is exactly the opposite of the other. Evaluating both the forward and backward free energy differences provides an indication of convergence. The Bennett Acceptance Ratio (BAR) method159,160 is a frequently employed scheme to reduce sampling bias and statistical error. In 1985, Jorgensen et al.161 proposed a "double-wide" scheme to perform FEP calculations in order to reduce the computational cost. Double-wide FEP can be explained by the following example. Suppose $\Delta A(\lambda_i \to \lambda_j)$ is to be computed. Instead of performing two MD simulations at $\lambda_i$ and $\lambda_j$, only one MD simulation at the midpoint $\lambda_{(i+j)/2}$ is conducted. $\Delta A(\lambda_{(i+j)/2} \to \lambda_i)$ and $\Delta A(\lambda_{(i+j)/2} \to \lambda_j)$ are calculated, and the objective free energy difference is obtained from them.
If N configurations from each MD simulation are used to compute $\Delta A(\lambda_i \to \lambda_j)$, the conventional FEP scheme requires 4N potential energy calculations, while double-wide FEP requires only 3N.

2.5.3 Constant-pH MD Methods
As described in the previous chapter, constant-pH MD methods aim to describe the protonation equilibrium correctly at a given pH value. Constant-pH MD models sample protonation state space explicitly, along with conformational space. In practice, two protonation state sampling schemes have been developed. One scheme utilizes a binary protonation state space: only the protonated and deprotonated states are defined. MC steps are performed periodically during the MD propagation, which samples the conformational space. At each MC step, a new protonation state is selected and the free energy difference between the old and new states is computed. The Metropolis criterion is then applied to evaluate the MC move. Since a binary protonation state space is adopted, this scheme is generally called the discrete protonation state model. The other scheme employs a continuous protonation state space. Not only are the completely protonated and deprotonated species defined, but fractional protonation states also exist in the simulation. The MD propagation samples both conformational and protonation state space. This scheme is called the continuous protonation state model. In this section, the CPHMD model developed by the Brooks group and two discrete protonation state constant-pH MD methods, developed by Baptista et al. and by Mongan et al., are described to provide a brief overview. In the CPHMD method, Lee et al.114 applied λ-dynamics116 to the protonation coordinate and used the Generalized Born (GB) implicit solvent model. They chose a variable $\lambda$, bounded between 0 and 1, to control the protonation fraction. $\lambda = 0$ represents an ionizable residue in its protonated state, while $\lambda = 1$ corresponds to the deprotonated ionizable residue.
Due to the continuous nature of $\lambda$, the values $\lambda = 0$ and $\lambda = 1$ are rarely sampled exactly. Thus an arbitrary cutoff $\lambda_p$ is adopted such that any $\lambda$ smaller than $\lambda_p$ is defined to be protonated, while any $\lambda$ greater than $1 - \lambda_p$ is defined to be deprotonated. So that an unbounded coordinate can be used in practice, a new coordinate $\theta$ is introduced and propagated in the MD simulation. $\lambda$ is expressed as:
$$\lambda = \sin^2(\theta) \tag{2-63}$$
An artificial potential barrier between the protonated and deprotonated states is introduced. It is a biasing potential that increases the residence time near the protonated and deprotonated states, and it is centered at the halfway point of titration ($\lambda = 1/2$). The biasing potential used by Lee et al. is
$$U_{bias} = -4\beta\left(\lambda - \frac{1}{2}\right)^2 \tag{2-64}$$
where $\beta$ is an adjustable parameter controlling the height of the biasing potential. A value of 1.25 kcal/mol is found to be enough to provide sufficient occupation time in the protonated and deprotonated states. The total potential of the system, which provides the forces for MD propagation, has the form:
$$U_{total} = U_{bond} + U_{angle} + U_{torsion} + U_{elec}(\theta) + U_{vdw}(\theta) + U_{GB}(\theta) + U_{nonpolar} + \sum_{i=1}^{n}\left[U_{model}(\theta_i) + U_{pH}(\theta_i) + U_{bias}(\theta_i)\right] \tag{2-65}$$
Here, the first five terms are essentially defined by Eq. 1-9. $U_{GB}$ is the GB solvation free energy, which will be explained in the next chapter. $U_{nonpolar}$ is the energy related to the solvent-accessible surface area. The index i in Eq. 2-65 denotes an ionizable residue. $U_{model}$ is the potential of mean force (PMF) along the titration coordinate for a model compound. The $\Delta G_{MM}(AH \to A)$ shown in Eq. 1-17 can be represented by $U_{model}(\lambda = 0) - U_{model}(\lambda = 1)$. The $U_{model}$ in Eq. 2-65 is fit to a two-parameter parabolic function of the form $U_{model} = A_i(\sin^2(\theta_i) - B_i)^2$. $U_{pH}(\theta_i) = 2.303\,k_BT \sin^2(\theta_i)\,(pK_a - \text{pH})$ is the chemical potential of adding a fractional proton to the solution at the given pH. The term $U_{model}(\theta_i) + U_{pH}(\theta_i)$ is essentially the quantum mechanical dissociation free energy of a fractional proton. The CPHMD method also assumes that Eq. 1-18 holds.
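The relation between the propagated coordinate $\theta$, the titration coordinate $\lambda$, and the per-residue bias and pH terms can be sketched in a few lines. This is a hedged illustration, not the CPHMD implementation: it assumes the barrier form $U_{bias} = -4\beta(\lambda - 1/2)^2$ and $U_{pH} = \ln(10)\,k_BT\,\lambda\,(pK_a - \text{pH})$, and the function names, the cutoff $\lambda_p = 0.1$ and the example pKa are hypothetical.

```python
import math

KB = 0.0019872041  # Boltzmann constant in kcal/(mol K)


def lam(theta):
    """Titration coordinate lambda = sin^2(theta), bounded between 0 and 1."""
    return math.sin(theta) ** 2


def u_bias(theta, beta=1.25):
    """Barrier centered at lambda = 1/2 with height beta (kcal/mol); assumed sign convention."""
    return -4.0 * beta * (lam(theta) - 0.5) ** 2


def u_ph(theta, pka, ph, temperature=300.0):
    """pH-dependent term: ln(10) kT lambda (pKa - pH), in kcal/mol."""
    return math.log(10.0) * KB * temperature * lam(theta) * (pka - ph)


def protonation_label(theta, lam_p=0.1):
    """Classify a frame using the cutoff lambda_p (0.1 is a hypothetical choice)."""
    value = lam(theta)
    if value < lam_p:
        return "protonated"
    if value > 1.0 - lam_p:
        return "deprotonated"
    return "mixed"


for theta in (0.0, math.pi / 4.0, math.pi / 2.0):
    print(protonation_label(theta), round(u_bias(theta), 3), round(u_ph(theta, 4.0, 7.0), 3))
```

With these conventions the deprotonated end state ($\lambda = 1$) is lowered in energy when the pH exceeds the pKa, and the bias term is lowest at both end states, which discourages the simulation from lingering at unphysical intermediate values.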
Another feature of the CPHMD method is the use of an extended Hamiltonian. A kinetic energy term for the titration coordinate $\theta$ is employed in CPHMD:
$$K_\theta = \sum_i \frac{1}{2} M_i\,\dot{\theta}_i^2 \tag{2-66}$$
The fictitious mass $M_i$ controls how quickly the protonation state responds to the force acting on it.

Baptista et al.118 proposed that an MD simulation incorporating protonation state changes essentially samples a semi-grand canonical ensemble. The joint PDF can be written as:
$$P(p, q, p_s, q_s, \mathbf{n}) = \frac{\exp\left[\beta\mu n - \beta H(p, q, p_s, q_s, \mathbf{n})\right]}{\sum_{\mathbf{n}}\int \exp\left[\beta\mu n - \beta H(p, q, p_s, q_s, \mathbf{n})\right] dp\,dq\,dp_s\,dq_s} \tag{2-67}$$
Here, p and q are the momenta and coordinates of the solute, respectively, and $p_s$ and $q_s$ are the momenta and coordinates of the solvent, respectively. $\mathbf{n}$ is the vector containing the protonation state of each ionizable residue; its elements $n_i$ are defined as in the continuum electrostatic model. n is the number of protonated ionizable residues, $\mu$ is the chemical potential of protons, and $\beta = 1/k_BT$. The Hamiltonian contains quantum mechanical and classical force field terms. The quantum mechanical part in their model is assumed not to depend on coordinates and momenta. Introducing a dummy atom to replace the proton in a deprotonated residue makes the kinetic energy a function of the momenta only. Two conditional samplings are considered by Baptista et al.: one is conformational sampling at a fixed protonation state, the other is protonation state sampling at a fixed structure. The PDF of conformations at a fixed protonation state is:
$$P(p, q, p_s, q_s \mid \mathbf{n}) = \frac{\exp\left[-\beta H_c(p, q, p_s, q_s, \mathbf{n})\right]}{\int \exp\left[-\beta H_c(p, q, p_s, q_s, \mathbf{n})\right] dp\,dq\,dp_s\,dq_s} \tag{2-68}$$
where $H_c$ is the classical Hamiltonian. Because the quantum mechanical Hamiltonian depends only on the protonation state, which is fixed during conformational sampling, the quantum contribution is a constant and cancels. The PDF of protonation states at fixed coordinates is given in Eq. 2-69:
$$P(\mathbf{n} \mid q) = \frac{\exp\left[-2.303\,n\,\text{pH} - \beta\Delta G(q, \mathbf{n})\right]}{\sum_{\mathbf{n}'}\exp\left[-2.303\,n'\,\text{pH} - \beta\Delta G(q, \mathbf{n}')\right]} \tag{2-69}$$
where $\Delta G$ is the free energy of a protonation state relative to the completely deprotonated state. In their model, an FDPB-based method is used to calculate the free energy difference. By combining the two conditional samplings, one is able to generate an ensemble satisfying Eq. 2-67. In order to prove this statement, one must show that the Markov chain constructed from the transition matrix and the two conditional probabilities satisfies the following condition:
$$P = \lim_{k\to\infty} \rho\,W^k \tag{2-70}$$
In the above equation, P is the joint PDF defined in Eq. 2-67, $\rho$ is a joint PDF depending on the same variables as P, and W is the transition matrix. Proving that Eq. 2-70 holds means proving that the Markov chain defined by $\rho$ and W is ergodic. In order to prove that a Markov chain is ergodic, one needs to prove that (a) the Markov chain is irreducible; (b) the chain is aperiodic; (c) the transition matrix elements are time-independent; and (d) the limiting distribution is stationary. The detailed proof is given by Baptista et al. in their 2002 paper. Their proof justifies the discrete protonation state constant-pH method, which samples conformational space at a fixed protonation state and samples protonation states at a fixed structure.

In 2004, Mongan et al.127 proposed a constant-pH MD method and implemented it in the AMBER suite. This algorithm follows the scheme proposed by Baptista et al.118 but employs the GB model in both the MD and the MC steps. Given a protein with N titratable sites, the discrete protonation state model means that the protonation states of the protein are described by a vector $\mathbf{x} = (x_1, x_2, \ldots, x_N)$, where each $x_i$ is an integer representing the protonation state of titratable residue i. In AMBER, five amino acids are defined as titratable: aspartate, glutamate, histidine, lysine and tyrosine. For each titratable residue, different protonation states have different partial charges on the side chain.
This model also includes the syn and anti forms of the protons on the aspartate and glutamate side chains as well as the δ and ε proton locations for histidine. At each Monte Carlo step, a titratable site and a new protonation state for that site are chosen randomly, and the transition free energy at this fixed configuration is used to evaluate the MC move. Consider a titratable site A in a protein environment; its protonated form is prot-AH and its deprotonated form is prot-A. The equilibrium between the two forms is governed by their free energy difference. This free energy difference is an ensemble average over different configurations. However, the free energy difference cannot be computed by a molecular mechanics (MM) model, since the transition between the two forms involves bond breaking/forming and solvation of a proton, both of which involve quantum mechanical effects. These problems can be solved by using a reference compound. The reference compound has the same titratable side chain as prot-AH but a known pKa value ($pK_{a,ref}$). Following Mongan et al., we assume that the transition free energy can be divided into a quantum mechanics (QM) part and a molecular mechanics (MM) part. We further assume that the quantum mechanical energy components are the same for the reference compound and prot-AH. Since the pKa of the reference compound is known, its transition free energy from the deprotonated form to the protonated form at a given pH is:
$$\Delta G_{ref} = k_BT \ln 10\,(\text{pH} - pK_{a,ref}) \tag{2-71}$$
So the QM component of the transition free energy can be expressed as:
$$\Delta G_{ref,QM} = \Delta G_{ref} - \Delta G_{ref,MM} \tag{2-72}$$
Here $\Delta G_{ref,MM}$ is the molecular mechanics contribution to the free energy of the protonation reaction for that reference compound. In practice, the QM component of the transition free energy also contains errors from the MM calculations, so it is actually a non-MM component.
With the approximation for the QM component of the transition free energy,
$$\Delta G_{ref,QM} = \Delta G_{protein,QM} \tag{2-73}$$
the transition free energy from prot-A to prot-AH can be calculated as:
$$\Delta G = k_BT \ln 10\,(\text{pH} - pK_{a,ref}) + \Delta G_{MM} - \Delta G_{ref,MM} \tag{2-74}$$
Here, $\Delta G_{MM}$ is the molecular mechanics contribution (electrostatic in nature) to the free energy of the protein titratable site. Hence, by using a reference compound, the QM effects do not need to be computed. Effectively, we compute $\Delta pK_a$ relative to the reference compound. Computing $\Delta pK_a$ also helps cancel some of the error introduced by the GB solvation model through the use of $\Delta G_{ref,MM}$. In AMBER, a reference compound is a blocked dipeptide possessing the titratable side chain (for example, acetyl-Asp-methylamine). Five reference compounds were constructed, corresponding to the five titratable residues. The values of $\Delta G_{ref,MM}$ for each reference compound are obtained from thermodynamic integration calculations at 300 K and stored as internal parameters in AMBER. $\Delta G_{MM}$ is calculated by taking the difference between the potential energy with the charges of the current protonation state and the potential energy with the charges of the new protonation state. If the transition is accepted, MD steps are carried out to sample conformational space in the new protonation state. If the MC attempt is rejected, MD steps are carried out with no change to the protonation state.

2.6 Advanced Sampling Methods
Conformational sampling in an MD or MC simulation is essential in the study of complex systems such as polymers and proteins. One major concern is that the PES of a complex system is very rugged and contains many local energy minima. Thus, kinetic trapping occurs as a result of the low rate of potential energy barrier crossing, especially when the barriers are high. To overcome this kinetic trapping, generalized ensemble methods can be employed in molecular simulations.
As its name implies, a generalized ensemble method differs from the canonical ensemble method in the weight factor of a state. The weight factor in the canonical ensemble is the Boltzmann weight. In a generalized ensemble method, however, a non-Boltzmann weight factor can be used. (This does not mean that the Boltzmann factor is prohibited in a generalized ensemble method. In fact, parallel tempering, which belongs to the family of generalized ensemble methods, does adopt the Boltzmann factor.) By choosing a non-Boltzmann weight factor, the system is able to perform a random walk in potential energy space. Thus, potential energy barriers are overcome easily and more conformations are visited. Frequently utilized generalized ensemble algorithms include the multicanonical (MUCA) method and the replica exchange molecular dynamics (REMD) method. In this section, MUCA and parallel tempering are introduced briefly. Due to the importance of the REMD method to this dissertation, its details are explained in the next section.

2.6.1 The Multicanonical Algorithm (MUCA)
In the canonical ensemble, the probability of visiting a state in energy space is:
$$P_{canonical}(T, E) \propto n(E)\,e^{-E/k_BT} \tag{2-75}$$
Here, n(E) is the density of states (DOS), which is the number of states between E and E + dE, and $e^{-E/k_BT}$ is the Boltzmann factor. As the potential energy increases, the Boltzmann factor decreases but the DOS increases rapidly. As a result, a bell-shaped probability distribution function (PDF) of E is observed. In the MUCA method,54,55,137 however, the PDF is designed to be flat (a constant), although it can still be written in the form of Eq. 2-76:
$$P_{MUCA}(E) \propto n(E)\,W_{MUCA}(E) = \text{constant} \tag{2-76}$$
where $W_{MUCA}(E)$ is the multicanonical weight factor and n(E) is the DOS. The multicanonical weight factor needs to be inversely proportional to the DOS in order to generate a flat PDF. However, the DOS of a system is in general unknown, which makes the multicanonical weight unknown a priori.
Generating the correct n(E) is the central task of a MUCA simulation. In practice, short simulations are performed in order to determine the DOS in an iterative manner. Details of determining the DOS can be found in the 1995 paper of Okamoto and Hansmann.162 After the DOS is resolved, the canonical ensemble PDF can be obtained. Thus, the average of any quantity can be determined by Eq. 1-11 or Eq. 1-12, depending on whether an MD or MC simulation is performed. Another way to determine the DOS is the Wang-Landau algorithm.163,164 In the Wang-Landau algorithm, the DOS is recorded in a histogram g(U) whose elements are all initially set to unity. A second histogram, called the visit histogram, is constructed with initial values set to zero. The visit histogram represents the number of visits to each energy level. Monte Carlo moves are then made. Instead of being evaluated by the Metropolis criterion, they are evaluated by the DOS:
$$w(i \to j) = \min\left[1, \frac{g(U_i)}{g(U_j)}\right] \tag{2-77}$$
where $w(i \to j)$ is the transition probability from state i to state j. Each time an energy level is visited, the corresponding element of the DOS histogram is updated by multiplying its current value by a modification coefficient that is greater than 1. The initial value of the modification coefficient is $f_0 = e \approx 2.71828$. Every time an MC move is performed, the corresponding element of the visit histogram is also updated. The MC moves continue until the visit histogram is flat. At this stage, the DOS has converged to within the accuracy set by the current modification coefficient. In order to achieve a finer convergence, a second round of the above process is performed. This time, the modification coefficient is $f_1 = \sqrt{f_0}$, and the visit histogram is reset to zero. This process iterates until a modification coefficient of approximately 1 is reached (in the paper of Wang and Landau, the final value of the modification coefficient is 1.00000001).
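The Wang-Landau procedure described above can be sketched on a small model whose DOS is known exactly. The example is an illustration under stated assumptions and not code from this work: the system is a set of independent two-state spins whose energy is the number of "up" spins, so that the exact DOS is the binomial coefficient; the flatness threshold of 80%, the checking interval, and the final modification coefficient (looser than the 1.00000001 quoted above, to keep the run short) are hypothetical choices. Logarithms of g and f are stored to avoid overflow.

```python
import math
import random


def wang_landau_spins(n_spins=8, ln_f_final=1e-4, flatness=0.8,
                      check_every=2000, seed=1):
    """Estimate ln g(U) for n_spins independent two-state spins, U = number of up spins."""
    rng = random.Random(seed)
    spins = [0] * n_spins
    energy = 0
    ln_g = [0.0] * (n_spins + 1)   # g(U) = 1 for every level initially
    ln_f = 1.0                     # f0 = e
    while ln_f > ln_f_final:
        visits = [0] * (n_spins + 1)   # visit histogram reset each round
        while True:
            for _ in range(check_every):
                k = rng.randrange(n_spins)
                new_energy = energy + (1 - 2 * spins[k])  # flipping spin k
                # Accept with probability min(1, g(U_old) / g(U_new)), Eq. 2-77
                log_ratio = ln_g[energy] - ln_g[new_energy]
                if log_ratio >= 0.0 or rng.random() < math.exp(log_ratio):
                    spins[k] ^= 1
                    energy = new_energy
                ln_g[energy] += ln_f      # multiply g(U) by f
                visits[energy] += 1
            mean_visits = sum(visits) / len(visits)
            if min(visits) >= flatness * mean_visits:
                break                      # histogram is flat enough
        ln_f *= 0.5                        # f_new = sqrt(f_old)
    return [value - ln_g[0] for value in ln_g]  # normalize so that g(0) = 1


ln_dos = wang_landau_spins()
for level, value in enumerate(ln_dos):
    print(level, round(value, 2), round(math.log(math.comb(8, level)), 2))
```

Only ratios of g(U) are determined by the algorithm, which is why the result is normalized to the lowest level before it is compared with the exact binomial values.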
By utilizing the Wang-Landau algorithm, the DOS is obtained and a random walk in potential energy space is achieved.

2.6.2 Parallel Tempering
In 1986, Swendsen and Wang first performed parallel tempering (replica exchange MC) simulations to investigate spin glasses.59 Multiple non-interacting copies (replicas) of the system are simulated at different temperatures. At each temperature, an MC simulation is conducted to sample the conformational space. Periodically, an exchange of the structures (or, equivalently, the temperatures) of two replicas is attempted. The detailed balance condition is applied and the weight factor of each state is the Boltzmann weight factor. The Metropolis criterion is used to accept or reject the move. Hansmann et al.58 first utilized the parallel tempering algorithm in the study of a biomolecule (7-residue Met-enkephalin). Other applications of the parallel tempering algorithm include the X-ray structure determination performed by Falcioni and Deem.165 An MC simulation at a high temperature accepts transition attempts more often than one at a low temperature. Thus, simulations at high temperatures tend to visit more of conformational space. Exchanging structures with replicas at lower temperatures helps the latter avoid getting trapped in conformational space. The acceptance ratio, which is the average fraction of successful exchange attempts, is an important issue in the parallel tempering method. It is correlated with the temperature distribution of the replicas. According to Kofke,166 the acceptance ratio is the area of overlap between the potential energy PDFs at the two temperatures. For a given number of replicas, if the temperatures of two replicas are too different, the overlap between the two potential energy PDFs will be small. Therefore, accepting an exchange attempt is unlikely, which makes the parallel tempering simulation inefficient.
However, if the temperatures of two adjacent replicas are too close, the overlap between the two PDFs will be large, and hence the acceptance ratio will be large, but the regions of conformational space sampled by the two adjacent replicas will be nearly the same. More replicas than actually needed are then used to achieve the same goal, and computer resources are wasted.

2.7 Replica Exchange Molecular Dynamics (REMD) Methods
Due to the correlation between conformation and protonation sampling, correct sampling of protonation states requires accurate sampling of protein conformations. Hence, generalized ensemble methods such as the multicanonical algorithm and REMD should be used to avoid the kinetic trapping that comes from low rates of barrier crossing in constant temperature MD simulations. REMD has been applied to the continuous protonation state constant-pH method (REX-CPHMD) by Khandogin et al.110-113 They have performed REX-CPHMD simulations to predict pKa values110 and to explore pH-dependent protein dynamics.111-113 REMD, which is the MD version of parallel tempering, was developed by Sugita and Okamoto in 1999.62 The theory of REMD is essentially the same as that of parallel tempering. In their method, exchanges of temperatures are attempted. This leads to the unique part of REMD: the treatment of velocities after an exchange attempt is accepted, because the velocities must reflect the temperature correctly. Sugita and Okamoto proposed rescaling the velocities in order to recover the desired temperature when temperatures are swapped. Similar to other generalized-ensemble methods, the REMD algorithm aims to make the system perform a random walk in temperature or potential energy space, and hence avoid kinetic trapping. The advantage of REMD over other generalized-ensemble methods is that the weight factor is the Boltzmann weight, which is known a priori. This advantage makes REMD very frequently employed in MD simulations of complex systems.
The REMD algorithm has been applied to studies of peptides, proteins, and protein-membrane systems in order to describe free energy landscapes, amyloid formation, structure prediction, and binding. Many extended versions, such as solute-tempering REMD [167] and structure-reservoir REMD [168-170], have been proposed to improve the performance of the REMD algorithm. The REMD variants will be briefly explained later in this section.

2.7.1 Temperature REMD (TREMD)

A thorough description of the TREMD algorithm can be found in the original paper of Sugita and Okamoto [62]. In TREMD, N noninteracting copies (replicas) of a system are simulated at N different temperatures (one each). Regular MD is performed, and periodically an exchange of configurations between two (usually adjacent) temperatures is attempted. Suppose replica i at temperature $T_m$ and replica j at temperature $T_n$ attempt to exchange; the following satisfies the detailed balance condition:

$$P_m(i)\,P_n(j)\,w(i \to j) = P_n(i)\,P_m(j)\,w(j \to i) \tag{2-78}$$

Here $w(i \to j)$ is the transition probability between two states i and j, and $P_m(i)$ is the population of state i at temperature $T_m$ (assumed to be Boltzmann weighted in REMD), so that

$$P_m(i) \propto \exp[-H(p_i, q_i)/k_B T_m] \tag{2-79}$$

where H is the Hamiltonian of the state, q represents the molecular structure, and p stands for the momenta. The Hamiltonian consists of kinetic energy (K) and potential energy (U) terms and can be written as:

$$H(p, q) = K(p) + U(q) \tag{2-80}$$

In the original derivation of the exchange probability, Sugita and Okamoto noted that exchanging two replicas (states) is equivalent to exchanging temperatures. The momenta of each replica need to be rescaled after the exchange attempt:

$$p_n(i) = \sqrt{T_n/T_m}\;p_m(i) \tag{2-81}$$

$$p_m(j) = \sqrt{T_m/T_n}\;p_n(j) \tag{2-82}$$

After inserting Eq. 2-79 and Eq. 2-80 into Eq. 2-78, the detailed balance equation becomes:

$$\exp\{-[K(p_m(i)) + U(q_i)]/k_BT_m\}\,\exp\{-[K(p_n(j)) + U(q_j)]/k_BT_n\}\,w(i \to j) = \exp\{-[K(p_n(i)) + U(q_i)]/k_BT_n\}\,\exp\{-[K(p_m(j)) + U(q_j)]/k_BT_m\}\,w(j \to i) \tag{2-83}$$

According to Eq. 2-81 and Eq. 2-82,

$$K(p_m(i)) = (T_m/T_n)\,K(p_n(i)) \tag{2-84}$$

$$K(p_n(j)) = (T_n/T_m)\,K(p_m(j)) \tag{2-85}$$

Therefore, the kinetic energy contributions on both sides of Eq. 2-83 cancel, leaving only the potential energy terms to contribute to the exchange probability:

$$\frac{w(i \to j)}{w(j \to i)} = \frac{\exp[-U(q_i)/k_BT_n]\,\exp[-U(q_j)/k_BT_m]}{\exp[-U(q_i)/k_BT_m]\,\exp[-U(q_j)/k_BT_n]} \tag{2-86}$$

Further manipulation of Eq. 2-86 yields:

$$\frac{w(i \to j)}{w(j \to i)} = \exp\left[\left(\frac{1}{k_BT_m} - \frac{1}{k_BT_n}\right)\bigl(U(q_i) - U(q_j)\bigr)\right]$$

If the Metropolis criterion is applied, the exchange probability is obtained as:

$$w(i \to j) = \min\left\{1, \exp\left[\left(\frac{1}{k_BT_m} - \frac{1}{k_BT_n}\right)\bigl(U(q_i) - U(q_j)\bigr)\right]\right\} \tag{2-87}$$

If the exchange attempt between two replicas is accepted, the temperatures of the two replicas are swapped and the velocities are rescaled to the new temperatures by multiplying all the old velocities by the square root of the ratio of the new temperature to the old temperature:

$$v_{new} = v_{old}\sqrt{T_{new}/T_{old}} \tag{2-88}$$

Here, $v_{new}$ and $v_{old}$ are the new and old velocities, respectively, and $T_{new}$ and $T_{old}$ are the temperatures after and before an exchange is accepted, respectively. The acceptance ratio is the average value of the exchange probabilities between two temperatures:

$$P_{acc} = \left\langle \min\left\{1, \exp\left[\left(\frac{1}{k_BT_m} - \frac{1}{k_BT_n}\right)\bigl(U(q_i) - U(q_j)\bigr)\right]\right\} \right\rangle \tag{2-89}$$

For a given system, the potential energy function is independent of temperature, but the potential energy PDF in a canonical ensemble depends on temperature. The potential energy PDF can be approximated by a Gaussian function (obtained by truncating at second order the Taylor expansion of the PDF about the potential energy of maximum probability). The Gaussian is centered at the mean potential energy of the system with variance $\sigma^2 = k_BT^2C_V$, where $C_V$ is the heat capacity. At this stage, the Gaussian expression for the potential energy PDF is not adopted. It will be employed later in this section. The potential energy PDF at temperature $T_m$ is currently written as:

$$P_m(U) = \frac{1}{Q_m}\,n(U)\,\exp(-U/k_BT_m) \tag{2-90}$$

where $Q_m$ is the partition function at $T_m$, n(U) is the DOS, and the exponential term is the Boltzmann weight factor as a function of potential energy.
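The exchange test of Eq. 2-87 and the velocity rescaling of Eq. 2-88 can be illustrated with a minimal Python sketch. This is not the implementation used in any MD package; the function names, the kcal/(mol K) value of the Boltzmann constant, and the list-of-lists velocity layout are assumptions made only for illustration.

```python
import math
import random

K_B = 0.0019872041  # Boltzmann constant in kcal/(mol K); the unit choice is an assumption


def exchange_probability(u_i, u_j, t_m, t_n, k_b=K_B):
    """Metropolis probability (Eq. 2-87) for swapping replica i (at T_m, potential
    energy u_i) with replica j (at T_n, potential energy u_j)."""
    beta_m = 1.0 / (k_b * t_m)
    beta_n = 1.0 / (k_b * t_n)
    exponent = (beta_m - beta_n) * (u_i - u_j)
    # min{1, exp(x)} evaluated without overflowing exp() for large positive x
    return 1.0 if exponent >= 0.0 else math.exp(exponent)


def rescale_velocities(velocities, t_old, t_new):
    """Scale every velocity component by sqrt(T_new / T_old) (Eq. 2-88)."""
    factor = math.sqrt(t_new / t_old)
    return [[factor * component for component in atom] for atom in velocities]


def attempt_exchange(u_i, u_j, t_m, t_n, rng=random):
    """Return True if the exchange attempt is accepted."""
    return rng.random() < exchange_probability(u_i, u_j, t_m, t_n)
```

When the colder replica holds the higher potential energy the swap is always accepted; otherwise it is accepted with a probability that decays with both the energy gap and the difference in inverse temperature.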
Recall that in probability theory the average of a quantity A can be expressed as:

$$\langle A \rangle = \int P(A)\,A\,dA \tag{2-91}$$

Extending Eq. 2-91 to the bivariate case, and noting that the two PDFs are independent, the acceptance ratio can be rewritten as

$$P_{acc} = \int_{-\infty}^{\infty}\!\int_{-\infty}^{\infty} P_m(U)\,P_n(U')\,\min\left\{1, \exp\left[\left(\frac{1}{k_BT_m} - \frac{1}{k_BT_n}\right)(U - U')\right]\right\}\,dU\,dU' \tag{2-92}$$

Let the function $g(U, U')$ denote the Metropolis factor in Eq. 2-92, with $\beta_m = 1/k_BT_m$ and $\beta_n = 1/k_BT_n$. Then

$$g(U, U') = \min\{1, \exp[(\beta_m - \beta_n)(U - U')]\} \tag{2-93}$$

Without loss of generality, we can assume that $\beta_m > \beta_n$, which means $T_m < T_n$. Therefore, another way of writing Eq. 2-93 is $g(U, U') = 1$ when $U > U'$ and $g(U, U') = \exp[(\beta_m - \beta_n)(U - U')]$ when $U < U'$. Inserting $g(U, U')$ into Eq. 2-92 leads to:

$$P_{acc} = \int_{-\infty}^{\infty} P_m(U)\,dU \int_{-\infty}^{U} P_n(U')\,dU' + \int_{-\infty}^{\infty} P_m(U)\,dU \int_{U}^{\infty} \exp[(\beta_m - \beta_n)(U - U')]\,P_n(U')\,dU' \tag{2-94}$$

For simplicity, we denote the second term on the right-hand side of Eq. 2-94 as $h(U, U')$. Inserting Eq. 2-90 into $h(U, U')$,

$$h(U, U') = \int_{-\infty}^{\infty} \frac{1}{Q_m}\,n(U)\,e^{-\beta_m U}\,dU \int_{U}^{\infty} e^{\beta_m U} e^{-\beta_m U'} e^{-\beta_n U} e^{\beta_n U'}\,\frac{1}{Q_n}\,n(U')\,e^{-\beta_n U'}\,dU' \tag{2-95}$$

Since U and U' are independent, Eq. 2-95 can be rewritten as:

$$h(U, U') = \int_{-\infty}^{\infty} \frac{1}{Q_m}\,n(U)\,e^{-\beta_m U} e^{\beta_m U} e^{-\beta_n U}\,dU \int_{U}^{\infty} \frac{1}{Q_n}\,n(U')\,e^{-\beta_n U'} e^{-\beta_m U'} e^{\beta_n U'}\,dU' \tag{2-96}$$

Simplifying Eq. 2-96 gives:

$$h(U, U') = \int_{-\infty}^{\infty} \frac{1}{Q_m}\,n(U)\,e^{-\beta_n U}\,dU \int_{U}^{\infty} \frac{1}{Q_n}\,n(U')\,e^{-\beta_m U'}\,dU' \tag{2-97}$$

Recall that a partition function is just a normalizing constant, so $Q_m$ and $Q_n$ in Eq. 2-97 can switch their positions in the integrand. Thus, Eq. 2-97 becomes:

$$h(U, U') = \int_{-\infty}^{\infty} P_n(U)\,dU \int_{U}^{\infty} P_m(U')\,dU' \tag{2-98}$$

Inserting Eq. 2-98 into Eq. 2-94,

$$P_{acc} = \int_{-\infty}^{\infty} P_m(U)\,dU \int_{-\infty}^{U} P_n(U')\,dU' + \int_{-\infty}^{\infty} P_n(U)\,dU \int_{U}^{\infty} P_m(U')\,dU' \tag{2-99}$$

Each term on the right-hand side of Eq. 2-99 can be interpreted as an overlap between the two PDFs, and the sum is the entire overlap between them. Therefore, the average exchange probability is just the overlap between the potential energy PDFs at the two temperatures.

Next, let us consider the temperature distribution in the simplest case, in which the heat capacity is a constant.
As mentioned earlier, the potential energy PDF of a canonical ensemble can be written as a Gaussian function,

$$P(U) = P(\bar{U})\,\exp\left[-\frac{(U - \bar{U})^2}{2k_BT^2C_V}\right] \tag{2-100}$$

where $\bar{U}$ is the average potential energy, $P(\bar{U})$ is the probability density of finding $\bar{U}$ at temperature T, and $C_V$ is the heat capacity. Since the PDF must be normalized, it is easy to find the relationship between $P(\bar{U})$ and the standard deviation of the Gaussian function:

$$P(\bar{U}) = 1/\sqrt{2\pi k_BT^2C_V} \tag{2-101}$$

For simplicity in the derivation of the acceptance ratio, the Gaussian PDF at temperature $T_m$ will be written as Eq. 2-102 from now on, with $\sigma_m^2 = k_BT_m^2C_V$:

$$P_m(U) = \frac{1}{\sqrt{2\pi\sigma_m^2}}\,\exp\left[-\frac{(U - \bar{U}_m)^2}{2\sigma_m^2}\right] \tag{2-102}$$

Recall that temperatures are distributed so as to maintain a random walk in temperature space. Hence, a constant acceptance ratio should be achieved for any two adjacent temperatures. As shown previously, the acceptance ratio is the overlap between two potential energy PDFs. Consider two potential energy PDFs at temperatures $T_m < T_n$. The PDF at $T_m$ lies to the left of the PDF at $T_n$. After finding the potential energy $U_{intersect}$ at which the two Gaussian PDFs intersect, the overlap between the two PDFs can be computed by integrating the left Gaussian PDF from $U_{intersect}$ to infinity and the right Gaussian PDF from minus infinity to $U_{intersect}$, and adding the two results:

$$P_{acc} = \int_{U_{intersect}}^{\infty} \frac{1}{\sqrt{2\pi\sigma_m^2}}\,\exp\left[-\frac{(U - \bar{U}_m)^2}{2\sigma_m^2}\right] dU + \int_{-\infty}^{U_{intersect}} \frac{1}{\sqrt{2\pi\sigma_n^2}}\,\exp\left[-\frac{(U - \bar{U}_n)^2}{2\sigma_n^2}\right] dU \tag{2-103}$$

In terms of complementary error functions, Eq. 2-103 becomes

$$P_{acc} = \frac{1}{2}\,\mathrm{erfc}\left(\frac{U_{intersect} - \bar{U}_m}{\sqrt{2}\,\sigma_m}\right) + \frac{1}{2}\,\mathrm{erfc}\left(\frac{\bar{U}_n - U_{intersect}}{\sqrt{2}\,\sigma_n}\right) \tag{2-104}$$

According to Rathore et al. [171], the acceptance ratio can be approximated as:

$$P_{acc} \approx \mathrm{erfc}\left(\frac{\bar{U}_n - \bar{U}_m}{2\sqrt{2}\,\sigma}\right) \tag{2-105}$$

where $\sigma = (\sigma_m + \sigma_n)/2$. For a geometric distribution of temperatures where $T_n = cT_m$, $\sigma_m + \sigma_n = \sqrt{k_BC_V}\,T_m(c + 1)$. The average potential energy difference can be computed as

$$\bar{U}_n - \bar{U}_m \approx C_V\,\Delta T = C_V(T_n - T_m) = C_V(c - 1)\,T_m \tag{2-106}$$

Thus, if the heat capacity does not change with temperature, the temperature $T_m$ in the numerator and denominator of Eq. 2-105 cancels, which means the acceptance ratio is a constant.
Furthermore, Eq. 2-105 also indicates how the number of replicas needed to cover a temperature range depends on system size. In order to have a non-negligible $P_{acc}$, $(\bar{U}_n - \bar{U}_m)/\sigma \sim 1$. This leads to $C_V\,\Delta T/[\sqrt{k_BC_V}\,T_m(c + 1)] \sim 1$. Further simplification leads to:

$$\Delta T \sim \frac{T_m}{\sqrt{C_V/k_B}} \tag{2-107}$$

Since the heat capacity is O(N), where N is the number of particles, the number of replicas needed to cover a temperature range is $O(N^{1/2})$.

2.7.2 Hamiltonian REMD (HREMD)

Instead of preparing replicas at different temperatures, another way to overcome potential energy barriers is simply to change the PES so that the barriers are reduced [61]. This is the basic idea of HREMD. In the HREMD algorithm, replicas differ in their Hamiltonians but have the same temperature. As in TREMD, regular MD is performed and an exchange of configurations between two neighboring replicas is attempted periodically. Consider replica i with Hamiltonian $H_n$ and replica j with Hamiltonian $H_m$ attempting to exchange. By employing the detailed balance equation (Eq. 2-78) and the Boltzmann weight of a molecular structure, the transition probability can be written as:

$$w(i \to j) = \min\{1, \exp[\beta(H_n(i) + H_m(j) - H_m(i) - H_n(j))]\} \tag{2-108}$$

2.7.3 Technical Details in REMD Simulations

Temperature distributions have been explored in order to optimize the performance of the REMD method. For systems with constant heat capacity, a geometric distribution of temperatures has been adopted. Sugita and Okamoto [62] and Kofke [166] argued that the most efficient way to exploit the REMD algorithm is to let each replica spend the same amount of simulation time at each temperature (a random walk in temperature space). In practice, this is achieved by producing the same acceptance ratio for each pair of neighboring replicas, given that each replica only attempts to exchange with its neighbors in temperature space. Under the condition that the system has a constant heat capacity, this yields a geometric distribution of temperatures ($T_{i+1}/T_i = c$).
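The constant-heat-capacity argument above can be checked numerically with a short Python sketch. It builds a geometric temperature ladder and evaluates the ratio of the mean energy gap to the average PDF width for each neighboring pair, working in units where $k_B = 1$. The heat capacity value of 500 and the function names are hypothetical, chosen only for illustration.

```python
import math


def geometric_ladder(t_min, t_max, n_replicas):
    """Temperatures with a constant ratio T_{i+1}/T_i = c between neighbors
    (requires n_replicas >= 2)."""
    ratio = (t_max / t_min) ** (1.0 / (n_replicas - 1))
    return [t_min * ratio**i for i in range(n_replicas)]


def reduced_gap(t_m, t_n, cv_over_kb):
    """Mean energy gap divided by the average Gaussian width, assuming a
    temperature-independent heat capacity and energies in units of k_B."""
    mean_gap = cv_over_kb * (t_n - t_m)                 # C_V * (T_n - T_m), Eq. 2-106
    sigma = math.sqrt(cv_over_kb) * (t_m + t_n) / 2.0   # (sigma_m + sigma_n) / 2
    return mean_gap / sigma


ladder = geometric_ladder(250.0, 480.0, 8)
gaps = [reduced_gap(a, b, cv_over_kb=500.0) for a, b in zip(ladder, ladder[1:])]
```

Every entry of `gaps` equals $2\sqrt{C_V/k_B}\,(c - 1)/(c + 1)$, independent of temperature. This is why a geometric ladder gives the same acceptance ratio for every neighboring pair when the heat capacity is constant.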
Sanbonmatsu and Garcia suggested an iterative method to distribute temperatures among replicas in 2002 [172]. They used the average potential energy as a function of temperature to maintain a random walk in temperature space. In 2005, Rathore et al. [171] suggested that an acceptance ratio of 0.2 yields the best performance, based on the constant heat capacity assumption. They chose a Gō-type model of protein A and the Lennard-Jones liquid to study the deviation of the heat capacity from its final value as a function of acceptance ratio. A minimum in the deviation was observed at an acceptance ratio of around 0.2. Kone and Kofke [173] performed a similar study for parallel tempering simulations. They also considered a random-walk model in temperature space through replica exchange moves. The acceptance ratio is given by:

$$P_{acc} = \mathrm{erfc}\left(\frac{1 - B}{1 + B}\,C^{1/2}\right) \tag{2-109}$$

where $B = \beta_1/\beta_0$, $\beta_i = 1/k_BT_i$ is the inverse temperature, and C is the heat capacity, which is assumed to be constant in their study. Without loss of generality, $\beta_0$ is greater than $\beta_1$. The mean-square displacement of this random walk (Eq. 2-110) was maximized with respect to the acceptance ratio. The maximum lies near an acceptance ratio of 20%.

$$\sigma^2 \propto (\ln B)^2\,P_{acc}(B) \tag{2-110}$$

where $\sigma^2$ is the mean-square displacement, and B and $P_{acc}$ are as given in Eq. 2-109.

Temperature distributions in parallel tempering simulations of the villin headpiece subdomain HP36 were investigated by Trebst et al. [174]. HP36 undergoes a helix-coil transition at high temperatures, and hence the heat capacity is not constant. The diffusion of a replica in temperature space was introduced to judge the performance. In their method, a replica is labeled "up" when the extreme temperature it visited most recently is the lowest temperature, and "down" when the extreme temperature it visited most recently is the highest. For each temperature $T_i$, two histograms $n_{up}(T_i)$ and $n_{down}(T_i)$ are recorded.
The two histograms record the number of visits from replicas labeled "up" and "down", respectively. The average fraction of replicas traveling from the lowest to the highest temperature can be calculated as:

$$f(T_i) = \frac{n_{up}(T_i)}{n_{up}(T_i) + n_{down}(T_i)} \tag{2-111}$$

The diffusivity D(T) is adopted and has the form:

$$D(T) \propto \frac{\Delta T}{df/dT} \tag{2-112}$$

They pointed out that the diffusivity is temperature dependent: a minimum of the diffusivity was observed around the temperature at which the heat capacity is at its maximum. The plot of diffusivity vs temperature indicates that the random walk is suppressed most strongly where the phase transition occurs. The number of round trips of each replica between the temperature extremes was maximized to generate an optimal temperature distribution. More recently, Nadler and Hansmann [175-177] suggested that the optimal number of replicas between the lowest and highest temperatures in explicit solvent simulations is given by $N_{optimal} = 1 + 0.594\sqrt{C}\,\ln(T_{max}/T_{min})$, where C is the heat capacity, and $T_{max}$ and $T_{min}$ are the highest and lowest temperatures, respectively. They also proposed that the optimal temperature distribution can be formulated as $T_{optimal}(i) = T_{min}\,(T_{max}/T_{min})^{(i-1)/(N-1)}$.

In addition to the replica temperature distribution, the exchange attempt frequency (EAF) is also an important issue for the sampling efficiency of parallel tempering and REMD. In 2001, Opps and Schofield [178] investigated the effect of EAF in parallel tempering. A two-dimensional spin system and a polypeptide in vacuum were selected to test the effect of EAF on properties such as the order parameter and the radius of gyration of the polypeptide. They suggested that the most efficient scheme is to attempt an exchange after every few MC steps. The situation is more complicated in the case of REMD. In general, thermostats are used in MD propagation to maintain a canonical ensemble.
It has been argued that exchanges in REMD should happen only after the system temperature has stabilized [179], because attempting exchanges too frequently may prevent the system from dissipating heat. This argument was supported by studies of the peptide Fs21 performed by Zhang et al. [179], who suggested that an exchange attempt interval of 1 ps is desirable for REMD. However, Sindhikara et al. [180] later showed that a small exchange attempt interval (even as small as a few MD steps) does not affect heat dissipation, provided that the REMD exchange is done properly. They investigated the deviation of conformational sampling from a long-simulation reference calculation as a function of EAF and concluded that a large EAF (a small time interval between exchange attempts) is preferred. Abraham and Gready [181] studied the effect of EAF using a 23-residue peptide in explicit water. By examining the potential energy autocorrelation time, they argued that an exchange period below 1 ps is too short for replica exchange attempts to be independent, which reduces the tempering efficiency. However, their conclusion was not supported by an investigation of tempering efficiency performed by Zhang and Ma [182], who utilized the transition matrix and its correlation functions. The autocorrelation function of the transition probability can be written as a function of the eigenvalues of the transition matrix, and its decay time was explored in order to understand the tempering efficiency. Zhang and Ma found that the tempering efficiency increases monotonically as the EAF increases.

Thermostat effects on the performance of REMD have also been explored. Earlier work was done by the Garcia group [172]. They studied whether the potential energy PDFs satisfy the Boltzmann distribution:

$$\ln\left[\frac{P(U, T_1)}{P(U, T_2)}\right] = \left(\frac{1}{k_BT_2} - \frac{1}{k_BT_1}\right)U + c$$

where P(U, T) is the potential energy PDF at temperature T and c is a constant.
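The linear relation above suggests a simple diagnostic, sketched here in Python: histogram the potential energies sampled at two temperatures on a shared grid, take the logarithm of the ratio bin by bin, and compare the fitted slope with $1/k_BT_2 - 1/k_BT_1$. The function name, the bin settings, and the synthetic gamma-distributed energies (canonical for a power-law DOS, in units where $k_B = 1$) are assumptions for illustration, not data from the studies cited.

```python
import math
import random


def log_ratio_slope(energies_1, energies_2, lo, hi, n_bins, min_count=100):
    """Least-squares slope of ln[P(U,T1)/P(U,T2)] versus U on a shared grid.
    Bins with fewer than min_count samples in either histogram are skipped."""
    width = (hi - lo) / n_bins

    def histogram(energies):
        counts = [0] * n_bins
        for u in energies:
            if lo <= u < hi:
                counts[min(int((u - lo) / width), n_bins - 1)] += 1
        return counts

    counts_1, counts_2 = histogram(energies_1), histogram(energies_2)
    xs, ys = [], []
    for k in range(n_bins):
        if counts_1[k] >= min_count and counts_2[k] >= min_count:
            xs.append(lo + (k + 0.5) * width)  # bin center
            p1 = counts_1[k] / len(energies_1)
            p2 = counts_2[k] / len(energies_2)
            ys.append(math.log(p1 / p2))
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    s_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    s_xx = sum((x - x_mean) ** 2 for x in xs)
    return s_xy / s_xx


# Synthetic canonical samples: n(U) ~ U^(SHAPE-1) gives a gamma PDF with scale k_B*T.
rng = random.Random(0)
T1, T2, SHAPE = 1.0, 1.1, 50.0
sample_1 = [rng.gammavariate(SHAPE, T1) for _ in range(100_000)]
sample_2 = [rng.gammavariate(SHAPE, T2) for _ in range(100_000)]
slope = log_ratio_slope(sample_1, sample_2, 35.0, 70.0, 30)
expected = 1.0 / T2 - 1.0 / T1  # k_B = 1
```

For energies produced by a thermostat that does not generate a canonical ensemble, the fitted slope would be expected to deviate from `expected`, or the log-ratio would not be linear in U.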
They found that the Nosé-Hoover and Andersen thermostats satisfy the above condition, while the Berendsen thermostat does not. Rosta et al. [183] investigated thermostat artifacts in REMD simulations in 2009. The current REMD exchange scheme assumes a Boltzmann distribution (canonical ensemble) in the calculation of the exchange probability, but the Berendsen thermostat cannot preserve the Boltzmann distribution. They therefore performed REMD simulations of bulk water and of protein folding with the temperature controlled either by the Berendsen thermostat or by Langevin dynamics, and studied the potential energy PDFs and thermal unfolding under the two thermostats. The Berendsen thermostat was shown to produce a shifted average potential energy and prolonged tails in the potential energy PDF for bulk water, while no such effect was seen when Langevin dynamics was employed. With the Berendsen thermostat, the probability of folding was increased at low temperatures and decreased at high temperatures. The authors proposed that REMD simulations be performed with thermostats that can generate a Boltzmann distribution, such as Langevin dynamics and the Andersen and Nosé-Hoover thermostats.

In a REMD simulation, the number of replicas needed to cover a temperature range scales as $O(f^{1/2})$, where f is the number of degrees of freedom of the system. For a large system, the number of replicas needed is large. For example, 64 replicas were used in a REMD study of a β-hairpin surrounded by explicit water molecules (4342 atoms in each replica) to cover the temperature range from 270 K to 695 K [184]. A number of methods have been developed to reduce the number of replicas needed in REMD simulations. In 2002, Fukunishi et al. [61] proposed Hamiltonian REMD (HREMD). In the HREMD scheme, replicas differ in their Hamiltonians but have the same temperature.
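The HREMD exchange test of Eq. 2-108 can be sketched in a few lines of Python. The convention assumed here is that replica i currently runs under Hamiltonian $H_n$ and replica j under $H_m$, so the exponent is $-\beta$ times the energy change caused by swapping the two structures. The function and argument names are hypothetical.

```python
import math


def hremd_exchange_probability(h_n_i, h_m_j, h_m_i, h_n_j, beta):
    """Metropolis probability for swapping the structures of replica i (currently
    under H_n) and replica j (currently under H_m) at a common inverse
    temperature beta. h_x_y is Hamiltonian x evaluated on the structure of replica y."""
    # (energy before the swap) - (energy after the swap), multiplied by beta
    exponent = beta * (h_n_i + h_m_j - h_m_i - h_n_j)
    return 1.0 if exponent >= 0.0 else math.exp(exponent)
```

Unlike TREMD, each attempt needs two extra energy evaluations, because each structure must also be scored with the other replica's Hamiltonian.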
The exchange strategies in the paper of Fukunishi et al. were to scale hydrophobic interactions and to scale van der Waals interactions. In 2005, Liu et al. [167] published a method named replica exchange with solute tempering. In this algorithm, the protein-water and water-water interactions are scaled such that the exchange probability does not depend on the number of explicit water molecules. The number of replicas needed to cover the same temperature range is significantly reduced compared with the original REMD algorithm. Lyman et al. [185], and later Liu and Voth [186,187], developed resolution exchange schemes to improve the performance of REMD, in which coarse-grained (low-resolution) models take over the role of the high-temperature replicas. The Simmerling group contributed a hybrid explicit/implicit solvation model [188] in order to reduce the number of replicas needed in REMD simulations with explicit water molecules. Each replica is propagated in an explicit water box. At an exchange attempt, the solute and its solvation shell, which is determined on the fly, are placed in a dielectric continuum. Exchange probabilities are calculated from the potential energies of the solute and the hybrid solvent. The use of a hybrid solvent reduced the number of replicas from 40 to 8 in a test case of the polypeptide Ala10 simulated at temperatures from 267 K to 571 K.

Structural reservoir techniques [168-170] have also been incorporated into the REMD algorithm. High-temperature MD simulations are performed first to generate a structural reservoir, and structures in the reservoir are brought to the replicas via exchanges. One advantage of using a structural reservoir is that non-Boltzmann weight factors can be chosen in the calculation of exchange probabilities [170]. Recently, Ballard and Jarzynski [189] proposed using nonequilibrium work simulations to accept exchange attempts.
Kamberaj and van der Vaart [190] developed a new scheme to perform exchanges, in which the generalized canonical PDF is employed to achieve a flat potential of mean force in temperature space. The Wang-Landau algorithm [163,164] was adopted in order to estimate the DOS in temperature space, and the round-trip time between extreme temperatures was minimized.

More recently, solvent viscosity has been selected as a parameter in addition to temperature for the REMD method [191]. This method is named V-REMD and is essentially a two-dimensional REMD method. The motivation for choosing viscosity as a parameter is that the lower the viscosity, the faster a protein diffuses and samples conformational space. In this algorithm, one replica is selected to have normal viscosity and the others use reduced viscosities. The mass of the solvent molecules is scaled by a factor of $\lambda^2$ when the viscosity is scaled by a factor of $\lambda$. Changing the mass of the solvent molecules does not affect the potential energy at an exchange attempt. Thus, the exchange probability of V-REMD is the same as that of conventional TREMD. The author applied V-REMD to the study of trialanine, deca-alanine, and a 16-residue β-hairpin peptide. By using V-REMD, the number of replicas was reduced by a factor of 1.5 to 2.

The replica exchange method (REM) can be coupled with other generalized-ensemble methods in order to enhance conformational sampling. The Okamoto group has coupled REM with MUCA and with simulated tempering. The two resulting schemes are called the multicanonical replica exchange method [192] and replica exchange simulated tempering [193,194], respectively. The details of coupling REM with generalized-ensemble methods can be found in a review by Mitsutake et al. [53]

Due to its stochastic nature, the REMD algorithm has been employed to investigate thermodynamics rather than kinetics [195]. However, a properly designed scheme for analyzing the REMD trajectory in phase space can yield information about kinetics.
In 2005, Levy and coworkers [195] designed a kinetic network and used a master equation to solve for the transition rates from REMD simulations. The structures at all temperatures are grouped into states based on their structural similarity (they selected a 42-dimensional Euclidean distance space based on Cα-Cα distances, instead of clustering, to group their structures). A state is denoted as a node, and an edge stands for a transition between two nodes. A total of 800,000 nodes and $7.347 \times 10^9$ edges were obtained. The master equation was utilized to describe the transitions between states. Since the conformational space is discretized into states, the master equation is written in matrix notation, $dP(t)/dt = KP(t)$, where K is the transition matrix and P(t) is the probability distribution of states at time t. Instead of solving for the eigenvalues of the transition matrix or solving the differential equation numerically, the authors simulated paths satisfying the master equation. Markov state models of this kind have also been employed in the study of protein folding.

In 2006, van der Spoel and Seibert [196] studied the protein folding rate based on the Arrhenius equation. The folding mechanism in their investigation was assumed to be two-state. A binary folding indicator, based on the RMSD relative to the native state, was adopted by the authors, and a first-order reaction rate equation was set up. The rate equation was then integrated and averaged over all trajectories in order to generate a derived fraction of folded structures. A fitting parameter $\chi^2$, which measures the difference between the derived and actual fractions of folded structures, was minimized numerically with respect to the energy barriers and pre-exponential factors. In this manner, the Arrhenius reaction rate is obtained from REMD simulations. Yang et al. [197] proposed using a diffusion equation to extract kinetics from REMD simulations in 2007.
The Fokker-Planck equation was employed to extract the local drift velocity and diffusion coefficient from REMD simulations. Langevin dynamics on the reaction coordinate is then performed using the drift velocity and diffusion coefficient, and the free energy landscape is reconstructed from the same two quantities. In 2008, Buchete and Hummer [198] demonstrated that both local conformational transition rates and global folding rates can be accurately extracted from REMD simulations, without any assumption about the temperature dependence of the kinetics (Arrhenius or non-Arrhenius). Similar to Levy and coworkers, Buchete and Hummer adopted a master equation operating on a discretized space to describe transitions. The conditional probability of state j at time t, given the initial state i, was computed from the master equation. The likelihood of observing $N_{ji}$ transitions in a time interval was maximized with respect to the natural logarithm of the transition rate constants (the transition matrix elements) and the natural logarithm of the equilibrium population of state i, which yields the rate constants. A detailed description can be found in the paper of Buchete and Hummer.

CHAPTER 3 CONSTANT-pH REMD: METHOD AND IMPLEMENTATION

3.1 Introduction

In this chapter, the constant-pH REMD algorithm used in the AMBER simulation suite is presented and is employed to study model systems. We first tested our method on five dipeptides and a model peptide with the sequence Ala-Asp-Phe-Asp-Ala (ADFDA). The two ends of the model peptide ADFDA were not capped, so the two ionizable side chains have different electrostatic environments. The pKa values of the two Asp residues are therefore expected to be different. Our constant-pH REMD method is then applied to a heptapeptide derived from OMTKY3, the same heptapeptide that Dlugosz and Antosiewicz studied in their paper.
NMR experiments indicated that the pKa of Asp is 3.6 [122], 0.4 pKa units lower than the value for the blocked Asp dipeptide. Dlugosz and Antosiewicz performed constant-pH MD simulations, and their method predicted the pKa to be 4.24 [122]. Our purpose is to show that the REMD algorithm coupled with a discrete protonation state description can greatly improve pH-dependent protein conformation and protonation state sampling.

* Reproduced in part with permission from Meng, Y.; Roitberg, A.E. Constant pH Replica Exchange Molecular Dynamics in Biomolecules Using a Discrete Protonation Model, J. Chem. Theory Comput. 2010, 6, 1401-1412. Copyright 2010 American Chemical Society.

3.2 Theory and Methods

3.2.1 Constant-pH REMD Algorithm in the AMBER Simulation Suite

In constant-pH molecular dynamics, the potential energy of the system depends not only on the protein structure but also on the protein protonation state. Likewise, when coupling the REMD algorithm with constant-pH MD, one can either attempt to exchange molecular structures only or swap both structures and protonation states at the same time. For simplicity, let us consider two replicas, where replica 0 has temperature $T_0$, protein structure $q_0$ and protonation state $n_0$, while replica 1 has temperature $T_1$, structure $q_1$ and protonation state $n_1$. A diagrammatic description of the two exchange algorithms is shown in Figure 3-1.

Figure 3-1. Methods to perform exchange attempts. A) Only molecular structures are attempted to exchange. The protonation states are kept the same. B) Both molecular structures and protonation states are attempted to exchange.

In the first way of performing an exchange attempt, replica 0 tries to jump from state $(q_0, n_0)$ to state $(q_1, n_0)$ at temperature $T_0$ in one Monte Carlo step. Similarly, replica 1 attempts to move from state $(q_1, n_1)$ to state $(q_0, n_1)$ at temperature $T_1$. Protonation states are retained at exchange attempts and only change during dynamics.
Therefore, the detailed balance equation now becomes:

$$w(\beta_0 q_0 n_0, \beta_1 q_1 n_1 \to \beta_0 q_1 n_0, \beta_1 q_0 n_1)\,\exp[-\beta_0 E(q_0, n_0)]\,\exp[-\beta_1 E(q_1, n_1)] = w(\beta_0 q_1 n_0, \beta_1 q_0 n_1 \to \beta_0 q_0 n_0, \beta_1 q_1 n_1)\,\exp[-\beta_0 E(q_1, n_0)]\,\exp[-\beta_1 E(q_0, n_1)] \tag{3-1}$$

Here $w(\beta_0 q_0 n_0, \beta_1 q_1 n_1 \to \beta_0 q_1 n_0, \beta_1 q_0 n_1)$ is the transition probability of swapping structures. If the Metropolis criterion is used, this exchange probability can be written as:

$$w(\beta_0 q_0 n_0, \beta_1 q_1 n_1 \to \beta_0 q_1 n_0, \beta_1 q_0 n_1) = \min\{1, \exp(\Delta)\} \tag{3-2}$$

In Eq. 3-2, $\Delta$ has the form:

$$\Delta = \beta_0[E(q_0, n_0) - E(q_1, n_0)] - \beta_1[E(q_0, n_1) - E(q_1, n_1)] \tag{3-3}$$

Here $\beta_0 = 1/k_BT_0$, $\beta_1 = 1/k_BT_1$, and E is the potential energy. If the protonation states of two adjacent replicas are the same at an exchange attempt, the exchange probability of our constant-pH REMD is equivalent to the conventional REMD exchange probability. If they are not the same, four potential energy terms are needed to calculate the exchange probability. Under this circumstance, constant-pH REMD becomes a REMD algorithm that combines the temperature and Hamiltonian REMD algorithms.

One possible concern about exchanging only structures is the role of kinetic energy, especially when $n_0$ and $n_1$ are different. In the REMD algorithm developed by Sugita and Okamoto, the kinetic energy terms in the Boltzmann factors cancel each other on average through velocity rescaling (Eq. 2-88), so only potential energies are required to compute exchange probabilities. Canceling the kinetic energy terms becomes problematic when the two systems attempting to exchange do not have the same number of particles. However, in the constant-pH MD algorithm proposed by Mongan et al., a proton does not leave the molecule but becomes a dummy atom when an ionizable side chain is in its deprotonated state. Furthermore, that dummy atom retains its position and velocity, which are controlled by molecular dynamics.
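The structure-only exchange test of Eq. 3-2 and Eq. 3-3 can be sketched in Python as follows. This is an illustration rather than the AMBER implementation; the function names, the argument order, and the kcal/(mol K) value of the Boltzmann constant are assumptions.

```python
import math

K_B = 0.0019872041  # Boltzmann constant in kcal/(mol K); the unit choice is an assumption


def structure_exchange_delta(e_q0_n0, e_q1_n0, e_q0_n1, e_q1_n1, t0, t1, k_b=K_B):
    """Delta of Eq. 3-3. e_qa_nb is the potential energy of structure q_a
    evaluated with protonation state n_b."""
    beta0 = 1.0 / (k_b * t0)
    beta1 = 1.0 / (k_b * t1)
    return beta0 * (e_q0_n0 - e_q1_n0) - beta1 * (e_q0_n1 - e_q1_n1)


def structure_exchange_probability(e_q0_n0, e_q1_n0, e_q0_n1, e_q1_n1, t0, t1, k_b=K_B):
    """min{1, exp(Delta)} of Eq. 3-2, guarded against overflow in exp()."""
    delta = structure_exchange_delta(e_q0_n0, e_q1_n0, e_q0_n1, e_q1_n1, t0, t1, k_b)
    return 1.0 if delta >= 0.0 else math.exp(delta)
```

When both replicas carry the same protonation state, `e_q0_n0 == e_q0_n1` and `e_q1_n0 == e_q1_n1`, so `Delta` reduces to $(\beta_0 - \beta_1)[E(q_0) - E(q_1)]$, the conventional TREMD expression.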
Hence, the kinetic energy contributions to the Boltzmann weight cancel in the exchange probability calculation, leaving only the potential energy terms.

The second possibility consists of exchanging protonation states as well as molecular structures at REMD Monte Carlo moves. For instance, replica 0 attempts to move from state $(q_0, n_0)$ to state $(q_1, n_1)$ at temperature $T_0$ in one MC move, and replica 1 attempts to jump from state $(q_1, n_1)$ to state $(q_0, n_0)$ at temperature $T_1$. The detailed balance equation can now be written as:

$$\frac{w(\beta_0 q_0 n_0, \beta_1 q_1 n_1 \to \beta_0 q_1 n_1, \beta_1 q_0 n_0)}{w(\beta_0 q_1 n_1, \beta_1 q_0 n_0 \to \beta_0 q_0 n_0, \beta_1 q_1 n_1)} = \frac{w(\beta_0 q_0 n_0 \to \beta_0 q_1 n_1)\,w(\beta_1 q_1 n_1 \to \beta_1 q_0 n_0)}{w(\beta_1 q_0 n_0 \to \beta_1 q_1 n_1)\,w(\beta_0 q_1 n_1 \to \beta_0 q_0 n_0)} \tag{3-4}$$

This equation states that the exchange probability is the product of the MC transition probabilities at temperatures $T_0$ and $T_1$. If the protonation states of two adjacent replicas are the same at an exchange attempt, the exchange probability of constant-pH REMD becomes the exchange probability of conventional temperature-based REMD. If $n_0$ and $n_1$ are different, then each MC transition is essentially the protonation state change step of constant-pH MD plus a structural transition. For example, consider the MC transition at temperature $T_0$,

$$w(\beta_0 q_0 n_0 \to \beta_0 q_1 n_1) = \min\{1, \exp(-\Delta_1)\} \tag{3-5}$$

In Eq. 3-5, $\Delta_1$ has the form:

$$\Delta_1 = \beta_0[E(q_1, n_0) - E(q_0, n_0)] + (\mathrm{pH} - \mathrm{p}K_{a,ref})\ln 10 + \beta_0[E_{elec}(q_1, n_1) - E_{elec}(q_1, n_0)] - \beta_0\,\Delta G_{ref,MM} \tag{3-6}$$

The first term in $\Delta_1$ derives from the transition in configuration at fixed protonation state $n_0$, and the rest corresponds to the protonation state change at fixed structure $q_1$. $E_{elec}$ represents the electrostatic component of the potential energy. Similarly, the transition probability of the MC jump at $T_1$ can be expressed as:

$$w(\beta_1 q_1 n_1 \to \beta_1 q_0 n_0) = \min\{1, \exp(-\Delta_2)\} \tag{3-7}$$

and

$$\Delta_2 = \beta_1[E(q_0, n_1) - E(q_1, n_1)] - (\mathrm{pH} - \mathrm{p}K_{a,ref})\ln 10 - \beta_1[E_{elec}(q_0, n_1) - E_{elec}(q_0, n_0)] + \beta_1\,\Delta G_{ref,MM} \tag{3-8}$$

Therefore, similar to Eq. 3-2, the exchange probability can be written as:

$$w(\beta_0 q_0 n_0, \beta_1 q_1 n_1 \to \beta_0 q_1 n_1, \beta_1 q_0 n_0) = \min\{1, \exp(\Delta')\} \tag{3-9}$$

with $\Delta' = -(\Delta_1 + \Delta_2)$, that is,

$$\Delta' = \Delta - \beta_0[E_{elec}(q_1, n_1) - E_{elec}(q_1, n_0)] + \beta_1[E_{elec}(q_0, n_1) - E_{elec}(q_0, n_0)] + (\beta_0 - \beta_1)\,\Delta G_{ref,MM} \tag{3-10}$$

In Eq. 3-10, $\Delta$ is the same quantity as in Eq. 3-3, and the pH-dependent terms of Eq. 3-6 and Eq. 3-8 cancel. The exchange probability calculation in the second method of coupling REMD and constant-pH MD uses the same energy terms required by the first method, since obtaining the electrostatic potential energies does not require extra energy calculations. The advantage of implementing the second exchange protocol over the first should not be significant, because it is the conformational sampling at higher temperatures that greatly improves conformational sampling at lower temperatures. Allowing protonation states to change at exchange attempts does not provide extra gains in conformational sampling. In addition, one can always choose to sample protonation state space during the MD propagation. Therefore, only the first method of performing exchanges was implemented.

3.2.2 Simulation Details

Constant-pH REMD simulations were carried out first on five reference compounds (blocked aspartate, glutamate, histidine, lysine, and tyrosine) to test our method and implementation. The experimental pKa values of these reference compounds are known and are listed in Table 3-1. We later performed constant-pH REMD simulations on the model peptide ADFDA (Ala-Asp-Phe-Asp-Ala, unblocked termini) and the heptapeptide derived from OMTKY3 (residues 26 to 32 with blocked termini). Four replicas were used in the REMD simulations of the reference compounds and ADFDA. The temperatures were 240, 300, 370, and 460 K for all six molecules. The acidic side chains were titrated from pH 2.5 to 6, histidine from pH 5.5 to 8, and the basic side chains from pH 9 to 12. An interval of 0.5 pH units was chosen for all titrations. Eight replicas were chosen for the heptapeptide, with a temperature range from 250 to 480 K.
Each replica was simulated for 10 ns in all REMD simulations, and an exchange was attempted every 2 ps. An MC move to change the protonation state was attempted every 10 fs. A second set of REMD runs was performed under the same overall conditions but with different initial structures in order to check simulation convergence. To compare conformational and protonation state sampling, 100 ns constant-pH MD simulations were carried out for the aspartate reference compound and ADFDA at the same pH values as in the REMD runs. For the heptapeptide, one set of 10 ns constant-pH MD simulations was run at all pH values simulated by the REMD method.

Constant-pH REMD and MD simulations were done using the AMBER 10 molecular simulation suite.199 The AMBER ff99SB force field139 was used in all the simulations. The SHAKE algorithm145 was used to constrain the bonds connecting hydrogen atoms with heavy atoms in all the simulations, which allowed the use of a 2 fs time step. The OBC Generalized Born implicit solvent model200 was used to model the water environment in all our calculations. The Berendsen thermostat,146 with a relaxation time of 2 ps, was used to keep the replica temperatures around their target values. The salt concentration (Debye-Hückel based) was set at 0.1 M. The cutoff for nonbonded interactions and the Born radii was 30 Å.

3.2.3 Global Conformational Sampling Comparison Using Cluster Analysis

In our study, global conformational sampling was compared using cluster analysis.169,188 Cluster analysis is a technique that groups "similar" structures; each group is called a cluster. It relies on a measure of similarity between two objects. In the cluster analysis we performed, similarity is measured by the backbone RMSD, and the hierarchical agglomerative clustering algorithm is employed. A hierarchical algorithm creates a hierarchy of clusters and can be agglomerative or divisive.
The hierarchical agglomerative algorithm starts by treating every object as its own cluster and repeatedly merges similar clusters, while the divisive algorithm starts with one cluster containing all objects and splits it into smaller groups. In our work, the cluster analysis was done using the MoilView program.201 The MD and REMD trajectories (having the same number of frames) at 300 K and at the same solution pH value were first combined. The ptraj module of the AMBER package was used to create the combined trajectory: the "trajin" keyword was used to read in the two trajectories and the "trajout" command wrote out the combined trajectory. The combined trajectory was clustered based on the root-mean-square deviations (RMSDs) of the peptide backbone atoms. A cluster cutoff RMSD of 1.5 Å was chosen for both ADFDA and the heptapeptide. Because the combined trajectory is clustered, the MD and REMD conformations populate the same set of clusters. The MoilView program generates a file indicating which cluster each snapshot in the combined trajectory belongs to. From this file, the fraction of the conformational ensemble corresponding to each cluster (the fractional population of each cluster) was calculated separately for the MD and the REMD run, giving two sets of fractional populations. The fractional population of a given cluster from the MD trajectory need not equal that from the REMD trajectory. Therefore, the correlation between the two sets can be investigated by plotting one set against the other and performing a linear fit. If the MD and REMD simulations produced the same structural ensemble, the fractional population of a cluster from the MD simulation would be the same as that from the REMD simulation. The cluster population fractions from the REMD simulation were plotted against those from the MD simulation (see Figure 3-7A).
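The workflow described above (cluster a combined trajectory, then compare per-run cluster populations) can be sketched in a few lines. This is only a toy illustration, not the MoilView procedure: the complete-linkage merging rule, the function names, and the use of a generic pairwise distance in place of the backbone RMSD are all assumptions made for the example.

```python
from itertools import combinations


def agglomerative_clusters(n_items, distance, cutoff):
    """Agglomerative clustering with complete linkage (an assumed linkage rule).

    distance(i, j) returns the dissimilarity of items i and j (for example a
    backbone RMSD). Merging stops when no pair of clusters is within cutoff.
    """
    clusters = [[i] for i in range(n_items)]  # every frame starts as its own cluster
    while len(clusters) > 1:
        best = None
        for a, b in combinations(range(len(clusters)), 2):
            # Complete linkage: cluster distance is the largest member-to-member distance.
            d = max(distance(i, j) for i in clusters[a] for j in clusters[b])
            if best is None or d < best[0]:
                best = (d, a, b)
        if best[0] > cutoff:
            break
        _, a, b = best
        clusters[a].extend(clusters[b])
        del clusters[b]
    return clusters


def fractional_populations(clusters, labels, source):
    """Fraction of the frames from one run (source) that falls in each cluster."""
    total = sum(1 for label in labels if label == source)
    return [sum(1 for i in c if labels[i] == source) / total for c in clusters]


def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


if __name__ == "__main__":
    # Hypothetical one-dimensional "structures"; abs() stands in for the RMSD.
    md = [0.0, 0.1, 0.2, 5.0, 5.1, 10.0]
    remd = [0.1, 0.15, 5.0, 5.2, 5.1, 10.1]
    coords = md + remd
    labels = ["MD"] * len(md) + ["REMD"] * len(remd)
    clusters = agglomerative_clusters(len(coords), lambda i, j: abs(coords[i] - coords[j]), 1.5)
    pop_md = fractional_populations(clusters, labels, "MD")
    pop_remd = fractional_populations(clusters, labels, "REMD")
    print(pop_md, pop_remd, pearson_r(pop_md, pop_remd))
```

The square of the value returned by `pearson_r` corresponds to the R² of a straight-line fit of one set of populations against the other.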
The correlation coefficients between the MD and REMD cluster populations were calculated at each solution pH value by linear regression.169,188 A high correlation between the MD and REMD cluster populations indicates that the structural ensembles are similar to each other. This method provides a direct comparison of global conformational sampling between MD and REMD simulations. The same technique was used when studying the convergence of constant-pH REMD and MD trajectories (see Figure 3-7B and Figure 3-12). When investigating the convergence of conformational sampling, snapshots from two constant-pH REMD simulations (or two constant-pH MD simulations) were combined. The two constant-pH simulations should have the same temperatures and solution pH values and differ only in their initial structures. A high correlation coefficient indicates that the two structural ensembles are similar and that the conformational sampling is converged, while a poor correlation means that the structural ensembles are different and that the conformational sampling depends on the initial conditions.

3.2.4 Local Conformational Sampling and Convergence to Final State

In our study, the local conformational sampling was examined by comparing the probability distributions of the backbone dihedral angle pair (φ, ψ). Essentially, we are comparing the Ramachandran plots of a residue. Each (φ, ψ) probability density was computed by binning the φ and ψ angle pairs into 10° × 10° bins, which leads to a 36 × 36 histogram. These two-dimensional histograms were normalized into populations and the contours were plotted. The metric used to evaluate the convergence of the (φ, ψ) probability density was the root-mean-square deviation (RMSD) between the cumulative (φ, ψ) histogram and the one produced by using all configurations. Each cumulative histogram was constructed from the (φ, ψ) pairs up to the current time, following the same algorithm mentioned earlier in this section.
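The convergence metric described in this section can be sketched as follows. This is a minimal pure-Python version under stated assumptions: angles are given in degrees in the range [-180, 180], the function names are hypothetical, and the checkpoints are expressed as numbers of frames rather than times.

```python
import math

N_BINS = 36  # 360 degrees divided into 10-degree bins


def dihedral_histogram(pairs, n_bins=N_BINS):
    """Normalized 2D histogram of (phi, psi) pairs given in degrees."""
    width = 360.0 / n_bins
    hist = [[0.0] * n_bins for _ in range(n_bins)]
    for phi, psi in pairs:
        # Shift from [-180, 180] to [0, 360]; the modulo folds +180 onto the first bin.
        i = int((phi + 180.0) // width) % n_bins
        j = int((psi + 180.0) // width) % n_bins
        hist[i][j] += 1.0
    total = float(len(pairs))
    return [[count / total for count in row] for row in hist]


def histogram_rmsd(p, q):
    """Root-mean-square deviation between two equally sized square histograms."""
    n = len(p)
    squared = sum((p[i][j] - q[i][j]) ** 2 for i in range(n) for j in range(n))
    return math.sqrt(squared / (n * n))


def cumulative_rmsd(pairs, checkpoints):
    """RMSD between the histogram of the first k frames and the final histogram."""
    final = dihedral_histogram(pairs)
    return [histogram_rmsd(dihedral_histogram(pairs[:k]), final) for k in checkpoints]


if __name__ == "__main__":
    # Hypothetical trajectory: three helical-like frames and one extended-like frame.
    trajectory = [(-60.0, -45.0)] * 3 + [(-120.0, 130.0)]
    print(cumulative_rmsd(trajectory, [1, 2, 4]))
```

By construction the last checkpoint, which uses every frame, gives an RMSD of zero, so the interesting quantity is how quickly the curve approaches zero.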
Essentially, we were computing the RMSD between two matrices. The RMSD between the cumulative probability density at time t and the final probability density (computed from all configurations) is given by

\[
\mathrm{RMSD}(t) = \sqrt{\frac{\sum_{i=1}^{36} \sum_{j=1}^{36} \left[ P_{ij}(t) - P_{ij,\mathrm{final}} \right]^2}{36 \times 36}}
\tag{3-11}
\]

where $P_{ij}(t)$ is the ij-th element of the cumulative probability density of the (φ, ψ) pairs at time t and $P_{ij,\mathrm{final}}$ is the corresponding element of the final probability density matrix.

3.3 Results and Discussion

3.3.1 Reference Compounds

We first applied our constant-pH REMD method to the reference compounds. Table 3-1 shows the pKa values predicted by the REMD simulations (10 ns for each replica) as well as the reference pKa values. All our pKa values were calculated by fitting to the HH equation. The constant-pH REMD predictions agree with the reference values.

Table 3-1. The REMD pKa predictions of reference compounds.
pKa         Aspartate    Glutamate    Histidine    Lysine        Tyrosine
REMD        3.97(0.01)   4.41(0.01)   6.40(0.03)   10.42(0.01)   9.61(0.01)
Reference   4.0          4.4          6.5          10.4          9.6
The numbers in parentheses are the standard errors.

The pH titration curves of the same reference compounds showed agreement between the MD (100 ns) and REMD simulations. Figure 3-2 shows the REMD and MD titration curves of the aspartic acid reference compound as an example.

Figure 3-2. Titration curves of blocked aspartate amino acid from 100 ns MD at 300 K and REMD runs. Agreement can be seen between MD and REMD simulations. (Legend: MD run, REMD run; x-axis: solution pH.)

We further studied the convergence of protonation state sampling. The REMD and MD cumulative protonation fractions were plotted with respect to MC attempts for the aspartate reference compound at all pH values. Figure 3-3 shows the protonated fraction versus time at pH 4 as one example.
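The two protonation-state analyses used in this section, the cumulative protonated fraction and the fit of a titration curve, can be sketched as below. The fit uses the linearized Hill form log10([A-]/[HA]) = n(pH - pKa), of which the HH equation is the special case n = 1; the function names and the plain least-squares fit are assumptions for illustration and are not claimed to be the fitting procedure used for Table 3-1.

```python
import math


def cumulative_fraction(states):
    """Running average of a 0/1 protonation-state series (1 = protonated)."""
    averages, total = [], 0
    for k, state in enumerate(states, start=1):
        total += state
        averages.append(total / k)
    return averages


def hill_fit(ph_values, protonated_fractions):
    """Fit log10((1 - f) / f) = n * (pH - pKa) by least squares.

    f is the protonated fraction at each pH. Points with f equal to 0 or 1
    are skipped because the logarithm is undefined there.
    Returns (pKa, n), where n is the Hill coefficient.
    """
    xs, ys = [], []
    for ph, f in zip(ph_values, protonated_fractions):
        if 0.0 < f < 1.0:
            xs.append(ph)
            ys.append(math.log10((1.0 - f) / f))
    n_points = len(xs)
    mean_x, mean_y = sum(xs) / n_points, sum(ys) / n_points
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return -intercept / slope, slope


if __name__ == "__main__":
    # Synthetic titration data generated from an ideal curve with pKa = 4.0, n = 1.
    ph_grid = [2.5 + 0.5 * k for k in range(8)]
    fractions = [1.0 / (1.0 + 10.0 ** (ph - 4.0)) for ph in ph_grid]
    print(hill_fit(ph_grid, fractions))
    print(cumulative_fraction([1, 0, 1, 1]))
```

Plotting the output of `cumulative_fraction` against the MC step index gives curves of the kind compared between MD and REMD runs.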
Figure 3-3 suggests that although the final pKa predictions are the same for the REMD and MD simulations, the protonation state sampling in the REMD simulations clearly converges faster than that in an MD run.

Figure 3-3. Cumulative average protonation fraction of aspartic acid reference compound vs Monte Carlo (MC) steps at pH=4. (Legend: MD, pH=4; REMD, pH=4; x-axis: MC titration steps.)

3.3.2 Model peptide ADFDA

The model peptide ADFDA (as a zwitterion) was chosen as a more stringent test of our constant-pH REMD method. The charged termini provide a different electrostatic environment for each titratable Asp residue, and hence a correct constant-pH REMD model should reflect this difference in the titration curves of the two Asp residues. The Asp2 residue is closer to the NH3+ group, so its deprotonated state is favored and its pKa value should shift below 4.0 (the pKa value of the reference aspartic dipeptide). The Asp4 residue is closer to the negatively charged COO- group, and hence its pKa value should shift above 4.0.

The titration curves of the model peptide ADFDA from the REMD simulations are shown in Figure 3-4. We can clearly see that Asp2 and Asp4 have titration curves that differ from each other and from the reference compound. The pKa value and Hill coefficient of each Asp residue were obtained by fitting the titration curves to a Hill plot. The results are shown in Table 3-2. The REMD pKa predictions reflect the difference between Asp2 and Asp4 due to their different electrostatic environments in the peptide. We also display the MD titration curves of Asp2 and Asp4 in Figure 3-4 and list the MD pKa predictions and corresponding Hill coefficients in Table 3-2. The titration curve of the Asp2 residue shows only a small difference between the MD and REMD simulations. However, the titration behavior of Asp4 differs between the MD and REMD calculations when the solution pH is below 5. Interestingly, Lee et al.
studied a blocked Asp-Asp peptide using the CPHMD method, reporting a different Hill coefficient for each of the two Asp residues.

Figure 3-4. The titration curves of the model peptide ADFDA at 300 K from both MD and REMD simulations. MD simulation time was 100 ns and 10 ns were chosen for each replica for REMD runs. (Legend: Asp2 MD, Asp2 REMD, Asp4 MD, Asp4 REMD, Asp reference; x-axis: solution pH.)

Table 3-2. pKa predictions and Hill coefficients fitted from the Hill plots.
         Asp2                        Asp4
         pKa    Hill coefficient     pKa    Hill coefficient
REMD     3.74   0.87                 4.38   0.67
MD       3.76   0.89                 4.54   0.85

The convergence rates of the Asp2 titration behavior were compared between the REMD and MD calculations because the Asp2 titration curves from the two methods are very close. The cumulative protonated fractions versus MC attempts at pH 4 are shown in Figure 3-5. As for the reference compound, faster convergence in protonation state sampling can be seen for the REMD simulation, even though both the REMD and MD calculations resulted in the same final protonated fraction. Clearly, our constant-pH REMD method accelerates the convergence of protonation state sampling.

Figure 3-5. Cumulative average protonation fraction of Asp2 in model peptide ADFDA vs Monte Carlo (MC) steps at pH=4. (Legend: MD, pH=4, Asp2; REMD, pH=4, Asp2; x-axis: MC titration steps.)

In addition to protonation state sampling, we also evaluated the conformational sampling in the constant-pH MD and REMD simulations. First, the distributions of the backbone φ and ψ angle pairs (Ramachandran plots) of residues Asp2, Phe3 and Asp4 in ADFDA were studied at each solution pH. The regions in the Ramachandran plots sampled by the MD and REMD simulations are the same at all pH values. Ramachandran plots for residue Asp2 at pH 4 are shown in Figure 3-6 as an example.

(A) MD, pH=4, Asp2 (B) REMD, pH=4, Asp2
Figure 3-6. Backbone dihedral angle (φ, ψ) normalized probability density (Ramachandran plots) for Asp2 at pH 4 in ADFDA. Ramachandran plots at other solution pH values are similar. For Asp2, constant-pH MD and REMD sampled the same local backbone conformational space. Phe3 and Asp4 Ramachandran plots also display the same trend.

Since the Ramachandran plot only represents local conformational sampling, we also evaluated global conformational sampling by clustering the MD and REMD trajectories and comparing the cluster populations. The R² values of the MD and REMD cluster populations are listed in Table 3-3. A plot of the cluster populations from the MD and REMD trajectories at solution pH 4 is shown in Figure 3-7A as an example. The large R² values indicate that the MD and REMD simulations sampled the same conformational space and generated the same structural ensemble. The small size of ADFDA and the simple structure of each residue make 100 ns long enough for MD to sample the relevant conformations.

We further studied the convergence of the REMD simulations by comparing the global conformation distributions of two REMD simulations starting from two different structures. Cluster populations of the two REMD simulations at solution pH 4 are displayed in Figure 3-7B. The R² value is 0.959 at pH 4. This large correlation tells us that the two REMD simulations provide the same structural ensemble and hence that the two simulations are converged.

Table 3-3. Correlation coefficients between MD and REMD cluster populations.
pH    2.5    3      3.5    4
R²    0.94   0.90   0.79   0.93
pH    4.5    5      5.5    6
R²    0.85   0.98   0.92   0.96
The R² values were calculated by linear regression.

(A) ADFDA, pH=4; linear fit, R²=0.93. (B) ADFDA, pH=4; linear fit, R²=0.96; x-axis: % population of REMD run 1.

Figure 3-7. Cluster populations of ADFDA at 300 K. A) MD vs REMD at pH 4. Trajectories from MD and REMD simulations are combined first.
By clustering the combined trajectory, the MD and REMD structural ensembles populate the same clusters. The fraction of the conformational ensemble corresponding to each cluster (fractional population of each cluster) was calculated for the MD and REMD simulations, respectively. The two resulting sets of fractional cluster populations were plotted against each other. B) Two REMD runs from different starting structures at pH 4. The large correlation shown in Figure 3-7B suggests that the REMD runs are converged. Large correlations between two independent REMD runs are also observed at other solution pH values. Correlations between MD and REMD simulations can be found in Table 3-3.

3.3.3 Heptapeptide derived from OMTKY3

We first compared the protonation state sampling of the constant-pH REMD and MD simulations. Titration curves of Asp3, Lys5 and Tyr7 from the two sets of simulations are plotted in Figure 3-8A and 3-8B. For each titratable residue, the titration curves generated by constant-pH REMD and MD are close to each other. Since the pKa value of Asp3 in this heptapeptide has been experimentally determined to be 3.6, it is interesting to evaluate how our predicted values compare with the experimental result. The pKa values of Asp3 were calculated from the Hill plots displayed in Figure 3-8C. The predicted pKa value is 3.7 for both the REMD and MD simulations, in excellent agreement with the experimental pKa value. Following the same procedure, our predicted pKa values of Lys5 and Tyr7 from the constant-pH REMD and MD simulations were obtained. Not surprisingly, the REMD and MD schemes yielded essentially the same predicted pKa values for Lys5 and Tyr7.

Figure 3-8. A) Titration curves of Asp3 in the heptapeptide derived from protein OMTKY3.
B) Titration curves of Lys5 and Tyr7 in the heptapeptide derived from protein OMTKY3. C) Hill plots of Asp3, from which the pKa values of Asp3 are obtained. (Legends: MD and REMD for Asp3, Lys5 and Tyr7; x-axis: solution pH.)

Although the final pKa predictions are the same for the constant-pH REMD and MD simulations, constant-pH REMD showed a clear advantage in the convergence of protonation state sampling. Again, we chose the cumulative average protonation fraction vs MC steps to reflect the convergence of protonation state sampling for all three titratable residues. Several representative plots are shown in Figure 3-9. The trend that constant-pH REMD simulations produce faster convergence in the protonation fraction holds for all three residues. Therefore, the constant-pH REMD method samples protonation states more efficiently than constant-pH MD.

Figure 3-9. A) Cumulative average protonation fraction of Asp3 of the heptapeptide derived from OMTKY3 vs MC steps. B) and C) are the cumulative average protonation fractions of Tyr7 and Lys5 in the heptapeptide vs MC steps, respectively. Clearly, faster convergence is achieved in constant-pH REMD simulations. (Legends: MD and REMD at pH=4 for Asp3, at pH=10 for Lys5 and at pH=9 for Tyr7; x-axis: MC titration steps.)

Conformational sampling is an important issue in constant-pH studies. We first looked at the conformational sampling of the peptide backbone, which we evaluated through Ramachandran plots. Six residues (from Ser2 to Tyr7) are studied here.
Not surprisingly, the Ramachandran plots from the constant-pH REMD and MD simulations are very close, suggesting that the overall local conformational sampling is similar. The Ramachandran plots of Asp3 at pH 4 are shown in Figure 3-10 as examples. The only exception is Tyr7 at acidic pH values: Tyr7 can visit the left-handed alpha helix conformation during constant-pH REMD runs but does not do so in constant-pH MD runs. In general, constant-pH REMD and MD yielded the same Ramachandran plots for the heptapeptide.

Figure 3-10. Dihedral angle (φ, ψ) probability densities of Asp3 at pH 4. A) Constant-pH MD results. B) Constant-pH REMD results. The two probability densities are almost identical, indicating that constant-pH MD and REMD sample the same local conformational space. All others also show a very similar trend.

As demonstrated earlier, the overall sampling of the (φ, ψ) distribution by constant-pH REMD and MD is similar for Ser2 to Thr6. It is interesting to determine how fast each sampling scheme reaches the final distribution. We studied the evolution of backbone conformational sampling based on cumulative data, as we did for the convergence of protonation state sampling. As described in the Methods section, the RMSD between the (φ, ψ) distribution accumulated up to the current time and the distribution from the total simulation time was calculated. The smaller the RMSD, the closer the probability distribution is to the final distribution. Deviations were calculated starting from the second nanosecond, with the time incremented in intervals of 100 ps. The cumulative time-dependent RMSDs of Asp3 and Lys5 are shown in Figure 3-11 as examples. As seen in the figures, these curves decrease faster in the constant-pH REMD simulations.
Figure 3-11 suggests that although the final (φ, ψ) probability distributions are similar for the constant-pH REMD and MD simulations, the constant-pH REMD simulation clearly reaches the final state faster.

Figure 3-11. The root-mean-square deviations (RMSD) between the cumulative (φ, ψ) probability density up to the current time and the (φ, ψ) probability density produced by the entire simulation. (φ, ψ) probability density convergence behaviors at other pH values also show that REMD runs converge to the final distribution faster. (Panels: A) MD and REMD, pH=4, Asp3; B) MD and REMD, pH=4, Lys5; x-axis: time (ps).)

Cluster analysis was also applied to study the convergence of conformational sampling in the heptapeptide. By comparing the cluster populations of the first and second halves of one trajectory, one can check the convergence of that simulation. The two halves of a structural ensemble should yield the same population in each cluster if convergence is reached. For example, for the simulations at pH 4, both constant-pH REMD and MD yield about 20 clusters, and the correlation coefficients are calculated through a linear regression. Cluster population plots and correlation coefficients are shown in Figure 3-12. A much higher correlation coefficient is found for the constant-pH REMD simulation (R² = 0.89) than for the constant-pH MD simulation (R² = 0.54), suggesting that the two halves of the constant-pH REMD simulation at pH 4 populate each cluster much more similarly than those of the corresponding constant-pH MD simulation. Hence, much better convergence is achieved by the constant-pH REMD run.

(A) Linear fit, pH=4, R²=0.54; x-axis: % population of MD run (the first half). (B) Linear fit, pH=4, R²=0.89; x-axis: % population of REMD run (the first half).

Figure 3-12.
Cluster population at 300 K from constant-pH MD and REMD simulations at pH=4. Cluster analysis is performed using the entire simulation. The populations in each cluster from the first and second halves of the trajectory are compared and plotted. Ideally, a converged trajectory should yield a correlation coefficient of 1. A) Constant-pH MD. B) Constant-pH REMD. A much higher correlation coefficient can be seen in the constant-pH REMD simulation, suggesting that much better convergence is achieved by the constant-pH REMD run.

3.4 Conclusions

In our work, we have applied the replica exchange molecular dynamics (REMD) algorithm to the discrete protonation state model developed by Mongan et al. in order to study pH-dependent protein structure and dynamics. Seven small peptides were selected to test our constant-pH REMD method. Constant-pH molecular dynamics (MD) simulations were run on the same peptides for comparison. The results of the constant-pH REMD method are encouraging. The method predicts pKa values in agreement with literature and experimental results. It also converges faster than constant-pH MD in both protonation state and conformational sampling. The REMD algorithm has thus proven beneficial for studying pH-dependent protein structures. Our future work will include studies of pH-dependent protein dynamics and the application of this constant-pH REMD method to large proteins.

CHAPTER 4
CONSTANT-pH REMD: STRUCTURE AND DYNAMICS OF THE C-PEPTIDE OF RIBONUCLEASE A

4.1 Introduction

The protein and peptide folding problem202 is an important aspect of protein science and biophysical chemistry.203 In 1961, Anfinsen studied the refolding of denatured ribonuclease (RNase).204 He first increased the temperature of the protein, and the protein lost its functional three-dimensional shape (native state). When Anfinsen lowered the temperature, he found that the RNase was able to refold into its normal shape without any other help.
His experiment raised questions about protein folding. In general, people are interested in the thermodynamics (such as the free energy landscape, folding pathways, and interactions in a protein), the folding kinetics (such as how fast a protein folds), and the prediction of the native state for a given sequence.202 Both experimental and theoretical approaches have been employed to understand protein folding.205,206 From now on, our introduction to protein folding will focus on computer simulations.

In a protein folding simulation, the concept of the free energy landscape always plays an important role.202,207 Many questions can be answered once the free energy landscape is obtained. Levinthal,208 in 1968, argued that it is impossible for a protein to search all its conformations during the folding process, because the time needed to visit all conformations would be much longer than the observed folding time. His argument is well known as "Levinthal's paradox". Thus, proteins must fold to their native states along some well-defined folding pathways. The "new" view of protein folding is the free energy landscape theory, which provides a statistical view of the folding landscape.202,203,207 The folding process does not require chemical-reaction-like steps between specific states. Basically, a protein folds on a funnel-shaped free energy landscape, which is defined by the amino acid sequence of the protein. The folding process is a directed visit of conformations on the landscape in order to reach the native state, which is the most thermodynamically stable conformation. Changing the temperature, adding denaturant to the protein solution, or changing the solution pH can change the free energy landscape and hence affect protein folding. The free energy landscape of a protein is often rugged51 and requires advanced sampling techniques such as the REMD method to sample the conformational space.
Because a high-dimensional landscape cannot be visualized directly, a free energy landscape is frequently projected onto one or two reaction coordinates. In practice, the coordinates are chosen from important quantities such as the radius of gyration of the protein, the number of backbone hydrogen bonds, and the number of native contacts. Principal component analysis has also been carried out to generate the folding free energy landscape. The relative free energy (potential of mean force, PMF) can be calculated by

\[
\Delta F(B \to A) = F(A) - F(B) = -k_B T \ln\left(\frac{P(A)}{P(B)}\right)
\tag{4-1}
\]

where $\Delta F(B \to A)$ is the relative PMF between state A and state B defined by the reaction coordinate(s), and $P(A)$ and $P(B)$ are the probability densities of finding the system in state A and state B along the reaction coordinate(s), respectively.

Knowing the free energy landscape can help people understand folding mechanisms. Transition states, intermediates, and folding pathways can be obtained from a folding free energy landscape. For example, when the free energy barrier between the folded and unfolded states disappears, the folding is called downhill folding, in which the folding time is determined by the diffusion rate on the free energy landscape.

One example of a protein folding free energy landscape study is the simulation of the folding of the C-terminal β-hairpin of protein G, performed by Zhou et al. in 2001.184 The OPLS-AA force field, the SPC explicit water model, and the REMD algorithm were employed in their simulation. The free energy landscape was projected onto seven different reaction coordinates, such as the radius of gyration, the number of hydrogen bonds, and the fraction of native contacts. Two-dimensional free energy landscapes along those reaction coordinates were generated in order to elucidate the folding pathway. Four different states were found in the folding landscape: the native state, the unfolded state, and two intermediate states. Structural features of each state were also characterized.
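Returning to Eq. 4-1, the relative PMF along a single reaction coordinate can be estimated directly from a histogram of that coordinate. The sketch below is a minimal illustration under stated assumptions: the function names and the bin counts are hypothetical, energies are reported in kcal/mol, and each bin is referenced to the most populated bin.

```python
import math

KB = 0.0019872041  # Boltzmann constant in kcal/(mol K)


def relative_pmf(prob_a, prob_b, temperature):
    """F(A) - F(B) = -kB * T * ln(P(A) / P(B)), in kcal/mol."""
    return -KB * temperature * math.log(prob_a / prob_b)


def pmf_profile(counts, temperature):
    """PMF of each histogram bin relative to the most populated bin.

    Empty bins were never visited, so their free energy is reported as infinity.
    """
    reference = max(counts)
    return [
        relative_pmf(count, reference, temperature) if count > 0 else math.inf
        for count in counts
    ]


if __name__ == "__main__":
    # Hypothetical histogram of a reaction coordinate (e.g. radius of gyration) at 300 K.
    counts = [10, 100, 40, 0]
    for index, value in enumerate(pmf_profile(counts, 300.0)):
        print(f"bin {index}: {value:.2f} kcal/mol")
```

Because only ratios of populations enter Eq. 4-1, raw counts can be used in place of normalized probabilities as long as all bins have the same width.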
The formation of the hydrophobic core and of hydrogen bonds during the folding process was investigated. They found that the hydrophobic core and the hydrogen bonds formed almost simultaneously after the initial collapse.

Although not investigated in this chapter, protein folding kinetics is also an important aspect of protein folding.209 One example of a folding kinetics study is determining the speed of protein folding.210 Computer simulations have been performed to elucidate folding kinetics.211 The Pande group at Stanford University pioneered computer simulations of folding kinetics.206,211-213 When studying protein folding kinetics, the Pande group conducted multiple independent MD simulations starting from different initial conditions. The probability of the native state in the structural ensemble was computed after a predefined simulation time. Assuming that the folding mechanism is two-state and follows first-order reaction kinetics, and that the transition time is much shorter than the residence time in either state, the probability of barrier crossing is given by

\[
P(t) = 1 - e^{-kt}
\tag{4-2}
\]

where t is the simulation time and k is the folding rate. In the limit $t \ll 1/k$, Eq. 4-2 can be simplified to $P(t) \approx kt$ by Taylor expansion. The probability of barrier crossing can be estimated from the fraction of simulations that crossed the barrier. Other methods used to explore folding kinetics include Markov state models.195,198,214-217

One example of predicting the folding time is given by a study of the C-terminal β-hairpin of protein G. In their studies, Pande and coworkers213 used the OPLS-AA force field and the GB implicit solvent model, with water-like viscosity imposed through the Langevin collision coefficient. A total simulation time of 38 μs was accumulated through 2700 independent simulations, among which 8 completely folded trajectories were found. Thus, a folding time of 4.7±1.7 μs can be derived from Eq.
4-2, which is in agreement with the experimental result of 6 μs. Furthermore, the folding free energy landscape was generated, and the folding pathway and folding intermediates were also probed.

Another area of protein folding simulation is to probe protein folding through unfolding simulations. Unfolding simulations adopt the assumption that folding processes follow the reverse pathways of unfolding processes. Both elevated temperatures and denaturants can be employed to denature proteins. Levitt and Daggett have performed unfolding simulations extensively.218-220

The C-peptide, residues 1 to 13 from the N-terminus of RNase A, is a peptide well studied by experiments.5,7,221-226 In 1971, Brown and Klee223 first observed the presence of α-helix in the C-peptide through circular dichroism (CD) spectroscopy. This peptide was further studied extensively by the Baldwin group.5,7,222,224,226 CD spectroscopy showed that the C-peptide exhibits pH-dependent α-helix formation. The mean residue ellipticity at 222 nm of the C-peptide showed a bell-shaped pH profile, with a maximum at pH 5. Mutation experiments indicated that Glu2 and His12 in the C-peptide are crucial to the pH-dependent helix formation.5,7,224,226 The maximal mean residue ellipticity occurs at pH 5 because both the glutamate and histidine residues are charged at that pH. NMR experiments on an analog of the C-peptide (RN-24) by the Wright group also confirmed the formation of complete and partial helices.225 Two side chain interactions were believed to stabilize the partial helix formation in the C-peptide and its analogs in the mutation experiments and NMR studies.7,224-226 A salt bridge between the Glu2 and Arg10 side chains was proposed to improve helix formation as the pH increases to 5. The interaction between Phe8 and His12 was also believed to improve helix formation as the pH decreases to 5.
The folding and side chain interactions of the C-peptide and its analogs have also been extensively studied by molecular simulations.227-235 Schaefer et al.232 studied the helical conformations and folding thermodynamics. The Okamoto group228-230,233-235 has performed thorough investigations of the C-peptide using a multicanonical algorithm (MUCA) and the replica exchange method (REM) in both implicit and explicit solvent. They studied the secondary structures of the C-peptide, the roles of Glu2 and His12 in the C-peptide, the helix-coil transition, and the dielectric effect in implicit solvent. Ohkubo and Brooks231 used REMD simulations with the GB model to explore the helix-coil transition of short peptides including the C-peptide. The conformational entropy as a function of temperature was explored for the C-peptide and its analogues (with different chain lengths), and was found to be proportional to chain length over a wide range of temperatures. Felts and coworkers227 carried out REMD simulations with the AGBNP implicit solvent model to study the folding free energy landscape of the C-peptide. The free energy landscape was projected onto the radius of gyration and the helical length. The possible Glu2-Arg10 interaction was also explored, as were the dielectric effects of the AGBNP solvation model on the helical length and the salt bridge. In 2005, Sugita and Okamoto233 performed replica exchange multicanonical algorithm simulations in explicit solvent to explore the folding mechanism and side-chain interactions such as Glu2-Arg10 and Phe8-His12. They constructed the folding free energy landscape along the principal component axes. The correlations between the Glu2-Arg10 and Phe8-His12 interactions and the C-peptide conformations were elucidated. They found that the minimum free energy conformation possesses both interactions.
They also suggested that the purpose of the Glu2-Arg10 salt bridge is to prevent the α-helix from extending to the N-terminus of the C-peptide, and that the Phe8-His12 interaction stabilizes the α-helix conformation toward the C-terminus. More importantly, Khandogin et al. [112] studied the pH-dependent folding of the C-peptide with REX-CPHMD. Important electrostatic interactions such as the Lys1-Glu9, Glu2-Arg10 and Phe8-His12 interactions were also investigated.

The C-peptide has also been selected to test the effect of force fields on protein folding simulations and simulation convergence. In 2004, Yoda et al. [234,235] tested six commonly employed force fields (AMBER94, AMBER96, AMBER99, CHARMM22, OPLS-AA/L, and GROMOS96) on the C-peptide as well as on the C-terminal fragment from the B1 domain of protein G (the G-peptide) in explicit water using generalized-ensemble methods. Melting curves were studied. Secondary structures of both peptides were also computed and compared with experimental data. AMBER99 and CHARMM22 were found to show the best agreement for the C-peptide.

In this chapter, we present a study of the C-peptide using the constant-pH REMD method introduced in the previous chapter. The effect of pH on the folding of the C-peptide and on its structural ensemble is studied. We compare directly with experimental measurements of helicity, namely the mean residue ellipticity at 222 nm. Important electrostatic interactions such as the Glu2-Arg10 salt bridge and the Phe8-His12 interaction are also examined.

4.2 Methods

4.2.1 Simulation Details

The C-peptide we simulated has the sequence KETAAAKFERQHM. The N-terminus of the C-peptide (lysine) is charged, while the C-terminus (methionine) is capped with an amide. Constant-pH REMD simulations were performed starting from a completely extended structure at pH values of 2, 3, 4, 5, 6.5 and 8. Eight replicas were chosen with temperatures ranging from 260 to 420 K. A simulation time of 44 ns was used for each replica in all REMD runs, and an exchange was attempted every 2 ps.
The structures obtained from the first 4 ns were discarded, resulting in 40 ns of production time for each replica. Glu2, Lys7, Glu9 and His12 were selected to be titratable. An MC move to change the protonation state was attempted every 10 fs. A second set of REMD runs was performed at pH values of 2, 5 and 8 starting from a fully helical initial structure in order to check simulation convergence. The three pH values were selected to represent low pH, the pH at which maximum helicity was observed experimentally, and high pH, respectively.

The AMBER 10 molecular simulation suite [199] was used to simulate the C-peptide. The AMBER ff99SB force field [139] was used in all simulations. The SHAKE algorithm [145] was applied in all simulations, which allowed the use of a 2 fs time step. The OBC Generalized Born implicit solvent model [200] was used to model the water environment in all calculations. The Berendsen thermostat [146], with a relaxation time of 2 ps, was used to keep the replica temperatures around their target values. The salt concentration (Debye-Hückel based) was set to 0.1 M. The cutoff for nonbonded interactions and the Born radii was 30 Å (this cutoff is longer than the peptide).

4.2.2 Cluster Analysis

When studying the folding of the C-peptide, cluster analysis plays two roles. One is to compare structural ensembles and check convergence at a particular temperature and solution pH value; the other is to analyze a single ensemble of structures in order to investigate protein structures and interactions. As described in the previous chapter, cluster analysis was done using the MoilView program [201], and the Cα RMSD was chosen to measure structural similarity. When comparing conformational sampling, two different comparisons were adopted. The first is to compare the first and the second halves of one trajectory.
In this case, cluster analysis was performed on a single trajectory, and the cluster information can be utilized to study folding thermodynamics and interactions in the C-peptide. The second is to compare the structural ensembles produced by simulations starting from the fully extended and fully helical structures. In the second case, the two trajectories (having the same number of frames) at 300 K and at the same solution pH value were first combined. The combined trajectory was then clustered on the basis of the root-mean-square deviations (RMSDs) of the peptide backbone atoms. The population fraction corresponding to each cluster was obtained for both trajectories. The correlation coefficient between the cluster populations of the two trajectories was calculated at each solution pH value by linear regression. A high correlation indicates that the two structural ensembles are close to each other. This method provides a direct comparison of global conformational sampling between the two trajectories. A cluster cutoff RMSD of 2.0 Å was chosen in our analysis.

4.2.3 Definition of the Secondary Structure of Proteins (DSSP) Analysis

The secondary structures of the C-peptide were explored with the DSSP algorithm [236], which was proposed by Kabsch and Sander. The DSSP algorithm identifies the secondary structure of a residue from hydrogen bond calculations. The calculation is based on the electrostatic energy between a backbone carbonyl group and a backbone amide group,

$$U = q_1 q_2 \left( \frac{1}{r_{ON}} + \frac{1}{r_{CH}} - \frac{1}{r_{OH}} - \frac{1}{r_{CN}} \right) \times 332\ \text{kcal/mol} \qquad (4\text{-}3)$$

In the above equation, $q_1$ and $q_2$ are the partial charges on the atoms of the two groups, and each $r_{XY}$ is the distance between atoms X and Y. If the electrostatic energy is below -0.5 kcal/mol, a hydrogen bond is assigned to the corresponding carbonyl and amide groups. The secondary structure of a residue is labeled by one letter: G for 3-10 helix, H for α-helix, I for π-helix, B for antiparallel β-sheet, b for parallel β-sheet, and T for turns.
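The hydrogen bond criterion of Eq. 4-3 can be illustrated with a short sketch. It is not the DSSP program itself: the function names are hypothetical, and the partial charges of 0.42e and 0.20e are the values from the original Kabsch and Sander definition rather than values stated in this chapter. Coordinates are assumed to be in Å.

```python
import math

# Assumed constants: the partial charges are taken from the original
# Kabsch-Sander definition of DSSP, not from this chapter.
Q1 = 0.42              # partial charge on the C=O group, units of e
Q2 = 0.20              # partial charge on the N-H group, units of e
COULOMB_FACTOR = 332.0 # converts e^2/Angstrom to kcal/mol
HBOND_CUTOFF = -0.5    # kcal/mol; energies below this count as a hydrogen bond


def hbond_energy(c, o, n, h):
    """Electrostatic energy of Eq. 4-3 for one C=O ... H-N pair (kcal/mol).

    c, o: coordinates of the carbonyl carbon and oxygen (Angstrom).
    n, h: coordinates of the amide nitrogen and hydrogen (Angstrom).
    """
    r_on = math.dist(o, n)
    r_ch = math.dist(c, h)
    r_oh = math.dist(o, h)
    r_cn = math.dist(c, n)
    return Q1 * Q2 * (1.0 / r_on + 1.0 / r_ch - 1.0 / r_oh - 1.0 / r_cn) * COULOMB_FACTOR


def is_hydrogen_bonded(c, o, n, h):
    """Apply the -0.5 kcal/mol cutoff described in the text."""
    return hbond_energy(c, o, n, h) < HBOND_CUTOFF


if __name__ == "__main__":
    # Idealized collinear geometry, chosen only for illustration.
    c, o = (0.0, 0.0, 0.0), (1.23, 0.0, 0.0)
    h, n = (3.13, 0.0, 0.0), (4.13, 0.0, 0.0)
    print(round(hbond_energy(c, o, n, h), 2), is_hydrogen_bonded(c, o, n, h))
```

In a full DSSP assignment these pairwise hydrogen bonds are then combined into turn, helix and sheet patterns; the sketch covers only the energy test.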
4.2.4 Computation of the Mean Residue Ellipticity

CD spectroscopy is one of the most commonly used techniques to study protein secondary structure and folding [237]. Chiral molecules absorb left circularly polarized light (LCPL) and right circularly polarized light (RCPL) differently. CD spectroscopy measures the difference in absorbance of LCPL and RCPL by a chiral molecule, and can thus provide information on protein secondary structure.

Electromagnetic waves contain oscillating electric and magnetic fields perpendicular to each other and to the direction of propagation. Circularly polarized light (CPL) has an electric field vector that rotates about its propagation direction while maintaining its magnitude. This is in contrast to linearly polarized light, whose electric field vector oscillates in one plane and changes its magnitude. When LCPL propagates toward an observer, the electric field vector rotates counterclockwise, while that of RCPL rotates clockwise. When circularly polarized light passes through chiral molecules, the difference in the absorption of LCPL and RCPL is given by

$$\Delta\varepsilon(\lambda) = \varepsilon_L(\lambda) - \varepsilon_R(\lambda) \qquad (4\text{-}4)$$

where $\varepsilon_L$ and $\varepsilon_R$ are the extinction coefficients of LCPL and RCPL, respectively, and $\lambda$ is the wavelength. $\Delta\varepsilon$ has the dimensions of (cm M)^-1 or cm^2 dmol^-1. The extinction coefficient $\varepsilon$ can be calculated from the Beer-Lambert law, $\varepsilon = A/(c\,l)$, where $A$ is the absorbance, $c$ is the concentration, and $l$ is the width of the cuvette. This difference is the signal measured in CD spectroscopy.

Many CD instruments record the signal as ellipticity, $\theta$, which is measured in degrees. The ellipticity can be calculated as $\theta = 32.98\,(A_L - A_R) = 32.98\, c\, l\, \Delta\varepsilon$, where 32.98 has units of degrees. A more frequently adopted measure of CD is the molar ellipticity $[\theta]$ [238],

$$[\theta] = \frac{100\,\theta}{c\,l} = 3298\,\Delta\varepsilon(\lambda) \qquad (4\text{-}5)$$

Here, the molar ellipticity has units of deg cm^2 dmol^-1.

The integrated intensity of a CD band is called the rotational strength.
Theoretically, for an electronic transition from the ground state (0) to an excited state (i), the rotational strength can be calculated as

$$R_{0i} = \mathrm{Im}\left( \langle \psi_0 | \mu_e | \psi_i \rangle \cdot \langle \psi_i | \mu_m | \psi_0 \rangle \right) \qquad (4\text{-}6)$$

where $\psi_0$ and $\psi_i$ are the wavefunctions of the electronic ground and excited states, respectively; $\mu_e$ and $\mu_m$ are the electric and magnetic transition dipole moment operators, respectively; and Im stands for the imaginary part. Eq. 4-6 suggests that the frequently adopted units of rotational strength are Debye-Bohr magnetons (DBM; 1 DBM = 9.274 x 10^-39 erg cm^3, where erg is the cgs unit of energy). Eq. 4-6 is origin-dependent because the magnetic transition dipole moment operator is origin-dependent. In order to avoid this origin dependence, the dipole-velocity formulation can be employed,

$$R_{0i} = \frac{e\hbar}{2\pi m \nu_0}\, \mathrm{Im}\left( \langle \psi_0 | \nabla | \psi_i \rangle \cdot \langle \psi_i | \mu_m | \psi_0 \rangle \right) \qquad (4\text{-}7)$$

Here, $e$ is the charge of an electron, $m$ is the mass of an electron, and $\nu_0$ is the frequency of the transition.

According to the paper of Sreerama and Woody [238], the CD spectrum can be calculated by assuming that each CD band (CD transition) is a Gaussian function of wavelength,

$$\Delta\varepsilon_k = \frac{2.278\, R_k \lambda_k}{\Delta_k} \qquad (4\text{-}8)$$

where $\Delta\varepsilon_k$, $R_k$, $\lambda_k$, and $\Delta_k$ are the CD intensity, rotational strength, wavelength and half-bandwidth (one half of the width at 1/e of the maximum) of the kth transition, respectively. In Eq. 4-8, the constant 2.278 has the dimensions of DBM^-1 cm^2 dmol^-1.

The far-ultraviolet (far-UV, wavelengths shorter than 250 nm) CD spectra of proteins can yield important information about protein secondary structure [238]. In the far-UV range, the peptide bonds of a protein are the main chromophores. Thus, CD spectra in the far-UV range are reported on a per-residue basis (mean residue ellipticity). In a protein CD spectrum, a positive band at ~190 nm and two negative bands at 208 nm and 222 nm are found for α-helix [239]. In particular, a strong negative band at 222 nm is a leading indication of the presence of helical structures.
Structures containing β-sheet show two bands in their CD spectra: a positive band at ~198 nm and a negative band at 215 nm [240].

Computing protein CD spectra using quantum mechanical methods combined with Eq. 4-7 is possible only in principle, owing to the size and complexity of protein structures. The matrix method [241], which uses predetermined parameters, has been adopted to tackle this problem. In the matrix method, a secular matrix is constructed from transition energies and interactions between transitions. A protein is considered as a set of independent chromophores. The local transition energies and the interactions between transitions on different chromophores are used to construct the secular matrix. A transition on a local chromophore is represented by a charge distribution. The charge distributions, as parameters, are determined from quantum mechanical wavefunctions, from experiments, or from a combination of both [242-244]. The off-diagonal elements of the secular matrix, which represent the interactions between transitions on different chromophores, are further simplified to charge-charge (monopole-monopole) electrostatic interactions [238],

$$V_{ij,kl} = \sum_m \sum_n \frac{q_{ijm}\, q_{kln}}{r_{ijm,kln}} \qquad (4\text{-}9)$$

Here, $V_{ij,kl}$ is the electrostatic energy between transition j on chromophore i and transition l on chromophore k; m runs over the point charges of transition j on chromophore i, n runs over the point charges of transition l on chromophore k, and r denotes the distance between two charges. Diagonalization of the secular matrix by a unitary transformation yields the eigenvalues and eigenvectors corresponding to all transitions of the protein. The eigenvalues provide the transition energies, and the eigenvectors describe the mixing of local transitions. The rotational strengths can be obtained from the eigenvectors.

In this work, the algorithm developed by the Woody group [238,244] was used to compute the mean residue ellipticity.
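As a rough illustration of the secular-matrix construction described above, the following sketch evaluates the monopole-monopole coupling of Eq. 4-9 for two transitions and diagonalizes the resulting 2 x 2 matrix in closed form. It is a toy model under stated assumptions, not the Woody group's program: the function names, the charges and the coordinates are hypothetical, one transition per chromophore is assumed, and the energies and couplings are taken to be in one consistent (arbitrary) unit system.

```python
import math


def monopole_coupling(charges_a, coords_a, charges_b, coords_b):
    """Eq. 4-9: sum of q_m * q_n / r_mn over all pairs of transition monopoles."""
    total = 0.0
    for q_a, r_a in zip(charges_a, coords_a):
        for q_b, r_b in zip(charges_b, coords_b):
            total += q_a * q_b / math.dist(r_a, r_b)
    return total


def diagonalize_two_transitions(e1, e2, v):
    """Closed-form diagonalization of the symmetric matrix [[e1, v], [v, e2]].

    Returns ((lower, upper), theta). The upper eigenvector is
    (cos(theta), sin(theta)) in the basis of the two local transitions,
    so theta describes how strongly they mix.
    """
    mean = 0.5 * (e1 + e2)
    split = math.hypot(0.5 * (e1 - e2), v)
    theta = 0.5 * math.atan2(2.0 * v, e1 - e2)
    return (mean - split, mean + split), theta


if __name__ == "__main__":
    # Hypothetical transition monopoles: a small dipole-like pair on each chromophore.
    charges = [0.1, -0.1]
    chromophore_1 = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
    chromophore_2 = [(0.0, 3.0, 0.0), (1.0, 3.0, 0.0)]
    v = monopole_coupling(charges, chromophore_1, charges, chromophore_2)
    energies, theta = diagonalize_two_transitions(1.0, 1.0, v)
    print(v, energies, theta)
```

For two degenerate local transitions the coupling splits the levels symmetrically by the magnitude of the coupling and mixes them equally. A real calculation would use many transitions per protein and a general eigensolver, and would go on to compute rotational strengths from the eigenvectors.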
A detailed description of their algorithm can be found in the paper of Sreerama and Woody. The peptide transitions (two ππ* transitions at 140 and 190 nm, respectively, and one nπ* transition at 220 nm) were computed using the matrix method [241] in the origin-independent form [245]. Transition charge distributions (monopole charges) were obtained from INDO/S [246] semiempirical electronic structure calculations. Side chain transitions of phenylalanine, tyrosine and tryptophan were also included in the calculations. α-Helix formation is characterized by two negative bands at 208 and 222 nm and a positive band at 192 nm. Following the experiments performed by the Baldwin group, the mean residue ellipticity at 222 nm ([θ]222) was calculated to generate the pH profile.

In practice, Woody's program reads one protein structure in PDB format and yields the mean residue ellipticity and the rotational strength as a function of wavelength. Therefore, the ptraj module of the AMBER 10 package was used to generate a protein structural ensemble, from which an ensemble average of the mean residue ellipticity at 222 nm was obtained.

4.3 Results and Discussion

4.3.1 Testing Structural Convergence

Conformational sampling convergence was investigated using cluster analysis, as described earlier. Two ways of checking the conformational sampling of the simulations started from the fully extended structure were used. One is to compare the first and second halves of the trajectory; the other is to compare with the structural ensembles produced by simulations starting from a fully helical structure. The R2 values of the cross clustering are listed in Table 4-1. Plots demonstrating the cluster population correlations from both comparisons at pH 2 are shown in Figure 4-1 as an example. The large R2 values indicate that converged structural ensembles are achieved within the 40 ns simulations.
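The convergence check described above can be sketched in a few lines, assuming that the combined trajectory has already been clustered and that each frame carries an integer cluster label. The function names and the example labels are hypothetical; the sketch reports the R2 of a simple linear regression between the two sets of cluster populations, which equals the squared Pearson correlation coefficient.

```python
from collections import Counter


def cluster_populations(labels, n_clusters):
    """Percentage of frames assigned to each cluster index 0..n_clusters-1."""
    counts = Counter(labels)
    n_frames = len(labels)
    return [100.0 * counts.get(k, 0) / n_frames for k in range(n_clusters)]


def r_squared(x, y):
    """R^2 of a simple linear regression of y on x (squared Pearson r)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    s_xx = sum((a - mean_x) ** 2 for a in x)
    s_yy = sum((b - mean_y) ** 2 for b in y)
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return s_xy * s_xy / (s_xx * s_yy)


if __name__ == "__main__":
    # Hypothetical cluster labels for frames of two trajectories
    # (for example, runs started from extended and from helical structures).
    labels_extended = [0, 0, 0, 1, 1, 2]
    labels_helical = [0, 0, 1, 1, 1, 2]
    pop_e = cluster_populations(labels_extended, 3)
    pop_h = cluster_populations(labels_helical, 3)
    print(pop_e, pop_h, r_squared(pop_e, pop_h))
```

The same two functions cover the first-half versus second-half comparison: the labels of one trajectory are simply split in the middle before the populations are computed.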
[Figure 4-1 plots: (A) pH = 2, % population against % population of the first half, linear fit with R2 = 0.90; (B) pH = 2, % population against % population of the REMD run starting from the extended structure, linear fit with R2 = 0.95.]

Figure 4-1. Cluster populations at 300 K from constant-pH REMD simulations at pH 2. A) Cluster analysis is performed on the trajectory initiated from the fully extended structure. The populations of each cluster from the first and second halves of the trajectory are compared and plotted. B) Two REMD runs from different starting structures at pH 2. Correlation coefficients at other pH values can be found in Table 4-1.

Table 4-1. Correlation coefficients between two sets of cluster populations.

              pH = 2   pH = 3   pH = 4   pH = 5   pH = 6.5   pH = 8
R2 (E vs E)   0.90     0.92     0.90     0.94     0.93       0.85
R2 (E vs H)   0.95     -        -        0.88     -          0.84

E vs E means comparing the first and the second halves of the trajectories starting from the fully extended structure. E vs H stands for comparing the structural ensembles given by simulations starting from the fully extended and fully helical structures, respectively.

4.3.2 pKa Calculation and Convergence

Four residues of the C-peptide are titratable in our constant-pH REMD simulations: Glu2, Lys7, Glu9 and His12. Lys7 is always protonated in the pH range of 2 to 8, as expected. Thus, only the data from the glutamate and histidine residues are analyzed. For each glutamate and histidine residue, the fraction of deprotonation at each pH value is obtained, and a Hill plot is used to determine the pKa value. The pKa values are 3.1, 3.7 and 6.5 for Glu2, Glu9 and His12, respectively.

The cumulative average fraction of protonation vs the number of constant-pH MC attempts is chosen to study the convergence of the pKa calculation. The cumulative average fraction of protonation represents the time evolution of the protonation state sampling. As shown in Figure 4-2, a stabilized fraction of protonation is achieved within the 40 ns simulations.
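A minimal sketch of the two analyses in this section is given below. It assumes the standard Hill form, log10(s/(1 - s)) = n (pH - pKa), where s is the deprotonated fraction and n is the Hill coefficient, and fits it by ordinary least squares; the function names are hypothetical and the titration data in the example are synthetic, not simulation results.

```python
import math


def hill_fit(ph_values, deprotonated_fractions):
    """Fit log10(s / (1 - s)) = n * (pH - pKa) by least squares.

    Returns (pKa, n). Points with s equal to 0 or 1 are skipped because
    the logarithm is undefined there.
    """
    xs, ys = [], []
    for ph, s in zip(ph_values, deprotonated_fractions):
        if 0.0 < s < 1.0:
            xs.append(ph)
            ys.append(math.log10(s / (1.0 - s)))
    if len(xs) < 2:
        raise ValueError("need at least two usable titration points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = s_xy / s_xx                 # Hill coefficient n
    intercept = mean_y - slope * mean_x # equals -n * pKa
    return -intercept / slope, slope


def cumulative_average(states):
    """Running mean of a 0/1 protonation-state series (1 = protonated)."""
    averages, total = [], 0.0
    for count, state in enumerate(states, start=1):
        total += state
        averages.append(total / count)
    return averages


if __name__ == "__main__":
    # Synthetic data generated from an assumed pKa of 3.7 and n = 0.9.
    ph_values = [2.0, 3.0, 4.0, 5.0]
    fractions = [1.0 / (1.0 + 10.0 ** (0.9 * (3.7 - ph))) for ph in ph_values]
    print(hill_fit(ph_values, fractions))
    print(cumulative_average([1, 0, 1, 1]))
```

A cumulative average that has flattened out by the end of the run, as in Figure 4-2, is the behavior one looks for before trusting the fitted pKa.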
4.3.3 The Mean Residue Ellipticity of the C-peptide

The mean residue ellipticity of the C-peptide at each pH value and at 300 K was computed. The pH profile of [θ]222 (Figure 4-3) is clearly a bell-shaped curve, in agreement with the experimental pH profile of [θ]222. The maximum of our calculated [θ]222 is at pH 5, with a numerical value of ~6400 deg cm^2 dmol^-1. However, the computed values of [θ]222 at the ends of the range (pH = 2, 3, and 8) suggest that the helix is more populated in the simulations than in experiments at those pH values.

[Figure 4-2 plot: curves for Glu2 at pH 3 and Glu9 at pH 4; x-axis: MC steps (total time = 40 ns).]

Figure 4-2. Cumulative average fraction of protonation vs Monte Carlo (MC) steps. Only the two glutamate residues are shown here; the histidine residue is found to show the same trend. The pH values are selected such that the overall average fraction of protonation is close to 0.5.

As mentioned in Section 2.2.2, the protonation state model involves parameters fitted at 300 K; thus, results obtained at temperatures other than 300 K should be viewed qualitatively, not quantitatively. The C-peptide at a temperature lower than 300 K shows a more negative [θ]222 (more helical), while [θ]222 becomes less negative (less helical) when the temperature is higher than 300 K. Experiments showed that the pH profile becomes flat at high temperatures [5]. Our results reflect the same trend: the pH profile of [θ]222 at 420 K is flat and less negative than that at 300 K, while the pH profile at 280 K is still bell-shaped and more negative.

[Figure 4-3 plot: curves at several temperatures, including T = 420 K; x-axis: Solution pH Value.]

Figure 4-3. Computed mean residue ellipticity at 222 nm as a function of pH. A bell-shaped curve with a maximum at pH 5 is obtained at 300 K. The effect of temperature on the mean residue ellipticity at 222 nm is also demonstrated.
4.3.4 Helical Structures in the C-peptide

In order to examine the helical conformations in different environments, the constant-pH REMD simulations at pH 2, 5, and 8 were selected to represent the pH range. The secondary structures of the C-peptide were computed using the DSSP algorithm [236]. Any residue that, according to the DSSP algorithm, belongs to the 3-10 helix or α-helix conformation is called helical. The helical percentages of each residue are shown in Figure 4-4. The maximum helical percentage of a residue is ~55% at pH 2 and 5, and ~40% at pH 8. The average helical percentage at pH 5 is around 30%, which is in good agreement with experiments (29±2%). Figure 4-4 suggests that the C-peptide contains many nonhelical structures, even at pH 5, where the helical content is maximal.

[Figure 4-4 plot: curves for pH = 2, pH = 5 and pH = 8; x-axis: Residue Number (1 to 13).]

Figure 4-4. Helical content as a function of residue number.

We calculated the Cα RMSD relative to the fully folded structure and the Cα radius of gyration (Rg) of the C-peptide. The fully helical structure has a Cα RMSD of 0.8 Å relative to the ribonuclease A X-ray structure; residues Thr3 to His12 were chosen to calculate the Cα RMSD. The time series and the probability densities of the RMSD and Rg are illustrated in Figure 4-5. According to Figure 4-5B, two conformations can be seen at all three pH values. The conformation with the smaller RMSD represents structures closer to the fully helical structure, and the structural ensemble at pH 5 possesses more such structures than the other two structural ensembles. Figure 4-5D shows the probability density of Rg, and it suggests that the C-peptide is more compact at pH 5 than at pH 2 and 8. The Rg results agree with the RMSD results because helical structures are more compact.

Figure 4-5.
A) Time series of the Cα RMSD vs the fully helical structure at pH 5. The first two residues at each end are not selected because the ends are very flexible. B) Probability densities of the Cα RMSD. Clearly, the structural ensemble at pH 5 contains more structures similar to the fully helical structure. C) Time series of the Cα radius of gyration at pH 5. D) Probability density of the Cα radius of gyration. More compact structures are found at pH 5.

We further studied the details of the C-peptide structural ensemble with respect to pH. The study of helical structure is based on our DSSP results. We first show the probability density of the total number of helical residues at pH 2, 5 and 8 in Figure 4-6A. As expected, the simulation at pH 5 generated the smallest number of nonhelical structures, ~25% of the ensemble. The simulation at pH 8 generated the most nonhelical structures: ~37% of the structural ensemble possesses no helical residue. Among the structures possessing helical residues, those with four helical residues are the most probable, and structures containing three helical residues are also common at all three pH values. Structures possessing six helical residues are also found. Furthermore, the simulation at pH 5 yielded more configurations possessing seven-residue and longer helices. Thus, longer helical chains are formed more often at pH 5.

[Figure 4-6 plots: panels (A) to (D), each with data for pH = 2, pH = 5 and pH = 8; x-axes: Number of Helical Residues, Number of Helical Segments, Helix Starting Position, and Helical Length.]

Figure 4-6. A) Probability densities of the number of helical residues in the C-peptide. B) Probability densities of the number of helical segments in the C-peptide.
A helical segment contains continuous helical residues. The probability of forming a second helical segment is very low at all three pH values; thus, only the first helical segment is studied further. C) Probability densities of the starting position of a helical segment. D) Probability densities of the length of a helical segment (number of residues in a helical segment).

Next, the number of helical segments (a helical segment contains continuous helical residues) is studied and shown in Figure 4-6B. The number of helical segments ranges from zero to two at all three pH values. However, C-peptide structures having two helical segments are rare. The probability densities of having two helical segments at pH 2 and 8 are ~0.05, while that at pH 5 is ~0.1. Owing to the small population of the second helical segment, the analysis of the helical length (number of helical residues in a segment) and the helix starting position (residue number of the amino acid initiating a helical segment) is focused on the first helical segment.

Figure 4-6C shows the probability density of the helix starting position in the C-peptide. The most probable starting position is affected by solution pH. At pH 2, Lys7 is the most favorable position to start a helix, whereas the most probable place to initiate a helix is Thr3 at pH 5 and 8. At pH 2 and 5, Thr3, Ala6 and Lys7 are favorable positions to start a helix, while Thr3 and Lys7 are the favorable positions at pH 8. However, the effect of solution pH on the helical segment length is not as significant as its effect on the helix starting position. Figure 4-6D shows that three-residue and four-residue helices are dominant at all three pH values.

4.3.5 The Two-Dimensional Probability Densities

Two-dimensional (2D) probability densities can be employed to study the correlations between important variables.
The peaks in the plots indicate the coupling between two variables and represent stable conformations. The more populated a region is, the more stable the corresponding conformation is. The 2D probability densities of helix starting position and helical length are illustrated in Figures 4-7 to 4-9. Helices consisting of Thr3-Ala5, Lys7-Arg10 and Glu9-His12 are present at all three pH values, while the number of helical conformations is larger at pH 5 and 8. At pH 2 and 5, the most probable helix is the four-residue helix starting from Lys7 (Lys7-Arg10). The 2D probability densities reveal that the six-residue (Lys7-His12) helix and the seven-residue (Ala6-His12) helix are stable at pH 5. At pH 8, Thr3-Ala5 becomes the most favorable helix. Lys7-Arg10 and Lys7-His12 are also favorable. At pH 8, a new seven-residue helix (Thr3-Glu9) is found.

[Figure 4-7 plot: pH = 2; x-axis: Helical Segment Starting Position.]

Figure 4-7. 2D probability density of helical starting position and helical length, pH = 2.

[Figure 4-8 plot: pH = 5; x-axis: Helical Segment Starting Position.]

Figure 4-8. 2D probability density of helical starting position and helical length, pH = 5.

[Figure 4-9 plot: pH = 8; x-axis: Helical Segment Starting Position.]

Figure 4-9. 2D probability density of helical starting position and helical length, pH = 8.

2D probability densities correlating the helical length and the Cα RMSD relative to the fully helical structure are shown in Figures 4-10 to 4-12. As expected, structures having long helices (helical length > 7) correspond to conformations with RMSDs smaller than 2.2 Å, and this region is more populated at pH 5. Interestingly, configurations possessing a four-residue helix can also yield RMSDs smaller than 2.2 Å, suggesting that structures having a partial helix can also be similar to the fully helical structure.

[Figure 4-10 plot: pH = 2; x-axis: Helical Segment Length.]

Figure 4-10. 2D probability density of helical length and Cα RMSD at pH = 2.
[Figure 4-11 plot: pH = 5; x-axis: Helical Segment Length.]

Figure 4-11. 2D probability density of helical length and Cα RMSD at pH = 5.

[Figure 4-12 plot: pH = 8; x-axis: Helical Segment Length.]

Figure 4-12. 2D probability density of helical length and Cα RMSD at pH = 8.

4.3.6 Important Electrostatic Interactions: Lys1-Glu9 and Glu2-Arg10

The salt bridge between Glu2 and Arg10 was found in the X-ray structure of RNase A [247]. Amino acid substitution experiments on the C-peptide indicated that this salt bridge is crucial to the increase in helical content as the pH increases to 5 [7,224]. Proton NMR experiments by Osterhout et al. [225] suggested that this salt bridge stabilizes a partial helix rather than the complete helix. They proposed that the RN24 structural ensemble contains three major conformations: unfolded, completely folded, and partial helix with the Glu2-Arg10 interaction. Hansmann et al. [229] also proposed, on the basis of multicanonical simulations, that the salt bridge stabilizes a partial helix. Felts et al. [227] found that the salt bridge is significantly populated only in globular, nonhelical C-peptide structures. Sugita and Okamoto [233] studied the C-peptide using multicanonical REM in explicit solvent. They found that the Glu2-Arg10 salt bridge does not stabilize the helix directly, but instead stops the helix from extending to the N-terminus. In the REX-CPHMD study performed by Khandogin et al., Lys1-Glu9, instead of Glu2-Arg10, was found to contribute to helix formation.

The Lys1-Glu9 and Glu2-Arg10 interactions are studied in our work. Figures 4-13A and 4-13B show the probability density vs charge distance for the two interactions at pH 2, 5 and 8.
At pH 2, neither the Lys1-Glu9 nor the Glu2-Arg10 salt bridge is formed, consistent with mostly protonated glutamate residues. At pH 5 and 8, the Glu2-Arg10 salt bridge is clearly formed (Figure 4-13B), while the Lys1-Glu9 salt bridge is formed to a much lesser extent (Figure 4-13A). Figure 4-14 shows the correlation between the two salt bridges at pH 5. Clearly, the two salt bridges cannot be formed at the same time. The effect of the Glu2-Arg10 salt bridge on helical structure formation is reflected by conditional probabilities. The probabilities of finding helical residue(s) given that the Glu2-Arg10 salt bridge is formed were calculated at pH 2, 5 and 8. The conditional probabilities are 0.64, 0.73 and 0.63, respectively. Although the probability of forming the Glu2-Arg10 salt bridge is low at pH 2 (~1%), the chance of having a helical structure is 63% once it is formed. This clearly shows the stabilizing effect of Glu2-Arg10 on helix formation.

[Figure 4-13 plots: panels (A) and (B), each with curves for pH = 2, pH = 5 and pH = 8; x-axes: Lys1-Glu9 Distance (Å) and Glu2-Arg10 Distance (Å).]

Figure 4-13. A) Probability density of the Lys1-Glu9 distance (Å). The distance is the minimum distance between the side chain nitrogen atom of Lys1 and the side chain carboxylic oxygen atoms of Glu9. B) Probability density of the Glu2-Arg10 distance (Å). The distance is the minimum distance between the side chain carboxylic oxygen atoms of Glu2 and the guanidinium nitrogen atoms of Arg10.

[Figure 4-14 plot: pH = 5; x-axis: Glu2-Arg10 Distance (Å).]

Figure 4-14. Two-dimensional probability density of the Lys1-Glu9 and Glu2-Arg10 distances at pH 5. Apparently, the Lys1-Glu9 and Glu2-Arg10 salt bridges cannot be formed simultaneously.

The correlations between the Glu2-Arg10 salt bridge and the helical length and helix starting position are studied further.
Figure 4-15A shows that the Glu2-Arg10 salt bridge can be found in nonhelical configurations and in four-residue and six-residue helices at pH 5. Moreover, in the six-residue helix, the Glu2-Arg10 salt bridge is always formed. The same pattern is obtained at pH 8; thus, the pH 8 results are not shown here. Figure 4-15B shows the correlation between the salt bridge and the helix starting position at pH 5. When a helix is initiated at Thr3, the salt bridge is not formed. When a helix begins at Ala4, at Lys7, or at residues after Lys7, only the salt-bridged state is seen. However, in nonhelical configurations and in helices beginning at Ala6, both states are found. In addition, Lys7 is the most probable place to initiate a helix when the salt bridge is formed. Again, no salt bridge is found when a helix starts at Thr3. Combining the correlations between Glu2-Arg10 and helical length and between Glu2-Arg10 and helix starting position, the salt bridge clearly has the effect of preventing helix formation near the N-terminus and stabilizing partial helices near the C-terminus (Lys7-Arg10 and Lys7-His12).

[Figure 4-15 plots: panels A and B, both at pH = 5; x-axes: Helical Segment Length (A) and Helix Starting Position (B).]

Figure 4-15. A) Two-dimensional probability density of Glu2-Arg10 salt bridge formation and helical length at pH 5. According to the plot, the Glu2-Arg10 salt bridge can be found in four-residue, six-residue and nonhelical structures. B) Two-dimensional probability density of the Glu2-Arg10 salt bridge and the helix starting position at pH 5. If a helix begins from Thr3, it cannot have a Glu2-Arg10 salt bridge. Thus, one role of the Glu2-Arg10 salt bridge is to prevent helix formation from Thr3.

4.3.7 Important Electrostatic Interactions: Phe8-His12

His12 is believed to be responsible for the decrease in helical content when the solution pH increases from 5 to 8 [226]. His12 was found to interact with Phe8 [221]. However, the nature of the Phe8-His12 interaction is not completely clear.
A weak hydrogen bond between the charged side chain of His12 (proton donor) and the aromatic ring of Phe8 (proton acceptor) is supported by the configuration in the RNase A X-ray structure [247] and by ion screening experiments [222,226], but is in contrast to proton NMR experiments [221]. A contact between the aromatic ring of His12 and the backbone carbonyl oxygen of Phe8 has been proposed to explain the proton NMR results. Sugita and Okamoto studied the interaction between the aromatic ring of Phe8 and the charged ring of His12 [233]. They observed that the contact between the two rings is made and stabilizes the helix near the C-terminus. However, the REX-CPHMD results showed that the interaction between the backbone carbonyl oxygen of Phe8 and the charged side chain of His12 is responsible for the increased helical content at pH 5 [112].

[Figure 4-16 plots: panels (A) and (B), each with curves for pH = 2, pH = 5 and pH = 8; x-axes: Phe8 Backbone-His12 Ring Distance (Å) and Phe8 Ring-His12 Ring Distance (Å).]

Figure 4-16. A) Probability density of the Phe8 backbone to His12 ring distance. The distance is the minimum distance between the Phe8 backbone carbonyl oxygen atom and the His12 imidazole nitrogen atoms. B) Probability density of the Phe8 ring to His12 ring distance. The distance is the minimum distance between the Phe8 aromatic ring carbon atoms and the His12 imidazole nitrogen atoms.

We also studied the ring-ring and backbone-ring interactions between Phe8 and His12 at pH 2, 5 and 8. The ring-ring interaction is represented by the minimum distance between the aromatic atoms of Phe8 and the two side chain nitrogen atoms of His12. The backbone-ring interaction is represented by the minimum distance between the backbone carbonyl oxygen atom of Phe8 and the two side chain nitrogen atoms of His12. Figures 4-16A and 4-16B show the probability densities of each distance at the three pH values. We found that the backbone-ring contact is made at all three pH values.
However, forming such a contact at pH 8 is much less favorable than at pH 5. Interestingly, the close contact between the Phe8 backbone and the His12 ring is coupled to Glu2-Arg10 salt-bridge formation (Figure 4-17). The ring-ring contact is observed at pH 5 but not at pH 8. At pH 2, the ring-ring contact is formed but is much less probable. More importantly, the integrated probability of making a backbone-ring contact is larger than that of forming a ring-ring contact at both pH 2 and pH 5. To separate configurations that make a contact from the rest, cutoff distances of 4.0 Å and 5.0 Å are adopted for the backbone-ring and ring-ring contacts, respectively. The integrated probability (area under the curve up to the cutoff) of making a backbone-ring contact and a ring-ring contact is 0.34 and 0.22, respectively, at pH 5, and 0.23 and 0.14, respectively, at pH 2. Thus, the Phe8 backbone-His12 ring interaction is the major form of the contact.

Figure 4-17. A) Two-dimensional probability density of the Glu2-Arg10 distance (in Å) and the Phe8-His12 backbone-to-ring distance at pH 5. B) Correlation between the Glu2-Arg10 salt bridge and the Phe8-His12 contact at pH 5.

We further examine the correlation between the Phe8 backbone-His12 ring contact and helical properties such as helical length and helix starting position. The backbone-ring contact is found in the four-residue and six-residue helices at pH 2 and 5. At pH 8, it is seen in the four-residue helix. The two-dimensional probability densities are similar at the three pH values, so only the plots at pH 5 are shown as an example (Figures 4-18A and 4-18B). As with the Glu2-Arg10 salt bridge, Lys7 is the most favorable position to initiate a helix that has a contact between Phe8 and His12. Thus, the Phe8-His12 backbone-ring contact stabilizes helix formation near the C-terminus (Lys7 to Arg10 and Lys7 to His12).
However, unlike the Glu2-Arg10 interaction, a helix initiated at Thr3 is still able to form a contact between Phe8 and His12. The Phe8-His12 contact therefore does not affect helix formation near the N-terminus.

Figure 4-18. A) Two-dimensional probability density of helical segment length and the Phe8-His12 interaction at pH 5. B) Two-dimensional probability density of helical segment starting position and the Phe8-His12 interaction at pH 5. Phe8-His12 also stabilizes four-residue and six-residue structures. Helices beginning at Lys7 and the Phe8-His12 contact are coupled. Unlike the Glu2-Arg10 salt bridge, the Phe8-His12 contact can also be formed in helices initiated at Thr3.

Cluster analysis is performed to identify significant conformations and to examine the important electrostatic interactions. The structures at pH 5 are clustered because both the Glu2-Arg10 and Phe8-His12 contacts are more probable at pH 5 than at pH 2 or 8, so the contacts can be studied within clusters. The 20 most populated clusters and their average helical percentages are plotted in Figure 4-19A. The most populated cluster shows the largest average helical content, while the second most populated cluster shows a much lower helical content (close to the lowest among the 20 clusters). The most populated cluster corresponds to conformations with small Cα RMSDs (< 2.2 Å) relative to the fully helical structure (Figure 4-19B). Interestingly, the plot of helical percentage versus residue number (Figure 4-19C) reveals that the second most populated cluster shows helical structure only between Lys7 and His12. Thus, in this cluster helices are formed only near the C-terminus. Figure 4-19D shows the probability densities of the Glu2-Arg10 and Phe8-His12 interactions.
Compared with the corresponding probability densities computed from the entire structural ensemble, forming the Glu2-Arg10 and Phe8-His12 contacts is more probable in the structures belonging to the second most populated cluster. This is especially clear for the Glu2-Arg10 interaction. Results from the second most populated cluster confirm that the Glu2-Arg10 and Phe8-His12 contacts, especially the Glu2-Arg10 contact, stabilize partial helix formation near the C-terminus.

4.4 Conclusions

In this chapter, we have studied the pH-dependent helix formation of the C-peptide of ribonuclease A using constant-pH REMD simulations. The mean residue ellipticity at 222 nm is computed at each pH value and used to gauge helical content. The pH profile is a bell-shaped curve with maximal helicity at pH 5, in good agreement with experimental results. The pH effect on the C-peptide structural ensembles is studied at three representative pH values, 2, 5 and 8, which represent the two ends of the pH profile and the pH of maximum helical content. At pH 2, helices consisting of Thr3-Ala5, Lys7-Arg10 and Glu9-His12 are formed, and Lys7-Arg10 is the most stable one. At pH 5, additional six-residue (Lys7-His12) and seven-residue (Ala6-His12) helices are stable, but the most probable helix is the same as at pH 2. At pH 8, the most favorable helix switches to Thr3-Ala5; Lys7-His12 and a new seven-residue helix (Thr3-Glu9) are also present.

The formation of the Glu2-Arg10 salt bridge and its role in helix formation are studied. We find that the salt bridge is formed and is most probable at pH 5. The Glu2-Arg10 salt bridge is found to stabilize helix formation near the C-terminus. The nature of the Phe8-His12 interaction and its role in helix formation are also explored. The contact between the backbone carbonyl oxygen of Phe8 and the charged side chain of His12 is the major form of this interaction.
The role of the Phe8-His12 contact is similar to that of the Glu2-Arg10 salt bridge. Cluster analysis of the trajectory generated at pH 5 confirmed the effects of the Glu2-Arg10 and Phe8-His12 interactions.

Figure 4-19. A) Top 20 populated clusters and their average helical percentage at pH 5. B) Probability densities of the Cα RMSD (in Å) relative to the fully helical structure for the two most populated clusters. C) Helical percentage as a function of residue number for the two most populated clusters. D) Probability density of the Glu2-Arg10 and Phe8 backbone-His12 ring distances (in Å) in the second most populated cluster.

CHAPTER 5
CONSTANT-pH REMD: pKa CALCULATIONS OF HEN EGG WHITE LYSOZYME

5.1 Introduction

Hen egg white lysozyme (HEWL, shown in Figure 5-1) has long been used to test pKa prediction methods and constant-pH methods [125]. This protein is a 129-residue enzyme and was the first enzyme to have its three-dimensional structure determined by X-ray crystallography [248,249]. Lysozyme is found in secretions such as tears and saliva. It catalyzes the hydrolysis of a polysaccharide, and the reaction has an optimal pH around 5 [125]. By hydrolyzing polysaccharides, lysozyme can damage the cell walls of certain bacteria. HEWL is a monomeric single-domain enzyme whose active site is situated in a cleft between two regions. Two residues are crucial to catalysis: Glu35 and Asp52.
During the hydrolysis, a covalent enzyme-substrate intermediate is formed [249]. In this process, Glu35 acts as the proton donor and Asp52 acts as the nucleophile [249]. The catalytic mechanism starts with the donation of a proton from Glu35 to the substrate. Asp52 then attacks the anomeric carbon of the substrate and forms a covalent bond with it. In the final step, the enzyme-substrate complex is hydrolyzed by a water molecule and the initial protonation states of Glu35 and Asp52 are restored.

HEWL has been a good test system for pKa prediction studies for several reasons. First, accurately predicting the pKa values of both ionizable residues in the active site can help identify the proton donor and the nucleophile in HEWL according to a simple criterion proposed by Nielsen and McCammon in 2003 [250]. They proposed that if a catalytic mechanism involves two acidic residues, the proton donor should have a pKa value of at least 5.0 and the pKa of the nucleophile should be at least 1.5 pH units lower than that of the proton donor. Second, the pKa values of the HEWL acidic residues were determined by Bartik et al. [251] using two-dimensional proton NMR. These measurements show that several ionizable residues have pKa values very different from their intrinsic pKa values. Third, because there are more than 100 PDB entries of the wild-type HEWL structure, the effect of structural variation on pKa calculation methods can be tested, especially for the FDPB method [250]. Thus, our constant-pH REMD method will be tested on HEWL.

Figure 5-1. Crystal structure of HEWL (PDB code 1AKI). Residues in red are aspartates and residues in blue are glutamates.

Various constant-pH methods have been tested on HEWL. Bürgi et al. [130] used their constant-pH method to predict the pKa values of HEWL. The RMS error between predicted and experimental pKa values ranged from 2.8 to 3.8 pH units.
In 2004, Lee et al. [114] applied their CPHMD method to four proteins: turkey ovomucoid (PDB code 1OMT), bovine trypsin inhibitor (1BPI), HEWL (193L) and ribonuclease A (7RSA). The overall pKa RMS error relative to experimental data was around 1 pH unit. For HEWL, the average absolute error over all ionizable residues (including the termini) was 1.6 pH units, while the average absolute error for the acidic ionizable residues was 1.5 pH units. However, the predicted pKa values of Glu35 and Asp52 were both 5.8, so the CPHMD results could not distinguish the proton donor from the nucleophile.

In the same year, Mongan et al. [127] published their discrete protonation state constant-pH MD method and also selected HEWL as the test system. In their study, four different crystal structures of HEWL were used (1AKI, 1LSA, 3LZT, and 4LYT). The RMS errors of the pKa values of all ionizable residues relative to experimental results were 0.86, 0.77, 0.88, and 0.95 for 1AKI, 1LSA, 3LZT, and 4LYT, respectively. In addition to pKa predictions, Mongan et al. studied the correlation between protonation and conformation. A principal component analysis was conducted, the trajectory was projected onto the first two eigenvectors (those with the largest eigenvalues), and an association between conformation and protonation was observed.

In 2006, Khandogin and Brooks [110] used the REX-CPHMD method to predict the pKa values of 10 proteins. The RMS errors between REX-CPHMD and experimental pKa values ranged from 0.6 to slightly more than 1 pH unit. For HEWL, the RMS error between predicted and experimental pKa values was 0.6 pH units and the maximum absolute error was 1.0 pH unit. So far, their HEWL pKa prediction RMS error is the smallest among constant-pH pKa calculations on HEWL.
Machuqueiro and Baptista presented HEWL pKa predictions from their stochastic titration constant-pH MD with an explicit water model in 2008 [125]. The RMS errors between predicted and experimental pKa values were 0.82 and 1.13 for the generalized reaction field [252] and PME [154] treatments of long-range electrostatics, respectively. A comparative FDPB calculation (using the same single crystal structure as in the constant-pH MD and a protein dielectric constant of 2) was also conducted, and its RMS error was 2.76. Since the constant-pH method proposed by Baptista requires FDPB calculations, the choice of the dielectric constant inside the protein is crucial. Machuqueiro and Baptista performed constant-pH MD with three different dielectric constants (ε = 2, 4, and 8) combined with the PME treatment of long-range electrostatics. The pKa RMS errors were 1.13, 1.02, and 1.12 for ε = 2, 4, and 8, respectively.

More recently, the constant-pH MD proposed by Mongan et al. [127] was coupled with accelerated molecular dynamics (AMD) [133,134] and tested on HEWL by Williams et al. [129]. Constant-pH AMD and MD simulations of 5 ns in length were performed. Only the acidic ionizable residues in HEWL were titrated in the constant-pH scheme. RMS errors between predicted and experimental pKa values were calculated. Constant-pH AMD yielded an overall RMS error of 0.73, while the original constant-pH MD gave 0.80. The pKa RMS errors of the aspartates were 0.75 and 1.46 from constant-pH AMD and MD, respectively. The pKa RMS errors of the glutamates were 0.85 and 1.04 from constant-pH AMD and MD, respectively. In general, recent works using various constant-pH schemes have achieved RMS errors in the range of 0.6 to 1.13 for HEWL.

In this chapter, we present a study of HEWL using the constant-pH REMD algorithm. Both structurally restrained and unrestrained simulations were performed. pKa values from constant-pH REMD are compared with experimental values.
We also investigated the pKa convergence, the effect of structural restraints, and conformation-protonation correlations.

5.2 Simulation Details

The crystal structure 1AKI (PDB code) was taken as the HEWL starting structure in our study. Water molecules in the crystal structure were first stripped. Only aspartate and glutamate residues were titrated, giving nine ionizable residues. Hydrogen atoms were added by the LEaP module in the AMBER suite. The post-processed crystal structure was then minimized and heated from 0 K to 300 K. The restart structure from the heating process was taken as the initial structure for our constant-pH REMD simulations. In this chapter, all REMD runs refer to constant-pH REMD simulations for simplicity. The pH range was from 2 to 6 in increments of 0.5 pH units.

Two sets of REMD simulations were performed: unrestrained (ntr=0 in AMBER) and restrained (ntr=1 in AMBER). In each REMD run, an exchange of structures was attempted every 500 MD steps. One thousand exchange attempts were planned for both sets, so the simulation time of each replica in each set is 1 ns. In the unrestrained REMD runs, we chose the highest temperature to be 320 K in the hope that HEWL would not unfold at any temperature. In the restrained REMD runs, the Cα atoms of residues 3 to 126 were restrained by harmonic potentials. The restraining harmonic potential has the following form:

\[ U_{res} = k\,(q - q_{ref})^{2} \]

where q and q_ref are the Cartesian coordinates at the current time and the Cartesian coordinates of the reference structure, respectively, and k is the force constant of the harmonic potential, which determines the strength of the restraint. In our simulations, the reference coordinates are the initial Cα coordinates. With harmonic restraints on the Cα atoms, the secondary structure of HEWL is preserved, so the highest temperature can be increased to 420 K in order to achieve better side-chain conformational sampling.
The force constant of the harmonic potentials was 1.0 kcal/mol·Å² (restraint_wt=1 in AMBER).

Several other REMD simulations were performed based on the results of these two sets of REMD runs. Their general goal was to test the hypotheses drawn from the two previous sets. First, the constant-pH REMD simulations with restraints on the Cα atoms were continued for another 1 ns at all pH values in order to check the pKa convergence of the restrained simulations. As before, 1000 exchange attempts were made in this additional 1 ns, and the restraint strength remained 1.0 kcal/mol·Å². Second, a new set of constant-pH REMD simulations with restraints on the Cα atoms was performed. The force constant in this second set was 0.1 kcal/mol·Å², so that the effect of restraint strength could be tested. The details of the constant-pH REMD simulations are given in Table 5-1.

Table 5-1. Simulation details of constant-pH REMD runs

pH values    Restrained   Restraint   Number of   Temperature   Simulation   Exchange
             or not       strength    replicas    (K)           time (ns)    attempts
2-6          No           0           4           280-320       1            1000
2-6          Yes          1           8           280-420       2            2000
3, 4, 4.5    Yes          0.1         8           280-420       2            2000

The restraint strength is the force constant of the harmonic potential, in kcal/mol·Å². The REMD simulation with the 1 kcal/mol·Å² restraint was performed in two stages. Each stage lasted 1 ns, and the purpose of the second stage was to check the pKa convergence.

All simulations were performed using the AMBER 9 molecular simulation suite [253] with the AMBER ff99SB force field [139]. The SHAKE algorithm [145] was used to allow a 2 fs time step. The OBC Generalized Born implicit solvent model [200] was used to model the water environment in all our calculations. The Berendsen thermostat [146], with a relaxation time of 2 ps, was used to keep each replica temperature around its target value. The salt concentration (Debye-Hückel based) was set to 0.1 M. The cutoff for nonbonded interactions and for the Born radii calculation was 30 Å.
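As an illustration of the setup described in this section, the following minimal Python sketch evaluates a positional restraint energy and the simulation time per replica. It assumes the restraint has the quadratic form k|q - q_ref|² summed over restrained atoms, with no factor of 1/2, which is how the text is read here; the function names and the toy coordinates are hypothetical and are not part of AMBER.

```python
def harmonic_restraint_energy(coords, ref_coords, k=1.0):
    """Sum of k * |q - q_ref|^2 over restrained atoms.

    coords, ref_coords: lists of (x, y, z) tuples in Angstrom.
    k: force constant in kcal/mol/A^2 (assumed form, no factor of 1/2).
    Returns the restraint energy in kcal/mol.
    """
    energy = 0.0
    for (x, y, z), (xr, yr, zr) in zip(coords, ref_coords):
        energy += k * ((x - xr) ** 2 + (y - yr) ** 2 + (z - zr) ** 2)
    return energy


def time_per_replica_ns(exchange_attempts, steps_per_attempt=500, timestep_fs=2.0):
    """Simulation time of one replica: attempts x MD steps per attempt x time step."""
    return exchange_attempts * steps_per_attempt * timestep_fs / 1.0e6


# Toy example: two restrained atoms displaced by 1 A and 2 A from their references.
reference = [(0.0, 0.0, 0.0), (5.0, 0.0, 0.0)]
current = [(1.0, 0.0, 0.0), (5.0, 2.0, 0.0)]

strong = harmonic_restraint_energy(current, reference, k=1.0)  # 1 + 4 = 5 kcal/mol
weak = harmonic_restraint_energy(current, reference, k=0.1)    # ten times smaller

one_stage_ns = time_per_replica_ns(1000)   # 1000 attempts x 500 steps x 2 fs = 1 ns
two_stage_ns = time_per_replica_ns(2000)   # two 1 ns stages = 2 ns
```

The same displacement costs ten times less energy under the 0.1 kcal/mol·Å² restraint, which is why the weaker restraint leaves the backbone more room to relax.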
5.3 Protein Conformational and Protonation State Equilibrium Model

Suppose an ionizable side chain has only two conformations in equilibrium and each conformer has its own protonation equilibrium. We use 1p, 1d, 2p and 2d to label conformer 1 in its protonated form, conformer 1 in its deprotonated form, conformer 2 in its protonated form, and conformer 2 in its deprotonated form, respectively. The equilibria among all species are shown in Figure 5-2.

Figure 5-2. A simple schematic view of the conformation-protonation equilibrium in a constant-pH simulation.

Then K_{12}, the equilibrium constant between conformations 1 and 2, is

\[ K_{12} = \frac{[1p] + [1d]}{[2p] + [2d]} \tag{5-1} \]

In the above model, pK_{a,1} and pK_{a,2} describe the protonation equilibrium within each conformation. They can be expressed as

\[ \mathrm{p}K_{a,1} = \mathrm{pH} - \log\frac{[1d]}{[1p]} \tag{5-2} \]

and

\[ \mathrm{p}K_{a,2} = \mathrm{pH} - \log\frac{[2d]}{[2p]} \tag{5-3} \]

So the pKa of that ionizable residue is

\[ \mathrm{p}K_{a} = \mathrm{pH} - \log\frac{[1d] + [2d]}{[1p] + [2p]} \tag{5-4} \]

5.4 NMR Chemical Shift Calculations

A theoretical NMR chemical shift titration curve was generated. Because of system size limitations, full quantum mechanical NMR calculations were performed only on the ionizable residue dipeptide (the ionizable residue with both ends blocked). The structure of the ionizable dipeptide was extracted from the representative structures (representing different side-chain conformations) generated by cluster analysis. Proper protonation states were assigned to each structure. All full quantum mechanical NMR calculations were performed with the Gaussian03 software package [254] using the B3LYP functional and the 6-311++G** basis set. Isotropic magnetic shielding constants were computed in vacuum using the GIAO method [255]. Tetramethylsilane (TMS) was used as the reference to obtain the chemical shifts.

Recently, Merz and coworkers [256] developed an automated fragmentation quantum mechanical/molecular mechanical (AF-QM/MM) approach to study protein properties. They have applied their method to compute the protein chemical shifts of Trp-cage.
In this AF-QM/MM model, one residue and the atoms near it (within 4 Å) are assigned to the QM region and the rest of the protein is placed in the MM region. During the NMR calculations, all atoms in the MM region are treated as point charges. We also applied this AF-QM/MM method to 1AKI to calculate chemical shifts. Again, all AF-QM/MM calculations were based on representative structures.

5.5 Results and Discussions

5.5.1 Structural Stability and pKa Convergence

Since changing the protonation state during a simulation causes a discontinuity in force and energy, structural stability in our simulations is important. We use the Cα root-mean-square deviation (RMSD) relative to the 1AKI structure as our metric. Figure 5-3A shows the Cα RMSD versus time in the unrestrained REMD runs. HEWL is unstable at all simulated pH values. The RMSD can reach a very high value (about 18 Å) during the simulations. Even at pH 4, where the Cα RMSD values are small relative to the rest, the Cα RMSD can still exceed 3 Å. pKa predictions from the unrestrained REMD runs should therefore not be used.

Figure 5-3B shows the RMSDs in the restrained REMD runs. Although the RMSD values are small and stable throughout the 2 ns simulations, the restrained REMD simulations still reveal a problem. Our simulations use 1AKI, which was resolved at pH 4.5, as the starting structure. As the pH moves away from 4.5, one may expect HEWL to adopt conformations slightly different from 1AKI, so a larger RMSD should be expected where the pH is far from 4.5. This behavior has been confirmed in the work of Mongan et al. However, restraining the Cα atoms results in the same RMSD over the entire pH range. This may have a negative effect on pKa predictions at pH values far from 4.5.

Figure 5-3. Cα RMSD (in Å) relative to the crystal structure (PDB code 1AKI) as a function of frame number at pH 2, 3, 4, 5 and 6. A) Cα RMSD from REMD without restraints on Cα (total time 1 ns). B) Cα RMSD from REMD with restraints on Cα (total time 2 ns). The restraint strength is 1 kcal/mol·Å².

To check the convergence of protonation state sampling in the restrained REMD simulations, the pKa prediction error (predicted value minus experimental value) and the pKa prediction deviation (predicted pKa value at the current time minus the final predicted pKa value) are followed over time and shown in Figures 5-4 and 5-5. According to those plots, the pKa predictions stabilize after a few hundred picoseconds of simulation. Increasing the simulation time would not change the average pKa predictions or their errors relative to experimental values. To show that convergence in protonation state sampling is reached over a wide pH range, a representative plot of the Asp52 pKa deviations is shown in Figure 5-5B. Convergence is clearly seen over the pH range.

Figure 5-4. pKa prediction error as a function of time (in ps) for Glu7, Asp18, Asp48, Asp52, Asp66, Asp87, Asp101 and Asp119. The predicted pKa at a given time is a cumulative result. For each ionizable residue, the time series of its pKa error is generated at the pH closest to its average predicted pKa. In this way, we try to eliminate any bias toward the energetically favored state. A flat line is an indication of convergence. Glu35 is not shown here due to poor convergence.

Figure 5-5. A) pKa prediction convergence to its final value as a function of time (in ps). As before, the pKa value at a given time is a cumulative average. A flat line with a y-value of 0 is expected when the pKa calculation has converged. The same pH values are chosen for each ionizable residue as in Figure 5-4. B) Asp52 pKa prediction convergence to its final value at multiple pH values (2, 2.5, 3, 3.5, 4 and 4.5). The pH values are those at which the calculated pKa is used to compute the composite pKa.

5.5.2 pKa Predictions

A common way to assess the accuracy of pKa predictions is to look at the pKa RMS error relative to experimentally measured pKa values. In general, a Hill plot is used to generate the pKa of each ionizable residue because a Hill plot combines the results from all simulations. Mongan et al. proposed a way to calculate pKa values without using a Hill plot in their constant-pH MD paper. They called pKa values calculated in this way composite pKa values. A composite pKa is the average of all pKa values having an absolute offset of less than 2 pH units, where the offset is the difference between a predicted pKa and the pH at which it was obtained.

Table 5-2 shows the pKa values and the pKa RMS errors from the 2 ns restrained REMD runs. Composite pKa values, pKa values obtained from Hill plots, and their RMS errors relative to experimental measurements are also listed in Table 5-2. We used the same experimental pKa values as Mongan et al. to calculate the pKa RMS errors. In our work, the pKa predictions from Hill plots yield an RMS error of 0.84, while the composite pKa values produce an RMS error of 0.87.
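The composite pKa and the RMS error described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the analysis code used in this work: the function names are hypothetical, the per-pH pKa is assumed to follow the Henderson-Hasselbalch relation from the deprotonated fraction, and the input numbers are the Glu7 row and the pH 4.5 column of Table 5-2.

```python
import math


def pka_from_fraction(ph, frac_deprotonated):
    """Henderson-Hasselbalch estimate of the pKa from the deprotonated fraction at one pH."""
    return ph - math.log10(frac_deprotonated / (1.0 - frac_deprotonated))


def composite_pka(ph_values, pka_values, max_offset=2.0):
    """Average the per-pH pKa estimates whose offset |pKa - pH| is below max_offset."""
    kept = [pka for ph, pka in zip(ph_values, pka_values) if abs(pka - ph) < max_offset]
    return sum(kept) / len(kept)


def rms_error(predicted, experimental):
    """Root-mean-square difference between predicted and experimental pKa values."""
    squared = [(p - e) ** 2 for p, e in zip(predicted, experimental)]
    return math.sqrt(sum(squared) / len(squared))


# Glu7 predictions at each simulated pH listed in Table 5-2.
ph_values = [2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0, 6.0]
glu7 = [3.61, 3.58, 3.46, 3.03, 2.99, 2.93, 2.36, 3.37]
glu7_composite = composite_pka(ph_values, glu7)  # about 3.27; pH 5 and 6 are excluded

# All nine residues at pH 4.5 (Table 5-2), in the order Glu7 ... Asp119.
experimental = [2.85, 2.66, 6.2, 2.5, 3.68, 2.0, 2.07, 4.09, 3.2]
predicted_ph45 = [2.93, 2.35, 4.53, 2.45, 2.72, 2.72, 2.64, 3.55, 3.01]
rmse_ph45 = rms_error(predicted_ph45, experimental)  # about 0.74
```

For Glu7, the offsets at pH 5 and pH 6 exceed 2 pH units, so only the six values from pH 2 to pH 4.5 enter the average, which gives the tabulated composite value of 3.27.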
According to the constant-pH simulation literature, the RMS errors of HEWL pKa predictions are around 0.8 for acidic ionizable residues, so our simulations show no significant improvement in overall pKa prediction. However, as mentioned in the structural stability discussion, restraining the Cα atoms of a protein reduces its ability to adjust its conformation. The further a pH value is from the crystal pH, the more the structural ensemble is skewed from the correct one. Simulations performed around pH 4.5 are less affected by the restraint than simulations at pH values far from 4.5. Since a less skewed structural ensemble introduces less error into the pKa predictions, one may expect smaller pKa RMS errors relative to experimental values around pH 4.5. The pKa prediction RMS errors relative to experimental values are plotted against pH in Figure 5-6. As expected, a minimum with an RMS error of 0.74 is found at pH 4.5. An RMS error of 0.74 is among the best published HEWL predictions.

Table 5-2. Predicted pKa values and their RMS errors relative to experimental measurements from the restrained REMD simulations.
Residue     Exp    pH 2   pH 2.5  pH 3   pH 3.5  pH 4   pH 4.5  pH 5   pH 6   Comp   Hill
Glu7        2.85   3.61   3.58    3.46   3.03    2.99   2.93    2.36   3.37   3.27   3.23
Asp18       2.66   1.59   1.54    1.51   1.61    1.91   2.35    2.5    3.69   1.63   1.4
Glu35       6.2    3.76   3.65    4.36   4.14    4.31   4.53    4.76   4.61   4.27   4.58
Asp48       2.5    1.88   1.98    2.14   2.34    2.6    2.45    1.96   2.9    2.23   2.01
Asp52       3.68   2.71   2.45    2.63   2.82    3.05   2.72    2.77   3.99   2.73   2.68
Asp66       2.0    2.5    2.69    2.86   2.92    3.12   2.72    3.09   4.04   2.8    2.73
Asp87       2.07   2.32   2.43    2.64   2.49    2.54   2.64    2.79   3.62   2.51   2.42
Asp101      4.09   4.52   4.4     4.14   4.03    3.79   3.55    3.44   3.96   3.89   3.85
Asp119      3.2    2.71   2.78    3.01   3.01    3.25   3.01    2.89   3.97   2.96   2.9
RMS Error          1.04   1.1     0.91   0.89    0.83   0.74    0.79   1.12   0.87   0.84

Exp stands for the experimental pKa value, Comp for the composite pKa, and Hill for the pKa obtained from the Hill plot. The force constant of the harmonic potential used here is 1.0 kcal/mol·Å².

Figure 5-6. RMS error between predicted and experimental pKa values versus pH. A minimum of the pKa RMS error is found near the pH at which the 1AKI crystal structure was resolved.

5.5.3 Constant-pH REMD Simulations with a Weaker Restraint

Based on what has been found so far, we propose that reducing the restraint strength on the Cα atoms will yield better pKa predictions. A weaker restraint gives the protein more conformational freedom, so HEWL can relax its structure further, even at pH 4.5. A more accurate structural ensemble can thus be produced, which in turn will improve the pKa calculations.

Constant-pH REMD simulations with a weaker restraint (harmonic potential on the Cα atoms) of 0.1 kcal/mol·Å² were carried out at three different pH values to test this hypothesis. First, as shown in Figure 5-7A, all three simulations generate larger Cα RMSDs relative to 1AKI than the simulations with the stronger restraint. This means that HEWL relaxes more when a weaker restraint is used.
In addition, the Cα RMSD fluctuations in all three runs are larger than those in the 1 kcal/mol·Å² REMD runs, which means that more conformational space is visited. Another interesting point in the weakly restrained REMD runs is that the Cα RMSDs at pH 3 and 4 are larger than those at pH 4.5. Simulations at pH 3 and 4 do tend to sample conformations that differ from those sampled at pH 4.5.

The pKa prediction results are listed in Table 5-3. The pKa prediction deviation from the final value versus time at pH 4.5 is shown in Figure 5-7B to demonstrate the convergence of protonation state sampling. According to Table 5-3, an improvement of nearly 0.1 pH units in the RMS error of the predicted pKa values is seen at each pH for the weakly restrained REMD runs. However, among the three RMS errors, the best one is still obtained at pH 4.5, indicating that the restraint still favors simulations near pH 4.5. After reducing the restraint strength, our best pKa RMS error relative to experimental values is 0.62.

Table 5-3. Predicted pKa values and their RMS errors relative to experimental measurements from weakly restrained REMD simulations.

            pH = 3          pH = 4          pH = 4.5
Residue     1       0.1     1       0.1     1       0.1
Glu7        3.46    3.71    2.99    3.38    2.93    3.34
Asp18       1.51    1.57    1.91    1.76    2.35    2.23
Glu35       4.36    5.09    4.31    5.23    4.53    5.24
Asp48       2.14    2.27    2.6     2.48    2.45    2.71
Asp52       2.63    2.47    3.05    2.88    2.72    3.29
Asp66       2.86    2.63    3.12    2.66    2.72    2.93
Asp87       2.64    2.52    2.54    2.79    2.64    2.88
Asp101      4.14    3.82    3.79    3.77    3.55    3.54
Asp119      3.01    2.22    3.25    2.21    3.01    3.38
RMSE        0.91    0.84    0.83    0.72    0.74    0.62

In Table 5-3, the number 1 in the second row means that the force constant of the restraining potential is 1 kcal/mol·Å², while 0.1 stands for 0.1 kcal/mol·Å². RMSE stands for RMS error.

Figure 5-7.
A) Cα RMSD of HEWL from the weakly restrained REMD simulations at pH 3, 4 and 4.5 (total time 2 ns). The RMSDs are larger than those obtained with the stronger restraint. Among the weakly restrained simulations, the RMSDs are greater at pH 3 and 4 than at pH 4.5. B) pKa prediction deviation from the final value at pH 4.5 from constant-pH REMD with the 0.1 kcal/mol·Å² restraint.

5.5.4 Active Site Ionizable Residue pKa Prediction: Asp52

Accurate calculation of the pKa values of ionizable residues in the active site is important because their protonation states are crucial in enzyme reactions. In HEWL, Asp52 acts as a nucleophile. This requires Asp52 to be deprotonated during the reaction, which has an optimal pH around 5. In both sets of restrained REMD simulations, Asp52 is indeed deprotonated around pH 5. However, the error of the Asp52 pKa relative to the experimental value is about 1 pH unit. Mongan and coworkers observed the same trend, except that a larger error was obtained in their simulations. They attributed the very low predicted pKa of Asp52 to the Asp52-Asn46 hydrogen bond [127].

Asp52 and the residues that interact strongly with it (three asparagine residues: Asn44, Asn46 and Asn59) in the crystal structure of 1AKI (with hydrogen atoms added and the proper protonation state chosen for pH 4.5) are shown in Figure 5-8. We studied those interactions, represented by atom-to-atom distances, in our REMD simulations. We find that Asp52 is closer to Asn59 and Asn44 than to Asn46, indicating that Asp52 interacts more strongly with Asn59 and Asn44 than with Asn46. Time series of the distances from the Asp52 carboxylic oxygen atoms to the Asn59 and Asn44 ND2 atoms at pH 3 are shown in Figure 5-9. As can be seen from Figures 5-9A and 5-9B, Asp52 stays within hydrogen-bonding distance of Asn44 and Asn59 for a long time, even at a pH as low as 3. Furthermore, the hydrogen-bonding distances between Asp52 and Asn44 and between Asp52 and Asn59 are coupled. The two oxygen atoms in the carboxylic group of Asp52 are able to act as proton acceptors simultaneously.
This means that the deprotonated form of Asp52 is overstabilized by hydrogen bonding, even at low pH values.

[Figure 5-8: structure image with residue labels ASN46, ASN59 and ASP52]
Figure 5-8. Asp52 in the crystal structure of 1AKI. Its neighbors that have strong electrostatic interactions with it are also shown.

[Figure 5-9: distance time series vs. frame number; panel A, Asp52 OD1 to Asn59 ND2 and Asp52 OD1 to Asn44 ND2; panel B, Asp52 OD2 to Asn59 ND2 and Asp52 OD2 to Asn44 ND2]
Figure 5-9. A) Time series of the distances from the Asp52 carboxylic oxygen atom OD1 to the Asn59 and Asn44 ND2 atoms at pH 3 in the 1 kcal/mol·Å² constant-pH REMD run. B) Time series of the distances from the Asp52 carboxylic oxygen atom OD2 to the Asn59 and Asn44 ND2 atoms under the same conditions. Hydrogen bonds that stabilize deprotonated Asp52 are formed to a large extent even at low pH.

Next, hydrogen bond analysis was conducted with the PTRAJ module in the AMBER suite for both sets of restrained REMD simulations. Hydrogen bonds can be found between Asp52 and all three asparagines (Asn44, Asn46 and Asn59) in both sets. The occupation times of the Asp52-Asn44 and Asp52-Asn59 hydrogen bonds are longer than that of the Asp52-Asn46 hydrogen bond. Furthermore, the Asp52-Asn44 and Asp52-Asn59 hydrogen bonds are coupled, according to the distances shown in Figure 5-9. Asp52 is protonated only when the entire carboxylic group points away from Asn44 and Asn59. The Asp52-Asn44 and Asp52-Asn59 hydrogen bonds, not the Asp52-Asn46 hydrogen bond, are responsible for the low predicted pKa value of Asp52. The hydrogen bond contents are similar in the strongly and weakly restrained REMD simulations. This indicates that the hydrogen-bonding effect on Asp52 in our simulations is too strong. Reducing the restraint strength does not help the conformational sampling of Asp52.

5.5.5 Active Site Ionizable Residue pKa Prediction: Glu35

Glu35 is another problematic case in our study.
In the 1 kcal/mol·Å² runs, it has the largest single-residue error, almost 2 pH units. Excluding Glu35 lowers the pKa RMS error by nearly 0.2 pH unit. In the 0.1 kcal/mol·Å² runs, the predicted pKa of Glu35 is improved, with an error of around 1 pH unit. This is the main reason that smaller pKa RMS errors relative to experimental data are found in all three 0.1 kcal/mol·Å² REMD simulations. Although the pKa error of Glu35 in the weakly restrained REMD simulations is still large, the good news is that Glu35 can be correctly identified as the proton donor based on the criterion proposed by Nielsen and McCammon: Glu35 has a pKa value of ~5.2, and the pKa difference between Asp52 and Glu35 is greater than 1.5 pH units. The predicted pKa value of Glu35 was determined to be 5.32 in the study performed by Mongan et al. They claimed that a hydrogen-bonding effect similar to the one shown by Asp52 was responsible for the low predicted pKa value of Glu35 (ref 127). However, hydrogen-bonding analysis of our data does not show any significant hydrogen bonding formed by Glu35, which is contrary to what Mongan et al. claimed.

In the 1AKI crystal structure, the Glu35 side chain is in the vicinity of the Gln57, Trp108 and Ala110 side chains. Several key distances between the Glu35 carboxylic group and the Gln57, Trp108 and Ala110 side chains in the crystal structure are listed in Table 5-4. According to Table 5-4, Glu35 is in a hydrophobic region, except for the short distance between the Glu35 OE2 atom and the Ala110 backbone amide nitrogen atom. The hydrophobic environment is the main reason for the elevated pKa value of Glu35. However, when the carboxylic group points toward the Ala110 amide group, the deprotonated form of Glu35 is favored. If such a conformation is stable throughout the simulations, the predicted pKa value will be smaller than it is supposed to be.
We think one reason for the low predicted pKa value is that Glu35 is stuck in conformations that stabilize the deprotonated form. The weakly restrained simulations, however, allow Glu35 to relax its structure further and to visit conformations that stabilize the protonated form more frequently.

Table 5-4. Distances between the Glu35 carboxylic oxygen atoms and neighboring residue side-chain atoms in the 1AKI crystal structure.

             Glu35 OE1   Glu35 OE2
Gln57 CB     3.56        5.25
Gln57 CG     3.85        5.84
Trp108 CB    5.36        3.43
Trp108 CG    5.43        3.94
Trp108 CD1   4.65        3.67
Ala110 N     4.65        3.09
Ala110 CB    4.19        3.48

The unit of all distances in Table 5-4 is Å.

The Glu35 heavy-atom RMSD relative to 1AKI, together with cluster analysis based on those RMSDs, is used to study Glu35 conformational sampling. The distributions of the heavy-atom RMSD, shown in Figure 5-10, reveal two conformations in the strongly restrained simulations: one centered at an RMSD of ~0.1 Å (labeled conformation 1) and the other centered at ~0.6 Å (labeled conformation 2). However, an extra conformation (conformation 3) is visited by the weakly restrained REMD simulations. Cluster analysis is employed to separate those conformations. For conformation 2, the carboxylic group of Glu35 points toward the Ala110 amide group in both sets of restrained REMD runs (Figure 5-11). The carboxylic group in conformation 1 also points toward the Ala110 amide group, although to a lesser extent. However, conformation 3 (found in the weakly restrained runs only) contains configurations in which the Glu35 carboxylic group points away from the Ala110 amide group (Figure 5-12B). In this conformation, the Glu35 side chain is in the hydrophobic region and the protonated species is favored. A too-low population of conformation 3 is responsible for the low predicted pKa value of Glu35.

[Figure 5-10, panel A: series REMD, pH=4.5, res=1.0 and REMD, pH=4.5, res=0.1 vs. frame number (total time = 2 ns)]
Figure 5-10.
A) Time series of the Glu35 heavy-atom (excluding the two carboxylic oxygen atoms) RMSD relative to the crystal structure 1AKI. B) Probability distribution of the RMSD. The conformation centered at an RMSD of ~0.1 Å is labeled conformation 1. The one centered at ~0.6 Å is named conformation 2. An extra conformation (conformation 3) is visited by the weakly restrained REMD simulation.

[Figure 5-10, panel B: probability distributions for REMD, pH=4.5, res=1.0 and REMD, pH=4.5, res=0.1 vs. RMSD of Glu35 relative to 1AKI (Å)]
Figure 5-10. Continued

[Figure 5-11: structure images, panels A and B, both labeled REMD, pH=4.5, res=1.0]
Figure 5-11. A) Representative structure of conformation 1. B) Representative structure of conformation 2. The structure ensemble is generated from the REMD simulations with the stronger restraining potential. The carboxylic group of Glu35 in conformation 2 clearly points toward the amide group of Ala110. The deprotonated form of Glu35 tends to decrease the electrostatic energy. Furthermore, conformation 1 does not particularly favor the protonated Glu35. No significant stabilizing factor is found for the protonated Glu35.

[Figure 5-12: structure image labeled REMD, pH=4.5, res=0.1]
Figure 5-12. Representative structure of conformation 3 from cluster analysis. Glu35 is in the hydrophobic region consisting of Gln57, Trp108 and Ala110. Conformations 1 and 2 in the weakly restrained simulations are basically the same as those shown in Figure 5-11.

Another possible reason for the underestimated pKa value of Glu35 is the use of implicit solvent in the constant-pH MD and REMD simulations. Imoto et al. suggested that Glu35 and Asp52 are coupled by two water molecules through hydrogen bonding, with the Glu35 carboxylic group acting as a proton donor. The protonated form of Glu35 would thus be stabilized, contributing to the elevated pKa value. Two water molecules are indeed found between Glu35 and Asp52 in the 1AKI crystal structure, and they are within hydrogen-bonding distances to Glu35 and Asp52.
If the hypothesis is true, the use of implicit solvent breaks this hydrogen-bonding network, so a stabilizing factor for protonated Glu35 is missing. A constant-pH algorithm employing explicit solvent is needed to study this effect.

5.5.6 Correlation between Conformation and Protonation

As described earlier, one advantage of constant-pH methods is that conformational sampling and protonation state sampling are directly coupled. In this work, side-chain dihedral angles are chosen to study conformation-protonation coupling. The Asp119 χ1 and χ2 dihedral angles at pH 3 are shown as representatives. Two-dimensional (2D) histograms of dihedral angle versus protonation state are displayed in Figure 5-13. A 2D histogram is generated by binning in dihedral angle and protonation state space. (As explained in the second chapter, considering the syn and anti configurations of the protons generates five protonation states for an ionizable aspartate in AMBER. They are labeled 0, 1, 2, 3 and 4, where state 0 is the deprotonated state and the rest are protonated species.)

[Figure 5-13: 2D histograms of side-chain dihedral angle vs. protonation state (0 to 4), panels A and B]
Figure 5-13. A) Correlation between the side-chain dihedral angle χ1 and protonation states. B) Correlation between the side-chain dihedral angle χ2 and protonation states.

Our 2D histograms show the correlations between the dihedral angle distribution and the protonation state distribution. Two conformations are obtained in χ1 space: conformation 1 has a χ1 angle around 60°, while conformation 2 has a χ1 angle around 170°. In Figure 5-13A, we can clearly see that conformation 1 is coupled with the protonated form and that most structures in conformation 2 are in the deprotonated state. According to Figure 5-13B, similar behavior can be seen in χ2 space too.
Most deprotonated Asp119 structures have χ2 near 40° and 140°, while configurations with χ2 of 75° and 100° are protonated.

A closer look at the 1AKI crystal structure reveals that the side chains of Asp119 and Arg125 are close to each other (the carboxylic group of Asp119 and the guanidinium group of Arg125 are within hydrogen-bonding distance). Since Arg125 carries a positive charge on its guanidinium group, it stabilizes the deprotonated Asp119 when the two side chains are close to each other. We calculated the pKa of Asp119 in 1AKI using H++, a web-based FDPB server developed by Alexey Onufriev's group at Virginia Tech, in which the FDPB equation is solved on the basis of only one protein structure (refs 257, 258). The calculated pKa of Asp119 using the FDPB method is 1.1, 0.7 and 1.3 when the internal dielectric constant is set to 2, 4 and 6, respectively. All three pKa values are much lower than the experimental pKa value of 3.2. This behavior agrees with the explanation above: the Asp119-Arg125 side-chain coupling stabilizes the deprotonated form of Asp119. The single-structure FDPB-based pKa calculations yield such low pKa values because only one conformation is visited by Asp119. Therefore, Asp119 must sample other conformations in order to yield accurate pKa predictions. The time evolution of the distance between the Asp119 and Arg125 side chains is shown in Figure 5-14 to demonstrate that conformations other than the crystal conformation are visited in our constant-pH REMD runs. In Figure 5-14, we can clearly see that the close contact between the Asp119 and Arg125 side chains can be broken during our simulations. Allowing the side chains to move results in a pKa value of 3.0 in our simulations. The comparison between the constant-pH and single-structure FDPB algorithms clearly demonstrates the importance of conformational sampling in pKa calculations.

[Figure 5-14: series Asp119 OD1 and Asp119 OD2 vs. frame number (2 ns in total)]
Figure 5-14.
Minimal distance between the Asp119 side-chain carboxylic oxygen atoms (OD1 and OD2) and the Arg125 guanidinium nitrogen atoms. Since the guanidinium group has three nitrogen atoms, the minimal distance is the shortest distance between Asp119 OD1 (or OD2) and those three nitrogen atoms.

Therefore, another way to look at conformations is to combine Asp119 and Arg125. The distance between the Asp119 CG and Arg125 CZ atoms is selected to distinguish different conformations. Figure 5-15A shows the CG-CZ distance probability distribution, which reveals that two conformations exist. One conformation is centered at a CG-CZ distance of 4.2 Å and represents the state in which the Asp119-Arg125 coupling is on. The other conformation comprises all structures that do not belong to the first one; based on the distance between Asp119 CG and Arg125 CZ, we can say that the coupling is off. The 2D histogram of distance versus protonation state at pH 3 is shown in Figure 5-15B. As can be seen in the 2D histogram contour plot, the short-distance conformation is indeed in the deprotonated state. The pKa of the short-distance conformation is negative infinity. Although several snapshots are both protonated and at short distance, the 2D histogram does not reveal them as a stable conformation, so the short-distance conformation is purely coupled with the deprotonated form. From a Hill plot, we also obtain a pKa value of 3.3 for the longer-distance conformation.

[Figure 5-15: panel A, probability vs. distance (Å) for pH=3, pH=4 and pH=4.5; panel B, distance vs. protonation state (0 to 4)]
Figure 5-15. A) Probability distribution of the Asp119 CG to Arg125 CZ distance. The Asp119 CG to Arg125 CZ distance is used to distinguish conformations. B) Coupling between conformations and protonation states.
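The Hill-plot estimate of a per-conformation pKa can be illustrated with a minimal sketch. It assumes that the deprotonated fraction of one conformation has already been counted at several pH values, and it fits log10(f / (1 - f)) = n (pH - pKa) by ordinary least squares. The function name `hill_fit` is hypothetical, and the fractions below are synthetic values generated from a pKa of 3.3 with a Hill coefficient of 1; they are not data from these simulations.

```python
import math


def hill_fit(ph_values, deprotonated_fractions):
    """Least-squares fit of log10(f / (1 - f)) = n * (pH - pKa).

    Returns (pKa, n). Every fraction must lie strictly between 0 and 1,
    otherwise the logarithm is undefined.
    """
    xs = [float(p) for p in ph_values]
    ys = []
    for f in deprotonated_fractions:
        if not 0.0 < f < 1.0:
            raise ValueError("fractions must be strictly between 0 and 1")
        ys.append(math.log10(f / (1.0 - f)))

    # Ordinary least squares for the line y = slope * x + intercept.
    n_points = len(xs)
    mean_x = sum(xs) / n_points
    mean_y = sum(ys) / n_points
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x

    # The line crosses y = 0 at pH = pKa; the slope is the Hill coefficient.
    return -intercept / slope, slope


# Synthetic (hypothetical) titration data for illustration only.
ph_grid = [2.0, 2.5, 3.0, 3.5, 4.0, 4.5]
synthetic_fractions = [1.0 / (1.0 + 10.0 ** (3.3 - ph)) for ph in ph_grid]
fitted_pka, hill_coefficient = hill_fit(ph_grid, synthetic_fractions)
```

A conformation that is never protonated, such as the short-distance one, has f = 1 at every pH and cannot be fitted this way, which matches the statement that its pKa is negative infinity.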
5.5.7 Conformation-Protonation Equilibrium Model

Because the conformational and protonation equilibria are coupled, it is interesting and important to know how pH affects the conformational equilibrium. Again, Asp119 is selected as the representative for our study. First, we show the derivation and the analytical form of K12 as a function of pH in the general case. From now on, we label conformation 1 in the deprotonated form as 1d. Then 1p, 2d and 2p stand for conformation 1 in the protonated form, conformation 2 in the deprotonated form and conformation 2 in the protonated form, respectively. According to eq. 2 and 3, $[1p] = [1d] \cdot 10^{(pK_{a,1} - pH)}$ and $[2p] = [2d] \cdot 10^{(pK_{a,2} - pH)}$. We can substitute [1p] and [2p] in eq. 1 with [1d] and [2d], so the conformational equilibrium constant has the form:

$$K_{12} = \frac{[1d]}{[2d]} \cdot \frac{1 + 10^{(pK_{a,1} - pH)}}{1 + 10^{(pK_{a,2} - pH)}} \qquad (5\text{-}5)$$

In Eq. 5-5, [1d]/[2d] is the equilibrium constant between conformations 1 and 2 in the deprotonated form, and it is equal to K12 at high pH, where both conformations are in the deprotonated form. So K12 has the final analytical formula:

$$K_{12} = K_{12,h} \cdot \frac{1 + 10^{(pK_{a,1} - pH)}}{1 + 10^{(pK_{a,2} - pH)}} \qquad (5\text{-}6)$$

where K12,h stands for K12 at high pH. In our derivation, conformation 1 always has a smaller pKa value than conformation 2, so the denominator always increases faster than the numerator as the pH decreases. Since K12,h is a constant, K12 is a sigmoid function of pH. When the pH is much greater than both pKa values, K12 becomes K12,h. When the pH is much smaller than both pKa values, K12 reaches its lower bound. In the case of Asp119, the pKa value of conformation 1 is minus infinity when we use the Asp119 CG to Arg125 CZ distance to distinguish the two conformations. The ratios of K12 to K12,h from both the analytical derivation and the actual simulations are plotted in Figure 5-16. Close agreement between the K12/K12,h plots generated from the simulations and from the conformation-protonation equilibrium model is seen in Figure 5-16A.
The agreement shows that the model can represent the conformational equilibrium in our constant-pH REMD simulations, so further use of the model is possible. Different pKa,1 and pKa,2 values are also used in order to test how the two pKa values affect the shape and inflection point of the sigmoid function. According to Figures 5-16B, 5-16C and 5-16D, if the difference between pKa,1 and pKa,2 is large (greater than approximately 1 pH unit), the inflection point appears at a pH equal to pKa,2. pKa,1 affects the inflection point only when the difference is small. If we view a K12/K12,h plot as a titration curve whose inflection point is the pKa value, then the K12/K12,h plot yields a pKa value equal to pKa,2, which is 3.3 in the case of Asp119.

[Figure 5-16: K12/K12,h vs. solution pH, four panels. Recoverable legend entries: actual simulation and analytical; pKa,1 = minus infinity, 0.5, 1.0 and 2.0 with pKa,2 = 3.3; pKa,1 = minus infinity with pKa,2 = 3.3, 2.0, 4.0 and 6.0; (pKa,1, pKa,2) = (1.0, 2.0), (2.0, 3.0), (3.0, 3.5) and (3.5, 4.0)]
Figure 5-16. K12/K12,h as a function of pH and its dependence on pKa,1 and pKa,2.

Since the analytical form of K12, pKa,1 and pKa,2 are known and the sum of all fractions is unity, we can work out the fraction of each species. The analytical expressions are:

$$[1d] = \left(\frac{K_{12}}{K_{12} + 1}\right) \left(\frac{1}{1 + 10^{(pK_{a,1} - pH)}}\right) \qquad (5\text{-}7)$$

$$[1p] = \left(\frac{K_{12}}{K_{12} + 1}\right) \left(\frac{10^{(pK_{a,1} - pH)}}{1 + 10^{(pK_{a,1} - pH)}}\right) \qquad (5\text{-}8)$$

$$[2d] = \left(\frac{1}{K_{12} + 1}\right) \left(\frac{1}{1 + 10^{(pK_{a,2} - pH)}}\right) \qquad (5\text{-}9)$$

$$[2p] = \left(\frac{1}{K_{12} + 1}\right) \left(\frac{10^{(pK_{a,2} - pH)}}{1 + 10^{(pK_{a,2} - pH)}}\right) \qquad (5\text{-}10)$$

In our study of Asp119, pKa,1 is minus infinity, so [1p] is equal to zero. K12,h is calculated as the average of all [1d]/[2d] values, which gives a K12,h of 1.6. Another K12,h of 1.8, which is the K12 at pH 5, is also tried. The fractions of each species from both the analytical formulas and the actual simulations are shown in Figure 5-17.
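The equilibrium model of this section can be evaluated numerically with a short sketch. It assumes the reading of Eq. 5-6 as K12 = K12,h (1 + 10^(pKa,1 - pH)) / (1 + 10^(pKa,2 - pH)) and of the species fractions as the population of each conformation, K12/(K12 + 1) or 1/(K12 + 1), multiplied by the Henderson-Hasselbalch fraction within that conformation. The function names are hypothetical; the parameter values (pKa,1 = minus infinity, pKa,2 = 3.3, K12,h = 1.6) are the ones quoted in the text for Asp119.

```python
import math


def k12(ph, k12_high, pka1, pka2):
    """Conformational equilibrium constant K12 as a function of pH."""
    return k12_high * (1.0 + 10.0 ** (pka1 - ph)) / (1.0 + 10.0 ** (pka2 - ph))


def species_fractions(ph, k12_high, pka1, pka2):
    """Fractions of 1d, 1p, 2d and 2p at a given pH; they sum to one."""
    k = k12(ph, k12_high, pka1, pka2)
    conf1 = k / (k + 1.0)      # total population of conformation 1
    conf2 = 1.0 / (k + 1.0)    # total population of conformation 2
    x1 = 10.0 ** (pka1 - ph)   # [1p]/[1d]
    x2 = 10.0 ** (pka2 - ph)   # [2p]/[2d]
    return {
        "1d": conf1 / (1.0 + x1),
        "1p": conf1 * x1 / (1.0 + x1),
        "2d": conf2 / (1.0 + x2),
        "2p": conf2 * x2 / (1.0 + x2),
    }


# Asp119 parameters quoted in the text. A pKa of minus infinity makes
# 10 ** (pKa - pH) evaluate to 0.0, so conformation 1 is never protonated.
PKA1 = float("-inf")
PKA2 = 3.3
K12_HIGH = 1.6

# For pKa,1 = -infinity the protonated fraction [2p] simplifies to
# x2 / (1 + K12,h + x2), so it equals 1/2 at pH = pKa,2 - log10(1 + K12,h).
apparent_pka = PKA2 - math.log10(1.0 + K12_HIGH)
fractions_at_midpoint = species_fractions(apparent_pka, K12_HIGH, PKA1, PKA2)
```

Under these assumptions the midpoint of the overall titration curve lies near pH 2.89 for K12,h = 1.6 and near pH 2.85 for K12,h = 1.8, below the pKa,2 of 3.3, because the permanently deprotonated conformation 1 adds to the deprotonated population at every pH.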
[Figure 5-17: panel A, fractions of 1d, 2d and 2p vs. solution pH for K12,h = 1.8 and K12,h = 1.6; panel B, analytical curves (K12,h = 1.8 and K12,h = 1.6) and actual simulations vs. solution pH]
Figure 5-17. A) Fraction of each species as a function of pH (titration curves) obtained from the equations based on the conformation-protonation equilibrium. The effect of K12,h is tested. B) Comparison of titration curves derived from the actual simulations and from the equilibrium equations.

First, the plots of the 2p fraction versus pH are almost identical for the two K12,h values. This means that although the fractions of 1d and 2d are affected, their sum is not. Second, the titration curves derived from the analytical formulas and from the actual simulations agree with each other very well. The agreement among the titration curves leads to similar pKa values: both analytical titration curves, using different K12,h values, yield pKa values between 2.8 and 2.9 with negligible difference, and the titration curve from the actual simulations gives a pKa value of 3.0. The analysis demonstrates that the equilibrium model can represent the protonation equilibrium in our simulations.

5.5.8 Theoretical NMR Titration Curves

Since the model can be used to simplify the conformation-protonation equilibrium in our constant-pH REMD simulations, it is interesting to know whether it has practical value. Reproducing experimental titration curves offers a good test. Quantum mechanical calculations of the NMR chemical shift (δ) are therefore performed, and their results are presented and discussed in this part. As shown earlier, the dynamics of Asp119 generates two conformations, indicating whether the Asp119-Arg125 electrostatic interaction is "on" or "off". Our NMR calculations are based on the representative structure of each conformation, in the proper protonation state. Due to the size of the HEWL molecule, full quantum mechanical calculations are too expensive.
So our first trial uses the Asp119 dipeptide. The chemical shifts of 1d, 2p and 2d are obtained, and the fractions of each species at different pH values can be calculated using eq. 7, 8 and 10. At each pH value, the theoretical chemical shift used to build the titration curve is calculated as $\delta = \delta_{1d}[1d] + \delta_{2d}[2d] + \delta_{2p}[2p]$. The chemical shifts of 1d, 2d and 2p are 2.17, 2.48 and 3.03 ppm, respectively, and the theoretical NMR titration curve is plotted in Figure 5-18. Comparing the theoretical titration curve with the experimental one shows that the trend is correctly reproduced. At low pH, the theoretical and experimental chemical shifts agree well: 3.03 ppm versus 3.13 ppm. However, the difference between the calculated and experimental high-pH chemical shifts is greater than 0.6 ppm. As a result, our calculated (δ at low pH minus δ at high pH) is 0.75 ppm, while the experimental difference is only 0.21 ppm.

[Figure 5-18: chemical shift vs. solution pH; series Full QM + Asp119 dipeptide and QM/MM + entire HEWL]
Figure 5-18. Theoretical NMR chemical shifts as a function of pH. The plot tests whether the conformation-protonation equilibrium model can reproduce the experimental titration curve based on NMR chemical shift measurements.

The problem at high pH could be that a dipeptide cannot accurately represent Asp119 and its environment, especially since we know that there is a strong Asp119-Arg125 Coulomb interaction. So a set of QM/MM calculations was conducted using the entire HEWL molecule. The new chemical shifts are 2.58, 2.69 and 3.25 ppm for 1d, 2d and 2p. Comparing the chemical shifts based on the dipeptide and on the entire molecule, the differences for 2p and 2d are 0.22 ppm and 0.21 ppm. More importantly, both 2p chemical shifts are similar to the experimental low-pH value (each differs from it by about 0.1 ppm). The differences are small for 2p and 2d because there are no significant interactions for Asp119 in conformation 2.
Unlike 2p or 2d, the chemical shift of 1d is improved by 0.41 ppm, showing that using the whole HEWL molecule changes the 1d chemical shift substantially. After applying the QM/MM method to the entire HEWL, the calculated (δ at low pH minus δ at high pH) becomes 0.63 ppm. The theoretical titration curve obtained with the QM/MM technique is also displayed in Figure 5-18. Whether a dipeptide or the entire HEWL is used in the NMR calculations, the pKa values are around 2.9, as expected. The NMR titration curves yield the same pKa value as the protonation (deprotonation) fraction versus pH does. The NMR titration curve calculations validate the conformation-protonation equilibrium model and confirm its applicability. This model can be used to simplify many analyses that involve further calculations.

5.6 Conclusions

In this chapter, constant-pH REMD simulations are performed to study the pKa values of hen egg white lysozyme. Three sets of constant-pH REMD simulations have been performed: one set is conducted without a restraining potential, while a harmonic potential is applied to the Cα atoms in the other two sets. The force constants of the two harmonic potentials are 1 and 0.1 kcal/mol·Å², respectively, so that the effect of restraint strength on pKa prediction accuracy can be studied.

In our constant-pH REMD simulations, the unrestrained runs are found to be structurally unstable. The Cα atom RMSD relative to the crystal structure can be as high as 18 Å. Owing to the restraining potential, HEWL in a restrained simulation is stable and similar to the crystal structure, according to the Cα atom RMSD values. In the restrained simulations with a force constant of 1 kcal/mol·Å², accurate pKa predictions are achieved. The overall RMS errors between predicted and experimental pKa values are 0.87 and 0.84, depending on the pKa calculation method. Unfortunately, those two RMS errors are not better than the constant-pH MD results obtained by Mongan et al.
The advantage of incorporating the REMD method is not observed. However, a plot of the RMS error as a function of pH yields the smallest RMS error at pH 4.5, the pH at which the crystal structure was resolved. Supported by the work of Mongan et al., we propose that the further the pH is from the crystal pH value, the stronger the biasing effect of the restraining potential. The bias in conformational sampling in turn affects the pKa predictions. As expected, reducing the strength of the harmonic potential results in improved pKa predictions. Likewise, the smallest pKa RMS error of 0.62 is obtained at pH 4.5 in the weakly restrained constant-pH REMD simulations. An RMS error of 0.62 is among the best pKa predictions generated from constant-pH simulations.

The pKa predictions of catalytic ionizable residues are of particular interest in the case of HEWL. The constant-pH REMD simulations with the stronger restraining potential failed to identify the proton donor under the criteria proposed by Nielsen and McCammon in 2003. The weakly restrained constant-pH REMD simulations are able to predict the proton donor and the nucleophile, although the errors of the predicted pKa values of Glu35 and Asp52 are among the largest in our simulations. Hydrogen bonding is found to be responsible for the large error of Asp52. The hydrogen bonding of Asp52 with Asn44 and Asn59 overstabilizes the deprotonated form of Asp52, making its predicted pKa value too low. For Glu35, conformational sampling also plays a role in the underestimation of its pKa value. However, other factors, such as the use of implicit solvent, may affect the pKa prediction of Glu35 too.

In this work, we also focused on the conformation and protonation equilibria in constant-pH REMD simulations. Correlations between protonation and the side-chain dihedral angles χ1 and χ2 are studied. Another representation of conformations, namely whether an important electrostatic interaction is formed or not, is also adopted.
In both cases, the coupling between conformation and protonation is observed. The effect of conformation-protonation coupling is partially reflected by the comparison between the constant-pH and single-structure FDPB algorithms: constant-pH REMD yields better pKa values because more conformational space is visited.

The conformation-protonation equilibrium is studied further. Equilibrium constants between conformations are derived in order to show how pH affects the conformational equilibrium. The conformational equilibrium constant is shown to be pH dependent, and it is a sigmoid function of pH. The shape of the sigmoid function is influenced by the pKa values of each conformation. Titration curves, which are the means of obtaining pKa values, are also derived from the conformation-protonation equilibrium. All analytical results are in good agreement with our simulations.

In addition, we apply this conformation-protonation equilibrium model to reproduce an experimental NMR titration curve by carrying out full QM and QM/MM calculations. First, we showed the importance of the protein environment for chemical shift calculations. A calculation using an isolated ionizable side chain can only qualitatively reproduce the experimental NMR titration curve. The error mainly comes from the high-pH end, where the isolated side chain assumption fails. After adding the protein environment, our theoretical titration curve is greatly improved, and good agreement with the experimental result is obtained. Our conformation-protonation equilibrium model can be used to represent our simulations and will simplify further calculations.

LIST OF REFERENCES

(1) Bettelheim, F. A. Introduction to general, organic, and biochemistry; 8th ed.; Thomson Brooks/Cole: Belmont, CA, 2007.
(2) Dey, A.; Verma, C. S.; Lane, D. P. Br. J. Cancer 2008, 98, 4-8.
(3) Vogelstein, B.; Lane, D.; Levine, A. J. Nature 2000, 408, 307-310.
(4) Matthew, J. B.; Gurd, F. R. N.; Garcia-Moreno, E. B.; Flanagan, M. A.; March, K. L.; Shire, S. J. CRC Crit. Rev. Biochem.
1985, 18, 91-197.
(5) Bierzynski, A.; Kim, P. S.; Baldwin, R. L. Proc. Natl. Acad. Sci. U. S. A. 1982, 79, 2470-2474.
(6) Ferguson, N.; Schartau, P. J.; Sharpe, T. D.; Sato, S.; Fersht, A. R. J. Mol. Biol. 2004, 344, 295-301.
(7) Shoemaker, K. R.; Kim, P. S.; Brems, D. N.; Marqusee, S.; York, E. J.; Chaiken, I. M.; Stewart, J. M.; Baldwin, R. L. Proc. Natl. Acad. Sci. U. S. A. 1985, 82, 2349-2353.
(8) Garcia-Mira, M. M.; Sadqi, M.; Fischer, N.; Sanchez-Ruiz, J. M.; Munoz, V. Science 2002, 298, 2191-2195.
(9) Hunenberger, P. H.; Helms, V.; Narayana, N.; Taylor, S. S.; McCammon, J. A. Biochemistry 1999, 38, 2358-2366.
(10) Demchuk, E.; Genick, U. K.; Woo, T. T.; Getzoff, E. D.; Bashford, D. Biochemistry 2000, 39, 1100-1113.
(11) Dillet, V.; Dyson, H. J.; Bashford, D. Biochemistry 1998, 37, 10298-10306.
(12) Harris, T. K.; Turner, G. J. IUBMB Life 2002, 53, 85-98.
(13) Laidler, K. J. Chemical kinetics; 3rd ed.; Harper & Row: New York, 1987.
(14) Fersht, A. Structure and mechanism in protein science: a guide to enzyme catalysis and protein folding; W.H. Freeman: New York, 1999.
(15) Simonson, T.; Carlsson, J.; Case, D. A. J. Am. Chem. Soc. 2004, 126, 4167-4180.
(16) Lee, A. C.; Crippen, G. M. J. Chem. Inf. Model. 2009, 49, 2013-2033.
(17) Langsetmo, K.; Fuchs, J. A.; Woodward, C. Biochemistry 1991, 30, 7603-7609.
(18) Garcia-Moreno, B.; Dwyer, J. J.; Gittis, A. G.; Lattman, E. E.; Spencer, D. S.; Stites, W. E. Biophys. Chem. 1997, 64, 211-224.
(19) Garcia-Moreno, B.; Fitch, C.; Karp, D.; Gittis, A.; Lattman, E. Biophys. J. 2002, 82, 300a.
(20) Tanford, C. Adv. Protein Chem. 1962, 17, 69-165.
(21) Dwyer, J. J.; Gittis, A. G.; Karp, D. A.; Lattman, E. E.; Spencer, D. S.; Stites, W. E.; Garcia-Moreno, B. Biophys. J. 2000, 79, 1610-1620.
(22) Harms, M. J.; Castaneda, C. A.; Schlessman, J. L.; Sue, G. R.; Isom, D. G.; Cannon, B. R.; Garcia-Moreno, B. J. Mol. Biol. 2009, 389, 34-47.
(23) Mehler, E. L.; Fuxreiter, M.; Simon, I.; Garcia-Moreno, E. B. Proteins: Struct., Funct., Genet.
2002, 48, 283-292.
(24) Anderson, D. E.; Becktel, W. J.; Dahlquist, F. W. Biochemistry 1990, 29, 2403-2408.
(25) Dyson, H. J.; Jeng, M. F.; Tennant, L. L.; Slaby, I.; Lindell, M.; Cui, D. S.; Kuprin, S.; Holmgren, A. Biochemistry 1997, 36, 2622-2636.
(26) Bashford, D.; Case, D. A.; Dalvit, C.; Tennant, L.; Wright, P. E. Biochemistry 1993, 32, 8045-8056.
(27) Wang, Y. X.; Freedberg, D. I.; Yamazaki, T.; Wingfield, P. T.; Stahl, S. J.; Kaufman, J. D.; Kiso, Y.; Torchia, D. A. Biochemistry 1996, 35, 9945-9950.
(28) Dyson, H. J.; Tennant, L. L.; Holmgren, A. Biochemistry 1991, 30, 4262-4268.
(29) Jeng, M. F.; Dyson, H. J. Biochemistry 1996, 35, 1-6.
(30) Wilson, N. A.; Barbar, E.; Fuchs, J. A.; Woodward, C. Biochemistry 1995, 34, 8931-8939.
(31) Callis, P. R. Methods Enzymol. 1997, 278, 113-150.
(32) Callis, P. R.; Burgess, B. K. J. Phys. Chem. B 1997, 101, 9429-9432.
(33) Vivian, J. T.; Callis, P. R. Biophys. J. 2001, 80, 2093-2109.
(34) Inoue, M.; Yamada, H.; Yasukochi, T.; Kuroki, R.; Miki, T.; Horiuchi, T.; Imoto, T. Biochemistry 1992, 31, 5545-5553.
(35) Kajander, T.; Kahn, P. C.; Passila, S. H.; Cohen, D. C.; Lehtio, L.; Adolfsen, W.; Warwicker, J.; Schell, U.; Goldman, A. Structure 2000, 8, 1203-1214.
(36) Bartlett, G. J.; Porter, C. T.; Borkakoti, N.; Thornton, J. M. J. Mol. Biol. 2002, 324, 105-121.
(37) Jiang, Y. X.; Ruta, V.; Chen, J. Y.; Lee, A.; MacKinnon, R. Nature 2003, 423, 42-48.
(38) Luecke, H.; Richter, H. T.; Lanyi, J. K. Science 1998, 280, 1934-1937.
(39) Bashford, D.; Case, D. A. Annu. Rev. Phys. Chem. 2000, 51, 129-152.
(40) Still, W. C.; Tempczyk, A.; Hawley, R. C.; Hendrickson, T. J. Am. Chem. Soc. 1990, 112, 6127-6129.
(41) Cramer, C. J. Essentials of computational chemistry: theories and models; J. Wiley: West Sussex, England; New York, 2002.
(42) Raha, K.; Merz, K. M. In Annual reports in computational chemistry; Spellmeyer, D. C., Ed.; Elsevier: Amsterdam; Boston, 2005; Vol. 1, pp 113-130.
(43) Dixon, S. L.; Merz, K. M. J. Chem. Phys.
1996, 104, 6643-6649.
(44) Vreven, T.; Morokuma, K. In Annual Reports in Computational Chemistry; Spellmeyer, D., Ed.; Elsevier: Amsterdam; Boston, 2006; Vol. 2, pp 35-51.
(45) Field, M. J.; Bash, P. A.; Karplus, M. J. Comput. Chem. 1990, 11, 700-733.
(46) Singh, U. C.; Kollman, P. A. J. Comput. Chem. 1986, 7, 718-730.
(47) Warshel, A.; Levitt, M. J. Mol. Biol. 1976, 103, 227-249.
(48) Kamerlin, S. C. L.; Haranczyk, M.; Warshel, A. J. Phys. Chem. B 2009, 113, 1253-1272.
(49) Monard, G.; Merz, K. M. Acc. Chem. Res. 1999, 32, 904-911.
(50) Metropolis, N.; Rosenbluth, A. W.; Rosenbluth, M. N.; Teller, A. H.; Teller, E. J. Chem. Phys. 1953, 21, 1087-1092.
(51) Wolynes, P. G.; Onuchic, J. N.; Thirumalai, D. Science 1995, 267, 1619-1620.
(52) Itoh, S. G.; Okumura, H.; Okamoto, Y. Mol. Simul. 2007, 33, 47-56.
(53) Mitsutake, A.; Sugita, Y.; Okamoto, Y. Biopolymers 2001, 60, 96-123.
(54) Berg, B. A.; Neuhaus, T. Phys. Lett. B 1991, 267, 249-253.
(55) Berg, B. A.; Neuhaus, T. Phys. Rev. Lett. 1992, 68, 9-12.
(56) Lyubartsev, A. P.; Martsinovski, A. A.; Shevkunov, S. V.; Vorontsov-Velyaminov, P. N. J. Chem. Phys. 1992, 96, 1776-1783.
(57) Marinari, E.; Parisi, G. Europhys. Lett. 1992, 19, 451-458.
(58) Hansmann, U. H. E. Chem. Phys. Lett. 1997, 281, 140-150.
(59) Swendsen, R. H.; Wang, J. S. Phys. Rev. Lett. 1986, 57, 2607-2609.
(60) Earl, D. J.; Deem, M. W. Phys. Chem. Chem. Phys. 2005, 7, 3910-3916.
(61) Fukunishi, H.; Watanabe, O.; Takada, S. J. Chem. Phys. 2002, 116, 9058-9067.
(62) Sugita, Y.; Okamoto, Y. Chem. Phys. Lett. 1999, 314, 141-151.
(63) Tanford, C.; Kirkwood, J. G. J. Am. Chem. Soc. 1957, 79, 5333-5339.
(64) Tanford, C.; Roxby, R. Biochemistry 1972, 11, 2192-2198.
(65) Bashford, D.; Karplus, M. Biochemistry 1990, 29, 10219-10225.
(66) Gilson, M. K. Proteins: Struct., Funct., Genet. 1993, 15, 266-282.
(67) Antosiewicz, J.; McCammon, J. A.; Gilson, M. K. J. Mol. Biol. 1994, 238, 415-436.
(68) Antosiewicz, J.; McCammon, J. A.; Gilson, M. K. Biochemistry 1996, 35, 7819-7833.
(69) Bashford, D.; Karplus, M. J. Phys. Chem. 1991, 95, 9556-9561.
(70) Yang, A. S.; Gunner, M. R.; Sampogna, R.; Sharp, K.; Honig, B. Proteins: Struct., Funct., Genet. 1993, 15, 252-265.
(71) Yang, A. S.; Honig, B. J. Mol. Biol. 1993, 231, 459-474.
(72) Madura, J. D.; Briggs, J. M.; Wade, R. C.; Davis, M. E.; Luty, B. A.; Ilin, A.; Antosiewicz, J.; Gilson, M. K.; Bagheri, B.; Scott, L. R.; McCammon, J. A. Comput. Phys. Commun. 1995, 91, 57-95.
(73) Nicholls, A.; Honig, B. J. Comput. Chem. 1991, 12, 435-445.
(74) Beroza, P.; Fredkin, D. R.; Okamura, M. Y.; Feher, G. Proc. Natl. Acad. Sci. U. S. A. 1991, 88, 5804-5808.
(75) Bone, S.; Pethig, R. J. Mol. Biol. 1985, 181, 323-326.
(76) Harvey, S. C.; Hoekstra, P. J. Phys. Chem. 1972, 76, 2987.
(77) Garcia-Moreno, B.; Fitch, C. A. Methods Enzymol. 2004, 380, 20-51.
(78) Simonson, T.; Brooks, C. L. J. Am. Chem. Soc. 1996, 118, 8452-8458.
(79) Mehler, E. L.; Eichele, G. Biochemistry 1984, 23, 3887-3891.
(80) Mehler, E. L.; Guarnieri, F. Biophys. J. 1999, 77, 3-22.
(81) Alexov, E. G.; Gunner, M. R. Biophys. J. 1997, 72, 2075-2093.
(82) Barth, P.; Alber, T.; Harbury, P. B. Proc. Natl. Acad. Sci. U. S. A. 2007, 104, 4898-4903.
(83) Georgescu, R. E.; Alexov, E. G.; Gunner, M. R. Biophys. J. 2002, 83, 1731-1748.
(84) Gunner, M. R.; Alexov, E.; Torres, E.; Lipovaca, S. J. Biol. Inorg. Chem. 1997, 2, 126-134.
(85) Livesay, D. R.; Jacobs, D. J.; Kanjanapangka, J.; Chea, E.; Cortez, H.; Garcia, J.; Kidd, P.; Marquez, M. P.; Pande, S.; Yang, D. J. Chem. Theory Comput. 2006, 2, 927-938.
(86) You, T. J.; Bashford, D. Biophys. J. 1995, 69, 1721-1733.
(87) Kollman, P. Chem. Rev. 1993, 93, 2395-2417.
(88) Straatsma, T. P.; McCammon, J. A. Annu. Rev. Phys. Chem. 1992, 43, 407-435.
(89) Warshel, A.; Sussman, F.; King, G. Biochemistry 1986, 25, 8368-8372.
(90) Russell, S. T.; Warshel, A. J. Mol. Biol. 1985, 185, 389-404.
(91) Jorgensen, W. L.; Briggs, J. M. J. Am. Chem. Soc. 1989, 111, 4190-4197.
(92) Merz, K. M. J. Am. Chem. Soc. 1991, 113, 3572-3575.
(93) Hu, H.; Yang, W. T. Annu. Rev. Phys. Chem. 2008, 59, 573-601.
(94) Li, G. H.; Zhang, X. D.; Cui, Q. J. Phys. Chem. B 2003, 107, 8643-8653.
(95) Riccardi, D.; Schaefer, P.; Cui, Q. J. Phys. Chem. B 2005, 109, 17715-17733.
(96) Bas, D. C.; Rogers, D. M.; Jensen, J. H. Proteins: Struct., Funct., Bioinf. 2008, 73, 765-783.
(97) Jensen, J. H.; Li, H.; Robertson, A. D.; Molina, P. A. J. Phys. Chem. A 2005, 109, 6634-6643.
(98) Li, H.; Hains, A. W.; Everts, J. E.; Robertson, A. D.; Jensen, J. H. J. Phys. Chem. B 2002, 106, 3486-3494.
(99) Li, H.; Robertson, A. D.; Jensen, J. H. Proteins: Struct., Funct., Bioinf. 2004, 55, 689-704.
(100) Li, H.; Robertson, A. D.; Jensen, J. H. Proteins: Struct., Funct., Bioinf. 2005, 61, 704-721.
(101) Minikis, R. M.; Kairys, V.; Jensen, J. H. J. Phys. Chem. A 2001, 105, 3829-3837.
(102) Day, P. N.; Jensen, J. H.; Gordon, M. S.; Webb, S. P.; Stevens, W. J.; Krauss, M.; Garmer, D.; Basch, H.; Cohen, D. J. Chem. Phys. 1996, 105, 1968-1986.
(103) Gordon, M. S.; Freitag, M. A.; Bandyopadhyay, P.; Jensen, J. H.; Kairys, V.; Stevens, W. J. J. Phys. Chem. A 2001, 105, 293-307.
(104) Mongan, J.; Case, D. A. Curr. Opin. Struct. Biol. 2005, 15, 157-163.
(105) Baptista, A. M. J. Chem. Phys. 2002, 116, 7766-7768.
(106) Baptista, A. M.; Martel, P. J.; Petersen, S. B. Proteins: Struct., Funct., Genet. 1997, 27, 523-544.
(107) Borjesson, U.; Hunenberger, P. H. J. Chem. Phys. 2001, 114, 9706-9719.
(108) Borjesson, U.; Hunenberger, P. H. J. Phys. Chem. B 2004, 108, 13551-13559.
(109) Khandogin, J.; Brooks, C. L. Biophys. J. 2005, 89, 141-157.
(110) Khandogin, J.; Brooks, C. L. Biochemistry 2006, 45, 9363-9373.
(111) Khandogin, J.; Brooks, C. L. Proc. Natl. Acad. Sci. U. S. A. 2007, 104, 16880-16885.
(112) Khandogin, J.; Chen, J. H.; Brooks, C. L. Proc. Natl. Acad. Sci. U. S. A. 2006, 103, 18546-18550.
(113) Khandogin, J.; Raleigh, D. P.; Brooks, C. L. J. Am. Chem. Soc. 2007, 129, 3056-3057.
(114) Lee, M. S.; Salsbury, F. R.; Brooks, C. L. Proteins: Struct., Funct., Bioinf. 2004, 56, 738-752.
(115) Mertz, J. E.; Pettitt, B. M. Int. J. Supercomp. Appl. 1994, 8, 47-53.
(116) Kong, X. J.; Brooks, C. L. J. Chem. Phys. 1996, 105, 2414-2423.
(117) Chen, J. H.; Brooks, C. L.; Khandogin, J. Curr. Opin. Struct. Biol. 2008, 18, 140-148.
(118) Baptista, A. M.; Teixeira, V. H.; Soares, C. M. J. Chem. Phys. 2002, 117, 4184-4200.
(119) Dlugosz, M.; Antosiewicz, J. M. Chem. Phys. 2004, 302, 161-170.
(120) Dlugosz, M.; Antosiewicz, J. M. J. Phys. Chem. B 2005, 109, 13777-13784.
(121) Dlugosz, M.; Antosiewicz, J. M. J. Phys.: Condens. Matter 2005, 17, S1607-S1616.
(122) Dlugosz, M.; Antosiewicz, J. M.; Robertson, A. D. Phys. Rev. E 2004, 69, 021915.
(123) Machuqueiro, M.; Baptista, A. M. J. Phys. Chem. B 2006, 110, 2927-2933.
(124) Machuqueiro, M.; Baptista, A. M. Biophys. J. 2007, 92, 1836-1845.
(125) Machuqueiro, M.; Baptista, A. M. Proteins: Struct., Funct., Bioinf. 2008, 72, 289-298.
(126) Machuqueiro, M.; Baptista, A. M. J. Am. Chem. Soc. 2009, 131, 12586-12594.
(127) Mongan, J.; Case, D. A.; McCammon, J. A. J. Comput. Chem. 2004, 25, 2038-2048.
(128) Walczak, A. M.; Antosiewicz, J. M. Phys. Rev. E 2002, 66, 051911.
(129) Williams, S. L.; de Oliveira, C. A. F.; McCammon, J. A. J. Chem. Theory Comput. 2010, 6, 560-568.
(130) Burgi, R.; Kollman, P. A.; van Gunsteren, W. F. Proteins: Struct., Funct., Genet. 2002, 47, 469-480.
(131) Meng, Y. L.; Roitberg, A. E. J. Chem. Theory Comput. 2010, 6, 1401-1412.
(132) Schaefer, M.; Karplus, M. J. Phys. Chem. 1996, 100, 1578-1599.
(133) Hamelberg, D.; Mongan, J.; McCammon, J. A. J. Chem. Phys. 2004, 120, 11919-11929.
(134) Hamelberg, D.; Mongan, J.; McCammon, J. A. Protein Sci. 2004, 13, 76-76.
(135) Ponder, J. W.; Case, D. A. Adv. Protein Chem. 2003, 66, 27-85.
(136) Allinger, N. L.; Yuh, Y. H.; Lii, J. H. J. Am. Chem. Soc. 1989, 111, 8551-8566.
(137) Leach, A. R. Molecular modelling: principles and applications, 2nd ed.; Prentice Hall: Harlow, England; New York, 2001.
(138) MacKerell, A. D. In Annual reports in computational chemistry; Spellmeyer, D. C., Ed.; Elsevier: Amsterdam; Boston, 2005; Vol. 1, pp 91-102.
(139) Hornak, V.; Abel, R.; Okur, A.; Strockbine, B.; Roitberg, A.; Simmerling, C. Proteins: Struct., Funct., Bioinf. 2006, 65, 712-725.
(140) MacKerell, A. D.; Bashford, D.; Bellott, M.; Dunbrack, R. L.; Evanseck, J. D.; Field, M. J.; Fischer, S.; Gao, J.; Guo, H.; Ha, S.; Joseph-McCarthy, D.; Kuchnir, L.; Kuczera, K.; Lau, F. T. K.; Mattos, C.; Michnick, S.; Ngo, T.; Nguyen, D. T.; Prodhom, B.; Reiher, W. E.; Roux, B.; Schlenkrich, M.; Smith, J. C.; Stote, R.; Straub, J.; Watanabe, M.; Wiorkiewicz-Kuczera, J.; Yin, D.; Karplus, M. J. Phys. Chem. B 1998, 102, 3586-3616.
(141) Daura, X.; Mark, A. E.; van Gunsteren, W. F. J. Comput. Chem. 1998, 19, 535-547.
(142) Jorgensen, W. L.; Tirado-Rives, J. J. Am. Chem. Soc. 1988, 110, 1657-1666.
(143) Cornell, W. D.; Cieplak, P.; Bayly, C. I.; Gould, I. R.; Merz, K. M.; Ferguson, D. M.; Spellmeyer, D. C.; Fox, T.; Caldwell, J. W.; Kollman, P. A. J. Am. Chem. Soc. 1995, 117, 5179-5197.
(144) Verlet, L. Phys. Rev. 1967, 159, 98.
(145) Ryckaert, J. P.; Ciccotti, G.; Berendsen, H. J. C. J. Comput. Phys. 1977, 23, 327-341.
(146) Berendsen, H. J. C.; Postma, J. P. M.; van Gunsteren, W. F.; DiNola, A.; Haak, J. R. J. Chem. Phys. 1984, 81, 3684-3690.
(147) McQuarrie, D. A. Statistical thermodynamics; University Science Books: Mill Valley, Calif., 1973.
(148) Nose, S. J. Chem. Phys. 1984, 81, 511-519.
(149) Berendsen, H. J. C.; Grigera, J. R.; Straatsma, T. P. J. Phys. Chem. 1987, 91, 6269-6271.
(150) Jorgensen, W. L.; Chandrasekhar, J.; Madura, J. D.; Impey, R. W.; Klein, M. L. J. Chem. Phys. 1983, 79, 926-935.
(151) Mahoney, M. W.; Jorgensen, W. L. J. Chem. Phys. 2000, 112, 8910-8922.
(152) Allen, M. P.; Tildesley, D. J. Computer simulation of liquids; Clarendon Press; Oxford University Press: Oxford [England]; New York, 1987.
(153) Ewald, P. P. Annalen der Physik 1921, 64, 253-287.
(154) Darden, T.; York, D.; Pedersen, L. J. Chem. Phys. 1993, 98, 10089-10092.
(155) Hawkins, G. D.; Cramer, C. J.; Truhlar, D. G. Chem. Phys. Lett. 1995, 246, 122-129.
(156) Kirkwood, J. G. J. Chem. Phys. 1935, 3, 300-313.
(157) Straatsma, T. P.; McCammon, J. A. J. Chem. Phys. 1991, 95, 1175-1188.
(158) Zwanzig, R. W. J. Chem. Phys. 1954, 22, 1420-1426.
(159) Bennett, C. H. J. Comput. Phys. 1976, 22, 245-268.
(160) Shirts, M. R.; Chodera, J. D. J. Chem. Phys. 2008, 129, 124105.
(161) Jorgensen, W. L.; Ravimohan, C. J. Chem. Phys. 1985, 83, 3050-3054.
(162) Hansmann, U. H. E.; Okamoto, Y. Nucl. Phys. B 1995, 914-916.
(163) Wang, F. G.; Landau, D. P. Phys. Rev. E 2001, 64, 056101.
(164) Wang, F. G.; Landau, D. P. Phys. Rev. Lett. 2001, 86, 2050-2053.
(165) Falcioni, M.; Deem, M. W. J. Chem. Phys. 1999, 110, 1754-1766.
(166) Kofke, D. A. J. Chem. Phys. 2002, 117, 6911-6914.
(167) Liu, P.; Kim, B.; Friesner, R. A.; Berne, B. J. Proc. Natl. Acad. Sci. U. S. A. 2005, 102, 13749-13754.
(168) Li, H. Z.; Li, G. H.; Berg, B. A.; Yang, W. J. Chem. Phys. 2006, 125, 144902.
(169) Okur, A.; Roe, D. R.; Cui, G. L.; Hornak, V.; Simmerling, C. J. Chem. Theory Comput. 2007, 3, 557-568.
(170) Roitberg, A. E.; Okur, A.; Simmerling, C. J. Phys. Chem. B 2007, 111, 2415-2418.
(171) Rathore, N.; Chopra, M.; de Pablo, J. J. J. Chem. Phys. 2005, 122, 024111.
(172) Sanbonmatsu, K. Y.; Garcia, A. E. Proteins: Struct., Funct., Genet. 2002, 46, 225-234.
(173) Kone, A.; Kofke, D. A. J. Chem. Phys. 2005, 122, 206101.
(174) Trebst, S.; Troyer, M.; Hansmann, U. H. E. J. Chem. Phys. 2006, 124, 174903.
(175) Nadler, W.; Hansmann, U. H. E. Phys. Rev. E 2007, 76, 065701.
(176) Nadler, W.; Hansmann, U. H. E. Phys. Rev. E 2007, 75, 026109.
(177) Nadler, W.; Hansmann, U. H. E. J. Phys. Chem. B 2008, 112, 10386-10387.
(178) Opps, S. B.; Schofield, J. Phys. Rev. E 2001, 6305, 056701.
(179) Zhang, W.; Wu, C.; Duan, Y. J. Chem. Phys. 2005, 123, 154105.
(180) Sindhikara, D.; Meng, Y. L.; Roitberg, A. E. J. Chem. Phys. 2008, 128, 024103.
(181) Abraham, M. J.; Gready, J. E. J. Chem. Theory Comput. 2008, 4, 1119-1128.
(182) Zhang, C.; Ma, J. P. J. Chem. Phys. 2008, 129, 134112.
(183) Rosta, E.; Buchete, N. V.; Hummer, G. J. Chem. Theory Comput. 2009, 5, 1393-1399.
(184) Zhou, R. H.; Berne, B. J.; Germain, R. Proc. Natl. Acad. Sci. U. S. A. 2001, 98, 14931-14936.
(185) Lyman, E.; Ytreberg, F. M.; Zuckerman, D. M. Phys. Rev. Lett. 2006, 96, 028105.
(186) Liu, P.; Shi, Q.; Lyman, E.; Voth, G. A. J. Chem. Phys. 2008, 129, 114103.
(187) Liu, P.; Voth, G. A. J. Chem. Phys. 2007, 126, 045106.
(188) Okur, A.; Wickstrom, L.; Layten, M.; Geney, R.; Song, K.; Hornak, V.; Simmerling, C. J. Chem. Theory Comput. 2006, 2, 420-433.
(189) Ballard, A. J.; Jarzynski, C. Proc. Natl. Acad. Sci. U. S. A. 2009, 106, 12224-12229.
(190) Kamberaj, H.; van der Vaart, A. J. Chem. Phys. 2009, 130, 074906.
(191) Nguyen, P. H. J. Chem. Phys. 2010, 132, 144109.
(192) Sugita, Y.; Okamoto, Y. Chem. Phys. Lett. 2000, 329, 261-270.
(193) Mitsutake, A.; Okamoto, Y. Chem. Phys. Lett. 2000, 332, 131-138.
(194) Mitsutake, A.; Okamoto, Y. J. Chem. Phys. 2004, 121, 2491-2504.
(195) Andrec, M.; Felts, A. K.; Gallicchio, E.; Levy, R. M. Proc. Natl. Acad. Sci. U. S. A. 2005, 102, 6801-6806.
(196) van der Spoel, D.; Seibert, M. M. Phys. Rev. Lett. 2006, 96, 238102.
(197) Yang, S. C.; Onuchic, J. N.; Garcia, A. E.; Levine, H. J. Mol. Biol. 2007, 372, 756-763.
(198) Buchete, N. V.; Hummer, G. Phys. Rev. E 2008, 77, 030902.
(199) Case, D. A.; Darden, T. A.; Cheatham, T. E., III; Simmerling, C. L.; Wang, J.; Duke, R. E.; Luo, R.; Crowley, M.; Walker, R. C.; Zhang, W.; Merz, K. M.; Wang, B.; Hayik, S.; Roitberg, A.; Seabra, G.; Kolossvary, I.; Wong, K. F.; Paesani, F.; Vanicek, J.; Wu, X.; Brozell, S. R.; Steinbrecher, T.; Gohlke, H.; Yang, L.; Tan, C.; Mongan, J.; Hornak, V.; Cui, G.; Mathews, D. H.; Seetin, M. G.; Sagui, C.; Babin, V.; Kollman, P. A.; University of California, San Francisco: San Francisco, 2008.
(200) Onufriev, A.; Bashford, D.; Case, D. A. J. Phys. Chem. B 2000, 104, 3712-3720.
(201) Elber, R.; Roitberg, A.; Simmerling, C.; Goldstein, R.; Li, H. Y.; Verkhivker, G.; Keasar, C.; Zhang, J.; Ulitsky, A. Comput. Phys. Commun. 1995, 91, 159-189.
(202) Dill, K. A.; Ozkan, S. B.; Shell, M. S.; Weikl, T. R. Annu. Rev. Biophys. 2008, 37, 289-316.
(203) Dobson, C. M. Nature 2003, 426, 884-890.
(204) Anfinsen, C. B.; Haber, E.; Sela, M.; White, F. H. Proc. Natl. Acad. Sci. U. S. A. 1961, 47, 1309-1314.
(205) Mayor, U.; Johnson, C. M.; Daggett, V.; Fersht, A. R. Proc. Natl. Acad. Sci. U. S. A. 2000, 97, 13518-13522.
(206) Snow, C. D.; Nguyen, N.; Pande, V. S.; Gruebele, M. Nature 2002, 420, 102-106.
(207) Brooks, C. L. Acc. Chem. Res. 2002, 35, 447-454.
(208) Levinthal, C. J. Chim. Phys. Phys.-Chim. Biol. 1968, 65, 44-45.
(209) Gruebele, M. Annu. Rev. Phys. Chem. 1999, 50, 485-516.
(210) Kubelka, J.; Hofrichter, J.; Eaton, W. A. Curr. Opin. Struct. Biol. 2004, 14, 76-88.
(211) Snow, C. D.; Sorin, E. J.; Rhee, Y. M.; Pande, V. S. Annu. Rev. Biophys. Biomol. Struct. 2005, 34, 43-69.
(212) Snow, C. D.; Qiu, L. L.; Du, D. G.; Gai, F.; Hagen, S. J.; Pande, V. S. Proc. Natl. Acad. Sci. U. S. A. 2004, 101, 4077-4082.
(213) Zagrovic, B.; Sorin, E. J.; Pande, V. J. Mol. Biol. 2001, 313, 151-169.
(214) Jayachandran, G.; Vishal, V.; Pande, V. S. J. Chem. Phys. 2006, 124, 054118.
(215) Singhal, N.; Snow, C. D.; Pande, V. S. J. Chem. Phys. 2004, 121, 415-425.
(216) Swope, W. C.; Pitera, J. W.; Suits, F. J. Phys. Chem. B 2004, 108, 6571-6581.
(217) Swope, W. C.; Pitera, J. W.; Suits, F.; Pitman, M.; Eleftheriou, M.; Fitch, B. G.; Germain, R. S.; Rayshubski, A.; Ward, T. J. C.; Zhestkov, Y.; Zhou, R. J. Phys. Chem. B 2004, 108, 6582-6594.
(218) Daggett, V.; Levitt, M. J. Mol. Biol. 1993, 232, 600-619.
(219) Daggett, V.; Levitt, M. J. Cell. Biochem. 1993, 223-223.
(220) Daggett, V.; Levitt, M. Curr. Opin. Struct. Biol. 1994, 4, 291-295.
(221) Dadlez, M.; Bierzynski, A.; Godzik, A.; Sobocinska, M.; Kupryszewski, G. Biophys. Chem. 1988, 31, 175-181.
(222) Baldwin, R. L. Biophys. Chem. 1995, 55, 127-135.
(223) Brown, J. E.; Klee, W. A. Biochemistry 1971, 10, 470-476.
(224) Fairman, R.; Shoemaker, K. R.; York, E. J.; Stewart, J. M.; Baldwin, R. L. Biophys. Chem. 1990, 37, 107-119.
(225) Osterhout, J. J.; Baldwin, R. L.; York, E. J.; Stewart, J. M.; Dyson, H. J.; Wright, P. E. Biochemistry 1989, 28, 7059-7064.
(226) Shoemaker, K. R.; Fairman, R.; Schultz, D. A.; Robertson, A. D.; York, E. J.; Stewart, J. M.; Baldwin, R. L. Biopolymers 1990, 29, 1-11.
(227) Felts, A. K.; Harano, Y.; Gallicchio, E.; Levy, R. M. Proteins: Struct., Funct., Bioinf. 2004, 56, 310-321.
(228) Hansmann, U. H. E.; Okamoto, Y. J. Phys. Chem. B 1998, 102, 653-656.
(229) Hansmann, U. H. E.; Okamoto, Y. J. Phys. Chem. B 1999, 103, 1595-1604.
(230) La Penna, G.; Mitsutake, A.; Masuya, M.; Okamoto, Y. Chem. Phys. Lett. 2003, 380, 609-619.
(231) Ohkubo, Y. Z.; Brooks, C. L. Proc. Natl. Acad. Sci. U. S. A. 2003, 100, 13916-13921.
(232) Schaefer, M.; Bartels, C.; Karplus, M. J. Mol. Biol. 1998, 284, 835-848.
(233) Sugita, Y.; Okamoto, Y. Biophys. J. 2005, 88, 3180-3190.
(234) Yoda, T.; Sugita, Y.; Okamoto, Y. Chem. Phys. 2004, 307, 269-283.
(235) Yoda, T.; Sugita, Y.; Okamoto, Y. Chem. Phys. Lett. 2004, 386, 460-467.
(236) Kabsch, W.; Sander, C. Biopolymers 1983, 22, 2577-2637.
(237) Johnson, W. C. Annu. Rev. Biophys. Biophys. Chem. 1988, 17, 145-166.
(238) Sreerama, N.; Woody, R. W. Methods Enzymol. 2004, 383, 318-351.
(239) Gratzer, W. B.; Doty, P.; Holzwarth, G. M. Proc. Natl. Acad. Sci. U. S. A. 1961, 47, 1785-1791.
(240) Manning, M. C.; Illangasekare, M.; Woody, R. W. Biophys. Chem. 1988, 31, 77-86.
(241) Bayley, P. M.; Nielsen, E. B.; Schellman, J. A. J. Phys. Chem. 1969, 73, 228-243.
(242) Clark, L. B. J. Am. Chem. Soc. 1995, 117, 7974-7986.
(243) Hirst, J. D. J. Chem. Phys. 1998, 109, 782-788.
(244) Woody, R. W.; Sreerama, N. J. Chem. Phys. 1999, 111, 2844-2845.
(245) Goux, W. J.; Hooker, T. M. J. Am. Chem. Soc. 1980, 102, 7080-7087.
(246) Ridley, J.; Zerner, M. Theor. Chim. Acta 1973, 32, 111-134.
(247) Wlodawer, A.; Svensson, L. A.; Sjolin, L.; Gilliland, G. L. Biochemistry 1988, 27, 2705-2717.
(248) Blake, C. C. F.; Koenig, D. F.; Mair, G. A.; North, A. C. T.; Phillips, D. C.; Sarma, V. R. Nature 1965, 206, 757-761.
(249) Vocadlo, D. J.; Davies, G. J.; Laine, R.; Withers, S. G. Nature 2001, 412, 835-838.
(250) Nielsen, J. E.; McCammon, J. A. Protein Sci. 2003, 12, 313-326.
(251) Bartik, K.; Redfield, C.; Dobson, C. M. Biophys. J. 1994, 66, 1180-1184.
(252) Tironi, I. G.; Sperb, R.; Smith, P. E.; van Gunsteren, W. F. J. Chem. Phys. 1995, 102, 5451-5459.
(253) Case, D. A.; Darden, T. A.; Cheatham, T. E., III; Simmerling, C. L.; Wang, J.; Duke, R. E.; Luo, R.; Merz, K. M.; Pearlman, D. A.; Crowley, M.; Walker, R. C.; Zhang, W.; Wang, B.; Hayik, S.; Roitberg, A.; Seabra, G.; Wong, K. F.; Paesani, F.; Wu, X.; Brozell, S.; Tsui, V.; Gohlke, H.; Yang, L.; Tan, C.; Mongan, J.; Hornak, V.; Cui, G.; Beroza, P.; Mathews, D. H.; Schafmeister, C.; Ross, W. S.; Kollman, P. A.; University of California, San Francisco: San Francisco, 2006.
(254) Frisch, M. J.; Trucks, G. W.; Schlegel, H. B.; Scuseria, G. E.; Robb, M. A.; Cheeseman, J. R.; Montgomery, J. A., Jr.; Vreven, T.; Kudin, K. N.; Burant, J. C.; Millam, J. M.; Iyengar, S. S.; Tomasi, J.; Barone, V.; Mennucci, B.; Cossi, M.; Scalmani, G.; Rega, N.; Petersson, G. A.; Nakatsuji, H.; Hada, M.; Ehara, M.; Toyota, K.; Fukuda, R.; Hasegawa, J.; Ishida, M.; Nakajima, T.; Honda, Y.; Kitao, O.; Nakai, H.; Klene, M.; Li, X.; Knox, J. E.; Hratchian, H. P.; Cross, J. B.; Bakken, V.; Adamo, C.; Jaramillo, J.; Gomperts, R.; Stratmann, R. E.; Yazyev, O.; Austin, A. J.; Cammi, R.; Pomelli, C.; Ochterski, J. W.; Ayala, P. Y.; Morokuma, K.; Voth, G. A.; Salvador, P.; Dannenberg, J. J.; Zakrzewski, V. G.; Dapprich, S.; Daniels, A. D.; Strain, M. C.; Farkas, O.; Malick, D. K.; Rabuck, A. D.; Raghavachari, K.; Foresman, J. B.; Ortiz, J. V.; Cui, Q.; Baboul, A. G.; Clifford, S.; Cioslowski, J.; Stefanov, B. B.; Liu, G.; Liashenko, A.; Piskorz, P.; Komaromi, I.; Martin, R. L.; Fox, D. J.; Keith, T.; Al-Laham, M. A.; Peng, C. Y.; Nanayakkara, A.; Challacombe, M.; Gill, P. M. W.; Johnson, B.; Chen, W.; Wong, M. W.; Gonzalez, C.; Pople, J. A.; Gaussian, Inc.: Wallingford CT, 2004.
(255) Ditchfield, R. Mol. Phys. 1974, 27, 789-807.
(256) He, X.; Wang, B.; Merz, K. M. J. Phys. Chem. B 2009, 113, 10380-10388.
(257) Anandakrishnan, R.; Onufriev, A. J. Comput. Biol. 2008, 15, 165-184.
(258) Gordon, J. C.; Myers, J. B.; Folta, T.; Shoja, V.; Heath, L. S.; Onufriev, A. Nucleic Acids Res. 2005, 33, 368-371.

BIOGRAPHICAL SKETCH

Yilin Meng was born in Jilin, Jilin Province, People's Republic of China. He attended the Dalian University of Technology in Dalian, Liaoning Province, where he studied chemical engineering and graduated with a bachelor's degree in engineering in 2004. During college, Yilin developed an interest in computational chemistry, especially electronic structure theory, and worked in Dr. Ce Hao's group for a year. In August 2004, Yilin came to the University of Florida and began his life as a graduate student. His original plan was to keep studying electronic structure theory. However, he was impressed by the research of Dr. Adrian E. Roitberg. He later joined the Roitberg group and started his career in molecular modeling.
CONSTANT pH REPLICA EXCHANGE MOLECULAR DYNAMICS STUDY OF PROTEIN STRUCTURE AND DYNAMICS

By YILIN MENG

A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA 2010

© 2010 Yilin Meng

To my family

ACKNOWLEDGMENTS

At the completion of my graduate study at the University of Florida, I would like to take great pleasure in acknowledging the people who have supported me over these years.

I primarily thank my advisor, Professor Adrian E. Roitberg. Throughout the years working in his group, I have learned a tremendous amount from him. His guidance and encouragement helped me overcome obstacles not only in research but also in my personal life. There is no way I would have achieved my goal without his support and help.

I am thankful for the support and guidance of my committee members, Professors Kenneth M. Merz Jr., Nicolas C. Polfer, Stephen J. Hagen, and Arthur S. Edison. I would also like to thank Professors So Hirata, Joanna R. Long, Carlos L. Simmerling, and Wei Yang for their guidance in my research.

I am very grateful for the assistance and helpful discussions from my colleagues in the Roitberg group, especially Dr. Daniel Sindhikara, Dr. Gustavo Seabra, Dr. Lena Dolghih, Dr. Seonah Kim, Jason Swails, Danial Dashti, Billy Miller, Dwight McGee, and Sung Cho. I appreciate all my friends at the Quantum Theory Project, the Department of Chemistry and Physics.

I thank the source of funding that supported my graduate study. My research was supported by the National Institutes of Health under Contract 1R01 AI073674. Computer resources and support were provided by the Large Allocations Resource Committee through grant TG-MCA05S010 and the University of Florida High-Performance Computing Center.

I want to acknowledge my wife, Xian, who encouraged me and supported me to complete this work.
Finally, I am very grateful to my whole family for their love and encouragement.

TABLE OF CONTENTS

ACKNOWLEDGMENTS
LIST OF TABLES
LIST OF FIGURES
LIST OF ABBREVIATIONS
ABSTRACT

CHAPTER

1 INTRODUCTION
  1.1 Acid-Base Equilibrium
  1.2 Amino Acids and Proteins
  1.3 Ionizable Residues in Proteins and the Effect of pH on Proteins
  1.4 Measuring pKa Values of Ionizable Residues
  1.5 Molecular Modeling
  1.6 Potential Energy Surface
  1.7 Molecular Dynamics, Monte Carlo Methods and Ergodicity
  1.8 Theoretical Protein Titration Curves and pKa Calculations Using Poisson-Boltzmann Equation
  1.9 Computing pKa Values by Free Energy Calculations
  1.10 pKa Prediction Using Empirical Methods
  1.11 Constant pH Molecular Dynamics (Constant pH MD) Methods

2 THEORY AND METHODS IN MOLECULAR MODELING
  2.1 Potential Energy Functions and Classical Force Fields
    2.1.1 Potential Energy Surface
    2.1.2 Force Field Models
    2.1.3 Protein Force Field Models
  2.2 Molecular Dynamics (MD) Method
    2.2.1 MD Integrator
    2.2.2 Thermostats in MD Simulations
    2.2.3 Pressure Control in MD Simulations
  2.3 Monte Carlo (MC) Method
    2.3.1 Canonical Ensemble and Configuration Integral
    2.3.2 Markov Chain Monte Carlo (MCMC)
    2.3.3 The Metropolis Monte Carlo Method
    2.3.4 Ergodicity and the Ergodic Hypothesis
  2.4 Solvent Models
    2.4.1 Explicit Solvent Model
    2.4.2 The Poisson-Boltzmann (PB) Implicit Solvent Model
    2.4.3 The Generalized Born (GB) Implicit Solvent Model
  2.5 pKa Calculation Methods
    2.5.1 The Continuum Electrostatic (CE) Model
    2.5.2 Free Energy Calculation Methods
    2.5.3 Constant pH MD Methods
  2.6 Advanced Sampling Methods
    2.6.1 The Multicanonical Algorithm (MUCA)
    2.6.2 Parallel Tempering
  2.7 Replica Exchange Molecular Dynamics (REMD) Methods
    2.7.1 Temperature REMD (T-REMD)
    2.7.2 Hamiltonian REMD (H-REMD)
    2.7.3 Technical Details in REMD Simulations

3 CONSTANT pH REMD: METHOD AND IMPLEMENTATION
  3.1 Introduction
  3.2 Theory and Methods
    3.2.1 Constant pH REMD Algorithm in AMBER Simulation Suite
    3.2.2 Simulation Details
    3.2.3 Global Conformational Sampling Comparison Using Cluster Analysis
    3.2.4 Local Conformational Sampling and Convergence to Final State
  3.3 Results and Discussion
    3.3.1 Reference Compounds
    3.3.2 Model peptide ADFDA
    3.3.3 Heptapeptide derived from OMTKY3
  3.4 Conclusions

4 CONSTANT pH REMD: STRUCTURE AND DYNAMICS OF THE C-PEPTIDE OF RIBONUCLEASE A
  4.1 Introduction
  4.2 Methods
    4.2.1 Simulation Details
    4.2.2 Cluster Analysis
    4.2.3 Definition of the Secondary Structure of Proteins (DSSP) Analysis
    4.2.4 Computation of the Mean Residue Ellipticity
  4.3 Results and Discussion
    4.3.1 Testing Structural Convergence
    4.3.2 pKa Calculation and Convergence
    4.3.3 The Mean Residue Ellipticity of the C-peptide
    4.3.4 Helical Structures in the C-peptide
    4.3.5 The Two-Dimensional Probability Densities
    4.3.6 Important Electrostatic Interactions: Lys1-Glu9 and Glu2-Arg10
    4.3.7 Important Electrostatic Interactions: Phe8-His12
    4.3.8 Cluster Analysis Results
  4.4 Conclusions

5 CONSTANT pH REMD: pKa CALCULATIONS OF HEN EGG WHITE LYSOZYME
  5.1 Introduction
  5.2 Simulation Details
  5.3 Protein Conformational and Protonation State Equilibrium Model
  5.4 NMR Chemical Shift Calculations
  5.5 Results and Discussions
    5.5.1 Structural Stability and pKa Convergence
    5.5.2 pKa Predictions
    5.5.3 Constant pH REMD Simulations with a Weaker Restraint
    5.5.4 Active Site Ionizable Residue pKa Prediction: Asp52
    5.5.5 Active Site Ionizable Residue pKa Prediction: Glu35
    5.5.6 Correlation between Conformation and Protonation
    5.5.7 Conformation-Protonation Equilibrium Model
    5.5.8 Theoretical NMR Titration Curves
  5.6 Conclusions

LIST OF REFERENCES
BIOGRAPHICAL SKETCH

LIST OF TABLES

Table
1-1 Intrinsic pKa values of ionizable residues in proteins (26).
3-1 The REMD pKa predictions of reference compounds.
3-2 pKa [...]
3-3 Correlation coefficients between MD and REMD cluster populations.
4-1 Correlation coefficients between two sets of cluster populations.
5-1 Simulation details of constant pH REMD runs.
5-2 Predicted pKa values and their RMS errors relative to experimental measurements from the restrained REMD simulations.
5-3 Predicted pKa values and their RMS errors relative to experimental measurements from weakly restrained REMD simulations.
5-4 Distance between Glu35 carboxylic oxygen atoms and neighboring residue side chain atoms in 1AKI crystal structure.

LIST OF FIGURES

Figure
1-1 A) Structure of an amino acid named alanine. An amino group (-NH2), a carboxylic acid group (-COOH), a side chain (-R, in this case, a methyl group) and a hydrogen atom are bonded to a central carbon atom (Cα). B) Dihedral angles φ and ψ of alanine dipeptide.
1-2 A Ramachandran plot (a contour plot showing the probability density of (φ, ψ) pairs) of tyrosine generated from the simulation of a heptapeptide which will be described later in chapter 3. In this figure, a left-handed helix is also shown.
1-3 A diagram showing the cartoon representation of an enzyme at low pH (acidic) and at around the optimal pH value. EH indicates the structure at low pH and E stands for the zwitterion form, which is the active species in our model. [13]
1-4 The reaction schemes showing the enzyme reactions at pH values smaller than the optimal pH value. K_s, K_s', K_1 and K_2 are equilibrium constants of the corresponding reactions and k_cat is the rate constant of the rate-determining step. This model can be used to explain how pH value affects enzyme catalysis in the pH range that is larger than optimal pH. [13,14]
1-5 A) An example of a titration curve. B) Hill plot on the basis of the titration described in Figure 1-5A. The two plots are generated from constant pH MD simulations of an aspartic acid in a pentapeptide.
1-6 13C NMR titration curves of aspartate residues in the HIV-1 protease/KNI-272 complex taken from Wang et al., 1996. [27] In this figure, Asp C chemical shifts are plotted as a function of pD. Asp25 and Asp125 do not change protonation states in this pD range. But isotope shift experiments show that with permission from Wang, Y. X.; Freedberg, D. I.; Yamazaki, T.; Wingfield, P. T.; Stahl, S. J.; Kaufman, J. D.; Kiso, Y.; Torchia, D. A. Biochemistry 1996, 35, 9945-9950.
1-7 Thermodynamic cycle used to compute the pKa shift. Both acid dissociation reactions occur in aqueous solution. A thermodynamic cycle is a series of thermodynamic processes that eventually returns to the initial state. A state function, such as the reaction free energy in this case, is path independent and hence unchanged through a cyclic process.
1-8 7 and Figure 1-8, protein-AH represents the ionizable residue in protein environment. AH represents the reference compound which is usually the ionizable residue with two termini capped. In practice, a proton does not disappear but instead becomes a dummy atom. The proton has its position and velocity. The bonded interactions involving the proton are still effective. However, there is no nonbonded interaction for that proton. The change in protonation state is reflected by changes of partial charges in the ionizable residue.
2-1 A diagram showing bond stretching coupled with angle bending. A cross term calculating coupling energy is adopted when evaluating the total potential energy.
2-2 A diagrammatic description of TIP3P and TIP4P water models. A) TIP3P model. The red circle is oxygen atom and the black circles are the hydrogen atoms. Experimental bond length and bond angle are adopted. B) TIP4P model. Oxygen and hydrogen atoms are labeled with same color as in the TIP3P model. TIP4P model also employs the experimental OH bond length and HOH bond angle. Clearly, the fourth site (green circle) which carries negative partial charge has been added to the TIP4P model.
3-1 Methods to perform exchange attempts. A) Only molecular structures are attempted to exchange. The protonation states are kept the same. B) Both molecular structures and protonation states are attempted to exchange.
3-2 Titration curves of blocked aspartate amino acid from 100 ns MD at 300 K and REMD runs. Agreement can be seen between MD and REMD simulations.
3-3 Cumulative average protonation fraction of aspartic acid reference compound vs Monte Carlo (MC) steps at pH = 4.
3-4 The titration curves of the model peptide ADFDA at 300 K from both MD and REMD simulations. MD simulation time was 100 ns and 10 ns were chosen for each replica for REMD runs.
3-5 Cumulative average protonation fraction of Asp2 in model peptide ADFDA vs Monte Carlo (MC) steps at pH = 4.
3-6 (Ramachandran plots) for Asp2 at pH 4 in ADFDA. Ramachandran plots at other solution pH values are similar. For Asp2, constant pH MD and REMD sampled the same local backbone conformational space. Phe3 and Asp4 Ramachandran plots also display the same trend.
3-7 Cluster populations of ADFDA at 300 K. A) MD vs REMD at pH 4. Trajectories from MD and REMD simulations are combined first. By clustering the combined trajectory, the MD and REMD structural ensembles will populate the same clusters. The fraction of the conformational ensemble corresponding to each cluster (fractional population of each cluster) was calculated for MD and REMD simulation, respectively. Two sets of fractional population of clusters were generated, and hence plotted against each other. B) Two REMD runs from different starting structures at pH 4. Large correlation shown in Figure 3-7B suggests that the REMD runs are converged. Large correlations between two independent REMD runs are also observed at other solution pH values. Correlations between MD and REMD simulations can be found in Table 3-3.
3-8 A) Titration curves of Asp3 in the heptapeptide derived from protein OMTKY3.
B) Titration curves of Lys5 and Tyr7 in the heptapeptide derived Ka values of Asp3 are f
3-9 A) Cumulative average protonation fraction of Asp3 of the heptapeptide derived from OMTKY3 vs MC steps. B) and C) are the cumulative average protonation fractions of Tyr7 and Lys5 in the heptapeptide vs MC steps, respectively. Clearly, faster convergence is achieved in constant pH REMD simulations.
3-10 pH MD results. B) Constant pH REMD results. The two probability densities are almost identical, indicating that constant pH MD and REMD sample the same local conformational space. All others also show very similar trend.
3-11 The root mean robability density behaviors at other pH values also show that REMD runs converge to final distribution faster.
3-12 Cluster population at 300 K from constant pH MD and REMD simulations at pH = 4. Cluster analysis is performed using the entire simulation. The populations in each cluster from the first and second half of the trajectory are compared and plotted. Ideally, a converged trajectory should yield a correlation coefficient of 1. A) Constant pH MD. B) Constant pH REMD. A much higher correlation coefficient can be seen in the constant pH REMD simulation, suggesting that much better convergence is achieved by the constant pH REMD run.
4-1 Cluster population at 300 K from constant pH REMD simulations at pH 2. A) Cluster analysis is performed on the trajectory initiated from the fully extended structure. The populations in each cluster from the first and second half of the trajectory are compared and plotted. B) Two REMD runs from different starting structures at pH 2. Correlation coefficients at other pH values can be found in Table 4-1.
4-2 Cumulative average fraction of protonation vs Monte Carlo (MC) steps. Only the two glutamate residues are shown here and the histidine residue is found to show the same trend. The pH values are selected such that the overall average fraction of protonation is close to 0.5.
4-3 Computed mean residue ellipticity at 222 nm as a function of pH value. A bell-shaped curve at 300 K is obtained with a maximum at pH 5. The effect of temperature on mean residue ellipticity at 222 nm is also demonstrated.
4-4 Helical content as a function of residue number.
4-5 A) Time series of Cα RMSDs vs the fully helical structure at pH 5. The first two residues at each end are not selected because the ends are very flexible. B) Probability densities of the Cα RMSDs. Clearly, the structural ensemble at pH 5 contains more structures similar to the fully helical structure. C) Time series of Cα radius of gyration at pH 5. D) Probability density of the Cα radius of gyration. More compact structures are found at pH 5.
4-6 A) Probability densities of number of helical residues in the C-peptide. B) Probability densities of the number of helical segments in the C-peptide. A helical segment contains continuous helical residues. The probability of forming the second helical segment is very low at all three pH values, thus only the first helical segment is further studied. C) Probability densities of the starting position of a helical segment. D) Probability densities of the length of a helical segment (number of residues in a helical segment).
4-7 2D probability density of helical starting position and helical length, pH = 2.
4-8 2D probability density of helical starting position and helical length, pH = 5.
4-9 2D probability density of helical starting position and helical length, pH = 8.
4-10 2D probability density of helical length and Cα RMSD at pH = 2.
4-11 2D probability density of helical length and Cα RMSD at pH = 5.
4-12 2D probability density of helical length and Cα RMSD at pH = 8.
4-13 A) Probability density of Lys1-Glu9 distance (Å). The distance is the minimum distance between the side chain nitrogen atom of Lys1 and the side chain carboxylic oxygen atoms of Glu9. B) Probability density of Glu2-Arg10 distance (Å). The distance is the minimum distance between side chain carboxylic oxygen atoms of Glu2 and guanidinium nitrogen atoms of Arg10.
4-14 Two-dimensional probability density of Lys1-Glu9 and Glu2-Arg10 at pH 5. Apparently, Lys1-Glu9 and Glu2-Arg10 salt bridges cannot be formed simultaneously.
4-15 A) Two-dimensional probability density of Glu2-Arg10 salt bridge formation and helical length at pH 5. According to the plot, the Glu2-Arg10 salt bridge can be found in four-residue, six-residue and non-helical structures. B) Two-dimensional probability density of Glu2-Arg10 salt bridge and the helix starting position at pH 5. If a helix begins from Thr3, it cannot have a Glu2-Arg10 salt bridge. Thus, one role of the Glu2-Arg10 salt bridge is to prevent helix formation from Thr3.
4-16 A) Probability density of Phe8 backbone to His12 ring distance.
The distance is the minimum distance between Phe8 backbone carbonyl oxygen atom and His12 imidazole nitrogen atoms. B) Probability density of Phe8 ring to His12 ring distance. The distance is the minimum distance between Phe8 aromatic ring carbon atoms and His12 imidazole nitrogen atoms.
4-17 A) Two-dimensional probability density of Glu2-Arg10 distance and Phe8-His12 backbone to ring distance at pH 5. B) Correlations between Glu2-Arg10 salt bridge and Phe8-His12 contact at pH 5.
4-18 A) Two-dimensional probability density of helical segment length and Phe8-His12 interaction. B) Two-dimensional probability density of helical segment starting position and Phe8-His12 interaction. Phe8-His12 also stabilizes four-residue and six-residue structures. Helices begin at Lys7 and Phe8-His12 is coupled. Unlike Glu2-Arg10, Phe8-His12 stabilizes helices starting from Thr3.
4-19 A) Top 20 populated clusters and average helical percentage. B) Probability densities of the Cα RMSD vs the fully helical structure of the top 2 populated clusters. C) Helical percentage as a function of residue number of the top 2 populated clusters. D) Probability density of the Glu2-Arg10 and Phe8 backbone-His12 ring interactions in the second most populated cluster.
5-1 Crystal structure of HEWL (PDB code 1AKI). Residues in red represent aspartate and residues in blue are glutamate.
5-2 A simple schematic view of the conformation-protonation equilibrium in a constant pH simulation.
5-3 Cα RMSD vs crystal structure (PDB code: 1AKI).
A) Cα RMSD vs 1AKI from REMD without restraint on Cα. B) Cα RMSD vs 1AKI from REMD with restraint on Cα. The restraint strength is 1 kcal/mol·Å².
5-4 pKa prediction error as a function of time. The predicted pKa at a given time is a cumulative result. For each ionizable residue, the time series of its pKa error is generated at a pH where the average predicted pKa is closest to that pH value. In this way, we try to eliminate any bias toward the energetically favored state. A flat line is an indication of convergence. Glu35 is not shown here due to poor convergence.
5-5 A) pKa prediction convergence to its final value. Similarly, the pKa value at a given time is a cumulative average. A flat line having y value of 0 is expected when pKa calculation convergence is reached. The same pH values are chosen for each ionizable residue as in Figure 5-4. B) Asp52 pKa prediction convergence to its final value at multiple pH values. The pH values are selected in such a way that the pKa calculated at this pH will be used to compute composite pKa.
5-6 RMS error between predicted and experimental pKa vs pH value. A minimum of pKa RMS error can be found near the pH at which 1AKI crystal structure is resolved.
5-7 A) Cα RMSD of HEWL from weaker restraint REMD simulations. The RMSDs are larger than those with stronger restraints. When comparing RMSDs at different pH for simulations using weaker restraint, RMSDs are greater at pH 3 and 4 than those at pH 4.5. B) pKa prediction deviation from final value at pH 4.5 from constant pH REMD with 0.1 kcal/mol·Å²
5-8 Asp52 in the crystal structure of 1AKI. Its neighbors that have strong electrostatic interactions are also shown.
5-9 A) Time series of Asp52 carboxylic oxygen atom OD1 to Asn59 and Asn44 ND2 distances at pH 3 in the 1 kcal/mol·Å² constant pH REMD run. B) Time series of Asp52 carboxylic oxygen atom OD2 to Asn59 and Asn44 ND2 distances under the same condition. Hydrogen bonds which are stabilizing deprotonated Asp52 are formed to a large extent even at a low pH.
5-10 A) Time series of the Glu35 heavy atoms (excluding two carboxylic oxygen atoms) RMSD relative to crystal structure 1AKI. B) Probability distribution of the RMSD. The conformation centered at RMSD ~0.1 is labeled as conformation 1. The one centered at ~0.6 is named conformation 2. Apparently, an extra conformation (conformation 3) is visited by the weakly restrained REMD simulation.
5-11 A) Representative structure of conformation 1. B) Representative structure of conformation 2. The structure ensemble is generated from REMD simulations with stronger restraining potential. The carboxylic group of Glu35 in conformation 2 is clearly pointing toward the amide group of Ala110. Deprotonated form of Glu35 tends to decrease the electrostatic energy. Furthermore, conformation 1 does not particularly favor the protonated Glu35. No significant stabilizing factor is found for the protonated Glu35.
5-12 Representative structure of conformation 3 from cluster analysis. Glu35 is in the hydrophobic region, consisting of Gln57, Trp108 and Ala110. Conformation 1 and 2 in the weakly restrained simulations are basically the same as those demonstrated in Figure 5-11.
5-13 A) Correlation between side chain dihedral angle χ1 and protonation states.
B) Correlation between side chain dihedral angle χ2 and protonation states.
5-14 Minimal distance between Asp119 side chain carboxylic oxygen atoms (OD1 and OD2) and Arg125 guanidinium nitrogen atoms. Since guanidinium group has three nitrogen atoms, the minimal distance is the shortest distance between Asp119 OD1 (or OD2) and those three nitrogen atoms.
5-15 A) Probability distribution of Asp119 CG to Arg125 CZ distances. The Asp119 CG to Arg125 CZ distance is used to distinguish conformations. B) Coupling between conformations and protonation states.
5-16 K_12/K_12,h as a function of pH and its dependence on pKa,1 and pKa,2
5-17 A) Fraction of each species as a function of pH (titration curves) obtained from equations based on conformation-protonation equilibrium. The effect of K_12 is tested. B) Comparison of titration curves derived from actual simulations and from the equilibrium equations.
5-18 Theoretical NMR chemical shifts as a function of pH. It conformation-protonation equilibrium model can reproduce experimental titration curve based on NMR chemical shift measurements.
LIST OF ABBREVIATIONS

ACE Analytical Continuum Electrostatic
BAR Bennett Acceptance Ratio
CD Circular Dichroism
CE Continuum Electrostatic
CPHMD Continuous Constant pH Molecular Dynamics
CPL Circularly Polarized Light
DOF Degree of Freedom
DOS Density of States
DSSP Definition of the Secondary Structure of Proteins
EAF Exchange Attempt Frequency
EFP Effective Fragment Potential
FEP Free Energy Perturbation
FDPB Finite Difference Poisson-Boltzmann
GB Generalized Born
HEWL Hen Egg White Lysozyme
HH Henderson-Hasselbalch
H-REMD Hamiltonian Replica Exchange Molecular Dynamics
LCPL Left Circularly Polarized Light
MC Monte Carlo
MCMC Markov Chain Monte Carlo
MCCE Multiconformation Continuum Electrostatic
MD Molecular Dynamics
MDFE Molecular Dynamics-based Free Energy (calculation)
MM Molecular Mechanics
MUCA Multicanonical
NMR Nuclear Magnetic Resonance
NPT Isothermal-isobaric Ensemble
NVE Microcanonical Ensemble
NVT Canonical Ensemble
PB Poisson-Boltzmann
PBC Periodic Boundary Condition
PES Potential Energy Surface
PDF Probability Distribution Function
PMF Potential of the Mean Force
QM Quantum Mechanics
QM/MM Hybrid Quantum Mechanical Molecular Mechanical
RCPL Right Circularly Polarized Light
REM Replica Exchange Method
REMD Replica Exchange Molecular Dynamics
REX-CPHMD Replica Exchange Continuous Constant pH Molecular Dynamics
RF Radio Frequency
RMSD Root Mean Square Deviation
TI Thermodynamic Integration
T-REMD Temperature Replica Exchange Molecular Dynamics
V-REMD Viscosity Replica Exchange Molecular Dynamics

Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

CONSTANT pH REPLICA EXCHANGE MOLECULAR DYNAMICS STUDY OF PROTEIN STRUCTURE AND DYNAMICS

By
Yilin Meng
August 2010

Chair: Adrian E.
Roitberg
Major: Chemistry

Solution pH is a very important thermodynamic variable that affects protein structure, function and dynamics. Enormous effort has been made experimentally and computationally to understand the effect of pH on proteins. One category of computational methods for studying the effect of pH is constant pH molecular dynamics (constant pH MD). Constant pH MD employs dynamic protonation in simulations and correlates protein conformations with protonation states. Therefore, constant pH MD algorithms are able to predict the pKa value of an ionizable residue as well as to study pH dependence directly.

A replica exchange constant pH molecular dynamics (constant pH REMD) method is proposed and implemented to improve coupled protonation and conformational state sampling. By mixing conformational sampling at constant pH (with discrete protonation states) with a temperature ladder, this method avoids conformational trapping. Our method was tested on seven different biological systems. Constant pH REMD not only predicted pKa values correctly for model peptides but also converged faster than constant pH MD. Furthermore, constant pH REMD showed its advantage in the efficiency of conformational sampling. The advantage of utilizing constant pH REMD is clear.

We have studied the effect of pH on the structure and dynamics of the C-peptide from ribonuclease A by constant pH REMD. The mean residue ellipticity at 222 nm at each pH value is computed for direct comparison with experimental measurements. The C-peptide conformational ensembles at pH 2, 5, and 8 are studied. The Glu2-Arg10 and Phe8-His12 interactions and their roles in helix formation are also investigated.

The constant pH REMD method is applied to the study of hen egg white lysozyme (HEWL). pKa values are calculated and compared with experimental values.
Factors that could affect pKa prediction, such as the hydrogen bond network and interactions between ionizable residues, are discussed. Structural features such as the coupling between conformation and protonation states are demonstrated in order to emphasize the importance of accurate sampling of the coupled conformations and protonation states.

CHAPTER 1
INTRODUCTION

1.1 Acid-Base Equilibrium

Acids and bases are common in our daily lives. For example, vinegar is acidic and ammonia is basic. According to the Brønsted-Lowry definition, an acid is a chemical compound that can donate protons and a base is a chemical compound that can accept protons. An acid is converted to its conjugate base by transferring a proton to a base, and a base is converted to its conjugate acid by accepting a proton. For simplicity, the conversion between an acid and its conjugate base can be described by the reaction:

$$\mathrm{HA} \rightleftharpoons \mathrm{H^+} + \mathrm{A^-}$$

where HA is an acid, $\mathrm{A^-}$ is its conjugate base and $\mathrm{H^+}$ represents the proton (in an aqueous environment, $\mathrm{H^+}$ is the hydronium ion $\mathrm{H_3O^+}$). An equilibrium state exists between any acid-base conjugate pair. At equilibrium, the concentration of each species is constant. In an acid-base reaction, an acid dissociation constant is used to describe this equilibrium. The acid dissociation constant is defined by Eq. 1-1.

$$K_a = \frac{a_{\mathrm{H^+}}\, a_{\mathrm{A^-}}}{a_{\mathrm{HA}}} \qquad (1\text{-}1)$$

Here $K_a$ is the acid dissociation constant and $a_{\mathrm{H^+}}$, $a_{\mathrm{A^-}}$ and $a_{\mathrm{HA}}$ represent the activity of each species, respectively. In Eq. 1-1, the activity of each individual species (taking HA as an example) can be expressed as:

$$a_{\mathrm{HA}} = \gamma_{\mathrm{HA}} \frac{[\mathrm{HA}]}{c^{\circ}} \qquad (1\text{-}2)$$

In Eq. 1-2, $\gamma_{\mathrm{HA}}$ is the activity coefficient of HA, $[\mathrm{HA}]$ is the concentration of HA, and $c^{\circ}$ is the standard concentration, which is 1 M. In an ideal solution, the activity coefficients are unity. The concentration of each species is divided by the standard concentration in order to make the acid dissociation constant dimensionless. For simplicity, the acid dissociation constant is expressed using the concentration of each species from now on.
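As a numerical illustration of Eqs. 1-1 and 1-2, the following minimal sketch evaluates the activities and the resulting dissociation constant. It is not part of the original work: the function names and the example concentrations are hypothetical, and the activity coefficients default to unity, which corresponds to the ideal-solution assumption.

```python
STANDARD_CONCENTRATION = 1.0  # mol/L, the standard concentration in Eq. 1-2


def activity(concentration, gamma=1.0, c0=STANDARD_CONCENTRATION):
    """Dimensionless activity a = gamma * [X] / c0 (Eq. 1-2)."""
    return gamma * concentration / c0


def acid_dissociation_constant(conc_h, conc_a, conc_ha,
                               gamma_h=1.0, gamma_a=1.0, gamma_ha=1.0):
    """K_a = a(H+) * a(A-) / a(HA) (Eq. 1-1)."""
    numerator = activity(conc_h, gamma_h) * activity(conc_a, gamma_a)
    return numerator / activity(conc_ha, gamma_ha)


# Hypothetical equilibrium concentrations in mol/L (illustrative values only).
ka = acid_dissociation_constant(conc_h=1.0e-3, conc_a=1.0e-3, conc_ha=0.1)
print(f"K_a = {ka:.2e}")
```

With unit activity coefficients the result reduces to the familiar concentration-based expression, which is the form used in the rest of this chapter.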
The Ka indicates the strength of an acid: the stronger the acid, the larger its Ka. Ka values span many orders of magnitude. Therefore, a logarithmic (base 10) measure of the Ka is more frequently adopted:

$$\mathrm{p}K_a = -\log_{10} K_a \qquad (1\text{-}3)$$

Combining Eq. 1-1 and Eq. 1-3, we can express the pKa value as:

$$\mathrm{p}K_a = \mathrm{pH} - \log_{10}\frac{[\mathrm{A^-}]}{[\mathrm{HA}]} \qquad (1\text{-}4)$$

Eq. 1-4 is the Henderson-Hasselbalch (HH) equation. It allows one to solve directly for pH values instead of calculating the concentration of hydronium ions first. When $[\mathrm{A^-}] = [\mathrm{HA}]$, the HH equation becomes $\mathrm{p}K_a = \mathrm{pH}$. Therefore, the pKa value of an acid is numerically equal to the pH value at which the acid and its conjugate base have the same concentration.

The acid dissociation constant represents the thermodynamics of an acid dissociation reaction because the pKa value is proportional to the Gibbs free energy of the reaction. For simple compounds such as acetic acid, temperature is the most important factor that affects the pKa value. However, for complex molecules such as proteins and peptides, the effect of the environment is also crucial and will be discussed in this dissertation.

1.2 Amino Acids and Proteins

The goal of this dissertation is to study the acid-base equilibrium in peptide and protein systems and its effect on peptide and protein conformations by the constant pH REMD method. Thus, an introduction to peptides and proteins, especially their structures, will be helpful. Amino acids have the generic structure shown in Figure 1-1A. Each amino acid consists of an amino group (-NH2), a carboxylic acid group (-COOH) and a distinctive side chain (-R). All three groups are connected to a carbon atom which is called carbon alpha (Cα). There are twenty naturally occurring side chains and they can be divided into groups based on their physical or chemical properties.
For example, one way to categorize the twenty side chains is based on their acid/base properties in aqueous solution. Therefore, aspartic acid is an acidic amino acid and lysine is a basic amino acid.

The carboxylic group of one amino acid can react with the amine group of another amino acid. This condensation reaction forms a peptide bond, which links the two amino acids, and yields a water molecule. Proteins are formed as a consequence of this condensation reaction. A protein is a string of amino acids connected by peptide bonds and folded into a globular structure. A protein often consists of a minimum of 30 to 50 amino acids. [1] Shorter chains of amino acids are often called peptides. Each amino acid in a protein or peptide is called a residue. The peptide bonds form the backbone of a protein.

Figure 1-1. A) Structure of an amino acid named alanine. An amino group (-NH2), a carboxylic acid group (-COOH), a side chain (-R, in this case, a methyl group) and a hydrogen atom are bonded to a central carbon atom (Cα). B) Dihedral angles φ and ψ of alanine dipeptide.

A protein usually has four levels of structure, which are called primary structure, secondary structure, tertiary structure and quaternary structure. The primary structure is the sequence of amino acids. The folding of a protein is determined by its primary structure. Next, the secondary structure (e.g., α-helix, β-strand, or loop) is the three-dimensional structure of local segments of a protein. As mentioned earlier, proteins fold themselves into functional structures after they are formed. After folding, protein backbones often adopt certain types of fold or alignment. The term secondary structure is used to describe these local three-dimensional arrangements. The two most common secondary structures found in proteins are α-helices and β-strands.
The local secondary structure of a particular residue in a protein can be described by a Ramachandran plot, which is a two-dimensional histogram (or probability distribution) of the backbone dihedral angle pair (φ, ψ). As demonstrated in Figure 1-1B, the backbone can rotate around the N-Cα and Cα-C bonds, forming the dihedral angles φ and ψ. Backbone conformations of a residue can be described by specifying (φ, ψ). Three main regions are generally populated in a Ramachandran plot, corresponding to the three main stable conformations of a residue: the right-handed α-helix region near (φ = -57°, ψ = -47°), the β-strand region near (φ = -125°, ψ = 150°) and the polyproline II region near (φ = -75°, ψ = 145°). The most populated region indicates the most stable conformation of a residue. An example of a Ramachandran plot is shown in Figure 1-2. Furthermore, the tertiary structure is the set of three-dimensional positions of all atoms in a protein. The tertiary structure yields information about protein side chains, for example, salt bridges. Finally, the quaternary structure defines the positions of all atoms in a protein containing multiple peptide chains, for example, the hemoglobin tetramer. It is the highest level of protein structure.

Figure 1-2. A Ramachandran plot (a contour plot showing the probability density of (φ, ψ) pairs) of tyrosine generated from the simulation of a heptapeptide which will be described later in chapter 3. In this figure, a left-handed helix is also shown.

Proteins perform vital functions, which are important to our lives. Almost all cell activities depend on proteins. For example, hemoglobin can transport oxygen molecules from lung to cells; [1] many chemical reactions occurring in living organisms are catalyzed by proteins called enzymes; and proteins are also involved in cell signaling. Mutations in proteins, as well as aggregation and misfolding of proteins, can cause many diseases. For example, many cancers result from mutations in the tumor suppressor p53.
[2,3] Thus, understanding protein structures and functions is important.

1.3 Ionizable Residues in Proteins and the Effect of pH on Proteins

An ionizable residue in a protein is a residue with a side chain that can donate or accept proton(s). There are seven ionizable residues: ASP, GLU, HIS, CYS, TYR, LYS and ARG. Ionizable residues define the acid-base properties of a protein. Consequently, the solution pH value becomes an important thermodynamic variable affecting protein structure, dynamics, folding mechanism and function. [4] Many biological phenomena such as protein folding/misfolding, [5-8] substrate docking [9] and enzyme catalysis are pH dependent. [10-12]

A good example of how pH value affects proteins is the pH dependence of enzyme kinetics. Most enzymes possess an optimal pH value, at which the reaction rate is largest. Enzyme catalysis is pH dependent because the active sites of enzymes in general contain important acidic or basic residues. Only one form (acidic or basic) of the ionizable residue is catalytically active, thus the concentration of the catalytically active species will affect the kinetics. Consider a simple reaction model (Figure 1-3 and Figure 1-4) to demonstrate how pH value affects the enzyme reaction rate. In this model, only the zwitterion form is active, no intermediate exists for the enzyme reaction, and the protonation-deprotonation steps are faster than the catalysis steps. Furthermore, the rate-determining step does not depend on pH value.

Figure 1-3. A diagram showing the cartoon representation of an enzyme at low pH (acidic) and at around the optimal pH value. EH indicates the structure at low pH and E stands for the zwitterion form, which is the active species in our model. [13]

Figure 1-4.
The reaction schemes showing the enzyme reactions at pH values smaller than the optimal pH value. K_s, K_s', K_1 and K_2 are equilibrium constants of the corresponding reactions and k_cat is the rate constant of the rate-determining step. This model can be used to explain how pH value affects enzyme catalysis in the pH range that is larger than optimal pH. [13,14]

The equilibrium constants shown in Figure 1-4 are not independent of each other. The relationship among them is given by:

$$K_s K_2 = K_s' K_1 \qquad (1\text{-}5)$$

According to the above equation, if $K_1 = K_2$ then substrate binding is not affected by the pH value of the solution. Otherwise, the binding is pH dependent. After applying the steady-state approximation, the reaction rate can be written as:

$$v = \frac{k_{cat}[E]_0[S]}{K_s + [S]\left(1 + [\mathrm{H^+}]/K_2\right) + K_s[\mathrm{H^+}]/K_1} \qquad (1\text{-}6)$$

where $[E]_0$ is the initial concentration of the enzyme, $[S]$ is the substrate concentration and $[\mathrm{H^+}]$ is the concentration of hydronium ions. At low pH, increasing the concentration of hydronium ions (that is, decreasing the pH value) decreases the reaction rate. The same kind of model can also be applied to derive the effect of pH on the reaction rate when the pH is higher than optimal. Likewise, only the zwitterion form is catalytically active. The conclusion is that a pH value that is too high or too low will lower the rate of the enzyme-catalyzed reaction.

Given the importance of the solution pH, knowing the pKa value of an ionizable residue in a protein is important because it indicates the average protonation state of that ionizable residue at a certain pH value. However, the pKa value of an ionizable residue is highly affected by its protein environment. [15,16] Two major factors affect protein pKa values: one is the desolvation effect and the other is the electrostatic interaction. Other factors such as hydrogen bonding and structural rearrangement are also able to affect protein pKa values.
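Returning briefly to the rate expression of Eq. 1-6, its pH dependence on the acidic side of the optimum can be illustrated numerically. The sketch below is not part of the original work. It assumes the rate law v = k_cat [E]0 [S] / (K_s + [S](1 + [H+]/K_2) + K_s [H+]/K_1), with K_s, K_1 and K_2 taken as dissociation constants, and every parameter value is hypothetical and chosen only for illustration.

```python
def reaction_rate(h_conc, s_conc, e0, k_cat, k_s, k_1, k_2):
    """Rate on the acidic side of the optimum, in the assumed form of Eq. 1-6.

    h_conc: hydronium concentration, s_conc: substrate concentration,
    e0: initial enzyme concentration (all in mol/L).
    """
    denominator = k_s + s_conc * (1.0 + h_conc / k_2) + k_s * h_conc / k_1
    return k_cat * e0 * s_conc / denominator


def rate_at_ph(ph, **params):
    """Convert pH to [H+] and evaluate the rate."""
    return reaction_rate(10.0 ** (-ph), **params)


# Hypothetical parameters (illustrative only, not taken from any experiment).
params = dict(s_conc=1.0e-3, e0=1.0e-6, k_cat=100.0,
              k_s=1.0e-4, k_1=1.0e-5, k_2=1.0e-6)

for ph in (3.0, 4.0, 5.0, 6.0, 7.0):
    print(f"pH {ph:.1f}: v = {rate_at_ph(ph, **params):.3e} M/s")
```

As the hydronium concentration becomes negligible, the expression reduces to the pH-independent form k_cat [E]0 [S] / (K_s + [S]), and the rate falls monotonically as the pH is lowered, in line with the discussion of Eq. 1-6.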
An ionizable side chain in the interior of a protein can have a pKa value different from that of the isolated amino acid in solution, which is caused by the dehydration (desolvation) effect.[17-19] For example, Asp26 of thioredoxin, which lies in a deep pocket of the protein, has a pKa value of 7.5,[17] while the pKa value of a water-exposed aspartic acid is 4.0.[20] The Garcia-Moreno group has been employing site-directed mutagenesis to study the effect of desolvation;[18,19,21-23] this work will be described later in this chapter. Their research on buried ionizable residues provides a probe of the dielectric constant inside the protein, which is an important parameter for pKa prediction on the basis of the Poisson-Boltzmann equation.

Electrostatic interactions such as salt bridges are also able to affect pKa values. For example, His31 and Asp70 form a salt bridge in T4 lysozyme.[24] The formation of this salt bridge shifts the pKa of Asp70 to 0.5 and changes the pKa of His31 to 9.1. Interestingly, Asp26 in thioredoxin has been shown to form a salt bridge with Lys57 when it is in the deprotonated form.[25] The formation of a salt bridge should reduce the pKa value of Asp26. Therefore, the pKa value of 7.5 is the combined result of the desolvation effect and the electrostatic interaction.

Each ionizable residue has its own intrinsic pKa value. The intrinsic pKa value of an ionizable residue is defined as the pKa value measured when this residue is fully solvent exposed and is not interacting with any other groups,[20] for example, an aspartate residue with its two termini blocked. This kind of dipeptide is often used as a reference (or model) compound in the theoretical calculation of protein pKa values. The intrinsic pKa values are reported in Table 1-1.

Table 1-1. Intrinsic pKa values of ionizable residues in proteins.
[26]

Residue name    Intrinsic pKa value
ASP             4.0
GLU             4.4
HIS             6.7
CYS             8.0
TYR             9.6
LYS             10.4
ARG             12.0

1.4 Measuring pKa Values of Ionizable Residues

A general way to determine the pKa value of an acid experimentally is through titration. In experiments, the pH is measured by a pH meter as a function of the volume of base added to the solution. A titration curve is thereby obtained (Figure 1-5A shows an example) and the pKa value is the pH at which the deprotonated and protonated species have the same concentration. Another way of presenting a titration curve is to plot the fraction of deprotonation (or protonation) vs the pH value. A Hill plot (an example is shown in Figure 1-5B), which is obtained by plotting log([A-]/[HA]) as a function of pH, is used to study titration behavior. After fitting to the modified HH equation,

$$\log\frac{[\mathrm{A}^-]}{[\mathrm{HA}]} = n\,(\mathrm{pH} - \mathrm{p}K_a)$$

the x-intercept is the pKa value and the slope ($n$) is the Hill coefficient, which reflects interactions between ionizable residues. The HH equation is represented as a straight line in a Hill plot, with a slope of unity. If only one ionizable residue is present in the system of interest, or if an ionizable residue does not couple with other ionizable residue(s), the HH equation should be reproduced; in that case, a slope that differs from unity reflects statistical (random) error. Interacting ionizable residues demonstrate non-HH behavior and possess non-unity Hill coefficients. When $n > 1$, proton binding is positively cooperative, which means that binding of the first proton increases the binding affinity of the other one. When $n < 1$, the binding of protons is negatively cooperative, which means that the binding of one proton decreases the affinity of the other proton.

Figure 1-5. A) An example of a titration curve. B) A Hill plot on the basis of the titration described in Figure 1-5A. The two plots are generated from constant-pH MD simulations of an aspartic acid in a pentapeptide.
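The Hill-plot analysis described above can be sketched in a few lines of Python. This is an illustration only: it assumes the modified HH equation in the form log10([A-]/[HA]) = n(pH - pKa), generates synthetic deprotonated fractions (not simulation or experimental data), and recovers n and the pKa with an ordinary least-squares line. The helper names `deprotonated_fraction` and `hill_fit` are hypothetical.

```python
import math


def deprotonated_fraction(pH, pKa, n=1.0):
    """Fraction of deprotonated species implied by the modified HH equation."""
    return 1.0 / (1.0 + 10.0 ** (n * (pKa - pH)))


def hill_fit(pH_values, deprot_fractions):
    """Least-squares line through log10(f / (1 - f)) vs pH.

    Returns (n, pKa): the slope is the Hill coefficient and the
    x-intercept is the pKa.
    """
    ys = [math.log10(f / (1.0 - f)) for f in deprot_fractions]
    count = len(pH_values)
    mean_x = sum(pH_values) / count
    mean_y = sum(ys) / count
    sxx = sum((x - mean_x) ** 2 for x in pH_values)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(pH_values, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, -intercept / slope


# Synthetic titration data for a site with pKa = 4.0 and Hill coefficient 0.8.
pH_values = [2.0, 3.0, 4.0, 5.0, 6.0]
fractions = [deprotonated_fraction(pH, pKa=4.0, n=0.8) for pH in pH_values]
n_fit, pKa_fit = hill_fit(pH_values, fractions)
print(f"Hill coefficient = {n_fit:.3f}, pKa = {pKa_fit:.3f}")
```

Because the synthetic points lie exactly on the model, the fit returns the input parameters; with noisy data the deviation of the slope from unity has to be judged against the statistical error, as noted in the text.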
However, determining the pKa values of protein ionizable residues by measuring the solution pH as a function of the volume of base is difficult because a protein in general contains multiple ionizable residues. A site-specific experimental technique is preferred.

Nuclear Magnetic Resonance (NMR) is one of the most frequently employed spectroscopic methods in chemistry, physics and biological science. One application of NMR is to measure the pKa values of individual ionizable residues. NMR spectroscopy measures the absorption of radio-frequency (RF) radiation by a nucleus in a magnetic field. Only a nucleus with a nonzero spin quantum number (for example, the spin-1/2 nuclei 1H, 13C and 15N) is able to generate an NMR signal. Furthermore, the absorption is affected by the chemical environment around that nucleus. The electron density around a nucleus shields the nucleus from the external magnetic field. Thus a different chemical environment (electron density) around a nucleus changes its resonance frequency, resulting in a chemical shift. Changes in protonation state result in changes in the chemical shifts of the nuclei around the ionizable site (for example, Cγ of Asp, Cδ of Glu, and Nδ1 and Nε2 of His). Consequently, at a given pH value, the equilibrium between the protonated and deprotonated species yields a weighted-average chemical shift,

$$\delta_{obs} = \delta_{p} + \frac{\Delta\delta}{1 + 10^{\,n(\mathrm{p}K_a - \mathrm{pH})}} \tag{1-7}$$

Here $\delta_{obs}$, $\delta_{p}$ and $\Delta\delta$ are the observed chemical shift, the chemical shift of the protonated species, and the change in chemical shift caused by titration, respectively, and $n$ is the Hill coefficient. Eq. 1-7 implies the HH equation when $n = 1$. Chemical shifts are therefore measured at different pH values and a titration curve is obtained. Figure 1-6 shows titration curves generated by NMR spectroscopy. However, in practice, one-dimensional NMR spectra of proteins are often too complicated to be interpreted.
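As a small worked illustration of Eq. 1-7, the sketch below computes the population-weighted chemical shift and recovers a pKa from synthetic shift data by a simple grid search. It assumes Eq. 1-7 has the form delta_obs = delta_p + delta_change / (1 + 10^(n(pKa - pH))). The function names, the end-point shifts (176.0 ppm and a 3.0 ppm change) and the data points are hypothetical and are not taken from any measurement; a real analysis would normally fit the end-point shifts as well.

```python
def observed_shift(pH, delta_protonated, delta_change, pKa, n=1.0):
    """Population-weighted chemical shift at a given pH (assumed form of Eq. 1-7)."""
    return delta_protonated + delta_change / (1.0 + 10.0 ** (n * (pKa - pH)))


def fit_pKa(pH_values, shifts, delta_protonated, delta_change, n=1.0):
    """Grid search in steps of 0.01 pH units for the pKa with the smallest squared error."""
    best_pKa, best_error = None, float("inf")
    for step in range(1401):  # candidate pKa values from 0.00 to 14.00
        pKa = step / 100.0
        error = sum(
            (observed_shift(pH, delta_protonated, delta_change, pKa, n) - shift) ** 2
            for pH, shift in zip(pH_values, shifts)
        )
        if error < best_error:
            best_pKa, best_error = pKa, error
    return best_pKa


# Synthetic titration of a single site with pKa = 7.5 (made-up shifts in ppm).
pH_values = [5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
shifts = [observed_shift(pH, 176.0, 3.0, 7.5) for pH in pH_values]
print("fitted pKa:", fit_pKa(pH_values, shifts, 176.0, 3.0))
```

At pH = pKa the two species are equally populated, so the observed shift lies exactly halfway between the protonated and deprotonated limits.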
Introducing a new spectral dimension simplifies the spectra and yields more useful information. In two-dimensional NMR spectroscopy, the sample is excited by one or more pulses and then evolves for a time $t_1$, during which the signal is not recorded. Following the evolution time, one or more pulses are applied to the sample and the resulting signal is measured as a function of a new time variable $t_2$. 1H, 13C and 15N NMR are frequently employed in experiments to determine protein pKa values.[14] Proton NMR has been shown to be particularly useful in studying histidine pKa values. It is also employed to study the acid-base equilibrium of tyrosine residues. 13C NMR experiments can be performed to determine the pKa values of lysine and aspartate.

Figure 1-6. 13C NMR titration curves of aspartate residues in the HIV-1 protease/KNI-272 complex, taken from Wang et al., 1996.[27] In this figure, Asp Cγ chemical shifts are plotted as a function of pD. Asp25 and Asp125 do not change protonation states in this pD range, but isotope shift experiments show that Asp25 is protonated and Asp125 is deprotonated in this pD range. Reprinted with permission from Wang, Y. X.; Freedberg, D. I.; Yamazaki, T.; Wingfield, P. T.; Stahl, S. J.; Kaufman, J. D.; Kiso, Y.; Torchia, D. A. Biochemistry 1996, 35, 9945-9950.

One example of measuring the pKa value of an ionizable residue using NMR is the determination of the pKa value of Asp26 in Escherichia coli thioredoxin.[17,25,28-30] NMR, especially 2D NMR, has been intensively employed in investigations of the pKa value of Asp26. Escherichia coli thioredoxin has two redox forms. The oxidized form has a disulfide bond linking Cys32 and Cys35, while the two cysteine residues are not bonded in the reduced form. Hence, the two cysteine residues are ionizable in the reduced form, which makes the investigations more complicated.
Asp26 is located at the bottom of a hydrophobic cavity near the active-site disulfide and is completely buried in the protein. In 1991, Dyson et al. investigated the effect of pH on thioredoxin in the vicinity of the active site using 2D NMR.[28] Both oxidized and reduced thioredoxin were studied. The CαH and CβH chemical shifts of Cys32 and Cys35, and the NH, CαH and CβH chemical shifts of Asp26, were measured as a function of pH. Those chemical shifts were found to titrate with a pKa value of 7.5. Since the cysteine residues in oxidized thioredoxin are not ionizable, they proposed that the apparent pKa is the pKa value of Asp26. In the same year, Langsetmo et al. measured the electrophoretic mobility of the wild type and the D26A mutant of oxidized thioredoxin as a function of pH. A pKa of 7.5 was obtained from their experiments.[17] In 1995, Wilson et al. measured the chemical shifts of the CβH1, CβH2 and Cβ atoms of Cys32 and Cys35 using the reduced form of thioredoxin.[30] Both the wild type and the D26A mutant were studied. Comparing the titration curves of the wild type and the D26A mutant, a titration with a pKa value > 9 was found to be missing in the D26A thioredoxin experiment. Adopting pKa values of 7.1 and 7.9, derived from Raman spectroscopy, for the cysteine residues in reduced thioredoxin, they concluded that Asp26 has an apparent pKa greater than 9. However, their results were challenged by the pKa determinations of Cys32 and Cys35 in the reduced form of thioredoxin. In 1995, Jeng et al. studied the titration behavior of Cys32 and Cys35 in the reduced form of thioredoxin by 13C NMR experiments.[29] Their pKa values were found to be 7.5 and 9.5.
These pKa values of Cys32 and Cys35 challenged the results obtained by Wilson et al. In order to elucidate the pKa value of Asp26 in reduced thioredoxin, Jeng and Dyson measured the pKa value of Asp26 in 1996 using 2D NMR.[29] The 13C chemical shift of the carboxylic group, which is bonded to the titrating site, as well as the CβH1 and CβH2 proton chemical shifts, was measured as a function of pH. The authors believed that the pH effect on the 13C chemical shift of the carboxylic group should result from titration because of its proximity to the titrating site. The apparent pKa value obtained from their experiments was between 7.3 and 7.5, which is the same as the pKa value of Asp26 in the oxidized form.

Fluorescence spectroscopy can be utilized to determine pKa values as well. Fluorescence is the emission of light by a substance when it relaxes from an electronic excited state ($S_1$) to the electronic ground state ($S_0$). In fluorescence spectroscopy, the substance is first excited from $S_0$ to one of the many vibrational states of $S_1$ by absorbing a photon. Following the excitation, relaxation to the vibrational ground state of $S_1$ occurs through collisions with other molecules. Once in the ground vibrational state of $S_1$, the substance returns to one of the many vibrational states of $S_0$ by emitting a photon. Since the substance can return to various vibrational states of the electronic ground state, a band of emission wavelengths is observed. The absorption and emission wavelengths are different (the emitted photons have a longer wavelength) and the difference in wavelength is called the Stokes shift. The average time the substance stays in its electronic excited state is called the fluorescence lifetime.
In biophysical chemistry, tryptophan fluorescence is frequently employed to study conformational changes in proteins. In general, tryptophan has a maximal absorption wavelength of 280 nm[31] and a maximal emission wavelength of 300-350 nm.[32,33] Changes in the environment of a tryptophan residue affect the emission wavelength and/or intensity. Furthermore, tryptophan fluorescence is sensitive to the polarity of the local environment. One advantage of tryptophan fluorescence spectroscopy is that the chromophore is intrinsic; no change is made to the protein. If a change in the protonation state of an ionizable residue affects the spectrum of a neighboring tryptophan residue, which is the main fluorescent species in a protein, then fluorescence spectroscopy can be employed to generate a titration curve, from which the pKa value is obtained. One example of determining a pKa value by fluorescence spectroscopy is the measurement of the pKa of Glu35 in HEWL performed by the Imoto group.[34] Trp108 is in van der Waals contact with Glu35. Changes in the protonation state of Glu35 induce a large shift in the intensity of the Trp108 fluorescence signal.

Another way of obtaining a titration curve is the potentiometric method. Potentiometric titration measures the pH as a function of the volume of titrant added. The volume of titrant added at each dosing can be used to calculate the moles of hydrogen ions released from (or bound by) a peptide or protein, and hence the number of hydrogen ions released (or bound) per molecule. Plotting the number of hydrogen ions released (or bound) per molecule as a function of pH generates a titration curve. By utilizing potentiometric titration, a titration curve of the entire peptide or protein can be obtained.
The Garcia-Moreno group has been utilizing the potentiometric method, combined with other experimental techniques and protein pKa calculations, to investigate the pKa values of ionizable residues buried deep in a protein.[18,19,21-23] As mentioned in the last section, the protein environment can shift the pKa value of an ionizable residue. In nature, a small portion of the ionizable residues are buried in deep pockets of the protein, inaccessible to water.[22,35] Those buried ionizable residues are crucial to protein functions such as catalysis[12,36] and ion or electron transport.[37,38] Determining and understanding the pKa values of buried ionizable residues is important for biological research. The Garcia-Moreno group performed site-directed mutagenesis experiments, mutating a nonpolar residue that is inaccessible to water to an ionizable residue. The pKa value of the mutated ionizable residue is determined experimentally and predicted theoretically. By combining experimental and theoretical determinations, the dielectric effect and electrostatic interactions can be elucidated.

One example of the mutagenesis experiments is the mutation of Val66 in a hyperstable variant of staphylococcal nuclease (SNase) to glutamate.[19,21] The hyperstable variant and its mutant are called PHS and PHS/V66E, respectively. The PHS nuclease is made by mutating three residues of the wild-type SNase: P117G, H124L, and S128A. Val66 is located in the core region of SNase and is inaccessible to the aqueous environment. Potentiometric titrations were performed on both PHS and PHS/V66E. The difference between the two titration curves represents the Glu66 titration plus other titrations affected by the mutation, although it is assumed that the latter effect is not significant. The difference in hydrogen ions ($\Delta\nu$) bound to PHS and PHS/V66E was fitted to the following equation,

$$\Delta\nu = \frac{10^{\,\mathrm{p}K_a - \mathrm{pH}}}{1 + 10^{\,\mathrm{p}K_a - \mathrm{pH}}} \tag{1-8}$$

where pH is the solution pH value, and pKa in this case is the pKa value of Glu66.
The pH dependence of the stability of PHS and PHS/V66E was also demonstrated by the guanidine hydrochloride denaturation free energy profiles. The Trp140 fluorescence was recorded as a probe of the denaturation. The difference in denaturation free energy profiles was also fitted nonlinearly to obtain the pKa value of Glu66. The pKa value of Glu66 was determined to be 8.8 from potentiometric titration and 8.5 from the protein stability study. The pKa shift of 4.4 (on the basis of the potentiometric measurements; glutamate has an intrinsic pKa value of 4.4) is among the largest ones for acidic ionizable residues. Once the experimental pKa value is accurately obtained, a reverse pKa prediction can be performed to investigate the dielectric constant inside the protein, which is an important parameter in the continuum electrostatic model and will be explained later in this chapter. In fact, direct potentiometric measurements were first carried out by the Garcia-Moreno group on PHS and PHS/V66K.[18] A pKa value of 6.38 was found for Lys66, while the pKa value of the lysine model compound is 10.4.

Recent site-directed mutagenesis studies on PHS have been extended to Leu38.[22] Mutations to aspartate, glutamate and lysine were conducted. Similar to their treatment of the Val66 mutations, potentiometric titration and protein denaturation experiments were conducted by the Garcia-Moreno group to determine the pKa values. For PHS/L38E, NMR was employed to facilitate the Glu38 pKa measurement. PHS/L38K showed a pKa value close to the intrinsic value of lysine. After mutation, the lysine was found to adjust its side chain to let water molecules penetrate. However, L38D and L38E showed elevated pKa values. Both Asp38 and Glu38 were still inaccessible to water, although structural rearrangement was also observed. Their pKa values were further perturbed by electrostatic interactions with surface carboxylic groups.
These investigations have unveiled how conformational changes, desolvation and electrostatic interactions affect pKa values.

1.5 Molecular Modeling

Experimental techniques such as spectroscopy are fundamental to the study of protein structure and function. For example, NMR spectroscopy is frequently employed in biological science, X-ray crystallography can be applied to resolve protein structures, and circular dichroism (CD) spectrometry is employed to determine the secondary structure of a protein. However, advances in computational power combined with leaps in theory mean that experiments are no longer the only way to understand biological molecules. Molecular modeling offers another way to investigate the structures and properties of biological molecules. It combines theories developed in the fields of physics, chemistry and biology with computer resources to simulate the behavior of molecules. Results from simulations are often compared to experimental observations in order to validate the method and to understand the behavior of biological molecules at an atomistic level.

1.6 Potential Energy Surface

Molecules in general possess more than one stable configuration. In principle, all possible molecular configurations need to be considered in order to simulate a molecule correctly. A potential energy surface (PES), which is a surface defined by the potential energies of all possible configurations, can be utilized to fulfill this requirement. The local minima of a PES indicate stable conformations of a molecule. There are multiple ways to generate a PES. Quantum mechanical calculations offer the most accurate way to construct a PES. By solving the Schrödinger equation, one can obtain the energies and wave function of the molecule.
In the field of chemistry, electronic structure theory utilizes quantum mechanics to describe the motion of electrons in the framework of the Born-Oppenheimer approximation. The Born-Oppenheimer approximation states that the electronic relaxation caused by nuclear motion is instantaneous because of the huge difference between the masses of electrons and nuclei. Thus, electronic motion and nuclear motion are decoupled. The eigenvalue of the electronic Schrödinger equation at each nuclear configuration is the potential energy of the nuclei at that geometry. Solving the Schrödinger equation at different configurations yields the PES of a molecule. However, electronic structure calculations are very expensive, which hinders the use of high levels of theory when studying large biological molecules.

Because of the cost of electronic structure methods, an alternative way to describe a PES is to use a classical mechanical model. One commonly used model is the all-atom force field, in which the PES is computed without solving the Schrödinger equation. In an all-atom force field model, no electrons are present and each atom is represented by a single particle (in contrast to the united-atom force field model, where a functional group is represented by a particle). Atoms interact with each other via bonded and non-bonded potential energy terms. Equation 1-9 shows an example of an all-atom force field model that is frequently adopted in the simulations of proteins:

$$V = \sum_{bonds}\frac{1}{2}k_b\,(b - b_0)^2 + \sum_{angles}\frac{1}{2}k_\theta\,(\theta - \theta_0)^2 + \sum_{dihedrals}\sum_{n=1}^{3}\frac{V_n}{2}\left[1 + \cos(n\phi - \gamma)\right] + \sum_{i=1}\sum_{j=i+1}\left\{\frac{q_i q_j}{4\pi\epsilon_0 r_{ij}} + 4\epsilon_{ij}\left[\left(\frac{\sigma_{ij}}{r_{ij}}\right)^{12} - \left(\frac{\sigma_{ij}}{r_{ij}}\right)^{6}\right]\right\} \tag{1-9}$$

The first three summations are bonded terms and they represent bond stretching, valence angle bending and torsions, respectively. In Eq. 1-9, bond stretching and angle bending are described by harmonic potentials. The torsion term is expressed as a Fourier series because of the periodic nature of a dihedral angle.
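The bonded terms just described can be written down directly. The following minimal Python sketch assumes the functional forms (1/2)k(x - x0)^2 for bonds and angles and (V_n/2)[1 + cos(n*phi - gamma)] for each torsion Fourier component. The function names and all parameter values are hypothetical and are not taken from any published force field.

```python
import math


def harmonic_energy(value, equilibrium, force_constant):
    """Harmonic term used for both bond stretching and angle bending."""
    return 0.5 * force_constant * (value - equilibrium) ** 2


def torsion_energy(phi, fourier_terms):
    """Torsion energy as a short Fourier series.

    fourier_terms is a list of (V_n, n, gamma) tuples: barrier height,
    periodicity and phase of each component; phi and gamma are in radians.
    """
    return sum(0.5 * v_n * (1.0 + math.cos(n * phi - gamma))
               for v_n, n, gamma in fourier_terms)


# Made-up parameters: a bond stretched by 0.1 length units and a
# threefold torsion with a barrier of 2.0 energy units.
print("bond energy:", harmonic_energy(1.1, 1.0, 300.0))
threefold = [(2.0, 3, 0.0)]
for degrees in (0, 60, 120, 180):
    phi = math.radians(degrees)
    print(f"torsion energy at {degrees:3d} deg: {torsion_energy(phi, threefold):.3f}")
```

The threefold term repeats every 120 degrees, which is the periodicity that motivates the Fourier-series form; the harmonic terms, by contrast, grow without bound away from the equilibrium value.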
The latter two summations are the non-bonded interaction terms. The two components in the double summation represent electrostatic interactions and van der Waals interactions, respectively. The electrostatic potential is represented by the Coulomb interaction, in which $q_i$ and $q_j$ are the partial charges on atoms $i$ and $j$, respectively, and $r_{ij}$ is the distance between the two atoms. In Eq. 1-9, the van der Waals interaction is calculated by the Lennard-Jones potential, in which $\epsilon_{ij}$ is the well depth and $\sigma_{ij}$ is the distance at which the repulsive and attractive potentials are equal. The solvent effect is also considered when an implicit solvent such as the Generalized Born (GB) model[39,40] is adopted (solvent models will be briefly described in the next chapter).

The cost of an all-atom force field model is low compared with ab initio methods because it utilizes predefined parameters when calculating potential energies. Those parameters are generated by fitting to experimental data and quantum mechanical calculations. One must note that the parameters are often internally consistent, which means that parameters of different force fields are in general non-transferable. All-atom force field models are utilized much more frequently than quantum mechanical methods when simulating large systems such as proteins. However, force fields such as Eq. 1-9 do not allow bond breaking or forming. Thus, they cannot be used to study reactions. Nowadays, linear-scaling techniques in electronic structure theory are being developed in order to fill the gap between force fields and the high-accuracy ab initio methods.[41,42] One example of a linear-scaling algorithm is the DivCon program developed by the Merz group.[43]

The balance between computational accuracy and cost is the main theme in computational chemistry.[44] One category of schemes attempting to achieve this balance is the so-called hybrid quantum mechanical/molecular mechanical (QM/MM) methods.
[41,45-47] The basic idea of the QM/MM methods is that different regions of a system may play different roles. For example, if one wants to study an enzymatic reaction, the potential energy calculation involving the active site should be done with a quantum mechanical model because the classical force field is not able to describe bond forming/breaking. On the other hand, the bulk water (assuming no water molecule participates in the enzymatic reaction) and the protein environment of the enzyme can be represented by the force field in order to save simulation time. In the QM/MM methodology, different regions of a system are treated at different levels of theory and interact with each other. The QM/MM approaches have become a key area in the simulation of proteins.[48,49]

1.7 Molecular Dynamics, Monte Carlo Methods and Ergodicity

Accurately simulating the behavior of a molecule requires more than knowing the PES. A molecule often has more than one minimum on the PES. Finding the correct probability distribution of molecular conformations is also important because the majority of experiments measure molecular properties as averages over molecular structures. Sampling algorithms such as molecular dynamics (MD) and the Metropolis Monte Carlo (MC) method are therefore crucial to molecular modeling.

For a system containing $N$ particles, there are $6N$ degrees of freedom (DOF). Half of the DOF come from the coordinates and the other half represent the momenta of all particles. The $6N$-dimensional space defined by those DOF is called the phase space. Both MD and MC methods sample the molecular phase space. Over time, the system generates a trajectory in the phase space. MD utilizes the equation of motion to propagate a system in the phase space (the details of molecular dynamics will be presented in the next chapter). Each particle in the system has a velocity and a position, and Newton's equation of motion (Eq. 1-10) is applied to control the dynamics:

$$\mathbf{F}_i = m_i\mathbf{a}_i = m_i\frac{d^2\mathbf{r}_i}{dt^2} \tag{1-10}$$

The force on any particle in the system is given by the negative gradient of the potential energy. The equation of motion is usually solved numerically. By propagating the equation of motion, the phase space is explored and a probability distribution for the DOF is obtained. Therefore, molecular properties can be computed by averaging over time:

$$\langle A \rangle = \lim_{t\to\infty}\frac{1}{M}\sum_{\tau=0}^{t}A(\tau) \tag{1-11}$$

In Eq. 1-11, $A$ is the property of interest, $t$ is the total simulation time, and $M$ is the size of the sample taken during the entire simulation. The bracket stands for taking the average, and $A(\tau)$ is the value of $A$ at time $\tau$ in the simulation.

In contrast to MD, the Metropolis MC method (from now on simply called the MC method unless otherwise mentioned) does not utilize the equation of motion. The MC method samples the phase space through a Markov chain (the details of the Monte Carlo method will be presented in the next chapter). In the MC algorithm, a new state (for example, a new molecular configuration) is randomly selected and the transition probability relationship between the current state and the new state is calculated by the detailed balance equation. Then a Metropolis criterion[50] is applied to accept or reject the transition to the new state. The Markov chain can be applied because the system is assumed to be at equilibrium. Likewise, after a sufficient number of transitions, the phase space is explored and molecular properties can be computed by averaging over the ensemble:

$$\langle A \rangle = \sum_{i}A_i\,\rho_i \tag{1-12}$$

Here $A_i$ is the value of $A$ in state $i$ and $\rho_i$ is the normalized probability density of state $i$. The MD and the MC methods represent two different ways of sampling phase space and computing average molecular properties. According to the ergodic hypothesis, the time average is equal to the ensemble average:

$$\langle A \rangle = \sum_{i}A_i\,\rho_i = \lim_{t\to\infty}\frac{1}{M}\sum_{\tau=0}^{t}A(\tau) \tag{1-13}$$

The ergodic hypothesis is often assumed to be true in molecular simulations.
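The Metropolis procedure outlined above can be illustrated on a toy problem. The sketch below is a minimal example under stated assumptions, not the implementation used in this work: it samples a single coordinate in a one-dimensional harmonic potential U(x) = x^2/2 with kT = 1, uses symmetric uniform trial moves, and estimates the ensemble average of the potential energy, whose exact value for this model is kT/2. The function names, step size, number of steps and random seed are arbitrary choices.

```python
import math
import random


def metropolis_average(potential, observable, kT=1.0, n_steps=200_000,
                       max_move=1.5, seed=0):
    """Estimate the ensemble average of `observable` for one coordinate x."""
    rng = random.Random(seed)
    x = 0.0
    u = potential(x)
    total = 0.0
    for _ in range(n_steps):
        x_new = x + rng.uniform(-max_move, max_move)  # symmetric trial move
        u_new = potential(x_new)
        # Metropolis criterion: accept downhill moves always and uphill
        # moves with probability exp(-(U_new - U_old) / kT).
        if u_new <= u or rng.random() < math.exp(-(u_new - u) / kT):
            x, u = x_new, u_new
        total += observable(x)  # a rejected move counts the old state again
    return total / n_steps


def harmonic(x):
    """Toy potential energy U(x) = x^2 / 2 (arbitrary units)."""
    return 0.5 * x * x


average_u = metropolis_average(harmonic, harmonic)
print(f"<U> from Metropolis MC: {average_u:.3f} (exact value kT/2 = 0.5)")
```

A sufficiently long MD trajectory of the same system coupled to a thermostat should give the same average if the ergodic hypothesis holds; for rugged energy surfaces neither method is guaranteed to converge in practical simulation lengths.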
This hypothesis makes the MD and MC methods equivalent in sampling phase space. If the system is ergodic, the phase spaces generated by MD and MC should be the same because the phase space does not depend on the sampling technique. The same should hold for any observable property.

Conformational sampling in an MD or MC simulation is essential in the study of complex systems such as polymers and proteins. One major concern is that the PES of a complex system is very rugged and contains many local energy minima.[51] Thus, kinetic trapping occurs as a result of the low rate of potential energy barrier crossing, especially when the barrier is high. In order to overcome this kinetic trapping, generalized ensemble methods (advanced sampling methods)[52,53] are frequently employed in molecular simulations. Popular generalized ensemble methods include the multicanonical algorithm,[54,55] the simulated tempering method,[56,57] the parallel tempering method[58-60] and the replica exchange molecular dynamics (REMD) method.[61,62] A more thorough description of MD, MC and the advanced sampling methods will be presented in the next chapter.

1.8 Theoretical Protein Titration Curves and pKa Calculations Using Poisson-Boltzmann Equation

Studying protein titration curves theoretically has a long history. As early as 1957, Tanford and Kirkwood presented their study of protein titration curves.[63] In their model, proteins were considered to be low-dielectric spheres with discrete unit charges on ionizable residues. They proposed that the pKa value of an ionizable residue can be calculated from its intrinsic pKa value and pairwise electrostatic interactions with other ionizable residues. Calculating the pairwise electrostatic interactions involves using empirical parameters. A protein titration curve showing the average charge as a function of pH was plotted.
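To make the idea of a titration curve built from intrinsic pKa values and pairwise interactions concrete, the following Python sketch computes the average net charge of a small set of sites as a function of pH. It is not the Tanford-Kirkwood sphere model: it simply enumerates every protonation state of the sites and Boltzmann-weights them, with pairwise interaction energies supplied as arbitrary illustrative numbers in units of kT. The function name `average_charge` and all parameter values are hypothetical.

```python
import itertools
import math

LN10 = math.log(10.0)


def average_charge(pH, pKa_intrinsic, charge_deprotonated, coupling):
    """Boltzmann-averaged net charge over all 2**N protonation states.

    pKa_intrinsic[i]       intrinsic pKa of site i
    charge_deprotonated[i] charge of site i when deprotonated (-1 acid, 0 base)
    coupling[i][j]         interaction energy, in kT, between unit charges
                           of the same sign on sites i and j
    """
    n_sites = len(pKa_intrinsic)
    partition_sum = 0.0
    charge_sum = 0.0
    for state in itertools.product((0, 1), repeat=n_sites):  # 1 = protonated
        charges = [charge_deprotonated[i] + state[i] for i in range(n_sites)]
        # Free energy (in kT) of binding protons at this pH ...
        energy = sum(state[i] * LN10 * (pH - pKa_intrinsic[i])
                     for i in range(n_sites))
        # ... plus pairwise charge-charge interactions.
        energy += sum(coupling[i][j] * charges[i] * charges[j]
                      for i in range(n_sites) for j in range(i + 1, n_sites))
        weight = math.exp(-energy)
        partition_sum += weight
        charge_sum += weight * sum(charges)
    return charge_sum / partition_sum


# Two acidic sites with intrinsic pKa 4.0 that repel each other by 2 kT
# when both are charged (illustrative numbers only).
pKas = [4.0, 4.0]
deprotonated_charges = [-1, -1]
repulsion = [[0.0, 2.0], [2.0, 0.0]]
for pH in (2.0, 4.0, 6.0, 8.0):
    print(f"pH {pH:.1f}: average charge = "
          f"{average_charge(pH, pKas, deprotonated_charges, repulsion):+.3f}")
```

Exact enumeration is only practical for a handful of sites because the number of states doubles with every site added.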
The Tanford-Kirkwood model was further extended and utilized to study lysozyme by Tanford and Roxby.[64] The equations used to generate a titration curve in the Tanford and Roxby paper were the same as those used by Tanford and Kirkwood. However, they employed an iterative approach to generate titration curves and pKa values for all ionizable residues. In their approach, each ionizable residue was initially assigned a pKa value equal to its intrinsic value. At a given pH, the average charge on each site (representing the fraction of deprotonation/protonation) can be computed. Those average charges were then employed to update the pKa values. This process was repeated until a self-consistent average charge and pKa value were obtained for each site. A titration curve can then be produced by plotting the average charge as a function of pH.

In 1990, Bashford and Karplus utilized the finite difference Poisson-Boltzmann (FDPB) equation in the calculation of pKa values.[65] A detailed description of the FDPB method will be presented in the next chapter. The pKa shift of an ionizable residue relative to a model compound is calculated (in their paper, the intrinsic pKa is defined as the pKa value of an ionizable residue when all other sites are neutral, that is, when there are no interactions between ionizable sites). Given a molecular configuration, three terms are calculated by the FDPB equation for each ionizable site: the Born solvation free energy, the pairwise electrostatic interactions with non-ionizable residues (represented by partial charges), and the pairwise electrostatic interactions between ionizable sites. Summing the three terms yields the electrostatic work of charging the ionizable side chain, and hence the pKa shift. A protein titration curve is represented by plotting the fraction of protonation vs pH.
Consider a protein with $N$ ionizable sites, each of which can have two states (protonated and deprotonated). There are $2^N$ possible macrostates and each macrostate can be represented by an $N$-dimensional vector. Once the FDPB equation is solved, the free energy difference of each vector relative to the completely deprotonated state is computed. Thus, the fraction of protonation of an ionizable site can be calculated by taking the Boltzmann-weighted average over the $2^N$ macrostates.

The FDPB method forms the foundation of the continuum electrostatic (CE) models, which are frequently utilized when studying protein pKa values.[16,65-71] The FDPB method has been implemented in many modeling software packages such as UHBD[72] and DELPHI.[73] Many modifications have been made to improve its performance. In 1991, Beroza et al. employed the Metropolis MC method to sample the $2^N$ protonation states instead of calculating the protonation fraction at a given pH directly.[74] With MC sampling of protonation states, the number of ionizable residues included in the simulation can increase dramatically.

Solving the FDPB equation requires the dielectric constant of the protein as an input parameter, and the dielectric constant is very important because the electrostatic energy is inversely proportional to it. It is considered the most important adjustable parameter in FDPB-based pKa calculations.[16] Thus, one question arising from the FDPB method is how to choose the dielectric constant for proteins. Values between 4 and 20 are typically adopted in FDPB calculations.[67] Direct experimental determination of the interior dielectric constant is extremely difficult. In practice, protein dielectric constants are measured using protein powders, which causes problems in interpreting the resulting dielectric constants.[18,75,76] Research has been performed to find an optimal interior dielectric constant for protein pKa predictions.
Because the environment differs from site to site, no single dielectric constant can reproduce the experimental pKa values of both internal and surface residues in a protein. [77] In 1996, Simonson and Brooks studied the charge screening effect and the protein dielectric constant by MD simulations. [78] They found that the protein dielectric constant can range from ~4 in the protein interior to a much higher value (~30) in the region near the surface. As mentioned in section 1.4, the Garcia-Moreno group conducted site-directed mutagenesis experiments in a deep pocket of a protein where water is inaccessible and measured the pKa value of the mutated ionizable residue. [18,19,21-23,77] The experimental pKa value was then put back into the FDPB equation in order to examine the protein interior dielectric constant, which was found to be ~11. [18] Mehler and his co-worker employed a sigmoidally screened electrostatic interaction to treat the protein dielectric environment. [79,80] Their method was applied to Glu35 and Asp66 in hen egg white lysozyme and gave satisfactory results. [80]

Another problem in FDPB-based pKa calculations is that the FDPB equation is often solved on the basis of a single structure, such as an X-ray crystal structure. The entropic effect is missing when a single structure is used. To improve the performance of the CE model in pKa calculations, protein conformational sampling is also considered in order to incorporate conformational flexibility. [81-86] In the 1990s, You and Bashford developed an algorithm in which 36 side chain conformations of ionizable residues are adopted in the calculation of pKa values. [86] In 1997, Alexov and Gunner proposed using the Monte Carlo method to sample $2^N M^N L^K$ possible states instead of just $2^N$ protonation states. [81] Here N is the number of ionizable residues, each of which can have M possible conformations.
Furthermore, each of the K non-ionizable residues possesses L possible conformations. The Gunner group further extended this algorithm to the so-called multiconformation continuum electrostatic (MCCE) method. [83] Recently, Barth et al. proposed a rotamer repacking technique combined with the FDPB method, named FDPB_MF. [82] In the FDPB_MF method, the conformational space of the side chains of ionizable residues was defined by a rotamer probability distribution. Each rotamer was given a weight and interacted with other ionizable residues in a mean-field scheme.

1.9 Computing pKa Values by Free Energy Calculations

MD-based free energy (MDFE) calculations [87,88] have also been employed to predict pKa values. MDFE calculations combine free energy calculation algorithms with MD propagation. MD propagation samples phase space and generates a conformational ensemble. Free energy calculation methods compute the free energy difference between two states on the basis of the phase space sampled by MD. Free energy perturbation (FEP) and thermodynamic integration (TI) are two frequently employed free energy calculation methods and will be explained in more detail in the next chapter. Free energy calculation algorithms such as FEP and TI can be used to compute pKa values because the pKa is associated with the free energy of the reaction.

Early pKa calculations utilizing free energy calculations were conducted by Warshel et al., [89,90] Jorgensen et al., [91] and Merz [92] with the FEP method and classical force fields. In the 1980s, Warshel et al. proposed a protein dipole Langevin dipole (PDLD) model for pKa calculations. [90] In the PDLD model, proteins were treated as particles having partial charges and polarizable dipoles, while the nearby solvent molecules were viewed as Langevin dipoles. The bulk water far away from ionizable residues was still treated as a dielectric continuum.
Electrostatic interactions between charges and dipoles, and between dipoles and dipoles, were computed. Jorgensen et al. combined ab initio quantum mechanical calculations and classical FEP calculations in 1989. [91] They calculated the pKa difference between two acids. The gas-phase dissociation free energies of the two acids were computed by quantum mechanical methods. The solvation free energy calculations were conducted using the MC FEP method for the neutral molecules and the anions. One shortcoming of their calculations is that only small organic molecules were investigated, due to the computational cost of quantum mechanical methods. In 1991, Merz performed classical FEP calculations for three glutamate residues in two proteins (HEWL and human carbonic anhydrase II). [92] The glutamate dipeptide was utilized as a model compound to eliminate the gas-phase dissociation free energy calculations.

When MDFE calculations utilizing classical force fields are performed, quantum effects such as bond forming/breaking cannot be simulated. Thus, the pKa shift of an ionizable residue relative to its intrinsic pKa value (the pKa value of the reference compound, which is defined in section 1.3 of this dissertation) is computed by the free energy calculations. A diagrammatic explanation of the pKa shift calculation utilizing the MDFE method is given in Figure 1-7 and Figure 1-8.

Figure 1-7. Thermodynamic cycle used to compute the pKa shift. Both acid dissociation reactions occur in aqueous solution. A thermodynamic cycle is a series of thermodynamic processes that eventually returns to the initial state. A state function, such as the reaction free energy in this case, is path independent and hence unchanged through a cyclic process.

Figure 1-8. In Figure 1-7 and Figure 1-8, protein-AH represents the ionizable residue in the protein environment and AH represents the reference compound, which is usually the ionizable residue with its two termini capped.
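In practice, a proton does not disappear but instead becomes a dummy atom. The proton retains its position and velocity. The bonded interactions involving the proton remain in effect; however, the proton has no non-bonded interactions. The change in protonation state is reflected by changes in the partial charges of the ionizable residue. Equations 1-14 to 1-20 explain how pKa values are computed from free energy calculations using force fields:

$$ \mathrm{p}K_{a,\mathrm{protein}} = \frac{1}{2.303\,k_B T}\,\Delta G_{\mathrm{protein}} \tag{1-14} $$

$$ \mathrm{p}K_{a,\mathrm{ref}} = \frac{1}{2.303\,k_B T}\,\Delta G_{\mathrm{ref}} \tag{1-15} $$

In Eq. 1-14 and Eq. 1-15, $\Delta G_{\mathrm{protein}}$ and $\Delta G_{\mathrm{ref}}$ are the acid dissociation reaction free energies of the ionizable residue in the protein and of the reference compound, respectively. Therefore, the pKa shift between the ionizable residue in the protein environment and the reference compound can be calculated as

$$ \Delta \mathrm{p}K_a = \frac{1}{2.303\,k_B T}\left(\Delta G_{\mathrm{protein}} - \Delta G_{\mathrm{ref}}\right) $$

According to the thermodynamic cycle shown in Figure 1-7, $\Delta G_{\mathrm{protein}} - \Delta G_{\mathrm{ref}} = \Delta G_1 + \Delta G_2 + \Delta G_3$. Here, $\Delta G_1$ and $\Delta G_2$ are the free energy differences between the two protonated species and between the two deprotonated species, respectively. $\Delta G_3$ is equal to zero because the free energy difference between two protons that are in the same environment is zero. However, calculating $\Delta G_1$ and $\Delta G_2$ directly with MDFE calculations is not preferable because the difference between the reference compound and the protein system is very large. A simpler way to determine $\Delta G_1 + \Delta G_2$ is needed. Therefore, the thermodynamic cycle shown in Figure 1-8 is employed. By utilizing that thermodynamic cycle, $(\Delta G_1 + \Delta G_2)$ can be expressed as $\left(\Delta G(\text{protein-AH} \to \text{protein-A}^-) - \Delta G(\text{AH} \to \text{A}^-)\right)$, where $\Delta G(\text{protein-AH} \to \text{protein-A}^-)$ and $\Delta G(\text{AH} \to \text{A}^-)$ are the free energy differences between the protonated and deprotonated ionizable residue in the protein and in the reference compound, respectively. They can be further expressed as:

$$ \Delta G(\text{protein-AH} \to \text{protein-A}^-) = \Delta G_{\mathrm{MM}}(\text{protein-AH} \to \text{protein-A}^-) + \Delta G_{\mathrm{QM}}(\text{protein-AH} \to \text{protein-A}^-) \tag{1-16} $$

and

$$ \Delta G(\text{AH} \to \text{A}^-) = \Delta G_{\mathrm{MM}}(\text{AH} \to \text{A}^-) + \Delta G_{\mathrm{QM}}(\text{AH} \to \text{A}^-) \tag{1-17} $$

In Eq. 1-16 and Eq. 1-17, the MM in the subscripts denotes the free energy differences that are calculated with classical force fields.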
The quantum mechanical contributions (labeled by QM in the subscripts) to the free energy difference of an ionizable residue in the protein environment and of its reference compound are assumed to be the same:

$$ \Delta G_{\mathrm{QM}}(\text{protein-AH} \to \text{protein-A}^-) = \Delta G_{\mathrm{QM}}(\text{AH} \to \text{A}^-) \tag{1-18} $$

Combining all derivations with this assumption, the difference between the two acid dissociation reaction free energies can be written as:

$$ \Delta G_{\mathrm{protein}} - \Delta G_{\mathrm{ref}} = \Delta G_{\mathrm{MM}}(\text{protein-AH} \to \text{protein-A}^-) - \Delta G_{\mathrm{MM}}(\text{AH} \to \text{A}^-) \tag{1-19} $$

Thus, subtracting Eq. 1-15 from Eq. 1-14 yields:

$$ \mathrm{p}K_{a,\mathrm{protein}} = \mathrm{p}K_{a,\mathrm{ref}} + \frac{1}{2.303\,k_B T}\left[\Delta G_{\mathrm{MM}}(\text{protein-AH} \to \text{protein-A}^-) - \Delta G_{\mathrm{MM}}(\text{AH} \to \text{A}^-)\right] \tag{1-20} $$

$\Delta G_{\mathrm{MM}}(\text{protein-AH} \to \text{protein-A}^-)$ and $\Delta G_{\mathrm{MM}}(\text{AH} \to \text{A}^-)$ are computed by MDFE calculations (for example, TI). A more detailed description of the MDFE methodology and of how to compute these two quantities is given in the next chapter.

An example of using classical force field MDFE calculations to study pKa values is given by Simonson et al. [15] The pKa values of Asp20 (experimental pKa of 2, which is lower than the intrinsic Asp pKa value) and Asp26 (experimental pKa of 7.5) in thioredoxin, and of Asp14 (with an experimental pKa around 4) in ribonuclease A, were evaluated by TI calculations. The aspartate dipeptide was taken as the model compound; both explicit and implicit water models were used in their simulations. Proton dissociation was represented by changes in the partial charges of the carboxylic group only. The free energy change caused by the disappearance of the proton van der Waals interaction was not considered because the van der Waals radius of the proton in aspartate is zero in the AMBER force field. Correct protonation free energies were obtained, and the entropic and enthalpic effects were also reproduced correctly. However, several problems have also been found with MDFE-based pKa calculations. For example, interactions between ionizable sites cannot be incorporated directly. Furthermore, the computed free energy differences depend on the force fields and solvation models. Hybrid quantum mechanical/molecular mechanical (QM/MM) methods can be coupled with free energy calculation simulations.
[48,93] Recently, the Cui group has conducted pKa calculations using FEP calculations coupled with the SCC-DFTB method. [94,95] A detailed description of QM/MM free energy calculations of pKa values can be found in a recent review by Kamerlin et al. [48]

1.10 pKa Prediction Using Empirical Methods

Empirical models are also employed to study protein pKa values. According to Lee and Crippen, [16] the seemingly most accepted empirical method is PROPKA, which was developed by the Jensen group. [96-101] The PROPKA method uses 30 parameters obtained from 314 residues in 44 proteins. QM calculations and the effective fragment potential (EFP) method, [102,103] which is a QM/MM method, were employed to generate those parameters. In the PROPKA method, a pKa value is obtained by applying perturbations to model pKa values. Three types of perturbations are considered: hydrogen bonding, the desolvation effect, and charge-charge interactions. A detailed description of the PROPKA method can be found in a review by Jensen et al. [97]

1.11 Constant pH Molecular Dynamics (Constant pH MD) Methods

Traditionally, MD simulations have been performed at constant protonation state. The protonation state of each ionizable residue is assigned before an MD simulation is started, and the protonation states are not allowed to change during MD propagation. Performing constant protonation state MD simulations requires knowing the pKa values of all ionizable residues beforehand. Not knowing a pKa value may result in a wrong assignment of the protonation state. In addition, if pKa values are near the solution pH, both protonation states are significantly populated, and a constant protonation state MD simulation cannot reflect this situation. More importantly, constant protonation state MD simulations cannot be employed to study the coupling between conformations and protonation states. Thus, constant pH MD algorithms were developed in order to correlate protein conformation and protonation state.
[104] The purpose of constant pH MD is to describe the protonation equilibrium correctly at a given pH value. Therefore, its applications include pKa predictions and studies of pH effects.

One category of constant pH MD methods uses a continuous protonation parameter. [105-115] Earlier models include a grand canonical MD algorithm developed by Mertz and Pettitt in 1994 [115] and a method introduced by Baptista et al. in 1997. [106] In the Mertz and Pettitt model, protons are allowed to be exchanged between a titratable side chain and water molecules. Baptista et al. used a potential of mean force to treat protonation and conformation simultaneously. Later, Börjesson and Hünenberger developed a continuous protonation variable model in which the protonation fraction is adjusted by weak coupling to a proton bath, using an explicit solvent. [107,108] More recently, the continuous protonation state model has been further developed by the Brooks group. [109-114] They developed a constant pH MD algorithm named continuous constant pH molecular dynamics (CPHMD). In the CPHMD method, Lee et al. [114] applied $\lambda$ dynamics [116] to the protonation coordinate and used the Generalized Born (GB) [40,117] implicit solvent model. They chose a variable $\lambda$ to control the protonation fraction and introduced an artificial potential barrier between the protonated and deprotonated states. This potential is a biasing potential that increases the residency time close to the protonated/deprotonated states, and it is centered halfway through the titration ($\lambda = 1/2$). The CPHMD method was then extended by incorporating an improved GB model and the REMD algorithm for better sampling. The applications of CPHMD and replica exchange CPHMD include predicting pKa values of various proteins, [110,114] studying proton tautomerism [109] and pH-dependent protein dynamics such as folding [112,113] and aggregation.
[111] In addition to continuous protonation state models, discrete protonation state methods have also been developed to study the pH dependence of protein structure and dynamics. [118-131] The discrete protonation state models utilize a hybrid molecular dynamics and Monte Carlo (hybrid MD/MC) method. Protein conformations are sampled by molecular dynamics, and protonation states are sampled periodically during the MD simulation using a Monte Carlo scheme. A new protonation state is selected after a user-defined number of MD steps, and the free energy difference between the old and the new state is calculated. The Metropolis criterion is used to accept or reject the protonation change. Various solvent models and protonation state energy algorithms have been used in discrete protonation state constant pH MD simulations.

Bürgi et al. [130] presented a constant pH MD method using the discrete protonation state model and applied it to hen egg white lysozyme (HEWL). The lysozyme was dissolved in explicit water. Short TI calculations (20 ps of dynamics) were carried out to provide the classical free energy difference between the old and new protonation states at each MC attempt. The MC move is evaluated based on the following free energy difference:

$$ \Delta G = k_B T\,(\mathrm{pH} - \mathrm{p}K_{a,\mathrm{ref}})\ln 10 + \Delta G_{\mathrm{TI,protein}} - \Delta G_{\mathrm{TI,ref}} \tag{1-21} $$

In the above equation, pH is a parameter that represents the pH value of the solution, $\mathrm{p}K_{a,\mathrm{ref}}$ is the pKa value of the model compound (reference compound), and $\Delta G_{\mathrm{TI,protein}}$ and $\Delta G_{\mathrm{TI,ref}}$ are the classical force field free energies of the protonation change given by TI for the protein and the reference compound, respectively. One pitfall of the method developed by Bürgi et al. is the choice of the TI simulation time. The 20 ps TI calculation represents neither a single-structure protonation free energy nor an average over the entire ensemble.

The Baptista group proposed a constant pH MD method that uses the FDPB method to calculate protonation energies, with the MD performed in explicit solvent.
[118,123-126] The MD propagations are conducted at fixed protonation states. The MC moves in the protonation states are performed at fixed molecular configurations. The MD propagation generates a conditional PDF of coordinates and momenta given the protonation states, while the MC sampling yields a conditional PDF of protonation states given the molecular configuration. Baptista et al. proved that the hybrid MD and MC method generates an ergodic Markov chain. [118] Hence, the conditional probability distributions yielded by MD and MC generate a joint probability distribution satisfying the semigrand canonical ensemble. The work of Baptista et al. provides the theoretical justification for combined MD and MC sampling in the discrete protonation state constant pH methods. In practice, MD simulations are conducted in explicit water to sample conformational space. A new protonation state is selected, and the free energy difference is calculated using the structure at that moment and the continuum electrostatic model. The MC transition is evaluated and, if the move is accepted, a short MD run is performed to relax the solvent. After solvent relaxation, MD steps continue for solute and solvent. The Baptista group applied their constant pH MD method to the study of the protonation-conformation coupling effect, [123] the pH-dependent conformational states of kyotorphin, [124] pKa predictions of HEWL, [125] and the redox titration of cytochrome c3. [126]

Walczak and Antosiewicz also employed the FDPB method to determine protonation energies, but they used Langevin dynamics to propagate coordinates between MC steps. [128] This method was further extended by Dlugosz and Antosiewicz. [119-122,128] The extended method combines conventional MD simulation using the analytical continuum electrostatic (ACE) [132] scheme to sample conformations with the FDPB method for the MC moves.
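The hybrid MD/MC cycle shared by the discrete protonation state methods can be sketched as follows. This is a toy illustration under stated assumptions, not any group's implementation: the MD segment between MC attempts is omitted, the free energies `dg_protein` and `dg_ref` are fixed hypothetical numbers rather than values recomputed from the current structure, and all free energies are written in the protonation direction so that the transition free energy has the form of Eq. 1-21.

```python
import math
import random

KB = 0.0019872041  # kcal/(mol K)


def protonation_free_energy(ph, pka_ref, dg_protein, dg_ref, temperature):
    """Free energy of protonating the site at the given pH (Eq. 1-21 form).

    dg_protein and dg_ref are hypothetical classical free energies of
    protonation in the protein and in the reference compound (kcal/mol).
    """
    kt = KB * temperature
    return kt * math.log(10.0) * (ph - pka_ref) + dg_protein - dg_ref


def metropolis_accept(delta_g, temperature, rng):
    """Standard Metropolis test on a free energy difference."""
    if delta_g <= 0.0:
        return True
    return rng.random() < math.exp(-delta_g / (KB * temperature))


def sample_protonation(ph, pka_ref, dg_protein, dg_ref,
                       temperature=300.0, n_attempts=200000, seed=0):
    """Return the fraction of MC attempts after which the site is protonated."""
    rng = random.Random(seed)
    dg_prot = protonation_free_energy(ph, pka_ref, dg_protein, dg_ref, temperature)
    protonated = True
    count = 0
    for _ in range(n_attempts):
        # A real simulation would run MD steps here and recompute dg_protein
        # for the current conformation before attempting the MC move.
        delta_g = -dg_prot if protonated else dg_prot
        if metropolis_accept(delta_g, temperature, rng):
            protonated = not protonated
        count += protonated
    return count / n_attempts


if __name__ == "__main__":
    for ph in (3.0, 4.0, 5.0):
        print(ph, round(sample_protonation(ph, 4.0, 0.0, 0.0), 3))
```

With `dg_protein` equal to `dg_ref`, the sampled fraction should approach the Henderson-Hasselbalch value for the reference pKa, which gives a quick sanity check of the sign convention.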
Succinic acid [119] and a heptapeptide derived from ovomucoid third domain (OMTKY3) [122] have been studied by Dlugosz and Antosiewicz. This heptapeptide corresponds to residues 26-32 of OMTKY3 and has the sequence acetyl-Ser-Asp-Asn-Lys-Thr-Tyr-Gly-methylamine. Nuclear magnetic resonance (NMR) experiments indicated that the pKa of Asp is 3.6, [122] 0.4 pKa unit lower than the value of the blocked Asp dipeptide. In their studies, conventional molecular dynamics (MD) simulations were carried out to sample peptide conformations. Their method predicted the pKa to be 4.24.

Mongan et al. developed a method combining the GB model and the discrete protonation state model and implemented it in the AMBER simulation suite. [127] In their method, the GB model is used for the protonation state energy calculations as well as for the solvation free energy calculations. Therefore, the solvent models in conformational and protonation state sampling are consistent, and the computational cost is small. More recently, the accelerated molecular dynamics (AMD) [133,134] method was combined with the constant pH algorithm to enhance conformational sampling. [129] This model has been utilized to calculate pKa values of an enzyme and to explore the protonation-conformation coupling. The continuous protonation state model developed by the Brooks group and the discrete protonation state models proposed by Baptista et al. and by Mongan et al. will be further explained in chapter 2.

CHAPTER 2 THEORY AND METHODS IN MOLECULAR MODELING

Molecular modeling, or molecular simulation, is a way to study molecules using theories developed in the fields of physics, chemistry and biology, coupled with computer resources. With the development of computer power and parallel computation, molecular modeling is more and more often involved in research in biology, chemistry and physics. [42] Understanding the underlying theory and methods of molecular modeling is necessary in order to perform simulations and analyze the data generated.
In this chapter, the basic theory and methods of constant pH replica exchange molecular dynamics and of protein pKa calculations are described.

2.1 Potential Energy Functions and Classical Force Fields

2.1.1 Potential Energy Surface

Molecular modeling studies molecules, which in general possess more than one configuration for a given chemical formula. In principle, all possible molecular configurations need to be considered in order to simulate a molecule correctly. A potential energy surface (PES), which is a surface defined by the potential energies of all possible configurations, can be utilized to fulfill this requirement. The concept of the PES is a result of the Born-Oppenheimer approximation. The Born-Oppenheimer approximation states that the electronic relaxation caused by nuclear motion is instantaneous because of the huge difference between the masses of electrons and nuclei. Thus, electronic motion and nuclear motion are decoupled. The electronic energy, which is computed at a fixed nuclear geometry (molecular structure), is the potential energy of the nuclei at that structure. Local minima on the PES indicate stable conformations of a molecule. Quantum mechanics forms the foundation for understanding molecular behavior and offers the most accurate way to construct a PES. Ideally, the Schrödinger equation is solved for the electronic energy at all possible nuclear configurations, which yields the PES of a molecule.

2.1.2 Force Field Models

Although quantum mechanical calculations generate very accurate energies, performing a molecular simulation using quantum mechanical methods is too time consuming, even with parallel computation, especially for large systems such as polymers and proteins. Force field (equivalent to molecular mechanics) models have been designed to solve this problem. Force field models ignore electrons and calculate the potential energy of a system based on the nuclear geometry only.
Force field calculations are fast because the potential energy functions are simple and parameterized. In a force field model, the potential energy of a system generally has the following contributions: bond stretching (vibration), angle bending, bond rotation (torsion), electrostatic interaction and the van der Waals interaction. The first three contributions are often called the bonded interactions, and the last two belong to the non-bonded interactions.

In many force field models such as the AMBER force field, [135] the bond stretching energy between atoms i and j is the second-order truncation of the Taylor expansion of the potential energy function about the equilibrium distance and hence can be formulated as a harmonic potential:

$$ E_{\mathrm{bond}}(r_{ij}) = \frac{1}{2}\,k_{ij}\left(r_{ij} - r_{ij}^{\mathrm{eq}}\right)^2 \tag{2-1} $$

where $k_{ij}$ is the force constant, $r_{ij}$ is the distance between the two atoms and $r_{ij}^{\mathrm{eq}}$ is the equilibrium distance between the two atoms. One drawback of this function is that a bond cannot be broken: the energy becomes infinite when the two atoms are infinitely far apart. Therefore, such a potential energy function can be applied to bond stretching near the equilibrium distance only. The simplest remedy is to include higher-order terms of the Taylor expansion, but this increases the computation time. For example, expansions up to the fourth order are adopted in the general organic force field MM3. [136] This Taylor expansion strategy is also employed in deriving angle bending potential functions. Torsions (or dihedral angles) are periodic and hence a Fourier series is adopted as the torsion potential energy function. One example of the torsion potential energy formula is displayed in Eq. 1-9.

The van der Waals interaction in a force field model should be able to reproduce the repulsion and attraction between two particles having no permanent charges. The attractive interaction is generally called dispersion.
Quantum mechanics indicates that the dispersion energy is inversely proportional to the sixth power of the distance between two particles (say atoms) i and j (under the dipole-dipole interaction approximation): [137]

$$ E_{\mathrm{disp}}(r_{ij}) = -\frac{C_{ij}}{r_{ij}^{6}} \tag{2-2} $$

where $C_{ij}$ is a constant specific to i and j and $r_{ij}$ is the distance between i and j. There is no theoretical derivation for the repulsive interaction. However, for computational simplicity, the repulsive energy is taken to be inversely proportional to the twelfth power of the distance. A simple way to combine the repulsive and attractive potentials is to add the two. Thus, the van der Waals interaction is governed by the Lennard-Jones potential shown in Eq. 1-9. Because the van der Waals interaction decays very fast as a function of inter-particle distance, it is often called a short-range interaction.

The simplest model of electrostatic interaction is the point charge model, which is adopted in the AMBER force field. Partial charges are assigned to each atom and used to calculate the interaction energy. More complicated models, such as calculating the electrostatic energy through dipole moment-dipole moment interactions, have also been employed. [137]

Bond, angle and torsion interactions are coupled. Thus, the coupling effects (cross terms) should be incorporated into force fields. Mathematically, cross terms are generated from multi-dimensional Taylor expansions. For example, the angle bending accompanied by two bond stretching motions (shown in Figure 2-1) is formulated (as in MM3) as:

$$ E_{\mathrm{sb}} = \frac{1}{2}\,k_{r\theta}\left[\left(r_1 - r_1^{\mathrm{eq}}\right) + \left(r_2 - r_2^{\mathrm{eq}}\right)\right]\left(\theta - \theta^{\mathrm{eq}}\right) \tag{2-3} $$

where $r_1$ and $r_2$ are the two bond lengths, $\theta$ is the angle between the two bonds, $k_{r\theta}$ is the coupling constant, and the superscript eq denotes equilibrium values.

Figure 2-1. A diagram showing bond stretching coupled with angle bending. A cross term calculating the coupling energy is adopted when evaluating the total potential energy.

A force field is simply a functional form together with its corresponding parameters. Thus, obtaining parameters is crucial for force field development. Given a potential energy function, the parameters are required to reproduce experimental data or quantum mechanical results as closely as possible.
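The individual energy terms discussed in this section are simple enough to be written in a few lines. The sketch below is an illustration under stated assumptions, not the AMBER implementation: it uses the harmonic bond of Eq. 2-1, a 12-6 Lennard-Jones term written with generic coefficients `a` and `b`, and a point charge Coulomb term. The function names, the numerical parameters, and the unit convention (kcal/mol, Angstrom, elementary charges) are hypothetical choices made for the example.

```python
COULOMB_CONSTANT = 332.0636  # kcal*Angstrom/(mol*e^2), a commonly quoted value


def harmonic_bond(r, k, r_eq):
    """Harmonic bond stretching energy, Eq. 2-1."""
    return 0.5 * k * (r - r_eq) ** 2


def lennard_jones(r, a, b):
    """12-6 Lennard-Jones energy: r^-12 repulsion plus r^-6 dispersion."""
    return a / r ** 12 - b / r ** 6


def lj_coefficients(epsilon, sigma):
    """Convert well depth and zero-crossing distance into (a, b) coefficients."""
    return 4.0 * epsilon * sigma ** 12, 4.0 * epsilon * sigma ** 6


def coulomb(r, qi, qj):
    """Point charge electrostatic energy between two partial charges."""
    return COULOMB_CONSTANT * qi * qj / r


if __name__ == "__main__":
    a, b = lj_coefficients(epsilon=0.1, sigma=3.4)  # hypothetical parameters
    for r in (3.0, 3.4, 3.8, 5.0, 8.0):
        print(f"r = {r:3.1f}  LJ = {lennard_jones(r, a, b):8.4f}  "
              f"Coulomb = {coulomb(r, 0.4, -0.4):8.3f}")
    # The harmonic bond keeps rising with stretch and never dissociates.
    print(harmonic_bond(1.2, 300.0, 1.0), harmonic_bond(3.0, 300.0, 1.0))
```

Printing the two non-bonded terms side by side makes the difference in range visible: the Lennard-Jones energy is essentially zero by 8 Angstrom, whereas the Coulomb energy decays only as 1/r.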
2.1.3 Protein Force Field Models

Computer simulations of biological molecules often involve thousands of atoms or even more, [138] especially when using explicit solvent models. Many simulations of proteins use force fields to reduce the computational cost. Popular protein force fields include (but are not limited to) the AMBER 99SB, [139] CHARMM 22, [140] GROMOS 96 [141] and OPLS force fields. [142] In general, a simple potential energy function like Eq. 1-9 is employed in the protein force fields.

Protein force field parameters are in general optimized on the basis of small molecules. Take the AMBER force field (Eq. 1-9) as an example; it contains bonded and non-bonded terms. In the non-bonded terms, the partial charges are fitted to quantum mechanical calculations at the Hartree-Fock/6-31G* level of theory in vacuum. This level of theory typically overestimates the dipole moment, and hence the resulting partial charges can satisfactorily approximate the condensed-phase charge distribution. The Lennard-Jones parameters have been obtained by reproducing liquid properties, following the work of Jorgensen et al. [142] After the partial charges are assigned, the Lennard-Jones parameters are fitted to reproduce experimental data such as the heat capacity, the liquid density, and the heat of vaporization. The bond stretching and angle bending parameters are derived by fitting to structural and vibrational experimental data of small molecules that make up proteins. The bond and angle parameters should ensure that the geometries of simple protein fragments are close to experimental data. The torsion (dihedral angle) parameters can be obtained from quantum mechanical conformational energy calculations. Determining the torsion parameters is often the last step of force field parameter optimization.
Given the previously obtained parameter sets for the individual energy terms, the torsion parameters are adjusted to best fit quantum mechanical conformational energies, for example, the Ramachandran plot of a model compound. Detailed descriptions of protein force field parameter determination can be found in the papers of Cornell et al., [143] MacKerell et al. [140] and Hornak et al. [139]

2.2 Molecular Dynamics (MD) Method

2.2.1 MD Integrator

As mentioned in the introduction, MD samples phase space utilizing the equations of motion. A trajectory in phase space is generated over time. The ergodic hypothesis is assumed to be true, that is, the time average of any property at equilibrium is equivalent to the ensemble average. Thus, given a set of initial positions and momenta and a method to compute forces, an MD simulation can be applied to any system. For a simple system such as a harmonic oscillator moving along one axis, there exists an analytical solution for the trajectory (the coordinate and momentum as functions of time can be expressed analytically). However, an analytical solution cannot be obtained for complex systems such as polymers or proteins. Therefore, numerical integrators are implemented to propagate the positions and velocities of particles. One of the frequently used integrators is the leap-frog algorithm: [41,144]

$$ q(t + \Delta t) = q(t) + v\!\left(t + \tfrac{1}{2}\Delta t\right)\Delta t \tag{2-4} $$

$$ v\!\left(t + \tfrac{1}{2}\Delta t\right) = v\!\left(t - \tfrac{1}{2}\Delta t\right) + a(t)\,\Delta t \tag{2-5} $$

$$ a(t) = \frac{F(t)}{m} = -\frac{1}{m}\,\frac{\partial U(t)}{\partial q} \tag{2-6} $$

Here, q and v stand for the position and velocity of a particle, respectively; m is its mass; a(t), F(t) and U(t) represent the acceleration, the force and the potential energy at time t; and $\Delta t$ is the time step used in the MD simulation. One frequently employed potential energy function is the force field model introduced in the previous section. According to Eq. 2-4, 2-5 and 2-6, the leap-frog algorithm propagates positions and velocities in a coupled way.
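A minimal sketch of the leap-frog update of Eq. 2-4 to 2-6 is given below for a one-dimensional particle. It is an illustration only: the function name `leapfrog`, the start-up estimate of the first half-step velocity, and the harmonic oscillator test system are assumptions made for this example. The on-step velocity is taken as the average of the two neighbouring half-step velocities.

```python
import math


def leapfrog(q0, v0, force, mass, dt, n_steps):
    """Propagate one particle in one dimension with the leap-frog algorithm.

    Returns a list of (position, velocity) pairs at times 0, dt, 2*dt, ...
    """
    q = q0
    # Start-up assumption: estimate v(t - dt/2) from the initial velocity.
    v_half = v0 - 0.5 * dt * force(q0) / mass
    trajectory = []
    for _ in range(n_steps):
        a = force(q) / mass              # Eq. 2-6
        v_prev = v_half
        v_half = v_prev + a * dt         # Eq. 2-5: v(t + dt/2)
        v_now = 0.5 * (v_prev + v_half)  # on-step velocity from half steps
        trajectory.append((q, v_now))
        q = q + v_half * dt              # Eq. 2-4: q(t + dt)
    return trajectory


if __name__ == "__main__":
    # Harmonic oscillator with k = m = 1, so the exact solution is q = cos(t).
    traj = leapfrog(1.0, 0.0, lambda q: -q, 1.0, 0.01, 1000)
    q_last, v_last = traj[-1]
    t_last = 999 * 0.01
    print(q_last, math.cos(t_last))
    print(v_last, -math.sin(t_last))
```

Comparing against the analytical harmonic oscillator solution is a convenient check because the total energy should stay very close to its initial value for a sufficiently small time step.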
The velocity at time t can be calculated from the velocities at $t + \tfrac{1}{2}\Delta t$ and $t - \tfrac{1}{2}\Delta t$ by the following equation:

$$ v(t) = \frac{1}{2}\left[v\!\left(t + \tfrac{1}{2}\Delta t\right) + v\!\left(t - \tfrac{1}{2}\Delta t\right)\right] \tag{2-7} $$

One important issue in MD propagation is choosing a proper time step that balances the speed of propagation against the accuracy of the simulation. Too small a time step wastes simulation time sampling the same conformation, whereas too large a time step can bring two atoms too close together and hence cause instability of the trajectory. In general, the time step is chosen to be about a tenth of the period of the fastest motion. In biological molecules, the fastest motion is bond stretching, in particular for bonds involving hydrogen atoms. Thus, one way to increase the time step without reducing accuracy is to remove the degrees of freedom having the highest frequency. One commonly employed algorithm to achieve this goal is the SHAKE algorithm. [145] When the SHAKE algorithm is used to remove the heavy atom-hydrogen degrees of freedom, the heavy atom-hydrogen bond lengths are fixed. The fixed bond lengths act as distance constraints between heavy atoms and hydrogen atoms. Lagrange multipliers are utilized to keep the bond lengths constant. By employing the SHAKE algorithm, a larger time step such as 2 fs can be used. Methods that integrate the equations of motion more efficiently are an active area of research.

2.2.2 Thermostats in MD Simulations

Before describing thermostats in MD simulations, the concept of a thermodynamic ensemble (statistical ensemble) should be introduced. An ensemble is a large number of replicas of the system of interest (it may contain an infinite number of replicas). All replicas in an ensemble are considered at once. Each replica represents the system in one possible state. Thermodynamic ensembles are characterized by macroscopic thermodynamic properties. Several frequently employed thermodynamic ensembles are the microcanonical ensemble (NVE ensemble), the canonical ensemble (NVT ensemble), the isothermal-isobaric ensemble (NPT ensemble) and the grand canonical ensemble.
simulation conserve the total energy and represent a system in the microcanonical (NVE) ensemble where number of particles ( ) volume ( ) and total energy ( ) are constant However our system of interest is in the canonical (NVT) ensemble in which number of particles ( ) volume ( ) and temperature ( ) are constant. T h erefore maintaining a constant temperature in a MD simulation is necessary Any algorithm that can maintain c onstant temperature and approximate the NVT ensemble is called a thermostat. Popular thermostats include Berendsen thermostat, 146 Langevin dynamics 147 and Nose Hoover thermostat. 148 The Berendsen thermostat and Langevin dynamics are utilized in our MD simulation s and thus explained here. In a MD simulation, the temperature can be written as: = 1 3 2 2 = 1 (2 8) Here N is the number of particles, n is number of constrained degree of freedom, m i and are the mass and velocity of particle i Thus, tempera ture is a function of velocities of all particles. The simplest way to control temperature is to rescale velocity at each time step. However, this will cause discontinuity in the momentum trajectory in phase space. PAGE 67 67 Berendsen et al introduced a weak coupl ing method to an external heat bath to MD simulations. The heat bath can add or remove heat from the system in order to maintain a constant temperature. T he rate of temperatu re change is governed by Eq. 2 9 : = 1 0 (2 9 ) w here 0 is the temperature of the bath and is the coupling time which indicates the time scale a system r elaxes to target value. By employing a coupling time, the MD propagation can avoid sudden change in velocities. Since temperature is computed from velocities of all the atoms, what t he Berendsen thermostat really does is to multiply all velocities with a s c aling factor (shown in E q. 
2-10) in order to rescale the current temperature $T$ toward the target value $T_0$:

$$ \lambda = \left[1 + \frac{\delta t}{\tau}\left(\frac{T_0}{T} - 1\right)\right]^{1/2} \tag{2-10} $$

By rescaling velocities, the Berendsen thermostat controls the temperature in MD simulations. As mentioned before, the coupling time $\tau$ determines how tightly the system and the heat bath are coupled. A large $\tau$ means the coupling is weak, and it takes a long time for the system to relax from the current temperature to the target temperature. As $\tau \to \infty$, the internal energy is conserved and the microcanonical ensemble is restored. If $\tau$ is small, the coupling between the system and the heat bath is strong and the velocity scaling factor deviates further from one. However, a strong velocity rescaling causes a large disruption in the momentum part of the phase space trajectory. The further the scaling factor is from one, the less natural the trajectory is.

Langevin dynamics belongs to the category of stochastic thermostats. [137] It mimics Brownian motion. The equation of motion of the MD method with a stochastic thermostat becomes:

$$ \frac{d\mathbf{v}_i}{dt} = -\frac{1}{m_i}\frac{\partial U}{\partial \mathbf{r}_i} - \gamma\,\mathbf{v}_i + \frac{\mathbf{A}_i(t)}{m_i} \tag{2-11} $$

In Eq. 2-11, $\mathbf{v}_i$, $\mathbf{r}_i$, and $m_i$ are the velocity, position, and mass of particle $i$, respectively, $U$ is the potential energy, $\gamma$ is the friction coefficient, and $\mathbf{A}(t)$ is a random force at time $t$. The amplitude of this force is determined by the fluctuation-dissipation theorem (Eq. 2-12):

$$ \left\langle A_i(t_1)\,A_j(t_2) \right\rangle = 2\,m_i\,\gamma\,k_B T\,\delta_{ij}\,\delta(t_1 - t_2) \tag{2-12} $$

$\langle A_i(t_1) A_j(t_2) \rangle$ is the time correlation of $A$ on particle $i$ at time $t_1$ with $A$ on particle $j$ at time $t_2$, $k_B$ is the Boltzmann constant, $T$ is the temperature, $\delta_{ij}$ is the Kronecker delta, and $\delta(t_1 - t_2)$ is the Dirac delta function. Langevin dynamics can be used as a thermostat because the equation of motion depends on temperature through the random force term.

2.2.3 Pressure Control in MD Simulations

Most biological experiments are performed at constant pressure and constant temperature (NPT ensemble). Therefore, pressure control techniques (barostats) should be used in simulations to maintain the system pressure, which is done by adjusting the system volume.
Since the number of particles is constant during a simulation, another use of pressure control is to regulate the system density, which should stay at an appropriate value. A commonly employed barostat is the Berendsen barostat. [146]

The pressure of a system in a simulation is calculated using the virial theorem of Clausius and can be expressed as:

$$ P = \frac{1}{V}\left[N k_B T - \frac{1}{3}\sum_{i=1}^{N}\sum_{j=i+1}^{N} r_{ij}\,\frac{dU(r_{ij})}{dr_{ij}}\right] \tag{2-13} $$

In the above equation, $P$ is the pressure, $V$ is the volume, $N$ is the number of particles, and $T$ is the temperature. $r_{ij}$ and $U(r_{ij})$ are the distance and interaction energy between atoms $i$ and $j$, respectively. Analogous to temperature control, the pressure can be maintained simply by rescaling the volume at each time step, although this disrupts the system volume too strongly. The Berendsen barostat was developed in order to smooth the change in volume. The Berendsen barostat, whose algorithm is analogous to that of the Berendsen thermostat, utilizes a pressure bath. The rate of pressure change is governed by the following equation:

$$ \frac{dP}{dt} = \frac{1}{\tau_P}\left(P_0 - P\right) \tag{2-14} $$

where $\tau_P$ is the coupling constant and $P_0$ is the pressure of the bath. The change in pressure is realized by adjusting the system volume. The coordinates of all particles in the system are scaled by a factor $\lambda^{1/3}$, where the volume scaling factor $\lambda$ is formulated as:

$$ \lambda = 1 - \kappa\,\frac{\delta t}{\tau_P}\left(P_0 - P\right) \tag{2-15} $$

The $\kappa$ in the above equation is the isothermal compressibility. It represents the volume fluctuation caused by a pressure change:

$$ \kappa = -\frac{1}{V}\left(\frac{\partial V}{\partial P}\right)_T \tag{2-16} $$

2.3 Monte Carlo (MC) Method

2.3.1 Canonical Ensemble and Configuration Integral

In statistical mechanics, an ensemble is a collection of a very large number of systems, each of which is a replica (on a thermodynamic level) of a particular thermodynamic system of interest. If the thermodynamic system of interest has a volume $V$, a number of particles $N$, and a temperature $T$, then an ensemble containing a very large number of such systems is called the canonical ensemble. The canonical ensemble is important because it best represents the systems of interest in practice.
Because each system of the canonical ensemble is not isolated, the energy of each system is not fixed. Thus, there is a probability $P_i$ of finding a system with energy $E_i$, and the probability distribution of systems in the canonical ensemble is the so-called Boltzmann distribution (Eq. 2-17):

$$ P_i = \frac{1}{Q}\,e^{-E_i / k_B T} \tag{2-17} $$

Here $Q$ is the partition function and is essentially a normalization factor, and $E_i$ is the quantum energy of a system:

$$ Q = \sum_i e^{-E_i / k_B T} \tag{2-18} $$

In classical mechanics, the Hamiltonian function $H$ describes the total energy of a system and can be expressed as $H(\mathbf{p}, \mathbf{q})$, where $\mathbf{p}$ and $\mathbf{q}$ are momenta and positions, respectively. In general, the Hamiltonian can be separated into a kinetic energy $K(\mathbf{p})$, which depends only on momenta, and a potential energy $U(\mathbf{q})$, which depends only on positions. In addition to using the Hamiltonian instead of the quantum energy, the energy levels become continuous in the classical limit. Thus, the partition function is written as an integral:

$$ Q = \iint e^{-\beta H(\mathbf{p}, \mathbf{q})}\,d\mathbf{p}\,d\mathbf{q} \tag{2-19} $$

Here $\beta = 1 / k_B T$. After integrating out the kinetic energy term, the partition function has the form of Eq. 2-20 and is called the configuration integral:

$$ Z = \int e^{-\beta U(\mathbf{q})}\,d\mathbf{q} \tag{2-20} $$

Thus, the Boltzmann distribution in the classical limit is given by Eq. 2-21:

$$ P(\mathbf{q}) = \frac{1}{Z}\,e^{-\beta U(\mathbf{q})} \tag{2-21} $$

2.3.2 Markov Chain Monte Carlo (MCMC)

The definition of a Markov chain is crucial to the MCMC methods, so it is explained first in this section. Consider a stochastic process at discrete steps $(t_1, t_2, \ldots)$ for a system that has a finite set of states $(S_1, S_2, \ldots)$. We write $X_t$ for the state of the system at step $t$. The conditional probability of $X_t = S_j$ given that $X_{t-1}$ is in state $S_i$, and so on, is:

$$ P\left(X_t = S_j \mid X_{t-1} = S_i,\ X_{t-2} = S_k,\ \ldots,\ X_1 = S_l\right) \tag{2-22} $$

A Markov process is defined as a process for which the conditional probability in Eq. 2-22 depends only on the previous state $X_{t-1} = S_i$:

$$ P\left(X_t = S_j \mid X_{t-1} = S_i,\ X_{t-2} = S_k,\ \ldots,\ X_1 = S_l\right) = P\left(X_t = S_j \mid X_{t-1} = S_i\right) \tag{2-23} $$

The corresponding sequence of states $(X_1, X_2, \ldots)$ is called a Markov chain.
The conditional probability $P(X_t = S_j \mid X_{t-1} = S_i)$ is essentially the transition probability from state $S_i$ to $S_j$ and is denoted $p_{ij}$. From probability theory, a transition probability has the properties $p_{ij} \ge 0$ and $\sum_j p_{ij} = 1$. Thus, the probability of $X_t = S_j$ can be written as:

$$ P(X_t = S_j) = \sum_i P\left(X_t = S_j \mid X_{t-1} = S_i\right) P(X_{t-1} = S_i) = \sum_i p_{ij}\,P(X_{t-1} = S_i) \tag{2-24} $$

A change in $P(X = S_j)$ with respect to step $t$ is governed by the master equation:

$$ \frac{dP(X = S_j)}{dt} = -\sum_i p_{ji}\,P(X = S_j) + \sum_i p_{ij}\,P(X = S_i) \tag{2-25} $$

At equilibrium (or under the steady-state approximation), $P(X = S_j)$ should not change with the step $t$. This leads to:

$$ \sum_i p_{ji}\,P(X = S_j) = \sum_i p_{ij}\,P(X = S_i) \tag{2-26} $$

Since the Markov chain introduced above possesses a discrete and finite number of states, the transition probabilities can be arranged as a matrix, which is called the transition matrix. The $ij$-th element of the transition matrix is $p_{ij}$. The probability distribution can be represented by a row vector, and multiplying a probability distribution by the transition matrix generates a new probability distribution. If a Markov chain is time homogeneous (here "time" is essentially a step count, owing to the stochastic nature of a Markov chain), the elements of the transition matrix are constants (time independent). When a probability distribution vector is not changed by multiplication with the transition matrix, the distribution is said to be stationary. At equilibrium, the elements of the transition matrix are independent of time, and the equilibrium distribution is an eigenvector of the transition matrix with an eigenvalue of 1. Hence, multiplying the equilibrium probability distribution by the transition matrix does not change it.

Properties of a Markov chain include the following: a Markov chain is irreducible if all states communicate with each other; it is aperiodic if the number of steps needed to move between two states is not periodic; and it is positive recurrent if the expectation value of the return time to a state is finite. These properties are closely related to the ergodicity of a Markov chain.
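The statement that the equilibrium distribution is a left eigenvector of the transition matrix with eigenvalue 1 can be checked numerically. The following is a minimal sketch, not taken from the original work: the 3-state transition matrix `T` and the helper names `step` and `stationary` are hypothetical, chosen only for illustration. It repeatedly multiplies a row vector by the transition matrix until the distribution stops changing.

```python
def step(dist, T):
    """Multiply a row-vector distribution by the transition matrix T (rows sum to 1)."""
    n = len(T)
    return [sum(dist[i] * T[i][j] for i in range(n)) for j in range(n)]


def stationary(T, dist, tol=1e-12, max_steps=10_000):
    """Iterate dist <- dist * T until the largest change is below tol."""
    for _ in range(max_steps):
        new = step(dist, T)
        if max(abs(a - b) for a, b in zip(new, dist)) < tol:
            return new
        dist = new
    return dist


# Hypothetical irreducible, aperiodic 3-state chain (illustrative values only).
T = [
    [0.9, 0.1, 0.0],
    [0.2, 0.7, 0.1],
    [0.0, 0.3, 0.7],
]

# Start with all probability in state 0; the limit does not depend on this choice.
pi = stationary(T, [1.0, 0.0, 0.0])
print([round(p, 6) for p in pi])
```

For this particular matrix the stationary distribution can also be solved by hand from $\pi T = \pi$, giving (0.6, 0.3, 0.1), which the iteration reproduces regardless of the starting vector.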
The MCMC methods are Monte Carlo samplings from a probability distribution that employ a Markov chain whose equilibrium probability distribution is the intended probability distribution. The states sampled by the Monte Carlo method form a Markov chain. The transitions in MCMC must satisfy the detailed balance equation:

$$ P(X = S_i)\,p_{ij} = P(X = S_j)\,p_{ji} \tag{2-27} $$

A Markov chain is said to be reversible when it satisfies the detailed balance equation.

2.3.3 The Metropolis Monte Carlo Method

In 1953, Metropolis et al. [50] proposed an algorithm to sample the phase space of a system at equilibrium by the MC method. According to the Metropolis algorithm, at configuration $i$ a new configuration $j$ is chosen, both configurations are weighted by the Boltzmann distribution (Eq. 2-21), and the detailed balance condition (Eq. 2-27) is employed to evaluate the transitions (MC moves) between configurations:

$$ \rho(i)\,\pi(i \to j) = \rho(j)\,\pi(j \to i) \tag{2-28} $$

In the above equation, $\rho(i)$ is the Boltzmann weight of configuration $i$ and $\pi(i \to j)$ is the transition probability from configuration $i$ to $j$. Inserting Eq. 2-21 into Eq. 2-28 and rearranging yields:

$$ \frac{\pi(i \to j)}{\pi(j \to i)} = \frac{\rho(j)}{\rho(i)} = e^{-\beta\left[U(j) - U(i)\right]} = e^{-\beta \Delta U} \tag{2-29} $$

The transition probability from configuration $i$ to $j$ can then be written as:

$$ \pi(i \to j) = \min\left(1,\ e^{-\beta \Delta U}\right) \tag{2-30} $$

In practice, the new configuration is accepted if $\Delta U \le 0$. However, if $\Delta U > 0$, a random number between zero and one is generated and compared with $e^{-\beta \Delta U}$. If the random number is less than or equal to $e^{-\beta \Delta U}$, the new configuration is accepted. Otherwise, the current configuration is kept and is added to the configuration ensemble again. This accept/reject criterion is the so-called Metropolis criterion. MC sampling with the Metropolis criterion generates a Markov chain whose equilibrium PDF is the Boltzmann distribution. Comparing Metropolis MC with MD, the MC method simulates a system in the canonical ensemble without an explicit thermostat; the bottleneck of MC sampling is the potential energy difference between configurations, while the bottleneck of MD is the energy barrier.
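The accept/reject procedure described above can be sketched in a few lines. This is a minimal illustration under stated assumptions rather than the code used in this work: the one-dimensional harmonic potential, the reduced units ($\beta = 1$, force constant $k = 1$), the maximum displacement, and the function names `metropolis_accept` and `sample_harmonic` are all hypothetical choices made for the example.

```python
import math
import random


def metropolis_accept(delta_u, beta, rng):
    """Metropolis criterion: always accept downhill moves, accept uphill
    moves with probability exp(-beta * delta_u)."""
    if delta_u <= 0.0:
        return True
    return rng.random() <= math.exp(-beta * delta_u)


def sample_harmonic(n_steps, beta=1.0, k=1.0, max_disp=2.0, seed=0):
    """Sample x from exp(-beta * 0.5 * k * x**2) with symmetric trial moves."""
    rng = random.Random(seed)
    x = 0.0
    samples = []
    for _ in range(n_steps):
        trial = x + rng.uniform(-max_disp, max_disp)
        delta_u = 0.5 * k * (trial * trial - x * x)
        if metropolis_accept(delta_u, beta, rng):
            x = trial
        # A rejected move counts the current configuration again.
        samples.append(x)
    return samples


samples = sample_harmonic(100_000, seed=1)
mean_x2 = sum(x * x for x in samples) / len(samples)
print(f"<x^2> = {mean_x2:.3f} (Boltzmann value 1/(beta*k) = 1.000)")
```

For a harmonic potential the Boltzmann distribution is Gaussian with $\langle x^2 \rangle = 1/(\beta k)$, so the sampled average gives a quick check that the chain converges to the intended distribution. No temperature control is needed: the temperature enters only through $\beta$ in the acceptance test.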
2.3.4 Ergodicity and the Ergodic Hypothesis

In statistical mechanics, ergodic (the adjective of ergodicity) describes a system that satisfies the ergodic hypothesis. The ergodic hypothesis states that, over a long period of time, the time average and the ensemble average of a property are the same. In our simulations, the ergodic hypothesis is often assumed to be true. Ergodicity breaking (the ergodic hypothesis does not hold) often means that the system is trapped in a local region of phase space. One example in which the ergodic hypothesis does not hold is the spontaneous magnetization of a ferromagnetic system below the Curie temperature. The ensemble average of the net magnetization is zero, since spin up and spin down are degenerate states and the populations of the two states should be the same. However, a net magnetization exists when the temperature is below the Curie temperature. Ergodicity is also often discussed for Markov chains. A Markov chain is called ergodic when it is irreducible and all its states are aperiodic and positive recurrent.

2.4 Solvent Models

Because proteins are stable and perform their functions in the condensed phase, especially in aqueous solution, representing the solvation effect is of great importance. The solvent most frequently modeled in MD simulations is water. Two ways of representing aqueous solution are presented here: the explicit and the implicit solvent models. As the names indicate, the explicit water model employs water molecules in the simulation, and the implicit water model treats water as a dielectric continuum.

2.4.1 Explicit Solvent Model

Different water models such as SPC/E, [149] TIP3P, [150] and TIP4P [150] have been developed. Water model parameters are fitted to bulk water properties such as density, heat of vaporization, and dipole moment. [150] The density of liquid water is an important physical quantity for checking water models.
The density of liquid water shows a maximum at 4 °C, and water models should reflect this correctly. TIP3P fails to do so, while TIP4P, TIP5P, [151] and their variants are able to reproduce this trend.

Take the TIP3P and TIP4P water models as examples. A simple diagrammatic description of the TIP3P and TIP4P water models is shown in Figure 2-2. The TIP3P water model has one oxygen atom and two hydrogen atoms. The geometry of TIP3P water is the same as the experimental geometry, with an O-H bond length of 0.9572 Å and an H-O-H angle of 104.52°. Only the oxygen atom has a van der Waals radius. Thus, van der Waals interactions occur only among oxygen atoms. Partial charges are placed on the oxygen atom and the hydrogen atoms. The partial charge on the oxygen atom is $-0.834e$ and the partial charge on each hydrogen atom is $+0.417e$, where $e$ is the magnitude of the charge of an electron. When computing interactions (Coulomb interaction and Lennard-Jones interaction) between two TIP3P water molecules, 3 × 3 = 9 distances need to be calculated.

The TIP4P water model, as its name implies, has four sites. As in the TIP3P water model, the experimental geometry (bond length and bond angle) is adopted in the TIP4P model, and the only atom in the TIP4P molecule with a van der Waals interaction is again oxygen. However, in the TIP4P model the negative partial charge is located on the fourth site instead of on the oxygen atom as in the TIP3P model. Placing the negative charge on the fourth site improves the electrostatic properties of water, such as the dipole moment. The positive partial charges are still placed on the hydrogen atoms. The new partial charges are $-1.04e$ and $+0.52e$. New Lennard-Jones potential parameters have also been employed for the TIP4P water model to achieve better fitting results.
Computing the interactions between a pair of TIP4P molecules requires 9 distances for the electrostatic interactions and 1 distance for the Lennard-Jones potential. Therefore, using the TIP4P model in a simulation is computationally more expensive than using the TIP3P model. For a five-site water model such as TIP5P, 17 distances are needed in order to calculate water-water interactions.

When simulating a molecule with explicit water molecules, the periodic boundary condition (PBC) is utilized in order to mimic bulk solution. [152] Otherwise, water molecules would evaporate into vacuum. Ewald summation [153] or Particle Mesh Ewald (PME) summation [154] is employed to compute the long-range electrostatics efficiently when the PBC is employed.

One advantage of employing the explicit water model is that specific solvent-solute interactions can be represented. For example, studying the hydrogen bonding between water molecules and proteins requires the explicit water model. However, the explicit model suffers from high computational cost, since CPU time is approximately proportional to the number of interatomic interactions.

Figure 2-2. A diagrammatic description of the TIP3P and TIP4P water models. A) TIP3P model. The red circle is the oxygen atom and the black circles are the hydrogen atoms. The experimental bond length and bond angle are adopted. B) TIP4P model. Oxygen and hydrogen atoms are labeled with the same colors as in the TIP3P model. The TIP4P model also employs the experimental O-H bond length and H-O-H bond angle. The fourth site (green circle), which carries the negative partial charge, has been added to the TIP4P model.

2.4.2 The Poisson-Boltzmann (PB) Implicit Solvent Model

An alternative way of representing the solvation effect is to reproduce the PES of a molecule after it is dissolved in solvent. The solution-phase potential energy of a molecule can be computed by adding the solvation free energy to the gas-phase potential energy.
Given the correct solution-phase PES, correct forces can be generated for the equation of motion. Thus, the key issue is finding an accurate free energy of solvation. A dielectric continuum model can be employed to calculate the free energy of solvation. In the dielectric continuum model, the free energy (work) of assembling a charge distribution is expressed as:

$$ G = \frac{1}{2}\int \rho(\mathbf{r})\,\phi(\mathbf{r})\,d\mathbf{r} \tag{2-31} $$

Here $\rho(\mathbf{r})$ is the charge density of the molecule and $\phi(\mathbf{r})$ is the electrostatic potential. The Poisson-Boltzmann model utilizes the Poisson-Boltzmann equation to describe the electrostatic potential as a function of the charge density. In practice, the linearized PB equation (Eq. 2-32), which keeps only the first-order term of the Taylor series expansion of the hyperbolic sine, is often employed:

$$ \nabla \cdot \left[\varepsilon(\mathbf{r})\,\nabla \phi(\mathbf{r})\right] = -4\pi\rho(\mathbf{r}) + \varepsilon(\mathbf{r})\,\lambda(\mathbf{r})\,\kappa^2\,\phi(\mathbf{r}) \tag{2-32} $$

In the above equation, $\varepsilon$ is the dielectric constant, $\lambda$ is a switching function that is zero where the electrolyte is inaccessible and one otherwise, and $\kappa^2$ is the Debye-Hückel parameter.

For simple cases such as spherical charge distributions, the solutions to the PB equation are analytical and simple. Consider dissolving a sphere with charge $q$ and radius $a$, with the charge uniformly distributed on the surface. The charge density on the surface can be expressed as:

$$ \rho(\mathbf{s}) = \frac{q}{4\pi a^2} \tag{2-33} $$

Here $\mathbf{s}$ is any point on the surface. Outside the sphere, the electrostatic potential at $\mathbf{r}$ is given by:

$$ \phi(\mathbf{r}) = \frac{q}{\varepsilon\,|\mathbf{r}|} \tag{2-34} $$

Evaluating the right-hand side of Eq. 2-31 with Eq. 2-33 and Eq. 2-34 yields $G = q^2 / (2\varepsilon a)$. The free energy of solvation is the difference between the solution-phase and gas-phase free energies. Thus, it can be written as:

$$ \Delta G_{\mathrm{solv}} = -\frac{1}{2}\left(1 - \frac{1}{\varepsilon}\right)\frac{q^2}{a} \tag{2-35} $$

This is the so-called Born equation, and it is the basis of the generalized Born (GB) method, which will be introduced later.

For complex systems such as proteins, there is no analytical solution to the linearized PB equation.
[73] Therefore, this equation is solved iteratively until self-consistency is achieved between the charge density and the electrostatic potential.

2.4.3 The Generalized Born (GB) Implicit Solvent Model

Solving the linearized PB equation is computationally expensive. The GB method has been proposed as an approximation to the PB implicit solvent model. [39,117] Using the GB implicit solvent can greatly shorten the simulation time, which makes GB frequently employed in molecular simulations. Similar to Eq. 2-35, the free energy of solvation in the GB method is given by:

$$ \Delta G_{\mathrm{solv}} = -\frac{1}{2}\left(1 - \frac{1}{\varepsilon}\right)\sum_i \sum_j \frac{q_i\,q_j}{f_{GB}} \tag{2-36} $$

Here $q_i$ and $q_j$ are the charges on atoms $i$ and $j$, and $f_{GB}$ is calculated by:

$$ f_{GB} = \left[r_{ij}^2 + \alpha_i \alpha_j \exp\left(-\frac{r_{ij}^2}{4\,\alpha_i \alpha_j}\right)\right]^{1/2} \tag{2-37} $$

Here $\alpha_i$ is the effective Born radius of charge $i$ and $r_{ij}$ is the distance between the two charges. Another approximation in the GB method is the Coulomb field approximation. [40] This approximation estimates the effective Born radius by integrating the energy density of a Coulomb field over the molecular volume. The integral is often evaluated numerically.

Note that the GB theory involves two approximations to reproduce the PB results. The first approximation consists of Eq. 2-36 and Eq. 2-37. The second is the Coulomb field approximation. In practice, further approximations are often introduced to reduce the time spent computing the effective Born radii. The pairwise approximation [155] is often applied. In this approximation, the van der Waals radius of an atom and a function that depends on the positions and van der Waals radii of atom pairs are used to compute the effective Born radius.

2.5 pKa Calculation Methods

2.5.1 The Continuum Electrostatic (CE) Model

The basic idea of the CE model is also given in Figure 1-6.
Since computing the pKa value of an ionizable residue in a protein directly is difficult (it involves breaking a bond and dissolving all species in water), a model compound is utilized and the pKa shift is calculated via the thermodynamic cycles shown in Figure 1-7 and Figure 1-8. Like the MDFE calculations, the CE model computes the pKa value of an ionizable residue relative to its intrinsic value (or model compound value, according to the definition of Bashford and Karplus; the definition of the intrinsic pKa can be found in Section 1.3). The pKa value of an ionizable residue is written as:

$$ \mathrm{p}K_a = \mathrm{p}K_{a,\mathrm{int}} + \frac{1}{2.303\,k_B T}\left(\Delta G_{\mathrm{protein}} - \Delta G_{\mathrm{ref}}\right) \tag{2-38} $$

In the above equation, $\mathrm{p}K_{a,\mathrm{int}}$ is the intrinsic pKa value of an ionizable residue and can be found in Table 1-1. $\Delta G_{\mathrm{protein}}$ and $\Delta G_{\mathrm{ref}}$ are the free energy differences between the protonated and deprotonated species (taken as deprotonated minus protonated) for that ionizable residue and for its reference compound, respectively. The reference compound utilized in the CE model is an isolated ionizable residue with its two ends capped, fully exposed to the aqueous environment. Eq. 2-38 is essentially the same as Eq. 1-20. The difference between the MDFE methods and the CE model lies in how the free energy differences between the protonated and deprotonated species on the right-hand side of Eq. 1-20 are generated. MDFE methods compute the two free energy differences via free energy calculation algorithms, while the CE model calculates them via the FDPB method. In this continuum electrostatic model, proteins are considered as low-dielectric regions surrounded by a high-dielectric continuum representing water. Protonation is represented by adding a unit charge to the ionizable site.

In the continuum electrostatic model, $\Delta G_{\mathrm{protein}}$ and $\Delta G_{\mathrm{ref}}$ are assumed to differ only in their electrostatic contributions. This assumption results in the cancellation of the non-electrostatic free energy contributions. Thus, what must be calculated is the electrostatic work of charging a site from zero to unit charge, both in the ionizable residue and in the reference compound.
For any ionizable site in a fixed protein structure, this electrostatic work can be decomposed into three terms: the Born solvation free energy ($\Delta G_{\mathrm{Born}}$), the background free energy, which is the interaction of the ionizable site with non-ionizable charges ($\Delta G_{\mathrm{back}}$), and the interaction with other ionizable sites ($\Delta G_{\mathrm{inter}}$). For the reference compound, only the first two terms exist. Thus, $\Delta G_{\mathrm{protein}}$ can be written as:

$$ \Delta G_{\mathrm{protein}} = \Delta G_{\mathrm{Born}}(\mathrm{protein}) + \Delta G_{\mathrm{back}}(\mathrm{protein}) + \Delta G_{\mathrm{inter}}(\mathrm{protein}) \tag{2-39} $$

and $\Delta G_{\mathrm{ref}}$ can be written as:

$$ \Delta G_{\mathrm{ref}} = \Delta G_{\mathrm{Born}}(\mathrm{ref}) + \Delta G_{\mathrm{back}}(\mathrm{ref}) \tag{2-40} $$

The linearized PB equation (described in Section 2.4.2) is solved for the electrostatic potentials using the finite difference method. For an ionizable site $i$, the Born solvation term is determined by Eq. 2-35. The background free energy is calculated using Eq. 2-41:

$$ \Delta G_{\mathrm{back}} = \sum_j q_j\,\phi(\mathbf{r}_i, \mathbf{r}_j) \tag{2-41} $$

Here $q_j$ is a non-ionizable partial charge and $\phi(\mathbf{r}_i, \mathbf{r}_j)$ stands for the electrostatic potential produced at $\mathbf{r}_j$ by a unit charge placed at $\mathbf{r}_i$. The electrostatic interaction with other ionizable sites can also be evaluated by Eq. 2-41, except that the charges on the ionizable sites must be used. After computing all components on the right-hand sides of Eq. 2-39 and Eq. 2-40, the pKa of ionizable residue $i$ is obtained from Eq. 2-38.

To produce a titration curve, consider a protein containing $N$ ionizable residues. Each ionizable residue has two states: protonated and deprotonated. Thus, there are $2^N$ macrostates for that protein. Each macrostate can be represented by a vector $\mathbf{x} = (x_1, x_2, \ldots, x_N)$, whose element $x_i$ is 0 or 1 according to whether ionizable site $i$ is deprotonated or protonated. The free energy of $\mathbf{x}$ relative to the vector whose components are all zero (this is equivalent to the free energy change of charging the non-zero components in the vector) is given by Eq. 2-42:

$$ \Delta G(\mathbf{x}) = \sum_{i=1}^{N} x_i\,\Delta G_i + \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N} W_{ij}\left(q_i^0 + x_i\right)\left(q_j^0 + x_j\right) \tag{2-42} $$

Here $\Delta G_i$ collects the single-site (Born and background) contributions for ionizable site $i$, $W_{ij}$ is the electrostatic interaction between unit charges at ionizable sites $i$ and $j$, and $q_i^0$ is the charge of site $i$ when it is in the deprotonated state.
Thus, $\theta_i$, which is the fraction of protonation of site $i$, can be written as (Eq. 2-43):

$$ \theta_i = \frac{\sum_{\mathbf{x}} x_i \exp\left[-\beta\,\Delta G(\mathbf{x}) - 2.303\,n(\mathbf{x})\,\mathrm{pH}\right]}{\sum_{\mathbf{x}} \exp\left[-\beta\,\Delta G(\mathbf{x}) - 2.303\,n(\mathbf{x})\,\mathrm{pH}\right]} \tag{2-43} $$

Here $\beta = 1 / k_B T$ and $n(\mathbf{x})$ is the number of non-zero components in $\mathbf{x}$. Summing up the individual $\theta_i$ generates a titration curve of the entire protein.

2.5.2 Free Energy Calculation Methods

As mentioned previously, the pKa value is proportional to the standard free energy of the reaction. Therefore, free energy calculation methods can be employed to compute the pKa value of the ionizable residue of interest. In this section, two frequently used free energy calculation methods, thermodynamic integration (TI) [156,157] and free energy perturbation (FEP), [158] are described. Both TI and FEP are equilibrium methods and can be employed to compute the free energy difference between two states. In other words, each transformation should be reversible.

In the TI method, the initial state A (having potential energy $U_A(\mathbf{q})$, where $\mathbf{q}$ is the molecular structure) and the final state B (having potential energy $U_B(\mathbf{q})$) are connected by a reaction coordinate $\lambda$ (this reaction coordinate does not necessarily have any physical significance). The simplest scheme for constructing the potential energy as a function of $\lambda$ is:

$$ U(\lambda, \mathbf{q}) = (1 - \lambda)\,U_A(\mathbf{q}) + \lambda\,U_B(\mathbf{q}) \tag{2-44} $$

Slowly changing $\lambda$ from zero to one converts state A to B; the intermediate values of $\lambda$ correspond to a mixed system without physical meaning. The Helmholtz free energy in the canonical ensemble (or the Gibbs free energy in the isothermal-isobaric ensemble) is formulated as:

$$ A = -k_B T \ln Q = -k_B T \ln Z + \mathrm{const} \tag{2-45} $$

where $Q$ is the partition function and $Z$ is the configuration integral; the constant comes from the momentum integral and cancels in free energy differences between states with the same masses. From now on, the derivation focuses on the canonical ensemble and the Helmholtz free energy, but it can be extended to the isothermal-isobaric ensemble and the Gibbs free energy in the same manner (this statement also holds when the free energy perturbation method is described later). Following Eq.
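2-45, the Helmholtz free energy as a function of $\lambda$ is:

$$ A(\lambda) = -k_B T \ln Z(\lambda), \qquad Z(\lambda) = \int e^{-U(\lambda, \mathbf{q}) / k_B T}\,d\mathbf{q} \tag{2-46} $$

Here, $U(\lambda, \mathbf{q})$ is the potential energy function and $\mathbf{q}$ is the molecular structure. The free energy difference can be written as:

$$ \Delta A = A(1) - A(0) = \int_0^1 \frac{\partial A(\lambda)}{\partial \lambda}\,d\lambda \tag{2-47} $$

Then,

$$ \frac{\partial A(\lambda)}{\partial \lambda} = -k_B T\,\frac{\partial \ln Z(\lambda)}{\partial \lambda} = -k_B T\,\frac{1}{Z(\lambda)}\frac{\partial Z(\lambda)}{\partial \lambda} \tag{2-48} $$

Plugging the explicit form of the configuration integral into the derivative leads to:

$$ \frac{\partial Z(\lambda)}{\partial \lambda} = \frac{\partial}{\partial \lambda}\int e^{-U(\lambda, \mathbf{q}) / k_B T}\,d\mathbf{q} = \int \frac{\partial}{\partial \lambda}\,e^{-U(\lambda, \mathbf{q}) / k_B T}\,d\mathbf{q} \tag{2-49} $$

$$ \frac{\partial}{\partial \lambda}\,e^{-U(\lambda, \mathbf{q}) / k_B T} = -\frac{1}{k_B T}\,e^{-U(\lambda, \mathbf{q}) / k_B T}\,\frac{\partial U(\lambda, \mathbf{q})}{\partial \lambda} \tag{2-50} $$

Therefore,

$$ \frac{1}{Z(\lambda)}\frac{\partial Z(\lambda)}{\partial \lambda} = -\frac{1}{k_B T}\,\frac{1}{Z(\lambda)}\int e^{-U(\lambda, \mathbf{q}) / k_B T}\,\frac{\partial U(\lambda, \mathbf{q})}{\partial \lambda}\,d\mathbf{q} \tag{2-51} $$

Since the integration is over coordinate space and $Z(\lambda)$ does not depend on $\mathbf{q}$, the configuration integral can be moved inside the integral. Combining Eq. 2-48 and Eq. 2-51 gives:

$$ \frac{\partial A(\lambda)}{\partial \lambda} = -k_B T\,\frac{1}{Z(\lambda)}\frac{\partial Z(\lambda)}{\partial \lambda} = \int \frac{e^{-U(\lambda, \mathbf{q}) / k_B T}}{Z(\lambda)}\,\frac{\partial U(\lambda, \mathbf{q})}{\partial \lambda}\,d\mathbf{q} \tag{2-52} $$

The first factor in the integrand is the Boltzmann weight factor $P(\lambda, \mathbf{q})$. Rewriting Eq. 2-52 yields:

$$ \frac{\partial A(\lambda)}{\partial \lambda} = \int P(\lambda, \mathbf{q})\,\frac{\partial U(\lambda, \mathbf{q})}{\partial \lambda}\,d\mathbf{q} = \left\langle \frac{\partial U(\lambda, \mathbf{q})}{\partial \lambda} \right\rangle_{\lambda} \tag{2-53} $$

Thus, the final form of $\Delta A$ is:

$$ \Delta A = \int_0^1 \frac{\partial A(\lambda)}{\partial \lambda}\,d\lambda = \int_0^1 \left\langle \frac{\partial U(\lambda, \mathbf{q})}{\partial \lambda} \right\rangle_{\lambda} d\lambda \tag{2-54} $$

In both Eq. 2-53 and Eq. 2-54, the bracket represents an ensemble average generated at $\lambda$.

In pKa calculations, state A (or B) represents the protonated species and the other state represents the deprotonated species. Each intermediate $\lambda$ value corresponds to a mixed protonated and deprotonated state without any physical meaning. When classical force fields are applied, the proton becomes a dummy atom in the deprotonated state but retains its position and velocity in the protein (or model compound). Furthermore, states A and B differ only in their charge distributions. The dissociation free energy can be computed by applying numerical integration methods (such as the trapezoidal rule or Gaussian quadrature) to Eq. 2-54. As explained in the previous chapter, the quantum mechanical contributions to the proton dissociation free energy are assumed to be the same for the protein and the model compound. Therefore, subtracting the dissociation free energy of the model compound from that of the protein yields the pKa shift relative to the pKa value of the model compound.

The FEP method, which was initially introduced by Zwanzig in 1954, [158] is another frequently employed free energy calculation method. Consider two states (A and B) with partition functions $Q_A$ and $Q_B$, and Helmholtz free energies $A_A$ and $A_B$, respectively.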
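The free energy difference from A to B can be expressed as:

$$ \Delta A = A_B - A_A = -k_B T \ln\left(Q_B / Q_A\right) \tag{2-55} $$

Suppose the configuration integrals are adopted instead of the partition functions. The potential energy functions of states A and B are $U_A(\mathbf{q})$ and $U_B(\mathbf{q})$, respectively, where $\mathbf{q}$ is the molecular structure. Thus,

$$ \Delta A = -k_B T \ln\left(Z_B / Z_A\right) = -k_B T \ln \frac{\int e^{-U_B(\mathbf{q}) / k_B T}\,d\mathbf{q}}{\int e^{-U_A(\mathbf{q}) / k_B T}\,d\mathbf{q}} \tag{2-56} $$

According to Zwanzig, $U_B$ can be written as the sum of $U_A$ and a perturbation term $\Delta U$:

$$ U_B(\mathbf{q}) = U_A(\mathbf{q}) + \Delta U(\mathbf{q}) \tag{2-57} $$

$$ \Delta A = -k_B T \ln \frac{\int e^{-\left[U_A(\mathbf{q}) + \Delta U(\mathbf{q})\right] / k_B T}\,d\mathbf{q}}{\int e^{-U_A(\mathbf{q}) / k_B T}\,d\mathbf{q}} \tag{2-58} $$

$$ \Delta A = -k_B T \ln \frac{\int e^{-U_A(\mathbf{q}) / k_B T}\,e^{-\Delta U(\mathbf{q}) / k_B T}\,d\mathbf{q}}{\int e^{-U_A(\mathbf{q}) / k_B T}\,d\mathbf{q}} \tag{2-59} $$

The Boltzmann weight factor of state A has the form:

$$ P_A(\mathbf{q}) = \frac{e^{-U_A(\mathbf{q}) / k_B T}}{\int e^{-U_A(\mathbf{q}) / k_B T}\,d\mathbf{q}} \tag{2-60} $$

Therefore,

$$ \Delta A = -k_B T \ln \int P_A(\mathbf{q})\,e^{-\Delta U(\mathbf{q}) / k_B T}\,d\mathbf{q} = -k_B T \ln \left\langle e^{-\Delta U / k_B T} \right\rangle_A \tag{2-61} $$

The bracket with subscript A stands for the ensemble average performed over the structural ensemble generated from state A. Substituting $\Delta U$ with $U_B - U_A$, Eq. 2-61 becomes:

$$ \Delta A = -k_B T \ln \left\langle e^{-\left(U_B - U_A\right) / k_B T} \right\rangle_A \tag{2-62} $$

In order to compute $\Delta A$, one simulation of state A is performed. Once a configuration $\mathbf{q}$ is generated, the potential energy difference at configuration $\mathbf{q}$ is computed. The ensemble average of $e^{-\Delta U / k_B T}$ can then be calculated easily, and hence $\Delta A$ is obtained.

According to Eq. 2-62, if the potential energy difference between the two states (the perturbation) is too large, the exponential average is dominated by rarely sampled configurations and the free energy difference given by the FEP calculation can be unreasonable. Thus, FEP calculations cannot accurately reflect the true free energy difference for large changes in the Hamiltonian (basically, the potential energy); the estimate is reliable only when the two Hamiltonians are similar. In order to compute the free energy difference between two very different systems (such as the free energy difference from benzene to toluene), intermediate systems that mix the two end points are adopted in such a way that the differences between neighbors can be treated as perturbations. To be specific, a coupling parameter can be adopted in the same fashion as in TI. The sum of the free energy differences between intermediate systems (each intermediate state has a specific value of the coupling parameter) is the targeted free energy difference.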
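The exponential averaging over configurations generated from state A can be illustrated with a toy system for which the answer is known. The sketch below is an illustration under stated assumptions, not the procedure used in this work: states A and B are hypothetical one-dimensional harmonic potentials with force constants `k_a` and `k_b`, configurations of state A are drawn directly from its Gaussian Boltzmann distribution instead of from an MD trajectory, and the function name `fep_forward` is made up for the example. For this system the exact result is $\Delta A = \tfrac{1}{2} k_B T \ln(k_b / k_a)$.

```python
import math
import random


def fep_forward(delta_u_samples, kT):
    """Zwanzig estimate: dA = -kT * ln < exp(-dU/kT) >_A.
    A log-sum-exp shift is used so large |dU| values do not overflow."""
    exponents = [-du / kT for du in delta_u_samples]
    shift = max(exponents)
    total = sum(math.exp(e - shift) for e in exponents)
    return -kT * (shift + math.log(total / len(exponents)))


kT = 1.0          # reduced units (assumption)
k_a, k_b = 1.0, 2.0

# Draw configurations of state A exactly: x ~ Normal(0, sqrt(kT / k_a)).
rng = random.Random(7)
xs = [rng.gauss(0.0, math.sqrt(kT / k_a)) for _ in range(100_000)]

# Perturbation energy dU = U_B - U_A evaluated on configurations from A.
delta_u = [0.5 * (k_b - k_a) * x * x for x in xs]

estimate = fep_forward(delta_u, kT)
exact = 0.5 * kT * math.log(k_b / k_a)
print(f"FEP estimate {estimate:.4f}, exact {exact:.4f}")
```

Making `k_b` much larger than `k_a` increases the perturbation, and the estimate then converges more slowly because only a few configurations of A carry most of the weight, which is the reason for introducing intermediate states.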
In practice, computing $\Delta A_{A \to B}$ (the forward free energy difference) is as easy (or as hard) as computing $\Delta A_{B \to A}$ (the backward free energy difference), and in principle one is exactly the opposite of the other. Evaluating both the forward and backward free energy differences provides an indication of convergence. The Bennett Acceptance Ratio (BAR) method [159,160] is a frequently employed scheme to reduce sampling bias and statistical error.

In 1985, Jorgensen et al. [161] proposed double-wide sampling for FEP calculations in order to reduce the computational cost. Double-wide FEP can be explained by the following example. Suppose $\Delta A_{\lambda_i \to \lambda_j}$ is to be computed. Instead of performing two MD simulations at $\lambda_i$ and $\lambda_j$, only one MD simulation at $(\lambda_i + \lambda_j)/2$ is conducted. The free energy differences from $(\lambda_i + \lambda_j)/2$ to $\lambda_i$ and from $(\lambda_i + \lambda_j)/2$ to $\lambda_j$ are calculated, and the objective free energy difference can then be obtained. If $n$ configurations of each MD simulation are taken in order to compute $\Delta A$, the conventional FEP scheme requires $4n$ potential energy calculations, while double-wide FEP requires only $3n$.

2.5.3 Constant pH MD Methods

As described in the previous chapter, the constant pH MD methods aim to describe the protonation equilibrium correctly at a given pH value. The constant pH MD models sample the protonation state space explicitly, along with the conformational space. In practice, two protonation state sampling schemes have been developed. One scheme utilizes a binary protonation state space: only the protonated and deprotonated states are defined. MC steps are performed periodically during the MD propagation, which samples the conformational space. At each MC step, a new protonation state is selected and the free energy difference between the old and new states is computed. The Metropolis criterion is then applied to evaluate the MC move. Since a binary protonation state space is adopted, this scheme is generally called the discrete protonation state model. The other scheme employs a continuous protonation state space.
Not only are the completely protonated and deprotonated species defined, but fractional protonation states also exist in the simulation. The MD propagation samples both the conformational and the protonation state space. This latter scheme is named the continuous protonation state model. In this section, the CPHMD model developed by the Brooks group and two discrete protonation state constant pH MD methods, developed by Baptista et al. and by Mongan et al., are described to provide a brief overview.

In the CPHMD method, Lee et al. [114] applied $\lambda$ dynamics [116] to the protonation coordinate and used the Generalized Born (GB) implicit solvent model. They chose a variable $\lambda$, which is bound between 0 and 1, to control the protonation fraction. $\lambda = 0$ represents an ionizable residue in its protonated state, while $\lambda = 1$ corresponds to the deprotonated ionizable residue. Because $\lambda$ is continuous, the values $\lambda = 0$ and $\lambda = 1$ are rarely sampled exactly. Thus, an arbitrary value $\lambda_P$ is adopted such that any $\lambda$ smaller than $\lambda_P$ is defined to be protonated, while any $\lambda$ greater than $1 - \lambda_P$ is defined to be deprotonated. So that an unbounded coordinate can be used in practice, a new coordinate $\theta$ is introduced and propagated in the MD simulation. $\lambda$ is expressed as:

$$ \lambda = \sin^2(\theta) \tag{2-63} $$

An artificial potential barrier between the protonated and deprotonated states is introduced. This potential is a biasing potential that increases the residence time close to the protonated and deprotonated states, and it is centered at the halfway point of titration ($\lambda = 1/2$). The formula of the biasing potential used by Lee et al. is:

$$ U_{\mathrm{barrier}}(\lambda) = -4\beta\left(\lambda - \frac{1}{2}\right)^2 \tag{2-64} $$

where $\beta$ is an adjustable parameter controlling the height of the biasing potential. A value of 1.25 kcal/mol was found to be enough to provide sufficient occupation time in the protonated and deprotonated states.
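The mapping from the unbounded coordinate to the protonation fraction and the shape of the biasing potential can be made concrete with a short sketch. It assumes the forms $\lambda = \sin^2\theta$ and $U_{\mathrm{barrier}} = -4\beta(\lambda - 1/2)^2$ with $\beta = 1.25$ kcal/mol as described above; the function names and the cutoff value `0.1` used to label states are hypothetical and chosen only for illustration, not taken from Lee et al.

```python
import math


def lam_from_theta(theta):
    """Map the unbounded coordinate theta onto lambda in [0, 1]."""
    return math.sin(theta) ** 2


def dlam_dtheta(theta):
    """Chain-rule factor d(lambda)/d(theta) = 2 sin(theta) cos(theta) = sin(2 theta),
    needed to convert forces on lambda into forces on theta."""
    return math.sin(2.0 * theta)


def barrier(lam, beta=1.25):
    """Biasing potential in kcal/mol: zero at lambda = 1/2, -beta at the end points,
    so the mid-point sits beta above the protonated and deprotonated states."""
    return -4.0 * beta * (lam - 0.5) ** 2


def label_state(lam, cutoff=0.1):
    """Assign a protonation label; the cutoff value is an illustrative assumption."""
    if lam < cutoff:
        return "protonated"
    if lam > 1.0 - cutoff:
        return "deprotonated"
    return "mixed"


for theta in (0.0, math.pi / 8, math.pi / 4, 3 * math.pi / 8, math.pi / 2):
    lam = lam_from_theta(theta)
    print(f"theta={theta:.3f}  lambda={lam:.3f}  "
          f"U_barrier={barrier(lam):+.3f}  state={label_state(lam)}")
```

Because $\sin^2\theta$ stays within [0, 1] for any real $\theta$, the MD integrator can move $\theta$ freely without any boundary handling, and the barrier term lowers the energy of the end states relative to the mid-point of titration.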
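The total potential of the system, which provides the forces for MD propagation, is given by Eq. 2-65. It is the sum of the five force field terms essentially defined by Eq. 1-9, the GB solvation free energy $U^{\mathrm{GB}}$, which will be explained in the next chapter, the energy $U^{\mathrm{SA}}$ related to solvent accessible surface areas, and, for each ionizable residue $i = 1, \ldots, N$, a biasing term $U^{\mathrm{barr}}$, a model compound term $U^{\mathrm{model}}$ and a pH-dependent term $U^{\mathrm{pH}}$.

$U^{\mathrm{model}}$ is a potential of mean force (PMF) along the titration coordinate for a model compound. The free energy difference shown in Eq. 1-17 can be represented by the difference between $U^{\mathrm{model}}(\lambda = 0)$ and $U^{\mathrm{model}}(\lambda = 1)$. The $U^{\mathrm{model}}$ in Eq. 2-65 is fit to a two-parameter parabolic function of the form $U^{\mathrm{model}}(\theta) = A\left(\sin^{2}\theta - B\right)^{2}$. The pH-dependent term is $U^{\mathrm{pH}}(\theta) = 2.303\,k_B T \sin^{2}\theta\,(\mathrm{p}K_a - \mathrm{pH})$, where $\mathrm{p}K_a$ is that of the model compound; it is the chemical potential of adding a fractional proton to the solution at the given pH. The combination of $U^{\mathrm{model}}$ and $U^{\mathrm{pH}}$ is essentially the quantum mechanical dissociation free energy of a fractional proton. The CPHMD method also assumes that Eq. 1-18 is true.

Another feature of the CPHMD method is the use of an extended Hamiltonian. A kinetic energy term for the titration coordinates is employed in CPHMD:

$$K_{\theta} = \sum_{i=1}^{N} \frac{1}{2} m_i \dot{\theta}_i^{2} \qquad (2\text{-}66)$$

The fictitious mass $m_i$ controls how quickly the protonation state responds to the force acting on it.

Baptista et al. [118] proposed that an MD simulation incorporating protonation state changes essentially samples a semigrand canonical ensemble. The joint PDF can be written as:

$$\rho(\mathbf{p}, \mathbf{q}, \mathbf{p}_s, \mathbf{q}_s, \mathbf{n}) \propto \exp\left\{\beta\left[\mu\, n(\mathbf{n}) - H(\mathbf{p}, \mathbf{q}, \mathbf{p}_s, \mathbf{q}_s, \mathbf{n})\right]\right\} \qquad (2\text{-}67)$$

Here, $\mathbf{p}$ and $\mathbf{q}$ are the momenta and coordinates of the solute, respectively, and $\mathbf{p}_s$ and $\mathbf{q}_s$ are the momenta and coordinates of the solvent, respectively. $\mathbf{n}$ is the vector containing the protonation state information of each ionizable residue; its details are explained in the continuum electrostatic model. $n(\mathbf{n})$ is essentially the number of protonated ionizable residues, $\mu$ is the chemical potential of protons, and $\beta = 1/k_B T$. The Hamiltonian $H$ contains quantum mechanical and classical force field terms. The quantum mechanical part in their model is assumed not to depend on coordinates and momenta. The introduction of a dummy atom to replace the proton in a deprotonated residue makes the kinetic energy a function of the momenta only.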
Two conditional samplings were considered by Baptista et al.: one is conformational sampling under a fixed protonation state, the other is protonation state sampling under a fixed structure. The PDF of conformations at a fixed protonation state is:

$$\rho(\mathbf{p}, \mathbf{q}, \mathbf{p}_s, \mathbf{q}_s \mid \mathbf{n}) \propto \exp\left[-\beta H_c(\mathbf{p}, \mathbf{q}, \mathbf{p}_s, \mathbf{q}_s, \mathbf{n})\right] \qquad (2\text{-}68)$$

where $H_c$ is the classical Hamiltonian and, as in Eq. 2-67, $(\mathbf{p}, \mathbf{q})$ and $(\mathbf{p}_s, \mathbf{q}_s)$ are the momenta and coordinates of solute and solvent and $\mathbf{n}$ is the protonation state vector. Because the quantum mechanical Hamiltonian depends only on the protonation state, which is fixed during conformational sampling, the quantum contribution is a constant and cancels. The PDF of protonation states at fixed coordinates is given in Eq. 2-69:

$$\rho(\mathbf{n} \mid \mathbf{q}) = \frac{\exp\left[-2.303\, n(\mathbf{n})\,\mathrm{pH} - \beta\,\Delta G(\mathbf{n})\right]}{\sum_{\mathbf{n}'} \exp\left[-2.303\, n(\mathbf{n}')\,\mathrm{pH} - \beta\,\Delta G(\mathbf{n}')\right]} \qquad (2\text{-}69)$$

where $n(\mathbf{n})$ is the number of protonated residues and $\Delta G(\mathbf{n})$ is the free energy of a protonation state relative to the completely deprotonated state. In their model, an FDPB-based method is used to calculate this free energy difference. By combining the two conditional samplings, one is able to generate an ensemble satisfying Eq. 2-67. To prove this statement, one must show that the Markov chain constructed from the transition matrix and the two conditional probabilities satisfies the following condition:

$$\rho = \lim_{k \to \infty} \rho_0\, \mathbf{T}^{k} \qquad (2\text{-}70)$$

In the above equation, $\rho$ is the joint PDF as defined in Eq. 2-67, $\rho_0$ is a joint PDF depending on the same variables as $\rho$, and $\mathbf{T}$ is the transition matrix. Proving that Eq. 2-70 holds means proving that the Markov chain defined by $\rho_0$ and $\mathbf{T}$ is ergodic. To prove that a Markov chain is ergodic, one needs to prove that (a) the Markov chain is irreducible; (b) the chain is aperiodic; (c) the transition matrix elements are time independent; and (d) the limiting distribution is stationary. The detailed proof is given by Baptista et al. in their 2002 paper. Their proof justifies the discrete protonation state constant pH method, which samples conformational space at a fixed protonation state and samples protonation states at a fixed structure.
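The statement that alternating the two conditional samplings recovers the joint distribution can be checked numerically on a very small model. The sketch below is only an illustration under stated assumptions: the "conformation" is an index over three made-up states, the protonation state is a single site with $n \in \{0, 1\}$, the reduced energies (in units of $k_B T$) and the pH value are hypothetical, and the joint weight is assumed to be $\exp[-E(x, n) - 2.303\, n\, \mathrm{pH}]$. It is not the stochastic titration implementation of Baptista et al.

```python
import math
import random


def joint_weights(ph, energies):
    """Unnormalised joint weights w[x][n] = exp(-E(x, n) - 2.303 * n * pH).

    energies[x][n] is a made-up reduced energy (units of kT) for
    conformation x with n protons bound.
    """
    return [[math.exp(-e_xn - 2.303 * n * ph) for n, e_xn in enumerate(row)]
            for row in energies]


def sample_index(weights, rng):
    """Draw an index with probability proportional to its weight."""
    r = rng.random() * sum(weights)
    acc = 0.0
    for k, w in enumerate(weights):
        acc += w
        if r < acc:
            return k
    return len(weights) - 1


def alternate_sampling(w, n_steps, rng):
    """Alternate conformation moves at fixed n with protonation moves at fixed x."""
    n_conf, n_prot = len(w), len(w[0])
    counts = [[0] * n_prot for _ in range(n_conf)]
    x, n = 0, 0
    for _ in range(n_steps):
        x = sample_index([w[k][n] for k in range(n_conf)], rng)  # x given n
        n = sample_index(w[x], rng)                              # n given x
        counts[x][n] += 1
    return [[c / n_steps for c in row] for row in counts]


energies = [[0.0, -1.0], [0.5, 0.2], [1.0, -0.5]]  # hypothetical values
w = joint_weights(ph=0.3, energies=energies)
total = sum(sum(row) for row in w)
exact = [[v / total for v in row] for row in w]
sampled = alternate_sampling(w, 200_000, random.Random(0))
max_err = max(abs(s - e)
              for row_s, row_e in zip(sampled, exact)
              for s, e in zip(row_s, row_e))
```

In this toy model every weight is positive, so the chain is irreducible and aperiodic, and the sampled frequencies approach the exact joint probabilities as the number of steps grows.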
In 2004, Mongan et al. [127] proposed a constant pH MD method and implemented it in the AMBER suite. This algorithm follows the scheme proposed by Baptista et al. [118] but employs the GB model in both the MD and the MC steps. Given a protein with N titratable sites, the discrete protonation state model means that the protonation states of the protein are described by a vector $\mathbf{x} = (x_1, x_2, \ldots, x_N)$, where each $x_i$ is an integer representing the protonation state of titratable residue $i$. In AMBER, five amino acids are defined to be titratable: aspartate, glutamate, histidine, lysine and tyrosine. For each titratable residue, different protonation states have different partial charges on the side chain. The model also includes syn and anti forms of the protons for the aspartate and glutamate side chains.

At each Monte Carlo step, a titratable site and a new protonation state for that site are chosen randomly, and the transition free energy at this fixed configuration is used to evaluate the MC move. Consider a titratable site A in a protein environment: its protonated form is protA-H and its deprotonated form is protA$^-$. The equilibrium between the two forms is governed by their free energy difference, which is an ensemble average over configurations. However, this free energy difference cannot be computed by a molecular mechanics (MM) model alone, since the transition between the two forms involves bond breaking and forming and the solvation of a proton, both of which involve quantum mechanical effects. These problems can be solved by using a reference compound. The reference compound has the same titratable side chain as protA-H but a known p$K_a$ value ($\mathrm{p}K_{a,\mathrm{ref}}$). Following Mongan et al., we assume that the transition free energy can be divided into a quantum mechanics (QM) part and a molecular mechanics (MM) part. We further assume that the quantum mechanical energy components are the same for the reference compound and protA-H.
Since the p$K_a$ of the reference compound is known, its transition free energy from the deprotonated form to the protonated form at a given pH is:

$$\Delta G_{\mathrm{ref}} = k_B T\,(\mathrm{pH} - \mathrm{p}K_{a,\mathrm{ref}})\ln 10 \qquad (2\text{-}71)$$

So the QM component of the transition free energy of the reference compound can be expressed as:

$$\Delta G_{\mathrm{ref,QM}} = \Delta G_{\mathrm{ref}} - \Delta G_{\mathrm{ref,MM}} \qquad (2\text{-}72)$$

Here $\Delta G_{\mathrm{ref,MM}}$ is the molecular mechanics contribution to the free energy of the protonation reaction for the reference compound. Under the assumption stated above, the QM component of the transition free energy in the protein is approximated by that of the reference compound:

$$\Delta G_{\mathrm{protein,QM}} = \Delta G_{\mathrm{ref,QM}} \qquad (2\text{-}73)$$

Then the transition free energy from protA$^-$ to protA-H can be calculated as:

$$\Delta G = k_B T\,(\mathrm{pH} - \mathrm{p}K_{a,\mathrm{ref}})\ln 10 + \Delta G_{\mathrm{MM}} - \Delta G_{\mathrm{ref,MM}} \qquad (2\text{-}74)$$

Here $\Delta G_{\mathrm{MM}}$ is the molecular mechanics contribution (electrostatic in nature) to the free energy of the protein titratable site. Hence, by using a reference compound, the p$K_a$ is computed relative to the reference p$K_a$. This can also help cancel some of the error introduced by the GB solvation model through the use of $\Delta G_{\mathrm{ref,MM}}$.

In AMBER, a reference compound is a blocked dipeptide amino acid possessing a titratable side chain (for example, acetyl-Asp-methylamine). Five reference compounds were constructed, corresponding to the five titratable residues. The values of $\Delta G_{\mathrm{ref,MM}}$ for each reference compound are obtained from thermodynamic integration calculations at 300 K and are set as internal parameters in AMBER. $\Delta G_{\mathrm{MM}}$ is calculated by taking the difference between the potential energy with the charges of the current protonation state and the potential energy with the charges of the new protonation state. If the transition is accepted, MD steps are carried out to sample conformational space in the new protonation state. If the MC attempt is rejected, MD steps are also carried out, with no change to the protonation state.

2.6 Advanced Sampling Methods

Conformational sampling in an MD or MC simulation is essential in the study of complex systems such as polymers and proteins. One major concern is that the PES of a complex system is very rugged and contains many local energy minima.
Thus, kinetic trapping occurs as a result of the low rate of potential energy barrier crossing, especially when the barrier is high. To overcome this kinetic trapping, generalized ensemble methods can be employed in molecular simulations. As the name implies, a generalized ensemble method differs from the canonical ensemble method in the weight factor of a state. The weight factor in the canonical ensemble is the Boltzmann weight, whereas a non-Boltzmann weight factor can be used in a generalized ensemble method. (This does not mean that the Boltzmann factor is prohibited in a generalized ensemble method. In fact, parallel tempering, which belongs to the family of generalized ensemble methods, does adopt the Boltzmann factor.) By choosing a non-Boltzmann weight factor, the system is able to perform a random walk in potential energy space. Thus, potential energy barriers are overcome more easily and more conformations are visited. Frequently utilized generalized ensemble algorithms include the multicanonical (MUCA) method and the replica exchange molecular dynamics (REMD) method. In this section, MUCA and parallel tempering are introduced briefly. Due to the importance of the REMD method to this dissertation, its details are explained in the next section.

2.6.1 The Multicanonical Algorithm (MUCA)

In the canonical ensemble, the probability of visiting a state with potential energy $E$ is:

$$P(E) \propto \Omega(E)\, e^{-E/k_B T} \qquad (2\text{-}75)$$

Here, $\Omega(E)$ is the density of states (DOS), which means the number of states between $E$ and $E + dE$, and $e^{-E/k_B T}$ is the Boltzmann factor. As the potential energy increases, the Boltzmann factor decreases but the DOS increases rapidly, so a bell-shaped probability distribution function (PDF) of $E$ is observed. In the MUCA method [54,55,137], however, the PDF is designed to be flat (a constant), although it can still be written in the form of Eq. 2-76:

$$P_{\mathrm{mu}}(E) \propto \Omega(E)\, w_{\mathrm{mu}}(E) = \mathrm{constant} \qquad (2\text{-}76)$$

where $w_{\mathrm{mu}}(E)$ is the multicanonical weight factor and $\Omega(E)$ is the DOS.
The multicanonical weight factor needs to be inversely proportional to the DOS in order to generate a flat PDF. However, the DOS of a system is in general unknown, which makes the multicanonical weight unknown a priori. Generating the correct distribution of $w_{\mathrm{mu}}(E)$ is the central task of a MUCA simulation. In practice, short simulations are performed in order to determine the DOS in an iterative manner. Details of determining the DOS can be found in the paper of Okamoto and Hansmann published in 1995 [162]. After the DOS is resolved, the canonical ensemble PDF can be obtained. Thus, the average of any quantity can be determined by Eq. 1-11 or Eq. 1-12, depending on whether an MD or an MC simulation is performed.

Another way to explore the DOS is the Wang-Landau algorithm [163,164]. In the Wang-Landau algorithm, the DOS is recorded as a histogram $\Omega(E)$ whose elements are all initially set to unity. Another histogram, called the visit histogram, is also constructed with initial values set to zero. The visit histogram represents the number of visits to each energy level. Monte Carlo moves are made. Instead of being evaluated by the Metropolis criterion, they are evaluated by the DOS,

$$p(E_1 \to E_2) = \min\left[1, \frac{\Omega(E_1)}{\Omega(E_2)}\right] \qquad (2\text{-}77)$$

where $p(E_1 \to E_2)$ is the transition probability from a state with energy $E_1$ to a state with energy $E_2$. Each time an energy level is visited, the corresponding element of the DOS histogram is updated by multiplying the current value by a modification coefficient that is greater than 1. The initial value of the modification coefficient is $f_0 = e \approx 2.71828$. Every time an MC move is performed, the corresponding element of the visit histogram is also updated. The MC moves continue until the visit histogram is flat. At this stage, the DOS has converged to within the accuracy set by the current modification coefficient. In order to achieve a finer convergence, a second round of the above process is performed. This time, the modification coefficient $f_1$ is given by $f_1 = \sqrt{f_0}$. The visit histogram is then reset to zero.
This process iterates until a modification coefficient that is approximately 1 is reached (in the paper of Wang and Landau, the final value of the modification coefficient is 1.00000001). By utilizing the Wang-Landau algorithm, the DOS is obtained and a random walk in potential energy space is achieved.

2.6.2 Parallel Tempering

In 1986, Swendsen and Wang first performed parallel tempering (replica exchange MC) simulations to investigate spin glasses [59]. Multiple non-interacting copies (replicas) of the system are simulated at different temperatures. At each temperature, an MC simulation is conducted to sample the conformational space. Periodically, an exchange of the structures or the temperatures of two replicas is attempted. The detailed balance condition is applied and the weight factor of a state is the Boltzmann weight factor. The Metropolis criterion is utilized to accept or reject the move. Hansmann et al. [58] first utilized the parallel tempering algorithm in the study of a biomolecule (the pentapeptide Met-enkephalin). Other applications of the parallel tempering algorithm include the X-ray structure determination performed by Falcioni and Deem [165].

An MC simulation at a high temperature accepts transition attempts more often than one at a low temperature. Thus, simulations at high temperatures tend to visit more conformations in conformational space. Exchanging structures with replicas at lower temperatures can help the latter avoid getting trapped in conformational space. The acceptance ratio, which is the average fraction of successful exchange attempts, is an important issue in the parallel tempering method. It is correlated with the temperature distribution of the replicas.
According to Kofke [166], the acceptance ratio is the area of overlap between the potential energy PDFs at two temperatures. Given the number of replicas, if the temperatures of two replicas are too different, the overlap between the two potential energy PDFs is small. Therefore, accepting an exchange attempt is unlikely, which makes the parallel tempering simulation inefficient. However, if the temperatures of two adjacent replicas are too close, the overlap between the two PDFs is large, and hence the acceptance ratio is large, but the conformational space sampled by the two adjacent replicas is nearly the same. More replicas than actually needed are then used to achieve the same goal, and computer resources are wasted.

2.7 Replica Exchange Molecular Dynamics (REMD) Methods

Due to the correlation between conformational and protonation sampling, correct sampling of protonation states requires accurate sampling of protein conformations. Hence, generalized ensemble methods such as the multicanonical algorithm and REMD should be used to avoid kinetic trapping, which comes from low rates of barrier crossing in constant temperature MD simulations. REMD has been applied to the continuous protonation state constant pH method (REX-CPHMD) by Khandogin et al. [110-113]. They have performed REX-CPHMD simulations to predict p$K_a$ values [110] and to explore pH-dependent protein dynamics [111-113].

REMD, which is the MD version of parallel tempering, was developed by Sugita and Okamoto in 1999 [62]. The theory of REMD is essentially the same as that of parallel tempering. In their method, exchanges of temperatures are attempted. This leads to the unique part of REMD: the treatment of velocities after an exchange attempt is accepted, because the velocities must correctly reflect the temperature. Sugita and Okamoto proposed to rescale the velocities in order to recover the desired temperature when temperatures are swapped.
Similar to other generalized ensemble methods, the REMD algorithm aims to make the system perform a random walk in either temperature or potential energy space, and hence avoid kinetic trapping. The advantage of REMD over other generalized ensemble methods is that the weight factor is the Boltzmann weight, which is known a priori. This advantage makes REMD very frequently employed in MD simulations of complex systems. The REMD algorithm has been applied to studies of peptides, proteins and protein-membrane systems in order to describe free energy landscapes, amyloid formation, structure prediction and binding. Many extended versions, such as solute tempering REMD [167] and structure reservoir REMD [168-170], have been proposed to improve the performance of the REMD algorithm. The REMD variants are briefly explained later in this section.

2.7.1 Temperature REMD (T-REMD)

A thorough description of the T-REMD algorithm can be found in the original paper of Sugita and Okamoto [62]. In T-REMD, N non-interacting copies (replicas) of a system are simulated at N different temperatures (one each). Regular MD is performed, and periodically an exchange of configurations between two (usually adjacent) temperatures is attempted. Suppose replica $i$ at temperature $T_m$ and replica $j$ at temperature $T_n$ are attempting to exchange; the following satisfies the detailed balance condition:

$$P_m(i)\,P_n(j)\,w(i \to j) = P_m(j)\,P_n(i)\,w(j \to i) \qquad (2\text{-}78)$$

Here $w(i \to j)$ is the transition probability between two states $i$ and $j$, and $P_m(i)$ is the population of state $i$ at temperature $T_m$ (in REMD, assumed to be Boltzmann weighted), so that

$$P_m(i) \propto \exp\left[-H(q_i, p_i)/k_B T_m\right] \qquad (2\text{-}79)$$

where $H$ is the Hamiltonian of the state, $q$ represents the molecular structure, and $p$ stands for the momenta. The Hamiltonian consists of kinetic energy ($K$) and potential energy ($U$) terms and can be written as:

$$H(q, p) = K(p) + U(q) \qquad (2\text{-}80)$$

In the original derivation of the exchange probability, Sugita and Okamoto mentioned that exchanging two replicas (states) is equivalent to exchanging temperatures.
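The momenta of each replica after an exchange attempt need to be rescaled (replica $i$, with structure $q_i$ and momenta $p_i$, moves from $T_m$ to $T_n$, and replica $j$ moves from $T_n$ to $T_m$):

$$p_i' = \sqrt{T_n/T_m}\; p_i \qquad (2\text{-}81)$$

$$p_j' = \sqrt{T_m/T_n}\; p_j \qquad (2\text{-}82)$$

After inserting Eq. 2-79 and Eq. 2-80 into Eq. 2-78, the detailed balance equation becomes:

$$e^{-[K(p_i) + U(q_i)]/k_B T_m}\, e^{-[K(p_j) + U(q_j)]/k_B T_n}\, w(i \to j) = e^{-[K(p_j') + U(q_j)]/k_B T_m}\, e^{-[K(p_i') + U(q_i)]/k_B T_n}\, w(j \to i) \qquad (2\text{-}83)$$

According to Eq. 2-81 and Eq. 2-82,

$$K(p_i')/k_B T_n = K(p_i)/k_B T_m \qquad (2\text{-}84)$$

$$K(p_j')/k_B T_m = K(p_j)/k_B T_n \qquad (2\text{-}85)$$

Therefore, the kinetic energy contributions on both sides of Eq. 2-83 cancel, leaving only the potential energy terms to contribute to the exchange probability:

$$\frac{w(i \to j)}{w(j \to i)} = \frac{e^{-U(q_j)/k_B T_m}\, e^{-U(q_i)/k_B T_n}}{e^{-U(q_i)/k_B T_m}\, e^{-U(q_j)/k_B T_n}} \qquad (2\text{-}86)$$

Further manipulation of Eq. 2-86 yields:

$$\frac{w(i \to j)}{w(j \to i)} = \exp\left\{\left(\frac{1}{k_B T_m} - \frac{1}{k_B T_n}\right)\left[U(q_i) - U(q_j)\right]\right\}$$

If the Metropolis criterion is applied, the exchange probability is obtained as:

$$w(i \to j) = \min\left\{1, \exp\left[\left(\frac{1}{k_B T_m} - \frac{1}{k_B T_n}\right)\left(U(q_i) - U(q_j)\right)\right]\right\} \qquad (2\text{-}87)$$

If the exchange attempt between two replicas is accepted, the temperatures of the two replicas are swapped and the velocities are rescaled to the new temperatures by multiplying all the old velocities by the square root of the ratio of the new temperature to the old temperature:

$$v_{\mathrm{new}} = v_{\mathrm{old}} \sqrt{T_{\mathrm{new}}/T_{\mathrm{old}}} \qquad (2\text{-}88)$$

Here $v_{\mathrm{new}}$ and $v_{\mathrm{old}}$ are the new and old velocities, respectively, and $T_{\mathrm{new}}$ and $T_{\mathrm{old}}$ are the temperatures after and before an exchange is accepted, respectively. The acceptance ratio is the average value of the exchange probabilities between two temperatures:

$$\langle P_{\mathrm{acc}} \rangle = \left\langle \min\left\{1, \exp\left[\left(\frac{1}{k_B T_m} - \frac{1}{k_B T_n}\right)\left(U(q_i) - U(q_j)\right)\right]\right\} \right\rangle \qquad (2\text{-}89)$$

For a given system, the potential energy function is independent of temperature, but the potential energy PDF in a canonical ensemble depends on temperature. The potential energy PDF can be considered a Gaussian function (to second order in the Taylor expansion of the PDF about the potential energy value corresponding to maximum probability). The Gaussian is centered at the mean potential energy of the system with a variance $\sigma^2 = k_B T^2 C_v$, where $C_v$ is the heat capacity. At this stage, the Gaussian expression of the potential energy PDF is not adopted; it will be employed later in this section. The potential energy PDF at temperature $T$ is currently written as:

$$P_T(E) = \frac{1}{Z_T}\, \Omega(E)\, e^{-E/k_B T} \qquad (2\text{-}90)$$

where $E$ is the potential energy, $\Omega(E)$ is the DOS, $Z_T$ is the partition function, and the exponential term is the Boltzmann weight factor as a function of potential energy.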
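Before continuing with the derivation, the exchange test and the velocity rescaling described above can be summarized in a short Python sketch. It assumes the Metropolis form of Eq. 2-87, $\min\{1, \exp[(1/k_B T_m - 1/k_B T_n)(U_i - U_j)]\}$, and the rescaling of Eq. 2-88; the function names, the value of $k_B$ in kcal/(mol K) and the flat list used for velocities are illustrative assumptions rather than the data structures of any particular MD package.

```python
import math
import random

K_B = 0.0019872041  # Boltzmann constant in kcal/(mol K)


def exchange_probability(u_i, u_j, t_m, t_n, k_b=K_B):
    """Metropolis probability of swapping replica i (at T_m) with replica j (at T_n).

    u_i and u_j are the potential energies of the two replicas.
    """
    delta = (1.0 / (k_b * t_m) - 1.0 / (k_b * t_n)) * (u_i - u_j)
    return 1.0 if delta >= 0.0 else math.exp(delta)


def attempt_exchange(u_i, u_j, t_m, t_n, rng=random):
    """Return True if the exchange attempt is accepted."""
    return rng.random() < exchange_probability(u_i, u_j, t_m, t_n)


def rescale_velocities(velocities, t_old, t_new):
    """Scale all velocity components by sqrt(T_new / T_old) after an accepted swap."""
    scale = math.sqrt(t_new / t_old)
    return [v * scale for v in velocities]
```

Only potential energies enter the test because the rescaling multiplies the kinetic energy by $T_{\mathrm{new}}/T_{\mathrm{old}}$, which is what makes the kinetic terms cancel in the detailed balance equation.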
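Recall that in probability theory the average of a quantity $A(x)$ over a PDF $P(x)$ can be expressed as:

$$\langle A \rangle = \int A(x)\, P(x)\, dx \qquad (2\text{-}91)$$

Extend Eq. 2-91 to the bivariate case, write $E_i = U(q_i)$ and $E_j = U(q_j)$ for the potential energies of the two replicas, and notice that the two PDFs are independent. The acceptance ratio can be rewritten as

$$\langle P_{\mathrm{acc}} \rangle = \int_{-\infty}^{+\infty}\!\!\int_{-\infty}^{+\infty} \min\left\{1, e^{(\beta_m - \beta_n)(E_i - E_j)}\right\} P_m(E_i)\, P_n(E_j)\, dE_i\, dE_j \qquad (2\text{-}92)$$

where $\beta_m = 1/k_B T_m$, $\beta_n = 1/k_B T_n$, and $P_m$ and $P_n$ are the potential energy PDFs at $T_m$ and $T_n$. Let a function $f$ denote the Metropolis factor; then

$$f(E_i, E_j) = \min\left\{1, e^{(\beta_m - \beta_n)(E_i - E_j)}\right\} \qquad (2\text{-}93)$$

Without loss of generality, we can assume that $T_m > T_n$, which means $\beta_m < \beta_n$. Therefore, another way of writing Eq. 2-93 is $f = 1$ when $E_j > E_i$ and $f = e^{(\beta_m - \beta_n)(E_i - E_j)}$ when $E_j < E_i$. Inserting $f$ into Eq. 2-92 leads to:

$$\langle P_{\mathrm{acc}} \rangle = \iint_{E_j > E_i} P_m(E_i)\, P_n(E_j)\, dE_i\, dE_j + \iint_{E_j < E_i} e^{(\beta_m - \beta_n)(E_i - E_j)}\, P_m(E_i)\, P_n(E_j)\, dE_i\, dE_j \qquad (2\text{-}94)$$

For simplicity, we denote the second term as $I$. Inserting Eq. 2-90 into $I$:

$$I = \iint_{E_j < E_i} e^{(\beta_m - \beta_n)(E_i - E_j)}\, \frac{1}{Z_m}\, \Omega(E_i)\, e^{-\beta_m E_i}\, \frac{1}{Z_n}\, \Omega(E_j)\, e^{-\beta_n E_j}\, dE_i\, dE_j \qquad (2\text{-}95)$$

Since $E_i$ and $E_j$ are independent, Eq. 2-95 can be rewritten as:

$$I = \iint_{E_j < E_i} \frac{1}{Z_m}\, \Omega(E_i)\, e^{(\beta_m - \beta_n) E_i}\, e^{-\beta_m E_i}\; \frac{1}{Z_n}\, \Omega(E_j)\, e^{-(\beta_m - \beta_n) E_j}\, e^{-\beta_n E_j}\, dE_i\, dE_j \qquad (2\text{-}96)$$

Simplifying Eq. 2-96 gives:

$$I = \iint_{E_j < E_i} \frac{1}{Z_m}\, \Omega(E_i)\, e^{-\beta_n E_i}\; \frac{1}{Z_n}\, \Omega(E_j)\, e^{-\beta_m E_j}\, dE_i\, dE_j \qquad (2\text{-}97)$$

Recall that a partition function is just a normalizing constant, so $Z_m$ and $Z_n$ in Eq. 2-97 can switch their positions in the integrand. Thus Eq. 2-97 becomes:

$$I = \iint_{E_j < E_i} P_n(E_i)\, P_m(E_j)\, dE_i\, dE_j \qquad (2\text{-}98)$$

Inserting Eq. 2-98 into Eq. 2-94:

$$\langle P_{\mathrm{acc}} \rangle = \iint_{E_j > E_i} P_m(E_i)\, P_n(E_j)\, dE_i\, dE_j + \iint_{E_j < E_i} P_n(E_i)\, P_m(E_j)\, dE_i\, dE_j \qquad (2\text{-}99)$$

Each term on the right hand side of Eq. 2-99 can be interpreted as an overlap between the two PDFs, and the sum is the entire overlap between the two PDFs. Therefore, the average exchange probability is just the overlap between the potential energy PDFs at two temperatures.

Next, let us consider the temperature distribution in the simplest case, in which the heat capacity is a constant. As mentioned earlier, the potential energy PDF of a canonical ensemble can be written as a Gaussian function,

$$P_T(E) = P_T(\bar{E})\, \exp\left[-\frac{(E - \bar{E})^2}{2 k_B T^2 C_v}\right] \qquad (2\text{-}100)$$

where $\bar{E}$ is the average potential energy, $P_T(\bar{E})$ is the probability density of finding $\bar{E}$ at temperature $T$, and $C_v$ is the heat capacity. Since the PDF should be normalized, it is easy to find the relationship between $P_T(\bar{E})$ and the standard deviation $\sigma$ of the Gaussian function:

$$P_T(\bar{E}) = 1/\sqrt{2\pi\sigma^2} \qquad (2\text{-}101)$$

For simplicity in the derivation of the acceptance ratio, the Gaussian PDF at temperature $T$ will be written as Eq. 2-102 from now on:

$$P_T(E) = \frac{1}{\sqrt{2\pi\sigma^2}}\, \exp\left[-\frac{(E - \bar{E})^2}{2\sigma^2}\right] \qquad (2\text{-}102)$$

Recall that one assumption used to distribute temperatures is to maintain a random walk in temperature space.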
Hence, a constant acceptance ratio should be achieved for any two adjacent temperatures. As shown previously, the acceptance ratio is the overlap between two potential energy PDFs. Consider two potential energy PDFs at temperatures $T_1 < T_2$, with means $\bar{E}_1$ and $\bar{E}_2$ and standard deviations $\sigma_1$ and $\sigma_2$. The PDF at $T_1$ lies to the left of the PDF at $T_2$. After finding the potential energy $E^*$ where the two Gaussian PDFs intersect, the overlap between the two PDFs can be computed by integrating the left Gaussian PDF from $E^*$ to infinity and the Gaussian on the right from minus infinity to $E^*$, and adding them up:

$$P_{\mathrm{acc}} = \int_{E^*}^{\infty} \frac{1}{\sqrt{2\pi\sigma_1^2}}\, e^{-(E - \bar{E}_1)^2/2\sigma_1^2}\, dE + \int_{-\infty}^{E^*} \frac{1}{\sqrt{2\pi\sigma_2^2}}\, e^{-(E - \bar{E}_2)^2/2\sigma_2^2}\, dE \qquad (2\text{-}103)$$

Using complementary error functions, Eq. 2-103 becomes

$$P_{\mathrm{acc}} = \frac{1}{2}\, \mathrm{erfc}\left(\frac{E^* - \bar{E}_1}{\sqrt{2}\,\sigma_1}\right) + \frac{1}{2}\, \mathrm{erfc}\left(\frac{\bar{E}_2 - E^*}{\sqrt{2}\,\sigma_2}\right) \qquad (2\text{-}104)$$

According to Rathore et al. [171], the acceptance ratio can be approximated as:

$$P_{\mathrm{acc}} \approx \mathrm{erfc}\left(\frac{\Delta\bar{E}}{2\sqrt{2}\,\sigma_m}\right) \qquad (2\text{-}105)$$

where $\Delta\bar{E} = \bar{E}_2 - \bar{E}_1$ and $\sigma_m = (\sigma_1 + \sigma_2)/2$. For a geometric distribution of temperatures, where $T_2 = c\,T_1$ with $c$ a constant, the average potential energy difference can be computed as

$$\Delta\bar{E} = C_v (T_2 - T_1) = C_v T_1 (c - 1) \qquad (2\text{-}106)$$

Thus, if the heat capacity does not change with temperature, the temperature terms in the numerator and denominator of Eq. 2-105 cancel, which means that the acceptance ratio is a constant. Furthermore, Eq. 2-105 also indicates the number of replicas needed to cover a temperature range as a function of system size. In order to have a non-zero acceptance ratio, $\Delta\bar{E}/\sigma_m \sim 1$ is required. With $\sigma = \sqrt{k_B C_v}\,T$, this leads to $\sqrt{C_v/k_B}\,(c - 1)/(c + 1) \sim 1$. Further simplifications lead to:

$$c - 1 \sim \sqrt{k_B/C_v} \qquad (2\text{-}107)$$

Since the heat capacity is proportional to $N$, where $N$ is the number of particles, the number of replicas needed to cover a temperature range scales as $N^{1/2}$.

2.7.2 Hamiltonian REMD (H-REMD)

Instead of preparing replicas with different temperatures, another way to overcome potential energy barriers is simply to change the PES so that the potential energy barriers are reduced [61]. This is the basic idea of H-REMD. In the H-REMD algorithm, replicas differ in their Hamiltonians but have the same temperature. As in T-REMD, regular MD is performed and an exchange of configurations between two neighboring replicas is attempted periodically.
Let us consider replica $i$ with Hamiltonian $H_n$ and replica $j$ with Hamiltonian $H_m$, which are attempting to exchange. By employing the detailed balance equation (Eq. 2-78) and the Boltzmann weight of a molecular structure, the transition probability can be written as:

$$w(i \to j) = \min\left\{1, \exp\left[-\beta\left(H_m(q_i) + H_n(q_j) - H_n(q_i) - H_m(q_j)\right)\right]\right\} \qquad (2\text{-}108)$$

where $q_i$ and $q_j$ are the structures of the two replicas and $\beta = 1/k_B T$.

2.7.3 Technical Details in REMD Simulations

Temperature distributions have been explored in order to optimize the performance of the REMD method. For systems having a constant heat capacity, a geometric distribution of temperatures has been adopted. Sugita and Okamoto [62] and Kofke [166] argued that the most efficient way to exploit the REMD algorithm is to let each replica spend the same amount of simulation time at each temperature (a random walk in temperature space). In practice, this is achieved by producing the same acceptance ratio for each replica, given that each replica only attempts to exchange with its neighbors in temperature space. Under the condition that the system has a constant heat capacity, this yields a geometric distribution of temperatures ($T_{i+1}/T_i = c$, a constant). Sanbonmatsu and Garcia suggested an iterative method to distribute temperatures for replicas in 2002 [172]. They used the average potential energy as a function of temperature to maintain a random walk in temperature space. In 2005, Rathore et al. [171] suggested that an acceptance ratio of 0.2 yields the best performance, based on the constant heat capacity assumption. They chose a Go-type model of protein A and the Lennard-Jones liquid to study the deviation of the heat capacity relative to its final value as a function of acceptance ratio. A minimum of the deviation was observed at an acceptance ratio of around 0.2. Kone and Kofke [173] performed a similar study for parallel tempering simulations. They also considered a random walk model in temperature space through replica exchange moves.
The acceptance ratio is given as a function of $B$ and $C_v$ (Eq. 2-109), where $B = \beta_1/\beta_0$ is the ratio of the Boltzmann factors $\beta = 1/k_B T$ of the two replicas, and $C_v$ is the heat capacity, which is assumed to be constant in their study. Without loss of generality, $\beta_0$ is greater than $\beta_1$. The mean square displacement of this random walk (Eq. 2-110) was maximized with respect to the acceptance ratio. The maximum was found near an acceptance ratio of 20%.

$$\sigma^2 \propto \bar{p}_{\mathrm{acc}}\, (\ln B)^2 \qquad (2\text{-}110)$$

where $\sigma^2$ is the mean square displacement, and $\bar{p}_{\mathrm{acc}}$ and $B$ are as in Eq. 2-109.

Temperature distributions in parallel tempering simulations of the villin headpiece subdomain HP-36 were investigated by Trebst et al. [174]. HP-36 undergoes a helix-coil transition at high temperatures and hence the heat capacity is not constant. The diffusion of a replica in temperature space was introduced to judge the quality of a temperature distribution. Each replica is labeled according to which extreme temperature it visited most recently, for example "up" if its most recent visit to an extreme temperature was to the lowest one and "down" otherwise. For each temperature $T$, two histograms $n_{\mathrm{up}}(T)$ and $n_{\mathrm{down}}(T)$ are recorded. The two histograms keep a record of the number of replicas with each label observed at that temperature. The fraction of replicas traveling from the lowest to the highest temperature can be calculated as:

$$f(T) = \frac{n_{\mathrm{up}}(T)}{n_{\mathrm{up}}(T) + n_{\mathrm{down}}(T)} \qquad (2\text{-}111)$$

The diffusivity is adopted and has the form:

$$D(T) \propto \frac{\Delta T}{df/dT} \qquad (2\text{-}112)$$

They pointed out that the diffusivity is temperature dependent: a minimum of the diffusivity was observed around the temperature where the heat capacity is at its maximum. The plot of diffusivity versus temperature indicates that the random walk is suppressed the most when a phase transition occurs. The number of round trips between the temperature extremes of each replica was maximized to generate an optimal temperature distribution. More recently, Nadler and Hansmann [175-177] suggested that the optimal number of replicas between the lowest and highest temperatures in an explicit solvent simulation has the following formula: $N_{\mathrm{opt}} = 1 + 0.594\sqrt{C}\,\ln(T_{\max}/T_{\min})$, where $C$ is the heat capacity, and $T_{\max}$ and $T_{\min}$ are the highest and lowest temperatures, respectively.
They also proposed that the optimal temperature distribution can be formulated as: $T_i = T_{\min}\,(T_{\max}/T_{\min})^{(i-1)/(N-1)}$, where $T_{\min}$ and $T_{\max}$ are the lowest and highest temperatures and $N$ is the number of replicas.

In addition to the replica temperature distribution, the exchange attempt frequency (EAF) is also an important issue for the sampling efficiency of parallel tempering and REMD. In 2001, Opps and Schofield [178] investigated the effect of EAF on parallel tempering. A two-dimensional spin system and a polypeptide in vacuum were selected to test the effect of EAF on properties such as the order parameter and the radius of gyration of the polypeptide. They suggested that the most efficient scheme is to attempt an exchange after a few MC steps. The situation is more complicated in the case of REMD. In general, thermostats are used in MD propagation to maintain a canonical ensemble. It has been argued that exchanges in REMD should happen when the system temperature has stabilized [179]. Attempting to exchange frequently may prevent the system from dissipating heat. This argument was supported by studies of the peptide Fs21 performed by Zhang et al. [179]. They suggested that an exchange attempt interval of 1 ps is desirable for REMD. However, Sindhikara et al. [180] later showed that a small exchange attempt interval (even as small as a few MD steps) does not affect heat dissipation, provided that the REMD exchange is done properly. They investigated the deviation of conformational sampling, relative to a long reference simulation, as a function of EAF, and pointed out that a large EAF (a small exchange attempt time interval) is preferred. Abraham and Gready [181] studied the effect of EAF based on a 23-residue peptide in explicit water. By examining the potential energy autocorrelation time, they argued that an exchange period below 1 ps is too short for replica exchange attempts to be independent, and hence reduces the tempering efficiency. However, their conclusion was not supported by an investigation of tempering efficiency performed by Zhang and Ma [182]. Zhang and Ma utilized the transition matrix and its correlation functions. The autocorrelation function of the transition probability can be written as a function of the eigenvalues of the transition matrix. The decay time was explored in order to understand the tempering efficiency. Zhang and Ma found that the tempering efficiency increases monotonically as the EAF increases.

Thermostat effects on the performance of REMD have also been explored. Earlier work was done by the Garcia group [172]. They studied whether the potential energy PDFs satisfy the Boltzmann distribution:

$$\ln\frac{P_{T_1}(E)}{P_{T_2}(E)} = \left(\frac{1}{k_B T_2} - \frac{1}{k_B T_1}\right) E + C$$

where $P_T(E)$ is the potential energy PDF at temperature $T$ and $C$ is a constant. They found that the Nose-Hoover and Andersen thermostats satisfy the above condition, while the Berendsen thermostat does not. Rosta et al. [183] investigated thermostat artifacts in REMD simulations in 2009. The current REMD exchange scheme assumes a Boltzmann distribution (canonical ensemble) in the calculation of the exchange probability. However, the Berendsen thermostat cannot preserve the Boltzmann distribution. Thus REMD simulations of bulk water and of protein folding were performed with the temperature controlled either by the Berendsen thermostat or by Langevin dynamics. They studied the potential energy PDFs and thermal unfolding under the two thermostats. The Berendsen thermostat was shown to produce a shifted average potential energy and prolonged tails in the potential energy PDF of bulk water, while no such effect was seen when Langevin dynamics was employed. With the Berendsen thermostat, an increased probability of folding at low temperatures was reported, whereas the probability of folding was decreased at high temperatures. The authors proposed that REMD simulations be performed with thermostats that can generate a Boltzmann distribution, such as Langevin dynamics and the Andersen and Nose-Hoover thermostats.
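The linear relation used in this kind of thermostat check, $\ln[P_{T_1}(E)/P_{T_2}(E)] = (1/k_B T_2 - 1/k_B T_1)E + C$, can be illustrated on a model whose density of states is known analytically. The sketch below is a minimal example, assuming a hypothetical density of states $\Omega(E) \propto E^{n-1}$ (for which the canonical energy PDF is a gamma distribution); in an actual REMD analysis the two PDFs would instead be histograms of potential energies collected at two temperatures. The function names, the temperatures and the value of $n$ are illustrative choices.

```python
import math

K_B = 0.0019872041  # Boltzmann constant in kcal/(mol K)


def log_pdf(energy, temperature, n_dof):
    """ln P_T(E) for the assumed density of states Omega(E) ~ E**(n_dof - 1)."""
    beta = 1.0 / (K_B * temperature)
    return (n_dof * math.log(beta) + (n_dof - 1) * math.log(energy)
            - beta * energy - math.lgamma(n_dof))


def log_ratio_slope(energies, t1, t2, n_dof):
    """Least-squares slope of ln[P_T1(E) / P_T2(E)] plotted against E."""
    ys = [log_pdf(e, t1, n_dof) - log_pdf(e, t2, n_dof) for e in energies]
    mean_x = sum(energies) / len(energies)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(energies, ys))
    sxx = sum((x - mean_x) ** 2 for x in energies)
    return sxy / sxx


t1, t2, n_dof = 300.0, 320.0, 50
energies = [20.0 + 0.5 * k for k in range(40)]  # kcal/mol, arbitrary grid
slope = log_ratio_slope(energies, t1, t2, n_dof)
expected_slope = 1.0 / (K_B * t2) - 1.0 / (K_B * t1)
```

For a Boltzmann-distributed ensemble the fitted slope agrees with the expected value; a systematic deviation of the slope obtained from simulation histograms would indicate that the thermostat does not produce a canonical distribution.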
In a REMD simulation, the number of replicas needed to cover a temperature range scales as $f^{1/2}$, where $f$ is the number of degrees of freedom of the system. For a large system, the number of replicas needed is large. For example, 64 replicas were used in a REMD study of a β-hairpin surrounded by explicit water molecules (4342 atoms in each replica) to cover the temperature range from 270 K to 695 K [184]. A number of methods have been developed to reduce the number of replicas needed in REMD simulations. In 2002, Fukunishi et al. [61] proposed Hamiltonian REMD (H-REMD). In the H-REMD scheme, replicas differ in their Hamiltonians but have the same temperature. The exchange strategies in the paper of Fukunishi et al. were to scale hydrophobic interactions and to scale van der Waals interactions. In 2005, Liu et al. [167] published a method named replica exchange with solute tempering. In this algorithm, the protein-water interactions and water-water interactions are scaled such that the exchange probability does not depend on the number of explicit water molecules. The number of replicas needed in a replica exchange with solute tempering simulation to cover the same temperature range is significantly reduced compared with the original REMD algorithm. Lyman et al. [185], and later Liu and Voth [186,187], developed resolution exchange schemes to improve the performance of REMD. Coarse-grained (low resolution) models are employed to replace the role of high temperature replicas. The Simmerling group contributed the hybrid explicit/implicit solvation model [188] in order to reduce the number of replicas needed in REMD simulations with explicit water molecules. Each replica is propagated in an explicit water box. At an exchange attempt, the solute and its solvation shell, which is calculated on the fly, are placed in a dielectric continuum.
Exchange probabilities are calculated based on the potential energies of the solute and the hybrid solvent. The use of a hybrid solvent can shrink the number of replicas from 40 to 8 in a test case of the polypeptide Ala10 simulated at temperatures from 267 K to 571 K. Structural reservoir techniques [168-170] have also been incorporated into the REMD algorithm. High temperature MD simulations are performed first to generate a structural reservoir. Structures in the reservoir are then brought to the replicas via exchanges. One advantage of using a structural reservoir is that non-Boltzmann weight factors can be chosen in the calculation of exchange probabilities. [170] Recently, Ballard and Jarzynski [189] proposed using nonequilibrium work simulations to accept exchange attempts. Kamberaj and van der Vaart [190] developed a new scheme to perform exchanges, in which a generalized canonical PDF is employed to achieve a flat potential of mean force in temperature space. The Wang-Landau algorithm [163,164] was adopted to estimate the DOS in temperature space, and the round-trip time between the extreme temperatures was minimized.

More recently, solvent viscosity has been selected as a parameter in addition to temperature for the REMD method. [191] This method is named V-REMD and is essentially a two-dimensional REMD method. The motivation for choosing viscosity as a parameter is that the lower the viscosity, the faster a protein diffuses and samples the conformational space. In this algorithm, one replica is selected to have normal viscosity, while the others use reduced viscosities. The mass of the solvent molecules is scaled by a factor of $\lambda^2$ when the viscosity is scaled by a factor of $\lambda$. Changing the mass of the solvent molecules does not affect the potential energy at an exchange attempt. Thus, the exchange probability of V-REMD is the same as that of conventional T-REMD.
The author applied V-REMD to the study of trialanine, deca-alanine, and a 16-residue hairpin peptide. By using V-REMD, the number of replicas is reduced by a factor of 1.5 to 2.

The replica exchange method (REM) can be coupled with other generalized ensemble methods in order to enhance conformational sampling. The Okamoto group has coupled REM with MUCA and with simulated tempering. The two resulting schemes are called the multicanonical replica exchange method [192] and replica exchange simulated tempering, [193,194] respectively. The details of coupling REM with generalized ensemble methods can be found in a review by Mitsutake et al. [53]

Due to its stochastic nature, the REMD algorithm has been employed to investigate thermodynamics rather than kinetics. [195] However, a properly designed scheme for analyzing the REMD trajectory in phase space can yield information about kinetics. In 2005, Levy and coworkers [195] designed a kinetic network and used the master equation to solve for transition rates from REMD simulations. The structures at all temperatures were grouped into states based on their structural similarity (instead of clustering, they selected a 42-dimensional Euclidean distance space based on Cα-Cα distances to group their structures). A state is denoted as a node, and an edge stands for a transition between two nodes. A total of 800,000 nodes and 7.347 × 10^9 edges were obtained. The master equation was used to describe the transitions between states. Since the conformational space was discretized into states, the master equation is written in matrix notation,

$$\frac{d\mathbf{P}(t)}{dt} = \mathbf{K}\,\mathbf{P}(t)$$

where $\mathbf{K}$ is the transition matrix and $\mathbf{P}(t)$ is the probability distribution of states at time $t$. Instead of solving for the eigenvalues of the transition matrix or solving the differential equation numerically, the authors simulated paths satisfying the master equation. This kind of Markov state model has also been employed in the study of protein folding.
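The matrix form of the master equation can be made concrete on a toy system. The sketch below is an illustration only: it integrates $d\mathbf{P}/dt = \mathbf{K}\mathbf{P}$ deterministically with a forward Euler step, whereas the cited work simulated stochastic paths on a very large network. The two-state rate constants and the function name `propagate_master_equation` are hypothetical.

```python
def propagate_master_equation(rate_matrix, p0, dt, n_steps):
    """Integrate dP/dt = K P with forward Euler.

    rate_matrix[i][j] is the rate constant for the transition j -> i (i != j);
    each diagonal element is minus the sum of the other entries in its column,
    so that total probability is conserved.
    """
    p = list(p0)
    n = len(p)
    for _ in range(n_steps):
        dp = [sum(rate_matrix[i][j] * p[j] for j in range(n)) for i in range(n)]
        p = [p[i] + dt * dp[i] for i in range(n)]
    return p


# Toy two-state network (hypothetical rates, arbitrary time units).
k_01 = 2.0  # state 0 -> state 1
k_10 = 1.0  # state 1 -> state 0
K = [[-k_01, k_10],
     [k_01, -k_10]]
p_final = propagate_master_equation(K, [1.0, 0.0], dt=0.001, n_steps=10000)
print(p_final)  # approaches the equilibrium populations k_10/(k_01+k_10), k_01/(k_01+k_10)
```

For a network of the size quoted above, storing and propagating the full matrix is impractical, which is consistent with the authors' choice to simulate paths instead.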
In 2006, van der Spoel and Seibert [196] studied protein folding rates based on the Arrhenius equation. The folding mechanism in their investigation was assumed to be two-state. A binary folding indicator, based on the RMSD relative to the native state, was adopted by the authors, and a first-order reaction rate equation was set up. The rate equation was then integrated and averaged over all trajectories in order to generate a derived fraction of folded structures. A fitting parameter $\chi^2$, which measures the difference between the derived and the actual fraction of folded structures, was minimized numerically with respect to the energy barriers and pre-exponential factors. In this manner, Arrhenius reaction rates can be obtained from REMD simulations. In 2007, Yang et al. [197] proposed using a diffusion equation to extract kinetics from REMD simulations. The Fokker-Planck equation was employed to extract the local drift velocity and diffusion coefficient from REMD simulations. Langevin dynamics on the reaction coordinate is then performed using the drift velocity and diffusion coefficient, and the free energy landscape is reconstructed from the same two quantities. In 2008, Buchete and Hummer [198] demonstrated that both local conformational transition rates and global folding rates can be accurately extracted from REMD simulations without any assumption about the temperature dependence of the kinetics (Arrhenius or non-Arrhenius). Similar to Levy and coworkers, Buchete and Hummer adopted a master equation operating on a discretized space to describe transitions. The conditional probability of finding the system in a given state at a given time, given the initial state, was computed from the master equation.
The likelihood of observing the recorded numbers of transitions in a time interval was maximized with respect to the natural logarithms of the transition rate constants (transition matrix elements) and of the equilibrium state populations. Thus, the rate constants were generated. A detailed description can be found in the paper of Buchete and Hummer.

CHAPTER 3
CONSTANT pH REMD: METHOD AND IMPLEMENTATION

3.1 Introduction

In this chapter, the constant pH REMD algorithm used in the AMBER simulation suite is presented and employed to study model systems. We first tested our method on five dipeptides and a model peptide having the sequence Ala-Asp-Phe-Asp-Ala (ADFDA). The two ends of the model peptide ADFDA were not capped, so the two ionizable side chains experience different electrostatic environments, and the pKa values of the two Asp residues are expected to differ accordingly. Our constant pH REMD method was then applied to a heptapeptide derived from OMTKY3, the same heptapeptide that Dlugosz and Antosiewicz studied in their paper. NMR experiments indicated that the pKa of Asp is 3.6, [122] 0.4 pKa units lower than the value of the blocked Asp dipeptide. Dlugosz and Antosiewicz performed constant pH MD simulations, and their method predicted the pKa to be 4.24. [122] Our purpose is to show that the REMD algorithm coupled with a discrete protonation state description can greatly improve pH-dependent protein conformation and protonation state sampling.

3.2 Theory and Methods

3.2.1 Constant pH REMD Algorithm in AMBER Simulation Suite

In the case of constant pH molecular dynamics, the potential energy of the system depends not only on the protein structure but also on the protein protonation state.

Reproduced in part with permission from Meng, Y.; Roitberg, A. E. Constant pH Replica Exchange Molecular Dynamics in Biomolecules Using a Discrete Protonation Model, J. Chem. Theory Comput. 2010, 6, 1401-1412.
Copyright 2010 American Chemical Society.

Likewise, when coupling the REMD algorithm with constant pH MD, one can either attempt to exchange molecular structures only or swap both structures and protonation states at the same time. For simplicity, let us consider two replicas, where replica 0 has temperature $T_0$, protein structure $q_0$, and protonation state $n_0$, while replica 1 has temperature $T_1$, structure $q_1$, and protonation state $n_1$. A diagrammatic description of the two exchange algorithms is shown in Figure 3-1.

Figure 3-1. Methods to perform exchange attempts. A) Only molecular structures are exchanged; the protonation states are kept the same. B) Both molecular structures and protonation states are exchanged.

The first way of performing an exchange attempt is that replica 0 tries to jump from state ($q_0$, $n_0$) to state ($q_1$, $n_0$) at temperature $T_0$ in one Monte Carlo step. Similarly, replica 1 attempts to move from state ($q_1$, $n_1$) to state ($q_0$, $n_1$) at temperature $T_1$. Protonation states are kept fixed at exchange attempts and only change during dynamics. Therefore, the detailed balance equation now becomes:

$$P_{T_0}(q_0, n_0)\, P_{T_1}(q_1, n_1)\, w(T_0 q_0 n_0, T_1 q_1 n_1 \to T_0 q_1 n_0, T_1 q_0 n_1) = P_{T_0}(q_1, n_0)\, P_{T_1}(q_0, n_1)\, w(T_0 q_1 n_0, T_1 q_0 n_1 \to T_0 q_0 n_0, T_1 q_1 n_1) \quad (3\text{-}1)$$

Here $P_T(q, n)$ is the Boltzmann weight of structure $q$ with protonation state $n$ at temperature $T$, and $w(T_0 q_0 n_0, T_1 q_1 n_1 \to T_0 q_1 n_0, T_1 q_0 n_1)$ is the transition probability of swapping structures. If the Metropolis criterion is used, this exchange probability can be written as:

$$w(T_0 q_0 n_0, T_1 q_1 n_1 \to T_0 q_1 n_0, T_1 q_0 n_1) = \min\{1, \exp(-\Delta)\} \quad (3\text{-}2)$$

In Eq. 3-2, $\Delta$ has the form:

$$\Delta = \beta_0 \left[ E(q_1, n_0) - E(q_0, n_0) \right] - \beta_1 \left[ E(q_1, n_1) - E(q_0, n_1) \right] \quad (3\text{-}3)$$

Here $\beta_0 = 1/k_B T_0$, $\beta_1 = 1/k_B T_1$, and $E$ is the potential energy. If the protonation states of two adjacent replicas are the same at an exchange attempt, the exchange probability of our constant pH REMD is equivalent to the conventional REMD exchange probability. If this is not the case, however, four potential energy terms are needed to calculate the exchange probability. Under this circumstance, constant pH REMD becomes a REMD algorithm that combines the temperature and Hamiltonian REMD algorithms.
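The structure-only exchange criterion can be sketched in a few lines. This is an illustration under stated assumptions, not the AMBER implementation: the function name `structure_swap_probability` and the numerical energies are hypothetical, the acceptance is taken as $\min\{1, \exp(-\Delta)\}$ with $\Delta = \beta_0[E(q_1,n_0) - E(q_0,n_0)] - \beta_1[E(q_1,n_1) - E(q_0,n_1)]$ as follows from detailed balance with Boltzmann weights, and energies are assumed to be in kcal/mol.

```python
import math

K_B = 0.0019872041  # Boltzmann constant in kcal/(mol K)


def structure_swap_probability(t0, t1, e_q0n0, e_q1n0, e_q0n1, e_q1n1):
    """Metropolis probability of swapping structures q0 and q1 between
    replica 0 (temperature t0, protonation state n0) and replica 1
    (temperature t1, protonation state n1).

    e_qXnY is the potential energy of structure qX evaluated with
    protonation state nY, so four energies are needed when n0 != n1.
    """
    beta0 = 1.0 / (K_B * t0)
    beta1 = 1.0 / (K_B * t1)
    delta = beta0 * (e_q1n0 - e_q0n0) - beta1 * (e_q1n1 - e_q0n1)
    return 1.0 if delta <= 0.0 else math.exp(-delta)


# Different protonation states: all four energies enter (hypothetical values).
print(structure_swap_probability(300.0, 370.0, -100.0, -96.0, -103.0, -98.0))

# Same protonation state: E(q, n0) == E(q, n1), and the expression reduces to
# the conventional temperature REMD criterion.
print(structure_swap_probability(300.0, 370.0, -100.0, -95.0, -100.0, -95.0))
```

The second call shows the limiting case discussed in the text: with identical protonation states only two distinct energies remain, and $\Delta$ becomes $(\beta_0 - \beta_1)[E(q_1) - E(q_0)]$.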
One possible concern about exchanging only structures is the role of the kinetic energy, especially when $n_0$ and $n_1$ are different. In the REMD algorithm developed by Sugita and Okamoto, the kinetic energy terms in the Boltzmann factors cancel each other on average through velocity rescaling (Eq. 2-88), so only potential energies are required to compute exchange probabilities. Canceling the kinetic energy terms is problematic when the numbers of particles of the two systems attempting to exchange are not the same. However, according to the constant pH MD algorithm proposed by Mongan et al., a proton does not leave the molecule but becomes a dummy atom when an ionizable side chain is in its deprotonated state. Furthermore, that dummy atom retains its position and velocity, which are controlled by molecular dynamics. Hence, the kinetic energy contributions to the Boltzmann weight cancel out during the exchange probability calculation, leaving only the potential energy in the calculation.

The second possibility consists of exchanging protonation states as well as molecular structures at REMD Monte Carlo moves. For instance, replica 0 attempts to move from state ($q_0$, $n_0$) to state ($q_1$, $n_1$) at temperature $T_0$ in one MC move, and replica 1 attempts to jump from state ($q_1$, $n_1$) to state ($q_0$, $n_0$) at temperature $T_1$. The exchange probability can now be written as:

$$w(T_0 q_0 n_0, T_1 q_1 n_1 \to T_0 q_1 n_1, T_1 q_0 n_0) = w(T_0 q_0 n_0 \to T_0 q_1 n_1)\, w(T_1 q_1 n_1 \to T_1 q_0 n_0) \quad (3\text{-}4)$$

This equation states that the exchange probability is the product of the MC transition probabilities at temperatures $T_0$ and $T_1$. If the protonation states of two adjacent replicas are the same at an exchange attempt, the exchange probability of constant pH REMD becomes the exchange probability of conventional temperature-based REMD. If $n_0$ and $n_1$ are different, then each MC transition is essentially the protonation state change step of constant pH MD plus a structural transition. For example, consider the MC transition at temperature $T_0$:

$$w(T_0 q_0 n_0 \to T_0 q_1 n_1) = \min\{1, \exp(-\Delta_1)\} \quad (3\text{-}5)$$

In Eq. 3-5, $\Delta_1$ has the form:

$$\Delta_1 = \beta_0 \left[ E(q_1, n_0) - E(q_0, n_0) \right] + \ln 10\,(\mathrm{pH} - \mathrm{p}K_{a,\mathrm{ref}}) + \beta_0 \left[ E_{elec}(q_1, n_1) - E_{elec}(q_1, n_0) \right] - \beta_0\, \Delta G_{\mathrm{ref}} \quad (3\text{-}6)$$

The first term in $\Delta_1$ derives from the transition in configuration at fixed protonation state $n_0$, and the rest corresponds to the protonation state change at fixed structure $q_1$. $E_{elec}$ represents the electrostatic component of the potential energy. Similarly, the transition probability of the MC jump at $T_1$ can be expressed as:

$$w(T_1 q_1 n_1 \to T_1 q_0 n_0) = \min\{1, \exp(-\Delta_2)\} \quad (3\text{-}7)$$

and

$$\Delta_2 = \beta_1 \left[ E(q_0, n_1) - E(q_1, n_1) \right] - \ln 10\,(\mathrm{pH} - \mathrm{p}K_{a,\mathrm{ref}}) - \beta_1 \left[ E_{elec}(q_0, n_1) - E_{elec}(q_0, n_0) \right] + \beta_1\, \Delta G_{\mathrm{ref}} \quad (3\text{-}8)$$

Therefore, similar to Eq. 3-2, the exchange probability can be written as:

$$w(T_0 q_0 n_0, T_1 q_1 n_1 \to T_0 q_1 n_1, T_1 q_0 n_0) = \min\{1, \exp(-\Delta')\} \quad (3\text{-}9)$$

and

$$\Delta' = \Delta + \beta_0 \left[ E_{elec}(q_1, n_1) - E_{elec}(q_1, n_0) \right] - \beta_1 \left[ E_{elec}(q_0, n_1) - E_{elec}(q_0, n_0) \right] + (\beta_1 - \beta_0)\, \Delta G_{\mathrm{ref}} \quad (3\text{-}10)$$

In Eq. 3-10, $\Delta$ is the same quantity as in Eq. 3-3. The exchange probability calculation in the second method of coupling REMD and constant pH MD uses the same energy terms required by the first method, since obtaining the electrostatic potential energies does not require extra energy calculations. The advantage of implementing the second exchange protocol over the first should not be significant, because it is the conformational sampling at higher temperatures that greatly improves conformational sampling at lower temperatures. Allowing protonation states to change at exchange attempts does not provide extra gains in conformational sampling. In addition, one can always choose to sample protonation state space during the MD propagation. Therefore, only the first method of performing exchanges was implemented.

3.2.2 Simulation Details

Constant pH REMD simulations were carried out first on five reference compounds, blocked aspartate, glutamate, histidine, lysine, and tyrosine, to test our method and implementation. The experimental pKa values of those reference compounds are known and listed in Table 3-1. We later performed constant pH REMD simulations on a model peptide ADFDA (Ala-Asp-Phe-Asp-Ala, unblocked termini) and on the heptapeptide derived from OMTKY3 (residues 26 to 32 with blocked termini).
Four replicas were used in the REMD simulations of the reference compounds and ADFDA. The temperatures were 240, 300, 370, and 460 K for all six molecules. The pH range for the acidic side chains was sampled from 2.5 to 6, and the pH range for histidine from 5.5 to 8. The basic side chains were titrated from pH 9 to 12. An interval of 0.5 pH units was chosen for all titrations. Eight replicas were chosen for the heptapeptide, with a temperature range from 250 to 480 K. Each replica was simulated for 10 ns in all REMD simulations, and an exchange was attempted every 2 ps. An MC move to change the protonation state was attempted every 10 fs. A second set of REMD runs was done with the same overall conditions but different initial structures in order to check simulation convergence. To compare conformational and protonation state sampling, 100 ns of constant pH MD simulations were carried out for the aspartate reference compound and ADFDA at the same pH values as in the REMD runs. For the heptapeptide, one set of 10 ns constant pH MD simulations was done at all pH values simulated by the REMD method.

Constant pH REMD and MD simulations were done using the AMBER 10 molecular simulation suite. [199] The AMBER ff99SB force field [139] was used in all the simulations. The SHAKE algorithm [145] was used to constrain the bonds connecting hydrogen atoms with heavy atoms in all the simulations, which allowed the use of a 2 fs time step. The OBC Generalized Born implicit solvent model [200] was used to model the water environment in all our calculations. The Berendsen thermostat, [146] with a relaxation time of 2 ps, was used to keep the replica temperatures around their target values. The salt concentration (Debye-Huckel based) was set at 0.1 M. The cutoff for nonbonded interactions and the Born radii was 30 Å.

3.2.3 Global Conformational Sampling Comparison Using Cluster Analysis

In our study, global conformational sampling was compared using cluster analysis. [169,188] In a cluster analysis, each resulting group is called a cluster.
A cluster analysis measures the similarity between two objects. In the cluster analysis we performed, protein backbone similarity (measured by backbone RMSD) is considered and the hierarchical agglomerative clustering algorithm is employed. A hierarchical algorithm creates a hierarchy of clusters and can be agglomerative or divisive. The hierarchical agglomerative algorithm starts by considering every object as a cluster and combines similar clusters into one cluster, while the divisive algorithm starts with one cluster containing all objects and divides it into more groups. In our work, the cluster analysis was done using the Moil-View program. [201]

The MD and REMD trajectories (having the same number of frames) at 300 K and the same solution pH value were first combined, using the ptraj module of the AMBER package. The combined trajectory was clustered based on the root-mean-square deviations (RMSDs) of the peptide backbone atoms. A cluster cutoff RMSD of 1.5 Å was chosen for both ADFDA and the heptapeptide in our analysis. By clustering the combined trajectory, the MD and REMD conformational ensembles populate the same clusters. The fraction of the conformational ensemble corresponding to each cluster (the fractional population of each cluster) was calculated for the MD and REMD runs separately, generating two sets of fractional cluster populations. Note that the fractional population of a given cluster from the MD trajectory may differ from that from the REMD trajectory. The correlation between the two sets of fractional populations can therefore be investigated by plotting one set against the other and performing a linear fit. The Moil-View program generates a file indicating which cluster each snapshot in the combined trajectory belongs to.
Thus, the fractional population of each cluster was obtained for the MD and REMD simulations. If the MD and REMD simulations produced the same structural ensemble, the fractional population of a cluster from the MD simulation would be the same as that from the REMD simulation. The cluster population fractions from the REMD simulation were plotted against those from the MD simulation (see Figure 3-7A). The correlation coefficients between the MD and REMD cluster populations were calculated at each solution pH value by linear regression. [169,188] A high correlation between the MD and REMD cluster populations indicates that the structural ensembles are similar to each other. This method provides a direct comparison of global conformational sampling between MD and REMD simulations.

The same technique was used when studying the convergence of constant pH REMD and MD trajectories (see Figure 3-7B and Figure 3-12). When investigating the convergence of conformational sampling, snapshots from two constant pH REMD simulations (or two constant pH MD simulations) were combined. The two constant pH simulations have the same temperatures and solution pH values and differ only in their initial structures. A high correlation coefficient indicates that the two structural ensembles are similar and that the two conformational samplings are converged, while a poor correlation means that the structural ensembles are different and that the conformational sampling depends on the initial conditions.

3.2.4 Local Conformational Sampling and Convergence to Final State

In our study, the local conformational sampling was examined by comparing the probability distributions of the backbone dihedral angle pairs (φ, ψ). The bin sizes for φ and ψ are 10°, which leads to a 36 × 36 histogram. These two-dimensional histograms were normalized into populations, and the convergence was measured by the root-mean-square deviation (RMSD) between them. Essentially, we were computing the RMSD between two matrices.
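The matrix RMSD just described can be sketched as follows. This is a minimal illustration, not the analysis script used in this work: the helper names `dihedral_histogram` and `histogram_rmsd` are hypothetical, angles are assumed to be given in degrees within [-180, 180], and the four (φ, ψ) pairs are made-up values chosen only to make the arithmetic easy to follow.

```python
import math


def dihedral_histogram(phi_psi_pairs, bin_width=10.0):
    """Normalized 2D histogram of (phi, psi) pairs given in degrees."""
    n_bins = int(round(360.0 / bin_width))
    hist = [[0.0] * n_bins for _ in range(n_bins)]
    for phi, psi in phi_psi_pairs:
        # min() keeps an angle of exactly +180 degrees inside the last bin.
        i = min(int((phi + 180.0) / bin_width), n_bins - 1)
        j = min(int((psi + 180.0) / bin_width), n_bins - 1)
        hist[i][j] += 1.0
    total = float(len(phi_psi_pairs))
    return [[count / total for count in row] for row in hist]


def histogram_rmsd(p, q):
    """Root-mean-square deviation between two equally sized matrices."""
    n_rows, n_cols = len(p), len(p[0])
    squared = sum((p[i][j] - q[i][j]) ** 2
                  for i in range(n_rows) for j in range(n_cols))
    return math.sqrt(squared / (n_rows * n_cols))


# Made-up trajectory of four snapshots; the cumulative histogram after the
# first two snapshots is compared with the histogram of the full trajectory.
trajectory = [(-60.0, -45.0), (-60.0, -45.0), (60.0, 45.0), (-120.0, 130.0)]
final = dihedral_histogram(trajectory)
cumulative = dihedral_histogram(trajectory[:2])
print(histogram_rmsd(cumulative, final))
```

Repeating the comparison for growing prefixes of a trajectory gives the RMSD-versus-time curves used to judge how quickly a simulation approaches its final distribution.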
The RMSD between the cumulative probability density at time $t$ and the final probability density (all configurations were used to compute the final probability density) is given by

$$\mathrm{RMSD}(t) = \sqrt{\frac{1}{36 \times 36} \sum_{i=1}^{36} \sum_{j=1}^{36} \left[ p_{ij}(t) - p_{ij}^{final} \right]^2} \quad (3\text{-}11)$$

where $p_{ij}(t)$ is the $ij$th element of the cumulative probability density of the (φ, ψ) pairs at time $t$ and $p_{ij}^{final}$ is the corresponding element in the final probability density matrix.

3.3 Results and Discussion

3.3.1 Reference Compounds

We first applied our constant pH REMD method to the reference compounds. Table 3-1 shows the pKa values predicted by the REMD simulations (10 ns for each replica) as well as the reference pKa values. All our pKa values were calculated by fitting to the HH equation. The constant pH REMD predictions agree with the reference values.

Table 3-1. The REMD pKa predictions of reference compounds.

pKa        Aspartate    Glutamate    Histidine    Lysine        Tyrosine
REMD       3.97 (0.01)  4.41 (0.01)  6.40 (0.03)  10.42 (0.01)  9.61 (0.01)
Reference  4.0          4.4          6.5          10.4          9.6

The numbers in parentheses are the standard errors.

The pH titration curves of the same reference compounds showed agreement between the MD (100 ns) and REMD simulations. Figure 3-2 shows the REMD and MD titration curves of the aspartic acid reference compound as an example.

Figure 3-2. Titration curves of the blocked aspartate amino acid from 100 ns MD at 300 K and from REMD runs. Agreement can be seen between the MD and REMD simulations.

We further studied the convergence of protonation state sampling. The REMD and MD cumulative protonation fractions were plotted with respect to MC attempts for the aspartate reference compound at all pH values. Figure 3-3 shows the protonated fraction versus time at pH 4 as one example.
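A pKa fit of the kind mentioned above can be sketched with a linearized Hill plot. The source does not state the fitting procedure, so the least-squares fit of $\log_{10}[f_d/(1-f_d)] = n(\mathrm{pH} - \mathrm{p}K_a)$ used here is an assumption, as is the function name `fit_hill_plot`; $f_d$ is the deprotonated fraction and $n$ the Hill coefficient, with $n = 1$ recovering the HH equation. The input data are synthetic, generated from an assumed pKa and Hill coefficient rather than taken from a simulation.

```python
import math


def fit_hill_plot(ph_values, deprotonated_fractions):
    """Return (pKa, Hill coefficient) from a linear least-squares Hill plot."""
    xs, ys = [], []
    for ph, frac in zip(ph_values, deprotonated_fractions):
        # Fractions of exactly 0 or 1 cannot be placed on a logarithmic plot.
        if 0.0 < frac < 1.0:
            xs.append(ph)
            ys.append(math.log10(frac / (1.0 - frac)))
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    hill = sxy / sxx
    intercept = y_mean - hill * x_mean
    return -intercept / hill, hill


# Synthetic titration data on the pH grid used for acidic residues (2.5 to 6
# in steps of 0.5), generated from an assumed pKa of 4.0 and n = 1.
ph_grid = [2.5 + 0.5 * i for i in range(8)]
fractions = [1.0 / (1.0 + 10.0 ** (4.0 - ph)) for ph in ph_grid]
pka, hill = fit_hill_plot(ph_grid, fractions)
print(round(pka, 3), round(hill, 3))
```

A Hill coefficient below 1 obtained from such a fit indicates a titration curve broader than the HH form, which is how interactions between titrating sites usually show up.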
According to Figure 3-3, although the final pKa predictions are the same for the REMD and MD simulations, the protonation state sampling in the REMD simulations clearly converges faster than that in the MD run.

Figure 3-3. Cumulative average protonation fraction of the aspartic acid reference compound vs. Monte Carlo (MC) steps at pH = 4.

3.3.2 Model Peptide ADFDA

The model peptide ADFDA (as a zwitterion) was chosen as a more stringent test of our constant pH REMD method. The charged termini provide a different electrostatic environment for each titratable Asp residue, and hence a correct constant pH REMD model should reflect this difference in the titration curves of the two Asp residues. The Asp2 residue is closer to the NH3+ group, so the deprotonated state is favored and the pKa value of Asp2 should shift below 4.0 (the pKa value of the reference aspartic dipeptide). The Asp4 residue is closer to the negatively charged COO- group, and hence its pKa value should shift above 4.0. The titration curves of the model peptide ADFDA from the REMD simulations are shown in Figure 3-4. We can clearly see that Asp2 and Asp4 have titration curves that differ from each other and from that of the reference compound. The pKa value and Hill coefficient of each Asp residue were obtained by fitting the titration curves to a Hill plot. The results are shown in Table 3-2. The REMD pKa predictions reflect the difference between Asp2 and Asp4 due to their different electrostatic environments in the peptide. We also display the MD titration curves of Asp2 and Asp4 in Figure 3-4 and list the MD pKa predictions and corresponding Hill coefficients in Table 3-2. The titration curve of Asp2 shows only a small difference between the MD and REMD simulations. However, differences in the titration behavior of Asp4 between the MD and REMD calculations can be seen when the solution pH is below 5. Interestingly, Lee et al.
studied a blocked Asp-Asp peptide using the CPHMD method, reporting a different Hill coefficient for each of the two Asp residues.

Figure 3-4. The titration curves of the model peptide ADFDA at 300 K from both MD and REMD simulations. The MD simulation time was 100 ns, and 10 ns were used for each replica in the REMD runs.

Table 3-2. pKa predictions and Hill coefficients fitted from the Hill plots.

        Asp2                      Asp4
        pKa    Hill coefficient   pKa    Hill coefficient
REMD    3.74   0.87               4.38   0.67
MD      3.76   0.89               4.54   0.85

Convergence rates of the Asp2 titration behavior were compared between the REMD and MD calculations because the Asp2 titration curves are very close. The cumulative protonated fractions versus MC attempts at pH 4 are shown in Figure 3-5. Again, faster convergence in protonation state sampling can be seen for the REMD simulation, even though both the REMD and MD calculations resulted in the same final protonated fraction. Clearly, our constant pH REMD method accelerates the convergence of protonation state sampling.

Figure 3-5. Cumulative average protonation fraction of Asp2 in the model peptide ADFDA vs. Monte Carlo (MC) steps at pH = 4.

In addition to protonation state sampling, we also evaluated the conformational sampling in the constant pH MD and REMD simulations. The backbone dihedral angle (φ, ψ) distributions at each solution pH were studied. The regions in the Ramachandran plots sampled by the MD and REMD simulations are the same at all pH values. Ramachandran plots for residue Asp2 at pH 4 are shown in Figure 3-6 as an example.

Figure 3-6. Backbone dihedral angle distributions (Ramachandran plots) for Asp2 at pH 4 in ADFDA. Ramachandran plots at other solution pH values are similar.

For Asp2, constant pH MD and REMD sampled the same local backbone conformational space. The Phe3 and Asp4 Ramachandran plots display the same trend. Since a Ramachandran plot only represents local conformational sampling, we also evaluated global conformational sampling by clustering the MD and REMD trajectories and comparing the cluster populations.
The MD and REMD cluster population R² values are listed in Table 3-3. A plot of the cluster populations from the MD and REMD trajectories at a solution pH of 4 is shown in Figure 3-7A as an example. The large R² values indicate that MD and REMD sampled the same conformational space and generated the same structural ensemble. The small size of ADFDA and the simple structure of each residue make 100 ns long enough for MD to sample the relevant conformations. We further studied the convergence of the REMD simulations by comparing the global conformation distributions of two REMD simulations starting from two different structures. The cluster populations of the two REMD simulations at solution pH 4 are displayed in Figure 3-7B. The R² value is 0.959 at pH 4. This large correlation tells us that the two REMD simulations provide the same structural ensemble, and hence that the two simulations are converged.

Table 3-3. Correlation coefficients between MD and REMD cluster populations.

pH    2.5    3      3.5    4      4.5    5      5.5    6
R²    0.94   0.90   0.79   0.93   0.85   0.98   0.92   0.96

The R² values were calculated by linear regression.

Figure 3-7. Cluster populations of ADFDA at 300 K. A) MD vs. REMD at pH 4. Trajectories from the MD and REMD simulations are combined first. By clustering the combined trajectory, the MD and REMD structural ensembles populate the same clusters. The fraction of the conformational ensemble corresponding to each cluster (the fractional population of each cluster) was calculated for the MD and REMD simulations separately, and the two sets of fractional populations were plotted against each other. B) Two REMD runs from different starting structures at pH 4. The large correlation shown in Figure 3-7B suggests that the REMD runs are converged. Large correlations between two independent REMD runs are also observed at other solution pH values.
Correlations between the MD and REMD simulations can be found in Table 3-3.

3.3.3 Heptapeptide Derived from OMTKY3

We first compared the protonation state sampling between the constant pH REMD and MD simulations. Titration curves of Asp3, Lys5, and Tyr7 from the two sets of simulations are plotted in Figures 3-8A and 3-8B. For each titratable residue, the titration curves generated by constant pH REMD and MD are close to each other. Since the pKa value of Asp3 in this heptapeptide has been experimentally determined to be 3.6, it is interesting to evaluate how our predicted values compare to the experimental result. The pKa was obtained from the Hill plot shown in Figure 3-8C. The predicted pKa value is 3.7 for both the REMD and MD simulations, in excellent agreement with the experimental pKa value. Following the same procedure, our predicted pKa values of Lys5 and Tyr7 from the constant pH REMD and MD simulations were obtained. Not surprisingly, the REMD and MD schemes yielded essentially the same predicted pKa values for Lys5 and Tyr7.

Figure 3-8. A) Titration curves of Asp3 in the heptapeptide derived from the protein OMTKY3. B) Titration curves of Lys5 and Tyr7 in the heptapeptide derived from the protein OMTKY3. C) pKa values obtained from Hill plots.

Although the final pKa predictions are the same for the constant pH REMD and MD simulations, constant pH REMD showed a clear advantage in the convergence of protonation state sampling. Again, we chose the cumulative average protonation fraction vs. MC steps to reflect the convergence of protonation state sampling for all three titratable residues. Several representative plots are shown in Figure 3-9. Constant pH REMD simulations produced faster convergence of the protonation fraction in every case examined. Therefore, the constant pH REMD method clearly outperforms constant pH MD in protonation state sampling.
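The cumulative average protonation fraction used as the convergence measure can be computed with a running mean. This is a minimal sketch: the function name `cumulative_protonated_fraction` and the encoding of the protonation state as 1 (protonated) or 0 (deprotonated) per recorded MC attempt are assumptions, not the output format of the simulation code.

```python
def cumulative_protonated_fraction(states):
    """Running average of a 0/1 protonation record.

    states[k] is 1 if the residue is protonated after MC attempt k + 1,
    otherwise 0. Element k of the result is the protonated fraction over
    the first k + 1 attempts.
    """
    fractions = []
    protonated = 0
    for step, state in enumerate(states, start=1):
        protonated += state
        fractions.append(protonated / step)
    return fractions


# Short made-up record of protonation states.
record = [1, 0, 1, 1, 0, 0, 1, 0]
print(cumulative_protonated_fraction(record))
```

Plotting this running average against the attempt index for an MD and a REMD record at the same pH reproduces the kind of comparison shown in the cumulative protonation fraction figures.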
Figure 3-9. A) Cumulative average protonation fraction of Asp3 in the heptapeptide derived from OMTKY3 vs. MC steps. B) and C) Cumulative average protonation fractions of Tyr7 and Lys5, respectively, in the heptapeptide vs. MC steps. Clearly, faster convergence is achieved in the constant pH REMD simulations.

Conformational sampling is an important issue in constant pH studies. We first looked at the conformational sampling of the peptide backbone, evaluated through Ramachandran plots. Six residues (from Ser2 to Tyr7) are studied here. Not surprisingly, the Ramachandran plots from the constant pH REMD and MD simulations are very close, suggesting that the overall local conformational samplings are similar. The Ramachandran plots of Asp3 at pH 4 are shown in Figure 3-10 as examples. The only exception is Tyr7 at acidic pH values: Tyr7 can visit the left-handed alpha helix conformation during constant pH REMD runs but not in constant pH MD runs. In general, constant pH REMD and MD yielded the same Ramachandran plots for the heptapeptide.

Figure 3-10. Dihedral angle (φ, ψ) probability densities of Asp3 at pH 4. A) Constant pH MD results. B) Constant pH REMD results. The two probability densities are almost identical, indicating that constant pH MD and REMD sample the same local conformational space. All others show a very similar trend. The dihedral angle probability densities from constant pH REMD and MD are similar for Ser2 to Thr6.

It is interesting to determine how fast each sampling scheme reaches the final distribution. We studied the evolution of backbone conformational sampling based on cumulative data, as we did in the case of protonation state sampling convergence. As described in the Methods section, the RMSD between the cumulative (φ, ψ) probability density and the final probability density was calculated as a function of simulation time. The smaller the RMSD, the closer a probability distribution is to the final distribution. Deviations were calculated starting from the second nanosecond, with time intervals incremented by 100 ps.
The cumulative time-dependent RMSDs of Asp3 and Lys5 are shown in Figure 3-11 as examples. As seen in the figure, these curves decrease faster in the constant pH REMD simulations. Figure 3-11 suggests that although the final (φ, ψ) probability distributions are similar between the constant pH REMD and MD simulations, the constant pH REMD simulation clearly reaches the final state faster.

Figure 3-11. The root-mean-square deviations (RMSD) between the cumulative and final (φ, ψ) probability densities. The behaviors at other pH values also show that REMD runs converge to the final distribution faster.

Cluster analysis was also applied to study the convergence of conformational sampling in the heptapeptide. By comparing cluster populations between the first and second halves of one trajectory, one can check the convergence of that simulation. The two halves of a structural ensemble should yield the same populations in each cluster if convergence is reached. For example, for the simulations at pH 4, both constant pH REMD and MD yield about 20 clusters, and the correlation coefficients are calculated through a linear regression. Cluster population plots and correlation coefficients are shown in Figure 3-12. A much higher correlation coefficient is seen for the constant pH REMD simulation, suggesting that the two halves of the constant pH REMD simulation at pH 4 populate each cluster much more similarly than those of the corresponding constant pH MD simulation. Hence, much better convergence is achieved by the constant pH REMD run.

Figure 3-12. Cluster populations at 300 K from constant pH MD and REMD simulations at pH = 4. Cluster analysis is performed using the entire simulation. The populations in each cluster from the first and second halves of the trajectory are compared and plotted. Ideally, a converged trajectory should yield a correlation coefficient of 1. A) Constant pH MD. B) Constant pH REMD. A much higher correlation coefficient is seen for the constant pH REMD simulation, suggesting that much better convergence is achieved by the constant pH REMD run.
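The first-half versus second-half comparison can be sketched as follows. This is an illustration under stated assumptions: cluster assignments are taken to be a list of 0-based integer cluster indices, one per snapshot (the actual output format of the clustering program is not assumed), the helper names `cluster_populations` and `r_squared` are hypothetical, and the assignment list is made up.

```python
def cluster_populations(assignments, n_clusters):
    """Fraction of snapshots that fall in each cluster."""
    counts = [0] * n_clusters
    for cluster in assignments:
        counts[cluster] += 1
    return [count / len(assignments) for count in counts]


def r_squared(xs, ys):
    """Squared Pearson correlation coefficient of two equally long lists."""
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    syy = sum((y - y_mean) ** 2 for y in ys)
    return sxy * sxy / (sxx * syy)


# Made-up cluster assignment for an eight-snapshot trajectory, three clusters.
assignments = [0, 0, 1, 2, 0, 1, 0, 2]
half = len(assignments) // 2
first_half = cluster_populations(assignments[:half], 3)
second_half = cluster_populations(assignments[half:], 3)
print(first_half, second_half, r_squared(first_half, second_half))
```

The same two helpers cover the MD versus REMD comparison as well: the combined trajectory is clustered once, and the assignments are then split by origin instead of by trajectory half.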
3.4 Conclusions

In our work, we have applied the replica exchange molecular dynamics (REMD) algorithm to the discrete protonation state model developed by Mongan et al. in order to study pH-dependent protein structure and dynamics. Seven small peptides were selected to test our constant pH REMD method. Constant pH molecular dynamics (MD) simulations were run on the same peptides for comparison. The results of the constant pH REMD method are encouraging. The method predicts pKa values in agreement with literature and experimental results. It also displays an advantage in convergence behavior for both protonation state sampling and conformational sampling. The REMD algorithm has proven beneficial for studying pH-dependent protein structures. Our future work will include studies of pH-dependent protein dynamics and application of this constant pH REMD method to large proteins.

CHAPTER 4
CONSTANT pH REMD: STRUCTURE AND DYNAMICS OF THE C-PEPTIDE OF RIBONUCLEASE A

4.1 Introduction

The protein and peptide folding problem [202] is an important aspect of protein science and biophysical chemistry. [203] In 1961, Anfinsen studied the refolding of denatured ribonuclease (RNase). [204] He first denatured the protein, so that it lost its functional three-dimensional shape (native state). When the denaturing conditions were removed, he found that RNase was able to refold into its native shape without any other help. His experiment raised questions about protein folding. In general, people are interested in the thermodynamics (such as the free energy landscape, folding pathways, and interactions within a protein), the folding kinetics (such as how fast a protein folds), and native state prediction for a given sequence. [202] Both experimental and theoretical approaches have been employed to understand protein folding. [205,206] From now on, our introduction to protein folding will focus on computer simulations.
In a protein folding simulation, the concept of the free energy landscape always plays an important role. [202,207] Many questions can be answered once the free energy landscape is obtained. Levinthal [208] proposed in 1968 that it is impossible for a protein to search all its conformations during the folding process, because the time taken to visit all conformations would be much longer than the observed folding time. His argument is well known as Levinthal's paradox, and it implies that proteins fold along some well-defined folding pathways. Another view of protein folding is the free energy landscape theory, which provides a statistical view of the folding landscape. [202,203,207] In this view, the folding process does not require chemical-reaction-like steps between specific states. Basically, a protein folds on a funnel-shaped free energy landscape that is defined by the amino acid sequence of the protein. The folding process is a directed visit of conformations on the landscape in order to reach the native state, which is the most thermodynamically stable conformation. Changing the temperature, adding denaturant to the protein solution, or changing the solution pH can change the free energy landscape and hence affect protein folding. The free energy landscape of a protein is often rugged [51] and requires advanced sampling techniques such as the REMD method to sample the conformational space. Because a high-dimensional landscape cannot be visualized directly, a free energy landscape is frequently projected onto one or two reaction coordinates. In practice, the free energy landscape is often projected onto important reaction coordinates such as the radius of gyration of the protein, the number of backbone hydrogen bonds, and the number of native contacts. Principal component analysis has also been carried out to generate the folding free energy landscape.
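Projecting a landscape onto a single reaction coordinate can be sketched as below, using the standard relation between a probability density and a relative free energy, F = -k_B T ln P. This is an illustrative sketch only: the function name, the number of bins, and the choice of kcal/mol units are assumptions, and the samples would be any reaction coordinate (for example, a radius of gyration) extracted from a trajectory at one temperature.

```python
import numpy as np

K_B = 0.0019872041  # Boltzmann constant in kcal/(mol K)

def pmf_1d(samples, temperature, bins=50):
    """Relative free energy along one reaction coordinate from a histogram.

    Returns bin centers and -k_B T ln P, shifted so that the lowest
    finite value is zero. Empty bins come out as +inf.
    """
    hist, edges = np.histogram(samples, bins=bins, density=True)
    centers = 0.5 * (edges[:-1] + edges[1:])
    with np.errstate(divide="ignore"):
        pmf = -K_B * temperature * np.log(hist)
    pmf -= pmf[np.isfinite(pmf)].min()
    return centers, pmf
```

The difference between the values at two bins gives the relative free energy between the corresponding states; a two-dimensional version would use `np.histogram2d` in the same way.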
The relative free energy (potential of mean force, PMF) can be calculated as

$$\Delta G_{A \to B} = G_B - G_A = -k_B T \ln\left(P_B / P_A\right) \tag{4-1}$$

where $\Delta G_{A \to B}$ is the relative PMF between state A and state B defined by the reaction coordinate(s), $P_A$ and $P_B$ are the probability densities of finding state A and state B along the reaction coordinate(s), respectively, $k_B$ is the Boltzmann constant, and $T$ is the temperature. Knowing the free energy landscape can help people understand folding mechanisms. Transition states, intermediates, and folding pathways can be obtained from a folding free energy landscape. For example, when the free energy barrier between the folded and unfolded states disappears, the folding is called downhill folding, in which the folding time is determined by the diffusion rate on the free energy landscape.

One example of a protein folding free energy landscape study is the simulation of the folding of the C-terminal β-hairpin of protein G, performed by Zhou et al. in 2001. [184] The OPLS-AA force field, the SPC explicit water model, and the REMD algorithm were employed in their simulation. The free energy landscape was projected onto seven different reaction coordinates, such as the radius of gyration, the number of hydrogen bonds, and the fraction of native contacts. Two-dimensional free energy landscapes along those reaction coordinates were generated in order to elucidate the folding pathway. Four different states were found in the folding landscape: the native state, the unfolded state, and two intermediate states. Structural features of each state were also characterized. The formation of the hydrophobic core and of hydrogen bonds in the folding process was investigated. They found that the hydrophobic core and the hydrogen bonds formed almost simultaneously after the initial collapse.

Although not investigated in this chapter, protein folding kinetics is also an important aspect of protein folding. [209] One example of a folding kinetics study is determining the speed of protein folding. [210] Computer simulations have been performed to elucidate folding kinetics.
[211] The Pande group at Stanford University pioneered computer simulations of folding kinetics. [206,211-213] When studying protein folding kinetics, the Pande group conducted multiple independent MD simulations starting from different initial conditions. The probability of the native state in the structural ensemble was computed after a predefined simulation time. Assuming that folding is a two-state process following first-order kinetics, and that the transition time is much shorter than the residence time in either state, the probability of barrier crossing is given by

$$P(t) = 1 - e^{-kt} \tag{4-2}$$

where $t$ is the simulation time and $k$ is the folding rate. In the limit $t \ll 1/k$, Eq. 4-2 simplifies to $P(t) \approx kt$ according to the Taylor expansion. The probability of barrier crossing can be computed as the fraction of simulations that crossed the barrier. Other methods utilized to explore folding kinetics include Markov state models. [195,198,214-217]

One example of predicting the folding time is the study of the C-terminal β-hairpin of protein G. In their study, Pande and co-workers [213] utilized the OPLS-AA force field and the GB implicit solvent model, with water-like viscosity imposed via the Langevin collision coefficient. Their total simulation time was accumulated over 2700 independent simulations, among which 8 completely folded trajectories were found. Thus, a folding time of 4.7 ± 2 μs was estimated, which is in agreement with experiment. Furthermore, the folding free energy landscape was generated, and the folding pathway and folding intermediates were also probed.

Another area of protein folding simulation probes protein folding through unfolding simulations. Unfolding simulations adopt the assumption that folding processes follow the reverse pathways of unfolding processes. Both temperature and denaturants can be employed to denature proteins. Levitt and Daggett have performed unfolding simulations extensively.
[218-220]

The C-peptide, residues 1 to 13 from the N-terminus of RNase A, is a peptide well studied by experiments. [5,7,221-226] In 1971, Brown and Klee [223] first observed the presence of α-helix in the C-peptide through circular dichroism (CD) spectroscopy. This peptide was further studied extensively by the Baldwin group. [5,7,222,224,226] CD spectroscopy showed that the C-peptide demonstrated pH-dependent helix formation. The mean residue ellipticity at 222 nm of the C-peptide showed a bell-shaped pH profile, with a maximum at pH 5. Mutation experiments indicated that Glu2 and His12 in the C-peptide were crucial to the pH-dependent helix formation. [5,7,224,226] Maximal mean residue ellipticity occurred at pH 5 because both the glutamate and histidine residues are charged at that pH. NMR experiments on an analog of the C-peptide (RN-24) by the Wright group also confirmed the formation of complete and partial helices. [225] Two side-chain interactions were believed to stabilize partial helix formation in the C-peptide and its analogs in the mutation experiments and NMR studies. [7,224-226] A salt bridge between the Glu2 and Arg10 side chains was proposed to improve helix formation as the pH increased to 5. The interaction between Phe8 and His12 was also believed to improve helix formation as the pH was reduced to 5.

The folding and side-chain interactions of the C-peptide and its analogs have also been extensively studied by molecular simulations. [227-235] Schaefer et al. [232] studied the helical conformations and folding thermodynamics. The Okamoto group [228-230,233-235] has performed thorough investigations of the C-peptide using a multicanonical algorithm (MUCA) and the replica exchange method (REM) in both implicit and explicit solvent. They studied the secondary structures of the C-peptide, the roles of Glu2 and His12 in the C-peptide helix-coil transition, and the dielectric effect in the implicit solvent.
Ohkubo and Brooks [231] utilized REMD simulations with the GB model to explore the helix-coil transition of short peptides including the C-peptide. Conformational entropy as a function of temperature was explored for the C-peptide and its analogues (of different chain lengths). The conformational entropy was found to be proportional to chain length over a wide range of temperatures. Felts and co-workers [227] carried out REMD simulations with the AGBNP implicit solvent model to study the folding free energy landscape of the C-peptide. The free energy landscape was projected onto the radius of gyration and the helical length. The possible Glu2-Arg10 interaction was also explored. The dielectric effects of the AGBNP solvation model on helical length and salt bridge formation were investigated too. In 2005, Sugita and Okamoto [233] performed replica exchange multicanonical algorithm simulations in explicit solvent to explore the folding mechanism and side-chain interactions such as Glu2-Arg10 and Phe8-His12. They constructed the folding free energy landscape along the principal component axes. The correlations between the Glu2-Arg10 and Phe8-His12 interactions and the C-peptide conformations were elucidated. They found that the minimum free energy conformation possesses both interactions. They also suggested that the role of the Glu2-Arg10 salt bridge is to prevent the helix from extending to the N-terminus of the C-peptide, and that the Phe8-His12 interaction stabilizes the alpha-helix conformation toward the C-terminus. More importantly, Khandogin et al. [112] studied the pH-dependent folding of the C-peptide with REX-CPHMD.
Important electrostatic interactions such as the Lys1-Glu9, Glu2-Arg10, and Phe8-His12 interactions were also investigated.

The C-peptide has also been selected to test the effect of force fields on protein folding simulations and simulation convergence. In 2004, Yoda et al. [234,235] tested six commonly employed force fields (AMBER94, AMBER96, AMBER99, CHARMM22, OPLS-AA/L, and GROMOS96) on the C-peptide as well as on the C-terminal fragment from the B1 domain of protein G (the G-peptide) in explicit water using generalized-ensemble methods. Melting curves were studied. Secondary structures of both peptides were also computed and compared with experimental data. AMBER99 and CHARMM22 were found to show the best agreement for the C-peptide.

In this chapter, we present a study of the C-peptide using the constant pH REMD method introduced in the previous chapter. The effect of pH on the folding of the C-peptide and on the structural ensemble is studied. We compare directly with experimental measurements of helicity, namely the mean residue ellipticity at 222 nm. Important electrostatic interactions such as the Glu2-Arg10 salt bridge and the Phe8-His12 interaction are also examined.

4.2 Methods

4.2.1 Simulation Details

The C-peptide we simulated has the sequence KETAAAKFERQHM. The N-terminus of the C-peptide (lysine) is charged, while the C-terminus (methionine) is capped with an amide. For our study, constant pH REMD simulations were performed starting from a completely extended structure at pH values of 2, 3, 4, 5, 6.5, and 8. Eight replicas were chosen with temperatures ranging from 260 to 420 K. A simulation time of 44 ns was used for each replica in all REMD runs, and an exchange was attempted every 2 ps. The structures obtained from the first 4 ns were discarded, resulting in 40 ns of production time for each replica. Glu2, Lys7, Glu9, and His12 were selected to be titratable. An MC move to change the protonation state was attempted every 10 fs.
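The bookkeeping implied by these settings can be worked out with a short calculation. The sketch below only restates the numbers quoted in this section; the helper name `attempts` is hypothetical, and the 2 fs time step is the value quoted later in this section for the SHAKE-constrained dynamics.

```python
FS_PER_NS = 1_000_000
FS_PER_PS = 1_000

def attempts(total_ns, interval_fs):
    """Number of attempts made at a fixed interval over a run of total_ns."""
    return (total_ns * FS_PER_NS) // interval_fs

# Values quoted in the text, per replica.
total_ns = 44          # full run length
production_ns = 40     # after discarding the first 4 ns
exchange_fs = 2 * FS_PER_PS
mc_fs = 10
time_step_fs = 2       # time step quoted later in this section

exchange_attempts = attempts(total_ns, exchange_fs)
production_exchange_attempts = attempts(production_ns, exchange_fs)
mc_attempts = attempts(total_ns, mc_fs)
md_steps_between_mc = mc_fs // time_step_fs
```

Under these settings each replica makes 22,000 exchange attempts (20,000 of them in the production window) and 4.4 million protonation-state MC attempts, with five MD steps between consecutive MC attempts.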
A second set of REMD runs was performed at pH values of 2, 5, and 8 starting from a fully helical initial structure in order to check simulation convergence. The three pH values were selected to represent low pH, the pH at which maximum helicity was observed experimentally, and high pH, respectively.

The AMBER 10 molecular simulation suite [199] was used to simulate the C-peptide. The AMBER ff99SB force field [139] was used in all simulations. The SHAKE algorithm [145] was used in all simulations, which allowed the use of a 2 fs time step. The OBC Generalized Born implicit solvent model [200] was used to model the water environment in all our calculations. The Berendsen thermostat, [146] with a relaxation time of 2 ps, was used to keep the replica temperatures around their target values. The salt concentration (Debye-Hückel based) was set to 0.1 M. The cutoff for non-bonded interactions and the Born radii was 30 Å (this cutoff is longer than the peptide).

4.2.2 Cluster Analysis

In studying the folding of the C-peptide, cluster analysis plays two roles. One is to compare structural ensembles and check convergence at a particular temperature and solution pH; the other is to analyze a single ensemble of structures to investigate protein structures and interactions. As described in the previous chapter, cluster analysis was done using the Moil-View program, [201] and the Cα RMSD was chosen to measure structural similarity. Two different comparisons of conformational sampling were adopted. The first is to compare the first and second halves of one trajectory. In this case, cluster analysis was performed on a single trajectory, and the cluster information can be utilized to study folding thermodynamics and interactions in the C-peptide. The second is to compare the structural ensembles produced by simulations starting from the fully extended and fully helical structures.
In the second case, the two trajectories (having the same number of frames) at 300 K and at the same solution pH were first combined. The combined trajectory was then clustered on the basis of the root-mean-square deviations (RMSDs) of the peptide backbone atoms. The population fraction of each cluster was obtained for both trajectories. The correlation coefficient, which represents the correlation between the cluster populations of the two trajectories, was calculated at each solution pH by linear regression. A high correlation indicates that the structural ensembles are close to each other. This method provides a direct comparison of global conformational sampling between the two trajectories. A cluster cutoff RMSD of 2.0 Å was chosen in our analysis.

4.2.3 Definition of the Secondary Structure of Proteins (DSSP) Analysis

The secondary structures of the C-peptide were explored with the DSSP algorithm, [236] proposed by Kabsch and Sander. The DSSP algorithm identifies the secondary structure of a residue from hydrogen bond calculations. The calculation is based on the electrostatic energy between a backbone carbonyl group and a backbone amide group,

$$E = q_1 q_2 \left( \frac{1}{r_{ON}} + \frac{1}{r_{CH}} - \frac{1}{r_{OH}} - \frac{1}{r_{CN}} \right) \times 332\ \mathrm{kcal/mol} \tag{4-3}$$

In the above equation, $q_1$ and $q_2$ are the partial charges on the atoms of the carbonyl and amide groups, respectively, and $r_{XY}$ is the distance between atoms X and Y in Å. If the electrostatic energy is below -0.5 kcal/mol, a hydrogen bond is assigned to the corresponding carbonyl and amide groups. The secondary structure of a residue is labeled by one letter: G for 3-10 helix, H for alpha helix, I for pi helix, B for antiparallel beta sheet, b for parallel beta sheet, and T for turn.

4.2.4 Computation of the Mean Residue Ellipticity

CD spectroscopy is one of the most commonly used techniques to study protein secondary structures and folding. [237] Chiral molecules absorb left circularly polarized light (LCPL) and right circularly polarized light (RCPL) differently.
CD spectroscopy measures the difference in absorbance of LCPL and RCPL by a chiral molecule. It can provide information on protein secondary structures. Electromagnetic waves contain oscillating electric and magnetic fields perpendicular to each other and to the propagation direction. Circularly polarized light (CPL) has an electric field vector that rotates about its propagation direction while maintaining its magnitude. This is in contrast to linearly polarized light, which has an electric field vector that oscillates in one plane and changes its magnitude. When LCPL propagates toward an observer, the electric field vector rotates counterclockwise, while that of RCPL rotates clockwise. When circularly polarized light passes through chiral molecules, the difference in the absorption of LCPL and RCPL is given by

$$\Delta\varepsilon(\lambda) = \varepsilon_L(\lambda) - \varepsilon_R(\lambda) \tag{4-4}$$

where $\varepsilon_L$ and $\varepsilon_R$ are the extinction coefficients for LCPL and RCPL, respectively, and $\lambda$ is the wavelength. $\Delta\varepsilon$ has the dimensions of $(\mathrm{M\,cm})^{-1}$ or, equivalently, $\mathrm{cm^2\,mmol^{-1}}$. The extinction coefficient can be calculated from the Beer-Lambert law, $\varepsilon = A/(c\,l)$, where $A$ is the absorbance, $c$ is the concentration, and $l$ is the width of the cuvette. This difference gives rise to the CD spectrum. Many CD instruments record the signal as the ellipticity $\theta$, which is measured in degrees. The ellipticity can be calculated as $\theta = 32.98\,\Delta A = 32.98\,\Delta\varepsilon\,c\,l$, where the constant 32.98 has units of degrees. A more frequently adopted measure of CD is the molar ellipticity $[\theta]$, [238]

$$[\theta] = \frac{100\,\theta}{c\,l} = 3298\,\Delta\varepsilon \tag{4-5}$$

Here, the molar ellipticity has units of $\mathrm{deg\,cm^2\,dmol^{-1}}$.

The integrated intensity of a CD band is called the rotational strength. Theoretically, for an electronic transition from the ground state ($0$) to an excited state ($a$), the rotational strength can be calculated as

$$R_{0a} = \mathrm{Im}\left\{ \langle \psi_0 | \hat{\boldsymbol{\mu}} | \psi_a \rangle \cdot \langle \psi_a | \hat{\mathbf{m}} | \psi_0 \rangle \right\} \tag{4-6}$$

where $\psi_0$ and $\psi_a$ are the wavefunctions of the electronic ground and excited states, respectively; $\hat{\boldsymbol{\mu}}$ and $\hat{\mathbf{m}}$ are the electric and magnetic transition dipole moment operators, respectively; and Im stands for the imaginary part. Eq.
4-6 suggests that the frequently adopted units of rotational strength are Debye-Bohr magnetons (DBM; 1 DBM = $9.274 \times 10^{-39}$ erg cm$^3$, where erg is a unit of energy). Eq. 4-6 is origin dependent because the magnetic transition dipole moment operator is origin dependent. In order to avoid this origin dependence, the dipole velocity formulation can be employed,

$$R_{0a} = -\frac{e\hbar}{2\pi m_e \nu_{0a}}\,\mathrm{Im}\left\{ \langle \psi_0 | \nabla | \psi_a \rangle \cdot \langle \psi_a | \hat{\mathbf{m}} | \psi_0 \rangle \right\} \tag{4-7}$$

Here, $e$ is the charge of an electron, $m_e$ is the mass of an electron, and $\nu_{0a}$ is the frequency of the transition. According to Sreerama and Woody, [238] the CD spectrum can be calculated by assuming that each CD band (CD transition) is a Gaussian function of wavelength,

$$\Delta\varepsilon_k = 2.278\,\frac{R_k \lambda_k}{\Delta_k} \tag{4-8}$$

where $\Delta\varepsilon_k$, $R_k$, $\lambda_k$, and $\Delta_k$ are the CD, rotational strength, wavelength, and half bandwidth (one half of the width at $1/e$ of the maximum) of the $k$th transition, respectively. In Eq. 4-8, the constant 2.278 has the dimensions of $\mathrm{DBM^{-1}\,cm^2\,mmol^{-1}}$.

The far-ultraviolet (far-UV, with wavelengths shorter than 250 nm) CD spectra of proteins can yield important information about the secondary structures of proteins. [238] In the far-UV range, the peptide bonds in a protein are the main chromophores. Thus, CD spectra in the far-UV range are reported on a per-residue basis (mean residue ellipticity). In a protein CD spectrum, a positive band at ~190 nm and two negative bands at 208 nm and 222 nm are found for an α-helix. [239] In particular, a strong negative band at 222 nm is a leading indication of the presence of helical structures. Structures rich in β-sheet show two bands in CD spectra: a positive band at ~198 nm and a negative band at ~215 nm. [240] Computing protein CD spectra using quantum mechanical methods combined with Eq. 4-7 is possible only in principle, because of the size and complexity of protein structures. The matrix method, [241] which uses predetermined parameters, has been adopted to tackle this problem.
In the matrix method, a secular matrix is constructed from transition energies and interactions between transitions. A protein is considered as a set of independent chromophores. The local transition energies of each chromophore and the interactions between transitions in different chromophores are used to construct the secular matrix. A transition on a local chromophore is represented by a charge distribution. The charge distributions, as parameters, are determined from quantum mechanical wavefunctions, from experiments, or from a combination of both. [242-244] The off-diagonal elements of the secular matrix, which represent the interactions between transitions in different chromophores, are further simplified to charge-charge (monopole-monopole) electrostatic interactions, [238]

$$V_{ia;jb} = \sum_s \sum_t \frac{q_{ias}\,q_{jbt}}{r_{ias;jbt}} \tag{4-9}$$

Here, $V_{ia;jb}$ is the electrostatic energy between transition $a$ on chromophore $i$ and transition $b$ on chromophore $j$; $s$ runs over the point charges of transition $a$ on chromophore $i$, $t$ runs over the point charges of transition $b$ on chromophore $j$, and $r_{ias;jbt}$ denotes the distance between two charges. Diagonalization of the secular matrix by a unitary transformation yields the eigenvalues and eigenvectors corresponding to all transitions of the protein. The eigenvalues provide the transition energies, and the eigenvectors describe the mixing of local transitions. The rotational strengths can be obtained from the eigenvectors.

In this work, the algorithm developed by the Woody group [238,244] was used to compute the mean residue ellipticity. A detailed description of their algorithm can be found in the paper of Sreerama and Woody. The peptide transitions (two ππ* transitions at 140 and 190 nm, respectively, and one nπ* transition at 220 nm) were computed using the matrix method [241] in the origin-independent form. [245] Transition charge distributions (monopole charges) were obtained from INDO/S [246] semi-empirical electronic structure calculations.
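The structure of the matrix method (Eq. 4-9 followed by diagonalization) can be illustrated with a toy calculation. This is a minimal sketch and not the Woody group's program: the function names are hypothetical, no unit conversion is applied, and the step from eigenvectors to rotational strengths is omitted.

```python
import numpy as np

def monopole_coupling(charges_a, coords_a, charges_b, coords_b):
    """Monopole-monopole interaction: sum over charge pairs of q_s q_t / r_st.

    charges_* are 1D arrays of transition monopole charges and coords_*
    are (n, 3) arrays of their positions. Units are left to the caller.
    """
    diff = coords_a[:, None, :] - coords_b[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    return float(np.sum(np.outer(charges_a, charges_b) / dist))

def solve_secular_matrix(energies, couplings):
    """Diagonalize a secular matrix with local energies on the diagonal.

    couplings is a square array whose off-diagonal elements are the
    interactions between local transitions. Returns eigenvalues
    (mixed transition energies) and eigenvectors (mixing coefficients).
    """
    h = np.array(couplings, dtype=float)
    h = 0.5 * (h + h.T)            # enforce symmetry
    np.fill_diagonal(h, energies)  # local transition energies
    return np.linalg.eigh(h)
```

For two degenerate local transitions with energy 1.0 and coupling 0.1 (arbitrary units), the eigenvalues split into 0.9 and 1.1, which is the exciton-type mixing that the eigenvectors describe.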
Side-chain transitions of phenylalanine, tyrosine, and tryptophan were also included in the calculations. α-Helix formation can be characterized by two negative bands at 208 and 222 nm and a positive band at 192 nm. Following the experiments performed by the Baldwin group, the mean residue ellipticity at 222 nm ($[\theta]_{222}$) was calculated to generate the pH profile. In practice, the program takes a protein structure in PDB format and yields the mean residue ellipticity and the rotational strength as a function of wavelength. Therefore, the ptraj module of the AMBER 10 package was utilized to generate a protein structural ensemble, from which the ensemble average of the mean residue ellipticity at 222 nm was obtained.

4.3 Results and Discussion

4.3.1 Testing Structural Convergence

Conformational sampling convergence was investigated using cluster analysis, as described earlier. Two ways of checking the conformational sampling of the simulations started from the fully extended structure were used. One is to compare the first and second halves of the trajectory; the other is to compare with the structural ensembles produced by simulations starting from a fully helical structure. The R² values of the cross-clustering are listed in Table 4-1. Plots demonstrating the cluster population correlations from both comparisons at pH 2 are shown in Figure 4-1 as an example. The large R² values indicate that converged structural ensembles are achieved within the 40 ns simulations.

Figure 4-1. Cluster populations at 300 K from constant pH REMD simulations at pH 2. A) Cluster analysis performed on the trajectory initiated from the fully extended structure. The populations of each cluster from the first and second halves of the trajectory are compared and plotted. B) Two REMD runs from different starting structures at pH 2. Correlation coefficients at other pH values can be found in Table 4-1.

Table 4-1. Correlation coefficients between two sets of cluster populations.
              pH = 2   pH = 3   pH = 4   pH = 5   pH = 6.5   pH = 8
R² (E vs E)   0.90     0.92     0.90     0.94     0.93       0.85
R² (E vs H)   0.95     -        -        0.88     -          0.84

E vs E means comparing the first and second halves of the trajectories starting from the fully extended structure. E vs H stands for comparing the structural ensembles given by simulations starting from the fully extended and fully helical structures, respectively.

4.3.2 pKa Calculation and Convergence

Four residues of the C-peptide are titratable in our constant pH REMD simulations: Glu2, Lys7, Glu9, and His12. Lys7 is always protonated in the pH range of 2 to 8, as expected. Thus, only the data from the glutamate and histidine residues are analyzed. For each glutamate and histidine residue, the fraction of deprotonation at each pH value is used to determine its pKa value. The pKa values are 3.1, 3.7, and 6.5 for Glu2, Glu9, and His12, respectively. The cumulative average fraction of protonation vs. constant pH MC attempts is chosen to study the convergence of the pKa calculation. The cumulative average fraction of protonation represents the time evolution of the protonation state sampling. As shown in Figure 4-2, a stabilized fraction of protonation is achieved within the 40 ns simulations.

4.3.3 The Mean Residue Ellipticity of the C-peptide

The mean residue ellipticity of the C-peptide at each pH value and at 300 K was computed. The pH profile of $[\theta]_{222}$ (Figure 4-3) is clearly a bell-shaped curve, in agreement with the experimental pH profile of $[\theta]_{222}$. The maximum (in magnitude) of our calculated $[\theta]_{222}$ is at pH 5, with a value of approximately -6400 deg cm² dmol⁻¹. However, the computed values of $[\theta]_{222}$ at the ends (pH = 2, 3, and 8) suggest that the helix is more populated in the simulations than in experiments at those pH values.

Figure 4-2. Cumulative average fraction of protonation vs. Monte Carlo (MC) steps. Only the two glutamate residues are shown here; the histidine residue is found to show the same trend.
The pH values are selected such that the overall average fraction of protonation is close to 0.5.

As mentioned in Section 2.2.2, the protonation state model uses parameters fitted at 300 K; thus, results obtained at temperatures other than 300 K should be viewed qualitatively, not quantitatively. The C-peptide at a temperature lower than 300 K shows a more negative $[\theta]_{222}$ (more helical), while $[\theta]_{222}$ becomes less negative (less helical) when the temperature is higher than 300 K. Experiments showed that the pH profile becomes flat at high temperatures. [5] Our results reflect the same trend: the pH profile of $[\theta]_{222}$ at 420 K is flat and less negative than that at 300 K, while the pH profile at 280 K is still bell-shaped and more negative.

Figure 4-3. Computed mean residue ellipticity at 222 nm as a function of pH. A bell-shaped curve at 300 K is obtained with a maximum at pH 5. The effect of temperature on the mean residue ellipticity at 222 nm is also demonstrated.

4.3.4 Helical Structures in the C-peptide

In order to examine the helical conformations in different environments, the constant pH REMD simulations at pH 2, 5, and 8 were selected to represent the pH range. The secondary structures of the C-peptide were computed with the DSSP algorithm. [236] Any residue that, according to the DSSP algorithm, belongs to the 3-10 helix conformation is called helical. The helical percentages of each residue are shown in Figure 4-4. The maximum helical percentage of a residue is ~55% at pH 2 and 5, and ~40% at pH 8. The average helical percentage at pH 5 is around 30%, which is in good agreement with experiments (29 ± 2%). Figure 4-4 suggests that the C-peptide contains many non-helical structures, even at pH 5, where the helical content is maximal.

Figure 4-4. Helical content as a function of residue number.
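The per-residue helical percentage can be sketched from per-frame secondary structure strings. This is an illustration under stated assumptions: the input format (one string of one-letter codes per frame), the function name, and the default set of letters counted as helical are hypothetical, so `helix_codes` should be set to match the definition of "helical" adopted in the analysis.

```python
import numpy as np

def helical_percentage(dssp_frames, helix_codes="GHI"):
    """Percentage of frames in which each residue is assigned a helix code.

    dssp_frames is a sequence of equal-length strings, one per frame,
    holding one secondary structure letter per residue. The default
    helix_codes is an assumption, not the definition used in the text.
    """
    is_helix = np.array(
        [[code in helix_codes for code in frame] for frame in dssp_frames]
    )
    return 100.0 * is_helix.mean(axis=0)
```

Averaging the returned array over residues gives an overall helical percentage of the kind compared with experiment above.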
We calculated the Cα RMSD relative to the fully helical structure (the fully helical structure has a Cα RMSD of 0.8 Å relative to the ribonuclease A X-ray structure; Thr3 to His12 were chosen to calculate the Cα RMSD) and the Cα radius of gyration (Rg) of the C-peptide. The time series and probability densities of the RMSDs and Rg are illustrated in Figure 4-5. According to Figure 4-5B, two conformations can be seen at all three pH values. The conformation with the smaller RMSD represents structures closer to the fully helical structure, and the structural ensemble at pH 5 possesses more such structures than the other two ensembles. Figure 4-5D shows the probability density of Rg and suggests that the C-peptide is more compact at pH 5 than at pH 2 and 8. The Rg results agree with the RMSD results because helical structures are more compact.

Figure 4-5. A) Time series of Cα RMSDs relative to the fully helical structure at pH 5. The first two residues at each end are not selected because the ends are very flexible. B) Probability densities of the Cα RMSDs. Clearly, the structural ensemble at pH 5 contains more structures similar to the fully helical structure. C) Time series of the Cα radius of gyration at pH 5. D) Probability densities of the Cα radius of gyration. More compact structures are found at pH 5.

We further studied the details of the C-peptide structural ensemble with respect to pH. The studies of helical structure were based on our DSSP results. We first show the probability density of the total number of helical residues at pH 2, 5, and 8 in Figure 4-6A. As expected, simulations at pH 5 generated the smallest number of non-helical structures, ~25% of the ensemble. The simulation at pH 8 generated the most non-helical structures: ~37% of the structural ensemble possesses no helical residue.
For those structures possessing helical residues, structures with four helical residues are the most probable, and structures containing three helical residues are also common at all three pH values. Structures possessing six helical residues are also found. Furthermore, the simulation at pH 5 yielded more configurations possessing helices of seven or more residues. Thus, longer helices are formed more often at pH 5.

Figure 4-6. A) Probability densities of the number of helical residues in the C-peptide. B) Probability densities of the number of helical segments in the C-peptide. A helical segment contains contiguous helical residues. The probability of forming a second helical segment is very low at all three pH values; thus, only the first helical segment is studied further. C) Probability densities of the starting position of a helical segment. D) Probability densities of the length of a helical segment (number of residues in a helical segment).

Next, the number of helical segments (a helical segment contains contiguous helical residues) was studied and is shown in Figure 4-6B. The number of helical segments ranges from zero to two at all three pH values. However, C-peptide structures with two helical segments are rare: the probability of having two helical segments is ~0.05 at pH 2 and 8, and ~0.1 at pH 5. Because of the small population of the second helical segment, the analysis of the helical length (number of helical residues in a segment) and the helix starting position (residue number of the amino acid initiating a helical segment) focuses on the first helical segment. Figure 4-6C shows the probability density of the helix starting position in the C-peptide. The most probable starting position is affected by solution pH. At pH 2, Lys7 is the most favorable position to start a helix, but the most probable place to initiate a helix is Thr3 at pH 5 and 8.
At pH 2 and 5, Thr3, Ala6 and Lys7 are favorable positions to start a helix, while Thr3 and Lys7 are the favorable positions at pH 8. However, the effect of solution pH on helical segment length is not as significant as its effect on helix starting position. Figure 4-6D shows that three-residue and four-residue helices are dominant at all three pH values.

4.3.5 The Two-Dimensional Probability Densities

Two-dimensional (2D) probability densities can be employed to study the correlations between important variables. The peaks in the plots indicate coupling between two variables and represent stable conformations. The more populated a region is, the more stable the corresponding conformation is. The 2D probability densities of helix starting position and helical length are illustrated in Figures 4-7 to 4-9. Helices consisting of Thr3-Ala5, Lys7-Arg10 and Glu9-His12 are present at all three pH values, while more helical conformations are found at pH 5 and 8. At pH 2 and 5, the most probable helix is the four-residue helix starting from Lys7 (Lys7-Arg10). The 2D probability densities reveal that the six-residue (Lys7-His12) and seven-residue (Ala6-His12) helices are stable at pH 5. At pH 8, Thr3-Ala5 becomes the most favorable helix; Lys7-Arg10 and Lys7-His12 are also favorable. At pH 8, a new seven-residue helix (Thr3-Glu9) is found.

Figure 4-7. 2D probability density of helical starting position and helical length at pH 2.

Figure 4-8. 2D probability density of helical starting position and helical length at pH 5.

Figure 4-9. 2D probability density of helical starting position and helical length at pH 8.
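The segment analysis behind Figures 4-6 to 4-9 can be sketched as follows. This is a minimal illustration, assuming that each frame has already been reduced to a per-residue sequence of helix flags (for example from a DSSP assignment); the function names are hypothetical, indices are zero-based (index 2 corresponds to Thr3), and the normalization over frames that contain at least one segment is an assumption rather than the procedure stated in the text.

```python
import numpy as np

def helical_segments(is_helix):
    """Return (start_index, length) for every run of consecutive helical residues."""
    segments = []
    start = None
    for i, flag in enumerate(is_helix):
        if flag and start is None:
            start = i                        # a new segment begins here
        elif not flag and start is not None:
            segments.append((start, i - start))
            start = None
    if start is not None:                    # segment that runs to the last residue
        segments.append((start, len(is_helix) - start))
    return segments

def joint_density(frames, n_residues):
    """2D probability of (start position, length) for the first helical segment."""
    counts = np.zeros((n_residues, n_residues + 1))
    for is_helix in frames:
        segs = helical_segments(is_helix)
        if segs:
            start, length = segs[0]          # only the first segment is analyzed
            counts[start, length] += 1
    total = counts.sum()
    return counts / total if total > 0 else counts
```

For a 13-residue peptide, a frame whose flags are set for residues 7 to 10 contributes one count to the cell (start index 6, length 4), which corresponds to the Lys7-Arg10 helix discussed above.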
2D probability densities correlating helical length and Cα RMSD relative to the fully helical structure are shown in Figures 4-10 to 4-12. As expected, structures having long helices (helical length > 7) correspond to conformations with RMSDs smaller than 2.2 Å, and this region is more populated at pH 5. Interestingly, configurations possessing a four-residue helix can also yield RMSDs smaller than 2.2 Å, suggesting that partially helical structures can also be similar to the fully helical structure.

Figure 4-10. 2D probability density of helical length and Cα RMSD at pH 2.

Figure 4-11. 2D probability density of helical length and Cα RMSD at pH 5.

Figure 4-12. 2D probability density of helical length and Cα RMSD at pH 8.

4.3.6 Important Electrostatic Interactions: Lys1-Glu9 and Glu2-Arg10

The salt bridge between Glu2 and Arg10 was found in the X-ray structure of RNase A [247]. Amino acid substitution experiments on the C-peptide indicated that this salt bridge is crucial to the increase in helical content as the pH rises to 5 [7,224]. Proton NMR experiments by Osterhout et al. [225] suggested that this salt bridge stabilizes a partial helix rather than the complete helix. They proposed that the RN-24 structural ensemble contains three major conformations: unfolded, completely folded, and a partial helix with the Glu2-Arg10 interaction. Hansmann et al. [229] also proposed, on the basis of multicanonical simulations, that the salt bridge stabilizes a partial helix. Felts et al. [227] found that the salt bridge is significantly populated only in globular, non-helical C-peptide structures. Sugita and Okamoto [233] studied the C-peptide using multicanonical REM and explicit solvent. They found that the Glu2-Arg10 salt bridge does not stabilize the helix directly but instead stops the helix from extending to the N-terminus. In the REX-CPHMD study by Khandogin et al., Lys1-Glu9, rather than Glu2-Arg10, was found to contribute to helix formation.
The Lys1-Glu9 and Glu2-Arg10 interactions are studied in our work. Figures 4-13A and 4-13B show the probability density as a function of the distance between the charged groups for the two interactions at pH 2, 5 and 8. At pH 2, neither the Lys1-Glu9 nor the Glu2-Arg10 salt bridge is formed, consistent with mostly protonated glutamates. At pH 5 and 8, the Glu2-Arg10 salt bridge is clearly formed (Figure 4-13A), while the Lys1-Glu9 salt bridge is formed to a much lesser extent (Figure 4-13B). Figure 4-14 shows the correlation between the two salt bridges at pH 5. Clearly, the two salt bridges cannot be formed at the same time.

The effect of the Glu2-Arg10 salt bridge on helix formation is reflected in conditional probabilities. The probabilities of finding helical residue(s) given that the Glu2-Arg10 salt bridge is formed were calculated at pH 2, 5 and 8. The conditional probabilities are 0.64, 0.73 and 0.63, respectively. Although at pH 2 the probability of forming the Glu2-Arg10 salt bridge is low (~1%), the chance of having a helical structure is 63% once it is formed. This clearly shows the stabilizing effect of Glu2-Arg10 on helix formation.

Figure 4-13. A) Probability density of the Lys1-Glu9 distance (Å). The distance is the minimum distance between the side-chain nitrogen atom of Lys1 and the side-chain carboxylic oxygen atoms of Glu9. B) Probability density of the Glu2-Arg10 distance (Å). The distance is the minimum distance between the side-chain carboxylic oxygen atoms of Glu2 and the guanidinium nitrogen atoms of Arg10.

Figure 4-14. Two-dimensional probability density of the Lys1-Glu9 and Glu2-Arg10 distances at pH 5. The Lys1-Glu9 and Glu2-Arg10 salt bridges cannot be formed simultaneously.

The correlations between the Glu2-Arg10 salt bridge and helical length, and between the salt bridge and helix starting position, are studied further. Figure 4-15A shows that the Glu2-Arg10 salt bridge can be found in non-helical configurations and in four-residue and six-residue helices at pH 5.
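The minimum-distance definition in the Figure 4-13 caption and the conditional probabilities quoted above can be illustrated with a small sketch. It assumes per-frame arrays of salt-bridge distances (Å) and helical-residue counts; the function names are hypothetical, and the distance cutoff that defines a "formed" salt bridge is a free parameter here because the text does not state the value that was used.

```python
import numpy as np

def min_distance(group_a, group_b):
    """Smallest pairwise distance between two sets of atoms, shapes (Na, 3) and (Nb, 3)."""
    diff = group_a[:, None, :] - group_b[None, :, :]
    return float(np.sqrt((diff ** 2).sum(axis=-1)).min())

def conditional_helix_probability(distances, n_helical, cutoff):
    """P(at least one helical residue | salt bridge formed), with 'formed' = distance < cutoff."""
    distances = np.asarray(distances, dtype=float)
    n_helical = np.asarray(n_helical)
    bridged = distances < cutoff             # frames in which the salt bridge is present
    if not bridged.any():
        return float("nan")                  # condition never met: probability undefined
    return float((n_helical[bridged] > 0).mean())
```

Because the estimate is taken only over frames in which the salt bridge exists, it can be sizable even when the salt bridge itself is rare, as in the pH 2 case described above.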
Moreover, in the six-residue helix, the Glu2-Arg10 salt bridge is always formed. The same pattern is obtained at pH 8, so the pH 8 results are not shown here. Figure 4-15B shows the correlation between the salt bridge and helix starting position at pH 5. When a helix is initiated at Thr3, the salt bridge is not formed. When a helix begins at Ala4, Lys7 or a residue after Lys7, only the salt-bridged state is seen. However, in non-helical configurations and in helices beginning at Ala6, both states are found. In addition, Lys7 is the most probable place to initiate a helix when the salt bridge is formed. Again, no salt bridge is found when a helix starts at Thr3. Combining the correlations between Glu2-Arg10 and helical length and between Glu2-Arg10 and helix starting position, the salt bridge clearly prevents helix formation near the N-terminus and stabilizes partial helices near the C-terminus (Lys7-Arg10 and Lys7-His12).

Figure 4-15. A) Two-dimensional probability density of Glu2-Arg10 salt bridge formation and helical length at pH 5. According to the plot, the Glu2-Arg10 salt bridge can be found in four-residue, six-residue and non-helical structures. B) Two-dimensional probability density of the Glu2-Arg10 salt bridge and the helix starting position at pH 5. If a helix begins at Thr3, it cannot have a Glu2-Arg10 salt bridge. Thus, one role of the Glu2-Arg10 salt bridge is to prevent helix formation from Thr3.

4.3.7 Important Electrostatic Interactions: Phe8-His12

His12 is believed to be responsible for the decrease in helical content when the solution pH increases from 5 to 8 [226]. His12 was found to interact with Phe8 [221]. However, the nature of the Phe8-His12 interaction is not completely clear.
A weak hydrogen bond between the charged side chain of His12 (proton donor) and the aromatic ring of Phe8 (proton acceptor) is supported by the configuration in the RNase A X-ray structure [247] and by ion screening experiments [222,226], but is at odds with proton NMR experiments [221]. A contact between the aromatic ring of His12 and the backbone carbonyl oxygen of Phe8 has been proposed to explain the proton NMR results. Sugita and Okamoto studied the interaction between the aromatic ring of Phe8 and the charged ring of His12 [233]. They observed that a contact between the two rings is made and stabilizes the helix near the C-terminus. However, the REX-CPHMD results showed that the interaction between the backbone carbonyl oxygen of Phe8 and the charged side chain of His12 is responsible for the increased helical content at pH 5 [112].

Figure 4-16. A) Probability density of the Phe8 backbone to His12 ring distance. The distance is the minimum distance between the Phe8 backbone carbonyl oxygen atom and the His12 imidazole nitrogen atoms. B) Probability density of the Phe8 ring to His12 ring distance. The distance is the minimum distance between the Phe8 aromatic ring carbon atoms and the His12 imidazole nitrogen atoms.

We also studied the ring-ring and backbone-ring interactions between Phe8 and His12 at pH 2, 5 and 8. The ring-ring interaction is represented by the minimum distance between the aromatic atoms of Phe8 and the two side-chain nitrogen atoms of His12. The backbone-ring interaction is represented by the minimum distance between the backbone carbonyl oxygen atom of Phe8 and the two side-chain nitrogen atoms of His12. Figures 4-16A and 4-16B show the probability densities of each distance at the three pH values. We found that the backbone-ring contact is made at all three pH values. However, forming such a contact is much less favorable at pH 8 than at pH 5.
Interestingly, the close contact between the Phe8 backbone and the His12 ring is coupled to Glu2-Arg10 salt bridge formation (Figure 4-17). The ring-ring contact is observed at pH 5 but not at pH 8. At pH 2, the ring-ring contact is formed but is much less probable. More importantly, the integrated probability of making a backbone-ring contact is larger than the integrated probability of forming a ring-ring contact at pH 2 and 5. To separate configurations making a contact from the rest, cutoff distances of 4.0 Å and 5.0 Å are adopted for the backbone-ring and ring-ring contacts, respectively. The integrated probabilities (area under the curve) of making the backbone-ring contact and the ring-ring contact are 0.34 and 0.22, respectively, at pH 5, and 0.23 and 0.14, respectively, at pH 2. Thus, the Phe8 backbone-His12 ring interaction is the major form of the contact.

Figure 4-17. A) Two-dimensional probability density of the Glu2-Arg10 distance and the Phe8-His12 backbone-to-ring distance at pH 5. B) Correlations between the Glu2-Arg10 salt bridge and the Phe8-His12 contact at pH 5.

We further examined the correlation between the Phe8 backbone-His12 ring contact and helical properties such as helical length and helix starting position. The backbone-ring contact is found in the four-residue and six-residue helices at pH 2 and 5. At pH 8, it is seen in the four-residue helix. The 2D probability densities are similar at the three pH values, so only the plot at pH 5 is shown as an example (Figures 4-18A and 4-18B). As with the Glu2-Arg10 salt bridge, Lys7 is the most favorable place to initiate a helix when there is a contact between Phe8 and His12. Thus, the Phe8-His12 backbone-ring contact stabilizes helix formation near the C-terminus (Lys7 to Arg10 and Lys7 to His12). However, unlike the Glu2-Arg10 interaction, a helix initiated at Thr3 is still able to form a contact between Phe8 and His12.
The Phe8-His12 contact does not affect helix formation near the N-terminus.

Figure 4-18. A) Two-dimensional probability density of helical segment length and the Phe8-His12 interaction. B) Two-dimensional probability density of helical segment starting position and the Phe8-His12 interaction. Phe8-His12 also stabilizes four-residue and six-residue structures. Helices beginning at Lys7 are coupled to the Phe8-His12 contact. Unlike Glu2-Arg10, Phe8-His12 stabilizes helices starting from Thr3.

4.3.8 Cluster Analysis Results

Cluster analysis was performed to identify significant conformations and to examine important electrostatic interactions. The structures at pH 5 were clustered because both the Glu2-Arg10 and Phe8-His12 contacts are more probable than at pH 2 or 8, so that the contacts can be studied within clusters. The 20 most populated clusters and their average helical percentages are plotted in Figure 4-19A. The most populated cluster shows the largest average helical content, whereas the second most populated cluster shows a much lower helical content (close to the lowest among the 20 clusters). The most populated cluster corresponds to the conformation yielding small Cα RMSDs (< 2.2 Å) relative to the fully helical structure (Figure 4-19B). Interestingly, the plot of helical percentage versus residue number (Figure 4-19C) reveals that the second most populated cluster shows helical structure only between Lys7 and His12. Thus, in this cluster helices are formed only near the C-terminus. Figure 4-19D shows the probability densities of the Glu2-Arg10 and Phe8-His12 interactions. Compared with the corresponding probability densities for the entire structural ensemble, forming the Glu2-Arg10 and Phe8-His12 contacts is more probable in the structures belonging to the second most populated cluster. This is especially obvious for the Glu2-Arg10 interaction.
Results obtained from the second most populated cluster confirm that the Glu2-Arg10 and Phe8-His12 contacts, especially the Glu2-Arg10 contact, stabilize partial helix formation near the C-terminus.

4.4 Conclusions

In this chapter, we have studied the pH-dependent helix formation of the C-peptide of ribonuclease A using constant-pH REMD simulations. The mean residue ellipticity at 222 nm was computed at each pH value and used to gauge helical content. The pH profile clearly shows a bell-shaped curve with maximal helicity at pH 5, in good agreement with experimental results. The pH effect on the C-peptide structural ensembles was studied at three representative pH values, 2, 5 and 8, which represent the two ends of the pH profile and the pH yielding the maximum helical content. At pH 2, helices consisting of Thr3-Ala5, Lys7-Arg10 and Glu9-His12 are formed, and Lys7-Arg10 is the most stable. At pH 5, additional six-residue (Lys7-His12) and seven-residue (Ala6-His12) helices are stable, but the most probable helix is the same as at pH 2. At pH 8, the most favorable helix switches to Thr3-Ala5; Lys7-His12 and a new seven-residue helix (Thr3-Glu9) are also present. Glu2-Arg10 salt bridge formation and its role in helix formation were studied. We find that the salt bridge is formed and is most probable at pH 5. The Glu2-Arg10 salt bridge is found to stabilize helix formation near the C-terminus. The nature of the Phe8-His12 interaction and its role in helix formation were also explored. The contact between the backbone carbonyl oxygen of Phe8 and the charged side chain of His12 is the major form of this interaction. The role of the Phe8-His12 contact is similar to that of the Glu2-Arg10 salt bridge. Results from cluster analysis of the trajectory generated at pH 5 confirmed the effects of the Glu2-Arg10 and Phe8-His12 interactions.

Figure 4-19. A) Top 20 populated clusters and average helical percentage.
B) Probability densities of the Cα RMSD relative to the fully helical structure for the top 2 populated clusters. C) Helical percentage as a function of residue number for the top 2 populated clusters. D) Probability density of the Glu2-Arg10 and Phe8 backbone-His12 ring interactions in the second most populated cluster.

CHAPTER 5
CONSTANT-pH REMD: pKa CALCULATIONS OF HEN EGG WHITE LYSOZYME

5.1 Introduction

Hen egg white lysozyme (HEWL, shown in Figure 5-1) has long been used to test pKa prediction methods and constant-pH methods [125]. This protein is a 129-amino-acid enzyme and is the first enzyme to have its three-dimensional structure determined by X-ray crystallography [248,249]. Lysozyme is found in secretions such as tears and saliva. The function of this enzyme is to catalyze the hydrolysis of a polysaccharide, and the reaction has an optimal pH around 5 [125]. By hydrolyzing polysaccharides, lysozyme can damage the cell walls of certain bacteria. HEWL is a monomeric single-domain enzyme whose active site is situated in a cleft between two regions. Two residues are crucial to the catalysis, Glu35 and Asp52. During the hydrolysis, a covalent enzyme-substrate intermediate is formed [249]. In this process, Glu35 acts as the proton donor and Asp52 becomes the nucleophile [249]. The starting point of the catalytic mechanism is the donation of a proton from Glu35 to the substrate. Then, Asp52 attacks the anomeric carbon of the substrate and forms a covalent bond with the substrate. In the final step, the enzyme-substrate complex is hydrolyzed by a water molecule and the initial protonation states of Glu35 and Asp52 are restored.

HEWL has been a good test system for pKa prediction studies for several reasons. First, accurately predicting the pKa values of both ionizable residues in the active site can help identify the proton donor and the nucleophile in HEWL according to a simple criterion proposed by Nielsen and McCammon in 2003.
[250] They proposed that if a catalytic mechanism involves two acidic residues, then the proton donor should have a pKa of at least 5.0 and the pKa of the nucleophile should be at least 1.5 pH units lower than that of the proton donor. Second, the pKa values of the HEWL acidic residues were determined by Bartik et al. [251] using two-dimensional proton NMR. Several ionizable residues have pKa values very different from their intrinsic pKa values. Furthermore, there are more than 100 PDB entries of the wild-type HEWL structure, so the effect of structural variation on pKa calculation methods, especially the FDPB method, can be tested [250]. Thus, our constant-pH REMD method will be tested on HEWL.

Figure 5-1. Crystal structure of HEWL (PDB code 1AKI). Residues in red represent aspartate and residues in blue are glutamate.

Various constant-pH methods have been tested on HEWL. Bürgi et al. [130] utilized their constant-pH method to predict pKa values of HEWL. The RMS error between predicted and experimental pKa values ranged from 2.8 to 3.8 pH units. In 2004, Lee et al. [114] applied their CPHMD method to four proteins: turkey ovomucoid (PDB code 1OMT), bovine trypsin inhibitor (1BPI), HEWL (193L) and ribonuclease A (7RSA). The overall pKa RMS error relative to experimental data was around 1 pH unit. For HEWL, the average absolute error over all ionizable residues (including the termini) was 1.6 pH units, while the average absolute error of the pKa values of the acidic ionizable residues relative to experimental data was 1.5 pH units. However, the pKa values of Glu35 and Asp52 were both 5.8, indicating that the CPHMD results could not distinguish the proton donor from the nucleophile. In the same year, Mongan et al. [127] published their discrete protonation state constant-pH MD method. HEWL was also selected as the test system.
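The Nielsen and McCammon criterion quoted above is simple enough to express as a one-line check. The sketch below is illustrative only; the function name and default arguments are hypothetical, and the example values are the experimental pKa values of Glu35 and Asp52 listed in the "Exp" column of Table 5-2 and the CPHMD values of 5.8 quoted above.

```python
def satisfies_donor_nucleophile_criterion(pka_donor, pka_nucleophile,
                                          min_donor_pka=5.0, min_gap=1.5):
    """True if the donor pKa is at least min_donor_pka and the nucleophile
    pKa lies at least min_gap pH units below the donor pKa."""
    return (pka_donor >= min_donor_pka
            and (pka_donor - pka_nucleophile) >= min_gap)

# Experimental values (Glu35 = 6.2, Asp52 = 3.68) pass the test.
experimental_ok = satisfies_donor_nucleophile_criterion(6.2, 3.68)

# Identical predicted values (5.8 and 5.8) fail the gap requirement.
cphmd_ok = satisfies_donor_nucleophile_criterion(5.8, 5.8)
```

The second case fails because the two predicted values leave no gap between donor and nucleophile, which is why those predictions cannot assign the catalytic roles.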
In the study performed by Mongan et al., four different crystal structures of HEWL were utilized (1AKI, 1LSA, 3LZT and 4LYT). The RMSDs of the pKa values of all ionizable residues relative to experimental results were 0.86, 0.77, 0.88 and 0.95 for 1AKI, 1LSA, 3LZT and 4LYT, respectively. In addition to pKa predictions, Mongan et al. also studied the correlation between protonation and conformation. Principal component analysis of a trajectory was conducted, the trajectory was projected onto the first two eigenvectors (those with the largest eigenvalues), and an association between conformation and protonation was observed. In 2006, Khandogin and Brooks [110] utilized the REX-CPHMD method to predict pKa values of 10 proteins. The RMS errors between REX-CPHMD and experimental pKa values ranged from 0.6 to slightly greater than 1 pH unit. For HEWL, the RMS error between predicted and experimental pKa values was 0.6 pH unit and the maximum absolute error was 1.0 pH unit. So far, their HEWL pKa prediction RMS error is the smallest among constant-pH pKa calculations on HEWL. Machuqueiro and Baptista presented HEWL pKa predictions from their stochastic titration constant-pH MD with an explicit water model in 2008 [125]. The RMS errors between predicted and experimental pKa values were 0.82 and 1.13 for the generalized reaction field [252] and PME [154] treatments of long-range electrostatics, respectively. A comparative FDPB calculation (using the same single crystal structure as in the constant-pH MD and a protein dielectric constant of 2) was also conducted, and the RMS error was found to be 2.76. Since the constant-pH method proposed by Baptista requires FDPB calculations, the selection of the dielectric constant inside the protein was crucial.
Machuqueiro and Baptista performed constant-pH MD utilizing three different dielectric constants (ε = 2, 4 and 8) combined with PME treatment of long-range electrostatics. The pKa RMS errors were 1.13, 1.02 and 1.12 for ε = 2, 4 and 8, respectively. More recently, the constant-pH MD proposed by Mongan et al. [127] was coupled with accelerated molecular dynamics (AMD) [133,134] and tested on HEWL by Williams et al. [129] Constant-pH AMD and MD simulations of 5 ns in length were performed. Only the acidic ionizable residues in HEWL were titrated in the constant-pH scheme. RMS errors between predicted and experimental pKa values were calculated. The constant-pH AMD yielded an overall RMS error of 0.73, while the original constant-pH MD pKa RMS error was 0.80. The pKa RMS errors of the aspartates were 0.75 and 1.46 from constant-pH AMD and MD, respectively. The pKa RMS errors of the glutamates were 0.85 and 1.04 from constant-pH AMD and MD, respectively. In general, recent works utilizing various constant-pH schemes have achieved RMS errors in the range of 0.6 to 1.13 for HEWL.

In this chapter we present a study of HEWL using the constant-pH REMD algorithm. Both structurally restrained and unrestrained simulations were performed. pKa values from constant-pH REMD are compared with experimental values. We also investigated pKa convergence, the effect of structural restraints, and conformation-protonation correlations.

5.2 Simulation Details

The crystal structure 1AKI (PDB code) was taken as the HEWL starting structure in our study. Water molecules in the crystal structure were stripped first. Only aspartate and glutamate residues were studied, so nine ionizable residues were selected. Hydrogen atoms were added by the LEaP module of the AMBER suite. The post-processed crystal structure was then minimized and heated from 0 K to 300 K.
The restart structure from the heating process was taken as the initial structure for our constant-pH REMD simulations. For simplicity, all REMD runs in this chapter refer to constant-pH REMD simulations. The pH ranged from 2 to 6 in increments of 0.5 pH unit. Two sets of REMD simulations were performed: unrestrained (ntr=0 in AMBER) and restrained (ntr=1 in AMBER). In each REMD run, an exchange of structures was attempted every 500 MD steps, and 1000 exchange attempts were made in both sets. Thus, the simulation time of each replica in each set is 1 ns. In the unrestrained REMD runs, we chose the highest temperature to be 320 K in the hope that HEWL would not unfold at any temperature. In the restrained REMD runs, Cα atoms from residue 3 to 126 were restrained by harmonic potentials. The restraining harmonic potential has the following form:

$$V_{\mathrm{restraint}} = \frac{1}{2} k \left(\mathbf{x} - \mathbf{x}_0\right)^2$$

where $\mathbf{x}$ and $\mathbf{x}_0$ are the Cartesian coordinates at the current time and the Cartesian coordinates of the reference structure, respectively, and $k$ is the force constant of the harmonic potential, which determines the strength of the restraint. In our simulations, the reference coordinates are the initial Cα coordinates. By putting restraining harmonic potentials on the Cα atoms, the secondary structure of HEWL is preserved, and the highest temperature is increased to 420 K in order to achieve better side-chain conformational sampling. The force constant of the harmonic potentials was 1.0 kcal/mol·Å² (restraint_wt=1 in AMBER).

Several other REMD simulations were performed on the basis of the results from these two sets of REMD runs. The general goal of those simulations was to test the hypotheses derived from the two previous sets. First, the constant-pH REMD simulation with restraints on the Cα atoms was continued for another 1 ns at all pH values in order to check the pKa convergence of the restrained simulations.
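The harmonic restraint and the simulation-length bookkeeping described above can be made concrete with a short sketch. It follows the functional form written in the text (with the factor of 1/2) and sums over all restrained atoms; the function names are hypothetical, and no claim is made here about how the simulation package applies the prefactor internally. The 2 fs time step used in the arithmetic is the value stated in the simulation details of this chapter.

```python
import numpy as np

def restraint_energy(coords, ref_coords, force_constant):
    """Harmonic restraint energy, 0.5 * k * |x - x0|^2 summed over atoms.

    coords, ref_coords: arrays of shape (N, 3) in Angstrom.
    force_constant: k in kcal/mol/Angstrom^2.
    """
    disp = np.asarray(coords, dtype=float) - np.asarray(ref_coords, dtype=float)
    return 0.5 * force_constant * float((disp ** 2).sum())

def simulation_time_ns(exchange_attempts, steps_per_attempt, timestep_fs):
    """Total simulated time per replica in nanoseconds (1 ns = 1e6 fs)."""
    return exchange_attempts * steps_per_attempt * timestep_fs / 1e6

# 1000 exchange attempts, one every 500 MD steps, with a 2 fs time step.
time_per_replica = simulation_time_ns(1000, 500, 2.0)
```

With these inputs the helper returns 1 ns per replica, matching the figure quoted in the text, and a single atom displaced by 1 Å under k = 1.0 kcal/mol·Å² contributes 0.5 kcal/mol in this form.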
Likewise, 1000 exchange attempts were conducted in those 1 ns simulations, and the restraint strength was still 1.0 kcal/mol·Å². Second, a new set of constant-pH REMD simulations with restraints on the Cα atoms was performed. The force constant adopted in this second set was 0.1 kcal/mol·Å², so that the effect of restraint strength could be tested. The details of the constant-pH REMD simulations can be found in Table 5-1.

Table 5-1. Simulation details of constant-pH REMD runs.

pH values | Restrained or not | Restraint strength | Number of replicas | Temperature (K) | Simulation time (ns) | Exchange attempts
2~6 | No | 0 | 4 | 280~320 | 1 | 1000
2~6 | Yes | 1 | 8 | 280~420 | 2 | 2000
3, 4, 4.5 | Yes | 0.1 | 8 | 280~420 | 2 | 2000

The restraint strength is represented by the force constant of a harmonic potential. The unit of the force constant is kcal/mol·Å². The REMD simulation with the 1 kcal/mol·Å² restraint was performed in two stages. Each stage lasted 1 ns, and the purpose of the second stage was to check the pKa convergence.

All simulations were performed using the AMBER 9 molecular simulation suite [253] with the AMBER ff99SB force field [139]. The SHAKE algorithm [145] was used to allow a 2 fs time step. The OBC Generalized Born implicit solvent model [200] was used to model the water environment in all our calculations. The Berendsen thermostat [146], with a relaxation time of 2 ps, was used to keep the replica temperatures around their target values. The salt concentration (Debye-Hückel based) was set to 0.1 M. The cutoff for nonbonded interactions and the Born radii was 30 Å.

5.3 Protein Conformational and Protonation State Equilibrium Model

Suppose an ionizable side chain has only two conformations in equilibrium and each conformer has its own protonation equilibrium. We can use 1p, 1d, 2p and 2d to label conformer 1 in its protonated form, conformer 1 in its deprotonated form, conformer 2 in its protonated form and conformer 2 in its deprotonated form, respectively.
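For the four species just labeled, one possible way to write the coupled equilibria is sketched below. This is an illustration under stated assumptions rather than a reproduction of the chapter's numbered equations: the symbol $K_{12}^{p}$ and the choice to define the conformational equilibrium constant between the two protonated forms are assumptions made here, and $K_{a,1}$ and $K_{a,2}$ are the usual acid dissociation constants of each conformer.

```latex
\begin{align}
K_{12}^{p} &= \frac{[2p]}{[1p]}, \qquad
K_{a,1} = \frac{[1d][\mathrm{H}^+]}{[1p]}, \qquad
K_{a,2} = \frac{[2d][\mathrm{H}^+]}{[2p]} \\
K_a &= \frac{\left([1d] + [2d]\right)[\mathrm{H}^+]}{[1p] + [2p]}
     = \frac{K_{a,1} + K_{12}^{p}\, K_{a,2}}{1 + K_{12}^{p}} \\
\mathrm{p}K_a &= -\log_{10}\!\left(
  \frac{10^{-\mathrm{p}K_{a,1}} + K_{12}^{p}\, 10^{-\mathrm{p}K_{a,2}}}{1 + K_{12}^{p}}
\right)
\end{align}
```

Under these assumptions the observed pKa is a population-weighted combination of the two conformer-specific values: it reduces to $\mathrm{p}K_{a,1}$ when $K_{12}^{p} \to 0$ and to $\mathrm{p}K_{a,2}$ when $K_{12}^{p} \to \infty$.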
The equilibrium among all species is shown in Figure 5-2.

Figure 5-2. A simple schematic view of the conformation-protonation equilibrium in a constant-pH simulation.

Then K12, the equilibrium constant between conformations 1 and 2, is given by Equation 5-1. In the above model, pKa,1 and pKa,2 represent the protonation equilibrium within each conformation; they are expressed by Equations 5-2 and 5-3. The pKa of the ionizable residue is then given by Equation 5-4.

5.4 NMR Chemical Shift Calculations

A theoretical NMR chemical shift titration curve was generated. Owing to the limitation on system size, full quantum mechanical NMR calculations were performed only on the ionizable residue dipeptide (the ionizable residue with both ends blocked). The structure of the ionizable dipeptide was extracted from the representative structures (representing different side-chain conformations) generated by cluster analysis. Proper protonation states were assigned to each structure. All full quantum mechanical NMR calculations were performed with the Gaussian03 software package [254] using the B3LYP functional and the 6-311++G** basis set. Isotropic magnetic shielding constants were computed in vacuum using the GIAO method [255]. Tetramethylsilane (TMS) was used as the reference in order to obtain the chemical shift.

Recently, Merz and co-workers [256] developed an automated fragmentation quantum mechanical/molecular mechanical (AF-QM/MM) approach to study protein properties. They have applied their method to compute the protein chemical shifts of Trp-cage. In this AF-QM/MM model, one residue and the atoms near it (less than 4 Å away) are assigned to the QM region and the rest of the protein is placed in the MM region. During NMR calculations, all atoms in the MM region are treated as point charges. We applied this AF-QM/MM method to 1AKI to calculate chemical shifts as well. Again, all AF-QM/MM calculations were based on representative structures.
5.5 Results and Discussions

5.5.1 Structural Stability and pKa Convergence

Since changing the protonation state during a simulation causes discontinuities in force and energy, structural stability in our simulations is important. We chose the Cα root-mean-square deviation (RMSD) relative to the 1AKI structure as our metric. Figure 5-3A shows the Cα RMSD versus time in the unrestrained REMD runs. HEWL is unstable at all simulated pH values: the RMSD can reach very high values (~18 Å) during the simulations. Even at pH 4, where the Cα RMSD values are small relative to the rest, the Cα RMSD can still exceed 3 Å [...] pKa [...] be used. Figure 5-3B shows the RMSDs in the restrained REMD runs. Although the RMSD values are small and stable throughout the 2 ns simulations, the restrained REMD simulations still reveal problems according to Figure 5-3B. Our simulations use 1AKI, which was resolved at pH 4.5, as the starting structure. As the pH moves away from 4.5, one may expect HEWL to adopt conformations slightly different from 1AKI, so a larger RMSD should be expected where the pH is far from 4.5. This behavior has been confirmed in the work of Mongan et al. However, restraining the Cα atoms results in the same RMSDs over the entire pH range. This may have a negative effect on pKa predictions at pH values far from 4.5.

Figure 5-3. Cα RMSD relative to the crystal structure (PDB code: 1AKI).
A) Cα RMSD relative to 1AKI from REMD without restraints on the Cα atoms. B) Cα RMSD relative to 1AKI from REMD with restraints on the Cα atoms. The restraint strength is 1 kcal/mol·Å².

In order to check the convergence of protonation state sampling in the restrained REMD simulations, the pKa prediction error (predicted value minus experimental value) as a function of time and the time evolution of the prediction deviation (predicted pKa at the current time minus the final predicted pKa) are followed and shown in Figures 5-4 and 5-5. According to those plots, stabilization of the pKa predictions is seen [...] change average pKa predictions and their errors relative to experimental values. To show that convergence in protonation state sampling is reached over a wide range of pH, a representative plot of the Asp52 pKa deviations is shown in Figure 5-5B. Convergence is clearly seen over the pH range.

Figure 5-4. pKa prediction error as a function of time. The predicted pKa at a given time is a cumulative result. For each ionizable residue, the time series of its pKa error is generated at the pH where the average predicted pKa is closest to that pH value. In this way, we try to eliminate any bias toward the energetically favored state. A flat line is an indication of convergence. Glu35 is not shown here due to poor convergence.

Figure 5-5. A) pKa prediction convergence to its final value. Similarly, the pKa value at a given time is a cumulative average. A flat line with a y value of 0 is expected when convergence of the pKa calculation is reached. The same pH values are chosen for each ionizable residue as in Figure 5-4. B) Asp52 pKa prediction convergence to its final value at multiple pH values.
The pH values are selected such that the pKa calculated at each of them is used to compute the composite pKa.

5.5.2 pKa Predictions

A popular way to assess the accuracy of pKa predictions is to look at the pKa RMS error relative to experimentally measured pKa [...] to generate pKa [...] simulations. Mongan et al. proposed a way to calculate pKa [...] their constant-pH MD paper. They called the pKa values calculated in this way composite pKa values. A composite pKa is the average of all pKa values having an absolute offset of less than 2 pH units. Here, an offset is the difference between a predicted pKa and the pH value at which it was obtained. Table 5-2 shows the pKa values and the pKa RMS errors from the 2 ns restrained REMD runs. Composite pKa values, pKa values [...] and their RMS errors relative to experimental measurements are also listed in Table 5-2. We used the same experimental pKa values as Mongan et al. to calculate the pKa RMS errors. In our work, the pKa predictions from [...] while utilizing composite pKa values produces an RMS error of 0.87. According to the constant-pH simulation literature, the RMS errors of HEWL pKa predictions are around 0.8 for acidic ionizable residues. So there is no significant improvement in pKa prediction from our simulations. However, as mentioned in the discussion of structural stability, putting a restraint on the Cα atoms of a protein lowers its ability to adjust its conformation. The further a pH value is from the crystal pH, the more the structural ensemble is skewed from the correct one. Simulations performed around pH 4.5 are less affected by the restraint than simulations performed at pH values far from 4.5. The less a structural ensemble is skewed, the smaller the human-introduced error in the pKa predictions, so one may expect a smaller pKa RMS error relative to experimental values around pH 4.5.
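The quantities discussed in this subsection can be reproduced in a few lines. The sketch below is a minimal illustration, not the analysis code used for this work: the function names are hypothetical, the single-pH estimate assumes a Henderson-Hasselbalch relation with a Hill coefficient of 1, and the strict "offset < 2" rule is applied exactly as worded above. The numbers are the experimental column, the pH 4.5 column and the Glu7 row of Table 5-2.

```python
import math

# Experimental pKa values and predictions at pH 4.5, as listed in Table 5-2.
EXPERIMENTAL = {"Glu7": 2.85, "Asp18": 2.66, "Glu35": 6.2, "Asp48": 2.5,
                "Asp52": 3.68, "Asp66": 2.0, "Asp87": 2.07,
                "Asp101": 4.09, "Asp119": 3.2}
PREDICTED_PH45 = {"Glu7": 2.93, "Asp18": 2.35, "Glu35": 4.53, "Asp48": 2.45,
                  "Asp52": 2.72, "Asp66": 2.72, "Asp87": 2.64,
                  "Asp101": 3.55, "Asp119": 3.01}

# Glu7 predictions at each simulated pH (same row of Table 5-2).
PH_VALUES = [2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0, 6.0]
GLU7_PKA_BY_PH = [3.61, 3.58, 3.46, 3.03, 2.99, 2.93, 2.36, 3.37]

def pka_from_fraction(ph, frac_deprotonated):
    """Single-pH estimate: pKa = pH - log10(deprotonated / protonated)."""
    return ph - math.log10(frac_deprotonated / (1.0 - frac_deprotonated))

def composite_pka(ph_values, pka_values, max_offset=2.0):
    """Average of the predictions whose |pKa - pH| offset is below max_offset."""
    kept = [pka for ph, pka in zip(ph_values, pka_values)
            if abs(pka - ph) < max_offset]
    return sum(kept) / len(kept)

def rms_error(predicted, reference):
    """Root-mean-square error over the residues present in reference."""
    squared = [(predicted[name] - reference[name]) ** 2 for name in reference]
    return math.sqrt(sum(squared) / len(squared))

glu7_composite = composite_pka(PH_VALUES, GLU7_PKA_BY_PH)
rms_at_ph45 = rms_error(PREDICTED_PH45, EXPERIMENTAL)
```

For the Glu7 row, the strict threshold drops the two highest-pH entries and the remaining six values average to about 3.27, in line with the tabulated composite value; borderline offsets in other rows may have been handled differently in the original analysis. The pH 4.5 column gives an RMS error of about 0.74.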
The RMS errors of the pKa predictions relative to experimental values are plotted against pH in Figure 5-6. As expected, a minimum with an RMS error of 0.74 is found at pH 4.5. An RMS error of 0.74 is among the best published HEWL predictions.

Table 5-2. Predicted pKa values and their RMS errors relative to experimental measurements from the restrained REMD simulations.

Residue     Exp (251)  pH 2   pH 2.5  pH 3   pH 3.5  pH 4   pH 4.5  pH 5   pH 6   Comp   Hill
Glu7        2.85       3.61   3.58    3.46   3.03    2.99   2.93    2.36   3.37   3.27   3.23
Asp18       2.66       1.59   1.54    1.51   1.61    1.91   2.35    2.5    3.69   1.63   1.4
Glu35       6.2        3.76   3.65    4.36   4.14    4.31   4.53    4.76   4.61   4.27   4.58
Asp48       2.5        1.88   1.98    2.14   2.34    2.6    2.45    1.96   2.9    2.23   2.01
Asp52       3.68       2.71   2.45    2.63   2.82    3.05   2.72    2.77   3.99   2.73   2.68
Asp66       2.0        2.5    2.69    2.86   2.92    3.12   2.72    3.09   4.04   2.8    2.73
Asp87       2.07       2.32   2.43    2.64   2.49    2.54   2.64    2.79   3.62   2.51   2.42
Asp101      4.09       4.52   4.4     4.14   4.03    3.79   3.55    3.44   3.96   3.89   3.85
Asp119      3.2        2.71   2.78    3.01   3.01    3.25   3.01    2.89   3.97   2.96   2.9
RMS Error              1.04   1.1     0.91   0.89    0.83   0.74    0.79   1.12   0.87   0.84

In this table, Comp stands for the composite pKa value of an ionizable residue (see Mongan's paper for the definition) and Hill stands for the pKa value obtained from the Hill's plot. The force constant of the harmonic potential used here is 1.0 kcal/mol·Å².

Figure 5-6. RMS error between predicted and experimental pKa vs pH value. A minimum of the pKa RMS error is found near the pH at which the 1AKI crystal structure was resolved.

5.5.3 Constant pH REMD Simulations with a Weaker Restraint

Based on what has been found so far, we propose that reducing the restraint strength on the Cα atoms will yield better pKa predictions, because a weaker restraint leaves more freedom for conformational sampling. HEWL can relax its structure further, even at pH 4.5, so a more accurate structural ensemble can be produced. This, in turn, will improve pKa calculations.
Constant pH REMD simulations with a weaker restraint (a harmonic potential of 0.1 kcal/mol·Å² on the Cα atoms) were carried out at three different pH values to test our hypothesis. First, as shown in Figure 5-7A, all three simulations generate larger Cα RMSDs relative to 1AKI than the simulations with the stronger restraint do. This means HEWL relaxes more when a weaker restraint is used. In addition, the Cα RMSD fluctuations in all three runs are larger than those in the 1 kcal/mol·Å² REMD runs, which means more conformational space is visited. Another interesting point in the weakly restrained REMD runs is that the Cα RMSDs at pH 3 and 4 are larger than those at pH 4.5. Simulations at pH 3 and 4 do tend to sample conformations that differ from those sampled at pH 4.5.

The pKa prediction results are listed in Table 5-3. The pKa prediction deviation from the final value vs time at pH 4.5 is shown in Figure 5-7B to demonstrate the convergence of protonation state sampling. According to Table 5-3, an improvement of nearly 0.1 pH unit in the RMS error of the predicted pKa values is seen at each pH for the weakly restrained REMD runs. However, among all three RMS errors, the best one is still obtained at pH 4.5, indicating that the restraint still favors simulations near pH 4.5. After reducing the restraint strength, our best pKa RMS error relative to experimental values is 0.62.

Table 5-3. Predicted pKa values and their RMS errors relative to experimental measurements from weakly restrained REMD simulations.

            pH=3           pH=4           pH=4.5
Residue     1      0.1     1      0.1     1      0.1
Glu7        3.46   3.71    2.99   3.38    2.93   3.34
Asp18       1.51   1.57    1.91   1.76    2.35   2.23
Glu35       4.36   5.09    4.31   5.23    4.53   5.24
Asp48       2.14   2.27    2.6    2.48    2.45   2.71
Asp52       2.63   2.47    3.05   2.88    2.72   3.29
Asp66       2.86   2.63    3.12   2.66    2.72   2.93
Asp87       2.64   2.52    2.54   2.79    2.64   2.88
Asp101      4.14   3.82    3.79   3.77    3.55   3.54
Asp119      3.01   2.22    3.25   2.21    3.01   3.38
RMSE        0.91   0.84    0.83   0.72    0.74   0.62

In Table 5-3, the number 1 in the second row means the force constant of the restraining potential is 1 kcal/mol·Å², while 0.1 stands for 0.1 kcal/mol·Å². RMSE stands for RMS error.

Figure 5-7. A) Cα RMSD of HEWL from the weaker-restraint REMD simulations. The RMSDs are larger than those with the stronger restraint. Among the simulations using the weaker restraint, the RMSDs are greater at pH 3 and 4 than at pH 4.5. B) pKa prediction deviation from the final value at pH 4.5 from constant pH REMD with a 0.1 kcal/mol·Å² restraint.

5.5.4 Active Site Ionizable Residue pKa Prediction: Asp52

Accurate calculation of the pKa values of ionizable residues in an active site is important because their protonation states are crucial in enzyme reactions. In the case of HEWL, Asp52 works as a nucleophile. This requires Asp52 to be deprotonated during the reaction, which has an optimal pH around 5. In both sets of restrained REMD simulations, Asp52 is indeed deprotonated around pH 5. However, the error of the Asp52 pKa relative to the experimental value is about 1 pH unit. Mongan and co-workers observed the same trend, except that their error was larger. They claimed that an Asp52-Asn46 hydrogen bond caused the very low predicted pKa of Asp52 (127). Asp52 and the residues that strongly interact with it (three asparagine residues: Asn44, Asn46 and Asn59) in the crystal structure of 1AKI (hydrogen atoms added and proper protonation states chosen at pH 4.5) are shown in Figure 5-8. We studied those interactions, represented by atom-to-atom distances, in our REMD simulations.
We find that Asp52 is closer to Asn59 and Asn44 than to Asn46, indicating that Asp52 interacts more strongly with Asn59 and Asn44 than with Asn46. Time series of the distances from the Asp52 carboxylic oxygen atoms to the Asn59 and Asn44 ND2 atoms at pH 3 are shown in Figure 5-9. As can be seen from Figures 5-9A and 5-9B, Asp52 stays within hydrogen bonding distance of Asn44 and Asn59 for a long time at a pH as low as 3. Furthermore, the hydrogen bonding distances between Asp52 and Asn44 and between Asp52 and Asn59 are coupled. The two oxygen atoms in the carboxylic group of Asp52 are able to act as proton acceptors simultaneously. This means that the deprotonated form of Asp52 is over-stabilized by hydrogen bonding even at low pH values.

Figure 5-8. Asp52 in the crystal structure of 1AKI. Its neighbors that have strong electrostatic interactions with it are also shown.

Figure 5-9. A) Time series of the distances from the Asp52 carboxylic oxygen atom OD1 to the Asn59 and Asn44 ND2 atoms at pH 3 in the 1 kcal/mol·Å² constant pH REMD run. B) Time series of the distances from the Asp52 carboxylic oxygen atom OD2 to the Asn59 and Asn44 ND2 atoms under the same conditions. Hydrogen bonds that stabilize deprotonated Asp52 are formed to a large extent even at low pH.

Next, hydrogen bond analysis was conducted with the PTRAJ module in the AMBER suite for both sets of restrained REMD simulations. Hydrogen bonds are found between Asp52 and all three asparagines (Asn44, Asn46, and Asn59) in both sets. The occupation times of the Asp52-Asn44 and Asp52-Asn59 hydrogen bonds are longer than that of the Asp52-Asn46 hydrogen bond. Furthermore, the Asp52-Asn44 and Asp52-Asn59 hydrogen bonds are coupled, according to the distances shown in Figure 5-9. Asp52 is protonated only when the entire carboxylic group points away from Asn44 and Asn59. The Asp52-Asn44 and Asp52-Asn59 hydrogen bonds, not the Asp52-Asn46 hydrogen bond, are responsible for the low predicted pKa value of Asp52.
The hydrogen bond contents are similar in the strongly and weakly restrained REMD simulations. This indicates that the hydrogen bonding effect on Asp52 in our simulations is too strong. Reducing the restraint strength does help the conformational sampling of Asp52.

5.5.5 Active Site Ionizable Residue pKa Prediction: Glu35

Glu35 is another problematic case in our study. In the 1 kcal/mol·Å² runs, Glu35 has the largest single-residue error: almost 2 pH units. Excluding Glu35 lowers the pKa RMS error by nearly 0.2 pH unit. In the 0.1 kcal/mol·Å² runs, the pKa value of Glu35 is improved, with an error of around 1 pH unit. This is the main reason that smaller pKa RMS errors relative to experimental data are found in all three 0.1 kcal/mol·Å² REMD simulations. Although the pKa error of Glu35 in the weakly restrained REMD simulations is large, the good news is that Glu35 can be correctly identified as the proton donor based on the criterion proposed by Nielsen and McCammon: Glu35 has a pKa value of ~5.2, and the pKa difference between Asp52 and Glu35 is greater than 1.5 pH units. The predicted pKa value of Glu35 was 5.32 in the study performed by Mongan et al. They claimed that a hydrogen bonding effect similar to the one seen for Asp52 was responsible for the low predicted pKa value of Glu35 (127). However, hydrogen bond analysis of our data does not show any significant hydrogen bonding by Glu35, contrary to what Mongan et al. claimed. In the 1AKI crystal structure, the Glu35 side chain is in the vicinity of the Gln57, Trp108 and Ala110 side chains. Several key distances between the Glu35 carboxylic group and the Gln57, Trp108 and Ala110 side chains in the crystal structure are listed in Table 5-4. According to Table 5-4, Glu35 is in a hydrophobic region, except for a short distance between the Glu35 OE2 atom and the Ala110 backbone amide nitrogen atom.
The hydrophobic effect is the main reason for the elevated pKa value of Glu35. However, when the carboxylic group points toward the Ala110 amide group, the deprotonated form of Glu35 is favored. If such a conformation is stable throughout the simulations, the predicted pKa value will be smaller than it is supposed to be. We think one reason for the low predicted pKa value is that Glu35 is stuck in conformations that stabilize the deprotonated form. The weakly restrained simulations, however, allow Glu35 to relax its structure further and to visit conformations that stabilize the protonated form more frequently.

Table 5-4. Distances between Glu35 carboxylic oxygen atoms and neighboring residue side chain atoms in the 1AKI crystal structure.

              Glu35 OE1   Glu35 OE2
Gln57 CB      3.56        5.25
Gln57 CG      3.85        5.84
Trp108 CB     5.36        3.43
Trp108 CG     5.43        3.94
Trp108 CD1    4.65        3.67
Ala110 N      4.65        3.09
Ala110 CB     4.19        3.48

The unit of all distances in Table 5-4 is Å.

The Glu35 heavy atom RMSD relative to 1AKI and a cluster analysis based on those RMSDs are used to study Glu35 conformational sampling. The distributions of the heavy atom RMSD, shown in Figure 5-10, reveal two conformations in the strongly restrained simulations: one centered at an RMSD of ~0.1 Å (labeled conformation 1) and the other centered at ~0.6 Å (labeled conformation 2). However, an extra conformation (conformation 3) is visited in the weakly restrained REMD simulations. Cluster analysis is employed to separate those conformations. In conformation 2, the carboxylic group of Glu35 points toward the Ala110 amide group in both sets of restrained REMD runs (Figure 5-11). The carboxylic group in conformation 1 also points toward the Ala110 amide group, although to a lesser extent.
However, conformation 3 (seen in the weakly restrained runs only) contains configurations in which the Glu35 carboxylic group points away from the Ala110 amide group (Figure 5-12B). In this conformation, the Glu35 side chain is in the hydrophobic region and the protonated species is favored. The low population of conformation 3 is responsible for the low predicted pKa value of Glu35.

Figure 5-10. A) Time series of the Glu35 heavy atom (excluding the two carboxylic oxygen atoms) RMSD relative to the crystal structure 1AKI. B) Probability distribution of the RMSD. The conformation centered at an RMSD of ~0.1 Å is labeled conformation 1. The one centered at ~0.6 Å is labeled conformation 2. An extra conformation (conformation 3) is visited in the weakly restrained REMD simulation.

Figure 5-11. A) Representative structure of conformation 1. B) Representative structure of conformation 2. The structural ensemble is generated from the REMD simulations with the stronger restraining potential. The carboxylic group of Glu35 in conformation 2 clearly points toward the amide group of Ala110. The deprotonated form of Glu35 tends to decrease the electrostatic energy. Furthermore, conformation 1 does not particularly favor protonated Glu35. No significant stabilizing factor is found for protonated Glu35.

Figure 5-12. Representative structure of conformation 3 from cluster analysis. Glu35 is in the hydrophobic region consisting of Gln57, Trp108 and Ala110. Conformations 1 and 2 in the weakly restrained simulations are basically the same as those shown in Figure 5-11.

Another possible reason for the underestimated pKa value of Glu35 is the use of implicit solvent in the constant pH MD and REMD simulations. Imoto et al. suggested that Glu35 and Asp52 were coupled by two water molecules through hydrogen bonding. The Glu35 carboxylic group acted as a proton donor in the hydrogen bonding.
Thus the protonated form of Glu35 was stabilized, which contributed to the elevated pKa value. Two water molecules are indeed found between Glu35 and Asp52 in the 1AKI crystal structure, and they are within hydrogen bonding distance of Glu35 and Asp52. If the hypothesis is true, the use of implicit solvent breaks this hydrogen bonding network, so a stabilizing factor for protonated Glu35 is missing. A constant pH algorithm employing explicit solvent is needed to study this effect.

5.5.6 Correlation between Conformation and Protonation

As described earlier, one advantage of constant pH methods is that conformational sampling and protonation state sampling are directly coupled. In this work, side chain dihedral angles are chosen to study conformation-protonation coupling. The Asp119 χ1 and χ2 dihedral angles at pH 3 are shown as representatives. Two-dimensional histograms of dihedral angles and protonation states are displayed in Figure 5-13. A two-dimensional (2D) histogram is generated by binning the dihedral angle and protonation state space. (As explained in the second chapter, considering the syn and anti configurations of protons generates five protonation states for an ionizable aspartate in AMBER. They are labeled 0, 1, 2, 3 and 4, in which state 0 is the deprotonated state and the rest are protonated species.)

Figure 5-13. A) Correlation between side chain dihedral angle χ1 and protonation states. B) Correlation between side chain dihedral angle χ2 and protonation states.

Our 2D histograms show the correlations between the dihedral angle distribution and the protonation state distribution. Two conformations are found in χ1 space: conformation 1 has a χ1 angle around 60°, while conformation 2 has a χ1 angle around 170°. In Figure 5-13A, we can clearly see that conformation 1 is coupled with the protonated form, while most structures in conformation 2 are in the deprotonated state.
According to Figure 5-13B, similar behavior is seen in χ2 space. Most deprotonated Asp119 structures have χ2 near 40° and 140°, while configurations with χ2 of 75° and 100° are protonated. A closer look at the 1AKI crystal structure reveals that the side chains of Asp119 and Arg125 are close to each other (the carboxylic group of Asp119 and the guanidinium group of Arg125 are within hydrogen bonding distance). Since Arg125 carries a positive charge on its guanidinium group, it stabilizes deprotonated Asp119 when the two side chains are close to each other. We calculated the pKa of Asp119 in 1AKI using H++ (a web-based FDPB server developed by Alexey Onufriev's group at Virginia Tech; the FDPB equation is solved on the basis of only one protein structure) (257,258). The calculated pKa of Asp119 using the FDPB method is 1.1, 0.7 and 1.3 when the internal dielectric constant is set to 2, 4, and 6, respectively. All three pKa values are much lower than the experimental pKa value of 3.2. This behavior agrees with what we just explained: the Asp119-Arg125 side chain coupling stabilizes the deprotonated form of Asp119. The single-structure FDPB-based pKa calculations yield such low pKa values because Asp119 visits only one conformation. Therefore, Asp119 must sample other conformations in order to yield accurate pKa predictions. The time evolution of the distance between the Asp119 and Arg125 side chains is shown in Figure 5-14 to show that conformations other than the crystal conformation are visited in our constant pH REMD runs. In Figure 5-14, we can clearly see that the close contact between the Asp119 and Arg125 side chains can be broken during our simulations. Allowing the side chains to move results in a pKa value of 3.0 in our simulations. The comparison between the constant pH and single-structure FDPB algorithms clearly demonstrates the importance of conformational sampling in pKa calculations.
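The two-dimensional histograms used in this section amount to counting frames per (dihedral bin, protonation state) cell. The sketch below shows one way to do that with the Python standard library. It is an illustration under stated assumptions rather than the analysis script used here: the function names, the 10° bin width and the six-frame toy trajectory are all hypothetical. The only convention taken from the text is the state labelling (0 for the deprotonated state, 1 to 4 for the protonated ones).

```python
import math
from collections import Counter


def histogram_2d(dihedrals_deg, states, bin_width=10.0):
    """Count frames per (dihedral bin start, protonation state) cell.

    Angles are wrapped to [-180, 180) before binning.
    """
    counts = Counter()
    for angle, state in zip(dihedrals_deg, states):
        wrapped = (angle + 180.0) % 360.0 - 180.0
        bin_start = math.floor(wrapped / bin_width) * bin_width
        counts[(bin_start, state)] += 1
    return counts


def protonated_fraction_by_bin(counts):
    """Fraction of protonated frames (state != 0) in each dihedral bin."""
    totals, protonated = Counter(), Counter()
    for (bin_start, state), n in counts.items():
        totals[bin_start] += n
        if state != 0:
            protonated[bin_start] += n
    return {b: protonated[b] / totals[b] for b in totals}


# Hypothetical toy trajectory: one dihedral angle (degrees) and one state label per frame.
toy_dihedrals = [62.0, 65.0, 58.0, 171.0, 168.0, 175.0]
toy_states = [1, 3, 0, 0, 0, 0]

toy_counts = histogram_2d(toy_dihedrals, toy_states)
toy_fractions = protonated_fraction_by_bin(toy_counts)
for bin_start in sorted(toy_fractions):
    print(f"{bin_start:7.1f} deg  protonated fraction = {toy_fractions[bin_start]:.2f}")
```

Bins whose protonated fraction is close to 1 or to 0 correspond to conformations that are coupled to the protonated or the deprotonated form, respectively.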
Figure 5-14. Minimal distance between the Asp119 side chain carboxylic oxygen atoms (OD1 and OD2) and the Arg125 guanidinium nitrogen atoms. Since the guanidinium group has three nitrogen atoms, the minimal distance is the shortest distance between Asp119 OD1 (or OD2) and those three nitrogen atoms.

Therefore, another way to look at conformations is to combine Asp119 and Arg125. The distance between the Asp119 CG and Arg125 CZ atoms is now used to distinguish conformations. Figure 5-15A shows the CG-CZ distance probability distribution, which also reveals that two conformations exist. One conformation is centered at a CG-CZ distance of 4.2 Å, which means the Asp119-Arg125 coupling is on. The other conformation represents all structures not belonging to the first one; based on the distance between Asp119 CG and Arg125 CZ, we can say the coupling is off. The 2D histogram of distance and protonation state at pH 3 is shown in Figure 5-15B. As can be seen in the 2D histogram contour plot, the short-distance conformation is indeed in the deprotonated state. The pKa of the short-distance conformation is negative infinity. Although several snapshots are both protonated and at short distance, the 2D histogram does not reveal them as a stable conformation. So, the short-distance conformation is purely coupled with the deprotonated form. We also find that the pKa value of the long-distance conformation is 3.3 according to Hill's plot.

Figure 5-15. A) Probability distribution of Asp119 CG to Arg125 CZ distances. The Asp119 CG to Arg125 CZ distance is used to distinguish conformations. B) Coupling between conformations and protonation states.

5.5.7 Conformation-Protonation Equilibrium Model

Because the conformational and protonation equilibria are coupled, knowing the effect of pH on the conformational equilibrium is interesting and important.
Again, Asp119 is selected as the representative in our study. First, we show the derivation and the analytical form of K12, the conformational equilibrium constant between conformations 1 and 2, as a function of pH in the general case. From now on, we label conformation 1 in its deprotonated form as 1d. Then 1p, 2d and 2p stand for conformation 1 in its protonated form, conformation 2 in its deprotonated form and conformation 2 in its protonated form, respectively. According to eq. 2 and 3, [1p] = [1d]·10^(pKa,1 - pH) and [2p] = [2d]·10^(pKa,2 - pH). We can substitute [1p] and [2p] in eq. 1 with [1d] and [2d], so the conformational equilibrium constant has the form:

\[ K_{12} = \frac{[1d]\left(1 + 10^{\,pK_{a,1} - pH}\right)}{[2d]\left(1 + 10^{\,pK_{a,2} - pH}\right)} \]   (5-5)

In Eq. 5-5, [1d]/[2d] is the equilibrium constant between conformations 1 and 2 in the deprotonated form, and it is equal to K12 at high pH, where both conformations are deprotonated. So K12 has the final analytical formula:

\[ K_{12} = K_{12,h}\,\frac{1 + 10^{\,pK_{a,1} - pH}}{1 + 10^{\,pK_{a,2} - pH}} \]   (5-6)

where K12,h stands for K12 at high pH. In our derivation, conformation 1 always has a smaller pKa value than conformation 2, so the denominator always increases faster than the numerator as the pH goes down. Since K12,h is a constant, K12 is a sigmoid function of pH. When the pH is much greater than both pKa values, K12 becomes K12,h. When the pH is much smaller than both pKa values, K12 reaches its lower bound, K12,h·10^(pKa,1 - pKa,2). In the case of Asp119, the pKa value of conformation 1 is minus infinity when the Asp119 CG to Arg125 CZ distance is used to distinguish the two conformations. The ratios K12/K12,h from both the analytical derivation and the actual simulations are plotted in Figure 5-16. Close agreement between the K12/K12,h plots generated from the simulations and from the conformation-protonation equilibrium model is seen in Figure 5-16A. The agreement shows that the model can represent the conformational equilibrium in our constant pH REMD simulations, so further use of the model is possible.
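The pH dependence of the conformational equilibrium constant can be checked numerically. The sketch below assumes the form K12 = K12,h·(1 + 10^(pKa,1 - pH)) / (1 + 10^(pKa,2 - pH)), which follows from the substitution described above with K12 taken as the ratio of the total populations of conformations 1 and 2. The function name `k12_ratio` is hypothetical, and a pKa of minus infinity is represented by Python's floating-point negative infinity.

```python
def k12_ratio(ph, pka1, pka2):
    """Return K12 / K12,h at a given pH for conformational pKa values pka1 and pka2."""
    numerator = 1.0 + 10.0 ** (pka1 - ph)
    denominator = 1.0 + 10.0 ** (pka2 - ph)
    return numerator / denominator


# Asp119-like case from the text: pKa,1 is minus infinity and pKa,2 is 3.3.
MINUS_INFINITY = float("-inf")
for ph in (1.0, 2.0, 3.3, 5.0, 7.0):
    print(f"pH {ph:4.1f}  K12/K12,h = {k12_ratio(ph, MINUS_INFINITY, 3.3):.4f}")
```

With pKa,1 at minus infinity the ratio reduces to 1 / (1 + 10^(pKa,2 - pH)), so it passes through 0.5 exactly at pH = pKa,2. For two finite pKa values the low-pH limit is 10^(pKa,1 - pKa,2) rather than zero.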
Different pKa,1 and pKa,2 values are also used to test how the two pKa values affect the shape and the inflection point of the sigmoid function. According to Figures 5-16B, 5-16C and 5-16D, if the difference between pKa,1 and pKa,2 is large (greater than approximately 1 pH unit), the inflection point appears at a pH equal to pKa,2; pKa,1 affects the inflection point only when the difference is small. If we view a K12/K12,h plot as a titration curve whose inflection point is the pKa value, then the K12/K12,h plot yields a pKa value equal to pKa,2, which is 3.3 in the case of Asp119.

Figure 5-16. K12/K12,h as a function of pH and its dependence on pKa,1 and pKa,2.

Since the analytical forms of K12, pKa,1 and pKa,2 are known and the sum of all fractions is unity, we can work out the fraction of each species. Writing f_X for the fraction of species X, the analytical expressions are:

\[ f_{1d} = \frac{K_{12}}{K_{12} + 1}\cdot\frac{1}{1 + 10^{\,pK_{a,1} - pH}} \]   (5-7)

\[ f_{1p} = \frac{K_{12}}{K_{12} + 1}\cdot\frac{10^{\,pK_{a,1} - pH}}{1 + 10^{\,pK_{a,1} - pH}} \]   (5-8)

\[ f_{2d} = \frac{1}{K_{12} + 1}\cdot\frac{1}{1 + 10^{\,pK_{a,2} - pH}} \]   (5-9)

\[ f_{2p} = \frac{1}{K_{12} + 1}\cdot\frac{10^{\,pK_{a,2} - pH}}{1 + 10^{\,pK_{a,2} - pH}} \]   (5-10)

In our study of Asp119, pKa,1 is minus infinity, which makes [1p] equal to zero. K12,h is calculated as the average of all [1d]/[2d], which gives a K12,h of 1.6. Another K12,h of 1.8, which is the K12 at pH 5, is also tried. The fractions of each species from both the analytical formulas and the actual simulations are shown in Figure 5-17.

Figure 5-17. A) Fraction of each species as a function of pH (titration curves) obtained from the equations based on the conformation-protonation equilibrium. The effect of K12,h is tested. B) Comparison of titration curves derived from the actual simulations and from the equilibrium equations.

First, the plots of the fraction of 2p vs pH are almost identical for the two K12,h values. This means that although the fractions of 1d and 2d are affected, the sum of 1d and 2d is not.
Second, the titration curves derived from the analytical formulas and from the actual simulations agree with each other very well. The agreement among titration curves leads to similar pKa values. The analytical titration curves using the two K12,h values yield pKa values between 2.8 and 2.9 with negligible difference, and the titration curve from the actual simulations gives a pKa value of 3.0. The analysis demonstrates that the equilibrium model can represent the protonation equilibrium in our simulations.

5.5.8 Theoretical NMR Titration Curves

Since the model can be used to simplify the conformation-protonation equilibrium in our constant pH REMD simulations, it is interesting to know whether it has practical meaning. Reproducing experimental titration curves offers a good test. So, quantum mechanical calculations of NMR chemical shifts (δ) were performed, and their results are presented and discussed in this part. As shown earlier, the dynamics of Asp119 generates two conformations, depending on whether the Asp119-Arg125 electrostatic interaction is on or off. Our NMR calculations are based on the representative structure of each conformation, in the proper protonation state. Due to the size of the HEWL molecule, full quantum mechanical calculations are too expensive, so our first trial uses an Asp119 dipeptide. Chemical shifts of 1d, 2d and 2p are obtained, and the fractions of each species at different pH values are calculated using eq. 5-7, 5-9 and 5-10. At each pH value, the theoretical chemical shift used to make a titration curve is calculated as the population-weighted average of the chemical shifts of the species present:

\[ \delta(pH) = f_{1d}\,\delta_{1d} + f_{2d}\,\delta_{2d} + f_{2p}\,\delta_{2p} \]

where f_X is the fraction of species X at that pH and δ_X is its chemical shift. The chemical shifts of 1d, 2d and 2p are 2.17, 2.48 and 3.03 ppm, respectively, and the theoretical NMR titration curve is plotted in Figure 5-18. Compared with the experimental titration curve, the trend is correctly reproduced. At low pH, the theoretical and experimental chemical shifts agree well: 3.03 ppm versus 3.13 ppm.
However, the difference between the calculated and experimental high-pH chemical shifts is greater than 0.6 ppm. As a result, our calculated Δδ (δ at low pH minus δ at high pH) is 0.75 ppm, while the experimental difference is only 0.21 ppm.

Figure 5-18. Theoretical NMR chemical shifts as a function of pH. It is plotted to see whether the conformation-protonation equilibrium model can reproduce the experimental titration curve based on NMR chemical shift measurements.

The problem at high pH could be that a dipeptide cannot accurately represent Asp119 and its environment, especially since we know there is a strong Asp119-Arg125 Coulomb interaction. So a set of QM/MM calculations was conducted using the entire HEWL molecule. The new chemical shifts are 2.58, 2.69 and 3.25 ppm for 1d, 2d and 2p. Comparing the chemical shifts based on the dipeptide and on the entire molecule, the differences for 2p and 2d are 0.22 ppm and 0.21 ppm. More importantly, both 2p chemical shifts are close to the experimental low-pH value (each differs from it by about 0.1 ppm). The differences are small for 2p and 2d because Asp119 has no significant interactions in conformation 2. Unlike 2p or 2d, the chemical shift of 1d is improved by 0.41 ppm, showing that using the whole HEWL molecule changes the 1d chemical shift considerably. After applying the QM/MM method to the entire HEWL, the calculated Δδ (δ at low pH minus δ at high pH) becomes 0.63 ppm. The theoretical titration curve from the QM/MM calculations is also displayed in Figure 5-18. No matter whether a dipeptide or the entire HEWL is used in the NMR calculations, the pKa values are around 2.9, as expected. NMR titration curves yield the same pKa value as the protonation (deprotonation) fraction vs pH does. The NMR titration curve calculations validate the use of the conformation-protonation equilibrium model and confirm its applicability. This model can be used to simplify many analyses involving further calculations.
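The species fractions and the theoretical titration curve of this section can be reproduced with a short script. The sketch below is an illustration under stated assumptions, not the code used in this work: the function names are hypothetical, the conformational populations are taken as K12/(K12 + 1) and 1/(K12 + 1), each conformation is split into protonated and deprotonated forms by its own pKa, and the chemical shift is assumed to be the population-weighted average of the species shifts. The numerical inputs (pKa,2 = 3.3, K12,h = 1.8 and the chemical shifts) are the values quoted in the text.

```python
import math


def species_fractions(ph, pka1, pka2, k12_high):
    """Fractions of 1d, 1p, 2d and 2p from the conformation-protonation model."""
    x1 = 10.0 ** (pka1 - ph)  # [1p]/[1d]
    x2 = 10.0 ** (pka2 - ph)  # [2p]/[2d]
    k12 = k12_high * (1.0 + x1) / (1.0 + x2)
    conf1 = k12 / (k12 + 1.0)
    conf2 = 1.0 / (k12 + 1.0)
    return {
        "1d": conf1 / (1.0 + x1),
        "1p": conf1 * x1 / (1.0 + x1),
        "2d": conf2 / (1.0 + x2),
        "2p": conf2 * x2 / (1.0 + x2),
    }


def chemical_shift(ph, shifts, pka1, pka2, k12_high):
    """Population-weighted chemical shift; species with zero population are skipped."""
    fractions = species_fractions(ph, pka1, pka2, k12_high)
    return sum(f * shifts[s] for s, f in fractions.items() if f > 0.0)


PKA1 = float("-inf")  # conformation 1 is never protonated
PKA2 = 3.3
K12_HIGH = 1.8

dipeptide_shifts = {"1d": 2.17, "2d": 2.48, "2p": 3.03}  # ppm
qmmm_shifts = {"1d": 2.58, "2d": 2.69, "2p": 3.25}       # ppm


def delta_low_minus_high(shifts, low_ph=-2.0, high_ph=12.0):
    """Difference between the limiting low-pH and high-pH chemical shifts."""
    low = chemical_shift(low_ph, shifts, PKA1, PKA2, K12_HIGH)
    high = chemical_shift(high_ph, shifts, PKA1, PKA2, K12_HIGH)
    return low - high


delta_dipeptide = delta_low_minus_high(dipeptide_shifts)
delta_qmmm = delta_low_minus_high(qmmm_shifts)

# With pKa,1 at minus infinity, the protonated fraction reaches 0.5 at this pH.
apparent_pka = PKA2 - math.log10(1.0 + K12_HIGH)

print(f"dipeptide delta(low pH) - delta(high pH): {delta_dipeptide:.2f} ppm")
print(f"QM/MM delta(low pH) - delta(high pH): {delta_qmmm:.2f} ppm")
print(f"apparent pKa: {apparent_pka:.2f}")
```

Under these assumptions the limiting differences come out close to the 0.75 and 0.63 ppm quoted above. When pKa,1 is minus infinity, the apparent pKa of the whole residue is pKa,2 - log10(1 + K12,h), which lies between 2.8 and 2.9 for K12,h values of 1.6 and 1.8.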
5.6 Conclusions

In this chapter, constant pH REMD simulations are performed to study the pKa values of hen egg white lysozyme. Three sets of constant pH REMD simulations have been performed: one set is conducted without a restraining potential, while a harmonic potential is applied to the Cα atoms in the other two sets. The force constants of the two harmonic potentials are 1 and 0.1 kcal/mol·Å², respectively, so that the effect of restraint strength on pKa prediction accuracy can be studied. In our constant pH REMD simulations, the unrestrained ones are found to be structurally unstable. The Cα atom RMSD relative to the crystal structure can be as high as 18 Å. Due to the restraining potential, HEWL in a restrained simulation is stable and similar to the crystal structure, according to the Cα atom RMSD values.

In the restrained simulations with a force constant of 1 kcal/mol·Å², accurate pKa predictions are achieved. The overall RMS errors between predicted and experimental pKa values are 0.87 and 0.84, depending on the pKa calculation method. Unfortunately, those two RMS errors are not better than the constant pH MD results obtained by Mongan et al. The advantage of incorporating the REMD method is not observed. However, a plot of the RMS error as a function of pH yields the smallest RMS error at pH 4.5, the pH at which the crystal structure was resolved. Supported by the work of Mongan et al., we propose that the further the pH is from the crystal pH, the stronger the biasing effect of the restraining potential. The bias in conformational sampling in turn affects the pKa predictions. As expected, reducing the strength of the harmonic potential results in improved pKa predictions. Likewise, the smallest pKa RMS error, 0.62, is obtained at pH 4.5 in the weakly restrained constant pH REMD simulations. An RMS error of 0.62 is among the best pKa predictions generated from constant pH simulations.
The pKa predictions of the catalytic ionizable residues are of particular interest in the case of HEWL. Constant pH REMD simulations with the stronger restraining potential failed to identify the proton donor under the criteria proposed by Nielsen and McCammon in 2003. The weakly restrained constant pH REMD simulations are able to predict the proton donor and the nucleophile, although the errors of the predicted pKa values of Glu35 and Asp52 are among the largest in our simulations. Hydrogen bonding is found to be responsible for the large error of Asp52. The hydrogen bonding of Asp52 with Asn44 and Asn59 over-stabilizes the deprotonated form of Asp52, making the pKa value of Asp52 too small. For Glu35, conformational sampling also plays a role in underestimating its pKa value. However, other factors such as the use of implicit solvent may affect the pKa prediction of Glu35 too.

In this work, we also focused on the conformation and protonation equilibrium in constant pH REMD simulations. Correlations between protonation and the side chain dihedral angles χ1 and χ2 are studied. Another representation of conformations, namely whether an important electrostatic interaction is formed or not, is also adopted. In both cases, the coupling between conformation and protonation is observed. The effect of conformation-protonation coupling is partially reflected by the comparison between the constant pH and single-structure FDPB algorithms. Constant pH REMD yields better pKa values because more conformational space is visited. The conformation-protonation equilibrium is further studied. Equilibrium constants between conformations are derived in order to show how pH affects the conformational equilibrium. The conformational equilibrium constant is shown to be pH dependent, and it is a sigmoid function of pH. The shape of the sigmoidal function is influenced by the pKa values of each conformation.
Titration curves, which are the means to obtain pKa values, are also derived from the conformation-protonation equilibrium. All analytical results are in good agreement with our simulations. In addition, we apply this conformation-protonation equilibrium to reproduce the experimental NMR titration curve by carrying out full QM and QM/MM calculations. First, we showed the importance of the protein environment to chemical shift calculations. A calculation using an isolated ionizable side chain can only qualitatively reproduce the experimental NMR titration curve. The error mainly comes from the high-pH end, where the isolated side chain assumption fails. After adding the protein environment, our theoretical titration curve is greatly improved and good agreement with the experimental result is obtained. Our conformation-protonation equilibrium model can be used to represent our simulations and will simplify further calculations.

LIST OF REFERENCES

(1) Bettelheim, F. A. Introduction to general, organic, and biochemistry; 8th ed.; Thomson Brooks/Cole: Belmont, CA, 2007.
(2) Dey, A.; Verma, C. S.; Lane, D. P. Br. J. Cancer 2008, 98, 4-8.
(3) Vogelstein, B.; Lane, D.; Levine, A. J. Nature 2000, 408, 307-310.
(4) Matthew, J. B.; Gurd, F. R. N.; Garciamoreno, E. B.; Flanagan, M. A.; March, K. L.; Shire, S. J. Crc Cr. Rev. Biochem. 1985, 18, 91-197.
(5) Bierzynski, A.; Kim, P. S.; Baldwin, R. L. Proc. Natl. Acad. Sci. U. S. A. 1982, 79, 2470-2474.
(6) Ferguson, N.; Schartau, P. J.; Sharpe, T. D.; Sato, S.; Fersht, A. R. J. Mol. Biol. 2004, 344, 295-301.
(7) Shoemaker, K. R.; Kim, P. S.; Brems, D. N.; Marqusee, S.; York, E. J.; Chaiken, I. M.; Stewart, J. M.; Baldwin, R. L. Proc. Natl. Acad. Sci. U. S. A. 1985, 82, 2349-2353.
(8) Garcia-Mira, M. M.; Sadqi, M.; Fischer, N.; Sanchez-Ruiz, J. M.; Munoz, V. Science 2002, 298, 2191-2195.
(9) Hunenberger, P. H.; Helms, V.; Narayana, N.; Taylor, S. S.; McCammon, J. A. Biochemistry 1999, 38, 2358-2366.
(10) Demchuk, E.; Genick, U. K.; Woo, T. T.; Getzoff, E.
D.; Bashford, D. Biochemistry 2000, 39, 1100-1113.
(11) Dillet, V.; Dyson, H. J.; Bashford, D. Biochemistry 1998, 37, 10298-10306.
(12) Harris, T. K.; Turner, G. J. IUBMB Life 2002, 53, 85-98.
(13) Laidler, K. J. Chemical kinetics; 3rd ed.; Harper & Row: New York, 1987.
(14) Fersht, A. Structure and mechanism in protein science: a guide to enzyme catalysis and protein folding; W.H. Freeman: New York, 1999.
(15) Simonson, T.; Carlsson, J.; Case, D. A. J. Am. Chem. Soc. 2004, 126, 4167-4180.
(16) Lee, A. C.; Crippen, G. M. J. Chem. Inf. Model. 2009, 49, 2013-2033.
(17) Langsetmo, K.; Fuchs, J. A.; Woodward, C. Biochemistry 1991, 30, 7603-7609.
(18) Garcia-Moreno, B.; Dwyer, J. J.; Gittis, A. G.; Lattman, E. E.; Spencer, D. S.; Stites, W. E. Biophys. Chem. 1997, 64, 211-224.
(19) Garcia-Moreno, B.; Fitch, C.; Karp, D.; Gittis, A.; Lattman, E. Biophys. J. 2002, 82, 300a-300a.
(20) Tanford, C. Adv. Protein Chem. 1962, 17, 69-165.
(21) Dwyer, J. J.; Gittis, A. G.; Karp, D. A.; Lattman, E. E.; Spencer, D. S.; Stites, W. E.; Garcia-Moreno, B. Biophys. J. 2000, 79, 1610-1620.
(22) Harms, M. J.; Castaneda, C. A.; Schlessman, J. L.; Sue, G. R.; Isom, D. G.; Cannon, B. R.; Garcia-Moreno, B. J. Mol. Biol. 2009, 389, 34-47.
(23) Mehler, E. L.; Fuxreiter, M.; Simon, I.; Garcia-Moreno, E. B. Proteins: Struct., Funct., Genet. 2002, 48, 283-292.
(24) Anderson, D. E.; Becktel, W. J.; Dahlquist, F. W. Biochemistry 1990, 29, 2403-2408.
(25) Dyson, H. J.; Jeng, M. F.; Tennant, L. L.; Slaby, I.; Lindell, M.; Cui, D. S.; Kuprin, S.; Holmgren, A. Biochemistry 1997, 36, 2622-2636.
(26) Bashford, D.; Case, D. A.; Dalvit, C.; Tennant, L.; Wright, P. E. Biochemistry 1993, 32, 8045-8056.
(27) Wang, Y. X.; Freedberg, D. I.; Yamazaki, T.; Wingfield, P. T.; Stahl, S. J.; Kaufman, J. D.; Kiso, Y.; Torchia, D. A. Biochemistry 1996, 35, 9945-9950.
(28) Dyson, H. J.; Tennant, L. L.; Holmgren, A. Biochemistry 1991, 30, 4262-4268.
(29) Jeng, M. F.; Dyson, H. J. Biochemistry 1996, 35, 1-6.
(30) Wilson, N.
A.; Barbar, E.; Fuchs, J. A.; Woodward, C. Biochemistry 1995, 34, 8931-8939.
(31) Callis, P. R. Methods Enzymol. 1997, 278, 113-150.
(32) Callis, P. R.; Burgess, B. K. J. Phys. Chem. B 1997, 101, 9429-9432.
(33) Vivian, J. T.; Callis, P. R. Biophys. J. 2001, 80, 2093-2109.
(34) Inoue, M.; Yamada, H.; Yasukochi, T.; Kuroki, R.; Miki, T.; Horiuchi, T.; Imoto, T. Biochemistry 1992, 31, 5545-5553.
(35) Kajander, T.; Kahn, P. C.; Passila, S. H.; Cohen, D. C.; Lehtio, L.; Adolfsen, W.; Warwicker, J.; Schell, U.; Goldman, A. Structure 2000, 8, 1203-1214.
(36) Bartlett, G. J.; Porter, C. T.; Borkakoti, N.; Thornton, J. M. J. Mol. Biol. 2002, 324, 105-121.
(37) Jiang, Y. X.; Ruta, V.; Chen, J. Y.; Lee, A.; MacKinnon, R. Nature 2003, 423, 42-48.
(38) Luecke, H.; Richter, H. T.; Lanyi, J. K. Science 1998, 280, 1934-1937.
(39) Bashford, D.; Case, D. A. Annu. Rev. Phys. Chem. 2000, 51, 129-152.
(40) Still, W. C.; Tempczyk, A.; Hawley, R. C.; Hendrickson, T. J. Am. Chem. Soc. 1990, 112, 6127-6129.
(41) Cramer, C. J. Essentials of computational chemistry: theories and models; J. Wiley: West Sussex, England; New York, 2002.
(42) Raha, K.; Merz, K. M. In Annual reports in computational chemistry; Spellmeyer, D. C., Ed.; Elsevier: Amsterdam; Boston, 2005; Vol. 1, pp 113-130.
(43) Dixon, S. L.; Merz, K. M. J. Chem. Phys. 1996, 104, 6643-6649.
(44) Vreven, T.; Morokuma, K. In Annual Reports in Computational Chemistry; Spellmeyer, D., Ed.; Elsevier: Amsterdam; Boston, 2006; Vol. 2, pp 35-51.
(45) Field, M. J.; Bash, P. A.; Karplus, M. J. Comput. Chem. 1990, 11, 700-733.
(46) Singh, U. C.; Kollman, P. A. J. Comput. Chem. 1986, 7, 718-730.
(47) Warshel, A.; Levitt, M. J. Mol. Biol. 1976, 103, 227-249.
(48) Kamerlin, S. C. L.; Haranczyk, M.; Warshel, A. J. Phys. Chem. B 2009, 113, 1253-1272.
(49) Monard, G.; Merz, K. M. Acc. Chem. Res. 1999, 32, 904-911.
(50) Metropolis, N.; Rosenbluth, A. W.; Rosenbluth, M. N.; Teller, A. H.; Teller, E. J. Chem. Phys. 1953, 21, 1087-1092.
(51) Wolynes, P.
G.; Onuchic, J. N.; Thirumalai, D. Science 1995, 267, 1619-1620.
(52) Itoh, S. G.; Okumura, H.; Okamoto, Y. Mol. Simul. 2007, 33, 47-56.
(53) Mitsutake, A.; Sugita, Y.; Okamoto, Y. Biopolymers 2001, 60, 96-123.
(54) Berg, B. A.; Neuhaus, T. Phys. Lett. B 1991, 267, 249-253.
(55) Berg, B. A.; Neuhaus, T. Phys. Rev. Lett. 1992, 68, 9-12.
(56) Lyubartsev, A. P.; Martsinovski, A. A.; Shevkunov, S. V.; Vorontsovvelyaminov, P. N. J. Chem. Phys. 1992, 96, 1776-1783.
(57) Marinari, E.; Parisi, G. Europhys. Lett. 1992, 19, 451-458.
(58) Hansmann, U. H. E. Chem. Phys. Lett. 1997, 281, 140-150.
(59) Swendsen, R. H.; Wang, J. S. Phys. Rev. Lett. 1986, 57, 2607-2609.
(60) Earl, D. J.; Deem, M. W. Phys. Chem. Chem. Phys. 2005, 7, 3910-3916.
(61) Fukunishi, H.; Watanabe, O.; Takada, S. J. Chem. Phys. 2002, 116, 9058-9067.
(62) Sugita, Y.; Okamoto, Y. Chem. Phys. Lett. 1999, 314, 141-151.
(63) Tanford, C.; Kirkwood, J. G. J. Am. Chem. Soc. 1957, 79, 5333-5339.
(64) Tanford, C.; Roxby, R. Biochemistry 1972, 11, 2192-2198.
(65) Bashford, D.; Karplus, M. Biochemistry 1990, 29, 10219-10225.
(66) Gilson, M. K. Proteins: Struct., Funct., Genet. 1993, 15, 266-282.
(67) Antosiewicz, J.; McCammon, J. A.; Gilson, M. K. J. Mol. Biol. 1994, 238, 415-436.
(68) Antosiewicz, J.; McCammon, J. A.; Gilson, M. K. Biochemistry 1996, 35, 7819-7833.
(69) Bashford, D.; Karplus, M. J. Phys. Chem. 1991, 95, 9556-9561.
(70) Yang, A. S.; Gunner, M. R.; Sampogna, R.; Sharp, K.; Honig, B. Proteins: Struct., Funct., Genet. 1993, 15, 252-265.
(71) Yang, A. S.; Honig, B. J. Mol. Biol. 1993, 231, 459-474.
(72) Madura, J. D.; Briggs, J. M.; Wade, R. C.; Davis, M. E.; Luty, B. A.; Ilin, A.; Antosiewicz, J.; Gilson, M. K.; Bagheri, B.; Scott, L. R.; McCammon, J. A. Comput. Phys. Commun. 1995, 91, 57-95.
(73) Nicholls, A.; Honig, B. J. Comput. Chem. 1991, 12, 435-445.
(74) Beroza, P.; Fredkin, D. R.; Okamura, M. Y.; Feher, G. Proc. Natl. Acad. Sci. U. S. A. 1991, 88, 5804-5808.
(75) Bone, S.; Pethig, R. J. Mol. Biol. 1985, 181, 323-326.
(76) Harvey, S. C.; Hoekstra, P. J. Phys. Chem. 1972, 76, 2987-&.
(77) Garcia-Moreno, B.; Fitch, C. A. Methods Enzymol. 2004, 380, 20-51.
(78) Simonson, T.; Brooks, C. L. J. Am. Chem. Soc. 1996, 118, 8452-8458.
(79) Mehler, E. L.; Eichele, G. Biochemistry 1984, 23, 3887-3891.
(80) Mehler, E. L.; Guarnieri, F. Biophys. J. 1999, 77, 3-22.
(81) Alexov, E. G.; Gunner, M. R. Biophys. J. 1997, 72, 2075-2093.
(82) Barth, P.; Alber, T.; Harbury, P. B. Proc. Natl. Acad. Sci. U. S. A. 2007, 104, 4898-4903.
(83) Georgescu, R. E.; Alexov, E. G.; Gunner, M. R. Biophys. J. 2002, 83, 1731-1748.
(84) Gunner, M. R.; Alexov, E.; Torres, E.; Lipovaca, S. J. Biol. Inorg. Chem. 1997, 2, 126-134.
(85) Livesay, D. R.; Jacobs, D. J.; Kanjanapangka, J.; Chea, E.; Cortez, H.; Garcia, J.; Kidd, P.; Marquez, M. P.; Pande, S.; Yang, D. J. Chem. Theory Comput. 2006, 2, 927-938.
(86) You, T. J.; Bashford, D. Biophys. J. 1995, 69, 1721-1733.
(87) Kollman, P. Chem. Rev. 1993, 93, 2395-2417.
(88) Straatsma, T. P.; McCammon, J. A. Annu. Rev. Phys. Chem. 1992, 43, 407-435.
(89) Warshel, A.; Sussman, F.; King, G. Biochemistry 1986, 25, 8368-8372.
(90) Russell, S. T.; Warshel, A. J. Mol. Biol. 1985, 185, 389-404.
(91) Jorgensen, W. L.; Briggs, J. M. J. Am. Chem. Soc. 1989, 111, 4190-4197.
(92) Merz, K. M. J. Am. Chem. Soc. 1991, 113, 3572-3575.
(93) Hu, H.; Yang, W. T. Annu. Rev. Phys. Chem. 2008, 59, 573-601.
(94) Li, G. H.; Zhang, X. D.; Cui, Q. J. Phys. Chem. B 2003, 107, 8643-8653.
(95) Riccardi, D.; Schaefer, P.; Cui, Q. J. Phys. Chem. B 2005, 109, 17715-17733.
(96) Bas, D. C.; Rogers, D. M.; Jensen, J. H. Proteins: Struct., Funct., Bioinf. 2008, 73, 765-783.
(97) Jensen, J. H.; Li, H.; Robertson, A. D.; Molina, P. A. J. Phys. Chem. A 2005, 109, 6634-6643.
(98) Li, H.; Hains, A. W.; Everts, J. E.; Robertson, A. D.; Jensen, J. H. J. Phys. Chem. B 2002, 106, 3486-3494.
(99) Li, H.; Robertson, A. D.; Jensen, J. H. Proteins: Struct., Funct., Bioinf. 2004, 55, 689-704.
(100) Li, H.; Robertson, A. D.; Jensen, J. H.
Proteins: Struct., Funct., Bioinf. 2005, 61, 704-721.
(101) Minikis, R. M.; Kairys, V.; Jensen, J. H. J. Phys. Chem. A 2001, 105, 3829-3837.
(102) Day, P. N.; Jensen, J. H.; Gordon, M. S.; Webb, S. P.; Stevens, W. J.; Krauss, M.; Garmer, D.; Basch, H.; Cohen, D. J. Chem. Phys. 1996, 105, 1968-1986.
(103) Gordon, M. S.; Freitag, M. A.; Bandyopadhyay, P.; Jensen, J. H.; Kairys, V.; Stevens, W. J. J. Phys. Chem. A 2001, 105, 293-307.
(104) Mongan, J.; Case, D. A. Curr. Opin. Struct. Biol. 2005, 15, 157-163.
(105) Baptista, A. M. J. Chem. Phys. 2002, 116, 7766-7768.
(106) Baptista, A. M.; Martel, P. J.; Petersen, S. B. Proteins: Struct., Funct., Genet. 1997, 27, 523-544.
(107) Borjesson, U.; Hunenberger, P. H. J. Chem. Phys. 2001, 114, 9706-9719.
(108) Borjesson, U.; Hunenberger, P. H. J. Phys. Chem. B 2004, 108, 13551-13559.
(109) Khandogin, J.; Brooks, C. L. Biophys. J. 2005, 89, 141-157.
(110) Khandogin, J.; Brooks, C. L. Biochemistry 2006, 45, 9363-9373.
(111) Khandogin, J.; Brooks, C. L. Proc. Natl. Acad. Sci. U. S. A. 2007, 104, 16880-16885.
(112) Khandogin, J.; Chen, J. H.; Brooks, C. L. Proc. Natl. Acad. Sci. U. S. A. 2006, 103, 18546-18550.
(113) Khandogin, J.; Raleigh, D. P.; Brooks, C. L. J. Am. Chem. Soc. 2007, 129, 3056-3057.
(114) Lee, M. S.; Salsbury, F. R.; Brooks, C. L. Proteins: Struct., Funct., Bioinf. 2004, 56, 738-752.
(115) Mertz, J. E.; Pettitt, B. M. Int. J. Supercomp. Appl. 1994, 8, 47-53.
(116) Kong, X. J.; Brooks, C. L. J. Chem. Phys. 1996, 105, 2414-2423.
(117) Chen, J. H.; Brooks, C. L.; Khandogin, J. Curr. Opin. Struct. Biol. 2008, 18, 140-148.
(118) Baptista, A. M.; Teixeira, V. H.; Soares, C. M. J. Chem. Phys. 2002, 117, 4184-4200.
(119) Dlugosz, M.; Antosiewicz, J. M. Chem. Phys. 2004, 302, 161-170.
(120) Dlugosz, M.; Antosiewicz, J. M. J. Phys. Chem. B 2005, 109, 13777-13784.
(121) Dlugosz, M.; Antosiewicz, J. M. J. Phys.: Condens. Matter 2005, 17, S1607-S1616.
(122) Dlugosz, M.; Antosiewicz, J. M.; Robertson, A. D. Phys. Rev. E 2004, 69, 021915.
(123) Machuqueiro, M.; Baptista, A. M. J. Phys. Chem. B 2006, 110, 2927-2933.
(124) Machuqueiro, M.; Baptista, A. M. Biophys. J. 2007, 92, 1836-1845.
(125) Machuqueiro, M.; Baptista, A. M. Proteins: Struct., Funct., Bioinf. 2008, 72, 289-298.
(126) Machuqueiro, M.; Baptista, A. M. J. Am. Chem. Soc. 2009, 131, 12586-12594.
(127) Mongan, J.; Case, D. A.; McCammon, J. A. J. Comput. Chem. 2004, 25, 2038-2048.
(128) Walczak, A. M.; Antosiewicz, J. M. Phys. Rev. E 2002, 66, 051911.
(129) Williams, S. L.; de Oliveira, C. A. F.; McCammon, J. A. J. Chem. Theory Comput. 2010, 6, 560-568.
(130) Burgi, R.; Kollman, P. A.; van Gunsteren, W. F. Proteins: Struct., Funct., Genet. 2002, 47, 469-480.
(131) Meng, Y. L.; Roitberg, A. E. J. Chem. Theory Comput. 2010, 6, 1401-1412.
(132) Schaefer, M.; Karplus, M. J. Phys. Chem. 1996, 100, 1578-1599.
(133) Hamelberg, D.; Mongan, J.; McCammon, J. A. J. Chem. Phys. 2004, 120, 11919-11929.
(134) Hamelberg, D.; Mongan, J.; McCammon, J. A. Protein Sci. 2004, 13, 76-76.
(135) Ponder, J. W.; Case, D. A. Adv. Protein Chem. 2003, 66, 27-85.
(136) Allinger, N. L.; Yuh, Y. H.; Lii, J. H. J. Am. Chem. Soc. 1989, 111, 8551-8566.
(137) Leach, A. R. Molecular modelling: principles and applications; 2nd ed.; Prentice Hall: Harlow, England; New York, 2001.
(138) MacKerell, A. D. In Annual reports in computational chemistry; Spellmeyer, D. C., Ed.; Elsevier: Amsterdam; Boston, 2005; Vol. 1, pp 91-102.
(139) Hornak, V.; Abel, R.; Okur, A.; Strockbine, B.; Roitberg, A.; Simmerling, C. Proteins: Struct., Funct., Bioinf. 2006, 65, 712-725.
(140) MacKerell, A. D.; Bashford, D.; Bellott, M.; Dunbrack, R. L.; Evanseck, J. D.; Field, M. J.; Fischer, S.; Gao, J.; Guo, H.; Ha, S.; Joseph-McCarthy, D.; Kuchnir, L.; Kuczera, K.; Lau, F. T. K.; Mattos, C.; Michnick, S.; Ngo, T.; Nguyen, D. T.; Prodhom, B.; Reiher, W. E.; Roux, B.; Schlenkrich, M.; Smith, J. C.; Stote, R.; Straub, J.; Watanabe, M.; Wiorkiewicz-Kuczera, J.; Yin, D.; Karplus, M. J. Phys. Chem. B 1998, 102, 3586-3616.
(141) Daura, X.; Mark, A. E.; van Gunsteren, W. F. J. Comput. Chem. 1998, 19, 535-547.
(142) Jorgensen, W. L.; Tirado-Rives, J. J. Am. Chem. Soc. 1988, 110, 1657-1666.
(143) Cornell, W. D.; Cieplak, P.; Bayly, C. I.; Gould, I. R.; Merz, K. M.; Ferguson, D. M.; Spellmeyer, D. C.; Fox, T.; Caldwell, J. W.; Kollman, P. A. J. Am. Chem. Soc. 1995, 117, 5179-5197.
(144) Verlet, L. Phys. Rev. 1967, 159, 98.
(145) Ryckaert, J. P.; Ciccotti, G.; Berendsen, H. J. C. J. Comput. Phys. 1977, 23, 327-341.
(146) Berendsen, H. J. C.; Postma, J. P. M.; van Gunsteren, W. F.; Dinola, A.; Haak, J. R. J. Chem. Phys. 1984, 81, 3684-3690.
(147) McQuarrie, D. A. Statistical thermodynamics; University Science Books: Mill Valley, Calif., 1973.
(148) Nose, S. J. Chem. Phys. 1984, 81, 511-519.
(149) Berendsen, H. J. C.; Grigera, J. R.; Straatsma, T. P. J. Phys. Chem. 1987, 91, 6269-6271.
(150) Jorgensen, W. L.; Chandrasekhar, J.; Madura, J. D.; Impey, R. W.; Klein, M. L. J. Chem. Phys. 1983, 79, 926-935.
(151) Mahoney, M. W.; Jorgensen, W. L. J. Chem. Phys. 2000, 112, 8910-8922.
(152) Allen, M. P.; Tildesley, D. J. Computer simulation of liquids; Clarendon Press; Oxford University Press: Oxford [England]; New York, 1987.
(153) Ewald, P. P. Annalen Der Physik 1921, 64, 253-287.
(154) Darden, T.; York, D.; Pedersen, L. J. Chem. Phys. 1993, 98, 10089-10092.
(155) Hawkins, G. D.; Cramer, C. J.; Truhlar, D. G. Chem. Phys. Lett. 1995, 246, 122-129.
(156) Kirkwood, J. G. J. Chem. Phys. 1935, 3, 300-313.
(157) Straatsma, T. P.; McCammon, J. A. J. Chem. Phys. 1991, 95, 1175-1188.
(158) Zwanzig, R. W. J. Chem. Phys. 1954, 22, 1420-1426.
(159) Bennett, C. H. J. Comput. Phys. 1976, 22, 245-268.
(160) Shirts, M. R.; Chodera, J. D. J. Chem. Phys. 2008, 129, 124105.
(161) Jorgensen, W. L.; Ravimohan, C. J. Chem. Phys. 1985, 83, 3050-3054.
(162) Hansmann, U. H. E.; Okamoto, Y. Nucl. Phys. B 1995, 914-916.
(163) Wang, F. G.; Landau, D. P. Phys. Rev. E 2001, 64, 056101.
(164) Wang, F. G.; Landau, D. P. Phys. Rev. Lett.
2001, 86, 2050-2053.
(165) Falcioni, M.; Deem, M. W. J. Chem. Phys. 1999, 110, 1754-1766.
(166) Kofke, D. A. J. Chem. Phys. 2002, 117, 6911-6914.
(167) Liu, P.; Kim, B.; Friesner, R. A.; Berne, B. J. Proc. Natl. Acad. Sci. U. S. A. 2005, 102, 13749-13754.
(168) Li, H. Z.; Li, G. H.; Berg, B. A.; Yang, W. J. Chem. Phys. 2006, 125, 144902.
(169) Okur, A.; Roe, D. R.; Cui, G. L.; Hornak, V.; Simmerling, C. J. Chem. Theory Comput. 2007, 3, 557-568.
(170) Roitberg, A. E.; Okur, A.; Simmerling, C. J. Phys. Chem. B 2007, 111, 2415-2418.
(171) Rathore, N.; Chopra, M.; de Pablo, J. J. J. Chem. Phys. 2005, 122, 024111.
(172) Sanbonmatsu, K. Y.; Garcia, A. E. Proteins: Struct., Funct., Genet. 2002, 46, 225-234.
(173) Kone, A.; Kofke, D. A. J. Chem. Phys. 2005, 122, 206101.
(174) Trebst, S.; Troyer, M.; Hansmann, U. H. E. J. Chem. Phys. 2006, 124, 174903.
(175) Nadler, W.; Hansmann, U. H. E. Phys. Rev. E 2007, 76, 065701.
(176) Nadler, W.; Hansmann, U. H. E. Phys. Rev. E 2007, 75, 026109.
(177) Nadler, W.; Hansmann, U. H. E. J. Phys. Chem. B 2008, 112, 10386-10387.
(178) Opps, S. B.; Schofield, J. Phys. Rev. E 2001, 6305, 056701.
(179) Zhang, W.; Wu, C.; Duan, Y. J. Chem. Phys. 2005, 123, 154105.
(180) Sindhikara, D.; Meng, Y. L.; Roitberg, A. E. J. Chem. Phys. 2008, 128, 024103.
(181) Abraham, M. J.; Gready, J. E. J. Chem. Theory Comput. 2008, 4, 1119-1128.
(182) Zhang, C.; Ma, J. P. J. Chem. Phys. 2008, 129, 134112.
(183) Rosta, E.; Buchete, N. V.; Hummer, G. J. Chem. Theory Comput. 2009, 5, 1393-1399.
(184) Zhou, R. H.; Berne, B. J.; Germain, R. Proc. Natl. Acad. Sci. U. S. A. 2001, 98, 14931-14936.
(185) Lyman, E.; Ytreberg, F. M.; Zuckerman, D. M. Phys. Rev. Lett. 2006, 96, 028105.
(186) Liu, P.; Shi, Q.; Lyman, E.; Voth, G. A. J. Chem. Phys. 2008, 129, 114103.
(187) Liu, P.; Voth, G. A. J. Chem. Phys. 2007, 126, 045106.
(188) Okur, A.; Wickstrom, L.; Layten, M.; Geney, R.; Song, K.; Hornak, V.; Simmerling, C. J. Chem. Theory Comput. 2006, 2, 420-433.
(189) Ballard, A. J.; Jarzynski, C. Proc.
Natl. Acad. Sci. U. S. A. 2009, 106, 12224-12229.
(190) Kamberaj, H.; van der Vaart, A. J. Chem. Phys. 2009, 130, 074906.
(191) Nguyen, P. H. J. Chem. Phys. 2010, 132, 144109.
(192) Sugita, Y.; Okamoto, Y. Chem. Phys. Lett. 2000, 329, 261-270.
(193) Mitsutake, A.; Okamoto, Y. Chem. Phys. Lett. 2000, 332, 131-138.
(194) Mitsutake, A.; Okamoto, Y. J. Chem. Phys. 2004, 121, 2491-2504.
(195) Andrec, M.; Felts, A. K.; Gallicchio, E.; Levy, R. M. Proc. Natl. Acad. Sci. U. S. A. 2005, 102, 6801-6806.
(196) van der Spoel, D.; Seibert, M. M. Phys. Rev. Lett. 2006, 96, 238102.
(197) Yang, S. C.; Onuchic, J. N.; Garcia, A. E.; Levine, H. J. Mol. Biol. 2007, 372, 756-763.
(198) Buchete, N. V.; Hummer, G. Phys. Rev. E 2008, 77, 030902.
(199) Case, D. A.; Darden, T. A.; T.E. Cheatham, I.; Simmerling, C. L.; Wang, J.; Duke, R. E.; Luo, R.; Crowley, M.; Walker, R. C.; Zhang, W.; Merz, K. M.; B.Wang; Hayik, S.; Roitberg, A.; Seabra, G.; Kolossvry, I.; K.F.Wong; Paesani, F.; Vanicek, J.; X.Wu; Brozell, S. R.; Steinbrecher, T.; Gohlke, H.; Yang, L.; Tan, C.; Mongan, J.; Hornak, V.; Cui, G.; Mathews, D. H.; Seetin, M. G.; Sagui, C.; Babin, V.; Kollman, P. A.; University of California, San Francisco: San Francisco, 2008.
(200) Onufriev, A.; Bashford, D.; Case, D. A. J. Phys. Chem. B 2000, 104, 3712-3720.
(201) Elber, R.; Roitberg, A.; Simmerling, C.; Goldstein, R.; Li, H. Y.; Verkhivker, G.; Keasar, C.; Zhang, J.; Ulitsky, A. Comput. Phys. Commun. 1995, 91, 159-189.
(202) Dill, K. A.; Ozkan, S. B.; Shell, M. S.; Weikl, T. R. Annu. Rev. Biophys. 2008, 37, 289-316.
(203) Dobson, C. M. Nature 2003, 426, 884-890.
(204) Anfinsen, C. B.; Haber, E.; Sela, M.; White, F. H. Proc. Natl. Acad. Sci. U. S. A. 1961, 47, 1309-1314.
(205) Mayor, U.; Johnson, C. M.; Daggett, V.; Fersht, A. R. Proc. Natl. Acad. Sci. U. S. A. 2000, 97, 13518-13522.
(206) Snow, C. D.; Nguyen, N.; Pande, V. S.; Gruebele, M. Nature 2002, 420, 102-106.
(207) Brooks, C. L. Acc. Chem. Res. 2002, 35, 447-454.
(208) Levinthal, C. J. Chim. Phys. Phys.
Chim. Biol. 1968, 65, 44-45.
(209) Gruebele, M. Annu. Rev. Phys. Chem. 1999, 50, 485-516.
(210) Kubelka, J.; Hofrichter, J.; Eaton, W. A. Curr. Opin. Struct. Biol. 2004, 14, 76-88.
(211) Snow, C. D.; Sorin, E. J.; Rhee, Y. M.; Pande, V. S. Annu. Rev. Biophys. Biomol. Struct. 2005, 34, 43-69.
(212) Snow, C. D.; Qiu, L. L.; Du, D. G.; Gai, F.; Hagen, S. J.; Pande, V. S. Proc. Natl. Acad. Sci. U. S. A. 2004, 101, 4077-4082.
(213) Zagrovic, B.; Sorin, E. J.; Pande, V. J. Mol. Biol. 2001, 313, 151-169.
(214) Jayachandran, G.; Vishal, V.; Pande, V. S. J. Chem. Phys. 2006, 124, 054118.
(215) Singhal, N.; Snow, C. D.; Pande, V. S. J. Chem. Phys. 2004, 121, 415-425.
(216) Swope, W. C.; Pitera, J. W.; Suits, F. J. Phys. Chem. B 2004, 108, 6571-6581.
(217) Swope, W. C.; Pitera, J. W.; Suits, F.; Pitman, M.; Eleftheriou, M.; Fitch, B. G.; Germain, R. S.; Rayshubski, A.; Ward, T. J. C.; Zhestkov, Y.; Zhou, R. J. Phys. Chem. B 2004, 108, 6582-6594.
(218) Daggett, V.; Levitt, M. J. Mol. Biol. 1993, 232, 600-619.
(219) Daggett, V.; Levitt, M. J. Cell. Biochem. 1993, 223-223.
(220) Daggett, V.; Levitt, M. Curr. Opin. Struct. Biol. 1994, 4, 291-295.
(221) Dadlez, M.; Bierzynski, A.; Godzik, A.; Sobocinska, M.; Kupryszewski, G. Biophys. Chem. 1988, 31, 175-181.
(222) Baldwin, R. L. Biophys. Chem. 1995, 55, 127-135.
(223) Brown, J. E.; Klee, W. A. Biochemistry 1971, 10, 470-476.
(224) Fairman, R.; Shoemaker, K. R.; York, E. J.; Stewart, J. M.; Baldwin, R. L. Biophys. Chem. 1990, 37, 107-119.
(225) Osterhout, J. J.; Baldwin, R. L.; York, E. J.; Stewart, J. M.; Dyson, H. J.; Wright, P. E. Biochemistry 1989, 28, 7059-7064.
(226) Shoemaker, K. R.; Fairman, R.; Schultz, D. A.; Robertson, A. D.; York, E. J.; Stewart, J. M.; Baldwin, R. L. Biopolymers 1990, 29, 1-11.
(227) Felts, A. K.; Harano, Y.; Gallicchio, E.; Levy, R. M. Proteins: Struct., Funct., Bioinf. 2004, 56, 310-321.
(228) Hansmann, U. H. E.; Okamoto, Y. J. Phys. Chem. B 1998, 102, 653-656.
(229) Hansmann, U. H. E.; Okamoto, Y. J. Phys. Chem.
B 1999, 103, 1595-1604.
(230) La Penna, G.; Mitsutake, A.; Masuya, M.; Okamoto, Y. Chem. Phys. Lett. 2003, 380, 609-619.
(231) Ohkubo, Y. Z.; Brooks, C. L. Proc. Natl. Acad. Sci. U. S. A. 2003, 100, 13916-13921.
(232) Schaefer, M.; Bartels, C.; Karplus, M. J. Mol. Biol. 1998, 284, 835-848.
(233) Sugita, Y.; Okamoto, Y. Biophys. J. 2005, 88, 3180-3190.
(234) Yoda, T.; Sugita, Y.; Okamoto, Y. Chem. Phys. 2004, 307, 269-283.
(235) Yoda, T.; Sugita, Y.; Okamoto, Y. Chem. Phys. Lett. 2004, 386, 460-467.
(236) Kabsch, W.; Sander, C. Biopolymers 1983, 22, 2577-2637.
(237) Johnson, W. C. Annu. Rev. Biophys. Biophys. Chem. 1988, 17, 145-166.
(238) Sreerama, N.; Woody, R. W. Methods Enzymol. 2004, 383, 318-351.
(239) Gratzer, W. B.; Doty, P.; Holzwarth, G. M. Proc. Natl. Acad. Sci. U. S. A. 1961, 47, 1785-1791.
(240) Manning, M. C.; Illangasekare, M.; Woody, R. W. Biophys. Chem. 1988, 31, 77-86.
(241) Bayley, P. M.; Nielsen, E. B.; Schellma.Ja J. Phys. Chem. 1969, 73, 228-243.
(242) Clark, L. B. J. Am. Chem. Soc. 1995, 117, 7974-7986.
(243) Hirst, J. D. J. Chem. Phys. 1998, 109, 782-788.
(244) Woody, R. W.; Sreerama, N. J. Chem. Phys. 1999, 111, 2844-2845.
(245) Goux, W. J.; Hooker, T. M. J. Am. Chem. Soc. 1980, 102, 7080-7087.
(246) Ridley, J.; Zerner, M. Theor. Chim. Acta 1973, 32, 111-134.
(247) Wlodawer, A.; Svensson, L. A.; Sjolin, L.; Gilliland, G. L. Biochemistry 1988, 27, 2705-2717.
(248) Blake, C. C. F.; Koenig, D. F.; Mair, G. A.; North, A. C. T.; Phillips, D. C.; Sarma, V. R. Nature 1965, 206, 757-761.
(249) Vocadlo, D. J.; Davies, G. J.; Laine, R.; Withers, S. G. Nature 2001, 412, 835-838.
(250) Nielsen, J. E.; McCammon, J. A. Protein Sci. 2003, 12, 313-326.
(251) Bartik, K.; Redfield, C.; Dobson, C. M. Biophys. J. 1994, 66, 1180-1184.
(252) Tironi, I. G.; Sperb, R.; Smith, P. E.; Vangunsteren, W. F. J. Chem. Phys. 1995, 102, 5451-5459.
(253) Case, D. A.; Darden, T. A.; T.E. Cheatham, I.; Simmerling, C. L.; Wang, J.; Duke, R. E.; R.Luo; Merz, K. M.; Pearlman, D. A.; Crowley, M.; Walker, R.
C.; Zhang, W.; Wang, B.; S.Hayik; Roitberg, A.; Seabra, G.; Wong, K. F.; Paesani, F.; Wu, X.; Brozell, S.; Tsui, V.; H.Gohlke; Yang, L.; Tan, C.; Mongan, J.; Hornak, V.; Cui, G.; Beroza, P.; Mathew, D. H.; C.Schafmeister; Ross, W. S.; Kollman, P. A.; University of California, San Francisco: San Francisco, 2006.
(254) Frisch, M. J. T., G. W.; Schlegel, H. B.; Scuseria, G. E.; Robb, M. A.; Cheeseman, J. R.; Montgomery, Jr., J. A.; Vreven, T.; Kudin, K. N.; Burant, J. C.; Millam, J. M.; Iyengar, S. S.; Tomasi, J.; Barone, V.; Mennucci, B.; Cossi, M.; Scalmani, G.; Rega, N.; Petersson, G. A.; Nakatsuji, H.; Hada, M.; Ehara, M.; Toyota, K.; Fukuda, R.; Hasegawa, J.; Ishida, M.; Nakajima, T.; Honda, Y.; Kitao, O.; Nakai, H.; Klene, M.; Li, X.; Knox, J. E.; Hratchian, H. P.; Cross, J. B.; Bakken, V.; Adamo, C.; Jaramillo, J.; Gomperts, R.; Stratmann, R. E.; Yazyev, O.; Austin, A. J.; Cammi, R.; Pomelli, C.; Ochterski, J. W.; Ayala, P. Y.; Morokuma, K.; Voth, G. A.; Salvador, P.; Dannenberg, J. J.; Zakrzewski, V. G.; Dapprich, S.; Daniels, A. D.; Strain, M. C.; Farkas, O.; Malick, D. K.; Rabuck, A. D.; Raghavachari, K.; Foresman, J. B.; Ortiz, J. V.; Cui, Q.; Baboul, A. G.; Clifford, S.; Cioslowski, J.; Stefanov, B. B.; Liu, G.; Liashenko, A.; Piskorz, P.; Komaromi, I.; Martin, R. L.; Fox, D. J.; Keith, T.; Al-Laham, M. A.; Peng, C. Y.; Nanayakkara, A.; Challacombe, M.; Gill, P. M. W.; Johnson, B.; Chen, W.; Wong, M. W.; Gonzalez, C.; and Pople, J. A.; Gaussian, Inc.: Wallingford CT, 2004.
(255) Ditchfie.R Mol. Phys. 1974, 27, 789-807.
(256) He, X.; Wang, B.; Merz, K. M. J. Phys. Chem. B 2009, 113, 10380-10388.
(257) Anandakrishnan, R.; Onufriev, A. J. Comput. Biol. 2008, 15, 165-184.
(258) Gordon, J. C.; Myers, J. B.; Folta, T.; Shoja, V.; Heath, L. S.; Onufriev, A. Nucleic Acids Res. 2005, 33, 368-371.
BIOGRAPHICAL SKETCH

[...] the Dalian University of Technology at Dalian, Liaoning Province and studied chemical [...] his college, Yilin has developed an interest in computational chemistry, especially the [...] In August 2004, Yilin came to the University of Florida and began his life as a graduate student. His original plan was to continue studying electronic structure theory. However, he was impressed by the research of Dr. Adrian E. Roitberg. Later, he joined the Roitberg group and started his career in molecular modeling.