Citation
Rationalizing and Quantifying the Scrambling Chemistry in Peptide Sequencing

Material Information

Title:
Rationalizing and Quantifying the Scrambling Chemistry in Peptide Sequencing: A Computational Perspective
Creator:
Yu, Long
Place of Publication:
[Gainesville, Fla.]
Florida
Publisher:
University of Florida
Publication Date:
Language:
English
Physical Description:
1 online resource (163 p.)

Thesis/Dissertation Information

Degree:
Doctorate (Ph.D.)
Degree Grantor:
University of Florida
Degree Disciplines:
Chemistry
Committee Chair:
POLFER,NICOLAS CAMILLE
Committee Co-Chair:
WEI,WEI
Committee Members:
OMENETTO,NICOLO
ROITBERG,ADRIAN E
CHEN,SIXUE
Graduation Date:
12/19/2014

Subjects

Subjects / Keywords:
Amino acids ( jstor )
Average linear density ( jstor )
Datasets ( jstor )
Ions ( jstor )
Isomers ( jstor )
Mass spectrometers ( jstor )
Mass spectroscopy ( jstor )
Molecular dynamics ( jstor )
Sequencing ( jstor )
Simulations ( jstor )
Chemistry -- Dissertations, Academic -- UF
peptide -- permutation -- scrambling -- sequence
Genre:
bibliography ( marcgt )
theses ( marcgt )
government publication (state, provincial, territorial, dependent) ( marcgt )
born-digital ( sobekcm )
Electronic Thesis or Dissertation
Chemistry thesis, Ph.D.

Notes

Abstract:
A spectrum of computational approaches has been applied to the investigation of peptide sequence scrambling processes and the limitations of ultra-high-resolution mass spectrometry. The first project focused on the energetics of the isomerization of linear peptide fragments to head-to-tail macrocyclic structures, which are at the basis of rationalizing sequence scrambling. Molecular dynamics simulations were used to investigate the potential of mean force (PMF) along the distance between the atoms involved in the head-to-tail nucleophilic attack. Peptides with higher energy barriers disfavor macrocycle formation and tend to adopt linear oxazolone structures, whereas lower barriers correlate with facile macrocycle formation. A key factor in favoring macrocyclization was found to be the peptide length, with longer sequences favoring head-to-tail cyclization. In the second project, the appearance of scrambled/permuted sequence ions in tandem mass spectra was evaluated to draw statistical conclusions on trends in the dissociation chemistry, showing that the propensity of sequence permutations increases with the length of the precursor peptides. The ratio of matched permuted sequence ions over all fragments can reach up to 25% for longer peptides. Meanwhile, the overall average percentage of permuted sequence ions was found to be 5.3% when subtracting contributions from false positives, compared to 16.9% for original sequence ions, suggesting that scrambling does not constitute a significant problem for correct peptide sequencing. In the third project, all compositionally distinct peptides up to 1000 Da made from 21 amino acid residues were considered, to explore the limitations of ultra-high accuracy mass measurements on peptide identification. The number of peptides was found to grow exponentially with nominal mass, reaching nearly 50,000 at 1000 Da. Moreover, a striking periodic oscillation behavior was observed, with a period of 15 Da at lower nominal masses and 14 Da at higher masses.
These mass differences coincide with the most common mass differences between amino acid building blocks, resulting in a large number of isomers or isobars at some nominal masses, but lower numbers at adjacent nominal masses. Due to the large number of isomers, even ultra-high resolution (e.g. 0.5 ppm) cannot promise a high rate of identification when the nominal mass is high. ( en )
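The enumeration idea behind the third project can be illustrated with a short Python sketch. It is not the thesis's algorithm: the residue subset, the use of integer (nominal) residue masses that ignore the mass defect, and the names NOMINAL_RESIDUE_MASS, WATER and count_compositions are all assumptions made for this example. It counts how many distinct amino acid compositions (order ignored) fall on each integer mass.

```python
# Nominal (integer) residue masses in Da for a few common amino acids.
# Illustrative subset only (an assumption); the thesis uses its own
# list of 21 residues and exact masses.
NOMINAL_RESIDUE_MASS = {
    "G": 57, "A": 71, "S": 87, "P": 97, "V": 99,
    "Q": 128, "K": 128,  # isobaric at nominal-mass resolution
}

WATER = 18  # added once per linear peptide (H2O across the two termini)


def count_compositions(residue_masses, max_mass):
    """Count multisets of residues whose masses sum to each integer total.

    counts[t] is the number of distinct compositions (order ignored)
    with total residue mass t. This is coin-change dynamic programming.
    """
    counts = [0] * (max_mass + 1)
    counts[0] = 1  # the empty composition
    for mass in residue_masses:  # one pass per residue type
        for total in range(mass, max_mass + 1):
            counts[total] += counts[total - mass]
    return counts


if __name__ == "__main__":
    limit = 400
    counts = count_compositions(list(NOMINAL_RESIDUE_MASS.values()), limit)
    # Peptide nominal mass = sum of residue masses + one water.
    for peptide_mass in (146, 203, 300):
        print(peptide_mass, counts[peptide_mass - WATER])
```

Placing the residue type in the outer loop means each composition is counted once regardless of sequence order, which matches the notion of compositionally distinct peptides; residues that share a nominal mass (here Q and K) still count as separate compositions, which is one source of the isobars mentioned above.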
General Note:
In the series University of Florida Digital Collections.
General Note:
Includes vita.
Bibliography:
Includes bibliographical references.
Source of Description:
Description based on online resource; title from PDF title page.
Source of Description:
This bibliographic record is available under the Creative Commons CC0 public domain dedication. The University of Florida Libraries, as creator of this bibliographic record, has waived all rights to it worldwide under copyright law, including all related and neighboring rights, to the extent allowed by law.
Thesis:
Thesis (Ph.D.)--University of Florida, 2014.
Local:
Adviser: POLFER,NICOLAS CAMILLE.
Local:
Co-adviser: WEI,WEI.
Statement of Responsibility:
by Long Yu.

Record Information

Source Institution:
UFRGP
Rights Management:
Copyright Yu, Long. Permission granted to the University of Florida to digitize, archive and distribute this item for non-profit research and educational purposes. Any reuse of this item in excess of fair use or other copyright exemptions requires permission of the copyright holder.
Resource Identifier:
974372620 ( OCLC )
Classification:
LD1780 2014 ( lcc )

Full Text

PAGE 1

RATIONALIZING AND QUANTIFYING THE SCRAMBLING CHEMISTRY IN PEPTIDE SEQUENCING: A COMPUTATIONAL PERSPECTIVE
By
LONG YU
A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY
UNIVERSITY OF FLORIDA
2014

PAGE 2

© 2014 Long Yu

PAGE 3

To my parents, my wife, Roscoe and my future children

PAGE 4

ACKNOWLEDGMENTS

I gratefully treasure the opportunity to work with Dr. Nicolas Polfer, and to benefit from his long-standing and generous guidance throughout my graduate study and research. I cannot imagine a better Ph.D. advisor. I would like to thank Dr. David Powell for his kind generosity in granting me access to his Fourier transform ion cyclotron resonance mass spectrometer, which started my experience in mass spectrometry experiments. I thank Dr. Wei D. Wei, Dr. Nicolo Omenetto, Dr. Adrian E. Roitberg and Dr. Sixue Chen for the precious time and energy they spent serving on my committee. I would like to thank Yipu Miao and Pengfei Li, former members of the Quantum Theory Project in the chemistry department, for kindly sharing their knowledge and experience, which greatly helped my understanding of molecular dynamics and ab initio calculations. I thank Johan Galindo-Cruz from Dr. Roitberg's group for the tutorial and help in molecular dynamics model building. I especially thank Xinyu Miao, my undergraduate roommate, one of my best friends and the best man at my wedding ceremony. His timely help and directions overcame so many obstacles I encountered during my methodology development. I thank my fellow alumna and former group mate Xian Chen for her tutorials in the experimental techniques and her considerate care during my graduate study. I appreciate the cooperation with Yanglan Tan during my calculation and data analysis. I really treasure the true friendship with group member and fellow undergraduate alumnus Ning Zhao ever since he joined the Polfer group. The Polfer and the Eyler research groups have been a great support to me and this research could not have been accomplished otherwise.

PAGE 5

I thank Yu Shen, Yixing Yang and Weiran Cao for their unselfish tutoring during my preparation for the qualifying exams; otherwise it would have been very difficult for a student with an undergraduate background in physics to pass them all. I thank my fellow alumnus and long-time friend Ou Chen for his ungrudging accommodation in every aspect of my living and studying since my arrival in the United States, especially when I needed it most. His crucial help during my application to the University of Florida jumpstarted my graduate study and my life in Gainesville; otherwise this thesis and this Ph.D. degree would both be nonexistent. I thank all my friends and relatives for their support of all kinds. Without them life would be pure misery. I thank my mother for her lifetime devotion to my education, and her deep sacrifice of time and energy that led to a huge compromise in her otherwise even more successful business career. I will try my best to be her biggest pride. I REALLY need to thank goodness for bringing me to my wife, Yifan Wang. She is simply the one by whom female perfection in my eyes has been defined, and marrying her is the best thing that has ever happened in my life.

PAGE 6

TABLE OF CONTENTS
page
ACKNOWLEDGMENTS .... 4
LIST OF TABLES .... 9
LIST OF FIGURES .... 11
LIST OF ABBREVIATIONS .... 15
ABSTRACT .... 16
CHAPTER
1 INTRODUCTION .... 18
  1.1 Proteomics and Peptide Sequencing .... 20
  1.2 Peptide Scrambling .... 21
    1.2.1 Overview .... 21
    1.2.2 Previous Work .... 22
    1.2.3 Proposed Research .... 24
2 EXPERIMENTAL AND COMPUTATIONAL TECHNIQUES .... 27
  2.1 Background .... 27
  2.2 Electrospray Ionization (ESI) .... 28
  2.3 Mass Spectrometry .... 29
    2.3.1 Introduction .... 29
    2.3.2 Time of Flight (TOF) Mass Spectrometer .... 29
      2.3.2.1 Mechanism .... 29
      2.3.2.2 Comparison with other MS .... 30
    2.3.3 Fourier Transform Ion Cyclotron Resonance Spectrometer .... 31
      2.3.3.1 Overview and Mechanism .... 31
      2.3.3.2 Advantages of FT-ICR MS .... 32
  2.4 Collision Induced Dissociation (CID) .... 32
  2.5 Infrared Multiple Photon Dissociation (IRMPD) .... 33
  2.6 Free Electron Laser and FELIX .... 35
    2.6.1 Introduction .... 35
    2.6.3 FELIX facility .... 36
  2.7 Density Functional Theory .... 36
    2.7.1 Overview .... 36
    2.7.2 DFT Calculation with Gaussian 03 .... 37
  2.8 Molecular Dynamics Simulation with AMBER .... 37
    2.8.1 Overview .... 37
    2.8.2 Molecular Dynamics Simulations .... 38
    2.8.3 AMBER (Assisted Model Building with Energy Refinement) .... 38

PAGE 7

3 TRENDS IN FAVORING/DISFAVORING MACROCYCLE FORMATION: SEQUENCE LENGTH, TORSIONAL RESTRICTION, AND BASIC RESIDUES .... 42
  3.1 Scope and Motivation .... 42
    3.1.1 Prior IR Spectroscopy Studies .... 42
    3.1.2 Planned Approach .... 44
  3.2 Computational Methods .... 46
    3.2.1 Overview .... 46
    3.2.2 Molecular Dynamics Model Building .... 46
      3.2.2.1 Overview .... 46
      3.2.2.2 HyperChem .... 47
      3.2.2.3 Gaussian calculations .... 48
      3.2.2.4 AMBER calculation .... 48
    3.2.3 Potential Calculation with WHAM .... 49
      3.2.3.1 Weighted histogram average method (WHAM) equations .... 49
      3.2.3.2 WHAM calculation parameters .... 51
  3.3 Results and Discussion .... 52
    3.3.1 Overview .... 52
    3.3.2 "b" ions of Gn, [YAG]n and Proline Modified QWFGLM Peptides .... 52
      3.3.2.1 [YAG]n sequence motifs .... 52
      3.3.2.2 G4 to G8 .... 53
      3.3.2.3 QWFGLM and proline substitutes .... 53
    3.3.3 Arginine Peptide Results .... 54
      3.3.3.1 N-C results .... 55
        b6 ions of R-substituting QWFGLMs .... 55
        b5 ions of R-substituting QWFGLs .... 55
      3.3.3.2 N-H results .... 56
        b6 ions of R-substituting QWFGLMs .... 57
        b5 ions of R-substituting QWFGLs .... 57
    3.3.4 Limitation of the Underlying Computational Approaches .... 58
  3.4 Summary .... 58
4 STATISTICAL STUDY OF SEQUENCE SCRAMBLING IN COLLISION-INDUCED DISSOCIATION OF PEPTIDES .... 73
  4.1 Background .... 73
  4.2 Computational Methods .... 76
    4.2.1 Overview .... 76
    4.2.2 Experimental Data and Importing .... 77
    4.2.3 Bottom-up Sequencing Tool .... 77
      4.2.3.1 General procedure of sequencing .... 78
      4.2.3.2 Types of fragment ions to be considered .... 79
      4.2.3.3 More details of mass matching .... 80
    4.2.4 Raw Output File Analysis Tool .... 80
    4.2.5 False Positives .... 82
  4.3 Results and Discussion .... 85
    4.3.1 Overview .... 85

PAGE 8

      4.3.1.1 CID experiment setup .... 85
    4.3.2 General Sequencing Results .... 86
    4.3.3 Population Analysis .... 86
    4.3.4 Dependency on Peptide Lengths .... 88
    4.3.5 Dependence on Precursor Ion Charge State .... 93
    4.3.6 Discussions on the Trends in Scrambling .... 94
  4.4 Summary .... 95
5 PEPTIDE MASS AND IDENTIFICATION BY HIGH RESOLUTION MASS SPECTROMETRY .... 108
  5.1 Background .... 108
    5.1.1 Motivation .... 108
    5.1.2 Background of Proposed Research Work .... 110
  5.2 Computational Method .... 111
    5.2.1 Overview .... 111
    5.2.2 Peptides of Interest .... 111
    5.2.3 Masses Enumeration Algorithm .... 113
    5.2.4 Data Analysis .... 115
      5.2.4.1 Peptide rearrangement by mass-ascending order .... 115
      5.2.4.2 Atomic composition and isomer depletion .... 116
      5.2.4.3 Nominal mass population analysis .... 117
      5.2.4.4 Differentiation by mass .... 117
  5.3 Result and Discussion .... 119
    5.3.1 Population Analysis and Mass Periodicity .... 119
    5.3.2 Explanation of the Periodicity .... 121
    5.3.3 Identification by Mass under Different PPM Values .... 123
      5.3.3.1 Unique masses .... 124
      5.3.3.2 Identification under different mass accuracies .... 125
      5.3.3.3 Atomic composition identification .... 126
    5.3.4 Influence from Additional Amino Acid Residues .... 127
  5.4 Summary .... 129
6 CONCLUSIONS AND FURTHER WORK .... 144
APPENDIX
A ORIGINAL DATA FILE SAMPLES AND TABLES .... 148
B AMBER MODEL BUILDING DETAILS AND SAMPLE OUTPUTS .... 153
LIST OF REFERENCES .... 158
BIOGRAPHICAL SKETCH .... 163

PAGE 9

LIST OF TABLES
Table page
2-1 Specifications of the FELIX light source. .... 40
3-1 Relative contributions of oxazolone and macrocycle structures for oligoglycine bn products, based on HDX results. .... 70
3-2 List of b ions considered for PMF calculations. .... 70
3-3 Comparison of experimental and computed results for bn [YAG]n, [GAY]n and [AYG]n based on PMF data in Figure 3-9. .... 71
3-5 Experimental results on the isomeric structures for the b ions of QWFGLM and the proline modifications, based on PMF data in Figure 3-11. .... 71
3-6 Experimental and computational results for b6 ions for QWFGLM and its arginine-modified sequence motifs based on PMF data in Figure 3-12. .... 71
3-7 Experimental and computational results for b5 ions for arginine-modified sequence motifs of QWFGL based on PMF data in Figure 3-13. .... 72
3-8 Experimental and computational results for the b6 ions for arginine-containing peptides based on PMF data in Figure 3-15. .... 72
3-9 Experimental and computational results for b5 ions of arginine-containing peptides based on Figure 3-16. .... 72
4-1 Direct sequence ion matching results for the MS2 spectrum in Figure 4-1. .... 107
4-2 Nondirect sequence ion matching results for MS2 spectrum in Figure 4-1. .... 107
4-3 List of direct and nondirect ion types considered for fragment matching. .... 107
5-1 List of amino acid residues considered and their exact masses. .... 142
5-2 From left to right: Atomic composition, one-letter code, and exact masses of 21 amino acid residues included in the amino acid list. .... 142
5-3 Summary of compositional differences in Tables 5-3 and 5-4, showing their corresponding nominal mass differences and number of occurrences .... 143
A-1 A sample MS2 file. .... 148
A-2 Sample raw output file. .... 149
A-3 Sample entries from the data file. .... 150

PAGE 10

A-4 Atomic compositional differences between all pairs of amino acid residues considered. The set of 5 numbers denotes the difference in number of C, N, O, P and S atoms, respectively. .... 151
A-5 Atomic compositional differences between all pairs of amino acid residues considered. The set of 5 numbers denotes the difference in number of C, N, O, P and S atoms, respectively (continued). .... 152

PAGE 11

LIST OF FIGURES
Figure page
1-1 Mechanistic scheme showing oxazolone b fragment formation, followed by cyclization into a macrocycle and loss of sequence information for the peptide. .... 26
2-1 Schematic demonstration of ESI mechanism. .... 40
2-2 The geometry of an open, cylindrical ICR cell. .... 41
2-3 Schematic of free electron laser operation. .... 41
2-4 The bonded and non-bonded interactions considered in molecular mechanics simulation. .... 41
3-1 Schematic diagram of head-to-tail macrocyclization. .... 61
3-2 Kinetic fitting of the HDX results for glycine-based b fragment ions .... 61
3-3 Mid-IR-MPD spectrum of b5-G8 (generated from octaglycine), compared to the lowest-energy conformers for the various chemical structures .... 62
3-4 Top: IRMPD spectra of the b4 ions with the sequence motifs A) TyrAlaGly, B) GlyAlaTyr, and C) AlaTyrGly. Bottom: IRMPD spectra of the b6 ions with the sequence motifs A) TyrAlaGly, B) GlyAlaTyr, and C) AlaTyrGly. .... 63
3-5 Overlay of IRMPD spectra of QPWFGLMPG b7, QPFGLMPG b6, and protonated cyclo(QPFGLM). .... 64
3-6 IR spectra for b5 arginine peptides recorded in the mid-IR range; the shaded region corresponds to the typical wavelength for the oxazolone C=O stretching mode. .... 64
3-7 IR spectra for b6 arginine peptides recorded in the mid-IR range; the shaded region corresponds to the typical wavelength for the oxazolone C=O stretching mode. .... 65
3-8 Schematic diagram of the distance coordinate for PMF calculation. .... 65
3-9 PMF results for bn [YAG]n, [AYG]n and [GAY]n. .... 66
3-10 PMF calculation results for bn Gn. .... 66
3-11 PMF calculation results for the b ions of QWFGLM, QPWFGLM and QPFGLM. .... 67

PAGE 12

3-12 PMF calculation results for the b6 ions for QWFGLM and arginine-modified sequence motifs. .... 67
3-13 PMF calculation results for the b5 ions for arginine-modified sequence motifs of QWFGL. .... 68
3-14 Schematic representation of how the proton transfer (pathway marked in red) deactivates the head-to-tail cyclization pathway. .... 68
3-15 PMF calculation results for the b6 ions for arginine-modified sequence motifs of QWFGLM. .... 69
3-16 PMF calculation results for the b6 ions for arginine-modified sequence motifs of QWFGL. .... 69
4-1 Schematic diagram of b ion sequence permutations as a result of head-to-tail cyclization. .... 98
4-2A A sample MS-MS spectrum from CID experiment .... 99
4-2B Schematic flowchart for the automated direct/nondirect ion matching and statistical data analysis. .... 100
4-3 Size distribution of peptides identified by Scherl et al. [57] and used as a reference peptide list in this study. .... 101
4-4 Number of peptides as a function of peptide length, for the higher- (blue) and lower- (red) confidence datasets. .... 101
4-5 Mean percentages and standard deviations of direct (blue) and nondirect (red) ions among all fragments at each peptide length, for the higher-confidence dataset. .... 102
4-5B Mean percentages and standard deviations of direct (blue) and nondirect (red) ions among all fragments at each peptide length, for the lower-confidence dataset. .... 102
4-6 Intensity-weighted percentage of nondirect ions at each peptide length for higher- (top) and lower- (bottom) confidence datasets. .... 103
4-7 Number of possible permutations due to macrocycle formation versus peptide length based on mechanism in Figure 4-1. .... 103
4-8 Number of peptides at each peptide length for the real (red) and decoy (black) peptide list. .... 104

PAGE 13

4-9 Percentage of direct (black) and nondirect (red) ions among all fragments at each peptide length for higher (top) and lower (middle) confidence datasets and decoy (bottom) dataset. .... 104
4-10 Percentages of false positive nondirect ion population over real nondirect ion population at each peptide length, for higher (blue) and lower (red) confidence datasets. .... 105
4-11 Percentages of direct (blue) and nondirect (red) ions among all fragments after depleting false positive contributions at each peptide length, superposed with the population at each peptide length. .... 105
4-12 Number of precursor ions at each charge state (+2, +3, +4 and +5). .... 106
4-13 Percentages of nondirect ions among all fragments for charge states 2+ (black) and 3+ (red) at each peptide length for higher (top) and lower (bottom) confidence datasets. .... 106
5-1 Population of compositionally distinct peptides at each nominal mass. .... 131
5-2 Histogram analysis of compositionally distinct peptides at each individual nominal mass. .... 131
5-3 Plot of the number of compositionally distinct peptides and polynomial fit to the ninth order up to a nominal mass of 1000 Da. .... 132
5-4 Population plot and polynomial fit to the ninth order up to 600 Da. .... 132
5-5 Relative deviation between the population (P) and the polynomial fitting value (F) at each nominal mass, calculated by relative deviation=(P-F)/F, for nominal masses up to 1000 Da. .... 133
5-6 Relative deviation between the population (P) and the polynomial fitting value (F) at each nominal mass, calculated by relative deviation=(P-F)/F, for nominal masses between 600 and 1000 Da. .... 133
5-7 Population and polynomial fit as a function of nominal mass up to 1500 Da. .... 134
5-8 Relative deviation between the population (P) and the polynomial fitting value (F) at each nominal mass, calculated by relative deviation=(P-F)/F, for nominal masses between 1000 and 1500 Da. .... 134
5-9 Relative deviation between the population (P) and the polynomial fitting value (F) at each nominal mass, calculated by relative deviation=(P-F)/F, for nominal masses between 310 and 400 Da (B) and 900-1000 Da (C). .... 135

PAGE 14

5-10 Relative deviation between the population (P) and the polynomial fitting value (F) at each nominal mass, calculated by relative deviation=(P-F)/F, for nominal masses between 815 and 900 Da. .... 135
5-11 Number of occurrences for each nominal mass difference, which is calculated as the absolute value of nominal mass difference between any pair of amino acid residues considered in this research. .... 136
5-12 Number of unique masses vs. overall population at each nominal mass for masses up to 1000 Da. The inset gives a zoomed plot between 300 and 600 Da. .... 136
5-13 Percentage of unique masses within the entire population at each nominal mass for masses up to 1000 Da. .... 137
5-14 Percentage of masses identifiable with 50 ppm detection accuracy and of unique masses within the entire population at each nominal mass for masses up to 1000 Da. .... 137
5-15 Percentages of masses distinguishable with 0.1, 1, 5 and 10 ppm detection accuracy within the entire population at each nominal mass for masses up to 1000 Da. .... 138
5-16 Percentages of masses distinguishable with 0.1, 1, 5 and 10 ppm detection accuracy within the entire population at each nominal mass for masses between 400 and 1000 Da. .... 138
5-17 Percentages of masses identifiable with 0.1, 1, 5 and 10 ppm detection accuracy within the entire isomer-excluded population at each nominal mass, for masses up to 1000 Da. .... 139
5-18 Percentages of masses identifiable with 0.1, 1, 5 and 10 ppm detection accuracy within the entire isomer-excluded population at each nominal mass, for masses between 500 and 1000 Da. .... 139
5-19 Population at each nominal mass for masses up to 1000 Da when the original 21 AA residues (black) and the modified 24 AA residues (21 plus 3 phosphorylated residues, red) are used during mass enumeration. .... 140
5-20 Histograms of mass distribution within nominal masses 499, 699 and 899 Da, when only the original 21 AA residues (black) or the augmented 24 AA residues (orange) are considered for mass enumeration. .... 141


LIST OF ABBREVIATIONS

AA        Amino Acid
AMBER     Assisted Model Building with Energy Refinement
AMU       Atomic Mass Units
CID       Collision Induced Dissociation
DFT       Density Functional Theory
ESI       ElectroSpray Ionization
FEL       Free Electron Laser
FELIX     Free Electron Laser for Infrared eXperiments
FT-ICR    Fourier Transform Ion Cyclotron Resonance
HDX       Hydrogen Deuterium eXchange
HF        Hartree-Fock
HPLC      High Performance Liquid Chromatography
IRMPD     Infrared Multiple Photon Dissociation
MALDI     Matrix Assisted Laser Desorption Ionization
MD        Molecular Dynamics
MS        Mass Spectrometry
MS²/MSMS  Tandem mass spectrometry, in which two consecutive measurements are made to determine the masses of both the precursor ion and the fragment ions
PMF       Potential of Mean Force
PPM       Parts Per Million
TOF       Time Of Flight


Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

RATIONALIZING AND QUANTIFYING THE SCRAMBLING CHEMISTRY IN PEPTIDE SEQUENCING: A COMPUTATIONAL PERSPECTIVE

By Long Yu
December 2014
Chair: Nicolas C. Polfer
Major: Chemistry

A spectrum of computational approaches has been applied to the investigation of peptide sequence scrambling processes and the limitations of ultra-high resolution mass spectrometry. The first project focused on the energetics of the isomerization of linear peptide fragments to head-to-tail macrocyclic structures, which are the basis for rationalizing sequence scrambling. Molecular dynamics simulations were used to investigate the potential of mean force (PMF) along the distance between the atoms involved in the head-to-tail nucleophilic attack. Peptides with higher energy penalties disfavor macrocycle formation and tend to adopt linear oxazolone structures, whereas lower energetic penalties correlate with facile macrocycle formation. A key factor favoring macrocyclization was found to be the peptide length, with longer sequences favoring head-to-tail cyclization. In the second project, the appearance of scrambled/permuted sequence ions in tandem mass spectra was evaluated to draw statistical conclusions on trends in the dissociation chemistry, showing that the propensity for sequence permutation increases with the length of the precursor peptide. The ratio of matched permuted sequence ions over all fragments can reach up to 25% for longer peptides. Meanwhile, the overall average percentage of permuted sequence ions was found to be 5.3% after subtracting contributions from false positives, compared to 16.9% for original sequence ions, suggesting that scrambling does not constitute a significant problem for correct peptide sequencing. In project 3, all compositionally distinct peptides up to 1000 Da made from 21 amino acid residues were considered, to explore the limitations of ultra-high accuracy mass measurements for peptide identification. The number of peptides was found to grow exponentially with nominal mass, reaching nearly 50,000 at 1000 Da. Moreover, a striking periodic oscillation was observed, with a period of 15 Da at lower nominal masses and 14 Da at higher masses. These mass differences coincide with the most common mass differences between amino acid building blocks, resulting in a large number of isomers or isobars at some nominal masses, but lower numbers at adjacent nominal masses. Due to the large number of isomers, even ultra-high resolution (e.g., 0.5 ppm) cannot guarantee a high rate of identification when the nominal mass is high.


CHAPTER 1
INTRODUCTION

In recent years peptide sequencing [1,2] has become a major tool in proteomics, especially with the integration of in-line automated high performance liquid chromatography coupled to high resolution tandem mass spectrometry [3,4] and probability-based peptide identification ("sequencing") software [5,6]. While this approach has become mainstream, one notable obstacle remains: the percentage of matched peptide fragments among all recorded peaks during automated peptide identification generally stays at a relatively low level, in many cases below 10 percent. Many causes have been suggested [7-14]. One that is widely discussed is that there may be unknown fragmentation pathways besides the default cleavages of a peptide, whose products cannot be predicted by the sequencing software and which therefore account for many of the unmatched peaks in tandem mass spectra. Theories have been proposed to describe such novel pathways, among them peptide scrambling. First proposed by Paizs [15], the scrambling theory asserts that linear fragment ions from collision induced dissociation (CID, the predominant dissociation technique in tandem mass spectrometry) of peptides can undergo cyclization to form macrocyclic isomers, whose reopening at a site different from the original cyclization site results in new fragments not otherwise predicted from CID. While previous works [16-22], including those by the author of this thesis, have proven the existence of the scrambling effect through infrared multiple photon dissociation (IRMPD) and hydrogen-deuterium exchange experiments accompanied by theoretical DFT calculations, the chemistry behind the scrambling mechanism, as well as its overall influence on peptide sequencing, remains relatively unclear. The studies presented in


this dissertation are attempts to unveil this chemistry and influence, mainly from a computational perspective. To that end, Chapter 2 details the experimental elements involved in this research, from the instruments, their physical and chemical background and the apparatus setups, to the experimental techniques, procedures and data acquisition methods. Chapter 3 investigates the scrambling chemistry when certain basic amino acid residues are present in the precursor peptide ions, and their impact on peptide scrambling. IRMPD experiments provide IR spectra, and thus structural information, for linear and macrocyclic fragment ions, while molecular dynamics simulations are used to explain and predict the presence or absence of scrambling in specific peptide sequences. Chapter 4 statistically evaluates the contribution of peptide scrambling to typical bottom-up peptide sequencing algorithms, by taking a large dataset of experimental tandem mass spectra and conducting bottom-up fragment ion matching while also considering the novel fragments predicted from the scrambling pathways. Chapter 5 moves one step further and investigates the limits of identification by ultra-high resolution mass spectrometry alone. All possible peptide ions and their fragments with masses up to 1000 Da are enumerated, and the identification probability is analyzed for different MS resolving powers (in parts per million, ppm), together with the patterns in the distribution of mass peaks and the chemistry behind them. Chapter 6 summarizes all three projects and gives an overall conclusion. Before these chapters, the remaining sections of this introduction provide the necessary background in some detail. They cover a brief introduction to the proteome and proteomics, the framework of peptide sequencing, and the essentials of peptide scrambling,


including experiments that have validated its existence, and its influence on the bottom-up peptide identification algorithm.

1.1 Proteomics and Peptide Sequencing

The concept of proteomics was introduced in 1997 [23] to describe the large-scale study of proteins, with particular focus on their structural and functional properties. One important aspect of proteomics is analytical protein identification [1,2], which lays the foundation for other research on protein structures, expression and interactions, and for extending proteomics to systems biology. Since the 1990s, mass spectrometry [24-27] has become the dominant experimental technique for protein identification, which more specifically refers to determining the amino acid sequence of a protein, hence the name "protein sequencing". In typical experimental approaches, however, proteins are not sequenced directly but in a bottom-up fashion: the protein sample is first digested enzymatically, producing a mixture of peptides. This mixture is separated, normally by high performance liquid chromatography (HPLC) among other techniques, and then ionized using electrospray ionization (ESI) or matrix assisted laser desorption ionization (MALDI), before being introduced into the high-vacuum chamber of a mass spectrometer, where the mass-to-charge ratios of the peptide ions are analyzed. The peptide ions in the chamber then go through a round of dissociation, usually collision induced dissociation (CID), in which a buffer gas is introduced into the chamber and collides with the peptide ions, causing them to fragment. The resulting fragment ions are again mass analyzed. This mass measurement of both the peptide ("precursor") ion and the fragment ions is called tandem mass spectrometry, or MS-MS. These


measurements are used in a bottom-up sequencing algorithm, in which fragment and precursor masses are matched against a large list of mass data for known peptides and their fragments formed by typical known pathways. The identities of the peptides are inferred from the quality of the match, measured as a percentage, and ultimately build toward the identity of the original protein. In recent years this procedure of HPLC separation, MS-MS measurement and data analysis has been well integrated and automated, thanks to the rise in computing power, improved algorithms and databases, and advances in electronic and mechanical engineering. MS-based protein sequencing is dominant because of its much better sensitivity, higher tolerance to mixtures, and superior efficiency compared with legacy methods such as Edman degradation.

1.2 Peptide Scrambling

1.2.1 Overview

While MS-based protein sequencing has been widely accepted, problems with the method have been attracting attention, most noticeably the matching percentage. The fundamental premise of protein sequencing is the comparison between mass information from MS-MS spectra and sequence information from DNA/protein databases, yet only a small percentage of fragment ions can be confidently recognized and used to interpret the original MS-MS spectra. Most other mass peaks (denoting fragment ions) are simply not recognized. Although some of these peaks result from experimental imperfections such as detector false positives or contaminants, the main reason for the failure to identify them is an incomplete understanding of the fragmentation chemistry, that is, the chemistry behind fragment ion generation. Theories explaining novel fragment ion generation pathways have since been proposed, among which is peptide scrambling. During the CID process of the MS-MS


approach, the protonated peptide ions collide with the background inert gas molecules and dissociate. The products often arise from cleavage at amide backbone bonds and are called b and y ions [2]. The complete nomenclature for such product ions is given in Figure 1 [2]. The b ions are chemically interpreted as being formed through nucleophilic attack by the neighboring carbonyl oxygen atom. According to recent work by Paizs et al. [14,15], however, such b ions can also isomerize into a macrocyclic structure, that is, form a cyclic peptide. When this macrocyclic peptide reopens at a site different from where the N and C termini of the original linear peptide joined, a sequence-permuted linear peptide is formed, whose fragment ions differ from those of the original sequence. The matching program cannot identify them unless this "scrambling" mechanism is taken into consideration. Such scrambled fragment ions may account for many unidentified peaks in MS-MS spectra.

1.2.2 Previous Work

Mass spectrometry is a proven tool in protein sequencing, but by itself it cannot provide direct structural information for the ions introduced into its chamber, which is essential for understanding the scrambling chemistry. For this reason alternative approaches have been introduced, including hydrogen-deuterium exchange (HDX), ion mobility, isotope labeling and infrared multiple photon dissociation (IRMPD) spectroscopy. IRMPD spectroscopy is especially useful, because the vibrational peaks in the IR spectrum reveal direct structural information for the underlying peptide and fragment ions: the chemical groups in an ion carry their signature vibrational modes. In particular, theoretical IR spectra of the candidate structures for a b fragment can be obtained from density functional theory (DFT) calculations and matched to the IRMPD spectra, to determine the structure or mixture of structures present.


In a previous work by the author of this dissertation and his colleagues, both the experimental and theoretical approaches mentioned above were applied to the b ions of oligoglycines, namely b2 to b8. First, HD exchange experiments were carried out for these ions in the cell of a 4.7 Tesla FT-ICR mass spectrometer, and the product ion abundance was recorded over time. Linear regression of these plots revealed remarkably different reaction rates for these ions, suggesting the existence of isomer mixtures for some b ions, while a single isomer is present in other cases. To further investigate the structures of these isomers, IRMPD spectroscopy was then performed by exciting the ions in the ICR cell with either a free electron laser (FEL) or an optical parametric oscillator (OPO) laser at different wavelengths, and recording the IRMPD yield at wavelengths in the near- and mid-IR regions, so that the yields plotted against wavelength give the IRMPD spectrum of each ion. Meanwhile, theoretical IR spectra of the candidate linear and macrocyclic isomers of these b ions were calculated, first through a series of geometry optimizations using molecular dynamics simulation and DFT, followed by frequency calculations using DFT at the B3LYP/6-31G** level of theory. By matching these spectra to their experimental counterparts, and especially by checking whether the diagnostic vibrational modes unique to one isomer are present in the experimental IR spectra, the existence of the corresponding isomer can be validated. The results showed that while the b2 ion consists of only a linear oxazolone structure, b2-b4 ions consist of a mixture of both linear and cyclic structures, whereas b5 to b8 ions are mostly macrocyclic. This work confirms both the macrocyclization and scrambling pathways and, more importantly, the size effect for peptides undergoing them, and is thus widely cited. Both the experimental and theoretical approaches used in that work


are also adopted in some of the research documented in this dissertation, and will thus be explained more thoroughly in the subsequent chapters.

1.2.3 Proposed Research

As the investigation into peptide fragment ion scrambling moves forward, more questions arise en route to a deeper understanding of the chemistry behind it. Beyond the experiments that confirm the size effect in fragment ion cyclization (and the subsequent scrambling) in certain peptides, it becomes of interest whether cyclization depends on other factors. For example, the dependence on the presence of certain amino acid residues and/or their positions along the sequence remains unclear and needs investigation, and whether the size effect exists universally across common peptides needs further confirmation. Furthermore, while experiments have confirmed scrambling in many individual cases, whether and to what extent it is common in CID and other dissociation techniques remains unclear. To answer such questions, more investigations have to be carried out at a statistical level, with a large number of sequencing results, such as those from CID experiments, examined and analyzed, so that the possible contribution of ion scrambling to existing bottom-up protein sequencing can also be evaluated. In this dissertation we focus on a discussion of the computational approaches in light of a comparison to IRMPD experiments. These approaches cover several aspects, from the chemistry behind individual scrambling events under different scenarios, to the impact scrambling can have on proteomics, analyzed at a statistical level. Moreover, inspired by the statistical approach, a complete enumeration of peptide fragment masses under 1000 Da and the subsequent data mining are performed, to investigate the potential of


ultra-high resolution mass spectrometry for determining atomic composition from mass alone, as well as the behavior, patterns and population distribution of the enumerated masses.
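The idea of enumerating compositions at a given nominal mass, and of asking what mass accuracy separates them, can be illustrated with a minimal sketch. This is not the enumeration code used in this work: the five-residue subset, the function names `count_compositions` and `ppm_difference`, and the restriction to bare residue sums (no terminal groups or charge carrier) are assumptions made purely for illustration. The residue masses are standard monoisotopic values.

```python
# Illustrative subset of residues: (nominal mass, monoisotopic mass) in Da.
# The subset and all names here are hypothetical, chosen for a short example.
RESIDUES = {
    "G": (57, 57.02146),
    "A": (71, 71.03711),
    "N": (114, 114.04293),
    "Q": (128, 128.05858),
    "K": (128, 128.09496),
}


def count_compositions(nominal_mass, residues=RESIDUES):
    """Count residue compositions (order ignored) that sum to a nominal mass.

    This is the classic coin-change recurrence: residues are processed one
    at a time, so each multiset of residues is counted exactly once.
    """
    ways = [0] * (nominal_mass + 1)
    ways[0] = 1  # the empty composition
    for nominal, _exact in residues.values():
        for total in range(nominal, nominal_mass + 1):
            ways[total] += ways[total - nominal]
    return ways[nominal_mass]


def ppm_difference(mass_a, mass_b):
    """Relative separation of two exact masses in parts per million."""
    return abs(mass_a - mass_b) / min(mass_a, mass_b) * 1e6


if __name__ == "__main__":
    # Nominal mass 128 is reached by Gly+Ala, by Gln and by Lys.
    print(count_compositions(128))
    # Gln and Lys are isobars at nominal mass 128 but differ in exact mass.
    print(round(ppm_difference(RESIDUES["K"][1], RESIDUES["Q"][1]), 1))
```

In this toy set, Gly+Ala and Gln share the same elemental composition and are therefore isomers that no mass accuracy can separate, whereas Gln and Lys are isobars separated by roughly 280 ppm.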


Figure 1-1. Mechanistic scheme showing oxazolone b fragment formation, followed by cyclization into a macrocycle and loss of sequence information for the peptide.
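The loss of sequence information on ring opening can be made concrete with a short sketch. It treats the macrocycle as a circular string, opens it at every position, and collects the b-type prefixes that do not belong to the original b series. The example sequence, the function names and the assumption that every ring-opening site is equally accessible are illustrative only and say nothing about the actual energetics. The m/z of a singly protonated b ion is taken as the sum of residue masses plus one proton.

```python
# Monoisotopic residue masses (Da) for the residues used in the example.
RESIDUE_MASS = {
    "G": 57.02146,
    "A": 71.03711,
    "L": 113.08406,
    "F": 147.06841,
    "Y": 163.06333,
}
PROTON_MASS = 1.007276


def b_ion_mz(sequence):
    """m/z of a singly protonated b ion with the given residue sequence."""
    return sum(RESIDUE_MASS[r] for r in sequence) + PROTON_MASS


def ring_openings(sequence):
    """All linear sequences obtained by opening the head-to-tail macrocycle."""
    return sorted({sequence[i:] + sequence[:i] for i in range(len(sequence))})


def b_fragments(sequence):
    """N-terminal prefixes (b-type fragments) shorter than the sequence."""
    return {sequence[:k] for k in range(1, len(sequence))}


def permuted_b_fragments(sequence):
    """b-type fragments that arise only after cyclization and reopening."""
    direct = b_fragments(sequence)
    scrambled = set()
    for opened in ring_openings(sequence):
        scrambled |= b_fragments(opened)
    return scrambled - direct


if __name__ == "__main__":
    for fragment in sorted(permuted_b_fragments("YAGFL"), key=len):
        print(fragment, round(b_ion_mz(fragment), 4))
```

For a homopolymer such as an oligoglycine, every reopened sequence is identical to the original, so the set of permuted fragments is empty even though cyclization may occur.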


CHAPTER 2
EXPERIMENTAL AND COMPUTATIONAL TECHNIQUES

2.1 Background

While the majority of the results and discussions presented in this dissertation come from a variety of computational approaches, these are contrasted with a range of experimental results that were recorded, obtained and/or cited, and subjected to data mining and analysis. These results, despite being presented in several subsequent chapters, share common experimental techniques and procedures. It is therefore preferable to present these techniques and procedures together in a single experimental chapter, rather than in a separate experimental section in each of the later project chapters. Some theoretical background and the computational methods derived from it are likewise used throughout the projects, and for the same reason they are presented here. The following sections cover the physics and chemistry behind these techniques and approaches, the mechanisms and instrumentation in detail, the actual experimental facilities and setups, and the running parameters of the actual experiments, where applicable. The coverage starts with a general introduction to mass spectrometry (MS), followed by more detail on the kinds of mass spectrometers used in the underlying experiments, namely time-of-flight (TOF) MS and Fourier transform ion cyclotron resonance (FT-ICR) MS. It then extends to collision induced dissociation (CID), the major ion dissociation technique used for protein and peptide sequencing. Infrared multiple photon dissociation (IRMPD) is then introduced, with a brief explanation of its theoretical mechanism and its application in MS experiments. The laser light sources used in the underlying IRMPD experiments, namely the free electron laser (FEL), are


also briefly introduced. Background is also given on the apparatus involved in acquiring the IRMPD spectroscopy data. The theoretical and computational aspects cover the concept of density functional theory (DFT), the Gaussian software used for DFT calculations, and molecular dynamics simulation with the AMBER (Assisted Model Building with Energy Refinement) force field and software.

2.2 Electrospray Ionization (ESI)

Electrospray ionization (ESI) was first introduced for use with mass spectrometry by Fenn et al. in 1984 [28], and has since seen wide application in biomolecule MS [29,30]. Compared to other ionization techniques, ESI can be conveniently coupled to analytical separation techniques such as high performance liquid chromatography (HPLC) [31], which makes on-line analytical operation much more efficient. This advantage, together with ESI's capability to ionize large biomolecules, has made it the most widely used ionization technique in protein sequencing and related research. A schematic diagram of how ESI works is shown in Figure 2-1. The sample solution is prepared by dissolving the analyte in a volatile solvent, typically 50:50 methanol:H2O, at a concentration between 10⁻³ and 10⁻⁹ M. This solution is pumped through a fine needle tip directed at a capillary, which is the entrance to the mass spectrometer. A high voltage difference, in the range of 2000 to 4000 V, is applied between the needle and the capillary. The resulting electric field causes the formation of a Taylor cone, from which a spray of fine charged droplets is generated. Heating of the capillary aids evaporation of solvent from the droplets, which then undergo Rayleigh explosions as the repulsion between the positively charged ions overcomes the surface tension of the droplets. Eventually, the charged ions are transferred into the gas phase, after which they can be guided by the ion optics of the mass spectrometer.


2.3 Mass Spectrometry

2.3.1 Introduction

A mass spectrometer measures the mass-to-charge ratios of charged particles (ions) in a sample. The abundances of the ions are often plotted against their mass-to-charge ratios, forming a "mass spectrum". Mass spectrometry involves ionizing atoms and/or molecules, transferring these ions into the gas phase, and eventually guiding them into a mass analyzer for mass analysis. While the concept of separating charged particles by their mass-to-charge ratio was experimentally validated in the late 19th century, in recent decades mass spectrometry has become a dominant analytical technique because of its high sensitivity and accuracy as well as its wide range of applications. Mass spectrometers have seen great improvements over the past century, with different designs and detection mechanisms introduced for detecting ions and analyzing their masses. Fundamentally, however, all such designs are based on the different behaviors of charged particles in electromagnetic fields. The experimental data presented in the subsequent chapters involve two types of mass spectrometers, the time-of-flight (TOF) and Fourier transform ion cyclotron resonance (FT-ICR) mass spectrometers, which are briefly introduced in the following subsections.

2.3.2 Time of Flight (TOF) Mass Spectrometer

2.3.2.1 Mechanism

In time-of-flight mass spectrometry, ions are accelerated by an electric field, and the time it takes the ions to travel a certain distance is recorded. More specifically, an ion with mass $m$ and charge $q$ that is accelerated by an electric voltage $V$ acquires a velocity $v$ and a kinetic energy $E_k$ following the equations below:


\[ E_k = qV = \tfrac{1}{2} m v^2 \tag{2-1} \]

\[ v = \sqrt{\frac{2qV}{m}} \tag{2-2} \]

By accelerating an ion of mass $m$ and charge $q$ with a fixed voltage $V$ and measuring its final speed $v$, the mass-to-charge ratio (i.e., $m/q$, also referred to as $m/z$) can be readily obtained. For the velocity measurement, the charged ion undergoes a free flight over a given travel distance $d$, so that by measuring the time $t$ it takes to cover the distance, the traveling velocity can be calculated and the mass-to-charge ratio determined:

\[ v = \frac{d}{t} \tag{2-3} \]

\[ t = d \sqrt{\frac{m}{2qV}} \tag{2-4} \]

\[ \frac{m}{q} = \frac{2Vt^2}{d^2} \tag{2-5} \]

The mass spectrometer taking advantage of this measuring mechanism is thus called a "time of flight" (TOF) mass spectrometer.

2.3.2.2 Comparison with other MS

The concept of a TOF MS can be traced back to as early as 1946, with an early prototype dated to 1948 [32]. Unlike many other mass spectrometers, such as quadrupole, ion trap and legacy sector instruments, which obtain mass spectra through stepwise scanning of individual masses, a TOF MS records the mass spectrum of an entire batch of ions by measuring their mass-to-charge ratios in the same measurement. It is thus able to acquire much greater ion abundance and meanwhile be much more efficient in


obtaining the full spectrum. This also means a TOF instrument can efficiently record multiple rounds of mass spectra and average them, to provide even more reliable results [33]. In terms of resolving power, a modern TOF MS can achieve a relative resolution ($m/\Delta m$) as high as 20,000 [34], or 10 ppm in terms of $\Delta m/m$. While this is not as high as state-of-the-art FT-ICR MS, which reaches 1 ppm or better, it is still accurate enough for most MS applications.

2.3.3 Fourier Transform Ion Cyclotron Resonance Spectrometer

2.3.3.1 Overview and Mechanism

Fourier transform ion cyclotron resonance (FT-ICR) mass spectrometers offer the advantages of high mass resolution and accuracy [35,36], as well as an ultra-high vacuum environment. The latter is important in IRMPD experiments, as de-excitation of the ions by collisions is minimized. FT-ICR is based on the principle of measuring cyclotron frequencies in a fixed magnetic field. The ions in the Penning trap (Figure 2-2) are first excited by an oscillating electric field perpendicular to the magnetic field, which increases their cyclotron radius and brings the ions into phase coherence. The frequency of the cyclotron motion is then measured on two opposing plates by induced current detection circuitry. The superposition of multiple sine-wave components can be deconstructed by Fourier transform analysis. Finally, the cyclotron frequency is related to m/z by the cyclotron equation:

\[ f = \frac{qB}{2\pi m}, \tag{2-6} \]

where $f$ is the cyclotron frequency, $B$ is the magnetic field strength, $q$ is the ion charge, and $m$ is the ion mass.
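As a numerical check on the two mass analyzers described in this section, the following sketch evaluates the time-of-flight relation for an ion accelerated through a fixed voltage and the cyclotron frequency of an ion in a magnetic field. The function names, the 20 kV acceleration voltage and the 1 m flight path are hypothetical values chosen for illustration. The 4.7 T field matches the FT-ICR instrument mentioned in Chapter 1. Field-free flight after an idealized acceleration is assumed.

```python
import math

E_CHARGE = 1.602176634e-19   # elementary charge, C
DALTON = 1.66053906660e-27   # unified atomic mass unit, kg


def tof_flight_time(mass_da, charge, voltage, distance):
    """Flight time (s) of an ion accelerated through `voltage` (V) that then
    drifts over `distance` (m), from q*V = m*v^2/2 and v = d/t."""
    mass = mass_da * DALTON
    q = charge * E_CHARGE
    velocity = math.sqrt(2.0 * q * voltage / mass)
    return distance / velocity


def tof_mass_to_charge(flight_time, voltage, distance):
    """m/z (Da per elementary charge) recovered from m/q = 2*V*t^2/d^2."""
    m_over_q_si = 2.0 * voltage * flight_time ** 2 / distance ** 2  # kg/C
    return m_over_q_si * E_CHARGE / DALTON


def cyclotron_frequency(mass_da, charge, b_field):
    """Cyclotron frequency (Hz) from f = q*B / (2*pi*m)."""
    return charge * E_CHARGE * b_field / (2.0 * math.pi * mass_da * DALTON)


if __name__ == "__main__":
    t = tof_flight_time(1000.0, 1, 20000.0, 1.0)
    print(f"TOF of m/z 1000: {t * 1e6:.2f} microseconds")
    print(f"recovered m/z:   {tof_mass_to_charge(t, 20000.0, 1.0):.3f}")
    f = cyclotron_frequency(1000.0, 1, 4.7)
    print(f"cyclotron frequency at 4.7 T: {f / 1e3:.1f} kHz")
```

Under these assumptions an ion of m/z 1000 needs about 16 microseconds to cross the drift region and orbits at about 72 kHz in the 4.7 T field. Halving m/z doubles the cyclotron frequency, which is why the frequency measurement translates directly into a mass measurement.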


The equation is usually given in terms of the angular frequency $\omega$,

\[ \omega = \frac{qB}{m}, \tag{2-7} \]

where the angular frequency is defined as $\omega = 2\pi f$, $q$ is the ion charge, $B$ is the magnetic field strength and $m$ is the ion mass. Figure 2-2 is a schematic representation of an open ICR cell.

2.3.3.2 Advantages of FT-ICR MS

The FT-ICR mass spectrometer differs from other mass analyzers in several respects. First, unlike analyzers whose detection relies on contact between sensors and ions, FT-ICR MS only needs the ions of interest to pass close to the detection plates [37]. Second, ions are resolved not in space or time but solely by their cyclotron motion, so all ions in the ICR cell can be detected simultaneously rather than at different places or times. Also, by using a superconducting magnet to generate the magnetic field, FT-ICR MS can provide unparalleled mass resolution, which makes it even more competitive when large biomolecules are of interest.

2.4 Collision Induced Dissociation (CID)

Collision induced dissociation (CID) [38,39] dates back almost as far as the invention of the first mass spectrometer itself, and has been studied since the first half of the twentieth century. More recently, it has found an important application in the detection, identification and structural analysis of biomolecules. Especially during the last twenty years, gas phase CID has become the prominent dissociation technique for producing fragment ions from precursor ions in tandem mass spectrometers. To initiate CID, the ions are accelerated by an external electric field to gain kinetic energy. Neutral molecules, in most cases inert gases such as helium, nitrogen or argon, are then introduced into the chamber so that the CID ions will


collide with these molecules. Through such collisions, kinetic energy is converted into internal energy of the ions, breaking internal bonds and leading to dissociation of the precursor ion. The convenience and effectiveness of CID have made it the dominant ion dissociation technique in tandem mass spectrometry, especially in bottom-up protein sequencing.

2.5 Infrared Multiple Photon Dissociation (IRMPD)

Because IRMPD is at the center of our experimental IR spectra measurements, its mechanism and implementation are explained in more detail. The dissociation of mass-selected ions with line-tunable CO2 lasers was first reported in the late 1970s [40]. Due to the very limited tuning ranges of CO2 lasers, it was not until the emergence of high-power, widely tunable free electron lasers (FELs) that infrared multiple photon dissociation (IRMPD) spectroscopy of trapped ions became useful. The earlier development of soft ionization techniques for biomolecules, such as electrospray ionization (ESI) and matrix assisted laser desorption/ionization (MALDI), opened up novel avenues for experiments. IRMPD is the dissociation of ions as a consequence of the incoherent absorption of multiple photons, which distinguishes it from single photon or coherent multiphoton dissociation. Reaching the energy threshold for dissociation of most polyatomic molecules normally requires the absorption of tens of infrared photons. For coherent multiphoton dissociation, this would mean the absorption of that number of photons within the same vibrational ladder. However, due to the anharmonic nature of molecular vibrations, the energy differences between levels within the same ladder become smaller as the ladder is climbed, which precludes


efficient coherent absorption/excitation. This limitation is called the anharmonicity bottleneck. During an IRMPD experiment, on the other hand, the absorption of each infrared photon is incoherent. The anharmonic coupling between different vibrational modes diffuses the energy absorbed from a single infrared photon into the bath of background vibrational states of the molecule, a process referred to as intramolecular vibrational redistribution (IVR). For large molecules with a sufficiently high density of states, IVR quickly transfers the population from the excited state into the background states, so that the molecule is ready for the next single photon absorption, which makes the photon absorption events incoherent. Historically, IRMPD was first reported in the late 1970s, when mass-selected ions were dissociated with line-tunable CO2 lasers, whose tunable range is very limited. The emergence of the free electron laser, with both high irradiation power and a wide tunable range, then made IRMPD spectroscopy of trapped ions broadly practical, especially when coupled with the ESI and MALDI ionization techniques, which make large gas phase biomolecules available to the technique. Experimentally, IRMPD spectra are recorded by scanning the laser irradiation frequency in a stepwise manner. At each step the ions of interest are held in the ICR cell of the FT-ICR mass spectrometer, irradiated for a fixed length of time, and then mass analyzed. When the irradiation is in resonance with a vibrational mode of the ion, stronger absorption occurs and more ions are dissociated through the IRMPD pathway, reducing the precursor ion population and increasing that of the product ions. By plotting either the abundance of precursor ion (the


"depletion spectrum") or the ratio between product and precursor ions (the "IRMPD yield spectrum") against the irradiation frequency, the IR spectrum of the underlying ion can be obtained.

2.6 Free Electron Laser and FELIX

2.6.1 Introduction

For IRMPD experiments, the absorption of many (tens to hundreds of) photons is needed to overcome the dissociation threshold, which in turn requires a powerful laser source. The emergence of free electron lasers, with both high laser power and a wide range of tunable wavelengths, truly enables full spectroscopic analysis in IRMPD experiments. FELs [41,42] use a relativistic electron beam as the lasing medium, which moves freely through a magnetic structure, hence the term "free electron". Figure 2-3 shows the schematic of operation of a free electron laser. To create the laser light, electrons are accelerated to relativistic speed (near the speed of light) and pass through the FEL oscillator, where magnets with alternating poles produce a periodic, transverse magnetic field. The beam of electrons moves in phase with the emitted light, and both fields add coherently. Unlike in conventional undulators, in which the electrons radiate individually, the electron beam in an FEL consists of bunches that emit coherently and continue to radiate in phase with each other, resulting in higher laser intensity. The wavelength of an FEL can be conveniently tuned by adjusting the electron beam energy and the magnetic field strength.
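The dependence of the FEL wavelength on beam energy and magnetic field can be sketched with the standard resonance condition for a planar undulator, $\lambda = \frac{\lambda_u}{2\gamma^2}\left(1 + \frac{K^2}{2}\right)$ with $K = \frac{e B_0 \lambda_u}{2\pi m_e c}$. This relation is textbook FEL physics and is not quoted from the text above. The function names and the beam energy, peak field and undulator period used below are hypothetical illustration values, not FELIX specifications.

```python
import math

E_CHARGE = 1.602176634e-19       # elementary charge, C
ELECTRON_MASS = 9.1093837015e-31  # electron rest mass, kg
LIGHT_SPEED = 299792458.0         # speed of light, m/s
ELECTRON_REST_MEV = 0.51099895    # electron rest energy, MeV


def undulator_parameter(peak_field, period):
    """Dimensionless undulator parameter K for peak field (T) and period (m)."""
    return E_CHARGE * peak_field * period / (
        2.0 * math.pi * ELECTRON_MASS * LIGHT_SPEED
    )


def fel_wavelength(kinetic_energy_mev, peak_field, period):
    """Fundamental wavelength (m) from the planar-undulator resonance
    condition, lambda = period / (2 gamma^2) * (1 + K^2 / 2)."""
    gamma = 1.0 + kinetic_energy_mev / ELECTRON_REST_MEV
    k = undulator_parameter(peak_field, period)
    return period / (2.0 * gamma ** 2) * (1.0 + k ** 2 / 2.0)


if __name__ == "__main__":
    # Hypothetical settings: 25 MeV beam, 0.2 T peak field, 65 mm period.
    base = fel_wavelength(25.0, 0.2, 0.065)
    print(f"{base * 1e6:.1f} micrometers")
    # Higher beam energy shortens the wavelength; a stronger field lengthens it.
    print(f"{fel_wavelength(40.0, 0.2, 0.065) * 1e6:.1f} micrometers")
    print(f"{fel_wavelength(25.0, 0.3, 0.065) * 1e6:.1f} micrometers")
```

With these illustrative settings the output falls in the mid-infrared, and the two tuning knobs act in opposite directions: raising the beam energy shortens the wavelength, while raising the magnetic field lengthens it.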

PAGE 36

2.6.3 FELIX facility

The IRMPD experiments shown here were performed with the Free Electron Laser for Infrared eXperiments (FELIX) located at the FOM Institute for Plasma Physics [43]. Note that the facility has since been moved to Radboud University in Nijmegen, The Netherlands. The specifications of the FELIX light source are shown in Table 2-1 [44].

2.7 Density Functional Theory

2.7.1 Overview

Density functional theory (DFT) is one of the two most popular theoretical approaches in quantum chemistry, and is widely used across computational chemistry and physics. In general, DFT is a quantum theory that describes the electronic structure of a many-body system using functionals of the electron density, within the Kohn-Sham framework [45]. In this framework, a many-body system of interacting electrons in an external potential is mapped onto a system of non-interacting electrons moving in an effective potential [46]. The overall energy of such a system is expressed as a functional of the electron density,

\[ E[\rho] = T_s[\rho] + \int v_{ext}(\mathbf{r})\,\rho(\mathbf{r})\,d\mathbf{r} + V_H[\rho] + E_{xc}[\rho] \tag{2-8} \]

where $T_s$ is the Kohn-Sham kinetic energy,

\[ T_s[\rho] = \sum_{i=1}^{N} \int \phi_i^{*}(\mathbf{r}) \left( -\frac{\hbar^{2}}{2m}\nabla^{2} \right) \phi_i(\mathbf{r})\,d\mathbf{r} \tag{2-9} \]

$v_{ext}$ is the external potential acting on the interacting system, $E_{xc}$ is the exchange-correlation energy and $V_H$ is the Coulomb energy,

PAGE 37

\[ V_H[\rho] = \frac{e^{2}}{2} \iint \frac{\rho(\mathbf{r})\,\rho(\mathbf{r}')}{|\mathbf{r}-\mathbf{r}'|}\,d\mathbf{r}\,d\mathbf{r}' \tag{2-10} \]

2.7.2 DFT Calculation with Gaussian 03

During the research covered in this dissertation, several types of quantum-chemical calculations were carried out using Gaussian 03, the dominant commercial software package in computational chemistry. These calculations include: DFT geometry optimizations using the B3LYP functional [47] at different levels of theory, to locate the local energy-minimized structures considered as candidate structures for particular peptide and fragment ions; Hartree-Fock electrostatic potential calculations with the 6-31G* basis set, to build the molecular dynamics force field for novel structures; and DFT frequency calculations at the B3LYP/6-31G* level, to generate theoretical IR spectra for the candidate structures so that they can be matched to the experimental IR spectra to give structural information. All calculations were run on the University of Florida High Performance Clusters (UFHPC) [48].

2.8 Molecular Dynamics Simulation with AMBER

2.8.1 Overview

Molecular dynamics is the computational simulation of the physical movement of microscopic particles such as atoms, molecules and ions. In principle, such a simulation propagates the motion of particles whose interactions with external potentials are represented by molecular dynamics force fields. Molecular dynamics simulation has seen wide use in biochemistry research, where very large biomolecules can be simulated at affordable computational cost while maintaining reasonably high accuracy. In the work presented in this dissertation, molecular dynamics simulation has been used to search the

PAGE 38

conformational space in order to find global energy-minimized structures, to perform preliminary geometry optimizations, and to calculate the potentials of mean force (PMF) for a series of positively charged ions. These calculations were done using the AMBER molecular dynamics simulation software.

2.8.2 Molecular Dynamics Simulations

Because electrons are orders of magnitude lighter than nuclei, the Born-Oppenheimer approximation can be made, so that the energy of a molecular system can be calculated as a function of the nuclear coordinates alone, without treating the electron motions explicitly. This lays the foundation of molecular mechanics. Molecular mechanics considers five contributions to simulate the interactions: bond stretching, bond angle bending and bond torsion, which are bonded interactions, and van der Waals and electrostatic interactions, which are non-bonded. The potential energy of a molecular mechanics system can thus be expressed with terms corresponding to these interactions, as shown in Eq. 2-11.

\[ V = \sum_{bonds} k_b (r - r_0)^{2} + \sum_{angles} k_\theta (\theta - \theta_0)^{2} + \sum_{torsions} \frac{V_n}{2}\left[1 + \cos(n\phi - \gamma)\right] + \sum_{i<j} \left[\frac{A_{ij}}{r_{ij}^{12}} - \frac{B_{ij}}{r_{ij}^{6}}\right] + \sum_{i<j} \frac{q_i q_j}{\varepsilon\, r_{ij}} \tag{2-11} \]

These five interactions are shown schematically in Figure 2-4.

2.8.3 AMBER (Assisted Model Building with Energy Refinement)

AMBER [49,50,51] is developed jointly by research groups in universities and industry, including Rutgers University, the University of Utah, Stony Brook University, the University of California at Irvine, the University of Florida and Encysive Pharmaceuticals; the project was originally led by Peter

PAGE 39

Kollman (UCSF). Over the decades it has become one of the most powerful and widely used molecular mechanics simulation packages, and it is publicly available.

The software consists of two major parts. The first is a library of common atoms, ligands and molecules for the simulation of biomolecules; for instance, peptide sequences can be constructed from amino acid residues that are pre-defined in the AMBER package. The second is the molecular dynamics simulation program package, which includes source code, binary executables and program demos. During AMBER MD simulations, atoms in molecules or charged ions are treated as point charges to simulate electrostatic interactions, hydrogen bonding and similar effects, while restraints on bond distances, bond angles and dihedral angles are also taken into account, so that a complete set of MD force constants is defined. The MD simulation is then performed on these defined atoms, which interact with one another according to charges derived from the restrained electrostatic potential (RESP). Note that, in contrast to ab initio or DFT approaches, AMBER is not capable of making or breaking bonds.
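To make the five interaction types of Section 2.8.2 concrete, the following is a minimal sketch of how each term could be evaluated for a single interaction and summed. It is illustrative only and is not AMBER code: the function names, the argument layout, and the 12-6 van der Waals form written in terms of a well depth and minimum-energy distance are assumptions made here, and the Coulomb conversion factor is an approximate value for units of kcal/mol, angstrom and elementary charge.

```python
import math

# Approximate Coulomb conversion factor in kcal*angstrom/(mol*e^2) (assumed value).
COULOMB = 332.06

def bond_energy(r, r0, k):
    """Harmonic bond stretching: k * (r - r0)^2."""
    return k * (r - r0) ** 2

def angle_energy(theta, theta0, k):
    """Harmonic angle bending (angles in radians): k * (theta - theta0)^2."""
    return k * (theta - theta0) ** 2

def torsion_energy(phi, vn, n, gamma):
    """Periodic torsion term: (Vn / 2) * [1 + cos(n*phi - gamma)]."""
    return 0.5 * vn * (1.0 + math.cos(n * phi - gamma))

def lennard_jones(r, epsilon, rmin):
    """12-6 van der Waals term with well depth epsilon at distance rmin."""
    ratio6 = (rmin / r) ** 6
    return epsilon * (ratio6 ** 2 - 2.0 * ratio6)

def coulomb(r, qi, qj):
    """Point-charge electrostatics in vacuum."""
    return COULOMB * qi * qj / r

def mm_energy(bonds=(), angles=(), torsions=(), pairs=()):
    """Sum all terms; each entry is a tuple of the arguments of its term.

    pairs holds (r, epsilon, rmin, qi, qj) for each non-bonded atom pair.
    """
    energy = sum(bond_energy(*b) for b in bonds)
    energy += sum(angle_energy(*a) for a in angles)
    energy += sum(torsion_energy(*t) for t in torsions)
    energy += sum(lennard_jones(r, eps, rmin) + coulomb(r, qi, qj)
                  for r, eps, rmin, qi, qj in pairs)
    return energy
```

For example, `mm_energy(bonds=[(1.6, 1.5, 300.0)])` evaluates a single bond stretched by 0.1 angstrom from its equilibrium length. Because every term depends only on fixed connectivity and distances, a model of this kind cannot describe bond making or breaking, as noted above.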

PAGE 40

Table 2-1. Specifications of the FELIX light source.

Parameter                      Value
wavelength range               3-250 µm
continuous tuning range        factor 3
micropulse energy              1-50 µJ
micropulse power               0.5-100 MW
micropulse repetition rate     1 GHz or 25 MHz
macropulse repetition rate     5 (10) Hz
micropulse duration            6-100 optical cycles
macropulse duration            < 10 µs
spectral bandwidth             0.4-7 %
polarization (linear)          > 99 %

Figure 2-1. Schematic demonstration of ESI mechanism.

PAGE 41

Figure 2-2. The geometry of an open, cylindrical ICR cell.

Figure 2-3. Schematic of free electron laser operation.

Figure 2-4. The bonded and non-bonded interactions considered in molecular mechanics simulation.

PAGE 42

CHAPTER 3
TRENDS IN FAVORING/DISFAVORING MACROCYCLE FORMATION: SEQUENCE LENGTH, TORSIONAL RESTRICTION, AND BASIC RESIDUES

3.1 Scope and Motivation

Protein sequencing is at the center of modern proteomics research, and tandem mass spectrometry (MS-MS) via collision-induced dissociation (CID) has become the dominant experimental approach for sequencing enzymatically digested peptides. Chapter 1 summarized the experimental approaches employed in mass spectrometry-based sequencing, as well as the potential problems posed by sequence scrambling pathways. Prior studies have given evidence for sequence scrambling in select peptide systems [15-21], have hypothesized reaction pathways that rationalize sequence scrambling [14,15], and have provided experimental evidence for intermediates proposed in those reaction pathways. In particular, infrared photodissociation spectroscopy has provided compelling evidence for the presence of head-to-tail macrocyclic peptide fragment structures that are key in sequence scrambling [52,53].

This Chapter attempts to rationalize some of the trends in the head-to-tail macrocyclization chemistry of b fragments (depicted in Figure 3-1), as confirmed by IR spectroscopy (and other structural probes), by making use of molecular dynamics modeling. For the sake of clarity, the prior experimental IR spectroscopy results are summarized in the next subsection, followed by the computational approaches and results presented here.

3.1.1 Prior IR Spectroscopy Studies

In prior studies by Xian et al., to which this author also contributed, experimental IR spectroscopy and gas-phase H/D exchange results had indicated that longer

PAGE 43

sequences gave rise to enhanced head-to-tail macrocyclization. While these results have been published elsewhere [22], the key plots are briefly reproduced here for convenience. Figure 3-2 [22] shows the HDX results for oligoglycine bn fragments, which depict the change in composition of the oxazolone/macrocycle mixture as the size of the peptide increases. Note that the b2 ions have only a single reaction rate, suggesting a single isomer. IRMPD spectra have also been recorded for the b2, b5 and b8 ions, and their theoretical counterparts have been calculated by DFT at the B3LYP/6-31G* level of theory for both oxazolone and macrocycle candidate structures. The IR spectrum of the G2 peptide, which structurally resembles the b2 ions, has also been recorded as a reference to determine the isomer structure of the b2 ions. The b5 G8 spectrum is shown in Figure 3-3 as an example. In Figure 3-3, the signature C=O stretching mode unique to the oxazolone structure and the CO-H+ bending mode unique to the macrocycle structure provide evidence for a mixture of both structures. The experimental IR spectra and their theoretical counterparts, together with the HDX results, give quantitative information about the oxazolone/macrocycle composition of each of the oligoglycine b ions, as shown in Table 3-1. Table 3-1 shows a gradual increase in the proportion of macrocycle structures with increasing peptide size, suggesting a size effect in macrocycle formation.

A more pronounced shift from oxazolone to macrocycle structures as a function of peptide length was observed in peptide b fragments with related sequence motifs [54], as shown in Figure 3-4 [54].

PAGE 44

In another study by Tirado et al. [55], the impact of the proline residue on macrocycle formation was investigated. As shown in Figure 3-5 [55], experimental IRMPD results indicated that the proline-containing b6 and b7 ions of QPFGLM and QPWFGLM, respectively, contain considerable oxazolone populations, whereas the original b6 ions of QWFGLM are exclusively macrocyclic. In other words, the proline residue suppresses macrocycle formation.

In unpublished results from Tirado et al. [56], shown in Figure 3-6 and Figure 3-7, the role of the basic residue arginine in head-to-tail macrocyclization was investigated. While the original b6 ions of QWFGLM are purely macrocyclic, the b5 and b6 ions of QRFGL and QRFGLM, respectively, both show strong oxazolone signature modes, indicating the presence of oxazolone structures. This is similar to the situation for b6 QPFGLM and b7 QPWFGLM. However, when arginine substitutes residues at other positions, no clear peak is recorded in the oxazolone target region, suggesting macrocycle structures, as for QWFGLM.

The systematic studies above indicate the role of molecular structure in head-to-tail macrocyclization, such as sequence length, the potential role of torsionally restricted residues (e.g. proline), and the role of basic residues. The explanations put forward for these trends have so far been largely hypothetical. This Chapter aims to provide a deeper understanding, based on energetic considerations from molecular dynamics simulations.

3.1.2 Planned Approach

While the above experimental results have demonstrated the impact of molecular structure on macrocycle formation, it is worthwhile to approach the problem from the theoretical side and seek explanations for this influence.

PAGE 45

According to the scrambling mechanism presented in Chapter 1, macrocycle formation starts with the nucleophilic attack from the N-terminus nitrogen to the oxazolone ring oxygen. For this attack to happen, a linear oxazolone structure has to bend sufficiently to bring the N- and C-termini into close proximity, as the nucleophilic attack can happen only at short distance. Such bending means overcoming a considerable energy penalty. It is therefore postulated that this energy penalty, which differs with molecular structure, plays a role in whether the macrocycle structure is favored. To test this postulate, the energy penalties have to be calculated for the structures whose experimental results have been reported, in this case via molecular dynamics simulations.

In order to study the molecular strain as a function of the distance between the N-terminus and the oxazolone ring, an umbrella sampling method is implemented, which applies an external biasing potential along the N/C terminal distance coordinate and thus allows sufficient sampling at close distances. The weighted histogram analysis method (WHAM) is then used to remove the effect of the applied bias and calculate the potential of mean force (PMF) along the distance coordinate. Note that standard canonical MD simulations are less useful in this instance, as the sampling at close distances would be expected to be insufficient because of the molecular strain.

With these considerations, umbrella sampling and WHAM-based PMF calculations are planned for all structures discussed above, namely the bn ions of oligoglycines, the b4 and b6 ions of [YAG]n for all iterated sequences, and the b ions of QWFGLM as well as its proline and arginine modifications. Table 3-2 summarizes all the b ions considered for PMF calculation.
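The umbrella sampling and de-biasing steps outlined above can be sketched in a few lines. The block below is a minimal, self-contained illustration under stated assumptions and is not the software used in this work: the function names are hypothetical, the bias is taken as k(x - x0)^2 without a factor of 1/2, kT is set to roughly 0.596 kcal/mol (about 300 K), and the per-window histograms are assumed to be already binned on a common grid. It iterates the standard WHAM self-consistency equations until the window normalization factors stop changing.

```python
import math

KT = 0.596  # k_B * T in kcal/mol near 300 K (assumed value)

def harmonic_bias(x, center, k):
    """Umbrella bias for one window: k * (x - center)^2 (assumed convention)."""
    return k * (x - center) ** 2

def wham(counts, centers, bin_mids, k, kT=KT, tol=1e-10, max_iter=100000):
    """Estimate the PMF from biased histograms.

    counts[i][j] is the number of samples from window i falling in bin j.
    Returns the PMF per bin in the units of kT, shifted so its minimum is 0.
    """
    n_win, n_bin = len(counts), len(bin_mids)
    n_samples = [sum(row) for row in counts]
    # Biasing factors c_ij = exp(-V_i(x_j) / kT).
    c = [[math.exp(-harmonic_bias(bin_mids[j], centers[i], k) / kT)
          for j in range(n_bin)] for i in range(n_win)]
    total_counts = [sum(counts[i][j] for i in range(n_win)) for j in range(n_bin)]
    f = [1.0] * n_win
    for _ in range(max_iter):
        # Unbiased probability estimate for each bin.
        p = [total_counts[j] / sum(n_samples[i] * f[i] * c[i][j] for i in range(n_win))
             for j in range(n_bin)]
        # Updated normalization factor for each window.
        f_new = [1.0 / sum(c[i][j] * p[j] for j in range(n_bin)) for i in range(n_win)]
        change = max(abs(math.log(a / b)) for a, b in zip(f_new, f))
        f = f_new
        if change < tol:
            break
    norm = sum(p)
    pmf = [-kT * math.log(x / norm) if x > 0 else float("inf") for x in p]
    lowest = min(pmf)
    return [v - lowest for v in pmf]
```

A typical call would pass one histogram row per umbrella window together with the window centers and the force constant, for example `wham(counts, centers, bin_mids, k=5.0)`. The iteration only converges to a meaningful profile when neighboring windows overlap, which is why closely spaced windows along the distance coordinate are needed.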

PAGE 46

The PMFs for these structures are calculated along the reduced coordinate, i.e., the distance between the N-terminus nitrogen and C-terminus oxygen, where the nucleophilic attack occurs. A schematic diagram showing this distance is presented in Figure 3-8. If necessary, PMFs along other coordinates are also calculated. The results are compared with the experimental results and conclusions, in the hope of rationalizing the effect of molecular strain and structure on macrocycle formation.

3.2 Computational Methods

3.2.1 Overview

The main focus of this section is on how to calculate the potential energy surface (PES) for oxazolone peptides by scanning interatomic distances of interest. While in principle this calculation is done through molecular dynamics simulation with the AMBER software, the actual implementation is not straightforward and involves considerable special treatment, because the structures, and especially substructures such as the oxazolone ring, are novel to AMBER and not readily available, thus requiring in-house model building and parameterization. The following subsections explain the details of this model building and parameterization, as well as the PES calculation.

3.2.2 Molecular Dynamics Model Building

3.2.2.1 Overview

Before a PMF calculation can be initiated in AMBER, the geometry and molecular dynamics force field constants of the underlying molecules/ions have to be prepared. This preparation is referred to as MD model building. In general it is done through the interplay of three software packages: HyperChem for initial structure building, Gaussian for

PAGE 47

fine structure optimization and electrostatic potential calculation, and AMBER for final parameterization.

3.2.2.2 HyperChem

HyperChem is commercially available from Hypercube, a company based in Gainesville, FL. The software offers a variety of chemistry-related functions, but its core strength is the ease of building and visualizing biomolecular structures. The relatively intuitive adding, removing and modification of atoms, bonds, bond angles and dihedral angles and, more importantly, a large database of common substructures such as amino acid residues make it a powerful tool for building peptide fragments with more unusual chemical moieties, such as oxazolone rings. In addition to structure building, HyperChem can carry out structural calculations at the molecular dynamics, semi-empirical and ab initio levels. Being single-machine, Windows-based software, it cannot be conveniently run on the UF high performance clusters, which limits its use for expensive quantum-chemical calculations at high levels of theory. Nonetheless, it can conveniently optimize the geometries of initial, manually built input structures with MD and semi-empirical (AM1, as used in this research) methods, paving the way for further DFT geometry optimization.

The building of oxazolone structures in HyperChem works as follows. For a given sequence, a preliminary backbone structure is first made using the amino acid database provided by HyperChem. Modifications are then made to the backbone peptide, including the addition of a proton, modification of bond types and angles, rearrangement to form the oxazolone ring, and setting of the charge state. Once the modified structures are topologically identical to the target structures, geometry optimizations are performed, first at the molecular dynamics level and then at the AM1 semi-empirical level.

PAGE 48

The optimized structures are saved in standard PDB (Protein Data Bank) format for subsequent optimizations.

3.2.2.3 Gaussian calculations

Gaussian is the dominant commercially available software package for ab initio calculations, as briefly discussed in Chapter 2. While its capabilities cover a wide variety of areas in chemistry, it is primarily used in this research for geometry optimization and electrostatic potential calculation. First, the geometry resulting from the HyperChem optimization is further optimized by density functional theory with the B3LYP (Becke 3-Parameter, Lee, Yang and Parr) functional and the 6-31G* basis set, to obtain a locally minimized structure. This structure is used as the initial structure for the MD simulation, and as the basis for the electrostatic potential calculation and the MD force field building. Besides the geometry optimization, an ab initio calculation of the electrostatic potential is carried out at the HF/6-31G* level of theory. Results from these calculations are used during the MD parameterization of the underlying ions. More discussion is presented in the AMBER subsection.

3.2.2.4 AMBER calculation

As a powerful molecular dynamics simulation tool, AMBER has been used in this research first for force field parameterization, and then for the umbrella sampling and WHAM PMF calculations.

Force field model building

As follows from the MD simulation procedure described above, the proper force constants for the ions of interest have to be prepared before any simulation can start. While AMBER does have force field constants for most legacy molecules/ions or

PAGE 49

substructures in its database that can be used to build whole molecules/ions, no ready-made parameters exist for the novel oxazolone structures. To handle such structures, AMBER includes the antechamber package, which calculates the force constants necessary for MD simulation. The calculation is based on external ab initio results, namely the geometry and electrostatic potentials (ESP) of the underlying molecules. By reading in the geometry and ESP information and augmenting them with AMBER built-in parameters, the parameterization of novel molecules and ions can be performed, resulting in input files ready for the subsequent PES calculation.

3.2.3 Potential Calculation with WHAM

The potentials of mean force (PMF) are calculated using umbrella sampling with a series of biasing potentials, so that sufficient sampling can be achieved in the high-energy region where the distances along the reduced coordinate of interest are very short, followed by the weighted histogram analysis method (WHAM) to reconstruct the PMF. While this approach has been widely applied in molecular dynamics research, a basic introduction to umbrella sampling and WHAM is provided below.

3.2.3.1 Weighted histogram analysis method (WHAM) equations

Let $S$ be the number of biased simulations, and $N_i$ be the number of samples generated from the $i$-th simulation. The samples are discretized in $M$ bins to determine a histogram with respect to the biasing coordinate $\xi$, which is also the reduced coordinate of interest. Let $p_{ij}$ be the estimate of the biased probability in the $j$-th bin of

PAGE 50

the $i$-th simulation, and $p_j^{\circ}$ be that of the unbiased probability in the $j$-th bin; then the relationship between $p_{ij}$ and $p_j^{\circ}$ can be expressed by

\[ p_{ij} = f_i\, c_{ij}\, p_j^{\circ}, \tag{3-1} \]

where $c_{ij}$ is the biasing factor and $f_i$ is the normalizing constant so that $\sum_j p_{ij} = 1$, that is,

\[ f_i = \frac{1}{\sum_j c_{ij}\, p_j^{\circ}}. \tag{3-2} \]

In the case of coordinate biasing,

\[ c_{ij} = \exp\left[-\frac{V_i(\xi_j)}{k_B T}\right], \tag{3-3} \]

where $V_i$ is the biasing potential for the $i$-th simulation. An optimal estimate of $p_j^{\circ}$ can then be given by

\[ p_j^{\circ} = \frac{\sum_i n_{ij}}{\sum_i N_i\, f_i\, c_{ij}}, \tag{3-4} \]

where $n_{ij}$ is the number of counts in histogram bin $j$ from simulation $i$, $N_i$ is the total number of samples from simulation $i$, and $f_i$ is the normalizing factor,

PAGE 51

\[ f_i = \frac{1}{\sum_j c_{ij}\, p_j^{\circ}}. \tag{3-5} \]

The above two equations can be used to calculate a reasonable estimate of the PMF, and are referred to as the WHAM equations.

3.2.3.2 WHAM calculation parameters

This subsection gives the details of the actual WHAM calculations for the PMFs of interest, especially the molecular dynamics simulation parameters. The PMF umbrella sampling was carried out along the distance between pairs of atoms of interest, such as between the N-terminal N and the C-terminal oxazolone C, for all peptides of interest. First, 1000 steps of steepest descent minimization were performed, followed by 3000 steps of conjugate gradient minimization. Then, a 20 ps heating process was simulated to heat the system from 0 K to 300 K, followed by a 5 ns equilibration of the structure. Next, umbrella sampling was performed in 34 individual windows, with distances ranging from 1.0 to 17.5 Å in 0.5 Å intervals. Each window comprised 1 ns of equilibration and 5 ns of sampling. A harmonic potential of 5 kcal mol⁻¹ Å⁻² was applied between the two reactive atoms in the umbrella sampling simulations, and data points were stored every 10 fs. In total, 1.7 million data points were collected for the final PMF umbrella sampling for each peptide investigated. All molecular dynamics simulations were performed with a step size of 1 fs. The Langevin algorithm was used for temperature control, with the collision frequency set to 2.0 ps⁻¹.

Since the classical model cannot break chemical bonds, owing to the harmonic potential approximation, structures with distances of less than 2.0 Å, or larger than

PAGE 52

certain values could not be found in the trajectories of the umbrella sampling process, thus constraining the distance ranges for the PMF free energy profiles.

3.3 Results and Discussion

3.3.1 Overview

This section presents the computational results for the peptide fragments whose experimental studies were summarized above. Comparisons between experimental and computed results are discussed in terms of the hypotheses put forward.

3.3.2 "b" ions of Gn, [YAG]n and Proline Modified QWFGLM Peptides

3.3.2.1 [YAG]n sequence motifs

The PMF results for the distance between the N-terminal N and the C-terminal oxazolone C for all these fragment ions are shown in Figure 3-9. The interpretation of the IRMPD spectra of the b4 and b6 ions for the related sequence motifs YAG, AYG and GAY is summarized in Table 3-3 (see IRMPD spectra in Figure 3-4). The striking contrast between oxazolone structures for all b4 ions and macrocycles for the b6 ions is seen for all sequence variants. The PMFs for the b4 ions are more elevated than those for the b6 ions at distances shorter than 5 angstrom. This indicates the higher energy penalties that b4 ions have to overcome in order to bend their backbones sufficiently to bring the N- and C-termini into close proximity, so that a nucleophilic attack can be initiated. The higher energy penalty, which reaches a maximum value of 17.31, 16.04 and 19.29 kcal mol⁻¹ for YAG, AYG and GAY, respectively, can be attributed to the shorter b4 length, which results in more significant ring strain compared to b6. The corresponding values for b6 are 17.31, 16.04 and 19.29 kcal mol⁻¹, which compare to computed transition state energies for head-to-tail cyclization of 12 kcal mol⁻¹ for linear b5 ions of YAGFL from

PAGE 53

density functional theory as calculated by Harrison et al. It is important to note that our computational results do not necessarily give an accurate reflection of the true PES for head-to-tail cyclization, given the limitations of this approach at shorter distances, but rather offer a qualitative gauge of the relative ring strain that is induced as the peptide structure rearranges to bring both termini into close proximity.

3.3.2.2 G4 to G8

The results from the previous study by Chen, Yu, et al. on oligoglycine b ions are summarized in Table 3-4. The trend shows a gradual increase in the percentage of macrocycle structures from b4 to b7, followed by a steep jump to 100% macrocycle for b8. This size effect is in qualitative agreement with the analysis of the [YAG]n and related results, but shows a jump from oxazolone to macrocycle for longer sequences.

The PMF results for the b4-b8 series are shown in Figure 3-10. The insets show the regions at shorter distances. While the PMFs of the b4 and, to a lesser extent, b5 ions are higher in energy at closer distances (i.e., <5 angstrom), the PMFs of b6 to b8 are barely separated. While the computational results are in general accordance with the experimental results for the smaller b ions, from b4 to b6, the differences between the larger b ions do not mirror the experimental results. This suggests that the WHAM approach has some limitations in accuracy, and that this simplified approach cannot substitute for more advanced transition state calculations.

3.3.2.3 QWFGLM and proline substitutes

The IRMPD spectra of the b6 ion of QWFGLM, the b6 ion of QPFGLM and the b7 ion of QPWFGLM have been recorded and analyzed by M. Tirado et al. from the Polfer group, and the conclusions on isomer structures are cited in Table 3-5.

PAGE 54

According to these conclusions, the b7 ion of QPWFGLM disfavors macrocycle formation most, followed by the b6 ion of QPFGLM. While both proline-containing peptides demonstrate considerable proclivity for oxazolone structures, the original b6 QWFGLM still favors the macrocycle structure.

The PMF calculation results for these b ions are shown in Figure 3-11. The inset shows the region at shorter distances. At close N-C distances (<5 angstrom), these PMFs demonstrate a trend in accordance with the experimental results. The b7 QPWFGLM (blue) PMF shows the highest energy penalty at close distance, and this ion also has the lowest percentage of macrocycle formation. The b6 QPFGLM PMF falls below it but still lies above that of b6 QWFGLM, which favors macrocycle formation most. In this case, the energy penalty hypothesis fits the existing experimental results well. The comparison between experimental and computational results is also summarized in Table 3-5.

3.3.3 Arginine Peptide Results

The results above appear to validate the general approach of using PMF calculations to evaluate relative ring strain in these peptide fragments, and thus to obtain a qualitative understanding of the ease or difficulty of forming macrocycles. The accuracy of the calculations seems to be limited when comparing larger b ions, particularly b7 and b8. The subsequent studies on arginine-containing b ions are therefore limited to b5 and b6. Internal arginine residues are not expected in tryptically digested peptides, as the cleavage will be at the arginine residue. However, when other enzymes are used for the protein digestion, peptides with internal arginines can still form.

PAGE 55

3.3.3.1 N-C results

b6 ions of R-substituted QWFGLM

The IR spectra of b6 QRFGLM, QWRGLM and QWFRLM have been recorded by M. Tirado et al. from the Polfer group, and are shown in Figure 3-7. The conclusions on isomer structures are shown in Table 3-6. For convenient reference, these b6 ions are labeled b6 R0, b6 R2, b6 R3 and b6 R4, where the number after R denotes the position at which unprotonated arginine substitutes the original residue. Note that b6 R0 refers to no substitution. According to these conclusions, b6 R2 clearly disfavors macrocyclization, whereas all other b ions are still macrocyclic.

PMF calculation results for these b ions are shown in Figure 3-12. The insets give the results at close distances. As a reference, the computational results are compared to their experimental counterparts in Table 3-6. These results show no correlation. Clearly, b6 R4 has a considerably higher energy penalty than the other b6 ions. On the other hand, the PMFs of b6 R0, b6 R2 and b6 R3 are not clearly differentiated from one another, and the three curves even cross between 3 and 4 angstroms. Nonetheless, b6 R0, b6 R3 and b6 R4 are all macrocyclic structures despite having either higher or lower energy penalties than b6 R2, which is experimentally confirmed to adopt an oxazolone structure.

b5 ions of R-substituted QWFGL

The calculated PMFs with respect to the N-C coordinate are presented in a manner similar to the b6 ions, for b5 R2, b5 R3 and b5 R5, as shown in Figure 3-13.

PAGE 56

For convenient reference, the relative ranking of the calculated energies is compared with the corresponding experimental results for these b ions in Table 3-7. As for the b6 ions, the N-C curves fail to show strong separation between the different ions, suggesting that they should behave in the same way. The experimental results show that this is clearly not the case, once again revealing a discrepancy between experimental and computed results.

3.3.3.2 N-H results

Since the N-C results do not suggest unambiguous energy differences among the b5 or b6 arginine peptides, and thus cannot provide a reasonable explanation for the favoring/disfavoring of macrocycle formation in the different arginine peptides, it is worthwhile to turn to an alternative viewpoint. Unlike the other peptides presented in this research, the arginine peptides are unique in that the (unprotonated) arginine residue is a highly basic site, and hence represents a potential site for proton transfer. This raises the possibility of proton transfer from the C-terminal oxazolone ring to the arginine side chain. Once this happens, the oxazolone ring is deactivated, and the nucleophilic attack from the N-terminus can no longer happen, preventing macrocycle formation. The proton transfer thus becomes a competing pathway against macrocycle formation.

It is then reasonable to evaluate the ease of this proton transfer when the arginine residue is located at different positions along the peptide chain. It is expected that macrocycle formation will be disfavored for peptides undergoing this proton transfer, leaving them as oxazolone structures. This evaluation can be done by scanning the PMF along the distance between the proton on the

PAGE 57

oxazolone ring and the nitrogen on the arginine side chain that is the target of the proton transfer, as illustrated in Figure 3-14. This calculation has been done for all b5 and b6 arginine peptides, and the results are presented in the following two subsections.

b6 ions of R-substituted QWFGLM

Unlike the N-C results, the PMF with respect to the N-H coordinate gives very clear separation between b6 R2, b6 R3 and b6 R4, as shown in Figure 3-15. For more convenient reference, the structural conclusions from the experimental results and the energy ranking from this calculation are listed side by side in Table 3-8.

As shown in Figure 3-15 and Table 3-9, b6 R2 has a remarkably lower energy minimum at around 2 angstrom than b6 R3, which is in turn lower than b6 R4. This suggests that b6 R2 is the most likely to undergo the proton transfer and thus disfavors macrocycle formation most, with the other two b6 ions ranked accordingly. Note that AMBER MD simulation is not capable of making or breaking bonds, so the proton transfer cannot be modeled as such. The results nevertheless show that the arginine has a high affinity for interaction with the proton, and that b6 R2 has by far the lowest ring strain to make this happen. This analysis is well supported by the experimental results, as b6 R2 stands as the only oxazolone structure among all arginine peptides.

b5 ions of R-substituted QWFGL

The PMFs for N-H shown in Figure 3-16 and the energy ranking shown in Table 3-9 again demonstrate comparable trends. The b5 R2 curve has a clearly lower energy minimum for proton transfer, while those for b5 R3 and b5 R5 are both higher, although no clear ranking can be established between the two.

Again, these calculations are supported by the corresponding experimental IRMPD spectra. Owing to equipment availability, the IR spectrum of b5 R2 was recorded only in the 1720 to 1900 cm⁻¹ region, yet the strong peak between 1750 and 1800 cm⁻¹ provides strong evidence for the oxazolone structure. The IR spectra of b5 R3 and b5 R5, on the other hand, show little signal in the oxazolone diagnostic region, suggesting the absence of the oxazolone structure and the dominance of the macrocycle counterparts. These observations match the predictions from the N-H PMF calculations and further validate the energetics-based approach to predicting whether macrocycle formation is favored or disfavored.

3.3.4 Limitations of the Underlying Computational Approaches

While the PMF calculations presented in this research do provide qualitative results, and thus possible explanations for their experimental counterparts, the nature of such calculations also leads to several limitations. The underlying molecular dynamics simulation is quite simplistic: it does not account for bond breaking or formation, and it imposes a lower limit on how closely two nonbonded atoms can approach each other. Alternatively, quantum chemical transition state calculations for macrocycle formation could yield more accurate energetics. However, such calculations are potentially very expensive given the size of the molecules. In fact, many transition states would need to be considered, making the calculations nontrivial and requiring considerable expertise in computational approaches.

3.4 Summary

The head-to-tail cyclization chemistry of b ions during collision-induced dissociation (CID) has been investigated using computational approaches. The potential
mean force (PMF) as a function of the distance between the pair of atoms involved in the nucleophilic attack, namely the N-terminal N and the oxazolone ring C, has been calculated for a series of b ions, with the aim of explaining and predicting whether macrocycle formation is favored or disfavored from an energetics point of view. These calculations were done through AMBER molecular dynamics simulations, which utilized umbrella sampling and the weighted histogram analysis method (WHAM). The results were further compared with their experimental IRMPD spectroscopy counterparts.

In the case of the b ions of oligoglycines from G4 to G8 and the b4 and b6 ions of [YAG]n, [AYG]n and [GAY]n, clear size-effect trends have been identified in both the computational and the experimental results. The PMFs at close distance (<2 Å) between the N-terminal N and the C-terminal oxazolone C are notably higher for shorter b ions than for longer ones, suggesting greater energy penalties for bringing the two ends of the peptide together. These shorter peptides also tend to disfavor macrocycle formation according to the IRMPD results.

In the case of the b ions of QWFGLM and two of its proline modifications, QPWFGLM and QPFGLM, the impact of the added torsional rigidity of a proline residue has been assessed. The PMF calculations showed that the b7 ion from QPWFGLM had the highest energy penalty among the three and was structurally mostly oxazolone according to the IRMPD results. The b6 ion from QPFGLM also had a higher penalty than the unmodified b6 QWFGLM, which is consistent with the presence of some oxazolone for the b6 ion from QPFGLM, as opposed to exclusively macrocycle for the QWFGLM sequence motif. The impact of torsional rigidity in terms of preventing
head-to-tail cyclization was thus confirmed, and the PMF calculations turned out to be a useful qualitative tool for evaluating this trend.

A number of b5 and b6 ions were considered with arginine substitutions in the QWFGLM sequence motif. Besides the PMFs between the (N-terminal) N and the (oxazolone) C, those between the H (i.e., proton) on the oxazolone ring and the (unprotonated) N on the arginine side chain were also calculated. While the N-C results showed no consistent trend that could rationalize the experimental data, the N-H PMF results did, suggesting a different mechanism for favoring or disfavoring macrocycle formation. In particular, b6 QRFGLM and b5 QRFGL had lower minima at closer distances (~2 Å), suggesting that these structures are more amenable to proton transfer from the oxazolone ring to the arginine side chain. As those two b ions were also the only ones found to be mostly oxazolone according to the IRMPD results, this suggests that the proton transfer deactivates the N-C nucleophilic attack and thus shuts down macrocycle formation.

In summary, by comparing experimental and computational results for several b ions, several factors influencing macrocycle formation have been investigated. While the energetics results presented here are merely qualitative, the relative energetics were generally found to be in accordance with the experimental trends, thus providing a chemical rationale for the trends that favor or disfavor macrocyclization.
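As a rough illustration of what a PMF along a distance coordinate represents, the sketch below applies plain Boltzmann inversion, F(r) = -k_B T ln P(r), to a histogram of sampled distances. This is only the basic relation that umbrella sampling and WHAM build on; it is not the AMBER/WHAM workflow used in this work. The function name, the temperature, and the bin counts are made up for illustration.

```python
import math

K_B = 0.0019872041  # Boltzmann constant in kcal/(mol*K)

def pmf_from_counts(counts, temperature=300.0):
    """Boltzmann-invert histogram counts into a PMF in kcal/mol.

    The lowest value is shifted to zero; empty bins get +infinity.
    """
    total = sum(counts)
    kt = K_B * temperature
    free = [(-kt * math.log(c / total)) if c > 0 else math.inf for c in counts]
    lowest = min(free)
    return [f - lowest for f in free]

# Synthetic counts per distance bin (illustrative only, not thesis data).
counts = [1, 10, 100, 10]
pmf = pmf_from_counts(counts)
print([round(v, 3) for v in pmf])
```

Bins that are visited less often end up higher in free energy, which is why a high PMF at short N-C distance is read as a penalty for bringing the two ends of the peptide together.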

Figure 3-1. Schematic diagram of head-to-tail macrocyclization.

Figure 3-2. Kinetic fitting of the HDX results for glycine-based b fragment ions for A) b2 generated from triglycine, B) b3 generated from octaglycine, C) b4 generated from pentaglycine, D) b5 generated from octaglycine, E) b6 generated from octaglycine, F) b7 generated from octaglycine, and G) b8 generated from octaglycine. exchanging structure as a function of bn fragment size. Reprinted with permission from Xian Chen, Long Yu, Jeffrey D. Steill, et al., J. Am. Chem. Soc. 2009, 131, 18272-18282. Copyright © 2009, American Chemical Society.

Figure 3-3. Mid-IR-MPD spectrum of b5-G8 (generated from octaglycine), compared to the lowest-energy conformers for the various chemical structures: A) macrocycle structure protonated on a backbone carbonyl, B) oxazolone structure protonated on the N terminus, and C) oxazolone structure protonated on the oxazolone ring N. The energies relative to the lowest conformer are indicated. The chemically diagnostic bands are labeled. Reprinted with permission from Xian Chen, Long Yu, Jeffrey D. Steill, et al., J. Am. Chem. Soc. 2009, 131, 18272-18282. Copyright © 2009, American Chemical Society.

Figure 3-4. Top: IRMPD spectra of the b4 ions with the sequence motifs A) TyrAlaGly, B) GlyAlaTyr, and C) AlaTyrGly. Bottom: IRMPD spectra of the b6 ions with the sequence motifs A) TyrAlaGly, B) GlyAlaTyr, and C) AlaTyrGly. The CID products were generated from the precursor ions (TyrAlaGly)2ProGly, (GlyAlaTyr)2ProGly, and (AlaTyrGly)2ProGly, respectively. The spectral region associated with the oxazolone band(s) is colored in light red. Reprinted with permission from Marcus Tirado, Nicolas C. Polfer, Angewandte Chemie, Volume 124, Issue 26, pages 6542-6544, June 25, 2012. License Number 3496931382001. Copyright © 2012 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim.

Figure 3-5. Overlay of IRMPD spectra of QPWFGLMPG b7, QPFGLMPG b6, and protonated cyclo(QPFGLM). The chemically diagnostic oxazolone CO stretch region is highlighted in red. Reprinted with permission from Marcus Tirado, Jochem Rutters, Xian Chen, Alfred Yeung, Jan van Maarseveen, John R. Eyler, Giel Berden, Jos Oomens, Nick C. Polfer, Journal of The American Society for Mass Spectrometry, March 2012, Volume 23, Issue 3, pp 475-482. License Number 3496931202774. Licensed content publisher: Springer.

Figure 3-6. IR spectra of the b5 arginine peptides recorded in the mid-IR range. The shaded region corresponds to the typical position of the oxazolone C=O stretching mode.

Figure 3-7. IR spectra of the b6 arginine peptides recorded in the mid-IR range. The shaded region corresponds to the typical position of the oxazolone C=O stretching mode.

Figure 3-8. Schematic diagram of the distance coordinate for the PMF calculation.

Figure 3-9. PMF results for bn [YAG]n, [AYG]n and [GAY]n.

Figure 3-10. PMF calculation results for bn Gn.

Figure 3-11. PMF calculation results for the b ions of QWFGLM, QPWFGLM and QPFGLM.

Figure 3-12. PMF calculation results for the b6 ions of QWFGLM and its arginine-modified sequence motifs.

Figure 3-13. PMF calculation results for the b5 ions of the arginine-modified sequence motifs of QWFGL.

Figure 3-14. Schematic representation of how the proton transfer (pathway marked in red) deactivates the head-to-tail cyclization pathway.

Figure 3-15. PMF calculation results for the b6 ions of the arginine-modified sequence motifs of QWFGLM.

Figure 3-16. PMF calculation results for the b5 ions of the arginine-modified sequence motifs of QWFGL.

Table 3-1. Relative contributions of oxazolone and macrocycle structures for oligoglycine bn products, based on HDX results.

b ions    Oxazolone and/or macrocycle structures
b2Gn      Oxazolone
b3Gn      Oxazolone
b4Gn      90% oxazolone + 10% macrocycle
b5Gn      75% oxazolone + 25% macrocycle
b6Gn      70% oxazolone + 30% macrocycle
b7Gn      65% oxazolone + 35% macrocycle
b8Gn      Macrocycle

Table 3-2. List of b ions considered for PMF calculations.

b ions        b ions
b4 Gn         b4 [YAG]n
b5 Gn         b6 [YAG]n
b6 Gn         b4 [GAY]n
b7 Gn         b6 [GAY]n
b8 Gn         b4 [AYG]n
b6 QWFGLM     b6 [AYG]n
b7 QPWFGLM    b6 QPFGLM
b5 QRFGL      b6 QRFGLM
b5 QWRGL      b6 QWRGLM
b5 QWFGR      b6 QWFRLM

Table 3-3. Comparison of experimental and computed results for bn [YAG]n, [GAY]n and [AYG]n, based on PMF data in Figure 3-9.

b ions       Oxazolone and/or macrocycle structures    Relative energy ranking
b4[YAG]n     Oxazolone                                 Higher
b6[YAG]n     Macrocycle                                Lower
b4[GAY]n     Oxazolone                                 Higher
b6[GAY]n     Macrocycle                                Lower
b4[AYG]n     Oxazolone                                 Higher
b6[AYG]n     Macrocycle                                Lower

Table 3-4. Comparison of experimental and computed results for bn Gn, based on PMF data in Figure 3-10.

b ions    Oxazolone and/or macrocycle structures    Relative energy ranking (descending)
b4Gn      90% oxazolone + 10% macrocycle            First
b5Gn      75% oxazolone + 25% macrocycle            Second
b6Gn      70% oxazolone + 30% macrocycle            Third
b7Gn      65% oxazolone + 35% macrocycle            Fourth (by trivial margin)
b8Gn      Macrocycle                                Fifth (by trivial margin)

Table 3-5. Experimental results on the isomeric structures for the b ions of QWFGLM and the proline modifications, with energy rankings based on PMF data in Figure 3-11.

b ions     Oxazolone and/or macrocycle structures    Relative energy ranking (descending)
QPWFGLM    Mostly oxazolone                          First
QPFGLM     Mostly macrocycle                         Second
QWFGLM     Macrocycle                                Third

Table 3-6. Experimental and computational results for the b6 ions of QWFGLM and its arginine-modified sequence motifs, based on PMF data in Figure 3-12.

b ions    Oxazolone and/or macrocycle structures    Relative energy ranking (descending)
QWFGLM    Macrocycle                                Fourth
QRFGLM    Mostly oxazolone                          Second
QWRGLM    Macrocycle                                Third
QWFRLM    Macrocycle                                First

Table 3-7. Experimental and computational results for the b5 ions of the arginine-modified sequence motifs of QWFGL, based on PMF data in Figure 3-13.

b ions    Oxazolone and/or macrocycle structures    Relative energy ranking (descending)
QRFGL     Oxazolone                                 Third
QWRGL     Macrocycle                                First
QWFGR     Macrocycle                                Second

Table 3-8. Experimental and computational results for the b6 ions of the arginine-containing peptides, based on PMF data in Figure 3-15.

b ions    Oxazolone and/or macrocycle structures    Relative energy ranking (descending)
QRFGLM    Mostly oxazolone                          Third
QWRGLM    Macrocycle                                Second
QWFRLM    Macrocycle                                First

Table 3-9. Experimental and computational results for the b5 ions of the arginine-containing peptides, based on PMF data in Figure 3-16.

b ions    Oxazolone and/or macrocycle structures    Relative energy ranking (descending)
QRFGL     Oxazolone                                 Third
QWRGL     Macrocycle                                First
QWFGR     Macrocycle                                Second

CHAPTER 4
STATISTICAL STUDY OF SEQUENCE SCRAMBLING IN COLLISION-INDUCED DISSOCIATION OF PEPTIDES

4.1 Background

The typical approach to protein sequencing [1,2], discussed in the previous chapters, combines collision-induced dissociation (CID) and tandem mass spectrometry [3,4] with automated database matching [5,6]. However, in a typical CID tandem mass spectrum, fewer than 20% of the fragment ions can be identified and assigned to CID products of the precursor ion by commercial sequencing programs. Such a low identification rate could have two explanations. First, the precursor ion sample might not be sufficiently purified, so that the actual sample contains multiple precursors at the same time, whose fragment ions then make it difficult to assign the fragments of each precursor. Second, the chemistry of ion fragmentation may be incompletely understood [7-9]. As a result, some fragment ions of the precursors may not be properly predicted, and many peaks in the mass spectrum may find no corresponding matches. While peptide purification can be improved by better separation (e.g., liquid chromatography), studying the fragmentation chemistry behind peptide sequencing is of greater interest here.

One topic of particular interest is the occurrence of sequence permutation, or "scrambling", during CID experiments [10-14], in which the primary structure of the peptide is observed to change and to produce novel fragment ions. The b ions involved can adopt two structures, namely the linear oxazolone and the macrocycle. The interconversion between these structures provides an explanation for peptide scrambling in CID, as the macrocycle formed from an oxazolone structure may not reopen at the location where it closed, and the subsequent oxazolone structure then has a permutated
sequence. In this case, the original b ions are called direct sequence ions, or simply direct ions, whereas the permutated b ions are called nondirect sequence ions, or nondirect ions [27-31].

According to previous studies [15-20], longer peptides tend to cyclize more completely, which means a potentially stronger cyclization effect [21-26]. On top of this, the longer a peptide is, the more possible ring-opening positions there are that do not correspond to the initial site of ring formation. In other words, the number of possible scrambled fragment ions is expected to increase with sequence length. Admittedly, a longer peptide also has more kinds of direct fragment ions among its CID products; however, the number of possible scrambled fragment ions grows much faster with peptide length than the number of direct fragment ions. This is because, for a given linear peptide, the number of possible CID fragment ions grows linearly with the length of the peptide, and the number of unique scrambled linear peptides produced by scrambling also grows linearly with peptide length. Consequently, the number of unique fragment ions directly produced by CID is O(n), where n denotes the length of the peptide in amino acid residues, while that of the scrambled fragment ions is O(n^2), so in principle the ratio of the nondirect sequence ion population to the direct sequence ion population is expected to grow linearly with peptide length. A schematic picture of b ion permutation can be found in Figure 4-1.

One specific tandem mass spectrum, shown in Figure 4-2A, is presented and discussed below to demonstrate the impact of sequence permutation on sequencing. The spectrum is that of the 3+ charged precursor ion of a tryptic digest peptide, GAQAPAFSLVGGDLADVTLENFAGK, and its CID fragments. Out of the 17 mass peaks, five
have been matched to direct sequence ions and four to nondirect ions (marked "s ions", where "s" stands for scrambling). Moreover, when intensity (the count of ions at a specific mass peak) is taken into account, the summed intensity of all nondirect ions accounts for 36% of that of all matched (direct plus nondirect) ions. More detailed matching results can be found in Table 4-1 and Table 4-2. Note that, unlike a matched direct sequence ion, a nondirect ion can result from the fragmentation of several different permutated sequences. For example, the sa4 ion GGAQ can be a fragment of the intermediate b fragments of three permutated sequences, namely GGAQAPAFSLV, GGAQAPAFSLVG and GGAQAPAFSLVGGDLADVTLENFA, since all of these sequences can arise from b ions of the original GAQAPAFSLVGGDLADVTLENFAGK peptide undergoing cyclization and subsequent permutation. Note the positions of the glycines that can give rise to the sequence "GGAQ".

While such peptide scrambling has been confirmed in many individual cases, such experiments cannot assess the overall influence of scrambling on tandem mass spectrometry as used in proteomic studies. Moreover, the twenty naturally occurring amino acids and the long average length of peptides in typical proteomic studies both require a large body of experimental data to be analyzed in order to give a statistical evaluation of the process. This leads to the need for a systematic study consisting of the following elements. As stated above, it has to work on a large set of experimental data for its results to have statistical merit. Furthermore, the experimental data have to be matched to a limited number of known proteins so that the matching algorithm can maximize its accuracy. Lastly, the
experimental data, i.e., the set of tandem mass spectra, have to be of high accuracy and resolution, so that the contribution from false positives due to experimental factors is minimized and the contribution from scrambling can be better assessed. In the subsequent sections, a study designed to incorporate all of these elements is explained in detail.

4.2 Computational Methods

4.2.1 Overview

In order to analyze the consequences of taking peptide scrambling into consideration during peptide sequencing, an automated procedure has to be designed. While plenty of commercial software is available for bottom-up protein sequencing, it is inconvenient to modify any of it so that the novel nondirect sequence ions are added to the fragment ion matching mechanism and the contribution from sequence permutation can be readily assessed. Moreover, the statistical analysis of the initial ion matching cannot be performed with any existing public software, as the underlying task is too specific. For these reasons, several in-house programs were written in the Java programming language and UNIX shell script to perform both the sequencing and the data analysis operations.

In addition, the nature of sequence permutation makes the number of possible nondirect sequence ions grow rapidly as the size of the peptide increases, which also increases the likelihood of a "false positive", i.e., the incorrect matching of an experimental fragment peak to a nondirect sequence ion, since the peaks are matched against a much larger pool of candidate masses. This concern is even more significant given that such mass matching is not to exact values but rather to ranges set by the mass accuracy. Without proper
understanding of the impact of false positives, the assessment of the contribution of nondirect sequence ions might be of limited merit. Thus, a separate procedure is also introduced to evaluate the incidence of false positives. All aspects of the computational approaches are presented in this section.

4.2.2 Experimental Data and Importing

A complete dataset from a previous proteomics study was obtained, in which 41,434 tandem mass spectra from CID experiments are stored in the standard MS2 format. These data were recorded during an earlier study by Scherl et al. [57] using an LTQ Orbitrap mass spectrometer (Thermo). More experimental details on the data acquisition can be found in the experimental method chapter. Each mass spectrum is stored individually in one MS2 file, which records the mass and charge state of the precursor ion as well as the masses of all product ion peaks. For each of these masses, the intensity of the mass peak, which is proportional to the ion abundance, is also recorded in the MS2 file. These MS2 files are then analyzed individually by the in-house sequencing program, and the sequencing results are saved to individual files corresponding to each MS2 file. A sample MS2 file can be found in Table A-1 in Appendix A.

4.2.3 Bottom-up Sequencing Tool

This section focuses on the in-house sequencing software, explaining the overall philosophy and general procedure, the fragment ion types considered for mass matching (including direct and nondirect sequence ions), the programming realization of each sub-step, and adjustable parameters such as the matching accuracy.
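As a minimal sketch of how such a file could be read, the Python fragment below parses one spectrum. It assumes the commonly used MS2 text layout, in which an "S" line ends with the precursor m/z, a "Z" line carries the charge state and the singly protonated mass, "H", "I" and "D" lines hold metadata, and every remaining line is an m/z and intensity pair. The sample text is hypothetical and is not taken from Table A-1; the in-house program itself was written in Java and is not reproduced here.

```python
def parse_ms2(text):
    """Parse one MS2-style spectrum into precursor information and peaks."""
    spectrum = {"precursor_mz": None, "charge": None, "mh_mass": None, "peaks": []}
    for line in text.splitlines():
        fields = line.split()
        if not fields or fields[0] in ("H", "I", "D"):
            continue  # skip blank lines and header/annotation lines
        if fields[0] == "S":
            # Assumed layout: S <first scan> <last scan> <precursor m/z>
            spectrum["precursor_mz"] = float(fields[-1])
        elif fields[0] == "Z":
            # Assumed layout: Z <charge> <singly protonated mass>
            spectrum["charge"] = int(fields[1])
            spectrum["mh_mass"] = float(fields[2])
        else:
            # Peak line: <m/z> <intensity>
            spectrum["peaks"].append((float(fields[0]), float(fields[1])))
    return spectrum

# Hypothetical file content, for illustration only.
sample = """H\tComment\thypothetical example
S\t1\t1\t500.7500
Z\t2\t1000.4927
175.1190\t1200.0
304.1615\t850.5
"""
spec = parse_ms2(sample)
print(spec["charge"], spec["mh_mass"], len(spec["peaks"]))
```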

4.2.3.1 General procedure of sequencing

In order to achieve maximum accuracy and to comply with proteomics protocols, a finite number of peptides has to be used for precursor ion matching. In this case, the matching list consists of all tryptic digest peptides from Pseudomonas aeruginosa identified in a previous study [57]. Each MS2 file is processed by the in-house sequencing program in a sequential manner. Upon opening an MS2 raw data file, the precursor ion mass recorded at the top of the file is matched against the masses in this peptide list. If no peptide in the list matches this mass within 5 ppm (the mass accuracy of the mass spectrometer with which the experimental data were collected; see a later subsection for more discussion), the program generates a null output file and moves on to the next MS2 file. Otherwise, the precursor ion is assigned to the sequence that has the minimum mass difference from the precursor ion. The sequencing program then enumerates all direct and nondirect ions of the matched sequence and calculates the masses of these fragment ions. The fragment ion masses in the MS2 file are compared first with the direct ion masses, and the results are recorded in the corresponding individual output file, together with the original peptide sequence, the type of direct ion (e.g., a2 of ABCDER), and the matching accuracy in ppm (with a maximum value of 5), preceded by a header of "direct ion matching" at the top of the matching list. The fragment ions in the MS2 file that fail to match any direct ion are then matched against all possible nondirect ions, and the results are recorded in the same way as for the direct ions, the only difference being that the header is now "nondirect ion matching".
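The precursor matching step described above can be sketched as follows. This is an illustrative Python fragment rather than the in-house Java program: the function name, the dictionary of peptide masses, and the placeholder peptide labels and masses are all hypothetical. Only the 5 ppm window and the choice of the candidate with the smallest mass difference are taken from the text.

```python
def match_precursor(observed_mass, peptide_masses, tol_ppm=5.0):
    """Return (sequence, mass difference) for the closest peptide within
    tol_ppm of the observed precursor mass, or None if nothing matches."""
    best = None
    for sequence, mass in peptide_masses.items():
        diff = abs(observed_mass - mass)
        within_tolerance = diff / mass * 1e6 <= tol_ppm
        if within_tolerance and (best is None or diff < best[1]):
            best = (sequence, diff)
    return best

# Placeholder labels and masses, for illustration only.
peptides = {"PEPTIDE_A": 1000.0000, "PEPTIDE_B": 1000.0040, "PEPTIDE_C": 1200.0000}

hit = match_precursor(1000.0030, peptides)   # A is 3 ppm away, B is about 1 ppm away
miss = match_precursor(1100.0000, peptides)  # nothing within 5 ppm
print(hit, miss)
```

A `None` result corresponds to the case in which the program writes a null output file and moves on to the next MS2 file.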
An output file storing both the precursor and the fragment ion matching results is finally created, one for each MS2 file, which is further analyzed statistically using another in-house
program, to be explained subsequently. After this, the program moves on to the next MS2 file, until all MS2 files have been processed. Further details of the above procedure are elaborated in the subsequent sections. A schematic flowchart is given in Figure 4-2B.

4.2.3.2 Types of fragment ions to be considered

Depending on the bond at which the peptide is cleaved, the N-terminal and C-terminal fragment ions are called a, b, c and x, y, z ions, respectively. Furthermore, for a peptide with n amino acid residues cleaved at the mth residue, the corresponding N-terminal fragments are called a_m, b_m and c_m, whereas the C-terminal fragments are called x_(n-m), y_(n-m) and z_(n-m). In a CID experiment, the products typically considered for sequencing are the a, b and y ions. Apart from these ions, there can be "satellite" fragment ions resulting from further H2O or NH3 losses from the original a and b ions. These types of ions are considered for direct sequence ion matching. On the other hand, a and b ions as well as their NH3 and H2O losses are considered in the calculation of CID products of the permuted primary structures (nondirect sequence ions), after the direct sequence ion matching has been performed.

It should be mentioned that some nondirect ions are indistinguishable from internal fragment ions of the original sequence. For example, the internal ion BCD can also be the b3 ion of the permuted b5 fragment sequence BCDEA. Since internal fragments are typically not included in sequencing and the two ions cannot be differentiated, such fragments are assigned to nondirect sequence ions. Also, only one permutation and the subsequent fragmentation are considered when calculating nondirect fragment ions, and charge states +1 and +2 are considered for both direct and nondirect ions. Table 4-3 gives the complete list of direct and nondirect ions considered for identification.
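The enumeration of permuted sequences and their b-type fragments can be illustrated with a short sketch. It assumes that each b_m ion cyclizes and reopens at any of the m-1 positions other than the original one, that only a single permutation is considered, and that the smallest ring considered is b3. The last point is an assumption inferred from the count of 2 + 3 + 4 = 9 permuted sequences that the text gives for a six-residue sequence. Function names are hypothetical, and the sketch works on sequences only; it neither computes masses nor merges fragments that share a composition.

```python
def rotations(seq):
    """Non-identity cyclic rotations: the ring reopens at a different site."""
    return [seq[i:] + seq[:i] for i in range(1, len(seq))]

def permuted_sequences(peptide, min_ring=3):
    """Permuted sequences from one cyclization/reopening of each b_m ion."""
    perms = []
    for m in range(min_ring, len(peptide)):  # b_m ions, m = min_ring .. n-1
        perms.extend(rotations(peptide[:m]))
    return perms

def nondirect_b_fragments(peptide, min_ring=3):
    """Map each nondirect b-type fragment sequence to its parent permuted sequences."""
    direct = {peptide[:k] for k in range(1, len(peptide))}
    found = {}
    for perm in permuted_sequences(peptide, min_ring):
        for k in range(1, len(perm)):
            fragment = perm[:k]
            if fragment not in direct:
                found.setdefault(fragment, set()).add(perm)
    return found

perms = permuted_sequences("ABCDEF")
frags = nondirect_b_fragments("ABCDEF")
print(len(perms))            # 9 permuted sequences for six residues
print(sorted(frags["BCD"]))  # parents of the fragment BCD
```

In this toy case the fragment BCD arises from both BCDA and BCDEA, which mirrors the observation that one nondirect ion can originate from several permuted sequences.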

4.2.3.3 More details of mass matching

As stated in Chapter 2, the tandem mass spectra from the CID experiments recorded with an LTQ Orbitrap [58] have a mass accuracy of 5 parts per million (ppm). The same relative accuracy is therefore used during the mass matching. It should be mentioned that 5 ppm is also a de facto standard accuracy used in protein sequencing worldwide, so the work presented here carries more realistic merit. The precursor ion mass in the MS2 file is first matched against the aforementioned peptide list, which consists of 5,029 unique sequences. This list of peptides was previously identified in the work of Scherl et al. [57]. During the matching, accurate masses are calculated for all peptides in the peptide list and compared to that of the precursor ion. When the relative difference between the precursor ion mass and that of a known sequence falls within 5 ppm, the sequence is assigned to the precursor ion, and both the direct and the nondirect ions listed in the previous section are calculated for the sequence and matched sequentially to the fragment ion masses in the MS2 file. Upon a match (the difference between a fragment ion mass and a calculated mass falls within 5 ppm), the corresponding fragment ion type and the sequence on which its name is based, as well as the relative difference in ppm, the charge state of the fragment ion, and its intensity in the mass spectrum, are recorded in the output file. After the matching is complete, the total count of fragment peaks is recorded at the bottom of the output file for the convenience of later analysis. A sample output file can be found in Table A-2 in Appendix A.

4.2.4 Raw Output File Analysis Tool

Through the method described in Section 4.2.3, all 41,434 MS2 files are analyzed and their corresponding output files generated. In order to investigate the impact of peptide scrambling on bottom-up protein sequencing at a statistical
level, as well as the behavior of the scrambling mechanism within a large peptide population, all of these output files have to be examined and data-mined. It is unrealistic to perform such an examination manually, so an automated result analysis procedure had to be developed. The procedure has to cover two aspects of automated processing. First, for each individual output file, it scans through the initial matching results and extracts the raw data needed to calculate intermediate variables, which are more amenable to data mining. Second, it incorporates all such intermediate variables into one single data file, performs statistical analysis on the data as a whole, and outputs the final results in a human-friendly manner. The final output should also be machine-readable, so that further automated analysis can be conveniently performed in the future.

With these aspects in mind, an in-house data processing and analysis program was written in Java. The details of this program are as follows. First, it reads in each raw output file, records the matched precursor ion sequence (if there is a match), and calculates its length by counting the number of residues in the sequence. It then counts the number of matches for direct sequence ions and for nondirect sequence ions with unique masses, and sums them to give the total number of matches. After the matched ions have been counted, the intensities (the number of ions recorded by the mass spectrometer at a specific m/z) of these matched ions are summed, both together and separately for direct and nondirect ions, and the contribution of each is calculated as a percentage of the overall intensity sum. This is performed for all raw output files, and a data file with these calculated values is created. Starting from these data, further statistical analysis is performed, including population
analysis, the distribution of peptide lengths and charge states, the dependence of the direct and nondirect ion matching percentages on these variables, etc. Details of this analysis are presented in the Results and Discussion section. A table with sample data file entries can be found in Table A-3 in Appendix A.

4.2.5 False Positives

One outcome of sequence permutation is that a much greater number of nondirect sequence ions with unique sequences and masses can be present. As a result, experimental fragment ion masses are matched against a larger pool of calculated masses. This is a bigger problem for peptides with longer sequences, as the number of possible permuted ions grows rapidly with the length of the peptide. Since peaks in the mass spectra are matched to the masses of the calculated ions not on the basis of exact mass but within an acceptable difference (5 ppm), the following scenario becomes possible: a peak is identified simply because there are so many calculated ions with masses close to it that one of them is eventually close enough to trigger a match, even if the peak actually corresponds to some other ion. In that case, the peak would be incorrectly identified. Such false positives in fragment ion matching are not negligible, as in the worst case they might account for a large portion of all supposedly matched nondirect sequence ions and render the statistical analysis meaningless. For example, in the situation represented in Figure 4-1, a total of (2+3+4)=9 permuted sequences are to be considered for a sequence length of 6. More generally, the number of possible permuted sequences is $\sum_{k=2}^{n-2} k$ for an n-residue sequence, or $\frac{n(n-3)}{2}$. This means a remarkable growth as the length of the sequence grows. Consequently, there has to be a way to establish whether such false positives exist and to evaluate to what extent they contribute to the data presented.

One approach to evaluating the statistics of false positives as a function of length is to use a decoy peptide list and to repeat the sequencing procedure based on the decoy list. The conventional approach to making a decoy set of peptides involves inverting the peptide sequences. That approach would clearly be ill-suited here, as those inverted peptide sequences also correspond to permuted/scrambled peptide sequences. In our approach, for each peptide in the original list, a decoy peptide is artificially constructed with a sequence whose mass lies within 5 ppm of that of the original peptide. When the tandem mass spectra are analyzed with this decoy list, the fragment ion matches have no chemical merit, and thus the contribution from false positives can be evaluated.

The key step in making such an approach possible is the construction of the decoy peptide list. More specifically, for each of the 5,029 peptides in the list, a random sequence has to be made up whose mass is close enough to the mass of the original peptide. While calculating the exact mass of a given sequence is straightforward, finding a novel sequence with a specific exact mass is not as obvious, unless one is willing to calculate the masses of all peptides up to a particular limit. The sequences in the original list are all built from the twenty natural amino acids, which correspond to 19 unique masses (leucine and isoleucine have the same exact mass), so the above question converts to finding non-negative integer coefficients for the 19 unique masses such that their weighted sum equals a given mass. All of the above masses hold their values accurate
to the fifth decimal place in Da, and when multiplied by 100,000 these values all become integers. This simple step transforms the problem into an integer problem, for which a mathematical treatment is known. In mathematics, it is called the Frobenius coin problem (after the mathematician Ferdinand Frobenius), also known as the McNugget problem. The coin problem asks for the largest integer (the Frobenius number) that cannot be obtained by combining any number of a set of given integers. For example, with only 2 cent and 5 cent coins one cannot make 3 cents, but any amount equal to or greater than 4 cents can be made by combining these coins. For example, two 2 cent coins and one 5 cent coin make 9 cents. Fortunately, algorithms for solving the Frobenius coin problem are known, and following the same philosophy, the coefficients can be found for any integer greater than the Frobenius number for the same set of denominations. The solution is also readily available as a built-in function in the Mathematica software, and a small program was written in the Mathematica programming language to automatically generate decoy substitute peptides for the entries in the original list. The home-written program reads in the 19 unique exact masses of the amino acids and all 5,029 peptide sequences, calculates their exact masses, calls the FrobeniusSolve[] [59] function sequentially for each exact mass, generates all 5,029 decoy sequences, and finally writes them to a text file with both the sequences and their exact masses. Notably, the nature of the built-in Mathematica FrobeniusSolve[] [59] algorithm leads to a tendency to generate solutions dominated by a single residue, for example glycine. This might lead to a discrepancy between the length

of the forged sequence and that of the original sequence and compromise the quality of the false positive evaluation. This obstacle is overcome by letting the FrobeniusSolve[] [59] function generate a pool of 500 different solutions and picking one with the same length as the original sequence for the decoy peptide list. With this problem resolved, the text file is deliberately formatted in exactly the same manner as the original peptide list, so that it can directly substitute for the original data file and the false positive sequencing procedure can be performed.

4.3 Results and Discussion

4.3.1 Overview

In this section, results from the in-house sequencing program, which takes scrambling product ions into consideration, are presented. These include the counts of matches for all MS2 files as a whole and for those with higher and lower matching confidence, followed by data analysis of the matching results for each individual MS2 file. More specifically, matching quality and the percentage of direct/nondirect sequencing ions matched among all ions are analyzed with variables such as peptide length and charge state isolated, so that their contributions can be evaluated. At the end of this section, the sequencing results using the decoy peptide list are also presented to reach a better understanding of how false positives affect the novel sequencing algorithm that considers scrambled ions.

4.3.1.1 CID experiment setup

Before reporting the computational results, the experimental setup through which the 41,434 raw MS2 results were obtained is briefly explained. The peptides for these MS-MS spectra were prepared following standard proteomics practice: proteins were first extracted from a clinical strain of Pseudomonas aeruginosa, and their

concentrations were determined using a Coomassie-based protein assay purchased from Pierce/Thermo-Fisher. These proteins then underwent reduction with dithiothreitol (DTT), followed by alkylation with iodoacetamide and digestion using sequencing grade porcine trypsin purchased from Promega. The peptides produced by this digestion were then desalted using Vydac C18 microspin columns. The prepared peptides were first separated using nanoflow high performance liquid chromatography (HPLC). The HPLC equipment is a NanoAcquity system made by Waters Corporation. The HPLC used a homemade, gravity-pulled analytical column (75 micron inner diameter by 150 millimeter) packed with 5 micron diameter, 100 angstrom C18AQ particles supplied by Michrom. The HPLC equipment is coupled to an LTQ-Orbitrap mass spectrometer made by Thermo. Following the HPLC separation, peptides were ionized using electrospray ionization (ESI) in positive ion mode.

4.3.2 General Sequencing Results

Of all the 41,434 MS2 files processed by the in-house sequencing program, 6,358 files carried precursor ion masses that were matched to those in the peptide list. Among these files, 4,878 also have at least one fragment ion peak matched to the direct sequencing ions generated from the matched precursor ion. The others have either only nondirect sequencing ions matched or no match at all. In order to maintain a relatively high level of confidence, only these 4,878 results are considered in the subsequent analysis.

4.3.3 Population Analysis

As elaborated in Section 4.2, each result file is analyzed and the information within it is condensed to an entry in the final analysis data. The condensed information includes

an index to the original result file name, the matched precursor ion sequence and its charge state, the number of fragment ions recorded by the mass spectrometer, the number of direct sequencing ion matches and the number of nondirect ion matches, the summed intensities of the matched direct and nondirect ions as percentages of the summed intensity of all matched ions, and the total intensity of all fragment ions.

According to previous work, the size of a peptide, or its length as defined by the number of amino acid residues it has, affects the scrambling effect. For this reason it is natural to first separate the 4,878 matched results by the size of the peptide each result corresponds to. The number of peptides of each length is counted and plotted against peptide length, as shown in Figure 4-3. Among all the precursor ions, the shortest have 7 amino acid residues while the longest has 32. The majority of peptides have lengths between 10 and 22, judging by the full width at half maximum, with a peak between 12 and 15. Previous research on oligoglycines has shown that cyclic peptides account for 100% of the CID product b ions for peptide lengths equal to or greater than 8, and the reopening of these cyclic structures then leads to scrambling. It is therefore speculated that scrambling will occur very widely among these precursor ions.

On the other hand, while the 4,878 matched peptides each have at least one direct sequence ion, a single direct ion is not a sufficiently confident criterion for concluding a precursor ion match in the legacy approach. To better analyze the sequencing results, these peptides were further grouped into two categories by their numbers of direct sequencing ion matches. Those with 5 or more direct sequencing ions are binned into the "high confidence" group, as in the legacy peptide sequencing approach

such a number of fragment ion matches, combined with the mass matching of the precursor ion, suggests a highly confident identification. Peptides with only 1 to 4 direct sequencing ion matches are binned into the "low confidence" group for a comparable reason. Under these criteria, 3,888 peptides are grouped into the high confidence dataset, while 990 belong to the low confidence group. Another population analysis was done for both datasets, and the results are shown in Figure 4-4. According to the grouped plot, despite the high confidence dataset being almost 4 times larger, the size distribution patterns of the two datasets closely resemble each other, and both are very similar to that of the complete dataset. In short, the high confidence dataset peaks between 12 and 15 and the low confidence dataset at 15. The full width at half maximum range for the high confidence dataset is 10 to 21, while that for the low confidence dataset is 11 to 17. It should also be mentioned that the majority of these lengths have peptide counts over 10, with many over 100, which gives the analysis over peptide length considerable statistical merit.

4.3.4 Dependency on Peptide Lengths

To validate this postulation, the percentages of direct/nondirect sequencing ions over all tandem MS ions were calculated for all 4,878 results and grouped by their corresponding peptide lengths. The percentages belonging to the same peptide length were then averaged and their standard deviations calculated. The results are plotted in Figures 4-5A and 4-5B. Again, results for the high- and low-confidence datasets are presented separately. For the high confidence dataset, the ratio of direct ion matching is relatively unchanged as peptide length grows, staying at around 20%. For the low confidence dataset, the direct ion matching ratio is generally much lower, between 4% and 8%, which

is somewhat expected, as by definition peptides in this group can have only up to 4 direct sequencing ions. Moreover, it shows only a very slight increase with peptide length. Considering that the total number of unique tandem masses also increases as the peptide length grows, such a relatively unchanged ratio can be understood. The ratio of nondirect sequencing ions, however, shows a striking increase as the length of the peptide grows, for both the high- and low-confidence datasets.

This ratio analysis does not take into account the intensities of the matched peaks. A higher intensity corresponds to a stronger signal, or more ions, recorded by the mass spectrometer, so a tandem mass peak with higher intensity brings more confidence to the validity of the peak, and it is worthwhile to take the intensity contribution into account. Moreover, the relative intensity contributions are in fact more relevant for evaluating the probabilities of these events. Following this consideration, the intensity-weighted count of nondirect sequencing ion matches over the intensity-weighted count of all matched ions was also calculated and analyzed similarly. The results are shown in Figure 4-6. For the high confidence dataset, the percentage of nondirect sequencing ion intensity contribution grows steadily from near zero (meaning that direct sequencing ions dominate among all matched ions) to roughly 60% (a greater contribution than that of direct ions). For the low confidence dataset, the percentage remains at a relatively high level (~50%) after growing from around 15% at shorter peptide lengths. Meanwhile, keep in mind that the number of permutations due to macrocycle formation has a quadratic relationship with peptide length, as shown in Figure 4-7.
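The intensity-weighted ratio described above can be sketched as follows. This is a minimal illustration rather than the in-house program: the function names, the (kind, intensity) tuple layout, and the sample intensities are all assumptions made for the example.

```python
from collections import defaultdict
from statistics import mean


def intensity_weighted_nondirect(matched_ions):
    """Fraction of matched-ion intensity carried by nondirect ions.

    matched_ions: list of (kind, intensity) tuples, where kind is
    "direct" or "nondirect" (hypothetical layout for this sketch).
    """
    total = sum(intensity for _, intensity in matched_ions)
    if total == 0:
        return 0.0
    nondirect = sum(i for kind, i in matched_ions if kind == "nondirect")
    return nondirect / total


def mean_by_length(spectra):
    """Average the per-spectrum fraction for each peptide length.

    spectra: list of (peptide_length, matched_ions) pairs.
    """
    grouped = defaultdict(list)
    for length, matched_ions in spectra:
        grouped[length].append(intensity_weighted_nondirect(matched_ions))
    return {length: mean(values) for length, values in grouped.items()}


# Made-up intensities, for illustration only (not thesis data).
example = [
    (10, [("direct", 100.0), ("nondirect", 50.0), ("nondirect", 50.0)]),
    (10, [("direct", 80.0)]),
]
result = mean_by_length(example)
```

Weighting by intensity means that a single strong nondirect peak counts for more than several weak ones, which is why this ratio can differ substantially from the count-based ratio at the same peptide length.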

Given the apparent increase of nondirect ions from near zero for short sequences to significant contributions for longer sequences, it is imperative that the rate of false positive identification be evaluated. In fact, it is in principle possible that the entire increase of nondirect ions (as a function of sequence length) is an artifact resulting from false positives. By running the false positive evaluation program introduced in Section 4.2, the mass spectral analysis was performed with a decoy peptide list, and the results were analyzed to provide comparable plots, starting with Figure 4-8. Note that the size distribution is not identical to that of the original peptide list, as each decoy peptide merely needs to have the same mass, not the same sequence length. According to Figure 4-8, the decoy sequencing yields a similar pattern in the population distribution by peptide length, which gives some confidence in the quality of the decoy peptide list and makes the subsequent analysis more comparable with the real analysis.

The ratios of direct/nondirect sequencing ions over all tandem mass fragments using the decoy peptide list were calculated, plotted for each individual peptide length, and presented in Figure 4-9. For convenient comparison, the same plots using the real list for the high and low confidence datasets are also included. In the decoy plot, the ratio of direct sequencing ions stays at very low values, close to zero. This makes perfect sense, as the direct ions calculated from the decoy-matched precursor peptides do not actually exist, and it is natural that hardly any experimental tandem mass peaks find their masses matched to them. On the other hand, despite also being quite low in percentage values, the ratios for nondirect sequencing ions do rise to remarkably

higher values at longer peptide lengths than at shorter lengths, rising from close to zero at length 15 through length 25 and eventually reaching around 10%. This means that false positives do occur in nondirect ion matching, for the reason discussed earlier. However, this increase is noticeably smaller than the one observed in the original nondirect analysis. For instance, the nondirect ratios at length 25 for the high confidence, low confidence, and decoy datasets are 25%, 20% and 8%, respectively, meaning that false positives account for either one third or 40% of the identified nondirect ions in the original analysis (i.e., 8% vs. 25%, 8% vs. 20%).

To better evaluate the contribution of false positives to the nondirect ion matching, the above ratio was derived for each peptide length and plotted against length, as shown in Figure 4-10. The results show that for the high confidence dataset the percentage contribution from false positives remains steady at around 30%, while for the low confidence dataset it fluctuates between 30% and 50%. The steady value of 30% for the high confidence set suggests that the statistics of large numbers average out and that this value gives an accurate reflection of the true rate of false positives. In addition, this result suggests that the majority of the trend showing an increase of nondirect ions with sequence length is real and can be quantified. If the original maximum value of around 40% nondirect ions is decreased by 1/3 to remove the false positive contribution, this still leaves around 25% nondirect ion contribution, which raises the total percentage of matched ions from less than 20% (direct sequencing ions only) to over 40% (direct and nondirect ions combined) for longer peptides.
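The arithmetic behind these estimates can be written out as a short sketch. The function names are hypothetical; the input percentages are the values quoted in the text above (25%, 20% and 8% at length 25, and a maximum of around 40% nondirect ions).

```python
def false_positive_share(decoy_pct, real_pct):
    """Share of matched nondirect ions attributable to false positives."""
    return decoy_pct / real_pct


def corrected_nondirect_pct(real_pct, fp_share):
    """Nondirect percentage left after removing the false positive share."""
    return real_pct * (1.0 - fp_share)


# Values quoted in the text for peptide length 25.
high_share = false_positive_share(8.0, 25.0)  # about one third
low_share = false_positive_share(8.0, 20.0)   # 40%

# Maximum of ~40% nondirect ions reduced by one third.
corrected_max = corrected_nondirect_pct(40.0, 1.0 / 3.0)
```

Under these inputs the corrected maximum comes to roughly 27%, which is consistent with the "around 25%" stated in the text given the approximate nature of the 40% starting value.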

Meanwhile, in order to evaluate the impact of scrambling on peptide sequencing from a more general perspective, it is worthwhile to calculate the overall average percentage of nondirect ions among all experimentally recorded fragment ions. This is done as follows. First, the impact of false positives has to be minimized, if not removed. This is done by subtracting the false positive percentages from the overall nondirect ion percentages at all peptide lengths. These processed nondirect ion percentages at each peptide length are then weighted by the precursor ion population at each length, and a weighted average is calculated to obtain the overall nondirect ion percentage. The overall direct ion percentage can be calculated following the same procedure. The percentages of direct and nondirect ions among all recorded fragments after the false positive depletion, together with the weighting population, are presented for the high confidence dataset in Figure 4-11. Notably, the precursor ion populations are densest between peptide lengths 10 and 20, whereas the nondirect ion percentages within the same length region are relatively low (below 10%), especially after removing the contribution from false positives. This brings the weighted average of the nondirect ion percentage across all peptide lengths down considerably. The direct ion percentages, on the other hand, remain moderately above 15% regardless of peptide length, and the impact of population weighting is thus minimal. Following this procedure, the overall average percentages for direct and nondirect ions are calculated to be 16.93% and 5.31%, respectively, suggesting that direct ions still play a dominant role in peptide sequencing.
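The weighting procedure described above can be sketched as follows. This is an illustrative outline only: the function name, the dictionary layout, and the toy numbers are assumptions, and clipping negative differences to zero is a choice made for this sketch that the text does not specify.

```python
def population_weighted_average(per_length):
    """Population-weighted mean of false-positive-corrected percentages.

    per_length: {peptide_length: (population, real_pct, decoy_pct)}
    (hypothetical layout for this sketch).
    """
    total_population = sum(pop for pop, _, _ in per_length.values())
    weighted_sum = sum(
        # Subtract the decoy (false positive) percentage at each length;
        # clipping at zero is an assumption of this sketch.
        pop * max(real - decoy, 0.0)
        for pop, real, decoy in per_length.values()
    )
    return weighted_sum / total_population


# Toy numbers for illustration only (not thesis data).
toy = {10: (100, 5.0, 1.0), 20: (50, 20.0, 8.0)}
overall = population_weighted_average(toy)
```

In this toy case the heavily populated short length, with its low corrected percentage, pulls the overall average well below the value at the longer length, which mirrors the effect described in the text.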

4.3.5 Dependence on Precursor Ion Charge State

The previous section has validated the size effect in peptide scrambling, as both the ratio of the nondirect sequencing ion count over the count of all experimentally recorded fragment ions and the contribution from nondirect ion intensity grow with peptide length. However, a longer peptide precursor ion normally also carries more protons, giving a higher charge state. Thus the question remains open whether charge state, as a separate parameter, influences the likelihood of peptide scrambling and the resulting ratio of nondirect sequencing ions. To answer this question, data analysis was also performed with respect to charge state.

To begin with, a population analysis of all precursor ions by charge state was performed. Counts of precursor ions with each charge state were plotted and are presented in Figure 4-12. It should be mentioned that among all 4,878 matching results there are cases where multiple precursor ions were matched to the same sequence and charge state. As shown in the plot, the majority of peptides have a charge state of +2 (2931 peptides) or +3 (1852 peptides), with few quadruply charged (93 peptides) and very rare occurrences of +5 (2 peptides). For this reason the later analysis, which tries to isolate the charge state contribution, focuses only on the +2 and +3 charge states. For this analysis, peptides with charge states 2 and 3 are separated, and the ratio of nondirect sequencing ions over all tandem mass fragment ions in each charge state is plotted against peptide length. Again, results from the high and low confidence datasets are separated. The results are shown in Figure 4-13. For the low confidence dataset, charge states 2+ and 3+ make little difference. However, for the high confidence dataset, the charge state 3+ results clearly differ from those of charge state 2+, being able to

produce a higher ratio of nondirect sequencing ions, especially at longer peptide lengths over 19 amino acid residues. This leads to a divergence between the 2+ and 3+ plots.

4.3.6 Discussions on the Trends in Scrambling

While there are clear trends showing that scrambling effects are stronger with longer peptide lengths and higher charge states, it is worthwhile to consider the chemistry behind such trends. For the trend in length, the growth is consistent with previous experiments in which b fragments of oligoglycines were investigated, where the percentage of macrocyclic peptides among linear oxazolone and macrocycle isomers grows with the number of glycine residues. Chemically, such growth in macrocycle propensity with length means a larger population of cyclic peptides, laying a larger foundation for scrambling to happen. Meanwhile, a larger macrocyclic peptide also offers more possible sites for the reopening of the macrocycle, so for the same number of macrocyclic peptides a longer peptide is expected to have more unique scrambled fragment ions. Moreover, these two trends are quite likely to interact and result in even greater scrambling effects, as they reinforce each other. However, chemically a large enough macrocyclic peptide is not as entropically favored as the linear oxazolone isomer, since its degrees of freedom are more restricted. Hence an upper limit in peptide length for such monotonic growth is somewhat expected, but it is absent from the experimental results. On the other hand, the results from the current work also do not provide direct evidence for the existence of macrocyclic peptides with lengths over 30 amino acid residues, as most unambiguously matched fragment ions are of much shorter lengths.

For the trend in charge states, the results have shown that in the high confidence dataset the higher (3+) charge state leads to greater scrambling effects than the lower (2+) state. Chemically this may be due to the higher charge state's greater promotion of sequence scrambling. On top of such trends, the standard deviation of the nondirect ion ratio also grows rapidly as the length and charge state of the precursor ion grow. This means that even with the same number of amino acid residues and the same charge state, different peptides may still behave differently with respect to sequence scrambling. For example, the amino acid composition of a peptide may affect its likelihood of cyclization, as discussed in Chapter 3. Predicting sequence permutation behavior using merely two parameters (length and charge state) therefore looks quite difficult. Also, b ion cyclization and scrambling may not be the only pathway leading to the production of nondirect sequencing ions, and other novel mechanisms may also need to be taken into consideration in the future.

4.4 Summary

Previous research has confirmed that sequence scrambling can happen during collision induced dissociation of protonated peptides, which is widely used in tandem mass spectrometry as the standard practice for protein sequencing. To statistically investigate the impact of such scrambling, 41,434 experimental MS-MS spectra from previous research were analyzed via an in-house sequencing program that takes scrambling effects into consideration, and matched to a list of 5,029 unique tryptic digest peptides from P. aeruginosa. Out of these spectra, 4,878 were matched to known sequences in the list. The matched peptides were grouped into high and low confidence datasets by the number of direct sequencing ions

they carry, with a cutoff set at 5. For the high confidence dataset, just under 20% of all fragment ions were identified as direct sequencing ions. Of the remaining 80% of fragments, which are not identified by conventional sequencing programs, up to 30% could be matched to fragment ions arising from sequence scrambling, or nondirect sequencing ions. The overall average percentage of permuted sequence ions, taking into account the size distribution and discounting false positives, was found to be 5.3%, compared to 16.9% for original sequence ions, suggesting that permuted sequence ions are clearly present but are not a major contributor to tandem mass spectra.

Based on these results, it is worthwhile to take sequence permutations into consideration in standard protein sequencing algorithms. The matched nondirect sequencing ions may improve the confidence of peptide sequencing when included in the fragment ion matching, especially compared with the traditional approach in which they are simply discarded. Processing all 41,434 MS2 files takes around a minute, and in that sense taking the scrambling effect into consideration adds little computational load to existing sequencing programs, making it very cost effective. The only major concern is the contribution of false positives to nondirect ions, which account for roughly one third of those ions, as evaluated with a decoy dataset. This can be improved by using higher resolution mass spectrometers and higher accuracy data acquisition. However, the overall population-weighted average percentages of direct and nondirect ions still suggest that the direct ions play the primary role in peptide sequencing, and the contribution from the nondirect ions should thus be treated only as

an auxiliary input which assists the identification, instead of playing an equally critical role.

Besides the validation of the scrambling effect and its potential positive contribution to protein sequencing, trends in scrambling have also been observed. The percentage of nondirect sequencing ions grows as peptide length grows, and it also grows with higher charge state (3+ vs 2+). The standard deviation of this percentage also grows with peptide length, indicating that factors other than length and charge state play a considerable role in scrambling, such as the presence of specific amino acid residues like the basic residues investigated in Chapter 3.

Figure 4-1. Schematic diagram of b ion sequence permutations as a result of head-to-tail cyclization. Reprinted with permission from Long Yu, Yanglan Tan, Yihsuan Tsai, David R. Goodlett,* and Nick C. Polfer, J. Proteome Res. 2011, 10, 2409-2416. Copyright © 2011, American Chemical Society.

Figure 4-2A. A sample MS-MS spectrum from a CID experiment. Reprinted with permission from Long Yu, Yanglan Tan, Yihsuan Tsai, David R. Goodlett,* and Nick C. Polfer, J. Proteome Res. 2011, 10, 2409-2416. Copyright © 2011, American Chemical Society.

Figure 4-2B. Schematic flowchart for the automated direct/nondirect ion matching and statistical data analysis.

Figure 4-3. Size distribution of peptides identified by Scherl et al. [57] and used as a reference peptide list in this study.

Figure 4-4. Number of peptides as a function of peptide length, for the higher (blue) and lower (red) confidence datasets.

Figure 4-5A. Mean percentages and standard deviations of direct (blue) and nondirect (red) ions among all fragments at each peptide length, for the higher confidence dataset.

Figure 4-5B. Mean percentages and standard deviations of direct (blue) and nondirect (red) ions among all fragments at each peptide length, for the lower confidence dataset.

Figure 4-6. Intensity-weighted percentage of nondirect ions at each peptide length for the higher (top) and lower (bottom) confidence datasets.

Figure 4-7. Number of possible permutations due to macrocycle formation versus peptide length, based on the mechanism in Figure 4-1.

Figure 4-8. Number of peptides at each peptide length for the real (red) and decoy (black) peptide lists.

Figure 4-9. Percentage of direct (black) and nondirect (red) ions among all fragments at each peptide length for the higher (top) and lower (middle) confidence datasets and the decoy (bottom) dataset.

Figure 4-10. Percentages of the false positive nondirect ion population over the real nondirect ion population at each peptide length, for the higher (blue) and lower (red) confidence datasets.

Figure 4-11. Percentages of direct (blue) and nondirect (red) ions among all fragments after depleting false positive contributions at each peptide length, superposed with the population at each peptide length (hollow dots), for the higher confidence dataset.

Figure 4-12. Number of precursor ions at each charge state (+2, +3, +4 and +5).

Figure 4-13. Percentages of nondirect ions among all fragments for charge states 2+ (black) and 3+ (red) at each peptide length, for the higher (top) and lower (bottom) confidence datasets.

Table 4-1. Direct sequence ion matching results for the MS2 spectrum in Figure 4-1.

Observed peaks (m/z)   Theoretical peaks (m/z)   Ion types    Rel. Intensities (%)
665.327                665.325                   y6           56
879.453                879.457                   y8           48
1061.042               1061.047                  y21 (2+)     100
1123.055               1123.060                  b23 (2+)     50

Table 4-2. Nondirect sequence ion matching results for the MS2 spectrum in Figure 4-1.

Observed peaks (m/z)   Theoretical peaks (m/z)   Ion types    Putative permutated sequences of b fragment   Rel. Intensities (%)
286.15                 286.15                    sa4          GGAQAPAFSLV                                   28
553.31                 553.30                    sa7          VGGAQAPAFSL                                   68
1088.04                1088.04                   sa23 (2+)    TLENFAGGAQAPAFSLVGGDLADV                      41
1093.59                1093.60                   sa13         LVGGDLAGAQAPAFS                               29

Table 4-3. List of direct and nondirect ion types considered for fragment matching.

Direct ions:     a, b, y, a-NH3, b-H2O, b-NH3
Nondirect ions:  sa, sb, sa-NH3, sb-H2O, sb-NH3

CHAPTER 5
PEPTIDE MASS AND IDENTIFICATION BY HIGH RESOLUTION MASS SPECTROMETRY

5.1 Background

5.1.1 Motivation

The years since the 1970s have seen great progress in high resolution, high accuracy mass spectrometry [60], with the worldwide application of FT-ICR mass spectrometers whose resolution power has evolved to the sub-ppm range [61-64], accompanied by the introduction of other high resolution instruments such as the orbitrap mass analyzer [65-67]. The driving force for more accurate mass measurement has come from the analytical challenges of identifying complex mixtures in biological research (e.g. proteomics [68-71], metabolomics), as well as petroleum analysis [9] and many other applications. The analysis of complex mixtures is also typically assisted by separation techniques, such as liquid chromatography. However, such an approach also involves inherent challenges. First, because it cannot provide direct structural information, mass spectrometry, whatever resolution and accuracy it attains, is not able to separate isomers, as the mass difference due to bond formation is virtually infinitesimal [72,73]. Second, even setting aside the isomer problem, the number of unique mass values is expected to grow exponentially as nominal mass increases, since more distinct chemical species can be formed as the total number of atoms and radicals, and their combinations, grows at higher nominal mass [74-77]. This will likely lead to a situation where two neighboring unique masses differ by so little that even a state-of-the-art high resolution mass spectrometer cannot tell them apart. Such considerations lead to one question: if a universal identification by mass alone [78-80] through high

resolution mass spectrometry cannot be achieved, is there any more specific scenario or condition under which it can be? For example, when the unknown substances are restricted to metabolites only, or to products from enzyme-digested proteins or peptides and their fragments, can such identification be achieved? And if so, what is the minimum resolution power required for such identification [81]?

Meanwhile, even in situations where mass alone does not give an unambiguous identification, a mass measurement with high accuracy, say 1 ppm or better, is still of great merit. For example, the tolerance used in the experiments and subsequent data processing in the previous chapter is 5 ppm. At this tolerance the false positive analysis gives an estimate of 30%; that is, roughly one third of matched nondirect sequencing ions are falsely identified simply because the underlying fragment ion mass is not different enough from the mass of the ion to which it is assigned. If, for instance, the tolerance is improved to 1 ppm, the threshold difference will be only one fifth of that in the original matching, and it is reasonable to believe that the false positive rate will be considerably reduced. However, such a deduction is merely qualitative, and quantitative conclusions, such as what resolution is needed to hold the false positive percentage below a certain level, remain unclear and need additional work.

With these considerations in mind, a computational approach is proposed which limits the pool of masses to protonated peptides. These are the most common precursor ions in collision induced dissociation experiments, the latter being the single most widely applied technique in protein sequencing. In short, all possible peptides with unique amino acid residue compositions are enumerated up to an upper limit of 1000 Da, and this mass pool then undergoes data mining procedures, especially with different

resolution powers (in ppm) applied, to shed light on the questions asked above. While the detailed computational approach and the results and discussion are presented in the subsequent sections, the later subsections of this section cover related information in more detail.

5.1.2 Background of Proposed Research Work

As part of the work in Chapter 4, a decoy peptide list was forged to evaluate the contribution of false positives that affect the sequencing of nondirect ions. Such forging requires building a specific peptide composition from the twenty amino acid residues, whose exact masses are known, so that the composition carries an exact mass equal to that of a given peptide in the authentic peptide list. This is made possible through the application of the "FrobeniusSolve" algorithm, which is readily accessible within the Mathematica programming language. Given the exact masses of the amino acid residues and the target peptide mass, this algorithm can provide a virtually unlimited number of peptides with unique compositions yet masses equal to the given mass. In order to make the decoy peptides better resemble their authentic counterparts, their lengths should match those of the original peptides as closely as possible. To achieve this, 500 decoy peptides meeting the mass criterion, whose lengths vary greatly, were randomly generated with the FrobeniusSolve algorithm for each peptide in the original list, so that one peptide with the desired length could be pulled from this candidate pool. This approach turned out to be successful, as the population distribution by length obtained with the decoy peptide list very closely resembled that of the authentic list. Keep in mind that, according to the solved Frobenius problem, any large enough (larger than the "Frobenius number" for the set of 19 exact masses corresponding to the 20 naturally existing amino acid residues) exact

PAGE 111

mass can be matched to corresponding peptide sequences. This means that no mass above a specific value can provide unambiguous peptide identification, as multiple peptides are bound to share that mass. However, how this "specific value" behaves remains unclear. On the other hand, this suggests that it is worthwhile to examine all possible peptide masses up to a certain upper limit, which are finite in number, and to study this pool of mass data, especially to explore the possibility of identification by mass: that is, given a specific mass accuracy in ppm, whether a certain peptide's composition, and the corresponding exact mass, is distinct from those of all other peptides. Following this consideration, a computational approach was planned and executed, as discussed in the subsequent sections.

5.2 Computational Method

5.2.1 Overview

This section covers the following aspects of the computational approach. It begins by introducing the peptides of interest for this research, followed by the programming details of how the corresponding masses are generated. The details of the data analysis are then presented, together with the other computational procedures that support the analysis of the original mass data.

5.2.2 Peptides of Interest

This research focuses on protonated peptides consisting of the twenty naturally occurring amino acid residues as well as oxidized methionine and alkylated cysteine, with a nominal mass upper limit of 1000 Da. As the most widely used protein sequencing technique today, collision-induced dissociation (CID) based tandem mass spectrometry identifies the sequence of a protein by mass
analyzing the peptides that are the enzymatic digestion products of proteins, as well as their CID product fragment ions, and matching these masses to a sequencing list. At the center of this technique, the protonated peptides, being the precursor ions of the CID experiments and the primary target of mass-based sequencing, draw special interest. Moreover, protonated peptides are compositionally identical to the y ions of the same sequence, which are themselves a major CID product type. For these reasons, protonated peptides are selected as the target for mass enumeration. Table 5-1 gives the sequence enumeration, with the exact masses also presented.

The upper mass limit of the peptide enumeration is set at 1000 Da. This limit is somewhat arbitrary, but it follows from several considerations. First, the number of unique peptide sequences is expected to grow exponentially with the maximum allowed sequence length, which in turn is tied to the upper mass limit. Thus, to keep the computational load realistic, an upper mass limit has to be set. Second, initial test runs with 1000 Da and 2000 Da upper limits indicated that the percentage of mass-distinguishable peptides has already dropped to a very small value once the mass exceeds 800 Da (see the Results and Discussion section for details), while the population at each nominal mass rises steeply (fitted, for convenience, with a ninth-order polynomial). The total number of peptides with unique sequences grows from millions to billions when the upper mass limit rises from 1000 Da to 2000 Da. While generating such raw sequence data is affordable even with a 2000 Da upper limit, the proposed subsequent data mining is of a higher order of complexity, which makes the higher limit less favorable. By taking into
consideration these factors, the 1000 Da upper mass limit is selected and used for all later calculations and data analysis.

5.2.3 Masses Enumeration Algorithm

To enumerate all possible unique sequences within an upper mass limit, the most straightforward method is brute force: exhaust all combinations by placing each amino acid residue in turn at each position of a sequence of a given length, and do so for every length allowed by the mass limit. The maximum length can be deduced by dividing the mass limit by the mass of the glycine residue, the lightest of all amino acid residues. However, as this research focuses on peptide masses, different sequences with the same amino acid residue composition correspond to the same exact mass. Brute-force enumeration therefore leads to great redundancy, and removing that redundancy is yet more work, adding to an already resource-consuming procedure. Moreover, since the mass of the glycine residue is about 57 Da, the 1000 Da mass limit leads to a length limit of 17 residues. That means roughly 21^17 (21 being the number of residues with unique masses), or over 3 × 10^22, combinations for length 17 alone, and the total number of combinations reaches the sum of 21^n for n = 1 to 17, a number associated with great computational complexity.

To avoid this complication, an alternative approach based on a dynamic programming philosophy is designed to cut both the computational load and the complexity of redundancy removal. At the core of this approach, one key question needs to be addressed: if we have a complete set of peptides with unique exact masses made from n unique amino acid residues, how can we obtain the new complete set when one more unique amino acid residue is introduced (i.e., the complete set for (n+1) unique residues)?

For convenience, the mass of this new unique residue, R, is denoted M, and the peptide mass upper limit L. Which peptides containing R need to be considered to produce a complete set? First, there are the peptides made exclusively from R residues (R, RR, RRR, and so on); this list has floor(L/M) (the integer part of such a division) entries. Second, for each peptide in the old complete set, R residues are appended one at a time until the mass upper limit is reached, to form the corresponding peptides with novel compositions. For example, for a sequence XXXXXX whose exact mass is Q, the novel peptides are XXXXXXR, XXXXXXRR, and so on, for as long as the mass stays within L. These two pathways are the only ones that can add novel peptides to the old set when the new residue R is taken into consideration. By adding to the old set the peptides made exclusively from R residues and the novel peptides derived from each peptide in the old complete set by the procedure described above, the complete set of peptides with unique compositions for (n+1) unique residues is obtained.

Note that the peptides made exclusively from R can be regarded as derived from a "null peptide" with zero residues and zero mass. Thus, by including such a null peptide in the complete set, the two procedures above can be unified into one simple approach. We now have the procedure to produce the complete set for (m+1) unique-mass residues from that for m; as long as we have the complete set for m = 1 to begin with, we can eventually obtain the complete set for all 21 unique-mass amino acid residues. In fact, the m = 1 complete set can also be built by the same procedure from the m = 0 set, which contains only the null peptide.
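The extension step described above can be sketched in a few lines. The following Python sketch is illustrative only and is not the C++ program used in this work. The three-residue alphabet, the function name `enumerate_compositions`, and the constant `PROTONATED_OFFSET` (one water plus one proton added to the residue sum) are assumptions made for the example.

```python
# Hypothetical three-residue alphabet with monoisotopic residue masses (Da).
RESIDUES = {"G": 57.02146, "A": 71.03711, "S": 87.03203}

# Assumed offset for a protonated peptide: one water plus one proton.
PROTONATED_OFFSET = 18.01056 + 1.00728


def enumerate_compositions(residues, limit, offset=PROTONATED_OFFSET):
    """Return (counts, protonated_mass) for every composition within `limit`.

    `counts` is a tuple with one entry per residue, in the iteration
    order of `residues`.
    """
    names = list(residues)
    # m = 0 set: only the null peptide (no residues, zero residue mass).
    current = [((0,) * len(names), 0.0)]
    for i, name in enumerate(names):
        mass_r = residues[name]
        extended = []
        for counts, mass in current:
            # k = 0 keeps the old entry; k >= 1 appends k copies of the
            # new residue until the mass limit would be exceeded.
            k = 0
            while mass + k * mass_r + offset <= limit:
                new_counts = counts[:i] + (k,) + counts[i + 1:]
                extended.append((new_counts, mass + k * mass_r))
                k += 1
        current = extended
    # Drop the null peptide and report protonated masses.
    return [(c, m + offset) for c, m in current if any(c)]


if __name__ == "__main__":
    for counts, mass in sorted(enumerate_compositions(RESIDUES, 200.0),
                               key=lambda entry: entry[1]):
        print(counts, round(mass, 5))
```

Because each residue is introduced exactly once and only ever appended to compositions that do not yet contain it, every composition is generated a single time, so no redundancy removal is needed afterwards.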

In terms of computational implementation, this dynamic programming approach is implemented in C++, and the source code is compiled and executed on the UF HPC cluster. The program turned out to be very efficient, calculating the complete set for the 1000 Da upper mass limit in just one minute and that for 2000 Da in less than ten. The output is in text-only format, with each peptide sequence and its exact mass on one row, for convenient subsequent data analysis.

5.2.4 Data Analysis

While the initial motivation for this mass enumeration study was to explore the detection or identification limit of ultra-high resolution mass spectrometers, more interesting behaviors of the large dataset itself were revealed as the analysis unfolded. As a result, the data analysis became a data mining exercise on the original dataset, covering far more than whether a mass peak is distinguishable at a certain detection accuracy. The tools used for these data mining procedures are described in the following subsections.

5.2.4.1 Peptide rearrangement by mass ascending order

The raw output file from the peptide enumeration program lists the peptides in the order in which they were enumerated. This ordering does not necessarily place peptides with the closest, or even equal, masses next to each other, and is therefore inconvenient for the subsequent mass differentiation. It is thus worthwhile to rearrange the entries (each corresponding to a peptide) of the original output file in ascending order of mass. This is done with a simple bubble sort algorithm implemented in C++. In addition, the sequence composition and exact mass of a given peptide are separated by a tab character in the text-only output file, for convenient
subsequent operations through both C++/UNIX shell programming and Microsoft Excel data processing.

5.2.4.2 Atomic composition and isomer depletion

In differentiation by mass, one important situation to deal with is isomers. Two peptides with different amino acid compositions can have the same atomic composition and thus be isomers, which in turn cannot be differentiated by mass: the mass differences arising from bond energies are too small to be reflected in exact masses quoted to five decimal places. Isomers must first be identified before their duplicate entries can be deleted. Computationally, this is done in two steps. First, the atomic compositions of all peptides in the output files are calculated. This is straightforward: a program written in C++ translates each amino acid residue in a peptide into its atomic composition and then sums the atomic compositions of all residues in the peptide to give the peptide's atomic composition. The amino acid residues considered and their atomic compositions are given in Table 5-2. The elemental composition is appended to the end of each peptide entry in the mass-ascending list to form an intermediate output file. Second, starting from this intermediate file, peptides with the same exact mass have their atomic compositions compared. If two or more peptides are found to be isomers, only one is kept while the others are deleted. This results in a new output file containing no isomers, which is used in the data mining where isomers are not to be considered.
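The two steps above can be illustrated with a minimal Python sketch; it is not the C++ program used in this work. The residue subset, the representation of a peptide composition as a string of one-letter codes, and the function names are assumptions for the example. For brevity the sketch groups peptides directly by atomic composition rather than first comparing exact masses, which gives the same result because identical atomic compositions imply identical exact masses.

```python
from collections import Counter

# Atomic compositions of a few residues (amino acid minus one water).
RESIDUE_ATOMS = {
    "G": {"C": 2, "H": 3, "N": 1, "O": 1},
    "A": {"C": 3, "H": 5, "N": 1, "O": 1},
    "N": {"C": 4, "H": 6, "N": 2, "O": 2},
    "Q": {"C": 5, "H": 8, "N": 2, "O": 2},
}


def atomic_composition(peptide):
    """Sum the atomic compositions of all residues in `peptide`."""
    total = Counter()
    for residue in peptide:
        total.update(RESIDUE_ATOMS[residue])
    # A sorted tuple is hashable and independent of residue order.
    return tuple(sorted(total.items()))


def deplete_isomers(peptides):
    """Keep only the first peptide seen for each atomic composition."""
    seen = set()
    kept = []
    for peptide in peptides:
        key = atomic_composition(peptide)
        if key not in seen:
            seen.add(key)
            kept.append(peptide)
    return kept


if __name__ == "__main__":
    # GG and N share C4H6N2O2; AG and Q share C5H8N2O2.
    print(deplete_isomers(["GG", "N", "AG", "Q", "AA"]))
```

The Gly-Gly/Asn and Ala-Gly/Gln pairs are small, well-known examples of peptides with different amino acid compositions but identical atomic compositions.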

5.2.4.3 Nominal mass population analysis

Before investigating differentiability, the first step taken with the processed peptide enumeration data is to count the number of peptides at each nominal mass. This is computationally simple, as it involves only grouping the peptides by nominal mass and counting the peptides in each group. Moreover, to reflect the detailed mass distribution within a given nominal mass, one nominal mass range (e.g., 100.0 to 101.0) can be further divided into, for example, 100 evenly spaced groups, on which another population analysis can be performed. In practice, the latter is done only for certain representative nominal masses, namely 100, 200, ..., 1000, rather than for all masses, to observe the trend in population as nominal mass grows. Notably, because of mass defects, certain masses might be misgrouped. For example, a mass of 499.995 might be grouped into 499-500 whereas its nominal mass should actually be 500, so that it should be grouped into 500-501. This situation is evaluated and discussed in the results section. The population analysis for all nominal masses is also performed, with very interesting results on top of the expected exponential growth. These are presented in the Results and Discussion section.

5.2.4.4 Differentiation by mass

A key question addressed here is whether a certain mass peak can be unambiguously assigned to a peptide with a unique amino acid composition at a given mass accuracy. In other words, are there peptides with neighboring masses so close that their masses overlap with the mass ranges of other peptides? For example, for a peptide A whose exact mass is 500.00000 Da and a relative accuracy of 5 ppm, if there is another peptide B having an exact mass within the
500.00000 ± 0.0025 Da range, say 500.00200, then a mass reading of 500.00000 from a mass spectrometer with 5 ppm mass accuracy cannot be unambiguously assigned to one peptide. In that case, both A and B are called "mass indistinguishable". On the other hand, if, in the peptide enumeration list, the peptides adjacent to peptide A, C and D, had exact masses of 500.00300 and 499.99700, respectively, then a mass reading of 500.00000 ± 5 ppm leads to the conclusion that peptide A is "mass distinguishable".

Computationally, the procedure for deciding whether a peptide is mass distinguishable works as follows. Starting from the peptide with the smallest exact mass in the mass-ascending peptide list, the "indistinguishable range" of the peptide is calculated from its exact mass and the given ppm value. In the scenario above, it is between 499.99750 and 500.00250 Da for an exact mass of 500.00000 and an accuracy of 5 ppm. Then the entries immediately before and after this peptide are read in, and their masses are compared with the lower and upper limits of the indistinguishable range. If neither falls within the range, a "D" mark is appended to the end of the peptide's entry, meaning it is mass distinguishable; otherwise "I" is appended for "indistinguishable". This operation is applied to all entries in the mass list, for both the original list and its isomer-depleted counterpart. Once this is done, the resulting peptides are also grouped by nominal mass, and the percentages of distinguishable peptides in both cases are analyzed as a function of nominal mass, to reveal the statistical behavior of the whole peptide pool. The entire procedure is also repeated with different mass accuracy values, to explore the potential and cost-effectiveness of different levels of high-accuracy mass analyzers.
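The neighbor comparison described above can be sketched as follows. This is a minimal Python illustration, not the program used in this work; the function name `mark_distinguishable` and the choice to return a list of marks (rather than append them to a file) are assumptions for the example. The input is assumed to be already sorted in ascending order of mass.

```python
def mark_distinguishable(masses, ppm):
    """Return 'D' or 'I' for each mass in an ascending list of masses.

    A mass is marked 'I' (indistinguishable) if either adjacent entry
    falls inside its +/- ppm window, and 'D' otherwise.
    """
    marks = []
    for i, mass in enumerate(masses):
        half_width = mass * ppm * 1e-6
        low, high = mass - half_width, mass + half_width
        neighbours = []
        if i > 0:
            neighbours.append(masses[i - 1])
        if i + 1 < len(masses):
            neighbours.append(masses[i + 1])
        clash = any(low <= other <= high for other in neighbours)
        marks.append("I" if clash else "D")
    return marks


if __name__ == "__main__":
    # Peptides D, A and C from the example in the text, at 5 ppm.
    print(mark_distinguishable([499.99700, 500.00000, 500.00300], 5))
    # Peptides A and B from the example in the text, at 5 ppm.
    print(mark_distinguishable([500.00000, 500.00200], 5))
```

Setting `ppm` to zero reduces the window to a single point, so only entries with exactly equal masses are marked "I".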

5.3 Results and Discussion

5.3.1 Population Analysis and Mass Periodicity

All peptides with unique amino acid compositions have been enumerated under a mass limit of 1000 Da and ranked by exact mass, before being further grouped by nominal mass, as described in Section 5.2. To begin with, the population of each group is plotted against the corresponding nominal mass, as shown in Figure 5-1. For convenient reference, a zoomed plot of the 0-600 Da mass range is also shown as an inset.

As shown in Figure 5-1, the number of compositionally distinct peptides at each nominal mass grows exponentially with nominal mass, with fewer than a hundred (95) peptides at a nominal mass of 500 Da but almost fifty thousand (49,993) at 1000 Da. This is expected, as higher masses allow more possible amino acid residue compositions, which leads to a greater number of unique peptides.

The nominal mass population analysis could be distorted if many masses were distributed around a specific integer mass, so that the distribution would be artificially truncated by the nominal mass grouping; this may well happen at higher nominal masses. To check for this, the distributions of exact masses at a series of representative nominal masses, namely 400, 500, ..., 1000, were investigated. This is done by evenly splitting the underlying mass range, for example 399.0 to 400.0 for nominal mass 400, into 200 sub-ranges, followed by another population analysis within that nominal mass, so that whether the suspected situation actually occurs can be verified. The results are shown in Figure 5-2.
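The grouping and sub-range counting used in this population analysis can be sketched as below. This is an illustrative Python sketch rather than the program used in this work. It assumes, for simplicity, that the nominal mass group of a peptide is the integer part of its exact mass and that the sub-ranges span the interval from `nominal` up to (but not including) `nominal + 1`; the function names are hypothetical.

```python
import math
from collections import Counter


def nominal_population(masses):
    """Count peptides per nominal mass group (integer part of the mass)."""
    return Counter(math.floor(mass) for mass in masses)


def sub_bin_population(masses, nominal, bins=200):
    """Histogram of exact masses inside [nominal, nominal + 1)."""
    counts = [0] * bins
    for mass in masses:
        if nominal <= mass < nominal + 1:
            index = min(int((mass - nominal) * bins), bins - 1)
            counts[index] += 1
    return counts


if __name__ == "__main__":
    example = [400.15, 400.17, 400.25, 401.5]
    print(nominal_population(example))
    print(sub_bin_population(example, 400, bins=10))
```

If the sub-range histogram of a group is concentrated well away from both ends of the interval, the grouping has not split the distribution of that nominal mass across two groups.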

The plots for nominal masses 400 to 1000 do show the mass center shifting to higher values within the range. This is expected, since in most cases the exact masses of the constituent amino acid residues take the form of a two-digit or low three-digit integer followed by a relatively small decimal value. A larger nominal mass means a greater number of amino acid residues and thus a greater mass defect, primarily due to the progressive inclusion of H atoms, which have the largest positive mass defect (atomic mass 1.00794). Fortunately, up to a nominal mass of 1000 Da the shift has not reached the other end of the decimal range, indicating that the nominal mass population analysis is not yet affected.

Given the exponential growth of population with nominal mass, it is worthwhile to fit it with a polynomial expansion, in order to decouple the observed periodic oscillation from the general increase. A 9th-order polynomial was found to give a good fit (R-square = 0.9994). The fit, the original population plot and the fitting parameters are presented in Figure 5-3. For better reference, a zoomed plot with nominal mass ranging from 0 to 600 is also presented in Figure 5-4. As can be seen in both figures, the fit works quite well across the full range of nominal masses.

The polynomial fit values at each nominal mass are subtracted from the original population values, so that the fluctuations about the general growth trend can be better demonstrated. Because the absolute fluctuations from such subtractions also grow exponentially with nominal mass, it is more convenient to examine the "relative fluctuations", in which the original fluctuation values are divided by their
corresponding fit values. The results are plotted in Figure 5-5, and a zoomed-in view of the 600-1000 mass range is presented in Figure 5-6.

The observed periodic behavior occurs across the entire mass range, from just over 200 Da to 1000 Da. Below 200 Da, the relatively small population at each nominal mass makes the periodicity harder to discern. To examine this uniform periodicity more thoroughly, the upper mass limit was raised to 1500 Da, and the mass enumeration and subsequent processing were performed again, as were the polynomial fitting and the relative fluctuation calculations. The corresponding results are presented in Figure 5-7 and Figure 5-8. As shown in both figures, the periodic behavior continues from 1000 Da to 1500 Da, although the amplitude is less symmetric owing to the lower quality of the polynomial fit.

To observe the periodicity more precisely, the relative fluctuation plot is zoomed in on the mass regions around 350, 850 and 950 Da, as shown in Figure 5-9 and Figure 5-10. In Figure 5-9, the population fluctuation has a period of 16 Da, with consecutive minima at 324, 340, 356, etc., whereas in the higher 850 Da range (Figure 5-10) the period becomes 14 Da (815, 829, 843, etc.). At higher masses the oscillation is also more uniform and regular; the relative fluctuation plot in the 850 Da range almost follows a sine function. What lies behind this periodicity is investigated in the section below.

5.3.2 Explanation of the Periodicity

To interpret this periodicity, one option is to consider the masses of the building blocks, the amino acid residues, and, more importantly, their respective mass
differences. Hence the nominal mass differences between all possible amino acid residue pairs, totaling 210 (21 × 20 / 2), have been calculated; they range from 0 to 129. The number of occurrences of each nominal mass difference is presented as a histogram in Figure 5-11.

According to the histogram, certain nominal mass differences appear more frequently than others. In particular, high numbers of occurrences show up at 14/16, 28/32, 44/46, 57, 71/76 and 85/89 Da, separated by low numbers in between. Moreover, these "peaks" of occurrence are separated by 14 Da: 71 - 57 = 46 - 32 = 28 - 14 = 14. This may be the reason why the higher masses (e.g., in the 850 Da range) show a periodic oscillation in their population with a period of 14.

The question then becomes how these peaks of occurrence, especially the one at 14 Da, arise. One feasible approach is to look further into the atomic composition of each amino acid residue. According to Table 5-3, all the amino acid residues considered in this research are built from five types of atoms: C, H, N, O and S. The compositional differences between all residues are therefore calculated, and the results are presented in Table A-4 and Table A-5. Note that a minus sign denotes a smaller number of atoms in the comparison. To make trends in the results easier to see, comparisons that return the same result more than once are highlighted in the same color.

According to these tables, multiple composition differences appear more than once, and their relatively high frequency of occurrence may be the reason behind the periodic behavior. To examine these high
frequency occurrences, the composition differences are summarized in Table 5-3, in order of their number of occurrences. From the raw and summarized tables, the following observations can be made. First, the mass difference of 14 Da is the most dominant, for it corresponds to multiple composition differences, some of which themselves occur with high frequency. These composition differences include "CH2" (as between Ala and Gly), "C2H6 - O" (as in Lys vs Asn), "-H2 + O" (as in Asp vs Thr), etc. Meanwhile, the three most frequently occurring composition differences are "CH2" (14 Da, 5 occurrences), "O" (16 Da, 3 occurrences) and "S" (16 Da, 3 occurrences). It is also noticeable that stacked multiples of "CH2" are common, such as "C2H4" (28 Da, 2 occurrences) and "C3H6" (42 Da, 2 occurrences). In addition, there is another composition difference corresponding to an integer multiple of 14 Da ("CO", 28 Da). Taken together, these observations may explain why at lower masses the patterns have a period of 14-16 Da, whereas at higher masses the 14 Da difference becomes dominant, leading to a very uniform 14 Da period. Later work by Hubler et al. [82] further explains the 14 Da period from a mathematical point of view.

5.3.3 Identification by Mass under Different PPM Values

The original primary goal of this research is to investigate the possibility of identifying peptide composition by mass measurement alone, and how the accuracy of mass spectrometers/analyzers affects such identification. The methods have been explained in detail in Section 5.2; in this section the corresponding results and discussion are presented.
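As a brief aside before the identification results, the pairwise nominal mass differences used in Section 5.3.2 can be tabulated with a short script. The Python sketch below is illustrative and is not the procedure used to produce Figure 5-11. The nominal residue masses are standard values, but the labels, the function name `difference_histogram`, and in particular the choice of carbamidomethylation (nominal mass 160) for the alkylated cysteine are assumptions made for the example.

```python
from collections import Counter
from itertools import combinations

# Nominal residue masses (Da) for the 21 unique-mass building blocks.
NOMINAL_RESIDUE_MASS = {
    "G": 57, "A": 71, "S": 87, "P": 97, "V": 99, "T": 101, "C": 103,
    "L/I": 113, "N": 114, "D": 115, "Q": 128, "K": 128, "E": 129,
    "M": 131, "H": 137, "F": 147, "R": 156, "Y": 163, "W": 186,
    "M(ox)": 147,   # oxidized methionine
    "C(alk)": 160,  # ASSUMPTION: carbamidomethylated cysteine
}


def difference_histogram(nominal_masses):
    """Count how often each absolute nominal mass difference occurs."""
    values = nominal_masses.values()
    return Counter(abs(a - b) for a, b in combinations(values, 2))


if __name__ == "__main__":
    histogram = difference_histogram(NOMINAL_RESIDUE_MASS)
    for difference, count in sorted(histogram.items()):
        print(difference, count)
```

With 21 building blocks the histogram covers 210 pairs, and the differences span 0 (Gln vs Lys, which share a nominal mass) to 129 (Trp vs Gly), in line with the range quoted in Section 5.3.2.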

5.3.3.1 Unique masses

Before turning to realistic PPM values, the most extreme case is considered: a perfect mass analyzer that measures the exact mass without any relative error. With such a mass analyzer, as long as the exact mass of a peptide in the list differs from that of every other peptide in the list, its composition can be determined by a mass-only measurement. The masses of such peptides are referred to as "unique masses". Computationally, this uses the same program described in Section 5.2, with the PPM value set to zero. After the unique mass identification is done for all masses in the list, the nominal mass population analysis is performed again on the unique masses, and the results are shown in Figure 5-12, together with the population analysis for all peptides. For convenient reference, a zoomed plot with nominal mass ranging from 300 to 600 Da is given in the inset of Figure 5-12.

As shown in Figure 5-12, for higher (> 800 Da) nominal masses, even with extreme accuracy (0 PPM), the population of unique masses still accounts for only a small portion of the total population, whereas the situation is relatively better for lower nominal masses, especially in the < 500 Da mass range. Moreover, unlike the overall population plot, which exhibits clear periodicity, the unique mass population plot is much smoother, with no significant fluctuation. This might be because the wavy behavior of the overall population plot comes largely from isomers, which by definition are not present among the unique masses. To better compare the two populations, the unique mass population is divided by the all-mass population at each nominal mass, and the percentages are plotted against nominal mass, as shown in Figure 5-13.

According to Figure 5-13, at lower (< 500 Da) nominal masses, a large portion (50% to 100%) of peptides can be identified by mass (or atomic composition), whereas at higher nominal masses the percentage falls significantly, to less than 10% as the nominal mass approaches 1000 Da. The percentage plot also shows the periodic fluctuations seen in the population analysis. This may be because at some nominal masses more isomers are present, which raise the overall population and in turn lower the percentage of unique masses.

5.3.3.2 Identification under different mass accuracies

With the 0 PPM (exact mass identification) scenario investigated, the next step is to observe how different relative accuracies of a mass analyzer affect the quality of identification by mass alone. To begin with, the aforementioned program is run again at a relatively low level of accuracy, 50 PPM, which is typical of many commercially available mass spectrometers. The percentage of mass-identified peptides at this accuracy is plotted against nominal mass and presented along with the 0 PPM result for easier comparison, as shown in Figure 5-14.

From the plot in Figure 5-14, it is obvious that accuracy makes a major difference to the quality of identification by mass. While 0 PPM accuracy maintains over 20% identification up to 600 Da and a considerable portion thereafter, the lower accuracy of 50 PPM makes such identification largely ineffective at nominal masses of 400 Da and greater. This raises the question of how higher but still practical accuracies fill the gap between 50 PPM and 0 PPM in terms of their contribution to the quality of identification. To answer this question, a series of higher accuracies, namely 10, 5, 1 and 0.1 PPM, are used for additional program runs,
and the corresponding results are shown in Figure 5-15. A zoomed-in plot of the 500 to 1000 mass range is also presented in Figure 5-16 for better observation.

It is evident that even 10 PPM accuracy improves on the 50 PPM results, and further improvements in accuracy steadily raise the nominal mass limit below which a considerable portion of masses lead to unambiguous composition identification by mass alone. However, it is worth mentioning that even with 0.1 PPM accuracy, the quality of identification is still not as good as in the exact mass (0 PPM) scenario: the latter gives around 20% at a nominal mass of 600 Da, whereas the former yields less than 5%. In this sense, further improvement in the detection accuracy of today's ultra-high resolution mass spectrometers, especially FT-ICR instruments, is still of meaningful merit for identification by mass alone. On the other hand, the trend clearly indicates that the quest for higher mass accuracy yields progressively lower returns in terms of identification by mass alone.

5.3.3.3 Atomic composition identification

In the previous analysis, only compositionally distinct peptides were enumerated, so peptides with the same amino acid composition but different sequences were not counted separately. The analysis addresses only the determination of the amino acid composition of a peptide, and as a result the list of peptides did not exclude isomers. However, it is also of practical merit if a mass measurement alone can establish the atomic composition of an unknown peptide. With this in mind, isomer redundancies are removed through the procedure described in Section 5.2, leaving only one representative peptide for each group of isomers. The remaining pool of peptides then undergoes the same procedure as above, with the same set of PPM values. The resulting
identification percentages are shown in Figure 5-17. Again, a zoomed-in plot is also presented, in Figure 5-18.

The plots in Figure 5-17 and Figure 5-18 show much higher identification percentages than the previous results. This is due to two factors. On one hand, the redundancy depletion significantly reduces the overall size of the peptide pool, and thus lowers the denominator. On the other hand, the depletion of isomer redundancies lets the remaining representative peptides join the original unique mass peptides, and thus raises the numerator. As a result, a 0.1 PPM accuracy mass measurement can maintain an identification percentage of around 30% all the way to 1000 Da, which is roughly ten times higher than in the previous results. Also, for nominal masses over 700 Da, the 1 PPM and 0.1 PPM plots show a decisive advantage over the neighboring 5 PPM results, again implying very practical merit in the pursuit of even higher resolving power for ultra-high resolution mass spectrometers.

5.3.4 Influence from Additional Amino Acid Residues

The population analysis and the subsequent discovery of mass periodicity are based on the peptide list built from the 22 most common amino acid residues, which have 21 unique masses and atomic compositions, since leucine and isoleucine are isomers. The work presented above to explain the mass periodicity suggests that the periodic behavior is related to the most common mass differences between these residues. It is then worthwhile to examine whether the introduction of more building blocks affects the mass periodicity. In particular, there are many typical post-translational modifications (PTMs) that are attached to the side chains of certain amino acid residues, thus changing their exact masses and atomic compositions. To explore this new scenario, one common and important PTM, phosphorylation, is taken into consideration.

Phosphorylation is commonly observed on three amino acid residues, namely serine (to form pSer), threonine (pThr) and tyrosine (pTyr). These three phosphorylated residues therefore join the original 21 unique-mass building blocks, and the new group of 24 building blocks is used to re-enumerate all possible masses up to 1000 Da. A nominal mass population analysis is then performed on the new list, and the results are plotted in Figure 5-19 and compared with those from the original list.

According to Figure 5-19, adding the three residues increases the population at each nominal mass considerably, especially at higher nominal masses; this is to be expected, as more building residues lead to more composition possibilities. However, the new plot shows the same wavy pattern as the old one. Moreover, as seen in the zoomed plot, the ups and downs of the two plots are highly synchronized with each other, maintaining the same period of around 14 Da. The mass periodicity therefore applies to a broader scenario than that in which only peptides built from naturally occurring amino acid residues are considered.

In addition to the population analysis across the entire range up to 1000 Da, analyses at several individual nominal masses have also been performed, similar to those presented in Figure , but with both the non-phosphorylated and phosphorylated results. The new histograms are shown in Figure 5-20. Figure 5-20 shows a clear trend: the overall histograms from the phosphorylated results are shifted to lower masses, which can be attributed to the presence of the phosphorus atom and its exact mass (30.9738 Da). Despite this shift, however, the overall distribution is still within the 1 Da range.

5.4 Summary

Using a dynamic programming algorithm implemented in C++, all peptides with unique amino acid compositions built from 21 amino acid residues have been enumerated up to a mass limit of 1000 Da, forming a complete peptide list. Subsequent analysis and data mining on this list led to the following findings.

First, starting from around 200 Da, where the population at each nominal mass is large enough to carry statistical merit, a clear periodicity has been confirmed in the plot of population versus nominal mass. The period is around 15 to 16 Da at lower masses and 14 Da at higher masses. These values correspond to the most frequent mass and atomic compositional differences between the amino acid residue building blocks.

Second, the possibility of compositional identification by mass measurement alone has been investigated as a function of mass accuracy. The percentage of peptides that can be identified by mass was examined at several accuracies, and the results for lower and upper nominal masses lead to different conclusions. At lower nominal masses, a high-resolution mass spectrometer can identify the amino acid residue composition for as many as 60% of the peptides at a given nominal mass, but this number drops below 10% at higher masses, primarily because of the large number of isomers at larger masses. Consequently, when atomic composition identification is considered instead, which removes the negative effect of isomers, identification improves greatly, with nearly 40% to 60% of peptides with nominal masses between 600 and 1000 Da identified by a mass measurement of 0.1 ppm accuracy.

Whether these conclusions are limited by the number of amino acid building blocks has also been investigated. Three post-translationally modified residues, the phosphorylated pSer, pThr and pTyr, were added to the original 21 building residues and the population analysis was repeated. The results behave very similarly to the original ones, apart from higher absolute populations due to the greater number of possible combinations introduced by the new residues. Whether a large number of PTM residues would change the conclusions about mass periodicity remains an open question, but where such an analysis is needed, the approach used in this chapter can be applied in the same way.

Overall, the results and the considerations used to explain the periodicity suggest that, for polymeric molecules, differences in mass and atomic composition between building blocks make it likely that some nominal masses allow more isomers than others, giving rise to a periodicity in the population at each nominal mass. More broadly, similar periodic behavior can be expected for many other polymeric species composed of repeating building blocks, especially biomolecules such as DNA and saccharides, as well as organic polymers.
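The identification percentage discussed in this summary can be made concrete with a small Python sketch. It assumes, as one plausible reading of the criterion, that a peptide mass counts as identified when no other candidate at the same nominal mass lies within the stated ppm tolerance; the function name and the example list are hypothetical. The example uses Gln, Lys and Gly+Ala with the Table 5-1 values; Gln and Gly+Ala share the atomic composition C5H8N2O2, so their tiny difference here comes only from table rounding.

```python
def identifiable_fraction(masses, ppm):
    """Fraction of masses with no other candidate within +/- ppm of them.

    masses: exact masses of all candidates at one nominal mass.
    Pass set(masses) instead to mimic an isomer-excluded population.
    """
    ordered = sorted(masses)
    n = len(ordered)
    if n == 0:
        return 0.0
    identified = 0
    for i, mass in enumerate(ordered):
        tolerance = mass * ppm * 1e-6
        # Only the nearest neighbors in the sorted list need to be checked.
        left_clear = i == 0 or mass - ordered[i - 1] > tolerance
        right_clear = i == n - 1 or ordered[i + 1] - mass > tolerance
        if left_clear and right_clear:
            identified += 1
    return identified / n


# Candidates at nominal mass 128 (Table 5-1 values): Gln, Lys, Gly+Ala.
candidates_128 = [128.05860, 128.09500, 57.02147 + 71.03712]
fraction_at_1ppm = identifiable_fraction(candidates_128, ppm=1.0)
```

With these three candidates only Lys is separated from its neighbors at 1 ppm, so one third of the pool is identified, which mirrors how isomers depress the percentage at higher nominal masses.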

Figure 5-1. Population of compositionally distinct peptides at each nominal mass.

Figure 5-2. Histogram analysis of compositionally distinct peptides at each individual nominal mass.

Figure 5-3. Number of compositionally distinct peptides and polynomial fit to the ninth order up to a nominal mass of 1000 Da.

Figure 5-4. Population and polynomial fit to the ninth order up to 600 Da.

Figure 5-5. Relative deviation between the population (P) and the polynomial fitting value (F) at each nominal mass, calculated as relative deviation = (P - F)/F, for nominal masses up to 1000 Da.

Figure 5-6. Relative deviation between the population (P) and the polynomial fitting value (F) at each nominal mass, calculated as relative deviation = (P - F)/F, for nominal masses between 600 and 1000 Da.

Figure 5-7. Population and polynomial fit as a function of nominal mass up to 1500 Da.

Figure 5-8. Relative deviation between the population (P) and the polynomial fitting value (F) at each nominal mass, calculated as relative deviation = (P - F)/F, for nominal masses between 1000 and 1500 Da.

Figure 5-9. Relative deviation between the population (P) and the polynomial fitting value (F) at each nominal mass, calculated as relative deviation = (P - F)/F, for nominal masses between 310 and 400 Da (B) and between 900 and 1000 Da (C). Reprinted with permission from Long Yu, Yan Mei Xiong and Nick C. Polfer, Anal. Chem. 2011, 83, 8019-8023. Copyright © 2011, American Chemical Society.

Figure 5-10. Relative deviation between the population (P) and the polynomial fitting value (F) at each nominal mass, calculated as relative deviation = (P - F)/F, for nominal masses between 815 and 900 Da.

Figure 5-11. Number of occurrences for each nominal mass difference, calculated as the absolute value of the nominal mass difference between any pair of amino acid residues considered in this research.

Figure 5-12. Number of unique masses vs. overall population at each nominal mass for masses up to 1000 Da. The inset gives a zoomed plot between 300 and 600 Da.

Figure 5-13. Percentage of unique masses within the entire population at each nominal mass for masses up to 1000 Da.

Figure 5-14. Percentage of masses identifiable with 50 ppm detection accuracy and of unique masses within the entire population at each nominal mass for masses up to 1000 Da.

Figure 5-15. Percentages of masses distinguishable with 0.1, 1, 5 and 10 ppm detection accuracy within the entire population at each nominal mass for masses up to 1000 Da.

Figure 5-16. Percentages of masses distinguishable with 0.1, 1, 5 and 10 ppm detection accuracy within the entire population at each nominal mass for masses between 400 and 1000 Da.

Figure 5-17. Percentages of masses identifiable with 0.1, 1, 5 and 10 ppm detection accuracy within the entire isomer-excluded population at each nominal mass, for masses up to 1000 Da.

Figure 5-18. Percentages of masses identifiable with 0.1, 1, 5 and 10 ppm detection accuracy within the entire isomer-excluded population at each nominal mass, for masses between 500 and 1000 Da.

Figure 5-19. Population at each nominal mass for masses up to 1000 Da when the original 21 AA residues (black) or the augmented 24 AA residues (21 plus 3 phosphorylated residues, red) are used during mass enumeration.

Figure 5-20. Histograms of mass distribution within nominal masses 499, 699 and 899 Da, when only the original 21 AA residues (black) or the augmented 24 AA residues (orange) are considered for mass enumeration.

Table 5-1. List of amino acid residues considered and their exact masses.

AA residue   Exact mass    AA residue   Exact mass
Gly          57.02147      Lys          128.0950
Ala          71.03712      Glu          129.0426
Ser          87.03203      Met          131.0405
Pro          97.05277      His          137.0589
Val          99.06842      Ox Met       147.0354
Thr          101.04770     Phe          147.0684
Cys          103.00920     Arg          156.1011
Leu/Ile      113.08410     Alk Cys      160.0306
Asn          114.04290     Tyr          163.0633
Asp          115.02700     Trp          186.0793
Gln          128.05860

Table 5-2. From left to right: atomic composition, one-letter code, name and exact mass of the 21 amino acid residues included in the amino acid list.

C,H,N,O,S      Letter   Name                  Mass
2,3,1,1,0      G        Glycine               57.02146
3,5,1,1,0      A        Alanine               71.03711
3,5,1,2,0      S        Serine                87.03203
5,7,1,1,0      P        Proline               97.05276
5,9,1,1,0      V        Valine                99.06841
4,7,1,2,0      T        Threonine             101.04770
3,5,1,1,1      C        Cysteine              103.00920
6,11,1,1,0     L        Leucine(Isoleucine)   113.08410
4,6,2,2,0      N        Asparagine            114.04290
4,5,1,3,0      D        Aspartic Acid         115.02690
5,8,2,2,0      Q        Glutamine             128.05860
6,12,2,1,0     K        Lysine                128.09500
5,7,1,3,0      E        Glutamic Acid         129.04260
5,9,1,1,1      M        Methionine            131.04050
6,7,3,1,0      H        Histidine             137.05890
5,9,1,2,1      I        Oxidized Methionine   147.03540
9,9,1,1,0      F        Phenylalanine         147.06840
6,12,4,1,0     R        Arginine              156.10110
5,8,2,2,1      B        Alkylated Cysteine    160.03070
9,9,1,2,0      Y        Tyrosine              163.06330
11,10,2,1,0    W        Tryptophan            186.07930

Table 5-3. Summary of compositional differences in Tables A-4 and A-5, showing their corresponding nominal mass differences and number of occurrences.

Compositional difference (C, H, N, O, S)   Nominal mass   Occurrences
1,2,0,0,0                                  14             5
0,0,0,1,0                                  16             3
0,0,0,0,1                                  32             3
1,2,0,0,1                                  46             3
2,3,1,1,0                                  57             3
0,-1,-1,1,0                                1              2
2,4,0,-1,0                                 12             2
0,-1,1,0,0                                 13             2
-1,-3,1,1,0                                15             2
-1,-4,0,2,0                                16             2
4,0,0,0,-1                                 16             2
2,1,1,-1,0                                 23             2
1,1,1,0,0                                  27             2
1,0,0,1,0                                  28             2
2,4,0,0,0                                  28             2
0,-1,1,1,0                                 29             2
1,2,0,1,0                                  30             2
3,6,0,0,0                                  42             2
2,4,0,1,0                                  44             2
5,3,-1,0,0                                 49             2
2,2,0,2,0                                  58             2
2,4,0,0,1                                  60             2
6,4,0,0,0                                  76             2
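The occurrence counts in Table 5-3 can be recomputed from the atomic compositions in Table 5-2. The Python sketch below is one way to do so and is not the author's program: it takes the difference for every pair of residues, orients each difference so that its nominal mass is positive (an assumed convention), and counts identical differences. The nominal atomic masses C = 12, H = 1, N = 14, O = 16, S = 32 and all names are assumptions of the sketch.

```python
from collections import Counter
from itertools import combinations

# Atomic compositions (C, H, N, O, S) keyed by the one-letter codes of Table 5-2.
COMPOSITIONS = {
    "G": (2, 3, 1, 1, 0), "A": (3, 5, 1, 1, 0), "S": (3, 5, 1, 2, 0),
    "P": (5, 7, 1, 1, 0), "V": (5, 9, 1, 1, 0), "T": (4, 7, 1, 2, 0),
    "C": (3, 5, 1, 1, 1), "L": (6, 11, 1, 1, 0), "N": (4, 6, 2, 2, 0),
    "D": (4, 5, 1, 3, 0), "Q": (5, 8, 2, 2, 0), "K": (6, 12, 2, 1, 0),
    "E": (5, 7, 1, 3, 0), "M": (5, 9, 1, 1, 1), "H": (6, 7, 3, 1, 0),
    "I": (5, 9, 1, 2, 1), "F": (9, 9, 1, 1, 0), "R": (6, 12, 4, 1, 0),
    "B": (5, 8, 2, 2, 1), "Y": (9, 9, 1, 2, 0), "W": (11, 10, 2, 1, 0),
}
NOMINAL_ATOM_MASS = (12, 1, 14, 16, 32)  # C, H, N, O, S


def nominal_mass(difference):
    """Nominal mass (Da) of a compositional difference."""
    return sum(n * m for n, m in zip(difference, NOMINAL_ATOM_MASS))


def canonical_difference(first, second):
    """Difference between two compositions, oriented to a non-negative nominal mass."""
    diff = tuple(b - a for a, b in zip(first, second))
    mass = nominal_mass(diff)
    zero = tuple(0 for _ in diff)
    # Flip the sign when the mass is negative, or when it is zero and the
    # first non-zero entry is negative, so each pair maps to one orientation.
    if mass < 0 or (mass == 0 and diff < zero):
        diff = tuple(-d for d in diff)
    return diff


def count_differences(compositions):
    """Count how often each canonical difference occurs over all residue pairs."""
    pairs = combinations(compositions.values(), 2)
    return Counter(canonical_difference(a, b) for a, b in pairs)


difference_counts = count_differences(COMPOSITIONS)
```

For example, `difference_counts[(1, 2, 0, 0, 0)]` counts the residue pairs that differ by one CH2 unit (14 Da), the most frequent difference in Table 5-3.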

CHAPTER 6
CONCLUSIONS AND FURTHER WORK

Three computational approaches have been carried out to rationalize and quantify the chemistry of sequence permutations during peptide sequencing.

In Chapter 3, to rationalize from an energetic point of view why linear peptide fragments in CID experiments favor or disfavor macrocycle formation, b ions of oligoglycines (b4 to b8), [YAG]n (b4 and b6), QWFGLM (b6) and its proline-modified (b6 and b7) and arginine-modified (b5 and b6) analogs were chosen as target systems for a systematic study. Previous experimental results from IRMPD spectroscopy and HDX had established which molecular systems favor linear oxazolone structures and which favor macrocycles, allowing a comparison between experiment and computation. Through molecular dynamics and DFT calculations, the geometries and electrostatic potentials of these linear structures were obtained and used in the AMBER suite of programs to generate the force field parameters for MD simulations. The basic premise of our approach was that extensive macrocycle formation should correlate with a low energy penalty for a head-to-tail nucleophilic attack, whereas a high penalty would impede this process and leave the fragment as a linear oxazolone. The potential of mean force (PMF) along the distance between the atoms involved in this nucleophilic attack, namely the N-terminal N and the C-terminal oxazolone C, reflects the energy penalty for this process. The PMF was approximated using umbrella sampling and the WHAM method in AMBER. The calculated PMF results were compared with the experimental assignments of oxazolone or macrocycle structures, showing qualitative agreement: peptides that disfavor macrocycle formation have higher energy penalties. However, for the arginine-modified peptides this agreement breaks down, and an alternative explanation involving proton transfer from the oxazolone ring to the basic arginine side chain was put forward. The b5 QRFGL and b6 QRFGLM fragments were found to have the lowest energy penalty for the proton transfer, which correlates with their disfavoring of macrocycle formation. Given that proton transfer constitutes a reaction pathway competing with macrocyclization, this is consistent. In summary, the computed results provide an energetic basis for rationalizing facile or non-facile macrocycle formation.

To evaluate the contribution of sequence permutation (scrambling) to peptide sequencing at a statistical level, 41,434 previously recorded MS2 spectra were investigated in Chapter 4. An in-house sequencing program, which matches the fragments in MS2 spectra to both direct and nondirect sequence ions, processed all the raw spectra, and the results were further processed for statistical analysis by another in-house program. Out of the 41,434 spectra, over 4500 were matched to tryptic digest peptides from P. aeruginosa. For high-confidence matches with at least 5 direct sequence ions matched, around 20% of the fragments in the MS2 spectra can be assigned based on the original sequence. Of the remaining 80% of fragments, up to one third can be assigned based on permuted sequences. The propensity for sequence permutation was also shown to increase with peptide length and charge state (e.g., 3+ vs 2+). When the number of peptides at each length is factored in and the overall percentage of nondirect ions among all fragments is calculated, after removing the contribution from false positives, nondirect ions are still less abundant than direct ions, indicating that sequence permutation makes only a limited contribution to peptide sequencing.

In Chapter 5, the effectiveness of ultra-high-resolution mass spectrometry for identification by mass alone was explored by computing all compositionally distinct peptides with nominal masses up to 1000 Da with an in-house C++ program, using 21 distinct amino acid residues as building blocks. When these peptides are grouped by nominal mass, a periodic behavior appears for nominal masses of 200 Da and above. The period was found to be ~15 Da at lower masses and ~14 Da at higher masses, values that reflect the most common mass differences between amino acid residues. The percentage of compositionally distinct peptides that can be uniquely identified with high mass accuracy showed a similar modulation pattern. At some masses this percentage falls below 10%, whereas at other masses, especially lower ones, more than 30% of compositionally distinct peptides can be identified. Because of the large number of isomers, improvements in mass accuracy have only a marginal effect on the identification percentage. While the exact pattern of the periodicity depends on the amino acid building blocks used, the periodicity itself is more universal: when the phosphorylated residues pSer, pThr and pTyr were added to the existing 21 residues, the periodicity persisted. It is thus believed that a similar periodic behavior in the number of isomers at each nominal mass can be expected for any polymeric molecule, because the differences in elemental composition between building blocks, and the corresponding mass differences, cause some nominal masses to have more isomers than others.

Overall, the work presented in this dissertation has contributed to a better understanding of scrambling chemistry, both qualitatively and quantitatively. Several directions for further work follow from the current results and approaches.

For the rationalization of macrocycle formation, the current PMF calculations have their limitations: the molecular dynamics simulation used in the current approach is quite simplistic, cannot account for bond formation and breaking, and limits the shortest distance that the pair of atoms of interest can reach. A quantum chemical transition state calculation for the macrocycle formation of linear peptides would yield more accurate energetics. Because of the size and complexity of these systems, which involve many putative transition states, such calculations would require considerable effort and expertise. Moreover, while the current approach focuses on b ions from CID experiments, a ions have also been proposed to play a role in sequence permutations in recent work by Paizs [14,15]. The energetics of the rearrangement reactions of a ions would be another fertile area of research for establishing trends in the macrocyclization chemistry.

The statistical tools designed for the current large pool of MS2 data can also be applied to data from other types of tandem mass spectrometers, for instance Q-TOF instruments, in which the parent ion is fragmented in a collision cell and the fragments are analyzed by TOF. The scrambling percentage in collision cell instruments may be quite different from that in ion traps, because the energy is deposited quickly into the ion, typically via a single, higher-energy collision. This contrasts with the slow heating via multiple collisions in ion traps.

APPENDIX A
ORIGINAL DATA FILE SAMPLES AND TABLES

Table A-1. A sample MS2 file.

1055_3_2R1.txt          //MS2 file name
>S 1055 3               //serial number
1427.71991 3            //precursor mass and charge state
120.156769 732.86       //fragment masses and intensities
131.13237 845.15
158.976166 893.85
235.758713 788.85
246.155731 1013.2
249.09552 781.82
316.993164 692.26
335.44281 891.58
362.181671 7782.34
363.181824 2063.83
439.794037 2677.59
447.268707 2725.26
640.420044 927.85
666.732483 829.48
751.685303 808.99
755.377808 1049.99
835.627991 759.29
937.439026 822.03
1039.70178 950.91
1224.3501 995.56
1257.90906 819.04
1292.12598 713.27
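A file laid out like the sample in Table A-1 can be read with a few lines of Python. The sketch below is illustrative only and is not the in-house program: it assumes the four-part layout shown in the table (file name, a ">S" line, precursor mass and charge, then mass/intensity pairs) and treats anything after "//" as an annotation to be discarded. The function and key names are hypothetical.

```python
def parse_ms2_text(text):
    """Parse MS2 text laid out as in Table A-1 into a dictionary."""
    lines = []
    for raw in text.splitlines():
        # Drop "//" annotations and surrounding whitespace; skip empty lines.
        content = raw.split("//", 1)[0].strip()
        if content:
            lines.append(content)

    file_name = lines[0]
    serial_fields = lines[1].split()[1:]          # fields after the ">S" marker
    precursor_mass, charge = lines[2].split()
    peaks = []
    for line in lines[3:]:
        mass, intensity = line.split()
        peaks.append((float(mass), float(intensity)))

    return {
        "file_name": file_name,
        "serial_fields": serial_fields,
        "precursor_mass": float(precursor_mass),
        "charge": int(charge),
        "peaks": peaks,
    }


SAMPLE = """1055_3_2R1.txt //MS2 file name
>S 1055 3 //serial number
1427.71991 3 //precursor mass and charge state
120.156769 732.86 //fragment masses and intensities
131.13237 845.15
"""
parsed = parse_ms2_text(SAMPLE)
```

Here `parsed["peaks"]` holds the fragment list as (mass, intensity) tuples, ready to be compared against theoretical fragment masses.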

Table A-2. Sample raw output file.

2427.30571 was between 2427.287584 and 2427.311856
Serial number   Matched Sequence
3384            ITYTFTDEAPALATYSLLPIVK
The precursor ion charge is: 3
Comparing direct fragments.
Mass           Matched Fragment   Sequence                 Intensity   Charge   ppm value
>456.316467    y4                 ITYTFTDEAPALATYSLLPIVK   18792.15    1        3.43
>820.896179    b15_H2O_2          ITYTFTDEAPALATYSLLPIVK   1276.87     2        0.5
>920.954651    b17_H2O_2          ITYTFTDEAPALATYSLLPIVK   5074.22     2        0.91
>929.960938    b17_2              ITYTFTDEAPALATYSLLPIVK   1447.37     2        1.99
>932.577393    y8                 ITYTFTDEAPALATYSLLPIVK   1635.95     1        4.41
>986.499573    b18_2              ITYTFTDEAPALATYSLLPIVK   13245.84    2        1.57
>1042.476196   b9                 ITYTFTDEAPALATYSLLPIVK   1598.42     1        3.3
>1141.098999   b21_2              ITYTFTDEAPALATYSLLPIVK   1268.35     2        4.15
>1323.649170   b12                ITYTFTDEAPALATYSLLPIVK   1142.46     1        1.87
>Done
Comparing nondirect fragments for scrambled sequences.
Mass           Matched Fragment   Sequence                 Intensity   Charge   ppm value
$626.35        sb9_NH3_2          DEAPALATYSLLPIITYTFT     927.14      2        1.69
$Done
Total compared values: 40
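The first line of the sample output in Table A-2 reports that a measured mass falls between two bounds. Those bounds are consistent with a window of plus or minus 5 ppm around a mass of about 2427.29972; this midpoint and the 5 ppm width are inferred from the printed bounds and are not stated in the source. The following Python sketch, with hypothetical function names, shows the arithmetic of such a window and of a ppm error.

```python
def ppm_window(theoretical_mass, ppm):
    """Lower and upper mass bounds for a +/- ppm tolerance."""
    delta = theoretical_mass * ppm * 1e-6
    return theoretical_mass - delta, theoretical_mass + delta


def ppm_error(observed_mass, theoretical_mass):
    """Signed deviation of an observed mass from a theoretical mass, in ppm."""
    return (observed_mass - theoretical_mass) / theoretical_mass * 1e6


# Inferred midpoint of the bounds printed in Table A-2 (an assumption).
reference_mass = 2427.29972
lower, upper = ppm_window(reference_mass, ppm=5.0)
observed = 2427.30571
is_match = lower <= observed <= upper
deviation = ppm_error(observed, reference_mass)
```

The same two functions apply to the per-fragment "ppm value" column, provided the theoretical fragment masses are available.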

Table A-3. Sample entries from the data file.

File name                     Charge  Sequence             Compared  Total  Direct  Nondirect  Total intensity  % of nondirect intensity
SequenceMatch_1001_2_1R1.txt  2       YTLAADTK             39        8      7       1          32077.18         3.79
SequenceMatch_1012_2_3R1.txt  2       RFEENVVQK            23        7      6       1          19960.06         11.26
SequenceMatch_1028_2_2R1.txt  2       VGAATEVEMK           37        19     17      2          52492.06         3.85
SequenceMatch_1028_2_2R1.txt  2       VGAATEVEMK           37        19     17      2          52492.06         3.85
SequenceMatch_1059_2_2R1.txt  2       AGDKVNVTR            49        15     11      4          60280.49         16.08
SequenceMatch_1079_2_7R1.txt  2       MYAEQAQQGEDAPQGEQAK  122       49     35      14         414208.4         10.53
SequenceMatch_1079_2_7R1.txt  2       MYAEQAQQGEDAPQGEQAK  122       49     35      14         414208.4         10.53
SequenceMatch_1079_2_7R1.txt  2       MYAEQAQQGEDAPQGEQAK  122       49     35      14         414208.4         10.53
SequenceMatch_1079_3_3R1.txt  3       PAAPDRPAASEAETTVR    48        16     14      2          131716.4         9.86
SequenceMatch_1081_3_1R1.txt  3       VEVAGDKVNVTR         88        23     20      3          738890           1.3
SequenceMatch_1081_3_1R1.txt  3       VEVAGDKVNVTR         88        23     20      3          738890           1.3
SequenceMatch_1083_3_1R1.txt  3       LGVHDVEHVGGK         74        24     21      3          87945.95         5.66
SequenceMatch_1083_3_3R1.txt  3       YAHVDCPGHADYVK       69        29     19      10         82117.03         24.36
SequenceMatch_1083_3_3R1.txt  3       YAHVDCPGHADYVK       69        29     19      10         82117.03         24.36
SequenceMatch_1083_3_3R1.txt  3       YAHVDCPGHADYVK       69        29     19      10         82117.03         24.36
SequenceMatch_1087_3_2R1.txt  3       ETITKDNVEIEGK        112       38     25      13         388422.9         7.94
SequenceMatch_1088_3_1R1.txt  3       NVLKEGEEVEAK         99        32     20      12         445354.7         7.51
SequenceMatch_1088_3_1R1.txt  3       NVLKEGEEVEAK         99        32     20      12         445354.7         7.51
SequenceMatch_1088_3_2R1.txt  3       SLRELEQDGQAQK        54        24     17      7          69024.85         15.4
SequenceMatch_1088_3_2R1.txt  3       SLRELEQDGQAQK        54        24     17      7          69024.85         15.4
SequenceMatch_1089_3_2R1.txt  3       SLRELEQDGQAQK        62        24     17      7          80484.19         18.09
SequenceMatch_1089_3_2R1.txt  3       SLRELEQDGQAQK        62        24     17      7          80484.19         18.09
SequenceMatch_1103_3_2R1.txt  3       VEHYVDQEELK          64        16     12      4          42841.76         13.33
SequenceMatch_1108_3_2R1.txt  3       AKDSALGGIESVRK       34        8      7       1          32165.94         4.42
SequenceMatch_1108_3_3R1.txt  3       NAGAASDVAQPHPSIR     43        11     10      1          58682.52         2.86
SequenceMatch_1111_2_5R1.txt  2       INTNSVDTNHAER        29        7      7       0          19491.26         0
SequenceMatch_1115_2_5R1.txt  2       NAEAQLQNASAQR        33        11     9       2          34162.18         6.69
SequenceMatch_1115_3_1R1.txt  3       VMVCSPGLAHQR         38        9      7       2          77505.33         8.49
SequenceMatch_1119_3_1R1.txt  3       KVLDEQVSEVR          48        15     9       6          60307.5          18.31
SequenceMatch_1119_3_1R1.txt  3       KVLDEQVSEVR          48        15     9       6          60307.5          18.31
SequenceMatch_1120_3_2R1.txt  3       GHPEYYQDVAAR         63        19     17      2          130972.2         1.97
SequenceMatch_1122_2_5R1.txt  2       LAQGDAAQAEQVAR       36        15     15      0          48897.52         0
SequenceMatch_1128_3_2R1.txt  3       SNTPVVAISDDGSKR      75        22     12      10         102577.9         33.47
SequenceMatch_1128_3_2R1.txt  3       SNTPVVAISDDGSKR      75        22     12      10         102577.9         33.47
SequenceMatch_1130_2_6R1.txt  2       VVDNTVQGSAAQAAAPAQR  48        28     18      10         153218.8         23.93
SequenceMatch_1135_3_1R1.txt  3       SIAMGSTEGLKR         44        10     8       2          307497.5         4.45
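Entries like those in Table A-3 lend themselves to a simple aggregate of direct and nondirect matches. The Python sketch below is only an illustration and not the in-house statistics program. It assumes that "Compared" is the number of fragments examined in a spectrum, that "Total" equals "Direct" plus "Nondirect", and that verbatim duplicate rows should be counted once; none of these readings is confirmed by the source, and the names are hypothetical.

```python
# (file name, charge, sequence, compared, total, direct, nondirect)
# First rows of Table A-3, including one verbatim duplicate.
ROWS = [
    ("SequenceMatch_1001_2_1R1.txt", 2, "YTLAADTK", 39, 8, 7, 1),
    ("SequenceMatch_1012_2_3R1.txt", 2, "RFEENVVQK", 23, 7, 6, 1),
    ("SequenceMatch_1028_2_2R1.txt", 2, "VGAATEVEMK", 37, 19, 17, 2),
    ("SequenceMatch_1028_2_2R1.txt", 2, "VGAATEVEMK", 37, 19, 17, 2),
]


def summarize(rows, drop_duplicates=True):
    """Percentages of direct and nondirect matches among compared fragments."""
    if drop_duplicates:
        # dict.fromkeys keeps the first occurrence of each row, in order.
        rows = list(dict.fromkeys(rows))
    compared = sum(row[3] for row in rows)
    direct = sum(row[5] for row in rows)
    nondirect = sum(row[6] for row in rows)
    return {
        "spectra": len(rows),
        "direct_pct": 100.0 * direct / compared,
        "nondirect_pct": 100.0 * nondirect / compared,
    }


summary = summarize(ROWS)
```

Grouping the rows by sequence length or charge state before calling `summarize` would give length- and charge-resolved percentages of the kind discussed in the statistical chapter.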

Table A-4. Atomic compositional differences between all pairs of amino acid residues considered. Each set of 5 numbers denotes the difference in the number of C, H, N, O and S atoms, respectively (column residue minus row residue).

    G             A             S             P             V             T             C             L             N             D             Q
G   N/A           1,2,0,0,0     1,2,0,1,0     3,4,0,0,0     3,6,0,0,0     2,4,0,1,0     1,2,0,0,1     4,8,0,0,0     2,3,1,1,0     2,2,0,2,0     3,5,1,1,0
A   N/A           N/A           0,0,0,1,0     2,2,0,0,0     2,4,0,0,0     1,2,0,1,0     0,0,0,0,1     3,6,0,0,0     1,1,1,1,0     1,0,0,2,0     2,3,1,1,0
S   N/A           N/A           N/A           2,2,0,-1,0    2,4,0,-1,0    1,2,0,0,0     0,0,0,-1,1    3,6,0,-1,0    1,1,1,0,0     1,0,0,1,0     2,3,1,0,0
P   N/A           N/A           N/A           N/A           0,2,0,0,0     -1,0,0,1,0    -2,-2,0,0,1   1,4,0,0,0     -1,-1,1,1,0   -1,-2,0,2,0   0,1,1,1,0
V   N/A           N/A           N/A           N/A           N/A           -1,-2,0,1,0   -2,-4,0,0,1   1,2,0,0,0     -1,-3,1,1,0   -1,-4,0,2,0   0,-1,1,1,0
T   N/A           N/A           N/A           N/A           N/A           N/A           -1,-2,0,-1,1  2,4,0,-1,0    0,-1,1,0,0    0,-2,0,1,0    1,1,1,0,0
C   N/A           N/A           N/A           N/A           N/A           N/A           N/A           3,6,0,0,-1    1,1,1,1,-1    1,0,0,2,-1    2,3,1,1,-1
L   N/A           N/A           N/A           N/A           N/A           N/A           N/A           N/A           -2,-5,1,1,0   -2,-6,0,2,0   -1,-3,1,1,0
N   N/A           N/A           N/A           N/A           N/A           N/A           N/A           N/A           N/A           0,-1,-1,1,0   1,2,0,0,0
D   N/A           N/A           N/A           N/A           N/A           N/A           N/A           N/A           N/A           N/A           1,3,1,-1,0
Q   N/A           N/A           N/A           N/A           N/A           N/A           N/A           N/A           N/A           N/A           N/A
Rows K, E, M, H, I, F, R, B, Y and W: N/A in all columns.

Table A-5. Atomic compositional differences between all pairs of amino acid residues considered. Each set of 5 numbers denotes the difference in the number of C, H, N, O and S atoms, respectively (column residue minus row residue) (continued).

    K             E             M             H             I             F             R             B             Y             W
G   4,9,1,0,0     3,4,0,2,0     3,6,0,0,1     4,4,2,0,0     3,6,0,1,1     7,6,0,0,0     4,9,3,0,0     3,5,1,1,1     7,6,0,1,0     9,7,1,0,0
A   3,7,1,0,0     2,2,0,2,0     2,4,0,0,1     3,2,2,0,0     2,4,0,1,1     6,4,0,0,0     3,7,3,0,0     2,3,1,1,1     6,4,0,1,0     8,5,1,0,0
S   3,7,1,-1,0    2,2,0,1,0     2,4,0,-1,1    3,2,2,-1,0    2,4,0,0,1     6,4,0,-1,0    3,7,3,-1,0    2,3,1,0,1     6,4,0,0,0     8,5,1,-1,0
P   1,5,1,0,0     0,0,0,2,0     0,2,0,0,1     1,0,2,0,0     0,2,0,1,1     4,2,0,0,0     1,5,3,0,0     0,1,1,1,1     4,2,0,1,0     6,3,1,0,0
V   1,3,1,0,0     0,-2,0,2,0    0,0,0,0,1     1,-2,2,0,0    0,0,0,1,1     4,0,0,0,0     1,3,3,0,0     0,-1,1,1,1    4,0,0,1,0     6,1,1,0,0
T   2,5,1,-1,0    1,0,0,1,0     1,2,0,-1,1    2,0,2,-1,0    1,2,0,0,1     5,2,0,-1,0    2,5,3,-1,0    1,1,1,0,1     5,2,0,0,0     7,3,1,-1,0
C   3,7,1,0,-1    2,2,0,2,-1    2,4,0,0,0     3,2,2,0,-1    2,4,0,1,0     6,4,0,0,-1    3,7,3,0,-1    2,3,1,1,0     6,4,0,1,-1    8,5,1,0,-1
L   0,1,1,0,0     -1,-4,0,2,0   -1,-2,0,0,1   0,-4,2,0,0    -1,-2,0,1,1   3,-2,0,0,0    0,1,3,0,0     -1,-3,1,1,1   3,-2,0,1,0    5,-1,1,0,0
N   2,6,0,-1,0    1,1,-1,1,0    1,3,-1,-1,1   2,1,1,-1,0    1,3,-1,0,1    5,3,-1,-1,0   2,6,2,-1,0    1,2,0,0,1     5,3,-1,0,0    7,4,0,-1,0
D   2,7,1,-2,0    1,2,0,0,0     1,4,0,-2,1    2,2,2,-2,0    1,4,0,-1,1    5,4,0,-2,0    2,7,3,-2,0    1,3,1,-1,1    5,4,0,-1,0    7,5,1,-2,0
Q   1,4,0,-1,0    0,-1,-1,1,0   0,1,-1,-1,1   1,-1,1,-1,0   0,1,-1,0,1    4,1,-1,-1,0   1,4,2,-1,0    0,0,0,0,1     4,1,-1,0,0    6,2,0,-1,0
K   N/A           -1,-5,-1,2,0  -1,-3,-1,0,1  0,-5,1,0,0    -1,-3,-1,1,1  3,-3,-1,0,0   0,0,2,0,0     -1,-4,0,1,1   3,-3,-1,1,0   5,-2,0,0,0
E   N/A           N/A           0,2,0,-2,1    1,0,2,-2,0    0,2,0,-1,1    4,2,0,-2,0    1,5,3,-2,0    0,1,1,-1,1    4,2,0,-1,0    6,3,1,-2,0
M   N/A           N/A           N/A           1,-2,2,0,-1   0,0,0,1,0     4,0,0,0,-1    1,3,3,0,-1    0,-1,1,1,0    4,0,0,1,-1    6,1,1,0,-1
H   N/A           N/A           N/A           N/A           -1,2,-2,1,1   3,2,-2,0,0    0,5,1,0,0     -1,1,-1,1,1   3,2,-2,1,0    5,3,-1,0,0
I   N/A           N/A           N/A           N/A           N/A           4,0,0,-1,-1   1,3,3,-1,-1   0,-1,1,0,0    4,0,0,0,-1    6,1,1,-1,-1
F   N/A           N/A           N/A           N/A           N/A           N/A           -3,3,3,0,0    -4,-1,1,1,1   0,0,0,1,0     2,1,1,0,0
R   N/A           N/A           N/A           N/A           N/A           N/A           N/A           -1,-4,-2,1,1  3,-3,-3,1,0   5,-2,-2,0,0
B   N/A           N/A           N/A           N/A           N/A           N/A           N/A           N/A           4,1,-1,0,-1   6,2,0,-1,-1
Y   N/A           N/A           N/A           N/A           N/A           N/A           N/A           N/A           N/A           2,1,1,-1,0
W   N/A           N/A           N/A           N/A           N/A           N/A           N/A           N/A           N/A           N/A

APPENDIX B
AMBER MODEL BUILDING DETAILS AND SAMPLE OUTPUTS

I. Overview

The peptides of interest in Chapter 3 are novel structures to AMBER, so their input files cannot be generated directly within the software. Legacy approaches generate the ESP distribution for each peptide after its geometry has been optimized with high-level DFT calculations, and build the AMBER input files from that distribution. This becomes costly as the peptides grow larger and the number of peptides of interest increases. In our approach, we instead make input files only for the novel residues of each peptide; the remaining residues are recognized and parameterized within AMBER. The AMBER input files for the entire peptide can then be generated using LEaP by combining the novel residue input files with the standard AMBER force fields. These input files are then used for the PES calculation, to provide possible explanations for the experimental results and to help understand the scrambling chemistry.

II. Remaking the novel residues

An ESP calculation requires complete molecules rather than isolated residues. Additional atoms therefore need to be added to each residue so that the residue carries a charge distribution in this molecule similar to the one it has in our peptides of interest. The residues and their supporting molecules are listed below.

1. List of novel residues to make

1.1 N-terminus residues: NHX (X = Q, R, G, Y, A)
Structure: H-XNH-CH3
Charge/multiplicity: 0/1
Total residues to make: 5

1.2 Oxazolone C-terminus residues: OXX (XX = LM, GL, GR, GG, GY, AG, YG, GA, AY)
Structure: CH3-XX
Charge/multiplicity: 1/1
Total residues to make: 9

1.3 Unprotonated arginine
Structure: CH3-RNH-CH3
Charge/multiplicity: 0/1
Total residues to make: 1

2. Procedure

2.1 In HyperChem, build initial structures for the above molecules containing the novel residues of interest, and optimize them with AMBER MD. Export the resulting structures in PDB format to obtain the atom types and Cartesian coordinates.

2.2 Based on this information, generate Gaussian input for the corresponding structures so that they can be geometry-optimized with DFT. The optimized structures are then submitted to an ESP calculation.

2.3 Within AMBER, make the input files (prepi and prmtop files) for each residue by reading in the ESP calculation results for the structure containing that residue, and remove the atoms that do not belong to the residue with a "MAINCHAIN" approach. Through a script file, this approach specifies the head atom type and the type of atom connected to it, the same for the tail atom, the main chain atoms, all the omitted atoms, and the charge of the residue. This is done through several Antechamber commands, as demonstrated in an online Antechamber tutorial.

2.4 Once the input files are ready, use Antechamber to generate the PDB file from the prepi and prmtop files for each novel residue.

2.5 At the end of this procedure, there should be a prepi, a prmtop and a PDB file for each novel residue.

III. Making the AMBER input files for the peptides of interest

1. List of peptides to make

1.1 The original sequence: QWFGLM
1.2 The proline-modified sequences: QPFGLM, QPWFGLM
1.3 Arginine substitutions: RWFGLM, QRFGLM, QWRGLM, QWFRLM
1.4 b5 peptides with arginine substitutions: QRFGL, QWRGL, QWFGR
1.5 Glycines, g4 to g8: GGGG, GGGGG, GGGGGG, GGGGGGG, GGGGGGGG
1.6 YAGY, YAGYAG, AYG, AYGA, GAY, GAYGAY

2. Procedure to make AMBER input files for a peptide

2.1 In HyperChem, starting from the amino acid database, build the backbone structure of the peptide. Make the proper modifications at the N terminus (addition of a hydrogen) and, if necessary, at the arginine residue (removal of a hydrogen), and build the oxazolone structure at the C terminus. Optimize the resulting structure with AMBER MD and export the PDB file.

2.2 Edit the PDB file so that the atoms of each residue are arranged in the same order as in the PDB files of the novel residues, and change the residue names to those of the novel residues where applicable. This is tedious and leaves no tolerance for error. Double-check the arrangement and save the PDB file.

2.3 In AMBER, open the LEaP program. Sequentially load the standard AMBER force fields, the prepi and prmtop files for the novel residues, and the PDB file for the peptide. Then manually specify the sequence of the peptide and let LEaP generate the final prepi and prmtop files for the peptide.

IV. Parameterization samples

1. C-terminus oxazolone glycine-glycine (OGG)

2. Unprotonated arginine (R)

LIST OF REFERENCES

(1) Aebersold, R.; Goodlett, D. R. Chem. Rev. 2001, 101, 269-295.
(2) Steen, H.; Mann, M. Nat. Rev. Mol. Cell Biol. 2004, 5, 699-711.
(3) Biemann, K.; Martin, S. A. Mass Spectrom. Rev. 1987, 6, 1-75.
(4) Chait, B. T.; Wang, R.; Beavis, R. C.; Kent, S. B. Science 1993, 262, 89-92.
(5) Perkins, D. N.; Pappin, D. J. C.; Creasy, D. M.; Cottrell, J. S. Electrophoresis 1999, 20, 3551-3567.
(6) Eng, J. K.; McCormack, A. L.; Yates, J. R., III. J. Am. Soc. Mass Spectrom. 1994, 5, 976-989.
(7) Dongre, A. R.; Jones, J. L.; Somogyi, A.; Wysocki, V. H. J. Am. Chem. Soc. 1996, 118, 8365-8374.
(8) Wysocki, V. H.; Tsaprailis, G.; Smith, L. L.; Breci, L. A. J. Mass Spectrom. 2000, 35, 1399-1406.
(9) Boyd, R.; Somogyi, A. J. Am. Soc. Mass Spectrom. 2010, 21, 1275-1278.
(10) Poulter, L.; Taylor, L. C. E. Int. J. Mass Spectrom. Ion Proc. 1989, 91, 183-197.
(11) Burlet, O.; Orkiszewski, R. S.; Ballard, K. D.; Gaskell, S. J. Rapid Commun. Mass Spectrom. 1992, 6, 658-662.
(12) Tang, X.; Boyd, R. K. Rapid Commun. Mass Spectrom. 1992, 6, 651-657.
(13) Johnson, R. S.; Krylov, D.; Walsh, K. A. J. Mass Spectrom. 1995, 30, 386-387.
(14) Harrison, A. G.; Yalcin, T. Int. J. Mass Spectrom. 1997, 165, 339-347.
(15) Harrison, A. G.; Young, A. B.; Bleiholder, B.; Suhai, S.; Paizs, B. J. Am. Chem. Soc. 2006, 128, 10364-10365.
(16) Tang, X.-J.; Thibault, P.; Boyd, R. K. Anal. Chem. 1993, 65, 2824-2834.
(17) Tang, X.; Boyd, R. K. Rapid Commun. Mass Spectrom. 1994, 8, 678-686.
(18) Vachet, R. W.; Bishop, B. M.; Erickson, B. W.; Glish, G. L. J. Am. Chem. Soc. 1997, 119, 5481-5488.
(19) Yague, J.; Paradela, A.; Ramos, M.; Ogueta, S.; Marina, A.; Barahona, F.; Lopez de Castro, J. A.; Vazquez, J. Anal. Chem. 2003, 75, 1524-1535.

(20) Mouls, J.; Aubagnac, J.; Martinez, J.; Enjalbal, C. J. Proteome Res. 2007, 6, 1378-1391.
(21) Jia, C.; Qi, W.; He, Z. J. Am. Soc. Mass Spectrom. 2007, 18, 663-678.
(22) Chen, X.; Yu, L.; Steill, J. D.; Oomens, J.; Polfer, N. C. J. Am. Chem. Soc. 2009, 131, 18272-18282.
(23) James, P. Quarterly Reviews of Biophysics, Vol. 30, Issue 04, pp 279-331.
(24) Tsaprailis, G.; Nair, H.; Zhong, W.; Kuppannan, K.; Futrell, J. H.; Wysocki, V. H. Anal. Chem. 2004, 76, 2083-2094.
(25) Tsaprailis, G.; Nair, H.; Somogyi, A.; Wysocki, V. H.; Zhong, W.; Futrell, J. H.; Summerfield, S. G.; Gaskell, S. J. J. Am. Chem. Soc. 1999, 121, 5142-5154.
(26) Tsaprailis, G.; Somogyi, A.; Nikolaev, E. N.; Wysocki, V. H. Int. J. Mass Spectrom. 2000, 195, 467-479.
(27) Paizs, B.; Suhai, S. Rapid Commun. Mass Spectrom. 2001, 15, 2307-2323.
(28) Fenn, J. B.; Mann, M.; Meng, C. K.; Wong, S. F.; Whitehouse, C. M. Science 1989, 246, 64-71.
(29) Ho, C. S.; Lam, C. W. K.; Chan, M. H. M.; Cheung, R. C. K.; Law, L. K.; Lit, L. C. W.; Ng, K. F.; Suen, M. W. M.; Tai, H. L. Clin. Biochem. Rev. 2003, 24, 3-12.
(30) Yamashita, M.; Fenn, J. B. J. Phys. Chem. 1984, 88, 4451-4459.
(31) Lindsay, S.; Kealey, D. "High performance liquid chromatography", OSTI ID: 7013902.
(32) Stephens, W. E. Phys. Rev. 1946, 69, 691.
(33) Mamyrin, B. A. "Time-of-flight mass spectrometer", US 4072862.
(34) Wiley, W. C.; McLaren, I. H. Rev. Sci. Instrum. 1955, 26, 1150-1157.
(35) Marshall, A. G.; Hendrickson, C. L.; Jackson, G. S. Mass Spectrom. Rev. 1998, 17, 1-35.
(36) Marshall, A. G.; Hendrickson, C. L. Int. J. Mass Spectrom. 215, 59-75.
(37) Guan, S.; Marshall, A. G. Int. J. Mass Spectrom. Ion Proc. 1995, 146, 261-296.
(38) Wells, M. J.; McLuckey, S. A. Methods in Enzymology 2005, 402, 148-185.
(39) Sleno, L.; Volmer, D. A. J. Mass Spectrom. 2004, 39, 1091-1112.

PAGE 160

160 (40) Woodin, R. L. ; Bomse, D. S.; Beauchamp, J. L. J. Am. Chem. Soc. 1978, 100, 3248-3250. (41) Huang, Z; Kim, K. J. Phys. Rev. ST Accel. Beams 2007, 10, 034801. (42) Motz, H.; Thon, W.; Whitehurst, R.N. J. Ap pl . Phys. 1953, 24 , 826833. (43) Oepts, D.; van der Meer, A.F.G.; van Amersfoort, P.W. Infrared Phys. Tech. 1995, 36, 297-308 (44) http://www.differ.nl/research/guthz/felix/specifications (45) Kohn, W.; Sham, L. J. Phys. Rev. 1965, 140, A1133-A1138. (46) Becke, A. D. J. Chem. Phys. 1993, 98, 5648-5652 ; (47) Becke, A. D. J. Chem. Phys. 1993, 98, 1372-1377; (48) http://www.hpc.ufl.edu/ (49) http://ambermd.org/ (50) Salomon-Ferrer, R.; Case, D. A.; Walker, R. C. WIREs Comput Mol Sci 2012, 00, 113. (51) Gtz, A. W.; Williamson, M. J.; Xu, D.; Poole, D.; Le Grand, S.; Walker, R. C. J. Chem. Theory Comput., 2012, 8 , 1542 1555. (52) Eyler, J. R. Mass Spectrom Rev. 2009 , 28 , 448-67. (53) Polfer, N . C .; Oomens, J. Mass Spectrom Rev. 2009 , 28 , 46894. (54) Tira do, M.; Polfer, N. C. Ang. Chem. 2012, 124, 6542 6544. (55) Tirado, M.; Rutters, J.; Chen, X.; Yeung, A; van Maarseveen, J.; Eyler, J. R.; Berden, G.; Oomens, J.; Polfer, N. C. J. Am. Soc. Mass Spectrom. 2012, 23 , 475-482. (56) Marcus Tirado, Ph.D Thesis, University of Florida Chemistry. (57) Scherl, A.; Shaffer, S. A.; Taylor, G. K.; Hernandez, P.; Appel, R. D.; Binz, P. A.; Goodlett, D. R. J. Am. Soc. Mass Spectrom. 2008, 19, 891 901. (58) Hu, Q.; Noll, R. J.; Li, H.; Makarov, A.; Hardmanc, M.; Cooks, R. G. The J. Mass Spectrom. 2005, 40, 430 443. (59) http://reference.wolfram.com/language/ref/FrobeniusSolve.html (60) He, F.; Hendrickson, C. L.; Marshall, A. G. Anal. Chem. 2001,73, 647 650.

PAGE 161

161 (61) Marshall, A. G. Int. J. Mass Spectrom. 2000, 200, 331 356. (62) Marshall,A.G.; Hendrickson, C. L.;Emmett,M.R.; Rodgers, R. P.; Blakney, G. T.; Nilsson, C. L. Eur. J. Mass Spectrom. 2007, 13, 57 59. (63) Schaub, T. M.; Hendrickson, C. L.; Horning, S.; Quinn, J. F.; Senko, M. W.; Marshall, A. G. Anal. Chem. 2008, 80, 3985 3990. (64) Zhang, L.K.; Rempel, D. L.; Pramanik, B. N.; Gross, M. L. Mass Spectrom. Rev. 2005, 24, 286 309. (65) Makarov, A. Mass spectrometer. U.S. Patent 5, 886, 346, 1999. (66) Hu, Q.; Noll, R. J.; Li, H.; Makarov, A.; Hardmanc, M.; Cooks, R. G. J. Mass Spectrom. 2005, 40, 430 443. (67) Olsen, J. V.; de Godoy, L.; Li, G.; Macek, B.; Mortensen, P.; Pesch, R.; Makarov, A.; Lange, O.; Horning, S.; Mann, M. Mol. Cell. Proteomics 2005, 4, 2010 2021. (68) Marshall, A. G.; Rodgers, R. P. Acc. Chem. Res. 2004, 37, 53 59. (69) Yates, J. R. I. Trends Genet. 2000, 16, 5 8. (70) Aebersold, R.; Goodlett, D. R. Chem. Rev. 2001, 101, 269 295. (71) Aebersold, R.; Mann, M. Nature 2003, 422, 198 207. (72) Meng, F.; Cargile, B. J.; Miller, L. M.; Forbes, A. J.; Johnson, J. R.; Kelleher, N. L. Nat. Biotechnol. 2001, 19, 952 957. (73) Bogdanov, B.; Smith, R. D. Mass Spectrom. Rev. 2005, 24, 168 200. (74) Mann, M. J. Protein Chem. 1994, 13, 506 507. (75) Zubarev, R. A.; Hakansson, P.; Sundqvist, B. Anal. Chem. 1996, 68, 4060 4063. (76) Takach, E. J.; Hines, W. M.; Patterson, D. H.; Juhasz, P.; Falick, A. M.; Vestal, M. L.; Martin, S. A. J. Protein Chem. 1997, 16, 363 369. (77) He, F.; Emmett, M. R.; Hakansson, K.; Hendrickson, C. L.; Marshall, A. G. J. Proteome Res. 2004, 3, 61 67. (78) Shen, Y. F.; Tolic, N.; Hixson, K. K.; Purvine, S. O.; Pasa-Tolic, L.; Qian, W. J.; Adkins, J. N.; Moore, R. J.; Smith, R. D. Anal. Chem. 2008, 80, 1871 1882. (79) Schenk, S.; Schoenhals, G. S.; de Souza, G.; Mann, M. BMC Med. Genomics 2008, 1, 41.

PAGE 162

162 (80) Scherl,A.; Shaffer, S.A.;Taylor,G. K.;Hernandez,P.; Appel,R.D.; Binz, P. A.; Goodlett, D. R. J. Am. Soc. Mass Spectrom. 2008, 19, 891 901. (81) He, F.; Emmett, M. R.; Hakansson, K.; Hendrickson, C. L.; Marshall, A. G. J. Proteome Res. 2004, 3, 61 67. (82) Hubler, S . L .; Craciun, G. Biosystems, 2012, 109, 179-185.

PAGE 163

BIOGRAPHICAL SKETCH

Republic of China. He received a Bachelor of Science degree in physics from the Special Class for the Gifted Young, University of Science and Technology of China, in June 2004. After graduation, he worked as a research assistant, first at the University of Science and Technology of China and then at the City University of Hong Kong. In March 2006, he was admitted to the Ph.D. program in the University of Florida chemistry department and was officially enrolled in July 2006. Long received a Master of Science degree in August 2010 and continued his Ph.D. studies in the Department of Chemistry, University of Florida, supervised by Dr. Nicolas Polfer in the physical division. He graduated with his Ph.D. in Chemistry in the winter of 2014.