USE OF MUTATION EVENTS AND MUTUAL INFORMATION TO PREDICT PROTEIN-PROTEIN INTERACTIONS

By

HOMER FLOYD WILLIS IV

A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA

2010
© 2010 Homer Floyd Willis IV
To Karen and Devin
ACKNOWLEDGMENTS

I arrived back in Gainesville in 2001, some 16 years after taking my first class as an undergrad, to pursue an MBA degree. Little did I know that the educational journey would continue and that I would be defending a dissertation in 2010 for a PhD in Computer Science. Prior to my enrollment in the CISE program I felt I had a strong foundation in Computer Science, having worked as a software engineer for the previous 12 years and programmed since I was in the 8th grade. I really had no idea how much I did not know, and for that I will always be grateful to the many faculty members of the department who have chosen to teach.

I would like to thank Dr. Helal for his support and inspiration that you can burn a candle at three ends. We developed an RFID-based information grid to address problems in wayfinding for the blind. This work resulted in the publication, Information Grid and Wearable Computing Solution to the Problem of Wayfinding for the Blind, Symposium on Wearable Computers, Osaka, Japan, October 2005, 34-37 (cited 51 times), and the funding of a $200,000 research grant by NSF.

My research topic has foundations in Electrical Engineering, where Claude E. Shannon in 1948 published a classic paper that formalized Information Theory to help understand what is possible when sending an electrical signal from point A to point B. He probably had no idea that Information Theory would play such an important role in so many disciplines. I probably would have missed the application of Information Theory to genetics had it not been for a journal review class I took with Dr. Braun in the Zoology department. Dr. Braun decided we should review and understand a paper on the use of mutual information to detect co-evolving amino acids. I realized the method could be improved by sampling mutation events, and this became my research topic. I would like to thank Dr. Braun for his passion towards data analysis, for introducing me to the research topic and for serving on my committee.

I would like to thank the organizers of PMSB 2006, Probabilistic Modeling and Machine Learning in Structural and Systems Biology, for accepting my paper on predicting co-evolving pairs in Pfam for oral presentation and recording the seminar as part of a video lecture series (http://videolectures.net/pmsb06_willis_peppu/). I would also like to thank the organizers of ISMB 2006 3DSig, Structural Bioinformatics & Computational Biophysics, for selecting my short paper for oral presentation. I realized at this conference how challenging the problem of tertiary protein structure is and how passionate the researchers in this field are.

I would also like to thank Dr. Schneider at the National Cancer Institute for his many emails discussing mutual information and his emphasis on ways to visualize mutual information. I became very active in using X3D to visualize data and was an invited speaker at the X3D Tech Talk addressing the use of X3D in scientific visualization at SIGGRAPH 2006. This work led to Willis, S. Protein CorreLogo: an X3D representation of co-evolving pairs, tertiary structure, ligand binding pockets and protein-protein interactions in protein families, which was selected for Proceedings of the Twelfth International Conference on 3D Web Technology, ACM Press.

I would like to thank the State of Florida for having the vision to fund the Scripps Florida Research Institute, where I began working in February 2009. Writing software for the automation of mass spectrometry data in hydrogen deuterium exchange (HDX) experiments exposed me to the precision by which protein structures operate. I was assigned a project to develop a method that would allow for the use of HDX for ligand screening by comparing the structural dynamics of ligands on nuclear receptors. This work resulted in the publication, Chalmers MJ, Pascal BD, precision differential hydrogen de

My work with predicting co-evolving pairs in nuclear receptors was encouraged by Dr. Griffin, who as Chairman of the Department of Molecular Therapeutics and Director of Translational Research Institute always managed to find the time to discuss what I thought to be interesting data patterns. These discussions and understanding of what makes nuclear receptors positions conserved within the nuclear receptor superfamily: approach reveals functionally Nuclear Receptor Signaling Atlas.

I would like to thank Dr. Üngör for agreeing to serve as my committee chair and allowing me to explore an independent research topic. I would also like to thank Dr. Banerjee and Dr. Kahveci for their time and patience in serving on my committee, and Dr. Phillips for serving as my external committee member.

All network graph images were rendered with the free version of yEd. All protein structure images were generated with Chimera (Pettersen, 2004).
TABLE OF CONTENTS

ACKNOWLEDGMENTS
LIST OF TABLES
LIST OF FIGURES
ABSTRACT

CHAPTER

1 INTRODUCTION

2 BACKGROUND
    Structural Biology
    Mutual Information

3 MUTUAL INFORMATION CALCULATED WITH MEMI METHOD

4 MUTUAL INFORMATION IN PROTEIN FAMILIES PFAM
    Mutual Information in Protein Families
        Mutual Information in PF04055.9
        Mutual Information in Pfam
    Analysis

5 PROTEIN CORRELOGO A VISUAL REPRESENTATION OF MUTUAL INFORMATION RELATIONSHIPS IN PFAM
    CorreLogo
    Protein CorreLogo
        3D Model
        Model for PFAM PF00027.18 and PDB 1ne4 A
        Model for PFAM PF00018.16 and PDB 1i07 A B
    Protein CorreLogo Summary

6 MUTUAL INFORMATION TO PREDICT PROTEIN INTERACTIONS IN RETROVIRUSES
    Virus Background
    Virus Protein Topology Models Predicted from Mutual Information
    Visual Representation of Informative Relationships

7 PREDICTING PROTEIN INTERACTIONS IN HEPATITIS C (HCV)
    Protein Topology Models
    Power-law Degree Distribution
    Analysis of MEMI Model
    Protein Surface Mutual Information Models
    HCV Analysis

8 PREDICTING PROTEIN INTERACTIONS IN HIV
    Protein Topology Models
    Power-law Degree Distribution
    Analysis of MEMI Model
    Mutual Information Analysis of Pre and Post Treatment Sequence Data
    Protein Surface Mutual Information Models
    HIV Analysis

9 PREDICTING PROTEIN INTERACTIONS IN INFLUENZA A VIRUS
    Power-law Degree Distribution
    Analysis of MEMI Model
    Protein Surface Mutual Information Models
    Influenza A Analysis

10 PREDICTING PROTEIN INTERACTIONS IN DENGUE
    Protein Topology Models
    Power-law Degree Distribution
    Analysis of MEMI Model
    Dengue Analysis

11 PREDICTING FUNCTIONAL CO-EVOLVING SECONDARY STRUCTURES IN NUCLEAR RECEPTORS
    MEMI Predicted Protein Interaction Network is Scale Free
    Co-evolving Secondary Structures
    MSA Sequence Positions [139,144,195]
    Allosteric communication between distant secondary structures
    PPRE DNA

12 CONCLUSIONS AND FUTURE RESEARCH
    Summary of Results
        MEMI Method and its application on Pfam
        Protein CorreLogo
        Protein Topology of Viruses
            HCV
            Influenza
            HIV
            Dengue
        Predicting functional co-evolving secondary structures
    Other Potential Applications and Future Research
        High Throughput Screening
        Predicting Protein Interactions between Pfam families

LIST OF REFERENCES
BIOGRAPHICAL SKETCH
LIST OF TABLES

Table
4-1 Clustering of co-evolved pairs MI pairs
4-2 Number of MI scores < 500 and MEMI mutation count > 40
5-1 Amino acid grouping and color
7-1 HCV proteins and functions
7-2 Information contribution between
8-1 HIV HXB2 gene product sequence positions
9-1 Influenza A gene sequence positions
9-2 Predicted co-evolving pair HA(125,275).
9-3 Predicted co-evolving pair HA(379,478).
11-1 in deuterium exchange with the addition of DNA.
11-2 RXR HDX showing peptides that had a change in deuterium exchange with the addition of DNA.
LIST OF FIGURES

Figure
2-1 Physio-chemical properties of amino acids
2-2 Mutual information
2-3 Entropy to calculate mutual information
2-4 Venn diagram mutual information
3-1 Phylogenetic tree for a single sequence position
3-2 Phylogenetic tree for co-evolving pairs where green indicates a mutation event
3-3 State diagram of observed mutation transitions
4-1 Mutual Information log base 2
4-2 Mutual Information log base 2 with phylogenetic effect reduced
4-3 MI clusters
4-4 MEMI clusters
4-5 MI for PF04055.9
4-6 MEMI for PF04055.9
4-7 Distribution of perc
4-8 Percentage accuracy Z>=4 and MI count < 500
4-9 Ribbon model with MI in PF00014.13
5-1 RNA CorreLogo of 5S loop E region RFAM RF00001
5-2 Protein CorreLogo PF00025 PDB 1HUR A
5-3 Sequence Logo
5-4 PF00025 1HUR
5-5 PF00027.18 1NE4 A RP adenosine binding pocket
5-6 PF00018.16 1I07 A B
7-1 Hepatitis C virus genome NS2 through NS5B are the non-structural proteins
7-2 HCV top 100 pairs MI method
7-3 HCV top 100 pairs MEMI method
7-4 Edges per node for the top 100 pairs using MI method in HCV
7-5 Edges per node for the top 100 pairs using MEMI method in HCV
7-6 HCV topology (Penin, Dubuisson et al. 2004)
7-7 Predicted co-evolving pairs NS3 PDB 1CU1
7-8 Predicted co-evolving pairs NS5A PDB 1ZH1
7-9 Predicted co-evolving pairs NS5B PDB 1GX6
8-1 HIV gene map
8-2 HIV top 64 pairs MI
8-3 HIV top 64 pairs MEMI
8-4 Edges per node for the top 64 MI pairs using MI method in HIV
8-5 Edges per node for the top 64 MI pairs using MEMI method in HIV
8-6 HIV MI Pairs top 100 MEMI method
8-7 HIV immature and mature
8-8 HIV PRO GAG MI pairs pre-treatment
8-9 HIV PRO GAG MI pairs post-treatment
8-10 ENV gp120 T Cell CD4 antibody PDB 2NXY wide view
8-11 HIV viral attachment
8-12 ENV gp120 T Cell CD4 antibody PDB 2NXY detailed view
8-13 POL RT PDB 2IAJ
8-14 GAG p6 PDB 2C55
8-15 GAG p24 PDB 2ONT (148-220)
8-16 VPR PDB 1M8L
8-17 TAT PDB 1JFW
8-18 POL Integrase PDB 1EX4
9-1 Influenza virus 3D topology
9-2 Influenza top 100 pairs MI method A/duck/Viet Nam/18/2005(H5N1)
9-3 Influenza top 100 pairs MEMI method A/duck/Viet Nam/18/2005(H5N1)
9-4 Influenza top 200 pairs MEMI method A/duck/Viet Nam/18/2005(H5N1)
9-5 Edges per node for the top 100 pairs using MI method in influenza
9-6 Edges per node for the top 100 pairs using MEMI method in influenza
9-7 Influenza virus imaged using electron tomography
9-8 Four neuraminidase structures PDB 2HU4
9-9 Predicted co-evolving pairs in neuraminidase side view
9-10 Predicted co-evolving pairs symmetry in neuraminidase (top view)
9-11 Predicted co-evolving pairs in neuraminidase showing second cluster (side view)
9-12 Neuraminidase docking scenario showing co-evolving pair relationships
9-13 Three hemagglutinin structures PDB 2IBX cluster 1
9-14 Three hemagglutinin structures PDB 2IBX cluster 2
9-15 Three hemagglutinin structures PDB 2IBX cluster 3
9-16 Three hemagglutinin structures PDB 2IBX cluster 1 and 3 interactions
10-1 Dengue genome
10-2 Dengue top 50 pairs MI method
10-3 Dengue top 50 pairs MEMI method
10-4 Edges per node for the top 100 pairs using MI method in dengue
10-5 Edges per node for the top 100 pairs using MEMI method in dengue
10-6 Envelope protein PDB 2B6B
10-7 Envelope protein PDB 2B6B interface to NS5 PDB 2J7U
11-1 Top 300 predicted co-evolving pairs from MSA alignment ranked by mutual information.
11-2 Number of edges or degree K per predicted co-evolving sequence position
11-3 Degree distribution of edges per node in log scale log(P(k)) vs. log(k)
11-4 The top 100 predicted co-evolving pairs wher
11-5 Predicted co-evolving secondary structures bits of information
11-6 Predicted co-evolving secondary structures network with minimum distance between secondary structures.
11-7 NR DBD dimer interface showing predicted co-evolving pairs in MSA (139,144,195)
11-8 1HCQ Estrogen Receptor alpha DBD homodimer showing predicted co-evolving pairs in MSA(139,144,101).
11-9 MSA (139,144,195) showing conserved amino acid triplets for nuclear receptor.
11-10 Proposed MEMI interaction model for DBD signaling to LBD Helix 3
11-11
11-12
11-13
11-14
11-15 LBD RXR DBD dimer interface.
11-16 +/- DNA as compared to
11-17 addition of DNA.
11-18 nge with the addition of DNA.
11-19 Helix A responsible for DNA binding with promoter region is conserved across a nuclear receptors.
Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

USE OF MUTATION EVENTS AND MUTUAL INFORMATION TO PREDICT PROTEIN-PROTEIN INTERACTIONS

By

Homer Floyd Willis IV

December 2010

Chair: Alper Üngör
Major: Computer Engineering

All biological functions in the living cell involve the interaction of protein structures in a complex system where, through billions of years of evolution, the insertion, deletion and mutation of DNA sequences that code for proteins provide the coding for life as we know it. A key challenge in understanding this coding is the ability to separate random mutations from positive mutations that are critical to preserving the chemical properties of the protein interactions. In this thesis, we develop a new mutual information based method (MEMI) that uses mutation events in a protein multiple sequence alignment, sampled from a representative phylogenetic tree, to predict co-evolving sequence positions. Others have published numerous papers on the use of mutual information to predict co-evolving sequence positions, with techniques to filter false positives that are not found to be contact neighbors in a representative protein structure. Co-evolving relationships could indicate close proximity in a protein structure and can provide possible solutions for tertiary protein structure prediction. The MEMI method predicts with 81% accuracy co-evolving pairs in Pfam where the distance between the two sequence positions in the protein structure is less than 12 angstroms (contact neighbors), as compared to 56% accuracy using mutual information (MI), a previously known method. To understand the false positives, a new visualization software tool, Protein CorreLogo, was developed to allow a comprehensive 3D overview of co-evolving pairs in a protein family. The Protein CorreLogo models provide insight that false positives with high mutual information were contact neighbors in the stable protein homodimer structure and represented a prediction method for protein-protein interactions.

The MEMI method is used to predict protein-protein interactions, represented as a co-evolving network, for the retroviruses: hepatitis C, HIV, influenza A and dengue. Retroviruses were selected because they have high mutation rates and small genomes, and have a large amount of sequence data available for analysis. Determining informative mutation patterns can provide invaluable insight into the protein topology or systems biology of a virus, which can lead to a vaccine or cure.

The MEMI method is used to predict co-evolving pairs in nuclear receptors, an important class of proteins that regulate the expression of proteins. Three sequence positions are identified as co-evolving that are unique for each of the 48 known nuclear receptors in humans. The three sequence positions are found in the heterodimer interface of the nuclear receptor DNA binding domain and in the flanking regions of the protein DNA binding interface. It is proposed that the three sequence positions, as unique identifiers for all nuclear receptors, are critical sequence positions dictating nuclear receptor DNA binding specificity. Analysis of nuclear receptors predicts co-evolving relationships between the DNA binding domain and the ligand binding domain where the distances greatly exceed 12 angstroms and could be viewed as false positives. A high number of co-evolving pairs are observed grouped by secondary structure, and a new method is introduced to predict co-evolving secondary structures. We used hydrogen deuterium exchange to show that the changes in nuclear receptor secondary structures when bound to DNA support the predicted co-evolving secondary structures network.

In this thesis, we provide a large exploration of using mutual information to predict co-evolving pairs, where the difficulty is validating the results because very little is known about protein-protein interactions. Identifying predicted co-evolving pairs that can be associated with unique attributes of the protein(s) could provide validation that the unique amino acid pairing patterns are important in protein differentiation. With the growing collection of protein sequences and fully sequenced species genomes, information theory will play an important role in increasing our knowledge of the diversity of life that is contained in DNA.
CHAPTER 1
INTRODUCTION

Protein-protein interactions have been notoriously difficult to study for a variety of reasons, including the high false positive results of genetic approaches such as yeast two-hybrid screening (Young 1998) and the false negative results of biochemical approaches (Phizicky, Bastiaens et al. 2003) when applied to membrane proteins. As the amount of biological data that is collected and stored in databases increases, bioinformatics will become an important tool to predict or detect protein interactions. In Chapter 2 the biological relationships to sequence data and protein structures are described, as well as an overview of mutual information.

The use of mutual information (MI method) to detect co-evolving sequence data is a long-running and actively researched topic (Schneider, Stormo et al. 1986; Korber, Farber et al. 1993; Clarke 1995; Pazos, Helmer-Citterich et al. 1997; Atchley, Wollenberg et al. 2000; Pritchard, Bladon et al. 2001; Tillier and Lui 2003; Wu, Schiffer et al. 2003; Crooks 2004; Daub, Steuer et al. 2004; Hamilton, Burrage et al. 2004; Dimmic, Hubisz et al. 2005; Martin, Gloor et al. 2005; Fares and McNally 2006; Fares and Travers 2006; Inbal Halperin 2006; Wang, San Wong et al. 2006; Yi, Ma et al. 2007), where various methods are used to filter or correct for false positives to show agreement with known protein interactions. Mutual information is dependent on accurate probability distributions, and if the samples are selected from a limited set of sequence data then a bias is introduced towards the grouping of the samples (Atchley, Wollenberg et al. 2000). This phylogenetic bias contributes to a high number of false positives and fails to detect valid co-evolving sequence positions that arose in early ancestors (Pollock and Taylor 1997; Barker and Pagel 2005). The phylogenetic effect on sequence data when calculating mutual information has prevented it from becoming a widely used technique to predict protein-protein interactions.
The Mutation Event Mutual Information (MEMI) method is proposed, which calculates probability distributions by sampling along a phylogenetic tree for mutation events. The phylogenetic tree for a collection of aligned sequence data is used as a template to build a mutation history for two sequence positions. If the two children of a parent node in the tree have the same amino acid then the parent node is assigned that amino acid. If the two children have different amino acids then an X (not known) is assigned to the parent. The process continues from the leaves to the root of the tree. If a comparison is made between a node with an assigned amino acid and a sibling node with an X, then the children of the X node are searched for agreement with the sibling node currently being compared. If a match is found then the parent node is assigned that amino acid. Once a consensus tree has been determined for two sequence positions, the tree is descended node by node, and a mutation event at any node results in the sampling of the amino acid pair assigned to that node. The collection of amino acid pairs selected from mutation events along the entire tree is then used to calculate the probability distributions that determine the mutual information for the two sequence positions. This process is repeated for all sequence positions as a pairwise comparison. The advantage of this approach is that the phylogenetic influences from closely related sequence data can be eliminated, and mutation events that occurred in early ancestors or along parallel evolutionary paths in the tree will be detected. In Chapter 3 a more detailed description of the MEMI method is given.

A significant challenge in predicting co-evolving amino acids is the lack of validated co-evolving amino acids based on biological evidence.
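The ancestor assignment rule described above for the MEMI method can be sketched for a single sequence position as follows. This is a minimal illustration, not the dissertation's implementation: the `Node` class and the function names are hypothetical, the tree is assumed to be strictly binary, and the search beneath an X node is assumed to cover the whole subtree rather than only its direct children.

```python
from dataclasses import dataclass
from typing import Optional

UNKNOWN = "X"  # marker for an ancestor whose amino acid is not known


@dataclass
class Node:
    """Hypothetical tree node; leaves carry an observed amino acid."""
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    residue: Optional[str] = None


def subtree_contains(node, residue):
    """Return True if any node at or below `node` carries `residue`."""
    if node is None:
        return False
    if node.residue == residue:
        return True
    return (subtree_contains(node.left, residue)
            or subtree_contains(node.right, residue))


def assign_ancestors(node):
    """Assign residues to internal nodes from the leaves up to the root."""
    if node.left is None and node.right is None:
        return node.residue  # leaf: keep the observed amino acid
    a = assign_ancestors(node.left)
    b = assign_ancestors(node.right)
    if a == b:
        node.residue = a  # children agree (two X children stay X)
    elif a == UNKNOWN and subtree_contains(node.left, b):
        node.residue = b  # X sibling has a descendant matching b
    elif b == UNKNOWN and subtree_contains(node.right, a):
        node.residue = a  # X sibling has a descendant matching a
    else:
        node.residue = UNKNOWN  # children disagree
    return node.residue


# Tree ((C, C), (C, D)): the right-hand ancestor is X, the root resolves to C.
tree = Node(Node(Node(residue="C"), Node(residue="C")),
            Node(Node(residue="C"), Node(residue="D")))
root_residue = assign_ancestors(tree)
```

Sampling amino acid pairs at mutation events would then operate on two such labelled trees, one per sequence position.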
One potential attribute of co-evolving pairs is that they should be found in close proximity in 3D space, indicating a possible interaction. Thus a common measure of a true positive is finding that predicted co-evolving pairs that are distant in sequence space are within 12 angstroms in 3D space. The primary research motivation for predicting co-evolving pairs is to find constraints for tertiary protein structure prediction that indicate how a protein folds. Proteins are dynamic structures, and a predicted co-evolving pair may not be found as contact neighbors in a crystallized protein; it would then be considered a false positive even though it may be a true positive. It is important not to limit the measure of the accuracy of a particular algorithm in predicting co-evolving pairs to contact distance in a representative protein structure. This becomes even more apparent when considering the use of co-evolving pairs to predict protein-protein interactions, where very little is known about the specific sequence positions interacting between two proteins.

In Chapter 4, sequence data from 2,765 Pfam families is used for a comparative analysis of the MEMI method with the previously known MI method, using protein structures from each protein family to score true positives based on 3D distance. The MEMI method is shown to predict statistically significant co-evolving pairs that are located less than 12 angstroms apart in a representative PDB structure 81% of the time, compared to 56% for the MI method. Despite this significant improvement in predicting co-evolving pairs that are contact neighbors, it was observed that a number of apparent false positives had the highest bits of information in the protein family. To explore the data relationships and the reasons for the false positives, Protein Correlogo was developed to provide a 3D model of all known data attributes as they relate to the predicted co-evolving pairs. Protein Correlogo is presented in Chapter 5, where the visual models led to the determination that selected examples of false positives with high mutual information were actually contact neighbors in the homodimer structure of the stable protein.
Predicted co-evolving pairs could be 40 angstroms apart in the tertiary structure yet less than 12 angstroms apart at the interface between copies of the same protein in its stable quaternary structure.
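The distinction between tertiary and quaternary contact can be illustrated with a short distance check. The sketch below is only an illustration under stated assumptions: the function names, the use of one representative coordinate per residue, and the coordinates themselves are hypothetical, chosen so that a pair is 40 angstroms apart within one chain but close across the dimer interface.

```python
import math

CONTACT_CUTOFF = 12.0  # angstroms; the threshold quoted in the text


def pair_distances(i, j, chain_a, chain_b):
    """Return (intra, inter) distances in angstroms for positions i and j.

    chain_a and chain_b map a sequence position to an (x, y, z) coordinate,
    for example the C-alpha atom of that residue in each copy of a homodimer.
    """
    intra = math.dist(chain_a[i], chain_a[j])
    inter = min(math.dist(chain_a[i], chain_b[j]),
                math.dist(chain_a[j], chain_b[i]))
    return intra, inter


def classify(i, j, chain_a, chain_b, cutoff=CONTACT_CUTOFF):
    """Label a predicted pair by where, if anywhere, it is in contact."""
    intra, inter = pair_distances(i, j, chain_a, chain_b)
    if intra < cutoff:
        return "tertiary contact"
    if inter < cutoff:
        return "quaternary contact"
    return "no contact"


# Hypothetical coordinates for two copies of the same protein.
chain_a = {10: (0.0, 0.0, 0.0), 85: (40.0, 0.0, 0.0)}
chain_b = {10: (46.0, 0.0, 0.0), 85: (6.0, 0.0, 0.0)}

intra, inter = pair_distances(10, 85, chain_a, chain_b)
label = classify(10, 85, chain_a, chain_b)
```

Scoring only the intra-chain distance would mark this pair as a false positive, while the inter-chain distance places it at the interface.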
It was clear that predicted co-evolving pairs could not be judged as true positives based on tertiary structure distance alone, making their use in providing possible constraints for protein folding problems difficult. By analyzing the network properties of predicted co-evolving pairs, where a sequence position is a node and an edge indicates the co-evolving relationship, it is possible to measure the randomness of the network as an indicator of false positives. Once predicted co-evolving pairs are identified, numerous visual indicators of potential relationships can be constructed using network graphs or highlighted on solved PDB structures. The network graphs can play an informative role in the understanding of interactions for trans-membrane proteins and proteins with unsolved PDB structures, where little is known about sequence positions that may be exposed on a protein surface.

To understand the potential of using mutual information based methods to predict protein-protein interactions, it is important to use sequence data sampled for a specific phenotype to capture the compensating mutations important to preserve or enhance the function of the proteins. The application of mutual information and the Mutation Event Mutual Information (MEMI) method is ideally suited to retroviruses, which have high mutation rates and small genomes that can be easily sequenced. Current wet lab research techniques used to detect or validate point-specific protein interactions in viruses are still evolving, making validation of protein-protein interactions difficult. Developing and validating a bioinformatics approach to the understanding of a protein topology that is only dependent on genomic sequence data can provide an alternative approach in the understanding of protein-protein interactions in viruses.
In Chapter 6 a background of retroviruses is given as well as a summary of findings for the analysis of co-evolving pairs in Hepatitis C (Chapter 7), HIV (Chapter 8), Influenza A (Chapter 9) and Dengue (Chapter 10). In Chapters 7-10, co-evolving pairs are predicted using the MEMI method and the MI method, where emphasis is placed on the predicted co-evolving network having power-law properties indicating non-random relationships. The MEMI predicted co-evolving pairs are, when possible, shown on a representative protein structure and compared to known virus topology.

In Chapter 11 the MEMI method is applied to nuclear receptors, an important class of proteins responsible for the regulation of specific genes, with significant findings on the importance of predicted co-evolving pairs and a novel introduction of co-evolving secondary structures. Humans contain 48 named nuclear receptors that are responsible for sensing ligands such as steroids, hormones and vitamins as a trigger for conformational changes in protein structure, which impact the recruitment of other proteins responsible for gene transcription. Nuclear receptors consist of a DNA-binding domain, where the sequence space is considered highly conserved, connected by a hinge domain to the ligand-binding domain, where the sequence space is considered moderately conserved. Both the DNA-binding domain and the ligand-binding domain are considered highly conserved structures. Nuclear receptors are drug targets for approximately 13% of FDA approved drugs, which means they are the subject of significant research studies. As nuclear receptors are important regulators of gene transcription, the method by which they bind to DNA promoter regions near a target gene is well understood. Predicting DNA-binding motifs for nuclear receptors using protein and DNA sequence data is an actively researched field. The DNA-binding domain of nuclear receptors is highly conserved in sequence space, where mutations are important for selecting a targeted group of genes. The MEMI method is used to predict co-evolving pairs in nuclear receptors to identify regions that are important in the evolution of nuclear receptors.
The MEMI method revealed three sequence positions, which were previously unknown distinct for each named nuclear receptor and are found as contact
also found in the flanking regions where the nuclear receptor DBD is bound to DNA, indicating a possible role of the gene promoter DNA in coding for gene transcription without ligand. The nuclear receptor DNA-binding domain (DBD) and ligand-binding domain (LBD) are two distinct protein structures connected by a long hinge region. The co-evolving pair analysis showed strong relationships between the two regions, even though it is well known that they do not interact. As nuclear receptors evolved, each acquired functional attributes that require coordination between the region that recognizes DNA for binding to specific genes and the LBD, which is activated or deactivated by a select ligand. The sequence positions that are predicted to be co-evolving in the DBD and LBD are contained in secondary structures that are highly conserved in structure but are important in the differentiation of protein function. The network analysis revealed co-evolving relationships between secondary structures, and this was used to introduce the novel concept of predicted co-evolving secondary structures. The binding of nuclear receptors to DNA has recently been shown to cause changes in secondary structure in Helix 3, located in the LBD very distant from the DNA-binding domain. This allosteric communication is predicted by the MEMI method, which shows a high number of co-evolving relationships between sequence positions in Helix 3 and sequence positions in the DBD. To validate these findings, hydrogen-deuterium exchange is used to study the structural changes in secondary structures when bound to DNA. The changes in secondary structures based on experimental evidence strongly agree with the secondary structures that are predicted to be co-evolving, and suggest a mechanism by which DNA binding is communicated to a distant secondary structure responsible for enabling the recruitment of proteins for gene transcription.
In Chapter 12 a summary of results is provided and directions for future research are discussed. Developing algorithms to predict co-evolving pairs can provide invaluable insight into the evolution of proteins based on distinct pairing patterns that are shown to be unique to named attributes of proteins. Using the distance between predicted co-evolving pairs represents one possible step in the validation of a true positive. Based on the analysis of a large collection of sequence data in Pfam, viruses and nuclear receptors, it is clear that other methods must be used to test the non-randomness of the predictions. Mutual information is a measure of randomness or information between two variables, but what is not given is the nature of the information being expressed. Co-evolving pairs that are shown to be unique for named attributes of a protein family can be used, when possible, to validate the correctness of predicted co-evolving pairs. Predicted co-evolving pairs for which it is not clear why the pairing patterns are unique should be viewed as informative and not classified as false positives, given that our knowledge of protein interactions is very limited.
CHAPTER 2 BACKGROUND

With large-scale genome projects sequencing diverse species and DNA sequencing having become a standard tool in research labs, the amount of collected sequence data will require extensive computational resources to store, index and reference in meaningful ways. As of May 30, 2007 the Nucleotide Sequence Database consists of 97 million sequence entries, of which 23 million are whole genome shotgun data (EMBL 2007). The Swiss-Prot database is a curated database of protein sequence data with a high level of manual research annotation, a minimal level of duplication, protein domain structure and specific cross-referencing to other databases. As of August 21, 2007 the Swiss-Prot database consists of 277,883 entries and is growing exponentially (Swiss-Prot 2007). The protein sequence data in Swiss-Prot is organized in protein families called Pfam, a large collection of multiple sequence alignments grouped by functional family based on hidden Markov models. As of June 2007, Pfam contains 9318 protein families (Bateman, Birney et al. 2002).

Proteins are encoded in genomes as triplets of nucleotides that code, in reading frames, for one of twenty possible amino acids. The chemical properties of the amino acids then form secondary structures as either a loop, helix or beta sheet, which then combine to form a stable tertiary structure that is the starting point of a protein and its interaction with other proteins and ligands. It is these protein interactions in 3D space that guide almost every process in a living cell. Great importance is placed on solving the 3D structure of a protein sequence if possible, as it gives the researcher insight into the chemical and functional properties of the protein. The critical proteins involved with trans-membrane transport in and out of cells are difficult to isolate, and accurately determining the 3D structure of these proteins is an unsolved problem.
The collection of 3D protein structures is organized in the RCSB PDB database and as of August 21, 2007 consists of 45,368 protein structures (RCSB 2007), where approximately 5,000 of the 9,317 Pfam protein families do not have a representative PDB structure. The collection of PDB structures has allowed a reclassification of sequence data grouped by secondary structure into the Class, Architecture, Topology, Homologous superfamily (CATH) (Orengo, Michie et al. 1997) and the Structural Classification Of Proteins (SCOP) (Lo Conte, Ailey et al. 2000) databases. Using homology to predict secondary structure from sequence data is considered 70-80 percent accurate, and accurately predicting protein structure without sequence homology is an unsolved problem (Crooks 2004).

The end purpose of collecting sequence data and determining protein structures is to understand how specific proteins interact and then to create protein topologies that illustrate the biochemistry, signaling networks and regulatory processes of the cell. This knowledge is critical in developing treatments for cancer, finding vaccines for viruses and improving the quality of life. Progress is being made on biological techniques to observe, test and validate protein interactions, but the complexity is high and the techniques are based on prior assumptions that may not be accurate. The mutual information theory derived method proposed here appears to be very promising, since it is based purely on sequence data of potentially interacting proteins, without any preconceived notions or biases about what is currently known. The ability to detect co-evolving amino acid pairs between proteins could indeed reveal specific interfaces that may help predict protein-protein interactions. In turn, these predicted interactions may focus the attention of the biologist on new research targets to validate using conventional biochemical processes.
The MEMI method is used to detect co-evolving pairs and uses mutual information to detect non-random compensating mutations by re-sampling the sequence data, selecting mutation events along the phylogenetic tree for the aligned sequences. In most cases, determining the 3D structure of a protein is difficult, slow and prohibitively expensive given current state-of-the-art X-ray or NMR techniques (Chandonia and Brenner 2006). For proteins that are trans-membrane, and thus important in understanding the role of specific proteins related to infection in cells, easily solving PDB structures remains a difficult problem (Torres, Stevens et al. 2003). The ability to collect sequence data is not limited to a specific type of protein, which allows protein interactions to be predicted for trans-membrane proteins with other regulatory proteins in the host cell. In turn, these predicted protein-protein interactions can be used as specific targets for site-directed mutagenesis studies. This approach has general applications for all protein-protein interactions where enough genomic sequence data exists to accurately determine the probability distribution of the population (Wong, Roccatano et al. 2007).

The ability to detect co-evolving amino acid pairs using information theory or mutual information can be an important tool in the understanding of protein-protein interactions in rapidly mutating viruses such as Hepatitis C, HIV or Influenza A. With the recognized difficulty in determining 3D structures for trans-membrane proteins and non-structural proteins, which are important to the regulation and replication of viruses, the clinical researcher is at a disadvantage in advancing the understanding of the life cycle of a virus and possible drug targets. Developing and validating a bioinformatics approach to the understanding of a protein topology that is only dependent on genomic sequence data can provide an alternative approach in the understanding of protein-protein interactions in viruses.
Structural Biology

A protein is a 3D structure constructed from a chain of amino acids that is coded for by a DNA sequence, resulting in the formation of secondary structures that fold to form a complex tertiary structure (Chothia 1984). The functional or biochemical properties of a protein depend on the 3D structure of the protein. The major function of DNA is to encode the sequence of amino acids required to construct a protein. DNA itself is a double helix structure made up of four bases, Adenine (A), Guanine (G), Thymine (T) and Cytosine (C). The DNA is arranged as a sequence of letters in chromosomes, and the genetic information that codes for a protein function is called a gene. In the human genome, there are approximately 3 billion base pairs of DNA arranged into 46 chromosomes (Venter, Adams et al. 2001). Each consecutive block of three nucleotides in a gene, called a codon (TTG, ACT, GCA), is responsible for coding one of twenty possible amino acids.

Each amino acid has various physio-chemical properties that may or may not be important depending on the location of the amino acid in the protein 3D structure. An amino acid that is hydrophobic will typically be found packed in the inner core of a protein structure or a cell membrane to avoid water. An amino acid that is charged can be positive or negative and can form bonds with oppositely charged amino acids. The Venn diagram in Figure 2-1 represents the different physio-chemical attributes of amino acids and is just one of many possible classifications (Betts and Russell 2003). Each amino acid consists of a backbone made up of a Carbon atom (the C alpha), a second Carbon atom, and a Nitrogen atom (C C N), connecting with a left and right neighboring amino acid to form a chain of atoms (C C N) (C C N) (C C N). Attached to the first C, or C alpha, is the residue or side chain that results in the unique physio-chemical properties of each amino acid. The residues or side chains of neighboring amino acids form chemical bonds that result in the formation of a secondary structure as a helix, beta sheet or turn/loop.
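The codon-to-amino-acid mapping mentioned above can be illustrated with the standard genetic code. The sketch below is illustrative only; the names `CODON_TABLE` and `translate` are hypothetical, and the table assumes the standard code with bases ordered T, C, A, G and `*` marking a stop codon.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, one letter per codon in TCAG order; '*' is a stop.
AMINO_ACIDS = ("FFLLSSSSYY**CC*W"
               "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR"
               "VVVVAAAADDEEGGGG")

# Map each three-letter codon to its one-letter amino acid code.
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate(dna):
    """Translate a DNA string codon by codon, ignoring a trailing partial codon."""
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


# The three example codons from the text: Leucine, Threonine, Alanine.
peptide = translate("TTGACTGCA")
```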
The three secondary structures are the basic building blocks that form the tertiary structure of the protein (Branden and Tooze 1999). A specific region of a protein structure can be assigned a specific function, and this region would be coded by a sequence of amino acids. This group of amino acids, or sequence of 20 possible symbols each representing an amino acid, can be further grouped into a functional protein family or Pfam (Bateman, Birney et al. 2002). The Pfam database contains information about protein families where function and protein structure are generally conserved within a particular family. A Hidden Markov Model is constructed for a family from manually curated sequence data and is then used to classify all deposited sequence data into a particular Pfam family.

Random mutations in the genetic code result in insertions, deletions or substitutions, causing variations in the sequence of amino acids that form a secondary structure. These mutations are considered either neutral or positive in relation to the protein function or the formation of the protein structure (Kimura 1983). If a mutation occurs that hinders the formation of the secondary structure or alters the surface area of a key protein interface in such a way that the protein can no longer function, then it negatively impacts the host organism's ability to survive and pass the negative mutation to future generations. The occurrence of sequence mutations from evolutionary pressures may cause a collection of amino acid sequences in a protein family to have only 30% similarity but still retain the same general secondary or tertiary structure, which preserves protein function (Pollastri, Martin et al. 2007). The high degree of variability in amino acid sequences that belong to the same protein family but still yield the same 3D structure makes predicting protein structure from a sequence of amino acids an unsolved problem.

The tertiary structure of a single protein can form a quaternary structure (a grouping of tertiary structures) by combining with an exact copy of the same protein to form a dimer, which is viewed as a single stable structure.
The collection of proteins interacting to form a stable structure can also occur between three proteins and is referred to as a trimer. It is also common for a single protein to organize with multiple copies of the same protein or other protein structures to form barrel structures or large complex structures referred to as homo-oligomers. Proteins can also interact with a ligand, which is simply an atom, ion or molecule, and this interaction can play a critical role in the life of a cell. Each protein structure has an overall purpose in the cellular process, and the 3D model is important for understanding how the protein works and for identifying possible drug targets to fight disease.

Cystic Fibrosis (CF) is a common hereditary disease that is the result of the deletion of a single amino acid at position 508 of the protein encoded by the CFTR gene, which is located on chromosome 7 in the human genome. The CFTR gene is responsible for the formation of a protein that is responsible for the movement of salt and water into and out of cells. The deletion of this one amino acid prevents the protein structure from moving across the cell membrane. Understanding the impact of single or multiple mutations in a gene can provide clues as to the overall function of the gene.

A protein 3D structure can be determined by X-ray crystallography or NMR techniques. The result is a PDB file that represents either part of or an entire protein structure, which is submitted to the Protein Data Bank. As of January 2007, the Protein Data Bank contained 41,258 structures. As of November 2006, the Pfam database, which represents protein sub-sequences organized by function, contained 8957 families. Of these 8957 families, approximately 5000 do not have a representative PDB structure, making it difficult to understand which amino acids are important in protein function. Each unique amino acid sequence is assigned an accession number, and it may have multiple PDB structures associated with it. The protein structure that was determined from a unique sequence is assigned a PDB ID in the form of four characters (1ONB, 1BTZ, 1BEF). The PDB ID can be used to download a PDB file for a referenced amino acid sequence. In many cases the techniques required for X-ray crystallography or NMR may prevent the structure of the entire sequence from being determined. The PDB file may represent a sub-sequence or have gaps in the 3D structure. The PDB file contains the 3D coordinates of each atom, grouped by amino acid and tertiary structure. In many cases the PDB file will contain the quaternary structure, showing how the proteins form a stable structure or how the protein has interfaced with a specific group of ligands.

Numerous free applications exist that allow the researcher to load and view a PDB data file as a 3D structure (Moreland, Gramada et al. 2005; Angel 2006). These applications are written specifically to read the 3D coordinates of the atoms and, based on the amino acid properties, render visual models. The simplest is the ball and stick model, though the level of detail can be visually overwhelming. The ribbon model uses the (C C N) coordinates for each amino acid along the backbone and a Hermite function (Abramowitz and Stegun 1965) to form a smooth line between the (C C N) (C C N) data points, which results in a clear indication of the secondary structures (helix, beta sheet and turns) in the 3D model. Proteins can have an active region at the core of the 3D structure where binding to a ligand may occur, and the ribbon model is used to highlight this region. Proteins also interact with other proteins, and in this case the surface of the protein structure becomes visually important. The surface model can be approximated by representing each atom coordinate as a sphere with a diameter equal to the size of the atom. The applications used by researchers to view 3D protein models support the same basic features, with variation among applications in the ability to inspect and modify the visual appearance of the protein structure. Basic features include the ability to view ball and stick, ribbon and surface map models, and to rotate and zoom in and out on the model.
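Reading atom coordinates from a PDB file, as the viewers described above do, can be sketched as follows. This is a minimal illustration that assumes the fixed-column ATOM record layout of the PDB file format; the helper names `atom_line` and `parse_ca_atoms`, and the sample coordinates, are hypothetical and exist only to make the example self-contained.

```python
def atom_line(serial, name, res_name, chain, res_seq, x, y, z):
    """Build one fixed-column ATOM record (hypothetical helper for the demo)."""
    return ("{:<6}{:>5} {:^4}{:1}{:>3} {:1}{:>4}{:1}   "
            "{:>8.3f}{:>8.3f}{:>8.3f}{:>6.2f}{:>6.2f}").format(
        "ATOM", serial, name, " ", res_name, chain, res_seq, " ",
        x, y, z, 1.0, 0.0)


def parse_ca_atoms(pdb_text):
    """Return {(chain, residue number): (residue name, x, y, z)} for C-alpha atoms."""
    atoms = {}
    for line in pdb_text.splitlines():
        if not line.startswith("ATOM"):
            continue  # skip HETATM, REMARK and other record types
        if line[12:16].strip() != "CA":
            continue  # keep only the C-alpha atom of each residue
        chain = line[21]
        res_seq = int(line[22:26])
        coords = (float(line[30:38]), float(line[38:46]), float(line[46:54]))
        atoms[(chain, res_seq)] = (line[17:20].strip(),) + coords
    return atoms


# Two residues from a made-up chain A; only the CA records are kept.
sample = "\n".join([
    atom_line(1, "N", "MET", "A", 1, 27.340, 24.430, 2.614),
    atom_line(2, "CA", "MET", "A", 1, 26.266, 25.413, 2.842),
    atom_line(9, "CA", "GLY", "A", 2, 24.000, 27.500, 1.125),
])
ca_atoms = parse_ca_atoms(sample)
```

Grouping by chain identifier and residue number mirrors how the file organizes atoms by amino acid and by chain in a quaternary structure.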
Each application has strengths and weaknesses in its ability to select individual atoms, amino acids, and secondary structures with a focus on highlighting a specific region of interest. A common use of these programs is to create a visual model that can be used in a research publication to support the understanding of a particular region in the protein structure. It is also common for the applications, which have origins in the open source community, to support multiple operating systems, and it is uncommon to have the ability to export a 3D model to any standard 3D format. It is a general observation that biologists will typically use a Mac or Linux PC, as they have a requirement to run research applications written originally for Unix.

Mutual Information

The field of Information Theory was introduced by Shannon (1948) as a measure of information and the detection of noise in a communication channel. The information shared between two discrete random variables X and Y is defined as mutual information and is shown in Figure 2-3, where p(x,y) is the joint probability distribution of the two variables X and Y, and p(x) and p(y) are the probability distributions of X and Y. A completely random signal is considered to have no information, or maximum entropy, and is defined as in Figure 2-3 (a). If a signal consists of a continuous transmission in which each letter A-Z has equal probability, then that signal would appear to be random and would have an entropy score of 1.0 with log base 26. The log base impacts the maximum entropy value, and to obtain a normalized scale the base would equal the number of discrete values in the set. It is common to use log base 2 when calculating entropy, and the units are then bits, an approximation of how many binary values are required to send the same information through a communication channel. A simplified view of the mutual information calculation, shown in Figure 2-3 (d), is the sum of the entropy of X and Y minus the joint entropy of X and Y, Figure 2-3 (c), and the relationships are more easily illustrated in Figure 2-4.

Mutual information is based on Shannon Entropy (H), which is derived from the probabilities of occurrences of individual and combined events between two discrete random variables. In Figure 2-4 a Venn diagram shows the relationship between entropy and mutual information. If the entropy of X and the entropy of Y indicate 100% random behavior then the sum of H(X) and H(Y) will be a maximum. If the X and Y pair is non-random then the joint entropy H(X,Y) will be at a minimum and the mutual information will be at a maximum. The maximum area of H(X) in Figure 2-4, when scaled by log base N where N is the size of the set from which X is drawn, is 1.0 if each symbol in X has equal probability. The same scaling function applies to H(Y), where typically the number of members in the sets for X and Y is the same (Cover and Thomas 2006). In cases where the number of discrete values for X and the number of discrete values for Y are not the same, the log base can be used to normalize the entropy calculations for H(X), H(Y) and H(X,Y) (Gouveia-Oliveira and Pedersen 2007). When calculating mutual information for a sequence position that can possibly contain 20 different amino acids, log base 20 could be used as a scaling factor. The problem with this approach is that, functionally, the number of amino acids that can be substituted at a specific sequence position may be constrained by a unique physio-chemical property. The treatment of this condition can be simplified by applying log base 2 as a standard, and this is used in all entropy calculations in this research.
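The entropy and mutual information relationships described above, MI(X,Y) = H(X) + H(Y) - H(X,Y), can be sketched in a few lines. The function names and the toy columns below are illustrative assumptions; probabilities are estimated as simple observed frequencies, with no correction for sample size or phylogeny.

```python
import math
from collections import Counter


def entropy(symbols, base=2):
    """Shannon entropy of a sequence of observed symbols."""
    counts = Counter(symbols)
    n = len(symbols)
    return -sum((c / n) * math.log(c / n, base) for c in counts.values())


def mutual_information(xs, ys, base=2):
    """MI(X,Y) = H(X) + H(Y) - H(X,Y) for two aligned columns of symbols."""
    joint = list(zip(xs, ys))  # paired observations give the joint distribution
    return entropy(xs, base) + entropy(ys, base) - entropy(joint, base)


# Each letter A-Z with equal probability: normalized entropy with log base 26.
alphabet = [chr(ord("A") + i) for i in range(26)]
h_norm = entropy(alphabet, base=26)   # close to 1.0
h_bits = entropy(alphabet, base=2)    # close to log2(26) bits

# Toy columns: fully coupled positions versus independent positions.
mi_coupled = mutual_information("AACC", "DDEE")      # close to 1 bit
mi_independent = mutual_information("AACC", "DEDE")  # close to 0 bits
```

With log base 2, the coupled columns share one bit of information because knowing the symbol in one column fully determines the symbol in the other.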
Figure 2-1. Physio-chemical properties of amino acids (Venn diagram with sets: Aromatic, Aliphatic, Hydrophobic, Positive, Negative, Charged, Polar, Small, Tiny, Proline)
Figure 2-2. Mutual information
Figure 2-3. Entropy to calculate mutual information, panels (a) through (d)
Figure 2-4. Venn diagram of mutual information (regions: H(X), H(Y), H(X|Y), H(Y|X), MI(X,Y), H(X,Y))
36 CHAPTER 3 MUTUAL INFORMATION C ALCULATED WITH MEMI METHOD Application of Information Theory and the analysis of sequence data are impacted by a sampling bias from targeted research on proteins of medical interest and the introduction of noise from the ph ylogenetic impact on probability calculations. This topic will be thoroughly explored and algorithms developed to improve the quality of information measured in protein sequence data. When calculating probabilities one underlying assumption is that the sam ples used to calculate the probabilities are randomly picked samples from the population. If the samples are not randomly picked then a bias is introduced towards the grouping of the samples (Atchley, Wollenberg et al. 2000) In a protein family each sequence represents a sample of the population of sequences that represent that fami ly. The first introduced bias is that the sequences tend to come from research studies that have medical or pharmaceutical interests. Given common evolutionary history of all genetic sequences, the ability to survey the existing and extinct population for all members of a particular protein family is not practical or feasible. So we are left with a collection of protein sequences that do not represent the entire population pool, with the added impact of phylogenetic influence where each sample is dependent on its parent. These two factors can contribute statistical bias or noise when calculating entropy or mutual information relationships. As the scope of genetic research expands to include a larger set of all species then the impact of populating sampling w ill decrease as protein sequences belonging to a protein family become more diverse. If a hypothetical database contained all sequences belonging to a protein family with a time stamp that indicated the origination of that sequence from evolutionary forces then the challenge of measuring probability distributions would be negligible. Such a database does not exist and
the ability to go back in time and accurately predict the evolutionary time stamp of when a sequence evolved is also impossible, or at minimum extremely difficult, without accurate fossil records. To compensate for the lack of a valid time stamp, various techniques are used to construct a binary evolutionary tree (Semple and Steel 2003). The tree has N terminal nodes, one for each sequence in the family. Each internal node of the tree represents a hypothetical ancestor of the sequences that are children of that node. Numerous methods are used to construct the trees. One common approach is to use Molecular Clock Theory to simplify the analysis and assume that the number of mutations per unit of time is constant (Zuckerkandl and Pauling 1962). The tree is constructed by minimizing the number of mutations between a child and the hypothetical parent ancestor node.

The evolutionary tree represents a graph of mutation events that can be used to correct or compensate for the phylogenetic influence in a protein family. Detecting mutation events from the root node to the leaf nodes generates the set of all mutations at a particular sequence position. This set of mutations is then used as the basis for probability calculations of observed mutations. In Figure 3-1 a binary tree represents a phylogenetic tree where the circles are the hypothetical parents and the boxes represent a sequence position in a protein family. Without taking the phylogenetic influence into consideration, the probabilities for the set are p(A)=1/6, p(D)=1/6 and p(C)=2/3. If we calculate the probability of observed mutation events starting from the parent node, then p(A)=1/3, p(D)=1/3 and p(C)=1/3, as indicated by the dashed nodes. This is done by starting at the root node and counting all child nodes where the child does not equal the parent. Comparing the two approaches to calculating probability yields two very different results.
The first accurately represents the sample, and the second represents the probability of transition to a different amino acid. This
has the impact of reducing or compensating for the phylogenetic influence on probability calculations. This same approach can be applied to calculating mutual information between two sequence positions and is the basis for improving the detection of co-evolving pairs.

In Figure 3-2 the tree represents a pair of amino acids found at position x and position y. The phylogenetic tree is used to detect mutation events between pairs, which become the population sample used for probability calculations. The probability based on the number of observed sequences would result in p(AE)=1/6, p(DE)=1/6, p(CD)=3/6 and p(CE)=1/6. Using the method described above, where we start at the root node and count child nodes that differ from their parent, with the additional rule that an internal node marked XX takes on the value of its parent node, we get the following probabilities: p(AE)=1/4, p(DE)=1/4, p(CD)=1/4 and p(CE)=1/4, as indicated by the colored nodes. The impact of having CD occur 50% of the time is now reduced to 25%, which serves as an adjustment for the phylogenetic influence of a mutation that occurs early in the tree, and overall only four distinct mutations occur.

It would appear that counting the number of distinct combinations would yield the same results. However, this is only true in the example presented. In a large tree, mutations occur along multiple paths of the tree; an amino acid pair that appears early in the tree may be absent for many mutations and then reappear as a dominant stable pairing along a particular branch of the tree. This approach focuses on counting the transitions from one mutation state to the next; if the state does not change, then a mutation did not occur. In Figure 3-3 a state diagram is constructed from the observed paired state transitions illustrated in Figure 3-2.
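The counting rule described in this chapter can be sketched in a few lines of code. The following is a minimal sketch, not the implementation used in this work. It assumes each tree node is a (state, children) tuple and that the root's own state counts as one event, which is one way to reproduce the probabilities quoted in the text. The topologies single_tree and pair_tree and all function names are hypothetical and do not reproduce the trees drawn in Figures 3-1 and 3-2.

```python
from collections import Counter
from fractions import Fraction

UNKNOWN = "XX"  # internal node with no assigned state


def leaf(state):
    """Build a terminal node (an observed sequence)."""
    return (state, [])


def leaf_states(node):
    """Collect the states of all terminal nodes below (and including) node."""
    state, children = node
    if not children:
        return [state]
    return [s for child in children for s in leaf_states(child)]


def mutation_events(node, parent_state=None):
    """Walk down from the root and record a state whenever it differs from its parent."""
    state, children = node
    if state == UNKNOWN:
        state = parent_state  # unknown nodes inherit the parent's state
    events = []
    if state is not None and state != parent_state:
        events.append(state)  # assumption: the root's own state counts once
    for child in children:
        events.extend(mutation_events(child, state))
    return events


def probabilities(states):
    """Relative frequency of each state as an exact fraction."""
    counts = Counter(states)
    total = sum(counts.values())
    return {s: Fraction(n, total) for s, n in counts.items()}


# Hypothetical topologies chosen only to match the leaf frequencies in the text.
single_tree = ("C", [
    ("XX", [leaf("C"), leaf("C")]),
    ("XX", [leaf("C"), ("XX", [leaf("C"), ("XX", [leaf("A"), leaf("D")])])]),
])
pair_tree = ("CD", [
    ("XX", [leaf("CD"), leaf("CD")]),
    ("XX", [leaf("CD"), ("CE", [leaf("CE"), ("XX", [leaf("AE"), leaf("DE")])])]),
])

print(probabilities(leaf_states(pair_tree)))      # frequencies of observed sequences
print(probabilities(mutation_events(pair_tree)))  # frequencies of mutation events
```

On these example trees the observed-sequence frequencies give CD a weight of 1/2, while the mutation-event frequencies give each of the four pair states a weight of 1/4, matching the adjustment described for Figure 3-2.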
The motivation for this approach is to increase the reliability of detecting co-evolving pairs that are [...] Hydrogen atoms or to detect protein-protein interactions. Calculating probabilities using observed state transitions reduces the phylogenetic effect on the overall probability scores. To bring the key point into focus, we restate the assumption: if a sequence position mutates to a different amino acid, then amino acids located in neighboring secondary or tertiary structures may also need to mutate to preserve the structure or function in that region. That a particular pair of amino acids remains in one state for 90% of all sequences is informative, but it becomes noise when trying to determine the existence of co-evolving pairs. The final transition into the state that achieves a 90% representation may be the most efficient form in the structure. The mutations that occurred prior to this stable state represent co-evolving information, and by using a phylogenetic tree to determine mutation events we can more accurately reflect the changes that occurred.

Figure 3-1. Phylogenetic tree for a single sequence position
Figure 3-2. Phylogenetic tree for co-evolving pairs, where green indicates a mutation event

Figure 3-3. State diagram of observed mutation transitions (states: CD, DE, AE, CE)
CHAPTER 4 MUTUAL INFORMATION IN PROTEIN FAMILIES PFAM

Mutual information is the measure of mutual dependence between two variables. In protein structures, sequence positions that share high mutual information are referred to as co-evolving pairs (Martin, Gloor et al. 2005). When two amino acids are distant (>10 positions) in a sequence, the fold of the protein could place them at contact points or as near neighbors (<12 angstroms). When one of them mutates, it is likely that a neighboring amino acid in 3D space will also need to mutate to preserve the function or structure of the protein. The use of mutual information or entropy measures to detect co-evolving pairs is an actively researched topic (Atchley, Wollenberg et al. 2000; Pritchard, Bladon et al. 2001; Crooks 2004; Hamilton, Burrage et al. 2004; Dimmic, Hubisz et al. 2005; Martin, Gloor et al. 2005).

The ability to predict the secondary structure of helices, strands or loops by homology, where sequence data from an unknown structure is compared to sequence data for known structures, is considered 70-80 percent accurate (Crooks 2004). Taking these predicted secondary structures and developing an accurate tertiary model is considered a challenging and unsolved problem. An algorithm that uses mutual information to detect co-evolving pairs that are close in 3D space but distant in the protein sequence would play an important role in tertiary structure prediction for that protein family. Accurately predicting co-evolving pairs in homologous protein sequences is difficult because of phylogenetic noise in the signal. A reliable method to predict co-evolving pairs allows for the initial relative placement of secondary structures in a potential tertiary structure. This narrows the possible number of protein folding solutions and provides a potentially accurate base model, which can then be further refined by already accepted methods.
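For reference, the standard Shannon definitions behind the mutual information measure can be written as follows. Here X and Y stand for the amino acid distributions observed at two alignment columns; the symbol names are notation chosen for this sketch, and log base 2 is assumed so that the values are in bits.

```latex
\begin{align}
H(X)    &= -\sum_{x} p(x)\,\log_2 p(x) \\
H(X,Y)  &= -\sum_{x,y} p(x,y)\,\log_2 p(x,y) \\
MI(X,Y) &= H(X) + H(Y) - H(X,Y)
         = \sum_{x,y} p(x,y)\,\log_2 \frac{p(x,y)}{p(x)\,p(y)}
\end{align}
```

The second form of MI(X,Y) makes explicit that the score is zero when the two columns are statistically independent, since then p(x,y) = p(x) p(y) for every pair.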
One approach to reducing the phylogenetic effect is to set thresholds of minimum entropy scores and to normalize results by the joint entropy (Martin, Gloor et al. 2005). To compensate for the phylogenetic effect in the creation of background noise, any column with entropy < 0.3 is discarded and the final mutual information score is divided by the joint entropy of the two columns. This provides a filtering mechanism in which column pairs that have a high degree of randomness receive lower mutual information scores. Understanding the information that is being measured and the ability to filter noise is important in detecting co-evolving pairs and minimizing false positives.

Clarke (1995) takes a functional biological approach to explain the impact of a mutation between distant amino acid pairs and their relationship in the 3D protein structure. Clarke points out the challenges and importance of uniform sampling of sequences from an evolutionarily diverse sample. When aligning the sequences to detect co-variation between amino acid pairs, the probability measurement can be biased when data is only available for a narrow part of the phylogenetic tree. Clarke focuses on correcting for errors by reducing the effective weight of repeating sequences, because they are evolutionarily close and appear to create a sampling bias. With the correction, high mutual information scores are shown to map onto important functional areas of a protein 3D structure.

Multiple methods can be used to select or filter information in order to improve the ability to detect a particular segment of co-evolving pairs or to reduce false positives. Tillier and Lui (2003) compare multiple approaches to detecting co-evolving pairs. Summing and weighting the entropy over all columns introduces multiple significant interdependencies for a given site. Further derivations account for insertions and deletions.
These results are then used in conjunction with the mutual information value to calculate a dependency ratio, which is the
degree of correlation between two sites that is not attributable to phylogeny. Further mathematical manipulation introduces the entropy-weighted dependency ratio, which performs better than using mutual information as the criterion for investigation. According to Tillier and Lui, the use of mutual information created a high false positive rate.

With a growing database of sequences associated with a particular protein family, the challenges of small sample size are becoming less of an issue. The small-sample problem is replaced by the impact of phylogenetic noise from closely related sequences. In Chapter 3 a re-sampling method (MEMI) that detects mutation pairs in a phylogenetic tree is introduced to reduce the phylogenetic effect. An initial comparison of the standard method (MI) and the MEMI method is done using Pfam family PF04055.9, which shows a high number of statistically significant MI pairs. Additional details associated with the relationship of amino acids with shared mutual information are discussed. The MEMI method is further tested against the full Pfam database to determine the predictive accuracy of detecting co-evolving pairs. This represents the first known review of an algorithm to detect co-evolving pairs against the complete Pfam database, which represents 85% coverage of the Swiss-Prot database and 75% coverage of the SP+TrEMBL database (John Marc Chandonia 2005).

Mutual Information in Protein Families

Mutual information is calculated for the protein family PF04055.9 using the standard approach (MI) and by reducing the phylogenetic effect (MEMI) through re-sampling against the phylogenetic tree for the family. The prediction accuracy and quality of data points for both techniques will be reviewed. In the Pfam full data set, PF04055.9 has 4582 sequences listed as belonging to that family, and it was selected for its high number of statistically significant MI pairs.
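The standard MI calculation, together with the entropy threshold and joint-entropy normalization attributed earlier in this chapter to Martin, Gloor et al. (2005), might be sketched as follows. This is an illustrative sketch only: the function names, the representation of a column as a string of residues, and the choice to return None for a discarded column are assumptions, and gap handling is omitted.

```python
import math
from collections import Counter


def entropy(symbols):
    """Shannon entropy in bits (log base 2) of a sequence of hashable symbols."""
    counts = Counter(symbols)
    total = len(symbols)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def normalized_mi(col_x, col_y, min_entropy=0.3):
    """Mutual information of two alignment columns divided by their joint entropy.

    Returns None when either column falls below the entropy threshold,
    mirroring the rule that such columns are discarded.
    """
    h_x = entropy(col_x)
    h_y = entropy(col_y)
    if h_x < min_entropy or h_y < min_entropy:
        return None
    h_xy = entropy(list(zip(col_x, col_y)))  # joint entropy over paired residues
    mi = h_x + h_y - h_xy
    return mi / h_xy


print(normalized_mi("AAAADDDD", "EEEEKKKK"))  # two perfectly covarying columns
print(normalized_mi("AAAAAAAA", "EEEEKKKK"))  # a fully conserved column is discarded
```

For the MEMI variant the same function could be applied to the list of mutation-event pairs detected along the tree rather than to the raw columns; that pairing is an assumption about how the two steps would be combined.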
To measure prediction accuracy, the MI and MEMI algorithms are run against the entire Pfam data set, with the criterion that a protein family must have at least 100 sequences and fewer than 5,000 sequences. In determining the average 3D distance between the amino acids of a pair, the protein sequence with a corresponding PDB structure must have 90% alignment. The distance between an amino acid pair that is shown to have high mutual information is measured by finding the minimum distance between non-hydrogen atoms in a representative PDB structure. If more than one PDB structure exists for a particular protein family, then the average distance in all PDB structures is used.

Mutual Information in PF04055.9

To illustrate the two approaches, mutual information was calculated using log base 2 (MI) for Pfam family PF04055.9, and values greater than four standard deviations from the mean are graphed in Figure 4-1. A total of 24 amino acid pairs that are greater than 10 sequence positions apart have a Z score of 4 or greater. Eight of these data points have a near conta [...] Mutual information can be used to detect co-evolving pairs and in this example 33% of the pairs [...] difficult to use the data in determining the relative 3D position of two secondary structures. An indication of a false posi [...] this example four data points fall in this range. We would like to increase the percentage of MI scores that [...] well defined false positives.

In Figure 4-2 the results of calculating mutual information by reducing the phylogenetic effect are graphed (MEMI). Of the 23 data points that have a MI score with Z>=4, 16 of the data points are less than [...] the MEMI [...] and a standard deviation of 5.0 for the MI method. Reducing the phylogenetic impact in the
probability scores in this one example has improved by 100% the number of identified pairs that are less than [...]. Both methods identify 4 out of 5 of the same MI clusters or secondary structures. With the MI method the amino acid pairs can be grouped among 5 distinct regions or sequence positions (4, 26, 193, 410, 555). In the MEMI data the 5 distinct regions are identified at sequence positions (4, 26, 188, 330, 412). For comparative purposes, sequence positions 193 and 188, and likewise 410 and 412, are assumed to be members of the same secondary structure. The paired relationships less than [...] for the two methods are listed in Table 4-1.

The quality of the information found in the MI pairs needs to reflect the number of indicated relationships between secondary structures. For the MI method, the MI pairs duplicate information about relationships between the same secondary structures. In Figure 4-3 the MI mutual information relationships between sequence positions are represented as a graph of the host secondary structures. The MEMI mutual information relationships are represented in Figure 4-4. To approximate a secondary structure, it is assumed that if two sequence positions with high mutual information are within 10 positions of each other, then those positions are grouped as one node in the graph. In the MI graph the six amino acid pairs of high mutual information reflect relationships between four different secondary structures. In the MEMI graph the five regions of interest share common mutual information across multiple nodes or secondary structures.

In each method a high number of the identified amino acid pairs are close sequence position neighbors of other identified amino acid pairs. For example, (410,554), (412,555) and (410,555) indicate one target area of mutation. This could represent an initial mutation at position 555, which then required a compensating near-contact mutation at position 412. The two
secondary structures also compensated for the mutation of 555 by a local mutation at 554 and for the mutation at 412 by a local mutation at 410. When calculating mutual information and the joint entropy between these different pair combinations, it appears that (410,554) and (410,555) are co-evolving pairs. This is misleading if the mutations at 410 and 554 occurred primarily to support the primary co-evolving mutation at (412,555).

This same associative effect can be seen in MI scores between three secondary structures. In the MEMI data, sequence positions (26, 412, 188) have a shared relationship that forms a cycle. This could be explained if the initial mutation occurred at sequence position 26, which resulted in compensating mutations at 412 and 188. The high MI score between 412 and 188 may not be informative, as it does not reflect a true co-evolving pair with functional or structural significance and may be the result of an associative property shared between (26,412) and (26,188). In both of these examples the amino acid pairs are less than [...] difficult to select amino acid pairs that are not co-evolving. Where this is significant is in determining what constitutes a false positive. If an amino acid pair with high MI score [...] apart and a local mutation in the secondary structure occurs to compensate, this local mutation with a high MI [...] is possible to filter these secondary relationships, which will improve the predictive quality of using mutual information to detect co-evolving pairs. To simplify the analysis or measurement a [...]

To better understand the impact of the overall mutual information found in protein family PF04055.9, the scores for every pair combination are graphed in Figure 4-5 and Figure 4-6. The
[...] the Y axis the corresponding MI score. Each sequence in a protein family will have a series of inserts, which allows for optimal global alignment in the family. When calculating the MI for a column pair, if a column contains a total number of inserts greater than 20% of the number of rows in that column, no MI score was calculated. PF04055.9 contains 4,582 sequences and has an overall sequence length, allowing for inserts, of 580. Both graphs follow the same general outline or shape, with the major difference being the mean score for each. The average mutual information for the MI method is 0.3 with a standard deviation of 0.15. The average mutual information for the MEMI data is 0.17 with a standard deviation of 0.067. Higher MI scores represent more information, as opposed to low MI scores, which are associated with random occurrence. The difficulty is separating the information signal from the random noise. In this particular example, the reduction in the mutual information average from 0.3 to 0.17 represents an information contribution of 0.13 from the phylogenetic effect. The high mutual information values found at the peaks in the graph typically represent high mutual information between pairs closer than 10 sequence positions. Mutual information scores with values greater than two standard deviations from the mean (Z=2) are statistically significant, with particular interest placed on amino acid pairs with MI scores four standard deviations from the mean (Z>=4). The reduction in information associated with the phylogenetic effect allows the MEMI method to detect a better signal (less noise) associated with information indicating co-evolving pairs.

Mutual Information in Pfam

With two different methods of calculating mutual information and one test case that shows MEMI yields better results, the next step is to measure results across a broad sample of test cases.
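The column filter and Z score selection described for PF04055.9 can be illustrated with a short sketch. It is a sketch under stated assumptions rather than the code used in this work: the function names are hypothetical, the gap characters '-' and '.' are assumed, and the population standard deviation is used because the text does not say which estimator was applied.

```python
import statistics


def gap_fraction(column, gap_chars="-."):
    """Fraction of rows in an alignment column that are inserts (gaps)."""
    return sum(ch in gap_chars for ch in column) / len(column)


def significant_pairs(mi_scores, z_min=4.0, min_separation=10):
    """Select column pairs whose MI lies at least z_min standard deviations above the mean.

    mi_scores maps (i, j) column index pairs to an MI score. Only pairs more
    than min_separation sequence positions apart are returned, as (pair, z)
    tuples sorted by descending Z score.
    """
    values = list(mi_scores.values())
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)  # assumption: population standard deviation
    if stdev == 0:
        return []
    hits = []
    for (i, j), score in mi_scores.items():
        z = (score - mean) / stdev
        if z >= z_min and abs(i - j) > min_separation:
            hits.append(((i, j), z))
    return sorted(hits, key=lambda item: item[1], reverse=True)


# Toy data: a flat background plus two high-scoring pairs, one of them too close in sequence.
scores = {(i, i + 20): 0.1 for i in range(100)}
scores[(3, 200)] = 2.0
scores[(50, 55)] = 2.0
print(significant_pairs(scores))
print(gap_fraction("AC--G") > 0.20)  # such a column would be skipped
```

In the toy data only the pair (3, 200) is returned, because (50, 55) has an equally high score but the two positions are fewer than 10 apart.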
Pfam is based on Hidden Markov Models, where the sequence data used to train each model is hand selected and remains constant over various iterations of Pfam releases. With proven models to select members of a particular protein family, the models are used against the Swiss-Prot
and SP TrEMBL databases to assign sequences to the appropriate Pfam family. The Pfam seed data set used for training the models is typically a small subset of the sequences in the family. A Pfam family full data set ranges in size from fewer than 10 to more than 10,000 sequences. Of the 8,183 protein families in Pfam 19.0, 2,765 families have one or more referenced PDB structures.

The Pfam data can be downloaded from http://pfam.wustl.edu as a flat file. For the Pfam full data set the file size is over 2 GB. The data is annotated in the STOCKHOLM format and provides references to numerous data types and external data sources. To facilitate better organization, an XML version of a Pfam family was developed and the data converted to an XML equivalent in which one protein family occupies one file. The 2,765 families that contained at least one PDB association were identified and became the test data set for analyzing high mutual information scores for co-evolving pairs. The phylogenetic tree used for each protein family was provided with the Pfam data set and is assumed to be optimal. An additional constraint placed on the Pfam families was that the number of sequences in the family should be greater than 100 and less than 5000 (to reduce computation time).

Once mutual information is calculated for a family, the referenced PDB definitions are used to determine the actual 3D distance between amino acid pairs, measured between the closest non-Hydrogen atoms. The PDB files are available in a flat file format, mmCIF, or as structured data in an XML format called PDBML/XML. The XML version was used for the referenced PDB data. For a PDB model to be used as a reference, it was required that the Pfam sequence align with the PDB sequence by at least 90%. Where amino acids in the Pfam sequence did not align, their 3D coordinates are based on the interpolated point average of the closest neighbors that do have sequence alignment.
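The distance measurement between the closest non-Hydrogen atoms can be sketched as below. This is a minimal illustration, assuming each residue has already been parsed into a list of (element, (x, y, z)) tuples with coordinates in angstroms. The function names and the input layout are hypothetical, and parsing of PDBML/XML files is not shown.

```python
import math
from itertools import product


def min_heavy_atom_distance(residue_a, residue_b):
    """Smallest distance in angstroms between non-hydrogen atoms of two residues."""
    heavy_a = [xyz for element, xyz in residue_a if element.upper() != "H"]
    heavy_b = [xyz for element, xyz in residue_b if element.upper() != "H"]
    # min() raises ValueError if either residue has no non-hydrogen atoms.
    return min(math.dist(p, q) for p, q in product(heavy_a, heavy_b))


def average_distance(residue_pairs, max_structures=5):
    """Average the pair distance over the first max_structures PDB structures.

    residue_pairs holds one (residue_a, residue_b) tuple per structure.
    """
    used = residue_pairs[:max_structures]
    return sum(min_heavy_atom_distance(a, b) for a, b in used) / len(used)


# Toy residues: the hydrogen atoms lie closest together but are ignored.
res_a = [("N", (0.0, 0.0, 0.0)), ("H", (0.0, 0.0, 1.0)), ("C", (1.0, 0.0, 0.0))]
res_b = [("O", (4.0, 0.0, 0.0)), ("H", (1.5, 0.0, 0.0))]
print(min_heavy_atom_distance(res_a, res_b))
```

The cap of five structures in average_distance follows the limit on representative PDB structures per sequence used in this chapter.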
Initial work on aligning Pfam sequence data with the PDB data revealed significant data quality issues. The offsets provided in the Pfam reference data did not provide sequence alignment, and the defined sequence in the PDB data did not match the amino acid data at the atomic level. To minimize measurement errors, the PDB sequence used for alignment and indexing is derived from the atom structures in the PDB data.

Starting with the 2,765 Pfam families and filtering on the number of sequences being greater than 100 and less than 5000 results in 1,777 Pfam families. Requiring at least one sequence to have 90% alignment with the referenced PDB reduces the test set to 783 families. If a protein sequence has multiple representative PDB structures, then the sequence was limited to the first five PDB structures when calculating the average 3D distance between two sequence positions, to reduce overall computational time. The average distance across all PDB references is also calculated to represent the average distance between two sequence positions in the Pfam family.

Calculating mutual information using the MI method against the Pfam full data set for the 783 families takes approximately 12 hours. When the algorithm that determines the mutations from the phylogenetic tree is added, the calculations take sixty times longer. The data to be analyzed translates well into a collection of parallel jobs, and all processing was done on a 24-node cluster.

The primary focus is on mutual information scores at least four standard deviations above the mean (Z>=4) with a sequence distance between pairs greater than 10. Of the 783 families used to validate the use of mutual information, 225 families for the MI method and 240 families for the MEMI method had at least one pair that met the filter
[...] calculated for each family. This average prediction percentage represents the likelihood that, if we use the same approach on Pfam families that do not have solved PDB models, the MI scores would represent co-evolving pairs. The MI group with Z>=4 has an average percentage of 46.7% and the MEMI group with Z>=4 has 51.8%, indicating that the MEMI algorithm does a slightly better job of finding co-evolving pairs. Graphing each approach yields interesting results: a high number of successes at 100% and an equally high number below 30%. Figure 4-8 is a histogram comparing the two methods grouped by Z score and the number of protein families by percentage of co-evolving pairs predicted. The average results are listed in Table 4-2.

There are numerous attributes associated with the data, and by clustering the different dimensions the goal is to detect additional filter criteria that can be used to increase the quality of the data. In Figure 4-7 the number of MI scores where Z>=2 is on the X axis and the percentage of MI scores where Z> [...] Including an additional filter or constraint on the overall number of MI scores, namely that the number where Z>=2 is less than 500, increases the Z>=4 percentage to 56.2% for the MI group. It was determined that if the average number of mutation events between co-evolving pairs for a protein family was less than 40, the prediction of co-evolving pairs was poor. For the MEMI group, the results are filtered so that the number of MI scores where Z>=2 is less than 500 and the average number of mutations between co-evolving pairs is greater than 40, resulting in an improved prediction accuracy of 81.3%. Using the filter criteria, the accuracy of the MI and MEMI methods is compared in Figure 4-8 and Table 4-2. The MEMI average prediction accuracy improved through a reduction in the number of low-scoring families for both the Z>=4 and Z=3 groups. In section 3.1, the definition
of a [...] as undetermined or requiring further classification. The average percentage of predicted co [...] for the MI group is 26% and for the MEMI group is 10.9%. The MEMI method performs better in predicting a lower number of defined false positives, with the potential to improve the overall prediction accuracy by approximately 7% if the predicted co-evolvi [...] properly. Improvement in the reduction of false positives is also possible if pairs with distance > [...] evolving pair can be detected.

Analysis

The MEMI method of re-sampling sequence data based on mutation events along the phylogenetic tree is an effective approach to improving the quality of predicted co-evolving pairs. With an effective strategy identified, it is important to identify weaknesses or potential areas of improvement. The predictive accuracy of the MEMI method can be further improved by properly classifying amino acid pairs that have shared mutual information from local structure mutations. Eliminating amino acid pairs where the pair is part of a cycle between three nodes, and where one segment in the cycle is less than 10 sequence positions apart, will reduce the number of false positives. This will affect amino acid pairs with MI scores that fall in the Z>=4 and Z=3 categories. The biggest gain in the general use of the MEMI method will be an improvement in the prediction accuracy of MI scores that have a Z score of 3, which currently applies to approximately 46% of the Pfam families.

The information content associated with co-evolution reflects shared physio-chemical properties that maintain protein structure or function. The log base N problem defines the scaling of the MI
score based on the number of possible symbols that can exist in the sample set. The use of log base 2 is standard because it represents data in units of bits and is appropriate when each symbol in a column can be selected from the same data set. When a sequence position is constrained by physio-chemical properties to a small set, the overall mutual information score will be minimized. For example, the positive class of amino acids consists of (H,K,R) and the negative class of (D,E) (Livingstone and Barton 1993). The maximum value of MI is limited to the minimum column entropy (Martin, Gloor et al. 2005). If an amino acid position is classified as positive and is represented by 33% H, 33% K and 33% R, the assumption is that H, K and R are equally random occurrences constrained by a positive amino acid type. The entropy for this position is 1.58, compared to an entropy of 3.2 for a small amino acid position that has random occurrence of nine amino acids. Both are equally random when constrained by amino acid type. If the positive amino acid and the small amino acid were perfectly correlated, so that the conditional entropy of the positive position given the small position is 0, then the MI score would be 1.58, the minimum entropy of the two columns. The MI score of 1.58 may not be statistically significant when compared to other co-evolving pairs and would not be detected. Classifying a sequence position based on its physio-chemical properties and scaling the entropy and joint entropy by the appropriate log base N for the sequence position classifier may improve the detection of co-evolving pairs. The challenge is determining the appropriate classifier based on the observed mutation transitions in the co-evolving pair and testing the fitness of this approach.

Interpreting information from co-evolving pairs as it relates to function or structure in a protein can be aided by mapping the information relationships onto a ribbon model representation of the protein structure for further review.
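The effect of the log base on the entropy values quoted above can be checked with a small worked example. The sketch below assumes uniform occurrence within each class and uses a nine-letter set as a stand-in for the small class; the exact membership of that set is an assumption made for illustration, as are the function and variable names.

```python
import math
from collections import Counter


def entropy(symbols, base=2):
    """Shannon entropy of a sequence of symbols, using logarithms of the given base."""
    counts = Counter(symbols)
    total = len(symbols)
    return -sum((n / total) * math.log(n / total, base) for n in counts.values())


positive = "HKR"     # positive class, each residue equally likely
small = "ACDGNPSTV"  # nine-residue stand-in for the small class (assumed membership)

# Log base 2: the two positions receive very different entropies.
print(round(entropy(positive), 2), round(entropy(small), 2))

# Log base N, with N the class size: both positions are seen as equally random.
print(round(entropy(positive, base=3), 2), round(entropy(small, base=9), 2))
```

With log base 2 the two positions score roughly 1.58 and 3.17 bits (the 3.2 quoted above), while scaling each by its own class size brings both to 1.0, which is the sense in which both are equally random when constrained by amino acid type.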
An example ribbon model mapped with
mutual information for PF00014.13 and PDB model 1kun is presented in Figure 4-9. The mutual information between co-evolving pairs is indicated by connecting black lines.

Figure 4-1. Mutual information, log base 2. Axis: Angstroms (0 to 30). Pairs shown: (2,26) (26,410) (2,412) (4,26) (2,410) (26,412) (410,554) (26,555) (412,555) (4,410) (410,555) (26,163) (0,26) (25,410) (4,555) (8,412) (1,26) (26,161) (2,193) (3,410) (8,189) (3,26) (2,555)

Figure 4-2. Mutual information, log base 2, with phylogenetic effect reduced. Axis: Angstroms (0 to 30). Pairs shown: (188,412) (8,189) (8,414) (8,412) (189,414) (188,414) (25,412) (189,412) (26,412) (2,412) (2,193) (330,412) (26,188) (1,412) (190,412) (5,188) (188,410) (26,190) (188,330) (26,410) (3,188) (25,410) (1,26)
Figure 4-3. MI clusters (nodes: 4, 26, 193, 410, 555)

Figure 4-4. MEMI clusters (nodes: 4, 26, 188, 330, 412)

Figure 4-5. MI for PF04055.9

Figure 4-6. MEMI for PF04055.9

Figure 4-7. [...] where Z>=2

Figure 4-8. Percentage accuracy, Z>=4 and MI count < 500
Figure 4-9. Ribbon model with MI in PF00014.13

Table 4-1. Clustering of co-evolved pairs: MI pairs < 12 and > 10 sequence positions apart

MI           MEMI
(4,26)       (188,412)
(410,554)    (8,189)
(412,555)    (8,414)
(410,555)    (8,412)
(8,412)      (189,414)
(2,193)      (188,414)
(8,189)      (25,412)
(3,26)       (189,412)
(2,25)       (2,193)
             (330,412)
             (26,188)
             (190,412)
             (5,188)
             (188,410)
             (188,330)
             (3,188)
Table 4-2. Number of MI scores < 500 and MEMI mutation count > 40

         MI                 MEMI
         %<12A    %Pfam     %<12A    %Pfam
Z>=4     56.2     18.3      81.3     15.8
Z=3      42.6     55.8      56.4     46.4
Z=2      33.2     79.7      36.9     71.6
CHAPTER 5 PROTEIN CORRELOGO A VISUAL REPRESENTATION OF MUTUAL INFORMATION RELATIONSHIPS IN PFAM

To understand the functional elements of a protein structure, biologists use domain-specific 3D viewers (PDB) that are written to process the coordinates of atoms that represent the protein structure solved using X-ray crystallography or NMR. The PDB viewers have been written to capture specific or common features of interest to the researcher. With the explosion of protein sequence data, comparative studies and statistical analysis of data can indicate regions of interest in 3D models. Integrating statistical data into existing PDB viewers is difficult because the software is typically written to accomplish very specific functional goals and does not support exporting to a standard 3D format. In this thesis, PDB data is used to create X3D (VRML) PDB ribbon models that are augmented with statistically significant data and compared to an Information Rich Virtual Environment represented as a Protein CorreLogo X3D model.

A protein family (Pfam) represents multiple alignments of protein sequences where protein domains and the tertiary structures have evolutionarily conserved regions representing protein function. Various information properties of the protein family, tertiary attributes from a PDB structure and the location of ligand-binding pockets are combined to create a 3D immersive model. The multiple sequence alignment from the protein family is used to detect co-evolving amino acid pairs using mutual information. Co-evolving pairs are indicated as a column with color coding to represent the physio-chemical properties of each co-evolving amino acid combination. Additional visualizations along each axis include the 2D sequence logo, the degree of insert regions in the protein family and the surface accessibility of each amino acid for the referenced PDB sequence.
The Protein CorreLogo model is built using X3D (VRML), facilitating immersive viewing of complex data relationships and detected co-evolving pairs.
Two protein families are chosen as representative examples and are presented in the results section, which compares the Protein CorreLogo model with a PDB ribbon model showing the structural significance of co-evolving amino acid pairs predicted using mutual information. Pfam family PF00027.18 was chosen because it has a high number of co-evolving pairs that are located in proximity to the protein's ligand-binding pocket, an important functional region of the protein. Another example protein family, with SH3 domains that are involved in signal transduction related to cytoskeletal organization (PF00018.16), shows significant mutual information occurring between two pairs of amino acids that are in contact in the intertwined dimer structure but are on opposite ends of the tertiary structure.

CorreLogo

The Protein CorreLogo, shown in Figure 5-2, is an extension of CorreLogo, shown in Figure 5-1, for RNA and DNA alignments (Bindewald, Schneider et al. 2006) to protein families defined by Pfam (Bateman, Birney et al. 2002). The basis for CorreLogo is a 3D representation of a 2D sequence logo (Schneider and Stephens 1990), shown in Figure 5-3, to visualize sequence conservation and the information content of a multiple sequence alignment (Schneider and Stephens 1990). Numerous implementations exist to generate a sequence logo and have been thoroughly reviewed in (Bindewald, Schneider et al. 2006). The CorreLogo for RNA and DNA alignments shows, through its 3D visualization, valuable additional information compared to a traditional sequence logo. In the CorreLogo model, the addition of mutual information (Shannon 1948; Pierce 1980) relationships existing between two sequence positions is the key indicator for regions of interest in understanding the model. With a focus on RNA/DNA, the CorreLogo model represents the bases A, C, G and (T or U) mapped to four colors, showing the potential relationships found in significant mutual information pairs.
One of the visual innovations in the CorreLogo model is the
representation as a column of stacked colored segments, where the height of each segment is proportional to the contribution of each base pair to the overall mutual information score. The CorreLogo model integrates the standard 2D sequence logos along two sides of the square matrix and additional bar graphs to indicate the fraction of gap characters on the remaining two sides.

This section describes the creation of a CorreLogo for proteins and the integration of various data attributes associated with the protein family and corresponding PDB structure into one visual model. Expanding from a standard sequence logo to a 3D model allows additional visual indicators that reflect mutual information, spatial distance between amino acids, ligand-binding pockets and protein-protein interactions to be combined in one model. The 3D model can then be viewed in an immersive environment allowing dynamic real-time navigation to specific regions of interest. The 3D model can also be rendered as a 2D image focused on the informative elements or a specific area of interest in the Protein CorreLogo model.

Two protein families are presented where the Protein CorreLogo model is validated against a PDB ribbon model showing the structural significance of co-evolving pairs detected by mutual information. In the first example, a protein family with proteins that bind cyclic nucleotides (PF00027) (Korner, Sofia et al. 2003), the co-evolving pairs indicate strong association with the ligand-binding pocket regions. In the second example, a protein family with SH3 domains that are involved in signal transduction related to cytoskeletal organization (PF00018) (Kishan, Newcomer et al. 2001), two of the co-evolving pairs are distant in the tertiary structure of the protein but are in contact in the quaternary structure of the intertwined dimer.
Protein CorreLogo

The Protein CorreLogo model allows the visualization of various properties from a multiple sequence alignment in a protein family. The key data element is mutual information, as
this is available for families with and without a representative PDB structure for that family. The information content or correlation depicted between two columns in a multiple sequence alignment can serve as an indicator or marker for a region of interest in the protein family. These detected co-evolving amino acid pairs (Martin, Gloor et al. 2005) could indicate local mutations to preserve secondary structure, a ligand-binding site or a protein-protein interaction for a quaternary structure.

3D Model

The Protein CorreLogo models are built using a Java application that reads PDB structures and other statistical data related to the protein sequence and outputs the 3D model in X3D, the XML-based ISO standard that replaces VRML (http://www.web3D.org/x3d.html). Working with X3D as an XML DOM in Java makes creating 3D representations straightforward, with minimal programming effort on the part of the developer. This allows application development to focus on the organization and collection of statistical data and the creation of visual models in an abstract way. Once the models are rendered in X3D they can be viewed on multiple operating systems using a web browser or local X3D viewer.

The mutual information for a particular Pfam family is calculated, and then the model is built by selecting a representative protein sequence in the family and the corresponding PDB structure. An example model is shown in Figure 5-2 for protein family PF00025 (Pasqualato, Renault et al. 2002), sequence ARF1_HUMAN with accession number P84077, and PDB structure 1hur (Carlos Amor, Harrison et al. 1994). Because the sequence forms both sides of the data matrix, the matrix is a mirror image of the same data divided along the diagonal. The protein sequence forms the two perpendicular sides of the grid so that a corresponding mutual information pair from the protein family can be placed as a column at the intersection of the two sequence positions.
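The idea of emitting the model as an X3D document through an XML tree can be sketched as follows. This is a minimal illustration in Python rather than the Java application described above; the function names, the fixed red color, the unit column footprint and the example score of 2.5 are all assumptions made for the sketch. Only the X3D node names (Transform, Shape, Appearance, Material, Box) come from the X3D standard.

```python
import xml.etree.ElementTree as ET


def add_column(scene, x, z, height, rgb):
    """Append one box-shaped column standing on the grid at (x, z)."""
    # X3D boxes are centred on their origin, so lift by half the height.
    transform = ET.SubElement(scene, "Transform",
                              translation=f"{x} {height / 2} {z}")
    shape = ET.SubElement(transform, "Shape")
    appearance = ET.SubElement(shape, "Appearance")
    ET.SubElement(appearance, "Material",
                  diffuseColor=" ".join(str(c) for c in rgb))
    ET.SubElement(shape, "Box", size=f"1 {height} 1")
    return transform


def build_scene(mi_pairs):
    """mi_pairs: iterable of (i, j, score) for significant MI pairs."""
    x3d = ET.Element("X3D", profile="Interchange", version="3.2")
    scene = ET.SubElement(x3d, "Scene")
    for i, j, score in mi_pairs:
        # The matrix is symmetric, so mirror each pair across the diagonal.
        add_column(scene, i, j, score, (1.0, 0.0, 0.0))
        add_column(scene, j, i, score, (1.0, 0.0, 0.0))
    return x3d


# Hypothetical score for the pair of positions 153 and 139.
x3d_doc = build_scene([(153, 139, 2.5)])
x3d_text = ET.tostring(x3d_doc, encoding="unicode")
```

Writing `x3d_text` to a `.x3d` file should give a scene that an X3D viewer can open. A fuller model would add the sequence logos, bar graphs and binding pocket squares as further nodes in the same tree.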
A major difference between the CorreLogo and the Protein
CorreLogo is the addition of 3D structural distances and the indication of ligand-binding pockets. This creates a mapping problem, because the multiple sequence alignment for a protein family typically contains many regions of inserts to accommodate the overall optimal multiple sequence alignment. The protein sequence that is used as the base reference to build the model has all inserts removed to allow a one-to-one alignment with the sequence found in a representative PDB structure. This allows 3D coordinate data to be used to determine the distance between co-evolving pairs and ligands and to be mapped onto the Protein CorreLogo model. The reference sequence from the Pfam alignment is then mapped column by column to the corresponding sequence without any inserts. It is possible that multiple positions in the Pfam alignment map to the same position in the PDB sequence. These regions of inserts are indicated by the Pfam alignment gap bar, where for each insert position the height of the bar is increased by 1.

A grouping and color scheme was selected to illustrate the physio-chemical properties of the amino acids (Livingstone and Barton 1993). Each amino acid can exhibit multiple properties, which adds to the complexity of mapping an amino acid to a particular color. The mapping listed in Table 5-1 was used to simplify the overall color representation. This is only one of many possible groupings and color schemes, which could be user selectable in the model.

A 2D sequence logo, shown in Figure 5-2, is placed parallel to the grid corresponding to each sequence position. The uncertainty for each column in the Pfam alignment is calculated and, when the sequence logo is built, only the exact one-to-one mapping from the Pfam alignment to the PDB sequence is used.
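The column-by-column mapping described above can be sketched as follows. This is one possible reading, not the dissertation's code: it assumes gaps in the reference row are written as `-` or `.`, and that a gap column is attributed to the most recent residue. The function name and the toy alignment row are hypothetical.

```python
GAP_CHARS = "-."


def map_columns(aligned_ref):
    """Map each alignment column to an index in the ungapped reference.

    Returns (col_to_pos, insert_counts):
      col_to_pos[c]    -> ungapped index for column c; gap columns reuse the
                          index of the most recent residue (-1 before any)
      insert_counts[p] -> number of gap columns that follow residue p
    """
    col_to_pos = []
    insert_counts = []
    pos = -1
    for ch in aligned_ref:
        if ch in GAP_CHARS:
            if pos >= 0:
                insert_counts[pos] += 1  # one more unit on the gap bar
        else:
            pos += 1
            insert_counts.append(0)
        col_to_pos.append(pos)
    return col_to_pos, insert_counts


# Toy reference row: eight alignment columns, five residues.
cols, inserts = map_columns("MK--LV.A")
```

Here columns 1, 2 and 3 all map to residue 1, which illustrates how several alignment positions can share one position in the PDB sequence, and `inserts` holds the per-residue heights for the alignment gap bar.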
The grey bar along the top of the sequence logo represents the uncertainty for a sequence position, where the uncertainty H(i) is the Shannon entropy of the amino acid distribution at position i, scaled to the range 0 to 1. If the sequence position is completely conserved, that is, if only one amino acid value is found in that position, then H(i)=0. The bar for that sequence position would have the
maximum height and be colored according to the amino acid's physio-chemical properties. When a column is completely random or has maximum uncertainty then H(i)=1 and the bar height would be 0. In Figure 5-2 at marker A, the sequence logo indicates two sequence positions that are conserved with a positive (red) and negative (black) amino acid in line with a ligand-binding pocket region indicated by the purple square.

To identify amino acids that may be part of a ligand-binding pocket, each protein atom in the PDB structure is compared against all ligand atoms in the PDB structure, and if two atoms are within five angstroms of each other the amino acid is marked as being in or near a ligand-binding pocket. To illustrate this in a PDB ribbon model, shown in Figure 5-5, each amino acid that is within five angstroms of an atom belonging to a ligand is colored purple. For protein-protein interactions that form quaternary structures, where an atom is within five angstroms of an amino acid in a different protein structure, the amino acid is assigned a color corresponding to the specific protein structure. An example is illustrated in Figure 5-6, where the cyan color indicates that the two amino acids at the intersection of the color are contact neighbors with an amino acid in another protein structure. The example in Figure 5-6 shows a dimer protein structure where the two structures combine to form a stable quaternary structure. In Figure 5-4 an example Protein CorreLogo model shows a ligand and a protein-protein interaction forming a quaternary structure.

For the Protein CorreLogo model, each amino acid that is in or near a ligand-binding pocket results in a pair-wise comparison of the distance between each pair of binding pocket amino acids. If the two binding pocket amino acids being compared are closer than 16 angstroms, then the intersection of the two sequence positions is marked with the appropriate color.
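The five-angstrom test for binding pocket residues can be sketched as a brute-force distance search. The data layout (tuples of residue index and coordinates), the function name and the toy coordinates are assumptions for illustration; a real implementation would read the atoms from a PDB file, and treating "within five angstroms" as less than or equal to the cutoff is also an assumption.

```python
import math

CONTACT_CUTOFF = 5.0  # angstroms, the cutoff quoted in the text


def pocket_residues(protein_atoms, ligand_atoms, cutoff=CONTACT_CUTOFF):
    """Return sorted residue indices with an atom within cutoff of a ligand atom.

    protein_atoms: list of (residue_index, (x, y, z))
    ligand_atoms:  list of (x, y, z)
    """
    hits = set()
    for res_idx, p in protein_atoms:
        if res_idx in hits:
            continue  # residue already marked by another of its atoms
        if any(math.dist(p, q) <= cutoff for q in ligand_atoms):
            hits.add(res_idx)
    return sorted(hits)


# Toy coordinates: residues 10 and 12 lie near the ligand, residue 11 does not.
protein = [(10, (0.0, 0.0, 0.0)), (10, (1.5, 0.0, 0.0)),
           (11, (9.0, 0.0, 0.0)), (12, (3.0, 4.0, 1.0))]
ligand = [(3.0, 0.0, 0.0)]
pocket = pocket_residues(protein, ligand)
```

The same function could be reused for dimer contacts by passing the atoms of the second chain in place of the ligand atoms.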
The intensity of the intersecting square's color is proportional to the distance, for distances less than 16 angstroms. If two amino acids are eight angstroms apart then the intensity of the selected color is reduced to
50%. If two amino acids are in contact the color would be reduced to 0% intensity, or black. Along the center diagonal a binding pocket amino acid is compared to itself, which gives a distance of zero and is indicated by a black square. Ligand-binding pocket regions located away from the center diagonal are amino acids that are distant in the sequence but near neighbors in 3D space. This marking of ligand-binding pockets is meant to show areas of interest and depends on the data in the PDB structure being complete and accurate. In immersive mode, hovering the mouse pointer over a binding pocket marker shows the specific details about which amino acid atoms are closest and the distance between the atoms.

The Pfam data set contains pre-calculated surface accessibility data for each sequence, where zero indicates no surface accessibility and a value of nine is 100 percent accessible. This data is represented as a blue bar graph along the side of the data matrix. An additional grey bar graph is added along the remaining two sides of the data matrix to show the number of inserts in the referenced sequence for the protein family multiple sequence alignment. Each consecutive insert increases the height of the bar by one unit.

Each column located at an intersection point on the grid represents a statistically significant mutual information relationship between two sequence positions in the Pfam multiple sequence alignment. The range of values depends on the number of correlated mutations that occurred in the protein family and the quality of the overall alignment. A protein sequence of length 200 results in a 200 by 200 matrix of 40,000 mutual information scores (19,900 distinct pairs, because the matrix is symmetric). The mean is taken over all possible pair-wise combinations, and mutual information scores that are at least three standard deviations above the mean are included in the model.
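Two of the rules above, the distance-based color intensity and the three-standard-deviation cutoff, can be sketched as follows. The function names and the toy scores are hypothetical. The use of the population standard deviation, and the reading of "three standard deviations from the mean" as three or more above the mean, are assumptions.

```python
import statistics


def pocket_intensity(distance, max_distance=16.0):
    """Color intensity in [0, 1]; None when the pair is too far apart to mark."""
    if distance >= max_distance:
        return None
    return distance / max_distance


def significant_pairs(scores, z_cutoff=3.0):
    """Keep pairs whose score is at least z_cutoff standard deviations above the mean.

    scores: dict mapping (i, j) -> mutual information score
    """
    values = list(scores.values())
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    if sd == 0:
        return {}  # every score is identical, so nothing stands out
    return {pair: s for pair, s in scores.items()
            if (s - mean) / sd >= z_cutoff}


# Toy data: nineteen background pairs and one strongly scoring pair.
scores = {(i, i + 1): 0.1 for i in range(1, 20)}
scores[(226, 240)] = 2.0
kept = significant_pairs(scores)
```

With these toy values only the pair (226, 240) survives the cutoff, and `pocket_intensity(8.0)` gives 0.5, matching the 50% example in the text. Lowering `z_cutoff` admits more columns into the model.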
For some protein families this may result in fewer than ten statistically significant mutual information pairs. For other protein families it could exceed one
hundred mutual information pairs, which would make the visual model difficult to use. The cutoff for statistical significance can be adjusted to include an appropriate number of mutual information pairs for the model.

In Figure 5-4 an immersive view of a mutual information column is shown; the key data attributes are indicated by black arrows, which were added manually for reference purposes. The two sides of the column are mirrored on the opposite side of the column. Starting at the bottom is the average distance between amino acid pairs for all referenced PDB structures from the protein family, rounded up to the nearest integer. Next to the average distance is the standard deviation, which can indicate the degree of structural variability between the pairs in the protein family. The next block above, with a green background and black numbers, is the surface accessibility for the amino acid in the referenced sequence. The next block, with a white background and green text, is the secondary structure for that amino acid. The blue block with white letters is the amino acid index in the PDB sequence. The next green block with black letters is the corresponding amino acid for the referenced PDB sequence.

The mutual information score is a summation, which makes it possible to measure the percentage contribution of each amino acid pair in the multiple sequence alignment to the overall score. A particular amino acid pair or pairs may contribute significantly to the overall mutual information score. Each amino acid pair is mapped to the corresponding physio-chemical properties, and for each colored pair group the overall contribution is summed. The height of each colored block pair in the column is the overall percentage contribution for each pair group. Blocks are added in order of decreasing contribution until 95% of the mutual information score has been reached.
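The stacking of colored blocks can be sketched as grouping the per-pair terms of the mutual information sum by property group and keeping the largest groups until 95% is covered. The grouping follows Table 5-1; the function name, the input layout and the toy term values are hypothetical, and the sketch assumes every term is positive, which the individual terms of a mutual information sum need not be.

```python
from collections import defaultdict

# Property groups from Table 5-1, written with one-letter amino acid codes.
GROUPS = {
    "positive": "HKR",
    "negative": "DE",
    "polar": "NQS",
    "small": "P",
    "hydrophobic": "ACFGILMTVWY",
}
RESIDUE_GROUP = {aa: name for name, members in GROUPS.items() for aa in members}


def group_contributions(pair_terms, coverage=0.95):
    """pair_terms: dict mapping (aa_i, aa_j) -> that pair's term in the MI sum.

    Returns [(group_pair, fraction_of_total), ...], largest first, stopping
    once the running total reaches `coverage`.
    """
    totals = defaultdict(float)
    for (a, b), term in pair_terms.items():
        totals[(RESIDUE_GROUP[a], RESIDUE_GROUP[b])] += term
    grand_total = sum(totals.values())
    blocks, running = [], 0.0
    ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
    for group_pair, value in ranked:
        if running >= coverage:
            break
        fraction = value / grand_total
        blocks.append((group_pair, fraction))
        running += fraction
    return blocks


# Hypothetical per-pair terms for one column.
terms = {("L", "V"): 0.5, ("I", "L"): 0.1, ("D", "K"): 0.3,
         ("N", "P"): 0.06, ("S", "H"): 0.04}
blocks = group_contributions(terms)
```

In this toy column the hydrophobic-hydrophobic and negative-positive groups dominate, and the smallest group is dropped because the first three already cover more than 95% of the score.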
The remaining 5% of block pairs are not included as they typically contain a large number of
amino acid combinations that could be viewed as random or not significant. In Figure 5-4 the column representing sequence positions 153 and 139 shows strong contributions to the mutual information score from hydrophobic (brown)-hydrophobic (brown) amino acid pairs and from negative (black)-positive (red) amino acid pairs.

Model for PFAM PF00027.18 and PDB 1ne4 A

In Figure 5-5 the CorreLogo model for a protein family with proteins that bind cyclic nucleotides, Pfam PF00027.18 (Korner, Sofia et al. 2003) (2890 sequences), is compared to the 3D ribbon structure for sequence accession P00514 (KAP0_BOVIN) and PDB structure 1ne4 A (Wu, Jones et al. 2004). Blue circles with letters indicate the correlation of key regions between the protein ribbon model and the Protein CorreLogo; in the PDB ribbon model each marker has two arrows pointing to the sequence positions in 3D space. The blue circles were added to this figure manually for reference purposes. In the 3D ribbon structure, black lines show mutual information between pairs that is four standard deviations from the mean and yellow lines indicate three standard deviations from the mean. These indicators of mutual information are programmatically added when the X3D ribbon model is created. In the 3D ribbon structure a mutual information pair is only shown with a connecting line if the two amino acids have a sequence distance greater than 10 positions. If an amino acid is within five angstroms of a ligand atom then it is colored purple to indicate a binding pocket region. In the Protein CorreLogo model the purple regions indicate the binding pockets for the ligand RP-adenosine. The protein family PF00027.18 is a sub-sequence of the PDB sequence defined by 1ne4 and is indicated by the white grid, versus the grey grid for the entire sequence. The family has two mutual information pairs with a score that is four standard deviations from the mean (Z>=4)
between sequence positions (226,240) and (194,240), marked with A and G in Figure 5-5. The markers are used to show relationships between two different models of the same basic data. In protein families where protein structures do not exist, the value of the Protein CorreLogo model is that it can indicate amino acids or areas of research interest. The average distance measured against 12 PDB structures in PF00027.18 is 3.7 angstroms with a standard deviation of 0.8 for (226,240), and 5.7 angstroms with a standard deviation of 1.7 for (194,240). The co-evolving pairs that are predicted using mutual information have no input from PDB data. Only when the locations of predicted co-evolving pairs are merged on the grid with binding pocket regions is it indicated that a relationship may exist. For those co-evolving pairs that do not have both amino acids in the ligand-binding pocket, marked C, D, E, F and G, the ribbon model suggests that some structure-related support function is taking place.

Model for PFAM PF00018.16 and PDB 1i07 A B

In Figure 5-6 the Protein CorreLogo model for a protein family with SH3 domains that are involved in signal transduction related to cytoskeletal organization, Pfam PF00018.16 (Mayer 2001) (3373 sequences), is compared to the 3D ribbon structure for sequence accession Q08509 (EPS8_MOUSE) and PDB structure 1i07 A B (Kishan, Newcomer et al. 2001). Blue circles with letters indicate the correlation of key regions between the protein ribbon model and the Protein CorreLogo, pointing to the sequence positions in 3D space. The blue and red circles were added to this figure manually for reference purposes. This particular example shows the quaternary structure as a dimer and the binding locations where the amino acids from the two structures are within 5 angstroms of each other, as indicated by the cyan colored regions. The mutual information for each detected co-evolving pair is represented by a column in the CorreLogo model and
connecting yellow and black lines in the ribbon model where the difference between sequence positions is greater than 10. In this model the sequence positions (8, 52), marker B, have an average distance of 6 angstroms across 90 representative PDB structures and a standard deviation of 8.6, which indicates an unexpectedly high degree of variation in 3D distance. The same type of distance variability also occurs between sequence positions (17, 52), marker A, with an average distance of 11 angstroms and a standard deviation of 7.9. The large regions of cyan colored squares in the CorreLogo model indicate that this protein is an intertwined dimer. The ribbon model shows the mutual information connections for markers A and B spanning opposite ends of the single-sequence structure. For the PDB structure 1i07 A, which is the basis for this CorreLogo model, the distance between (8, 52) is 27 angstroms and between (17, 52) is 18 angstroms. The large distance between the predicted co-evolving pairs is unexpected. When the distance between the amino acid pairs is examined in relation to the dimer structure formed by the two protein sequences A and B, the mutual information relationship is revealed. The distance between the two structures in the dimer for the pair (8 Sequence B, 52 Sequence A), marker D, is 4 angstroms and for (17 B, 52 A), marker C, is 10 angstroms. In this particular structure the co-evolving amino acid pairs found at (17, 52) and (8, 52) support the protein-protein interaction between the two structures in the intertwined dimer.

Protein CorreLogo Summary

The two example Protein CorreLogo models show amino acid pairs with strong mutual information, indicating potentially interesting characteristics of the protein structure. By using mutual information to detect potential co-evolving pairs in a protein family and integrating the data into the Protein CorreLogo and a representative PDB ribbon model when available, the
researcher has a simplified view of complex data relationships. The information content, or the reason for correlated mutations between two amino acids, is not obvious from the Protein CorreLogo model. The advantage of the Protein CorreLogo model is that it can convey important information about regions of potential functional interest to the researcher based only on sequence data. This is important when a PDB structure for a protein family has not been solved. The use of X3D as an XML-based 3D modeling language allows rapid development of visual models that can be easily extended by third-party applications or user-written applications.

Figure 5-1. RNA CorreLogo of 5S loop E region, RFAM RF00001
Figure 5-2. Protein CorreLogo PF00025 PDB 1HUR A. (Figure labels: surface accessibility; 10 sequence position grid; sequence logo; GDP binding pocket; 1hur B dimer binding; Pfam alignment gaps; 1hur sequence end; MI pair intersection columns; marker A.)

Figure 5-3. Sequence Logo

Figure 5-4. PF00025 1HUR. (Figure labels: surface accessibility; MI column pair; sequence positions; secondary structure; average 3D PDB distance between pairs; standard deviation of distance between pairs; physio-chemical properties for all column pairs; Pfam.)

Figure 5-5. PF00027.18 1NE4 A. (Figure labels: RP-adenosine binding pocket; markers A, B, C, D, E, F, G.)

Figure 5-6. PF00018.16 1I07 A B. (Figure labels: markers A, B.)
Table 5-1. Amino acid grouping and color

Property      Color    Amino Acids
Positive      Red      His, Lys, Arg
Negative      Black    Asp, Glu
Polar         White    Asn, Gln, Ser
Small         Yellow   Pro
Hydrophobic   Brown    Ala, Cys, Phe, Gly, Ile, Leu, Met, Thr, Val, Trp, Tyr
CHAPTER 6
MUTUAL INFORMATION TO PREDICT PROTEIN INTERACTIONS IN RETROVIRUSES

Protein-protein interactions in viruses have been notoriously difficult to study for a variety of reasons, including the high false positive results of genetic approaches such as yeast two-hybrid screening (Young 1998) and the false negative results of biochemical approaches (Phizicky, Bastiaens et al. 2003) when applied to membrane proteins. As the amount of biological data that is collected and stored in databases increases, bioinformatics will become an important tool to predict or detect protein interactions.

The use of mutual information to detect co-evolving sequence data is a long-running and actively researched topic (Schneider, Stormo et al. 1986; Korber, Farber et al. 1993; Clarke 1995; Pazos, Helmer Citterich et al. 1997; Atchley, Wollenberg et al. 2000; Pritchard, Bladon et al. 2001; Tillier and Lui 2003; Wu, Schiffer et al. 2003; Crooks 2004; Daub, Steuer et al. 2004; Hamilton, Burrage et al. 2004; Dimmic, Hubisz et al. 2005; Martin, Gloor et al. 2005; Fares and McNally 2006; Fares and Travers 2006; Inbal Halperin 2006; Wang, San Wong et al. 2006; Yi, Ma et al. 2007), where various methods are used to filter or correct for false positives to show agreement with known protein interactions. Mutual information depends on accurate probability distributions, and if the samples are selected from a limited set of sequence data then a bias is introduced towards the grouping of the samples (Atchley, Wollenberg et al. 2000). This phylogenetic bias contributes to a high number of false positives and prevents the detection of valid co-evolving sequence positions that arose in early ancestors (Pollock and Taylor 1997; Barker and Pagel 2005). The phylogenetic effect on sequence data when calculating mutual information has prevented it from becoming a widely used technique to predict protein-protein interactions.
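For reference, the standard mutual information calculation between two alignment columns can be sketched as the sum of the two column entropies minus their joint entropy. The function names and the toy columns are hypothetical; the use of base-2 logarithms and the absence of any gap handling or bias correction are simplifying assumptions.

```python
import math
from collections import Counter


def entropy(symbols):
    """Shannon entropy (bits) of a sequence of hashable symbols."""
    counts = Counter(symbols)
    n = len(symbols)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def mutual_information(col_i, col_j):
    """MI(i, j) = H(i) + H(j) - H(i, j) for two alignment columns."""
    pairs = list(zip(col_i, col_j))
    return entropy(col_i) + entropy(col_j) - entropy(pairs)


# Two toy column pairs taken from four aligned sequences.
covarying = mutual_information("AAKK", "DDEE")    # residues change together
independent = mutual_information("AAKK", "DEDE")  # no association
```

In the first pair the two columns always change together and the score is one bit; in the second the columns are independent and the score is zero. Because the probabilities are estimated by counting sequences, closely related sequences are counted repeatedly, which is the source of the phylogenetic bias discussed above.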
We proposed the Mutation Event Mutual Information (MEMI) method in Chapter 3, which calculates probability distributions by sampling along a phylogenetic tree for mutation events.
The phylogenetic tree for a collection of aligned sequence data is used as a template to build a mutation history for two sequence positions. If two children of a parent node in the tree each have the same amino acid then the parent node is assigned that amino acid. If the two children each have a different amino acid then an X, or not known, is assigned to the parent. The process continues from the leaves at the base of the tree up to the root. If a comparison is made between a node with an assigned amino acid and a sibling node with an X, then the children of the X node are searched for agreement with the sibling node currently being compared. If a match is found then the parent node is assigned that amino acid. Once a consensus tree has been determined for two sequence positions, the tree is descended along each node, and a mutation event in any node results in the sampling of the amino acid pair assigned to that node. The collection of amino acid pairs selected based on mutation events along the entire tree is then used to calculate the probability distributions that determine the mutual information for the two sequence positions. This process is repeated for all sequence positions as a pair-wise comparison. The advantage of this approach is that the phylogenetic influence of closely related sequence data can be eliminated, and mutation events that occur in early ancestors or along parallel evolutionary paths in the tree will be detected.

Once co-evolving pairs are predicted, numerous visual indicators of potential relationships can be constructed using network graphs or highlighted on solved PDB structures. The network graphs can play an informative role in understanding interactions for trans-membrane proteins and proteins with unsolved PDB structures, where little is known about which sequence positions may be exposed on a protein surface.
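The consensus-tree step and the sampling of mutation events can be sketched on a small binary tree. This is one possible reading of the description above, not the MEMI implementation from Chapter 3. In particular, the sketch assumes that a mutation event is a node whose fully assigned pair differs from the pair of its nearest fully assigned ancestor, and that the search below an X node looks at all of its descendants. The class and function names and the four-leaf tree are hypothetical.

```python
UNKNOWN = "X"


class Node:
    """Binary tree node holding one state per sequence position (i, j)."""

    def __init__(self, left=None, right=None, states=None):
        self.left, self.right = left, right
        self.states = list(states) if states else [UNKNOWN, UNKNOWN]

    def children(self):
        return [c for c in (self.left, self.right) if c is not None]


def seen_below(node, k, aa):
    """True if amino acid aa is assigned at position k anywhere below node."""
    return any(c.states[k] == aa or seen_below(c, k, aa)
               for c in node.children())


def assign(node, k):
    """Fill in internal states for position k, working from leaves to root."""
    kids = node.children()
    if not kids:
        return node.states[k]
    a, b = [assign(c, k) for c in kids]
    if a == b:
        state = a
    elif a == UNKNOWN and seen_below(kids[0], k, b):
        state = b  # the X child contains the sibling's amino acid
    elif b == UNKNOWN and seen_below(kids[1], k, a):
        state = a
    else:
        state = UNKNOWN
    node.states[k] = state
    return state


def sample_events(node, ancestor=None, out=None):
    """Collect pairs at nodes that differ from their nearest assigned ancestor."""
    if out is None:
        out = []
    if UNKNOWN not in node.states:
        if ancestor is not None and node.states != ancestor:
            out.append(tuple(node.states))
        ancestor = node.states
    for child in node.children():
        sample_events(child, ancestor, out)
    return out


# Four sequences; only the third carries the (K, E) pair.
l1, l2 = Node(states="AD"), Node(states="AD")
l3, l4 = Node(states="KE"), Node(states="AD")
n1, n2 = Node(l1, l2), Node(l3, l4)
root = Node(n1, n2)
for k in (0, 1):
    assign(root, k)
events = sample_events(root)
```

In this toy tree the three identical (A, D) sequences contribute no samples, and the single change to (K, E) is sampled once. That is the intended effect: counts reflect mutation events rather than the number of closely related sequences.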
The application of MI and the MEMI method is ideally suited to retroviruses, which have high mutation rates and small genomes that can be easily sequenced. Current wet lab research
techniques used to detect or validate point-specific protein interactions in viruses are still evolving. Developing and validating a bioinformatics approach to understanding protein topology that depends only on genomic sequence data can provide an alternative approach to understanding protein-protein interactions in viruses.

Virus Background

A virus consists of genetic material that is carried into a host cell by a protective protein shell called a capsid. Once inside a cell the virus begins replication by using the host cell to reproduce multiple copies of the virus genome. Infection continues as the reproduced virus genomes leave the host cell and infect other cells. Viruses can pass between hosts through direct contact or body fluids. Viruses are treated as non-living because they do not respond to stimuli in the environment and depend on a host cell for replication. The dependency on a host cell for replication makes the study of viruses in the laboratory difficult, which limits the understanding of how the virus infects the cell and replicates.

Viruses can be classified by the process that is used for replication. A DNA virus has a DNA genome and is therefore copied by DNA polymerases, which can proofread and correct mutations in the DNA strand. This results in a lower mutation rate, or a virus that has lower variation from generation to generation. In contrast, RNA viruses have very high mutation rates because the polymerases that copy RNA genomes generally lack this proofreading ability. If a mutation occurs in a region of the protein structure that has functional significance, then it may require a compensating mutation elsewhere to preserve function.
Hepatitis C, HIV, Influenza A, Dengue fever, Yellow fever, Measles, Rabies and Ebola are all examples of RNA viruses that, because of their catastrophic impact on the human population, are being aggressively researched to develop vaccines and treatments. The genomes of viruses are relatively small compared to those of bacteria or multicellular organisms, which makes sequencing of a
specific virus genome a straightforward process. This has allowed active research on virus genome sequence data, where specific variations or responses to therapy are compared to genomic differences between virus species or sub-types. With a large collection of publicly available RNA virus genome sequences, mutual information can be used to detect co-evolving pairs, which in turn can predict protein-protein interactions or an overall virus protein topology.

Because viruses mutate rapidly, the ability to sequence genomic data reliably, quickly and at low cost will play an increasingly important role in the study of viruses. In February 2004, the National Human Genome Research Institute issued a request for proposals to develop a system to sequence a single person's genome in less than a day for $1,000. Today it is possible to sequence up to 1,000 base pairs for $1. As the sequencing of genomic data for a specific strain of virus becomes routine, bioinformatics will play an important role in understanding the significance and purpose of virus mutations between genotypes.

The immune system is the body's natural defense and has traditionally been the first line of defense against disease and illness. Pathogens, such as rapidly mutating viruses, are constantly evolving in new ways to avoid detection by the immune system and to successfully infect their hosts.
An example of a virus that has evolved to escape detection by our immune system is HIV, which does so in such a unique way that it is referred to as a quasispecies (Wain-Hobson 1989). Because compensating mutations can be detected and sequence data can be collected easily, information theory and mutual information can become an important tool in the development of vaccines to help prevent the next global flu pandemic or a yet-to-emerge devastating virus.

Virus Protein Topology Models Predicted from Mutual Information

In the following chapters, protein-protein interaction models are built for Hepatitis C, HIV, Influenza A and Dengue fever using the standard method (MI) and the MEMI method to
calculate mutual information to predict co-evolving pairs. The information expressed as non-random mutations will indicate potential interactions between proteins, which can be used to infer a relationship between proteins or groups of proteins. The first assumption is that if a relationship exists between two proteins then those proteins form a positive or dependent relationship. It is proposed that anti-relationships also exist that prevent two proteins from interacting, to keep a protein surface open for a future interaction with a different protein. As the protein structures are expressed they originate from the same general cellular location, and for interactions to take place the two proteins must be in close proximity in 3D space at some point in time. If common interfaces are used to bind proteins then negative interfaces should exist to prevent protein binding, which could play an important role in reserving a protein surface for a protein structure that has not yet migrated to the same 3D space.

A significant challenge is validating or scoring a predicted protein-protein interaction model when very little is known about the retrovirus protein topology. The application of mutual information to the problem of detecting compensating mutations, which may be involved in protein interactions, is ideally suited for viruses but difficult to score quantitatively. The resulting models will only indicate protein interactions at sequence positions with a high mutation rate or high entropy, and cannot be used for highly conserved sequence positions, which are typically the sequence positions of interest in wet lab research. This makes it difficult to validate against published wet lab research a particular protein interaction between two sequence positions that appear random. Model validation is also difficult when two different methods, MI and MEMI, are proposed to calculate entropy and joint entropy.
One approach to scoring the quality of the MEMI and MI methods is to test whether the collection of predicted co-evolving sequence positions is non-random: when organized into a graph, it will have a
power law degree distribution, which indicates that a few nodes, or hubs, are connected to a large number of nodes that each have a single edge (Strogatz 2001; Albert and Barabási 2002; Girvan and Newman 2002; Barabasi and Oltvai 2004). The information expressed by a pair of co-evolving sequence positions does not provide any indication of the reason for the compensating mutation, but simply that some relationship exists between the two sequence positions. When these relationships are combined with other predicted informative relationships or mapped onto a representative PDB structure, it is possible to infer the purpose of the compensating mutations.

The following items summarize general or informative observations based on the predicted protein interaction network graph or specific examples of predicted co-evolving pairs mapped onto a PDB structure. Given that the predictions are made from what appears to be random data, or sequence positions with high entropy, any results that appear non-random based on other data attributes support the potential for informative relationships and should be investigated further.

The MEMI protein topology model for HCV shows strong correlation to the proposed structural organization of the HCV genome (Penin, Dubuisson et al. 2004), even though only 161 HCV genome sequences were used to determine probability distributions when calculating mutual information using the MEMI method.

The MEMI protein topology model for Influenza was constructed from 1818 genome sequences. The MEMI model predicts that Neuraminidase (NA) only interacts with other NA proteins on the virus surface and that Hemagglutinin (HA), also located on the virus surface, only interacts with other HA proteins. This clustering of HA groups and NA groups is supported by electron tomography models of Influenza (Harris, Cardone et al.
2006) In the MEMI model the top 100 scoring mutual information relationships indic ates no interactions between HA or NA, where in the MI model 11 of the top scoring co evolving sequences positions occur between HA and NA. The assumption is that the MEMI model is informative given that of the top 200 scoring mutual information pairs, non e exist between the HA and NA proteins where the algorithm has no knowledge of protein boundaries. The MEMI protein topology model for HIV accurately predicts the interface between gp120 and gp41 where the predicted co evolving sequence positions in gp120 are located at the interface occupied by CD4 in PDB 2nxy (Zhou, Xu et al. 2007) The MI method did
82 not predict an interface between gp120 and gp41 in the top 100 highest scoring mutual information pairs. The protei n topology models for Dengue fever were constructed from 160 genome sequences and the resulting MI model shows uniform distribution of interactions or appears non informative. The MEMI model shows potential informative relationships as a power law distribu tion of the network topology. In addition, an interesting relationship is indicated in the Envelope protein. The Envelope protein has a solved trimer structure 1THD where the three long protein structures are parallel forming a sheet, which combines with o ther trimers to form the virus capsid. When the sequence positions in E that shows interactions with other proteins are highlighted on the trimer they are clustered along a straight line. The significance is that the middle protein structure in the trimer is anti parallel to the other two structures so only one solution exists that would allow the same sequence positions in all three structures to be located along the same axis. The center E structure that is anti parallel is offset and rotated 180 degrees so that the sequence positions K122, K123, T120 and S229 can be found along the same axis and appears to be informative. Visual Representation of Informative Relationships For each virus, the predicted co evolving sequence positions using the MI and MEMI m ethod are used to construct a network graph where each sequence position is grouped in a node representing a protein When mutual information is calculated in the virus genome it is done by a pair wise comparison of each sequence position with no knowledge of gene or protein boundaries. One problem with this approach is that conserved sequence positions involved in protein interactions will not be included in the model because of the low entropy and the inability to infer relationships from compensating mut ations. 
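The pairwise calculation and the low-entropy limitation can be illustrated with a minimal sketch. It computes plain mutual information, I(A;B) = H(A) + H(B) - H(A,B), from raw column frequencies; it does not implement the mutation-event sampling of the MEMI method, and the function names and the four-sequence toy alignment are assumptions made purely for illustration.

```python
from collections import Counter
from math import log2


def entropy(symbols):
    """Shannon entropy (in bits) of a sequence of hashable symbols."""
    counts = Counter(symbols)
    total = len(symbols)
    return -sum((c / total) * log2(c / total) for c in counts.values())


def mutual_information(col_a, col_b):
    """I(A;B) = H(A) + H(B) - H(A,B) for two alignment columns."""
    joint = list(zip(col_a, col_b))
    return entropy(col_a) + entropy(col_b) - entropy(joint)


# Hypothetical toy alignment: one row per genome, one column per position.
alignment = ["AKT", "AKT", "ARS", "ARS"]
columns = list(zip(*alignment))

# Column 0 is fully conserved (zero entropy), so it can share no information.
mi_conserved = mutual_information(columns[0], columns[1])
# Columns 1 and 2 always change together, so they share one full bit.
mi_covarying = mutual_information(columns[1], columns[2])
```

In this toy case the conserved column yields zero mutual information with every other column, which is the reason conserved interface positions cannot appear in the models.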
The difficulty in validating the protein topologies predicted with mutual information is that very little is currently known about protein interactions in viruses, and what is known is generally limited to the existence of an interaction between two proteins rather than a known sequence position in one protein interacting with a specific sequence position in another. An additional limiting factor in validating whether a predicted co-evolving sequence position is part of a known protein interaction is that mutagenesis studies typically focus on conserved sequence positions in a virus genome, and regions with high mutation rates are not actively studied.

Viruses have fairly well defined genes that code for protein structures, which keeps the virus genome as small as possible. Because virus genomes are small, with minimal insertions or deletions, a large collection of virus genomes can be accurately aligned with ClustalW (Thompson, Higgins et al. 1994). The predicted co-evolving pairs that express the most information are mapped, when possible, onto a corresponding PDB structure as a surface map. Knowing the 3D structure of a protein in a virus genome is critical to the overall understanding of the virus protein topology and to developing potential targets for drug therapy. Despite this importance, the task of solving the 3D structure of critical proteins has proven difficult, and it is common to find that only one or two representative PDB structures have been solved for a specific virus genome. For PDB structures that do exist, a predicted sequence position involved in a mutual information pair is expected to be located on the surface of the protein if it takes part in a protein-protein interaction. Clusters of mutual information pairs that belong to the same gene but are distant from each other in the genome sequence are expected to share a common area on the protein surface.
CHAPTER 7
PREDICTING PROTEIN INTERACTIONS IN HEPATITIS C (HCV)

Hepatitis C virus (HCV) is a positive-sense single-stranded RNA virus of the Flaviviridae family and is the main cause of chronic liver disease in humans. Despite its small 9.5 kilobase genome, HCV is extremely effective at evading the immune system of its human host: after initial contact, 70% of individuals become persistently infected. Worldwide, there are an estimated 170 million carriers of HCV (Memon and Memon 2002), including 3 million in the USA, mostly infected with genotype 1 virus (Wasley and Alter 2000). Twenty percent will develop liver cirrhosis, and up to 2.5% of these patients will develop hepatocellular carcinoma. Since its discovery in 1989, HCV has been the focus of intense investigation, culminating recently in the development of an effective system for culturing infectious virus of genotype 2a in hepatoma cells. Nevertheless, the role of several individual proteins encoded by the virus, and the role of specific interactions among those proteins, remain mostly unknown.

Through its constant mutations, Hepatitis C is able to avoid detection by the immune system and infect the liver, often going undetected for many years under normal circumstances. With advances in genomic sequencing and structural biology, researchers around the world are closing the knowledge gap on the life cycle of viruses, which should lead to the rapid development of vaccines. To put things in perspective, however, Hepatitis C was discovered in 1989, and 18 years later a reliable preventive vaccine or cure has still not been developed.

Protein Topology Models

The HCV protein topology model was built from the 161 complete genome sequences available from the Los Alamos HCV Sequence Database (Kuiken, Yusim et al. 2005). The limited amount of sequence data can lead to errors in the probability distributions for each sequence position, which can generate false positives or cause a co-evolving pair to go undetected. The MEMI method was initially developed to predict contact pairs, or co-evolving pairs, that could then be used as initial solutions or for model scoring in tertiary structure prediction. One of the challenges was eliminating false positives, that is, detected co-evolving pairs whose two sequence positions were more than 12 angstroms apart in a PDB structure. In these specific cases the co-evolving pair relationship became important when the protein formed a stable structure as a trimer or a dimer.

The proteins and corresponding genome sequence positions are given in Figure 7-1, and a description of each protein is given in Table 7-1 (Chevaliez and Pawlotsky 2006). The HCV protein-protein interaction model showing the 100 highest scoring mutual information pairs is shown in Figure 7-2 for the MI method and in Figure 7-3 for the MEMI method. Each node in the network graph represents an amino acid connected by an edge to its corresponding co-evolving pair. The label on each edge is the number of mutation events determined using the MEMI method; the lower the number, the greater the likelihood of small-sample error. The graph is laid out automatically with the organic layout algorithm of the yEd graph editor, which minimizes the distance between nodes while keeping each node within the boundaries of its parent node, which represents a gene. The network graph is then adjusted by hand to prevent overlaps of nodes and lines. Where two proteins share multiple edges, the expectation is that they are showing a potential interface, and the resulting model gives the specific sequence positions that are interacting.
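Grouping sequence positions by gene amounts to a lookup against the protein boundaries. The sketch below is only an illustration of that lookup: it assumes that the numbers printed along Figure 7-1 (1, 192, 384, ..., 2421, with 3011 as the last residue) are the first polyprotein residue of each protein, and the names `PROTEIN_STARTS` and `locate` are hypothetical.

```python
from bisect import bisect_right

# Assumed reading of Figure 7-1: first polyprotein residue of each protein.
PROTEIN_STARTS = [
    ("Core", 1), ("E1", 192), ("E2", 384), ("p7", 747), ("NS2", 810),
    ("NS3", 1027), ("NS4A", 1658), ("NS4B", 1712), ("NS5A", 1973),
    ("NS5B", 2421),
]
POLYPROTEIN_END = 3011  # assumed last residue of NS5B


def locate(position):
    """Return (protein name, position within that protein), both 1-based."""
    if not 1 <= position <= POLYPROTEIN_END:
        raise ValueError(f"position {position} is outside the polyprotein")
    starts = [start for _, start in PROTEIN_STARTS]
    # Index of the last protein whose start is <= position.
    index = bisect_right(starts, position) - 1
    name, start = PROTEIN_STARTS[index]
    return name, position - start + 1


example = locate(2342)  # a hypothetical polyprotein position
```

Under these assumptions a pair of polyprotein positions becomes a pair of (protein, local position) labels, which is the form needed to place each amino acid inside the node of its parent gene.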
The ability to predict a general protein-protein interface is an important step in understanding the dynamics of how proteins interact, and the ability to reference a specific sequence position provides precise locations for drug targets and mutagenesis studies. For proteins that do not have a representative PDB structure, which is true for most trans-membrane proteins, the researcher can gain important insight into regions of the protein sequence that may be on the protein surface and interacting with other proteins.

A significant challenge is validating or scoring a predicted protein-protein interaction model when very little is known about the virus protein topology. The application of mutual information to the problem of detecting compensating mutations, which may be involved in protein interactions, is ideally suited to viruses but difficult to score quantitatively. The resulting models only indicate protein interactions at sequence positions with high entropy and cannot be used on highly conserved sequence positions, which are typically the positions of interest in wet lab research. This makes it difficult to validate, against published wet lab research, a particular protein interaction between two sequence positions that appear random. Model validation is also difficult when two different methods, as with MI and MEMI, are proposed to calculate entropy and joint entropy.
One approach to scoring the MEMI and MI methods rests on the expectation that the collection of predicted co-evolving amino acid sequence positions is not random and, when organized into a graph, will have a power law degree distribution, in which a few nodes (hubs) are connected to a large number of nodes that each have a single edge (Barabasi and Oltvai 2004).

Power Law Degree Distribution

To illustrate the degree distribution of the MI and MEMI methods, the top 100 mutual information pairs from HCV were connected in an undirected graph; the number of nodes versus the number of edges is shown in Figure 7-4 and Figure 7-5. The co-evolving sequence positions predicted for HCV by the MI and MEMI methods, shown in Figure 7-2 and Figure 7-3, have few sequence positions in common. The degree distribution for the MI method shown in Figure 7-4 has 8 nodes with 1 and 2 edges, which is a characteristic of a non-random network, and 3 nodes with 15, 16 and 17 edges, indicating a hub in the network. The MI model also has a number of nodes in the middle of the distribution with a high number of edges, which indicates random connections or potential false positives. For the MEMI method shown in Figure 7-5, the graph shows 30 nodes with only 1 connected edge and a fast decay in the number of nodes as the edge count increases. The MEMI method shows a few nodes with a high number of connected edges, which indicates a hub in the network. Comparing the two models, the MEMI method shows a close approximation to the power law distribution expected in biological networks.

Analysis of MEMI Model

In the MI model shown in Figure 7-2, the dominating feature is E1 sequence position T:66 interacting with NS5A, NS2, NS5B and NS4B. The E1 and E2 proteins are the key components of the HCV virion envelope and are responsible for viral entry and fusion (Bartosch, Dubuisson et al. 2003; Nielsen, Bassendine et al. 2004). E1 would not be expected to have a role as a regulatory protein that requires interaction with other protein sequences. The same general pattern of many shared interfaces between NS5A, NS2, NS5B and NS4B would indicate the need for multiple interactions with different protein sequences. It is possible that these are valid interfaces, but they generally do not agree with what is known about protein interactions in HCV as indicated by Figure 7-6 (Penin, Dubuisson et al. 2004). The challenge of building protein topology models from co-evolving pairs predicted using mutual information is that the nature of the information, or the reason for the non-random compensating mutations, is not given; the model shows only that information exists.
Inferring the purpose of an interaction therefore requires analysis in the context of other research knowledge, and the following should be viewed as a general discussion of possible reasons for the interactions when validating the MEMI model against the HCV topology model given in Figure 7-6.
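The edges-per-node tallies reported for Figures 7-4 and 7-5 can be reproduced from any list of predicted pairs with a short sketch such as the one below. The function name and the toy edge list are hypothetical, and each pair is assumed to appear only once in the list.

```python
from collections import Counter


def degree_distribution(edges):
    """Map each degree (edges per node) to the number of nodes with that degree.

    `edges` is an iterable of (node_a, node_b) pairs in an undirected graph.
    """
    degree = Counter()
    for node_a, node_b in edges:
        degree[node_a] += 1
        degree[node_b] += 1
    # Count how many nodes share each degree, sorted by degree.
    return dict(sorted(Counter(degree.values()).items()))


# Hypothetical toy graph: one hub with four partners plus one isolated pair.
toy_edges = [("hub", "a"), ("hub", "b"), ("hub", "c"), ("hub", "d"), ("e", "f")]
distribution = degree_distribution(toy_edges)
```

A distribution dominated by degree-1 nodes with a small number of high-degree hubs is the shape described in the text for the MEMI model, whereas many nodes of intermediate degree is the pattern flagged there as suggesting random connections.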
In Figure 7-3, NS4A does not show any relationships to other proteins, and this is attributed to the very low mutation rate in the NS4A sequence data. Mutual information can only be used to detect co-evolving amino acid pairs at sequence positions that have mutations, and it will only detect protein-protein interactions that involve co-evolving amino acid pairs. For the Core structure, which does have variance in its sequence, no co-evolving pairs were detected, which is not unreasonable given the function of the core as the viral capsid.

The structural topology given in Figure 7-6 can be compared to the MEMI model given in Figure 7-3. NS3 is shown as a neighbor of, and interacting with, NS4B and NS5A, which is consistent with the MEMI model. The MEMI model does not show NS3 interacting with NS5B, which is reasonable to infer given the 3D distance indicated by the HCV topology model. NS3 could be playing a protease role for NS4B and NS5A; it may also interact with NS5B but have had no reason for compensating mutations to preserve the interaction. NS2 shows minimal support for interaction with NS3, NS5A, NS4B and E1. P7 shows minimal interaction with E2; the two appear to be neighbors in the trans-membrane region. P7 also shows a single interaction with each of NS5A and NS5B. In the case of the information shared with NS5A and NS5B, given the orientation of P7 and the fact that the amino acids at positions A11 and L20 are detected as co-evolving pairs, the purpose could be to keep NS5A and NS5B from anchoring near P7. NS5A and NS5B could be from the HCV virus that is being expressed or from a competing HCV virus that is active in the region. NS5A shows strong interaction with NS5B and NS4B, which is also acceptable given that they are neighbors. In NS5A, sequence positions S370, P397, S383 and D402 show co-evolving pair relationships with both NS4B and NS5B.
Given the sequence proximity of these four amino acids in NS5A, this could be a key region for regulation. An alternative explanation could be that this region has strong affinity for NS5B, so that the co-evolving pair relationships with NS5B are positive and the co-evolving pair relationships with NS4B are negative, that is, meant to keep NS4B away from that protein docking region. A PDB structure for this region of NS5A does not exist, which highlights the value of this approach for determining potential protein-protein interactions. Even without complete PDB structures for either NS4B or NS5A, it is possible to infer relationships and, through mutagenesis, test specific amino acid positions in both genes for possible interactions. Given the current MI model, the sequence positions of interest are K42 for NS4B; S370, D402, P397, S383 and S101 for NS5A; and W571, A376, S377, D66 and R65 for NS5B.

Protein Surface Mutual Information Models

Figure 7-7 through Figure 7-9 provide surface structures that highlight the amino acids detected as having co-evolving pair relationships with other genes. The PDB structure for NS3 (1CU1) represents the entire NS3 sequence, NS5B (1GX6) is a near-complete sequence, and the structure for NS5A (1ZH1) is a fragment from sequence position 36 to 198. The anchor domain of NS5A has also been solved for sequence positions 1-31, and no co-evolving pair relationships are detected in this region. The region showing the most informative relationships runs from sequence position 370 to 402 and currently does not have a representative PDB structure. The use of mutual information gives specific regions and amino acids to study using mutagenesis, without the need for a PDB structure to help focus attention on amino acids that are on the protein surface.

In Figure 7-7 the blue surface amino acids (H110, V630) share a co-evolving pair relationship with NS4B (A51, K42), and the green surface amino acids (T178, Q580) with NS5A (S81, Y412 and S370).
K42 in NS4B shares a co-evolving pair relationship with both V630 and Q580 of NS3. If NS3 is acting as a protease for NS4B and NS5A, then NS3 should make efficient use of its structure to recognize the appropriate cleavage point for two different sequences. The symmetry of the green region being a right-handed turn versus a left-handed turn for the blue region, balanced about the center of NS3, could indicate the process by which NS3 is able to function as a protease against two different sequences. In the bottom structure of Figure 7-7 (opposite side), the red region (T449) shares a co-evolving pair relationship with L4 in NS4B, which is at the beginning of the NS4B sequence. Sequence position T449 also shares information with Y413 and R294 in NS5A, which are not part of the solved NS5A PDB structure. T449 is located on the opposite side of NS3 in Figure 7-7, which could be the entry point for the start of the protease process; this may indicate that it is important to have a relationship with the beginning of the two different sequences that are entering the protease process.

In Figure 7-8 the NS5A protein structure PDB 1ZH1 is shown with S101 interfacing with NS5B as an informative interface. The 1ZH1 PDB structure is a subsequence of NS5A, which makes it difficult to see the full extent of the shared information between the two structures. The NS5B structure shown in Figure 7-9 provides a clearer indication of the relationship between NS5A and NS5B. Sequence position S81 shares a relationship with NS3 (T178, T449), two positions that are in close proximity but on opposite sides of NS3, as shown in Figure 7-7. For NS5B, shown in Figure 7-9, the dominating interface based on co-evolving pairs is with NS5A sequence positions (S101, S370, S383). Sequence positions (R65, D66, A73) in NS5B share a structural symmetry with (A376, S377, E480), where the second grouping is flipped 180 degrees about the vertical axis, and the two groupings share the same co-evolving pair relationships with NS5A.
This symmetry is also shared by D66 and A376, which are sequence neighbors of R65 and S377 respectively and, more interestingly, share a set of co-evolving pairs with both NS5A and NS4B; for this reason the two amino acids are highlighted in blue.
To emphasize that this region of information is expressing a co-evolving relationship, both A376 and D66 are co-evolving with the same amino acid, K42 in NS4B. NS4B is a membrane protein in which sequence positions (N-75, 138-140, 191-C) are predicted to be exposed to the cytoplasm and would represent possible interaction points with other proteins (Tan 2006). Sequence position K42 is predicted to be in the cytoplasm and is shown to be interacting with NS5B at sequence positions D66 and A376, which are located at the top and bottom of the structure. The membrane anchor of NS5B is located at sequence positions 570-591 (Tan 2006), which indicates that the bottom region of NS5B as shown in Figure 7-9 should be oriented toward the cell wall for anchoring; this is also the region showing the most interaction with NS5A.

Based on the predicted interaction of NS5B with NS4B at K42, the purpose of the information relationship could be to provide an initial attraction point to the region if NS5B comes in contact with a section already occupied by NS4B. With sequence positions D66 and R65 on NS5B providing both a +/- amino acid, and the same relationship reflected in A376 and S377, the orientation and positioning of NS5B as it approaches NS4B would be critical in forming a positive or negative attraction with NS4B:K42. The interface could offer a steering influence that allows NS5B to find an open surface on the cell wall. The same steering relationship could also exist with NS5A, which is also anchored in the region and shows a dominant symmetrical interface, with a shared NS5A interface to the top and bottom regions of NS5B. The NS5A membrane anchoring region is mapped to the first 30 amino acids in the sequence (Brass, Bieck et al. 2002) and provides an indication of the orientation of the NS5A structure, which has been partially solved. The shared information sequence positions of NS5A (S101, E118, S370, S383, G391, S397) are assumed to be on the protein surface exposed to the cell interior.
The NS5B side of this informative interface (A376, S377, E455, G480) is the region that contains the NS5B membrane anchor, which should be oriented toward the cell wall. This informative relationship between NS5A and NS5B could provide steering or spatial adjustments that allow both to anchor properly to the cell wall.

An interface is also predicted from NS5B:C303 to P7:A11 and E2:S281, where all three sequence positions show co-evolving pair relationships. The regions P7:A11 and E2:281 are proposed to be located in neighboring structures in the ER membrane, so a potential co-evolving relationship between P7 and E2 is reasonable, but the need or reason for shared information with NS5B:C303 is unclear. The reason for an informative relationship between two co-evolving positions need not be universal across all sequences; it could be isolated to a particular subtype in which compensating mutations are required because of a significant change in protein structure. The pairing relationship between E2:281 and NS5B:C303 contains 1.842 bits of information, and the contribution to that information can be attributed to the co-evolving pairs found in Table 7-2. Understanding the relationship, or the information content shared between E2:281, NS5B:C303 and P7:A11, depends on other external attributes that can be shown to share potential relationships. If the informative co-evolving amino acid relationships can be isolated to a particular subtype, that could indicate the purpose of the compensating mutations.

In the examples given where a PDB structure exists, the location of sequence positions on a representative PDB structure can be used to infer potential relationships as a possible explanation of the information content shared between predicted co-evolving amino acids. If a PDB structure does not exist, another possible analysis of the information content is to examine the physicochemical properties of the paired amino acids as an indication of function or purpose.
In the case of E2:281, NS5B:C303 and P7:A11, which show a shared informative relationship and are known to exist structurally in the same general region, understanding the physicochemical properties of the compensating mutations may indicate the purpose of, or the reason for, the expressed information.

HCV Analysis

The MEMI predicted protein interaction model shows strong agreement with what is currently known about protein-protein interactions in HCV. The top 100 MEMI predicted protein-protein interaction sequence positions that can be mapped to a representative PDB structure are found on the surface of the protein. This is a positive indicator that they could be involved in protein interactions, in contrast to a group of predicted co-evolving sequence positions found buried in the protein structure. By associating predicted co-evolving sequence positions with their location on a protein structure and comparing them to other sequence positions predicted to be involved in compensating mutations, clear structural symmetry or patterns can be detected. In NS5B, the predicted co-evolving pairs form a structural symmetry of two interfaces flipped 180 degrees around the vertical axis and interacting with the same sequence positions in NS5A, which supports the view that the co-evolving pairs predicted with MEMI indicate informative relationships. Given that the MEMI model was based on 161 HCV genome sequences, the quality of the model appears to be good and would improve as more genome sequences become available.

Another possible approach to understanding targeted protein-protein interactions, one that minimizes the need for sequencing complete genomes, is to test specific protein-protein interfaces via mutagenesis by sequencing only the two proteins of interest and then using sequences from repeated experiments and multiple generations to calculate mutual information. This would eliminate the need to sequence the entire HCV genome, allowing for an increased number of sequence samples with the added knowledge of the phylogenetic tree based on when the sequences are sampled. The increased number of samples and an accurate mutation order would improve the quality of the protein interaction model between two proteins, and the technique can be easily used by HCV researchers in existing labs.

Another interesting approach to force large-scale probing of protein interactions is to use High Throughput Screening (HTS) on HCV with various drugs and then look in the sequence data for possible compensating mutations as an indication of a specific protein interaction. To validate the models, the same tests can be performed multiple times and the mutual information calculated for each test case separately. Any predicted protein interactions common to separate test cases would indicate that the protein interaction is not random and would allow for controlled testing and validation without the need for mutagenesis.

Figure 7-1. Hepatitis C virus genome. NS2 through NS5B are the non-structural proteins. (Genome map; labels in order: C, E1, E2, P7, NS2, NS3, 4A, 4B, 5A, 5B, 3' UTR; positions marked: 1, 192, 384, 747, 810, 1027, 1658, 1712, 1973, 2421, 3011.)
Figure 7-2. HCV top 100 pairs, MI method.

Figure 7-3. HCV top 100 pairs, MEMI method.

Figure 7-4. Edges per node for the top 100 pairs using the MI method in HCV. (Chart axes: Edges, 1 to 18; Nodes, 0 to 30.)

Figure 7-5. Edges per node for the top 100 pairs using the MEMI method in HCV. (Chart axes: Edges, 1 to 20; Nodes, 0 to 30.)

Figure 7-6. HCV topology (Penin, Dubuisson et al. 2004).

Figure 7-7. Predicted co-evolving pairs, NS3 PDB 1CU1. (Labels on the structure: NS4B; Q:580; T:178; NS2:C:69; T:449; V:630; H:110; K:42; A:51; L:4; NS5A; S:81; Y:413; R:294. Figure note: *opposite side of NS3.)

Figure 7-8. Predicted co-evolving pairs, NS5A PDB 1ZH1. (Labels on the structure: NS5B; Q:62; D:402 (not shown); S:81; NS3; T:178; T:449; E:116; A:376; S:101; S:377; W:571; R:65; E:455.)

Figure 7-9. Predicted co-evolving pairs, NS5B PDB 1GX6. (Labels on the structure: NS5A; R:65; D:66; A:246; A:376; S:377; G:480; C:303; NS4B:K:42; S:370; S:383; A:73; S:101; S:397; E:118; G:391; T:276; E:455; E2:S:281; P7:A:11. Figure notes: rotated 90 degrees; structural symmetry where the lower grouping is flipped 180 degrees; two regions share the same co-evolving relationships with NS5A (R65=S377) (D66=A376) (A73=E480).)
Table 7-1. HCV proteins and functions

HCV Protein    Function
Core           Nucleocapsid
E1             Envelope
E2             Envelope/Receptor binding
p7             Calcium ion channel
NS2            NS2-3 autoprotease
NS3            Component of NS2-3 and NS3-4A proteinases; NTPase/helicase
NS4A           NS3-4A proteinase cofactor
NS4B           Membrane web induction
NS5A           RNA replication by formation of replication complexes
NS5B           RNA-dependent RNA polymerase

Table 7-2. Information contribution between E2:281 and NS5B:C303

Amino acid pair    Percentage of Information
H A                .2405
Q S                .2405
Y T                .2405
S I                .1548
T C                .1047
S C                .019
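As a worked check on Table 7-2, the listed values can be read as fractions of the 1.842 bits reported in the text for the E2:281 and NS5B:C303 pairing. That reading is an assumption (the column is headed "Percentage of Information"), as is the interpretation of each row as one amino acid pair; the variable names are hypothetical.

```python
TOTAL_BITS = 1.842  # information reported for the E2:281 / NS5B:C303 pairing

# Table 7-2 values, read here as fractions of the total (an assumption).
FRACTIONS = {
    ("H", "A"): 0.2405,
    ("Q", "S"): 0.2405,
    ("Y", "T"): 0.2405,
    ("S", "I"): 0.1548,
    ("T", "C"): 0.1047,
    ("S", "C"): 0.019,
}

# The fractions account for all of the information in the pairing.
fraction_sum = sum(FRACTIONS.values())

# Contribution of each amino acid pair, expressed in bits.
bits_per_pair = {pair: frac * TOTAL_BITS for pair, frac in FRACTIONS.items()}
```

Under this reading the six listed pairs sum to 1.0, and each of the three largest contributors accounts for roughly 0.44 bits of the total.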
CHAPTER 8
PREDICTING PROTEIN INTERACTIONS IN HIV

Human immunodeficiency virus (HIV) is a retrovirus that can lead to AIDS, in which the immune system begins to fail, allowing a wide range of infections to occur that are often fatal. HIV infection in humans is pandemic: in the last quarter century nearly 65 million people were infected with HIV, and an estimated 25 million have died of AIDS-related illnesses. The estimated annual funding of HIV resources in 2005 was $8.3 billion, of which $630 million was allocated to finding a preventive vaccine (UNAIDS 2006). An additional impact of HIV and a weakened immune system is the ability of other infectious diseases, such as tuberculosis, to develop new drug-resistant strains in people infected with HIV (Iademarco and Castro 2003). HIV is very efficient in that it encodes multiple proteins from the same genome by using different reading frames. This allows the virus in its RNA form to remain small while still coding for a higher number of proteins from a shorter genomic sequence. The HIV sequence database is maintained by the Los Alamos National Laboratory (Leitner, Foley et al. 2005) and provides a variety of tools, including Gene Cutter, which is used to separate all gene sequences from the multiple reading frames in an HIV sequence.

Protein Topology Models

All available HIV genome sequences (624) were downloaded (October 2006) from the Los Alamos HIV database and submitted to Gene Cutter to extract each gene sequence from the multiple reading frames. Each gene sequence was then arranged in a fixed order with the corresponding genes from the same genomic sequence to form one hybrid, or flattened, genomic sequence. This allowed a full sequence representation of the HIV genome without the multiple reading frames.
This gave an aligned set of amino acid sequences that could then be used to calculate mutual information using the MI and MEMI methods. Quicktree (Howe, Bateman et al. 2002) was used to build a phylogenetic tree for the sequences, which was then used for the MEMI method. With 600+ sequences at a length of 3000+, it took approximately 30 hours to process the MEMI method on an AMD 64 1900. Once the mutual information pairs are identified in the multiple sequence alignment, they are programmatically mapped to a single reference sequence with no inserts. For this example the HIV sequence 97BL006_AF193275 was used as the reference sequence because it was first in the returned list of HIV sequences. The top scoring mutual information pairs are then mapped into a network graph and grouped by gene. Each sequence position pair is connected in the graph to indicate a predicted interaction or informative relationship. An organic layout algorithm is then used to minimize the edge distance between each connected segment.

The gene boundaries listed in Table 8-1 (Leitner, Foley et al. 2005) represent a hypothetical sequence without multiple reading frames, where Start bp defines the beginning position of the gene and End bp the corresponding end of the gene. Figure 8-1 gives the position of each gene; overlapping genes represent the use of multiple reading frames at the RNA level and show HIV's very efficient coding mechanism for increasing the number of proteins in a shorter genome. The impact is that a single RNA mutation in a shared reading frame may change the amino acid in two different proteins. This is a clear indicator of a co-evolving relationship between two proteins, but the information being expressed is only that the two sequence positions originate from the same reading frame. The multiple reading frames of HIV were detected as very high scoring informative relationships.
To filter for this effect, for each amino acid pair combination each occurrence of a sequence position was tracked as being in an overlapping protein region. If the sequence distance between the mutual information pairs was equal to a predetermined offset value, then the pair was marked as sharing a reading frame and eliminated.

The MI model shown in Figure 8-2 has limited interaction between proteins, which makes it difficult to judge the quality of the model. The mutual information models were built from 624 sequences that, if carefully selected from a global population, may result in accurate probability distributions for each sequence position, allowing the MI method to produce a quality model.

The one significant feature found in the MEMI model shown in Figure 8-3 is the interface between gp120 and gp41, where gp120 binds to gp41, which prevents the human immune response from recognizing the virus. The interface between gp120 and gp41 is a known drug target in the attempt to develop an HIV vaccine. When the virus needs to bind to the cell, gp120 changes structure quickly, exposing gp41 to the host cell. The gp120 protein structure will then bind to CD4, which is expressed on the surface of T cells, and is then used to gain entry to the host T cells. In the MEMI model POL RT shows interaction with multiple proteins. POL RT is responsible for transcribing the viral RNA into double-stranded DNA, so the informative relationships may be at the RNA level and not involved in protein interactions; they may instead represent interfaces to handle the multiple reading frames found in gp120, gp41, p6, POL Pro and REV exon2.

Power-law Degree Distribution

The degree distributions for the network graphs of the top 64 predicted co-evolving pairs in HIV using the MI and MEMI methods are shown in Figure 8-4 and Figure 8-5. The expectation is that the distribution will be non-random and exhibit a power-law distribution (Barabasi and Oltvai 2004). The multiple reading frames of HIV cause a high-scoring co-evolving relationship with all overlapping protein sequences, and these pairs are not included in the MI or MEMI model. Both models have a power-law distribution, indicating that both network graphs are non-random, but the quality of the models cannot be differentiated from the degree distribution.

Analysis of MEMI Model

The top 100 scoring mutual information pairs using the MEMI method are shown in Figure 8-6 to show a higher number of possible interactions between proteins. The relationship of POL RT was discussed earlier: E328 has a common relationship with many of the HIV proteins, but this is probably occurring at the RNA level as part of the critical steps of encoding RNA to DNA for cellular replication. REV exon2 sequence position V46 also shares common interfaces with multiple proteins, with local support from P45 and G40, which should all be found near the same protein surface. REV is responsible for binding to a Rev Response Element sequence to export mRNA out of the nucleus and into the cytoplasm before the host cell RNA splicing machinery has a chance to cut the HIV RNA. Once the RNA is exported from the nucleus it can then form structural proteins in the cytoplasm. The protein VIF at sequence position T69 also shows strong affinity for multiple HIV proteins. The role of VIF is to attack the human enzyme APOBEC3G, which is the host cell defense against virus infection (Strebel 2003). The reason for interaction with multiple proteins in HIV restricted to sequence position VIF:T69 is not clear. An interface is indicated between p24 at (Q4, T216, P122) and p6 at (R4, E6), which is supported by the approximate location of p6 forming a surface to contain a collection of p24 in the mature state of HIV (Figure 8-7). POL Integrase and POL RT also show support for a strong interface in the MEMI model and are neighbors in the mature HIV model shown in Figure 8-7.
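The edges-per-node counts of the kind plotted in Figure 8-4 and Figure 8-5 can be derived directly from a list of predicted pairs. The following is a minimal sketch, not the implementation used in this work: the function name `degree_distribution` and the node labels are illustrative, and the five pairs are only a small subset of relationships mentioned in this chapter, not the top 64 data set.

```python
from collections import Counter

def degree_distribution(pairs):
    """Map each degree (edges per node) to the number of nodes with that degree."""
    degree = Counter()
    for a, b in pairs:
        degree[a] += 1
        degree[b] += 1
    # Count how many nodes share each degree, ordered by degree.
    return dict(sorted(Counter(degree.values()).items()))

# Illustrative subset of pairs named in the text (labels are hypothetical keys).
pairs = [
    ("POL RT:E328", "gp120:G431"),
    ("POL RT:E328", "POL Int:280"),
    ("POL RT:E328", "POL Rnase:115"),
    ("VPR:R95", "TAT:D5"),
    ("TAT:R56", "REV exon1:D9"),
]

distribution = degree_distribution(pairs)
print(distribution)
```

A distribution dominated by degree-1 nodes with a few high-degree hubs is the shape the text describes as power-law-like; a formal fit would need far more than this toy subset.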
Mutual Information Analysis of Pre- and Post-Treatment Sequence Data

A direct application of using mutual information to detect compensating mutations as a result of drug treatments can help indicate key protein interfaces and alternative drug targets for drug-resistant strains. In work with Dr. Goodenow, who is the Stephany W. Holloway University Chair for AIDS Research at the University of Florida, sequence data of GAG and PRO were analyzed from pre- and post-treatment of HIV in children. The resulting models are shown in Figure 8-8 and Figure 8-9, where it is observed that after treatment with protease drugs the GAG-PRO interface was adapting to overcome the presence of the protease inhibitor. The challenges with this specific analysis are the limited number of sequences and that samples came from hourly increments in the same patients, which introduces phylogenetic errors when using traditional mutual information or other correlation models. Using the MEMI method in this type of narrow sequencing allows mutation events to be detected in closely related sequences, so that accurate mutual information models can be built.

In the pre-treatment interface shown in Figure 8-8, PRO (M38, Q60, I87, T93) interacts with GAG (C53, D65, C66, E100, E101, K115). In the post-treatment interface shown in Figure 8-9 the number of interface points doubles from 7 to 14, where PRO (P3, L12, G18, E23, V34, G51, Q63, G89, T98, L99) is interfacing with GAG (R24, Q26, C35, G39, F73, G75, K82, G86, S91, D120). It is difficult to infer anything concrete from the two models. The models do indicate that the pre-treatment interface was occurring between a low number of sequence positions with minimal mutation count, and that in the post-treatment sequence data the number of indicated compensating mutations doubled, with higher mutation counts, and the sequence positions involved changed.
The challenge for researchers is that the GAG protein structure is not known, which makes it difficult to understand the impact of compensating mutations on protein structure and alternative locations for drug targets. By sampling sequence data between specific known protein interfaces to detect compensating mutations as a result of drug interaction, a simple application of mutual information using the MEMI method can indicate compensating protein interfaces and targets for new drugs.

Protein Surface Mutual Information Models

The predicted gp120-gp41 interface is a known relationship and is a key component of the HIV virus's ability to infect a cell. In Figure 8-10 gp120 is shown binding to CD4, which is found on the cell surface of a T cell. This interface location is a common interface with gp41, where gp120 covers gp41 to prevent it from being recognized by the host immune system. The PDB structure for gp41 has not been solved as it is a trans-membrane protein, but a proposed model for the interactions of gp120 and gp41 with CD4 for host cell invasion is shown in Figure 8-11. In Figure 8-12 the informative relationships of co-evolving amino acids in gp120 with other proteins are shown. Sequence positions E91, S274 and R466 share a predicted interface with gp41, with strong structural support for this interface in gp120: the three sequence positions are in close proximity on the protein surface but are distant sequence positions in the genome. In the PDB 2NXY structure the three sequence positions are buried, with minimal exposure to the surface, when binding with CD4. In the binding mode with gp41 the gp120 protein structure adjusts to hide or cap gp41, and it may be in this binding structure that the three sequence positions are on the protein surface. This is a unique example, in that predicted sequence positions in protein-protein interactions tend to be on the protein surface when a reference PDB model exists. The 3D representations of protein structures are typically presented as fixed or static models, which is an artifact of crystallizing the structure as part of the X-ray crystallography process.
A protein is a dynamic structure that has flexibility and can present different protein surfaces depending on the protein interactions. Sequence position R469 in gp120 shows interfaces to VPR, POL Pro, POL Integrase, and REV exon2, and given the location of R469 on the protein surface, binding with CD4 could play a role in preventing other proteins from binding to the interface. An interface is also predicted to exist between REV exon2 and gp120, where gp120 sequence positions H374, N478, and R428 are neighbors on the protein surface of gp120 and all three are distant in the genome. REV sequence position V46 shows strong pairing relationships with many sequence positions in gp41 and gp120, which share a common interface. REV could be performing a regulatory function, or these sequence positions, when represented in the RNA form, may present a Rev response element (RRE) interface for transport out of the cell nucleus (Le, Zhang et al. 2002).

Sequence position G431 in gp120 shows a co-evolving relationship with E328 of POL RT, shown in Figure 8-13. The POL RT structure is a dimer where the second structure does not appear to play a significant role beyond structure stability in the transcribing of the HIV RNA to double-stranded DNA in the palm region. Sequence position E328 shares common interfaces with numerous proteins and is located near the palm cut region where RNA-DNA transcription takes place. The other sequence positions that show co-evolving relationships with other proteins, R284, L295 and I309, are also found near the pocket of the palm cut region and may serve as guide markers for the transcription process. Sequence position E328 appears to play a key role in the transcription process, where a mutation in this region requires compensating mutations in POL Int, p24, POL Rnase, REV exon2, gp41, gp120 and p6. The transcription process should be operating against a long and narrow structure as it moves through the palm cut region. POL Int has a sequence length of 288 and sequence position 280 is predicted to interact with E328. POL Rnase has a sequence length of 120 and sequence position 115 is predicted to interact with E328. Protein gp120 has a sequence length of 481 and sequence position 431 is predicted to interact with E328. Proteins p6 and p24 both show co-evolving relationships at the beginning of each sequence. For gp41 and REV exon2 the predicted co-evolving sequence positions are found near the middle of the sequence, and both proteins are encoded within regions that include multiple reading frames.

A predicted interface exists between p24 sequence positions T216 and P122 and p6 sequence positions R4 and E6 (Figure 8-14 and Figure 8-15). Proteins p6 and p7 make up the nucleocapsid, which is surrounded by p24, which forms the viral capsid; they are shown in Figure 8-7 as surface neighbors in a mature HIV virus. It is reasonable to expect a binding interface to exist between p6 and p24.

VPR has two interactions: Q8 with gp120 and R95 with TAT. The two positions are at opposite ends of a long structure shown in Figure 8-16. For the VPR-TAT interface, VPR R95 is a positively charged amino acid and TAT D5 (Figure 8-17) is a negatively charged amino acid, which indicates the potential for a protein-protein interaction. For the predicted interface between TAT R56 and REV exon1 D9, the compensating mutations are likewise between a positively and a negatively charged amino acid, which indicates the potential for a protein-protein interaction.

Integrase performs a key role in integrating the HIV genetic material into the DNA of the host cell. The integration is performed after POL RT converts the RNA to a double-stranded DNA helix. The POL Integrase dimer structure shown in Figure 8-18 has five sequence positions (D55, E212, K240, E246, C280), all found on the protein surface, that are predicted to form an interface between POL Integrase, POL RT, GAG pol, and gp120. Sequence position Q177 shows interactions with gp41, REV exon2, VIF and POL pro, where Q177 is buried in PDB 1EX4.
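The charge argument used above for the VPR-TAT and TAT-REV pairs can be expressed as a simple check. This is a minimal sketch under stated assumptions: the helper names `charge` and `is_charge_complementary` are hypothetical, only K and R are treated as positive and D and E as negative, and histidine is left neutral as a simplification. It says nothing about whether two residues are actually in contact.

```python
# Simplified side-chain charge classes at physiological pH (an assumption:
# histidine is treated as neutral here).
POSITIVE = {"K", "R"}
NEGATIVE = {"D", "E"}

def charge(residue):
    """Return +1, -1 or 0 for a one-letter amino acid code."""
    if residue in POSITIVE:
        return 1
    if residue in NEGATIVE:
        return -1
    return 0

def is_charge_complementary(residue_a, residue_b):
    """True when one residue is positive and the other is negative."""
    return charge(residue_a) * charge(residue_b) == -1

# Residue pairs taken from the predicted interfaces discussed in the text.
print(is_charge_complementary("R", "D"))  # VPR R95 with TAT D5
print(is_charge_complementary("R", "D"))  # TAT R56 with REV exon1 D9
print(is_charge_complementary("P", "E"))  # p24 P122 with p6 E6
```

Complementary charge is only weak supporting evidence for a predicted pair; the p24-p6 pair fails this test yet is supported in the text by structural proximity instead.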
HIV Analysis

HIV is a well-researched virus where significant resources have been allocated to understand how the virus works in order to facilitate a cure or vaccine. The protein interactions that are predicted for HIV based on genome sequence data using the MEMI method show general agreement with what is currently known about HIV protein interactions. The most significant example is the interface between gp120 and gp41, where the protein structure for gp41 remains unsolved and is a critical drug target to prevent the spread of HIV infection. Interfaces are also predicted between POL RT and POL Integrase, which perform the important steps of transcribing the HIV RNA genome into DNA and integrating the DNA into the host cell. With the ability to detect protein interfaces as a result of compensating mutations that are caused by HIV drugs, the researcher has a tool to understand the continuing evolution of the HIV virus. As the knowledge of point-specific protein interactions based on mutual information detecting co-evolving amino acids grows, the understanding of the systems biology of HIV can be improved.

Figure 8-1. HIV gene map. (Footnote 1: http://www.hiv.lanl.gov/content/hiv-db/MAP/landmark.html)
Figure 8-2. HIV top 64 pairs MI.

Figure 8-3. HIV top 64 pairs MEMI.

Figure 8-4. Edges per node for the top 64 MI pairs using MI method in HIV. (Bar chart: number of nodes, 0 to 60, against edges per node, 1 to 11.)

Figure 8-5. Edges per node for the top 64 MI pairs using MEMI method in HIV. (Bar chart: number of nodes, 0 to 60, against edges per node, 1 to 14.)

Figure 8-6. HIV MI pairs top 100 MEMI method.

Figure 8-7. HIV immature and mature. (Footnote 2: This work is in the public domain in the United States because it is a work of the United States Federal Government under the terms of 17 U.S.C. § 105. Drs. Louis E. Henderson and Larry Arthur.)

Figure 8-8. HIV PRO GAG MI pairs pre-treatment.

Figure 8-9. HIV PRO GAG MI pairs post-treatment.

Figure 8-10. ENV gp120, T cell CD4, antibody, PDB 2NXY, wide view. Figure labels: Antibody, T Cell Surface, CD4.

Figure 8-11. HIV viral attachment. (Footnote 3: http://www.niaid.nih.gov/daids/dtpdb/attach.asp)

Figure 8-12. ENV gp120, T cell CD4, antibody, PDB 2NXY, detailed view. Figure labels: REV exon2, gp41, Antibody, T Cell Surface, CD4, V:46, POL RT:E:328, H:374, N:478, VIF:T:69, POL Int, E:212, G:431, VPR:Q:8, POL Pro:V:11, R:466, S:274, E:91, C:253, G:89, G:40, N:280, Y:127, R:469, G:458, Q:428.

Figure 8-13. POL RT PDB 2IAJ. Figure labels: p6, Palm Cut Region, I:309, E:6, L:295, POL Int, R:284, D:55, E:212, E:246, gp120, N:280, Q:428, G:431, E:91, R:4, gp41, Y:127, N:105, POL Int, K:240, C:280, P24, Q:4, P:122, exon2:V:46, E:328, POL Rnase:G:115.

Figure 8-14. GAG p6 PDB 2C55. Figure labels: P24, P:122, Q:4, T:216, POL RT, I:309, E:328, R:4, E:6.

Figure 8-15. GAG p24 PDB 2ONT (148-220). Figure labels: T:216, p6:E:6.

Figure 8-16. VPR PDB 1M8L. Figure labels: R:95, TAT:D:5, Q:8, gp120:R:469.

Figure 8-17. TAT PDB 1JFW. Figure labels: R:56, REV exon1:D:9, D:5, VPR:R:95.

Figure 8-18. POL Integrase PDB 1EX4. Figure labels: Q:177, gp41:G:89, REV exon2:R:19, VIF:T:69, POL pro, G:68, T:91, D:55, K:240, POL RT, R:284, L:295, E:328, E:246, gp120, R:469, N:478, E:212, GAG Pol:R:3, POL RT:R:264.
Table 8-1. HIV HXB2 gene product sequence positions

Gene Product  | Function                                                             | Start bp | End bp
--------------|----------------------------------------------------------------------|----------|-------
GAG p17       | Matrix                                                               | 790      | 1185
GAG p24       | Capsid                                                               | 1186     | 1881
GAG p2        |                                                                      | 1882     | 1921
GAG p7        | NucleoCapsid                                                         | 1921     | 2085
GAG p1        |                                                                      | 2086     | 2133
GAG p6        | NucleoCapsid                                                         | 2134     | 2292
GAG pol       |                                                                      | 2088     | 2252
POL pro       | Protease cuts proteins into segments (drug target)                   | 2250     | 2549
POL RT        |                                                                      | 2550     | 3869
POL Rnase     |                                                                      | 3870     | 4229
POL Integrase | Integrates DNA produced by reverse transcription into host genome    | 4230     | 5096
VIF           | Promotes the infectivity but not the production of viral particles   | 5041     | 5620
VPR           | Interacts with p6. Detected in nucleus.                              | 5559     | 5850
TAT exon1     | Viral regulatory factors                                             | 5831     | 6045
TAT exon2     | Viral regulatory factors                                             | 8379     | 8466
REV exon1     | Viral regulatory factors (export)                                    | 5970     | 6045
REV exon2     | Viral regulatory factors (export)                                    | 8379     | 8650
VPU           | Membrane protein (degradation of CD4, enhancement of virion release) | 6062     | 6310
ENV           |                                                                      | 6225     | 6314
ENV gp120     | External glycoprotein                                                | 6315     | 7757
ENV gp41      | Transmembrane glycoprotein                                           | 7758     | 8792
NEF           | Down-regulates CD4 and increases viral infection in a host cell      | 8797     | 9417
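The overlapping coding regions that give rise to shared reading frames can be located from the start and end coordinates in Table 8-1. The sketch below is an illustration only: it uses a subset of the table rows, the name `overlapping_regions` is hypothetical, and it finds the overlapping intervals without reproducing the offset test used in this work to eliminate pairs, whose exact form is not specified here.

```python
from itertools import combinations

# Start and end coordinates copied from Table 8-1 (a subset of rows).
REGIONS = {
    "GAG p6": (2134, 2292),
    "POL pro": (2250, 2549),
    "POL RT": (2550, 3869),
    "POL Integrase": (4230, 5096),
    "VIF": (5041, 5620),
    "VPR": (5559, 5850),
}

def overlapping_regions(regions):
    """Return (name_a, name_b, start, end) for every pair of regions that overlap."""
    overlaps = []
    for (name_a, (start_a, end_a)), (name_b, (start_b, end_b)) in combinations(
        regions.items(), 2
    ):
        # Two closed intervals overlap when the later start is not past the earlier end.
        start, end = max(start_a, start_b), min(end_a, end_b)
        if start <= end:
            overlaps.append((name_a, name_b, start, end))
    return overlaps

overlaps = overlapping_regions(REGIONS)
for row in overlaps:
    print(row)
```

Positions falling inside any reported interval are the ones that would need to be tracked as belonging to an overlapping protein region before pairs are scored.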
CHAPTER 9 PREDICTING PROTEIN INTERACTIONS IN INFLUENZA A VIRUS

Influenza is an RNA virus more commonly referred to as the flu. Various strains of Influenza have been responsible for killing millions of people. Influenza A is the most severe, and Bird Flu, or H5N1, is a form of Influenza A which currently infects birds. The concern is that a new variant of H5N1 will find a way to infect humans. For this analysis 2000+ Influenza A genomes were downloaded from The J. Craig Venter Institute, formerly known as The Institute for Genomic Research. The genomic data is organized by protein as unaligned sequence data. Clustalw was used to align the sequence data for each protein, and the aligned proteins were then assembled as a global alignment for the entire genome. Clustalw discards duplicate sequences, so when the locally aligned protein sequences were assembled into a globally aligned genome this resulted in 1818 unique Influenza A genome sequences. The MEMI and MI methods were used to calculate mutual information for the 1818 sequences. This took approximately 48 hours on an AMD dual core 1800 processor using the MEMI method. The algorithm has no notion of gene boundaries and does a pair-wise comparison on all sequence positions to determine the high-scoring mutual information amino acid pairs. Statistically significant mutual information between two sequence positions indicates that when one position mutates, the other sequence position must also mutate to compensate. The reason for the compensating mutations cannot be determined from the mutual information calculation, which simply indicates that information exists that may be informative or of interest between the two sequence positions.

In prior analysis of mutual information in HCV, HIV and Dengue the predicted co-evolving sequence positions expressing the most information occurred between proteins. The assumption for Influenza A, with no prior knowledge of the protein topology, is that it would exhibit the same type of relationships. For the MI method, of the top scoring 100 mutual information pairs, 11 were between Hemagglutinin (HA) and Neuraminidase (NA) and the remaining 89 were isolated, with both sequence positions bounded by HA or by NA, as shown in Figure 9-2. This would indicate that HA has a strong association with HA in a homo-oligomer and NA has a strong association with NA in a homo-oligomer. The detected interactions between HA and NA could indicate some measure of protein interactions. The mutual information calculated using the MI and MEMI methods was based on 1818 sequences from global sources, which could provide a representative distribution of probabilities for each sequence position, resulting in an accurate protein interaction model. The MI model was compared to the MEMI model, where in the top 100 scoring MEMI predicted co-evolving relationships not one pair was associated with an HA-NA interface, as shown in Figure 9-3. Both the MI and MEMI methods show co-evolving sequence positions bounded by HA and NA even though Influenza consists of 12 individual proteins, which indicates that HA and NA play a significant role in the need to mutate and in the importance of preserving important protein interfaces. The MEMI method shown in Figure 9-3 clearly indicates that no significant mutually dependent relationships can be detected between HA and NA. The significance of predicting no interactions between HA and NA for the top 100 scoring predicted co-evolving pairs is that mutual information has no notion of gene boundaries when doing a pair-wise comparison of all sequence positions in the genome. If some measure of false positives existed in the calculations, it is expected that at least two sequence positions, one bounded by HA and the other by NA, would show an informative relationship.
This is the case for the MI method, which can be attributed to potential false positives, but using the MEMI method no interactions are predicted between HA and NA, a clear differentiator between the MI and MEMI methods. To further test the likelihood of a protein interface existing between HA and NA, or the existence of false positives using the MEMI method, the top 200 scoring predicted co-evolving relationships are shown in Figure 9-4. The absence of protein interactions between NA and HA, even as false positives, in the top 200 scoring co-evolving pairs indicates that the MEMI method is detecting informative relationships and that HA and NA are forming distinct homo-oligomer structures.

Power-law Degree Distribution

The degree distributions for the network graphs of the top 100 predicted co-evolving pairs in Influenza using the MI and MEMI methods are shown in Figure 9-5 and Figure 9-6. The expectation is that the distribution will be non-random and exhibit a power-law distribution (Barabasi and Oltvai 2004). The MI model shows a linear decay as the number of edges increases, indicating that nodes share a high number of edges, which can indicate randomness or a high number of false positives for protein interactions in a biological network. The MEMI model indicates a power-law distribution, with 38 nodes showing only 1 edge and one node in the tail with 24 edges acting as a hub.

Analysis of MEMI Model

A team of researchers at NIAMS led by Alasdair Steven, working with an H3N2 Influenza strain, were able to image the virus using electron tomography (ET), shown in Figure 9-7 (Harris, Cardone et al. 2006). Looking at this 3D model it is clear that Hemagglutinin (HA) groups with other Hemagglutinin proteins, where some amount of space is required between each Hemagglutinin structure. The mutual information relationships in Hemagglutinin may also indicate compensating mutations to preserve secondary or tertiary structure. It is also possible, if the sequence positions are found on the protein surface, that they are preventing Hemagglutinin structures from interacting, maintaining an even spacing, or providing orientation as the cluster forms and attaches to the capsid. Detailed examples of the 3D locations of the predicted co-evolving sequence positions are given in the next section; all are found on the surface of the protein, which indicates protein interaction or anti-relationships to prevent interaction. For Neuraminidase (NA) the structures are clustered in tight groups, and sequence positions with high mutual information relationships, depending on their location on the 3D structure, may form protein-protein interactions with the complement sequence position in a neighboring Neuraminidase structure or help with the overall orientation of each structure.

PDB structures for Neuraminidase (2HU4) and Hemagglutinin (2IBX) are available as complex quaternary structures. For Neuraminidase a point of interest is sequence position 45, which was not part of the solved PDB structure and is possibly involved with the anchoring to the capsid as a trans-membrane region of the protein. Each PDB structure was rendered in UCSF Chimera as a surface and, when possible, the sequence positions for each mutual information relationship were highlighted in a unique color and labeled. From the mutual information network topology model, color annotations were done on each structure by cluster of sequence positions. Of the top 100 predicted Influenza co-evolving sequence positions in Neuraminidase only one, Q293, is buried in the PDB structure. Predicted co-evolving sequence positions for Hemagglutinin were mapped using PDB model 2IBX, and the mutual information network topology model uses the sequence A/duck/Viet Nam/18/2005(H5N1). For Neuraminidase the PDB model 2HU4 was recently released, and the amino acid sequence position offsets in the PDB model showed good agreement with the A/duck/Viet Nam/18/2005(H5N1) sequence.

Protein Surface Mutual Information Models

In Figure 9-3 three clusters of co-evolving pair relationships exist, centered around (Q:45, C:141, A:61) in the NA protein.
The assumption is that, with compensating mutations limited to sequence positions found in NA, the co-evolving relationships are important for the formation of a stable NA structure in a homo-oligomer. To understand the information being expressed between sequence positions, they are highlighted on PDB 2HU4, which is four NA proteins in a stable structure, shown in Figure 9-8. The cluster formed by (A:288:308, A:265:284, A:285:304, G:67:87, S:217:236) with Q:45 is located in the center of the NA structure in a relatively small region on the surface of the protein. The Q:45 and A:61 sequence positions are not part of the solved 2HU4 PDB structure, so it is difficult to determine a specific structural relationship based on proximity. It is possible that as NA structures are produced and deposited on the viral surface they need to be evenly distributed for efficient operation, and if an existing NA cluster is already attached then the center cluster steers Q:45 away from that location. In Figure 9-9 the 2HU4 structure is rotated 90 degrees to indicate that all predicted co-evolving amino acids are found on one side of the 2HU4 structure. This is a strong indicator that the predicted co-evolving sequence positions are working together for some purpose that involves interfaces isolated to one side of the structure. The second cluster, defined by (C:141:161, K:187:206, D:364:387, S:368:391, T:432:453), is found on the outer edge of the NA structure, where (C:141:161, K:187:206, T:432:453) form an interface with a neighboring NA structure and (D:364:387, S:368:391) are located on the other surface of the homo-oligomer structure. Using the location on the protein structure to validate the information that is expressed in the co-evolving relationships, the second cluster is shown in Figure 9-10. That pattern is symmetrical and would appear to have purpose. In Figure 9-11 the second cluster is shown rotated 90 degrees about the horizontal axis, which shows that in the homo-oligomer structure for NA all predicted sequence positions in the second cluster are found on the same plane.
It is proposed that co-evolving or informative relationships in the second cluster are important in the formation of a stable homo-oligomer structure, where the NA structure must be rotated 90 degrees as it approaches another NA structure. If two NA structures approach each other where C:141:161 is aligned with K:187:206, then this is a positive attraction. The sequence position K:187:206 also shares a co-evolving relationship with T:432:453, which can help with steering or adjusting the rotation of the approaching NA into the appropriate position. This positive relationship is shown in Figure 9-12 and is indicated by the green connecting lines. The sequence position C:141:161 is a central or common node for the other four sequence positions in the second cluster, and this relationship could play a role when three NA structures have attached and one remaining NA structure is required to form a stable structure. As the fourth NA structure approaches, only one orientation will work for docking purposes, and the other three sequence positions located on the outside edge in the cluster will provide the appropriate interactions either to adjust the orientation of the approaching NA structure or to reject it. This proposed relationship is shown in Figure 9-12 and is indicated by the black lines terminated by diamonds.

In Figure 9-13 through Figure 9-16 the clusters or hubs of mutual information relationships for Hemagglutinin (HA) are mapped onto the 2IBX PDB structure. HA plays a significant role in attaching to the host cell, and the host organism's immune response will classify an HA subtype. As an example, Bird Flu is classified as H5N1 and Spanish Flu as H1N1, where H indicates HA and N indicates NA. In its stable form HA is a trimer, and of the HA pairs in the top 100 MEMI predicted co-evolving pairs, all except one are found on the trimer surface of 2IBX.
The predicted protein interactions in HA, like those in NA, could play a significant role in the formation of the stable trimer structure, and the three clusters/hubs could be associated with a particular subtype. With well-defined subtypes the classification of amino acids in a predicted co-evolving pair could indicate specific amino acid pairings required for different structural features. The HA pair (125,275) has the most bits of information, at 2.333, of all HA predicted co-evolving pairs, and the two positions are distant in the HA sequence. In PDB 2IBX, HA pair (125,275) are surface neighbors in the mid region of the protein structure, indicating the potential significance of the information being expressed. To understand the information that is being expressed in HA pair 125-275, the specific amino acid pairings were selected from all 1818 Influenza sequences and grouped by HA subtype, as shown in Table 9-2. When looking at HA sequence positions 125 and 275 separately, a high number of the amino acid G in both positions dominates 1316 H3 genome sequences and 368 H1 sequences. Using the MEMI method to do a pair-wise comparison of mutation events before calculating the mutual information minimizes the impact of the phylogenetic influences from high sequence similarity. The result is that specific amino acid combinations in HA 125-275 predict the H subtype. If amino acids D and S are found at sequence positions 125 and 275, then based on the 1818 reference Influenza genome sequences this would indicate the H6 subtype. For the 15 unique amino acid pairs found in HA sequence positions 125 and 275, each amino acid pair indicates a unique subtype. In three examples (H2, H6 and H11) two different amino acid pairs can be used as markers to indicate the corresponding subtype. The remaining subtypes are all correlated to a single amino acid pair combination, which could be used as a classifier of HA subtypes. Given that the sequence positions HA (125,275) are surface neighbors in H5N1 and that specific pairing combinations indicate different HA subtypes, the information content being expressed is critical to the structure and immune response of HA.
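The grouping of amino acid pairings by subtype described above can be sketched as follows. This is a hedged illustration, not the analysis behind Table 9-2: the function names are hypothetical, the four-residue sequences are invented toy data, and column indices 0 and 3 merely stand in for HA positions 125 and 275.

```python
from collections import defaultdict

def pair_to_subtypes(records, i, j):
    """Map each (residue at i, residue at j) pair to the set of subtypes showing it."""
    mapping = defaultdict(set)
    for subtype, sequence in records:
        mapping[(sequence[i], sequence[j])].add(subtype)
    return dict(mapping)

def is_perfect_classifier(mapping):
    """True when every observed amino acid pair occurs in exactly one subtype."""
    return all(len(subtypes) == 1 for subtypes in mapping.values())

# Invented toy alignment: (subtype, aligned sequence).
records = [
    ("H6", "DAAS"),
    ("H6", "DCAS"),
    ("H3", "GAAG"),
    ("H3", "GTAG"),
    ("H5", "NAAK"),
]

mapping = pair_to_subtypes(records, 0, 3)
print(mapping)
print(is_perfect_classifier(mapping))

# A pair shared by two subtypes breaks the one-to-one relationship.
ambiguous = pair_to_subtypes(records + [("H1", "NCAK")], 0, 3)
print(is_perfect_classifier(ambiguous))
```

The check allows several pairs to point at the same subtype, which matches the H2, H6 and H11 cases in the text; it only fails when one pair is seen in more than one subtype.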
This analysis (not shown) was repeated on other HA predicted co-evolving pairs with high mutual information and showed significant correlation, in that a specific amino acid pair combination indicates an HA subtype. The information being expressed is not isolated to a specific combination of sequence positions but must be considered as it relates to the hub or clustering relationships shown in the network graph. An additional example is given in Table 9-3 for HA sequence positions 379-478, which are not edges of a hub, but 379 is supporting 478, which is an edge on a small hub. From Figure 9-14 the two sequence positions are in the same general surface region but are not contact neighbors. Based on the location of 478 and 379, the relationship could play a role in the formation or orientation of the structure to form a stable trimer. The specific amino acid pairings found at HA 379 and 478 indicate a strong correlation for HA subtype except for amino acid pairs Y:L and Y:I. The Y:L pair combination is found in H1, H2, H5 and H6. Understanding the similarities or differences of H1, H2, H5 and H6 when Y:L is present could indicate the structural or functional role of these two sequence positions.

From the groupings and graph relationships shown in Figure 9-13 through Figure 9-16 it is possible to pick regions of interest for wet lab studies. With the high number of predicted co-evolving pairs in HA, the task of selecting significant or critical relationships and wet lab experiments to understand the information being expressed can be difficult. Another potential data analysis tool is the cross-referencing of feature attributes associated with genome sequences, for example subtype and the physical-chemical properties of amino acids, and correlating them to the information that is being expressed by specific amino acid pairings.
Influenza A Analysis

The MEMI model predicts that Neuraminidase (NA) only interacts with other NA proteins on the virus surface and that Hemagglutinin (HA), also located on the virus surface, only interacts with other HA proteins. This clustering of HA groups and NA groups is supported by electron tomography models of Influenza (Harris, Cardone et al. 2006). In the MEMI model the top 100 scoring mutual information relationships indicate no interactions between HA and NA, whereas in the MI model 11 of the top scoring co-evolving sequence positions occur between HA and NA. The assumption is that the MEMI model is informative given that, of the top 200 scoring mutual information pairs, none exist between the HA and NA proteins, even though the algorithm has no knowledge of protein boundaries.

The clustering of pairing relationships in NA shows an interaction pattern that appears to guide the formation of four NA proteins as a homo-oligomer. NA prevents HA from binding to the surface of an infected cell when the virus is escaping to infect other cells. This can be viewed as a secondary step in the process of infecting the cell, and the information being expressed from compensating mutations is directed towards increasing the efficiency of the homo-oligomer and attaching to the virus surface. The information that is expressed in HA is complex, and HA is the key protein for the infection of the host cell. The mutations that occur in HA impact the efficiency or type of host cells to which the Influenza A virus will attach. The two predicted co-evolving pairs in HA expressing the most information are perfect classifiers for the Influenza subtype, where a specific amino acid pair codes for a specific subtype. It is not clear what role these pairing relationships play in the protein structure of HA, but it is clear they are important markers that could be an important drug target.
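For reference, the conventional mutual information score between two alignment columns, which the MI method in this chapter builds on, can be sketched in a few lines. This is the standard Shannon formulation computed from observed frequencies, not the MEMI method, which samples mutation events before scoring; the function name and the toy columns are illustrative assumptions.

```python
from collections import Counter
from math import log2

def mutual_information(column_a, column_b):
    """Shannon mutual information, in bits, between two equal-length alignment columns."""
    if len(column_a) != len(column_b):
        raise ValueError("columns must have the same number of sequences")
    n = len(column_a)
    count_a = Counter(column_a)
    count_b = Counter(column_b)
    count_ab = Counter(zip(column_a, column_b))
    # Sum p(a,b) * log2(p(a,b) / (p(a) * p(b))) over the observed residue pairs.
    return sum(
        (c / n) * log2((c / n) / ((count_a[a] / n) * (count_b[b] / n)))
        for (a, b), c in count_ab.items()
    )

# Toy columns: one residue per sequence at each of two positions.
covarying = mutual_information("AAGG", "TTCC")    # second column tracks the first
independent = mutual_information("AAGG", "TCTC")  # no relationship between columns
print(covarying, independent)
```

Because the score depends only on the two columns, nothing in it refers to gene boundaries, which is why the absence of HA-NA pairs among the top MEMI scores is treated as informative in the text.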
Figure 9-1. Influenza virus 3D topology. 4
4 This file has been released into the public domain by the copyright holder, its copyright has expired, or it is ineligible for copyright. This applies worldwide.
Figure 9-2. Influenza top 100 pairs, MI method, A/duck/Viet Nam/18/2005(H5N1)
Figure 9-3. Influenza top 100 pairs, MEMI method, A/duck/Viet Nam/18/2005(H5N1)
Figure 9-4. Influenza top 200 pairs, MEMI method, A/duck/Viet Nam/18/2005(H5N1)
Figure 9-5. Edges per node for the top 100 pairs using the MI method in influenza (axes: Nodes, Edges)
Figure 9-6. Edges per node for the top 100 pairs using the MEMI method in influenza (axes: Nodes, Edges)
Figure 9-7. Influenza virus imaged using electron tomography (labels: HA, NA)
Figure 9-8. Four neuraminidase structures (PDB 2HU4). Labeled positions: A:265:284, A:285:304, Q:45, G:67:87, S:217:236, A:288:308, A:61, C:141:161, T:432:453, S:368:391, K:187:206, D:364:387, Q:45, A:61
Figure 9-9. Predicted co-evolving pairs in neuraminidase (side view)
Figure 9-10. Predicted co-evolving pairs symmetry in neuraminidase (top view)
Figure 9-11. Predicted co-evolving pairs in neuraminidase showing second cluster (side view)
Figure 9-12. Neuraminidase docking scenario showing co-evolving pair relationships. Labeled positions: C:141:161, K:187:206, T:432:453, D:364:387, S:368:391. Legend: Positive/Attractive relationship; Negative/Interference relationship
Figure 9-13. Three hemagglutinin structures (PDB 2IBX), cluster 1. Labeled positions: N:125:113, K:275:262, K:169:156, I:67:55, Q:18:6, G:276:264, H:119:107, K:139:127, V:14, E:186:174, P:297:285, N:112:100, L:60:48
Figure 9-14. Three hemagglutinin structures (PDB 2IBX), cluster 2. Labeled positions: K:135:123, Y:153:141, F:433:88*, R:420:75, L:478:133, P:505:160, K:38:26, W:76:64, H:141:129, Y:379:34
Figure 9-15. Three hemagglutinin structures (PDB 2IBX), cluster 3. Labeled positions: M:298:286, E:85:73, N:88:76, T:52:40, P:210:198, L:188:176*, M:82:70. * L:188:176 is buried behind and offset from N:88:76.
Figure 9-16. Three hemagglutinin structures (PDB 2IBX), cluster 1 and 3 interactions. Labeled positions: M:298:286, E:85:73, K:275:262, E:186:174, V:14*. * V:14 is not included in the PDB model but plays an informative role in the protein interface.

Table 9-1. Influenza A gene sequence positions

Gene product      Function                                                  Start bp   End bp
polymerase PB2    Viral polymerase                                          0          760
PB1 F2 protein    Impacts cell apoptosis                                    761        861
polymerase PB1    Viral polymerase                                          862        1619
polymerase PA     Viral polymerase                                          1620       2335
NS2               Nuclear export                                            2336       2456
NS1               Cellular RNA transport, splicing and translation          2457       2693
CAP                                                                         2694       3191
NA                Blocks HA binding when new virus escapes host cell        3192       3677
MP2               Opens up virus to the cytoplasm of the host cell          3678       3774
MP1               Binds to viral RNA                                        3775       4026
HA                Responsible for binding to cell that is being infected    4027       4614
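The boundaries in Table 9-1 can be used to label each end of a predicted pair with its gene product and to count how many pairs cross a protein boundary, for example between HA and NA. The following is a minimal sketch, assuming that pair positions are indexed on the concatenated sequence exactly as in Table 9-1. The function names and the example pairs are hypothetical and are not taken from the MEMI implementation.

```python
from bisect import bisect_right

# (start, end, product) boundaries copied from Table 9-1.
BOUNDARIES = [
    (0, 760, "PB2"), (761, 861, "PB1-F2"), (862, 1619, "PB1"),
    (1620, 2335, "PA"), (2336, 2456, "NS2"), (2457, 2693, "NS1"),
    (2694, 3191, "CAP"), (3192, 3677, "NA"), (3678, 3774, "MP2"),
    (3775, 4026, "MP1"), (4027, 4614, "HA"),
]
STARTS = [start for start, _, _ in BOUNDARIES]


def product_at(position):
    """Return the gene product whose range contains the given position."""
    idx = bisect_right(STARTS, position) - 1
    if idx < 0:
        raise ValueError(f"position {position} is before the first product")
    start, end, name = BOUNDARIES[idx]
    if not start <= position <= end:
        raise ValueError(f"position {position} is outside Table 9-1")
    return name


def count_pairs_between(pairs, product_a, product_b):
    """Count pairs with one end in product_a and the other in product_b."""
    wanted = {product_a, product_b}
    return sum(1 for i, j in pairs if {product_at(i), product_at(j)} == wanted)


# Hypothetical example pairs (not real predictions): NA-NA, HA-HA, NA-HA.
example_pairs = [(3200, 3300), (4100, 4500), (3300, 4100)]
cross_ha_na = count_pairs_between(example_pairs, "HA", "NA")
```

Passing the same product twice counts pairs that lie entirely within that protein, which is the case the MEMI model reports for HA and NA.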
Table 9-2. Predicted co-evolving pair HA(125,275). Each amino acid combination is indexed against occurrence in an HA subtype

Subtype   A:E   D:S   E:T   G:E   G:G   G:Q   G:R   K:K   K:R   L:F   L:L   N:K   S:G   T:R   Y:G
H1          0     0     0     0     0     0     0     0     0     0     0     0   368     0     1
H2          0     0     0     0     0     0     0     0    16     0     0     0     0     2     0
H3          0     0     0     0  1316     0     0     0     0     0     0     0     0     0     0
H4          0     0     0     0     0    19     0     0     0     0     0     0     0     0     0
H5          0     0     0     0     0     0     0     0     0     0     0    37     0     0     0
H6          0    22     7     0     0     0     0     0     0     0     0     0     0     0     0
H7          0     0     0     5     0     0     0     0     0     0     0     0     0     0     0
H8          1     0     0     0     0     0     0     0     0     0     0     0     0     0     0
H10         0     0     0     0     0     0     4     0     0     0     0     0     0     0     0
H11         0     0     0     0     0     0     0     0     0     1    11     0     0     0     0
H12         0     0     0     0     0     0     0     4     0     0     0     0     0     0     0
Total       1    22     7     5  1316    19     4     4    16     1    11    37   368     2     1

Table 9-3. Predicted co-evolving pair HA(379,478). Each amino acid combination is indexed against occurrence in an HA subtype

Subtype   I:E   M:A   M:T   Q:D   Q:M   Q:V   T:D   T:K   Y:I   Y:L
H1          0     0     0     0     0     0     0     0   354    13
H2          0     0     0     0     0     0     0     0     2    16
H3          0     0     0     0  1308     6     0     0     0     0
H4          0     0     0     0     0     0     0    19     0     0
H5          0     0     0     0     0     0     0     0     0    37
H6          0     0     0     0     0     0     0     0     0    29
H7          0     0     0     0     0     0     5     0     0     0
H8          0     1     0     0     0     0     0     0     0     0
H10         0     0     0     4     0     0     0     0     0     0
H11        12     0     0     0     0     0     0     0     0     0
H12         0     1     3     0     0     0     0     0     0     0
Total      12     2     3     4  1308     6     5    19   356    95
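A table of this kind can be checked mechanically for amino acid pairs that occur in more than one subtype. The sketch below encodes the nonzero cells of Table 9-2 and lists any pair that maps to several subtypes. The function names are hypothetical, and the code only illustrates the tabulation; it is not the procedure used to build the tables.

```python
from collections import defaultdict

# Nonzero cells of Table 9-2: (HA subtype, amino acid pair) -> sequence count.
TABLE_9_2 = {
    ("H1", "S:G"): 368, ("H1", "Y:G"): 1,
    ("H2", "K:R"): 16, ("H2", "T:R"): 2,
    ("H3", "G:G"): 1316,
    ("H4", "G:Q"): 19,
    ("H5", "N:K"): 37,
    ("H6", "D:S"): 22, ("H6", "E:T"): 7,
    ("H7", "G:E"): 5,
    ("H8", "A:E"): 1,
    ("H10", "G:R"): 4,
    ("H11", "L:F"): 1, ("H11", "L:L"): 11,
    ("H12", "K:K"): 4,
}


def subtypes_by_pair(counts):
    """Map each amino acid pair to the set of subtypes in which it occurs."""
    result = defaultdict(set)
    for (subtype, pair), n in counts.items():
        if n > 0:
            result[pair].add(subtype)
    return dict(result)


def ambiguous_pairs(counts):
    """Return the pairs that occur in more than one subtype, sorted."""
    by_pair = subtypes_by_pair(counts)
    return sorted(pair for pair, subtypes in by_pair.items() if len(subtypes) > 1)


ambiguous_in_table_9_2 = ambiguous_pairs(TABLE_9_2)
```

The same two functions can be applied to the cells of Table 9-3 by building a second dictionary in the same shape.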
CHAPTER 10 PREDICTING PROTEIN INTERACTIONS IN DENGUE

Using mutual information to detect co-evolving amino acids shows promise as a technique to detect protein-protein interactions. This has strong applications in RNA viruses with high mutation rates. Dengue fever is caused by a single-stranded RNA flavivirus. The Genome Institute of Singapore and the Novartis Institute for Tropical Diseases have compiled a portal for information related to the Dengue virus (http://dengueinfo.org/NITD/).

Protein Topology Models

Using sequence data for 160 complete Dengue genomes published at the Dengue portal, the protein topology based on the top 50 scoring predicted co-evolving pairs is shown in Figure 10-2 for the MI method and in Figure 10-3 for the MEMI method. It is important to understand that an interaction between two amino acids with high mutual information located on two different protein surfaces could be positive, indicating that the two protein surfaces have an affinity for each other and that the two proteins share an interface. It is also possible that the mutual information relationship between the two sequence positions prevents the two protein surfaces from interacting in order to reserve the protein surface for another protein.

Comparing the two protein topology models, the MI model shows a one-to-many relationship in which a single sequence position shares mutual information with multiple other protein interfaces, with a low number of sequence positions per protein and few predicted co-evolving pairs confined to a single protein. This could be explained if each protein has a fairly conserved sequence with a minimal number of positions that vary. The sequence positions in the MI model are the same as those in the MEMI model, which would indicate that sequence variance in Dengue is low. Given that the mutual information was calculated with 160 sequences and that the sequence data were collected from isolated geographical regions, the overall sequence variance may be low.

Power Law Degree Distribution

The degree distributions for the network graphs of the top 100 predicted co-evolving pairs in Dengue using the MI and MEMI methods are shown in Figure 10-4 and Figure 10-5. The expectation is that the distribution will be non-random and exhibit a power law (Barabasi and Oltvai 2004). The MI model shows a linear decay as the number of edges increases, indicating that nodes share a high number of edges, which can indicate randomness or a high number of false positives for protein interactions in a biological network. The MEMI model has a power law distribution, with 24 nodes showing only one edge, 23 nodes showing two edges, and one node in the tail with 37 edges acting as a hub.

Analysis of MEMI Model

This is a good example of the differences between the MEMI and MI methods. Both methods select the same sequence positions, but the interfaces shown are very different. The MI method appears to predict interfaces with a large number of proteins, which is not expected, whereas the MEMI method has well-defined singular interfaces, which are expected in protein interactions. The one major difference between the two models is the role of NS4A, which shows no interactions in the MI model but in the MEMI model has a single sequence position, A36, interacting with E, NS5, NS2A, NS2B, PrM, NS4B, and NS3. The NS4A protein is the least studied of the NS proteins and is believed to play a role in the dramatic rearrangement and induction of unique membrane structures within the cytoplasm of an infected cell (DengueInfo 2007). It would be difficult to say that no false positives exist in the MEMI model, and because of the low sequence count the model may be a very poor prediction of protein interactions in Dengue.
What is clear is that both the MI model and the MEMI model show strong agreement on the sequence positions that may be involved in protein interactions, where the MI relationships appear all equally likely or almost random and the MEMI method has distinct relationships that appear informative. Once the sequence positions of interest have been identified, highlighting them on a solved PDB structure can provide additional information about the accuracy or importance of the predicted sequence positions. The major envelope protein E has a solved structure, 1THD, as a trimer. The sequence positions in E that have high mutual information are color coded. The sequence positions come from various sections of the sequence but in 3D space form a straight line along the trimer, where the center E structure is flipped and rotated 180 degrees, as shown in Figure 10-6. The structures marked A and C are parallel with the same alignment, and structure B is flipped and rotated 180 degrees. This is an interesting indicator that the line or axis formed by these sequence positions may be important. The other interfaces are at the end of the structure, and even though they are distant in sequence position they are all approximate neighbors in 3D space.

Dengue Analysis

The protein topology models for Dengue fever were constructed from 160 genome sequences, and the resulting MI model shows a uniform distribution of interactions and appears non-informative. The MEMI model shows potentially informative relationships, with a power law distribution in the network topology. In addition, an interesting relationship is indicated in the envelope protein. The envelope protein has a solved trimer structure, 1THD, in which the three long protein structures are parallel, forming a sheet that combines with other trimers to form the virus capsid. When the sequence positions in E that show interactions with other proteins are highlighted on the trimer, they are clustered along a straight line.
The significance is that the middle protein structure in the trimer is anti-parallel to the other two structures, so only one solution exists that would allow the same sequence positions in all three structures to be located along the same axis. The center E structure, which is anti-parallel, is offset and rotated 180 degrees so that the sequence positions K122, K123, T120 and S229 can be found along the same axis, which appears to be informative.

Figure 10-1. Dengue genome
Figure 10-2. Dengue top 50 pairs, MI method
Labels: C, prM, E, NS1, 2A, 2B, NS3, 4A, 4B, NS5
Figure 10-3. Dengue top 50 pairs, MEMI method
Figure 10-4. Edges per node for the top 100 pairs using the MI method in Dengue (axes: Nodes, Edges)
Figure 10-5. Edges per node for the top 100 pairs using the MEMI method in Dengue (axes: Nodes, Edges)
Figure 10-6. Envelope protein (PDB 2B6B). Labels: K:122, K:123, T:120, S:229, B, A, C, Q:47, A:205, N:191, NS1, NS2A:K:142, NS2B:L:123, K:563, Q:644, NS5
Figure 10-7. Envelope protein (PDB 2B6B) interface to NS5 (PDB 2J7U). Labels: PrM, K:122, L:342, S:300, NS4A:A:36, Q:293 0, M:340, H:346, R:345, K:563:Q, Q:644:T, I:632, T:120, S:631:K, V:650:K, T:723:K, NS2B:K:90, K:28, D:29
CHAPTER 11 PREDICTING FUNCTIONAL CO-EVOLVING SECONDARY STRUCTURES IN NUCLEAR RECEPTORS

In this study we analyze sequence data from all nuclear receptors to compare the predictions of the MEMI method with those obtained by calculating mutual information from all aligned sequence data (MI method). Nuclear receptors (NRs) are multidomain transcription factors and with few exceptions consist of an N-terminal domain, a highly conserved DNA binding domain (DBD), a hinge domain connecting the DBD to the LBD, and a ligand binding domain (LBD); several have C-terminal extensions referred to as F domains (Robinson-Rechavi, Garcia et al. 2003). Currently there are 744 structures in the PDB for NR LBDs, while there are only 96 structures of NR DBDs. NRs are zinc finger proteins that can bind to DNA as monomers, homodimers, or heterodimers via their DBDs, typically upstream of proximal promoter regions of target genes at specific nucleotide sequences referred to as nuclear receptor response elements (NRREs) (Desvergne and Wahli 1999). While the DBDs are highly conserved across all NRs, mutations within this domain allow specificity for NR binding across the genome. The LBD region is structurally conserved yet only moderately conserved at the sequence level, perhaps allowing this protein family to bind and respond to a wide range of endogenous ligands such as hormones, sterols, and fatty acids. Ligand binding results in changes in LBD conformational dynamics, facilitating recruitment or displacement of co-regulatory chromatin remodeling proteins that in turn affect the transcriptional output of target genes.

Nuclear receptors are important drug targets due to their well-established roles in the pathologies of a number of diseases as well as their ability to be modulated by small molecules. Currently, 13% of FDA-approved drugs target NRs for treatment of cancer, osteoporosis, and diabetes (Overington, Al-Lazikani et al. 2006). With significant research focused on NRs, there is a large collection of published mutagenesis studies as well as naturally occurring mutations that have been identified as correlating with various diseases and disorders (Horn, Vriend et al. 2001). These data can be used to provide additional evidence that predicted co-evolving pairs are sequence positions found in regions that are functionally important.

The calculated mutual information, or non-random relationship, between two sequence positions could be attributed to phylogeny, structure, function, interactions or stochastic processes (Buck and Atchley 2005; Codoner and Fares 2008), which further complicates the determination of false positives when developing algorithms to detect compensating mutations of interest. The co-evolving relationship between two sequence positions could also be important to prevent an interaction from occurring and thus be difficult to validate from a static representation in a stable protein structure. The statistically significant pairing relationships with high mutual information could be represented as a collection of predicted co-evolving pairs or as a protein interaction network model to help focus attention on regions or secondary structures of interest. We introduce a protein interaction map where predicted co-evolving sequence positions are grouped by secondary structure as a method to reduce the visual complexity of the network. To further reduce the network complexity we introduce a model that sums the mutual information between secondary structures, which indicates co-evolving secondary structures. To probe this model we used hydrogen/deuterium exchange (HDX) coupled with mass spectrometry (MS) to study protein stability, dynamics, and protein-ligand interaction (Englander and Kallenbach 1983; Bai, Milne et al. 1993; Zhang and Smith 1993; Chamberlain and Marqusee 1997; Englander, Mayne et al. 1997; Engen and Smith 2001; Sivaraman, Arrington et al. 2001; Hamuro, Wong et al. 2002; Hamuro, Zawadzki et al. 2003; Garcia, Pantazatos et al. 2004; Yan, Broderick et al. 2004; Yan, Watson et al. 2004; Busenlehner and Armstrong 2005; Lisal, Lam et al. 2005; Nazabal, Maddelein et al. 2005; Winters, Spellman et al. 2005; Chalmers, Busby et al. 2007) and to provide supporting evidence of interacting co-evolving secondary structures.

The nuclear receptors are represented in Pfam by PF00104, Ligand Binding Domain (LBD), and PF00105, DNA Binding Domain (DBD). PF00104.22 contains 2549 sequences and PF00105.10 contains 2647 sequences; joining them by accession id creates an MSA of 2094 sequences. PF00104.22 does not include sequences from the first two helices of the LBD, Helix 1 (H1) and Helix 2 (H2). PF00105.10 ends at the C-terminal side of the DBD (Helix c); thus co-evolving pair prediction will not include amino acids covering the hinge domain, H1 and H2. The mutual information was calculated for all pairwise sequence positions, and the probabilities for each sequence position were determined by detecting mutation events in the phylogenetic tree representing the MSA (see Materials and Methods). The pairwise comparison of all sequence positions in the union of PF00104 and PF00105 allows for the detection of co-evolving pairs based on high mutual information that may exist in or between the LBD and DBD regions independent of quaternary structure. Co-evolving pairs may indicate key mutations that differentiate attributes of nuclear receptors in terms of sequence-specific DNA binding or specificity for ligand. The predicted co-evolving pairs using the MEMI method and the MI method are used to construct corresponding protein interaction networks.
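For reference, the plain MI calculation between two alignment columns can be sketched as follows. This is only the standard estimate from observed column frequencies (the MI method); the MEMI method instead derives its probabilities from mutation events detected in the phylogenetic tree, which is not reproduced here. The function name and the toy columns are hypothetical.

```python
from collections import Counter
from math import log2


def mutual_information(col_x, col_y):
    """Mutual information in bits between two equal-length MSA columns.

    Probabilities are estimated from the observed residue frequencies in
    the columns, one residue per aligned sequence.
    """
    if len(col_x) != len(col_y) or not col_x:
        raise ValueError("columns must be non-empty and of equal length")
    n = len(col_x)
    count_x = Counter(col_x)
    count_y = Counter(col_y)
    count_xy = Counter(zip(col_x, col_y))
    mi = 0.0
    for (a, b), c in count_xy.items():
        # p(a,b) * log2( p(a,b) / (p(a) * p(b)) ), written with raw counts
        mi += (c / n) * log2(c * n / (count_x[a] * count_y[b]))
    return mi


# Toy columns: the first pair covaries perfectly, the second is independent.
mi_covarying = mutual_information("AAGG", "CCTT")
mi_independent = mutual_information("AAGG", "CTCT")
```

Applying the function to every pair of columns in the MSA and ranking the results gives the kind of pair list from which the interaction networks are drawn.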
The top 300 predicted co-evolving pairs ranked by mutual information using the MEMI method and the MI method are shown as an interaction network in Figure 11-1. The p-value for each of the 300 predicted co-evolving pairs is p < .0001.

MEMI Predicted Protein Interaction Network is Scale Free

The predicted network model of co-evolving pairs represents a protein interaction network and should exhibit properties associated with complex networks that model social networks, scientific collaborations, links between web sites (Barabasi and Albert 1999) and protein interactions (Jeong, Mason et al. 2001; Wagner 2001). A network is considered scale free if the degree, or number of neighbors, of each node follows a power law distribution P(k) ~ k^(-α) where 2 < α < 3 (Barabasi and Albert 1999). A network with random properties would have an equally distributed number of edges across all vertices and would have a bell-shaped or normal distribution. In a protein interaction network that is scale free, the hubs, or sequence positions with a high degree, could play a critical role in determining mutations that differentiate features and functions in the various subfamilies and groups of proteins. Protein interaction networks have scale-free and small-world properties, which allow for adaptability and can also indicate proteins that are important when the protein has a high degree or high number of interactions with other proteins (Jeong, Mason et al. 2001; Wagner 2001). If a protein interaction network based on predicted co-evolving pairs is scale free and small world, then it is anticipated that the hubs, or sequence positions with a high number of interactions, are important in the differentiation of the protein (Chakrabarti and Panchenko 2010).

The top 300 predicted co-evolving pairs are graphed as the number of nodes with degree k using the MEMI method in Figure 11-2 A and the MI method in Figure 11-2 B. It is expected in a scale-free network that the number of nodes with few edges is high, the number of nodes with many edges is low, and the distribution follows a power law. The distribution of edges to nodes for the MEMI method has a power law exponent, calculated using plfit, of α = 2.2613, and for the MI method α = 2.4722 (Clauset, Shalizi et al. 2009). Both methods produce a network where 2 < α < 3, and false positives in predicted co-evolving pairs can introduce additional edges between nodes, resulting in errors in how well the connected edges follow a power law distribution.
This is better illustrated by comparing the actual distribution of the network as a log-log graph in Figure 11-3 A for the MEMI method and Figure 11-3 B for the MI method. The p-value for each distribution is calculated using plpva, where if p > 0.1 the power law distribution is a plausible hypothesis for the data and should not be rejected (Clauset, Shalizi et al. 2009). For the MEMI method p = 0.6150 (power law is plausible) and for the MI method p = 0.0320 (power law should be rejected). Based on the calculated p-value and the fit of the data on a log-log graph, the MEMI method network distribution appears to follow a power law distribution. The MI method does not appear to follow a power law distribution, and this result could be attributed to a high number of false positives introducing random edges in the network. Protein-protein interaction networks exhibit scale-free properties where most proteins interact with a few partners (low degree) and a small but significant number of proteins interact with a large number of proteins (high degree) (Li, Armstrong et al. 2004). Biological and non-biological scale-free networks are resistant to random errors at nodes with a low degree, whereas the removal or deletion of a node with a high degree can have a significant impact on the network (Albert, Jeong et al. 2000; Jeong, Mason et al. 2001; Milo, Shen-Orr et al. 2002; Ozier, Amin et al. 2003; Han, Bertin et al. 2004). We have shown that the predicted protein-protein interaction network using the MEMI method exhibits properties of a scale-free network where the network indicates interactions between co-evolving amino acids.
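The degree sequence and a rough exponent estimate can be sketched in a few lines. This is not plfit or plpva: it assumes a fixed lower cutoff k_min instead of selecting one, uses the approximate discrete maximum likelihood estimator described by Clauset, Shalizi et al. (2009), which is less accurate for small k_min, and performs no goodness-of-fit test. The function names and the toy edge list are hypothetical.

```python
from collections import Counter
from math import log


def node_degrees(edges):
    """Return {node: degree} for an undirected edge list of (u, v) pairs."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    return dict(degree)


def estimate_alpha(degrees, k_min=1):
    """Approximate discrete MLE of the power law exponent for k >= k_min.

    alpha ~= 1 + n / sum(ln(k_i / (k_min - 0.5))) over the n degrees kept.
    """
    tail = [k for k in degrees if k >= k_min]
    if not tail:
        raise ValueError("no degrees at or above k_min")
    log_sum = sum(log(k / (k_min - 0.5)) for k in tail)
    return 1.0 + len(tail) / log_sum


# Toy star network: one hub of degree 3 and three leaves of degree 1.
toy_edges = [("a", "b"), ("a", "c"), ("a", "d")]
toy_degrees = node_degrees(toy_edges)
toy_alpha = estimate_alpha(toy_degrees.values(), k_min=1)
```

For real data the cutoff selection and the p-value computation should be left to plfit and plpva as cited in the text; the sketch only shows where the exponent comes from.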
Co-evolving Secondary Structures

We introduce a modified protein interaction network diagram where the predicted co-evolving pairs represented by a connected edge are grouped by secondary structure, as defined by the consensus secondary structure in Pfam PF00104 and by the canonical representation of the DBD of the retinoid X receptor (Laudet and Gronemeyer 2002), as shown in Figure 11-4. To minimize the visual complexity of the network, the top 100 predicted co-evolving pairs based on the amount of mutual information are graphed in a network model, shown in Figure 11-4 A for the MEMI method and in Figure 11-4 B for the MI method. Each sequence position in the a NR that binds to and is activated by 9-cis retinoic acid and functions as a homodimer or as a heterodimer partner for many NRs including the PPARs and the vitamin D receptor (VDR). The on 144 when mapped Each NR sequence that is included in the MSA has an N-terminal domain of different length plus amino acid insertions and deletions as compared to other NR sequences, which results in varying indexes for each protein sequence position in the MSA.

By grouping predicted co-evolving pairs by secondary structure it is possible to predict secondary structures that co-evolve. We introduce a novel method to detect statistically significant co-evolving secondary structures by summing the mutual information for sequence pairs grouped by secondary structure. Relationships that are more than one standard deviation from the mean information between all secondary structures are shown in the greatly simplified model in Figure 11-5, as compared to Figure 11-1. Nuclear receptors primarily form homodimers or heterodimers, and the secondary structures that share high mutual information can represent secondary structures that interact or are important in the differentiation of NRs. The MEMI method predicts an interaction between secondary structures DBD CII and DBD Helix c, which is the DBD dimer interface between NRs, and between DBD Helix c and LBD S2, a recently discovered third dimer interface (Chandra, Huang et al. 2008), shown in Figure 11-5 A. The MI method does not predict the DBD dimer interface between DBD Helix c and DBD CII, as shown in Figure 11-5 B.
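The grouping step described above can be sketched as follows. Several details are assumptions made for illustration: "more than one standard deviation from the mean" is read as above the mean, the population standard deviation is used, and the position-to-structure map, MI values and function names are hypothetical rather than taken from the study.

```python
from collections import defaultdict
from statistics import mean, pstdev


def sum_mi_by_structure(pairs, structure_of):
    """Sum MI over pairs, keyed by the unordered pair of secondary structures.

    pairs is an iterable of (position_i, position_j, mi); structure_of maps
    an MSA position to the name of its secondary structure.
    """
    totals = defaultdict(float)
    for i, j, mi in pairs:
        key = tuple(sorted((structure_of[i], structure_of[j])))
        totals[key] += mi
    return dict(totals)


def significant_structure_pairs(totals, n_std=1.0):
    """Keep structure pairs whose summed MI exceeds mean + n_std * std."""
    values = list(totals.values())
    if len(values) < 2:
        return {}
    cutoff = mean(values) + n_std * pstdev(values)
    return {key: value for key, value in totals.items() if value > cutoff}


# Hypothetical positions, structures and MI values, for illustration only.
example_structure_of = {139: "DBD CII", 144: "DBD CII",
                        195: "DBD Helix c", 201: "H3", 300: "H5"}
example_pairs = [(144, 195, 2.0), (139, 195, 1.8), (139, 201, 0.3),
                 (201, 300, 0.2), (144, 300, 0.1)]
example_totals = sum_mi_by_structure(example_pairs, example_structure_of)
example_significant = significant_structure_pairs(example_totals)
```

Keying on the sorted pair of structure names treats the relationship as undirected, so a pair contributes to the same total whichever end is listed first.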
MSA Sequence Positions [139,144,195]

MSA positions (144,195) and (139,195) have the highest mutual information using the MEMI method. These three sequence positions are also hubs, or nodes with high degree (195 k=45, 139 k=29, 144 k=26), which indicates their overall importance in the protein interaction network. These three sequence positions are located in the DBD, which is considered highly conserved, but have unique pairing patterns that predict co-evolving relationships in the DBD and LBD regions. We analyze the importance of these three sequence positions based on their location in representative PDB structures and on known naturally occurring mutations, and show that the pairing patterns are conserved among individual nuclear receptors. The first PDB structure of a full-length heterodimeric NR complex bound to DNA (3DZU, 3DZY, 3EOO) was published in October 2008 (Chandra, Huang et al. 2008). This structure was small LXXLL peptides corresponding to coactivator NR boxes and various pharmacological ligands. Prior to this publication, PDB structures of NRs were limited to homo/heterodimer LBD (PF00104) models (744 structures) or homo/heterodimer DBD (PF00105) models (96 structures). MSA sequence positions 139 and 144, located in DBD CII, and 195, located in DBD Helix c, form a clique where the pairs (144,195) and (139,195) have the highest overall mutual 5 and in Figure 11-7 A shown in Figure 11-7 B and in the homodimer DBD interface of REVERB REVERB shown in 5 [139:K:157: ] 139 is the MSA position, K is the amino acid found at sequence position 157 in
Figure 11-7 C. The nuclear receptors RAR, RXR, PPAR, VDR and TR and their isoforms typically bind as heterodimers with RXR to the idealized direct repeat DNA pattern AGGTCANxAGGTCA, where Nx = [1,2,3,4,5] depending on the nuclear receptor binding partner. In contrast, steroid NRs such as estrogen receptor alpha bind as homodimers to the idealized DNA sequence AGGTCANxACTGGA, where the DNA is a palindrome with an Nx = 3 base pair spacer. The nuclear receptor DBD sequence is highly conserved, and the ability to select for DNA binding is guided by docking with different orientations and spacings of the idealized DNA pattern AGGTCA as a direct repeat, a palindrome or an everted repeat. In Figure 11-8 the ER homodimer DBD bound to DNA has a contact neighbor between [139:R55:ESR1_HUMAN] and [101:P44:ESR1_HUMAN] forming the dimer interface. The sequence positions 139, located in the DBD Helix c, and 101, located in the DBD D Box (Figure 11-4 A), are predicted to be co-evolving pairs, illustrating how the predicted protein interaction map is a summary view of interactions that may be specific to various NR protein complexes. A mutation in [144:R607Q:ANDR_HUMAN] is associated with Partial Androgen Insensitivity Syndrome (PAIS) and breast cancer in men (Wooster, Mangion et al. 1992; Weidemann, Linck et al. 1996; Weidemann, Peters et al. 1998; Chen, Chern et al. 1999). A mutation in [194:K630T:ANDR_HUMAN] is associated with prostate cancer (Tilley, Buchanan et al. 1996). The NR5 subfamily members contain a conserved sequence called the FTZ-F1 box (579-601) responsible for DNA binding as a monomer, which includes the hub [195:A580:FTZF1_DROME], indicating the multipurpose roles of secondary structures as a feature of NRs (Ueda, Sun et al. 1992). In Figure 11-9, for each of the 2094 nuclear receptor sequences used in the mutual information calculations, the amino acids found at MSA positions (139,144) are listed in each
row and all amino acids found at MSA position 195 are listed in the columns, where the intersection of a row and a column corresponds to the nuclear receptor that contains that amino acid triplet. The amino acids found at MSA position 195 are generally conserved across nuclear receptor groups, and the rows represented by amino acids found at MSA positions (139,144) are conserved across individual nuclear receptor subgroups. MSA position 144 appears to be more conserved across nuclear receptor groups and MSA position 139 within individual nuclear receptors. The clear pattern, beyond minor variances between the receptor isoforms, is that the pairing patterns are unique or should be considered conserved amino acids for a particular nuclear receptor. The conserved amino acid triplets at sequence positions (139,144,195) and their location in the DBD could play an important role in the structural changes required for various nuclear receptor monomer, homodimer, and heterodimer partners in recognizing DNA response elements (Umesono, Murakami et al. 1991; Kliewer, Umesono et al. 1992).

Allosteric communication between distant secondary structures

Both the MEMI and MI methods predict a co-evolving relationship between DBD CII and LBD S2 as well as H3, which is separated by 23 Å. The large distance between two regions predicted to have a co-evolving relationship could indicate a false positive or be the result of indirect co-evolution (Burger and van Nimwegen 2010). Helix 3 in both models is a hub where it shares a large number of edges with other secondary structures, which can indicate it is an PDB 3DZU we take the minimum distance between all secondary structures in the protein interaction networks in Figure 11-5 and construct a plausible interaction network where the Figure 11-6 The protein interaction map using the MEMI method shows predicted co-evolving pairs between (139,144,195) and the hub MSA sequence position 201 in Helix 3, which is part of the
AF-2 region (Warnmark, Treuter et al. 2003). The predicted co-evolving pair relationships between MSA positions (139,144,195) in the DBD dimer interface and numerous sequence positions in Helix 3 could play a role in a hydrogen bond network that facilitates structural changes in Helix 3 upon DNA binding. The predicted co-evolving secondary structures between Helix 3 and the DBD dimer interface can also be an indirect effect of amino acids that are conserved among specific NRs for the differentiation of features. At the protein level Helix 3, DBD CII and DBD Helix c could be co-evolving to establish specific attributes associated with the protein without co-evolving to maintain favorable pairings for direct interaction between secondary structures.

PPRE DNA

The DBD region of nuclear receptors is highly conserved, but the mutual information based on mutation events contains the highest information content or non-random mutations with a high number of sequence positions in the LBD. The hinge region connecting the LBD and DBD provides what would appear to be a clear separation between two protein structures or, alternatively, enough flexibility to allow for interaction with and binding to the appropriate DNA promoter region. Upon binding to the DNA, interactions between the DBD and LBD can enhance or influence the ability of the AF-2 region to bind co-activators or co-repressors. To validate predicted interactions between secondary structures in the DBD and possible allosteric communication to the LBD, HDX analysis was performed on full PPRE DNA. A change in the deuterium exchange in peptides bound by predicted co-evolving secondary structures would provide evidence that the secondary structures are interacting and can be assigned to a specific protein, as opposed to inferring interaction from minimum distance in a representative PDB.
When the DBD heterodimer binds to the PPRE DNA promoter region, conformational changes can be detected by comparing the deuterium level of amide hydrogens of the protein complex in the presence and absence of the PPRE. If a region of the receptor is involved in or perturbed by interaction with DNA, peptic peptides derived from that region would show different HDX kinetics compared to those from the complex without DNA. This approach, differential HDX, provides a measure of the PPRE-induced conformational change within the complex. HDX was performed as previously described (Chalmers, Busby et al. 2006; Bruning, Chalmers et al. 2007; Chalmers, Busby et al. 2007; Dai, Chalmers et al. 2008). Exchange kinetics of peptic peptides deriv course for the labeling was 10s, 30s, 60s, 300s, 900s, and 3600s, and each time point was done in triplicate. The results of the HDX analysis for peptides with statistically significant differences me Table 11-1 and Table 11-2, where at least two time points have a p-value less than 0.05 in a t-test, indicating there is a measurable difference in the presence or absence of DNA. The table data show the average %D exchange for all time points for apo and ligand. The difference between %D exchange for each time point is summed to give a weighted difference between the apo and ligand peptide. Both ith additional changes also dimerization interface between LBDs. Using mutual information to predict co-evolving secondary structures does not indicate in what context they H3 could be co-evolving based
on location in a solved PDB structure as illustrated in Figure 11-6. It is also possible to have VDR. By using either solved PDB structures or HDX analysis we can assign the predicted co-evolving secondary structure pairs to the specific proteins where we have data to suggest they are interacting. In other protein complexes the relationships could be reversed or not required for the differentiation of features for a specific protein complex. In Figure 11-10 a revised protein interaction model is presented based on the MEMI predicted secondary structure interactions, minimal distances between secondary structures in 3DZU and HDX data where peptides bound by secondary structures showed perturbations upon DNA binding suggest possible interac H10, RXR H5 and RXR H3 show statistically significant changes in deuterium exchange. No co-evolving pair interactions are predicted between H10 and H5 even though they are contact neighbors or between PPA sequence and functional attributes not requiring compensating mutations then MI-based methods would not predict interactions with H10 even though it may play a critical role in protein-protein inte Figure 11-13 within 1 h) due to the high stabilization from ligand MRL20 and no observable perturbation terminal end (LIASFSH+1 310 317) is structurally dynamic and also appeared minimal but statistically significant effect of DNA binding appearing in the peptide Figure 11-13
167 in Figure 11-14 Helix c and CII in the DBD have a predicted co-evolving relationship with S2, that is and RXR DBD Helix c have a minimum distance of 7.3 Å and P to negatively affect PPRE binding and transcriptional activity (Chandra, Huang et al. 2008). Our c, further supporting the interaction as predicted by the MEMI method. co-evolving secondary structures by the MEMI method but not using the MI method. We do not gion but do show the D Box region, which immediately precedes CII, is less dynamic in the presence of DNA (Figure 11-17 B). involved in the dimer interface is less dynamic (Figure 11-17 C dynamic (Figure 11-18 C). Fr the already mentioned MSA hubs (139,144,195) are shown to be conserved among individual NRs, are involved in the DBD dimer interface and interact with DNA regions flanking the DNA promoter. Helix 3 contains a mutual information hub (a position with a high number of edges, or co-evolving pairs), [201:V:290], that as part of the AF-2 stabilizes H12 (Warnmark, Treuter et al. 2003) and the iates transactivation (Barroso, Gurnell et
168 al. 1999). Further examples of the importance of H3 MSA position 201 to the stabilization of H12 can be found in the literature across different NRs. A point mutation [201:S235A:VDR_HUMAN] reduces ligand-dependent transcription by 55% (Kraichely, Collins et al. 1999). Resistance to thyroid hormone in TR is attributed to the mutation [212:T277A:THB_HUMAN], with further implications in reducing co-activator interaction, where T277 is surface exposed and in close proximity to L454 and E457 in helix 12, which are known to be critical for co-activator interaction (Collingwood, Wagner et al. 1998). The mutation [201:D351Y:ESR1_HUMAN] found in MCF7 tumors was stimulated by the nonsteroidal antiestrogen tamoxifen (TAM) (Wolf and Jordan 1994). It has been shown that [201:D351:ESR1_HUMAN] plays a role in accurate positioning of helix 12 in the absence of ligand and that D351A inhibits folding into the active conformation. In the presence of ligand (estradiol) with the D351A mutation, the active conformation is restored (Anghel, Perly et al. 2000). 286 in Figure 11-11 A and 288 298 in Figure 11-11 sequence position [201:D:274], which based on structural homology should interact with H12, is more dynamic with the addition of DNA (Figure 11-12). The allosteric communication from DBD to H3 when bound to DNA could result from stabilization of the DBD dimer interface and either H5 or H7 leads to structural changes in H3. From the HDX
169 other protein complexes the symmetry could be reversed or H3 in both proteins could be impacted by the addition of DNA as an allosteric ligand. By using network theory and the properties of a scale-free network it is possible to validate a co-evolving protein interaction graph as representative of a complex network, which supports the claim that the network contains minimal random errors. We have shown that when mutation events are sampled along a consensus phylogenetic tree, the co-evolving pairs predicted using mutual information form a network with scale-free properties. When mutual information is used without adjusting for the influence of biased sequence data on the probability calculations, the resulting protein interaction network is not scale-free, which can be attributed to false positives. We used HDX analysis to show that structural changes occur in predicted co-evolving secondary structures, indicating a possible signaling pathway from the NR DBD to LBD Helix 3. Using the MEMI method we predict two of the three known NR dimer interfaces. Co-evolving sequence positions are typically expected to occur within a tertiary structure, but in this example the predicted co-evolving pairs with the highest mutual information are found in the quaternary dimer interface. The third dimer interface, located in the LBD and formed by Helix 10, was not predicted. This shows one of the weaknesses of information theory approaches: mutual information cannot be used to predict protein interactions between conserved regions. In nuclear receptors the DBD is highly conserved and the LBD is moderately conserved, and each has distinct functional attributes related to ligand affinity and the selection of targeted genes for transcription. Mutations in the LBD select for preferred heterodimer partners and have unique properties associated with the recruitment of co-activators and/or co-repressors.
Mutations in the DBD allow the recognition of specific DNA promoter regions, which in turn is dependent on a specific ligand for gene regulation. The method by which nuclear receptors bind
170 to DNA is well understood, but the ability to predict which nuclear receptors target specific genes is an ongoing field of research (Vaisanen, Dunlop et al. 2005; Montemayor, Montemayor et al. 2010). Helix A in the DBD, which binds DNA, is generally conserved across nuclear receptors by group, as shown in Figure 11-19, and does not provide the specificity for gene selection. The analysis of co-evolving pairs using the MEMI method reveals three sequence positions (139,144,195) that express overall the most information in the DBD and LBD and are conserved by nuclear receptor sub-group in the 2094 sequences used in this analysis. Because the distinct amino acid triplets found at (139,144,195) are conserved across nuclear receptor sub-groups, they should be viewed as functionally important to the unique attributes associated with each nuclear receptor. The three sequence positions, based on solved PDB structures, form a dimer interface between DBDs and could play a role in the spacing requirements for DNA recognition in conta dual use of these three sequence positions is revealed and illustrated in Figure 11-7 A. dimer interface is found sequence positions (139,144) not involved in the DBD dimer interface are in a position to omoter region. Since these three sequence positions are specific to each nuclear receptor, they provide a unique identifier that could be used for DNA flanking region of PPARE is attributed to the binding affinity of PPAR (Palmer, Hsu et al. 1995; Ijpenberg, Jeannin et al. 1997; Juge-Aubry, Pernin et al. 1997)
171 Mutual information provides a straightforward method to measure the mutual dependence between two variables and has been used in various forms in numerous research papers to predict co-evolving pairs that can indicate contact neighbors, which can then be used in predicting protein tertiary structure (Crooks and Brenner 2004; Shackelford and Karplus 2007; Horner, Pirovano et al. 2008; Swanson, Vannucci et al. 2009; Burger and van Nimwegen 2010). The analysis of multiple aligned sequence data using mutual information results in a measure of information that indicates the bits of data needed to represent the pairing patterns between two sequence positions, but it does not indicate the nature of the signal or the reason for a co-evolving pair. A co-evolving pair could also be a critical adaptation that prevents a particular folding pattern from occurring, and it would then appear as a false positive in a representative PDB structure. Network theory and the properties of a scale-free network allow the predicted protein interaction model to be validated as a complex network with non-random or biased properties. By taking a systems biology view of the sequence positions that are predicted to be co-evolving, and thus involved in some aspect of protein interaction or protein differentiation, we can gain insight into the overall dynamics of protein interactions.
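The measure described above can be illustrated with a minimal sketch. It computes plain mutual information, in bits, between two alignment columns from observed residue frequencies. It is not the MEMI method, which samples mutation events along a phylogenetic tree, and the function name and toy columns are hypothetical.

```python
from collections import Counter
from math import log2


def mutual_information(column_x, column_y):
    """Mutual information, in bits, between two alignment columns.

    column_x and column_y hold the residues observed at two MSA positions,
    one entry per sequence, in the same sequence order.
    """
    if len(column_x) != len(column_y) or not column_x:
        raise ValueError("columns must be non-empty and of equal length")
    n = len(column_x)
    count_x = Counter(column_x)
    count_y = Counter(column_y)
    count_xy = Counter(zip(column_x, column_y))
    mi = 0.0
    for (a, b), c in count_xy.items():
        p_ab = c / n
        # Compare the joint frequency with the product of the marginals.
        mi += p_ab * log2(p_ab / ((count_x[a] / n) * (count_y[b] / n)))
    return mi


# Hypothetical toy columns: residues that always change together.
covarying = mutual_information("AAGG", "LLVV")
# Hypothetical toy columns: residues that pair independently.
independent = mutual_information("AAGG", "LVLV")
# Hypothetical toy columns: the first position is fully conserved.
conserved = mutual_information("AAAA", "LLVV")
```

In this sketch a fully conserved column yields zero mutual information, which is consistent with the limitation noted for conserved regions such as Helix 10.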
172 Figure 11-1 Top 300 predicted co-evolving pairs from the MSA ranked by mutual information. The MSA was built by joining PFAM 00105 (DBD domain) and PFAM 00104 (LBD domain) via common accession id, resulting in 2094 sequences. A) MEMI method B) MI method
173 Figure 11-2 Number of edges, or degree K, per predicted co-evolving sequence position. A sequence position (node) in a predicted co-evolving pair may also be predicted to co-evolve with other sequence positions. The distribution of the number of edges, or predicted co-evolving relationships, is considered scale-free if it follows a power-law distribution: a limited number of nodes have a high number of edges and a large number of nodes have a low number of edges. A) Degree distribution using the MEMI method compared to a random distribution with the same number of nodes and edges. B) Degree distribution using mutual information without correcting for over-sampling of sequences of research interest. Figure 11-3 Degree distribution of edges per node in log scale, log(P(k)) vs log(k). A) MEMI method, where the dashed line = 2.2613 is the best fit of the observed data points. B) MI method, where the dashed line = 2.4722 is the best fit of the observed data points.
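The degree distribution and log-log fit described in these captions can be sketched as follows. This is only an illustration under stated assumptions: edges are given as pairs of MSA positions, the slope is a simple least-squares fit of log(P(k)) against log(k), and the function names and the toy edge list are hypothetical. The fitting procedure actually used for the figures is not stated here, and Clauset, Shalizi et al. (2009) in the reference list discuss more rigorous ways to fit power laws.

```python
from collections import Counter
from math import log10


def degree_distribution(edges):
    """Return {k: P(k)}, the fraction of nodes that have k edges."""
    degree = Counter()
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    nodes_per_degree = Counter(degree.values())
    n_nodes = len(degree)
    return {k: c / n_nodes for k, c in sorted(nodes_per_degree.items())}


def loglog_slope(pk):
    """Least-squares slope of log10(P(k)) against log10(k)."""
    xs = [log10(k) for k in pk]
    ys = [log10(p) for p in pk.values()]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den


# Hypothetical toy network: one hub position paired with four others.
toy_edges = [(201, 139), (201, 144), (201, 195), (201, 101)]
pk = degree_distribution(toy_edges)
slope = loglog_slope(pk)
```

A power-law exponent is usually reported as the negative of this slope, so a steeper negative slope corresponds to a larger exponent.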
174 Figure 11-4 The top 100 predicted co The first number indicates the position in the multiple sequence alignment followed by the amino acid and the sequence position in the representative protein. A) MEMI method B) MI method
175 Figure 11-5 Predicted co-evolving secondary structures bits of information Sum of MI between secondary structures for the top 300 predicted co-evolving pairs one standard deviation above the mean. Edges are labeled with the sum of mutual information. A) MEMI network: average MI between secondary structures 3.29 +/- 2.98 B) MI network: average MI between secondary structures 3.92 +/- 3.33
176 Figure 11 6 Predicted co evolving secondary structures network with minimum distance between secondary structures. Distance is measured from closest atoms bound by shown as dashed lines A) MEMI netwo interfaced to LBD of PPAR via S2 and interacting with H3 B) MI network with no clear interface between LBD and DBD based on minimal distance between secondary structures.
177 Figure 11 7 NR DBD dimer interface showing predicted co evolving pairs in MSA (139,144,195). In each structure MSA 139 is blue, 144 is red and 195 is green. A) REVERB DBD
178 Figure 11-8 1HCQ Estrogen Receptor alpha DBD homodimer showing predicted co-evolving pairs in MSA (139,144,101). MSA position 139 is blue, 144 is red and 101 is orange. Figure 11-9 MSA (139,144,195) showing conserved amino acid triplets for nuclear receptors. Each nuclear receptor group is generally clustered in columns and individual isoforms differentiate by rows where amino acid combinations are unique to a specific nuclear receptor.
179 Figure 11-10 Proposed MEMI interaction model for DBD signaling to LBD Helix 3. The hub at MSA 201:V:290 is shown to play a role in the stabilization of Helix 12, which is the primary mechanism in preventing binding with co-activator proteins (Collingwood, Wagner et al. 1998; Barroso, Gurnell et al. 1999; Kraichely, Collins et al. 1999). The predicted co-evolving pair relationships between MSA positions (139,144,195) and numerous sequence positions in Helix 3 could facilitate the repositioning of Helix 12, which prevents co-activator proteins from binding to the LBD if the DBD is not bound to DNA. Figure 11-11 MI hub, high number of edges or co-evolving pairs, [201:V:290] that has been shown to play a role in the mediates transactivation.
180 Figure 11-12 MI hub. Based on structural homology and sequence position [201:D:274] in the MSA this region can play a role in interaction with Helix 12. The region is slightly more dynamic in the presence of DNA. Figure 11-13 RXR Helix 5 peptide is slightly more dynamic.
181 Figure 11 14 presence of DNA are less dynamic. Figure 11 15 LBD RXR DBD dimer neighbor with R202 in RXR DBD Helix c
182 Figure 11 16 ic with minimal change +/ DNA as compared to Figure 11 17 addition of DNA.
183 Figure 11 18 addition of DNA.
184 Figure 11-19 Helix A responsible for DNA binding with promoter region is generally conserved across a group of nuclear receptors.
185 Table 11-1
Peptide | Charge | Start | End | Features | apo Avg %D | Theo PPRE Avg %D | Sum Diff %D | Stdev %D | Conf Value | T test
VDTEMPFWPTNF | 2 | 47 | 59 | | 37.4 | 39.3 | 12.6 | 7.8 | 4.8 | [2/6]
CRVCGDKASGF | 2 | 154 | 165 | [N TERM DBD:3 DBD CI:9] | 19.4 | 14.3 | 31.4 | 9 | 22.3 | [3/6]
KLIYDRCDL | 2 | 185 | 194 | [DBD 1:6 DBD D Box:4] | 27.2 | 24.5 | 17.1 | 7.7 | 9.4 | [3/6]
IYDRCDL | 2 | 187 | 194 | [DBD 1:4 DBD D Box:4] | 21.2 | 15.6 | 34.2 | 7.7 | 26.5 | [6/6]
IYDRCDL | 1 | 187 | 194 | [DBD 1:4 DBD D Box:4] | 23.1 | 17.7 | 35.1 | 8.2 | 26.9 | [6/6]
QKCLA | 1 | 211 | 216 | [DBD Helix b:5 DBD 2:1] | 0.2 | 1.8 | 12.4 | 5.7 | 6.7 | [3/6]
AVGMS | 1 | 215 | 220 | [DBD Helix b:1 DBD 2:2 Helix c:3] | 12.5 | 9.2 | 20.2 | 6.1 | 14.1 | [3/6]
VGMSHNAIRFGRMPQAEKEKL | 3 | 216 | 237 | [DBD 2:2 DBD Helix c:12 A Box:7 Hinge:1] | 38.9 | 29.7 | 55.5 | 5.1 | 50.4 | [5/6]
TGKTTDKSPFVIYDM | 2 | 281 | 296 | [DBD LBD Hinge:14 H1:2] | 31 | 29.4 | 11.1 | 5.4 | 5.6 | [2/6]
TGKTTDKSPFVIYDMNSLM | 2 | 281 | 300 | [DBD LBD Hinge:14 H1:6] | 36.1 | 33.6 | 17.6 | 8 | 9.6 | [3/6]
TGKTTDKSPFVIYDMNSLM | 3 | 281 | 300 | [DBD LBD Hinge:14 H1:6] | 35.7 | 32.6 | 19.9 | 6.7 | 13.2 | [2/6]
LISEGQGFMTREF | 2 | 383 | 396 | [S1:1 S2:9 H6:4] | 20 | 16.7 | 20.1 | 6.1 | 14 | [4/6]
ISEGQGFMTRE | 2 | 384 | 395 | [S2:9 H6:3] | 19 | 16.5 | 15.7 | 5.9 | 9.8 | [5/6]
ISEGQGFMTREFL | 2 | 384 | 397 | [S2:9 H6:5] | 20 | 16.7 | 20.1 | 6.1 | 14 | [4/6]
VIILSGDRPGLL | 2 | 433 | 445 | [H8:2 R8 9:7 H9:4] | 8.4 | 6.8 | 10 | 4.4 | 5.6 | [2/6]
LQKMTDL | 1 | 479 | 486 | [H10:8] | 2.7 | 0.8 | 21.7 | 2.7 | 19 | [5/6]
LQKMTDL | 2 | 479 | 486 | [H10:8] | 1.9 | 1.7 | 23.2 | 3.1 | 20 | [4/6]
LRQIVTEHVQL | 2 | 485 | 496 | [H10:4 R10 11:1 H11:8] | 4.2 | 2.9 | 9.3 | 3.8 | 5.5 | [2/6]
RQIVTEHVQLL | 2 | 486 | 497 | [H10:3 R10 11:1 H11:9] | 4.2 | 2.9 | 9.3 | 3.8 | 5.5 | [2/6]
186 Table 11-2 RXR HDX showing peptides that had a change in deuterium exchange with the addition of DNA.
Peptide | Charge | Start | End | Features | apo Avg %D | Theo PPRE Avg %D | Avg %D | Stdev %D | Conf Value | T test
FTKHICAICGDRSSGKHYGVY | 3 | 130 | 151 | [NT DBD:8 DBD CI:13 DBD Helix a:1] | 22 | 17.8 | 25.9 | 3.8 | 22.1 | [2/6]
IDKRQRNRCQY | 2 | 179 | 190 | [DBD CII:7 DBD Helix b:5] | 41.7 | 36.6 | 30.9 | 11.3 | 19.5 | [2/6]
QKCLA | 1 | 193 | 198 | [DBD Helix b:5 DBD 2:1] | 0.2 | 1.8 | 12.4 | 5.7 | 6.7 | [3/6]
KREAVQEERQRGKDRNENEVESTS | 4 | 201 | 225 | [DBD Helix c:11 A Box:7 Hinge:7] | 41.6 | 35.1 | 39.6 | 8.2 | 31.4 | [4/6]
YVEANMGLNPSSPNDPVTNICQ | 2 | 249 | 271 | [H2:3 R2 3:10 H3:10] | 40.4 | 42 | 11.5 | 6.9 | 4.6 | [3/6]
AADKQLFTL | 1 | 271 | 280 | [H3:10] | 8 | 9.5 | 9 | 3.5 | 5.5 | [3/6]
AADKQLFT | 1 | 271 | 279 | [H3:9] | 8.6 | 10.4 | 10.9 | 3.6 | 7.3 | [2/6]
AADKQLFT | 2 | 271 | 279 | [H3:9] | 7.8 | 9.6 | 12.7 | 5.1 | 7.6 | [2/6]
AADKQLFTLVE | 2 | 271 | 282 | [H3:12] | 7.9 | 9.7 | 11.3 | 3.5 | 7.8 | [2/6]
LLRLPALRS | 2 | 419 | 428 | [H10:10] | 0.5 | 0.3 | 8.8 | 2.9 | 5.9 | [2/6]
LRLPALRSI | 2 | 420 | 429 | [H10:10] | 0.5 | 0.3 | 8.8 | 2.9 | 5.9 | [2/6]
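The summed difference reported in the HDX tables can be sketched roughly as below. The sketch assumes triplicate %D measurements at the six labeling times given in the text and assumes that absolute differences between the per-time-point means are summed; whether the original analysis used signed or absolute differences is not stated here. All names and values in the example are hypothetical.

```python
# Labeling times in seconds, as listed in the HDX description.
TIME_POINTS_S = (10, 30, 60, 300, 900, 3600)


def mean(values):
    return sum(values) / len(values)


def summed_difference(apo, with_dna):
    """Sum over time points of |mean apo %D - mean +DNA %D| for one peptide.

    apo and with_dna map each labeling time to its replicate %D values.
    Taking the absolute value is an assumption of this sketch.
    """
    return sum(
        abs(mean(apo[t]) - mean(with_dna[t])) for t in TIME_POINTS_S
    )


# Hypothetical triplicate %D values for a single peptide.
apo = {t: [20.0, 21.0, 19.0] for t in TIME_POINTS_S}
with_dna = {t: [15.0, 16.0, 14.0] for t in TIME_POINTS_S}
total = summed_difference(apo, with_dna)
```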
187 CHAPTER 12 CONCLUSIONS AND FUTURE RESEARCH
The ability to understand the inner workings of a living cell and the protein interactions that guide it can lead to an unprecedented age of discovery in the early detection of diseases, personalized medicine, regenerative medicine and the fountain of youth. The information that guides every biological process originates from the DNA/RNA encoding and is the key to translating the genomic language of life. When sequencing the genome of a species becomes a routine investigative tool in biology research labs, an unprecedented amount of information will be available for computational analysis by computer scientists. Biologists using first principles of research and validation in wet lab experiments are providing the foundation of knowledge that ultimately will be defined in computer programs, which will sort through massive amounts of genomic data, furthering and redefining our base knowledge of the life of a cell. The application of mutual information to genomic data can help narrow the researcher's focus to interesting relationships that may not have been obvious or intuitive in traditional wet lab research.
Summary of Results
MEMI Method and its application on Pfam
The ability to properly apply mutual information to genomic data is key to eliminating false positives and to giving the researcher a measure of confidence that a predicted relationship between two sequence positions, though it may not be apparent, is worth understanding. Using mutual information against Pfam sequence data to predict contact pair relationships resulted in an accuracy of 56.2% with the MI method and 81.3% with the MEMI method. This is a substantial improvement, but it still leaves an 18.7% false positive rate, which indicates either an error in the application of mutual information or a measure of uncertainty related to understanding the nature of the information that is being expressed.
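The accuracy and false positive figures above can be read as the fraction of predicted pairs that are, or are not, contact neighbors in a solved structure. A minimal sketch of that bookkeeping follows. The 8.0 Å cutoff, the function name and the example distances are assumptions made for illustration and are not taken from the analysis reported here.

```python
def contact_accuracy(predicted_pairs, min_distance, cutoff=8.0):
    """Fraction of predicted pairs whose residues lie within the cutoff.

    min_distance maps each (i, j) pair to a distance in angstroms.
    The default cutoff is an assumed value, not one stated in the text.
    """
    if not predicted_pairs:
        raise ValueError("no predicted pairs to score")
    hits = sum(1 for pair in predicted_pairs if min_distance[pair] <= cutoff)
    return hits / len(predicted_pairs)


# Hypothetical predictions and distances for illustration only.
predicted = [(10, 42), (11, 57), (23, 88), (30, 31)]
distances = {(10, 42): 4.1, (11, 57): 6.9, (23, 88): 15.2, (30, 31): 3.8}
accuracy = contact_accuracy(predicted, distances)
false_positive_rate = 1.0 - accuracy
```

As the Protein CorreLogo results suggest, a pair that fails this check against a monomer structure may still be a contact neighbor across a dimer interface, so the distances would need to include inter-chain contacts to avoid counting such pairs as false positives.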
188 Protein CorreLogo
Through Protein CorreLogo a visual method was developed to display attributes associated with sequence data and predicted co-evolving pairs, providing a single view of apparent false positives. It was determined that some predicted co-evolving pairs that appeared to be false positives were actually contact neighbors in the interface of a protein structure in its dimer configuration. The error was the narrow focus on sequence data associated with a single protein family and the failure to recognize that the purpose of the sequence data that forms a protein is not the creation of a stable structure but to put a combination of amino acids on the surface of the protein to interact with other proteins.
Protein Topology of Viruses
The MEMI method was applied to the RNA virus genomes of HCV, HIV, Influenza and Dengue fever to determine the protein topology of each virus. This was a full-scale application of predicting co-evolving or compensating mutations to an entire genome. Viruses are actively researched, have comparatively short genomes, and must actively mutate to avoid the host immune system, which makes them an ideal test case for the application of the MEMI method.
HCV
The MEMI protein topology model for HCV shows strong correlation to the proposed structural organization of the HCV genome (Penin, Dubuisson et al. 2004), even though only 161 HCV genome sequences were used to determine probability distributions when calculating mutual information using the MEMI method.
Influenza
The MEMI protein topology model for Influenza was constructed from 1818 genome sequences. The MEMI model predicts that Neuraminidase (NA) only interacts with other NA proteins on the virus surface and that Hemagglutinin (HA), also located on the virus surface, only
189 interacts with other HA proteins. This clustering of HA groups and NA groups is supported by electron tomography models of Influenza (Harris, Cardone et al. 2006). In the MEMI model the top 100 scoring mutual information relationships indicate no interactions between HA and NA, whereas in the MI model 11 of the top scoring co-evolving sequence positions occur between HA and NA. The assumption is that the MEMI model is informative given that, of the top 200 scoring mutual information pairs, none exist between the HA and NA proteins even though the algorithm has no knowledge of protein boundaries.
HIV
The MEMI protein topology model for HIV accurately predicts the interface between gp120 and gp41, where the predicted co-evolving sequence positions in gp120 are located at the interface occupied by CD4 in PDB 2nxy (Zhou, Xu et al. 2007). The MI method did not predict an interface between gp120 and gp41 in the top 100 highest scoring mutual information pairs.
Dengue
The protein topology models for Dengue fever were constructed from 160 genome sequences. The resulting MI model shows a uniform distribution of interactions and appears non-informative. The MEMI model shows potentially informative relationships as a power-law distribution of the network topology. In addition, an interesting relationship is indicated in the Envelope protein. The Envelope protein has a solved trimer structure, 1THD, where the three long protein structures are parallel, forming a sheet which combines with other trimers to form the virus capsid. When the sequence positions in E that show interactions with other proteins are highlighted on the trimer, they are clustered along a straight line. The significance is that the middle protein structure in the trimer is anti-parallel to the other two structures, so only one solution exists that would allow the same sequence positions in all three structures to be located along the same axis.
The center E structure that is anti-parallel is offset and rotated 180 degrees
190 so that the sequence positions K122, K123, T120 and S229 can be found along the same axis, which appears to be informative.
Predicting functional co-evolving secondary structures
Predicting functionally important or feature-differentiating regions of proteins from multiple sequence alignments can provide new ways to extract information from the vast amounts of sequence data. The key is recognizing mutation patterns and providing methods to validate that the patterns are not random and contain information indicating the importance of the co-evolving relationships. With a purely informatics approach, using mutual information calculated from mutation events together with network theory, the researcher can take a high-level overview of secondary structures of interest in the evolution and differentiation of proteins that are functionally conserved but have unique features or attributes associated with each protein. By predicting co-evolving secondary structures, with the ability to select specific sequence positions that are viewed as hubs, the information can be used for directed mutagenesis studies. The predicted secondary structure interaction model can provide constraints in tertiary structure prediction or indicate possible protein-protein docking models. A key challenge in any prediction algorithm is validating the findings beyond general observations. We offer a straightforward approach to predicting functionally important regions of a protein with a network theory system view of the data that we hope will serve as a method to validate other co-evolving pair algorithms. We used hydrogen-deuterium exchange to show that the changes in nuclear receptor secondary structures upon DNA binding support the secondary structures that are predicted to be co-evolving in the LBD and DBD of nuclear receptors. HDX analysis of protein dynamics can be an invaluable tool to validate algorithms that predict co-evolving amino acids.
191 Other Potential Applications and Future Research
High Throughput Screening
The encouraging results from the application of MEMI to virus genomes, and the awareness that the information being expressed defines protein-protein interactions, open additional areas of research. Through high-throughput screening processes, multiple libraries of drug compounds can be applied in parallel to a specific virus strain, where the goal is to force compensating mutations between established protein interfaces. The sequence data from the two proteins of interest can then be easily collected and analyzed using MEMI to determine very specific protein interfaces. The experiments can be repeated in parallel, allowing for comparison of what appear to be random mutations defining protein topology models that can be validated with each run of the experiment.
Predicting Protein Interactions between Pfam families
Sequence data is currently collected and submitted either at the DNA level or as a sequence or sub-sequence of an expressed protein. Protein sequence data, which is the grammar of the genome, has been organized into protein families using hidden Markov models, which were first widely applied in speech recognition. What is missing in the current sequence databases is an index that allows the words or sentences that were being expressed at the time of the experiment to be referenced for analysis. It is also not practical to sequence entire genomes for a specific research experiment; instead, only the sub-genome responsible for generating the proteins in the region of interest is sequenced. Applying mutual information at the genomic level for humans alone to determine protein interactions would not offer the genome sequence diversity required for meaningful results. A larger view of evolution is required where each Pfam family has a reference sequence that was sourced from a particular species genome.
In another representative Pfam family a protein
192 sequence also exists that originated from the same species genome. By linking these two sequences in different families to the same genome, an index is created. If this process is repeated for all sequences in Pfam grouped by family, a species genome can be constructed that has organized the grammar or words and the sentences they originated from. Mutual information can then be applied to this collection of hybrid genome sequence data organized and grouped by Pfam. Compensating mutations that are predicted across Pfam families would indicate that two protein families interact, with additional reference data that gives the specific sequence positions of interest. If this technique were applied to proteins known to be expressed in nerve cells across all species, it would be possible to determine protein-protein interactions based only on sequence data. Once a protein topology is developed for proteins expressed in nerve cells, the mutations that express the most information between two sequence positions can then be mapped onto the phylogenetic tree used in the MEMI method. Features or attributes associated with one species and not another can then be isolated based on the informative mutations. This can be repeated for almost any cellular process where observed differences occur between species that have different sequence data in the same protein family.
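The linking step proposed above can be sketched as a simple join. The sketch assumes that each family alignment is available as a mapping from a species identifier to one aligned sequence, and it concatenates the sequences for species present in both families so that mutual information can be computed across family boundaries. The function name, the species keys and the short sequences are hypothetical, and real Pfam data would need a rule for species with several sequences in one family.

```python
def link_families_by_species(family_a, family_b):
    """Concatenate aligned sequences for species found in both families.

    family_a and family_b map a species identifier to one aligned
    sequence. Returns {species: sequence_a + sequence_b}.
    """
    shared = sorted(set(family_a) & set(family_b))
    return {species: family_a[species] + family_b[species] for species in shared}


# Hypothetical aligned fragments keyed by species identifier.
dbd_family = {"HUMAN": "CRVCG", "MOUSE": "CRVCG", "CHICK": "CKVCG"}
lbd_family = {"HUMAN": "LISEG", "MOUSE": "LITEG", "XENLA": "LVSEG"}
hybrid = link_families_by_species(dbd_family, lbd_family)
```

Columns 0 to 4 of each hybrid sequence then come from the first family and columns 5 to 9 from the second, so a high-scoring pair that spans the two ranges would point to a possible interaction between the families.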
193 LIST OF REFERENCES Abramowitz, M. and I. Stegun (1965). Handbook of Mathematical Functions with Formulas, Graphs and Mathematical Tables. New York, Dover. Albert, R. and A. L. Barabsi (2002). "Statistical mechanics of complex networks." Reviews of Modern Physics 74 (1): 47. Albert, R., H. Jeong, et al. (2000). "Error and attack tolerance of complex networks." Nature 406 (6794): 378 382. Angel, H. (2006). "Biomolecules in the computer: Jmol to the rescue." Biochemistry and Molecular Biology Education 34 (4): 255 261. Anghel, S. I., V. Perly, et al. (2000). "Aspartate 351 of estrogen receptor alpha is no t crucial for the antagonist activity of antiestrogens." J Biol Chem 275 (27): 20867 20872. Atchley, W. R., K. R. Wollenberg, et al. (2000). "Correlations Among Amino Acid Sites in bHLH Protein Domains: An Information Theoretic Analysis." Mol Biol Evol 17 (1 ): 164 178. Bai, Y., J. S. Milne, et al. (1993). "Primary structure effects on peptide group hydrogen exchange." Proteins 17 (1): 75 86. Barabasi, A. L. and R. Albert (1999). "Emergence of Scaling in Random Networks." Science 286 (5439): 509 512. Barabasi, A L. and Z. N. Oltvai (2004). "Network biology: understanding the cell's functional organization." Nat Rev Genet 5 (2): 101 113. Barker, D. and M. Pagel (2005). "Predicting Functional Gene Links from Phylogenetic Statistical Analyses of Whole Genomes." PLoS Computational Biology 1 (1): e3. Barroso, I., M. Gurnell, et al. (1999). "Dominant negative mutations in human PPARgamma associated with severe insulin resistance, diabetes mellitus and hypertension." Nature 402 (6764): 880 883. Bartosch, B., J. Dubuisson, et al. (2003). "Infectious Hepatitis C Virus Pseudo particles Containing Functional E1 E2 Envelope Protein Complexes." J Exp Med 197 (5): 633 642. Bateman, A., E. Birney, et al. (2002). "The Pfam Protein Families Database." Nucleic Acids Res 30 (1): 276 280. Betts, M. J. and R. Russell (2003). Amino Acid Properties and Consequences of Substitutions. 
Bioinformatics for Geneticists. M. R. Barnes and I. C. Gray: 289-316. Bindewald, E., T. D. Schneider, et al. (2006). "CorreLogo: an online server for 3D sequence logos of RNA and DNA alignments." Nucleic Acids Res 34 (suppl_2): W405-411.
194 Branden, C. and J. Tooze (1999). Introduction to Protein Structure New York, Garland Publishing. Brass, V., E. Bieck, et al. (2002). "An Amino terminal Amphipathic alpha Helix Medi ates Membrane Association of the Hepatitis C Virus Nonstructural Protein 5A." J. Biol. Chem. 277 (10): 8130 8139. Bruning, J. B., M. J. Chalmers, et al. (2007). "Partial agonists activate PPARgamma using a helix 12 independent mechanism." Structure 15 (10): 1258 1271. Buck, M. J. and W. R. Atchley (2005). "Networks of Coevolving Sites in Structural and Functional Domains of Serpin Proteins." Molecular Biology and Evolution 22 (7): 1627 1634. Burger, L. and E. van Nimwegen (2010). "Disentangling Direct from Ind irect Co Evolution of Residues in Protein Alignments." PLoS Comput Biol 6 (1): e1000633. Busenlehner, L. S. and R. N. Armstrong (2005). "Insights into enzyme structure and dynamics elucidated by amide H/D exchange mass spectrometry." Archives of Biochemistr y and Biophysics 433 (1): 34 46. Carlos Amor, J., D. H. Harrison, et al. (1994). "Structure of the human ADP ribosylation factor 1 complexed with GDP." Nature 372 (6507): 704 708. Chakrabarti, S. and A. R. Panchenko (2010). "Structural and functional roles o f coevolved sites in proteins." PLoS ONE 5 (1): e8591. Chalmers, M. J., S. A. Busby, et al. (2006). "Probing protein ligand interactions by automated hydrogen/deuterium exchange mass spectrometry." Anal Chem 78 (4): 1005 1014. Chalmers, M. J., S. A. Busby, e t al. (2007). "A two stage differential hydrogen deuterium exchange method for the rapid characterization of protein/ligand interactions." J Biomol Tech 18 (4): 194 204. Chamberlain, A. K. and S. Marqusee (1997). "Touring the landscapes: partially folded pr oteins examined by hydrogen exchange." Structure 5 (7): 859 863. Chandonia, J. M. and S. E. Brenner (2006). "The Impact of Structural Genomics: Expectations and Outcomes." Science 311 (5759): 347 351. Chandra, V., P. Huang, et al. (2008). 
"Structure of the i ntact PPAR [ggr] RXR [agr] nuclear receptor complex on DNA." Nature : 350 356. Chen, C. P., S. R. Chern, et al. (1999). "Androgen receptor gene mutations in 46,XY females with germ cell tumours." Hum Reprod 14 (3): 664 670. Chevaliez, S. and J. Pawlotsky (20 06). HCV Genome and Life Cycle. Hepatitis C viruses: genomes and molecular biology S. L. Tan. Wymondham, U.K., Horizon Bioscience.
BIOGRAPHICAL SKETCH
Homer Floyd Willis IV was born on January 6, 1968, in Miami, Florida, and was given the [...] moved to Jacksonville, Florida, where he graduated from Sandalwood High School in 1985. Scooter enrolled at the University of Florida in the fall of 1985 to pursue a degree in electrical engineering. Outside the classroom, the experience of college was memorable, but the slow pace of academic lectures and learning by taking tests was not to his liking. A plan was put in place to graduate in three and a half years and enter the workforce as an electrical engineer. In the spring of 1988, Scooter was elected Student Body President as the result of a dispute between the Greek-run Student Government and the funding of the Engineer Fair. The demands of politics and the creation of the University of Florida All-in-One student ID (GatorOne) card derailed any hopes of graduating anytime soon. After finishing his duties as Student Body President, Scooter started Global Digital Solutions, Inc. to develop, support and market the GatorOne card. The core class that kept Scooter from graduating was tech writing, where the grade was based on attendance, something very difficult to do when starting a business that required excessive travel. In 1998, Scooter and his soon-to-be wife, Karen, took a year off from work to travel the world. The only work requirement during the trip was to take tech writing by correspondence. The final paper was mailed in from some place in India, and Scooter was awarded a bachelor of science in electrical engineering in the fall of 1998. During the dot-com period of 2000, Scooter was chief technical officer of a well-financed online virtual reality company. After the company was closed and the dot-com bubble had burst, Scooter decided to return to school for an executive MBA for engineers and scientists. He moved to Gainesville and began working for UF as the technical architect for a PeopleSoft ERP implementation. He was awarded the MBA in the summer of 2003. With a renewed focus on learning, he started taking classes in computer science, exploring the options of earning a Ph.D. in computer engineering. He began taking computer science classes in the fall of 2002 while enrolled in the MBA program. In the fall of 2003, he began taking a full-time load of classes in computer science. He passed his written qualification exam on the first attempt in the spring of 2005 and was admitted to candidacy in the spring of 2006. In January of 2009, he began working at Scripps Florida in the Omics Informatics department, where he is involved in numerous research projects related to the study of nuclear receptors.