Citation
Statistical Methods for Analyzing Genomics Data

Material Information

Title:
Statistical Methods for Analyzing Genomics Data
Creator:
Sikdar, Sinjini
Place of Publication:
[Gainesville, Fla.]
Florida
Publisher:
University of Florida
Publication Date:
Language:
english
Physical Description:
1 online resource (114 p.)

Thesis/Dissertation Information

Degree:
Doctorate ( Ph.D.)
Degree Grantor:
University of Florida
Degree Disciplines:
Biostatistics
Committee Chair:
DATTA,SUSMITA
Committee Co-Chair:
DATTA,SOMNATH
Committee Members:
ZOU,FEI
MCINTYRE,LAUREN M

Subjects

Subjects / Keywords:
genes -- genomics -- meta-analysis -- pathways
Biostatistics -- Dissertations, Academic -- UF
Genre:
bibliography ( marcgt )
theses ( marcgt )
government publication (state, provincial, terriorial, dependent) ( marcgt )
born-digital ( sobekcm )
Electronic Thesis or Dissertation
Biostatistics thesis, Ph.D.

Notes

Abstract:
My doctoral dissertation comprises of three different projects that address three different aspects of statistical analysis of genomics data. The first project describes a novel statistical approach for identification of 'master regulator' transcription factor in a genome. A 'master regulator' transcription factor, being at the top of the hierarchy of the transcriptomic regulation, often appears to control most of the regulatory activities of the other transcription factors and the associated genes which are known to play important roles in a disease process. Therefore, it is important to identify and target the master regulator transcription factor for proper understanding of the associated disease process. Through simulated scenario and real dataset analyses, we show that our method performs well in validating the existence of a master regulator, and identifies biologically meaningful master regulators. The second project involves an integrated analysis using multiple cancer datasets for investigating the significance of the biological pathways which are interrupted by cancer-causing genetic mutations. This dataset consists of expression profiles for genes/proteins of patients receiving treatment, for three types of cancer - Head and Neck Squamous Cell Carcinoma, Lung Adenocarcinoma and Kidney Renal Clear Cell Carcinoma. We consider pathway analysis to identify all the biological pathways which are active among the patients and investigate the roles of the significant pathways using a differential network analysis of the protein expression datasets for the three cancers separately. We then integrate the pathway based results of all the three cancers which provide a more comprehensive picture of the three cancers. In the third project, we develop a novel meta-analysis method of combining p-values from different independent experiments involving large-scale multiple testing frameworks. Adhering to the regular statistical assumptions regarding the null distributions of test statistics can lead to incorrect significance testing results and biased inference in large-scale multiple testing frameworks when results from different independent genomic experiments are combined. In order to overcome this, our proposed method takes into account empirical adjustments of the individual test statistics and p-values. Consequently, our method outperforms the standard meta-analysis approach of significance testing as shown in simulation studies and real genomic data analyses. ( en )
General Note:
In the series University of Florida Digital Collections.
General Note:
Includes vita.
Bibliography:
Includes bibliographical references.
Source of Description:
Description based on online resource; title from PDF title page.
Source of Description:
This bibliographic record is available under the Creative Commons CC0 public domain dedication. The University of Florida Libraries, as creator of this bibliographic record, has waived all rights to it worldwide under copyright law, including all related and neighboring rights, to the extent allowed by law.
Thesis:
Thesis (Ph.D.)--University of Florida, 2017.
Local:
Adviser: DATTA,SUSMITA.
Local:
Co-adviser: DATTA,SOMNATH.
Electronic Access:
RESTRICTED TO UF STUDENTS, STAFF, FACULTY, AND ON-CAMPUS USE UNTIL 2018-02-28
Statement of Responsibility:
by Sinjini Sikdar.

Record Information

Source Institution:
UFRGP
Rights Management:
Applicable rights reserved.
Embargo Date:
2/28/2018
Classification:
LD1780 2017 ( lcc )

Downloads

This item has the following downloads:


Full Text

PAGE 1

STATISTICAL METHODS FOR ANALYZING GENOMICS DATA By SINJINI SIKDAR A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY UNIVERSITY OF FLORIDA 2017

PAGE 2

2017 Sinjini Sikdar

PAGE 3

Dedicated to my husband, Sandipan, who has been a constant source of support and encouragement during the challenges of my PhD life

PAGE 4

4 ACKNOWLEDGMENTS I am deeply grateful to my advisor Professor Susmita Datta for her unyielding support and invaluable guidance throughout my graduate work. It would not have been possible for me to reach this po int without her support. I also want to express my sincere thanks to Professor Somnath Datta as I learnt a lot through my interactions with him and benefited from his expert suggestions. I am also than kful to my other dissertation committee members Professor Fei Zou, and Professor Lauren McIntyre for all their kind support and constructive comments regarding my dissertation. I want to thank Professor Ryan Gill from University of Louisville for his helpf ul contributions to my research works. I also want to thank all the faculty, staff, and students of the Department of Biostatistics at University of Florida as well as all the faculty and students of the Department of Bioinformatics and Biostatistics at Un iversity of Louisville who have helped me reach this point. I would like to thank my sister Shreejata Sikdar for her invaluable friendship and constant support throughout my PhD life. Finally, I want to thank my parents for their immense encouragement t hat have helped me in moving forward in my academic life.

PAGE 5

5 TABLE OF CONTENTS page ACKNOWLEDGMENTS ................................ ................................ ................................ ............... 4 LIST OF TABLES ................................ ................................ ................................ ........................... 7 LIST OF FIGURE S ................................ ................................ ................................ ......................... 8 ABSTRACT ................................ ................................ ................................ ................................ ..... 9 CHAPTER 1 INTRODUCTION AND LITERATURE REVIEWS ................................ ............................ 11 1.1 Identification of the Master Regulator Transcription Fac tor ................................ ........... 11 1.2 Meta Analysis of Differential Protein Expression Networks of Cancer Signaling Pathways in Three Different Cancers ................................ ................................ ................ 14 1.3 Meta analysis Approach for Large scale Simultaneous Hypothesis Testing in Genomic Experiments ................................ ................................ ................................ ......... 15 2 A NOVEL STATISTICAL APROACH FOR IDENTIFICATION OF THE MASTER REGULATOR TRANSCRIPTION FACTOR ................................ ................................ ....... 19 2.1 Methods ................................ ................................ ................................ ............................ 19 2.1.1 Biological Considerations ................................ ................................ ...................... 19 2.1.2 Identification of the Master Regulator through a Hypothesis Testing Framework ................................ ................................ ................................ ................... 21 2.2 Simulations ................................ ................................ ................................ ....................... 25 2.2.1 Data Generation ................................ ................................ ................................ ...... 26 2.2.2 Size of the Test ................................ ................................ ................................ ....... 28 2.2.3 Power of the Test ................................ ................................ ................................ .... 29 2.3 Data Analysis ................................ ................................ ................................ .................... 34 2.3.1 Prostate Cancer Data ................................ ................................ .............................. 34 2.3.2 Colorectal Cancer Data ................................ ................................ ........................... 36 2.4 Discussion and Conclusion ................................ ................................ ............................... 37 3 EXPLORING THE IMPORTANCE OF CANCER PATHWAYS BY META ANALYSIS OF DIFFERENTIAL PROTEIN EXPRESSIO N NETWORKS IN THREE DIFFERENT CANCERS ................................ ................................ ................................ ....... 40 3.1 Methods ................................ ................................ ................................ ............................ 40 3.1.1 Pathway Analysis ................................ ................................ ................................ ... 41 3.1.2 Differential Network Analysis ................................ ................................ ............... 41 3.1.3 Rank Aggregation ................................ ................................ ................................ ... 43 3.2 Results ................................ ................................ ................................ ............................... 45 3.2.1 Pathway Analysis Results ................................ ................................ ....................... 45

PAGE 6

6 3.2.2 Differential Network Analysis Results ................................ ................................ ... 45 3.2.3 Rank Aggregation Results ................................ ................................ ...................... 48 3.3 Discussion and Conclusion ................................ ................................ ............................... 50 4 EAMA: EMPIRICALLY ADJUSTED META ANALYSIS FOR LARGE SCALE SIMULTANEOUS HYPOTHESIS TESTING IN GENOMIC EXPERIMENTS ................. 53 4.1 Methods ................................ ................................ ................................ ............................ 53 4.1.1 Empiricall y Adjusted Meta Analysis (EAMA) ................................ ...................... 53 4.1.2 Estimation of the empirical null distribution ................................ .......................... 55 4.2 Simulation Studies ................................ ................................ ................................ ............ 55 4.2.1 Generation of Continuous Data (Microarray Based Gene Expression) ................. 57 4.2.1.1 Correlated gene expression levels ................................ ................................ 61 4.2.1.2 Reduction of the number of experim ents ................................ ................ 62 4.2.1.3 Reduction in the difference in magnitude of the expression levels of the genes ................................ ................................ ................................ ..................... 64 4.2.1.4 Increase in the number of genes ................................ ............................... 65 4.2.1.5 Presence of a hidden variable which does not act as a confounder .............. 67 4.2.2 Generation of count data (NGS base d gene expression) ................................ ........ 68 4.3 Data Analysis ................................ ................................ ................................ .................... 69 4.4 Discussion ................................ ................................ ................................ ......................... 76 APPENDIX. R SCRIPTS ................................ ................................ ................................ .............. 78 A.1 R Sc ript for Identification of Master Regulator Transcription Factor ............................. 78 A.2 R Script for Meta Analysis of Differential Protein Express ion Networks ...................... 86 A.2.1 Pathway based Analysis ................................ ................................ ........................ 86 A.2.2 Individual Protein based Analysis ................................ ................................ ......... 92 A.3 R Script for EAMA ................................ ................................ ................................ ......... 99 LIST OF REFERENCES ................................ ................................ ................................ ............. 103 BIOGRAPHICAL SKETCH ................................ ................................ ................................ ....... 114

PAGE 7

7 LIST OF TABLES Table page 2 1 The list of top ten transcription factors which are highly correlated with the two ................................ ................................ ......... 35 3 1 values obtained from differential network analysis for each cancer type. ................................ ................................ ............................ 45 3 2 values for each cancer type along with the overall ordering. ................................ ................................ ................................ ............................. 50 4 1 Performances of EAMA and the nave method with 1000 uncorrelated genes, absolute differences in differential expressions as 8 and reduced number of experiments. ................................ ................................ ................................ ....................... 65 4 2 Performance assessment of EAMA and the nave method where a hidden variable does not act as a confounder. ................................ ................................ ............................. 68 4 3 The performances of EAMA and that of the nave method using the simulated count datasets. ................................ ................................ ................................ .............................. 69 4 4 The number of patients in each of the two lung cancer types within each dataset. ........... 70 4 5 The grouping of the genes, identified by EAMA and the nave method, according to the signaling pathways. ................................ ................................ ................................ ...... 73

PAGE 8

8 LIST OF FIGURES Figure page 2 1 A toy example showing possible regulatory network across a set of genes and transcription factors. ................................ ................................ ................................ .......... 20 2 2 The power curve with 500 subjects in each group for several choices of ...................... 31 2 3 Plot of the power curves for different choices of the sample sizes with several choices of using simulated datasets. ................................ ................................ .............. 32 2 4 Plot showing the power performance of our test in presence of two independent master regulators with varying using simulated datasets. ................................ ............. 33 3 1 Network structure for RAS signaling pathway in Head and Neck Squamous Cell Carcinoma (HNSC) ................................ ................................ ................................ ............ 46 3 2 Network structure for PI3K signaling pathway in Lung Adenocarcinoma (LUAD) ......... 48 3 3 well as combined. ................................ ................................ ................................ ............... 49 4 1 Performance assessment with 10 experiments, 1000 uncorrelated genes and absolute differences in differential expressions as 15. ................................ ................................ ..... 60 4 2 Performances of EAMA and the nave method with 10 experiments, 1000 correlated genes and absolute differences in differential expressions as 15. ................................ ...... 63 4 3 Performances of EAMA and the nave method with 1000 uncorrelated genes, absolute differences in differential expressions as 15, and (A) number of experiments was 8 (B) number of experiments was 4. ................................ ................................ ........... 63 4 4 Performance assessment of EAMA and the nave method with 10 experiments, 1000 uncorrelated genes and absolute differences in differential expressions as 8. ................... 65 4 5 Performance assessment of EAMA and the nave method with 10 experiments, 5000 uncorrelated genes and absolute differences in differential expressions as 15. ................. 66 4 6 The histograms of the original and modified z scores for the lung cancer datasets. ......... 71 4 7 The violin plots of the gene with ID 472 for the two cancer types in each of the five datasets. ................................ ................................ ................................ .............................. 74 4 8 The violin plots for the gene with ID 8200 for the two cancer types in each of the five datasets ................................ ................................ ................................ ........................ 75

PAGE 9

9 Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy STATISTICAL METHODS FOR ANALYZING GENOMICS DATA By Sinjini Sikdar August 2017 Chair: Susmita Datta Major: Biostatistics My doctoral dissertation comprises of three different projects that address three different aspects of statistical analysis of genomic s data. The first t op of the hierarchy of the transcriptomic regulation, may control the regulatory activities of the other transcription factors and the associated genes. For example, in cases of systemic disease, cellular function is disrupted. We hypothesize that these ma y be the result of a regulatory Therefore, it is important to identify and target the master regulator transcription factor for proper understanding of the associat ed disease process. Through simulated scenario and real dataset analyses, we show that our method performs well in validating the existence of a master regulator, and identifies biologically meaningful master regulators. The second project involves an integrated analysis using multiple cancer datasets for investigating the significance of the biological pathways which are interrupted by cancer associated genetic mutations. This dataset consists of expression profiles for genes/proteins of patients recei ving treatment, for three types of cancer Head and Neck Squamous Cell

PAGE 10

10 Carcinoma, Lung Adenocarcinoma and Kidney Renal Clear Cell Carcinoma. We consider pathway analysis to identify all the biological pathways which are active among the patients and inves tigate the roles of the significant pathways using a differential network analysis of the protein expression datasets for the three cancers separately. We then integrate the pathway based results of all the three cancers which provide a more comprehensive picture of the three cancers. In the third project, we develop a novel meta analysis method of combining p values from different independent experiments involving large scale multiple testing frameworks. Adhering to the regular statistical assumptions rega rding the null distributions of test statistics can lead to incorrect significance testing results and biased inference in large scale multiple testing frameworks when results from different independent genomic experiments are combined. In order to overcom e this, our proposed method takes into account empirical adjustments of the individua l test statistics and p values. Consequently, our method outperforms the standard meta analysis approach of significance testing as shown in simulation studies and real ge nomic data analyses.

PAGE 11

11 CHAPTER 1 INTRODUCTION AND LITERATURE REVIEWS This dissertation is based on three different projects that are linked through the analysis of genomic datasets. The first project involves a novel two step statistical approach for activities o f the other transcription factors and the associated genes. In the second project, we explore the importance of cancer pathways through the integrative analyses of differential protein expression networks in three different cancers. A third project is als o proposed that aims to develop a meta analysis approach for large scale multiple hypothesis testing framework in genomic datasets. 1.1 Identification of the Master Regulator Transcription F actor Through several scientific findings, it has been suggested that cancer is mainly caused by the mutations in certain genes [2 4 ] The genes, which can transform a cell into tumor cell, are So, for effective understanding of cancer, identification of these mu tated genes (oncogenes) is essent ial. Genomic studies of different cancer datasets lead s to identification of several potential oncogenes which are directly or indirectly responsible for development and progression of cancer [ 5 7 ]. With the advent of modern technology, high throughput dat a analysis typically yields a list of differentially expressed genes or proteins that may have potential to play important roles in a given phenomenon or phenotype. However, It is a very challenging task to target and individually study all of these genes as the y are large in number [8 10 ]. One way to identify patterns is to group the proteins and genes belonging to the same pathway [8, 10 ]. These genes and their corresponding pathways form networks that are Reprinted with permission from Sikdar S, Datta S. A novel statistical approach for identification of the master regulator transcription factor. BMC Bioinformatics. 2017; 18: 79.

PAGE 12

12 hypothesized to control various cellular function s, and there has been sufficient interest in analyzing such pathway based networks [ 11 14 ] However re cent findings suggest that oncogenes and tumor suppressor genes may which play key roles in carcinogenesi s [15 18 ]. Cancer signaling pathways seem to ide ntify transcription factors [15, 19 ], which lead to tumor development, progression and cancer metastasis through the controlling of th e gene expression patterns [15, 16]. As suggested by [15, 17, 20 22 ], some of the major groups of transcription factors, which have been identified for cancer, are the steroid receptors (e.g. estrogen receptors in breast cancer, androgen receptors in prostate cancer), resident nuclear proteins activated by kinase cascades, and t he STAT protein family members. Apart from these, the ETS protein family members have also been identified as potential cancer transcription factors for their eme rging roles in human cancer [23 25 ]. It has been shown that direct suppression of transcripti on factors expressions which serve as the convergence points of oncogenic signaling and are functionally altered in many cancers, can lead to significant antitumor responses with minimal side effects, and targeting these transcription factors in tumor related immune cells can help in recovering from tumor immunoresistance [26 ]. As a result of these features of the transcription facto rs, in addition to the facts that they are much smaller in number than the oncogenes and have well regulated expression and activities, transcription factors are gaining popularity as potential therapeutic targets in anti cancer drug development [16, 17, 2 7, 28 ]. In the recent past, studies have identified a transcription factor or a group of transcription factors as the driving force behind the development of a biological or disease process [ 29 35 ] In order to facilitate such detection there have been at tempts to develop statistical methods for accurate identification of transcription factors that regulate large number of genes [ 36 42 ]. To

PAGE 13

13 this end methods have been attempted for identification of transcription factors and transcription factor binding sit es in cell cycle of yeast and similar organisms using multiple data sources [36 41 ] and to human cell culture [42 ] In addition, there have been efforts in developing statistical tools for identifying a cluster of transcription factors that cooperatively regulates a large number of genes and the as sociated disease process [43, 44 ] Methods have also been developed for identifying differentially regulated gene sets by integrating regulatory networks of transcription factors and gene expression data [45 ] Al so, transcription factor activities have been estimated through their effect on target genes [46 ] The importance of transcription factor regulation is also evident from methods that have been developed for identifying coordinately activated functional mo dules from gene expression data [47, 48 ] These methods assume that the transcription factor regulated target genes are differentially expressed from non target gene s i n the same functional module [47, 48 ] In fact there have been several studies for ident ifying transcription factors under the assumption that co expression indicates co regulation [49 51 ] The main idea behind such transcription factor regulation is that genes regulated by such transcription factors should have, on an average, significantly different expression levels durin g o ne or more cell c ycle phases [52 ] Besides, there have been studies for identifying groups of important transcription factors through integration of different genomic and epigenomic features [53 ] and integration of tra nscriptional and protein interaction networks [54 ] Most of these recent methods, including that of [55] and [56 ] have been directed towards identification of a group of candidate driver transcription factors. Despite the fact that in most cases there are a group of transcription factors that regulate the oncogenes and hence the disease process, it has been seen tha t there is a hierarchical structure in the regulatory activities of these transcription

PAGE 14

14 regulatory activities of the other transcription f act ors and the associated genes [57 59 ] According to the definition provided in [58 ] top of a regulatory hierarchy and must not be under the regulatory influence of any other gene or transcription factor. We use this definitio transcription factor. This master regulator transcription factor can be targeted for proper understanding of the associated disease process and can be used as a biomarker. In current literature there is a la ck of appropriate statistical methods which use a single data source for accurate identification of such a master regulator among a set of identified transcription factors. Motivated by this, we develop a novel two step statistical approach to test for the existence of a master regulator transcription factor and for subsequent identification of the master regulator, if it exists, from gene expression data alone. 1.2 Meta Analysis of Differential Protein Expression Networks of Cancer Signaling Pathways in Th ree Different Cancers known to be directly or indirectly responsible for caus ing selective growth advantage. These ling pathw ays. According to [60 ] this selective growth advantage can occur only through these twelve signaling pathways. We refer these cellular processes ] We believe understanding the roles of these twelve pathways can result into novel therapeutic intervention strateg cancers namely, Head and Neck Squamous Cell Carcinoma (HNSC), Lung Adenocarcin oma

PAGE 15

15 (LUAD) and Kidney Renal Clear Cell Carcinoma (KIRC). We find the study of protein protein products. These datasets are available to us from International Can cer Genomic Consortium (ICGC) as part of the CAMDA 2015 challenge data. We pursue an integrative analysis of protein expressions of all these three cancer datasets to investigate whether each of ee cancers. For example, we determine whether the proteins associated with these pathways interact differently between the The differentially expressed pathways between the two diseas e groups allow us to gain more insights about the functional working mechanism of the cells than just the individual differential ly expressed genes/proteins [61 ] The process begins with grouping of the proteins according to their biological pathways. Then we examine whether the network structures of all the proteins in We also examine whether the connectivity of each single protein in the netwo Then we would then rank the pathways by their global order of importance with re spect to all the three cancers. This ranking may shed light to the regulatory roles of individual proteins in the context of all others in the pathway. 1.3 Meta analysis Approach for Large scale Simultaneous Hypothesis Testing in Genomic Experiments In genomic experiments and association studies, meta analysis is a popular tool for pooling results from multiple experiments and studies in order to reach an overall decision. In recent times, rapid progress in technology has led to major development of h igh throughput

PAGE 16

16 genomic assays. This means that hundreds and thousands of genes are now being analyzed at the same time. As a result, the level of simultaneous inference has undergone a huge surge over the last decade. Development of novel meta analysis app roaches is crucial in order for such settings since the sample size of individual experiments are generally small compared to the number of genes leading to low power of statistical detection from them after adjusting for multiplicities. However, there has not been much change in the meta analysis methods to accommodate this large scale aspect of the underlying inference and the possibility of underlying hidden factors that act as confounders. For example, in testing for the significance of genes in disease studies, more or less the same meta analysis methods are being applied to experiments involving hundreds and thousands of genes as were initially developed for experiments involving a few number of candidate genes. One common practice is to use Fis ethod [62 ] for combining the p values from different testing problems involving the same overall hypothesis. The individual p values are calculated for one gene at a time and the negative log transformed p values are combined to form a chi squared test sta tistic under the assumption that they are individually uniformly distributed under the null hypothesis. However, as pointed out by Efron [63 ], in large values may not be u are needed to ensure that all the p values from individual experiments are uniforml y distributed values can be properly implemented. In single hypothesis test framework the main aim is to reject the null hypothesis in favor wer, say 80%. But in large scale multiple testing framework involving, say, 10,000 hypotheses related to 10,000 genes, rejection

PAGE 17

17 of 80% of 10,000, i.e. 8,000, null hypotheses is no longer a desired outcome. Rather, the aim of such large scale testing frame usually less than 10%, which can be pursued for further investigation. The advantage of having large number, e.g., 100 or more, of hypotheses over a single hypothesis is that it enables the e stimation of empirical null distribution avoiding the dependency on theoretical asymptotic null distribu t ion, as pointed out by Efron [63 ]. The use of this empirical null is more appropriate for addressing the goal of large scale hypotheses testing problem s. This is particularly relevant in large observational studies which are often characterized by the presence of unobserved variable effects (e.g., batch effects) or unmeasured/missed confounding factors. Unlike the theoretical null, the empirical null dis tribution, automatically, takes into account the effects of the additional variation (and also small to moderate biases). This has an even more serious consequence in meta analyses of large scale genomic experiments where a number of potentially low powere d study results are combined in order to achieve significance. However, a p value combination method [62 ] may lead to incorrect findings if at least one of such studies contains hidden sources of variation leading to a violation of the theoretical null distributional assumption for that component study. In particular, a possible consequence of combining unadjusted or incorrectly adjusted p v a ] in a large scale multiple testing situations is the preval ence of large number of false discoveries when some underlying hidden variable plays the role of a confounder. Occurrences of false discoveries are common in large scale DNA microarray experiments where the aim is to detect genes that are differentially ex pressed between two or more biological conditions. A good example in this context can be the well know n study of [64 ] which aimed to identify the differentially expressed genes between two types of genetic mutation

PAGE 18

18 analyzed, out of which 51 genes turned out to be significant initially at the p value cut off of 0.001. Later on it was shown that the chosen cut off is expected to produce substantial false positives, and the authors had to lower the cut off resulting in the lower number of significant genes. Details of this st udy results can be found in [65 ]. This highlights the possibility of inaccurate scientific conclusions due to the occurrences of false discoveries in lar ge scale experiments. A possible remedy to this problem is to make adjustments using the false discovery rate (FDR), developed by [66 ], which have been widely used in methods for analyzing d ata from g enomic experiments [ 67 70 ]. However, using the aforement ioned breast cancer microarray data, Efron [63 ] showed that even adjusting for FDR may not be enough to restrict false discoveries if the underlying assumption of the standard normal distribution of the test statistic is under suspicion. Instead, he advoca a local false discovery rate calculation. Motivated by this, we develop a meta analysis method called Empirically Adjusted Meta analysis (EAMA) that does not combine the raw p values t hey are first transformed, where the amount of transformation depends on the discrepancy between the empirical and the theoretical null (e.g., uniform distribution in case of p values), before they are Fisher combined. Of course, a multiple hypothesis meth od such as Benjamini Hochberg (1995) [66 ] is applied at the end to make the significance calls. We show that this procedure is very effective in reducing the FDR and increasing specificity at the expense of a slight reduction in sensitivity in a variety of situations which are affected by the presence of some hidden confounders.

PAGE 19

19 CHAPTER 2 A NOVEL STATISTICAL APROACH FOR IDENTIFICATION OF THE MASTER REGULATOR TRANSCRIPTION FACTOR 2.1 Methods In this chapter, we present our proposed statistical approach fo r identification of master regulator transcription factor. We first discuss the biological considerations that motivated the development of our method. We then provide the steps to formulate our test statistic and use it for the identification of the regul atory circuit of the transcription factors and genes. Finally, we identify the master transcription regulator at the top of the regulation hierarchy. 2.1.1 Biological Considerations Important biological processes can have multiple layers of regulation and control. A transcription factor is known to control not only genes but also other transcription factors. As discussed before in section 1.1 usually there is a hierarchical structure in the regulation of the transcription factors so that the master reg ulator controls most of the regulatory activities of the other transcription factors and the associated genes. In this article, we aim to identify the master regulator transcription factor which is at the top of the hierarchy for better understanding of th e associated disease process. A toy example is shown in Figure 2 1 which shows the possible regulatory network across a set of genes and transcription factors in a genome. In Figure 2 1, denote the transcription factors and denote the set of genes. Suppose directly regulates five of the genes, which are and also all the other three transcription factors, The transcription factor regulates the genes Similarly, the transcription factor regulates the genes and finally, the transcription factor regulates the Reprinted with permission from Sikdar S, Datta S. A novel statistical approach for identification of the master regulator transcription factor. BMC Bioinformatics. 2017; 18: 79.

PAGE 20

20 genes In this example, there exists a hierarchical structure with three layers. We have at the top of the hierarchical structure as it directly or indirectly regulates the other transcription factors and the genes. So, is considered to be the first layer of the hierarchy. Now, directly regulates the other transcription fact ors, and the genes So, and are considered to be at the second layer of the hierarchy. regulates the genes indirectly through the transcription factors Thus, the genes form the third layer of the hierarchy. In this example, directly or indirectly regulates all the layers of the hierarchy and is not under the regulatory influence of any other gene or trans cription factor. Therefore, according to the definition, can be considered as the master re gulator transcription factor. Figure 2 1 A toy example showing possible regulatory network across a set of genes and transcription factors

PAGE 21

21 Here, in this project we attempt to develop a test that can check if there exists any transcription factor that acts as a master regulator in a genome, and identify such a master regulator if present. The details of our proposed method are given in the next section. 2.1.2 Ide ntification of the Master Regulator through a Hypothesis Testing Framework Let denote the total number of transcription factors present in a genome. Let the transcription factors be denoted by Let the genes, which are not transc ription factors, be denoted by where denotes the total number of such genes. So, in total, we have expression data on genes. Let us assume that there are two groups of subjects, for example, the case group (the disease group) and the control group (non disease group). Let there be subjects in the case group and subjects in the control group. So, in other words, we have two groups of subjects with expression levels for features in each group. It is well known that the genes including the transcription factors are expressed differently in the two groups. Additionally, they are connected with one another differently in the networks of the two groups. There are methods to reverse engineer the networks of genes with association measures such as correlations, partial correlations and partial lea st squares regression scores [71, 72 ] These inter genomic connectivity are different in two groups and can be detected by statistica l methods such as Diff ere ntial Network Analysis (DNA) [73 ] Since, it is believed that the master regulator maximally controls the other transcription factors as well as the associated genes; it is important to find the regulatory network among the transcr iption factors and also the degree of regulation of all the transcription factors on the genes. We first measure the degree of regulation of the transcription factors on the genes. The degree of regulation of a transcription factor on the genes is measured by the change in connectivity of the genes it regulates in the two networks. In other words, we find how the connectivity of a

PAGE 22

22 transcription factor with the genes differs between the subjects in case and control groups. For this, we estimate the change in connectivity of a transcription factor with the genes in the two groups of samples using connectivity scores of the given transcription factor with all the genes in the case group with that in the control group. The difference in connectivity is measured u sing the following statistic [73 ] : ( 2 1 ) where denotes the set of all the genes, i.e., the cardinality of is Here, are the connectivity scores between the transcription factor and the gene in the case and control groups, respectively. There are several choices for connectivity scores such rrelation scores, partial least square based ass ociation scores. In this project the give us an idea about the magnitude of the differential regulation of the transcription factors on the genes between the case and control groups. Next, we find the regulatory structure among the transcription factors. For this, we measure the association between each pair of transcription factors using the coefficient s cores between them. For each pair let denote the absolute value of the and ; where Note that this calculation is done by po oling the data from both the groups. At this stage, for a transcription factor we have two measures: a measure of the differential regulation of on the genes (given by ); and a measure of the association of wit h all the transcription factors ( ).

PAGE 23

23 We argue that the degree of change in connectivity of the genes in the two networks is controlled by the transcription factors which are correlated amongst themselves in a hierarchical manner. That is, the hierarchical regulation pattern (as measured by the rank order) among the transcription factors is the same with the differential connectivity of genes in the two groups that they control. In other words, the rank order of the amount of differe ntial connectivity of a transcription factor with other genes it controls is in line (e.g., concordant ) with its ordered connectivity with the master regulator. Therefore, we consider two ranked lists. One that ranks the transcription factors by the amount of differential connectivity of the genes it controls and another that puts the master regulator in the first position and ranks the remaining transcription factors by their correlation with the master regulator. We evaluate the concordance of these two sets of ranks using a statistical measure which is described in next paragraph. Since we do not know a priori the identity of the master regulator/s, we maximize this measure of concordance over the set of all transcription factors in candidacy for playing the role of the master regulator. In case the maximal concordance is statistically significant, we conclude that there is a master regulator. In addition, we declare the transcription factor for which this concordance measure is maximal amongst all trans cription factors to be the master regulator. We construct a concordance statistic for each transcription factor that is in candidacy for the master regulator in the following way: Step 1. n coefficient test statistic given by (2 2 ) below based on the pairs of data Note that denotes the average difference in connectivity of transcription factor between the two groups, and is the absolute correlation between the transcription factor s and This test statistic below conveys whether the differential connectivity of the genes with the

PAGE 24

24 transcription factor in the two experimental groups is concordant with the correlations of the transcription factor with all other transcription factors. In other words, measures whether the differential connectivity is concordant with the hierarchical regulation of the trans cription transcription factor is given as: (2 2) where, in the above paired list, in the above paired list, = Total number of such paired observations for This statistic can be used to test the null hypothesis that the two sets of ranks produced by differential con nectivity and the correlations with are non concordant versus the alternative hypothesis that they are concordant. Step 2. We repeat step 1 for all such transcription factors, so that we have a concordance test statistic for each of the transcription factors which is a potential master regulator. We believe that the master regulator has the maximum measure of concordance, among all the transcription factors. Since we do not know the identity of the master regulator, we m aximize the measure of concordance, given by over the set of all transcription factors. So, we define as the maximum of the statistics given in ( 2 2) over all the transcription factors that are in candidacy for the role of the master re gulator, i.e. (2 3) Thus, statistically significant large values of would indicate the existence of a master regulator.

PAGE 25

25 Significance of can be assessed by a bootstrap (resampling) based procedure as the sampling distribution of is not tractable. This will calculate the p value or the observed level of significance of the value of test statistic calculated in ( 2 3). We draw bootstrap samples from the original sample each of size and consider the first samples as the case group and the remaining samples as the control group. We compute the test statistic value for each bootstrap sample. Let denotes the value of our test statistic for the bootstrap sample, where In order to estimate the p value, we calculate the proportion of times the test statistic values based on the bootstrap samples exceed the test statistic value obtained from the original sample, i.e., (2 4) If the p value obtained from ( 2 4) is low then the test is significant and we conclude that there exists a master regulator in the system. In case we conclude that there exists a master regulator the transcription factor is claimed to be the master regulator if it has the maximum value of the statistic given in ( 2 2), i.e. (2 5) We evaluate the performance of our master regulator identification procedure using a simulation experiment in the next section. 2.2 Simulation s In order to evaluate the performance of our proposed method, we generate synthetic datasets of gene expressions of the case and control groups with the different regulation schemes of the transcription factors. The simulation scheme consists of the followi ng steps:

PAGE 26

26 2.2.1 Data Generation We consider transcription factors and genes as described before in the Methods section. Also, let there be subjects in the case group and subjects in the control group. The gene expression data for the two groups of subjects are generated as given below. Note that, the choices of all the design parameters considered below are given in later sections depending on whether we are simulating und er the null or under the alternative. Step 1. We assume that (log transformed) expression values for follows a normal distribution with mean and variance 1 i.e. in the case group, and in the control group. We also generate ind ependent random variables from that are also independent of Step 2. We want to generate all the transcription factors in such a way that there exists a hierarchical regulatory pattern among them. In other words, we want t o generate the remaining transcription factors in such a way that where denotes the correlation between the transcription factors and One way of achieving this is to simulate the remaining transcription factors as follows: (2 6) where, are decreasing in In this case, the correlation structures among all the transcription factors are given by:

PAGE 27

27 and (2 7) Step 3. The next step is to generate the genes We assume that each of the transcription factors regulates genes. Here, The genes, which are directly regulated by alone, are generated as given below: (2 8) where, and are independent and identically distributed (i.i.d) as and and are real numbers. Here, the correlation between the transcription factor and the genes is given by ( 2 9) The genes, regulated by the remaining transcription factors are generated as follows: (2 10) where, and are i.i.d and and are real numbers, In this case, the correlation between a transcription factor and the genes regulated by that transcription factor is given by

PAGE 28

28 (2 11) Also, the correlations between a transcription factor and the genes which are not regulated by that transcription factors are zero i.e. Furthermore, We calculate the size and power of our test in the following sections. 2.2.2 Size of the Test Recall that, the null hypothesis of interest is that the rank order of the transcription factors based on their differential connectivity with the genes is not statistically concordant with their rank order based on their correlations with the master regulator. So, the null situation can be created by assuming that there exists a hierarchical regulatory pattern among the transcription factors but there is no differential regulation of the genes in the two experimental groups due to the transcription factors. Hence, there is no such master regulator. In order to follow the null hypothesis in the simulation setup, we assume to be decreasing in and choose and The decreasing nature of ensures that there exists a hierarchical regula tory pattern among the transcription factors. and ensure that the associations of the transcription factors with the genes remain the same in the two groups i.e. there is no differential connectivities of the transcript ion factors with the genes between the two groups. We generate samples for the case group and samples for the control group using the above described

PAGE 29

29 scheme. We calculate the value of our test statistic, denoted by using equation ( 2 3) and find its p value as described in the Methods section. In order to find the size of the test, we use Monte Carlo method. We repeat the whole process 1000 times and therefore, get 1000 p values using equation ( 2 4). Let the p value for the Monte Carlo ite ration be denoted as The size for the test is given by: (2 12 ) In particular, we consider the following choices of the parameters for calculating the size of the test: For the above choices of the parameters, the empirical size of the test came out to be 0.032 which is close to the nominal size of 0.05. 2.2.3 Power of the Test To calculate the power of our test, we generate a data under the alternative hypothesis Here the alternative hypothesis is that the rank order of the transcription factors based on their differential connectivity with the genes is concordant with their rank order based on their correlations with the master regulator. So, we generate the data in such a way that acts as the master regulator, that is, the connectivity of with other transcription factors are most concordant with the differential connectivity of the genes with the transcription factors. We set and Here ensures that the connectivity (associations) of with the genes, regulated by it, are greater in case group than that in the control group. Similarly, ensures that the connectivity of with the genes, regulated

PAGE 30

30 by it, are greater in case group than that in the control group, Also, we assume to be decreasing in so that there is a hierarchical regulatory structure among the transcription factor s, being at the top of the hierarchy. We follow the same steps in calculating the p value as we did for size calculation in the previous section. We consider the same choices for as we c onsider for size calculation. In particular, we choose ; We choose where These choices of ; ensure that increase in the value of also increase the difference between In other words, the differential regulations of the transcription factors on the genes between the two groups increase as increases. For the choice of we consider the follow ing relation: which implies This choice of ensures that increase in the value of also increase the difference between We draw the power curve for different choices of as shown in Figu re 2 2 From Figure 2 2 we see that the power steadily increases as the differential connectivity (regulated by ) of the genes with the transcription factors between the two groups increase. The power curve starts from 3.2% at (no difference in th e connectivity of the genes with the transcription factors in the two groups) and reaches its maximum of 100% at (maximum difference in the connectivity of the genes with the transcription factors in the two groups). The power reaches over 80% with a moderate choice of Therefore, we can say that our proposed method is a valid test (e.g., size ) that performs reasonably well (power reaching 100%) in identifying a significant concordance in the differential connectivity of the genes with

PAGE 31

31 the transcription factors and the connectivity of a transcription factor with ma ster regulator, if one exists. Figure 2 2 The power curve with 500 subjects in each group for several choices of We also consider several other choices of the sample sizes in each of the two groups (case and control), and calculate the size and draw the power curves for each of the following choices of the sample sizes: representing reduced sample sizes and unequal sam ple sizes in each treatment group.

PAGE 32

32 Figure 2 3. Plot of the power curves for different choices of the sample sizes with several choices of using simulated datasets. Overall, from our analyses with different choices of sample size, we find that the power of our test is increasing with increase in the sample size as well as an increase in the differential connectivity of the genes with the transcription factors in the two groups. Details of the variation of the power with samp le size c an be found in Figure 2 3 which shows the power curves for each of the above choices of the sample sizes with different choices of

PAGE 33

33 Figure 2 4. Plot showing the power performance of our test in presence of two independent master r egulators with varying using simulated datasets. In order to check the performance of our test in case there are more than one master regulator transcription factors, we have also studied a simulated scenario where there are two independent master reg ulator transcription factors regulating two independent sets of genes through transcriptional regulatory networks. Figure 2 4 shows the power performance of our test in the presence of two independent master regulators in the system. In this case, too, our test has substantial power performance, similar to the simulated settings of a single master regulator

PAGE 34

34 transcription factor. Note that, here we have considered one of the many possible simulation settings. However, our method can also be generalized for s everal other complicated scenarios. 2.3 Data Analysis 2.3.1 Prostate Cancer Data We apply our test statistic, proposed in Methods section, to a human Illumina expression array dataset GSE18684 of androgen regulated gene expression in the LNC aP prostate can cer cell line [74 ] It is believed that androgens and the androgen receptor (AR) play significant roles in prostate cancer cell proliferation and invasion. S o, this study was conducted by [74 ] with an aim to identify the androgen receptor (AR) regulated g enes. The LNCaP cells were treated with androgen (R1881) or with vehicle (ethanol) control There are 10 control and 35 androgen treated samples with expression levels for 17182 probes in the dataset. We identify the set of probes which are differentially expressed in the two groups (androgen treated and vehicle control) using the 75 ] After adjusting for false discovery rate (FDR) at 5% significance level, 6054 probes are differentially expressed in the two groups, out of which 542 are transcription factors. Now, we test whether there exists any master regulator in the above mentioned dataset. For this, we compute the value of our test statistic for this dataset using equation ( 2 3), which turns out to be 0.49 with a bootstrap based p value of 0.006. Since the p value is highly significant we conclude that there exists a master regulator transcription factor in the system which is controlling all the other transcription factors and the genes. In order to find the mast er regulator, we use equation ( 2 5) as given in the Methods section. For this study, the two equation ( 2 3). So, we conclude that these two transcription factors ma ximally control all the other transcription factors and consequently control the connectivity of the genes differently in

PAGE 35

35 the two groups. Additionally, 0.8. This high value of the correlation suggests that these two transcription factors are approximately at the same level of transcriptional regulatory hierarchy. Therefore, it can be concluded that both of them are the master regulators. Among these two prostate cancer. It is believed that deregulation of WNT/ catenin pathway contributes to prostate cancer progression [76 81] and according to [82 ] inhibition of the transcription fact or catenin expression and proliferation in human glioma stem cells. is relevant to prostate cancer [83 ] Further, the be associated with the processes of cancer agg ressiveness and angiogenesis [84 ] The results from our analysis show consistency method to identify the master r egulator transcription factor. Besides, the transcription factor congenital abnormalities o f t he kidneys and urinary tract [85 ] signi ficant r ole s in many cancers like NSCLC [86], breast cancer [87 ] etc. Table 2 1. The list of top ten transcription factors which are highly co rrelated with the two master regulators Master Regulators Top 10 transcription factors correlated with the master regulator PEG3 WWC1, FOXD4L1, NCOA7, TSHZ3, CTBP1, TCFL5, LHX2, ARID5B, CDCA7L, MAK, ARNT2 MSRB2, TULP4, TSHZ3, TCFL5, SNAPC5, TFDP1, WWC1, CITED4, NCOA7, GRAMD4 From Table 2 1, it can be seen that the

PAGE 36

36 influenced by AR signaling and is increased i n prostate cancer [88 ] The transcription factor ffe ct AR mediated transcription [89 ] known to be d ownreg ulated in prostate cancer [90 ] FOXD4L1 is als o implicated in many cancers [91 ] 2.3.2 Colorectal Cancer Data We apply our method to another human microarray dataset GSE41 07. This study was conducted by [92 ] with an aim to identify differentially expressed genes in early onset colorectal cancer (CRC). RNA samples are extracted from colonic mucosa of pa tients as well as healthy controls and analyzed using GeneChip U133 Plus 2.0 Array. There are 22 subjects involved in the study which included 12 patients and 10 controls. All the patients and the controls in the data are young Chinese who are aged 50 year s or less. There are expression levels for 54,675 genes for all the patients in the dataset. We first filter the data in order to find the set of differentially li pa ckage in Bioconductor [75 ] After adjusting for FDR at 10% significance level, the number of differentially expressed genes turns out to be 5192, among which 266 are transcription factors. Next, we apply our method to the filtered dataset. We first test w hether there exists a master regulator in the data. The value of our test statistic given in equation ( 2 3), is 0.38 for this dataset with a p value of 0.04 for the bootstrap based test. Since, the p value is small enough to make the test significant, we conclude that there exists a master regulator in the data. We identify the master regulator using equation ( 2 5), given in the Methods section. The master regulator in factor

PAGE 37

37 kappa ators of innate immune responses, inf lammation, and cell survival [93, 94 ] frequently associated with tumor growth in leukemias and lymphomas as well as prostate, pa ncr eatic and colorectal cancers [95 97 ] It has been wid plays a leading role in regulation of target genes that promote cell proliferation, anti apoptosis, regulate immune and inflammatory response, and results in p at hogenesis of various cancers [94, 98 102 ] Further, it has resistance to chemotherapy and radiotherapy [ 102 ] while molecular targeted therapy against co nstitutive activation [94 ] According to [ 101 ] ongoing inflammatory process in the gut mucosa resulting in the progression of colitis associated not only colitis associated cancer, but als o sporadic colorectal cancer [103 ] GC1A (PGC value of 0.72 in the patients group whereas it is cells. This leads to the increase in glucose oxidation which is observe d d u ring pro inflammatory state [104 ] 2.4 Discussion and Conclusion In this project we present a novel approach to identify a master regulator transcription factor in a system using only the gene expression profiles of the patients. We consider a simulation setting which validates our approach with a reasonable power in detecting the

PAGE 38

38 existence of a master regulator. We have also checked the power of our test in the presence of two ind ependent master regulator transcription factors in the simulation setup. We apply our approach to two human microarray datasets and detect the existence of master regulators in those. In order to check the robustness of our method in experiments not typica lly falling under Gliob las t oma (GBM) TCGA RNA seq data [105 ] Here we compare the two types of GBM tumors: Mesenchymal and Classical. Our method concludes the existen ce of a master regulator transcription factor (PPRC1) between the two types of GBM tumors (Mesenchymal and Classical) with a p value of 0.08 (marginally significant). One advantage of our proposed method is that it needs only a single data source for acc urate identification of a master regulator among a set of transcription factors. In the existing literature, most of the approaches integrate multiple platforms to identify a group of transcription factors or transcription factor binding sites or different ially regulated gene sets. Our method is particularly useful in case there is limitation in data sources and data from multiple platforms are not available. Also, our method is aimed to identify a single master regulator, as opposed to identifying a group of transcription factors associated with the disease process as in the case of other existing methods. The method can identify multiple master regulator transcription factors if they are individually at the top of hierarchy of the transcription regulation. This is advantageous in anti cancer drug development processes which initially target the most potential transcription factor associated with the disease and can be used as a potential biomarker. However, there is a scope of further improvement of our pr oposed method by incorporating important platforms like ChIP Seq data From simulation settings, we see that the performance of our method gets better with the increase in the number of patients in each group. So, our method is expected to be more

PAGE 39

39 efficien t when there is sufficiently large number (around 100) of patients in each group while it may not be very efficient in case the sample size is very small. Although both t he data analyzed in this project have lower number of subjects in each group, our test was still successful in identifying master regulator transcription factors from the data. One important assumption of our method is that the ranking of the transcription factors on the basis of their differential connectivity of the genes between two expe rimental conditions is concordant with the hierarchical order of their own regulation. The fulfilment of the above mentioned condition is a key indicator to the existence of a master regulator transcription factor and its subsequent detection through our m ethod. However, it may be possible that in certain situations, although there exists a master regulator transcription factor, there is no such clear cut concordance he other genes. In such a case, our method may not perform well. Overall, we believe that our method will give new insight for efficient identification of potential disease biomarker and therapeutic target in drug development processes based on master regu lator transcription factor

PAGE 40

40 CHAPTER 3 EXPLORING THE IMPORTANCE OF CANCER PATHWAYS BY META ANALYSIS OF DIFFERENTIAL PROTEIN EXPRESSION NETWORKS IN THREE DIFFERENT CANCERS 3.1 Methods In this chapter w e analyze the pre processed challenge datasets for CAMDA 2015 provided by the International Cancer Genomic Consortium (ICGC ) to explore the importance of cancer pathways For our study, we consider the protein expression and the clinical profiles of the patients for the three cancers : Head and Neck Squ amous Cell Carcinoma (HNSC), Lung Adenocarcinoma (LUAD) and Kidney Renal Clear Cell Carcinoma (KIRC). A set of 132 proteins is present in the protein expression profiles of each of the three cancers ; the patient sample sizes of HNSC, LUAD and KIRC are 212 237 and 454 patients, respectively. The number of patients in the clinical profile of HNSC, LUAD and KIRC are respectively 422, 473 and 515. The clinical profile of each of the cancer type represents the disease status ( progression or complete remiss ion ) of each patient However, 44, 111 and 28 patients have missing disease status in the clinical data of HNSC, LUAD and KIRC respectively. After removing the patients who have missing disease status, we are left with 185 patients (71 in val ues of 132 proteins. In summary, we have two groups of patients for each cancer type and a set of recorded protein expression values of 132 proteins for each of them Reprint ed with permission from Sikdar S, Datta S, Datta S. Exploring the importance of cancer pathways by meta analysis of differential protein expression networks in three different cancers. Biol Direct 2016; 11: 65.

PAGE 41

41 3.1.1 Pathway A nalysis From several studies it has been found that approximately there are 140 genes that are responsible for selective growth advantage. These genes mentioned before in section 1.1.2, mutations occur in a typical tumor due to two to eight such are only twelve signaling pathways which allo w selective growth advantage [60 : MAPK, STAT, PI3 K, RAS, Cell Cycle/Apoptosis, NOTCH, HH, APC, Chromatin modification, Transcriptional regulation and DNA damage control. Among these, MAPK, STAT, PI3 signaling pathway We separately analyze the protein profiles [ 107, 108 ] and group the proteins according to their biological pathways. As selective growth advantage can occur only through t only the proteins included in the 3.1.2 Differential Network A nalysis For each cancer, i e complete remission group to the perform d ifferential network analysis [73 ] using the R package dna [ 109 ] This differential network analysis for each pathway is conducted based on connectivity scores between the proteins in these target pathways T he difference in connectivity between the two groups ( progression versus complete remission ) is computed m athematically, using the following statistic:

PAGE 42

42 (3 1) where denotes the number of proteins in Here are the connectivity scores between the proteins and in the progression and complete remission group s, respectively. For our a nalysis, the connectivity score of a protein pair in a network is taken to be the Pe the expression values of the two proteins in the corresponding sample data. A permutation test is then carried out using the test statistic as follows: let denote the number of proteins in the sample (132 in this case). Let and denote t patients in the two groups are given in the matrix form of orders and respectively. Now, a matrix is constructed by merging the rows of the two profiles, i.e. is of order Then, the rows of are randomly permuted and the first patients are considered as one group and the remaining patients as another group. For the permutation, the connectivity scores between the proteins and are then computed using the expression profiles of the permuted data. Thus, the permuted test statistic is given by (3 2) We repeat this process 1000 times and obtain the observed level of sig nificance (p value) using (3 3)

PAGE 43

43 Next, we construct graphical networks values are significant, so that we get an idea about the network structures in each of the two groups The graphical networks are constructed by connecting each pair of proteins for which the connectivity score e xceeds a threshold In addition to testing the overall pathway significance for each cancer we also test whether the connectivity of each single protein has changed between the two groups ( complete remission ) using the following statis tic: (3 4) where denotes the set of all proteins and is the number of proteins in Once again, a permutation test is carried out for each protein, in the way described before. The p value corresponding to each prot ein is obtained using (3 3) with and replaced by and respectively, where is given by (3 5) 3.1.3 Rank Aggregation The p values, obtained usi ng the te st statistic given in ( 3 1 ) are used to obtain ranked value gets rank 1, the next one gets rank 2 and so on. Since, these ranked lists vary according to the cancer type; we need to aggregate them in a meaningful way to get an overall ranked list which would then rank the pathways by their global order of importance In other words, t his overall ranked lis t may provide us with the most important all the three cance rs. In order t o get th is overall ranked list, we consider t he R package RankAggreg [110, 111 ] based on the Cross en tropy Monte Carlo algorithm [112 ] In

PAGE 44

44 the framework of optim ization problem, RankAggreg [110, 111 ] minimizes an objective function, so that a final ordered list is obtained which is close to all the individual ordered lists, simultaneously. The objective function is defined as follows (3 6) Here, is the ordered list, is the proposed ordered list, is a distance measure and denotes the weight associated with the ordered list The aim here is to find for which the objective function given in ( 3 6 ) is minimum. In other words, (3 7) For our a nalysis, we consider equal weights for all the three ranked lists Here, is the As a distance measure, we consider weighted [113 ] A brief description of the algorithm of the weighted be the p values (in ascending order) associated with the o rdered list, Let and ordered list and the proposed ordered list (3 8) For our second analysis at the individual protein level the p values, obtained usi ng the t est statistic given in ( 3 4 ) are used to rank the set of 132 individual proteins. An overall ranked list of these proteins is obtained using the R package RankAggreg [110, 111 ] in a similar way.

PAGE 45

45 3.2 Results 3.2.1 Pathway Analysis Results For each cancer type, we find representation of five out of twelve in ou r sample of 132 protein s us ]; they are the PI3K signaling pathway, Cell Cycle, Apoptosis, RAS signaling pathway and MAPK signaling pathway. It is in pathways are all related to 3.2.2 Diffe rential Network Analysis Results Next, we determine whether there is any significant difference in the network structures [73 109 ] of the network of proteins in between the two groups of patients ( progression vs complete remission ) usi ng the t est statistic given in ( 3 1 ) coefficients as connectivity scores and absolute distance measure for each of the three cancer types The p values of the analys is are reported in Table 3 1. Based on these results, we have the following findings: the RAS signaling pathway is highly significant (p value = 0.026) and MAPK signaling pathway is marginally significant (p value = 0.082) in HNSC ; f or LUAD, PI3K signalin g pathway is highly significant (p value = 0.013). Table 3 1. values obtained from differential network analysis for each cancer type. Target Pathway Cancer Type HNSC LUAD KIRC RAS signaling pathway 0.026 0.507 0.156 MAPK signaling pathway 0.082 0.759 0.517 PI3K signaling pathway 0.241 0.013 0.774 Apoptosis 0.407 0.417 0.487 Cell Cycle 0.410 0.238 0.997 A graphical representation of the network structure of the proteins in the two groups of patients for RAS signaling pathway in HNSC is shown in Figure 3 1. In this figure, two proteins

PAGE 46

46 are connected if the connectivity score between them is significantly l arge. Different colors and shades in the figure represent positive or negative correlations and the thickness of the lines represent s the strength of the associations. A visual inspection reveals some obvious differences in the network connectivity between the two groups of patients. Figure 3 1. Network structure for RAS signaling pathway in Head and Neck Squamous Cell Carcinoma (HNSC) It can be seen from Figure 3 1 that the protein MET has very high connectivity with the be seen to The protein GAB2 appears to be

PAGE 47

47 percent connectivity with MAPK1 in the themselves and also with EGFR, which is further connected with KIT. But no such interesting MET, and BAD show noticeably different connectivities in the two networks. The corresponding genes are known oncogenes. GAB2 is known to be overexpressed in multiple human tumors especially in melanoma. I t is alte red by MAPK and PI3 K signaling pathways [114 ] MAPK1 (Mitogen activated protein kinase 1) is broadl y implicated in many cancers [115 ] MET is associated with MET signaling process. In many cancers involving solid tumors inhibiting this signaling has major therapeutic effect [116 ]. Similarly, it is found that BAD is a pro apoptotic protein which has been identified as an integrator of several anti apoptotic signaling pathways in prostate cancer cells [117 ] A graphical representation of the network structur e of the proteins in PI3K signaling pathway for the two groups of patients in LUAD is shown in Figure 3 2. From the figure, it can be seen that the protein MET has interesting connections with the proteins CASP9, FOXO3, group whereas no such interesting connections can The protein MAPK1 seems to be connected with MET, YWHAE, differences in the two networks PRKCA is a serine/threonine protein kinase that is highly

PAGE 48

48 expressed in a number of cancer cells where it can act as a tumor promoter and is implicated in malignant phenotypes of several tumors such as gliomas and breast cancers [118 ]. Figure 3 2. Network structure for PI3K signaling pathway in Lung Adenocarcinoma (LUAD ) Our analysis of individual proteins using the t est statistic given in (3 4 ) produces Figure 3 3 (A) (C). The pie charts represent the proportions of top fifty differentially connected proteins l three cancers. For all the three cancers, PI3K and RAS signaling pathways show significant contributions in terms of differential network connectivity. 3.2.3 Rank Aggregation Results Next, we rank the relative importance the p values, obtained using (3 3) from the differential network analysis, so that we can get an idea about the 100% Connectivity 100% Connectivity MET YWHAE PRKCA FRAP1 TSC2 BCL2L11 NFKB1 CASP9 FOXO3 Positive Negative Highly Positive MAPK1

PAGE 49

49 varies from one cancer to another; we obtain a rank aggregated list of the RankAggreg [110, 111 ]. Table 3 2 shows the ordering of the five cancers along with the rank aggregated list. Thus overall, the RAS signaling pathway appears to be most important followed by the PI3K signaling pathway based on our integrative analysis of the available data on three cancers. Figure 3 3. Relative contributions of the e cancers separately as well as combined. We also obtain a rank aggregated list of the proteins, based on the p values of the tests based on differential connectivity of each protein in the two different networks of the two patient groups. The proportions 3 (D). Once again, PI3K and RAS take the top two most important spots in terms of differential network connectivity. PI3K Cell Cycle RAS MAPK Apoptosis PI3K Cell Cycle RAS MAPK Apoptosis PI3K Cell Cycle RAS MAP K Apoptosis PI3K Cell Cycle RAS MAP K Apoptosis (A) (B) (C) (D)

PAGE 50

50 Table 3 2. by p values for each cancer type along with the overall ordering. Cancer Type Pathway Ordering by p values HNSC R, M, P, A, C LUAD P, C, A, R, M KIRC R, A, M, P, C Overall R, P, M, A, C R: RAS Signaling pathway M: MAPK Signaling pathway P: PI3K Signaling pathway A: Apoptosis C: Cell Cycle 3.3 Discussion and Conclusion It is known that for most cancers with solid tumors the genes in the above mentioned nd c hange their prot ein products [60 ] In all human tumors, PI3K is known to be as one of the most frequently targeted pathways M utation in PI3K pathway components contributes up to 30 percent of all human cancer and is known to be activated by RAS [ 119 ] It is interesting to know that PI3K is a regulatory subunit, which binds to cell surface receptors and to the RAS p rotein Genes and proteins in PI 3K and RAS have been investigated as therapeutic targets for many cancers [120, 121 ] Here this purely quanti tative analysis of the existing protein expression data of three different cancers also reveals the significant alteration of the proteins in PI 3K and RAS pathways Our findings are consistent with this and suggest that continued future efforts be made in this direction. Since genes act in consort during a biological process, a network analysis is essential for a system wide understanding. Thus, a study of differential network connectivity could yield interesting findings that are not possible from a diffe rential expression analysis of individual proteins. In addition, pathway level information should be incorporated in a differential network analysis whenever possible. It can be seen from Figure 3 3 (D) that on the basis of analysis of differential connect ivity of individual proteins, PI3K emerged as the most important pathway

PAGE 51

51 overall. However, this could be partly due to the fact that PI3K pathway has the largest number stic based on differential connectivity of pathways is automatically normalized by the size of the pathway and on the basis of this comparison, RAS turned out to be the most significant overall as shown in Table 3 2. In a recent study [122 ] multiple molecular profile data of LUAD for the CAMDA 2014 lung adenocarcinoma challenge data provided by ICGC is analyzed and it is noted that EGFR signaling pathway plays a significant role among the patients. Besides, it is known that EGFR acti va tion initiates RAS signaling [123 125 ] and EGF R induces rapid increase in number of epithelial cells by activating a network of signaling elements, includi ng members of the RAS and PI3K [125, 126 ] Thus, it is common for l ung cancer patients to have active EGF R mutation Moreover, the RTK/RAS/RAF pathway, identified as the main route in causing adenocarcinoma, e .g. in KRAS, BRAF, and EGFR [127 ] A lthough the EGFR kinase inhibitor Gefitinib is an effective treatment for lung cancers with EGFR activating mutations, amplification of MET causes Gefitinib resistance by driving ERBB3 (HER3) dependent activation of PI3K [ 128 ] a pathway thought to be specific t o EGFR/ERBB family receptors This fact, along with the fact that MET has been found in the module of the progression group but not in that of the complete remission group in PI3K signaling pathway of LUAD (see Figure 3 2) in our current study suggest that the patients under stu dy might have been treated with Gefitinib, but the presence of MET in some of these patients (those in the progression group) may have led to the resistance to this drug. However, this is regarding the treatment regime for any of these patients. This highlights the potential role of MET in lung cancer progression

PAGE 52

52 An interesting observation from Figure 3 3 is that much lesser proportion of proteins in cell cycle pathway is differentially connected between the two groups of patients in KIRC compared to LUAD and HNSC. So underlying molecular mechanisms related to cell cycle pathway may be a little different in KIRC compared to LUAD and HNSC. In a recent paper [129 ] patient level information such as mutation profiles is incorporated to identify protein protein interaction (PPI) interfaces enriched in somatic mutations. It will be interesting to explore how t o incorporate patient heterogeneity information into our approach.

PAGE 53

53 CHAPTER 4 EAMA: EMPIRICALLY ADJUSTED META ANALYSIS FOR LARGE SCALE SIMULTANEOUS HYPOTHESIS TESTING IN GENOMIC EXPERIMENTS 4.1 Methods 4.1.1 Empirically Adjusted Meta Analysis (EAMA) Here, we provide a description of empirically adjusted meta analysis (EAMA) in details. Let us suppose we have number of genes in our study. We are interested in finding out which of these genes contribute significantly to our outcome of interest. Suppose be the collection of null hypotheses where denotes the hypothesis that the gene has no significant contribution to the outcome of interest, Also, suppose we have data from multiple independent experiments. Let be the number of independent genomic experiments. By combining the results of these independent experiments we aim to identify the genes which contribute significantly to the outcome of interest. Let us define as the collection of p values from the genomic experiment, where is the p value corresponding to gene (i.e. corresponding to the null hypothesis ) in the experiment, Using the inverse z transformation, we get the collection of z sc ores as (4 1) The z scores given in (4 1 ) may not follow a distribution under the null hypotheses. Here we m odify the z scores, given in (4 1 estimating a n empirical null distribution [63 ] so that the resulting z scores follow distribution under the null hypotheses. A brief detail on estimating the empirical n ull can be found in the next section Following the steps of [63 ], suppose the empirical null distribution is

PAGE 54

54 obtained as u sing ]. Then the modified z values are calculated as (4 2) These empiricall y adjusted z values, given in (4 2 ), can be assumed to follow a distribution under the appropriate null hypotheses. Finally, we convert the modified z values into the corresponding p values as At this stage, we have a set of p values from the independent experiments for each gene But for proper inference on the overall effect of a gene, we need to have a single p value for that gene. So, for a typical gene we combine the p values from all of the experiments to obtain a single p value [62 ] as given below: If are the p values for the gene obtained from the independent experiments, we combine these p values to get a single test statistic Under the null hypothesis gene does not contribute significantly to the outcome, follows a distribution with degrees of freedom assuming that the p values follow uniform distribution. Using these we obtain a resulting set of p va lues where is the p value corresponding to the test statistic In this way we are able to get a single p value corresponding to each of the genes. To account for the large number of hypotheses being tested in the study, we apply the Benjamini Hochberg [67 ] method of multiplicity correction to get a set of corrected p values It may be noted that other methods for multiplicity correction can also be used. Finally, using the p value we decide whether the gene has any overall significant contribution to the outcome of interest.

PAGE 55

55 4.1.2 Estimation of the empirical null distribution A brief discussion on estimating the nu ll distribution, following [63 ], is as follows. Suppose that there are value s which can be classified into two classes, is generated according to the null hypothesis. Also, assume that the prior probabilities of the and respectively and that has density either (null density) or (non null) depending on its class. Then the mixture density of is given by Following Bayes theorem the a poster iori probability of belonging to the Uninteresting class given is obtained as The local false discovery rate is then defined as The main idea is to estimate the density from the central peak of the observed histogram of the values. Under the assumption that is density of a normal distribution with mean (not necessarily 0) and standard deviation (not necessarily 1), for close to zero, we can write Then and can be estimated as and Around curve is estimated through a quadratic approximation which leads to the final estimation of and This is done under the assumption that the central peak of the value histogram, presumably close to zero, is mainly contributed by the null cases, and this method of estimating the null parameters is termed as e ] for m ore details on this approach). 4.2 Simulation S tudies In order to evaluate the performance of our proposed method (EAMA), described in section 4.1 .1 for accurate identification of significant genes, we simulate datasets mimicking

PAGE 56

56 multiple g enomic experiments. We simulate continuous expression datasets as obtained from microarray experiments, as well as, count datasets which are found in next generation sequencing experime nts. Besides, we also consider the possible presence of some unknown hidden va riables or confounders that often impact the results of the genomic experiments. It is our interest to check the performance of EAMA in circumstances affected by the presence of hidden variables or confounders We call a gene differentially expressed if t ini ] adjusted p value is less than 0.05. The performance of EAMA alo ng with its nave counterpart i s assessed using four performance assessment measures: sensitivity, specificity, false discovery rate (FDR) and false n on discovery rate (FNR) as defined below: 1. Sensitivity (or true positive rate or reca ll): Proportion of genes that a re correctly identified as differentially expressed among all the differentially expressed genes. 2. Specificity (or true negative ra te): Propo rtion of genes that a re correctly identified as non differentially expressed among all the non differentially expressed genes. 3. False discovery rate (FDR, or 1 minus precisio n): Proportion of genes that a re incorrectly identified as differentially expresse d among the set of identified differentially expressed genes. 4. False non discovery rate (F NR): Proportion of genes that a re incorrectly identified as non differentially expressed among the set of identified non differentially expressed genes. The values o f all the above four measures a re calculated for our method (EAMA) based on 500 independent Monte Carlo simulations. For comparison, we also obtain the results from a nave meta analysis which does not apply the empirical adjustment (null transformation) to the raw p values. The nave method combines the raw p values using Fisher ] and adjust the resulting p Hochberg [66 ] Details of the data generation methods are described below

PAGE 57

57 4.2 .1 G eneration o f Continuous Data (Microarray Based Gene Expression) We generate a number of simulation studies involving mult iple genomic experiments. We have datasets obtained from independently generated experi ments, where each experiment has data on genes and subjects distributed equally over two groups (for example, case and co ntrol). The first 10 subjects a re considered to be in one group and the remaining 10 subjects in the other. The (log) expression levels of genes for a typical experiment a re generated as follows. Let be the (log) expression value corresponding to the gene belonging to the subject and group. The (log) expression values of genes a re generated using a linear model as given below: (4 3) where denotes the general mean effect in the model, is the effect of the gene, is the effect of the group, is the interaction effect of the gene and the group, is the effect of a latent confounder on the gene of the subject in the group, is the error component corresponding to the gene of the subject in the group. Here, For each of the independent experiments, the ge ne (log) expression profiles ar e generated as mentioned above. For our simulation studies on microarray data, the following two simulation schemes a re considered. Scheme 1. In this scheme, we simulate the microarray datasets for each experiment using (4 3 ) assuming that there i s no effect of any hidden confounder in the model. To achieve this, we set in (4 3 ).

PAGE 58

58 Fo r simplicity, we further assume that all the main effect terms were zero. That is, we set and in (4 3 ). The denote mean zero random errors. We generate these random errors in (4 3 ), independently from distribution under the assumption that all the genes in the datasets a re independent. In our simulations, we set 70 genes as differentially expressed between the two groups of subjec ts. In particular, we consider the diff erences in magnitudes of differential (log) expressions of these 70 genes between the two groups to be 15. For this, the interaction effects bet ween the genes and the groups a re generated as given below : For For For This data generation i s repeated for each of the independent experiments. Scheme 2. In this scheme, we wan t to evaluate the performance of EAMA in the presence of hidden confound er in the model. The effects of the latent variable in (4 3 ) a re generated in this scheme in such a way that it varied not only over the two groups of subjects and different groups of genes but also over different ex periments. So, here we generate as where When , which implies that there i s no effect of the latent confounder on the gene of the subject in the group. On the other hand, when , i.e., the ef fect of the latent confounder i s given by for the gene of the subject in the group. Here, i s generated depending on the gene, subject group, and experiment ID as follows:

PAGE 59

59 and (4 4) Generation of using the above design represents a situation where the gene expression values depend on the group status of the subjects and also on some unobserved features of the experiment (for example, age and gender of the subjects, geographical location of the exp eriment, etc.) which are often present in observational studies. Apart from all the variables in (4 3 ) a re generated in the same way as in Scheme 1 for each of the independent experiments. After generating the microarray datasets, the set of d ifferentially expressed genes a re identified for each of the experiments u limma ] and the corresponding raw p values of all the genes under study a re stored. Note that while identifying the set of differentially expr essed genes not consider the effect of any unmeasured/hidden confounding factors that may be present in the simulation model. This is because, although affecting the outcome, these factors remain unaccounted in practice, the very reason that they a our method EAMA to obtain the set of significant genes. W e consider the results from the setting involving 10 independent experiments and 1000 unco rrelated genes where 70 genes a re differentially expresse d. The difference in magnitudes of the (log) expressions of these 70 genes between the two groups i s considered to be 15.

PAGE 60

60 Fig ure 4 1. Performance assessment with 10 experiments, 1000 uncorrelated genes and absolute differences in differential expressio ns as 15. Fig ure 4 1 shows the plots of sensitivity, specificity, FDR, and FNR of EAMA and that of the nave method for each of the two simulation schemes. From Fig ure 4 1, we find that the performances of EAMA and the nave method a re very similar in Scheme 1 (no hidden confounder in the model) in terms of all the four perform ance measures. However, there i s a wide difference between the performances of the two competing methods in Scheme 2 (presence of hidden confounder in the model) as evident from Fig ure 4 1. The se nsitivity of the nave method i s better than that of EAMA, while the specificity of EAMA i s much better than that of the nave method The most drastic difference i s observed in the FDR measure in Scheme 2.

PAGE 61

61 Here EAMA outpe rforms the nave method by a large margin in terms of FDR, as the FDR of the nave method i s unacceptably high compared to that of EAMA. From th is simulation study, it appears that the performances of EAMA and the nave method are similar when there i s no latent factor present in the study. But in the presence of some hidden variables which act as confounders and have significant effects on the expression levels of the ge nes, our method (EAMA) performs reasonably well in terms of having low FDR whereas the nave method has unacceptably high FDR values. This result justifies our expectation that the EAMA, based on the empirical null distribution adjustment, lead to an accurate inference by accounting for the excess variations cause d by the latent factors whi ch a re missed by the theoretical null based nave method. We also consi der some variations of the above simulation schemes by introducing correlation among some of the genes under study, changing the number of independent experiments, varying the number of genes involved, changing the magnitude of differential expression among the set of differentially exp ressed genes. We also consider a simulation setting by introducing an effect of a hidden variable which does not act as a confounder. Each of the above me ntioned schemes are described below. 4.2 .1.1 Correlated gene expression levels In reality, genes having same biological functions are expected to have correlated expression values in the datasets. To study the robustness of EAMA, we generate a s imulation s etting where there a re 4 clust ers of correlated genes. We use the same model (4 3 ) in generating the (log ) expression values, but assure correlation among the genes within a cluster through the generation of the random error term in the following way:

PAGE 62

62 Let and denote the 4 clusters and denotes the union of the 4 clusters. We generate as (4 5) where a re drawn independently from in such a way that the values of a re same for all the genes belonging to the same cluster. ar e generated independently from All the other variables in (4 3 ) a re generated in two di fferent ways, similar to the studies with independent genes, namely Scheme 1 and Scheme 2. After generating the microarray datasets with correlated ge ne expression values, we apply our method EAMA to obtain the set of significant genes as before. The resu lts with correlated genes a re found to be similar to what we obtained from the study of uncorrelated genes where EAMA performed much better than the nave method in terms of FDR in the presence of latent factors in the study (see Fig ure 4 2) 4.2 .1.2 Reduction of the number of experiments We consider simulation settings with reduced number of experimen ts. In particular, we consider two other choices of the number of experiments which a re 8 and 4. The (log) expression values of the genes a re generated using (4 3) in the same way as before under the two schemes, namely, Scheme 1 and Scheme 2.

PAGE 63

63 Figure 4 2. Performance s of EAMA and the nave method with 10 experiments, 1000 correlated genes and absolute differences in differential expressions as 15. Fig ure 4 3. Performance s of EAMA and the nave method with 1000 uncorrelated genes absolute differences in differential expressions a s 15, and (A) number of experiments was 8 (B) number of experiments was 4. (A) (B)

PAGE 64

64 Overall, we find that the relative performances of EAMA and the nave method a re similar for different choices of the number of experiments, although the pe rformance of the nave method ge t s progres sively worse and that of EAMA ge t s better as the number of experiments increases (see Fig ures 4 3 (A) (B)). 4.2 .1.3 Reduction in the difference in magnitude of the expression levels of the genes Further, we consider the situation where the magnitude of the difference in the (log) expression levels of the 70 differentially ex pressed genes between the two groups is reduced. This i s reflected through the reduction of magnitudes of the interaction effect s between genes and groups in (4 3 ). Here the interaction effects a re generated as given below: For we set For we set For we set So, the difference in magnitude of the (log) expression levels of the differentially expressed genes between the two groups i s 8 instead of 15 as considered in the previous setting. In this case, too, we observe similar performances of EAMA and the nave method in both the schemes (see Fig ure 4 4 ). However, the difference between the FDR values of EAMA and the nave method i s hu ge where the EAMA outperforms the nave method by a very large ma rgin. Interestingly, it appears that the pe rformance of the nave method ge t s worse as the difference in the magnitudes of differential expression tend to decrease. Additionally, we also red uce the number of experiments to 8 and 4, where the difference in magnitude of the (log) expression levels of the differentially expressed genes between the two groups i s 8. The results are given in Table 4 1. The results of the study with reduced numb er of experiments a re similar to what we obtained with

PAGE 65

65 Figure 4 4. Performance assessment of EAMA and the nave method with 10 experiments, 1000 uncorrelated genes and absolute differences in differential expressions as 8. Table 4 1. Performances of EAMA and the nave method with 1000 uncorrelated genes, absolute differences in differential expressions as 8 and reduced number of experiments Scheme No. of experiments Method Sensitivity Specificity FDR FNR 1 8 EAMA 1 0.993 0.078 0 Nave 1 0.996 0.047 0 4 EAMA 1 0.994 0.067 8.63e 06 Nave 1 0.996 0.045 1.29e 05 2 8 EAMA 0.286 1 0.005 0.051 Nave 0.315 0.809 0.889 0.06 4 EAMA 0.286 0.999 0.046 0.051 Nave 0.311 0.907 0.793 0.054 4.2 .1.4 Increase in the number of genes We conside r a data generation through (4 3 ) where the number of genes (G) involved i s increased to 5000 such that the proportion of d ifferentially expressed genes is reduced. The

PAGE 66

66 simulation i s designed in such a way that, out of these 5000 genes, 200 genes a re differentially expressed. So the proportion of overall d ifferentially expressed genes i s 4%, which is lower than our previously discussed setting. Here the interaction effects ar e generated as given below : For we set For we set For we set Fig ure 4 5 Performance assessment of EAMA and the nave method with 10 experimen ts, 5 000 uncorrelated genes and absolute differences in differentia l expressions as 15.

PAGE 67

67 As before, we consider two different ways of generating the latent factor leading to the two simulation schemes, namely, Scheme 1 and Scheme 2. The performances of EAMA and the nave methods a re s imilar to what we obtained from the study of 1000 genes (see Fig ure 4 5 ) where EAMA largely outperforms the nave method in terms of FDR. Additionally, the pe rformances of the two methods a re compared after reducing the number of experiments and also the difference in magnitude of the interaction effects between gen es and groups. Overall, we find that there i s not much difference in the performances of the two met hods when the number of genes i s increased from 1000 to 5000 in both the schemes. 4.2 .1.5 Pre sence of a hidden variable which does not act as a confounder In addition to Scheme 1 and Scheme 2, we also check the situation where a hidden variable, although present and affects the outcome, does not act as a confounder. We refer to this scheme as Scheme 3. Here we consider a simulation setting where the distribution of the latent variable in (4 3 ) is the same in the case and the control groups of subjects i.e. the effect of the hidden variable on the outcome does not vary significantly betwe en the two groups of subjects. Here, where and i s generated for the experiment as : The differences in magnitudes of differential (log) expressions of the 70 differentially expressed genes between the two groups are 2. We found that in Scheme 3, where the latent variable no longe r acts as a c onfounder, EAMA has higher sensitivity than the nave method while the nave method appears a bit conservative with low sensitivity and FDR (see Table 4 2).

PAGE 68

68 Table 4 2. Performance assessment of EAMA and the nave method where a hidden variabl e does not ac t as a confounder. Method Sensitivity Specificity FDR FNR EAMA 0.517 0.995 0.0963 0.0351 Naive 0.337 0.999 0.023 0.0476 4.2.2 Generation of count data (NGS based gene expression) We also gen erate realistic NGS like datasets for our simulation experiments using a popular NGS simulator called SimSeq [ 132 ]. SimSeq generates read counts in two treatment groups for a known set of differentially expressed genes based on a real RNA sequencing datas et. He re, we generate a count data using SimSeq and kidney renal clear cell carcinom a data as the source datase t [133 ]. The kidney renal clear cell carcinoma dataset consists of 20,531 genes and 144 paired samples with tumor and non tumor replicate. We f ilter th e kidney renal clear cell carcinoma dataset by incl uding only those genes which have more than one hundred non zero read counts so that the simulated dataset did not include all zero read counts. From the red uced source dataset we generate read counts with genes and subjects distributed equally over two groups. Out of these 5000 genes, 1000 genes a re considered to be differentially expressed. Similar to the previous simulated settings involving expr ession datasets, we also assume that there e xists an effect of a hidden confounder in the count dataset. In order to achieve this, we independently generate another set of read counts using SimSeq with 5000 genes and 60 subjec ts as before where 1000 genes a re differentially expressed. For the gene, group and subject we generate a random observation from When we add the two read counts and divide the resulting sum by two in order to maintain the original magni tude of the read counts, for the gene in the group for the subject. We then round the result to nearest integer. If we retain

PAGE 69

69 the original read count for the gene, group and subject. In this way, determine whether there exists an effect of the hidden variable on the gene in the group for the subject. We repeat this whole process for experiments. After generating the count datasets, the set of d ifferentially expressed genes a re identified for each of the experiments using edgeR in Bioconductor [ 134, 135 ] and the corresponding raw p values of all the genes under study are stored. Then we apply our method EAMA to obtain the set of significant genes. The performances of ou r method (EAMA) and that of t he nave meta analysis method a re assessed usi ng the simulated count datasets. The results are shown in Table 4 3. From this table we find that the EAMA and nave method have similar performances in terms of sensitivity, specif icity, and FNR, while the major difference lies in the FDR. The FDR of EAMA i s much lower than that of the nave method, and hence EAMA has less f alse discoveries. The results a re similar to what we obtained in the studies involving continuous datasets. Ta ble 4 3. The performances of EAMA and that of the nave method using the simulated count datasets Method Sensitivity Specificity FDR FNR EAMA 0.870 0.968 0.129 0.032 Nave 0.904 0.942 0.204 0.025 4.3 Data Analysis We consider five publicly available lung cancer gene expression datasets: Bhattacharjee [136], GSE11969 [137], GSE29016 [138], GSE30219 [139] and GSE43580 [140 ]. These datasets a re previously analyzed by [141 ] on a classification fr amework. Each of the datasets is normalized and filtered by [141 ]. All the five datasets a re merged so that each of them has the same set of genes. We use the processed and merged datasets, which are available at https://zenodo.org/record/16006

PAGE 70

70 Each data set has normalized expression levels for 7200 genes. Although, information regarding lung cancer type, smoking status, ag e and gender for the patients i s available for our analysis we only use the information about t he cancer type of the patients. Here, w e attempt to ide ntify the set of genes, which a re differentially expressed between the two lung cancer types: Adenocarcinoma (AD) and Squamous cell carcinoma (SQ). The number of patients in each of the two types of lung cancer within each dataset is shown in Table 4 4. Table 4 4 The number of patients in each of the two lung cancer types within each dataset. Dataset Lung cancer type Adenocarcinoma (AD) Squamous cell carcinoma(SQ) Bhattacharjee 60 21 GSE11969 90 35 GSE29016 38 12 GSE30219 85 61 GSE43580 77 73 We fit linear model with the gene expression values of the patients as the response and cancer type of the patients as the predictor, for each experiment separately. U Bioconductor [ 75 ] we obtain the p values corresponding to the main effect term for the lung cancer type. So, we have five sets of p values for each of the 7200 genes. We identify the set of differentially expressed genes based on the five experiments using both EAMA and the nave method. Fig ure 4 6 shows the h istograms of the original z scores and the modified z scores (after empirically adjusting the z scores) obtained using (4.1) and (4.2 ) respectively, given in the Methods section. The curves superimposed on the histograms in Fig ure 4 6 are the density curv es of distribution. From Fig ure 4 6 it can be seen that the den sity of the original z scores i s much deviated from the distribution, whereas the den sity of the modified z scores i s almost identical to the distribution. So, Fig ure 4 6 highlights the fact that the test based on the assumption

PAGE 71

71 that the original z scores are standard normally distributed would lead to incorrect inference and possibly large amount of false discoveries. Hence, it is not surprising that the nave method, tha t used the original z scores, identified 5127 differentially expressed genes (more than 70% of the correction method [66 ]. On the other hand, our proposed method EAMA id entified 1541 significantly differentially expressed genes ( approximately 21% of the total number of genes) [66 ], hence reducing the possibility of gross false discoveries. Fig ure 4 6 The histograms of the original and modified z scores for the lung cancer datasets The genes identified by the methods described above could be further analyzed using pathway analyses of cancer signaling pathways. Several studies had identified 140 ponsible for tumor formation, a re classified into

PAGE 72

72 twelve signaling p athways [60 ]. These twelve signaling pathwa ys are: MAPK, STAT, PI3K, RAS, Cell Cycle/Apoptosis, NOTCH, HH, APC, Chromatin modification, Transcriptional regulation and DNA damage control. In order to check the association of the significant genes identified through our analysis with these signaling pathways, we functionally group all the significant genes, identified by EAMA as well as th e nave method, using DAVID [67 142 ]. Eight out of the twelve possible pathways have been identified through the significant genes obtained from EAMA as well as the nave method. Table 4 5 shows the grouping of the genes according to the above mentioned twelve signaling pathways. Out of the eight signalin g pathways, given in Table 4 re significant among the genes identified by EAMA at a p re significant among the genes identified by the n ave method at 10% significance level. Although the number of genes identified by the nave method i s very large, number of significant signaling pathways, obtained through the identified genes, i s same for the nave method and EAMA. From Table 4 5, we can see that 44 of t he genes, identified by EAMA, a entified by the nave method, a the nave method, also include t pathway. We further study some of the genes that have been identified by the nave method but not by EAMA. For example, the gene with ID 472 i s identified by the nave me thod but not by EAMA. We study in details the expression pattern of this gene in each of the five datasets. Fig ure 4 7 shows the violin plots of the gene with ID 472 for the two cancer types in each of the five datasets.

PAGE 73

73 Table 4 5. The grouping of the genes, identified by EAMA a nd the nave met hod, according to the signaling pathways. Signaling pathway EAMA Naive 12 40 MAPK 21 88 STAT 5 34 PI3K 26 99 RAS 22 79 Cell Cycle 44 80 NOTCH 10 17 HH 5 14 From Fig ure 4 7 we f ind that the gene with ID 472 i s not differentially expressed between the two lung cancer types in four of the five datasets. This sugg ests that the gene with ID 472, in spite of being id entified by the nave method, i s unlikely to be an important factor for the discrimination between t he two cancer types. Also, based on the individual analyses, the p val ue corresponding to this gene i Also, from Table 4 5, we can see that there a under the nave metho d. A mong these 40 genes, 12 genes a re also identified by EAMA as significant whereas the remaining 28 genes a re not identified as significant by our proposed method EAMA. For e xample, the gene with ID 8200 i s identified as significant by the nave method b ut not by EAMA. Fig 4 8 shows the violin plots for the gene with ID 8200 for the two cancer types in each of the five datasets. From Fig ure 4 8 we can s ee that the gene with ID 8200 i s not differentially expressed between the two lung cancer types for fou r of the five datasets. As before, w e can suggest that this gene do not have a considerable impact in distinguishing the two cancer types, and it i s reasonable t hat our proposed method EAMA has

PAGE 74

74 Fig ure 4 7 The violin plots of the gene with ID 472 for the two cancer types in each of the five datasets

PAGE 75

75 Fig ure 4 8 The violin plots for the gene with ID 8200 for the two cancer types in each of the five datasets

PAGE 76

76 4.4 Discussion High throughput tech nologies have enabled simultaneous analysis of thousands of genes in a single experiment. Combining hypotheses testing results from the multiple genomic experiments is a popular meta analysis approach of identifying significant genes related to some biolog ical process. However, there is a distinct difference between the aims of meta analyses involving single hypothesis in each component experiment with that of large scale multiple hypotheses in each experiment. While the former targets in favoring the alter hypothesis with high power, the latter is designed for the identification of a small proportion of ible candidates. In this project we discuss the possibility of making err oneous conclusions from combining the p values calculated under standard theoretical assumptions from multiple genomic experiments when each experiment involves simultaneous testing of enormous number of hypotheses in presence of some hidden confounder var iable. The presence of some confounder variables can induce over dispersion and/or bias that remain unaccounted by the theoretical null assu mptions. In particular, we show that even adjusting the p values by taking into account the false discovery rates ma y not be enough to substantially diminish the false discoveries from the meta analysis of large scale multiple testing when hidden confounder variables are present. We propose an alternative approach of modifying the p values by constructing empirical null distributions and combining these empirically adjusted p values through proper meta analysis approach. Through simulation studies and real datasets involv ing genomic experiments we show that our proposed method has much better performance than the standar d meta analysis approach of combining raw p values from multiple experiments especially in presence of hidden confounder variables. This article mainly focuses on developing meta analysis approach for combining p values from multiple genomic experiments w hen the outcome of interest is affected by some hidden

PAGE 77

77 variable that acts as a confounder with different effects between different groups under study. In biological studies, mostly in observational studies, such hidden variables are very common which play th e roles of confounders. We show that in such scenarios our proposed method performs much better than the standard nave method of meta analysis. However, it may happen in certain situations that a hidden variable affects the outcome but does not have a c onfounding effect, i.e. the effect of the hidden variable on the outcome does not vary significantly between the differ ent groups under study. We consider multiple simulation settings that addresses the abovementioned scenarios and the results show that th e standard meta analysis approach of combining raw p values is a bit conservative having lower power and FDR than our proposed method EAMA. We use empirical nul l distribution as outlined by [63 ] where th e empirical null distribution a approach (see section 4.1.2 as well as [131 ] for details ) of ]. There is an alternative option of estimating the null e ( Details can be found in [131 ]) In various simulation settings, the empi rical null distribution appears to perform better than the maximum likelihood approach. However, the overall results obtained through t he maximum likelihood approach are not significantly different from that obtained from the shown here in details. One may note that although we have used FDR adjusted p values using Benjamin i Hochberg procedure [66 ], there are other options for adjustment of false discovery rates i n cluding the use of q values [143 ].

PAGE 78

78 APPENDIX A R SCRIPTS A. 1 R Script for Identification of Master Regulator Transcription Factor library(dna) library(MASS) library(Kendall) pvalue_boot_tau < NULL for(z in 1:500){ ### case group mu_u < rep(0,10) Sigma=diag(10) u < mvrnorm(100,mu_u,Sigma) colnames(u) < paste("u",1:10,sep="") ## TFs 1 to 10 x1 < rnorm(100,mean=50,sd=1) rho < c(0,0.95,0.8,0.7,0.6,0.5,0.4,0.3,0. 2,0.1) s < x1 for(i in 2:10){ xi < (rho[i]*x1+u[,i])/sqrt(1+(rho[i])^2) s < cbind(s,xi) }

PAGE 79

79 colnames(s) < paste("TF",1:10,sep="") ## TF1 to TF10 round(cor(s),4) ## Genes 1 to 30 delta < 1 ## Delt a r1 < c(0.45,0.4,0.35,0.3,0.25,0.2,0.15,0.1,0.05) r2 < (1 delta)*r1 gamma1 < 0.5+(delta^2)*r1[1] genes1to30 < NULL for(i in 1:30){ e1 < rnorm(100) g1 < x1*gamma1+e1 genes1to30 < cbind(genes1to30,g1) } colnames(genes1to30) < paste("G",1:30,sep="") ## Genes1 to 30 m < c(10,10,10,10,10,10,5,5,5) r1 < c(0.45,0.4,0.35,0.3,0.25,0.2,0.15,0.1,0.05) oth_genes < NULL for(i in 2:10){ for(j in 1:m[i 1]){ e < rnorm(100)

PAGE 80

80 gj < u[,i]*r1[i 1]+e oth_genes < cbind(oth_genes,gj) } } colnames(oth_genes) < paste("G",31:105,sep="") ## Remaining genes pop_exp < cbind(s,genes1to30,oth_genes) ## case group : TFs, Genes ### control group mu_u < rep(0,10) Sigma=diag(10) u < mvrnorm(70,mu_u,Sigma) colnames(u) < paste("u",1:10,sep="") ## TFs 1 to 10 x1 < rnorm(70,mean=5,sd=1) rho < c(0,0.95,0.8,0.7,0.6,0.5,0.4,0.3,0.2,0.1) s < x1 for(i in 2:10){ xi < (rho[i]*x1+u[,i])/sqrt(1+(rho[i])^2) s < cbind(s,xi) }

PAGE 81

81 colnames(s) < paste("TF",1:10,sep="") ## TF1 to TF10 round(cor(s),4) ## Genes 1 to 30 gamma2 < 0.5 genes1to30 < NULL for(i in 1:30){ e1 < rnorm(70) g1 < x1*gamma2+e1 genes1to30 < cbind(genes1to30,g1) } colnames(genes1to30) < paste("G",1:30,sep="") ## Genes 1 to 30 m < c(10,10,10,10,10,10 ,5,5,5) r2 < (1 delta)*r1 oth_genes < NULL for(i in 2:10){ for(j in 1:m[i 1]){ e < rnorm(70) gj < u[,i]*r2[i 1]+e oth_genes < cbind(oth_genes,gj) }

PAGE 82

82 } colnames(oth_genes) < paste("G",31:105,sep="") ## Remaining genes pop_ctl < cbind(s,genes1to30,oth_genes) ## control group: TFs and genes ############################################################################# TFs < paste("TF",1:10,sep="") genes < paste("G",1:105,sep="") ## Measuring differential connecti vity scores for all TFs s < NULL for(l in 1:length(TFs)){ sub < c(TFs[l],genes) pop_exp_new < pop_exp[,colnames(pop_exp)%in%sub] pop_ctl_new < pop_ctl[,colnames(pop_ctl)%in%sub] s1=cornet(pop_exp_new,rescale.scores=FALSE) s2=cornet(pop_ctl_new,rescale .scores=FALSE) s[l] < sum(abs(s1[,TFs[l]] s2[,TFs[l]]))/(length(genes)) } DEgenes_pop < data.frame(TFs,s) DEgenes_pop TfTestStat_pop=DEgenes_pop[with(DEgenes_pop, order( DEgenes_pop$s)),] TfTestStat_pop

PAGE 83

83 ## Pearson's correlation between each pair of TFs data_req_pop=rbind(pop_exp,pop_ctl) TFexpress_pop=data_req_pop[,which(colnames(data_req_pop)%in%TFs)] cormatTF_pop=cor(TFexpress_pop,method="pearson") ## Kendall's statistic for each TF stat_tau=NULL for(t in 1:length(TFs)){ TranInterested=as.character(Tf TestStat_pop[t,1]) orderCorrTF=sort(abs(cormatTF_pop[Tra nInterested,]),decreasing=TRUE) d1=data.frame(gene=TfTestStat_pop[,1],TfTestStat_pop) d2=data.frame(orderCorrTF) d22=data.frame(d2,gene=rownames(d2)) data=merge(d22, d1, by="gene") test_tau=Kendall(da ta[,2], data[,4]) stat_tau[t]=test_tau$tau } allTFsStat_pop=data.frame(TfTestStat_pop[,1],stat_tau) Stat_pop_tau=max(allTFsStat_pop[,2]) ## Final test statistic K ################### ######## Sampling ######################### #

PAGE 84

84 full_pop < rbind(pop_exp,pop_ctl) p_tau < NULL for(k in 1:500){ ## Bootstrap samples x < seq(1:170) sam < sample(x,170,replace=TRUE) sam_exp=full_pop[sam[1:100],] sam_ctl=full_pop[sam[101:170],] ## Differential connectivity scores for all TFs for a bootstrap sample s < NULL for(l in 1:length(TFs)){ sub < c(TFs[l],genes) sam_exp_new < sam_exp[,colnames(sam_exp)%in%sub] sam_ctl_new < sam _ctl[,colnames(sam_ctl)%in%sub] s1=cornet(sam_exp_new,rescale.scores=FALSE) s2=cornet(sam_ctl_new ,rescale.scores=FALSE) s[l] < sum(abs(s1[,TFs[l]] s2[,TFs[l]]))/(length(genes)) } DEgenes_sam < data.frame(TFs,s) DEgenes_sam TfTestStat=DEgenes_sam[with(DEgenes_sam, order( DEgenes_sam$s)),]

PAGE 85

85 TfTestStat ## Pearson's correlation among the TFs for a boot strap sample data_req=rbind(sam_exp,sam_ctl) TFexpress=data_req[,which(colnames(data_req)%in%TFs)] cormatTF=cor(TFexpress,method="pearson") ## Kendall's test statistic value for each TF for a bootstrap sample stat_tau=NULL for(t in 1:length(TFs)){ TranInt erested=as.character(TfTestStat[t,1]) orderCorrTF=sort(abs(cormatTF[Tra nInterested,]),decreasing=TRUE) d1=data.frame(gene=TfTestStat[,1],TfTestStat) d2=data.frame(orderCorrTF) d22=data.frame(d2,gene=rownames(d2)) data=merge(d22, d1, by="gene") test_tau=Kendall(data[,2], data[,4]) stat_tau[t]=test_tau$tau } allTFsStat=data.frame(TfTestStat[,1],stat_tau) Stat_obs_tau=max(allTFsStat[,2]) ## Bootstrap based test statistic value p_tau[k] < Stat_obs_tau }

PAGE 86

86 pvalue_boot_tau[z]=mean(p_tau > Stat_pop_tau) ## p value calculation based on 500 bootstrap samples } datafin < data.frame(seq(1:500),pvalue_boot_tau) sum(datafin[,2]<0.05) A. 2 R Script for Meta Analysis of Differential Protein Expression Networks A.2. 1 Pathway based Analysis LUAD_prot < read.csv("C: \ \ Users \ \ sinjini8 \ \ Desktop \ \ CAMDA2015 \ \ Protein_LUAD \ \ GO_KEGG_LUAD.csv" ,header=T) ## Pathway results LUAD_prot1 < LUAD_prot[,c(3,4)] Prot_mod_LUAD < read.csv("C: \ \ Users \ \ sinjini8 \ \ Desktop \ \ CAMDA2015 \ \ Protein_LUAD \ \ Prot_LUAD.csv" ,header =T) ## Expression Data Clinical < read.csv("C: \ \ Users \ \ sinjini8 \ \ Desktop \ \ CAMDA2015 \ \ Clinical_LUAD \ \ Clinical_mod_LUAD.cs v",header=T) ## Clinical data Clinical1 < cbind(as.character(Clinical[,1]),ifelse(Clinical[,6]=="progression","p","cr")) common_ids < intersect(C linical1[,1],Prot_mod_LUAD[,1]) Prot_Clin_LUAD < Prot_mod_LUAD[which(Prot_mod_LUAD[,1]%in%common_ids),]

PAGE 87

87 ## Prot_LUAD dataset after mer ging with Clinical with 172 ids Clin_Prot_LUAD < Clinical1[which(Clinical1[, 1]%in%common_ids),] ## Clinical_LUAD dataset after merging with Prot_LUAD with 172 ids Prot_Clin_LUAD_stat < cbind(Prot_Clin_LUAD[,1],Clin_Prot_LUAD[,2],Prot_Clin_LUAD[,2:ncol(Prot_Clin_LUAD)]) dim(Prot_Clin_LUAD_stat) #### pathway 1 PI3K Akt signaling pathway ################## GR_prot < levels(droplevels(LUAD_prot1[1,2])) GR_prot_split < unlist(strsplit(GR_prot,",")) ## gene list for P13K prot_path1 < cbind(Prot_Clin_LUAD_stat[,1:2],Prot_Clin_LUAD_stat[,GR_prot_split[1:leng th(GR_prot_spli t)]]) ## expression data with PI3K proteins only dim(prot_path1) prot_path1.cr < subset(prot_path1,prot_path1[,2]=="cr") dim(prot_path1.cr) prot_path1.p < subset(prot_path1,prot_path1[,2]=="p") dim(prot_ path1.p)

PAGE 88

88 library(dna) ourgenelist < colnames(prot_path1)[ c(1,2)] t=test.class.genes(prot_path1.cr[, c(1,2)],prot_path1.p[ c(1,2)],genelist=ourgenelist, scores="cor",distance="abs",rescale.sco res=TRUE,num.permutations=1000) get.results(t) par(mfrow=c(1 ,2)) s=cornet(prot_path1.cr[, c(1,2)],rescale.scores=TRUE) network.modules(s,m=3,epsilon=.7,plot=TRUE,interactive=FALSE,main="complete remission") s=cornet(prot_path1.p[, c(1,2)],rescale.scores=TRUE) network.modules(s,m=3,epsilon=.7,plot=TRUE,interactive=FALSE,main="progression") #### pathway 2 Cell cycle ################## GR_prot < levels(droplevels(LUAD_prot1[18,2])) GR_prot_split < unlist(strsplit(GR_prot,",")) ## gene list for Ce ll cycle prot_path2 < cbind(Prot_Clin_LUAD_stat[,1:2],Prot_Clin_LUAD_stat[,GR_prot_split[1:length(GR_prot_spli t)]]) ## expression data with Cell cycle proteins only dim(prot_path2)

PAGE 89

89 prot_path2.cr < subset(prot_path2,p rot_path2[,2]=="cr") dim(prot_path2.cr) prot_path2.p < subset(prot_path2,prot_path2[,2]=="p") dim(prot_path2.p) ourgenelist < colnames(prot_path2)[ c(1,2)] t=test.class.genes(prot_path2.cr[, c(1,2)],prot_path2.p[ c(1,2)],genelist=ourgenelist, scores="c or",distance="abs",rescale.sco res=TRUE,num.permutations=1000) get.results(t) #### pathway 3 Ras signaling pathway ################## GR_prot < levels(droplevels(LUAD_prot1[32,2])) GR_prot_split < unlist(strsplit(GR_prot,",")) ## gene l ist for Ras signaling pathway prot_path3 < cbind(Prot_Clin_LUAD_stat[,1:2],Prot_Clin_LUAD_stat[,GR_prot_split[1:length(GR_prot_spli t)]]) ## expression data with Ras signaling pathway proteins only dim(prot_path3) prot_path3.cr < subset(prot_path3,prot_path3[,2]=="cr") dim(prot_path3.cr) prot_path3.p < subset(prot_path3,prot_path3[,2]=="p")

PAGE 90

90 dim(prot_path3.p) ourgenelist < colnames(prot_path3)[ c(1,2)] t=test.class.genes(prot_path3.cr[, c(1,2)],prot_path3.p[, c(1 ,2)],genelist=o urgenelist, scores="cor",distance="abs",rescale.sco res=TRUE,num.permutations=1000) get.results(t) #### pathway 4 MAPK signaling pathway ################## GR_prot < levels(droplevels(LUAD_prot1[56,2])) GR_prot_split < unl ist(strsplit(GR_prot,",")) ## gene list for MAPK signaling pathway prot_path4 < cbind(Prot_Clin_LUAD_stat[,1:2],Prot_Clin_LUAD_stat[,GR_prot_split[1:length(GR_prot_spli t)]]) ## expression data with MAPK signaling p athway proteins only dim(prot_path4) prot_path4.cr < subset(prot_path4,prot_path4[,2]=="cr") dim(prot_path4.cr) prot_path4.p < subset(prot_path4,prot_path4[,2]=="p") dim(prot_path4.p) ourgenelist < colnames(prot_path4)[ c(1,2)]

PAGE 91

91 t=test.class.genes(prot_ path4.cr[, c(1,2)],prot_path4.p[ c(1,2)],genelist=ourgenelist, scores="cor",distance="abs",rescale.sco res=TRUE,num.permutations=1000) get.results(t) #### pathway 5 Apoptosis ################## GR_prot < levels(droplevels(LUAD_prot1[71,2])) GR_prot_split < unlist(strsplit(GR_prot,",")) ## gene list for Apoptosis prot_path5 < cbind(Prot_Clin_LUAD_stat[,1:2],Prot_Clin_LUAD_stat[,GR_prot_split[1:length(GR_prot_spli t)]]) ## expression data with Apoptosis proteins only dim(prot_path5) prot_path5.cr < subset(prot_path5,prot_path5[,2]=="cr") dim(prot_path5.cr) prot_path5.p < subset(prot_path5,prot_path5[,2]=="p") dim(prot_path5.p) ourgenelist < colnames(prot_path5)[ c( 1,2)] t=test.class.genes(prot_path5.cr[, c(1,2)],prot_path5.p[ c(1,2)],genelist=ourgenelist, scores="cor",distance="abs",rescale.sco res=TRUE,num.permutations=1000) get.results(t)

PAGE 92

92 ######################################################################## library(RankAggreg) x < matrix(c("Ras signaling pathway", "MAPK signaling pathway","PI3K Akt signaling pat hway","Apoptosis","Cell cycle", "PI3K Akt s ignaling pathway","Cell cycle", "Apoptosis","Ras signaling pat hway","MAPK signaling pathway", "Ras signal ing pathway","Apoptosis", "MAPK signaling pathway ","PI3K Akt signaling pathway", Cell cycle"), byrow=T,ncol=5) w < matrix(c(0.026,0.082,0.241,0.407,0.410 ,0.013,0.238,0.417,0.507,0.759, 0.156,0.487,0.517,0.774,0.997),byrow=T,ncol=5) RankAggre g(x, 5, weig hts=w, method="CE", distance="Spearman",seed=NULL, maxIter = 1000) A.2. 2 In di vidual Protein based Analysis ####LUNG Cancer ### Merging protein data with clinical Prot_mod_LUAD < read.csv("C: \ \ Users \ \ sinjini8 \ \ Desktop \ \ CAMDA2015 \ \ Protein_LUAD \ \ Prot_LUAD.csv",header =T)

PAGE 93

93 Clinical < read.csv("C: \ \ Users \ \ sinjini8 \ \ Desktop \ \ CAMDA2015 \ \ Clinical_LUAD \ \ Clinical_mod_LUAD.cs v",header=T) Clinical1 < cbind(as.character(Clinical[,1]),ifelse(C linical[,6]=="progression","p","cr")) common_ids < intersect(Clinical1[,1],Prot_mod_LUAD[,1]) Prot_Clin_LUAD < Prot_mod_LUAD[which(Prot_mod_LUAD[,1]%in%common_ids),] ## Prot_LUAD dataset after merging with Clinical with 172 ids Clin_ Prot_LUAD < Clinical1[which(Clinical1[,1]%in%common_ids),] ## Clinical_LUAD dataset after merging with Prot_LUAD with 172 ids Prot_Clin_LUAD_stat < cbind(Prot_Clin_LUAD[,1],Clin_Prot_LUAD[,2],Prot_Clin_LUAD[,2:ncol(Prot_Clin_LUAD)]) dim(Pr ot_Clin_LUAD_stat) prot_path.cr < subset(Prot_Clin_LUAD_stat,Prot_Clin_LUAD_stat[,2]=="cr") dim(prot_path.cr) prot_path.p < subset(Prot_Clin_LUAD_stat,Prot_Clin_LUAD_stat[,2]=="p") dim(prot_path.p) ################# ## testing for individual proteins ## ##########################

PAGE 94

94 library(dna) tig.results1=test.individual.genes(prot_path.cr[, c(1,2)],prot_path.p[, c(1,2)],scores="cor", distance="abs",rescale.scores=TRUE,num.permutations=1000) summary(tig.results1) proteins1 < get.results(tig.results1)[1:50,] dim(proteins1) prots1 < rownames(proteins1) ################### pathway data ####################################### LUAD_prot < read.csv("C: \ \ Users \ \ sinjini8 \ \ Desktop \ \ CAMDA2015 \ \ Protein_LUAD \ \ GO_KEGG_LUAD.csv" ,hea der=T) LUAD_prot1 < LUAD_prot[,c(3,4)] GR_prot1_P13K < levels(droplevels(LUAD_prot1[1,2])) GR_prot1_split_P13K < unlist(strsplit(GR_prot1_P13K,",")) ## gene list for P13K Tot_P13K1 < length(GR_prot1_split_P13K) P13K1 < GR_prot1_split_P13K[GR_prot1_ split_P13K%in%prots1] ## P13K proteins significant at 10% level GR_prot1_CC < levels(droplevels(LUAD_prot1[18,2]))

PAGE 95

95 GR_prot1_split_CC < unlist(strsplit(GR_prot1_CC,",")) ## gene list for Cell cycle Tot_CC1 < length(GR_prot1_split_CC) CC1 < GR_prot1_ split_CC[GR_prot1_split_CC%in%prots1] ## cell cycle proteins significant at 10% level GR_prot1_RAS < levels(droplevels(LUAD_prot1[32,2])) GR_prot1_split_RAS < unlist(strsplit(GR_prot1_RAS,",")) ## gene list for Ras signaling pathway Tot_RAS1 < length( GR_prot1_split_RAS) RAS1 < GR_prot1_split_RAS[GR_prot1_split_RAS%in%prots1] ## RAS proteins significant at 10% level GR_prot1_MAPK < levels(droplevels(LUAD_prot1[56,2])) GR_prot1_split_MAPK < unlist(strsplit(GR_prot1_MAPK,",")) ## gene list for MAPK signaling pathway Tot_MAPK1 < length(GR_prot1_split_MAPK) MAPK1 < GR_prot1_split_MAPK[GR_prot1_split_MAPK%in%prots1] ## MAPK proteins significant at 10% level

PAGE 96

96 GR_prot1_Apop < levels(droplevels(LUAD_prot1[71,2])) GR_prot1_split_Apop < unlist(strsplit (GR_prot1_Apop,",")) ## gene list for Apoptosis Tot_Apop1 < length(GR_prot1_split_Apop) Apoptosis1 < GR_prot1_split_Apop[GR_prot1_split_Apop%in%prots1] ## Apoptosis proteins significant at 10% level par(mfrow = c(2, 2), # 2x2 layout oma = c(2, 2, 2, 0), # two rows of text at the outer left and bottom margin mar = c(1, 4, 1, 3), # space for one row of text at ticks and to separate plots mgp = c(2, 1, 0), # axis label at 2 rows distance, tick labels at 1 row xpd = NA) require(grDevices) pie_pro t1 < c(length(P 13K1),length(CC1),length(RAS1), length(MAPK1),length(Apoptosis1)) names(pie_prot1) < c("PI3K","Cell Cycle","RAS","MAPK","Apoptosis") pie(pie_prot1,col = c("red", "blue", "green3","yellow","cyan"), radius = 0.7, main = "Lung Adenocarcinoma ") ####Rank Aggregation ####################################### lung < get.results(tig.results1)

PAGE 97

97 kirc < get.results(tig.results2) hnsc < get.results(tig.results3) x < matrix(c(rownames(lung),rownames(kirc),rownames(hnsc)),nrow=3,byrow=T) w < matrix(c(lung$p.value,kirc$p.value,hnsc$p.value),nrow=3,byrow=T) library(RankAggreg) R < RankAggreg(x, 50, weights=w, method="CE",distance="Spearman",seed=NULL, maxIter = 1000) proteinsR < c("EGFR","NFKB1","ATM","SRC","ACACA.ACACB","LCK","NF2", "BCL2L 11", "PRKCA", "FRAP1", "CCNB1", "PCNA", "YBX1", "CDKN2A", "YWHAE", "FN1", "EIF4EBP1", "SMAD1", "BECN1", "DVL3", "ERBB2", "MRE11A", "ACACA", "RAF1", "CDH2", "AXL", "KRT5", "TSC2", "EEF2", "RAD50", "PTGS2", "XBP1", "WWTR1", "KRAS", "MACC1", "GAB2", "CLDN7", "NOTCH1", "MAP2K1", "NFE2L2", "MET", "FOXO3", "CDH3", "STAT3", "ESR1", "CTNNA1", "AKT1.AKT2.AKT3", "GSK3A.GSK3B", "PRKCD", "PGR") p13k_proteins < c(GR_prot1_split_P13K,GR_prot2_split_P13K,GR_prot3_split_P13K) p13k_proteins_un < unique(p13k_proteins) Tot _P13KR < length(p13k_proteins_un)

PAGE 98

98 ras_proteins < c(GR_prot1_split_RAS,GR_prot2_split_RAS,GR_prot3_split_RAS) ras_proteins_un < unique(ras_proteins) Tot_RASR < length(ras_proteins_un) cc_proteins < c(GR_prot1_split_CC,GR_prot2_split_CC,GR_prot3_split _CC) cc_proteins_un < unique(cc_proteins) Tot_CCR < length(cc_proteins_un) apop_proteins < c(GR_prot1_split_Apop,GR_prot2_split_Apop,GR_prot3_split_Apop) apop_proteins_un < unique(apop_proteins) Tot_ApopR < length(apop_proteins_un) mapk_proteins < c(GR_prot1_split_MAPK,GR_prot2_split_MAPK,GR_prot3_split_MAPK) mapk_proteins_un < unique(mapk_proteins) Tot_MAPKR < length(mapk_proteins_un) P13K_R < proteinsR[proteinsR%in%p13k_proteins_un] RAS_R < proteinsR[proteinsR%in%ras_proteins_un] CC_R < prot einsR[proteinsR%in%cc_proteins_un] Apop_R < proteinsR[proteinsR%in%apop_proteins_un] MAPK_R < proteinsR[proteinsR%in%mapk_proteins_un] require(grDevices)

PAGE 99

99 pie_protR < c(length(P13K_R),length(CC_R),length(RAS_R), length(MAPK_R),length(Apop_R)) names(pie_ protR) < c("PI3K","Cell Cycle","RAS","MAPK","Apoptosis") pie(pie_protR,col = c("red", "blue", "green3","yellow","cyan"), radius = 0.7, mai n = "Overall") A. 3 R Script for EAMA library(limma) library(locfdr) M < 10 NG < 1000 NSG < 70 N < 20 GV_con < matrix(rep(c(rep( 7.5,20),rep(7.5,50),rep(0,NG NSG)),10), nrow=NG,ncol=10,byrow=FALSE) GV_case < matrix(rep(c(rep(7.5,20),rep( 7.5,50),rep(0,NG NSG)),10), nrow=NG,ncol=10,byrow=FALSE) GV < cbind(GV_con,GV_case) false_prop < NULL DE_prop < NULL false_other < NULL

PAGE 100

100 DE_other < NULL for(i in 1:500){ Ztrans < NULL rpval < NULL for(k in 1:M){ Wijk < NULL eijk < NULL for(j in 1:20){ if(j <= 10){ s < rbinom(NG,1,0.4) z1 < rnorm(20, 3+(1*k),0.01) z2 < rnorm(50,2+(1*k),0.01) z3 < rnorm(NG NSG,10 +(1*k),0.01) z < c(z1,z2,z3) W < z*s Wijk < cbind(Wijk,W) e < rnorm(NG,0,0.05) eijk < cbind(eijk,e) } else{ s < rbinom(NG,1,0.4) z1 < rnorm(20,3+(1*k),0.01) z2 < rnorm(50,15+(1*k),0.01) z3 < rnorm(NG NSG,20+(1*k),0.01) z < c(z1,z2,z3) W < z*s Wijk < cbind(Wijk,W) e < rnorm(NG,0,0.05) eijk < cbind(eijk,e) }

PAGE 101

101 } yijk < GV+Wijk+eijk rownames( yijk) < paste("G",1:NG,sep="") design < cbind(Grp1=1,Grp2vs1=c(rep(0,10),rep(1,10))) options(digits=3) fit < lmFit(yijk,design) fit < eBayes(fit) t < t opTable( fit,coef=2,number=NG,p.value=1) pval < data.frame(genes=row names(t),t[,4]) colnames(pval) < c("genes",paste("col",k,sep="")) if(k==1){rpval < pval} else{ rpval < merge(rpval,pval,by="genes") } } rownames(rpval) < rpval[,1] raw_pval < t(rpval[, 1]) t1=apply(raw_pval,2,FUN=function(x) pchisq(2*sum( log(x)),2*M,lower.tail=FALSE)) adj _pval1=p.adjust(t1,method="BH") Ztrans < qnorm(raw_pval) Zvec < as.vector(Ztrans) Efron < locfdr(Zvec,bre=120,nulltype=2) mu.hat < Efron$fp0[5,1] sig.hat < Efron $fp0[5,2] Mod_Zt rans < (Ztrans mu.hat)/sig.hat pval_til < pnorm(Mod_Ztrans) t2=apply(pval_til,2,FUN=function(x) pchisq(2*sum( log(x)),2*M,lower.tail=FALSE)) adj_pval2=p.adjust(t2,method="BH") prop_meth < adj_pval2[adj_pval2<0.05] without_trans < adj_pval1[adj_pval1<0.05]

PAGE 102

102 DE_prop[i] < length(prop_meth[names(prop_meth)%in%paste("G",1:NSG,sep="")]) false_prop[i] < length(prop_meth[names(prop_meth)%in%paste("G",(NSG+1):NG,sep="")]) DE_other[i] < length(without_trans[names(without_trans)%in%pa ste("G",1:NSG,sep="")]) false_other[i] < length(without_trans[names(without_trans)%in%paste("G",(NSG+1):NG, sep="")]) } TP < DE_prop FN < NSG DE_prop Sens < mean(TP/(TP+FN)) NNSG < NG NSG TN < NNSG false_prop FP < false_prop Spec < mean(TN/(TN+ FP)) FDR < mean(FP/(TP+FP)) FNR < mean(FN/(TN+FN)) ## TP2 < DE_other FN2 < NSG DE_other Sens2 < mean(TP2/(TP2+FN2)) TN2 < NNSG false_other FP2 < false_other Spec2 < mean(TN2/(TN2+FP2)) FDR2 < mean(FP2/(TP2+FP2)) FNR2 < mean(FN2/(TN2+FN2))

PAGE 103

103 LIST OF REF ERENCES [1] Sikdar S, Datta S. A novel statistical approach for identification of the master regulator transcription factor BMC Bioinformatics. 2017; 18: 79. [2] Hahn WC Counter CM Lundberg AS Beijersb ergen RL Brooks MW W einberg RA Creation of human tumour cells with defined genetic elements. Nature. 1999 ; 400 : 464 8. [3] Hahn WC Weinberg RA Rule s for making human tumor cells. N Engl J Med. 2002 ; 347 : 1593 603. [4] Cancer.Net. http://www.cancer.net/navigating cancer care/cancer basics/genetics/genetics cancer [5] Tonon G Wong KK Maulik G Brennan C Feng B Zhang Y Khatry DB Protopopov A You MJ Aguirre AJ et al. High resolution genomic profiles of human lung cancer. Proc Natl Acad Sci USA 2005; 102(27): 9625 30. [6] Park H Cho SY Kim H Na D Han JY Chae J Park C Park OK Min S Kang J et al. Genomic alterations in BCL2L1 and DLC1 contribute to drug sensitivity in gastric cancer Proc Natl Acad Sci USA 2015 ; 112 ( 40 ) : 12492 7. [7] Sweet Cordero A Mukherjee S Subr amanian A You H Roix JJ Ladd Acosta C Mesirov J Golub TR Jacks T An on cogenic KRAS2 expression signature identified by cross sp ecies gene expression analysis. Nat Genet. 2005; 37(1): 48 55. [8 ] Khatri P, Sirota M, Butte AJ. Ten Year s of Pathway Analysis: Current approaches and outstanding c hallenges. PLoS Comput Bi ol. 2012; 8(2): e1002375. [9] Myers JS von Lersner AK Ro bbins CJ Sang QX Differentially expressed genes and signature pathways of human prostate cancer. PloS One 2015; 10 (12): e0145322. [10] Subramanian A Tamayo P Mootha VK Mukherjee S Ebert BL Gillette MA Paulovich A Pomero y SL Golub TR Lander ES et al. Gene set enrichment analysis: a knowledge based approach for interpreting ge nome wide expression profiles. Proc Natl Acad Sci USA 2005; 102(43): 15545 50. [11] Baranzini SE Galwey NW Wang J Khankhanian P Lindberg R Pelleti er D Wu W Uitdehaag BM Kappos L GeneMSA Consortium et al. P athway and network based analysis of genome wide association studies in multiple sclerosis. Hum Mol Genet 2009; 18(11): 2078 90. [12] Chuang HY Lee E Liu YT Lee D Ideker T Network based classificati on of breast cancer metastasis. Mol Syst Biol 2007; 3 : 140. [13] Li Y Agarwal P A pathway based view of human dise ases and disease relationships. PloS O ne 2009; 4(2) : e4346.

PAGE 104

104 [14] Lee I Blom UM Wang PI Shim JE Marcotte EM Prioritizing candidate disease genes by network based boosting o f genome wide association data. Genome Res 2011; 21(7): 1109 2 1. [15 ] Libermann TA, Zerbini LF. Targeting transcription factors for cancer gene therapy. Curr Gene Ther. 2006; 6(1): 17 33. [16 ] Zerbini LF. Oncogenic transcription factors: target g enes. In: eLS. John Wiley & Sons Ltd. 2007. http://www.els.net [doi: 1 0.1002/9780470015902.a0006049]. [17 ] Darnell JE Jr. Transcription factors as targets for cancer therapy. Nat Rev Cancer. 2002; 2(10): 740 9. [18] Wasylyk B Wasylyk C Flores P Begue A Leprince D Stehelin D The c ets proto oncogenes encode transcription factors that cooperate with c Fos and c Jun fo r transcriptional activation. Nature 1990; 346(6280) : 191 3 [19] Downward J Targeting RAS signalling pathways in cancer therapy. Nat Rev Cancer 2003 ; 3 (1): 11 22. [20] Huang B, Warner M, Gustafsson J Estrogen receptors in breast carcinogenesis and endocrine therapy. Mol Cell Endocrinol. 2015; 418 : 240 4. [21] Sharifi N Steroid receptors aplenty in prostate cancer. N Engl J Med. 2014 ; 370 (10): 970 1. [22] Yu H, Lee H, Herrmann A, Buettner R, Jove R Revisiting STAT3 signalling in cancer: new and unexpected biological functions. Nat Rev Cancer 2014 ; 14 (11): 736 46. [23 ] Seth RB, Sun L, Ea CK, Chen ZJ. Identification and characterization of MAVS, a mitochondrial antiviral signaling protein that activates NF kappaB and IRF 3. Cell. 2005; 122(5): 669 82. [24] Federman N, Meyers PA, Daw NC, Toretsky J, Breitmeyer JB, Singh AS, Miller LL, Oltersdorf T, Jezior D, Jessen KA, et al. A phase I, first in human, dose escala tion study of intravenous TK216 in patients with relaps ed or refractory e wing sarcoma. J Clin Oncol 2017; 35: TPS1 1626 [25] Mahajan N. Signatures of prostate deriv ed ets factor (PDEF) in cancer. Tumo ur Biol. 2016; 37(11) : 14335 40. [26 ] Yeh JE, Toniolo PA, Frank DA. Targeting transcription factors: promising new strategies for cancer therapy. Curr Opin Oncol. 2013; 25(6): 652 8. [27 ] Redmond AM Carroll JS. Defining and targeting transcription factors in cancer. Genome Biol. 2009; 10(7): 311. [28 ] Bhagwat AS, Vakoc CR. Targeting Transcription Factors in Cancer. Trends in cancer. 2015; 1(1): 53 65.

PAGE 105

105 [29 ] Tovar H, Garca Herrera R, Espinal Enrquez J, Hernndez Lemus E. Transcriptional master regulator analysis in breast cancer genetic networks. Comput Biol Chem. 2015; 59 : 67 77. [30 ] Bae T, Rho K, Choi JW, Horimoto K, Kim W, Kim S. Identification of upstream regulators for prognostic expression signature genes in colorectal cancer. BMC Syst Biol. 2013 ; 7 : 86. [31 ] Sawle AD, Kebschull M, Demmer RT, Papa panou PN. Identification of master regulator genes in human p eriodontitis. J Dent Res. 2016; 95(9): 1010 7. [32 ] Gubelmann C, Schwalie PC, Raghav SK, Rder E, Delessa T, Kiehlmann E, Waszak SM, Corsinotti A, Udin G, Holcombe W, et al. Identification of the transcription factor ZEB1 as a central component of the adipogenic gene regulatory network. Elife 2014 ; 3 : e03346. [33] Medvedovic J, Ebert A, Tagoh H, Busslinger M Pax5: a master regulator of B cell development and leukemogenesis. Adv. Immunol. 2011; 111: 179 206. [34] Carro MS, Lim WK, Alvarez MJ, Bollo RJ, Zhao X, Snyder EY, Sulman EP, Anne SL, Doetsch F, Colman H, et al. The transcriptional network for mesenchymal transformation of brain tumours. Nature. 2010; 463 (7279) : 318 25. [35] Mullen AC, Orlando DA, Newman JJ, Lovn J, Kumar RM, Bilodeau S, Reddy J, Guenther MG, DeKoter RP, Young RA Master transcription factors determine cell type specific responses to TGF signaling Cell. 2011; 147(3): 565 76 [ 36 ] Sinha S, Tompa M. A statistical meth od for finding transcription factor binding sites. Proc Int Conf Intell Syst Mol Biol. 2000 ; 8: 344 54. [ 37 ] Gardner TS, di Bernardo D, Lorenz D, Collins JJ. Inferring genetic networks and identifying compound mode of action via expression profiling. Science. 2003; 301(5629): 102 5. [38 ] Tsai HK, Lu HH, Li WH. Statistical methods for identifying yeast cell cycle transcription factors. Proc Natl Acad Sci USA. 2005 ; 102 (38): 13532 7 [ 39 ] Tsai HK, Huang GT, Chou MY, Lu HH, Li WH. Method for identifying t ranscription factor binding sites in yeast. Bioinformatics. 2006 ; 22 (14): 1675 81. [40 ] Cheng C, Li LM. Systematic identification of cell cycle regulated transcription factors from microarray time series data. BMC Genomics. 2008 ; 9 : 116. [41 ] Wu WS, Li WH. Systematic identification of yeast cell cycle transcription factors using multiple data sources. BMC Bioinformatics. 2008 ; 9 : 522. [42 ] Ho Sui SJ, Mortimer JR, Arenillas DJ, Brumm J, Walsh CJ, Kennedy BP, Wasserman WW. oPOSSUM: identification of over repr esented transcription factor binding sites in co expressed genes. Nucleic Acids Res. 2005 ; 33 (10): 3154 64.

PAGE 106

106 [43 ] Banerjee N, Zhang MQ. Identifying cooperativity among transcription factors controlling the cell cycle in yeast. Nucleic Acids Res. 2003; 31 (23 ): 7024 31. [44 ] Hu H. An efficient algorithm to identify coord inately activated transcription factors. Genomics. 2010; 95(3): 143 50. [45 ] Ma S, Jiang T, Jiang R. Differential regulation enrichment analysis via the integration of transcriptional regulator y network and gene expression data. Bioin formatics. 2015; 31(4): 563 71. [46 ] Schacht T, Oswald M, Eils R, Eichmller SB, Knig R. Estimating the activity of transcription factors by the effect on their target genes. Bioinformatics 2014; 30 (17): i401 7. [47 ] Petti AA, Church GM. A network of transcriptionally coordinated functional modules in Saccharomyces cerevisiae. Genome Res. 2005; 15(9): 1298 306. [48] Luscombe NM, Babu MM, Yu H, Snyder M, Teichmann SA, Gerstein M. Genomic analysis of regulatory netw ork dynamics reveals large topological changes. Nature. 2004; 431(7006): 308 12. [49 ] Roth FP, Hughes JD, Estep PW, Church GM Finding DNA regulatory motifs within unaligned noncoding sequences clustered by whole genome mRNA quantitation. Nat Biotechnol. 1 998; 16(10): 939 45. [50 ] Allocco DJ, Kohane IS, Butte AJ. Quantifying the relationship between co expression, co regulation and gene function. BMC Bioinformatics. 2004; 5(1): 18. [51 ] Tavazoie S, Hughes JD, Campbell MJ, Cho RJ, Church GM. Systematic determination of genetic network architecture. Nat Genet. 1999; 22(3): 281 5. [52 ] Das D, Banerjee N, Zhang MQ Interacting models of cooperative gene regulation. Proc Natl Acad Sci US A. 2004; 101(46): 16234 9. [53 ] Gevaert O, Plevritis S. Ident ifying master regulators of cancer and their downstream targets by integrating genomic and epigenomic features. Pac Symp Biocomput. 2013; 123 34. [54 ] Padi M, Quackenbush J. Integrating transcriptional and protein interaction networks to prioritize condition specific master regulators. BMC Syst Biol. 2015 ; 9 : 80. [55 ] Piao G, Saito S, Sun Y, Liu ZP, Wang Y, Han X, Wu J, Zhou H, Chen L, Horimoto K. A computational procedure for identifying master regulator candidates: a case study on diabetes progress ion in Goto Kakizaki rats. BMC Syst Biol 2012 ; 6 (1): S2. [56 ] Saito S, Zhou X, Bae T, Kim S, Horimoto K. Identification of master regulator candidates in conjunction with network screening and inference. Int J Data Min Bioinform 2013 ; 8 (3): 366 80.

PAGE 107

107 [57 ] Yang J, Mani SA, Donaher JL, Ramaswamy S, Itzykson RA, Come C, Savagner P, Gitelman I, Richardson A, Weinberg RA. Twist, a master regulator of morphogenesis, plays an essential role in tumor metastasis Cell. 2004; 117(7): 927 39. [58 ] Chan SS, Kyba M. What is a master regulator? J Stem Cell Res Ther. 2013 ; 3 : 114. [59 ] De D, Jeong MH, Leem YE, Svergun DI, Wemmer DE, Kang JS, Kim KK, Kim SH. Inhibition of master transcription factors in pluripotent cells induces early stage differentiation. Proc Natl Aca d Sci US A. 2014; 111 (5): 1778 83. [60 ] Vogelstein B, Papadopoulos N, Velculescu VE, Zhou S, Diaz LA, Kinzler KW. Cancer genome landscapes. Science. 2013; 339(6127): 1546 58. [61 ] Emmert Streib F, Glazko GV. Pathway a nalysis of expression data: d eciphering functional building blocks of complex d iseases PLoS Comput Biol. 2011; 7(5): e1002053. [62 ] Fisher R A. Statistical methods for research w orkers. London: Oliver and Boyd, 1932. [63 ] Efron B. Large scale simultaneous hypothesis testing: the choi ce of a null hyp othesis. JASA 2004; 99: 96 104. [64 ] Hedenfalk I, Duggan D, Chen Y, Radmacher M, Bittner M, Simon R Meltzer P, Gusterson B, Esteller M, Raffeld M, et al. Gene expression profiles in hereditary breast cancer. N Engl J Med. 2001; 344: 539 48. [65 ] Storey JD, Tibshirani R. Statistical significance for genomewide studies. Proc Natl Acad Sci. US A. 2003 ; 100 : 9440 5. [66 ] Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Stat Soc Series B Stat Methodol. 1995; 57: 289 300. [67 ] Huang Da W, Sherman BT, Lempicki RA. Bioinformatics enrichment tools: paths toward the comprehensive functional analysis of large gene lists. Nucleic Acids Res. 2009; 37 : 1 13. [68 ] Purcell S, Neale B, Todd Brown K, Thomas L, Ferreira MA, Bender D, Maller J Sklar P de Bakker PI Daly MJ et al. PLINK: a tool set for whole genome association and population based linkage analyses. Am J Hum Genet. 2007 ; 81 : 559 75. [69 ] Anders Simon, Huber W. Differential expression analysis for sequence count data. Genome biol. 2010; 11: R106. [70 ] Benjamini Y, Drai D, Elmer G, Kafkafi N, Golani I. Controlling the false discovery rate in behavior genetics research. Behav Brain Res. 2001 ; 125 : 279 84. [ 71 ] Wold H. Estimation of principal components and related models by iterative least squares. In Krishnaiaah, P.R. Multivariate Analysis New York: Academic Press. 1966. 391 420.

PAGE 108

108 [ 72 ] Datta S Exploring relationships in gene expressions: a partial least squares approach. Gene Expr. 2001; 9(6):249 55. [ 73 ] Gill R, Datta S, Datta S A statistical framework for differential network analysis from microarray data. BMC Bioinformatics 2010; 11 : 95. [ 74 ] Massie CE, Lynch A, Ramos Montoya A, Boren J, Stark R Fazli L Warren A Scott H Madhu B Sharma N et al The androgen receptor fuels prostate cancer by regulating central metabolism and biosynthesis. EMBO J. 2011; 30(13): 2719 33. [ 75 ] Ritchie ME, Phipson B Wu D, Hu Y, Law CW, Shi W, Smyth GK. limma powers differential expression analyses fo r RNA sequencing and microarray studies. Nucleic Acids Res. 2015 ; 43 (7), e47. [ 76 ] Cronauer MV, Schulz WA, Ackermann R, Burchardt M. Effects of WNT/beta catenin pathway activation on signaling through T cell factor and androgen receptor in prostate cancer cell lines. Int J Oncol. 2005; 26(4): 1033 40. [ 77 ] Li H, Kim JH, Koh SS, Stallcup MR. Synergistic effects of coactivators GRIP1 and beta catenin on gene activation: cross talk between androgen receptor and Wnt signaling pathways. J Biol Chem. 2004; 279(6) : 4212 20. [ 78 ] Song LN, Gelmann EP. Interaction of beta catenin and TIF2/GRIP1 in transcriptional activation by the androgen receptor. J Biol Chem. 2005; 280 (45): 37853 67. [ 79 ] Song LN, Herrell R, Byers S, Shah S, Wilson EM, G elmann EP Beta catenin bind s to the activation function 2 region of the androgen receptor and modulates the effects of the N terminal domain and TIF2 on ligand dependent transcription. Mol Cell Biol. 2003; 23(5): 1674 87. [ 80 ] Terry S, Yang X, Chen MW, Vacherot F, Buttyan R. Multifa ceted interaction between the androgen and Wnt signaling pathways and the implication for prostate cancer. J Cell Biochem. 2006; 99(2): 402 10. [ 81 ] Yang X, Chen MW, Terry S, Vacherot F, Bemis DL, Capodice J Kitajewski J de la Taille A Benson MC Guo Y, et al. Complex regulation of human androgen receptor expression by Wnt signaling in prostate cancer cells. Oncogene. 2006; 25(24): 3436 44. [ 82 ] Jiang X, Yu Y, Yang HW, Agar NY, Frado L, Johnson MD The imprinted gene PEG3 inhibits Wnt signaling and regul ates glioma growth. J Bio l Chem. 2010; 285(11): 8472 80. [ 83 ] Ribarska T, Bastian KM, Koch A, Schulz WA. Specific changes in the expression of imprinted genes in prostate cancer implications for cancer progression and epigenetic regulation. Asian J Androl. 2012 ; 14 (3): 436 50. [ 84 ] Su ZZ, Goldstein NI, Jiang H, Wang MN, Duigou GJ, Young CS, Fisher PB. PEG 3, a nontransforming cancer progression gene, is a positive regulator of cancer aggressiveness and ang iogenesis. Proc Natl Acad Sci US A. 1999 ; 96 (26): 151 15 20.

PAGE 109

109 [ 85 ] Webb EA, AlMutair A Kelberman D Bacchelli C Chanudet E Lescai F Andoniadou CL Banyan A Alsawaid A Alrifai MT et al. ARNT2 mutation causes hypopituitarism, post natal microcephaly, visual and renal anomalies. Brain. 2013; 136(10): 3096 105. [ 86 ] Yang B, Yang E, Liao H, Wang Z, Den Z, Ren H. ARNT2 is downregulated and serves as a potential tumor suppressor gene in non small cell lung cancer. Tumour Biol. 2015; 36(3): 2111 9. [ 87 ] Martinez V, Kennedy S, Doolan P, Gammell P, Joyce H, Kenny E, Prakash Mehta J, Ryan E, O'Connor R, Crown J, et al. Drug metabolism related genes as potential biomarkers: analysis of expression in normal and tumour breast tissue. Breast Cancer Res Treat. 2008; 110(3): 521 30. [ 88 ] Stauffer S, Chen X, Zhang L, Chen Y, Dong J. KIBRA promotes prostate cancer cell proliferation and motility. FEBS J. 2016; 283(10): 1800 11. [ 89 ] Heemers HV, Regan KM, Schmidt LJ, Anderson SK, Ballman KV, Tindall DJ. Androgen modulation of coregulator expression in prostate cancer ce lls. Mol Endocrinol. 2009 ; 23 (4): 572 83. [ 90 ] Yamamoto M, Cid E, Bru S, Yamamoto F. Rare and frequent promoter methylation, respectively, of TSHZ2 and 3 genes that are both downregulated in expression in breast and prostate cancers. PLoS One. 2011; 6(3): e17149. [ 91 ] Jackson BC, Carpenter C, Nebert DW, Vasiliou V. Update of human and mouse forkhead box (FOX) gene families. Hum Genomics. 2010; 4(5): 345 52. [92 ] Hong Y, Ho KS, Eu KW, Cheah PY. A susceptibility gene set for early onset colorectal cancer that integrates diverse signaling pathways: implication for tumorigenesis. Clin Cancer Res 2007; 13(4): 1107 14. [93 ] Karin M, Lin A. NF kappaB at the crossroads of life and death. Nat Immunol. 2002; 3(3): 221 7. [94 ] Sakamoto K, Maeda S, Hikiba Y, Nakagawa H, Hayakawa Y, Shibata W, Yanai A, Ogura K, Omata M. Constitutive NF kappaB activation in colorectal carcinoma plays a key role in angiogenesis, promoting tumor growth. Clin Cancer Res. 2009; 15(7): 2248 58. [95 ] Nakshatri H, Bhat Nakshatri P, Martin DA, G oulet RJ Jr, Sledge GW Jr. Constitutive activation of NF kB during progression of breast cancer to hormone independent growth. Mol Cell Biol. 1997; 17(7): 3629 39. [96 ] Rayet B, Gelinas C. Aberrant rel/nfkb genes and activity in human cancer. O ncogene. 19 99; 18(49): 6938 47. [97 ] Tai DI, Tsai SL, Chang YH, Huang SN, Chen TC, Chang KS, Liaw YF. Constitutive activation of nuclear factor kB in hepatocellular carcinoma. Cancer. 2000; 89(11): 2274 81.

PAGE 110

110 [98 ] Luo JL, Kamata H, Karin M. IKK/NF kappa B signaling: balancing life and death -a new approach to cancer therapy. J Clin Invest. 2005; 115(10): 2625 32. [99 ] Karin M, Ben Neriah Y. Phosphorylation meets ubiquitination: the control of NF [ kappa] B activity. Annu Rev Immunol. 2000; 18: 621 63. [100 ] Hayden MS, Ghosh S. Signaling to NF kappa B. Genes Dev. 2004; 18(18): 2195 224. [101 ] Wang S, Liu Z, Wang L, Zhang X NF kappaB signaling pathway, inflammation and colorectal cancer. Cell Mol Immunol. 2009; 6(5): 327 34. [ 102 ] Wang CY, Cusack JC Jr, Liu R, Baldwin AS Jr. Control of inducible chemoresistance: enhanced antitumor therapy through increased apoptosis by inhibition of NF ka ppaB. Nat Med. 1999; 5(4): 412 7. [103 ] Sakamoto K, Maeda S. Targeting NF kappaB for colorectal cancer. Expert Opin Ther Targets. 2010; 1 4(6): 593 601. [104 ] Alvarez Guardia D, Palomer X, Coll T, Davidson MM, Chan TO, Feldman AM, Laguna JC, Vzquez Carrera M. The p65 subunit of NF kappaB binds to PGC 1alpha, linking inflammation and metabolic disturbances in cardiac cells. Cardiovasc Res. 2 010; 87(3): 449 58. [105 ] Brennan CW, Verhaak RGW, McKenna A, Campos B, Noushmehr H, Salama SR, Zheng S, Chakravarty D, San born JZ, Berman SH, et al. The s omatic genomic landscape of g lioblastoma. Cell 2013; 155 (2): 462 77. [106 ] Sikdar S, Datta S, Datta S. Exploring the importance of cancer pathways by meta analysis of differential protein expression networks in three different cancers. Biol Direct 2016; 11: 65. [107 ] Reimand J, Kull M, Peterson H, Hansen J, Vilo J. g: Profiler a web based toolset for functional profiling of gene lists from large scale experiments. Nucleic Acids Res. 2007; 35 : W193 W200. [108 ] Reimand J, Arak T, Vilo J. g: Profiler a web server for functional interpretation of gene lists (2011 update). Nucleic Acids Res. 2011; 39 : W307 W315. [109 ] Gill R, Datta S, Datta S. dna: An R package for differential network analysis. B ioinformation 2014; 10: 233 34. [110 ] Pihur V, Datta S, Datta S. Weighted rank aggregation of cluster validation measures: a monte carlo cross entropy approach. Bioinformatics 2007; 23: 1607 15. [111 ] Pihur V, Datta S, Datta S. RankAggreg, an R package for weighted rank aggregation. BMC B ioinformatics 2009; 10: 62. [112 ] Rubinstein R. The cross entropy method for combinatorial and continuous optimization. Methodol Comput Appl. 1999; 1: 127 90.

PAGE 111

111 [113 ] Fagin R, Kumar R, Sivakumar D. Com paring top k lists. SIAM J Discrete Math. 2003; 17: 134 60. [114 ] Yang Y, Wu J, Demir A, Castillo Martin M, Melamed RD, Zhang G, Fukunaga Kanabis M, Perez Lorenzo R, Zheng B, Silvers DN et al. GAB2 induces tumor angiogenesis in NRAS driven melanoma. Oncogene. 2013; 32: 3627 37. [115 ] The Human Protein Atlas. http://www.proteinatlas.org/ENSG00000100030 MAPK1/cancer [116 ] Gherardi E, Birchmeier W, Birchmeier C, Vande Woude G. Targeting MET in cancer: rationale and progress. Nat Rev Cancer 2012; 12 : 89 103 [117 ] Smith AJ, Karpova Y, Jr Willingham M, Kulik G. Expression of the Bcl 2 protein BAD promotes prostate cancer growth. PLoS One 2009 ; 4 : e6224. [118 ] UniProt. http://www.uniprot.org/uniprot/P17252. [119 ] Luo J, Manning BD, Cantley LC. Targeting the PI3K Akt pathway in human cancer: rationale and promise. Cancer Cell 2003; 4 : 257 62. [120 ] Knight ZA, Shokat KM. Chemically targeting the PI3K family. Biochem Soc Trans. 2007; 35: 245 9. [121 ] Gysin S, Salt M, Young A, McCormick F. Therapeutic strategies for targeting Ras proteins. Genes Cancer 2011; 2: 359 72. [122 ] Sikdar S, Choo Wosoba H, Abdia Y, Dutta S, Gill R, Datta S, Datta S. An integrative exploratory analysis of omics data from the ICGC cancer genomes lung adenocarcinoma study. Syst Biomed. 2014; 2: 54 62. [123 ] Chakravarti A, Loeffler JS, Dyson NJ. Insulin like growth factor receptor I mediates resistance to anti epidermal growth factor receptor therapy in primary human glioblastoma cells through continued activation of phosphoinosi tide 3 kinase signaling. Cancer Res 2002; 62: 200 7. [124 ] Rojas M, Yao S, Lin YZ. Controlling epidermal growth factor (EGF) stimulated Ras activation in intact cells by a cell permeable peptide mimicking phosphorylated EGF receptor. J Biol Chem. 1996; 271: 27456 61. [125 ] She QB Solit DB Ye Q KE Lobo J Rosen N The BAD protein integrates survival signaling by EGFR/MAPK and PI3K/ Akt kinase pathways in PTEN deficient tumor cells. Cancer Cell 2005; 8: 287 97. [126 ] Yarden Y, Sliwkowski MX. Untangling the ErbB signalling network. Nat Rev Mol Cell Biol. 2001; 2: 127 37. [127 ] CrownBio. http://www.crownbio.com/finding right path lung cancer therapy/

PAGE 112

112 [128 ] E ngelman JA, Zejnullahu K, Mitsudomi T, Song Y, Hyland C, Park JO, Lindeman N, Gale CM, Zhao X, Christensen J et al. MET amplification leads to gefitinib resistance in lung cancer by activating ERBB3 signaling. Science 2007; 316: 1039 43. [129 ] Porta Pardo E, Garcia Alonso L, Hrabe T, Dopazo J, Godzik A. A pan cance r catalogue of cancer driver protein interaction interfaces. PLoS Computat Biol 2015; 11: e1004518. [130 ] Efron B, Turnbull BB, Narasimhan B. locfdr Vignette Complete Help Documentation Including Usage Ti ps and Simulation Example 2015. [131 ] Efron B. Size, power and false discovery rat es. Ann. Stat. 2007; 35: 1351 77. [132 ] Benidt S, Nettleton D. SimSeq: a nonparametric approach to simulation of RNA sequence datasets. Bioinformatics. 2015; 31 : 2131 140. [133 ] Network TCGAR. Comprehensive molecular cha racterization of clear cell renal cell carcinoma. Nature. 2013; 499 : 43 9. [134 ] Robinson MD, McCarthy DJ, Smyth GK. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics. 2010 ; 26 : 139 40. [135 ] McCarthy DJ, Chen Y, Smyth GK. Differential expression analysis of multifactor RNA Seq experiments with respect to biological variation. Nucleic Acids Res. 2012 ; 40 : 4288 97. [136 ] Bhattacharjee A, Richards WG, Staunton J, Li C, Monti S, Vasa P, Ladd C, B eheshti J, Bueno R, Gillette M, et al. Classification of human lung carcinomas by mRNA expression profiling reveals distinct adenocarcinoma subclasses. Proc Natl Acad Sci. US A. 2001; 98: 13790 5. [137 ] Takeuchi T, Tomida S, Yatabe Y, Kosaka T, Osada H, Yanagisawa K, Mitsudomi T, Takahashi T Expression profile defined classification of lung adenocarcinoma shows close relationship with underlying major genetic changes and clinicopathologic behaviors. J Clin Oncol. 2006; 24: 1679 88. [138 ] Staaf J, Jnss on G, Jnsson M, Karlsson A, Isaksson S, Salomonsson A, Pettersson HM Soller M Ewers SB Johansson L et al. Relation between smoking history and gene expression profiles in lung adenocarcinom as. BMC Med Genom. 2012; 5: 22. [139 ] Rouss eaux S, Debernardi A, Jacquiau B, Vitte AL, Vesin A, Nagy Mignotte H, Moro Sibilot D Brichon PY Lantuejoul S Hainaut P et al. Ectopic activation of germline and placental genes identifies aggressive metastasis prone lung cancers. Sci Transl Med. 2013; 5: 186ra66. [140 ] Tarca AL, Lauria M, Unger M, Bilal E, Boue S Kumar Dey K, Hoeng J Koeppl H Martin F Meyer P et al. Strengths and limitations of microarray based phenotype prediction: lessons learned from the IMPROVER Diagnostic Signature Chall enge. Bi oinformatics. 2013; 29: 2892 9.

PAGE 113

113 [141 ] Hughey JJ, Butte AJ. Robust meta analysis of gene expression using the elastic net. Nucleic Acids Res. 2015 ; gkv229. [1 42 ] Huang Da W, Sherman BT, Lempicki RA. Systematic and integrative analysis of large gene lists using DAVID Bioinformatics Resources. Nat Protoc. 2009; 4: 44 57. [143 ] Storey JD. A direct approach to false discovery rates. J R Stat Soc Series B Stat Methodol. 2002; 64 : 479 98.

PAGE 114

114 BIOGRAPHICAL SKETCH Sinjini Sikdar was born in India. She received her undergraduate degree in 2010 from Presidency College, Universit y of Calcutta, with a major in statistics and minor in m athemat ics and e conomics. She r s tatistics from Universi ty of Calcutta in She joined the graduate degree program in the Department of Bioinformatics and Biostatistics at University of Louisville in 2013 to pursue a PhD. She, along with he r supervisor, moved to the PhD program of the Department of Biostatistics at University of Florida in 2015 and continued her PhD degree there. Apart from her dissertation, she also worked on several projects during her PhD program which led to a number of publications in peer reviewed journals. Some of her works have been published in journals like Briefings in Bioinformatics BMC Bioinformatics, Biology Direct among others She also presented her research works in several conferences. She was awarded a 2 n d prize for excellent research presented at the 13 th Annual International Conference on Critical Assessment of Massive Data Analysis (CAMDA), held at Boston in 2014. She received the University of Florida PhD travel award for presenting her res earch at the ENAR 2017 conference at Washington D C. She also received one of the best paper awards at Biostatistics Workshop on Statistical Inference on Biomedical Big Data, held at University of Florida in 2017. During her PhD program, she worked as a gr aduate research assistant and had opportunities to work on several collaborative projects.