Feature Ranking and Selection for SVM Classification and Applications

Material Information

Title:
Feature Ranking and Selection for SVM Classification and Applications
Physical Description:
1 online resource (89 p.)
Language:
English
Creator:
Hsu, Wen-Chin
Publisher:
University of Florida
Place of Publication:
Gainesville, Fla.
Publication Date:

Thesis/Dissertation Information

Degree:
Doctorate (Ph.D.)
Degree Grantor:
University of Florida
Degree Disciplines:
Electrical and Computer Engineering
Committee Chair:
Chen, Su-Shing
Committee Members:
Chang, Lung-Ji
Harris, John Gregory
Ritter, Gerhard
Wayne, Marta L

Subjects

Subjects / Keywords:
alzheimers -- cancers -- disease -- expression -- features -- gene -- information -- mutual -- network -- selection -- svm -- target
Electrical and Computer Engineering -- Dissertations, Academic -- UF
Genre:
Electrical and Computer Engineering thesis, Ph.D.
bibliography   ( marcgt )
theses   ( marcgt )
government publication (state, provincial, territorial, dependent)   ( marcgt )
born-digital   ( sobekcm )
Electronic Thesis or Dissertation

Notes

Abstract:
Diagnosis of complex diseases such as cancers or Alzheimer’s disease (AD) remains a challenging research problem. An optimal approach is to discover important biomarkers for both diagnosis and therapy. These biomarkers form a certain dependency network, called a target network, which serves as a framework for diagnosis and therapies. However, selecting important genes for microarray datasets has been a major problem due to the COD (Curse of Dimensionality), which refers to the difficulty of finding relationships among a large number of input parameters (features) from a small number of samples (patient subjects). A general methodology, AMFES (Adaptive Multiple FEature Selection), for ranking and selecting important biomarkers based on SVM (Support Vector Machine) classification is developed to improve diagnosis of complex diseases. In the research, three methods are comprehensively compared: AMFES, RFE (Recursive Feature Elimination) and CORR (Correlation Coefficient), on five datasets (leukemia, colon cancer, lymphoma, prostate cancer and simulated data). As a result, AMFES performs better in terms of computational time and the number of selected features, while also maintaining higher or comparable test accuracy and statistical significance. Based on the biomarkers, a multi-target and multi-component design that provides synergistic results is proposed to improve cancer therapy. First, the biomarkers are selected and target networks are constructed for three datasets: prostate cancer (three stages), breast cancer (four subtypes), and another prostate cancer dataset (normal vs. cancerous). Then, a framework is proposed as a computational foundation for the therapy. Recently, Maes et al. have investigated blood-based biomarkers to help analyze AD [2-4]. Based on our success with cancers, we believed that AMFES could be usefully applied to AD. In this work, we extend the translational bioinformatics study conducted by Maes et al. for their AD datasets (GSE4226, GSE4227 and GSE4229). Interestingly, some of our selected genes are not listed in Maes’ report, and this difference may indicate the novelty of our genes. In addition, based on the gender analysis, we observe that gender could play a role in AD degradation. Finally, we describe a complete process for the diagnosis and prognosis of AD.
General Note:
In the series University of Florida Digital Collections.
General Note:
Includes vita.
Bibliography:
Includes bibliographical references.
Source of Description:
Description based on online resource; title from PDF title page.
Source of Description:
This bibliographic record is available under the Creative Commons CC0 public domain dedication. The University of Florida Libraries, as creator of this bibliographic record, has waived all rights to it worldwide under copyright law, including all related and neighboring rights, to the extent allowed by law.
Statement of Responsibility:
by Wen-Chin Hsu.
Thesis:
Thesis (Ph.D.)--University of Florida, 2013.
Local:
Adviser: Chen, Su-Shing.
Electronic Access:
RESTRICTED TO UF STUDENTS, STAFF, FACULTY, AND ON-CAMPUS USE UNTIL 2014-05-31

Record Information

Source Institution:
UFRGP
Rights Management:
Applicable rights reserved.
Classification:
lcc - LD1780 2013
System ID:
UFE0045215:00001


Full Text

FEATURE RANKING AND SELECTION FOR SVM CLASSIFICATION AND APPLICATIONS

By

WEN-CHIN HSU

A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA

2013

© 2013 Wen-Chin Hsu

To my family, my teachers and my friends

ACKNOWLEDGMENTS

I would like to express my deepest appreciation to my advisor, Professor Su-Shing Chen, for his invaluable guidance, long-term support and encouragement. I feel blessed to work with such a professional scholar. Professor Chen always has brilliant and novel ideas that inspire me to approach research from different perspectives. His outstanding experience guided me through difficult research phases and helped me overcome obstacles. His professional working philosophy also showed me the importance of teamwork. Without his generous help, my dissertation would have been difficult to accomplish. In addition, he and his family treat me as a member of the family, which gives me strong support.

I also greatly thank my committee members, Dr. John Harris, Dr. Gerhard Ritter, Dr. Lung-Ji Chang and Dr. Marta Wayne, for their encouragement and valuable guidance. In addition, I would like to express my sincere appreciation to Dr. Fu Chang and Dr. Chan Cheng Liu (team members of our research project), and to the faculty and friends of the Department of Electrical and Computer Engineering for all their discussions. In particular, I would like to thank Dr. Chun Chung Choi (international student group facilitator) and Dr. Carlos A. Hernandez (Clinical Assistant Professor), who gave me warm encouragement and help.

Finally, I owe a great debt of appreciation to my parents, Mr. Huan Tu Hsu and Ms. Su Chin Hsu, and to my sisters, especially my third sister, Wen Chen Hsu. If I have any achievement, she is the individual who always supports me. Their unconditional love and strong belief in me are the reasons that carry me forward. Their wholehearted caring and continuous love are the wings that fly me to a bigger world. They are the rocks of my life.

TABLE OF CONTENTS

ACKNOWLEDGMENTS
LIST OF TABLES
LIST OF FIGURES
ABSTRACT

CHAPTER

1 INTRODUCTION
    Traditional Microarray Statistics
    Statement of Problem
    Background of Study
        CORR (Correlation Coefficient)
        SVM (Support Vector Machine)
        RFE (Recursive Feature Elimination)
    Statement of Work
        Selecting and Ranking Biomarkers Using SVM Classification
        Target Network
        Translational Bioinformatics

2 METHODOLOGY
    AMFES (Adaptive Multiple FEature Selection)
        Ranking
        Selection
        Integrated Ranking and Selection
    Mutual Information
    Target Network
    Synergistic Therapy

3 COMPARISONS OF AMFES AND OTHER METHODOLOGIES
    Microarray Data Description
        Leukemia
        Colon Cancer
        Lymphoma
        Prostate Cancer
        Simulated Dataset
    Results
    Discussion

4 CLINICAL BIOINFORMATICS: DIAGNOSIS AND THERAPY
    Microarray Data Description
        Prostate Cancer Dataset with RNA Biomarkers
        Breast Cancer Dataset with Non-Coding MicroRNA Biomarkers
        Prostate Cancer Dataset of Cancerous and Normal Samples with RNA Biomarkers
    Results
        Calculating Mutual Information
        Synergistic Therapy
    Discussion

5 T DISEASE)
    Microarray Datasets Descriptions
        GSE4226
        GSE4227
        GSE4229
    Results
        Results of Biomarkers
        ROC/AUC Comparison
        Mutual Information Analysis
        Clustergram Example
        Functional Attributes
        Overlapping Genes Discovered
        Different Gene Profiling
        Gender Analysis
    Discussion

6 CONCLUSIONS

APPENDIX

A TARGET NETWORKS OF MUTUAL INFORMATION
B ISEASE ANALYSIS

LIST OF REFERENCES
BIOGRAPHICAL SKETCH

LIST OF TABLES

Table

2-1 Pseudo codes of the rank subroutine
2-2 Pseudo codes of the selection subroutine
2-3 Pseudo codes of the integrated subroutine
3-1 Summary of tasks
3-2 T-test and p-value for the colon cancer dataset
3-3 T-test and p-values for the leukemia cancer dataset
3-4 T-test and p-values for the lymphoma cancer dataset
3-5 The t-test and p-value for the prostate cancer dataset
3-6 Informative features discovery rate (%)
4-1 Descriptions of 3 datasets: GSE18655 (prostate cancer), GSE19536 (breast cancer) and GSE21036 (prostate cancer)
4-2 Results of selected subsets of genes
4-3 Results of analysis of MI matrices
5-1 Descriptions of 3 datasets: GSE4226, GSE4227, and GSE4229
5-2 Results of selected subsets of genes
5-3 Results of analysis of MI matrices
5-4 The partial biological processes of genes selected for GSE4226
5-5 17 common down-regulated genes
5-6 Nine common up-regulated genes
5-7 Mutual information analysis for non-overlapped genes of AMFES and
5-8 Comparisons of female genes and male genes selected by AMFES and
5-9
5-10 Overlapped genes between the ones selected by AMFES and ones of
A-1 GSE18655 96 Biomarkers (attached: pdf file 16kB)
A-2 GSE19536 72 Biomarkers (attached: pdf file 10kB)
A-3 GSE19536 68 Biomarkers (attached: pdf file 12kB)
A-4 GSE21036 22 Biomarkers (attached: pdf file 12kB)
A-5 GSE18655 mutual information of grade 1 (attached: pdf file 145kB)
A-6 GSE18655 mutual information of grade 2 (attached: pdf file 145kB)
A-7 GSE18655 mutual information of grade 3 (attached: pdf file 145kB)
A-8 GSE19536 Basal like (attached: pdf file 73kB)
A-9 GSE19536 Normal like (attached: pdf file 75kB)
A-10 GSE21036 Cancer (attached: pdf file 14kB)
A-11 GSE21036 Normal (attached: pdf file 14kB)
B-1 GSE4226_74_Biomarkers (attached: pdf file 11kB)
B-2 GSE4227_52_Biomarkers (attached: pdf file 10kB)
B-3 GSE4229_395_Biomarkers (attached: pdf file 14kB)
B-4 746_Down regulated_Genesymbols (attached: pdf file 18kB)
B-5 82_Up regulated_Biomarkers (attached: pdf file 10kB)
B-6 GSE4226 AD MI (attached: pdf file 87kB)
B-7 GSE4226 Normal MI (attached: pdf file 84kB)
B-8 GSE4227 AD MI (attached: pdf file 46kB)
B-9 GSE4227 Normal MI (attached: pdf file 46kB)
B-10 GSE4229 AD MI (attached: pdf file 2.3MB)
B-11 GSE4229 Normal MI (attached: pdf file 2.3MB)

LIST OF FIGURES

Figure

1-1 Feature space of liver cancer patients (in blue) and healthy patients (in green)
2-1 AMFES expands four features into eight features, four of which are original (O1, O2, O3 and O4) and the other four are artificial (R1, R2, R3 and R4), obtained by permuting (randomizing) O1-O4.
2-2 The distribution of original pairwise MI values and permuted pairwise MI values.
3-1 Average test accuracy of the 3 methods (100 %)
3-2 Number of selected features.
3-3 Total computational time, which includes training and testing for several pairs (sec).
3-4 Computational time for training the dataset and testing is around 1 for one pair (sec).
3-5 The AUC values of the 3 methods on the cancer datasets (sec).
3-6 Comparison of ROC curves for the colon cancer dataset: black, AMFES; light grey, RFE; dark grey, CORR. The AUC value is for AMFES.
3-7 Comparison of ROC curves for the leukemia cancer dataset: black, AMFES; light grey, RFE (fully overlapped with AMFES); dark grey, CORR. The AUC value is for AMFES.
3-8 Comparison of ROC curves for the lymphoma cancer dataset: black, AMFES; light grey, RFE (fully overlapped with AMFES); dark grey, CORR. The AUC value is for AMFES.
3-9 Comparison of ROC curves for the prostate cancer dataset: black, AMFES; light grey, RFE (fully overlapped with AMFES); dark grey, CORR. The AUC value is for AMFES.
4-1 Comparison of 96 MI of grade1, grade2 and grade3 prostate cancer samples
4-2 Comparison of 72 MI of luminal A and luminal B samples
4-3 Comparison of 68 MI of basal-like and normal-like samples
4-4 Comparison of 22 MI of prostate cancerous and normal-like samples
4-5 Diagram of detailed process of building the genetic model
4-6 Relationships between biomarkers, pharmacons and operons, where R1, R2, R3, R4 and R5 denote 5 biomarkers. Among all the biomarkers, R2, R3 and R5 are regulators
5-1
5-2 ROC curve comparison of AMFES values shown in the figure is for AMFES.
5-3 Histograms of pairwise MI values of normal and AD samples of GSE4226
5-4 Histograms of pairwise MI values of normal and AD samples of GSE4227
5-5 Histograms of pairwise MI values of normal and AD samples of GSE4229
5-6 The clustergram of first 15 genes selected by AMFES for GSE4226
5-7 The target network of first 15 genes selected by AMFES for GSE4266
5-8 A complete process to improve diagnosis of AD by AMFES

LIST OF ABBREVIATIONS

AA      Agent Score
AD      Alzheimer's Disease
AMFES   Adaptive Multiple Feature Selection
AN      Athymic Nude Rat
CISE    Computer and Information Science and Engineering
COD     Curse of Dimensionality
DNA     Deoxyribonucleic Acid
GEO     Gene Expression Omnibus
MI      Mutual Information
miRNA   microRNA
NCBI    National Center for Biotechnology Information
NINDS   National Institute of Neurological Disorders and Stroke
OMIM    Online Mendelian Inheritance in Man
PCA     Principal Component Analysis
RNA     Ribonucleic Acid
SD      Sprague Dawley Rat
SOM     Self Organization Map
SS      Synergy Score
SVM     Support Vector Machine
TBI     Traumatic Brain Injury
TCMID   Traditional Chinese Medicine Information Database
TS      Topology Score

Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

FEATURE RANKING AND SELECTION FOR SVM CLASSIFICATION AND APPLICATIONS

By

Wen-Chin Hsu

May 2013

Chair: Su-Shing Chen
Major: Electrical and Computer Engineering

Diagnosis of complex diseases such as cancers or Alzheimer's disease (AD) remains a challenging research problem. An optimal approach is to discover important biomarkers for both diagnosis and therapy. These biomarkers form a certain dependency network, called a target network, which serves as a framework for diagnosis and therapies. However, selecting important genes for microarray datasets has been a major problem due to the COD (Curse of Dimensionality), which refers to the difficulty of finding relationships among a large number of input parameters (features) from a small number of samples (patient subjects). A general methodology, AMFES (Adaptive Multiple FEature Selection), for ranking and selecting important biomarkers based on SVM (Support Vector Machine) classification is developed to improve diagnosis of complex diseases. In the research, three methods are comprehensively compared: AMFES, RFE (Recursive Feature Elimination) and CORR (Correlation Coefficient), on five datasets (leukemia, colon cancer, lymphoma, prostate cancer and simulated data). As a result, AMFES performs better in terms of computational time and the number of selected features, while also maintaining higher or comparable test accuracy and statistical significance.

Based on the biomarkers, a multi-target and multi-component design that provides synergistic results is proposed to improve cancer therapy. First, the biomarkers are selected and target networks are constructed for three datasets: prostate cancer (three stages), breast cancer (four subtypes), and another prostate cancer dataset (normal vs. cancerous). Then, a framework is proposed as a computational foundation for the therapy.

Recently, Maes et al. have investigated blood-based biomarkers to help analyze AD [2-4]. Based on our success with cancers, we believed that AMFES could be usefully applied to AD. In this work, we extend the translational bioinformatics study conducted by Maes et al. for their AD datasets (GSE4226, GSE4227 and GSE4229). Interestingly, some of our selected genes are not listed in Maes' report, and this difference may indicate the novelty of our genes. In addition, based on the gender analysis, we observe that gender could play a role in AD degradation. Finally, we describe a complete process for the diagnosis and prognosis of AD.

CHAPTER 1
INTRODUCTION

Traditional Microarray Statistics

Clustering, such as hierarchical clustering, SOM (Self-Organizing Map), k-means and PCA (Principal Component Analysis), is traditionally a standard gene selection method. Cluster and Cluster 3.0, an extended version of Cluster, have focused on gene clusters correlated with diseases, and many tools are now distributed as open-source software suites, including GeneCluster 1.0 [5], which has been updated to a new system called GenePattern 2.0 [6]. In addition to various clustering methods, GenePattern also features a systems biology tool to analyze genomic data. For users of Java-based tools, TIGR MeV (Multiple Experiments Viewer) can be used [7]. Another tool, used especially to determine gene pathways, is GenMapp (Gene Map Annotator and Pathway Profiler) [8], which can visualize and analyze the metabolic and signaling pathways of datasets and further build the connection between datasets and diseases. The current version is GenMapp 2.0 [9].

Statement of Problem

When analyzing human gene expression, a common and challenging research problem is the COD (Curse of Dimensionality), which arises when the number of samples (patient subjects) is small but the number of features (genes) is relatively large. Theoretically, when the number of samples is much smaller than the number of features, the statistical significance of the analysis is reduced. As a rule of thumb, the number of samples needed grows exponentially with the number of features. Thus, gene expression analysis has focused on reducing a gigantic set of features to a small but statistically valid set.
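The effect described in the Statement of Problem can be illustrated with a small simulation. The following is only a sketch under stated assumptions: the data are purely random (hypothetical, not taken from any dataset in this work), the function names `pearson` and `max_chance_correlation` are introduced here for illustration, and Pearson correlation with the class labels stands in for any per-feature relevance score. With only 20 samples, the strongest chance correlation found among thousands of irrelevant features is typically large, which is why a small sample space makes relevant and irrelevant genes hard to tell apart.

```python
import random
import statistics


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def max_chance_correlation(n_samples, n_features, seed=0):
    """Largest |correlation| between class labels and purely random features."""
    rng = random.Random(seed)
    # Balanced two-class labels; every feature below is noise by construction.
    labels = [1.0] * (n_samples // 2) + [-1.0] * (n_samples - n_samples // 2)
    best = 0.0
    for _ in range(n_features):
        feature = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]
        best = max(best, abs(pearson(feature, labels)))
    return best


# Same seed, so the 10 features of the first run are the first 10 of the second.
few = max_chance_correlation(n_samples=20, n_features=10)
many = max_chance_correlation(n_samples=20, n_features=5000)
print(f"strongest chance correlation among 10 features:   {few:.2f}")
print(f"strongest chance correlation among 5000 features: {many:.2f}")
```

Because both runs share a seed, the second run can only find an equal or stronger spurious correlation than the first; none of these features carries any real information about the labels.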

Background of Study

CORR (Correlation Coefficient)

A common methodology for selecting important features involves ranking features based on their individual abilities to classify disease, by computing their correlation coefficients with the sample classes. If a given dataset has n features and, for each feature, only a few samples of two classes are available, we can use the CORR approach of [10]. The coefficient $W_i$ of feature i is defined as

\[ W_i = \frac{\mu_i^{(1)} - \mu_i^{(2)}}{\sigma_i^{(1)} + \sigma_i^{(2)}} \]

where $\mu_i^{(c)}$ and $\sigma_i^{(c)}$ denote the mean and standard deviation of the gene expressions for gene i, calculated over all patient samples in class c (class 1 or class 2), with $i = 1, \ldots, n$ and n the number of features in the dataset. However, ranking features by their correlation coefficients requires that the features be orthogonal, which may not always be valid. In addition, selecting a group of high-ranked features does not guarantee a good classification, while a group of lower-ranked features may form a much stronger classifier [11]. Later, in Chapter 3, the CORR approach will be used as a comparison with AMFES [10, 11].

SVM (Support Vector Machine)

SVM has been a powerful classification tool for machine learning and pattern recognition [12-14], and for a decade it has also been used to rank features [11]. The simplest type of SVM is the binary SVM, which is used to classify two classes. Given a set of training samples with known classification labels, an SVM uses the knowledge learned in training to predict the label of an unseen sample. For example, a set of samples from cancerous and healthy patients, where cancerous patients are labeled as class +1 and healthy patients as class -1, is used to train an SVM. Once the SVM learns the classification patterns, it can predict whether a new patient has cancer or not.

Formally, given a set of n training samples, the i-th sample $x_i$ is a d-dimensional vector with label $y_i$, where $y_i \in \{-1, +1\}$ and d is the number of features. Representing each sample as a data point in the space formed by all features, our goal is to find a hyperplane (a boundary) that separates all data points in the feature space into two classes. For example, assume that the samples for both liver cancer and healthy patients, $x_1, x_2, \ldots, x_9$, are given and that weight and age are considered important factors that trigger the proliferation of liver cancer cells. In SVM terms, these two factors are called features, and the samples are represented as data points in a 2-dimensional feature space formed by weight and age, as shown in Figure 1-1.

As Figure 1-1 shows, there are many hyperplanes, such as H1 and H2, that can separate the green data points from the blue ones. H1 is better than H2 because with H1 it is more difficult to mislabel data points into the wrong class. Thus, the goal is to find the hyperplane that generates the maximum distance (margin) between the nearest data points of the two classes. The hyperplane is represented as the set of points x that satisfy

\[ W \cdot x + b = 0 \tag{1-1} \]

where W is a normal vector to the hyperplane and the dot operator represents an inner product. The hyperplane resides in the middle of the margin determined by the boundaries formed by the nearest data points of the two classes. These data samples, which lie on the boundaries and form the maximum margin, are called support vectors. A sample $x_i$ in the space can be classified into class +1 or class -1 by the definitions below.

\[ W \cdot x_i + b \ge +1 \quad \text{for } y_i = +1 \tag{1-1a} \]

\[ W \cdot x_i + b \le -1 \quad \text{for } y_i = -1 \tag{1-1b} \]

All data points of the two classes, with their labels, can be represented by

\[ y_i \, (W \cdot x_i + b) \ge 1, \quad i = 1, \ldots, n \tag{1-2} \]

Thus, the goal is to find W and b that maximize the margin while all points of the two classes satisfy the constraints in (1-2). Mathematically, the margin is $2 / \|W\|$, so the goal is to minimize $\|W\|$, or equivalently $\tfrac{1}{2}\|W\|^2$, subject to the constraints.

When one encounters an optimization problem of finding a maximum or minimum subject to a set of constraints, the method of Lagrange multipliers is commonly used. Consider the problem of finding the minimum of a function f(x) subject to g(x) >= 0. One can construct a Lagrange function as

\[ L(x, \alpha) = f(x) - \alpha \, g(x) \tag{1-3} \]

\[ \alpha \ge 0 \tag{1-4} \]

where x is the primal variable of the primal space and $\alpha$ is the dual variable of the dual space. Introducing Lagrange multipliers into our optimization problem, let $f = \tfrac{1}{2}\|W\|^2$ and $g_i = y_i (W \cdot x_i + b) - 1$. Then the objective function in the primal space is represented as

\[ L_P = \frac{1}{2}\|W\|^2 - \sum_{i=1}^{n} \alpha_i \left[ y_i (W \cdot x_i + b) - 1 \right] \tag{1-5} \]
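The separating-hyperplane constraints and the margin discussed in this section can be checked numerically on a toy example. This is a minimal sketch under assumptions: the four 2-dimensional samples, the two candidate hyperplanes labeled H1 and H2, and all function names are hypothetical and chosen only so the arithmetic is easy to follow. The sketch also assumes the convention that a hyperplane is given by a weight vector w and offset b, with constraint y (w . x + b) >= 1 and margin width 2 / ||w||.

```python
import math


def decision_value(w, b, x):
    """Signed score w . x + b of a sample x."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def classify(w, b, x):
    """Predicted label (+1 or -1) from the sign of the decision value."""
    return 1 if decision_value(w, b, x) >= 0 else -1


def satisfies_constraints(w, b, samples, labels):
    """True if every sample meets y * (w . x + b) >= 1."""
    return all(y * decision_value(w, b, x) >= 1.0 for x, y in zip(samples, labels))


def margin_width(w):
    """Distance 2 / ||w|| between the two class boundaries."""
    return 2.0 / math.sqrt(sum(wi * wi for wi in w))


# Hypothetical toy data with two features per sample.
samples = [(1.0, 1.0), (2.0, 0.0), (3.0, 3.0), (4.0, 2.0)]
labels = [-1, -1, 1, 1]

# Two hyperplanes that both separate the data (assumed for illustration).
h1_w, h1_b = (0.5, 0.5), -2.0   # boundary x1 + x2 = 4
h2_w, h2_b = (2.0, 0.0), -5.0   # boundary x1 = 2.5

for name, w, b in (("H1", h1_w, h1_b), ("H2", h2_w, h2_b)):
    ok = satisfies_constraints(w, b, samples, labels)
    print(name, "feasible:", ok, "margin:", round(margin_width(w), 3))

new_sample = (3.5, 2.5)
print("H1 predicts", classify(h1_w, h1_b, new_sample), "for", new_sample)
```

Both hyperplanes satisfy the constraints, but H1 leaves a wider margin than H2, which is the sense in which one separating hyperplane is preferred over another.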

Since (1-5) is a quadratic programming problem, the optimization problem can be transformed into its dual form, which is to maximize

\[ L_D = \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j \, (x_i \cdot x_j) \quad \text{subject to} \quad \sum_{i=1}^{n} \alpha_i y_i = 0 \tag{1-6} \]

where $\alpha_i \ge 0$. Finally, W can be calculated as

\[ W = \sum_{i=1}^{n} \alpha_i y_i x_i \tag{1-7} \]

and only those samples that lie on the boundaries have nonzero $\alpha_i$ values [12].

The explanation above applies to linearly separable samples. When the data points are not separable in the original feature space, they may become separable after they are transformed to a higher-dimensional feature space. Kernel functions, $k(x_i, x_j)$, compute inner products in that space and thereby perform the transformation implicitly. In this research, we use the common Gaussian radial basis kernel function, details of which can be found in [13].

RFE (Recursive Feature Elimination)

RFE is the first methodology to apply SVM to rank biomarkers [11]. RFE trains an SVM on all features and eliminates the feature deemed least useful by the SVM. The process proceeds recursively with the remaining features and ranks features in the reverse order of their elimination. RFE has the advantage of evaluating features collectively using SVM, an effective learning machine. However, the disadvantage of RFE, as of any backward elimination method, is its computation speed. In microarray data analysis, thousands of features or even more are common, requiring an extremely large amount of computational time for RFE.
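The recursive elimination loop can be sketched in a few lines. This is a minimal sketch, not the implementation evaluated in this work: it assumes a linear SVM trained by plain subgradient descent on the regularized hinge loss (instead of a quadratic programming solver or a kernel SVM), it uses the squared weight of each feature as the elimination criterion, and the dataset and the names `train_linear_svm` and `rfe_ranking` are hypothetical.

```python
def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=200):
    """Full-batch subgradient descent on lam/2 * ||w||^2 + mean hinge loss."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w = [lam * wj for wj in w]
        grad_b = 0.0
        for xi, yi in zip(X, y):
            score = sum(wj * xj for wj, xj in zip(w, xi)) + b
            if yi * score < 1.0:  # sample violates the margin constraint
                for j in range(d):
                    grad_w[j] -= yi * xi[j] / n
                grad_b -= yi / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def rfe_ranking(X, y):
    """Return feature indices ranked from most to least useful."""
    remaining = list(range(len(X[0])))
    eliminated = []
    while remaining:
        # Retrain on the surviving features only.
        X_sub = [[row[j] for j in remaining] for row in X]
        w, _ = train_linear_svm(X_sub, y)
        # Drop the feature with the smallest squared weight.
        worst = min(range(len(remaining)), key=lambda k: w[k] ** 2)
        eliminated.append(remaining.pop(worst))
    return eliminated[::-1]  # reverse order of elimination


# Hypothetical data: feature 0 tracks the label; features 1 and 2 are small noise.
X = [
    [2.0, 0.3, 0.5], [1.8, -0.2, 0.4], [2.2, 0.1, -0.1], [1.9, -0.1, 0.3],
    [-2.1, 0.2, -0.4], [-1.9, -0.3, 0.1], [-2.0, 0.1, -0.5], [-1.8, -0.1, -0.2],
]
y = [1, 1, 1, 1, -1, -1, -1, -1]

ranking = rfe_ranking(X, y)
print("features ranked from most to least useful:", ranking)
```

Each round retrains the classifier on the surviving features, so the cost grows with the number of features. With thousands of genes this repeated retraining is the computational burden noted above, and practical variants remove several features per round.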


Statement of Work

Selecting and Ranking Biomarkers Using SVM Classification

We have developed an alternative feature ranking and selection methodology, AMFES (Adaptive Multiple FEature Selection), to improve on RFE. Based on several cancer datasets, AMFES is compared with RFE and CORR (Correlation Coefficient). Unlike RFE, AMFES evaluates features based on a number of feature subsets generated in an adaptive fashion. As we observed in our experiments, the COD effect is associated with apparent correlation between features. Because the number of features is large relative to the number of training samples, an irrelevant feature may accidentally become correlated with some critical features. With the introduction of multiple feature subsets, the irrelevant feature and its correlated features co-occur in only some of the subsets. Therefore, by examining a sufficient number of subsets, irrelevant features can be more easily distinguished from the critical features.

To further improve the outcome, the ranking procedure of AMFES is implemented in a number of stages. At the first stage, all features are evaluated and ranked. In doing so, most, if not all, critical features can be moved to the top ranks, thereby reducing the number of irrelevant features in these ranks. At each subsequent stage, AMFES examines the features whose ranks at the previous stage were above the median rank. This allows AMFES to deal with fewer irrelevant features at the current stage than at previous stages. Then, to improve the feature ranking, AMFES re-ranks these features in the same way as in the first stage.

Randomly generated feature subsets have been used to form random forests [15], which consist of multiple decision trees, each of which is built on a feature subset. While one individual decision tree may perform rather weakly, a combination of three


forms a stronger classifier. Breiman (2001) further proposed a feature selection method for a random forest. Based on a similar idea, Tuv et al. (2006) developed a more sophisticated method [16]. While Breiman and Tuv et al. proposed feature selection methods for an ensemble of classifiers, Lai et al. (2006) proposed a random subset method (RSM) for a single SVM classifier [17]. Like RSM, AMFES is a method for a single SVM classifier, but RSM generates feature subsets and ranks features in one step, while AMFES performs an iterative re-ranking process. Moreover, AMFES employs a different ranking score from RSM. We show that AMFES achieves better or comparable test accuracy rates compared to RFE and selects a smaller number of features. Moreover, the computation time is much less for AMFES.

Target Network

Lately, researchers have found that the mere superposition of single drugs can generate side effects and crosstalk between drugs, and these interactions may cancel out the favorable effects of the treatments. Thus, Zimmermann et al. and Keith et al. focused on measuring drug treatments as a whole rather than considering them individually [18, 19]. Dancey et al. later proposed a synergistic concept to evaluate drug treatments [20]. However, evaluations are still case-based and lack a systematic approach. A network methodology was first proposed to evaluate the efficiency of drug treatments [21]. By building the target networks of diseases, researchers could further select suitable drug agents to improve the efficiency of therapy. A target network is an interaction network of biomarkers based on graph theory, where the nodes represent biomarkers and the edges represent the interactions between pairs of biomarkers. Intuitively, complex diseases possess unique target


networks as signatures of diseases, each of which will have a synergistic therapy strategy. In [22], Li et al. used an SS (synergy score) to apply the topology factor of the network based on the disease and the drug agent combination. Our approach is first to build a more precise target network from the biomarkers selected by AMFES. Then, we hope to identify the intrinsic properties by computing the mutual information of the interactions among these biomarkers. The proposed approach considers the mutual information of the target network. As example systems, we focus on developing target networks of cancers. The resulting target networks are shown in Chapter 4.

Translational Bioinformatics

Diagnosis of Alzheimer's disease (AD) is a very challenging research problem. Recently, Maes et al. investigated blood-based biomarkers to analyze AD [4]. Based on our success with AMFES in selecting important biomarkers for cancers [23], we now extend the translational bioinformatics study on AD. Our results selected a much smaller set of biomarkers and obtained better ROC/AUC (Receiver Operating Characteristic/Area Under Curve) values after cross-validation. Then, the target networks of the selected biomarkers are constructed. As shown in all the chapters, our method discovered a new group of genes that are not reported by Maes et al. Then, the mutual information values between our group and the Maes group are compared [4]. The result shows a low dependency between these two groups, demonstrating the novelty of our results. In addition, the MI values of AD subjects are lower than those of normal subjects. Based on the gender analysis, we observe that gender may play a role in AD degradation. We also have provided a


summary of our work on diagnosis and prognosis based on the selected biomarkers and the target networks constructed for AD as well as cancers.

Figure 1-1. Feature space of liver cancer patients (in blue) and healthy patients (in green).


CHAPTER 2
METHODOLOGY

AMFES (Adaptive Multiple FEature Selection)

AMFES comprises both a ranking and a selection process. This chapter describes each process in detail; the integrated ranking and selection is described at the end of the chapter.

When a dataset is given, AMFES randomly divides it into a learning subset S of samples and a testing subset T of samples at a heuristic learning:testing ratio of 5:1. The subset S is used for ranking and selecting genes and for constructing a classifier from the selected genes, while T is used for computing test accuracy. When a learning subset S is given, r training-validation pairs are extracted from S, where r is set by a heuristic rule that depends on n, the number of samples in S. Each pair randomly divides S into a training component and a validation component at a training:validation ratio of 4:1. The heuristic ratio and rule were chosen from experimental experience to balance time consumption and performance [24] (unpublished data).

Ranking

The gene ranking process consists of several ranking stages. In the first stage, all genes are arranged by their ranking scores in descending order. In the next stage, only the top half of the ranked genes are ranked again, while the bottom half keeps its current order. The same procedure repeats recursively until only three genes remain.
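The data split and the staged re-ranking just described can be sketched as follows. This is an illustrative sketch under assumptions: `rank_fn` is a hypothetical callback that returns its input genes ordered best-first, rounding of the split sizes is a free choice, and treating a final re-ranking of the top three genes as the last stage is one reading of "until only three genes remain".

```python
import random

def split(items, ratio, rng):
    """Randomly split items into (large part, small part) at ratio:1."""
    shuffled = items[:]
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * ratio / (ratio + 1))
    return shuffled[:cut], shuffled[cut:]

def staged_rank(genes, rank_fn, stop_at=3):
    """Rank all genes, then repeatedly re-rank only the current top half.
    The bottom half of each stage keeps the order it already has."""
    ranked = rank_fn(list(genes))
    top = len(ranked)
    while top > stop_at:
        top = max(stop_at, top // 2)
        ranked[:top] = rank_fn(ranked[:top])
    return ranked

rng = random.Random(0)
learning, testing = split(list(range(60)), 5, rng)    # 5:1 learning:testing
training, validation = split(learning, 4, rng)        # 4:1 training:validation

stage_sizes = []
def demo_rank(genes):
    stage_sizes.append(len(genes))
    return sorted(genes, reverse=True)                 # stand-in ranking

order = staged_rank(range(10), demo_rank)
```

In this toy run the ranking callback sees 10, 5 and 3 genes in turn, which shows how quickly the number of genes per stage shrinks.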


Assuming that at a given ranking stage there are k genes indexed from 1 to k, AMFES follows the five steps below to rank these k genes.

(I) Generate m independent subsets S_1, ..., S_m. Each subset S_i, 1 ≤ i ≤ m, has j genes, which are selected randomly and independently from the k genes, where j = (int)(k/2). S_i then induces a transformation that converts the training samples x_1, ..., x_n to z_i1, ..., z_in.

(II) Let C_i, 1 ≤ i ≤ m, be the SVM classifier trained on the samples z_i1, ..., z_in.

(III) For each of the k genes, the ranking score of the gene g is computed according to Eq. (2-1).

(IV) The score uses the average weight of the gene g: the summation of the weights of g in the m subsets is divided by the number of subsets for which g is randomly selected. This increases the robustness with which the score represents the true classifying ability of gene g.

(V) The k genes are ranked in descending order by their ranking scores, given by

\[ \theta_m(g) = \frac{\sum_{i=1}^{m} I_{\{g \in S_i\}} \, weight_i(g)}{\sum_{i=1}^{m} I_{\{g \in S_i\}}} \tag{2-1} \]

where I is an indicator function such that I_proposition = 1 if the proposition is true; otherwise, I_proposition = 0. In other words, if gene g is randomly selected for the subset S_i, this is denoted as g ∈ S_i and I_proposition = 1.

We denote the objective function of C_i as obj_i(v_1, ..., v_s), where v_1, ..., v_s are the support vectors of C_i. The weight_i(g) is then defined as the change in the objective function due to g, i.e.,

\[ weight_i(g) = \left| obj_i(v_1, \ldots, v_s) - obj_i(v_1^{(g)}, \ldots, v_s^{(g)}) \right| \tag{2-2} \]

[11, 25]. Note that if v is a vector, v^(g) is the vector obtained by dropping gene g from v.

The value of m is not fixed in advance. Instead, it is determined by adding one feature subset at a time until a stop criterion is met. The vector θ_m is defined as the k-dimensional vector comprising the ranking scores derived from the m feature subsets generated thus far. Because these subsets are randomly and independently selected, the law of large numbers ensures that, during the iteration process, θ_m converges to a constant vector, which is the vector of some average results [26]. For this reason, no new feature subset is generated when θ_m and θ_{m-1} approach each other, i.e., when the criterion in Eq. (2-3) is satisfied. The pseudocode of the ranking process is shown in Table 2-1.

Selection

When all features are ranked, a naive way to find the critical subset of features is first to train an SVM on F_k, the top k ranked features, for k = 1, ..., d, then to compute the respective validation accuracy rates, and finally to pick the F_k with the highest rate. However, this procedure can be time-consuming when d is a very large number. In addition, it is not robust, because a tiny variation in the validation accuracy rate can drastically alter the optimal F_k. To solve this problem, some have proposed selection procedures with the help of artificial features [16, 24, 27, 28]. Our approach follows closely that of Tuv et al. but relies on a validation procedure to


determine the subset of selected features instead of the statistical test adopted by Tuv et al. [16].

When a dataset is given, AMFES generates as many artificial features as there are original features. For example, assume that a dataset X with 3 samples (x_1, x_2 and x_3) and 2 features (f_1 and f_2) is given, as shown in Figure 2-1. AMFES permutes the elements of each feature to generate two corresponding artificial features, a_1 and a_2, which are appended next to the original features. Then it labels the artificial features as the same class as their corresponding original features. Artificial features are created to help distinguish the relevant features from the irrelevant ones. When AMFES ranks all features, including both artificial and original features, the irrelevant features should rank closer to the artificial features than the relevant ones do.

Assume a set of unranked genes is given. AMFES first generates artificial features based on the original features as described above. The ranking procedure is then applied to both the original and artificial features. After ranking, each original gene is assigned a gene index, a real value between 0 and 1, which is the proportion of artificial features that are ranked above it. AMFES then generates 200 subset candidates from which the optimal subset is chosen. The number 200 was determined from experimental conditions to balance time consumption and performance. Each of the 200 subsets, B(p_i), has a p_i value between 0 and 1, where p_i = i × 0.005 and i is the index of the subset. Each subset contains the original genes whose gene indices are smaller than or equal to the respective p_i value. We can obtain the validation accuracy, v(p_i), for every B(p_i) by training an SVM.
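The artificial-feature construction, the gene index and the candidate subsets B(p_i) can be sketched as follows. This is a minimal sketch, not the thesis code: the function names are hypothetical, features are identified by column position (original columns first, artificial columns after them), and the ranking passed to `gene_indices` is assumed to come from some external ranking routine.

```python
import random

def add_artificial_features(samples, rng):
    """Append one permuted copy of every feature column to each sample row."""
    columns = [list(col) for col in zip(*samples)]
    artificial = []
    for col in columns:
        shuffled = col[:]
        rng.shuffle(shuffled)            # permute the values across samples
        artificial.append(shuffled)
    return [list(row) for row in zip(*(columns + artificial))]

def gene_indices(ranking, n_original):
    """ranking: feature ids best-first; ids >= n_original are artificial.
    Returns {original id: proportion of artificial features ranked above it}."""
    n_artificial = len(ranking) - n_original
    seen_artificial = 0
    index = {}
    for feature in ranking:
        if feature >= n_original:
            seen_artificial += 1
        else:
            index[feature] = seen_artificial / n_artificial
    return index

def candidate_subset(index, p):
    """B(p): original genes whose gene index is at most p."""
    return sorted(g for g, value in index.items() if value <= p)

expanded = add_artificial_features([[1, 10], [2, 20], [3, 30]], random.Random(1))
index = gene_indices([0, 2, 1, 3], n_original=2)
```

A gene that outranks every artificial feature gets index 0 and appears in every candidate subset, while a gene ranked below all artificial features gets index 1 and appears only in the largest one.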


To select the optimal subset for one training-validation pair, AMFES stops at the first p_k at which v(p_k) ≥ v_baseline and v(p_k) ≥ v(p_l) for k ≤ l ≤ k+10, where v_baseline is the validation accuracy rate of the SVM trained on the baseline, i.e., the case in which all features are involved in training. The final result, B(p_k), is then the optimal subset for the given set of genes.

Integrated Ranking and Selection

The ranking and selection processes from the previous sections correspond to one training-validation pair. To increase the reliability of validation, r pairs are generated to find the optimal subset. The validation accuracy v(p_qi) of the q-th pair is computed for all subsets, where q denotes the pair index and i denotes the subset index. Then av(p_i), the average of v(p_qi) over the r training-validation pairs, is computed. A subset search is then performed on av(p_i), as explained in the Selection section, to find the optimal p_i value, denoted as p*. However, p* is a derived value that does not belong to one unique subset. Thus, we use all samples of S as training samples and repeat the process in order to find a unique subset that corresponds to p*. We generate artificial genes and again rank them together with the original genes. Finally, we select the original genes whose gene indices are smaller than or equal to the previously derived p* as the subset of genes selected for S.

The integrated version of the process is shown in Table 2-3. The AMFES ALGORITHM represents the integrated version of the whole process, while the RANK SUBROUTINE represents the ranking process and the SELECTION SUBROUTINE represents the selection process. All computations were performed on a quad-core Intel i3 CPU at 2.4 GHz with 4 GB of RAM.
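The averaging over pairs and the stopping rule can be sketched as follows. This is a hedged sketch: the comparison operators are taken as "greater than or equal", the look-ahead window is simply truncated near the end of the list, and the function names are hypothetical.

```python
def average_accuracy(v_by_pair):
    """av(p_i): mean of v(p_qi) over the r training-validation pairs.
    v_by_pair[q][i] is the validation accuracy of subset i for pair q."""
    r = len(v_by_pair)
    return [sum(column) / r for column in zip(*v_by_pair)]

def select_index(v, v_baseline, window=10):
    """Return the first k with v[k] >= v_baseline and v[k] >= v[l]
    for every l in k..k+window, or None if no such k exists."""
    for k, v_k in enumerate(v):
        ahead = v[k:k + window + 1]
        if v_k >= v_baseline and all(v_k >= v_l for v_l in ahead):
            return k
    return None

av = average_accuracy([[0.5, 1.0], [1.0, 1.0]])
first_small_window = select_index([0.6, 0.8, 0.8, 0.9], v_baseline=0.7, window=1)
first_full_window = select_index([0.6, 0.8, 0.8, 0.9], v_baseline=0.7)
```

The look-ahead is what keeps the rule from stopping on a local bump: with the full window the search skips the plateau at 0.8 because a better accuracy lies within the next ten candidates.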


Mutual Information

To treat a complex disease, an optimal approach is to discover important biomarkers for which we can specify a certain treatment. These biomarkers form a certain dependency network as a framework for diagnosis and therapies [29]. We call such a network a target network of these biomarkers [23]. Mutual information has been used to measure the dependency between two random variables based on their probabilities. The mutual information of random variables X and Y, I(X; Y), can be expressed by these equivalent equations [30]:

\[ I(X;Y) = H(X) - H(X \mid Y) \tag{3-1} \]
\[ I(X;Y) = H(Y) - H(Y \mid X) \tag{3-2} \]
\[ I(X;Y) = H(X) + H(Y) - H(X,Y) \tag{3-3} \]

where H(X) and H(Y) denote the marginal entropies, H(X|Y) and H(Y|X) denote the conditional entropies, and H(X,Y) denotes the joint entropy of X and Y. To compute entropy, the probability distribution functions of the random variables must be calculated first. Because gene expressions are usually continuous numbers, we use kernel estimation to calculate the probability distribution [31].

Assuming that the two random variables X and Y are continuous, the mutual information is defined as [30]:

\[ I(X;Y) = \iint f(x,y) \, \log \frac{f(x,y)}{f(x)\, f(y)} \, dx \, dy \tag{3-4} \]

where f(x,y) denotes the joint probability distribution, and f(x) and f(y) denote the marginal probability distributions of X and Y. By using Gaussian kernel estimation, f(x,y), f(x) and f(y) can be further represented by the equations below [32].


\[ f(x,y) = \frac{1}{M} \sum_{u=1}^{M} \frac{1}{2\pi h^2} \exp\!\left( -\frac{(x - x_u)^2 + (y - y_u)^2}{2h^2} \right) \tag{3-5} \]
\[ f(x) = \frac{1}{M} \sum_{u=1}^{M} \frac{1}{\sqrt{2\pi h^2}} \exp\!\left( -\frac{(x - x_u)^2}{2h^2} \right) \tag{3-6} \]

where M represents the number of samples for both X and Y, u is the index of samples, u = 1, ..., M, and h is a parameter controlling the width of the kernels; f(y) is obtained in the same way as f(x). Thus, the mutual information I(X;Y) can then be represented as:

\[ I(X;Y) = \frac{1}{M} \sum_{w=1}^{M} \log \frac{ M \sum_{u=1}^{M} \exp\!\left( -\frac{(x_w - x_u)^2 + (y_w - y_u)^2}{2h^2} \right) }{ \sum_{u=1}^{M} \exp\!\left( -\frac{(x_w - x_u)^2}{2h^2} \right) \sum_{u=1}^{M} \exp\!\left( -\frac{(y_w - y_u)^2}{2h^2} \right) } \tag{3-7} \]

where both w and u are indices of samples.

Computation of the pairwise MI for the genes of a microarray dataset usually involves nested loops, which require extensive computational time. Assuming that a dataset has N genes and each gene has M samples, the computation of the pairwise mutual information values usually first finds the kernel distance between any two samples for a given gene. Then, the same process is repeated for every pair of genes in the dataset. To make the computation efficient, two improvements are applied [31]. First, the marginal probability of each gene is calculated in advance and reused during the process [31, 33]. Second, the summation over each sample pair for a given gene is moved to the outermost for loop rather than being performed inside a nested for loop for every gene pair. As a result, the kernel distance between two samples is calculated only twice instead of N times, thereby saving a lot of computation time.
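A Gaussian-kernel estimate of pairwise mutual information, with the per-gene kernel values computed once and reused, can be sketched as follows. This is an illustration under assumptions rather than the implementation of [31]: the estimator is the sample average of log f(x,y)/(f(x)f(y)) with Gaussian kernels, the bandwidth h = 0.3 is an arbitrary choice, and the function names are hypothetical.

```python
import math

def kernel_matrix(values, h):
    """K[w][u] = exp(-(x_w - x_u)^2 / (2 h^2)) for one gene."""
    return [[math.exp(-((a - b) ** 2) / (2 * h * h)) for b in values]
            for a in values]

def mutual_information(kx, ky):
    """Kernel estimate of I(X;Y) from two precomputed kernel matrices."""
    m = len(kx)
    total = 0.0
    for w in range(m):
        # The joint Gaussian kernel factorizes into the two per-gene kernels.
        joint = sum(kx[w][u] * ky[w][u] for u in range(m))
        total += math.log(m * joint / (sum(kx[w]) * sum(ky[w])))
    return total / m

def pairwise_mi(genes, h=0.3):
    """genes: one list of expression values per gene. Returns an N x N matrix."""
    kernels = [kernel_matrix(g, h) for g in genes]   # computed once per gene
    n = len(genes)
    mi = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            mi[i][j] = mi[j][i] = mutual_information(kernels[i], kernels[j])
    return mi

genes = [[0.0, 1.0, 2.0, 3.0], [0.0, 1.0, 2.0, 3.0], [1.0, -1.0, 1.0, -1.0]]
mi = pairwise_mi(genes)
```

Precomputing `kernels` is the point of the sketch: each gene's sample-to-sample kernel values are evaluated once, not once per gene pair, which mirrors the reuse described in the text.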


LNO (Loop Nest Optimization), which changes the order of nested loops, is a common time-saving technique in computer science [34].

Target Network

In our approach, a constructed target network is represented as an undirected graph. Nodes represent genes in the system, and edges represent the dependency between gene pairs [29]. For each gene pair, an MI (Mutual Information) value is used to measure their dependency and to represent the weight of the linkage. Assuming that a graph has N genes, there is one pairwise MI value for every gene pair. An N × N adjacency matrix, which can be visualized as a heatmap, is used to hold the MI values of all the linkages in the graph. In addition, hierarchical clustering is often used to verify the dependency between genes. For efficiency, we adopted the Matlab clustergram function, which uses Euclidean distance as the default method to calculate pairwise distance.

In order to remove irrelevant linkages in a graph, it is necessary to choose a suitable MI threshold, which determines the topology of the network. A value of 0 or 1 is assigned to each matrix element based on the chosen MI threshold. References [35] and [36] describe a method to determine a suitable threshold using permutations of MI. The procedure involves permuting the MI values of gene pairs and then choosing the largest one as the threshold. Using this procedure with 30 repetitions of the permutation on the MI matrix, we chose 0.06 as the threshold. The distributions of the original and permuted MI values for the GSE4226 AD dataset are shown in Figure 2-2.
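Turning an MI matrix into the 0/1 adjacency matrix of the target network can be sketched as follows. The threshold 0.06 is the value reported above; treating a link as present when its MI is strictly greater than the threshold, and ignoring the diagonal, are assumptions of this sketch.

```python
def adjacency_from_mi(mi, threshold=0.06):
    """Binarize an N x N MI matrix: 1 where MI exceeds the threshold, else 0."""
    n = len(mi)
    return [[1 if i != j and mi[i][j] > threshold else 0 for j in range(n)]
            for i in range(n)]

def edges(adjacency):
    """List each undirected edge once as a pair (i, j) with i before j."""
    n = len(adjacency)
    return [(i, j) for i in range(n) for j in range(i + 1, n) if adjacency[i][j]]

mi = [[1.00, 0.10, 0.02],
      [0.10, 1.00, 0.07],
      [0.02, 0.07, 1.00]]
adjacency = adjacency_from_mi(mi)
network_edges = edges(adjacency)
```

Raising the threshold removes weak links first, so the same MI matrix can yield a sparser or denser network depending on the value chosen.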


Synergistic Therapy

Scientists believe that the effect of a drug with multiple components should be viewed as a whole rather than as a superposition of individual components [18, 19]. Thus, a synergistic concept was formed and is considered an efficient way to design a drug [20]. Fitzgerald et al. used mathematical models to measure the effect generated by the multiple components [26]. However, their method does not consider practical issues such as crosstalk between pathways. Csermely et al. started to apply a network approach to analyze the interactions among multiple components [21]. Inspired by this, Li et al. proposed another systems biology methodology, NIMS (Network target based Identification of Multicomponent Synergy), to measure the effect of drug agent pairs depending on their gene expression data [22]. NIMS focuses on ranking the drug agent pairs of Chinese medicine components by SS (Synergy Score). A drug component is denoted as a drug agent, and the set of genes associated with it is denoted as the agent genes of the drug agent [22].

For example, for a given disease, assume there are N drug agents. Initially, NIMS randomly chooses two drug agents, A_1 and A_2, from the N agents and builds a background target network from their agent genes as a graph. From the graph, NIMS calculates a TS (Topology Score) by applying PCA (Principal Component Analysis) to form an importance score, the IP value, which integrates betweenness, closeness and PageRank, a variant of eigenvector centrality [37]. The TS is used to evaluate the topological significance of the target network for the drug agent pair, A_1 and A_2, and is defined as


(3-8)

where IP_1 and IP_2 denote the IP values for drug agents A_1 and A_2, respectively, min(d_i,j) denotes the shortest path from gene i of A_1 to all genes of A_2, and min(d_j,i) denotes the one from gene j of A_2 to all genes of A_1.

In [22], NIMS defines another term, AS (Agent Score), to evaluate the similarity of disease phenotypes for a drug agent. For a given drug agent, if one of its agent genes has a phenotype record in the OMIM (Online Mendelian Inheritance in Man) database, the drug agent has that phenotype as one of its phenotypes. The similarity score of a drug agent pair is defined by the cosine measure [38], and the AS is defined as in Eq. (3-9), where P_i,j denotes the similarity score of the i-th phenotype of A_1 and the j-th phenotype of A_2, and M denotes the total number of phenotypes. The SS (Synergy Score) of the pair is then defined as the product of TS and AS. NIMS calculates the SS for all possible drug agent pairs for a disease and can then find potential drug agent pairs after ranking them by SS.
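The way AS and SS combine can be sketched as follows. This is a loose illustration under stated assumptions, not the NIMS implementation: phenotypes are represented here as hypothetical numeric feature vectors, AS is taken as the mean cosine similarity over all phenotype pairs (one reading of the sum of P_i,j divided by M), and the TS values are supplied as given numbers instead of being computed from a network.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def agent_score(phenotypes_1, phenotypes_2):
    """Assumed form of AS: mean similarity over all phenotype pairs."""
    pairs = [(p, q) for p in phenotypes_1 for q in phenotypes_2]
    return sum(cosine(p, q) for p, q in pairs) / len(pairs)

def rank_by_synergy(scores):
    """scores: {agent pair: (TS, AS)}. Returns pairs sorted by SS = TS * AS."""
    return sorted(scores, key=lambda pair: scores[pair][0] * scores[pair][1],
                  reverse=True)

as_value = agent_score([[1.0, 0.0]], [[1.0, 0.0], [0.0, 1.0]])
ranking = rank_by_synergy({("A", "B"): (0.8, 0.5), ("A", "C"): (0.5, 0.9)})
```

Because SS is a product, a pair needs both a favorable network topology and phenotype similarity to rank highly; a strong TS cannot compensate for an AS near zero.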


Table 2-1. Pseudocode of the rank subroutine

RANK SUBROUTINE
INPUT: a subset of k genes to be ranked
  Generate k artificial genes and put them next to the original genes.
  Pick an initial tentative value of m.
  DO
    FOR each subset S_i of m subsets
      Randomly select j elements from k genes to form the subset S_i.
      Train an SVM to get weight_i(g) for each gene in the subset.
    ENDFOR
    FOR each gene of k genes
      Compute the average score of the gene from m subsets.
    ENDFOR
    List k genes in descending order by their ranking scores.
  ENDDO WHILE m does not satisfy equation (2-3)
OUTPUT: the ranked k genes

Table 2-2. Pseudocode of the selection subroutine

SELECTION SUBROUTINE
INPUT: a few subsets with their validation accuracies, av(p_i)
  Compute the validation accuracy of all genes, v_baseline.
  FOR each subset given
    IF v(p_k) ≥ v_baseline and v(p_k) ≥ v(p_l) for k ≤ l ≤ k+10
      Resulting subset is B(p_k).
    ENDIF
  ENDFOR
OUTPUT: B(p_k)
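The RANK SUBROUTINE of Table 2-1 can be sketched in runnable form as follows. This is a sketch under assumptions: `weight_fn` is a hypothetical callback standing in for "train an SVM on this subset and return weight_i(g) for each gene in it", the artificial-gene step is left out, and the stop test (relative squared change of the score vector below `tol`, checked after at least `m_start` subsets) is an assumed form of the convergence criterion.

```python
import random

def rank_genes(genes, weight_fn, rng, m_start=10, tol=0.01, m_max=1000):
    """Rank genes by their average weight over random half-size subsets."""
    j = max(1, len(genes) // 2)
    total = {g: 0.0 for g in genes}   # sum of weights per gene
    count = {g: 0 for g in genes}     # number of subsets containing the gene
    score = {g: 0.0 for g in genes}
    previous, m = None, 0
    while m < m_max:
        subset = rng.sample(genes, j)            # one more random subset
        for g, w in weight_fn(subset).items():
            total[g] += w
            count[g] += 1
        m += 1
        # Average weight over the subsets that actually contained the gene.
        score = {g: total[g] / count[g] if count[g] else 0.0 for g in genes}
        if previous is not None and m >= m_start:
            change = sum((score[g] - previous[g]) ** 2 for g in genes)
            norm = sum(previous[g] ** 2 for g in genes)
            if norm > 0 and change / norm < tol:
                break                            # scores have stabilized
        previous = score
    return sorted(genes, key=lambda g: score[g], reverse=True)

genes = list(range(8))
toy_weight = lambda subset: {g: float(g) for g in subset}   # stand-in for the SVM
ranked = rank_genes(genes, toy_weight, random.Random(0), m_start=30)
```

Dividing by `count[g]` instead of by m is what makes the score an average over only those subsets in which the gene took part, as step (IV) of the ranking procedure describes.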


Table 2-3. Pseudocode of the integrated AMFES algorithm

AMFES ALGORITHM (Integrated Version)
INPUT: a dataset
  Divide the dataset into train samples and test samples.
  Divide the train samples into r training-validation component pairs.
  FOR each pair of r training-validation components
    Generate 200 candidate subsets p_qi.
    FOR each subset of 200 subsets
      CALL RANK subroutine to rank each subset.
      Assign each original gene a gene index.
      Train each subset on an SVM and compute the corresponding validation accuracy, v(p_qi), for the subset.
    ENDFOR
  ENDFOR
  FOR each subset of 200 subsets
    Compute the average validation rate, av(p_i), of the subset from r pairs.
  ENDFOR
  CALL SELECTION subroutine to search for the optimal subset by its average validation rate and denote it as p*.
  CALL RANK subroutine to rank original genes again and select original genes which belong to the subset B(p*).
OUTPUT: an optimal subset of genes B(p*)

Figure 2-1. AMFES expands four features into eight features, four of which are original (O1, O2, O3 and O4) and the other four are artificial (R1, R2, R3 and R4), obtained by permuting (randomizing) O1 to O4.


Figure 2-2. The distribution of original pairwise MI values and permuted pairwise MI values.


CHAPTER 3
COMPARISONS OF AMFES AND OTHER METHODOLOGIES

Microarray Data Description

To compare AMFES to the RFE and CORR methods, datasets of four cancer types and a simulated dataset are used.

Leukemia

Golub et al. classified types of cancer using DNA microarray gene expression, and Guyon et al. used the same dataset for comparison. The dataset includes two types of leukemia (ALL and AML) and was split into a training set and a test set. The training set contains 38 samples (27 ALL and 11 AML). The test set contains 34 samples (20 ALL and 14 AML). Each sample has 7129 features whose gene expression values are normalized.

Colon Cancer

Guyon et al. used the colon cancer dataset presented in [39], with 62 tissue samples in total (22 normal tissues and 40 tissues from cancer patients). Each sample has 2000 gene expression values. In [39], Alon et al. observed that normal samples and cancer samples tended to group into separate clusters under hierarchical clustering. In addition, they discovered some genes that contribute to classification as normal or cancerous. To complement the previous work, Guyon et al. performed a classification using RFE and designed a method to determine the optimal subset of genes.

Lymphoma

The lymphoma dataset contains gene expressions of DLBCL (Diffuse Large B-Cell Lymphoma) patients. There are 96 samples (62 malignant and 34 normal), with


4,026 features [40]. The research in [40] demonstrated that genetic variations play a role in the survival rate against DLBCL. The differential gene expressions among these patients show associations with the various tumor proliferation rates.

Prostate Cancer

The CNAs (Copy Number Alterations) of some genes may be indicators of the growth of prostate cancers [41]. Some changes are discovered in mutations of fusion genes, mRNA expressions and pathways in a majority of primary prostate samples. The analysis was applied to four platforms and consists of three subseries, GSE21034, GSE21035 and GSE21036 [41]. We use only GSE21036 for analysis. This prostate cancer dataset contains 373 features of miRNA expressions and 142 samples (114 tumorous and 28 normal) [23]. The platform is Agilent-019118 Human miRNA Microarray 2.0 G4470B (miRNA ID version).

Simulated Dataset

To evaluate the ability to discover informative features, we generate a simulated dataset, Sim-data, based on an approach similar to that used to generate Data-G [42], with 50 samples each for class 1 and class 2. The informative features follow the Gaussian distribution N(0.25, 1) for class 1 and N(-0.25, 1) for class 2, while the non-informative features follow the Gaussian distribution N(0, 1) for both classes. Among all the features, 5% of the genes are randomly chosen as outliers, which follow N(0.25, 100) for class 1 samples and N(-0.25, 100) for class 2.
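A generator for data of this shape can be sketched as follows. This is a sketch under assumptions: the numbers of features and of informative features are free parameters chosen by the caller, the second argument of N(., .) is read as a variance (so the outlier standard deviation is 10), and outliers are drawn from all features regardless of whether they are informative.

```python
import random

def make_sim_data(n_features, n_informative, rng, n_per_class=50, outlier_frac=0.05):
    """Return (samples, labels, informative ids, outlier ids)."""
    features = list(range(n_features))
    informative = set(rng.sample(features, n_informative))
    outliers = set(rng.sample(features, round(outlier_frac * n_features)))
    samples, labels = [], []
    for label, sign in ((1, 1.0), (2, -1.0)):
        for _ in range(n_per_class):
            row = []
            for f in features:
                if f in outliers:
                    # N(+0.25, 100) or N(-0.25, 100): variance 100 assumed
                    row.append(rng.gauss(sign * 0.25, 10.0))
                elif f in informative:
                    row.append(rng.gauss(sign * 0.25, 1.0))
                else:
                    row.append(rng.gauss(0.0, 1.0))
            samples.append(row)
            labels.append(label)
    return samples, labels, informative, outliers

samples, labels, informative, outliers = make_sim_data(200, 20, random.Random(0))
```

The class means differ by only 0.5 standard deviations on each informative feature, so no single feature separates the classes well; that is what makes the discovery rate a meaningful test of a ranking method.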


Results

We compare AMFES, RFE and CORR in terms of the average test accuracy, number of features, total computational time, training time, ROC curve, AUC values, t-test, p-values, and informative feature discovery rate. The tasks are summarized in Table 3-1. The average test accuracy is the average of the test accuracies measured from multiple randomly chosen training-testing pairs, as described in the method section. The number of selected features is compared under the condition of the same test accuracy, most likely 100%, among all methods. We measure the total computational time as the time of both the training and testing processes. The training computation time is the time used during the training process only. The two-sample t-test assuming unequal means and variances is performed on the top six genes to obtain t-scores and p-values with a 5% statistical significance threshold. In addition, we analyze the classification ability of the top six genes by visualizing their ROC curves and AUC values, which are computed by 2-fold cross-validation using the LIBSVM tool [13]. The discovery rate of informative features is evaluated only on the Sim-data. It is computed as the number of informative features selected divided by the total number of selected features. We create 10 simulated datasets as described in the Simulated Dataset section and calculate the individual and average discovery rates over all simulated datasets.

AMFES, RFE and CORR are compared on the colon, leukemia, lymphoma and prostate cancer datasets and on the simulated data. The average test accuracies are shown in Figure 3-1, and the numbers of selected features are shown in Figure 3-2. The total computation time, including both the training and testing processes, for all datasets is displayed in Figure 3-3, and the training time is given in Figure 3-4. The t-test values and p-values of the


top six genes for each cancer type are displayed in Tables 3-2 through 3-5, respectively. Figure 3-5 shows the corresponding AUC values for all methods. The ROC curves for colon cancer, leukemia, lymphoma and prostate cancer are shown in Figures 3-6 through 3-9, respectively. Finally, Table 3-6 presents the discovery rate for the informative features using AMFES, RFE and CORR.

Discussion

AMFES has higher or comparable average test accuracy compared to RFE and CORR, and for the same test accuracy, AMFES selects the smallest number of features. For example, in the colon dataset, AMFES selects as few as 12% of the number selected by RFE and 6% of the number selected by CORR while maintaining a higher or comparable test accuracy. It is especially noteworthy that AMFES demonstrates better total computational performance than RFE and CORR, showing much shorter total and training computation times. The total computational times, including training and testing, are compared based on a few training-validation pairs. The training computational time is compared based on only one training-validation pair, while the testing computational time is as little as 1 sec.

We analyze the individual classification ability of the top six genes by applying the t-test as a complement to the test accuracy of the selected features. For the colon and lymphoma datasets, the features selected by AMFES, RFE and CORR all have p-values less than 5%. On the leukemia dataset, both AMFES and CORR have p-values less than 5%, while the fifth top feature selected by RFE has a p-value over 5%. For the prostate cancer dataset, the fourth top feature selected by AMFES has a p-value larger than 5%. Although one feature selected by AMFES does not have a p-value less than 5%, the overall classifying ability still demonstrates either better or comparable


performance, as shown by its ROC/AUC analysis. In addition, a combination of individually weaker features is still able to outperform one of the stronger ones [11, 16, 17, 43]. For the discovery rate of informative features, AMFES also shows a slightly higher result than the other two methods. Both the average test accuracy and the informative feature discovery rate support the efficiency and efficacy of AMFES.

Table 3-1. Summary of tasks

List of tasks
  Average test accuracy
  Number of selected features
  Total computational time
  Training time
  ROC curve
  AUC values
  Two-sample t-test score
  P-values
  Informative features discovery rate
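Two of the tasks in Table 3-1 are simple enough to sketch directly: the t statistic of a two-sample test with unequal variances (Welch's form, assumed here to be the variant meant in the Results) and the informative-feature discovery rate. The function names are hypothetical, and the p-value lookup from the t distribution is left out because it needs a statistics library.

```python
from statistics import mean, variance

def welch_t(a, b):
    """Return (t statistic, Welch-Satterthwaite degrees of freedom)."""
    va = variance(a) / len(a)      # squared standard error of mean(a)
    vb = variance(b) / len(b)
    t = (mean(a) - mean(b)) / (va + vb) ** 0.5
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df

def discovery_rate(selected, informative):
    """Share of the selected features that are truly informative."""
    selected = set(selected)
    return len(selected & set(informative)) / len(selected)

t, df = welch_t([1.0, 2.0, 3.0, 4.0], [2.0, 3.0, 4.0, 5.0])
rate = discovery_rate([1, 2, 3, 4], [2, 4, 6])
```

For one gene, `a` and `b` would hold its expression values in the two classes; a large absolute t with a p-value under the 5% threshold counts as a pass in Tables 3-2 through 3-5.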


Table 3-2. T-test and p-values for the colon cancer dataset

          AMFES                      RFE                   CORR
          p-value          t-test    p-value    t-test     p-value          t-test
Top1      1e-4 * 0.0000    7.3868    0.0000     8.0856     1e-5 * 0.0000    7.9930
Top2      1e-4 * 0.0000    8.0856    0.0001     4.2560     1e-5 * 0.0001    7.2484
Top3      1e-4 * 0.0016    5.9360    0.0004     3.7890     1e-5 * 0.0000    8.0856
Top4      1e-4 * 0.7420    4.2560    0.0002     3.9189     1e-5 * 0.0001    7.3868
Top5      1e-4 * 0.0025    5.8116    0.0029     3.1026     1e-5 * 0.0253    5.8116
Top6      1e-4 * 0.0164    5.3169    0.0011     3.4219     0.2246           5.2323
Results:  6 pass                     6 pass                6 pass

Table 3-3. T-test and p-values for the leukemia cancer dataset

          AMFES                RFE                     CORR
          p-value    t-test    p-value    t-test       p-value             t-test
Top1      0.0010     3.4348    0.0097     2.6585       1.0e-10 * 0.0000    10.9232
Top2      0.0003     3.8536    0.0006     3.6053       1.0e-10 * 0.0024    9.0196
Top3      0.0032     3.0550    0.0113     2.6014       1.0e-10 * 0.0029    8.9802
Top4      0.0013     3.3516    0.1458     1.4710       1.0e-10 * 0.0545    8.2853
Top5      0.0097     2.6599    0.0074     2.7598       1.0e-10 * 0.1390    8.0645
Top6      0.0003     3.8310    0.0001     4.2705       1.0e-10 * 0.0457    8.3267
Results:  6 pass               5 pass, 1 fail          6 pass


Table 3-4. T-test and p-values for the lymphoma cancer dataset

          AMFES                          RFE                   CORR
          p-value              t-test    p-value    t-test     p-value              t-test
Top1      1.0e-004 * 0.000     10.1515   0.0000     7.0261     1.0e-008 * 0.0000    10.1515
Top2      1.0e-004 * 0.2668    4.4177    0.0000     6.0387     1.0e-008 * 0.0000    9.5715
Top3      1.0e-004 * 0.0000    8.2928    0.0000     7.1575     1.0e-008 * 0.0001    8.2928
Top4      1.0e-004 * 0.0000    7.0261    0.0005     3.6218     1.0e-008 * 0.0001    8.2274
Top5      1.0e-004 * 0.0002    6.0982    0.0069     2.7623     1.0e-008 * 0.0000    8.5950
Top6      1.0e-004 * 0.0307    4.9641    0.0000     6.0931     1.0e-008 * 0.1008    6.7886
Results:  All pass                       All pass              All pass

Figure 3-1. Average test accuracy of the three methods (%).


Figure 3-2. Number of selected features.

Figure 3-3. Total computational time, which includes training and testing for several pairs (sec).


Figure 3-4. Computational time (sec) for training on the dataset for one pair; testing takes around 1 sec.


Figure 3-5. The AUC values of the three methods on the cancer datasets.
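As a reminder of what the AUC values measure, the area under the ROC curve equals the probability that a randomly chosen positive sample receives a higher classifier score than a randomly chosen negative one, with ties counted as one half. The sketch below computes it directly from that definition; it is not the LIBSVM-based procedure used for the figure, and the score lists are hypothetical decision values.

```python
def auc(scores_pos, scores_neg):
    """AUC as the share of (positive, negative) pairs ranked correctly."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0      # positive ranked above negative
            elif p == n:
                wins += 0.5      # ties count as half
    return wins / (len(scores_pos) * len(scores_neg))

value = auc([0.9, 0.8, 0.4], [0.5, 0.3])
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation, which is why the values need no unit.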


Figure 3-6. Comparison of ROC curves for the colon cancer dataset: black, AMFES; light grey, RFE; dark grey, CORR. The AUC value is for AMFES.


Figure 3-7. Comparison of ROC curves for the leukemia cancer dataset: black, AMFES; light grey, RFE (fully overlapped with AMFES); dark grey, CORR. The AUC value is for AMFES.

PAGE 48

Figure 3-8. Comparison of ROC curves for the lymphoma cancer dataset: black, AMFES; light grey, RFE (fully overlapped with AMFES); dark grey, CORR. The AUC value is for AMFES.

PAGE 49

Figure 3-9. Comparison of ROC curves for the prostate cancer dataset: black, AMFES; light grey, RFE (fully overlapped with AMFES); dark grey, CORR. The AUC value is for AMFES.
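The AUC values quoted with the ROC figures above summarize each curve as a single number. As a minimal sketch, and not the tool used to produce the figures, the AUC can be computed directly from decision scores as the fraction of positive-negative sample pairs that are ranked correctly, with ties counted as one half. The function name `auc_from_scores` and the toy scores and labels below are illustrative assumptions, not values from any dataset in this work.

```python
def auc_from_scores(scores, labels):
    """Area under the ROC curve via the pairwise (Mann-Whitney) formulation.

    scores: decision values, where larger means "more likely positive".
    labels: 1 for positive samples; any other value (e.g. -1) for negatives.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y != 1]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0      # positive ranked above negative
            elif p == n:
                wins += 0.5      # tie counts as half a correct ranking
    return wins / (len(pos) * len(neg))


# Hypothetical toy data for illustration only.
toy_scores = [0.9, 0.8, 0.35, 0.6, 0.1]
toy_labels = [1, 1, 1, -1, -1]
toy_auc = auc_from_scores(toy_scores, toy_labels)
```

A value of 1.0 means every positive sample outranks every negative one, while 0.5 corresponds to a ranking no better than chance.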

PAGE 50

Table 3-5. The t test and p value for the prostate cancer dataset

            AMFES                            RFE                         CORR
            p value         t test           p value          t test     p value          t test
Top1        6.79765e-009    7.16216e+000     1e-4 * 0.0000    7.2662     1e-6 * 0.0690    5.6977
Top2        2.60543e-008    6.89841e+000     1e-4 * 0.0000    8.0750     1e-6 * 0.0000    8.5169
Top3        1.45811e-011    8.99926e+000     1e-4 * 0.1002    4.5839     1e-6 * 0.0000    7.9116
Top4        T test fail     T test fail      1e-4 * 0.0028    5.4003     1e-6 * 0.0000    7.2180
Top5        9.38531e-009    7.16995e+000     1e-4 * 0.0000    6.4124     1e-6 * 0.2778    5.4003
Top6        2.37164e-008    .97485e+000      1e-4 * 0.0000    7.2007     1e-6 * 0.0000    8.0750
Results:    5 pass, 1 fail                   All pass                    All pass

Table 3-6. Informative features discovery rate (%)

         No 1   No 2   No 3   No 4   No 5   No 6   No 7   No 8   No 9   No 10   Average
AMFES    88     86     97     93     84     88     90     89     86     97      90
RFE      91     85     91     80     86     90     84     86     92     93      87
CORR     91     84     96     80     87     92     87     91     88     100     89

PAGE 51

CHAPTER 4
CLINICAL BIOINFORMATICS: DIAGNOSIS AND THERAPY

An important goal in medical science, especially for complicated diseases such as cancer, is the design of a therapy tailored to each patient (so-called personalized medicine). To do this, the physician must know the specific biomarkers for each form of the disease and how these biomarkers interact. This information is usually represented as a target network. This chapter describes the steps used to generate the target networks for three stages of prostate cancer and for various subtypes of breast cancer.

Microarray Data Description

Prostate Cancer Dataset with RNA Biomarkers

To provide a more accurate prognosis, pathologists use cancer stages to measure cell tissue and tumor aggressiveness as indicators that help doctors choose a suitable treatment. The most widely used cancer staging system is the TNM (Tumor, Node, and Metastasis) system [20]. Depending on the level of differentiation between normal and tumor cells, a different histologic grade is given. Grade 1 indicates almost normal tissues, grade 2 indicates somewhat normal tissues, and grade 3 indicates tissues far from normal conditions. Although most cancers can be accommodated by the TNM staging system, some specific cancers require additional grading systems for pathologists to better interpret tumors. The Gleason Grading System, which is especially useful for prostate cancers, gives a GS (Gleason Score) based on the cellular contents and tissues of cancer biopsies from patients; the higher the GS, the less favorable the prognosis. The prostate cancer dataset, GSE18655, includes 139 patients with 502 RNA molecular markers [21]. Li et

PAGE 52

al. [21] showed that prostate tumors with the gene fusion TMPRSS2:ERG T1/E4 have a higher risk of recurrence than tumors without the gene fusion. The samples were fresh-frozen prostate tumor tissues from patients after radical prostatectomy surgery. Liquid nitrogen was used to freeze the middle sections of the prostates at extremely low temperature. Among these patients, 38 have GS 5-6, corresponding to histologic grade 1; 90 have GS 7, corresponding to histologic grade 2; and 11 have GS 8-9, corresponding to histologic grade 3. The platform used for the dataset is GPL5858, the DASL (cDNA-mediated annealing, selection, extension and ligation) Human Cancer Panel by Gene, manufactured by Illumina. The FDR (false discovery rate) of all RNA expressions in the microarray is less than 5%.

Breast Cancer Dataset with Non-Coding MicroRNA Biomarkers

The miRNAs (microRNAs) correlate strongly with some cellular processes, such as proliferation. They are the basis of a breast cancer dataset [22] containing 799 miRNAs and 101 patient samples. Differential expression of miRNAs indicates different levels of proliferation corresponding to 5 intrinsic breast cancer subtypes: luminal A, luminal B, basal-like, normal-like, and ERBB2. The original dataset contains 101 samples (41 luminal A, 15 basal-like, 10 normal-like, 12 luminal B, 17 ERBB2, as well as 1 sample with T35 mutation status, another sample with T35 wild-type mutation, and 3 unclassified samples). GSE19536 was represented in two platforms: GPL8227, an Agilent-019118 Human miRNA Microarray 2.0 G4470B (miRNA ID version), and GPL6480, an Agilent-014850 Whole Human Genome Microarray 4x44K G4112F (Probe Name). For this research, only the expressions of platform GPL8227 are used.

PAGE 53

Prostate Cancer Dataset of Cancerous and Normal Samples with RNA Biomarkers

The CNAs (Copy Number Alterations) of some genes may be associated with the growth of prostate cancers [23]. In addition, changes are found in fusion-gene mutations, RNA expressions and pathways in a majority of primary prostate samples. The analysis was applied to four platforms and consists of 3 subseries, GSE21034, GSE21035 and GSE21036 [23], but only GSE21036 is used for analysis as an example. The microarray dataset has 142 samples (114 primary prostate cancer samples and 28 normal cell samples). The platform is the Agilent-019118 Human miRNA Microarray 2.0 G4470B (miRNA ID version).

Results

We employ AMFES on two prostate cancer datasets (GSE18655 and GSE21036) and a breast cancer dataset (GSE19536). As shown in Table 4-1 for GSE18655, AMFES selects 96 biomarkers via a two-step process. The first step involves differentiation of grade 1 samples, resulting in 93 biomarkers. In the second step, AMFES classifies between grade 2 and grade 3 samples, with 3 biomarkers being selected. Thus, these 96 biomarkers are assumed to be able to classify among grade 1, grade 2 and grade 3 samples [6]. For GSE19536, AMFES also performs classification in two steps. In the first step, AMFES classifies between luminal and non-luminal samples, selecting 47 biomarkers [6]. In the second step, AMFES further classifies luminal samples as luminal A or luminal B and selects 27 biomarkers. For the non-luminal samples, AMFES classifies them as basal-like or normal-like samples, selecting 25 biomarkers [6]. After removing duplicate biomarkers, AMFES has 72 (47 + 27 - 2 duplicates) for classifying luminal samples and 68 (47 + 25 - 4 duplicates) for classifying non-luminal

PAGE 54

samples [6]. For GSE21036, AMFES simply selects 22 biomarkers for classifying cancerous and normal samples. Table 4-2 shows the number of selected genes, and the complete lists of these biomarkers can be found in Additional file 1 GSE18655_96_Biomarkers.xlsx, Additional file 2 GSE19536_72_Biomarkers.xlsx, Additional file 3 GSE19536_68_Biomarkers.xlsx, and Additional file 4 GSE21036_22_Biomakers.xlsx.

Calculating Mutual Information

The MI calculation is then applied, as described in the Mutual Information section, to the 96 biomarkers for GSE18655. The pairwise MI values of the grade 1, grade 2 and grade 3 samples are represented in three 96*96 matrices, which can be found in Additional file 5 GSE18655 Grade1 MI.xlsx, Additional file 6 GSE18655 Grade2 MI.xlsx and Additional file 7 GSE18655 Grade3 MI.xlsx. We also represent the four MI matrices of the 72 and 68 biomarkers for GSE19536 in Additional file 8 GSE19536 Luminal A MI.xlsx, Additional file 9 GSE19536 Luminal B MI.xlsx, Additional file 10 GSE19536 Basal Like MI.xlsx, and Additional file 11 GSE19536 Normal Like MI.xlsx. The two MI matrices for GSE21036 are in Additional file 12 GSE21036 Cancer MI.xlsx and Additional file 13 GSE21036 Normal MI.xlsx.

The results of the analysis of the MI matrices for the different classifications are shown in Table 4-3. For a given matrix, the first column in Table 4-3 denotes the mean value; the second column denotes the standard deviation; the third column shows the number of positive values in the matrix; the fourth column shows the number of negative values; the sixth column shows the minimum value; and the seventh column displays the maximum. In the fifth column, the MI matrices for two different classifications, such as luminal A vs. luminal B, are compared. The fifth column denotes the number of sign

PAGE 55

differences between the samples compared, where one sign difference corresponds to different signs for the two entries at the same position in the two matrices. We employ the same process for comparing basal-like versus normal-like samples for GSE19536 and cancerous versus normal samples for GSE21036. To visualize the differences, the histograms of the MI values of grade 1, grade 2 and grade 3 samples are displayed in Figure 4-1. Figure 4-2 shows the histograms for luminal A versus luminal B, Figure 4-3 shows basal-like versus normal-like, and Figure 4-4 shows cancerous versus normal. Since there are three prostate types, they cannot be fairly compared (N/A in column 5 for GSE18655). In addition, because there are many MI entries in all histograms, only the densest section of each histogram is shown in the figures.

Synergistic Therapy

Based on the interpretation of the network [4,5], we propose a framework that can help to elucidate the underlying interactions between multi-target biomarkers and multi-component drug agents. The framework consists of three parts: (1) selection of biomarkers for a complex disease such as cancer, (2) construction of target networks of biomarkers, and (3) determination of the interactions between biomarkers and drug agents to provide a personalized and synergistic therapy plan. From the GEO datasets for cancers, a genetic model of each cancer, called the signature of that particular cancer, is developed. The signatures (target networks) of various cancers may be quite different, corresponding to the different biomarkers in Additional file 1 GSE18655_96_Biomarkers.xlsx, Additional file 2 GSE19536_72_Biomarkers.xlsx, Additional file 3 GSE19536_68_Biomarkers.xlsx, and Additional file 4 GSE21036_22_Biomakers.xlsx. For these different signatures, various synergistic mechanisms may be discovered, as exemplified in [24].

PAGE 56

A synergistic therapy plan for a patient A can then be designed based on his or her bodily samples, such as saliva or blood. The process starts with obtaining the corresponding microarray dataset of patient A and then applying it to the genetic model, as shown in Figure 4-5. A complete synergistic therapy should be able to select a small subset of biomarkers and correlate them with drug agents in a multi-target, multi-component network approach, as shown in Figure 4-6. In Figure 4-6, a disease is associated with several biomarkers such as RNAs, miRNAs or proteins, denoted by R1, R2, R3, R4 and R5, which are the regulators for operons O1, O2 and O3. An operon is a basic unit of DNA formed by a group of genes controlled by a gene regulator. These operons initiate molecular mechanisms as promoters. The gene regulators can enable organs to regulate other genes either by induction or repression. Each target biomarker may have a list of pharmacons used as enzyme inhibitors. Traditionally, pharmacons refer to biologically active substances, and they are not limited to drug agents. For example, herbal extracts whose ingredients have a promising anti-disease effect can be used as pharmacons [24]. Meanwhile, the pharmacons, denoted by D1, D2 and D3, have effects on some target biomarkers. For example, D1 may affect target biomarker R3, D2 may affect target biomarker R5, and D3 may affect biomarker R1. Compared with the drug agent pair methodology [5], the proposed framework in Figure 4-6 represents a more accurate interpretation of biomarkers with multi-component drug agents.

Discussion

After computation, the MI values can be either positive or negative. The positive values represent the attraction among the biomarkers while the negative values represent

PAGE 57

the repulsion among the biomarkers, similar to the concept of Yin-Yang in TCM (Traditional Chinese Medicine). From these results, we observe that there is minimal difference in mutual information values between cancer stages. However, the difference in the mean MI value for prostate cancer versus normal cells is more obvious. The mean MI value of the cancerous samples in the last prostate cancer dataset is approximately twice that of the normal samples. This may merit further investigation.
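The per-matrix quantities reported in Table 4-3 (mean, standard deviation, counts of positive and negative entries, minimum, maximum, and the number of sign differences between two matrices) can be sketched as follows. This is an illustrative reading of the column definitions given in the text, not the code used in this work. The function names, the use of the population standard deviation, the treatment of zero as its own sign, and the 2x2 toy matrices are all assumptions made for the example.

```python
from statistics import mean, pstdev


def summarize_mi(matrix):
    """Summary statistics over all entries of a square MI matrix (list of rows)."""
    values = [v for row in matrix for v in row]
    return {
        "mean": mean(values),
        "std": pstdev(values),  # assumption: population standard deviation
        "num_positive": sum(1 for v in values if v > 0),
        "num_negative": sum(1 for v in values if v < 0),
        "min": min(values),
        "max": max(values),
    }


def count_sign_differences(a, b):
    """Count positions where two equally sized matrices differ in sign.

    Assumption: zero is treated as a third sign, distinct from + and -.
    """
    def sign(v):
        return (v > 0) - (v < 0)

    return sum(
        1
        for row_a, row_b in zip(a, b)
        for x, y in zip(row_a, row_b)
        if sign(x) != sign(y)
    )


# Hypothetical 2x2 matrices for illustration; the real matrices are 96x96, 72x72, etc.
class_a = [[0.2, -0.1], [-0.1, 0.3]]
class_b = [[0.1, 0.05], [0.05, -0.2]]

stats_a = summarize_mi(class_a)
n_diff = count_sign_differences(class_a, class_b)
```

For two classes such as luminal A and luminal B, `count_sign_differences` would be applied to the two matrices built over the same biomarker set, so that each position refers to the same biomarker pair.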

PAGE 58

Figure 4-1. Comparison of 96 MI of grade1, grade2 and grade3 prostate cancer samples

Figure 4-2. Comparison of 72 MI of luminal A and luminal B samples

PAGE 59

Figure 4-3. Comparison of 68 MI of basal-like and normal-like samples

Figure 4-4. Comparison of 22 MI of prostate cancerous and normal-like samples

PAGE 60

Figure 4-5. Diagram of detailed process of building the genetic model

PAGE 61

Figure 4-6. Relationships between biomarkers, pharmacons and operons, where R1, R2, R3, R4 and R5 denote 5 biomarkers. Among all the biomarkers, R2, R3 and R5 are regulators.

PAGE 62

Table 4-1. Descriptions of 3 datasets: GSE18655 (prostate cancer), GSE19536 (breast cancer) and GSE21036 (prostate cancer)

                        Prostate Cancer (GSE18655)               Breast Cancer (GSE19536)                                                 Prostate Cancer (GSE21036)
Number of Biomarkers    502                                      489                                                                      373
Type of Biomarkers      RNAs                                     miRNAs                                                                   RNAs
Number of Samples       139                                      101                                                                      142
Variation of Samples    Grade1 (38), Grade2 (90), Grade3 (11)    Luminal A (41), Luminal B (15), Basal-like (10), Normal-like (12)        Cancerous (114), Normal (28)

Table 4-2. Results of selected subsets of genes

                                 Prostate Cancer (GSE18655)    Breast Cancer (GSE19536)    Breast Cancer (GSE19536)    Prostate Cancer (GSE21036)
Number of Biomarkers Selected    96                            72                          68                          22
Variation of Samples             Grade1, Grade2, Grade3        Luminal A, Luminal B        Basal-like, Normal-like     Cancerous, Normal
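The totals of 72 and 68 biomarkers for GSE19536 in Table 4-2 come from merging the first-step selection (47 biomarkers) with each second-step selection (27 and 25 biomarkers) and dropping the duplicates (2 and 4, respectively). A minimal sketch of that merge as a set union is given below. The identifiers are placeholders generated to match the reported counts; they are not the actual miRNA names, and the helper name `merged_count` is an assumption for illustration.

```python
def merged_count(step_one, step_two):
    """Number of distinct biomarkers after merging two selections."""
    return len(set(step_one) | set(step_two))


# Placeholder IDs sized to match the counts reported for GSE19536.
luminal_vs_non = [f"m{i}" for i in range(47)]               # step 1: 47 biomarkers
a_vs_b = luminal_vs_non[:2] + [f"a{i}" for i in range(25)]  # step 2: 27, of which 2 repeat
basal_vs_normal = luminal_vs_non[:4] + [f"b{i}" for i in range(21)]  # step 2: 25, 4 repeat

luminal_total = merged_count(luminal_vs_non, a_vs_b)               # 47 + 27 - 2
non_luminal_total = merged_count(luminal_vs_non, basal_vs_normal)  # 47 + 25 - 4
```

Using a set union rather than subtracting a hand-counted number of duplicates avoids miscounting when the same biomarker appears in both selections.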

PAGE 63

Table 4-3. Results of analysis of MI matrices

Columns, in order: Mean value of MI; Standard deviation of MI; Num of positive values; Num of negative values; Num of different sign; Min value; Max value

GSE18655_grade1           0.00024  0.0015  6298  2918  N/A   0.0858
GSE18655_grade2           0.00020  0.0017  6468  2748        0.0949
GSE18655_grade3           0.0004   0.0021  6650  2566        0.0582
GSE19536_A(72)            0.00036  0.0022  3912  1272  2052  0.1293
GSE19536_B(72)            0.00053  0.0040  3388  1796        0.2279
GSE19536_BasalLike(68)    0.0017   0.0056  3491  998   1217  0.1648
GSE19536_NormalLike(68)   0.0056   0.008   4200  420         0.1279
GSE21036_cancer           0.0165   0.0212  10    474   56    0.1446
GSE21036_norm             0.0086   0.0146  46    438         0.1565

PAGE 64

CHAPTER 5

This chapter uses the bioinformatics methods developed in Chapter 4 to discover biomarkers.

Microarray Datasets Descriptions

The gene expressions used for this study are based on PBMC (Peripheral Blood Mononuclear Cell) blood-based biomarkers [2-4]. PBMCs are blood cells with round nuclei, which are separated from plasma, polymorphonuclear cells and erythrocytes using ficoll, a hydrophilic polysaccharide. Fields such as immunology, transplant immunology, and vaccine development often use PBMCs. All subjects, both AD patients and normal elderly controls, took the MMSE (Mini Mental State Examination). Those with chronic metabolic conditions such as diabetes, rheumatoid arthritis and other chronic illnesses, or with familial AD problems, are not included in the analysis [2-4].

GSE4226

AMFES is used to analyze the gene expressions from the BMC (Blood Mononuclear Cell) of AD patients [4]. The dataset contains 9600 features from 14 normal elderly control samples (7 females and 7 males) and 14 AD patient samples (7 females and 7 males). The average age of the patients is 79 ± 5 years, with 11 ± 4 years of formal education. The platform of the dataset is GPL1211, and the gene expressions are extracted using the NIA (National Institute on Aging) Human MGC (Mammalian Gene Collection) cDNA microarray technology.

GSE4227

The dataset is extracted from BMC under the same GPL1211 platform as GSE4226 and is used to identify the genes with expressions associated with GSTM3

PAGE 65

(Glutathione S-Transferase Mu 3) [3]. The dataset contains 9600 features and 34 samples (18 normal elderly control samples and 16 sporadic AD samples).

GSE4229

This dataset contains new subjects and some subjects from GSE4226 and GSE4227. The blood samples were extracted by phlebotomy into an EDTA vacutainer. The dataset also contains 9600 features and 40 samples (18 AD patients and 22 normal elderly control samples). The platform is the same as that for GSE4226 and GSE4227.

Results

Results of Biomarkers

Table 5-1 contains the description of the three datasets, GSE4226, GSE4227 and GSE4229. AMFES selects 74 genes for GSE4226, 52 for GSE4227 and 395 for GSE4229, and the selected results are shown in Table 5-2. The complete lists of the 74, 52 and 395 selected genes can be found in Tables B-1, B-2 and B-3 in Appendix B. The statistical results of the MI values are shown in Table 5-3.

ROC/AUC Comparison

For dataset GSE4226, gene expression differences between AD and normal samples were analyzed by Maes et al. [4] using SAM (Significance Analysis of Microarrays) software, and 30 permutations were performed to generate the corresponding t test. As a result, 849 genes were found to be down-regulated and 93 genes up-regulated in AD signal paths [4]. The overlap of the genes selected by AMFES with those in Maes et al. is shown in Figure 5-1. Among the 849 down-regulated genes, 746 genes overlap the 9600 genes in GPL1211. For the 93 up-regulated genes, 82 genes overlap. The complete list of the 746 down-regulated genes is provided in Table B-4, and the list of the 82 up-regulated

PAGE 66

genes is shown in Table B-5. To compare the classification ability of the selected genes, the AUC is calculated, and the resulting ROC curves for the gene expressions of the 74 genes selected by AMFES and of the 828 genes (746 down-regulated + 82 up-regulated) are drawn using the LIBSVM Matlab ROC tool [13], as shown in Figure 5-2. The ROC/AUC values are verified by cross-validation [13].

Mutual Information Analysis

The pairwise MI values of the selected genes were calculated separately for AD and normal samples. The histograms of MI values for GSE4226 are shown in Figure 5-3, where the black bars represent MI values of normal samples and the grey bars those of AD samples. The histograms for GSE4227 and GSE4229 are displayed in Figures 5-4 and 5-5, respectively. The pairwise MI files of AD and normal samples are listed in Table B-6: GSE4226 AD, Table B-7: GSE4226 Normal, Table B-8: GSE4227 AD, Table B-9: GSE4227 Normal, Table B-10: GSE4229 AD and Table B-11: GSE4229 Normal. The analysis results are shown in Table 5-3.

Clustergram Example

The clustergram function applied to the genes selected from GSE4226 is described as an example. Only the top-ranked 15 genes are used for analysis, as shown in Figure 5-6. If a few genes share high pairwise MI values with a specific gene, they tend to cluster together, as indicated by rectangles (Figure 5-6), and have fewer hops (the number of connections between a pair of genes) than other genes (Figure 5-7). For example, PEX5 shares similar MI values with DNPEP, CCBP2, CCMT1, BCAP29, LRRC1 and NDUFA6, which are clustered together in Figure 5-6. In the graphical view of the target network, these genes have direct connections to PEX5, as shown

PAGE 67

in Figure 5-7. On the other hand, a gene such as PLEKHA1, which is two hops away from PEX5, has an obvious color difference in the MI clustergram in Figure 5-6.

Functional Attributes

The biological functional attributes were searched for the 74 genes selected in GSE4226, and those of 19 were found in the SOURCE database (http://source.stanford.edu) [44]. The results are shown in Table 5-4.

Overlapping Genes Discovered

As shown in Figure 5-1, AMFES discovered 17 overlapping down-regulated genes out of 746 genes and 9 overlapping up-regulated genes out of 82 genes, shown in Table 5-5 and Table 5-6, respectively.

Different Gene Profiling

As shown in Figure 5-1, 729 genes are specific to Maes et al. [4] and 57 genes are discovered only by AMFES. We analyzed the correlations of these 57 genes with the 729 genes by calculating the MI values on the combined matrix of the two groups. As shown in Table 5-7, the minimum, mean and maximum MI values of the 57-to-57 and 729-to-729 gene pairs differ from those of the 57-to-729 pairs. The MI values between the 729 Maes et al.-specific genes and the 57 AMFES-specific genes are clearly lower than those within the 57 genes or within the 729 genes.

Gender Analysis

After dividing all samples by gender and applying AMFES, the genes selected by AMFES are compared to those selected by Maes et al. [4] in Tables 5-8, 5-9 and 5-10. The genes selected by AMFES are compared with the ones in Maes et al.: 139 down-regulated genes (common for both genders), 19 up-regulated ones (common for

PAGE 68

both genders), 130 down-regulated (female-specific), 132 up-regulated (female-specific), 124 down-regulated (male-specific) and 151 up-regulated (male-specific) in [4]. The 420 female genes are the total of the ones common to both genders (139 down-regulated + 19 up-regulated) and the female-specific ones (130 down-regulated + 132 up-regulated), and the 433 male genes are the total of the ones common to both genders (139 down-regulated + 19 up-regulated) and the male-specific ones (124 down-regulated + 151 up-regulated). Table 5-10 shows the overlapping female and male results, respectively.

Discussion

In this chapter, GSE4226 is described in more detail because the numbers of female and male subjects are equal. The same procedure can be applied to GSE4227 and GSE4229. For GSE4226, Maes et al. found 849 down-regulated and 93 up-regulated biomarkers, while our results select a much smaller subset of 74 biomarkers with higher test accuracy (100% vs. 90%) [4]. Our results also obtain higher ROC/AUC values (0.96 vs. 0.51). For the distributions of MI values, all three datasets have both positive and negative values. We observe that normal subjects have higher MI values than AD subjects, which could be an indicator for prognosis and diagnosis. In addition, all normal subjects have higher standard deviations than AD subjects, which may reveal some interesting patterns among AD subjects. Hierarchical clustering of the 74 genes yields the top-ranked 15 genes shown in the heatmap (Figure 5-6). The genes clustered together also have short distances between them in the corresponding target network (Figure 5-7). We also observe a low dependency between our 57 genes and the 729 genes when compared to the dependency within each group, an indication of the novelty of our genes. In addition, we extract the biological process, the

PAGE 69

functional attributes of 19 out of these 57 genes from the SOURCE database. These 57 genes could be potential candidates for further clinical investigation. Finally, we analyze the selected features based on gender, and AMFES still selects a much smaller number of features for female subjects (GSE4226: 19, GSE4227: 12, GSE4229: 36) and male subjects (GSE4226: 9, GSE4227: 13, GSE4229: 13).

Based on our results, the complete process for improving the diagnosis of AD is developed, as shown in Figure 5-8. First, all gene expressions of AD and healthy subjects are labeled 1 or -1 and used to train AMFES. From the trained patterns of these subjects, when a new subject is presented, AMFES can predict the pathological status of the subject and select a small set of important biomarkers to construct the target network for the new subject. Based on the computations of mutual information and the selected biomarkers, we can obtain more detailed information, such as regulatory pathways and biological processes for genetic profiling, to further improve the diagnosis of AD.

PAGE 70

Table 5-1. Descriptions of 3 datasets: GSE4226, GSE4227, and GSE4229

                        GSE4226                     GSE4227                     GSE4229
Number of Biomarkers    9600                        9600                        9600
Type of Biomarkers      RNAs                        RNAs                        RNAs
Number of Samples       28 (14 AD vs. 14 normal)    34 (16 AD vs. 18 normal)    40 (18 AD vs. 22 normal)

Table 5-2. Results of selected subsets of genes

                                 GSE4226    GSE4227    GSE4229
Number of Biomarkers Selected    74         52         395

Table 5-3. Results of analysis of MI matrices

                   Mean value of MI    Standard deviation of MI    # of positive values    # of negative values    Min value    Max value
GSE4226_normal     0.0408              0.0572                      4912                    272                     0.0043       0.6211
GSE4226_AD         0.0355              0.0463                      5088                    388                     0.0045       0.5810
GSE4227_normal     0.0309              0.0436                      2546                    158                     0.0056       0.5621
GSE4227_AD         0.0289              0.0399                      2490                    214                     0.0075       0.5048
GSE4229_normal     0.0246              0.0295                      146301                  9724                    0.0069       0.5513
GSE4229_AD         0.0221              0.0278                      142665                  13360                   0.0077       0.5189

PAGE 71

Table 5-4. The partial biological processes of genes selected for GSE4226

Symbol      Biological Process
CCBP2       signal transduction
PEX5        protein transport
NDUFA6      electron transport component
ARG2        arginine catabolism
BCAP29      apoptosis
BTF3        transcription
DNPEP       peptide metabolism; proteolysis and peptidolysis
HSP90AB1    positive regulation of nitric oxide biosynthesis; protein folding; response to unfolded protein
RXRG        regulation of transcription, DNA-dependent; transcription
LCMT1       protein modification
ECE2        cell-cell signaling; embryonic development; heart development; peptide hormone processing; proteolysis and peptidolysis; regulation of G-protein coupled receptor protein signaling pathway; vasoconstriction
ZNF3        cell differentiation; immune cell activation; regulation of transcription, DNA-dependent; regulation of transcription, DNA-dependent; transcription
SLC10A3     "Sodium ion transport / Sodium ion transport / Transport / Organic anion transport activity"
CD99        cell adhesion
CPSF3       mRNA cleavage; mRNA polyadenylylation
PEA15       anti-apoptosis; negative regulation of glucose import; regulation of apoptosis; transport
FEM1B       induction of apoptosis
CCRN4L      In multiple clusters
EIF2S2      protein biosynthesis; translational initiation
LEPREL2     protein metabolism
CSTF1       RNA processing; mRNA cleavage; mRNA polyadenylylation
PLOD3       protein metabolism; protein modification
SUMF2
IFT81       cell differentiation; spermatogenesis
POLR2I      RNA elongation; regulation of transcription, DNA-dependent; transcription; transcription from RNA polymerase II promoter
GBA2        bile acid metabolism; bile acid metabolism
NCK1        T cell activation; intracellular signaling cascade; positive regulation of T cell proliferation; positive regulation of actin filament polymerization; signal complex formation

PAGE 72

Table 5-4. Continued.

Symbol       Biological Process
PPP2R1A      RNA splicing; ceramide metabolism; activation of MAPK; induction of apoptosis; mitotic chromosome condensation; negative regulation of cell growth; negative regulation of tyrosine phosphorylation of Stat3 protein; protein amino acid dephosphorylation; protein complex assembly; regulation of DNA replication; regulation of Wnt receptor signaling pathway; regulation of cell adhesion; regulation of cell cycle; regulation of cell differentiation; regulation of growth; regulation of transcription; regulation of translation; response to organic substance; second-messenger-mediated signaling
TBCK         protein amino acid phosphorylation
ATF3         regulation of transcription, DNA-dependent; transcription
PFKM         glucose metabolism; glycogen metabolism; regulation of glycolysis
XRCC6        DNA ligation; DNA recombination; DNA repair; positive regulation of transcription, DNA-dependent
C21orf119
SSSCA1       cell cycle; cytokinesis; mitosis
VCAM1        cell-cell adhesion
MAT2B        S-adenosylmethionine biosynthesis; S-adenosylmethionine biosynthesis; extracellular polysaccharide biosynthesis
SLC25A6      mitochondrial transport; transport
C20orf3      biosynthesis
BNIP3        anti-apoptosis; apoptosis; positive regulation of apoptosis
IGHA1        immune response
CELA3A       cholesterol metabolism; digestion; proteolysis and peptidolysis; proteolysis and peptidolysis
PI           GPI anchor biosynthesis; nucleotide metabolism
PSMB4        ubiquitin-dependent protein catabolism

PAGE 73

Table 5-5. 17 common down-regulated genes

No.   Gene Symbol
1     CRYBA2
2     KIF1B
3     CPSF3
4     MPDU1
5     BAP29
6     PLEKHA1
7     RXRG
8     MAT2B
9     SSSCA1
10    PPP2R1A
11    RAB9A
12    C20orf3
13    BTF3
14    UQCRC1
15    HSPCB
16    LOXL4
17    GK001

Table 5-6. Nine common up-regulated genes

No.   Gene Symbol
1     NDUFA6
2     ATF3
3     PRMT6
4     CSTF1
5     PS1D
6     RPS25
7     GRCB
8     VCAM1
9     IGLJ3

PAGE 74

Table 5-7. Mutual information analysis for non

                       Minimum    Mean      Maximum
57_57 pairwise MI      0.0012     0.0425    0.5071
57_729 pairwise MI     0.0072     0.0302    0.3005
729_729 pairwise MI    0.0064     0.0368    0.5093

Table 5-8. Comparison of female and male genes selected

Datasets    Number of features selected for female       Number of features selected for male
            (AMFES vs. Maes (417 genes))                 (AMFES vs. Maes (430 genes))
GSE4226     19                                           9
GSE4227     12                                           13
GSE4229     36                                           13

Table 5-9. Common female genes and male genes found between

Datasets    Common female genes (AMFES vs. Maes)    Common male genes (AMFES vs. Maes)
GSE4226     ELA3B, TAH2                             FLJ12571, IDH2, RNASE1, ZDHHC3
GSE4227     FLJ20234                                ATF3, IDH2, KIAA0737, LOC144305, RNASE1, RP4-622L5
GSE4229     CA3, DKFZP56410422, RIC-8               IDH2, KIAA0737, RNASE1

PAGE 75

Table 5-10.

Samples     Up-regulated genes                                                                                Down-regulated
            Number   Gene symbols                                                                             Number   Gene symbols
GSE4226     2        W: RHBDL2                                                                                0        n/a
                     M: DLOD3
GSE4227     3        W: n/a                                                                                   0        W: n/a
                     M: FEZ1, OSR2, SLC2A5                                                                             M: n/a
GSE4229     17       W: CREB3, GOS2, GNB3, MAGEA12, RAB13                                                     1        W: n/a
                     M: ASAH1, DUSP14, EIF2AK4, FEZ1, FH, INA, NCKAP1, PLOD3, PSMA3, SIAH2, SLC1A5, ZNF256             M: CYBSP2

Figure 5-1. Overlapping genes selected by AMFES with those in Maes et al.

PAGE 76

Figure 5-2. The AUC value shown in the figure is for AMFES.

Figure 5-3. Histograms of pairwise MI values of normal and AD samples of GSE4226

PAGE 77

Figure 5-4. Histograms of pairwise MI values of normal and AD samples of GSE4227

Figure 5-5. Histograms of pairwise MI values of normal and AD samples of GSE4229

PAGE 78

Figure 5-6. The clustergram of the first 15 genes selected by AMFES for GSE4226

Figure 5-7. The target network of the first 15 genes selected by AMFES for GSE4226

PAGE 79

Figure 5-8. A complete process to improve diagnosis of AD by AMFES

PAGE 80

CHAPTER 6
CONCLUSIONS

Based on the results presented in Chapter 3, we have shown AMFES to be an effective SVM-based classification method that, compared to CORR and RFE, selects a much smaller set of important biomarkers with a shorter computation time and higher or comparable test accuracy, statistical significance and discovery rate of informative features. It provides a general methodology not only for microarray data but also for other applications such as image processing, pattern recognition, sampling and data mining.

In Chapter 4, we presented a comprehensive approach to the diagnosis and therapy of cancers, and we proposed a complete procedure for clinical application to cancer patients. While the genetic model provides a standard framework for designing synergistic therapy, the actual plan for an individual patient is personalized and flexible. With careful monitoring, physicians may adaptively change or modify the therapy plan.

Finally, we discovered important biomarkers and provided the target networks for AD in Chapter 5. Our research extended the translational bioinformatics study of Maes et al. to improve diagnosis by using AMFES. Our results were verified by cross-validation and obtained better ROC/AUC values than those of Maes et al. A comparison of the distributions of MI values for normal and AD subjects showed that the AD subjects had lower MI values than the normal ones. Thus, we developed a methodology based on the MI value for diagnosis and prognosis. If the maximum MI value of a new patient is higher than the maximum value of our normal subjects, we can assume that the patient does not have AD. On the other hand, if the minimum MI value of the new patient is lower than the minimum value of our results, the patient could have AD. The accuracy of our

PAGE 81

method can be improved by enlarging the sample space of subjects. When we clustered the selected genes and showed them in a heatmap, the clustered genes showed short distances among them in the corresponding target network. We also demonstrated the novelty of our selected features. Finally, when we performed gene selection based on gender, our results still selected a much smaller set of features compared to Maes et al. Based on the target networks, we can further develop a synergistic strategy to improve therapy for AD in the future.
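The MI-based screening rule stated in this chapter compares a new patient's extreme MI values against reference extremes. A minimal sketch of that comparison is shown below, under several assumptions that are not specified in the text: the function name `screen_by_mi`, the order in which the two conditions are checked, the "inconclusive" outcome when neither condition holds, and the toy thresholds and MI values, which are hypothetical and not taken from Table 5-3.

```python
def screen_by_mi(patient_mi_values, normal_max, reference_min):
    """Apply the two-threshold MI rule to one patient's pairwise MI values.

    normal_max:    maximum MI observed among the normal reference subjects.
    reference_min: minimum MI observed in the reference results.
    Both thresholds must come from a reference cohort; none are built in.
    """
    highest = max(patient_mi_values)
    lowest = min(patient_mi_values)
    if highest > normal_max:      # assumption: this condition is checked first
        return "AD unlikely"
    if lowest < reference_min:
        return "AD possible"
    return "inconclusive"         # assumption: the text does not cover this case


# Hypothetical thresholds and patient values, for illustration only.
example_a = screen_by_mi([0.02, 0.70, 0.10], normal_max=0.60, reference_min=-0.01)
example_b = screen_by_mi([-0.05, 0.30, 0.10], normal_max=0.60, reference_min=-0.01)
example_c = screen_by_mi([0.00, 0.30, 0.10], normal_max=0.60, reference_min=-0.01)
```

Because the rule depends entirely on the reference extremes, a larger reference cohort changes the thresholds directly, which is consistent with the remark that accuracy can be improved by enlarging the sample space.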


APPENDIX A
TARGET NETWORKS OF MUTUAL INFORMATION

Table A-1. GSE18655 96 Biomarkers (attached: pdf file 16 kB)
http://ufdcimages.uflib.ufl.edu/AA/00/01/38/93/00001/GSE18655_96_Biomarkers.pdf
Table A-2. GSE19536 72 Biomarkers (attached: pdf file 10 kB)
http://ufdcimages.uflib.ufl.edu/AA/00/01/38/93/00001/GSE19536_72_Biomarkers.pdf
Table A-3. GSE19536 68 Biomarkers (attached: pdf file 12 kB)
http://ufdcimages.uflib.ufl.edu/AA/00/01/38/93/00001/GSE19536_68_Biomarkers.pdf
Table A-4. GSE21036 22 Biomarkers (attached: pdf file 12 kB)
http://ufdcimages.uflib.ufl.edu/AA/00/01/38/93/00001/GSE21036_22_Biomarkers.pdf
Table A-5. GSE18655 mutual information of grade 1 (attached: pdf file 145 kB)
http://ufdcimages.uflib.ufl.edu/AA/00/01/38/93/00001/18655_Grade1_MI.pdf
Table A-6. GSE18655 mutual information of grade 2 (attached: pdf file 145 kB)
http://ufdcimages.uflib.ufl.edu/AA/00/01/38/93/00001/18655_Grade2_MI.pdf
Table A-7. GSE18655 mutual information of grade 3 (attached: pdf file 145 kB)
http://ufdcimages.uflib.ufl.edu/AA/00/01/38/93/00001/18655_Grade3_MI.pdf
Table A-8. GSE19536 Basal like (attached: pdf file 73 kB)
http://ufdcimages.uflib.ufl.edu/AA/00/01/38/93/00001/19536_Basal_Like_MI.pdf
Table A-9. GSE19536 Normal like (attached: pdf file 75 kB)
http://ufdcimages.uflib.ufl.edu/AA/00/01/38/93/00001/19536_Normal_Like_MI.pdf
Table A-10. GSE21036 Cancer (attached: pdf file 14 kB)
http://ufdcimages.uflib.ufl.edu/AA/00/01/38/93/00001/21036_Cancer_MI.pdf
Table A-11. GSE21036 Normal (attached: pdf file 14 kB)
http://ufdcimages.uflib.ufl.edu/AA/00/01/38/93/00001/21036_Normal_MI.pdf


APPENDIX B
ALZHE I M

Table B-1. GSE4226_74_Biomarkers (attached: pdf file 11 kB)
http://ufdcimages.uflib.ufl.edu/AA/00/01/38/93/00001/GSE4226_74_Biomarkers.pdf
Table B-2. GSE4227_52_Biomarkers (attached: pdf file 10 kB)
http://ufdcimages.uflib.ufl.edu/AA/00/01/38/93/00001/GSE4227_52_Biomarkers.pdf
Table B-3. GSE4229_395_Biomarkers (attached: pdf file 14 kB)
http://ufdcimages.uflib.ufl.edu/AA/00/01/38/93/00001/GSE4229_395_Biomarkers.pdf
Table B-4. 746_Down regulated_Genesymbols (attached: pdf file 18 kB)
http://ufdcimages.uflib.ufl.edu/AA/00/01/38/93/00001/746_Down_regulated_Genesymbols.pdf
Table B-5. 82_Up regulated_Biomarkers (attached: pdf file 10 kB)
http://ufdcimages.uflib.ufl.edu/AA/00/01/38/93/00001/82_Up_regulated_Genesymbols.pdf
Table B-6. GSE4226 AD MI (attached: pdf file 87 kB)
http://ufdcimages.uflib.ufl.edu/AA/00/01/38/93/00001/GSE4226_AD_MI.pdf
Table B-7. GSE4226 Normal MI (attached: pdf file 84 kB)
http://ufdcimages.uflib.ufl.edu/AA/00/01/38/93/00001/GSE4226_Normal_MI.pdf
Table B-8. GSE4227 AD MI (attached: pdf file 46 kB)
http://ufdcimages.uflib.ufl.edu/AA/00/01/38/93/00001/GSE4227_AD_MI.pdf
Table B-9. GSE4227 Normal MI (attached: pdf file 46 kB)
http://ufdcimages.uflib.ufl.edu/AA/00/01/38/93/00001/GSE4227_Normal_MI.pdf
Table B-10. GSE4229 AD MI (attached: pdf file 2.3 MB)
http://ufdcimages.uflib.ufl.edu/AA/00/01/38/93/00001/GSE4229_AD_MI.pdf
Table B-11. GSE4229 Normal MI (attached: pdf file 2.3 MB)
http://ufdcimages.uflib.ufl.edu/AA/00/01/38/93/00001/GSE4229_Normal_MIs.pdf


LIST OF REFERENCES

[1] X. Zhang, X. Lu, Q. Shi et al. tion and sample classification for mass BMC bioinformatics, vol. 7, pp. 197, 2006.
[2] O. C. Maes, H. M. Schipper, H. M. Chertkow et al. Alzheimer's disease blood J Gerontol A Biol Sci Med Sci, vol. 64, no. 6, pp. 636-45, Jun, 2009.
[3] O. C. Maes, H. M. Schipper, G. Chong et al. Neurobiol Aging, vol. 31, no. 1, pp. 34-45, Jan, 2010.
[4] O. C. Maes, S. Xu, B. Yu et al. Neurobiol Aging, vol. 28, no. 12, pp. 1795-809, Dec, 2007.
[5] organizing maps: Proceedings of the National Academy of Sciences, vol. 96, no. 6, pp. 2907-2912, 1999.
[6] M. Reich, T. Liefeld, J. Gould et al. Nature genetics, vol. 38, no. 5, pp. 500-501, 2006.
[7] A. I. Saeed, N. K. Bhagabati, J. C. Braisted et al. "[9] TM4 Microarray Software Suite," Methods in Enzymology DNA Microarrays, Part B: Databases and Statistics, pp. 134-193: Academic Press.
[8] K. D. Dahlquist, N. Salomonis, K. Vranizan et al. Nature genetics, vol. 31, no. 1, pp. 19-20, 2002.
[9] N. Salomonis, K. Hanspers, A. C. Zambon et al. resources for pathway BMC bioinformatics, vol. 8, no. Journal Article, pp. 217, 2007.
[10] T. R. Golub, D. K. Slonim, P. Tamayo et al. Science (New York, N.Y.), vol. 286, no. 5439, pp. 531-537, 1999.
[11] I. Guyon, J. Weston, S. Barnhill et al. Mach. Learn., vol. 46, no. 1-3, pp. 389-422, 2002.
[12] C. Bishop, Pattern recognition and machine learning: Springer, 2006.


[13] C. C. Chang, and C. ACM Trans. Intell. Syst. Technol., vol. 2, no. 3, pp. 1-27, 2011.
[14] Machine Learning, vol. 20, no. 3, pp. 273-297, 1995.
[15] IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20, no. 8, pp. 832-844, Aug, 1998.
[16] E. Tuv, A. Borisov, and K. Torkkola "Feature Selection Using Ensemble Based Ranking Against Artificial Contrasts." pp. 2181-2186.
[17] Pattern Recognition Letters, vol. 27, no. 10, pp. 1067-1076, 2006.
[18] Nat Rev Drug Discov, vol. 4, no. 1, pp. 71-78, 2005.
[19] target therapeutics: when Drug discovery today, vol. 12, no. 1-2, pp. 34-42, 2007.
[20] Nature reviews. Drug discovery, vol. 5, no. 8, pp. 649-659, 2006.
[21] target drugs: Trends in pharmacological sciences, vol. 26, no. 4, pp. 178-182, 2005.
[22] S. Li, B. Zhan BMC systems biology, vol. 5 Suppl 1, no. Journal Article, pp. S10, 2011.
[23] W. C. Hsu, C. C. Liu, F. Chang et al. ssification: Mutual information, J Clin Bioinforma, vol. 2, no. 1, pp. 16, October 2nd, 2012.
[24] F. C. a. C. C. Liu, Ranking and selecting features using an adaptive multiple feature subset method, number, Technical Report TR-IIS-12-005, Academia Sinica, 2012.
[25] J. Mach. Learn. Res., vol. 3, pp. 1357-1370, 2003.


[26] J. B. Fitzgerald, B. Schoeberl, U. B. Nielsen et al. combin Nature chemical biology, vol. 2, no. 9, pp. 458-466, 2006.
[27] J. Bi, K. Bennett, M. Embrechts et al. J. Mach. Learn. Res., vol. 3, pp. 1229-1243, 2003.
[28] H. Stoppiglia, G. Dreyfus, R. Dubois et al. J. Mach. Learn. Res., vol. 3, no. Journal Article, pp. 1399-1414, 2008.
[29] rstanding the cell's Nat Rev Genet, vol. 5, no. 2, pp. 101-13, Feb, 2004.
[30] SIGMOBILE Mob. Comput. Commun. Rev., vol. 5, no. 1, pp. 3-55, 2001.
[31] P. Qiu, A. J. Gentles Computer Methods and Programs in Biomedicine, vol. 94, no. 2, pp. 177-180, May, 2009.
[32] J. Beirlant, E. J. Dudewicz, L. G. ouml et al. International Journal of Mathematical and Statistical Sciences, vol. 6, no. Journal Article, 1997.
[33] A. Margolin, I. Nemenman, K. Basso et al. Reconstruction of Gene Regulat BMC bioinformatics, vol. 7, no. Suppl 1, pp. S7, 2006.
[34] E. W. Michael, and S. L. Monica, "A data locality optimizing algorithm," 1991.
[35] A. J. Butte, and I. S. Kohane, "Mutual information relevance networks: functional genomic clustering using pairwise entropy measurements," 2000.
[36] K. Basso, A. A. Margolin, G. Stolovitzky et al. Nat Genet, vol. 37, no. 4, pp. 382-390, 2005.
[37] L. Page, S. Brin, R. Motwani et al. The PageRank Citation Ranking: Bringing Order to the Web, Technical Report, Stanford InfoLab, 1999.
[38] M. A. van Driel, J. Bruggeman, G. Vriend et al. mining analysis of the European journal of human genetics: EJHG, vol. 14, no. 5, pp. 535-542, 2006.
[39] U. Alon, N. Barkai, D. A. Notterman et al. revealed by clustering analysis of tumor and normal colon tissues probed by Proceedings of the National Academy of Sciences of the United States of America, vol. 96, no. 12, pp. 6745-6750, 1999.


[40] A. A. Alizadeh, M. B. Eisen, R. E. Davis et al. Nature, vol. 403, no. 6769, pp. 503-511, 2000.
[41] B. S. Taylor, N. Schultz, H. Hieronymus et al. Cancer cell, vol. 18, no. 1, pp. 11-22, 2010.
[42] X. Zhang, X. Lu, Q. Shi et al. ection and sample classification for mass BMC bioinformatics, vol. 7, no. 1, pp. 197, 2006.
[43] Machine Learning, vol. 45, pp. 5-32, 2001.
[44] M. Diehn, G. Sherlock, G. Binkley et al. URCE: a unified genomic resource Nucleic Acids Res, vol. 31, no. 1, pp. 219-23, Jan 1, 2003.


BIOGRAPHICAL SKETCH

In many ways, Dr [...] scholarly accomplishments [...] a master's degree in electrical engineering from the University of Southern California and a second M.S. in computer science from California State University, Northridge, prepared her to be an independent researcher, the dream since she was a child [...] Silicon Valley. Via coursework and independent projects, she obtained advanced knowledge and a solid background in both hardware design and software programming.

For her M.S. in computer science, she designed modular components and communication protocols for vehicle systems. Car owners often suffer from high maintenance costs, difficult diagnosis, and expensive repairs due to complicated wiring systems. Her thesis focused on developing an efficient protocol and algorithm to reduce the wiring complexity and to mark and accurately diagnose a faulty part. The communication protocol design with the embedded modular components will also reduce manufacturing costs. The thesis gave her an opportunity to apply knowledge of hardware (microcontrollers) and software (programming/protocols) to a significant real-world platform.

After she was accepted by the University of Florida in 2006, she continued developing better algorithms and models to improve the efficiency of computing systems. During her research, she learned that many bioinformaticians have been developing computational methodologies as frameworks and preliminary work to aid in diagnosis for patients. The diagnosis of [...] disease has been very challenging because of the complex nature of these diseases and the many parameters that need to be considered. However, the number of patient samples is limited. Machine learning has been a great methodology to solve the problem. The idea of improving diagnoses for human beings from the computational perspective really intrigued her. Thus, in cooperation with other team members, she proposed AMFES, an efficient SVM-based classification algorithm to discover important biomarkers of cancers and AD. AMFES has been verified to be theoretically efficient, and important biomarkers have been discovered. This research represents a starting point to improve diagnosis of complex diseases. In the future, she will continue to contribute her abilities and knowledge of electrical and computer engineering to help medical professionals improve the health of human beings.