Research

Cancer classification: Mutual information, target network and strategies of therapy

Authors:
Wen-Chin Hsu (affiliations 1, 2), wenchin@ufl.edu
Chan-Cheng Liu (affiliation 4), cheng@iis.sinica.edu.tw
Fu Chang, fchang@iis.sinica.edu.tw
Su-Shing Chen (affiliation 3; corresponding author), suchen@cise.ufl.edu

Affiliations:
1. System Biology Lab, University of Florida, Florida, USA
2. Department of Electrical and Computer Engineering, University of Florida, Florida, USA
3. Department of Computer and Information Science and Engineering, University of Florida, Florida, USA
4. Institute of Information Science, Academia Sinica, Taipei, Taiwan

Journal of Clinical Bioinformatics 2012, 2(1):16. ISSN 2043-9113. http://www.jclinbioinformatics.com/content/2/1/16. doi:10.1186/2043-9113-2-16. PMID 23031749.

Received 10 July 2012; accepted 20 September 2012; published 2 October 2012.

© 2012 Hsu et al.; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Keywords: Feature selection, Biomarkers, Microarray, Therapy design, Target network

Abstract

Background
Cancer therapy is a challenging research area because side effects often occur in chemotherapy and radiation therapy. We intend to study a multi-target, multi-component design that provides synergistic results to improve the efficiency of cancer therapy.

Methods
We have developed a general methodology, AMFES (Adaptive Multiple FEature Selection), for ranking and selecting important cancer biomarkers based on SVM (Support Vector Machine) classification. In particular, we exemplify this method with three datasets: a prostate cancer dataset (three stages), a breast cancer dataset (four subtypes), and another prostate cancer dataset (normal vs.
cancerous). Moreover, we have computed the target networks of these biomarkers as the signatures of the cancers, together with additional information (the mutual information between biomarkers of the network). We then propose a robust framework for synergistic therapy design that includes various existing mechanisms.

Results
These methodologies were applied to three GEO datasets: GSE18655 (three prostate cancer stages), GSE19536 (four breast cancer subtypes) and GSE21036 (prostate cancer cells and normal cells). We selected 96 biomarkers for the first prostate cancer dataset (three prostate stages), 72 for breast cancer (luminal A vs. luminal B), 68 for breast cancer (basal-like vs. normal-like), and 22 for the second prostate cancer dataset (cancerous vs. normal). In addition, we obtained statistically significant results of mutual information, which demonstrate that the dependencies among these biomarkers can be positive or negative.

Conclusions
We proposed an efficient feature ranking and selection scheme, AMFES, to select an important subset from a large number of features for any cancer dataset. We then obtained the signatures of these cancers by building their target networks. Finally, we proposed a robust framework of synergistic therapy for cancer patients. Our framework is not only supported by real GEO datasets but also aims at a multi-target/multi-component drug design tool, which improves on the traditional single-target/single-component analysis methods. This framework builds a computational foundation that can provide a clear classification of cancers and lead to an efficient cancer therapy.

Background

Cancer therapy is a difficult research area due to its level of complexity. Recently, the mere superposition of single drugs has been found to generate side effects and crosstalk with other drugs, which may cancel out the final success of treatments.
Thus, current research focuses on measuring drug treatments as a whole rather than considering them individually [1,2]. Later, a synergistic concept was proposed to evaluate drug treatments [3]. However, evaluations are still case-based and lack a systematic approach. In [4], a network methodology is first used to evaluate the efficiency of drug treatments. Li et al. then use a parameter, the SS (Synergy Score), to introduce the topology factor of the network based on the disease and the drug agent combination [5].

Our approach is first to build a more precise target network from the biomarkers selected by AMFES [6]. Then, we identify the intrinsic properties by computing the mutual information of the interactions among these biomarkers. Our approach aims to improve Li's results by considering the mutual information in the target network. We also provide a general framework of synergistic therapy, which may include several different approaches.

Methods

AMFES

The COD (Curse of Dimensionality) has been a major challenge of microarray data analysis due to the large number of genes (features) and the relatively small number of samples (patterns). To tackle this problem, many gene selection methodologies were developed to select only significant subsets of genes in a microarray dataset. AMFES selects an optimal subset of genes by training an SVM with subsets of genes generated adaptively [6].

When AMFES runs on a dataset, all samples are randomly divided into a training subset S and a testing subset T at a heuristic ratio of 5:1. S is used for ranking and selecting genes and for constructing a classifier out of the selected genes. T is used for computing test accuracy. When a training subset S is given, we extract r training-validation pairs from S according to the heuristic rule r = max(5, (int)(500/n + 0.5)), where n is the number of samples in S.
Each pair randomly divides S into a training component and a validation component at a ratio of 4:1. The heuristic ratio and rule were chosen from experimental experience as a balance between time consumption and performance. AMFES has two fundamental processes, ranking and selection. We first explain each process in detail and then present the integrated version.

Ranking

The gene ranking process consists of several ranking stages. At the first stage, all genes are ranked by their ranking scores in descending order. At the next stage, only the top half of the ranked genes are ranked again, while the bottom half keeps its current order in the subsequent stages. The same iteration repeats recursively until only three genes remain to be ranked, which completes one ranking process.

Assume that at a given ranking stage there are k genes indexed from 1 to k. To rank these k genes, we follow the four steps below.

(I) We first generate m independent subsets $S_1, \dots, S_m$. Each subset $S_i$, i = 1, 2, …, m, has j genes which are selected randomly and independently from the k genes, where j = (int)(k/2).
(II) Let $C_i$ be the SVM classifier that is trained on the subset of genes $S_i$, i = 1, 2, …, m. For each of the k genes, we compute the ranking score $\theta_m(g)$ of the gene g as in equation (1).
(III) We use the average weight of the gene g, that is, the sum of the weights of g over the m subsets divided by the number of subsets for which g is randomly selected. This makes the score a more robust representation of the true classifying ability of the gene g.
(IV) Rank the k genes in descending order by their ranking scores.
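As an illustration only, the splitting rule and one ranking stage (steps (I) to (IV)) can be sketched in Python as follows. This is a minimal sketch, not the AMFES implementation: `weight_fn` is a hypothetical callback standing in for "train an SVM on this subset and return a weight per gene", the names `num_validation_pairs` and `rank_stage` are ours, and giving a score of 0 to a gene that is never drawn is an assumption the text does not cover.

```python
import random
from collections import defaultdict


def num_validation_pairs(n):
    """Heuristic rule r = max(5, (int)(500/n + 0.5)) for n training samples."""
    return max(5, int(500 / n + 0.5))


def rank_stage(genes, weight_fn, m, rng):
    """Rank `genes` once, averaging each gene's weight over the subsets containing it.

    weight_fn(subset) is a hypothetical stand-in for SVM training: it must
    return a dict {gene: weight} for the genes in `subset`.
    """
    j = len(genes) // 2                  # subset size j = (int)(k/2)
    total = defaultdict(float)           # sum of weights seen for each gene
    count = defaultdict(int)             # number of subsets that contained the gene
    for _ in range(m):
        subset = rng.sample(genes, j)    # step (I): random subset of j genes
        weights = weight_fn(subset)      # step (II): weights from the classifier
        for g in subset:
            total[g] += weights[g]
            count[g] += 1
    # step (III): average over the subsets that contain the gene
    # (assumption: a gene that was never drawn gets score 0.0)
    scores = {g: (total[g] / count[g] if count[g] else 0.0) for g in genes}
    # step (IV): descending order by ranking score
    ranked = sorted(genes, key=lambda g: scores[g], reverse=True)
    return ranked, scores
```

With a toy `weight_fn` that returns fixed weights, `rank_stage` recovers the order of those weights; in practice the callback would wrap whichever SVM library is used.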
$$\theta_m(g) = \frac{\sum_{i=1}^{m} I\{g \in S_i\}\,\mathrm{weight}_i(g)}{\sum_{i=1}^{m} I\{g \in S_i\}} \qquad (1)$$

where I is an indicator function such that I{proposition} = 1 if the proposition is true; otherwise, I{proposition} = 0. In other words, if gene g is randomly selected for the subset $S_i$, this is denoted as $g \in S_i$ and I{proposition} = 1.

We denote the objective function of $C_i$ as $\mathrm{obj}_i(v_1, v_2, \dots, v_s)$, where $v_1, v_2, \dots, v_s$ are the support vectors of $C_i$. The $\mathrm{weight}_i(g)$ is then defined as the change in the objective function due to g, i.e.,

$$\mathrm{weight}_i(g) = \left|\,\mathrm{obj}_i(v_1, v_2, \dots, v_s) - \mathrm{obj}_i\!\left(v_1^{(g)}, v_2^{(g)}, \dots, v_s^{(g)}\right)\right| \qquad (2)$$

[6-8]. Note that if v is a vector, $v^{(g)}$ is the vector obtained by dropping gene g from v. Let $\theta_m$ be a vector comprising the ranking scores derived from the m gene subsets generated thus far, and let $\theta_{m-1}$ be the vector at the previous stage. The value of m is determined when $\theta_m$ satisfies equation (3), by adding a gene to an empty subset one at a time.

$$\frac{\lVert \theta_{m-1} - \theta_m \rVert^2}{\lVert \theta_{m-1} \rVert^2} < 0.01 \qquad (3)$$

where $\lVert\theta\rVert$ is understood as the Euclidean norm of the vector θ. The pseudocode of the ranking process is shown below.

Pseudocode for the ranking process of AMFES

RANK-SUBROUTINE
INPUT: a subset of k genes to be ranked
    Generate k artificial genes and put them next to the original genes
    Pick an initial tentative value of m
    DO WHILE m does not satisfy equation (3)
        FOR each subset S_i of the m subsets
            Randomly select j elements from the k genes to form the subset S_i
            Train an SVM to get weight_i(g) for each gene in the subset
        ENDFOR
        FOR each gene of the k genes
            Compute the average score of the gene from the m subsets
        ENDFOR
        List the k genes in descending order by their ranking scores
    ENDDO
OUTPUT: the k genes in ranked order

Selection

Ranking artificial features together with original features has been demonstrated to be a useful tool for distinguishing relevant features from irrelevant ones [9-11]. In our selection process, we also use this technique to find the optimal subset of genes.

Assume a set of genes is given. We generate artificial genes and rank them together with the original ones. After ranking the set, we assign a gene-index to each original gene, defined as the proportion of artificial genes that are ranked above it; the gene-index is therefore a real number between 0 and 1. Then, we generate a few subset candidates from which the optimal subset is chosen. Let $p_1, p_2, \dots$ be the sequence of subset-indices of the candidates with $p_1 < p_2 < \dots$, where $p_i = i \times 0.005$ and i = 1, 2, …, 200. Let $B(p_i)$ denote the subset corresponding to subset-index $p_i$; it contains the original genes whose gene-indices are smaller than or equal to $p_i$. Then, we train an SVM on every $B(p_i)$ and compute its validation accuracy $v(p_i)$.

We stop at the first $p_k$ at which $v(p_k) \ge v_{baseline}$ and $v(p_k) \ge v(p_l)$ for $k \le l \le k+10$, where $v_{baseline}$ is the validation accuracy of the SVM trained on the baseline, i.e., the case in which all features are involved in training. The final result, $B(p_k)$, is then the optimal subset for the given set of genes. The pseudocode for the selection process of AMFES is listed below.

Pseudocode for the selection process of AMFES

SELECTION-SUBROUTINE
INPUT: a few subsets with their validation accuracies, av(p_i)
    Compute the validation accuracy of all genes, v_baseline
    FOR each subset given
        IF v(p_k) ≥ v_baseline and v(p_k) ≥ v(p_l) for k ≤ l ≤ k+10 THEN
            Resulting subset is B(p_k)
        ENDIF
    ENDFOR
OUTPUT: B(p_k)

Integrated version

The ranking and selection processes from the previous sections apply to one training-validation pair. To increase the reliability of validation, we generate r pairs to find the optimal subset. We calculate the validation accuracy of the q-th pair for all $p_{qi}$ subsets, where q denotes the pair-index and i denotes the subset-index. Then, we compute $av(p_i)$, the average of $v(p_{qi})$ over the r training-validation pairs, and perform the subset search explained in the Selection section on $av(p_i)$ to find the optimal $p_i$, denoted as p*.

However, p* does not correspond to a unique subset, since each pair has its own B(p*) and they can all be different. Thus, we adopt all samples of S as training samples in order to find a unique subset. We generate artificial genes and rank them together with the original genes. Finally, we select the original genes whose gene-indices are smaller than or equal to p* as the genes selected for S. The integrated version of the process is shown below. In the pseudocode below, AMFES-ALGORITHM represents the integrated version of the whole process, while RANK-SUBROUTINE represents the ranking process and SELECTION-SUBROUTINE represents the selection process.

Pseudocode for the integrated version of AMFES

AMFES-ALGORITHM (Integrated Version)
INPUT: a dataset
    Divide the dataset into training samples and test samples
    Divide the training samples into r training-validation component pairs
    FOR each pair of the r training-validation component pairs
        Generate 200 candidate subsets p_qi
        FOR each subset of the 200 subsets
            CALL RANK subroutine to rank each subset
            Assign each original gene a gene-index
            Train an SVM on each subset and compute the corresponding validation accuracy, v(p_qi), for the subset
        END FOR
    END FOR
    FOR each subset of the 200 subsets
        Compute the average validation rate, av(p_i), of the subset from the r pairs
    END FOR
    CALL SELECTION subroutine to search for the optimal subset by its average validation rate and denote it as p*
    CALL RANK subroutine to rank the original genes again and select the original genes which belong to the subset B(p*)
OUTPUT: an optimal subset of genes B(p*)

Mutual information

Mutual information measures the dependency between two random variables based on their probability distributions. For two random variables X and Y, the mutual information of X and Y, I(X; Y), can be expressed by these equivalent equations [12]:

$$I(X;Y) = H(X) - H(X \mid Y) \qquad (4)$$

$$\phantom{I(X;Y)} = H(Y) - H(Y \mid X) \qquad (5)$$

$$\phantom{I(X;Y)} = H(X) + H(Y) - H(X,Y) \qquad (6)$$

where H(X) and H(Y) denote the marginal entropies, H(X|Y) and H(Y|X) denote the conditional entropies, and H(X,Y) denotes the joint entropy of X and Y. To compute the entropies, the probability distribution functions of the random variables must be calculated first. Because gene expressions are usually continuous numbers, we used kernel estimation to calculate the probability distributions [13].

Assume the two random variables X and Y are continuous. The mutual information is defined as [12]:

$$I(X,Y) = \iint f(x,y)\,\log\frac{f(x,y)}{f(x)\,f(y)}\;dx\,dy \qquad (7)$$

where f(x,y) denotes the joint probability distribution, and f(x) and f(y) denote the marginal probability distributions of X and Y. Using Gaussian kernel estimation, f(x,y), f(x) and f(y) can be represented by the equations below [14]:

$$f(x,y) = \frac{1}{M}\sum_{u=1}^{M}\frac{1}{2\pi h^2}\,e^{-\frac{1}{2h^2}\left((x-x_u)^2+(y-y_u)^2\right)} \qquad (8)$$

$$f(x) = \frac{1}{M}\sum_{u=1}^{M}\frac{1}{\sqrt{2\pi h^2}}\,e^{-\frac{1}{2h^2}(x-x_u)^2} \qquad (9)$$

and analogously for f(y), where M is the number of samples for both X and Y, u is the sample index, u = 1, 2, …, M, and h is a parameter controlling the width of the kernels.
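To make the kernel estimate concrete, the sketch below evaluates the Gaussian kernel densities of equations (8) and (9) at the sample points and averages the log ratio of joint to marginal densities over the M samples; the normalizing constants cancel in that ratio. It is a plain-Python illustration under our own naming (`gaussian_kernel_mi` is not from the paper), with no attempt at the speed-ups used for whole datasets.

```python
import math


def gaussian_kernel_mi(x, y, h):
    """Kernel estimate of I(X, Y) from paired samples x[u], y[u] with bandwidth h."""
    m = len(x)
    if m != len(y):
        raise ValueError("x and y must have the same number of samples")
    total = 0.0
    for w in range(m):
        kx_sum = 0.0   # unnormalized marginal kernel sum for X at sample w
        ky_sum = 0.0   # unnormalized marginal kernel sum for Y at sample w
        kxy_sum = 0.0  # unnormalized joint kernel sum at sample w
        for u in range(m):
            kx = math.exp(-((x[w] - x[u]) ** 2) / (2.0 * h * h))
            ky = math.exp(-((y[w] - y[u]) ** 2) / (2.0 * h * h))
            kx_sum += kx
            ky_sum += ky
            kxy_sum += kx * ky  # the joint Gaussian kernel factorizes
        # f(x,y) / (f(x) f(y)) at sample w; the 1/(2*pi*h^2) factors cancel
        total += math.log(m * kxy_sum / (kx_sum * ky_sum))
    return total / m
```

A quick sanity check: when one variable is constant, the joint kernel sum equals the marginal one and the estimate is 0; when Y is a copy of X, the estimate is positive.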
Thus, the mutual information $I(X,Y)$ can be represented as:

$$I(X,Y) = \frac{1}{M}\sum_{w=1}^{M}\log\frac{M\sum_{u=1}^{M} e^{-\frac{1}{2h^2}\left((x_w-x_u)^2+(y_w-y_u)^2\right)}}{\sum_{u=1}^{M} e^{-\frac{1}{2h^2}(x_w-x_u)^2}\;\sum_{u=1}^{M} e^{-\frac{1}{2h^2}(y_w-y_u)^2}} \qquad (10)$$

where both w and u are sample indices, w, u = 1, 2, …, M.

Computing the pairwise mutual information of the genes in a microarray dataset usually involves nested loops, which take a considerable amount of time. Assume a dataset has N genes and each gene has M samples. To calculate the pairwise mutual information values, the computation usually first finds the kernel distance between any two samples for a given gene. Then, the same process goes through every pair of genes in the dataset. To make the computation efficient, two improvements are applied [13]. The first is to calculate the marginal probability of each gene in advance and reuse it during the process [13,15]. The second is to move the summation over each sample pair for a given gene to the outermost for-loop rather than keeping it inside a nested for-loop over every gene pair. As a result, the kernel distance between two samples is calculated only twice instead of N times, which saves a lot of computation time. LNO (Loop Nest Optimization), which changes the order of nested loops, is a common time-saving technique in computer science [16].

Target network

The effect of drugs with multiple components should be viewed as a whole rather than as a superposition of individual components [1,2]. Thus, a synergistic concept was formed and is considered an efficient way to design a drug [3]. In [17], mathematical models are used to measure the effect generated by multiple components. However, they do not consider practical situations such as crosstalk between pathways. A network approach has since been used to analyze the interactions among multiple components [4].
Following the work in [4], another systems-biology methodology, NIMS (Network-target-based Identification of Multicomponent Synergy), was proposed to measure the effect of drug agent pairs based on their gene expression data [5]. NIMS focuses on ranking the drug agent pairs of Chinese Medicine components by their SS.

In [5], a drug component is called a drug agent, and the set of genes associated with it are called the agent genes of that drug agent. For a given disease, assume there are n candidate drug agents. Initially, NIMS randomly chooses two of them, A1 and A2, and builds a background target network from their agent genes as a graph. From the graph, NIMS calculates the TS (Topology Score) by applying PCA (Principal Component Analysis) to form an IP value, which integrates betweenness, closeness and a variant of eigenvalue PageRank [18]. The TS is used to evaluate the topological significance of the target network for the drug agent pair A1 and A2, and is defined as

$$TS_{1,2} = \frac{1}{2}\times\left[\frac{\sum_i IP_1(i)\,\exp\!\left(-\min(d_{i,j})\right)}{\sum_i IP_1(i)} + \frac{\sum_j IP_2(j)\,\exp\!\left(-\min(d_{j,i})\right)}{\sum_j IP_2(j)}\right] \qquad (11)$$

where $IP_1$ and $IP_2$ denote the IP values for drug agents A1 and A2. $\min(d_{i,j})$ denotes the minimum shortest path from gene i of A1 to all genes of A2, and $\min(d_{j,i})$ denotes the minimum shortest path from gene j of A2 to all genes of A1.

NIMS defines another term, the AS (Agent Score), to evaluate the similarity of disease phenotypes for a drug agent pair. For a given drug agent, if one of its agent genes has a phenotype record in the OMIM (Online Mendelian Inheritance in Man) database, the drug agent has that phenotype as one of its phenotypes. The similarity score of a drug agent pair is defined as the cosine of the angle between the pair's feature vectors [19].
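The two ingredients introduced so far, the topology score of equation (11) and the cosine similarity between feature vectors, can be sketched as follows. This is a hedged illustration rather than the NIMS code: the IP values are taken as given inputs (the PCA step that produces them is not reproduced), the network is assumed to be an unweighted undirected graph stored as an adjacency dict, a gene that cannot reach the other agent is assumed to contribute 0, and all function names are ours.

```python
import math
from collections import deque


def bfs_distances(adj, source):
    """Shortest-path lengths (in edges) from `source` in an unweighted graph."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nb in adj.get(node, ()):
            if nb not in dist:
                dist[nb] = dist[node] + 1
                queue.append(nb)
    return dist


def half_score(adj, ip_a, genes_b):
    """IP-weighted average of exp(-min distance) from each gene of one agent to the other agent."""
    num = 0.0
    for gene, ip in ip_a.items():
        dist = bfs_distances(adj, gene)
        reachable = [dist[g] for g in genes_b if g in dist]
        if reachable:  # assumption: unreachable genes contribute 0
            num += ip * math.exp(-min(reachable))
    return num / sum(ip_a.values())


def topology_score(adj, ip1, ip2):
    """TS for an agent pair; ip1 and ip2 map each agent gene to its IP value."""
    return 0.5 * (half_score(adj, ip1, ip2.keys()) + half_score(adj, ip2, ip1.keys()))


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, nonzero feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)
```

For a path graph a-b-c with agent genes {a} and {c}, each half of the score is exp(-2), so the topology score is exp(-2); agents that share a gene score 1 on that gene.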
The AS is defined as:

$$AS_{1,2} = \frac{\sum_{i,j} P_{i,j}}{M} \qquad (12)$$

where $P_{i,j}$ denotes the similarity score of the i-th phenotype of A1 and the j-th phenotype of A2, and M denotes the total number of phenotypes.

The SS of the pair is then defined as the product of TS and AS. NIMS calculates the SS for all possible drug agent pairs for a disease and can then find potential drug agent pairs after ranking them by SS.

Results

Microarray data description

The three datasets are described briefly in Table 1, which lists the number of biomarkers, the types of biomarkers, the number of samples and the variation of samples used.

Table 1. Descriptions of 3 datasets: GSE18655 (prostate cancer), GSE19536 (breast cancer) and GSE21036 (prostate cancer)

| | Prostate Cancer (GSE18655) | Breast Cancer (GSE19536) | Prostate Cancer (GSE21036) |
|---|---|---|---|
| Number of Biomarkers | 502 | 489 | 373 |
| Type of Biomarkers | RNAs | miRNAs | miRNAs |
| Number of Samples | 139 | 78 | 142 |
| Variation of Samples | Grade1 (38), Grade2 (90), Grade3 (11) | Luminal A (41), Luminal B (12), Basal-like (15), Normal-like (10) | Cancerous (114), Normal (28) |

The prostate cancer dataset with RNA biomarkers

To give a better prognosis, pathologists use cancer staging to assess tissue and tumor aggressiveness, which gives doctors an indicator for choosing a suitable treatment. The most widely used cancer staging system is the TNM (Tumor, Node, and Metastasis) system [20]. Depending on the level of differentiation between normal and tumor cells, a different histologic grade is given. Grade 1 tumors indicate almost normal tissue, grade 2 tumors indicate somewhat normal tissue, and grade 3 tumors indicate tissue far from normal conditions.
Although most cancers can be adapted to the TNM grading system, some specific cancers require additional grading systems for pathologists to better interpret tumors.

The Gleason Grading System is used specifically for prostate cancers, and a GS (Gleason Score) is given based on the cellular content and tissue of cancer biopsies from patients. The higher the GS, the worse the prognosis. The prostate cancer dataset, GSE18655, includes 139 patients with 502 molecular markers (RNAs) [21]. In [21], it was shown that prostate tumors with the gene fusion TMPRSS2:ERG T1/E4 have a higher risk of recurrence than tumors without the gene fusion. The 139 samples were fresh-frozen prostate tumor tissues of patients after radical prostatectomy. All samples were taken from the patients' prostates at the time of prostatectomy, and liquid nitrogen was used to freeze the middle sections of the prostates at extremely low temperature. Among these patients, 38 samples have GS 5–6, corresponding to histologic grade 1; 90 samples have GS 7, corresponding to histologic grade 2; and 11 samples have GS 8–9, corresponding to histologic grade 3. The platform used for the dataset is GPL5858, the DASL (cDNA-mediated annealing, selection, extension and ligation) Human Cancer Panel by Gene, manufactured by Illumina. The FDR (false discovery rate) of all RNA expressions in the microarray is less than 5%.

Breast cancer dataset with non-coding miRNA biomarkers

miRNAs correlate strongly with some cellular processes, such as proliferation, and this correlation has been studied in a breast cancer dataset [22]. The dataset has 799 miRNAs and 101 patient samples. Differential expression of miRNAs indicated different levels of proliferation corresponding to 6 intrinsic breast cancer subtypes: luminal A, luminal B, basal-like, normal-like, and ERBB2.
The original dataset has 101 samples: 41 are luminal A, 15 are basal-like, 10 are normal-like, 12 are luminal B, 17 are ERBB2, 2 have T35 mutation status, 1 has T35 wild-type mutation status, and 3 are not classified. GSE19536 is represented on two platforms: GPL8227, an Agilent-019118 Human miRNA Microarray 2.0 G4470B (miRNA ID version), and GPL6480, an Agilent-014850 Whole Human Genome Microarray 4x44K G4112F (Probe Name). For this paper, we only used the expressions from GPL8227.

Prostate cancer dataset of cancerous and normal samples with miRNA biomarkers

The CNAs (Copy Number Alterations) of some genes may be associated with the growth of prostate cancers [23]. In addition, changes have been discovered in fusion gene mutations, mRNA expression and pathways in a majority of primary prostate samples. The analysis was applied to four platforms and consists of 3 subseries, GSE21034, GSE21035 and GSE21036 [23]. For this paper, we only use GSE21036 for analysis. The microarray dataset has 142 samples, which include 114 primary prostate cancer samples and 28 normal cell samples. The platform is the Agilent-019118 Human miRNA Microarray 2.0 G4470B (miRNA ID version).

Results of AMFES

We apply AMFES to the prostate cancer (GSE18655), breast cancer (GSE19536) and second prostate cancer (GSE21036) datasets. For GSE18655, AMFES selects 96 biomarkers. The classification is performed in two steps. The first step classifies between grade 1 samples and samples of higher grades, and selects 93 biomarkers. The second step classifies between grade 2 and grade 3 samples, and selects 3 biomarkers. Thus, we can assume that these 96 biomarkers can classify among grade 1, grade 2 and grade 3 samples [6]. For GSE19536, AMFES also performs classification in two steps. In the first step, AMFES classifies between luminal and non-luminal samples and selects 47 biomarkers [6].
In the second step, AMFES further classifies the luminal samples into luminal A and luminal B and selects 27 biomarkers. For the non-luminal samples, AMFES classifies them into basal-like and normal-like samples and selects 25 biomarkers [6]. After removing duplicate biomarkers, AMFES has 72 biomarkers (47 + 27 − 2 duplicated) for classifying luminal samples and 68 (47 + 25 − 4 duplicated) for classifying non-luminal ones [6]. For GSE21036, AMFES simply selects 22 biomarkers for classifying cancerous and normal samples. Table 2 shows the number of selected genes. The complete lists of these biomarkers can be found in Additional file 1 GSE18655_96_Biomarkers.xlsx, Additional file 2 GSE19536_72_Biomarkers.xlsx, Additional file 3 GSE19536_68_Biomarkers.xlsx, and Additional file 4 GSE21036_22_Biomakers.xlsx.

Additional file 1: GSE18655_96_Biomarkers. An MS Office Excel file which contains a list of gene symbols of the 96 biomarkers of GSE18655 samples. File: 20439113216S1.xlsx

Additional file 2: GSE19536_72_Biomarkers. An MS Office Excel file which contains a list of gene symbols of the 72 biomarkers of GSE19536 luminal A and luminal B samples. File: 20439113216S2.xlsx

Additional file 3: GSE19536_68_Biomarkers. An MS Office Excel file which contains a list of gene symbols of the 68 biomarkers of GSE19536 basal-like and normal-like samples. File: 20439113216S3.xlsx

Additional file 4: GSE21036_22_Biomarkers. An MS Office Excel file which contains a list of gene symbols of the 22 biomarkers of GSE21036 samples.
File: 20439113216S4.xlsx

Table 2. Results of selected subsets of genes

| | Prostate Cancer (GSE18655) | Breast Cancer (GSE19536) | Breast Cancer (GSE19536) | Prostate Cancer (GSE21036) |
|---|---|---|---|---|
| Number of Biomarkers Selected | 96 | 72 | 68 | 22 |
| Variation of Samples | Grade1, Grade2, Grade3 | Luminal A, Luminal B | Basal-like, Normal-like | Cancerous, Normal |

We then apply the MI calculation described in the Mutual information section to the 96 biomarkers for GSE18655 and represent the pairwise MI values of grade 1, grade 2 and grade 3 samples in three 96 × 96 matrices, which can be found in Additional file 5 GSE18655 Grade1 MI.xlsx, Additional file 6 GSE18655 Grade2 MI.xlsx and Additional file 7 GSE18655 Grade3 MI.xlsx. We also represent the four MI matrices of the 72 and 68 biomarkers for GSE19536 in Additional file 8 GSE19536 LuminalA MI.xlsx, Additional file 9 GSE19536 LuminalB MI.xlsx, Additional file 10 GSE19536 BasalLike MI.xlsx, and Additional file 11 GSE19536 NormalLike MI.xlsx. The two MI matrices for GSE21036 are in Additional file 12 GSE21036 Cancer MI.xlsx and Additional file 13 GSE21036 Normal MI.xlsx.

Additional file 5: 18655 Grade1 MI. An MS Office Excel file which contains a matrix of the pairwise MI values of the 96 biomarkers of grade 1 samples. File: 20439113216S5.xlsx

Additional file 6: 18655 Grade2 MI. An MS Office Excel file which contains a matrix of the pairwise MI values of the 96 biomarkers of grade 2 samples. File: 20439113216S6.xlsx

Additional file 7: 18655 Grade3 MI. An MS Office Excel file which contains a matrix of the pairwise MI values of the 96 biomarkers of grade 3 samples. File: 20439113216S7.xlsx

Additional file 8: 19536 LuminalA MI. An MS Office Excel file which contains the pairwise MI values of the 72 biomarkers of luminal A samples. File: 20439113216S8.xlsx

Additional file 9: 19536 LuminalB MI. An MS Office Excel file which contains the pairwise MI values of the 72 biomarkers of luminal B samples.
File: 20439113216S9.xlsx

Additional file 10: 19536 BasalLike MI. An MS Office Excel file which contains the pairwise MI values of the 68 biomarkers of basal-like samples. File: 20439113216S10.xlsx

Additional file 11: 19536 NormalLike MI. An MS Office Excel file which contains the pairwise MI values of the 68 biomarkers of normal-like samples. File: 20439113216S11.xlsx

Additional file 12: 21036 Cancer MI. An MS Office Excel file which contains the pairwise MI values of the 22 biomarkers of cancerous samples. File: 20439113216S12.xlsx

Additional file 13: 21036 Normal MI. An MS Office Excel file which contains the pairwise MI values of the 22 biomarkers of normal samples. File: 20439113216S13.xlsx

We analyze these MI matrices and list the differences between them under different conditions in Table 3. For a given matrix, the first column in Table 3 gives the mean value; the second column gives the standard deviation; the third column gives the number of positive values in the matrix; the fourth column gives the number of negative values; the sixth column gives the minimum value and the seventh column gives the maximum. In the fifth column, we compare the MI matrices under two different conditions, such as luminal A vs. luminal B. If the signs of the two entries at the same position in the two matrices differ, we count one sign difference. The fifth column gives the number of sign differences between the samples compared. We use the same process to compare basal-like versus normal-like for GSE19536 and cancerous versus normal for GSE21036. To visualize the differences, we display the histograms of the MI values of grade 1, grade 2 and grade 3 samples in Figure 1. Figure 2 shows the histograms for luminal A versus luminal B samples, Figure 3 shows basal-like versus normal-like samples, and Figure 4 shows cancerous versus normal samples.
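The per-matrix statistics and the sign-difference count described above can be sketched as below. This is an illustrative sketch with names of our choosing (`summarize`, `sign_differences`); it assumes that every entry of the matrix is counted, that the standard deviation is the population form, and that a zero entry has a sign of its own, none of which the text states explicitly.

```python
import math


def summarize(matrix):
    """Mean, standard deviation, sign counts, minimum and maximum over all entries."""
    values = [v for row in matrix for v in row]
    n = len(values)
    mean = sum(values) / n
    # assumption: population standard deviation (divide by n)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return {
        "mean": mean,
        "std": std,
        "positive": sum(1 for v in values if v > 0),
        "negative": sum(1 for v in values if v < 0),
        "min": min(values),
        "max": max(values),
    }


def sign(v):
    """Return 1, -1 or 0 according to the sign of v."""
    return (v > 0) - (v < 0)


def sign_differences(a, b):
    """Count positions where two equally sized matrices have entries of different sign."""
    return sum(
        1
        for row_a, row_b in zip(a, b)
        for va, vb in zip(row_a, row_b)
        if sign(va) != sign(vb)
    )
```

For instance, `sign_differences(luminal_a, luminal_b)` on the two 72 × 72 matrices would yield the single count shared by that pair of rows, where `luminal_a` and `luminal_b` are hypothetical names for the loaded matrices.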
Table 3. Results of analysis of MI matrices

| | Mean value of MI | Standard deviation of MI | Num of positive values | Num of negative values | Num of values of different sign | Min value | Max value |
|---|---|---|---|---|---|---|---|
| GSE18655_grade1 | 0.00024 | 0.0015 | 6298 | 2918 | N/A | −0.0011 | 0.0858 |
| GSE18655_grade2 | 0.00020 | 0.0017 | 6468 | 2748 | | −0.0018 | 0.0949 |
| GSE18655_grade3 | 0.0004 | 0.0021 | 6650 | 2566 | | −0.0029 | 0.0582 |
| GSE19536_A(72) | 0.00036 | 0.0022 | 3912 | 1272 | 2052 | −0.0010 | 0.1293 |
| GSE19536_B(72) | 0.00053 | 0.0040 | 3388 | 1796 | | −0.0022 | 0.2279 |
| GSE19536_BasalLike(68) | 0.0017 | 0.0056 | 3491 | 998 | 1217 | −0.0033 | 0.1648 |
| GSE19536_NormalLike(68) | 0.0056 | 0.008 | 4200 | 420 | | −0.002 | 0.1279 |
| GSE21036_cancer | 0.0165 | 0.0212 | 10 | 474 | 56 | −0.002 | 0.1446 |
| GSE21036_norm | 0.0086 | 0.0146 | 46 | 438 | | −0.0015 | 0.1565 |

The number of values of different sign is a single value for each compared pair of matrices and is listed on the first row of the pair; N/A applies to all three GSE18655 rows.

Figure 1. Comparison of MI of the 96 biomarkers for grade 1, grade 2 and grade 3 samples.

Figure 2. Comparison of MI of the 72 biomarkers for luminal A and luminal B samples.

Figure 3. Comparison of MI of the 68 biomarkers for basal-like and normal-like samples.

Figure 4. Comparison of MI of the 22 biomarkers for prostate cancerous and normal samples.

For the fifth column, GSE18655 has three prostate grades, which cannot be fairly compared pairwise, so we skipped the comparison for it. In addition, because every histogram contains many MI entries, the figures show only the densest section of each histogram.

Results of calculating mutual information

The statistical results of calculating mutual information are shown in Table 3.

Synergistic therapy

Based on the interpretation of the network [4,5], we propose a framework that can help to elucidate the underlying interactions between multi-target biomarkers and multi-component drug agents.
The framework consists of three parts: selecting biomarkers of a complex disease such as cancer, building target networks of biomarkers, and forming interactions between biomarkers and drug agents to provide a personalized and synergistic therapy plan.

From the GEO datasets of cancers, we have discovered the genetic model of each cancer, called the signature of that particular cancer. The signatures (target networks) of different cancers may be quite different, which corresponds to the different biomarkers in Additional file 1 GSE18655_96_Biomarkers.xlsx, Additional file 2 GSE19536_72_Biomarkers.xlsx, Additional file 3 GSE19536_68_Biomarkers.xlsx, and Additional file 4 GSE21036_22_Biomakers.xlsx. For these different signatures, we would discover various synergistic mechanisms, which have been exemplified in [24].

Assume we would like to provide a synergistic therapy plan for a patient A. By collecting his or her bodily samples, such as saliva and blood, we first obtain the corresponding microarray dataset of patient A and apply it to the genetic model as shown in Figure 5.

Figure 5 Diagram of detailed process of building the genetic model.

A complete synergistic therapy should be able to select a small subset of biomarkers and correlate them with drug agents in a multitarget, multicomponent network approach as shown in Figure 6. In Figure 6, a disease is associated with several biomarkers, such as RNAs, miRNAs or proteins, denoted by R1, R2, R3, R4 and R5, which are the regulators for operons O1, O2, and O3. An operon is a basic unit of DNA formed by a group of genes controlled by a gene regulator. These operons initiate molecular mechanisms as promoters. The gene regulators can enable organs to regulate other genes either by induction or repression. Each target biomarker may have a list of pharmacons used as enzyme inhibitors.
Traditionally, pharmacons refer to biologically active substances, which are not limited to drug agents. For example, herbal extracts whose ingredients have a promising anti-AD (Alzheimer's Disease) effect can be used as pharmacons [24]. Meanwhile, the pharmacons denoted by D1, D2, and D3 have effects on some target biomarkers. For example, D1 affects target biomarker R3, D2 affects target biomarker R5 and D3 affects biomarker R1. Compared with the drug agent pair methodology [5], the proposed framework in Figure 6 represents a more accurate interpretation of biomarkers with multicomponent drug agents.

Figure 6 Relationships between biomarkers, pharmacons and operons where R1, R2, R3, R4 and R5 denote 5 biomarkers. Among all the biomarkers, R2, R3 and R5 are regulators.

Discussion
Among the MI values obtained, we see both positive and negative values. A positive value can represent attraction among the biomarkers, while a negative value may represent repulsion among the biomarkers, which matches the concept of Yin-Yang in TCM (Traditional Chinese Medicine). From these results, we observed that there is minimal difference in mutual information values between cancer stages. However, the difference in mean MI value between prostate cancer and normal cells is more obvious: the mean MI value of the cancerous samples in the last prostate cancer dataset is approximately twice that of the normal samples (0.0165 versus 0.0086 in Table 3). This may be of interest to medical researchers for further investigation.

Conclusions
We have presented a comprehensive approach to the diagnosis and therapy of complex diseases, such as cancer. A complete procedure is proposed for clinical application to cancer patients. While the genetic model provides a standard framework for designing synergistic therapy, the actual plan for an individual patient is personalized and flexible.
With careful monitoring, physicians may adaptively change or modify the therapy plan. Further analysis of this framework in clinical settings is still needed.

Competing interests
The authors declare that they have no competing interests.

Authors' contributions
WH, CL: Implementation of the project. FC, SC: Design of the project. All authors read and approved the final manuscript.

Acknowledgements
We are grateful to the reviewers for their valuable comments and suggestions. We are also grateful to Dr. John Harris for his encouragement of this research. We also thank Dr. LungJi Chang for his discussion and encouragement.

References
1. Zimmermann GR, Lehar J, Keith CT: Multitarget therapeutics: when the whole is greater than the sum of the parts. Drug discovery today 2007, 12(1–2):34–42. PMID 17198971.
2. Keith CT, Borisy AA, Stockwell BR: Multicomponent therapeutics for networked systems. Nat Rev Drug Discov 2005, 4(1):71–78. doi:10.1038/nrd1609. PMID 15688074.
3. Dancey JE, Chen HX: Strategies for optimizing combinations of molecularly targeted anticancer agents. Nature reviews Drug discovery 2006, 5(8):649–659. doi:10.1038/nrd2089.
4. Csermely P, Agoston V, Pongor S: The efficiency of multitarget drugs: the network approach might help drug design. Trends Pharmacol Sci 2005, 26(4):178–182. doi:10.1016/j.tips.2005.02.007. PMID 15808341.
5. Li S, Zhang B, Zhang N: Network target for screening synergistic drug combinations with application to traditional Chinese medicine. BMC Syst Biol 2011, 5(Suppl 1):S10. doi:10.1186/1752-0509-5-S1-S10. PMCID 3287565. PMID 22784616.
6. Hsu WC, Liu CC, Chang F, Chen SS: Feature Selection for Microarray Data Analysis: GEO & AMFES. Florida: Technical Report, Gainesville; 2012.
7. Guyon I, Weston J, Barnhill S, Vapnik V: Gene Selection for Cancer Classification using Support Vector Machines. Mach Learn 2002, 46(1–3):389–422.
8. Rakotomamonjy A: Variable selection using svm based criteria. J Mach Learn Res 2003, 3:1357–1370.
9. Bi J, Bennett K, Embrechts M, Breneman C, Song M: Dimensionality reduction via sparse support vector machines. J Mach Learn Res 2003, 3:1229–1243.
10. Stoppiglia H, Dreyfus G, Dubois R, Oussar Y: Ranking a random feature for variable and feature selection. J Mach Learn Res 2008, 3:1399–1414.
11. Tuv E, Borisov A, Torkkola K: Feature Selection Using Ensemble Based Ranking Against Artificial Contrasts. Neural Networks, 2006 IJCNN '06 International Joint Conference on 2006, 2181–2186.
12. Shannon CE: A mathematical theory of communication. SIGMOBILE Mob Comput Commun Rev 2001, 5(1):3–55. doi:10.1145/584091.584093.
13. Qiu P, Gentles AJ, Plevritis SK: Fast calculation of pairwise mutual information for gene regulatory network reconstruction. Comp Methods and Programs in Biomed 2009, 94(2):177–180. doi:10.1016/j.cmpb.2008.11.003.
14. Beirlant J, Dudewicz E, Györfi L, Meulen ECVD: Nonparametric entropy estimation: An overview. Int J Math Stat Sci 1997, 6(1):17–39.
15. Margolin A, Nemenman I, Basso K, Wiggins C, Stolovitzky G, Favera R, Califano A: ARACNE: An Algorithm for the Reconstruction of Gene Regulatory Networks in a Mammalian Cellular Context. BMC Bioinforma 2006, 7(Suppl 1):S7. doi:10.1186/1471-2105-7-S1-S7.
16. Michael EW, Monica SL: A data locality optimizing algorithm. 1991.
17. Fitzgerald JB, Schoeberl B, Nielsen UB, Sorger PK: Systems biology and combination therapy in the quest for clinical efficacy. Nat Chem Biol 2006, 2(9):458–466. doi:10.1038/nchembio817. PMID 16921358.
18. Page L, Brin S, Motwani R, Winograd T: The PageRank Citation Ranking: Bringing Order to the Web. Stanford InfoLab; 1999.
19. van Driel MA, Bruggeman J, Vriend G, Brunner HG, Leunissen JA: A text-mining analysis of the human phenome. Eur J Human Genet: EJHG 2006, 14(5):535–542. doi:10.1038/sj.ejhg.5201585.
20. Sobin LH, Wittekind C: TNM: classification of malignant tumours. New York: Wiley-Liss; 2002.
21. Barwick BG, Abramovitz M, Kodani M, Moreno CS, Nam R, Tang W, Bouzyk M, Seth A, Leyland-Jones B: Prostate cancer genes associated with TMPRSS2-ERG gene fusion and prognostic of biochemical recurrence in multiple cohorts. Br J Cancer 2010, 102(3):570–576. doi:10.1038/sj.bjc.6605519. PMCID 2822948. PMID 20068566.
22. Enerly E, Steinfeld I, Kleivi K, Leivonen SK, Aure MR, Russnes HG, Ronneberg JA, Johnsen H, Navon R, Rodland E, et al: miRNA-mRNA Integrated Analysis Reveals Roles for miRNAs in Primary Breast Tumors. PLoS One 2011, 6(2):e16915. doi:10.1371/journal.pone.0016915. PMCID 3043070. PMID 21364938.
23. Taylor BS, Schultz N, Hieronymus H, Gopalan A, Xiao Y, Carver BS, Arora VK, Kaushik P, Cerami E, Reva B: Integrative genomic profiling of human prostate cancer. Cancer cell 2010, 18(1):11–22. doi:10.1016/j.ccr.2010.05.026. PMCID 3198787. PMID 20579941.
24. Sun Y, Zhu R, Ye H, Tang K, Zhao J, Chen Y, Liu Q, Cao Z: Towards a bioinformatics analysis of anti-Alzheimer's herbal medicines from a target network perspective. Briefings in bioinformatics 2012.
RESEARCH Open Access

Cancer classification: Mutual information, target network and strategies of therapy
WenChin Hsu1,2, ChanCheng Liu4, Fu Chang4 and SuShing Chen1,3*

Abstract
Background: Cancer therapy is a challenging research area because side effects often occur in chemo and radiation therapy. We intend to study a multitarget and multicomponent design that will provide synergistic results to improve the efficiency of cancer therapy.
Methods: We have developed a general methodology, AMFES (Adaptive Multiple FEature Selection), for ranking and selecting important cancer biomarkers based on SVM (Support Vector Machine) classification. In particular, we exemplify this method with three datasets: a prostate cancer dataset (three stages), a breast cancer dataset (four subtypes), and another prostate cancer dataset (normal vs. cancerous). Moreover, we have computed the target networks of these biomarkers as the signatures of the cancers with additional information (mutual information between biomarkers of the network). Then, we proposed a robust framework for a synergistic therapy design approach which includes various existing mechanisms.
Results: These methodologies were applied to three GEO datasets: GSE18655 (three prostate stages), GSE19536 (4 subtypes of breast cancer) and GSE21036 (prostate cancer cells and normal cells). We selected 96 biomarkers for the first prostate cancer dataset (three prostate stages), 72 for breast cancer (luminal A vs. luminal B), 68 for breast cancer (basal-like vs. normal-like), and 22 for the other prostate cancer dataset (cancerous vs. normal). In addition, we obtained statistically significant results of mutual information, which demonstrate that the dependencies among these biomarkers can be positive or negative.
Conclusions: We proposed an efficient feature ranking and selection scheme, AMFES, to select an important subset from a large number of features for any cancer dataset. Thus, we obtained the signatures of these cancers by building their target networks. Finally, we proposed a robust framework of synergistic therapy for cancer patients. Our framework is not only supported by real GEO datasets but also aims at a multitarget/multicomponent drug design tool, which improves on the traditional single-target/single-component analysis methods. This framework builds a computational foundation which can provide a clear classification of cancers and lead to an efficient cancer therapy.
Keywords: Feature selection, Biomarkers, Microarray, Therapy design, Target network

Background
Cancer therapy is a difficult research area due to its level of complexity. Lately, the mere superposition of single drugs has been found to generate side effects and crosstalk with other drugs, which may cancel out the final success of treatments. Thus, current research focuses on measuring the drug treatments as a whole rather than considering them individually [1,2]. Later, a synergistic concept was proposed to evaluate drug treatments [3]. However, evaluations are still based on cases and do not follow a systematic approach. In [4], a network methodology was first used to evaluate the efficiency of drug treatments. Li et al. then use a parameter, namely an SS (Synergy Score), to introduce the topology factor of the network based on the disease and the drug agent combination [5].

Our approach is first to build a more precise target network from the biomarkers selected by AMFES [6]. Then, we identify the intrinsic properties by computing the mutual information of the interactions among these biomarkers. Our approach aims to improve Li's results by considering the mutual information in the target network. We also provide a general framework of synergistic therapy, which may include several different approaches.

*Correspondence: suchen@cise.ufl.edu
1System Biology Lab, University of Florida, Florida, USA
3Department of Computer and Information Science and Engineering, University of Florida, Florida, USA
Full list of author information is available at the end of the article
JOURNAL OF CLINICAL BIOINFORMATICS
2012 Hsu et al.; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Hsu et al. Journal of Clinical Bioinformatics 2012, 2:16 http://www.jclinbioinformatics.com/content/2/1/16

Methods
AMFES
The COD (Curse of Dimensionality) has been a major challenge of microarray data analysis due to the large number of genes (features) and the relatively small number of samples (patterns). To tackle this problem, many gene selection methodologies were developed to select only significant subsets of genes in a microarray dataset. AMFES selects an optimal subset of genes by training an SVM with subsets of genes generated adaptively [6].

When AMFES runs on a dataset, all samples are randomly divided into a training subset S and a testing subset T at a heuristic ratio of 5:1. S is used for ranking and selecting genes and for constructing a classifier out of the selected genes.
T is used for computing test accuracy. When a training subset S is given, we extract r training-validation pairs from S according to the heuristic rule r = max(5, (int)(500/n + 0.5)), where n is the number of samples in S. Each pair randomly divides S into a training component and a validation component at a ratio of 4:1. The heuristic ratio and rule were chosen from experimental experience as a balance between time consumption and performance. Basically, AMFES has two fundamental processes, ranking and selection. We first explain each process in detail and then the integrated version at the end.

Ranking
The gene ranking process contains a few ranking stages. At the first stage, all genes are ranked by their ranking scores in descending order. Then, in the next stage, only the top half of the ranked genes are ranked again, while the bottom half holds its current order in the subsequent stages. The same iteration repeats recursively until only three genes remain to be ranked again, which completes one ranking process. Assume that at a given ranking stage there are k genes indexed from 1 to k. To rank these k genes, we follow the 4 steps below.
(I) We first generate m independent subsets S_1 ... S_m. Each subset S_i, i = 1, 2 ... m, has j genes which are selected randomly and independently from the k genes, where j = (int)(k/2).
(II) Let C_i be the SVM classifier that is trained on each subset of genes, i = 1, 2 ... m. For each gene of the k genes, we compute the ranking score θ_m(g) of the gene g, as in equation (1).
(III) We use the average weight of the gene g, i.e. the summation of the weights of g in the m subsets divided by the number of subsets for which g is randomly selected. This increases the robustness in representing the true classifying ability of the gene g.
(IV) Rank the k genes in descending order by their ranking scores.

$$\theta_m(g) = \frac{\sum_{i=1}^{m} I_{\{g \in S_i\}}\,\mathrm{weight}_i(g)}{\sum_{i=1}^{m} I_{\{g \in S_i\}}} \tag{1}$$

where I is an indicator function such that I_{proposition} = 1 if the proposition is true; otherwise, I_{proposition} = 0. In other words, if gene g is randomly selected for the subset S_i, it is denoted as g ∈ S_i and I_{g ∈ S_i} = 1.

We denote the objective function of C_i as obj_i(v_1, v_2, ..., v_s), where v_1, v_2, ..., v_s are the support vectors of C_i. The weight_i(g) is then defined as the change in the objective function due to g, i.e.,

$$\mathrm{weight}_i(g) = \mathrm{obj}_i(v_1, v_2, \ldots, v_s) - \mathrm{obj}_i\big(v_1^{(g)}, v_2^{(g)}, \ldots, v_s^{(g)}\big) \tag{2}$$

[6][7,8]. Note that if v is a vector, v^(g) is the vector obtained by dropping gene g from v. Let θ_m be a vector comprising the ranking scores derived from the m gene subsets generated thus far and θ_{m−1} the vector at the previous stage. The m value is determined when θ_m satisfies equation (3), by adding a gene to an empty subset one at a time.

$$\frac{\lVert \theta_{m-1} - \theta_m \rVert^2}{\lVert \theta_{m-1} \rVert^2} < 0.01 \tag{3}$$

where ‖θ‖ is understood as the Euclidean norm of vector θ. The pseudocode of the ranking process is shown below.

Pseudocodes for ranking process of AMFES
RANK SUBROUTINE
INPUT: a subset of k genes to be ranked
Generate k artificial genes and put them next to the original genes
Pick an initial tentative value of m
DO WHILE m does not satisfy equation (3)
    FOR each subset S_i of m subsets
        Randomly select j elements from k genes to form the subset S_i.
        Train an SVM to get weight_i(g) for each gene in the subset
    END FOR
    FOR each gene of k genes
        Compute the average score of the gene from m subsets
    END FOR
    List k genes in descending order by their ranking scores
END DO
OUTPUT: a ranked k genes

Selection
Ranking artificial features together with original features has been demonstrated to be a useful tool for distinguishing relevant features from irrelevant ones, as in [9–11]. In our selection process, we also use this technique to find the optimal subset of genes.

Assume a set of genes is given. We generate artificial genes and rank them together with the original ones. After finishing ranking the set, we assign a gene index to each original gene by the proportion of artificial ones that are ranked above it, where the gene index is a real number between 0 and 1. Then, we generate a few subset candidates from which the optimal subset is chosen. Let p_1, p_2, ... be the sequence of subset indices of the candidates with p_1 < p_2 < ..., where p_i = i × 0.005 and i = 1, 2, ... 200. Let B(p_i) denote the corresponding subset of subset index p_i; it contains the original genes whose indices are smaller than or equal to p_i.
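A few of the quantities defined so far can be sketched in code: the heuristic number of training-validation pairs r, the averaged ranking score of equation (1), the gene index derived from artificial genes, and the candidate subset B(p). This is a minimal illustration under stated assumptions, not the AMFES implementation: all function names and the toy gene names are hypothetical, the per-subset weights are given directly instead of being obtained from a trained SVM, and a gene that is never drawn is assigned a score of 0.0, a case the text leaves undefined.

```python
def num_pairs(n):
    """Heuristic rule r = max(5, (int)(500/n + 0.5)) for n training samples."""
    return max(5, int(500 / n + 0.5))


def ranking_scores(genes, subsets, weights):
    """Equation (1): mean weight of each gene over the subsets that contain it."""
    scores = {}
    for g in genes:
        picked = [w[g] for s, w in zip(subsets, weights) if g in s]
        # Assumption: a gene that was never drawn gets 0.0 (0/0 in the formula).
        scores[g] = sum(picked) / len(picked) if picked else 0.0
    return scores


def gene_indices(ranked, artificial):
    """Map each original gene to the fraction of artificial genes ranked above it."""
    total = sum(1 for g in ranked if g in artificial)
    seen, index = 0, {}
    for g in ranked:  # ranked is ordered best-first
        if g in artificial:
            seen += 1
        else:
            index[g] = seen / total
    return index


def candidate_subset(index, p):
    """B(p): original genes whose gene index is smaller than or equal to p."""
    return [g for g, value in index.items() if value <= p]


r_small = num_pairs(50)    # 500/50 + 0.5 = 10.5, truncated to 10
r_large = num_pairs(116)   # 500/116 + 0.5 is below 5, so the floor of 5 applies

scores = ranking_scores(
    ["g1", "g2", "g3"],
    subsets=[{"g1", "g2"}, {"g1", "g3"}],
    weights=[{"g1": 0.4, "g2": 0.1}, {"g1": 0.2, "g3": 0.3}],
)

# Hypothetical ranking: g* are original genes, a* are artificial genes.
ranked = ["g3", "g1", "a1", "g2", "a2", "g4"]
index = gene_indices(ranked, artificial={"a1", "a2"})
subset_indices = [i / 200 for i in range(1, 201)]  # p_i = i * 0.005
smallest = candidate_subset(index, subset_indices[0])
largest = candidate_subset(index, subset_indices[-1])
```

Writing p_i as i / 200 rather than i * 0.005 is a small design choice that keeps the last candidate index exactly equal to 1.0 in floating point.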
Then we train an SVM on every B(p_i) and compute its validation accuracy v(p_i). We stop at the first p_k at which v(p_k) ≥ v_baseline and v(p_k) ≥ v(p_l) for k ≤ l ≤ k+10, where v_baseline is the validation accuracy rate of the SVM trained on the baseline, i.e., the case in which all features are involved in training. The final result, B(p_k), is then the optimal subset for the given set of genes. The pseudocode for the selection process of AMFES is listed below.

Pseudocodes for selection process of AMFES
SELECTION SUBROUTINE
INPUT: a few subsets with their validation accuracies, av(p_i)
Compute the validation accuracy of all genes, v_baseline.
FOR each subset given
    IF v(p_k) ≥ v_baseline and v(p_k) ≥ v(p_l) for k ≤ l ≤ k+10 THEN
        Resulted subset is B(p_k)
    END IF
END FOR
OUTPUT: B(p_k)

Integrated version
The ranking and selection processes from the previous sections are for one training-validation pair. To increase the reliability of validation, we generate r pairs to find the optimal subset. We calculate the validation accuracy of the qth pair for all p_qi subsets, where q denotes the pair index and i denotes the subset index. Then, we compute av(p_i), the average of v(p_qi) over the r training-validation pairs, and perform the subset search as explained in the selection section on av(p_i) to find the optimal p_i, denoted as p*. However, p* does not correspond to a unique subset, since each pair has its own B(p*) and they can all be different. Thus, we adopt all samples of S as training samples in order to find a unique subset. We generate artificial genes and rank them together with the original genes. Finally, we select the original genes whose indices are smaller than or equal to p* as the genes we select for S. The integrated version of the process is shown below. In the pseudocode below, the AMFES ALGORITHM represents the integrated version of the whole process, while RANK SUBROUTINE represents the ranking process and SELECTION SUBROUTINE represents the selection process.

Pseudocodes for integrated version of AMFES
AMFES ALGORITHM Integrated Version
INPUT: a dataset
Divide a dataset into train samples and test samples.
Divide the train samples into r training-validation components pairs
FOR each pair of r train-validation components pairs
    Generate 200 candidate subsets p_qi
    FOR each subset of 200 subsets
        CALL RANK subroutine to rank each subset.
        Assign each original gene a gene index
        Train each subset on an SVM and compute corresponding validation accuracy, v(p_qi), for the subset
    END FOR
END FOR
FOR each subset of 200 subsets
    Compute average validation rate, av(p_i), of the subset from r pairs.
END FOR
CALL SELECTION subroutine to search for the optimal subset by its average validation rate and denote it as p*
CALL RANK subroutine to rank original genes again and select original genes which belong to the subset B(p*).
OUTPUT: an optimal subset of genes B(p*)

Mutual information
Mutual information has been used to measure the dependency between two random variables based on their probability distributions. Given two random variables X and Y, the mutual information of X and Y, I(X;Y), can be expressed by these equivalent equations [12]:

$$I(X;Y) = H(X) - H(X \mid Y) \tag{4}$$
$$\phantom{I(X;Y)} = H(Y) - H(Y \mid X) \tag{5}$$
$$\phantom{I(X;Y)} = H(X) + H(Y) - H(X,Y) \tag{6}$$

where H(X) and H(Y) denote marginal entropies, H(X|Y) and H(Y|X) denote conditional entropies and H(X,Y) denotes the joint entropy of X and Y. To compute entropy, the probability distribution functions of the random variables must be calculated first. Because gene expressions are usually continuous numbers, we used kernel estimation to calculate the probability distribution [13].

Assume the two random variables X and Y are continuous. The mutual information is defined as [12]:

$$I(X;Y) = \iint f(x,y)\,\log\frac{f(x,y)}{f(x)\,f(y)}\,dx\,dy \tag{7}$$

where f(x,y) denotes the joint probability distribution, and f(x) and f(y) denote the marginal probability distributions of X and Y. By using Gaussian kernel estimation, f(x,y), f(x) and f(y) can be further represented as the equations below [14]:

$$f(x,y) = \frac{1}{M}\sum_{u=1}^{M}\frac{1}{2\pi h^2}\,e^{-\frac{1}{2h^2}\left((x-x_u)^2+(y-y_u)^2\right)} \tag{8}$$

$$f(x) = \frac{1}{M}\sum_{u=1}^{M}\frac{1}{\sqrt{2\pi h^2}}\,e^{-\frac{1}{2h^2}(x-x_u)^2} \tag{9}$$

where M represents the number of samples for both X and Y, u is the index of samples, u = 1, 2, ... M, and h is a parameter controlling the width of the kernels. Thus, the mutual information I(X;Y) can then be represented as:

$$I(X;Y) = \frac{1}{M}\sum_{w=1}^{M}\log\frac{M\sum_{u=1}^{M} e^{-\frac{1}{2h^2}\left((x_w-x_u)^2+(y_w-y_u)^2\right)}}{\sum_{u=1}^{M} e^{-\frac{1}{2h^2}(x_w-x_u)^2}\;\sum_{u=1}^{M} e^{-\frac{1}{2h^2}(y_w-y_u)^2}} \tag{10}$$

where both w and u are indices of samples, w, u = 1, 2, ... M.

Computing mutual information for every pair of genes in a microarray dataset usually involves nested loops, which take a substantial amount of time. Assume a dataset has N genes and each gene has M samples. To calculate the pairwise mutual information values, the computation usually first finds the kernel distance between any two samples for a given gene. Then, the same process goes through every pair of genes in the dataset. To make the computation efficient, two improvements are applied [13]. The first one is to calculate the marginal probability of each gene in advance and use it repeatedly during the process [13][15]. The second improvement is to move the summation over each sample pair for a given gene to the outermost for-loop rather than inside a nested for-loop for every pair of genes. As a result, the kernel distance between two samples is only calculated twice instead of N times, which saves a lot of computation time. LNO (Loop Nest Optimization), which changes the order of nested loops, is a common time-saving technique in computer science [16].

Target network
The effect of drugs with multiple components should be viewed as a whole rather than as a superposition of individual components [1][2]. Thus, a synergistic concept was formed and is considered an efficient way to design a drug [3]. In [17], mathematical models are used to measure the effect generated by the multiple components. However, that work does not consider practical situations such as crosstalk between pathways. A network approach then started to be used to analyze the interactions among multiple components [4]. Initiated by the work in [4], another system biological methodology, NIMS (Network-target-based Identification of Multicomponent Synergy), was proposed to measure the effect of drug agent pairs depending on their gene expression data [5]. NIMS focuses on ranking the drug agent pairs of Chinese Medicine components by their SS.
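Returning to the kernel-based estimate of equation (10) in the Mutual information section, the computation can be sketched as below. This is an illustrative sketch, not the authors' program: the function names, the kernel width h = 0.3 and the toy expression values are assumptions. Each gene's kernel matrix and marginal sums are computed once and reused for every gene pair, which follows the spirit of the first improvement described above; it does not reproduce the exact loop reordering of [13].

```python
import math


def kernel_matrix(values, h):
    """Gaussian kernel value between every pair of samples of one gene."""
    return [[math.exp(-((a - b) ** 2) / (2 * h * h)) for b in values]
            for a in values]


def pairwise_mi(expr, h=0.3):
    """Equation (10) for every pair of genes.

    expr is a list of N genes, each a list of M expression values.
    Returns a symmetric N x N list of lists.
    """
    n, m = len(expr), len(expr[0])
    kernels = [kernel_matrix(gene, h) for gene in expr]       # once per gene
    marginals = [[sum(row) for row in k] for k in kernels]    # reused per pair
    mi = [[0.0] * n for _ in range(n)]
    for a in range(n):
        for b in range(a, n):
            total = 0.0
            for w in range(m):
                # exp(-dx^2/2h^2) * exp(-dy^2/2h^2) equals the joint kernel.
                joint = sum(kernels[a][w][u] * kernels[b][w][u]
                            for u in range(m))
                total += math.log(m * joint
                                  / (marginals[a][w] * marginals[b][w]))
            mi[a][b] = mi[b][a] = total / m
    return mi


# Toy data (hypothetical): y depends on x, z is constant across samples.
x = [0.1, 0.4, 0.5, 0.9, 1.3]
y = [2 * v for v in x]
z = [1.0] * 5
mi = pairwise_mi([x, y, z])
```

With this estimator a constant gene carries no information about any other gene, while x and y share a positive estimate. In general the estimate is not guaranteed to be non-negative, which is consistent with the negative entries reported for the MI matrices.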
In [5], a drug component is denoted as a drug agent, and the set of genes associated with it are denoted as the agent genes of the drug agent. For a given disease, assume there are N drug agents, where N = 1, 2 ... n. Initially, NIMS randomly chooses two drug agents from N, A1 and A2, and builds a background target network from their agent genes in a graph. From the graph, NIMS calculates the TS (Topology Score) of the graph by applying PCA (Principal Component Analysis) to form an IP value which integrates betweenness, closeness and a variant of eigenvalue PageRank [18]. The TS is used to evaluate the topological significance of the target network for the drug agent pair, A1 and A2, and is defined as

$$TS_{1,2} = \frac{1}{2}\left[\frac{\sum_i IP_1(i)\,\exp\!\big(-\min(d_{i,j})\big)}{\sum_i IP_1(i)} + \frac{\sum_j IP_2(j)\,\exp\!\big(-\min(d_{j,i})\big)}{\sum_j IP_2(j)}\right] \tag{11}$$

where IP_1 and IP_2 denote the IP values for drug agents A1 and A2, min(d_{i,j}) denotes the minimum shortest path from gene i of A1 to all genes of A2, and min(d_{j,i}) denotes the one from gene j of A2 to all genes of A1.

NIMS defines another term, AS (Agent Score), to evaluate the similarity of a disease phenotype for a drug agent. For a given drug agent, if one of its agent genes has a phenotype record in the OMIM (Online Mendelian Inheritance in Man) database, the drug agent has that phenotype as one of its phenotypes. The similarity score of a drug agent pair is defined as the cosine value of the pair's feature vector angle [19]. The AS is defined as:

$$AS_{1,2} = \frac{\sum_{i,j} P_{i,j}}{M} \tag{12}$$

where P_{i,j} denotes the similarity score of the ith phenotype of A1 and the jth phenotype of A2 and M denotes the total number of phenotypes.
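Equations (11) and (12) can be illustrated with a small sketch. It is not the NIMS implementation: the function names, the toy IP values, the shortest-path lengths and the similarity scores are all hypothetical, the IP values are supplied directly rather than derived by PCA, and M is passed in as a parameter because its exact definition is taken from the text as given. The final product corresponds to the Synergy Score, which [5] defines as TS multiplied by AS.

```python
import math


def topology_score(ip1, ip2, dist):
    """Equation (11). ip1 and ip2 map agent genes to IP values;
    dist[a][b] is the shortest-path length between genes a and b."""
    def one_side(ip_from, ip_to):
        weighted = sum(
            ip * math.exp(-min(dist[g][t] for t in ip_to))
            for g, ip in ip_from.items()
        )
        return weighted / sum(ip_from.values())

    return 0.5 * (one_side(ip1, ip2) + one_side(ip2, ip1))


def agent_score(similarity, m):
    """Equation (12): sum of the phenotype similarity scores P[i][j] over M."""
    return sum(sum(row) for row in similarity) / m


# Toy inputs (hypothetical): agent A1 has genes a and b, agent A2 has gene c.
ip_a1 = {"a": 0.6, "b": 0.4}
ip_a2 = {"c": 1.0}
dist = {"a": {"c": 1}, "b": {"c": 2}, "c": {"a": 1, "b": 2}}

ts = topology_score(ip_a1, ip_a2, dist)
as_score = agent_score([[0.5, 0.25]], m=2)
ss = ts * as_score  # Synergy Score as the product of TS and AS
```

Short paths between the agent genes of the two drugs push TS towards 1, and long paths push it towards 0, so closely connected agent pairs rank higher.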
The SS of the pair is then defined as the product of TS and AS. NIMS calculates the SS for all possible drug agent pairs for a disease and can then find potential drug agent pairs after ranking them by SS.

Results
Microarray data description
We give a brief description of the three datasets in Table 1, which lists the number of biomarkers, the types of biomarkers, the number of samples and the variation of samples used.

The prostate cancer dataset with RNA biomarkers
In order to give a better prognosis, pathologists have used a cancer stage to measure cell tissues and tumor aggressiveness as an indicator for doctors to choose a suitable treatment. The most widely used cancer staging system is the TNM (Tumor, Node, and Metastasis) system [20]. Depending on the level of differentiation between normal and tumor cells, a different histologic grade is given. Grade 1 indicates almost normal tissues, grade 2 indicates somewhat normal tissues and grade 3 indicates tissues far from normal conditions. Although most cancers can be adapted to the TNM grading system, some specific cancers require additional grading systems for pathologists to better interpret tumors.

The Gleason Grading System is used especially for prostate cancers, and a GS (Gleason Score) is given based on the cellular contents and tissues of cancer biopsies from patients. The higher the GS, the worse the prognosis. The prostate cancer dataset, GSE18655, includes 139 patients with 502 molecular markers, RNAs [21]. In [21], it was shown that prostate tumors with the gene fusion TMPRSS2:ERG T1/E4 have a higher risk of recurrence than tumors without the gene fusion. The 139 samples were fresh-frozen prostate tumor tissues of patients after radical prostatectomy surgery. All samples were taken from the patients' prostates at the time of prostatectomy, and liquid nitrogen was used to freeze the middle sections of the prostates at extremely low temperature. Among these patients, 38 samples have GS 5–6 corresponding to histologic grade 1, 90 samples have GS 7 corresponding to histologic grade 2 and 11 samples have GS 8–9 corresponding to histologic grade 3.
TheplatformusedforthedatasetsisGPL5858,DASL (cDNAmediated,annealing,selection,extensionand ligation)HumanCancerPanelbyGenemanufacturedby Illumina.TheFDR(falsediscoveryrate)ofallRNAs expressionsinthemicroarrayislessthan5%. Table1Descriptionsof3datasets:GSE18655(prostatecancer),GSE19536(breastcancer)andGSE21036(prostate cancer)ProstateCancer(GSE18655)BreastCancer(GSE19536)ProstateCancer(GSE21036) NumberofBiomarkers502489373 TypeofBiomarkersRNAsmiRNAsmiRNAs NumberofSamples13978142 VariationofSamplesGrade1(38),Grade2(90), Grade3(11) LuminalA(41),LuminalB(12), Basallike(15),Normallike(10) Cancerous(114),Normal(28) Hsu etal.JournalofClinicalBioinformatics 2012, 2 :16 Page5of11 http://www.jclinbioinformatics.com/content/2/1/16 PAGE 6 BreastcancerdatasetwithNoncodingmiRNAbiomarkersThemiRNAshavestrongcorrelationwithsomecellular processes,suchasproliferation,whichhasbeenusedas abreastcancerdataset[22].Ithas799miRNAsand101 patients samples.DifferentialexpressionsofmiRNAs indicateddifferentlevelofproliferationscorresponding to6intrinsicbreastcancersubtypes:luminalA,luminal B,basallike,normallike,andERBB2.Theoriginaldatasethas101samplesandamongthem,41samplesareluminalA,15samplesarebasallike,10samplesare normallike,12samplesareluminalB,17samplesare ERBB2,2sampleshaveT35mutationstatus,another samplehasT35widetypemutationand3samplesare notclassified.GSE19536wasrepresentedintwoplatformsGPL8227,anAgilient09118HumanmiRNA microarray2.0G4470B(miRNAIDversion)andthe GPL6480,anAgilent014850wholeHumanGenome Microarray4x44kG4112F(ProbeName).Forthispaper, weonlyusedtheexpressionsfromGPL8227.Prostatecancerdatasetofcancerousandnormalsamples withmiRNAbiomarkersTheCNAs(CopyNumberAlterations)ofsomegenes mayassociatewithgrowthofprostatecancers[23].In addition,somechangesarediscoveredinmutationsof fusiongene,mRNAexpressionsandpathwaysinamajorityofprimaryprostatesamples.Theanalysiswasappliedtofourplatformsandconsistsof3subseries, GSE21034,GSE21035andGSE21036[23].Forthis 
paper, we only use GSE21036 for analysis. The microarray dataset has 142 samples, which include 114 primary prostate cancer samples and 28 normal cell samples. The platform is the Agilent-019118 Human miRNA Microarray 2.0 G4470B (miRNA ID version).

Results of AMFES

We apply AMFES to the prostate cancer (GSE18655), breast cancer (GSE19536) and second prostate cancer (GSE21036) datasets. For GSE18655, AMFES selects 96 biomarkers. The classification is performed in two steps. The first step classifies between grade 1 samples and samples of higher grade, and selects 93 biomarkers. In the second step, AMFES classifies between grade 2 and grade 3 samples and selects 3 biomarkers. Thus, we assume that these 96 biomarkers can classify among grade 1, grade 2 and grade 3 samples [6]. For GSE19536, AMFES also performs classification in two steps. In the first step, AMFES classifies between luminal and non-luminal samples and selects 47 biomarkers [6]. In the second step, AMFES further classifies the luminal samples into luminal A and luminal B and selects 27 biomarkers. It likewise classifies the non-luminal samples into basal-like and normal-like samples and selects 25 biomarkers [6]. After removing duplicate biomarkers, AMFES has 72 biomarkers (47 + 27 - 2 duplicated) for classifying luminal samples and 68 (47 + 25 - 4 duplicated) for classifying non-luminal ones [6]. For GSE21036, AMFES simply selects 22 biomarkers for classifying cancerous and normal samples. Table 2 shows the number of selected genes. The complete lists of these biomarkers can be found in Additional file 1: GSE18655_96_Biomarkers.xlsx, Additional file 2: GSE19536_72_Biomarkers.xlsx, Additional file 3: GSE19536_68_Biomarkers.xlsx, and Additional file 4: GSE21036_22_Biomakers.xlsx.

Table 2 Results of selected subsets of genes

                                 Prostate Cancer (GSE18655)   Breast Cancer (GSE19536)   Breast Cancer (GSE19536)   Prostate Cancer (GSE21036)
Number of biomarkers selected    96                           72                         68                         22
Variation of samples             Grade 1, Grade 2, Grade 3    Luminal A, Luminal B       Basal-like, Normal-like    Cancerous, Normal

We then apply the MI calculation described in the Mutual Information section to the 96 biomarkers for GSE18655 and represent the pairwise MI values of the grade 1, grade 2 and grade 3 samples in three 96 × 96 matrices, which can be found in Additional file 5: GSE18655Grade1MI.xlsx, Additional file 6: GSE18655Grade2MI.xlsx and Additional file 7: GSE18655Grade3MI.xlsx. We also represent the four MI matrices of the 72 and 68 biomarkers for GSE19536 in Additional file 8: GSE19536LuminalAMI.xlsx, Additional file 9: GSE19536LuminalBMI.xlsx, Additional file 10: GSE19536BasalLikeMI.xlsx, and Additional file 11: GSE19536NormalLikeMI.xlsx. The two MI matrices for GSE21036 are in Additional file 12: GSE21036CancerMI.xlsx and Additional file 13: GSE21036NormalMI.xlsx.

We analyze these MI matrices and list the differences between them under different conditions in Table 3. For a given matrix, the first column in Table 3 gives the mean value; the second column gives the standard deviation; the third column shows the number of positive values in the matrix; the fourth column shows the number of negative values; the sixth column shows the minimum value and the seventh column shows the maximum. In the fifth column, we compare the MI matrices under two different conditions, such as luminal A vs. luminal B. If the signs of the two entries at the same position in the two matrices differ, we count it as one sign difference. The fifth column gives the number of sign differences between the samples compared. We use the same process to compare basal-like versus normal-like for GSE19536 and cancerous versus normal for GSE21036. To visualize the differences, we display the histograms of MI values of the grade 1, grade 2 and grade 3 samples in Figure 1. Figure 2 shows the histograms for luminal A versus luminal B samples. Figure 3 shows basal-like versus normal-like samples and Figure 4 shows cancerous versus normal samples.
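The per-matrix summary and the sign-difference count described for Table 3 can be sketched as follows. This is a minimal illustration rather than the authors' code: the function names and the toy matrices are hypothetical, the population standard deviation is an assumption (the text does not say which estimator was used), and zero entries are treated as neither positive nor negative.

```python
from statistics import mean, pstdev


def summarize(matrix):
    """Return the Table 3 style summary of one MI matrix (list of rows)."""
    values = [v for row in matrix for v in row]
    return {
        "mean": mean(values),
        "std": pstdev(values),  # assumption: population standard deviation
        "num_positive": sum(v > 0 for v in values),
        "num_negative": sum(v < 0 for v in values),
        "min": min(values),
        "max": max(values),
    }


def sign(v):
    # -1 for negative, 0 for zero, +1 for positive
    return (v > 0) - (v < 0)


def count_sign_differences(a, b):
    """Count positions at which two equally shaped matrices differ in sign."""
    if len(a) != len(b) or any(len(ra) != len(rb) for ra, rb in zip(a, b)):
        raise ValueError("matrices must have the same shape")
    return sum(
        sign(x) != sign(y)
        for ra, rb in zip(a, b)
        for x, y in zip(ra, rb)
    )


# Toy 2 x 2 matrices with made-up values, for illustration only.
toy_a = [[0.1, -0.2], [0.0, 0.3]]
toy_b = [[0.2, 0.1], [-0.1, 0.3]]
```

Calling `summarize(toy_a)` and `count_sign_differences(toy_a, toy_b)` on the toy data mirrors how one row of Table 3 and one entry of its fifth column would be filled in for a pair of conditions.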
For GSE18655, the comparison in the fifth column is skipped: because there are three grades of prostate samples, they cannot be fairly compared as a pair. In addition, because each histogram contains many MI entries, we only show the densest section of each histogram in the figures.

Results of calculating mutual information

The statistical results of calculating mutual information are shown in Table 3.

Table 3 Results of analysis of MI matrices

                          Mean value   Standard deviation   Num of positive   Num of negative   Num of values of    Min       Max
                          of MI        of MI                values            values            different sign      value     value
GSE18655_grade1           0.00024      0.0015               6298              2918              N/A                 -0.0011   0.0858
GSE18655_grade2           0.00020      0.0017               6468              2748                                  -0.0018   0.0949
GSE18655_grade3           0.0004       0.0021               6650              2566                                  -0.0029   0.0582
GSE19536_A(72)            0.00036      0.0022               3912              1272              2052                -0.0010   0.1293
GSE19536_B(72)            0.00053      0.0040               3388              1796                                  -0.0022   0.2279
GSE19536_BasalLike(68)    0.0017       0.0056               3491              998               1217                -0.0033   0.1648
GSE19536_NormalLike(68)   0.0056       0.008                4200              420                                   -0.002    0.1279
GSE21036_cancer           0.0165       0.02121047456                                                                -0.002    0.1446
GSE21036_norm             0.0086       0.014646438                                                                  -0.0015   0.1565

Figure 1 Comparison of 96 MI of grade 1, grade 2 and grade 3 samples (histogram of frequency per bin for the grade 1, grade 2 and grade 3 series).

Synergistic therapy

Based on the interpretation of the network [4,5], we propose a framework that can help elucidate the underlying interactions between multi-target biomarkers and multi-component drug agents. The framework consists of three parts: selecting biomarkers of a complex disease such as cancer, building target networks of biomarkers, and forming interactions between biomarkers and drug agents to provide a personalized and synergistic therapy plan.
From the GEO datasets of cancers, we have discovered the genetic model of each cancer, called the signature of that particular cancer. The signatures (target networks) of different cancers may be quite different, which corresponds to the different biomarkers in Additional file 1: GSE18655_96_Biomarkers.xlsx, Additional file 2: GSE19536_72_Biomarkers.xlsx, Additional file 3: GSE19536_68_Biomarkers.xlsx, and Additional file 4: GSE21036_22_Biomakers.xlsx. For these different signatures, we would discover various synergistic mechanisms, as exemplified in [24].

Figure 2 Comparison of 72 MI of luminal A and luminal B samples (histogram of frequency per bin for the luminal A and luminal B series).

Figure 3 Comparison of 68 MI of basal-like and normal-like samples (histogram of frequency per bin, with bins from -0.0003 to 0.0096 in steps of 0.0003, for the basal-like and normal-like series).

Assume we would like to provide a synergistic therapy plan for a patient A. By collecting his or her biological samples, such as saliva or blood, we first obtain the corresponding microarray dataset of patient A and apply it to the genetic model as shown in Figure 5.
A complete synergistic therapy should be able to select a small subset of biomarkers and correlate them with drug agents in a multi-target, multi-component network approach, as shown in Figure 6. In Figure 6, a disease is associated with several biomarkers, such as RNAs, miRNAs or proteins, denoted by R1, R2, R3, R4 and R5, which are the regulators for operons O1, O2, and O3. An operon is a basic unit of DNA formed by a group of genes controlled by a gene regulator. These operons initiate molecular mechanisms as promoters. The gene regulators can enable organs to regulate other genes either by induction or repression. Each target biomarker may have a list of pharmacons used as enzyme inhibitors. Traditionally, pharmacons refer to biologically active substances, which are not limited to drug agents. For example, herbal extracts whose ingredients have a promising anti-AD (Alzheimer's Disease) effect can be used as pharmacons [24]. Meanwhile, the pharmacons, denoted by D1, D2, and D3, have effects on some target biomarkers. For example, D1 affects target biomarker R3, D2 affects target biomarker R5 and D3 affects biomarker R1. Compared with the drug agent pair methodology [5], the proposed framework in Figure 6 represents a more accurate interpretation of biomarkers with multi-component drug agents.

Figure 4 Comparison of 22 MI of prostate cancerous and normal samples (histogram of frequency per bin for the normal samples and cancer samples series).

Figure 5 Diagram of detailed process of building the genetic model.

Discussion

Among the MI values obtained, we see both positive and negative values. Positive values can represent attraction among the biomarkers, while negative values may represent repulsion among the biomarkers, which matches the concept of Yin-Yang in TCM (Traditional Chinese Medicine). From these results, we observed that there is minimal difference in mutual information values between cancer stages. However, the difference in mean MI value between prostate cancer and normal cells is more obvious. The mean MI value of the cancerous samples in the last prostate cancer dataset is approximately twice that of the normal samples (0.0165 versus 0.0086 in Table 3). This may be of interest to medical researchers for further investigation.

Conclusions

We have presented a comprehensive approach to the diagnosis and therapy of complex diseases, such as cancer. A complete procedure is proposed for clinical application to cancer patients. While the genetic model provides a standard framework for designing synergistic therapy, the actual plan for an individual patient is personalized and flexible. With careful monitoring, physicians may adaptively change or modify the therapy plan. Much further analysis of this framework in clinical settings is still needed.

Additional files

Additional file 1: GSE18655_96_Biomarkers. An MS Office Excel file which contains a list of gene symbols of 96 biomarkers of GSE18655 samples.
Additional file 2: GSE19536_72_Biomarkers. An MS Office Excel file which contains a list of gene symbols of 72 biomarkers of GSE19536 luminal A and luminal B samples.
Additional file 3: GSE19536_68_Biomarkers. An MS Office Excel file which contains a list of gene symbols of 68 biomarkers of GSE19536 basal-like and normal-like samples.
Additional file 4: GSE21036_22_Biomarkers. An MS Office Excel file which contains a list of gene symbols of 22 biomarkers of GSE21036 samples.
Additional file 5: 18655Grade1MI. An MS Office Excel file which contains a matrix of the pairwise MI values of 96 biomarkers of grade 1 samples.
Additional file 6: 18655Grade2MI.
An MS Office Excel file which contains a matrix of the pairwise MI values of 96 biomarkers of grade 2 samples.
Additional file 7: 18655Grade3MI. An MS Office Excel file which contains a matrix of the pairwise MI values of 96 biomarkers of grade 3 samples.
Additional file 8: 19536LuminalAMI. An MS Office Excel file which contains the pairwise MI values of 72 biomarkers of luminal A samples.
Additional file 9: 19536LuminalBMI. An MS Office Excel file which contains the pairwise MI values of 72 biomarkers of luminal B samples.
Additional file 10: 19536BasalLikeMI. An MS Office Excel file which contains the pairwise MI values of 68 biomarkers of basal-like samples.
Additional file 11: 19536NormalLikeMI. An MS Office Excel file which contains the pairwise MI values of 68 biomarkers of normal-like samples.
Additional file 12: 21036CancerMI. An MS Office Excel file which contains the pairwise MI values of 22 biomarkers of cancerous samples.
Additional file 13: 21036NormalMI. An MS Office Excel file which contains the pairwise MI values of 22 biomarkers of normal samples.

Figure 6 Relationships between biomarkers, pharmacons and operons, where R1, R2, R3, R4 and R5 denote 5 biomarkers. Among all the biomarkers, R2, R3 and R5 are regulators. (Diagram labels: MI target network for a cancer with biomarkers R1 to R5; operons O1 to O3, linked by "Regulate"; pharmacons D1 to D4; edge types "Good interaction" and "Chosen by algorithm".)

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

WH, CL: implementation of the project. FC, SC: design of the project. All authors read and approved the final manuscript.

Acknowledgements

We are grateful to the reviewers for their valuable comments and suggestions. We are also grateful to Dr. John Harris for his encouragement of this research, and to Dr. LungJi Chang for his discussion and encouragement.
Author details

1 System Biology Lab, University of Florida, Florida, USA. 2 Department of Electrical and Computer Engineering, University of Florida, Florida, USA. 3 Department of Computer and Information Science and Engineering, University of Florida, Florida, USA. 4 Institute of Information Science, Academia Sinica, Taipei, Taiwan.

Received: 10 July 2012 Accepted: 20 September 2012 Published: 2 October 2012

References

1. Zimmermann GR, Lehar J, Keith CT: Multi-target therapeutics: when the whole is greater than the sum of the parts. Drug discovery today 2007, 12(1-2):34-42.
2. Keith CT, Borisy AA, Stockwell BR: Multicomponent therapeutics for networked systems. Nat Rev Drug Discov 2005, 4(1):71-78.
3. Dancey JE, Chen HX: Strategies for optimizing combinations of molecularly targeted anticancer agents. Nature reviews Drug discovery 2006, 5(8):649-659.
4. Csermely P, Agoston V, Pongor S: The efficiency of multi-target drugs: the network approach might help drug design. Trends Pharmacol Sci 2005, 26(4):178-182.
5. Li S, Zhang B, Zhang N: Network target for screening synergistic drug combinations with application to traditional Chinese medicine. BMC Syst Biol 2011, 5(Suppl 1):S10.
6. Hsu WC, Liu CC, Chang F, Chen SS: Feature Selection for Microarray Data Analysis: GEO & AMFES. Gainesville, Florida: Technical Report; 2012.
7. Guyon I, Weston J, Barnhill S, Vapnik V: Gene Selection for Cancer Classification using Support Vector Machines. Mach Learn 2002, 46(1-3):389-422.
8. Rakotomamonjy A: Variable selection using SVM-based criteria. J Mach Learn Res 2003, 3:1357-1370.
9. Bi J, Bennett K, Embrechts M, Breneman C, Song M: Dimensionality reduction via sparse support vector machines. J Mach Learn Res 2003, 3:1229-1243.
10. Stoppiglia H, Dreyfus G, Dubois R, Oussar Y: Ranking a random feature for variable and feature selection. J Mach Learn Res 2003, 3:1399-1414.
11. Tuv E, Borisov A, Torkkola K: Feature Selection Using Ensemble Based Ranking Against Artificial Contrasts. In Neural Networks, 2006 IJCNN '06 International Joint Conference on; 2006:2181-2186.
12. Shannon CE: A mathematical theory of communication. SIGMOBILE Mob Comput Commun Rev 2001, 5(1):3-55.
13. Qiu P, Gentles AJ, Plevritis SK: Fast calculation of pairwise mutual information for gene regulatory network reconstruction. Comp Methods and Programs in Biomed 2009, 94(2):177-180.
14. Beirlant J, Dudewicz EJ, Györfi L, van der Meulen EC: Nonparametric entropy estimation: An overview. Int J Math Stat Sci 1997, 6(1):17-39.
15. Margolin A, Nemenman I, Basso K, Wiggins C, Stolovitzky G, Favera R, Califano A: ARACNE: An Algorithm for the Reconstruction of Gene Regulatory Networks in a Mammalian Cellular Context. BMC Bioinforma 2006, 7(Suppl 1):S7.
16. Michael EW, Monica SL: A data locality optimizing algorithm; 1991.
17. Fitzgerald JB, Schoeberl B, Nielsen UB, Sorger PK: Systems biology and combination therapy in the quest for clinical efficacy. Nat Chem Biol 2006, 2(9):458-466.
18. Page L, Brin S, Motwani R, Winograd T: The PageRank Citation Ranking: Bringing Order to the Web. In Stanford InfoLab. 1999.
19. van Driel MA, Bruggeman J, Vriend G, Brunner HG, Leunissen JA: A text-mining analysis of the human phenome. Eur J Human Genet: EJHG 2006, 14(5):535-542.
20. Sobin LH, Wittekind C: TNM: classification of malignant tumours. New York: Wiley-Liss; 2002.
21. Barwick BG, Abramovitz M, Kodani M, Moreno CS, Nam R, Tang W, Bouzyk M, Seth A, Leyland-Jones B: Prostate cancer genes associated with TMPRSS2-ERG gene fusion and prognostic of biochemical recurrence in multiple cohorts. Br J Cancer 2010, 102(3):570-576.
22. Enerly E, Steinfeld I, Kleivi K, Leivonen SK, Aure MR, Russnes HG, Ronneberg JA, Johnsen H, Navon R, Rodland E, et al: miRNA-mRNA Integrated Analysis Reveals Roles for miRNAs in Primary Breast Tumors. PLoS One 2011, 6(2):e16915.
23. Taylor BS, Schultz N, Hieronymus H, Gopalan A, Xiao Y, Carver BS, Arora VK, Kaushik P, Cerami E, Reva B, et al: Integrative genomic profiling of human prostate cancer. Cancer cell 2010, 18(1):11-22.
24. Sun Y, Zhu R, Ye H, Tang K, Zhao J, Chen Y, Liu Q, Cao Z: Towards a bioinformatics analysis of anti-Alzheimer's herbal medicines from a target network perspective. In Briefings in bioinformatics. 2012.

doi:10.1186/2043-9113-2-16
Cite this article as: Hsu et al.: Cancer classification: Mutual information, target network and strategies of therapy. Journal of Clinical Bioinformatics 2012 2:16.