Pattern Recognition Techniques in Microarray Data Analysis

Miao Li, Biao Wang, Zohreh Momeni, and Faramarz Valafar
Department of Computer Science
San Diego State University
San Diego, California, USA
faramarz@sciences.sdsu.edu

Abstract: The explosion in the production rate of expression level data has highlighted the need for automated techniques that help scientists analyze, understand, and cluster (sequence) the enormous amount of data that is being produced. Examples of such problems are analyzing gene expression level data produced by microarray technology on the genomic scale, sequencing genes on the genomic scale, sequencing proteins and amino acids, etc. In this paper we review some of the pattern recognition techniques that have been used in this area.

Keywords: Pattern recognition, sequencing, microarray, clustering, expression level data

Introduction

Microarray technology relies on the hybridization properties of nucleic acids to monitor DNA or RNA abundance on a genomic scale. Microarrays have revolutionized the study of the genome by allowing researchers to study the expression of thousands of genes simultaneously for the first time. The ability to simultaneously study thousands of genes under a host of differing conditions presents an immense challenge in the fields of computational science and data mining. In order to properly comprehend and interpret expression data produced by microarray technology, new computational and data mining techniques need to be developed. In this survey, we review methods that have been used to address this challenge. The analysis and understanding of microarray data includes a search for genes that have similar or correlated patterns of expression. Among the various frameworks in which pattern recognition has traditionally been formulated, the statistical approach, neural network techniques, and methods imported from statistical learning theory are among those that have been applied in microarray data analysis [1].
In the following you will find a survey of a number of techniques that have been used in microarray data analysis.

Simple approaches:

In order to understand the connection between microarray data and biological function, one could ask the following fundamental question: how do expression levels of specific genes, or gene-sequences, differ in a control group versus a treatment group? In other words, in a controlled study, if a specific biological function (condition) is present, how do expression levels change when the function (condition) is turned off (absent), or vice versa? This line of questioning could be useful in, for instance, determining the effect of a specific genomic treatment of a disease. In such a case, the question would be: has the treatment changed the expression level(s) of specific (target) gene(s)/gene-sequence(s) to noticeably different levels? If so, have the changes in expression levels resulted in elimination of the patient's condition (or alleviation of its symptoms)? If the answers to the above questions are yes, then a correlation can be drawn between the specific genes/gene-sequences that show changed levels and the biological function.

One of the simple methods attempting to answer the above question is called the Fold approach. In this approach, if the average expression level of a gene has changed by a predetermined number of folds, that gene's expression is declared to have been changed (from on to off, or vice versa) due to the treatment. In many studies, a 2-fold technique is used (rather than 3-fold or 4-fold), in which the average expression level has to change to at least two folds of its initial level in order for it to be classified as changed. The drawback of this method is that it is unlikely to reveal the desired correlation between expression data and function, as a predetermined factor of 2 (or 3 or 4) has different significance depending on the expression levels of various genes. A further drawback is that this method only compares the

expression level of the gene under question to determine whether it has been turned on or off. We believe that a better and more biologically relevant method of analysis would be to consider expression patterns of related (or neighboring) genes to determine the on or off state of the gene currently under observation. The fold technique, as it is, does not allow this type of analysis.

Similar to the Fold approach, the t-test is another simple method applied in gene expression analysis. It operates on the logarithm of expression levels, and requires a comparison computed from the means and variances of both the treatment and control groups. The fundamental problem with the t-test is that it requires repeated treatment and control experiments, which is both tedious and costly. As a result, the number of experimental repetitions is often small, which can affect the reliability of this mean-based approach [2,3].

Karhunen-Loève expansion:

The Karhunen-Loève expansion [4], as it is known in pattern recognition, is also known as principal component analysis (PCA) or singular value decomposition (SVD) [5] in statistics. Use of SVD in microarray data analysis has been suggested by various researchers [6-11]. PCA is a linear mathematical technique that finds basis vectors that span the problem space (gene expression space). These vectors are called principal components (PCs). In expression data analysis, the vectors are also called mean expression vectors, or eigengenes. A PC can be thought of as a major pattern in the data set (e.g. gene expression data). The more PCs are used to span (model) the problem space, the more accurate the representation will be. However, one should also be aware that the lower the significance of a PC, the more noise it represents. So a balance needs to be struck between the need for maximal expansion of the problem space and the need for elimination of noise. In most cases, PCA reduces the dimensionality of the problem space without much loss of generality or information.
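As a minimal sketch of how the leading principal component might be extracted, the following Python example centers a small expression matrix, forms its covariance matrix, and applies power iteration. The function name `leading_component`, the toy matrix, the one-gene-per-row layout, and the use of power iteration in place of a full SVD routine are all assumptions made for illustration; they are not taken from the studies cited above.

```python
import math


def leading_component(rows, iterations=100):
    """Return (unit vector, fraction of variance explained) for the first PC.

    rows: list of expression vectors, one per gene (hypothetical layout).
    """
    n, d = len(rows), len(rows[0])
    # Center the data: subtract the mean expression vector.
    mean = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - mean[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    # Power iteration: repeatedly apply the covariance matrix and renormalize.
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break  # no variance along the current direction
        v = [x / norm for x in w]
    # The Rayleigh quotient gives the variance captured along v.
    eigenvalue = sum(v[a] * cov[a][b] * v[b] for a in range(d) for b in range(d))
    total = sum(cov[a][a] for a in range(d))
    return v, (eigenvalue / total if total else 0.0)


# Toy data (made up): four genes measured under two conditions.
genes = [[1.0, 2.1], [2.0, 3.9], [3.0, 6.2], [4.0, 7.8]]
pc, explained = leading_component(genes)
print(pc, explained)
```

In this toy case nearly all of the variance lies along a single direction, which is the situation in which keeping only the first few PCs loses little information.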
It is easy to think of each PC as the mean expression vector representing a cluster of expression data (expression pattern). In most studies involving SVD, the technique was used to find underlying patterns or `modes' in expression data with the intention of linking these modes to the action of transcriptional regulators. The advantage of SVD is the simplicity of the algorithm and the ease with which it can be understood. Among the disadvantages is the method's inherently linear nature. Because SVD is a linear technique, it works well in problem spaces that are linearly separable. It is yet to be shown that finding underlying modes in expression data is a linearly separable problem. PCA is a powerful technique for the analysis of gene expression data when used in combination with another classification technique, such as k-means clustering or self-organizing maps (SOMs), that requires the user to specify the number of clusters. We will discuss both of these methods in the coming sections.

Bayesian Belief Networks (BBN):

A more sophisticated approach, the Bayesian probabilistic framework, has been suggested to analyze expression data [2,12,13]. An alternative to the SVD methodology described above is to use prior knowledge of the regulatory network's architecture to design competing models, and then use Bayesian belief networks to pick the model that best fits the expression data [14]. Gifford and co-workers have used this approach to distinguish between two competing models for galactose regulation [15]. Friedman and co-workers have used Bayesian networks to analyze genome-wide expression data in order to identify significant interactions between genes in a variety of metabolic and regulatory pathways [16,12]. Baldi et al. [2] used BBNs to model log-expression values by independent normal distributions, parameterized by corresponding means and variances with hierarchical prior distributions.
They derive point estimates for both parameters and hyperparameters, and regularize the expression for the variance of each gene by combining the empirical variance with a local background variance associated with neighboring genes. This approach compares favorably with the simple t-test and fold methods. It can accommodate noise, variability, and the low replication often typical of microarray data [2].

k-means Clustering (Partitioning):

k-means is a divisive clustering approach. It partitions data (genes or experiments) into groups that have similar expression patterns. k is the number of clusters (also sometimes called buckets, bins, or classes) into which the user believes the data should fall. The number k is an input value given to the algorithm by the user. The k-means clustering algorithm is a three-step process. In the first step, the algorithm randomly assigns each training datum to one of the k clusters. In the second step, the mean inter- and intra-class distances are calculated. The mean inter-class distance (δ) of each cluster is

calculated by first calculating a mean vector (µ) for each cluster and then averaging the distances between the vectors (data) of a cluster and its mean vector. In expression level data, the mean vector is called the average expression vector. For a cluster containing n vectors v_1, ..., v_n:

$$\mu = \frac{1}{n}\sum_{i=1}^{n} v_i \quad\text{and}\quad \delta = \frac{1}{n}\sum_{i=1}^{n} \lVert v_i - \mu \rVert$$

The mean intra-class distance between two clusters is the distance between their respective mean vectors. $\Delta_{1,2}$, for instance, is the mean intra-class distance between clusters 1 and 2:

$$\Delta_{1,2} = \lVert \mu_1 - \mu_2 \rVert$$

In this terminology, δ measures the spread of the data within a single cluster, while $\Delta_{1,2}$ measures the separation between two clusters.

The third step is an iterative step, and its goal is to minimize the mean inter-class distances (δ), maximize the intra-class distances ($\Delta_{1,2}$), or both, by moving data from one cluster to another. In each iteration one piece of data is moved to a new cluster where it is closest to µ (the mean vector of the new cluster). After each move, the average expression vectors for the two affected classes are recalculated. This process continues until any further move would increase the mean inter-class distances (expression variability for each class) or reduce the intra-class distances.

There are additional (sometimes optional) steps that can be found in variations of the basic k-means clustering algorithm described above. Quackenbush [17] discusses a few optional steps and variations of this algorithm.

k-means clustering is easy to implement. However, the major disadvantage of this method is that the number k is often not known in advance. Another potential problem with this method is that, because each gene is uniquely assigned to some cluster, it is difficult for the method to accommodate a large number of stray data points, intermediates, or outliers. Further concerns have to do with the biological interpretation (in the case of expression data) of the final clustered data. In this regard, Tamayo et al. explain that k-means clustering is a completely unstructured approach, which proceeds in an entirely local fashion and produces an unorganized collection of clusters that is not conducive to interpretation [18].

The most recent variant of the k-means clustering algorithm (at the time of this survey) designed specifically for the assessment of gene spots (on the array images) is the work of Bozinov et al. [19]. The technique is based on clustering pixels of a target area into foreground and background. The authors report: "results from the analysis of real gene spots indicate that our approach performs superior to other existing analytical methods." [19]

Hierarchical Clustering:

There are various hierarchical clustering algorithms that can be applied to microarray data analysis. These include single-linkage clustering, complete-linkage clustering, average-linkage clustering, weighted pair-group averaging, and within pair-group averaging [17,20-22]. These algorithms differ only in the manner in which distances are calculated between the growing clusters and the remaining members of the data set. Hierarchical clustering algorithms usually generate a gene similarity score for all gene combinations, place the scores in a matrix, join those genes that have the highest score, and then continue to join progressively less similar pairs. In the clustering process, after the similarity scores are calculated, the most closely related pairs are identified in an above-diagonal scoring matrix. A node in the hierarchy is created for the highest-scoring pair, the gene expression profiles of the two genes are averaged, and the joined elements are weighted by the number of elements they contain. The matrix is then updated, replacing the two joined elements by the node. For n genes, the process is repeated n-1 times until a single element (that contains all genes) remains.

In the following formulas, we assume a Euclidean measure for distance, and arithmetic averaging for calculating the means. In various forms of the algorithm, various measures of distance and averaging techniques have been used. A popular and more representative measure of distance has been the Mahalanobis distance.
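The agglomerative procedure described above (score all pairs, join the closest pair, replace it by a size-weighted average profile, and repeat n-1 times) can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: it uses Euclidean distance, so the "highest score" is taken to be the smallest distance, and the function names `euclidean` and `agglomerate` and the toy profiles are hypothetical rather than taken from any of the cited implementations.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two expression profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def agglomerate(profiles):
    """Merge the closest pair of clusters until one cluster remains.

    Returns a list of (sorted member indices, merge distance), one per merge.
    """
    # Each active cluster maps an id to (member indices, mean profile).
    clusters = {i: ([i], list(p)) for i, p in enumerate(profiles)}
    merges = []
    next_id = len(profiles)
    while len(clusters) > 1:
        ids = sorted(clusters)
        best = None
        # Scan the above-diagonal part of the pairwise distance matrix.
        for x in range(len(ids)):
            for y in range(x + 1, len(ids)):
                d = euclidean(clusters[ids[x]][1], clusters[ids[y]][1])
                if best is None or d < best[0]:
                    best = (d, ids[x], ids[y])
        d, a, b = best
        members_a, mean_a = clusters.pop(a)
        members_b, mean_b = clusters.pop(b)
        na, nb = len(members_a), len(members_b)
        # Average the two profiles, weighted by the number of genes in each.
        merged = [(na * u + nb * v) / (na + nb) for u, v in zip(mean_a, mean_b)]
        clusters[next_id] = (members_a + members_b, merged)
        merges.append((sorted(members_a + members_b), d))
        next_id += 1
    return merges


# Toy data (made up): four genes measured under two conditions.
toy = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 12.0]]
for members, distance in agglomerate(toy):
    print(members, round(distance, 3))
```

Because each merge is final, the sketch also makes the greedy character of the method visible: once two profiles are joined, later iterations only ever see their averaged profile.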

The first report by Wen et al. [20] uses clustering and data-mining techniques to analyze large-scale gene expression data. This report is significant in that it shows how integrating results that were obtained by using various distance metrics can reveal different but meaningful patterns in the data. Eisen et al. [21] also make an elegant demonstration of the power of hierarchical clustering in the analysis of microarray data.

Similar to the k-means algorithm, the advantage of hierarchical clustering lies in its simplicity. A further advantage of the hierarchical technique over the k-means method is that the results from hierarchical clustering methods can easily be visualized. Although hierarchical cluster analysis is a powerful technique and possesses clear advantages for expression data analysis, it also presents researchers with two major drawbacks.

The first problem arises from the greedy nature of the algorithm. Hierarchical clustering is essentially a greedy algorithm and, like other such algorithms, it suffers from sensitivity to early mistakes in the greedy process. Because greedy algorithms, by definition, cannot go back (backtrack) to redo a step that was taken by mistake, small errors in cluster assignment in the early stages of the algorithm can be drastically amplified [23]. Therefore, depending on results produced by certain arbitrarily imposed clustering structures (that do not correspond to reality) can give rise to misleading conclusions. For instance, in time-course gene expression studies, hierarchical clustering has received mixed reviews. These algorithms often fail to discriminate between different patterns of variation in time. For instance, a gene expression pattern for which a high value is found at an intermediate time point will be clustered with another gene for which a high value is found at a later point in time. These variations have to be separated in a subsequent step.
The second drawback of hierarchical clustering is best described by Quackenbush [17]: "one potential problem with many hierarchical clustering methods is that, as clusters grow in size, the expression vector that represents the cluster might no longer represent any of the genes in the cluster. Consequently, as clustering progresses, the actual expression patterns of the genes themselves become less relevant." As a result, an active area of research in hierarchical cluster analysis is in detecting when to stop the hierarchical merging of elements. In this direction, new hybrid techniques are emerging that combine hierarchical methodology with k-means techniques.

Mixture models and EM (expectation maximization):

Mixture models are probabilistic models built by using positive convex combinations of distributions taken from a given family of distributions [24,25]. The EM algorithm is an iterative algorithm that proceeds in two alternating steps, the E (expectation) step and the M (maximization) step. Applying the EM algorithm to the corresponding mixture model can serve as a complementary analysis to standard hierarchical clustering. An attractive feature of the mixture modeling approach is that a strength-of-evidence measure for the number of true clusters in the data is computed. This assessment of reliability addresses the primary deficiency of hierarchical clustering and is often an important question to biologists considering data from microarray studies. A disadvantage of the technique is that variance parameters are difficult to estimate for the mixture model in the case of clusters with a small number of samples. One way to address this is to use a fully Bayesian estimation procedure. The approach also might not work with data having temporal structure, say gene expression of the same population of cells measured at a number of different time points [24].

Gene Shaving:

Gene shaving is a recently developed and popular statistical method for discovering patterns in gene expression data.
The original algorithm uses the PCA algorithm and was proposed by Hastie, Tibshirani et al. [26]. Later variations are under development, some of which include the replacement of the PCA step with a nonlinear variety. Gene shaving is designed to identify clusters of genes with coherent expression patterns and large variation across the samples. Using this method, the authors have successfully analyzed gene expression measurements made on samples from patients with diffuse large B-cell lymphoma and identified a small cluster of genes whose expression is highly predictive of survival [26].

The shaving algorithm can be summarized in the following steps:

1) Build the expression matrix Ξ for all genes, and center each row of Ξ to zero mean.
2) Compute the leading principal component of the rows of Ξ.
3) Shave off the proportion (typically 10%) of the genes having the smallest absolute inner product with the leading principal component.
4) Repeat the second and third steps until only one gene remains.
5) Estimate the optimal cluster size k (for maximum gap).
6) Orthogonalize each row of Ξ with respect to the average gene in the size-optimized gene cluster.
7) Repeat the first five steps with the orthogonalized data to find the second optimal cluster. This process is continued until a maximum of M clusters are reached (M is chosen a priori).

There are two varieties of the original shaving algorithm: supervised (or partially supervised) and unsupervised. In supervised and partially supervised shaving, available information about genes and samples (outcome measure, known properties of genes or samples, or any other a priori information) can be used to label the data as a means of supervising the clustering process and ensuring meaningful groupings. Gene shaving also allows genes to belong to more than one cluster. These two properties (the ability to supervise and multiple groupings for the same gene) make gene shaving different from most hierarchical clustering and other widely used methods for analyzing gene expression data. The most prominent advantage of the shaving methods proposed by Hastie, Tibshirani et al. was best expressed by the authors themselves: "(shaving methods) search for clusters of genes showing both high variation across the samples, and coherence (correlation) across the genes. Both of these aspects are important and cannot be captured by simple clustering of genes or setting threshold of individual genes based on the variation over samples." [26]

The drawback of this method is the computational intensity of the algorithm. The shaving process requires repeated computation of the largest principal component of a large set of variables. Some variant approaches should consider replacement algorithms for SVD that are less computationally intensive.

Support Vector Machine (SVM):

SVM is a supervised machine learning technique in the sense that vectors are classified with respect to known reference vectors. SVMs solve the problem by mapping the gene-expression vectors from expression space into a higher-dimensional feature space, in which distance is measured using a mathematical function known as a kernel function, and the data can then be separated into two classes.
[17] Gene expression vectors can be thought of as points in an n-dimensional space. For microarray analysis, sets of genes are identified that represent a target pattern of gene expression. The SVM is then trained to discriminate between the data points for that pattern (positive points in the feature space) and other data points that do not show that pattern (negative points in the feature space). With an appropriately chosen feature space of sufficient dimensionality, any consistent training set can be made separable [27].

SVM is a linear technique in that it uses hyperplanes as separating surfaces between positive and negative points in the feature space. Specifically, SVM chooses the hyperplane that provides the maximum margin between the plane surface and the positive and negative points. This feature provides a mechanism to avoid overfitting. Once the separating hyperplane has been selected, the decision function for classifying points with respect to the hyperplane only involves dot products between points in the feature space, which carries a low computational burden [27]. Because the linear boundary in the feature space maps to a nonlinear boundary in the gene expression space, SVM can be considered a nonlinear separation technique.

The important advantage of SVM is that it offers a possibility to train generalizable, nonlinear classifiers in high-dimensional space using a small training set. For large training sets, SVM typically selects a small support set that is necessary for designing the classifier, thereby minimizing the computational requirements during testing. Furey et al. [28] praise SVM as follows: "It (SVM) has demonstrated the ability to not only correctly separate entities into appropriate classes, but also identify instances whose established classification is not supported by the data. It performs well in multiple biological analyses, having remarkable robust performance with respect to sparse and noisy data."
One of the drawbacks of SVM is its sensitivity to the choice of kernel function, parameters, and penalties. For instance, if the kernel function is not appropriately chosen for the training data, SVM may not be able to find a separating hyperplane in feature space [27]. Choosing the best kernel function, parameters, and penalties for the SVM algorithm can be difficult in many cases. Because of the sensitivity of the algorithm to these choices, different choices often yield completely different classifications. As a result, it is necessary to successively increase kernel complexity until an appropriate (biologically sound) classification is achieved [17]. The ad hoc nature of the penalty term (error penalty), the computational complexity of the training procedure (a quadratic minimization problem), and the risk of overfitting when using larger hidden layers are further drawbacks of this method.
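To make the maximum-margin idea and the dot-product decision function concrete, the following Python sketch trains a soft-margin linear SVM on made-up two-dimensional "expression vectors". It is a simplified illustration under several assumptions: it uses the plain dot product (a linear kernel) rather than a mapping to a higher-dimensional feature space, and it replaces the quadratic minimization mentioned above with subgradient descent on the hinge loss. The function names `train_linear_svm` and `classify`, the parameters `lam`, `lr`, and `epochs`, and the data are hypothetical.

```python
def dot(u, v):
    """Dot product of two vectors."""
    return sum(a * b for a, b in zip(u, v))


def train_linear_svm(points, labels, lam=0.01, lr=0.1, epochs=200):
    """Soft-margin linear SVM trained by subgradient descent on the hinge loss.

    labels must be +1 (shows the target pattern) or -1 (does not).
    lam is the penalty weight on ||w||^2; lr is the step size (both assumed).
    """
    w = [0.0] * len(points[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(points, labels):
            margin = y * (dot(w, x) + b)
            # Shrink w: step along the gradient of the (lam / 2) * ||w||^2 penalty.
            w = [wi * (1.0 - lr * lam) for wi in w]
            if margin < 1.0:
                # Point lies inside the margin or is misclassified:
                # move the hyperplane away from it.
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    return w, b


def classify(w, b, x):
    """Decision function: the sign of a single dot product plus an offset."""
    return 1 if dot(w, x) + b >= 0.0 else -1


# Toy data (made up): positive points show the target pattern, negative do not.
positives = [[2.0, 2.0], [3.0, 3.0], [2.0, 3.0]]
negatives = [[-2.0, -2.0], [-3.0, -3.0], [-2.0, -3.0]]
points = positives + negatives
labels = [1] * len(positives) + [-1] * len(negatives)

w, b = train_linear_svm(points, labels)
print(w, b, [classify(w, b, x) for x in points])
```

The values chosen for `lam` and `lr` here are arbitrary, which reflects the sensitivity to parameters and penalties noted above; a kernel SVM would replace `dot` with a kernel function evaluated against the support vectors.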

Other Techniques in Microarray Data Analysis:

Sasik et al. [23] have presented the percolation clustering approach to the clustering of gene expression patterns based on the mutual connectivity of the patterns. Unlike SOM or k-means, which force gene expression data into a fixed number of predetermined clustering structures, this approach aims to reveal the natural tendency of the data to cluster, in analogy to the physical phenomenon of percolation.

GA/KNN is another algorithm, described by Li et al. [29]. This approach combines a genetic algorithm (GA) and the k-nearest neighbor (KNN) method to identify genes that can jointly discriminate between different classes of samples. GA/KNN is a supervised stochastic pattern recognition method. It is capable of selecting a subset of predictive genes from a large set of noisy data for sample classification [29].

A large body of research has been conducted on the application of various types of artificial neural networks to the problems at hand. These techniques have been reviewed in [30] and will not be repeated in this paper.

Conclusion

A number of pattern recognition techniques have been applied to analyze microarray expression data [2,7,]. The simplest category of these techniques is that of individual gene-based analysis, such as the fold approach, the t-test rule, and the Bayesian framework. More sophisticated techniques include (but are not limited to) cluster analysis methods and the SVD technique. The hypothesis (hope) behind using clustering methods is that genes in a cluster share some common function or regulatory elements. Although this may be true in a biologically sound division of data, it holds only partially (or not at all) in the clustering produced by a purely mathematical technique. The success of these algorithms should be evaluated based on how biologically sound (or relevant) the clustering is. On this road, we believe that algorithms that allow the inclusion of a priori biological knowledge show higher potential.
This review represents only a small part of the research being conducted in the area, and is meant only as a complement to, and continuation of, the surveys that others have conducted in this area [7,]. It should in no way be taken as a complete survey of all algorithms. For reasons of limited space, some significant developments in the area had to be left out. Furthermore, new techniques and algorithms are being proposed for microarray data analysis on a daily basis, making survey articles such as this highly time-dependent.

References

1. Szabo, A. et al. Variable selection and pattern recognition with gene expression data generated by the microarray technology. Math Biosci (2002) 76(), 7-98
2. Baldi, P. et al. A Bayesian framework for the analysis of microarray expression data: regularized t-test and statistical inference of gene changes (2001) Bioinformatics, 7(6), 509-59
3. Pan, W. A comparative review of statistical methods for discovering differentially expressed genes in replicated microarray experiments (2002) Bioinformatics 8(4), 546-54
4. Mallat, S. G. (1999) A Wavelet Tour of Signal Processing (Academic, San Diego), second edition
5. Anderson, T. W. (1984) Introduction to Multivariate Statistical Analysis (Wiley, NY), second edition
6. Alter, O. et al. Singular Value Decomposition for Genome-wide Expression Data Processing and Modeling (2000) Proc. Natl. Acad. Sci. USA 97, 00-006
7. O. Alter, P.O. Brown and D. Botstein, Singular value decomposition for genome-wide expression data processing and modeling. Proc Natl Acad Sci USA 97 (2000), pp. 00 006
8. N.S. Holter, A. Maritan, M. Cieplak, N.V. Fedoroff and J.R. Banavar, Dynamic modeling of gene expression data. Proc Natl Acad Sci USA 98 (2001), pp. 693 698
9. N.S. Holter, M. Mitra, A. Maritan, M. Cieplak, J.R. Banavar and N.V. Fedoroff, Fundamental patterns underlying gene expression profiles: simplicity from complexity. Proc Natl Acad Sci USA 97 (2000), pp. 8409 844

10. Raychaudhuri S, Stuart JM, Altman RB. Principal components analysis to summarize microarray experiments: application to sporulation time series. Pac Symp Biocomput 2000, 455-466.
11. Wu, C., Berry, M., Shivakumar, S. and McLarty, J. 1995. Neural networks for full-scale protein sequence classification: sequence encoding with singular value decomposition. Machine Learning. 2: 77-93.
12. N. Friedman, M. Linial, I. Nachman and D. Pe'er, Using Bayesian networks to analyze expression data. J Comput Biol 7 (2000), pp. 60 620
13. A. Drawid and M. Gerstein, A Bayesian system integrating expression data with sequence patterns for localizing proteins: comprehensive application to the yeast genome. J Mol Biol 30 (2000), pp. 059 075
14. D.K. Gifford, Blazing pathways through genetic mountains. Science 293 (2001), pp. 2049 205
15. Hartemink AJ, Gifford DK, Jaakkola TS, Young RA. Using graphical models and genomic expression data to statistically validate models of genetic regulatory networks. Pac Symp Biocomput 2001, 422-433
16. D. Pe'er, A. Regev, G. Elidan and N. Friedman, Inferring subnetworks from perturbed expression profiles. Bioinformatics 7 Suppl (2001), pp. S25 S224
17. Quackenbush, J. Computational Analysis of Microarray Data (2001) Nature Genetics 2, 48-427
18. Tamayo, P. et al. Interpreting patterns of gene expression with self-organizing maps: methods and application to hematopoietic differentiation (1999) Proc. Natl. Acad. Sci. USA 96, 2907-292
19. Bozinov, D. Unsupervised technique for robust target separation and analysis of DNA microarray spots through adaptive pixel clustering (2002) Bioinformatics 8(5), 747-56
20. Wen, X. et al. Large-scale Temporal Gene Expression Mapping of Central Nervous System Development (1998) Proc. Natl. Acad. Sci. USA 95, 334-339
21. Eisen, P. T. et al. Cluster Analysis and Display of Genome-Wide Expression Patterns (1998) Proc. Natl. Acad. Sci. USA 95, 4863-4868
22. Alon, U. et al. Broad Patterns of Gene Expression Revealed by Clustering Analysis of Tumor and Normal Colon Tissues Probed by Oligonucleotide Arrays (1999) Proc. Natl. Acad. Sci. USA 96, 6745-6750
23. Sasik, R. et al. Percolation Clustering: A Novel Approach to the Clustering of Gene Expression Patterns in Dictyostelium Development (2001) PSB Proceedings 6, 335-347
24. Baldi, P. On the convergence of a clustering algorithm for protein-coding regions in microbial genomes (2000) Bioinformatics 6, 367-37
25. Ghosh, D. Mixture modeling of gene expression data from microarray experiments (2001) Bioinformatics 8(2), 275-286
26. Hastie, T. et al. Gene Shaving as a Method for Identifying Distinct Sets of Genes with Similar Expression Pattern (2000) Genome Biology, -2
27. Brown, M. P. S. et al. Knowledge-based analysis of microarray gene expression data by using support vector machines (2000) Proc. Natl. Acad. Sci. USA 97, 262-267
28. Furey, T. S. et al. Support vector machine classification and validation of cancer tissue samples using microarray expression data (2000) Bioinformatics 6, 906-94
29. Li, et al. Gene selection for sample classification based on gene expression data: study of sensitivity to choice of parameters of the GA/KNN method (2001) Bioinformatics 7(2), 3-42
30. F. Valafar. Neural Network Applications in Biological Sequencing. Proceedings of the 2003 International Conference on Mathematics and Engineering Techniques in Medicine and Biological Sciences (METMBS 03), June 24-27, 2002. In print.