Estimating the Number of Clusters in Genetics of Acute Lymphoblastic Leukemia Data




Journal of Al Azhar University-Gaza (Natural Sciences), 2011, 13 : 109-118

Estimating the Number of Clusters in Genetics of Acute Lymphoblastic Leukemia Data

Mahmoud K. Okasha, Khaled I.A. Almghari
Department of Applied Statistics, Al-Azhar University - Gaza

Received 16/11/2011 Accepted 31/12/2011

Abstract: Cluster analysis is a statistical technique that has been widely used to cluster gene expressions and other data in many fields. However, a recurring problem in the literature is the choice of the number of clusters. Specifically, estimating the number of clusters in a given population, particularly for gene expressions, is of great interest and needs to be addressed. Many algorithms are used in practice for that purpose in different fields. In this paper we examine different clustering algorithms for estimating the number of clusters, based on probabilities, covariance matrices, and eigenvalues, on real data sets using R package algorithms. Specifically, we examine the model-based algorithm (Mclust) and the hierarchical clustering algorithm (hclust), and compare these algorithms with the Partition Around Medoid (PAM) algorithm. We found that the first algorithm can be used only for large data sets, while the second can be safely used for small data sets. Mclust is a model-based clustering approach built on the Bayesian Information Criterion (BIC), with models fitted by the expectation-maximization (EM) algorithm. The results of these two algorithms are compared with a third approach based on the PAM algorithm, in which the number of clusters is selected manually as the number that maximizes the average silhouette width. Although the latter algorithm requires the number of clusters to be chosen manually, it has the best performance. However, the first two algorithms can be automated to produce the best estimate of the number of clusters in a given data set. These algorithms can be applied not only to genetic data but also in many other fields such as market research.

Keywords: clustering, model based algorithm, hierarchical clustering, Partition Around Medoid, Bayesian Information Criterion, average silhouette, hierarchical tree, gene expression.

http://www.alazhar.edu.ps/journal123/natural_scences.asp?typeno=1

1. Introduction

Cluster analysis is a collection of statistical methods used to detect groups of observations that have similar behavior or characteristics in a set of data. Cluster analysis is generally classified into two different techniques, namely hierarchical and non-hierarchical procedures. In the hierarchical case, the goal is to construct a hierarchy or a decision-tree-like structure (dendrogram) to illustrate the relationship among entities. In the non-hierarchical method, a position in the measurement space is taken as a central place and the distance is measured from that central point (Partition Around Medoid). Hierarchical clustering involves the concept of ordering. The ordering is driven by the number of observations that can be combined at a time, based on the assumption that the distance between two observations is not statistically different from zero. The clusters can be arrived at either by weeding out observations (divisive method) or by joining together similar observations (agglomerative method). However, estimating the number of clusters in any data remains the main problem (Chen et al., 2002).

2. Aims of the study

Acute lymphoblastic leukemia has many different types and causes, and every type has many different stages. The main goal of the analysis of acute lymphoblastic leukemia data is to split the sample into categories and subcategories and to classify the data into homogeneous clusters. To achieve this goal, cluster analysis is usually used to:

1. Classify homogeneous cases into the same clusters and heterogeneous ones into different clusters.
2. Reduce the sample cases to a few different clusters with similar properties.
3. Determine the number of clusters.

Allocating homogeneous objects into the same cluster means that all patients with the same type of disease and at the same stage will be classified into the same group. The benefit of this is that patients in the same cluster should be given similar treatment protocols.
Moreover, classifying the genes which cause the disease makes it easier to isolate these genes in new generations and so avoid acute lymphoblastic leukemia. Afterwards, the disease can be avoided by using DNA technologies that can prevent it in people who carry the genes which cause acute lymphoblastic leukemia. The dependent variable to be clustered here is usually a classification variable indicating the type and stage of the disease. Thus, this paper aims to estimate the number of clusters in acute lymphoblastic leukemia genetics data.

3. Statistical Models in Differential Gene Expression

Several model-based techniques have been used in the analysis of acute lymphoblastic leukemia data and of microarray data. The approach is based on multivariate exploratory data analysis, aiming to provide a number of techniques that allow quick viewing of distinct gene expression patterns within a data set.

Principal Component Analysis (PCA) has been used in the analysis of multivariate data by expressing the maximum variance in a minimum number of principal components; redundant components are eliminated, thus reducing the dimension of the input vectors (De Bin and Risso, 2011).

Singular Value Decomposition (SVD) treats microarray data as a matrix $A$ composed of $n$ rows (genes) by $p$ columns (experiments). SVD is represented by the equation

$$A_{n \times p} = U_{n \times n} \, S_{n \times p} \, V^{T}_{p \times p},$$

with $U$ being the gene coefficient vectors, $S$ the mode amplitudes and $V^{T}$ the expression level vectors.

One of the statistical techniques most familiar to biologists is hierarchical clustering, which presents data as a gene list within a dendrogram in a bottom-up analysis. This can be obtained by assigning a similarity score to all gene pairs, by calculating Pearson's correlation coefficient, and building a tree of genes. K-means clustering, however, is a top-down technique that groups a collection of nodes into a fixed number of clusters (k) through an iterative process. Each class must have a center point that is the average position of all the samples in that class, and each sample must fall into the class whose center is closest to it.
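As a rough illustration of the k-means iteration described above, the following Python sketch alternates between assigning each sample to its nearest center and recomputing each center as the mean of its members. The function names, the toy two-dimensional points and the starting centers are illustrative assumptions and are not taken from the paper, whose analysis was carried out in R.

```python
import math


def assign(points, centers):
    """Return, for every point, the index of the nearest center (Euclidean)."""
    return [min(range(len(centers)), key=lambda c: math.dist(p, centers[c]))
            for p in points]


def update(points, labels, centers):
    """Recompute each center as the coordinate-wise mean of its members."""
    new_centers = []
    for c, old in enumerate(centers):
        members = [p for p, lab in zip(points, labels) if lab == c]
        if not members:              # an empty class keeps its previous center
            new_centers.append(old)
            continue
        dim = len(members[0])
        new_centers.append(tuple(sum(m[j] for m in members) / len(members)
                                 for j in range(dim)))
    return new_centers


def kmeans(points, centers, max_iter=100):
    """Iterate assignment and update steps until the labels stop changing."""
    centers = [tuple(c) for c in centers]
    labels = assign(points, centers)
    for _ in range(max_iter):
        centers = update(points, labels, centers)
        new_labels = assign(points, centers)
        if new_labels == labels:
            break
        labels = new_labels
    return labels, centers


# Toy data (hypothetical): two visibly separated groups of samples.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.2), (5.1, 4.9), (4.8, 5.0)]
labels, centers = kmeans(points, centers=[(0.0, 0.0), (6.0, 6.0)])
print(labels, centers)
```

The number of clusters k is fixed in advance here (by the length of the starting center list), which is exactly the limitation that motivates the estimation methods discussed later in the paper.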
The Nearest-Neighbor (NN) methods are based on a distance function for pairs of tumor messenger ribonucleic acid (mRNA) samples, such as the Euclidean distance or one minus the correlation of their gene expression profiles. For each tumor sample in the test set, the NN method proceeds as follows:

(a) find the k closest tumor samples in the learning set, and
(b) predict the class by majority vote; that is, choose the class that is most common among those k neighbors.

The number of neighbors k is chosen by cross-validation; that is, by running the NN classifier on the learning set only.

Class prediction is based on supervised data analysis methods that impose known groups on datasets. First, a training set is identified; that is, a group of genes with a known pattern of expression is used to "train" a dataset, and the data are then classified by comparison with the training set. This method is very useful in the sub-classification of similar samples, in cancer diagnosis, or in predicting cell or patient response to drug therapy. In some cases, this type of analysis has also been used to predict patient outcome, allowing a clinically relevant use of microarray data.

Fisher Linear Discriminant Analysis assumes that a random vector X has a multivariate normal distribution for each defined group, and that the covariance within each group is identical for all the groups. This makes the optimal decision function for the comparison of data a linear transformation of x. Variations on this theme include quadratic discriminant analysis, flexible discriminant analysis and penalized discriminant analysis.

Other methods of analysis include Support Vector Machines, which are based on constructing planes in a multidimensional space that separate the different classes of genes, and which set decision boundaries using an iterative training algorithm. Data are mapped into a higher dimensional space from the original input space, and a nonlinear decision boundary is assigned. This plane is known as the maximum margin hyperplane, and can be located by the use of a kernel function (a nonparametric weighting function). Moreover, artificial neural networks, or perceptrons, are another machine-learning technique. Multilayer perceptrons can be used to classify samples based on their gene expression.
Gene expression data for a sample are input into the model, and a response is generated in the next layer, ultimately triggering a response in the output layer. This output perceptron is expected to represent the class to which the sample belongs. Decision trees are another tool; they are built by using criteria to divide samples into nodes. Samples are divided recursively until they either fall into partitions or a termination condition is met. Ultimately, the intermediate nodes represent splitting points or partitioning criteria, and the leaf nodes represent the decisions.

4. The data

The data set used in the analysis in the present paper is the ALL data, which was obtained from Chiaretti et al. (2004) and can also be obtained from Bioconductor (2004). It consists of microarrays from 128 different individuals with Acute Lymphoblastic Leukemia (ALL). A number of additional covariates are available. The data have been normalized (using qqnorm) and it is the jointly normalized data that are available to us. The data are given in the form of an exprset object. The covariates include the date of diagnosis; the sex of the patient, coded as M and F; the age of the patient in years; the type and stage of the disease; a vector CR with the following values: 1: CR, remission achieved; 2: DEATH IN CR, patient died while in remission; 3: DEATH IN INDUCTION, patient died while in induction therapy; 4: REF, patient was refractory to therapy; and the date on which remission was achieved. Other covariates include an assigned molecular biology of the cancer (mainly for those with B-cell ALL), BCR/ABL, ALL/AF4, E2APBX etc.; the patient's response to multidrug resistance, either NEG or POS; a vector indicating whether the patient had continuous complete remission or not; a vector indicating whether the patient had a relapse or not; and many other follow-up and biological data.

The data consist of 83 males, 42 females and 3 NAs. The clustering variable is the type and stage of the disease; B indicates B-cell while T indicates T-cell. Both types B and T have 5 stages each. The B-cell group includes 4 observations of B, 9 observations of B1, 35 observations of B2, 22 observations of B3, and 9 observations of B4. The T-cell group includes 5 observations of T, 1 observation of T1, 5 observations of T2, 9 observations of T3 and 2 observations of T4.
The data set (ALL) was separated into two subsets of patients because it consists of 94 patients who have B-cells and 32 patients who have T-cells (Chiaretti et al., 2004). The goal here is to split the sample into categories and subcategories and to classify the data into homogeneous clusters.

5. Estimating the number of clusters

When there is more than one cluster of patients, a plot of the absolute eigenvalues of the data matrix is characterized by the intersection of a two-part curve: the first part has a steep negative slope and the second part is flat. The curve intersects the x-axis at x = k in the case of a similarity matrix. Plot-based inference can be formalized by splitting the values of the covariate (rank) at different points and finding the change point corresponding to the best fit for the response variable, based on minimum deviance (Dudoit et al., 2002). One expects the slope to change dramatically at the change point. The number of large eigenvalues is the index at which the slope changes minus 1. Since the projection operation forces eigenvalues to be exactly zero, such artificial eigenvalues can always be deleted before interpreting the plots or applying the slope change method (Dudoit et al., 2002).

The null hypothesis that k = 1 can also be tested by comparing the deviance of the simple linear regression with the minimum deviance of the broken-line regression. The null hypothesis is rejected if the difference between the two deviances is greater than the critical chi-squared value with one degree of freedom at the specified significance level. This procedure is ad hoc because the non-null deviance is minimized over all possible change points. Experience has shown that while positive square roots of the eigenvalues are superior for visual inspection, the slope change method works best using the absolute value of the raw eigenvalues.

The methods applied to the underlying data sets rely on Bioconductor, an open software development project for computational biology and bioinformatics in R, to automatically estimate the number of clusters for the large B_cells sample with 79 cases and the small T_cells sample with 32 cases. The automatic estimation of the number of clusters saves time and effort, particularly for inexperienced users.
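The slope change method described earlier in this section can be sketched as follows. This is a minimal illustration under stated assumptions: the two segments are fitted as independent least-squares lines (a simplification of a continuous broken-line regression), the deviance is taken to be the residual sum of squares, and the eigenvalues are made-up numbers rather than values from the ALL data. The function names are hypothetical.

```python
def rss_line(xs, ys):
    """Residual sum of squares of the least-squares line through (xs, ys)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx else 0.0
    intercept = my - slope * mx
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))


def best_change_point(values, min_seg=2):
    """Try every split of the ranked values and keep the minimum-deviance one.

    Returns (split, deviance), where `split` is the number of values in the
    first (steep) segment.
    """
    ranks = list(range(1, len(values) + 1))
    best = None
    for split in range(min_seg, len(values) - min_seg + 1):
        dev = (rss_line(ranks[:split], values[:split])
               + rss_line(ranks[split:], values[split:]))
        if best is None or dev < best[1]:
            best = (split, dev)
    return best


# Hypothetical absolute eigenvalues, sorted in decreasing order.
eigenvalues = [9.0, 6.0, 3.0, 0.4, 0.3, 0.2, 0.1]

split, deviance = best_change_point(eigenvalues)
change_index = split + 1        # first rank belonging to the flat segment
n_large = change_index - 1      # "index at which the slope changes minus 1"

# Deviance of a single straight line, for comparison with the two-segment fit.
single_line_deviance = rss_line(list(range(1, len(eigenvalues) + 1)), eigenvalues)
print(n_large, deviance, single_line_deviance)
```

The difference between `single_line_deviance` and `deviance` is the quantity that the text compares with a chi-squared critical value; the scaling needed to make that comparison valid is not specified in the paper and is left out here.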
Two libraries were applied: Mclust on the B_cells sample and hclust on the T_cells sample.

5.1. Estimating the Number of Clusters Using Mclust Algorithm

The Mclust algorithm was developed by Fraley and Raftery (2007-a) and assumes a normal (Gaussian) mixture model:

$$\prod_{i=1}^{n} \sum_{k=1}^{G} \tau_k \, \phi_k(x_i \mid \mu_k, \Sigma_k),$$

where $x$ represents the data, $G$ is the number of components, $\tau_k$ is the probability that an observation belongs to the $k$th component ($\tau_k \ge 0$; $\sum_{k=1}^{G} \tau_k = 1$), and

$$\phi_k(x_i \mid \mu_k, \Sigma_k) = (2\pi)^{-p/2} \, |\Sigma_k|^{-1/2} \exp\left\{ -\tfrac{1}{2} (x_i - \mu_k)^{T} \Sigma_k^{-1} (x_i - \mu_k) \right\}.$$

The exception is model-based hierarchical clustering, for which the model used is the classification likelihood with a parameterized normal distribution assumed for each class:

$$\prod_{i=1}^{n} \phi_{\ell_i}(x_i \mid \mu_{\ell_i}, \Sigma_{\ell_i}),$$

where the $\ell_i$ are labels indicating a unique classification of each observation: $\ell_i = k$ if $x_i$ belongs to the $k$th component. The components or clusters in both these models are ellipsoidal, centered at the means $\mu_k$. The covariances $\Sigma_k$ determine their other geometric features. Each covariance matrix is parameterized by eigenvalue decomposition in the form

$$\Sigma_k = \lambda_k D_k A_k D_k^{T},$$

where $D_k$ is the orthogonal matrix of eigenvectors, $A_k$ is a diagonal matrix whose elements are proportional to the eigenvalues of $\Sigma_k$, and $\lambda_k$ is a scalar (Banfield and Raftery, 1993). The orientation of the principal components of $\Sigma_k$ is determined by $D_k$, while $A_k$ determines the shape of the density contours; $\lambda_k$ specifies the volume of the corresponding ellipsoid, which is proportional to $\lambda_k^{d} |A_k|$, where $d$ is the data dimension ($d = p$ in the density above). Characteristics (orientation, volume and shape) of distributions are usually estimated from the data, and can be allowed to vary between clusters or constrained to be the same for all clusters. This parameterization includes but is not restricted to well-known variance models that are associated with various criteria for hierarchical clustering, such as equal-volume spherical variance ($\Sigma_k = \lambda I$) for the sum of squares criterion, constant variance, and unconstrained variance (Fraley and Raftery, 2006).

Several measures have been proposed for choosing the clustering model (parameterization and number of clusters). We use the Bayesian Information Criterion (BIC) approximation to the Bayes factor, which adds a penalty to the loglikelihood based on the number of parameters, and has performed well in a number of applications (Fraley and Raftery, 2007-b). The Bayesian Information Criterion (BIC) has the following features:

1. It is independent of the prior.
2. It can measure the efficiency of the parameterized model in terms of predicting the data.
3. It penalizes the complexity of the model, where complexity refers to the number of parameters in the model.
4. It can be used to choose the number of clusters, namely the number at which the model attains its best BIC.
5. In the form given below, the model with the lower value of BIC is the one to be preferred.

The BIC has the form

$$-2 \ln p(x \mid k) \approx \mathrm{BIC} = -2 \ln L + k \ln(N),$$

where:
$x$: the observed data.
$N$: the number of observations.
$k$: the number of free parameters to be estimated.
$p(x \mid k)$: the likelihood of the observed data given the number of parameters.
$L$: the maximized value of the likelihood function for the estimated model.

Mclust reports this criterion with the opposite sign, $2 \ln L - k \ln(N)$, so that in Mclust output a large BIC score indicates strong evidence for the corresponding model. BIC can be used to choose the number of clusters and the covariance parameterization (Mclust). Using the Mclust algorithm we can select the fitted model: each combination of a specification of the covariance matrices and a number of clusters corresponds to a separate probability model. The optimal model according to BIC is then chosen among parameterized Gaussian mixture models fitted by EM initialized by hierarchical clustering.
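To make the use of the criterion concrete, the sketch below evaluates BIC in the form $-2 \ln L + k \ln(N)$ for a few candidate models and keeps the one with the lowest score. The log-likelihoods and parameter counts are hypothetical placeholders, not results from the ALL data, and the helper name `bic` is an assumption made for illustration.

```python
import math


def bic(log_likelihood, n_params, n_obs):
    """BIC in the form -2 ln L + k ln(N); lower values are preferred."""
    return -2.0 * log_likelihood + n_params * math.log(n_obs)


# Hypothetical candidates: number of clusters G -> (maximized ln L, free parameters k).
candidates = {1: (-250.0, 2), 2: (-200.0, 5), 3: (-198.0, 8)}
n_obs = 100  # hypothetical number of observations N

scores = {g: bic(ll, k, n_obs) for g, (ll, k) in candidates.items()}
best_g = min(scores, key=scores.get)

# Mclust-style score (opposite sign): the same model is the one with the largest value.
mclust_style = {g: -s for g, s in scores.items()}
print(best_g, scores)
```

In this toy example the third component barely improves the likelihood, so the penalty term $k \ln(N)$ outweighs the gain and the two-component model is selected under either sign convention.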

5.2. Estimating the Number of Clusters Using hclust Algorithm

The hclust algorithm allows clustering genes by the similarity of their expression profiles. The purpose of the analysis is to select groups of genes that have common patterns of expression in different experiments, e.g. high expression in cancer tissues and low expression in normal tissues. These patterns of co-expression are usually interpreted as co-regulation. The similarity of expression patterns may not be captured by simple rules and can be described by similarity (or distance) measures. There are several measures of expression profile similarity between two genes:

1. Euclidean distance. This is the geometric distance in the multidimensional space.
2. Squared Euclidean distance. The squared Euclidean distance can be used in order to place progressively greater weight on objects that are further apart.
3. Manhattan distance. This distance is the average absolute difference over the set of experiments.
4. Chebychev distance. This distance is computed as $d_{ij} = \max_k |x_{ik} - x_{jk}|$. The measure is useful when one wants to define two objects as "different" if they differ on any one of the experiments. In SelTag all distance measures (1-3) are normalized to the number of fields involved in the calculation. This is useful when taking into account expression data with missing values.
5. $1 - r_{ij}$. This measure keeps close the profiles with positive correlation coefficients and is useful when one wants to detect co-regulated genes.
6. $1 - |r_{ij}|$. This measure keeps close the profiles with a higher absolute value of the correlation coefficient.
7. $1 + r_{ij}$. This measure keeps close the profiles with negative correlation coefficients (anti-correlated).

The hclust function produces the dendrogram that describes the clustering process. The function performs a hierarchical cluster analysis using a set of dissimilarities for the n objects being clustered. Initially, each object is assigned to its own cluster. A number of different clustering methods are provided, including Ward's minimum variance method.
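The distance measures listed above can be written down in a few lines. The sketch below is illustrative only: the function names and the three toy expression profiles are assumptions, and the Manhattan distance is implemented as an average absolute difference to match the normalized definition given in the text.

```python
import math


def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def squared_euclidean(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))


def mean_abs_diff(x, y):
    """Manhattan distance normalized by the number of experiments."""
    return sum(abs(a - b) for a, b in zip(x, y)) / len(x)


def chebychev(x, y):
    return max(abs(a - b) for a, b in zip(x, y))


def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical expression profiles of three genes over four experiments.
g1 = [1.0, 2.0, 3.0, 4.0]
g2 = [2.0, 4.0, 6.0, 8.0]   # perfectly correlated with g1
g3 = [4.0, 3.0, 2.0, 1.0]   # perfectly anti-correlated with g1

r12, r13 = pearson(g1, g2), pearson(g1, g3)
print(euclidean(g1, g2), mean_abs_diff(g1, g2), chebychev(g1, g2))
print(1 - r12, 1 - abs(r13), 1 + r13)   # correlation-based measures 5, 6 and 7
```

Note how g1 and g2 are far apart under the geometric distances but at distance zero under $1 - r_{ij}$, which is why correlation-based measures are preferred when the shape of the profile matters more than its magnitude.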
There are numerous ways in which clusters can be formed. Hierarchical clustering is one of the most straightforward methods; it can be either agglomerative or divisive. Agglomerative hierarchical clustering begins with every case being a cluster unto itself; at successive steps, similar clusters are merged. Divisive clustering starts with all cases in one cluster and ends with each case in an individual cluster. In agglomerative clustering, once a cluster is formed, it cannot be split; it can only be combined with other clusters. Agglomerative hierarchical clustering does not let cases be separated from clusters that they have joined: once in a cluster, always in that cluster. We choose the number of clusters at the point where the height in the hierarchical cluster diagram is maximized. Both agglomerative and divisive methods are used to estimate the number of clusters in small data sets. Having chosen the number of clusters using the hclust algorithm described above, we compare its results with those of the Partitioning Around Medoid (PAM) algorithm.

5.3. Estimating the Number of Clusters Using the Partitioning Around Medoid (PAM) Algorithm

This algorithm was designed by Kaufman and Rousseeuw (1990) as a partitioning method which operates on a dissimilarity matrix, e.g. a Euclidean distance matrix. PAM is more robust than k-means in the presence of noise and outliers because a medoid is less influenced by outliers or other extreme values than a mean. It works well for small data sets but does not scale well to large data sets. For a prespecified number of clusters K, the PAM procedure is based on the search for K representative objects, or medoids, among the observations to be clustered. After finding a set of K medoids, K clusters are constructed by assigning each observation to the nearest medoid. The goal is to find K medoids, $M^* = (m_1^*, \ldots, m_K^*)$, that minimize the sum of the dissimilarities of the observations to their closest medoid; that is,

$$M^* = \arg\min_{M} \sum_{i} \min_{k} d(x_i, m_k).$$

PAM therefore tends to be more robust than k-means. This algorithm has the following features:

a) It accepts a dissimilarity matrix.

b) It is more robust because it minimizes a sum of dissimilarities instead of a sum of squared Euclidean distances.
c) It provides a novel graphical display.
d) It allows the number of clusters to be selected as the number which maximizes the average silhouette width.

Among its graphs, PAM provides silhouette plots, which can be used to:

1. Select the number of clusters, and
2. Assess how well individual observations are clustered.

The silhouette width of observation $i$ is defined as

$$sil_i = \frac{b_i - a_i}{\max(a_i, b_i)},$$

where $a_i$ denotes the average dissimilarity between $i$ and all other observations in the cluster to which $i$ belongs, and $b_i$ denotes the minimum average dissimilarity of $i$ to objects in other clusters. Intuitively, objects with large silhouette width $sil_i$ are well clustered, while those with small $sil_i$ tend to lie between clusters.

The divisive coefficient represents the strength of the clustering structure found by the PAM algorithm. Let $dd(i)$ be the diameter of the cluster to which observation $i$ belongs before being split off as a single observation, divided by the diameter of the whole data set. The divisive coefficient (DC) is given by

$$DC = \frac{1}{n} \sum_{i=1}^{n} \bigl(1 - dd(i)\bigr),$$

where $n$ is the number of objects and $dd(i)$ is the normalized cluster diameter defined above. See McQuarrie and Tsai (1998).
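The silhouette width defined above can be computed directly from a dissimilarity matrix and a vector of cluster labels. The following is a minimal sketch with hypothetical names and toy one-dimensional data; it assumes at least two clusters and assigns width 0 to an observation that is alone in its cluster, which is a common convention rather than something stated in the paper.

```python
def silhouette_widths(dist, labels):
    """Silhouette width of every observation.

    dist   : square matrix (list of lists) of pairwise dissimilarities
    labels : list with the cluster label of each observation (>= 2 clusters)
    """
    n = len(labels)
    cluster_ids = sorted(set(labels))
    widths = []
    for i in range(n):
        own = [j for j in range(n) if labels[j] == labels[i] and j != i]
        if not own:                 # singleton cluster: width set to 0 by convention
            widths.append(0.0)
            continue
        # a_i: average dissimilarity to the other members of i's own cluster
        a_i = sum(dist[i][j] for j in own) / len(own)
        # b_i: smallest average dissimilarity to the members of any other cluster
        b_i = min(
            sum(dist[i][j] for j in range(n) if labels[j] == c) / labels.count(c)
            for c in cluster_ids if c != labels[i]
        )
        widths.append((b_i - a_i) / max(a_i, b_i))
    return widths


# Toy data (hypothetical): four points on a line, two obvious clusters.
values = [0.0, 1.0, 10.0, 11.0]
dist = [[abs(a - b) for b in values] for a in values]
labels = [0, 0, 1, 1]

widths = silhouette_widths(dist, labels)
average_width = sum(widths) / len(widths)
print(widths, average_width)
```

Repeating this calculation for partitions with different numbers of clusters and keeping the partition with the largest `average_width` is the manual selection rule attributed to PAM in the text.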

6. The Analysis of Acute Lymphoblastic Leukemia (ALL) Genetics Data

The data used for analysis and illustration in this paper is the ALL data set described above, which consists of 128 microarrays from different individuals with acute lymphoblastic leukemia. The grouping variable is BT, the type and stage of the disease; B indicates B-cell while T indicates T-cell.

6.1. Estimating the Number of Clusters Using Mclust Algorithm

When estimating the number of clusters in the B_cells data by the Mclust algorithm, the result was that the data are divided into two components with different variances, while the variance within each component ("cluster") is equal. Therefore, we concluded that there are two homogeneous clusters, with all symmetric observations within the same cluster (see Szekely and Rizzo, 2005). Figure 1 is produced by the Mclust algorithm and illustrates this result for the B-cell data. In Figure 1, two models can easily be seen: the upper one is marked by solid triangles and the other by empty triangles. Each triangle corresponds to a number of clusters, so that each model is evaluated at 9 different numbers of clusters. Moreover, as described in the characteristics of the Bayesian Information Criterion, the model with the lowest absolute value of BIC is preferred, which here is the upper one, marked with solid triangles. Also, both models reach their maximum BIC when the number of clusters is two. Therefore, Figure 1 supports the results of the Mclust algorithm output.

Figure 1: The BIC for different numbers of clusters in the ALL data set

6.2. Estimating the Number of Clusters Using hclust Algorithm

The second data set (T_cells) consists of 32 observations, so the suitable algorithm for estimating the number of clusters is hclust, as described in Section 5. The hclust algorithm uses agglomerative hierarchical clustering, which begins with every case being its own cluster. Similar clusters are merged, and we choose the number of clusters that maximizes the height at the hierarchical cluster dendrogram.
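A simplified picture of this procedure is sketched below: a naive agglomerative clustering with complete linkage that records the height of every merge, followed by a rule that cuts the tree where the jump between successive merge heights is largest. Reading "maximizes the height" as "largest gap between merge heights" is an interpretation made for this sketch, not a rule stated by the authors, and the function names and toy data are hypothetical.

```python
def complete_linkage_heights(dist):
    """Merge heights of naive agglomerative clustering with complete linkage."""
    clusters = [[i] for i in range(len(dist))]   # every case starts as its own cluster
    heights = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # complete linkage: cluster distance = largest pairwise dissimilarity
                h = max(dist[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or h < best[0]:
                    best = (h, a, b)
        h, a, b = best
        heights.append(h)
        merged = clusters[a] + clusters[b]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (a, b)]
        clusters.append(merged)
    return heights


def clusters_at_largest_gap(heights):
    """Number of clusters left when cutting at the largest jump in merge height.

    Assumes at least three objects, so that at least one gap exists.
    """
    n_objects = len(heights) + 1
    gaps = [heights[m + 1] - heights[m] for m in range(len(heights) - 1)]
    m = max(range(len(gaps)), key=gaps.__getitem__)
    return n_objects - (m + 1)      # m + 1 merges have been carried out below the cut


# Toy data (hypothetical): five points on a line forming two groups.
values = [0, 1, 10, 11, 12]
dist = [[abs(a - b) for b in values] for a in values]

heights = complete_linkage_heights(dist)
n_clusters = clusters_at_largest_gap(heights)
print(heights, n_clusters)
```

For real data the same idea is usually applied by inspecting the dendrogram visually, which is how the result for the T_cells sample is read off in this section.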

Figure 2: Cluster dendrogram for the second data set (T_cells), hclust (*, "complete"); vertical axis: Height.

Reading Figure 2 from the bottom, each object starts as a cluster by itself, so we have 32 clusters. Moving from the bottom to the top, object 1 is in one cluster and objects 2 and 3 are in another. These two clusters are merged into another cluster at height = 3. Objects 4 and 5 form one cluster and objects 6 and 7 another, at height = 3. The four clusters are merged into a cluster at height = 7. Objects 8 and 9 are merged into a cluster, as are objects 10 and 11, and both clusters are merged at height = 3. Objects 12 and 13 are merged into a cluster, and objects 14 and 15 into another, at height = 7. The two clusters at height = 7 are merged into another cluster at height = 15.

Objects 16 and 17 are merged into a cluster at height = 3, as are objects 18 and 19, and these two clusters are merged into one cluster at height = 7. The same pattern holds for objects 20 and 21 with objects 22 and 23, for objects 24 and 25 with objects 26 and 27, and for objects 28 and 29 with objects 30 and 31: each pair is merged at height = 3 and each pair of pairs is merged at height = 7. Object 32 is clustered with itself. The five clusters are merged into a cluster at height = 4. The clusters from object 24 to 32 are merged at height = 8. The objects from 16 to 23 and from 24 to 31 are merged into a cluster at height = 16. Here we have two clusters with maximum height, one at 15 and the other at 16. We conclude that our data are composed of 2 clusters.

6.3. Estimating the Number of Clusters Using the Partitioning Around Medoid (PAM) Algorithm

The goal of cluster analysis for our data set is to reach the maximum dissimilarity between observations of different clusters and the medoid, a wider cluster diameter, and the maximum average silhouette width. Using the PAM algorithm we achieved the following results:

When the number of clusters is 2, the average silhouette width is 0.6, the total maximum dissimilarity is 16 and the total cluster diameter is 31.

When the number of clusters is 3, the average silhouette width is 0.55, the total maximum dissimilarity is 16 and the total cluster diameter is 30.

When the number of clusters is 4, the average silhouette width is 0.51, the total maximum dissimilarity is 16 and the total cluster diameter is 29.
When number of clusters s 5 then the average slhouette wdth = 0.48 and the total maxmum dssmlarty s 16 and the total of cluster dameter s 28. Journal of Al Azhar Unversty-Gaza (Natural Scences), 2011, 13 (123)
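The selection rule behind these figures, computing the average silhouette width for each candidate number of clusters and keeping the largest, can be sketched as follows. This is an illustration under stated assumptions, not the PAM implementation used in the paper. The medoids are found by exhaustive search rather than PAM's build-and-swap heuristic, the data are one-dimensional toy values invented for the example, and all function names are hypothetical.

```python
from itertools import combinations


def assign(points, medoids):
    """Label each point with the position of its nearest medoid."""
    return [min(range(len(medoids)), key=lambda c: abs(p - points[medoids[c]]))
            for p in points]


def k_medoids_exhaustive(points, k):
    """Find the k medoids that minimise total distance to the nearest medoid.

    Tries every medoid set, so it is only practical for tiny samples;
    PAM reaches a similar objective with a build/swap heuristic.
    """
    def cost(medoids):
        return sum(min(abs(p - points[m]) for m in medoids) for p in points)

    best = min(combinations(range(len(points)), k), key=cost)
    return list(best), assign(points, best)


def average_silhouette_width(points, labels):
    """Mean of s(i) = (b - a) / max(a, b) over all points (needs >= 2 clusters)."""
    total = 0.0
    for i, p in enumerate(points):
        own = [abs(p - q) for j, q in enumerate(points)
               if labels[j] == labels[i] and j != i]
        if not own:  # singleton cluster: s(i) is taken as 0
            continue
        a = sum(own) / len(own)  # mean distance within the own cluster
        b = min(                 # mean distance to the nearest other cluster
            sum(abs(p - q) for j, q in enumerate(points) if labels[j] == c)
            / labels.count(c)
            for c in set(labels) if c != labels[i]
        )
        total += (b - a) / max(a, b)
    return total / len(points)


# Toy data (invented): two well-separated groups of four cases each.
toy = [1.0, 1.4, 2.0, 2.5, 10.0, 10.6, 11.0, 11.7]
widths = {}
for k in range(2, 6):
    _, labels = k_medoids_exhaustive(toy, k)
    widths[k] = average_silhouette_width(toy, labels)
best_k = max(widths, key=widths.get)
```

The silhouette width is undefined for a single cluster, so the loop starts at two clusters. On the toy data the width is largest for two clusters and drops once either group is split further.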

From the above results we can conclude that the best number of clusters is 2. In this case the average silhouette width, the dissimilarity between the observations and the cluster medoid, and the cluster diameter each reach their maximum value. We notice that beyond 2 clusters we get the same maximum dissimilarity, because the sample size is 32 and it is difficult to cluster a sample of this size into more than 3 clusters.

7. Conclusions:

In this paper we analyzed two data sets. The first one is the B_cells data, which can be considered a large sample, and the Mclust algorithm was used to estimate its number of clusters. Using this algorithm we showed that the number of clusters is two. We also described the fit of the Mclust algorithm, which uses the Bayesian Information Criterion (BIC), and showed that the maximum BIC is achieved when the number of clusters is two. To confirm these results, we compared the results of both the Mclust and hclust algorithms with the PAM algorithm, where we selected different numbers of clusters from 1 to 5, because there are 5 stages of disease in our data set, and computed the Average Silhouette Width for each choice of the number of clusters. We conclude that the number of clusters also equals two, since it corresponds to the maximum value of the Average Silhouette Width. Looking at the number of clusters in B_cells, we can conclude that there are only 2 clusters, which means that we reduce the 5 stages to 2 symmetric clusters and that each cluster should receive the same medication or treatment.

The second data set is T_cells, which is a small sample. We therefore used the hclust algorithm to estimate the number of clusters and concluded that the number of clusters is two. To confirm these results, we compared the results of the hclust algorithm with the PAM algorithm, where we selected different numbers of clusters from 1 to 5, as in the previous data set, and computed the Average Silhouette Width for each choice of the number of clusters. We then conclude that the number of clusters equals two, since it corresponds to the maximum value of the Average Silhouette Width. Looking at the number of clusters in T_cells, we can conclude that there are only 2 clusters, which means that we reduce the 5 stages to 2 symmetric clusters and that each cluster should receive the same medication or treatment.

From the above discussion of the results and the comparison between the Mclust, hclust and PAM algorithms, we conclude that the Mclust algorithm is suitable for large data sets and the hclust algorithm is suitable for small samples. We therefore recommend using the Mclust algorithm for large samples and the hclust algorithm for small samples.

8. Recommendations:

1. From the above results, we recommend concentrating future research on methods of detecting the genes that cause different types of cancer, in order to avoid this disease by isolating these genes in new generations.
2. The data used in this study were obtained from the "bioconductor.org" website. We recommend that a genetic data bank be established in Palestine. This would help in isolating the genes that cause hereditary diseases in Palestine.
3. For future studies, we propose joint research in the field of genetics between the Faculties of Medicine and Medical Sciences and the Department of Statistics at Al Azhar University.
4. We recommend conducting further research on using Neural Network techniques for estimating the number of clusters in genetics data.
5. We also recommend conducting further research on testing whether there is significant evidence of a difference in the types of cancer between genetic causes and other causes in Palestine.
6. It is also recommended that the clustering algorithms discussed in this paper be applied in other fields, such as Economics and Human Sciences.

References:

1. Banfield J. D. and Raftery A. E. (1993); Model-based Gaussian and non-Gaussian clustering; Biometrics, 49:803-821.
2. Bioconductor (2004): Open software development for computational biology and bioinformatics; Gentleman R., Carey V. J., Bates D. M., Bolstad B., Dettling M., Dudoit S., Ellis B., Gautier L., Ge Y., and others; Genome Biology, Vol. 5, R80.
3. Chen G., Banerjee N., Jaradat S. A., Tanaka T. S., Ko M. S. H. and Zhang M. Q. (2002); Evaluation and Comparison of Clustering Algorithms in Analyzing ES Cell Gene Expression Data; Statistica Sinica, 12:241-262.
4. Chiaretti S., Li X., Gentleman R., Vitale A., Vignetti M., Mandelli F., Ritz J. and Foa R. (2004); Gene expression profile of adult T-cell acute lymphocytic leukemia identifies distinct subsets of patients with different response to therapy and survival; Blood, Vol. 103, No. 7.
5. De Bin R. and Risso D. (2011); A novel approach to the clustering of microarray data via nonparametric density estimation; BMC Bioinformatics, 12:49.
6. Dudoit S., Fridlyand J. and Speed T. P. (2002); Comparison of Discrimination Methods for the Classification of Tumors Using Gene Expression Data; Journal of American Statistical Association, Vol. 97, No. 457, 70-85.
7. Fraley C. and Raftery A. E. (2006); Model-based microarray image analysis; R News, 6:60-63.
8. Fraley C. and Raftery A. E. (2007-a); Model-based methods of classification: using the mclust software in chemometrics; Journal of Statistical Software, 18(6).
9. Fraley C. and Raftery A. E. (2007-b); Bayesian regularization for normal mixture estimation and model-based clustering; Journal of Classification, 24:155-181.
10. Kaufman L. and Rousseeuw P. J. (1990); Finding Groups in Data: An Introduction to Cluster Analysis; Wiley-Interscience, New York (Series in Applied Probability and Statistics).
11. McQuarrie A. D. R. and Tsai C. L. (1998); Regression and Time Series Model Selection; World Scientific.
12. Szekely G. J. and Rizzo M. L. (2005); Hierarchical Clustering via Joint Between-Within Distances: Extending Ward's Minimum Variance Method; Journal of Classification, 22(2):151-183.