Estimating the Number of Clusters in Genetics of Acute Lymphoblastic Leukemia Data


Journal of Al Azhar University-Gaza (Natural Sciences), 2011, 13

Mahmoud K. Okasha, Khaled I. A. Almghari
Department of Applied Statistics, Al-Azhar University - Gaza
Received 16/11/2011, Accepted 31/12/2011

Abstract: Cluster analysis is a statistical technique that has been widely used to cluster gene expressions and other data in many fields. A recurring problem in the literature, however, is the choice of the number of clusters. In particular, estimating the number of clusters in a given population, especially for gene expressions, is of great interest and needs to be addressed. Many algorithms are used in practice for that purpose in different fields. In this paper we examine different clustering algorithms for estimating the number of clusters, based on probabilities, covariance matrices, and eigenvalues, on real data sets using R package algorithms. Specifically, we examine the model-based algorithm (Mclust) and the hierarchical clustering algorithm (hclust), and compare these algorithms with the Partitioning Around Medoids (PAM) algorithm. We found that the first algorithm can be used only for large data sets and that the second can be safely used for small data sets. Mclust is a model-based clustering approach that fits mixture models with the expectation-maximization (EM) algorithm and selects among them with the Bayesian Information Criterion (BIC). The results of these two algorithms are compared with a third approach based on the PAM algorithm, in which the number of clusters is selected manually as the number that maximizes the average silhouette width. Although the latter algorithm requires the number of clusters to be chosen manually, it had the best performance. The first two algorithms, however, can be automated to produce the best estimate for the number of clusters in a given data set. These algorithms can be applied not only to genetic data but also in many other fields such as market research.

Keywords: clustering, model based algorithm, hierarchical clustering, Partitioning Around Medoids, Bayesian Information Criterion, average silhouette, hierarchical tree, gene expression.

1. Introduction

Cluster analysis is a collection of statistical methods used to detect groups of observations that have similar behavior or characteristics in a set of data. Cluster analysis is generally classified into two different techniques, namely hierarchical and non-hierarchical procedures. In the hierarchical procedures, the goal is to construct a hierarchy or a decision-tree-like structure (dendrogram) to illustrate the relationship among entities. In the non-hierarchical method, a position in the measurement space is taken as a central place and distances are measured from that central point (Partitioning Around Medoids). Hierarchical clustering involves the concept of ordering. The ordering is driven by the number of observations that can be combined at a time, based on the assumption that the distance between two observations is not statistically different from zero. The clusters can be arrived at either by weeding out observations (divisive method) or by joining together similar observations (agglomerative method). However, estimating the number of clusters in any data remains the main problem (Chen et al., 2002).

2. Aims of the study

Acute lymphoblastic leukemia has many different types and causes, and every type has many different stages. The main goal of the analysis of acute lymphoblastic leukemia data is to split the sample into categories and subcategories and to classify the data into homogeneous clusters. To achieve this goal, cluster analysis is usually used to:
1. Classify homogeneous cases into the same clusters and heterogeneous ones into different clusters.
2. Reduce the sample cases to a few different clusters with similar properties.
3. Determine the number of clusters.

Allocating homogeneous objects to the same cluster means that all patients with the same type of disease and at the same stage will be classified into the same group. The benefit is that patients in the same cluster should be given similar treatment protocols. Moreover, classifying the genes which cause the disease makes it easier to isolate these genes in new generations to avoid acute lymphoblastic leukemia. Afterwards, the disease can be avoided by using DNA technologies that can prevent it in people who have the genes which cause acute lymphoblastic leukemia. The dependent variable to be clustered here is usually a classification variable indicating the type and stage of the disease. Thus, this paper aims to estimate the number of clusters in acute lymphoblastic leukemia genetics data.

3. Statistical Models in Differential Gene Expression

Several model-based techniques have been used in the analysis of acute lymphoblastic leukemia data and of microarray data in general. The approach is based on multivariate exploratory data analysis, aiming at a number of techniques that allow quick viewing of distinct gene expression patterns within a data set.

Principal Component Analysis (PCA) has been used in the analysis of multivariate data by expressing the maximum variance with a minimum number of principal components. Redundant components are eliminated, thus reducing the dimensions of the input vectors (De Bin and Risso, 2011).

Singular Value Decomposition (SVD) treats microarray data as a matrix $A$, which is composed of $n$ rows (genes) by $p$ columns (experiments). SVD is represented by the equation

$$A_{n \times p} = U_{n \times n}\, S_{n \times p}\, V^{T}_{p \times p}$$

with $U$ being the gene coefficient vectors, $S$ the mode amplitudes and $V^{T}$ the expression level vectors.

One of the statistical techniques most familiar to biologists is hierarchical clustering, which presents data as a gene list within a dendrogram to perform a bottom-up analysis. This can be obtained by assigning a similarity score to all gene pairs by calculating Pearson's correlation coefficient, and building a tree of genes. K-means clustering, however, is a top-down technique that groups a collection of nodes into a fixed number of clusters (k) through an iterative process. Each class must have a center point that is the average position of all the points in that class, and each sample must fall into the class to which its center is closest.
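The k-means iteration described above can be sketched in a few lines. This is only an illustrative Python sketch, not code from the paper (which relies on R packages): the function name `kmeans`, the starting centers and the six two-gene "samples" are all hypothetical, and the procedure shown is plain Lloyd-style k-means with squared Euclidean distance.

```python
def kmeans(points, centers, max_iter=100):
    """Lloyd-style k-means on tuples of floats, starting from the given centers."""
    centers = [tuple(c) for c in centers]
    labels = []
    for _ in range(max_iter):
        # Assignment step: each sample joins the class whose center is closest.
        labels = [
            min(range(len(centers)),
                key=lambda k: sum((a - b) ** 2 for a, b in zip(p, centers[k])))
            for p in points
        ]
        # Update step: each center becomes the average position of its members.
        new_centers = []
        for k, old in enumerate(centers):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                new_centers.append(
                    tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new_centers.append(old)  # an empty class keeps its old center
        if new_centers == centers:
            break  # centers stopped moving, so the assignment is stable
        centers = new_centers
    return labels, centers


# Hypothetical expression profiles (two genes) for six samples.
samples = [(1.0, 1.1), (0.9, 1.0), (1.2, 0.8), (5.0, 5.2), (5.1, 4.9), (4.8, 5.0)]
labels, centers = kmeans(samples, centers=[samples[0], samples[3]])
```

Note that k is fixed in advance here (two starting centers), which is exactly why the number of clusters has to be estimated separately.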
The Nearest-Neighbor (NN) methods are based on a distance function for pairs of tumor messenger ribonucleic acid (mRNA) samples, such as the Euclidean distance or one minus the correlation of their gene expression profiles. By implementing the NN method for each tumor sample in the test set we can: (a) find the k closest tumor samples in the learning set, and (b) predict the class by majority vote; that is, choose the class that is most common among those k neighbors. The number of neighbors k is chosen by cross-validation; that is, by running the NN classifier on the learning set only.

Class prediction is based on supervised data analysis methods that impose known groups on datasets. First, a training set is identified; that is, a group of genes with a known pattern of expression is used to "train" a dataset, by comparing the data to the training set and thus classifying it. This method is very useful in the sub-classification of similar samples, in cancer diagnosis, or to predict cell or patient response to drug therapy. In some cases, this type of analysis has also been used to predict patient outcome, allowing for a clinically relevant use of microarray data.

Fisher Linear Discriminant Analysis assumes that a random vector X has a multivariate normal distribution for each defined group, and that the covariance within each group is identical for all the groups. This makes the optimal decision function for the comparison of data a linear transformation of x. Variations on this theme include quadratic discriminant analysis, flexible discriminant analysis and penalized discriminant analysis.

Other methods of analysis include Support Vector Machines, which are based on constructing planes in a multidimensional space that separate the different classes of genes, and which set decision boundaries using an iterative training algorithm. Data are mapped into a higher dimensional space from their original input space, and a nonlinear decision boundary is assigned. This plane is known as the maximum margin hyperplane, and can be located by the use of a kernel function (a nonparametric weighting function).

Moreover, Artificial Neural Networks, or perceptrons, are another machine-learning technique. Multilayer perceptrons can be used to classify samples based on their gene expression. Gene expression data for a sample are input into the model, and a response is generated in the next layer, ultimately triggering a response in the output layer. This output perceptron is expected to represent the class to which the sample belongs.

The method of Decision Trees is another tool, which can be built by using criteria to divide samples into nodes. Samples are divided recursively until they either fall into partitions, or until a termination condition is

met. Ultimately, the intermediate nodes represent splitting points or partitioning criteria, and the leaf nodes represent the decisions.

4. The data

The data set used in the analysis in the present paper is the ALL data, which was obtained from Chiaretti et al. (2004) and can also be obtained from Bioconductor (2004). It consists of a sample of microarrays from 128 different individuals with Acute Lymphoblastic Leukemia (ALL). A number of additional covariates are available. The data have been normalized (using qqnorm) and it is the jointly normalized data that are available to us. The data are given in the form of an exprSet object.

The covariates include the date of diagnosis; the sex of the patient, coded as M and F; the age of the patient in years; the type and stage of the disease; and a vector CR with the following values:
1: CR, remission achieved;
2: DEATH IN CR, patient died while in remission;
3: DEATH IN INDUCTION, patient died while in induction therapy;
4: REF, patient was refractory to therapy.

Further covariates are the date on which remission was achieved; an assigned molecular biology of the cancer (mainly for those with B-cell ALL), BCR/ABL, ALL/AF4, E2APBX etc.; the patient's response to multidrug resistance, either NEG or POS; a vector indicating whether the patient had continuous complete remission or not; a vector indicating whether the patient had relapse or not; and many other follow-up and biological data.

The data consist of 83 males, 42 females and 3 NAs. The clustering variable is the type and stage of the disease; B indicates B-cell while T indicates T-cell. Both types B and T have 5 stages each. In these stages there are 4 observations of B, 9 observations of B1, 35 observations of B2, 22 observations of B3, and 9 observations of B4. Moreover, T-cell includes 5 observations of T, 1 observation of T1, 5 observations of T2, 9 observations of T3 and 2 observations of T4. The data set (ALL) was separated into two subsets of patients because it consists of 94 patients who have B-cells and 32 patients who have T-cells (Chiaretti et al., 2004). The goal here is to split the sample into categories and subcategories and to classify the data into homogeneous clusters.

5. Estimating the number of clusters

When there is more than one cluster of patients, a plot of the absolute eigenvalues of the data matrix is characterized by the intersection of a curve in two parts: the first has a high negative slope and the second is flat. In the case of a similarity matrix, the curve intersects the x-axis at the value x = k. Plot-based inference can be formalized by splitting the values of the covariate (rank) at different points and finding the inflection point corresponding to the best fit for the response variable based on minimum deviance (Dudoit et al., 2002). One expects the slope to change dramatically at the inflection point. The number of large eigenvalues is the index at which the slope changes minus 1. Since the projection operation forces eigenvalues to be exactly zero, artificial eigenvalues can always be deleted before interpreting the plots or applying the slope change method (Dudoit et al., 2002).

The null hypothesis that k = 1 can also be tested by comparing the deviance of the simple linear regression with the minimum deviance of the broken line regression. The null hypothesis is rejected if the difference between the two deviances is greater than the expected chi-squared value with one degree of freedom at the specified significance level. This procedure is ad hoc because the non-null deviance is minimized over all possible change points. Experience has shown that while positive square roots of the eigenvalues are superior for visual inspection, the slope change method works best using the absolute values of the raw eigenvalues.

The methods applied to the underlying data sets rely on Bioconductor, an open software development project for computational biology and bioinformatics built on R, to automatically estimate the number of clusters for the large B_cells sample with 79 cases and the small T_cells sample with 32 cases. The automatic estimation of the number of clusters saves time and effort, particularly for inexperienced users.
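As a rough illustration of the slope change idea described in this section, the sketch below fits two separate least-squares lines to the ranked absolute eigenvalues for every admissible split and keeps the split with the smallest total residual sum of squares (used here as the deviance). It is a simplified reading under stated assumptions, not the procedure of the cited authors: the function names and the eigenvalues are hypothetical, each segment is required to contain at least two points, and no chi-squared test is carried out.

```python
def line_rss(xs, ys):
    """Residual sum of squares of the least-squares line through (xs, ys)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return sum((y - my - slope * (x - mx)) ** 2 for x, y in zip(xs, ys))


def slope_change_rank(eigenvalues):
    """Rank at which the flat segment starts, chosen by minimum total RSS."""
    values = sorted((abs(e) for e in eigenvalues), reverse=True)
    ranks = list(range(1, len(values) + 1))
    best_rank, best_rss = None, float("inf")
    # 'start' is the 0-based position where the second (flat) segment begins;
    # both segments need at least two points to define a line.
    for start in range(2, len(values) - 1):
        rss = (line_rss(ranks[:start], values[:start])
               + line_rss(ranks[start:], values[start:]))
        if rss < best_rss:
            best_rank, best_rss = ranks[start], rss
    return best_rank


# Hypothetical eigenvalues: two large ones followed by a flat tail.
eigs = [9.0, 6.0, 1.0, 0.9, 0.8, 0.7, 0.6, 0.5]
change_rank = slope_change_rank(eigs)
n_large = change_rank - 1  # "index at which the slope changes minus 1"
```

Because each segment needs two points, this sketch cannot return fewer than two large eigenvalues; the case k = 1 would have to be handled by the separate test against a single regression line mentioned above.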
Two libraries were applied: Mclust on the B_cells sample and hclust on the T_cells sample.

5.1. Estimating the Number of Clusters Using Mclust Algorithm

The Mclust algorithm was developed by Fraley and Raftery (2007-a) and assumes a normal (Gaussian) mixture model:

$$\prod_{i=1}^{n} \sum_{k=1}^{G} \tau_k\, \phi_k(x_i \mid \mu_k, \Sigma_k),$$

where $x$ represents the data, $G$ is the number of components, $\tau_k$ is the probability that an observation belongs to the $k$th component ($\tau_k \ge 0$; $\sum_{k=1}^{G} \tau_k = 1$), and

$$\phi_k(x_i \mid \mu_k, \Sigma_k) = (2\pi)^{-d/2}\, |\Sigma_k|^{-1/2} \exp\left\{ -\tfrac{1}{2} (x_i - \mu_k)^{T} \Sigma_k^{-1} (x_i - \mu_k) \right\},$$

where $d$ is the data dimension. The exception is model-based hierarchical clustering, for which the model used is the classification likelihood with a parameterized normal distribution assumed for each class:

$$\prod_{i=1}^{n} \phi_{\ell_i}(x_i \mid \mu_{\ell_i}, \Sigma_{\ell_i}),$$

where the $\ell_i$ are labels indicating a unique classification of each observation: $\ell_i = k$ if $x_i$ belongs to the $k$th component.

The components or clusters in both these models are ellipsoidal, centered at the means $\mu_k$. The covariances $\Sigma_k$ determine their other geometric features. Each covariance matrix is parameterized by eigenvalue decomposition in the form

$$\Sigma_k = \lambda_k D_k A_k D_k^{T},$$

where $D_k$ is the orthogonal matrix of eigenvectors, $A_k$ is a diagonal matrix whose elements are proportional to the eigenvalues of $\Sigma_k$, and $\lambda_k$ is a scalar (Banfield and Raftery, 1993). The orientation of the principal components of $\Sigma_k$ is determined by $D_k$, while $A_k$ determines the shape of the density contours; $\lambda_k$ specifies the volume of the corresponding ellipsoid, which is proportional to $\lambda_k^{d} |A_k|$. Characteristics (orientation, volume and shape) of distributions are usually estimated from the data, and can be allowed to vary between clusters, or constrained to be the same for all clusters. This parameterization includes but is not restricted to well-known variance models that are associated with various criteria for hierarchical clustering, such as equal-volume spherical variance ($\Sigma_k = \lambda I$) for the sum of squares criterion, constant variance, and unconstrained variance (Fraley and Raftery, 2006).

Several measures have been proposed for choosing the clustering model (parameterization and number of clusters). We use the Bayesian Information Criterion (BIC) approximation to the Bayes factor, which adds a penalty to the log-likelihood based on the number of parameters, and has performed well in a number of applications (Fraley and Raftery, 2007-b). The BIC has the following features:
1. It is independent of the prior.
2. It can measure the efficiency of the parameterized model in terms of predicting the data.
3. It penalizes the complexity of the model, where complexity refers to the number of parameters in the model.
4. It can be used to choose the number of clusters, by selecting the model with the best BIC.
5. In the form given below, the model with the lower value of BIC is preferred. Mclust reports the criterion with the opposite sign, $2\ln L - k\ln(n)$, so in Mclust output the model with the larger BIC is preferred.

The BIC has the form

$$-2 \ln p(x \mid k) \approx \mathrm{BIC} = -2 \ln L + k \ln(n)$$

where:
$x$: the observed data.
$n$: the number of observations.
$k$: the number of free parameters to be estimated.
$p(x \mid k)$: the likelihood of the observed data given the number of parameters.
$L$: the maximized value of the likelihood function for the estimated model.

In Mclust's sign convention, a large BIC score indicates strong evidence for the corresponding model. BIC can be used to choose both the number of clusters and the covariance parameterization (Mclust). Using the Mclust algorithm we can select the fitted model: each combination of a specification of the covariance matrices and a number of clusters corresponds to a separate probability model. The optimal model is then chosen according to BIC, with EM initialized by hierarchical clustering for parameterized Gaussian mixture models.
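The model selection step can be illustrated with a deliberately small example. The sketch below is not the Mclust implementation: it assumes univariate data, a single variance shared by all components, a deterministic start from contiguous chunks of the sorted data (Mclust initializes EM by hierarchical clustering instead), and hypothetical function names and data values. It scores each candidate number of components with BIC in the larger-is-better convention, $2\ln L - k\ln(n)$.

```python
import math


def normal_pdf(x, mean, var):
    """Density of a univariate normal distribution."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def mixture_loglik(data, g, n_iter=200):
    """Fit a g-component, equal-variance Gaussian mixture by EM; return ln L."""
    xs = sorted(data)
    n = len(xs)
    # Deterministic start (an assumption): g contiguous chunks of the sorted data.
    chunks = [xs[j * n // g:(j + 1) * n // g] for j in range(g)]
    weights = [len(c) / n for c in chunks]
    means = [sum(c) / len(c) for c in chunks]
    var = sum((x - m) ** 2 for c, m in zip(chunks, means) for x in c) / n
    for _ in range(n_iter):
        # E-step: posterior probability of each component for each observation.
        resp = []
        for x in xs:
            dens = [w * normal_pdf(x, m, var) for w, m in zip(weights, means)]
            total = sum(dens)
            resp.append([d / total for d in dens])
        # M-step: update mixing proportions, means and the common variance.
        for k in range(g):
            nk = sum(r[k] for r in resp)
            weights[k] = nk / n
            if nk > 1e-12:  # leave an (almost) empty component where it is
                means[k] = sum(r[k] * x for r, x in zip(resp, xs)) / nk
        var = sum(r[k] * (x - means[k]) ** 2
                  for r, x in zip(resp, xs) for k in range(g)) / n
    return sum(
        math.log(sum(w * normal_pdf(x, m, var) for w, m in zip(weights, means)))
        for x in xs
    )


def bic(data, g):
    """BIC in the larger-is-better convention: 2 ln L - k ln(n)."""
    n_params = 2 * g  # g means + (g - 1) mixing proportions + 1 common variance
    return 2.0 * mixture_loglik(data, g) - n_params * math.log(len(data))


# Hypothetical one-dimensional measurements forming two well-separated groups.
values = [0.8, 0.9, 1.0, 1.1, 1.2, 1.3, 1.0, 4.8, 5.0, 5.1, 5.3, 4.9]
scores = {g: bic(values, g) for g in (1, 2, 3)}
best_g = max(scores, key=scores.get)
```

A full analysis would repeat this comparison over the covariance parameterizations as well as over the number of components, which is what the paragraph above describes for Mclust.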

5.2. Estimating the Number of Clusters Using hclust Algorithm

The hclust algorithm allows clustering genes by the similarity of their expression profiles. The purpose of the analysis is to select groups of genes that have common patterns of expression in different experiments, e.g. high expression in cancer tissues and low expression in normal tissues. These patterns of co-expression are usually treated as co-regulation. The similarity of the expression patterns may not be limited by simple rules and can be described by similarity (or distance) measures. There are several measures of expression profile similarity between two genes:
1. Euclidean distance. This is the geometric distance in the multidimensional space.
2. Squared Euclidean distance. The squared Euclidean distance can be used in order to place progressively greater weight on objects that are further apart.
3. Manhattan distance. This distance is the average absolute difference over the set of experiments.
4. Chebychev distance. This distance is computed as $d_{ij} = \max_k |x_{ik} - x_{jk}|$. The measure is useful when one wants to define two objects as "different" if they differ on any one of the experiments. In SelTag all distance measures (1-3) are normalized to the number of fields involved in the calculation. This is useful when taking into account expression data with missing values.
5. $1 - r_{ij}$, where $r_{ij}$ is the correlation coefficient between profiles $i$ and $j$. This measure keeps profiles with positive correlation coefficients close and is useful when one wants to detect co-regulated genes.
6. $1 - |r_{ij}|$. This measure keeps profiles with a higher absolute value of the correlation coefficient close.
7. $1 + r_{ij}$. This measure keeps profiles with negative correlation coefficients (anti-correlated) close.

The hclust output describes the dendrogram produced by the clustering process. The function performs a hierarchical cluster analysis using a set of dissimilarities for the n objects being clustered. Initially, each object is assigned to its own cluster. A number of different clustering methods are provided, including Ward's minimum variance method.
There are numerous ways in which clusters can be formed. Hierarchical clustering is one of the most straightforward methods; it can be either agglomerative or divisive. Agglomerative hierarchical clustering begins with every case being a cluster unto itself; at successive steps, similar clusters are merged. Divisive clustering starts with all cases in one cluster and ends up with each case in an individual cluster. In agglomerative clustering, once a cluster is formed, it cannot be split; it can only be combined with other clusters. Agglomerative hierarchical clustering does not let cases be separated from clusters that they have joined: once in a cluster, always in that cluster. The number of clusters can be chosen where the height in the hierarchical cluster dendrogram reaches its maximum. Both agglomerative and divisive methods are used to estimate the number of clusters in small data sets. After choosing the number of clusters using the hclust algorithm described above, we compare its results with the results of the Partitioning Around Medoids (PAM) algorithm.

5.3. Estimating the Number of Clusters Using the Partitioning Around Medoids (PAM) Algorithm

This algorithm was designed by Kaufman and Rousseeuw (1990) as a partitioning method which operates on the dissimilarity matrix, e.g. the Euclidean distance matrix. PAM is more robust than k-means in the presence of noise and outliers because a medoid is less influenced by outliers or other extreme values than a mean. It works well for small data sets but does not scale well to large data sets. For a prespecified number of clusters K, the PAM procedure is based on the search for K representative objects, or medoids, among the observations to be clustered. After finding a set of K medoids, K clusters are constructed by assigning each observation to the nearest medoid. The goal is to find K medoids, $M^{*} = (m_1^{*}, \ldots, m_K^{*})$, that minimize the sum of the dissimilarities of the observations to their closest medoid; that is,

$$M^{*} = \arg\min_{M} \sum_{i} \min_{k} d(x_i, m_k).$$

This algorithm has the following features:
a) It accepts the dissimilarity matrix.
b) It is more robust than k-means because it minimizes a sum of dissimilarities instead of a sum of squared Euclidean distances.
c) It provides a novel graphical display.
d) It allows selecting the number of clusters by choosing the number which maximizes the average silhouette width.

Among its graphs, PAM provides silhouette plots, which can be used to:
1. Select the number of clusters, and
2. Assess how well individual observations are clustered.

The silhouette width of observation $i$ is defined as

$$sil_i = \frac{b_i - a_i}{\max(a_i, b_i)},$$

where $a_i$ denotes the average dissimilarity between $i$ and all other observations in the cluster to which $i$ belongs, and $b_i$ denotes the minimum average dissimilarity of $i$ to objects in other clusters. Intuitively, objects with large silhouette width $sil_i$ are well clustered; those with small $sil_i$ tend to lie between clusters.

The divisive coefficient represents the strength of the clustering structure found by the PAM algorithm. Let $dd(i)$ be the diameter of the cluster to which observation $i$ belongs before being split off as a single observation, divided by the diameter of the whole data set. The divisive coefficient (DC) is given by

$$DC = \frac{1}{n} \sum_{i=1}^{n} \bigl(1 - dd(i)\bigr),$$

where $n$ is the number of objects and $dd(i)$ is the diameter defined above. See McQuarrie and Tsai (1998).
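The silhouette criterion defined above can be computed directly from a dissimilarity matrix and a vector of cluster labels. The following Python sketch is illustrative only: the function name, the six data points and the two candidate labelings are hypothetical, the labelings are supplied by hand rather than produced by PAM, and a singleton cluster is given silhouette width 0 by convention. It assumes at least two clusters and no exact duplicates within a cluster.

```python
def average_silhouette(dist, labels):
    """Average silhouette width from a full dissimilarity matrix and labels."""
    n = len(labels)
    clusters = sorted(set(labels))
    widths = []
    for i in range(n):
        own = [j for j in range(n) if labels[j] == labels[i] and j != i]
        if not own:
            widths.append(0.0)  # convention for an observation alone in its cluster
            continue
        # a_i: average dissimilarity to the other members of i's own cluster.
        a = sum(dist[i][j] for j in own) / len(own)
        # b_i: smallest average dissimilarity to the members of any other cluster.
        b = min(
            sum(dist[i][j] for j in range(n) if labels[j] == c) / labels.count(c)
            for c in clusters if c != labels[i]
        )
        widths.append((b - a) / max(a, b))
    return sum(widths) / n


# Hypothetical one-dimensional observations and their absolute differences.
obs = [1.0, 1.2, 0.9, 5.0, 5.1, 4.8]
dist = [[abs(u - v) for v in obs] for u in obs]

# Two hand-made candidate partitions (in practice these would come from PAM).
sil_k2 = average_silhouette(dist, [0, 0, 0, 1, 1, 1])
sil_k3 = average_silhouette(dist, [0, 0, 0, 1, 1, 2])
best_k = 2 if sil_k2 > sil_k3 else 3
```

Selecting the number of clusters then amounts to repeating the clustering for each candidate K and keeping the K with the largest average silhouette width.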

6. The Analysis of Acute Lymphoblastic Leukemia (ALL) Genetics Data

The data used for analysis and illustration in this paper is the ALL data set described above, which consists of 128 microarrays from different individuals with acute lymphoblastic leukemia. The grouping variable is BT, the type and stage of the disease; B indicates B-cell while T indicates T-cell.

6.1. Estimating the Number of Clusters Using Mclust Algorithm

When the number of clusters in the B_cells data was estimated by the Mclust algorithm, the data were divided into two components with different variances, with equal variance within each component (cluster). Therefore, we concluded that there are two homogeneous clusters, with all symmetric observations within the same cluster (see Szekely and Rizzo, 2005). Figure 1 is produced by the Mclust algorithm and illustrates this result for the B-cell data. In Figure 1 below, two models can be seen easily: the upper one is marked by solid triangles and the other by empty triangles. Each triangle represents a number of clusters, so that each model has 9 different numbers of clusters. As described in the characteristics of the Bayesian Information Criterion, the model with the lowest absolute value of BIC is preferred, which here is the upper one, marked with solid triangles. Both models also reach their maximum BIC when the number of clusters is two. Therefore, Figure 1 supports the results of the Mclust algorithm output.

Figure 1: The BIC for different numbers of clusters in the ALL data set

6.2. Estimating the Number of Clusters Using hclust Algorithm

The second data set (T_cells) consists of 32 observations, so the suitable algorithm for estimating the number of clusters is hclust, as described in Section 5. The hclust algorithm uses agglomerative hierarchical clustering, which begins with every case being a cluster. Similar clusters are merged, and we can choose the number of clusters that maximizes the height in the hierarchical cluster dendrogram.
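One way to read "maximizing the height" is to cut the dendrogram where the jump between consecutive merge heights is largest. The Python sketch below follows that reading; it is an assumption about the rule, not the authors' code. It uses complete linkage (the method shown in the hclust output of Figure 2), hypothetical function names, and six made-up observations rather than the T_cells data.

```python
def complete_linkage_heights(dist):
    """Merge heights of agglomerative clustering with complete linkage."""
    clusters = [[i] for i in range(len(dist))]  # every case starts as a cluster
    heights = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Complete linkage: distance between the two farthest members.
                d = max(dist[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        heights.append(d)
        clusters[a] = clusters[a] + clusters[b]  # merged clusters are never split
        del clusters[b]
    return heights


def clusters_by_largest_gap(heights):
    """Cut the dendrogram where consecutive merge heights differ the most."""
    n = len(heights) + 1
    gaps = [heights[m + 1] - heights[m] for m in range(len(heights) - 1)]
    m = max(range(len(gaps)), key=gaps.__getitem__)
    # After merge number m (0-based) has been made, n - (m + 1) clusters remain.
    return n - (m + 1)


# Hypothetical one-dimensional observations and their absolute differences.
obs = [1.0, 1.2, 0.9, 5.0, 5.1, 4.8]
dist = [[abs(u - v) for v in obs] for u in obs]
heights = complete_linkage_heights(dist)
n_clusters = clusters_by_largest_gap(heights)
```

With these values the last merge happens at a far greater height than all earlier ones, so the cut falls just below it and leaves two clusters.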

Figure 2: Cluster dendrogram for the second data set (T_cells), produced by hclust (*, "complete"); the vertical axis shows height.

Looking at Figure 2 from the bottom, we can see that each object starts as a cluster by itself, so we have 32 clusters. Moving from the bottom to the top, we observe that object 1 is in one cluster and objects 2 and 3 are in another. These two clusters are merged into a cluster at height = 3. Objects 4 and 5 are in one cluster and objects 6 and 7 in another, at height = 3. The four clusters are merged into a cluster at height = 7. Objects 8 and 9 are merged into a cluster. Objects 10 and 11 are also merged into a cluster, and both clusters are merged into a cluster at height = 3. Objects 12 and 13 are merged into a cluster, and objects 14 and 15 are merged into a cluster at height = 7. The two clusters at height = 7 are merged into another cluster at height = 15.

Objects 16 and 17 are merged into a cluster at height = 3. Objects 18 and 19 are also merged into a cluster at height = 3, and the two clusters at height = 3 are merged into one cluster at height = 7. Objects 20 and 21 are merged into a cluster at height = 3. Objects 22 and 23 are also merged into a cluster at height = 3, and the two clusters at height = 3 are merged into one cluster at height = 7. Objects 24 and 25 are merged into a cluster at height = 3. Objects 26 and 27 are also merged into a cluster at height = 3, and the two clusters at height = 3 are merged into one cluster at height = 7. Objects 28 and 29 are merged into a cluster at height = 3. Objects 30 and 31 are also merged into a cluster at height = 3, and the two clusters at height = 3 are merged into one cluster at height = 7, while object 32 is clustered by itself. The five clusters are merged into a cluster at height = 4. The clusters from object 24 to 32 are merged at height = 8. The objects from 16 to 23 and from 24 to 31 are merged into a cluster at height = 16. Here we have two clusters with maximum height: one at 15 and the other at 16. We conclude that our data are composed of 2 clusters.

6.3. Estimating the Number of Clusters Using the Partitioning Around Medoids (PAM) Algorithm

The goal of cluster analysis for our data set is to reach the maximum dissimilarity between the observations of different clusters and the medoid, the widest cluster diameter, and the maximum average silhouette width. Using the PAM algorithm we obtained the following results:
When the number of clusters is 2, the average silhouette width is 0.6, the total maximum dissimilarity is 16 and the total cluster diameter is 31.
When the number of clusters is 3, the average silhouette width is 0.55, the total maximum dissimilarity is 16 and the total cluster diameter is 30.
When the number of clusters is 4, the average silhouette width is 0.51, the total maximum dissimilarity is 16 and the total cluster diameter is 29.
When the number of clusters is 5, the average silhouette width is 0.48, the total maximum dissimilarity is 16 and the total cluster diameter is 28.

From the above results we can conclude that the best number of clusters is 2. In this case the average silhouette width, the dissimilarity between the observations and the cluster medoid, and the cluster diameter each reach their maximum value. We notice that beyond 2 clusters we get the same maximum dissimilarity, because the sample size is 32 and it is difficult to cluster a sample of that size into more than 3 clusters.

7. Conclusions

In this paper we analyzed two data sets. The first is the B_cells data, which can be considered a large sample, and the Mclust algorithm was used to estimate the number of clusters. Using this algorithm we showed that the number of clusters is two. We also described the fit of the Mclust algorithm, which uses the Bayesian Information Criterion (BIC), and showed that the number of clusters is two where the maximum BIC is achieved. To confirm these results, we compared the results of both the Mclust and hclust algorithms with the PAM algorithm, where we selected different numbers of clusters from 1 to 5, because we have 5 stages of disease in our data set, and computed the average silhouette width for each choice of the number of clusters. We conclude that the number of clusters again equals two, since it corresponds to the maximum value of the average silhouette width. Looking at the number of clusters in B_cells, we can conclude that there are only 2 clusters, which means that we reduce the 5 stages to 2 symmetric clusters, and that each cluster should receive the same medication or treatment.

The second data set is T_cells, which is a small sample. Therefore we used the hclust algorithm to estimate the number of clusters, and we concluded that the number of clusters is two. To confirm these results we compared the results of the hclust algorithm with the PAM algorithm, where we selected different numbers of clusters from 1 to 5 as in the previous data set and computed the average silhouette width for each choice of the number of clusters. We then conclude that the number of clusters equals two, since it corresponds to the maximum value of the average silhouette width. Looking at the number of clusters in T_cells, we can conclude that there are only 2 clusters, which means that we reduce the 5 stages to 2 symmetric clusters, and that each cluster should receive the same medication or treatment.

17 S-Polarzed Surface waves n Ferrte bounded by Nonlnear Nonmagnetc,. From the above dscussons of the results and the comparson between Mclust, hclust and PAM algorthms we conclude that the Mclust algorthm s sutable for large data sets and the hclust algorthm s sutable for small samples. Therefore we recommend usng Mclust algorthm for large samples and hclust algorthm for small samples. 8. Recommendatons: 1. From the above results we recommend to concentrate future research on methods of detectng genes that causes dfferent types of cancer to avod ths dsease by solatng these genes n the new generatons. 2. The data used n ths study was obtaned from "boconductor.org" webste. We recommend that a genetc data bank to be establshed n Palestne. Ths would help n solaton genes whch causes heredty dseases n Palestne. 3. For future studes we propose jont researches between the Facultes of Medcne, Medcal Scences and the Department of Statstcs at Al Azhar Unversty n genetcs felds. 4. We recommend conductng further research on usng Neural Networks technques for estmatng the number of clusters n genetcs data. 5. We also recommend conductng further research on testng whether there s a sgnfcant evdence of dfferent types of cancer between genetcs causes and other causes n Palestne. 6. It s also recommended that the clusterng algorthms dscussed n ths paper to be appled n other felds such as Economcs and Human Scences. References: 1. Banfeld J. D. and Raftery A. E. (1993); Model-based Gaussan and non-gaussan clusterng. Bometrcs, 49: Boconductor (2004): Open software development for computatonal bology and bonformatcs; Gentleman R., Carey V. J., Bates D. M., Bolstad B., Dettlng M., Dudot S., Ells B., Gauter L., Ge Y., and others, Genome Bology, Vol. 5, R80. Journal of Al Azhar Unversty-Gaza (Natural Scences), 2011, 13 (125)

3. Chen, G., Banerjee, N., Jaradat, S.A., Tanaka, T.S., Ko, M.S.H. and Zhang, M.Q. (2002), Evaluation and Comparison of Clustering Algorithms in Analyzing ES Cell Gene Expression Data, Statistica Sinica, 12.
4. Chiaretti S, Li X, Gentleman R, Vitale A, Vignetti M, Mandelli F, Ritz J, Foa R (2004); Gene expression profile of adult T-cell acute lymphocytic leukemia identifies distinct subsets of patients with different response to therapy and survival; Blood, Vol. 103.
5. De Bin R. and Risso D. (2011), A novel approach to the clustering of microarray data via nonparametric density estimation, BMC Bioinformatics, 12.
6. Dudoit, S; Fridlyand, J and Speed, T. P. (2002), Comparison of Discrimination Methods for the Classification of Tumors Using Gene Expression Data; Journal of the American Statistical Association; Vol. 97, No. 457.
7. Fraley C and Raftery AE (2006); Model-based microarray image analysis; R News, 6.
8. Fraley C and Raftery AE (2007-a); Model-based methods of classification: using the mclust software in chemometrics; Journal of Statistical Software, 18(6).
9. Fraley C and Raftery AE (2007-b); Bayesian regularization for normal mixture estimation and model-based clustering; Journal of Classification, 24.
10. Kaufman L and Rousseeuw PJ (1990), Finding Groups in Data: An Introduction to Cluster Analysis, Wiley-Interscience, New York (Series in Applied Probability and Statistics).
11. McQuarrie ADR and Tsai CL (1998), Regression and Time Series Model Selection, World Scientific.
12. Szekely, G. J. and Rizzo, M. L. (2005), Hierarchical Clustering via Joint Between-Within Distances: Extending Ward's Minimum Variance Method, Journal of Classification 22(2).


More information

Performance Analysis and Coding Strategy of ECOC SVMs

Performance Analysis and Coding Strategy of ECOC SVMs Internatonal Journal of Grd and Dstrbuted Computng Vol.7, No. (04), pp.67-76 http://dx.do.org/0.457/jgdc.04.7..07 Performance Analyss and Codng Strategy of ECOC SVMs Zhgang Yan, and Yuanxuan Yang, School

More information

Cluster Analysis. Cluster Analysis

Cluster Analysis. Cluster Analysis Cluster Analyss Cluster Analyss What s Cluster Analyss? Types of Data n Cluster Analyss A Categorzaton of Maor Clusterng Methos Parttonng Methos Herarchcal Methos Densty-Base Methos Gr-Base Methos Moel-Base

More information

A study on the ability of Support Vector Regression and Neural Networks to Forecast Basic Time Series Patterns

A study on the ability of Support Vector Regression and Neural Networks to Forecast Basic Time Series Patterns A study on the ablty of Support Vector Regresson and Neural Networks to Forecast Basc Tme Seres Patterns Sven F. Crone, Jose Guajardo 2, and Rchard Weber 2 Lancaster Unversty, Department of Management

More information

Chapter XX More advanced approaches to the analysis of survey data. Gad Nathan Hebrew University Jerusalem, Israel. Abstract

Chapter XX More advanced approaches to the analysis of survey data. Gad Nathan Hebrew University Jerusalem, Israel. Abstract Household Sample Surveys n Developng and Transton Countres Chapter More advanced approaches to the analyss of survey data Gad Nathan Hebrew Unversty Jerusalem, Israel Abstract In the present chapter, we

More information

Recurrence. 1 Definitions and main statements

Recurrence. 1 Definitions and main statements Recurrence 1 Defntons and man statements Let X n, n = 0, 1, 2,... be a MC wth the state space S = (1, 2,...), transton probabltes p j = P {X n+1 = j X n = }, and the transton matrx P = (p j ),j S def.

More information

Evaluating credit risk models: A critique and a new proposal

Evaluating credit risk models: A critique and a new proposal Evaluatng credt rsk models: A crtque and a new proposal Hergen Frerchs* Gunter Löffler Unversty of Frankfurt (Man) February 14, 2001 Abstract Evaluatng the qualty of credt portfolo rsk models s an mportant

More information

An Algorithm for Data-Driven Bandwidth Selection

An Algorithm for Data-Driven Bandwidth Selection IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 25, NO. 2, FEBRUARY 2003 An Algorthm for Data-Drven Bandwdth Selecton Dorn Comancu, Member, IEEE Abstract The analyss of a feature space

More information

DATA MINING CLASSIFICATION ALGORITHMS FOR KIDNEY DISEASE PREDICTION

DATA MINING CLASSIFICATION ALGORITHMS FOR KIDNEY DISEASE PREDICTION DATA MINING CLASSIFICATION ALGORITHMS FOR KIDNEY DISEASE PREDICTION Dr. S. Vjayaran 1, Mr.S.Dhayanand 2, Assstant Professor 1, M.Phl Research Scholar 2, Department of Computer Scence, School of Computer

More information

1. Fundamentals of probability theory 2. Emergence of communication traffic 3. Stochastic & Markovian Processes (SP & MP)

1. Fundamentals of probability theory 2. Emergence of communication traffic 3. Stochastic & Markovian Processes (SP & MP) 6.3 / -- Communcaton Networks II (Görg) SS20 -- www.comnets.un-bremen.de Communcaton Networks II Contents. Fundamentals of probablty theory 2. Emergence of communcaton traffc 3. Stochastc & Markovan Processes

More information

Cluster Analysis of Data Points using Partitioning and Probabilistic Model-based Algorithms

Cluster Analysis of Data Points using Partitioning and Probabilistic Model-based Algorithms Internatonal Journal of Appled Informaton Systems (IJAIS) ISSN : 2249-0868 Foundaton of Computer Scence FCS, New York, USA Volume 7 No.7, August 2014 www.jas.org Cluster Analyss of Data Ponts usng Parttonng

More information