A Simple Approach to Clustering in Excel

Aravind H
Center for Computational Engineering and Networking, Amrita Vishwa Vidyapeetham, Coimbatore, India

C Rajgopal
Center for Computational Engineering and Networking, Amrita Vishwa Vidyapeetham, Coimbatore, India

K P Soman
Center for Computational Engineering and Networking, Amrita Vishwa Vidyapeetham, Coimbatore, India

ABSTRACT

Data clustering refers to the method of grouping data into different groups depending on their characteristics. This grouping brings order to the data and hence makes further processing easier. This paper explains the clustering process using the simplest of clustering algorithms, K-Means. The novelty of the paper comes from the fact that it shows a way to perform clustering in Microsoft Excel 2007 without using macros, through the innovative use of What-If Analysis. The paper also shows that image processing operations can be done in Excel and that all operations except displaying an image do not require a macro. The paper gives a solution to the problem of reading an image in Excel by introducing a user-defined add-in. The paper also explains and implements image segmentation as an application of clustering. This paper aims to show that Microsoft Excel is a great tool for technical learning, because it can implement almost all algorithms and processes and is very successful in providing first-hand exposure to a novice student.

General Terms
Machine learning, Clustering, Image processing.

Keywords
Data clustering, K-Means, Image Segmentation, Excel add-in to read image, Microsoft Excel

1. INTRODUCTION

There is no doubt that information is the driving force of the world. But what is information? To be exact, it is a collection of meaningful data. Each day, the world of computing manipulates billions of data items to extract information from them. Data behaves, quite literally, like an ideal gas: it tends to fill the best and largest storage in no time. Hence managing data is a complex job.
Grouping them into different groups as soon as they are obtained brings order to the data. This helps reduce the computational complexity required in further processing and managing it. The term data clustering refers to the process of partitioning a set of data into a set of meaningful subclasses called clusters. Data clustering has an immense number of applications in every field of life. One important application of clustering is in the field of data mining. Data mining is the process of discovering meaningful new correlations, patterns and trends by sifting through large amounts of data, using pattern recognition technologies as well as statistical and mathematical techniques. Clustering is often one of the first steps in data mining analysis. It identifies groups of related records that can be used as a starting point for exploring further relationships. For example, in the detection of diseases like tumors, the scanned pictures or X-rays are subjected to hierarchical clustering. Here clusters are formed using a variety of images available for a specific part of the body along with valid records. Such clusters are created for all body parts. The tumor-affected part of the body is located by comparing the test image with the images in these clusters. Once the body part is located, the image is sent to that specific matching cluster and matched with all the images in that particular cluster. The image with which the query image has the most similarities is retrieved and the record associated with that image is taken. Using this technique, really fine tumors can be detected [1]. Clustering greatly reduces the time needed to find the exact match in the database. On the management side, a common requirement of predicting the sales of a product in different cities is achieved by clustering demographically similar cities. Another application of clustering is load balancing in application servers. Load balancing is an enterprise-level feature in which the application server automatically alternates requests among the server instances in a cluster.
Clustering enables application servers to route requests to a running server instance when the original server instance goes down [1]. Clustering is also used to improve the performance (i.e., perplexity) of language models as well as to compress language models [2]. Here we explain how to implement one of the simplest clustering algorithms, K-Means. This is done with the help of two add-in packages available with Microsoft Excel: the Solver and What-If Analysis. As an application of clustering, this paper explains how segmentation of a picture using clustering can be implemented in Microsoft Excel.

1.1 Microsoft Excel

Microsoft Excel is a basic learning tool of great potential. Excel's forte is performing numerical calculations, organizing data, and comparing as well as presenting data graphically [8]. Using Excel it is possible to implement almost all algorithms, which helps the learner gain exposure to the functionality of the algorithm. To an extent it is true to say that Excel is intelligent software, except that it is only as intelligent as the user. The Excel package provides a large number of features, most of them as add-in packages. In this paper we are mainly concerned with two of the add-in packages, the Solver and What-If Analysis. We have also used conditional formatting to provide a look and feel to the display of output. Solver is basically an optimization problem solver which takes an objective function and a set of constraints and provides back

an optimum solution, i.e., the best values for the variables of which the objective function is made, so that the objective function is minimized or maximized (whichever the user requires). The "best" or optimal solution may mean maximizing profits, minimizing costs, or achieving the best possible quality. The What-If Analysis feature helps in executing the same set of operations for different inputs and then records or prints the outputs for each of those input sets. This makes life easier by reducing the workload of doing the same set of operations for different input sets manually. The implications of What-If Analysis were understood from [5], [6]. Conditional Formatting is a method by which specific operations, such as highlighting interesting cells or ranges of cells, emphasizing unusual values, visualizing data using data bars, color scales and icon sets, or doing some mathematical manipulation, are easily implemented depending on the rule set defined for that specific set of cells. These rule sets are active all the time, i.e., when cell values are changed the rules are reapplied.

2. CLUSTERING ALGORITHMS

The performance of clustering algorithms depends heavily on the spread of the data, and for this reason more than one clustering algorithm has been developed over time. Some of them include K-Means, K-Medoids, the EM algorithm, different types of linkage methods, the mean-shift algorithm, and algorithms that minimize some graph-cut criteria. Thus it is true to say that a universal clustering algorithm remains an elusive goal.

2.1 K-means Clustering

K-means clustering is one of the basic clustering algorithms in the machine learning domain. The inference of this algorithm is based on the value of k, which is the number of clusters that can be found in an n-dimensional dataset. Usually the value of k is assumed or known a priori. In the k-means algorithm, since it is considered that there are k clusters, we consider that there are k cluster means (cluster centers), where the cluster mean is the average of all the data points falling under each cluster.
The end result of the k-means clustering algorithm is that each data point in the data set is grouped into clusters around the cluster means. If the data points tightly surround the cluster means, the cluster is considered highly cohesive and good; otherwise it is not. Hence cluster compactness forms the quality metric of the k-means algorithm. Thus, as per [7], we can define a measure of cluster compactness as the total squared distance of each data point of a cluster from the cluster mean, which is given by

$$\varepsilon_k = \sum_{x_i \in c_k} \lVert x_i - \bar{x}_k \rVert^2 = \sum_{i=1}^{m} z_{ki} \lVert x_i - \bar{x}_k \rVert^2$$

where the cluster mean is defined as

$$\bar{x}_k = \frac{1}{m_k} \sum_{x_i \in c_k} x_i$$

and $m_k = \sum_{i=1}^{m} z_{ki}$ is the total number of points allocated to the k-th cluster. The parameter $z_{ki}$ is an indicator variable indicating the suitability of the i-th data point $x_i$ to be a part of the k-th cluster. The suitability is determined by considering points at a minimum distance from the cluster mean to be a part of the cluster. The indicator variable takes the value 1 when the i-th data point falls in the k-th cluster and 0 otherwise. Again from [7], the total goodness of the clustering is then based on the sum of the cluster compactness measures for each of the K clusters. Using the indicator variables $z_{ki}$ we can define the overall cluster goodness as

$$\varepsilon_K = \sum_{k=1}^{K} \sum_{i=1}^{m} z_{ki} \lVert x_i - \bar{x}_k \rVert^2$$

The objective of the algorithm is to find the optimum $\bar{x}_k$ for the above equation so that the value of $\varepsilon_K$ (the measure of overall cluster quality) reaches its minimum.

2.1.1 The Simplified Algorithm

The simplified algorithm for performing k-means clustering can be given in the form of an iterative minimization of the overall measure of cluster quality $\varepsilon_K$. It can be elaborated in the following steps: 1. Given the value of k, choose arbitrary cluster means $\bar{x}_k$ and find all indicator parameters $z_{ki}$ for each of the cluster means. This is essentially done by setting $z_{ki}$ to 1 if the i-th data point has minimum distance to the k-th cluster mean compared to the distance from all the other cluster means, and to 0 otherwise. 2. Calculate the overall metric $\varepsilon_K = \sum_{k=1}^{K} \sum_{i=1}^{m} z_{ki} \lVert x_i - \bar{x}_k \rVert^2$. 3.
Minimize the overall metric by assuming a new set of cluster means. 4. For the new set of cluster means, calculate the new indicator parameter values as described in Step 1. 5. For the new set of cluster means and indicator parameter values, recalculate the overall metric. 6. Repeat Steps 3 to 5 until $\varepsilon_K$ converges.

2.2 Implementation Details for K-Means Clustering

Clustering, as the name says, is the process of grouping data. In our case we take a group of data points in two dimensions. Let us assume the two-dimensional data points (3,13), (3,10), (8,7), (5,2), (5,4), (2,11), (5,4), (2,14), (6,2), (4,2), (10,8), (8,9), (10,9), (10,12).
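The iterative steps above can be sketched in Python as follows. This is a minimal illustration rather than the spreadsheet method used in this paper: it assumes the standard K-means rule for Step 3 (each new mean is the average of the points currently assigned to it), whereas the Excel implementation delegates that step to Solver. The function names are hypothetical, and the starting centers are the ones assumed in the worked example.

```python
import math

# Data points of the worked example.
POINTS = [(3, 13), (3, 10), (8, 7), (5, 2), (5, 4), (2, 11), (5, 4),
          (2, 14), (6, 2), (4, 2), (10, 8), (8, 9), (10, 9), (10, 12)]


def assign(points, means):
    """Steps 1 and 4: index of the nearest mean for every point (the z indicators)."""
    return [min(range(len(means)), key=lambda k: math.dist(p, means[k]))
            for p in points]


def overall_metric(points, means, labels):
    """Steps 2 and 5: sum of squared distances of points to their assigned mean."""
    return sum(math.dist(p, means[k]) ** 2 for p, k in zip(points, labels))


def update_means(points, means, labels):
    """Step 3 (assumed rule): move each mean to the average of its points."""
    new_means = []
    for k, old in enumerate(means):
        members = [p for p, lab in zip(points, labels) if lab == k]
        if not members:
            new_means.append(old)  # an empty cluster keeps its previous mean
            continue
        new_means.append(tuple(sum(p[d] for p in members) / len(members)
                               for d in range(len(old))))
    return new_means


def k_means(points, means, tol=1e-9, max_iter=100):
    """Step 6: repeat update, assignment and metric until the metric settles."""
    labels = assign(points, means)
    metric = overall_metric(points, means, labels)
    for _ in range(max_iter):
        means = update_means(points, means, labels)
        labels = assign(points, means)
        new_metric = overall_metric(points, means, labels)
        converged = metric - new_metric < tol
        metric = new_metric
        if converged:
            break
    return means, labels, metric


means, labels, metric = k_means(POINTS, [(4, 4), (5, 12), (10, 6)])
print(means, labels, round(metric, 3))
```

Because the update rule differs from Solver's search and the metric here uses squared distances, the centers found by this sketch need not coincide with those reported for the spreadsheet.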

Figure 4 is a plot of the initial cluster centers and the data points. This is obtained by selecting the range of the data points (C7:D20) and the three cluster points separately (X=F8, Y=G8), (X=H8, Y=I8) and (X=J8, Y=K8) and then plotting them using the XY plot available in Excel.

Figure 1. Data points

The data points are chosen here such that the three distinct clusters can be identified easily (even visually). It must be kept in mind that real-world data points will not always provide such a visual advantage. Clustering can be done on any N-dimensional data and for data groups with any sort of compactness.

Figure 2. XY Plot of Data points

The K-Means algorithm starts with the initial assumption of K. Here we have assumed K as 3. Since K = 3, we have 3 cluster centers and they are initialized with random points. Let the initial points assumed be (4, 4), (5, 12), (10, 6). These are entered in separate cells in the Microsoft Excel sheet as shown in Figure 3.

Figure 4. Cluster Centers (Initially assumed)

In the K-Means algorithm the objective is to minimize the term $\sum_{j=1}^{k} \sum_{i=1}^{n} \lVert X_i^{(j)} - C_j \rVert^2$, where $X_i^{(j)}$ is a data point belonging to cluster j and $C_j$ is the cluster center (centroid). For this reason, as the next step we need to find the Euclidean distance of each data point from all the centroids (cluster centers). Here, for ease, we utilize the What-If Analysis feature available in Excel. As a prerequisite for What-If Analysis we need to set up a structure as shown below, where the X and Y values for the chosen index are taken from the table shown in Figure 1. The index value is provided in cell G11 and the corresponding X and Y values are obtained in cells I11 and J11. The formula used in I11 is =INDEX(C7:C20,$G$11,1) and the formula in J11 is =INDEX(D7:D20,$G$11,1).

Figure 5. A requirement for What-If Analysis

We have 14 data points in our example and need to find the distance (Euclidean distance) of all the points from each of the assumed cluster centers.
The figure below shows the table structure designed to find the different Euclidean distances using What-If Analysis [5], [6].

Figure 3. Cluster Centers (Initially assumed)
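As a rough illustration of what this table computes, the following Python sketch emulates it: for each index from 1 to 14 it looks up the point (as the INDEX formulas do), computes the Euclidean distance to each of the three initially assumed centers, and labels the point with the nearest one. The names `lookup` and `distance_table` and the row layout are hypothetical stand-ins for the worksheet cells, not part of Excel.

```python
import math

# Data points (cells C7:D20) and initially assumed centers (cells F8:K8).
DATA = [(3, 13), (3, 10), (8, 7), (5, 2), (5, 4), (2, 11), (5, 4),
        (2, 14), (6, 2), (4, 2), (10, 8), (8, 9), (10, 9), (10, 12)]
CENTERS = [(4, 4), (5, 12), (10, 6)]


def lookup(index):
    """Mimic =INDEX(range, index, 1): a 1-based lookup of a data point."""
    return DATA[index - 1]


def distance_table(centers):
    """One row per point: index, distances to each center, label, minimum distance."""
    rows = []
    for index in range(1, len(DATA) + 1):
        x, y = lookup(index)
        dists = [math.sqrt((x - cx) ** 2 + (y - cy) ** 2) for cx, cy in centers]
        nearest = dists.index(min(dists))  # first column holding the minimum
        rows.append((index, dists, f"Cluster{nearest + 1}", min(dists)))
    return rows


for index, dists, label, nearest in distance_table(CENTERS):
    print(index, [round(d, 3) for d in dists], label, round(nearest, 3))
```

Summing the last element of every row gives the single value that the spreadsheet later hands to Solver as its objective.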

distances to the nearest cluster centers. The minimization is done with the help of the Solver feature available in Excel. Solver basically solves an optimization problem (minimization or maximization) subject to a set of constraints. Figure 7 depicts how to find optimal cluster centers using Solver. As Figure 7 shows, the field labeled Set Target Cell is assigned the address of the Excel cell whose value has to be minimized. Here it is $T$21 (the objective function). The field labeled By Changing Cells should be given the addresses of the cells whose values have to be adjusted in order to achieve the minimization. Here it is $F$8:$K$8.

Figure 6. Assignment of clusters

For What-If Analysis, the values 1 to 14 are written in the leftmost column M (M7 to M20) as shown in the above screenshot of the table. In the first row of the second column (i.e., cell N7), type the formula to find the Euclidean distance of centroid 1 from data point 1, i.e., =SQRT(($I$11-$F$8)^2+($J$11-$G$8)^2). When INDEX OF CURRENT POINT is set to 1, $I$11 refers to the X coordinate of data point 1 and $J$11 refers to the Y coordinate of data point 1. $F$8 holds the X value of cluster center 1 and $G$8 holds the Y value of cluster center 1. Similarly, in the third column (i.e., cell O7) type the formula for the distance of the second centroid from data point 1, i.e., =SQRT(($I$11-$H$8)^2+($J$11-$I$8)^2). In the next column (i.e., cell P7) find the distance from the third centroid, i.e., =SQRT(($I$11-$J$8)^2+($J$11-$K$8)^2). In the last-but-one column, find the class to which data point 1 belongs. The formula used to obtain this is =IF(MIN(N7:P7)=N7,"Cluster1",IF(MIN(N7:P7)=O7,"Cluster2","Cluster3")). Here MIN(N7:P7) takes the minimum of the distances calculated from the selected data point to the three cluster centers 1, 2 and 3. The formula also finds out which column the minimum value comes from. That is,
if the minimum comes from the second column we assign the point to Cluster 1, if it comes from the third column we assign it to Cluster 2, and if it comes from the fourth column we assign it to Cluster 3. This is done in order to assign the point to a particular cluster. What-If Analysis performs the above operations for all the points by changing the index value of the table shown in Figure 5 from 1 to 14, and the corresponding results are assigned to cells S7:S20. Now we need to explicitly find and keep the minimum distance of each point from among the three cluster centers. This calculation is done in the last column, labeled Minimum distance. We then calculate the sum of these minimum distance values over all the points, as shown in Figure 6. The K-Means algorithm specifies to minimize the above-mentioned sum of

Figure 7. Assigning Solver inputs

On clicking the Solve button, the optimal values for all three centroids are obtained. Solver achieves this by incrementing or decrementing the values of the assumed cluster centers by a small factor (this factor can be specified as a Solver parameter). Any change in a cluster center is reflected in the allocation of data points to the clusters. [This is a feature of the What-If Analysis add-in.] After solving we can see that the cluster centers come close to the centers of the corresponding clusters. The reallocation of data points can be seen in Figure 9. Figure 8 shows the minimized distance value and the optimal cluster centers.

Figure 8. (a) Minimized distance, (b) Optimal cluster centers
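The effect of this Solver step can be approximated outside Excel. The sketch below is a stand-in under stated assumptions, not Solver's actual algorithm: it minimizes the same objective (the sum of each point's distance to its nearest center, as held in cell $T$21) with a simple coordinate search that nudges one center coordinate at a time and halves the step when no nudge helps. The function names and the step schedule are illustrative choices, and the result need not match the centers reported by Solver.

```python
import math

DATA = [(3, 13), (3, 10), (8, 7), (5, 2), (5, 4), (2, 11), (5, 4),
        (2, 14), (6, 2), (4, 2), (10, 8), (8, 9), (10, 9), (10, 12)]


def total_min_distance(flat):
    """Objective: sum over all points of the distance to the nearest center.

    `flat` holds the centers as [x1, y1, x2, y2, x3, y3], like cells F8:K8.
    """
    centers = list(zip(flat[0::2], flat[1::2]))
    return sum(min(math.dist(p, c) for c in centers) for p in DATA)


def coordinate_search(start, step=1.0, min_step=1e-4):
    """Illustrative minimizer: accept any single-coordinate nudge that lowers the objective."""
    current = list(start)
    best = total_min_distance(current)
    while step >= min_step:
        improved = False
        for i in range(len(current)):
            for delta in (step, -step):
                trial = current.copy()
                trial[i] += delta
                value = total_min_distance(trial)
                if value < best:
                    current, best, improved = trial, value, True
        if not improved:
            step /= 2  # refine the search once no nudge of this size helps
    return current, best


start = [4, 4, 5, 12, 10, 6]  # initially assumed centers
centers, objective = coordinate_search(start)
print(round(total_min_distance(start), 3), round(objective, 3))
print([round(v, 3) for v in centers])
```

Because the objective is re-evaluated from scratch for every trial, the reassignment of points to their nearest center happens implicitly, much as the What-If table recalculates whenever Solver changes a center cell.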

The new cluster centers obtained are (5.218, 3.667), (3.615, 10.461), (9.478, 8.750). The XY plot of the new cluster centers is shown in Figure 10. From Figure 10 it can be seen that the optimal cluster centers have moved closer to the center of each cluster. Because it is easy to implement, we have chosen the k-means algorithm for color component clustering. First of all we need to read the image to be clustered into an Excel sheet. This gives the corresponding pixel values (R, G and B) for the image. Reading of the image is done within Excel by using a user-defined add-in named loadimagearray. The add-in is made available at [4]. The add-in was developed using C# [3]. Once installed, the add-in can be obtained from the functions tab of Excel as shown below.

Figure 11. Image description of selecting loadimagearray

Figure 9. Optimal cluster assignment

We can then plot the final points, which are the optimal cluster centers, as shown in Figure 10. In Figure 10 we can see that three clusters are recognized. This is because the value of k in our example is 3. It must be kept in mind that k can be any higher value, and as k increases the number of clusters recognized by the algorithm also increases. Once you select the template, loadimagearray appears in the function list. On selecting this function one sees the popup window shown below. It asks for four arguments: the location of the image (given within double quotes), the length of the image, the width of the image, and finally the color component you are interested in, i.e., give 1 for acquiring the red, 2 for the green and 3 for the blue pixels of the image.

Figure 10. XY Plot of Data points and Optimal Cluster centers

2.3 An Application of Clustering

Image segmentation, an important problem in computer vision, is often formulated as a clustering problem. A color picture is made of three components: R (red), G (green) and B (blue). As mentioned previously, clustering groups similar data points depending on the Euclidean distance of their property values in property space.
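To make the idea concrete, the sketch below treats each pixel as a point in (R, G, B) space, assigns it to the nearest of four centroids by Euclidean distance, and recolors it with a display color for that cluster. It is only an illustration under stated assumptions: the pixel list and display colors are made up, the helper names are hypothetical, and the centroids are the initial values quoted later in this section, not the optimized ones.

```python
import math

# Initial centroids quoted later in this section (R, G, B).
CENTROIDS = [(66, 76, 45), (240, 230, 44), (44, 113, 110), (5, 5, 200)]
# Made-up display colors, one per cluster, used to paint the segmented image.
DISPLAY = [(0, 0, 0), (255, 255, 0), (0, 128, 128), (0, 0, 255)]


def nearest_centroid(pixel, centroids=CENTROIDS):
    """Index of the centroid closest to the pixel in RGB space."""
    dists = [math.dist(pixel, c) for c in centroids]
    return dists.index(min(dists))


def segment(pixels, centroids=CENTROIDS, colors=DISPLAY):
    """Replace every pixel by the display color of its cluster."""
    return [colors[nearest_centroid(p, centroids)] for p in pixels]


# A tiny made-up "image": one pixel near each centroid.
pixels = [(60, 80, 40), (250, 240, 50), (40, 110, 120), (10, 10, 190)]
print([nearest_centroid(p) for p in pixels])
print(segment(pixels))
```

A real image would supply one (R, G, B) triple per cell of the three sheets filled by the add-in; the assignment logic stays the same.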
Here we have selected a color image and clustered the different colors in the picture. The number of clusters created depends on the value of k. Here we have chosen k = 4, since there are four colors in the image considered.

Figure 12. loadimagearray - details input screen

On clicking OK, the first pixel is shown in the selected cell of the Excel sheet. Beginning from that cell, select a range of 100 x 69 cells (for this example). Now select the formula in the starting cell where the pixel value was printed and press Ctrl + Shift + Enter, the key combination Excel uses to enter an array formula. We now get all the pixel values for the red component. Repeat the same process to obtain the green and blue components by giving 2 and 3 in place of 1. As a second step we can plot this picture (Apples). This can be done by suitably coloring the corresponding cells of an Excel sheet with the color obtained from the pixel value combination (R, G and B, which were obtained in three different sheets). We have achieved this coloring of Excel

cells by using a small macro. Finally, reduce the row height and column width of the Excel cells to small values so that the image can be seen in the required size. Now, just as mentioned in the k-means implementation description, we make an arrangement wherein the R, G and B values are acquired from the table in Figure 15 by providing the index values. This arrangement is shown in Figure 16.

Figure 16. A requirement for What-If Analysis

Figure 13. Apples (Input image for clustering)

We have the image pixel values for the R, G and B components read and stored in an Excel file in separate sheets. Now we can move to segmentation of pixel values. This segmentation is done using k-means clustering. The first step is to randomly select the cluster centers (centroids). The values selected should be within the range 0-255 since pixel values can only be in this range. Here we have selected four cluster centers since we are interested in clustering 4 colors, including one white background, and hence k = 4. Let these be (66, 76, 45), (240, 230, 44), (44, 113, 110) and (5, 5, 200). These are entered in separate cells as shown in Figure 14. Now, as mentioned in the normal K-means implementation, we obtain the distance of each data point (pixel value) from all the assumed centroid points. For each data point, find the minimum among the distances to all four clusters and print it in the last column of the table in Figure 17. The sum of these minimum distances is found, and this is the objective function (value) which is to be minimized. Figure 17 illustrates the whole process of obtaining the total distance to be minimized.

Figure 14. Cluster Centers (Initially assumed)

The data points to be clustered here are the pixel values of the image (Apples). For convenience, the pixel values obtained for the image are written into a single-column layout as shown in Figure 16 (red, green and blue pixel values should be in different columns). The available data is written in this format to help What-If Analysis.

Figure 17.
Total distance to be minimized

Now we utilize Solver to find optimal values of R, G and B for each of the centroids. The optimum values of the centroid points, found by minimizing the sum of distances to the nearest centroid points, are shown in the figure below.

Figure 18. Optimized centroid values

Figure 15. Test data

Due to the use of What-If Analysis, once the cluster centers (centroids) are changed the allocation of data points to the clusters also changes automatically. Now we plot the segmented image of "Apples" by assigning different user-chosen colors to the points

belonging to different clusters. This is also done with the help of a macro. The segmented image is obtained as shown in Figure 19.

Figure 19. Clustered / Segmented Image

3. CONCLUSION

Data clustering, or grouping together of similar data, is useful in all real-world data processing applications. This paper deals with the implementation of K-Means clustering in Microsoft Excel 2007 with the help of the What-If Analysis and Solver add-in packages available by default. As an application of clustering, this paper explains how image segmentation can be done in Microsoft Excel. The paper also describes a newly developed add-in which can read any image and obtain the pixel values for the red, green and blue components of the image. Thus it is shown that the whole set of image processing operations, such as reading, processing and printing of an image, can be done in Excel. Above all, this paper stresses the potential of Microsoft Excel as a scientific learning tool.

4. REFERENCES

[1] Ali, R., Ghal, U., Saeed, A. Data Clustering and Its Applications. Available at the web address: http://members.tripod.com/asim_saeed/paper.htm
[2] Gao, J., Goodman, Joshua T., Miao, J. The Use of Clustering Techniques For Language Modeling. International Journal for Computational Linguistics and Chinese Language Processing. Vol. 6, No. 1, pp 27-60.
[3] Gunnerson, E. and Wienholt, N. (2005), A Programmer's Introduction to C# 2.0, Third Edition. Published by Apress. ISBN (pb): 1-59059-501-7
[4] Implementation of clustering in Excel and Excel Add-in to load Images. Available at the URL: http://cen.amritafoss.org/downloads/books/datamining/worksheet/
[5] MacDonald, M. (2006), Excel 2007: The Missing Manual. Published by Pogue Press, O'Reilly. ISBN: 978-0-596-52759-4.
[6] Ragsdale, C. T. and Zobel, C. W. (2010), A Simple Approach to Implementing and Training Neural Networks in Excel. Decision Sciences Journal of Innovative Education, 8: 143-149. doi: 10.1111/j.1540-4609.2009.00249.x
[7] Soman, K.P., Loganathan, R., Ajay, V. Machine Learning With SVM and Other Kernel Methods. Published by PHI Learning Private Limited.
ISBN: 978-81-203-3435-9
[8] Tang, H. (2008), A Simple Approach of Data Mining in Excel. IEEE Fourth International Conference on Wireless Communications, Networking and Mobile Computing, doi: 10.1109/WiCom.2008.2679