A Statistical Perspective on Data Mining

Ranjan Maitra

(Ranjan Maitra is Assistant Professor of Statistics in the Department of Mathematics and Statistics at the University of Maryland, Baltimore County, Baltimore, MD 21250, USA.)

Abstract

Technological advances have led to new and automated data collection methods. Datasets once at a premium are often plentiful nowadays and sometimes indeed massive. A new breed of challenges is thus presented: primary among them is the need for methodology to analyze such masses of data with a view to understanding complex phenomena and relationships. Such capability is provided by data mining, which combines core statistical techniques with those from machine intelligence. This article reviews the current state of the discipline from a statistician's perspective, illustrates issues with real-life examples, discusses the connections with statistics, the differences, the failings and the challenges ahead.

1 Introduction

The information age has been matched by an explosion of data. This surfeit has been a result of modern, improved and, in many cases, automated methods for both data collection and storage. For instance, many stores tag their items with a product-specific bar code, which is scanned in when the corresponding item is bought. This automatically creates a gigantic repository of information on products and product combinations sold. Similar databases are also created by automated book-keeping, digital communication tools or by remote sensing satellites, and aided by the availability of affordable and effective storage mechanisms: magnetic tapes, data warehouses and so on. This has created a situation of plentiful data and the potential for new and deeper understanding of complex phenomena. The very size of these databases, however, means that any signal or pattern may be overshadowed by noise. New methodology for the careful analysis of such datasets is therefore called for. Consider for instance the database created by the scanning of product bar codes at sales checkouts. Originally adopted for reasons of convenience, this now forms the basis for gigantic databases as large stores maintain records of products bought by customers in any
The author thanks Surajit Chaudhuri for discussions on the practical aspects of data mining from the point of view of a researcher in databases and for help with Figure 4, Rouben Rostamian for providing me with the enrolment data of Table 1 and Devasis Bassu for help with the example in Section 6 of this paper.
transaction. Some businesses have gone further: by providing customers with an incentive to use a magnetic-striped frequent shopper card, they have created a database not just of product combinations but also time-sequenced information on such transactions. The goal behind collecting such data is the ability to answer questions such as "If potato chips and ketchup are purchased together, what is the item that is most likely to be also bought?", or "If shampoo is purchased, what is the most common item also bought in that same transaction?". Answers to such questions result in what are called association rules. Such rules can be used, for instance, in deciding on store layout or on promotions of certain brands of products by offering discounts on select combinations. Applications of association rules transcend sales transactions data; indeed, I illustrate the concepts in Section 2 through a small-scale class-scheduling problem in my home department, but with millions of daily transactions on millions of products, that application best represents the complexities and challenges in deriving meaningful and useful association rules and is part of folklore. An oft-stated goal of data mining is the discovery of patterns and relationships among different variables in the database. This is no different from some of the goals of statistical inference: consider, for instance, simple linear regression. Similarly, the pair-wise relationship between the products sold above can be nicely represented by means of an undirected weighted graph, with products as the nodes and weighted edges for the presence of the particular product pair in as many transactions as proportional to the weights. While undirected graphs provide a graphical display, directed acyclic graphs are perhaps more interesting; they provide understanding of the phenomena driving the relationships between the variables. The nature of these relationships can be analyzed using classical and modern statistical tools such as regression, neural networks and so on. Section 3 illustrates this concept. Closely related to this notion of knowledge discovery is that of causal dependence models, which can be studied using Bayesian belief networks.
Heckerman (1996) introduces an example of a car that does not start and proceeds to develop the goal of finding the most likely cause for the malfunction as an illustration for this concept. The building blocks are elementary statistical tools such as Bayes theorem and conditional probability statements, but as we shall see in Section 3, the use of these concepts to make a pass at explaining causality is unique. Once again, the problem becomes more acute with large numbers of variables as in many complex systems or processes. Another aspect of knowledge discovery is supervised learning. Statistical tools such as discriminant analysis or classification trees often need to be refined for these problems. Some additional methods to be investigated here are k-nearest neighbor methods, bootstrap aggregation or bagging, and boosting, which originally evolved in the machine learning literature, but whose statistical properties have been analyzed in recent years by statisticians. Boosting is particularly useful in the context of data streams, when we have rapid data flowing into the system and real-time classification rules are needed. Such capability is especially desirable in the context of financial data, to guard against credit card and calling card fraud, when transactions are streaming in from several sources and an automated split-second determination of fraudulent or genuine use has to be made, based on past experience. Modern classification tools such as these are surveyed in Section 4.
Another important aspect of knowledge discovery is unsupervised learning or clustering, which is the categorization of the observations in a dataset into an a priori unknown number of groups, based on some characteristic of the observations. This is a very difficult problem, and is only compounded when the database is massive. Hierarchical clustering, probability-based methods, as well as optimization partitioning algorithms are all difficult to apply here. Maitra (2001) develops, under restrictive Gaussian equal-dispersion assumptions, a multi-pass scheme which clusters an initial sample, filters out observations that can be reasonably classified by these clusters, and iterates the above procedure on the remainder. This method is scalable, which means that it can be used on datasets of any size. This approach, along with several unsurmounted challenges, is reviewed in detail in Section 5. Finally, we address the issue of text retrieval. With its ready accessibility, the World Wide Web is a treasure-trove of information. A user wishing to obtain documents on a particular topic can do so by typing the word using one of the public-domain search engines. However, when a user types the word "car" he likely wants not just all documents containing the word "car" but also relevant documents including ones on "automobile". In Section 6, we discuss a technique, similar to dimension reduction, which makes this possible. Here also, the problem is compounded by databases that are gigantic in size. Note, therefore, that a large number of the goals of data mining overlap with those in statistics. Indeed, in some cases, they are exactly the same and sound statistical solutions exist, but are often computationally impractical to implement. In the sequel, I review some of the facets of the connection between data mining and statistics as indicated above. I illustrate each aspect through a real-life example, discuss current methodology and outline suggested solutions. Except for part of the clustering example, where I used some of my own pre-written software routines in C and Fortran, I was able to use the publicly available and free statistical software package R (available at http://www.r-project.org/) to perform the necessary data analysis.
While not exhaustive, I hope that this article provides a broad review of the emerging area from a statistical view-point and spurs interest in the many unsolved or unsatisfactorily-solved problems in these areas.

2 Association Rules

Association rules (Piatetsky-Shapiro, 1991) are statements of the form, "93% of customers who purchase paint also buy paint brushes." The importance of such a rule from a commercial viewpoint is that a store selling paint will also stock paint brushes, for customer convenience, translating into more buyer visits and store sales. Further, the store can offer promotions on brands with higher profit margins to customers buying paint and subtly dictate customer preference. Such information can additionally be used to decide on storage layout and scheduled dispatch from the warehouse. Transactions data are used to derive rules such as the one mentioned above. Formally, let D = {T_1, T_2, ..., T_n} be the transactions database, where each transaction T_j is a member of the set of items in I = {i_1, i_2, ..., i_m}. (The transactions data may, for instance, be represented as an n × m matrix of 0's and 1's called D, with the (k, l)-th entry as an indicator
variable for the presence of item i_k in transaction T_l.) Further, write X ⊆ T_l to denote that a set of items X is contained in T_l. An association rule is a statement of the form X ⇒ Y, where X, Y ⊂ I and X and Y are disjoint sets. The rule X ⇒ Y has confidence γ if 100γ% of the transactions in the database D containing X also contain the set of items Y. It is said to have support δ if 100δ% of the transactions in D contain all elements of the set X ∪ Y. The improvement ι of a rule X ⇒ Y is the ratio of the confidence and the proportion of transactions only involving Y. Note that while confidence provides us with a measure of how confident we are in the given rule, support assesses the proportion of transactions on which the rule is based. In general, rules with greater support are more desirable from a business perspective, though rules with lower support often represent small niches who, because of their size, are not necessarily wooed by other businesses and may be very loyal customers. Further, improvement compares how much better a rule X ⇒ Y is at predicting Y than using no conditional information in deciding on Y. When improvement is greater than unity, it is better to predict Y using X than to simply use the marginal probability distribution of Y. On the other hand, if the improvement of X ⇒ Y is less than unity, then it is better to predict Y using the marginal probability distribution than using its distribution conditional on X. At equality, there is no difference in using either X for predicting Y or in simply deciding on Y by chance. Although association rules have been primarily applied to sales data, they are illustrated here in the context of scheduling mathematics and statistics graduate classes.

2.1 An Application

Like several departments, the Department of Mathematics and Statistics at the University of Maryland, Baltimore County offers several graduate courses in mathematics and statistics. Most students in these classes pursue Master of Science (M.S.) and doctoral (Ph.D.) degrees offered by the department. Scheduling these classes has always been an onerous task.
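The three quantities just defined are easy to compute directly from a transactions database. The article used R for its analyses; the following is a minimal illustrative sketch in Python (the function name and the toy transactions are my own, not from the article), representing each transaction as a set of item labels.

```python
# Hypothetical sketch: support, confidence and improvement of a rule X => Y
# computed from a transactions database D (a list of sets of item labels).
def rule_metrics(D, X, Y):
    """Return (support, confidence, improvement) of the rule X => Y."""
    n = len(D)
    n_X = sum(1 for T in D if X <= T)           # transactions containing X
    n_XY = sum(1 for T in D if (X | Y) <= T)    # transactions containing X and Y
    n_Y = sum(1 for T in D if Y <= T)           # transactions containing Y
    support = n_XY / n                          # delta
    confidence = n_XY / n_X if n_X else 0.0     # gamma
    improvement = confidence / (n_Y / n) if n_Y else float("inf")  # iota
    return support, confidence, improvement
```

For example, with D = [{"paint", "brush"}, {"paint"}, {"brush", "tape"}, {"paint", "brush", "tape"}], the rule {"paint"} ⇒ {"brush"} has support 2/4, confidence 2/3 and improvement (2/3)/(3/4), i.e. slightly below unity, so here paint purchases would not improve the prediction of brush purchases.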
Practical issues do not permit all classes to be scheduled at different times, and indeed, four particular time-slots are most desirable for students. Additionally, it is desirable for commuting and part-time students to have their classes scheduled close to one another. Hitherto, class times have been decided on the basis of empirical perceptions of the scheduling authorities. Deciding on obvious candidates for concurrently scheduled classes is sometimes rather straightforward. For instance, students should have a proper grasp of the elementary probability course (Stat 651) to be ready for the higher-level course (Stat 611), so that these can run simultaneously. Such rules are not always obvious, however. Further, the other question, which classes to schedule close to each other, is not always easy to determine. Here I use association rules to suggest classes that should not be offered concurrently, as well as classes that may be scheduled close to each other to maximize student convenience. The database on the enrollment of graduate students in the fall semester of 2001 is small, with 11 items (the classes offered) and 38 transactions (students' enrollment in the different classes); see Table 1. Association rules were built using the above; the five rules with the highest support (for both one- and two-class conditioning rules) are reported in Table 2. Since each
Table 1: Enrollment data of departmental graduate students in mathematics and statistics at the University of Maryland, Baltimore County, Fall semester, 2001.

Course     Class list by student's last name
Math 600   Gavrea, Korostyshevskaya, Liu, Musedere, Shevchenko, Soane, Vdovina
Math 617   Challou, Gavrea, Korostyshevskaya, Shevchenko, Vdovina, Waldt
Math 630   Du, Feldman, Foster, Gavrea, Korostyshevskaya, Shevchenko, Singh, Smith, Soane, Taff, Vdovina, Waldt
Math 700   Challou, Hanhart, He, Sabaka, Soane, Tymofyeyev, Wang, Webster, Zhu
Math 710   Korolev, Korostyshevskaya, Maura, Osmoukhina
Stat 601   Cameron, Li, Paul, Ponce, Siddani, Zhang
Stat 611   Cameron, Li, Liu, Siddani, Zhang
Stat 615   Benamat, Kelly, Li, Liu, Ponce, Siddani, Wang, Webb, Zhang
Stat 621   Chang, Osmoukhina, Tymofyeyev
Stat 651   Waldt, Wang
Stat 710   Chang, Liu, Paul, Ponce, Safronov

student enrolls in at most three classes, note that all conditioning rules involve only up to two classes. To suggest a class schedule using the above, it is fairly clear that Math 600, Math 617 and Math 630 should be held at different time-slots but close to each other. Note that while the rule "If Math 600 and Math 617, then Math 630" has confidence 100%, both (Math 600 & Math 630) ⇒ Math 617 and (Math 617 & Math 630) ⇒ Math 600 have confidence of 80%. Based on this, it is preferable to suggest an ordering of class times such that the Math 630 time-slot is in between those of Math 600 and Math 617. Similar rules are used to suggest the schedule in Table 3 for future semesters that have a similar mix of classes. The suggested schedule requires at most four different time-slots, while the current schedule, derived empirically and ignoring enrollment data, has no less than eight time-slots.

Table 2: Some association rules obtained from Table 1 and their confidence (γ), support (δ) and improvement (ι). Only single- and double-class conditioning rules with the five highest improvement measures are reported here.
Rule                                 γ     δ     ι
Stat 601 ⇒ Stat 611                  0.67  0.11  5.07
Stat 611 ⇒ Stat 601                  0.67  0.11  5.07
Math 600 ⇒ Math 617                  0.57  0.10  3.62
Math 617 ⇒ Math 600                  0.67  0.10  3.62
Stat 611 ⇒ Stat 615                  0.80  0.11  3.38
(Math 700 & Stat 615) ⇒ Stat 651     1.00  0.03  19.00
(Math 630 & Stat 651) ⇒ Math 617     1.00  0.03  6.33
(Math 600 & Math 630) ⇒ Math 617     0.80  0.10  5.07
(Stat 611 & Stat 615) ⇒ Stat 601     0.75  0.08  4.75
(Math 617 & Math 630) ⇒ Math 600     0.80  0.10  4.34
Table 3: Suggested schedule for graduate classes in mathematics and statistics at the University of Maryland, Baltimore County, based on the association rules obtained in Table 2. The current, empirically derived schedule is shown for comparison.

Current schedule               Suggested schedule
Slot  Classes                  Slot  Classes
1     Stat 621                 1     Math 600, Stat 611
2     Stat 651, Stat 611       2     Math 630, Math 700, Stat 601, Math 710
3     Math 710                 3     Math 617, Stat 615, Stat 621
4     Stat 615                 4     Stat 651, Stat 710
5     Math 600
6     Math 617
7     Math 700, Stat 601
8     Math 630, Stat 710

The above is an interesting and somewhat straightforward illustration of association rules. It is made easier by the limitation that all graduate students register for at most three classes. For instance, the consequences of searching for such rules for all undergraduate and graduate classes taught at the university can well be imagined. There are hundreds of such classes and over 12,000 students, each of whom takes any number between one and eight classes a semester. Obtaining association rules from such large databases is quite daunting. This problem, of course, pales in comparison to the search for such rules in sales databases, where there are several millions of items and hundreds of millions of transactions.

2.2 Application to Large Databases

There has been substantial work on finding association rules in large databases. Agrawal et al. (1993) focus on discovering association rules of sufficient support and confidence. The basic idea is to first specify large item-sets, described as those sets of items whose transaction support (i.e. the proportion of transactions in which they occur together) is greater than the desired minimum support δ+. By restricting attention to these item-sets, one can develop rules X ⇒ Y of confidence given by the ratio of the support of X ∪ Y to the support of X. This rule holds if and only if this ratio is greater than the minimum desired confidence γ+. It may be noted that since X ∪ Y is a large item-set, the given rule X ⇒ Y trivially has the pre-specified minimum support δ+. The algorithms for discovering all large item-sets in a transactions database are iterative in spirit.
The idea is to make multiple passes over the transactions database, starting with a seed set of large item-sets and using this to generate candidate item-sets for consideration. The support of each of these candidate item-sets is also calculated at each pass. The candidate item-sets satisfying our definition of large item-sets at each pass form the seed for the next pass, and the process continues till convergence of the item-sets, which is defined as the stage when no new candidate item-set is found. The idea is conceptually very simple and hinges on the fact that an item-set of j elements can be large only if every subset of it is large. Consequently, it is enough for a multi-pass algorithm to build the rules incrementally in terms of the cardinality of the large item-sets. For instance, if at any stage, there are k large item-sets of j elements each, then the only possible large item-sets with more than j elements pertain to transactions which have members from these k item-sets. So, the search-space for large item-sets can be considerably whittled down, even in the context of huge databases.
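The multi-pass scheme above can be sketched in a few lines; the following illustrative Python sketch (my own, not from the article or the original database papers) grows candidate item-sets one element at a time and prunes any candidate with a subset that is not large.

```python
from itertools import combinations

def apriori_large_itemsets(D, min_support):
    """Multi-pass search for large item-sets in a transactions database D
    (a list of sets). An item-set is 'large' if its support, the proportion
    of transactions containing it, is at least min_support."""
    n = len(D)
    support = lambda s: sum(1 for T in D if s <= T) / n
    # Pass 1: large single-item sets form the initial seed.
    items = sorted({x for T in D for x in T})
    current = [frozenset([x]) for x in items if support(frozenset([x])) >= min_support]
    large = list(current)
    # Later passes: candidates of size j+1 built from large j-item-sets;
    # a candidate survives only if every (j)-element subset is large (pruning).
    while current:
        seed = set(current)
        candidates = {a | b for a in current for b in current if len(a | b) == len(a) + 1}
        candidates = {c for c in candidates
                      if all(frozenset(s) in seed for s in combinations(c, len(c) - 1))}
        current = [c for c in candidates if support(c) >= min_support]
        large.extend(current)
    return large
```

The pruning step is exactly the "every subset must be large" observation: no support counting is wasted on candidates that are already ruled out by an earlier pass.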
This forms the basis of algorithms in the database literature.

2.3 Additional Issues

The solutions for large item-sets ignore transactions with small support. In some cases, these represent niches which may be worth catering to, especially from a business point of view. Reducing γ+ and δ+ would allow for these rules, but would result in a large number of spurious association rules, defeating the very purpose of data mining. Brin et al. (1997a, 1997b) therefore develop the notion of both correlation and implication rules. Instead of confidence measures, their rules have implication strengths ranging from 0 to ∞: an implication strength of 1 means that the rule is as useful as under the framework of statistical independence. An implication strength greater than 1 indicates a greater than expected presence, under statistical independence, of the item-set. Another notion (Aggarwal and Yu, 1998), called collective strength of an item-set I, involves defining a violation rate v(I) as the fraction of violations (transactions in which only some members of an item-set are present) of the item-set over all transactions. The collective strength is then defined as

C(I) = [{1 - v(I)} / {1 - E v(I)}] · [{E v(I)} / {v(I)}],

where E denotes expected value under the assumption of statistical independence. This measure regards both absence and presence of item combinations in an item-set symmetrically, so that it can be used if the absence of certain item combinations, rather than their presence, is of interest. It additionally incorporates the correlation measure because for perfectly positively correlated items, C(I) = ∞, while for perfectly negatively correlated items, C(I) = 0. Finally, the collective strength uses the relative occurrences of an item-set in the database: item-sets with insignificant support can then be pruned off at a later stage. DuMouchel and Pregibon (2001) point out, however, that the above advantages accrue at considerable loss of interpretability. They propose a measure of interestingness for all item-sets and fit an empirical Bayes model to the item-set counts. All lower 95% credible limits for the interestingness measure of an item-set are ranked.
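The violation rate and collective strength described above are straightforward to compute; the sketch below (illustrative Python, my own construction) takes E v(I) under independence to be 1 minus the probability that all items are present minus the probability that none are, using the observed marginal item frequencies.

```python
def collective_strength(D, I):
    """Collective strength C(I) of item-set I in transactions database D
    (a list of sets). A violation is a transaction containing some but not
    all members of I; E v(I) is computed under independence of the items."""
    n = len(D)
    I = frozenset(I)
    violations = sum(1 for T in D if (I & T) and not (I <= T))
    v = violations / n
    p = {x: sum(1 for T in D if x in T) / n for x in I}   # marginal frequencies
    all_present = 1.0
    none_present = 1.0
    for x in I:
        all_present *= p[x]
        none_present *= 1 - p[x]
    ev = 1 - all_present - none_present                   # E[v(I)] under independence
    if v == 0:
        return float("inf")   # perfectly positively correlated items
    return ((1 - v) / (1 - ev)) * (ev / v)
```

The two limiting cases in the text fall out directly: items that always appear together are never violated, giving C(I) = ∞, while items that never co-occur are violated in every transaction touching them, giving C(I) = 0.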
This approach has the added advantage of discovering complex mechanisms among multi-item associations that are more frequent than their corresponding pairwise associations suggest. One of the issues in a multi-pass algorithm such as that of Agrawal et al. (1993) is that it is not suited for data streams such as web applications. Understanding the structure and nature of such transactions is important, and to this extent, some work has been done recently (Babcock et al., 2002). Moreover, none of the above formulations use any time series information, which is very often present in the database. For example, a customer may purchase tea and sugar separately in successive transactions, presumably failing to recall that he needed to purchase the second product also. Such associations would not show up above, and indeed it would perhaps be desirable for the store to provide a friendly reminder to such customers. Additionally, no consideration is given to multiple taxonomies (or hierarchies) in postulating these rules. For example, a refrigerator is a kitchen appliance is a heavy electrical appliance. This is a taxonomy. Given such a taxonomy, it may be possible to infer a rule that people who buy kitchen appliances tend also to buy dishware. This rule may hold even if rules such as "people who buy refrigerators also tend to buy dishware" or "people who buy kitchen
appliances also tend to buy dishware" do not hold. It would be useful to construct rules using such hierarchies. These are some of the aspects of association rules requiring the attention of both the statistical and the database community.

3 Graphical Models and Belief Networks

The pair-wise associations from the enrollment data in Table 1 can be represented using an undirected graph with nodes representing the classes. Weights on the edges between a pair of nodes would indicate the frequencies with which the two classes are taken by the same student. Graphs are frequently used to provide pictorial representations of joint pairwise relationships between variables. From a statistical point of view, they additionally help us to learn and specify the joint multivariate distribution of the variables. For suppose we are interested in specifying the joint distribution of p variables X_1, X_2, ..., X_p. The simplest approach is to specify that the joint distribution of the variables is the product of the marginals of each, i.e. the variables are all independent. In many complex phenomena and real-life situations, such an assumption is not tenable. On the other hand, multivariate distributions are difficult to specify and subsequent inference is often intractable, especially when p is large. Representing the joint distribution through an appropriate sequence of marginal and conditional distributions alleviates the problem somewhat. I provide an illustration through a simplified day-to-day example, also graphically represented in Figure 1.

[Figure 1: A directed graph representation for the output from a laser printer. Nodes: Computer/Workstation, Print Server, Printer, Toner, Paper, and Output (Printout). Directed edges are drawn in the direction of cause to effect.]

When a print job is submitted from a desktop computer/workstation, control of the job is transferred to the print server, which queues it appropriately and sends it to the
printing device. The printer needs toner and paper to print its output. The output may be represented by a variable which may be binary, or a continuous variable measuring the degree of satisfaction with the result. For instance, 0% may indicate no output, 100% a perfect result, while 25% may mean faded output as a consequence of an over-used toner. The variables here are the binary variable X_1 representing the correctness of the print command, X_2 for the number of pages transmitted from the print server to the printer, X_3 denoting the quality of the toner, X_4 for the number of pages in the printer paper tray and X_5 indicating the status of the printer (off-line, on-line or jammed) while interacting with the toner, the print server and the paper tray. The output variable is given by X_6. The distribution of X_1, X_2, ..., X_6 can be represented by means of the directed graph in Figure 1. The arrows indicate conditional dependence of the descendants on their parents. For instance, X_2 is dependent on X_1, while X_5 is dependent on its antecedents X_1, X_2, ..., X_4 only through its parents X_2, X_3, X_4. Finally, X_6 is dependent on its antecedents X_1, X_2, ..., X_5 only through X_5. The graph of Figure 1 indicates that the joint distribution of X_1, X_2, ..., X_6 is:

Pr(X_1, X_2, ..., X_6) = Pr(X_1) Pr(X_2 | X_1) Pr(X_3) Pr(X_4) Pr(X_5 | X_2, X_3, X_4) Pr(X_6 | X_5).   (1)

Note that conditional and marginal probability distributions may have additional parameters: for instance, both paper and toner may be modeled in terms of the number of pages sent through the print server and the date last replenished. Now suppose that, given X_6 = 0, we want to diagnose the possible cause. If all the probability distributions were completely specified, then by using Bayes' Theorem and conditional probability statements one can calculate the probabilities of the other components malfunctioning, given X_6 = 0, and use this to identify the most likely causes, together with a statement on the reliability of our assessments. The graphical structure described above is a Bayesian belief network.
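With fully specified (here binary and entirely made-up) conditional probability tables, the diagnosis Pr(component failed | X_6 = 0) can be obtained by brute-force enumeration over the printer network's factorization. The sketch below is purely illustrative; the numbers are assumptions, not from the article.

```python
from itertools import product

# Hypothetical CPTs (all variables binary, 1 = "OK"/"present"); numbers made up.
p_x1 = {1: 0.95, 0: 0.05}                              # print command correct
p_x2 = {1: {1: 0.99, 0: 0.01}, 0: {1: 0.0, 0: 1.0}}    # pages reach printer | x1
p_x3 = {1: 0.90, 0: 0.10}                              # toner good
p_x4 = {1: 0.95, 0: 0.05}                              # paper in tray
def p_x5(x5, x2, x3, x4):                              # printer ready | x2, x3, x4
    ok = 0.99 if (x2 and x3 and x4) else 0.01
    return ok if x5 == 1 else 1 - ok
def p_x6(x6, x5):                                      # output produced | x5
    ok = 0.98 if x5 else 0.0
    return ok if x6 == 1 else 1 - ok

def joint(x1, x2, x3, x4, x5, x6):
    # The chain-of-parents factorization of the printer network:
    # Pr(X1) Pr(X2|X1) Pr(X3) Pr(X4) Pr(X5|X2,X3,X4) Pr(X6|X5)
    return (p_x1[x1] * p_x2[x1][x2] * p_x3[x3] * p_x4[x4]
            * p_x5(x5, x2, x3, x4) * p_x6(x6, x5))

def posterior_given_no_output(var_index):
    """Pr(X_{var_index+1} = 0 | X6 = 0) by enumeration over (x1, ..., x5)."""
    num = den = 0.0
    for x in product([0, 1], repeat=5):
        p = joint(*x, 0)        # condition on X6 = 0 (no output)
        den += p
        if x[var_index] == 0:
            num += p
    return num / den
```

For example, posterior_given_no_output(2) gives the posterior probability that the toner is bad given that nothing printed, which rises well above its 0.10 prior; this is exactly the Bayes-theorem diagnosis described in the text, made feasible here only because the network is tiny.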
In reality, of course, we would not be handed down a complete specification of the conditional and marginal probability distributions, even with a completely known network. I address the alternative situation a little later, but for now consider a scenario with knowledge of the network structure as outlined, but unknown probabilities. Data on the network (such as several observations on the working of the printer in the example above) are used to learn the different probabilities. For each variable X_i, let Π_i represent the set of its antecedents, and θ_{iω}(x_i) be the conditional density of X_i at the point x_i given that Π_i = ω. For discrete state-spaces, one can put Dirichlet distributions on each probability, while for continuous state-spaces a Dirichlet Process prior may be used. Assuming parameter independence, i.e. independence of the parameters underlying the unknown probabilities, the probability distributions are updated with newer data. I refer to Heckerman (1996) for a detailed description of the related mechanics, but note that uncertainty in our knowledge of the probability distributions governing the network is addressed through these priors. The use of graphical and conditional structures to specify joint distributions has a long history in statistics. The simplest case is that of a Markov Chain. If X_1, X_2, ..., X_n are successive realizations of a first-order Markov Chain, the dependence structure may be represented using the directed graph X_1 → X_2 → ... → X_n. Note that the above structure
implies that the conditional distribution of any interior X_i given both the past and the future involves only its immediate antecedent X_{i-1} and its immediate descendant X_{i+1}. In other words, the conditional distribution of X_i given X_{i-1} and X_{i+1} is independent of the other variables. Thus an equivalent structure may be specified by an undirected acyclic graph, with the only edges involving X_i being those connecting it to X_{i-1} and X_{i+1} for all i = 2, 3, ..., n - 1. X_1 and X_n have only one connecting edge, to X_2 and X_{n-1} respectively. The undirected graph representation is important because a Markov Chain running backwards, and therefore having the directed graph X_1 ← X_2 ← ... ← X_n, has the same conditional dependence structure as the one above. Specifying joint distributions through conditional distributions is fairly common and well-established in statistics. For instance, it can be shown that under conditions of positivity for the state-space (which may be relaxed somewhat), complete knowledge of the local characteristics or the full conditional distributions Pr(X_i | X_j : j ≠ i, j = 1, 2, ..., n) is enough to specify the joint distribution of X_1, X_2, ..., X_n (Brook, 1964; Besag, 1974). Indeed, a system specified through its conditional probability structure is called a Markov Random Field (MRF) and is commonly used in statistical physics and Bayesian spatial statistics, to name a few areas. Of course, a conditional independence model as described above using a graph is handy here because the local characteristic at a particular X_i is dependent only on those X_j's which share a common edge with X_i. This property is called the locally Markov property. Those X_j's sharing an edge with X_i in the graphical representation of dependence are called neighbors of the latter. Defining the environment of a set of random variables to be the union of neighbors of each element in the set, a globally Markov property is defined to be the case when the conditional distribution of the subset of random variables given everything else depends only through the environment. Under conditions of positivity, the locally Markov property is equivalent to the globally Markov property.
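The interior-node property of the chain is easy to verify numerically. The sketch below (my own illustration, with arbitrary two-state transition probabilities) builds the joint distribution of a four-variable first-order chain from its factorization and checks that the conditional distribution of X_2 given everything else coincides with its distribution given only the neighbors X_1 and X_3.

```python
# Illustrative two-state first-order Markov chain X1 -> X2 -> X3 -> X4;
# the numbers are made up for the demonstration.
pi0 = [0.3, 0.7]                    # Pr(X1)
P = [[0.9, 0.1], [0.2, 0.8]]        # Pr(X_{t+1} = j | X_t = i)

def joint(x1, x2, x3, x4):
    # Chain factorization Pr(X1) Pr(X2|X1) Pr(X3|X2) Pr(X4|X3)
    return pi0[x1] * P[x1][x2] * P[x2][x3] * P[x3][x4]

def p_x2_given(x1, x3, x4=None):
    """Pr(X2 = 1 | X1 = x1, X3 = x3) and, optionally, X4 = x4 as well."""
    states = [x4] if x4 is not None else [0, 1]
    num = sum(joint(x1, 1, x3, c) for c in states)
    den = sum(joint(x1, j, x3, c) for j in (0, 1) for c in states)
    return num / den
```

Conditioning additionally on X_4 leaves the answer unchanged, which is the content of the equivalent undirected (neighbor) representation of the chain.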
Further, if we define a cliquo to be a set consisting either of a single random variable or of a set of random variables all of which are neighbors to each other, and a clique to be a maximal cliquo (such that there is no superset of the cliquo which is also a cliquo), it can be shown (Hammersley and Clifford, 1971) that the underlying probability density/mass function can be decomposed into a product of functions, each defined on a separate clique. In most cases, it is impractical to know causal relationships between the different components of the network. Indeed, learning the underlying causal structure is desirable and a much-sought goal of the knowledge discovery process. This can be done by extending the methodology from the case where the network is completely known but the governing probability distributions are not. We specify a prior probability on each possible network, and then, for each network, proceed as above. The posterior probability of each network is marginalized over the parameter probability distributions and the maximum a posteriori (MAP) estimate computed. Note that the computational burden is often severe, so that stochastic methods such as Markov Chain Monte Carlo (MCMC) are used in computation. Markov graphs are particularly amenable to such methods. Also, I refer back to the discussion in the previous paragraph and mention that, in the terminology of Verma and Pearl (1990), two network structures are equivalent if and only if they have the same structure, ignoring arc
directions, and the same v-structures. (A v-structure is an ordered triple (x, y, z) with arcs between x to y and z to y, but no arc between x and z.) Equivalent network structures have the same probability and, therefore, our learning methods actually learn about equivalence classes in the network. Even with these simplifications, the number of networks to evaluate grows exponentially with the number of nodes n. Of interest, therefore, is to see whether a few network-structures can be used in the learning procedure. Some researchers have shown that even a single good network structure often provides an excellent approximation to several networks (Cooper and Herskovits, 1992; Aliferis and Cooper, 1994; Heckerman et al., 1995). The focus then is on how to identify a few of these good networks. One approach is to score networks in terms of the goodness of fit and rank them; available methods include Madigan and Raftery's (1994) Occam's Razor, Rissanen's (1987) Minimum Description Length (MDL), the more traditional Akaike's (1974) An Information Criterion (AIC) or Schwarz's (1978) Bayes Information Criterion (BIC). Finally, consider the issue of learning new variables in the network whose existence and relevance are not known a priori; they are therefore hidden. Such variables (both their number and state-space) can be incorporated in the prior model during the network-learning process. I refer to Cheeseman and Stutz (1996) for a detailed description of a network with hidden variables. Belief networks have generated a lot of interest in the knowledge discovery community; a few references in this regard are Charniak (1991), Spiegelhalter et al. (1993), Heckerman et al. (1995), Lauritzen (1982), Verma and Pearl (1990), Frydenberg (1990), Buntine (1994, 1996). I conclude this discussion of graphical models and belief networks by noting that most of the network-learning methods discussed here are computer- and memory-intensive, especially in the case of very complex processes with lots of variables. Developing methodology in this context is therefore desirable and useful.
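To make the scoring idea concrete, here is a small illustrative sketch (my own, not from the article) that compares two candidate structures for a pair of binary variables by BIC, computed as the maximized log-likelihood minus (d/2) log n, where d counts free parameters: structure A treats X and Y as independent (d = 2), structure B adds the arc X → Y (d = 3).

```python
import math
from collections import Counter

def bic_scores(data):
    """BIC scores of two candidate network structures on binary pairs (x, y):
    A: X, Y independent; B: X -> Y. Higher score = preferred structure."""
    n = len(data)
    cx = Counter(x for x, _ in data)
    cy = Counter(y for _, y in data)
    cxy = Counter(data)
    # Structure A log-likelihood: sum of log Pr(x) + log Pr(y)
    llA = sum(math.log(cx[x] / n) + math.log(cy[y] / n) for x, y in data)
    # Structure B log-likelihood: sum of log Pr(x) + log Pr(y | x)
    llB = sum(math.log(cx[x] / n) + math.log(cxy[(x, y)] / cx[x]) for x, y in data)
    return llA - (2 / 2) * math.log(n), llB - (3 / 2) * math.log(n)
```

On strongly associated data the extra arc pays for its parameter and B scores higher; on independent data the penalty dominates and the sparser structure A wins, which is the behavior that makes such scores usable for ranking candidate networks.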
4 Classification and Supervised Learning

Supervised learning or classification has long been an important area in statistics, and can arise with regard to a number of applications. For instance, a bank may like to identify long-term depositors by their attributes and provide them with incentives for conducting business at that bank. Other applications also exist; here we discuss a case study in the emerging area of bioinformatics.

4.1 Protein Localization

Identifying protein localization sites is an important early step for finding remedies (Nakai and Kanehisa, 1991). The E. coli bacterium has eight such sites: cp or the cytoplasm, im or the inner membrane without signal sequence, pp or periplasm, imU or inner membrane with an uncleavable signal sequence, om or the outer membrane, omL or the outer membrane lipoprotein, imL or the inner membrane lipoprotein, and imS or the inner membrane with a cleavable signal sequence. Each protein sequence has a number of numeric attributes: these are mcg or measurements obtained using McGeoch's method
for signal sequence recognition, gvh or measurements via von Heijne's method for signal sequence recognition, aac or the score of a discriminant analysis of the amino acid content of outer membrane and periplasm proteins, alm1 or the score of the ALOM membrane-spanning region prediction program, and alm2 or the score using the ALOM program after excluding putative cleavable signal regions from the sequence. In addition, there are two binary attributes: lip or von Heijne's Signal Peptidase II consensus sequence score, and chg, indicating the presence of a charge on the N-terminus of predicted lipoproteins. (See Horton and Nakai, 1996 and the references therein for a detailed description of these attributes.) Data on 336 such protein sequences for E. coli are available for analysis at the University of California Irvine's Machine Learning Repository (Horton and Nakai, 1996). The lip and chg binary attributes are ignored for classification because some of the methods discussed here do not admit binary attributes. As a consequence, protein sequences from the sites imL and imS can no longer be distinguished from each other. Sequences from these sites are also eliminated. Two cases from the site omL are discarded because the number of cases is too few for quadratic discriminant analysis, leaving for analysis a subset of 324 sequences from five protein localization sites. Further, a set of 162 random sequences was kept aside as a test set; the remaining 162 formed the training data.

4.2 Decision-Theoretic Approaches

The dataset is in the form {(X_i, G_i) : i = 1, 2, ..., n}, where 1 ≤ G_i ≤ K represents the group to which the i-th observation X_i belongs. The density of an arbitrary observation X evaluated at a point x may be written as f(x) = Σ_{k=1}^{K} π_k f_k(x), where π_k is the prior probability that an observation belongs to the k-th group, and f_k(·) is the density of an observation from the k-th group. Our goal is to specify the group G(x), given an observation x. Applying Bayes' theorem provides posterior membership probabilities of x for each group:

Pr(G(x) = k | X = x) = π_k f_k(x) / Σ_{j=1}^{K} π_j f_j(x).

The group with the highest posterior probability is identified as the predicted class for x.
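The Bayes rule above is direct to implement once the class densities are specified. As a minimal illustrative sketch (univariate Gaussian groups with made-up parameters, not the protein data; the article's analyses were done in R):

```python
import math

def posterior(x, priors, means, sds):
    """Pr(G(x) = k | X = x) = pi_k f_k(x) / sum_j pi_j f_j(x)
    for K univariate Gaussian class densities f_k."""
    def phi(x, m, s):
        # Gaussian density with mean m and standard deviation s
        return math.exp(-0.5 * ((x - m) / s) ** 2) / (s * math.sqrt(2 * math.pi))
    w = [p * phi(x, m, s) for p, m, s in zip(priors, means, sds)]
    total = sum(w)
    return [wi / total for wi in w]

def classify(x, priors, means, sds):
    # Bayes rule: assign x to the group with highest posterior probability
    post = posterior(x, priors, means, sds)
    return max(range(len(post)), key=post.__getitem__)
```

With equal priors and well-separated means, the rule simply assigns x to the nearer class mean; unequal priors shift the decision boundary toward the rarer class.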
Note that having the individual class densities is equivalent to obtaining the posterior class probabilities, and hence the predicted classification (assuming that the π_k's are known). When the densities f_k are multivariate Gaussian, each with different mean vector µ_k but with common dispersion matrix Σ, the logarithm of the ratio between the posterior class probabilities of classes k and j is a linear function in x. This is because

log [Pr(G(x) = k | X = x) / Pr(G(x) = j | X = x)] = log(π_k/π_j) − (1/2)(µ_k + µ_j)ᵗΣ⁻¹(µ_k − µ_j) + xᵗΣ⁻¹(µ_k − µ_j).

This is the basis for Fisher's (1936) linear discriminant analysis (LDA). Linear discriminant functions δ_k(x) = xᵗΣ⁻¹µ_k − (1/2)µ_kᵗΣ⁻¹µ_k + log π_k are defined for each group. Then the decision
rule is equivalent to G(x) = argmax_k δ_k(x). In practice, the π_k's, µ_k's and Σ are unknown and need to be estimated. The class proportions of the training data are used to estimate the π_k's, the individual group means are used to estimate the µ_k's, while Σ is estimated using the pooled within-groups dispersion matrix. Operational implementation of LDA is very straightforward: using the spectral decomposition Σ̂ = VDVᵗ and X* = D^{−1/2}VᵗX as the projection of the sphered data-point X into the transformed space, a new observation is classified to the closest class centroid in the transformed space, after scaling closeness appropriately to account for the effect of the class prior probabilities π_k. For arbitrary Gaussian densities, the discriminant function δ_k(x) = −(1/2) log |Σ_k| − (1/2)(x − µ_k)ᵗΣ_k⁻¹(x − µ_k) + log π_k includes a quadratic term in x, so the method is called quadratic discriminant analysis (QDA). The estimation of the parameters is similar to before, with the k-th group dispersions from the training data used to estimate Σ_k. For the protein sequence data, the training dataset was used to devise classification rules using both LDA and QDA. With LDA, the test set had a misclassification rate of 30.5%, while with QDA it was only 9.3%. Clearly, LDA was not very successful here. In general, both methods are reputed to have good performance despite being fairly simple. This is perhaps because most decision boundaries are no more complicated than linear or quadratic. Friedman (1989) suggested a compromise between LDA and QDA by proposing regularized discriminant analysis (RDA). The basic idea is to have a regularized estimate for the covariance matrix Σ̂_k(γ) = γΣ̂_k + (1 − γ)Σ̂, where Σ̂ is the pooled dispersion matrix from LDA and Σ̂_k is the k-th group variance-covariance matrix obtained as per QDA. Other regularizations exist, and we address these shortly. But first, a few additional words on the popularity of LDA. As discussed in Hastie et al. (2001), one attraction of LDA is an additional restriction that allows substantial reduction in dimensionality through informative low-dimensional projections.
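The linear discriminant rule above is simple enough to sketch in a few lines. In the following illustration (mine, not from the analysis; the two centroids, common dispersion and priors are hypothetical, and Σ⁻¹ is supplied directly rather than estimated from training data), each group's δ_k(x) is evaluated and the largest wins:

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def mat_vec(m, v):
    return [dot(row, v) for row in m]

def lda_discriminant(x, mu_k, sigma_inv, pi_k):
    """delta_k(x) = x' Sigma^-1 mu_k - 0.5 mu_k' Sigma^-1 mu_k + log pi_k."""
    s_mu = mat_vec(sigma_inv, mu_k)
    return dot(x, s_mu) - 0.5 * dot(mu_k, s_mu) + math.log(pi_k)

def lda_classify(x, means, sigma_inv, priors):
    """Assign x to the group with the largest linear discriminant score."""
    scores = [lda_discriminant(x, mu, sigma_inv, pi)
              for mu, pi in zip(means, priors)]
    return max(range(len(scores)), key=lambda k: scores[k])

# Hypothetical two-group problem with common dispersion Sigma = I (so Sigma^-1 = I).
means = [[0.0, 0.0], [2.0, 2.0]]
sigma_inv = [[1.0, 0.0], [0.0, 1.0]]
priors = [0.5, 0.5]
print(lda_classify([0.3, -0.2], means, sigma_inv, priors))  # near the first centroid
print(lda_classify([1.8, 2.1], means, sigma_inv, priors))   # near the second centroid
```

QDA differs only in that each group carries its own Σ_k, adding the log-determinant and quadratic terms to the score.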
The K p-dimensional centroids obtained using LDA span at most a (K − 1)-dimensional subspace, so there is substantial reduction in dimensionality if p is much more than K. So, X may as well be projected onto a (K − 1)-dimensional subspace, and the classes separated there without any loss of information. A further reduction in dimensionality is possible by projecting X onto a smaller subspace in some optimal way. Fisher's optimality criterion was to find projections which spread the centroids as much as possible in terms of variance. The problem is similar to finding principal component subspaces of the centroids themselves: formally, the problem is to define the j-th discriminant coordinate by the corresponding principal component of the matrix of centroids (after scaling with Σ̂). The solution is equivalent to that for the generalized eigenvalue problem, and to Fisher's problem of finding the projection of the data that maximizes the between-class variance (hence separates the different classes as widely as possible) relative to the within-class covariance, subject to being orthogonal to the projection coordinates already found. As with principal components, a few discriminant coordinates are often enough to provide well-separated classes. We display LDA using only the first two discriminant variables, the projected data-points using the corresponding discriminant coordinates, for the E. coli protein sequence dataset in Figure 2.
Figure 2: Decision boundaries for the classifier based on the first two linear discriminants (axes: first and second discriminant coordinates). The larger-font c, p, o, u and i represent the projected means in the first two discriminant coordinates for the cp, pp, om, im and imU categories. Smaller-font characters represent the projected observations for each group.

One drawback of LDA arises from the fact that linear decision boundaries are not always adequate to separate the classes. QDA alleviates the problem somewhat by using quadratic boundaries. More general decision boundaries are provided by flexible discriminant analysis (FDA) (Hastie et al., 1994). The basic idea is to recast the classification problem as a regression problem and to allow for nonparametric approaches. To do this, define a score function ϕ : G → IR which assigns scores to the different classes, and regress these scores on a transformed version of the predictor variables, perhaps allowing for regularization and other constraints. In other words, the problem reduces to finding scorings ϕ and transformations ζ(·) that minimize
the constrained mean squared error

MSE({ϕ_m, ζ_m}_{m=1}^M) = (1/n) Σ_{m=1}^M [ Σ_{i=1}^n {ϕ_m(G_i) − ζ_m(x_i)}² + λJ(ζ_m) ].

Here J is a regularization constraint corresponding to the desired nonparametric regression methodology, such as smoothing and additive splines, kernel estimation and so on. Computations are particularly simplified when the nonparametric regression procedure can be represented as a linear operator ϖ_h with smoothing parameter h. Then the problem can be reformulated in terms of the multivariate adaptive regression of Y on X. Here Y is the (n × K) indicator response matrix, with i-th row corresponding to the i-th training data-pair (X_i, G_i) and having exactly one non-zero entry of 1, in the column corresponding to G_i. Denote ϖ_h as the linear operator that fits the final chosen model with fitted values Ŷ, and let ζ̂(x) be the vector of fitted functions. The optimal scores are obtained by computing the eigen-decomposition of YᵗŶ and then projecting ζ̂(x) onto this normalized eigen-space. These optimal scores are then used to update the fit. The above approach reduces to LDA when ϖ_h is the linear regression projection operator. FDA can also be viewed directly as a form of penalized discriminant analysis (PDA), where the regression procedure is written as a generalized ridge regression procedure (Hastie et al., 1995). For the protein sequence dataset, PDA provided a misclassification rate of 13.59% on the test set. FDA used with multivariate adaptive regression splines (MARS) (Friedman, 1991) gave a test set misclassification rate of 16.05%. With BRUTO, the rate was 16.67%, while with a quadratic polynomial regression scoring method the test set misclassification error was 14.81%. Another generalization is mixture discriminant analysis (MDA). This is an immediate extension of LDA and QDA, built on the assumption that the density for each class is a mixture of Gaussian distributions with unknown mixing proportions, means and dispersions (Hastie and Tibshirani, 1996). The number of mixture components in each class is assumed to be known. The parameters are estimated for each class by the Expectation-Maximization (E-M) algorithm.
MDA is more flexible than LDA or QDA in that it allows more prototypes than the mean and variance to describe a class: here the prototypes are the mixing proportions, means and dispersions that make up the mixture dispersion in each class. This allows for more general decision boundaries, beyond the linear or the quadratic. One may combine FDA or PDA with MDA models to allow for even more generalizations. More flexible options for decision-theoretic approaches to the classification problem involve the use of nonparametric density estimates for each class, or the special case of naive Bayesian approaches, which assume that the inputs are conditionally independent within each class. The full array of tools in density estimation can then be put to work here.

4.3 Distribution-free Predictive Approaches

The methods discussed in the previous section are essentially model-based. Model-free approaches such as tree-based classification also exist and are popular for their intuitive appeal.
In this section, we discuss in detail a few such predictive approaches: these are the nearest-neighbor methods, support vector machines, neural networks and classification trees.

4.3.1 Nearest-neighbor Approaches

Perhaps the simplest and most intuitive of all predictive approaches is k-nearest-neighbor classification. Depending on k, the strategy for predicting the class of an observation is to identify the k closest neighbors from among the training dataset and then to assign the class which has the most support among those neighbors. Ties may be broken at random or using some other approach. Despite the simple intuition behind the k-nearest-neighbor method, there is some similarity between it and regression. To see this, note that the regression function f(x) = IE(Y | X = x) minimizes the expected (squared) prediction error. Relaxing the conditioning at a point to include a region close to the point, together with a 0-1 loss function, leads to the nearest-neighbor approach. The choice of k, the number of neighbors included in the classification at a new point, is important. A common choice is k = 1, but this can give rise to very irregular and jagged regions with high variances in prediction. Larger choices of k lead to smoother regions and less variable classifications, but do not capture local details and can have larger biases. Since this is a predictive problem, the choice of k may be made using cross-validation. Cross-validation was used on the protein sequence training dataset to obtain k = 9 as the optimal choice in terms of minimizing predicted misclassification error. The distance function used here was Euclidean. The 9-nearest-neighbor classification predicted the protein sequences in the test set with a misclassification rate of 15.1%. With k = 1, the misclassification rate was 17.3%. On the face of it, k-nearest neighbors have only one parameter: the number of neighbors included in deciding the majority-vote predicted classification. However, the effective number of parameters to be fit is not really k but more like n/k. This is because the approach effectively divides the region spanned by the training set into approximately n/k parts, and each part is assigned to its majority-vote classifier.
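The whole procedure fits in a few lines. The sketch below is mine (the training pairs are made up, ties are broken by insertion order rather than at random, and the distance is Euclidean, as in the analysis above):

```python
import math
from collections import Counter

def knn_classify(x, training, k):
    """k-nearest-neighbor prediction: Euclidean distance, majority vote.

    training is a list of (feature_vector, class_label) pairs."""
    neighbors = sorted(training, key=lambda pair: math.dist(x, pair[0]))[:k]
    votes = Counter(label for _, label in neighbors)
    # Ties are broken here by insertion order; random tie-breaking is another option.
    return votes.most_common(1)[0][0]

# A hypothetical training set with two classes in two dimensions.
training = [([0.0, 0.0], "a"), ([0.2, 0.1], "a"), ([0.1, 0.3], "a"),
            ([1.0, 1.0], "b"), ([0.9, 1.2], "b"), ([1.1, 0.8], "b")]
print(knn_classify([0.1, 0.1], training, k=3))
print(knn_classify([1.0, 0.9], training, k=3))
```

Cross-validating the choice of k amounts to wrapping this function in a loop over candidate values and held-out folds.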
Another point to be noted is that the curse of dimensionality (Bellman, 1961) also impacts performance. As dimensionality increases, the data-points in the training set become closer to the boundary of the sample space than to any other observation. Consequently, prediction is much more difficult at the edges of the training sample, since some extrapolation in prediction may be needed. Furthermore, for a p-dimensional input problem, the sampling density is proportional to n^{1/p}. Hence the sample size required to maintain the same sampling density as in lower dimensions grows exponentially with the number of dimensions. The k-nearest-neighbor approach is therefore not immune to the phenomenon of degraded performance in higher dimensions.

4.3.2 Support Vector Classifiers and Machines

Consider a two-class problem with two input dimensions in the feature space, as represented in Figure 3. Without loss of generality, let the class indicators G_i be −1 and +1 (each class denoted by its own plotting symbol in the figure). Figure 3a shows the case where the two classes
are linearly separable (the line ωᵗx = γ is one possible separating line). Note that for points classified correctly, G_i(ωᵗx_i − γ) ≥ 0. Rosenblatt (1958) devised the perceptron learning algorithm with a view to finding a separating hyperplane (a line in two dimensions) minimizing the distance of misclassified points to the decision boundary. However, Figure 3a shows that there can be an infinite number of such solutions. Vapnik (1996) provided an elegant approach to this problem by introducing an additional requirement: the revised objective is to find the optimal separating hyperplane separating the two classes while also maximizing the shortest distance of the points in each class to the hyperplane. Assuming without loss of generality that ‖ω‖ = 1, the optimization problem is to maximize δ over ‖ω‖ = 1 and γ, subject to the constraint that G_i(ωᵗx_i − γ) ≥ δ for all i = 1, 2, ..., n. The problem can be written in the equivalent but more convenient form: minimize ‖ω‖ over all ω and γ subject to the constraint G_i(ωᵗx_i − γ) ≥ 1 for all i = 1, 2, ..., n. Here, we have dropped the unit-norm requirement on ω and set δ = ‖ω‖⁻¹. Standard solutions to this problem exist and involve quadratic programming. The classifier for a new point x is given by sign(ω̂ᵗx − γ̂), where ω̂ and γ̂ are the optimal values. Figure 3b shows the case where the two classes overlap in the input space, so that no hyperplane can completely separate the two groups. A reasonable compromise is to allow some points to be on the wrong side of the margin. Let us define additional nonnegative slack variables ϑ = {ϑ_i, i = 1, 2, ..., n}. The original constraint in the problem of finding the optimal separating hyperplane is then modified to be G_i(ωᵗx_i − γ) ≥ δ(1 − ϑ_i) for all i = 1, 2, ..., n, with Σϑ_i bounded above by a constant. The slack variables ϑ_i in the constraint denote the proportional amount by which any predicted observation is on the wrong side of the classification margin. Capping their sum limits the total proportional amount by which predictions are made on the wrong side of the margin. Note that misclassifications occur only when ϑ_i > 1, so that Σϑ_i ≤ M automatically limits the number of training set misclassifications to at most M.
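The role of the slack variables is easy to see numerically. The sketch below is my own illustration (a made-up candidate hyperplane and data, with the constraint scaled as above so that the margin is at G_i(ωᵗx_i − γ) = 1): it computes ϑ_i = max(0, 1 − G_i(ωᵗx_i − γ)) and counts the misclassifications that result.

```python
def slack_variables(omega, gamma, points, labels):
    """theta_i = max(0, 1 - G_i (omega' x_i - gamma)) for a candidate hyperplane.

    theta_i = 0: correct side of the margin; 0 < theta_i <= 1: inside the
    margin but still correctly classified; theta_i > 1: misclassified."""
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))
    return [max(0.0, 1.0 - g * (dot(omega, x) - gamma))
            for x, g in zip(points, labels)]

# Hypothetical 2-D data with class indicators G_i in {-1, +1}, and the
# candidate hyperplane omega' x = gamma with omega = (1, 1), gamma = 1.5.
points = [[2.0, 2.0], [1.6, 1.2], [0.2, 0.4], [-1.0, -0.5], [1.5, 1.5]]
labels = [1, 1, -1, -1, -1]
theta = slack_variables([1.0, 1.0], 1.5, points, labels)
print(theta)                      # the last point is on the wrong side: theta > 1
print(sum(t > 1 for t in theta))  # number of training misclassifications
```

Bounding Σϑ_i in the quadratic program is what trades margin width against such violations.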
Once again, the optimization problem can be rephrased as that of minimizing ‖ω‖ subject to the constraints G_i(ωᵗx_i − γ) ≥ (1 − ϑ_i), with δ = ‖ω‖⁻¹, Σϑ_i ≤ M and ϑ_i ≥ 0 for all i = 1, 2, ..., n, so that, similar to the separable case, quadratic programming techniques can again be used. It can be shown (Hastie et al., 2001) that the solution ω̂ is of the form ω̂ = Σ_{i=1}^n α̂_i G_i x_i, with non-zero values of α̂_i only when G_i(ωᵗx_i − γ) = (1 − ϑ_i). The solution is called the support vector classifier. Note that points with non-zero values for the slack variables ϑ_i play a major role: these are called the support vectors. This is a major distinction from LDA, where the solution depends on all data points, including those far away from the decision boundary; it does this through calculations on the class covariance matrices and the class centroids. The support vector classifier, on the other hand, uses all data-points to identify the support vectors, but focuses on the observations near the boundary for the classification. Of course, if the underlying classes are really Gaussian, LDA will perform better than SVM because of the latter's heavy reliance on the (noisier) observations near the class boundaries. Support vector machines (Cristianini and Shawe-Taylor, 2001) generalize the above scenario in a spirit similar to the extension of FDA over LDA. The support vector classifier finds linear boundaries in the input feature space. The approach is made more flexible by
Figure 3: The support vector classifier for a two-class problem in a two-dimensional input space. (a) In the separable case, the decision boundary ωᵗx = γ is the solid line, while broken lines bound the maximal separating margin of width 2δ. (b) In the overlapping case, points on the wrong side of their margin are labeled θ_j = δϑ_j, which is the distance from their margin. All unlabeled points have θ_j = 0. The maximal margin itself is obtained within the constraint that Σϑ_j does not exceed a certain allowable budget.

enlarging the feature space to include basis expansions such as polynomials or splines. Once the basis functions are selected to be, say, ζ(x) = {ζ_i(x), i = 1, 2, ..., m}, the problem is recast in the same framework to obtain the classifier G(x) = sign(ω̂ᵗζ(x) − γ̂). Similar to the case of linear boundaries, the solution is now ω̂ = Σ_{i=1}^n α̂_i G_i ζ(x_i). This representation makes computations practical, because the fitted function Ĝ(x) now has the form Σ_{i=1}^n α̂_i G_i ⟨ζ(x), ζ(x_i)⟩, which means that knowledge of the kernel function Ψ(x, y) = ⟨ζ(x), ζ(y)⟩, rather than the exact transformation ζ(x), is enough to solve the problem. Popular choices for Ψ(x, y) include the p-th degree polynomial Ψ(x, y) = (1 + xᵗy)^p, the radial basis function Ψ_h(x, y) = exp(−‖x − y‖²/h), and the sigmoidal function Ψ(x, y) = tanh(β₁xᵗy + β₀). Support vector machines with different kernels were used on the protein sequence dataset. Linear support vector classification, using 35 support vectors, yielded a test set misclassification rate of 62.34%. A slightly higher misclassification rate of 63.5% was obtained with the sigmoidal
kernel with β₁ = 0.2 and β₀ = 0. This last application used 33 support vectors. Using a polynomial kernel of degree 6 improved misclassification a bit: prediction errors on the test set were of the order of 56.8%. The number of support vectors was then 45. Finally, using the radial basis function with h = 5 gave a test set misclassification rate of 17.9%. The number of support vectors was identified to be 51. Clearly, the performance is best with the radial kernel, even though it is worse than that using either QDA, FDA, PDA or k-nearest neighbors. I conclude with a brief discussion of recent work in this fast-developing area. As with several similar problems, support vector machines are also challenged by massive databases. One approach, developed by Platt (1998, 1999), is to break the quadratic programming into a series of smallest possible problems. This is called chunking. The smallest possible chunk size is 2. Since the component problems are small, analytic solutions may be obtained, speeding up computations. Another area of interest is to formulate the problem in a probabilistic framework and to obtain not a prediction but the posterior probability of a class given an input. This is done, for example, by formulating the classification model through a logistic link function Prob(G = 1 | x) = (1 + exp{−f(x)})⁻¹ (Wahba, 1999). Other approaches also exist; see, for example, Hastie and Tibshirani (1998), Platt (1999), Vapnik (1996) and the references therein.

4.3.3 Artificial Neural Networks

Consider a classification problem with two input variables X₁ and X₂. A single-hidden-layer artificial neural network is obtained by defining a layer Z with components Z_l = σ(α_{0l} + α_lᵗX), l = 1, 2, ..., L. The output G is then assumed to be a perturbed version of a function of a linear combination of this layer: G = F_j(X) ⊕ ε, where F_j(X) = g_j(T_j) with T_j = β_{0j} + β_jᵗZ, ⊕ represents the degradation operator on the signal F(X) with the noise ε, σ(·) is the activation function, and j = 0, 1 for the two-class problem (more generally j = 0, 1, ..., K − 1 for the K-class problem). This is illustrated in Figure 4, with L = 3.
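Before going further, the forward pass just described can be sketched concretely. This is my illustration, not a fitted model: all the weights below are hypothetical, σ is the common sigmoid choice and g_j the softmax output function (both defined in the next paragraph), and only prediction, not fitting, is shown.

```python
import math

def sigmoid(u):
    """The common choice of activation function."""
    return 1.0 / (1.0 + math.exp(-u))

def softmax(t):
    """Output function g_j(T) = exp(T_j) / sum_j' exp(T_j')."""
    m = max(t)  # subtract the maximum for numerical stability
    exps = [math.exp(v - m) for v in t]
    s = sum(exps)
    return [e / s for e in exps]

def forward(x, alpha0, alpha, beta0, beta):
    """One forward pass: Z_l = sigmoid(alpha0[l] + alpha[l] . x),
    then T_j = beta0[j] + beta[j] . Z, then softmax over the T_j."""
    z = [sigmoid(a0 + sum(a * xi for a, xi in zip(av, x)))
         for a0, av in zip(alpha0, alpha)]
    t = [b0 + sum(b * zl for b, zl in zip(bv, z))
         for b0, bv in zip(beta0, beta)]
    return softmax(t)

# Hypothetical weights: two inputs, three hidden units, two classes.
alpha0 = [0.1, -0.2, 0.0]
alpha = [[0.5, -0.4], [0.3, 0.8], [-0.6, 0.2]]
beta0 = [0.0, 0.1]
beta = [[1.0, -1.0, 0.5], [-0.5, 0.9, 0.2]]
probs = forward([0.7, -1.2], alpha0, alpha, beta0, beta)
print(probs)                                  # class probabilities, summing to one
print(max(range(2), key=lambda j: probs[j]))  # predicted class
```

Fitting, discussed below, amounts to choosing the α's and β's to minimize an error function of these outputs.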
The X's form the input layer, the Z's are the hidden layer and the Y's are the output layer. The components α₀ and β₀ are the bias units for each layer. In general, there can be several hidden layers. The most common choice for the activation function is the sigmoid σ(u) = (1 + exp(−u))⁻¹, though a Gaussian radial basis function σ(u) = exp{−u²} has also been used, in what is called a radial basis function network. The function g_j(·) is called the output function and is usually the softmax function g_j(T) = exp(T_j)/Σ_{j′} exp(T_{j′}). This is of course the transformation also used in logistic regression, which, like other regression models such as multiple linear regression or projection pursuit (Friedman and Stuetzle, 1984), is a special case within this framework: choosing the identity function for σ(·), L = 1, β₀ = 0 and the binomial distribution, for instance, takes us to the logistic model. Note that there may be any number of hidden layers in the model. Fitting the neural network means estimating the unknown parameters α and β to obtain the classifier G(x) = argmax_j F_j(X), optimizing the error function, usually chosen to be least-squares or entropy error. Because of the strong possibility of over-fitting, given the
Figure 4: A two-variable-input, one-hidden-layer, feed-forward neural network.

model's enormous flexibility, a penalty component is usually incorporated; early stopping is also employed. The most common method used to fit a neural network model is back-propagation, proposed by Hopfield (1982, 1984): essentially a gradient descent approach which works by setting the parameters to some initial values, building F from these values in the network, and then computing the error between the fitted F and the observations. This error is then propagated back to the previous hidden layer, where it is apportioned to each component. The process is in turn repeated down to the previous layer and on down to the input layer. The input parameters are adjusted to minimize these apportioned errors and fed forward to form a new updated version of F. This procedure is iterated till convergence. Back-propagation can be quite slow operationally, and is speeded up using conjugate gradients and variable metric methods. Artificial neural networks have been wildly popular in the engineering and artificial intelligence communities, partly because of an effective naming strategy that has linked the methodology to an imitation of the brain. The enormous flexibility provided by any number of hidden layers is often mentioned as its great virtue. However, from a statistical perspective, they are really nothing more than non-linear statistical models, with the inherent
dangers of over-fitting presented by the possibly unrestrained number of parameters and layers. Over-fitting of the weight parameters is countered by incorporating penalty functions and by using cross-validation to decide on the weight of the penalty. The number of hidden units and layers is best left to knowledge and understanding of the system being modeled and/or to experimentation. Additionally, the initial weights supplied to the fitting algorithms have a good deal of impact on the rate of convergence and the final solution: small weights mean that the algorithm hardly moves, while large weights often lead to poor solutions. Also, since scaling of the inputs impacts the quality of the solution, it is perhaps best to start with inputs all standardized to zero mean and unit standard deviation. Finally, as with many nonlinear problems, multiple local minima exist; the solution is either to try a number of runs with different starting points and choose the optimum among the solutions, or to average over predictions from different solutions (Ripley, 1996). Applying the single-hidden-layer neural network to the protein sequence training sample with an entropy error function and several random initial weights yielded a network with two nodes in the hidden layer and a test set misclassification rate of 22.22%.

4.3.4 Classification Trees

We conclude this description of nonparametric classification with a brief discussion of classification trees (Breiman et al., 1984). These are supervised learning methods which result in a tree-like structure, effectively partitioning the input feature space into sets of rectangles and then taking a majority vote of the training set observations in each set to come up with a classification rule. A common approach is to use only binary splits at each node, though multi-split approaches are also used. The tree is built by finding the split on a predictor variable which best divides the data into homogeneous groups. It is thus inductively built through future splits that are wholly dependent on past choices (such algorithms are called greedy algorithms). The tree-depth is usually found by cross-validation.
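The greedy split search at the heart of tree-building can be sketched for a single predictor. This toy version is mine (Breiman et al. discuss several impurity measures; I use the Gini impurity here, with made-up one-feature data):

```python
def gini(labels):
    """Gini impurity of a set of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = {}
    for g in labels:
        counts[g] = counts.get(g, 0) + 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def best_split(values, labels):
    """Scan every cut-point of one feature and return the binary split
    minimizing the weighted Gini impurity of the two child nodes."""
    n = len(values)
    best_cut, best_score = None, float("inf")
    for cut in sorted(set(values)):
        left = [g for v, g in zip(values, labels) if v <= cut]
        right = [g for v, g in zip(values, labels) if v > cut]
        if not left or not right:
            continue
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best_score:
            best_cut, best_score = cut, score
    return best_cut, best_score

# Hypothetical one-feature data: small values are class 0, large ones class 1.
values = [0.1, 0.4, 0.35, 0.8, 0.9, 0.75]
labels = [0, 0, 0, 1, 1, 1]
print(best_split(values, labels))  # a perfect split at 0.4, child impurity 0.0
```

A full tree repeats this search over all predictors at each node and recurses on the two child partitions until a stopping rule (or cross-validated pruning) ends the growth.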
We used the training sample for the protein sequence dataset to come up with a seven-node tree as the best leave-one-out cross-validated tree. This tree yielded a 19.75% test set misclassification rate. Classification trees are attractive because of their interpretability and their ability to handle missing observations (as a new value, or through a surrogate variable for splitting at a node). Furthermore, they are easy to use in higher dimensions, since the splitting is often done one feature at a time. Moreover, both categorical and quantitative input variables can be incorporated in the model. However, there are some disadvantages. Some algorithms require discrete (binned) input variables. More importantly, a tree model invariably has high-order interactions, given that all splits depend on previous ones. Also, classification is very often done across several databases, and there needs to be a way of unifying such trees. I discuss methods for aggregating different trees in the following section, but conclude this section with an outline of a different approach to this problem. The strategy is to take Fourier transforms of the different classification trees from the different databases (Kargupta and Park, 2001). Indeed, the coefficients of these representations decay exponentially, so that storing only a few for computations is adequate. This is especially useful for applications such as mobile
computing, where memory and bandwidth are limited and need to be conserved. The stored coefficients can then be combined to build a tree-based structure based on all the databases.

4.4 Aggregation Approaches

The classification approaches perform differently on different datasets. No method uniformly performs better than any other on all datasets, and this is indeed expected, because each technique addresses specific situations. Moreover, some methods, such as classification trees, have high variance and are not very robust. Perturbing the data slightly can result in a very different tree structure. In this section, we discuss approaches which combine different models as well as aggregate over bootstrap samples of the data.

4.4.1 Bootstrap Aggregation

The concept of bootstrap aggregation, or bagging, was first introduced by Breiman (1996). Suppose that, as before, we have training data of the form Z = {(X_i, G_i) : i = 1, 2, ..., n}, and let us assume that this yields the classification rule Ĝ(x). Bagging computes a composite rule over a collection of bootstrap samples, with variance reduction as the goal. For each bootstrap sample Z*_b, b = 1, 2, ..., B, a classifier Ĝ*_b(x) is obtained. The test set is classified from each bootstrapped rule Ĝ*_b(x), and each observation in the test set is assigned to the group which has the highest representation among the bootstrap predictions. The scheme outlined here is a majority-voting rule. (An alternative scheme is to average the posterior probability for each class over the bootstrap predictions.) Bagging is a methodology also used in regression; indeed, it can be shown quite easily that, within this framework and under squared-error loss, the expected error in prediction is reduced. The same does not necessarily hold in classification: bagging a good classifier can make it better, but bagging a bad one can actually make it worse! To see this, consider the simple two-class scenario where the observation G_i ≡ 1 for all X. Any classifier Ĝ(x) that predicts an observation into class 0 with probability α and class 1 otherwise will have a misclassification rate of α, while the error rate of the bagged rule will be 1.0 for α > 0.5.
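A minimal majority-vote bagging loop can be sketched as follows. This is my own illustration under stated assumptions: the base learner is a deliberately weak one-feature threshold "stump" of my invention, the data are made up, and B is kept small:

```python
import random
from collections import Counter

def bootstrap_sample(data, rng):
    """Sample n points with replacement from the n training points."""
    return [rng.choice(data) for _ in data]

def bagged_classifier(data, fit, b_samples, rng):
    """Fit one rule per bootstrap sample; predict by majority vote over the rules."""
    rules = [fit(bootstrap_sample(data, rng)) for _ in range(b_samples)]
    def predict(x):
        votes = Counter(rule(x) for rule in rules)
        return votes.most_common(1)[0][0]
    return predict

def fit_stump(sample):
    """Weak base learner: threshold at the midpoint of the two class means."""
    by_class = {}
    for x, g in sample:
        by_class.setdefault(g, []).append(x)
    if len(by_class) < 2:  # degenerate bootstrap sample containing one class only
        only = next(iter(by_class))
        return lambda x: only
    m0 = sum(by_class[0]) / len(by_class[0])
    m1 = sum(by_class[1]) / len(by_class[1])
    cut = (m0 + m1) / 2.0
    return lambda x: 0 if x <= cut else 1

rng = random.Random(0)
data = [(0.1, 0), (0.3, 0), (0.2, 0), (0.9, 1), (1.1, 1), (0.8, 1)]
predict = bagged_classifier(data, fit_stump, b_samples=25, rng=rng)
print(predict(0.15), predict(1.0))
```

The averaging-of-posteriors alternative mentioned above would replace the vote count with a mean of each rule's class probabilities.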
Using bootstrap replications of size 500, we illustrate bagging on the protein sequence dataset with some of the methods described earlier: these are k-nearest neighbors, the single-layer four-hidden-node neural network and the classification tree. For the k-nearest-neighbor case, k was chosen for each bootstrap sample to be the one with the lowest misclassification rate on the complete training sample. For tree classification, the best tree size was obtained by leave-one-out cross-validation of the bootstrap sample. In all cases, the final rule was decided from the bootstrapped classification rules Ĝ*_b(x) by majority voting. For the k-nearest-neighbor classification approach, bagging resulted in an error rate of 17.28%. Bagging the neural network produced a test set misclassification rate of 20.99%, while the bagged tree classifier produced an error rate of 16.67%. Interestingly, therefore, bagging degraded the performance of the k-nearest-neighbor classifier but improved the performance of the tree and neural network classifiers.
4.4.2 Bumping

Bumping is really a method of stochastic search that tries to find a better model. The method uses bootstrap sampling to search the space for a good model, and can be used effectively to get out of the problems associated with local minima. Operationally, the method is implemented by taking bootstrap samples Z*_b, b = 1, 2, ..., B, and deriving the corresponding classification rules Ĝ*_b(x). Each rule is then used to fit the training set. The classification rule with the least error rate on the training set is chosen as the optimal rule found by bumping. We use bumping on both the classification tree and the neural network rules to classify the protein sequence localization dataset. The test set misclassification rate of the bumped leave-one-out cross-validated tree classifier was 20.37%, while that of the two-node bumped neural network classifier was 25.31%.

4.4.3 Boosting

I finally address a very powerful method, originally introduced in the machine learning literature, called boosting. This method, as introduced by Freund and Schapire (1997), is intended to improve the performance of weak classifiers by combining several of them to form a powerful committee; in that sense, it is similar to committee methods such as bagging (Friedman et al., 2000). A weak classifier is one that performs only marginally better than random guessing. Boosting applies the weak classifier successively to modified portions of the dataset. The modification introduced in each iteration is to increase representation from misclassified observations while decreasing participation from parts of the dataset where the classification rule does well. Thus, as the methodology progresses, the algorithm concentrates on those parts of the feature space that are difficult to classify. The final step is to take a weighted vote of all the classification rules that come out of every iteration. Operationally, the algorithm called AdaBoost.M1, introduced by Freund and Schapire (1997), has the following steps:

1. Assign initial weights ω_i = 1/n to each observation in the training set, i = 1, 2, ..., n.

2.
For j = 1 to J, do the following:

(a) Fit a classifier Ĝ_j(x) to the training dataset, where the i-th observation is assigned the weight ω_i, i = 1, 2, ..., n.

(b) Let α_j = log[(1 − ε_j)/ε_j], where ε_j = Σ_{i=1}^n ω_i 1[Ĝ_j(x_i) ≠ G_i] / Σ_{i=1}^n ω_i.

(c) Re-weight ω_i to be ω_i exp{α_j 1[Ĝ_j(x_i) ≠ G_i]} for i = 1, 2, ..., n.

3. Compute the final boosted classifier Ĝ(x) = argmax_k Σ_{j=1}^J α_j 1[Ĝ_j(x) = k].

Note that for the two-class problem with classes designated as {−1, 1}, the final classifier reduces to Ĝ(x) = sign[Σ_{j=1}^J α_j Ĝ_j(x)]. Also, the above can be viewed as a way of fitting
an additive expansion, if we consider the individual classifiers Ĝ_j(x) to be a set of elementary basis functions. This equivalence of the boosting procedure to forward stage-wise modeling of an additive model is key to understanding its performance in a statistical framework (Friedman et al., 2000). We illustrate boosting on the tree classifier, the PDA classifier and the neural network classifier for the protein sequence localization data. The improvement in the test set misclassification rate was marginal (to 17.9%) for the boosted leave-one-out cross-validated tree classifier. Boosting the two-node neural network classifier showed the best improvement, with a decrease in the misclassification rate to 20.37% on the test set, while boosting the PDA classifier had no effect: the test set error remained at 13.58%. We end by noting that, unlike the other methods discussed here, boosting can be adapted to infinite data streams, such as credit card transactions data. In this two-class problem, the goal is real-time identification of fraudulent transactions. Fraud patterns usually change as newer and newer such schemes are invented and put into practice. At the same time, any classifier should guard against past kinds of aberrant behavior. One way of applying boosting is to keep portions of the past data and to include new transactions data, weighted by the cases where the current rule performed poorly, to derive a classifier that can then be applied and fine-tuned on new streaming data. This methodology can also be applied to derive classifiers from massive databases by first obtaining a rule from a small sample, using that to get a weighted sub-sample from the remainder where the rule does not perform well, recomputing the rule, and so on, until the final weighted rule shows no further improvement.

5 Clustering and Unsupervised Learning

The ability to group data into a previously unknown number of categories is important in a variety of contexts.
Applications include grouping species of bees in taxonomy (Michener and Sokal, 1957), clustering corporations on the basis of size, assets and so on in business (Chen et al., 1974), locating Iron Age settlements from brooches found in archaeology (Hodson et al., 1966), or demarcating polar regions using characteristics of sea ice and firn in glaciology (Rotman et al., 1981). In recent times, the importance of clustering has been made more acute by the huge databases created by automated data collection methods. For example, identifying niche groups from gigantic customer databases for targeted advertising is useful (Hinneburg and Keim, 1999). In industrial manufacturing (Hinneburg and Keim, 1999), homogeneous groups may be identified on the basis of feature vectors such as Fourier or wavelet coefficients from a large computer-aided-design (CAD) database of parts. Standard parts may then be manufactured for each representative group in lieu of specialized parts, reducing costs and the number of kinds of parts to be produced. Many more applications exist: here I present the software metrics example of Maitra (2001).

5.1 Clustering Software Metrics

The software industry is a very dynamic sector, with new releases and upgrades becoming available in very short time-frames. The challenge for any software company is to offer
both maintenance and upgrade services. However, because this is also a sector with very high personnel turnover rates, product maintenance or upgrades are rarely carried out by the same set of developers. The current personnel must therefore have a proper grasp of the complexities of the product. Traditionally, they have relied on system source code documentation and manuals. With the increasingly short time-lines associated with software production, comprehensive manuals have all but disappeared. Automated methodology to reveal a comprehensive view of the inner organization of source code is therefore desirable. Since modularity characterizes good software packages (Myers, 1978; Yourdon, 1975), one may cluster procedures and functions using measurements on different aspects of code, such as the number of blank lines, non-commented source lines, calls to itself, calls to external routines and so on. Many software packages contain a huge number of components: the example in Maitra (2001) that we revisit here has 32 measurements on 214,982 procedures. Clustering algorithms for such datasets are therefore desirable and important for the software industry.

5.2 An Overview of Clustering Methods

There is a large body of methodological work in clustering (Banfield and Raftery, 1993; Beckett, 1977; Brossier, 1990; Can and Ozkarahan, 1984; Chen et al., 1974; Cormack, 1971; Celeux and Govaert, 1995; Eddy, 1996; Everitt, 1988; Fowlkes, 1988; Gnanadesikan, 1977; Good, 1979; Hartigan, 1975, 1985; Mojena and Wishart, 1980; Murtagh, 1985; Ramey, 1985; Ripley, 1991; Symons, 1981; Van Ryzin, 1977). Broadly, clustering algorithms can be divided into two kinds: hierarchical clustering techniques and optimization-partitioning algorithms (Mardia et al., 1979). The former partition the dataset in a hierarchy, resulting in a tree structure with the property that all observations in a group at some branch node remain together higher up the tree. Most of these techniques are based on merging or splitting groups of data-points on the basis of a pre-defined between-groups similarity measure.
Splitting or top-down methods start with one node representing the single group to which the entire dataset belongs. At each stage, a group splits into two portions such that the distance between the two groups is maximized. There are several criteria for calculating inter-group distances. The single linkage criterion specifies the distance between any two groups to be the minimum of all distances between observations in either group, and results in lean, skinny clusters. The complete linkage criterion specifies the distance between any two groups to be the maximum of all pairwise distances between points in the two groups, and results in compact clusters. Average linkage defines the distance between the groups as the average of all the pairwise distances between every pair of points in the two groups, and results in groups that are round but may contain some extreme observations. Several other criteria also exist. Agglomerative or bottom-up approaches start with each observation as its own group. The groups closest to each other are successively merged, using one of the same sets of criteria outlined above, forming a hierarchy right up to the very top, where every observation in the dataset is a member of the same group.

Optimization-partitioning algorithms divide the dataset into a number of homogeneous clusters based on an optimality criterion, such as the minimization of some aspect of the within-sums-of-squares-and-products matrix (Friedman and Rubin, 1967; Scott and Symons, 1971). Some common approaches are K-means clustering, K-medoids, self-organizing maps and probability clustering. We discuss each method briefly in the following paragraphs.

The K-means algorithm (Hartigan and Wong, 1979) is widely used in a number of applications. The procedure finds a locally optimal partition of the input space close to an initial set of K cluster centers using Euclidean distance. The algorithm iteratively updates the partitions and cluster centers till convergence. The partitions are updated by assigning each point in the dataset to the center closest to it, and the cluster centers are calculated at each step as the centers of the data-points in each partition. The K-means algorithm performs best when the underlying clusters are reasonably spherical, well-separated and homogeneous. The method has some optimality properties: Pollard (1981) provides an elegant proof of the strong consistency of the cluster center estimates.

A generalization of the above is the K-medoids algorithm, which also finds a locally optimal partition in the vicinity of the initial cluster values, called medoids. The algorithm is flexible enough to allow for any distance measure. Similar to K-means, the algorithm iteratively updates the partitions and the medoids alternately, till convergence. The partitions are updated at each step by assigning each observation to the class with the closest medoid (as per the distance measure). The medoid for each class is updated by setting it to be the observation in the group having the smallest total distance to all its fellow cluster members. The K-medoids algorithm is clearly more computationally demanding than K-means because of the medoids-update step. Simplified strategies have been suggested by Kaufman and Rousseeuw (1990) and earlier by Massart et al. (1983). The latter suggest a branch-and-bound combinatorial method for the problem that can be practically implemented only when the size of the dataset is very small. A constrained version of the K-means algorithm is provided by Self-Organizing Maps (SOM) (Kohonen, 2001).
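The alternation between the assignment and center-update steps can be sketched in a few lines (a minimal Python illustration with hypothetical names; real implementations such as Hartigan and Wong's are considerably more refined):

```python
def kmeans(points, centers, iters=100):
    """Plain K-means: alternate assignment and center updates until convergence."""
    for _ in range(iters):
        # Assignment step: each point goes to its closest current center.
        clusters = [[] for _ in centers]
        for p in points:
            j = min(range(len(centers)),
                    key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centers[j])))
            clusters[j].append(p)
        # Update step: each center becomes the mean of the points assigned to it.
        new = [tuple(sum(xs) / len(xs) for xs in zip(*pts)) if pts else centers[j]
               for j, pts in enumerate(clusters)]
        if new == centers:  # a fixed point: a locally optimal partition
            break
        centers = new
    return centers, clusters
```

The result depends on the initial centers, which is precisely why the choice of starting values recurs as an issue later in this section.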
The idea behind this algorithm, originally provided by Kohonen (1990), is to map the original input-space to a lower-dimensional manifold and then to find K prototypes representing the K groups in the lower-dimensional space. Each of the K prototypes {µ_j; j = 1, 2, ..., K} is parameterized with respect to a coordinate system of points l_j in Q. For instance, the µ_j may be initialized to lie in the p-dimensional principal component plane of the data. The SOM algorithm tries to bend the plane so that these prototypes approximate the data points as well as possible. A final step maps the observations onto the p-dimensional grid. The updates on the µ_j's can be done by processing each observation x one at a time. The process involves first finding the closest µ_j to x and then updating the µ_j's using

    µ_j = µ_j + α γ(||l − l_j||)(x − µ_j).

Here, γ(·) is a decreasing neighborhood function which effectively implies that prototypes µ_j have more weight if their corresponding coordinates l_j in the lower-dimensional map are close to l. The parameter α is the learning rate and is decreased gradually from 1 to zero as the algorithm progresses. The function γ(·) also usually becomes more rapidly decreasing as the algorithm proceeds. Note that using a neighborhood function such that every prototype has only one neighbor in the Q-coordinate system yields an online version of the K-means algorithm. When applied in this framework, experiments show that the algorithm does stabilize at one of the local minima of the K-means algorithm. Batch versions of SOM also exist: each µ_j is updated to sum_i γ(||l_i − l_j||) x_i / sum_i γ(||l_i − l_j||).

Finally, we discuss a model-based clustering approach, which is implemented through the expectation-maximization (E-M) algorithm. The dataset X_1, X_2, ..., X_N is assumed to be a random sample from the mixture distribution

    f(x) = sum_{k=1}^{K} π_k φ(x; µ_k, Σ_k),

where the π_k's sum to unity, and φ(x; µ_k, Σ_k) represents the multivariate normal density with mean µ_k and dispersion matrix Σ_k. The goal is to estimate the mixing proportions π_k's, as well as the µ_k's and Σ_k's. With a known K, the problem can be set up as a likelihood-maximization problem and solved iteratively using an E-M approach. The proposal (Day, 1969; Dick and Bowden, 1973; Wolfe, 1970) updates the class proportions π_k's in the E-step and the cluster means µ_k's and dispersions Σ_k's in the M-step. The E-M algorithm has been shown to exhibit strong consistency (Redner and Walker, 1985); the result holds for the wider class of a known number of mixtures of exponential families. In the E-step, the π_k's are updated given the current values of the π_k's, µ_k's and Σ_k's by classifying the observations into each of the K groups and then estimating the current value of π_k by the proportion of observations classified into the k-th group. The M-step then uses the updated classification to obtain updated estimates for the class means and dispersions. The procedure starts with initial estimates for all parameters and continues until convergence. For unknown K however, the E-M algorithm is not necessarily consistent (see the report of the Panel on Discriminant Analysis and Clustering, 1989).

I now mention some issues in the choice of the number of clusters K. Determining the number of clusters K has been quite a challenge: there are many suggestions, some of them made by Marriott (1971), Mojena and Wishart (1980) and, more recently, by Tibshirani et al. (2001). Everitt (1974) performed a simulation analysis of then-available criteria for choosing K and reported Marriott's (1971) criterion of minimizing k²|W_k| with respect to k, subject to non-singular W_k, to be the most useful.
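For concreteness, here is a minimal univariate sketch of the E-M iteration for a Gaussian mixture with K known. This is the standard soft-assignment E-M rather than the classification-based variant described above (which replaces the posterior weights with hard assignments); all names and data are illustrative:

```python
import math

def em_mixture(xs, pi, mu, var, iters=50):
    """E-M for a univariate K-component Gaussian mixture (toy sketch)."""
    K = len(pi)
    phi = lambda x, m, v: math.exp(-(x - m) ** 2 / (2 * v)) / math.sqrt(2 * math.pi * v)
    for _ in range(iters):
        # E-step: posterior membership probabilities for each observation.
        resp = []
        for x in xs:
            w = [pi[k] * phi(x, mu[k], var[k]) for k in range(K)]
            s = sum(w)
            resp.append([wk / s for wk in w])
        # M-step: update mixing proportions, means and variances.
        nk = [sum(r[k] for r in resp) for k in range(K)]
        pi = [nk[k] / len(xs) for k in range(K)]
        mu = [sum(r[k] * x for r, x in zip(resp, xs)) / nk[k] for k in range(K)]
        var = [sum(r[k] * (x - mu[k]) ** 2 for r, x in zip(resp, xs)) / nk[k]
               for k in range(K)]
    return pi, mu, var
```

With well-separated initial means, the iteration converges quickly to the component means and proportions; the multivariate case replaces the scalar variance with the dispersion matrix Σ_k.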
This criterion assumes a common dispersion matrix Σ for all the clusters and denotes the pooled within-clusters sums-of-squares-and-products matrix by W_k. Tibshirani et al. (2001) recently proposed another heuristic approach for estimating K. They propose the Gap statistic, which compares the curve for log Ω_k obtained from the dataset to that obtained using uniform data over the input space. Here Ω_k is any within-cluster dissimilarity measure obtained by partitioning the dataset into k clusters. For instance, one may use W_k as the measure when the underlying groups are assumed to be homogeneous and Gaussian. The Gap statistic estimates the optimal number of clusters K to be that value for which the difference, or the gap, is the largest. Unlike several other competing methods, their illustrations show that this measure works well even when there is only one underlying cluster in the population.
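A toy sketch of the Gap idea in Python (1-D, with a crude K-means supplying the within-cluster dispersion; the names, the reference-set size and the uniform sampling box are my own simplifications, not the authors' implementation):

```python
import math, random

def within_ss(points, k, iters=25):
    """log W_k from a crude 1-D K-means partition (a stand-in for log Omega_k)."""
    centers = sorted(random.sample(points, k))
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for x in points:
            clusters[min(range(k), key=lambda j: (x - centers[j]) ** 2)].append(x)
        centers = [sum(c) / len(c) if c else centers[j]
                   for j, c in enumerate(clusters)]
    w = sum((x - centers[j]) ** 2 for j, c in enumerate(clusters) for x in c)
    return math.log(w + 1e-12)  # guard against log(0) for perfect fits

def gap(points, k, n_ref=20):
    """Gap statistic: mean reference log-dispersion minus observed log-dispersion."""
    lo, hi = min(points), max(points)
    ref = [within_ss([random.uniform(lo, hi) for _ in points], k)
           for _ in range(n_ref)]
    return sum(ref) / n_ref - within_ss(points, k)
```

With two well-separated 1-D clusters, gap(points, 2) exceeds gap(points, 1), so k = 2 would be chosen; the published procedure additionally corrects for the simulation error in the reference curve.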
5.3 Application to Massive Datasets

Most clustering algorithms are not feasible to implement when the dataset is massive. Among the methods described above, SOM is the only approach largely unaffected, but even then we need to know the lower-dimensional manifold on which to project the data. In many cases, such as in data visualization, the dimensionality is set to be 2 or 3. The coordinate system can then be decided, perhaps by taking a sample from the dataset and computing the principal components plane. Online clustering can then proceed. However, this depends on the objective of clustering and is not always the best approach for a given situation. We discuss here approaches to clustering such massive datasets.

The earliest statistical attention to the need for clustering large datasets was in the context of a hierarchical algorithm developed by McQuitty and Koch (1975) for a dataset of a thousand observations. A subsequent attempt was made by formulating the problem in terms of data reduction (Bruynooghe, 1978; Zupan, 1982). These papers however address the problem using definitions of a large dataset that are now largely obsolete. More recent attention has been in the context of clustering massive databases in the computer science literature (a few references are Agrawal et al., 1998; Bradley et al., 1997; Bradley et al., 1998; Charikar et al., 1997; Eddy et al., 1996; Ester et al., 1996; Ester et al., 1998; Ganti et al., 1999; Goil et al., 2000; Huang, 1997; Guha et al., 1998; Han et al., 1998; Sheikholeslami et al., 1998; Zhang et al., 1996). A number of algorithms try to obtain approximate partitions of the dataset and are not really practical for large datasets. I focus discussion here on the few statistical approaches developed. Bradley et al. (1997, 1998) suggest an approach under the assumption that the number of clusters K is known in advance and that the group densities are multivariate normal. They first take a sample of observations and set up the likelihood-maximization problem in terms of an E-M algorithm.
After clustering the first sample, the algorithm updates the estimated π_k's via somewhat crude refinements derived from the remaining dataset in samples. The method is scalable, but inefficient. Further, the algorithm just identifies clusters based on a first-stage sample, and at subsequent stages only updates the estimated class proportions. This means that the benefits from such a huge number of observations are not fully utilized, and since the initial sample is necessarily small (for practicality of the above procedure), minority clusters are unlikely to be included and therefore identified.

Gordon (1986) provided a suggested road-map for clustering massive datasets. The basic thrust of his suggestion was to split the clustering problem into clustering and classification components and to devise a scheme whereby one would be able to figure out whether the groups had been adequately identified at the clustering stage. Although his paper has very few clear directions on implementation, Maitra (2001) developed and extensively tested a strategy similar in spirit. Under the assumption that all clusters have common dispersion Σ, the methodology first takes a subsample S_1 of size n from the dataset D (decided by the available resources). This first sample is clustered and the cluster means and dispersion estimated, using K-means with K estimated via Marriott's (1971) criterion. Writing K_0 for the estimate of K, and π_k^0's, µ_k^0's and Σ^0 for the estimates of the cluster proportions, means and the common dispersion, the next step is to perform a hypothesis test on each of the remaining
members of the dataset, where the hypothesis to be tested is whether the observation can reasonably be classified into the clusters so far identified. Formally therefore, the null and the alternative hypotheses to be tested for a given observation X are

    H_0: X ~ sum_{k=1}^{K_0} π_k^0 φ(x; µ_k^0, Σ^0)  vs.  H_1: X ~ sum_{l=1}^{L_0} π_l φ(x; µ_l, Σ^0),    (2)

where L_0 = K_0 + 1, the π_l's and µ_l's are the class probabilities and class centers under H_1, and {µ_k^0; k = 1, 2, ..., K_0} is a subset of {µ_l; l = 1, 2, ..., L_0}. In this setting, a likelihood-ratio-type test statistic is given by

    Λ_0 = min(1, |Σ^0|^{1/2} sum_{k=1}^{K_0} π_k^0 φ(x; µ_k^0, Σ^0)).    (3)

The null distribution of this test statistic can be obtained by simulation. The above hypothesis is tested at significance levels from a set of candidate values of α. For each α, if the proportion of observations rejected is less than α, then it is clear that random chance can explain the extreme observations in the dataset, so all clusters are declared to have been identified and the cluster-identification stage is terminated. Otherwise, the observations for the largest significance level α for which the proportion of rejects is greater than α form the set D_1 under consideration at the next stage. The π_k^0's are scaled to sum to the proportion accepted, and a sample S_1 of size n is drawn from the second-stage dataset D_1. The sample is again clustered and the entire procedure iterated until every observation in the dataset has either been included in a sample to be clustered at some stage or has been in the acceptance region for one of the hypothesis tests. Let K* and {(µ_k*, π_k*); k = 1, 2, ..., K*} denote the number of clusters, the cluster means and the cluster proportions at the end of this stage. This ends the cluster-identification stage of the algorithm. A classification of all observations into these clusters provides final estimates of the class proportions, and the empty clusters are eliminated. A final step classifies the observations into the clusters based on these final estimates of class proportions.
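The test statistic in (3), as reconstructed here, is straightforward to compute. The Python sketch below uses hypothetical names and a bivariate normal density; the rejection threshold would come from the simulated null distribution, which is not shown:

```python
import math

def mvn_density(x, mu, sigma):
    """Bivariate normal density phi(x; mu, Sigma) for a 2x2 dispersion matrix."""
    det = sigma[0][0] * sigma[1][1] - sigma[0][1] * sigma[1][0]
    inv = [[sigma[1][1] / det, -sigma[0][1] / det],
           [-sigma[1][0] / det, sigma[0][0] / det]]
    d = (x[0] - mu[0], x[1] - mu[1])
    q = (d[0] * (inv[0][0] * d[0] + inv[0][1] * d[1])
         + d[1] * (inv[1][0] * d[0] + inv[1][1] * d[1]))
    return math.exp(-q / 2) / (2 * math.pi * math.sqrt(det))

def lambda0(x, pis, mus, sigma):
    """Lambda_0 = min(1, |Sigma|^(1/2) * sum_k pi_k * phi(x; mu_k, Sigma)):
    small values flag observations unlikely under the identified clusters."""
    det = sigma[0][0] * sigma[1][1] - sigma[0][1] * sigma[1][0]
    mix = sum(p * mvn_density(x, m, sigma) for p, m in zip(pis, mus))
    return min(1.0, math.sqrt(det) * mix)
```

An observation far from every identified cluster center yields a Λ_0 near zero and is rejected under H_0, sending it on to the next clustering stage.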
The algorithm is scalable and, when tested on a range of examples with datasets of N ranging from 500,000 to over 77 million, terminated in no more than five stages. Maitra (2001) illustrated and tested this methodology extensively through a set of simulation experiments. Before discussing an experiment, I mention a measure to evaluate performance. One measure, with known densities, is the Kullback-Leibler-type divergence measure

    κ(f, f̂) = ∫ {log f(x) − log f̂(x)} f(x) dx,

with f(x) and f̂(x) as the true and estimated mixture densities respectively. Another option is the dissimilarity measure between the original true partition and the partition formed by the clustering algorithm, adapted from Mirkin and Chernyi (1970), and Arabie and Boorman (1973). This partition-based measure is constructed for a total number of N objects as follows. Out of the N(N − 1)/2 possible pairs of objects, let the disagreement δ be the total number of pairs which are either classified together in the true partition but in different groups by the clustering algorithm, or vice-versa. Then M = 2δ/{N(N − 1)} is a normalized measure of disagreement between two partitions. M is complementary to the similarity measure (equal to 1 − M) proposed by Rand (1971).

[Figure 5: True cluster means and proportions (filled circles) and the estimates (unfilled circles) for test case 1, obtained using (a) the multi-stage clustering algorithm, and (b) clustering a sample from the dataset. The circles are proportional in area to the square root of the class proportions.]

The experiment (Maitra, 2001) I discuss here is on a sample of size 500,000 from a mixture of 94 bivariate normal populations with different means and class proportions but equal dispersions (σ_1 = σ_2 = 0.1; ρ = 0.5). The mixing proportions ranged from the very small (1.8 × 10⁻⁴) to the moderate (0.17). The clusters are graphically shown in Figure 5 (filled circles). The initial starting points for the K-means algorithm at each stage were chosen to be the K highest local modes of the given stage sample. The multi-stage algorithm terminated in only four stages of sampling and clustering. Ninety-five groups were identified, and the final estimated standard deviations were 0.225 and 0.238 with correlation estimated at 0.305. Figure 5a (unfilled circles) provides a graphical display of the performance of the algorithm: note that a large number of the clusters have been correctly identified. (The area of the circles in the figure is proportional to the square root of the class probabilities in order to highlight the smaller clusters.) Figure 5b (unfilled circles) displays the 21 clusters identified by just clustering a small sample of size 2,500 from the original dataset. Interestingly, the latter strategy fails to identify even a few medium-sized clusters. The Kullback-Leibler-type divergence measure was computed to be 0.48 for the multi-stage clustering scheme and 1.25 for the alternative. Partitioning of the observations
as per the former disagreed with the truth for only δ = 188,037,120 pairs of cases, so that M = 0.0015. Disagreement was more emphatic for the case of the alternative algorithm, with δ = 1,468,538,880 and M = 0.0089. Thus, relative to the latter algorithm, the multi-stage clustering strategy improved classification rates by about 83.1%.

The above approach provides the first real attempt to address the problem of clustering massive datasets in a comprehensive statistical framework. It is not necessary to use K-means while clustering the sample at each stage: other approaches can also be used. The assumption of common dispersion for all clusters is important in the derivation of the likelihood ratio test of (2) for testing the representativeness of the identified clusters based on each individual observation. Some very limited generalization (such as a known relationship of the cluster dispersions with the cluster means) is possible and exhibited, but those are useful only in some special cases. We discuss possible extensions and detail areas requiring further attention in a short while.

5.4 Application to the Clustering of Software Records

This illustration is also provided in Maitra (2001). Once again, K-means was used in clustering the sub-samples. The dataset was first pre-processed by removing those procedures without any metrics. These procedures had perhaps not yet been integrated into the package. This reduced the total number of records to 214,982. Non-informative metrics (or metrics with common readings for all procedures) were then excluded from our consideration, thus bringing down the dimensionality from 70 to 32. All metrics were then standardized to the same scale and the multi-stage clustering algorithm applied. At each stage, the sample size was set to be n = 2,500. Initial guesses for the K-means algorithm were chosen for each k from the means of the k most homogeneous groups obtained through hierarchical clustering with Euclidean distance, and the optimal K was determined by Marriott's criterion. The multi-stage clustering algorithm terminated after two stages and identified 94 clusters.
With this grouping, the 94 classes for the procedures were used for software maintenance and upgrades. Moreover, some of these procedures were themselves found to be defective, and the grouping was also used to identify other components that needed to be updated while fixing those bugs.

5.5 Unresolved Issues

There are several aspects of clustering that need further attention. Some of these issues are general and pertain to all sizes of datasets; others are more specific and arise because of the size of the dataset or because of the type of data. Indeed, the breadth of the unresolved issues underlines the difficult nature of the problem. Hartigan's (1975) implementation of the K-means algorithm is sub-optimal and known to be unstable. One of the issues in the K-means algorithm is the choice of initial values for the cluster centers. The algorithm then finds a local minimum of the chosen penalty function. Bradley and Fayyad (1998) develop an algorithm in an attempt to provide a robust way of
obtaining the cluster centers. Assuming K known, the strategy consists of obtaining several sub-samples of the dataset and clustering each sub-sample via K-means and an arbitrary set of initialization values. The cluster centers µ̂ obtained from each sub-sample are then clustered again via K-means, with initial values of the cluster centers given by the cluster centers from each sub-sample. Each exercise with initial points µ̂ gives a set of cluster means µ̃, and the final initial points for the K-means algorithm are chosen from amongst this set as the one which is closest in a least-squares sense to all the µ̂'s. The goal of these suggestions is to obtain initial values for the K-means algorithm. When the sub-sample has the same size as the sample, the above method is identical to running the K-means algorithm with randomly chosen start-points, and then replicating this experiment several times. Note that the sub-samples are not necessarily independent, and it is not clear how this affects the clustering strategy of the initially chosen cluster means µ̂. Additionally, knowledge of K is critical. Further, this strategy only addresses the issue of initial points for the K-means algorithm. The important issue of noise and variability in the observations is not addressed, and there is a need for clustering methodology robust against noise and random perturbations.

The multi-stage clustering scheme developed by Maitra (2001) is severely restricted by the assumption of underlying homogeneous Gaussian clusters. Extending the methodology is important, and we note that the major stumbling block is the estimation of the dispersion matrix under the global condition for the likelihood ratio test for testing each observation.

Clustering datasets on the basis of indirectly observed data is another important problem that has not received much attention, even though the Panel on Discriminant Analysis and Clustering (1989) mentioned this as an important area requiring statistical attention. The context is that X_1, X_2, ..., X_N are not directly observed, but a projection is observed. Therefore, we have a sample Y_1, Y_2, ..., Y_N, where Y_i = AX_i.
For this proposal, A is a linear transformation with full column rank. The objective then is to obtain estimates for K, the µ_k's and Σ_k's, without going through a potentially computationally expensive inverse problem, as would happen for very large N. Such functionality would be very useful in the context of segmenting images that are indirectly observed, for example, in Positron Emission Tomography (PET); see Maitra and O'Sullivan (1998) or Maitra (2001) for detailed discussions of the application.

Another area of interest with regard to clustering is when all coordinate-values are not available together, but become available incrementally. That is, we first have observations only on the first coordinate X_11, X_21, ..., X_N1, followed by observations on the second coordinate X_12, X_22, ..., X_N2, and so on up to time p. Writing X_i = {X_i1, X_i2, ..., X_ip}; i = 1, 2, ..., N, our goal is to cluster the time series X_1, X_2, ..., X_N without actually having to wait for all the observations to be recorded. In other words, we propose to cluster the X_i's based on partially available information and then to sequentially update our clusters with newly available data on the other coordinates. This would be desirable in the context of the PET set-up also. This is because when the duration of the study is long, it is not feasible to segment voxels based on the complete set of observations. In such a case, it would be desirable to segment observations based on the first few time-points and then to update the segmentation sequentially with the observations from subsequent time-bins.
Another area of interest is the case when the observations arrive sequentially. This is important in the context, for example, of streaming electronic transactions. Businesses need to identify and categorize transactions because marketing campaigns can then identify and tailor programs designed for the different groups. The context then is that we have an infinite stream of observations X_1, X_2, ... from which we want to estimate the number of clusters K, the cluster centers µ_k's and the dispersions Σ_k's, and also to categorize the observations into these clusters.

Thus, we see that several issues in clustering datasets need attention in the statistical literature. Many of the approaches that exist are empirical and somewhat ad hoc, pointing to the difficult nature of the problem, while at the same time underlining the absence of, as well as the need for, coordinated statistical methodology.

6 Information Retrieval and Latent Semantic Indexing

Search engines available on the Worldwide Web (WWW) very nicely illustrate the need for information retrieval. For instance, it is very common to use a web-based search engine to find articles pertaining to a particular topic. Typing in the word "laptop", for instance, should bring up all articles in the database containing that word. This is what one gets by a lexicographic search. However, when searching for documents on the word "laptop", the user is most likely interested not only in articles containing the word "laptop" but also in articles on portable computers which choose to use some other word (such as "notebook") to refer to portable computational equipment. This aspect would be missing in a simple lexical-match-based search strategy. How would one provide a search strategy incorporating such synonyms? One approach would be to put together an exhaustive list of synonyms and refer to all articles containing at least one synonym for the desired word. However, the database has a large number of words (and, with particular reference to the WWW, this list is dynamic) and besides, the same word can be used in more than one context.
To elucidate the last point, note that a user who types in the word "Jaguar" would get articles on cars, on the Anglo-French fighter aircraft, as well as on the member of the feline family. This aspect of a word being used in many different contexts is called polysemy and may be alleviated somewhat by using multiple search words, such as "Jaguar cars", to promote context. Once again, note that a lexical match on the word "car" would not yield any information on cars similar to any of the Jaguars in terms of performance and/or price. One approach, called relevance feedback (Rocchio, 1971), seeks to provide the user with a list of documents and ask for feedback on the relevance of the list to the query.

A statistical approach called Latent Semantic Indexing (LSI) was developed (Deerwester et al., 1990) to allow for retrieval on the basis of concept, rather than an exact lexical match. The approach uses statistically derived conceptual indices to search the database. The basic idea is to write a term-document matrix containing the frequency of occurrences of each term in each document. Let us assume that there are D documents in the database, and that the documents contain a total of T unique keywords (or the words of interest in the database).
For instance, keywords could be all words in the database, excluding articles, prepositions, and conjunctions. Denote the term-document frequency matrix by A = ((a_td)), where a_td is the number of times that the t-th keyword occurs in the d-th document. LSI uses a k-rank approximation of A obtained from its singular value decomposition (SVD). To see this, suppose that rank(A) = r and that the SVD of A is given by A = U D_λ V', with U and V being T × r and D × r matrices respectively. Here U'U = I_r, V'V = I_r, and D_λ is a diagonal matrix of the singular values λ_1 ≥ λ_2 ≥ ... ≥ λ_r of A. Alternatively, writing U = [u_1 : u_2 : ... : u_r] and V = [v_1 : v_2 : ... : v_r], we have A = sum_{i=1}^{r} λ_i u_i v_i'. Then the k-rank approximation to A is given by A_k = sum_{i=1}^{k} λ_i u_i v_i'. The goal of the reduced-rank representation A_k is to capture the salient association structure between the terms and the documents while at the same time eliminating a substantial part of the noise and variability associated with word usage. Note that the SVD decomposes A into a set of k uncorrelated indexing variables, with each term and document represented by a vector in a k-dimensional space using the elements of the left or right singular vectors.

A query search for documents pertaining to a particular set of terms can be represented as a vector of 0's and 1's, with 1 indicating presence of a keyword in the given query and 0 otherwise. A query can thus be considered to be a document, denoted by the T × 1 vector q. To find documents close to this query, we need to transform both query and documents into the same space and pick the transformed documents closest to the query in the reduced space. Note that the reduced-rank representation of A means that any document d in the database can be represented as d ≈ U_k Λ_k d̂_k, so that the document maps into the reduced-document space as d̂_k = Λ_k^{-1} U_k' d. This is the row of V_k corresponding to the document d. Similarly, any query q transforms to the vector q̂ = Λ_k^{-1} U_k' q, where Λ_k = diag(λ_1, λ_2, ..., λ_k).
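These steps are easy to carry out numerically. The sketch below (NumPy, with a tiny made-up term-document matrix rather than the one from the example that follows) forms the rank-k factors via the SVD, maps documents and a query into the reduced space, and scores documents by their cosine with the transformed query:

```python
import numpy as np

# Toy term-document matrix: rows = terms, columns = documents (illustrative only).
A = np.array([[1, 1, 0, 0],   # "linear"
              [1, 0, 0, 0],   # "model"
              [0, 1, 1, 0],   # "regression"
              [0, 0, 1, 1],   # "multivariate"
              [0, 0, 0, 1]],  # "analysis"
             dtype=float)

U, lam, Vt = np.linalg.svd(A, full_matrices=False)
k = 2                                # rank of the reduced representation
Uk, lamk = U[:, :k], lam[:k]

def to_reduced(v):
    """Map a term-space vector (document or query) via Lambda_k^{-1} U_k' v."""
    return (Uk.T @ v) / lamk

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Query with keywords "linear" and "regression" (a 0/1 vector in term space).
q_hat = to_reduced(np.array([1.0, 0.0, 1.0, 0.0, 0.0]))
scores = [cosine(q_hat, to_reduced(A[:, d])) for d in range(A.shape[1])]
```

Document 1, which contains exactly the query terms, attains cosine 1 in the reduced space; retrieval keeps every document whose score clears the chosen threshold.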
To determine documents close to the query, we calculate the cosine between the vectors q̂ and d̂_k for each document d, and use that to provide the documents most relevant to a search. The search returns all those documents with cosine measure above a user-defined threshold value α. We now illustrate this feature using a simple example.

6.1 An Illustration

Consider a database containing the titles of the 16 statistics books given in Table 4. In our database, we have used as keywords all those words that appear more than once among the titles. Thus, we have the keywords: Generalized (T1), Linear (T2), Model (T3), Theory (T4), Inference (T5), Application (T6), Regression (T7), Introduction (T8), Analysis (T9), Multivariate (T10) and Nonparametric (T11). Note that we have associated words such as "Applied" and "Applications", or "Multivariate" and "Multivariable", with the same keywords. Also, we have not included "statistics" as a keyword since all the titles here pertain exclusively to that subject. Table 4 yields the term-document matrix provided in Table 5. The matrix is decomposed using the SVD. For the reduced-rank representation, we use k = 2 so that the application of LSI on this example can be represented graphically. Figure 6 is a two-dimensional plot of the terms and the documents in the reduced space. The two columns of U_2 Λ_2 are plotted against each other on the x- and y-axes respectively. Similarly,
Table 4: List of statistics books used in our LSI example. The underlined terms are the keywords used for searching the documents (book titles).

D1   Generalized Linear Models
D2   Theory of Statistical Inference
D3   Applied Linear Regression
D4   Linear Statistical Inference and Its Applications
D5   Applied Multivariate Statistical Analysis
D6   Theory and Applications of the Linear Model
D7   Introduction to Linear Regression Analysis
D8   Multivariate Analysis
D9   An Introduction to Multivariate Statistical Analysis
D10  Theory of Multivariate Statistics
D11  Applied Linear Regression Analysis and Multivariable Methods
D12  The Analysis of Linear Models
D13  Applied Regression Analysis
D14  Applied Linear Regression Models
D15  Nonparametric Regression and Generalized Linear Models
D16  Multiple and Generalized Nonparametric Regression

Table 5: The 11 × 16 term-document matrix corresponding to the database in Table 4.

Terms          D1 D2 D3 D4 D5 D6 D7 D8 D9 D10 D11 D12 D13 D14 D15 D16
Generalized     1  0  0  0  0  0  0  0  0   0   0   0   0   0   1   1
Linear          1  0  1  1  0  1  1  0  0   0   1   1   0   1   1   0
Model           1  0  0  0  0  1  0  0  0   0   0   1   0   1   1   0
Theory          0  1  0  0  0  1  0  0  0   1   0   0   0   0   0   0
Inference       0  1  0  1  0  0  0  0  0   0   0   0   0   0   0   0
Application     0  0  1  1  1  1  0  0  0   0   1   0   1   1   0   0
Regression      0  0  1  0  0  0  1  0  0   0   1   0   1   1   1   1
Introduction    0  0  0  0  0  0  1  0  1   0   0   0   0   0   0   0
Analysis        0  0  0  0  1  0  1  1  1   0   1   1   1   0   0   0
Multivariate    0  0  0  0  1  0  0  1  1   1   1   0   0   0   0   0
Nonparametric   0  0  0  0  0  0  0  0  0   0   0   0   0   0   1   1
for the transformed documents, the two columns of V_2 Λ_2 are plotted on the x- and y-axes respectively. Note the natural clustering of terms and documents. Thus, the documents D5, D8, D9 and D10 all cluster together, and the keyword "multivariate" may be used to describe these titles. The documents D1, D15 and D16 all pertain to "generalized" and "models", while the documents D3, D4, D6, D12 and D14 all cluster together and are close to the keywords "linear" and "regression". The terms "theory" and "inference" are close to each other, as are "nonparametric" and "generalized", or "linear" and "regression".

I now exhibit query search using LSI. Suppose we want to choose titles pertaining to linear models. The query vector q̂ in the two-dimensional transformed space is given by (0.211, 0.122), illustrated by the × in Figure 6. To find the documents in the database most relevant to our search, we calculate the cosine of the angle formed by q̂ and d̂_k for each document vector d. Only the transformed document vectors for D1, D3, D4, D6, D12, D14, D15 and D16 have cosines of over 0.9 with q̂. Note that these include titles on regression even though the query did not include this as a keyword. Note also that lexical matching would not have included D16, for instance. It is often common to scale the term-document matrix in order to downweight common terms. The SVD is then performed and LSI implemented on this weighted term-document matrix. Although we did not use such a strategy in our illustration here, it can be readily implemented within the framework mentioned above.

6.2 Updating the Database

In a dynamic database such as on the WWW, new documents and terms are being added over time. As new documents and terms come in, the database and indeed the reduced k-rank representation need to be updated. Of course, one may repeat the above exercise for every update of the database. However, this is a very expensive process requiring constant recomputation. Moreover, for large databases, the SVD is in itself very computationally expensive, even for extremely sparse matrices. One option is to use what is called folding-in of terms and documents.
This approach assumes that the SVD done on the original database is adequate for the expanded database. In other words, the new term and document vectors are just folded into the existing database. To illustrate, a new document vector d is mapped, using the arguments given above, to d̂_k = Λ_k^{-1} U_k' d, while a new term vector t is similarly projected to t̂_k = Λ_k^{-1} V_k' t in the originally reduced term-space. The new vectors thus formed are however not orthogonal to the others or to each other. Folding-in also has no effect on the existing term and document vectors, and therefore the approach does not learn about context from new terms and documents in the expanded database. All these factors mean that while folding-in may work for a few additional terms and documents, it would be inefficient with a substantially new database.

Another approach to this problem is called SVD-updating (Berry et al., 1994; Berry et al., 1999). This approach assumes that the term-document matrix for the existing database A is represented through A_k. Matrix operations are then used to compute the SVD of A_k augmented with the new columns from the additional documents D and the new terms T. To elucidate, suppose for the sake of simplicity that the new term-document matrix only
Figure 6: The two-dimensional plot of terms and documents for the example in Table 6.1. The point denotes the location of the query linear models in the transformed space. The shaded region represents the region containing the documents which, in the reduced document space, have cosines of over 0.9 with the transformed query vector.

includes a new set of documents (and no new terms), and is given by [A D]. Here D is a T × P matrix. Replacing A by its reduced-rank version A_k, the SVD of A' = [A_k D] is computed in terms of A' = U' Λ' V''. It can be shown (Berry et al., 1995; Berry et al., 1999) that U' = U_k Ũ and V' = B Ṽ, where Ã = [Λ_k U_k' D], Ã = Ũ Λ' Ṽ' in terms of its SVD, and B is the block-diagonal matrix formed by the matrices V_k and I_P. This reduces the computations for the SVD, but note that the rank of the new matrix is limited to at most k. A similar approach may be used for adding new terms. Once again, suppose that we are only incorporating new terms and no new documents in the expanded database (as a consequence, say, of documents that have been updated). Then the new term-document matrix is given by A' = [A; T], the matrix A with the Q × D matrix T of new term rows appended below it. We then replace A by its reduced-rank version A_k and compute the SVD of A' = [A_k; T] in terms of A' = U' Λ' V''. Again, it is easy
to show that V' = V_k Ṽ and U' = B Ũ, where Ã = [Λ_k; T V_k], the matrix Λ_k with the rows of T V_k appended below it, Ã = Ũ Λ' Ṽ' in terms of its SVD, and B is the block-diagonal matrix formed by the matrices U_k and I_Q. For the case when both new terms and new documents are added to the database, the above can be implemented as a two-step approach. Note that while the computations here are considerably more than for folding-in, they are far less than those for computing the SVD of the entire database. A primary contributor to the expense of the SVD-updating computations described above is the fact that the U's and V's are dense matrices.

6.3 Other Issues

We conclude this section with a discussion of some of the many interesting unresolved statistical issues. A primary unresolved issue pertains to the choice of k. In our application above, we chose k = 2 for ease of representation of our results. There is really no known method for choosing k. One of the issues in the choice of k is the absence of a proper statistical representation of the LSI framework, even though Ding (1999) took the first steps toward providing a probability model representation of LSI. An appropriate statistical model may be used together with a penalty function to decide on k. Another issue concerns noise and variability. There are two aspects here: one pertains to terms that are corrupted as a result of typographical errors. For instance, nonparametric may be mis-spelt in one of the documents. This is noise in the database and needs to be modeled separately from variability, which arises when one has a choice of synonyms (for example, war and conflict) and chooses to use one of them. It may be useful to incorporate within LSI a probability model on the basis of the usage of these synonyms. Another area in which there has been considerable work, and where LSI has been used, concerns retrieval across multi-lingual databases (Landauer and Littman, 1990; Littman et al., 1998). SVD-updating has been modified to accommodate this context. Finally, computation continues to be a very serious constraint.
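To make the computational trade-off concrete, the document-updating step of Section 6.2 can be sketched as follows. This is a minimal illustration under the simplifying assumption used above (new document columns only, rank capped at k, with the new columns implicitly projected onto the span of U_k); the function and variable names and the random demonstration data are mine, not from the paper.

```python
import numpy as np

def svd_update_docs(Uk, sk, Vk, D):
    """Update the rank-k factors of A_k = Uk diag(sk) Vk' after appending
    new document columns D, via the SVD of the small matrix [Lambda_k, Uk' D]
    rather than of the full augmented term-document matrix."""
    k, p = len(sk), D.shape[1]
    F = np.hstack([np.diag(sk), Uk.T @ D])        # k x (k + p)
    Uf, sf, Vft = np.linalg.svd(F, full_matrices=False)
    U_new = Uk @ Uf                               # new left singular vectors
    B = np.block([[Vk, np.zeros((Vk.shape[0], p))],
                  [np.zeros((p, k)), np.eye(p)]])  # block-diagonal (V_k, I_p)
    V_new = B @ Vft.T                             # new right singular vectors
    return U_new, sf, V_new

# Small demonstration: a random 6 x 5 term-document matrix, rank k = 2,
# and 2 new document columns.
rng = np.random.default_rng(0)
A = rng.random((6, 5))
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
Uk, sk, Vk = U[:, :k], s[:k], Vt[:k, :].T
D = rng.random((6, 2))
U2, s2, V2 = svd_update_docs(Uk, sk, Vk, D)
# U2 and V2 have orthonormal columns, and U2 diag(s2) V2' reproduces
# [A_k, P D], where P = Uk Uk' projects the new columns onto span(U_k).
```

Whereas folding-in would leave U_k, Λ_k and V_k untouched and merely append Λ_k^{-1} U_k' d for each new document, the update above refreshes all the factors at the cost of one SVD of a k × (k + p) matrix; new terms are handled analogously, and both together in two steps.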
An approach may be to update the SVD of a database incrementally, using a combination of folding-in and SVD-updating, perhaps together with a statistical model and a multi-stage approach. Thus, we see that there are substantial issues in information retrieval requiring statistical attention.

7 Discussion

This paper discusses some of the major areas associated with the fast-emerging discipline of data mining. As we can see, a number of these aspects have received statistical attention in a variety of ways. In some cases, methodology has been built, perhaps somewhat unconventionally, atop existing statistical machinery. However, I note that there are avenues for substantial further research in almost all the aspects discussed in this paper. A common theme in many of these cases pertains to mining and knowledge discovery in massive databases; however, as we have mentioned in this paper, these are not the only issues. Knowledge discovery, learning about the different processes, and the understanding of causal and associative relationships, whether in small or large databases, are important goals. Sound statistical procedures exist, but quite often under very restrictive assumptions which are not always appropriate for modern databases. There is a need to extend such methodology to more general situations.

In passing, I discuss an area not traditionally associated with data mining, but which has very much been an issue in multivariate statistics. I believe the issue of dimensionality reduction has a greater role to play in analyzing large, irregular or noisy databases. In the statistical literature, several approaches to this problem are available: some commonly used techniques are Principal Components Analysis (PCA), Canonical Correlation Analysis (CCA) and Multi-dimensional Scaling (MDS) (Mardia et al., 1979), Principal Curves (Hastie and Stuetzle, 1989) and Projection Pursuit (Huber, 1985). In recent years, a technique called Independent Components Analysis (ICA) has been developed which separates signals into independent components (Hyvarinen and Oja, 2000). The approach is similar to PCA for Gaussian data but finds independent components in non-Gaussian data, thereby removing a major shortcoming in the application of PCA to non-Gaussian data.

At some level, data mining ultimately needs automated tools to gain insight into relationships from databases. A large number of tools are commercially available and are often sold as a panacea for all problems. Indeed, these tools are often touted as a substitute for expert human interaction and analyses. In substance, many of these tools are front-ends for very common and existing statistical tools, packaged under effective naming strategies. It is unlikely, and in my experience erroneous, to believe that human expertise can ever be substituted by a set of automated learning tools, especially in the analysis of complex processes and complicated datasets. I believe that data mining tools will be invaluable in accessing records, especially in complex databases.
However, from the point of view of understanding and gaining insights into databases, knowledge discovery would perhaps be better served by considering these tools (as we do with statistical software) as one more useful aid in the task ahead.

References

[1] Agrawal, R., Imielinski, T. and Swami, A. (1993). Mining association rules between sets of items in large databases. In Proc. of ACM SIGMOD Conference on Management of Data, 207-216, Washington, DC.
[2] Agrawal, R., Gehrke, J., Gunopulos, D. and Raghavan, P. (1998). Automatic subspace clustering of high-dimensional data for data mining applications. In Proc. of ACM SIGMOD International Conference on Management of Data.
[3] Aggarwal, C. C. and Yu, P. S. (1998). Mining large itemsets for association rules. Bulletin of the IEEE Computer Society Technical Committee on Data Engineering, 21:1:23-31.
[4] Akaike, H. (1974). A new look at the statistical model identification. IEEE Trans. Automatic Control, 19:716-723.
[5] Aliferis, C. and Cooper, G. (1994). An evaluation of an algorithm for inductive learning of Bayesian belief networks using simulated datasets. In Proc. Tenth Conf. on Uncertainty in Artificial Intelligence, 8-14. San Francisco: Morgan Kaufmann.
[6] Arabie, P. and Boorman, S. A. (1973). Multi-dimensional scaling of measures of distance between partitions. J. Math. Psych. 10:148-203.
[7] Babcock, B., Babu, S., Datar, M., Motwani, R. and Widom, J. (2002). Models and issues in data stream systems. Invited paper in Proc. of the 2002 ACM Symp. on Principles of Database Systems (PODS 2002), to appear.
[8] Banfield, J. D. and Raftery, A. E. (1993). Model-based Gaussian and non-Gaussian clustering. Biometrics 49:803-821.
[9] Beckett, J. (1977). Clustering rank preference data. ASA Proc. Soc. Stat., 983-986.
[10] Bellman, R. E. (1961). Adaptive Control Processes. Princeton University Press.
[11] Berry, M. W., Dumais, S. T. and O'Brien, G. W. (1994). Using linear algebra for intelligent information retrieval. SIAM Review, 37:4:573-595.
[12] Berry, M. W., Drmac, Z. and Jessup, E. R. (1999). Matrices, vector spaces, and information retrieval. SIAM Review, 41:2:335-362.
[13] Besag, J. (1974). Spatial interaction and the statistical analysis of lattice systems (with discussion). J. Roy. Stat. Soc. B 36:192-236.
[14] Bozdogan, H. and Sclove, S. L. (1984). Multi-sample cluster analysis using Akaike's Information Criterion. Ann. Inst. Stat. Math. 36:163-180.
[15] Bradley, P. S. and Fayyad, U. M. (1998). Refining initial points for K-means clustering. In Proceedings of the Fifteenth International Conference on Machine Learning (ICML98), 91-99. Morgan Kaufmann, San Francisco.
[16] Bradley, P., Fayyad, U. and Reina, C. (1997). Scaling EM (Expectation-Maximization) clustering to large databases. Technical Report MSR-TR-98-35, Microsoft Research.
[17] Bradley, P., Fayyad, U. and Reina, C. (1998). Scaling clustering algorithms to large databases. In The Fourth International Conference on Knowledge Discovery and Data Mining, New York City.
[18] Breiman, L. (1996). Bagging predictors. Machine Learning, 24:51-64.
[19] Breiman, L., Friedman, J., Olshen, R. and Stone, C. (1984). Classification and Regression Trees. Wadsworth.
[20] Brin, S., Motwani, R. and Silverstein, C. (1997a). Beyond market baskets: generalizing association rules to correlations. Proc. of the ACM SIGMOD, 265-276.
[21] Brin, S., Motwani, R., Ullman, J. D. and Tsur, S. (1997b). Dynamic itemset counting and implication rules for market basket data. Proc. of the ACM SIGMOD, 255-264.
[22] Brook, D. (1964). On the distinction between the conditional probability and the joint probability approaches in the specification of nearest-neighbor systems. Biometrika, 51:481-483.
[23] Brossier, G. (1990). Piecewise hierarchical clustering. J. Classification 7:197-216.
[24] Bruynooghe, M. (1978). Large data set clustering methods using the concept of space contraction. In COMPSTAT, Third Symp. Comp. Stat. (Leiden), 239-245. Physica-Verlag (Heidelberg).
[25] Buntine, W. (1994). Operations for learning with graphical models. J. Artificial Intelligence Research, 2:159-225.
[26] Buntine, W. (1996). Graphical models for discovering knowledge. In Advances in Knowledge Discovery and Data Mining (Fayyad, Piatetsky-Shapiro, Smyth and Uthurusamy, Eds.), 59-82.
[27] Can, F. and Ozkarahan, E. A. (1984). Two partitioning-type clustering algorithms. J. Amer. Soc. Inform. Sci. 35:268-276.
[28] Charniak, E. (1991). Bayesian networks without tears. AI Magazine, 12:50-63.
[29] Charikar, M., Chekuri, C., Feder, T. and Motwani, R. (1997). Incremental clustering and dynamic information retrieval. In Proceedings of the 29th Annual ACM Symposium on Theory of Computing, 144-155.
[30] Cheeseman, P. and Stutz, J. (1996). Bayesian classification (AutoClass): Theory and results. In Advances in Knowledge Discovery and Data Mining (Fayyad, Piatetsky-Shapiro, Smyth and Uthurusamy, Eds.), 153-182.
[31] Chen, H., Gnanadesikan, R. and Kettenring, J. R. (1974). Statistical methods for grouping corporations. Sankhya Ser. B 36:1-28.
[32] Cormack, R. M. (1971). A review of classification. J. Roy. Stat. Soc. Ser. A 134:321-367.
[33] Celeux, G. and Govaert, G. (1995). Gaussian parsimonious clustering models. Patt. Recog. 28:781-793.
[34] Cooper, G. and Herskovits, E. (1992). A Bayesian method for the induction of probabilistic networks from data. Machine Learning, 9:309-347.
[35] Cristianini, N. and Shawe-Taylor, J. (2001). An Introduction to Support Vector Machines (and other kernel-based learning methods). Cambridge University Press.
[36] Day, N. E. (1969). Estimating the components of a mixture of two normal distributions. Biometrika 56:463-474.
[37] Deerwester, S., Dumais, S., Furnas, G., Landauer, T. and Harshman, R. (1990). Indexing by latent semantic analysis. J. Amer. Soc. Information Science, 41:391-407.
[38] Dick, N. P. and Bowden, D. C. (1973). Maximum likelihood estimation of mixtures of two normal distributions. Biometrics 29:781-790.
[39] Ding, C. H. Q. (1999). A similarity-based probability model for latent semantic indexing. In Proc. of 22nd ACM SIGIR '99 Conf., 59-65.
[40] Discriminant Analysis and Clustering: Report of the Panel on Discriminant Analysis, Classification and Clustering. Statistical Science 4:34-69.
[41] DuMouchel, W. and Pregibon, D. (2001). Empirical Bayes screening for multi-item associations in massive datasets. In Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 67-76. ACM Press, San Francisco, CA.
[42] Eddy, W. F., Mockus, A. and Oue, S. (1996). Approximate single linkage cluster analysis of large datasets in high-dimensional spaces. Comp. Stat. and Data Analysis 23:29-43.
[43] Ester, M., Kriegel, H. P., Sander, J. and Xu, X. (1996). A density-based algorithm for discovering clusters in large spatial databases with noise. In Proc. 2nd Int. Conf. on Knowledge Discovery and Data Mining (KDD-96), Portland, OR, 226-231.
[44] Ester, M., Kriegel, H. P., Sander, J. and Xu, X. (1996). Clustering for mining in large spatial databases. Special Issue on Data Mining, KI-Journal, ScienTec Publishing, No. 1.
[45] Everitt, B. (1974). Cluster Analysis. Heinemann Educational, London.
[46] Everitt, B. S. (1988). A finite mixture model for the clustering of mixed-mode data. Stat. Prob. Lett. 6:305-309.
[47] Fisher, R. A. (1936). The use of multiple measurements in taxonomic problems. Eugenics, 7:179-188.
[48] Fowlkes, E. B., Gnanadesikan, R. and Kettenring, J. R. (1988). Variable selection in clustering. J. Classification 5:205-228.
[49] Freund, Y. and Schapire, R. (1997). A decision-theoretic generalization of online learning and an application to boosting. J. Computer and System Sciences, 55:119-139.
[50] Friedman, J. (1989). Regularized discriminant analysis. J. Amer. Stat. Assoc., 84:165-175.
[51] Friedman, J. (1991). Multivariate adaptive regression splines (with discussion). Ann. Stat., 19(1):1-141.
[52] Friedman, J., Hastie, T. and Tibshirani, R. (2001). Additive logistic regression: a statistical view of boosting (with discussion). Ann. Stat., 28:337-407.
[53] Friedman, J. and Stuetzle, W. (1981). Projection pursuit regression. J. Amer. Stat. Assoc., 76:817-823.
[54] Friedman, H. P. and Rubin, J. (1967). On some invariant criteria for grouping data. J. Amer. Stat. Assoc. 62:1159-1178.
[55] Frydenberg, M. (1990). The chain graph Markov property. Scand. J. Stat., 17:333-353.
[56] Ganti, V., Ramakrishnan, R., Gehrke, J., Powell, A. and French, J. (1999). Clustering large datasets in arbitrary metric spaces. In Proc. of International Conference on Data Engineering.
[57] Gnanadesikan, R. (1977). Methods for Statistical Data Analysis of Multivariate Observations. John Wiley & Sons (New York; Chichester).
[58] Goil, S. and Harsha, A. C. (1999). Parallel subspace clustering for very large datasets. Technical report CPDC-TR-9906-010, Northwestern University.
[59] Good, I. J. (1979). The clustering of random variables. J. Stat. Computation and Simulation, 9:241-243.
[60] Gordon, A. D. (1986). Links between clustering and assignment procedures. Proc. Comp. Stat., 149-156. Physica-Verlag (Heidelberg).
[61] Guha, S., Rastogi, R. and Shim, K. (1998). CURE: An efficient algorithm for clustering large databases. In Proceedings of ACM-SIGMOD 1998 International Conference on Management of Data, Seattle.
[62] Hammersley, J. and Clifford, P. (1971). Markov fields on finite graphs and lattices. Unpublished manuscript.
[63] Han, E. H., Karypis, G., Kumar, V. and Mobasher, B. (1998). Clustering in a high-dimensional space using hypergraph models. Technical Report 97-019, Department of Computer Science, University of Minnesota.
[64] Hartigan, J. (1975). Clustering Algorithms. Wiley, New York.
[65] Hartigan, J. (1985). Statistical theory in clustering. J. Class. 2:63-76.
[66] Hartigan, J. A. and Wong, M. A. (1979). [Algorithm AS 136] A k-means clustering algorithm. Applied Statistics, 28:100-108.
[67] Hastie, T., Buja, A. and Tibshirani, R. (1995). Penalized discriminant analysis. Ann. Stat., 23:73-102.
[68] Hastie, T. and Stuetzle, W. (1989). Principal curves. J. Amer. Statist. Assoc., 84(406):502-516.
[69] Hastie, T. and Tibshirani, R. (1996). Discriminant analysis by Gaussian mixtures. J. Roy. Stat. Soc. Ser. B, 58:155-176.
[70] Hastie, T., Tibshirani, R. and Buja, A. (1994). Flexible discriminant analysis by optimal scoring. J. Amer. Stat. Assoc., 89:1255-1270.
[71] Hastie, T., Tibshirani, R. and Friedman, J. (2001). The Elements of Statistical Learning: Data Mining, Inference and Prediction. Springer, New York.
[72] Heckerman, D., Geiger, D. and Chickering, D. (1995). Learning Bayesian networks: The combination of knowledge and statistical data. Machine Learning.
[73] Heckerman, D. (1996). Bayesian networks for knowledge discovery. In Advances in Knowledge Discovery and Data Mining (Fayyad, Piatetsky-Shapiro, Smyth and Uthurusamy, Eds.), 273-305.
[74] Hinneburg, A. and Keim, D. (1999). Cluster discovery methods for large databases: From the past to the future. Tutorial Session, in Proc. of ACM SIGMOD International Conference on Management of Data.
[75] Hodson, F. R., Sneath, P. H. A. and Doran, J. E. (1966). Some experiments in the numerical analysis of archaeological data. Biometrika 53:311-324.
[76] Hopfield, J. (1982). Neural networks and physical systems with emergent collective computational properties. Proc. National Academy of Sciences of the USA, 79:2554-2558.
[77] Hopfield, J. (1984). Neurons with graded response have collective computational properties like those of two-state neurons. Proc. National Academy of Sciences of the USA, 81:3088-3092.
[78] Horton, P. and Nakai, K. (1996). A probabilistic classification system for predicting the cellular localization sites of proteins. Intelligent Systems in Molecular Biology, 109-115.
[79] Huang, Z. (1997). A fast clustering algorithm to cluster very large categorical datasets in data mining. In Proc. SIGMOD Workshop on Research Issues on Data Mining and Knowledge Discovery.
[80] Huber, P. J. (1985). Projection pursuit. Ann. Stat., 13:435-475.
[81] Hyvarinen, A. and Oja, E. (2000). Independent component analysis: algorithms and applications. Neural Networks, 13(4-5):411-430.
[82] Kargupta, H. and Park, B. (2001). Mining time-critical data streams using the Fourier spectrum of decision trees. Proc. IEEE Intl. Conf. on Data Mining, 281-288.
[83] Kaufman, L. and Rousseeuw, P. (1990). Finding Groups in Data: An Introduction to Cluster Analysis. Wiley, New York.
[84] Kohonen, T. (1990). The self-organizing map. Proc. IEEE, 78:1464-1479.
[85] Kohonen, T. (2001). Self-Organizing Maps (Third edition). Springer, Berlin.
[86] Landauer, T. K. and Littman, M. L. (1990). Fully automatic cross-language document retrieval using latent semantic indexing. In Proc. Sixth Annual Conference of the UW Centre for the New Oxford English Dictionary and Text Research, Waterloo, Ontario.
[87] Lauritzen, S. (1982). Lectures on Contingency Tables. University of Aalborg Press, Aalborg, Denmark.
[88] Littman, M., Dumais, S. T. and Landauer, T. (1998). Automatic cross-language information retrieval using latent semantic indexing. In Cross Language Information Retrieval, Grefenstette, G. (Ed.). Kluwer.
[89] Madigan, D. and Raftery, A. (1994). Model selection and accounting for model uncertainty in graphical models using Occam's window. J. Amer. Stat. Assoc., 89:1535-1546.
[90] Maitra, R. (2001). Clustering massive datasets with applications to software metrics and tomography. Technometrics, 43:3:336-346.
[91] Maitra, R. and O'Sullivan, F. (1998). Variability assessment in positron emission tomography and related generalized deconvolution problems. J. Amer. Stat. Assoc. 93:44:1340-1355.
[92] Mardia, K. V., Kent, J. T. and Bibby, J. M. (1979). Multivariate Analysis. Academic Press (New York; London).
[93] Marriott, F. H. (1971). Practical problems in a method of cluster analysis. Biometrics 27:501-514.
[94] Michener, C. D. and Sokal, R. R. (1957). A quantitative approach to a problem in classification. Evolution 11:130-162.
[95] Mirkin, B. G. and Chernyi, L. B. (1970). Measurement of the distance between distinct partitions of a finite set of objects. Automation and Remote Control 31:786-792.
[96] McQuitty, L. L. and Koch, V. L. (1975). A method of hierarchical clustering of a matrix of a thousand by thousand. Ed. Psych. Meas. 35:239-254.
[97] Mojena, R. and Wishart, D. (1980). Stopping rules for Ward's clustering method. In COMPSTAT 1980, Proc. Comp. Stat., 454-459. Physica-Verlag (Heidelberg).
[98] Murtagh, F. (1985). Multi-dimensional Clustering Algorithms. Springer-Verlag (Berlin; New York).
[99] Myers, G. J. (1978). Composite Structured Design. Van Nostrand (New York).
[100] Nakai, K. and Kanehisa, M. (1991). Expert system for predicting protein localization sites in gram-negative bacteria. PROTEINS: Structure, Function, and Genetics, 11:95-110.
[101] Piatetsky-Shapiro, G. (1991). Discovery, analysis, and presentation of strong rules. In Inductive Logic Programming, S. Muggleton (Ed.), 281-298. Academic Press, London.
[102] Platt, J. (1998). How to implement SVMs. IEEE Intelligent Systems Magazine, Trends and Controversies, Marti Hearst (Ed.), 13(4).
[103] Platt, J. (1999). Fast training of support vector machines using sequential minimal optimization. In Advances in Kernel Methods - Support Vector Learning, B. Scholkopf, C. Burges and A. Smola (Eds.). MIT Press.
[104] Pollard, D. (1981). Strong consistency of k-means clustering. Ann. Stat. 9:135-140.
[105] Ramey, D. B. (1985). Nonparametric clustering techniques. In Encycl. Stat. Sci. 6:318-319. Wiley (New York).
[106] Rand, W. M. (1971). Objective criteria for the evaluation of clustering methods. J. Amer. Stat. Assoc. 66:846-850.
[107] Redner, R. A. and Walker, H. F. (1984). Mixture densities, maximum likelihood and the EM algorithm. SIAM Rev. 26:195-239.
[108] Ripley, B. D. (1991). Classification and clustering in spatial and image data. In Analysis and Modeling of Data and Knowledge, 85-91. Springer-Verlag (Berlin; New York).
[109] Ripley, B. D. (1996). Pattern Recognition and Neural Networks. Cambridge University Press.
[110] Rissanen, J. (1987). Stochastic complexity (with discussion). J. Roy. Stat. Soc. Ser. B, 49:223-265.
[111] Rocchio, J. J. (1971). Relevance feedback in information retrieval. In The SMART Retrieval System, G. Salton (Ed.), 313-323. Prentice-Hall, Inc., NJ.
[112] Rotman, S. R., Fisher, A. D. and Staelin, D. H. (1981). Analysis of multiple-angle microwave observations of snow and ice using cluster analysis techniques. J. Glaciology 27:89-97.
[113] Rosenblatt, F. (1958). The perceptron: a probabilistic model for information storage and organization in the brain. Psychological Review, 65:386-408.
[114] Schwarz, G. (1978). Estimating the dimension of a model. Annals of Statistics, 6:461-464.
[115] Scott, A. J. and Symons, M. J. (1971). Clustering methods based on likelihood ratio criteria. Biometrics 27:387-397.
[116] Sheikholeslami, G., Chatterjee, S. and Zhang, A. (1998). WaveCluster: A multi-resolution clustering approach for very large spatial databases. In 24th International Conference on Very Large Data Bases, August 24-27, New York City.
[117] Spiegelhalter, D., Dawid, A., Lauritzen, S. and Cowell, R. (1993). Bayesian analysis in expert systems. Statistical Science, 8:219-282.
[118] Symons, M. J. (1981). Clustering criteria and multivariate normal mixtures. Biometrics 37:35-43.
[119] Tibshirani, R., Walther, G. and Hastie, T. (2001). Estimating the number of clusters in a dataset via the gap statistic. J. Roy. Stat. Soc. Ser. B, 63:411-423.
[120] Van Ryzin, J. (1977). Classification and Clustering. Academic Press (New York; London).
[121] Vapnik, V. (1996). The Nature of Statistical Learning Theory. Springer-Verlag, New York.
[122] Verma, T. and Pearl, J. (1990). Equivalence and synthesis of causal models. In Proc. Sixth Conf. on Uncertainty in Artificial Intelligence, 220-227. San Francisco: Morgan Kaufmann.
[123] Wahba, G. (1999). Support vector machines, reproducing kernel Hilbert spaces and the randomized GACV. In Advances in Kernel Methods - Support Vector Learning (B. Scholkopf, C. J. C. Burges and A. J. Smola, Eds.), 69-88. MIT Press.
[124] Wolfe, J. H. (1970). Pattern clustering by multivariate mixture analysis. Multivariate Behavioral Research 5:329-350.
[125] Yourdon, E. (1975). Techniques of Program Structure and Design. Prentice-Hall (Englewood Cliffs, NJ).
[126] Zhang, T., Ramakrishnan, R. and Livny, M. (1996). BIRCH: An efficient data clustering method for very large databases. In Proc. of ACM SIGMOD International Conference on Management of Data.
[127] Zupan, J. (1982). Clustering of Large Data Sets. John Wiley & Sons (New York; Chichester).