Learning from Large Distributed Data: A Scaling Down Sampling Scheme for Efficient Data Processing




International Journal of Machine Learning and Computing, Vol. 4, No. 3, June 2014

Che Ngufor and Janusz Wojtusiak

Manuscript received September 9, 2013; revised December 2013. Che Ngufor is with Computational Sciences and Informatics at George Mason University, Fairfax, VA (e-mail: cngufor@masonlive.gmu.edu). Janusz Wojtusiak is with the George Mason University Department of Health Administration and Policy (e-mail: jwojt@mli.gmu.edu). DOI: 10.7763/IJMLC.2014.V4.45

Abstract: Extracting information from a training data set for predictive inference is a fundamental task in data mining and machine learning. With the exponential growth in the amount of data being generated in the past few years, there is an urgent need to develop or adapt existing learning algorithms to efficiently learn from large data sets. This paper describes three scaling techniques enabling machine learning algorithms to learn from large distributed data sets. First, a general single-pass formula for computing the covariance matrix of large data sets using the MapReduce framework is derived. Second, two new efficient and accurate sampling schemes for scaling down large data sets for local processing are presented. The first sampling scheme uses the single-pass covariance formula to select the most informative data points based on uncertainties in the linear discriminant score. The second technique, on the other hand, selects informative points based on uncertainties in the logistic regression model. A series of numerical experiments demonstrates numerically stable results from the application of the formula and a fast, efficient, accurate and cost effective sampling scheme.

Index Terms: Linear discriminant analysis, logistic regression, classification, sampling, MapReduce, single-pass.

I. INTRODUCTION

The basic machine learning task is that of extracting relevant information from a training set for predictive inference. Given today's ever and steadily growing data set sizes, the machine learning process must be able to handle large amounts of data effectively and efficiently. However, most existing machine learning algorithms were designed at a time when data set sizes were far smaller than current sizes. This has led to a significant amount of research in scaling methods, that is, designing algorithms to learn efficiently from large data sets. Two general approaches can be identified in this endeavor: scaling up and scaling down.

The first approach attempts to scale up machine learning algorithms, that is, develop new algorithms or modify existing ones so that they can better handle large data sets. There has been a rapid rise in research on methods for scaling up machine learning algorithms. This research has been aided in part by the fact that some machine learning algorithms can be readily deployed in parallel. For example, [1] showed that ten commonly used machine learning algorithms (logistic regression, naive Bayes, k-means clustering, support vector machines, etc.) can be easily written as MapReduce programs on multi-core machines. The other part can be attributed to the rapid evolution of hardware and programming architectures [2]. These new technologies are highly optimized for distributed computing in the sense that they are parallel efficient, reliable, fault tolerant and scalable. The Hadoop-MapReduce framework, for example, has been successfully applied to a broad range of real world machine learning applications [3], [4].

Despite the appealing properties of scaling up machine learning algorithms, there are some obvious problems with this approach. First, scaling up an algorithm so that it can handle "large" data today does not necessarily mean it will handle "large" data tomorrow. Second, adapting current algorithms to cope with large data can be very challenging, and the scaled-up algorithm may end up being too complex and computationally too expensive to deploy. Finally, not all machine learning algorithms can be modified for parallel implementation. Coupled with the fact that no single algorithm is uniformly the best in all applications, it is sometimes necessary to deploy many algorithms so that they can collaborate to improve accuracy.

The second approach to the scaling problem attempts to scale down large data sets to reasonable sizes that allow practical use of existing algorithms. The traditional approach is to take a random sample of the large data for learning. This naive approach runs the risk of learning from non-informative instances. For example, in imbalanced classification problems where one class is underrepresented, it is possible for random sampling to select only a few members of the minority class and a large number of the majority class, presenting yet another imbalanced learning problem.

Though not exhaustively explored, research in methods for scaling up algorithms has been on the rise in the past few years. Much less work has been done on finding efficient ways to scale down very large distributed data sets for learning. This work presents three key contributions to research in learning from large distributed data sets. In a first step, a new single-pass formula for computing the covariance matrix of large data sets on the MapReduce framework is derived. The formula can be seen as an efficient generalization of the pairwise and incremental update formulas presented in [5]. The single-pass covariance matrix estimation is then used in a second step to derive two new sampling schemes for scaling down a large distributed data set for efficient processing locally in memory. Precisely, uncertainties in the linear discriminant score and the logistic regression model are used to infer the informativeness of data points with respect to the decision boundary. The informative instances are selected through uncertainties of interval estimates of these statistics.

Because real data is not always normal, linear discriminant analysis (LDA) may perform poorly when the normality assumption is violated. Thus the Hausman specification test [6] can be applied to test for normality and to decide which sampling scheme to use. The Hausman test requires consistent estimators of the asymptotic covariance matrix of the parameters of the models to be tested. The Fisher information matrix is easily computed from the output of the logistic regression (LR) and provides a consistent estimator of the asymptotic covariance matrix of its parameters. On the other hand, consistent estimators of the asymptotic covariance matrix of the parameters of LDA are not readily available. To solve this problem, this paper also derives a consistent asymptotic covariance matrix of the parameters of LDA that is simple and easy to compute for large data sets using the single-pass covariance formula.

Empirical results show that, using these techniques, large classification data sets can be efficiently scaled down to manageable sizes permitting local processing in memory, while sacrificing little if any accuracy. There are also many potential benefits of the sampling approach: the selected samples can be used to gain more knowledge about the characteristics of the decision boundary, for visualization, and to train other algorithms that cannot be easily trained on MapReduce. A specific example illustrating the usefulness of this approach is presented where the support vector machine (SVM) is trained using scaled down data and large scale prediction is carried out on MapReduce.

For simplicity, this paper considers only binary classification problems and is organized as follows. Related work is presented in Section II. Section III presents a general single-pass formula for computing the covariance matrix of large data sets. Section IV briefly reviews LDA and LR and presents confidence interval estimates of the models. The derivation of the asymptotic covariance matrix of the parameters of LDA is also presented in this section. The implementation of the single-pass covariance matrix computation and the proposed sampling schemes on MapReduce is presented in Section V. Numerical experiments are reported in Section VI, while Section VII concludes the paper. Because of space limitations, the proofs of the propositions presented in the paper are omitted. The interested reader is referred to http://www.mli.gmu.edu/publications for proofs and other detailed information.

II. RELATED WORK

Single-pass pairwise updating algorithms, especially for the variance, have existed for some time and have proven to be numerically stable [5]. In these techniques, the data set is always split into two subsets and the variance/covariance of each subset is computed and combined. This leads to a tree-like structure where each node is the result of the combination of two statistics, each of which resulted from the combination of two other statistics, and so on. The drawback of these methods lies in the pairwise incremental updating step. At each round of the computation, the data is split into two and the formula is applied recursively. Special data structures and bookkeeping may be required to handle the tree-like structure of the algorithm. Moreover, it is not readily amenable to the MapReduce framework, as some communication between processes may be required. The single-pass formula presented in this paper avoids the tree-like combination structure and computes the covariance matrix in one step.

Much work has been done on scaling up existing machine learning algorithms so that they can handle large data sets and reduce execution time. While this approach is attractive, there are many situations where it is not possible to scale up a machine learning algorithm, or where the scale-up produces a complex or computationally more expensive algorithm. In such situations, scaling down by sampling is an attractive alternative. Various approaches have been followed to reduce training set sizes, such as dimension reduction, random sampling, and active learning [7]. The simplest of these sampling techniques is random sampling, in which there is no control over the nature and type of instances that are added to the training set. It is possible to reduce the training size with random sampling and end up with a training set with no precise decision boundary. It is therefore important to guide random sampling so that the reduced training set always has classes separated by a precise decision boundary.

Active learning, on the other hand, is a controlled sampling technique that has been shown in several applications to be reasonably successful in dealing with the problem of acquiring training data. Its main purpose is to minimize the number of data points requested for labeling, thereby reducing the cost of learning. In active learning using SVM, for example [7], training size reduction is achieved by training a classifier using only the support vectors or data points close to the SVM hyperplane. A major drawback of using SVM for sample size reduction is that training an SVM is at least quadratic in the training set size. Thus, the computational cost for large data sets can be very significant. In addition, SVM is also known to be very difficult to parallelize, especially on MapReduce where there is little or no communication between processes. The sampling schemes for sample size reduction presented in this work are designed to avoid or minimize these problems.

III. SINGLE-PASS PARALLEL STATISTICS

Accurate computation of statistics such as the mean, the variance/covariance matrix and the correlation coefficient/matrix is critical for the deployment of many machine learning applications. For example, the performance of discriminant analysis, principal component analysis, outlier detection, etc. depends on the accurate estimation of these statistics. However, the computation of these statistics, especially the variance or covariance matrix, can be very expensive for large data sets and potentially unstable when their magnitude is very small. The standard approach consists of calculating the sum of squared deviations from the mean. This involves passing through the data twice: first to compute the mean, and second to compute the deviations from the mean. This naive two-pass algorithm is known to be numerically stable, but may become too costly.
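For concreteness, the following is a minimal Python/NumPy sketch (an illustration, not code from the paper) of the naive two-pass estimate just described: one pass over a data block to compute the mean, and a second pass to accumulate the squared deviations.

```python
import numpy as np

def two_pass_covariance(X):
    """Naive two-pass covariance of an (n x p) data block.

    Pass 1 computes the column means; pass 2 accumulates the outer
    products of the deviations from the mean.  Numerically stable,
    but the data must be read twice.
    """
    n = X.shape[0]
    mean = X.sum(axis=0) / n          # first pass
    centered = X - mean               # second pass
    scatter = centered.T @ centered   # sum of squared deviations
    return scatter / (n - 1), mean    # unbiased covariance, mean
```

On a distributed data set the second pass is what becomes expensive, since every block would have to be revisited once a global mean is known; the single-pass formula presented below removes that requirement.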

Single-Pass Covariance Matrix Formula

Given a large distributed data set $\mathcal{D}$ that can be partitioned into $k$ finite blocks $\mathcal{D} = \cup_{j=1}^{k}\mathcal{D}_j$ with $\mathcal{D}_i \cap \mathcal{D}_j = \emptyset$ for $i \neq j$, each $\mathcal{D}_j$ is typically a set of $n_j$ multivariate random samples $\{X_1, \ldots, X_{n_j}\}$, where each $X_i$ is a $p$-dimensional random vector $X_i = (X_{i1}, \ldots, X_{ip})^T$. The scatter matrix of each block is given by

$S_j = \sum_{i=1}^{n_j} (X_i - \bar{X}_j)(X_i - \bar{X}_j)^T$

where $\bar{X}_j = \frac{1}{n_j}\sum_{i=1}^{n_j} X_i$ is the sample mean of each block. The unbiased estimate of the covariance matrix of each block is given by $\hat{\Sigma}_j = S_j/(n_j - 1)$. The main goal is to compute the covariance matrix of the complete data.

Proposition 1. The scatter matrix of the distributed data partitioned into $k$ disjoint data-blocks $\{\mathcal{D}_j\}_{j=1}^{k}$ is given by

$S = \sum_{j=1}^{k} S_j + \frac{1}{n}\sum_{(i,j)} n_i n_j (\bar{X}_i - \bar{X}_j)(\bar{X}_i - \bar{X}_j)^T$    (1)

where $n = \sum_{j=1}^{k} n_j$, $\bar{X} = \frac{1}{n}\sum_{j=1}^{k} n_j \bar{X}_j$ is the overall sample mean, and $\sum_{(i,j)}$ denotes the summation over all combinations of distinct pairs $(i, j)$ from $(1, \ldots, k)$.

IV. SCALING DOWN SAMPLING SCHEMES

In most classification problems, the classifier usually has great uncertainty in deciding the class memberships of instances on or very close to the decision boundary. These are interesting data points warranting further investigation. If the classifier can be taught how to accurately classify these points, then classifying the non-boundary points will be a trivial process. It is well known in the active learning community that training a classifier only on the most uncertain examples can drastically reduce data labeling cost, computational time and training set size without sacrificing the predictive accuracy of the classifier [7]. Borrowing this idea, this section presents two complementary sampling schemes, based on the linear discriminant score and the logistic regression model, to scale down a large distributed data set for local learning.

A. A Linear Discriminant Score Sampling Scheme

LDA aims at discriminating between two multivariate normal populations with a common covariance matrix, say $H_1: N_p(\mu_1, \Sigma)$ and $H_2: N_p(\mu_2, \Sigma)$, on the basis of independent random samples of sizes $n_1$ and $n_2$. Fisher's linear discriminant rule assigns a new test example $x$ to population $H_1$ if the discriminant score $\delta(x)$ satisfies

$\delta(x) = \delta_0 + \lambda^T x \geq 0$    (2)

where $\delta_0 = \log(\pi_1/\pi_2) - \frac{1}{2}(\mu_1 + \mu_2)^T \Sigma^{-1}(\mu_1 - \mu_2)$, $\lambda = \Sigma^{-1}(\mu_1 - \mu_2)$, and $\pi_j$ is the probability that $x$ belongs to population $H_j$. The decision boundary is defined by the points satisfying $\delta(x) = 0$. If the true value of $\delta(x)$ were known, these points could be easily determined. However, $\delta(x)$ is not known and is usually estimated using unbiased sample versions such as the minimum variance unbiased estimator [8], given as

$\hat{\delta}(x) = \frac{1}{2}\left[\hat{\Delta}_2(x) - \hat{\Delta}_1(x)\right] + \log(n_1/n_2)$    (3)

where $\hat{\Delta}_j(x) = \frac{n_1+n_2-p-3}{n_1+n_2-2}\,(x - \bar{X}_j)^T S_p^{-1}(x - \bar{X}_j) - \frac{p}{n_j}$ is a bias-corrected estimate of the squared Mahalanobis distance from $x$ to population $H_j$, with $\bar{X}_j$ the sample mean and $S_p$ the pooled covariance matrix.

If the probability distribution of $\hat{\delta}(x)$ is known, then one can easily find the likelihood that the true value of the score lies within some specified range for each test point $x$. For example, if $\hat{\delta}(x)$ is assumed to be normally distributed, then a 95% confidence interval centered at 0 will correspond to data points close to the decision boundary.

The LDA sampling scheme presented in this work is based on an approximate distribution derived in [8] under the assumption of equal priors, i.e. $\pi_1 = \pi_2$. The case of unequal priors can be adjusted accordingly. Letting $\Delta_j(x) = (x - \mu_j)^T \Sigma^{-1}(x - \mu_j)$ be the squared Mahalanobis distance between $x$ and the population center $\mu_j$, and $\delta(x) = \frac{1}{2}\left[\Delta_2(x) - \Delta_1(x)\right]$, it is shown in [8] that $\hat{\delta}(x)$ is asymptotically normally distributed with mean $\delta(x)$ and variance $\mathrm{var}(\hat{\delta}(x))$ given, in a simplified form, as a function of $\delta(x)$, of the Mahalanobis distance $\Delta(\mu) = (\mu_1 - \mu_2)^T\Sigma^{-1}(\mu_1 - \mu_2)$ between the two population centers, and of constants depending only on the sample sizes and the dimension, such as $N = n_1 + n_2$, $M = n_1 - n_2$, $n = n_1 n_2$ and $a = N - p$; the exact expression is given in [8].

With the approximate distribution of $\hat{\delta}(x)$, uncertainties in classifying each data point $x$ can be estimated by computing confidence intervals about the mean value $\hat{\delta}(x)$. In particular, the $(1-\alpha)100\%$ confidence interval about the decision boundary $\delta(x) = 0$ is given by

$-Z_{\alpha/2} \leq \dfrac{\hat{\delta}(x)}{\sqrt{\mathrm{var}(\hat{\delta}(x))}} \leq Z_{\alpha/2}$    (5)
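To make the query rule concrete, the following is a minimal Python/NumPy sketch (an illustration, not the paper's code) of selecting boundary points via Eq. (5). It uses the plug-in score $\delta_0 + \lambda^T x$ built from sample estimates rather than the bias-corrected estimator in (3), and it assumes the score variances prescribed by [8] have already been computed and are passed in as an array.

```python
import numpy as np
from scipy import stats

def discriminant_scores(X, xbar1, xbar2, S_pooled, n1, n2):
    """Plug-in discriminant scores delta(x) = delta_0 + lambda^T x,
    with priors estimated by the class proportions n1/(n1+n2), n2/(n1+n2)."""
    lam = np.linalg.solve(S_pooled, xbar1 - xbar2)      # lambda = S_p^{-1} (xbar1 - xbar2)
    delta0 = np.log(n1 / n2) - 0.5 * (xbar1 + xbar2) @ lam
    return X @ lam + delta0

def lda_boundary_mask(delta_hat, delta_var, alpha=0.05):
    """Eq. (5): keep x when the (1 - alpha)100% interval for its score
    contains the decision boundary delta(x) = 0, i.e. a high-uncertainty point."""
    z = stats.norm.ppf(1.0 - alpha / 2.0)
    return np.abs(delta_hat / np.sqrt(delta_var)) <= z
```

Note that widening the interval (smaller alpha) selects more points while a tighter interval returns fewer, which is how the size of the reduced sample is controlled.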

Fig. 1. Performance of sampling schemes.

Data points within this interval represent points for which the classifier has high uncertainty about class membership and are the most informative for learning. A large confidence interval will select more points, while a tight interval will return fewer points. Equation 5 therefore provides an efficient, principled query strategy for uncertainty based active learning [7] that can be used for sample size reduction.

In the standard pool-based active learning parlance, a small labeled training set $\mathcal{D}_l$ and a large "unlabeled" pool $\mathcal{D}_u$ are assumed to be available. The task of the active learner is to use the information in $\mathcal{D}_l$ in a smart way to select the best query point $x^* \in \mathcal{D}_u$, ask the user or an oracle for its label, and then add it to $\mathcal{D}_l$. This process continues until the desired training set size or accuracy of the learner has been achieved. For the sampling schemes proposed in this work, however, both $\mathcal{D}_l$ and $\mathcal{D}_u$ are labeled training sets, and the idea is to select the most informative data points and their labels from $\mathcal{D}_u$. The proposed learning algorithm for sample size reduction is presented in Algorithm 1. The stopping criterion of the algorithm can be set equal to the required sample size of the reduced training set. Note that at each round of the algorithm, the selected points can be used to train a different classifier such as LR or SVM.

Fig. 1(a) shows a comparison of the classification performance of LR trained at each step of the sampling scheme using: data points selected by the LDA sampling scheme, the support vectors from an SVM training, and random sampling (Random). The forest covertype data from the UCI Machine Learning repository [9] was used for training. The classification problem represented by this data is to discriminate between 7 forest cover types using 54 cartographic variables. The data was converted to binary by combining the two majority forest cover types (Spruce-Fir with n = 211,840 and Lodgepole Pine with n = 283,301) into one class and the rest (n = 85,871) into the second class. The data was split into 75% training and 25% testing. The sampling schemes were stopped once 0.7% of the training set had been queried for learning. The results showed that, for the LDA sampling scheme, only about 0.3% of carefully selected training data points were needed to reach the error reduction at which the algorithm approximately stabilizes, whereas the random sampling method used all 0.7% of the training data and still achieved only a 0.4% reduction in error. The performance of the LDA and SVM sampling schemes is very similar; however, LDA took by far a smaller time to converge compared to SVM. Precisely, in this example, the time ratio of SVM to LDA was about 40, averaged over ten-fold cross-validation. To the best of the knowledge of the authors of this paper, this is the first "active" learning technique for sample size reduction based on uncertainties in the linear discriminant score.

B. A Logistic Regression Sampling Scheme

LR is another popular discriminative classification method that makes no assumption about the distribution of the independent variables. It has found wide use in machine learning, data mining, and statistics.

Let $\{(x_i, y_i)\}_{i=1}^{n}$ be a set of training examples where the random variables $y_i \in \{0, 1\}$ are binary and $x_i \in \mathbb{R}^p$ are $p$-dimensional feature vectors. The fundamental assumption of the LR model is that the log-odds or "logit" transformation of the posterior probability $\pi(\beta; x) = \Pr(y = 1 \mid x; \beta)$ is linear, i.e.

$\log\dfrac{\pi}{1 - \pi} = \beta_0 + \beta^T x$    (6)

where $\beta_0$ and $\beta = (\beta_1, \ldots, \beta_p)$ are the unknown parameters of the model. Typically, the method of maximum likelihood is used to estimate the unknown parameters. By setting $x_i := (1, x_i)$ and $\beta := (\beta_0, \beta)$, the regularized log-likelihood function is given by

$l(\beta) = \beta^T X^T Y - \sum_{i=1}^{n}\log\big(1 + \exp(x_i^T\beta)\big) - \frac{\lambda}{2}\,\beta^T\beta$

where $X$ is the design matrix, $Y$ the response vector and $\lambda$ reflects the strength of the regularization. Iterative methods such as gradient based methods or the Newton-Raphson method are commonly used to compute the maximum likelihood estimate (MLE) $\hat{\beta}$ of $\beta$. For example, the one-step update of $L_2$-regularized stochastic gradient descent (SGD) is given by

$\beta_{\mathrm{new}} = \beta_{\mathrm{old}} + \eta\big[(y_i - \pi(\beta_{\mathrm{old}}; x_i))\,x_i - \lambda\beta_{\mathrm{old}}\big]$    (7)

where $\eta > 0$ is the learning rate.
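The following is a small Python/NumPy sketch of the update in Eq. (7) (an illustration only; the epoch structure, learning rate and regularization strength are assumptions, not the paper's settings). The design matrix is assumed to already carry a leading column of ones so that the first coefficient plays the role of $\beta_0$.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sgd_epoch(X, y, beta, eta=0.01, lam=1e-4, rng=None):
    """One sweep of the L2-regularized SGD update of Eq. (7):
    beta <- beta + eta * [(y_i - pi(beta; x_i)) * x_i - lam * beta]."""
    rng = np.random.default_rng() if rng is None else rng
    for i in rng.permutation(X.shape[0]):   # visit the examples in random order
        pi_i = sigmoid(X[i] @ beta)
        beta = beta + eta * ((y[i] - pi_i) * X[i] - lam * beta)
    return beta
```

As described next, each iteration touches a single randomly chosen example, which is what makes the update cheap enough to run independently inside every mapper in the parallelized SGD scheme discussed in Section V.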

Each iteration of SGD consists of choosing an example $(x_i, y_i)$ at random from the training set and updating the parameter $\beta$.

An important feature of the LR parameters is that the parameter estimates are consistent. It can be shown that the MLE of LR is asymptotically normally distributed, i.e.

$\sqrt{n}\,(\hat{\beta} - \beta) \sim N\big(O,\ I(\beta)^{-1}\big)$

where $I(\beta) = X^T W X$ is the Fisher information matrix with $W = \mathrm{diag}\{\pi_i(1 - \pi_i)\}$, $i = 1, \ldots, n$ (see for example [10], Section 6.5). Based on the distribution of $\hat{\beta}$, the asymptotic distribution of the MLE of the logistic function $\hat{\pi}$ can be derived by application of the delta method. Specifically, for any real valued function $g$ with the property that $\partial g(\beta)/\partial\beta$ exists, one has

$\sqrt{n}\,\big(g(\hat{\beta}) - g(\beta)\big) \sim N\big(O,\ \nabla g(\beta)^T I(\beta)^{-1}\,\nabla g(\beta)\big)$

where $\nabla$ is the gradient operator. By taking $g(\beta) = \pi(\beta; x)$, it can be seen that $\hat{\pi} = \pi(\hat{\beta}; x)$ is asymptotically normally distributed with mean $\pi(\beta; x)$ and variance $\mathrm{Var}(\pi(\hat{\beta}; x)) = \nabla\pi(\beta; x)^T I(\beta)^{-1}\,\nabla\pi(\beta; x)$.

The decision boundary of the LR model is defined by $\beta_0 + \beta^T x = 0$, i.e. where $\pi(\beta; x) = 0.5$. This shows that points on the boundary have equal chances of being assigned to either population. Therefore, uncertainties in $\pi(\hat{\beta}; x)$ for the boundary points can be statistically captured by a $(1-\alpha)100\%$ confidence interval about 0.5. Confidence intervals for the parameter estimates of LR can be calculated from critical values of the Student t-distribution. By following a calculation similar to the one presented in [11], Section 8.6.3, for the confidence interval of a linear function of the parameters of a linear regression model, one obtains for $\pi(\hat{\beta}; x)$ the statistic

$t = \dfrac{\pi(\hat{\beta}; x) - \pi(\beta; x)}{\sqrt{\mathrm{Var}(\pi(\hat{\beta}; x))}} \sim t(n - p - 1)$

which has a Student t-distribution with $n - p - 1$ degrees of freedom. Uncertainties about the true decision boundary $\pi(\beta; x) = 0.5$ can now be inferred through confidence intervals. In particular, the $(1-\alpha)100\%$ confidence interval about the decision boundary is given by

$-t_{\alpha/2,\,n-p-1} \leq \dfrac{\pi(\hat{\beta}; x) - 0.5}{\sqrt{\mathrm{Var}(\pi(\hat{\beta}; x))}} \leq t_{\alpha/2,\,n-p-1}$    (8)

A similar algorithm for sample size reduction using the logistic regression model is presented in Algorithm 2.
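The following Python/NumPy sketch illustrates the query rule of Eq. (8) (an illustration, not the paper's implementation). It assumes the fitted coefficients and the inverse Fisher information $(X^T W X)^{-1}$ from the training run are available, and uses the gradient $\nabla\pi(\beta; x) = \pi(1-\pi)\,x$ of the logistic function.

```python
import numpy as np
from scipy import stats

def lr_boundary_mask(X, beta, info_inv, alpha=0.05):
    """Eq. (8): keep x when the (1 - alpha)100% interval for pi(beta; x)
    contains 0.5, i.e. the point lies close to the LR decision boundary.

    X        : (n, p+1) design matrix with a leading column of ones
    beta     : fitted coefficient vector
    info_inv : inverse Fisher information (X^T W X)^{-1} from training
    """
    n, n_coef = X.shape                      # n_coef = p + 1 (features + intercept)
    pi = 1.0 / (1.0 + np.exp(-(X @ beta)))
    grad = (pi * (1.0 - pi))[:, None] * X    # rows: d pi / d beta at each x
    var = np.einsum('ij,jk,ik->i', grad, info_inv, grad)   # delta-method variance
    t_crit = stats.t.ppf(1.0 - alpha / 2.0, df=n - n_coef)  # n - p - 1 in the text
    return np.abs(pi - 0.5) <= t_crit * np.sqrt(var)
```

Section V applies this rule inside each mapper, so that only points flagged as boundary points leave the distributed blocks.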
Fig. 1(b) shows the error curve for the logistic regression trained at each step of the sampling scheme using: data points selected by the LR sampling scheme, the support vectors from an SVM training, and random sampling. The Waveform data set from the UCI Machine Learning repository was used for this example. There are a total of 5000 records in this data set with 40 attributes; 75% were used for training and 25% for testing. All the sampling schemes were stopped once 6% of the training set had been queried for learning. Clearly, the LR sampling scheme outperforms both the SVM and Random schemes.

The authors of this paper are unaware of any previous use of uncertainties in the LR model as described in Algorithm 2 for sample size reduction or for active learning. A closely related but different approach is the variance reduction active learning for LR presented in [12]. The idea of this approach is to select data points that minimize the mean squared error of the LR model. To do this, the mean squared error is decomposed into its bias and variance components. However, in the active learning step, only the variance is minimized and the bias term is neglected. Frequently, however, the bias term constitutes a large portion of the model's error, so this variance-only active learning approach may not select all informative data points.

C. The Hausman Specification Test

Under the normality assumption, the LDA and LR estimators are both known to be consistent, but LDA is asymptotically more efficient [13]. Thus the Hausman specification test can be applied to test for these distributional assumptions by comparing the two estimators. This section briefly presents the derivation of the asymptotic covariance matrix of the LDA parameters required for the Hausman specification test. This is useful in deciding which of the sampling schemes presented in this paper is best to use.

The LDA and LR models are very similar in form but significantly different in model assumptions. LR makes no assumptions about the distribution of the independent variables, while LDA explicitly assumes a normal distribution. Specifically, LR is applicable to a wider class of input distributions than the normal LDA. However, as illustrated in [13], when the normality assumption holds, LDA is more efficient than LR. Under non-normal conditions, LDA is generally inconsistent whereas LR maintains its consistency. Since LDA may perform poorly on non-normal data, an important criterion for choosing between LR and LDA is to check whether the assumption of normality is satisfied.

The Hausman specification test is an asymptotic chi-square test based on the quadratic form obtained from the difference between a consistent estimator under the alternative hypothesis and an efficient estimator under the null hypothesis. Under the null hypothesis of normality, both the LDA and LR estimators should be numerically close, implying that for large sample sizes the difference between them converges to zero. Under the alternative hypothesis of non-normality, however, the two estimators should differ. Naturally then, if the null hypothesis is true, one should use the more efficient estimator, which is the LDA estimator, and the LR estimator otherwise.

Let $\hat{\Sigma}_{LDA}$ and $\hat{\Sigma}_{LR}$ be the estimated asymptotic covariance matrices of $\hat{\lambda}$ and $\hat{\beta}$, the estimators of LDA and LR respectively. Letting $Q = \hat{\lambda} - \hat{\beta}$, the Hausman chi-squared statistic [6] is defined by

$Q^T\big(\hat{\Sigma}_{LDA} - \hat{\Sigma}_{LR}\big)^{-}\,Q \;\sim\; \chi^2_p$    (9)

where $(\cdot)^{-}$ denotes the generalized inverse. During training, $\hat{\Sigma}_{LR}$ is readily available through the LR Fisher information matrix $I(\hat{\beta})$. Therefore, the main difficulty in computing the Hausman statistic is how to compute $\hat{\Sigma}_{LDA}$. Several methods have been proposed in the LDA literature to compute $\hat{\Sigma}_{LDA}$ [6], [13]. These methods are, however, too complex to implement on MapReduce. In this work, a much simpler approach, following Proposition 2, is derived, and the resulting covariance matrix can be easily computed by the single-pass formula.

Proposition 2: Given the training set $\{(x_i, y_i)\}_{i=1}^{n}$, where $y_i = j$ indicates the multivariate normal $N_p(\mu_j, \Sigma)$ from which $x_i$ comes, the limiting distribution of the MLE of the LDA parameters is given by

$\sqrt{n^*}\,(\hat{\lambda}^* - \lambda) \sim N_p(O, \Gamma)$

where $\hat{\lambda}^*$ is $\hat{\lambda}$ rescaled by a constant that depends only on the sample sizes, $\lambda = \Sigma^{-1}(\mu_1 - \mu_2)$, $n^*$ is an effective sample size determined by $n_1$ and $n_2$, and $\Gamma$ is built from $\Sigma^{-1}$, $\Sigma^{-1}\mu\mu^T\Sigma^{-1}$ and $(\mu^T\Sigma^{-1}\mu)\,\Sigma^{-1}$ with $\mu = \mu_1 - \mu_2$. Computing $\Gamma$ for large data sets only requires a straightforward application of the single-pass formula. Note that the constant term $\delta_0$ in (2) has been omitted for convenience.

V. A DISTRIBUTED FRAMEWORK FOR MACHINE LEARNING

This section briefly describes the Hadoop-MapReduce framework and its application to machine learning. The section ends with the implementation of the single-pass covariance formula and the LDA and LR sampling schemes on MapReduce.

A. The Hadoop-MapReduce Framework

The MapReduce (MR) framework is based on a typical divide and conquer parallel computing strategy. Any application that can be designed as a divide and conquer application can generally be set up as an MR program. The application core of MR consists of two functions: a Map and a Reduce function. The input to the Map function is a list of key-value pairs. Each key-value pair is processed separately by each Map function, which outputs a key-value pair. The output from each Map is then shuffled so that values corresponding to the same key are grouped together. The Reduce function aggregates the list of values corresponding to the same key using the user-specified aggregating function. In the Hadoop (http://hadoop.apache.org/) implementation of MR, all that is required is for the user to provide the Map and Reduce functions. Data partitioning, distribution, replication, communication, synchronization and fault tolerance are handled by the Hadoop platform.

While highly scalable, the Hadoop-MapReduce framework suffers from one serious limitation for machine learning tasks: it does not support iterative procedures. However, a number of techniques have recently been proposed to train iterative machine learning algorithms like LR efficiently on MR [14], [15]. In [14], the first order Taylor approximation of the logistic function is used to approximate the logistic score equations. This leads to "least-squares normal equations" for LR. The authors demonstrated that the least-squares approximation technique is easy to implement on MR and showed superior scalability and accuracy compared to gradient based methods. In [15], a parallelized version of SGD is proposed: the full SGD is solved by each MR Map function and the Reducer simply averages the results to obtain the global solution. The least-squares method for LR proposed in [14] is suitable for the purpose of this paper; however, there is no guarantee that the estimates will remain consistent for carrying out the Hausman specification test. The parallelized SGD approach is therefore implemented for the LR sampling scheme. To speed up the convergence of SGD, the least-squares solutions are used as initial estimates.

B. MapReduce Implementation

The appealing property of the single-pass covariance matrix computation is that its MR implementation is very straightforward. Each Map function computes the covariance matrix of the data assigned to it and the Reducer simply combines them by application of the single-pass formula. Selecting informative data points by the LDA sampling scheme proceeds in a similar way. Each Map function selects its most informative data points using Algorithm 1. The selected points, covariance matrix and mean vector are then passed to the Reducer, which applies the single-pass formula to compute the global covariance matrix of the reduced data from all mappers. Optionally, another run of Algorithm 1 can be carried out by the Reducer, but with the sample mean and covariance matrix set to the global values. This step is useful to filter out any uninformative data points that were selected to initialize the algorithm. The whole process is performed as a single-pass sampling scheme over the distributed data.
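The map/reduce structure just described can be sketched on a single machine as follows (a minimal Python/NumPy illustration, not the paper's Hadoop code): each "map" call emits the count, mean and scatter matrix of its block, and the "reduce" step merges them with the pairwise correction term of Proposition 1, so the raw data never has to be revisited.

```python
import numpy as np

def map_block_stats(X):
    """Map side: per-block count, mean and scatter matrix S_j."""
    n = X.shape[0]
    mean = X.mean(axis=0)
    centered = X - mean
    return n, mean, centered.T @ centered

def reduce_block_stats(blocks):
    """Reduce side: merge per-block statistics via Eq. (1) of Proposition 1."""
    counts = np.array([b[0] for b in blocks], dtype=float)
    means = np.array([b[1] for b in blocks])
    S = sum(b[2] for b in blocks)
    n = counts.sum()
    for i in range(len(blocks)):                 # pairwise correction term of Eq. (1)
        for j in range(i + 1, len(blocks)):
            d = means[i] - means[j]
            S += (counts[i] * counts[j] / n) * np.outer(d, d)
    grand_mean = (counts[:, None] * means).sum(axis=0) / n
    return n, grand_mean, S / (n - 1)            # global count, mean and covariance

# Sanity check against a direct computation on the concatenated data:
# blocks = [map_block_stats(Xj) for Xj in np.array_split(X, 4)]
# _, _, cov = reduce_block_stats(blocks)
# assert np.allclose(cov, np.cov(X, rowvar=False))
```

Because the reduce step needs only the per-block counts, means and scatter matrices, the combination can be done in a single MR job regardless of how many blocks the data is split into.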

The LR sampling scheme also proceeds in a similar fashion. Here each mapper solves the SGD for the parameter estimates and selects informative points by application of Algorithm 2. The Reducer aggregates all informative points and averages the LR parameters from all mappers, and optionally performs another sampling using the global parameter estimates. The algorithm equally proceeds as a single-pass MR job.

To decide which sampling scheme to adopt when there is concern about the normality assumption of LDA, the Hausman specification test can be used. Both sampling schemes can be run by the same MR program, and each mapper performs the Hausman test before querying data points.

VI. EXPERIMENTS

This section presents numerical results to demonstrate the correctness of the single-pass covariance matrix computation and the effectiveness of the LDA and LR sampling schemes. First, a series of synthetic binary classification data sets is used to assess the accuracy of the covariance matrix calculations. Then the two sampling schemes are used to scale down two real data sets for local processing. For comparison, SVM is also trained on the full and sampled data and tested on a large test set on a Hadoop cluster.

A. Correctness of the Single-Pass Algorithm

The accuracy of the single-pass formula was assessed by estimating the common covariance matrix for a series of two multivariate normal populations $N_p(\mu_j, \Sigma)$, $j = 1, 2$. The parameters of the two populations were generated as follows. The mean of the first population $\mu_1$ is uniformly generated from three intervals: $I_1 = [0.99, 1.01]$, $I_2 = [999.99, 1000]$ and $I_3 = [999999.99, 1000000]$, while the second population mean $\mu_2$ is taken as $\mu_1$ shifted up or down by a fixed offset. The covariance matrix $\Sigma$ is also randomly generated such that its diagonal entries are sampled from the intervals $I_i$, $i = 1, 2, 3$. The intervals $I_i$ are specially chosen so that the generated data points are large with very small deviations from each other. In this way, they will almost cancel each other out in the computation of the variance. This allows the study of numerical stability on large data sets with small variances. For each interval, four experiments were performed with the sample size varying as $n = 10\times10^3$, $100\times10^3$, $1000\times10^3$ and $3000\times10^3$. The proportion of observations falling in the first population was chosen as 0.3, while the dimension of the multivariate normal distribution was set to $p = 50$. The common population covariance matrix is estimated by the pooled sample covariance matrix $S_p$.

Since the true parameters $\mu_j$ and $\Sigma$ are known, it is easy to assess the numerical accuracy of the computations. The accuracy of the algorithms is measured using the Log Relative Error metric introduced in [16] and defined as

$\mathrm{LRE} = -\log_{10}\big(|\hat{\theta} - \theta| / |\theta|\big)$

where $\hat{\theta}$ is the computed value from an algorithm and $\theta$ is the true value. LRE is a measure of the number of correct significant digits that match between the two values. Higher values of LRE indicate that the algorithm is numerically more stable. The naive two-pass pooled covariance matrix of the full data sets is also computed for comparison. All experiments were performed using Hadoop version 1.0.4 on a small cluster of three machines: one with 8 cores and 6 GB RAM, and two with 6 cores and 8 GB RAM each.

Table I presents the LRE values obtained from the single-pass and the naive two-pass algorithms. The results indicate that the single-pass is slightly more stable than the naive two-pass. For sample sizes greater than $3000\times10^3$ it became too costly to compute the naive two-pass covariance matrix on a single machine.

TABLE I: ACCURACY OF SINGLE-PASS AND TWO-PASS ALGORITHMS

Range | Sample Size (x10^3) | Two-Pass (LRE) | Single-Pass (LRE)
I1    | 10   | 6.85 | 6.85
I1    | 100  | 3.08 | 3.
I1    | 1000 | 8.84 | 9.04
I1    | 3000 | 8.70 | 8.7
I2    | 10   | 6.90 | 6.90
I2    | 100  | .98  | .6
I2    | 1000 | 8.3  | 8.08
I2    | 3000 | 7.97 | 7.96
I3    | 10   | 7.03 | 7.03
I3    | 100  | .38  | .08
I3    | 1000 | 7.06 | 7.08
I3    | 3000 | 8.77 | 8.94

B. Effective Scale-Down Sampling Scheme

This section demonstrates the effectiveness of using uncertainties in LDA and LR as tools for down-sampling large data sets. Empirical results on two real large data sets are presented. The basic idea is to apply Algorithms 1 and 2 and the Hausman test on distributed data to select only the most informative examples for local learning. The algorithms also output the parameters of LDA and LR, from which local learning and distributed learning can be compared.

TABLE II: ACCURACY OF LOCAL MODELS VS DISTRIBUTED MODELS

Dataset | Model | Accuracy | Training Size | Training Time (min)
Flight  | LDA_d | 0.75 | 8,59,655  | 43.7
Flight  | LR_d  | 0.78 | 8,59,655  | 5.9
Flight  | SVM_d | -    | 8,59,655  | -
Flight  | LDA_l | 0.73 | 587,079   | 45.3
Flight  | LR_l  | 0.77 | 40,54     | 6.3
Flight  | SVM_l | 0.77 | 40,54     | 37.4
Linkage | LDA_d | 0.98 | 4,311,849 | 0.8
Linkage | LR_d  | 0.99 | 4,311,849 | 5.6
Linkage | SVM_d | 0.98 | 4,311,849 | 786.47
Linkage | LDA_l | 0.98 | 0,078     | 5.8
Linkage | LR_l  | 0.99 | 0,38      | 7.6
Linkage | SVM_l | 0.99 | 0,38      | 3.7

C. Data Sets

The first real data set is the Airline data set (http://stat-computing.org/dataexpo/2009/), consisting of more than 120 million flight arrival and departure records for all commercial flights within the USA from October 1987 to April 2008.

The classification problem formulated here is to predict flight delays. The data contains a continuous variable indicating the flight delay in minutes, where negative values mean the flight was early and positive values represent delayed flights. A binary response variable was created where values greater than 15 min were coded as 1 and zero otherwise. Flight details for four years, 2004-2007 (n = 8,59,655), were used for training and details for 2008 (n = 5,80,46) were reserved for testing.

The second data set is the Record Linkage Comparison Patterns data set from the UCI Machine Learning repository, consisting of 5,749,132 record pairs. The classification task is to decide from comparison patterns whether the underlying (medical) records belong to the same individual. The data set was split into 75% training and 25% testing.

Three classifiers, LDA, LR and SVM, were trained on the scaled down (local) data and their performance was assessed on a large distributed test set. The three classifiers were equally trained on the full distributed data using MR and tested on the same test set. Due to the large sample size of the flight data, it was computationally very expensive to train SVM on the full data, so only the local results are available. An attempt was also made to perform an SVM sampling scheme on MapReduce, i.e. to select only the support vectors for learning. However, this approach was again computationally too expensive and was dropped. The Gaussian kernel was used for SVM, with a 5-fold cross-validation procedure for parameter selection. To differentiate a local model from a distributed model, the subscripts l and d are used respectively.

The Airline and record linkage data sets contain binary and categorical variables with at least 3 levels, clearly indicating non-normal conditions. This was verified by the Hausman test, meaning that LR may be more robust for learning these data than LDA. However, the LDA results are still reported. With a 95% confidence interval, it was observed that the number of samples selected by both sampling schemes was usually less than 3% of the training size.

Table II shows the performance of the LDA, LR and SVM models trained locally using sampled data selected by the LDA and LR sampling schemes, compared to their performance on the full training set. The local LDA model is trained on data sampled by the LDA sampling scheme, and likewise for the local LR model. However, because the LDA normality assumption was violated for both data sets, the local SVM model was trained on data selected by the LR sampling scheme. The training time for the local models in the last column is the total time to perform the scale-down operation on MR and to carry out local training of the classifiers on a single machine.

The results in Table II illustrate the effectiveness of the sampling schemes in terms of both accuracy and time scalability. While it was not possible to train SVM on the large flight data set, using the LR sampling scheme it was possible to train SVM locally, giving almost the same predictive accuracy as LR trained on the full data set. Equally, for the record linkage data set, it took almost 13 hours to train the SVM classifier on MapReduce, while an even better accuracy was obtained with less than 0.5% of the training size in only about half an hour. Training the classifiers on less than 3% of the original training size resulted in almost the same accuracy as learning from the complete data. This result illustrates the effectiveness of the sampling schemes. Though LDA failed the Hausman specification test, its overall performance was nevertheless very good.

VII. CONCLUSION

This paper presented three major contributions to research in machine learning with large distributed data sets. First, a general single-pass formula for computing the covariance matrix of large distributed data sets was derived. Numerical results obtained from the application of the new formula showed slightly more stable and accurate results than the traditional two-pass algorithm. In addition, the presented formula does not require the pairwise incremental updating schemes of existing techniques. Second, two new simple, fast, parallel efficient, scalable and accurate sampling techniques based on uncertainties of the linear discriminant score and the logistic regression model were presented. These schemes are readily implemented on the MapReduce framework and make use of the single-pass covariance matrix formula. With these sampling schemes, large distributed data sets can be scaled down to manageable sizes for efficient local processing on a single machine. Numerical evaluation results demonstrated that the approach is accurate and cost effective, producing results that are as accurate as learning from the full distributed data set.

REFERENCES

[1] C. Chu, S. K. Kim, Y.-A. Lin, Y. Yu, G. Bradski, A. Y. Ng, and K. Olukotun, "Map-reduce for machine learning on multicore," Advances in Neural Information Processing Systems, vol. 19, pp. 281-288, 2007.
[2] R. Bekkerman, M. Bilenko, and J. Langford, Scaling up Machine Learning: Parallel and Distributed Approaches, Cambridge University Press, 2012.
[3] F. Zhang, J. Cao, X. Song, H. Cai, and C. Wu, "AMREF: An adaptive MapReduce framework for real time applications," in Proc. 9th International Conference on Grid and Cooperative Computing (GCC), 2010, pp. 157-162.
[4] B. Panda, J. S. Herbach, S. Basu, and R. J. Bayardo, "PLANET: Massively parallel learning of tree ensembles with MapReduce," Proceedings of the VLDB Endowment, vol. 2, no. 2, pp. 1426-1437, 2009.
[5] J. Bennett, R. Grout, P. Pébay, D. Roe, and D. Thompson, "Numerically stable, single-pass, parallel statistics algorithms," in Proc. IEEE International Conference on Cluster Computing and Workshops, 2009, pp. 1-8.
[6] A. W. Lo, "Logit versus discriminant analysis: A specification test and application to corporate bankruptcies," Journal of Econometrics, vol. 31, no. 2, pp. 151-178, 1986.
[7] B. Settles, "Active learning literature survey," University of Wisconsin, Madison, 2010.
[8] F. Critchley and I. Ford, "Interval estimation in discrimination: The multivariate normal equal covariance case," Biometrika, vol. 72, no. 1, pp. 109-116, 1985.
[9] A. Frank and A. Asuncion, "UCI machine learning repository," 2010.
[10] P. J. Bickel and B. Li, Mathematical Statistics, in Test, 1977.
[11] A. C. Rencher and G. B. Schaalje, Linear Models in Statistics, Wiley-Interscience, 2008.
[12] A. I. Schein and L. H. Ungar, "Active learning for logistic regression: An evaluation," Machine Learning, vol. 68, no. 3, pp. 235-265, 2007.
[13] B. Efron, "The efficiency of logistic regression compared to normal discriminant analysis," Journal of the American Statistical Association, vol. 70, no. 352, pp. 892-898, 1975.
[14] C. Ngufor and J. Wojtusiak, "Learning from large-scale distributed health data: An approximate logistic regression approach," in Proc. ICML '13: Role of Machine Learning in Transforming Healthcare, 2013.
[15] M. Zinkevich, M. Weimer, A. Smola, and L. Li, "Parallelized stochastic gradient descent," Advances in Neural Information Processing Systems, vol. 23, no. 3, pp. 1-9, 2010.
[16] B. D. McCullough, "Assessing the reliability of statistical software: Part I," The American Statistician, vol. 52, no. 4, pp. 358-366, 1998.

Che Ngufor received his B.Sc. in mathematics and computer sciences from the University of Dschang, Cameroon, in 2004, and his M.Sc. in mathematics from Tennessee Technological University, Cookeville, TN, in 2008. He is currently working on his Ph.D. in computational sciences and informatics at George Mason University, Fairfax, VA. Mr. Ngufor's main research areas include computational mathematics and statistics, medical informatics, distributed and parallel computing, big data and large-scale machine learning, and knowledge discovery.

Janusz Wojtusiak obtained his master's degree in computer science from Jagiellonian University in 2001, and his Ph.D. in computational sciences and informatics (concentration in computational intelligence and knowledge mining) from George Mason University in 2007. Currently Dr. Wojtusiak is an assistant professor in the George Mason University Department of Health Administration and Policy, and the coordinator of the GMU Health Informatics program. He also serves as the director of the GMU Machine Learning and Inference Laboratory, and the director of the GMU Center for Discovery Science and Health Informatics.