Learning from Large Distributed Data: A Scaling Down Sampling Scheme for Efficient Data Processing


International Journal of Machine Learning and Computing, Vol. 4, No. 3, June 2014

Che Ngufor and Janusz Wojtusiak

Abstract: Extracting information from a training data set for predictive inference is a fundamental task in data mining and machine learning. With the exponential growth in the amount of data being generated in the past few years, there is an urgent need to develop or adapt existing learning algorithms to efficiently learn from large data sets. This paper describes three scaling techniques enabling machine learning algorithms to learn from large distributed data sets. First, a general single-pass formula for computing the covariance matrix of large data sets using the MapReduce framework is derived. Second, two new efficient and accurate sampling schemes for scaling down large data sets for local processing are presented. The first sampling scheme uses the single-pass covariance formula to select the most informative data points based on uncertainties in the linear discriminant score. The second technique, on the other hand, selects informative points based on uncertainties in the logistic regression model. A series of numerical experiments demonstrates numerically stable results from the application of the formula and a fast, efficient, accurate and cost-effective sampling scheme.

Index Terms: Linear discriminant analysis, logistic regression, classification, sampling, MapReduce, single-pass.

Manuscript received September 9, 2013; revised December 2013. Che Ngufor is with Computational Sciences and Informatics at George Mason University, Fairfax, VA (e-mail: cngufor@masonlive.gmu.edu). Janusz Wojtusiak is with the George Mason University Department of Health Administration and Policy (e-mail: jwojt@mli.gmu.edu). DOI: 10.7763/IJMLC.2014.V4.425

I. INTRODUCTION

The basic machine learning task is that of extracting relevant information from a training set for predictive inference. Given today's ever and steadily growing data set sizes, the machine learning process must be able to effectively and efficiently handle large amounts of data. However, most existing machine learning algorithms were designed at a time when data set sizes were far smaller than current sizes. This has led to a significant amount of research in scaling methods, that is, designing algorithms to efficiently learn from large data sets. Two general approaches can be identified in this endeavor: scaling up and scaling down.

The first approach attempts to scale up machine learning algorithms, that is, develop new algorithms or modify existing ones so that they can better handle large data sets. There has been a rapid rise in research on methods for scaling up machine learning algorithms. This research has been aided in part by the fact that some machine learning algorithms can be readily deployed in parallel. For example, [1] showed that ten commonly used machine learning algorithms (logistic regression, naive Bayes, k-means clustering, support vector machines, etc.) can be easily written as MapReduce programs on multi-core machines. The other part can be attributed to the rapid evolution of hardware and programming architectures [2]. These new technologies are highly optimized for distributed computing in the sense that they are parallel efficient, reliable, fault tolerant and scalable. The Hadoop-MapReduce framework, for example, has been successfully applied to a broad range of real-world machine learning applications [3], [4].

Despite the appealing properties of scaling up machine learning algorithms, there are some obvious problems with this approach. First, scaling up an algorithm so that it can handle "large" data today does not necessarily mean it will handle "large" data tomorrow. Second, adapting current algorithms to cope with large data can be very challenging, and the scaled-up algorithm may end up being too complex and computationally too expensive to deploy. Finally, not all machine learning algorithms can be modified for parallel implementation. Coupled with the fact that there is no single algorithm that is uniformly the best in all applications, it is sometimes necessary to deploy many algorithms so that they can collaborate to improve accuracy.

The second approach to the scaling problem attempts to scale down large data sets to reasonable sizes that allow practical use of existing algorithms. The traditional approach is to take a random sample of the large data for learning. This naive approach runs the risk of learning from non-informative instances. For example, in imbalanced classification problems where one class is underrepresented, it is possible for random sampling to select only a few members of the minority class and a large number of the majority class, presenting yet another imbalanced learning problem.

Though not exhaustively explored, research in methods for scaling up algorithms has been on the rise in the past few years. Much less work has been done on finding efficient ways to scale down very large distributed data sets for learning. This work presents three key contributions to research in learning from large distributed data sets. In a first step, a new single-pass formula for computing the covariance matrix of large data sets on the MapReduce framework is derived. The formula can be seen as an efficient generalization of the pairwise and incremental update formulas presented in [5]. The single-pass covariance matrix estimation is then used in a second step to derive two new sampling schemes for scaling down a large distributed data set for efficient processing locally in memory. Precisely, uncertainties in the linear discriminant score and the logistic regression model are used to infer the informativeness of data points with respect to the decision boundary. The informative instances are selected through uncertainties of interval estimates of these statistics. Because real data are not always normal, linear discriminant analysis (LDA) may perform poorly when the normality assumption is violated. Thus the Hausman specification test [6] can be applied to test for normality and decide which sampling scheme to use.

The Hausman test requires consistent estimators of the asymptotic covariance matrices of the parameters of the models to be tested. The Fisher information matrix is easily computed from the output of logistic regression (LR) and provides a consistent estimator of the asymptotic covariance matrix of its parameters. On the other hand, consistent estimators of the asymptotic covariance matrix of the parameters of LDA are not readily available. To solve this problem, this paper also derives a consistent asymptotic covariance matrix of the parameters of LDA that is simple and easy to compute for large data sets using the single-pass covariance formula.

Empirical results show that, using these techniques, large classification data sets can be efficiently scaled down to manageable sizes permitting local processing in memory, while sacrificing little if any accuracy. There are also many potential benefits of the sampling approach: the selected samples can be used to gain more knowledge about the characteristics of the decision boundary, for visualization, and to train other algorithms that cannot easily be trained on MapReduce. A specific example illustrating the usefulness of this approach is presented, where a support vector machine (SVM) is trained using the scaled-down data and large-scale prediction is carried out on MapReduce.

For simplicity, this paper considers only binary classification problems and is organized as follows. Related work is presented in Section II. Section III presents a general single-pass formula for computing the covariance matrix of large data sets. Section IV briefly reviews LDA and LR and presents confidence interval estimates for the models; the derivation of the asymptotic covariance matrix of the parameters of LDA is also presented in this section. Implementation of the single-pass covariance matrix computation and the proposed sampling schemes on MapReduce is presented in Section V. Numerical experiments are reported in Section VI, while Section VII concludes the paper. Because of space limitations, the proofs of the propositions presented in the paper are omitted; the interested reader is referred to the authors for proofs and other detailed information.

II. RELATED WORK

Single-pass pairwise updating algorithms, especially for the variance, have been in existence for some time and are proven to be numerically stable [5]. In these techniques, the data set is always split into two subsets, and the variance/covariance of each subset is computed and combined. This leads to a tree-like structure where each node is the result of the combination of two statistics, each of which resulted from the combination of two other statistics, and so on. The drawback of these methods lies in the pairwise incremental updating step: at each round of the computation, the data is split into two and the formula is applied recursively. Special data structures and bookkeeping may be required to handle the tree-like structure of the algorithm. Moreover, it is not readily amenable to the MapReduce framework, as some communication between processes may be required. The single-pass formula presented in this paper avoids the tree-like combination structure and computes the covariance matrix in one step; a sketch of the pairwise baseline follows.
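To make the tree-like pairwise scheme concrete, the following minimal NumPy sketch (ours, not code from [5]; all names are illustrative) shows the two-block merge that such methods apply recursively:

```python
import numpy as np
from functools import reduce

def combine_pair(a, b):
    """Merge two block summaries (n, mean, S), where S is the scatter
    matrix sum((x - mean)(x - mean)^T), via the pairwise update of [5]."""
    (n_a, mean_a, S_a), (n_b, mean_b, S_b) = a, b
    n = n_a + n_b
    delta = mean_b - mean_a
    mean = mean_a + (n_b / n) * delta
    S = S_a + S_b + (n_a * n_b / n) * np.outer(delta, delta)
    return n, mean, S

# Pairwise methods fold the blocks two at a time, producing the
# tree/chain of intermediate results described above:
# n, mean, S = reduce(combine_pair, block_summaries)
```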
Much work has been done on scaling up existing machine learning algorithms so that they can handle large data sets and reduce execution time. While this approach is attractive, there are many situations where it is not possible to scale up a machine learning algorithm, or where the scale-up produces a complex or computationally more expensive algorithm. In such situations, scaling down by sampling is an attractive alternative. Various approaches have been followed to reduce training set sizes, such as dimension reduction, random sampling and active learning [7]. The simplest of these sampling techniques is random sampling, in which there is no control over the nature and type of instances that are added to the training set. It is possible to reduce the training size with random sampling and end up with a training set with no precise decision boundary. It is therefore important to guide random sampling so that the reduced training set always has classes separated by a precise decision boundary. Active learning, on the other hand, is a controlled sampling technique that has been shown in several applications to be reasonably successful in dealing with the problem of acquiring training data. Its main purpose is to minimize the number of data points requested for labeling, thereby reducing the cost of learning. In active learning using SVM, for example [7], training size reduction is achieved by training a classifier using only the support vectors or data points close to the SVM hyperplane. A major drawback of using SVM for sample size reduction is that training an SVM is at least quadratic in the training set size, so the computational cost for large data sets can be very significant. In addition, SVM is known to be very difficult to parallelize, especially on MapReduce where there is little or no communication between processes. The sampling schemes for sample size reduction presented in this work are designed to avoid or minimize these problems.

III. SINGLE-PASS PARALLEL STATISTICS

Accurate computation of statistics such as the mean, the variance/covariance matrix and the correlation coefficient/matrix is critical for the deployment of many machine learning applications. For example, the performance of discriminant analysis, principal component analysis, outlier detection, etc. depends on the accurate estimation of these statistics. However, the computation of these statistics, especially the variance or covariance matrix, can be very expensive for large data sets and potentially unstable when their magnitude is very small. The standard approach consists of calculating the sum of squared deviations from the mean. This involves passing through the data twice: first to compute the mean, and second to compute the deviations from the mean. This naive two-pass algorithm is known to be numerically stable, but may become too costly.

Single-Pass Covariance Matrix Formula

Consider a large distributed data set $\mathcal{D}$ that can be partitioned into $k$ finite blocks $\mathcal{D}_j$, $j = 1, \ldots, k$. Each $\mathcal{D}_j$ is typically a set of $n_j$ multivariate random samples $\{X_1, \ldots, X_{n_j}\}$, where each $X_i$ is a $p$-dimensional random vector $X_i = (X_{i1}, \ldots, X_{ip})^T$.

The scatter matrix of each block is given by

$$S_j = \sum_{i \in \mathcal{D}_j} (X_i - \bar{X}_j)(X_i - \bar{X}_j)^T,$$

where $\bar{X}_j = (1/n_j) \sum_{i \in \mathcal{D}_j} X_i$ is the sample mean of the block. The covariance matrix of each block is usually estimated using unbiased sample versions such as the minimum variance unbiased estimator [8], given by $\hat{\Sigma}_j = S_j/(n_j - 1)$. The main goal is to compute the covariance matrix of the complete data.

Proposition 1: The scatter matrix of the distributed data partitioned into $k$ disjoint data blocks $\{\mathcal{D}_j\}_{j=1}^{k}$ is given by

$$S = \sum_{j=1}^{k} S_j + \frac{1}{n} \sum_{(i,j)} n_i n_j (\bar{X}_i - \bar{X}_j)(\bar{X}_i - \bar{X}_j)^T, \qquad (1)$$

where $n = \sum_{j=1}^{k} n_j$, $\sum_{(i,j)}$ denotes the summation over the $\binom{k}{2}$ combinations of distinct pairs $(i, j)$ from $(1, \ldots, k)$, $\bar{X} = (1/n)\sum_{j=1}^{k} n_j \bar{X}_j$ is the overall sample mean, and $S_p$ is the corresponding pooled covariance matrix.
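As a minimal illustration (our sketch, not the paper's code; names are illustrative), Proposition 1 can be applied directly to per-block summaries, with no intermediate pairwise merges:

```python
import numpy as np
from itertools import combinations

def single_pass_scatter(blocks):
    """blocks: list of (n_j, mean_j, S_j) computed independently per block.
    Returns (n, mean, S) for the full data via Proposition 1."""
    n = sum(nj for nj, _, _ in blocks)
    mean = sum(nj * mj for nj, mj, _ in blocks) / n
    S = sum(Sj for _, _, Sj in blocks)
    # correction term: sum over distinct block pairs (i, j)
    for (ni, mi, _), (nj, mj, _) in combinations(blocks, 2):
        d = mi - mj
        S = S + (ni * nj / n) * np.outer(d, d)
    return n, mean, S

# sanity check against a direct computation
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
parts = np.array_split(X, 4)
summ = [(len(P), P.mean(0), (P - P.mean(0)).T @ (P - P.mean(0))) for P in parts]
n, mean, S = single_pass_scatter(summ)
assert np.allclose(S / (n - 1), np.cov(X, rowvar=False))
```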

IV. SCALING DOWN SAMPLING SCHEMES

In most classification problems, the classifier usually has great uncertainty in deciding the class memberships of instances on or very close to the decision boundary. These are interesting data points warranting further investigation. If the classifier can be taught how to accurately classify these points, then classifying the non-boundary points will be a trivial process. It is well known in the active learning community that training a classifier only on the most uncertain examples can drastically reduce data labeling cost, computational time and training set size without sacrificing the predictive accuracy of the classifier [7]. Borrowing this idea, this section presents two complementary sampling schemes, based on the linear discriminant score and the logistic regression model, to scale down a large distributed data set for local learning.

A. A Linear Discriminant Score Sampling Scheme

LDA aims at discriminating between two multivariate normal populations with a common covariance matrix, $H_1: \mathcal{N}_p(\mu_1, \Sigma)$ and $H_2: \mathcal{N}_p(\mu_2, \Sigma)$ say, on the basis of independent random samples of sizes $n_1$ and $n_2$. Fisher's linear discriminant rule assigns a new test example $x$ to population $H_1$ if the discriminant score $\delta(x)$ satisfies

$$\delta(x) = \delta_0 + \lambda^T x \geq 0,$$

where $\delta_0 = \log(\pi_1/\pi_2) - \frac{1}{2}(\mu_1 + \mu_2)^T \Sigma^{-1}(\mu_1 - \mu_2)$, $\lambda = \Sigma^{-1}(\mu_1 - \mu_2)$, and $\pi_i$ is the probability that $x$ belongs to population $H_i$. The decision boundary is defined by the points satisfying $\delta(x) = 0$. If the true value of $\delta(x)$ were known, these decision boundary points could easily be determined. However, $\delta(x)$ is not known and is estimated by the plug-in score

$$\hat{\delta}(x) = \tfrac{1}{2}\left(\hat{\Delta}_2(x) - \hat{\Delta}_1(x)\right) + \log(n_1/n_2), \qquad (3)$$

with $\hat{\Delta}_i(x) = \frac{n_1 + n_2 - p - 3}{n_1 + n_2 - 2}\,(x - \bar{X}_i)^T S_p^{-1} (x - \bar{X}_i) - p/n_i$.

If the probability distribution of $\hat{\delta}(x)$ is known, then one can easily find the likelihood that the true value of the score is within some specified range for each test point $x$. For example, if $\hat{\delta}(x)$ is assumed to be normally distributed, then a 95% confidence interval centered at 0 will correspond to data points close to the decision boundary. The LDA sampling scheme presented in this work is based on an approximate distribution derived in [8] under the assumption of equal priors, i.e. $\pi_1 = \pi_2$; the case of unequal priors can be adjusted accordingly. Letting $\Delta_i(x) = (x - \mu_i)^T \Sigma^{-1} (x - \mu_i)$ be the squared Mahalanobis distance between $x$ and the population center $\mu_i$, and $\delta(x) = \frac{1}{2}(\Delta_2(x) - \Delta_1(x))$, it is shown in [8] that $\hat{\delta}(x)$ is asymptotically normally distributed with mean $\delta(x)$ and variance $\mathrm{var}(\hat{\delta}(x))$ given in simplified closed form as a quadratic function of $\delta(x)$ and of the Mahalanobis distance between the population means, with coefficients depending only on $N = n_1 + n_2$, $M = n_2 - n_1$, $\tilde{n} = n_1 n_2$ and $p$ (4).

With the approximate distribution of $\hat{\delta}(x)$, uncertainties in classifying each data point $x$ can be estimated by computing confidence intervals about the mean value $\hat{\delta}(x)$. In particular, the $(1-\alpha)100\%$ confidence interval about the decision boundary $\delta(x) = 0$ is given by

$$-z_{\alpha/2} \leq \frac{\hat{\delta}(x)}{\sqrt{\mathrm{var}(\hat{\delta}(x))}} \leq z_{\alpha/2}. \qquad (5)$$

Fig. 1. Performance of sampling schemes.

Data points within this interval represent points for which the classifier has high uncertainty about class membership and are the most informative for learning. A large confidence interval will select more points, while a tight interval will return fewer points. Equation (5) therefore presents an efficient, principled query strategy for uncertainty-based active learning [7] that can be used for sample size reduction. In standard pool-based active learning parlance, a small labeled training set $\mathcal{D}_l$ and a large "unlabeled" pool $\mathcal{D}_u$ are assumed to be available. The task of the active learner is to use the information in $\mathcal{D}_l$ in a smart way to select the best query point $x^* \in \mathcal{D}_u$, ask the user or an oracle for its label, and then add it to $\mathcal{D}_l$. This process continues until the desired training set size or accuracy of the learner has been achieved. For the sampling schemes proposed in this work, however, both $\mathcal{D}_l$ and $\mathcal{D}_u$ are labeled training sets, and the idea is to select the most informative data points and their labels from $\mathcal{D}_u$. The proposed learning algorithm for sample size reduction is presented in Algorithm 1. The stopping criterion of the algorithm can be set equal to the required sample size of the reduced training set. Note that at each round of the algorithm, the selected points can be used to train a different classifier such as LR or SVM.

Fig. 1(a) shows a comparison of the classification performance of LR trained at each step of the sampling scheme using: data points selected by the LDA sampling scheme, the support vectors from an SVM training, and random sampling (Random). The forest covertype data from the UCI machine learning repository [9] was used for training. The classification problem represented by this data is to discriminate between 7 forest cover types using 54 cartographic variables. The data was converted to binary by combining the two majority forest cover types (Spruce-Fir with n = 211,840 and Lodgepole Pine with n = 283,301) into one class and the rest (n = 85,871) into the second class. The data was split into 75% training and 25% testing. The sampling schemes were stopped once 0.7% of the training set had been queried for learning. The results showed that, for the LDA sampling scheme, to achieve the reduction in error at which the algorithm approximately stabilizes, only about 0.3% of carefully selected training data points were needed, whereas the random sampling method used all 0.7% of the training data and still achieved only a 0.4% reduction in error. The performance of the LDA and SVM sampling schemes is very similar; however, LDA took by far a smaller time to converge compared to SVM. Precisely, in this example the time ratio of SVM to LDA was about 40, averaged over ten-fold cross-validation. To the best of the authors' knowledge, this is the first "active" learning technique for sample size reduction based on uncertainties in the linear discriminant score. A sketch of the selection rule follows.
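A minimal sketch of the resulting query rule (ours, not the paper's Algorithm 1; `var_delta` is assumed to be computed from the closed-form variance (4) of [8], which we do not reproduce here):

```python
import numpy as np
from scipy import stats

def lda_scores(X, mean1, mean2, S_pooled, n1, n2):
    """Plug-in linear discriminant score for each row of X."""
    lam = np.linalg.solve(S_pooled, mean1 - mean2)
    d0 = np.log(n1 / n2) - 0.5 * (mean1 + mean2) @ lam
    return X @ lam + d0

def lda_select(delta_hat, var_delta, alpha=0.05):
    """Eq. (5): keep points whose standardized score lies inside the
    (1 - alpha) normal interval around the boundary delta(x) = 0."""
    z = stats.norm.ppf(1 - alpha / 2)
    return np.flatnonzero(np.abs(delta_hat) <= z * np.sqrt(var_delta))
```

Widening `alpha` toward 0 selects more boundary points, mirroring the interval-width trade-off described above.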
B. A Logistic Regression Sampling Scheme

LR is another popular discriminative classification method that makes no assumption about the distribution of the independent variables. It has found wide use in machine learning, data mining and statistics. Let $\{(x_i, y_i)\}_{i=1}^{n}$ be a set of training examples, where the random variables $y_i \in \{0, 1\}$ are binary and the $x_i \in \mathbb{R}^p$ are $p$-dimensional feature vectors. The fundamental assumption of the LR model is that the log-odds, or "logit", transformation of the posterior probability $\pi(\beta; x) = \Pr(y = 1 \mid x; \beta)$ is linear, i.e.

$$\log\frac{\pi}{1-\pi} = \beta_0 + \beta^T x. \qquad (6)$$

By setting $x := (1, x)$ and $\beta := (\beta_0, \beta)$, the regularized log-likelihood function is given by

$$l(\beta) = \beta^T X^T Y - \sum_{i=1}^{n} \log\left(1 + \exp(x_i^T \beta)\right) - \frac{\gamma}{2}\beta^T\beta,$$

where $X$ is the design matrix, $Y$ the response vector and $\gamma \geq 0$ reflects the strength of regularization. Iterative methods such as gradient-based methods or the Newton-Raphson method are commonly used to compute the maximum likelihood estimate (MLE) $\hat{\beta}$ of $\beta$. For example, the one-step update of $L_2$-regularized stochastic gradient descent (SGD) is given by

$$\beta_{new} = \beta_{old} + \eta\left[(y_i - \pi(\beta_{old}; x_i))\,x_i - \gamma\,\beta_{old}\right], \qquad (7)$$

where $\eta > 0$ is the learning rate. Each iteration of SGD consists of choosing an example $(x_i, y_i)$ at random from the training set and updating the parameter vector $\beta$; a sketch follows.
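A minimal sketch of update (7) (ours; rows of `X` are assumed to carry a leading intercept column, and the step size is kept fixed for simplicity):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sgd_epoch(X, y, beta, eta=0.1, gamma=1e-4, seed=0):
    """One pass of L2-regularized SGD for logistic regression, eq. (7):
    beta <- beta + eta * ((y_i - pi_i) * x_i - gamma * beta)."""
    rng = np.random.default_rng(seed)
    for i in rng.permutation(len(y)):   # visit examples in random order
        pi = sigmoid(X[i] @ beta)
        beta = beta + eta * ((y[i] - pi) * X[i] - gamma * beta)
    return beta
```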

An important feature of the LR parameters is that the parameter estimates are consistent. It can be shown that the MLE of LR is asymptotically normally distributed, i.e. $\sqrt{n}(\hat{\beta} - \beta) \rightsquigarrow \mathcal{N}(0, I(\beta)^{-1})$, where $I(\beta) = X^T W X$ is the Fisher information matrix with $W = \mathrm{diag}\{\pi_i(1-\pi_i)\}$, $i = 1, \ldots, n$ (see, for example, [10], Section 6.5). Based on the distribution of $\hat{\beta}$, the asymptotic distribution of the MLE of the logistic function $\hat{\pi}$ can be derived by application of the delta method. Specifically, for any real-valued function $g$ with gradient $\nabla g(\beta) = \partial g(\beta)/\partial\beta$, one has $\sqrt{n}\left(g(\hat{\beta}) - g(\beta)\right) \rightsquigarrow \mathcal{N}\left(0, \nabla g(\beta)^T I(\beta)^{-1} \nabla g(\beta)\right)$. By taking $g(\beta) = \pi(\beta; x)$, it can be seen that $\hat{\pi} = \pi(\hat{\beta}; x)$ is asymptotically normally distributed with mean $\pi(\beta; x)$ and variance $\mathrm{Var}(\pi(\hat{\beta}; x)) = \nabla\pi(\beta; x)^T I(\beta)^{-1} \nabla\pi(\beta; x)$.

The decision boundary of the LR model is defined by $\beta_0 + \beta^T x = 0$, i.e. where $\pi(\beta; x) = 0.5$. This shows that points on the boundary have equal chances of being assigned to either population. Therefore, uncertainties in $\pi(\hat{\beta}; x)$ for the boundary points can be statistically captured by a $(1-\alpha)100\%$ confidence interval about 0.5. Confidence intervals for the parameter estimates of LR can be calculated from critical values of the Student t-distribution. Following a calculation similar to the one presented in [11] for the confidence interval of a linear function of the parameters of a linear regression model, one obtains for $\pi(\hat{\beta}; x)$ the statistic

$$t = \frac{\pi(\hat{\beta}; x) - \pi(\beta; x)}{\sqrt{\mathrm{Var}(\pi(\hat{\beta}; x))}} \sim t(n - p - 1),$$

which has a Student t-distribution with $n - p - 1$ degrees of freedom. Uncertainties about the true decision boundary $\pi(\beta; x) = 0.5$ can now be inferred through confidence intervals. In particular, the $(1-\alpha)100\%$ confidence interval about the decision boundary is given by

$$-t_{\alpha/2,\,n-p-1} \leq \frac{\pi(\hat{\beta}; x) - 0.5}{\sqrt{\mathrm{Var}(\pi(\hat{\beta}; x))}} \leq t_{\alpha/2,\,n-p-1}. \qquad (8)$$

A similar algorithm for sample size reduction using the logistic regression model is presented in Algorithm 2.

Fig. 1(b) shows the error curve for logistic regression trained at each step of the sampling scheme using: data points selected by the LR sampling scheme, the support vectors from an SVM training, and random sampling. The Waveform data set from the UCI machine learning repository was used for this example. There are a total of 5000 records in this data set with 40 attributes; 75% were used for training and 25% for testing. All the sampling schemes were stopped once 6% of the training set had been queried for learning. Clearly, the LR sampling scheme outperforms both the SVM and Random schemes. The authors of this paper are unaware of any previous use of uncertainties in the LR model as described in Algorithm 2 for sample size reduction or for active learning. A closely related but different approach is the variance-reduction active learning for LR presented in [12]. The idea of this approach is to select data points that minimize the mean squared error of the LR model. To do this, the mean squared error is decomposed into its bias and variance components. However, in the active learning step only the variance is minimized and the bias term is neglected. Frequently, however, the bias term constitutes a large portion of the model's error, so this variance-only active learning approach may not select all informative data points.
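The interval (8) translates directly into a selection rule; a minimal sketch (ours, not the paper's Algorithm 2; `X` again carries an intercept column, so the degrees of freedom are n minus the number of columns):

```python
import numpy as np
from scipy import stats

def lr_select(X, beta_hat, alpha=0.05):
    """Eq. (8): keep points whose estimated posterior probability is
    within the (1 - alpha) t-interval of the boundary value 0.5."""
    n, k = X.shape                         # k = p + 1 with intercept
    pi = 1.0 / (1.0 + np.exp(-(X @ beta_hat)))
    w = pi * (1.0 - pi)
    info = (X * w[:, None]).T @ X          # Fisher information X^T W X
    info_inv = np.linalg.inv(info)
    grad = w[:, None] * X                  # d pi / d beta = pi (1 - pi) x
    var = np.einsum('ij,jk,ik->i', grad, info_inv, grad)
    t = stats.t.ppf(1 - alpha / 2, df=n - k)
    return np.flatnonzero(np.abs(pi - 0.5) <= t * np.sqrt(var))
```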
C. The Hausman Specification Test

Under the normality assumption, the LDA and LR estimators are both known to be consistent, but LDA is asymptotically more efficient [13]. Thus the Hausman specification test can be applied to test these distributional assumptions by comparing the two estimators. This section briefly presents the derivation of the asymptotic covariance matrix of the LDA parameters required for the Hausman specification test. This is useful in deciding which of the sampling schemes presented in this paper is best to use. The LDA and LR models are very similar in form but significantly different in model assumptions. LR makes no assumptions about the distribution of the independent variables, while LDA explicitly assumes a normal distribution. Specifically, LR is applicable to a wider class of input distributions than the normal LDA. However, as illustrated in [13], when the normality assumption holds, LDA is more efficient than LR. Under non-normal conditions, LDA is generally inconsistent whereas LR maintains its consistency. Since LDA may perform poorly on non-normal data, an important criterion for choosing between LR and LDA is to check whether the assumption of normality is satisfied. Hausman's specification test is an asymptotic chi-square test based on the quadratic form obtained from the difference between an estimator that is consistent under the alternative hypothesis and an estimator that is efficient under the null hypothesis.

Under the null hypothesis of normality, both the LDA and LR estimators should be numerically close, implying that for large sample sizes the difference between them converges to zero. Under the alternative hypothesis of non-normality, however, the two estimators should differ. Naturally then, if the null hypothesis is true one should use the more efficient estimator, which is the LDA estimator, and the LR estimator otherwise.

Let $\hat{\Sigma}_{LDA}$ and $\hat{\Sigma}_{LR}$ be the estimated asymptotic covariance matrices of $\hat{\lambda}$ and $\hat{\beta}$, the estimators of LDA and LR respectively. Letting $Q = \hat{\lambda} - \hat{\beta}$, the Hausman chi-squared statistic [6] is defined by

$$H = Q^T\left(\hat{\Sigma}_{LDA} - \hat{\Sigma}_{LR}\right)^{-} Q \sim \chi^2_p, \qquad (9)$$

where $(\cdot)^{-}$ denotes the generalized inverse. During training, $\hat{\Sigma}_{LR}$ is readily available through the LR Fisher information matrix $I(\hat{\beta})$. Therefore, the main difficulty in computing the Hausman statistic is how to compute $\hat{\Sigma}_{LDA}$. Several methods have been proposed in the LDA literature to compute $\hat{\Sigma}_{LDA}$ [6], [13], but these methods are too complex to implement on MapReduce. In this work, a much simpler approach following Proposition 2 is derived, and the resulting covariance matrix can be easily computed by the single-pass formula.

Proposition 2: Given the training set $\{(x_i, y_i)\}_{i=1}^{n}$, where $y_i = j$, $j = 1, 2$, indicates the multivariate normal $\mathcal{N}_p(\mu_j, \Sigma)$ from which $x_i$ comes, the limiting distribution of the MLE of the LDA parameters is given by

$$\sqrt{n}\left(\hat{\lambda}^* - \lambda\right) \rightsquigarrow \mathcal{N}_p(0, \Gamma),$$

where $\hat{\lambda}^*$ is a bias-corrected rescaling of the plug-in estimator $\hat{\lambda}$, $\lambda = \Sigma^{-1}(\mu_1 - \mu_2)$, and

$$\Gamma = \frac{n^2}{n_1 n_2}\,\Sigma^{-1} + \Sigma^{-1}\left(\bar{\mu}\bar{\mu}^T + (\bar{\mu}^T \Sigma^{-1} \bar{\mu})\,\Sigma\right)\Sigma^{-1}, \qquad \bar{\mu} = \mu_1 - \mu_2.$$

Computing $\Gamma$ for large data sets requires only a straightforward application of the single-pass formula. Note that the constant term has been omitted for convenience.
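A minimal sketch of the test in eq. (9) (ours; the two covariance estimates are assumed already computed, e.g. $\hat{\Sigma}_{LR}$ from the Fisher information and $\hat{\Sigma}_{LDA}$ via Proposition 2 and the single-pass formula):

```python
import numpy as np
from scipy import stats

def hausman_test(lam_hat, beta_hat, cov_lda, cov_lr):
    """Eq. (9): H = Q^T (cov_lda - cov_lr)^- Q ~ chi^2_p under normality,
    with Q the difference of the two slope-coefficient estimates and a
    generalized (Moore-Penrose) inverse of the covariance difference."""
    Q = lam_hat - beta_hat
    H = float(Q @ np.linalg.pinv(cov_lda - cov_lr) @ Q)
    return H, stats.chi2.sf(H, df=len(Q))   # statistic, p-value
```

A small p-value rejects normality, pointing to the LR sampling scheme; otherwise the more efficient LDA scheme can be used.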
V. A DISTRIBUTED FRAMEWORK FOR MACHINE LEARNING

This section briefly describes the Hadoop-MapReduce framework and its application to machine learning. The section ends with the implementation of the single-pass covariance computation and the LDA and LR sampling schemes on MapReduce.

A. The Hadoop-MapReduce Framework

The MapReduce (MR) framework is based on a typical divide-and-conquer parallel computing strategy, and any application that can be designed as a divide-and-conquer application can generally be set up as an MR program. The core of MR consists of two functions: a Map and a Reduce function. The input to the Map function is a list of key-value pairs. Each key-value pair is processed separately by each Map function, which outputs key-value pairs. The output from each Map is then shuffled so that values corresponding to the same key are grouped together. The Reduce function aggregates the list of values corresponding to the same key using a user-specified aggregating function. In Hadoop, an open-source implementation of MR, all that is required of the user is to provide the Map and Reduce functions; data partitioning, distribution, replication, communication, synchronization and fault tolerance are handled by the Hadoop platform.

While highly scalable, the Hadoop-MapReduce framework suffers from one serious limitation for machine learning tasks: it does not support iterative procedures. However, a number of techniques have recently been proposed to train iterative machine learning algorithms like LR efficiently on MR [14], [15]. In [14], the first-order Taylor approximation of the logistic function is used to approximate the logistic score equations. This leads to "least-squares normal equations" for LR. The authors demonstrated that the least-squares approximation technique is easy to implement on MR and showed superior scalability and accuracy compared to gradient-based methods. In [15], a parallelized version of SGD is proposed: the full SGD is solved by each MR Map function, and the Reducer simply averages the results to obtain the global solution. The least-squares method for LR proposed in [14] is suitable for the purpose of this paper; however, there is no guarantee that the estimates will remain consistent for carrying out the Hausman specification test. The parallelized SGD approach is therefore implemented for the LR sampling scheme. To speed up convergence of SGD, the least-squares solutions are used as initial estimates.

B. MapReduce Implementation

The appealing property of the single-pass covariance matrix computation is that its MR implementation is very straightforward. Each Map function computes the covariance matrix of the data assigned to it, and the Reducer simply combines the results by application of the single-pass formula. Selecting informative data points by the LDA sampling scheme proceeds in a similar way. Each Map function selects its most informative data points using Algorithm 1. The selected points, covariance matrix and mean vector are then passed to the Reducer, which applies the single-pass formula to compute the global covariance matrix of the reduced data from all mappers. Optionally, another run of Algorithm 1 can be carried out by the Reducer, but with the sample mean and covariance matrix set to the global values. This step is useful to filter out any uninformative data points that were selected to initialize the algorithm. The whole process is performed as a single-pass sampling scheme over the distributed data.
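The structure of the covariance job can be sketched as plain map and reduce functions (ours, reusing the `single_pass_scatter` sketch from Section III; a real deployment would express this through Hadoop streaming or a similar API rather than in-memory generators):

```python
import numpy as np

def cov_map(block):
    """Map: summarize one data block (a NumPy array of rows) under a
    single common key so all summaries meet at one reducer."""
    m = block.mean(axis=0)
    S = (block - m).T @ (block - m)
    yield 'cov', (len(block), m, S)

def cov_reduce(key, summaries):
    """Reduce: one-shot combination of all block summaries by the
    single-pass formula of Proposition 1."""
    yield key, single_pass_scatter(list(summaries))
```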

The LR sampling scheme proceeds in a similar fashion. Each mapper solves the SGD for the parameter estimates and selects informative points by application of Algorithm 2. The Reducer aggregates all informative points, averages the LR parameters from all mappers, and optionally performs another sampling using the global parameter estimates. The algorithm equally proceeds as a single-pass MR job. To decide which sampling scheme to adopt when there is concern about the normality assumption of LDA, the Hausman specification test can be used: both sampling schemes can be run by the same MR program, and each mapper performs the Hausman test before querying data points.

VI. EXPERIMENTS

This section presents numerical results to demonstrate the correctness of the single-pass covariance matrix computation and the effectiveness of the LDA and LR sampling schemes. First, a series of synthetic binary classification data sets is used to assess the accuracy of the covariance matrix calculations. Then the two sampling schemes are used to scale down two real data sets for local processing. For comparison, SVM is also trained on the full and sampled data and tested on a large test set on a Hadoop cluster.

A. Correctness of the Single-Pass Algorithm

The accuracy of the single-pass formula was assessed by estimating the common covariance matrix of a series of two multivariate normal populations $\mathcal{N}_p(\mu_i, \Sigma)$, $i = 1, 2$. The parameters of the two populations were generated as follows. The mean of the first population, $\mu_1$, was uniformly generated from three intervals: $I_1 = [0.99, 1.0]$, $I_2 = [999.99, 1000]$ and $I_3$, with endpoints near $10^6$, while the second population mean was taken as a small fixed offset above or below $\mu_1$. The covariance matrix $\Sigma$ was also randomly generated such that its diagonal entries were sampled from the intervals $I_i$, $i = 1, 2, 3$. The intervals $I_i$ were specially chosen so that the generated data points are large with very small deviations from each other; in this way, they almost cancel each other out in the computation of the variance. This allows the study of numerical stability on large data sets with small variances. For each interval, four experiments were performed with the sample size varying as $n = 10 \times 10^3$, $100 \times 10^3$, …. The proportion of observations falling in the first population was chosen as 0.3, while the dimension of the multivariate normal distribution was set to $p = 50$. The common population covariance matrix was estimated by the pooled sample covariance matrix $S_p$. Since the true parameters $\mu_i$ and $\Sigma$ are known, it is easy to assess the numerical accuracy of the computations.

The accuracy of the algorithms is measured using the Log Relative Error metric introduced in [16] and defined as

$$\mathrm{LRE} = -\log_{10}\left(\frac{|\hat{\theta} - \theta|}{|\theta|}\right),$$

where $\hat{\theta}$ is the computed value from an algorithm and $\theta$ is the true value. LRE is a measure of the number of correct significant digits that match between the two values; higher values of LRE indicate that the algorithm is numerically more stable. The naive two-pass pooled covariance matrix of the full data sets was also computed for comparison. All experiments were performed using Hadoop version 1.0.4 on a small cluster of three machines: one 8-core, 6 GB RAM machine and two 6-core, 8 GB RAM machines.

TABLE I: ACCURACY (LRE) OF SINGLE-PASS AND TWO-PASS ALGORITHMS
Sample Size | Covariance Matrix Range | Two-Pass LRE | Single-Pass LRE

Table I presents the LRE values obtained from the single-pass and the naive two-pass algorithms. The results indicate that the single-pass algorithm is slightly more stable than the naive two-pass algorithm. For the largest sample sizes, it became too costly to compute the naive two-pass covariance matrix on a single machine. A small sketch of this accuracy check follows.
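A minimal sketch of the metric and of the kind of stability check described above (ours; the location, spread and sizes are illustrative stand-ins for the intervals $I_i$):

```python
import numpy as np

def lre(estimate, truth):
    """Log Relative Error of [16]: roughly the number of correct
    significant digits in `estimate` relative to `truth`."""
    return -np.log10(np.abs(estimate - truth) / np.abs(truth))

# large values with tiny spread: deviations nearly cancel numerically
rng = np.random.default_rng(1)
X = rng.normal(loc=1000.0, scale=0.01, size=(100_000, 3))
v = X.var(axis=0, ddof=1)        # two-pass style variance estimate
print(lre(v, 0.01 ** 2))         # correct digits per column
```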

B. Effective Scale-Down Sampling Scheme

This section demonstrates the effectiveness of using uncertainties in LDA and LR as tools for down-sampling large data sets. Empirical results on two real large data sets are presented. The basic idea is to apply Algorithms 1 and 2 and the Hausman test on the distributed data to select only the most informative examples for local learning. The algorithms also output the parameters of LDA and LR, from which local learning and distributed learning can be compared.

C. Data Sets

The first real data set is the Airline data set, consisting of more than 120 million flight arrival and departure records for all commercial flights within the USA from October 1987 to April 2008. The classification problem formulated here is to predict flight delays. The data contains a continuous variable indicating flight delay in minutes, where negative values mean the flight was early and positive values represent delayed flights. A binary response variable was created where values greater than 15 minutes were coded as 1 and zero otherwise. Flight details for four years (n = 8,59,655) were used for training, and details for 2008 (n = 5,80,46) were reserved for testing. The second data set is the Record Linkage Comparison Patterns data set from the UCI machine learning repository, consisting of 5,749,132 record pairs. The classification task is to decide from comparison patterns whether the underlying (medical) records belong to the same individual. The data set was split into 75% training and 25% testing.

Three classifiers, LDA, LR and SVM, were trained on the scaled-down (local) data and their performance assessed on a large distributed test set. The three classifiers were equally trained on the full distributed data using MR and tested on the same test set. Due to the large sample size of the flight data, it was computationally very expensive to train SVM on the full data, so only the local results are available. An attempt was also made to perform an SVM sampling scheme on MapReduce, i.e. to select only the support vectors for learning; however, the approach was again computationally too expensive and was dropped. The Gaussian kernel was used for SVM, with a 5-fold cross-validation procedure for parameter selection. To differentiate a local model from a distributed model, the subscripts l and d are used respectively.

TABLE II: ACCURACY OF LOCAL MODELS VS DISTRIBUTED MODELS
Dataset | Model | Accuracy | Training Size | Training Time (min)

The Airline and Record Linkage data sets contain binary and categorical variables with at least 3 levels, clearly indicating non-normal conditions. This was verified by Hausman's test, meaning that LR may be more robust for learning the data than LDA; LDA results are nevertheless still reported. With a 95% confidence interval, it was observed that the number of samples selected by both sampling schemes was usually less than 3% of the training size. Table II shows the performance of the LDA, LR and SVM models trained locally on sampled data selected by the LDA and LR sampling schemes, compared to their performance on the full training set. The local LDA model is trained on data sampled by the LDA sampling scheme, and likewise for the local LR model. However, because the LDA normality assumption was violated for both data sets, the local SVM model was trained on data selected by the LR sampling scheme. The training time for the local models is the total time to perform the scale-down operation on MR plus the time to carry out local training of the classifiers on a single machine.

The results in Table II illustrate the effectiveness of the sampling schemes in terms of both accuracy and time scalability. While it was not possible to train SVM on the large flight data set, using the LR sampling scheme it was possible to train SVM locally, giving almost the same predictive accuracy as LR trained on the full data set. Equally, for the record linkage data set, it took almost 3 hours to train the SVM classifier on MapReduce, while an even better accuracy was obtained with less than 0.5% of the training size in only about half an hour. Training the classifiers on less than 3% of the original training size resulted in almost the same accuracy as learning from the complete data. This result illustrates the effectiveness of the sampling schemes. Though LDA failed the Hausman specification test, its overall performance was nevertheless very good.
VII. CONCLUSION

This paper presented three major contributions to research in machine learning with large distributed data sets. First, a general single-pass formula for computing the covariance matrix of large distributed data sets was derived. Numerical results obtained from application of the new formula showed slightly more stable and accurate results than the traditional two-pass algorithm. In addition, the presented formula does not require the pairwise incremental updating schemes of existing techniques. Second, two new simple, fast, parallel-efficient, scalable and accurate sampling techniques based on uncertainties in the linear discriminant score and the logistic regression model were presented. These schemes are readily implemented on the MapReduce framework and make use of the single-pass covariance matrix formula. With these sampling schemes, large distributed data sets can be scaled down to manageable sizes for efficient local processing on a single machine. Numerical evaluation results demonstrated that the approach is accurate and cost-effective, producing results that are as accurate as learning from the full distributed data set.

REFERENCES
[1] C. Chu, S. K. Kim, Y.-A. Lin, Y. Yu, G. Bradski, A. Y. Ng, and K. Olukotun, "Map-reduce for machine learning on multicore," Advances in Neural Information Processing Systems, vol. 19, pp. 281-288, 2007.
[2] R. Bekkerman, M. Bilenko, and J. Langford, Scaling Up Machine Learning: Parallel and Distributed Approaches, Cambridge University Press, 2011.
[3] F. Zhang, J. Cao, X. Song, H. Cai, and C. Wu, "AMREF: An adaptive MapReduce framework for real time applications," in Proc. 2010 9th International Conference on Grid and Cooperative Computing (GCC), 2010.
[4] B. Panda, J. S. Herbach, S. Basu, and R. J. Bayardo, "PLANET: massively parallel learning of tree ensembles with MapReduce," Proceedings of the VLDB Endowment, vol. 2, no. 2, pp. 1426-1437, 2009.
[5] J. Bennett, R. Grout, P. Pébay, D. Roe, and D. Thompson, "Numerically stable, single-pass, parallel statistics algorithms," in Proc. IEEE International Conference on Cluster Computing and Workshops, 2009, pp. 1-8.
[6] A. W. Lo, "Logit versus discriminant analysis: A specification test and application to corporate bankruptcies," Journal of Econometrics, vol. 31, no. 2, pp. 151-178, 1986.
[7] B. Settles, "Active learning literature survey," University of Wisconsin, Madison, 2010.
[8] F. Critchley and I. Ford, "Interval estimation in discrimination: the multivariate normal equal covariance case," Biometrika, vol. 72, no. 1, pp. 109-116, 1985.
[9] A. Frank and A. Asuncion, "UCI machine learning repository," 2010.
[10] P. J. Bickel and B. Li, Mathematical Statistics, in Test, 1977.
[11] A. C. Rencher and G. B. Schaalje, Linear Models in Statistics, Wiley-Interscience, 2008.
[12] A. I. Schein and L. H. Ungar, "Active learning for logistic regression: an evaluation," Machine Learning, vol. 68, no. 3, pp. 235-265, 2007.
[13] B. Efron, "The efficiency of logistic regression compared to normal discriminant analysis," Journal of the American Statistical Association, vol. 70, no. 352, pp. 892-898, 1975.
[14] C. Ngufor and J. Wojtusiak, "Learning from large-scale distributed health data: An approximate logistic regression approach," in Proc. ICML 2013 Workshop: Role of Machine Learning in Transforming Healthcare, 2013.
[15] M. Zinkevich, M. Weimer, A. Smola, and L. Li, "Parallelized stochastic gradient descent," Advances in Neural Information Processing Systems, vol. 23, pp. 1-9, 2010.
[16] B. D. McCullough, "Assessing the reliability of statistical software: Part I," The American Statistician, vol. 52, no. 4, pp. 358-366, 1998.

Che Ngufor received his B.Sc. in mathematics and computer science from the University of Dschang, Cameroon, in 2004 and his M.Sc. in mathematics from Tennessee Technological University, Cookeville, TN, in 2008. He is currently working on his Ph.D. in computational sciences and informatics at George Mason University, Fairfax, VA. Mr. Ngufor's main research areas include computational mathematics and statistics, medical informatics, distributed and parallel computing, big data and large-scale machine learning, and knowledge discovery.

Janusz Wojtusiak obtained his master's degree in computer science from Jagiellonian University in 2001, and his Ph.D. in computational sciences and informatics (concentration in computational intelligence and knowledge mining) from George Mason University in 2007. Currently Dr. Wojtusiak is an assistant professor in the George Mason University Department of Health Administration and Policy, and the coordinator of the GMU Health Informatics program. He also serves as the director of the GMU Machine Learning and Inference Laboratory and the director of the GMU Center for Discovery Science and Health Informatics.


An Evaluation of the Extended Logistic, Simple Logistic, and Gompertz Models for Forecasting Short Lifecycle Products and Services An Evaluaton of the Extended Logstc, Smple Logstc, and Gompertz Models for Forecastng Short Lfecycle Products and Servces Charles V. Trappey a,1, Hsn-yng Wu b a Professor (Management Scence), Natonal Chao

More information

Mining Multiple Large Data Sources

Mining Multiple Large Data Sources The Internatonal Arab Journal of Informaton Technology, Vol. 7, No. 3, July 2 24 Mnng Multple Large Data Sources Anmesh Adhkar, Pralhad Ramachandrarao 2, Bhanu Prasad 3, and Jhml Adhkar 4 Department of

More information

1 Example 1: Axis-aligned rectangles

1 Example 1: Axis-aligned rectangles COS 511: Theoretcal Machne Learnng Lecturer: Rob Schapre Lecture # 6 Scrbe: Aaron Schld February 21, 2013 Last class, we dscussed an analogue for Occam s Razor for nfnte hypothess spaces that, n conjuncton

More information

A Novel Methodology of Working Capital Management for Large. Public Constructions by Using Fuzzy S-curve Regression

A Novel Methodology of Working Capital Management for Large. Public Constructions by Using Fuzzy S-curve Regression Novel Methodology of Workng Captal Management for Large Publc Constructons by Usng Fuzzy S-curve Regresson Cheng-Wu Chen, Morrs H. L. Wang and Tng-Ya Hseh Department of Cvl Engneerng, Natonal Central Unversty,

More information

How To Calculate The Accountng Perod Of Nequalty

How To Calculate The Accountng Perod Of Nequalty Inequalty and The Accountng Perod Quentn Wodon and Shlomo Ytzha World Ban and Hebrew Unversty September Abstract Income nequalty typcally declnes wth the length of tme taen nto account for measurement.

More information

A DYNAMIC CRASHING METHOD FOR PROJECT MANAGEMENT USING SIMULATION-BASED OPTIMIZATION. Michael E. Kuhl Radhamés A. Tolentino-Peña

A DYNAMIC CRASHING METHOD FOR PROJECT MANAGEMENT USING SIMULATION-BASED OPTIMIZATION. Michael E. Kuhl Radhamés A. Tolentino-Peña Proceedngs of the 2008 Wnter Smulaton Conference S. J. Mason, R. R. Hll, L. Mönch, O. Rose, T. Jefferson, J. W. Fowler eds. A DYNAMIC CRASHING METHOD FOR PROJECT MANAGEMENT USING SIMULATION-BASED OPTIMIZATION

More information

Evaluating credit risk models: A critique and a new proposal

Evaluating credit risk models: A critique and a new proposal Evaluatng credt rsk models: A crtque and a new proposal Hergen Frerchs* Gunter Löffler Unversty of Frankfurt (Man) February 14, 2001 Abstract Evaluatng the qualty of credt portfolo rsk models s an mportant

More information

SIMPLE LINEAR CORRELATION

SIMPLE LINEAR CORRELATION SIMPLE LINEAR CORRELATION Smple lnear correlaton s a measure of the degree to whch two varables vary together, or a measure of the ntensty of the assocaton between two varables. Correlaton often s abused.

More information

Characterization of Assembly. Variation Analysis Methods. A Thesis. Presented to the. Department of Mechanical Engineering. Brigham Young University

Characterization of Assembly. Variation Analysis Methods. A Thesis. Presented to the. Department of Mechanical Engineering. Brigham Young University Characterzaton of Assembly Varaton Analyss Methods A Thess Presented to the Department of Mechancal Engneerng Brgham Young Unversty In Partal Fulfllment of the Requrements for the Degree Master of Scence

More information

New Approaches to Support Vector Ordinal Regression

New Approaches to Support Vector Ordinal Regression New Approaches to Support Vector Ordnal Regresson We Chu chuwe@gatsby.ucl.ac.uk Gatsby Computatonal Neuroscence Unt, Unversty College London, London, WCN 3AR, UK S. Sathya Keerth selvarak@yahoo-nc.com

More information

A Fast Incremental Spectral Clustering for Large Data Sets

A Fast Incremental Spectral Clustering for Large Data Sets 2011 12th Internatonal Conference on Parallel and Dstrbuted Computng, Applcatons and Technologes A Fast Incremental Spectral Clusterng for Large Data Sets Tengteng Kong 1,YeTan 1, Hong Shen 1,2 1 School

More information

Brigid Mullany, Ph.D University of North Carolina, Charlotte

Brigid Mullany, Ph.D University of North Carolina, Charlotte Evaluaton And Comparson Of The Dfferent Standards Used To Defne The Postonal Accuracy And Repeatablty Of Numercally Controlled Machnng Center Axes Brgd Mullany, Ph.D Unversty of North Carolna, Charlotte

More information

Chapter XX More advanced approaches to the analysis of survey data. Gad Nathan Hebrew University Jerusalem, Israel. Abstract

Chapter XX More advanced approaches to the analysis of survey data. Gad Nathan Hebrew University Jerusalem, Israel. Abstract Household Sample Surveys n Developng and Transton Countres Chapter More advanced approaches to the analyss of survey data Gad Nathan Hebrew Unversty Jerusalem, Israel Abstract In the present chapter, we

More information

v a 1 b 1 i, a 2 b 2 i,..., a n b n i.

v a 1 b 1 i, a 2 b 2 i,..., a n b n i. SECTION 8.4 COMPLEX VECTOR SPACES AND INNER PRODUCTS 455 8.4 COMPLEX VECTOR SPACES AND INNER PRODUCTS All the vector spaces we have studed thus far n the text are real vector spaces snce the scalars are

More information

THE APPLICATION OF DATA MINING TECHNIQUES AND MULTIPLE CLASSIFIERS TO MARKETING DECISION

THE APPLICATION OF DATA MINING TECHNIQUES AND MULTIPLE CLASSIFIERS TO MARKETING DECISION Internatonal Journal of Electronc Busness Management, Vol. 3, No. 4, pp. 30-30 (2005) 30 THE APPLICATION OF DATA MINING TECHNIQUES AND MULTIPLE CLASSIFIERS TO MARKETING DECISION Yu-Mn Chang *, Yu-Cheh

More information

Binomial Link Functions. Lori Murray, Phil Munz

Binomial Link Functions. Lori Murray, Phil Munz Bnomal Lnk Functons Lor Murray, Phl Munz Bnomal Lnk Functons Logt Lnk functon: ( p) p ln 1 p Probt Lnk functon: ( p) 1 ( p) Complentary Log Log functon: ( p) ln( ln(1 p)) Motvatng Example A researcher

More information

SPEE Recommended Evaluation Practice #6 Definition of Decline Curve Parameters Background:

SPEE Recommended Evaluation Practice #6 Definition of Decline Curve Parameters Background: SPEE Recommended Evaluaton Practce #6 efnton of eclne Curve Parameters Background: The producton hstores of ol and gas wells can be analyzed to estmate reserves and future ol and gas producton rates and

More information

"Research Note" APPLICATION OF CHARGE SIMULATION METHOD TO ELECTRIC FIELD CALCULATION IN THE POWER CABLES *

Research Note APPLICATION OF CHARGE SIMULATION METHOD TO ELECTRIC FIELD CALCULATION IN THE POWER CABLES * Iranan Journal of Scence & Technology, Transacton B, Engneerng, ol. 30, No. B6, 789-794 rnted n The Islamc Republc of Iran, 006 Shraz Unversty "Research Note" ALICATION OF CHARGE SIMULATION METHOD TO ELECTRIC

More information

Descriptive Models. Cluster Analysis. Example. General Applications of Clustering. Examples of Clustering Applications

Descriptive Models. Cluster Analysis. Example. General Applications of Clustering. Examples of Clustering Applications CMSC828G Prncples of Data Mnng Lecture #9 Today s Readng: HMS, chapter 9 Today s Lecture: Descrptve Modelng Clusterng Algorthms Descrptve Models model presents the man features of the data, a global summary

More information

Georey E. Hinton. University oftoronto. Email: zoubin@cs.toronto.edu. Technical Report CRG-TR-96-1. May 21, 1996 (revised Feb 27, 1997) Abstract

Georey E. Hinton. University oftoronto. Email: zoubin@cs.toronto.edu. Technical Report CRG-TR-96-1. May 21, 1996 (revised Feb 27, 1997) Abstract The EM Algorthm for Mxtures of Factor Analyzers Zoubn Ghahraman Georey E. Hnton Department of Computer Scence Unversty oftoronto 6 Kng's College Road Toronto, Canada M5S A4 Emal: zoubn@cs.toronto.edu Techncal

More information

Robust Design of Public Storage Warehouses. Yeming (Yale) Gong EMLYON Business School

Robust Design of Public Storage Warehouses. Yeming (Yale) Gong EMLYON Business School Robust Desgn of Publc Storage Warehouses Yemng (Yale) Gong EMLYON Busness School Rene de Koster Rotterdam school of management, Erasmus Unversty Abstract We apply robust optmzaton and revenue management

More information

Vision Mouse. Saurabh Sarkar a* University of Cincinnati, Cincinnati, USA ABSTRACT 1. INTRODUCTION

Vision Mouse. Saurabh Sarkar a* University of Cincinnati, Cincinnati, USA ABSTRACT 1. INTRODUCTION Vson Mouse Saurabh Sarkar a* a Unversty of Cncnnat, Cncnnat, USA ABSTRACT The report dscusses a vson based approach towards trackng of eyes and fngers. The report descrbes the process of locatng the possble

More information

APPLICATION OF PROBE DATA COLLECTED VIA INFRARED BEACONS TO TRAFFIC MANEGEMENT

APPLICATION OF PROBE DATA COLLECTED VIA INFRARED BEACONS TO TRAFFIC MANEGEMENT APPLICATION OF PROBE DATA COLLECTED VIA INFRARED BEACONS TO TRAFFIC MANEGEMENT Toshhko Oda (1), Kochro Iwaoka (2) (1), (2) Infrastructure Systems Busness Unt, Panasonc System Networks Co., Ltd. Saedo-cho

More information

Data Visualization by Pairwise Distortion Minimization

Data Visualization by Pairwise Distortion Minimization Communcatons n Statstcs, Theory and Methods 34 (6), 005 Data Vsualzaton by Parwse Dstorton Mnmzaton By Marc Sobel, and Longn Jan Lateck* Department of Statstcs and Department of Computer and Informaton

More information

Bayesian Network Based Causal Relationship Identification and Funding Success Prediction in P2P Lending

Bayesian Network Based Causal Relationship Identification and Funding Success Prediction in P2P Lending Proceedngs of 2012 4th Internatonal Conference on Machne Learnng and Computng IPCSIT vol. 25 (2012) (2012) IACSIT Press, Sngapore Bayesan Network Based Causal Relatonshp Identfcaton and Fundng Success

More information

Gender Classification for Real-Time Audience Analysis System

Gender Classification for Real-Time Audience Analysis System Gender Classfcaton for Real-Tme Audence Analyss System Vladmr Khryashchev, Lev Shmaglt, Andrey Shemyakov, Anton Lebedev Yaroslavl State Unversty Yaroslavl, Russa vhr@yandex.ru, shmaglt_lev@yahoo.com, andrey.shemakov@gmal.com,

More information

A Replication-Based and Fault Tolerant Allocation Algorithm for Cloud Computing

A Replication-Based and Fault Tolerant Allocation Algorithm for Cloud Computing A Replcaton-Based and Fault Tolerant Allocaton Algorthm for Cloud Computng Tork Altameem Dept of Computer Scence, RCC, Kng Saud Unversty, PO Box: 28095 11437 Ryadh-Saud Araba Abstract The very large nfrastructure

More information

How To Know The Components Of Mean Squared Error Of Herarchcal Estmator S

How To Know The Components Of Mean Squared Error Of Herarchcal Estmator S S C H E D A E I N F O R M A T I C A E VOLUME 0 0 On Mean Squared Error of Herarchcal Estmator Stans law Brodowsk Faculty of Physcs, Astronomy, and Appled Computer Scence, Jagellonan Unversty, Reymonta

More information

Improved SVM in Cloud Computing Information Mining

Improved SVM in Cloud Computing Information Mining Internatonal Journal of Grd Dstrbuton Computng Vol.8, No.1 (015), pp.33-40 http://dx.do.org/10.1457/jgdc.015.8.1.04 Improved n Cloud Computng Informaton Mnng Lvshuhong (ZhengDe polytechnc college JangSu

More information

Abstract. Clustering ensembles have emerged as a powerful method for improving both the

Abstract. Clustering ensembles have emerged as a powerful method for improving both the Clusterng Ensembles: {topchyal, Models jan, of punch}@cse.msu.edu Consensus and Weak Parttons * Alexander Topchy, Anl K. Jan, and Wllam Punch Department of Computer Scence and Engneerng, Mchgan State Unversty

More information

PAS: A Packet Accounting System to Limit the Effects of DoS & DDoS. Debish Fesehaye & Klara Naherstedt University of Illinois-Urbana Champaign

PAS: A Packet Accounting System to Limit the Effects of DoS & DDoS. Debish Fesehaye & Klara Naherstedt University of Illinois-Urbana Champaign PAS: A Packet Accountng System to Lmt the Effects of DoS & DDoS Debsh Fesehaye & Klara Naherstedt Unversty of Illnos-Urbana Champagn DoS and DDoS DDoS attacks are ncreasng threats to our dgtal world. Exstng

More information

Simple Interest Loans (Section 5.1) :

Simple Interest Loans (Section 5.1) : Chapter 5 Fnance The frst part of ths revew wll explan the dfferent nterest and nvestment equatons you learned n secton 5.1 through 5.4 of your textbook and go through several examples. The second part

More information

Traffic-light a stress test for life insurance provisions

Traffic-light a stress test for life insurance provisions MEMORANDUM Date 006-09-7 Authors Bengt von Bahr, Göran Ronge Traffc-lght a stress test for lfe nsurance provsons Fnansnspetonen P.O. Box 6750 SE-113 85 Stocholm [Sveavägen 167] Tel +46 8 787 80 00 Fax

More information

Project Networks With Mixed-Time Constraints

Project Networks With Mixed-Time Constraints Project Networs Wth Mxed-Tme Constrants L Caccetta and B Wattananon Western Australan Centre of Excellence n Industral Optmsaton (WACEIO) Curtn Unversty of Technology GPO Box U1987 Perth Western Australa

More information

Sample Design in TIMSS and PIRLS

Sample Design in TIMSS and PIRLS Sample Desgn n TIMSS and PIRLS Introducton Marc Joncas Perre Foy TIMSS and PIRLS are desgned to provde vald and relable measurement of trends n student achevement n countres around the world, whle keepng

More information

1. Measuring association using correlation and regression

1. Measuring association using correlation and regression How to measure assocaton I: Correlaton. 1. Measurng assocaton usng correlaton and regresson We often would lke to know how one varable, such as a mother's weght, s related to another varable, such as a

More information

Diagnostic Tests of Cross Section Independence for Nonlinear Panel Data Models

Diagnostic Tests of Cross Section Independence for Nonlinear Panel Data Models DISCUSSION PAPER SERIES IZA DP No. 2756 Dagnostc ests of Cross Secton Independence for Nonlnear Panel Data Models Cheng Hsao M. Hashem Pesaran Andreas Pck Aprl 2007 Forschungsnsttut zur Zukunft der Arbet

More information

Analysis of Premium Liabilities for Australian Lines of Business

Analysis of Premium Liabilities for Australian Lines of Business Summary of Analyss of Premum Labltes for Australan Lnes of Busness Emly Tao Honours Research Paper, The Unversty of Melbourne Emly Tao Acknowledgements I am grateful to the Australan Prudental Regulaton

More information

Searching for Interacting Features for Spam Filtering

Searching for Interacting Features for Spam Filtering Searchng for Interactng Features for Spam Flterng Chuanlang Chen 1, Yun-Chao Gong 2, Rongfang Be 1,, and X. Z. Gao 3 1 Department of Computer Scence, Bejng Normal Unversty, Bejng 100875, Chna 2 Software

More information

Detecting Credit Card Fraud using Periodic Features

Detecting Credit Card Fraud using Periodic Features Detectng Credt Card Fraud usng Perodc Features Alejandro Correa Bahnsen, Djamla Aouada, Aleksandar Stojanovc and Björn Ottersten Interdscplnary Centre for Securty, Relablty and Trust Unversty of Luxembourg,

More information

Improved Mining of Software Complexity Data on Evolutionary Filtered Training Sets

Improved Mining of Software Complexity Data on Evolutionary Filtered Training Sets Improved Mnng of Software Complexty Data on Evolutonary Fltered Tranng Sets VILI PODGORELEC Insttute of Informatcs, FERI Unversty of Marbor Smetanova ulca 17, SI-2000 Marbor SLOVENIA vl.podgorelec@un-mb.s

More information

Number of Levels Cumulative Annual operating Income per year construction costs costs ($) ($) ($) 1 600,000 35,000 100,000 2 2,200,000 60,000 350,000

Number of Levels Cumulative Annual operating Income per year construction costs costs ($) ($) ($) 1 600,000 35,000 100,000 2 2,200,000 60,000 350,000 Problem Set 5 Solutons 1 MIT s consderng buldng a new car park near Kendall Square. o unversty funds are avalable (overhead rates are under pressure and the new faclty would have to pay for tself from

More information

A Hierarchical Anomaly Network Intrusion Detection System using Neural Network Classification

A Hierarchical Anomaly Network Intrusion Detection System using Neural Network Classification IDC IDC A Herarchcal Anomaly Network Intruson Detecton System usng Neural Network Classfcaton ZHENG ZHANG, JUN LI, C. N. MANIKOPOULOS, JAY JORGENSON and JOSE UCLES ECE Department, New Jersey Inst. of Tech.,

More information

An Algorithm for Data-Driven Bandwidth Selection

An Algorithm for Data-Driven Bandwidth Selection IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. 25, NO. 2, FEBRUARY 2003 An Algorthm for Data-Drven Bandwdth Selecton Dorn Comancu, Member, IEEE Abstract The analyss of a feature space

More information

Dynamic Resource Allocation for MapReduce with Partitioning Skew

Dynamic Resource Allocation for MapReduce with Partitioning Skew Ths artcle has been accepted for publcaton n a future ssue of ths journal, but has not been fully edted. Content may change pror to fnal publcaton. Ctaton nformaton: DOI 1.119/TC.216.253286, IEEE Transactons

More information

A Dynamic Load Balancing for Massive Multiplayer Online Game Server

A Dynamic Load Balancing for Massive Multiplayer Online Game Server A Dynamc Load Balancng for Massve Multplayer Onlne Game Server Jungyoul Lm, Jaeyong Chung, Jnryong Km and Kwanghyun Shm Dgtal Content Research Dvson Electroncs and Telecommuncatons Research Insttute Daejeon,

More information

Implementation of Deutsch's Algorithm Using Mathcad

Implementation of Deutsch's Algorithm Using Mathcad Implementaton of Deutsch's Algorthm Usng Mathcad Frank Roux The followng s a Mathcad mplementaton of Davd Deutsch's quantum computer prototype as presented on pages - n "Machnes, Logc and Quantum Physcs"

More information

Data Broadcast on a Multi-System Heterogeneous Overlayed Wireless Network *

Data Broadcast on a Multi-System Heterogeneous Overlayed Wireless Network * JOURNAL OF INFORMATION SCIENCE AND ENGINEERING 24, 819-840 (2008) Data Broadcast on a Mult-System Heterogeneous Overlayed Wreless Network * Department of Computer Scence Natonal Chao Tung Unversty Hsnchu,

More information