Distributed Column Subset Selection on MapReduce


Ahmed K. Farahat, Ahmed Elgohary, Ali Ghodsi, Mohamed S. Kamel
University of Waterloo, Waterloo, Ontario, Canada N2L 3G1
Email: {afarahat, aelgohary, aghodsb,

Abstract: Given a very large data set distributed over a cluster of several nodes, this paper addresses the problem of selecting a few data instances that best represent the entire data set. The solution to this problem is of crucial importance in the big data era as it enables data analysts to understand the insights of the data and explore its hidden structure. The selected instances can also be used for data preprocessing tasks such as learning a low-dimensional embedding of the data points or computing a low-rank approximation of the corresponding matrix. The paper first formulates the problem as the selection of a few representative columns from a matrix whose columns are massively distributed, and it then proposes a MapReduce algorithm for selecting those representatives. The algorithm first learns a concise representation of all columns using random projection, and it then solves a generalized column subset selection problem at each machine in which a subset of columns is selected from the sub-matrix on that machine such that the reconstruction error of the concise representation is minimized. The paper then demonstrates the effectiveness and efficiency of the proposed algorithm through an empirical evaluation on benchmark data sets.

Keywords: Column Subset Selection; Greedy Algorithms; Distributed Computing; Big Data; MapReduce

I. INTRODUCTION

Recent years have witnessed the rise of the big data era in computing and storage systems. With the great advances in information and communication technology, hundreds of petabytes of data are generated, transferred, processed and stored every day. The availability of this overwhelming amount of structured and unstructured data creates an acute need to develop fast and accurate algorithms to discover useful information that is hidden in the big data.
One of the crucial problems in the big data era is the ability to represent the data and its underlying information in a succinct format. Although different algorithms for clustering and dimension reduction can be used to summarize big data, these algorithms tend to learn representatives whose meanings are difficult to interpret. For instance, the traditional clustering algorithms such as k-means [1] tend to produce centroids which encode information about thousands of data instances. The meanings of these centroids are hard to interpret. Even clustering methods that use data instances as prototypes, such as k-medoid [2], learn only one representative for each cluster, which is usually not enough to capture the insights of the data instances in that cluster. In addition, using medoids as representatives implicitly assumes that the data points are distributed as clusters and that the number of those clusters is known ahead of time. This assumption is not true for many data sets.

On the other hand, traditional dimension reduction algorithms such as Latent Semantic Analysis (LSA) [3] tend to learn a few latent concepts in the feature space. Each of these concepts is represented by a dense vector which combines thousands of features with positive and negative weights. This makes it difficult for the data analyst to understand the meaning of these concepts. Even if the goal of representative selection is to learn a low-dimensional embedding of data instances, learning dimensions whose meanings are easy to interpret allows the understanding of the results of the data mining algorithms, such as understanding the meanings of data clusters in the low-dimensional space.

The acute need to summarize big data to a format that appeals to data analysts motivates the development of different algorithms to directly select a few representative data instances and/or features. This problem can be generally formulated as the selection of a subset of columns from a data matrix, which is formally known as the Column Subset Selection (CSS) problem [4], [5], [6].
Although many algorithms have been proposed for tackling the CSS problem, most of these algorithms focus on randomly selecting a subset of columns with the goal of using these columns to obtain a low-rank approximation of the data matrix. In this case, these algorithms tend to select a relatively large number of columns. When the goal is to select very few columns to be directly presented to a data analyst or indirectly used to interpret the results of other algorithms, the randomized CSS methods are not going to produce a meaningful subset of columns. On the other hand, deterministic algorithms for CSS, although more accurate, do not scale to work on big matrices with massively distributed columns.

This paper addresses the aforementioned problem by presenting a fast and accurate algorithm for selecting very few columns from a big data matrix with massively distributed columns. The algorithm starts by learning a concise representation of the data matrix using random projection. Each machine then independently solves a generalized column subset selection problem in which a subset of columns is selected from the current sub-matrix such that the reconstruction error of the concise representation is minimized. A further selection step is then applied to
the columns selected at different machines to select the required number of columns. The proposed algorithm is designed to be executed efficiently over massive amounts of data stored on a cluster of several commodity nodes. In such settings of infrastructure, ensuring the scalability and the fault tolerance of data processing jobs is not a trivial task. In order to alleviate these problems, MapReduce [7] was introduced to simplify large-scale data analytics over a distributed environment of commodity machines. Currently, MapReduce (and its open source implementation Hadoop [8]) is considered the most successful and widely-used framework for managing big data processing jobs. The approach proposed in this paper considers the different aspects of developing MapReduce-efficient algorithms.

The contributions of the paper can be summarized as follows:
- The paper proposes an algorithm for distributed Column Subset Selection (CSS) which first learns a concise representation of the data matrix and then selects columns from distributed sub-matrices that approximate this concise representation.
- To facilitate CSS from different sub-matrices, a fast and accurate algorithm for generalized CSS is proposed. This algorithm greedily selects a subset of columns from a source matrix which approximates the columns of a target matrix.
- A MapReduce-efficient algorithm is proposed for learning a concise representation using random projection. The paper also presents a MapReduce algorithm for distributed CSS which only requires two passes over the data with a very low communication overhead.
- Large-scale experiments have been conducted on benchmark data sets in which different methods for CSS are compared.

The rest of the paper is organized as follows. Section II describes the notations used throughout the paper. Section III gives a brief background on the CSS problem. Section IV describes a centralized greedy algorithm for CSS, which is the core of the distributed algorithm presented in this paper. Section V gives a necessary background on the framework of MapReduce.
The proposed MapReduce algorithm for distributed CSS is described in detail in Section VI. Section VII reviews the state-of-the-art CSS methods and their applicability to distributed data. In Section VIII, an empirical evaluation of the proposed method is described. Finally, Section IX concludes the paper.

II. NOTATIONS

The following notations are used throughout the paper unless otherwise indicated. Scalars are denoted by small letters (e.g., $m$, $n$), sets are denoted in script letters (e.g., $\mathcal{S}$, $\mathcal{R}$), vectors are denoted by small bold italic letters (e.g., $f$, $g$), and matrices are denoted by capital letters (e.g., $A$, $B$). The subscript $(i)$ indicates that the variable corresponds to the $i$-th block of data in the distributed environment. In addition, the following notations are used:

For a set $\mathcal{S}$:
  $|\mathcal{S}|$   the cardinality of the set.
For a vector $x \in \mathbb{R}^{m}$:
  $x_i$   the $i$-th element of $x$.
  $\|x\|$   the Euclidean norm ($\ell_2$-norm) of $x$.
For a matrix $A \in \mathbb{R}^{m \times n}$:
  $A_{ij}$   the $(i, j)$-th entry of $A$.
  $A_{i:}$   the $i$-th row of $A$.
  $A_{:j}$   the $j$-th column of $A$.
  $A_{:\mathcal{S}}$   the sub-matrix of $A$ which consists of the set $\mathcal{S}$ of columns.
  $A^T$   the transpose of $A$.
  $\|A\|_F$   the Frobenius norm of $A$: $\|A\|_F = \sqrt{\sum_{i,j} A_{ij}^2}$.
  $\tilde{A}$   a low rank approximation of $A$.
  $\tilde{A}_{\mathcal{S}}$   a rank-$l$ approximation of $A$ based on the set $\mathcal{S}$ of columns, where $|\mathcal{S}| = l$.

III. COLUMN SUBSET SELECTION (CSS)

The Column Subset Selection (CSS) problem can be generally defined as the selection of the most representative columns of a data matrix [4], [5], [6]. The CSS problem generalizes the problem of selecting representative data instances as well as the unsupervised feature selection problem. Both are crucial tasks that can be directly used for data analysis or as pre-processing steps for developing fast and accurate algorithms in data mining and machine learning.

Although different criteria for column subset selection can be defined, a common criterion that has been used in much recent work measures the discrepancy between the original matrix and the approximate matrix reconstructed from the subset of selected columns [9], [10], [11], [12], [13], [4], [5], [6], [14].
Most of the recent work either develops CSS algorithms that directly optimize this criterion or uses this criterion to assess the quality of the proposed CSS algorithms. In the present work, the CSS problem is formally defined as

Problem 1 (Column Subset Selection): Given an $m \times n$ matrix $A$ and an integer $l$, find a subset of columns $\mathcal{L}$ such that $|\mathcal{L}| = l$ and
$$\mathcal{L} = \arg\min_{\mathcal{S}} \|A - P^{(\mathcal{S})} A\|_F^2,$$
where $P^{(\mathcal{S})}$ is an $m \times m$ projection matrix which projects the columns of $A$ onto the span of the candidate columns $A_{:\mathcal{S}}$.

The criterion $F(\mathcal{S}) = \|A - P^{(\mathcal{S})} A\|_F^2$ represents the sum of squared errors between the original data matrix $A$ and its rank-$l$ column-based approximation (where $l = |\mathcal{S}|$),
$$\tilde{A}_{\mathcal{S}} = P^{(\mathcal{S})} A. \tag{1}$$
In other words, the criterion $F(\mathcal{S})$ calculates the Frobenius norm of the residual matrix $E = A - \tilde{A}_{\mathcal{S}}$. Other types of matrix norms can also be used to quantify the reconstruction error. Some of the recent work on the CSS problem [4], [5], [6] derives theoretical bounds for both the Frobenius and spectral norms of the residual matrix. The present work, however, focuses on developing algorithms that minimize the Frobenius norm of the residual matrix.

The projection matrix $P^{(\mathcal{S})}$ can be calculated as
$$P^{(\mathcal{S})} = A_{:\mathcal{S}} \left( A_{:\mathcal{S}}^T A_{:\mathcal{S}} \right)^{-1} A_{:\mathcal{S}}^T, \tag{2}$$
where $A_{:\mathcal{S}}$ is the sub-matrix of $A$ which consists of the columns corresponding to $\mathcal{S}$. It should be noted that if $\mathcal{S}$ is known, the term $\left( A_{:\mathcal{S}}^T A_{:\mathcal{S}} \right)^{-1} A_{:\mathcal{S}}^T A$ is the closed-form solution of the least-squares problem $T^{*} = \arg\min_{T} \|A - A_{:\mathcal{S}} T\|_F^2$.

The set of selected columns (i.e., data instances or features) can be directly presented to a data analyst to learn about the insights of the data, or they can be used to preprocess the data for further analysis. For instance, the selected columns can be used to obtain a low-dimensional representation of all columns into the subspace of selected ones. This representation can be obtained by calculating an orthogonal basis $Q$ for the selected columns and then embedding all columns of $A$ into the subspace of $Q$ as $W = Q^T A$. The selected columns can also be used to calculate a column-based low-rank approximation of $A$ [12]. Moreover, the leading singular values and vectors of the low-dimensional embedding $W$ can be used to approximate those of the data matrix.

IV. GREEDY CSS

The column subset selection criterion presented in Section III measures the reconstruction error of a data matrix based on the subset of selected columns. The minimization of this criterion is a combinatorial optimization problem whose optimal solution can be obtained in $O(n^{l} mnl)$ [5]. This section briefly describes a deterministic greedy algorithm for optimizing this criterion, which extends the greedy method for unsupervised feature selection recently proposed by Farahat et al. [15], [16]. A brief description of this method is included in this section for completeness.
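As a small illustration of the criterion of Section III and the embedding $W = Q^T A$, the following NumPy sketch evaluates $F(\mathcal{S})$ through the least-squares form rather than by building the $m \times m$ projection matrix. The function names `css_criterion` and `embed`, the toy matrix and the chosen column indices are assumptions made for this example and are not taken from the paper.

```python
import numpy as np

def css_criterion(A, S):
    """F(S) = ||A - P^(S) A||_F^2 for a list S of column indices."""
    A_S = A[:, S]
    # T* = argmin_T ||A - A_S T||_F^2, the least-squares problem below Eq. (2)
    T, *_ = np.linalg.lstsq(A_S, A, rcond=None)
    residual = A - A_S @ T            # E = A - P^(S) A
    return float(np.sum(residual ** 2))

def embed(A, S):
    """W = Q^T A, with Q an orthonormal basis of the selected columns."""
    Q, _ = np.linalg.qr(A[:, S])      # reduced QR: Q has |S| columns
    return Q.T @ A

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 8))       # toy data, illustration only
S = [1, 4, 5]
W = embed(A, S)
error = css_criterion(A, S)
```

Because $P^{(\mathcal{S})} = QQ^T$, the error also equals $\|A\|_F^2 - \|W\|_F^2$, which gives a cheap consistency check on the two functions.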
The reader is referred to [16] for the proofs of the different formulas presented in this section.

The greedy CSS [16] is based on the following recursive formula for the CSS criterion.

Theorem 1: Given a set of columns $\mathcal{S}$. For any $\mathcal{P} \subset \mathcal{S}$,
$$F(\mathcal{S}) = F(\mathcal{P}) - \|\tilde{E}_{\mathcal{R}}\|_F^2,$$
where $E = A - P^{(\mathcal{P})} A$, and $\tilde{E}_{\mathcal{R}}$ is the low-rank approximation of $E$ based on the subset $\mathcal{R} = \mathcal{S} \setminus \mathcal{P}$ of columns.
Proof: See [16, Theorem 2].

The term $\|\tilde{E}_{\mathcal{R}}\|_F^2$ represents the decrease in reconstruction error achieved by adding the subset $\mathcal{R}$ of columns to $\mathcal{P}$. This recursive formula allows the development of an efficient greedy algorithm that approximates the optimal solution of the column subset selection problem. At iteration $t$, the goal is to find column $p$ such that
$$p = \arg\min_{i} F(\mathcal{S} \cup \{i\}), \tag{3}$$
where $\mathcal{S}$ is the set of columns selected during the first $t-1$ iterations. Let $G$ be an $n \times n$ matrix which represents the inner-products over the columns of the residual matrix $E$, i.e., $G = E^T E$. The greedy selection problem can be simplified to (see [16, Section 6])

Problem 2 (Greedy Column Subset Selection): At iteration $t$, find column $p$ such that
$$p = \arg\max_{i} \frac{\|G_{:i}\|^2}{G_{ii}},$$
where $G = E^T E$, $E = A - \tilde{A}_{\mathcal{S}}$ and $\mathcal{S}$ is the set of columns selected during the first $t-1$ iterations.

For iteration $t$, define $\delta = G_{:p}$ and $\omega = G_{:p} / \sqrt{G_{pp}} = \delta / \sqrt{\delta_p}$. The vector $\delta^{(t)}$ can be calculated in terms of $A$ and previous $\omega$'s as
$$\delta^{(t)} = A^T A_{:p} - \sum_{r=1}^{t-1} \omega_p^{(r)} \omega^{(r)}. \tag{4}$$
The numerator and denominator of the selection criterion at each iteration can be calculated in an efficient manner without explicitly calculating $E$ or $G$ using the following theorem.

Theorem 2: Let $f_i = \|G_{:i}\|^2$ and $g_i = G_{ii}$ be the numerator and denominator of the criterion function for column $i$ respectively, $f = [f_i]_{i=1..n}$, and $g = [g_i]_{i=1..n}$. Then,
$$f^{(t)} = \left( f - 2 \left( \omega \circ \left( A^T A \omega - \sum_{r=1}^{t-2} \left( \omega^{(r)T} \omega \right) \omega^{(r)} \right) \right) + \|\omega\|^2 (\omega \circ \omega) \right)^{(t-1)},$$
$$g^{(t)} = \left( g - (\omega \circ \omega) \right)^{(t-1)},$$
where $\circ$ represents the Hadamard product operator.
Proof: See [16, Theorem 4].

Algorithm 1 shows the complete greedy CSS algorithm.
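The selection rule of Problem 2 can be sketched directly in NumPy. The version below is a deliberately naive illustration: it forms the residual $E$ and the matrix $G$ explicitly at every iteration instead of using the memory-efficient updates of Theorem 2, so it is only suitable for small matrices. The function name `greedy_css`, the numerical tolerance and the rank-2 toy matrix are assumptions made for this example.

```python
import numpy as np

def greedy_css(A, l, tol=1e-12):
    """Naive greedy CSS: pick l columns by maximizing ||G_:i||^2 / G_ii."""
    E = A.astype(float).copy()        # residual matrix; E = A while S is empty
    selected = []
    for _ in range(l):
        G = E.T @ E                   # inner products of residual columns
        f = np.sum(G ** 2, axis=0)    # f_i = ||G_:i||^2
        g = np.diag(G).copy()         # g_i = G_ii
        # Columns with (numerically) zero residual cannot be selected.
        score = np.where(g > tol, f / np.maximum(g, tol), -np.inf)
        p = int(np.argmax(score))
        if not np.isfinite(score[p]):
            break                     # nothing left to explain
        selected.append(p)
        q = E[:, p] / np.sqrt(g[p])   # unit vector along residual column p
        E = E - np.outer(q, q @ E)    # project that direction out of E
    return selected

rng = np.random.default_rng(1)
A = rng.standard_normal((10, 2)) @ rng.standard_normal((2, 7))  # rank-2 toy matrix
S = greedy_css(A, 2)
```

On this rank-2 example, two selected columns are enough to reconstruct every column of `A` up to rounding error.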
The distributed CSS algorithm presented in this paper introduces a generalized variant of the greedy CSS algorithm in which a subset of columns is selected from a source matrix such that the reconstruction error of a target matrix is minimized. The distributed CSS method uses the greedy generalized CSS algorithm as the core method for selecting columns at different machines as well as in the final selection step.
Algorithm 1: Greedy Column Subset Selection
Input: Data matrix $A$, Number of columns $l$
Output: Selected subset of columns $\mathcal{S}$
  Initialize $\mathcal{S} = \{\ \}$
  Initialize $f_i^{(0)} = \|A^T A_{:i}\|^2$, $g_i^{(0)} = A_{:i}^T A_{:i}$ for $i = 1 \ldots n$
  Repeat $t = 1 \rightarrow l$:
    $p = \arg\max_{i} f_i^{(t)} / g_i^{(t)}$, $\mathcal{S} = \mathcal{S} \cup \{p\}$
    $\delta^{(t)} = A^T A_{:p} - \sum_{r=1}^{t-1} \omega_p^{(r)} \omega^{(r)}$
    $\omega^{(t)} = \delta^{(t)} / \sqrt{\delta_p^{(t)}}$
    Update $f_i$'s, $g_i$'s (Theorem 2)

V. MAPREDUCE PARADIGM

MapReduce [7] was presented as a programming model to simplify large-scale data analytics over a distributed environment of commodity machines. The rationale behind MapReduce is to impose a set of constraints on data access at each individual machine and communication between different machines to ensure both the scalability and fault-tolerance of the analytical tasks. Currently, MapReduce is considered the de-facto solution for many data analytics tasks over large distributed clusters [17], [18].

A MapReduce job is executed in two phases of user-defined data transformation functions, namely, map and reduce phases. The input data is split into physical blocks distributed among the nodes. Each block is viewed as a list of key-value pairs. In the first phase, the key-value pairs of each input block $b$ are processed by a single map function running independently on the node where the block $b$ is stored. The key-value pairs are provided one-by-one to the map function. The output of the map function is another set of intermediate key-value pairs. The values associated with the same key across all nodes are grouped together and provided as an input to the reduce function in the second phase. Different groups of values are processed in parallel on different machines. The output of each reduce function is a third set of key-value pairs, and these are collectively considered the output of the job.

It is important to note that the set of the intermediate key-value pairs is moved across the network between the nodes, which incurs significant additional execution time when much data is to be moved. For complex analytical tasks, multiple jobs are typically chained together [17] and/or many rounds of the same job are executed on the input data set [18].
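To make the two phases concrete, the following pure-Python sketch runs one map/reduce round in memory over a list of blocks. It is only a single-process illustration of the programming model, not of Hadoop itself: the helper `run_mapreduce`, the example `map_fn` and `reduce_fn`, and the toy column-distributed matrix are all assumptions made for this example.

```python
from collections import defaultdict

def run_mapreduce(blocks, map_fn, reduce_fn):
    """Run one map/reduce round over in-memory blocks of (key, value) pairs."""
    groups = defaultdict(list)
    for block in blocks:                    # one mapper per physical block
        for key, value in block:            # pairs are fed one by one
            for out_key, out_value in map_fn(key, value):
                groups[out_key].append(out_value)   # group values by key
    output = []
    for key in sorted(groups):              # one reduce call per distinct key
        output.extend(reduce_fn(key, groups[key]))
    return output

# Example job: squared norm of every row of a column-distributed matrix.
def map_fn(col_index, column):
    for row_index, entry in enumerate(column):
        yield row_index, entry * entry

def reduce_fn(row_index, values):
    yield row_index, sum(values)

blocks = [
    [(0, [1.0, 2.0]), (1, [3.0, 4.0])],     # block holding columns 0 and 1
    [(2, [5.0, 6.0])],                      # block holding column 2
]
row_sq_norms = dict(run_mapreduce(blocks, map_fn, reduce_fn))
```

In this sketch every intermediate pair passes through `groups`, which stands in for the network transfer whose cost is discussed above.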
In addition to the programming model constraints, Karloff et al. [19] defined a set of computational constraints that ensure the scalability and the efficiency of MapReduce-based analytical tasks. These computational constraints limit the used memory size at each machine, the output size of both the map and reduce functions and the number of rounds used to complete a certain task. The MapReduce algorithms presented in this paper adhere to both the programming model constraints and the computational constraints. The proposed algorithm also aims at minimizing the overall running time of the distributed column subset selection task to facilitate interactive data analytics.

VI. DISTRIBUTED CSS ON MAPREDUCE

This section describes a MapReduce algorithm for the distributed column subset selection problem. Given a big data matrix $A$ whose columns are distributed across different machines, the goal is to select a subset of columns $\mathcal{S}$ from $A$ such that the CSS criterion $F(\mathcal{S})$ is minimized.

One naïve approach to perform distributed column subset selection is to select different subsets of columns from the sub-matrices stored on different machines. The selected subsets are then sent to a centralized machine where an additional selection step is optionally performed to filter out irrelevant or redundant columns. Let $A_{(i)}$ be the sub-matrix stored at machine $i$; the naïve approach optimizes the following function:
$$\sum_{i=1}^{c} \left\| A_{(i)} - P^{(\mathcal{L}_{(i)})} A_{(i)} \right\|_F^2, \tag{5}$$
where $\mathcal{L}_{(i)}$ is the set of columns selected from $A_{(i)}$ and $c$ is the number of physical blocks of data. The resulting set of columns is the union of the sets selected from different sub-matrices: $\mathcal{L} = \cup_{i=1}^{c} \mathcal{L}_{(i)}$. The set $\mathcal{L}$ can further be reduced by invoking another selection process in which a smaller subset of columns is selected from $A_{:\mathcal{L}}$.

The naïve approach, however simple, is prone to missing relevant columns. This is because the selection at each machine is based on approximating a local sub-matrix, and accordingly there is no way to determine whether the selected columns are globally relevant or not.
For instance, suppose the extreme case where all the truly representative columns happen to be loaded on a single machine. In this case, the algorithm will select a less-than-required number of columns from that machine and many irrelevant columns from other machines.

In order to alleviate this problem, the different machines have to select columns that best approximate a common representation of the data matrix. To achieve that, the proposed algorithm first learns a concise representation of the span of the big data matrix. This concise representation is relatively small and it can be sent over to all machines. After that, each machine can select columns from its sub-matrix that approximate this concise representation. The proposed algorithm uses random projection to learn this concise representation, and proposes a generalized Column Subset Selection (CSS) method to select columns from different machines. The details of the proposed methods are explained in the rest of this section.
A. Random Projection

The first step of the proposed algorithm is to learn a concise representation $B$ for a distributed data matrix $A$. In the proposed approach, a random projection method is employed. Random projection [20][21][22] is a well-known technique for dealing with the curse-of-the-dimensionality problem. Let $\Omega$ be a random projection matrix of size $n \times r$, and given a data matrix $X$ of size $m \times n$, the random projection can be calculated as $Y = X\Omega$. It has been shown that applying random projection $\Omega$ to $X$ preserves the pairwise distances between vectors in the row space of $X$ with a high probability [20]:
$$(1 - \epsilon) \|X_{i:} - X_{j:}\| \le \|X_{i:}\Omega - X_{j:}\Omega\| \le (1 + \epsilon) \|X_{i:} - X_{j:}\|, \tag{6}$$
where $\epsilon$ is an arbitrarily small factor.

Since the CSS criterion $F(\mathcal{S})$ measures the reconstruction error between the big data matrix $A$ and its low-rank approximation $P^{(\mathcal{S})} A$, it essentially measures the sum of the distances between the original rows and their approximations. This means that when applying random projection to both $A$ and $P^{(\mathcal{S})} A$, the reconstruction error of the original data matrix $A$ will be approximately equal to that of $A\Omega$ when both are approximated using the subset of selected columns:
$$\|A - P^{(\mathcal{S})} A\|_F^2 \approx \|A\Omega - P^{(\mathcal{S})} A\Omega\|_F^2. \tag{7}$$
So, instead of optimizing $\|A - P^{(\mathcal{S})} A\|_F^2$, the distributed CSS can approximately optimize $\|A\Omega - P^{(\mathcal{S})} A\Omega\|_F^2$.

Let $B = A\Omega$; the distributed column subset selection problem can be formally defined as

Problem 3 (Distributed Column Subset Selection): Given an $m \times n_{(i)}$ sub-matrix $A_{(i)}$ which is stored at node $i$ and an integer $l_{(i)}$, find a subset of columns $\mathcal{L}_{(i)}$ such that $|\mathcal{L}_{(i)}| = l_{(i)}$ and
$$\mathcal{L}_{(i)} = \arg\min_{\mathcal{S}} \|B - P^{(\mathcal{S})} B\|_F^2,$$
where $B = A\Omega$, $\Omega$ is an $n \times r$ random projection matrix, $\mathcal{S}$ is the set of the indices of the candidate columns and $\mathcal{L}_{(i)}$ is the set of the indices of the selected columns from $A_{(i)}$.

A key observation here is that random projection matrices whose entries are sampled i.i.d. from some univariate distribution $\Psi$ can be exploited to compute random projection on MapReduce in a very efficient manner. Examples of such matrices are Gaussian random matrices [20], uniform random sign ($\pm 1$) matrices [21], and sparse random sign matrices [22].
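The approximation in Eq. (7) can be checked numerically on a small example. The sketch below assumes a Gaussian $\Omega$ whose entries are scaled by $1/\sqrt{r}$ so that squared norms are preserved in expectation; that scaling, the matrix sizes, the helper `residual_sq` and the arbitrary column subset are choices made for this illustration rather than details stated in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, r = 40, 500, 100
A = rng.standard_normal((m, n))             # toy data matrix
S = [3, 17, 42, 99, 250]                    # an arbitrary column subset

# Assumption: N(0, 1/r) entries, so E[||x Omega||^2] = ||x||^2 for any row x.
Omega = rng.standard_normal((n, r)) / np.sqrt(r)
B = A @ Omega                               # concise representation, m x r

Q, _ = np.linalg.qr(A[:, S])                # orthonormal basis of span(A_:S)

def residual_sq(X):
    """||X - P^(S) X||_F^2 with P^(S) = Q Q^T."""
    return float(np.sum((X - Q @ (Q.T @ X)) ** 2))

exact = residual_sq(A)                      # left-hand side of Eq. (7)
sketched = residual_sq(B)                   # right-hand side of Eq. (7)
ratio = sketched / exact
```

Here `B` has 100 columns instead of 500, yet the two residuals should agree closely, which is what allows each machine to work against the much smaller matrix $B$.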
In order to implement random projection on MapReduce, the data matrix $A$ is distributed in a column-wise fashion and viewed as pairs of $\langle i, A_{:i} \rangle$ where $A_{:i}$ is the $i$-th column of $A$. Recall that $B = A\Omega$ can be rewritten as
$$B = \sum_{i=1}^{n} A_{:i} \Omega_{i:} \tag{8}$$
and since the map function is provided one column of $A$ at a time, one does not need to worry about pre-computing the full matrix $\Omega$. In fact, for each input column $A_{:i}$, a new vector $\Omega_{i:}$ needs to be sampled from $\Psi$. So, each input column generates a matrix of size $m \times r$, which means that $O(nmr)$ data should be moved across the network to sum the generated $n$ matrices at $m$ independent reducers, each summing a row $B_{j:}$ to obtain $B$. To minimize that network cost, an in-memory summation can be carried out over the generated $m \times r$ matrices at each mapper. This can be done incrementally after processing each column of $A$. That optimization reduces the network cost to $O(cmr)$, where $c$ is the number of physical blocks of the matrix.¹ Algorithm 2 outlines the proposed random projection algorithm. The term emit is used to refer to outputting new $\langle$key, value$\rangle$ pairs from a mapper or a reducer.

Algorithm 2: Fast Random Projection on MapReduce
Input: Data matrix $A$, Univariate distribution $\Psi$, Number of dimensions $r$
Output: Concise representation $B = A\Omega$, $\Omega_{ij} \sim \Psi \ \forall i, j$
  map:
    $\bar{B} = [0]_{m \times r}$
    foreach $\langle i, A_{:i} \rangle$
      Generate $v = [v_1, v_2, \ldots, v_r]$, $v_j \sim \Psi$
      $\bar{B} = \bar{B} + A_{:i} v$
    for $j = 1$ to $m$
      emit $\langle j, \bar{B}_{j:} \rangle$
  reduce:
    foreach $\langle j, [\, [\bar{B}_{(1)}]_{j:}, [\bar{B}_{(2)}]_{j:}, \ldots, [\bar{B}_{(c)}]_{j:} \,] \rangle$
      $B_{j:} = \sum_{i=1}^{c} [\bar{B}_{(i)}]_{j:}$
      emit $\langle j, B_{j:} \rangle$

B. Generalized CSS

This section presents the generalized column subset selection algorithm which will be used to perform the selection of columns at different machines. While Problem 1 is concerned with the selection of a subset of columns from a data matrix which best represent other columns of the same matrix, Problem 3 selects a subset of columns from a source matrix which best represent the columns of a different target matrix.
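Returning briefly to the random projection step of Section VI-A, the in-memory summation of Algorithm 2 can be simulated on a single machine as follows. This is a sketch under stated assumptions: `map_block` is a hypothetical helper, $\Psi$ is taken to be the standard normal distribution, the reducers are emulated by one array sum, and the sampled rows of $\Omega$ are kept only so that the result can be checked against $A\Omega$ (a real mapper would discard them).

```python
import numpy as np

def map_block(block, m, r, rng):
    """Mapper: accumulate the sum of A_:i Omega_i: over one block's columns."""
    B_partial = np.zeros((m, r))
    omega_rows = {}                       # kept for verification only
    for i, column in block:               # pairs <i, A_:i>
        v = rng.standard_normal(r)        # row Omega_i:, sampled on the fly
        omega_rows[i] = v
        B_partial += np.outer(column, v)  # in-memory summation at the mapper
    return B_partial, omega_rows

rng = np.random.default_rng(0)
m, n, r = 4, 6, 3
A = rng.standard_normal((m, n))           # toy data matrix
blocks = [
    [(i, A[:, i]) for i in range(0, 3)],  # block 1: columns 0..2
    [(i, A[:, i]) for i in range(3, 6)],  # block 2: columns 3..5
]
outputs = [map_block(block, m, r, rng) for block in blocks]

# Reducers: row j of B is the sum of row j over the c mapper outputs.
B = np.sum([partial for partial, _ in outputs], axis=0)

# Rebuild Omega from the sampled rows to check B = A Omega.
Omega = np.zeros((n, r))
for _, rows in outputs:
    for i, v in rows.items():
        Omega[i] = v
```

Only one $m \times r$ partial sum leaves each mapper, which corresponds to the $O(cmr)$ network cost mentioned above.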
The objective function of Problem 3 represents the reconstruction error of the target matrix $B$ based on the selected columns from the source matrix, and the term $P^{(\mathcal{S})} = A_{:\mathcal{S}} \left( A_{:\mathcal{S}}^T A_{:\mathcal{S}} \right)^{-1} A_{:\mathcal{S}}^T$ is the projection matrix which projects the columns of $B$ onto the subspace of the columns selected from $A$. In order to optimize this new criterion, a greedy algorithm can be introduced.

¹ The in-memory summation can also be replaced by a MapReduce combiner [7].

Let $F(\mathcal{S}) = \|B - P^{(\mathcal{S})} B\|_F^2$ be the
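distributed CSS criterion; the following theorem derives a recursive formula for $F(\mathcal{S})$.

Theorem 3: Given a set of columns $\mathcal{S}$. For any $\mathcal{P} \subset \mathcal{S}$,
$$F(\mathcal{S}) = F(\mathcal{P}) - \|\tilde{F}_{\mathcal{R}}\|_F^2,$$
where $F = B - P^{(\mathcal{P})} B$, and $\tilde{F}_{\mathcal{R}}$ is the low-rank approximation of $F$ based on the subset $\mathcal{R} = \mathcal{S} \setminus \mathcal{P}$ of columns of $E = A - P^{(\mathcal{P})} A$.

Proof: Using the recursive formula for the low-rank approximation of $A$, $\tilde{A}_{\mathcal{S}} = \tilde{A}_{\mathcal{P}} + \tilde{E}_{\mathcal{R}}$, and multiplying both sides with $\Omega$ gives
$$\tilde{A}_{\mathcal{S}} \Omega = \tilde{A}_{\mathcal{P}} \Omega + \tilde{E}_{\mathcal{R}} \Omega.$$
Low-rank approximations can be written in terms of projection matrices as
$$P^{(\mathcal{S})} A\Omega = P^{(\mathcal{P})} A\Omega + R^{(\mathcal{R})} E\Omega.$$
Using $B = A\Omega$,
$$P^{(\mathcal{S})} B = P^{(\mathcal{P})} B + R^{(\mathcal{R})} E\Omega.$$
Let $F = E\Omega$. The matrix $F$ is the residual after approximating $B$ using the set $\mathcal{P}$ of columns:
$$F = E\Omega = \left( A - P^{(\mathcal{P})} A \right) \Omega = A\Omega - P^{(\mathcal{P})} A\Omega = B - P^{(\mathcal{P})} B.$$
This means that
$$P^{(\mathcal{S})} B = P^{(\mathcal{P})} B + R^{(\mathcal{R})} F.$$
Substituting in $F(\mathcal{S}) = \|B - P^{(\mathcal{S})} B\|_F^2$ gives
$$F(\mathcal{S}) = \|B - P^{(\mathcal{P})} B - R^{(\mathcal{R})} F\|_F^2.$$
Using $F = B - P^{(\mathcal{P})} B$ gives
$$F(\mathcal{S}) = \|F - R^{(\mathcal{R})} F\|_F^2.$$
Using the relation between the Frobenius norm and the trace,
$$F(\mathcal{S}) = \mathrm{trace}\left( \left( F - R^{(\mathcal{R})} F \right)^T \left( F - R^{(\mathcal{R})} F \right) \right) = \mathrm{trace}\left( F^T F - 2 F^T R^{(\mathcal{R})} F + F^T R^{(\mathcal{R})} R^{(\mathcal{R})} F \right) = \mathrm{trace}\left( F^T F - F^T R^{(\mathcal{R})} F \right) = \|F\|_F^2 - \|R^{(\mathcal{R})} F\|_F^2.$$
Using $F(\mathcal{P}) = \|F\|_F^2$ and $\tilde{F}_{\mathcal{R}} = R^{(\mathcal{R})} F$ proves the theorem.

Using the recursive formula for $F(\mathcal{S} \cup \{i\})$ allows the development of a greedy algorithm which at iteration $t$ optimizes
$$p = \arg\min_{i} F(\mathcal{S} \cup \{i\}) = \arg\max_{i} \|\tilde{F}_{\{i\}}\|_F^2. \tag{9}$$

Algorithm 3: Greedy Generalized Column Subset Selection
Input: Source matrix $A$, Target matrix $B$, Number of columns $l$
Output: Selected subset of columns $\mathcal{S}$
  Initialize $f_i^{(0)} = \|B^T A_{:i}\|^2$, $g_i^{(0)} = A_{:i}^T A_{:i}$ for $i = 1 \ldots n$
  Repeat $t = 1 \rightarrow l$:
    $p = \arg\max_{i} f_i^{(t)} / g_i^{(t)}$, $\mathcal{S} = \mathcal{S} \cup \{p\}$
    $\delta^{(t)} = A^T A_{:p} - \sum_{r=1}^{t-1} \omega_p^{(r)} \omega^{(r)}$
    $\gamma^{(t)} = B^T A_{:p} - \sum_{r=1}^{t-1} \omega_p^{(r)} \upsilon^{(r)}$
    $\omega^{(t)} = \delta^{(t)} / \sqrt{\delta_p^{(t)}}$, $\upsilon^{(t)} = \gamma^{(t)} / \sqrt{\delta_p^{(t)}}$
    Update $f_i$'s, $g_i$'s (Theorem 4)

Let $G = E^T E$ and $H = F^T E$; the objective function of this optimization problem can be simplified as follows:
$$\|\tilde{F}_{\{i\}}\|_F^2 = \left\| E_{:i} \left( E_{:i}^T E_{:i} \right)^{-1} E_{:i}^T F \right\|_F^2 = \mathrm{trace}\left( F^T E_{:i} \left( E_{:i}^T E_{:i} \right)^{-1} E_{:i}^T F \right) = \frac{\|F^T E_{:i}\|^2}{E_{:i}^T E_{:i}} = \frac{\|H_{:i}\|^2}{G_{ii}}. \tag{10}$$
This allows the definition of the following generalized CSS problem.

Problem 4 (Greedy Generalized CSS): At iteration $t$, find column $p$ such that
$$p = \arg\max_{i} \frac{\|H_{:i}\|^2}{G_{ii}},$$
where $H = F^T E$, $G = E^T E$, $F = B - P^{(\mathcal{S})} B$, $E = A - P^{(\mathcal{S})} A$ and $\mathcal{S}$ is the set of columns selected during the first $t-1$ iterations.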
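Problem 4 can be illustrated with a direct, unoptimized NumPy sketch that keeps both residuals $E$ and $F$ in memory and deflates them after each selection. It does not use the recursive updates described in this section and is meant only for small matrices. The function name `greedy_generalized_css`, the tolerance and the toy source and target matrices are assumptions made for this example.

```python
import numpy as np

def greedy_generalized_css(A, B, l, tol=1e-12):
    """Greedily pick l columns of the source A that best reconstruct B."""
    E = A.astype(float).copy()        # source residual, E = A - P^(S) A
    F = B.astype(float).copy()        # target residual, F = B - P^(S) B
    selected = []
    for _ in range(l):
        g = np.sum(E ** 2, axis=0)    # g_i = G_ii = ||E_:i||^2
        H = F.T @ E                   # H = F^T E
        f = np.sum(H ** 2, axis=0)    # f_i = ||H_:i||^2
        # Columns with (numerically) zero residual cannot be selected.
        score = np.where(g > tol, f / np.maximum(g, tol), -np.inf)
        p = int(np.argmax(score))
        if not np.isfinite(score[p]):
            break                     # no usable source column is left
        selected.append(p)
        q = E[:, p] / np.sqrt(g[p])   # unit vector along residual column p
        E = E - np.outer(q, q @ E)    # deflate the source residual
        F = F - np.outer(q, q @ F)    # deflate the target residual
    return selected

rng = np.random.default_rng(0)
A = rng.standard_normal((8, 6))       # toy source matrix
B = 3.0 * A[:, [4]]                   # toy target: a multiple of column 4
picked = greedy_generalized_css(A, B, 1)
```

Since the target here is a scaled copy of column 4 of `A`, the Cauchy-Schwarz inequality guarantees that column 4 attains the largest score, so the first pick is that column.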
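For iteration $t$, define $\gamma = H_{:p}$ and $\upsilon = H_{:p} / \sqrt{G_{pp}} = \gamma / \sqrt{\delta_p}$. The vector $\gamma^{(t)}$ can be calculated in terms of $A$, $B$ and previous $\omega$'s and $\upsilon$'s as
$$\gamma^{(t)} = B^T A_{:p} - \sum_{r=1}^{t-1} \omega_p^{(r)} \upsilon^{(r)}.$$
Similarly, the numerator and denominator of the selection criterion at each iteration can be calculated in an efficient manner using the following theorem.

Theorem 4: Let $f_i = \|H_{:i}\|^2$ and $g_i = G_{ii}$ be the numerator and denominator of the greedy criterion function for column $i$ respectively, $f = [f_i]_{i=1..n}$, and $g = [g_i]_{i=1..n}$. Then,
$$f^{(t)} = \left( f - 2 \left( \omega \circ \left( A^T B \upsilon - \sum_{r=1}^{t-2} \left( \upsilon^{(r)T} \upsilon \right) \omega^{(r)} \right) \right) + \|\upsilon\|^2 (\omega \circ \omega) \right)^{(t-1)},$$
$$g^{(t)} = \left( g - (\omega \circ \omega) \right)^{(t-1)},$$
where $\circ$ represents the Hadamard product operator.

As outlined in Section VI-A, the algorithm's distribution strategy is based on sharing the concise representation of the data $B$ among all mappers. Then, independent $l_{(b)}$ columns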
from each mapper are selected using the generalized CSS algorithm. A second phase of selection is run over the $\sum_{b=1}^{c} l_{(b)}$ columns (where $c$ is the number of input blocks) to find the best $l$ columns to represent $B$. Different ways can be used to set $l_{(b)}$ for each input block $b$. In the context of this paper, the set of $l_{(b)}$ is assigned uniform values for all blocks (i.e., $l_{(b)} = l/c \ \forall b \in 1, 2, \ldots, c$). Other methods are to be considered in future extensions. Algorithm 4 sketches the MapReduce implementation of the distributed CSS algorithm. It should be emphasized that the proposed MapReduce algorithm requires only two passes over the data set and it moves very little data across the network.

Algorithm 4: Distributed CSS on MapReduce
Input: Matrix $A$ of size $m \times n$, Concise representation $B$, Number of columns $l$
Output: Selected columns $C$
  map:
    $A_{(b)} = [\ ]$
    foreach $\langle i, A_{:i} \rangle$
      $A_{(b)} = [A_{(b)} \ A_{:i}]$
    $\tilde{\mathcal{S}} = \mathrm{GeneralizedCSS}(A_{(b)}, B, l_{(b)})$
    foreach $j$ in $\tilde{\mathcal{S}}$
      emit $\langle 0, [A_{(b)}]_{:j} \rangle$
  reduce:
    For all values $\{ [A_{(1)}]_{:\tilde{\mathcal{S}}_{(1)}}, [A_{(2)}]_{:\tilde{\mathcal{S}}_{(2)}}, \ldots, [A_{(c)}]_{:\tilde{\mathcal{S}}_{(c)}} \}$
      $A_{(0)} = [\, [A_{(1)}]_{:\tilde{\mathcal{S}}_{(1)}}, [A_{(2)}]_{:\tilde{\mathcal{S}}_{(2)}}, \ldots, [A_{(c)}]_{:\tilde{\mathcal{S}}_{(c)}} \,]$
      $\mathcal{S} = \mathrm{GeneralizedCSS}(A_{(0)}, B, l)$
      foreach $j$ in $\mathcal{S}$
        emit $\langle 0, [A_{(0)}]_{:j} \rangle$

VII. RELATED WORK

Different approaches have been proposed for selecting a subset of representative columns from a data matrix. This section focuses on briefly describing these approaches and their applicability to massively distributed data matrices. The Column Subset Selection (CSS) methods can be generally categorized into randomized, deterministic and hybrid.

The randomized methods sample a subset of columns from the original matrix using carefully chosen sampling probabilities. Frieze et al. [9] were the first to suggest the idea of randomly sampling $l$ columns from a matrix and using these columns to calculate a rank-$k$ approximation of the matrix (where $l \ge k$). That work of Frieze et al. was followed by different papers [10], [11] that enhanced the algorithm by proposing different sampling probabilities. Drineas et al.
[12] proposed a subspace sampling method which samples columns using probabilities proportional to the norms of the rows of the top $k$ right singular vectors of $A$. Deshpande et al. [13] proposed an adaptive sampling method which updates the sampling probabilities based on the columns selected so far. Column subset selection with uniform sampling can be easily implemented on MapReduce. For non-uniform sampling, the efficiency of implementing the selection on MapReduce is determined by how easy the calculations of the sampling probabilities are. The calculations of probabilities that depend on calculating the leading singular values and vectors are time-consuming on MapReduce. On the other hand, adaptive sampling methods are computationally very complex as they depend on calculating the residual of the whole data matrix after each iteration.

The second category of methods employs a deterministic algorithm for selecting columns such that some criterion function is minimized. This criterion function usually quantifies the reconstruction error of the data matrix based on the subset of selected columns. The deterministic methods are slower, but more accurate, than the randomized ones. In the area of numerical linear algebra, the column pivoting method exploited by the QR decomposition [23] permutes the columns of the matrix based on their norms to enhance the numerical stability of the QR decomposition algorithm. The first $l$ columns of the permuted matrix can be directly selected as representative columns. Besides methods based on QR decomposition, different recent methods have been proposed for directly selecting a subset of columns from the data matrix. Boutsidis et al. [4] proposed a deterministic column subset selection method which first groups columns into clusters and then selects a subset of columns from each cluster. Çivril and Magdon-Ismail [14] presented a deterministic algorithm which greedily selects columns from the data matrix that best represent the leading right singular vectors of the matrix. Recently, Boutsidis et al.
[6] presented a column subset selection algorithm which first calculates the top-$k$ right singular vectors of the data matrix (where $k$ is the target rank) and then uses deterministic sparsification methods to select $l \ge k$ columns from the data matrix. Besides, other deterministic algorithms have been proposed for selecting columns based on the volume defined by them and the origin [24], [25].

The deterministic algorithms are more complex to implement on MapReduce. For instance, it is time-consuming to calculate the leading singular values and vectors of a massively distributed matrix or to cluster its columns using k-means. It is also computationally complex to calculate QR decomposition with pivoting. Moreover, the recently proposed algorithms for volume sampling are more complex than other CSS algorithms as well as the one presented in this paper, and they are infeasible for large data sets.

A third category of CSS techniques is the hybrid methods which combine the benefits of both the randomized and deterministic methods. In these methods, a large subset of columns is randomly sampled from the columns of the data matrix and then a deterministic step is employed to reduce
the number of selected columns to the desired rank. For instance, Boutsidis et al. [5] proposed a two-stage hybrid CSS algorithm which first samples $O(l \log l)$ columns based on probabilities calculated using the $l$ leading right singular vectors, and then employs a deterministic algorithm to select exactly $l$ columns from the columns sampled in the first stage. However, the algorithm depends on calculating the leading $l$ right singular vectors, which is time-consuming for large data sets. The hybrid algorithms for CSS can be easily implemented on MapReduce if the randomized selection step is MapReduce-efficient and the deterministic selection step can be implemented on a single machine. This is usually true if the number of columns selected by the randomized step is relatively small.

In comparison to other CSS methods, the algorithm proposed in this paper is designed to be MapReduce-efficient. In the distributed selection step, representative columns are selected based on a common representation. The common representation proposed in this work is based on random projection. This is more efficient than the work of Çivril and Magdon-Ismail [14], which selects columns based on the leading singular vectors. In comparison to other deterministic methods, the proposed algorithm is specifically designed to be parallelized, which makes it applicable to big data matrices whose columns are massively distributed. On the other hand, the two-step scheme of distributed-then-centralized selection is similar to that of the hybrid CSS methods. The proposed algorithm, however, employs a deterministic algorithm at the distributed selection phase, which is more accurate than the randomized selection employed by hybrid methods in the first phase.

VIII. EXPERIMENTS

Experiments have been conducted on two big data sets to evaluate the efficiency and effectiveness of the proposed distributed CSS algorithm on MapReduce. The properties of the data sets are described in Table I.

TABLE I: THE PROPERTIES OF THE DATA SETS USED TO EVALUATE THE DISTRIBUTED CSS METHOD.

  Data set         Type         # Instances   # Features
  RCV1-200K        Documents    193,844       47,236
  TinyImages-1M    Images       1 million     1,024
The RCV1-200K is a subset of the RCV1 data set [26] which has been prepared and used by Chen et al. [27] to evaluate parallel spectral clustering algorithms. The TinyImages-1M data set contains 1 million images that were sampled from the 80 million tiny images data set [28] and converted to grayscale.

Similar to previous work on CSS, the different methods are evaluated according to their ability to minimize the reconstruction error of the data matrix based on the subset of selected columns. In order to quantify the reconstruction error across different data sets, a relative accuracy measure is defined as

\[
\text{Relative Accuracy} = \frac{\|A - \tilde{A}_U\|_F - \|A - \tilde{A}_S\|_F}{\|A - \tilde{A}_U\|_F - \|A - \tilde{A}_l\|_F} \times 100\%,
\]

where $\tilde{A}_U$ is the rank-l approximation of the data matrix based on a random subset U of columns, $\tilde{A}_S$ is the rank-l approximation of the data matrix based on the subset S of columns, and $\tilde{A}_l$ is the best rank-l approximation of the data matrix calculated using the Singular Value Decomposition (SVD). This measure compares different methods relative to uniform sampling as a baseline, with higher values indicating better performance.

The experiments were conducted on Amazon EC2² clusters, which consist of 10 instances for the RCV1-200K data set and 20 instances for the TinyImages-1M data set. Each instance has 7.5 GB of memory and a two-core processor. All instances are running Debian and Hadoop. The data sets were converted into a binary format in the form of a sequence of key-value pairs. Each pair consisted of a column index as the key and a vector of the column entries as the value. This is the standard format used in Mahout³ for storing distributed matrices.

The distributed CSS method has been compared with different state-of-the-art methods. It should be noted that most of these methods were not designed with the goal of applying them to massively distributed data, and hence their implementation on MapReduce is not straightforward. However, the designed experiments used the best practices for implementing the different steps of these methods on MapReduce to the best of the authors' knowledge.
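The relative accuracy measure defined above can be illustrated on a small in-memory matrix. The following is a minimal NumPy sketch, not the authors' MapReduce implementation. It assumes that the "rank-l approximation based on a subset of columns" is the best rank-l approximation of A within the span of those columns; the function names and the toy "largest-norm columns" selector are hypothetical and only serve the illustration.

```python
import numpy as np


def rank_l_from_columns(A, cols, l):
    """Best rank-l approximation of A restricted to the span of A[:, cols].

    Assumes the chosen columns are linearly independent.
    """
    Q, _ = np.linalg.qr(A[:, cols])            # orthonormal basis of the chosen columns
    U, s, Vt = np.linalg.svd(Q.T @ A, full_matrices=False)
    return Q @ (U[:, :l] * s[:l]) @ Vt[:l, :]  # truncate the projected matrix to rank l


def best_rank_l(A, l):
    """Best rank-l approximation of A from the truncated SVD."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return (U[:, :l] * s[:l]) @ Vt[:l, :]


def relative_accuracy(A, cols_S, cols_U, l):
    """Relative accuracy (in %) of subset S against the uniform baseline U."""
    err = lambda X: np.linalg.norm(A - X, "fro")
    e_U = err(rank_l_from_columns(A, cols_U, l))   # baseline error
    e_S = err(rank_l_from_columns(A, cols_S, l))   # error of the evaluated subset
    e_l = err(best_rank_l(A, l))                   # best achievable rank-l error
    return 100.0 * (e_U - e_S) / (e_U - e_l)


# Toy usage on random data (hypothetical sizes).
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 50))
l = 5
cols_U = rng.choice(A.shape[1], size=l, replace=False)   # uniform baseline subset
cols_S = np.argsort(-np.linalg.norm(A, axis=0))[:l]      # toy selector: largest-norm columns
print(relative_accuracy(A, cols_S, cols_U, l))
```

A subset identical to the baseline scores 0%, a subset that matches the best rank-l approximation scores 100%, and a subset worse than the baseline scores below zero.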
Specifically, the following distributed CSS algorithms were compared.

- UniNoRep: uniform sampling of columns without replacement. This is usually the worst performing method in terms of approximation error and it will be used as a baseline to evaluate methods across different data sets.
- HybridUni, HybridCol and HybridSVD: different distributed variants of the hybrid CSS algorithm which can be implemented efficiently on MapReduce. In the randomized phase, the three methods use probabilities calculated based on uniform sampling, column norms and the norms of the rows of the leading singular vectors, respectively. The number of selected columns in the randomized phase is set to l log l. In the deterministic phase, the centralized greedy CSS is employed to select exactly l columns from the randomly sampled columns.
- DistApproxSVD: an extension of the centralized algorithm for sparse approximation of Singular Value Decomposition (SVD) [14]. The distributed CSS algorithm presented in this paper (Algorithm 4) is used to select columns that best approximate the leading singular vectors (by setting B = U_k Σ_k). The use of the distributed CSS algorithm extends the original algorithm proposed by Çivril and Magdon-Ismail [14] to work on distributed matrices. In order to allow efficient implementation on MapReduce, the number of leading singular vectors is set to 100.
- DistGreedyCSS: the distributed column subset selection method described in Algorithm 4. For all experiments, the dimension of the random projection matrix is set to 100. This makes the size of the concise representation the same as in the DistApproxSVD method. Two types of random matrices are used for random projection: (1) a dense Gaussian random matrix (rnd), and (2) a sparse random sign matrix (ssgn).

For the methods that require the calculation of Singular Value Decomposition (SVD), the Stochastic SVD (SSVD) algorithm [29] is used to approximate the leading singular values and vectors of the data matrix. The use of SSVD significantly reduces the run time of the original SVD-based algorithms while achieving comparable accuracy. In the conducted experiments, the SSVD implementation of Mahout was used.

² Amazon Elastic Compute Cloud (EC2).
³ Mahout is an Apache project for implementing machine learning algorithms on Hadoop.

Table II: The run times and relative accuracies of different CSS methods. The best performing method for each l is highlighted in bold, and the second best method is underlined. Negative measures indicate methods that perform worse than uniform sampling.
Columns: run time (minutes) and relative accuracy (%), each for l = 10, l = 100 and l = 500.
Rows (RCV1-200K): Uniform - Baseline; Hybrid (Uniform); Hybrid (Column Norms); Hybrid (SVD-based); Distributed Approx. SVD; Distributed Greedy CSS (rnd); Distributed Greedy CSS (ssgn).
Rows (Tiny Images - 1M): Uniform - Baseline; Hybrid (Uniform); Hybrid (Column Norms); Hybrid (SVD-based); Distributed Approx. SVD; Distributed Greedy CSS (ssgn).

Table II shows the run times and relative accuracies for different CSS methods. It can be observed from the table that for the RCV1-200K data set, the DistGreedyCSS methods (with random Gaussian and sparse random sign matrices) outperform all other methods in terms of relative accuracy.
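The two kinds of random projection matrix used by DistGreedyCSS can be sketched as follows. This is a hedged illustration only. The sparse sign construction below follows the style of Achlioptas [21] with a sparsity parameter s = 3, which is an assumption; the exact distribution used in the experiments is not stated in this section. The orientation B = AΩ (columns of A are data instances, Ω has one row per column of A) and all function names are likewise assumptions of the sketch.

```python
import numpy as np


def gaussian_projection(n, c, rng):
    """Dense Gaussian random matrix (the 'rnd' variant): n x c, i.i.d. N(0, 1)."""
    return rng.standard_normal((n, c))


def sparse_sign_projection(n, c, rng, s=3):
    """Sparse random sign matrix (the 'ssgn' variant), Achlioptas-style.

    Entries are sqrt(s) * {+1, 0, -1} with probabilities 1/(2s), 1 - 1/s, 1/(2s).
    The value s = 3 is an assumption of this sketch.
    """
    signs = rng.choice([1.0, 0.0, -1.0], size=(n, c),
                       p=[1 / (2 * s), 1 - 1 / s, 1 / (2 * s)])
    return np.sqrt(s) * signs


def concise_representation(A, omega):
    """Concise representation B = A @ omega of an m x n matrix A (m x c result)."""
    return A @ omega


# Toy usage: 1000 columns projected down to a 100-column representation.
rng = np.random.default_rng(0)
A = rng.standard_normal((50, 1000))
B_rnd = concise_representation(A, gaussian_projection(1000, 100, rng))
B_ssgn = concise_representation(A, sparse_sign_projection(1000, 100, rng))
print(B_rnd.shape, B_ssgn.shape)
```

With s = 3, about two thirds of the entries of the sparse matrix are zero, so the product touches fewer entries than the dense Gaussian product. This is consistent with the shorter run times reported for the ssgn variant.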
In addition, the run times of both DistGreedyCSS variants are relatively small compared to the DistApproxSVD method, which achieves accuracies that are close to those of the DistGreedyCSS method. Both the DistApproxSVD and DistGreedyCSS methods achieve very good approximation accuracies compared to randomized and hybrid methods. It should also be noted that using a sparse random sign matrix for random projection takes much less time than a dense Gaussian matrix, while achieving comparable approximation accuracies. Based on this observation, the sparse random matrix has been used with the TinyImages-1M data set.

For the TinyImages-1M data set, although the DistApproxSVD achieves slightly higher approximation accuracies than DistGreedyCSS (with sparse random sign matrix), the DistGreedyCSS selects columns in almost one-third of the time. The reason why the DistApproxSVD outperforms DistGreedyCSS for this data set is that its rank is relatively small (less than 1024). This means that using the leading 100 singular values and vectors as the concise representation of the data matrix captures most of the information in the matrix and accordingly is more accurate than random projection. The DistGreedyCSS, however, still selects a very good subset of columns in a relatively small time.

IX. CONCLUSION

This paper proposes an accurate and efficient MapReduce algorithm for selecting a subset of columns from a massively distributed matrix. The algorithm starts by learning a concise representation of the data matrix using random projection. It then selects columns from each sub-matrix that best approximate this concise representation. A centralized selection step is then performed on the columns selected from different sub-matrices. In order to facilitate the implementation of the proposed method, a novel algorithm for greedy generalized CSS is proposed to perform the selection from different sub-matrices. In addition, the different steps of the algorithm are carefully designed to be MapReduce-efficient.
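The three steps summarized above (random projection, selection within each sub-matrix, centralized selection) can be sketched on a single machine as follows. This is a simplified illustration under stated assumptions, not the paper's Algorithm 4. The greedy routine is a naive residual-based selector rather than the efficient greedy generalized CSS algorithm the paper proposes, the partitions are simulated with a loop instead of MapReduce tasks, and the centralized step reuses B as its target. All names are hypothetical.

```python
import numpy as np


def greedy_select(A, B, l):
    """Naive greedy choice of l columns of A whose span best reconstructs B.

    Assumes l does not exceed the rank of A.
    """
    E_A = A.astype(float)   # residuals of the candidate columns (astype copies)
    E_B = B.astype(float)   # residual of the target matrix
    selected = []
    for _ in range(l):
        norms = np.sum(E_A ** 2, axis=0)
        # Reduction in ||E_B||_F^2 obtained by projecting out each candidate residual.
        scores = np.sum((E_B.T @ E_A) ** 2, axis=0) / np.maximum(norms, 1e-12)
        scores[norms < 1e-10] = -np.inf          # skip columns already in the span
        if selected:
            scores[selected] = -np.inf           # never pick a column twice
        j = int(np.argmax(scores))
        q = E_A[:, j] / np.sqrt(norms[j])        # unit vector of the chosen residual
        E_A -= np.outer(q, q @ E_A)              # deflate candidates
        E_B -= np.outer(q, q @ E_B)              # deflate target
        selected.append(j)
    return selected


def distributed_css(A, l, c, n_parts, rng):
    """Select l columns of A via per-partition selection plus a centralized step."""
    n = A.shape[1]
    B = A @ rng.standard_normal((n, c))          # concise representation (Gaussian projection)
    candidates = []
    for idx in np.array_split(np.arange(n), n_parts):   # each part stands in for one machine
        local = greedy_select(A[:, idx], B, l)
        candidates.extend(idx[local])            # map local indices back to global ones
    candidates = np.array(candidates)
    final = greedy_select(A[:, candidates], B, l)        # centralized selection
    return candidates[final]


# Toy usage (hypothetical sizes).
rng = np.random.default_rng(0)
A = rng.standard_normal((30, 200))
print(distributed_css(A, l=5, c=10, n_parts=4, rng=rng))
```

Each simulated partition returns l candidates, so the centralized step only sees n_parts × l columns. That reduction is what lets the final selection run on a single machine.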
Experiments on big data sets demonstrate the effectiveness and efficiency of the proposed algorithm in comparison to other CSS methods when implemented on distributed data.

REFERENCES

[1] A. K. Jain and R. C. Dubes, Algorithms for Clustering Data. Upper Saddle River, NJ, USA: Prentice-Hall, Inc., 1988.
[2] L. Kaufman and P. Rousseeuw, "Clustering by means of medoids," Technische Hogeschool, Delft (Netherlands). Department of Mathematics and Informatics, Tech. Rep.
[3] S. Deerwester, S. Dumais, G. Furnas, T. Landauer, and R. Harshman, "Indexing by latent semantic analysis," Journal of the American Society for Information Science and Technology, vol. 41, no. 6.
[4] C. Boutsidis, J. Sun, and N. Anerousis, "Clustered subset selection and its applications on IT service metrics," in Proceedings of the Seventeenth ACM Conference on Information and Knowledge Management (CIKM'08), 2008.
[5] C. Boutsidis, M. W. Mahoney, and P. Drineas, "An improved approximation algorithm for the column subset selection problem," in Proceedings of the Twentieth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA'09), 2009.
[6] C. Boutsidis, P. Drineas, and M. Magdon-Ismail, "Near optimal column-based matrix reconstruction," in Proceedings of the 52nd Annual IEEE Symposium on Foundations of Computer Science (FOCS'11), 2011.
[7] J. Dean and S. Ghemawat, "MapReduce: Simplified data processing on large clusters," Communications of the ACM, vol. 51, no. 1.
[8] T. White, Hadoop: The Definitive Guide, 1st ed. O'Reilly Media, Inc.
[9] A. Frieze, R. Kannan, and S. Vempala, "Fast Monte-Carlo algorithms for finding low-rank approximations," in Proceedings of the 39th Annual IEEE Symposium on Foundations of Computer Science (FOCS'98), 1998.
[10] P. Drineas, A. Frieze, R. Kannan, S. Vempala, and V. Vinay, "Clustering large graphs via the singular value decomposition," Machine Learning, vol. 56, no. 1-3, pp. 9-33.
[11] P. Drineas, R. Kannan, and M. Mahoney, "Fast Monte Carlo algorithms for matrices II: Computing a low-rank approximation to a matrix," SIAM Journal on Computing, vol. 36, no. 1.
[12] P. Drineas, M. Mahoney, and S. Muthukrishnan, "Subspace sampling and relative-error matrix approximation: Column-based methods," in Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques. Springer Berlin / Heidelberg, 2006.
[13] A. Deshpande, L. Rademacher, S. Vempala, and G. Wang, "Matrix approximation and projective clustering via volume sampling," Theory of Computing, vol. 2, no. 1.
[14] A. Çivril and M. Magdon-Ismail, "Column subset selection via sparse approximation of SVD," Theoretical Computer Science, vol. 421, no. 0, pp. 1-14.
[15] A. K. Farahat, A. Ghodsi, and M. S. Kamel, "An efficient greedy method for unsupervised feature selection," in Proceedings of the Eleventh IEEE International Conference on Data Mining (ICDM'11), 2011.
[16] A. K. Farahat, A. Ghodsi, and M. S. Kamel, "Efficient greedy feature selection for unsupervised learning," Knowledge and Information Systems, vol. 35, no. 2.
[17] T. Elsayed, J. Lin, and D. W. Oard, "Pairwise document similarity in large collections with MapReduce," in Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics on Human Language Technologies: Short Papers (HLT'08), 2008.
[18] A. Ene, S. Im, and B. Moseley, "Fast clustering using MapReduce," in Proceedings of the Seventeenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'11), 2011.
[19] H. Karloff, S. Suri, and S. Vassilvitskii, "A model of computation for MapReduce," in Proceedings of the 21st Annual ACM-SIAM Symposium on Discrete Algorithms (SODA'10), 2010.
[20] S. Dasgupta and A. Gupta, "An elementary proof of a theorem of Johnson and Lindenstrauss," Random Structures and Algorithms, vol. 22, no. 1.
[21] D. Achlioptas, "Database-friendly random projections: Johnson-Lindenstrauss with binary coins," Journal of Computer and System Sciences, vol. 66, no. 4.
[22] P. Li, T. J. Hastie, and K. W. Church, "Very sparse random projections," in Proceedings of the Twelfth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'06), 2006.
[23] G. Golub and C. Van Loan, Matrix Computations, 3rd ed. Johns Hopkins Univ Pr.
[24] A. Deshpande and L. Rademacher, "Efficient volume sampling for row/column subset selection," in Proceedings of the 51st Annual IEEE Symposium on Foundations of Computer Science (FOCS'10), 2010.
[25] V. Guruswami and A. K. Sinop, "Optimal column-based low-rank matrix reconstruction," in Proceedings of the 21st Annual ACM-SIAM Symposium on Discrete Algorithms (SODA'12), 2012.
[26] D. D. Lewis, Y. Yang, T. G. Rose, and F. Li, "RCV1: A new benchmark collection for text categorization research," The Journal of Machine Learning Research, vol. 5.
[27] W.-Y. Chen, Y. Song, H. Bai, C.-J. Lin, and E. Chang, "Parallel spectral clustering in distributed systems," Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 33, no. 3.
[28] A. Torralba, R. Fergus, and W. Freeman, "80 million tiny images: A large data set for nonparametric object and scene recognition," Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 30, no. 11.
[29] N. Halko, P.-G. Martinsson, Y. Shkolnisky, and M. Tygert, "An algorithm for the principal component analysis of large data sets," SIAM Journal on Scientific Computing, vol. 33, no. 5, 2011.
More information+ + +   This circuit than can be reduced to a planar circuit
MeshCurrent Method The meshcurrent s analog of the nodeoltage method. We sole for a new set of arables, mesh currents, that automatcally satsfy KCLs. As such, meshcurrent method reduces crcut soluton to
More informationStudy on CET4 Marks in China s Graded English Teaching
Study on CET4 Marks n Chna s Graded Englsh Teachng CHE We College of Foregn Studes, Shandong Insttute of Busness and Technology, P.R.Chna, 264005 Abstract: Ths paper deploys Logt model, and decomposes
More informationHow Sets of Coherent Probabilities May Serve as Models for Degrees of Incoherence
1 st Internatonal Symposum on Imprecse Probabltes and Ther Applcatons, Ghent, Belgum, 29 June 2 July 1999 How Sets of Coherent Probabltes May Serve as Models for Degrees of Incoherence Mar J. Schervsh
More informationAn Alternative Way to Measure Private Equity Performance
An Alternatve Way to Measure Prvate Equty Performance Peter Todd Parlux Investment Technology LLC Summary Internal Rate of Return (IRR) s probably the most common way to measure the performance of prvate
More informationEnterprise Master Patient Index
Enterprse Master Patent Index Healthcare data are captured n many dfferent settngs such as hosptals, clncs, labs, and physcan offces. Accordng to a report by the CDC, patents n the Unted States made an
More informationCHOLESTEROL REFERENCE METHOD LABORATORY NETWORK. Sample Stability Protocol
CHOLESTEROL REFERENCE METHOD LABORATORY NETWORK Sample Stablty Protocol Background The Cholesterol Reference Method Laboratory Network (CRMLN) developed certfcaton protocols for total cholesterol, HDL
More informationUncertain Data Mining: A New Research Direction
Uncertan Data Mnng: A New Research Drecton Mchael Chau 1, Reynold Cheng, and Ben Kao 3 1: School of Busness, The Unversty of Hong Kong, Pokfulam, Hong Kong : Department of Computng, Hong Kong Polytechnc
More informationQuestions that we may have about the variables
Antono Olmos, 01 Multple Regresson Problem: we want to determne the effect of Desre for control, Famly support, Number of frends, and Score on the BDI test on Perceved Support of Latno women. Dependent
More information1. Fundamentals of probability theory 2. Emergence of communication traffic 3. Stochastic & Markovian Processes (SP & MP)
6.3 /  Communcaton Networks II (Görg) SS20  www.comnets.unbremen.de Communcaton Networks II Contents. Fundamentals of probablty theory 2. Emergence of communcaton traffc 3. Stochastc & Markovan Processes
More informationMethodology to Determine Relationships between Performance Factors in Hadoop Cloud Computing Applications
Methodology to Determne Relatonshps between Performance Factors n Hadoop Cloud Computng Applcatons Lus Eduardo Bautsta Vllalpando 1,2, Alan Aprl 1 and Alan Abran 1 1 Department of Software Engneerng and
More informationMETHODOLOGY TO DETERMINE RELATIONSHIPS BETWEEN PERFORMANCE FACTORS IN HADOOP CLOUD COMPUTING APPLICATIONS
METHODOLOGY TO DETERMINE RELATIONSHIPS BETWEEN PERFORMANCE FACTORS IN HADOOP CLOUD COMPUTING APPLICATIONS Lus Eduardo Bautsta Vllalpando 1,2, Alan Aprl 1 and Alan Abran 1 1 Department of Software Engneerng
More informationAn Adaptive and Distributed Clustering Scheme for Wireless Sensor Networks
2007 Internatonal Conference on Convergence Informaton Technology An Adaptve and Dstrbuted Clusterng Scheme for Wreless Sensor Networs Xnguo Wang, Xnmng Zhang, Guolang Chen, Shuang Tan Department of Computer
More informationThe eigenvalue derivatives of linear damped systems
Control and Cybernetcs vol. 32 (2003) No. 4 The egenvalue dervatves of lnear damped systems by YeongJeu Sun Department of Electrcal Engneerng IShou Unversty Kaohsung, Tawan 840, R.O.C emal: yjsun@su.edu.tw
More informationGreedy Column Subset Selection for Largescale Data Sets
Knowledge and Information Systems manuscript No. will be inserted by the editor) Greedy Column Subset Selection for Largescale Data Sets Ahmed K. Farahat Ahmed Elgohary Ali Ghodsi Mohamed S. Kamel Received:
More informationQuality Adjustment of Secondhand Motor Vehicle Application of Hedonic Approach in Hong Kong s Consumer Price Index
Qualty Adustment of Secondhand Motor Vehcle Applcaton of Hedonc Approach n Hong Kong s Consumer Prce Index Prepared for the 14 th Meetng of the Ottawa Group on Prce Indces 20 22 May 2015, Tokyo, Japan
More informationA Prefix Code Matching Parallel LoadBalancing Method for SolutionAdaptive Unstructured Finite Element Graphs on Distributed Memory Multicomputers
Ž. The Journal of Supercomputng, 15, 25 49 2000 2000 Kluwer Academc Publshers. Manufactured n The Netherlands. A Prefx Code Matchng Parallel LoadBalancng Method for SolutonAdaptve Unstructured Fnte Element
More informationFault tolerance in cloud technologies presented as a service
Internatonal Scentfc Conference Computer Scence 2015 Pavel Dzhunev, PhD student Fault tolerance n cloud technologes presented as a servce INTRODUCTION Improvements n technques for vrtualzaton and performance
More informationAn interactive system for structurebased ASCII art creation
An nteractve system for structurebased ASCII art creaton Katsunor Myake Henry Johan Tomoyuk Nshta The Unversty of Tokyo Nanyang Technologcal Unversty Abstract NonPhotorealstc Renderng (NPR), whose am
More informationMAPP. MERIS level 3 cloud and water vapour products. Issue: 1. Revision: 0. Date: 9.12.1998. Function Name Organisation Signature Date
Ttel: Project: Doc. No.: MERIS level 3 cloud and water vapour products MAPP MAPPATBDClWVL3 Issue: 1 Revson: 0 Date: 9.12.1998 Functon Name Organsaton Sgnature Date Author: Bennartz FUB Preusker FUB Schüller
More informationA heuristic task deployment approach for load balancing
Xu Gaochao, Dong Yunmeng, Fu Xaodog, Dng Yan, Lu Peng, Zhao Ja Abstract A heurstc task deployment approach for load balancng Gaochao Xu, Yunmeng Dong, Xaodong Fu, Yan Dng, Peng Lu, Ja Zhao * College of
More informationTHE DISTRIBUTION OF LOAN PORTFOLIO VALUE * Oldrich Alfons Vasicek
HE DISRIBUION OF LOAN PORFOLIO VALUE * Oldrch Alfons Vascek he amount of captal necessary to support a portfolo of debt securtes depends on the probablty dstrbuton of the portfolo loss. Consder a portfolo
More information