Distributed Column Subset Selection on MapReduce

Ahmed K. Farahat, Ahmed Elgohary, Ali Ghodsi, Mohamed S. Kamel
University of Waterloo, Waterloo, Ontario, Canada N2L 3G1
Email: {afarahat, aelgohary, aghodsib,

Abstract: Given a very large data set distributed over a cluster of several nodes, this paper addresses the problem of selecting a few data instances that best represent the entire data set. The solution to this problem is of crucial importance in the big data era as it enables data analysts to understand the insights of the data and explore its hidden structure. The selected instances can also be used for data preprocessing tasks such as learning a low-dimensional embedding of the data points or computing a low-rank approximation of the corresponding matrix. The paper first formulates the problem as the selection of a few representative columns from a matrix whose columns are massively distributed, and it then proposes a MapReduce algorithm for selecting those representatives. The algorithm first learns a concise representation of all columns using random projection, and it then solves a generalized column subset selection problem at each machine in which a subset of columns is selected from the sub-matrix on that machine such that the reconstruction error of the concise representation is minimized. The paper then demonstrates the effectiveness and efficiency of the proposed algorithm through an empirical evaluation on benchmark data sets.

Keywords: Column Subset Selection; Greedy Algorithms; Distributed Computing; Big Data; MapReduce

I. INTRODUCTION

Recent years have witnessed the rise of the big data era in computing and storage systems. With the great advances in information and communication technology, hundreds of petabytes of data are generated, transferred, processed and stored every day. The availability of this overwhelming amount of structured and unstructured data creates an acute need to develop fast and accurate algorithms to discover useful information that is hidden in the big data.
One of the crucial problems in the big data era is the ability to represent the data and its underlying information in a succinct format. Although different algorithms for clustering and dimension reduction can be used to summarize big data, these algorithms tend to learn representatives whose meanings are difficult to interpret. For instance, the traditional clustering algorithms such as k-means [1] tend to produce centroids which encode information about thousands of data instances. The meanings of these centroids are hard to interpret. Even clustering methods that use data instances as prototypes, such as k-medoid [2], learn only one representative for each cluster, which is usually not enough to capture the insights of the data instances in that cluster. In addition, using medoids as representatives implicitly assumes that the data points are distributed as clusters and that the number of those clusters is known ahead of time. This assumption is not true for many data sets. On the other hand, traditional dimension reduction algorithms such as Latent Semantic Analysis (LSA) [3] tend to learn a few latent concepts in the feature space. Each of these concepts is represented by a dense vector which combines thousands of features with positive and negative weights. This makes it difficult for the data analyst to understand the meaning of these concepts. Even if the goal of representative selection is to learn a low-dimensional embedding of data instances, learning dimensions whose meanings are easy to interpret allows the understanding of the results of the data mining algorithms, such as understanding the meanings of data clusters in the low-dimensional space.

The acute need to summarize big data to a format that appeals to data analysts motivates the development of different algorithms to directly select a few representative data instances and/or features. This problem can be generally formulated as the selection of a subset of columns from a data matrix, which is formally known as the Column Subset Selection (CSS) problem [4], [5], [6].
Although many algorithms have been proposed for tackling the CSS problem, most of these algorithms focus on randomly selecting a subset of columns with the goal of using these columns to obtain a low-rank approximation of the data matrix. In this case, these algorithms tend to select a relatively large number of columns. When the goal is to select a very few columns to be directly presented to a data analyst or indirectly used to interpret the results of other algorithms, the randomized CSS methods are not going to produce a meaningful subset of columns. On the other hand, deterministic algorithms for CSS, although more accurate, do not scale to work on big matrices with massively distributed columns.

This paper addresses the aforementioned problem by presenting a fast and accurate algorithm for selecting a very few columns from a big data matrix with massively distributed columns. The algorithm starts by learning a concise representation of the data matrix using random projection. Each machine then independently solves a generalized column subset selection problem in which a subset of columns is selected from the current sub-matrix such that the reconstruction error of the concise representation is minimized. A further selection step is then applied to the columns selected at different machines to select the required number of columns.

The proposed algorithm is designed to be executed efficiently over massive amounts of data stored on a cluster of several commodity nodes. In such settings of infrastructure, ensuring the scalability and the fault tolerance of data processing jobs is not a trivial task. In order to alleviate these problems, MapReduce [7] was introduced to simplify large-scale data analytics over a distributed environment of commodity machines. Currently, MapReduce (and its open source implementation Hadoop [8]) is considered the most successful and widely-used framework for managing big data processing jobs. The approach proposed in this paper considers the different aspects of developing MapReduce-efficient algorithms.

The contributions of the paper can be summarized as follows:
- The paper proposes an algorithm for distributed Column Subset Selection (CSS) which first learns a concise representation of the data matrix and then selects columns from distributed sub-matrices that approximate this concise representation.
- To facilitate CSS from different sub-matrices, a fast and accurate algorithm for generalized CSS is proposed. This algorithm greedily selects a subset of columns from a source matrix which approximates the columns of a target matrix.
- A MapReduce-efficient algorithm is proposed for learning a concise representation using random projection. The paper also presents a MapReduce algorithm for distributed CSS which only requires two passes over the data with a very low communication overhead.
- Large-scale experiments have been conducted on benchmark data sets in which different methods for CSS are compared.

The rest of the paper is organized as follows. Section II describes the notations used throughout the paper. Section III gives a brief background on the CSS problem. Section IV describes a centralized greedy algorithm for CSS, which is the core of the distributed algorithm presented in this paper. Section V gives a necessary background on the framework of MapReduce.
The proposed MapReduce algorithm for distributed CSS is described in detail in Section VI. Section VII reviews the state-of-the-art CSS methods and their applicability to distributed data. In Section VIII, an empirical evaluation of the proposed method is described. Finally, Section IX concludes the paper.

II. NOTATIONS

The following notations are used throughout the paper unless otherwise indicated. Scalars are denoted by small letters (e.g., $m$, $n$), sets are denoted in script letters (e.g., $\mathcal{S}$, $\mathcal{R}$), vectors are denoted by small bold italic letters (e.g., $\mathbf{f}$, $\mathbf{g}$), and matrices are denoted by capital letters (e.g., $A$, $B$). The subscript $(i)$ indicates that the variable corresponds to the $i$-th block of data in the distributed environment. In addition, the following notations are used:

For a set $\mathcal{S}$:
- $|\mathcal{S}|$: the cardinality of the set.

For a vector $\mathbf{x} \in \mathbb{R}^m$:
- $x_i$: the $i$-th element of $\mathbf{x}$.
- $\|\mathbf{x}\|$: the Euclidean norm ($\ell_2$-norm) of $\mathbf{x}$.

For a matrix $A \in \mathbb{R}^{m \times n}$:
- $A_{ij}$: the $(i, j)$-th entry of $A$.
- $A_{i:}$: the $i$-th row of $A$.
- $A_{:j}$: the $j$-th column of $A$.
- $A_{:\mathcal{S}}$: the sub-matrix of $A$ which consists of the set $\mathcal{S}$ of columns.
- $A^T$: the transpose of $A$.
- $\|A\|_F$: the Frobenius norm of $A$: $\|A\|_F = \sqrt{\Sigma_{i,j} A_{ij}^2}$.
- $\tilde{A}$: a low rank approximation of $A$.
- $\tilde{A}_{\mathcal{S}}$: a rank-$l$ approximation of $A$ based on the set $\mathcal{S}$ of columns, where $|\mathcal{S}| = l$.

III. COLUMN SUBSET SELECTION (CSS)

The Column Subset Selection (CSS) problem can be generally defined as the selection of the most representative columns of a data matrix [4], [5], [6]. The CSS problem generalizes the problem of selecting representative data instances as well as the unsupervised feature selection problem. Both are crucial tasks that can be directly used for data analysis or as pre-processing steps for developing fast and accurate algorithms in data mining and machine learning. Although different criteria for column subset selection can be defined, a common criterion that has been used in much recent work measures the discrepancy between the original matrix and the approximate matrix reconstructed from the subset of selected columns [9], [10], [11], [12], [13], [4], [5], [6], [14].
Most of the recent work either develops CSS algorithms that directly optimize this criterion or uses this criterion to assess the quality of the proposed CSS algorithms. In the present work, the CSS problem is formally defined as

Problem 1: (Column Subset Selection) Given an $m \times n$ matrix $A$ and an integer $l$, find a subset of columns $\mathcal{L}$ such that $|\mathcal{L}| = l$ and

$$\mathcal{L} = \arg\min_{\mathcal{S}} \|A - P^{(\mathcal{S})} A\|_F^2,$$

where $P^{(\mathcal{S})}$ is an $m \times m$ projection matrix which projects the columns of $A$ onto the span of the candidate columns $A_{:\mathcal{S}}$.

The criterion $F(\mathcal{S}) = \|A - P^{(\mathcal{S})} A\|_F^2$ represents the sum of squared errors between the original data matrix $A$ and its rank-$l$ column-based approximation (where $l = |\mathcal{S}|$),

$$\tilde{A}_{\mathcal{S}} = P^{(\mathcal{S})} A. \tag{1}$$
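To make the criterion of Problem 1 concrete, the following minimal Python sketch evaluates $F(\mathcal{S})$ for a small matrix. It is an illustration only: the names `dot` and `css_criterion`, the list-of-columns layout, and the use of Gram-Schmidt orthogonalization in place of an explicit projection matrix are assumptions of this sketch rather than anything prescribed by the paper. Pure Python is used so that the example runs without dependencies.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(u, v))


def css_criterion(columns, selected):
    """Return ||A - P^(S) A||_F^2 for a matrix A given as a list of columns.

    `selected` holds the indices of the candidate columns S.
    """
    # Build an orthonormal basis of span(A_:S) with modified Gram-Schmidt.
    basis = []
    for j in selected:
        q = list(columns[j])
        for b in basis:
            c = dot(b, q)
            q = [x - c * y for x, y in zip(q, b)]
        norm = dot(q, q) ** 0.5
        if norm > 1e-12:  # skip columns that are already in the span
            basis.append([x / norm for x in q])

    # Sum the squared norms of every column's residual after projection.
    total = 0.0
    for col in columns:
        r = list(col)
        for b in basis:
            c = dot(b, r)
            r = [x - c * y for x, y in zip(r, b)]
        total += dot(r, r)
    return total
```

For the three columns (1, 0), (0, 1) and (1, 1), selecting the third column alone gives a criterion of about 1, whereas selecting the first column alone gives 2, so the third column is the better single representative under this criterion.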

In other words, the criterion $F(\mathcal{S})$ calculates the Frobenius norm of the residual matrix $E = A - \tilde{A}_{\mathcal{S}}$. Other types of matrix norms can also be used to quantify the reconstruction error. Some of the recent work on the CSS problem [4], [5], [6] derives theoretical bounds for both the Frobenius and spectral norms of the residual matrix. The present work, however, focuses on developing algorithms that minimize the Frobenius norm of the residual matrix.

The projection matrix $P^{(\mathcal{S})}$ can be calculated as

$$P^{(\mathcal{S})} = A_{:\mathcal{S}} \left(A_{:\mathcal{S}}^T A_{:\mathcal{S}}\right)^{-1} A_{:\mathcal{S}}^T, \tag{2}$$

where $A_{:\mathcal{S}}$ is the sub-matrix of $A$ which consists of the columns corresponding to $\mathcal{S}$. It should be noted that if $\mathcal{S}$ is known, the term $\left(A_{:\mathcal{S}}^T A_{:\mathcal{S}}\right)^{-1} A_{:\mathcal{S}}^T A$ is the closed-form solution of the least-squares problem $T^* = \arg\min_T \|A - A_{:\mathcal{S}} T\|_F^2$.

The set of selected columns (i.e., data instances or features) can be directly presented to a data analyst to learn about the insights of the data, or they can be used to preprocess the data for further analysis. For instance, the selected columns can be used to obtain a low-dimensional representation of all columns into the subspace of selected ones. This representation can be obtained by calculating an orthogonal basis for the selected columns $Q$ and then embedding all columns of $A$ into the subspace of $Q$ as $W = Q^T A$. The selected columns can also be used to calculate a column-based low-rank approximation of $A$ [12]. Moreover, the leading singular values and vectors of the low-dimensional embedding $W$ can be used to approximate those of the data matrix.

IV. GREEDY CSS

The column subset selection criterion presented in Section III measures the reconstruction error of a data matrix based on the subset of selected columns. The minimization of this criterion is a combinatorial optimization problem whose optimal solution can be obtained in $O(n^l m n l)$ [5]. This section briefly describes a deterministic greedy algorithm for optimizing this criterion, which extends the greedy method for unsupervised feature selection recently proposed by Farahat et al. [15], [16]. A brief description of this method is included in this section for completeness.
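The reader is referred to [16] for the proofs of the different formulas presented in this section.

The greedy CSS [16] is based on the following recursive formula for the CSS criterion.

Theorem 1: Given a set of columns $\mathcal{S}$. For any $\mathcal{P} \subset \mathcal{S}$,

$$F(\mathcal{S}) = F(\mathcal{P}) - \|\tilde{E}_{\mathcal{R}}\|_F^2,$$

where $E = A - P^{(\mathcal{P})} A$, and $\tilde{E}_{\mathcal{R}}$ is the low-rank approximation of $E$ based on the subset $\mathcal{R} = \mathcal{S} \setminus \mathcal{P}$ of columns.

Proof: See [16, Theorem 2].

The term $\|\tilde{E}_{\mathcal{R}}\|_F^2$ represents the decrease in reconstruction error achieved by adding the subset $\mathcal{R}$ of columns to $\mathcal{P}$. This recursive formula allows the development of an efficient greedy algorithm that approximates the optimal solution of the column subset selection problem. At iteration $t$, the goal is to find column $p$ such that

$$p = \arg\min_i F(\mathcal{S} \cup \{i\}), \tag{3}$$

where $\mathcal{S}$ is the set of columns selected during the first $t - 1$ iterations.

Let $G$ be an $n \times n$ matrix which represents the inner-products over the columns of the residual matrix $E$, i.e., $G = E^T E$. The greedy selection problem can be simplified to (see [16, Section 6])

Problem 2: (Greedy Column Subset Selection) At iteration $t$, find column $p$ such that

$$p = \arg\max_i \frac{\|G_{:i}\|^2}{G_{ii}},$$

where $G = E^T E$, $E = A - \tilde{A}_{\mathcal{S}}$ and $\mathcal{S}$ is the set of columns selected during the first $t - 1$ iterations.

For iteration $t$, define $\boldsymbol{\delta} = G_{:p}$ and $\boldsymbol{\omega} = G_{:p} / \sqrt{G_{pp}} = \boldsymbol{\delta} / \sqrt{\delta_p}$. The vector $\boldsymbol{\delta}^{(t)}$ can be calculated in terms of $A$ and previous $\boldsymbol{\omega}$'s as

$$\boldsymbol{\delta}^{(t)} = A^T A_{:p} - \Sigma_{r=1}^{t-1} \omega_p^{(r)} \boldsymbol{\omega}^{(r)}. \tag{4}$$

The numerator and denominator of the selection criterion at each iteration can be calculated in an efficient manner without explicitly calculating $E$ or $G$ using the following theorem.

Theorem 2: Let $f_i = \|G_{:i}\|^2$ and $g_i = G_{ii}$ be the numerator and denominator of the criterion function for column $i$ respectively, $\mathbf{f} = [f_i]_{i=1..n}$, and $\mathbf{g} = [g_i]_{i=1..n}$. Then,

$$\mathbf{f}^{(t)} = \left(\mathbf{f} - 2\left(\boldsymbol{\omega} \circ \left(A^T A \boldsymbol{\omega} - \Sigma_{r=1}^{t-2} \left(\boldsymbol{\omega}^{(r)T} \boldsymbol{\omega}\right) \boldsymbol{\omega}^{(r)}\right)\right) + \|\boldsymbol{\omega}\|^2 \left(\boldsymbol{\omega} \circ \boldsymbol{\omega}\right)\right)^{(t-1)},$$

$$\mathbf{g}^{(t)} = \left(\mathbf{g} - \left(\boldsymbol{\omega} \circ \boldsymbol{\omega}\right)\right)^{(t-1)},$$

where $\circ$ represents the Hadamard product operator.

Proof: See [16, Theorem 4].

Algorithm 1 shows the complete greedy CSS algorithm.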
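The recursive updates of Equation (4) and Theorem 2 can be sketched in a few lines of Python. This is a hedged illustration and not the authors' implementation: the function names, the list-of-columns layout, the tolerance `1e-12` used to skip columns that are already (numerically) in the span of the selection, and the early exit when no usable column remains are all assumptions of this sketch.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(u, v))


def greedy_css(columns, l):
    """Greedily select up to l column indices of A (a list of columns)."""
    n, m = len(columns), len(columns[0])

    def gram_col(i):
        # Column i of A^T A, computed on demand.
        return [dot(c, columns[i]) for c in columns]

    # Initial criterion terms: f_i = ||A^T A_:i||^2 and g_i = A_:i^T A_:i.
    f = []
    for i in range(n):
        gi = gram_col(i)
        f.append(dot(gi, gi))
    g = [dot(c, c) for c in columns]

    selected, omegas = [], []
    for _ in range(l):
        candidates = [i for i in range(n)
                      if i not in selected and g[i] > 1e-12]
        if not candidates:
            break  # the remaining columns add nothing to the span
        p = max(candidates, key=lambda i: f[i] / g[i])
        selected.append(p)

        # delta = A^T A_:p - sum_r omega_p^(r) omega^(r)   (Equation 4)
        delta = gram_col(p)
        for w in omegas:
            delta = [d - w[p] * x for d, x in zip(delta, w)]
        scale = delta[p] ** 0.5
        omega = [d / scale for d in delta]

        # u = A^T A omega - sum_r (omega^(r)T omega) omega^(r)   (Theorem 2)
        a_omega = [sum(columns[i][k] * omega[i] for i in range(n))
                   for k in range(m)]
        u = [dot(c, a_omega) for c in columns]
        for w in omegas:
            coef = dot(w, omega)
            u = [x - coef * y for x, y in zip(u, w)]

        # Element-wise (Hadamard) updates of f and g.
        sq = dot(omega, omega)
        f = [fi - 2 * wi * ui + sq * wi * wi
             for fi, wi, ui in zip(f, omega, u)]
        g = [gi - wi * wi for gi, wi in zip(g, omega)]
        omegas.append(omega)
    return selected
```

On the columns (1, 0), (0, 1) and (1, 1), the initial ratios $f_i / g_i$ are 2, 2 and 3, so the third column is selected first. Note that neither $E$ nor $G$ is ever formed; only the vectors $\boldsymbol{\omega}^{(r)}$ are stored.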
The distributed CSS algorithm presented in this paper introduces a generalized variant of the greedy CSS algorithm in which a subset of columns is selected from a source matrix such that the reconstruction error of a target matrix is minimized. The distributed CSS method uses the greedy generalized CSS algorithm as the core method for selecting columns at different machines as well as in the final selection step.

Algorithm 1 Greedy Column Subset Selection
Input: Data matrix $A$, number of columns $l$
Output: Selected subset of columns $\mathcal{S}$
  Initialize $\mathcal{S} = \{\,\}$
  Initialize $f_i^{(0)} = \|A^T A_{:i}\|^2$, $g_i^{(0)} = A_{:i}^T A_{:i}$ for $i = 1 \ldots n$
  Repeat $t = 1 \to l$:
    $p = \arg\max_i f_i^{(t)} / g_i^{(t)}$, $\mathcal{S} = \mathcal{S} \cup \{p\}$
    $\boldsymbol{\delta}^{(t)} = A^T A_{:p} - \Sigma_{r=1}^{t-1} \omega_p^{(r)} \boldsymbol{\omega}^{(r)}$
    $\boldsymbol{\omega}^{(t)} = \boldsymbol{\delta}^{(t)} / \sqrt{\delta_p^{(t)}}$
    Update $f_i$'s, $g_i$'s (Theorem 2)

V. MAPREDUCE PARADIGM

MapReduce [7] was presented as a programming model to simplify large-scale data analytics over a distributed environment of commodity machines. The rationale behind MapReduce is to impose a set of constraints on data access at each individual machine and communication between different machines to ensure both the scalability and fault-tolerance of the analytical tasks. Currently, MapReduce is considered the de-facto solution for many data analytics tasks over large distributed clusters [17], [18].

A MapReduce job is executed in two phases of user-defined data transformation functions, namely, map and reduce phases. The input data is split into physical blocks distributed among the nodes. Each block is viewed as a list of key-value pairs. In the first phase, the key-value pairs of each input block $b$ are processed by a single map function running independently on the node where the block $b$ is stored. The key-value pairs are provided one-by-one to the map function. The output of the map function is another set of intermediate key-value pairs. The values associated with the same key across all nodes are grouped together and provided as an input to the reduce function in the second phase. Different groups of values are processed in parallel on different machines. The output of each reduce function is a third set of key-value pairs and collectively considered the output of the job. It is important to note that the set of the intermediate key-value pairs is moved across the network between the nodes, which incurs significant additional execution time when much data are to be moved.
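The two-phase flow described above can be imitated in a single process, which is often enough to prototype a job before running it on a cluster. The sketch below is a toy simulation under stated assumptions: `run_mapreduce`, `map_row_sums` and `reduce_sum` are hypothetical names, the map function here receives a whole block at once (playing the role of one map task) rather than one pair at a time, and nothing is actually distributed.

```python
from collections import defaultdict


def run_mapreduce(blocks, map_fn, reduce_fn):
    """Simulate one MapReduce job over a list of input blocks."""
    groups = defaultdict(list)
    for block in blocks:                # one map task per physical block
        for key, value in map_fn(block):
            groups[key].append(value)   # shuffle: group values by key
    output = []
    for key in sorted(groups):          # one reduce call per distinct key
        output.extend(reduce_fn(key, groups[key]))
    return output


def map_row_sums(block):
    """Emit (row index, partial row sum) for a block of (index, column) pairs."""
    m = len(block[0][1])
    partial = [0.0] * m
    for _, column in block:
        for j in range(m):
            partial[j] += column[j]
    return [(j, partial[j]) for j in range(m)]


def reduce_sum(key, values):
    """Add up the partial sums that share the same row index."""
    return [(key, sum(values))]


# A 2 x 3 matrix stored column-wise in two blocks.
blocks = [[(0, [1.0, 2.0]), (1, [3.0, 4.0])], [(2, [5.0, 6.0])]]
row_sums = run_mapreduce(blocks, map_row_sums, reduce_sum)
```

Here `row_sums` holds one pair per row of the matrix. Only the per-block partial sums cross the simulated shuffle, which mirrors the point made above about keeping the intermediate key-value pairs small.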
For complex analytical tasks, multiple jobs are typically chained together [17] and/or many rounds of the same job are executed on the input data set [18].

In addition to the programming model constraints, Karloff et al. [19] defined a set of computational constraints that ensure the scalability and the efficiency of MapReduce-based analytical tasks. These computational constraints limit the used memory size at each machine, the output size of both the map and reduce functions and the number of rounds used to complete a certain task. The MapReduce algorithms presented in this paper adhere to both the programming model constraints and the computational constraints. The proposed algorithm also aims at minimizing the overall running time of the distributed column subset selection task to facilitate interactive data analytics.

VI. DISTRIBUTED CSS ON MAPREDUCE

This section describes a MapReduce algorithm for the distributed column subset selection problem. Given a big data matrix $A$ whose columns are distributed across different machines, the goal is to select a subset of columns $\mathcal{S}$ from $A$ such that the CSS criterion $F(\mathcal{S})$ is minimized.

One naïve approach to perform distributed column subset selection is to select different subsets of columns from the sub-matrices stored on different machines. The selected subsets are then sent to a centralized machine where an additional selection step is optionally performed to filter out irrelevant or redundant columns. Let $A_{(i)}$ be the sub-matrix stored at machine $i$; the naïve approach optimizes the following function:

$$\Sigma_{i=1}^{c} \|A_{(i)} - P^{(\mathcal{L}_{(i)})} A_{(i)}\|_F^2, \tag{5}$$

where $\mathcal{L}_{(i)}$ is the set of columns selected from $A_{(i)}$ and $c$ is the number of physical blocks of data. The resulting set of columns is the union of the sets selected from different sub-matrices: $\mathcal{L} = \cup_{i=1}^{c} \mathcal{L}_{(i)}$. The set $\mathcal{L}$ can further be reduced by invoking another selection process in which a smaller subset of columns is selected from $A_{:\mathcal{L}}$.

The naïve approach, however simple, is prone to missing relevant columns.
This is because the selection at each machine is based on approximating a local sub-matrix, and accordingly there is no way to determine whether the selected columns are globally relevant or not. For instance, suppose the extreme case where all the truly representative columns happen to be loaded on a single machine. In this case, the algorithm will select a less-than-required number of columns from that machine and many irrelevant columns from other machines.

In order to alleviate this problem, the different machines have to select columns that best approximate a common representation of the data matrix. To achieve that, the proposed algorithm first learns a concise representation of the span of the big data matrix. This concise representation is relatively small and it can be sent over to all machines. After that, each machine can select columns from its sub-matrix that approximate this concise representation. The proposed algorithm uses random projection to learn this concise representation, and proposes a generalized Column Subset Selection (CSS) method to select columns from different machines. The details of the proposed methods are explained in the rest of this section.

A. Random Projection

The first step of the proposed algorithm is to learn a concise representation $B$ for a distributed data matrix $A$. In the proposed approach, a random projection method is employed. Random projection [20], [21], [22] is a well-known technique for dealing with the curse-of-dimensionality problem. Let $\Omega$ be a random projection matrix of size $n \times r$. Given a data matrix $X$ of size $m \times n$, the random projection can be calculated as $Y = X\Omega$. It has been shown that applying random projection $\Omega$ to $X$ preserves the pairwise distances between vectors in the row space of $X$ with a high probability [20]:

$$(1 - \epsilon) \|X_{i:} - X_{j:}\| \leq \|X_{i:}\Omega - X_{j:}\Omega\| \leq (1 + \epsilon) \|X_{i:} - X_{j:}\|, \tag{6}$$

where $\epsilon$ is an arbitrarily small factor.

Since the CSS criterion $F(\mathcal{S})$ measures the reconstruction error between the big data matrix $A$ and its low-rank approximation $P^{(\mathcal{S})} A$, it essentially measures the sum of the distances between the original rows and their approximations. This means that when applying random projection to both $A$ and $P^{(\mathcal{S})} A$, the reconstruction error of the original data matrix $A$ will be approximately equal to that of $A\Omega$ when both are approximated using the subset of selected columns:

$$\|A - P^{(\mathcal{S})} A\|_F^2 \approx \|A\Omega - P^{(\mathcal{S})} A\Omega\|_F^2. \tag{7}$$

So, instead of optimizing $\|A - P^{(\mathcal{S})} A\|_F^2$, the distributed CSS can approximately optimize $\|A\Omega - P^{(\mathcal{S})} A\Omega\|_F^2$.

Let $B = A\Omega$. The distributed column subset selection problem can be formally defined as

Problem 3: (Distributed Column Subset Selection) Given an $m \times n_{(i)}$ sub-matrix $A_{(i)}$ which is stored at node $i$ and an integer $l_{(i)}$, find a subset of columns $\mathcal{L}_{(i)}$ such that $|\mathcal{L}_{(i)}| = l_{(i)}$ and

$$\mathcal{L}_{(i)} = \arg\min_{\mathcal{S}} \|B - P^{(\mathcal{S})} B\|_F^2,$$

where $B = A\Omega$, $\Omega$ is an $n \times r$ random projection matrix, $\mathcal{S}$ is the set of the indices of the candidate columns and $\mathcal{L}_{(i)}$ is the set of the indices of the selected columns from $A_{(i)}$.

A key observation here is that random projection matrices whose entries are sampled i.i.d. from some univariate distribution $\Psi$ can be exploited to compute random projection on MapReduce in a very efficient manner. Examples of such matrices are Gaussian random matrices [20], uniform random sign ($\pm 1$) matrices [21], and sparse random sign matrices [22].
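One way to see why i.i.d. entries help: $B = A\Omega$ equals the sum, over the columns of $A$, of the outer product of column $A_{:i}$ with row $\Omega_{i:}$, so each block can sample rows of $\Omega$ on the fly and accumulate its own partial sum. The Python sketch below illustrates this under stated assumptions: the function names are hypothetical, $\Psi$ is taken to be the standard Gaussian (one of the choices listed above), and a single seeded generator is shared by all blocks purely to make the toy example reproducible, whereas independent mappers would each sample on their own.

```python
import random


def project_block(block, r, rng):
    """Mapper side: accumulate sum_i A_:i * Omega_i: over one block.

    `block` is a list of (column index, column) pairs.
    """
    m = len(block[0][1])
    partial = [[0.0] * r for _ in range(m)]
    for _, column in block:
        # A fresh row of Omega for this column; Omega is never stored.
        v = [rng.gauss(0.0, 1.0) for _ in range(r)]
        for j in range(m):
            for k in range(r):
                partial[j][k] += column[j] * v[k]
    return partial


def random_projection(blocks, r, seed=0):
    """Return B = A * Omega as a list of m rows, each of length r."""
    rng = random.Random(seed)
    partials = [project_block(block, r, rng) for block in blocks]
    m = len(partials[0])
    # Reducer side: row j of B is the sum of row j over all partial matrices.
    return [[sum(p[j][k] for p in partials) for k in range(r)]
            for j in range(m)]
```

Each block contributes a single $m \times r$ partial matrix regardless of how many columns it holds, which is what keeps the amount of shuffled data proportional to the number of blocks rather than the number of columns.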
In order to implement random projection on MapReduce, the data matrix $A$ is distributed in a column-wise fashion and viewed as pairs of $\langle i, A_{:i} \rangle$ where $A_{:i}$ is the $i$-th column of $A$. Recall that $B = A\Omega$ can be rewritten as

$$B = \Sigma_{i=1}^{n} A_{:i} \Omega_{i:}, \tag{8}$$

and since the map function is provided one column of $A$ at a time, one does not need to worry about pre-computing the full matrix $\Omega$. In fact, for each input column $A_{:i}$, a new vector $\Omega_{i:}$ needs to be sampled from $\Psi$. So, each input column generates a matrix of size $m \times r$, which means that $O(nmr)$ data should be moved across the network to sum the generated $n$ matrices at $m$ independent reducers, each summing a row $B_{j:}$ to obtain $B$. To minimize that network cost, an in-memory summation can be carried out over the generated $m \times r$ matrices at each mapper. This can be done incrementally after processing each column of $A$. That optimization reduces the network cost to $O(cmr)$, where $c$ is the number of physical blocks of the matrix (see footnote 1). Algorithm 2 outlines the proposed random projection algorithm. The term emit is used to refer to outputting new $\langle$key, value$\rangle$ pairs from a mapper or a reducer.

Algorithm 2 Fast Random Projection on MapReduce
Input: Data matrix $A$, univariate distribution $\Psi$, number of dimensions $r$
Output: Concise representation $B = A\Omega$, $\Omega_{ij} \sim \Psi \;\forall i, j$
  map:
    $\bar{B} = [0]_{m \times r}$
    foreach $\langle i, A_{:i} \rangle$
      Generate $\mathbf{v} = [v_1, v_2, \ldots, v_r]$, $v_j \sim \Psi$
      $\bar{B} = \bar{B} + A_{:i} \mathbf{v}$
    for $j = 1$ to $m$
      emit $\langle j, \bar{B}_{j:} \rangle$
  reduce:
    foreach $\langle j, [[\bar{B}_{(1)}]_{j:}, [\bar{B}_{(2)}]_{j:}, \ldots, [\bar{B}_{(c)}]_{j:}] \rangle$
      $B_{j:} = \Sigma_{i=1}^{c} [\bar{B}_{(i)}]_{j:}$
      emit $\langle j, B_{j:} \rangle$

B. Generalized CSS

This section presents the generalized column subset selection algorithm which will be used to perform the selection of columns at different machines. While Problem 1 is concerned with the selection of a subset of columns from a data matrix which best represent other columns of the same matrix, Problem 3 selects a subset of columns from a source matrix which best represent the columns of a different target matrix.
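The objective function of Problem 3 represents the reconstruction error of the target matrix $B$ based on the selected columns from the source matrix, and the term $P^{(\mathcal{S})} = A_{:\mathcal{S}} \left(A_{:\mathcal{S}}^T A_{:\mathcal{S}}\right)^{-1} A_{:\mathcal{S}}^T$ is the projection matrix which projects the columns of $B$ onto the subspace of the columns selected from $A$. In order to optimize this new criterion, a greedy algorithm can be introduced.

Footnote 1: The in-memory summation can also be replaced by a MapReduce combiner [7].

Let $F(\mathcal{S}) = \|B - P^{(\mathcal{S})} B\|_F^2$ be the distributed CSS criterion. The following theorem derives a recursive formula for $F(\mathcal{S})$.

Theorem 3: Given a set of columns $\mathcal{S}$. For any $\mathcal{P} \subset \mathcal{S}$,

$$F(\mathcal{S}) = F(\mathcal{P}) - \|\tilde{F}_{\mathcal{R}}\|_F^2,$$

where $F = B - P^{(\mathcal{P})} B$, and $\tilde{F}_{\mathcal{R}}$ is the low-rank approximation of $F$ based on the subset $\mathcal{R} = \mathcal{S} \setminus \mathcal{P}$ of columns of $E = A - P^{(\mathcal{P})} A$.

Proof: Using the recursive formula for the low-rank approximation of $A$, $\tilde{A}_{\mathcal{S}} = \tilde{A}_{\mathcal{P}} + \tilde{E}_{\mathcal{R}}$, and multiplying both sides with $\Omega$ gives

$$\tilde{A}_{\mathcal{S}} \Omega = \tilde{A}_{\mathcal{P}} \Omega + \tilde{E}_{\mathcal{R}} \Omega.$$

Low-rank approximations can be written in terms of projection matrices as

$$P^{(\mathcal{S})} A\Omega = P^{(\mathcal{P})} A\Omega + R^{(\mathcal{R})} E\Omega.$$

Using $B = A\Omega$,

$$P^{(\mathcal{S})} B = P^{(\mathcal{P})} B + R^{(\mathcal{R})} E\Omega.$$

Let $F = E\Omega$. The matrix $F$ is the residual after approximating $B$ using the set $\mathcal{P}$ of columns:

$$F = E\Omega = \left(A - P^{(\mathcal{P})} A\right)\Omega = A\Omega - P^{(\mathcal{P})} A\Omega = B - P^{(\mathcal{P})} B.$$

This means that

$$P^{(\mathcal{S})} B = P^{(\mathcal{P})} B + R^{(\mathcal{R})} F.$$

Substituting in $F(\mathcal{S}) = \|B - P^{(\mathcal{S})} B\|_F^2$ gives

$$F(\mathcal{S}) = \|B - P^{(\mathcal{P})} B - R^{(\mathcal{R})} F\|_F^2.$$

Using $F = B - P^{(\mathcal{P})} B$ gives

$$F(\mathcal{S}) = \|F - R^{(\mathcal{R})} F\|_F^2.$$

Using the relation between the Frobenius norm and the trace,

$$F(\mathcal{S}) = \mathrm{trace}\left(\left(F - R^{(\mathcal{R})} F\right)^T \left(F - R^{(\mathcal{R})} F\right)\right) = \mathrm{trace}\left(F^T F - 2 F^T R^{(\mathcal{R})} F + F^T R^{(\mathcal{R})} R^{(\mathcal{R})} F\right) = \mathrm{trace}\left(F^T F - F^T R^{(\mathcal{R})} F\right) = \|F\|_F^2 - \|R^{(\mathcal{R})} F\|_F^2.$$

Using $F(\mathcal{P}) = \|F\|_F^2$ and $\tilde{F}_{\mathcal{R}} = R^{(\mathcal{R})} F$ proves the theorem.

Using the recursive formula for $F(\mathcal{S} \cup \{i\})$ allows the development of a greedy algorithm which at iteration $t$ optimizes

$$p = \arg\min_i F(\mathcal{S} \cup \{i\}) = \arg\max_i \|\tilde{F}_{\{i\}}\|_F^2. \tag{9}$$

Let $G = E^T E$ and $H = F^T E$. The objective function of this optimization problem can be simplified as follows:

$$\|\tilde{F}_{\{i\}}\|_F^2 = \left\|E_{:i} \left(E_{:i}^T E_{:i}\right)^{-1} E_{:i}^T F\right\|_F^2 = \mathrm{trace}\left(F^T E_{:i} \left(E_{:i}^T E_{:i}\right)^{-1} E_{:i}^T F\right) = \frac{\|F^T E_{:i}\|^2}{E_{:i}^T E_{:i}} = \frac{\|H_{:i}\|^2}{G_{ii}}. \tag{10}$$

This allows the definition of the following generalized CSS problem.

Problem 4: (Greedy Generalized CSS) At iteration $t$, find column $p$ such that

$$p = \arg\max_i \frac{\|H_{:i}\|^2}{G_{ii}},$$

where $H = F^T E$, $G = E^T E$, $F = B - P^{(\mathcal{S})} B$, $E = A - P^{(\mathcal{S})} A$ and $\mathcal{S}$ is the set of columns selected during the first $t - 1$ iterations.

Algorithm 3 Greedy Generalized Column Subset Selection
Input: Source matrix $A$, target matrix $B$, number of columns $l$
Output: Selected subset of columns $\mathcal{S}$
  Initialize $f_i^{(0)} = \|B^T A_{:i}\|^2$, $g_i^{(0)} = A_{:i}^T A_{:i}$ for $i = 1 \ldots n$
  Repeat $t = 1 \to l$:
    $p = \arg\max_i f_i^{(t)} / g_i^{(t)}$, $\mathcal{S} = \mathcal{S} \cup \{p\}$
    $\boldsymbol{\delta}^{(t)} = A^T A_{:p} - \Sigma_{r=1}^{t-1} \omega_p^{(r)} \boldsymbol{\omega}^{(r)}$
    $\boldsymbol{\gamma}^{(t)} = B^T A_{:p} - \Sigma_{r=1}^{t-1} \omega_p^{(r)} \boldsymbol{\upsilon}^{(r)}$
    $\boldsymbol{\omega}^{(t)} = \boldsymbol{\delta}^{(t)} / \sqrt{\delta_p^{(t)}}$, $\boldsymbol{\upsilon}^{(t)} = \boldsymbol{\gamma}^{(t)} / \sqrt{\delta_p^{(t)}}$
    Update $f_i$'s, $g_i$'s (Theorem 4)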
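Problem 4 can be prototyped directly from its definition by keeping the residuals $E$ and $F$ explicitly and deflating them after each selection. The Python sketch below does exactly that; it is deliberately unoptimized and is not the recursive, memory-efficient form of Algorithm 3. The second function shows how such a routine might be used once per block and then once centrally, in the spirit of the two-phase selection described in this section. All names, the list-of-columns layout, the tolerance `1e-12` and the `per_block` parameter are assumptions of this sketch.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(u, v))


def generalized_css(source, target, l):
    """Pick up to l columns of `source` that best reconstruct `target`.

    Both matrices are lists of columns with the same number of rows.
    """
    E = [list(c) for c in source]  # residual of the source matrix A
    F = [list(c) for c in target]  # residual of the target matrix B
    selected = []
    for _ in range(l):
        best, best_score = None, 0.0
        for i, e in enumerate(E):
            g = dot(e, e)  # G_ii
            if i in selected or g <= 1e-12:
                continue
            # ||H_:i||^2 / G_ii, where H_:i = F^T E_:i
            score = sum(dot(fc, e) ** 2 for fc in F) / g
            if best is None or score > best_score:
                best, best_score = i, score
        if best is None:
            break  # no remaining column can reduce the error
        selected.append(best)
        # Remove the component along the chosen residual column from E and F.
        pivot = list(E[best])
        g = dot(pivot, pivot)
        for cols in (E, F):
            for c in cols:
                coef = dot(pivot, c) / g
                c[:] = [x - coef * y for x, y in zip(c, pivot)]
    return selected


def distributed_css(blocks, target, l, per_block):
    """Select per_block columns in each block, then l columns overall.

    `blocks` is a list of blocks; each block is a list of
    (global column index, column) pairs. Returns global column indices.
    """
    candidates = []
    for block in blocks:  # independent local selections ("map")
        local = generalized_css([c for _, c in block], target, per_block)
        candidates.extend(block[j] for j in local)
    # Central selection over the gathered candidates ("reduce").
    final = generalized_css([c for _, c in candidates], target, l)
    return [candidates[j][0] for j in final]
```

In this toy setting the target passed to both phases is the same shared matrix, so every local selection is judged against a common representation rather than against its own block.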
For iteration $t$, define $\boldsymbol{\gamma} = H_{:p}$ and $\boldsymbol{\upsilon} = H_{:p} / \sqrt{G_{pp}} = \boldsymbol{\gamma} / \sqrt{\delta_p}$. The vector $\boldsymbol{\gamma}^{(t)}$ can be calculated in terms of $A$, $B$ and previous $\boldsymbol{\omega}$'s and $\boldsymbol{\upsilon}$'s as

$$\boldsymbol{\gamma}^{(t)} = B^T A_{:p} - \Sigma_{r=1}^{t-1} \omega_p^{(r)} \boldsymbol{\upsilon}^{(r)}.$$

Similarly, the numerator and denominator of the selection criterion at each iteration can be calculated in an efficient manner using the following theorem.

Theorem 4: Let $f_i = \|H_{:i}\|^2$ and $g_i = G_{ii}$ be the numerator and denominator of the greedy criterion function for column $i$ respectively, $\mathbf{f} = [f_i]_{i=1..n}$, and $\mathbf{g} = [g_i]_{i=1..n}$. Then,

$$\mathbf{f}^{(t)} = \left(\mathbf{f} - 2\left(\boldsymbol{\omega} \circ \left(A^T B \boldsymbol{\upsilon} - \Sigma_{r=1}^{t-2} \left(\boldsymbol{\upsilon}^{(r)T} \boldsymbol{\upsilon}\right) \boldsymbol{\omega}^{(r)}\right)\right) + \|\boldsymbol{\upsilon}\|^2 \left(\boldsymbol{\omega} \circ \boldsymbol{\omega}\right)\right)^{(t-1)},$$

$$\mathbf{g}^{(t)} = \left(\mathbf{g} - \left(\boldsymbol{\omega} \circ \boldsymbol{\omega}\right)\right)^{(t-1)},$$

where $\circ$ represents the Hadamard product operator.

As outlined in Section VI-A, the algorithm's distribution strategy is based on sharing the concise representation of the data $B$ among all mappers. Then, $l_{(b)}$ columns are selected independently at each mapper using the generalized CSS algorithm. A second phase of selection is run over the $\Sigma_{b=1}^{c} l_{(b)}$ columns (where $c$ is the number of input blocks) to find the best $l$ columns to represent $B$. Different ways can be used to set $l_{(b)}$ for each input block $b$. In the context of this paper, the set of $l_{(b)}$ is assigned uniform values for all blocks (i.e., $l_{(b)} = l/c \;\forall b \in 1, 2, \ldots, c$). Other methods are to be considered in future extensions. Algorithm 4 sketches the MapReduce implementation of the distributed CSS algorithm. It should be emphasized that the proposed MapReduce algorithm requires only two passes over the data set and it moves very little data across the network.

Algorithm 4 Distributed CSS on MapReduce
Input: Matrix $A$ of size $m \times n$, concise representation $B$, number of columns $l$
Output: Selected columns $C$
  map:
    $A_{(b)} = [\,]$
    foreach $\langle i, A_{:i} \rangle$
      $A_{(b)} = [A_{(b)} \; A_{:i}]$
    $\mathcal{S}_{(b)} = \mathrm{GeneralizedCSS}(A_{(b)}, B, l_{(b)})$
    foreach $j$ in $\mathcal{S}_{(b)}$
      emit $\langle 0, [A_{(b)}]_{:j} \rangle$
  reduce:
    For all values $\{[A_{(1)}]_{:\mathcal{S}_{(1)}}, [A_{(2)}]_{:\mathcal{S}_{(2)}}, \ldots, [A_{(c)}]_{:\mathcal{S}_{(c)}}\}$
      $A_{(0)} = [[A_{(1)}]_{:\mathcal{S}_{(1)}}, [A_{(2)}]_{:\mathcal{S}_{(2)}}, \ldots, [A_{(c)}]_{:\mathcal{S}_{(c)}}]$
      $\mathcal{S} = \mathrm{GeneralizedCSS}(A_{(0)}, B, l)$
      foreach $j$ in $\mathcal{S}$
        emit $\langle 0, [A_{(0)}]_{:j} \rangle$

VII. RELATED WORK

Different approaches have been proposed for selecting a subset of representative columns from a data matrix. This section focuses on briefly describing these approaches and their applicability to massively distributed data matrices. The Column Subset Selection (CSS) methods can be generally categorized into randomized, deterministic and hybrid.

The randomized methods sample a subset of columns from the original matrix using carefully chosen sampling probabilities. Frieze et al. [9] were the first to suggest the idea of randomly sampling $l$ columns from a matrix and using these columns to calculate a rank-$k$ approximation of the matrix (where $l \geq k$). That work of Frieze et al. was followed by different papers [10], [11] that enhanced the algorithm by proposing different sampling probabilities. Drineas et al.
[12] proposed a subspace sampling method which samples columns using probabilities proportional to the norms of the rows of the top $k$ right singular vectors of $A$. Deshpande et al. [13] proposed an adaptive sampling method which updates the sampling probabilities based on the columns selected so far. Column subset selection with uniform sampling can be easily implemented on MapReduce. For non-uniform sampling, the efficiency of implementing the selection on MapReduce is determined by how easy it is to calculate the sampling probabilities. The calculations of probabilities that depend on calculating the leading singular values and vectors are time-consuming on MapReduce. On the other hand, adaptive sampling methods are computationally very complex as they depend on calculating the residual of the whole data matrix after each iteration.

The second category of methods employs a deterministic algorithm for selecting columns such that some criterion function is minimized. This criterion function usually quantifies the reconstruction error of the data matrix based on the subset of selected columns. The deterministic methods are slower, but more accurate, than the randomized ones. In the area of numerical linear algebra, the column pivoting method exploited by the QR decomposition [23] permutes the columns of the matrix based on their norms to enhance the numerical stability of the QR decomposition algorithm. The first $l$ columns of the permuted matrix can be directly selected as representative columns. Besides methods based on QR decomposition, different recent methods have been proposed for directly selecting a subset of columns from the data matrix. Boutsidis et al. [4] proposed a deterministic column subset selection method which first groups columns into clusters and then selects a subset of columns from each cluster. Çivril and Magdon-Ismail [14] presented a deterministic algorithm which greedily selects columns from the data matrix that best represent the right leading singular values of the matrix. Recently, Boutsidis et al.
[6] presented a column subset selection algorithm which first calculates the top-$k$ right singular values of the data matrix (where $k$ is the target rank) and then uses deterministic sparsification methods to select $l \geq k$ columns from the data matrix. Besides, other deterministic algorithms have been proposed for selecting columns based on the volume defined by them and the origin [24], [25]. The deterministic algorithms are more complex to implement on MapReduce. For instance, it is time-consuming to calculate the leading singular values and vectors of a massively distributed matrix or to cluster its columns using k-means. It is also computationally complex to calculate QR decomposition with pivoting. Moreover, the recently proposed algorithms for volume sampling are more complex than other CSS algorithms as well as the one presented in this paper, and they are infeasible for large data sets.

A third category of CSS techniques is the hybrid methods which combine the benefits of both the randomized and deterministic methods. In these methods, a large subset of columns is randomly sampled from the columns of the data matrix and then a deterministic step is employed to reduce the number of selected columns to the desired rank. For instance, Boutsidis et al. [5] proposed a two-stage hybrid CSS algorithm which first samples $O(l \log l)$ columns based on probabilities calculated using the $l$-leading right singular vectors, and then employs a deterministic algorithm to select exactly $l$ columns from the columns sampled in the first stage. However, the algorithm depends on calculating the leading $l$ right singular vectors, which is time-consuming for large data sets. The hybrid algorithms for CSS can be easily implemented on MapReduce if the randomized selection step is MapReduce-efficient and the deterministic selection step can be implemented on a single machine. This is usually true if the number of columns selected by the randomized step is relatively small.

In comparison to other CSS methods, the algorithm proposed in this paper is designed to be MapReduce-efficient. In the distributed selection step, representative columns are selected based on a common representation. The common representation proposed in this work is based on random projection. This is more efficient than the work of Çivril and Magdon-Ismail [14] which selects columns based on the leading singular vectors. In comparison to other deterministic methods, the proposed algorithm is specifically designed to be parallelized, which makes it applicable to big data matrices whose columns are massively distributed. On the other hand, the two-step scheme of distributed then centralized selection is similar to that of the hybrid CSS methods. The proposed algorithm, however, employs a deterministic algorithm at the distributed selection phase, which is more accurate than the randomized selection employed by hybrid methods in the first phase.

VIII. EXPERIMENTS

Experiments have been conducted on two big data sets to evaluate the efficiency and effectiveness of the proposed distributed CSS algorithm on MapReduce.

Table I
THE PROPERTIES OF THE DATA SETS USED TO EVALUATE THE DISTRIBUTED CSS METHOD.

Data set        Type        # Instances   # Features
RCV1-200K       Documents   193,844       47,236
TinyImages-1M   Images      1 million     1,024
The properties of the data sets are described in Table I. The RCV1-200K data set is a subset of the RCV1 data set [26] which has been prepared and used by Chen et al. [27] to evaluate parallel spectral clustering algorithms. The TinyImages-1M data set contains 1 million images that were sampled from the 80 million tiny images data set [28] and converted to grayscale.

Similar to previous work on CSS, the different methods are evaluated according to their ability to minimize the reconstruction error of the data matrix based on the subset of selected columns. In order to quantify the reconstruction error across different data sets, a relative accuracy measure is defined as

$$\text{Relative Accuracy} = \frac{\|A - \tilde{A}_U\|_F - \|A - \tilde{A}_S\|_F}{\|A - \tilde{A}_U\|_F - \|A - \tilde{A}_l\|_F} \times 100\%,$$

where $\tilde{A}_U$ is the rank-l approximation of the data matrix based on a random subset U of columns, $\tilde{A}_S$ is the rank-l approximation of the data matrix based on the subset S of columns, and $\tilde{A}_l$ is the best rank-l approximation of the data matrix calculated using the Singular Value Decomposition (SVD). This measure compares different methods relative to uniform sampling as a baseline, with higher values indicating better performance.

The experiments were conducted on Amazon EC2 clusters (see footnote 2), which consist of 10 instances for the RCV1-200K data set and 20 instances for the TinyImages-1M data set. Each instance has 7.5 GB of memory and a two-core processor. All instances run Debian and Hadoop. The data sets were converted into a binary format in the form of a sequence of key-value pairs. Each pair consisted of a column index as the key and a vector of the column entries as the value. That is the standard format used in Mahout (see footnote 3) for storing distributed matrices.

The distributed CSS method has been compared with different state-of-the-art methods. It should be noted that most of these methods were not designed with the goal of applying them to massively distributed data, and hence their implementation on MapReduce is not straightforward.
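The relative accuracy measure defined above can be illustrated with a small in-memory NumPy sketch. It assumes that the approximation based on a column subset is the projection of the data matrix onto the span of the selected columns, which has rank at most l when l columns are selected. The function names and the single-machine setting are illustrative assumptions and are not the MapReduce implementation evaluated in this section.

```python
import numpy as np


def project_onto_columns(A, cols):
    """Approximate A by projecting it onto the span of the columns in `cols`.

    Assumption: this projection plays the role of the column-based
    approximation in the relative accuracy measure.
    """
    Q, _ = np.linalg.qr(A[:, cols])      # orthonormal basis of the selected columns
    return Q @ (Q.T @ A)


def best_rank_l(A, l):
    """Best rank-l approximation of A from the truncated SVD."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return (U[:, :l] * s[:l]) @ Vt[:l]


def relative_accuracy(A, S, U_cols):
    """Relative accuracy (%) of subset S against the uniform baseline U_cols."""
    l = len(S)
    err = lambda M: np.linalg.norm(A - M, "fro")
    e_u = err(project_onto_columns(A, U_cols))   # baseline error
    e_s = err(project_onto_columns(A, S))        # error of the evaluated subset
    e_l = err(best_rank_l(A, l))                 # smallest achievable rank-l error
    return 100.0 * (e_u - e_s) / (e_u - e_l)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.standard_normal((20, 50))
    baseline = rng.choice(A.shape[1], size=5, replace=False)
    candidate = rng.choice(A.shape[1], size=5, replace=False)
    print(relative_accuracy(A, candidate, baseline))
```

Under this definition, a subset that matches the baseline scores 0%, a subset that attains the best rank-l error scores 100%, and a subset that is worse than the baseline receives a negative score.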
The designed experiments nevertheless used the best practices for implementing the different steps of these methods on MapReduce, to the best of the authors' knowledge. Specifically, the following distributed CSS algorithms were compared.

UniNoRep: uniform sampling of columns without replacement. This is usually the worst performing method in terms of approximation error, and it is used as a baseline to evaluate methods across different data sets.

HybridUni, HybridCol and HybridSVD: different distributed variants of the hybrid CSS algorithm which can be implemented efficiently on MapReduce. In the randomized phase, the three methods use probabilities calculated based on uniform sampling, column norms, and the norms of the rows of the leading singular vectors, respectively. The number of selected columns in the randomized phase is set to l log l. In the deterministic phase, the centralized greedy CSS is employed to select exactly l columns from the randomly sampled columns.

DistApproxSVD: an extension of the centralized algorithm for sparse approximation of the Singular Value Decomposition (SVD) [14]. The distributed CSS algorithm presented in this paper (Algorithm 4) is used to select columns that best approximate the leading singular vectors (by setting B = U_k Σ_k). The use of the distributed CSS algorithm extends the original algorithm proposed by Çivril and Magdon-Ismail [14] to work on distributed matrices. In order to allow efficient implementation on MapReduce, the number of leading singular vectors is set to 100.

DistGreedyCSS: the distributed column subset selection method described in Algorithm 4. For all experiments, the dimension of the random projection matrix is set to 100. This makes the size of the concise representation the same as that of the DistApproxSVD method. Two types of random matrices are used for random projection: (1) a dense Gaussian random matrix (rnd), and (2) a sparse random sign matrix (ssgn).

For the methods that require the calculation of the Singular Value Decomposition (SVD), the Stochastic SVD (SSVD) algorithm [29] is used to approximate the leading singular values and vectors of the data matrix. The use of SSVD significantly reduces the run time of the original SVD-based algorithms while achieving comparable accuracy. In the conducted experiments, the SSVD implementation of Mahout was used.

Footnote 2: Amazon Elastic Compute Cloud (EC2).
Footnote 3: Mahout is an Apache project for implementing machine learning algorithms on Hadoop.

Table II. The run times and relative accuracies of different CSS methods. The best performing method for each l is highlighted in bold, and the second best method is underlined. Negative measures indicate methods that perform worse than uniform sampling.
Columns: run time (minutes) and relative accuracy (%), each for l = 10, l = 100 and l = 500.
Rows for RCV1-200K: Uniform - Baseline; Hybrid (Uniform); Hybrid (Column Norms); Hybrid (SVD-based); Distributed Approx. SVD; Distributed Greedy CSS (rnd); Distributed Greedy CSS (ssgn).
Rows for Tiny Images - 1M: Uniform - Baseline; Hybrid (Uniform); Hybrid (Column Norms); Hybrid (SVD-based); Distributed Approx. SVD; Distributed Greedy CSS (ssgn).

Table II shows the run times and relative accuracies for different CSS methods. It can be observed from the table that for the RCV1-200K data set, the DistGreedyCSS methods (with random Gaussian and sparse random sign matrices) outperform all other methods in terms of relative accuracies.
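The two kinds of random projection matrices used by DistGreedyCSS (rnd and ssgn) can be sketched as follows. This is a minimal single-machine illustration under stated assumptions: the concise representation is taken to be B = AΩ, where Ω has one row per column of A, and the sparse sign matrix follows the database-friendly construction of [21] with sparsity parameter s = 3. The function names, the parameter s, and the block-wise summation that mimics combining partial results from different machines are hypothetical and are not taken from Algorithm 4.

```python
import numpy as np


def gaussian_projection(n, c, rng):
    """Dense Gaussian random matrix (the 'rnd' variant), shape n x c."""
    return rng.standard_normal((n, c))


def sparse_sign_projection(n, c, rng, s=3):
    """Sparse random sign matrix (the 'ssgn' variant), shape n x c.

    Assumption: entries are +sqrt(s) or -sqrt(s) with probability 1/(2s)
    each, and 0 with probability 1 - 1/s.
    """
    values = np.array([np.sqrt(s), 0.0, -np.sqrt(s)])
    probs = [1.0 / (2 * s), 1.0 - 1.0 / s, 1.0 / (2 * s)]
    return rng.choice(values, size=(n, c), p=probs)


def concise_representation(A, omega):
    """Concise representation B = A @ omega of an m x n matrix A (m x c)."""
    return A @ omega


def blockwise_representation(A, omega, n_blocks):
    """Same B, accumulated from column blocks of A.

    Each block stands in for the sub-matrix held by one machine; the
    partial products are summed, as a reduce step would do.
    """
    blocks = np.array_split(np.arange(A.shape[1]), n_blocks)
    return sum(A[:, idx] @ omega[idx] for idx in blocks)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.standard_normal((8, 30))              # 8 features, 30 columns
    omega = sparse_sign_projection(A.shape[1], 5, rng)
    B = concise_representation(A, omega)
    print(B.shape, np.allclose(B, blockwise_representation(A, omega, 3)))
```

With s = 3, about two thirds of the entries of the sparse sign matrix are zero, so forming the product touches fewer entries than with a dense Gaussian matrix.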
In addition, the run times of both DistGreedyCSS variants are relatively small compared to the DistApproxSVD method, which achieves accuracies that are close to those of the DistGreedyCSS method. Both the DistApproxSVD and DistGreedyCSS methods achieve very good approximation accuracies compared to the randomized and hybrid methods. It should also be noted that using a sparse random sign matrix for random projection takes much less time than a dense Gaussian matrix, while achieving comparable approximation accuracies. Based on this observation, the sparse random matrix has been used with the TinyImages-1M data set.

For the TinyImages-1M data set, although DistApproxSVD achieves slightly higher approximation accuracies than DistGreedyCSS (with a sparse random sign matrix), DistGreedyCSS selects columns in almost one-third of the time. The reason why DistApproxSVD outperforms DistGreedyCSS for this data set is that its rank is relatively small (less than 1,024). This means that using the leading 100 singular vectors as the concise representation of the data matrix captures most of the information in the matrix and accordingly is more accurate than random projection. DistGreedyCSS, however, still selects a very good subset of columns in a relatively short time.

IX. CONCLUSION

This paper proposes an accurate and efficient MapReduce algorithm for selecting a subset of columns from a massively distributed matrix. The algorithm starts by learning a concise representation of the data matrix using random projection. It then selects columns from each sub-matrix that best approximate this concise representation. A centralized selection step is then performed on the columns selected from the different sub-matrices. In order to facilitate the implementation of the proposed method, a novel algorithm for greedy generalized CSS is proposed to perform the selection from the different sub-matrices. In addition, the different steps of the algorithm are carefully designed to be MapReduce-efficient.
Experiments on big data sets demonstrate the effectiveness and efficiency of the proposed algorithm in comparison to other CSS methods when implemented on distributed data.

REFERENCES

[1] A. K. Jain and R. C. Dubes, Algorithms for Clustering Data. Upper Saddle River, NJ, USA: Prentice-Hall, Inc., 1988.

[2] L. Kaufman and P. Rousseeuw, "Clustering by means of medoids," Technische Hogeschool, Delft (Netherlands). Department of Mathematics and Informatics, Tech. Rep.
[3] S. Deerwester, S. Dumais, G. Furnas, T. Landauer, and R. Harshman, "Indexing by latent semantic analysis," Journal of the American Society for Information Science and Technology, vol. 41, no. 6.
[4] C. Boutsidis, J. Sun, and N. Anerousis, "Clustered subset selection and its applications on IT service metrics," in Proceedings of the Seventeenth ACM Conference on Information and Knowledge Management (CIKM 08), 2008.
[5] C. Boutsidis, M. W. Mahoney, and P. Drineas, "An improved approximation algorithm for the column subset selection problem," in Proceedings of the Twentieth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA 09), 2009.
[6] C. Boutsidis, P. Drineas, and M. Magdon-Ismail, "Near optimal column-based matrix reconstruction," in Proceedings of the 52nd Annual IEEE Symposium on Foundations of Computer Science (FOCS 11), 2011.
[7] J. Dean and S. Ghemawat, "MapReduce: Simplified data processing on large clusters," Communications of the ACM, vol. 51, no. 1.
[8] T. White, Hadoop: The Definitive Guide, 1st ed. O'Reilly Media, Inc.
[9] A. Frieze, R. Kannan, and S. Vempala, "Fast Monte-Carlo algorithms for finding low-rank approximations," in Proceedings of the 39th Annual IEEE Symposium on Foundations of Computer Science (FOCS 98), 1998.
[10] P. Drineas, A. Frieze, R. Kannan, S. Vempala, and V. Vinay, "Clustering large graphs via the singular value decomposition," Machine Learning, vol. 56, no. 1-3, pp. 9-33.
[11] P. Drineas, R. Kannan, and M. Mahoney, "Fast Monte Carlo algorithms for matrices II: Computing a low-rank approximation to a matrix," SIAM Journal on Computing, vol. 36, no. 1.
[12] P. Drineas, M. Mahoney, and S. Muthukrishnan, "Subspace sampling and relative-error matrix approximation: Column-based methods," in Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques. Springer Berlin / Heidelberg, 2006.
[13] A. Deshpande, L. Rademacher, S. Vempala, and G. Wang, "Matrix approximation and projective clustering via volume sampling," Theory of Computing, vol. 2, no. 1.
[14] A. Çivril and M. Magdon-Ismail, "Column subset selection via sparse approximation of SVD," Theoretical Computer Science, vol. 421, no. 0, pp. 1-14.
[15] A. K. Farahat, A. Ghodsi, and M. S. Kamel, "An efficient greedy method for unsupervised feature selection," in Proceedings of the Eleventh IEEE International Conference on Data Mining (ICDM 11), 2011.
[16] A. K. Farahat, A. Ghodsi, and M. S. Kamel, "Efficient greedy feature selection for unsupervised learning," Knowledge and Information Systems, vol. 35, no. 2.
[17] T. Elsayed, J. Lin, and D. W. Oard, "Pairwise document similarity in large collections with MapReduce," in Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics on Human Language Technologies: Short Papers (HLT 08), 2008.
[18] A. Ene, S. Im, and B. Moseley, "Fast clustering using MapReduce," in Proceedings of the Seventeenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 11), 2011.
[19] H. Karloff, S. Suri, and S. Vassilvitskii, "A model of computation for MapReduce," in Proceedings of the 21st Annual ACM-SIAM Symposium on Discrete Algorithms (SODA 10), 2010.
[20] S. Dasgupta and A. Gupta, "An elementary proof of a theorem of Johnson and Lindenstrauss," Random Structures and Algorithms, vol. 22, no. 1.
[21] D. Achlioptas, "Database-friendly random projections: Johnson-Lindenstrauss with binary coins," Journal of Computer and System Sciences, vol. 66, no. 4.
[22] P. Li, T. J. Hastie, and K. W. Church, "Very sparse random projections," in Proceedings of the Twelfth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 06), 2006.
[23] G. Golub and C. Van Loan, Matrix Computations, 3rd ed. Johns Hopkins Univ Pr.
[24] A. Deshpande and L. Rademacher, "Efficient volume sampling for row/column subset selection," in Proceedings of the 51st Annual IEEE Symposium on Foundations of Computer Science (FOCS 10), 2010.
[25] V. Guruswami and A. K. Sinop, "Optimal column-based low-rank matrix reconstruction," in Proceedings of the 21st Annual ACM-SIAM Symposium on Discrete Algorithms (SODA 12), 2012.
[26] D. D. Lewis, Y. Yang, T. G. Rose, and F. Li, "RCV1: A new benchmark collection for text categorization research," The Journal of Machine Learning Research, vol. 5.
[27] W.-Y. Chen, Y. Song, H. Bai, C.-J. Lin, and E. Chang, "Parallel spectral clustering in distributed systems," Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 33, no. 3.
[28] A. Torralba, R. Fergus, and W. Freeman, "80 million tiny images: A large data set for nonparametric object and scene recognition," Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 30, no. 11.
[29] N. Halko, P.-G. Martinsson, Y. Shkolnisky, and M. Tygert, "An algorithm for the principal component analysis of large data sets," SIAM Journal on Scientific Computing, vol. 33, no. 5, 2011.