Distributed Column Subset Selection on MapReduce


Ahmed K. Farahat, Ahmed Elgohary, Ali Ghodsi, Mohamed S. Kamel
University of Waterloo, Waterloo, Ontario, Canada N2L 3G1
Email: {afarahat, aelgohary, aghodsib,

Abstract: Given a very large data set distributed over a cluster of several nodes, this paper addresses the problem of selecting a few data instances that best represent the entire data set. The solution to this problem is of crucial importance in the big data era as it enables data analysts to understand the insights of the data and explore its hidden structure. The selected instances can also be used for data preprocessing tasks such as learning a low-dimensional embedding of the data points or computing a low-rank approximation of the corresponding matrix. The paper first formulates the problem as the selection of a few representative columns from a matrix whose columns are massively distributed, and it then proposes a MapReduce algorithm for selecting those representatives. The algorithm first learns a concise representation of all columns using random projection, and it then solves a generalized column subset selection problem at each machine in which a subset of columns is selected from the sub-matrix on that machine such that the reconstruction error of the concise representation is minimized. The paper then demonstrates the effectiveness and efficiency of the proposed algorithm through an empirical evaluation on benchmark data sets.

Keywords-Column Subset Selection; Greedy Algorithms; Distributed Computing; Big Data; MapReduce;

I. INTRODUCTION

Recent years have witnessed the rise of the big data era in computing and storage systems. With the great advances in information and communication technology, hundreds of petabytes of data are generated, transferred, processed and stored every day. The availability of this overwhelming amount of structured and unstructured data creates an acute need to develop fast and accurate algorithms to discover useful information that is hidden in the big data.
One of the crucial problems in the big data era is the ability to represent the data and its underlying information in a succinct format. Although different algorithms for clustering and dimension reduction can be used to summarize big data, these algorithms tend to learn representatives whose meanings are difficult to interpret. For instance, the traditional clustering algorithms such as k-means [1] tend to produce centroids which encode information about thousands of data instances. The meanings of these centroids are hard to interpret. Even clustering methods that use data instances as prototypes, such as k-medoid [2], learn only one representative for each cluster, which is usually not enough to capture the insights of the data instances in that cluster. In addition, using medoids as representatives implicitly assumes that the data points are distributed as clusters and that the number of those clusters is known ahead of time. This assumption is not true for many data sets. On the other hand, traditional dimension reduction algorithms such as Latent Semantic Analysis (LSA) [3] tend to learn a few latent concepts in the feature space. Each of these concepts is represented by a dense vector which combines thousands of features with positive and negative weights. This makes it difficult for the data analyst to understand the meaning of these concepts. Even if the goal of representative selection is to learn a low-dimensional embedding of data instances, learning dimensions whose meanings are easy to interpret allows the understanding of the results of the data mining algorithms, such as understanding the meanings of data clusters in the low-dimensional space.

The acute need to summarize big data in a format that appeals to data analysts motivates the development of different algorithms to directly select a few representative data instances and/or features. This problem can be generally formulated as the selection of a subset of columns from a data matrix, which is formally known as the Column Subset Selection (CSS) problem [4], [5], [6].
Although many algorithms have been proposed for tackling the CSS problem, most of these algorithms focus on randomly selecting a subset of columns with the goal of using these columns to obtain a low-rank approximation of the data matrix. In this case, these algorithms tend to select a relatively large number of columns. When the goal is to select a very few columns to be directly presented to a data analyst or indirectly used to interpret the results of other algorithms, the randomized CSS methods are not going to produce a meaningful subset of columns. On the other hand, deterministic algorithms for CSS, although more accurate, do not scale to work on big matrices with massively distributed columns.

This paper addresses the aforementioned problem by presenting a fast and accurate algorithm for selecting a very few columns from a big data matrix with massively distributed columns. The algorithm starts by learning a concise representation of the data matrix using random projection. Each machine then independently solves a generalized column subset selection problem in which a subset of columns is selected from the current sub-matrix such that the reconstruction error of the concise representation is minimized. A further selection step is then applied to

the columns selected at different machines to select the required number of columns. The proposed algorithm is designed to be executed efficiently over massive amounts of data stored on a cluster of several commodity nodes. In such settings of infrastructure, ensuring the scalability and the fault tolerance of data processing jobs is not a trivial task. In order to alleviate these problems, MapReduce [7] was introduced to simplify large-scale data analytics over a distributed environment of commodity machines. Currently, MapReduce (and its open source implementation Hadoop [8]) is considered the most successful and widely-used framework for managing big data processing jobs. The approach proposed in this paper considers the different aspects of developing MapReduce-efficient algorithms.

The contributions of the paper can be summarized as follows:
- The paper proposes an algorithm for distributed Column Subset Selection (CSS) which first learns a concise representation of the data matrix and then selects columns from distributed sub-matrices that approximate this concise representation.
- To facilitate CSS from different sub-matrices, a fast and accurate algorithm for generalized CSS is proposed. This algorithm greedily selects a subset of columns from a source matrix which approximates the columns of a target matrix.
- A MapReduce-efficient algorithm is proposed for learning a concise representation using random projection. The paper also presents a MapReduce algorithm for distributed CSS which only requires two passes over the data with a very low communication overhead.
- Large-scale experiments have been conducted on benchmark data sets in which different methods for CSS are compared.

The rest of the paper is organized as follows. Section II describes the notations used throughout the paper. Section III gives a brief background on the CSS problem. Section IV describes a centralized greedy algorithm for CSS, which is the core of the distributed algorithm presented in this paper. Section V gives a necessary background on the framework of MapReduce.
The proposed MapReduce algorithm for distributed CSS is described in detail in Section VI. Section VII reviews the state-of-the-art CSS methods and their applicability to distributed data. In Section VIII, an empirical evaluation of the proposed method is described. Finally, Section IX concludes the paper.

II. NOTATIONS

The following notations are used throughout the paper unless otherwise indicated. Scalars are denoted by small letters (e.g., $m$, $n$), sets are denoted in script letters (e.g., $\mathcal{S}$, $\mathcal{R}$), vectors are denoted by small bold italic letters (e.g., $f$, $g$), and matrices are denoted by capital letters (e.g., $A$, $B$). The subscript $(i)$ indicates that the variable corresponds to the $i$-th block of data in the distributed environment. In addition, the following notations are used:

For a set $\mathcal{S}$:
- $|\mathcal{S}|$: the cardinality of the set.

For a vector $x \in \mathbb{R}^m$:
- $x_i$: the $i$-th element of $x$.
- $\|x\|$: the Euclidean norm ($\ell_2$-norm) of $x$.

For a matrix $A \in \mathbb{R}^{m \times n}$:
- $A_{ij}$: the $(i, j)$-th entry of $A$.
- $A_{i:}$: the $i$-th row of $A$.
- $A_{:j}$: the $j$-th column of $A$.
- $A_{:\mathcal{S}}$: the sub-matrix of $A$ which consists of the set $\mathcal{S}$ of columns.
- $A^T$: the transpose of $A$.
- $\|A\|_F$: the Frobenius norm of $A$: $\|A\|_F = \sqrt{\sum_{i,j} A_{ij}^2}$.
- $\tilde{A}$: a low rank approximation of $A$.
- $\tilde{A}_{\mathcal{S}}$: a rank-$l$ approximation of $A$ based on the set $\mathcal{S}$ of columns, where $|\mathcal{S}| = l$.

III. COLUMN SUBSET SELECTION (CSS)

The Column Subset Selection (CSS) problem can be generally defined as the selection of the most representative columns of a data matrix [4], [5], [6]. The CSS problem generalizes the problem of selecting representative data instances as well as the unsupervised feature selection problem. Both are crucial tasks that can be directly used for data analysis or as pre-processing steps for developing fast and accurate algorithms in data mining and machine learning. Although different criteria for column subset selection can be defined, a common criterion that has been used in much recent work measures the discrepancy between the original matrix and the approximate matrix reconstructed from the subset of selected columns [9], [10], [11], [12], [13], [4], [5], [6], [14].
Most of the recent work either develops CSS algorithms that directly optimize this criterion or uses this criterion to assess the quality of the proposed CSS algorithms. In the present work, the CSS problem is formally defined as

Problem 1: (Column Subset Selection) Given an $m \times n$ matrix $A$ and an integer $l$, find a subset of columns $\mathcal{L}$ such that $|\mathcal{L}| = l$ and

$$\mathcal{L} = \arg\min_{\mathcal{S}} \| A - P^{(\mathcal{S})} A \|_F^2,$$

where $P^{(\mathcal{S})}$ is an $m \times m$ projection matrix which projects the columns of $A$ onto the span of the candidate columns $A_{:\mathcal{S}}$.

The criterion $F(\mathcal{S}) = \| A - P^{(\mathcal{S})} A \|_F^2$ represents the sum of squared errors between the original data matrix $A$ and its rank-$l$ column-based approximation (where $l = |\mathcal{S}|$),

$$\tilde{A}_{\mathcal{S}} = P^{(\mathcal{S})} A. \qquad (1)$$
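To make the criterion of Problem 1 concrete, the following is a minimal sketch (not part of the original paper) that evaluates the sum of squared reconstruction errors for a candidate set of column indices. It assumes a small dense matrix stored as a list of rows and uses only the Python standard library; the function name `css_criterion` is hypothetical, and the projection is carried out through a modified Gram-Schmidt basis of the selected columns rather than through an explicit projection matrix.

```python
def css_criterion(A, S):
    """Return ||A - P A||_F^2, where P projects onto the span of columns S of A.

    A is a list of m rows (each a list of n floats); S is a list of column indices.
    """
    m, n = len(A), len(A[0])
    cols = [[A[i][j] for i in range(m)] for j in range(n)]  # column-major copy

    # Orthonormal basis of the selected columns (modified Gram-Schmidt).
    basis = []
    for j in S:
        v = cols[j][:]
        for q in basis:
            c = sum(qi * vi for qi, vi in zip(q, v))
            v = [vi - c * qi for qi, vi in zip(q, v)]
        norm = sum(vi * vi for vi in v) ** 0.5
        if norm > 1e-12:  # skip columns that add no new direction
            basis.append([vi / norm for vi in v])

    # Sum the squared norms of the residual of every column of A.
    error = 0.0
    for a in cols:
        r = a[:]
        for q in basis:
            c = sum(qi * ri for qi, ri in zip(q, r))
            r = [ri - c * qi for qi, ri in zip(q, r)]
        error += sum(ri * ri for ri in r)
    return error


# Columns are (1,0), (0,1) and (1,1); the third column is the best single choice.
A = [[1.0, 0.0, 1.0],
     [0.0, 1.0, 1.0]]
errors = {j: css_criterion(A, [j]) for j in range(3)}
```

For this toy matrix, selecting either of the first two columns leaves an error of 2, while selecting the third column leaves an error of 1, so an exhaustive search with l = 1 would return the third column.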

In other words, the criterion $F(\mathcal{S})$ calculates the squared Frobenius norm of the residual matrix $E = A - \tilde{A}_{\mathcal{S}}$. Other types of matrix norms can also be used to quantify the reconstruction error. Some of the recent work on the CSS problem [4], [5], [6] derives theoretical bounds for both the Frobenius and spectral norms of the residual matrix. The present work, however, focuses on developing algorithms that minimize the Frobenius norm of the residual matrix.

The projection matrix $P^{(\mathcal{S})}$ can be calculated as

$$P^{(\mathcal{S})} = A_{:\mathcal{S}} \left( A_{:\mathcal{S}}^T A_{:\mathcal{S}} \right)^{-1} A_{:\mathcal{S}}^T, \qquad (2)$$

where $A_{:\mathcal{S}}$ is the sub-matrix of $A$ which consists of the columns corresponding to $\mathcal{S}$. It should be noted that if $\mathcal{S}$ is known, the term $\left( A_{:\mathcal{S}}^T A_{:\mathcal{S}} \right)^{-1} A_{:\mathcal{S}}^T A$ is the closed-form solution of the least-squares problem $T^* = \arg\min_T \| A - A_{:\mathcal{S}} T \|_F^2$.

The set of selected columns (i.e., data instances or features) can be directly presented to a data analyst to learn about the insights of the data, or they can be used to preprocess the data for further analysis. For instance, the selected columns can be used to obtain a low-dimensional representation of all columns in the subspace of the selected ones. This representation can be obtained by calculating an orthogonal basis $Q$ for the selected columns and then embedding all columns of $A$ into the subspace of $Q$ as $W = Q^T A$. The selected columns can also be used to calculate a column-based low-rank approximation of $A$ [12]. Moreover, the leading singular values and vectors of the low-dimensional embedding $W$ can be used to approximate those of the data matrix.

IV. GREEDY CSS

The column subset selection criterion presented in Section III measures the reconstruction error of a data matrix based on the subset of selected columns. The minimization of this criterion is a combinatorial optimization problem whose optimal solution can be obtained in $O(n^l m n l)$ [5]. This section briefly describes a deterministic greedy algorithm for optimizing this criterion, which extends the greedy method for unsupervised feature selection recently proposed by Farahat et al. [15], [16]. A brief description of this method is included in this section for completeness.
The reader is referred to [16] for the proofs of the different formulas presented in this section. The greedy CSS [16] is based on the following recursive formula for the CSS criterion.

Theorem 1: Given a set of columns $\mathcal{S}$. For any $\mathcal{P} \subset \mathcal{S}$,

$$F(\mathcal{S}) = F(\mathcal{P}) - \| \tilde{E}_{\mathcal{R}} \|_F^2,$$

where $E = A - P^{(\mathcal{P})} A$, and $\tilde{E}_{\mathcal{R}}$ is the low-rank approximation of $E$ based on the subset $\mathcal{R} = \mathcal{S} \setminus \mathcal{P}$ of columns.

Proof: See [16, Theorem 2].

The term $\| \tilde{E}_{\mathcal{R}} \|_F^2$ represents the decrease in reconstruction error achieved by adding the subset $\mathcal{R}$ of columns to $\mathcal{P}$. This recursive formula allows the development of an efficient greedy algorithm that approximates the optimal solution of the column subset selection problem. At iteration $t$, the goal is to find column $p$ such that

$$p = \arg\min_i F(\mathcal{S} \cup \{i\}), \qquad (3)$$

where $\mathcal{S}$ is the set of columns selected during the first $t - 1$ iterations. Let $G$ be an $n \times n$ matrix which represents the inner-products over the columns of the residual matrix $E$, i.e., $G = E^T E$. The greedy selection problem can be simplified to (see [16, Section 6])

Problem 2: (Greedy Column Subset Selection) At iteration $t$, find column $p$ such that

$$p = \arg\max_i \frac{\| G_{:i} \|^2}{G_{ii}},$$

where $G = E^T E$, $E = A - \tilde{A}_{\mathcal{S}}$ and $\mathcal{S}$ is the set of columns selected during the first $t - 1$ iterations.

For iteration $t$, define $\delta = G_{:p}$ and $\omega = G_{:p} / \sqrt{G_{pp}} = \delta / \sqrt{\delta_p}$. The vector $\delta^{(t)}$ can be calculated in terms of $A$ and previous $\omega$'s as

$$\delta^{(t)} = A^T A_{:p} - \sum_{r=1}^{t-1} \omega_p^{(r)} \omega^{(r)}. \qquad (4)$$

The numerator and denominator of the selection criterion at each iteration can be calculated in an efficient manner without explicitly calculating $E$ or $G$ using the following theorem.

Theorem 2: Let $f_i = \| G_{:i} \|^2$ and $g_i = G_{ii}$ be the numerator and denominator of the criterion function for column $i$ respectively, $f = [f_i]_{i=1..n}$, and $g = [g_i]_{i=1..n}$. Then,

$$f^{(t)} = \left( f - 2 \left( \omega \circ \left( A^T A \omega - \sum_{r=1}^{t-2} \left( \omega^{(r)T} \omega \right) \omega^{(r)} \right) \right) + \| \omega \|^2 (\omega \circ \omega) \right)^{(t-1)},$$

$$g^{(t)} = \left( g - (\omega \circ \omega) \right)^{(t-1)},$$

where $\circ$ represents the Hadamard product operator.

Proof: See [16, Theorem 4].

Algorithm 1 shows the complete greedy CSS algorithm.
The distributed CSS algorithm presented in this paper introduces a generalized variant of the greedy CSS algorithm in which a subset of columns is selected from a source matrix such that the reconstruction error of a target matrix is minimized. The distributed CSS method uses the greedy generalized CSS algorithm as the core method for selecting columns at different machines as well as in the final selection step.
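The selection rule of Problem 2 can be illustrated with a deliberately naive sketch. This is not the memory-efficient recursion of Theorem 2: the version below keeps the residual matrix $E$ explicitly and deflates it after every selection, which is only reasonable for small dense matrices. The names `greedy_css`, `_dot` and `_deflate` are hypothetical, and only the Python standard library is assumed.

```python
def _dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def _deflate(x, q, qq):
    """Remove from x its component along q (qq is the squared norm of q)."""
    c = _dot(q, x) / qq
    return [xi - c * qi for xi, qi in zip(x, q)]


def greedy_css(A, l, tol=1e-12):
    """Greedily pick up to l column indices of A (a list of rows).

    At each step the column maximizing ||G_:i||^2 / G_ii is chosen,
    where G = E^T E and E is the current residual of A.
    """
    m, n = len(A), len(A[0])
    E = [[A[i][j] for i in range(m)] for j in range(n)]  # residual, by column
    selected = []
    for _ in range(l):
        best, best_score = None, 0.0
        for i in range(n):
            g_ii = _dot(E[i], E[i])
            if g_ii <= tol:  # column already fully explained
                continue
            f_i = sum(_dot(E[i], E[j]) ** 2 for j in range(n))
            if f_i / g_ii > best_score:
                best, best_score = i, f_i / g_ii
        if best is None:  # residual is numerically zero
            break
        selected.append(best)
        q = E[best][:]
        qq = _dot(q, q)
        E = [_deflate(e, q, qq) for e in E]
    return selected


A = [[1.0, 0.0, 1.0],
     [0.0, 1.0, 1.0]]
picked = greedy_css(A, 2)
```

On this toy matrix the first pick is the column (1, 1), whose score of 3 equals the reduction in reconstruction error from 4 to 1; the second pick breaks the tie between the remaining two columns in favour of the lower index. Each iteration of this sketch costs on the order of $n^2 m$ operations, which is exactly the cost that the recursive updates of $f$ and $g$ are meant to avoid.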

Algorithm 1 Greedy Column Subset Selection
Input: Data matrix $A$, Number of columns $l$
Output: Selected subset of columns $\mathcal{S}$
1: Initialize $\mathcal{S} = \{ \}$
2: Initialize $f_i^{(0)} = \| A^T A_{:i} \|^2$, $g_i^{(0)} = A_{:i}^T A_{:i}$ for $i = 1 \ldots n$
3: Repeat $t = 1 \rightarrow l$:
4:   $p = \arg\max_i f_i^{(t)} / g_i^{(t)}$, $\mathcal{S} = \mathcal{S} \cup \{p\}$
5:   $\delta^{(t)} = A^T A_{:p} - \sum_{r=1}^{t-1} \omega_p^{(r)} \omega^{(r)}$
6:   $\omega^{(t)} = \delta^{(t)} / \sqrt{\delta_p^{(t)}}$
7:   Update $f_i$'s, $g_i$'s (Theorem 2)

V. MAPREDUCE PARADIGM

MapReduce [7] was presented as a programming model to simplify large-scale data analytics over a distributed environment of commodity machines. The rationale behind MapReduce is to impose a set of constraints on data access at each individual machine and communication between different machines to ensure both the scalability and fault-tolerance of the analytical tasks. Currently, MapReduce is considered the de-facto solution for many data analytics tasks over large distributed clusters [17], [18].

A MapReduce job is executed in two phases of user-defined data transformation functions, namely, map and reduce phases. The input data is split into physical blocks distributed among the nodes. Each block is viewed as a list of key-value pairs. In the first phase, the key-value pairs of each input block $b$ are processed by a single map function running independently on the node where the block $b$ is stored. The key-value pairs are provided one-by-one to the map function. The output of the map function is another set of intermediate key-value pairs. The values associated with the same key across all nodes are grouped together and provided as an input to the reduce function in the second phase. Different groups of values are processed in parallel on different machines. The output of each reduce function is a third set of key-value pairs, and these are collectively considered the output of the job. It is important to note that the set of the intermediate key-value pairs is moved across the network between the nodes, which incurs significant additional execution time when much data are to be moved.
For complex analytical tasks, multiple jobs are typically chained together [17] and/or many rounds of the same job are executed on the input data set [18].

In addition to the programming model constraints, Karloff et al. [19] defined a set of computational constraints that ensure the scalability and the efficiency of MapReduce-based analytical tasks. These computational constraints limit the used memory size at each machine, the output size of both the map and reduce functions and the number of rounds used to complete a certain task. The MapReduce algorithms presented in this paper adhere to both the programming model constraints and the computational constraints. The proposed algorithm also aims at minimizing the overall running time of the distributed column subset selection task to facilitate interactive data analytics.

VI. DISTRIBUTED CSS ON MAPREDUCE

This section describes a MapReduce algorithm for the distributed column subset selection problem. Given a big data matrix $A$ whose columns are distributed across different machines, the goal is to select a subset of columns $\mathcal{S}$ from $A$ such that the CSS criterion $F(\mathcal{S})$ is minimized.

One naïve approach to perform distributed column subset selection is to select different subsets of columns from the sub-matrices stored on different machines. The selected subsets are then sent to a centralized machine where an additional selection step is optionally performed to filter out irrelevant or redundant columns. Let $A_{(i)}$ be the sub-matrix stored at machine $i$; the naïve approach optimizes the following function:

$$\sum_{i=1}^{c} \left\| A_{(i)} - P^{(\mathcal{L}_{(i)})} A_{(i)} \right\|_F^2, \qquad (5)$$

where $\mathcal{L}_{(i)}$ is the set of columns selected from $A_{(i)}$ and $c$ is the number of physical blocks of data. The resulting set of columns is the union of the sets selected from different sub-matrices: $\mathcal{L} = \cup_{i=1}^{c} \mathcal{L}_{(i)}$. The set $\mathcal{L}$ can further be reduced by invoking another selection process in which a smaller subset of columns is selected from $A_{:\mathcal{L}}$.

The naïve approach, however simple, is prone to missing relevant columns.
This is because the selection at each machine is based on approximating a local sub-matrix, and accordingly there is no way to determine whether the selected columns are globally relevant or not. For instance, suppose the extreme case where all the truly representative columns happen to be loaded on a single machine. In this case, the algorithm will select a less-than-required number of columns from that machine and many irrelevant columns from other machines.

In order to alleviate this problem, the different machines have to select columns that best approximate a common representation of the data matrix. To achieve that, the proposed algorithm first learns a concise representation of the span of the big data matrix. This concise representation is relatively small and it can be sent over to all machines. After that, each machine can select columns from its sub-matrix that approximate this concise representation. The proposed algorithm uses random projection to learn this concise representation, and proposes a generalized Column Subset Selection (CSS) method to select columns from different machines. The details of the proposed methods are explained in the rest of this section.

A. Random Projection

The first step of the proposed algorithm is to learn a concise representation $B$ for a distributed data matrix $A$. In the proposed approach, a random projection method is employed. Random projection [20][21][22] is a well-known technique for dealing with the curse-of-the-dimensionality problem. Let $\Omega$ be a random projection matrix of size $n \times r$, and given a data matrix $X$ of size $m \times n$, the random projection can be calculated as $Y = X\Omega$. It has been shown that applying random projection $\Omega$ to $X$ preserves the pairwise distances between vectors in the row space of $X$ with a high probability [20]:

$$(1 - \epsilon) \| X_{i:} - X_{j:} \| \le \| X_{i:}\Omega - X_{j:}\Omega \| \le (1 + \epsilon) \| X_{i:} - X_{j:} \|, \qquad (6)$$

where $\epsilon$ is an arbitrarily small factor.

Since the CSS criterion $F(\mathcal{S})$ measures the reconstruction error between the big data matrix $A$ and its low-rank approximation $P^{(\mathcal{S})} A$, it essentially measures the sum of the distances between the original rows and their approximations. This means that when applying random projection to both $A$ and $P^{(\mathcal{S})} A$, the reconstruction error of the original data matrix $A$ will be approximately equal to that of $A\Omega$ when both are approximated using the subset of selected columns:

$$\| A - P^{(\mathcal{S})} A \|_F^2 \approx \| A\Omega - P^{(\mathcal{S})} A\Omega \|_F^2. \qquad (7)$$

So, instead of optimizing $\| A - P^{(\mathcal{S})} A \|_F^2$, the distributed CSS can approximately optimize $\| A\Omega - P^{(\mathcal{S})} A\Omega \|_F^2$.

Let $B = A\Omega$; the distributed column subset selection problem can be formally defined as

Problem 3: (Distributed Column Subset Selection) Given an $m \times n_{(i)}$ sub-matrix $A_{(i)}$ which is stored at node $i$ and an integer $l_{(i)}$, find a subset of columns $\mathcal{L}_{(i)}$ such that $|\mathcal{L}_{(i)}| = l_{(i)}$ and

$$\mathcal{L}_{(i)} = \arg\min_{\mathcal{S}} \| B - P^{(\mathcal{S})} B \|_F^2,$$

where $B = A\Omega$, $\Omega$ is an $n \times r$ random projection matrix, $\mathcal{S}$ is the set of the indices of the candidate columns and $\mathcal{L}_{(i)}$ is the set of the indices of the selected columns from $A_{(i)}$.

A key observation here is that random projection matrices whose entries are sampled i.i.d. from some univariate distribution $\Psi$ can be exploited to compute random projection on MapReduce in a very efficient manner. Examples of such matrices are Gaussian random matrices [20], uniform random sign ($\pm 1$) matrices [21], and sparse random sign matrices [22].
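The observation above can be illustrated with a small single-process simulation. Because the entries of $\Omega$ are i.i.d., the row of $\Omega$ that multiplies a column of $A$ can be drawn at the moment that column is read, so $\Omega$ never needs to be materialized, and each block only has to ship one $m \times r$ partial sum. The sketch below is an illustration under assumptions rather than the authors' implementation: it uses uniform random sign entries, it regenerates each row of $\Omega$ from a hypothetical `(seed, column index)` pair so that the result can be checked, and the function names are invented for this example.

```python
import random


def omega_row(i, r, seed=0):
    """Row i of a random sign matrix, regenerated on demand from (seed, i)."""
    rng = random.Random(seed * 1_000_003 + i)
    return [rng.choice((-1.0, 1.0)) for _ in range(r)]


def map_block(block, m, r, seed=0):
    """Mapper: accumulate in memory the outer products A_:i * Omega_i: of one block."""
    partial = [[0.0] * r for _ in range(m)]
    for i, column in block:  # block is a list of (column index, column) pairs
        v = omega_row(i, r, seed)
        for j in range(m):
            for k in range(r):
                partial[j][k] += column[j] * v[k]
    return [(j, partial[j]) for j in range(m)]  # emit (row index, partial row)


def reduce_rows(emitted, m, r):
    """Reducer: add up the partial rows that share the same row index."""
    B = [[0.0] * r for _ in range(m)]
    for j, row in emitted:
        for k in range(r):
            B[j][k] += row[k]
    return B


def random_projection(blocks, m, r, seed=0):
    emitted = [pair for block in blocks for pair in map_block(block, m, r, seed)]
    return reduce_rows(emitted, m, r)


columns = [(0, [1.0, 2.0]), (1, [0.0, 1.0]), (2, [3.0, -1.0]), (3, [2.0, 2.0])]
B_one_block = random_projection([columns], m=2, r=3)
B_two_blocks = random_projection([columns[:2], columns[2:]], m=2, r=3)
```

Splitting the columns into one block or two gives the same $B$, and each block emits only $m$ rows of length $r$ regardless of how many columns it holds.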
In order to implement random projection on MapReduce, the data matrix $A$ is distributed in a column-wise fashion and viewed as pairs of $\langle i, A_{:i} \rangle$ where $A_{:i}$ is the $i$-th column of $A$. Recall that $B = A\Omega$ can be rewritten as

$$B = \sum_{i=1}^{n} A_{:i} \Omega_{i:} \qquad (8)$$

and since the map function is provided one column of $A$ at a time, one does not need to worry about pre-computing the full matrix $\Omega$. In fact, for each input column $A_{:i}$, a new vector $\Omega_{i:}$ needs to be sampled from $\Psi$. So, each input column generates a matrix of size $m \times r$, which means that $O(nmr)$ data should be moved across the network to sum the generated $n$ matrices at $m$ independent reducers, each summing a row $B_{j:}$ to obtain $B$. To minimize that network cost, an in-memory summation can be carried out over the generated $m \times r$ matrices at each mapper. This can be done incrementally after processing each column of $A$. That optimization reduces the network cost to $O(cmr)$, where $c$ is the number of physical blocks of the matrix (see footnote 1). Algorithm 2 outlines the proposed random projection algorithm. The term emit is used to refer to outputting new $\langle key, value \rangle$ pairs from a mapper or a reducer.

Algorithm 2 Fast Random Projection on MapReduce
Input: Data matrix $A$, Univariate distribution $\Psi$, Number of dimensions $r$
Output: Concise representation $B = A\Omega$, $\Omega_{ij} \sim \Psi \ \forall i, j$
1: map:
2:   $\bar{B} = [0]_{m \times r}$
3:   foreach $\langle i, A_{:i} \rangle$
4:     Generate $v = [v_1, v_2, \ldots, v_r]$, $v_j \sim \Psi$
5:     $\bar{B} = \bar{B} + A_{:i} v$
6:   for $j = 1$ to $m$
7:     emit $\langle j, \bar{B}_{j:} \rangle$
8: reduce:
9:   foreach $\langle j, [ [\bar{B}_{(1)}]_{j:}, [\bar{B}_{(2)}]_{j:}, \ldots, [\bar{B}_{(c)}]_{j:} ] \rangle$
10:    $B_{j:} = \sum_{i=1}^{c} [\bar{B}_{(i)}]_{j:}$
11:    emit $\langle j, B_{j:} \rangle$

B. Generalized CSS

This section presents the generalized column subset selection algorithm which will be used to perform the selection of columns at different machines. While Problem 1 is concerned with the selection of a subset of columns from a data matrix which best represent other columns of the same matrix, Problem 3 selects a subset of columns from a source matrix which best represent the columns of a different target matrix.
The objective function of Problem 3 represents the reconstruction error of the target matrix $B$ based on the selected columns from the source matrix, and the term $P^{(\mathcal{S})} = A_{:\mathcal{S}} \left( A_{:\mathcal{S}}^T A_{:\mathcal{S}} \right)^{-1} A_{:\mathcal{S}}^T$ is the projection matrix which projects the columns of $B$ onto the subspace of the columns selected from $A$.

In order to optimize this new criterion, a greedy algorithm can be introduced.

Footnote 1: The in-memory summation can also be replaced by a MapReduce combiner [7].

Let $F(\mathcal{S}) = \| B - P^{(\mathcal{S})} B \|_F^2$ be the

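distributed CSS criterion; the following theorem derives a recursive formula for $F(\mathcal{S})$.

Theorem 3: Given a set of columns $\mathcal{S}$. For any $\mathcal{P} \subset \mathcal{S}$,

$$F(\mathcal{S}) = F(\mathcal{P}) - \| \tilde{F}_{\mathcal{R}} \|_F^2,$$

where $F = B - P^{(\mathcal{P})} B$, and $\tilde{F}_{\mathcal{R}}$ is the low-rank approximation of $F$ based on the subset $\mathcal{R} = \mathcal{S} \setminus \mathcal{P}$ of columns of $E = A - P^{(\mathcal{P})} A$.

Proof: Using the recursive formula for the low-rank approximation of $A$: $\tilde{A}_{\mathcal{S}} = \tilde{A}_{\mathcal{P}} + \tilde{E}_{\mathcal{R}}$, and multiplying both sides with $\Omega$ gives

$$\tilde{A}_{\mathcal{S}} \Omega = \tilde{A}_{\mathcal{P}} \Omega + \tilde{E}_{\mathcal{R}} \Omega.$$

Low-rank approximations can be written in terms of projection matrices as

$$P^{(\mathcal{S})} A\Omega = P^{(\mathcal{P})} A\Omega + R^{(\mathcal{R})} E\Omega.$$

Using $B = A\Omega$,

$$P^{(\mathcal{S})} B = P^{(\mathcal{P})} B + R^{(\mathcal{R})} E\Omega.$$

Let $F = E\Omega$. The matrix $F$ is the residual after approximating $B$ using the set $\mathcal{P}$ of columns:

$$F = E\Omega = \left( A - P^{(\mathcal{P})} A \right) \Omega = A\Omega - P^{(\mathcal{P})} A\Omega = B - P^{(\mathcal{P})} B.$$

This means that

$$P^{(\mathcal{S})} B = P^{(\mathcal{P})} B + R^{(\mathcal{R})} F.$$

Substituting in $F(\mathcal{S}) = \| B - P^{(\mathcal{S})} B \|_F^2$ gives

$$F(\mathcal{S}) = \| B - P^{(\mathcal{P})} B - R^{(\mathcal{R})} F \|_F^2.$$

Using $F = B - P^{(\mathcal{P})} B$ gives

$$F(\mathcal{S}) = \| F - R^{(\mathcal{R})} F \|_F^2.$$

Using the relation between the Frobenius norm and the trace,

$$F(\mathcal{S}) = \mathrm{trace}\left( \left( F - R^{(\mathcal{R})} F \right)^T \left( F - R^{(\mathcal{R})} F \right) \right) = \mathrm{trace}\left( F^T F - 2 F^T R^{(\mathcal{R})} F + F^T R^{(\mathcal{R})} R^{(\mathcal{R})} F \right) = \mathrm{trace}\left( F^T F - F^T R^{(\mathcal{R})} F \right) = \| F \|_F^2 - \| R^{(\mathcal{R})} F \|_F^2.$$

Using $F(\mathcal{P}) = \| F \|_F^2$ and $\tilde{F}_{\mathcal{R}} = R^{(\mathcal{R})} F$ proves the theorem.

Using the recursive formula for $F(\mathcal{S} \cup \{i\})$ allows the development of a greedy algorithm which at iteration $t$ optimizes

$$p = \arg\min_i F(\mathcal{S} \cup \{i\}) = \arg\max_i \| \tilde{F}_{\{i\}} \|_F^2. \qquad (9)$$

Algorithm 3 Greedy Generalized Column Subset Selection
Input: Source matrix $A$, Target matrix $B$, Number of columns $l$
Output: Selected subset of columns $\mathcal{S}$
1: Initialize $f_i^{(0)} = \| B^T A_{:i} \|^2$, $g_i^{(0)} = A_{:i}^T A_{:i}$ for $i = 1 \ldots n$
2: Repeat $t = 1 \rightarrow l$:
3:   $p = \arg\max_i f_i^{(t)} / g_i^{(t)}$, $\mathcal{S} = \mathcal{S} \cup \{p\}$
4:   $\delta^{(t)} = A^T A_{:p} - \sum_{r=1}^{t-1} \omega_p^{(r)} \omega^{(r)}$
5:   $\gamma^{(t)} = B^T A_{:p} - \sum_{r=1}^{t-1} \omega_p^{(r)} \upsilon^{(r)}$
6:   $\omega^{(t)} = \delta^{(t)} / \sqrt{\delta_p^{(t)}}$, $\upsilon^{(t)} = \gamma^{(t)} / \sqrt{\delta_p^{(t)}}$
7:   Update $f_i$'s, $g_i$'s (Theorem 4)

Let $G = E^T E$ and $H = F^T E$; the objective function of this optimization problem can be simplified as follows:

$$\| \tilde{F}_{\{i\}} \|_F^2 = \left\| E_{:i} \left( E_{:i}^T E_{:i} \right)^{-1} E_{:i}^T F \right\|_F^2 = \mathrm{trace}\left( F^T E_{:i} \left( E_{:i}^T E_{:i} \right)^{-1} E_{:i}^T F \right) = \frac{\| F^T E_{:i} \|^2}{E_{:i}^T E_{:i}} = \frac{\| H_{:i} \|^2}{G_{ii}}. \qquad (10)$$

This allows the definition of the following generalized CSS problem.

Problem 4: (Greedy Generalized CSS) At iteration $t$, find column $p$ such that

$$p = \arg\max_i \frac{\| H_{:i} \|^2}{G_{ii}},$$

where $H = F^T E$, $G = E^T E$, $F = B - P^{(\mathcal{S})} B$, $E = A - P^{(\mathcal{S})} A$ and $\mathcal{S}$ is the set of columns selected during the first $t - 1$ iterations.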
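The rule of Problem 4 can be sketched in the same naive style: the residuals $E$ (of the source) and $F$ (of the target) are stored explicitly and both are deflated by the selected residual column after each step. This is a small illustration under assumptions, not the efficient recursion used by the paper; the names `generalized_greedy_css`, `_dot` and `_deflate` are hypothetical, matrices are lists of rows, and the loop stops early once the target is reconstructed to within a tolerance.

```python
def _dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def _deflate(x, q, qq):
    """Remove from x its component along q (qq is the squared norm of q)."""
    c = _dot(q, x) / qq
    return [xi - c * qi for xi, qi in zip(x, q)]


def generalized_greedy_css(A, B, l, tol=1e-12):
    """Pick up to l columns of the source A that best reconstruct the target B.

    A is m x n and B is m x r, both given as lists of rows. The score of a
    column i is ||F^T E_:i||^2 / (E_:i^T E_:i), i.e. ||H_:i||^2 / G_ii.
    """
    m = len(A)
    E = [[A[i][j] for i in range(m)] for j in range(len(A[0]))]  # source residual
    F = [[B[i][j] for i in range(m)] for j in range(len(B[0]))]  # target residual
    selected = []
    for _ in range(l):
        best, best_score = None, tol
        for i, e in enumerate(E):
            g_ii = _dot(e, e)
            if g_ii <= tol:
                continue
            score = sum(_dot(f, e) ** 2 for f in F) / g_ii
            if score > best_score:
                best, best_score = i, score
        if best is None:  # no column reduces the target error any further
            break
        selected.append(best)
        q = E[best][:]
        qq = _dot(q, q)
        E = [_deflate(e, q, qq) for e in E]
        F = [_deflate(f, q, qq) for f in F]
    return selected


A = [[1.0, 0.0, 1.0],
     [0.0, 1.0, 1.0]]
B = [[1.0],
     [0.0]]
picked = generalized_greedy_css(A, B, 2)
```

With the target column (1, 0), only the first source column is selected even though two were requested, because the target residual is already zero after one step. Passing the source itself as the target recovers the ordinary greedy CSS rule of Problem 2.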
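For iteration $t$, define $\gamma = H_{:p}$ and $\upsilon = H_{:p} / \sqrt{G_{pp}} = \gamma / \sqrt{\delta_p}$. The vector $\gamma^{(t)}$ can be calculated in terms of $A$, $B$ and previous $\omega$'s and $\upsilon$'s as

$$\gamma^{(t)} = B^T A_{:p} - \sum_{r=1}^{t-1} \omega_p^{(r)} \upsilon^{(r)}.$$

Similarly, the numerator and denominator of the selection criterion at each iteration can be calculated in an efficient manner using the following theorem.

Theorem 4: Let $f_i = \| H_{:i} \|^2$ and $g_i = G_{ii}$ be the numerator and denominator of the greedy criterion function for column $i$ respectively, $f = [f_i]_{i=1..n}$, and $g = [g_i]_{i=1..n}$. Then,

$$f^{(t)} = \left( f - 2 \left( \omega \circ \left( A^T B \upsilon - \sum_{r=1}^{t-2} \left( \upsilon^{(r)T} \upsilon \right) \omega^{(r)} \right) \right) + \| \upsilon \|^2 (\omega \circ \omega) \right)^{(t-1)},$$

$$g^{(t)} = \left( g - (\omega \circ \omega) \right)^{(t-1)},$$

where $\circ$ represents the Hadamard product operator.

As outlined in Section VI-A, the algorithm's distribution strategy is based on sharing the concise representation of the data $B$ among all mappers. Then, independent $l_{(b)}$ columns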

from each mapper are selected using the generalized CSS algorithm. A second phase of selection is run over the $\sum_{b=1}^{c} l_{(b)}$ columns (where $c$ is the number of input blocks) to find the best $l$ columns to represent $B$. Different ways can be used to set $l_{(b)}$ for each input block $b$. In the context of this paper, the set of $l_{(b)}$ is assigned uniform values for all blocks (i.e., $l_{(b)} = l/c \ \forall b \in 1, 2, \ldots, c$). Other methods are to be considered in future extensions. Algorithm 4 sketches the MapReduce implementation of the distributed CSS algorithm. It should be emphasized that the proposed MapReduce algorithm requires only two passes over the data set and it moves a very small amount of the data across the network.

Algorithm 4 Distributed CSS on MapReduce
Input: Matrix $A$ of size $m \times n$, Concise representation $B$, Number of columns $l$
Output: Selected columns $C$
1: map:
2:   $A_{(b)} = [\ ]$
3:   foreach $\langle i, A_{:i} \rangle$
4:     $A_{(b)} = [A_{(b)} \ A_{:i}]$
5:   $\mathcal{S}$ = GeneralizedCSS($A_{(b)}$, $B$, $l_{(b)}$)
6:   foreach $j$ in $\mathcal{S}$
7:     emit $\langle 0, [A_{(b)}]_{:j} \rangle$
8: reduce:
9:   For all values $\{ [A_{(1)}]_{:\mathcal{S}_{(1)}}, [A_{(2)}]_{:\mathcal{S}_{(2)}}, \ldots, [A_{(c)}]_{:\mathcal{S}_{(c)}} \}$
10:    $A_{(0)} = [ [A_{(1)}]_{:\mathcal{S}_{(1)}}, [A_{(2)}]_{:\mathcal{S}_{(2)}}, \ldots, [A_{(c)}]_{:\mathcal{S}_{(c)}} ]$
11:    $\mathcal{S}$ = GeneralizedCSS($A_{(0)}$, $B$, $l$)
12:    foreach $j$ in $\mathcal{S}$
13:      emit $\langle 0, [A_{(0)}]_{:j} \rangle$

VII. RELATED WORK

Different approaches have been proposed for selecting a subset of representative columns from a data matrix. This section focuses on briefly describing these approaches and their applicability to massively distributed data matrices. The Column Subset Selection (CSS) methods can be generally categorized into randomized, deterministic and hybrid.

The randomized methods sample a subset of columns from the original matrix using carefully chosen sampling probabilities. Frieze et al. [9] were the first to suggest the idea of randomly sampling $l$ columns from a matrix and using these columns to calculate a rank-$k$ approximation of the matrix (where $l \ge k$). That work of Frieze et al. was followed by different papers [10], [11] that enhanced the algorithm by proposing different sampling probabilities. Drineas et al.
[12] proposed a subspace sampling method which samples columns using probabilities proportional to the norms of the rows of the top $k$ right singular vectors of $A$. Deshpande et al. [13] proposed an adaptive sampling method which updates the sampling probabilities based on the columns selected so far. Column subset selection with uniform sampling can be easily implemented on MapReduce. For non-uniform sampling, the efficiency of implementing the selection on MapReduce is determined by how easily the sampling probabilities can be calculated. The calculations of probabilities that depend on calculating the leading singular values and vectors are time-consuming on MapReduce. On the other hand, adaptive sampling methods are computationally very complex as they depend on calculating the residual of the whole data matrix after each iteration.

The second category of methods employs a deterministic algorithm for selecting columns such that some criterion function is minimized. This criterion function usually quantifies the reconstruction error of the data matrix based on the subset of selected columns. The deterministic methods are slower, but more accurate, than the randomized ones. In the area of numerical linear algebra, the column pivoting method exploited by the QR decomposition [23] permutes the columns of the matrix based on their norms to enhance the numerical stability of the QR decomposition algorithm. The first $l$ columns of the permuted matrix can be directly selected as representative columns. Besides methods based on QR decomposition, different recent methods have been proposed for directly selecting a subset of columns from the data matrix. Boutsidis et al. [4] proposed a deterministic column subset selection method which first groups columns into clusters and then selects a subset of columns from each cluster. Çivril and Magdon-Ismail [14] presented a deterministic algorithm which greedily selects columns from the data matrix that best represent the right leading singular values of the matrix. Recently, Boutsidis et al.
[6] presented a column subset selection algorithm which first calculates the top-$k$ right singular values of the data matrix (where $k$ is the target rank) and then uses deterministic sparsification methods to select $l \ge k$ columns from the data matrix. Besides, other deterministic algorithms have been proposed for selecting columns based on the volume defined by them and the origin [24], [25].

The deterministic algorithms are more complex to implement on MapReduce. For instance, it is time-consuming to calculate the leading singular values and vectors of a massively distributed matrix or to cluster their columns using k-means. It is also computationally complex to calculate QR decomposition with pivoting. Moreover, the recently proposed algorithms for volume sampling are more complex than other CSS algorithms as well as the one presented in this paper, and they are infeasible for large data sets.

A third category of CSS techniques is the hybrid methods which combine the benefits of both the randomized and deterministic methods. In these methods, a large subset of columns is randomly sampled from the columns of the data matrix and then a deterministic step is employed to reduce

the number of selected columns to the desired rank. For instance, Boutsidis et al. [5] proposed a two-stage hybrid CSS algorithm which first samples $O(l \log l)$ columns based on probabilities calculated using the $l$-leading right singular vectors, and then employs a deterministic algorithm to select exactly $l$ columns from the columns sampled in the first stage. However, the algorithm depends on calculating the leading $l$ right singular vectors which is time-consuming for large data sets. The hybrid algorithms for CSS can be easily implemented on MapReduce if the randomized selection step is MapReduce-efficient and the deterministic selection step can be implemented on a single machine. This is usually true if the number of columns selected by the randomized step is relatively small.

In comparison to other CSS methods, the algorithm proposed in this paper is designed to be MapReduce-efficient. In the distributed selection step, representative columns are selected based on a common representation. The common representation proposed in this work is based on random projection. This is more efficient than the work of Çivril and Magdon-Ismail [14] which selects columns based on the leading singular vectors. In comparison to other deterministic methods, the proposed algorithm is specifically designed to be parallelized which makes it applicable to big data matrices whose columns are massively distributed. On the other hand, the two-step process of distributed then centralized selection is similar to that of the hybrid CSS methods. The proposed algorithm however employs a deterministic algorithm at the distributed selection phase which is more accurate than the randomized selection employed by hybrid methods in the first phase.

VIII. EXPERIMENTS

Experiments have been conducted on two big data sets to evaluate the efficiency and effectiveness of the proposed distributed CSS algorithm on MapReduce.

TABLE I
THE PROPERTIES OF THE DATA SETS USED TO EVALUATE THE DISTRIBUTED CSS METHOD.

Data set        Type        # Instances   # Features
RCV1-200K       Documents   193,844       47,236
TinyImages-1M   Images      1 million     1,024
The properties of the data sets are described in Table I. The RCV1-200K is a subset of the RCV1 data set [26] which has been prepared and used by Chen et al. [27] to evaluate parallel spectral clustering algorithms. The TinyImages-1M data set contains 1 million images that were sampled from the 80 million tiny images data set [28] and converted to grayscale.

Similar to previous work on CSS, the different methods are evaluated according to their ability to minimize the reconstruction error of the data matrix based on the subset of selected columns. In order to quantify the reconstruction error across different data sets, a relative accuracy measure is defined as

\[
\text{Relative Accuracy} = \frac{\|A - \tilde{A}_U\|_F - \|A - \tilde{A}_S\|_F}{\|A - \tilde{A}_U\|_F - \|A - \tilde{A}_l\|_F} \times 100\%,
\]

where $\tilde{A}_U$ is the rank-l approximation of the data matrix based on a random subset U of columns, $\tilde{A}_S$ is the rank-l approximation of the data matrix based on the subset S of columns, and $\tilde{A}_l$ is the best rank-l approximation of the data matrix calculated using the Singular Value Decomposition (SVD). This measure compares different methods relative to uniform sampling as a baseline, with higher values indicating better performance.

The experiments were conducted on Amazon EC2² clusters, which consist of 10 instances for the RCV1-200K data set and 20 instances for the TinyImages-1M data set. Each instance has 7.5 GB of memory and a two-core processor. All instances run Debian and Hadoop. The data sets were converted into a binary format in the form of a sequence of key-value pairs. Each pair consisted of a column index as the key and a vector of the column entries as the value. That is the standard format used in Mahout³ for storing distributed matrices.

The distributed CSS method has been compared with different state-of-the-art methods. It should be noted that most of these methods were not designed with the goal of applying them to massively distributed data, and hence their implementation on MapReduce is not straightforward.
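The relative accuracy measure defined earlier in this section can be sketched on a single machine as follows. This is a minimal NumPy illustration, not the experimental code: the function names are hypothetical, the Frobenius norm is assumed for the reconstruction error, and the rank-l approximation based on a column subset is taken to be the best rank-l approximation of A inside the span of those columns.

```python
import numpy as np


def rank_l_from_columns(A, cols, l):
    """Best rank-l approximation of A restricted to the span of A[:, cols]."""
    Q, _ = np.linalg.qr(A[:, cols])           # orthonormal basis of the selected columns
    W = Q.T @ A                               # coordinates of A in that basis
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return Q @ (U[:, :l] * s[:l]) @ Vt[:l]    # truncate to rank l inside the span


def best_rank_l(A, l):
    """Best rank-l approximation of A via the truncated SVD."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return (U[:, :l] * s[:l]) @ Vt[:l]


def relative_accuracy(A, uniform_cols, selected_cols, l):
    """Relative accuracy (%) of selected_cols against a uniform-sampling baseline."""
    def err(approx):
        return np.linalg.norm(A - approx, "fro")   # assumed error norm

    e_u = err(rank_l_from_columns(A, uniform_cols, l))
    e_s = err(rank_l_from_columns(A, selected_cols, l))
    e_l = err(best_rank_l(A, l))
    return 100.0 * (e_u - e_s) / (e_u - e_l)


# Small usage example on synthetic data (all values here are illustrative).
rng = np.random.default_rng(0)
A = rng.standard_normal((30, 40))
uniform = rng.choice(40, size=5, replace=False)
largest_norm = np.argsort(-np.linalg.norm(A, axis=0))[:5]
print(relative_accuracy(A, uniform, largest_norm, l=5))
```

A value of 0 means the subset is no better than the uniform baseline, 100 means it matches the best rank-l approximation, and negative values mean it is worse than the baseline.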
Nevertheless, the experiments used the best practices for implementing the different steps of these methods on MapReduce, to the best of the authors' knowledge. Specifically, the following distributed CSS algorithms were compared.

UniNoRep: uniform sampling of columns without replacement. This is usually the worst performing method in terms of approximation error, and it is used as a baseline to evaluate methods across different data sets.

HybridUni, HybridCol and HybridSVD: different distributed variants of the hybrid CSS algorithm which can be implemented efficiently on MapReduce. In the randomized phase, the three methods use probabilities calculated based on uniform sampling, column norms and the norms of the rows of the leading singular vectors, respectively. The number of selected columns in the randomized phase is set to l log(l). In the deterministic phase, the centralized greedy CSS is employed to select exactly l columns from the randomly sampled columns.

DistApproxSVD: an extension of the centralized algorithm for sparse approximation of the Singular Value Decomposition (SVD) [14]. The distributed CSS algorithm presented in this paper (Algorithm 4) is used to select columns that best approximate the leading singular vectors (by setting B = U_k Σ_k). The use of the distributed CSS algorithm extends the original algorithm proposed by Çivril and Magdon-Ismail [14] to work on distributed matrices. In order to allow efficient implementation on MapReduce, the number of leading singular vectors is set to 100.

DistGreedyCSS: the distributed column subset selection method described in Algorithm 4. For all experiments, the dimension of the random projection matrix is set to 100. This makes the size of the concise representation the same as in the DistApproxSVD method. Two types of random matrices are used for random projection: (1) a dense Gaussian random matrix (rnd), and (2) a sparse random sign matrix (ssgn).

For the methods that require the calculation of the SVD, the Stochastic SVD (SSVD) algorithm [29] is used to approximate the leading singular values and vectors of the data matrix. The use of SSVD significantly reduces the run time of the original SVD-based algorithms while achieving comparable accuracy. In the conducted experiments, the SSVD implementation of Mahout was used.

² Amazon Elastic Compute Cloud (EC2).
³ Mahout is an Apache project for implementing machine learning algorithms on Hadoop.

Table II. The run times and relative accuracies of different CSS methods. The best performing method for each l is highlighted in bold, and the second best method is underlined. Negative measures indicate methods that perform worse than uniform sampling.
Columns: Run time (minutes) for l = 10, l = 100, l = 500; Relative accuracy (%) for l = 10, l = 100, l = 500.
Rows (RCV1-200K): Uniform - Baseline; Hybrid (Uniform); Hybrid (Column Norms); Hybrid (SVD-based); Distributed Approx. SVD; Distributed Greedy CSS (rnd); Distributed Greedy CSS (ssgn).
Rows (TinyImages-1M): Uniform - Baseline; Hybrid (Uniform); Hybrid (Column Norms); Hybrid (SVD-based); Distributed Approx. SVD; Distributed Greedy CSS (ssgn).

Table II shows the run times and relative accuracies for different CSS methods. It can be observed from the table that for the RCV1-200K data set, the DistGreedyCSS methods (with random Gaussian and sparse random sign matrices) outperform all other methods in terms of relative accuracy.
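The two random projection variants used by DistGreedyCSS (rnd and ssgn) can be illustrated with a short NumPy sketch. The function names are hypothetical, and the exact sparse sign distribution used in the experiments is not stated in this section; the version below assumes the sparse scheme of Achlioptas [21] with sparsity parameter s = 3. It also assumes the concise representation is formed as B = AΩ, where the columns of A are the data instances.

```python
import numpy as np


def gaussian_projection(n, c, rng):
    """Dense Gaussian random matrix of size n x c (the 'rnd' variant)."""
    return rng.standard_normal((n, c))


def sparse_sign_projection(n, c, rng, s=3):
    """Sparse random sign matrix of size n x c (the 'ssgn' variant).

    Assumed distribution: sqrt(s) * {+1 w.p. 1/(2s), 0 w.p. 1 - 1/s, -1 w.p. 1/(2s)}.
    """
    probs = [1 / (2 * s), 1 - 1 / s, 1 / (2 * s)]
    signs = rng.choice([-1.0, 0.0, 1.0], size=(n, c), p=probs)
    return np.sqrt(s) * signs


def concise_representation(A, omega):
    """Project the n columns of A (m x n) down to c columns: B = A @ omega (m x c)."""
    return A @ omega


# Illustrative sizes only; the experiments use a projection dimension of 100.
rng = np.random.default_rng(0)
A = rng.standard_normal((50, 200))
B_rnd = concise_representation(A, gaussian_projection(200, 100, rng))
B_ssgn = concise_representation(A, sparse_sign_projection(200, 100, rng))
print(B_rnd.shape, B_ssgn.shape)
```

With s = 3 roughly two thirds of the entries of the sign matrix are zero, which is one plausible reason the sparse variant is cheaper to apply than the dense Gaussian one.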
In addition, the run times of both DistGreedyCSS variants are relatively small compared to the DistApproxSVD method, which achieves accuracies that are close to those of the DistGreedyCSS method. Both the DistApproxSVD and DistGreedyCSS methods achieve very good approximation accuracies compared to randomized and hybrid methods. It should also be noted that using a sparse random sign matrix for random projection takes much less time than a dense Gaussian matrix, while achieving comparable approximation accuracies. Based on this observation, the sparse random matrix has been used with the TinyImages-1M data set.

For the TinyImages-1M data set, although DistApproxSVD achieves slightly higher approximation accuracies than DistGreedyCSS (with a sparse random sign matrix), DistGreedyCSS selects columns in almost one-third of the time. The reason why DistApproxSVD outperforms DistGreedyCSS for this data set is that the rank of the data matrix is relatively small (less than 1024). This means that a concise representation built from the leading 100 singular values and vectors captures most of the information in the matrix and is accordingly more accurate than random projection. DistGreedyCSS, however, still selects a very good subset of columns in a relatively small time.

IX. CONCLUSION

This paper proposes an accurate and efficient MapReduce algorithm for selecting a subset of columns from a massively distributed matrix. The algorithm starts by learning a concise representation of the data matrix using random projection. It then selects columns from each sub-matrix that best approximate this concise representation. A centralized selection step is then performed on the columns selected from different sub-matrices. In order to facilitate the implementation of the proposed method, a novel algorithm for greedy generalized CSS is proposed to perform the selection from different sub-matrices. In addition, the different steps of the algorithm are carefully designed to be MapReduce-efficient.
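The three steps summarized above (concise representation, per-sub-matrix selection, centralized selection) can be sketched on a single machine as follows. This is a naive illustration under stated assumptions, not the paper's Algorithm 4: the function names are hypothetical, the greedy step recomputes residuals explicitly instead of using an efficient recursive formulation, the loop over partitions stands in for the map phase, and using the concise representation B as the target of the centralized step is an assumption.

```python
import numpy as np


def greedy_generalized_css(A_sub, B, l):
    """Greedily pick up to l columns of A_sub whose span best reconstructs B."""
    F = np.array(A_sub, dtype=float)            # residual of the candidate columns
    E = np.array(B, dtype=float)                # residual of the target matrix
    tol = 1e-12 * np.max(np.sum(F * F, axis=0), initial=0.0)
    selected = []
    for _ in range(min(l, F.shape[1])):
        norms = np.sum(F * F, axis=0)
        gains = np.sum((E.T @ F) ** 2, axis=0)  # ||E^T f_i||^2 for each candidate i
        # Error reduction per candidate; near-zero residual columns score 0.
        scores = np.divide(gains, norms, out=np.zeros_like(gains), where=norms > tol)
        j = int(np.argmax(scores))
        if scores[j] <= 0.0:
            break
        q = F[:, j] / np.sqrt(norms[j])         # new orthonormal direction
        E -= np.outer(q, q @ E)                 # deflate the target
        F -= np.outer(q, q @ F)                 # deflate the candidates
        selected.append(j)
    return selected


def distributed_css(A, l, n_parts=4, c=100, seed=0):
    """Single-machine sketch of distributed-then-centralized column selection."""
    rng = np.random.default_rng(seed)
    omega = rng.standard_normal((A.shape[1], c))
    B = A @ omega                               # concise representation (m x c)
    candidates = []
    for idx in np.array_split(np.arange(A.shape[1]), n_parts):
        local = greedy_generalized_css(A[:, idx], B, l)   # one "mapper" per sub-matrix
        candidates.extend(idx[np.asarray(local, dtype=int)])
    candidates = np.array(candidates, dtype=int)
    final = greedy_generalized_css(A[:, candidates], B, l)  # centralized step (assumed target: B)
    return candidates[np.asarray(final, dtype=int)].tolist()


# Illustrative usage on a synthetic low-rank matrix.
rng = np.random.default_rng(1)
A = rng.standard_normal((20, 3)) @ rng.standard_normal((3, 40))
print(distributed_css(A, l=3, n_parts=4, c=10))
```

The greedy score is the reduction in squared Frobenius error of the target obtained by adding one more column, so each iteration picks the column that helps most given the columns already chosen.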
Experiments on big data sets demonstrate the effectiveness and efficiency of the proposed algorithm in comparison to other CSS methods when implemented on distributed data.

REFERENCES

[1] A. K. Jain and R. C. Dubes, Algorithms for Clustering Data. Upper Saddle River, NJ, USA: Prentice-Hall, Inc., 1988.
[2] L. Kaufman and P. Rousseeuw, "Clustering by means of medoids," Technische Hogeschool, Delft (Netherlands). Department of Mathematics and Informatics, Tech. Rep.
[3] S. Deerwester, S. Dumais, G. Furnas, T. Landauer, and R. Harshman, "Indexing by latent semantic analysis," Journal of the American Society for Information Science and Technology, vol. 41, no. 6.
[4] C. Boutsidis, J. Sun, and N. Anerousis, "Clustered subset selection and its applications on IT service metrics," in Proceedings of the Seventeenth ACM Conference on Information and Knowledge Management (CIKM '08), 2008.
[5] C. Boutsidis, M. W. Mahoney, and P. Drineas, "An improved approximation algorithm for the column subset selection problem," in Proceedings of the Twentieth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA '09), 2009.
[6] C. Boutsidis, P. Drineas, and M. Magdon-Ismail, "Near optimal column-based matrix reconstruction," in Proceedings of the 52nd Annual IEEE Symposium on Foundations of Computer Science (FOCS '11), 2011.
[7] J. Dean and S. Ghemawat, "MapReduce: Simplified data processing on large clusters," Communications of the ACM, vol. 51, no. 1.
[8] T. White, Hadoop: The Definitive Guide, 1st ed. O'Reilly Media, Inc.
[9] A. Frieze, R. Kannan, and S. Vempala, "Fast Monte-Carlo algorithms for finding low-rank approximations," in Proceedings of the 39th Annual IEEE Symposium on Foundations of Computer Science (FOCS '98), 1998.
[10] P. Drineas, A. Frieze, R. Kannan, S. Vempala, and V. Vinay, "Clustering large graphs via the singular value decomposition," Machine Learning, vol. 56, no. 1-3, pp. 9-33.
[11] P. Drineas, R. Kannan, and M. Mahoney, "Fast Monte Carlo algorithms for matrices II: Computing a low-rank approximation to a matrix," SIAM Journal on Computing, vol. 36, no. 1.
[12] P. Drineas, M. Mahoney, and S. Muthukrishnan, "Subspace sampling and relative-error matrix approximation: Column-based methods," in Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques. Springer Berlin / Heidelberg, 2006.
[13] A. Deshpande, L. Rademacher, S. Vempala, and G. Wang, "Matrix approximation and projective clustering via volume sampling," Theory of Computing, vol. 2, no. 1.
[14] A. Çivril and M. Magdon-Ismail, "Column subset selection via sparse approximation of SVD," Theoretical Computer Science, vol. 421, no. 0, pp. 1-14.
[15] A. K. Farahat, A. Ghodsi, and M. S. Kamel, "An efficient greedy method for unsupervised feature selection," in Proceedings of the Eleventh IEEE International Conference on Data Mining (ICDM '11), 2011.
[16] A. K. Farahat, A. Ghodsi, and M. S. Kamel, "Efficient greedy feature selection for unsupervised learning," Knowledge and Information Systems, vol. 35, no. 2.
[17] T. Elsayed, J. Lin, and D. W. Oard, "Pairwise document similarity in large collections with MapReduce," in Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics on Human Language Technologies: Short Papers (HLT '08), 2008.
[18] A. Ene, S. Im, and B. Moseley, "Fast clustering using MapReduce," in Proceedings of the Seventeenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '11), 2011.
[19] H. Karloff, S. Suri, and S. Vassilvitskii, "A model of computation for MapReduce," in Proceedings of the 21st Annual ACM-SIAM Symposium on Discrete Algorithms (SODA '10), 2010.
[20] S. Dasgupta and A. Gupta, "An elementary proof of a theorem of Johnson and Lindenstrauss," Random Structures and Algorithms, vol. 22, no. 1.
[21] D. Achlioptas, "Database-friendly random projections: Johnson-Lindenstrauss with binary coins," Journal of Computer and System Sciences, vol. 66, no. 4.
[22] P. Li, T. J. Hastie, and K. W. Church, "Very sparse random projections," in Proceedings of the Twelfth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '06), 2006.
[23] G. Golub and C. Van Loan, Matrix Computations, 3rd ed. Johns Hopkins Univ Pr.
[24] A. Deshpande and L. Rademacher, "Efficient volume sampling for row/column subset selection," in Proceedings of the 51st Annual IEEE Symposium on Foundations of Computer Science (FOCS '10), 2010.
[25] V. Guruswami and A. K. Sinop, "Optimal column-based low-rank matrix reconstruction," in Proceedings of the 23rd Annual ACM-SIAM Symposium on Discrete Algorithms (SODA '12), 2012.
[26] D. D. Lewis, Y. Yang, T. G. Rose, and F. Li, "RCV1: A new benchmark collection for text categorization research," The Journal of Machine Learning Research, vol. 5.
[27] W.-Y. Chen, Y. Song, H. Bai, C.-J. Lin, and E. Chang, "Parallel spectral clustering in distributed systems," Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 33, no. 3.
[28] A. Torralba, R. Fergus, and W. Freeman, "80 million tiny images: A large data set for nonparametric object and scene recognition," Pattern Analysis and Machine Intelligence, IEEE Transactions on, vol. 30, no. 11.
[29] N. Halko, P.-G. Martinsson, Y. Shkolnisky, and M. Tygert, "An algorithm for the principal component analysis of large data sets," SIAM Journal on Scientific Computing, vol. 33, no. 5, 2011.

Luby s Alg. for Maximal Independent Sets using Pairwise Independence

Luby s Alg. for Maximal Independent Sets using Pairwise Independence Lecture Notes for Randomzed Algorthms Luby s Alg. for Maxmal Independent Sets usng Parwse Independence Last Updated by Erc Vgoda on February, 006 8. Maxmal Independent Sets For a graph G = (V, E), an ndependent

More information

The Development of Web Log Mining Based on Improve-K-Means Clustering Analysis

The Development of Web Log Mining Based on Improve-K-Means Clustering Analysis The Development of Web Log Mnng Based on Improve-K-Means Clusterng Analyss TngZhong Wang * College of Informaton Technology, Luoyang Normal Unversty, Luoyang, 471022, Chna wangtngzhong2@sna.cn Abstract.

More information

What is Candidate Sampling

What is Candidate Sampling What s Canddate Samplng Say we have a multclass or mult label problem where each tranng example ( x, T ) conssts of a context x a small (mult)set of target classes T out of a large unverse L of possble

More information

8.5 UNITARY AND HERMITIAN MATRICES. The conjugate transpose of a complex matrix A, denoted by A*, is given by

8.5 UNITARY AND HERMITIAN MATRICES. The conjugate transpose of a complex matrix A, denoted by A*, is given by 6 CHAPTER 8 COMPLEX VECTOR SPACES 5. Fnd the kernel of the lnear transformaton gven n Exercse 5. In Exercses 55 and 56, fnd the mage of v, for the ndcated composton, where and are gven by the followng

More information

v a 1 b 1 i, a 2 b 2 i,..., a n b n i.

v a 1 b 1 i, a 2 b 2 i,..., a n b n i. SECTION 8.4 COMPLEX VECTOR SPACES AND INNER PRODUCTS 455 8.4 COMPLEX VECTOR SPACES AND INNER PRODUCTS All the vector spaces we have studed thus far n the text are real vector spaces snce the scalars are

More information

An Interest-Oriented Network Evolution Mechanism for Online Communities

An Interest-Oriented Network Evolution Mechanism for Online Communities An Interest-Orented Network Evoluton Mechansm for Onlne Communtes Cahong Sun and Xaopng Yang School of Informaton, Renmn Unversty of Chna, Bejng 100872, P.R. Chna {chsun,yang}@ruc.edu.cn Abstract. Onlne

More information

8 Algorithm for Binary Searching in Trees

8 Algorithm for Binary Searching in Trees 8 Algorthm for Bnary Searchng n Trees In ths secton we present our algorthm for bnary searchng n trees. A crucal observaton employed by the algorthm s that ths problem can be effcently solved when the

More information

Logistic Regression. Lecture 4: More classifiers and classes. Logistic regression. Adaboost. Optimization. Multiple class classification

Logistic Regression. Lecture 4: More classifiers and classes. Logistic regression. Adaboost. Optimization. Multiple class classification Lecture 4: More classfers and classes C4B Machne Learnng Hlary 20 A. Zsserman Logstc regresson Loss functons revsted Adaboost Loss functons revsted Optmzaton Multple class classfcaton Logstc Regresson

More information

Calculation of Sampling Weights

Calculation of Sampling Weights Perre Foy Statstcs Canada 4 Calculaton of Samplng Weghts 4.1 OVERVIEW The basc sample desgn used n TIMSS Populatons 1 and 2 was a two-stage stratfed cluster desgn. 1 The frst stage conssted of a sample

More information

Face Verification Problem. Face Recognition Problem. Application: Access Control. Biometric Authentication. Face Verification (1:1 matching)

Face Verification Problem. Face Recognition Problem. Application: Access Control. Biometric Authentication. Face Verification (1:1 matching) Face Recognton Problem Face Verfcaton Problem Face Verfcaton (1:1 matchng) Querymage face query Face Recognton (1:N matchng) database Applcaton: Access Control www.vsage.com www.vsoncs.com Bometrc Authentcaton

More information

Forecasting the Direction and Strength of Stock Market Movement

Forecasting the Direction and Strength of Stock Market Movement Forecastng the Drecton and Strength of Stock Market Movement Jngwe Chen Mng Chen Nan Ye cjngwe@stanford.edu mchen5@stanford.edu nanye@stanford.edu Abstract - Stock market s one of the most complcated systems

More information

L10: Linear discriminants analysis

L10: Linear discriminants analysis L0: Lnear dscrmnants analyss Lnear dscrmnant analyss, two classes Lnear dscrmnant analyss, C classes LDA vs. PCA Lmtatons of LDA Varants of LDA Other dmensonalty reducton methods CSCE 666 Pattern Analyss

More information

Module 2 LOSSLESS IMAGE COMPRESSION SYSTEMS. Version 2 ECE IIT, Kharagpur

Module 2 LOSSLESS IMAGE COMPRESSION SYSTEMS. Version 2 ECE IIT, Kharagpur Module LOSSLESS IMAGE COMPRESSION SYSTEMS Lesson 3 Lossless Compresson: Huffman Codng Instructonal Objectves At the end of ths lesson, the students should be able to:. Defne and measure source entropy..

More information

benefit is 2, paid if the policyholder dies within the year, and probability of death within the year is ).

benefit is 2, paid if the policyholder dies within the year, and probability of death within the year is ). REVIEW OF RISK MANAGEMENT CONCEPTS LOSS DISTRIBUTIONS AND INSURANCE Loss and nsurance: When someone s subject to the rsk of ncurrng a fnancal loss, the loss s generally modeled usng a random varable or

More information

Support Vector Machines

Support Vector Machines Support Vector Machnes Max Wellng Department of Computer Scence Unversty of Toronto 10 Kng s College Road Toronto, M5S 3G5 Canada wellng@cs.toronto.edu Abstract Ths s a note to explan support vector machnes.

More information

Institute of Informatics, Faculty of Business and Management, Brno University of Technology,Czech Republic

Institute of Informatics, Faculty of Business and Management, Brno University of Technology,Czech Republic Lagrange Multplers as Quanttatve Indcators n Economcs Ivan Mezník Insttute of Informatcs, Faculty of Busness and Management, Brno Unversty of TechnologCzech Republc Abstract The quanttatve role of Lagrange

More information

Implementation of Deutsch's Algorithm Using Mathcad

Implementation of Deutsch's Algorithm Using Mathcad Implementaton of Deutsch's Algorthm Usng Mathcad Frank Roux The followng s a Mathcad mplementaton of Davd Deutsch's quantum computer prototype as presented on pages - n "Machnes, Logc and Quantum Physcs"

More information

Recurrence. 1 Definitions and main statements

Recurrence. 1 Definitions and main statements Recurrence 1 Defntons and man statements Let X n, n = 0, 1, 2,... be a MC wth the state space S = (1, 2,...), transton probabltes p j = P {X n+1 = j X n = }, and the transton matrx P = (p j ),j S def.

More information

Data Broadcast on a Multi-System Heterogeneous Overlayed Wireless Network *

Data Broadcast on a Multi-System Heterogeneous Overlayed Wireless Network * JOURNAL OF INFORMATION SCIENCE AND ENGINEERING 24, 819-840 (2008) Data Broadcast on a Mult-System Heterogeneous Overlayed Wreless Network * Department of Computer Scence Natonal Chao Tung Unversty Hsnchu,

More information

The Greedy Method. Introduction. 0/1 Knapsack Problem

The Greedy Method. Introduction. 0/1 Knapsack Problem The Greedy Method Introducton We have completed data structures. We now are gong to look at algorthm desgn methods. Often we are lookng at optmzaton problems whose performance s exponental. For an optmzaton

More information

Power-of-Two Policies for Single- Warehouse Multi-Retailer Inventory Systems with Order Frequency Discounts

Power-of-Two Policies for Single- Warehouse Multi-Retailer Inventory Systems with Order Frequency Discounts Power-of-wo Polces for Sngle- Warehouse Mult-Retaler Inventory Systems wth Order Frequency Dscounts José A. Ventura Pennsylvana State Unversty (USA) Yale. Herer echnon Israel Insttute of echnology (Israel)

More information

Loop Parallelization

Loop Parallelization - - Loop Parallelzaton C-52 Complaton steps: nested loops operatng on arrays, sequentell executon of teraton space DECLARE B[..,..+] FOR I :=.. FOR J :=.. I B[I,J] := B[I-,J]+B[I-,J-] ED FOR ED FOR analyze

More information

Extending Probabilistic Dynamic Epistemic Logic

Extending Probabilistic Dynamic Epistemic Logic Extendng Probablstc Dynamc Epstemc Logc Joshua Sack May 29, 2008 Probablty Space Defnton A probablty space s a tuple (S, A, µ), where 1 S s a set called the sample space. 2 A P(S) s a σ-algebra: a set

More information

A Fast Incremental Spectral Clustering for Large Data Sets

A Fast Incremental Spectral Clustering for Large Data Sets 2011 12th Internatonal Conference on Parallel and Dstrbuted Computng, Applcatons and Technologes A Fast Incremental Spectral Clusterng for Large Data Sets Tengteng Kong 1,YeTan 1, Hong Shen 1,2 1 School

More information

320 The Internatonal Arab Journal of Informaton Technology, Vol. 5, No. 3, July 2008 Comparsons Between Data Clusterng Algorthms Osama Abu Abbas Computer Scence Department, Yarmouk Unversty, Jordan Abstract:

More information

Project Networks With Mixed-Time Constraints

Project Networks With Mixed-Time Constraints Project Networs Wth Mxed-Tme Constrants L Caccetta and B Wattananon Western Australan Centre of Excellence n Industral Optmsaton (WACEIO) Curtn Unversty of Technology GPO Box U1987 Perth Western Australa

More information

A Programming Model for the Cloud Platform

A Programming Model for the Cloud Platform Internatonal Journal of Advanced Scence and Technology A Programmng Model for the Cloud Platform Xaodong Lu School of Computer Engneerng and Scence Shangha Unversty, Shangha 200072, Chna luxaodongxht@qq.com

More information

A Replication-Based and Fault Tolerant Allocation Algorithm for Cloud Computing

A Replication-Based and Fault Tolerant Allocation Algorithm for Cloud Computing A Replcaton-Based and Fault Tolerant Allocaton Algorthm for Cloud Computng Tork Altameem Dept of Computer Scence, RCC, Kng Saud Unversty, PO Box: 28095 11437 Ryadh-Saud Araba Abstract The very large nfrastructure

More information

Feature selection for intrusion detection. Slobodan Petrović NISlab, Gjøvik University College

Feature selection for intrusion detection. Slobodan Petrović NISlab, Gjøvik University College Feature selecton for ntruson detecton Slobodan Petrovć NISlab, Gjøvk Unversty College Contents The feature selecton problem Intruson detecton Traffc features relevant for IDS The CFS measure The mrmr measure

More information

Improved SVM in Cloud Computing Information Mining

Improved SVM in Cloud Computing Information Mining Internatonal Journal of Grd Dstrbuton Computng Vol.8, No.1 (015), pp.33-40 http://dx.do.org/10.1457/jgdc.015.8.1.04 Improved n Cloud Computng Informaton Mnng Lvshuhong (ZhengDe polytechnc college JangSu

More information

Overview of monitoring and evaluation

Overview of monitoring and evaluation 540 Toolkt to Combat Traffckng n Persons Tool 10.1 Overvew of montorng and evaluaton Overvew Ths tool brefly descrbes both montorng and evaluaton, and the dstncton between the two. What s montorng? Montorng

More information

Realistic Image Synthesis

Realistic Image Synthesis Realstc Image Synthess - Combned Samplng and Path Tracng - Phlpp Slusallek Karol Myszkowsk Vncent Pegoraro Overvew: Today Combned Samplng (Multple Importance Samplng) Renderng and Measurng Equaton Random

More information

Multiple-Period Attribution: Residuals and Compounding

Multiple-Period Attribution: Residuals and Compounding Multple-Perod Attrbuton: Resduals and Compoundng Our revewer gave these authors full marks for dealng wth an ssue that performance measurers and vendors often regard as propretary nformaton. In 1994, Dens

More information

A DATA MINING APPLICATION IN A STUDENT DATABASE

A DATA MINING APPLICATION IN A STUDENT DATABASE JOURNAL OF AERONAUTICS AND SPACE TECHNOLOGIES JULY 005 VOLUME NUMBER (53-57) A DATA MINING APPLICATION IN A STUDENT DATABASE Şenol Zafer ERDOĞAN Maltepe Ünversty Faculty of Engneerng Büyükbakkalköy-Istanbul

More information

CS 2750 Machine Learning. Lecture 3. Density estimation. CS 2750 Machine Learning. Announcements

CS 2750 Machine Learning. Lecture 3. Density estimation. CS 2750 Machine Learning. Announcements Lecture 3 Densty estmaton Mlos Hauskrecht mlos@cs.ptt.edu 5329 Sennott Square Next lecture: Matlab tutoral Announcements Rules for attendng the class: Regstered for credt Regstered for audt (only f there

More information

BERNSTEIN POLYNOMIALS

BERNSTEIN POLYNOMIALS On-Lne Geometrc Modelng Notes BERNSTEIN POLYNOMIALS Kenneth I. Joy Vsualzaton and Graphcs Research Group Department of Computer Scence Unversty of Calforna, Davs Overvew Polynomals are ncredbly useful

More information

Cluster Analysis. Cluster Analysis

Cluster Analysis. Cluster Analysis Cluster Analyss Cluster Analyss What s Cluster Analyss? Types of Data n Cluster Analyss A Categorzaton of Maor Clusterng Methos Parttonng Methos Herarchcal Methos Densty-Base Methos Gr-Base Methos Moel-Base

More information

Learning from Multiple Outlooks

Learning from Multiple Outlooks Learnng from Multple Outlooks Maayan Harel Department of Electrcal Engneerng, Technon, Hafa, Israel She Mannor Department of Electrcal Engneerng, Technon, Hafa, Israel maayanga@tx.technon.ac.l she@ee.technon.ac.l

More information

Survey on Virtual Machine Placement Techniques in Cloud Computing Environment

Survey on Virtual Machine Placement Techniques in Cloud Computing Environment Survey on Vrtual Machne Placement Technques n Cloud Computng Envronment Rajeev Kumar Gupta and R. K. Paterya Department of Computer Scence & Engneerng, MANIT, Bhopal, Inda ABSTRACT In tradtonal data center

More information

J. Parallel Distrib. Comput.

J. Parallel Distrib. Comput. J. Parallel Dstrb. Comput. 71 (2011) 62 76 Contents lsts avalable at ScenceDrect J. Parallel Dstrb. Comput. journal homepage: www.elsever.com/locate/jpdc Optmzng server placement n dstrbuted systems n

More information

Sngle Snk Buy at Bulk Problem and the Access Network

Sngle Snk Buy at Bulk Problem and the Access Network A Constant Factor Approxmaton for the Sngle Snk Edge Installaton Problem Sudpto Guha Adam Meyerson Kamesh Munagala Abstract We present the frst constant approxmaton to the sngle snk buy-at-bulk network

More information

A Simple Approach to Clustering in Excel

A Simple Approach to Clustering in Excel A Smple Approach to Clusterng n Excel Aravnd H Center for Computatonal Engneerng and Networng Amrta Vshwa Vdyapeetham, Combatore, Inda C Rajgopal Center for Computatonal Engneerng and Networng Amrta Vshwa

More information

) of the Cell class is created containing information about events associated with the cell. Events are added to the Cell instance

) of the Cell class is created containing information about events associated with the cell. Events are added to the Cell instance Calbraton Method Instances of the Cell class (one nstance for each FMS cell) contan ADC raw data and methods assocated wth each partcular FMS cell. The calbraton method ncludes event selecton (Class Cell

More information

Cluster Analysis of Data Points using Partitioning and Probabilistic Model-based Algorithms

Cluster Analysis of Data Points using Partitioning and Probabilistic Model-based Algorithms Internatonal Journal of Appled Informaton Systems (IJAIS) ISSN : 2249-0868 Foundaton of Computer Scence FCS, New York, USA Volume 7 No.7, August 2014 www.jas.org Cluster Analyss of Data Ponts usng Parttonng

More information

A hybrid global optimization algorithm based on parallel chaos optimization and outlook algorithm

A hybrid global optimization algorithm based on parallel chaos optimization and outlook algorithm Avalable onlne www.ocpr.com Journal of Chemcal and Pharmaceutcal Research, 2014, 6(7):1884-1889 Research Artcle ISSN : 0975-7384 CODEN(USA) : JCPRC5 A hybrd global optmzaton algorthm based on parallel

More information

Ants Can Schedule Software Projects

Ants Can Schedule Software Projects Ants Can Schedule Software Proects Broderck Crawford 1,2, Rcardo Soto 1,3, Frankln Johnson 4, and Erc Monfroy 5 1 Pontfca Unversdad Católca de Valparaíso, Chle FrstName.Name@ucv.cl 2 Unversdad Fns Terrae,

More information

DEFINING %COMPLETE IN MICROSOFT PROJECT

DEFINING %COMPLETE IN MICROSOFT PROJECT CelersSystems DEFINING %COMPLETE IN MICROSOFT PROJECT PREPARED BY James E Aksel, PMP, PMI-SP, MVP For Addtonal Informaton about Earned Value Management Systems and reportng, please contact: CelersSystems,

More information

Abstract. Clustering ensembles have emerged as a powerful method for improving both the

Abstract. Clustering ensembles have emerged as a powerful method for improving both the Clusterng Ensembles: {topchyal, Models jan, of punch}@cse.msu.edu Consensus and Weak Parttons * Alexander Topchy, Anl K. Jan, and Wllam Punch Department of Computer Scence and Engneerng, Mchgan State Unversty

More information

Joint Scheduling of Processing and Shuffle Phases in MapReduce Systems

Joint Scheduling of Processing and Shuffle Phases in MapReduce Systems Jont Schedulng of Processng and Shuffle Phases n MapReduce Systems Fangfe Chen, Mural Kodalam, T. V. Lakshman Department of Computer Scence and Engneerng, The Penn State Unversty Bell Laboratores, Alcatel-Lucent

More information

Graph Calculus: Scalable Shortest Path Analytics for Large Social Graphs through Core Net

Graph Calculus: Scalable Shortest Path Analytics for Large Social Graphs through Core Net Graph Calculus: Scalable Shortest Path Analytcs for Large Socal Graphs through Core Net Lxn Fu Department of Computer Scence Unversty of North Carolna at Greensboro Greensboro, NC, U.S.A. lfu@uncg.edu

More information

How To Know The Components Of Mean Squared Error Of Herarchcal Estmator S

How To Know The Components Of Mean Squared Error Of Herarchcal Estmator S S C H E D A E I N F O R M A T I C A E VOLUME 0 0 On Mean Squared Error of Herarchcal Estmator Stans law Brodowsk Faculty of Physcs, Astronomy, and Appled Computer Scence, Jagellonan Unversty, Reymonta

More information

PAS: A Packet Accounting System to Limit the Effects of DoS & DDoS. Debish Fesehaye & Klara Naherstedt University of Illinois-Urbana Champaign

PAS: A Packet Accounting System to Limit the Effects of DoS & DDoS. Debish Fesehaye & Klara Naherstedt University of Illinois-Urbana Champaign PAS: A Packet Accountng System to Lmt the Effects of DoS & DDoS Debsh Fesehaye & Klara Naherstedt Unversty of Illnos-Urbana Champagn DoS and DDoS DDoS attacks are ncreasng threats to our dgtal world. Exstng

More information

The OC Curve of Attribute Acceptance Plans

The OC Curve of Attribute Acceptance Plans The OC Curve of Attrbute Acceptance Plans The Operatng Characterstc (OC) curve descrbes the probablty of acceptng a lot as a functon of the lot s qualty. Fgure 1 shows a typcal OC Curve. 10 8 6 4 1 3 4

More information

Can Auto Liability Insurance Purchases Signal Risk Attitude?

Can Auto Liability Insurance Purchases Signal Risk Attitude? Internatonal Journal of Busness and Economcs, 2011, Vol. 10, No. 2, 159-164 Can Auto Lablty Insurance Purchases Sgnal Rsk Atttude? Chu-Shu L Department of Internatonal Busness, Asa Unversty, Tawan Sheng-Chang

More information

1 Example 1: Axis-aligned rectangles

1 Example 1: Axis-aligned rectangles COS 511: Theoretcal Machne Learnng Lecturer: Rob Schapre Lecture # 6 Scrbe: Aaron Schld February 21, 2013 Last class, we dscussed an analogue for Occam s Razor for nfnte hypothess spaces that, n conjuncton

More information

A Performance Analysis of View Maintenance Techniques for Data Warehouses

A Performance Analysis of View Maintenance Techniques for Data Warehouses A Performance Analyss of Vew Mantenance Technques for Data Warehouses Xng Wang Dell Computer Corporaton Round Roc, Texas Le Gruenwald The nversty of Olahoma School of Computer Scence orman, OK 739 Guangtao

More information

Enterprise Master Patient Index

Enterprise Master Patient Index Enterprse Master Patent Index Healthcare data are captured n many dfferent settngs such as hosptals, clncs, labs, and physcan offces. Accordng to a report by the CDC, patents n the Unted States made an

Causal, Explanatory Forecasting. Analysis. Regression Analysis. Simple Linear Regression. Which is Independent? Forecasting

Causal, Explanatory Forecasting. Assumes cause-and-effect relationship between system inputs and its output. Forecasting with Regression Analysis. Richard S. Barr. Inputs, System, Cause + Effect Relationship. The job of

On the Optimal Control of a Cascade of Hydro-Electric Power Stations

M.C.M. Guedes a, A.F. Ribeiro a, G.V. Smirnov b and S. Vilela c. a Department of Mathematics, School of Sciences, University of Porto, Portugal;

Selecting Best Employee of the Year Using Analytical Hierarchy Process

J. Basic. Appl. Sci. Res., 5(11)72-76, 2015. 2015, TextRoad Publication. ISSN 2090-4304. Journal of Basic and Applied Scientific Research. www.textroad.com. Selecting Best Employee of the Year Using Analytical Hierarchy

Fast Fuzzy Clustering of Web Page Collections

Christian Borgelt and Andreas Nürnberger, Dept. of Knowledge Processing and Language Engineering, Otto-von-Guericke-University of Magdeburg, Universitätsplatz, D-396 Magdeburg,

How Sets of Coherent Probabilities May Serve as Models for Degrees of Incoherence

1st International Symposium on Imprecise Probabilities and Their Applications, Ghent, Belgium, 29 June - 2 July 1999. How Sets of Coherent Probabilities May Serve as Models for Degrees of Incoherence. Mark J. Schervish

An Alternative Way to Measure Private Equity Performance

Peter Todd, Parilux Investment Technology LLC. Summary: Internal Rate of Return (IRR) is probably the most common way to measure the performance of private

Fuzzy TOPSIS Method in the Selection of Investment Boards by Incorporating Operational Risks

July 6-8, 2011, London, U.K. Fuzzy TOPSIS Method in the Selection of Investment Boards by Incorporating Operational Risks. Elissa Nadia Madi and Abu Osman Md Tap. Abstract: Multi Criteria Decision Making (MCDM) involves

Fault tolerance in cloud technologies presented as a service

International Scientific Conference Computer Science 2015. Pavel Dzhunev, PhD student. Fault tolerance in cloud technologies presented as a service. INTRODUCTION: Improvements in techniques for virtualization and performance

This circuit then can be reduced to a planar circuit

Mesh-Current Method. The mesh-current method is the analog of the node-voltage method. We solve for a new set of variables, mesh currents, that automatically satisfy KCLs. As such, the mesh-current method reduces circuit solution to

Bayesian Network Based Causal Relationship Identification and Funding Success Prediction in P2P Lending

Proceedings of 2012 4th International Conference on Machine Learning and Computing, IPCSIT vol. 25 (2012), IACSIT Press, Singapore. Bayesian Network Based Causal Relationship Identification and Funding Success

THE DISTRIBUTION OF LOAN PORTFOLIO VALUE * Oldrich Alfons Vasicek

The amount of capital necessary to support a portfolio of debt securities depends on the probability distribution of the portfolio loss. Consider a portfolio

Single and multiple stage classifiers implementing logistic discrimination

Hélio Radke Bittencourt 1, Denis Altieri de Oliveira Moraes 2, Victor Haertel 2. 1 Pontifícia Universidade Católica do Rio Grande do Sul - PUCRS, Av. Ipiranga,

Methodology to Determine Relationships between Performance Factors in Hadoop Cloud Computing Applications

Luis Eduardo Bautista Villalpando 1,2, Alain April 1 and Alain Abran 1. 1 Department of Software Engineering and

ECE544NA Final Project: Robust Machine Learning Hardware via Classifier Ensemble

Sai Zhang, szhang12@illinois.edu, Dept. of Electr. & Comput. Eng., Univ. of Illinois at Urbana-Champaign, Urbana, IL, USA. Abstract: In

1. Fundamentals of probability theory 2. Emergence of communication traffic 3. Stochastic & Markovian Processes (SP & MP)

Communication Networks II (Görg), SS20, www.comnets.uni-bremen.de. Communication Networks II. Contents: 1. Fundamentals of probability theory 2. Emergence of communication traffic 3. Stochastic & Markovian Processes

CHOLESTEROL REFERENCE METHOD LABORATORY NETWORK. Sample Stability Protocol

Background: The Cholesterol Reference Method Laboratory Network (CRMLN) developed certification protocols for total cholesterol, HDL

A heuristic task deployment approach for load balancing

Abstract. Gaochao Xu, Yunmeng Dong, Xiaodong Fu, Yan Ding, Peng Lu, Jia Zhao *. College of

An interactive system for structure-based ASCII art creation

Katsunori Miyake, Henry Johan, Tomoyuki Nishita. The University of Tokyo, Nanyang Technological University. Abstract: Non-Photorealistic Rendering (NPR), whose aim

How To Understand The Results Of The German Meris Cloud And Water Vapour Product

Title: MERIS level 3 cloud and water vapour products. Project: MAPP. Doc. No.: MAPP-ATBD-ClWVL3. Issue: 1. Revision: 0. Date: 9.12.1998. Author: Bennartz (FUB), Preusker (FUB), Schüller

ANALYZING THE RELATIONSHIPS BETWEEN QUALITY, TIME, AND COST IN PROJECT MANAGEMENT DECISION MAKING

Matthew J. Liberatore, Department of Management and Operations, Villanova University, Villanova, PA 19085, 610-519-4390,

Document Clustering Analysis Based on Hybrid PSO+K-means Algorithm

Xiaohui Cui, Thomas E. Potok. Applied Software Engineering Research Group, Computational Sciences and Engineering Division, Oak Ridge National Laboratory,

A Prefix Code Matching Parallel Load-Balancing Method for Solution-Adaptive Unstructured Finite Element Graphs on Distributed Memory Multicomputers

The Journal of Supercomputing, 15, 25-49, 2000. 2000 Kluwer Academic Publishers. Manufactured in The Netherlands. A Prefix Code Matching Parallel Load-Balancing Method for Solution-Adaptive Unstructured Finite Element

An Adaptive and Distributed Clustering Scheme for Wireless Sensor Networks

2007 International Conference on Convergence Information Technology. An Adaptive and Distributed Clustering Scheme for Wireless Sensor Networks. Xinguo Wang, Xinming Zhang, Guoliang Chen, Shuang Tan. Department of Computer

ERP Software Selection Using The Rough Set And TPOSIS Methods

ERP Software Selection Using The Rough Set And TPOSIS Methods Under Fuzzy Environment. Information Management Department, Hunan University of Finance and Economics, No. 139, Fenglin 2nd Road, Changsha, 410205, China

Greedy Column Subset Selection for Large-scale Data Sets

Knowledge and Information Systems manuscript No. (will be inserted by the editor). Greedy Column Subset Selection for Large-scale Data Sets. Ahmed K. Farahat, Ahmed Elgohary, Ali Ghodsi, Mohamed S. Kamel. Received:

Vision Mouse. Saurabh Sarkar a* University of Cincinnati, Cincinnati, USA ABSTRACT 1. INTRODUCTION

The report discusses a vision based approach towards tracking of eyes and fingers. The report describes the process of locating the possible

Robust Design of Public Storage Warehouses. Yeming (Yale) Gong EMLYON Business School

Rene de Koster, Rotterdam School of Management, Erasmus University. Abstract: We apply robust optimization and revenue management

METHODOLOGY TO DETERMINE RELATIONSHIPS BETWEEN PERFORMANCE FACTORS IN HADOOP CLOUD COMPUTING APPLICATIONS

Luis Eduardo Bautista Villalpando 1,2, Alain April 1 and Alain Abran 1. 1 Department of Software Engineering

PSYCHOLOGICAL RESEARCH (PYC 304-C) Lecture 12

14 The Chi-squared distribution. If a normal variable X, having mean µ and variance σ², is standardised, the new variable Z has a mean 0 and variance 1. When this standardised

A DYNAMIC CRASHING METHOD FOR PROJECT MANAGEMENT USING SIMULATION-BASED OPTIMIZATION. Michael E. Kuhl Radhamés A. Tolentino-Peña

Proceedings of the 2008 Winter Simulation Conference. S. J. Mason, R. R. Hill, L. Mönch, O. Rose, T. Jefferson, J. W. Fowler, eds.

Traffic-light a stress test for life insurance provisions

MEMORANDUM. Date: 006-09-7. Authors: Bengt von Bahr, Göran Ronge. Finansinspektionen, P.O. Box 6750, SE-113 85 Stockholm [Sveavägen 167]. Tel +46 8 787 80 00, Fax

The Stochastic Guaranteed Service Model with Recourse for Multi-Echelon Warehouse Management

Jörg Rambau, Konrad Schade 1. Lehrstuhl für Wirtschaftsmathematik, Universität Bayreuth, Bayreuth, Germany. Abstract: The

Tools for Privacy Preserving Distributed Data Mining

Chris Clifton, Murat Kantarcioglu, Jaideep Vaidya. Purdue University, Department of Computer Sciences, 250 N University St, West Lafayette, IN 47907-2066 USA. (clifton, kanmurat,

Risk Model of Long-Term Production Scheduling in Open Pit Gold Mining

R Halatchev 1 and P Lever 2. ABSTRACT: Open pit gold mining is an important sector of the Australian mining industry. It uses large amounts of investments,

A Scalable Data Science Workflow Approach for Big Data Bayesian Network Learning

Jianwu Wang 1, Yan Tang 2, Mai Nguyen 1, Ilkay Altintas 1. 1 San Diego Supercomputer Center, University of California, San Diego, La Jolla,

Open Access A Load Balancing Strategy with Bandwidth Constraint in Cloud Computing. Jing Deng 1,*, Ping Guo 2, Qi Li 3, Haizhu Chen 1

Send Orders for Reprints to reprints@benthamscience.ae. The Open Cybernetics & Systemics Journal, 2014, 8, 115-121. Open Access. A Load Balancing Strategy with Bandwidth Constraint in Cloud Computing. Jing Deng 1,*,

Bayesian Cluster Ensembles

Hongjun Wang 1, Hanhuai Shan 2 and Arindam Banerjee 2. 1 Information Research Institute, Southwest Jiaotong University, Chengdu, Sichuan, 610031, China. 2 Department of Computer Science &

NPAR TESTS. One-Sample Chi-Square Test. Cell Specification. Observed Frequencies (O_i). Expected Frequencies (EXP_i)

NPAR TESTS. If a WEIGHT variable is specified, it is used to replicate a case as many times as indicated by the weight value rounded to the nearest integer. If the workspace requirements are exceeded and sampling has

Out-of-Sample Extensions for LLE, Isomap, MDS, Eigenmaps, and Spectral Clustering

Yoshua Bengio, Jean-François Paiement, Pascal Vincent, Olivier Delalleau, Nicolas Le Roux and Marie Ouimet. Département d'Informatique

Virtual Network Embedding with Coordinated Node and Link Mapping

N. M. Mosharaf Kabir Chowdhury, Cheriton School of Computer Science, University of Waterloo, Waterloo, Canada. Email: nmmkchow@uwaterloo.ca. Muntasir Raihan

Mining Feature Importance: Applying Evolutionary Algorithms within a Web-based Educational System

Behrouz MINAEI-BIDGOLI 1, and Gerd KORTEMEYER 2, and William F. PUNCH 1. 1 Genetic Algorithms Research and Applications

Forecasting the Demand of Emergency Supplies: Based on the CBR Theory and BP Neural Network

Proceedings of the 8th International Conference on Innovation & Management. Forecasting the Demand of Emergency Supplies: Based on the CBR Theory and BP Neural Network. Fu Deqiang, Lu Yun, Li Changbing. School

Search Efficient Representation of Healthcare Data based on the HL7 RIM

JOURNAL OF COMPUTERS, VOL. 5, NO. 12, DECEMBER 21. Search Efficient Representation of Healthcare Data based on the HL7 RIM. Razan Paul, Department of Computer Science and Engineering, Bangladesh University of
