Applied Multivariate Analysis  Big data analytics


 Jared Lang
 1 years ago
 Views:
Transcription
1 Applied Multivariate Analysis  Big data analytics Nathalie VillaVialaneix M1 in Economics and Economics and Statistics Toulouse School of Economics Nathalie VillaVialaneix M1 E&S  Big data, part 1 1/48
2 Course outline 1 Short introduction to bootstrap 2 Application of bootstrap to classification: bagging 3 Application of bagging to CART: random forests 4 Introduction to parallel computing 5 Application of map reduce to statistical learning Nathalie VillaVialaneix M1 E&S  Big data, part 1 2/48
3 Section 1 Short introduction to bootstrap Nathalie VillaVialaneix M1 E&S  Big data, part 1 3/48
4 Basics about bootstrap General method for: parameter estimation (especially bias) confidence interval estimation in a nonparametric context (i.e., when the law of the observation is completely unknown). Can handle small sample size (n small). Nathalie VillaVialaneix M1 E&S  Big data, part 1 4/48
5 Basics about bootstrap General method for: parameter estimation (especially bias) confidence interval estimation in a nonparametric context (i.e., when the law of the observation is completely unknown). Can handle small sample size (n small). [Efron, 1979] proposes to simulate the unknown law using resampling from the observations. Nathalie VillaVialaneix M1 E&S  Big data, part 1 4/48
6 Notations and problem Framework: X random variable with (unknown) law P. Problem: estimation of a parameter θ(p) with an estimate R n = R(x 1,..., x n ) where x 1,..., x n are i.i.d. observations of P? Standard examples: estimation of the mean: θ(p) = xdp(x) with R n = x = 1 n n i=1 x i; estimation of the variance: θ(p) = x 2 dp(x) ( xdp(x) ) 2 with R n = 1 n 1 n i=1 (x i x) 2. Nathalie VillaVialaneix M1 E&S  Big data, part 1 5/48
7 Notations and problem Framework: X random variable with (unknown) law P. Problem: estimation of a parameter θ(p) with an estimate R n = R(x 1,..., x n ) where x 1,..., x n are i.i.d. observations of P? Standard examples: estimation of the mean: θ(p) = xdp(x) with R n = x = 1 n n i=1 x i; estimation of the variance: θ(p) = x 2 dp(x) ( xdp(x) ) 2 with R n = 1 n 1 n i=1 (x i x) 2. In the previous examples, R n is a plugin estimate: R n = θ(p n ) where P n is the empirical distribution 1 n n i=1 δ x i. Nathalie VillaVialaneix M1 E&S  Big data, part 1 5/48
8 Real/bootstrap worlds sampling law P (related to X) P n = 1 n n i=1 δ x i (related to X n ) sample X n = {x 1,..., x n } X n = {x 1,..., x n} (sample of size n with replacement in X n ) estimation R n = R(x 1,..., x n ) R n = R(x 1,..., x n) Nathalie VillaVialaneix M1 E&S  Big data, part 1 6/48
9 Example: bootstrap estimation of the IC for mean Sample obtained from χ 2 distribution (n = 50, number of df: 3) histogram of (x i ) i=1,...,n and R n = 1 n n i=1 x i Nathalie VillaVialaneix M1 E&S  Big data, part 1 7/48
10 Example: bootstrap estimation of the IC for mean Distribution of B = 1000 estimates obtained from bootstrap samples: estimation of a confidence interval from 2.5% and 97.5% quantiles histogram of (R,b n ) b=1,...,b with R,b n = 1 n n i=1 x i and Q µ = quantile({r,b n } b, µ), µ {0.025, 0.975} Nathalie VillaVialaneix M1 E&S  Big data, part 1 7/48
11 Other practical examples θ is any parameter to be estimated: the mean of the distribution (as in the previous example), the median, the variance, slope or intercept in linear models... bias estimate estimate θ with the empirical estimate R n ; obtain B bootstrap estimates of θ, (Rn,b ) b=1,...,b ; bias of R n is estimated by: 1 B b Rn,b R n Nathalie VillaVialaneix M1 E&S  Big data, part 1 8/48
12 Other practical examples θ is any parameter to be estimated: the mean of the distribution (as in the previous example), the median, the variance, slope or intercept in linear models... bias estimate estimate θ with the empirical estimate R n ; obtain B bootstrap estimates of θ, (Rn,b ) b=1,...,b ; bias of R n is estimated by: 1 B b Rn,b R n variance estimate estimate θ with the empirical estimate R n ; obtain B bootstrap estimates of θ, (Rn,b ) b=1,...,b ; variance of R n is estimated by: 1 B b(rn,b Rn) 2 where Rn = 1 B b Rn,b Nathalie VillaVialaneix M1 E&S  Big data, part 1 8/48
13 Other practical examples θ is any parameter to be estimated: the mean of the distribution (as in the previous example), the median, the variance, slope or intercept in linear models... bias estimate estimate θ with the empirical estimate R n ; obtain B bootstrap estimates of θ, (Rn,b ) b=1,...,b ; bias of R n is estimated by: 1 B b Rn,b R n variance estimate estimate θ with the empirical estimate R n ; obtain B bootstrap estimates of θ, (Rn,b ) b=1,...,b ; variance of R n is estimated by: 1 B b(rn,b Rn) 2 where Rn = 1 B b Rn,b confidence interval estimate estimate θ with the empirical estimate R n ; obtain B bootstrap estimates of θ, (Rn,b ) b=1,...,b ; confidence interval at risk α of R n is estimated by: [Q α/2 ; Q 1 α/2 ] where Q µ = quantile({rn,b } b, µ) Nathalie VillaVialaneix M1 E&S  Big data, part 1 8/48
14 Do it yourself: unrealistic bootstrap by hand! {x i } i : ; ; ; ; n? empirical estimate for the mean? unbiased estimate for the variance? Nathalie VillaVialaneix M1 E&S  Big data, part 1 9/48
15 Do it yourself: unrealistic bootstrap by hand! {x i } i : ; ; ; ; n = 5 empirical estimate for the mean R n = x = 1 n n i=1 x i = unbiased estimate for the variance R n = ˆσ n 1 = 1 n 1 n i=1 (x i x) 2 = Nathalie VillaVialaneix M1 E&S  Big data, part 1 9/48
16 Do it yourself: unrealistic bootstrap by hand! {x i } i : ; ; ; ; n = 5 empirical estimate for the mean R n = x = 1 n n i=1 x i = unbiased estimate for the variance R n = ˆσ n 1 = 1 n 1 n i=1 (x i x) 2 = Bootstrap samples (B = 2): b = 1: x 3, x 3, x 2, x 1, x 3 b = 2: x 3, x 3, x 2, x 1, x 3 bootstrap estimate for the variance of x? bootstrap estimate for the mean of the empirical variance? Nathalie VillaVialaneix M1 E&S  Big data, part 1 9/48
17 Do it yourself: unrealistic bootstrap by hand! {x i } i : ; ; ; ; n = 5 empirical estimate for the mean R n = x = 1 n n i=1 x i = unbiased estimate for the variance R n = ˆσ n 1 = 1 n 1 n i=1 (x i x) 2 = Bootstrap samples (B = 2): b = 1: x 3, x 3, x 2, x 1, x 3 b = 2: x 3, x 3, x 2, x 1, x 3 = , Rn,2 = and Var (x) = bootstrap estimate for the mean of the empirical variance bootstrap estimate for the variance of x R,1 n R,1 n = , R,2 n = and Ê (σ n 1 ) = Nathalie VillaVialaneix M1 E&S  Big data, part 1 9/48
18 user and the package boot library( boot) # a sample from a Chi  Square distribution is generated orig. sample < rchisq(50, df =3) # the estimate of the mean is mean( orig. sample) # function that calculates estimate from a bootstrap sample. mean < function(x, d) { return( mean(x[d])) } # bootstraping now... boot. mean < boot( orig. sample, sample. mean, R =1000) boot. mean # ORDINARY NONPARAMETRIC BOOTSTRAP # Call: # boot( data = orig. sample, statistic = sample. mean, # R = 1000) # Bootstrap Statistics : # original bias std. error # t1* Nathalie VillaVialaneix M1 E&S  Big data, part 1 10/48
19 Section 2 Application of bootstrap to classification: bagging Nathalie VillaVialaneix M1 E&S  Big data, part 1 11/48
20 What is bagging? Bagging: Bootstrap Aggregating metaalgorithm based on bootstrap which aggregate an ensemble of predictors in statistical classification and regression (special case of model averaging approaches) Nathalie VillaVialaneix M1 E&S  Big data, part 1 12/48
21 What is bagging? Bagging: Bootstrap Aggregating metaalgorithm based on bootstrap which aggregate an ensemble of predictors in statistical classification and regression (special case of model averaging approaches) Notations: a random pair of variables (X, Y): X X and Y R (regression) or Y {1,..., K} (classification) a training set (x i, y i ) i=1,...,n of i.i.d. observations of (X, Y) Purpose: train a function, Φ n : X {1,..., K}, from (x i, y i ) i capable of predicting Y from X Nathalie VillaVialaneix M1 E&S  Big data, part 1 12/48
22 Basics Suppose that we are given an algorithm: T = { (x i, y i ) } i Φ T where Φ T is a classification function: Φ T : x X Φ T (x) {1,..., K}. Nathalie VillaVialaneix M1 E&S  Big data, part 1 13/48
23 Basics Suppose that we are given an algorithm: T = { (x i, y i ) } i Φ T where Φ T is a classification function: Φ T : x X Φ T (x) {1,..., K}. B classifiers can be defined from B bootstrap samples using this algorithm: b = 1,..., B, T b is a bootstrap sample of (x i, y i ) i=1,...,n and Φ b = Φ T b. Nathalie VillaVialaneix M1 E&S  Big data, part 1 13/48
24 Basics Suppose that we are given an algorithm: T = { (x i, y i ) } i Φ T where Φ T is a classification function: Φ T : x X Φ T (x) {1,..., K}. B classifiers can be defined from B bootstrap samples using this algorithm: b = 1,..., B, T b is a bootstrap sample of (x i, y i ) i=1,...,n and Φ b = Φ T b. (Φ b ) b=1,...,b are aggregated using a majority vote scheme (an averaging in the regression case): x X, Φ n (x) := argmax k=1,...,k {b : Φ b (x) = k} where S denotes the cardinal of a finite set S. Nathalie VillaVialaneix M1 E&S  Big data, part 1 13/48
25 Why using bagging? Bagging improves stability and limits the risk of overtraining. Experiment: Using breastcancer dataset from mlbench 1 : 699 observations on 10 variables, 9 being ordered or nominal and describing a tumor and 1 target class indicating if this tumor was malignant or benign. 1 Data are coming from the UCI machine learning repository Nathalie VillaVialaneix M1 E&S  Big data, part 1 14/48
26 To begin: bagging by hand... for x = Cl.thickness Cell.size Cell.shape Marg.adhesion j g g d Epith.c.size Bare.nuclei Bl.cromatin Normal.nucleoli e j e g Mitoses b and the following classification trees obtained from 3 bootstrap samples, what is the prediction for x by bagging? Nathalie VillaVialaneix M1 E&S  Big data, part 1 15/48
27 Nathalie VillaVialaneix M1 E&S  Big data, part 1 16/48
28 individual predictions are: malignant, malignant, malignant so the final prediction is malignant Nathalie VillaVialaneix M1 E&S  Big data, part 1 16/48
29 Description of computational aspects 100 runs. For each run: split the data into a training set (399 observations) and a test set (300 remaining observations); train a classification tree on the training set. Using the test set, calculate a test misclassification error by comparing the prediction given by the trained tree with the true class; generate 50 bootstrap samples from the training set and use them to compute 500 classification trees. Use them to compute a bagging prediction for the test sets and calculate a bagging misclassification error by comparing the bagging prediction with the true class. Nathalie VillaVialaneix M1 E&S  Big data, part 1 17/48
30 Description of computational aspects 100 runs. For each run: split the data into a training set (399 observations) and a test set (300 remaining observations); train a classification tree on the training set. Using the test set, calculate a test misclassification error by comparing the prediction given by the trained tree with the true class; generate 50 bootstrap samples from the training set and use them to compute 500 classification trees. Use them to compute a bagging prediction for the test sets and calculate a bagging misclassification error by comparing the bagging prediction with the true class. this results in test errors 100 bagging errors Nathalie VillaVialaneix M1 E&S  Big data, part 1 17/48
31 Results of the simulations Nathalie VillaVialaneix M1 E&S  Big data, part 1 18/48
32 What is overfitting? Function x y to be estimated Nathalie VillaVialaneix M1 E&S  Big data, part 1 19/48
33 What is overfitting? Observations we might have Nathalie VillaVialaneix M1 E&S  Big data, part 1 19/48
34 What is overfitting? Observations we do have Nathalie VillaVialaneix M1 E&S  Big data, part 1 19/48
35 What is overfitting? First estimation from the observations: underfitting Nathalie VillaVialaneix M1 E&S  Big data, part 1 19/48
36 What is overfitting? Second estimation from the observations: accurate estimation Nathalie VillaVialaneix M1 E&S  Big data, part 1 19/48
37 What is overfitting? Third estimation from the observations: overfitting Nathalie VillaVialaneix M1 E&S  Big data, part 1 19/48
38 What is overfitting? Summary A compromise must be made between accuracy and generalization ability. Nathalie VillaVialaneix M1 E&S  Big data, part 1 19/48
39 Why do bagging predictors work? References: [Breiman, 1994, Breiman, 1996]. For some instable predictors (such as classification trees for instance), a small change in the training set can yield to a big change in the trained tree due to overfitting (hence misclassification error obtained on the training dataset is very optimistic) Bagging reduces this instability by using an averaging procedure. Nathalie VillaVialaneix M1 E&S  Big data, part 1 20/48
40 Estimating the generalization ability Several strategies can be used to estimate the generalization ability of an algorithm: split the data into a training/test set ( 67/33%): the model is estimated with the training dataset and a test error is computed with the remaining observations; Nathalie VillaVialaneix M1 E&S  Big data, part 1 21/48
41 Estimating the generalization ability Several strategies can be used to estimate the generalization ability of an algorithm: split the data into a training/test set ( 67/33%): the model is estimated with the training dataset and a test error is computed with the remaining observations; cross validation: split the data into L folds. For each fold, train a model without the data included in the current fold and compute the error with the data included in the fold: the averaging of these L errors is the cross validation error; Nathalie VillaVialaneix M1 E&S  Big data, part 1 21/48
42 Estimating the generalization ability Several strategies can be used to estimate the generalization ability of an algorithm: split the data into a training/test set ( 67/33%): the model is estimated with the training dataset and a test error is computed with the remaining observations; cross validation: split the data into L folds. For each fold, train a model without the data included in the current fold and compute the error with the data included in the fold: the averaging of these L errors is the cross validation error; outofbag error (see next slide). Nathalie VillaVialaneix M1 E&S  Big data, part 1 21/48
43 Outofbag observations, prediction and error OOB (OutOf Bags) error: error based on the observations not included in the bag : for every i = 1,..., n, calculate the OOB prediction: Φ OOB (x i ) = argmax k=1,...,k { b : Φ b (x i ) = k and x i T b} (x i is said to be outofbag for T b if x i T b ) Nathalie VillaVialaneix M1 E&S  Big data, part 1 22/48
44 Outofbag observations, prediction and error OOB (OutOf Bags) error: error based on the observations not included in the bag : for every i = 1,..., n, calculate the OOB prediction: Φ OOB (x i ) = argmax k=1,...,k { b : Φ b (x i ) = k and x i T b} (x i is said to be outofbag for T b if x i T b ) OOB error is the misclassification rate of these estimates: 1 n n I {Φ OOB (x i ) y i} i=1 It is often a bit optimistic compared to the test error. Nathalie VillaVialaneix M1 E&S  Big data, part 1 22/48
45 Section 3 Application of bagging to CART: random forests Nathalie VillaVialaneix M1 E&S  Big data, part 1 23/48
46 General framework Notations: a random pair of variables (X, Y): X R p and Y R (regression) or Y {1,..., K} (classification) a training set (x i, y i ) i=1,...,n of i.i.d. observations of (X, Y) Purpose: train a function, Φ n : R p {1,..., K}, from (x i, y i ) i capable of predicting Y from X Nathalie VillaVialaneix M1 E&S  Big data, part 1 24/48
47 Overview: Advantages/Drawbacks Random Forest: introduced by [Breiman, 2001] Nathalie VillaVialaneix M1 E&S  Big data, part 1 25/48
48 Overview: Advantages/Drawbacks Random Forest: introduced by [Breiman, 2001] Advantages classification OR regression (i.e., Y can be a numeric variable or a factor); non parametric method (no prior assumption needed) and accurate; can deal with a large number of input variables, either numeric variables or factors; can deal with small/large samples. Nathalie VillaVialaneix M1 E&S  Big data, part 1 25/48
49 Overview: Advantages/Drawbacks Random Forest: introduced by [Breiman, 2001] Advantages classification OR regression (i.e., Y can be a numeric variable or a factor); non parametric method (no prior assumption needed) and accurate; can deal with a large number of input variables, either numeric variables or factors; can deal with small/large samples. Drawbacks black box model; is not supported by strong mathematical results (consistency...) until now. Nathalie VillaVialaneix M1 E&S  Big data, part 1 25/48
50 Basis for random forest: bagging of classification trees Suppose: Y {1,..., K} (classification problem) and (Φ b ) b=1,...,b are B classifiers, Φ b : R p {1,..., K}, obtained from B bootstrap samples of {(x i, y i )} i=1,...,n. Nathalie VillaVialaneix M1 E&S  Big data, part 1 26/48
51 Basis for random forest: bagging of classification trees Suppose: Y {1,..., K} (classification problem) and (Φ b ) b=1,...,b are B classifiers, Φ b : R p {1,..., K}, obtained from B bootstrap samples of {(x i, y i )} i=1,...,n. Basic bagging with classification trees 1: for b = 1,..., B do 2: Construct a bootstrap sample T b from {(x i, y i )} i=1,...,n 3: Train a classification tree from T b, Φ b 4: end for 5: Aggregate the classifiers with majority vote Φ n (x) := argmax k=1,...,k {b : Φ b (x) = k} where S denotes the cardinal of a finite set S. Nathalie VillaVialaneix M1 E&S  Big data, part 1 26/48
52 Random forests CART bagging with underefficient trees to avoid overfitting 1 for every tree, each time a split is made, it is preceded by a random choice of q variables among the p available X = (X 1, X 2,..., X p ). The current node is then built based on these variables only: it is defined as the split among the q variables that produces the two subsets with the largest interclass variance. An advisable choice for q is p for classification (and p/3 for regression); Nathalie VillaVialaneix M1 E&S  Big data, part 1 27/48
53 Random forests CART bagging with underefficient trees to avoid overfitting 1 for every tree, each time a split is made, it is preceded by a random choice of q variables among the p available X = (X 1, X 2,..., X p ). The current node is then built based on these variables only: it is defined as the split among the q variables that produces the two subsets with the largest interclass variance. An advisable choice for q is p for classification (and p/3 for regression); 2 the tree is restricted to the first l nodes with l small. Nathalie VillaVialaneix M1 E&S  Big data, part 1 27/48
54 Random forests CART bagging with underefficient trees to avoid overfitting 1 for every tree, each time a split is made, it is preceded by a random choice of q variables among the p available X = (X 1, X 2,..., X p ). The current node is then built based on these variables only: it is defined as the split among the q variables that produces the two subsets with the largest interclass variance. An advisable choice for q is p for classification (and p/3 for regression); 2 the tree is restricted to the first l nodes with l small. Hyperparameters those of the CART algorithm (maximal depth, minimum size of a node, minimum homogeneity of a node...); those that are specific to the random forest: q, number of bootstrap samples (B also called number of trees). Nathalie VillaVialaneix M1 E&S  Big data, part 1 27/48
55 Random forests CART bagging with underefficient trees to avoid overfitting 1 for every tree, each time a split is made, it is preceded by a random choice of q variables among the p available X = (X 1, X 2,..., X p ). The current node is then built based on these variables only: it is defined as the split among the q variables that produces the two subsets with the largest interclass variance. An advisable choice for q is p for classification (and p/3 for regression); 2 the tree is restricted to the first l nodes with l small. Hyperparameters those of the CART algorithm (maximal depth, minimum size of a node, minimum homogeneity of a node...); those that are specific to the random forest: q, number of bootstrap samples (B also called number of trees). Random forest are not very sensitive to hyperparameters setting: default values for q and bootstrap sample size (2n/3) should work in most cases. Nathalie VillaVialaneix M1 E&S  Big data, part 1 27/48
56 Additional tools OOB (OutOf Bags) error: error based on the OOB predictions. Stabilization of OOB error is a good indication that there is enough trees in the forest. Nathalie VillaVialaneix M1 E&S  Big data, part 1 28/48
57 Additional tools OOB (OutOf Bags) error: error based on the OOB predictions. Stabilization of OOB error is a good indication that there is enough trees in the forest. Importance of a variable to help interpretation: for a given variable X j (j {1,..., p}), the importance of X j is the mean decrease in accuracy obtained when the values of X j are randomized: I(X j ) = P (Φ n (X) = Y) P ( Φ n (X (j) = Y ) in which X (j) = (X 1,..., X (j),..., X p ), X (j) being the variable X j with permuted values. Nathalie VillaVialaneix M1 E&S  Big data, part 1 28/48
58 Additional tools OOB (OutOf Bags) error: error based on the OOB predictions. Stabilization of OOB error is a good indication that there is enough trees in the forest. Importance of a variable to help interpretation: for a given variable X j (j {1,..., p}), the importance of X j is the mean decrease in accuracy obtained when the values of X j are randomized. Importance is estimated with OOB observations (see next slide for details) Nathalie VillaVialaneix M1 E&S  Big data, part 1 28/48
59 Importance estimation in random forests OOB estimation for variable X j 1: for b = 1 B (loop on trees) do 2: permute values for (x j i ) i: x i T b return x(j,b) = (x 1,..., x (j,b) i i permuted values x (j,b) i 3: predict Φ b ( x (j,b)) 4: end for 5: return OOB estimation of the importance 1 B 1 B I b=1 T b {Φb (x i )=y i} 1 x i T T b b I { } Φ b (x (j,b) )=y i x i T b i,..., x p i ), Nathalie VillaVialaneix M1 E&S  Big data, part 1 29/48
60 Section 4 Introduction to parallel computing Nathalie VillaVialaneix M1 E&S  Big data, part 1 30/48
61 Very basic background on parallel computing Purpose: Distributed or parallel computing seeks at distributing a calculation on several cores (multicore processors), on several processors (multiprocessor computers) or on clusters (composed of several computers). Nathalie VillaVialaneix M1 E&S  Big data, part 1 31/48
62 Very basic background on parallel computing Purpose: Distributed or parallel computing seeks at distributing a calculation on several cores (multicore processors), on several processors (multiprocessor computers) or on clusters (composed of several computers). Constraint: communication between cores slows down the computation a strategy consists in breaking the calculation into independent parts so that each processing unit executes its part independently from the others. Nathalie VillaVialaneix M1 E&S  Big data, part 1 31/48
63 Parallel computing with nonbig data Framework: the data (number of observations n) is small enough to allow the processors to access them all and the calculation can be easily broken into independent parts. Nathalie VillaVialaneix M1 E&S  Big data, part 1 32/48
64 Parallel computing with nonbig data Framework: the data (number of observations n) is small enough to allow the processors to access them all and the calculation can be easily broken into independent parts. Example: bagging can easily be computed with a parallel strategy (it is said to be embarrassingly parallel): first step: each processing unit creates one (or several) bootstrap sample(s) and learn a (or several) classifier(s) from it; final step: a processing unit collect all results and combine them into a single classifier with a majority vote law. Nathalie VillaVialaneix M1 E&S  Big data, part 1 32/48
65 Illustration with random forest random forest Nathalie VillaVialaneix M1 E&S  Big data, part 1 33/48
66 Parallel computing with R R CRAN task view: HighPerformanceComputing.html What does this course cover? very basic explicit parallel computing; the package foreach (you can also have a look at the packages snow and snowfall for parallel versions of apply, lapply, sapply,... and management of random seed settings in parallel environments). What does this course not cover? implicit parallel computing; large memory management (see for instance biglm and bigmemory to be able to use outofmemory problems with R which occur because R is limited to physical randomaccess memory); GPU computing; resource management;... TL;TD: many things are still to learn at the end of this course... Nathalie VillaVialaneix M1 E&S  Big data, part 1 34/48
67 Parallel computing with Big Data What is Big Data? Massive datasets: Volume: n is huge, and the storage capacity of the data can exceed several GB or PB (ex: Facebook reports to have 21 Peta Bytes of data with 12 TB new data daily); Velocity: the data are coming in/out very fast and the need to analyze them is fast as well; Variety: the data are composed of many types of data. Nathalie VillaVialaneix M1 E&S  Big data, part 1 35/48
68 Parallel computing with Big Data What is Big Data? Massive datasets: Volume: n is huge, and the storage capacity of the data can exceed several GB or PB (ex: Facebook reports to have 21 Peta Bytes of data with 12 TB new data daily); Velocity: the data are coming in/out very fast and the need to analyze them is fast as well; Variety: the data are composed of many types of data. What needs to be adapted to use parallel programming? In this framework a single processing unit cannot analyze all the dataset by itself the data must be split between the different processing units as well. Nathalie VillaVialaneix M1 E&S  Big data, part 1 35/48
69 Overview of Map Reduce Data 1 Data 2 Data Data 3 The data are broken into several bits. Nathalie VillaVialaneix M1 E&S  Big data, part 1 36/48
70 Overview of Map Reduce Data 1 Map {(key, value)} Data 2 Map {(key, value)} Data Data 3 Map {(key, value)} Each bit is processed through ONE map step and gives pairs {(key, value)}. Nathalie VillaVialaneix M1 E&S  Big data, part 1 36/48
71 Overview of Map Reduce Data 1 Map {(key, value)} Data 2 Map {(key, value)} Data Data 3 Map {(key, value)} Map jobs must be independent! Result: indexed data. Nathalie VillaVialaneix M1 E&S  Big data, part 1 36/48
72 Overview of Map Reduce Data 1 Map {(key, value)} Data 2 Map {(key, value)} Reduce key = key 1 Data OUTPUT Reduce key = key k Data 3 Map {(key, value)} Each key is processed through ONE reduce step to produce the output. Nathalie VillaVialaneix M1 E&S  Big data, part 1 36/48
73 Map Reduce in practice (stupid) Case study: A huge number of sales identified by the shop and the amount. shop1,25000 shop2,12 shop2,1500 shop4,47 shop1, Question: Extract the total amount per shop. Standard way (sequential) the data are read sequentially; a vector containing the values of the current sum for every shop is updated at each line. Map Reduce way (parallel)... Nathalie VillaVialaneix M1 E&S  Big data, part 1 37/48
74 Map Reduce for an aggregation framework Data 1 Data 2 Data Data 3 The data are broken into several bits. Nathalie VillaVialaneix M1 E&S  Big data, part 1 38/48
75 Map Reduce for an aggregation framework Data 1 Map {(key, value)} Data 2 Map {(key, value)} Data Data 3 Map {(key, value)} Map step: reads the line and outputs a pair key=shop and value=amount. Nathalie VillaVialaneix M1 E&S  Big data, part 1 38/48
76 Map Reduce for an aggregation framework Data 1 Map {(key, value)} Data 2 Map {(key, value)} Reduce key = key 1 Data OUTPUT Reduce key = key k Data 3 Map {(key, value)} Reduce step: for every key (i.e., shop), compute the sum of values. Nathalie VillaVialaneix M1 E&S  Big data, part 1 38/48
77 In practice, Hadoop framework Apache Hadoop: opensource software framework for Big Data programmed in Java. It contains: a distributed file system (data seem to be accessed as if they were on a single computer, though distributed on several storage units); a mapreduce framework that takes advantage of data locality. It is divided into: Name Nodes (typically two) that manage the file system index and Data Nodes that contain a small portion of the data and processing capabilities. Nathalie VillaVialaneix M1 E&S  Big data, part 1 39/48
78 In practice, Hadoop framework Apache Hadoop: opensource software framework for Big Data programmed in Java. It contains: a distributed file system (data seem to be accessed as if they were on a single computer, though distributed on several storage units); a mapreduce framework that takes advantage of data locality. It is divided into: Name Nodes (typically two) that manage the file system index and Data Nodes that contain a small portion of the data and processing capabilities. Data inside HDFS are not indexed (unlike SQL data for instance) but stored as simple text files (e.g., comma separated) queries cannot be performed simply. Nathalie VillaVialaneix M1 E&S  Big data, part 1 39/48
79 Hadoop & R How will we be using it? We will be using a R interface for Hadoop, composed of several packages (see https://github.com/revolutionanalytics/rhadoop/wiki): studied: rmr: Map Reduce framework (can be used as if Hadoop is installed, even if it is not...); not studied: rhdfs (to manage Hadoop data filesystem), rhbase (to manage Hadoop HBase database), plyrmr (advanced data processing functions with a plyr syntax). Installing rmr without Hadoop: Nathalie VillaVialaneix M1 E&S  Big data, part 1 40/48
The Set Covering Machine
Journal of Machine Learning Research 3 (2002) 723746 Submitted 12/01; Published 12/02 The Set Covering Machine Mario Marchand School of Information Technology and Engineering University of Ottawa Ottawa,
More informationModel Compression. Cristian Bucilă Computer Science Cornell University. Rich Caruana. Alexandru NiculescuMizil. cristi@cs.cornell.
Model Compression Cristian Bucilă Computer Science Cornell University cristi@cs.cornell.edu Rich Caruana Computer Science Cornell University caruana@cs.cornell.edu Alexandru NiculescuMizil Computer Science
More informationIntroduction to Data Mining and Knowledge Discovery
Introduction to Data Mining and Knowledge Discovery Third Edition by Two Crows Corporation RELATED READINGS Data Mining 99: Technology Report, Two Crows Corporation, 1999 M. Berry and G. Linoff, Data Mining
More informationAn Introduction to Variable and Feature Selection
Journal of Machine Learning Research 3 (23) 11571182 Submitted 11/2; Published 3/3 An Introduction to Variable and Feature Selection Isabelle Guyon Clopinet 955 Creston Road Berkeley, CA 9478151, USA
More informationChoosing Multiple Parameters for Support Vector Machines
Machine Learning, 46, 131 159, 2002 c 2002 Kluwer Academic Publishers. Manufactured in The Netherlands. Choosing Multiple Parameters for Support Vector Machines OLIVIER CHAPELLE LIP6, Paris, France olivier.chapelle@lip6.fr
More informationNo One (Cluster) Size Fits All: Automatic Cluster Sizing for Dataintensive Analytics
No One (Cluster) Size Fits All: Automatic Cluster Sizing for Dataintensive Analytics Herodotos Herodotou Duke University hero@cs.duke.edu Fei Dong Duke University dongfei@cs.duke.edu Shivnath Babu Duke
More information1 Covariate Shift by Kernel Mean Matching
1 Covariate Shift by Kernel Mean Matching Arthur Gretton 1 Alex Smola 1 Jiayuan Huang Marcel Schmittfull Karsten Borgwardt Bernhard Schölkopf Given sets of observations of training and test data, we consider
More informationA Guide to Sample Size Calculations for Random Effect Models via Simulation and the MLPowSim Software Package
A Guide to Sample Size Calculations for Random Effect Models via Simulation and the MLPowSim Software Package William J Browne, Mousa Golalizadeh Lahi* & Richard MA Parker School of Clinical Veterinary
More informationLearning to Select Features using their Properties
Journal of Machine Learning Research 9 (2008) 23492376 Submitted 8/06; Revised 1/08; Published 10/08 Learning to Select Features using their Properties Eyal Krupka Amir Navot Naftali Tishby School of
More informationUnbiased Groupwise Alignment by Iterative Central Tendency Estimation
Math. Model. Nat. Phenom. Vol. 3, No. 6, 2008, pp. 232 Unbiased Groupwise Alignment by Iterative Central Tendency Estimation M.S. De Craene a1, B. Macq b, F. Marques c, P. Salembier c, and S.K. Warfield
More informationTop 10 algorithms in data mining
Knowl Inf Syst (2008) 14:1 37 DOI 10.1007/s1011500701142 SURVEY PAPER Top 10 algorithms in data mining Xindong Wu Vipin Kumar J. Ross Quinlan Joydeep Ghosh Qiang Yang Hiroshi Motoda Geoffrey J. McLachlan
More informationReview of basic statistics and the simplest forecasting model: the sample mean
Review of basic statistics and the simplest forecasting model: the sample mean Robert Nau Fuqua School of Business, Duke University August 2014 Most of what you need to remember about basic statistics
More informationGetting Started with SAS Enterprise Miner 7.1
Getting Started with SAS Enterprise Miner 7.1 SAS Documentation The correct bibliographic citation for this manual is as follows: SAS Institute Inc 2011. Getting Started with SAS Enterprise Miner 7.1.
More informationMINING DATA STREAMS WITH CONCEPT DRIFT
Poznan University of Technology Faculty of Computing Science and Management Institute of Computing Science Master s thesis MINING DATA STREAMS WITH CONCEPT DRIFT Dariusz Brzeziński Supervisor Jerzy Stefanowski,
More information1 NOT ALL ANSWERS ARE EQUALLY
1 NOT ALL ANSWERS ARE EQUALLY GOOD: ESTIMATING THE QUALITY OF DATABASE ANSWERS Amihai Motro, Igor Rakov Department of Information and Software Systems Engineering George Mason University Fairfax, VA 220304444
More informationLarge Linear Classification When Data Cannot Fit In Memory
Large Linear Classification When Data Cannot Fit In Memory ABSTRACT HsiangFu Yu Dept. of Computer Science National Taiwan University Taipei 106, Taiwan b93107@csie.ntu.edu.tw KaiWei Chang Dept. of Computer
More informationClean Answers over Dirty Databases: A Probabilistic Approach
Clean Answers over Dirty Databases: A Probabilistic Approach Periklis Andritsos University of Trento periklis@dit.unitn.it Ariel Fuxman University of Toronto afuxman@cs.toronto.edu Renée J. Miller University
More informationGetting the Most Out of Ensemble Selection
Getting the Most Out of Ensemble Selection Rich Caruana, Art Munson, Alexandru NiculescuMizil Department of Computer Science Cornell University Technical Report 20062045 {caruana, mmunson, alexn} @cs.cornell.edu
More informationDetecting LargeScale System Problems by Mining Console Logs
Detecting LargeScale System Problems by Mining Console Logs Wei Xu Ling Huang Armando Fox David Patterson Michael I. Jordan EECS Department University of California at Berkeley, USA {xuw,fox,pattrsn,jordan}@cs.berkeley.edu
More informationMODEL SELECTION FOR SOCIAL NETWORKS USING GRAPHLETS
MODEL SELECTION FOR SOCIAL NETWORKS USING GRAPHLETS JEANNETTE JANSSEN, MATT HURSHMAN, AND NAUZER KALYANIWALLA Abstract. Several network models have been proposed to explain the link structure observed
More informationStrong Rules for Discarding Predictors in Lassotype Problems
Strong Rules for Discarding Predictors in Lassotype Problems Robert Tibshirani, Jacob Bien, Jerome Friedman, Trevor Hastie, Noah Simon, Jonathan Taylor, and Ryan J. Tibshirani Departments of Statistics
More informationMining Data Streams. Chapter 4. 4.1 The Stream Data Model
Chapter 4 Mining Data Streams Most of the algorithms described in this book assume that we are mining a database. That is, all our data is available when and if we want it. In this chapter, we shall make
More informationNCSS Statistical Software
Chapter 06 Introduction This procedure provides several reports for the comparison of two distributions, including confidence intervals for the difference in means, twosample ttests, the ztest, the
More informationLoad Shedding for Aggregation Queries over Data Streams
Load Shedding for Aggregation Queries over Data Streams Brian Babcock Mayur Datar Rajeev Motwani Department of Computer Science Stanford University, Stanford, CA 94305 {babcock, datar, rajeev}@cs.stanford.edu
More informationProduct quantization for nearest neighbor search
Product quantization for nearest neighbor search Hervé Jégou, Matthijs Douze, Cordelia Schmid Abstract This paper introduces a product quantization based approach for approximate nearest neighbor search.
More informationHandling Missing Values when Applying Classification Models
Journal of Machine Learning Research 8 (2007) 16251657 Submitted 7/05; Revised 5/06; Published 7/07 Handling Missing Values when Applying Classification Models Maytal SaarTsechansky The University of
More informationAutomatically Detecting Vulnerable Websites Before They Turn Malicious
Automatically Detecting Vulnerable Websites Before They Turn Malicious Kyle Soska and Nicolas Christin, Carnegie Mellon University https://www.usenix.org/conference/usenixsecurity14/technicalsessions/presentation/soska
More informationES 2 : A Cloud Data Storage System for Supporting Both OLTP and OLAP
ES 2 : A Cloud Data Storage System for Supporting Both OLTP and OLAP Yu Cao, Chun Chen,FeiGuo, Dawei Jiang,YutingLin, Beng Chin Ooi, Hoang Tam Vo,SaiWu, Quanqing Xu School of Computing, National University
More informationGaussian Processes in Machine Learning
Gaussian Processes in Machine Learning Carl Edward Rasmussen Max Planck Institute for Biological Cybernetics, 72076 Tübingen, Germany carl@tuebingen.mpg.de WWW home page: http://www.tuebingen.mpg.de/ carl
More informationBigBench: Towards an Industry Standard Benchmark for Big Data Analytics
BigBench: Towards an Industry Standard Benchmark for Big Data Analytics Ahmad Ghazal 1,5, Tilmann Rabl 2,6, Minqing Hu 1,5, Francois Raab 4,8, Meikel Poess 3,7, Alain Crolotte 1,5, HansArno Jacobsen 2,9
More information