Applied Multivariate Analysis - Big data analytics


Applied Multivariate Analysis - Big data analytics
Nathalie Villa-Vialaneix, nathalie.villa@toulouse.inra.fr
M1 in Economics and Economics and Statistics, Toulouse School of Economics

Course outline
1 Short introduction to bootstrap
2 Application of bootstrap to classification: bagging
3 Application of bagging to CART: random forests
4 Introduction to parallel computing
5 Application of map reduce to statistical learning

Section 1: Short introduction to bootstrap

Basics about bootstrap
A general method for:
- parameter estimation (especially bias),
- confidence interval estimation,
in a non-parametric context (i.e., when the law of the observations is completely unknown). It can handle small sample sizes (n small). [Efron, 1979] proposed to simulate the unknown law by re-sampling from the observations.

Notations and problem
Framework: X, a random variable with (unknown) law P.
Problem: estimate a parameter $\theta(P)$ with an estimate $R_n = R(x_1, \ldots, x_n)$, where $x_1, \ldots, x_n$ are i.i.d. observations of P.
Standard examples:
- estimation of the mean: $\theta(P) = \int x \, dP(x)$ with $R_n = \bar{x} = \frac{1}{n} \sum_{i=1}^n x_i$;
- estimation of the variance: $\theta(P) = \int x^2 \, dP(x) - \left( \int x \, dP(x) \right)^2$ with $R_n = \frac{1}{n-1} \sum_{i=1}^n (x_i - \bar{x})^2$.
In these examples, $R_n$ is a plug-in estimate: $R_n = \theta(P_n)$, where $P_n$ is the empirical distribution $\frac{1}{n} \sum_{i=1}^n \delta_{x_i}$.

Real/bootstrap worlds
- sampling law: real world, $P$ (related to $X$); bootstrap world, $P_n = \frac{1}{n} \sum_{i=1}^n \delta_{x_i}$ (related to $X^*$);
- sample: real world, $X_n = \{x_1, \ldots, x_n\}$; bootstrap world, $X_n^* = \{x_1^*, \ldots, x_n^*\}$, a sample of size n drawn with replacement from $X_n$;
- estimation: real world, $R_n = R(x_1, \ldots, x_n)$; bootstrap world, $R_n^* = R(x_1^*, \ldots, x_n^*)$.
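This correspondence is one line of R: a bootstrap sample is obtained by drawing, with replacement, n values from the observed sample. A minimal sketch, assuming a numeric vector x holding the observations:

x <- rchisq(50, df = 3)                         # observed sample (real world)
n <- length(x)
x.star <- sample(x, size = n, replace = TRUE)   # bootstrap sample (bootstrap world)
mean(x)        # R_n, the statistic computed on the sample
mean(x.star)   # R_n^*, the same statistic on the bootstrap sample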

Example: bootstrap estimation of the CI for the mean
A sample is obtained from a $\chi^2$ distribution (n = 50, 3 degrees of freedom). [Figure: histogram of $(x_i)_{i=1,\ldots,n}$ with $R_n = \frac{1}{n} \sum_{i=1}^n x_i$]

Distribution of B = 1000 estimates obtained from bootstrap samples: a confidence interval is estimated from the 2.5% and 97.5% quantiles. [Figure: histogram of $(R_n^{*,b})_{b=1,\ldots,B}$ with $R_n^{*,b} = \frac{1}{n} \sum_{i=1}^n x_i^*$ and $Q_\mu = \mathrm{quantile}(\{R_n^{*,b}\}_b, \mu)$, $\mu \in \{0.025, 0.975\}$]

Other practical examples
$\theta$ is any parameter to be estimated: the mean of the distribution (as in the previous example), the median, the variance, a slope or an intercept in a linear model...
In all cases: estimate $\theta$ with the empirical estimate $R_n$ and obtain B bootstrap estimates of $\theta$, $(R_n^{*,b})_{b=1,\ldots,B}$. Then:
- bias estimate: the bias of $R_n$ is estimated by $\frac{1}{B} \sum_b R_n^{*,b} - R_n$;
- variance estimate: the variance of $R_n$ is estimated by $\frac{1}{B} \sum_b (R_n^{*,b} - \bar{R}_n^*)^2$, where $\bar{R}_n^* = \frac{1}{B} \sum_b R_n^{*,b}$;
- confidence interval estimate: a confidence interval at risk $\alpha$ for $R_n$ is estimated by $[Q_{\alpha/2}; Q_{1-\alpha/2}]$, where $Q_\mu = \mathrm{quantile}(\{R_n^{*,b}\}_b, \mu)$.
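The three recipes share the same skeleton. A minimal base-R sketch for the mean (the sample x and the number of replicates B are placeholders):

x <- rchisq(50, df = 3)
B <- 1000
R.n <- mean(x)                        # empirical estimate
# B bootstrap estimates of the parameter
R.star <- replicate(B, mean(sample(x, length(x), replace = TRUE)))
mean(R.star) - R.n                    # bias estimate
mean((R.star - mean(R.star))^2)       # variance estimate
quantile(R.star, c(0.025, 0.975))     # confidence interval at risk alpha = 5%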

Do it yourself: unrealistic bootstrap by hand!
$\{x_i\}_i$: …; …; …; …; …, so n = 5.
Empirical estimate for the mean: $R_n = \bar{x} = \frac{1}{n} \sum_{i=1}^n x_i = \ldots$
Unbiased estimate for the variance: $R_n = \hat{\sigma}_{n-1} = \frac{1}{n-1} \sum_{i=1}^n (x_i - \bar{x})^2 = \ldots$
Bootstrap samples (B = 2): b = 1: $x_3, x_3, x_2, x_1, x_3$; b = 2: $x_3, x_3, x_2, x_1, x_3$.
Bootstrap estimate for the variance of $\bar{x}$: $R_n^{*,1} = \ldots$, $R_n^{*,2} = \ldots$ and $\widehat{\mathrm{Var}}^*(\bar{x}) = \ldots$
Bootstrap estimate for the mean of the empirical variance: $R_n^{*,1} = \ldots$, $R_n^{*,2} = \ldots$ and $\hat{E}^*(\hat{\sigma}_{n-1}) = \ldots$

In R: the package boot

library(boot)
# a sample from a chi-squared distribution is generated
orig.sample <- rchisq(50, df = 3)
# the estimate of the mean is
mean(orig.sample)
# statistic: calculates the estimate from a bootstrap sample
# (d contains the indices of the bootstrap sample)
sample.mean <- function(x, d) {
  return(mean(x[d]))
}
# bootstrapping now...
boot.mean <- boot(orig.sample, sample.mean, R = 1000)
boot.mean
# ORDINARY NONPARAMETRIC BOOTSTRAP
# Call: boot(data = orig.sample, statistic = sample.mean, R = 1000)
# Bootstrap Statistics:
#     original  bias  std. error
# t1* ...       ...   ...
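From the returned boot object, the percentile confidence interval of the earlier slides can be extracted with boot.ci, using type = "perc":

# percentile interval from the 2.5% and 97.5% quantiles of the replicates
boot.ci(boot.mean, conf = 0.95, type = "perc")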

Section 2: Application of bootstrap to classification: bagging

What is bagging?
Bagging: Bootstrap Aggregating, a meta-algorithm based on the bootstrap that aggregates an ensemble of predictors in statistical classification and regression (a special case of model averaging approaches).
Notations: a random pair of variables (X, Y), with $X \in \mathcal{X}$ and $Y \in \mathbb{R}$ (regression) or $Y \in \{1, \ldots, K\}$ (classification); a training set $(x_i, y_i)_{i=1,\ldots,n}$ of i.i.d. observations of (X, Y).
Purpose: train a function $\Phi_n: \mathcal{X} \to \{1, \ldots, K\}$ from $(x_i, y_i)_i$, capable of predicting Y from X.

Basics
Suppose that we are given an algorithm $\mathcal{T} = \{(x_i, y_i)\}_i \mapsto \Phi_{\mathcal{T}}$, where $\Phi_{\mathcal{T}}$ is a classification function: $\Phi_{\mathcal{T}}: x \in \mathcal{X} \mapsto \Phi_{\mathcal{T}}(x) \in \{1, \ldots, K\}$.
B classifiers can be defined from B bootstrap samples using this algorithm: for $b = 1, \ldots, B$, $\mathcal{T}^b$ is a bootstrap sample of $(x_i, y_i)_{i=1,\ldots,n}$ and $\Phi^b = \Phi_{\mathcal{T}^b}$.
$(\Phi^b)_{b=1,\ldots,B}$ are aggregated using a majority vote scheme (an average in the regression case): $\forall x \in \mathcal{X}$, $\Phi_n(x) := \arg\max_{k=1,\ldots,K} |\{b: \Phi^b(x) = k\}|$, where $|S|$ denotes the cardinality of a finite set S.
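A minimal sketch of this scheme in R with rpart trees as the base classifiers (the data frame train, with a factor response y, is an assumption):

library(rpart)
B <- 50
# one classification tree per bootstrap sample of the training set
trees <- lapply(1:B, function(b) {
  boot.idx <- sample(nrow(train), replace = TRUE)
  rpart(y ~ ., data = train[boot.idx, ], method = "class")
})
# aggregation of the B predictions by majority vote
bagging.predict <- function(newdata) {
  votes <- sapply(trees, function(t) as.character(predict(t, newdata, type = "class")))
  votes <- matrix(votes, nrow = nrow(newdata))
  apply(votes, 1, function(v) names(which.max(table(v))))
}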

Why use bagging?
Bagging improves stability and limits the risk of overtraining.
Experiment, using the BreastCancer dataset from the R package mlbench (data from the UCI Machine Learning Repository): 699 observations on 10 variables, 9 of them ordered or nominal and describing a tumor, and 1 target class indicating whether the tumor was malignant or benign.

To begin: bagging by hand...
For the observation x with Cl.thickness = j, Cell.size = g, Cell.shape = g, Marg.adhesion = d, Epith.c.size = e, Bare.nuclei = j, Bl.cromatin = e, Normal.nucleoli = g, Mitoses = b, and the following classification trees obtained from 3 bootstrap samples, what is the prediction for x by bagging?

[Figure: the three classification trees trained on the three bootstrap samples]

The individual predictions are malignant, malignant and malignant, so the final bagging prediction is malignant.

Description of computational aspects
100 runs. For each run:
- split the data into a training set (399 observations) and a test set (the 300 remaining observations);
- train a classification tree on the training set; using the test set, compute a test misclassification error by comparing the prediction of the trained tree with the true class;
- generate 50 bootstrap samples from the training set and use them to train 50 classification trees; use these trees to compute a bagging prediction on the test set and a bagging misclassification error by comparing the bagging prediction with the true class.
This results in 100 test errors and 100 bagging errors.

Results of the simulations [Figure: comparison of the distributions of the 100 test errors and of the 100 bagging errors]

What is overfitting?
[Sequence of figures on one example: the function x ↦ y to be estimated; observations we might have; observations we do have; a first estimation from the observations: underfitting; a second estimation: accurate estimation; a third estimation: overfitting.]
Summary: a compromise must be made between accuracy and generalization ability.

Why do bagging predictors work?
References: [Breiman, 1994, Breiman, 1996].
For some unstable predictors (such as classification trees), a small change in the training set can lead to a big change in the trained predictor, due to overfitting (hence the misclassification error obtained on the training dataset is very optimistic). Bagging reduces this instability through an averaging procedure.

Estimating the generalization ability
Several strategies can be used to estimate the generalization ability of an algorithm:
- split the data into a training and a test set (approximately 67%/33%): the model is estimated on the training set and a test error is computed on the remaining observations;
- cross validation: split the data into L folds; for each fold, train a model without the data included in the current fold and compute the error on the data included in that fold; the average of these L errors is the cross validation error (a sketch is given below);
- out-of-bag error (see the next paragraph).
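A sketch of L-fold cross validation in R (the data frame d with a factor response y, and the rpart learner, are placeholders for any model):

library(rpart)
L <- 10
folds <- sample(rep(1:L, length.out = nrow(d)))   # random assignment to folds
errors <- sapply(1:L, function(l) {
  fit <- rpart(y ~ ., data = d[folds != l, ], method = "class")
  pred <- predict(fit, d[folds == l, ], type = "class")
  mean(pred != d$y[folds == l])                   # error on the held-out fold
})
mean(errors)                                      # cross validation error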

Out-of-bag observations, prediction and error
OOB (out-of-bag) error: an error based on the observations not included in the bag. For every $i = 1, \ldots, n$, compute the OOB prediction
$\Phi^{OOB}(x_i) = \arg\max_{k=1,\ldots,K} |\{b: \Phi^b(x_i) = k \text{ and } x_i \notin \mathcal{T}^b\}|$
($x_i$ is said to be out-of-bag for $\mathcal{T}^b$ if $x_i \notin \mathcal{T}^b$). The OOB error is the misclassification rate of these estimates:
$\frac{1}{n} \sum_{i=1}^n \mathbb{I}_{\{\Phi^{OOB}(x_i) \neq y_i\}}$
It is often a bit optimistic compared to the test error.
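Continuing the hand-made bagging sketch above (train, y, B and rpart as before), OOB predictions only aggregate the trees whose bootstrap sample does not contain the observation; a sketch, this time keeping the bootstrap indices:

boot.idx <- lapply(1:B, function(b) sample(nrow(train), replace = TRUE))
trees <- lapply(boot.idx, function(idx) rpart(y ~ ., data = train[idx, ], method = "class"))
oob.predict <- function(i) {
  # majority vote restricted to the trees for which x_i is out-of-bag
  oob <- sapply(boot.idx, function(idx) !(i %in% idx))
  votes <- sapply(trees[oob], function(t)
    as.character(predict(t, train[i, , drop = FALSE], type = "class")))
  names(which.max(table(votes)))
}
oob.pred <- sapply(1:nrow(train), oob.predict)
mean(oob.pred != train$y)   # OOB error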

Section 3: Application of bagging to CART: random forests

General framework
Notations: a random pair of variables (X, Y), with $X \in \mathbb{R}^p$ and $Y \in \mathbb{R}$ (regression) or $Y \in \{1, \ldots, K\}$ (classification); a training set $(x_i, y_i)_{i=1,\ldots,n}$ of i.i.d. observations of (X, Y).
Purpose: train a function $\Phi_n: \mathbb{R}^p \to \{1, \ldots, K\}$ from $(x_i, y_i)_i$, capable of predicting Y from X.

Overview: advantages/drawbacks
Random forests were introduced by [Breiman, 2001].
Advantages:
- classification OR regression (i.e., Y can be a numeric variable or a factor);
- non-parametric method (no prior assumption needed) and accurate;
- can deal with a large number of input variables, either numeric variables or factors;
- can deal with small and large samples.
Drawbacks:
- black-box model;
- not yet supported by strong mathematical results (consistency...).

Basis for random forests: bagging of classification trees
Suppose: $Y \in \{1, \ldots, K\}$ (classification problem) and $(\Phi^b)_{b=1,\ldots,B}$ are B classifiers, $\Phi^b: \mathbb{R}^p \to \{1, \ldots, K\}$, obtained from B bootstrap samples of $\{(x_i, y_i)\}_{i=1,\ldots,n}$.
Basic bagging with classification trees:
1: for b = 1, ..., B do
2:   construct a bootstrap sample $\mathcal{T}^b$ from $\{(x_i, y_i)\}_{i=1,\ldots,n}$
3:   train a classification tree $\Phi^b$ from $\mathcal{T}^b$
4: end for
5: aggregate the classifiers with a majority vote: $\Phi_n(x) := \arg\max_{k=1,\ldots,K} |\{b: \Phi^b(x) = k\}|$, where $|S|$ denotes the cardinality of a finite set S.

Random forests
CART bagging with under-efficient trees to avoid overfitting:
1. for every tree, each time a split is made, it is preceded by a random choice of q variables among the p available, $X = (X^1, X^2, \ldots, X^p)$. The current node is then built based on these variables only: it is defined as the split among the q variables that produces the two subsets with the largest inter-class variance. An advisable choice for q is $\sqrt{p}$ for classification (and p/3 for regression);
2. the tree is restricted to its first l nodes, with l small.
Hyperparameters:
- those of the CART algorithm (maximal depth, minimum size of a node, minimum homogeneity of a node...);
- those specific to the random forest: q and the number of bootstrap samples, B (also called the number of trees).
Random forests are not very sensitive to the hyperparameter settings: the default values for q and for the bootstrap sample size (2n/3) should work in most cases.
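In R, the randomForest package implements this procedure with these defaults. A minimal sketch on the BreastCancer data used earlier (dropping the Id column and the incomplete rows is an assumption about the preprocessing):

library(randomForest)
library(mlbench)
data(BreastCancer)
d <- na.omit(BreastCancer[, -1])                      # drop Id and incomplete rows
rf <- randomForest(Class ~ ., data = d, ntree = 500)  # mtry defaults to sqrt(p)
rf        # prints the confusion matrix and the OOB error estimate
plot(rf)  # OOB error as a function of the number of trees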

Additional tools
OOB (out-of-bag) error: an error based on the OOB predictions. Stabilization of the OOB error is a good indication that there are enough trees in the forest.
Importance of a variable, to help interpretation: for a given variable $X^j$ ($j \in \{1, \ldots, p\}$), the importance of $X^j$ is the mean decrease in accuracy obtained when the values of $X^j$ are randomized:
$I(X^j) = P(\Phi_n(X) = Y) - P(\Phi_n(X^{(j)}) = Y)$
in which $X^{(j)} = (X^1, \ldots, \tilde{X}^j, \ldots, X^p)$, $\tilde{X}^j$ being the variable $X^j$ with permuted values. The importance is estimated with the OOB observations (see below for details).

Importance estimation in random forests
OOB estimation of the importance of $X^j$:
1: for b = 1, ..., B (loop on trees) do
2:   permute the values $(x_i^j)$ for the OOB observations $x_i \notin \mathcal{T}^b$, which gives the permuted observations $\tilde{x}_i^{(j,b)} = (x_i^1, \ldots, \tilde{x}_i^j, \ldots, x_i^p)$
3:   predict $\Phi^b(\tilde{x}_i^{(j,b)})$
4: end for
5: return the OOB estimation of the importance:
$\frac{1}{B} \sum_{b=1}^B \left[ \frac{1}{|\overline{\mathcal{T}^b}|} \sum_{x_i \notin \mathcal{T}^b} \mathbb{I}_{\{\Phi^b(x_i) = y_i\}} - \frac{1}{|\overline{\mathcal{T}^b}|} \sum_{x_i \notin \mathcal{T}^b} \mathbb{I}_{\{\Phi^b(\tilde{x}_i^{(j,b)}) = y_i\}} \right]$
where $\overline{\mathcal{T}^b}$ denotes the set of OOB observations for $\mathcal{T}^b$.
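With randomForest, this permutation importance is computed when the model is trained with importance = TRUE (continuing the previous sketch):

rf <- randomForest(Class ~ ., data = d, ntree = 500, importance = TRUE)
importance(rf)   # mean decrease in accuracy (and in node impurity) per variable
varImpPlot(rf)   # graphical display of the variable importances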

Section 4: Introduction to parallel computing

Very basic background on parallel computing
Purpose: distributed or parallel computing aims at distributing a computation over several cores (multi-core processors), several processors (multi-processor computers) or clusters (composed of several computers).
Constraint: communication between the cores slows down the computation, so a common strategy consists in breaking the calculation into independent parts, so that each processing unit executes its part independently from the others.

Parallel computing with non-big data
Framework: the data (number of observations n) are small enough for every processor to access them all, and the calculation can easily be broken into independent parts.
Example: bagging can easily be computed with a parallel strategy (it is said to be embarrassingly parallel):
- first step: each processing unit creates one (or several) bootstrap sample(s) and trains one (or several) classifier(s) from it;
- final step: one processing unit collects all the results and combines them into a single classifier with a majority vote.

[Figure: illustration of the parallel strategy with a random forest]

Parallel computing with R
R CRAN task view: HighPerformanceComputing.html
What does this course cover? Very basic explicit parallel computing: the package foreach (you can also have a look at the packages snow and snowfall for parallel versions of apply, lapply, sapply, ..., and for the management of random seed settings in parallel environments).
What does this course not cover? Implicit parallel computing; large memory management (see for instance biglm and bigmemory to work around out-of-memory problems in R, which occur because R is limited to the physical random-access memory); GPU computing; resource management; ...
TL;DR: many things will still remain to be learned at the end of this course...
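A minimal foreach sketch of the embarrassingly parallel bagging scheme of the previous slides (the doParallel backend, the rpart base learner and the data frame train are assumptions):

library(doParallel)
library(rpart)
registerDoParallel(cores = 2)   # two processing units
B <- 50
# each iteration is independent: one bootstrap sample, one tree
trees <- foreach(b = 1:B) %dopar% {
  boot.idx <- sample(nrow(train), replace = TRUE)
  rpart(y ~ ., data = train[boot.idx, ], method = "class")
}
# the final majority vote is then computed sequentially, as before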

Parallel computing with big data
What is big data? Massive datasets:
- Volume: n is huge, and the storage needed for the data can exceed several GB (or even PB; e.g., Facebook reports having 21 petabytes of data, with 12 TB of new data daily);
- Velocity: the data come in and out very fast, and the need to analyze them is fast as well;
- Variety: the data are composed of many types of data.
What needs to be adapted to use parallel programming? In this framework, a single processing unit cannot analyze the whole dataset by itself: the data must be split between the different processing units as well.

Overview of Map Reduce
The data are broken into several bits (Data 1, Data 2, Data 3, ...). Each bit is processed through ONE map step, which outputs a set of pairs {(key, value)}; map jobs must be independent! The result is indexed data. Each key ($key_1, \ldots, key_k$) is then processed through ONE reduce step to produce the output.

Map Reduce in practice: a (stupid) case study
A huge number of sales, identified by the shop and the amount:
shop1,25000
shop2,12
shop2,1500
shop4,47
shop1,...
Question: extract the total amount per shop.
Standard way (sequential): the data are read sequentially, and a vector containing the current sum for every shop is updated at each line.
Map Reduce way (parallel): ...

Map Reduce for an aggregation framework
The data are broken into several bits. Map step: read each line and output the pair (key = shop, value = amount). Reduce step: for every key (i.e., every shop), compute the sum of its values.
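The two steps can be mimicked in plain (sequential) R, which makes the key/value logic explicit (the vector of sales lines is an assumption):

sales <- c("shop1,25000", "shop2,12", "shop2,1500", "shop4,47")
# map step: each line produces a (key = shop, value = amount) pair
fields <- strsplit(sales, ",")
keys   <- sapply(fields, `[`, 1)
values <- as.numeric(sapply(fields, `[`, 2))
# shuffle: group the values by key; reduce step: one sum per key
sapply(split(values, keys), sum)
#  shop1  shop2  shop4
#  25000   1512     47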

In practice: the Hadoop framework
Apache Hadoop: an open-source software framework for big data, programmed in Java. It contains:
- a distributed file system (the data seem to be accessed as if they were on a single computer, although they are distributed over several storage units);
- a map-reduce framework that takes advantage of data locality.
It is divided into Name Nodes (typically two), which manage the file system index, and Data Nodes, which each contain a small portion of the data together with processing capabilities.
Data inside HDFS are not indexed (unlike SQL data, for instance) but stored as simple text files (e.g., comma-separated), so queries cannot be performed simply.

Hadoop & R
How will we be using it? Through an R interface to Hadoop, composed of several packages:
- studied: rmr, a Map Reduce framework (it can be used as if Hadoop were installed, even if it is not...);
- not studied: rhdfs (to manage the Hadoop data filesystem), rhbase (to manage the Hadoop HBase database), plyrmr (advanced data processing functions with a plyr syntax).
Installing rmr without Hadoop: ...
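A hedged sketch with rmr (package rmr2), using the local backend so that no Hadoop installation is needed; the function names follow the rmr2 API, but treat the details as assumptions:

library(rmr2)
rmr.options(backend = "local")   # simulate map reduce without Hadoop
# write the (shop, amount) pairs to the (here simulated) distributed file system
sales <- to.dfs(keyval(c("shop1", "shop2", "shop2", "shop4"),
                       c(25000, 12, 1500, 47)))
# map: pass the (key, value) pairs through; reduce: sum the values of each key
totals <- mapreduce(input = sales,
                    map = function(k, v) keyval(k, v),
                    reduce = function(k, vv) keyval(k, sum(vv)))
from.dfs(totals)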


More information

I/O Considerations in Big Data Analytics

I/O Considerations in Big Data Analytics Library of Congress I/O Considerations in Big Data Analytics 26 September 2011 Marshall Presser Federal Field CTO EMC, Data Computing Division 1 Paradigms in Big Data Structured (relational) data Very

More information

Introduction to Hadoop

Introduction to Hadoop Introduction to Hadoop Miles Osborne School of Informatics University of Edinburgh miles@inf.ed.ac.uk October 28, 2010 Miles Osborne Introduction to Hadoop 1 Background Hadoop Programming Model Examples

More information

Machine Learning with MATLAB David Willingham Application Engineer

Machine Learning with MATLAB David Willingham Application Engineer Machine Learning with MATLAB David Willingham Application Engineer 2014 The MathWorks, Inc. 1 Goals Overview of machine learning Machine learning models & techniques available in MATLAB Streamlining the

More information

Benchmarking Open-Source Tree Learners in R/RWeka

Benchmarking Open-Source Tree Learners in R/RWeka Benchmarking Open-Source Tree Learners in R/RWeka Michael Schauerhuber 1, Achim Zeileis 1, David Meyer 2, Kurt Hornik 1 Department of Statistics and Mathematics 1 Institute for Management Information Systems

More information

Decompose Error Rate into components, some of which can be measured on unlabeled data

Decompose Error Rate into components, some of which can be measured on unlabeled data Bias-Variance Theory Decompose Error Rate into components, some of which can be measured on unlabeled data Bias-Variance Decomposition for Regression Bias-Variance Decomposition for Classification Bias-Variance

More information

On the effect of data set size on bias and variance in classification learning

On the effect of data set size on bias and variance in classification learning On the effect of data set size on bias and variance in classification learning Abstract Damien Brain Geoffrey I Webb School of Computing and Mathematics Deakin University Geelong Vic 3217 With the advent

More information

Penalized regression: Introduction

Penalized regression: Introduction Penalized regression: Introduction Patrick Breheny August 30 Patrick Breheny BST 764: Applied Statistical Modeling 1/19 Maximum likelihood Much of 20th-century statistics dealt with maximum likelihood

More information

Machine Learning using MapReduce

Machine Learning using MapReduce Machine Learning using MapReduce What is Machine Learning Machine learning is a subfield of artificial intelligence concerned with techniques that allow computers to improve their outputs based on previous

More information

Principles of Data Mining by Hand&Mannila&Smyth

Principles of Data Mining by Hand&Mannila&Smyth Principles of Data Mining by Hand&Mannila&Smyth Slides for Textbook Ari Visa,, Institute of Signal Processing Tampere University of Technology October 4, 2010 Data Mining: Concepts and Techniques 1 Differences

More information

Outline. High Performance Computing (HPC) Big Data meets HPC. Case Studies: Some facts about Big Data Technologies HPC and Big Data converging

Outline. High Performance Computing (HPC) Big Data meets HPC. Case Studies: Some facts about Big Data Technologies HPC and Big Data converging Outline High Performance Computing (HPC) Towards exascale computing: a brief history Challenges in the exascale era Big Data meets HPC Some facts about Big Data Technologies HPC and Big Data converging

More information

Lavastorm Analytic Library Predictive and Statistical Analytics Node Pack FAQs

Lavastorm Analytic Library Predictive and Statistical Analytics Node Pack FAQs 1.1 Introduction Lavastorm Analytic Library Predictive and Statistical Analytics Node Pack FAQs For brevity, the Lavastorm Analytics Library (LAL) Predictive and Statistical Analytics Node Pack will be

More information

Tackling Big Data with MATLAB Adam Filion Application Engineer MathWorks, Inc.

Tackling Big Data with MATLAB Adam Filion Application Engineer MathWorks, Inc. Tackling Big Data with MATLAB Adam Filion Application Engineer MathWorks, Inc. 2015 The MathWorks, Inc. 1 Challenges of Big Data Any collection of data sets so large and complex that it becomes difficult

More information

CSE-E5430 Scalable Cloud Computing Lecture 2

CSE-E5430 Scalable Cloud Computing Lecture 2 CSE-E5430 Scalable Cloud Computing Lecture 2 Keijo Heljanko Department of Computer Science School of Science Aalto University keijo.heljanko@aalto.fi 14.9-2015 1/36 Google MapReduce A scalable batch processing

More information

Winning the Kaggle Algorithmic Trading Challenge with the Composition of Many Models and Feature Engineering

Winning the Kaggle Algorithmic Trading Challenge with the Composition of Many Models and Feature Engineering IEICE Transactions on Information and Systems, vol.e96-d, no.3, pp.742-745, 2013. 1 Winning the Kaggle Algorithmic Trading Challenge with the Composition of Many Models and Feature Engineering Ildefons

More information