# Applied Multivariate Analysis - Big data analytics


1 Applied Multivariate Analysis - Big data analytics. Nathalie Villa-Vialaneix, M1 in Economics and Economics and Statistics, Toulouse School of Economics. Nathalie Villa-Vialaneix M1 E&S - Big data, part 1

2 Course outline:
1. Short introduction to bootstrap
2. Application of bootstrap to classification: bagging
3. Application of bagging to CART: random forests
4. Introduction to parallel computing
5. Application of map reduce to statistical learning

3 Section 1: Short introduction to bootstrap


5 Basics about bootstrap. A general method for: parameter estimation (especially bias) and confidence interval estimation in a non-parametric context (i.e., when the law of the observations is completely unknown). It can handle small sample sizes ($n$ small). [Efron, 1979] proposed to simulate the unknown law by re-sampling from the observations.


7 Notations and problem. Framework: $X$ is a random variable with (unknown) law $P$. Problem: estimate a parameter $\theta(P)$ with an estimate $R_n = R(x_1, \dots, x_n)$, where $x_1, \dots, x_n$ are i.i.d. observations of $P$. Standard examples: estimation of the mean, $\theta(P) = \int x \, dP(x)$ with $R_n = \bar{x} = \frac{1}{n}\sum_{i=1}^n x_i$; estimation of the variance, $\theta(P) = \int x^2 \, dP(x) - \left(\int x \, dP(x)\right)^2$ with $R_n = \frac{1}{n-1}\sum_{i=1}^n (x_i - \bar{x})^2$. In these examples, $R_n$ is a plug-in estimate: $R_n = \theta(P_n)$, where $P_n$ is the empirical distribution $\frac{1}{n}\sum_{i=1}^n \delta_{x_i}$.

8 Real/bootstrap worlds:

| | real world | bootstrap world |
|---|---|---|
| sampling law | $P$ (related to $X$) | $P_n = \frac{1}{n}\sum_{i=1}^n \delta_{x_i}$ (related to $X_n$) |
| sample | $X_n = \{x_1, \dots, x_n\}$ | $X_n^* = \{x_1^*, \dots, x_n^*\}$ (sample of size $n$ drawn with replacement from $X_n$) |
| estimation | $R_n = R(x_1, \dots, x_n)$ | $R_n^* = R(x_1^*, \dots, x_n^*)$ |
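The two columns above can be sketched in a few lines of base R (a minimal illustration; the variable names are our own):

```r
set.seed(42)
# "Real world": an i.i.d. sample of size n from the (here known) law P
x <- rchisq(50, df = 3)

# "Bootstrap world": one sample of size n drawn with replacement from x
x_star <- sample(x, size = length(x), replace = TRUE)

# The same estimator computed in both worlds
R_n      <- mean(x)       # estimate from the original sample
R_n_star <- mean(x_star)  # estimate from one bootstrap sample
```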

9 Example: bootstrap estimation of the CI for the mean. A sample is obtained from a $\chi^2$ distribution ($n = 50$, 3 degrees of freedom); the figure shows the histogram of $(x_i)_{i=1,\dots,n}$ and $R_n = \frac{1}{n}\sum_{i=1}^n x_i$.

10 Example: bootstrap estimation of the CI for the mean. Distribution of $B = 1000$ estimates obtained from bootstrap samples: a confidence interval is estimated from the 2.5% and 97.5% quantiles. The figure shows the histogram of $(R_n^{*,b})_{b=1,\dots,B}$ with $R_n^{*,b} = \frac{1}{n}\sum_{i=1}^n x_i^{*,b}$ and $Q_\mu = \text{quantile}(\{R_n^{*,b}\}_b, \mu)$, $\mu \in \{0.025, 0.975\}$.



13 Other practical examples. $\theta$ is any parameter to be estimated: the mean of the distribution (as in the previous example), the median, the variance, the slope or intercept in linear models...

Bias estimate: estimate $\theta$ with the empirical estimate $R_n$; obtain $B$ bootstrap estimates of $\theta$, $(R_n^{*,b})_{b=1,\dots,B}$; the bias of $R_n$ is estimated by $\frac{1}{B}\sum_b R_n^{*,b} - R_n$.

Variance estimate: estimate $\theta$ with the empirical estimate $R_n$; obtain $B$ bootstrap estimates of $\theta$, $(R_n^{*,b})_{b=1,\dots,B}$; the variance of $R_n$ is estimated by $\frac{1}{B}\sum_b (R_n^{*,b} - \bar{R}_n^*)^2$, where $\bar{R}_n^* = \frac{1}{B}\sum_b R_n^{*,b}$.

Confidence interval estimate: estimate $\theta$ with the empirical estimate $R_n$; obtain $B$ bootstrap estimates of $\theta$, $(R_n^{*,b})_{b=1,\dots,B}$; the confidence interval at risk $\alpha$ for $R_n$ is estimated by $[Q_{\alpha/2}; Q_{1-\alpha/2}]$, where $Q_\mu = \text{quantile}(\{R_n^{*,b}\}_b, \mu)$.
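The three estimates above can be computed directly in base R (a minimal sketch for the mean of a $\chi^2$ sample; the variable names are our own):

```r
set.seed(1)
x <- rchisq(50, df = 3)
B <- 1000

# B bootstrap estimates of the mean
R_star <- replicate(B, mean(sample(x, length(x), replace = TRUE)))

R_n  <- mean(x)
bias <- mean(R_star) - R_n                 # bootstrap bias estimate
v    <- mean((R_star - mean(R_star))^2)   # bootstrap variance estimate
ci   <- quantile(R_star, c(0.025, 0.975)) # percentile CI at risk 5%
```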

17 Do it yourself: unrealistic bootstrap by hand! $\{x_i\}_i$: ; ; ; ; and $n = 5$. Empirical estimate for the mean: $R_n = \bar{x} = \frac{1}{n}\sum_{i=1}^n x_i =$ . Unbiased estimate for the variance: $R_n = \hat{\sigma}_{n-1} = \frac{1}{n-1}\sum_{i=1}^n (x_i - \bar{x})^2 =$ . Bootstrap samples ($B = 2$): $b = 1$: $x_3, x_3, x_2, x_1, x_3$; $b = 2$: $x_3, x_3, x_2, x_1, x_3$. Bootstrap estimate for the variance of $\bar{x}$: $R_n^{*,1} =$ , $R_n^{*,2} =$ and $\widehat{\mathrm{Var}}(\bar{x}) =$ . Bootstrap estimate for the mean of the empirical variance: $R_n^{*,1} =$ , $R_n^{*,2} =$ and $\hat{E}(\hat{\sigma}_{n-1}) =$ .

18 R users and the package boot:

```r
library(boot)
# a sample from a Chi-Square distribution is generated
orig.sample <- rchisq(50, df = 3)
# the estimate of the mean is
mean(orig.sample)
# function that calculates the estimate from a bootstrap sample
sample.mean <- function(x, d) {
  return(mean(x[d]))
}
# bootstrapping now...
boot.mean <- boot(orig.sample, sample.mean, R = 1000)
boot.mean
# ORDINARY NONPARAMETRIC BOOTSTRAP
# Call:
# boot(data = orig.sample, statistic = sample.mean, R = 1000)
# Bootstrap Statistics :
#     original    bias    std. error
# t1*
```

19 Section 2: Application of bootstrap to classification: bagging


21 What is bagging? Bagging = Bootstrap Aggregating: a meta-algorithm, based on the bootstrap, which aggregates an ensemble of predictors in statistical classification and regression (a special case of model averaging approaches). Notations: a random pair of variables $(X, Y)$, with $X \in \mathcal{X}$ and $Y \in \mathbb{R}$ (regression) or $Y \in \{1, \dots, K\}$ (classification); a training set $(x_i, y_i)_{i=1,\dots,n}$ of i.i.d. observations of $(X, Y)$. Purpose: train a function $\Phi_n: \mathcal{X} \to \{1, \dots, K\}$ from $(x_i, y_i)_i$, capable of predicting $Y$ from $X$.



24 Basics. Suppose that we are given an algorithm $T = \{(x_i, y_i)\}_i \mapsto \Phi^T$, where $\Phi^T$ is a classification function: $\Phi^T: x \in \mathcal{X} \mapsto \Phi^T(x) \in \{1, \dots, K\}$. $B$ classifiers can be defined from $B$ bootstrap samples using this algorithm: for $b = 1, \dots, B$, $T^b$ is a bootstrap sample of $(x_i, y_i)_{i=1,\dots,n}$ and $\Phi^b = \Phi^{T^b}$. The $(\Phi^b)_{b=1,\dots,B}$ are aggregated using a majority vote scheme (an average in the regression case): $\forall x \in \mathcal{X}$, $\Phi_n(x) := \arg\max_{k=1,\dots,K} |\{b : \Phi^b(x) = k\}|$, where $|S|$ denotes the cardinality of a finite set $S$.
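The majority vote scheme above is easy to write in base R (a minimal sketch; the function name and the toy predictions are our own):

```r
# Majority vote over B classifiers:
# 'preds' is an n x B matrix; preds[i, b] is classifier b's prediction for x_i
majority_vote <- function(preds) {
  apply(preds, 1, function(votes) names(which.max(table(votes))))
}

preds <- matrix(c("malignant", "malignant", "benign",
                  "benign",    "benign",    "benign"),
                nrow = 2, byrow = TRUE)
majority_vote(preds)  # "malignant" "benign"
```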

25 Why use bagging? Bagging improves stability and limits the risk of overfitting. Experiment using the BreastCancer dataset from the mlbench package¹: 699 observations on 10 variables, 9 of them ordered or nominal and describing a tumor, and 1 target class indicating whether the tumor was malignant or benign. ¹The data come from the UCI Machine Learning Repository.

26 To begin: bagging by hand... For the observation x given by

| Cl.thickness | Cell.size | Cell.shape | Marg.adhesion | Epith.c.size | Bare.nuclei | Bl.cromatin | Normal.nucleoli | Mitoses |
|---|---|---|---|---|---|---|---|---|
| j | g | g | d | e | j | e | g | b |

and the following classification trees obtained from 3 bootstrap samples, what is the prediction for x by bagging?

27 (figure: the three classification trees)

28 The individual predictions are: malignant, malignant, malignant, so the final prediction is malignant.


30 Description of the computational aspects: 100 runs. For each run: split the data into a training set (399 observations) and a test set (the 300 remaining observations); train a classification tree on the training set and, using the test set, compute a test misclassification error by comparing the prediction given by the trained tree with the true class; generate 50 bootstrap samples from the training set and use them to compute 50 classification trees; use these to compute a bagging prediction for the test set and a bagging misclassification error by comparing the bagging prediction with the true class. This results in 100 test errors and 100 bagging errors.

31 Results of the simulations (figure).

32-38 What is overfitting? (sequence of figures) The function $x \mapsto y$ to be estimated; observations we might have; observations we do have; a first estimation from the observations: underfitting; a second estimation from the observations: accurate estimation; a third estimation from the observations: overfitting. Summary: a compromise must be made between accuracy and generalization ability.

39 Why do bagging predictors work? References: [Breiman, 1994, Breiman, 1996]. For some unstable predictors (such as classification trees), a small change in the training set can lead to a big change in the trained tree, due to overfitting (hence the misclassification error obtained on the training dataset is very optimistic). Bagging reduces this instability by using an averaging procedure.



42 Estimating the generalization ability. Several strategies can be used to estimate the generalization ability of an algorithm: split the data into a training/test set (about 67%/33%): the model is estimated on the training dataset and a test error is computed with the remaining observations; cross validation: split the data into L folds; for each fold, train a model without the data included in the current fold and compute the error on the data included in the fold; the average of these L errors is the cross validation error; out-of-bag error (see next slide).


44 Out-of-bag observations, prediction and error. OOB (out-of-bag) error: an error based on the observations not included in the bag. For every $i = 1, \dots, n$, calculate the OOB prediction: $\Phi^{OOB}(x_i) = \arg\max_{k=1,\dots,K} |\{b : \Phi^b(x_i) = k \text{ and } x_i \notin T^b\}|$ ($x_i$ is said to be out-of-bag for $T^b$ if $x_i \notin T^b$). The OOB error is the misclassification rate of these estimates: $\frac{1}{n}\sum_{i=1}^n I_{\{\Phi^{OOB}(x_i) \neq y_i\}}$. It is often a bit optimistic compared to the test error.
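The OOB prediction and error can be sketched in base R, given the per-tree predictions and an in-bag indicator (a minimal illustration; all names and the toy data are our own):

```r
# 'preds'  : n x B matrix, preds[i, b] = prediction of classifier b for x_i
# 'in_bag' : n x B logical matrix, TRUE if x_i belongs to bootstrap sample T^b
oob_error <- function(preds, in_bag, y) {
  oob_pred <- sapply(seq_along(y), function(i) {
    votes <- preds[i, !in_bag[i, ]]        # keep only trees where x_i is OOB
    if (length(votes) == 0) return(NA)     # x_i was in every bag
    names(which.max(table(votes)))
  })
  mean(oob_pred != y, na.rm = TRUE)        # OOB misclassification rate
}

preds  <- matrix(c("a", "a", "b",
                   "b", "a", "b"), nrow = 2, byrow = TRUE)
in_bag <- matrix(c(TRUE,  FALSE, FALSE,
                   FALSE, FALSE, TRUE), nrow = 2, byrow = TRUE)
y <- c("a", "b")
oob_error(preds, in_bag, y)
```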

45 Section 3: Application of bagging to CART: random forests

46 General framework. Notations: a random pair of variables $(X, Y)$, with $X \in \mathbb{R}^p$ and $Y \in \mathbb{R}$ (regression) or $Y \in \{1, \dots, K\}$ (classification); a training set $(x_i, y_i)_{i=1,\dots,n}$ of i.i.d. observations of $(X, Y)$. Purpose: train a function $\Phi_n: \mathbb{R}^p \to \{1, \dots, K\}$ from $(x_i, y_i)_i$, capable of predicting $Y$ from $X$.



49 Overview: advantages/drawbacks. Random forests were introduced by [Breiman, 2001]. Advantages: classification OR regression (i.e., $Y$ can be a numeric variable or a factor); a non-parametric method (no prior assumption needed) and accurate; can deal with a large number of input variables, either numeric variables or factors; can deal with small/large samples. Drawbacks: a black box model; not yet supported by strong mathematical results (consistency...).


51 Basis for random forests: bagging of classification trees. Suppose: $Y \in \{1, \dots, K\}$ (classification problem) and $(\Phi^b)_{b=1,\dots,B}$ are $B$ classifiers, $\Phi^b: \mathbb{R}^p \to \{1, \dots, K\}$, obtained from $B$ bootstrap samples of $\{(x_i, y_i)\}_{i=1,\dots,n}$.

Basic bagging with classification trees:
1: for b = 1, ..., B do
2:   construct a bootstrap sample $T^b$ from $\{(x_i, y_i)\}_{i=1,\dots,n}$
3:   train a classification tree $\Phi^b$ from $T^b$
4: end for
5: aggregate the classifiers with a majority vote: $\Phi_n(x) := \arg\max_{k=1,\dots,K} |\{b : \Phi^b(x) = k\}|$, where $|S|$ denotes the cardinality of a finite set $S$.
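The algorithm above can be sketched with the rpart package (which ships with R); this is a minimal illustration on the iris data, not the course's own code:

```r
library(rpart)

set.seed(1)
B <- 25
n <- nrow(iris)

# Steps 1-4: one classification tree per bootstrap sample T^b
trees <- lapply(1:B, function(b) {
  boot_idx <- sample(n, n, replace = TRUE)
  rpart(Species ~ ., data = iris[boot_idx, ], method = "class")
})

# Step 5: aggregate with a majority vote
bagging_predict <- function(trees, newdata) {
  preds <- sapply(trees,
                  function(t) as.character(predict(t, newdata, type = "class")))
  apply(as.matrix(preds), 1, function(v) names(which.max(table(v))))
}

pred <- bagging_predict(trees, iris)
mean(pred == iris$Species)  # training accuracy (optimistic!)
```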




55 Random forests: CART bagging with under-efficient trees to avoid overfitting: 1. for every tree, each time a split is made, it is preceded by a random choice of $q$ variables among the $p$ available, $X = (X^1, X^2, \dots, X^p)$; the current node is then built based on these variables only: it is defined as the split among the $q$ variables that produces the two subsets with the largest inter-class variance (an advisable choice for $q$ is $\sqrt{p}$ for classification and $p/3$ for regression); 2. the tree is restricted to its first $l$ nodes, with $l$ small.

Hyperparameters: those of the CART algorithm (maximal depth, minimum size of a node, minimum homogeneity of a node...); those specific to the random forest: $q$ and the number of bootstrap samples ($B$, also called the number of trees). Random forests are not very sensitive to the hyperparameter settings: the default values for $q$ and for the bootstrap sample size ($2n/3$) should work in most cases.


57 Additional tools. OOB (out-of-bag) error: an error based on the OOB predictions. Stabilization of the OOB error is a good indication that there are enough trees in the forest. Importance of a variable, to help interpretation: for a given variable $X^j$ ($j \in \{1, \dots, p\}$), the importance of $X^j$ is the mean decrease in accuracy obtained when the values of $X^j$ are randomized: $I(X^j) = P(\Phi_n(X) = Y) - P(\Phi_n(X^{(j)}) = Y)$, in which $X^{(j)} = (X^1, \dots, \tilde{X}^{(j)}, \dots, X^p)$, $\tilde{X}^{(j)}$ being the variable $X^j$ with permuted values. Importance is estimated with the OOB observations.


59 Importance estimation in random forests. OOB estimation for variable $X^j$:
1: for b = 1, ..., B (loop on trees) do
2:   permute the values $(x_i^j)_{i:\, x_i \notin T^b}$, which gives $\tilde{x}_i^{(j,b)} = (x_i^1, \dots, \tilde{x}_i^{(j,b)}, \dots, x_i^p)$ with permuted values $\tilde{x}_i^{(j,b)}$
3:   predict $\Phi^b(\tilde{x}^{(j,b)})$
4: end for
5: return the OOB estimation of the importance: $\frac{1}{B}\sum_{b=1}^{B}\left[\frac{1}{|\overline{T^b}|}\sum_{x_i \notin T^b} I_{\{\Phi^b(x_i) = y_i\}} - \frac{1}{|\overline{T^b}|}\sum_{x_i \notin T^b} I_{\{\Phi^b(\tilde{x}_i^{(j,b)}) = y_i\}}\right]$
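The permutation idea behind this estimate can be illustrated with a single rpart tree (a minimal sketch with our own function names; a real forest would average the accuracy decrease over the $B$ trees and their OOB observations):

```r
library(rpart)

set.seed(1)
fit <- rpart(Species ~ ., data = iris, method = "class")

acc <- function(model, data) {
  mean(predict(model, data, type = "class") == data$Species)
}

# Mean decrease in accuracy when one variable is randomized
perm_importance <- function(model, data, var) {
  permuted <- data
  permuted[[var]] <- sample(permuted[[var]])  # permute the values of X^j
  acc(model, data) - acc(model, permuted)
}

perm_importance(fit, iris, "Petal.Width")  # large: the tree relies on it
perm_importance(fit, iris, "Sepal.Width")  # zero: the tree does not use it
```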

60 Section 4: Introduction to parallel computing


62 Very basic background on parallel computing. Purpose: distributed or parallel computing seeks to distribute a computation over several cores (multi-core processors), several processors (multi-processor computers) or clusters (composed of several computers). Constraint: communication between cores slows down the computation, so a common strategy consists in breaking the calculation into independent parts so that each processing unit executes its part independently of the others.


64 Parallel computing with non-big data. Framework: the data (number of observations $n$) are small enough for every processing unit to access them all, and the computation can easily be broken into independent parts. Example: bagging can easily be computed with a parallel strategy (it is said to be embarrassingly parallel): first step: each processing unit creates one (or several) bootstrap sample(s) and learns a (or several) classifier(s) from it; final step: one processing unit collects all the results and combines them into a single classifier with a majority vote.
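The two steps above can be sketched with the base parallel package (the course itself uses foreach; this is only an illustration of the same idea, with our own variable names):

```r
library(parallel)
library(rpart)

B <- 8
cl <- makeCluster(2)              # two worker processes
clusterEvalQ(cl, library(rpart))  # load rpart on each worker
clusterSetRNGStream(cl, 1)        # reproducible parallel RNG

# First step: each worker builds bootstrap samples and trains trees
trees <- parLapply(cl, 1:B, function(b) {
  idx <- sample(nrow(iris), nrow(iris), replace = TRUE)
  rpart(Species ~ ., data = iris[idx, ], method = "class")
})
stopCluster(cl)

# Final step: collect the results and aggregate with a majority vote
preds <- sapply(trees,
                function(t) as.character(predict(t, iris, type = "class")))
final <- apply(preds, 1, function(v) names(which.max(table(v))))
```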

65 Illustration with random forests (figure).

66 Parallel computing with R. See the CRAN task view HighPerformanceComputing.html. What does this course cover? Very basic explicit parallel computing; the package foreach (you can also have a look at the packages snow and snowfall for parallel versions of apply, lapply, sapply, ... and for the management of random seed settings in parallel environments). What does this course not cover? Implicit parallel computing; large memory management (see for instance biglm and bigmemory for handling out-of-memory problems with R, which occur because R is limited to physical random-access memory); GPU computing; resource management; ... TL;DR: many things remain to be learned at the end of this course...


68 Parallel computing with Big Data. What is Big Data? Massive datasets: Volume: $n$ is huge and the storage size of the data can reach several GB or PB (e.g., Facebook reports having 21 petabytes of data, with 12 TB of new data daily); Velocity: the data come in/out very fast and the need to analyze them is fast as well; Variety: the data are composed of many types of data. What needs to be adapted to use parallel programming? In this framework, a single processing unit cannot analyze the whole dataset by itself: the data must be split between the different processing units as well.

69-72 Overview of Map Reduce (figure sequence): the data are broken into several bits; each bit is processed through ONE map step, which outputs pairs {(key, value)}; map jobs must be independent! The result is indexed data; each key is then processed through ONE reduce step to produce the output.

73 Map Reduce in practice: a (stupid) case study. A huge number of sales, identified by the shop and the amount:

```
shop1,25000
shop2,12
shop2,1500
shop4,47
shop1,
...
```

Question: extract the total amount per shop. Standard way (sequential): the data are read sequentially and a vector containing the current sum for every shop is updated at each line. Map Reduce way (parallel): ...

74-76 Map Reduce for an aggregation framework (figure sequence): the data are broken into several bits; Map step: read each line and output a pair with key = shop and value = amount; Reduce step: for every key (i.e., every shop), compute the sum of the values.
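This map/shuffle/reduce dataflow can be simulated in base R on the toy sales data (no Hadoop involved; the last amount in the slide is elided, so the figure below uses a made-up value of 300 for it):

```r
lines <- c("shop1,25000", "shop2,12", "shop2,1500", "shop4,47", "shop1,300")

# Map step: each line becomes a (key, value) pair
mapped <- lapply(lines, function(l) {
  parts <- strsplit(l, ",")[[1]]
  list(key = parts[1], value = as.numeric(parts[2]))
})

# Shuffle: group the values by key
keys   <- sapply(mapped, `[[`, "key")
groups <- split(sapply(mapped, `[[`, "value"), keys)

# Reduce step: one sum per key
result <- sapply(groups, sum)
result
# shop1 shop2 shop4
# 25300  1512    47
```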


78 In practice: the Hadoop framework. Apache Hadoop: an open-source software framework for Big Data, programmed in Java. It contains: a distributed file system (the data seem to be accessed as if they were on a single computer, though they are distributed over several storage units); a map-reduce framework that takes advantage of data locality. It is divided into: Name Nodes (typically two) that manage the file system index, and Data Nodes that each contain a small portion of the data and processing capabilities. Data inside HDFS are not indexed (unlike SQL data, for instance) but stored as simple text files (e.g., comma separated), so queries cannot be performed simply.

79 Hadoop & R. How will we be using it? We will use an R interface to Hadoop, composed of several packages. Studied: rmr (a Map Reduce framework that can be used as if Hadoop were installed, even when it is not). Not studied: rhdfs (to manage the Hadoop data filesystem), rhbase (to manage the Hadoop HBase database), plyrmr (advanced data processing functions with a plyr syntax). Installing rmr without Hadoop:


Knowledge Discovery and Data Mining Lecture 19 - Bagging Tom Kelsey School of Computer Science University of St Andrews http://tom.host.cs.st-andrews.ac.uk twk@st-andrews.ac.uk Tom Kelsey ID5059-19-B &

### Introduction to Machine Learning. Speaker: Harry Chao Advisor: J.J. Ding Date: 1/27/2011

Introduction to Machine Learning Speaker: Harry Chao Advisor: J.J. Ding Date: 1/27/2011 1 Outline 1. What is machine learning? 2. The basic of machine learning 3. Principles and effects of machine learning

### Chapter 12 Bagging and Random Forests

Chapter 12 Bagging and Random Forests Xiaogang Su Department of Statistics and Actuarial Science University of Central Florida - 1 - Outline A brief introduction to the bootstrap Bagging: basic concepts

### Decision Trees from large Databases: SLIQ

Decision Trees from large Databases: SLIQ C4.5 often iterates over the training set How often? If the training set does not fit into main memory, swapping makes C4.5 unpractical! SLIQ: Sort the values

### Hadoop MapReduce and Spark. Giorgio Pedrazzi, CINECA-SCAI School of Data Analytics and Visualisation Milan, 10/06/2015

Hadoop MapReduce and Spark Giorgio Pedrazzi, CINECA-SCAI School of Data Analytics and Visualisation Milan, 10/06/2015 Outline Hadoop Hadoop Import data on Hadoop Spark Spark features Scala MLlib MLlib

### L3: Statistical Modeling with Hadoop

L3: Statistical Modeling with Hadoop Feng Li feng.li@cufe.edu.cn School of Statistics and Mathematics Central University of Finance and Economics Revision: December 10, 2014 Today we are going to learn...

### Machine Learning Big Data using Map Reduce

Machine Learning Big Data using Map Reduce By Michael Bowles, PhD Where Does Big Data Come From? -Web data (web logs, click histories) -e-commerce applications (purchase histories) -Retail purchase histories

### The Big Data Bootstrap

Ariel Kleiner akleiner@cs.berkeley.edu Ameet Talwalkar ameet@cs.berkeley.edu Purnamrita Sarkar psarkar@cs.berkeley.edu Computer Science Division, University of California, Berkeley, CA 9472, USA Michael

### Parallelization Strategies for Multicore Data Analysis

Parallelization Strategies for Multicore Data Analysis Wei-Chen Chen 1 Russell Zaretzki 2 1 University of Tennessee, Dept of EEB 2 University of Tennessee, Dept. Statistics, Operations, and Management

### Ensemble Methods. Knowledge Discovery and Data Mining 2 (VU) (707.004) Roman Kern. KTI, TU Graz 2015-03-05

Ensemble Methods Knowledge Discovery and Data Mining 2 (VU) (707004) Roman Kern KTI, TU Graz 2015-03-05 Roman Kern (KTI, TU Graz) Ensemble Methods 2015-03-05 1 / 38 Outline 1 Introduction 2 Classification

### BIG DATA What it is and how to use?

BIG DATA What it is and how to use? Lauri Ilison, PhD Data Scientist 21.11.2014 Big Data definition? There is no clear definition for BIG DATA BIG DATA is more of a concept than precise term 1 21.11.14

### Comparison of Data Mining Techniques used for Financial Data Analysis

Comparison of Data Mining Techniques used for Financial Data Analysis Abhijit A. Sawant 1, P. M. Chawan 2 1 Student, 2 Associate Professor, Department of Computer Technology, VJTI, Mumbai, INDIA Abstract

### Gerry Hobbs, Department of Statistics, West Virginia University

Decision Trees as a Predictive Modeling Method Gerry Hobbs, Department of Statistics, West Virginia University Abstract Predictive modeling has become an important area of interest in tasks such as credit

### Applied Data Mining Analysis: A Step-by-Step Introduction Using Real-World Data Sets

Applied Data Mining Analysis: A Step-by-Step Introduction Using Real-World Data Sets http://info.salford-systems.com/jsm-2015-ctw August 2015 Salford Systems Course Outline Demonstration of two classification

### D-optimal plans in observational studies

D-optimal plans in observational studies Constanze Pumplün Stefan Rüping Katharina Morik Claus Weihs October 11, 2005 Abstract This paper investigates the use of Design of Experiments in observational

### M1 in Economics and Economics and Statistics Applied multivariate Analysis - Big data analytics Worksheet 2 - Bagging

Nathalie Villa-Vialaneix Année 2015/2016 M1 in Economics and Economics and Statistics Applied multivariate Analysis - Big data analytics Worksheet 2 - Bagging This worksheet illustrates the use of bagging

### Statistical Machine Learning

Statistical Machine Learning UoC Stats 37700, Winter quarter Lecture 4: classical linear and quadratic discriminants. 1 / 25 Linear separation For two classes in R d : simple idea: separate the classes

### Event driven trading new studies on innovative way. of trading in Forex market. Michał Osmoła INIME live 23 February 2016

Event driven trading new studies on innovative way of trading in Forex market Michał Osmoła INIME live 23 February 2016 Forex market From Wikipedia: The foreign exchange market (Forex, FX, or currency

### The More Trees, the Better! Scaling Up Performance Using Random Forest in SAS Enterprise Miner

Paper 3361-2015 The More Trees, the Better! Scaling Up Performance Using Random Forest in SAS Enterprise Miner Narmada Deve Panneerselvam, Spears School of Business, Oklahoma State University, Stillwater,

### HT2015: SC4 Statistical Data Mining and Machine Learning

HT2015: SC4 Statistical Data Mining and Machine Learning Dino Sejdinovic Department of Statistics Oxford http://www.stats.ox.ac.uk/~sejdinov/sdmml.html Bayesian Nonparametrics Parametric vs Nonparametric

### Classification and Regression by randomforest

Vol. 2/3, December 02 18 Classification and Regression by randomforest Andy Liaw and Matthew Wiener Introduction Recently there has been a lot of interest in ensemble learning methods that generate many

### Knowledge Discovery and Data Mining

Knowledge Discovery and Data Mining Unit # 11 Sajjad Haider Fall 2013 1 Supervised Learning Process Data Collection/Preparation Data Cleaning Discretization Supervised/Unuspervised Identification of right

### Map-Reduce for Machine Learning on Multicore

Map-Reduce for Machine Learning on Multicore Chu, et al. Problem The world is going multicore New computers - dual core to 12+-core Shift to more concurrent programming paradigms and languages Erlang,

### Is a Data Scientist the New Quant? Stuart Kozola MathWorks

Is a Data Scientist the New Quant? Stuart Kozola MathWorks 2015 The MathWorks, Inc. 1 Facts or information used usually to calculate, analyze, or plan something Information that is produced or stored by

### Advanced Data Science on Spark

Advanced Data Science on Spark Reza Zadeh @Reza_Zadeh http://reza-zadeh.com Data Science Problem Data growing faster than processing speeds Only solution is to parallelize on large clusters» Wide use in

### Predictive Modeling Techniques in Insurance

Predictive Modeling Techniques in Insurance Tuesday May 5, 2015 JF. Breton Application Engineer 2014 The MathWorks, Inc. 1 Opening Presenter: JF. Breton: 13 years of experience in predictive analytics

### Spark in Action. Fast Big Data Analytics using Scala. Matei Zaharia. www.spark- project.org. University of California, Berkeley UC BERKELEY

Spark in Action Fast Big Data Analytics using Scala Matei Zaharia University of California, Berkeley www.spark- project.org UC BERKELEY My Background Grad student in the AMP Lab at UC Berkeley» 50- person

### II. Methods - 2 - X (i.e. if the system is convective or not). Y = 1 X ). Usually, given these estimates, an

STORMS PREDICTION: LOGISTIC REGRESSION VS RANDOM FOREST FOR UNBALANCED DATA Anne Ruiz-Gazen Institut de Mathématiques de Toulouse and Gremaq, Université Toulouse I, France Nathalie Villa Institut de Mathématiques

### Trees and Random Forests

Trees and Random Forests Adele Cutler Professor, Mathematics and Statistics Utah State University This research is partially supported by NIH 1R15AG037392-01 Cache Valley, Utah Utah State University Leo

### Big Data and Data Science: Behind the Buzz Words

Big Data and Data Science: Behind the Buzz Words Peggy Brinkmann, FCAS, MAAA Actuary Milliman, Inc. April 1, 2014 Contents Big data: from hype to value Deconstructing data science Managing big data Analyzing

### STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.cs.toronto.edu/~rsalakhu/ Lecture 6 Three Approaches to Classification Construct

### A Performance Analysis of Distributed Indexing using Terrier

A Performance Analysis of Distributed Indexing using Terrier Amaury Couste Jakub Kozłowski William Martin Indexing Indexing Used by search

### Chapter 6. The stacking ensemble approach

82 This chapter proposes the stacking ensemble approach for combining different data mining classifiers to get better performance. Other combination techniques like voting, bagging etc are also described

### Journée Thématique Big Data 13/03/2015

Journée Thématique Big Data 13/03/2015 1 Agenda About Flaminem What Do We Want To Predict? What Is The Machine Learning Theory Behind It? How Does It Work In Practice? What Is Happening When Data Gets

### Social Media Mining. Data Mining Essentials

Introduction Data production rate has been increased dramatically (Big Data) and we are able store much more data than before E.g., purchase data, social media data, mobile phone data Businesses and customers

### Microsoft Azure Machine learning Algorithms

Microsoft Azure Machine learning Algorithms Tomaž KAŠTRUN @tomaz_tsql Tomaz.kastrun@gmail.com http://tomaztsql.wordpress.com Our Sponsors Speaker info https://tomaztsql.wordpress.com Agenda Focus on explanation

### TRANSACTIONAL DATA MINING AT LLOYDS BANKING GROUP

TRANSACTIONAL DATA MINING AT LLOYDS BANKING GROUP Csaba Főző csaba.fozo@lloydsbanking.com 15 October 2015 CONTENTS Introduction 04 Random Forest Methodology 06 Transactional Data Mining Project 17 Conclusions

### A General Framework for Mining Concept-Drifting Data Streams with Skewed Distributions

A General Framework for Mining Concept-Drifting Data Streams with Skewed Distributions Jing Gao Wei Fan Jiawei Han Philip S. Yu University of Illinois at Urbana-Champaign IBM T. J. Watson Research Center

### Introduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data

Introduction to Hadoop HDFS and Ecosystems ANSHUL MITTAL Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data Topics The goal of this presentation is to give

### The Artificial Prediction Market

The Artificial Prediction Market Adrian Barbu Department of Statistics Florida State University Joint work with Nathan Lay, Siemens Corporate Research 1 Overview Main Contributions A mathematical theory

### Chapter 11 Boosting. Xiaogang Su Department of Statistics University of Central Florida - 1 -

Chapter 11 Boosting Xiaogang Su Department of Statistics University of Central Florida - 1 - Perturb and Combine (P&C) Methods have been devised to take advantage of the instability of trees to create

### CSCI567 Machine Learning (Fall 2014)

CSCI567 Machine Learning (Fall 2014) Drs. Sha & Liu {feisha,yanliu.cs}@usc.edu September 22, 2014 Drs. Sha & Liu ({feisha,yanliu.cs}@usc.edu) CSCI567 Machine Learning (Fall 2014) September 22, 2014 1 /

### Data Mining Methods: Applications for Institutional Research

Data Mining Methods: Applications for Institutional Research Nora Galambos, PhD Office of Institutional Research, Planning & Effectiveness Stony Brook University NEAIR Annual Conference Philadelphia 2014

### Parallel Programming Map-Reduce. Needless to Say, We Need Machine Learning for Big Data

Case Study 2: Document Retrieval Parallel Programming Map-Reduce Machine Learning/Statistics for Big Data CSE599C1/STAT592, University of Washington Carlos Guestrin January 31 st, 2013 Carlos Guestrin

### Bringing Big Data Modelling into the Hands of Domain Experts

Bringing Big Data Modelling into the Hands of Domain Experts David Willingham Senior Application Engineer MathWorks david.willingham@mathworks.com.au 2015 The MathWorks, Inc. 1 Data is the sword of the

### Cross Validation techniques in R: A brief overview of some methods, packages, and functions for assessing prediction models.

Cross Validation techniques in R: A brief overview of some methods, packages, and functions for assessing prediction models. Dr. Jon Starkweather, Research and Statistical Support consultant This month

### Using multiple models: Bagging, Boosting, Ensembles, Forests

Using multiple models: Bagging, Boosting, Ensembles, Forests Bagging Combining predictions from multiple models Different models obtained from bootstrap samples of training data Average predictions or

### The Impact of Big Data on Classic Machine Learning Algorithms. Thomas Jensen, Senior Business Analyst @ Expedia

The Impact of Big Data on Classic Machine Learning Algorithms Thomas Jensen, Senior Business Analyst @ Expedia Who am I? Senior Business Analyst @ Expedia Working within the competitive intelligence unit

### Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware

Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware Created by Doug Cutting and Mike Carafella in 2005. Cutting named the program after

### Logistic Regression. Vibhav Gogate The University of Texas at Dallas. Some Slides from Carlos Guestrin, Luke Zettlemoyer and Dan Weld.

Logistic Regression Vibhav Gogate The University of Texas at Dallas Some Slides from Carlos Guestrin, Luke Zettlemoyer and Dan Weld. Generative vs. Discriminative Classifiers Want to Learn: h:x Y X features

### Big data in R EPIC 2015

Big data in R EPIC 2015 Big Data: the new 'The Future' In which Forbes magazine finds common ground with Nancy Krieger (for the first time ever?), by arguing the need for theory-driven analysis This future

### large-scale machine learning revisited Léon Bottou Microsoft Research (NYC)

large-scale machine learning revisited Léon Bottou Microsoft Research (NYC) 1 three frequent ideas in machine learning. independent and identically distributed data This experimental paradigm has driven

### MapReduce and Distributed Data Analysis. Sergei Vassilvitskii Google Research

MapReduce and Distributed Data Analysis Google Research 1 Dealing With Massive Data 2 2 Dealing With Massive Data Polynomial Memory Sublinear RAM Sketches External Memory Property Testing 3 3 Dealing With

### A Scalable Bootstrap for Massive Data

A Scalable Bootstrap for Massive Data arxiv:2.56v2 [stat.me] 28 Jun 22 Ariel Kleiner Department of Electrical Engineering and Computer Science University of California, Bereley aleiner@eecs.bereley.edu

### A Study Of Bagging And Boosting Approaches To Develop Meta-Classifier

A Study Of Bagging And Boosting Approaches To Develop Meta-Classifier G.T. Prasanna Kumari Associate Professor, Dept of Computer Science and Engineering, Gokula Krishna College of Engg, Sullurpet-524121,

### Linear Threshold Units

Linear Threshold Units w x hx (... w n x n w We assume that each feature x j and each weight w j is a real number (we will relax this later) We will study three different algorithms for learning linear

### I/O Considerations in Big Data Analytics

Library of Congress I/O Considerations in Big Data Analytics 26 September 2011 Marshall Presser Federal Field CTO EMC, Data Computing Division 1 Paradigms in Big Data Structured (relational) data Very

Introduction to Hadoop Miles Osborne School of Informatics University of Edinburgh miles@inf.ed.ac.uk October 28, 2010 Miles Osborne Introduction to Hadoop 1 Background Hadoop Programming Model Examples

### Spark. Fast, Interactive, Language- Integrated Cluster Computing

Spark Fast, Interactive, Language- Integrated Cluster Computing Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael Franklin, Scott Shenker, Ion Stoica UC

### ESS event: Big Data in Official Statistics. Antonino Virgillito, Istat

ESS event: Big Data in Official Statistics Antonino Virgillito, Istat v erbi v is 1 About me Head of Unit Web and BI Technologies, IT Directorate of Istat Project manager and technical coordinator of Web

### Basics of Statistical Machine Learning

CS761 Spring 2013 Advanced Machine Learning Basics of Statistical Machine Learning Lecturer: Xiaojin Zhu jerryzhu@cs.wisc.edu Modern machine learning is rooted in statistics. You will find many familiar

### Introduction to Hadoop and MapReduce

Introduction to Hadoop and MapReduce THE CONTRACTOR IS ACTING UNDER A FRAMEWORK CONTRACT CONCLUDED WITH THE COMMISSION Large-scale Computation Traditional solutions for computing large quantities of data

### Analysis of MapReduce Algorithms

Analysis of MapReduce Algorithms Harini Padmanaban Computer Science Department San Jose State University San Jose, CA 95192 408-924-1000 harini.gomadam@gmail.com ABSTRACT MapReduce is a programming model

### PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION

PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION Introduction In the previous chapter, we explored a class of regression models having particularly simple analytical

### SEIZE THE DATA. 2015 SEIZE THE DATA. 2015

1 Copyright 2015 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. BIG DATA CONFERENCE 2015 Boston August 10-13 Predicting and reducing deforestation

### Big Data and Parallel Work with R

Big Data and Parallel Work with R What We'll Cover Data Limits in R Optional Data packages Optional Function packages Going parallel Deciding what to do Data Limits in R Big Data? What is big data? More

### Data Mining Classification: Decision Trees

Data Mining Classification: Decision Trees Classification Decision Trees: what they are and how they work Hunt s (TDIDT) algorithm How to select the best split How to handle Inconsistent data Continuous

### A crash course in probability and Naïve Bayes classification

Probability theory A crash course in probability and Naïve Bayes classification Chapter 9 Random variable: a variable whose possible values are numerical outcomes of a random phenomenon. s: A person s

### A Learning Algorithm For Neural Network Ensembles

A Learning Algorithm For Neural Network Ensembles H. D. Navone, P. M. Granitto, P. F. Verdes and H. A. Ceccatto Instituto de Física Rosario (CONICET-UNR) Blvd. 27 de Febrero 210 Bis, 2000 Rosario. República

### 2015 The MathWorks, Inc. 1

25 The MathWorks, Inc. 빅 데이터 및 다양한 데이터 처리 위한 MATLAB의 인터페이스 환경 및 새로운 기능 엄준상 대리 Application Engineer MathWorks 25 The MathWorks, Inc. 2 Challenges of Data Any collection of data sets so large and complex

### Regression Estimation - Least Squares and Maximum Likelihood. Dr. Frank Wood

Regression Estimation - Least Squares and Maximum Likelihood Dr. Frank Wood Least Squares Max(min)imization Function to minimize w.r.t. b 0, b 1 Q = n (Y i (b 0 + b 1 X i )) 2 i=1 Minimize this by maximizing

### Scalable Data Analysis in R. Lee E. Edlefsen Chief Scientist UserR! 2011

Scalable Data Analysis in R Lee E. Edlefsen Chief Scientist UserR! 2011 1 Introduction Our ability to collect and store data has rapidly been outpacing our ability to analyze it We need scalable data analysis

REVOLUTION ANALYTICS WHITE PAPER Advanced Big Data Analytics with R and Hadoop 'Big Data' Analytics as a Competitive Advantage Big Analytics delivers competitive advantage in two ways compared to the traditional

2.3 Advanced analytics at your hands Neural Designer is the most powerful predictive analytics software. It uses innovative neural networks techniques to provide data scientists with results in a way previously

### Probabilistic Models for Big Data. Alex Davies and Roger Frigola University of Cambridge 13th February 2014

Probabilistic Models for Big Data Alex Davies and Roger Frigola University of Cambridge 13th February 2014 The State of Big Data Why probabilistic models for Big Data? 1. If you don t have to worry about

### CS570 Data Mining Classification: Ensemble Methods

CS570 Data Mining Classification: Ensemble Methods Cengiz Günay Dept. Math & CS, Emory University Fall 2013 Some slides courtesy of Han-Kamber-Pei, Tan et al., and Li Xiong Günay (Emory) Classification:

### Benchmarking Open-Source Tree Learners in R/RWeka

Benchmarking Open-Source Tree Learners in R/RWeka Michael Schauerhuber 1, Achim Zeileis 1, David Meyer 2, Kurt Hornik 1 Department of Statistics and Mathematics 1 Institute for Management Information Systems

### Supervised Learning (Big Data Analytics)

Supervised Learning (Big Data Analytics) Vibhav Gogate Department of Computer Science The University of Texas at Dallas Practical advice Goal of Big Data Analytics Uncover patterns in Data. Can be used

### Bayesian networks - Time-series models - Apache Spark & Scala

Bayesian networks - Time-series models - Apache Spark & Scala Dr John Sandiford, CTO Bayes Server Data Science London Meetup - November 2014 1 Contents Introduction Bayesian networks Latent variables Anomaly

### Machine Learning with MATLAB David Willingham Application Engineer

Machine Learning with MATLAB David Willingham Application Engineer 2014 The MathWorks, Inc. 1 Goals Overview of machine learning Machine learning models & techniques available in MATLAB Streamlining the

### Application of Machine Learning to Link Prediction

Application of Machine Learning to Link Prediction Kyle Julian (kjulian3), Wayne Lu (waynelu) December 6, 6 Introduction Real-world networks evolve over time as new nodes and links are added. Link prediction

### An Introduction to Ensemble Learning in Credit Risk Modelling

An Introduction to Ensemble Learning in Credit Risk Modelling October 15, 2014 Han Sheng Sun, BMO Zi Jin, Wells Fargo Disclaimer The opinions expressed in this presentation and on the following slides

### Multiple Linear Regression in Data Mining

Multiple Linear Regression in Data Mining Contents 2.1. A Review of Multiple Linear Regression 2.2. Illustration of the Regression Process 2.3. Subset Selection in Linear Regression 1 2 Chap. 2 Multiple

### COMP 551 Applied Machine Learning Lecture 6: Performance evaluation. Model assessment and selection.

COMP 551 Applied Machine Learning Lecture 6: Performance evaluation. Model assessment and selection. Instructor: (jpineau@cs.mcgill.ca) Class web page: www.cs.mcgill.ca/~jpineau/comp551 Unless otherwise