Applied Multivariate Analysis - Big data analytics


Applied Multivariate Analysis - Big data analytics
Nathalie Villa-Vialaneix, nathalie.villa@toulouse.inra.fr
M1 in Economics and Economics and Statistics, Toulouse School of Economics

Course outline
1 Short introduction to bootstrap
2 Application of bootstrap to classification: bagging
3 Application of bagging to CART: random forests
4 Introduction to parallel computing
5 Application of map reduce to statistical learning

Section 1: Short introduction to bootstrap

Basics about bootstrap
A general method for:
- parameter estimation (especially bias),
- confidence interval estimation,
in a non-parametric context (i.e., when the law of the observations is completely unknown). It can handle small sample sizes (n small). [Efron, 1979] proposed to simulate the unknown law by re-sampling from the observations.

Notations and problem
Framework: X, a random variable with (unknown) law P.
Problem: estimate a parameter $\theta(P)$ with an estimate $R_n = R(x_1, \ldots, x_n)$, where $x_1, \ldots, x_n$ are i.i.d. observations of P.
Standard examples:
- estimation of the mean: $\theta(P) = \int x \, dP(x)$ with $R_n = \bar{x} = \frac{1}{n} \sum_{i=1}^n x_i$;
- estimation of the variance: $\theta(P) = \int x^2 \, dP(x) - \left( \int x \, dP(x) \right)^2$ with $R_n = \frac{1}{n-1} \sum_{i=1}^n (x_i - \bar{x})^2$.
In these examples, $R_n$ is a plug-in estimate: $R_n = \theta(P_n)$, where $P_n$ is the empirical distribution $\frac{1}{n} \sum_{i=1}^n \delta_{x_i}$.

Real/bootstrap worlds
- sampling law: real world, $P$ (related to $X$); bootstrap world, $P_n = \frac{1}{n} \sum_{i=1}^n \delta_{x_i}$ (related to $X^*$);
- sample: real world, $X_n = \{x_1, \ldots, x_n\}$; bootstrap world, $X_n^* = \{x_1^*, \ldots, x_n^*\}$, a sample of size n drawn with replacement from $X_n$;
- estimation: real world, $R_n = R(x_1, \ldots, x_n)$; bootstrap world, $R_n^* = R(x_1^*, \ldots, x_n^*)$.
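This correspondence is one line of R: a bootstrap sample is obtained by drawing, with replacement, n values from the observed sample. A minimal sketch, assuming a numeric vector x holding the observations:

x <- rchisq(50, df = 3)                         # observed sample (real world)
n <- length(x)
x.star <- sample(x, size = n, replace = TRUE)   # bootstrap sample (bootstrap world)
mean(x)        # R_n, the statistic computed on the sample
mean(x.star)   # R_n^*, the same statistic on the bootstrap sample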

Example: bootstrap estimation of the CI for the mean
A sample is obtained from a $\chi^2$ distribution (n = 50, 3 degrees of freedom). [Figure: histogram of $(x_i)_{i=1,\ldots,n}$ with $R_n = \frac{1}{n} \sum_{i=1}^n x_i$]

Distribution of B = 1000 estimates obtained from bootstrap samples: a confidence interval is estimated from the 2.5% and 97.5% quantiles. [Figure: histogram of $(R_n^{*,b})_{b=1,\ldots,B}$ with $R_n^{*,b} = \frac{1}{n} \sum_{i=1}^n x_i^*$ and $Q_\mu = \mathrm{quantile}(\{R_n^{*,b}\}_b, \mu)$, $\mu \in \{0.025, 0.975\}$]

Other practical examples
$\theta$ is any parameter to be estimated: the mean of the distribution (as in the previous example), the median, the variance, a slope or an intercept in a linear model...
In all cases: estimate $\theta$ with the empirical estimate $R_n$ and obtain B bootstrap estimates of $\theta$, $(R_n^{*,b})_{b=1,\ldots,B}$. Then:
- bias estimate: the bias of $R_n$ is estimated by $\frac{1}{B} \sum_b R_n^{*,b} - R_n$;
- variance estimate: the variance of $R_n$ is estimated by $\frac{1}{B} \sum_b (R_n^{*,b} - \bar{R}_n^*)^2$, where $\bar{R}_n^* = \frac{1}{B} \sum_b R_n^{*,b}$;
- confidence interval estimate: a confidence interval at risk $\alpha$ for $R_n$ is estimated by $[Q_{\alpha/2}; Q_{1-\alpha/2}]$, where $Q_\mu = \mathrm{quantile}(\{R_n^{*,b}\}_b, \mu)$.
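The three recipes share the same skeleton. A minimal base-R sketch for the mean (the sample x and the number of replicates B are placeholders):

x <- rchisq(50, df = 3)
B <- 1000
R.n <- mean(x)                        # empirical estimate
# B bootstrap estimates of the parameter
R.star <- replicate(B, mean(sample(x, length(x), replace = TRUE)))
mean(R.star) - R.n                    # bias estimate
mean((R.star - mean(R.star))^2)       # variance estimate
quantile(R.star, c(0.025, 0.975))     # confidence interval at risk alpha = 5%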

Do it yourself: unrealistic bootstrap by hand!
$\{x_i\}_i$: …; …; …; …; …, so n = 5.
Empirical estimate for the mean: $R_n = \bar{x} = \frac{1}{n} \sum_{i=1}^n x_i = \ldots$
Unbiased estimate for the variance: $R_n = \hat{\sigma}_{n-1} = \frac{1}{n-1} \sum_{i=1}^n (x_i - \bar{x})^2 = \ldots$
Bootstrap samples (B = 2): b = 1: $x_3, x_3, x_2, x_1, x_3$; b = 2: $x_3, x_3, x_2, x_1, x_3$.
Bootstrap estimate for the variance of $\bar{x}$: $R_n^{*,1} = \ldots$, $R_n^{*,2} = \ldots$ and $\widehat{\mathrm{Var}}^*(\bar{x}) = \ldots$
Bootstrap estimate for the mean of the empirical variance: $R_n^{*,1} = \ldots$, $R_n^{*,2} = \ldots$ and $\hat{E}^*(\hat{\sigma}_{n-1}) = \ldots$

In R: the package boot

library(boot)
# a sample from a chi-squared distribution is generated
orig.sample <- rchisq(50, df = 3)
# the estimate of the mean is
mean(orig.sample)
# statistic: calculates the estimate from a bootstrap sample
# (d contains the indices of the bootstrap sample)
sample.mean <- function(x, d) {
  return(mean(x[d]))
}
# bootstrapping now...
boot.mean <- boot(orig.sample, sample.mean, R = 1000)
boot.mean
# ORDINARY NONPARAMETRIC BOOTSTRAP
# Call: boot(data = orig.sample, statistic = sample.mean, R = 1000)
# Bootstrap Statistics:
#     original  bias  std. error
# t1* ...       ...   ...
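From the returned boot object, the percentile confidence interval of the earlier slides can be extracted with boot.ci, using type = "perc":

# percentile interval from the 2.5% and 97.5% quantiles of the replicates
boot.ci(boot.mean, conf = 0.95, type = "perc")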

Section 2: Application of bootstrap to classification: bagging

What is bagging?
Bagging: Bootstrap Aggregating, a meta-algorithm based on the bootstrap that aggregates an ensemble of predictors in statistical classification and regression (a special case of model averaging approaches).
Notations: a random pair of variables (X, Y), with $X \in \mathcal{X}$ and $Y \in \mathbb{R}$ (regression) or $Y \in \{1, \ldots, K\}$ (classification); a training set $(x_i, y_i)_{i=1,\ldots,n}$ of i.i.d. observations of (X, Y).
Purpose: train a function $\Phi_n: \mathcal{X} \to \{1, \ldots, K\}$ from $(x_i, y_i)_i$, capable of predicting Y from X.

Basics
Suppose that we are given an algorithm $\mathcal{T} = \{(x_i, y_i)\}_i \mapsto \Phi_{\mathcal{T}}$, where $\Phi_{\mathcal{T}}$ is a classification function: $\Phi_{\mathcal{T}}: x \in \mathcal{X} \mapsto \Phi_{\mathcal{T}}(x) \in \{1, \ldots, K\}$.
B classifiers can be defined from B bootstrap samples using this algorithm: for $b = 1, \ldots, B$, $\mathcal{T}^b$ is a bootstrap sample of $(x_i, y_i)_{i=1,\ldots,n}$ and $\Phi^b = \Phi_{\mathcal{T}^b}$.
$(\Phi^b)_{b=1,\ldots,B}$ are aggregated using a majority vote scheme (an average in the regression case): $\forall x \in \mathcal{X}$, $\Phi_n(x) := \arg\max_{k=1,\ldots,K} |\{b: \Phi^b(x) = k\}|$, where $|S|$ denotes the cardinality of a finite set S.
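A minimal sketch of this scheme in R with rpart trees as the base classifiers (the data frame train, with a factor response y, is an assumption):

library(rpart)
B <- 50
# one classification tree per bootstrap sample of the training set
trees <- lapply(1:B, function(b) {
  boot.idx <- sample(nrow(train), replace = TRUE)
  rpart(y ~ ., data = train[boot.idx, ], method = "class")
})
# aggregation of the B predictions by majority vote
bagging.predict <- function(newdata) {
  votes <- sapply(trees, function(t) as.character(predict(t, newdata, type = "class")))
  votes <- matrix(votes, nrow = nrow(newdata))
  apply(votes, 1, function(v) names(which.max(table(v))))
}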

Why use bagging?
Bagging improves stability and limits the risk of overtraining.
Experiment, using the BreastCancer dataset from the R package mlbench (data from the UCI Machine Learning Repository): 699 observations on 10 variables, 9 of them ordered or nominal and describing a tumor, and 1 target class indicating whether the tumor was malignant or benign.

To begin: bagging by hand...
For the observation x with Cl.thickness = j, Cell.size = g, Cell.shape = g, Marg.adhesion = d, Epith.c.size = e, Bare.nuclei = j, Bl.cromatin = e, Normal.nucleoli = g, Mitoses = b, and the following classification trees obtained from 3 bootstrap samples, what is the prediction for x by bagging?

[Figure: the three classification trees trained on the three bootstrap samples]

The individual predictions are malignant, malignant and malignant, so the final bagging prediction is malignant.

Description of computational aspects
100 runs. For each run:
- split the data into a training set (399 observations) and a test set (the 300 remaining observations);
- train a classification tree on the training set; using the test set, compute a test misclassification error by comparing the prediction of the trained tree with the true class;
- generate 50 bootstrap samples from the training set and use them to train 50 classification trees; use these trees to compute a bagging prediction on the test set and a bagging misclassification error by comparing the bagging prediction with the true class.
This results in 100 test errors and 100 bagging errors.

Results of the simulations [Figure: comparison of the distributions of the 100 test errors and of the 100 bagging errors]

What is overfitting?
[Sequence of figures on one example: the function x ↦ y to be estimated; observations we might have; observations we do have; a first estimation from the observations: underfitting; a second estimation: accurate estimation; a third estimation: overfitting.]
Summary: a compromise must be made between accuracy and generalization ability.

Why do bagging predictors work?
References: [Breiman, 1994, Breiman, 1996].
For some unstable predictors (such as classification trees), a small change in the training set can lead to a big change in the trained predictor, due to overfitting (hence the misclassification error obtained on the training dataset is very optimistic). Bagging reduces this instability through an averaging procedure.

Estimating the generalization ability
Several strategies can be used to estimate the generalization ability of an algorithm:
- split the data into a training and a test set (approximately 67%/33%): the model is estimated on the training set and a test error is computed on the remaining observations;
- cross validation: split the data into L folds; for each fold, train a model without the data included in the current fold and compute the error on the data included in that fold; the average of these L errors is the cross validation error (a sketch is given below);
- out-of-bag error (see the next paragraph).
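A sketch of L-fold cross validation in R (the data frame d with a factor response y, and the rpart learner, are placeholders for any model):

library(rpart)
L <- 10
folds <- sample(rep(1:L, length.out = nrow(d)))   # random assignment to folds
errors <- sapply(1:L, function(l) {
  fit <- rpart(y ~ ., data = d[folds != l, ], method = "class")
  pred <- predict(fit, d[folds == l, ], type = "class")
  mean(pred != d$y[folds == l])                   # error on the held-out fold
})
mean(errors)                                      # cross validation error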

Out-of-bag observations, prediction and error
OOB (out-of-bag) error: an error based on the observations not included in the bag. For every $i = 1, \ldots, n$, compute the OOB prediction
$\Phi^{OOB}(x_i) = \arg\max_{k=1,\ldots,K} |\{b: \Phi^b(x_i) = k \text{ and } x_i \notin \mathcal{T}^b\}|$
($x_i$ is said to be out-of-bag for $\mathcal{T}^b$ if $x_i \notin \mathcal{T}^b$). The OOB error is the misclassification rate of these estimates:
$\frac{1}{n} \sum_{i=1}^n \mathbb{I}_{\{\Phi^{OOB}(x_i) \neq y_i\}}$
It is often a bit optimistic compared to the test error.
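Continuing the hand-made bagging sketch above (train, y, B and rpart as before), OOB predictions only aggregate the trees whose bootstrap sample does not contain the observation; a sketch, this time keeping the bootstrap indices:

boot.idx <- lapply(1:B, function(b) sample(nrow(train), replace = TRUE))
trees <- lapply(boot.idx, function(idx) rpart(y ~ ., data = train[idx, ], method = "class"))
oob.predict <- function(i) {
  # majority vote restricted to the trees for which x_i is out-of-bag
  oob <- sapply(boot.idx, function(idx) !(i %in% idx))
  votes <- sapply(trees[oob], function(t)
    as.character(predict(t, train[i, , drop = FALSE], type = "class")))
  names(which.max(table(votes)))
}
oob.pred <- sapply(1:nrow(train), oob.predict)
mean(oob.pred != train$y)   # OOB error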

Section 3: Application of bagging to CART: random forests

General framework
Notations: a random pair of variables (X, Y), with $X \in \mathbb{R}^p$ and $Y \in \mathbb{R}$ (regression) or $Y \in \{1, \ldots, K\}$ (classification); a training set $(x_i, y_i)_{i=1,\ldots,n}$ of i.i.d. observations of (X, Y).
Purpose: train a function $\Phi_n: \mathbb{R}^p \to \{1, \ldots, K\}$ from $(x_i, y_i)_i$, capable of predicting Y from X.

Overview: advantages/drawbacks
Random forests were introduced by [Breiman, 2001].
Advantages:
- classification OR regression (i.e., Y can be a numeric variable or a factor);
- non-parametric method (no prior assumption needed) and accurate;
- can deal with a large number of input variables, either numeric variables or factors;
- can deal with small and large samples.
Drawbacks:
- black-box model;
- not yet supported by strong mathematical results (consistency...).

Basis for random forests: bagging of classification trees
Suppose: $Y \in \{1, \ldots, K\}$ (classification problem) and $(\Phi^b)_{b=1,\ldots,B}$ are B classifiers, $\Phi^b: \mathbb{R}^p \to \{1, \ldots, K\}$, obtained from B bootstrap samples of $\{(x_i, y_i)\}_{i=1,\ldots,n}$.
Basic bagging with classification trees:
1: for b = 1, ..., B do
2:   construct a bootstrap sample $\mathcal{T}^b$ from $\{(x_i, y_i)\}_{i=1,\ldots,n}$
3:   train a classification tree $\Phi^b$ from $\mathcal{T}^b$
4: end for
5: aggregate the classifiers with a majority vote: $\Phi_n(x) := \arg\max_{k=1,\ldots,K} |\{b: \Phi^b(x) = k\}|$, where $|S|$ denotes the cardinality of a finite set S.

Random forests
CART bagging with under-efficient trees to avoid overfitting:
1. for every tree, each time a split is made, it is preceded by a random choice of q variables among the p available, $X = (X^1, X^2, \ldots, X^p)$. The current node is then built based on these variables only: it is defined as the split among the q variables that produces the two subsets with the largest inter-class variance. An advisable choice for q is $\sqrt{p}$ for classification (and p/3 for regression);
2. the tree is restricted to its first l nodes, with l small.
Hyperparameters:
- those of the CART algorithm (maximal depth, minimum size of a node, minimum homogeneity of a node...);
- those specific to the random forest: q and the number of bootstrap samples, B (also called the number of trees).
Random forests are not very sensitive to the hyperparameter settings: the default values for q and for the bootstrap sample size (2n/3) should work in most cases.
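In R, the randomForest package implements this procedure with these defaults. A minimal sketch on the BreastCancer data used earlier (dropping the Id column and the incomplete rows is an assumption about the preprocessing):

library(randomForest)
library(mlbench)
data(BreastCancer)
d <- na.omit(BreastCancer[, -1])                      # drop Id and incomplete rows
rf <- randomForest(Class ~ ., data = d, ntree = 500)  # mtry defaults to sqrt(p)
rf        # prints the confusion matrix and the OOB error estimate
plot(rf)  # OOB error as a function of the number of trees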

Additional tools
OOB (out-of-bag) error: an error based on the OOB predictions. Stabilization of the OOB error is a good indication that there are enough trees in the forest.
Importance of a variable, to help interpretation: for a given variable $X^j$ ($j \in \{1, \ldots, p\}$), the importance of $X^j$ is the mean decrease in accuracy obtained when the values of $X^j$ are randomized:
$I(X^j) = P(\Phi_n(X) = Y) - P(\Phi_n(X^{(j)}) = Y)$
in which $X^{(j)} = (X^1, \ldots, \tilde{X}^j, \ldots, X^p)$, $\tilde{X}^j$ being the variable $X^j$ with permuted values. The importance is estimated with the OOB observations (see below for details).

Importance estimation in random forests
OOB estimation of the importance of $X^j$:
1: for b = 1, ..., B (loop on trees) do
2:   permute the values $(x_i^j)$ for the OOB observations $x_i \notin \mathcal{T}^b$, which gives the permuted observations $\tilde{x}_i^{(j,b)} = (x_i^1, \ldots, \tilde{x}_i^j, \ldots, x_i^p)$
3:   predict $\Phi^b(\tilde{x}_i^{(j,b)})$
4: end for
5: return the OOB estimation of the importance:
$\frac{1}{B} \sum_{b=1}^B \left[ \frac{1}{|\overline{\mathcal{T}^b}|} \sum_{x_i \notin \mathcal{T}^b} \mathbb{I}_{\{\Phi^b(x_i) = y_i\}} - \frac{1}{|\overline{\mathcal{T}^b}|} \sum_{x_i \notin \mathcal{T}^b} \mathbb{I}_{\{\Phi^b(\tilde{x}_i^{(j,b)}) = y_i\}} \right]$
where $\overline{\mathcal{T}^b}$ denotes the set of OOB observations for $\mathcal{T}^b$.
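With randomForest, this permutation importance is computed when the model is trained with importance = TRUE (continuing the previous sketch):

rf <- randomForest(Class ~ ., data = d, ntree = 500, importance = TRUE)
importance(rf)   # mean decrease in accuracy (and in node impurity) per variable
varImpPlot(rf)   # graphical display of the variable importances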

Section 4: Introduction to parallel computing

Very basic background on parallel computing
Purpose: distributed or parallel computing aims at distributing a computation over several cores (multi-core processors), several processors (multi-processor computers) or clusters (composed of several computers).
Constraint: communication between the cores slows down the computation, so a common strategy consists in breaking the calculation into independent parts, so that each processing unit executes its part independently from the others.

Parallel computing with non-big data
Framework: the data (number of observations n) are small enough for every processor to access them all, and the calculation can easily be broken into independent parts.
Example: bagging can easily be computed with a parallel strategy (it is said to be embarrassingly parallel):
- first step: each processing unit creates one (or several) bootstrap sample(s) and trains one (or several) classifier(s) from it;
- final step: one processing unit collects all the results and combines them into a single classifier with a majority vote.

[Figure: illustration of the parallel strategy with a random forest]

Parallel computing with R
R CRAN task view: HighPerformanceComputing.html
What does this course cover? Very basic explicit parallel computing: the package foreach (you can also have a look at the packages snow and snowfall for parallel versions of apply, lapply, sapply, ..., and for the management of random seed settings in parallel environments).
What does this course not cover? Implicit parallel computing; large memory management (see for instance biglm and bigmemory to work around out-of-memory problems in R, which occur because R is limited to the physical random-access memory); GPU computing; resource management; ...
TL;DR: many things will still remain to be learned at the end of this course...
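A minimal foreach sketch of the embarrassingly parallel bagging scheme of the previous slides (the doParallel backend, the rpart base learner and the data frame train are assumptions):

library(doParallel)
library(rpart)
registerDoParallel(cores = 2)   # two processing units
B <- 50
# each iteration is independent: one bootstrap sample, one tree
trees <- foreach(b = 1:B) %dopar% {
  boot.idx <- sample(nrow(train), replace = TRUE)
  rpart(y ~ ., data = train[boot.idx, ], method = "class")
}
# the final majority vote is then computed sequentially, as before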

Parallel computing with big data
What is big data? Massive datasets:
- Volume: n is huge, and the storage needed for the data can exceed several GB (or even PB; e.g., Facebook reports having 21 petabytes of data, with 12 TB of new data daily);
- Velocity: the data come in and out very fast, and the need to analyze them is fast as well;
- Variety: the data are composed of many types of data.
What needs to be adapted to use parallel programming? In this framework, a single processing unit cannot analyze the whole dataset by itself: the data must be split between the different processing units as well.

Overview of Map Reduce
The data are broken into several bits (Data 1, Data 2, Data 3, ...). Each bit is processed through ONE map step, which outputs a set of pairs {(key, value)}; map jobs must be independent! The result is indexed data. Each key ($key_1, \ldots, key_k$) is then processed through ONE reduce step to produce the output.

Map Reduce in practice: a (stupid) case study
A huge number of sales, identified by the shop and the amount:
shop1,25000
shop2,12
shop2,1500
shop4,47
shop1,...
Question: extract the total amount per shop.
Standard way (sequential): the data are read sequentially, and a vector containing the current sum for every shop is updated at each line.
Map Reduce way (parallel): ...

Map Reduce for an aggregation framework
The data are broken into several bits. Map step: read each line and output the pair (key = shop, value = amount). Reduce step: for every key (i.e., every shop), compute the sum of its values.
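The two steps can be mimicked in plain (sequential) R, which makes the key/value logic explicit (the vector of sales lines is an assumption):

sales <- c("shop1,25000", "shop2,12", "shop2,1500", "shop4,47")
# map step: each line produces a (key = shop, value = amount) pair
fields <- strsplit(sales, ",")
keys   <- sapply(fields, `[`, 1)
values <- as.numeric(sapply(fields, `[`, 2))
# shuffle: group the values by key; reduce step: one sum per key
sapply(split(values, keys), sum)
#  shop1  shop2  shop4
#  25000   1512     47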

In practice: the Hadoop framework
Apache Hadoop: an open-source software framework for big data, programmed in Java. It contains:
- a distributed file system (the data seem to be accessed as if they were on a single computer, although they are distributed over several storage units);
- a map-reduce framework that takes advantage of data locality.
It is divided into Name Nodes (typically two), which manage the file system index, and Data Nodes, which each contain a small portion of the data together with processing capabilities.
Data inside HDFS are not indexed (unlike SQL data, for instance) but stored as simple text files (e.g., comma-separated), so queries cannot be performed simply.

Hadoop & R
How will we be using it? Through an R interface to Hadoop, composed of several packages:
- studied: rmr, a Map Reduce framework (it can be used as if Hadoop were installed, even if it is not...);
- not studied: rhdfs (to manage the Hadoop data filesystem), rhbase (to manage the Hadoop HBase database), plyrmr (advanced data processing functions with a plyr syntax).
Installing rmr without Hadoop: ...
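A hedged sketch with rmr (package rmr2), using the local backend so that no Hadoop installation is needed; the function names follow the rmr2 API, but treat the details as assumptions:

library(rmr2)
rmr.options(backend = "local")   # simulate map reduce without Hadoop
# write the (shop, amount) pairs to the (here simulated) distributed file system
sales <- to.dfs(keyval(c("shop1", "shop2", "shop2", "shop4"),
                       c(25000, 12, 1500, 47)))
# map: pass the (key, value) pairs through; reduce: sum the values of each key
totals <- mapreduce(input = sales,
                    map = function(k, v) keyval(k, v),
                    reduce = function(k, vv) keyval(k, sum(vv)))
from.dfs(totals)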


More information

I/O Considerations in Big Data Analytics

I/O Considerations in Big Data Analytics Library of Congress I/O Considerations in Big Data Analytics 26 September 2011 Marshall Presser Federal Field CTO EMC, Data Computing Division 1 Paradigms in Big Data Structured (relational) data Very

More information

Introduction to Hadoop

Introduction to Hadoop Introduction to Hadoop Miles Osborne School of Informatics University of Edinburgh miles@inf.ed.ac.uk October 28, 2010 Miles Osborne Introduction to Hadoop 1 Background Hadoop Programming Model Examples

More information

Machine Learning with MATLAB David Willingham Application Engineer

Machine Learning with MATLAB David Willingham Application Engineer Machine Learning with MATLAB David Willingham Application Engineer 2014 The MathWorks, Inc. 1 Goals Overview of machine learning Machine learning models & techniques available in MATLAB Streamlining the

More information

Benchmarking Open-Source Tree Learners in R/RWeka

Benchmarking Open-Source Tree Learners in R/RWeka Benchmarking Open-Source Tree Learners in R/RWeka Michael Schauerhuber 1, Achim Zeileis 1, David Meyer 2, Kurt Hornik 1 Department of Statistics and Mathematics 1 Institute for Management Information Systems

More information

Decompose Error Rate into components, some of which can be measured on unlabeled data

Decompose Error Rate into components, some of which can be measured on unlabeled data Bias-Variance Theory Decompose Error Rate into components, some of which can be measured on unlabeled data Bias-Variance Decomposition for Regression Bias-Variance Decomposition for Classification Bias-Variance

More information

On the effect of data set size on bias and variance in classification learning

On the effect of data set size on bias and variance in classification learning On the effect of data set size on bias and variance in classification learning Abstract Damien Brain Geoffrey I Webb School of Computing and Mathematics Deakin University Geelong Vic 3217 With the advent

More information

Penalized regression: Introduction

Penalized regression: Introduction Penalized regression: Introduction Patrick Breheny August 30 Patrick Breheny BST 764: Applied Statistical Modeling 1/19 Maximum likelihood Much of 20th-century statistics dealt with maximum likelihood

More information

Machine Learning using MapReduce

Machine Learning using MapReduce Machine Learning using MapReduce What is Machine Learning Machine learning is a subfield of artificial intelligence concerned with techniques that allow computers to improve their outputs based on previous

More information

Principles of Data Mining by Hand&Mannila&Smyth

Principles of Data Mining by Hand&Mannila&Smyth Principles of Data Mining by Hand&Mannila&Smyth Slides for Textbook Ari Visa,, Institute of Signal Processing Tampere University of Technology October 4, 2010 Data Mining: Concepts and Techniques 1 Differences

More information

Outline. High Performance Computing (HPC) Big Data meets HPC. Case Studies: Some facts about Big Data Technologies HPC and Big Data converging

Outline. High Performance Computing (HPC) Big Data meets HPC. Case Studies: Some facts about Big Data Technologies HPC and Big Data converging Outline High Performance Computing (HPC) Towards exascale computing: a brief history Challenges in the exascale era Big Data meets HPC Some facts about Big Data Technologies HPC and Big Data converging

More information

Lavastorm Analytic Library Predictive and Statistical Analytics Node Pack FAQs

Lavastorm Analytic Library Predictive and Statistical Analytics Node Pack FAQs 1.1 Introduction Lavastorm Analytic Library Predictive and Statistical Analytics Node Pack FAQs For brevity, the Lavastorm Analytics Library (LAL) Predictive and Statistical Analytics Node Pack will be

More information

Tackling Big Data with MATLAB Adam Filion Application Engineer MathWorks, Inc.

Tackling Big Data with MATLAB Adam Filion Application Engineer MathWorks, Inc. Tackling Big Data with MATLAB Adam Filion Application Engineer MathWorks, Inc. 2015 The MathWorks, Inc. 1 Challenges of Big Data Any collection of data sets so large and complex that it becomes difficult

More information

CSE-E5430 Scalable Cloud Computing Lecture 2

CSE-E5430 Scalable Cloud Computing Lecture 2 CSE-E5430 Scalable Cloud Computing Lecture 2 Keijo Heljanko Department of Computer Science School of Science Aalto University keijo.heljanko@aalto.fi 14.9-2015 1/36 Google MapReduce A scalable batch processing

More information

Winning the Kaggle Algorithmic Trading Challenge with the Composition of Many Models and Feature Engineering

Winning the Kaggle Algorithmic Trading Challenge with the Composition of Many Models and Feature Engineering IEICE Transactions on Information and Systems, vol.e96-d, no.3, pp.742-745, 2013. 1 Winning the Kaggle Algorithmic Trading Challenge with the Composition of Many Models and Feature Engineering Ildefons

More information