Applied Multivariate Analysis  Big data analytics


 Jared Lang
 1 years ago
 Views:
Transcription
1 Applied Multivariate Analysis  Big data analytics Nathalie VillaVialaneix M1 in Economics and Economics and Statistics Toulouse School of Economics Nathalie VillaVialaneix M1 E&S  Big data, part 1 1/48
2 Course outline 1 Short introduction to bootstrap 2 Application of bootstrap to classification: bagging 3 Application of bagging to CART: random forests 4 Introduction to parallel computing 5 Application of map reduce to statistical learning Nathalie VillaVialaneix M1 E&S  Big data, part 1 2/48
3 Section 1 Short introduction to bootstrap Nathalie VillaVialaneix M1 E&S  Big data, part 1 3/48
4 Basics about bootstrap General method for: parameter estimation (especially bias) confidence interval estimation in a nonparametric context (i.e., when the law of the observation is completely unknown). Can handle small sample size (n small). Nathalie VillaVialaneix M1 E&S  Big data, part 1 4/48
5 Basics about bootstrap General method for: parameter estimation (especially bias) confidence interval estimation in a nonparametric context (i.e., when the law of the observation is completely unknown). Can handle small sample size (n small). [Efron, 1979] proposes to simulate the unknown law using resampling from the observations. Nathalie VillaVialaneix M1 E&S  Big data, part 1 4/48
6 Notations and problem Framework: X random variable with (unknown) law P. Problem: estimation of a parameter θ(p) with an estimate R n = R(x 1,..., x n ) where x 1,..., x n are i.i.d. observations of P? Standard examples: estimation of the mean: θ(p) = xdp(x) with R n = x = 1 n n i=1 x i; estimation of the variance: θ(p) = x 2 dp(x) ( xdp(x) ) 2 with R n = 1 n 1 n i=1 (x i x) 2. Nathalie VillaVialaneix M1 E&S  Big data, part 1 5/48
7 Notations and problem Framework: X random variable with (unknown) law P. Problem: estimation of a parameter θ(p) with an estimate R n = R(x 1,..., x n ) where x 1,..., x n are i.i.d. observations of P? Standard examples: estimation of the mean: θ(p) = xdp(x) with R n = x = 1 n n i=1 x i; estimation of the variance: θ(p) = x 2 dp(x) ( xdp(x) ) 2 with R n = 1 n 1 n i=1 (x i x) 2. In the previous examples, R n is a plugin estimate: R n = θ(p n ) where P n is the empirical distribution 1 n n i=1 δ x i. Nathalie VillaVialaneix M1 E&S  Big data, part 1 5/48
8 Real/bootstrap worlds sampling law P (related to X) P n = 1 n n i=1 δ x i (related to X n ) sample X n = {x 1,..., x n } X n = {x 1,..., x n} (sample of size n with replacement in X n ) estimation R n = R(x 1,..., x n ) R n = R(x 1,..., x n) Nathalie VillaVialaneix M1 E&S  Big data, part 1 6/48
9 Example: bootstrap estimation of the IC for mean Sample obtained from χ 2 distribution (n = 50, number of df: 3) histogram of (x i ) i=1,...,n and R n = 1 n n i=1 x i Nathalie VillaVialaneix M1 E&S  Big data, part 1 7/48
10 Example: bootstrap estimation of the IC for mean Distribution of B = 1000 estimates obtained from bootstrap samples: estimation of a confidence interval from 2.5% and 97.5% quantiles histogram of (R,b n ) b=1,...,b with R,b n = 1 n n i=1 x i and Q µ = quantile({r,b n } b, µ), µ {0.025, 0.975} Nathalie VillaVialaneix M1 E&S  Big data, part 1 7/48
11 Other practical examples θ is any parameter to be estimated: the mean of the distribution (as in the previous example), the median, the variance, slope or intercept in linear models... bias estimate estimate θ with the empirical estimate R n ; obtain B bootstrap estimates of θ, (Rn,b ) b=1,...,b ; bias of R n is estimated by: 1 B b Rn,b R n Nathalie VillaVialaneix M1 E&S  Big data, part 1 8/48
12 Other practical examples θ is any parameter to be estimated: the mean of the distribution (as in the previous example), the median, the variance, slope or intercept in linear models... bias estimate estimate θ with the empirical estimate R n ; obtain B bootstrap estimates of θ, (Rn,b ) b=1,...,b ; bias of R n is estimated by: 1 B b Rn,b R n variance estimate estimate θ with the empirical estimate R n ; obtain B bootstrap estimates of θ, (Rn,b ) b=1,...,b ; variance of R n is estimated by: 1 B b(rn,b Rn) 2 where Rn = 1 B b Rn,b Nathalie VillaVialaneix M1 E&S  Big data, part 1 8/48
13 Other practical examples θ is any parameter to be estimated: the mean of the distribution (as in the previous example), the median, the variance, slope or intercept in linear models... bias estimate estimate θ with the empirical estimate R n ; obtain B bootstrap estimates of θ, (Rn,b ) b=1,...,b ; bias of R n is estimated by: 1 B b Rn,b R n variance estimate estimate θ with the empirical estimate R n ; obtain B bootstrap estimates of θ, (Rn,b ) b=1,...,b ; variance of R n is estimated by: 1 B b(rn,b Rn) 2 where Rn = 1 B b Rn,b confidence interval estimate estimate θ with the empirical estimate R n ; obtain B bootstrap estimates of θ, (Rn,b ) b=1,...,b ; confidence interval at risk α of R n is estimated by: [Q α/2 ; Q 1 α/2 ] where Q µ = quantile({rn,b } b, µ) Nathalie VillaVialaneix M1 E&S  Big data, part 1 8/48
14 Do it yourself: unrealistic bootstrap by hand! {x i } i : ; ; ; ; n? empirical estimate for the mean? unbiased estimate for the variance? Nathalie VillaVialaneix M1 E&S  Big data, part 1 9/48
15 Do it yourself: unrealistic bootstrap by hand! {x i } i : ; ; ; ; n = 5 empirical estimate for the mean R n = x = 1 n n i=1 x i = unbiased estimate for the variance R n = ˆσ n 1 = 1 n 1 n i=1 (x i x) 2 = Nathalie VillaVialaneix M1 E&S  Big data, part 1 9/48
16 Do it yourself: unrealistic bootstrap by hand! {x i } i : ; ; ; ; n = 5 empirical estimate for the mean R n = x = 1 n n i=1 x i = unbiased estimate for the variance R n = ˆσ n 1 = 1 n 1 n i=1 (x i x) 2 = Bootstrap samples (B = 2): b = 1: x 3, x 3, x 2, x 1, x 3 b = 2: x 3, x 3, x 2, x 1, x 3 bootstrap estimate for the variance of x? bootstrap estimate for the mean of the empirical variance? Nathalie VillaVialaneix M1 E&S  Big data, part 1 9/48
17 Do it yourself: unrealistic bootstrap by hand! {x i } i : ; ; ; ; n = 5 empirical estimate for the mean R n = x = 1 n n i=1 x i = unbiased estimate for the variance R n = ˆσ n 1 = 1 n 1 n i=1 (x i x) 2 = Bootstrap samples (B = 2): b = 1: x 3, x 3, x 2, x 1, x 3 b = 2: x 3, x 3, x 2, x 1, x 3 = , Rn,2 = and Var (x) = bootstrap estimate for the mean of the empirical variance bootstrap estimate for the variance of x R,1 n R,1 n = , R,2 n = and Ê (σ n 1 ) = Nathalie VillaVialaneix M1 E&S  Big data, part 1 9/48
18 user and the package boot library( boot) # a sample from a Chi  Square distribution is generated orig. sample < rchisq(50, df =3) # the estimate of the mean is mean( orig. sample) # function that calculates estimate from a bootstrap sample. mean < function(x, d) { return( mean(x[d])) } # bootstraping now... boot. mean < boot( orig. sample, sample. mean, R =1000) boot. mean # ORDINARY NONPARAMETRIC BOOTSTRAP # Call: # boot( data = orig. sample, statistic = sample. mean, # R = 1000) # Bootstrap Statistics : # original bias std. error # t1* Nathalie VillaVialaneix M1 E&S  Big data, part 1 10/48
19 Section 2 Application of bootstrap to classification: bagging Nathalie VillaVialaneix M1 E&S  Big data, part 1 11/48
20 What is bagging? Bagging: Bootstrap Aggregating metaalgorithm based on bootstrap which aggregate an ensemble of predictors in statistical classification and regression (special case of model averaging approaches) Nathalie VillaVialaneix M1 E&S  Big data, part 1 12/48
21 What is bagging? Bagging: Bootstrap Aggregating metaalgorithm based on bootstrap which aggregate an ensemble of predictors in statistical classification and regression (special case of model averaging approaches) Notations: a random pair of variables (X, Y): X X and Y R (regression) or Y {1,..., K} (classification) a training set (x i, y i ) i=1,...,n of i.i.d. observations of (X, Y) Purpose: train a function, Φ n : X {1,..., K}, from (x i, y i ) i capable of predicting Y from X Nathalie VillaVialaneix M1 E&S  Big data, part 1 12/48
22 Basics Suppose that we are given an algorithm: T = { (x i, y i ) } i Φ T where Φ T is a classification function: Φ T : x X Φ T (x) {1,..., K}. Nathalie VillaVialaneix M1 E&S  Big data, part 1 13/48
23 Basics Suppose that we are given an algorithm: T = { (x i, y i ) } i Φ T where Φ T is a classification function: Φ T : x X Φ T (x) {1,..., K}. B classifiers can be defined from B bootstrap samples using this algorithm: b = 1,..., B, T b is a bootstrap sample of (x i, y i ) i=1,...,n and Φ b = Φ T b. Nathalie VillaVialaneix M1 E&S  Big data, part 1 13/48
24 Basics Suppose that we are given an algorithm: T = { (x i, y i ) } i Φ T where Φ T is a classification function: Φ T : x X Φ T (x) {1,..., K}. B classifiers can be defined from B bootstrap samples using this algorithm: b = 1,..., B, T b is a bootstrap sample of (x i, y i ) i=1,...,n and Φ b = Φ T b. (Φ b ) b=1,...,b are aggregated using a majority vote scheme (an averaging in the regression case): x X, Φ n (x) := argmax k=1,...,k {b : Φ b (x) = k} where S denotes the cardinal of a finite set S. Nathalie VillaVialaneix M1 E&S  Big data, part 1 13/48
25 Why using bagging? Bagging improves stability and limits the risk of overtraining. Experiment: Using breastcancer dataset from mlbench 1 : 699 observations on 10 variables, 9 being ordered or nominal and describing a tumor and 1 target class indicating if this tumor was malignant or benign. 1 Data are coming from the UCI machine learning repository Nathalie VillaVialaneix M1 E&S  Big data, part 1 14/48
26 To begin: bagging by hand... for x = Cl.thickness Cell.size Cell.shape Marg.adhesion j g g d Epith.c.size Bare.nuclei Bl.cromatin Normal.nucleoli e j e g Mitoses b and the following classification trees obtained from 3 bootstrap samples, what is the prediction for x by bagging? Nathalie VillaVialaneix M1 E&S  Big data, part 1 15/48
27 Nathalie VillaVialaneix M1 E&S  Big data, part 1 16/48
28 individual predictions are: malignant, malignant, malignant so the final prediction is malignant Nathalie VillaVialaneix M1 E&S  Big data, part 1 16/48
29 Description of computational aspects 100 runs. For each run: split the data into a training set (399 observations) and a test set (300 remaining observations); train a classification tree on the training set. Using the test set, calculate a test misclassification error by comparing the prediction given by the trained tree with the true class; generate 50 bootstrap samples from the training set and use them to compute 500 classification trees. Use them to compute a bagging prediction for the test sets and calculate a bagging misclassification error by comparing the bagging prediction with the true class. Nathalie VillaVialaneix M1 E&S  Big data, part 1 17/48
30 Description of computational aspects 100 runs. For each run: split the data into a training set (399 observations) and a test set (300 remaining observations); train a classification tree on the training set. Using the test set, calculate a test misclassification error by comparing the prediction given by the trained tree with the true class; generate 50 bootstrap samples from the training set and use them to compute 500 classification trees. Use them to compute a bagging prediction for the test sets and calculate a bagging misclassification error by comparing the bagging prediction with the true class. this results in test errors 100 bagging errors Nathalie VillaVialaneix M1 E&S  Big data, part 1 17/48
31 Results of the simulations Nathalie VillaVialaneix M1 E&S  Big data, part 1 18/48
32 What is overfitting? Function x y to be estimated Nathalie VillaVialaneix M1 E&S  Big data, part 1 19/48
33 What is overfitting? Observations we might have Nathalie VillaVialaneix M1 E&S  Big data, part 1 19/48
34 What is overfitting? Observations we do have Nathalie VillaVialaneix M1 E&S  Big data, part 1 19/48
35 What is overfitting? First estimation from the observations: underfitting Nathalie VillaVialaneix M1 E&S  Big data, part 1 19/48
36 What is overfitting? Second estimation from the observations: accurate estimation Nathalie VillaVialaneix M1 E&S  Big data, part 1 19/48
37 What is overfitting? Third estimation from the observations: overfitting Nathalie VillaVialaneix M1 E&S  Big data, part 1 19/48
38 What is overfitting? Summary A compromise must be made between accuracy and generalization ability. Nathalie VillaVialaneix M1 E&S  Big data, part 1 19/48
39 Why do bagging predictors work? References: [Breiman, 1994, Breiman, 1996]. For some instable predictors (such as classification trees for instance), a small change in the training set can yield to a big change in the trained tree due to overfitting (hence misclassification error obtained on the training dataset is very optimistic) Bagging reduces this instability by using an averaging procedure. Nathalie VillaVialaneix M1 E&S  Big data, part 1 20/48
40 Estimating the generalization ability Several strategies can be used to estimate the generalization ability of an algorithm: split the data into a training/test set ( 67/33%): the model is estimated with the training dataset and a test error is computed with the remaining observations; Nathalie VillaVialaneix M1 E&S  Big data, part 1 21/48
41 Estimating the generalization ability Several strategies can be used to estimate the generalization ability of an algorithm: split the data into a training/test set ( 67/33%): the model is estimated with the training dataset and a test error is computed with the remaining observations; cross validation: split the data into L folds. For each fold, train a model without the data included in the current fold and compute the error with the data included in the fold: the averaging of these L errors is the cross validation error; Nathalie VillaVialaneix M1 E&S  Big data, part 1 21/48
42 Estimating the generalization ability Several strategies can be used to estimate the generalization ability of an algorithm: split the data into a training/test set ( 67/33%): the model is estimated with the training dataset and a test error is computed with the remaining observations; cross validation: split the data into L folds. For each fold, train a model without the data included in the current fold and compute the error with the data included in the fold: the averaging of these L errors is the cross validation error; outofbag error (see next slide). Nathalie VillaVialaneix M1 E&S  Big data, part 1 21/48
43 Outofbag observations, prediction and error OOB (OutOf Bags) error: error based on the observations not included in the bag : for every i = 1,..., n, calculate the OOB prediction: Φ OOB (x i ) = argmax k=1,...,k { b : Φ b (x i ) = k and x i T b} (x i is said to be outofbag for T b if x i T b ) Nathalie VillaVialaneix M1 E&S  Big data, part 1 22/48
44 Outofbag observations, prediction and error OOB (OutOf Bags) error: error based on the observations not included in the bag : for every i = 1,..., n, calculate the OOB prediction: Φ OOB (x i ) = argmax k=1,...,k { b : Φ b (x i ) = k and x i T b} (x i is said to be outofbag for T b if x i T b ) OOB error is the misclassification rate of these estimates: 1 n n I {Φ OOB (x i ) y i} i=1 It is often a bit optimistic compared to the test error. Nathalie VillaVialaneix M1 E&S  Big data, part 1 22/48
45 Section 3 Application of bagging to CART: random forests Nathalie VillaVialaneix M1 E&S  Big data, part 1 23/48
46 General framework Notations: a random pair of variables (X, Y): X R p and Y R (regression) or Y {1,..., K} (classification) a training set (x i, y i ) i=1,...,n of i.i.d. observations of (X, Y) Purpose: train a function, Φ n : R p {1,..., K}, from (x i, y i ) i capable of predicting Y from X Nathalie VillaVialaneix M1 E&S  Big data, part 1 24/48
47 Overview: Advantages/Drawbacks Random Forest: introduced by [Breiman, 2001] Nathalie VillaVialaneix M1 E&S  Big data, part 1 25/48
48 Overview: Advantages/Drawbacks Random Forest: introduced by [Breiman, 2001] Advantages classification OR regression (i.e., Y can be a numeric variable or a factor); non parametric method (no prior assumption needed) and accurate; can deal with a large number of input variables, either numeric variables or factors; can deal with small/large samples. Nathalie VillaVialaneix M1 E&S  Big data, part 1 25/48
49 Overview: Advantages/Drawbacks Random Forest: introduced by [Breiman, 2001] Advantages classification OR regression (i.e., Y can be a numeric variable or a factor); non parametric method (no prior assumption needed) and accurate; can deal with a large number of input variables, either numeric variables or factors; can deal with small/large samples. Drawbacks black box model; is not supported by strong mathematical results (consistency...) until now. Nathalie VillaVialaneix M1 E&S  Big data, part 1 25/48
50 Basis for random forest: bagging of classification trees Suppose: Y {1,..., K} (classification problem) and (Φ b ) b=1,...,b are B classifiers, Φ b : R p {1,..., K}, obtained from B bootstrap samples of {(x i, y i )} i=1,...,n. Nathalie VillaVialaneix M1 E&S  Big data, part 1 26/48
51 Basis for random forest: bagging of classification trees Suppose: Y {1,..., K} (classification problem) and (Φ b ) b=1,...,b are B classifiers, Φ b : R p {1,..., K}, obtained from B bootstrap samples of {(x i, y i )} i=1,...,n. Basic bagging with classification trees 1: for b = 1,..., B do 2: Construct a bootstrap sample T b from {(x i, y i )} i=1,...,n 3: Train a classification tree from T b, Φ b 4: end for 5: Aggregate the classifiers with majority vote Φ n (x) := argmax k=1,...,k {b : Φ b (x) = k} where S denotes the cardinal of a finite set S. Nathalie VillaVialaneix M1 E&S  Big data, part 1 26/48
52 Random forests CART bagging with underefficient trees to avoid overfitting 1 for every tree, each time a split is made, it is preceded by a random choice of q variables among the p available X = (X 1, X 2,..., X p ). The current node is then built based on these variables only: it is defined as the split among the q variables that produces the two subsets with the largest interclass variance. An advisable choice for q is p for classification (and p/3 for regression); Nathalie VillaVialaneix M1 E&S  Big data, part 1 27/48
53 Random forests CART bagging with underefficient trees to avoid overfitting 1 for every tree, each time a split is made, it is preceded by a random choice of q variables among the p available X = (X 1, X 2,..., X p ). The current node is then built based on these variables only: it is defined as the split among the q variables that produces the two subsets with the largest interclass variance. An advisable choice for q is p for classification (and p/3 for regression); 2 the tree is restricted to the first l nodes with l small. Nathalie VillaVialaneix M1 E&S  Big data, part 1 27/48
54 Random forests CART bagging with underefficient trees to avoid overfitting 1 for every tree, each time a split is made, it is preceded by a random choice of q variables among the p available X = (X 1, X 2,..., X p ). The current node is then built based on these variables only: it is defined as the split among the q variables that produces the two subsets with the largest interclass variance. An advisable choice for q is p for classification (and p/3 for regression); 2 the tree is restricted to the first l nodes with l small. Hyperparameters those of the CART algorithm (maximal depth, minimum size of a node, minimum homogeneity of a node...); those that are specific to the random forest: q, number of bootstrap samples (B also called number of trees). Nathalie VillaVialaneix M1 E&S  Big data, part 1 27/48
55 Random forests CART bagging with underefficient trees to avoid overfitting 1 for every tree, each time a split is made, it is preceded by a random choice of q variables among the p available X = (X 1, X 2,..., X p ). The current node is then built based on these variables only: it is defined as the split among the q variables that produces the two subsets with the largest interclass variance. An advisable choice for q is p for classification (and p/3 for regression); 2 the tree is restricted to the first l nodes with l small. Hyperparameters those of the CART algorithm (maximal depth, minimum size of a node, minimum homogeneity of a node...); those that are specific to the random forest: q, number of bootstrap samples (B also called number of trees). Random forest are not very sensitive to hyperparameters setting: default values for q and bootstrap sample size (2n/3) should work in most cases. Nathalie VillaVialaneix M1 E&S  Big data, part 1 27/48
56 Additional tools OOB (OutOf Bags) error: error based on the OOB predictions. Stabilization of OOB error is a good indication that there is enough trees in the forest. Nathalie VillaVialaneix M1 E&S  Big data, part 1 28/48
57 Additional tools OOB (OutOf Bags) error: error based on the OOB predictions. Stabilization of OOB error is a good indication that there is enough trees in the forest. Importance of a variable to help interpretation: for a given variable X j (j {1,..., p}), the importance of X j is the mean decrease in accuracy obtained when the values of X j are randomized: I(X j ) = P (Φ n (X) = Y) P ( Φ n (X (j) = Y ) in which X (j) = (X 1,..., X (j),..., X p ), X (j) being the variable X j with permuted values. Nathalie VillaVialaneix M1 E&S  Big data, part 1 28/48
58 Additional tools OOB (OutOf Bags) error: error based on the OOB predictions. Stabilization of OOB error is a good indication that there is enough trees in the forest. Importance of a variable to help interpretation: for a given variable X j (j {1,..., p}), the importance of X j is the mean decrease in accuracy obtained when the values of X j are randomized. Importance is estimated with OOB observations (see next slide for details) Nathalie VillaVialaneix M1 E&S  Big data, part 1 28/48
59 Importance estimation in random forests OOB estimation for variable X j 1: for b = 1 B (loop on trees) do 2: permute values for (x j i ) i: x i T b return x(j,b) = (x 1,..., x (j,b) i i permuted values x (j,b) i 3: predict Φ b ( x (j,b)) 4: end for 5: return OOB estimation of the importance 1 B 1 B I b=1 T b {Φb (x i )=y i} 1 x i T T b b I { } Φ b (x (j,b) )=y i x i T b i,..., x p i ), Nathalie VillaVialaneix M1 E&S  Big data, part 1 29/48
60 Section 4 Introduction to parallel computing Nathalie VillaVialaneix M1 E&S  Big data, part 1 30/48
61 Very basic background on parallel computing Purpose: Distributed or parallel computing seeks at distributing a calculation on several cores (multicore processors), on several processors (multiprocessor computers) or on clusters (composed of several computers). Nathalie VillaVialaneix M1 E&S  Big data, part 1 31/48
62 Very basic background on parallel computing Purpose: Distributed or parallel computing seeks at distributing a calculation on several cores (multicore processors), on several processors (multiprocessor computers) or on clusters (composed of several computers). Constraint: communication between cores slows down the computation a strategy consists in breaking the calculation into independent parts so that each processing unit executes its part independently from the others. Nathalie VillaVialaneix M1 E&S  Big data, part 1 31/48
63 Parallel computing with nonbig data Framework: the data (number of observations n) is small enough to allow the processors to access them all and the calculation can be easily broken into independent parts. Nathalie VillaVialaneix M1 E&S  Big data, part 1 32/48
64 Parallel computing with nonbig data Framework: the data (number of observations n) is small enough to allow the processors to access them all and the calculation can be easily broken into independent parts. Example: bagging can easily be computed with a parallel strategy (it is said to be embarrassingly parallel): first step: each processing unit creates one (or several) bootstrap sample(s) and learn a (or several) classifier(s) from it; final step: a processing unit collect all results and combine them into a single classifier with a majority vote law. Nathalie VillaVialaneix M1 E&S  Big data, part 1 32/48
65 Illustration with random forest random forest Nathalie VillaVialaneix M1 E&S  Big data, part 1 33/48
66 Parallel computing with R R CRAN task view: HighPerformanceComputing.html What does this course cover? very basic explicit parallel computing; the package foreach (you can also have a look at the packages snow and snowfall for parallel versions of apply, lapply, sapply,... and management of random seed settings in parallel environments). What does this course not cover? implicit parallel computing; large memory management (see for instance biglm and bigmemory to be able to use outofmemory problems with R which occur because R is limited to physical randomaccess memory); GPU computing; resource management;... TL;TD: many things are still to learn at the end of this course... Nathalie VillaVialaneix M1 E&S  Big data, part 1 34/48
67 Parallel computing with Big Data What is Big Data? Massive datasets: Volume: n is huge, and the storage capacity of the data can exceed several GB or PB (ex: Facebook reports to have 21 Peta Bytes of data with 12 TB new data daily); Velocity: the data are coming in/out very fast and the need to analyze them is fast as well; Variety: the data are composed of many types of data. Nathalie VillaVialaneix M1 E&S  Big data, part 1 35/48
68 Parallel computing with Big Data What is Big Data? Massive datasets: Volume: n is huge, and the storage capacity of the data can exceed several GB or PB (ex: Facebook reports to have 21 Peta Bytes of data with 12 TB new data daily); Velocity: the data are coming in/out very fast and the need to analyze them is fast as well; Variety: the data are composed of many types of data. What needs to be adapted to use parallel programming? In this framework a single processing unit cannot analyze all the dataset by itself the data must be split between the different processing units as well. Nathalie VillaVialaneix M1 E&S  Big data, part 1 35/48
69 Overview of Map Reduce Data 1 Data 2 Data Data 3 The data are broken into several bits. Nathalie VillaVialaneix M1 E&S  Big data, part 1 36/48
70 Overview of Map Reduce Data 1 Map {(key, value)} Data 2 Map {(key, value)} Data Data 3 Map {(key, value)} Each bit is processed through ONE map step and gives pairs {(key, value)}. Nathalie VillaVialaneix M1 E&S  Big data, part 1 36/48
71 Overview of Map Reduce Data 1 Map {(key, value)} Data 2 Map {(key, value)} Data Data 3 Map {(key, value)} Map jobs must be independent! Result: indexed data. Nathalie VillaVialaneix M1 E&S  Big data, part 1 36/48
72 Overview of Map Reduce Data 1 Map {(key, value)} Data 2 Map {(key, value)} Reduce key = key 1 Data OUTPUT Reduce key = key k Data 3 Map {(key, value)} Each key is processed through ONE reduce step to produce the output. Nathalie VillaVialaneix M1 E&S  Big data, part 1 36/48
73 Map Reduce in practice (stupid) Case study: A huge number of sales identified by the shop and the amount. shop1,25000 shop2,12 shop2,1500 shop4,47 shop1, Question: Extract the total amount per shop. Standard way (sequential) the data are read sequentially; a vector containing the values of the current sum for every shop is updated at each line. Map Reduce way (parallel)... Nathalie VillaVialaneix M1 E&S  Big data, part 1 37/48
74 Map Reduce for an aggregation framework Data 1 Data 2 Data Data 3 The data are broken into several bits. Nathalie VillaVialaneix M1 E&S  Big data, part 1 38/48
75 Map Reduce for an aggregation framework Data 1 Map {(key, value)} Data 2 Map {(key, value)} Data Data 3 Map {(key, value)} Map step: reads the line and outputs a pair key=shop and value=amount. Nathalie VillaVialaneix M1 E&S  Big data, part 1 38/48
76 Map Reduce for an aggregation framework Data 1 Map {(key, value)} Data 2 Map {(key, value)} Reduce key = key 1 Data OUTPUT Reduce key = key k Data 3 Map {(key, value)} Reduce step: for every key (i.e., shop), compute the sum of values. Nathalie VillaVialaneix M1 E&S  Big data, part 1 38/48
77 In practice, Hadoop framework Apache Hadoop: opensource software framework for Big Data programmed in Java. It contains: a distributed file system (data seem to be accessed as if they were on a single computer, though distributed on several storage units); a mapreduce framework that takes advantage of data locality. It is divided into: Name Nodes (typically two) that manage the file system index and Data Nodes that contain a small portion of the data and processing capabilities. Nathalie VillaVialaneix M1 E&S  Big data, part 1 39/48
78 In practice, Hadoop framework Apache Hadoop: opensource software framework for Big Data programmed in Java. It contains: a distributed file system (data seem to be accessed as if they were on a single computer, though distributed on several storage units); a mapreduce framework that takes advantage of data locality. It is divided into: Name Nodes (typically two) that manage the file system index and Data Nodes that contain a small portion of the data and processing capabilities. Data inside HDFS are not indexed (unlike SQL data for instance) but stored as simple text files (e.g., comma separated) queries cannot be performed simply. Nathalie VillaVialaneix M1 E&S  Big data, part 1 39/48
79 Hadoop & R How will we be using it? We will be using a R interface for Hadoop, composed of several packages (see https://github.com/revolutionanalytics/rhadoop/wiki): studied: rmr: Map Reduce framework (can be used as if Hadoop is installed, even if it is not...); not studied: rhdfs (to manage Hadoop data filesystem), rhbase (to manage Hadoop HBase database), plyrmr (advanced data processing functions with a plyr syntax). Installing rmr without Hadoop: Nathalie VillaVialaneix M1 E&S  Big data, part 1 40/48
Bootstrapping Big Data
Bootstrapping Big Data Ariel Kleiner Ameet Talwalkar Purnamrita Sarkar Michael I. Jordan Computer Science Division University of California, Berkeley {akleiner, ameet, psarkar, jordan}@eecs.berkeley.edu
More informationCOMP 598 Applied Machine Learning Lecture 21: Parallelization methods for largescale machine learning! Big Data by the numbers
COMP 598 Applied Machine Learning Lecture 21: Parallelization methods for largescale machine learning! Instructor: (jpineau@cs.mcgill.ca) TAs: PierreLuc Bacon (pbacon@cs.mcgill.ca) Ryan Lowe (ryan.lowe@mail.mcgill.ca)
More informationModel Combination. 24 Novembre 2009
Model Combination 24 Novembre 2009 Datamining 1 20092010 Plan 1 Principles of model combination 2 Resampling methods Bagging Random Forests Boosting 3 Hybrid methods Stacking Generic algorithm for mulistrategy
More informationData Mining Practical Machine Learning Tools and Techniques
Ensemble learning Data Mining Practical Machine Learning Tools and Techniques Slides for Chapter 8 of Data Mining by I. H. Witten, E. Frank and M. A. Hall Combining multiple models Bagging The basic idea
More informationDistributed forests for MapReducebased machine learning
Distributed forests for MapReducebased machine learning Ryoji Wakayama, Ryuei Murata, Akisato Kimura, Takayoshi Yamashita, Yuji Yamauchi, Hironobu Fujiyoshi Chubu University, Japan. NTT Communication
More informationFast Analytics on Big Data with H20
Fast Analytics on Big Data with H20 0xdata.com, h2o.ai Tomas Nykodym, Petr Maj Team About H2O and 0xdata H2O is a platform for distributed in memory predictive analytics and machine learning Pure Java,
More informationPredicting Flight Delays
Predicting Flight Delays Dieterich Lawson jdlawson@stanford.edu William Castillo will.castillo@stanford.edu Introduction Every year approximately 20% of airline flights are delayed or cancelled, costing
More informationM1 in Economics and Economics and Statistics Applied multivariate Analysis  Big data analytics Worksheet 3  Random Forest
Nathalie VillaVialaneix Année 2014/2015 M1 in Economics and Economics and Statistics Applied multivariate Analysis  Big data analytics Worksheet 3  Random Forest This worksheet s aim is to learn how
More informationGeneralizing Random Forests Principles to other Methods: Random MultiNomial Logit, Random Naive Bayes, Anita Prinzie & Dirk Van den Poel
Generalizing Random Forests Principles to other Methods: Random MultiNomial Logit, Random Naive Bayes, Anita Prinzie & Dirk Van den Poel Copyright 2008 All rights reserved. Random Forests Forest of decision
More informationISSN: 23201363 CONTEXTUAL ADVERTISEMENT MINING BASED ON BIG DATA ANALYTICS
CONTEXTUAL ADVERTISEMENT MINING BASED ON BIG DATA ANALYTICS A.Divya *1, A.M.Saravanan *2, I. Anette Regina *3 MPhil, Research Scholar, Muthurangam Govt. Arts College, Vellore, Tamilnadu, India Assistant
More informationData Mining. Nonlinear Classification
Data Mining Unit # 6 Sajjad Haider Fall 2014 1 Nonlinear Classification Classes may not be separable by a linear boundary Suppose we randomly generate a data set as follows: X has range between 0 to 15
More informationLeveraging Ensemble Models in SAS Enterprise Miner
ABSTRACT Paper SAS1332014 Leveraging Ensemble Models in SAS Enterprise Miner Miguel Maldonado, Jared Dean, Wendy Czika, and Susan Haller SAS Institute Inc. Ensemble models combine two or more models to
More informationClass #6: Nonlinear classification. ML4Bio 2012 February 17 th, 2012 Quaid Morris
Class #6: Nonlinear classification ML4Bio 2012 February 17 th, 2012 Quaid Morris 1 Module #: Title of Module 2 Review Overview Linear separability Nonlinear classification Linear Support Vector Machines
More informationKnowledge Discovery and Data Mining. Bootstrap review. Bagging Important Concepts. Notes. Lecture 19  Bagging. Tom Kelsey. Notes
Knowledge Discovery and Data Mining Lecture 19  Bagging Tom Kelsey School of Computer Science University of St Andrews http://tom.host.cs.standrews.ac.uk twk@standrews.ac.uk Tom Kelsey ID505919B &
More informationCI6227: Data Mining. Lesson 11b: Ensemble Learning. Data Analytics Department, Institute for Infocomm Research, A*STAR, Singapore.
CI6227: Data Mining Lesson 11b: Ensemble Learning Sinno Jialin PAN Data Analytics Department, Institute for Infocomm Research, A*STAR, Singapore Acknowledgements: slides are adapted from the lecture notes
More informationMachine Learning Big Data using Map Reduce
Machine Learning Big Data using Map Reduce By Michael Bowles, PhD Where Does Big Data Come From? Web data (web logs, click histories) ecommerce applications (purchase histories) Retail purchase histories
More informationIntroduction to Machine Learning. Speaker: Harry Chao Advisor: J.J. Ding Date: 1/27/2011
Introduction to Machine Learning Speaker: Harry Chao Advisor: J.J. Ding Date: 1/27/2011 1 Outline 1. What is machine learning? 2. The basic of machine learning 3. Principles and effects of machine learning
More informationHadoop MapReduce and Spark. Giorgio Pedrazzi, CINECASCAI School of Data Analytics and Visualisation Milan, 10/06/2015
Hadoop MapReduce and Spark Giorgio Pedrazzi, CINECASCAI School of Data Analytics and Visualisation Milan, 10/06/2015 Outline Hadoop Hadoop Import data on Hadoop Spark Spark features Scala MLlib MLlib
More informationChapter 12 Bagging and Random Forests
Chapter 12 Bagging and Random Forests Xiaogang Su Department of Statistics and Actuarial Science University of Central Florida  1  Outline A brief introduction to the bootstrap Bagging: basic concepts
More informationThe Big Data Bootstrap
Ariel Kleiner akleiner@cs.berkeley.edu Ameet Talwalkar ameet@cs.berkeley.edu Purnamrita Sarkar psarkar@cs.berkeley.edu Computer Science Division, University of California, Berkeley, CA 9472, USA Michael
More informationL3: Statistical Modeling with Hadoop
L3: Statistical Modeling with Hadoop Feng Li feng.li@cufe.edu.cn School of Statistics and Mathematics Central University of Finance and Economics Revision: December 10, 2014 Today we are going to learn...
More informationDecision Trees from large Databases: SLIQ
Decision Trees from large Databases: SLIQ C4.5 often iterates over the training set How often? If the training set does not fit into main memory, swapping makes C4.5 unpractical! SLIQ: Sort the values
More informationBIG DATA What it is and how to use?
BIG DATA What it is and how to use? Lauri Ilison, PhD Data Scientist 21.11.2014 Big Data definition? There is no clear definition for BIG DATA BIG DATA is more of a concept than precise term 1 21.11.14
More informationEnsemble Methods. Knowledge Discovery and Data Mining 2 (VU) (707.004) Roman Kern. KTI, TU Graz 20150305
Ensemble Methods Knowledge Discovery and Data Mining 2 (VU) (707004) Roman Kern KTI, TU Graz 20150305 Roman Kern (KTI, TU Graz) Ensemble Methods 20150305 1 / 38 Outline 1 Introduction 2 Classification
More informationComparison of Data Mining Techniques used for Financial Data Analysis
Comparison of Data Mining Techniques used for Financial Data Analysis Abhijit A. Sawant 1, P. M. Chawan 2 1 Student, 2 Associate Professor, Department of Computer Technology, VJTI, Mumbai, INDIA Abstract
More informationParallelization Strategies for Multicore Data Analysis
Parallelization Strategies for Multicore Data Analysis WeiChen Chen 1 Russell Zaretzki 2 1 University of Tennessee, Dept of EEB 2 University of Tennessee, Dept. Statistics, Operations, and Management
More informationGerry Hobbs, Department of Statistics, West Virginia University
Decision Trees as a Predictive Modeling Method Gerry Hobbs, Department of Statistics, West Virginia University Abstract Predictive modeling has become an important area of interest in tasks such as credit
More informationDoptimal plans in observational studies
Doptimal plans in observational studies Constanze Pumplün Stefan Rüping Katharina Morik Claus Weihs October 11, 2005 Abstract This paper investigates the use of Design of Experiments in observational
More informationStatistical Machine Learning
Statistical Machine Learning UoC Stats 37700, Winter quarter Lecture 4: classical linear and quadratic discriminants. 1 / 25 Linear separation For two classes in R d : simple idea: separate the classes
More informationEvent driven trading new studies on innovative way. of trading in Forex market. Michał Osmoła INIME live 23 February 2016
Event driven trading new studies on innovative way of trading in Forex market Michał Osmoła INIME live 23 February 2016 Forex market From Wikipedia: The foreign exchange market (Forex, FX, or currency
More informationM1 in Economics and Economics and Statistics Applied multivariate Analysis  Big data analytics Worksheet 2  Bagging
Nathalie VillaVialaneix Année 2015/2016 M1 in Economics and Economics and Statistics Applied multivariate Analysis  Big data analytics Worksheet 2  Bagging This worksheet illustrates the use of bagging
More informationApplied Data Mining Analysis: A StepbyStep Introduction Using RealWorld Data Sets
Applied Data Mining Analysis: A StepbyStep Introduction Using RealWorld Data Sets http://info.salfordsystems.com/jsm2015ctw August 2015 Salford Systems Course Outline Demonstration of two classification
More informationThe More Trees, the Better! Scaling Up Performance Using Random Forest in SAS Enterprise Miner
Paper 33612015 The More Trees, the Better! Scaling Up Performance Using Random Forest in SAS Enterprise Miner Narmada Deve Panneerselvam, Spears School of Business, Oklahoma State University, Stillwater,
More informationHT2015: SC4 Statistical Data Mining and Machine Learning
HT2015: SC4 Statistical Data Mining and Machine Learning Dino Sejdinovic Department of Statistics Oxford http://www.stats.ox.ac.uk/~sejdinov/sdmml.html Bayesian Nonparametrics Parametric vs Nonparametric
More informationKnowledge Discovery and Data Mining
Knowledge Discovery and Data Mining Unit # 11 Sajjad Haider Fall 2013 1 Supervised Learning Process Data Collection/Preparation Data Cleaning Discretization Supervised/Unuspervised Identification of right
More informationIs a Data Scientist the New Quant? Stuart Kozola MathWorks
Is a Data Scientist the New Quant? Stuart Kozola MathWorks 2015 The MathWorks, Inc. 1 Facts or information used usually to calculate, analyze, or plan something Information that is produced or stored by
More informationPredictive Modeling Techniques in Insurance
Predictive Modeling Techniques in Insurance Tuesday May 5, 2015 JF. Breton Application Engineer 2014 The MathWorks, Inc. 1 Opening Presenter: JF. Breton: 13 years of experience in predictive analytics
More informationMapReduce for Machine Learning on Multicore
MapReduce for Machine Learning on Multicore Chu, et al. Problem The world is going multicore New computers  dual core to 12+core Shift to more concurrent programming paradigms and languages Erlang,
More informationSpark in Action. Fast Big Data Analytics using Scala. Matei Zaharia. www.spark project.org. University of California, Berkeley UC BERKELEY
Spark in Action Fast Big Data Analytics using Scala Matei Zaharia University of California, Berkeley www.spark project.org UC BERKELEY My Background Grad student in the AMP Lab at UC Berkeley» 50 person
More informationBig Data and Data Science: Behind the Buzz Words
Big Data and Data Science: Behind the Buzz Words Peggy Brinkmann, FCAS, MAAA Actuary Milliman, Inc. April 1, 2014 Contents Big data: from hype to value Deconstructing data science Managing big data Analyzing
More informationII. Methods  2  X (i.e. if the system is convective or not). Y = 1 X ). Usually, given these estimates, an
STORMS PREDICTION: LOGISTIC REGRESSION VS RANDOM FOREST FOR UNBALANCED DATA Anne RuizGazen Institut de Mathématiques de Toulouse and Gremaq, Université Toulouse I, France Nathalie Villa Institut de Mathématiques
More informationClassification and Regression by randomforest
Vol. 2/3, December 02 18 Classification and Regression by randomforest Andy Liaw and Matthew Wiener Introduction Recently there has been a lot of interest in ensemble learning methods that generate many
More informationTrees and Random Forests
Trees and Random Forests Adele Cutler Professor, Mathematics and Statistics Utah State University This research is partially supported by NIH 1R15AG03739201 Cache Valley, Utah Utah State University Leo
More informationA Performance Analysis of Distributed Indexing using Terrier
A Performance Analysis of Distributed Indexing using Terrier Amaury Couste Jakub Kozłowski William Martin Indexing Indexing Used by search
More informationChapter 6. The stacking ensemble approach
82 This chapter proposes the stacking ensemble approach for combining different data mining classifiers to get better performance. Other combination techniques like voting, bagging etc are also described
More informationTRANSACTIONAL DATA MINING AT LLOYDS BANKING GROUP
TRANSACTIONAL DATA MINING AT LLOYDS BANKING GROUP Csaba Főző csaba.fozo@lloydsbanking.com 15 October 2015 CONTENTS Introduction 04 Random Forest Methodology 06 Transactional Data Mining Project 17 Conclusions
More informationSTA 4273H: Statistical Machine Learning
STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.cs.toronto.edu/~rsalakhu/ Lecture 6 Three Approaches to Classification Construct
More informationSocial Media Mining. Data Mining Essentials
Introduction Data production rate has been increased dramatically (Big Data) and we are able store much more data than before E.g., purchase data, social media data, mobile phone data Businesses and customers
More informationJournée Thématique Big Data 13/03/2015
Journée Thématique Big Data 13/03/2015 1 Agenda About Flaminem What Do We Want To Predict? What Is The Machine Learning Theory Behind It? How Does It Work In Practice? What Is Happening When Data Gets
More informationA General Framework for Mining ConceptDrifting Data Streams with Skewed Distributions
A General Framework for Mining ConceptDrifting Data Streams with Skewed Distributions Jing Gao Wei Fan Jiawei Han Philip S. Yu University of Illinois at UrbanaChampaign IBM T. J. Watson Research Center
More informationThe Artificial Prediction Market
The Artificial Prediction Market Adrian Barbu Department of Statistics Florida State University Joint work with Nathan Lay, Siemens Corporate Research 1 Overview Main Contributions A mathematical theory
More informationChapter 11 Boosting. Xiaogang Su Department of Statistics University of Central Florida  1 
Chapter 11 Boosting Xiaogang Su Department of Statistics University of Central Florida  1  Perturb and Combine (P&C) Methods have been devised to take advantage of the instability of trees to create
More informationCSCI567 Machine Learning (Fall 2014)
CSCI567 Machine Learning (Fall 2014) Drs. Sha & Liu {feisha,yanliu.cs}@usc.edu September 22, 2014 Drs. Sha & Liu ({feisha,yanliu.cs}@usc.edu) CSCI567 Machine Learning (Fall 2014) September 22, 2014 1 /
More informationMicrosoft Azure Machine learning Algorithms
Microsoft Azure Machine learning Algorithms Tomaž KAŠTRUN @tomaz_tsql Tomaz.kastrun@gmail.com http://tomaztsql.wordpress.com Our Sponsors Speaker info https://tomaztsql.wordpress.com Agenda Focus on explanation
More informationData Mining Methods: Applications for Institutional Research
Data Mining Methods: Applications for Institutional Research Nora Galambos, PhD Office of Institutional Research, Planning & Effectiveness Stony Brook University NEAIR Annual Conference Philadelphia 2014
More informationThe Impact of Big Data on Classic Machine Learning Algorithms. Thomas Jensen, Senior Business Analyst @ Expedia
The Impact of Big Data on Classic Machine Learning Algorithms Thomas Jensen, Senior Business Analyst @ Expedia Who am I? Senior Business Analyst @ Expedia Working within the competitive intelligence unit
More informationBringing Big Data Modelling into the Hands of Domain Experts
Bringing Big Data Modelling into the Hands of Domain Experts David Willingham Senior Application Engineer MathWorks david.willingham@mathworks.com.au 2015 The MathWorks, Inc. 1 Data is the sword of the
More informationCross Validation techniques in R: A brief overview of some methods, packages, and functions for assessing prediction models.
Cross Validation techniques in R: A brief overview of some methods, packages, and functions for assessing prediction models. Dr. Jon Starkweather, Research and Statistical Support consultant This month
More informationUsing multiple models: Bagging, Boosting, Ensembles, Forests
Using multiple models: Bagging, Boosting, Ensembles, Forests Bagging Combining predictions from multiple models Different models obtained from bootstrap samples of training data Average predictions or
More informationParallel Programming MapReduce. Needless to Say, We Need Machine Learning for Big Data
Case Study 2: Document Retrieval Parallel Programming MapReduce Machine Learning/Statistics for Big Data CSE599C1/STAT592, University of Washington Carlos Guestrin January 31 st, 2013 Carlos Guestrin
More informationA Study Of Bagging And Boosting Approaches To Develop MetaClassifier
A Study Of Bagging And Boosting Approaches To Develop MetaClassifier G.T. Prasanna Kumari Associate Professor, Dept of Computer Science and Engineering, Gokula Krishna College of Engg, Sullurpet524121,
More informationIntroduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data
Introduction to Hadoop HDFS and Ecosystems ANSHUL MITTAL Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data Topics The goal of this presentation is to give
More informationOpen source software framework designed for storage and processing of large scale data on clusters of commodity hardware
Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware Created by Doug Cutting and Mike Carafella in 2005. Cutting named the program after
More informationLogistic Regression. Vibhav Gogate The University of Texas at Dallas. Some Slides from Carlos Guestrin, Luke Zettlemoyer and Dan Weld.
Logistic Regression Vibhav Gogate The University of Texas at Dallas Some Slides from Carlos Guestrin, Luke Zettlemoyer and Dan Weld. Generative vs. Discriminative Classifiers Want to Learn: h:x Y X features
More informationlargescale machine learning revisited Léon Bottou Microsoft Research (NYC)
largescale machine learning revisited Léon Bottou Microsoft Research (NYC) 1 three frequent ideas in machine learning. independent and identically distributed data This experimental paradigm has driven
More informationBig data in R EPIC 2015
Big data in R EPIC 2015 Big Data: the new 'The Future' In which Forbes magazine finds common ground with Nancy Krieger (for the first time ever?), by arguing the need for theorydriven analysis This future
More informationMapReduce and Distributed Data Analysis. Sergei Vassilvitskii Google Research
MapReduce and Distributed Data Analysis Google Research 1 Dealing With Massive Data 2 2 Dealing With Massive Data Polynomial Memory Sublinear RAM Sketches External Memory Property Testing 3 3 Dealing With
More informationLinear Threshold Units
Linear Threshold Units w x hx (... w n x n w We assume that each feature x j and each weight w j is a real number (we will relax this later) We will study three different algorithms for learning linear
More informationI/O Considerations in Big Data Analytics
Library of Congress I/O Considerations in Big Data Analytics 26 September 2011 Marshall Presser Federal Field CTO EMC, Data Computing Division 1 Paradigms in Big Data Structured (relational) data Very
More informationESS event: Big Data in Official Statistics. Antonino Virgillito, Istat
ESS event: Big Data in Official Statistics Antonino Virgillito, Istat v erbi v is 1 About me Head of Unit Web and BI Technologies, IT Directorate of Istat Project manager and technical coordinator of Web
More informationFraud Detection for Online Retail using Random Forests
Fraud Detection for Online Retail using Random Forests Eric Altendorf, Peter Brende, Josh Daniel, Laurent Lessard Abstract As online commerce becomes more common, fraud is an increasingly important concern.
More informationBig Data and Parallel Work with R
Big Data and Parallel Work with R What We'll Cover Data Limits in R Optional Data packages Optional Function packages Going parallel Deciding what to do Data Limits in R Big Data? What is big data? More
More informationSpark. Fast, Interactive, Language Integrated Cluster Computing
Spark Fast, Interactive, Language Integrated Cluster Computing Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael Franklin, Scott Shenker, Ion Stoica UC
More informationAnalysis of MapReduce Algorithms
Analysis of MapReduce Algorithms Harini Padmanaban Computer Science Department San Jose State University San Jose, CA 95192 4089241000 harini.gomadam@gmail.com ABSTRACT MapReduce is a programming model
More informationBasics of Statistical Machine Learning
CS761 Spring 2013 Advanced Machine Learning Basics of Statistical Machine Learning Lecturer: Xiaojin Zhu jerryzhu@cs.wisc.edu Modern machine learning is rooted in statistics. You will find many familiar
More informationSEIZE THE DATA. 2015 SEIZE THE DATA. 2015
1 Copyright 2015 HewlettPackard Development Company, L.P. The information contained herein is subject to change without notice. BIG DATA CONFERENCE 2015 Boston August 1013 Predicting and reducing deforestation
More informationA Scalable Bootstrap for Massive Data
A Scalable Bootstrap for Massive Data arxiv:2.56v2 [stat.me] 28 Jun 22 Ariel Kleiner Department of Electrical Engineering and Computer Science University of California, Bereley aleiner@eecs.bereley.edu
More informationIntroduction to Hadoop and MapReduce
Introduction to Hadoop and MapReduce THE CONTRACTOR IS ACTING UNDER A FRAMEWORK CONTRACT CONCLUDED WITH THE COMMISSION Largescale Computation Traditional solutions for computing large quantities of data
More informationData Mining Classification: Decision Trees
Data Mining Classification: Decision Trees Classification Decision Trees: what they are and how they work Hunt s (TDIDT) algorithm How to select the best split How to handle Inconsistent data Continuous
More informationPATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION
PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION Introduction In the previous chapter, we explored a class of regression models having particularly simple analytical
More informationScalable Data Analysis in R. Lee E. Edlefsen Chief Scientist UserR! 2011
Scalable Data Analysis in R Lee E. Edlefsen Chief Scientist UserR! 2011 1 Introduction Our ability to collect and store data has rapidly been outpacing our ability to analyze it We need scalable data analysis
More informationA Learning Algorithm For Neural Network Ensembles
A Learning Algorithm For Neural Network Ensembles H. D. Navone, P. M. Granitto, P. F. Verdes and H. A. Ceccatto Instituto de Física Rosario (CONICETUNR) Blvd. 27 de Febrero 210 Bis, 2000 Rosario. República
More informationCS570 Data Mining Classification: Ensemble Methods
CS570 Data Mining Classification: Ensemble Methods Cengiz Günay Dept. Math & CS, Emory University Fall 2013 Some slides courtesy of HanKamberPei, Tan et al., and Li Xiong Günay (Emory) Classification:
More information2015 The MathWorks, Inc. 1
25 The MathWorks, Inc. 빅 데이터 및 다양한 데이터 처리 위한 MATLAB의 인터페이스 환경 및 새로운 기능 엄준상 대리 Application Engineer MathWorks 25 The MathWorks, Inc. 2 Challenges of Data Any collection of data sets so large and complex
More informationA crash course in probability and Naïve Bayes classification
Probability theory A crash course in probability and Naïve Bayes classification Chapter 9 Random variable: a variable whose possible values are numerical outcomes of a random phenomenon. s: A person s
More informationAdvanced analytics at your hands
2.3 Advanced analytics at your hands Neural Designer is the most powerful predictive analytics software. It uses innovative neural networks techniques to provide data scientists with results in a way previously
More informationMachine Learning with MATLAB David Willingham Application Engineer
Machine Learning with MATLAB David Willingham Application Engineer 2014 The MathWorks, Inc. 1 Goals Overview of machine learning Machine learning models & techniques available in MATLAB Streamlining the
More informationMultiple Linear Regression in Data Mining
Multiple Linear Regression in Data Mining Contents 2.1. A Review of Multiple Linear Regression 2.2. Illustration of the Regression Process 2.3. Subset Selection in Linear Regression 1 2 Chap. 2 Multiple
More informationProbabilistic Models for Big Data. Alex Davies and Roger Frigola University of Cambridge 13th February 2014
Probabilistic Models for Big Data Alex Davies and Roger Frigola University of Cambridge 13th February 2014 The State of Big Data Why probabilistic models for Big Data? 1. If you don t have to worry about
More informationIntroduction to Hadoop
Introduction to Hadoop Miles Osborne School of Informatics University of Edinburgh miles@inf.ed.ac.uk October 28, 2010 Miles Osborne Introduction to Hadoop 1 Background Hadoop Programming Model Examples
More informationBenchmarking OpenSource Tree Learners in R/RWeka
Benchmarking OpenSource Tree Learners in R/RWeka Michael Schauerhuber 1, Achim Zeileis 1, David Meyer 2, Kurt Hornik 1 Department of Statistics and Mathematics 1 Institute for Management Information Systems
More informationSupervised Learning (Big Data Analytics)
Supervised Learning (Big Data Analytics) Vibhav Gogate Department of Computer Science The University of Texas at Dallas Practical advice Goal of Big Data Analytics Uncover patterns in Data. Can be used
More informationBayesian networks  Timeseries models  Apache Spark & Scala
Bayesian networks  Timeseries models  Apache Spark & Scala Dr John Sandiford, CTO Bayes Server Data Science London Meetup  November 2014 1 Contents Introduction Bayesian networks Latent variables Anomaly
More informationDecompose Error Rate into components, some of which can be measured on unlabeled data
BiasVariance Theory Decompose Error Rate into components, some of which can be measured on unlabeled data BiasVariance Decomposition for Regression BiasVariance Decomposition for Classification BiasVariance
More informationMachine Learning using MapReduce
Machine Learning using MapReduce What is Machine Learning Machine learning is a subfield of artificial intelligence concerned with techniques that allow computers to improve their outputs based on previous
More informationThe Operational Value of Social Media Information. Social Media and Customer Interaction
The Operational Value of Social Media Information Dennis J. Zhang (Kellogg School of Management) Ruomeng Cui (Kelley School of Business) Santiago Gallino (Tuck School of Business) Antonio MorenoGarcia
More informationOPINION MINING IN PRODUCT REVIEW SYSTEM USING BIG DATA TECHNOLOGY HADOOP
OPINION MINING IN PRODUCT REVIEW SYSTEM USING BIG DATA TECHNOLOGY HADOOP 1 KALYANKUMAR B WADDAR, 2 K SRINIVASA 1 P G Student, S.I.T Tumkur, 2 Assistant Professor S.I.T Tumkur Abstract Product Review System
More informationOn the effect of data set size on bias and variance in classification learning
On the effect of data set size on bias and variance in classification learning Abstract Damien Brain Geoffrey I Webb School of Computing and Mathematics Deakin University Geelong Vic 3217 With the advent
More informationPenalized regression: Introduction
Penalized regression: Introduction Patrick Breheny August 30 Patrick Breheny BST 764: Applied Statistical Modeling 1/19 Maximum likelihood Much of 20thcentury statistics dealt with maximum likelihood
More informationOutline. High Performance Computing (HPC) Big Data meets HPC. Case Studies: Some facts about Big Data Technologies HPC and Big Data converging
Outline High Performance Computing (HPC) Towards exascale computing: a brief history Challenges in the exascale era Big Data meets HPC Some facts about Big Data Technologies HPC and Big Data converging
More information