Data Science Using Open Source Tools: Decision Trees and Random Forests Using R
1 Data Science Using Open Source Tools: Decision Trees and Random Forests Using R. Jennifer Evans, ClickFox, [email protected]. January 14, 2014. Twitter: JenniferE CF
2 Text questions to the Twitter account JenniferE CF.
3 Example Code in R. All the R code is hosted online (see the rcode page) and includes additional code examples.
4 Overview: 1 Data Science, a Brief Overview; 2 Data Science at ClickFox; 3 Data Preparation; 4 Algorithms (Decision Trees, Knowing Your Algorithm, Example Code in R); 5 Evaluation (Evaluating the Model, Evaluating the Business Questions); 6 Kaggle and Random Forest; 7 Visualization; 8 Recommended Reading
5 Data Science, a Brief Overview
6 What is Data Science? The meticulous process of iterative testing, proving, revising, retesting, resolving, redoing, programming (because you got smart here and thought: automate), debugging, recoding, debugging, tracing, more debugging, documenting (maybe should have started here...), analyzing results, some tweaking, some researching, some hacking, and starting over.
7 Data Science at ClickFox
8 Data Science at ClickFox. Software Development: actively engaged in development of product capabilities in the ClickFox Experience Analytics Platform (CEA). Client-Specific Analytics: engagements in client-specific projects. Force Multipliers: focus on enabling everyone to be more effective at using data to make decisions.
9 Will it Rain Tomorrow?
10 Data Preparation
11 Receive the Data: raw data.
12 Data Munging: begin creating the analytic data set.
13 Data Munging: data munging and metadata creation.
14 Data Preparation: checking that data quality has been preserved.
15 Bad Data. Types of bad data: missing, unknown, or non-existent; inaccurate, invalid, or inconsistent (false records or wrong information); corrupt (wrong character encoding); poorly interpreted, often because of a lack of context; polluted (too much data, so you overlook what is important).
16 Bad Data. A lot can go wrong in the data collection process, the data storage process, and the data analysis process. Examples: the nephew and the movie survey; protection troops flooded with information overlooked that the group gathering nearby was women and children, i.e. civilians; manufacturing with acceptable variance, but every so often the measurement machine was bumped, causing mismeasurements; chemists who were meticulous about data collection but inconsistent with data storage, using flat files and spreadsheets with no central data center, so the database grew over time (e.g. threshold limits listed both as zero and as less than some threshold number).
17 Bad Data: a parrot helping you write code...
18 Not to mention all the things that we can do to really screw things up.
19 "The combination of some data and an aching desire for an answer does not ensure that a reasonable answer can be extracted from a given body of data." (John Tukey)
20 Final Analytic Data Set
21 Will it Rain Tomorrow? Example (Variables):
> names(ds)
 [1] "Date" "Location" "MinTemp" "MaxTemp"
 [5] "Rainfall" "Evaporation" "Sunshine" "WindGustDir"
 [9] "WindGustSpeed" "WindDir9am" "WindDir3pm" "WindSpeed9am"
[13] "WindSpeed3pm" "Humidity9am" "Humidity3pm" "Pressure9am"
[17] "Pressure3pm" "Cloud9am" "Cloud3pm" "Temp9am"
[21] "Temp3pm" "RainToday" "RISK_MM" "RainTomorrow"
22 Will it Rain Tomorrow? Example (First Four Rows of Data): the first four rows of the Canberra weather data across all 24 columns (Location Canberra; WindGustDir NW, ENE, NW, NW; RainToday No, Yes, Yes, Yes; RISK_MM 3.6, 3.6, 39.8, 2.8; RainTomorrow Yes in each row).
23 Data Checking. Make sure that the values make sense in the context of the field: dates are in the date field; a measurement field has numerical values; counts of occurrences should be zero or greater.
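These checks are easy to script. A minimal sketch, assuming the weather data is loaded as ds with the columns listed on slide 21; the specific rules (Date parsed as a Date class, cloud cover on a 0-8 okta scale) are illustrative assumptions, not from the slides:
# Basic sanity checks on the analytic data set
stopifnot(inherits(ds$Date, "Date"))                               # dates are in the date field
stopifnot(is.numeric(ds$Rainfall), is.numeric(ds$MinTemp))         # measurements are numeric
stopifnot(all(ds$Rainfall >= 0, na.rm = TRUE))                     # amounts/counts are zero or greater
stopifnot(all(ds$Cloud9am >= 0 & ds$Cloud9am <= 8, na.rm = TRUE))  # assumed 0-8 okta scale
sapply(ds[c("MinTemp", "MaxTemp", "Rainfall")], range, na.rm = TRUE)  # eyeball the ranges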
24 head(ds): the first six rows of the data set, with the same 24 columns (Location Canberra; WindGustDir NW, ENE, NW, NW, SSE, SE; RainToday No, Yes, Yes, Yes, Yes, No; RainTomorrow Yes, Yes, Yes, Yes, No, No).
25 Data Checking. There are numeric and categorical variables.
26 Data Checking. Check the max/min: do they make sense? What are the ranges? Do the numerical values need to be normalized?
27 summary(ds): per-variable summaries of the data set; numeric variables get Min, 1st Qu., Median, Mean, 3rd Qu., and Max, factors such as Location, WindGustDir, WindDir9am, and WindDir3pm get level counts, and variables with missing values show NA counts (Sunshine and WindSpeed9am among them).
28 Data Checking. Plot variables against one another.
29 R Code Example (Scatterplot):
pairs(~ MinTemp + MaxTemp + Rainfall + Evaporation, data = ds, main = "Simple Scatterplot Matrix")
30 Figure: simple scatterplot matrix of MinTemp, MaxTemp, Rainfall, and Evaporation.
31 Figure: simple scatterplot matrix of Humidity3pm, Pressure9am, Sunshine, and RainTomorrow.
32 Figure: simple scatterplot matrix of WindDir9am, WindGustSpeed, Temp9am, and Cloud3pm.
33 Data Checking. Create a histogram of the numerical values in a data field, or a kernel density estimate.
34 R Code Example (Histogram):
histogram(ds$MinTemp, breaks=20, col="blue")
35 Figure: histogram of MinTemp (percent of total versus ds$MinTemp).
36 R Code Example (Kernel Density Plot):
plot(density(ds$MinTemp))
37 Figure: kernel density plot, density.default(x = ds$MinTemp), N = 366.
38 Data Checking. Kernel density plots for all numerical variables.
39 Figure: kernel density plot, density.default(x = ds$MinTemp), N = 366.
40 Figure: kernel density plot, density.default(x = ds$MaxTemp), N = 366.
41 Figure: kernel density plot, density.default(x = ds$Rainfall), N = 366.
42 Figure: kernel density plot, density.default(x = ds$Evaporation), N = 366.
43 Figure: kernel density plot, density.default(x = ds.complete$Sunshine), N = 328.
44 Figure: kernel density plot, density.default(x = ds.complete$WindSpeed9am), N = 328.
45 Figure: kernel density plot, density.default(x = ds$WindSpeed3pm), N = 366.
46 Figure: kernel density plot, density.default(x = ds$Humidity9am), N = 366.
47 Figure: kernel density plot, density.default(x = ds$MinTemp), N = 366.
48 Figure: kernel density plot, density.default(x = ds$Humidity3pm), N = 366.
49 Figure: kernel density plot, density.default(x = ds$Pressure9am), N = 366.
50 Figure: kernel density plot, density.default(x = ds$Pressure3pm), N = 366.
51 Figure: kernel density plot, density.default(x = ds$Cloud9am), N = 366.
52 Figure: kernel density plot, density.default(x = ds$Cloud3pm), N = 366.
53 Figure: kernel density plot, density.default(x = ds$Temp9am), N = 366.
54 Figure: kernel density plot, density.default(x = ds$Temp3pm), N = 366.
55 Data Checking. There are missing values in Sunshine and WindSpeed9am.
56 Data Checking: Missing and Incomplete. A common pitfall is to assume that you are working with data that is correct and complete. Usually a round of simple checks will reveal any problems, such as counting records, aggregating totals, and plotting and comparing to known quantities.
57 Data Checking. Watch for spillover of time-bound data. Check for duplicates: do not expect that the data is perfectly partitioned.
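These checks are also one-liners in R. A minimal sketch, assuming ds is the weather data frame and that Date plus Location should uniquely identify a record (that key is an illustrative assumption, not from the slides):
colSums(is.na(ds))                          # missing values per column (e.g. Sunshine, WindSpeed9am)
sum(duplicated(ds))                         # exact duplicate rows
sum(duplicated(ds[c("Date", "Location")]))  # duplicate keys, i.e. imperfect partitioning
range(ds$Date)                              # spillover outside the expected time window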
58 Algorithms
59 "All models are wrong, but some are useful." (George Box)
60 Movie Selection Explanation: the difference between decision trees and random forests.
61 Movie Selection Explanation. Willow is a decision tree.
62 Movie Selection Explanation. Willow does not generalize well, so you want to ask a few more friends.
63 Random Friend: Rainbow Dash
64 Random Friend: Cartman
65 Random Friend: Stay Puff Marshmallow
66 Random Friend: Professor Cat
67 Movie Selection Explanation. Your friends are an ensemble of decision trees. But you don't want them all having the same information and giving the same answer.
68 Good and Bad Predictors. Willow thinks you like vampire movies more than you do; Stay Puff thinks you like candy; Rainbow Dash thinks you can fly; Cartman thinks you just hate everything; Professor Cat wants a cheeseburger.
69 Movie Selection Explanation. Thus, your friends now form a bagged (bootstrap aggregated) forest of your movie preferences.
70 Movie Selection Explanation. There is still one problem with your data. You don't want all your friends asking the same questions and basing their decisions on whether a movie is scary or not. So when each friend asks a question, only a random subset of the possible questions is allowed: about the square root of the number of variables.
71 Conclusion. A random forest is just an ensemble of decision trees: really bad, over-fit beasts. A whole lot of trees that really have no idea about what is going on, but we let them vote anyway. Their votes all cancel each other out.
72 Random Forest Voting. Theorem (Bad Predictors Cancel Out): Willow + Cartman + StayPuff + ProfCat + Rainbowdash = AccuratePrediction
73 Boosting and Bagging Techniques. Bagging decision trees, an early ensemble method, builds multiple decision trees by repeatedly resampling the training data with replacement and voting the trees for a consensus prediction.
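To make the resample-and-vote recipe concrete, here is a minimal hand-rolled sketch using rpart on the weather data. It is an illustration only, not code from the talk; it assumes the objects form, ds, vars, train, and test defined later on slides 87-88, and the number of trees is arbitrary.
library(rpart)
set.seed(42)
m <- 25                                     # number of bagged trees (illustrative)
trees <- lapply(seq_len(m), function(i) {
  boot <- sample(train, replace = TRUE)     # bootstrap resample of the training rows
  rpart(form, data = ds[boot, vars], method = "class")
})
votes <- sapply(trees, function(tr)
  as.character(predict(tr, newdata = ds[test, vars], type = "class")))
bagged.pred <- apply(votes, 1, function(v) names(which.max(table(v))))   # majority vote across trees
table(bagged.pred, ds[test, target])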
74 Decision Trees
75 There are a lot of tree algorithm choices in R.
76 Trees in R: rpart (CART); tree (CART); ctree (conditional inference trees); CHAID (chi-squared automatic interaction detection); evtree (evolutionary algorithm); mvpart (multivariate CART); knnTree (nearest-neighbor-based trees); RWeka (J4.8, M5', LMT); LogicReg (logic regression); BayesTree; TWIX (trees with extra splits); party (conditional inference trees, model-based trees).
77 There are a lot of forest algorithm choices in R.
78 Forests in R: randomForest (CART-based random forests); randomSurvivalForest (for censored responses); party (conditional random forests); gbm (tree-based gradient boosting); mboost (model-based and tree-based gradient boosting).
79 There are a lot of other ensemble methods and useful packages in R.
80 Other Useful R Packages:
library(rattle)        # Fancy tree plot, nice graphical interface
library(rpart.plot)    # Enhanced tree plots
library(RColorBrewer)  # Color selection for fancy tree plot
library(party)         # Alternative decision tree algorithm
library(partykit)      # Convert rpart object to BinaryTree
library(doParallel)
library(caret)
library(ROCR)
library(Metrics)
library(GA)            # Genetic algorithm, the most popular EA
81 R Code Example (Useful Commands):
# Summary functions
dim(ds)
head(ds)
tail(ds)
summary(ds)
str(ds)

# List functions in package party
ls("package:party")

# Save plots as PDF
pdf("plot.pdf")
fancyRpartPlot(model)
dev.off()
82 Knowing your Algorithm
83 Classification and Regression Trees. Choose the best split from among the candidate set: rank order each splitting rule on the basis of some quality-of-split criterion (purity function). The most frequently used ones are: entropy reduction (nominal/binary targets), Gini index (nominal/binary targets), chi-square tests (nominal/binary targets), F-test (interval targets), and variance reduction (interval targets).
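As a reminder of what those purity functions measure, here is a small sketch (not from the talk) computing the Gini index and entropy of a node from its class proportions:
# Node impurity from a vector of class labels
gini <- function(y) {
  p <- prop.table(table(y))
  1 - sum(p^2)
}
entropy <- function(y) {
  p <- prop.table(table(y))
  p <- p[p > 0]
  -sum(p * log2(p))
}
gini(ds$RainTomorrow)      # impurity of the root node
entropy(ds$RainTomorrow)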
84 CART: Locally Optimal Trees. CART commonly uses a greedy heuristic, where split rules are selected in a forward stepwise search. The split rule at each internal node is selected to maximize the homogeneity of only its child nodes.
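The greedy step can be illustrated in a few lines: for a single numeric predictor, scan candidate thresholds and keep the one that most reduces the weighted Gini impurity of the two children. This is a toy sketch of the idea, reusing the gini() helper above; it is not the rpart implementation.
best.split <- function(x, y) {
  cuts <- sort(unique(x))
  cuts <- (head(cuts, -1) + tail(cuts, -1)) / 2          # midpoints between observed values
  impurity <- sapply(cuts, function(c) {
    left <- y[x < c]; right <- y[x >= c]
    (length(left) * gini(left) + length(right) * gini(right)) / length(y)  # weighted child impurity
  })
  c(cut = cuts[which.min(impurity)], gini = min(impurity))
}
ok <- !is.na(ds$Humidity3pm)
best.split(ds$Humidity3pm[ok], ds$RainTomorrow[ok])      # compare with the tree's Humidity3pm < 71 split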
85 Example Code in R
86 Example Code in R. Example (R Packages Used for Example Code):
library(rpart)         # Popular decision tree algorithm
library(rattle)        # Fancy tree plot, nice graphical interface
library(rpart.plot)    # Enhanced tree plots
library(RColorBrewer)  # Color selection for fancy tree plot
library(party)         # Alternative decision tree algorithm
library(partykit)      # Convert rpart object to BinaryTree
library(RWeka)         # Weka decision tree J48
library(evtree)        # Evolutionary algorithm, builds the tree from the bottom up
library(randomForest)
library(doParallel)
library(CHAID)         # Chi-squared automatic interaction detection tree
library(tree)
library(caret)
87 R Code Example (Data Prep):
data(weather)
dsname <- "weather"
target <- "RainTomorrow"
risk   <- "RISK_MM"
ds     <- get(dsname)
vars   <- colnames(ds)
(ignore <- vars[c(1, 2, if (exists("risk")) which(risk == vars))])
# names(ds)[1] == "Date"
# names(ds)[2] == "Location"
88 R Code Example (Data Prep):
vars <- setdiff(vars, ignore)
(inputs <- setdiff(vars, target))
(nobs <- nrow(ds))
dim(ds[vars])

(form <- formula(paste(target, "~ .")))
set.seed(1426)
length(train <- sample(nobs, 0.7*nobs))
length(test  <- setdiff(seq_len(nobs), train))
89 Note. It is okay to split the data set like this if the outcome of interest is not rare. If the outcome of interest occurs in only a small fraction of cases, use a different technique so that 30% or so of the cases with the outcome are in the training set.
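One such technique is a stratified split, which keeps the class proportions the same in train and test. A minimal sketch using caret (caret is loaded above; createDataPartition is a standard caret function, but this particular usage is an illustration rather than code from the talk):
set.seed(1426)
idx <- caret::createDataPartition(ds[[target]], p = 0.7, list = FALSE)   # stratified on RainTomorrow
train.strat <- as.integer(idx)
test.strat  <- setdiff(seq_len(nobs), train.strat)
prop.table(table(ds[[target]][train.strat]))   # class mix preserved in the training set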
90 Example Code in R. Example (rpart Tree):
model <- rpart(formula=form, data=ds[train, vars])
91 Note. The default parameter for predict is na.action = na.pass. If there are NAs in the data set, rpart will use surrogate splits.
92 Example Code in R. Example (rpart Tree Object):
print(model)
summary(model)
93 print(model): n = 256; the listing format is node), split, n, loss, yval, (yprob), with * denoting a terminal node. The tree: 1) root, No; 2) Humidity3pm < 71, No; 4) Pressure3pm >= threshold, No; 5) Pressure3pm < threshold, No; 10) Sunshine >= threshold, No; 11) Sunshine < threshold, Yes; 3) Humidity3pm >= 71, Yes. The thresholds appear on the plots that follow (Pressure3pm about 1010, Sunshine about 9.95).
94 summary(model): the rpart call, the CP table (CP, nsplit, rel error, xerror, xstd), the variable importance ranking (Humidity3pm, Sunshine, Pressure3pm, Temp9am, Pressure9am, Temp3pm, Cloud3pm, MaxTemp, MinTemp), and node details. Node 1 (256 observations, predicted class No) has primary splits on Humidity3pm < 71, Pressure3pm, Cloud3pm, Sunshine, and Pressure9am, and surrogate splits on Sunshine, Pressure3pm, Temp3pm, and Pressure9am; node 2 has 238 observations.
95 Example Code in R. Example (rpart Tree Object):
printcp(model)   # printcp for rpart objects
plotcp(model)
96 plotcp(model). Figure: cross-validated relative error (X-val Relative Error) against the cp value and the size of the tree.
97 Example Code in R. Example (rpart Tree Object):
plot(model)
text(model)
98 plot(model); text(model). Figure: the tree with splits Humidity3pm < 71, Pressure3pm >= 1010, and Sunshine >= 9.95, and leaves labeled No and Yes.
99 Example Code in R. Example (rpart Tree Object):
fancyRpartPlot(model)
100 fancyRpartPlot(model). Figure: the same tree drawn with rattle's fancyRpartPlot, with node class labels and percentages (Rattle, 2014 Jan 02 11:59:47, jevans).
101 Example Code in R. Example (rpart Tree Object):
prp(model)
prp(model, type=2, extra=104, nn=TRUE, fallen.leaves=TRUE, faclen=0, varlen=0,
    shadow.col="grey", branch.lty=3)
102 prp(model). Figure: the tree with abbreviated variable names (Humidity < 71, Pressure >= 1010, Sunshine >= 9.9).
103 prp(model, type=2, extra=104, nn=TRUE, fallen.leaves=TRUE, faclen=0, varlen=0, shadow.col="grey", branch.lty=3). Figure: the same tree with node numbers, class percentages, and fallen leaves.
104 Example Code in R. Example (rpart Tree Predictions):
pred      <- predict(model, newdata=ds[test, vars], type="class")
pred.prob <- predict(model, newdata=ds[test, vars], type="prob")
105 Example Code in R. Example (NA values and pruning):
table(is.na(ds))
ds.complete <- ds[complete.cases(ds),]
(nobs <- nrow(ds.complete))
set.seed(1426)
length(train.complete <- sample(nobs, 0.7*nobs))
length(test.complete  <- setdiff(seq_len(nobs), train.complete))

# Prune tree
model$cptable[which.min(model$cptable[,"xerror"]), "CP"]
model <- rpart(formula=form, data=ds[train.complete, vars], cp=0)
printcp(model)
prune <- prune(model, cp=.01)
printcp(prune)
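Rather than hard-coding cp = 0.01, the two pruning lines above can be combined so the tree is pruned at the cp value that minimizes the cross-validated error. A small sketch of that pattern, using the same objects as above:
best.cp <- model$cptable[which.min(model$cptable[, "xerror"]), "CP"]
pruned  <- prune(model, cp = best.cp)   # prune at the cp with the lowest xerror
printcp(pruned)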
106 Example Code in R. Example (Random Forest):
# Random Forest from library(randomForest)
table(is.na(ds))
table(is.na(ds.complete))

# subset(ds, select=-c(Humidity3pm, Humidity9am, Cloud9am, Cloud3pm))
setnum <- colnames(ds.complete)[16:19]
ds.complete[,setnum] <- lapply(ds.complete[,setnum], function(x) as.numeric(x))

ds.complete$Humidity3pm <- as.numeric(ds.complete$Humidity3pm)
ds.complete$Humidity9am <- as.numeric(ds.complete$Humidity9am)
107 Note. Variables in the randomForest algorithm must be either factor or numeric, and factors cannot have more than 32 levels.
108 Example Code in R. Example (Random Forest):
begTime <- Sys.time()
set.seed(1426)
model <- randomForest(formula=form, data=ds.complete[train.complete, vars])
runTime <- Sys.time() - begTime
runTime
# Time difference of secs
109 Note. NA values must be imputed, removed, or otherwise fixed.
110 Random Forest: Bagging. Given a standard training set D of size n, bagging generates m new training sets D_i, each of size n', by sampling from D uniformly and with replacement. By sampling with replacement, some observations may be repeated in each D_i. If n' = n, then for large n the set D_i is expected to contain the fraction 1 - 1/e (about 63.2%) of the unique examples of D, the rest being duplicates.
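A quick simulation (not from the talk) confirms that roughly 63.2% of the original observations appear in a bootstrap sample of the same size:
set.seed(1)
n <- 10000
mean(replicate(100, length(unique(sample(n, n, replace = TRUE))) / n))  # about 0.632
1 - exp(-1)                                                             # 1 - 1/e = 0.6321...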
111 Random Forest. Sampling with replacement (the default) versus sampling without replacement (with sample size equal to the 1 - 1/e = 0.632 fraction).
112 Example Code in R. Example (Random Forest, sampling without replacement):
begTime <- Sys.time()
set.seed(1426)
model <- randomForest(formula=form, data=ds.complete[train, vars],
                      ntree=500, replace=FALSE, sampsize=.632*.7*nrow(ds),
                      na.action=na.omit)
runTime <- Sys.time() - begTime
runTime
# Time difference of secs
113 print(model): Call: randomForest(formula = form, data = ds.complete[train, vars], ntree = 500, replace = FALSE, sampsize = 0.632 * 0.7 * nrow(ds), na.action = na.omit). Type of random forest: classification. Number of trees: 500. No. of variables tried at each split: 4. OOB estimate of error rate: 11.35%, followed by the No/Yes confusion matrix with class errors.
114 summary(model): the components of the randomForest object with their lengths, classes, and modes: call, type, predicted (229 values), err.rate, confusion, votes (a 229 x 2 matrix), oob.times, classes, importance, importanceSD, localImportance, proximity, ntree, mtry, forest, y, test, inbag, and terms.
115 str(model): the structure of the randomForest object, a list of 19 components, including type ("classification"), predicted (a factor with levels No and Yes), err.rate (a 500 x 3 matrix with columns OOB, No, Yes), confusion, votes (229 x 2), oob.times, classes, importance (20 x 1, MeanDecreaseGini), ntree = 500, mtry = 4, and the forest itself.
116 importance(model): MeanDecreaseGini for each of the 20 predictors (MinTemp, MaxTemp, Rainfall, Evaporation, Sunshine, WindGustDir, WindGustSpeed, WindDir9am, WindDir3pm, WindSpeed9am, WindSpeed3pm, Humidity9am, Humidity3pm, Pressure9am, Pressure3pm, Cloud9am, Cloud3pm, Temp9am, Temp3pm, RainToday).
117 Example Code in R. Example (Random Forest, predictions):
pred <- predict(model, newdata=ds.complete[test.complete, vars])
118 Note. Random Forest in parallel.
119 Example Code in R. Example (Random Forest in parallel):
# Random Forest in parallel
library(doParallel)
ntree = 500; numCore = 4
rep <- 125   # ntree / numCore
registerDoParallel(cores=numCore)
begTime <- Sys.time()
set.seed(1426)
rf <- foreach(ntree=rep(rep, numCore), .combine=combine,
              .packages="randomForest") %dopar%
  randomForest(formula=form, data=ds.complete[train.complete, vars],
               ntree=ntree,
               mtry=6,
               importance=TRUE,
               na.action=na.roughfix,   # can also use na.action = na.omit
               replace=FALSE)
runTime <- Sys.time() - begTime
runTime
# Time difference of secs
120 Note. mtry in model is 4, mtry in rf is 6; length(vars) is 24.
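The difference comes from randomForest's default: for classification, mtry defaults to the square root of the number of predictors, rounded down, while the parallel run above sets mtry = 6 explicitly. With the 20 predictors listed by importance(model), a quick check (illustrative, not from the slides):
floor(sqrt(20))   # 4, the default mtry for a classification forest with 20 predictors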
121 importance(model): MeanDecreaseGini for each of the 20 predictors, as on slide 116.
122 importance(rf): for the parallel forest fitted with importance=TRUE, four columns per predictor: No, Yes, MeanDecreaseAccuracy, and MeanDecreaseGini.
123 Example Code in R. Example (Random Forest):
pred <- predict(rf, newdata=ds.complete[test.complete, vars])
confusionMatrix(pred, ds.complete[test.complete, target])
124 confusionMatrix(pred, ds.complete[test.complete, target]): Confusion Matrix and Statistics: the prediction-versus-reference table, plus Accuracy (with 95% CI), No Information Rate, P-Value [Acc > NIR], Kappa, McNemar's Test P-Value, Sensitivity, Specificity, Pos Pred Value, Neg Pred Value, Prevalence, Detection Rate, Detection Prevalence, and Positive Class: No.
125 Example Code in R. Example (Random Forest):
# Factor levels
id <- which(!(ds$var.name %in% levels(ds$var.name)))
ds$var.name[id] <- NA
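A common use of this pattern is scoring new data whose factor columns contain values the forest never saw during training, since randomForest's predict will reject unseen levels. A hedged sketch of one way to handle it; newdata, the WindGustDir column, and the use of na.roughfix are illustrative choices, not code from the talk:
train.levels <- levels(ds.complete$WindGustDir)                      # levels the model was trained on
bad <- !(newdata$WindGustDir %in% train.levels)                      # values the forest has never seen
newdata$WindGustDir[bad] <- NA                                       # blank them out
newdata$WindGustDir <- factor(newdata$WindGustDir, levels = train.levels)  # re-align the level set
newdata <- na.roughfix(newdata)                                      # simple imputation for the introduced NAs
pred.new <- predict(rf, newdata = newdata)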
126 DANGER!! How do you draw a random forest?
127 Random Forest Visualization
128 Evaluating the Model
129 Evaluating the Model. Methods and metrics to evaluate model performance: 1 Resubstitution estimate (internal estimate, biased); 2 Confusion matrix; 3 ROC; 4 Test sample estimation (independent estimate); 5 V-fold and N-fold cross-validation (resampling techniques); 6 RMSLE, library(Metrics); 7 Lift. A small worked sketch of the confusion-matrix metrics follows.
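As a reminder of what the confusion-matrix statistics reported on the following slides actually are, here is a minimal sketch computing accuracy, sensitivity, and specificity by hand. It reuses pred and the test split defined earlier and treats Yes (rain) as the positive class, unlike caret's default of No; it is an illustration, not code from the talk.
mc <- table(Predicted = pred, Actual = ds.complete[test.complete, target])
accuracy    <- sum(diag(mc)) / sum(mc)
sensitivity <- mc["Yes", "Yes"] / sum(mc[, "Yes"])   # true positives / actual positives
specificity <- mc["No",  "No"]  / sum(mc[, "No"])    # true negatives / actual negatives
c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity)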
130 Example Code in R. Example (ctree in package party):
# Conditional inference tree
model <- ctree(formula=form, data=ds[train, vars])
131 ctree: plot(model). Figure: the conditional inference tree, splitting first on Sunshine <= 6.4 and then on Pressure3pm within each branch, with terminal-node bar plots of Yes/No (node 3, n = 29; node 4, n = 37; node 6, n = 25; node 7, n = 165).
132 print(model). Model formula: RainTomorrow ~ MinTemp + MaxTemp + Rainfall + Evaporation + Sunshine + WindGustDir + WindGustSpeed + WindDir9am + WindDir3pm + WindSpeed9am + WindSpeed3pm + Humidity9am + Humidity3pm + Pressure9am + Pressure3pm + Cloud9am + Cloud3pm + Temp9am + Temp3pm + RainToday. Fitted party: [1] root; [2] Sunshine <= 6.4; [3] Pressure3pm <= threshold: Yes (n = 29, err = 24.1%); [4] Pressure3pm > threshold: No (n = 36, err = 8.3%); [5] Sunshine > 6.4; [6] Cloud3pm <= 6; [7] Pressure3pm <= threshold: No (n = 18, err = 22.2%); [8] Pressure3pm > threshold: No (n = 147, err = 1.4%); [9] Cloud3pm > 6: No (n = 26, err = 26.9%). Number of inner nodes: 4; number of terminal nodes: 5.
133 Difference between ctree and rpart. Both rpart and ctree recursively perform univariate splits of the dependent variable based on values of a set of covariates. rpart employs information measures (such as the Gini coefficient) for selecting the current covariate; ctree uses a significance-test procedure to select variables instead of selecting the variable that maximizes an information measure. This may avoid some selection bias.
134 Example Code in R. Example (ctree in package party):
# For class predictions:
library(caret)
pred <- predict(model, newdata=ds[test, vars])
confusionMatrix(pred, ds[test, target])
mc <- table(pred, ds[test, target])
err <- 1 - (mc[1,1] + mc[2,2]) / sum(mc)   # resubstitution error rate
135 ctree confusion matrix: Confusion Matrix and Statistics: the prediction-versus-reference table (the Yes row shows 8 and 12), plus Accuracy (with 95% CI), No Information Rate, P-Value [Acc > NIR], Kappa, McNemar's Test P-Value, Sensitivity, Specificity, Pos Pred Value, Neg Pred Value, Prevalence, Detection Rate, Detection Prevalence, and Positive Class: No.
136 Example Code in R. Example (ctree in package party):
# For class probabilities:
pred.prob <- predict(model, newdata=ds[test, vars], type="prob")
137 ctree: summary(pred) gives the No/Yes counts of the predicted classes, and summary(pred.prob) gives Min, 1st Qu., Median, Mean, 3rd Qu., and Max of the class probabilities for No and Yes; err is 0.2.
138 Example Code in R. Example (ctree in package party):
# For a ROC curve:
library(ROCR)
pred <- do.call(rbind, as.list(pred))
summary(pred)
roc <- prediction(pred[,1], ds[test, target])
plot(performance(roc, measure="tpr", x.measure="fpr"), colorize=TRUE)

# For a lift curve:
plot(performance(roc, measure="lift", x.measure="rpp"), colorize=TRUE)

# Sensitivity/Specificity curve and Precision/Recall curve:
# Sensitivity (i.e. True Positives/Actual Positives)
# Specificity (i.e. True Negatives/Actual Negatives)
plot(performance(roc, measure="sens", x.measure="spec"), colorize=TRUE)
plot(performance(roc, measure="prec", x.measure="rec"), colorize=TRUE)
139 Figures: the ROC curve (true positive rate versus false positive rate), the lift chart (lift value versus rate of positive predictions), the sensitivity/specificity curve, and the precision/recall curve.
140 Example Code in R. Example (cross-validation):
# Example of using 10-fold cross-validation to evaluate your model
model <- train(ds[, vars], ds[, target], method="rpart", tuneLength=10)

# Cross-validation example
n <- nrow(ds)   # nobs
K <- 10         # for 10 validation cross sections
taille <- n %/% K
set.seed(5)
alea <- runif(n)
rang <- rank(alea)
bloc <- (rang-1) %/% taille + 1
bloc <- as.factor(bloc)
print(summary(bloc))
141 Example Code in R. Example (cross-validation continued):
all.err <- numeric(0)
for (k in 1:K) {
  model <- rpart(formula=form, data=ds[bloc != k, vars], method="class")
  pred <- predict(model, newdata=ds[bloc == k, vars], type="class")
  mc <- table(ds[bloc == k, target], pred)
  err <- 1 - (mc[1,1] + mc[2,2]) / sum(mc)
  all.err <- rbind(all.err, err)
}
print(all.err)
(err.cv <- mean(all.err))
142 print(all.err) lists the error for each of the ten folds (0.2 in every fold on the slide), and (err.cv <- mean(all.err)) gives the cross-validated error, 0.2.
143 caret Package. Check out the caret package if you're building predictive models in R. It implements a number of out-of-sample evaluation schemes, including bootstrap sampling, cross-validation, and multiple train/test splits. caret is really nice because it provides a unified interface to all the models, so you don't have to remember, e.g., that treeresponse is the function to get class probabilities from a ctree model.
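For example, the train() call on slide 140 can be pointed at an explicit resampling scheme through trainControl. A minimal sketch using standard caret arguments; this particular configuration is illustrative rather than code from the talk:
library(caret)
ctrl <- trainControl(method = "cv", number = 10, classProbs = TRUE)   # 10-fold cross-validation
fit  <- train(ds[, inputs], ds[, target], method = "rpart", tuneLength = 10, trControl = ctrl)
fit$results                                               # cross-validated accuracy and Kappa per cp value
predict(fit, newdata = ds[test, inputs], type = "prob")   # class probabilities, same call for any method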
144 Example Code in R. Example (Random Forest - cforest):
# Random Forest from library(party)
model <- cforest(formula=form, data=ds.complete[train.complete, vars])
145 cforest confusion matrix: Confusion Matrix and Statistics: the prediction-versus-reference table (the Yes row shows 3 and 6), plus Accuracy (with 95% CI), No Information Rate, P-Value [Acc > NIR], Kappa, McNemar's Test P-Value, Sensitivity, Specificity, Pos Pred Value, Neg Pred Value, Prevalence, Detection Rate, Detection Prevalence, and Positive Class: No.
146 Best Model: randomForest with mtry = 4. Confusion Matrix and Statistics: Prediction No: 75, 1; Prediction Yes: 2, 21; Accuracy with 95% CI, No Information Rate, P-Value [Acc > NIR] on the order of 1e-08, Kappa, McNemar's Test P-Value: 1, plus Sensitivity, Specificity, Pos Pred Value, Neg Pred Value, Prevalence, Detection Rate, Detection Prevalence, and Positive Class: No.
147 Example Code in R. Example (Data for Today): > Today shows a single new observation with the same columns (WindGustDir NNW, WindGustSpeed 30, WindDir9am N, WindDir3pm NW, RainToday Yes, and the other measurements for the day).
148 Example Code in R. Example (Random Forest - cforest):
> (predict(model, newdata=Today))
[1] Yes
Levels: No Yes
> (predict(model, newdata=Today, type="prob"))
$`50`
returns the RainTomorrow.No and RainTomorrow.Yes probabilities for the new observation.
149 Example Code in R. Example (Random Forest - randomForest):
> predict(model, newdata=Today)
 50
Yes
Levels: No Yes
> predict(model, newdata=Today, type="prob")
returns the No and Yes vote fractions, with attr(,"class") "matrix" "votes".
150 Will it Rain Tomorrow? Yes, it will rain tomorrow. There is a ninety percent chance of rain, and we are ninety-five percent confident that we have a five percent chance of being wrong.
151 Evaluating the Business Questions
152 Evaluating the Business Questions. Is this of value? Is it understandable? How do we communicate this to the business? Are you answering the question asked?
153 "An approximate answer to the right problem is worth a good deal more than an exact answer to an approximate problem." (John Tukey)
154 Kaggle and Random Forest
155 Tip. Get the advantage with creativity, understanding the data, data munging, and metadata creation.
156 "The best way to have a good idea is to have a lot of ideas." (Linus Pauling)
157 Tip. A lot of the data munging is done for you; you are given a nice flat file to work with. Knowing and understanding this process will enable you to find data leaks and holes in the data set. What did their data scientists miss?
158 Tip. Use some type of version control, write notes to yourself, and read the forum comments.
159 Visualization
160 Pie Chart Visualization (sometimes you really just need a pie chart).
161 Recommended Reading. Christopher M. Bishop (2006), Pattern Recognition and Machine Learning, Information Science and Statistics. Leo Breiman (1999), Random Forests, breiman/random-forests.pdf. George Casella and Roger L. Berger, Statistical Inference. Rachel Schutt and Cathy O'Neil (2013), Doing Data Science: Straight Talk from the Frontline. Q. Ethan McCallum (2013), Bad Data Handbook: Mapping the World of Data Problems. Graham Williams (2013), Decision Trees in R.
162 References. Hothorn, Hornik, and Zeileis (2006), party: A Laboratory for Recursive Partytioning. Torsten Hothorn and Achim Zeileis (2009), A Toolbox for Recursive Partytioning. Torsten Hothorn (2013), Machine Learning and Statistical Learning. Other sources: StackExchange, Stack Overflow, package documentation.
163 Acknowledgment. Ken McGuire, Robert Bagley.
164 Questions. Twitter account: JenniferE CF. Website for R code: rcode. [email protected]
user! 2013 Max Kuhn, Ph.D Pfizer Global R&D Groton, CT [email protected]
Predictive Modeling with R and the caret Package user! 23 Max Kuhn, Ph.D Pfizer Global R&D Groton, CT [email protected] Outline Conventions in R Data Splitting and Estimating Performance Data Pre-Processing
Classification of Bad Accounts in Credit Card Industry
Classification of Bad Accounts in Credit Card Industry Chengwei Yuan December 12, 2014 Introduction Risk management is critical for a credit card company to survive in such competing industry. In addition
Model-Based Recursive Partitioning for Detecting Interaction Effects in Subgroups
Model-Based Recursive Partitioning for Detecting Interaction Effects in Subgroups Achim Zeileis, Torsten Hothorn, Kurt Hornik http://eeecon.uibk.ac.at/~zeileis/ Overview Motivation: Trees, leaves, and
COMP3420: Advanced Databases and Data Mining. Classification and prediction: Introduction and Decision Tree Induction
COMP3420: Advanced Databases and Data Mining Classification and prediction: Introduction and Decision Tree Induction Lecture outline Classification versus prediction Classification A two step process Supervised
New Work Item for ISO 3534-5 Predictive Analytics (Initial Notes and Thoughts) Introduction
Introduction New Work Item for ISO 3534-5 Predictive Analytics (Initial Notes and Thoughts) Predictive analytics encompasses the body of statistical knowledge supporting the analysis of massive data sets.
Data Mining Practical Machine Learning Tools and Techniques
Credibility: Evaluating what s been learned Data Mining Practical Machine Learning Tools and Techniques Slides for Chapter 5 of Data Mining by I. H. Witten, E. Frank and M. A. Hall Issues: training, testing,
2 Decision tree + Cross-validation with R (package rpart)
1 Subject Using cross-validation for the performance evaluation of decision trees with R, KNIME and RAPIDMINER. This paper takes one of our old study on the implementation of cross-validation for assessing
Silvermine House Steenberg Office Park, Tokai 7945 Cape Town, South Africa Telephone: +27 21 702 4666 www.spss-sa.com
SPSS-SA Silvermine House Steenberg Office Park, Tokai 7945 Cape Town, South Africa Telephone: +27 21 702 4666 www.spss-sa.com SPSS-SA Training Brochure 2009 TABLE OF CONTENTS 1 SPSS TRAINING COURSES FOCUSING
Multivariate Logistic Regression
1 Multivariate Logistic Regression As in univariate logistic regression, let π(x) represent the probability of an event that depends on p covariates or independent variables. Then, using an inv.logit formulation
Chapter 12 Discovering New Knowledge Data Mining
Chapter 12 Discovering New Knowledge Data Mining Becerra-Fernandez, et al. -- Knowledge Management 1/e -- 2004 Prentice Hall Additional material 2007 Dekai Wu Chapter Objectives Introduce the student to
Analysis of WEKA Data Mining Algorithm REPTree, Simple Cart and RandomTree for Classification of Indian News
Analysis of WEKA Data Mining Algorithm REPTree, Simple Cart and RandomTree for Classification of Indian News Sushilkumar Kalmegh Associate Professor, Department of Computer Science, Sant Gadge Baba Amravati
BIDM Project. Predicting the contract type for IT/ITES outsourcing contracts
BIDM Project Predicting the contract type for IT/ITES outsourcing contracts N a n d i n i G o v i n d a r a j a n ( 6 1 2 1 0 5 5 6 ) The authors believe that data modelling can be used to predict if an
Why is Internal Audit so Hard?
Why is Internal Audit so Hard? 2 2014 Why is Internal Audit so Hard? 3 2014 Why is Internal Audit so Hard? Waste Abuse Fraud 4 2014 Waves of Change 1 st Wave Personal Computers Electronic Spreadsheets
Didacticiel Études de cas
1 Theme Data Mining with R The rattle package. R (http://www.r project.org/) is one of the most exciting free data mining software projects of these last years. Its popularity is completely justified (see
BOOSTED REGRESSION TREES: A MODERN WAY TO ENHANCE ACTUARIAL MODELLING
BOOSTED REGRESSION TREES: A MODERN WAY TO ENHANCE ACTUARIAL MODELLING Xavier Conort [email protected] Session Number: TBR14 Insurance has always been a data business The industry has successfully
6 Classification and Regression Trees, 7 Bagging, and Boosting
hs24 v.2004/01/03 Prn:23/02/2005; 14:41 F:hs24011.tex; VTEX/ES p. 1 1 Handbook of Statistics, Vol. 24 ISSN: 0169-7161 2005 Elsevier B.V. All rights reserved. DOI 10.1016/S0169-7161(04)24011-1 1 6 Classification
Benchmarking Open-Source Tree Learners in R/RWeka
Benchmarking Open-Source Tree Learners in R/RWeka Michael Schauerhuber 1, Achim Zeileis 1, David Meyer 2, Kurt Hornik 1 Department of Statistics and Mathematics 1 Institute for Management Information Systems
Data Mining Algorithms Part 1. Dejan Sarka
Data Mining Algorithms Part 1 Dejan Sarka Join the conversation on Twitter: @DevWeek #DW2015 Instructor Bio Dejan Sarka ([email protected]) 30 years of experience SQL Server MVP, MCT, 13 books 7+ courses
Knowledge Discovery and Data Mining
Knowledge Discovery and Data Mining Unit # 10 Sajjad Haider Fall 2012 1 Supervised Learning Process Data Collection/Preparation Data Cleaning Discretization Supervised/Unuspervised Identification of right
T O P I C 1 2 Techniques and tools for data analysis Preview Introduction In chapter 3 of Statistics In A Day different combinations of numbers and types of variables are presented. We go through these
Adequacy of Biomath. Models. Empirical Modeling Tools. Bayesian Modeling. Model Uncertainty / Selection
Directions in Statistical Methodology for Multivariable Predictive Modeling Frank E Harrell Jr University of Virginia Seattle WA 19May98 Overview of Modeling Process Model selection Regression shape Diagnostics
STATISTICA Formula Guide: Logistic Regression. Table of Contents
: Table of Contents... 1 Overview of Model... 1 Dispersion... 2 Parameterization... 3 Sigma-Restricted Model... 3 Overparameterized Model... 4 Reference Coding... 4 Model Summary (Summary Tab)... 5 Summary
Exploratory Data Analysis with R. @matthewrenze #codemash
Exploratory Data Analysis with R @matthewrenze #codemash Motivation The ability to take data to be able to understand it, to process it, to extract value from it, to visualize it, to communicate it that
Why Ensembles Win Data Mining Competitions
Why Ensembles Win Data Mining Competitions A Predictive Analytics Center of Excellence (PACE) Tech Talk November 14, 2012 Dean Abbott Abbott Analytics, Inc. Blog: http://abbottanalytics.blogspot.com URL:
Auxiliary Variables in Mixture Modeling: 3-Step Approaches Using Mplus
Auxiliary Variables in Mixture Modeling: 3-Step Approaches Using Mplus Tihomir Asparouhov and Bengt Muthén Mplus Web Notes: No. 15 Version 8, August 5, 2014 1 Abstract This paper discusses alternatives
The Operational Value of Social Media Information. Social Media and Customer Interaction
The Operational Value of Social Media Information Dennis J. Zhang (Kellogg School of Management) Ruomeng Cui (Kelley School of Business) Santiago Gallino (Tuck School of Business) Antonio Moreno-Garcia
Using multiple models: Bagging, Boosting, Ensembles, Forests
Using multiple models: Bagging, Boosting, Ensembles, Forests Bagging Combining predictions from multiple models Different models obtained from bootstrap samples of training data Average predictions or
Environmental Remote Sensing GEOG 2021
Environmental Remote Sensing GEOG 2021 Lecture 4 Image classification 2 Purpose categorising data data abstraction / simplification data interpretation mapping for land cover mapping use land cover class
Decision Trees. Andrew W. Moore Professor School of Computer Science Carnegie Mellon University. www.cs.cmu.edu/~awm [email protected].
Decision Trees Andrew W. Moore Professor School of Computer Science Carnegie Mellon University www.cs.cmu.edu/~awm [email protected] 42-268-7599 Copyright Andrew W. Moore Slide Decision Trees Decision trees
Supervised Learning (Big Data Analytics)
Supervised Learning (Big Data Analytics) Vibhav Gogate Department of Computer Science The University of Texas at Dallas Practical advice Goal of Big Data Analytics Uncover patterns in Data. Can be used
Data Mining Techniques for Prognosis in Pancreatic Cancer
Data Mining Techniques for Prognosis in Pancreatic Cancer by Stuart Floyd A Thesis Submitted to the Faculty of the WORCESTER POLYTECHNIC INSTITUE In partial fulfillment of the requirements for the Degree
Role of Customer Response Models in Customer Solicitation Center s Direct Marketing Campaign
Role of Customer Response Models in Customer Solicitation Center s Direct Marketing Campaign Arun K Mandapaka, Amit Singh Kushwah, Dr.Goutam Chakraborty Oklahoma State University, OK, USA ABSTRACT Direct
Gamma Distribution Fitting
Chapter 552 Gamma Distribution Fitting Introduction This module fits the gamma probability distributions to a complete or censored set of individual or grouped data values. It outputs various statistics
Data Analysis of Trends in iphone 5 Sales on ebay
Data Analysis of Trends in iphone 5 Sales on ebay By Wenyu Zhang Mentor: Professor David Aldous Contents Pg 1. Introduction 3 2. Data and Analysis 4 2.1 Description of Data 4 2.2 Retrieval of Data 5 2.3
DECISION TREE INDUCTION FOR FINANCIAL FRAUD DETECTION USING ENSEMBLE LEARNING TECHNIQUES
DECISION TREE INDUCTION FOR FINANCIAL FRAUD DETECTION USING ENSEMBLE LEARNING TECHNIQUES Vijayalakshmi Mahanra Rao 1, Yashwant Prasad Singh 2 Multimedia University, Cyberjaya, MALAYSIA 1 [email protected]
