1 R: A Language for Data Mining. 2 Data Mining, Rattle, and R. 3 Loading, Cleaning, Exploring Data in Rattle. 4 Descriptive Data Mining


A Data Mining Workshop
Excavating Knowledge from Data
Introducing Data Mining using R

Graham.Williams@togaware.com
Data Scientist, Australian Taxation Office
Adjunct Professor, Australian National University
Adjunct Professor, University of Canberra
Fellow, Institute of Analytics Professionals of Australia

http://togaware.com Copyright 2014, Graham.Williams@togaware.com

Workshop Overview
1 R: A Language for Data Mining
2 Data Mining, Rattle, and R
3 Loading, Cleaning, Exploring Data in Rattle
4 Descriptive Data Mining
5 Predictive Data Mining: Decision Trees
6 Predictive Data Mining: Ensembles
7 Moving into R and Scripting our Analyses
8 Literate Data Mining in R
Workshop notes are available online.

R: A Language for Data Mining

Installing R and Rattle
First task is to install R.
As free/libre open source software (FLOSS or FOSS), R and Rattle are available to all, with no limitations on our freedom to use and share the software, except to share and share alike.
Visit CRAN for R and the Rattle site for Rattle.
Linux: Install packages (Ubuntu is recommended)
    $ wajig install r-recommended r-cran-rattle
Windows: Download and install from CRAN
MacOSX: Download and install from CRAN

What is R?

Why a Workshop on R? Why do Data Science with R?
Most widely used Data Mining and Machine Learning Package
  Machine Learning
  Statistics
  Software Engineering
  But not the nicest of languages for a Computer Scientist!
Free (Libre) Open Source Statistical Software
  ... all modern statistical approaches
  ... many/most machine learning algorithms
  ... opportunity to readily add new algorithms
  That is important for us in the research community
  Get our algorithms out there and being used: impact!!!

What is R? Popularity of R

How Popular is R? Discussion List Traffic
Figure: Monthly traffic on each software's main discussion list.

How Popular is R? Discussion Topics
Figure: Number of discussions on popular Q&A forums, 2013.

How Popular is R? R versus SAS
Figure: Number of R/SAS related posts to Stack Overflow by week.

How Popular is R? Professional Forums
Figure: Members registered for the main discussion group of each software.

How Popular is R? Used in Analytics Competitions
Figure: Software used in data analysis competitions in 2011.

How Popular is R? User Survey
Figure: Rexer Analytics Survey 2010 results for data mining/analytic tools.

R: The Video
A 90 Second Promo from Revolution Analytics

Data Mining, Rattle, and R

An Introduction to Data Mining: Big Data and Big Business

Data Mining
A data driven analysis to uncover otherwise unknown but useful patterns in large datasets, to discover new knowledge and to develop predictive models, turning data and information into knowledge and (one day perhaps) wisdom, in a timely manner.

Data Mining
Application of
  Machine Learning
  Statistics
  Software Engineering
  Effective Communications and Intuition
... to Datasets that vary by Volume, Velocity, Variety, Value, Veracity
... to discover new knowledge
... to improve business outcomes
... to deliver better tailored services

An Introduction to Data Mining: Big Data and Big Business

Data Mining in Research
Health Research: Adverse reactions using linked Pharmaceutical, General Practitioner, Hospital, Pathology datasets.
Astronomy: Microlensing events in the Large Magellanic Cloud of several million observed stars (out of 10 billion).
Psychology: Investigation of age-of-onset for Alzheimer's disease from 75 variables for 800 people.
Social Sciences: Survey evaluation. Social network analysis: identifying key influencers.

Data Mining in Government
Australian Taxation Office: Lodgment ($0M), Tax Havens ($50M), Tax Fraud ($250M)
Immigration and Border Control: Check passengers before boarding
Health and Human Services: Doctor shoppers, Over servicing

The Business of Data Mining
SAS has annual revenues of $3B (2013)
IBM bought SPSS for $1.2B (2009)
Analytics is >$100B business and >$320B by 2020
Amazon, eBay/PayPal, Google, Facebook, LinkedIn, ...
Shortage of 80,000 data scientists in US in 2018 (McKinsey) ...

An Introduction to Data Mining: Algorithms

Basic Tools: Data Mining Algorithms
Cluster Analysis (kmeans, wskm)
Association Analysis (arules)
Linear Discriminant Analysis (lda)
Logistic Regression (glm)
Decision Trees (rpart, wsrpart)
Random Forests (randomForest, wsrf)
Boosted Stumps (ada)
Neural Networks (nnet)
Support Vector Machines (kernlab)
...
That's a lot of tools to learn in R! Many with different interfaces and options.

The Rattle Package for Data Mining: A GUI for Data Mining

Why a GUI?
Statistics can be complex and traps await
So many tools in R to deliver insights
Effective analyses should be scripted
Scripting also required for repeatability
R is a language for programming with data
How to remember how to do all of this in R?
How to skill up 50 data analysts with Data Mining?

Users of Rattle
Today, Rattle is used world wide in many industries
  Health analytics
  Customer segmentation and marketing
  Fraud detection
  Government
It is used by
  Universities to teach Data Mining
  Within research projects for basic analyses
  Consultants and Analytics Teams across business
It is and will remain freely available from CRAN.

The Rattle Package for Data Mining: Setting Things Up

Installation
Rattle is built using R
Need to download and install R from cran.r-project.org
Recommend also installing RStudio
Then start up RStudio and install Rattle:
    install.packages("rattle")
Then we can load the package and start up Rattle:
    library(rattle)
    rattle()
Required packages are loaded as needed.

The Rattle Package for Data Mining: Tour

A Tour Thru Rattle: Startup
A Tour Thru Rattle: Loading Data
A Tour Thru Rattle: Explore Distribution
A Tour Thru Rattle: Explore Correlations
A Tour Thru Rattle: Hierarchical Cluster

A Tour Thru Rattle: Decision Tree
A Tour Thru Rattle: Decision Tree Plot
A Tour Thru Rattle: Random Forest
A Tour Thru Rattle: Risk Chart
Figure: Risk Chart, Random Forest, weather.csv [test], RainTomorrow. Performance (%) against Caseload (%), showing Risk Scores, Lift, RainTomorrow (92%), Rain in MM (97%) and Precision.

Data Scientists are Programmers of Data: From GUI to CLI

Rattle's Log Tab

But...
Data scientists are programmers of data
A GUI can only do so much
R is a powerful statistical language
Data Scientists Desire...
  Scripting
  Transparency
  Repeatability
  Sharing

R Tool Suite: The Power of Free/Libre and Open Source Software

Tools
Ubuntu GNU/Linux operating system
  Feature rich toolkit, up-to-date, easy to install, FLOSS
RStudio
  Easy to use integrated development environment, FLOSS
  Powerful alternative is Emacs (Speaks Statistics), FLOSS
R Statistical Software Language
  Extensive, powerful, thousands of contributors, FLOSS
KnitR and LaTeX
  Produce beautiful documents, easily reproducible, FLOSS

R Tool Suite: Ubuntu

Using Ubuntu
Desktop Operating System (GNU/Linux)
  Replacing Windows and OSX
The GNU Tool Suite based on Unix: significant heritage
  Multiple specialised single task tools, working well together
  Compared to single application trying to do it all
Powerful data processing from the command line:
  grep, awk, head, tail, wc, sed, perl, python, most, diff, make, paste, join, patch, ...
For interacting with R start up RStudio from the Dash

RStudio Interface

RStudio: The Default Three Panels
RStudio: With R Script File Editor Panel

Introduction to R: Simple Plots

Scatterplot: R Code
Our first little bit of R code:
Load a couple of packages into the R library
    library(rattle)   # Provides the weather dataset
    library(ggplot2)  # Provides the qplot() function
Then produce a quick plot using qplot()
    ds <- weather
    qplot(MinTemp, MaxTemp, data=ds)
Your turn: give it a go.

Scatterplot: Plot
Figure: scatterplot of MaxTemp against MinTemp.

Scatterplot: RStudio

Introduction to R: Installing Packages

Missing Packages
Tools > Install Packages...

RStudio: Installing ggplot2

Introduction to R: RStudio Shortcuts

RStudio Keyboard Shortcuts
These will become very useful!
Editor:
  Ctrl-Enter will send the line of code to the R console
  Ctrl-2 will move the cursor to the Console
Console:
  UpArrow will cycle through previous commands
  Ctrl-UpArrow will search previous commands
  Tab will complete function names and list the arguments
  Ctrl-1 will move the cursor to the Editor
Your turn: try them out.

Introduction to R: Basic R Commands

Basic R
    library(rattle)  # Load the weather dataset.
    head(weather)    # First 6 observations of the dataset.
    ##   Date Location MinTemp MaxTemp Rainfall Evapora...
    ##   ...  Canberra ...
    ##   ...  Canberra ...
    ##   ...  Canberra ...
    str(weather)     # Structure of the variables in the dataset.
    ## 'data.frame': 366 obs. of 24 variables:
    ##  $ Date    : Date, format: ...
    ##  $ Location: Factor w/ 46 levels "Adelaide","Alba...
    ##  $ MinTemp : num ...
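The same inspection commands work on any data frame. As a minimal sketch that runs without installing rattle, the following assumes R's built-in iris data frame as a stand-in for the weather dataset; only the dataset is swapped, the functions are the ones named on the slide plus dim() and names().

```r
# Stand-in for the weather dataset: iris ships with base R (an assumption
# made here so that the example runs without the rattle package).
ds <- iris

dim(ds)      # Number of observations and number of variables.
names(ds)    # Variable names.
head(ds)     # First 6 observations of the dataset.
str(ds)      # Type and first few values of each variable.
summary(ds)  # Univariate summary of each variable.
```

Assigning the data frame to ds first means the remaining lines do not change when a different dataset is used.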

Basic R
    summary(weather)  # Univariate summary of the variables.
    ##  Date          Location            MinTemp ...
    ##  Min.   :...   Canberra     :366   Min.   :...
    ##  1st Qu.:...   Adelaide     :  0   1st Qu.:...
    ##  Median :...   Albany       :  0   Median :...
    ##  Mean   :...   Albury       :  0   Mean   :...
    ##  3rd Qu.:...   AliceSprings :  0   3rd Qu.:...
    ##  Max.   :...   BadgerysCreek:  0   Max.   :...
    ##                (Other)      :  0   ...
    ##  Rainfall        Evaporation     Sunshine        WindGust...
    ##  Min.   : 0.00   Min.   : 0.20   Min.   : 0.00   NW     :...
    ##  1st Qu.: 0.00   1st Qu.: 2.20   1st Qu.: 5.95   NNW    :...
    ##  Median : 0.00   Median : 4.20   Median : 8.60   E      :...
    ##  Mean   : 1.43   Mean   : 4.52   Mean   : 7.9    WNW    :...
    ##  3rd Qu.:...     3rd Qu.:...     3rd Qu.:10.50   ENE    :...

Introduction to R: Visualising Data

Visual Summaries: Add A Little Colour
    qplot(Humidity3pm, Pressure3pm, colour=RainTomorrow, data=ds)
Figure: Pressure3pm against Humidity3pm, coloured by RainTomorrow (No, Yes).

Visual Summaries: Careful with Categorics
    qplot(WindGustDir, Pressure3pm, data=ds)
Figure: Pressure3pm against WindGustDir (N, NNE, NE, ENE, E, ESE, SE, SSE, S, SSW, SW, WSW, W, WNW, NW, NNW, NA).

Visual Summaries: Add A Little Jitter
    qplot(WindGustDir, Pressure3pm, data=ds, geom="jitter")
Figure: Pressure3pm against WindGustDir, with points jittered.

Visual Summaries: And Some Colour
    qplot(WindGustDir, Pressure3pm, data=ds, colour=WindGustDir, geom="jitter")
Figure: Pressure3pm against WindGustDir, with points jittered and coloured by WindGustDir.

Introduction to R: Help

Getting Help
Precede Command with ?

Loading, Cleaning, Exploring Data in Rattle

Loading Data
Exploring Data
Test Data
Transform Data

Descriptive Data Mining

Cluster Analysis

What is Cluster Analysis?
Cluster: a collection of observations
  Similar to one another within the same cluster
  Dissimilar to the observations in other clusters
Cluster analysis
  Grouping a set of data observations into classes
Clustering is unsupervised classification: no predefined classes; descriptive data mining.
Typical applications
  As a stand-alone tool to get insight into data distribution
  As a preprocessing step for other algorithms

Major Clustering Approaches
Partitioning algorithms (kmeans, pam, clara, fanny): Construct various partitions and then evaluate them by some criterion. A fixed number of clusters, k, is generated. Start with an initial (perhaps random) cluster.
Hierarchical algorithms (hclust, agnes, diana): Create a hierarchical decomposition of the set of observations using some criterion
Density-based algorithms: based on connectivity and density functions
Grid-based algorithms: based on a multiple-level granularity structure
Model-based algorithms (mclust for mixture of Gaussians): A model is hypothesized for each of the clusters and the idea is to find the best fit of that model

Descriptive Data Mining: KMeans Clustering

Association Rule Mining
An unsupervised learning algorithm; descriptive data mining.
Identify items (patterns) that occur frequently together in a given set of data.
Patterns = associations, correlations, causal structures (Rules).
Data = sets of items in ...
  transactional database
  relational database
  complex information repositories
Rule: Body → Head [support, confidence]

Descriptive Data Mining: Association Rules

Examples
Friday and Nappies → Beer [0.5%, 60%]
Age [20, 30] and Income [20K, 30K] → MP3Player [2%, 60%]
Maths and CS → HDinCS [1%, 75%]
Gladiator and Patriot → Sixth Sense [0.1%, 90%]
Statins and Peritonitis → Chronic Renal Failure [0.1%, 32%]
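The [support, confidence] pair attached to each rule can be computed by hand, which helps when reading rules such as those in the examples. The following is a minimal sketch in base R: the transactions are hypothetical toy data invented for illustration, and support() is a helper defined here, not a function from the arules package that the workshop lists for association analysis.

```r
# Hypothetical toy transactions, invented for illustration only.
transactions <- list(
  c("nappies", "beer", "chips"),
  c("nappies", "beer"),
  c("nappies", "milk"),
  c("beer", "chips"),
  c("milk", "bread")
)

# Proportion of transactions that contain every item in `items`.
support <- function(items, trans) {
  mean(sapply(trans, function(tr) all(items %in% tr)))
}

# Rule: body -> head.
body_items <- "nappies"
head_items <- "beer"

# Support of the rule: how often body and head occur together.
rule_support <- support(c(body_items, head_items), transactions)

# Confidence: of the transactions containing the body, the share that
# also contain the head.
rule_confidence <- rule_support / support(body_items, transactions)

rule_support      # 2 of the 5 transactions contain both items
rule_confidence   # 2 of the 3 nappies transactions also contain beer
```

Support measures how common the pattern is overall, while confidence measures how reliable the rule is once the body has been observed, which is why a rule can have very low support and still high confidence.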
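Returning to the partitioning and hierarchical approaches listed under Major Clustering Approaches, here is a short sketch using kmeans() and hclust() from base R's stats package. The iris measurements are an assumed stand-in dataset, and the choices of k = 3, scaling, and Ward linkage are illustrative assumptions rather than anything prescribed by the workshop.

```r
# iris is a stand-in dataset; the four numeric columns are clustered and
# the known Species labels are used only to inspect the result.
ds <- iris[, 1:4]

# Partitioning: k-means starts from random centres, so fix the seed and
# try several starts, keeping the best.
set.seed(42)
model <- kmeans(scale(ds), centers = 3, nstart = 10)

model$size                                              # Observations per cluster.
table(cluster = model$cluster, species = iris$Species)  # Compare with labels.

# Hierarchical: build the full tree, then cut it into 3 groups.
hc     <- hclust(dist(scale(ds)), method = "ward.D2")
groups <- cutree(hc, k = 3)
table(groups)
```

The partitioning method needs k up front, whereas the hierarchical tree is built once and can be cut at any number of groups afterwards.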

Predictive Data Mining: Decision Trees

Decision Trees: Basics

Predictive Modelling: Classification
Goal of classification is to build models (sentences) in a knowledge representation (language) from examples of past decisions.
The model is to be used on unseen cases to make decisions.
Often referred to as supervised learning.
Common approaches: decision trees; neural networks; logistic regression; support vector machines.

Language: Decision Trees
Knowledge representation: A flow-chart-like tree structure
Internal nodes denote a test on a variable
Branch represents an outcome of the test
Leaf nodes represent class labels or class distribution
Figure: example tree with tests on Gender (Male, Female) and Age (< 42, > 42) and leaves labelled Y and N.

Tree Construction: Divide and Conquer
Decision tree induction is an example of a recursive partitioning algorithm: divide and conquer.
At start, all the training examples are at the root
Partition examples recursively based on selected variables
Figure: training examples (+ and -) partitioned by Gender (Females, Males) and by Age (< 42, > 42).

Decision Trees: Algorithm

Algorithm for Decision Tree Induction
A greedy algorithm: takes the best immediate (local) decision while building the overall model
Tree constructed top-down, recursive, divide-and-conquer
Begin with all training examples at the root
Data is partitioned recursively based on selected variables
Select variables on basis of a measure
Stop partitioning when?
  All samples for a given node belong to the same class
  There are no remaining variables for further partitioning; majority voting is employed for classifying the leaf
  There are no samples left

Basic Motivation: Entropy
We are trying to predict output Y (e.g., Yes/No) from input X.
A random data set may have high entropy:
  Y is from a uniform distribution
  a frequency distribution would be flat!
  a sample will include uniformly random values of Y
A data set with low entropy:
  Y's distribution will be very skewed
  a frequency distribution will have a single peak
  a sample will predominately contain just Yes or just No
Work towards reducing the amount of entropy in the data!
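To make the high-entropy and low-entropy cases above concrete, the sketch below computes entropy in bits for three made-up label vectors. The entropy() helper and the vectors are assumptions written for this illustration; they are not part of rattle or rpart.

```r
# Entropy (in bits) of a vector of class labels.
entropy <- function(y) {
  p <- table(y) / length(y)   # Class proportions.
  p <- p[p > 0]               # Treat 0 * log2(0) as 0.
  -sum(p * log2(p))
}

# Made-up samples of Y, for illustration only.
flat   <- rep(c("Yes", "No"), each = 50)    # 50/50 split: flat distribution.
skewed <- c(rep("No", 95), rep("Yes", 5))   # 95/5 split: single peak.
pure   <- rep("No", 100)                    # A single class only.

entropy(flat)     # 1 bit, the maximum for two classes
entropy(skewed)   # about 0.29 bits
entropy(pure)     # 0 bits
```

A split that moves the data from something like the flat case towards the skewed or pure cases is what the tree builder is looking for.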

Variable Selection Measure: Entropy
Information gain (ID3/C4.5)
Select the variable with the highest information gain
Assume there are two classes: P and N
Let the data S contain p elements of class P and n elements of class N
The amount of information needed to decide if an arbitrary example in S belongs to P or N is defined as

$$I_E(p, n) = -\frac{p}{p+n}\log_2\frac{p}{p+n} - \frac{n}{p+n}\log_2\frac{n}{p+n}$$

Variable Selection Measure: Gini
Gini index of impurity: traditional statistical measure (CART)
Measure how often a randomly chosen observation is incorrectly classified if it were randomly classified in proportion to the actual classes.
Calculated as the sum of the probability of each observation being chosen times the probability of incorrect classification, equivalently:

$$I_G(p, n) = 1 - (p^2 + (1 - p)^2)$$

where p here is the proportion of positives.
As with Entropy, the Gini measure is maximal when the classes are equally distributed and minimal when all observations are in one class or the other.

Variable Selection Measure
Figure: Variable Importance Measure. Info and Gini measures plotted against Proportion of Positives.
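The claim that both measures peak when the classes are equally distributed can be checked numerically. This sketch evaluates the two-class information and Gini measures over a small grid of proportions; info() and gini() are helper names chosen here for illustration, with p taken to be the proportion of positives.

```r
# p is the proportion of positives in a two-class node.
info <- function(p) {
  q <- c(p, 1 - p)
  q <- q[q > 0]               # Treat 0 * log2(0) as 0.
  -sum(q * log2(q))
}
gini <- function(p) 1 - (p^2 + (1 - p)^2)

# Evaluate both measures from all negatives (0) to all positives (1).
p <- seq(0, 1, by = 0.25)
data.frame(p = p, info = sapply(p, info), gini = gini(p))
```

Both columns are zero at p = 0 and p = 1 and largest at p = 0.5, where info is 1 and gini is 0.5, so the two measures rank pure and mixed nodes the same way even though their scales differ.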

Information Gain
Now use variable A to partition S into v cells: $\{S_1, S_2, \ldots, S_v\}$
If $S_i$ contains $p_i$ examples of P and $n_i$ examples of N, the information now needed to classify objects in all subtrees $S_i$ is:

$$E(A) = \sum_{i=1}^{v} \frac{p_i + n_i}{p + n} I(p_i, n_i)$$

So, the information gained by branching on A is:

$$Gain(A) = I(p, n) - E(A)$$

So choose the variable A which results in the greatest gain in information.

Decision Trees: In Rattle

Startup Rattle
    library(rattle)
    rattle()

Load Example Weather Dataset
Click on the Execute button and an example dataset is offered.
Click on Yes to load the weather dataset.

Summary of the Weather Dataset
A summary of the weather dataset is displayed.

Model Tab: Decision Tree
Click on the Model tab to display the modelling options.

Build Tree to Predict RainTomorrow
Decision Tree is the default model type; simply click Execute.
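Returning to the information gain formulas for E(A) and Gain(A), a small worked example may help. The sketch below assumes illustrative class counts (not taken from the weather data) and uses helper names info() and gain() invented here; it follows the two formulas term by term.

```r
# I(p, n): information needed to classify an example, given class counts.
info <- function(p, n) {
  q <- c(p, n) / (p + n)
  q <- q[q > 0]                 # Treat 0 * log2(0) as 0.
  -sum(q * log2(q))
}

# Gain(A) = I(p, n) - E(A), where p_i and n_i hold the class counts in
# each cell S_i produced by partitioning on variable A.
gain <- function(p_i, n_i) {
  p <- sum(p_i)
  n <- sum(n_i)
  e_a <- sum((p_i + n_i) / (p + n) * mapply(info, p_i, n_i))
  info(p, n) - e_a
}

# Illustrative counts: 9 examples of P and 5 of N, split three ways by A.
gain(p_i = c(2, 4, 3), n_i = c(3, 0, 2))   # about 0.247 bits
```

Here the middle cell is pure (4 of P, none of N) and so contributes nothing to E(A); the tree builder would compute this gain for every candidate variable and branch on the largest.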

Decision Tree Predicting RainTomorrow
Click the Draw button to display a tree (Settings > Advanced Graphics).

Evaluate Decision Tree
Click the Evaluate tab for options to evaluate model performance.

Evaluate Decision Tree: Error Matrix
Click Execute to display simple error matrix.
Identify the True/False Positives/Negatives.

Decision Tree Risk Chart
Click the Risk type and then Execute.

Decision Tree ROC Curve
Click the ROC type and then Execute.

Score a Dataset
Click the Score type to score a new dataset using model.
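For the error matrix step, where we are asked to identify the true/false positives/negatives, the following base R sketch shows where each count sits in a predicted-by-actual table. The label vectors are hypothetical and invented for illustration; Rattle produces its matrix from the model, not from code like this.

```r
# Hypothetical actual and predicted labels, invented for illustration.
actual    <- factor(c("Yes", "No", "No", "Yes", "No",
                      "No", "Yes", "No", "No", "No"),
                    levels = c("No", "Yes"))
predicted <- factor(c("Yes", "No", "Yes", "No", "No",
                      "No", "Yes", "No", "No", "No"),
                    levels = c("No", "Yes"))

# Error matrix: rows are predictions, columns are the true classes.
em <- table(Predicted = predicted, Actual = actual)
em

tp <- em["Yes", "Yes"]   # Predicted Yes, actually Yes.
fp <- em["Yes", "No"]    # Predicted Yes, actually No.
fn <- em["No",  "Yes"]   # Predicted No,  actually Yes.
tn <- em["No",  "No"]    # Predicted No,  actually No.

# Summary rates derived from the four counts.
c(error     = (fp + fn) / sum(em),
  precision = tp / (tp + fp),
  recall    = tp / (tp + fn))
```

The off-diagonal cells are the two kinds of error, and which one matters more depends on the application, which is why the Evaluate tab offers more than a single error rate.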

Log of R Commands
Click the Log tab for a history of all your interactions.
Save the log contents as a script to repeat what we did.

Log of R Commands: rpart()
Here we see the call to rpart() to build the model.
Click on the Export button to save the script to file.

Help > Model > Tree
Rattle provides some basic help; click Yes for R help.

Decision Trees: In R

Weather Dataset - Inputs
    ds <- weather
    head(ds, 4)
    ##   Date Location MinTemp MaxTemp Rainfall Evaporation Sunshine
    ##   ...  Canberra ...
    ##   ...  Canberra ...
    ##   ...  Canberra ...
    ##   ...  Canberra ...
    summary(ds[c(3:5,23)])
    ##     MinTemp         MaxTemp        Rainfall        RISK_MM
    ##  Min.   :-5.30   Min.   : 7.6   Min.   : 0.00   Min.   : 0.00
    ##  1st Qu.: 2.30   1st Qu.:15.0   1st Qu.: 0.00   1st Qu.: 0.00
    ##  Median : 7.45   Median :19.6   Median : 0.00   Median : 0.00
    ##  Mean   : 7.27   Mean   :20.6   Mean   : 1.43   Mean   : 1.43

Weather Dataset - Target
    target <- "RainTomorrow"
    summary(ds[target])
    ##  RainTomorrow
    ##  No :300
    ##  Yes: 66
    (form <- formula(paste(target, "~ .")))
    ## RainTomorrow ~ .
    (vars <- names(ds)[-c(1, 2, 23)])
    ##  [1] "MinTemp"      "MaxTemp"      "Rainfall"      "Evaporation"
    ##  [5] "Sunshine"     "WindGustDir"  "WindGustSpeed" "WindDir9am"
    ##  [9] "WindDir3pm"   "WindSpeed9am" "WindSpeed3pm"  "Humidity9am"
    ## [13] "Humidity3pm"  "Pressure9am"  "Pressure3pm"   "Cloud9am"
    ## [17] "Cloud3pm"     "Temp9am"      "Temp3pm"       "RainToday"
    ## [21] "RainTomorrow"

Simple Train/Test Paradigm
    set.seed(42)
    train <- c(sample(1:nrow(ds), 0.70*nrow(ds)))  # Training dataset
    head(train)
    ## [1] ...
    length(train)
    ## [1] 256
    test <- setdiff(1:nrow(ds), train)             # Testing dataset
    length(test)
    ## [1] 110
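The train/test paradigm relies on the two index sets being disjoint and together covering every observation. As a sketch, the check below repeats the sampling on the built-in iris data frame (an assumed stand-in so that it runs without rattle) and verifies both properties; floor() is added here only to make the sample size explicit.

```r
ds <- iris                         # Stand-in for the weather dataset.

set.seed(42)                       # Make the random sample repeatable.
n     <- nrow(ds)
train <- sample(n, floor(0.70 * n))        # Row indices for training.
test  <- setdiff(seq_len(n), train)        # All remaining row indices.

length(intersect(train, test))             # 0: no row is in both sets.
setequal(c(train, test), seq_len(n))       # TRUE: every row is used once.
c(train = length(train), test = length(test))
```

Because the model never sees the test rows while it is being built, performance measured on them is a fairer estimate of how it will behave on new data.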

In R: Display the Model

  model <- rpart(form, ds[train, vars])
  model
  ## n= 256
  ##
  ## node), split, n, loss, yval, (yprob)
  ## * denotes terminal node
  ##
  ## 1) root No ( )
  ## 2) Humidity3pm< No ( )
  ## 4) WindGustSpeed< No ( )
  ## 8) Cloud3pm< No ( ) *
  ## 9) Cloud3pm>= No ( )
  ## 18) Temp3pm< No ( ) *
  ## 19) Temp3pm>= Yes ( ) *

Notice the legend to help interpret the tree.

In R: Performance on Test Dataset

The predict() function is used to score new data.

  head(predict(model, ds[test,], type="class"))
  ##
  ## No No No No No No
  ## Levels: No Yes
  table(predict(model, ds[test,], type="class"), ds[test, target])
  ##
  ##     No Yes
  ## No  77  14
  ## Yes  8  11

In R: Example DTree Plot using Rattle

[Figure: decision tree for the weather data drawn with Rattle, with splits on Humidity3pm, WindGustSpeed, Cloud3pm, Temp3pm and Pressure3pm.]

In R: An R Scripting Hint

Notice the use of variables ds, target, vars.
Change these variables, and the remaining script is unchanged.
Simplifies script writing and reuse of scripts.

  ds <- iris
  target <- "Species"
  vars <- names(ds)

Then repeat the rest of the script, without change.

In R: An R Scripting Hint - Unchanged Code

This code remains the same to build the decision tree.

  form <- formula(paste(target, "~."))
  train <- c(sample(1:nrow(ds), 0.70*nrow(ds)))
  test <- setdiff(1:nrow(ds), train)
  model <- rpart(form, ds[train, vars])
  model
  ## n= 105
  ##
  ## node), split, n, loss, yval, (yprob)
  ## * denotes terminal node
  ##
  ## 1) root setosa ( )
  ## 2) Petal.Length< setosa ( ) *
  ## 3) Petal.Length>= virginica ( )
  ## 6) Petal.Length< versicolor ( ) *
  ## 7) Petal.Length>= virginica ( ) *

Similarly for the predictions.

  head(predict(model, ds[test,], type="class"))
  ##
  ## setosa setosa setosa setosa setosa setosa
  ## Levels: setosa versicolor virginica
  table(predict(model, ds[test,], type="class"), ds[test, target])
  ##
  ## setosa versicolor virginica
  ## setosa
  ## versicolor
  ## virginica 0
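A table of predicted against actual classes, like those above, can be turned into summary measures. The sketch below uses two short, made-up vectors (hypothetical values, not results from the workshop) so that it runs with base R alone.

```r
# Hypothetical predicted and actual classes, made up for illustration only.
actual    <- factor(c("No", "No", "No",  "No", "Yes", "Yes", "No", "Yes", "No", "No"),
                    levels = c("No", "Yes"))
predicted <- factor(c("No", "No", "Yes", "No", "Yes", "No",  "No", "Yes", "No", "No"),
                    levels = c("No", "Yes"))

# Rows are predictions and columns are the true classes, as on the slides.
cm <- table(Predicted = predicted, Actual = actual)

accuracy  <- sum(diag(cm)) / sum(cm)                # proportion classified correctly
error     <- 1 - accuracy                           # overall error rate
recall    <- cm["Yes", "Yes"] / sum(cm[, "Yes"])    # actual Yes cases that were found
precision <- cm["Yes", "Yes"] / sum(cm["Yes", ])    # predicted Yes cases that were right

cm
c(accuracy = accuracy, error = error, recall = recall, precision = precision)
```

Keeping the same level order in both factors ensures the diagonal of the table holds the correct predictions.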

Summary

Decision Tree Induction.
Most widely deployed machine learning algorithm.
Simple idea, powerful learner.
Available in R through the rpart package.
Related packages include party, Cubist, C50, RWeka (J48).

6 Predictive Data Mining: Ensembles

Multiple Models: Building Multiple Models

General idea developed in Multiple Inductive Learning algorithm (Williams 1987).
Ideas were developed (ACJ 1987, PhD 1990) in the context of:
  observe that variable selection methods don't discriminate;
  so build multiple decision trees;
  then combine into a single model.
Basic idea is that multiple models, like multiple experts, may produce better results when working together, rather than in isolation.
Two approaches covered: Boosting and Random Forests.
Meta learners.

Boosting Algorithm: Boosting Algorithms

Basic idea: boost observations that are hard to model.
Algorithm: iteratively build weak models using a poor learner:
  Build an initial model;
  Identify misclassified cases in the training dataset;
  Boost (over-represent) training observations modelled incorrectly;
  Build a new model on the boosted training dataset;
  Repeat.
The result is an ensemble of weighted models.
"Best off the shelf model builder." (Leo Breiman)

Boosting Example: Error Rate

Notice error rate decreases quickly then flattens.

  plot(m)

[Figure: training error plotted against iteration.]

Boosting Example: Variable Importance

Helps understand the knowledge captured.

  varplot(m)

[Figure: variable importance plot. Variables as listed on the plot: Temp3pm, Pressure9am, MinTemp, Humidity3pm, Temp9am, MaxTemp, Evaporation, Cloud9am, Humidity9am, WindSpeed3pm, WindSpeed9am, Cloud3pm, WindGustSpeed, Sunshine, Pressure3pm, WindGustDir, WindDir3pm, Rainfall, WindDir9am, RainToday.]
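The "boost the misclassified observations" step of the boosting algorithm can be made concrete with one round of re-weighting. This is a sketch of one common formulation (discrete AdaBoost), written in base R with made-up values; it is not taken from the ada package, whose internals may differ.

```r
# One round of AdaBoost-style re-weighting (illustrative sketch, base R only).
# `miss` marks which of 10 hypothetical training observations the current
# weak model classified incorrectly; the values are made up for illustration.
miss <- c(FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE)
w <- rep(1 / length(miss), length(miss))    # start with equal weights

err   <- sum(w[miss])                       # weighted error of the weak model
alpha <- 0.5 * log((1 - err) / err)         # weight of this model in the ensemble

w <- w * exp(ifelse(miss, alpha, -alpha))   # boost misclassified, shrink the rest
w <- w / sum(w)                             # renormalise so the weights sum to 1

round(w, 4)
sum(w[miss])   # the misclassified cases now carry half of the total weight
```

The next weak model is then built on the re-weighted data, so it concentrates on the observations the previous model got wrong.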

Boosting Example: Sample Trees

There are 50 trees in all. Here's the first 3.

  fancyRpartPlot(m$model$trees[[1]])
  fancyRpartPlot(m$model$trees[[2]])
  fancyRpartPlot(m$model$trees[[3]])

[Figure: the first three trees of the boosted model, with splits on Cloud3pm, Pressure3pm, Sunshine, MaxTemp and Humidity3pm.]

Boosting Example: Performance

  predicted <- predict(m, weather[-train,], type="prob")[,2]
  actual <- weather[-train,]$RainTomorrow
  risks <- weather[-train,]$RISK_MM
  riskchart(predicted, actual, risks)

[Figure: risk chart of Performance (%) against Caseload (%), showing Recall (88%), Risk (94%) and Precision, with Risk Scores and Lift axes.]

Boosting Example: Example Applications

ATO Application: What life events affect compliance?
  First application of the technology, 1995
  Decision Stumps: Age > NN; Change in Marital Status
Boosted Neural Networks
  OCR using neural networks as base learners
  Drucker, Schapire, Simard, 1993

Boosting: Summary

1 Boosting is implemented in R in the ada library
2 AdaBoost uses $e^{-m}$; LogitBoost uses $\log(1 + e^{-m})$; Doom II uses $1 - \tanh(m)$
3 AdaBoost tends to be sensitive to noise (addressed by BrownBoost)
4 AdaBoost tends not to overfit, and as new models are added, generalisation error tends to improve.
5 Can be proved to converge to a perfect model if the learners are always better than chance.

Random Forests

Original idea from Leo Breiman and Adele Cutler.
The name is licensed to Salford Systems!
Hence, R package is randomForest.
Typically presented in context of decision trees.
Random Multinomial Logit uses multiple multinomial logit models.

Random Forests (algorithm)

Build many decision trees (e.g., 500).
For each tree:
  Select a random subset of the training set (N);
  Choose different subsets of variables for each node of the decision tree (m << M);
  Build the tree without pruning (i.e., overfit)
Classify a new entity using every decision tree:
  Each tree votes for the entity.
  The decision with the largest number of votes wins!
  The proportion of votes is the resulting score.
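The voting step described above can be sketched in a few lines of base R. The votes below are hypothetical (7 trees and 4 new observations, invented for illustration); a real forest would supply them from its fitted trees.

```r
# Hypothetical class votes: one row per tree, one column per new observation.
votes <- matrix(c("Yes", "No",  "No",  "Yes",
                  "Yes", "No",  "Yes", "Yes",
                  "No",  "No",  "No",  "Yes",
                  "Yes", "No",  "No",  "Yes",
                  "Yes", "Yes", "No",  "Yes",
                  "No",  "No",  "No",  "Yes",
                  "Yes", "No",  "Yes", "Yes"),
                nrow = 7, byrow = TRUE)

score    <- colMeans(votes == "Yes")         # proportion of trees voting "Yes"
decision <- ifelse(score > 0.5, "Yes", "No") # the majority vote wins

score
decision
```

Using an odd number of trees for a two-class problem, as here, avoids tied votes.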

Random Forests Example: RF on Weather Data

  set.seed(42)
  (m <- randomForest(RainTomorrow ~ ., weather[train, -c(1:2, 23)], na.action=na.roughfix, importance=TRUE))
  ##
  ## Call:
  ## randomForest(formula=RainTomorrow ~ ., data=weath...
  ## Type of random forest: classification
  ## Number of trees: 500
  ## No. of variables tried at each split: 4
  ##
  ## OOB estimate of error rate: 13.67%
  ## Confusion matrix:
  ## No Yes class.error
  ## No
  ## Yes

Random Forests Example: Error Rate

Error rate decreases quickly then flattens over the 500 trees.

  plot(m)

[Figure: error plotted against the number of trees.]

Random Forests Example: Variable Importance

Helps understand the knowledge captured.

  varImpPlot(m, main="Variable Importance")

[Figure: variable importance in two panels, MeanDecreaseAccuracy and MeanDecreaseGini. Sunshine, Pressure3pm and Cloud3pm head both panels; Rainfall and RainToday are last in both.]

Random Forests Example: Sample Trees

There are 500 trees in all. Here's some rules from the first tree.

  ## Random Forest Model
  ##
  ## Tree 1 Rule 1 Node 30 Decision No
  ##
  ## 1: Evaporation <= 9
  ## 2: Humidity3pm <= 7
  ## 3: Cloud3pm <= 2.5
  ## 4: WindDir9am IN ("NNE")
  ## 5: Sunshine <= 0.25
  ## 6: Temp3pm <= 7.55
  ##
  ## Tree 1 Rule 2 Node 3 Decision Yes
  ##
  ## 1: Evaporation <= 9
  ## 2: Humidity3pm <= 7

Random Forests Example: Performance

  predicted <- predict(m, weather[-train,], type="prob")[,2]
  actual <- weather[-train,]$RainTomorrow
  risks <- weather[-train,]$RISK_MM
  riskchart(predicted, actual, risks)

[Figure: risk chart of Performance (%) against Caseload (%), showing Recall (92%), Risk (97%) and Precision, with Risk Scores and Lift axes.]

Features of Random Forests: By Breiman

Most accurate of current algorithms.
Runs efficiently on large data sets.
Can handle thousands of input variables.
Gives estimates of variable importance.

Summary

Ensemble: Multiple models working together
Often better than a single model
Variance and bias of the model are reduced
The best available models today - accurate and robust
In daily use in very many areas of application

7 Moving into R and Scripting our Analyses

Data Scientists are Programmers of Data

But...
Data scientists are programmers of data
A GUI can only do so much
R is a powerful statistical language

Data Scientists Desire...
Scripting
Transparency
Repeatability
Sharing

From GUI to CLI: Rattle's Log Tab

Step 1: Load the Dataset

  dsname <- "weather"
  ds <- get(dsname)
  dim(ds)
  ## [1] 366 24
  names(ds)
  ## [1] "Date" "Location" "MinTemp" "...
  ## [5] "Rainfall" "Evaporation" "Sunshine" "...
  ## [9] "WindGustSpeed" "WindDir9am" "WindDir3pm" "...
  ## [13] "WindSpeed3pm" "Humidity9am" "Humidity3pm" "...

Step 2: Observe the Data - Observations

  head(ds)
  ## Date Location MinTemp MaxTemp Rainfall Evapora...
  ## Canberra
  ## Canberra
  ## Canberra
  tail(ds)
  ## Date Location MinTemp MaxTemp Rainfall Evapo...
  ## Canberra
  ## Canberra
  ## Canberra

Step 2: Observe the Data - Structure

  str(ds)
  ## 'data.frame': 366 obs. of 24 variables:
  ## $ Date : Date, format:
  ## $ Location : Factor w/ 46 levels "Adelaide","Alba...
  ## $ MinTemp : num
  ## $ MaxTemp : num
  ## $ Rainfall : num
  ## $ Evaporation : num
  ## $ Sunshine : num
  ## $ WindGustDir : Ord.factor w/ 16 levels "N"<"NNE"<"N...
  ## $ WindGustSpeed: num
  ## $ WindDir9am : Ord.factor w/ 16 levels "N"<"NNE"<"N...
  ## $ WindDir3pm : Ord.factor w/ 16 levels "N"<"NNE"<"N...

Step 2: Observe the Data - Summary

  summary(ds)
  ## Date Location MinTemp...
  ## Min. : Canberra :366 Min. :
  ## 1st Qu.: Adelaide : 0 1st Qu.:
  ## Median : Albany : 0 Median :
  ## Mean : Albury : 0 Mean :
  ## 3rd Qu.: AliceSprings : 0 3rd Qu.:
  ## Max. : BadgerysCreek: 0 Max. :
  ## (Other) : 0...
  ## Rainfall Evaporation Sunshine Wind...
  ## Min. : 0.00 Min. : 0.20 Min. : 0.00 NW...
  ## 1st Qu.: 0.00 1st Qu.: 2.20 1st Qu.: 5.95 NNW...
  ## Median : 0.00 Median : 4.20 Median : 8.60 E...

Step 2: Observe the Data - Variables

  id <- c("Date", "Location")
  target <- "RainTomorrow"
  risk <- "RISK_MM"
  (ignore <- union(id, risk))
  ## [1] "Date" "Location" "RISK_MM"
  (vars <- setdiff(names(ds), ignore))
  ## [1] "MinTemp" "MaxTemp" "Rainfall" "...
  ## [5] "Sunshine" "WindGustDir" "WindGustSpeed" "...
  ## [9] "WindDir3pm" "WindSpeed9am" "WindSpeed3pm" "...
  ## [13] "Humidity3pm" "Pressure9am" "Pressure3pm" "...

Step 3: Clean the Data - Remove Missing

  dim(ds)
  ## [1] 366 24
  sum(is.na(ds[vars]))
  ## [1] 47
  ds <- ds[-attr(na.omit(ds[vars]), "na.action"),]
  dim(ds)
  ## [1] 328 24
  sum(is.na(ds[vars]))
  ## [1] 0
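Removing rows with missing values, as in Step 3, can also be written with complete.cases(). The sketch below is an alternative to the slide's na.omit() idiom, not the workshop's own code, and uses a small made-up data frame so that it runs on its own.

```r
# A small hypothetical dataset with some missing values.
ds <- data.frame(id = 1:5,
                 x  = c(1.2, NA, 3.4, 4.1, NA),
                 y  = c("a", "b", NA, "d", "e"))
vars <- c("x", "y")                    # the variables we intend to model with

sum(is.na(ds[vars]))                   # how many values are missing in vars
keep <- complete.cases(ds[vars])       # TRUE for rows with no missing values
ds <- ds[keep, ]                       # keep only the complete rows

dim(ds)
sum(is.na(ds[vars]))                   # no missing values remain
```

A logical index like `keep` also behaves sensibly when no rows have missing values, a case where negative indexing with an empty set of row numbers needs care.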

Step 3: Clean the Data - Target as Categoric

  summary(ds[target])
  ## RainTomorrow
  ## Min. :0.000
  ## 1st Qu.:0.000
  ## Median :0.000
  ## Mean :0.183
  ## 3rd Qu.:0.000
  ## Max. :1.000
  ds[target] <- as.factor(ds[[target]])
  levels(ds[target]) <- c("No", "Yes")
  summary(ds[target])
  ## RainTomorrow
  ## 0:268
  ## 1: 60

[Figure: bar chart of count by RainTomorrow.]

Step 4: Prepare for Modelling

  (form <- formula(paste(target, "~.")))
  ## RainTomorrow ~.
  (nobs <- nrow(ds))
  ## [1] 328
  train <- sample(nobs, 0.70*nobs)
  length(train)
  ## [1] 229
  test <- setdiff(1:nobs, train)
  length(test)
  ## [1] 99

Step 5: Build the Model - Random Forest

  library(randomForest)
  model <- randomForest(form, ds[train, vars], na.action=na.omit)
  model
  ##
  ## Call:
  ## randomForest(formula=form, data=ds[train, vars],...
  ## Type of random forest: classification
  ## Number of trees: 500
  ## No. of variables tried at each split: 4

Step 6: Evaluate the Model - Risk Chart

  pr <- predict(model, ds[test,], type="prob")[,2]
  riskchart(pr, ds[test, target], ds[test, risk], title="Random Forest - Risk Chart", risk=risk, recall=target, thresholds=c(0.35, 0.5))

[Figure: Random Forest risk chart of Performance (%) against Caseload (%), showing RainTomorrow (98%), RISK_MM (97%) and Precision, with Risk Scores and Lift axes.]
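Converting a 0/1 target into a labelled factor, as in Step 3, can be done in a single call to factor(). The sketch below uses a made-up five-row data frame. It selects the column with `[[ ]]` because, as far as I can tell, calling levels() on a one-column data frame sets an attribute on the data frame rather than relabelling the factor, which may be why the summary on the slide still shows 0 and 1.

```r
# Hypothetical numeric target, made up for illustration only.
ds <- data.frame(RainTomorrow = c(0, 1, 0, 0, 1))
target <- "RainTomorrow"

# Convert to a factor and attach readable labels in one step.
ds[[target]] <- factor(ds[[target]], levels = c(0, 1), labels = c("No", "Yes"))

levels(ds[[target]])
table(ds[[target]])
```

Stating `levels` explicitly fixes which number maps to which label, even if one of the classes happens to be absent from the data.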

8 Literate Data Mining in R

Motivation: Why is Reproducibility Important?

Your Research Leader or Executive drops by and asks:

"Remember that research you did last year? I've heard there is an update on the data that you used. Can you add the new data in and repeat the same analysis?"

"Jo Bloggs did a great analysis of the company returns data just before she left. Can you get someone else to analyse the new data set using the same methods, and so produce an updated report that we can present to the Exec next week?"

"The fraud case you provided an analysis of last year has finally reached the courts. We need to ensure we have a clear trail of the data sources, the analyses performed, and the results obtained, to stand up in court. Could you document these please."

Motivation: Literate Data Mining Overview

One document to intermix the analysis, code, and results
Authors productive with narrative and code in one document
Sweave (Leisch 2002) and now KnitR (Yihui 2011)
Embed R code into LaTeX documents for typesetting
KnitR also supports publishing to the web

Motivation: Why Reproducible Data Mining?

Automatically regenerate documents when code, data, or assumptions change.
Eliminate errors that occur when transcribing results into documents.
Record the context for the analysis and decisions made about the type of analysis to perform in the one place.
Document the processes to provide integrity for the conclusions of the analysis.
Share approach with others for peer review and for learning from each other: engender a continuous learning environment.

Motivation: Prime Objective: Trustworthy Software

"Those who receive the results of modern data analysis have limited opportunity to verify the results by direct observation. Users of the analysis have no option but to trust the analysis, and by extension the software that produced it. This places an obligation on all creators of software to program in such a way that the computations can be understood and trusted. This obligation I label the Prime Directive."
John Chambers (2008) Software for Data Analysis: Programming with R

Motivation: Beautiful Output by KnitR

KnitR combined with LaTeX will
Intermix analysis and results of analysis
Automatically generate graphics and tables
Support reproducible and transparent analysis
Produce the best looking reports.

Motivation: Beautiful Output by Default

The reader wants to read the document and easily do so!
Code highlighting is done automatically
Default theme is carefully designed
Many other themes are available
R Code is properly reformatted
Analyses (Graphs and Tables) automatically included.

Using RStudio

Simplified interaction with R, LaTeX, and KnitR
Executes R code one line at a time
Formats LaTeX documents and provides spell checking
A single click compile to PDF and synchronised views
Demonstrate: Startup and explore RStudio.

Basic LaTeX Markup: Introducing LaTeX

A text markup language rather than a WYSIWYG.
Based on TeX from 1977: very stable and powerful.
LaTeX is an easier to use macro package built on TeX.
Ensures consistent style (layout, fonts, tables, maths, etc.)
Automatic indexes, footnotes and references.
Documents are well structured and are clear text.
Has a learning curve.

Basic LaTeX Markup: Basic LaTeX Usage

  \documentclass{article}
  \begin{document}
  \end{document}

Demonstrate: Create a new Sweave document in RStudio.

Basic LaTeX Markup: Structures

  \documentclass{article}
  \begin{document}
  \section{Introduction}
  ...
  \subsection{Concepts}
  ...
  \end{document}

Basic LaTeX Markup: Formats

  \documentclass{article}
  \begin{document}
  \begin{itemize}
  \item ABC
  \item DEF
  \end{itemize}
  This is \textbf{bold} text or \textit{italic} text,...
  \end{document}

Basic LaTeX Markup: RStudio Support for LaTeX

RStudio provides excellent support for working with LaTeX documents
Helps to avoid having to know too much about LaTeX
Best illustrated through a demonstration:
  Format menu
  Section commands
  Font commands
  List commands
  Verbatim/Block commands
  Spell Checker
  Compile PDF
Demonstrate: Start a new document, add contents, format to PDF.

Incorporating R Code

We insert R code in a Chunk starting with << >>=
We terminate the Chunk with @
Save LaTeX with extension Rnw

This Chunk

  <<simple_example>>=
  x <- sum(1:10)
  x
  @

Produces

  x <- sum(1:10)
  x
  ## [1] 55

Demonstrate: Do this in RStudio.

Incorporating R Code: Making You Look Good

  <<format_example>>=
  for(i in 1:5) {
    j <- cos(sin(i)*i^2)+3
    print(j-5)
  }
  ## [1]
  ## [1]
  ## [1]
  ## [1] -.03

Incorporating R Code: R Within the Text

Include information about data within the narrative.
We can do that with \Sexpr{...}.

  Our dataset has \Sexpr{nrow(ds)} observations of \Sexpr{ncol(ds)} variables.

Becomes

  Our dataset has 8269 observations of 24 variables.

Better Still:

  \Sexpr{format(nrow(ds), big.mark=",")}

  Our dataset has 82,69 observations of 24 variables.

Formatting Tables and Plots: A Simple Table

  library(xtable)
  obs <- sample(1:nrow(weatherAUS), 8)
  vars <- 2:6
  xtable(weatherAUS[obs, vars])

[Table: columns Location, MinTemp, MaxTemp, Rainfall, Evaporation; rows for Cairns, Canberra, Cobar, SalmonGums, Canberra, PerthAirport, Darwin, Ballarat. Values not recoverable.]

Formatting Tables and Plots: Table: Exclude Row Names

  print(xtable(weatherAUS[obs, vars]), include.rownames=FALSE)

[Table: the same columns and rows, without row names.]

Formatting Tables and Plots: Table: Limit Number of Digits

  print(xtable(weatherAUS[obs, vars], digits=1), include.rownames=FALSE)

[Table: the same columns and rows, values shown to one decimal place.]
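The big.mark formatting used inside \Sexpr{} above can be tried directly at the R console before it goes into a document. A minimal sketch follows, with a made-up data frame standing in for the dataset (its size is an assumption chosen for illustration).

```r
# Hypothetical dataset: 12,345 rows and 2 columns, made up for illustration.
ds <- data.frame(x = seq_len(12345), y = 0)

n_obs  <- format(nrow(ds), big.mark = ",")   # insert thousands separators
n_vars <- ncol(ds)

# The sentence that \Sexpr{} would place in the narrative text.
msg <- paste("Our dataset has", n_obs, "observations of", n_vars, "variables.")
msg
```

format() returns a character string, so the value is ready to paste into text but should not be used for further arithmetic.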

Formatting Tables and Plots: Table: Tiny Font

  vars <- 2:8
  print(xtable(weatherAUS[obs, vars], digits=0), size="tiny", include.rownames=FALSE)

[Table: columns Location, MinTemp, MaxTemp, Rainfall, Evaporation, Sunshine, WindGustDir; rows for Cairns (SSE), Canberra (WSW), Cobar (ENE), SalmonGums (SE), Canberra (S), PerthAirport (W), Darwin (WNW), Ballarat (SE). Numeric values not recoverable.]

Formatting Tables and Plots: Table: Column Alignment

  vars <- 2:8
  print(xtable(weatherAUS[obs, vars], digits=0, align="rlrrrrrr"), size="tiny")

[Table: the same columns and rows, with row names shown and columns aligned as specified.]

Formatting Tables and Plots: Table: Caption

  print(xtable(weatherAUS[obs, vars], digits=1, caption="This is the table caption."), size="tiny")

[Table: the same columns and rows, captioned "Table 1: This is the table caption."]

Formatting Tables and Plots: Plots

  library(ggplot2)
  cities <- c("Canberra", "Darwin", "Melbourne", "Sydney")
  ds <- subset(weatherAUS, Location %in% cities & ! is.na(Temp3pm))
  g <- ggplot(ds, aes(Temp3pm, colour=Location, fill=Location))
  g <- g + geom_density(alpha = 0.55)
  print(g)

[Figure: density of Temp3pm by Location for Canberra, Darwin, Melbourne and Sydney.]

Knitting: Our First KnitR Document

Create a KnitR Document: New R Sweave

Knitting: Setup KnitR

We wish to use KnitR rather than the older Sweave processor.
In RStudio we can configure the options to use knitr:
  Select Tools then Options
  Choose the Sweave group
  Choose knitr for "Weave Rnw files using:"
  The remaining defaults should be okay
  Click Apply and then OK

Knitting: Simple KnitR Document

Insert the following into your new KnitR document:

  \title{Sample KnitR Document}
  \author{Graham Williams}
  \maketitle
  \section*{My First Section}
  This is some text that is automatically typeset by the LaTeX processor to produce well formatted quality output as PDF.

Your turn: Click Compile PDF to view the result.

Knitting: Simple KnitR Document - Resulting PDF

[Figure: result of Compile PDF.]

Knitting: Including R Commands in KnitR

KnitR: Add R Commands

R code can be used to generate results into the document:

  <<echo=FALSE, message=FALSE>>=
  library(rattle) # Provides the weather dataset
  library(ggplot2) # Provides the qplot() function
  ds <- weather
  qplot(MinTemp, MaxTemp,

Your turn: Click Compile PDF to view the result.

Knitting: KnitR Document With R Code - PDF with Plot

[Figure: result of Compile PDF, including the plot.]

Knitting: Basics Cheat Sheet - LaTeX Basics

  \subsection*{...}      % Introduce a Sub Section
  \subsubsection*{...}   % Introduce a Sub Sub Section
  \textbf{...}           % Bold font
  \textit{...}           % Italic font
  \begin{itemize}        % A bullet list
  \item ...
  \item ...
  \end{itemize}

Plus an extensive collection of other markup and capabilities.

Knitting: Basics Cheat Sheet - KnitR Basics

  echo=FALSE                    # Do not display the R code
  eval=TRUE                     # Evaluate the R code
  results="hide"                # Hide the results of the R commands
  fig.width=10                  # Extend figure width from 7 to 10 inches
  fig.height=8                  # Extend figure height from 7 to 8 inches
  out.width="0.8\\textwidth"    # Fit figure 80% page width
  out.height="0.5\\textheight"  # Fit figure 50% page height

Plus an extensive collection of other options.

Resources and References

OnePageR: Tutorial Notes
Rattle:
Guides:
Practise:
Book: Data Mining using Rattle/R
Chapter: Rattle and Other Tales
Paper: A Data Mining GUI for R, R Journal, Volume 1(2)


More information

Applied Data Mining Analysis: A Step-by-Step Introduction Using Real-World Data Sets

Applied Data Mining Analysis: A Step-by-Step Introduction Using Real-World Data Sets Applied Data Mining Analysis: A Step-by-Step Introduction Using Real-World Data Sets http://info.salford-systems.com/jsm-2015-ctw August 2015 Salford Systems Course Outline Demonstration of two classification

More information

Data Mining Practical Machine Learning Tools and Techniques

Data Mining Practical Machine Learning Tools and Techniques Ensemble learning Data Mining Practical Machine Learning Tools and Techniques Slides for Chapter 8 of Data Mining by I. H. Witten, E. Frank and M. A. Hall Combining multiple models Bagging The basic idea

More information

R Tools Evaluation. A review by Analytics @ Global BI / Local & Regional Capabilities. Telefónica CCDO May 2015

R Tools Evaluation. A review by Analytics @ Global BI / Local & Regional Capabilities. Telefónica CCDO May 2015 R Tools Evaluation A review by Analytics @ Global BI / Local & Regional Capabilities Telefónica CCDO May 2015 R Features What is? Most widely used data analysis software Used by 2M+ data scientists, statisticians

More information

How To Make A Credit Risk Model For A Bank Account

How To Make A Credit Risk Model For A Bank Account TRANSACTIONAL DATA MINING AT LLOYDS BANKING GROUP Csaba Főző csaba.fozo@lloydsbanking.com 15 October 2015 CONTENTS Introduction 04 Random Forest Methodology 06 Transactional Data Mining Project 17 Conclusions

More information

Grow Revenues and Reduce Risk with Powerful Analytics Software

Grow Revenues and Reduce Risk with Powerful Analytics Software Grow Revenues and Reduce Risk with Powerful Analytics Software Overview Gaining knowledge through data selection, data exploration, model creation and predictive action is the key to increasing revenues,

More information

Statistical Data Mining. Practical Assignment 3 Discriminant Analysis and Decision Trees

Statistical Data Mining. Practical Assignment 3 Discriminant Analysis and Decision Trees Statistical Data Mining Practical Assignment 3 Discriminant Analysis and Decision Trees In this practical we discuss linear and quadratic discriminant analysis and tree-based classification techniques.

More information

Data Mining Classification: Decision Trees

Data Mining Classification: Decision Trees Data Mining Classification: Decision Trees Classification Decision Trees: what they are and how they work Hunt s (TDIDT) algorithm How to select the best split How to handle Inconsistent data Continuous

More information

Social Media Mining. Data Mining Essentials

Social Media Mining. Data Mining Essentials Introduction Data production rate has been increased dramatically (Big Data) and we are able store much more data than before E.g., purchase data, social media data, mobile phone data Businesses and customers

More information

Delivery of an Analytics Capability

Delivery of an Analytics Capability Delivery of an Analytics Capability Ensembles and Model Delivery for Tax Compliance Graham Williams Senior Director and Chief Data Miner, Analytics Office of the Chief Knowledge Officer Australian Taxation

More information

partykit: A Toolkit for Recursive Partytioning

partykit: A Toolkit for Recursive Partytioning partykit: A Toolkit for Recursive Partytioning Achim Zeileis, Torsten Hothorn http://eeecon.uibk.ac.at/~zeileis/ Overview Status quo: R software for tree models New package: partykit Unified infrastructure

More information

BIDM Project. Predicting the contract type for IT/ITES outsourcing contracts

BIDM Project. Predicting the contract type for IT/ITES outsourcing contracts BIDM Project Predicting the contract type for IT/ITES outsourcing contracts N a n d i n i G o v i n d a r a j a n ( 6 1 2 1 0 5 5 6 ) The authors believe that data modelling can be used to predict if an

More information

An Introduction to Data Mining. Big Data World. Related Fields and Disciplines. What is Data Mining? 2/12/2015

An Introduction to Data Mining. Big Data World. Related Fields and Disciplines. What is Data Mining? 2/12/2015 An Introduction to Data Mining for Wind Power Management Spring 2015 Big Data World Every minute: Google receives over 4 million search queries Facebook users share almost 2.5 million pieces of content

More information

Leveraging Ensemble Models in SAS Enterprise Miner

Leveraging Ensemble Models in SAS Enterprise Miner ABSTRACT Paper SAS133-2014 Leveraging Ensemble Models in SAS Enterprise Miner Miguel Maldonado, Jared Dean, Wendy Czika, and Susan Haller SAS Institute Inc. Ensemble models combine two or more models to

More information

Lavastorm Analytic Library Predictive and Statistical Analytics Node Pack FAQs

Lavastorm Analytic Library Predictive and Statistical Analytics Node Pack FAQs 1.1 Introduction Lavastorm Analytic Library Predictive and Statistical Analytics Node Pack FAQs For brevity, the Lavastorm Analytics Library (LAL) Predictive and Statistical Analytics Node Pack will be

More information

CI6227: Data Mining. Lesson 11b: Ensemble Learning. Data Analytics Department, Institute for Infocomm Research, A*STAR, Singapore.

CI6227: Data Mining. Lesson 11b: Ensemble Learning. Data Analytics Department, Institute for Infocomm Research, A*STAR, Singapore. CI6227: Data Mining Lesson 11b: Ensemble Learning Sinno Jialin PAN Data Analytics Department, Institute for Infocomm Research, A*STAR, Singapore Acknowledgements: slides are adapted from the lecture notes

More information

Fast Analytics on Big Data with H20

Fast Analytics on Big Data with H20 Fast Analytics on Big Data with H20 0xdata.com, h2o.ai Tomas Nykodym, Petr Maj Team About H2O and 0xdata H2O is a platform for distributed in memory predictive analytics and machine learning Pure Java,

More information

Université de Montpellier 2 Hugo Alatrista-Salas : hugo.alatrista-salas@teledetection.fr

Université de Montpellier 2 Hugo Alatrista-Salas : hugo.alatrista-salas@teledetection.fr Université de Montpellier 2 Hugo Alatrista-Salas : hugo.alatrista-salas@teledetection.fr WEKA Gallirallus Zeland) australis : Endemic bird (New Characteristics Waikato university Weka is a collection

More information

Classification and Prediction

Classification and Prediction Classification and Prediction Slides for Data Mining: Concepts and Techniques Chapter 7 Jiawei Han and Micheline Kamber Intelligent Database Systems Research Lab School of Computing Science Simon Fraser

More information

Data Mining Methods: Applications for Institutional Research

Data Mining Methods: Applications for Institutional Research Data Mining Methods: Applications for Institutional Research Nora Galambos, PhD Office of Institutional Research, Planning & Effectiveness Stony Brook University NEAIR Annual Conference Philadelphia 2014

More information

DECISION TREE INDUCTION FOR FINANCIAL FRAUD DETECTION USING ENSEMBLE LEARNING TECHNIQUES

DECISION TREE INDUCTION FOR FINANCIAL FRAUD DETECTION USING ENSEMBLE LEARNING TECHNIQUES DECISION TREE INDUCTION FOR FINANCIAL FRAUD DETECTION USING ENSEMBLE LEARNING TECHNIQUES Vijayalakshmi Mahanra Rao 1, Yashwant Prasad Singh 2 Multimedia University, Cyberjaya, MALAYSIA 1 lakshmi.mahanra@gmail.com

More information

TOWARDS SIMPLE, EASY TO UNDERSTAND, AN INTERACTIVE DECISION TREE ALGORITHM

TOWARDS SIMPLE, EASY TO UNDERSTAND, AN INTERACTIVE DECISION TREE ALGORITHM TOWARDS SIMPLE, EASY TO UNDERSTAND, AN INTERACTIVE DECISION TREE ALGORITHM Thanh-Nghi Do College of Information Technology, Cantho University 1 Ly Tu Trong Street, Ninh Kieu District Cantho City, Vietnam

More information

Ensembles and PMML in KNIME

Ensembles and PMML in KNIME Ensembles and PMML in KNIME Alexander Fillbrunn 1, Iris Adä 1, Thomas R. Gabriel 2 and Michael R. Berthold 1,2 1 Department of Computer and Information Science Universität Konstanz Konstanz, Germany First.Last@Uni-Konstanz.De

More information

Quick Start. Creating a Scoring Application. RStat. Based on a Decision Tree Model

Quick Start. Creating a Scoring Application. RStat. Based on a Decision Tree Model Creating a Scoring Application Based on a Decision Tree Model This Quick Start guides you through creating a credit-scoring application in eight easy steps. Quick Start Century Corp., an electronics retailer,

More information

Chapter 4 Displaying and Describing Categorical Data

Chapter 4 Displaying and Describing Categorical Data Chapter 4 Displaying and Describing Categorical Data Chapter Goals Learning Objectives This chapter presents three basic techniques for summarizing categorical data. After completing this chapter you should

More information

Data mining and statistical models in marketing campaigns of BT Retail

Data mining and statistical models in marketing campaigns of BT Retail Data mining and statistical models in marketing campaigns of BT Retail Francesco Vivarelli and Martyn Johnson Database Exploitation, Segmentation and Targeting group BT Retail Pp501 Holborn centre 120

More information

A Data Mining Tutorial

A Data Mining Tutorial A Data Mining Tutorial Presented at the Second IASTED International Conference on Parallel and Distributed Computing and Networks (PDCN 98) 14 December 1998 Graham Williams, Markus Hegland and Stephen

More information

COC131 Data Mining - Clustering

COC131 Data Mining - Clustering COC131 Data Mining - Clustering Martin D. Sykora m.d.sykora@lboro.ac.uk Tutorial 05, Friday 20th March 2009 1. Fire up Weka (Waikako Environment for Knowledge Analysis) software, launch the explorer window

More information

Data exploration with Microsoft Excel: analysing more than one variable

Data exploration with Microsoft Excel: analysing more than one variable Data exploration with Microsoft Excel: analysing more than one variable Contents 1 Introduction... 1 2 Comparing different groups or different variables... 2 3 Exploring the association between categorical

More information

Data Mining. Nonlinear Classification

Data Mining. Nonlinear Classification Data Mining Unit # 6 Sajjad Haider Fall 2014 1 Nonlinear Classification Classes may not be separable by a linear boundary Suppose we randomly generate a data set as follows: X has range between 0 to 15

More information

Gerry Hobbs, Department of Statistics, West Virginia University

Gerry Hobbs, Department of Statistics, West Virginia University Decision Trees as a Predictive Modeling Method Gerry Hobbs, Department of Statistics, West Virginia University Abstract Predictive modeling has become an important area of interest in tasks such as credit

More information

Fine Particulate Matter Concentration Level Prediction by using Tree-based Ensemble Classification Algorithms

Fine Particulate Matter Concentration Level Prediction by using Tree-based Ensemble Classification Algorithms Fine Particulate Matter Concentration Level Prediction by using Tree-based Ensemble Classification Algorithms Yin Zhao School of Mathematical Sciences Universiti Sains Malaysia (USM) Penang, Malaysia Yahya

More information

Better credit models benefit us all

Better credit models benefit us all Better credit models benefit us all Agenda Credit Scoring - Overview Random Forest - Overview Random Forest outperform logistic regression for credit scoring out of the box Interaction term hypothesis

More information

Oracle Data Miner (Extension of SQL Developer 4.0)

Oracle Data Miner (Extension of SQL Developer 4.0) An Oracle White Paper September 2013 Oracle Data Miner (Extension of SQL Developer 4.0) Integrate Oracle R Enterprise Mining Algorithms into a workflow using the SQL Query node Denny Wong Oracle Data Mining

More information

Data mining techniques: decision trees

Data mining techniques: decision trees Data mining techniques: decision trees 1/39 Agenda Rule systems Building rule systems vs rule systems Quick reference 2/39 1 Agenda Rule systems Building rule systems vs rule systems Quick reference 3/39

More information

2 Decision tree + Cross-validation with R (package rpart)

2 Decision tree + Cross-validation with R (package rpart) 1 Subject Using cross-validation for the performance evaluation of decision trees with R, KNIME and RAPIDMINER. This paper takes one of our old study on the implementation of cross-validation for assessing
