1 R: A Language for Data Mining. 2 Data Mining, Rattle, and R. 3 Loading, Cleaning, Exploring Data in Rattle. 4 Descriptive Data Mining

Transcription

A Data Mining Workshop
Excavating Knowledge from Data: Introducing Data Mining using R

Graham.Williams@togaware.com
Data Scientist, Australian Taxation Office
Adjunct Professor, Australian National University
Adjunct Professor, University of Canberra
Fellow, Institute of Analytics Professionals of Australia

Visit: for Workshop Notes
http://togaware.com Copyright 2014, Graham.Williams@togaware.com

Workshop Overview
  1 R: A Language for Data Mining
  2 Data Mining, Rattle, and R
  3 Loading, Cleaning, Exploring Data in Rattle
  4 Descriptive Data Mining
  Moving into R and Scripting our Analyses

R: A Language for Data Mining

Installing R and Rattle
  The first task is to install R.
  As free/libre open source software (FLOSS or FOSS), R and Rattle are available to all, with no limitations on our freedom to use and share the software, except to share and share alike.
  Visit CRAN at
  Visit Rattle at
  Linux: install packages (Ubuntu is recommended)
      $ wajig install r-recommended r-cran-rattle
  Windows: download and install from CRAN
  MacOSX: download and install from CRAN

Why do Data Science with R?
  Most widely used Data Mining and Machine Learning package
    Machine Learning
    Statistics
    Software Engineering and Programming with Data
    But not the nicest of languages for a Computer Scientist!
  Free (Libre) Open Source Statistical Software
    ... all modern statistical approaches
    ... many/most machine learning algorithms
    ... opportunity to readily add new algorithms
  That is important for us in the research community: get our algorithms out there and being used, for impact!

How Popular is R? Discussion List Traffic
  [Figure: monthly traffic on each software's main discussion list.]

How Popular is R? Discussion Topics
  [Figure: number of discussions on popular Q&A forums, 2013.]

How Popular is R? R versus SAS
  [Figure: number of R/SAS related posts to Stack Overflow by week.]

How Popular is R? Professional Forums
  [Figure: number registered for the main discussion group for each software.]

How Popular is R? Used in Analytics Competitions
  [Figure: software used in data analysis competitions in 2011.]

How Popular is R? User Survey
  [Figure: Rexer Analytics Survey 2010 results for data mining/analytic tools.]

What is R? The Video
  A 90 Second Promo from Revolution Analytics

Data Mining: Definition
  A data driven analysis to uncover otherwise unknown but useful patterns in large datasets, to discover new knowledge and to develop predictive models, turning data and information into knowledge and (one day perhaps) wisdom, in a timely manner.

Data Mining
  Application of
    Machine Learning
    Statistics
    Software Engineering and Programming with Data
    Effective Communications and Intuition
  ... to datasets that vary by Volume, Velocity, Variety, Value, Veracity
  ... to discover new knowledge
  ... to improve business outcomes
  ... to deliver better tailored services

Data Mining in Research
  Health Research: adverse reactions using linked Pharmaceutical, General Practitioner, Hospital, Pathology datasets.
  Astronomy: microlensing events in the Large Magellanic Cloud of several million observed stars (out of 10 billion).
  Psychology: investigation of age-of-onset for Alzheimer's disease from 75 variables for 800 people.
  Social Sciences: survey evaluation; social network analysis, identifying key influencers.

Data Mining in Government
  Australian Taxation Office: Lodgment ($0M), Tax Havens ($50M), Tax Fraud ($250M)
  Immigration and Border Control: check passengers before boarding
  Health and Human Services: doctor shoppers, over servicing

The Business of Data Mining
  SAS has annual revenues of $3B (2013)
  IBM bought SPSS for $1.2B (2009)
  Analytics is a >$100B business and >$320B by 2020
  Amazon, eBay/PayPal, Google, Facebook, LinkedIn, ...
  Shortage of 180,000 data scientists in US in 2018 (McKinsey) ...

Basic Tools: Data Mining Algorithms
  Cluster Analysis (kmeans, wskm)
  Association Analysis (arules)
  Linear Discriminant Analysis (lda)
  Logistic Regression (glm)
  Decision Trees (rpart, wsrpart)
  Random Forests (randomForest, wsrf)
  Boosted Stumps (ada)
  Neural Networks (nnet)
  Support Vector Machines (kernlab)
  ...
  That's a lot of tools to learn in R! Many with different interfaces and options.

A GUI for Data Mining

Why a GUI?
  Statistics can be complex and traps await
  So many tools in R to deliver insights
  Effective analyses should be scripted
  Scripting also required for repeatability
  R is a language for programming with data
  How to remember how to do all of this in R?
  How to skill up 50 data analysts with Data Mining?

Users of Rattle
  Today, Rattle is used world wide in many industries
    Health analytics
    Customer segmentation and marketing
    Fraud detection
    Government
  It is used by universities to teach Data Mining
  Within research projects for basic analyses
  Consultants and analytics teams across business
  It is and will remain freely available.
  CRAN and

Setting Things Up

Installation
  Rattle is built using R
  Need to download and install R from cran.r-project.org
  Recommend also installing RStudio from
  Then start up RStudio and install Rattle:
      install.packages("rattle")
  Then we can start up Rattle:
      library(rattle)  # Attach the package so that rattle() is found
      rattle()
  Required packages are loaded as needed.

A Tour Thru Rattle (screenshots)
  Startup
  Loading Data
  Explore Distribution
  Explore Correlations
  Hierarchical Cluster

A Tour Thru Rattle (screenshots, continued)
  Decision Tree
  Decision Tree Plot
  Random Forest
  Risk Chart
  [Figure: Risk Chart, Random Forest, weather.csv [test], RainTomorrow. Performance (%) against Caseload (%), with curves for RainTomorrow (92%), Rain in MM (97%) and Precision, plus Lift and Risk Scores scales.]

Programming with Data

Data Scientists are Programmers of Data
  But...
  Data scientists are programmers of data
  A GUI can only do so much
  R is a powerful statistical language
  Data Scientists Desire...
    Scripting
    Transparency
    Repeatability
    Sharing

From GUI to CLI: Rattle's Log Tab

From GUI to CLI: Rattle's Log Tab

Interface Notes
  Work through the tabs from left to right
  After setting up a tab we need to Execute it
  Projects save the current Rattle state
  Projects can be restored at a later time

The Power of Free/Libre and Open Source Software

Tools
  Ubuntu GNU/Linux operating system
    Feature rich toolkit, up-to-date, easy to install, FLOSS
  RStudio
    Easy to use integrated development environment, FLOSS
    Powerful alternative is Emacs with ESS (Emacs Speaks Statistics), FLOSS
  R Statistical Software Language
    Extensive, powerful, thousands of contributors, FLOSS
  KnitR and LaTeX
    Produce beautiful documents, easily reproducible, FLOSS

Using Ubuntu
  Desktop Operating System (GNU/Linux)
    Replacing Windows and OSX
  The GNU Tool Suite, based on Unix: significant heritage
    Multiple specialised single task tools, working well together
    Compared to a single application trying to do it all
  Powerful data processing from the command line:
    grep, awk, head, tail, wc, sed, perl, python, most, diff, make, paste, join, patch, ...
  For interacting with R start up RStudio from the Dash

RStudio Interface (screenshots)
  RStudio: The Default Three Panels
  RStudio: With R Script File Editor Panel

Simple Plots

Scatterplot: R Code
  Our first little bit of R code. Load a couple of packages from the R library:
      library(rattle)   # Provides the weather dataset
      library(ggplot2)  # Provides the qplot() function
  Then produce a quick plot using qplot():
      ds <- weather
      qplot(MinTemp, MaxTemp, data=ds)

Scatterplot: Plot
  [Figure: scatterplot of MaxTemp against MinTemp.]

Scatterplot: RStudio
  [Screenshot]

Installing Packages

Missing Packages
  Tools, then Install Packages...

RStudio: Installing ggplot2
  [Screenshot]

RStudio Keyboard Shortcuts
  These will become very useful!
  Editor:
    Ctrl-Enter will send the line of code to the R console
    Ctrl-2 will move the cursor to the Console
  Console:
    UpArrow will cycle through previous commands
    Ctrl-UpArrow will search previous commands
    Tab will complete function names and list the arguments
    Ctrl-1 will move the cursor to the Editor

Basic R Commands

Basic R
      library(rattle)  # Load the weather dataset.
      head(weather)    # First 6 observations of the dataset.
      ## Date Location MinTemp MaxTemp Rainfall Evapora...
      ## Canberra
      ## Canberra
      ## Canberra
      str(weather)     # Structure of the variables in the dataset.
      ## 'data.frame': 366 obs. of 24 variables:
      ## $ Date : Date, format: ...
      ## $ Location : Factor w/ 46 levels "Adelaide","Alba...
      ## $ MinTemp : num

Basic R
      summary(weather)  # Univariate summary of the variables.
      ## Date Location MinTemp...
      ## Min. : Canberra :366 Min. :
      ## 1st Qu.: Adelaide : 0 1st Qu.:
      ## Median : Albany : 0 Median :
      ## Mean : Albury : 0 Mean :
      ## 3rd Qu.: AliceSprings : 0 3rd Qu.:
      ## Max. : BadgerysCreek: 0 Max. :
      ## (Other) : 0...
      ## Rainfall Evaporation Sunshine WindGust...
      ## Min. : 0.00 Min. : 0.20 Min. : 0.00 NW :...
      ## 1st Qu.: 0.00 1st Qu.: 2.20 1st Qu.: 5.95 NNW :...
      ## Median : 0.00 Median : 4.20 Median : 8.60 E :...
      ## Mean : 1.43 Mean : 4.52 Mean : 7.9 WNW :...
      ## 3rd Qu.: 3rd Qu.: 3rd Qu.:10.50 ENE :...

Visualising Data: Visual Summaries

Add A Little Colour
      qplot(Humidity3pm, Pressure3pm, colour=RainTomorrow, data=ds)
  [Figure: scatterplot of Pressure3pm against Humidity3pm, coloured by RainTomorrow (No, Yes).]

Careful with Categorics
      qplot(WindGustDir, Pressure3pm, data=ds)
  [Figure: Pressure3pm against WindGustDir (N, NNE, NE, ENE, E, ESE, SE, SSE, S, SSW, SW, WSW, W, WNW, NW, NNW, NA).]

Add A Little Jitter
      qplot(WindGustDir, Pressure3pm, data=ds, geom="jitter")
  [Figure: the same plot with the points jittered.]

And Some Colour
      qplot(WindGustDir, Pressure3pm, data=ds, colour=WindGustDir, geom="jitter")
  [Figure: the jittered plot, coloured by WindGustDir.]

Getting Help
  Precede the command with ?

Loading, Cleaning, Exploring Data in Rattle (screenshots)
  Loading Data
  Exploring Data
  Test Data
  Transform Data

Descriptive Data Mining

What is Cluster Analysis?
  Cluster: a collection of observations
    Similar to one another within the same cluster
    Dissimilar to the observations in other clusters
  Cluster analysis
    Grouping a set of data observations into classes
  Clustering is unsupervised classification: no predefined classes; descriptive data mining.
  Typical applications
    As a stand-alone tool to get insight into data distribution
    As a preprocessing step for other algorithms

What Is Good Clustering?
  High quality:
    high intra-class similarity
    low inter-class similarity
  The quality depends on:
    similarity measure
    algorithm for searching
  It also depends on the opinion of the user, and the algorithm's ability to discover hidden patterns that are of interest to the user.

Major Clustering Approaches
  Partitioning algorithms (kmeans, pam, clara, fanny): construct various partitions and then evaluate them by some criterion. A fixed number of clusters, k, is generated. Start with an initial (perhaps random) cluster.
  Hierarchical algorithms (hclust, agnes, diana): create a hierarchical decomposition of the set of observations using some criterion.
  Density-based algorithms: based on connectivity and density functions.
  Grid-based algorithms: based on a multiple-level granularity structure.
  Model-based algorithms (mclust for mixture of Gaussians): a model is hypothesized for each of the clusters and the idea is to find the best fit of that model.

KMeans Clustering

Association Rule Mining
  An unsupervised learning algorithm; descriptive data mining.
  Identify items (patterns) that occur frequently together in a given set of data.
  Patterns = associations, correlations, causal structures (Rules).
  Data = sets of items in...
    transactional database
    relational database
    complex information repositories
  Rule: Body → Head [support, confidence]
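The [support, confidence] pair attached to a rule can be checked by hand on a small example. The sketch below is an illustration only, not code from the workshop: the baskets are made up, and contains() and rule_stats() are hypothetical helper names. It uses base R rather than the arules package, and assumes the usual definitions, where support is the fraction of transactions containing both body and head, and confidence is that fraction divided by the fraction containing the body.

```r
# Toy transactions (made up for illustration): each element is one basket.
baskets <- list(
  c("nappies", "beer", "chips"),
  c("nappies", "beer"),
  c("nappies", "milk"),
  c("beer", "chips"),
  c("milk", "bread")
)

# TRUE for each basket that contains every item in `items`.
contains <- function(items) {
  vapply(baskets, function(b) all(items %in% b), logical(1))
}

# Support and confidence of the rule body -> head.
rule_stats <- function(body, head) {
  both <- mean(contains(c(body, head)))      # fraction with body and head
  c(support = both,
    confidence = both / mean(contains(body)))
}

rs <- rule_stats("nappies", "beer")
print(rs)   # support 0.4 (2 of 5 baskets), confidence 2/3 (2 of 3 nappies baskets)
```

On real transaction data a dedicated implementation such as arules would normally be used; the point here is only to make the two numbers in the rule notation concrete.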

Association Rules: Typical Applications
  Link analysis
  Market basket analysis
  Cross marketing
  Customers who purchase...

Association Rules: Examples
  Friday, Nappies → Beer [0.5%, 60%]
  Age [20, 30], Income [20K, 30K] → MP3Player [2%, 60%]
  Maths, CS → HDinCS [1%, 75%]
  Gladiator, Patriot → Sixth Sense [0.1%, 90%]
  Statins, Peritonitis → Chronic Renal Failure [0.1%, 32%]

Health Insurance Commission: Association Rules
  Associations on episode database for pathology services
  6.8 million records x 20 attributes (3.5GB)
  5 months preprocessing then 2 weeks data mining
  Goal: find associations between tests
  cmin = 50% and smin = 1%, 0.5%, 0.25% (1% of 6.8 million = 68,000)
  Unexpected/unnecessary combination of services
  Refuse cover: saves $550,000 per year

Predictive Modelling: Classification
  The goal of classification is to build models (sentences) in a knowledge representation (language) from examples of past decisions.
  The model is to be used on unseen cases to make decisions.
  Often referred to as supervised learning.
  Common approaches: decision trees; neural networks; logistic regression; support vector machines.

Language: Decision Trees
  Knowledge representation: a flow-chart-like tree structure
  Internal nodes denote a test on a variable
  A branch represents an outcome of the test
  Leaf nodes represent class labels or class distribution
  [Figure: example tree testing Age (< 42, > 42) and Gender (Male, Female), with leaves Y and N.]

Tree Construction: Divide and Conquer
  Decision tree induction is an example of a recursive partitioning algorithm: divide and conquer.
  At the start, all the training examples are at the root
  Partition examples recursively based on selected variables
  [Figure: positive and negative examples partitioned into Females/Males and then < 42 / > 42.]

Algorithm for Decision Tree Induction
  A greedy algorithm: takes the best immediate (local) decision while building the overall model
  Tree constructed top-down, recursive, divide-and-conquer
  Begin with all training examples at the root
  Data is partitioned recursively based on selected variables
  Select variables on the basis of a measure
  Stop partitioning when?
    All samples for a given node belong to the same class
    There are no remaining variables for further partitioning (majority voting is employed for classifying the leaf)
    There are no samples left

Basic Motivation: Entropy
  We are trying to predict output Y (e.g., Yes/No) from input X.
  A random data set may have high entropy:
    Y is from a uniform distribution
    a frequency distribution would be flat!
    a sample will include uniformly random values of Y
  A data set with low entropy:
    Y's distribution will be very skewed
    a frequency distribution will have a single peak
    a sample will predominantly contain just Yes or just No
  Work towards reducing the amount of entropy in the data!

Variable Selection Measure: Entropy
  Information gain (ID3/C4.5)
  Select the variable with the highest information gain
  Assume there are two classes: P and N
  Let the data S contain p elements of class P and n elements of class N
  The amount of information needed to decide if an arbitrary example in S belongs to P or N is defined as
    $$I_E(p, n) = -\frac{p}{p+n}\log_2\frac{p}{p+n} - \frac{n}{p+n}\log_2\frac{n}{p+n}$$

Variable Selection Measure: Gini
  Gini index of impurity: traditional statistical measure (CART)
  Measures how often a randomly chosen observation is incorrectly classified if it were randomly classified in proportion to the actual classes.
  Calculated as the sum of the probability of each observation being chosen times the probability of incorrect classification, equivalently (with p read here as the proportion of observations in class P):
    $$I_G(p, n) = 1 - (p^2 + (1 - p)^2)$$
  As with Entropy, the Gini measure is maximal when the classes are equally distributed and minimal when all observations are in one class or the other.
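Both impurity measures are easy to check numerically. The following is a minimal sketch, not code from the workshop: entropy() and gini() are hypothetical function names, both take the class counts p and n, and 0 * log2(0) is treated as 0 by convention.

```r
# Entropy (in bits) of a two-class node with p positives and n negatives.
entropy <- function(p, n) {
  probs <- c(p, n) / (p + n)
  probs <- probs[probs > 0]        # treat 0 * log2(0) as 0
  -sum(probs * log2(probs))
}

# Gini impurity of the same node; q is the proportion of positives.
gini <- function(p, n) {
  q <- p / (p + n)
  1 - (q^2 + (1 - q)^2)
}

entropy(5, 5)    # 1: maximal for equally distributed classes
gini(5, 5)       # 0.5: likewise maximal
entropy(10, 0)   # 0: a pure node
gini(10, 0)      # 0
```

Evaluating either function over a range of proportions reproduces the behaviour described above: a peak at equal class proportions and zero at either extreme.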

Variable Selection Measure: Information Gain
  Now use variable A to partition S into v cells: $\{S_1, S_2, \ldots, S_v\}$
  If $S_i$ contains $p_i$ examples of P and $n_i$ examples of N, the information now needed to classify objects in all subtrees $S_i$ is:
    $$E(A) = \sum_{i=1}^{v} \frac{p_i + n_i}{p + n} I(p_i, n_i)$$
  So, the information gained by branching on A is:
    $$Gain(A) = I(p, n) - E(A)$$
  So choose the variable A which results in the greatest gain in information.

Variable Importance Measure
  [Figure: the Info and Gini measures plotted against the Proportion of Positives.]

Building Decision Trees

Startup Rattle
      library(rattle)
      rattle()

Load Example Weather Dataset
  Click on the Execute button and an example dataset is offered. Click on Yes to load the weather dataset.

Summary of the Weather Dataset
  A summary of the weather dataset is displayed.

Model Tab: Decision Tree
  Click on the Model tab to display the modelling options.

Build Tree to Predict RainTomorrow
  Decision Tree is the default model type: simply click Execute.

Decision Tree Predicting RainTomorrow
  Click the Draw button to display a tree (Settings, then Advanced Graphics).

Evaluate Decision Tree
  Click the Evaluate tab for options to evaluate model performance.

Evaluate Decision Tree: Error Matrix
  Click Execute to display a simple error matrix. Identify the True/False Positives/Negatives.

Decision Tree Risk Chart
  Click the Risk type and then Execute.

Decision Tree ROC Curve
  Click the ROC type and then Execute.
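The error matrix from the evaluation steps above can be reduced to a few summary numbers once the true/false positives and negatives are identified. The sketch below is illustrative only: the counts are made up (they are not Rattle's output for the weather data), and the layout with predictions in rows and actual outcomes in columns is an assumption of this example.

```r
# Hypothetical error matrix: rows are predictions, columns are actual outcomes.
em <- matrix(c(80, 10,
                6, 14),
             nrow = 2, byrow = TRUE,
             dimnames = list(predicted = c("No", "Yes"),
                             actual    = c("No", "Yes")))

tn <- em["No",  "No"]    # true negatives
fn <- em["No",  "Yes"]   # false negatives: predicted No, actually Yes
fp <- em["Yes", "No"]    # false positives: predicted Yes, actually No
tp <- em["Yes", "Yes"]   # true positives

metrics <- c(
  error_rate = (fp + fn) / sum(em),
  precision  = tp / (tp + fp),
  recall     = tp / (tp + fn)
)
print(round(metrics, 3))
```

Precision and recall are the same quantities that appear as curves on the risk chart, which is one reason to compute them from the matrix first.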

Score a Dataset
  Click the Score type to score a new dataset using the model.

Log of R Commands
  Click the Log tab for a history of all your interactions. Save the log contents as a script to repeat what we did.

Log of R Commands: rpart()
  Here we see the call to rpart() to build the model. Click on the Export button to save the script to file.

Help: Model, Tree
  Rattle provides some basic help; click Yes for R help.

Doing It In R

Weather Dataset - Inputs
      ds <- weather
      head(ds, 4)
      ## Date Location MinTemp MaxTemp Rainfall Evaporation Sunshine
      ## Canberra
      ## Canberra
      ## Canberra
      ## Canberra
      summary(ds[c(3:5,23)])
      ## MinTemp MaxTemp Rainfall RISK_MM
      ## Min. :-5.30 Min. : 7.6 Min. : 0.00 Min. : 0.00
      ## 1st Qu.: 2.30 1st Qu.:15.0 1st Qu.: 0.00 1st Qu.: 0.00
      ## Median : 7.45 Median :19.6 Median : 0.00 Median : 0.00
      ## Mean : 7.27 Mean :20.6 Mean : 1.43 Mean : 1.43

Weather Dataset - Target
      target <- "RainTomorrow"
      summary(ds[target])
      ## RainTomorrow
      ## No :300
      ## Yes: 66
      (form <- formula(paste(target, "~ .")))
      ## RainTomorrow ~ .
      (vars <- names(ds)[-c(1, 2, 23)])
      ## [1] "MinTemp" "MaxTemp" "Rainfall" "Evaporation"
      ## [5] "Sunshine" "WindGustDir" "WindGustSpeed" "WindDir9am"
      ## [9] "WindDir3pm" "WindSpeed9am" "WindSpeed3pm" "Humidity9am"
      ## [13] "Humidity3pm" "Pressure9am" "Pressure3pm" "Cloud9am"
      ## [17] "Cloud3pm" "Temp9am" "Temp3pm" "RainToday"
      ## [21] "RainTomorrow"

Simple Train/Test Paradigm
      set.seed(42)
      train <- c(sample(1:nrow(ds), 0.70*nrow(ds)))  # Training dataset
      head(train)
      ## [1]
      length(train)
      ## [1] 256
      test <- setdiff(1:nrow(ds), train)  # Testing dataset
      length(test)
      ## [1] 110

Display the Model
      library(rpart)  # Provides rpart()
      model <- rpart(form, ds[train, vars])
      model
      ## n= 256
      ##
      ## node), split, n, loss, yval, (yprob)
      ## * denotes terminal node
      ##
      ## 1) root No ( )
      ## 2) Humidity3pm< No ( )
      ## 4) WindGustSpeed< No ( )
      ## 8) Cloud3pm< No ( ) *
      ## 9) Cloud3pm>= No ( )
      ## 18) Temp3pm< No ( ) *
      ## 19) Temp3pm>= Yes ( ) *
  Notice the legend to help interpret the tree.

Performance on Test Dataset
  The predict() function is used to score new data.
      head(predict(model, ds[test,], type="class"))
      ##
      ## No No No No No No
      ## Levels: No Yes
      table(predict(model, ds[test,], type="class"), ds[test, target])
      ##
      ## No Yes
      ## No 77 4
      ## Yes 8 4

Example DTree Plot using Rattle
  [Figure: decision tree drawn by Rattle, with splits on Humidity3pm < 60, Cloud3pm, WindGustSpeed < 64, Temp3pm < 26 and Pressure3pm, and leaves labelled No or Yes.]

An R Scripting Hint
  Notice the use of variables ds, target, vars.
  Change these variables, and the remaining script is unchanged.
  Simplifies script writing and reuse of scripts.
      ds <- iris
      target <- "Species"
      vars <- names(ds)
  Then repeat the rest of the script, without change.

An R Scripting Hint: Unchanged Code
  This code remains the same to build the decision tree.
      form <- formula(paste(target, "~ ."))
      train <- c(sample(1:nrow(ds), 0.70*nrow(ds)))
      test <- setdiff(1:nrow(ds), train)
      model <- rpart(form, ds[train, vars])
      model
      ## n= 105
      ##
      ## node), split, n, loss, yval, (yprob)
      ## * denotes terminal node
      ##
      ## 1) root setosa ( )
      ## 2) Petal.Length< setosa ( ) *
      ## 3) Petal.Length>= virginica ( )
      ## 6) Petal.Length< versicolor ( ) *
      ## 7) Petal.Length>= virginica ( ) *
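One way to take the scripting hint a step further is to wrap the unchanged code in a function, so that ds, target and vars become arguments. This is a sketch under stated assumptions, not part of the workshop script: fit_tree() is a hypothetical helper (it is not provided by rattle or rpart), vars is assumed to include the target column, and round() is used for the sample size to avoid relying on how sample() truncates a fractional size.

```r
library(rpart)  # recommended package shipped with standard R distributions

# Hypothetical helper: build a decision tree on a 70% training sample.
# `vars` must include the target column named by `target`.
fit_tree <- function(ds, target, vars = names(ds), seed = 42) {
  set.seed(seed)
  form  <- formula(paste(target, "~ ."))
  train <- sample(nrow(ds), round(0.70 * nrow(ds)))   # training rows
  test  <- setdiff(seq_len(nrow(ds)), train)          # remaining rows
  model <- rpart(form, data = ds[train, vars])
  list(model = model, train = train, test = test)
}

fit <- fit_tree(iris, "Species")
print(fit$model)
```

The same call with a different data frame and target name repeats the whole analysis, which is the reuse the hint is aiming for.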

Doing It In R

An R Scripting Hint: Unchanged Code

Similarly for the predictions.

    head(predict(model, ds[test,], type="class"))
    ##
    ## setosa setosa setosa setosa setosa setosa
    ## Levels: setosa versicolor virginica

    table(predict(model, ds[test,], type="class"), ds[test, target])
    ##
    ##              setosa versicolor virginica
    ##   setosa
    ##   versicolor
    ##   virginica                            0

Summary

- Most widely deployed machine learning algorithm.
- Simple idea, powerful learner.
- Available in R through the rpart package.
- Related packages include party, Cubist, C50, RWeka (J48).

Building Multiple Models

- General idea developed in the Multiple Inductive Learning algorithm (Williams 1987).
- Ideas were developed (ACJ 1987, PhD 1990) in the context of: observe that variable selection methods don't discriminate; so build multiple decision trees; then combine into a single model.
- Basic idea is that multiple models, like multiple experts, may produce better results when working together, rather than in isolation.
- Two approaches covered: Boosting and Random Forests.
- Meta learners.

Boosting

Boosting Algorithms

Basic idea: boost observations that are hard to model.

Algorithm: iteratively build weak models using a poor learner:

- Build an initial model;
- Identify mis-classified cases in the training dataset;
- Boost (over-represent) training observations modelled incorrectly;
- Build a new model on the boosted training dataset;
- Repeat.

The result is an ensemble of weighted models.

"Best off the shelf model builder." (Leo Breiman)

Example: Ada on Weather Data

    head(weather[c(1:5, 23, 24)], 3)
    ##   Date Location MinTemp MaxTemp Rainfall RISK_MM...
    ##        Canberra
    ##        Canberra
    ##        Canberra

    set.seed(42)
    train <- sample(1:nrow(weather), 0.7 * nrow(weather))
    (m <- ada(RainTomorrow ~ ., weather[train, -c(1:2, 23)]))
    ## Call:
    ## ada(RainTomorrow ~ ., data=weather[train, -c(1:2, 23)])
    ##
    ## Loss: exponential Method: discrete   Iteration: 50
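The boosting loop described above (build a weak model, find the misclassified cases, up-weight them, repeat) can be sketched by hand. This is a minimal illustration of discrete AdaBoost using single-split rpart trees as the weak learner; it is not how the ada package is implemented, and the two-class iris problem, the number of rounds and all variable names are assumptions made for the example.

```r
library(rpart)

set.seed(42)
# Assumed toy problem: is the flower virginica (+1) or not (-1)?
ds <- iris
ds$y <- factor(ifelse(ds$Species == "virginica", 1, -1))
ds$Species <- NULL

n      <- nrow(ds)
w      <- rep(1 / n, n)          # start with equal observation weights
rounds <- 10
stumps <- vector("list", rounds)
alpha  <- numeric(rounds)        # weight of each model in the ensemble

for (k in seq_len(rounds)) {
  # Weak learner: a one-split tree fitted to the weighted observations
  stumps[[k]] <- rpart(y ~ ., data = ds, weights = w,
                       control = rpart.control(maxdepth = 1, cp = -1, minsplit = 2))
  miss <- as.numeric(predict(stumps[[k]], ds, type = "class") != ds$y)
  err  <- sum(w * miss) / sum(w)
  err  <- min(max(err, 1e-10), 1 - 1e-10)   # guard against log(0)
  alpha[k] <- 0.5 * log((1 - err) / err)
  # Boost the misclassified observations, shrink the rest, renormalise
  w <- w * exp(alpha[k] * (2 * miss - 1))
  w <- w / sum(w)
}

# The ensemble prediction is the sign of the alpha-weighted vote
votes <- sapply(stumps, function(s)
  as.numeric(as.character(predict(s, ds, type = "class"))))
score       <- as.vector(votes %*% alpha)
ensemble    <- ifelse(score >= 0, 1, -1)
train_error <- mean(ensemble != as.numeric(as.character(ds$y)))
train_error
```

Each round concentrates weight on the observations the previous stumps got wrong, so later stumps are pushed towards different splits.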

Example: Error Rate

Notice error rate decreases quickly then flattens.

    plot(m)

[Figure: Training Error; error against iteration (1 to 50) for the training set.]

Example: Variable Importance

Helps understand the knowledge captured.

    varplot(m)

[Figure: Variable Importance Plot; variables against score, listed as Temp3pm, Pressure9am, MinTemp, Humidity3pm, Temp9am, MaxTemp, Evaporation, Cloud9am, Humidity9am, WindSpeed3pm, WindSpeed9am, Cloud3pm, WindGustSpeed, Sunshine, Pressure3pm, WindGustDir, WindDir3pm, Rainfall, WindDir9am, RainToday.]

Example: Sample Trees

There are 50 trees in all. Here are the first 3.

    fancyRpartPlot(m$model$trees[[1]])
    fancyRpartPlot(m$model$trees[[2]])
    fancyRpartPlot(m$model$trees[[3]])

[Figure: the first three trees, with splits on Cloud3pm, Pressure3pm, Sunshine, MaxTemp and Humidity3pm.]

Example: Performance

    predicted <- predict(m, weather[-train,], type="prob")[,2]
    actual <- weather[-train,]$RainTomorrow
    risks <- weather[-train,]$RISK_MM
    riskchart(predicted, actual, risks)

[Figure: risk chart; Performance (%) against Caseload (%), with Recall (88%), Risk (94%), Precision and Lift.]

Example Applications

- ATO Application: What life events affect compliance? First application of the technology, 1995. Decision Stumps: Age > NN; Change in Marital Status.
- Boosted Neural Networks: OCR using neural networks as base learners (Drucker, Schapire, Simard, 1993).

Summary

1. Boosting is implemented in R in the ada library.
2. AdaBoost uses e^(-m); LogitBoost uses log(1 + e^(-m)); Doom II uses 1 - tanh(m).
3. AdaBoost tends to be sensitive to noise (addressed by BrownBoost).
4. AdaBoost tends not to overfit, and as new models are added, generalisation error tends to improve.
5. Can be proved to converge to a perfect model (on the training data) if the learners are always better than chance.
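The loss functions named in the summary can be compared numerically. The sketch below evaluates the AdaBoost (exponential) and LogitBoost (logistic) losses at a few margins m, where a negative margin means a misclassified observation; the function names and the chosen margins are illustrative assumptions.

```r
# Loss as a function of the margin m (m < 0 means misclassified)
exp_loss   <- function(m) exp(-m)          # AdaBoost
logit_loss <- function(m) log1p(exp(-m))   # LogitBoost: log(1 + e^(-m))

margins <- c(-3, -1, 0, 1, 3)
losses <- data.frame(margin     = margins,
                     adaboost   = exp_loss(margins),
                     logitboost = logit_loss(margins))
print(losses)
```

The exponential loss grows much faster than the logistic loss as the margin becomes more negative, which is one way to see why AdaBoost reacts strongly to noisy, persistently misclassified observations.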

Random Forests

- Original idea from Leo Breiman and Adele Cutler.
- The name is licensed to Salford Systems! Hence, the R package is randomForest.
- Typically presented in the context of decision trees.
- Random Multinomial Logit uses multiple multinomial logit models.

Build many decision trees (e.g., 500). For each tree:

- Select a random subset of the training set (N);
- Choose different subsets of variables for each node of the decision tree (m << M);
- Build the tree without pruning (i.e., overfit).

Classify a new entity using every decision tree:

- Each tree votes for the entity.
- The decision with the largest number of votes wins!
- The proportion of votes is the resulting score.

Example: RF on Weather Data

    set.seed(42)
    (m <- randomForest(RainTomorrow ~ ., weather[train, -c(1:2, 23)],
                       na.action=na.roughfix,
                       importance=TRUE))
    ##
    ## Call:
    ##  randomForest(formula=RainTomorrow ~ ., data=weath...
    ##                Type of random forest: classification
    ##                      Number of trees: 500
    ## No. of variables tried at each split: 4
    ##
    ##         OOB estimate of error rate: 3.67%
    ## Confusion matrix:
    ##     No Yes class.error
    ## No
    ## Yes

Example: Error Rate

Error rate decreases quickly then flattens over the 500 trees.

    plot(m)

[Figure: error against number of trees.]

Example: Variable Importance

Helps understand the knowledge captured.

    varImpPlot(m, main="Variable Importance")

[Figure: Variable Importance, two panels.
MeanDecreaseAccuracy: Sunshine, Cloud3pm, Pressure3pm, Temp3pm, WindGustSpeed, MaxTemp, Pressure9am, Temp9am, Humidity3pm, MinTemp, Cloud9am, WindSpeed3pm, WindSpeed9am, Humidity9am, WindGustDir, Evaporation, WindDir9am, WindDir3pm, Rainfall, RainToday.
MeanDecreaseGini: Pressure3pm, Sunshine, Cloud3pm, Pressure9am, WindGustSpeed, Humidity3pm, MinTemp, Temp3pm, Temp9am, MaxTemp, Humidity9am, WindSpeed3pm, WindSpeed9am, Cloud9am, Evaporation, WindDir9am, WindGustDir, WindDir3pm, Rainfall, RainToday.]

Example: Sample Trees

There are 500 trees in all. Here are some rules from the first tree.

    ## Random Forest Model
    ##
    ## Tree Rule Node 30 Decision No
    ##
    ## 1: Evaporation <= 9
    ## 2: Humidity3pm <= 7
    ## 3: Cloud3pm <= 2.5
    ## 4: WindDir9am IN ("NNE")
    ## 5: Sunshine <= 0.25
    ## 6: Temp3pm <= 7.55
    ##
    ## Tree Rule 2 Node 3 Decision Yes
    ##
    ## 1: Evaporation <= 9
    ## 2: Humidity3pm <= 7
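The random forest recipe described earlier (random sample of observations, random subset of variables, unpruned trees, voting) can be sketched with rpart. This is a simplified illustration and not the randomForest implementation: it draws one variable subset per tree rather than per node, and the iris data, the number of trees and the subset size are assumptions.

```r
library(rpart)

set.seed(42)
ds     <- iris                       # assumed example data
target <- "Species"
inputs <- setdiff(names(ds), target)
ntree  <- 25                         # assumed, far fewer than the usual 500
mtry   <- 2                          # variables made available to each tree

forest <- lapply(seq_len(ntree), function(i) {
  rows <- sample(nrow(ds), replace = TRUE)       # bootstrap sample of rows
  vars <- sample(inputs, mtry)                   # simplification: per tree, not per node
  form <- reformulate(vars, response = target)
  # cp = 0 and minsplit = 2 grow the tree fully, i.e. without pruning
  rpart(form, data = ds[rows, ], control = rpart.control(cp = 0, minsplit = 2))
})

# Every tree votes on every observation
votes <- sapply(forest, function(tree)
  as.character(predict(tree, ds, type = "class")))

# Score = proportion of votes per class; decision = class with most votes
classes  <- levels(ds[[target]])
score    <- t(apply(votes, 1, function(v) table(factor(v, levels = classes)) / ntree))
decision <- classes[max.col(score, ties.method = "first")]

mean(decision == ds[[target]])       # agreement with the known classes
```

Each row of score sums to one, matching the description of the score as the proportion of votes.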

Example: Performance

    predicted <- predict(m, weather[-train,], type="prob")[,2]
    actual <- weather[-train,]$RainTomorrow
    risks <- weather[-train,]$RISK_MM
    riskchart(predicted, actual, risks)

[Figure: risk chart; Performance (%) against Caseload (%), with Recall (92%), Risk (97%), Precision and Lift.]

Features of Random Forests: By Breiman

- Most accurate of current algorithms.
- Runs efficiently on large data sets.
- Can handle thousands of input variables.
- Gives estimates of variable importance.

Summary

- Ensemble: multiple models working together.
- Often better than a single model.
- Variance and bias of the model are reduced.
- The best available models today: accurate and robust.
- In daily use in very many areas of application.

Moving into R and Scripting our Analyses

Data Scientists are Programmers of Data

But...

- Data scientists are programmers of data.
- A GUI can only do so much.
- R is a powerful statistical language.

Data Scientists Desire...

- Scripting
- Transparency
- Repeatability
- Sharing

From GUI to CLI: Rattle's Log Tab

Step 1: Load the Dataset

    dsname <- "weather"
    ds <- get(dsname)
    dim(ds)
    ## [1] 366 24
    names(ds)
    ##  [1] "Date"          "Location"    "MinTemp"     "...
    ##  [5] "Rainfall"      "Evaporation" "Sunshine"    "...
    ##  [9] "WindGustSpeed" "WindDir9am"  "WindDir3pm"  "...
    ## [13] "WindSpeed3pm"  "Humidity9am" "Humidity3pm" "...

Step 2: Observe the Data (Observations)

    head(ds)
    ##   Date Location MinTemp MaxTemp Rainfall Evapora...
    ##        Canberra
    ##        Canberra
    ##        Canberra
    tail(ds)
    ##   Date Location MinTemp MaxTemp Rainfall Evapo...
    ##        Canberra
    ##        Canberra
    ##        Canberra

Step 2: Observe the Data (Structure)

    str(ds)
    ## 'data.frame': 366 obs. of 24 variables:
    ##  $ Date         : Date, format: " " "
    ##  $ Location     : Factor w/ 46 levels "Adelaide","Alba...
    ##  $ MinTemp      : num
    ##  $ MaxTemp      : num
    ##  $ Rainfall     : num
    ##  $ Evaporation  : num
    ##  $ Sunshine     : num
    ##  $ WindGustDir  : Ord.factor w/ 16 levels "N"<"NNE"<"N...
    ##  $ WindGustSpeed: num
    ##  $ WindDir9am   : Ord.factor w/ 16 levels "N"<"NNE"<"N...
    ##  $ WindDir3pm   : Ord.factor w/ 16 levels "N"<"NNE"<"N...

Step 2: Observe the Data (Summary)

    summary(ds)
    ##      Date           Location        MinTemp ...
    ##  Min.   :       Canberra     :366   Min.   :
    ##  1st Qu.:       Adelaide     :  0   1st Qu.:
    ##  Median :       Albany       :  0   Median :
    ##  Mean   :       Albury       :  0   Mean   :
    ##  3rd Qu.:       AliceSprings :  0   3rd Qu.:
    ##  Max.   :       BadgerysCreek:  0   Max.   :
    ##                 (Other)      :  0 ...
    ##     Rainfall       Evaporation       Sunshine      Wind...
    ##  Min.   : 0.00   Min.   : 0.20   Min.   : 0.00   NW...
    ##  1st Qu.: 0.00   1st Qu.: 2.20   1st Qu.: 5.95   NNW...
    ##  Median : 0.00   Median : 4.20   Median : 8.60   E...

Step 2: Observe the Data (Variables)

    id <- c("Date", "Location")
    target <- "RainTomorrow"
    risk <- "RISK_MM"
    (ignore <- union(id, risk))
    ## [1] "Date"     "Location" "RISK_MM"
    (vars <- setdiff(names(ds), ignore))
    ##  [1] "MinTemp"     "MaxTemp"      "Rainfall"      "...
    ##  [5] "Sunshine"    "WindGustDir"  "WindGustSpeed" "...
    ##  [9] "WindDir3pm"  "WindSpeed9am" "WindSpeed3pm"  "...
    ## [13] "Humidity3pm" "Pressure9am"  "Pressure3pm"   "...

Step 3: Clean the Data (Remove Missing)

    dim(ds)
    ## [1] 366 24
    sum(is.na(ds[vars]))
    ## [1] 47
    ds <- ds[-attr(na.omit(ds[vars]), "na.action"),]
    dim(ds)
    ## [1] 328 24
    sum(is.na(ds[vars]))
    ## [1] 0

Step 3: Clean the Data (Target as Categoric)

    summary(ds[target])
    ##   RainTomorrow
    ##  Min.   :0.000
    ##  1st Qu.:0.000
    ##  Median :0.000
    ##  Mean   :0.183
    ##  3rd Qu.:0.000
    ##  Max.   :1.000

    ds[target] <- as.factor(ds[[target]])
    levels(ds[[target]]) <- c("No", "Yes")
    summary(ds[target])
    ##  RainTomorrow
    ##  No :268
    ##  Yes: 60

[Figure: bar chart of count by RainTomorrow.]

Step 4: Prepare for Modelling

    (form <- formula(paste(target, "~ .")))
    ## RainTomorrow ~ .
    (nobs <- nrow(ds))
    ## [1] 328
    train <- sample(nobs, 0.70*nobs)
    length(train)
    ## [1] 229
    test <- setdiff(1:nobs, train)
    length(test)
    ## [1] 99

Step 5: Build the Model (Random Forest)

    library(randomForest)
    model <- randomForest(form, ds[train, vars], na.action=na.omit)
    model
    ##
    ## Call:
    ##  randomForest(formula=form, data=ds[train, vars],...
    ##                Type of random forest: classification
    ##                      Number of trees: 500
    ## No. of variables tried at each split: 4
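The two cleaning steps can also be written defensively, as in the sketch below. It uses a small made-up data frame in place of the weather dataset, so the values and the column selection are assumptions. Selecting rows with complete.cases() avoids a pitfall of negative indexing on the "na.action" attribute: when no rows have missing values that attribute is NULL, and the negative index then selects zero rows.

```r
# Toy data standing in for the weather dataset (values are made up)
ds <- data.frame(MinTemp      = c(8.0, NA, 13.7, 9.2, 11.5),
                 Sunshine     = c(6.3, 9.7, NA, 9.1, 10.6),
                 RainTomorrow = c(0, 0, 1, 0, 1))
target <- "RainTomorrow"
vars   <- names(ds)

# Keep only the rows with no missing values among vars;
# this also behaves correctly when nothing is missing
ds <- ds[complete.cases(ds[vars]), ]

# Convert the 0/1 target to a factor and label the levels in one step
ds[[target]] <- factor(ds[[target]], levels = c(0, 1), labels = c("No", "Yes"))

table(ds[[target]])
```

Note the double brackets: ds[[target]] is the column itself, whereas ds[target] is a one-column data frame, so level labels need to be set on the former.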

Step 6: Evaluate the Model (Risk Chart)

    pr <- predict(model, ds[test,], type="prob")[,2]
    riskchart(pr, ds[test, target], ds[test, risk],
              title="Random Forest - Risk Chart",
              risk=risk, recall=target, thresholds=c(0.35, 0.5))

[Figure: Random Forest Risk Chart; Performance (%) against Caseload (%), with RainTomorrow (99%), RISK_MM (100%), Precision and Lift.]

Motivation

Why is Reproducibility Important?

Your Research Leader or Executive drops by and asks:

- Remember that research you did last year? I've heard there is an update on the data that you used. Can you add the new data in and repeat the same analysis?
- Jo Bloggs did a great analysis of the company returns data just before she left. Can you get someone else to analyse the new data set using the same methods, and so produce an updated report that we can present to the Exec next week?
- The fraud case you provided an analysis of last year has finally reached the courts. We need to ensure we have a clear trail of the data sources, the analyses performed, and the results obtained, to stand up in court. Could you document these please.

Literate Data Mining Overview

- One document to intermix the analysis, code, and results.
- Authors productive with narrative and code in one document.
- Sweave (Leisch 2002) and now KnitR (Yihui 2011).
- Embed R code into LaTeX documents for typesetting.
- KnitR also supports publishing to the web.

Why Reproducible Data Mining?

- Automatically regenerate documents when code, data, or assumptions change.
- Eliminate errors that occur when transcribing results into documents.
- Record the context for the analysis and decisions made about the type of analysis to perform in the one place.
- Document the processes to provide integrity for the conclusions of the analysis.
- Share approach with others for peer review and for learning from each other; engender a continuous learning environment.

Prime Objective: Trustworthy Software

"Those who receive the results of modern data analysis have limited opportunity to verify the results by direct observation. Users of the analysis have no option but to trust the analysis, and by extension the software that produced it. This places an obligation on all creators of software to program in such a way that the computations can be understood and trusted. This obligation I label the Prime Directive."

John Chambers (2008) Software for Data Analysis: Programming with R

Beautiful Output by KnitR

KnitR combined with LaTeX will:

- Intermix analysis and results of analysis.
- Automatically generate graphics and tables.
- Support reproducible and transparent analysis.
- Produce the best looking reports.

Beautiful Output by Default

The reader wants to read the document and easily do so!

- Code highlighting is done automatically.
- Default theme is carefully designed.
- Many other themes are available.
- R code is properly reformatted.
- Analyses (graphs and tables) are automatically included.

Using RStudio

Supporting Technology

A suite of Free and Open Source Software (FLOSS):

- RStudio: creating, managing, compiling documents
- LaTeX: markup language for typesetting a document
- R: statistical analysis language
- KnitR: integrator of typesetting and analysis

Using RStudio

- Simplified interaction with R, LaTeX, and KnitR.
- Executes R code one line at a time.
- Formats LaTeX documents and provides spell checking.
- A single click compile to PDF and synchronised views.

Demonstrate: Startup and explore RStudio.

Introducing LaTeX

- A text markup language rather than a WYSIWYG.
- Based on TeX from 1977: very stable and powerful.
- LaTeX is an easier to use macro package built on TeX.
- Ensures consistent style (layout, fonts, tables, maths, etc.).
- Automatic indexes, footnotes and references.
- Documents are well structured and are clear text.
- Has a learning curve.

Basic LaTeX Usage

    \documentclass{article}
    \begin{document}
    \end{document}

Demonstrate: Create a new Sweave document in RStudio.

Structures

    \documentclass{article}
    \begin{document}
    \section{Introduction}
    ...
    \subsection{Concepts}
    ...
    \end{document}

Formats

    \documentclass{article}
    \begin{document}
    \begin{itemize}
    \item ABC
    \item DEF
    \end{itemize}
    This is \textbf{bold} text or \textit{italic} text,...
    \end{document}

RStudio Support for LaTeX

- RStudio provides excellent support for working with LaTeX documents.
- Helps to avoid having to know too much about LaTeX.
- Best illustrated through a demonstration:
  Format menu; Section commands; Font commands; List commands; Verbatim/Block commands; Spell Checker; Compile PDF.

Demonstrate: Start a new document, add contents, format to PDF.

Incorporating R Code

- We insert R code in a Chunk starting with << >>=
- We terminate the Chunk with @
- Save LaTeX with extension Rnw

This Chunk

    <<simple_example>>=
    x <- sum(1:10)
    x
    @

Produces

    x <- sum(1:10)
    x
    ## [1] 55

Demonstrate: Do this in RStudio.

Making You Look Good

    <<format_example>>=
    for(i in 1:5){j<-cos(sin(i)*i^2)+3;print(j-5)}
    @

is reformatted in the output as:

    for (i in 1:5) {
        j <- cos(sin(i) * i^2) + 3
        print(j - 5)
    }
    ## [1]
    ## [1]
    ## [1]
    ## [1] -1.103
    ## [1]

R Within the Text

Include information about data within the narrative. We can do that with \Sexpr{...}.

    Our dataset has \Sexpr{nrow(ds)} observations of
    \Sexpr{ncol(ds)} variables.

Becomes

    Our dataset has 328 observations of 24 variables.

Better still, format large numbers with a thousands separator:

    \Sexpr{format(nrow(ds), big.mark=",")}

    Our dataset has 328 observations of 24 variables.
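The effect of big.mark inside \Sexpr{} only shows once a count reaches the thousands, which is why the 328-row example prints the same either way. A short sketch with made-up counts:

```r
small <- 328        # the number of observations quoted above
large <- 1234567    # a made-up larger count

format(small, big.mark = ",")   # unchanged: fewer than four digits
format(large, big.mark = ",")   # thousands are separated by commas

# The same call wrapped in a sentence, as \Sexpr{} would insert it
paste("Our dataset has", format(large, big.mark = ","), "observations.")
```

format() returns a character string, so the result drops straight into the typeset sentence.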

A Simple Table

    library(xtable)
    obs <- sample(1:nrow(weatherAUS), 8)
    vars <- 2:6
    xtable(weatherAUS[obs, vars])

[Table: columns Location, MinTemp, MaxTemp, Rainfall, Evaporation; rows for AliceSprings, Ballarat, MountGambier, WaggaWagga, Cairns, Albury, AliceSprings, PearceRAAF.]

Table: Exclude Row Names

    print(xtable(weatherAUS[obs, vars]),
          include.rownames=FALSE)

[Table: the same columns and rows, without row names.]

Table: Limit Number of Digits

    print(xtable(weatherAUS[obs, vars], digits=1),
          include.rownames=FALSE)

[Table: the same columns and rows, values shown to one decimal place.]

Table: Tiny Font

    vars <- 2:8
    print(xtable(weatherAUS[obs, vars], digits=0),
          size="tiny", include.rownames=FALSE)

[Table: columns Location, MinTemp, MaxTemp, Rainfall, Evaporation, Sunshine, WindGustDir; same eight rows, with WindGustDir values SE, W, SW, SSW, NE, NNE, N, S.]

Table: Column Alignment

    vars <- 2:8
    print(xtable(weatherAUS[obs, vars], digits=0,
                 align="rlrrrrrr"),
          size="tiny")

[Table: the same seven columns and eight rows, Location left aligned and the remaining columns right aligned.]

Table: Caption

    print(xtable(weatherAUS[obs, vars], digits=1,
                 caption="This is the table caption."),
          size="tiny")

[Table: the same seven columns and eight rows.]

Table 1: This is the table caption.

Plots

    library(ggplot2)
    cities <- c("Canberra", "Darwin", "Melbourne", "Sydney")
    ds <- subset(weatherAUS, Location %in% cities & ! is.na(Temp3pm))
    g <- ggplot(ds, aes(Temp3pm, colour=Location, fill=Location))
    g <- g + geom_density(alpha = 0.55)
    print(g)

[Figure: density of Temp3pm by Location (Canberra, Darwin, Melbourne, Sydney).]

Our First KnitR Document

Create a KnitR Document: New R Sweave

Setup KnitR

We wish to use KnitR rather than the older Sweave processor. In RStudio we can configure the options to use knitr:

- Select Tools, then Options
- Choose the Sweave group
- Choose knitr for "Weave Rnw files using:"
- The remaining defaults should be okay
- Click Apply and then OK

Simple KnitR Document

Insert the following into your new KnitR document:

    \title{Sample KnitR Document}
    \author{Graham Williams}
    \maketitle
    \section*{My First Section}
    This is some text that is automatically typeset
    by the LaTeX processor to produce well formatted
    quality output as PDF.

Simple KnitR Document: Resulting PDF

[Figure: result of Compile PDF.]

Including R Commands in KnitR

KnitR: Add R Commands

R code can be used to generate results into the document:

    <<echo=FALSE, message=FALSE>>=
    library(rattle)  # Provides the weather dataset
    library(ggplot2) # Provides the qplot() function
    ds <- weather
    qplot(MinTemp, MaxTemp,

KnitR Document With R Code

[Figure: the KnitR document with R code in RStudio.]

Simple KnitR Document: PDF with Plot

[Figure: result of Compile PDF.]

Basics Cheat Sheet

LaTeX Basics

    \subsection*{...}     % Introduce a Sub Section
    \subsubsection*{...}  % Introduce a Sub Sub Section
    \textbf{...}          % Bold font
    \textit{...}          % Italic font
    \begin{itemize}       % A bullet list
    \item ...
    \item ...
    \end{itemize}

Plus an extensive collection of other markup and capabilities.

KnitR Basics

    echo=FALSE        # Do not display the R code
    eval=TRUE         # Evaluate the R code
    results="hide"    # Hide the results of the R commands
    fig.width=10      # Extend figure width from 7 to 10 inches
    fig.height=8      # Extend figure height from 7 to 8 inches
    out.width="0.8\\textwidth"    # Fit figure 80% page width
    out.height="0.5\\textheight"  # Fit figure 50% page height

Plus an extensive collection of other options.
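The pieces above (LaTeX preamble, a chunk with options, \Sexpr) can be assembled into a complete .Rnw file from R itself. The sketch below writes a minimal document to a temporary directory and, if the knitr package happens to be installed, weaves it to a .tex file; the file names are hypothetical and compiling the .tex to PDF additionally needs a LaTeX installation, which is not attempted here.

```r
# Hypothetical file names, placed in a temporary directory
rnw <- file.path(tempdir(), "sample.Rnw")
tex <- file.path(tempdir(), "sample.tex")

# A minimal KnitR document: one chunk with options, one inline expression
writeLines(c(
  "\\documentclass{article}",
  "\\begin{document}",
  "\\section*{My First Section}",
  "<<simple_example, echo=FALSE, results=\"hide\">>=",
  "x <- sum(1:10)",
  "@",
  "The sum of the first ten integers is \\Sexpr{x}.",
  "\\end{document}"
), rnw)

# Weave only when knitr is available; knit() returns the output file name
if (requireNamespace("knitr", quietly = TRUE)) {
  knitr::knit(rnw, output = tex, quiet = TRUE)
}
```

In RStudio the same result comes from opening the .Rnw file and clicking Compile PDF, once knitr is selected for weaving Rnw files.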

Resources

Resources: One Page R

A new series of guides to using R.

- Data Preparation template:
- Model Building template:
- Plots using GGPlot2:
- Text Mining in R:

Resources and References

- OnePageR: Tutorial Notes
- Rattle:
- Guides:
- Practise:
- Book: Data Mining using Rattle/R
- Chapter: Rattle and Other Tales
- Paper: A Data Mining GUI for R, R Journal, Volume 1(2)

That Is All Folks. Time For Questions. Thank You.

This document, sourced from CUNYWorkshop.Rnw revision 35, was processed by KnitR version 1.5 and took.4 seconds to process. It was generated by gjw on theano running Ubuntu 13.10 with Intel(R) Core(TM) i7-357u having 4 cores and 3.9GB of RAM. It completed the processing :25:28.