
Use R! Series Editors: Robert Gentleman, Kurt Hornik, Giovanni G. Parmigiani. For further volumes:


Graham Williams

Data Mining with Rattle and R: The Art of Excavating Data for Knowledge Discovery

Graham Williams, Togaware Pty Ltd, PO Box 655, Jamison Centre ACT 2614, Australia

Series Editors:
Robert Gentleman, Program in Computational Biology, Division of Public Health Sciences, Fred Hutchinson Cancer Research Center, 1100 Fairview Avenue, N. M2-B876, Seattle, Washington, USA
Kurt Hornik, Department of Statistik and Mathematik, Wirtschaftsuniversität Wien, Augasse 2-6, A-1090 Wien, Austria
Giovanni G. Parmigiani, The Sidney Kimmel Comprehensive Cancer Center at Johns Hopkins University, 550 North Broadway, Baltimore, MD, USA

ISBN e-ISBN DOI
Springer New York Dordrecht Heidelberg London
Library of Congress Control Number:
© Springer Science+Business Media, LLC 2011

All rights reserved. This work may not be translated or copied in whole or in part without the written permission of the publisher (Springer Science+Business Media, LLC, 233 Spring Street, New York, NY 10013, USA), except for brief excerpts in connection with reviews or scholarly analysis. Use in connection with any form of information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed is forbidden. The use in this publication of trade names, trademarks, service marks, and similar terms, even if they are not identified as such, is not to be taken as an expression of opinion as to whether or not they are subject to proprietary rights.

Printed on acid-free paper. Springer is part of Springer Science+Business Media.

To Catharina


Preface

Knowledge leads to wisdom and better understanding. Data mining builds knowledge from information, adding value to the ever-increasing stores of electronic data that abound today. Emerging from the database community in the late 1980s, data mining grew quickly to encompass researchers and technologies from machine learning, high-performance computing, visualisation, and statistics, recognising the growing opportunity to add value to data. Today, this multidisciplinary and transdisciplinary effort continues to deliver new techniques and tools for the analysis of very large collections of data. Working on databases that are now measured in the terabytes and petabytes, data mining delivers discoveries that can improve the way an organisation does business. Data mining enables companies to remain competitive in this modern, data-rich, information-poor, knowledge-hungry, and wisdom-scarce world. Data mining delivers knowledge to drive the getting of wisdom.

A wide range of techniques and algorithms are used in data mining. In performing data mining, many decisions need to be made regarding the choice of methodology, data, tools, and algorithms. Throughout this book, we will be introduced to the basic concepts and algorithms of data mining. We use the free and open source software Rattle (Williams, 2009), built on top of the R statistical software package (R Development Core Team, 2011). As free software, the source code of Rattle and R is available to everyone, without limitation. Everyone is permitted, and indeed encouraged, to read the source code to learn, understand, verify, and extend it. R is supported by a worldwide network of some of the world's leading statisticians and implements all of the key algorithms for data mining.

This book will guide the reader through the various options that Rattle provides, serving to introduce the new data miner to its use. Many excursions into using R itself are presented, with the aim

of encouraging readers to use R directly as a scripting language. Through scripting comes the necessary integrity and repeatability required for professional data mining.

Features

A key feature of this book, which differentiates it from many other very good textbooks on data mining, is the focus on the hands-on, end-to-end process for data mining. We cover data understanding, data preparation, model building, model evaluation, data refinement, and practical deployment. Most data mining textbooks have their primary focus on just the model building, that is, the algorithms for data mining. This book, on the other hand, shares the focus with data and with model evaluation and deployment.

In addition to presenting descriptions of approaches and techniques for data mining using modern tools, we provide a very practical resource with actual examples using Rattle. Rattle is easy to use and is built on top of R. As mentioned above, we also provide excursions into the command line, giving numerous examples of direct interaction with R. The reader will learn to rapidly deliver a data mining project using software obtained for free from the Internet. Rattle and R deliver a very sophisticated data mining environment.

This book encourages the concept of programming with data, and this theme relies on some familiarity with the programming of computers. However, students without that background will still benefit from the material by staying with the Rattle application. All readers are encouraged, though, to consider becoming familiar with some level of writing commands to process and analyse data.

The book is accessible to many readers and not necessarily just those with strong backgrounds in computer science or statistics. At times, we do introduce more sophisticated statistical, mathematical, and computer science notation, but we generally aim to keep it simple. Sometimes this means oversimplifying concepts, but only where it does not lose the intent of the concept and only where it retains its fundamental accuracy. At other times, the presentation will leave the more statistically sophisticated wanting. As important as the material is, it is not always easily covered within the confines of a short book. Other resources cover such material in more detail. The reader is directed to the extensive

mathematical treatment by Hastie et al. (2009). For a more introductory treatment using R for statistics, see Dalgaard (2008). For a broader perspective on using R, including a brief introduction to the tools in R for data mining, Adler (2010) is recommended. For an introduction to data mining with a case study orientation, see Torgo (2010).

Organisation

Chapter 1 sets the context for our data mining. It presents an overview of data mining, the process of data mining, and issues associated with data mining. It also canvasses open source software for data mining. Chapter 2 then introduces Rattle as a graphical user interface (GUI) developed to simplify data mining projects. This covers the basics of interacting with R and Rattle, providing a quick-start guide to data mining.

Chapters 3 to 7 deal with data; we discuss the data, exploratory, and transformational steps of the data mining process. We introduce data and how to select variables and the partitioning of our data in Chapter 3. Chapter 4 covers the loading of data into Rattle and R. Chapters 5 and 6 then review various approaches to exploring the data in order for us to gain our initial insights about the data. We also learn about the distribution of the data and how to assess the appropriateness of any analysis. Often, our exploration of the data will lead us to identify various issues with the data. We thus begin cleaning the data, dealing with missing data, transforming the data, and reducing the data, as we describe in Chapter 7.

Chapters 8 to 14 then cover the building of models. This is the next step in data mining, where we begin to represent the knowledge discovered. The concepts of modelling are introduced in Chapter 8, introducing descriptive and predictive data mining. Specific descriptive data mining approaches are then covered in Chapters 9 (clusters) and 10 (association rules). Predictive data mining approaches are covered in Chapters 11 (decision trees), 12 (random forests), 13 (boosting), and 14 (support vector machines). Not all predictive data mining approaches are included, leaving some of the well-covered topics (including linear regression and neural networks) to other books. Having built a model, we need to consider how to evaluate its performance. This is the topic for Chapter 15. We then consider the task of deploying our models in Chapter 16.

Appendix A can be consulted for installing R and Rattle. Both R and Rattle are open source software, and both are freely available on multiple platforms. Appendix B describes in detail how the datasets used throughout the book were obtained from their sources and how they were transformed into the datasets made available through rattle.

Production and Typographical Conventions

This book has been typeset by the author using LaTeX and R's Sweave(). All R code segments included in the book are run at the time of typesetting the book, and the results displayed are directly and automatically obtained from R itself. The Rattle screen shots are also automatically generated as the book is typeset. Because all R code and screen shots are automatically generated, the output we see in the book should be reproducible by the reader. All code is run on a 64-bit deployment of R on a Ubuntu GNU/Linux system. Running the same code on other systems (particularly on 32-bit systems) may result in slight variations in the results of the numeric calculations performed by R. Other minor differences will occur with regard to the widths of lines and rounding of numbers.

The following options are set when typesetting the book. We can see that width= is set to 58 to limit the line width for publication. The two options scipen= and digits= affect how numbers are presented:

> options(width=58, scipen=5, digits=4, continue=" ")

Sample code used to illustrate the interactive sessions using R will include the R prompt, which by default is >. However, we generally do not include the usual continuation prompt, which by default consists of +. The continuation prompt is used by R when a single command extends over multiple lines, to indicate that R is still waiting for input from the user. For our purposes, including the continuation prompt makes it more difficult to cut-and-paste from the examples in the electronic version of the book. The options() example above includes this change to the continuation prompt. R code examples will appear as code blocks like the following example (though the continuation prompt, which is shown in the following example, will not be included in the code blocks in the book).

> library(rattle)
Rattle: A free graphical interface for data mining with R.
Version Copyright (c) Togaware Pty Ltd.
Type 'rattle()' to shake, rattle, and roll your data.
> rattle()
Rattle timestamp: :57:52
> cat("Welcome to Rattle",
+     "and the world of Data Mining.\n")
Welcome to Rattle and the world of Data Mining.

In providing example output from commands, at times we will truncate the listing and indicate missing components with [...]. While most examples will illustrate the output exactly as it appears in R, there will be times where the format will be modified slightly to fit publication limitations. This might involve silently removing or adding blank lines.

In describing the functionality of Rattle, we will use a sans serif font to identify a Rattle widget (a graphical user interface component that we interact with, such as a button or menu). The kinds of widgets that are used in Rattle include the check box for turning options on and off, the radio button for selecting an option from a list of alternatives, file selectors for identifying files to load data from or to save data to, combo boxes for making selections, buttons to click for further plots or information, spin buttons for setting numeric options, and the text view, where the output from R commands will be displayed.

R provides very many packages that together deliver an extensive toolkit for data mining. rattle is itself an R package; we use a bold font to refer to R packages. When we discuss the functions or commands that we can type at the R prompt, we will include parentheses with the function name so that it is clearly a reference to an R function. The command rattle(), for example, will start the user interface for Rattle. Many functions and commands can also take arguments, which we indicate by trailing the argument with an equals sign. The rattle() command, for example, can accept the command argument csvfile=.
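As an illustration of this notation (the filename here is hypothetical, not from the book), the following would start the Rattle interface with the named CSV file supplied as its data source:

> rattle(csvfile="myweather.csv")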

Implementing Rattle

Rattle has been developed using the Gnome (1997) toolkit with the Glade (1998) graphical user interface (GUI) builder. Gnome is independent of any programming language, and the GUI side of Rattle started out using the Python (1989) programming language. I soon moved to R directly, once RGtk2 (Lawrence and Temple Lang, 2010) became available, providing access to Gnome from R. Moving to R allowed us to avoid the idiosyncrasies of interfacing multiple languages.

The Glade graphical interface builder is used to generate an XML file that describes the interface independent of the programming language. That file can be loaded into any supported programming language to display the GUI. The actual functionality underlying the application is then written in any supported language, which includes Java, C, C++, Ada, Python, Ruby, and R! Through the use of Glade, we have the freedom to quickly change languages if the need arises.

R itself is written in the procedural programming language C. Where computation requirements are significant, R code is often translated into C code, which will generally execute faster. The details are not important for us here, but this allows R to be surprisingly fast when it needs to be, without the users of R actually needing to be aware of how the function they are using is implemented.

Currency

New versions of R are released twice a year, in April and October. R is free, so a sensible approach is to upgrade whenever we can. This will ensure that we keep up with bug fixes and new developments, and we won't annoy the developers with questions about problems that have already been fixed. The examples included in this book are from version of R and version of Rattle. Rattle is an ever-evolving package and, over time, whilst the concepts remain, the details will change. For example, the advent of ggplot2 (Wickham, 2009) provides an opportunity to significantly develop its graphics capabilities. Similarly, caret (Kuhn et al., 2011) offers a newer approach to interfacing various data mining algorithms, and we may see Rattle take advantage of this. New data mining algorithms continue to emerge and may be incorporated over time.

Similarly, the screen shots included in this book are current only for the version of Rattle available at the time the book was typeset. Expect some minor changes in various windows and text views, and the occasional major change with the addition of new functionality. Appendix A includes links to guides for installing Rattle. We also list there the versions of the primary packages used by Rattle, at least as of the date of typesetting this book.

Acknowledgements

This book has grown from a desire to share experiences in using and deploying data mining tools and techniques. A considerable proportion of the material draws on over 20 years of teaching data mining to undergraduate and graduate students and running industry-based courses. The aim is to provide recipe-type material that can be easily understood and deployed, as well as reference material covering the concepts and terminology a data miner is likely to come across.

Many thanks are due to students from the Australian National University, the University of Canberra, and elsewhere who over the years have been the reason for me to collect my thoughts and experiences with data mining and to bring them together into this book. I have benefited from their insights into how they learn best. They have also contributed in a number of ways with suggestions and example applications. I am also in debt to my colleagues over the years, particularly Peter Milne, Joshua Huang, Warwick Graco, John Maindonald, and Stuart Hamilton, for their support and contributions to the development of data mining in Australia.

Colleagues in various organisations deploying or developing skills in data mining have also provided significant feedback, as well as the motivation, for this book. Anthony Nolan deserves special mention for his enthusiasm and ongoing contribution of ideas that have helped fine-tune the material in the book. Many others have also provided insights and comments. Illustrative examples of using R have also come from the R mailing lists, and I have used many of these to guide the kinds of examples that are included in the book. The many contributors to those lists need to be thanked.

Thanks also go to the reviewers, who have added greatly to the readability and usability of the book. These include Robert Muenchen,

Peter Christen, Peter Helmsted, Bruce McCullough, and Balázs Bárány. Thanks also to John Garden for his encouragement and insights in choosing a title for the volume.

My very special thanks to my wife, Catharina, and children, Sean and Anita, who have endured my indulgence in bringing this book together.

Canberra

Graham J. Williams

Contents

Preface

Part I Explorations

1 Introduction
   Data Mining Beginnings; The Data Mining Team; Agile Data Mining; The Data Mining Process; A Typical Journey; Insights for Data Mining; Documenting Data Mining; Tools for Data Mining: R; Tools for Data Mining: Rattle; Why R and Rattle?; Privacy; Resources
2 Getting Started
   Starting R; Quitting Rattle and R; First Contact; Loading a Dataset; Building a Model; Understanding Our Data; Evaluating the Model: Confusion Matrix; Interacting with Rattle; Interacting with R; Summary; Command Summary
3 Working with Data
   Data Nomenclature; Sourcing Data for Mining; Data Quality; Data Matching; Data Warehousing; Interacting with Data Using R; Documenting the Data; Summary; Command Summary
4 Loading Data
   CSV Data; ARFF Data; ODBC Sourced Data; R Dataset; Other Data Sources; R Data; Library; Data Options; Command Summary
5 Exploring Data
   Summarising Data (Basic Summaries; Detailed Numeric Summaries; Distribution; Skewness; Kurtosis; Missing Values); Visualising Distributions (Box Plot; Histogram; Cumulative Distribution Plot; Benford's Law; Bar Plot; Dot Plot; Mosaic Plot; Pairs and Scatter Plots; Plots with Groups); Correlation Analysis (Correlation Plot; Missing Value Correlations; Hierarchical Correlation); Command Summary
6 Interactive Graphics
   Latticist; GGobi; Command Summary
7 Transforming Data
   Data Issues; Transforming Data; Rescaling Data; Imputation; Recoding; Cleanup; Command Summary

Part II Building Models

8 Descriptive and Predictive Analytics
   Model Nomenclature; A Framework for Modelling; Descriptive Analytics; Predictive Analytics; Model Builders
9 Cluster Analysis
   Knowledge Representation; Search Heuristic; Measures; Tutorial Example; Discussion; Command Summary
10 Association Analysis
   Knowledge Representation; Search Heuristic; Measures; Tutorial Example; Command Summary
11 Decision Trees
   Knowledge Representation; Algorithm; Measures; Tutorial Example; Tuning Parameters; Discussion; Summary; Command Summary
12 Random Forests
   Overview; Knowledge Representation; Algorithm; Tutorial Example; Tuning Parameters; Discussion; Summary; Command Summary
13 Boosting
   Knowledge Representation; Algorithm; Tutorial Example; Tuning Parameters; Discussion; Summary; Command Summary
14 Support Vector Machines
   Knowledge Representation; Algorithm; Tutorial Example; Tuning Parameters; Command Summary

Part III Delivering Performance

15 Model Performance Evaluation
   The Evaluate Tab: Evaluation Datasets; Measure of Performance; Confusion Matrix; Risk Charts; ROC Charts; Other Charts; Scoring
16 Deployment
   Deploying an R Model; Converting to PMML; Command Summary

Part IV Appendices

A Installing Rattle
B Sample Datasets
   B.1 Weather (B.1.1 Obtaining Data; B.1.2 Data Preprocessing; B.1.3 Data Cleaning; B.1.4 Missing Values; B.1.5 Data Transforms; B.1.6 Using the Data); B.2 Audit (B.2.1 The Adult Survey Dataset; B.2.2 From Survey to Audit; B.2.3 Generating Targets; B.2.4 Finalising the Data; B.2.5 Using the Data); B.3 Command Summary

References

Index

Part I Explorations


Chapter 1

Introduction

For the keen data miner, Chapter 2 provides a quick-start guide to data mining with Rattle, working through a sample process of loading a dataset and building a model.

Data mining is the art and science of intelligent data analysis. The aim is to discover meaningful insights and knowledge from data. Discoveries are often expressed as models, and we often describe data mining as the process of building models. A model captures, in some formulation, the essence of the discovered knowledge. A model can be used to assist in our understanding of the world. Models can also be used to make predictions.

For the data miner, the discovery of new knowledge and the building of models that nicely predict the future can be quite rewarding. Indeed, data mining should be exciting and fun as we watch new insights and knowledge emerge from our data. With growing enthusiasm, we meander through our data analyses, following our intuitions and making new discoveries all the time, discoveries that will continue to help change our world for the better.

Data mining has been applied in most areas of endeavour. There are data mining teams working in business, government, financial services, biology, medicine, risk and intelligence, science, and engineering. Anywhere we collect data, data mining is being applied and feeding new knowledge into human endeavour.

We are living in a time where data is collected and stored in unprecedented volumes. Large and small government agencies, commercial enterprises, and noncommercial organisations collect data about their businesses, customers, human resources, products, manufacturing

processes, suppliers, business partners, local and international markets, and competitors. Data is the fuel that we inject into the data mining engine.

Turning data into information and then turning that information into knowledge remains a key factor for success. Data contains valuable information that can support managers in their business decisions to effectively and efficiently run a business. Amongst data there can be hidden clues of the fraudulent activity of criminals. Data provides the basis for understanding the scientific processes that we observe in our world. Turning data into information is the basis for identifying new opportunities that lead to the discovery of new knowledge, which is the linchpin of our society!

Data mining is about building models from data. We build models to gain insights into the world and how the world works so we can predict how things behave. A data miner, in building models, deploys many different data analysis and model building techniques. Our choices depend on the business problems to be solved. Although data mining is not the only approach, it is becoming very widely used because it is well suited to the data environments we find in today's enterprises. This is characterised by the volume of data available, commonly in the gigabytes and terabytes and fast approaching the petabytes. It is also characterised by the complexity of that data, both in terms of the relationships that are awaiting discovery in the data and the data types available today, including text, image, audio, and video. The business environments are also rapidly changing, and analyses need to be performed regularly and models updated regularly to keep up with today's dynamic world.

Modelling is what people often think of when they think of data mining. Modelling is the process of turning data into some structured form or model that reflects the supplied data in some useful way. Overall, the aim is to explore our data, often to address a specific problem, by modelling the world. From the models, we gain new insights and develop a better understanding of the world.

Data mining, in reality, is so much more than simply modelling. It is also about understanding the business context within which we deploy it. It is about understanding and collecting data from across an enterprise and from external sources. It is then about building models and evaluating them. And, most importantly, it is about deploying those models to deliver benefits.

There is a bewildering array of tools and techniques at the disposal of the data miner for gaining insights into data and for building models.

This book introduces some of these as a starting point on a longer journey to becoming a practising data miner.

1.1 Data Mining Beginnings

Data mining, as a named endeavour, emerged at the end of the 1980s from the database community, which was wondering where the next big steps forward were going to come from. Relational database theory had been developed and successfully deployed, and thus began the era of collecting large amounts of data. How do we add value to our massive stores of data?

The first few data mining workshops in the early 1990s attracted the database community researchers. Before long, other computer science, and particularly artificial intelligence, researchers began to get interested. It is useful to note that a key element of intelligence is the ability to learn, and machine learning research had been developing technology for this for many years. Machine learning is about collecting observational data through interacting with the world and building models of the world from such data. That is pretty much what data mining was also setting about to do. So, naturally, the machine learning and data mining communities started to come together.

However, statistics is one of the fundamental tools for data analysis, and has been so for over a hundred years. Statistics brings to the table essential ideas about uncertainty and how to make allowances for it in the models that we build. Statistics provides a framework for understanding the strength or veracity of models that we might build from data. Discoveries need to be statistically sound and statistically significant, and any uncertainty associated with the modelling needs to be understood. Statistics plays a key role in today's data mining.

Today, data mining is a discipline that draws on sophisticated skills in computer science, machine learning, and statistics. However, a data miner will work in a team together with data and domain experts.

1.2 The Data Mining Team

Many data mining projects work with ill-defined and ambiguous goals. Whilst the first reaction to such an observation is that we should become better at defining the problem, the reality is that often the problem to

be solved is identified and refined as the data mining project progresses. That's natural.

An initiation meeting of a data mining project will often involve data miners, domain experts, and data experts. The data miners bring the statistical and algorithmic understanding, programming skills, and key investigative ability that underlies any analysis. The domain experts know about the actual problem being tackled, and are often the business experts who have been working in the area for many years. The data experts know about the data, how it has been collected, where it has been stored, how to access and combine the data required for the analysis, and any idiosyncrasies and data traps that await the data miner.

Generally, neither the domain expert nor the data expert understand the needs of the data miner. In particular, as a data miner we will often find ourselves encouraging the data experts to provide (or to provide access to) all of the data, and not just the data the data expert thinks might be useful. As data miners we will often think of ourselves as greedy consumers of all the data we can get our hands on.

It is critical that all three experts come together to deliver a data mining project. Their different understandings of the problem to be tackled all need to meld to deliver a common pathway for the data mining project. In particular, the data miner needs to understand the problem domain perspective and understand what data is available that relates to the problem and how to get that data, and identify what data processing is required prior to modelling.

1.3 Agile Data Mining

Building models is only one of the tasks that the data miner performs. There are many other important tasks that we will find ourselves involved in. These include ensuring our data mining activities are tackling the right problem; understanding the data that is available, turning noisy data into data from which we can build robust models; evaluating and demonstrating the performance of our models; and ensuring the effective deployment of our models.

Whilst we can easily describe these steps, it is important to be aware that data mining is an agile activity. The concept of agility comes from the agile software engineering principles, which include the evolution or incremental development of the problem requirements, the requirement

for regular client input or feedback, the testing of our models as they are being developed, and frequent rebuilding of the models to improve their performance. An allied aspect is the concept of pair programming, where two data miners work together on the same data in a friendly, competitive, and collaborative approach to building models. The agile approach also emphasises the importance of face-to-face communication, above and beyond all of the effort that is otherwise often expended, and often wasted, on written documentation. This is not to remove the need to write documents but to identify what is really required to be documented.

We now identify the common steps in a data mining project and note that the following chapters of this book then walk us through these steps one step at a time!

1.4 The Data Mining Process

The Cross Industry Standard Process for Data Mining (CRISP-DM, 1996) provides a common and well-developed framework for delivering data mining projects. CRISP-DM identifies six steps within a typical data mining project:

1. Problem Understanding
2. Data Understanding
3. Data Preparation
4. Modelling
5. Evaluation
6. Deployment

The chapters in this book essentially follow this step-by-step process of a data mining project, and Rattle is very much based around these same steps. Using a tab-based interface, each tab represents one of the steps, and we proceed through the tabs as we work our way through a data mining project. One noticeable exception to this is the first step, problem understanding. That is something that needs study, discussion, thought, and brain power. Practical tools to help in this process are not common.

1.5 A Typical Journey

Many organisations are looking to set up a data mining capability, often called the analytics team. Within the organisation, data mining projects can be initiated by the business or by this analytics team. Often, for best business engagement, a business-initiated project works best, though business is not always equipped to understand where data mining can be applied. It is often a mutual journey.

Data miners, by themselves, rarely have the deeper knowledge of business that a professional from the business itself has. Yet the business owner will often have very little knowledge of what data mining is about, and indeed, given the hype, may well have the wrong idea. It is not until they start getting to see some actual data mining models for their business that they start to understand the project, the possibilities, and a glimpse of the potential outcomes.

We will relate an actual experience over six months with six significant meetings of the business team and the analytics team. The picture we paint here is a little simplified and idealised but is not too far from reality.

Meeting One. The data miners sit in the corner to listen and learn. The business team understands little about what the data miners might be able to deliver. They discuss their current business issues and steps being taken to improve processes. The data miners have little to offer just yet but are on the lookout for the availability of data from which they can learn.

Meeting Two. The data miners will now often present some observations of the data from their initial analyses. Whilst the analyses might be well presented graphically, and are perhaps interesting, they are yet to deliver any new insights into the business. At least the data miners are starting to get the idea of the business, as far as the business team is concerned.

Meeting Three. The data miners start to demonstrate some initial modelling outcomes. The results begin to look interesting to the business team. They are becoming engaged, asking questions, and understanding that the data mining team has uncovered some interesting insights.

Meeting Four. The data miners are the main agenda item. Their analyses are starting to ring true. They have made some quite interesting discoveries from the data that the business team (the domain and data experts) supplied. The discoveries are nonobvious, and sometimes intriguing. Sometimes they are also rather obvious.

Meeting Five. The models are presented for evaluation. The data mining team has presented its evaluation of how well the models perform and explained the context for the deployment of the models. The business team is now keen to evaluate the model on real cases and monitor its performance over a period of time.

Meeting Six. The models have been deployed into business and are being run daily to match customers and products for marketing, to identify insurance claims or credit card transactions that may be fraudulent, or taxpayers whose tax returns may require refinement. Procedures are in place to monitor the performance of the model over time and to sound alarm bells once the model begins to deviate from expectations.

The key to much of the data mining work described here, in addition to the significance of communication, is the reliance and focus on data. This leads us to identify some key principles for data mining.

1.6 Insights for Data Mining

The starting point with all data mining is the data. We need to have good data that relates to a process that we wish to understand and improve. Without data we are simply guessing. Considerable time and effort spent getting our data into shape is a key factor in the success of a data mining project. In many circumstances, once we have the right data for mining, the rest is straightforward. As many others note, this effort in data collection and data preparation can in fact be the most substantial component of a data mining project.

My list of insights for data mining, in no particular order, includes:

1. Focus on the data and understand the business.
2. Use training/validate/test datasets to build/tune/evaluate models (see the sketch following this list).
3. Build multiple models: most give very similar performance.
4. Question the perfect model as too good to be true.
5. Don't overlook how the model is to be deployed.
6. Stress repeatability and efficiency, using scripts for everything.
7. Let the data talk to you but not mislead you.
8. Communicate discoveries effectively and visually.
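To illustrate the second insight at the command line, the following sketch randomly partitions a data frame into training, validation, and test subsets. The 70/15/15 split and the name ds are illustrative choices only, not prescribed by the book; ds stands for any data frame:

> set.seed(42)  # fix the random seed so the partition is repeatable (insight 6)
> role <- sample(c("train", "validate", "test"), nrow(ds),
                 replace=TRUE, prob=c(0.70, 0.15, 0.15))
> train    <- ds[role == "train", ]
> validate <- ds[role == "validate", ]
> test     <- ds[role == "test", ]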

1.7 Documenting Data Mining

An important task whilst data mining is the recording of the process. We need to be vigilant to record all that is done. This is often best done through the code we write to perform the analysis rather than having to document the process separately. Having a separate process to document the data mining will often mean that it is rarely completed. An implication of this is that we often capture the process as transparent, executable code rather than as a list of instructions for using a GUI.

There are many important advantages to ensuring we document a project through our coding of the data analyses. There will be times when we need to hand a project to another data miner. Or we may cease work on a project for a period of time and return to it at a later stage. Or we have performed a series of analyses and much the same process will need to be repeated again in a year's time. For whatever reason, when we return to a project, we find the documentation, through the coding, essential in being efficient and effective data miners.

Various things should be documented, and most can be documented through a combination of code and comments. We need to document our access to the source data, how the data was transformed and cleaned, what new variables were constructed, and what summaries were generated to understand the data. Then we also need to record how we built models and what models were chosen and considered. Finally, we record the evaluation and how we collect the data to support the benefit that we propose to obtain from the model.

Through documentation, and ideally by developing documented code that tells the story of the data mining project and the actual process as well, we will be communicating to others how we can mine data. Our processes can be easily reviewed, improved, and automated. We can transparently stand behind the results of the data mining by having openly available the process and the data that have led to the results.
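As a minimal sketch of what such self-documenting code can look like (not the book's own example), the following assumes the weather dataset supplied by rattle, with its Date and Location identifier variables, a RISK_MM variable recording the following day's rainfall amount (which would leak the target), and the RainTomorrow target. The comments carry the documentation:

# weather_model.R: a documented, repeatable record of a small analysis.
library(rattle)  # provides the weather dataset
library(rpart)   # provides the decision tree model builder

# Data access: the daily weather observations supplied with rattle.
ds <- weather

# Data cleaning: keep observations with a known outcome, then drop
# identifiers and the variable that leaks the target.
ds <- ds[!is.na(ds$RainTomorrow), ]
ds$Date <- NULL
ds$Location <- NULL
ds$RISK_MM <- NULL

# Model building: a decision tree to predict tomorrow's rain.
model <- rpart(RainTomorrow ~ ., data=ds)

# Evaluation: record the in-sample confusion matrix.
print(table(actual=ds$RainTomorrow,
            predicted=predict(model, type="class")))

Saving these commands to a file means the entire analysis can be re-run at any time with a single command, for example source("weather_model.R").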

1.8 Tools for Data Mining: R

R is used throughout this book to illustrate data mining procedures. It is the programming language used to implement the Rattle graphical user interface for data mining. If you are moving to R from SAS or SPSS, then you will find Muenchen (2008) a great resource.[1] R is a sophisticated statistical software package, easily installed, instructional, state-of-the-art, and it is free and open source. It provides all of the common, most of the less common, and all of the new approaches to data mining.

[1] An early version is available from

The basic modus operandi in using R is to write scripts using the R language. After a while you will want to do more than issue single simple commands and rather write programs and systems for common tasks that suit your own data mining. Thus, saving our commands to an R script file (often with the .R filename extension) is important. We can then rerun our scripts to transform our source data, at will and automatically, into information and knowledge. As we progress through the book, we will become familiar with the common R functions and commands that we might combine into a script.

Whilst for data mining purposes we will focus on the use of the Rattle GUI, more advanced users might prefer the powerful Emacs editor, augmented with the ESS package, to develop R code directly. Both run under GNU/Linux, Mac/OSX, and Microsoft Windows. We also note that direct interaction with R has a steeper learning curve than using GUI-based systems, but once over the hurdle, performing operations over the same or similar datasets becomes very easy using its programming language interface.

A paradigm that is encouraged throughout this book is that of learning by example or programming by example (Cypher, 1993). The intention is that anyone will be able to easily replicate the examples from the book and then fine-tune them to suit their own needs. This is one of the underlying principles of Rattle, where all of the R commands that are used under the graphical user interface are also exposed to the user. This makes it a useful teaching tool in learning R for the specific task of data mining, and also a good memory aid!

1.9 Tools for Data Mining: Rattle

Rattle is built on the statistical language R, but an understanding of R is not required in order to use it. Rattle is simple to use, quick to deploy, and allows us to rapidly work through the data processing, modelling, and evaluation phases of a data mining project.

On the other hand, R provides a very powerful language for performing data mining, well beyond the limitations that are embodied in any graphical user interface and the consequently canned approaches to data mining. When we need to fine-tune and further develop our data mining projects, we can migrate from Rattle to R.

Rattle can save the current state of a data mining task as a Rattle project. A Rattle project can then be loaded at a later time or shared with other users. Projects can be loaded, modified, and saved, allowing check pointing and parallel explorations. Projects also retain all of the R code for transparency and repeatability. This is an important aspect of any scientific and deployed endeavour: to be able to repeat our experiments.

Whilst a user of Rattle need not necessarily learn R, Rattle exposes all of the underlying R code to allow it to be directly deployed within the R Console as well as saved in R scripts for future reference. The R code can be loaded into R (outside of Rattle) to repeat any data mining task.

Rattle by itself may be sufficient for all of a user's needs, particularly in the context of introducing data mining. However, it also provides a stepping stone to more sophisticated processing and modelling in R itself. It is worth emphasising that the user is not limited to how Rattle does things. For sophisticated and unconstrained data mining, the experienced user will progress to interacting directly with R.

The typical workflow for a data mining project was introduced above. In the context of Rattle, it can be summarised as:

1. Load a Dataset.
2. Select variables and entities for exploring and mining.
3. Explore the data to understand how it is distributed or spread.
4. Transform the data to suit our data mining purposes.
5. Build our Models.
6. Evaluate the models on other datasets.
7. Export the models for deployment.

It is important to note that at any stage the next step could well be a step to a previous stage. Also, we can save the contents of Rattle's Log tab as a repeatable record of the data mining process. We illustrate a typical workflow that is embodied in the Rattle interface in Figure 1.1.

[Figure 1.1: The typical workflow of a data mining project as supported by Rattle. The stages are: Understand Business; Identify Data; Select Variables and Their Roles; Explore Distributions; Clean and Transform; Build and Tune Models; Evaluate Models; Deploy Model; Monitor Performance. The annotations note that understanding the business may itself entail a mini data mining project of a few days; that we start by getting as much data as we can and then cull; that exploring data is important for us to understand its shape, size, and content; that we may loop around many times as we clean, transform, and then build and tune our models; that we evaluate performance, structure, complexity, and deployability; and that deployment asks whether the model is run manually on demand or on an automatic schedule, and whether it is deployed stand-alone or integrated into current systems.]

1.10 Why R and Rattle?

R and Rattle are free software in terms of allowing anyone the freedom to do as they wish with them. This is also referred to as open source software to distinguish it from closed source software, which does not provide the source code. Closed source software usually has quite restrictive licenses associated with it, aimed at limiting our freedom using it. This is separate from the issue of whether the software can be obtained for free (which is

often, but not necessarily, the case for open source software) or must be purchased. R and Rattle can be obtained for free.

On 7 January 2009, the New York Times carried a front page technology article on R where a vendor representative was quoted:

I think it addresses a niche market for high-end data analysts that want free, readily available code. ... We have customers who build engines for aircraft. I am happy they are not using freeware when I get on a jet.

This is a common misunderstanding about the concept of free and open source software. R, being free and open source software, is in fact a peer-reviewed software product that a number of the world's top statisticians have developed and others have reviewed. In fact, anyone is permitted to review the R source code. Over the years, many bugs and issues have been identified and rectified by a large community of developers and users.

On the other hand, a closed source software product cannot be so readily and independently verified or viewed by others at will. Bugs and enhancement requests need to be reported back to the vendor. Customers then need to rely on a very select group of vendor-chosen people to assure the software, rectify any bugs in it, and enhance it with new algorithms. Bug fixes and enhancements can take months or years, and generally customers need to purchase the new versions of the software. Both scenarios (open source and closed source) see a lot of effort put into the quality of their software. With open source, though, we all share it, whereas we can share and learn very little about the algorithms we use from closed source software.

It is worthwhile to highlight another reason for using R in the context of free and commercial software. In obtaining any software, due diligence is required in assessing what is available. However, what is finally delivered may be quite different from what was promised or even possible with the software, whether it is open source or closed source, free or commercial. With free open source software, we are free to use it without restriction. If we find that it does not serve our purposes, we can move on with minimal cost. With closed source commercial purchases, once the commitment is made to buy the software and it turns out not to meet our requirements, we are generally stuck with it, having made the financial commitment, and have to make do.

Moving back to R specifically, many have identified the pros and cons of using this statistical software package. We list some of the advantages of using R:

- R is the most comprehensive statistical analysis package available. It incorporates all of the standard statistical tests, models, and analyses, as well as providing a comprehensive language for managing and manipulating data. New technology and ideas often appear first in R.
- R is a programming language and environment developed for statistical analysis by practising statisticians and researchers. It reflects well on a very competent community of computational statisticians.
- R is now maintained by a core team of some 19 developers, including some very senior statisticians.
- The graphical capabilities of R are outstanding, providing a fully programmable graphics language that surpasses most other statistical and graphical packages.
- The validity of the R software is ensured through openly validated and comprehensive governance as documented for the US Food and Drug Administration (R Foundation for Statistical Computing, 2008). Because R is open source, unlike closed source software, it has been reviewed by many internationally renowned statisticians and computational scientists.
- R is free and open source software, allowing anyone to use and, importantly, to modify it. R is licensed under the GNU General Public License, with copyright held by The R Foundation for Statistical Computing.
- R has no license restrictions (other than ensuring our freedom to use it at our own discretion), and so we can run it anywhere and at any time, and even sell it under the conditions of the license.
- Anyone is welcome to provide bug fixes, code enhancements, and new packages, and the wealth of quality packages available for R is a testament to this approach to software development and sharing.

- R has over 4800 packages available from multiple repositories specialising in topics like econometrics, data mining, spatial analysis, and bio-informatics.
- R is cross-platform. R runs on many operating systems and different hardware. It is popularly used on GNU/Linux, Macintosh, and Microsoft Windows, running on both 32- and 64-bit processors.
- R plays well with many other tools, importing data, for example, from CSV files, SAS, and SPSS, or directly from Microsoft Excel, Microsoft Access, Oracle, MySQL, and SQLite. It can also produce graphics output in PDF, JPG, PNG, and SVG formats, and table output for LaTeX and HTML.
- R has active user groups where questions can be asked and are often quickly responded to, often by the very people who developed the environment; this support is second to none. Have you ever tried getting support from the core developers of a commercial vendor?
- New books for R (the Springer Use R! series) are emerging, and there is now a very good library of books for using R.

Whilst the advantages might flow from the pen with a great deal of enthusiasm, it is useful to note some of the disadvantages or weaknesses of R, even if they are perhaps transitory!

- R has a steep learning curve; it does take a while to get used to the power of R, but no steeper than for other statistical languages.
- R is not so easy to use for the novice. There are several simple-to-use graphical user interfaces (GUIs) for R that encompass point-and-click interactions, but they generally do not have the polish of the commercial offerings.
- Documentation is sometimes patchy and terse, and impenetrable to the non-statistician. However, some very high-standard books are increasingly plugging the documentation gaps.
- The quality of some packages is less than perfect, although if a package is useful to many people, it will quickly evolve into a very robust product through collaborative efforts.

- There is, in general, no one to complain to if something doesn't work. R is a software application that many people freely devote their own time to developing. Problems are usually dealt with quickly on the open mailing lists, and bugs disappear with lightning speed. Users who do require it can purchase support from a number of vendors internationally.
- Many R commands give little thought to memory management, and so R can very quickly consume all available memory. This can be a restriction when doing data mining. There are various solutions, including using 64-bit operating systems that can access much more memory than 32-bit ones.

1.11 Privacy

Before closing out our introduction to data mining and tools for doing it, we need to touch upon the topic of privacy. Laws in many countries can directly affect data mining, and it is very worthwhile to be aware of them and their penalties, which can often be severe. There are basic principles relating to the protection of privacy that we should adhere to. Some are captured by the privacy principles developed by the international Organisation for Economic Co-operation and Development, the OECD (Organisation for Economic Co-operation and Development (OECD), 1980). They include:

- Collection limitation: Data should be obtained lawfully and fairly, while some very sensitive data should not be held at all.
- Data quality: Data should be relevant to the stated purposes, accurate, complete, and up-to-date; proper precautions should be taken to ensure this accuracy.
- Purpose specification: The purposes for which data will be used should be identified, and the data should be destroyed if it no longer serves the given purpose.
- Use limitation: Use of data for purposes other than specified is forbidden.

As data miners, we have a social responsibility to protect our society and individuals for the good and benefit of all of society. Please take that responsibility seriously. Think often and carefully about what you are doing.

1.12 Resources

This book does not attempt to be a comprehensive introduction to using R. Some basic familiarity with R will be gained through our travels in data mining using the Rattle interface and some excursions into R. In this respect, most of what we need to know about R is contained within the book. But there is much more to learn about R and its associated packages. We do list and comment on here a number of books that provide an entrée to R.

A good starting point for handling data in R is Data Manipulation with R (Spector, 2008). The book covers the basic data structures, reading and writing data, subscripting, manipulating, aggregating, and reshaping data.

Introductory Statistics with R (Dalgaard, 2008), as mentioned earlier, is a good introduction to statistics using R. Modern Applied Statistics with S (Venables and Ripley, 2002) is quite an extensive introduction to statistics using R. Moving more towards areas related to data mining, Data Analysis and Graphics Using R (Maindonald and Braun, 2007) provides excellent practical coverage of many aspects of exploring and modelling data using R. The Elements of Statistical Learning (Hastie et al., 2009) is a more mathematical treatise, covering all of the machine learning techniques discussed in this book in quite some mathematical depth. If you are coming to R from a SAS or SPSS background, then R for SAS and SPSS Users (Muenchen, 2008) is an excellent choice. Even if you are not a SAS or SPSS user, the book provides a straightforward introduction to using R.

Quite a few specialist books using R are now available, including Lattice: Multivariate Data Visualization with R (Sarkar, 2008), which covers the extensive capabilities of one of the graphics/plotting packages available for R. A newer graphics framework is detailed in ggplot2: Elegant Graphics for Data Analysis (Wickham, 2009). Bivand et al. (2008) cover applied spatial data analysis, Kleiber and Zeileis (2008) cover applied econometrics, and Cowpertwait and Metcalfe (2009) cover time series, to

name just a few books in the R library.

Moving on from R itself and into data mining, there are very many general introductions available. One that is commonly used for teaching in computer science is Han and Kamber (2006). It provides a comprehensive generic introduction to most of the algorithms used by a data miner. It is presented at a level suitable for information technology and database graduates.


Chapter 2

Getting Started

New ideas are often most effectively understood and appreciated by actually doing something with them. So it is with data mining. Fundamentally, data mining is about practical application, application of the algorithms developed by researchers in artificial intelligence, machine learning, computer science, and statistics.

This chapter is about getting started with data mining. Our aim throughout this book is to provide hands-on practise in data mining, and to do so we need some computer software. There is a choice of software packages available for data mining. These include commercial closed source software (which is also often quite expensive) as well as free open source software. Open source software (whether freely available or commercially available) is always the best option, as it offers us the freedom to do whatever we like with it, as discussed in Chapter 1. This includes extending it, verifying it, tuning it to suit our needs, and even selling it. Such software is often of higher quality than commercial closed source software because of its open nature.

For our purposes, we need some good tools that are freely available to everyone and can be freely modified and extended by anyone. Therefore we use the open source and free data mining tool Rattle, which is built on the open source and free statistical software environment R. See Appendix A for instructions on obtaining the software. Now is a good time to install R. Much of what follows for the rest of the book, and specifically this chapter, relies on interacting with R and Rattle.

We can, quite quickly, begin our first data mining project, with Rattle's support. The aim is to build a model that captures the essence of the knowledge discovered from our data. Be careful, though: there is a

lot of effort required in getting our data into shape. Once we have quality data, Rattle can build a model with just four mouse clicks, but the effort is in preparing the data and understanding and then fine-tuning the models.

In this chapter, we use Rattle to build our first data mining model, a simple decision tree model, which is one of the most common models in data mining. We cover starting up (and quitting from) R, an overview of how we interact with Rattle, and then how to load a dataset and build a model. Once the enthusiasm for building a model is satisfied, we then review the larger tasks of understanding the data and evaluating the model. Each element of Rattle's user interface is then reviewed before we finish by introducing some basic concepts related to interacting directly with and writing instructions for R.

2.1 Starting R

R is a command line tool that is initiated either by typing the letter R (capital R; R is case-sensitive) into a command line window (e.g., a terminal in GNU/Linux) or by opening R from the desktop icon (e.g., in Microsoft Windows and Mac/OSX). This assumes that we have already installed R, as detailed in Appendix A.

One way or another, we should see a window (Figure 2.1) displaying the R prompt (> ), indicating that R is waiting for our commands. We will generally refer to this as the R Console. The Microsoft Windows R Console provides additional menus specifically for working with R. These include options for working with script files, managing packages, and obtaining help.

We start Rattle by loading rattle into the R library using library(). We supply the name of the package to load as the argument to the command. The rattle() command is then entered with an empty argument list, as shown below. We will then see the Rattle GUI displayed, as in Figure 2.2.

> library(rattle)
> rattle()

The Rattle user interface is a simple tab-based interface, with the idea being to work from the leftmost tab to the rightmost tab, mimicking the typical data mining process.
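If the library() call fails because the package is not yet installed, rattle can typically first be fetched from CRAN, as sketched below; Appendix A covers installation in detail, and the exact steps can differ between platforms:

> install.packages("rattle")
> library(rattle)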

Figure 2.1: The R Console for GNU/Linux and Microsoft Windows. The prompt indicates that R is awaiting user commands.

Figure 2.2: The initial Rattle window displays a welcome message and a little introduction to Rattle and R.

Tip: The key to using Rattle, as hinted at in the status bar on starting up Rattle, is to supply the appropriate information for a particular tab and to then click the Execute button to perform the action. Always make sure you have clicked the Execute button before proceeding to the next step.

2.2 Quitting Rattle and R

A rather important piece of information, before we get into the details, is how to quit from the applications. To exit from Rattle, we simply click the Quit button. In general, this won't terminate the R Console. For R, the startup message (Figure 2.1) tells us to type q() to quit. We type this command into the R Console, including the parentheses so that the command is invoked rather than simply listing its definition. Pressing Enter will then ask R to quit:

> q()
Save workspace image? [y/n/c]:

We are prompted to save our workspace image. The workspace refers to all of the datasets and any other objects we have created in the current R session. We can save all of the objects currently available in a workspace between different invocations of R. We do so by choosing the y option. We might be in the middle of some complex analysis and wish to resume it at a later time, so this option is useful.

Many users generally answer n each time here, having already captured their analyses into script files. Script files allow us to automatically regenerate the results as required, and perhaps avoid saving and managing very large workspace files. If we do not actually want to quit, we can answer c to cancel the operation and return to the R Console.
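We need not wait until quitting to capture our work. As a minimal sketch (the filenames here are hypothetical), save.image() writes the current workspace to a file at any time, and source() replays a saved script file to regenerate its results:

> save.image("mysession.RData")   # snapshot the current workspace
> source("myanalysis.R")          # re-run an analysis captured in a script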

2.3 First Contact

In Chapter 1, we identified that a significant amount of effort within a data mining project is spent in processing our data into a form suitable for data mining. The amount of such effort should not be underestimated, but we do skip this step for now.

Once we have processed our data, we are ready to build a model, and with Rattle we can build the model with just a few mouse clicks. Using a sample dataset that someone else has already prepared for us, in Rattle we simply:

1. Click on the Execute button. Rattle will notice that no dataset has been identified, so it will take action, as in the next step, to ensure we have some data. This is covered in detail in Section 2.4 and Chapter 4.

2. Click on Yes within the resulting popup. The weather dataset is provided with Rattle as a small and simple dataset with which to explore the concepts of data mining. The dataset is described in detail in Chapter 3.

3. Click on the Model tab. This will change the contents of Rattle's main window to display options and information related to the building of models. This is where we tell Rattle what kind of model we want to build and how it should be built. The Model tab is described in more detail in Section 2.5, and model building is discussed in considerable detail in Chapters 8 to 14.

4. Click on the Execute button. Once we have specified what we want done, we ask Rattle to do it by clicking the Execute button. For simple model builders and small datasets, Rattle will take only a second or two before we see the results displayed in the text view window.

The resulting decision tree model, displayed textually in Rattle's text view, is based on a sample dataset of historic daily weather observations (the curious can skip a few pages ahead to see the actual decision tree in Figure 2.5 on page 30). The data comes from a weather monitoring station located in Canberra, Australia, via the Australian Bureau of Meteorology. Each observation is a summary of the weather conditions on a particular day. It has been processed to include a target variable that indicates whether it rained the day following the particular observation. Using this historic data, we have built a model to predict whether it will rain tomorrow. Weather data is commonly available, and you might be able to build a similar model based on data from your own region.

With only one or two more clicks, further models can be built. A few more clicks and we have an evaluation chart displaying the performance of the model. Then, with just a click or two more, we will have the model applied to a new dataset to generate scores for new observations.

Now to the details. We will continue to use Rattle and also the simple command line facility. The command line is not strictly necessary in using Rattle, but as we develop our data mining capability, it will become useful. We will load data into Rattle and explain the model that we have built. We will build a second model and compare their performances. We will then apply the model to a new dataset to provide scores for a collection of new observations (i.e., predictions of the likelihood of it raining tomorrow).

2.4 Loading a Dataset

With Rattle we can load a sample dataset in preparation for modelling, as we have just done. Now we want to illustrate loading any data (perhaps our own data) into Rattle. If we have followed the four steps in Section 2.3, then we will now need to reset Rattle. Simply click the New button within the toolbar. We are asked to confirm that we would like to clear the current project.

Alternatively, we might have exited Rattle and R, as described in Section 2.2, and need to restart everything, as described in Section 2.1. Either way, we need to have a fresh Rattle ready so that we can follow the examples below.

On starting Rattle, we can, without any other action, click the Execute button in the toolbar. Rattle will notice that no CSV file (the default data format) has been specified (notice the (None) in the Filename: chooser) and will ask whether we wish to use one of the sample datasets supplied with the package. Click on Yes to do so, to see the data listed, as shown in Figure 2.3.

Figure 2.3: The sample weather.csv file has been loaded into memory as a dataset ready for mining. The dataset consists of 366 observations and 24 variables, as noted in the status bar. The first variable has a role other than the default Input role. Rattle uses heuristics to initialise the roles.

The file weather.csv will be loaded by default into Rattle as its dataset. Within R, a dataset is actually known as a data frame, and we will see this terminology frequently.

The dataset summary (Figure 2.3) provides a list of the variables, their data types, default roles, and other useful information. The types will generally be Numeric (if the data consists of numbers, like temperature, rainfall, and wind speed) or Categoric (if the data consists of characters from the alphabet, like the wind direction, which might be N or S, etc.), though we can also see an Ident (identifier). An Ident is often one of the variables (columns) in the data that uniquely identifies each observation (row) of the data. The Comments column includes general information like the number of unique (or distinct) values the variable has and how many observations have a missing value for a variable.
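The same sample dataset can also be loaded directly at the command line. The following sketch assumes the rattle package ships its sample CSV files under a csv directory (fname and ds are our own names here); read.csv() returns the data as a data frame, and dim() confirms the 366 observations and 24 variables noted above:

> fname <- system.file("csv", "weather.csv", package="rattle")
> ds <- read.csv(fname)
> dim(ds)
[1] 366  24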

2.5 Building a Model

Using Rattle, we click the Model tab and are presented with the Model options (Figure 2.4). To build a decision tree model, one of the most common data mining models, click the Execute button (decision trees are the default). A textual representation of the model is shown in Figure 2.4.

The target variable (which stores the outcome we want to model or predict) is RainTomorrow, as we see in the Data tab window of Figure 2.3. Rattle automatically chose this variable as the target because it is the last variable in the data file and is a binary (i.e., two-valued) categoric. Using the weather dataset, our modelling task is to learn about the prospect of it raining tomorrow given what we know about today.

The textual presentation of the model in Figure 2.4 takes a little effort to understand and is further explained in Chapter 11. For now, we might click on the Draw button provided by Rattle to obtain the plot that we see in Figure 2.5. The plot provides a better idea of why it is called a decision tree. This is just a different way of representing the same model.

Clicking the Rules button will display a list of rules that are derived directly from the decision tree (we need to scroll the panel contained in the Model tab to see them). This is yet another way to represent the same model. The rules are listed here, and we explain them in detail next.

Figure 2.4: The weather dataset has been loaded, and a decision tree model has been built.

Rule number: 7 [RainTomorrow=Yes cover=27 (11%) prob=0.74]
  Pressure3pm< 1012
  Sunshine< 8.85

Rule number: 5 [RainTomorrow=Yes cover=9 (4%) prob=0.67]
  Pressure3pm>=1012
  Cloud3pm>=7.5

Rule number: 6 [RainTomorrow=No cover=25 (10%) prob=0.20]
  Pressure3pm< 1012
  Sunshine>=8.85

Rule number: 4 [RainTomorrow=No cover=195 (76%) prob=0.05]
  Pressure3pm>=1012
  Cloud3pm< 7.5

A well-recognised advantage of the decision tree representation for a model is that the paths through the decision tree can be interpreted as a collection of rules, as above. The rules are perhaps the more readable representation of the model. They are listed in the order of the probability (prob) that we see listed with each rule.

Figure 2.5: The decision tree built "out of the box" with Rattle. We traverse the tree by following the branches corresponding to the tests at each node. The notation on the root (top) node indicates that we travel down the left branch if Pressure3pm is greater than or equal to 1011.9 and down the right branch if it is less than 1011.9; nodes labelled the other way around are similar, but reversed. The leaf nodes include a node number for reference, a decision of No or Yes to indicate whether it will RainTomorrow, the number of training observations, and the strength or confidence of the decision.

The interpretation of the probability will be explained in more detail in Chapter 11, but we provide an intuitive reading here.

Rule number 7 (which corresponds to leaf node number 7 in Figure 2.5 and to the line so numbered in Figure 2.4) is the strongest rule predicting rain (having the highest probability for a Yes). We can read it as saying that if the atmospheric pressure (reduced to mean sea level) at 3 pm was less than 1012 hectopascals and the amount of sunshine today was less than 8.85 hours, then it seems there is a 74% chance of rain tomorrow (yval = Yes and prob = 0.74). That is to say that on most days when we have previously seen these conditions (as represented in the data) it has rained the following day.

Progressing down to the other end of the list of rules, we find the conditions under which it appears much less likely that there will be rain the following day. Rule number 4 has two conditions: the atmospheric pressure at 3 pm greater than or equal to 1012 hectopascals and cloud cover at 3 pm less than 7.5. When these conditions hold, the historic data tells us that it is unlikely to be raining tomorrow. In this particular case, it suggests only a 5% probability (prob=0.05) of rain tomorrow.

We now have our first model. We have data-mined our historic observations of weather to help provide some insight about the likelihood of it raining tomorrow.
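We can grow essentially the same model directly at the command line using rpart, the package that Rattle itself calls on for its decision trees. The following is a minimal sketch rather than a replica of Rattle's exact settings: it drops the identifier and risk variables by column position (mirroring the variable roles of Figure 2.3) and, unlike Rattle, trains on all 366 observations rather than a 70% sample, so the resulting tree may differ slightly from the figures:

> library(rpart)
> library(rattle)    # provides the weather dataset
> vars <- c(2:22, 24)
> model <- rpart(RainTomorrow ~ ., data=weather[vars])
> model              # a textual view, as in Rattle's Model tab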

2.6 Understanding Our Data

We have reviewed the modelling part of data mining above with very little attention to the data. A realistic data mining project, though, will precede modelling with quite an extensive exploration of the data, in addition to understanding the business, understanding what data is available, and transforming such data into a form suitable for modelling. There is a lot more involved than just building a model. We look now at exploring our data to better understand it and to identify what we might want to do with it.

Rattle's Explore tab provides access to some common plots as well as extensive data exploration possibilities through latticist (Andrews, 2010) and rggobi (Lang et al., 2011). We will cover exploratory data analysis in detail in Chapters 5 and 6. We present here an initial flavour of exploratory data analysis.

One of the first things we might want to know is how the values of the target variable (RainTomorrow) are distributed. A histogram might be useful for this. The simplest way to create one is to go to the Data tab, click on the Input role for RainTomorrow, and click the Execute button. Then go to the Explore tab, choose the Distributions option, and then select Bar Plot for RainTomorrow. The plot of Figure 2.6 will be shown.

We can see from Figure 2.6 that the target variable is highly skewed. More than 80% of the days have no rain. This is typical of data mining, where even greater skewness is not uncommon. We need to be aware of the skewness, for example, in evaluating any models we build: a model that simply predicts that it never rains is going to be over 80% accurate, but pretty useless.
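The skewness is easy to confirm numerically. A quick sketch at the command line, with the weather dataset available once rattle is loaded:

> table(weather$RainTomorrow)
> round(100 * prop.table(table(weather$RainTomorrow)))   # percentages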

Figure 2.6: The target variable, RainTomorrow, is skewed, with Yes being quite underrepresented.

We can display other simple plots from the Explore tab by selecting the Distributions option. Under both the Box Plot and Histogram columns, select MaxTemp and Sunshine (as in Figure 2.7). Then click on Execute to display the plots in Figure 2.8. The plots begin to tell a story about the data. We sketch the story here, leaving the details to Chapter 5.

The top two plots are known as box-and-whisker plots. The top left plot tells us that the maximum temperature is generally higher the day before it rains (the plot above the x-axis label Yes) than before the days when it does not rain (above the No). The top right plot suggests an even more dramatic skew for the amount of sunshine the day prior to the prediction. Generally we see that if there is less sunshine the day before, then the chance of rain (Yes) seems to be increased. Both box plots also give another clue about the distribution of the values of the target variable: the width of the boxes in a box plot provides a visual indication of this distribution.

Each bottom plot overlays three separate plots that give further insight into the distribution of the observations. The three plots within each figure are a histogram (bars), a density plot (lines), and a rug plot (short spikes on the x-axis), each of which we now briefly describe. The histogram has partitioned the numeric data into segments of equal width, showing the frequency for each segment. We see again that sunshine (the bottom right) is quite skewed compared with the maximum temperature.
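Plots very much like these can be generated with base R graphics. A minimal sketch (Rattle's own plots add the per-class densities and neater annotations):

> boxplot(MaxTemp ~ RainTomorrow, data=weather)     # box-and-whisker plot
> hist(weather$Sunshine, freq=FALSE)                # histogram on a density scale
> lines(density(weather$Sunshine, na.rm=TRUE))      # overlay the density plot
> rug(na.omit(weather$Sunshine))                    # add the rug plot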

Figure 2.7: Selecting MaxTemp and Sunshine under both the Box Plot and Histogram columns of the Distributions option on the Explore tab.

The density plots tend to convey a more accurate picture of the distribution of the data. Because the density plot is a simple line, we can also display the density plots for each of the target classes (Yes and No).

Along the x-axis is the rug plot. The short vertical lines represent actual observations. This can give us an idea of where any extreme values are, and the dense parts show where more of the observations lie. These plots are useful in understanding the distribution of the numeric data.

Rattle similarly provides a number of simple standard plots for categoric variables. A selection is shown in Figure 2.9. All three plots show a different view of the one variable, WindDir9am, as we now describe.

The top plot of Figure 2.9 shows a very simple bar chart, with bars corresponding to each of the levels (or values) of the categoric variable of interest (WindDir9am). The bar chart has been sorted from the overall most frequent to the overall least frequent categoric value.

Figure 2.8: A sample of distribution plots for two variables.

We note that each value of the variable (e.g., the value SE, representing a wind direction of southeast) has three bars. The first bar is the overall frequency (i.e., the number of days) for which the wind direction at 9 am was from the southeast. The second and third bars show the breakdown of that frequency across the respective values of the categoric target variable (i.e., for No and Yes). We can see that the distribution within each wind direction differs between the three groups, some more than others. Recall that the three groups correspond to all observations (All), observations where it did not rain on the following day (No), and observations where it did (Yes).

The lower two plots show essentially the same information, in different forms. The bottom left plot is a dot plot. It is similar to the bar chart, laid on its side and with dots representing the tops of the bars. The breakdown into the levels of the target variable is compactly shown as dots within the same row.

The bottom right plot is a mosaic plot, with all bars having the same height. The relative frequencies between the values of WindDir9am are now indicated by the widths of the bars. Thus, SE is the widest bar, and WSW is the thinnest. The proportion between No and Yes within each bar is clearly shown.

Figure 2.9: A sample of the three distribution plots for the one categoric variable.

A mosaic plot allows us to easily identify levels that have very different proportions associated with the levels of the target variable. We can see that a north wind direction has a higher proportion of observations where it rains the following day. That is, if there is a northerly wind today, then the chance of rain tomorrow seems to be increased.

These examples demonstrate that data visualisation (or exploratory data analysis) is a powerful tool for understanding our data: a picture is worth a thousand words. We actually learn quite a lot about our data even before we start to specifically model it. Many data miners begin to deliver significant benefits to their clients simply by providing such insights. We delve further into exploring data in Chapter 5.
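Base R can reproduce the mosaic plot too. A one-line sketch using mosaicplot() on the cross-tabulation of the two variables:

> mosaicplot(table(weather$WindDir9am, weather$RainTomorrow),
             main="WindDir9am by RainTomorrow")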

2.7 Evaluating the Model: Confusion Matrix

We often begin a data mining project by exploring the data to gain our initial insights. In all likelihood, we then also transform and clean up our data in various ways. We have illustrated above how to then build our first model. It is now time to evaluate the performance or quality of the model.

Evaluation is a critical step in any data mining process, and one that is often left underdone. For the sake of getting started, we will look at a simple evaluation tool. The confusion matrix (also referred to as the error matrix) is a common mechanism for evaluating model performance.

In building our model we used a 70% subset of all of the available data. Figure 2.3 (page 27) shows the default sampling strategy of 70/15/15. We call the 70% sample the training dataset. The remainder is split equally into a validation dataset (15%) and a testing dataset (15%).

The validation dataset is used to test different parameter settings or different choices of variables whilst we are data mining. It is important to note that this dataset should not be used to provide any error estimations of the final results from data mining, since it has been used as part of the process of building the model.

The testing dataset is only to be used to predict the unbiased error of the final results. It is important not to use this testing dataset in any way in building or even fine-tuning the models that we build. Otherwise, it no longer provides an unbiased estimate of the model performance.

The testing dataset and, whilst we are building models, the validation dataset are used to test the performance of the models we build. This often involves calculating the model error rate. A confusion matrix simply compares the decisions made by the model with the actual decisions. This will provide us with an understanding of the level of accuracy of the model in terms of how well the model will perform on new, previously unseen data.

Figure 2.10 shows the Evaluate tab with the Error Matrix (confusion matrix) using the Testing dataset for the Tree model that we have previously seen in Figures 2.4 and 2.5. Two tables are presented. The first lists the actual counts of observations and the second the percentages. We can observe that for 62% of the predictions the model correctly predicts that it won't rain (called the true negatives). That is, 35 days out of the 56 days are correctly predicted as not raining. Similarly, we see the model correctly predicts rain (called the true positives) on 18% of the days.

In terms of how correct the model is, we observe that it correctly predicts rain for 10 days out of the 15 days on which it actually does rain. This is a 67% accuracy in predicting rain. We call this the true positive rate, but it is also known as the recall and the sensitivity of the model. Similarly, the true negative rate (also called the specificity of the model) is 85%.

Figure 2.10: A confusion matrix applying the model to the testing dataset is displayed.

We also see six days when we are expecting rain and none occurs (called the false positives). If we were using this model to help us decide whether to take an umbrella or raincoat with us on our travels tomorrow, then this is probably not a serious loss: in this circumstance we had to carry an umbrella without needing to use it. Perhaps more serious, though, is that there are five days when our model tells us there will be no rain yet it rains (called the false negatives). We might get inconveniently wet without our umbrella. The concepts of true and false positives and negatives will be further covered in Chapter 15.

The performance measure here tells us that we are going to get wet more often than we would like. This is an important issue: the different types of errors have different consequences for us. We will also see more about this in Chapter 15.
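A confusion matrix is straightforward to compute ourselves. The sketch below is self-contained but simplified: it uses a single 70/30 split rather than Rattle's 70/15/15 partition, and train, actual, and predicted are our own names rather than Rattle's internal objects, so the counts will differ a little from Figure 2.10:

> library(rpart)
> library(rattle)
> set.seed(42)
> vars <- c(2:22, 24)
> train <- sample(nrow(weather), 0.7 * nrow(weather))
> model <- rpart(RainTomorrow ~ ., data=weather[train, vars])
> actual <- weather$RainTomorrow[-train]
> predicted <- predict(model, weather[-train, vars], type="class")
> table(actual, predicted)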

It is useful to compare the performance as measured using the validation and testing datasets with the performance as measured using the training dataset. To do so, we can select the Validation and then the Training options (and for completeness the Full option) from the Data line of the Evaluate tab and then Execute each. The resulting performance will be reported. We reproduce all four here for comparison, including the counts and the percentages.

Evaluation Using the Training Dataset:

    Count       Predict           Percentage   Predict
                 No  Yes                        No  Yes
    Actual No   205   10          Actual No     80    4
          Yes    15   26                Yes      6   10

Evaluation Using the Validation Dataset:

    Count       Predict           Percentage   Predict
                 No  Yes                        No  Yes
    Actual No    39    5          Actual No     72    9
          Yes     5    5                Yes      9    9

Evaluation Using the Testing Dataset:

    Count       Predict           Percentage   Predict
                 No  Yes                        No  Yes
    Actual No    35    6          Actual No     62   11
          Yes     5   10                Yes      9   18

Evaluation Using the Full Dataset:

    Count       Predict           Percentage   Predict
                 No  Yes                        No  Yes
    Actual No   279   21          Actual No     76    6
          Yes    25   41                Yes      7   11

We can see that there are fewer errors in the training dataset than in either the validation or testing datasets. That is not surprising, since the tree was built using the training dataset, and so it should be more accurate on what it has already seen. This provides a hint as to why we do not validate our model on the training dataset: the evaluation will provide optimistic estimates of the performance of the model. By applying the model to the validation and testing datasets (which the model has not previously seen), we expect to obtain a more realistic estimate of the performance of the model on new data.

Notice that the overall accuracy from the training dataset is 90% (i.e., adding the diagonal percentages, 80% plus 10%), which is excellent. For the validation and testing datasets, it is around 80%. This is more likely how accurate the model will be in the longer term as we apply it to new observations.
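The overall accuracy and error rate can be read off any such matrix by comparing the diagonal (the correct predictions) with the total. Continuing the hypothetical sketch above:

> cm <- table(actual, predicted)
> round(100 * sum(diag(cm)) / sum(cm))         # overall accuracy (%)
> round(100 * (1 - sum(diag(cm)) / sum(cm)))   # overall error rate (%)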

2.8 Interacting with Rattle

We have now stepped through some of the process of data mining. We have loaded some data, explored it, cleaned and transformed it, built a model, and evaluated the model. The model is now ready to be deployed. Of course, there is a lot more to what we have just done than what we have covered here. The remainder of the book provides much of these details. Before proceeding to the details, though, we might review how we interact with Rattle and R.

We have seen the Rattle interface throughout this chapter, and we now introduce it more systematically. The interface is based on a set of tabs through which we progress as we work our way through a data mining project. For any tab, once we have set up the required information, we will click the Execute button to perform the actions. Take a moment to explore the interface a little. Notice the Help menu and that the help layout mimics the tab layout.

The Rattle interface is designed as a simple interface to a powerful suite of underlying tools for data mining. The general process is to step through each tab, left to right, performing the corresponding actions. For any tab, we configure the options and then click the Execute button (or F2) to perform the appropriate tasks. It is important to note that the tasks are not performed until the Execute button (or F2, or the Execute menu item under Tools) is clicked.

The Status Bar at the base of the window will indicate when the action is completed. Messages from R (e.g., error messages) may appear in the R Console from which Rattle was started. Since Rattle is a simple graphical interface sitting on top of R itself, it is important to remember that some errors encountered by R on loading the data (and in fact during any operation performed by Rattle) may be displayed in the R Console.

The R code that Rattle passes on to R to execute underneath the interface is recorded in the Log tab. This allows us to review the R commands that perform the corresponding data mining tasks. The R code snippets can be copied as text from the Log tab and pasted into the R Console from which Rattle is running, to be directly executed. This allows us to deploy Rattle for basic tasks yet still gives us the full power of R to be deployed as needed, perhaps through using more command options than are exposed through the Rattle interface. This also allows us the opportunity to export the whole session as an R script file.

The log serves as a record of the actions taken and allows those actions to be repeated directly and automatically through R itself at a later time. Simply select (to display) the Log tab and click on the Export button. This will export the log to a file that will have an R extension. We can choose to include or exclude the extensive comments provided in the log and to rename the internal Rattle variables (from crs$ to a string of our own choosing).

We now traverse the main elements of the Rattle user interface, specifically the toolbar and menus. We begin with a basic concept: a project.

Projects

A project is a packaging of a dataset, variable selections, explorations, and models built from the data. Rattle allows projects to be saved for later resumption of the work or for sharing the data mining project with other users. A project is typically saved to a file with a rattle extension. In fact, the file is a standard binary RData file used by R to store objects in a more compact binary form. Any R system can load such a file and hence have access to these objects, even without running Rattle. Loading a rattle file into Rattle (using the Open button) will load that project into Rattle, restoring the data, models, and other displayed information related to the project, including the log and summary information. We can then resume our data mining from that point.

From a file system point of view, we can rename the files (as well as the filename extension, though that is not recommended) without impacting the project file itself. That is, the filename has no formal bearing on the contents, so use it to be descriptive. It is best to avoid spaces and unusual characters in the filenames.
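Because a project file is simply a binary RData file, we can inspect one from any R session, even without Rattle. A minimal sketch (the filename here is hypothetical; as noted later in this chapter, Rattle keeps its state in a container named crs):

> load("myproject.rattle")
> ls()   # the restored objects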

Projects are opened and saved using the appropriate buttons on the toolbar or from the Project menu.

Toolbar

The most important button on the Toolbar (Figure 2.11) is the Execute button. All action is initiated with an Execute, often with a click of the Execute button. A keyboard shortcut for Execute is the F2 function key. A menu item for Execute is also available. It is worth repeating that the user interface paradigm used within Rattle is to set up the parameters on a tab and then Execute the tab.

Figure 2.11: The Rattle menu and toolbar.

The next few buttons on the Toolbar relate to the concept of a project within Rattle. Projects were discussed above. Clicking on the New button will restore Rattle to its pristine startup state with no dataset loaded. This can be useful when a source dataset has been externally modified (external to Rattle and R). We might, for example, have manipulated our data in a spreadsheet or database program and re-exported the data to a CSV file. To reload this file into Rattle, if we have previously loaded it into the current Rattle session, we need to clear Rattle, as with a click of the New button. We can then specify the filename and reload it.

The Report button will generate a formatted report based on the current tab. A number of report templates are provided with Rattle and will generate a document in the open standard ODT format, for the open source and open standards supporting LibreOffice. Whilst support for user-generated reports is limited, the log provides the necessary commands used to generate the ODT file. We can thus create our own ODT templates and apply them within the context of the current Rattle session.

The Export button is available to export various objects and entities from Rattle. Details are available together with the specific sections in the following chapters. The nature of the export depends on which tab is active, and within the tab, which option is active. For example, if the Model tab is on display, then Export will save the current model as PMML (the Predictive Modelling Markup Language; see Chapter 16).

The Export button is not available for all tabs and options.

Menus

The menus (Figure 2.11) provide alternative access to many of the functions of the interface. A key point in introducing menus is that they can be navigated from the keyboard and contain keyboard shortcuts, so that we can navigate more easily through Rattle using the keyboard.

The Project menu provides access to the Open and Save options for loading and saving projects from or to files. The Tools menu provides access to some of the other toolbar functions as well as access to specific tabs. The Settings menu allows us to control a number of optional characteristics of Rattle. This includes tooltips and the use of the more modern Cairo graphics device.

Extensive help is available through the Help menu. The structure of the menu follows that of the tabs of the main interface. On selecting a help topic, a brief text popup will display some basic information. Many of the popups then have the option of displaying further information, which will be displayed within a Web browser. This additional documentation comes directly from the documentation provided by R or the relevant R package.

Interacting with Plots

It is useful to know how we interact with plots in Rattle. Often we will generate plots and want to include them in our own reports. Plots are generated from various places within the Rattle interface.

Rattle optionally uses the Cairo device, which is a vector graphics engine for displaying high-quality graphic plots. If the Cairo device is not available within your installation, then Rattle resorts to the default window device for the operating system (x11() for GNU/Linux and windows() for Microsoft Windows). The Settings menu also allows control of the choice of graphics device (allowing us to use the default by disabling support for Cairo). The Cairo device has a number of advantages, one being that it can be encapsulated within other windows, as is done with Rattle. This allows Rattle to provide some operating-system-independent functionality and a common interface. If we choose not to use the Cairo device, we will have the default devices, which still work just fine, but with less obvious functionality.

Figure 2.8 (page 34) shows a typical Rattle plot window. At the bottom of the window, we see a series of buttons that allow us to Save the plot to a file, to Print it, and to Close it.

The Save button allows us to save the graphics to a file in one of the supported formats. The supported formats include pdf (for high-resolution pictures), png (for vector images and text), jpg (for colourful images), svg (for general scalable vector graphics), and, in Microsoft Windows, wmf (for Windows Metafile, Microsoft Windows-specific vector graphics). A popup will request the filename to save to. The default is to save in PDF format, saving to a file with the filename extension of .pdf. You can choose to save in the other formats simply by specifying the appropriate filename extension.

The Print button will send the plot to a printer. This requires the underlying R application to have been set up properly to access the required printer. This should be the case by default. Once we are finished with the plot, we can click the Close button to shut down that particular plot window.

Keyboard Navigation

Keyboard navigation of the menus is usually initiated with the F10 function key. The keyboard arrow keys can then be used to navigate. Pressing the keyboard's Enter key will then select the highlighted menu item. Judicious use of the keyboard (in particular, the arrow keys, the Tab and Shift-Tab keys, and the Enter key, together with F2 and F10) allows us to completely control Rattle from the keyboard if desired or required.

2.9 Interacting with R

R is a command line tool. We saw in Section 2.1 how to interact with R to start up Rattle. Essentially, R displays a prompt to indicate that it is waiting for us to issue a command. Two such commands are library() and rattle(). In this section, we introduce some basic concepts and commands for interacting with R directly.

Basic Functionality

Generally we instruct R to evaluate functions: a technical term used to describe mathematical objects that return a result. All functions in R return a result, and that result can be passed to other functions to do other things. This simple idea is actually a very powerful concept, allowing functions to do well what they are designed to do (like building a model) and pass on their output to other functions to do something with it (like formatting it for easy reading).

We saw in Section 2.1 two function calls, which we repeat below. The first was a call to the function library(), where we asked R to load rattle. We then started up Rattle with a call to the rattle() function:

> library(rattle)
> rattle()

Irrespective of the purpose of the function, for each function call we usually supply arguments that refine the behaviour of the function. We did that above in the call to library(), where the argument was rattle. Another simple example is to call dim() (dimensions) with the argument weather:

> dim(weather)
[1] 366  24

Here, weather is an object name. We can think of it simply as a reference to some object (something that contains data). The object in this case is the weather dataset as used in this chapter. It is organised as rows and columns. The dim() function reports the number of rows and columns.

If we type a name (e.g., either weather or dim) at the R prompt, R will respond by showing us the object. Typing weather (followed by pressing the Enter key) will result in the actual data. We will see all 366 rows of data scrolled on the screen. If we type dim and press Enter, we will see the definition of the function (which in this case is a primitive function coded into the core of R):

> dim
function (x)  .Primitive("dim")

A common mistake made by new users is to type a function name by itself (without arguments) and end up a little confused about the resulting output. To actually invoke the function, we need to supply the argument list, which may be an empty list. Thus, at a minimum, we add () to the function call on the command line:

> dim()
Error in dim: 0 arguments passed to 'dim' which requires 1

As we see, executing this function will generate an error message. We note that dim() actually needs one argument, and no arguments were passed to it. Some functions can be invoked with no arguments, as is the case for rattle().

The examples above illustrate how we will show our interaction with R. The > is R's prompt, and when we see that, we know that R is waiting for commands. We type the string of characters dim(weather) as the command, in this case a call to the dim function. We then press the Enter key to send the command to R. R responds with the result from the function. In the case above, it returned the result [1] 366 24.

Technically, dim() returns a vector (a sequence of elements or values) of length 2. The [1] simply tells us that the first number we see from the vector (the 366) is the first element of the vector. The second element is 24. The two numbers listed by R in the example above (i.e., the vector returned by dim()) are the number of rows and columns, respectively, in the weather dataset; that is, its dimensions.

For very long vectors, the list of the elements of the vector will be wrapped to fit across the screen, and each line will start with a number within square brackets to indicate what element of the vector we are up to. We can illustrate this with seq(), which generates a sequence of numbers:

> seq(1, 50)
 [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18
[19] 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36
[37] 37 38 39 40 41 42 43 44 45 46 47 48 49 50

We saw above that we can view the actual data stored in an object by typing the name of the object (weather) at the command prompt.

Generally this will print too many lines (although only 366 in the case of the weather dataset). A useful pair of functions for inspecting our data are head() and tail(). These will list just the top and bottom six observations (or rows of data), by default, from the data frame, based on the order in which they appear there. Here we request, through the arguments to the function, to list the top two observations (and we also use indexing, described shortly, to list only the first nine variables):

> head(weather[1:9], 2)
        Date Location MinTemp MaxTemp Rainfall
1 2007-11-01 Canberra     8.0    24.3      0.0
2 2007-11-02 Canberra    14.0    26.9      3.6
  Evaporation Sunshine WindGustDir WindGustSpeed
1         3.4      6.3          NW            30
2         4.4      9.7         ENE            39

Similarly, we can request the bottom three rows of the dataset:

> tail(weather[1:9], 3)
          Date Location MinTemp MaxTemp Rainfall
364 2008-10-29 Canberra    12.5    19.9        0
365 2008-10-30 Canberra    12.5    26.9        0
366 2008-10-31 Canberra    12.3    30.2        0
    Evaporation Sunshine WindGustDir WindGustSpeed
364         8.4      5.3         ESE            43
365         5.0      7.1          NW            46
366         6.0     12.6          NW            78

The weather dataset is more complex than the simple vectors we have seen above. In fact, it is a special kind of list called a data frame, which is one of the most common data structures in R for storing our datasets. A data frame is essentially a list of columns. The weather dataset has 24 columns. For a data frame, each column is a vector, each of the same length.

If we only want to review certain rows or columns of the data frame, we can index the dataset name. Indexing simply uses square brackets to list the row numbers and column numbers that are of interest to us:

> weather[4:8, 2:4]

(R lists rows 4 to 8 of the Location, MinTemp, and MaxTemp columns: each row names Canberra as the location together with that day's minimum and maximum temperatures.)

Notice the notation for a sequence of numbers. The string 4:8 is actually equivalent to a call to seq() with two arguments, 4 and 8. The function returns a vector containing the integers from 4 to 8. It's the same as listing them all and combining them using c():

> 4:8
[1] 4 5 6 7 8
> seq(4, 8)
[1] 4 5 6 7 8
> c(4, 5, 6, 7, 8)
[1] 4 5 6 7 8

Getting Help

It is important to know how we can learn more about using R. From the command line, we obtain help on commands by calling help():

> help(dim)

A shorthand is to precede the argument with a ?, as in ?dim. This is automatically converted into a call to help(). The help.search() function will search the documentation to list functions that may be of relevance to the topic we supply as an argument:

> help.search("dimensions")

The shorthand here is to precede the string with two question marks, as in ??dimensions. A third command for searching for help on a topic is RSiteSearch(). This will submit a query to the R project's search engine on the Internet:

> RSiteSearch("dimensions")

Quitting R

Recall that to exit from R, as we saw in Section 2.2, we issue q():

> q()

Our first session with R is now complete. The command line, as we have introduced here, is where we access the full power of R. But not everyone wants to learn and remember commands, so Rattle will get us started quite quickly into data mining, with only our minimal knowledge of the command line.

R and Rattle Interactions

Rattle generates R commands that are passed on through to R at various times during our interactions with Rattle. In particular, whenever the Execute button is clicked, Rattle constructs the appropriate R commands and then sends them off to R and awaits R's response.

We can also interact with R itself directly, and even interleave our interactions with Rattle and R. In Section 2.5, for example, we saw a decision tree model represented textually within Rattle's text view. The same can also be viewed in the R Console using print(). We can replicate that here once we have built the decision tree model as described in Section 2.5.

The R Console window is where we can enter R commands directly. We first need to make the window active, usually by clicking the mouse within that window. For the example below, we assume we have run Rattle on the weather dataset to build a decision tree as described in Section 2.5. We can then type the print() command at the prompt. We see this in the code box below. The command itself consists of the name of an R function we wish to call on (print() in this case), followed by a list of arguments we pass to the function. The arguments provide information about what we want the function to do. The reference we see here, crs$rpart, identifies where the model itself has been saved internally by Rattle. The parameter digits= specifies the precision of the printed numbers. In this case we are choosing a single digit.

After typing the full command (including the function name and arguments), we then press the Enter key. This has the effect of passing the command to R. R will respond with the text exactly as shown below. The text starts with an indication of the number of observations (256). This is followed by the same textual presentation of the model we saw in Section 2.5.

> print(crs$rpart, digits=1)
n= 256

node), split, n, loss, yval, (yprob)
      * denotes terminal node

1) root 256 41 No (0.8 0.2)
  2) Pressure3pm>=1e+03 204 16 No (0.9 0.08)
    4) Cloud3pm< 8 195 10 No (0.9 0.05) *
    5) Cloud3pm>=8 9 3 Yes (0.3 0.7) *
  3) Pressure3pm< 1e+03 52 25 No (0.5 0.5)
    6) Sunshine>=9 25 5 No (0.8 0.2) *
    7) Sunshine< 9 27 7 Yes (0.3 0.7) *

Commands versus Functions

We have referred above to the R command line, where we enter commands to be executed. We also talked about functions that we type on the command line that make up the command to be executed. In this book, we will adopt a particular terminology around functions and commands, which we describe here.

In its true mathematical sense, a function is some operation that consumes some data and returns some result. Functions like dim(), seq(), and head(), as we have seen, do this. Functions might also have what we often call side effects; that is, they might do more than simply returning some result. In fact, the purpose of some functions is actually to perform some other action without necessarily returning a result. Such functions we will tend to call commands. The function rattle(), for example, does not return any result to the command line as such. Instead, its purpose is to start up the GUI and allow us to start data mining. Whilst rattle() is still a function, we will usually refer to it as a command rather than a function. The two terms can be used interchangeably.

Programming Styles for R

R is a programming language supporting different programming styles. We can use R to write programs that analyse data; we program the data analyses. Note that if we are only using Rattle, then we will not need to program directly. Nonetheless, for the programs we might write, we can take advantage of the numerous programming styles offered by R to develop code that analyses data in a consistent, simple, reusable, transparent, and error-free way.

Mistakenly, we are often trained to think that writing sentences in a programming language is primarily for the benefit of having a computer perform some activity for us. Instead, we should think of the task as really writing sentences that convey to other humans a story: a story about analysing our data. Coincidentally, we also want a computer to perform some activity. Keeping this simple message in mind whenever writing in R helps to ensure we write in such a way that others can easily understand what we are doing and that we can also understand what we have done when we come back to it after six months or more.

Environments as Containers in R

For a particular project, we will usually analyse a collection of data, possibly transforming it and storing different bits of information about it. It is convenient to package all of our data and what we learn about it into some container, which we might save as a binary R object and reload more efficiently at a later time. We will use R's concept of an environment for this.

As a programming style, we can create a storage space and give it a name (i.e., it will look like a programming language variable) to act as a container. The container is an R environment and is initialised using new.env() (new environment). Here, we create a new environment and give it the name en:

> en <- new.env()

The object en now acts as a single container into which we can place all the relevant information associated with the dataset and that can also be shared amongst several models. We will store and access the relevant information from this container.

Data is placed into the container using the $ notation and the assignment operator, as we see in the following example:

> en$obs <- 4:8
> en$obs
[1] 4 5 6 7 8
> en$vars <- 2:4
> en$vars
[1] 2 3 4

The variables obs and vars are now contained within the environment referenced as en. We can operate on variables within an environment without using the $ notation (which can become quite cumbersome) by wrapping the commands within evalq():

> evalq(
{
  nobs <- length(obs)
  nvars <- length(vars)
}, en)
> en$nobs
[1] 5
> en$nvars
[1] 3

The use of evalq() becomes most convenient when we have more than a couple of statements to write. At any time, we can list the contents of the container using ls():

> ls(en)
[1] "nobs"  "nvars" "obs"   "vars"

Another useful function, provided by gdata (Warnes, 2011), is ll(), which provides a little more information:

> library(gdata)
> ll(en)
        Class KB
nobs  integer  0
nvars integer  0
obs   integer  0
vars  integer  0

We can also convert the environment to a list using as.list():

> as.list(en)
$nvars
[1] 3

$nobs
[1] 5

$vars
[1] 2 3 4

$obs
[1] 4 5 6 7 8

By keeping all the data related to a project together, we can save and load the project through this one object. We also avoid polluting the global environment with lots of objects and losing track of what they all relate to, possibly confusing ourselves and others.

We can now also quite easily use the same variable names, but within different containers. Then, when we write scripts to build models, for example, often we will be able to use exactly the same scripts, changing only the name of the container. This encourages the reuse of our code and promotes efficiencies.

This approach is also sympathetic to the concept of object-oriented programming. The container is a basic object in the object-oriented programming context. We will use this approach of encapsulating all of our data and information within a container when we start building models. The following provides the basic template:

> library(rpart)
> weatherDS <- new.env()
> evalq({
  data <- weather
  nobs <- nrow(data)
  vars <- c(2:22, 24)
  form <- formula(RainTomorrow ~ .)
  target <- all.vars(form)[1]
  train <- sample(nobs, 0.7*nobs)
}, weatherDS)
> weatherRPART <- new.env(parent=weatherDS)
> evalq({
  model <- rpart(formula=form, data=data[train, vars])
  predictions <- predict(model, data[-train, vars])
}, weatherRPART)

Here we have created two containers, one for the data and the other for the model. The model container (weatherRPART) has as its parent the data container (weatherDS), which is achieved by specifying the parent= argument. This makes the variables defined in the data container available within the model container.

To save a container to a file, for use at a later time or to document stages within the data mining project, use save():

> save(weatherDS, file="weatherDS.RData")

It can later be loaded using load():

> load("weatherDS.RData")

It can at times become tiresome to be wrapping our code up within a container. Whilst we retain the discipline of using containers, we can also quickly interact with the variables in a container without having to specify the container each time. We use attach() and detach() to add a container into the so-called search path used by R to find variables. Thus we could do something like the following:

> attach(weatherRPART)
> print(model)
> detach(weatherRPART)

However, creating new variables to store within the environment will not work in the same way. Thus:

> attach(weatherRPART)
> new.model <- model
> detach(weatherRPART)

does not place the variable new.model into the weatherRPART environment. Instead, it goes into the global environment.

A convenient feature, particularly with the layout used within the evalq() examples above and generally throughout the book, is that we could ignore the string that starts a block of code (which is the line containing evalq({ ) and the string that ends a block of code (which is the line containing }, weatherDS) ) and simply copy-and-paste the other commands directly into the R console. The variables (data, nobs, etc.) are then created in the global environment, and nothing special is needed to access them. This is useful for quickly testing out ideas, for example, and is provided as a choice if you prefer not to use the container concept yourself.

Containers do, however, provide useful benefits. Rattle uses containers internally to collect together the data it needs. The Rattle container is called crs (the current rattle store). Once a dataset is loaded into Rattle, for example, it is stored as crs$dataset. We saw crs$rpart above, referring to the decision tree we built earlier.

2.10 Summary

In this chapter, we have become familiar with the Rattle interface for data mining with R. We have also built our first data mining model, albeit using an already prepared dataset. We have also introduced some of the basics of interacting with the R language.

We are now ready to delve into the details of data mining. Each of the following chapters will cover a specific aspect of the data mining process and illustrate how this is accomplished within Rattle and then further extended with direct coding in R. Before proceeding, it is advisable to review Chapter 1 as an introduction to the overall data mining process if you have not already done so.

2.11 Command Summary

This chapter has referenced the following R packages, commands, functions, and datasets:

<-             function   Assign a value into a named reference.
c()            function   Concatenate values into a vector.
dim()          function   Return the dimensions of a dataset.
evalq()        function   Access the environment for storing data.
head()         function   Return the first few rows of a dataset.
help()         command    Display help for a specific function.
help.search()  command    Search for help on a specific topic.
latticist      package    Interactive visualisation of data.
library()      command    Load a package into the R library.
ll()           function   Longer list of an environment.
load()         command    Load R objects from a file.
ls()           function   List the contents of an environment.
new.env()      function   Create a new object to store data.
nrow()         function   Number of rows in a dataset.
print()        command    Display a representation of an R object.
q()            command    Quit from R.
R              shell      Start up the R statistical environment.
rattle()       command    Start the Rattle GUI.
rggobi         package    Interactive visualisation of data.
rpart()        function   Build a decision tree predictive model.
rpart          package    Provides decision tree functions.
RSiteSearch()  command    Search the R Web site for help.
sample()       function   Random selection of its first argument.
save()         command    Save R objects into a file.
seq()          function   Return a sequence of numbers.
table()        function   Make a table from some variables.
tail()         function   Return the last few rows of a dataset.
weather        dataset    Sample dataset from rattle.
windows()      command    Open a new plot in Microsoft Windows.
x11()          command    Open a new plot in Unix/Linux.


Chapter 3

Working with Data

Data is the starting point for all data mining: without it there is nothing to mine. In today's world, there is certainly no shortage of data, but turning that data into information, knowledge, and, eventually, wisdom is not a simple matter.

We often think of data as being numbers or categories. But data can also be text, images, videos, and sounds. Data mining generally only deals with numbers and categories. Often, the other forms of data can be mapped into numbers and categories if we wish to analyse them using the approaches we present here.

Whilst data abounds in our modern era, we still need to scout around to obtain the data we need. Many of today's organisations maintain massive warehouses of data. This provides a fertile ground for sourcing data but also an extensive headache for us in navigating through a massive landscape.

An early step in a data mining project is to gather all the required data together. This seemingly simple task can be a significant burden on the budgeted resources for data mining, perhaps consuming up to 70-90% of the elapsed time of a project. It should not be underestimated.

When bringing data together, a number of issues need to be considered. These include the provenance (source and purpose) and quality (accuracy and reliability) of the data. Data collected for different purposes may well store different information in confusingly similar ways. Also, some data requires appropriate permission for its use, and the privacy of anyone the data relates to needs to be considered. Time spent at this stage getting to know your data will be time well spent.

In this chapter, we introduce data, starting with the language we use to describe and talk about data.

3.1 Data Nomenclature

Data miners have a plethora of terminology, often using many different terms to describe the same concept. A lot of this confusion of terminology is due to the history of data mining, with its roots in many different disciplines, including databases, machine learning, and statistics. Throughout this book, we will use a consistent and generally accepted nomenclature, which we introduce here.

We refer to a collection of data as a dataset. This might be called, in mathematical terms, a matrix or, in database terms, a table. Figure 3.1 illustrates a dataset annotated with our chosen nomenclature.

We often view a dataset as consisting of rows, which we refer to as observations, and those observations are recorded in terms of variables, which form the columns of the dataset. Observations are also known as entities, rows, records, and objects. Variables are also known as fields, columns, attributes, characteristics, and features. The dimension of a dataset refers to the number of observations (rows) and the number of variables (columns).

Variables can serve different roles: as input variables or output variables. Input variables are measured or preset data items. They might also be known as predictors, covariates, independent variables, observed variables, and descriptive variables. An output variable may be identified in the data. These are variables that are often influenced by the input variables. They might also be known as target, response, or dependent variables. In data mining, we often build models to predict the output variables in terms of the input variables. Early on in a data mining project, we may not know for sure which variables, if any, are output variables. For some data mining tasks (e.g., clustering), we might not have any output variables.

Some variables may only serve to uniquely identify the observations. Common examples include social security and other such government identity numbers. Even the date may be a unique identifier for particular observations. We refer to such variables as identifiers. Identifiers are not normally used in modelling, particularly those that are essentially randomly generated.

Variables can store different types of data. The values might be the names or the qualities of objects, represented as character strings. Or the values may be quantitative and thereby represented numerically. At a high level, we often only need to distinguish these two broad types of data, as we do here.

Figure 3.1: A simple dataset showing the nomenclature used. Each column is a variable and each row is an observation. (The figure annotates an example table of daily weather observations: a Date column marked as an Identifier; categoric Wind Dir. and Rain? columns; numeric Temp and Evap columns; the variables other than Rain? marked as Inputs; and Rain? marked as the Output.)

A categoric variable1 is one that takes on a single value, for a particular observation, from a fixed set of possible values. Examples include eye colour (with possible values including blue, green, and brown), age group (with possible values young, middle age, and old), and rain tomorrow (with only two possible values, Yes and No). Categoric variables are always discrete (i.e., they can only take on specific values).

Categoric variables like eye colour are also known as nominal variables, qualitative variables, or factors. The possible values have no order to them. That is, blue is no less than or greater than green. On the other hand, categoric variables like age group are also known as ordinal variables. The possible values have a natural order to them, so that young is in some sense less than middle age, which in turn is less than old.

1 We use the terms categoric rather than categorical and numeric rather than numerical.
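In R, a nominal variable is represented as a factor and an ordinal variable as an ordered factor. A small sketch of the distinction, using the examples above; comparisons are meaningful only for the ordered case:

> eyes <- factor(c("blue", "brown", "green", "blue"))
> ages <- factor(c("old", "young", "middle age"),
                 levels=c("young", "middle age", "old"),
                 ordered=TRUE)
> ages[2] < ages[1]   # young is less than old in the ordering
[1] TRUE
> eyes[1] < eyes[2]   # R warns that '<' is not meaningful for factors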

A categoric variable like rain tomorrow, having only two possible values, is also known as a binary variable.

A numeric variable has values that are integers or real numbers, such as a person's age or weight or their income or bank balance. Numeric variables are also known as quantitative variables. Numeric variables can be discrete (integers) or continuous (real).

A dataset (or, in particular, different randomly chosen subsets of a dataset) can have different roles. For building predictive models, for example, we often partition a dataset into three independent datasets: a training dataset, a validation dataset, and a testing dataset. The partitioning is done randomly to ensure each dataset is representative of the whole collection of observations. Typical splits might be 40/30/30 or 70/15/15. A validation dataset is also known as a design dataset (since it assists in the design of the model).

We build our model using the training dataset. The validation dataset is used to assess the model's performance. This will lead us to tune the model, perhaps through setting different model parameters. Once we are satisfied with the model, we assess its expected performance into the future using the testing dataset.

It is important to understand the significance of the testing dataset. This dataset must be a so-called holdout or out-of-sample dataset. It consists of randomly selected observations from the full dataset that are not used in any way in the building of the model. That is, it contains no observations in common with the training or validation datasets. This is important in relation to ensuring we obtain an unbiased estimate of the true performance of a model on new, previously unseen observations.

We can summarise our generic nomenclature, in one sentence, as: a dataset consists of observations recorded using variables, which consist of a mixture of input variables and output variables, either of which may be categoric or numeric.

Having introduced our generic nomenclature, we also need to relate the same concepts to how they are implemented in an actual system, like R. We do so, briefly, here. R has the concept of a data frame to represent a dataset. A data frame is, technically, a list of variables. Each variable in the list represents a column of data; a variable stores a collection of data items that are all

of the same type. For example, this might be a collection of integers recording the ages of clients. Technically, R refers to what we call a variable within a dataset as a vector. Each variable will record the same number of data items, and thus we can picture the dataset as a rectangular matrix, as we illustrated in Figure 3.1. A data frame is much like a table in a database or a page in a spreadsheet. It consists of rows, which we have called observations, and columns, which we have called variables.

3.2 Sourcing Data for Mining

To start a data mining project, we must first recognise and understand the problem to tackle. Whilst that might be quite obvious, there are subtleties we need to address, as discussed in Chapter 1. We also need data; again, somewhat obvious. As we suggested above, though, sourcing our data is usually not a trivial matter. We discuss the general data issue here before we delve into some technical aspects of data.

In an ideal world, the data we require for data mining will be nicely stored in a data warehouse or a database, or perhaps a spreadsheet. However, we live in a less than ideal world. Data is stored in many different forms and on many different systems, with many different meanings. Data is everywhere, for sure, but we need to find it, understand it, and bring it together.

Over the years, organisations have implemented well-managed data warehouse systems. They serve as the organisation-wide repository of data. It is true, though, that despite this, data will always spring up outside of the data warehouse and will have none of the careful controls that surround the data warehouse with regard to data provenance and data quality. Eventually the organisation's data custodians will recapture the useful new cottage industry repositories into the data warehouse, and the cycle of new cottage industries will begin once again. We will always face the challenge of finding data from many sources within an organisation.

An organisation's data is often not the only data we access within a data mining project. Data can be sourced from outside the organisation. This could include data publicly available, commercially collected, or legislatively obtained. The data will be in a variety of formats and of varying quality. An early task for us is to assess whether the data will

be useful for the business problem and how we will bring the new data together with our other data. We delve further into understanding the data in Chapter 5. We consider data quality now.

3.3 Data Quality

No real-world data is perfectly collected. Despite the amount of effort organisations put into ensuring the quality of the data they collect, errors will always occur. We need to understand issues relating to, for example, consistency, accuracy, completeness, interpretability, accessibility, and timeliness. It is important that we recognise and understand that our data will be of varying quality. We need to treat (i.e., transform) our data appropriately and be aware of the limitations (uncertainties) of any analysis we perform on it. Chapter 7 covers many aspects of data quality and how we can work towards improving the quality of our available data. Below we summarise some of the issues.

In the past, much data was entered by data entry staff working from forms or directly in conversation with clients. Different data entry staff often interpret different data fields (variables) differently. Such inconsistencies might include using different formats for dates or recording expenses in different currencies in the same field, with no information to identify the currency.

Often in the collection of data, some data is more carefully (or accurately) collected than other data. For bank transactions, for example, the dollar amounts must be very accurate. The precise spelling of a person's name or address might not need to be quite so accurate. Where the data must be accurate, extra resources will be made available to ensure data quality. Where accuracy is less critical, resources might be saved. In analysing data, it is important to understand these aspects of accuracy.

Related to accuracy is the issue of completeness. Some less important data might only be optionally collected, and thus we end up with much missing data in our datasets. Alternatively, some data might be hard to collect, and so for some observations it will be missing. When analysing data, we need to understand the reasons for missing data and deal with the data appropriately. We cover this in detail in Chapter 7.

Another major issue faced by the data miner is the interpretation of the data. Having a thorough understanding of the meaning of the data is

critical. Knowing that height is measured in feet or in meters will make a difference to the analysis. We might find that some data was entered as feet and other data as meters (the consistency problem). We might have dollar amounts over many years, and our analysis might need to interpret the amounts in terms of their relative present-day value. Codes are also often used, and we need to understand what each code means and how different codes relate to each other. As the data ages, the meaning of the different variables will often change or be altogether lost. We need to understand and deal with this.

The accessibility of the right data for analysis will often also be an issue. A typical process in data collection involves checking for obvious data errors in the data supplied and correcting those errors. In collecting tax return data from taxpayers, for example, basic checks will be performed to ensure the data appears correct (e.g., checking for mistakes that enter data as 3450 to mean $3450, whereas it was meant to be $34.50). Sometimes the checks might involve discussing the data with its supplier and modifying it appropriately. Often it is this cleaner data that is stored on the system rather than the original data supplied. The original data is often archived, but often it is such data that we actually need for the analysis; we want to analyse the data as supplied originally. Accessing archived data is often problematic.

Accessing the most recent data can sometimes be a challenge. In an online data processing environment, where the key measure of performance is the turnaround time of the transaction, providing other systems with access to the data in a timely manner can be a problem. In many environments, the data can only be accessed after a sometimes complex extract/transform/load (ETL) process. This can mean that the data may only be available after a day or so, which may present challenges for its timely analysis. Often, business processes need to be changed so that more timely access is possible.

3.4 Data Matching

In collecting data from multiple sources, we end up with a major problem in that we need to match observations from one dataset with those from another dataset. That is, we need to identify the same entities (e.g., people or companies) from different data sources. These different sources could be, for example, patient medical data from a doctor and

from a hospital. The doctor's data might contain information about the patients' general visits, basic test results, diagnoses, and prescriptions. The doctor might have a unique number to identify his or her own patients, as well as their names, dates of birth, and addresses. A hospital will also record data about patients that are admitted, including their reason for admission, treatment plan, and medications. The hospital will probably have its own unique number to identify each patient, as well as the patient's name, date and place of birth, and address.

The process of data matching might be as simple as joining two datasets together based on shared identifiers that are used in each of the two databases. If the doctor and the hospital share the same unique numbers to identify the patients, then the data matching process is simplified. However, the data matching task is usually much more complex. Data matching often involves, for example, matching of names, addresses, and dates and places of birth, all of which will have inaccuracies and alternatives for the same thing. The data entered at a doctor's consulting rooms will in general be entered by a different receptionist on a different day from the data entered on admission at a hospital where surgery might be performed. It is not uncommon to find, even within a single database, one person's name recorded differently, let alone when dealing with data from very different sources. One data source might identify John L. Smith, another might identify the person as J.L. Smith, and a third might have an error or two but identify the person as Jon Leslie Smyth. The task of data matching is to bring different data sources together in a reliable and supportable manner so that we have the right data about the right person.

An idea that can improve data matching quality is that of a trusted data matching bureau. Many data matching bureaus within organisations almost start each new data matching effort from scratch. However, over time there is the opportunity to build up a data matching database that records relevant information about all previous data matches. Under this scenario, each time a new data matching effort is undertaken, the identities within this database, and their associated information, are used to improve the new data matching. Importantly, the results of the new data matching feed back into the data matching database to improve the quality of the matched entities and thus even improve previously matched data.
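When a shared identifier does exist, the join itself is straightforward in R. A minimal sketch, with hypothetical doctor and hospital data frames sharing a patient id column:

> doctor <- data.frame(id=c(1, 2, 3), diagnosis=c("flu", "asthma", "flu"))
> hospital <- data.frame(id=c(2, 3, 4), admission=c("surgery", "observation", "surgery"))
> merge(doctor, hospital, by="id")   # inner join on the shared identifier
  id diagnosis   admission
1  2    asthma     surgery
2  3       flu observation

Without such an identifier, approximate matching on names, addresses, and dates is needed; base R offers only rudimentary support (e.g., agrep() for fuzzy string matching), and the harder record linkage task is beyond this sketch.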

Data matching is quite an extensive topic in itself and worth a separate book. A number of commercially available tools assist with the basic task. The open source Febrl system also provides data matching capabilities. They all aim to identify the same entity in all of the data sources.

3.5 Data Warehousing

The process of bringing data together into one unified and carefully managed repository is referred to as data warehousing, the analogy being with a large building used for the storage of goods. What we store in our warehouse is data. Data warehouses were topical in the 1990s and primarily vendor driven, servicing a real opportunity to get on top of managing data. Inmon (1996) provides a detailed introduction.

We can view a data warehouse as a large database management system. It is designed to integrate data from many different sources and to support analysis for different objectives. In any organisation, the data warehouse can be the foundation for business intelligence, providing a single, integrated source of data for the whole organisation.

Typically, a data warehouse will contain data collected together from multiple sources but focussed around the function of an organisation. The data sources will often be operational systems (such as transaction processing systems) that run the day-to-day functions of the organisation. In banking, for example, the transaction processing systems include ATMs and EFTPOS machines, which are today most pervasive. Transaction processing systems collect data that gets uploaded to the data warehouse on a regular basis (e.g., daily, but perhaps even more frequently).

Well-organised data warehouses, at least from the point of view of data mining, will also be nonvolatile. The data stored in our data warehouses will capture data regularly, and older data is not removed. Even when an update to data is to correct existing data items, such data must be maintained, creating a massive repository of historic data that can be used to carefully track changes over time.

Consider the case of tax returns held by our various revenue authorities. Many corrections are made to individual tax returns over time. When a tax return is filed, a number of checks for accuracy may result in

simple changes (e.g., correcting a misspelled address). Further changes might be made at a later time as a taxpayer corrects data originally supplied. Changes might also be the result of audits leading to corrections made by the revenue authority, or a taxpayer may notify the authority of a change in address.

Keeping the history of data changes is essential for data mining. It may be quite significant, from a fraud point of view, that a number of clients in a short period of time change their details in a common way. Similarly, it might be significant, from the point of view of understanding client behaviour, that a client has had ten different addresses in the past 12 months. It might be of interest that a taxpayer always files his or her tax return on time each year, and then makes the same two adjustments subsequently, each year. All of this historic data is important in building a picture of the entities we are interested in. Whilst the operational systems may only store data for one or two months before it is archived, having this data accessible for many years within a data warehouse for data mining is important.

In building a data warehouse, much effort goes into how the data warehouse is structured. It must be designed to facilitate the queries that operate on a large proportion of data. A careful design that exposes all of the data to those who require it will aid in the data mining process.

Data warehouses quickly become unwieldy as more data is collected. This often leads to the development of specific data marts, which can be thought of as creating a tuned subset of the data warehouse for specific purposes. An organisation, for example, may have a finance data mart, a marketing data mart, and a sales data mart. Each data mart will draw its information from various other data collected in the warehouse. Different data sources within the warehouse will be shared by different data marts and present the data in different ways.

A crucial aspect of a data warehouse (and any data storage, in fact) is the maintenance of information about the data, so-called metadata. Metadata helps make the data understandable and thereby useful. We might talk about two types of metadata: technical metadata and business metadata.

Technical metadata captures data about the operational systems from which the data was obtained, how it was extracted from the source systems, how it was transformed, how it was loaded into the warehouse, where it is stored within the warehouse, and its structure as stored in the warehouse.

The actual process of extracting, transforming, and then loading data is often referred to as ETL (extract, transform, load). Many vendors provide ETL tools, and there is also extensive capability for automating ETL using open source software, including R.

The business metadata, on the other hand, provides the information that is useful in understanding the data. It will include descriptions of the variables contained in the data and measures of their data quality. It can also include who owns the data, who has access to it, the cost of accessing it, when it was last updated, how frequently it is updated, and how it is used operationally.

Before data mining became a widely adopted technology, the data warehouse supported analyses through business intelligence (BI) technology. The simplest analyses build reports that aggregate the data within a warehouse in many different ways. Through this technology, an organisation is able to ensure its executives are aware of its activities. On-line analytic processing (OLAP) within the BI technology supports user-driven and multidimensional analyses of the data contained within the warehouse. Extending the concept of a human-driven and generally manual analysis of data, as in business intelligence, data mining provides a data-driven approach to the analysis of the data.

Ideally, the data warehouse is the primary data source for data mining. Integrating data from multiple sources, the data warehouse should contain an extensive resource that captures all of the activity of an organisation. Also, ideally, the data will be consistent, of high quality, and documented with very good metadata. If all that is true, the data mining will be quite straightforward. Rarely is this true. Nonetheless, mining data from the data warehouse can significantly reduce the time for preparing it and sharing the data across many data mining and reporting projects.

Data warehouses will often be accessed through the common structured query language (SQL). Our data will usually be spread across multiple locations within the warehouse, and SQL queries will be used to bring them together. Some basic familiarity with SQL will be useful as we extract our data. Otherwise we will need to ensure we have ready access to the skills of a data analyst to extract the data for us.
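To give a flavour of the kind of query involved, here is a minimal sketch of extracting joined warehouse data from R via the RODBC package (introduced properly in Section 4.3); the DSN, tables, and columns here are hypothetical:

> library(RODBC)
> channel <- odbcConnect("myDWH")
> ds <- sqlQuery(channel,
      "SELECT c.id, c.age, t.amount
         FROM clients c JOIN transactions t ON c.id = t.client_id
        WHERE t.amount > 1000")

The result, ds, is an ordinary R data frame, ready for the kinds of manipulation we introduce next.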

3.6 Interacting with Data Using R

Once we have scouted for data, matched common entities, and brought the data together, we need to structure the data into a form suitable for data mining. More specifically, we need to structure the data to suit the data mining tool we are intending to use. In our case, this involves putting the data into a form that allows it to be easily loaded into R, using Rattle, where we will then explore, test, and transform it in various ways in preparation for mining.

Once we have loaded a dataset into Rattle, through one of the mechanisms we introduce in Chapter 4 (or directly through R itself), we may want to modify the data, clean it, and transform it into the structures we require. We may already be familiar with a variety of tools for dealing with data (like SQL or a spreadsheet). These tools may be quite adequate for the manipulations we need to undertake. We can easily prepare the data with them and then load it into Rattle when ready. But R itself is also a very powerful data manipulation language. Much of R's capabilities for data management are covered in other books, including those of Spector (2008), Muenchen (2008), and Chambers (2008).

Rattle provides access to some data cleaning operations under the Transform tab, as covered in Chapter 7. We provide here elementary instruction in using R itself for a limited set of manipulations that are typical in preparing data for data mining. We do not necessarily cover the details nor provide the systematic coverage of R available through other means.

One of the most basic operations is accessing the data within a dataset. We index a dataset using the notation of square brackets, and within the square brackets we identify the index of the observations and the variables we are interested in, separating them with a comma. We briefly saw this previously in Section 2.9.

Using the same weather dataset as in Chapter 2 (available from rattle, which we can load into R's library()), we can access observations 100 to 105 and variables 5 to 6 by indexing the dataset. If either index (observations or variables) is left empty, then the result will be all observations or all variables, respectively, rather than just a subset of them. Using dim() to report on the resulting size (dimensions) of the dataset, we can see the effect of the indexing:

> library(rattle)
> weather[100:105, 5:6]
    Rainfall Evaporation
> dim(weather)
[1] 366  24
> dim(weather[100:105, 5:6])
[1] 6 2
> dim(weather[100:105,])
[1]  6 24
> dim(weather[,5:6])
[1] 366   2
> dim(weather[5:6])
[1] 366   2
> dim(weather[,])
[1] 366  24

Note that the notation 100:105 is actually shorthand for a call to seq(), which generates a list of numbers. Another way to generate a list of numbers is to use c() (for combine) and list each of the numbers explicitly. These expressions can replace the 100:105 in the example above to have the same effect. We can see this in the following code block.

> 100:105
[1] 100 101 102 103 104 105
> seq(100, 105)
[1] 100 101 102 103 104 105
> c(100, 101, 102, 103, 104, 105)
[1] 100 101 102 103 104 105

Variables can be referred to by their position number, as above, or by the variable name. In the following example, we extract six observations of just two variables. Note the use of the vars object to list the variables of interest and then from that index the dataset.

> vars <- c("Evaporation", "Sunshine")
> weather[100:105, vars]
    Evaporation Sunshine

We can list the variable names contained within a dataset using names():

> head(names(weather))
[1] "Date"        "Location"    "MinTemp"
[4] "MaxTemp"     "Rainfall"    "Evaporation"

In this example we list only the first six names, making use of head(). This example also illustrates the functional nature of R. Notice how we directly feed the output of one function (names()) into another function (head()). We could also use indexing to achieve the same result:

> names(weather)[1:6]
[1] "Date"        "Location"    "MinTemp"
[4] "MaxTemp"     "Rainfall"    "Evaporation"
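Rows can also be selected by a logical condition rather than by position. A small sketch (output omitted here):

> weather[weather$Rainfall > 25, c("Date", "Rainfall")]
> weather[weather$RainToday == "Yes" & weather$Sunshine < 3, vars]

The condition inside the brackets is itself a vector of logical values, with one entry per observation; only the rows for which it is TRUE are returned.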

When we index a dataset with single brackets, as in weather[2] or weather[4:7], we retrieve a subset of the dataset; specifically, we retrieve a subset of the variables. The result itself is another dataset, even if it contains just a single variable. Compare this with weather[[2]], which returns the actual values of the variable. The differences may appear subtle, but as we gain experience with R, they become important. We do not dwell on this here, though.

> head(weather[2])
  Location
1 Canberra
2 Canberra
3 Canberra
4 Canberra
5 Canberra
6 Canberra

> head(weather[[2]])
[1] Canberra Canberra Canberra Canberra Canberra Canberra
46 Levels: Adelaide Albany Albury ... Woomera

We can use the $ notation to access specific variables within a dataset. The expression weather$MinTemp refers to the MinTemp variable of the weather dataset:

> head(weather$MinTemp)
[1]

3.7 Documenting the Data

The weather dataset, for example, though very small in the number of observations, is somewhat typical of data mining. We have obtained the dataset from a known source and have processed it to build a dataset ready for our data mining. To do this, we've had to research the meaning of the variables and read about any idiosyncrasies associated with the collection of the data. Such information needs to be captured in a data mining report. The report should record where our data has come from, our understanding of its integrity, and the meaning of the variables. This

information will come from a variety of sources and usually from multiple domain experts. We need to understand and document the provenance of the data: how it was collected, who collected it, and how they understood what they were collecting.

The following summary will be useful. It is obtained from processing the output of str(). That output, which is normally only displayed in the console, is first captured into a variable using capture.output():

> sw <- capture.output(str(weather, vec.len=1))
> cat(sw[1])
'data.frame': 366 obs. of 24 variables:

The output is then processed to add a variable number and appropriately fit the page. The processing first uses sprintf() to generate a list of variable numbers, each number stored as a string of width 2 ("%2d"):

> swa <- sprintf("%2d", 1:length(sw[-1]))

Each number is then pasted to each line of the output, collapsing the separate lines to form one long string with a newline ("\n") separating each line:

> swa <- paste(swa, sw[-1], sep="", collapse="\n")

The gsub() function is then used to truncate lines that are too long by substituting a particular pattern of dots and digits with just "..":

> swa <- gsub("\\.\\.: [0-9]+ [0-9]+ \\.\\.\\.", "..", swa)

The final substitution removes some unnecessary characters, again to save on space. That is a little complex at this stage but illustrates the power of R for string processing (as well as statistics).

> swa <- gsub("( \\$|: )", "", swa)

We use cat() to then display the results of this processing.

> cat(swa)
 1 Date Date, format "2007-11-01" ...
 2 Location Factor w/ 46 levels "Adelaide","Albany",..
 3 MinTemp num ...
 4 MaxTemp num ...
 5 Rainfall num ...
 6 Evaporation num ...
 7 Sunshine num ...
 8 WindGustDir Ord.factor w/ 16 levels "N"<"NNE"<"NE"<..
 9 WindGustSpeed num ...
10 WindDir9am Ord.factor w/ 16 levels "N"<"NNE"<"NE"<..
11 WindDir3pm Ord.factor w/ 16 levels "N"<"NNE"<"NE"<..
12 WindSpeed9am num ...
13 WindSpeed3pm num ...
14 Humidity9am int ...
15 Humidity3pm int ...
16 Pressure9am num ...
17 Pressure3pm num ...
18 Cloud9am int ...
19 Cloud3pm int ...
20 Temp9am num ...
21 Temp3pm num ...
22 RainToday Factor w/ 2 levels "No","Yes" ...
23 RISK_MM num ...
24 RainTomorrow Factor w/ 2 levels "No","Yes" ...

3.8 Summary

In this chapter, we have introduced the concepts of data and dataset. We have described how we obtain data and issues related to the data we use for data mining. We have also introduced some basic data manipulation using R. We will revisit the weather, weatherAUS, and audit datasets throughout the book. Appendix B describes in detail how these datasets are obtained and processed into a form for use in data mining. The amount of detail there and the R code provided may be useful in learning more about manipulating data in R.

3.9 Command Summary

This chapter has referenced the following R packages, commands, functions, and datasets:

<-            function   Assign a value into a named reference.
$             function   Extract a variable from a dataset.
audit         dataset    Sample dataset from rattle.
c()           function   Combine items to form a collection.
cat()         function   Display the arguments to the screen.
dim()         function   Report the rows and columns of a dataset.
gsub()        function   Globally substitute one string for another.
head()        function   Show top observations of a dataset.
library()     command    Load a package into the R library.
names()       function   Show variables contained in a dataset.
paste()       function   Combine strings into one string.
rattle        package    Provides the weather and audit datasets.
seq()         function   Generate a sequence of numbers.
sprintf()     function   Format a string with substitution.
str()         function   Show the structure of an object.
weather       dataset    Sample dataset from rattle.
weatherAUS    dataset    A larger dataset from rattle.

Chapter 4

Loading Data

Data can come in many different formats from many different sources. By using R's extensive capabilities, Rattle provides direct access to such data. Indeed, we are fortunate with the R system in that it is an open system and therefore is strong on sharing and cooperating with other applications. R supports importing data in many formats.

One of the most common formats for data exchange between applications is the comma-separated value (CSV) file. Such files typically have a csv filename extension. This is a simple text file format that is oriented around rows and columns, using a comma to separate the columns in the file. Such files can be used to transfer data through export and import between spreadsheets, databases, weather monitoring stations, and many other applications. A variation on the idea is to separate the columns with other markers, such as a tab character, which is often associated with files having a txt filename extension.

These simple data files (the CSV and TXT files) contain no explicit metadata information; that is, there is no data to describe the structure of the data contained in the file. That information often needs to be guessed at by the software reading the data. Other types of data sources do provide information about the data so that our software does not need to make guesses about what it is reading. Attribute-Relation File Format files (Section 4.2) have an arff filename extension and add metadata to the CSV format.

Extracting data directly from a database often delivers the metadata along with the data itself. The Open Database Connectivity (ODBC) standard provides an open access method for accessing data stored in a variety of databases and is supported by R. This allows direct connection

to a vast collection of data sources, including Microsoft Excel, Microsoft Access, SQL Server, Oracle, MySQL, Postgres, and SQLite. Section 4.3 covers the package RODBC.

The full variety of R's capability for loading data is necessarily not available directly within Rattle. However, we can use the underlying R commands to load data and then access it within Rattle, as in Section 4.4. R packages themselves also provide an extensive collection of sample datasets. Whilst many datasets will be irrelevant to our specific tasks, they can be used to experiment with data mining using R. A list of datasets contained in the R library is available through the Rattle interface by choosing Library as the Source on the Data tab. We cover this further in Section 4.6.

Having loaded our data into Rattle through some mechanism, we need to decide on the role played by each of the variables in the dataset. We also need to decide how the observations in the dataset are going to be used in the mining. We record these decisions through the Rattle interface, with Rattle itself providing useful defaults. Once a dataset source has been identified and the Data tab executed, an overview of the data will be displayed in the text view.

Figure 4.1 displays the Rattle application after loading the weather.csv file, which is supplied as a sample dataset with the Rattle package. We get here by starting up R and then loading rattle, starting up Rattle, and then clicking the Execute button for an offer to load the weather dataset:

> library(rattle)
> rattle()

In this chapter, we review the different source data formats and discuss how to load them for data mining. We then review the options that Rattle provides for identifying how the data is to be used for data mining.

4.1 CSV Data

One of the simplest and most common ways of sharing data today is via the comma-separated values (CSV) format. CSV has become a standard file format used to exchange data between many different applications. CSV files, which usually have a csv extension, can be exported and imported by spreadsheets and databases, including LibreOffice Calc, Gnumeric, Microsoft Excel, SAS/Enterprise Miner, Teradata, Netezza, and

Figure 4.1: Loading the weather.csv dataset.

very many other applications. For these reasons, CSV is a good option for importing data into Rattle. The downside is that a CSV file does not contain explicit metadata (i.e., data about the data, including whether the data is numeric or categoric). Without this metadata, R sometimes determines the wrong data type for a particular column. This is not usually fatal, and we can help R along when loading data using R.

Locating and Loading Data

Using the Spreadsheet option of Rattle's Data tab, we can load data directly from a CSV file. Click the Filename button (Figure 4.2) to display the file chooser dialogue (Figure 4.3). We can browse to the CSV file we wish to load, highlight it, and click the Open button.

We now need to actually load the data into Rattle from the file. As always, we do this with a click on the Execute button (or a press of the F2 key). This will load the contents of the file from the hard disk into the computer's memory for processing by Rattle as a dataset.

Rattle supplies a number of sample CSV files and in particular pro-

Figure 4.2: The Spreadsheet option of the Data tab, highlighting the Filename button. Click this button to open up the file chooser.

vides the weather.csv data file. The data file will have been installed when rattle was installed. We can ask R to tell us the actual location of the file using system.file(), which we can type into the R Console:

> system.file("csv", "weather.csv", package="rattle")
[1] "/usr/local/lib/R/site-library/rattle/csv/weather.csv"

The location reported will depend on your particular installation and operating system. Here the location is relative to a standard installation of a Ubuntu GNU/Linux system.

Tip: We can also load this file into a new instance of Rattle with just two mouse clicks (Execute and Yes). We can then click the Filename button (displaying weather.csv) to open up a file browser showing the file path at the top of the window.

We can review the contents of the file using file.show(). This will pop up a window displaying the contents of the file:

> fn <- system.file("csv", "weather.csv", package="rattle")
> file.show(fn)

The file contents can be directly viewed outside of R and Rattle with any simple text editor. If you aren't familiar with CSV files, it is instructional to become so. We will see that the top of the file begins:

Date,Location,MinTemp,MaxTemp,Rainfall,Evaporation,...
2007-11-01,Canberra,8,24.3,0,3.4,6.3,NW,30,SW,NW,...
2007-11-02,Canberra,14,26.9,3.6,4.4,9.7,ENE,39,E,W,...
2007-11-03,Canberra,13.7,23.4,3.6,5.8,3.3,NW,85,N,NNE,...
...
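Having located the file, we could equally read it straight into a data frame ourselves; a quick sketch:

> ds <- read.csv(fn)
> dim(ds)
[1] 366  24

This same file is what Rattle itself reads, as we see next.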

Figure 4.3: The CSV file chooser showing just those files with a .csv extension in the folder. We can also select to display just the .txt files (e.g., the extension often used for tab-delimited files) or else all files by selecting from the dropdown menu at the bottom right.

A CSV file is just a normal text file that commonly begins with a header line listing the names of the variables, each separated by a comma. The remainder of the file after the header row is expected to consist of rows of data that record the observations. For each observation, the fields are separated by commas, delimiting the actual observation of each of the variables.

Loading data into Rattle from a CSV file uses read.csv(). We can see this by reviewing the contents of the Log tab. From the Log tab we will see something like the following:

> crs$dataset <- read.csv("file:.../weather.csv",
      na.strings=c(".", "NA", "", "?"),
      strip.white=TRUE)

The full path to the weather.csv file is truncated here for brevity, so the command above won't succeed with a copy-and-paste. Instead, copy the corresponding line from the Log tab into the R Console. The result

of executing this function is that the dataset itself is loaded into memory and referenced using the name crs$dataset.

The second argument in the function call above (na.strings=) lists the four strings that, if found as the value of a variable, will be translated into R's representation for missing values (NA). The list of strings used here captures the most common approaches to representing missing values. SAS, for example, uses the dot (".") to denote missing values, and R uses the special string "NA". Other applications simply use the empty string, whilst yet others (including machine learning applications like C4.5) use the question mark ("?").

We also use the strip.white= argument, setting it to TRUE, which has the effect of stripping white space (i.e., spaces and/or tabs). This allows the source CSV file to have the commas aligned for easier human viewing and still support missing values appropriately.

The read.csv() function need not be quite so complex. If we have a CSV file to load into R (again substituting the ... with the actual path to the file), we can usually simply type the following command:

> ds <- read.csv(".../weather.csv")

We can also load data directly from the Internet. For example, the weather dataset is available from togaware.com:

> ds <- read.csv("http://rattle.togaware.com/weather.csv")

As we saw in Chapter 2, Rattle will offer to load the supplied sample data file (weather.csv) if no other data file is specified through the Filename button. This is the simplest way to load sample data into Rattle, and is useful for learning the Rattle interface. After identifying the file to load, we do need to remember to click the Execute button to actually load the dataset into Rattle. The main text pane of the Data tab then changes to list the variables, together with their types and roles and some other useful information, as can be seen in Figure 4.1.

After loading the data from the file into Rattle, thereby creating a dataset, we can begin to explore it. The top of the file can be viewed in the R Console, as we saw in Chapter 2. Here we limit the display to just the first five variables and request just six observations:

> head(crs$dataset[1:5], 6)
        Date Location MinTemp MaxTemp Rainfall
1 2007-11-01 Canberra     8.0    24.3      0.0
2 2007-11-02 Canberra    14.0    26.9      3.6
3 2007-11-03 Canberra    13.7    23.4      3.6
4 2007-11-04 Canberra    13.3    15.5     39.8
5 2007-11-05 Canberra     7.6    16.1      2.8
6 2007-11-06 Canberra     6.2    16.9      0.0

As we described earlier (Section 2.9, page 50), Rattle stores the dataset within an environment called crs, so we can reference it directly in R as crs$dataset. Through the Rattle interface, once we have loaded the dataset, we can also view it as a spreadsheet by clicking the View button, which uses dfedit() from RGtk2Extras (Taverner et al., 2010).

Data Variations

The Rattle interface provides options for tuning how we read the data from a CSV file. As we can see in Figure 4.2, the options include the Separator and Header.

We can choose the field delimiter through the Separator entry. A comma is the default. To load a TXT file, which uses a tab as the field separator, we replace the comma with the special code \\t (that is, two slashes followed by a t) to represent a tab. We can also leave the entry empty, and any white space (i.e., any number of spaces and/or tabs) will be used as the separator.

From the read.csv() viewpoint, the effect of the separator entry is to include the appropriate argument (using sep=) in the call to the function. In this example, if we happen to have a file named mydata.txt that contained tab-delimited data, then we would include the sep=:

> ds <- read.csv("mydata.txt", sep="\t")

Tip: Note that when specifying the tab as the separator directly within R we use a single slash rather than the double slashes through the Rattle interface.

Another option of interest when loading a dataset is the Header check box. Generally, a CSV file will have as its first row a list of column names.

These names will be used by R and Rattle as the names of the variables. However, not all CSV files include headers. For such files, uncheck the Header check box. On loading a CSV file that does not contain headers, R will generate variable names for the columns. The check box translates to the header= argument in the call to read.csv(). Setting the value of header= to FALSE will result in the first line being read as data rather than as a header line. If we had such a file, perhaps called mydata.csv, then the call to read.csv() would be:

> ds <- read.csv("mydata.csv", header=FALSE)

Tip: The data can contain different numbers of columns in different rows, with missing columns at the end of a row being filled with NAs. This is handled using the fill= argument of read.csv(), which is TRUE by default.

Basic Data Summary

Once a dataset has been loaded into Rattle, we can start to obtain an idea of the shape of the data from the simple summary that is displayed. In Figure 4.1, for example, the first variable, Date, is recognised as a unique identifier for each observation. It has 366 unique values, which is the same as the number of observations. The variable Location has only a single value across all observations in the dataset. Consequently, it is identified as a constant and plays no role in the modelling. It is ignored.

The next five variables in Figure 4.1 are all tagged as numeric, followed by the categoric WindGustDir, and so on. The Comment column identifies the unique number of values and the number of missing observations for each variable. Sunshine, for example, has 114 unique values and 3 missing values. How to deal with missing values is covered in Chapter 7.

4.2 ARFF Data

The Attribute-Relation File Format (ARFF) is a text file format that is essentially a CSV file with a number of rows at the top of the file that contain metadata. The ARFF format was developed for use in the

Weka (Witten and Frank, 2005) machine learning software, and there are many datasets available in this format. We can load an ARFF dataset into Rattle through the ARFF option (Figure 4.4), specifying the filename from which the data is loaded.

Rattle provides sample ARFF datasets. To access them, after starting up Rattle and loading the sample weather dataset (Section 2.4), choose the ARFF option and then click the Filename chooser. Browse to the parent folder and then into the arff folder to choose a dataset to load.

Figure 4.4: Choosing the ARFF radio button to load an ARFF file.

The key difference between CSV and ARFF is in the top part of the file, which contains information about each of the variables in the data; this is the data description section. An example of the ARFF format for our weather dataset is shown below. Note that ARFF refers to variables as attributes.

@relation weather
@attribute Date date
@attribute Location {Adelaide, ...
@attribute MinTemp numeric
@attribute MaxTemp numeric
...
@attribute RainTomorrow {No, Yes}
@data
2007-11-01,Canberra,8,24.3,0,...,Yes
2007-11-02,Canberra,14,26.9,3.6,...,Yes
2007-11-03,Canberra,?,23.4,3.6,...,Yes
...

The data description section is straightforward, beginning with the name of the dataset (or the name of the relation in ARFF terminology). Each of the variables used to describe each observation is then identified together with its data type. Each variable definition appears on a

single line (we have truncated the lines in the example above). Numeric variables are identified as numeric, real, or integer. For categoric variables, the possible values are listed.

Two other data types recognised by ARFF are string and date. A string data type indicates that the variable can have a string (a sequence of characters) as its value. The date data type can also optionally specify the format in which the date is recorded. The default for dates is the ISO-8601 standard format, which is "yyyy-MM-dd'T'HH:mm:ss".

Following the metadata specification, the actual observations are then listed, each on a single line, with fields separated by commas, exactly as with a CSV file.

A significant advantage of the ARFF data file over the CSV data file is the metadata information. This is particularly useful in Rattle, where for categoric data the possible values are determined from the data when reading in a CSV file. Any possible values of a categoric variable that are not present in the data will, of course, not be known. When reading the data from an ARFF file, the metadata will list all possible values of a categoric variable, even if one of the values might not be used in the actual data. We will come across this as an issue, particularly when we build and deploy random forest models, as covered in Chapter 12.

Comments can also be included in an ARFF file with a "%" at the beginning of the comment line. Including comments in the data file allows us to record extra information about the dataset, including how it was derived, where it came from, and how it might be cited.

Missing values in an ARFF data file are identified using the question mark "?". These are identified by R's read.arff(), and we see them as the usual NAs in Rattle.

Overall, the ARFF format, whilst simple, is quite an advance over a CSV file. Nonetheless, CSV still remains the more common data file.

4.3 ODBC Sourced Data

Much data is stored within databases and data warehouses. The Open Database Connectivity (ODBC) standard has been developed as a common approach for accessing data from databases (and hence data warehouses). The technology is based on the Structured Query Language (SQL) used to query relational databases. We discuss here how to access data directly from such databases.

Rattle can obtain a dataset from any database accessible through ODBC by using Rattle's ODBC option (Figure 4.5). Underneath the GUI, RODBC (Ripley and Lapsley, 2010) provides the actual interface to the ODBC data source.

Figure 4.5: Loading data through an ODBC database connection.

The key to accessing data via ODBC is to identify the data source through a so-called data source name (or DSN). Different operating systems provide different mechanisms for setting up DSNs. Under the GNU/Linux operating system, for example, using the unixodbc application, the system DSNs are often defined in /etc/odbcinst.ini and /etc/odbc.ini. Under Microsoft Windows, the control panel provides access to the ODBC Data Sources tool.

Using Rattle, we identify a configured DSN by typing its name into the DSN text entry (Figure 4.5). Once a DSN is specified, Rattle will attempt to make a connection. Many ODBC drivers will prompt for a username and password before establishing the connection. Figure 4.6 illustrates a typical popup for entering such data, in this case for connecting to a Netezza data warehouse.

To establish a connection using R directly, we use odbcConnect() from RODBC. This function establishes what we might think of as a channel connecting to the remote data source:

> library(RODBC)
> channel <- odbcConnect("myDWH", uid="kayon", pwd="toga")

After establishing a connection to a data source, Rattle will query the database for the names of the available tables and provide access to that list through the Table combo box of Figure 4.5. We need to select the specific table to load.

A limited number of options available in R are exposed through Rattle for fine-tuning the ODBC connection. One option allows us to limit the

Figure 4.6: Netezza ODBC connection.

number of rows retrieved from the chosen table. If the row limit is set to 0, then all of the rows from the table are retrieved. Unfortunately, there is no SQL standard for limiting the number of rows returned from a query. For some database systems (e.g., Teradata and Netezza), the SQL keyword is LIMIT, and this is what is used by Rattle.

A variety of R functions, provided by RODBC, are available to interact with the database. For example, the list of available tables is obtained using sqlTables(). We pass to it the channel that we created above to communicate with the database:

> tables <- sqlTables(channel)

If there is a table in the connected database called, for example, clients, we can obtain a list of column names using sqlColumns():

> columns <- sqlColumns(channel, "clients")

Often, we are interested in loading only a specific subset of a table from the database. We can directly formulate an SQL query to retrieve just the data we want. For example:

> query <- "SELECT * FROM clients WHERE cost > 2500"
> myds <- sqlQuery(channel, query)

Using R directly provides a lot more scope for carefully identifying the data we wish to load. Any SQL query can be substituted for the simple SELECT statement used above. For those with skills in writing SQL queries, this provides quite a powerful mechanism for refining the data to be loaded, before it is loaded.

Loading data by directly sending an SQL query to the channel as above will store the data in R as a dataset, which we can reference as

myds (as defined above). This dataset can be accessed in Rattle with the R Dataset option, which we now introduce.

4.4 R Dataset: Other Data Sources

Data can be loaded from any source, one way or another, into R. We have covered loading data from a data file (as in loading a CSV or TXT file) or directly from a database. However, R supports many more options for importing data from a variety of sources.

Rattle can use any dataset (technically, any data frame) that has been loaded into R as a dataset to be mined. When choosing the R Dataset option of the Data tab (Figure 4.7), the Data Name box will list each of the available data frames that can be brought into Rattle as a dataset.

Using foreign (DebRoy and Bivand, 2011), for example, R can be used to read SPSS datasets (read.spss()), SAS XPORT format datasets (read.xport()), and DBF database files (read.dbf()). One notable exception, though, is the proprietary SAS dataset format, which cannot be loaded unless we have a licensed copy of SAS to read the data for us.

Loading SPSS Datasets

As an example, suppose we have an SPSS data file saved or exported from SPSS. We can read that into R using read.spss():

> library(foreign)
> mydataset <- read.spss(file="mydataset.sav")

Then, as in Figure 4.7, we can find the data frame name, mydataset, listed as an available R Dataset.

Figure 4.7: Loading an already defined R data frame as a dataset for use in Rattle.
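The other foreign readers follow the same pattern; a quick sketch, with hypothetical filenames:

> ds1 <- read.xport("mydataset.xpt")   # SAS XPORT transport file
> ds2 <- read.dbf("mydataset.dbf")     # DBF database file

Each of these returns a data frame (read.spss() returns a list by default; adding to.data.frame=TRUE gives a data frame directly), and any of them can then be selected through Rattle's R Dataset option.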

The datasets that we wish to use with Rattle need to be constructed or loaded into the same R session that is running Rattle (i.e., the same R Console in which we loaded the Rattle package).

Reading Data from the Clipboard

Figure 4.8: Selected region of a spreadsheet copied to the clipboard.

An interesting variation that may at times be quite convenient is the ability to directly copy and paste a selection via the system clipboard. Through this mechanism, we can copy (as in copy-and-paste) data from a spreadsheet into the clipboard. Then, within R, we can paste the data into a dataset using read.table(). Suppose we have opened a spreadsheet with the data we see in Figure 4.8. If we select the 16 rows, including the header, in the usual way, we can very simply load the data using R:

> expenses <- read.table(file("clipboard"), header=TRUE)

The expenses data frame is then available to Rattle.

Converting Dates

By default, the Date variable in the example above is loaded as categoric. We can convert it into a date type, as below, before we load it into Rattle, as in Figure 4.9:

> expenses$Date <- as.Date(expenses$Date, format="%d-%b-%y")
> head(expenses)
        Date Expense Total

Figure 4.9: Loading an R data frame that was obtained from a copy-and-paste, via the clipboard, from a spreadsheet.

Reading Data from the World Wide Web

A lot of data today is available in HTML format on the World Wide Web. XML (Lang, 2011) provides functions to read such data directly into R and so make that data available for analysis in Rattle (and, of course,

R). As an example, we can read data from Google's list of most visited web sites, converting it to a data frame and thus making it available to Rattle. We begin this by loading XML and setting up some locations:

> library(XML)
> google <- "http://www.google.com/"
> path <- "adplanner/static/top1000/"
> top1000urls <- paste(google, path, sep="")

Now we can read in the data using readHTMLTable(), extracting the relevant table and setting up the column names:

> tables <- readHTMLTable(top1000urls)
> top1000 <- tables[[2]]
> colnames(top1000) <- c('Rank', 'Site', 'Category', 'Users',
                         'Reach', 'Views', 'Advertising')

The top few rows of data from the table can be viewed using head():

> head(top1000)
  Rank          Site                     Category
1    1  facebook.com              Social Networks
2    2   youtube.com                 Online Video
3    3     yahoo.com                  Web Portals
4    4      live.com               Search Engines
5    5 wikipedia.org Dictionaries & Encyclopedias
6    6       msn.com                  Web Portals
        Users Reach           Views Advertising
1 880,000,000       910,000,000,000         Yes
2 800,000,000       100,000,000,000         Yes
3 660,000,000        77,000,000,000         Yes
4 550,000,000        36,000,000,000         Yes
5 490,000,000         7,000,000,000          No
6 450,000,000   24%  15,000,000,000         Yes

4.5 R Data

Using the RData File option (Figure 4.10), data can be loaded directly from a native binary R data file (usually with the RData filename exten-

sion). Such files may contain multiple datasets (usually in a compressed format) and will have been saved from R sometime previously (using save()).

RData can be loaded by first identifying the file containing the data. The data will be loaded once the file is identified, and we will be given an option to choose just one of the available data frames to be loaded as Rattle's dataset. We specify this through the Data Name combo box and then click Execute to make the dataset available within Rattle.

Figure 4.10: Loading a dataset from a binary R data file.

Figure 4.10 illustrates the selection of an RData file. The file is called cardiac.RData. Having identified the file, Rattle will populate the Data Name combo box with the names of each of the data frames found in the file. We can choose the risk dataset, from within the data file, to be loaded into Rattle.
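From within R itself, the round trip is just save() and load(); a small sketch with hypothetical object and file names:

> risk <- data.frame(id=1:3, score=c(0.2, 0.7, 0.4))
> save(risk, file="cardiac.RData")   # write one or more objects to the file
> rm(risk)
> load("cardiac.RData")              # restores the risk data frame by name
> dim(risk)
[1] 3 2

Note that load() restores objects under their original names rather than returning a value to be assigned.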

4.6 Library

Almost every R package provides a sample dataset that is used to illustrate the functionality of the package. Rattle, as we have seen, provides the weather, weatherAUS, and audit datasets. We can explore the wealth of datasets that are available to us through the packages that are contained in our installed R library.

The Library option of the Data tab provides access to this vast collection of sample datasets. Clicking the radio button will generate the list of available datasets, which can then be accessed from the Data Name combo box. The dataset name, the package that provides that dataset, and a short description of the dataset will be included in the list. Note that the list can be quite long, and its contents will depend on the packages that are installed. We can see a sample of the list here, illustrating the R code that Rattle uses to generate the list:

> da <- data(package=.packages(all.available=TRUE))
> sort(paste(da$results[, "Item"], " : ",
             da$results[, "Package"], " : ",
             da$results[, "Title"], sep=""))
...
  [10] "Adult : arules : Adult Data Set"
...
  [12] "Affairs : AER : Fair's Extramarital Affairs Data"
...
  [14] "Aids2 : MASS : Australian AIDS Survival Data"
...
  [19] "airmay : robustbase : Air Quality Data"
...
  [23] "ais : DAAG : Australian athletes data set"
...
  [66] "audit : rattle : Sample dataset for data mining"
...
  [74] "Baseball : vcd : Baseball Data"
...
[1082] "weather : rattle : Sample dataset for..."
...

To access a dataset provided by a particular package, the actual package will first need to be loaded using library() (Rattle will do so automatically). For many packages (specifically those that declare their datasets as being lazy loaded, that is, loaded when they are referenced), the dataset will then be available from the R Console simply by typing the dataset name. Otherwise, data() needs to be run before the dataset can be accessed. We need to provide data() with the name of the dataset to be made available. Rattle takes care of this for us to ensure the appropriate action is taken to have the dataset available.
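For example, at the console we could make rattle's audit dataset available ourselves; a one-line sketch:

> data(audit, package="rattle")
> head(audit)   # output omitted here

The dataset then appears in the workspace under its own name, just as Rattle's Library option arranges behind the scenes.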

4.7 Data Options

All of Rattle's data load options that we have described above share a common set of further options that relate to the dataset once it has been loaded. The additional options relate to sampling the data as well as deciding on the role played by each of the variables. We review these options in the context of data mining.

Partitioning Data

As we first saw in Section 2.7, the Partition option allows us to partition our dataset into a training dataset, a validation dataset, and a testing dataset. The concept of partitioning a dataset was further defined in Section 3.1. The concepts are primarily oriented towards predictive data mining.

Generally we will build a model using the training dataset. To evaluate (Chapter 15) the performance of the model, we might then apply it to the validation dataset. This dataset has not been used to build the model and so provides an estimate of how well the model will perform when presented with new observations. Depending on the performance, we may tune the model-building parameters to seek an improvement in model performance.

Once we have a model that appears to perform well, or as well as possible with respect to the validation dataset, we might then evaluate its performance on the third partition, the testing dataset. The model has not previously been exposed to the observations contained in the testing dataset. Thus, the performance of the model on this dataset is probably a very good indication of how well the model will perform on new observations as they become available.

The concept of partitioning or sampling, though, is more general than simply a mechanism for partitioning for predictive data mining purposes. Statisticians have developed an understanding of sampling as a mechanism for analysing a small dataset to make conclusions about the whole population. Thus there is much literature from the statistics community on ensuring a good understanding of the uncertainty surrounding any conclusions we might make from analyses performed on any data. Such an understanding is important, though often underplayed in the data mining context.

Rattle creates a random partition/sample using sample(). A random sample will generally have a good chance of reflecting the distributions of

Thus, exploring the data, as we do in Chapter 5, will be made easier when very large datasets are sampled down into much smaller ones. Exploring 10,000 observations is often a more interactive and practical proposition than exploring 1,000,000 observations.

Other advantages of sampling include allowing analyses or plots to be repeated over different samples to gauge the stability and statistical accuracy of results. Model building, as we will see particularly when building random forests (Chapter 12), can take advantage of this.

The use of sampling in this way will also be necessary in data mining when the datasets available to model are so large that model building may take a considerable amount of time (hours or days). Sampling down to small proportions of a dataset will allow us to experiment more interactively with building a model. Once we are sure of how the data needs to be cleaned and transformed from our initial interactions, we can start experimenting with models. After we have the basic model parameters in place, we might be in a position to clean, transform, and model over a much larger portion of the data. We can leave the model building to complete over the possibly many hours that are often needed.

The downside of sampling, particularly in the data mining context, is that observations that correspond to rare events might disappear from a sample. Cases of rare diseases, or of the few instances of fraud from amongst millions of electronic funds transfers, may well be lost, even though they are the items that are of most interest to us in many data mining projects. This problem is often referred to as the class imbalance problem.

Rattle provides a default random partitioning of a dataset with 70% of the data going into a training dataset, 15% into a validation dataset, and 15% into a testing dataset (see Figure 4.11). We can override these choices, depending on our needs. A very small sampling may be required to perform some explorations of otherwise very large datasets. Smaller samples may also be required to build models using some of the more computationally expensive algorithms (like support vector machines).

Random numbers are used to select samples. Any sequence of random numbers must start with a so-called seed. If we use the same seed each time, we will get the same sequence of random numbers. Thus the process is repeatable. By changing the seed we can select different random samples. This is often useful when we wish to explore the sensitivity of our models to different data samples.
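To make the mechanics concrete, the following sketch (our own illustration rather than Rattle's internal code) uses set.seed() and sample() to construct the default 70/15/15 partition of the weather dataset:

> library(rattle)
> set.seed(42)                         # fix the seed so the split is repeatable
> n        <- nrow(weather)            # 366 observations
> train    <- sample(n, 0.7*n)         # 70% of the rows for training
> rest     <- setdiff(seq_len(n), train)
> validate <- sample(rest, 0.15*n)     # 15% of the rows for validation
> test     <- setdiff(rest, validate)  # the remaining rows for testing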

Figure 4.11: Sampling the weather dataset.

Within Rattle a default seed is always used. This ensures, for example, repeatable modelling. The seed is passed to the R function set.seed() to set a seed for the next generated sequence of random numbers. Thus, by setting the seed to the same number each time, we can be assured of obtaining the same sample. Conversely, we may like to set the seed to a different number in a series of model building exercises and then compare the performance of each model. Each model will have been built from a different random sample of the dataset. If we see significant variation between the different models, we may be concerned about the robustness of the approach we are taking. We discuss this further in Chapter 15.

Variable Roles

When building a model, each variable will play a specific role. Most variables will be inputs to the model, and one variable is often identified as the target which we are modelling.

A variable can also be identified as a so-called risk variable. A risk variable might not be used for modelling as such. Generally it will record some magnitude associated with the risk or outcome. In the audit dataset, for example, it records the dollar amount of an adjustment that results from an audit; this is a measure of the size of the risk associated with that case. In the weather dataset the risk variable is the amount of rain recorded for the following day; the amount of rain can be thought of as the size of the risk.

See Section 15.4 for an example of using risk variables within Rattle, specifically for model evaluation. Finally, we might also identify some variables to be ignored in the modelling altogether.

In loading data into Rattle, we need to ensure our variables have their correct role for modelling. The default role for most variables is that of an Input variable. Generally, these are the variables that will be used to predict the value of a Target variable.

A target variable, if there is one associated with the dataset, is generally the variable of interest, from a predictive modelling point of view. That is, it is a variable that records the outcome from the historic data. In the case of the weather dataset this is RainTomorrow, whilst for the audit dataset the target is Adjusted.

Rattle uses simple heuristics to guess at a variable having a Target role. The primary heuristic is that a variable with a small number of distinct values (e.g., less than 5) is considered as a candidate target variable. The last variable in the dataset is usually considered as a candidate for being the target variable, because in many public datasets the last variable often is the target variable. If it has more than 5 distinct values, Rattle will proceed from the first variable until it finds one with less than 5, if there are any. Only one variable can be tagged as a Target.

In a similar vein, integer variables that have a unique value for each observation are often automatically identified as an Ident (an identifier). Any number of variables can be tagged as being an Ident. All Ident variables are ignored when modelling, but are used after scoring a dataset, when it is being written to a score file, so that the observations that are scored can be identified.

Not all variables in our dataset might be wanted for the particular modelling task at hand. Such variables can be ignored, using the Ignore radio button.

When loading data into Rattle, certain special strings are used to identify variable roles. For example, if the variable name starts with ID, then the variable is automatically marked as having a role as an Ident. The user can override this. Similarly, a variable with a name beginning with IGNORE will have the default role of Ignore. And so with RISK and TARGET.
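To illustrate the flavour of such a heuristic (this is our own sketch, not Rattle's actual source code), we might count the distinct values of each variable and flag those with fewer than 5 as candidate categoric targets:

> library(rattle)
> distinct <- sapply(weather, function(x) length(unique(x)))
> names(weather)[distinct < 5]    # candidate target variables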

At any one time, a target is either treated as categoric or numeric. For a numeric variable chosen as the target, if it has 10 or fewer unique values, then Rattle will automatically treat it as a categoric variable (by default). For modelling purposes, the consequence is that only classification type predictive models will be available. To build regression type predictive models, we need to override the heuristic by selecting the Numeric radio button of the Data tab.

Weights Calculator and Role

The final data configuration option of the Data tab is the Weight Calculator and the associated Weight role. A single variable can be identified as representing some weight associated with each observation. The Weight Calculator allows us to provide a formula that could involve multiple variables as well as some scaling to give a weight for each observation. For example, with the audit dataset, we might enter a formula that uses the adjustment amount; this will give more weight to those observations with a larger adjustment.

4.8 Command Summary

This chapter has referenced the following R packages, commands, functions, and datasets:

archetypes     package    Archetypal analysis.
audit          dataset    Sample dataset from rattle.
clients        dataset    A fictitious dataset.
data()         command    Make a dataset available to R.
dfedit()       command    Edit a data frame in a spreadsheet.
file.show()    command    Display the contents of a file.
foreign        package    Access multiple data formats.
library()      command    Load a package into the R library.
odbcConnect()  function   Connect to a database.
paste()        function   Combine strings into one string.
rattle         package    Provides sample datasets.

read.arff()      function   Read an ARFF data file.
read.csv()       function   Read a comma-separated data file.
read.dbf()       function   Read data from a DBF database file.
read.delim()     function   Read a tab-delimited data file.
read.spss()      function   Read data from an SPSS data file.
read.table()     function   Read data from a text file.
read.xport()     function   Read data from a SAS Export data file.
readHTMLTable()  function   Read data from the World Wide Web.
risk             dataset    A fictitious dataset.
RODBC            package    Provides database connectivity.
sample()         function   Take a random sample of a dataset.
save()           command    Save R objects to a binary file.
set.seed()       command    Reset the random number sequence.
skel             dataset    Dataset from archetypes package.
sqlColumns()     function   List columns of a database table.
sqlTables()      function   List tables available from a database.
system.file()    function   Locate R or package file.
weather          dataset    Sample dataset from rattle.
XML              package    Access and generate XML like HTML.

Chapter 5

Exploring Data

As a data miner, we need to live and breathe our data. Even before we start building our data mining models, we can gain significant insights through exploring the data. Insights gained can deliver new discoveries to our clients, discoveries that can offer benefits early on in a data mining project. Through such insights and discoveries, we will increase our knowledge and understanding.

Through exploring our data, we can discover what the data looks like, its boundaries (the minimum and maximum values), its numeric characteristics (the average value), and how it is distributed (how spread out the data is). The data begins to tell us a story, and we need to build and understand that story for ourselves. By capturing that story, we can communicate it back to our clients.

This task of exploratory data analysis (often abbreviated as EDA) is a core activity in any data mining project. Exploring the data generally involves getting a basic understanding of a dataset through numerous variable summaries and visual plots. Through data exploration, we begin to understand the lay of the land, just as a gold miner works to understand the terrain rather than blindly digging for gold randomly.

Through this exploration, we will often identify problems with the data, including missing values, noise, erroneous data, and skewed distributions. This in turn will drive our choice of the most appropriate and, importantly, applicable tools for preparing and transforming our data and for mining. Some tools, for example, are limited in use when there is much missing data.

Rattle provides tools ranging from textual summaries to visually appealing graphical plots for identifying correlations between variables.

The Explore tab within Rattle provides a window into the tools for helping us understand our data, driving the many options available in R.

5.1 Summarising Data

Figure 5.1 shows the options available under Rattle's Explore tab. We begin our exploration of data with the basic Summary option, which provides a textual overview of the data. Whilst a picture may be worth a thousand words, textual summaries still play an important role in our understanding of data.

Figure 5.1: The Explore tab provides access to a variety of ways in which we start to understand our data.

Often, we deal with very large datasets, and some of the calculations and visualisations we perform will be computationally quite expensive. Thus it may be useful to summarise random subsets of the data instead. The Partition option of the Data tab is useful here. This uses sample() to generate a list of row numbers that we can then use to index the dataset. The following example generates a 20% (i.e., 0.2 times the number of rows) random sample of our weather dataset. We use nrow() to obtain the number of rows of the dataset, from which sample() selects 20% (0.2 x 366 = 73.2, truncated to 73 rows), and dim() for information about the number of rows and columns in the data frame:

> library(rattle)
> dim(weather)
[1] 366  24
> set.seed(42)
> smp <- sample(nrow(weather), 0.2*nrow(weather))
> dim(weather[smp,])
[1] 73 24

For our weather dataset, with only 366 observations, we clearly do not need to sample.

Basic Summaries

The simplest text-based statistical summary of a dataset is provided by summary(). This is always a useful starting point in reviewing our data. It provides a summary of each variable. Here we see summaries for a mixture of numeric and categoric variables:

> summary(weather[7:9])
    Sunshine      WindGustDir  WindGustSpeed
 Min.   : 0.00   NW     : 73   Min.   :13.0
 1st Qu.: 5.95   NNW    : 44   1st Qu.:31.0
 Median : 8.60   E      : 37   Median :39.0
 Mean   : 7.91   WNW    : 35   Mean   :39.8
 3rd Qu.:10.50   ENE    : 30   3rd Qu.:46.0
 Max.   :13.60   (Other):144   Max.   :98.0
 NA's   : 3.00   NA's   :  3   NA's   : 2.0

For the numeric variables, summary() will list the minimum and maximum values together with average values (the mean and median) and the first and third quartiles. The quartiles represent a partitioning of the values of the numeric variable into four equally sized sets. The first quartile includes 25% of the observations of this variable that have a value less than this first quartile. The third quartile is the same, but at the 75% mark. The median is actually also the second quartile, representing the 50% cutoff (i.e., the middle value).

Generally, if the mean and median are significantly different, then we would think that there are some observations of this variable that are quite a distance from the mean in one particular direction (i.e., some exceptionally large positive or negative values, generally called outliers, which we cover in Chapter 7). From the variables we see above, Sunshine has a relatively larger (although still small) gap between its mean and median, whilst the mean and median of WindGustSpeed are quite similar. Sunshine has more small observations than large observations, using our terms rather loosely.
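The quartiles that summary() reports can also be obtained directly using quantile(), as in this brief sketch:

> quantile(weather$Sunshine, na.rm=TRUE)   # 0%, 25%, 50%, 75%, 100%

The values returned are the minimum, the three quartiles, and the maximum, matching the Sunshine column of the summary above.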

The categoric variables will have listed for them the top few most frequent levels with their frequency counts, aggregating the remainder under the (Other) label. Thus there are 73 observations with a NW wind gust, 44 with a NNW wind gust, and so on. We observe quite a predominance of these northwesterly wind gusts. For both types of listings, the count of any missing values (NAs) will be reported.

A somewhat more detailed summary is obtained from describe(), provided by Hmisc (Harrell, 2010). To illustrate this, we first load Hmisc into the library:

> library(Hmisc)

For numeric variables like Sunshine (which is variable number 7), describe() outputs two more deciles (10% and 90%) as well as two other percentiles (5% and 95%). The output continues with a list of the lowest few and highest few observations of the variable. The extra information is quite useful in building up our picture of the data.

> describe(weather[7])
weather[7]

 1  Variables      366  Observations
---------------------------------------------------------------
Sunshine
[statistics elided in this transcription: n (363), missing (3),
the number of unique values, the mean (7.91), the percentiles,
and the lowest and highest values]
---------------------------------------------------------------

For categoric variables like WindGustDir (which is variable number 8), describe() outputs the frequency count and the percentage this represents for each level. The information is split over as many lines as is required, as we see in the following code box.

> describe(weather[8])
weather[8]

 1  Variables      366  Observations
---------------------------------------------------------------
WindGustDir
[statistics elided in this transcription: n (363), missing (3),
the number of unique values (16), and the Frequency and % for
each of the levels N, NNE, NE, ENE, E, ESE, SE, SSE, S, SSW, SW,
WSW, W, WNW, NW, and NNW]
---------------------------------------------------------------

Detailed Numeric Summaries

An even more detailed summary of the numeric data is provided by basicStats() from fBasics (Wuertz et al., 2010). Though intended for time series data, it provides useful statistics in general, as we see in the code box below. Some of the same data that we have already seen is presented together with a little more. Here we see that the variable Sunshine is observed 366 times, of which 3 are missing (NAs). The minimum, maximum, quartiles, mean, and median are as before. The statistics then go on to include the total sum of the amount of sunshine, the standard error of the mean, the lower and upper confidence limits on the true value of the mean (at a 95% level of confidence), the variance and standard deviation, and two measures of the shape of the distribution of the data: skewness and kurtosis (explained below).

The mean is stated as being 7.91. We can be 95% confident that the actual mean (the true mean) of the population, of which the data we have here is assumed to be a random sample, is somewhere between 7.55 and 8.27.

> library(fBasics)
> basicStats(weather$Sunshine)
[output elided in this transcription: a single column of values,
labelled for the Sunshine data, listing nobs (366), NAs (3),
Minimum, Maximum, 1. Quartile, 3. Quartile, Mean (7.91), Median,
Sum, SE Mean, LCL Mean (7.55), UCL Mean (8.27), Variance,
Stdev (3.48), Skewness, and Kurtosis]

The standard deviation is a measure of how spread out (or how dispersed or how variable) the data is with respect to the mean. It is measured in the same units as the mean itself. We can read it to say that most observations (about 68% of them) are no more than this distance from the mean. That is, most days have 7.91 hours of sunshine, plus or minus 3.48 hours.

Our observation of the mean and standard deviation for the sunshine data needs to be understood in the context of other knowledge we glean about the variable. Consider again Figure 2.8 on page 34. An observation we might make there is that the distribution appears to be what we might call bimodal, that is, it has two distinct scenarios. One is that of a cloudy day, and for such days the hours of sunshine will be quite small. The other is that of a sunny day, for which the hours of sunshine will cover the whole day. This observation might be more important to us in weather forecasting than the interval around the mean. We might want to transform this variable into a binary variable to capture this observation. Transformations are covered in Chapter 7.

The variance is the square of the standard deviation.
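To see where the confidence limits come from, we can recompute them, approximately, from the mean and the standard error using the t distribution (a minimal sketch of the standard calculation, not fBasics' exact code):

> x  <- na.omit(weather$Sunshine)
> se <- sd(x)/sqrt(length(x))    # standard error of the mean
> mean(x) + c(-1, 1) * qt(0.975, df=length(x)-1) * se

The two values returned agree with the LCL Mean and UCL Mean of 7.55 and 8.27 reported above.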

Distribution

In statistics, we often talk about how observations (i.e., the values of a variable) are distributed. By distributed we mean how many times each value of some variable might appear in a collection of data. For the variable Sunshine, for example, the distribution is concerned with how many days have 8 hours of sunshine, how many have 8.1 hours, and so on. The concept is not quite that black and white, though. In fact, the distribution is often visualised as a smooth curve, as we might be familiar with from published articles that talk about a normal (or some other common) distribution. We often hear about the bell curve. This is a graph that plots a shape similar to that of musical bells. For our discussion here, it is useful to have a mental picture of such a bell curve, where the horizontal axis represents the possible values of the variable (the observations) and the vertical axis represents how often those values might occur.

Skewness

The skewness is a measure of how asymmetrically our data is distributed. The skewness indicates whether there is a long tail on one or the other side of the mean value of the data. Here we use skewness() from Hmisc to compare the distributions of a number of variables:

> skewness(weather[,c(7,9,12,13)], na.rm=TRUE)
     Sunshine WindGustSpeed  WindSpeed9am  WindSpeed3pm
[the numeric values were lost in this transcription]

A skewness of magnitude (i.e., ignoring whether it is positive or negative) greater than 1 represents quite an obvious extended spread of the data in one direction or the other. The direction of the spread is indicated by the sign of the skewness. A positive skewness indicates that the spread is more to the right side of the mean (i.e., above the mean) and is referred to as having a longer right tail. A negative skewness is the same but on the left side.

Many models and statistical tests are based on the assumption of a so-called bell curve distribution of the data, which describes a symmetric spread of data values around the mean. The greater the skewness, the greater the distortion to this spread of values.

For a large skewness, the assumptions of the models and statistical tests will not hold, and so we need to be a little more careful in their use. The impact tends to be greater for traditional statistical approaches and less so for more recent approaches like decision trees.

Kurtosis

A companion for skewness is kurtosis, which is a measure of the nature of the peaks in the distribution of the data. Once again, we might picture the distribution of the data as having a shape that is something like that of a church bell (i.e., a bell curve). The kurtosis tells us how skinny or fat the bell is. Hmisc provides kurtosis():

> kurtosis(weather[,c(7,9,12,13)], na.rm=TRUE)
     Sunshine WindGustSpeed  WindSpeed9am  WindSpeed3pm
[the numeric values were lost in this transcription]

A larger value for the kurtosis indicates that the distribution has a sharper peak, primarily because there are only a few values with more extreme values compared with the mean value. Thus, WindSpeed9am has a sharper peak and a smaller number of more extreme values than WindSpeed3pm. The lower kurtosis value indicates a flatter peak.

Missing Values

Missing values present challenges to data mining and modelling in general. There can be many reasons for missing values, including the fact that the data is hard to collect and so not always available (e.g., results of an expensive medical test), or that it is simply not recorded because it is in fact 0 (e.g., spouse income for a spouse who stays home to manage the family). Knowing why the data is missing is important in deciding how to deal with the missing value.

We can explore the nature of the missing data using md.pattern() from mice (van Buuren and Groothuis-Oudshoorn, 2011), as Rattle does when activating the Show Missing check button of the Summary option of the Explore tab. The results can help us understand any structure in the missing data and even why the data is missing:

> library(mice)
> md.pattern(weather[,7:10])
[the pattern table is elided in this transcription; its columns
are WindGustSpeed, Sunshine, WindGustDir, and WindDir9am]

The table presents, for each variable, a pattern of missing values. Within the table, a 1 indicates a value is present, whereas a 0 indicates a value is missing. The left column records the number of observations that match the corresponding pattern of missing values. There are 329 observations with no missing values over these four variables (each having a value of 1 within that row). The final column is the number of missing values within the pattern. In the case of the first row here, with no missing values, this is 0.

The rows and columns are sorted in ascending order according to the amount of missing data. Thus, generally, the first row records the number of observations that have no missing values. In our example, the second row corresponds to a pattern of missing values for the variable Sunshine. A small number of observations have just Sunshine missing (and there are three observations overall that have Sunshine missing, based on the final row). This particular row's pattern has just a single variable missing, as indicated by the 1 in the final column.

The final row records the number of missing values over the whole dataset for each of the variables. For example, WindGustSpeed has two missing values. The total number of missing values over all observations and variables is noted at the bottom right (39 in this example). In Section 7.4, we will discuss how we might deal with missing values through an approach called imputation.
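For a quick count of the missing values per variable (the information summarised in the final row of the md.pattern() table), a one-line sketch suffices:

> colSums(is.na(weather[,7:10]))    # missing values per variable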

5.2 Visualising Distributions

In the previous section, we purposely avoided any graphical presentation of our data. In fact, I rather expect you might have been frustrated that there was no picture there to help you visualise what we were describing. The absence of a picture was primarily to make the point that it can get a little tricky explaining ideas without the aid of pictures. In particular, our explanation of skewness and kurtosis was quite laboured and reverted to painting a mental picture rather than presenting an actual picture. After reviewing this current section, go back to reconsider the discussion of skewness and kurtosis.

Pictures really do play a significant role in understanding, and graphical presentations, for many, are more effective for communicating than tables of numbers. Graphical tools allow us to visually investigate the data's characteristics to help us understand it. Such an exploration of the data can clearly identify errors in the data or oddities about its collection. This will also guide our choice of options to transform variables in different ways and to select those variables of interest.

Visualising data has been an area of study within statistics for many years. A vast array of techniques have been developed for presenting data visually, and the topic is covered in great detail in many books, including Cleveland (1993) and Tufte (1985). It is a good idea, then, early on in a data mining project, to review the distributions of the values of each of the variables in our dataset graphically. R provides a wealth of options for graphically presenting data. Indeed, R is one of the most capable data visualisation languages and allows us to program the visualisations. There are also many standard types of visualisations, and some of these are available through Rattle's Distributions option on the Explore tab (Figure 5.2).

Using Rattle's Distributions option, we can select specific variables of interest and display various distribution plots. Selecting many variables will of course lead to many plots being displayed, and so it may be useful to display multiple plots per page (i.e., per window). Rattle will do this for us automatically, controlled by our setting of the appropriate value for the number of plots per page within the interface. By default, four plots are displayed per page or window.

Figure 5.3 illustrates a sample of the variety of plots available. Clockwise from the top left plot, we have illustrated a box plot, a histogram, a mosaic plot, and a cumulative function plot.

Figure 5.2: The Explore tab's Distributions option provides convenient access to a variety of standard plots for the two primary variable types: numeric and categoric.

Because we have identified a target variable (RainTomorrow), the plots include the distributions for each subset of observations associated with each value (No and Yes) of the target variable. That is, the plots include a visualisation of the stratification of the data based on the different values of the target variable.

In brief, the box plot identifies the median and mean of the variable (MinTemp) and the spread from the first quartile to the third, and indicates the outliers. The histogram splits the range of values of the variable (Sunshine) into segments (hours in this case) and shows the number of observations in each segment. The mosaic plot shows the proportions of data split according to the target (RainTomorrow) and the chosen variable (WindGustDir, modified to have fewer levels in this case). The cumulative plot shows the percentage of observations below any particular value of the variable (WindGustSpeed).

Each of the plots available through Rattle is explained in more detail in the following sections.

Figure 5.3: A sample of plots illustrates the different distributions and how they can be visualised.

Box Plot

A box plot (Tukey, 1977) (also known as a box-and-whisker plot) provides a graphical overview of how the observations of a variable are distributed. Rattle's box plot adds some additional information to the basic box plot provided by R.

A box plot is useful for quickly ascertaining the distribution of numeric data, covering some of the same statistics presented textually in Section 5.1. In particular, any skewness will be clearly visible. When a target variable has been identified, the box plot will also show the distribution of the observations of the chosen variable by the levels of the target variable. We see such a plot for the variable Humidity3pm in Figure 5.4, noting that RainTomorrow is the target variable. The width of each of the box plots also indicates the distribution of the values of the target variable. We see that there are quite a few more observations with No for RainTomorrow than with Yes.

The box plot (which is shown with Rattle's Annotate option active in Figure 5.4) presents a variety of statistics.

Figure 5.4: The Rattle box plot extends the default R box plots to provide a little more information by default and includes annotations if requested. The plot here is for the full dataset.

The thicker horizontal line within the box represents the median (also known as the second quartile or the 50th percentile). The leftmost box plot in Figure 5.4 (showing the distribution over all of the observations for Humidity3pm) has the median labelled as 43. The top and bottom extents of the box (55 and 32, respectively) identify the upper quartile (also known as the third quartile or the 75th percentile) and the lower quartile (the first quartile or the 25th percentile). The extent of the box is known as the interquartile range (55 - 32 = 23).

Dashed lines extend to the maximum and minimum data points that are no more than 1.5 times the interquartile range from the box (i.e., from the quartiles). We might expect most of the rest of the observations to be within this region. Outliers (points further than 1.5 times the interquartile range from the box) are then individually plotted (we can see a small number of outliers for the left two box plots, each being annotated with the actual value of the observation).

The notches in the box, around the median, indicate an approximate 95% confidence level for the differences between the medians (assuming independent observations, which may not be the case).

Thus they are useful in comparing the distributions. In this instance, we can observe that the median of the values associated with the observations for which it rained tomorrow (i.e., the variable RainTomorrow has the value Yes) is significantly different (at the 95% level of confidence) from the median for those observations for which it did not rain tomorrow. It would appear that a higher humidity recorded at 3 pm is an indication that it might rain tomorrow.

The mean is also displayed as the asterisk in each of the boxes. A large gap between the median and the mean is another indication of a skewed distribution.

Rattle's Log tab records the sequence of commands used to draw the box plot and to annotate it. Basically, boxplot() (the basic plot), points() (to plot the means), and text() (to label the various points) are employed. We can, as always, copy-and-paste these commands into the R Console to replicate the plot and to then manually modify the plot commands to suit any specific need. The automatically generated code is shown below, modified slightly for clarity.

The first step is to generate the data we wish to plot. The following example creates a single dataset with two columns, one being the observations of Humidity3pm and the other, identified by a variable called grp, the group to which the observation belongs. There are three groups, two corresponding to the two values of the target variable and the other covering all observations. The use of with() allows the variables within the original dataset to be referenced without having to name the dataset each time. We combine three data.frame() objects row-wise, using rbind(), to generate the final dataset:

> ds <- with(crs$dataset[crs$train,],
             rbind(data.frame(dat=Humidity3pm, grp="All"),
                   data.frame(dat=Humidity3pm[RainTomorrow=="No"],
                              grp="No"),
                   data.frame(dat=Humidity3pm[RainTomorrow=="Yes"],
                              grp="Yes")))

Now we display the boxplot(), grouping our data by the variable grp:

> bp <- boxplot(formula=dat ~ grp, data=ds, col=rainbow_hc(3),
                xlab="RainTomorrow", ylab="Humidity3pm", notch=TRUE)

Notice that we assign to the variable bp the value returned by boxplot(). The function returns the data from the calculations needed to draw the box plot. By saving the result, we can make further use of it, as we do below, to annotate the plot.

We will also annotate the plot with the means. To do so, summaryBy() from doBy comes in handy. The use of points() together with pch=8 results in the asterisks we see in Figure 5.4.

> library(doBy)
> points(x=1:3,
         y=summaryBy(formula=dat ~ grp, data=ds,
                     FUN=mean, na.rm=TRUE)$dat.mean,
         pch=8)

Next, we add further text() annotations to identify the median and interquartile range:

> for (i in seq(ncol(bp$stats)))
  {
    # the offset constant was lost in this transcription; a small
    # value such as 0.02 keeps the labels clear of the lines
    text(x=i,
         y=bp$stats[,i] +
           0.02*(max(ds$dat, na.rm=TRUE) - min(ds$dat, na.rm=TRUE)),
         labels=bp$stats[,i])
  }

The outliers are then annotated using text(), decreasing the font size using cex=:

> text(x=bp$group+0.1, y=bp$out, labels=bp$out, cex=0.6)

To round out our plot, we add a title() to include a main= and a sub= title. We format() the current date and time (Sys.time()) and include the current user (obtained from Sys.info()) in the titles:

> title(main="Distribution of Humidity3pm (sample)",
        sub=paste("Rattle",
                  format(Sys.time(), "%Y-%b-%d %H:%M:%S"),
                  Sys.info()["user"]))

A variation of the box plot is the box-percentile plot. This plot provides more information about the distribution of the values. We can see such a plot in Figure 5.5, which is generated using bpplot() of Hmisc. The following code will generate the plot (at the time of writing this book, box-percentile plots are not yet available in Rattle):

> library(Hmisc)
> h3 <- weather$Humidity3pm
> hn <- h3[weather$RainTomorrow=="No"]
> hy <- h3[weather$RainTomorrow=="Yes"]
> ds <- list(h3, hn, hy)
> bpplot(ds, name=c("All", "No", "Yes"),
         ylab="Humidity3pm", xlab="RainTomorrow")

The width within each box (they aren't quite boxes as such, but we get the idea) is determined to be proportional to the number of observations that are below (or above) that point. The median and the 25th and 75th percentiles are also shown.

Histogram

A histogram provides a quick and useful graphical view of the spread of the data. We can very quickly get a feel for the distribution of our data, including an idea of its skewness and kurtosis. Histograms are probably one of the more common ways of visually presenting data.

A histogram plot in Rattle includes three components, as we see in Figure 5.6. The first of these is obviously the vertical bars. The continuous data in the example here (the wind speed at 9 am) has been partitioned into ranges, and the frequency of each range is displayed as the bar. R automatically chooses both the partitioning and how the x-axis is labelled, showing x-axis points at 0, 10, 20, and so on. We might observe that the most frequent range of values is in the 4-6 partition.

The plot also includes a line plot showing the so-called density estimate. The density plot is a more accurate display of the actual (at least estimated true) distribution of the data (the values of WindSpeed9am).

Figure 5.5: A box-percentile plot provides some more information about the distribution.

It allows us to see that rather than values in the range 4-6 occurring frequently, in fact it is 6 itself that occurs most frequently.

The third element of the plot is the so-called rug along the bottom of the plot. The rug is a single-dimensional plot of the data along the number line. It is useful in seeing exactly where data points actually lie. For large collections of data with a relatively even spread of values, the rug ends up being quite black.

From Figure 5.6, we can make some observations about the data. First, it is clear that the measure of wind speed is actually an integer. Presumably, in the source data, it is rounded to the nearest integer. We can also observe that some values are not represented at all in the dataset. In particular, we can see that 0, 2, 4, 6, 7, and 9 are represented in the data but 1, 3, 5, and 8 are not. The distribution of the values for WindSpeed9am is also clearly skewed, having a longer tail to the right than to the left. Recall from Section 5.1 that WindSpeed9am had a positive skewness; similarly, its kurtosis measure was 1.48, indicating a bit of a narrower peak.
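The three components of Rattle's histogram can be approximated with base R functions; a minimal sketch (without the colour and target grouping that Rattle adds):

> library(rattle)
> ws <- na.omit(weather$WindSpeed9am)
> hist(ws, freq=FALSE,                       # the vertical bars
       main="Distribution of WindSpeed9am",
       xlab="WindSpeed9am")
> lines(density(ws))                         # the density estimate
> rug(ws)                                    # the rug along the bottom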

Figure 5.6: The Rattle histogram extends the default R histogram plots with a density plot and a rug plot.

We can compare WindSpeed9am with Sunshine, as in Figure 5.7. The corresponding skewness and kurtosis for Sunshine are -0.72 and -0.27, respectively. That is, Sunshine has a smaller and negative skew, and a smaller kurtosis and hence a more spread-out peak.

Cumulative Distribution Plot

Another popular plot for communicating the distribution of the values of a variable is the cumulative distribution plot. A cumulative distribution plot displays the proportion of the data that has a value that is less than or equal to the value shown on the x-axis.

Figure 5.8 shows a cumulative distribution plot for two variables, WindSpeed9am and Sunshine. Each chart includes three cumulative plots: one line is drawn for all the data and one line for each of the values of the target variable. We can see again that these two variables have quite different distributions.

Figure 5.7: The Rattle histogram for Sunshine for comparison with WindSpeed9am.

The plot for WindSpeed9am indicates that the wind speed at 9 am is usually at the lower end of the scale (e.g., less than 10), but there are a few days with quite extreme wind speeds at 9 am (i.e., outliers). For Sunshine there is a lot more data around the middle, which is typical of a more normal type of distribution. There is quite a spread of values between 6 and 10.

The Sunshine plot is also interesting. We can see quite an obvious difference between the two lines that represent All of the observations and just those with a No (i.e., observations for which there is no rain tomorrow) and the line that represents the Yes observations. It would appear that lower values of Sunshine today are associated with observations for which it rains tomorrow.

The Ecdf() command of Hmisc provides a simple interface for producing cumulative distribution plots. The code to generate the Sunshine plot is presented below.

Figure 5.8: Cumulative distribution plots for WindSpeed9am and Sunshine.

> library(rattle)
> library(Hmisc)
> su <- weather$Sunshine
> sn <- su[weather$RainTomorrow=="No"]
> sy <- su[weather$RainTomorrow=="Yes"]
> Ecdf(su, col="#E495A5", xlab="Sunshine", subtitles=FALSE)
> Ecdf(sn, col="#86B875", lty=2, add=TRUE, subtitles=FALSE)
> Ecdf(sy, col="#7DB0DD", lty=3, add=TRUE, subtitles=FALSE)

We can add a legend and a title to the plot:

> legend("bottomright", c("All", "No", "Yes"), bty="n",
         col=c("#E495A5", "#86B875", "#7DB0DD"),
         lty=1:3, inset=c(0.05, 0.05))
> title(main=paste("Distribution of Sunshine (sample)",
                   "by RainTomorrow", sep="\n"),
        sub=paste("Rattle",
                  format(Sys.time(), "%Y-%b-%d %H:%M:%S")))

Benford's Law

The use of Benford's law has proven to be effective in identifying oddities in data. It has been used for case selection in fraud detection, particularly in accounting data (Durtschi et al., 2004), where the value of a variable for a group of related observations might be identified as not conforming to Benford's law even though other groups do.

Figure 5.9: A Benford's law plot of the variable Income from the audit dataset, particularly showing nonconformance for the population of known noncompliant clients. The legend identifies the lines: Benford's, All (2000), 0 (1537), and 1 (463).

Benford's law relates to the frequency of occurrence of the first digit in a collection of numbers. These numbers might be the dollar income earned by individuals across a population of taxpayers or the height of buildings in a city. The law generally applies when several orders of magnitude (e.g., 10, 100, and 1000) are recorded in the observations.

The law states that the digit 1 appears as the first digit of the numbers some 30% of the time. That is, for income, numbers like $13,245 and $162,385 (having an initial digit of 1) will appear about 30% of the time in our population. On the other hand, the digit 9 (for example, as in $94,251) appears as the first digit less than 5% of the time.
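The proportions stated here follow from Benford's formula, which gives the probability of a leading digit d as log10(1 + 1/d); we can compute the expected frequencies directly:

> d <- 1:9
> round(log10(1 + 1/d), 3)
[1] 0.301 0.176 0.125 0.097 0.079 0.067 0.058 0.051 0.046

The first value confirms the 30% figure for the digit 1, and the last confirms that the digit 9 leads less than 5% of the time.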

Other digits have frequencies between these two, as we can see from the black line in Figure 5.9.

This rather startling observation is certainly found, empirically, to hold in many collections of numbers, such as bank account balances, tax refunds, stock prices, death rates, lengths of rivers, and potential fraud in elections. It is observed to hold for processes that are described by what are called power laws, which are common in nature. By plotting a collection of numbers against the expectation as based on Benford's law, we are able to quickly see any odd behaviour in the data.

Benford's law is not valid for all collections of numbers. For example, people's ages would not be expected to follow Benford's law, nor would telephone numbers. So we do need to use caution in relying just on Benford's law to identify cases of interest.

We can illustrate Benford's law using the audit dataset from rattle. Rattle provides a convenient mechanism for generating a plot to visualise Benford's law, and we illustrate this with the variable Income in Figure 5.9. The darker line corresponds to Benford's law, and we note that the lines corresponding to All and 0 follow the expected first-digit distribution proposed by Benford's law. However, the line corresponding to 1 (i.e., clients who had to have their claims adjusted) clearly deviates from the proposed distribution. This might indicate that these numbers have been made up by someone or that there is some other process happening that affects this population of numbers.

Bar Plot

The plots we have discussed so far work for numeric data. We now consider plots that work for categoric data. These include the bar plot, dot plot, and mosaic plot.

A bar plot, much like a histogram, uses vertical bars to show counts of the number of observations of each of the possible values of the categoric variable. There are many ways to graph a bar plot. In Rattle, the default is to list the possible values along the x-axis, leaving the y-axis for the frequency or count. When a categoric target variable is active within Rattle, additional bars will be drawn for each value, corresponding to the different values of the target.
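A basic version of such a plot (without the per-target bars that Rattle adds) can be drawn with base R's barplot(); a minimal sketch:

> library(rattle)
> counts <- sort(table(weather$WindGustDir), decreasing=TRUE)
> barplot(counts, xlab="WindGustDir", ylab="Frequency")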

Figure 5.10 shows a typical bar plot. The sample bar plot also shows that by default Rattle will sort the bars from the largest to the smallest.

Figure 5.10: A bar plot for the categoric variable WindGustDir (modified) from the weather dataset.

Dot Plot

A dot plot illustrates much the same information as a bar plot but uses a different approach. The bars are replaced by dots that show the height, thus not filling in as much of the graphic. A dotted line replaces the extent of the bar, and by default in Rattle the plots are horizontal rather than vertical. Once again, the categoric values are ordered to produce the plot in Figure 5.11.

For the dot plot, we illustrate the distribution of observations over all of the values of the categoric variable WindGustDir. With the horizontal plot, it is more feasible to list all of the values than for the vertical bar plots. Of course, the bar plots could be drawn horizontally for the same reason.

Both the bar plot and dot plot are useful in understanding how the observations are distributed across the different categoric values. One thing to look for when a target variable is identified is any specific variation in the distributions between the target values. In Figure 5.11, for example, we can see that the distributions of the Yes and No observations are quite different from the overall distributions.

Figure 5.11: A dot plot for the categoric variable WindGustDir (original) from the weather dataset.

Such observations may merely be interesting or might lead to useful questions about the data. In our data here, we do need to recall that the majority of days have no rain. Thus the No distribution of values for WindGustDir follows the distribution of all observations quite closely. We can see a few deviations, suggesting that these wind directions have an influence on the target variable.

Mosaic Plot

A mosaic plot is an effective way to visualise the distribution of the values of one variable over the different values of another variable, looking for any structure or relationship between those variables. In Rattle, this second variable is usually the target variable (e.g., RainTomorrow in our weather dataset), as in Figure 5.12.

The mosaic plot again provides insights into how the data is distributed over the values of a second variable. The area of each bar is proportional to the number of observations having a particular value for the variable WindGustDir. Once again, the values of WindGustDir are ordered according to their frequency of occurrence. The value NW is observed most frequently. The split between the values of the target variable RainTomorrow is similarly proportional to their frequency.

Figure 5.12: A mosaic plot for the categoric variable WindGustDir (original) from the weather dataset.

Once again, we see that a wind gust of west has a high proportion of days for which RainTomorrow is true. Something that we missed in reviewing the bar and dot plots is that a SW wind gust has the highest proportion of days where it rains tomorrow, followed by SSW. It is arguable, though, that it is harder to see the overall distribution of the wind gust directions in a mosaic plot compared with the bar and dot plots. Mosaic plots are thus generally used in combination with other plots, and they are particularly good for comparing two or more variables at the same time.
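Base R can draw a simple version of this plot through mosaicplot(); a minimal sketch (Rattle's own plot adds the frequency ordering and colours):

> library(rattle)
> mosaicplot(table(weather$WindGustDir, weather$RainTomorrow),
             main="Mosaic of WindGustDir by RainTomorrow",
             xlab="WindGustDir", ylab="RainTomorrow")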

Pairs and Scatter Plots

The bar and dot plots are basically single-variable (i.e., univariate) plots. In our plots, we have been including a second variable, the target. Moving on from considering the distribution of a single variable at a time, we can compare variables pairwise. Such a plot is called a scatter plot. Generally we have multiple variables that we might wish to compare pairwise using multiple scatter plots. Such a plot then becomes a scatter plot matrix.

The pairs() command in R can be used to generate a matrix of scatter plots. In fact, the function can be fine-tuned to not only display pairwise scatter plots but also to include histograms and a pairwise measure of the correlation between variables (correlations are discussed in Section 5.3). For this added functionality, we need two support functions that are a little more complex and that we won't explain in detail:

> panel.hist <- function(x, ...)
  {
    usr <- par("usr"); on.exit(par(usr))
    par(usr=c(usr[1:2], 0, 1.5))
    h <- hist(x, plot=FALSE)
    breaks <- h$breaks; nB <- length(breaks)
    y <- h$counts; y <- y/max(y)
    rect(breaks[-nB], 0, breaks[-1], y, col="grey90", ...)
  }

> panel.cor <- function(x, y, digits=2, prefix="", cex.cor, ...)
  {
    usr <- par("usr"); on.exit(par(usr))
    par(usr=c(0, 1, 0, 1))
    r <- cor(x, y, use="complete")
    txt <- format(c(r, 0.123456789), digits=digits)[1]
    txt <- paste(prefix, txt, sep="")
    if (missing(cex.cor)) cex.cor <- 0.8/strwidth(txt)
    text(0.5, 0.5, txt)
  }

We can then generate the plot with:

> vars <- c(5, 7, 8, 9, 15, 24)
> pairs(weather[vars],
        diag.panel=panel.hist,
        upper.panel=panel.smooth,
        lower.panel=panel.cor)

There are two additional commands defined here, panel.hist() and panel.cor(), provided as the arguments diag.panel and lower.panel to pairs(). These two commands are not provided by R directly. Their definitions can be obtained from the help page for pairs() and pasted into the R Console.

Figure 5.13: A pairs plot with a scatter plot matrix displayed in the upper panel, histograms in the diagonal, and a measure of correlation in the lower panel.

Tip: Rattle can generate a pairs plot including the scatter plots, histograms, and correlations in the one plot (Figure 5.13). To do so, we go to the Distributions option of the Explore tab and ensure that no plot types are selected for any variable (the default). Then click the Execute button.

Notice that we have only included six variables in the pairs plot. Any more than this and the plot becomes somewhat crowded. In generating a pairs plot, Rattle will randomly subset the total number of variables available down to just six variables. In fact, each time the Execute button is clicked, a different randomly selected collection of variables will be displayed. This is a useful exercise to explore for interesting pairwise relationships among our variables. If we are keen to do so, we can generate plots with more than six variables quite simply by copying the command from Rattle's Log tab (which will be similar to the pairs() command shown above) and pasting it into the R Console.

Let's explore the pairs plot in a little more detail. The diagonal contains a histogram for the numeric variables and a bar plot (also a histogram) for the categoric variables (here Rainfall, Sunshine, WindGustDir, WindGustSpeed, Humidity3pm, and RainTomorrow). The top right plots (i.e., those plots above the diagonal) are pairwise scatter plots, which plot the observations of just two variables at a time. The corresponding variables are identified from the diagonal.

The top left scatter plot, which appears in the first row and the second column, has Rainfall on the y-axis and Sunshine on the x-axis. We can see quite a predominance of days (observations) with what looks like no rainfall at all, with fewer observations having some rainfall. There does not appear to be any particular relationship between the amount of rain and the hours of sunshine, although there are some days with higher rainfall when there is less sunshine. Note, though, that an outlier for rainfall (at about 40 mm of rain) appears on a day with about 9 hours of sunshine. We might decide to explore this apparent anomaly to assure ourselves that there was no measurement or data error that led to this observation.

An interesting scatter plot to examine is that in row 2 and column 5. This plot has Sunshine on the y-axis and Humidity3pm on the x-axis. The solid red lines that are drawn on the plot are a result of panel.smooth() being provided as the value for the upper.panel argument in the call to pairs(). The line provides a hint of any trend in the relationship between the two variables. For this particular scatter plot, we can see some structure in that higher levels of humidity at 3 pm are observed with lower hours of sunshine. One or two of the other scatter plots show other, but less pronounced and hence probably less significant, relationships.

The lower part of our scatter plot matrix contains numbers between -1 and 1. These are measures of the correlation between two variables. Pearson's correlation coefficient is used. We can see that Rainfall and Humidity3pm (see the number in row 5, column 1) have a small positive correlation of about 0.28. That is not a great deal of correlation. If we square the correlation value to obtain about 0.08, we can interpret this as indicating that some 8% of the variation is related. There is perhaps some basis to expect that when we observe higher rainfall we might also observe higher humidity at 3 pm.

There is even stronger correlation between the variables Sunshine and Humidity3pm (row 5, column 2), measured at about -0.76. The negative sign indicates a negative correlation. Squaring this number leads us to observe that some 58% of the variation is related. Thus, observations of more sunshine do tend to occur with observations of less humidity at 3 pm, as we have already noted. We will come back to correlations shortly.

Plots with Groups

Extending the idea of comparing variables, we can usefully plot, for example, a box plot of one variable but with the observations split into groups that are defined by another variable. We have already seen this with the target variable being the one by which we group our observations, as in Figure 5.4. Simply through selecting another variable as the target, we can explore many different relationships quite effectively.

Consider, for example, the distribution of the observations of the variable Sunshine. We might choose Cloud9am as the target variable (in Rattle's Data tab) and then request a box plot from the Explore tab. The result will be as in Figure 5.14. Note that Cloud9am is actually a numeric variable, but we are effectively using it here as a categoric variable. This is okay since it has only nine numeric values, and those are used here to group together different observations (days) having a common value for the variable.

Figure 5.14: Data displayed through the Distributions tab is grouped using the target variable values to define the groups. Selecting alternative targets will group the data differently.

The leftmost box plot shows the distribution of the observations of Sunshine over the whole dataset. The remaining nine box plots then collect together the observations of Sunshine for each of the nine possible values of Cloud9am. Recall that Cloud9am is measured in something called oktas. An okta of 0 indicates no cloud coverage, 1 indicates one-eighth of the sky is covered, and so on up to 8, indicating that the sky is completely covered in clouds.
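The same grouped display can be reproduced directly using R's formula interface to boxplot(); a minimal sketch (without the "All" group that Rattle adds):

> library(rattle)
> boxplot(Sunshine ~ Cloud9am, data=weather,
          xlab="Cloud9am", ylab="Sunshine", notch=TRUE)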

The relationship that we see in Figure 5.14 then makes some sense. There is a clear downward trend in the box plots for the amount of sunshine as we progressively have more cloud coverage. Some groups are quite distinct: compare groups 6, 7, and 8. They have different medians, with their notches clearly not overlapping.

The plot also illustrates some minor idiosyncrasies of the box plot. The box plots for groups 2 and 4 appear a little different. Each has an odd arrangement at the end of the quartile on one side of the box. This occurs when the notch is calculated to be larger than the portion of the box on one side of the median.

5.3 Correlation Analysis

We have seen from many of the plots in the sections above, particularly those plots with more than a single variable, that we often end up identifying some kind of relationship or correlation between the observations of two variables. The relationship we saw between Sunshine and Humidity3pm in Figure 5.13 is one such example.

A correlation coefficient is a measure of the degree of relationship between two variables; it is usually a number between -1 and 1. The magnitude represents the strength of the correlation and the sign represents the direction of the correlation. A high degree of correlation (closer to 1 or -1) indicates that the two variables are very highly correlated, either positively or negatively. A high positive correlation indicates that observations with a high value for one variable will also tend to have a high value for the second variable. A high negative correlation indicates that observations with a high value for one variable will tend to have a lower value of the second variable. Correlations of 1 (or -1) indicate that the two variables are essentially identical, except perhaps for scale (i.e., one variable is just a multiple of the other).

Correlation Plot

From our previous exploration of the weather dataset, we noted a moderate (negative) correlation between Sunshine and Humidity3pm. Generally, days with a higher level of sunshine have a lower level of humidity at 3 pm, and vice versa.
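We can verify this by computing Pearson's correlation coefficient from first principles and comparing the result with cor(); a short sketch:

> x  <- weather$Sunshine
> y  <- weather$Humidity3pm
> ok <- complete.cases(x, y)    # drop pairs with missing values
> x  <- x[ok]; y <- y[ok]
> sum((x-mean(x)) * (y-mean(y))) /
  sqrt(sum((x-mean(x))^2) * sum((y-mean(y))^2))
> cor(x, y)                     # the same value

Both expressions return the same value, the moderate negative correlation noted above.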

Variables that are very strongly correlated are probably not independent. That is, they have some close relationship. The relationship could be causal, in that an increase in one has some physical impact on the other, but such evidence needs to be ascertained separately. Nonetheless, having correlated variables as input to some algorithms may misguide the data mining. Thus it is important to recognise this.

R can be used to quite easily generate a matrix of correlations between variables. The cor() command will calculate and list the Pearson correlation between variables:

> vars <- c(5, 6, 7, 9, 15)
> cor(weather[vars], use="pairwise", method="pearson")
[the 5-by-5 matrix of values is elided in this transcription; its
rows and columns are Rainfall, Evaporation, Sunshine,
WindGustSpeed, and Humidity3pm]

We can compare these numbers with those in Figure 5.13. They should agree. Note that each variable is, of course, perfectly correlated with itself, and that the matrix here is symmetrical about the diagonal (i.e., the measure of the correlation between Rainfall and Sunshine is the same as that between Sunshine and Rainfall).

We have to work a little hard to find patterns in the matrix of correlation values expressed in this way. Rattle provides access to a graphical plot of the correlations between variables in our dataset. The Correlation option of the Explore tab provides a number of choices for correlation plots (Figure 5.15). Simply clicking the Execute button will cause the default correlation plot to be displayed (Figure 5.16).

The first thing we might notice about this correlation plot is that only the numeric variables appear. Rattle only computes correlations between numeric variables.

Figure 5.15: The Explore tab's Correlation option provides access to plots that visualise correlations between pairs of variables.

The second thing to note about the plot is that it is symmetric about the diagonal, as is the numeric correlation matrix we saw above; the correlation between two variables is the same, irrespective of the order in which we view them. The third thing to note is that the order of the variables does not correspond to the order in the dataset but to the order of the strength of any correlations, from the least to the greatest. This is done to achieve a more pleasing graphic but can also lead to further insight with groupings of similar correlations. This is controlled through the Ordered check button.

We can understand the degree of any correlation between two variables by both the shape and the colour of the graphic elements. Any variable is, of course, perfectly correlated with itself, and this is reflected as the straight lines on the diagonal of the plot. A perfect circle, on the other hand, indicates that there is no (or very little) correlation between the variables. This appears to be the case, for example, for the correlation between Sunshine and Pressure9am. In fact, there is a correlation, just an extremely weak one (0.006), as we see in Figure 5.15.

[Figure: a correlation plot titled "Correlation weather.csv using Pearson", with the numeric variables ordered along both axes from Pressure9am, Pressure3pm, and Humidity9am through Temp9am and MinTemp.]

Figure 5.16: The correlation plot graphically displays different degrees of correlation pairwise between variables.

The circles turn into straight lines, by degrees, as the strength of correlation between the two variables increases. Thus we can see that there is some moderate correlation between Humidity9am and Humidity3pm, represented as the squashed circle (i.e., an ellipse shape). The more squashed (i.e., the more like a straight line), the higher the degree of correlation, as in the correlation between MinTemp and Temp9am. Notice that, intuitively, all of the observations of correlations make some sense.

The direction of the ellipse indicates whether the correlation is positive or negative. The correlations we noted above were in the positive direction. We can see, for example, our previously observed negative correlation between Sunshine and Humidity3pm.

The colours used to shade the ellipses give another, if redundant, cue to the strength of the correlation. The intensity of the colour is maximal (black) for a perfect correlation and minimal (white) if there is no correlation. Shades of red are used for negative correlations and blue

for positive correlations.

5.3.2 Missing Value Correlations

An interesting and useful twist on the concept of correlation analysis is the concept of correlation amongst missing values in our data. In many datasets, it is often constructive to understand the nature of missing data. We often find commonality amongst observations with a missing value for one variable having missing values for other variables. A correlation plot can effectively highlight such structure in our datasets.

The correlation between missing values can be explored by clicking the Explore Missing check box. To understand missing values fully, we have also turned off the partitioning of the dataset on the Data tab so that all of the data is considered for the plot. The resulting plot is shown in Figure 5.17.

[Figure: a plot titled "Correlation of Missing Values weather.csv using Pearson", showing the variables WindGustDir, Sunshine, WindGustSpeed, WindDir3pm, WindSpeed9am, and WindDir9am along both axes.]

Figure 5.17: The missing values correlation plot showing correlations between missing values of variables.

We notice immediately that only six variables are included in this correlation plot. Rattle has identified that the other variables have no missing values, and so there is no point including them in the plot. We also notice that a categoric variable, WindGustDir, is included in the plot even though it was not included in the usual correlation plot.
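The underlying computation can be sketched directly in R by recoding each variable as a missing-value indicator before calling cor(); this is a sketch of the general idea rather than Rattle's exact code:

> miss <- sapply(weather, function(x) as.numeric(is.na(x)))
> miss <- miss[, colSums(miss) > 0]   # keep only variables with missing values
> cor(miss, use="pairwise")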

We can obtain a correlation for categoric variables since we only measure the absence or presence of a value, which is easily interpreted as numeric.

The graphic shows us that WindGustSpeed and WindGustDir are quite highly correlated with respect to missing values. That is, when the variable WindGustSpeed has a missing value, WindGustDir also tends to have a missing value, and vice versa. The actual correlation can be read from the Rattle text view window. There is also a weak correlation between WindDir9am and WindSpeed9am. On the other hand, there is no (in fact, very little) correlation between Sunshine and WindGustSpeed, or any other variable, with regard to missing values.

It is important to note that the correlations showing missing values may be based on very small samples, and this information is included in the text view of the Rattle window. For example, in this case we can see in Figure 5.18 that there are only 21 missing observations for WindDir9am and only two or three for the other variables. This corresponds to approximately 8% and 1% of the observations, respectively, having missing values for these variables. This is too little to draw too many conclusions from.

5.3.3 Hierarchical Correlation

Another useful option provided by Rattle is the hierarchical correlation plot (Figure 5.19). The plot provides an overview of the correlation between variables using a tree-like structure known as a dendrogram. The plot lists the variables in the right column. The variables are then linked together in the dendrogram according to how well they are correlated. The x-axis is a measure of the height within the dendrogram, ranging from 0 to 3. The heights (i.e., lengths of the lines within the dendrogram) give an indication of the level of correlation between variables, with shorter heights indicating stronger correlations.

Very quickly we can observe that Temp3pm and MaxTemp are quite closely correlated (in fact, they have a correlation of 0.99). Similarly, Cloud3pm and Cloud9am are moderately correlated (0.51). The group of variables Temp9am, MinTemp, Evaporation, Temp3pm, and MaxTemp, unsurprisingly, have some higher level of correlation amongst themselves than they do with other variables.

A number of R functions are used together to generate the plot we see in Figure 5.19. We take the opportunity to review the R code to gain

Figure 5.18: The Rattle window displays the underlying data used for the missing observations correlation plot.

a little more understanding of working directly with R. Rattle's Log tab will again provide the steps, which include generating the correlations for the numeric variables using cor():

> numerics <- c(3:7, 9, 12:21)
> cc <- cor(weather[numerics], use="pairwise", method="pearson")

We then generate a hierarchical clustering of the correlations. This can be done using hclust() (cluster analysis is detailed in Chapter 9):

> hc <- hclust(dist(cc), method="average")

A dendrogram, the graph structure that we see in the plot of Figure 5.19, can then be constructed using as.dendrogram():

> dn <- as.dendrogram(hc)

[Figure: a dendrogram titled "Variable Correlation Clusters weather.csv using Pearson", listing the variables WindSpeed3pm, WindGustSpeed, WindSpeed9am, Rainfall, Cloud3pm, Cloud9am, Humidity3pm, Humidity9am, Pressure3pm, Pressure9am, Temp9am, MinTemp, Evaporation, Temp3pm, MaxTemp, and Sunshine.]

Figure 5.19: The Hierarchical option displays the correlation between variables using a dendrogram.

The actual plot is drawn using plot():

> plot(dn, horiz=TRUE)
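Putting these steps together, the whole figure can be reproduced from the R Console; a minimal sketch, assuming rattle and its weather dataset are loaded:

> library(rattle)
> numerics <- c(3:7, 9, 12:21)
> cc <- cor(weather[numerics], use="pairwise", method="pearson")
> hc <- hclust(dist(cc), method="average")   # cluster the correlation matrix
> plot(as.dendrogram(hc), horiz=TRUE)        # draw the dendrogram horizontally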

5.4 Command Summary

This chapter has referenced the following R packages, commands, functions, and datasets:

audit           dataset   Used to illustrate Benford's law.
basicStats()    command   Detailed statistics of data.
bpplot()        command   Box-percentile plot.
describe()      command   Detailed data summary.
cor()           function  Correlation between variables.
Ecdf()          command   Produce cumulative distribution plot.
fBasics         package   More comprehensive basic statistics.
hclust()        function  A hierarchical clustering algorithm.
Hmisc           package   Additional basic statistics and plots.
kurtosis()      function  A measure of distribution peakiness.
md.pattern()    command   Table of patterns of missing values.
mice            package   Missing data analysis.
pairs()         command   Matrix of pairwise scatter plots.
panel.hist()    command   Draw histograms within a pairs plot.
panel.cor()     command   Correlations within a pairs plot.
panel.smooth()  command   Add smooth line to pairs plot.
sample()        function  Select a random sample of a dataset.
skewness()      function  A measure of distribution skew.
summary()       command   Basic dataset statistics.
weather         dataset   Sample dataset from rattle.

Chapter 6
Interactive Graphics

There is more to exploring data than simply generating textual and statistical summaries and graphical plots. As we have begun to see, R has some very significant capabilities for generating graphics that assist in revealing the story our data is telling us and then help us to effectively communicate that story to others. However, R is specifically suited to generating static graphics; that is, as Wickham (2009) says, there is no benefit displaying on a computer screen as opposed to on a piece of paper when using R's graphics capabilities.

R graphics were implemented with the idea of presenting the data visually rather than interacting with it. We write scripts for the display. We then go back to our script to fine-tune or explore different options for the displayed data. This is great for repeatable generation of graphics but not so efficient for the "follow your nose" or ad hoc reporting approach to quick and efficient data exploration.

Being able to easily interact with a plot can add significantly to the efficiency of our data exploration and lead to the discovery of interesting and important patterns and relationships. Data miners will need sophisticated skills in dynamically interacting with the visualisations of data to provide themselves with significant insights. Whilst software supports this to some extent, the true insights come from the skill of the data miner. We must take time to explore our data, identify relationships, discover patterns, and understand the picture painted by the data.

Rattle provides access to two very powerful R packages for interactive data analysis, latticist (Andrews, 2010) and GGobi, the latter of which is accessed via rggobi (Lang et al., 2011). These can be initiated through the Interactive option of the Explore tab (Figure 6.1). We will introduce

each of the tools in this chapter. Note that each application has much more functionality than can be covered here, and indeed GGobi has its own book (Cook and Swayne, 2007), which provides good details.

Figure 6.1: The Explore tab's Interactive option can initiate a latticist or GGobi session for interactive data analysis.

6.1 Latticist

Latticist (Andrews, 2010) provides a graphical and interactive interface to the advanced plotting capabilities of R's lattice (Sarkar, 2008). It is written in R itself and allows the underlying R commands that generate the plots to be directly edited and their effect immediately viewed. This then provides a more interactive experience with the generation of R plots.

Select the Latticist radio button of the Interactive option of the Explore tab and then click the toolbar's Execute button to display latticist's window, as shown in Figure 6.2. From the R Console, we can use latticist() to display the same interactive window for exploring the weather dataset:

> library(latticist)
> latticist(weather)

With the initial Latticist window, we immediately obtain an overall view of some of the story from our data. Note that, by default, from Rattle, the plots show the data grouped by the target variable RainTomorrow. We see that numeric data is illustrated with a density plot, whilst categoric data is displayed using dot plots.

Many of the plots show differences in the distributions for the two groups (based on whether RainTomorrow is No or Yes). We might note, for example, that variables MinTemp and MaxTemp (the first two plots of the top row) have slightly higher values for the observations where it

Figure 6.2: The Explore tab's Interactive option can initiate a latticist session for interactive data analysis.

rains tomorrow. The third plot suggests that the amount of Rainfall today seems to be almost identically distributed for observations where it does not rain tomorrow and those where it does. The fifth plot then indicates that there seems to be less Sunshine on days prior to days on which it rains.

There is an extensive set of features available for interacting with the visualisations. The actual command used to generate the current plot is shown at the top of the window. We can modify the command and immediately see the result, either by editing the command in place or clicking the Edit call... button. The latter results in the display of a small text window in which the command can be edited. There are buttons in the main window's toolbar to open the help page for the current plot, to reload the plot, and to navigate to previous plots.

The default plot is a plot of the marginal distribution of the variables. The buttons near the bottom left of the window allow us to select between marginal, splom (pairs), and parallel coordinates plots. A splom is a scatter plot matrix similar to that seen in Chapter 5. A parallel coordinates plot draws a line for each observation from one variable to the next, as in

Figure 6.3. Parallel coordinates plots can be quite useful in identifying groups of observations with similar values across multiple variables.

Figure 6.3: The parallel coordinates plot from latticist.

The parallel coordinates plot in Figure 6.3 exposes some structure in the weather dataset. The top variable in this case is the target variable, RainTomorrow. The other variables are Sunshine, Rainfall, MaxTemp, and MinTemp. Noting that each line represents a single observation (the weather details for each day), we might observe that for days when there is less sunshine it is more likely to rain tomorrow, and similarly when there is more sunshine it is less likely to rain tomorrow. We can observe a strong band of observations with no rain tomorrow, higher amounts of sunshine today, and little or no rainfall today. From there (to the remaining two variables) we observe less structure in the data.

There is a lot more functionality available in latticist. Exploring many of the different options through the interface is fruitful. We can add arrows and text to plots and then export the plots for inclusion in other documents. The data can be subset and grouped in a variety of ways using the variables available. This can lead to many insights, "following our nose," so to speak, in navigating our way through the data. All the time we are on the lookout for structure and must remember to capture it to support the story that we find the data telling us.
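A similar parallel coordinates plot can also be scripted directly with lattice, on which latticist is built; a minimal sketch, where the choice of variables simply mirrors the figure:

> library(lattice)
> parallelplot(~ weather[c("Sunshine", "Rainfall", "MaxTemp", "MinTemp")],
+              data=weather, groups=RainTomorrow, auto.key=TRUE)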

6.2 GGobi

GGobi is also a powerful open source tool for visualising data, supporting two of the most useful interactive visualisation concepts, known as brushing and tours. GGobi is not R software as such[1] but is integrated with R through rggobi (Lang et al., 2011) and ggobi(). Key uses in a data mining context include the exploration of the distribution of observations for multiple variables, visualisations of missing values, exploration for the development of classification models, and cluster analysis. Cook and Swayne (2007) provide extensive coverage of the use of GGobi, particularly relevant in a data mining context.

To use GGobi from the Interactive option of the Explore tab, the GGobi application will need to be installed. GGobi runs under GNU/Linux, Mac OS/X, and Microsoft Windows and is available for download from the GGobi Web site.

GGobi is very powerful indeed, and here we only cover some basic functionality. With GGobi we are able to explore high-dimensional data through highly dynamic and interactive graphics that include tours, scatter plots, bar plots, and parallel coordinates plots. The plots are interactive and linked with brushing and identification. Panning and zooming are supported. Data can be rotated in 3D, and we can tour high-dimensional data through 1D, 2D, and 2x1D projections, with manual and automatic control of projection pursuits.

We are also able to interact with GGobi by issuing commands through the R Console, and thus we can script some standard visualisations from R using GGobi. For example, patterns found in data using R or Rattle can be automatically passed to GGobi for interactive exploration. Whilst interacting with GGobi plots we can also highlight points and have them communicated back to R for further analysis.

Scatter Plot

We can start GGobi from Rattle by clicking the Execute button whilst having selected GGobi under the Interactive option of the Explore tab, as in Figure 6.1. We can also initiate GGobi with rggobi(), providing it with a data frame to load. In this example, we remove the first two variables (Date and Location) and pass on to rggobi() the remaining variables:

[1] A project is under way to implement the concepts of GGobi directly in R.

Figure 6.4: The Explore tab's Interactive option can initiate a GGobi session for interactive data analysis. Select GGobi and then click Execute.

> library(rggobi)
> gg <- rggobi(weather[-c(1, 2)])

On starting, GGobi will display the two windows shown in Figure 6.5. The first provides controls for the visualisations and the other displays the default visualisation (a two-variable scatter plot of the first two variables of the data frame supplied, noting that we have removed Date and Location).

Figure 6.5: The GGobi application control and scatter plot windows.

The control window provides menus to access all of the functionality of GGobi. Below the menu bar, we can currently see the XY Plot (i.e., scatter plot) options. Two variables are selected from the variable list on the right side of the control window. The variables selected for display

in the scatter plot are for the x-axis (X) and the y-axis (Y). By default, the first (MinTemp) and second (MaxTemp) are the chosen variables in our dataset. We can choose any of our variables to be the X or the Y by clicking the appropriate button. This will immediately change what is displayed in the plot.

Multiple Plots

Any number of plots can be displayed simultaneously. From the Display menu, we can choose a New Scatterplot Display to have two (or more) plots displayed at one time, each in its own window. Figure 6.6 shows two scatter plots, with the new one chosen to display Evaporation against Sunshine. Changes that we make in the controlling window affect the current plot, which can be chosen by clicking the plot. We can also do this from the R Console using display():

> display(gg[1], vars=list(X="Evaporation", Y="Sunshine"))

Figure 6.6: Multiple scatter plots from GGobi with and without axes.

Brushing

Brushing allows us to select observations in any plot and see them highlighted in all plots. This lets us visualise across many more dimensions than possible with a single two-dimensional plot.

From a data mining perspective, we are usually most interested in the relationship between the input variables and the target variable (using the variable RainTomorrow in our examples). We can highlight its two different values for the different observations using colour. From the Tools menu, choose Automatic Brushing to display the window shown in Figure 6.7.

Figure 6.7: GGobi's automatic brushing. The frequencies along the bottom will be different depending on whether or not the data is partitioned within Rattle.

From the list of variables that we see at the top of the resulting window, we can choose RainTomorrow (after scrolling through the list of variables to find RainTomorrow at the bottom of the list). Notice that the number ranges that are displayed in the lower colour map change to reflect the range of values associated with the chosen variable. For RainTomorrow, which has only the values 0 and 1, any observations having RainTomorrow values of 0 will be coloured purple, whilst those with a value of 1 will be coloured yellow. We click on the Apply button for the automatic brushing to take effect. Any plots that GGobi is currently displaying (and any new plots we cause to be displayed from now on) will colour the observations appropriately,

as in Figure 6.8. This colouring of points across multiple plots is referred to as brushing.

Figure 6.8: Automatic brushing of multiple scatterplots using GGobi.

Figure 6.9: Colourful brushing of multiple scatterplots.

Our plots can be made somewhat more colourful by choosing a numeric variable, like Sunshine, as the choice for automatic brushing. We can see the effect in Figure 6.9. GGobi provides an extensive collection of colour schemes to choose from for these gradients. Under the Tools menu, for example, select the Color Schemes option. A nice choice could be YlOrRd9.
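Brushing can also be driven programmatically through rggobi's glyph accessors; a hedged sketch, noting that glyph_colour() is an rggobi accessor but that the mapping of the colour indices used here to actual colours depends on the active colour scheme:

> g <- gg[1]   # the GGobi dataset created earlier
> glyph_colour(g) <- ifelse(weather$RainTomorrow == "Yes", 9, 1)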

Other Plots

The Display menu provides a number of other options for plots. The Scatterplot Matrix, for example, can be used to display a matrix of scatter plots across many variables at one time. We've seen this already in both Rattle itself and latticist. However, GGobi offers brushing and linked views across all of the currently displayed GGobi plots. By default, the Scatterplot Matrix will display the first four variables in our dataset, as shown in Figure 6.10. We can add and remove variables by selecting the appropriate buttons in the control window, which we notice has changed to include just the Scatterplot Matrix options rather than the previous Scatterplot options. Any manual or automatic brushing in effect will also be reflected in the scatter plots, as we can see in Figure 6.10.

Figure 6.10: GGobi's scatter plot matrix.

A parallel coordinates plot is also easily generated from GGobi's Display menu. An example can be seen in Figure 6.11, showing five variables, beginning with Sunshine. The automatic brushing based on Sunshine is still in effect, and we can see that the coloured lines emanate from the left end of the plot within colour groups. The yellow lines represent observations with a higher value of Sunshine, and we can see that these generally correspond to higher values of the other variables here, except

for the final variable (Rainfall).

Figure 6.11: GGobi's parallel coordinates plot.

As with many of the approaches to data visualisation, when there are many observations the plots can become rather crowded and lose some of their usefulness. For example, a scatter plot over very many points will sometimes become a solid block of points showing little useful information.

Quality Plots Using R

We can save the plots generated by GGobi into an R script file and then have R generate the plots for us. This allows the plots to be regenerated as publication-quality graphics using R's capabilities. DescribeDisplay (Wickham et al., 2010) is required for this:

> install.packages("DescribeDisplay")
> library(DescribeDisplay)

Then, within GGobi, we choose from the Tools menu to Save Display Description. This will prompt us for a filename into which GGobi will write an R script to recreate the current graphic. We can load this script into R with dd_load() and then generate a plot in the usual way:

> pd <- dd_load("ggobi-saved-display-description.r")
> pdf("ggobi-rplot-deductions-outliers")
> plot(pd)
> dev.off()
> ggplot(pd)

R code can also be included in LibreOffice documents to directly generate and include the plots within the document using odfWeave (Kuhn et al., 2010). For Microsoft Word, SWordInstaller offers similar functionality.

Further GGobi Documentation

We have only really just started to scratch the surface of using GGobi here. There is a lot more functionality available, and whilst the functionality that is likely to be useful for the data miner has been touched on, there is a lot more to explore. So do explore the other features of GGobi, as some will surely be useful for new tasks. A very good overview of using GGobi for visual data mining is presented by Cook and Swayne (2007). Another overview is provided by Wickham et al. (2008).

6.3 Command Summary

This chapter has referenced the following R packages, commands, functions, and datasets:

dd_load()    command  Load an rggobi plot script file.
dev.off()    command  Close a graphics device.
display()    command  Create a new GGobi display.
ggplot()     command  Advanced plotting functionality.
ggobi()      command  Interactive data exploration using GGobi.
latticist()  command  Interactive data exploration within R.
latticist    package  Interactive data exploration within R.
odfWeave     package  Embed R in LibreOffice documents.
plot()       command  Visualise supplied data.
rggobi       package  Interactive data exploration using GGobi.
weather      dataset  Sample dataset from rattle.

Chapter 7
Transforming Data

An interesting issue with the delivery of a data mining project is that in reality we spend more of our time working on and with the data than we do building actual models, as we suggested in Chapter 1. In building models, we will often be looking to improve their performance. The answer is often to improve our data. This might entail sourcing some additional data, cleaning up the data, dealing with missing values in the data, transforming the data, and analysing the data to raise its efficiency through a better choice of variables.

In general, we need to transform our data from the raw data originally supplied for a data mining project to the polished and focussed data from which we build our best models. This is often the make-or-break phase of a data mining project. This chapter introduces these data issues. We then review the various options for dealing with some of these issues, illustrating how to do so in Rattle and R.

7.1 Data Issues

A review of the winning entries in the annual data mining competitions reinforces the notion that building models from the right data is crucial to the success of a data mining project. The ACM KDD Cup, an annual Data Mining and Knowledge Discovery competition, is often won by a team that has placed a lot of effort in preprocessing the data supplied.

The 2009 ACM KDD Cup competition is a prime example. The French telecommunications company Orange supplied data related to

customer relationship management. It consisted of 50,000 observations with much missing data. Each observation recorded values for 15,000 (anonymous) variables. There were three target variables to be modelled. One of the common characteristics for many entries was the preprocessing performed on the data. This included dealing with missing values, recoding data in various ways, and selecting variables. Some of the resulting models, for example, used only one or two hundred of the original 15,000 variables.

We review in this section some of the issues that relate to the quality of the data that we might have available for data mining. We then consider how we deal with these issues in the following sections.

An important point to understand is that often in data mining we are making use of, and indeed making do with, the data that is available. Such data might be regularly collected for other purposes. Some variables might be critical to the operation of the business, and so special attention is paid to ensuring their accuracy. However, other data might only be informational, and so less attention is paid to its quality.

We need to understand many different aspects about how and why the data was collected in order to understand any data issues. It is crucial to spend time understanding such data issues. We should do this before we start building models and then again when we are trying to understand why particular models have emerged. We need to explore the data issues that may have led to specific patterns or anomalies in our models. We may then need to rectify those issues and rebuild our models.

Data Cleaning

When collecting data, it is not possible to ensure it is perfectly collected, except in trivial cases. There will always be errors in the collection, despite how carefully it might have been collected. It cannot be stressed enough that we always need to be questioning the quality of the data we have. Particularly in large data warehouse environments where a lot of effort has already been expended in addressing data quality issues, there will still remain dirty data. It is important to always question the data quality and to be alert to the issue.

There are many reasons for the data to be dirty. Simple data entry errors occur frequently. Decimal points can be incorrectly placed, turning one dollar amount into a very different one. There can be inherent error in any counting or measuring device. There can also be external factors that cause errors

to change over time, and so on.

One of the most important ongoing tasks we have in data mining, then, is cleaning our data. We usually start cleaning the data before we build our models. Exploring the data and building descriptive and predictive models will lead us to question the quality of the data at different times, particularly when we identify odd patterns.

A number of simple steps are available in reviewing the quality of our data. In exploring data, we will often explore variables through frequency counts and histograms. Any anomalous patterns there should be explored and explained. For categoric variables, for example, we would be on the lookout for categories with very low frequency counts. These might be mistyped or differently typed (upper/lowercase) categories.

A major task in data cleaning is often focussed around cleaning up names and addresses. This becomes particularly significant when bringing data together from multiple sources. In combining financial and business data from numerous government agencies and public sources, for example, it is not uncommon to see an individual have his or her name recorded in multiple ways. Up to 20 or 30 variations can be possible. Street addresses present the same issues. A significant amount of effort is often expended in dealing with cleaning up such data in many organisations, and a number of tools have been developed to assist in the task.

Missing Data

Missing data is a common feature of any dataset. Sometimes there is no information available to populate some value. Sometimes the data has simply been lost, or the data is purposefully missing because it does not apply to a particular observation. For whatever reason the data is missing, we need to understand and possibly deal with it.

Missing values can be difficult to deal with. Often we will see missing values replaced with sentinels to mark that they are missing. Such sentinels can include things like 9999, or 1 Jan 1900, or even special characters that can interfere with automated processing like *, ?, #, or $. We consider dealing with missing values through various transformations, as discussed in Section 7.4.
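Both of these reviews are easily scripted in R; for example, frequency counts for a categoric variable and a tally of missing values per variable, using rattle's weather dataset as the example:

> table(weather$WindGustDir, useNA="ifany")     # spot low-frequency categories
> sapply(weather, function(x) sum(is.na(x)))    # missing values per variable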

Outliers

An outlier is an observation that has values for the variables that are quite different from most other observations. Typically, an outlier appears at the maximum or minimum end of a variable and is so large or small that it skews or otherwise distorts the distribution. It is not uncommon to have a single instance or a very small number of these outlier values when compared to the frequency of other values of the variable. When summarising our data, performing tests on the data, and in building models, outliers can have an adverse impact on the quality of the results.

Hawkins (1980) captures the concept of an outlier as "an observation that deviates so much from other observations as to arouse suspicion that it was generated by a different mechanism." Outliers can be thought of as exceptional cases. Examples might include extreme weather conditions on a particular day, a very wealthy person who financially is very different from the rest of the population, and so on.

Often, an outlier may be interesting but not really a key observation for our analysis. Sometimes outliers are the rare events that we are specifically interested in. We may be interested in rare, unusual, or just infrequent events in a data mining context when considering fraud in income tax, insurance, and on-line banking, as well as for marketing.

Identifying whether an observation is an outlier is quite difficult, as it depends on the context and the model to be built. Perhaps under one context an observation is an outlier but under another context it might be a typical observation. The decision of what an outlier is will also vary by application and by user.

General outlier detection algorithms include those that are based on distance, density, projections, or distributions. The distance-based approaches are common in data mining, where an outlier is identified based on an observation's distance from nearby observations. The number of nearby observations and the minimum distance are two parameters. Another common approach is to assume a known distribution for the data. We then consider by how much an observation deviates from the distribution.
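For a single numeric variable, the box plot convention offers a quick, if crude, way to flag candidate outliers in R; a sketch using the weather dataset:

> boxplot.stats(weather$Rainfall)$out   # observations beyond the whiskers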

Many more recent model builders (including random forests and support vector machines) are very robust to outliers in that outliers tend not to adversely affect the algorithm. Linear regression type approaches tend to be affected by outliers. One approach to dealing with outliers is to remove them from the dataset altogether. However, identifying the outliers remains an issue.

Variable Selection

Variable selection is another approach that can result in improved modelling. By removing irrelevant variables from the modelling process, the resulting models can be made more robust. Of course, it takes a good knowledge of the dataset and an understanding of the relevance of variables to the problem at hand. Some variables will also be found to be quite related to other variables, creating unnecessary noise when building models.

Various techniques can be used for variable selection. Simple techniques include considering different subsets of variables to explore for a subset that provides the best results. Other approaches use modelling measures (such as the information measure of decision tree induction discussed in Chapter 11) to identify the more important collection of variables. A variety of other techniques are available. Approaches like principal components analysis and the variable importance measures of random forests and boosting can guide the choice of variables for building models.

7.2 Transforming Data

With the plethora of issues that we find in data, there is quite a collection of approaches for transforming data to improve our ability to discover knowledge. Cleaning our dataset and creating new variables from other variables in the dataset occupies much of our time as data miners. A programming language like R provides support for most of the myriad of approaches possible.

Rattle's Transform tab (Figure 7.1) provides many options for transforming datasets using many of the more common transformations. This includes normalising our data, filling in missing values, turning numeric variables into categoric variables and vice versa, dealing with outliers, and removing variables or observations with missing values. For the more complex transformations, we can revert to using R.

Figure 7.1: The Transform tab options.

We now introduce the various transformations supported by Rattle. In tuning our dataset, we will often transform it in many different ways. This often represents quite a lot of work, and we need to capture the resulting data in some form. Once the dataset is transformed, we can save the new version to a CSV file. We do this by clicking on the Export button whilst viewing the Transform (or the Data) tab. This will prompt us for a CSV filename under which the current transformed dataset will be saved. We can also save the whole current state of Rattle as a project, which can easily be reloaded at a later time.

Another option, and one to be encouraged as good practise, is to save to a script file the series of transformations as recorded in the Log tab. Saving these to a script file means we can automate the generation of the transformed dataset from the original dataset. The automatically transformed dataset can then be used for building models or for scoring. For scoring (i.e., applying a model to a new collection of data), we can simply change the name of the original source data file within the script. The data is then processed through the R script and we can then apply our model to this new dataset within R.

The remainder of this chapter introduces each of the classes of transformations that are typical of a data mining project and supported by Rattle.

7.3 Rescaling Data

Different model builders will have different assumptions on the data from which the models are built. When building a cluster using any kind of distance measure, for example, we may need to ensure all variables have approximately the same scale. Otherwise, a variable like Income will

overwhelm a variable like Age when calculating distances. A distance of 10 years may be more significant than a distance of $10,000, yet the $10,000 swamps the 10 when the two are added together, as would be the case when calculating distances without rescaling the data. In these situations, we will want to normalise our data.

The types of normalisations (available through the Normalise option of the Transform tab) we can perform include recentering and rescaling our data to be around zero (Recenter uses a so-called Z score, which subtracts the mean and divides by the standard deviation), rescaling our data to be in the range from 0 to 1 (Scale [0-1]), performing a robust rescaling around zero using the median (Median/MAD), applying log() to our data, or transforming multiple variables with one divisor (Matrix). The details of these transformations will be presented below.

Other rescaling transformations include converting the numbers into a rank ordering (Rank) and performing a transform to rescale a variable according to some group that the observation belongs to (By Group).

Figure 7.2: Transforming Temp3pm in five different ways.

Figure 7.2 shows the result of transforming the variable Temp3pm in

five different ways. The simple summary that we can see for each variable in Figure 7.2 provides a quick view of how the data has been transformed. For example, the recenter transform of the variable Temp3pm has changed the range of values for the variable: the original values start from a minimum of 5.10, whilst the recentred values start from a minimum of -2.13.

Tip: Notice, as we see in Figure 7.2, that the original data is not modified. Instead, a new variable is created for each transform with a prefix added to the variable's name that indicates the kind of transformation. The prefixes are RRC_ (for Recenter), R01_ (for Scale [0-1]), RMD_ (for Median/MAD), RLG_ (for Log), and RRK_ (for Rank).

Figure 7.3 illustrates the effect of the four transformations on the variable Temp3pm compared with the original distribution of the data. The top left plot shows the original distribution. Note that the three normalisations (recenter, rescale 0-1, and recenter using the median/MAD) all produce new variables with very similar looking distributions. The log transform changes the distribution quite significantly. The rank transform simply delivers a variable with a flat distribution since the new variable simply consists of a sequence of integers and thus each value of the new variable appears just once.

Recenter

This is a common normalisation that re-centres and rescales our data. The usual approach is to subtract the mean value of a variable from each observation's value of the variable (to recentre the variable) and then divide the values by their standard deviation (calculated as the square root of the mean of the squared deviations), which rescales the variable back to a range within a few integer values around zero.

To demonstrate the transforms on our weather dataset, we will load rattle and create a copy of the dataset, to be referred to as ds:

> library(rattle)
> ds <- weather

The following R code can then perform the transformation using scale():

> ds$RRC_Temp3pm <- scale(ds$Temp3pm)
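Written out by hand, the same recentring is (equivalent, up to attributes, to scale()):

> ds$RRC_Temp3pm <- (ds$Temp3pm - mean(ds$Temp3pm, na.rm=TRUE)) /
+                   sd(ds$Temp3pm, na.rm=TRUE)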

[Figure 7.3 panels: frequency distributions of Temp3pm and of its RRC_, R01_, RMD_, RLG_, and RRK_ transforms.]

Figure 7.3: Comparing distributions after transforming. From left to right, top to bottom: original, recenter, rescale to 0-1, rank, log transform, and recenter using median/MAD.

Scale [0-1]

Rescaling so that our data has a mean around zero might not be so intuitive for variables that are never negative. Most numeric variables from the weather dataset naturally only take on positive values, including Rainfall and WindSpeed3pm. To rescale whilst retaining only positive values, we might choose the Scale [0-1] transform, which simply recodes the data so that the values are all between 0 and 1. This is done by subtracting the minimum value from the variable's value for each observation and then dividing by the difference between the minimum and the maximum values.
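As a plain R sketch of that arithmetic, before turning to the reshape approach used by Rattle:

> rng <- range(ds$Temp3pm, na.rm=TRUE)
> ds$R01_Temp3pm <- (ds$Temp3pm - rng[1]) / (rng[2] - rng[1])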

The following R code is used to perform the transformation. We use rescaler() from reshape (Wickham, 2007):

> library(reshape)
> ds$R01_Temp3pm <- rescaler(ds$Temp3pm, "range")

Median/MAD

This option for recentring and rescaling our data is regarded as a robust (to outliers) version of the standard Recenter option. Instead of using the mean and standard deviation, we subtract the median and divide by the so-called median absolute deviation (MAD). The following R code is used to perform the transformation. Again we use rescaler() from reshape:

> library(reshape)
> ds$RMD_Temp3pm <- rescaler(ds$Temp3pm, "robust")

Natural Log

Often the values of a variable can be quite skewed in one direction or another. A typical example is Income. The majority of a population may have incomes below $150,000. But there are a relatively small number of individuals with excessive incomes measured in the millions of dollars. In many approaches to analysis and model building, these extreme values (outliers) can adversely affect any analysis.

Logarithm transforms map a very broad range of (positive) numeric values into a narrower range of (positive) numeric values. The natural log function effectively reduces the spread of the values of the variable. This is particularly useful when we have outliers with extremely large values compared with the rest of the population.

Logarithms can use a so-called base with respect to which they do the transformation. We can use a base 10 transform to explain what the transform does. With a log10 transform, a salary of $10,000 is recoded as 4, $100,000 as 5, $150,000 as about 5.18, and $1,000,000 as 6. That is, a logarithm of base 10 recodes each power of 10 (e.g., 10^5 or 100,000) to the power itself (e.g., 5), and similarly for a logarithm of base 2, which recodes 8 (which is 2^3) to 3.
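We can verify the base-10 example directly at the R prompt (log10(150000) is about 5.176):

> log10(c(10000, 100000, 150000, 1000000))
[1] 4.000000 5.000000 5.176091 6.000000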

By default, Rattle simply uses the natural logarithm for its transform. This recodes using a logarithm to base e, where e is the special number 2.71828.... This is the default base that R uses for log(). The following R code is used to perform the transformation. We also recode any resulting infinite values (e.g., log(0)) to be treated as missing values:

> ds$RLG_Temp3pm <- log(ds$Temp3pm)
> ds$RLG_Temp3pm[ds$RLG_Temp3pm == -Inf] <- NA

Rank

On some occasions, we are not interested in the actual value of the variable but rather in the relative position of the value within the distribution of that variable. For example, in comparing restaurants or universities, the actual score may be less interesting than where each restaurant or university sits compared with the others. A rank is then used to capture the relative position, ignoring the actual scale of any differences.

The Rank option will convert each observation's numeric value for the identified variable into a ranking in relation to all other observations in the dataset. A rank is simply a list of integers, starting from 1, that is mapped from the minimum value of the variable, progressing by integer until we reach the maximum value of the variable. The largest value is thus the sample size, which for the weather dataset is 366.

A rank has an advantage over a recentring transform, as it removes any skewness from the data (which may or may not be appropriate for the task at hand). A problem with recoding our data using a rank is that it becomes difficult when using the resulting model to score new observations. How do we rank a single observation? For example, suppose we have a model that tests whether the rank is less than 50 for the variable Temp3pm. What does this actually mean when we apply this test to a new observation? We might instead need to revert the rank back to an actual value to be useful in scoring.

The following R code is used to perform the transformation. Once again we use rescaler() from reshape:

> library(reshape)
> ds$RRK_Temp3pm <- rescaler(ds$Temp3pm, "rank")
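Essentially the same result is available from base R's rank(), though the handling of ties may differ slightly from rescaler():

> ds$RRK_Temp3pm <- rank(ds$Temp3pm, na.last="keep")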

By Group

A By Group transform recodes the values of a variable into a rank order between 0 and 100. A categoric variable can also be identified as part of the transformation. In this case, the observations are grouped by the values of the categoric variable. These groups are then considered as peers. The ranking is then performed with respect to the peers rather than the whole population.

An example might be to rank wind speeds within groups defined by the wind direction. A high wind speed relative to one direction may not be a high wind speed relative to another direction. The code to do this gets a little complex:

> library(reshape)
> ds$RBG_SpeedByDir <- ds$WindGustSpeed
> bylevels <- levels(ds$WindGustDir)
> for (vl in bylevels)
+ {
+   grp <- sapply(ds$WindGustDir == vl, isTRUE)
+   ds[grp, "RBG_SpeedByDir"] <-
+     round(rescaler(ds[grp, "WindGustSpeed"], "range") * 99)
+ }
> ds[is.nan(ds$RBG_SpeedByDir), "RBG_SpeedByDir"] <- 50
> v <- c("WindGustSpeed", "WindGustDir", "RBG_SpeedByDir")

We can then selectively display some observations:

> head(ds[ds$WindGustDir %in% c("NW", "SE"), v], 10)

[Output: the first ten observations having WindGustDir NW or SE, showing WindGustSpeed, WindGustDir, and RBG_SpeedByDir; the numeric values have largely not survived this transcription.]

Observation 1, for example, with a WindGustSpeed of 30, is at the 18th percentile within all those observations for which WindGustDir is NW. Overall, we might observe that the WindGustSpeed is generally less when the WindGustDir is SE as compared with NW, looking at the rankings within each group.

Instead of generating a rank of between 0 and 100, a Z score (i.e., Recenter) could be used to recode within each group. This would require only a minor change to the R code above.

Summary

We summarise this collection of transformations of the first few observations of the variable Temp3pm:

[Table: the first few observations of Temp3pm alongside its RRC_, R01_, RMD_, RLG_, and RRK_ transformed values; the numeric values have not survived this transcription.]

7.4 Imputation

Imputation is the process of filling in the gaps (or missing values) in data. Data is missing for many different reasons, and it is important to understand why. This will guide us in dealing with the missing values. For rainfall variables, for example, a missing value may mean there was no rain recorded on that day, and hence it is really a surrogate for 0 mm of rain. Alternatively, perhaps the measuring equipment was not functioning that day and hence recorded no rain.

Imputation can be questionable because, after all, we are inventing data. We won't discuss here the pros and cons in any detail, but note that, despite such concerns, reasonable results can be obtained from simple imputations.

There are many types of imputations available, only some of which are directly available in Rattle. Imputation might involve simply replacing missing values with a particular value. This then allows, for example, linear regression models to be built using all observations. Or we might

add an additional variable to record when values are missing. This then allows the model builder to identify the importance of the missing values, for example. We do note, however, that not all model builders (e.g., decision trees) are troubled by missing values.

Figure 7.4 shows Rattle's Impute option on the Transform tab selected with the choices for imputation, including Zero/Missing, Mean, Median, Mode, and Constant.

Figure 7.4: The Transform tab with the Impute option selected.

When Rattle performs an imputation, it will store the results in a new variable within the same dataset. The new variable will have the same name as the variable that is imputed, but prefixed with either IZR_, IMN_, IMD_, IMO_, or ICN_. Such variables will automatically be identified as having an Input role, whilst the original variable will have a role of Ignore.

Zero/Missing

The simplest imputations involve replacing all missing values for a variable with a single value. This makes the most sense when we know that the missing values actually indicate that the value is 0 rather than unknown. For example, in a taxation context, if a taxpayer does not provide a value for a specific type of deduction, then we might assume that they intend it to be zero. Similarly, if the number of children in a family is not recorded, it could be a reasonable assumption to assume it is zero. For categoric data, the simplest approach to imputation is to replace missing values with a special value, such as Missing.

The following R code is used to perform the transformation:

> ds$IZR_Sunshine <- ds$Sunshine
> ds$IZR_Sunshine[is.na(ds$IZR_Sunshine)] <- 0
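For a categoric variable, the corresponding transformation introduces a new level to hold the special value; a sketch for WindGustDir, where the label Missing follows the convention mentioned above:

> ds$IZR_WindGustDir <- ds$WindGustDir
> levels(ds$IZR_WindGustDir) <- c(levels(ds$IZR_WindGustDir), "Missing")
> ds$IZR_WindGustDir[is.na(ds$IZR_WindGustDir)] <- "Missing"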

Mean/Median/Mode

Often a simple, if not always satisfactory, choice for missing values that are known not to be zero is to use some central value of the variable. This is often the mean, median, or mode, and thus usually has limited impact on the distribution. We might choose to use the mean, for example, if the variable is otherwise generally normally distributed (and in particular does not have any skewness). If the data does exhibit some skewness, though (e.g., there are a small number of very large values), then the median might be a better choice.

For categoric variables, there is, of course, no mean or median, and so in such cases we might (but with care) choose to use the mode (the most frequent value) as the default to fill in for the otherwise missing values. The mode can also be used for numeric variables. This could be appropriate for variables that are dominated by a single value. Perhaps we notice that predominately (e.g., for 80% of the observations) the temperature at 9 am is 26 degrees Celsius. That could be a reasonable choice for any missing values.

Whilst this is a simple and computationally quick approach, it is a very blunt approach to imputation and can lead to poor performance from the resulting models. However, it has also been found empirically to be useful.

The following R code is used to perform the transformation:

> ds$IMN_Sunshine <- ds$Sunshine
> ds$IMN_Sunshine[is.na(ds$IMN_Sunshine)] <- mean(ds$Sunshine, na.rm=TRUE)

Constant

This choice allows us to provide our own default value to fill in the gaps. This might be an integer or real number for numeric variables, or else a special marker or the choice of something other than the majority category for categoric variables. The following R code is used to perform the transformation:

> ds$ICN_Sunshine <- ds$Sunshine
> ds$ICN_Sunshine[is.na(ds$ICN_Sunshine)] <- 0   # or whatever constant we choose
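Returning to the central-value imputations above, a median-based version, often preferred for skewed variables, differs from the mean only in the summary statistic used:

> ds$IMD_Sunshine <- ds$Sunshine
> ds$IMD_Sunshine[is.na(ds$IMD_Sunshine)] <- median(ds$Sunshine, na.rm=TRUE)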

7.5 Recoding

The Recode option on the Transform tab provides numerous remapping operations, including binning and transformations of the type of the data. Figure 7.5 lists the options.

Figure 7.5: The Transform tab with the Recode option selected.

Binning

Binning is the operation of transforming a continuous numeric variable into a specific set of categoric values based on the numeric values. Simple examples include converting an age into an age group, and a temperature into Low, Medium, and High. Performing a binning transform may lose valuable information, so do give some thought as to whether binning is appropriate.

Binning can be useful in simplifying models. It is also useful when we visualise data. A mosaic plot (Chapter 5), for example, is only useful for categoric data, and so we could turn Sunshine into a categoric variable by binning. Binning can also be useful to set a numeric value as the stratifying variable in various plots in Chapter 5. For example, we could bin Temp9am and then choose the new BE4_Temp9am (BE4 for binning into four equal-size bins) as the Target and generate a Box Plot from the Explore tab to see the relationship with the Evaporation.

Rattle supports automated binning through the use of binning() (provided by Daniele Medri). The Rattle interface provides an option to choose between Quantile (or equal count) binning, KMeans binning, and Equal Width binning. For each option, the default number of bins is four. We can change this to suit our needs. The variables generated are prefixed with either BQn_, BKn_, or BEn_, respectively, with n replaced by the number of bins.
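In plain R, equal-width and quantile binning can be sketched with cut(); this mirrors the idea behind Rattle's binning() rather than its exact code:

> ds$BE4_Temp9am <- cut(ds$Temp9am, breaks=4)   # four equal-width bins
> ds$BQ4_Temp9am <- cut(ds$Temp9am,
+                       breaks=quantile(ds$Temp9am, seq(0, 1, 0.25), na.rm=TRUE),
+                       include.lowest=TRUE)    # four quantile (equal-count) bins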

Indicator Variables

Some model builders often do not directly handle categoric variables. This is typical of distance-based model builders such as k-means clustering, as well as the traditional numeric regression types of models. A simple approach to transforming a categoric variable into a numeric one is to construct a collection of so-called indicator or dummy variables. For each possible value of the categoric variable, we can create a new variable that will have the value 1 for any observation that has this categoric value and 0 otherwise. The result is a collection of new numeric variables, one for each of the possible categoric values.

An example might be the categoric variable Colour, which might only allow the possible values of Red, Green, or Blue. This can be converted to three variables, Colour_Red, Colour_Green, and Colour_Blue. Only one of these will have the value 1 at any time, whilst the other(s) will have the value 0.

Rattle's Transform tab provides an option to transform a categoric variable into a collection of indicator variables. Each of the new variables has a name that is prefixed by TIN_. The remainder of the name is made up of the original name of the categoric variable (e.g., Colour) and the particular value (e.g., Red). This will give, for example, TIN_Colour_Red as one of the new variable names. Table 7.1 illustrates how the recoding works for a collection of observations.

Table 7.1: Examples of recoding a single categoric variable as a number of numeric indicator variables.

Obs.  Colour  Colour_Red  Colour_Green  Colour_Blue
1     Green   0           1             0
2     Blue    0           0             1
3     Blue    0           0             1
4     Red     1           0             0
5     Green   0           1             0
6     Red     1           0             0

In terms of modelling, for a categoric variable with k possible values, we only need to convert it to k-1 indicator variables. The kth indicator variable is redundant and in fact is directly determined by the values of the other k-1 indicators. If all of the other indicators are 0, then clearly the kth will be 1.
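In R, a full set of indicators can be generated with model.matrix(); a sketch using a small Colour factor like that of Table 7.1:

> colour <- factor(c("Green", "Blue", "Blue", "Red", "Green", "Red"))
> model.matrix(~ colour - 1)   # the "- 1" keeps one 0/1 column per level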

Similarly, if any of the other k-1 indicators is 1, then the kth must be 0. Consequently, we should only include all but one of the new indicator variables as having an Input role. Rattle, by default, will set the role of the first new indicator variable to be Ignore.

There is not always a need to transform a categoric variable. Some model builders, like the Linear model builder in Rattle, will do it automatically.

Join Categorics

The Join Categorics option provides a convenient way to stratify the dataset based on multiple categoric variables. It is a simple mechanism that creates a new variable from the combination of all of the values of the two constituent variables selected in the Rattle interface. The resulting variables are prefixed with TJN_ and include the names of both the constituent variables.

A simple example might be to join RainToday and RainTomorrow to give a new variable (TJN here and TJN_RainToday_RainTomorrow in Rattle):

> ds$TJN <- interaction(paste(ds$RainToday, "_", ds$RainTomorrow, sep=""))
> ds$TJN[grep("^NA_|_NA$", ds$TJN)] <- NA
> ds$TJN <- as.factor(as.character(ds$TJN))
> head(ds[c("RainToday", "RainTomorrow", "TJN")])

  RainToday RainTomorrow     TJN
1        No          Yes  No_Yes
2       Yes          Yes Yes_Yes
3       Yes          Yes Yes_Yes
4       Yes          Yes Yes_Yes
5       Yes           No  Yes_No
6        No           No   No_No

We might also want to join a numeric variable and a categoric variable, like the common Age and Gender stratification. To do this, we first use the Binning option within Recode to categorise the Age variable and then use Join Categorics.

Type Conversion

The As Categoric and As Numeric options will, respectively, convert a numeric variable to categoric (with the new categoric variable name prefixed with TFC_) and vice versa (with the new numeric variable name prefixed with TNM_). The R code for these transforms uses as.factor() and as.numeric():

> ds$TFC_Cloud3pm <- as.factor(ds$Cloud3pm)
> ds$TNM_RainToday <- as.numeric(ds$RainToday)

7.6 Cleanup

It is quite easy to get our dataset variable count up to significant numbers. The Cleanup option allows us to tell Rattle to actually delete columns from the dataset. Thus, we can perform numerous transformations and then save the dataset back into a CSV file (using the Export option).

Various Cleanup options are available. These allow us to remove any variable that is ignored (Delete Ignored), remove any variables we select (Delete Selected), or remove any variables that have missing values (Delete Missing). The Delete Obs with Missing option will remove observations (rather than variables, i.e., remove rows rather than columns) that have missing values.

7.7 Command Summary

This chapter has referenced the following R packages, commands, functions, and datasets:

as.factor()   function  Convert variable to be categoric.
as.numeric()  function  Convert variable to be numeric.
is.na()       function  Identify which values are missing.
levels()      function  List the values of a categoric variable.
log()         function  Logarithm of a numeric variable.
mean()        function  Mean value of a numeric variable.
rescaler()    function  Remap numeric variables.

reshape  package   Transform variables in various ways.
scale()  function  Remap numeric variables.

Part II
Building Models


Chapter 8
Descriptive and Predictive Analytics

Modelling is what we most often think of when we think of data mining. Modelling is the process of taking some data (usually) and building a simplified description of the processes that might have generated it. The description is often a computer program or mathematical formula. A model captures the knowledge exhibited by the data and encodes it in some language. Often the aim is to address a specific problem through modelling the world in some form and then use the model to develop a better understanding of the world.

We now turn our attention to building models. As in any data mining project, building models is usually the aim, yet we spend a lot more time understanding the business problem and the data, and working the data into shape, before we can begin building the models. Often we gain much valuable knowledge from our preparation for modelling, and some data mining projects finish at that stage, even without the need to build a model; that might be unusual, though, and we do need to expect to build a model or two.

As we will find, we build models early on in a project, then work on our data some more to transform, shape, and clean it, build more models, then return to processing the data once again, and so on for many iterations. Each cycle takes us a step closer to achieving our desired outcomes.

This chapter introduces the concept of models and model builders that fall into the two categories of data mining: descriptive and predictive. In this chapter, we provide an overview of these approaches. For descrip-

For descriptive data mining, we present cluster analysis and association rules as two approaches to model building. For predictive data mining, we consider both classification and regression models, introducing algorithms like decision trees, random forests, boosting, support vector machines, linear regression, and neural networks. In each case, in their own chapters, the algorithms are presented together with a guide to using them within Rattle and R.

8.1 Model Nomenclature

Much of the terminology used in data mining has grown out of terminology used in both machine learning and research statistics. We identify, for example, two very broad categories of model building algorithms as descriptive analytics and predictive analytics. In a traditional machine learning context, these equate to unsupervised learning and supervised learning. We cover both approaches in the following chapters and describe each in a little more detail in the following sections.

On top of the basic algorithm for building models, we also identify meta learners, which include ensemble learners. These approaches suggest building many models and combining them in some way. Some ideas for ensembles originate from the multiple inductive learning (MIL) algorithm (Williams, 1988), where multiple decision tree models are built and combined as a single model.

8.2 A Framework for Modelling

Building models is a common pursuit throughout life. When we think about it, we build ad hoc and informal models every day when we solve problems in our head and live our lives. Different professions, like architects and engineers, for example, specifically build models to see how things fit together, to make sure they do fit together, to see how things will work in the real world, and even to sell the idea behind the model to others. Data mining is about building models that give us insights into the world and how it works. But even more than that, our models are often useful to give us guidance in how to deal with and interact with the real world.

Building models is thus fundamental to understanding our world. We start doing it as a child and continue until death.

When we build a model, whether it be with toy bricks, papier mâché, or computer software, we get a new perspective of how things fit together or interact. Once we have some basic models, we can start to get ideas about more complex ones, building on what has come before. With data mining, our models are driven by the data and thus aim to be objective. Other models might be more subjective and reflect our views of what we are modelling.

In understanding new, complex ideas, we often begin by trying to map the idea into concepts or constructs that we already know. We bring these constructs together in different ways that reflect how we understand a new, more complex idea. As we learn more about the new, complex idea, we change our model to better reflect that idea until eventually we have a model that is a good enough match to the idea.

The same is true when building models using computers. Writing any computer program is essentially about building a model. An accountant's spreadsheet is a model of something in the world. A social media application captures a model or introduces a new model of how people communicate. Models of the economy and of the environment provide insights into how these things work and allow us to explore possible future scenarios.

An important thing to remember, though, is that no model can perfectly represent the real world, except in the most simplistic and trivial of scenarios. To perfectly model the real world, even if it were possible, we would need to incorporate into the model every possible variable imaginable. The real world has so many different factors feeding into it that all we can really hope to do is to get a good approximation of it.

A model, as a good approximation of the world, will express some understanding of that world. It needs to be expressed using some language, whether it be a spoken or written human language, a mathematical language, a computer language, or a modelling language. The language is used to represent our knowledge.

We write or speak in sentences based on the language we have chosen. Some sentences expressed in our chosen language will capture useful knowledge. Other sentences might capture misinformation, and yet others may capture beliefs or propositions, and so on. Formally, each sentence will express or capture some concept within the formal constraints of the particular language chosen.

We can think of constructing a sentence to express something about our data as building a model.

For any language, though, there is often an infinite (or at least a very large) collection of possible sentences (i.e., models) that can be expressed. We need some way of measuring how good a sentence is. This might just be a measure of how well formed our written sentence is: is it grammatically correct and does it read well? But just as importantly, does the sentence express a valid statement about the world? Does it provide useful insight and knowledge about the world? Is it a good model?

For each of the model builders we introduce, we will use this three-pronged framework:

- identify the language used to express the discovered knowledge,
- develop a mechanism to search for good sentences within the language, and
- define a measure that can be used to assess how good a sentence is.

This is a quite common framework from the artificial intelligence tradition. There we seek to automatically search for solutions to problems, within the bounds of a chosen knowledge representation language. This framework is simply cast for the task of data mining: the task of building models.

We refer to an algorithm for building a model as a model builder. Rattle supports a number of model builders, including clustering, association rules, decision tree induction, random forests, boosted decision trees, support vector machines, logistic regression, and neural networks. In essence, the model builders differ in how they represent the models they build (i.e., the discovered knowledge) and how they find (or search for) the best model within this representation.

In building a model, we will often look to the structure of the model itself to provide insights. In particular, we can learn much about the relationships between the input variables and the target variable (if any) from studying our models. Sometimes these observations themselves deliver benefits from the data mining project, even without actually using the models directly.

There is generally an infinite number of possible sentences (i.e., models) given any specific language. In human language, we are generally very well skilled at choosing sentences from this infinite number of possibilities to best represent what we would like to communicate. And so it needs to be with model building. The skill is to express, within the chosen language, the best sentences that capture what it is we are attempting to model.

8.3 Descriptive Analytics

Descriptive analytics is the task of providing a representation of the knowledge discovered without necessarily modelling a specific outcome. The tasks of cluster analysis, association and correlation analysis, and pattern discovery can fall under this category.

From a machine learning perspective, we might compare these algorithms to unsupervised learning. The aim of unsupervised learning is to identify patterns in the data that extend our knowledge and understanding of the world that the data reflects. There is generally no specific target variable that we are attempting to model. Instead, these approaches shed light on the patterns that emerge from the descriptive analytics.

8.4 Predictive Analytics

Often our task in data mining is to build a model that can be used to predict the occurrence of an event. The model builders will extract knowledge from historic data and represent it in such a form that we can apply the resulting model to new situations. We refer to this as predictive analytics.

The tasks of classification and regression are at the heart of what we often think of as data mining and specifically predictive analytics. Indeed, we call much of what we do in data mining predictive analytics.

From a machine learning perspective, this is also referred to as supervised learning. The historic data from which we build our models will already have associated with it specific outcomes. For example, each observation of the weather dataset has associated with it a known outcome, recorded as the target variable. The target variable is RainTomorrow (whether it rained the following day), with the possible values of No and Yes.

Classification models are used to predict the class of new observations. New observations are classified into the different target variable categories or classes (for the weather dataset, this would be Yes and No). Often we will be presented with just two classes, but it could be more. A new observation might be today's weather observation. We want to classify the observation into the class Yes or the class No. Membership in a particular class indicates whether there might be rain on the following day or not, as the case may be.

Often, classification models are represented symbolically. That is, they are often expressed as, for example, a series of tests (or conditions) on different variables. Each test exhibits a piece of the knowledge that, together with other tests, leads to the identified outcome.

Regression models, on the other hand, are generally models that predict a numeric outcome. For the weather dataset, this might be the amount of rain expected on the following day rather than whether it will or won't rain. Regression models are often expressed as a mathematical formula that captures the relationship between a collection of input variables and the numeric target variable. This formula can then be applied to new observations to predict a numeric outcome.

Interestingly, regression comes from the word regress, which means to move backwards. It was used by Galton (1885) in the context of techniques for regressing (i.e., moving from) observations to the average. The early research included investigations that separated people into different classes based on their characteristics. The regression came from modelling the heights of related people (Crano and Brewer, 2002).

8.5 Model Builders

Each of the following chapters describes a particular class of model builders using specific algorithms. For each model builder, we identify the structure of the language used to describe a model. The search algorithm is described as well as any measures used to assist in the search and to identify a good model.

Following the formal overview of each model builder, we then describe how the algorithm is used in Rattle and R and provide illustrative examples. The aim is to provide insight into how the algorithm works and some details related to it so that as a data miner we can make effective use of the model builder.

The algorithms we present will generally be in the context of a two-class classification task where appropriate. The aim of such tasks is to distinguish between two classes of observations. Such problems abound. The two classes might, for example, identify whether or not it is predicted to rain tomorrow (No and Yes). Or they might distinguish between high-risk and low-risk insurance clients, productive and unproductive taxation audits, responsive and nonresponsive customers, successful and unsuccessful security breaches, and so on.

Many of the popular algorithms are covered in the following chapters. Algorithms not covered include neural networks, linear and logistic regressions, and Bayesian approaches.

In demonstrating the tasks using Rattle (together with a guide to the underlying R code), we note that Rattle presents a basic collection of tuning parameters. Good default values for various options allow the user to more simply build a model with little tuning. However, this may not always be the right approach, and whilst it is certainly a good place to start, experienced users will want to make much more use of the fuller set of tuning parameters available directly through the R Console.


Chapter 9

Cluster Analysis

The clustering technique is one of the core tools that is used by the data miner. Clustering gives us the opportunity to group observations in a generally unguided fashion according to how similar they are. This is done on the basis of a measure of the distance between observations.

For example, we might have a dataset that is made up of school children of various heights, a range of weights, and different ages. Depending on what is needed to solve the problem at hand, we might wish to group the students into smaller, more definable groups and then compare different variables common to all groupings. Each group may have different ranges, minimums and maximums, and so on that represent that group. Clustering allows the data miner to break data into more meaningful groups and then contrast the different clusters against each other. Clusters can also be useful in grouping observations to help make the smaller datasets easier to manage.

The aim of clustering is often to identify groups of observations that are close together but as a group are quite separate from other groups. Numerous algorithms have been developed for clustering. In this chapter, we focus primarily on the k-means clustering algorithm. The algorithm will identify a collection of k clusters using a heuristic search starting with a selection of k randomly chosen clusters.

9.1 Knowledge Representation

A model built using the k-means algorithm represents the clusters as a collection of k means. The observations in the dataset are associated with their closest mean and thus are partitioned into k clusters. The mean of a particular numeric variable for a collection of observations is the average value of that variable over those observations. The means for the collection of observations that form one of the k clusters in any particular clustering are then the collection of mean values for each of the input variables over the observations within the clustering.

Consider, for example, a simple and small random subset of the weather dataset. This can be generated as below, where we choose only a small number of the available numeric variables:

> library(rattle)
> set.seed(42)
> obs1 <- sample(1:nrow(weather), 5)
> vars <- c("MinTemp", "MaxTemp", "Rainfall", "Evaporation")
> cluster1 <- weather[obs1, vars]

We now obtain the means of each of the variables. The vector of means then represents one of the clusters within our set of k clusters:

> mean(cluster1)
    MinTemp     MaxTemp    Rainfall Evaporation

Another cluster will have a different mean:

> obs2 <- setdiff(sample(1:nrow(weather), 20), obs1)
> cluster2 <- weather[obs2, vars]
> mean(cluster2)
    MinTemp     MaxTemp    Rainfall Evaporation

In comparing the two clusters, we might suggest that the second cluster generally has warmer days with less rainfall. However, without having actually built the clustering model, we can't really make too many such general observations without knowing the actual distribution of the observations.
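A small aside on the code above: mean() applied to a data frame was valid in the version of R current when this was written, but it is defunct in more recent versions of R; colMeans() returns the same vector of per-variable means:

> colMeans(cluster1)   # same per-variable means as mean(cluster1)
> colMeans(cluster2)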

A particular sentence in our knowledge representation language for k-means is then a collection of k sets of mean values for each of the variables. Thus, if we were to simply partition the weather dataset into ten sets (a common value for k), we would get ten sets of means for each of the four variables. Together, these 10 by 4 means represent a single sentence (or model) in the k-means language.

9.2 Search Heuristic

For a given dataset, there are a very large number of possible k-means models that could be built. We might think to enumerate every possibility and then, using some measure that indicates how good the clustering is, choose the one that gets the best score. In general, this process of completely enumerating all possibilities would not be computationally possible. It may take hours, days, or weeks of computer time to generate and measure each possible set of clusters.

Instead, the k-means algorithm uses a search heuristic. It begins with a random collection of k clusters. Each cluster is represented by a vector of the mean values for each of the variables. The next step in the process is to then measure the distance between an observation and each of the k vectors of mean values. Each observation is then associated with its closest cluster. We then recalculate the mean values based on the observations that are now associated with each cluster. This will provide us with a new collection of k vectors of means.

With this new set of k means, we once again calculate the distance each observation is from each of the k means and reassociate the observation with the closest of the k means. This will often result in some observations moving from one group or cluster to another. Once again, we recalculate the mean values based on the observations that are now associated with each cluster. Again, we have k new vectors of means. We repeat the process again.

This iterative process is repeated until no more observations move from one cluster to another. The resulting clustering is then the model.
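This heuristic is compact enough to sketch in a few lines of R. The following is a minimal illustrative implementation of the loop just described, not the optimised algorithm behind R's kmeans(), and it does not handle clusters that happen to become empty:

simple.kmeans <- function(data, k, max.iter=100)
{
  data <- as.matrix(data)
  # Begin with k randomly chosen observations as the initial means.
  means <- data[sample(nrow(data), k), , drop=FALSE]
  cluster <- rep(0, nrow(data))
  for (iter in 1:max.iter)
  {
    # Squared Euclidean distance from each observation to each mean.
    d <- sapply(1:k, function(i) colSums((t(data) - means[i, ])^2))
    new.cluster <- max.col(-d)
    # Stop once no observation moves from one cluster to another.
    if (all(new.cluster == cluster)) break
    cluster <- new.cluster
    # Recalculate each mean from its associated observations.
    for (i in 1:k)
      means[i, ] <- colMeans(data[cluster == i, , drop=FALSE])
  }
  list(centers=means, cluster=cluster)
}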

9.3 Measures

The basic measure used in building the model is a measure of distance, or conversely the measure of similarity between observations and the cluster means. Any distance measure that measures the distance between two observations a and b must satisfy the following requirements:

- d(a, b) >= 0 (distance is nonnegative)
- d(a, a) = 0 (distance to itself is 0)
- d(a, b) = d(b, a) (distance is symmetric)
- d(a, b) <= d(a, c) + d(c, b) (triangular inequality)

One common distance measure is known as the Minkowski distance. This is formulated as

    d(a, b) = (|a1 - b1|^q + |a2 - b2|^q + ... + |an - bn|^q)^(1/q),

where a1 is the value of variable 1 for observation a, etc. The value of q determines an actual distance formula.

We can best picture the distance calculation using just two variables, like MinTemp and MaxTemp, from two observations. We plot the first two observations from the weather dataset in Figure 9.1 as generated using the following call to plot(). We also report the actual values being plotted.

> x <- round(weather$MinTemp[1:2])
> y <- round(weather$MaxTemp[1:2])
> plot(x, y, ylim=c(23, 29), pch=4, lwd=5,
       xlab="MinTemp", ylab="MaxTemp", bty="n")
> round(x)
[1]  8 14
> round(y)
[1] 24 27

When q = 1, d is known as the Manhattan distance:

    d(a, b) = |a1 - b1| + |a2 - b2| + ... + |an - bn|.

Figure 9.1: Two observations of two variables (MinTemp and MaxTemp) from the weather dataset.

What are the possible ways of measuring the distance between these two points? The Manhattan distance measure gets its name from one of the five boroughs of New York City. Most of the streets of Manhattan are laid out on a regular grid. Each block is essentially a rectangle. Figure 9.2 simplifies the grid structure but illustrates the point. Suppose we want to calculate the distance to walk from one block corner, say West 31st Street and 8th Avenue, to another, say West 17th Street and 6th Avenue. We must travel along the street, and the distance is given by how far we travel in each of just two directions, as is captured in the formula above.

For our weather dataset, we can add a grid() to the plot and limit our walk to the lines on the grid, as in Figure 9.2. The distance travelled will be d = 6 + 3 = 9, and one such path is shown as the horizontal and then vertical line in Figure 9.2.

Figure 9.2: Measuring the distance by travelling the streets of Manhattan (the regular grid, with one path shown as the horizontal and then the vertical line), rather than as a bird might fly (the direct line between the two points).

When q = 2, d is known as the more familiar, and most commonly used, Euclidean distance:

    d(a, b) = sqrt(|a1 - b1|^2 + |a2 - b2|^2 + ... + |an - bn|^2).

This is the straight-line distance between the two points shown in Figure 9.2. It is how a bird would fly direct from one point to another if it was flying high enough in Manhattan. The distance in this case is d = sqrt(6^2 + 3^2) = 6.7.

In terms of how we measure the quality of the actual clustering model, there are very many possibilities. Most relate to measuring the distance between all of the observations within a cluster and summing that up. Then we compare that with some measure of the distances between the means or even the observations of each of the different clusters. We will see and explain in a little more detail some of these measures in the next section.
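Both calculations can be checked with R's dist() function. A short sketch, assuming the rounded values plotted above, (8, 24) and (14, 27):

> p <- rbind(c(8, 24), c(14, 27))  # (MinTemp, MaxTemp) for the two observations
> dist(p, method="manhattan")      # 6 + 3 = 9
> dist(p, method="euclidean")      # sqrt(6^2 + 3^2) = 6.7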

9.4 Tutorial Example

The weather dataset is used to illustrate the building of a cluster model. The Cluster tab in the Rattle window provides access to various clustering algorithms, including k-means. kmeans() is provided directly through R by the standard stats package.

Building a Model Using Rattle

After loading a dataset into Rattle, we select the Cluster tab to be presented with various clustering algorithms. We will also see a simple collection of options available for use to fine-tune the model building.

The k-means algorithm is the default option, and by default ten clusters will be built as the model. A random seed is provided. Changing the seed will result in a randomly different collection of starting points for our means. The heuristic search then begins the iterative process as described in Section 9.2.

Load the weather dataset from the Data tab, and then simply clicking the Execute button whilst on the Cluster tab will result in the k-means clustering output shown in Figure 9.3.

Figure 9.3: Building a k-means clustering model.

The text view contains a little information about the model that has been built. We will work our way through its contents. It begins with the cluster size, which is simply a count of the number of observations within each cluster:

Cluster sizes:
[1] " "

Mean (or average) values are the basic representational language for models when using k-means. The text view provides a summary of the mean value of each variable over the whole dataset of observations (with the output truncated here):

Data means:
      MinTemp       MaxTemp      Rainfall   Evaporation
     Sunshine WindGustSpeed  WindSpeed9am  WindSpeed3pm
  Humidity9am   Humidity3pm   Pressure9am   Pressure3pm

Cluster Means

A model from a k-means clustering point of view consists of ten (because ten clusters is the default) vectors of the mean values for each of the variables. The main content of the text view is a list of these means. We only show the first five variables and only eight of the ten clusters:

Cluster centers:
   MinTemp MaxTemp Rainfall Evaporation Sunshine

Model Quality

The means are followed by a simple measure of the quality of the model:

Within cluster sum of squares:
 [1]
[10]

The measure used is the sum of the squares of the differences between the observations within each of the ten clusters.

Time Taken

Finally, we see how long the k-means algorithm took to build the ten clusters. For such a small dataset, very little time is required. The time taken is the amount of CPU time spent on the task:

Time taken: 0.00 secs

Tuning Options

The basic tuning option for building a k-means model in Rattle is simply the Number of clusters that are to be generated. The default is 10, but any positive integer greater than 1 is allowed.

Rattle also provides an option to iteratively build more clusters and measure the quality of each resulting model as a guide to how many clusters to build. This is chosen by enabling the Iterate Clusters option. When active, a model with two clusters, then a model with three clusters, and so on up to a model with ten (or as many as specified) clusters will be built. A plot is generated and displayed to report the improvement in the quality measure (the sum of the within cluster sum of squares).

As mentioned previously, the Seed option allows different starting points to be used for the heuristic search. Each time a different seed is used, the resulting model will usually be different. For some datasets, differences between the models using different seeds will often not be too large, though for others they might be quite large. In the latter case, we are finding different, possibly less optimal or perhaps equally optimal models each time.

The Runs option will repeat the model building the specified number of times and choose the model that provides the best performance against the measure of model quality. For each different seed, we can check the list of cluster sizes to confirm that we obtain a collection of clusters that are about the same sizes each time, though the order in the listing changes.

Once a model has been built, the Stats, Data Plot, and Discriminant Plot buttons become available. Clicking the Stats button will result in quite a few additional cluster statistics being displayed in the text view. These can all participate in determining the quality of the model and comparing one k-means model against another. The Data Plot and the Discriminant Plot buttons result in plots that display how the clusters are distributed across the data. The discriminant coordinates plot is generated by projecting the original data to display the key differences between clusters, similar to principal components analysis. The plots are probably only useful for smaller datasets (in the hundreds or thousands).

The Rattle user interface also provides access to the Clara, Hierarchical, and BiCluster clustering algorithms. These are not covered here.

Building a Model Using R

The primary function used within R for k-means clustering is kmeans(), which comes standard with R. We can build a k-means cluster model using the encapsulation idea presented in Section 2.9:

> weatherDS <- new.env()

From the weather dataset, we will select only two numeric variables on which to cluster, and we also ignore the output variable RISK_MM:

> library(rattle)
> evalq({
    data <- weather
    nobs <- nrow(data)
    vars <- c("MinTemp", "MaxTemp")  # the two variables on which to cluster
  }, weatherDS)

We now create a model container to store the results of the modelling and build the actual model. The container also includes the weatherDS dataset information.

> weatherKMEANS <- new.env(parent=weatherDS)
> evalq({
    model <- kmeans(x=na.omit(data[, vars]), centers=10)
  }, weatherKMEANS)

We have used kmeans() and passed to it a dataset with any observations having missing values omitted. The function otherwise complains if the data contains missing values, as we might expect when using a distance measure. The centers= option is used either to specify the number of clusters or to list the starting points for the clustering.

9.5 Discussion

Number of Clusters

The primary tuning parameter for the k-means algorithm is the number of clusters, k. Simply because the default is to identify ten clusters does not mean that 10 is a good choice at all. Choosing the number of clusters is often quite a tricky exercise. Sometimes it is a matter of experimentation, and other times we might have some other knowledge to help us decide.

We will soon note that the larger the number of clusters relative to the size of the sample, the smaller our clusters will generally be. However, a common observation is that often we might end up with a small number of clusters containing most of the observations and a large number of clusters containing only a few observations each.

We also note that different cluster algorithms (and even simply using different random seeds to initiate the clustering) can result in different (and sometimes very different) clusters. How much they differ is a measure of the stability of the clustering.

Rattle provides an Iterate Clusters option to assist with identifying a good number of clusters. The approach is to iterate through different values of k. For each k, we observe the sum of the within cluster sum of squares. A plot is generated to show both the sum of squares and the change in the sum of squares. A heuristic is to choose the number of clusters where we see the largest drop in the sum of the within cluster sum of squares.
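The Iterate Clusters behaviour is easy to reproduce directly in R, building a model for each candidate k and plotting the quality measure. A minimal sketch, using the four numeric variables from Section 9.1:

> library(rattle)
> vars <- c("MinTemp", "MaxTemp", "Rainfall", "Evaporation")
> ds <- na.omit(weather[vars])
> wss <- sapply(2:10, function(k)
                kmeans(ds, centers=k, nstart=10)$tot.withinss)
> plot(2:10, wss, type="b", xlab="Number of Clusters",
       ylab="Within Cluster Sum of Squares")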

Shape of Clusters

One of the characteristics to distinguish between clustering algorithms is the shape of the resulting clusters. Essentially, the k-means algorithm, as with any algorithm that uses the distance to a mean as the representation of the clusters, produces convex clusters. Other clustering algorithms exist that can produce differently shaped clusters that might better reflect the data.

Other Cluster Algorithms

R supports a very large variety of clustering algorithms besides the k-means algorithm we have described here. They are grouped into the partitioning type of algorithms, of which k-means is one example, model-based algorithms (see mclust (Fraley and Raftery, 2006), for example), and hierarchical clustering (see hclust() from stats and agnes() for agglomerative clustering, and diana() for divisive clustering from cluster (Maechler et al., 2005)).

Rattle supports the building of hierarchical clusters using hclust(). Such an algorithm builds the clusters iteratively and hierarchically. For an agglomerative hierarchical approach, the two closest observations form the first cluster. Then the next two closest observations, but now also including the mean of the first cluster as a combined observation, form the second cluster, and so on until we have formed a single cluster. The resulting collection of potential clusters can be drawn using a dendrogram, as shown in Figure 9.4. An advantage of this approach is that we get a visual cue as to the number of clusters that naturally appear in the data. In Figure 9.4 we have drawn boxes to indicate perhaps three clusters. A disadvantage is that this approach is really only useful for a small dataset.

Recent research has explored the issue of very high dimensional data, or data with very many variables. For such data the k-means algorithm performs rather poorly, as all observations essentially become equidistant from each other. A successful approach has been developed (Jing et al., 2007) using a weighted distance measure. The algorithm essentially chooses only subsets of the variables on which to cluster. This has been referred to as subspace clustering. The siatclust (Williams et al., 2011) package provides an implementation of this modification to the k-means algorithm. Entropy weighted variable selection through ewkm() is used to improve the clustering performance with high dimensional data.
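A dendrogram like that of Figure 9.4 can be generated with a few lines of R. A sketch using hclust() on a small subset of the weather dataset (keeping in mind that the approach suits only smaller datasets):

> library(rattle)
> vars <- c("MinTemp", "MaxTemp", "Rainfall", "Evaporation")
> hc <- hclust(dist(na.omit(weather[1:40, vars])), method="average")
> plot(hc)   # plot() on an hclust object draws the dendrogram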

Figure 9.4: A sample dendrogram showing three clusters.

9.6 Command Summary

This chapter has referenced the following R packages, commands, functions, and datasets:

agnes()      function   An agglomerative clustering algorithm.
cluster      package    A variety of tools for cluster analysis.
diana()      function   A divisive clustering algorithm.
ewkm()       function   Entropy weighted k-means.
evalq()      function   Access environment for storing data.
grid()       command    Add a grid to a plot.
hclust()     function   A hierarchical clustering algorithm.
kmeans()     function   The k-means clustering algorithm.
mean()       function   Calculate the mean values.

plot()       command    Draw a dendrogram for an hclust object.
round()      function   Round numbers to specific digits.
set.seed()   command    Reset random sequence for sampling.
siatclust    package    Weighted and subspace k-means.
stats        package    Base package providing k-means.
weather      dataset    Sample dataset from rattle.

Chapter 10

Association Analysis

Many years ago, a number of new Internet businesses were created to sell books on-line. Over time, they collected information about the books that each of their customers were buying. Using association analysis, they were able to identify groups of books that customers with similar interests seem to have been buying. Using this information, they were able to develop recommendation systems that informed their customers that other customers who purchased some book of interest also purchased other related books. The customer would often find such recommendations quite useful.

Association analysis identifies relationships or correlations between observations and/or between variables in our datasets. These relationships are then expressed as a collection of so-called association rules. The approach has been particularly successful in mining very large transactional databases, like shopping baskets and on-line customer purchases. Association analysis is one of the core techniques of data mining.

For the on-line bookselling example, historic data is used to identify, for example, that customers who purchased two particular books also tended to purchase another particular book. The historic data might indicate that the first two books are purchased by only 0.5% of all customers. But 70% of these then also purchase the third book. This is an interesting group of customers.

As a business, we will take advantage of this observation by targeting advertising of the third book to those customers who have purchased both of the other books.

The usual type of input data for association analysis consists of transactional data, such as the items in a shopping basket at a supermarket, books and videos purchased by a single client, or medical treatments and tests received by patients. We are interested in the items whose co-occurrence within a transaction is of interest.

This short chapter introduces the basic concepts of association analysis and how to perform it in both Rattle and R.

10.1 Knowledge Representation

A representation of association rules is required to identify relationships between items within transactions. Suppose each transaction is thought of as a basket of items (which we might represent as {A, B, C, D, E, F}). The aim is to identify collections of items that appear together in multiple baskets (e.g., perhaps the items {A, C, F} appear together in quite a few shopping baskets). From these so-called itemsets (i.e., sets of items) we identify rules like A, F → C that tell us that when A and F appear in a transaction (e.g., a shopping basket) then typically so does C. A collection of association rules then represents a model as the outcome of association analysis.

The general format of an association rule is A → C. Both A (the left hand side or antecedent) and C (the right side or consequent) are sets of items. Generally we think of items as being particular books, for example, or particular grocery items. Examples might be:

    milk → bread,
    beer & nuts → potato crisps,
    Shrek1 → Shrek2 & Shrek3.

The concept of an item can be generalised to a specific variable/value combination as the item. The concept of association analysis can then be applied to many different datasets. Using our weather dataset, for example, this representation will lead to association rules like

    WindDir3pm = NNW → RainToday = No.

10.2 Search Heuristic

The basis of an association analysis algorithm is the generation of frequent itemsets. A frequent itemset is a set of items that occur together frequently enough to be considered as a candidate for generating association rules.

The obvious approaches to identifying itemsets that appear frequently enough in the data are quite expensive computationally, even with moderately sized datasets. The apriori algorithm takes advantage of the simple observation that all subsets of a frequent itemset must also be frequent. That is, if {milk, bread, cheese} is a frequent itemset, then so must each of the smaller itemsets, {milk, bread}, {milk, cheese}, {bread, cheese}, {milk}, {bread}, and {cheese}.

This observation allows the algorithm to consider a significantly reduced search space by starting with frequent individual items. This first step eliminates very rare items. We then combine the remaining single items into itemsets containing just two items and retain only those that are frequent enough, and similarly for itemsets containing three items and so on.

The concept of frequent enough is a parameter of the algorithm used to control the number of association rules discovered. This is called the support and specifies how frequently the items must appear in the whole dataset before they can be considered as a candidate association rule. For example, the user may choose to consider only sets of items that occur in at least 5% of all transactions.

The second phase of the algorithm considers each of the frequent itemsets and for each generates all possible combinations of association rules. Thus, for an itemset containing three items {milk, bread, cheese}, the following are among the possible association rules that will be considered: bread & milk → cheese, milk → bread & cheese, cheese & milk → bread, and so on.

The actual association rules that we retain are those that meet a criterion called confidence. The confidence calculates the proportion of transactions containing A that also contain C. The confidence specifies a minimal probability for the association rule.

For example, the user may choose to generate only rules that are true at least 90% of the time (that is, when A appears in the basket, C also appears in the same basket at least 90% of the time).

The apriori algorithm is a breadth-first or generate-and-test type of search algorithm. Only after exploring all of the possibilities of associations containing k items does it then consider those containing k + 1 items. For each k, all candidates are tested to determine whether they have enough support.

In summary, the algorithm uses a simple two-phase generate-and-merge process. In phase 1, we generate frequent itemsets of size k (iterating from 1 until we have no frequent k-itemsets) and then combine them to generate candidate frequent itemsets of size k + 1. In phase 2, we build candidate association rules.

10.3 Measures

The two primary measures used in association analysis are the support and the confidence.

The minimum support is expressed as a percentage of the total number of transactions in the dataset. Informally, it is simply how often the items appear together from amongst all of the transactions. Formally, we define support for a collection of items I as the proportion of all transactions in which all items in I appear, and express the support for an association rule as

    support(A → C) = P(A ∪ C).

Typically, we use small values for the support, since overall the items that appear together frequently enough that are of interest generally won't be the obvious ones that regularly appear together.

The minimum confidence is also expressed as a proportion of the total number of transactions in the dataset. Informally, it is a measure of how often the items C appear whenever the items A appear in a transaction. Formally, it is a conditional probability:

    confidence(A → C) = P(C | A) = P(A ∪ C)/P(A).

It can also be expressed in terms of the support:

    confidence(A → C) = support(A → C)/support(A).

Typically, this measure will have larger values, since we are looking for the association rules that are quite strong, so that if we find the items in A in a transaction then there is quite a good chance of also finding C in the transaction.

There are a collection of other measures that are used with association rule analysis. One that is used in R, and hence Rattle, is the lift. The lift is the increased likelihood of C being in a transaction if A is included in the transaction. It is calculated as

    lift(A → C) = confidence(A → C)/support(C).

Another measure is the leverage, which captures the fact that a higher frequency of A and C with a lower lift may be interesting:

    leverage(A → C) = support(A → C) - support(A) × support(C).

10.4 Tutorial Example

Two types of association rules were identified above, corresponding to the type of data made available. The simplest case, known as market basket analysis, is when we have a transaction dataset that records just a transaction identifier. The identifier might identify a single shopping basket containing multiple items from shopping or a particular customer or patient and their associated purchases or medical treatments over time. A simple example of a market basket dataset might record the purchases of DVDs by customers (three customers in this case):

ID, Item
1, Sixth Sense
1, LOTR1
1, Harry Potter1
1, Green Mile
1, LOTR2
2, Gladiator
2, Patriot
2, Braveheart
3, LOTR1
3, LOTR2
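For this small dataset, the measures from Section 10.3 can be computed by hand. A sketch for the rule LOTR1 → LOTR2:

> baskets <- list(c("Sixth Sense", "LOTR1", "Harry Potter1",
                    "Green Mile", "LOTR2"),
                  c("Gladiator", "Patriot", "Braveheart"),
                  c("LOTR1", "LOTR2"))
> # Proportion of baskets containing all of the given items.
> has <- function(items)
    mean(sapply(baskets, function(b) all(items %in% b)))
> supp <- has(c("LOTR1", "LOTR2"))   # support    = 2/3
> conf <- supp / has("LOTR1")        # confidence = 1.0
> lift <- conf / has("LOTR2")        # lift       = 1.5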

The resulting model will then be a collection of association rules that might include LOTR1 → LOTR2.

The second form of association rule uses a dataset that we are more familiar with. This approach treats each observation as a transaction and the variables as the items in the shopping basket. Considering the weather dataset, we might obtain models that include rules of the form

    Humidity3pm = High & Pressure3pm = Low → RainToday = Yes.

Both forms are supported in Rattle and R.

Building a Model Using Rattle

Rattle builds association rule models through the Associate tab. The two types of association rules are supported, and the appropriate type is chosen using the Baskets check button. If the button is checked, then Rattle will use the Ident and Target variables for the analysis, performing a market basket analysis. If the button is not checked, then Rattle will use the Input variables for a rules analysis.

For a basket analysis, the data is thought of as representing shopping baskets (or any other type of collection of items, such as a basket of medical tests, a basket of medicines prescribed to a patient, a basket of stocks held by an investor, and so on). Each basket has a unique identifier, and the variable specified as an Ident variable on the Data tab is taken as the identifier of a shopping basket. The contents of the basket are then the items contained in the column of data identified as the Target variable. For market basket analysis, these are the only two variables used.

To illustrate market basket analysis with Rattle, we can use a very simple and trivial dataset consisting of the DVD movies purchased by customers. The data is available as a CSV file (named dvdtrans.csv) from the Rattle package. The simplest way to load this dataset into Rattle is to first load the default sample weather dataset from the weather.csv file into Rattle. We do this by clicking the Execute button on starting Rattle. Then click the Filename button (which will now be showing weather.csv) to list the contents of Rattle's sample CSV folder. Choose dvdtrans.csv and click Open and then Execute.

The ID variable will automatically be chosen as the Ident, but we will need to change the role of Item to be Target, as in Figure 10.1.

Figure 10.1: Choose the dvdtrans.csv file and load it into Rattle with a click of the Execute button. Then set the role for Item to be Target and click Execute for the new role to be noted.

On the Associate tab, ensure that the Baskets button is checked. Click the Execute button to build a model that will consist of a collection of association rules. Figure 10.2 shows the resulting text view, which we now review.

The first few lines of the text view list the number of association rules that make up the model. In our example, there are 127 rules:

Summary of the Apriori Association Rules:
Number of Rules: 127

The next code block reports on the distribution of the three measures as found for the 127 rules of the model:

Summary of the Measures of Interestingness:

    support         confidence          lift
 Min.   :0.100   Min.   :0.100    Min.   :
 1st Qu.:        1st Qu.:         1st Qu.:
 Median :0.100   Median :1.000    Median :
 Mean   :0.145   Mean   :0.759    Mean   :
 3rd Qu.:        3rd Qu.:         3rd Qu.:
 Max.   :0.700   Max.   :1.000    Max.   :10.000

Figure 10.2: Building an association rules model.

The 127 association rules met the criteria of having a minimum support of 0.1 and a minimum confidence of 0.1. Across the rules, the support ranges from 0.1 up to 0.4. Confidence ranges from 0.1 up to 1.0, and lift from 0.83 up to 10.0.

This section is followed by a summary of the process of building the model. It begins with a review of the options supplied or the default values for the various parameters. We can see confidence= and support= listed:

Summary of the Execution of the Apriori Command:

parameter specification:
 confidence minval smax arem  aval originalSupport
                          none FALSE           TRUE
 support minlen maxlen target   ext
                         rules FALSE

These options are tunable through the Rattle interface. Others can be tuned directly through R. A set of parameters that control how the algorithm itself operates is then displayed:

algorithmic control:
 filter tree heap memopt load sort verbose
    0.1 TRUE TRUE  FALSE TRUE    2    TRUE

The final section includes detailed information about the algorithm and the model that has been built:

apriori - find association rules with the apriori algorithm
version 4.21 (c) Christian Borgelt
set item appearances ...[0 item(s)] done [0.00s].
set transactions ...[10 item(s), 10 trans] done [0.00s].
sorting and recoding items ... [10 item(s)] done [0.00s].
creating transaction tree ... done [0.00s].
checking subsets of size ... done [0.00s].
writing ... [127 rule(s)] done [0.00s].
creating S4 object ... done [0.00s].

The Show Rules button will show all of the association rules for the model in the text view window, sorted by the level of confidence in the rule. The top five rules will be:

  lhs                rhs              supp conf lift
1 {Harry Potter2} => {Harry Potter1}   0.1  1.0
2 {Braveheart}    => {Patriot}         0.1  1.0
3 {Braveheart}    => {Gladiator}       0.1  1.0
4 {LOTR}          => {Green Mile}      0.1  1.0
5 {LOTR}          => {Sixth Sense}     0.1  1.0

These rules have only a single item on each side of the arrow, and all have a support of 0.1 and a confidence of 1. We can see that for either of the first two movies there is quite a large lift obtained.

Building a Model Using R

arules (Hahsler et al., 2011) provides apriori() for R.

The package provides an interface to the widely used, and freely available, apriori software from Christian Borgelt. This software was, for example, commercially licensed for use in the Clementine data mining package (Clementine became an SPSS product and was then purchased by IBM to become IBM SPSS Modeler) and is a well-developed and respected implementation.

When loading a dataset to process with apriori(), it needs to be converted into a transaction data structure. Consider a dataset with two columns, one being the identifier of the basket and the other being an item contained in the basket, as is the case for the dvdtrans.csv data. We can load that data into R:

> library(arules)
> library(rattle)
> dvdtrans <- read.csv(system.file("csv", "dvdtrans.csv",
                                   package="rattle"))
> dvdds <- new.env()
> dvdds$data <- as(split(dvdtrans$Item, dvdtrans$ID),
                   "transactions")
> dvdds$data
transactions in sparse format with
 10 transactions (rows) and
 10 items (columns)

We can then build the model using this transformed dataset:

> dvdapriori <- new.env(parent=dvdds)
> evalq({
    model <- apriori(data, parameter=list(support=0.2,
                                          confidence=0.1))
  }, dvdapriori)

The rules can be extracted and ordered by confidence using inspect(). In the following code block, we also use [1:5] to limit the display to just the first five association rules. We notice that the first two are symmetric, which is expected, since everyone who purchases one of these movies always also purchases the other.

> inspect(sort(dvdapriori$model, by="confidence")[1:5])
  lhs               rhs            support confidence lift
1 {LOTR1}        => {LOTR2}
2 {LOTR2}        => {LOTR1}
3 {Green Mile}   => {Sixth Sense}
4 {Patriot}      => {Gladiator}
5 {Patriot,
   Sixth Sense}  => {Gladiator}

10.5 Command Summary

This chapter has referenced the following R packages, commands, functions, and datasets:

apriori()   function   Build an association rule model.
arules      package    Support for association rules.
inspect()   function   Display results of model building.
weather     dataset    Sample dataset from rattle.


Chapter 11

Decision Trees

Decision trees (also referred to as classification and regression trees) are the traditional building blocks of data mining and the classic machine learning algorithm. Since their development in the 1980s, decision trees have been the most widely deployed machine-learning based data mining model builder. Their attraction lies in the simplicity of the resulting model, where a decision tree (at least one that is not too large) is quite easy to view, understand, and, importantly, explain. Decision trees do not always deliver the best performance, and represent a trade-off between performance and simplicity of explanation. The decision tree structure can represent both classification and regression models.

We introduce the decision tree as a knowledge representation language in Section 11.1. A search algorithm for finding a good decision tree is presented in Section 11.2. The measures used to identify a good tree are discussed in Section 11.3. Section 11.4 then illustrates the building of a decision tree in Rattle and directly through R. The options for building a decision tree are covered in Section 11.5.

11.1 Knowledge Representation

The tree structure is used in many different fields, such as medicine, logic, problem solving, and management science. It is also a traditional computer science structure for organising data. We generally present the tree upside down, with the root at the top and the leaves at the bottom. Starting from the root, the tree splits from the single trunk into two or more branches. Each branch itself might further split into two or more branches. This continues until we reach a leaf, which is a node that is not further split. We refer to the split of a branch as a node of the tree. The root and leaves are also referred to as nodes.

A decision tree uses this traditional structure. It starts with a single root node that splits into multiple branches, leading to further nodes, each of which may further split or else terminate as a leaf node. Associated with each nonleaf node will be a test or question that determines which branch to follow. The leaf nodes contain the decisions.

Consider the decision tree drawn on page 205 (which is the same tree as Figure 2.5 on page 30). This represents knowledge about observing weather conditions one day and the observation of rain on the following day. The No and Yes values at the leaves of the decision tree represent the decisions.

The root node of the example decision tree tests the mean sea level pressure at 3 pm (Pressure3pm). When this variable, for an observation, has a value greater than or equal to 1012 hPa, then we will continue down the left side of the tree. The next test down this left side of the tree is on the amount of cloud cover observed at 3 pm (Cloud3pm). If this is less than 8 oktas (i.e., anything but a fully overcast sky), then it is observed that on the following day it generally does not rain (No). If we observe that it is overcast today at 3 pm (i.e., Cloud3pm is 8 oktas, the maximum value of this variable; see Section 5.2.9, page 127), then generally we observe that it rains the following day (Yes). Thus we would be inclined to think that it might rain tomorrow if we observe these same conditions today.

Resuming our interpretation of the model from the root node of the tree, if Pressure3pm is less than 1012 hPa and Sunshine is greater than or equal to 9 (i.e., we observe at least 9 hours of sunshine during the day), then we do not expect to observe rain tomorrow. If we record 9 or less hours of sunshine, then we expect it to rain tomorrow.

The decision tree is a very convenient and efficient representation of knowledge. Generally, models expressed in one language can be translated to another language, and so it is with a decision tree. One simple and useful translation is into a rule set. The decision tree above translates to the following rules, where each rule corresponds to one pathway through the decision tree, starting at the root node and terminating at a leaf node:

Rule number: 7 [RainTomorrow=Yes cover=27 (11%) prob=0.74]
  Pressure3pm< 1012
  Sunshine< 8.85

Rule number: 5 [RainTomorrow=Yes cover=9 (4%) prob=0.67]
  Pressure3pm>=1012
  Cloud3pm>=7.5

Rule number: 6 [RainTomorrow=No cover=25 (10%) prob=0.20]
  Pressure3pm< 1012
  Sunshine>=8.85

Rule number: 4 [RainTomorrow=No cover=195 (76%) prob=0.05]
  Pressure3pm>=1012
  Cloud3pm< 7.5

A rule representation has its advantages. In reviewing the knowledge that has been captured, we can consider each rule separately rather than being distracted by the more complex structure of a large decision tree. It is also easy to see how each rule could be translated into a programming language statement like R, Python, C, VisualBasic, or SQL. The structure is as simple, and clear, as an If-Then statement.

We now explain the information provided for each rule. In building a decision tree, often a larger tree is built and then cut back (or pruned) so that it is not so complex and also to improve its accuracy. As a consequence, we will often see node numbers (and rule numbers) that are not sequential. The node numbers do not have any specific meaning other than as a reference.
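For example, Rule number 7 above corresponds directly to an If-Then statement in R. A sketch, assuming obs is a single-row data frame holding today's weather observation (this is illustrative, not code generated by Rattle):

> if (obs$Pressure3pm < 1012 && obs$Sunshine < 8.85)
      prediction <- "Yes"   # rain expected tomorrow (probability 0.74)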

Although it is not shown in the tree representation at the beginning of the chapter, we see in the rules above the probabilities that are typically recorded for each leaf node of the decision tree. The probabilities can be used to provide an indication of the strength of the decision we derive from the model. Thus, Rule number 7 indicates that for 74% of the observations (prob=0.74), when the observed pressure at 3 pm is less than 1012 hPa and the hours of sunshine are less than 8.85 hours, there is rainfall recorded on the following day (RainTomorrow=Yes). The other information provided with the rule is that 27 observations from the training dataset (i.e., 11% of the training dataset observations) are covered by this rule; they satisfy the two conditions.

There exist variations to the basic decision tree structure we have presented here for representing knowledge. Some approaches, as here, limit trees to two splits at any one node to generate a binary decision tree. For categoric data this might involve partitioning the values (levels) of the variable into two groups. Another approach is to have a branch corresponding to each of the levels of a categoric variable. From a representation point of view, what can be represented using a multiway tree can also be represented as a binary tree and vice versa. Other variations, for example, allow multiple variables to be tested at a node. We generally stay with the simpler representation, though, sometimes at the cost of the resulting model being a little more complex than if we used a more complex decision tree structure.

11.2 Algorithm

Identifying Alternative Models

The decision tree structure, as described above, is the language we use to express our knowledge. A sentence (or model) in this language is a particular decision tree. For any dataset, there will be very many, or even infinite, possible decision trees (sentences).

Consider the simple decision tree discussed above. Instead of the variable Pressure3pm being tested against the value 1012, it could have been tested against the value 1011, or 1013, or 1020, etc. Each would, when the rest of the tree has been built, represent a different sentence in the language, representing a slightly different capture of the knowledge.

There are very many possible values to choose from for just this one variable, even before we begin to consider values for the other variables that appear in the decision tree. Alternatively, we might choose to test the value of a different variable at the root node (or any other node). Perhaps we could test the value of Humidity3pm instead of Pressure3pm. This again introduces a large collection of alternative sentences that we might generate within the constraints of the language we have defined. Each sentence is a candidate for the capture of knowledge that is consistent with the observations represented in our training dataset.

As we saw in Section 8.2, this wealth of possible sentences presents a challenge: which is the best sentence, or equivalently, which is the best model that fits the data? Our task is to identify the sentence (or perhaps sentences) that best captures the knowledge that can be obtained from the observations that we have available to us.

We generally have an infinite collection of possible sentences to choose from. Enumerating every possible sentence, and testing whether it is a good model, will generally be too computationally expensive. This could well involve days, weeks, months, or even more of our computer time. Our task is to use the observations (the training dataset) to narrow down this search task so that we can find a good model in a reasonable amount of time.

Partitioning the Dataset

The algorithm that has been developed for decision tree induction is referred to as the top-down induction of decision trees, using a divide-and-conquer, or recursive partitioning, approach. We will describe the algorithm intuitively.

We continue here with the weather dataset to describe the algorithm. The distribution of the observations, with respect to the target variable RainTomorrow, is of particular interest. There are 66 observations that have the target as Yes (18%) and 300 observations with No (82%).

We want to find any input variable that can be used to split the dataset into two smaller datasets. The goal is to increase the homogeneity of each of the two datasets with respect to the target variable. That is, for one of the datasets, we would be looking for it to have an increased proportion of observations with Yes, and so the other dataset would have an increased proportion of observations with No.

We might, for example, decide to construct a partition of the original dataset using the variable Sunshine with a split value of 9. Every observation that has a value of Sunshine less than 9 goes into one subset and those remaining (with Sunshine greater than or equal to 9) into a second subset. These new datasets will have 201 and 162 observations, respectively (noting that three observations have missing values for this variable).

Now we consider the proportions of Yes and No observations within the two new datasets. For the subset of observations with Sunshine less than 9, the proportions are 28% Yes and 72% No. For the subset of observations with Sunshine greater than or equal to 9, the proportions are 5% Yes and 95% No.

By splitting on this variable, we have made an improvement in the homogeneity of the target variable values. In particular, the right dataset (Sunshine >= 9) results in a collection of observations that are very much in favour of no rain on the following day (95% No). This is what we are aiming to do. It allows us to observe that when the amount of sunshine on any day is quite high (i.e., at least 9 hours), then there is very little chance of rain on the following day (only a 5% chance based on our observations from the particular weather station).

The story for the other dataset is not quite so clear. The proportions have certainly changed, with a higher proportion of Yes observations than the original dataset, but the No observations still outnumber the Yes observations. Nonetheless, we can say that when we observe Sunshine < 9 there is an increased likelihood of rain the following day based on our historic observations. There is a 28% chance of rain compared with 18% over all observations.

Choosing the value 9 for the variable Sunshine is just one possibility from amongst very many choices. If we had chosen the value 5 for the variable Sunshine, we would have two new datasets with the Yes/No proportions 41%/59% and 12%/88%. Choosing a different variable altogether (Cloud3pm) with a split of 6, we would have two new datasets with the Yes/No proportions 8%/92% and 34%/66%. Another choice might be Pressure3pm with a split of 1012. This gives the Yes/No proportions as 47%/53% and 10%/90%.
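These proportions are easily verified directly in R. A sketch for the Sunshine split of 9, using the weather dataset from rattle:

> library(rattle)
> left <- weather$Sunshine < 9
> round(100 * prop.table(table(weather$RainTomorrow[left])))   # about 28% Yes
> round(100 * prop.table(table(weather$RainTomorrow[!left])))  # about 5% Yes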

training dataset: which of these is the best split? We come back to answer that question formally in Section 11.3. For now, we assume we choose one of them. With whichever choice we make, the result is that we now have two new smaller datasets.

Recursive Partitioning

The process is now repeated again separately for the two new datasets. That is, for the left dataset above (observations having Sunshine < 9), we consider all possible variables and splits to partition that dataset into two smaller datasets. Independently, for the right dataset (observations having Sunshine >= 9), we consider all possible variables and splits to partition that dataset into two smaller datasets as well. Now we have four even smaller datasets, and the process continues. For each of the four datasets, we again consider all possible variables and splits, choosing the best at each stage, partitioning the data, and so on, repeating the process until we decide that we should stop. In general, we might stop when we run out of variables, run out of data, or when partitioning the dataset does not improve the proportions or the outcome.

We can see now why this process is called divide-and-conquer or recursive partitioning. At each step, we have identified a question that we use to partition the data. The resulting two datasets then correspond to the two branches of the tree emanating from that node. For each branch, we identify a new question and partition appropriately, building our representation of the knowledge we are discovering from the data. We continually divide the dataset and conquer each of the smaller datasets more easily. We are also repeatedly partitioning the dataset and applying the same process, independently, to each of the smaller datasets; thus it is recursive partitioning.

At each stage of the process, we make a decision as to the best variable and split to partition the data. That decision may not be the best to make in the overall context of building this decision tree, but once we make that decision, we stay with it for the rest of the tree. This is generally referred to as a greedy approach.
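The candidate partitions discussed above are easy to check directly in R. A minimal sketch for the Sunshine split at 9 (subset() silently drops the three observations with a missing Sunshine):

> library(rattle)
> left  <- subset(weather, Sunshine <  9)   # 201 observations
> right <- subset(weather, Sunshine >= 9)   # 162 observations
> round(100 * prop.table(table(left$RainTomorrow)))   # about 72% No, 28% Yes
> round(100 * prop.table(table(right$RainTomorrow)))  # about 95% No,  5% Yes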

A greedy algorithm is generally quite efficient, whilst possibly sacrificing our opportunity to find the very best decision tree. There remains quite a bit of searching for the one variable and split point for each of the datasets we produce. However, this heuristic approach reduces our search space considerably by fixing the variable/split once it has been chosen.

11.3 Measures

In describing the basic algorithm above, it was indicated that we need to measure how good a particular partition of the dataset is. Such a measure will allow us to choose from amongst a collection of possibilities. We now consider how to measure the different splits of the dataset.

Information Gain

Rattle uses an information gain measure for deciding between alternative splits. The concept comes from information theory and uses a formulation of the concept of entropy from physics (i.e., the concept of the amount of disorder in a system). We discuss the concepts here in terms of a binary target variable, but the concept generalises to multiple classes and even to numeric target variables for regression tasks.

For our purposes, the concept of disorder relates to how mixed our dataset is with respect to the values of the target variable. If the dataset contains only observations that all have the same value for the target variable (e.g., it contains only observations where it rains the following day), then there is no disorder, i.e., no entropy or zero entropy. If the two values of the target variable are equally distributed across the observations (i.e., 50% of the dataset are observations where it rains tomorrow and the other 50% are observations where it does not rain tomorrow), then the dataset contains the maximum amount of disorder. We identify the maximum amount of entropy as 1. Datasets containing different mixtures of the values of the target variable will have a measure of entropy between 0 and 1.

From an information theory perspective, we interpret a measure of 0 (i.e., an entropy of 0) as indicating that we need no further information in order to classify a specific observation within the dataset; all observations belong to the same class. Conversely, a measure of 1 suggests we need the maximal amount of extra information in order to classify our

observations into one of the two available classes. If the split between the observations where it rains tomorrow and where it does not rain tomorrow is not 50%/50% but perhaps 75%/25%, then we need less extra information in order to classify our observations; the dataset already contains some information about which way the classification is going to go. Like entropy, our measure of required information is thus between 0 and 1.

In both cases, we will use the mathematical logarithm function for base 2 (log2) to transform our proportions (the proportions being 0.5, 0.75, 1.00, etc.). Base 2 is chosen since we use binary digits (bits) to encode information. However, we can use any base, since in the end it is the relative measure, rather than the exact measure, that we are interested in, and the logarithm functions have identical behaviour in this respect. The default R implementation (as we will see in Section 11.4) uses the natural logarithm, for example.

The formula we use to capture the entropy of a dataset, or equivalently the information needed to classify an observation, is

    info(D) = -p log2(p) - n log2(n)

where p is the proportion of positive observations and n the proportion of negative observations in the dataset D. We now delve into the nature of this formula to understand why this is a useful measure.

We can easily plot this function, as in Figure 11.1, with the x-axis showing the possible values of p and the y-axis showing the values of info. From the plot, we can see that the maximum value of the measure is 1. This occurs when there is the most amount of disorder in the data, or when the most amount of additional information is required to classify an observation. This occurs when the observations are equally distributed across the values of the target variable. For a binary target, as here, this occurs when p = 0.5 and n = 0.5.

Likewise, the minimum value of the measure is 0. This occurs at the extremes, where p = 1 (i.e., all observations are positive; RainTomorrow has the value Yes for each) or p = 0 (i.e., all observations are negative; RainTomorrow has the value No for each). This is interpreted as either no entropy or as requiring no further information in order to classify the observations.

This then provides a mechanism for measuring some aspect of the training dataset, capturing something about the knowledge content. As

we now see, we use this formulation to help choose the best split from among the very many possible splits we identified in Section 11.2.

Figure 11.1: Plotting the relationship between the proportion of positive observations in the data and the measure of information/entropy.

Each choice of a split results in a binary partition of the training dataset. We will call these D1 and D2, noting that D = D1 ∪ D2. The information measure can be applied to each of these subsets to give I1 and I2. If we add these together, weighted by the sizes of the two subsets, we get a measure of the combined information, or entropy:

    info(D, S) = |D1|/|D| * I1 + |D2|/|D| * I2

Comparing this with the original information, or entropy, we get a measure of the gain in knowledge obtained by using the particular split point:

    gain(D, S) = info(D) - info(D, S)
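These measures are easily explored directly in R. A minimal sketch, defining info() for a single dataset (with p the proportion of positive observations, and taking 0 log2(0) as 0 so the pure extremes evaluate to zero) and gain() for a candidate split, then checking the Sunshine >= 9 split from Section 11.2:

> info <- function(p)
  {
    n <- 1 - p
    plog2p <- ifelse(p == 0, 0, p * log2(p))
    nlog2n <- ifelse(n == 0, 0, n * log2(n))
    -plog2p - nlog2n
  }
> info(0.5)                         # maximum disorder
[1] 1
> info(c(0, 0.75, 1))               # pure nodes have zero entropy
[1] 0.0000000 0.8112781 0.0000000
> gain <- function(p, p1, n1, p2, n2)
  {
    # p: proportion of Yes in D; p1, p2: proportions in D1 and D2;
    # n1, n2: the number of observations in D1 and D2.
    info(p) - (n1*info(p1) + n2*info(p2))/(n1 + n2)
  }
> gain(0.18, 0.28, 201, 0.05, 162)  # roughly 0.08 bits gained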

This can then be calculated for each of the possible splits. The split that provides the greatest gain in information (and equivalently the greatest reduction in entropy) is the split we choose.

Other Measures

A variety of measures can be used as alternatives to the information measure. The most common alternative is the Gini index of diversity. This was introduced into decision tree building through the original CART (classification and regression tree) algorithm (Breiman et al., 1984). The plot of the function is very similar to the -p log2(p) - n log2(n) curve and typically will give the same split points.

11.4 Tutorial Example

The weather dataset is used to illustrate the building of a decision tree. We saw our first decision tree in Chapter 2. We can build a decision tree using Rattle's Tree option, found on the Model tab, or directly in R through rpart() of rpart (Therneau and Atkinson, 2011).

Building a Model Using Rattle

We build a decision tree using Rattle's Model tab's Tree option. After loading our dataset and identifying the Input variables and the Target variable, an Execute of the Model tab will result in a decision tree. We can see the result for the weather dataset in Figure 11.2, which shows the resulting tree in the text view and also highlights the key interface widgets that we need to deal with to build a tree.

The text view includes much information, and we will work our way through its contents. However, before doing so, we can get a quick view of the resulting decision tree by using the Draw button of the interface. A window will pop up, displaying the tree, as we saw in Figure 2.5 on page 30.

Working our way through the textual summary of the decision tree, we start with a report of the number of observations that were used to build the tree (i.e., 256):

Summary of the Decision Tree model for Classification...

n= 256

Figure 11.2: Building a decision tree predictive model using the weather dataset.

Tree Structure

We now look at the structure of the tree as it is presented in the text view. A legend is provided to assist in reading the tree structure:

node), split, n, loss, yval, (yprob)
      * denotes terminal node

The legend indicates that a node number will be provided, followed by a split (which will usually be in the form of a variable operation value), the number of entities n at that node, the number of entities that are incorrectly classified (the loss), the default classification for the node (the yval), and then the distribution of classes in that node (the yprobs). The distribution is ordered by class and the order is the same for all nodes. The next line indicates that a * denotes a terminal node of the tree (i.e., a leaf node; the tree is not split any further at that node).

The first node of any tree is always the root node. We work our way into the tree itself through the root node. The root node is numbered as node number 1:

1) root 256 41 No (0.8398438 0.1601562)

The root node represents all observations. By itself the node represents

a model that simply classifies every observation into the class that is associated with the majority from the training dataset. The information provided tells us that the majority class for the root node (the yval) is No. The 41 tells us how many of the 256 observations will be incorrectly classified as Yes. This is technically called the loss. The yprob component then reports on the distribution of the classes across the observations. We know the classes to be No and Yes. Thus, 84% (i.e., 0.8398438 as a proportion) of the observations have the target variable RainTomorrow as No, and 16% of the observations have it as Yes.

If the root node itself were treated as a model, it would always decide that it won't rain tomorrow. Based on the training dataset, the model would be 84% correct. That is quite a good level of accuracy, but the model is not particularly useful, since we are really interested in whether it is going to rain tomorrow.

The root node is split into two subnodes. The split is based on the variable Pressure3pm with a split value of 1011.9. Node 2 has the split expressed as Pressure3pm>=1011.9. That is, there are 204 observations with a 3 pm pressure reading of at least 1011.9 hPa:

2) Pressure3pm>=1011.9 204 16 No (0.9215686 0.0784314)

Only 16 of these 204 observations are misclassified, with the classification associated with this node being No. This represents an accuracy of 92% in predicting that it does not rain tomorrow.

Node 3 contains the remaining 52 observations, which have a 3 pm pressure of less than 1011.9. Whilst the decision is No, it is pretty close to a 50/50 split in this partition:

3) Pressure3pm< 1011.9 52 25 No (0.5192308 0.4807692)

We've skipped ahead a little to jump to node 3, so we now have a look again at node 2 and its split into subnodes. The algorithm has chosen Cloud3pm for the next split, with a split value of 7.5. Node 4 has 195 observations. These are the 195 observations for which the 3 pm pressure is greater than or equal to 1011.9 and the cloud coverage at 3 pm is less than 7.5. Under these circumstances, there is no rain on the following day 95% of the time:

2) Pressure3pm>=1011.9 204 16 No (0.9215686 0.0784314)
  4) Cloud3pm< 7.5 195 10 No (0.9487179 0.0512821) *
  5) Cloud3pm>=7.5 9 3 Yes (0.3333333 0.6666667) *

Node 5, at last, predicts that it will rain on the following day, at least based on the available historic observations. There are only nine observations here, and the frequency of observing rain on the following day is 67%. Thus we say there is a 67% probability of rain when the pressure at 3 pm is at least 1011.9 and the cloud cover at 3 pm is at least 7.5. Both node 4 and node 5 are marked with an asterisk (*), indicating that they are terminal nodes; they are not further split.

The remaining nodes, 6 and 7, split node 3 using the variable Sunshine and a split point of 8.85:

3) Pressure3pm< 1011.9 52 25 No (0.5192308 0.4807692)
  6) Sunshine>=8.85 25 5 No (0.8000000 0.2000000) *
  7) Sunshine< 8.85 27 7 Yes (0.2592593 0.7407407) *

Node 3 has almost equal numbers of No and Yes observations (52% and 48%, respectively). However, splitting on the number of hours of sunshine has quite nicely partitioned the observations into two groups that are quite a bit more homogeneous with respect to the target variable. Node 6 represents only a 20% chance of rain tomorrow, whilst node 7 represents a 74% chance of rain tomorrow.

That then is the model that has been built. It is a relatively simple decision tree with just seven nodes and four leaf nodes, with a maximum depth of 2 (in fact, each leaf node is at a depth of exactly 2).

Function Call

The next segment lists the underlying R command line that is used to build the decision tree. This was automatically generated based on the information provided through the interface. We could have directly entered this at the prompt in the R Console:

Classification tree:
rpart(formula=RainTomorrow ~ .,
    data=crs$dataset[crs$train, c(crs$input, crs$target)],
    method="class", parms=list(split="information"),
    control=rpart.control(usesurrogate=0, maxsurrogate=0))

The formula notes that we want to build a model to predict the value of the variable RainTomorrow based on the remainder of the variables in the dataset supplied (notated as the "~ ."). The dataset supplied consists of the crs$dataset data frame indexed to include the rows listed in the

variable crs$train. This is the training dataset. The columns from 3 to 22, and then column 24, are included in the dataset from which the model is built.

Following the specification of the formula and dataset are the tuning parameters for the algorithm. These are explained in detail in Section 11.5, but we briefly summarise them here. The method used is based on classification. The method for choosing the best split uses the information measure. Surrogates (for dealing with missing values) are not used by default in Rattle.

Variables Used

In general, only a subset of the available variables will be used in the resulting decision tree model. The next segment lists those variables that do appear in the tree. Of the 20 input variables, only three are used in the final model:

Variables actually used in tree construction:
[1] Cloud3pm    Pressure3pm Sunshine

Performance Evaluation

The next segment summarises the process of building the tree, and in particular the iterations and associated change in the accuracy of the model as new levels are added to the tree. The complexity table is discussed in more detail in Section 11.5. Briefly, though, we are most likely interested in the cross-validated error (refer to Section 15.1 for a discussion of cross-validation), which is the xerror column of the table.

The error over the whole dataset (i.e., if we were to classify every observation as No) is 0.16, or 16%. Treating this as the baseline error (i.e., 1.00), the table shows the relative reduction in the error (and cross-validation-based error) as we build the tree. From line 2, we see that after the first split of the dataset, we have reduced the cross-validation-based error to 80% of the original amount (i.e., 0.80 × 0.16, or 13%). Notice that the cross-validation error is being reduced more slowly than the error on the training dataset (error). This is typical.

The CP value (the complexity parameter) is explained further in Section 11.5, but for now we note that as the tree splits into more nodes,

the complexity parameter is reduced. But we also note that the cross-validation error starts to increase as we further split the decision tree. This tells the algorithm to stop partitioning, as the error rate (at least the unbiased estimate of it; refer to Section 15.1) is not improving:

Root node error: 41/256 = 0.16

n= 256

        CP nsplit rel error xerror   xstd

Time Taken

Finally, we see how long it took to build the tree. Decision trees are generally very quick to build:

Time taken: 0.03 secs

Tuning Options

The Rattle interface provides a choice of Algorithm for building the decision tree. The Traditional option is chosen by default, and that is what we have presented here. The Conditional option uses a more recent conditional inference tree algorithm, which is explained in more detail in Section 11.6. A variety of other tuning options are also provided, and they are discussed in some detail in Section 11.5.

Displaying Trees

The Rules and Draw buttons provide alternative views of the decision tree. Clicking on the Rules button will translate the decision tree into a set of rules and list those rules at the bottom of the text view. We need to scroll down the text view in order to see the rules. The rules in this form can be more easily extracted and used to generate code in other languages. A common example is to generate a query in SQL to extract the corresponding observations from a database.
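From the command line, rattle also provides asRules() to list the rules of an rpart model. A sketch, assuming the weatherRPART$model container built later in this section; each rule might then be translated by hand into, say, an SQL WHERE clause (the table name weather here is hypothetical):

> library(rattle)
> asRules(weatherRPART$model)
> # e.g., the rule for node 7 expressed as SQL:
> #   SELECT * FROM weather
> #   WHERE Pressure3pm < 1011.9 AND Sunshine < 8.85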

The Draw button will pop up a separate window to display a more visually appealing representation of the decision tree. We have seen the pictorial representation of a decision tree a number of times now, and they were generated from this button, as was Figure 11.3.

Figure 11.3: Typical Rattle decision tree.

Scoring

We can now use the model to predict the outcome for new observations, something we often call scoring. The Evaluate tab provides the Score option and the choice to Enter some data manually and have that data scored by the model. Executing this setup will result in a popup window in which to enter the data, and, on closing the window, the data is passed on to the model and the predictions are displayed in the Textview.

Building a Model using R

Underneath Rattle's GUI, we are relying on a collection of R commands and functions. The Log tab will expose them, and it is instructive to review the Log tab regularly to gain insight and understanding that will

be helpful in using R itself. We effectively lift the bonnet on the hood here so that we can directly build decision trees using R.

To use the traditional decision-tree-building algorithm, we load rpart. This provides rpart(), which is an implementation of the standard classification and regression tree algorithms. The implementation is very robust and reliable:

> library(rpart)

As we saw in Section 2.9, we will create the variable weatherDS (using new.env(), a new environment) to act as a container for the weather dataset and related information. We will access data within this container through the use of evalq() below:

> weatherDS <- new.env()

The weather dataset from rattle will be used for the modelling. Three columns from the dataset are ignored in our analyses, as they play no role in the model building. The three variables are the two that serve to identify the observations (Date and Location) and the risk variable (RISK_MM, the amount of rain recorded on the next day). Below we identify the index of these variables and record the negative index in vars, which is stored within the container:

> library(rattle)
> evalq({
    data <- weather
    nobs <- nrow(data)
    vars <- -grep('^(Date|Locat|RISK)', names(weather))
  }, weatherDS)

A random subset of 70% of the observations is chosen and will be used to identify a training dataset. The random number seed is set, using set.seed(), so that we will always obtain the same random sample, for illustrative purposes and repeatability. Choosing different random sample seeds is also useful, providing empirically an indication of how stable the models are:

> evalq({
    set.seed(42)
    train <- sample(nobs, 0.7*nobs)
  }, weatherDS)

We add to the weatherDS container the formula to describe the model that is to be built based on this dataset:

> evalq({
    form <- formula(RainTomorrow ~ .)
  }, weatherDS)

We now create a model container for the information relevant to the decision tree model that we will build. The container includes the weatherDS container (identifying it as parent= in the call to new.env()):

> weatherRPART <- new.env(parent=weatherDS)

The command to build a model is then straightforward. The variables data, train, and vars are obtained from the weatherDS container, and the result will be stored as the variable model within the weatherRPART container. We explain rpart() in detail below:

> evalq({
    model <- rpart(formula=form, data=data[train, vars])
  }, weatherRPART)

Here we use rpart(), passing to it a formula and the data. We don't need to include the formula= and the data= in the formal arguments to the function, as they will also be determined from their position in the argument list. It doesn't hurt to include them either, to provide more clarity for others reading the code.

The formula= argument identifies the model that is to be built. In this case, we pass to the function the variable form that we previously defined. The target variable (to the left of the tilde in form) is RainTomorrow, and the input variables consist of all of the remaining variables in the dataset (denoted by the period to the right of the tilde in form). We are requesting a model that predicts a value for RainTomorrow based on today's observations.

The data= argument identifies the training dataset. Once again, we pass to the function the variable data that we previously defined. The training dataset subset consists of the observation numbers listed in the variable train. The variables of the dataset that we wish to include are specified by vars, which in this case actually lists, as negative integers, the variables to ignore. Together, train and vars identify the observations and variables to include in the training of the model.
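With the model in hand, scoring (as described for Rattle's Evaluate tab earlier) reduces to a call to predict(). A minimal sketch, scoring the 30% of observations held out from training (setdiff() gives their row numbers; names here are our own):

> evalq({
    test <- setdiff(seq_len(nobs), train)
  }, weatherDS)
> evalq({
    predicted <- predict(model, newdata=data[test, vars], type="class")
    head(predicted)
  }, weatherRPART)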

The result of building the model is assigned into the variable model inside the environment weatherRPART and so can be independently referred to as weatherRPART$model.

Exploring the Model

Towards the end of Section 11.4, we explained the textual presentation of the results of building a decision tree model. The output we saw there can be reproduced in R using print() and printcp(). The output from print() is:

> print(weatherRPART$model)
n= 256

node), split, n, loss, yval, (yprob)
      * denotes terminal node

1) root 256 41 No (0.8398438 0.1601562)
  2) Pressure3pm>=1011.9 204 16 No (0.9215686 0.0784314)
    4) Cloud3pm< 7.5 195 10 No (0.9487179 0.0512821) *
    5) Cloud3pm>=7.5 9 3 Yes (0.3333333 0.6666667) *
  3) Pressure3pm< 1011.9 52 25 No (0.5192308 0.4807692)
    6) Sunshine>=8.85 25 5 No (0.8000000 0.2000000) *
    7) Sunshine< 8.85 27 7 Yes (0.2592593 0.7407407) *

We briefly discussed the output of this and the printcp() below in the previous section. We mentioned there how the CP (complexity parameter) is used to guide how large a decision tree to build. We might choose to stop when the cross-validated error (xerror) begins to increase. This is displayed in the output of printcp(). We can also obtain a useful graphical representation of the complexity parameter using plotcp() instead.

> printcp(weatherRPART$model)

Classification tree:
rpart(formula = form, data = data[train, vars])

Variables actually used in tree construction:
[1] Cloud3pm    Pressure3pm Sunshine

Root node error: 41/256 = 0.16

n= 256

        CP nsplit rel error xerror   xstd

Another command useful for providing information about the resulting model is summary():

> summary(weatherRPART$model)

This command provides quite a bit more information about the model-building process, beginning with the function call and data size. This is followed by the same complexity table we saw above:

Call:
rpart(formula=form, data=data[train, vars])
  n= 256

        CP nsplit rel error xerror   xstd

The summary goes on to provide information related to each node of the decision tree. Node number 1 is the root node of the decision tree. Its information appears first (note that the text here is modified to fit the page):

Node number 1: 256 observations,    complexity param=
  predicted class=No  expected loss=0.1601562
    class counts:   215    41
   probabilities: 0.840 0.160
  left son=2 (204 obs) right son=3 (52 obs)
  Primary splits:
    Pressure3pm   < 1012  right, improve=13.420, (0 missing)
    Cloud3pm      < 7.5   left,  improve= 9.492, (0 missing)
    Pressure9am   < 1016  right, improve= 9.143, (0 missing)
    Sunshine      < 6.45  right, improve= 8.990, (2 missing)
    WindGustSpeed < 64    left,  improve= 7.339, (2 missing)
  Surrogate splits:
    Pressure9am   < 1013  right, agree=0.938, adj=0.692, ...
    MinTemp       <       left,  agree=0.824, adj=0.135, ...
    Temp9am       <       left,  agree=0.816, adj=0.096, ...
    WindGustSpeed < 64    left,  agree=0.812, adj=0.077, ...
    WindSpeed3pm  < 34    left,  agree=0.812, adj=0.077, ...

We see that node number 1 has 256 observations to work with. Its complexity parameter is discussed further in Section 11.5. The next line identifies the default class for this node (No in this case), which corresponds to the class that occurs most frequently in the training dataset. With this class as the decision associated with this node, the error rate (or expected loss) is 16% (or 0.1601562). The table that follows then reports the frequency of observations by the target variable. There are 215 observations with No for RainTomorrow (84%) and 41 with Yes (16%).

The remainder of the information relates to deciding how to split the node into two subsets. The resulting split has a left branch (labelled as node number 2) with 204 observations. The right branch (labelled as node number 3) has 52 observations. The actual variable used to split the dataset into these two subsets is Pressure3pm, with the test being on the value 1011.9. Any observation with Pressure3pm < 1011.9 goes to the right branch, and the remainder go to the left. The measure (the improvement) associated with this split of the dataset is 13.42. We then see a collection of alternative splits and their associated measures. Clearly, Pressure3pm offers the best improvement, with the nearest competitor offering an improvement of 9.49.

The surrogate splits that are then presented relate to the handling of missing values in the data. Consider the situation where we apply the model to new data but have an observation with Pressure3pm missing. We could instead use Pressure9am. The information here indicates that 93.8% of the observations in the split based on Pressure9am < 1013 are the same as that based on Pressure3pm < 1012. The adj value is an indication of what is gained by using this surrogate split over simply giving up at this node and assigning the majority decision to the new observation. Thus, in using Pressure9am we gain a 69% improvement by using the surrogate.

The other nodes are then listed in the summary. They include the same kind of information, and we see the beginning of node number 2 here:

Node number 2: 204 observations,    complexity param=
  predicted class=No  expected loss=0.0784314
    class counts:   188    16
   probabilities: 0.922 0.078
  left son=4 (195 obs) right son=5 (9 obs)
  Primary splits:
    Cloud3pm    < 7.5 left,  improve=6.516, (0 missing)
    Sunshine    < 6.4 right, improve=2.937, (2 missing)
    Cloud9am    < 7.5 left,  improve=2.795, (0 missing)
    Humidity3pm < 71  left,  improve=1.465, (0 missing)
    WindDir9am  splits as RRRRR...LLLL, improve=1.391, ...

Note how categoric variables are reported. WindDir9am has 16 levels:

> levels(weather$WindDir9am)
 [1] "N"   "NNE" "NE"  "ENE" "E"   "ESE" "SE"  "SSE" "S"
[10] "SSW" "SW"  "WSW" "W"   "WNW" "NW"  "NNW"

All possible binary combinations of levels will have been considered, and the one reported above offers the best improvement. Here the first five levels (N to E) correspond to the right (R) branch and the remainder to the left (L) branch.

The leaf nodes of the decision tree (nodes 4, 5, 6, and 7) will have just the relevant information, thus no information on splits or surrogates. An

example is node 7. The following text again comes from the output of summary():

...
Node number 7: 27 observations
  predicted class=Yes  expected loss=0.2592593
    class counts:     7    20
   probabilities: 0.259 0.741

Node 7 is a leaf node that predicts Yes as the outcome. The error/loss is 7 out of 27 (26%), and the probability of Yes is 74%.

Miscellaneous Functions

We have covered above the main functions and commands in R for building and displaying a decision tree. Rpart and rattle also provide a collection of utility functions for exploring the model. First, the where= component of the decision tree object records the leaf node of the decision tree in which each observation in the training dataset ends up:

> head(weatherRPART$model$where, 12)

The plot() command and the related text() command will display a decision tree labelled appropriately:

> opar <- par(xpd=TRUE)
> plot(weatherRPART$model)
> text(weatherRPART$model)
> par(opar)

We notice that the default plot (Figure 11.4) looks different from the plot we obtain through Rattle. Rattle provides drawTreeNodes() as a variation of plot() based on draw.tree() from maptree (White, 2010). The plot here is a basic plot. The length of each line within the tree branches gives a visual indication of the error down that branch of the tree.

The plot and text can be further tuned through additional arguments to the two commands. There are very many tuning options available,

and they are listed in the manuals for the commands (?plot.rpart and ?text.rpart).

Figure 11.4: Typical R decision tree.

The path.rpart() command is then a useful adjunct to plot():

> path.rpart(weatherRPART$model)

Running this command allows us to use the left mouse button to click on a node on the plot to list the path to that node. For example, clicking the left mouse button on the bottom right node results in:

node number: 7
  root
  Pressure3pm< 1012
  Sunshine< 8.85

Click on the middle or right mouse button to finish interacting with the plot.

11.5 Tuning Parameters

Any implementation of the decision tree algorithm provides a collection of parameters for tuning how the tree is built. The defaults in Rattle (based on rpart's defaults) often provide a basically good tree. They are certainly a very good starting point and may be a satisfactory end point, too. However, tuning will be necessary where, for example, the target variable has very few examples of the particular class of interest or we would like to explore a number of alternative trees.

Whilst many tuning parameters are introduced here in some level of detail, the R documentation provides much more information. Use ?rpart to start exploring further.

The rpart() function has two arguments for tuning the algorithm, each being a structure containing other options. They are control= and parms=. We use these as in the following example:

> evalq({
    control <- rpart.control(minsplit=10,
                             minbucket=5,
                             maxdepth=20,
                             usesurrogate=0,
                             maxsurrogate=0)
    model <- rpart(formula=form,
                   data=data[train, vars],
                   method="class",
                   parms=list(split="information"),
                   control=control)
  }, weatherRPART)

We have already discussed the formula= and data= arguments. The remaining arguments are now discussed.

Modelling Method (method=)

The method= argument indicates the type of model to be built and is dependent on the target variable. For categoric targets, we generally build classification models, and so we use method="class". If the target is a numeric variable, then the argument would be method="anova" for an analysis of variance, building a regression tree.
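For example, a regression tree for a numeric target could be requested as follows. This is just an illustrative sketch, using RISK_MM (the numeric risk variable we otherwise ignore) as the target and a hypothetical variable name for the result:

> evalq({
    riskmodel <- rpart(formula=RISK_MM ~ MinTemp + MaxTemp + Sunshine,
                       data=data[train, ], method="anova")
  }, weatherRPART)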

Splitting Function (split=)

The split= argument is used to choose between different splitting functions (measures). The argument appears within the parms= argument of rpart(), which is built up as a named list. The setting split="information" directs rpart() to use the information gain measure we introduced above. The default choice of split="gini" (in R, though Rattle's default is "information") uses the Gini index of diversity. The choice makes no difference in this case, as we can verify by reviewing the output of the following two commands (though here we show just the one set of output):

> evalq({
    rpart(formula=form, data=data[train, vars],
          parms=list(split="information"))
  }, weatherRPART)
n= 256

node), split, n, loss, yval, (yprob)
      * denotes terminal node

1) root 256 41 No (0.8398438 0.1601562)
  2) Pressure3pm>=1011.9 204 16 No (0.9215686 0.0784314)
    4) Cloud3pm< 7.5 195 10 No (0.9487179 0.0512821) *
    5) Cloud3pm>=7.5 9 3 Yes (0.3333333 0.6666667) *
  3) Pressure3pm< 1011.9 52 25 No (0.5192308 0.4807692)
    6) Sunshine>=8.85 25 5 No (0.8000000 0.2000000) *
    7) Sunshine< 8.85 27 7 Yes (0.2592593 0.7407407) *

> evalq({
    rpart(formula=form, data=data[train, vars],
          parms=list(split="gini"))
  }, weatherRPART)

Minimum Split (minsplit=)

The minsplit= argument specifies the minimum number of observations that must exist at a node in the tree before it is considered for splitting. A node is not considered for splitting if it has fewer than minsplit observations. The minsplit= argument appears within the control= argument of rpart(). The default value of minsplit= is 20.

In the following example, we illustrate the boundary between splitting and not splitting the root node of our decision tree. This is often an issue in building a decision tree, and an inconvenience when all we obtain is a root node. Here the example shows that with a minsplit= of 53 the tree building will not proceed past the root node:

> evalq({
    rpart(formula=form, data=data[train, vars],
          control=rpart.control(minsplit=53))
  }, weatherRPART)
n= 256

node), split, n, loss, yval, (yprob)
      * denotes terminal node

1) root 256 41 No (0.8398438 0.1601562) *

Setting minsplit= to 52 results in a split on Pressure3pm (and further splitting) being considered and chosen, as we see in the code block below. Splitting on Pressure3pm splits the dataset into two datasets, one with 204 observations and the other with 52 observations. We can then see why, with minsplit= set to 53, the tree building does not proceed past the root node.

Changing the value of minsplit= allows us to eliminate some computation, as nodes with a small number of observations will generally play less of a role in our models. Leaf nodes can still be constructed that have fewer observations than the minsplit=, as that is controlled by the minbucket= argument.

> evalq({
    rpart(formula=form, data=data[train, vars],
          control=rpart.control(minsplit=52))
  }, weatherRPART)
n= 256

node), split, n, loss, yval, (yprob)
      * denotes terminal node

1) root 256 41 No (0.8398438 0.1601562)
  2) Pressure3pm>=1011.9 204 16 No (0.9215686 0.0784314) *
  3) Pressure3pm< 1011.9 52 25 No (0.5192308 0.4807692)
    6) Sunshine>=8.85 25 5 No (0.8000000 0.2000000) *
    7) Sunshine< 8.85 27 7 Yes (0.2592593 0.7407407) *

Minimum Bucket Size (minbucket=)

The minbucket= argument is the minimum number of observations in any leaf node. The default value is 7, or about one-third of the default value of minsplit=. If either of these two arguments is specified but not the other, then the default of the unspecified one is taken to be a value such that this relationship holds (i.e., minbucket= is one-third of minsplit=).

Once again we will see two examples of using minbucket=. The first example limits the minimum bucket size to be 10, resulting in the same model we obtained above. The second example reduces the limit down to just 5 observations in the bucket. The result will generally be a larger decision tree, since we are allowing leaf nodes with a smaller number of observations to be considered, and hence the option to split a node into smaller nodes will often be exercised by the tree-building algorithm.

> ops <- options(digits=2)
> evalq({
    rpart(formula=form, data=data[train, vars],
          control=rpart.control(minbucket=10))
  }, weatherRPART)
n= 256

node), split, n, loss, yval, (yprob)
      * denotes terminal node

1) root 256 41 No (0.84 0.16)
  2) Pressure3pm>=1e+03 204 16 No (0.92 0.08) *
  3) Pressure3pm< 1e+03 52 25 No (0.52 0.48)
    6) Sunshine>=8.8 25 5 No (0.80 0.20) *
    7) Sunshine< 8.8 27 7 Yes (0.26 0.74) *

> evalq({
    rpart(formula=form, data=data[train, vars],
          control=rpart.control(minbucket=5))
  }, weatherRPART)
n= 256

node), split, n, loss, yval, (yprob)
      * denotes terminal node

 1) root 256 41 No (0.84 0.16)
   2) Pressure3pm>=1e+03 204 16 No (0.92 0.08)
     4) Cloud3pm< 7.5 195 10 No (0.95 0.05) *
     5) Cloud3pm>=7.5 9 3 Yes (0.33 0.67) *
   3) Pressure3pm< 1e+03 52 25 No (0.52 0.48)
     6) Sunshine>=8.8 25 5 No (0.80 0.20) *
     7) Sunshine< 8.8 27 7 Yes (0.26 0.74)
      14) Evaporation< Yes ( )
        28) WindGustSpeed< No ( ) *
        29) WindGustSpeed>= Yes ( ) *
      15) Evaporation>= Yes ( ) *

> options(ops)

Note that changing the value of minbucket= can have an impact on the choice of variable for the split. This will occur when one choice with a higher improvement results in a node with too few observations, leading to another choice being taken to meet the minimum requirements for the number of observations in a split.

Whilst the default is to set minbucket= to be one-third of minsplit=, there is no requirement for minbucket= to be less than minsplit=. A node will always have at least minbucket= entities, and it will be considered for splitting if it has at least minsplit= observations and if on splitting each of its children has at least minbucket= observations.

Complexity Parameter (cp=)

The complexity parameter is used to control the size of the decision tree and to select an optimal tree size. The complexity parameter controls the process of pruning a decision tree. As we will discuss in Chapter 15, without pruning, a decision tree model can overfit the training data and then not perform very well on new data. In general, the more complex a model, the more likely it is to match the data on which it has been trained and the less likely it is to match new, previously unseen data. On the other hand, decision tree models are very interpretable, and thus building a more complex tree (i.e., having many branches) is sometimes tempting (and useful). It can provide insights that we can then test statistically.

Using cp= governs the minimum benefit that must be gained at each split of the decision tree in order to make a split worthwhile. This therefore saves on computing time by eliminating splits that appear to add little value to the model. The default is 0.01. A value of 0 will build a complete decision tree to maximum depth, depending on the values of minsplit= and minbucket=. This is useful if we want to look at the values for CP for various tree sizes. We look for the number of splits where the sum of the xerror (cross-validation error relative to the root node error) and xstd is minimum (as discussed in Section 11.4). This is usually early in the list.

The plotcp() command is useful in visualising the progression of the CP values. In the following example (noting that the cptable may vary slightly between different deployments of R, particularly between 64 bit R, as here, and 32 bit R), we build a full decision tree with

both cp= and minbucket= set to zero. We also show the CP table. The corresponding plot is shown in Figure 11.5.

> set.seed(41)
> evalq({
    control <- rpart.control(cp=0, minbucket=0)
    model <- rpart(formula=form, data=data[train, vars],
                   control=control)
  }, weatherRPART)
> print(weatherRPART$model$cptable)

        CP nsplit rel error xerror   xstd

> plotcp(weatherRPART$model)
> grid()

The figure illustrates a typical behaviour of model building. As we proceed to build a complex model, the error rate (the y-axis) initially decreases. It then flattens out and, as the model becomes more complex, the error rate begins to again increase. We will want to choose a model where it has flattened out. Based on the principle of favouring simpler models, we might choose the first of the similarly performing bottom points, and thus we might set cp=0.1, for example.

As a script, we could automate the selection with the following:

> xerr <- weatherRPART$model$cptable[, "xerror"]
> minxerr <- which.min(xerr)
> mincp <- weatherRPART$model$cptable[minxerr, "CP"]
> weatherRPART$model.prune <- prune(weatherRPART$model, cp=mincp)

Figure 11.5: Error rate versus complexity/tree size.

Priors (prior=)

Sometimes the proportions of classes in a training set do not reflect their true proportions in the population. We can inform Rattle and R of the population proportions, and the resulting model will reflect them. All probabilities will be modified to reflect the prior probabilities of the classes rather than the actual proportions exhibited in the training dataset. The priors can also be used to boost a particularly important class, by giving it a higher prior probability, although this might best be done through the loss matrix (Section 11.5).

In Rattle, the priors are expressed as a list of numbers that sum to 1. The list must be of the same length as the number of unique classes in the training dataset. An example for binary classification is 0.6,0.4. This translates into prior=c(0.6,0.4) for the call to rpart().

The following example illustrates how we might use the priors to

favour a particular target class that was otherwise not being predicted by the resulting model (because the resulting model turns out to be only a root node, always predicting No). We begin by creating the dataset object, consisting of the larger Australian weather dataset, weatherAUS:

> wausDS <- new.env()
> evalq({
    data <- weatherAUS
    nobs <- nrow(data)
    form <- formula(RainTomorrow ~ RainToday)
    target <- all.vars(form)[1]
    set.seed(42)
    train <- sample(nobs, 0.5*nobs)
  }, wausDS)

A decision tree model is then built and displayed:

> wausRPART <- new.env(parent=wausDS)
> evalq({
    model <- rpart(formula=form, data=data[train,])
    model
  }, wausRPART)
n=19509 (489 observations deleted due to missingness)

node), split, n, loss, yval, (yprob)
      * denotes terminal node

1) root No ( ) *

A table shows the proportion of observations assigned to each class in the training dataset:

> evalq({
    freq <- table(data[train, target])
    round(100*freq/length(train), 2)
  }, wausDS)

 No   Yes

Now we build a decision tree model but with different prior probabilities:

> evalq({
    model <- rpart(formula=form, data=data[train,],
                   parms=list(prior=c(0.5, 0.5)))
    model
  }, wausRPART)
n=19509 (489 observations deleted due to missingness)

node), split, n, loss, yval, (yprob)
      * denotes terminal node

1) root Yes ( )
  2) RainToday=No No ( ) *
  3) RainToday=Yes Yes ( ) *

The default priors when using rpart() without the prior= option are set to be the class proportions as found in the training dataset supplied.

Loss Matrix (loss=)

The loss matrix is used to weight different kinds of errors (or loss) differently. This refers to what are commonly known as false positives (or type I errors) and false negatives (or type II errors) when we talk about a two-class problem. Often, one type of error is more significant than another type of error. In fraud, for example, a model that identifies too many false positives is probably better than a model that identifies too many false negatives (because we then miss too many real frauds). In medicine, a false positive means that we diagnose a healthy patient with a disease, whilst a false negative means that we diagnose an ill patient as being healthy.

The default loss for each of the true/false positives/negatives is 1; they are all of equal impact or loss. In the case of a rare and underrepresented class (like fraud), we might consider false negatives to be four or even ten times worse than a false positive. Thus, we communicate this to the algorithm so that it will work harder to build a model to find all of the positive cases.

The loss matrix records these relative weights for the two-class case only. The following table illustrates the terminology (showing predicted

versus observed):

              Observed
  Predicted    0    1
      0       TN   FN
      1       FP   TP

Noting that we do not specify any weights in the loss matrix for the true positives (TP) and the true negatives (TN), we supply weights of 0 for them in the matrix. To specify the matrix in the Rattle interface, we supply a list of the form 0, FN, FP, 0.

In general, the loss matrix must have the same dimensions as the number of classes (i.e., the number of levels of the target variable) in the training dataset. For binary classification, we must supply four numbers, with the diagonals as zeros. An example is the string of numbers 0, 10, 1, 0, which might be interpreted as saying that an actual 1 predicted as 0 (i.e., a false negative) is ten times more unwelcome than a false positive. This is used to construct, row-wise, the loss matrix, which is passed through to rpart() as loss=matrix(c(0,10,1,0), byrow=TRUE, nrow=2) within the parms= list.

The loss matrix is used to alter the priors, which will affect the choice of variable on which to split the dataset at each node, giving more weight where appropriate. Using the loss matrix is often indicated when we build a decision tree that ends up being just a single root node (often because the positive class represents less than 5% of the population, and so the most accurate model would predict everyone to be a negative).

Other Options

The rpart() function provides many other tuning parameters that are not exposed through the Rattle GUI. These include maxdepth= to limit the depth of a tree and maxcompete= to limit the number of competing alternative splits for each node that is retained in the resulting model.

A number of options relate to the handling of surrogates. As indicated above, surrogates in the model allow for the handling of missing values. The surrogatestyle= argument indicates how surrogates are given preference. The default is to prefer variables with fewer missing values in the training dataset, with the alternative being to sort them by the percentage correct over the number of nonmissing values.

The usesurrogate= argument controls how surrogates are made use of in the model. The default for the usesurrogate= argument is 2.

This is also set when Rattle's Include Missing check button is active. The behaviour here is to try each of the surrogates whenever the main variable has a missing value, but if all surrogates are also missing, then follow the path with the majority of cases. If usesurrogate= is set to 1, the behaviour is to try each of the surrogates whenever the main variable has a missing value, but if all surrogates are also missing, then go no further. When the argument is set to 0 (the case when Rattle's Include Missing check button is not active), an observation with a missing value for the main variable is not used any further in the tree building. The maxsurrogate= argument simply limits the number of surrogates considered for each node.

11.6 Discussion

Decision trees have been around for a long time. They present a mechanism for structuring a series of questions. The next question to ask, at any time, is based on the answer to a previous question. In data mining, we commonly identify decision trees as the knowledge representation scheme targeted by the family of techniques originating from ID3 within the machine learning community (Quinlan, 1986) and from CART within the statistics community. The original ID3 algorithm was extended to become the commercially available C4.5 software. This was made available together with a book by Quinlan (1993) that served as a guide to using the code.

Traditional decision tree algorithms can suffer from overfitting and can exhibit a bias towards selecting variables with many possible splits (i.e., categoric variables). The algorithms do not use any statistical significance concepts and thus, as noted by Mingers (1989), cannot distinguish between significant and insignificant improvements in the information measure. The use of a cross-validated relative error measure, as in the implementation in rpart(), does guard against overfitting.

Hothorn et al. (2006) introduced an improvement to the approach presented here for building a decision tree, called conditional inference trees. Rattle offers the choice of traditional and conditional algorithms. Conditional inference trees address overfitting and variable selection biases by using a conditional distribution to measure the association between the output and the input variables. They take into account distributional properties.

Conditional inference trees can be built using ctree() from party (Hothorn et al., 2006). Within Rattle, we can choose the Conditional option to build a conditional inference tree. From the command line, we would use the following call to ctree():

> library(party)
> weatherCTREE <- new.env(parent=weatherDS)
> evalq({
    model <- ctree(formula=form, data=data[train, vars])
  }, weatherCTREE)

We can review just lines 8 to 17 of the resulting output, which is the tree itself:

> cat(paste(capture.output(weatherCTREE$model)[8:17],
            collapse="\n"))
1) Pressure3pm <= 1012; criterion = 1, statistic =
  2) Sunshine <= 8.8; criterion = 0.99, statistic =
    3)* weights = 27
  2) Sunshine > 8.8
    4)* weights = 25
1) Pressure3pm > 1012
  5) Cloud3pm <= 7; criterion = 1, statistic =
    6)* weights = 195
  5) Cloud3pm > 7
    7)* weights = 9

A plot of the tree is presented in Figure 11.6. The plot is quite informative and primarily self-explanatory. Node 3, for example, predicts rain relatively accurately, whilst node 6 describes conditions under which there is almost never any rain on the following day.

11.7 Summary

Decision tree algorithms handle mixed types of variables and missing values, and are robust to outliers and monotonic transformations of the input and to irrelevant inputs. The predictive power of decision trees tends to be poorer than for other techniques that we will introduce. However, the algorithm is generally straightforward, and the resulting

models are generally easily interpretable. This last characteristic has made decision tree induction very popular for over 30 years. This chapter has introduced the basic concept of representing knowledge as a decision tree and presented a measure for choosing a good decision tree and an algorithm for building one.

Figure 11.6: A conditional inference tree.

11.8 Command Summary

This chapter has referenced the following R packages, commands, functions, and datasets:

ctree()          function  Build a conditional inference tree.
draw.tree()      command   Enhanced graphic decision tree.
maptree          package   Provides draw.tree().
party            package   Conditional inference trees.
path.rpart()     function  Identify paths through decision tree.
plot()           command   Graphic display of the tree.
plotcp()         command   Plot complexity parameter.
print()          command   Textual version of the decision tree.
printcp()        command   Complexity parameter table.
rattle           package   The weather dataset and GUI.
rpart()          function  Build a decision tree predictive model.
rpart            package   Provides decision tree functions.
rpart.control()  function  Organise rpart control arguments.
set.seed()       function  Initiate random seed number sequence.
summary()        command   Summary of the tree building process.
text()           command   Add labels to decision tree graphic.
weather          dataset   Sample dataset from rattle.

Chapter 12

Random Forests

Building a single decision tree provides a simple model of the world, but it is often too simple or too specific. Over many years of experience in data mining, it has become clear that many models working together are better than one model doing it all. We have now become familiar with the idea of combining multiple models (like decision trees) into a single ensemble of models (to build a forest of trees). Compare this to how we might bring together panels of experts to ponder an issue and to then come up with a consensus decision. Governments, industry, and universities all manage their business processes in this way. It can often result in better decisions compared to simply relying on the expertise of a single authority on a topic.

The idea of building multiple trees arose early on with the development of the multiple inductive learning (MIL) algorithm (Williams, 1987, 1988). In building a single decision tree, it was noted that often there was very little difference in choosing between alternative variables. For example, two or more variables might not be distinguishable in terms of their ability to partition the data into more homogeneous datasets. The MIL algorithm builds all the equally good models and then combines them into one model, resulting in a better overall model.

Today we see a number of algorithms generating ensembles, including boosting, bagging, and random forests. In this chapter, we introduce the random forest algorithm, which builds hundreds of decision trees and combines them into a single model.

12.1 Overview

The random forest algorithm tends to produce quite accurate models because the ensemble reduces the instability that we can observe when we build single decision trees. This can often be illustrated simply by removing a very small number of observations from the training dataset, to see quite a change in the resulting decision tree.

The random forest algorithm (and other ensemble algorithms) tends to be much more robust to changes in the data. Hence, it is very robust to noise (i.e., variables that have little relationship to the target variable). Being robust to noise means that small changes in the training dataset will have little, if any, impact on the final decisions made by the resulting model. Random forest models are generally very competitive with nonlinear classifiers such as artificial neural nets and support vector machines.

Random forests handle underrepresented classification tasks quite well. This is where, in the binary classification task, one class has very few (e.g., 5% or fewer) observations compared with the other class. By building each decision tree to its maximal depth, as the random forest algorithm does (by not pruning the individual decision trees), we can end up with a model that is less biased. Each individual tree will overfit the data, but this is outweighed by the multiple trees using different variables and (over)fitting the data differently.

The randomness used by a random forest algorithm is in the selection of both observations and variables. It is this randomness that delivers considerable robustness to noise, outliers, and overfitting, when compared with a single-tree classifier. The randomness also delivers substantial computational efficiencies. In building a single decision tree, the model builder may select a random subset of the observations available in the training dataset. Also, at each node in the process of building the decision tree, only a small fraction of all of the available variables are considered when determining how to best partition the dataset. This substantially reduces the computational requirement.

In the area of genetic marker selection and microarray data within bioinformatics, for example, random forests have been found to be particularly well suited. They perform well even when many of the input variables have little bearing on the target variable (i.e., they are noise variables). Random forests are also suitable when there are very many input variables and not so many observations.
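This instability is easy to experience first-hand. A minimal sketch using rpart on rattle's weather dataset, dropping just ten observations (the column indices follow the usage in Chapter 11):

> library(rpart)
> library(rattle)
> vars <- c(3:22, 24)      # inputs plus target, ignoring Date,
                           # Location, and RISK_MM
> tree1 <- rpart(RainTomorrow ~ ., data=weather[, vars])
> tree2 <- rpart(RainTomorrow ~ ., data=weather[-(1:10), vars])
> # Printing tree1 and tree2 shows how a small change to the data
> # can alter the variables and split points that are chosen.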

In summary, a random forest model is a good choice for model building for a number of reasons. Often, very little preprocessing of the data needs to be performed, as the data does not need to be normalised and the approach is resilient to outliers. The need for variable selection is avoided because the algorithm effectively does its own. Because many trees are built using two levels of randomness (observations and variables), each tree is effectively an independent model and the resulting model tends not to overfit to the training dataset.

12.2 Knowledge Representation

The random forest algorithm is commonly presented in terms of decision trees as the primary form for representing knowledge. However, the random forest algorithm can be thought of as a meta-algorithm. It describes an approach to building models where the actual model builder could be a decision tree algorithm, a regression algorithm, or any one of many other kinds of model-building algorithms. The general concepts apply to any of these approaches. We will stay with decision trees as the underlying model builder for our purposes here.

In any ensemble approach, the key extension to the knowledge representation is in the way that we combine the decisions that are made by the individual experts or models. Various approaches have been considered over the years. Many come from the knowledge-based and expert systems communities, which often need to consider the issue of combining expert knowledge from multiple experts. Approaches to aggregating decisions into one final decision include simple majority rules and a weighted score where the weights correspond to the quality of the expertise (e.g., the measured accuracy of the individual tree).

The random forest algorithm will often build from 100 to 500 trees. In deploying the model, the decisions made by each of the trees are combined by treating all trees as equals. The final decision of the ensemble will be the decision of the majority of the constituent trees. If 80 out of 100 trees in the random forest say that it will rain tomorrow, then we will go with that decision and take the appropriate action for rain. Even if 51 of the 100 trees say that it will rain, we might go with that, although perhaps with less certainty. In the context of regression rather than classification, the result is the average value over the ensemble of regression trees.
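A minimal sketch of the two aggregation schemes, with votes and preds as hypothetical outputs from the individual trees for a single observation:

> votes <- c(rep("Yes", 80), rep("No", 20))  # 80 of 100 trees say Yes
> names(which.max(table(votes)))             # majority vote: "Yes"
> preds <- c(2.1, 1.9, 2.4)                  # regression tree outputs
> mean(preds)                                # ensemble average: 2.13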

12.3 Algorithm

Chapter 11 covered the building of an individual tree, and the same algorithm can be used for building one or 500 trees. It is how the training set is selected and how the variables to use in the modelling are chosen that differs between the trees built for a random forest.

Sampling the Dataset

The random forest algorithm builds multiple decision trees, using a concept called bagging, to introduce random sampling into the whole process. Bagging is the idea of collecting a random sample of observations into a bag (though the term itself is an abbreviation of bootstrap aggregation). Multiple bags are made up of randomly selected observations obtained from the original observations from the training dataset.

The selection in bagging is made with replacement, meaning that a single observation has a chance of appearing multiple times within a single bag. The sample size is often the same as for the full dataset, and so in general about two-thirds of the observations will be included in the bag (with repeats) and one-third will be left out. Each bag of observations is then used as the training dataset for building a decision tree (and those left out can be used as an independent sample for performance evaluation purposes).

Sampling the Variables

A second key element of randomness relates to the choice of variables for partitioning the dataset. At each step in building a single decision node (i.e., at each split point of the tree), a random, and usually small, set of variables is chosen. Only these variables are considered when choosing a split point. For each node in building a decision tree, a different random set of variables is considered.

Randomness

By randomly sampling both the data and the variables, we introduce decision trees that purposefully have different performance behaviours for different subsets of the data. It is this variation that allows us to consider an ensemble of such trees as representing a team of experts with differing expertise working together to deliver a better answer.
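The effect of sampling with replacement is easily demonstrated. A minimal sketch, with 366 being the size of the weather dataset:

> set.seed(42)
> bag <- sample(366, 366, replace=TRUE)
> length(unique(bag)) / 366    # about two-thirds end up in the bag
> (1 - 1/366)^366              # expected out-of-bag fraction: near 1/e, about 0.37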

Sampling also offers another significant advantage: computational efficiency. By considering only a small fraction of the total number of variables available whilst considering split points, the amount of computation required is significantly reduced.

In building each decision tree, the random forest algorithm generally will not perform any pruning of the decision tree. When building a single decision tree, it was noted in Chapter 11 that pruning was necessary to avoid overfitting the data. Overfitted models tend not to perform well on new data. However, a random forest of overfitted trees can deliver a very good model that performs well on new data.

Ensemble Scoring

In deploying the multiple decision trees as a single model, each tree has equal weight in the final decision-making process. A simple majority might dictate the outcome. Thus, if 300 of 500 decision trees all predict that it will rain tomorrow, then we might be inclined to expect there to be rain tomorrow. If only 100 trees of the 500 predict rain tomorrow, then we might not expect rain.

12.4 Tutorial Example

Our task is again to predict the likelihood of rain tomorrow given today's weather conditions. We illustrate this using Rattle and directly through R. In both cases, randomForest (Liaw and Wiener, 2002) is used. This package provides direct access to the original implementation of the random forest algorithm by its authors.

Building a Model Using Rattle

Rattle's Model tab provides the Forest option to build a forest of decision trees. Figure 12.1 displays the graphical interface to the options for building a random forest with the default values and also shows, in the Textview area, the top part of the results from building the random forest. We now step through the output of the text view line by line.

The first few lines note the number of observations used to build the model and then an indication that missing values in the training dataset are being imputed. If missing value imputation is not enabled, then the number of observations may be less than the number available, as the default is to drop observations containing missing values.

Figure 12.1: Building a random forest predictive model.

Summary of the Random Forest model:

Number of observations used to build the model: 256
Missing value imputation is active.

The next few lines record the actual function call that Rattle generated and passed on to R to be evaluated:

Call:
 randomForest(formula = RainTomorrow ~ .,
     data = crs$dataset[crs$sample, c(crs$input, crs$target)],
     ntree = 500, mtry = 4, importance = TRUE,
     replace = FALSE, na.action = na.roughfix)

A more detailed dissection of the function call is presented later, but

in brief, 500 trees were asked for (ntree=) and just four variables were considered for the split point for each node (mtry=). An indication of the importance of variables is maintained (importance=), and any observations with missing values will have those values imputed (na.action=).

The next few lines summarise some of the same information in a more accessible form. Note that, due to numerical differences, specific results may vary slightly between 32 bit and 64 bit deployments of R. The following was performed on a 64 bit deployment of R:

Type of random forest: classification
Number of trees: 500
No. of variables tried at each split: 4

Performance Evaluation

Next comes an indication of the performance of the resulting model. The out-of-bag (OOB) estimate of the error rate is calculated using the observations that are not included in the bag: the bag is the subset of the training dataset used for building the decision tree, hence the out-of-bag terminology. This unbiased estimate of error suggests that when the resulting model is applied to new observations, the answers will be in error 14.06% of the time. That is, it is 85.94% accurate, which is a reasonably good model.

OOB estimate of error rate: 14.06%

This overall measure of accuracy is then followed by a confusion matrix that records the disagreement between the final model's predictions and the actual outcomes of the training observations. The actual observations are the rows of this table, whilst the columns correspond to what the model predicts for an observation, and the cells count the number of observations in each category. That is, the model predicts No whilst the observation was Yes for 26 observations.

Confusion matrix:
      No Yes class.error
No   205  10     0.04651
Yes   26  15     0.63415

We see that the model and the training dataset agree that it won't rain for 205 of the observations. They agree that it will rain for 15 of the observations. However, there are 26 days for which the model predicts that it does not rain the following day and yet it does rain. Similarly, the model predicts that it will rain the following day for ten of the observations when in fact it does not rain.

The overall class errors, also calculated from the out-of-bag data, are included in the table. The model is wrong in predicting rain when there is none in only 4.65% of the observations when there is no rain. This is contrasted with the 63.41% error rate in predicting that it does rain tomorrow.

Underrepresented Classes

The acceptability of such errors (false positives versus false negatives) depends on many factors. Predicting that it will rain tomorrow and getting it wrong (false positive) might be an inconvenience in terms of carrying an umbrella around for the day. However, predicting that it won't rain and not being prepared for it (false negative) could result in a soggy dash for cover. The 63.41% error rate on the days when it does rain might therefore be a concern.

One approach with random forests in addressing the seriousness associated with the false negatives might be to adjust the balance between the underrepresented class (66 observations have RainTomorrow as Yes) and the overrepresented class (300 observations have RainTomorrow as No). In the training dataset the observations number 41 and 215, respectively (after removing any observations with missing values).

We can use the Sample Size option to encourage the algorithm to be more aggressive in predicting that it will rain tomorrow. We will balance up the sampling so that equal numbers of observations with Yes and No are chosen. Specifying a value of 35,35 for the sample size will do this. The confusion matrix for the resulting random forest is:

OOB estimate of error rate: 28.52%
Confusion matrix:
      No Yes class.error
No   147  68     0.31628
Yes    5  36     0.12195

The error rate for when it does rain tomorrow is now 12.2%: we now get wet 5 days out of the 41 on which it does rain, which is better than the 26 days out of 41 on which we previously ended up getting wet. The price we pay for this increased accuracy in predicting when it rains is that we now have more days predicted as raining when in fact it does not rain. The business problem here indicates that carrying an umbrella with us unnecessarily is less of a burden than getting wet when it rains and we don't have our umbrella. We are also assuming that we don't want to carry our umbrella all the time.

Variable Importance

One of the problems with a random forest, compared with a single decision tree, is that it becomes quite a bit more difficult to readily understand the discovered knowledge: there are 500 trees here to try to understand. One way to get an idea of the knowledge being discovered is to consider the importance of the variables, as emerges from their use in the building of the 500 decision trees. A variable importance table is the next piece of information that appears in the text view (we reformat it here to fit the limits of the page):

Variable Importance
               No  Yes  Accu  Gini
Pressure3pm
Sunshine
Cloud3pm
WindGustSpeed
Pressure9am
Temp3pm
Humidity3pm
MaxTemp
Temp9am
WindSpeed9am

The table lists each input variable and then four measures of importance for each variable. Higher values indicate that the variable is relatively more important. The table is sorted by the Accuracy measure of importance.
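The same information can be retrieved directly in R using importance() from randomForest. This is a sketch that assumes the weatherRF container we build later in this section, and sorts by the mean decrease in accuracy to mirror the ordering of the table above:

> library(randomForest)
> imp <- importance(weatherRF$model)
> imp[order(imp[, "MeanDecreaseAccuracy"], decreasing=TRUE), ]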

A naïve approach to measuring variable importance is to count the number of times the variable appears in the ensemble of decision trees. This is a rather blunt measure as, for example, variables can appear at different levels within a tree and thus have different levels of importance. Most measures thus incorporate some measure of the improvement made to the tree by each variable.

The third importance measure is a scaled average of the prediction accuracy of each variable. The calculation is based on a process of randomly permuting the values of a variable across the observations and measuring the impact on the predictive accuracy of the resulting tree. The larger the impact, the more important the variable is. Thus this measure reports the mean decrease in the accuracy of the model. The actual magnitude of the measure is not so relevant as the relative positioning of the variables by the measure.

The final measure of importance is the total decrease in a decision tree node's impurity (the splitting criterion) when splitting on a variable. The splitting criterion used is the Gini index. This is measured for a variable over all trees, giving a measure of the mean decrease in the Gini index of diversity relating to the variable.

The Importance button displays a visual plot of the accuracy and the Gini importance measures, as shown in Figure 12.2, and is more effective in illustrating the relative importance of the variables. Clearly, Pressure3pm is the most important variable, followed by Sunshine. The accuracy measure then lists Cloud3pm as the next most important. This is consistent with the decision tree we built in Chapter 11. What we did not learn in building the decision tree is that Pressure9am is also quite important, and that the remainder are less so, at least according to the accuracy measure.

We also notice that the categoric variables (like the wind direction variables WindGustDir, WindDir9am, and WindDir3pm) have a higher importance according to the Gini measure than with the accuracy measure. This bias towards categoric variables with many categories, exhibited in the Gini measure, is discussed further in Section 12.6. It is noteworthy that this bias will mislead us about the importance of these categoric variables.
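The permutation idea is simple enough to sketch by hand. The following, again assuming the weatherRF container built later in this section, measures for the single variable Sunshine the drop in accuracy on the held-out observations once its values are randomly permuted. The forest-internal measure does this per tree on the out-of-bag samples, so this is an approximation of the idea rather than a replication of the reported numbers:

> evalq({
      test <- setdiff(seq(nobs), train)
      actual <- data[test, target]
      base.acc <- mean(predict(model, data[test, vars]) == actual)
      shuffled <- data[test, vars]
      shuffled$Sunshine <- sample(shuffled$Sunshine)  # break the association
      perm.acc <- mean(predict(model, shuffled) == actual)
      base.acc - perm.acc   # a larger drop suggests a more important variable
  }, weatherRF)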

Figure 12.2: Two measures of variable importance as calculated by the random forest algorithm.

Time Taken

The tail of the textview provides information on how long it took to build the random forest of 500 trees. Note that even though we are building so many decision trees, the time taken is still less than 1 second.

Tuning Options

The Rattle interface provides a choice of Algorithm for building the random forest. The Traditional option is chosen by default, and that is what we have presented here. The Conditional option uses a more recent conditional inference tree algorithm for building the decision trees. This is explained in more detail in Section 12.6. A small number of other tuning options are also provided, and they are discussed in some detail in Section 12.5.

Error Plots

A useful diagnostic tool is the error plot, obtained with a click of the Error button. Figure 12.3 shows the resulting error plot for our random forest model.

The plot reports the accuracy of the forest of trees (in terms of error rate on the y-axis) against the number of trees that have been included in the forest (the x-axis).

Figure 12.3: The error rate of the overall model generally decreases as each new tree is added to the ensemble.

The key point we take from this plot is that after some number of trees there is actually very little that changes by adding further trees to the forest. From Figure 12.3 it would appear that going beyond about 20 trees in the forest adds very little value, when considering the out-of-bag (OOB) error rate. The two other lines show the changes in error rate associated with the predictions of the model (here we have two classes predicted and so two additional lines). We also take these into account when deciding how many trees to add to the forest.

Conversion to Rules

Another button available with the Forest option is the Rules button, with an associated text entry box. Clicking this button will convert the specified tree into a set of rules. If the tree specified is 0 (rather than, for example, the default 1), then all trees will be converted to rules. Be careful, though, as that could take a very long time for 500 trees and 20 or more rules per tree (10,000 rules). The first two rules from tree 1 of the random forest are shown in the following code block.

Random Forest Model 1

Tree 1 Rule 1 Node 28 Decision No

 1: Sunshine <= ...
 2: Cloud9am <= 7.5
 3: WindGustSpeed <= ...
 4: Humidity3pm <= ...
 5: MaxTemp <= ...

Tree 1 Rule 2 Node 29 Decision Yes

 1: Sunshine <= ...
 2: Cloud9am <= 7.5
 3: WindGustSpeed <= ...
 4: Humidity3pm <= ...
 5: MaxTemp > ...

Building a Model Using R

As usual, we will create a container into which we place the relevant information for the modelling. We set up some useful variables within the container (using evalq()) as well as constructing the training and test datasets based on a random sample of 70% of the observations, including only those columns (i.e., dataset variables) that are not identified as being ignored (vars is a list of negative indices, and thus indicates which columns not to include).

> library(rattle)
> weatherDS <- new.env()
> evalq({
      data <- na.omit(weather)
      nobs <- nrow(data)
      form <- formula(RainTomorrow ~ .)
      target <- all.vars(form)[1]
      vars <- -grep('^(Date|Location|RISK_)', names(data))
      set.seed(42)
      train <- sample(nobs, 0.7*nobs)
  }, weatherDS)

Considering the formula, the variable RainTomorrow is the target, with all remaining variables (~ .) from the provided dataset as the input variables.

Next we build the random forest. We first generate our training dataset as a random sample of 70% of the supplied dataset, noting that we reset the random number generator's seed back to a known number for repeatability. The data itself consists of the observations contained in the training dataset.

> library(randomForest)
> weatherRF <- new.env(parent=weatherDS)
> evalq({
      model <- randomForest(formula=form,
                            data=data[train, vars],
                            ntree=500, mtry=4,
                            importance=TRUE, localImp=TRUE,
                            na.action=na.roughfix, replace=FALSE)
  }, weatherRF)

The remaining arguments to the function are explained in Section 12.5.

Exploring the Model

The model object itself contains quite a lot of information about the model that has been built. The str() command gives the definitive list of all the components available within the object. An explanation is also available through the help page for randomForest():

> str(weatherRF$model)
> ?randomForest

We consider some of the information stored within the object here. The predicted component contains the values predicted for each observation in the training dataset based on the out-of-bag samples. If an observation is never in an out-of-bag sample, then the prediction will be reported as NA. Here we show just the first ten predictions:

> head(weatherRF$model$predicted, 10)
 No  No  No  No  No  No  No  No  No  No
Levels: No Yes

The importance component records the information related to the measures of variable importance as discussed in detail in Section 12.4, page 253. The information is reported for four measures (columns).

> head(weatherRF$model$importance)
             No  Yes  MeanDecreaseAccuracy  MeanDecreaseGini
MinTemp     ...  ...                   ...               ...
MaxTemp     ...  ...                   ...               ...
Rainfall    ...  ...                   ...               ...
Evaporation ...  ...                   ...               ...
Sunshine    ...  ...                   ...               ...
WindGustDir ...  ...                   ...               ...

The importance of each variable in predicting the outcome for each observation in the training dataset is also available in the model object. This is accessible through the localImp component:

> head(weatherRF$model$localImp)[, 1:4]
MinTemp     ...  ...  ...  ...
MaxTemp     ...  ...  ...  ...
Rainfall    ...  ...  ...  ...
Evaporation ...  ...  ...  ...
Sunshine    ...  ...  ...  ...
WindGustDir ...  ...  ...  ...

The error rate data is stored as the err.rate component. This can be accessed from the model object as we see in the following code block:

> weatherRF$model$err.rate

In Rattle, we saw an error plot that showed the change in error rate as more trees are added to the forest. We can obtain the actual data behind the plot quite easily:

> round(head(weatherRF$model$err.rate, 15), 4)
         OOB     No    Yes
 [1,]    ...    ...    ...
  ...
[15,]    ...    ...    ...

Here we see that the OOB estimate decreases quickly and then starts to flatten out. We can find the minimum quite simply, together with a list of the indexes where each minimum occurs:

> evalq({
      min.err <- min(data.frame(model$err.rate)["OOB"])
      min.err.idx <- which(data.frame(model$err.rate)["OOB"] == min.err)
  }, weatherRF)

The actual minimum value together with the indexes can be listed:

> weatherRF$min.err
[1] ...
> weatherRF$min.err.idx
[1] ...

We can then list the actual models where the minimum occurs:

> weatherRF$model$err.rate[weatherRF$min.err.idx,]
        OOB     No    Yes
[1,]    ...    ...    ...
[2,]    ...    ...    ...
[3,]    ...    ...    ...
[4,]    ...    ...    ...
[5,]    ...    ...    ...

We might thus decide that 12 (the first instance of the minimum OOB estimate) is a good number of trees to have in the forest.

Another interesting component is votes, which records the proportion of trees that vote No and Yes within the ensemble for a particular observation.

> head(weatherRF$model$votes)
        No   Yes
 ...   ...   ...

The numbers are reported as proportions and so add up to 1 for each observation, as we can confirm:

> head(apply(weatherRF$model$votes, 1, sum))
1 1 1 1 1 1

12.5 Tuning Parameters

Rattle provides access to just a few basic tuning options (Figure 12.1) for the random forest algorithm. The user interface allows the number of trees, the number of variables, and the sample size to be specified. As is generally the case with Rattle, the defaults are a good starting point!

These result in 500 trees being built, choosing from the square root of the number of variables available for each node, and no sampling of the training dataset to balance the classes. In Figure 12.1, we see that the number of variables has automatically been set to 4 for the weather dataset, which has 20 input variables.

The user interface options correspond to the function arguments ntree=, mtry=, and sampsize=. Rattle also sets importance= to TRUE, replace= to FALSE, and na.action= to na.roughfix(). A call to randomForest() including all arguments covered here will look like:

> evalq({
      model <- randomForest(formula=form,
                            data=data[train, vars],
                            ntree=500, mtry=4,
                            replace=FALSE, sampsize=.632*nobs,
                            importance=TRUE, localImp=FALSE,
                            na.action=na.roughfix)
  }, weatherRF)

Number of Trees (ntree=)

This specifies how many trees are to be built to populate the random forest. The default value is 500, and a common recommendation is that a minimum of 100 trees be built. The performance of the resulting random forest model tends not to degrade as the number of trees increases, though computationally it will take longer and will be more complex to use when scoring, and often there is little to gain from adding too many trees to a forest. The error matrix and error plot provide a guide to a good number of trees to include in a forest. See Section 12.4 for examples.

Number of Variables (mtry=)

The number of variables to consider for splitting at every node is specified by mtry=. This many variables will be randomly selected from all of those available each time we look to partition a dataset in the process of

building the decision tree. The general default value is the square root of the total number of variables available for classification tasks, and one-third of the number of available variables for regression.

If there are many noise variables (i.e., variables that play little or no role in predicting the outcome), then we might consider increasing the number of variables considered at each node to ensure we have some relevant variables to choose from.

Sample Size (sampsize=)

The sample size argument can be used to force the algorithm to select a smaller sample size than the default or to sample the observations differently based on the output variable values (for classification tasks). For example, if our training dataset contains 5,000 observations for which it does not rain tomorrow and only 500 for which it does rain tomorrow, we can specify the sample size as 400,400, for example, to have equal weight on both outcomes. This provides a mechanism for effectively setting the prior probabilities. See Section 12.4 for an example of doing this in Rattle, and the code sketch below for doing so directly in R.

Variable Importance (importance=)

The importance argument allows us to review the importance of each variable in determining the outcome. Two importance measures are calculated in addition to the importance of the variable in relation to each outcome in a classification task. These have been described in Section 12.4, and issues with the measures are discussed in Section 12.6.

Sampling with Replacement (replace=)

By default, when the training observations are sampled for building a particular tree within the forest, the sampling is performed with replacement. This means that any particular observation might appear multiple times within the sample, and thus some observations get overrepresented in some datasets. This is a feature of the approach. The replace= argument set to FALSE will perform sampling without replacement.
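As a sketch of the sampsize= mechanism in R, using the containers from Section 12.4, the following asks for 35 observations of each class when growing each tree, mirroring the Rattle example earlier in this chapter (model.bal is simply our own name for the resulting object):

> evalq({
      model.bal <- randomForest(formula=form,
                                data=data[train, vars],
                                ntree=500, mtry=4,
                                sampsize=c(35, 35),
                                importance=TRUE,
                                na.action=na.roughfix)
      model.bal$confusion  # expect a lower Yes class error, more false alarms
  }, weatherRF)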

Handling Missing Values (na.action=)

The implementation of the randomForest() algorithm does not directly handle missing values. A common approach on finding missing values is simply to ignore the observations with missing values by specifying na.omit as the value of na.action=. For some data, this could actually end up removing all observations from the training dataset. Another quick option is to replace missing values with the median (for numeric data) or the most frequent value (for categoric data) using na.roughfix.

12.6 Discussion

Brief History and Alternative Approaches

The concept of an ensemble of experts was something that the knowledge-based and expert systems research communities were exploring in the 1980s. Some early work on building and combining multiple decision trees was undertaken at that time (Williams, 1988). Multiple decision trees were built by choosing different variables at nodes where the choice of variable was not clear. The resulting ensemble was found to produce a better predictive model.

Ho (1995, 1998) then developed the concept of randomly sampling variables to build the ensemble of decision trees. Half of the available variables were randomly chosen for building each of the decision trees. She noted that as more trees were added to the ensemble, the predictive performance increased, mostly monotonically.

Breiman (2001) built on the idea of randomly sampling variables by introducing random sampling of variables at each node as the decision tree is built. He also added the concept of bagging (Breiman, 1996), where different random samples of the dataset are chosen as the training dataset for each tree. His algorithm is in common use today, and his actual implementation can be accessed within R through randomForest.

In some situations we will have available a huge number of variables to choose from. Often only a small proportion of the available variables will have some influence on the target variable. By randomly selecting a small proportion of the available variables we will often miss the more relevant variables in building our trees. An approach to address this situation introduces a weighted variable selection scheme to implement an enriched random forest (Amaratunga

et al., 2008). Weights can be based on the q-value, derived from the p-value for a two-sample t-test. We test for a group mean effect of a variable, testing how well the variable can separate the values of the binary target variable. The resulting weights then bias the random selection of variables toward those that have more influence on the target variable. An extension to this method allows it to work when the target variable has more than two values. In that case we can use a chi-square or information gain measure. The approach can be shown to produce considerably more accurate models, by ensuring each decision tree has a high degree of independence from the other trees and by weighting the sampling of the variables to ensure important variables are selected for each tree.

Using Other Random Forests

The randomForest() function can also be applied to regression tasks, survival analysis, and unsupervised clustering (Shi and Horvath, 2006).

Limitation on Categories

An issue with the implementation of random forests in R is that it cannot handle categoric data with more than 32 categoric values. Statistical concerns also suggest that categoric variables with more than 32 categories don't make a lot of sense, and thus little effort has been made in the R package to rectify the issue.

Importance Measures

We introduced the idea of measures of variable importance in building a model in Section 12.4. There we looked at the mean decrease in accuracy and the mean decrease in the Gini index as two measures calculated whilst the trees of the random forest are being built.

These variable importance measures provided by randomForest() have been found to be unreliable under certain conditions. The issue particularly arises where there is a mix of numeric and categoric variables or the numeric variables have quite different scales (e.g., Age versus Income), or when categoric variables have very different numbers of categories (Strobl et al., 2007). Less important variables can end up having too high an importance according to the measures used, and thus we will

be misled into believing the measures provided. Indeed, the Gini measure can be quite biased, so that categorics with many categories obtain a higher importance.

The cforest() function of party (Hothorn et al., 2006) provides an improved importance measure. This newer measure can be applied to any dataset, using subsampling without replacement, to give a more reliable measure of variable importance. A key aspect is that rather than sampling the data with replacement to obtain a same-size sample, a random subsample is used. Underneath, cforest() builds conditional decision trees by using the ctree() function discussed in Chapter 11.

In the following code block we first load party and create a new data structure to store our forest object, attaching the weather dataset container to the object:

> library(party)
> weatherCFOREST <- new.env(parent=weatherDS)

Now we can build the model itself with a call to cforest():

> evalq({
      model <- cforest(form,
                       data=data[vars],
                       controls=cforest_unbiased(ntree=50, mtry=4))
  }, weatherCFOREST)

We could now explore the resulting forest, but here we will simply list the top few most important variables, according to the measure used by party:

> evalq({
      varimp <- as.data.frame(sort(varimp(model), decreasing=TRUE))
      names(varimp) <- "Importance"
      head(round(varimp, 4), 3)
  }, weatherCFOREST)
            Importance
Pressure3pm        ...
Sunshine           ...
Cloud3pm           ...
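To gauge the predictive performance of the conditional forest itself, we can ask predict() for the out-of-bag predictions, which it supports through the OOB= argument. A brief sketch:

> evalq({
      pred <- predict(model, OOB=TRUE)
      round(100 * mean(pred != data[[target]]), 2)  # OOB error rate (%)
  }, weatherCFOREST)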

12.7 Summary

A random forest is an ensemble (i.e., a collection) of unpruned decision trees. Random forests are often used when we have very large training datasets and a very large number of input variables (hundreds or even thousands of input variables). A random forest model is typically made up of tens or hundreds of decision trees.

The generalisation error rate from random forests tends to compare favourably with boosting approaches (see Chapter 13), yet the approach tends to be more robust to noise in the training dataset and so tends to be a very stable model builder, as it does not suffer the sensitivity to noise in a dataset that single-decision-tree induction does. The general observation is that the random forest model builder is very competitive with nonlinear classifiers such as artificial neural nets and support vector machines. However, performance is often dataset-dependent, so it remains useful to try a suite of approaches.

Each decision tree is built from a random subset of the training dataset, sampled with what is called replacement (thus doing what is known as bagging). That is, some observations will be included more than once in the sample, and others won't appear at all. Generally, about two-thirds of the observations will be included in the subset of the training dataset and one-third will be left out.

In building each decision tree model based on a different random subset of the training dataset, a random subset of the available variables is used to choose how best to partition the dataset at each node. Each decision tree is built to its maximum size, with no pruning performed. Together, the resulting decision tree models of the forest represent the final ensemble model, where each decision tree votes for the result, and the majority wins. (For a regression model, the result is the average value over the ensemble of regression trees.)

In building the random forest model, we have options to choose the number of trees, the training dataset sample size for building each decision tree, and the number of variables to randomly select when considering how to partition the training dataset at each node. The random forest model builder can also report on the input variables that are actually most important in determining the values of the output variable.

By building each decision tree to its maximal depth (i.e., by not pruning the decision tree), we can end up with a model that is less biased. The randomness introduced by the random forest model builder in selecting

the dataset and the variables delivers considerable robustness to noise, outliers, and overfitting when compared with a single tree classifier.

The randomness also delivers substantial computational efficiencies. In building a single decision tree, the model builder may select a random subset of the training dataset. Also, at each node in the process of building the decision tree, only a small fraction of all of the available variables are considered when determining how best to partition the dataset. This substantially reduces the computational requirement.

In summary, a random forest model is a good choice for model building for a number of reasons. First, just like decision trees, very little, if any, preprocessing of the data needs to be performed. The data does not need to be normalised and the approach is resilient to outliers. Second, if we have many input variables, we generally do not need to do any variable selection before we begin model building. The random forest model builder is able to target the most useful variables. Third, because many trees are built and there are two levels of randomness, and each tree is effectively an independent model, the model builder tends not to overfit to the training dataset.

A key factor about a random forest being a collection of many decision trees is that each decision tree is not influenced by the other decision trees when constructed.

12.8 Command Summary

This chapter has referenced the following R packages, commands, functions, and datasets:

cforest()        function   Build a conditional random forest.
ctree()          function   Build a conditional inference tree.
evalq()          function   Access environment for storing data.
na.roughfix()    function   Impute missing values.
party            package    Conditional inference trees.
randomForest()   function   Implementation of random forests.
randomForest     package    Build ensemble of decision trees.
str()            function   Show the structure of an object.
weather          dataset    Sample dataset from rattle.

Chapter 13

Boosting

Figure: A sequence of models M1, M2, ..., Mn, each built from the training data and casting a vote in {-1, 1}, combined through a weighted sum into a final Yes/No decision.

The Boosting meta-algorithm is an efficient, simple, and easy-to-use approach to building models. The popular variant called AdaBoost (an abbreviation for adaptive boosting) has been described as the best off-the-shelf classifier in the world (attributed to Leo Breiman by Hastie et al. (2001, p. 302)). Boosting algorithms build multiple models from a dataset by using some other learning algorithm that need not be a particularly good learner.

Boosting associates weights with observations in the dataset and increases (boosts) the weights for those observations that are hard to model accurately. A sequence of models is constructed, and after each model is constructed the weights are modified to give more weight to those observations that are harder to classify. In fact, the weights of such observations generally oscillate up and down from one model to the next. The final model is then an additive model constructed from the sequence of models, each model's output being weighted by some score.

There is little tuning required and little is assumed about the learner used, except that it should be a weak learner! We note that boosting can fail to perform if there is insufficient data or if the weak models are overly complex. Boosting is also susceptible to noise.

Boosting algorithms are therefore similar to random forests in that an ensemble of models is built and then combined to deliver a better model

than any of the constituent models. The basic distinguishing characteristic of the boosting approach is that the trees are built one after another, with refinement being based on the previously built models. The concept is that after building one model any observations that are incorrectly classified by that model are boosted. A boosted observation is essentially given more prominence in the dataset, making the single observation overrepresented. This has the effect that the next model is more likely to correctly classify that observation. If not, then that observation will again be boosted.

In common with random forests, the boosting algorithms tend to be meta-algorithms. Any type of modelling approach might be used as the learning algorithm, but the decision tree algorithm is the usual approach.

13.1 Knowledge Representation

The boosting algorithm is commonly presented in terms of decision trees as the primary form for the representation of knowledge. The key extension to the knowledge representation is in the way that we combine the decisions that are made by the individual experts or models. For boosting, a weighted score is used, with each of the models in the ensemble having a weight corresponding to the quality of its expertise (e.g., the measured accuracy of the individual tree).

13.2 Algorithm

As a meta-learner, boosting employs any simple learning algorithm to build multiple models. Boosting often relies on the use of a weak learning algorithm: essentially any weak learner can be used. An ensemble of weak learners can lead to a strong model. A weak learning algorithm is one that is only slightly better than random guessing in terms of error rates (i.e., the model gets the decision wrong just less than 50% of the time). An early example was a decision tree of depth 1 (having a single split point and thus often referred to as a decision stump). Each weak model is slightly better than random, but as an ensemble delivers considerable accuracy.

The algorithm begins quite simply by building a weak initial model from the training dataset. Then, any observations in the training data that the model incorrectly classifies will have their importance within the

algorithm boosted. This is done by assigning all observations a weight (all observations might start, for example, with a weight of 1). Weights are boosted through a formula so that those that are wrongly classified by the model will have a weight greater than 1 for the building of the next decision stump.

A new model is built with these boosted observations. We can think of them as the problematic observations. The algorithm needs to take into account the weights of the observations in building a model. Consequently, the model builder effectively tries harder each iteration to correctly classify these difficult observations.

The process of building a model and then boosting observations incorrectly classified is repeated until a newly generated model performs no better than random. The result is then an ensemble of models to be used to make a decision on new data. The decision is arrived at by combining the expertises of each model in such a way that the more accurate models carry more weight.

We can illustrate the process abstractly with a simple example. Suppose we have ten observations. Each observation will get an initial weight of, let's say, 1/10, or 0.1. We build a decision tree that incorrectly classifies four observations (e.g., observations 7, 8, 9, and 10). We can calculate the sum of the weights of the misclassified observations as 0.4 (and generally we denote this as ε). This is a measure of the accuracy (actually the inaccuracy) of the model.

The ε is transformed into a measure used to update the weights and to provide a weight for the model when it forms part of the ensemble. This transformed value is α and is often something like 0.5 × log((1 − ε)/ε). The new weights for the misclassified observations can then be recalculated as e^α times the old weight. In our example, α = 0.2027 (i.e., 0.5 × log(0.6/0.4)), and so the new weights for observations 7, 8, 9, and 10 become 0.1 × e^α, or 0.1225.

The tree builder is called upon again, noting that some observations are effectively multiplied to have more representation in the algorithm. Thus a different tree is likely to be built that is more likely to correctly classify the observations that have higher weights (i.e., have more representation in the training dataset).

This new model will again have errors. Suppose this time that the model incorrectly classifies observations 1 and 8. Their current weights are 0.1 and 0.1225, respectively. Thus, the new ε is 0.2225/1.09, or 0.2041. The new α is then 0.6804. This is the weight that we give to this model when included in the ensemble. It is again used to modify the weights of the incorrectly classified observations, so that observation 1 gets a weight of 0.1 × e^α, or 0.1975, and observation 8's weight becomes 0.1225 × e^α, or 0.2419. So we can see that observation 8 now has the highest weight, since it seems to be quite a problematic observation.
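The weight arithmetic of this example is easily reproduced in R. This is only a sketch of the calculations above, with the observation indexes being those of our hypothetical example:

> w <- rep(0.1, 10)          # ten observations, equal initial weights
> mis <- c(7, 8, 9, 10)      # misclassified by the first model
> e1 <- sum(w[mis])/sum(w)   # 0.4
> a1 <- 0.5*log((1-e1)/e1)   # 0.2027
> w[mis] <- w[mis]*exp(a1)   # these weights become 0.1225
> mis <- c(1, 8)             # misclassified by the second model
> e2 <- sum(w[mis])/sum(w)   # 0.2225/1.09, or 0.2041
> a2 <- 0.5*log((1-e2)/e2)   # 0.6804
> w[mis] <- w[mis]*exp(a2)   # weights become 0.1975 and 0.2419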

The process continues until the individual tree that is built has an error rate of greater than 50%.

To deploy the individual models as an ensemble, each tree is used to classify a new observation. Each tree will provide a probability that it will rain tomorrow (a number between 0 and 1). For each tree, this is multiplied by the weight (α) associated with that tree. The final result is then calculated as the average of these predictions.

Actual implementations of the boosting algorithm use variations on the simple approach we have presented here. Variations are found in the formulas for updating the weights and for weighting the individual models. However, the overall concept remains the same.

13.3 Tutorial Example

Building a Model Using Rattle

The Boost option of the Model tab will build an ensemble of decision trees using the approach of boosting misclassified observations. The individual decision trees are built using rpart. The results of building the model are shown in the Textview area. Using the weather dataset (loaded by default in Rattle if we click Execute on starting up Rattle) we will see the Textview populated as in Figure 13.1.

The Textview begins with the usual summary of the underlying function call to build the model:

Summary of the Ada Boost model:

Call:
ada(RainTomorrow ~ .,
    data = crs$dataset[crs$train, c(crs$input, crs$target)],
    control = rpart.control(maxdepth = 30, cp = 0.01,
                            minsplit = 20, xval = 10),
    iter = 50)

Figure 13.1: Building an AdaBoost predictive model.

The model is to predict RainTomorrow based on the remainder of the variables. The data consists of the dataset loaded into Rattle, retaining only the observations whose index is contained in the training list and including all but columns 1, 2, and 23. The control= argument is passed directly to rpart() and has the same meaning as for rpart() (see Chapter 11). The number of trees to build is specified by the iter= argument.

The next line of information reports on some of the parameters used for building the model. We won't go into the details of the Loss and Method. Briefly, though, the Loss is exponential, indicating that the algorithm is minimising a so-called exponential loss function, and the Method used in the algorithm is discrete rather than gentle or real. The Iteration: item simply indicates the number of trees that were asked to be built.

Performance Evaluation

A confusion matrix presents the performance of the model over the training data, and the following line in the Textview reports the training

dataset error.

Final Confusion Matrix for Data:
          Final Prediction
True value  No Yes
       No  ... ...
       Yes ... ...

The out-of-bag error and the associated iteration are then reported. This is followed by suggestions of the number of iterations based on the training error and an error measure based on the Kappa statistic. The Kappa statistic adjusts for the situation where there are very different numbers of observations for each value of the target variable. Using these error estimates, the best number of iterations is suggested:

Out-Of-Bag Error: ...  iteration= 41

Additional Estimates of number of iterations:
train.err1 train.kap1
       ...        ...

The actual training and Kappa (adjusted) error rates are then recorded:

train.err train.kap
      ...       ...

Time Taken

The ada() implementation takes longer than randomForest() because it is relying on the inbuilt rpart() rather than specially written Fortran code as is the case for randomForest().

Time taken: 1.62 secs

Error Plot

Once a boosted model has been built, the Error button will display a plot of the decreasing error rate as more trees are added to the model. The plot annotates the curve with a series of five 1s simply to identify the curve. (Extended plots can include curves for test datasets.)

Figure 13.2: The error rate as more trees are added to the ensemble.

Figure 13.2 shows the decreasing error as more trees are added. The plot is typical of ensembles, where the error rate drops off quite quickly early on and then flattens out as we proceed. We might decide, from the plot, a point at which we stop building further trees. Perhaps that is around 40 trees for our data.

Variable Importance

A measure of the importance of variables is also provided by ada (Culp et al., 2010). Figure 13.3 shows the plot. The measure is a relative measure, so the order and distance between the scores are more relevant than the actual scores. The measure calculates, for each tree, the improvement in accuracy that the variable chosen to split the dataset offers the model. This is then averaged over all trees in the ensemble.

Of the five most important variables, we notice that there are two categoric variables (WindDir9am and WindDir3pm). Because of the nature of how variables are chosen for a decision tree algorithm, there may well be a bias here in favour of categoric variables, so we might discount their importance. See Chapter 12, page 265 for a discussion of the issue.

Figure 13.3: The variable importance plot for a boosted model.

Tuning Options

A few basic tuning options for boosting are provided by the Rattle interface. The first option is the Number of Trees to build, which is set to 50 by default. The Max Depth, Min Split, and Complexity are as provided by the decision tree algorithm and are discussed in Chapter 11.

Adding Trees

The Continue button allows further trees to be added to the model. This allows us to easily explore whether the addition of further trees will offer much improvement in the performance of the model, without starting the modelling over again. To add further trees, increase the value specified in the Number of Trees text box and click the Continue button. This will pick up the model building from where it left off and build as many more trees as is needed to get up to the specified number of trees.

R

The package ada provides ada(), which implements the boosting algorithm deployed by Rattle. The ada() function itself uses rpart() from rpart to build the decision trees. With the default settings, a very reasonable model can be built.

We will step through the simple process of building a boosted model. First, we create the dataset object, as usual. This will encapsulate the weather dataset from rattle, together with a collection of other useful data about the weather dataset. A training sample is also identified.

> library(rattle)
> weatherDS <- new.env()
> evalq({
      data <- weather
      nobs <- nrow(weather)
      vars <- -grep('^(Date|Location|RISK_)', names(data))
      form <- formula(RainTomorrow ~ .)
      target <- all.vars(form)[1]
      set.seed(42)
      train <- sample(nobs, 0.7*nobs)
  }, weatherDS)

We can now build the boosted model based on this dataset. Once again we create a container for the model, and include the above container for the dataset within this container.

> library(ada)
> weatherADA <- new.env(parent=weatherDS)

Within this new container we now build our model.

> evalq({
      control <- rpart.control(maxdepth=30, cp=0.01,
                               minsplit=20, xval=10)
      model <- ada(formula=form,
                   data=data[train, vars],
                   control=control,
                   iter=50)
  }, weatherADA)

We can obtain a basic overview of the model simply by printing its value, as we do in the following code block (note that the results here may vary slightly between 32 bit and 64 bit implementations of R).

> weatherADA$model
Call:
ada(form, data = data[train, vars], control = control,
    iter = 50)

Loss: exponential Method: discrete   Iteration: 50

Final Confusion Matrix for Data:
          Final Prediction
True value  No Yes
       No  ... ...
       Yes ... ...

Train Error: 0.07

Out-Of-Bag Error: ...  iteration= 38

Additional Estimates of number of iterations:
train.err1 train.kap1
       ...        ...

The summary() command provides a little more detail.

> summary(weatherADA$model)
Call:
ada(form, data = data[train, vars], control = control,
    iter = 50)

Loss: exponential Method: discrete   Iteration: 50

Training Results
Accuracy: 0.93 Kappa: ...

Replicating AdaBoost Directly Using rpart()

We can replicate the boosting process directly using rpart(). We will illustrate this as an example of a little more sophistication in R coding.

We will first load the weather dataset and extract the input variables (x) and the output variable (y). To simplify some of the mathematics, we will map the predictions to -1/1 rather than 0/1 (since then a model that predicts a value greater than 0 is a positive example and one below zero is a negative example). The data is encapsulated within a container called weatherBRP.

> library(rpart)
> weatherBRP <- new.env()
> evalq({
      data <- weather
      vars <- -grep('^(Date|Location|RISK_)', names(data))
      target <- "RainTomorrow"
      N <- nrow(data)
      M <- ncol(data) - length(vars)
      data$Target <- rep(-1, N)
      data$Target[data[target] == "Yes"] <- 1
      vars <- c(vars, -(ncol(data)-1))  # Remove old target
      form <- formula(Target ~ .)
      target <- all.vars(form)[1]
  }, weatherBRP)

The first few observations show the mapping from the original target, which has the values No and Yes, to the new numeric values -1 and 1.

> head(weatherBRP$data[c("RainTomorrow", "Target")])
  RainTomorrow Target
1          Yes      1
2          Yes      1
3          Yes      1
4          Yes      1
5           No     -1
6           No     -1

We can check the list of variables available (only checking the first few here), and note that we exclude four from our analysis:

> head(names(weatherBRP$data))
[1] "Date"        "Location"    "MinTemp"
[4] "MaxTemp"     "Rainfall"    "Evaporation"
> weatherBRP$vars
[1]  -1  -2 -23 -24

Now we can initialise the observation weights, which to start with are all the same (1/N):

> evalq({
      w <- rep(1/N, N)
  }, weatherBRP)
> round(head(weatherBRP$w), 4)
[1] 0.0027 0.0027 0.0027 0.0027 0.0027 0.0027

Next we build the first model. The rpart() function, conveniently, has a weights argument, and we simply pass to it the calculated weights stored in w. We also set up rpart.control() for building a decision tree stump. The control simply includes maxdepth=, setting it to 1 so that a single-level tree is built:

> evalq({
      control <- rpart.control(maxdepth=1)
      M1 <- rpart(formula=form, data=data[vars],
                  weights=w/mean(w),
                  control=control, method="class")
  }, weatherBRP)

We can then display the first model:

> weatherBRP$M1
n= 366

node), split, n, loss, yval, (yprob)
      * denotes terminal node

1) root 366 66 -1 (0.82 0.18)
  2) Humidity3pm< 71.5 ... -1 (0.86 0.14) *
  3) Humidity3pm>=71.5 ...  1 (... ...) *

We see that the decision tree algorithm has chosen Humidity3pm on which to split the data, at a split point of 71.5. For Humidity3pm < 71.5 the decision is -1 (it won't rain) with probability 0.86, and for Humidity3pm >= 71.5 the decision is 1.

We now need to find those observations that are incorrectly classified by the model. The R code here calls predict() to apply the model M1 to the dataset from which it was built. From this result, we get the second column. This is the list of probabilities of each observation being in class 1. If this probability is above 0.5, then the result is 1, otherwise it is -1 (multiplying the logical value by 2 and then subtracting 1 achieves this, since TRUE is regarded as 1 and FALSE as 0). The resulting class is then compared with the target, and which() returns the indexes of those observations for which the prediction differs from the actual class:

> evalq({
      ms <- which(((predict(M1)[,2] > 0.5)*2)-1 != data[target])
      names(ms) <- NULL
  }, weatherBRP)

The indexes of the first few of the 53 misclassified observations can be listed:

> evalq({
      cat(paste(length(ms), "observations incorrectly classified:\n"))
      head(ms)
  }, weatherBRP)
53 observations incorrectly classified:
[1] ...

We now calculate the model weight, based on the weighted error rate of this decision tree, dividing by the total sum of weights to get a normalised value (sum(w) is 1 at this point, but will not remain so after the updates):

> evalq({e1 <- sum(w[ms])/sum(w); e1}, weatherBRP)
[1] 0.1448087

The adjustment is then calculated:

> evalq({a1 <- log((1-e1)/e1); a1}, weatherBRP)
[1] 1.775911

We then update the observation weights:

> round(head(weatherBRP$w[weatherBRP$ms]), 4)
[1] 0.0027 0.0027 0.0027 0.0027 0.0027 0.0027
> evalq({w[ms] <- w[ms]*exp(a1)}, weatherBRP)
> round(head(weatherBRP$w[weatherBRP$ms]), 4)
[1] 0.0161 0.0161 0.0161 0.0161 0.0161 0.0161

A second model can now be built:

> evalq({
      M2 <- rpart(formula=form, data=data[vars],
                  weights=w/mean(w),
                  control=control, method="class")
  }, weatherBRP)

This results in a simple decision tree involving the variable Pressure3pm:

> weatherBRP$M2
n= 366

node), split, n, loss, yval, (yprob)
      * denotes terminal node

1) root 366 ... -1 (... ...)
  2) Pressure3pm>=... ... -1 (... ...) *
  3) Pressure3pm< ... ...  1 (... ...) *

Once again we identify the misclassified observations:

> evalq({
      ms <- which(((predict(M2)[,2] > 0.5)*2)-1 != data[target])
      names(ms) <- NULL
  }, weatherBRP)

There are 118 of them:

> evalq({length(ms)}, weatherBRP)
[1] 118

The indexes of the first few can also be listed:

> evalq({head(ms)}, weatherBRP)
[1] ...

We again boost the misclassified observations, first calculating the weighted error rate of the decision tree:

> evalq({e2 <- sum(w[ms])/sum(w); e2}, weatherBRP)
[1] ...

The adjustment is calculated:

> evalq({a2 <- log((1-e2)/e2); a2}, weatherBRP)
[1] ...

The adjustments are then made to the weights of the individual observations that were misclassified:

> round(head(weatherBRP$w[weatherBRP$ms]), 4)
[1] ...
> evalq({w[ms] <- w[ms]*exp(a2)}, weatherBRP)
> round(head(weatherBRP$w[weatherBRP$ms]), 4)
[1] ...

A third (and for our purposes the last) model can then be built:

> evalq({
      M3 <- rpart(formula=form, data=data[vars],
                  weights=w/mean(w),
                  control=control, method="class")
      ms <- which(((predict(M3)[,2] > 0.5)*2)-1 != data[target])
      names(ms) <- NULL
  }, weatherBRP)

Again we identify the misclassified observations:

> evalq({length(ms)}, weatherBRP)
[1] 145

Calculate the error rate:

> evalq({e3 <- sum(w[ms])/sum(w); e3}, weatherBRP)
[1] ...

Calculate the adjustment:

> evalq({a3 <- log((1-e3)/e3); a3}, weatherBRP)
[1] ...

We can then finally adjust the weights (in case we decide to continue building further decision trees):

> round(head(weatherBRP$w[weatherBRP$ms]), 4)
[1] ...
> evalq({w[ms] <- w[ms]*exp(a3)}, weatherBRP)
> round(head(weatherBRP$w[weatherBRP$ms]), 4)
[1] ...

The final (combined or ensemble) model, if we choose to stop here, is then

M(x) = a1 M1(x) + a2 M2(x) + a3 M3(x),

using the model weights a1, a2, and a3 calculated above.

13.4 Tuning Parameters

A number of options are given by Rattle for boosting a decision tree model. We briefly review them here.

Number of Trees (iter=50)

The number of trees to build is specified by the iter= argument. The default is to build 50 trees.

Bagging (bag.frac=0.5)

Bagging is used to randomly sample the supplied dataset. The default is to select a random sample of 50% of the population.

13.5 Discussion

References

Boosting originated with Freund and Schapire (1995). Building a collection of models into an ensemble can reduce misclassification error, bias, and variance (Bauer and Kohavi, 1999; Schapire et al., 1997). The original formulation of the algorithm adjusts all weights each iteration: weights are increased if the corresponding record is misclassified or decreased if it is correctly classified. The weights are then further normalised each iteration to ensure they continue to represent a distribution

(so that the weights sum to one: Σ w_j = 1). This can be simplified, as by Hastie et al. (2001), to increase only the weights of the misclassified observations.

A number of R packages implement boosting. We have covered ada here, and this is the package presently used by Rattle. caTools (Tuszynski, 2009) provides LogitBoost(), which is simple to use and an efficient implementation for very large datasets, using a carefully crafted implementation of decision stumps as the weak learners. gbm (Ridgeway, 2010) implements generalised boosted regression, providing a more widely applicable boosting algorithm. mboost (Hothorn et al., 2011) is another alternative, offering model-based boosting.

The variable importance measure implemented for ada() is described by Hastie et al. (2001).

Alternating Decision Trees Using Weka

An alternating decision tree (Freund and Mason, 1997) combines the simplicity of a single decision tree with the effectiveness of boosting. The knowledge representation combines tree stumps, a common model deployed in boosting, into a decision tree type structure.

A key characteristic of the tree representation is that the different branches are no longer mutually exclusive. The root node is a prediction node and has just a numeric score. The next layer of nodes are decision nodes and are essentially a collection of decision tree stumps. The next layer then consists of prediction nodes, and so on, alternating between prediction nodes and decision nodes.

A model is deployed by identifying the possibly multiple paths from the root node to the leaves, through the alternating decision tree, that correspond to the values for the variables of an observation to be classified. The observation's classification score (or measure of confidence) is the sum of the prediction values along the corresponding paths.

The alternating decision tree algorithm is implemented in the Weka data mining suite. Weka is available directly from R through RWeka (Hornik et al., 2009), which provides its comprehensive collection of data mining tools within the R framework. A simple example will illustrate the incredible power that this offers, using R as a unifying interface to an extensive collection of data mining tools. We can build an alternating decision tree in R using RWeka after installing the appropriate Weka package:

> library(RWeka)
> WPM("refresh-cache")
> WPM("install-package", "alternatingDecisionTrees")

We use make_Weka_classifier() to turn a Weka object into an R function:

> WPM("load-package", "alternatingDecisionTrees")
> cpath <- "weka/classifiers/trees/ADTree"
> ADT <- make_Weka_classifier(cpath)

We can obtain some background information about the resulting function by printing the value of the resulting variable:

> ADT
An R interface to Weka class 'weka.classifiers.trees.ADTree',
which has information

  Class for generating an alternating decision tree. The basic
  algorithm is based on:
  [...]

Argument list:
  (formula, data, subset, na.action,
   control = Weka_control(), options = NULL)

Returns objects inheriting from classes:
  Weka_classifier

The function WOW(), standing for Weka option wizard, will list the command line arguments that become available with the generated function, as seen in the following code block:

> WOW(ADT)
-B      Number of boosting iterations. (Default = 10)
        Number of arguments: 1.
-E      Expand nodes: -3(all), -2(weight), -1(z_pure),
        >=0 seed for random walk (Default = -3)
        Number of arguments: 1.
-D      Save the instance data with the model

Next we perform our usual model building. As always, we first create a container for the model, making available the appropriate dataset container for use from within this new container:

> weatherADT <- new.env(parent=weatherDS)

The model is built as a simple call to ADT():

> evalq({
      model <- ADT(formula=form, data=data[train, vars])
  }, weatherADT)

The resulting alternating decision tree can then be displayed as we see in the following code block.

> weatherADT$model
Alternating decision tree:

: ...
(1)Pressure3pm < ...: ...
(1)Pressure3pm >= ...: ...
(3)Temp3pm < 14.75: ...
(3)Temp3pm >= 14.75: ...
(2)Sunshine < 8.85: ...
(4)WindSpeed9am < 6.5: ...
(4)WindSpeed9am >= 6.5: ...
(8)Sunshine < 6.55: ...
(9)Temp3pm < 18.75: ...
(9)Temp3pm >= 18.75: ...
(8)Sunshine >= 6.55: ...
(2)Sunshine >= 8.85: ...
(6)MaxTemp < 24.35: ...
(6)MaxTemp >= 24.35: ...
(7)Sunshine < 10.9: ...
(7)Sunshine >= 10.9: ...
(5)Pressure3pm < ...: ...
(5)Pressure3pm >= ...: ...
(10)MaxTemp < 19.55: ...
(10)MaxTemp >= 19.55: ...

Legend: -ve = No, +ve = Yes

Tree size (total number of nodes): 31
Leaves (number of predictor nodes): 21

We can then explore how well the model performs:

> evalq({
      predictions <- predict(model, data[-train, vars])
      table(predictions, data[-train, target],
            dnn=c("Predicted", "Actual"))
  }, weatherADT)
         Actual
Predicted  No Yes
      No  ... ...
      Yes ... ...

Compare this with the ada() generated model:

> evalq({
      predictions <- predict(model, data[-train, vars])
      table(predictions, data[-train, target],
            dnn=c("Predicted", "Actual"))
  }, weatherADA)
         Actual
Predicted  No Yes
      No  ... ...
      Yes   7  15

In this example, the ada() model performs better than the ADT() model.

13.6 Summary

Boosting is an efficient, simple, and easy-to-understand model building strategy that tends not to overfit our data, hence building good models. The popular variant called AdaBoost (an abbreviation for adaptive boosting) has been described as the best off-the-shelf classifier in the world (attributed to Leo Breiman by Hastie et al. (2001, p. 302)). Boosting algorithms build multiple models from a dataset, using some other model builder, such as a decision tree builder or neural network, that need not be a particularly good model builder.

The basic idea of boosting is to associate a weight with each observation in the dataset. A series of models are built and the weights are increased (boosted) if a

13.7 Command Summary

This chapter has referenced the following R packages, commands, functions, and datasets:

ada()            function   Implementation of AdaBoost.
ada              package    Builds AdaBoost models.
caTools          package    Provides LogitBoost().
gbm              package    Generalised boosted regression.
LogitBoost()     function   Alternative boosting algorithm.
predict()        function   Applies a model to a new dataset.
randomForest()   function   Implementation of random forests.
rattle           package    The weather dataset and GUI.
rpart()          function   Builds a decision tree model.
rpart.control()  function   Controls the options ada() passes to rpart().
rpart            package    Builds decision tree models.
RWeka            package    Interface to the Weka software.
summary()        function   Summarise an ada model.
which()          function   Elements of a vector that are TRUE.
WOW()            function   The Weka option wizard.


Chapter 14

Support Vector Machines

A support vector machine (SVM) searches for so-called support vectors, which are the observations that lie at the edge of the region in space forming the boundary between one class of observations (e.g., the squares) and another class of observations (e.g., the circles). In the terminology of SVMs, we talk about the space between these two regions as the margin between the classes.

[Figure: two classes plotted against Pressure9am and WindGustSpeed, showing the support vectors and the margin bounded by the hyperplanes wx + b = -1, wx + b = 0, and wx + b = 1.]

Each region contains observations with the same value for the target variable (i.e., the class). The support vectors, and only the support vectors, are used to identify a hyperplane (a straight line in two dimensions) that separates the classes. The maximum margin between the separable classes is sought. This then represents the model.

It is usually quite rare that we can separate the data with a straight line (or a hyperplane when we have more than two input variables). That is, the data is not usually distributed in such a way that it is linearly separable. When this is the case, a technique is used to combine (or remap) the data in different ways, creating new variables so that the classes are then more likely to become linearly separable by a hyperplane (i.e., so that in the new, higher-dimensional data there is a gap between observations in the two classes). We can use the model we have built to score new observations by mapping the data in the same way as when the model was built, and then deciding on which side of the hyperplane the observation lies and hence the decision associated with it.

Support vector machines have been found to perform well on problems that are nonlinear, sparse, and high-dimensional. A disadvantage is that the algorithm is sensitive to the choice of tuning options (e.g., the type of transformation to perform), making it harder to use and time-consuming to identify the best model. Another disadvantage is that the transformations performed can be computationally expensive and are performed both whilst building the model and when scoring new data.

An advantage of the method is that the modelling only deals with the support vectors rather than the whole training dataset, and so the size of the training set is not usually an issue. Also, as a consequence of only using the support vectors to build a model, the model is less affected by outliers.

14.1 Knowledge Representation

The approach taken by a support vector machine model is to build a linear model; that is, to identify a line (in two dimensions) or a flat plane (in multiple dimensions) that separates observations with different values of the target variable. If we can find such a line or plane, then we find one that maximises the area between the two groups (when we are looking at binary classification, as with the weather dataset).

Consider a simple case where we have just two input variables (i.e., two-dimensional space). We will choose Pressure3pm and Sunshine. We will also purposefully select observations that will clearly demonstrate a significant separation (margin) between the observations for which it rains tomorrow and those for which it does not. The R code here illustrates our selection of the data and the drawing of the plot. From the weather dataset, we only select observations that meet a couple of conditions, to get two clumps of observations, one with No and the other with Yes for RainTomorrow (see Figure 14.1):

> library(rattle)
> obs <- with(weather,
              Pressure3pm+Sunshine > 1032 |
              (Pressure3pm+Sunshine < 1020 & RainTomorrow == "Yes"))
> ds <- weather[obs,]
> with(ds, plot(Pressure3pm, Sunshine,
                pch=as.integer(RainTomorrow),
                col=as.integer(RainTomorrow)+1))
> lines(c(1016.2, ), c(0, 12.7))
> lines(c(1032.8, ), c(0, 12.7))
> legend("topleft", c("Yes", "No"), pch=2:1, col=3:2)

Figure 14.1: A simple and easily linearly separable collection of observations.

Two lines are shown in Figure 14.1 as two possible linear models. Taking either line as a model, the observations to the left will be classified as Yes and those to the right as No. However, there is an infinite collection of possible lines that we could draw to separate the two regions.
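To see what "taking a line as a model" means computationally, here is a minimal sketch of the decision rule w.x + b from the figure at the start of the chapter. The weight vector w and bias b below are illustrative guesses for this two-variable example, not values from any fitted model:

# Classify observations by which side of the line (hyperplane) they fall on.
# The weights w and bias b are illustrative only.
w <- c(-1, -0.1)                 # coefficients for Pressure3pm and Sunshine
b <- 1016                        # bias (intercept) term
x <- as.matrix(ds[, c("Pressure3pm", "Sunshine")])
side <- x %*% w + b              # w.x + b: the sign determines the class
prediction <- ifelse(side > 0, "Yes", "No")
table(prediction, ds$RainTomorrow)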

The support vector approach suggests that we find a line in the separating region such that we can make the line as thick as possible to butt up against the observations on the boundary. We choose the line that fills up the maximum amount of space between the two regions, as in Figure 14.2. The observations that butt up against this region are the support vectors.

Figure 14.2: A maximal region or margin between the two classes of observations.

This is the representation of the model that we build using the approach of identifying support vectors and a maximal region between the classifications. The approach generalises to multiple dimensions (i.e., many input variables), where we search for hyperplanes that maximally fill the space between classes in the same way.

14.2 Algorithm

It is rarely the case that our observations are linearly separable. More likely, the data will appear as it does in Figure 14.3, which was generated directly from the data.

> ds <- weather
> with(ds, plot(Pressure3pm, Sunshine,
                pch=as.integer(RainTomorrow),
                col=as.integer(RainTomorrow)+1))
> legend("topleft", c("Yes", "No"), pch=2:1, col=3:2)

Figure 14.3: A nonlinearly separable collection of observations.

This kind of situation is where the kernel functions play a role. The idea is to introduce some other derived variables that are obtained from the input variables but combined and mathematically changed in some nonlinear way. Rather simple examples could be to add a new variable which squares the value of Pressure3pm, and another new variable that multiplies Pressure3pm by Sunshine.

Adding such variables to the data can enhance separation. Figure 14.4 illustrates the resulting location of the observations, showing an improvement in separation (though to artificially exaggerate the improvement, not all points are shown).

Figure 14.4: Nonlinearly transformed observations showing Pressure3pm squared (x-axis) against Pressure3pm multiplied by Sunshine (y-axis), artificially enhanced.

A genuine benefit is likely to be seen when we add further variables to our dataset. It becomes difficult to display such multi-dimensional plots on the page, but tools like GGobi can assist in visualising such data and confirming improved separation.

The basic kernel functions that are often used are the radial basis function, a linear function, and a polynomial function. The radial basis function uses a formula somewhat of the form exp(-gamma * ||x - x'||^2) for two observations x and x' (without going into the details). The polynomial function has the form (1 + <x, x'>)^d, for some integer d. For two input variables X1 and X2 and a power of d = 2, this becomes (1 + x1*x1' + x2*x2')^2. Again, we skip the actual details of how such formulas are used.
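As a quick check of these formulas, kernlab (introduced below) lets us evaluate the kernel functions directly. The data values and the sigma setting here are illustrative only:

# Evaluate the two kernels directly for a pair of observations.
library(kernlab)

x1 <- c(1010, 8.5)   # illustrative (Pressure3pm, Sunshine) values
x2 <- c(1018, 3.2)

rbf  <- rbfdot(sigma=0.1)             # exp(-sigma * ||x - x'||^2)
poly <- polydot(degree=2, offset=1)   # (1 + <x, x'>)^2
rbf(x1, x2)
poly(x1, x2)

# The same values from the formulas directly:
exp(-0.1 * sum((x1 - x2)^2))
(1 + sum(x1 * x2))^2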

There is a variety of kernel functions available, but the commonly preferred one, and a good place to start, is the radial basis function. Once the input variable space has been appropriately transformed, we then proceed to build a linear model as described in Section 14.1.

14.3 Tutorial Example

Building a Model Using Rattle

Rattle supports the building of support vector machine (SVM) models through kernlab (Karatzoglou et al., 2004). This package provides an extensive collection of kernel functions that can be applied to the data. This works by introducing new variables. Quite a variety of tuning options are provided by kernlab, but only a few are given through Rattle. It is quite easy to experiment with different kernels using the weather dataset provided. The default kernel (radial basis function) is a good starting point. Such models can be quite accurate with no or little tuning.

Two parameters are available for the radial basis function. C= is a penalty or cost parameter with a default value of 1. The Options widget can be used to set different values (e.g., C=10).

We review here the information provided to summarise the model, as displayed in Figure 14.5. The text view begins with the summary of the model, identifying it as an object of class (or type) ksvm (kernel support vector machine):

Summary of the SVM model (built using ksvm):

Support Vector Machine object of class "ksvm"

The type of support vector machine is then identified. The C-svc indicates that the standard regularised support vector classification (svc) algorithm is used, with parameter C for tuning the algorithm. The value used for C is also reported:

SV type: C-svc (classification)
parameter : cost C = 1

Figure 14.5: Building a support vector machine classification model.

An automatic algorithm is used to estimate another parameter (sigma) for the radial basis function kernel. The next two lines include an indication of the estimated value:

Gaussian Radial Basis kernel function.
Hyperparameter : sigma =

The remaining lines report on the characteristics of the model, including how many observations are on the boundary (i.e., the support vectors), the value of the so-called objective function that the algorithm optimises, and the error calculated on the training dataset:

Number of Support Vectors : 106
Objective Function Value :
Training error :
Probability model included.

Time Taken

The support vector machine is reasonably efficient:

Time taken: 0.16 secs

R

There is a wealth of functionality provided through kernlab and ksvm() for building support vector machine models. We will cover the basic functionality here. As usual, we begin with the dataset object from which we will be building our model:

> library(rattle)
> weatherDS <- new.env()
> evalq({
    data <- weather
    nobs <- nrow(weather)
    target <- "RainTomorrow"
    vars <- -grep('^(Date|Location|RISK_)', names(data))
    set.seed(42)
    train <- sample(nobs, 0.7*nobs)
    form <- formula(RainTomorrow ~ .)
  }, weatherDS)

We can now build the SVM model based on this dataset:

> library(kernlab)
> weatherSVM <- new.env(parent=weatherDS)
> evalq({
    model <- ksvm(form, data=data[train, vars],
                  kernel="rbfdot", prob.model=TRUE)
  }, weatherSVM)

The kernel= argument indicates that we will use the radial basis function as the kernel function. The prob.model= argument, set to TRUE, results in a model that predicts the probability of the outcomes. We obtain the usual overview of the model by simply printing its value:

> weatherSVM$model
Support Vector Machine object of class "ksvm"

SV type: C-svc (classification)
parameter : cost C = 1

Gaussian Radial Basis kernel function.
Hyperparameter : sigma =

Number of Support Vectors : 107
Objective Function Value :
Training error :
Probability model included.

14.4 Tuning Parameters

We describe here a number of tuning parameters, but many other options are available and are documented as part of kernlab.

Model Type    type=

The ksvm() function can be used for a variety of modelling tasks, depending on the type of target variable. We are generally interested in classification tasks using the so-called C-svc formulation (support vector classification with a C parameter for tuning). This is a standard formulation for SVMs and is referred to as regularised support vector classification. Other options here include nu-svc for automatically regularised support vector classification, one-svc for novelty detection, eps-svr for support vector regression that is robust to small (i.e., epsilon) errors, and nu-svr for support vector regression that automatically minimises epsilon. Other options are available.

Kernel Function    kernel=

The kernel method is the mechanism used to map our observations into a higher-dimensional space. It is then within this higher-dimensional space that the algorithm looks for a hyperplane that partitions our observations to find the maximal margin between the different values of the target variable.

The ksvm() function supports a number of kernel functions. A good starting point is the radial basis function (using a Gaussian type of function). This is the rbfdot option. The dot refers to the mathematical dot function or inner product between two vectors. This is integral to how support vector machines work, though not covered here. Other options include polydot for a polynomial kernel, vanilladot for a linear kernel, and splinedot for a spline kernel, amongst others.

Class Probabilities    prob.model=

If this is set to TRUE, then the resulting model will calculate class probabilities.

Kernel Parameter: Cost of Constraints Violation    C=

The cost parameter C= is by default 1. Larger values (e.g., 100 or 10,000) give more weight to the points near the decision boundary, whilst smaller values give relatively more weight to points further away from the decision boundary. Depending on the data, the choice of the cost argument may only play a small role in the resulting model.

Kernel Parameter: Sigma    sigma=

For a radial basis function kernel, the sigma value can be set. Rattle uses automatic sigma estimation (using sigest()) for this kernel. This will find a good sigma value based on the data. To experiment with various sigma values, we can use the R code from Rattle's Log tab, paste it into the R Console, and then add in the additional settings and run the model. This parameter tunes the selected kernel function and so is passed as part of the kpar= list, as in the sketch below.

Cross-Validation    cross=

We can specify an integer value here to indicate whether to perform k-fold cross-validation.
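A minimal sketch of experimenting with these options outside Rattle, assuming the weatherDS and weatherSVM containers from Section 14.3; the C=10 and cross=10 settings are illustrative choices of our own:

# Set an explicit sigma (estimated by sigest()) and request
# 10-fold cross-validation. Settings are illustrative only.
library(kernlab)
evalq({
  sig <- sigest(form, data=data[train, vars])[2]   # middle of three estimates
  model <- ksvm(form, data=data[train, vars],
                kernel="rbfdot", kpar=list(sigma=sig),
                C=10, cross=10, prob.model=TRUE)
  cross(model)    # the 10-fold cross-validation error
}, weatherSVM)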

14.5 Command Summary

This chapter has referenced the following R packages, commands, functions, and datasets:

kernlab    package    Kernel-based algorithms for machine learning.
ksvm()     function   Builds an SVM model.
rattle     package    The weather dataset and GUI.
sigest()   function   Sigma estimation for the kernel.
weather    dataset    Sample dataset from rattle.

Part III

Delivering Performance


Chapter 15

Model Performance Evaluation

If a model looks too good to be true, then generally it is. The preceding chapters presented a number of algorithms for building descriptive and predictive models. Before we can identify the best from amongst the different models, we must evaluate the performance of each model. This will allow us to understand what to expect when we use the model to score new observations. It can also help identify whether we have made any mistakes in our choice of input variables. A common error is to include as an input variable a variable that directly relates to the outcome (like the amount of rain tomorrow when we are predicting whether it will rain tomorrow). Such an input variable is, of course, exceptionally good at predicting the target.

In this chapter, we consider the issue of evaluating the performance of the models that we have built. Essentially, we consider predict(), provided by R and accessed through Rattle's Evaluate tab, and the functions that summarise and analyse the results of the predictions.

We will work through each of the approaches for evaluating model performance. We start with a simple table, called a confusion matrix, that compares predictions with actual answers. This also introduces the concepts of true positives, false positives, true negatives, and false negatives. We then explain a risk chart, which graphically compares the performance of the model against known outcomes and is used to identify a suitable trade-off between effort and outcomes. Traditional ROC curves are then introduced.

We finish with a discussion of simply scoring datasets and saving the results to a file.

In applying a model to a new dataset, the new dataset must contain all of the same variables, with the same data types, on which the model was built. This is true even if any variables were not used in the final model. If a variable is missing from the new dataset, then generally an error is generated.

15.1 The Evaluate Tab: Evaluation Datasets

Rattle's Evaluate tab provides access to a variety of options for evaluating the performance of our models. We can see the options listed in Figure 15.1. We briefly introduce the options here and expand on them in this chapter.

Figure 15.1: The Evaluate tab options.

Types of Evaluations

The range of different types of evaluations is presented as a series of radio buttons running from Confusion Matrix to Score. Only one type of evaluation is permitted to be chosen at any time. Each type of evaluation is presented in the following sections of this chapter.

Models to Evaluate

Below the row of evaluation types is a row of check boxes to choose the models we wish to evaluate. These check boxes are only available once a model has been built. As models are built, the check boxes will become available as options to check.

As we move from the Model tab to this Evaluate tab, the most recently built model will be automatically checked (and any previously checked model choices will be unselected). This corresponds to a common pattern of behaviour: often we will build and tune a model, then want to explore its performance by moving to this Evaluate tab. If the All option has been chosen within the Model tab, then all models that were successfully built will automatically be checked on the Evaluate tab. This is the case here, where the six predictive models are checked.

Dataset Used for Evaluation

To evaluate a model, we need to identify a dataset on which to perform the evaluation. The next row of options within the Rattle interface provides a collection of alternative sources of data.

The first four options for the Data correspond to the partitioning of the dataset specified on the Data tab. The options are Training, Validation, Testing, and Full. The concept of a training/validation/testing dataset partition was discussed in Section 3.1, and we discussed the concept of sampling and associated biases in Section 4.7. We now discuss it further in the context of evaluating the models.

The first option (but not the best option) is to evaluate our model on the Training dataset. This is generally not a good idea, and an information dialogue will be shown to reinforce this. The problem with evaluating our model on the training dataset is that we have built the model on this very dataset, and the model will often perform very well on it. It should, because we've tried hard to make sure it does. But this does not give us a very good idea of how well the model will perform in general on previously unseen data.

We need a better guide to how well the model will perform in general, that is, how the model will perform on new and previously unseen data. To answer that question, we need to apply the model to such data. In doing so, we will obtain the overall error rate of the model. This is simply the proportion of observations where the model's predictions and the actual known outcomes differ. This error rate, and not the error rate from the training dataset, will then be a better estimate of how well the model will perform. It is a less biased estimate of the error.

We use the Validation dataset to test the performance of a model whilst we are building and fine-tuning it. Thus, after building one decision tree, we will check its performance against this validation dataset.

We might then change some of the tuning options for building a decision tree model. We compare the new model against the old one based on its performance on the validation dataset. In this sense, the validation dataset is used during the modelling process to build the final model. Consequently, we will still have a biased estimate of the final performance of our model if we rely on the validation dataset for this measure.

The Testing dataset is then a hold-out dataset that has not been used at all during the model building. Once we have identified our best model based on using the validation dataset, the model's performance on the testing dataset is then assessed. This is then an estimate of the expected performance on any new data.

The fourth option uses the Full dataset for evaluating the model (the combined training, validation, and testing datasets). This might be seen as useful only as a curiosity rather than for accurate performance.

Another option available as a data source is provided through the Enter choice. This is available when Score is chosen as the type of evaluation. In this case, a window will pop up to allow us to directly enter some data and have it scored by the model.

The final two options for the data source are a CSV File and an R Dataset. These allow data to be loaded into R from a CSV file, and the model can be evaluated on that dataset. Alternatively, for a data frame already available through R, the R Dataset option will allow it to be chosen and the model evaluated on that.

Risk Variable

The final row of options begins with an informative label that reports the name of the Risk Variable chosen in the Data tab. The risk variable is used as a measure of how significant each observation is with respect to the target variable. For example, it might record the dollar value of the fraud or the amount of rain received tomorrow. The risk chart makes use of this variable, if there is one, and it is included in the common area of the Evaluate tab for information purposes only.

Scoring

The remaining options on the final row relate to scoring. Many models can predict an outcome or a probability for a particular outcome.

The Report option allows us to choose which of these we would like to see in the output when scoring. The Include option indicates whether to include all variables for each observation in the output or just the identifiers (those variables marked as having an Ident role on the Data tab).

A Note on Cross-Validation

In Section 2.7, we introduced the concept of partitioning our dataset into three samples: the training, validation, and testing datasets. This concept was then further discussed in Section 3.1 and in the section above. In considering each of the modelling algorithms, we also touched on the evaluation of the models, using the validation dataset, as part of the model building process. We have stressed that the testing dataset is used as the final unbiased estimate of the performance of a model.

A related paradigm for evaluating the performance of our models is through the use of cross-validation. Indeed, some of the algorithms implemented in R will perform cross-validation for us and report a performance measure based on it. The decision tree algorithm using rpart() is an example.

Cross-validation is a simple concept. Given a dataset, we partition it into, perhaps, ten random sample subsets. Then we build a model using nine of those subsets, combined to form the training dataset. We can then measure the performance of the resulting model on the held-out subset. Then we can repeat this by selecting a different nine subsets to use as a training dataset. Once again, the remaining subset will serve as a testing dataset. This can be repeated ten times to give us a measure of the expected performance of the resulting model.

A related concept, and one that we often find in the context of ensemble models, is that of out-of-bag, or OOB, measures of performance. We saw this concept when building a random forest model. We might recall that in building a random forest we sample a subset of the full dataset. That subset is used as the training dataset, and so the remaining dataset can be used to test the performance of the resulting model.

In those cases where the R implementation of an algorithm provides its own performance measure, using cross-validation or out-of-bag estimates, we might choose not to create a validation dataset in Rattle. Instead, we can rely on the measure supplied by the algorithm as we build and fine-tune our models. The testing dataset then remains useful to provide an unbiased measure once we have built our best models.
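As a minimal sketch of the procedure just described, using rpart() as the model builder (the ten-fold split and variable selection here are our own illustrative choices):

# A hand-rolled 10-fold cross-validation on the weather dataset.
library(rpart)
library(rattle)    # provides the weather dataset

set.seed(42)
vars  <- -grep('^(Date|Location|RISK_)', names(weather))
folds <- sample(rep(1:10, length.out=nrow(weather)))   # assign each row a fold

cv.errors <- sapply(1:10, function(k)
{
  # Train on nine folds, test on the held-out fold k.
  fit  <- rpart(RainTomorrow ~ ., data=weather[folds != k, vars])
  pred <- predict(fit, weather[folds == k, vars], type="class")
  mean(pred != weather$RainTomorrow[folds == k])       # fold error rate
})
mean(cv.errors)    # the cross-validation estimate of the error rate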

15.2 Measures of Performance

Quite a collection of measures has been developed over many years to gauge the performance of a model. The help page for performance() in ROCR (Sing et al., 2009) collects most of them together with a brief description, with some 30 measures listed. To review that list, using the R Console, simply ask for help:

> library(ROCR)
> help(performance)

We will discuss performance in the context of a binary classification model. This has been our focus with the weather dataset, predicting No or Yes for the variable RainTomorrow. For binary classification, we also often identify the predictions as positives or negatives. Thus, in terms of predicting whether it will rain tomorrow, Yes is the positive class and No is the negative class.

For an evaluation of a model, we apply the model to a dataset of observations with known actual outcomes (classes). The model will be used to predict the class for each observation. We then compare the predicted class with the actual class.

Error Rate

The simplest measure of the performance of a model is the error rate. This is calculated as the proportion of observations for which the model incorrectly predicts the class with respect to the actual class. That is, we simply divide the number of misclassifications by the total number of observations.

True and False Positives and Negatives

If our weather model predicts Yes in agreement with the actual outcome, then we refer to this as a true positive. Similarly, when they both agree on the negative, we refer to it as a true negative. On the other hand, when the model predicts No and the actual is Yes, then we have a false negative. Predicting a Yes when it is actually a No results in a false positive.
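A small sketch makes these four counts and the error rate concrete. The predictions and actuals below are illustrative made-up vectors, not output from any model in this book:

# Illustrative predictions and actuals (not from a fitted model).
predictions <- factor(c("No", "No", "Yes", "Yes", "No", "Yes"))
actuals     <- factor(c("No", "Yes", "Yes", "No", "No", "Yes"))

TP <- sum(predictions == "Yes" & actuals == "Yes")  # true positives:  2
TN <- sum(predictions == "No"  & actuals == "No")   # true negatives:  2
FP <- sum(predictions == "Yes" & actuals == "No")   # false positives: 1
FN <- sum(predictions == "No"  & actuals == "Yes")  # false negatives: 1

(FP + FN)/length(actuals)   # error rate: 2 out of 6, about 0.33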

Often it is useful to differentiate in this way between the types of misclassification errors that a model makes. For example, in the context of our weather dataset, it makes a difference whether we have a false positive or a false negative. A false positive would predict that it will rain tomorrow when in fact it does not. The consequence is that I might take my umbrella with me but won't need to use it, only a minor inconvenience. However, a false negative predicts that it won't rain tomorrow but in fact it does rain. Relying on the model, I would not bother with an umbrella. Consequently, I am caught in the rain and get uncomfortably wet. The consequences of a false negative in this case are more significant for me than those of a false positive.

Whether false positives or false negatives are more of an issue depends on the application. For medical applications, a false positive (e.g., falsely diagnosed with cancer) may be less of an issue than a false negative (e.g., the diagnosis of cancer being missed). Different model builders can deal with these situations in different ways. The decision tree algorithm, for example, can accept a loss matrix that gives different weights to the different outcomes. This will then bias the model building to avoid one type of error or the other.

Often, we are interested in the ratio of the number of true positives to the number of actual positives. This is referred to as the true positive rate, and similarly for the false positive rate and so on.

Precision, Recall, Sensitivity, Specificity

The precision of a model is the ratio of the number of true positives to the total number of predicted positives (the sum of the true positives and the false positives). It is a measure of how accurate the positive predictions are, or how precise the model is in predicting.

The recall of a model is just another name for the true positive rate. It is a measure of how many of the actual positives the model can identify, or how much the model can recall. The recall is also known as the sensitivity of the model.

Another measure that often arises in the context of sensitivity is specificity. This is simply another name for the true negative rate.
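Continuing the small illustration above, these measures are one-liners from the four counts:

precision   <- TP/(TP + FP)   # 2/3: how accurate the positive predictions are
recall      <- TP/(TP + FN)   # 2/3: true positive rate, also the sensitivity
specificity <- TN/(TN + FP)   # 2/3: true negative rate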

Other Measures

We will use and refine the measures introduced here in describing the various approaches to evaluating our models in the following sections. As the help() for ROCR indicates, we have very many to choose from, and which works best for the many different application areas is often determined through trial and error and experience.

15.3 Confusion Matrix

A confusion matrix (also known as an error matrix) is appropriate when predicting a categoric target (e.g., in binary classification models). We saw a number of confusion matrices in Chapter 2.

In Rattle, the Confusion Matrix is the default on the Evaluate tab. Clicking the Execute button will run the selected model(s) against the chosen dataset to predict the outcomes for each of the observations in that dataset. The predictions are compared with the actual observations, and the true and false positives and negatives are calculated. Figure 15.2 illustrates this for the decision tree model using the weather dataset.

We see in Figure 15.2 that six models have been built, and the text view will show the confusion matrix for each of the selected models. A quick way to build each type of model is to choose the All option on the Model tab.

The confusion matrix displays the predicted versus the actual results in a table. The first table shows the actual counts, whilst the second table shows the percentages. For the decision tree applied to the validation dataset, there are 5 true positives and 39 true negatives, and so the model is correct for 44 observations out of 54. That is, the overall error rate is 10 out of 54, or 19%.

The false positives and false negatives have the same count. On five days we will get wet, and on another five we will carry an umbrella with us unnecessarily. If we scroll the text view window of the Evaluate tab, we can see the confusion-matrix-based performance measures for the other models. The random forest appears to provide a slightly more accurate prediction, as we see in Figure 15.3. Note that the results vary slightly between different deployments of R, particularly between 64-bit R, as here, and 32-bit R.
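Reproducing the two tables outside Rattle is straightforward. A sketch, reusing the illustrative predictions and actuals vectors from above:

# The counts table and the percentage table, as Rattle displays them.
cm <- table(predictions, actuals, dnn=c("Predicted", "Actual"))
cm                                  # counts
round(100 * cm/length(actuals))     # percentages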

Figure 15.2: The Evaluate tab showing a confusion matrix.

The overall error rate for the random forest is 12%, with 4 true positives and 40 true negatives. Compared with the decision tree, there is one less day when we will get wet and three fewer days when we would unnecessarily carry our umbrella. We might instead look for a model that reduces the false negatives rather than the false positives. (Also remember that we should be careful when comparing such small numbers of observations; the differences won't be significant, though when using very large training datasets, as would be typical for data mining, we are in a better position to compare.)

15.4 Risk Charts

A risk chart, also known as a cumulative gain chart, provides another perspective on the performance of a binary classification model. Such a chart can be displayed by choosing the Risk option on the Evaluate tab.

We will explain risk charts here using the audit dataset. The use of risk charts to evaluate models of fraud and noncompliance is more logical than with the application to predicting rain.

Figure 15.3: Confusion matrix for the random forest model.

The audit dataset (Section B.2) contains observations of taxpayers who have been audited, together with the outcome of the audit: No or Yes. A positive outcome indicates that the taxpayer was required to update the tax return because of inaccuracies in the figures reported. A negative outcome indicates that the tax return required no adjustment. For each adjustment, we also record its dollar amount (as the risk variable).

We can build a random forest model using this dataset, but we first need to load it into Rattle. To do so, go back to the Data tab and, after loading rattle's weather dataset, click on the Filename chooser. We can then select the file audit.csv. Click on Execute to have the new dataset loaded. Then, from Rattle's Model tab, build a Forest, and then request a Risk Chart from the Evaluate tab. The resulting risk chart is shown in Figure 15.4.

To read the risk chart, we will pick a particular point and consider a specific scenario. The scenario is that of auditing taxpayers. Suppose we normally audit 100,000 taxpayers each year. Of those, let's say only 24,000 end up requiring an adjustment to their tax return.

We call this the strike rate: we strike 24,000 out of the 100,000 as being of interest, a strike rate of 24%.

Figure 15.4: A risk chart for a random forest on the audit dataset.

Suppose our funding now allows us to audit only 50,000 taxpayers. If we were to randomly select 50,000 from the 100,000 taxpayers, then we would expect to identify just 50% of the taxpayers whose tax returns actually required an adjustment. That is, we would identify only 12,000 of the 24,000 tax returns requiring an adjustment from amongst the 50,000 taxpayers randomly selected. This random selection is represented by the diagonal line in the plot: a random 50% caseload (i.e., 50,000 cases) will deliver a 50% performance (i.e., only half of the known cases of interest will be found). We can think of this as the baseline, the situation if we used random selection and no model.

We now introduce our random forest model, which predicts the likelihood of a taxpayer's tax return requiring an adjustment. For each taxpayer, the model provides a score: the probability of the taxpayer's tax return requiring an adjustment. We can now prioritise our audits of taxpayers based on these scores, so that taxpayers with a higher score are audited before taxpayers with a lower score. Once again, but now using this priority, we choose to audit only 50,000 taxpayers, selecting the 50,000 that have the highest risk scores.
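The arithmetic behind these risk chart points is simple enough to check directly; a sketch using the scenario's numbers:

# The audit scenario from the text: 100,000 cases, 24,000 of interest.
total_cases    <- 100000
total_positive <- 24000                      # a strike rate of 24%
caseload       <- 0.5                        # we can audit 50% of the cases

cases_audited <- caseload * total_cases      # 50,000 audits
random_hits   <- caseload * total_positive   # 12,000 found by random selection
model_hits    <- 0.9 * total_positive        # 21,600 at 90% performance
model_hits/random_hits                       # lift: 1.8, almost 2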

The dashed green line of the plot indicates the performance achieved when using the model to prioritise the audits. For a 50% caseload, the performance is approximately 90%. That is, we expect to identify 90% of the tax returns requiring an adjustment. So 21,600 of the 24,000 known adjustments, from amongst the 50,000 taxpayers chosen, are expected to be identified. That is a significant improvement over the 12,000 from the 50,000 selected randomly. Indeed, as the blue line in the plot indicates, that provides a lift in performance of almost 2. That is, we are identifying almost twice as many tax returns requiring adjustment as we would expect if we were simply selecting taxpayers randomly. In this light, the model provides quite a significant benefit.

Note that we are not particularly concentrating on error rates as such but on the benefit we achieve in using the model to rank or prioritise our business processes. Whilst a lot of attention is often paid to simplistic measures of model performance, other factors usually come into play in deciding which model performs best.

Note also from the plot in Figure 15.4 that after we have audited about 85% of the cases (i.e., at a caseload of 85) the model achieves 100% performance. That is, the model has ensured that all tax returns requiring adjustment have been identified by the time we have audited 85,000 taxpayers. A conservative use of the model would then ensure nearly all required audits (i.e., 24,000) are performed, yet save 15% of the effort previously required to identify all of the required audits. We also note that out of the 85,000 audits we are still unnecessarily auditing 61,000 taxpayers.

The solid red line of the risk chart often follows a path similar to that of the green line. It provides an indication of the measure of the size of the risk covered by the model. It is based on the variable identified as having the Risk role on the Data tab. In our case, it is the variable RISK_Adjustment, which records the dollar amount of any adjustment made to a tax return. In that sense, it is a measure of the size of the risk.

The risk performance line is included for information only. It has not been used in the modelling at all (though it could have been). Empirically, we often note that it sits near or above the target performance line. If it sits high above the target line, then the model is fortuitously identifying higher-risk cases earlier in the process, which is a useful outcome.

A risk chart can be displayed for any binary classification model built using Rattle. In comparing risk charts for different models, we are looking for a larger area under the curve. This generally means that a curve closer to the top left of the risk chart identifies a better-performing model than a curve that is closer to the baseline (diagonal line). Figure 15.5 illustrates the output when multiple models are selected, so that performances can be directly compared.

Figure 15.5: Four risk charts displayed to compare the performances of multiple model builders on the audit dataset.

The plots generated by Rattle include a measure of the area under the curve in the legend of the plot. For Figure 15.4, the area under the
