Occupation Classifier


Sebastián Jiménez-Bonnet - 05427277
December 15, 2011

1 Introduction

In this project I explored how accurately the occupation of a person can be guessed from other demographic variables. To this end, several algorithms were tried, including multinomial logistic regression, SVMs and Classification and Regression Trees (CART). I present here the results from the most successful algorithm, CART, and discuss some of the reasons why it performed better.

2 The Data

The data set used is an extract from a larger marketing database (source: Impact Resources, Inc., Columbus, OH, 1987), originally intended to classify people by income. The original data set consists of 9409 questionnaires filled out by shopping mall customers in the San Francisco Bay Area. The questionnaires contained 502 questions, and the extract used here consists of 14 demographic attributes. It is a mixed set containing both categorical and numerical variables, with many missing values. After the entries for which the response variable was missing were removed, a total of 8857 observations served as the dataset used to train and test the predictor. All the attributes were discretized and are listed below with their corresponding number of levels (for a complete listing of the levels see http://www-stat.stanford.edu/~tibs/ElemStatLearn/):

1. Occupation (9 levels)
2. Annual income of household (9 levels)
3. Sex (2 levels)
4. Marital status (5 levels)
5. Age (7 levels)
6. Education (6 levels)
7. How long have you lived in the San Francisco/Oakland/San Jose area? (5 levels)
8. Dual incomes, if married (3 levels)
9. Persons in your household (9 levels)
10. Persons in household under 18 (9 levels)
11. Householder status (3 levels)
12. Type of home (5 levels)
13. Ethnic classification (8 levels)
14. What language is spoken most often in your home? (3 levels)

Many of the variables have many levels, including the response variable Occupation, which has 9. This, combined with the fact that almost 20% of the observations (1743) have at least one missing value, makes the data set hard to work with. The treatment of these missing values played a central role in the project.

The dataset was randomly split into a training set with 7085 entries (80%) and a test set with 1772 entries (20%). The former was used to fit the model; the latter was kept separate until the very end, when it was used to estimate the prediction error of the final model.
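The preprocessing and the 80/20 split described above can be reproduced with a few lines of R. The following is only a minimal sketch, not the code actually used for the project; the file name "marketing.csv" and the column name Occupation are illustrative assumptions.

    # Minimal sketch of the data preparation in Section 2. The file name and
    # column names are assumptions; adapt them to the actual extract.
    marketing <- read.csv("marketing.csv")

    # All attributes are discretized, so treat every column as a factor.
    marketing[] <- lapply(marketing, factor)

    # Drop rows whose response (Occupation) is missing; the predictors may
    # still contain NAs, which are dealt with later by the tree itself.
    marketing <- marketing[!is.na(marketing$Occupation), ]

    # Random 80/20 split into training and test sets.
    set.seed(1)
    n         <- nrow(marketing)
    train_idx <- sample(n, size = round(0.8 * n))
    train     <- marketing[train_idx, ]
    test      <- marketing[-train_idx, ]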

Figure 1: The cross-validated (CV) error and the training set error of each fitted tree, expressed as a fraction of the error achieved by the single-node tree, plotted against the number of splits of the corresponding tree.

3 Classification and Regression Trees

The CART algorithm, roughly speaking, generates a sequence of binary splits, each on a single input variable, chosen so that the resulting measure of impurity is minimized at every step. The tree is grown until a stopping criterion is satisfied and is finally pruned to avoid overfitting. To make a prediction, each terminal node is assigned the class with the most votes among the training observations that fall into it.

There are several reasons why trees are especially appropriate for this particular problem. First, trees work well with discrete input and response variables that have many levels. Second, they are fairly easy to interpret, which may lead to a deeper understanding of the relation between the inputs and the response. Lastly, trees offer several ways to deal with missing observations, which is very important for this dataset.

3.1 Missing values

Binary trees are especially well suited to dealing with missing values. There are several approaches to the problem. The simplest one is to ignore the incomplete records and build a model based only on observations that have all their attributes observed. This approach wastes a lot of information (we would discard almost 20% of our dataset!) and cannot offer a prediction for new entries that have missing values themselves. Another approach is to add a dummy level to each attribute, corresponding to a missing value. This approach is useful in that it can help uncover a relation between the missingness and the response. But if the missing values appear at random, or are the product of data processing rather than data collection, this approach is also inefficient.
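Both of these simple strategies are easy to express in R before any tree is grown. The sketch below is purely illustrative (it reuses the hypothetical marketing data frame from the sketch in Section 2), and neither option is the approach ultimately taken in this project:

    # Option 1: complete-case analysis -- keep only fully observed rows,
    # discarding roughly 20% of the data.
    complete_only <- marketing[complete.cases(marketing), ]

    # Option 2: make "missing" an explicit level of every categorical variable,
    # so that a tree (or any other classifier) can split on missingness itself.
    with_na_level   <- marketing
    with_na_level[] <- lapply(with_na_level, addNA, ifany = TRUE)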

Figure 2: Our final binary tree. Its splits use the variables Age, Annual income, Education, Dual income, Sex, Householder status and Time in the Bay Area.

Both of these approaches can be implemented by many algorithms, but trees offer an additional one: surrogate splits. The idea is to choose several substitute splits for each of the splits selected for the model. In choosing these surrogate splits, only variables different from the original splitting variable are considered; the surrogate is then chosen so that the partition of the data it produces has the largest overlap with the partition achieved by the original split (considering only the data for which both variables are observed). The second surrogate excludes the original variable and the variable used by the first surrogate, and so on.

3.2 Fitting the predictor

In this implementation of the CART algorithm, the Gini index is used as the measure of impurity to grow the tree. Three criteria were used to stop branching:

1. A node considered for splitting was required to have at least 5 observations.
2. Each terminal node was required to have at least 2 observations.
3. A split had to decrease the measure of impurity by at least Cp = 0.001.

These stopping rules aim to avoid huge trees that are computationally expensive and that would in any case be pruned away at a later stage. To fit the model I used R as the software platform, specifically the package rpart. The algorithm computes a whole path of nested trees, from the trivial root-node tree to the full tree reached with the stopping criteria above, and indexes each tree by the value of the parameter Cp that it would have taken to stop branching exactly at that tree. To choose the appropriate tree and avoid overfitting, I computed a 10-fold cross-validation (CV) error for each of the trees. In Figure 1 we can see a plot of the training error and the CV error for each of these trees, as a fraction of the error of the trivial root-node classifier.
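A minimal sketch of this fitting and pruning procedure with rpart follows. It reuses the hypothetical train data frame from the earlier sketch, and the model selection it applies is the two standard deviation rule described in the next paragraph; the exact numbers will differ from run to run.

    library(rpart)

    # Grow a large classification tree with the stopping rules listed above.
    # rpart uses the Gini index by default and, also by default, builds surrogate
    # splits (usesurrogate = 2), which is how missing predictor values are handled.
    fit <- rpart(Occupation ~ ., data = train, method = "class",
                 parms   = list(split = "gini"),
                 control = rpart.control(minsplit = 5, minbucket = 2,
                                         cp = 0.001, xval = 10))

    # The cptable holds, for every tree on the nested path, its Cp value, number
    # of splits, training error and 10-fold CV error (all relative to the
    # root-node error), plus the standard deviation of the CV error.
    cp_tab <- fit$cptable

    # Two standard deviation parsimonious rule: pick the simplest tree whose
    # CV error is within two standard deviations of the minimum CV error.
    best      <- which.min(cp_tab[, "xerror"])
    threshold <- cp_tab[best, "xerror"] + 2 * cp_tab[best, "xstd"]
    chosen    <- min(which(cp_tab[, "xerror"] <= threshold))
    pruned    <- prune(fit, cp = cp_tab[chosen, "CP"])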

Figure 3: Prediction accuracy for each class of the response variable Occupation. The per-class accuracies range from 0.911 down to 0.000.

The root-node trivial error is 0.679, and is achieved by the classifier that assigns the most frequent class (in our case Professional/Managerial) to every input. I used the two standard deviation parsimonious approach, in which the selected model is the simplest one (fewest terminal nodes) that achieves a CV error within two standard deviations of the minimum CV error.

3.3 Final model

The final model corresponds to a Cp value of 0.00208, which translates into a binary tree with 20 nodes. Figure 2 shows the structure of the model. Note that the model only uses 7 of the 13 possible predictor variables. The variables that appear to be the most important in the tree are Age, Annual income and Education.

3.4 Prediction

Tested on the test set, the final model has a test error of 46%. At first sight this may seem high, but considering that the response variable has 9 classes and the training set is sparse, it is actually quite a remarkable performance. For a frame of reference, it can be compared to the root-node error, which is 68%. Figure 3 shows the classification accuracy for each class. Excluding Retired (which is easily identifiable by age), there is a clear bias in favor of the more frequent classes, Professional/Managerial and Student. In fact the least frequent class, Unemployed, is never assigned.

4 Conclusions

The tree predictor performed well in classifying the test set. One key feature of the model was its treatment of the missing values. For comparison, I fitted another CART classifier omitting the missing values and got a test error of 53%, which suggests that there is a considerable loss of information in doing so.
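The test-set evaluation reported in Section 3.4 amounts to an overall error rate plus the diagonal of a row-normalized confusion matrix, which is what Figure 3 displays. A minimal sketch, again reusing the hypothetical pruned tree and test set from the earlier sketches:

    # Predict the class of each test observation with the pruned tree;
    # surrogate splits are used automatically when a predictor is missing.
    pred <- predict(pruned, newdata = test, type = "class")

    # Overall test error (reported above as roughly 46%).
    test_error <- mean(pred != test$Occupation)

    # Per-class accuracy: fraction of each true class that is correctly labeled.
    conf_mat       <- table(truth = test$Occupation, predicted = pred)
    class_accuracy <- diag(conf_mat) / rowSums(conf_mat)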

There are some questions that still need to be answered in order to assess these results. How general is the model? Can it be extended to the general population of California or the United States? One obstacle I can see, which Figure 3 supports, is that the predictions are heavily affected by the frequency of each class. To obtain more generally applicable results, we could weight the observations so that the class frequencies in the training data are balanced against the class frequencies of the population we actually want to predict for.

Regarding the importance of the variables, it should be kept in mind that binary trees have a very high variance: small variations in the data can lead to quite different combinations of selected splitting variables. To better assess variable importance, at least when there is little domain context to help, Random Forests offer a more robust importance measure for the predictor variables and may be a good continuation of this work.

5 References

(i) Hastie, T., R. Tibshirani and J. Friedman. The Elements of Statistical Learning. Stanford, California, 2008.
(ii) Bishop, Christopher M. Pattern Recognition and Machine Learning. Cambridge, 2006.
(iii) Mitchell, Tom M. Machine Learning. 1997.