Predictive Modeling of Titanic Survivors: a Learning Competition



Similar documents
Smart Sell Re-quote project for an Insurance company.

Internet Gambling Behavioral Markers: Using the Power of SAS Enterprise Miner 12.1 to Predict High-Risk Internet Gamblers

Improved Interaction Interpretation: Application of the EFFECTPLOT statement and other useful features in PROC LOGISTIC

Improving the Performance of Data Mining Models with Data Preparation Using SAS Enterprise Miner Ricardo Galante, SAS Institute Brasil, São Paulo, SP

Classification of Titanic Passenger Data and Chances of Surviving the Disaster Data Mining with Weka and Kaggle Competition Data

Role of Customer Response Models in Customer Solicitation Center s Direct Marketing Campaign

EXPLORING & MODELING USING INTERACTIVE DECISION TREES IN SAS ENTERPRISE MINER. Copyr i g ht 2013, SAS Ins titut e Inc. All rights res er ve d.

Agenda. Mathias Lanner Sas Institute. Predictive Modeling Applications. Predictive Modeling Training Data. Beslutsträd och andra prediktiva modeller

Improving performance of Memory Based Reasoning model using Weight of Evidence coded categorical variables

Enhancing Compliance with Predictive Analytics

Gerry Hobbs, Department of Statistics, West Virginia University

How To Make A Credit Risk Model For A Bank Account

Modeling to improve the customer unit target selection for inspections of Commercial Losses in Brazilian Electric Sector - The case CEMIG

A Property & Casualty Insurance Predictive Modeling Process in SAS

PharmaSUG2011 Paper HS03

A fast, powerful data mining workbench designed for small to midsize organizations

Data mining and statistical models in marketing campaigns of BT Retail

A Comparison of Decision Tree and Logistic Regression Model Xianzhe Chen, North Dakota State University, Fargo, ND

Data Mining Techniques Chapter 6: Decision Trees

APPLICATION PROGRAMMING: DATA MINING AND DATA WAREHOUSING

WebFOCUS RStat. RStat. Predict the Future and Make Effective Decisions Today. WebFOCUS RStat

Didacticiel Études de cas

Applied Data Mining Analysis: A Step-by-Step Introduction Using Real-World Data Sets

Efficient Integration of Data Mining Techniques in Database Management Systems

partykit: A Toolkit for Recursive Partytioning

Successfully Implementing Predictive Analytics in Direct Marketing

Big Data Analytics. Benchmarking SAS, R, and Mahout. Allison J. Ames, Ralph Abbey, Wayne Thompson. SAS Institute Inc., Cary, NC

The Operational Value of Social Media Information. Social Media and Customer Interaction

Learning Example. Machine learning and our focus. Another Example. An example: data (loan application) The data and the goal

Analyzing Marine Piracy from Structured & Unstructured data using SAS Text Miner

Binary Logistic Regression

Microsoft Azure Machine learning Algorithms

Reevaluating Policy and Claims Analytics: a Case of Non-Fleet Customers In Automobile Insurance Industry

Data Mining Using SAS Enterprise Miner Randall Matignon, Piedmont, CA

Predicting earning potential on Adult Dataset

STATISTICA. Financial Institutions. Case Study: Credit Scoring. and

Survey Analysis: Data Mining versus Standard Statistical Analysis for Better Analysis of Survey Responses

Identifying At-Risk Students Using Machine Learning Techniques: A Case Study with IS 100

Data Mining Lab 5: Introduction to Neural Networks

Methods for Interaction Detection in Predictive Modeling Using SAS Doug Thompson, PhD, Blue Cross Blue Shield of IL, NM, OK & TX, Chicago, IL

An Overview of Data Mining: Predictive Modeling for IR in the 21 st Century

What is Data Mining? MS4424 Data Mining & Modelling. MS4424 Data Mining & Modelling. MS4424 Data Mining & Modelling. MS4424 Data Mining & Modelling

Data Mining: A Magic Technology for College Recruitment. Tongshan Chang, Ed.D.

Benchmarking of different classes of models used for credit scoring

ASSIGNMENT 4 PREDICTIVE MODELING AND GAINS CHARTS

PAKDD 2006 Data Mining Competition


DECISION TREE ANALYSIS: PREDICTION OF SERIOUS TRAFFIC OFFENDING

A Property and Casualty Insurance Predictive Modeling Process in SAS

Lavastorm Analytic Library Predictive and Statistical Analytics Node Pack FAQs

S The Difference Between Predictive Modeling and Regression Patricia B. Cerrito, University of Louisville, Louisville, KY

BIDM Project. Predicting the contract type for IT/ITES outsourcing contracts

2015 Workshops for Professors

Data Mining Classification: Decision Trees

testo dello schema Secondo livello Terzo livello Quarto livello Quinto livello

THE RISE OF THE BIG DATA: WHY SHOULD STATISTICIANS EMBRACE COLLABORATIONS WITH COMPUTER SCIENTISTS XIAO CHENG. (Under the Direction of Jeongyoun Ahn)

New Work Item for ISO Predictive Analytics (Initial Notes and Thoughts) Introduction

Data Mining Methods: Applications for Institutional Research

Risk pricing for Australian Motor Insurance

Getting Started with SAS Enterprise Miner 7.1

ECLT5810 E-Commerce Data Mining Technique SAS Enterprise Miner -- Regression Model I. Regression Node

SAS ENTERPRISE MINER 5.3

Predicting the Risk of Heart Attacks using Neural Network and Decision Tree

Data Mining Practical Machine Learning Tools and Techniques

Prediction of Stock Performance Using Analytical Techniques

Predicting Customer Default Times using Survival Analysis Methods in SAS

Predictive Data Mining in Very Large Data Sets: A Demonstration and Comparison Under Model Ensemble

Journée Thématique Big Data 13/03/2015

BOOSTING - A METHOD FOR IMPROVING THE ACCURACY OF PREDICTIVE MODEL

Welcome. Data Mining: Updates in Technologies. Xindong Wu. Colorado School of Mines Golden, Colorado 80401, USA

The Basics of SAS Enterprise Miner 5.2

Leveraging Ensemble Models in SAS Enterprise Miner

Title. Introduction to Data Mining. Dr Arulsivanathan Naidoo Statistics South Africa. OECD Conference Cape Town 8-10 December 2010.

Classification of Bad Accounts in Credit Card Industry

Data Mining. Nonlinear Classification

Regression Modeling Strategies

New Clergy Compensation Report

Application of Predictive Analytics to Higher Degree Research Course Completion Times

An Overview and Evaluation of Decision Tree Methodology

Decision Trees What Are They?

Paper Downtime of a truck = Truck repair end date - Truck repair start date

The first three steps in a logistic regression analysis with examples in IBM SPSS. Steve Simon P.Mean Consulting

Insurance Analytics - analýza dat a prediktivní modelování v pojišťovnictví. Pavel Kříž. Seminář z aktuárských věd MFF 4.

8. Machine Learning Applied Artificial Intelligence

Predicting Car Purchase Intent Using Data Mining Approach

Transcription:

SAS Analytics Day Predictive Modeling of Titanic Survivors: a Learning Competition Linda Schumacher

Problem Introduction On April 15, 1912, the RMS Titanic sank resulting in the loss of 1502 out of 2224 passengers and crew. Predicting the survivors based on demographic variables is a predictive modeling classification problem. kaggle.com hosts a Titanic Getting Starting public competition with model scoring based on the accuracy fit statistic. Predictive Modeling was performed using SAS Enterprise Miner 12.1

Data Exploration Target variable is binary: survived. Input variables: name, age, fare, sex, embarked, cabin, ticket, pseudo class, #siblings/spouse, #parents/children. Sex, social standing related variables (fare, pclass, cabin) and age have highest worth.

Date Preparation Impute Node Age has 20% missing values and is imputed with a tree method using a computed variable, title, #siblings/spouse, and #parents/children. Title is extracted from the name. Transform Node Fare is right skewed with a large range. The log transformation improved the skew and kurtosis. Range Skew Kurtosis fare 0-512.13 4.79 33.40 log(fare+1) 0-6.24 0.42 0.85

Modeling in SAS Enterprise Miner Data Partitioning Node Because of the limited number of cases, 890 passengers, data was partitioned into 70% training and 30% validation. A separate test set of 418 passengers was used for scoring. Modeling Nodes Decision Trees, Gradient Boosting Machine, Logistic Regression, Neural Network, Rule Induction. Modeling objective is highest accuracy on scored test data.

Modeling Variable Importance Title is the most important variable identified by trees, GBM, and logistic regression. Models next selected variables pclass, #siblings/spouse, ticket, fare, and age. Splits on #parents/children, embarked, and cabin rarely occurred.

Modeling Diagram Ensemble A Autonomous Decision trees were built using split criteria entropy or Gini. Trees were optimized using the Average Square Error assessment on validation data. An interactive tree was also created. A Gradient Boosting Machine with a maximum depth of 10 and a maximum of 2 surrogate rules was created.

Model Comparison Fit Statistics The SAS EM nodes all performed well. The ensembles have the best accuracy fit statistics on scored test data. Model Valid ROC Valid ASE Valid MISC Test Accuracy Score Ensemble A 0.907 0.104 0.144 0.81340 Ensemble B 0.912 0.105 0.144 0.80816 DT entropy 0.910 0.102 0.129 0.80383 GBM 0.910 0.111 0.133 0.78947 Regression 0.870 0.125 0.178 0.79904 Neural Net 0.892 0.114 0.163 0.76555 Rule Induct 0.856 0.121 0.159 0.79426 DT interact 0.842 0.128 0.156 0.77990 Ensemble B

Model Comparison ROC Chart Validation ROC index: 0.842-0.912 Test accuracy: 0.766-0.813 An autonomous tree using the entropy split criterion had the highest ROC index. Ensemble A had the highest test accuracy.

Modeling Results Title was initially created to impute age but became the most important variable for predictions. Title encapsulates age, gender, marital status, and some professions. In general, models predicted survivors with titles: Miss, Mrs., Master Females traveling in 1 st & 2 nd class and pockets within 3 rd class. Males 12 and younger with less than 3 siblings. Mr., Rev., and Others were not predicted to survive. Males 13 and older. Note that although 72 men were survivors in training & validation data, the models did not identify any groups of men as survivors. No Reverends in training or validation data survived

Conclusion Decision Trees are good predictive models for the Titanic disaster survival because: Variables are related or redundant. Most variables are ordinal or nominal. Trees make no assumptions on the distributions of variables. Interactions between variables are present. Interpretation of tree models is straightforward. Ensemble nodes improve on the predictions of component models by forming consensus predictions.

For more details contact presenter at: Linda Schumacher Phone: 9198481374 Email: linda.schumacher@okstate.edu Faculty Advisor: Dr. Goutam Chakraborty Competition Website: http://www.kaggle.com/c/titanic-gettingstarted Acknowledgements: The author thanks Austen Head, Rick Pack, and Brian Fannin for their valuable comments and suggestions.