Predictive Analytics Workshop using IBM SPSS Modeler 2012 IBM Corporation

Objectives Smarter Software for Smarter Cities

Agenda
8:30-8:40 Welcome and Introductions
8:40-8:55 Introduction to Predictive Analytics
8:55-9:05 Exercise: Navigating IBM SPSS Modeler
9:05-9:25 Exercise: Predictive in 20 Minutes
9:25-9:40 Data Mining Methodology and Application
9:40-9:55 Break
9:55-10:55 Exercise: Data Mining Techniques
10:55-11:25 Exercise: Text Analytics
11:25-11:50 Deployment
11:50-12:00 Wrap-up

Purpose of the Workshop
Introduction to predictive analytics and data mining
Stimulate thinking about how data mining would benefit your organization
Demonstrate ease of use of powerful technology
Get experience in doing data mining
See examples of how other organizations are benefitting from deploying predictive analytics

Smarter Planet: Instrumented, Interconnected, Intelligent

What is Predictive Analytics?
"Predictive Analytics helps connect data to effective action by drawing reliable conclusions about current conditions and future events."
Gareth Herschel, Research Director, Gartner Group

Our Ultimate Goal is to Ensure Your Success
Florida Juvenile Justice reduced delinquency in schools by 34%
Madrid reduced emergency response time by 25%
Analytics helped reduce Hamilton County's dropout rate by 25%
Predictive analytics helped slash Memphis's crime rate by 40% in one year
Lancaster uses predictive policing models to reduce crime, saving the city $1.3M a year
Analytics decreased fraud in the Federal government by 50%

Predictive Analytics in Public Sector
Crime Analyses
Force Deployment
Lead Generation
Hot Spot Analyses
Fraud detection and prevention
Money laundering
Network intrusion
Tax audits & collection
Entity Resolution
Text Analytics
Social Network Analysis
Education
  Which among our students are at risk?
  Who are the most promising applicants?
  Which alumni will donate and how much?
  How can my institution plan for future development?
Corrections
  Which inmates are at risk of recidivism?
  Why do some inmates return?
  Which programs are successful?

IBM SPSS Modeler A Quick Overview

IBM SPSS Modeler
High performance data mining and text analytics workbench
Used for the proactive:
  Identification of fraud, waste and abuse
  Reduction of costs
  Identification of risk
    Students at risk of failing
    Inmates at risk of returning
    Patients at risk of relapse
  Forecasting demographic shifts and migration
Allows analytics to be repeated and integrated

IBM SPSS Modeler

Predictive in 20 Minutes A Quick Exercise

Exercise: Predictive in 20 Minutes
Goal: Create a model to identify who is at risk of heart attack
Approach: Use patient data which contains various health and behavioral information
  Define which fields to use
  Choose the modeling technique
  Automatically generate a model to identify who is at risk
  Review results
Why? For public health policy implications, by proactively identifying and quantifying high-risk behaviors and practices

Break - Please Return in 15 Minutes

IBM SPSS Modeler One Analytical Workbench Endless Techniques

Data Mining Methodology
Cross-Industry Standard Process Model for Data Mining (CRISP-DM)
Describes Components of Complete Data Mining Project Cycle
Shows Iterative Nature of Data Mining
Vendor and Industry Neutral

Data Mining Techniques
Technique: Classification (or prediction)
  Usage: Used to predict group membership (e.g., will this employee leave?) or a number (e.g., how many widgets will I sell?)
  Algorithms: Auto Classifiers, Decision Trees, Logistic, SVM, Time Series, etc.
Technique: Segmentation
  Usage: Used to classify data points into groups that are internally homogeneous and externally heterogeneous. Identify cases that are unusual
  Algorithms: Auto Clustering, K-means, etc. Anomaly detection
Technique: Association
  Usage: Used to find events that occur together or in a sequence (e.g., market basket)
  Algorithms: Apriori, Carma, Sequence
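To make the association row concrete, here is a minimal Python sketch of the two measures behind market-basket rules: support (how often items occur together) and confidence (how often the consequent appears given the antecedent). It uses plain pair counting, not the Apriori, Carma, or Sequence algorithms themselves. The transactions, thresholds, and the `pair_rules` helper are hypothetical and are not taken from the workshop.

```python
from collections import Counter
from itertools import combinations

# Hypothetical market-basket transactions (illustrative data only).
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk", "butter"},
    {"bread", "milk", "butter"},
]

def pair_rules(transactions, min_support=0.4, min_confidence=0.7):
    """Return sorted (antecedent, consequent, support, confidence) for item pairs."""
    n = len(transactions)
    item_counts = Counter(item for t in transactions for item in t)
    pair_counts = Counter(
        pair for t in transactions for pair in combinations(sorted(t), 2)
    )
    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n  # share of baskets containing both items
        if support < min_support:
            continue
        for lhs, rhs in ((a, b), (b, a)):
            confidence = count / item_counts[lhs]  # P(rhs in basket | lhs in basket)
            if confidence >= min_confidence:
                rules.append((lhs, rhs, support, confidence))
    return sorted(rules)

rules = pair_rules(transactions)
for lhs, rhs, support, confidence in rules:
    print(f"{lhs} -> {rhs}: support={support:.2f}, confidence={confidence:.2f}")
```

Dedicated algorithms such as Apriori add candidate pruning so that larger itemsets can be searched efficiently; the measures they report are the same.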

Additional Data Mining Techniques
Technique: Text Analytics
  Usage: Used to discover patterns resident in text or other unstructured data (e.g., sentiment analysis)
  Algorithms: Natural Language Processing, Parts of Speech Analysis
Technique: Entity Analytics
  Usage: Used to determine which cases are likely the same actor, and which seemingly identical cases are actually independent
  Algorithms: Context Accumulation
Technique: Social Network Analysis
  Usage: Used to uncover associations which may exist between cases, and identify central or influential actors

IBM SPSS Modeler Segmentation Modeling

Segmentation Modeling
Goal: Discover natural groupings or clusters of alumni donors
Approach: Alumni data from a university
  Define which fields to use
  Use K-Means Clustering to generate a model to group alumni
  Appendix: Use these clusters to predict donation
Why?
  Better alumni understanding (demographics, socio-economic, etc.)
  Tailored messages for each group/segment
  Personal and more relevant for alumni
  Institutional Planning
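For readers who want to see what K-Means does outside the Modeler canvas, the following is a minimal pure-Python sketch of the assign-and-update loop. It is not the Modeler implementation. The alumni fields (years since graduation, past gifts), the data values, and the `kmeans` function are assumptions made for illustration.

```python
import math
import random

def kmeans(points, k, iterations=20, seed=0):
    """Plain k-means: returns (centroids, labels) for a list of numeric tuples."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k randomly chosen points
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return centroids, labels

# Hypothetical alumni records: (years since graduation, past gifts in $100s).
alumni = [(2, 1), (3, 1), (4, 2), (25, 20), (27, 22), (30, 25)]
centroids, labels = kmeans(alumni, k=2)
print(labels)
```

In practice the fields would be standardized first so that no single field dominates the distance calculation, and the number of clusters would be chosen by comparing several values of k.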

IBM SPSS Modeler Entity Analytics

Entity Analytics
Suppose that you have the following records from two different sources, and are not sure whether they refer to the same person or different people.
Source 1
  Record no.: 70001
  Name: Jon Smith
  Address: 123 Main Street
  Driv. License: 0001133107
Source 2
  Record no.: 9103
  Name: JOHNATHAN Smith
  Date of Birth: 06/17/1934
  Telephone: 555-1212
  Email: jls@mail.com
  IP address: 9.50.18.77
No exact matches between the two records. However, if we introduce a third source, we find some common attributes.
Source 3
  Record no.: 6251
  Name: Jon Smith
  Telephone: 555-1212
  Driv. License: 0001133107
Source 3 shares its driver's license number with Source 1 and its telephone number with Source 2.
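The three-source example can be sketched in Python as transitive linking on shared identifiers. This is a simplified illustration that assumes exact matching on a few strong identifiers (driver's license, telephone, email); it is not the matching logic used by IBM SPSS Modeler Entity Analytics. The record keys and the `resolve` helper are hypothetical names.

```python
from collections import defaultdict

# The three records from the example; only strong identifiers are compared.
records = {
    "source1:70001": {"driv_license": "0001133107"},
    "source2:9103": {"telephone": "555-1212", "email": "jls@mail.com"},
    "source3:6251": {"telephone": "555-1212", "driv_license": "0001133107"},
}

def resolve(records):
    """Group record ids that share any identifier value, directly or transitively."""
    parent = {rid: rid for rid in records}

    def find(rid):
        # Follow parent links to the group representative (with path halving).
        while parent[rid] != rid:
            parent[rid] = parent[parent[rid]]
            rid = parent[rid]
        return rid

    # Merge each record with the first record seen holding the same (field, value).
    first_seen = {}
    for rid, fields in records.items():
        for key in fields.items():
            if key in first_seen:
                parent[find(rid)] = find(first_seen[key])
            else:
                first_seen[key] = rid

    groups = defaultdict(set)
    for rid in records:
        groups[find(rid)].add(rid)
    return sorted(sorted(group) for group in groups.values())

entities = resolve(records)
without_source3 = resolve(
    {rid: fields for rid, fields in records.items() if not rid.startswith("source3")}
)
print(entities)
print(without_source3)
```

With all three sources the records collapse into a single entity; without Source 3, Sources 1 and 2 stay separate, which mirrors the point of the slide.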

Entity Analytics
3634 Suspects Results
Fields Used in EA Resolution    % Missing in PD Database
LAST                            0.7
FIRST                           0.4
MIDDLE                          63.9
RACE                            0.1
SEX                             0.1
DOB                             2.4
ADDR                            0.9
DRLIC                           63.8
PHONE                           59.1

IBM SPSS Modeler Classification Modeling

Classification model
Goal: Identify students likely to persist
Approach: Use student performance scores and other demographics
  Define which fields to use
  Use the Auto Classifier to choose the appropriate modeling technique
  Review results
Why?
  Identify students likely to persist into their second year
  Conversely, same methods can be used to identify students at risk of attrition (or prisoners at risk of recidivism, or patients likely to respond to treatment)
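The idea behind an automated classifier comparison can be sketched as: train several candidate models, score each on held-out records, and keep the best. The Python sketch below uses two deliberately simple model types (a majority-class baseline and one-split decision stumps) rather than the algorithms the Auto Classifier actually evaluates. The student data, field choices, and function names are hypothetical.

```python
# Hypothetical student records: (first-term GPA, credits completed, persisted?).
students = [
    (3.6, 30, 1), (3.1, 28, 1), (2.9, 27, 1), (3.8, 32, 1), (2.2, 15, 0),
    (1.9, 12, 0), (2.4, 26, 1), (2.0, 18, 0), (3.3, 29, 1), (1.7, 10, 0),
    (2.8, 14, 0), (3.0, 31, 1),
]
train, test = students[:8], students[8:]  # simple holdout split

def fit_majority(rows):
    """Baseline: always predict the most common class in the training rows."""
    majority = int(sum(r[-1] for r in rows) * 2 >= len(rows))
    return lambda row: majority

def fit_stump(feature):
    """Build a trainer for a one-split decision tree on a single feature."""
    def fit(rows):
        best_threshold, best_correct = None, -1
        for threshold in sorted({r[feature] for r in rows}):
            correct = sum((r[feature] >= threshold) == bool(r[-1]) for r in rows)
            if correct > best_correct:
                best_threshold, best_correct = threshold, correct
        return lambda row: int(row[feature] >= best_threshold)
    return fit

def accuracy(model, rows):
    """Fraction of rows whose predicted class equals the actual class."""
    return sum(model(r) == r[-1] for r in rows) / len(rows)

candidates = {
    "majority baseline": fit_majority,
    "stump on GPA": fit_stump(0),
    "stump on credits": fit_stump(1),
}
scores = {name: accuracy(fit(train), test) for name, fit in candidates.items()}
best = max(scores, key=scores.get)
print(scores, best)
```

Including a baseline among the candidates is a useful habit: a model that cannot beat "always predict the majority class" is not adding information.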

IBM SPSS Modeler Text Analytics

The Importance of Text
"Because people communicate with words, not numbers, it has become critical to be able to mine text for its meaning and to sort, analyse, and understand it in the same way that data has been tamed. In fact, the two basic types of information complement each other, with data supplying the 'what' and text supplying the 'why'."
Source: IDC, Text Analytics: Software's Missing Piece?

Text Analytics Turn unstructured officer notes and narratives into useable and searchable context-rich content with Text Analytics.

Data Mining and Text Analytics
Data Mining
  Use advanced analytical techniques on data
  Discover key relationships between variables
  Model effect of variables on outcomes
  Determine influence on outcomes
  Predict outcomes
  Apply models to new data
Text Analytics
  Extract, analyze and create structure for unstructured data
  Integrate analysis results into operational systems
  Integrate analysis results into Business Intelligence applications
  Integrate analysis results with structured data and use as input for Data Mining
    Improves model accuracy
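One simple way to picture "creating structure for unstructured data" is to turn each free-text note into a row of concept flags that a data mining model can consume alongside structured fields. The Python sketch below uses a small hand-built dictionary with whole-word matching; real text analytics relies on linguistic processing rather than a lookup table. The concept dictionary, the notes, and the `extract_concepts` function are hypothetical.

```python
import re

# Hypothetical concept dictionary: concept name -> trigger terms (illustrative only).
CONCEPTS = {
    "weapon": {"knife", "gun", "firearm"},
    "vehicle": {"car", "truck", "sedan"},
    "burglary": {"burglary", "break-in", "forced entry"},
}

def extract_concepts(note):
    """Return a {concept: 0 or 1} row flagging which concepts appear in the note."""
    text = note.lower()
    row = {}
    for concept, terms in CONCEPTS.items():
        # Lookarounds require the term to stand alone, so "car" does not match "scar".
        row[concept] = int(any(
            re.search(r"(?<!\w)" + re.escape(term) + r"(?!\w)", text)
            for term in terms
        ))
    return row

notes = [
    "Suspect fled in a blue sedan after forced entry at the rear door.",
    "Victim reports a knife was shown; no vehicle seen.",
]
rows = [extract_concepts(note) for note in notes]
for row in rows:
    print(row)
```

The second note shows the limits of a dictionary approach: the word "vehicle" is not a listed trigger term, and the negation "no vehicle seen" would need linguistic handling to interpret correctly.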

Deployment Many Options

Why IBM SPSS?

Workshop Takeaways
Easy to use, visual interface
Short timeframe to be productive with actionable results
Does not require knowledge of programming language
Business results focused
Cost effective solution that delivers powerful results across organization
Flexible licensing and deployment options
Full range of algorithms for your business problems
End-to-end solution
Data preparation through real time interactions
Use structured, unstructured and survey data
Full suite of products, from data collection through deployment
Flexible architecture
Leverages the investments already made in technology
Does not require data in a proprietary format or DB
Structured and unstructured data
Open architecture (both inputs and outputs)
SQL Pushback

Nucleus Research: The Real ROI from IBM
94% of clients achieved a positive ROI, with an average payback period of 10.7 months
Key benefits achieved include reduced costs, increased productivity, improved citizen & employee satisfaction and safety
81% of projects deployed on time, 75% on or under budget
"This is one of the highest ROI scores Nucleus has ever seen in its Real ROI series of research reports."
Rebecca Wettemann, Vice President of Research, Nucleus Research

Appendix

Data Mining Overview
From Amazon.com
Paperback: 512 pages
Publisher: Wiley; 1 edition (December 28, 1999)
Language: English
ISBN-10: 0471331236
ISBN-13: 978-0471331230
Good introductory text on data mining for marketing from two top communicators in the field

Statistical Analysis and Data Mining
Handbook of Statistical Analysis and Data Mining Applications
Robert Nisbet, John Elder IV, and Gary Miner
Academic Press (2009)
ISBN-10: 0123747651
An excellent guide to many aspects of data mining, including text mining.

Data Mining Algorithms
From Amazon.com
Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations
by Eibe Frank, Ian H. Witten
Paperback - 416 pages (October 13, 1999)
Morgan Kaufmann Publishers; ISBN: 1558605525
Best book I've found in between highly technical and introductory books. Good coverage of topics, especially trees and rules, but no neural networks.