Lecture 6 - Data Mining Processes



Dr. Songsri Tangsripairoj, Dr. Benjarath Pupacdi
Faculty of ICT, Mahidol University

Cross-Industry Standard Process for Data Mining (CRISP-DM)
Example Application: Telephone Bill Study

CRISP-DM
  Cross-Industry Standard Process for Data Mining (http://www.crisp-dm.org/)
  CRISP-DM is a data mining process model that describes the approaches expert data miners commonly use to tackle problems.
  One of the first comprehensive attempts at a standard process model for data mining
  Independent of industry sector and technology

CRISP-DM Phases
  1. Business (or problem) understanding
  2. Data understanding
  3. Data preparation: transform and create the data set for modeling
  4. Modeling
  5. Evaluation: check the good models and make sure nothing is missing
  6. Deployment

1. Business Understanding
  Determine business objectives
    Solve a specific problem
  Assess the current situation
  Convert the above into a data mining problem
    What types of customers are interested in each of our products?
    What are typical profiles of our customers?
  Develop a project plan

2. Data Understanding
  Initial data collection
  Data description
  Data exploration
  Data quality verification
  Data selection
    Related data can come from many sources:
      Internal (ERP or MIS, data warehouse)
      External (government data, commercial data)
      Created (research)

Set up a concise and clear description of the problem
  Identify spending behaviors of female shoppers who purchase seasonal clothes
  Identify bankruptcy patterns of credit card holders
Identify the relevant data for the problem description
  Demographic, credit card transactional, and financial data
Selected variables for the relevant data should be independent of each other

Demographic data
  Such as income, education, number of households, and age
Socio-graphic data
  Such as hobbies, club membership, and entertainment
Transactional data
  Such as sales records, credit card spending, and issued checks

Attribute types: nominal, ordinal, interval, ratio

Nominal attributes
  Have a finite set of unordered values
  Values are distinct symbols
  Only equality tests can be performed (=, ≠)
  Examples:
    outlook: {sunny, overcast, rainy}
    sex: {male, female}
    eye color: {black, blue, green, brown, etc.}

Ordinal attributes
  Have a finite set of ordered values
  Impose an order on values (<, >)
  But no distance between values is defined
  Examples:
    grades: A > B > C > D > F
    credit ratings: excellent > fair > bad
    temperature: hot > mild > cool
    height: tall > medium > short

Interval attributes
  Interval quantities are not only ordered but also measured in fixed and equal units
  The differences between values are meaningful, i.e., a unit of measurement exists (+, -)
  Examples: temperatures in Celsius or Fahrenheit, calendar dates

Ratio attributes
  Ratio quantities are treated as real numbers
  All mathematical operations are allowed
  Both differences and ratios are meaningful (*, /)
  Examples: age, length, time, counts, monetary quantities

The type of an attribute depends on which of the following properties (operations) it possesses:
  Distinctness: = ≠
  Order: < >
  Addition: + -
  Multiplication: * /
Categorical (qualitative)
  Nominal: distinctness
  Ordinal: distinctness and order
Numeric (quantitative)
  Interval: distinctness, order, and addition
  Ratio: all 4 properties
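The cumulative structure of the four levels can be sketched in a few lines of Python. This is only an illustration: the names LEVELS, NEW_OPERATIONS, allowed_operations, and is_meaningful are made up for this example and do not come from the lecture.

```python
# Each level of measurement adds one pair of operations to those of the
# levels before it (names here are illustrative, not from the slides).
LEVELS = ["nominal", "ordinal", "interval", "ratio"]
NEW_OPERATIONS = {
    "nominal": {"=", "!="},   # distinctness
    "ordinal": {"<", ">"},    # order
    "interval": {"+", "-"},   # addition
    "ratio": {"*", "/"},      # multiplication
}


def allowed_operations(level):
    """Return every operation that is meaningful at the given level."""
    if level not in LEVELS:
        raise ValueError(f"unknown level: {level}")
    operations = set()
    for name in LEVELS[: LEVELS.index(level) + 1]:
        operations |= NEW_OPERATIONS[name]
    return operations


def is_meaningful(level, operation):
    """True if the operation makes sense for attributes at this level."""
    return operation in allowed_operations(level)
```

For example, is_meaningful("interval", "/") is False: a ratio of two Celsius temperatures has no meaning, while their difference does.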

Discrete data
  Has only a finite or countably infinite set of values
  Often represented as integer variables
  Note: binary attributes (e.g., true/false, yes/no, 0/1) are a special case of discrete attributes
  Examples: zip codes, counts, or the set of words in a collection of documents

Continuous data
  Infinite number of possible values
  Continuous attributes are typically represented as floating-point variables
  Has real numbers as attribute values
  Practically, real values can only be measured and represented using a finite number of digits
  Examples: temperature, height, or weight

Types of Data
  Features | PolyAnalyst | PASW Modeler
  Continuous | Numerical | Range
  Integer | Integer | Range
  Yes/No | Binary | Flag
  Finite | Categorical | Set
  Date/Time String Text Range Typeless

3. Data Preparation
  Clean selected data for better quality
    Fill in missing values
    Identify or remove outliers
    Resolve redundancy caused by data integration
    Correct inconsistent data
  Transform data
    Convert different measurements of data into a unified numerical scale by using simple mathematical formulations

Customer ID  Zip     Gender  Income    Age  Marital Status  Transaction Amount
1001         10048   M       75000     28   M               5000
1002         J2S7K7  F       -40000    40   W               4000
1003         90210           10000000  45   S               7000
1004         6269    M       50000     0    S               1000
1005         55101   F       99999     30   D               3000

Incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
  e.g., occupation = ""
Noisy: containing errors or outliers
  e.g., Salary = -40000
Inconsistent: containing discrepancies in codes or names
  e.g., Age = 42, Birthday = 03/07/1997
  e.g., was rating 1, 2, 3; now rating A, B, C
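A minimal sketch of how such problems might be flagged automatically, using the five rows of the sample customer table (marital status and transaction amount are left out for brevity). The validation rules are assumptions made for illustration: the five-digit zip rule, the MAX_PLAUSIBLE_INCOME cut-off, and the find_issues helper are hypothetical and not part of the lecture.

```python
import re

# Rows transcribed from the sample customer table; None marks a missing value.
customers = [
    {"id": 1001, "zip": "10048", "gender": "M", "income": 75000, "age": 28},
    {"id": 1002, "zip": "J2S7K7", "gender": "F", "income": -40000, "age": 40},
    {"id": 1003, "zip": "90210", "gender": None, "income": 10000000, "age": 45},
    {"id": 1004, "zip": "6269", "gender": "M", "income": 50000, "age": 0},
    {"id": 1005, "zip": "55101", "gender": "F", "income": 99999, "age": 30},
]

MAX_PLAUSIBLE_INCOME = 1_000_000  # hypothetical cut-off, chosen for this sketch


def find_issues(row):
    """Return a list of data quality flags for one customer record."""
    issues = []
    if row["gender"] is None:
        issues.append("missing gender")           # incomplete
    if not re.fullmatch(r"\d{5}", row["zip"]):
        issues.append("zip is not five digits")   # possibly inconsistent
    if row["income"] < 0:
        issues.append("negative income")          # noisy
    elif row["income"] > MAX_PLAUSIBLE_INCOME:
        issues.append("income outlier")           # noisy
    if row["age"] <= 0:
        issues.append("non-positive age")         # noisy
    return issues


report = {row["id"]: find_issues(row) for row in customers}
```

A flag is a prompt for inspection, not proof of an error: J2S7K7 has the shape of a Canadian postal code, and 6269 could be a zip code whose leading zero was dropped.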

Outliers differ greatly from the majority of the data
  Data that are clearly out of range of the selected data groups
  Examples:
    The income of a customer included in the middle class is $250,000.
    The age of a credit card holder is recorded as 12.

Incomplete data may come from
  Data values that were not applicable when collected
  Different considerations between the time the data was collected and the time it is analyzed
  Human, hardware, or software problems
Noisy data (incorrect values) may come from
  Faulty data collection instruments
  Human or computer error at data entry
  Errors in data transmission
Inconsistent data may come from
  Different data sources
  Functional dependency violations (e.g., modifying some linked data)

Transform numerical to numerical scales
  Salary ranging from $20,000 to $100,000 to a number in [0.0, 1.0]
  The metric system (e.g., meter, kilometer) to the English system (e.g., foot, mile)
Recode categorical data to numerical scales
  1 = Yes and 0 = No
  1 for $0 to $20,000 and 2 for $20,001 to $40,000
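These transformations are simple enough to sketch directly. The code below assumes min-max scaling for the salary example and whole-dollar incomes for the bracket recoding; the function names are illustrative only.

```python
def min_max_scale(value, low, high):
    """Map a value from the range [low, high] onto [0.0, 1.0]."""
    if high <= low:
        raise ValueError("high must be greater than low")
    return (value - low) / (high - low)


def recode_yes_no(answer):
    """Recode a Yes/No answer as 1/0."""
    return {"yes": 1, "no": 0}[answer.strip().lower()]


def income_bracket(income, width=20_000):
    """Return 1 for $0 to $20,000, 2 for $20,001 to $40,000, and so on.

    Assumes income is a non-negative whole number of dollars.
    """
    if income < 0:
        raise ValueError("income must be non-negative")
    return max(1, (income - 1) // width + 1)
```

With the salary range from the example, min_max_scale(60_000, 20_000, 100_000) gives 0.5, and income_bracket(20_001) gives 2.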

4. Modeling
  Data treatment
    Training set
    Test set
    Maybe others
  Data mining techniques
    Association
    Classification
    Clustering
    Predictions
    Sequential patterns

Association
  Derive a set of association rules showing relationships among attributes and data items, based on statistical significance.
  Example: market-basket analysis
    TID  Items
    1    Bread, Coke, Milk
    2    Beer, Bread
    3    Beer, Coke, Diaper, Milk
    4    Beer, Bread, Diaper, Milk
    5    Coke, Diaper, Milk
  Rules discovered can be:
    {milk} --> {coke}
    {diaper, milk} --> {beer}
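The strength of such rules is usually reported as support (how often all the items occur together) and confidence (how often the consequent appears when the antecedent does). The slide does not name these measures, so the sketch below is an assumption about how the two example rules would be scored on the five transactions; the helper names are illustrative.

```python
# The five market-basket transactions from the example.
transactions = [
    {"bread", "coke", "milk"},
    {"beer", "bread"},
    {"beer", "coke", "diaper", "milk"},
    {"beer", "bread", "diaper", "milk"},
    {"coke", "diaper", "milk"},
]


def count(itemset):
    """Number of transactions that contain every item in the itemset."""
    return sum(1 for basket in transactions if itemset <= basket)


def support(antecedent, consequent):
    """Fraction of all transactions containing both sides of the rule."""
    return count(antecedent | consequent) / len(transactions)


def confidence(antecedent, consequent):
    """Fraction of antecedent transactions that also contain the consequent."""
    return count(antecedent | consequent) / count(antecedent)


milk_coke = (support({"milk"}, {"coke"}), confidence({"milk"}, {"coke"}))
diaper_milk_beer = (
    support({"diaper", "milk"}, {"beer"}),
    confidence({"diaper", "milk"}, {"beer"}),
)
```

Here {milk} --> {coke} holds in 3 of 5 transactions (support 0.6) and in 3 of the 4 that contain milk (confidence 0.75); {diaper, milk} --> {beer} has support 0.4 and confidence 2/3.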

Classification
  Classify data items into one of several predefined classes.
  Example: to indicate whether a customer is likely to buy a computer
  A decision tree:
    Age
      <=30: Student
        No: No
        Yes: Yes
      31..40: Yes
      >40: Credit rating
        Excellent: No
        Fair: Yes
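A decision tree is just a nested set of tests, so it can be written as a small function. The sketch below assumes the flattened tree reads as follows: age is tested at the root; customers aged 30 or under buy only if they are students; those aged 31 to 40 always buy; those over 40 buy only if their credit rating is fair. The function name and argument types are illustrative.

```python
def buys_computer(age, is_student, credit_rating):
    """Predict whether a customer buys a computer.

    Encodes one reading of the example tree (an assumption, since the
    slide's layout was lost in extraction).
    """
    if age <= 30:
        return is_student                 # Student: No -> No, Yes -> Yes
    if age <= 40:
        return True                       # 31..40 -> Yes
    return credit_rating == "fair"        # >40: Excellent -> No, Fair -> Yes
```

For instance, buys_computer(25, True, "fair") is True, while buys_computer(55, False, "excellent") is False.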

Clustering
  Group data items into a number of clusters by using some similarity measure.
  Example: find subgroups of customers having similar purchase behaviors.
  [Figure: scatter plot of patterns on axes P1 and P2, grouped into class 1, class 2, and class 3]
    Dimension = 2, classes = 3
    Patterns in class 1 = 20, class 2 = 28, class 3 = 25; total patterns = 73
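The slide does not name a clustering algorithm. As one common choice, the sketch below implements k-means with Euclidean distance as the similarity measure; the function names and the starting centroids are assumptions for illustration.

```python
import math


def assign(points, centroids):
    """Label each point with the index of its nearest centroid."""
    return [
        min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
        for p in points
    ]


def update(points, labels, centroids):
    """Move each centroid to the mean of the points assigned to it."""
    moved = []
    for k, old in enumerate(centroids):
        members = [p for p, label in zip(points, labels) if label == k]
        if not members:
            moved.append(old)  # keep an empty cluster's centroid in place
        else:
            moved.append(tuple(sum(axis) / len(members) for axis in zip(*members)))
    return moved


def k_means(points, centroids, max_iter=100):
    """Alternate assignment and update steps until the centroids stop moving."""
    centroids = [tuple(c) for c in centroids]
    for _ in range(max_iter):
        moved = update(points, assign(points, centroids), centroids)
        if moved == centroids:
            break
        centroids = moved
    return assign(points, centroids), centroids
```

With six made-up two-dimensional points, k_means([(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)], [(0, 0), (10, 10)]) separates the first three points from the last three.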

Prediction
  Related to regression techniques
  Discover the relationship between the dependent and independent variables, and the relationships among the independent variables
  Examples:
    Predict the amount of revenue that each item will generate during an upcoming sale, based on previous sales data
    Predict sales amounts of a new product based on advertising expenditure
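For a single independent variable, the simplest regression technique is an ordinary least-squares line. The sketch below fits sales against advertising expenditure; the function names are illustrative and the numbers in the usage note are made up, not taken from the lecture.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need at least two paired observations")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(x, slope, intercept):
    """Predicted value of the dependent variable for a new x."""
    return slope * x + intercept
```

With hypothetical advertising spend [1, 2, 3, 4] and sales [3, 5, 7, 9], fit_line returns a slope of 2 and an intercept of 1, so a spend of 5 predicts sales of 11.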

Sequential patterns
  Find similar patterns in data transactions over a business period
  Example: in point-of-sale transaction sequences
    Athletic apparel store: (Shoes) (Racket, Racketball) --> (Sports_Jacket)
    Computer bookstore: (Modern Database Management) (Data Warehousing Fundamentals) --> (Introduction to Data Mining)

5. Evaluation
  Does the model meet business objectives?
  Are any important business objectives not addressed?
  Does the model make sense?
  Is the model actionable?
  It should be possible to make business decisions after this step.
  All important objectives should be achieved.

6. Deployment
  Ongoing monitoring and maintenance
    Evaluate performance against success criteria
    Market reaction and competitor changes

Example Application
  Telephone industry
  Problem: unpaid bills
  Data mining used to develop models to predict nonpayment as early as possible

Telephone Bill Study
  Billing period sequence analyzed
    Use 2 months, receive bill, payment due month of billing, disconnect if unpaid in given period
  Hypothesis: insolvent customers would change calling habits and phone usage during a critical period before and immediately after termination of the billing period

1: Business Understanding
  Predict which customers would be insolvent
    In time for the firm to take preventive measures (and avert losing good customers)
  Hypothesis: insolvent customers would change calling habits and phone usage during a critical period before and immediately after termination of the billing period

2: Data Understanding
  Static customer information available in files
    Bills, payments, usage
  Used data warehouse to gather and organize data
    Coded to protect customer privacy

Creating Target Data Set
  Customer files
    Customer information
    Disconnects
    Reconnections
  Time-dependent data
    Bills
    Payments
    Usage
  100,000 customers over a 17-month period
  Stratified sampling to assure all groups appropriately represented

3: Data Preparation
  Filtered out incomplete data
  Deleted inexpensive calls
    Reduced data volume about 50%
  Low number of fraudulent cases
  Cross-checked with phone disconnects
  Lagged data made synchronization necessary

Data Reduction & Projection
  Information grouped by account
  Customer data aggregated by 2-week periods
  Discriminant analysis on 23 categories
  Calculated average owed by category (significant)
  Identified extra charges (significant)
  Investigated payment by installments (not significant)

Choosing Data Mining Function
  Classes:
    Most possibly solvent (99.3%)
    Most possibly insolvent (0.7%)
  Costs of error widely different
  New data set created through stratified sampling
    Retained all insolvent
    Altered distribution to 90% solvent
    Used 2,066 cases total
  Critical period identified
    Last 15 two-week periods before service interruption
  Variables defined by counting measures in two-week periods
    46 variables as candidate discriminant factors
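The slides do not say how the resampling was carried out. One plausible reading is sketched below: keep every insolvent case and draw a random sample of solvent cases sized so that solvent cases make up 90% of the new data set. The rebalance function and its parameters are hypothetical.

```python
import random


def rebalance(insolvent, solvent, solvent_share=0.9, seed=0):
    """Keep all insolvent cases and sample solvent ones to reach solvent_share."""
    if not 0 < solvent_share < 1:
        raise ValueError("solvent_share must be strictly between 0 and 1")
    # Number of solvent cases needed so they form solvent_share of the total.
    wanted = round(len(insolvent) * solvent_share / (1 - solvent_share))
    if wanted > len(solvent):
        raise ValueError("not enough solvent cases to reach the requested share")
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    return list(insolvent) + rng.sample(list(solvent), wanted)
```

With 10 insolvent and 1,000 solvent records, the result holds 100 records, 90 of them solvent. Rebalancing of this kind gives the rare class enough weight for a classifier to learn from when the raw split is 99.3% to 0.7%.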

4: Modeling
  Discriminant analysis
    Linear model
    SPSS stepwise forward selection
  Decision trees
    Rule-based classifier
  Neural networks
    Nonlinear model

Data Mining
  The training set is about 2/3 of the data; the rest (1/3) is the test set.
  Discriminant analysis
    Used 17 variables
    Equal costs: 0.875 correct
    Unequal costs: 0.930 correct
  Rule-based: 0.952 correct
  Neural network: 0.929 correct
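A two-thirds/one-third split can be sketched as a shuffle followed by a cut. The study's actual splitting procedure is not described, so the function below, including its name and seed parameter, is only an assumed illustration.

```python
import random


def split_train_test(records, train_fraction=2 / 3, seed=0):
    """Shuffle the records and cut them into a training set and a test set."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # fixed seed for repeatability
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]
```

Every record lands in exactly one of the two sets, so the model is always evaluated on cases it has not seen during training.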

5: Evaluation
  1st objective: maximize accuracy of predicting insolvent customers
    Decision tree classifier best
  2nd objective: minimize error rate for solvent customers
    Neural network model close to decision tree
  Used all 3 on a case-by-case basis

Coincidence Matrix: Combined Models
                    Model insolvent  Model solvent  Unclassified  Totals
  Actual insolvent  19               17             28            64
  Actual solvent    1                626            27            654
  Totals            20               643            55            718

6: Implementation
  Every customer examined using all 3 algorithms
    If all 3 agreed, used that classification
    If disagreement, categorized as unclassified
  Correct on test data: 0.898
  Only 1 actually solvent customer would have been disconnected
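The unanimous-vote rule and the reported accuracy can both be checked with a short sketch. The cell counts are those of the coincidence matrix for the combined models; the function and variable names are illustrative.

```python
def combine(predictions):
    """Return the shared label if all models agree, else 'unclassified'."""
    first = predictions[0]
    if all(p == first for p in predictions):
        return first
    return "unclassified"


# Coincidence matrix for the combined models: matrix[actual][predicted].
matrix = {
    "insolvent": {"insolvent": 19, "solvent": 17, "unclassified": 28},
    "solvent": {"insolvent": 1, "solvent": 626, "unclassified": 27},
}

total = sum(sum(row.values()) for row in matrix.values())
correct = sum(matrix[label][label] for label in matrix)
accuracy = correct / total  # unclassified cases count as not correct
```

The diagonal holds 19 + 626 = 645 of 718 test cases, which rounds to the reported 0.898, and the single solvent customer predicted insolvent is the one who would have been wrongly disconnected.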