Lecture 6 - Data Mining Processes
Dr. Songsri Tangsripairoj, Dr. Benjarath Pupacdi
Faculty of ICT, Mahidol University
Cross-Industry Standard Process for Data Mining (CRISP-DM)
Example Application: Telephone Bill Study
CRISP-DM
  Cross-Industry Standard Process for Data Mining (http://www.crisp-dm.org/)
  CRISP-DM is a data mining process model that describes the approaches expert data miners commonly use to tackle problems.
  One of the first comprehensive attempts toward a standard process model for data mining
  Independent of industry sector and technology
CRISP-DM Phases
  1. Business (or problem) understanding
  2. Data understanding
  3. Data preparation: transform and create the data set for modeling
  4. Modeling
  5. Evaluation: check for good models, and evaluate to ensure nothing is missing
  6. Deployment
1. Business Understanding
  Determine business objectives
    Solve a specific problem
  Assess the current situation
  Convert the above into a data mining problem
    What types of customers are interested in each of our products?
    What are typical profiles of our customers?
  Develop a project plan
2. Data Understanding
  Initial data collection
  Data description
  Data exploration
  Data quality verification
  Data selection
    Related data can come from many sources
      Internal (ERP or MIS, data warehouse)
      External (government data, commercial data)
      Created (research)
Set up a concise and clear description of the problem
  Identify spending behaviors of female shoppers who purchase seasonal clothes
  Identify bankruptcy patterns of credit card holders
Identify the relevant data for the problem description
  Demographic, credit card transactional, and financial data
Selected variables for the relevant data should be independent of each other
Demographic data
  Such as income, education, number of households, and age
Sociographic data
  Such as hobbies, club membership, and entertainment
Transactional data
  Such as sales records, credit card spending, and issued checks
Nominal
Ordinal
Interval
Ratio
Nominal quantities have a finite set of non-ordered values
Values are distinct symbols
Only equality tests can be performed (=, ≠)
Example:
  outlook: {sunny, overcast, rainy}
  sex: {male, female}
  eye color: {black, blue, green, brown, etc.}
Ordinal quantities have a finite set of ordered values
Impose order on values (<, >)
But: no distance between values is defined
Example:
  grades: A > B > C > D > F
  credit ratings: excellent > fair > bad
  temperature: hot > mild > cool
  height: tall > medium > short
Interval quantities are not only ordered but also measured in fixed and equal units
The differences between values are meaningful, i.e., a unit of measurement exists (+, -)
Examples:
  temperatures in Celsius or Fahrenheit
  calendar dates
Ratio quantities are treated as real numbers
All mathematical operations are allowed
Both differences and ratios are meaningful (*, /)
Example: age, length, time, counts, monetary quantities
The type of an attribute depends on which of the following properties (operations) it possesses:
  Distinctness: = ≠
  Order: < >
  Addition: + -
  Multiplication: * /
Categorical (qualitative)
  Nominal: distinctness
  Ordinal: distinctness and order
Numeric (quantitative)
  Interval: distinctness, order, and addition
  Ratio: all 4 properties
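The nesting of the four properties can be sketched in a few lines of Python. This is only an illustration: the names `Level`, `MIN_LEVEL`, `supported_properties`, and `is_categorical` are made up for this sketch and do not come from any data mining tool.

```python
from enum import IntEnum


class Level(IntEnum):
    """Measurement levels, ordered from weakest to strongest."""
    NOMINAL = 1
    ORDINAL = 2
    INTERVAL = 3
    RATIO = 4


# For each property, the weakest level at which it becomes meaningful.
MIN_LEVEL = {
    "distinctness": Level.NOMINAL,
    "order": Level.ORDINAL,
    "addition": Level.INTERVAL,
    "multiplication": Level.RATIO,
}


def supported_properties(level):
    """Return the properties an attribute of the given level possesses."""
    return [name for name, lowest in MIN_LEVEL.items() if level >= lowest]


def is_categorical(level):
    """Nominal and ordinal attributes are categorical; the rest are numeric."""
    return level <= Level.ORDINAL
```

Each level inherits every property of the levels below it, which is why a single comparison against the weakest supporting level is enough.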
Discrete data
  Has only a finite or countably infinite set of values
  Often represented as integer variables
  Note: binary attributes (e.g., true/false, yes/no, 0/1) are a special case of discrete attributes
  Examples: zip codes, counts, or the set of words in a collection of documents
Continuous data
  Infinite number of possible values
  Continuous attributes are typically represented as floating-point variables
  Has real numbers as attribute values
  Practically, real values can only be measured and represented using a finite number of digits
  Examples: temperature, height, or weight
Types of Data
  Features     PolyAnalyst   PASW Modeler
  Continuous   Numerical     Range
  Integer      Integer       Range
  Yes/No       Binary        Flag
  Finite       Categorical   Set
  Date/Time   String   Text   Range   Typeless
3. Data Preparation
  Clean selected data for better quality
    Fill in missing values
    Identify or remove outliers
    Resolve redundancy caused by data integration
    Correct inconsistent data
  Transform data
    Convert different measurements of data into a unified numerical scale by using simple mathematical formulations
Customer ID   Zip      Gender   Income     Age   Marital Status   Transaction Amount
1001          10048    M        75000      28    M                5000
1002          J2S7K7   F        -40000     40    W                4000
1003          90210             10000000   45    S                7000
1004          6269     M        50000      0     S                1000
1005          55101    F        99999      30    D                3000
incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
  e.g., occupation = ""
noisy: containing errors or outliers
  e.g., Salary = -40000
inconsistent: containing discrepancies in codes or names
  e.g., Age = 42, Birthday = 03/07/1997
  e.g., was rating 1, 2, 3, now rating A, B, C
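As a rough sketch, the three kinds of dirty data can be flagged with simple checks on the customer table shown earlier. The rules below are assumptions chosen for illustration (for example, that a zip code must be exactly five digits), and `RECORDS`, `find_problems`, and `REPORT` are hypothetical names.

```python
import re

# Rows copied from the customer table; an empty string marks a missing value.
RECORDS = [
    {"id": 1001, "zip": "10048", "gender": "M", "income": 75000, "age": 28},
    {"id": 1002, "zip": "J2S7K7", "gender": "F", "income": -40000, "age": 40},
    {"id": 1003, "zip": "90210", "gender": "", "income": 10000000, "age": 45},
    {"id": 1004, "zip": "6269", "gender": "M", "income": 50000, "age": 0},
    {"id": 1005, "zip": "55101", "gender": "F", "income": 99999, "age": 30},
]


def find_problems(record):
    """Return a list of data quality problems found in one record."""
    problems = []
    if not record["gender"]:
        problems.append("incomplete: gender missing")
    if record["income"] < 0:
        problems.append("noisy: negative income")
    if record["age"] <= 0:
        problems.append("noisy: non-positive age")
    # Assumed rule: a valid zip code is exactly five digits.
    if not re.fullmatch(r"\d{5}", record["zip"]):
        problems.append("inconsistent: zip is not 5 digits")
    return problems


# Map each customer ID to the problems found in its row.
REPORT = {record["id"]: find_problems(record) for record in RECORDS}
```

Checks like these only find violations of explicit rules. Suspicious but legal values, such as an income of 10000000 or 99999, would need an outlier test or domain knowledge.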
Outliers differ greatly from the majority of data
Data that are clearly out of range of the selected data groups
Example:
  The income of a customer included in the middle class is $250,000.
  The age of a credit card holder is recorded as 12.
Incomplete data may come from
  "Not applicable" data value when collected
  Different considerations between the time when the data was collected and when it is analyzed
  Human/hardware/software problems
Noisy data (incorrect values) may come from
  Faulty data collection instruments
  Human or computer error at data entry
  Errors in data transmission
Inconsistent data may come from
  Different data sources
  Functional dependency violation (e.g., modify some linked data)
Transform numerical to numerical scales
  Salary ranging from $20,000 to $100,000 to a number in [0.0, 1.0]
  The metric system (e.g., meter, kilometer) to the English system (e.g., foot, mile)
Recode categorical data to numerical scales
  1 = Yes and 0 = No
  1 for $0 to $20,000 and 2 for $20,001 to $40,000
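The transformations above can be sketched as small Python functions. Min-max scaling is assumed here as the way to map the salary range onto [0.0, 1.0]; the slide does not name a specific formula, and all function names are hypothetical.

```python
# Assumed bounds, taken from the salary range on the slide.
SALARY_MIN = 20_000
SALARY_MAX = 100_000
METERS_PER_FOOT = 0.3048  # exact by definition of the international foot


def scale_salary(salary):
    """Min-max scale a salary in [20000, 100000] to a number in [0.0, 1.0]."""
    return (salary - SALARY_MIN) / (SALARY_MAX - SALARY_MIN)


def meters_to_feet(meters):
    """Convert a metric length to the English system."""
    return meters / METERS_PER_FOOT


def recode_yes_no(answer):
    """Recode the categorical answers 'Yes' and 'No' as 1 and 0."""
    return {"yes": 1, "no": 0}[answer.strip().lower()]


def recode_income_band(income):
    """Recode a whole-dollar income as a band number.

    Only the two bands given on the slide are covered.
    """
    if 0 <= income <= 20_000:
        return 1
    if 20_001 <= income <= 40_000:
        return 2
    raise ValueError("no band defined for this income")
```

For example, `scale_salary(60000)` gives 0.5, the midpoint of the salary range.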
4. Modeling
  Data treatment
    Training set
    Test set
    Maybe others
  Data mining techniques
    Association
    Classification
    Clustering
    Predictions
    Sequential patterns
Association
Derive a set of association rules showing relationships among attributes and data items, based on statistical significance.
Example: Market-basket analysis
  TID   Items
  1     Bread, Coke, Milk
  2     Beer, Bread
  3     Beer, Coke, Diaper, Milk
  4     Beer, Bread, Diaper, Milk
  5     Coke, Diaper, Milk
Rules discovered can be
  {milk} --> {coke}
  {diaper, milk} --> {beer}
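One common way to judge such rules is by their support and confidence. The slide does not name these measures, so treat them as an assumption of this sketch, which uses the five transactions from the example. The function names are hypothetical.

```python
# The five market-basket transactions from the example.
TRANSACTIONS = [
    {"bread", "coke", "milk"},
    {"beer", "bread"},
    {"beer", "coke", "diaper", "milk"},
    {"beer", "bread", "diaper", "milk"},
    {"coke", "diaper", "milk"},
]


def count(itemset, transactions=TRANSACTIONS):
    """Number of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(1 for basket in transactions if itemset <= basket)


def support(itemset, transactions=TRANSACTIONS):
    """Fraction of all transactions that contain the itemset."""
    return count(itemset, transactions) / len(transactions)


def confidence(antecedent, consequent, transactions=TRANSACTIONS):
    """Of the transactions containing the antecedent, the fraction that
    also contain the consequent."""
    both = set(antecedent) | set(consequent)
    return count(both, transactions) / count(antecedent, transactions)
```

For the rule {milk} --> {coke}, three of the five baskets contain both items (support 0.6), and three of the four baskets with milk also contain coke (confidence 0.75).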
Classification
Classify data items into one of several predefined classes.
Example: To indicate whether a customer is likely to buy a computer
A decision tree
  Age?
    <=30: Student?
      No: No
      Yes: Yes
    31..40: Yes
    >40: Credit rating?
      Excellent: No
      Fair: Yes
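A decision tree like this one is just nested tests, so it can be written directly as a function. The branch layout below is read from the flattened figure (age at the root, then student status or credit rating) and should be checked against the original slide. The function name and argument names are hypothetical.

```python
def buys_computer(age, is_student, credit_rating):
    """Classify a customer as likely (True) or unlikely (False) to buy.

    Assumed tree layout:
      age <= 30  -> decided by student status
      age 31..40 -> always yes
      age > 40   -> yes only if the credit rating is "fair"
    """
    if age <= 30:
        return is_student
    if age <= 40:
        return True
    return credit_rating == "fair"
```

For example, `buys_computer(25, True, "fair")` follows the `<=30` branch and returns `True` because the customer is a student.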
Clustering
To group data items into a number of clusters by using some similarity measures.
Example: Find subgroups of customers having similar purchase behaviors.
[Figure: scatter plot of 73 two-dimensional patterns (axes P1 and P2) in 3 classes: class 1 = 20, class 2 = 28, class 3 = 25]
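The slide does not say which clustering algorithm is meant. As one possible illustration, here is a minimal k-means sketch that uses Euclidean distance as the similarity measure. The points and starting centroids are made-up toy values, not the 73 patterns from the figure.

```python
import math


def kmeans(points, centroids, iterations=10):
    """Plain k-means on 2-D points, starting from the given centroids."""
    centroids = list(centroids)
    labels = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for k in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == k]
            if members:  # keep the old centroid if a cluster is empty
                centroids[k] = (
                    sum(x for x, _ in members) / len(members),
                    sum(y for _, y in members) / len(members),
                )
    return labels, centroids


# Hypothetical toy data: two visibly separate groups of customers.
POINTS = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
LABELS, CENTROIDS = kmeans(POINTS, centroids=[(0, 0), (10, 10)])
```

Starting centroids are passed in explicitly so the result is repeatable. Practical implementations usually pick them at random and rerun several times.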
Predictions
Related to regression techniques
To discover the relationship between the dependent and independent variables, and the relationships among the independent variables
Examples:
  Predict the amount of revenue that each item will generate during an upcoming sale, based on previous sales data
  Predict sales amounts of a new product based on advertising expenditure
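The advertising example can be sketched with simple least-squares linear regression on one independent variable. The numbers below are invented purely for illustration, and `fit_line` and `predict` are hypothetical names.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(x, slope, intercept):
    """Predicted value of the dependent variable for a new x."""
    return slope * x + intercept


# Hypothetical data: advertising expenditure (x) and sales amount (y).
ADVERTISING = [1, 2, 3, 4]
SALES = [3, 5, 7, 9]
SLOPE, INTERCEPT = fit_line(ADVERTISING, SALES)
```

Here the dependent variable is the sales amount and the independent variable is the advertising expenditure. The fitted line is then used to predict sales for a spending level not seen before.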
Sequential patterns
To find similar patterns in data transactions over a business period
Example: In point-of-sale transaction sequences,
  Athletic Apparel Store: (Shoes) (Racket, Racketball) --> (Sports_Jacket)
  Computer Bookstore: (Modern Database Management) (Data Warehousing Fundamentals) --> (Introduction to Data Mining)
5. Evaluation
  Does the model meet business objectives?
  Are any important business objectives not addressed?
  Does the model make sense?
  Is the model actionable?
  It should be possible to make business decisions after this step.
  All important objectives should be achieved.
6. Deployment
  Ongoing monitoring and maintenance
    Evaluate performance against success criteria
    Market reaction and competitor changes
Example Application
  Telephone industry
  Problem: Unpaid bills
  Data mining used to develop models to predict nonpayment as early as possible
Telephone Bill Study
  Billing period sequence analyzed
    Use 2 months, receive bill, payment due month of billing, disconnect if unpaid in given period
  Hypothesis: Insolvent customers would change calling habits and phone usage during a critical period before and immediately after termination of the billing period
1: Business Understanding
  Predict which customers would be insolvent
    In time for the firm to take preventive measures (and avert losing good customers)
  Hypothesis: Insolvent customers would change calling habits and phone usage during a critical period before and immediately after termination of the billing period
2: Data Understanding
  Static customer information available in files
    Bills, payments, usage
  Used data warehouse to gather and organize data
    Coded to protect customer privacy
Creating Target Data Set
  Customer files
    Customer information
    Disconnects
    Reconnections
  Time-dependent data
    Bills
    Payments
    Usage
  100,000 customers over a 17-month period
  Stratified sampling to ensure all groups are appropriately represented
3: Data Preparation
  Filtered out incomplete data
  Deleted inexpensive calls
    Reduced data volume by about 50%
  Low number of fraudulent cases
  Cross-checked with phone disconnects
  Lagged data made synchronization necessary
Data Reduction & Projection
  Information grouped by account
  Customer data aggregated by 2-week periods
  Discriminant analysis on 23 categories
  Calculated average owed by category (significant)
  Identified extra charges (significant)
  Investigated payment by installments (not significant)
Choosing Data Mining Function
  Classes:
    Most possibly solvent (99.3%)
    Most possibly insolvent (0.7%)
  Costs of error widely different
  New data set created through stratified sampling
    Retained all insolvent
    Altered distribution to 90% solvent
    Used 2,066 cases total
  Critical period identified
    Last 15 two-week periods before service interruption
  Variables defined by counting measures in two-week periods
    46 variables as candidate discriminant factors
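The resampling idea (keep every insolvent case, then draw only enough solvent cases to make them 90% of the new data set) can be sketched as follows. This is an illustration under assumed names and a made-up population, not the procedure actually used in the study.

```python
import random


def rebalance(records, is_minority, majority_share=0.9, seed=0):
    """Keep all minority records and sample majority records so that the
    majority makes up roughly majority_share of the result."""
    minority = [r for r in records if is_minority(r)]
    majority = [r for r in records if not is_minority(r)]
    # Majority records needed for the target share of the new data set.
    wanted = round(len(minority) * majority_share / (1 - majority_share))
    wanted = min(wanted, len(majority))
    sampled = random.Random(seed).sample(majority, wanted)
    return minority + sampled


# Hypothetical population: 1,000 customers, 7 of them insolvent (0.7%).
CUSTOMERS = [{"id": i, "insolvent": i < 7} for i in range(1000)]
SAMPLE = rebalance(CUSTOMERS, lambda c: c["insolvent"])
```

With 7 insolvent customers, the sketch keeps all 7 and draws 63 solvent ones, giving 70 cases at a 90/10 split. Rebalancing like this gives the rare class enough weight for a model to learn from it.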
4: Modeling
  Discriminant analysis
    Linear model
    SPSS stepwise forward selection
  Decision trees
    Rule-based classifier
  Neural networks
    Nonlinear model
Data Mining
  Training set is about 2/3 of the data. The rest of the data (1/3) is the test set.
  Discriminant analysis
    Used 17 variables
    Equal costs: 0.875 correct
    Unequal costs: 0.930 correct
  Rule-based: 0.952 correct
  Neural network: 0.929 correct
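A 2/3 to 1/3 split can be sketched with a shuffle followed by a cut. The function name, the fixed seed, and the 30-record toy data set are assumptions for illustration; the study's own splitting procedure is not described on the slide.

```python
import random


def split_train_test(records, train_fraction=2 / 3, seed=0):
    """Shuffle the records and cut them into a training and a test set."""
    shuffled = list(records)  # copy so the caller's list is left untouched
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical data set of 30 case identifiers.
TRAIN, TEST = split_train_test(range(30))
```

The model is fitted on `TRAIN` only. `TEST` is held back so that the reported accuracy reflects cases the model has not seen.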
5: Evaluation
  1st objective: maximize accuracy of predicting insolvent customers
    Decision tree classifier best
  2nd objective: minimize error rate for solvent customers
    Neural network model close to decision tree
  Used all 3 on a case-by-case basis
Coincidence Matrix: Combined Models
                     Model insolvent   Model solvent   Unclass   Totals
  Actual insolvent   19                17              28        64
  Actual solvent     1                 626             27        654
  Totals             20                643             55        718
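The overall accuracy can be recomputed from the cells of the coincidence matrix. The sketch below uses only the six inner cell values from the table; the dictionary layout and function names are hypothetical.

```python
# Rows are actual classes, columns are the combined model's output.
MATRIX = {
    "insolvent": {"insolvent": 19, "solvent": 17, "unclassified": 28},
    "solvent": {"insolvent": 1, "solvent": 626, "unclassified": 27},
}


def total_cases(matrix):
    """Total number of test cases in the matrix."""
    return sum(sum(row.values()) for row in matrix.values())


def fraction_correct(matrix):
    """Share of all cases where the model output equals the actual class.

    Unclassified cases count as not correct.
    """
    correct = sum(matrix[actual][actual] for actual in matrix)
    return correct / total_cases(matrix)


def column_total(matrix, predicted):
    """Number of cases the model assigned to one output column."""
    return sum(row[predicted] for row in matrix.values())
```

Here `fraction_correct(MATRIX)` is (19 + 626) / 718, which is about 0.898.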
6: Implementation
  Every customer examined using all 3 algorithms
    If all 3 agreed, used that classification
    If disagreement, categorized as unclassified
  Correct on test data: 0.898
    Only 1 actually solvent customer would have been disconnected
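The agreement rule is simple enough to sketch directly. The function name and the label strings are assumptions; the three inputs stand for the outputs of the discriminant analysis, rule-based, and neural network models.

```python
def combine(predictions):
    """Return the shared label if every model agrees, else 'unclassified'."""
    first = predictions[0]
    if all(label == first for label in predictions):
        return first
    return "unclassified"


# Example: three models voting on two hypothetical customers.
AGREED = combine(["solvent", "solvent", "solvent"])
DISPUTED = combine(["solvent", "insolvent", "solvent"])
```

Requiring unanimity trades coverage for safety: more customers end up unclassified, but fewer solvent customers are wrongly flagged as insolvent.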