1 Title Introduction to Data Mining Dr Arulsivanathan Naidoo Statistics South Africa OECD Conference Cape Town 8-10 December

2 Outline Introduction Statistics vs Knowledge Discovery Predictive Modeling Data Mining Examples Census 2011 ROC Conclusions

3 Introduction
What is Data Mining?
Data mining is a general term. It is defined as the application of intelligent techniques such as decision trees, neural networks, fuzzy logic, genetic algorithms, the nearest neighbour method, rule induction and data visualization to large quantities of data to discover hidden trends, patterns and relationships (Lam and Kamber 2006).

4 Origins of Data Mining
Contributing fields (diagram with Data Mining at the centre): Statistics, Pattern Recognition, Neurocomputing, Machine Learning, Databases, Artificial Intelligence, KDD

5 Hypothesis Testing
Top-down: Statement -> Hypothesis -> Analysis -> Decision (accept H0)

6 Knowledge Discovery
Bottom-up: Data -> Question -> Answer -> Statement
Question: What item is purchased with disposable baby napkins?
Answer: Beer

7 Unsupervised Learning
Data
Association: items bought together
Disassociation: items not bought together
Sequential: items bought in order
Cluster / SOM: grouping into segments
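The idea of association ("items bought together") can be sketched with three common measures: support, confidence and lift. The baskets below are invented purely for illustration, and the function name `association_stats` is hypothetical; the slides do not name a specific algorithm or measure.

```python
# Hypothetical shopping baskets, invented purely for illustration.
baskets = [
    {"baby napkins", "beer", "bread"},
    {"baby napkins", "beer"},
    {"baby napkins", "milk"},
    {"beer", "crisps"},
    {"bread", "milk"},
]


def association_stats(baskets, a, b):
    """Return (support, confidence, lift) for the rule 'a is bought with b'."""
    n = len(baskets)
    n_a = sum(1 for basket in baskets if a in basket)
    n_b = sum(1 for basket in baskets if b in basket)
    n_ab = sum(1 for basket in baskets if a in basket and b in basket)
    support = n_ab / n             # share of all baskets containing both items
    confidence = n_ab / n_a        # share of baskets with a that also contain b
    lift = confidence / (n_b / n)  # above 1 hints at association, below 1 at disassociation
    return support, confidence, lift


support, confidence, lift = association_stats(baskets, "baby napkins", "beer")
print(support, confidence, lift)
```

This sketch assumes both items occur in at least one basket; otherwise the divisions would fail.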

8 Supervised Learning
Data, Target Variable
Methods: Decision Tree, Regression, Neural Network, Two Stage

9 What is a Model?
One word: Equation
Straight line: Y = mx + c
Example: Countryside
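As a rough illustration of a model being an equation, the slope m and intercept c of a straight line can be estimated from data. The sketch below assumes ordinary least squares, which the slide does not specify, and uses invented points; `fit_line` is a hypothetical helper name.

```python
def fit_line(xs, ys):
    """Estimate m and c in Y = m*x + c by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums (the common factor 1/n cancels out).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx
    c = mean_y - m * mean_x
    return m, c


# Invented points that lie exactly on Y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
m, c = fit_line(xs, ys)
print(m, c)
```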

10 Decision Tree
A decision tree model is constructed by segmenting a dataset using a series of simple rules, resulting in a hierarchy of segments within segments.
Algorithms such as CHAID (chi-squared automatic interaction detection) can be used to decide how to split the segments.
The hierarchy is called a tree and each segment a node.

11 Decision Tree
Data: 100 M, 100 W
Splits: Short Hair / Long Hair; Earrings / No Earrings
Predict that everyone with short hair and earrings is female.
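A chi-squared statistic can be used to compare candidate splits, which is the core idea behind CHAID. The sketch below is a simplification: real CHAID also merges categories and works with adjusted p-values. The counts are hypothetical, since the slide gives only the totals of 100 M and 100 W.

```python
def chi_squared(table):
    """Pearson chi-squared statistic for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat


# Hypothetical counts for 100 M and 100 W.
# Each row is one segment of the split; the columns are (M, W).
candidate_splits = {
    "hair length": [[90, 30], [10, 70]],  # short hair, long hair
    "earrings": [[20, 80], [80, 20]],     # earrings, no earrings
}

scores = {name: chi_squared(table) for name, table in candidate_splits.items()}
best_split = max(scores, key=scores.get)  # split whose segments differ most
print(scores, best_split)
```

With these invented counts the first split would be on hair length; different counts could favour earrings instead.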

12 Regression
Inputs: x1, x2, x3
Output: Y

13 Neural Networks
Inputs: x1, x2, x3
H1, H2 (black box)
Outputs: Y
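The diagram can be read as a small feed-forward network. The sketch below assumes that H1 and H2 are hidden units with a sigmoid activation and that Y is a single sigmoid output; the weights are arbitrary illustrative values, not taken from the slides.

```python
import math


def sigmoid(z):
    """Squash any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, hidden_weights, hidden_biases, output_weights, output_bias):
    """One forward pass: inputs x1..x3 -> hidden units H1, H2 -> output Y."""
    hidden = [
        sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
        for weights, bias in zip(hidden_weights, hidden_biases)
    ]
    y = sigmoid(sum(w * h for w, h in zip(output_weights, hidden)) + output_bias)
    return hidden, y


# Arbitrary illustrative weights: one row of three weights per hidden unit.
hidden_weights = [[0.5, -0.2, 0.1], [-0.3, 0.8, 0.4]]
hidden_biases = [0.0, 0.0]
output_weights = [1.0, -1.0]
output_bias = 0.0

hidden, y = forward([1.0, 2.0, 3.0], hidden_weights, hidden_biases,
                    output_weights, output_bias)
print(hidden, y)
```

In practice the weights are learned from training data; how they combine into Y is hard to interpret, which is why the slide calls the network a black box.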

14 Two Stage Buy from every Catalogue R100 Buy from Catalogue once/year R

15 Eurostat Funding
KESO (Knowledge Extraction for Statistical Offices): a Eurostat project whose goal is to construct a versatile, efficient, industrial-strength data mining system that satisfies the needs of providers of large-scale databases.
SPIN (Spatial Mining for Data of Public Interest): developed to support statistical offices in the timely and cost-effective dissemination of statistical data by integrating state-of-the-art GIS and data mining functionality in an open, highly extensible, internet-enabled plug-in architecture.
IDSA (Intelligent Data Control System), Hassain et al 2010: an application of data mining to official statistics.

16 NASS
Decision Trees
Census non-response weighting
Census Mail List trimming
Analysis of reporting errors
Allocation of survey incentives
Prediction of survey non-respondents

17 NASS
Cluster Analysis
2007 Census donor pool screening
Questionnaire design and construction
Identifying subtypes of records missing from the Census Mail List
Association Analysis
Survey data edit design

18 Examples
Absa Branch Robberies
Old Mutual Policies
MTN prepaid
HSBC Bank Credit Cards
Royal Saudi Air Force
Census

19 Census 2011 Model A Results Sample Data Model B Assess Score (Ranking) Model C Census 2001 Will Respond Informal Areas High Wall Areas 19

20 Prediction Types
Training Data
Case 1: inputs, target
Case 2: inputs, target
Case 3: inputs, target
Case 4: inputs, target
Case 5: inputs, target
Predictions: Decisions, Rankings, Estimates

21 Prediction Types
Training Data
Case 1: inputs, target
Case 2: inputs, target
Case 3: inputs, target
Case 4: inputs, target
Case 5: inputs, target
Decisions: Success, Failure, Failure, Success, Success

22 Prediction Types
Training Data
Case 1: inputs, target
Case 2: inputs, target
Case 3: inputs, target
Case 4: inputs, target
Case 5: inputs, target
Rankings

23 Prediction Types
Training Data
Case 1: inputs, target
Case 2: inputs, target
Case 3: inputs, target
Case 4: inputs, target
Case 5: inputs, target
Estimates
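One common way to relate the three prediction types is to start from estimates, sort them to obtain rankings, and apply a cut-off to obtain decisions. The sketch below uses hypothetical estimates and an assumed threshold of 0.5; neither comes from the slides, although the estimates were chosen so that the resulting decisions match the Success/Failure pattern on slide 21.

```python
# Hypothetical estimated probabilities of success for the five cases.
estimates = {
    "Case 1": 0.8,
    "Case 2": 0.3,
    "Case 3": 0.4,
    "Case 4": 0.7,
    "Case 5": 0.6,
}

# Rankings: order the cases from most to least likely to succeed.
ranking = sorted(estimates, key=estimates.get, reverse=True)

# Decisions: apply an assumed cut-off to each estimate.
threshold = 0.5
decisions = {
    case: ("Success" if p >= threshold else "Failure")
    for case, p in estimates.items()
}

print(ranking)
print(decisions)
```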

24 Prediction Type | Validation Fit Statistic | Direction
Decisions | Misclassification | Smallest
Decisions | Average Profit/Loss | Largest/Smallest
Decisions | Kolmogorov-Smirnov Statistic | Largest
Rankings | ROC Index (Concordance) | Largest
Rankings | Gini Coefficient | Largest
Estimates | Average Squared Error | Smallest
Estimates | Schwarz's Bayesian Criterion | Smallest
Estimates | Log-likelihood | Largest
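Two of the fit statistics above are simple enough to compute by hand. The sketch below shows the misclassification rate (for decisions) and the average squared error (for estimates) on invented validation data, again assuming a 0.5 cut-off for turning estimates into decisions.

```python
# Invented validation data: actual targets (1 = success) and model estimates.
actual = [1, 0, 0, 1, 1]
estimates = [0.8, 0.3, 0.6, 0.7, 0.4]

# Decisions from an assumed cut-off of 0.5.
decisions = [1 if p >= 0.5 else 0 for p in estimates]

# Misclassification rate: share of cases where the decision is wrong (smaller is better).
misclassification = sum(d != a for d, a in zip(decisions, actual)) / len(actual)

# Average squared error between estimate and target (smaller is better).
average_squared_error = sum((p - a) ** 2 for p, a in zip(estimates, actual)) / len(actual)

print(misclassification, average_squared_error)
```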

25 Confusion Matrix
Actual | Predicted male | Predicted female
Male | True Positive (a) | False Negative (b)
Female | False Positive (c) | True Negative (d)

26 ROC
Sensitivity = a / (a + b)
Specificity = d / (c + d)
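The cells a, b, c and d and the two ratios can be computed directly from paired actual and predicted labels. The labels below are invented for illustration, and "male" is treated as the positive class, in line with the confusion matrix slide; `confusion_counts` is a hypothetical helper name.

```python
def confusion_counts(actual, predicted, positive="male"):
    """Return the cells (a, b, c, d) of a 2x2 confusion matrix."""
    a = b = c = d = 0
    for act, pred in zip(actual, predicted):
        if act == positive and pred == positive:
            a += 1  # true positive
        elif act == positive:
            b += 1  # false negative
        elif pred == positive:
            c += 1  # false positive
        else:
            d += 1  # true negative
    return a, b, c, d


# Invented labels for eight people.
actual = ["male", "male", "male", "female", "female", "female", "female", "male"]
predicted = ["male", "male", "female", "female", "female", "male", "female", "male"]

a, b, c, d = confusion_counts(actual, predicted)
sensitivity = a / (a + b)  # share of actual positives predicted correctly
specificity = d / (c + d)  # share of actual negatives predicted correctly
print(a, b, c, d, sensitivity, specificity)
```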

27 ROC
The ROC (Receiver Operating Characteristic) curve was first used during World War 2, following the attack on Pearl Harbour in 1941. The US Army researched how to correctly detect Japanese aircraft from their radar signals.

28 ROC Curve
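An ROC curve plots sensitivity against 1 - specificity as the decision cut-off is lowered. The sketch below builds the curve from invented scores, then derives the ROC index (area under the curve), the Gini coefficient as 2 x ROC index - 1, and the Kolmogorov-Smirnov statistic as the largest gap between the two rates. It assumes all scores are distinct; tied scores would need extra handling.

```python
def roc_points(actual, scores):
    """Return (1 - specificity, sensitivity) points as the cut-off is lowered."""
    positives = sum(actual)
    negatives = len(actual) - positives
    points = [(0.0, 0.0)]
    tp = fp = 0
    # Visit cases from the highest score to the lowest.
    for score, label in sorted(zip(scores, actual), reverse=True):
        if label == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / negatives, tp / positives))
    return points


def area_under_curve(points):
    """Trapezoidal area under a curve given as (x, y) points sorted by x."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Invented targets (1 = positive) and model scores.
actual = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

points = roc_points(actual, scores)
roc_index = area_under_curve(points)
gini = 2 * roc_index - 1
ks = max(tpr - fpr for fpr, tpr in points)
print(roc_index, gini, ks)
```

A model that ranks no better than chance would give an ROC index near 0.5 and a Gini coefficient near 0.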

29 Conclusion
Data mining is a growing discipline which originated outside statistics, in the database community, mainly for commercial purposes.
Today data mining can be considered a branch of exploratory statistics in which useful models and patterns are uncovered through the extensive use of algorithms.
Finally, who should analyse huge data sets: the national statistics offices (NSOs) or other research institutions?
Data mining techniques use individual records, not aggregate data.
There is, by law, the confidentiality clause.
The NSOs are best placed, and this will imply new directions of research.

30 Conclusion
Official statistics should be a field for data mining, giving new life and value to its huge databases, but this may imply a redefinition of the visions and missions of official statistics offices.
South Africa changed its vision and mission this year.
At Statistics South Africa we have acquired data mining software and have started a data mining user group of over 100 researchers.
We are hoping to start a working paper series in which some of this research will be published on our website for comment.

31 Thank you

