Data Mining Toon Calders t.calders@tue.nl
What is Data Mining? Huge sets of data are being collected and stored
What is Data Mining? Analyzing all data manually becomes impossible Data mining emerged from this need Data mining is the use of sophisticated data analysis tools to discover previously unknown, valid patterns and relationships in large data sets. (Hand, Mannila, Smyth)
2II15: Course Organization Lectures: Thursday 13:30-15:15 in Auditorium 12 Lecturer: Toon Calders (t.calders@tue.nl, HG 7.82a) Course website: http://www.win.tue.nl/~tcalders/teaching/datamining/ Book: Tan, Steinbach, Kumar: Introduction to Data Mining Subscribe to the course in Studyweb
2II15: Course Organization Evaluation: Written exam 50% Group project 50% Without the project, no grade; without the exam, no grade Project/exam scores can be carried over to the August resit if the score is at least 6
2II15: Course Organization Group project Groups of 3-4 students Pick a dataset to analyze (suggestions online) Analyze the dataset; report results W8: Groups formed, assignment proposal W14: half-time report (presentation in W16) W22: end presentation (report in W23) Detailed list + examples next week
Outline Three Main Categories: Classification Clustering Pattern Mining Potential dangers of Data Mining Overfitting Bad experimental design Spurious discoveries Case study
Technique 1: Classification Learn a model based on labeled data. The model can then be used for prediction. Example: (decision-tree figure with attributes age (<30 / ≥30), gender (M / F), and car type (sports / family), and leaf predictions High / Medium / High / Low)
Technique 1: Classification Example: classifying astronomical objects
  Class: phase (early / intermediate / late)
  Attributes: image features, wavelengths
  Dataset size: 72 million stars, 20 million galaxies
  Object catalog: 9 GB; image database: 150 GB
Courtesy: http://aps.umn.edu
Technique 1: Classification Other examples:
- Fraud detection: "Fight against tax fraud yielded 590 million last year" (source: De Standaard, 6/6/08). "The data-mining technique keeps yielding just as much money: 204.37 million euro last year, against 218.3 million in 2006."
- Spam filters, e.g.:
  Content analysis details: (5.7 points, 5.0 required)
  pts  rule name          description
  ---- ------------------ --------------------------------------------------
  0.6  NO_REAL_NAME       From: does not include a real name
  0.0  NORMAL_HTTP_TO_IP  URI: Uses a dotted-decimal IP address in URL
  2.0  RCVD_IN_SORBS_DUL  RBL: SORBS: sent directly from dynamic IP address
                          [122.164.179.102 listed in dnsbl.sorbs.net]
  3.1  RCVD_IN_XBL        RBL: Received via a relay in Spamhaus XBL
                          [122.164.179.102 listed in zen.spamhaus.org]
  0.0  RCVD_IN_PBL        RBL: Received via a relay in Spamhaus PBL
                          [122.164.179.102 listed in zen.spamhaus.org]
- Classifying solar systems
Technique 1: Classification The course will cover: Different algorithms: Decision tree construction Nearest neighbor Naïve Bayes How to combine classifiers How to evaluate the performance of classifiers
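As a taster of the classifiers listed above, here is a minimal naïve Bayes sketch for categorical attributes. The data echoes the age/gender toy example but is made up, and the Laplace smoothing assumes two possible values per attribute; real implementations handle this more carefully.

```python
# Minimal naive Bayes sketch for categorical attributes (toy, illustrative data).
from collections import Counter, defaultdict

def train_nb(rows, labels):
    """Record class counts and per-(attribute, class) value counts."""
    priors = Counter(labels)
    counts = defaultdict(Counter)
    for row, y in zip(rows, labels):
        for i, v in enumerate(row):
            counts[(i, y)][v] += 1
    return priors, counts

def predict_nb(priors, counts, row):
    """Maximize P(class) * prod_i P(value_i | class), with Laplace smoothing
    (assuming 2 possible values per attribute)."""
    total = sum(priors.values())
    best, best_score = None, -1.0
    for y, ny in priors.items():
        score = ny / total
        for i, v in enumerate(row):
            score *= (counts[(i, y)][v] + 1) / (ny + 2)
        if score > best_score:
            best, best_score = y, score
    return best

# Hypothetical labeled data: (age group, gender) -> buys a sports car?
rows = [("<30", "M"), ("<30", "M"), ("<30", "F"), (">=30", "M"), (">=30", "F")]
labels = ["yes", "yes", "no", "no", "no"]
priors, counts = train_nb(rows, labels)
print(predict_nb(priors, counts, ("<30", "M")))   # -> yes
print(predict_nb(priors, counts, (">=30", "F")))  # -> no
```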
Technique 2: Clustering Automatically dividing data into homogeneous groups
Technique 2: Clustering Example:
Technique 2: Clustering Clustering stocks with similar behavior
Technique 2: Clustering The course will cover: Distance-based clustering Density-based clustering Hierarchical (agglomerative) clustering How to measure cluster quality
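One instance of the hierarchical methods covered here can be sketched in a few lines: single-linkage agglomerative clustering, shown on made-up one-dimensional points (real implementations work on distance matrices and merge far more efficiently).

```python
# Single-linkage agglomerative clustering sketch on toy 1-D points.

def single_linkage(points, k):
    """Start with every point in its own cluster; repeatedly merge the two
    clusters with the smallest minimum inter-point distance until k remain."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        best = None  # (distance, i, j) of the closest pair of clusters
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

groups = single_linkage([1.0, 1.2, 5.0, 5.1, 9.9], k=3)
print(sorted(sorted(g) for g in groups))  # [[1.0, 1.2], [5.0, 5.1], [9.9]]
```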
Technique 3: Pattern Mining Find regularities, trends, patterns that frequently occur in the data
Technique 3: Pattern Mining Other example:
Technique 3: Pattern Mining The course will cover: Algorithms Apriori FPGrowth Output reduction Condensed representations
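The level-wise idea behind Apriori can be sketched compactly; the market baskets below are made up for illustration, and a real miner would use hash trees or FP-trees for the counting.

```python
# Minimal Apriori sketch on toy market-basket data.
from itertools import combinations

def apriori(transactions, minsup):
    """Level-wise search: a set can only be frequent if all its subsets are."""
    items = sorted({i for t in transactions for i in t})
    frequent = {}
    candidates = [frozenset([i]) for i in items]
    while candidates:
        # Count the support (number of containing transactions) of each candidate.
        level = {}
        for c in candidates:
            support = sum(1 for t in transactions if c <= t)
            if support >= minsup:
                level[c] = support
        frequent.update(level)
        # Join frequent sets into next-level candidates, pruning any candidate
        # that has an infrequent subset (the Apriori property).
        candidates = []
        for a, b in combinations(level, 2):
            u = a | b
            if len(u) == len(a) + 1 and u not in candidates:
                if all(u - {x} in level for x in u):
                    candidates.append(u)
    return frequent

baskets = [frozenset(t) for t in
           [{"beer", "chips"}, {"beer", "chips", "salsa"},
            {"beer"}, {"chips", "salsa"}]]
patterns = apriori(baskets, minsup=2)
print(sorted((sorted(s), n) for s, n in patterns.items()))
```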
Techniques: Summary Current state of the art in Data Mining: a toolbox of many different techniques; also deviation/outlier detection, regression, web mining, … Typically data mining involves many different steps: there is no single optimal algorithm; it is an interactive process
Outline Three Main Categories: Classification Clustering Pattern Mining Case Study: Heating and Cooling Potential dangers of Data Mining Meaningless Discoveries Overfitting Bad experimental design
Case Study Optimizing energy usage for heating and cooling complex system dynamics only partially known lots of data being generated
Case Study Performance of individual components in idealized conditions well-known Reality turns out not to be so nice Different parameters constantly being monitored Room temperature Temperature in boiler Flow of water
Case Study Data mining helps: Model the «normal» behavior of the system, learned from observations (classification/regression) where it is difficult to model statistically Monitor when the system no longer follows the model (alarm function: something changed) Find regularities in the irregularities
Case Study: Conclusion Real applications need Physics Statistics estimate situation-dependent parameters Data mining for finding unexpected patterns, modelling complex systems
Outline Three Main Categories: Classification Clustering Pattern Mining Case study Potential dangers of Data Mining Meaningless discoveries Overfitting Bad experimental design
Meaningless Discoveries Implication ≠ causality Simpson's paradox Data dredging Redundancy No new information
Implication ≠ Causality Correlated, but not causal: Diet Coke ↔ obesity; intensive care ↔ death. At the beach: ice cream sales go up ⇒ # drowned goes up, and # drowned goes up ⇒ ice cream sales go up
Simpson's Paradox Two hospitals: an academic hospital and a local hospital. The success rate of simple and complex operations is measured:

           Academic   Local
  Simple     95%       92%
  Complex    75%       60%
  Total      78%       89%
Simpson's Paradox Two hospitals: an academic hospital and a local hospital. The success rate of simple and complex operations is measured:

           Academic    Local
  Simple   190/200     920/1000
  Complex  750/1000    60/100
  Total    940/1200    980/1100
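The reversal can be checked directly from the counts on this slide: the academic hospital wins on each kind of operation separately, yet loses on the aggregate, because it performs far more of the hard complex operations.

```python
# Success counts from the slide: (successes, total operations).
data = {
    "Academic": {"Simple": (190, 200), "Complex": (750, 1000)},
    "Local":    {"Simple": (920, 1000), "Complex": (60, 100)},
}

def rate(hospital, kind):
    s, n = data[hospital][kind]
    return s / n

def total_rate(hospital):
    s = sum(x for x, _ in data[hospital].values())
    n = sum(y for _, y in data[hospital].values())
    return s / n

# Academic is better on every kind of operation...
print(rate("Academic", "Simple"), rate("Local", "Simple"))    # 0.95 vs 0.92
print(rate("Academic", "Complex"), rate("Local", "Complex"))  # 0.75 vs 0.6
# ...yet worse overall: the mix of operation types differs between hospitals.
print(total_rate("Academic"), total_rate("Local"))            # ~0.78 vs ~0.89
```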
Data Dredging Torturing the data until they confess If you keep trying, eventually you will succeed.
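A back-of-the-envelope calculation shows why "keep trying" eventually succeeds; the numbers here (10 flips, 200 coins) are illustrative, not from the course.

```python
# Data dredging by the numbers: run enough tests and one will "succeed".
from math import comb

# One fair coin, 10 flips: probability of a result that "looks biased"
# (at most 1 or at least 9 heads).
p_single = 2 * (comb(10, 9) + comb(10, 10)) / 2**10   # ~0.021
# Now "torture" 200 fair coins: probability that at least one looks biased.
p_dredged = 1 - (1 - p_single) ** 200                 # ~0.99
print(round(p_single, 3), round(p_dredged, 3))
```

A single 2% false-positive rate sounds safe; repeated over hundreds of hypotheses it all but guarantees a spurious discovery.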
Redundancy Often the number of frequent sets is extremely large. (figure: data → patterns)
No New Information Most frequent patterns = most well-known patterns Many interesting patterns are infrequent; otherwise we would already know them
Outline Three Main Categories: Classification Clustering Pattern Mining Case study Potential dangers of Data Mining Meaningless discoveries Overfitting Bad experimental design
Overfitting Setting: training data, plus a separate set for testing the model We keep updating the model: making it more and more specific, better and better on the training data What happens to the generalization power?
Overfitting
Overfitting (figure: error curves with underfitting and overfitting regions) Underfitting: the model did not see enough data Overfitting: the model learns peculiarities of the input data
Overfitting Due to Noise Two-dimensional data, class + or − (figure: scatter plot over attributes A and B)
Overfitting Due to Noise Good model (figure: smooth decision boundary)
Overfitting Due to Noise Bad model with better training performance (figure: boundary contorted to fit the noise points)
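The effect can be demonstrated with a toy memorizing model; the 1-D points and the noise point below are made up. A model that memorizes the training data scores perfectly on it, yet a single mislabeled point derails it at test time, while a simpler model shrugs the noise off.

```python
# Overfitting sketch: memorizer (1-NN) vs a simple one-split model.

train = [(1.0, "+"), (2.0, "+"), (3.0, "+"), (4.8, "-"),  # 4.8 is a noise point
         (6.0, "-"), (7.0, "-"), (8.0, "-")]

def nn_predict(x):
    """Memorizing model: label of the closest training point."""
    return min(train, key=lambda t: abs(t[0] - x))[1]

def threshold_predict(x):
    """Simple model: one split at 5.0, ignoring the noise point."""
    return "+" if x < 5.0 else "-"

train_acc_nn = sum(nn_predict(x) == y for x, y in train) / len(train)
train_acc_thr = sum(threshold_predict(x) == y for x, y in train) / len(train)
print(train_acc_nn, train_acc_thr)   # 1.0 vs ~0.86 on the training data

test_point, true_label = 4.6, "+"
print(nn_predict(test_point))        # "-": the memorizer follows the noise
print(threshold_predict(test_point)) # "+": the simple model generalizes
```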
Outline Three Main Categories: Classification Clustering Pattern Mining Case study Potential dangers of Data Mining Meaningless discoveries Overfitting Bad experimental design
Bad Experimental Design Keep in mind: Never, ever test performance of your solutions on data that is used in the training process Always keep the scenario in mind in which you will deploy your method
Bad Experimental Design Example: Nearest Neighbor Classification. Training set has been given:

  A     B     C     D    Class
  0.5   0.3   0.1   7.5    +
  0.3   0.1   0.7   8.9    −
  0.4   0.2   0.8   4.2    +

Classifying a new example p: find the closest example q in the training set; assign the label of q to p
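The procedure is a few lines of code; the training rows come from the slide, while the query point is a hypothetical new example.

```python
# Nearest-neighbor classification sketch using the training set from the slide.
from math import dist  # Euclidean distance (Python 3.8+)

train = [((0.5, 0.3, 0.1, 7.5), "+"),
         ((0.3, 0.1, 0.7, 8.9), "-"),
         ((0.4, 0.2, 0.8, 4.2), "+")]

def classify(p):
    """Assign p the label of its closest training example."""
    _, label = min(train, key=lambda t: dist(t[0], p))
    return label

print(classify((0.45, 0.25, 0.2, 7.0)))  # closest to the first row -> "+"
```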
Bad Experimental Design
Bad Experimental Design How do we measure the distance? Weighted Euclidean distance between a new point (p_1, …, p_n) and a training point (q_1, …, q_n):

  dist(p, q) = √( Σ_{k=1}^{n} w_k (p_k − q_k)² )

We try some different settings for the weights: Equal weights → accuracy of 56% Standardized weights → accuracy of 65% Giving more weight to C → accuracy of 75%
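A short sketch of the weighted distance shows how much the weights matter: changing them can change which training example is "nearest". The points and weight vectors below are made up for illustration, not the actual settings behind the 56/65/75% figures.

```python
# Weighted Euclidean distance sketch: weights decide who the nearest neighbor is.
from math import sqrt

def wdist(p, q, w):
    """sqrt( sum_k w_k * (p_k - q_k)^2 )"""
    return sqrt(sum(wk * (pk - qk) ** 2 for wk, pk, qk in zip(w, p, q)))

p = (0.45, 0.25, 0.75, 8.0)          # hypothetical new example
a = (0.5, 0.3, 0.1, 7.5)             # a training example labelled "+"
b = (0.3, 0.1, 0.7, 8.9)             # a training example labelled "-"

equal = (1, 1, 1, 1)
c_heavy = (1, 1, 100, 1)             # hypothetical weighting emphasising C

print(wdist(p, a, equal) < wdist(p, b, equal))      # True: a is closer
print(wdist(p, a, c_heavy) < wdist(p, b, c_heavy))  # False: weighting C flips it
```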
Bad Experimental Design We draw the following conclusions: Standardized weights with a small correction to increase the weight of C gives the best results We can get an accuracy as high as 75%
Bad Experimental Design WHAT IS WRONG? (Problem reported by Eamonn Keogh)
Conclusions Three main techniques: Classification Pattern Mining Clustering Many dangers Under/overfitting Meaningless discoveries Bad experimental design
See you again next week!