Data Mining. Toon Calders



Data Mining Toon Calders t.calders@tue.nl

What is Data Mining? Huge sets of data are being collected and stored

What is Data Mining? Analyzing all data manually becomes impossible. Data mining emerged from this need. Data mining is the use of sophisticated data analysis tools to discover previously unknown, valid patterns and relationships in large data sets. (Hand, Mannila, Smyth)

2II15: Course Organization. Lectures: Thursday 13:30-15:15 in Auditorium 12. Lecturer: Toon Calders (t.calders@tue.nl, HG 7.82a). Course website: http://www.win.tue.nl/~tcalders/teaching/datamining/ Book: Tan, Steinbach, Kumar: Introduction to Data Mining. Subscribe to the course in Studyweb.

2II15: Course Organization. Evaluation: written exam 50%, group project 50%. Without project, no grade; without exam, no grade. Project/exam scores can be transferred to August if at least 6.

2II15: Course Organization. Group project: groups of 3-4 students. Pick a dataset to analyze (suggestions online). Analyze the dataset; report results. W8: groups formed, assignment proposal. W14: half-time report (presentation in W16). W22: end presentation (report in W23). Detailed list + examples next week.

Outline. Three main categories: classification, clustering, pattern mining. Potential dangers of Data Mining: overfitting, bad experimental design, spurious discoveries. Case study.

Technique 1: Classification. Learn a model based on labeled data. The model can be used for prediction. Example (figure: decision tree with tests on age (<30 / 30 and up), gender (M / F) and car type (sports / family); the leaves are labeled High, Medium, High, Low).

Technique 1: Classification. Class: phase (early, intermediate, late). Attributes: image features, wavelengths. Dataset size: 72 million stars, 20 million galaxies. Object catalog: 9 GB. Image database: 150 GB. Courtesy: http://aps.umn.edu

Technique 1: Classification. Other examples:
Fraud detection (datamining): "Fight against tax fraud brought in 590 million last year" [...] "The data mining technique does continue to bring in just as much money: 204.37 million euros last year, compared with 218.3 million in 2006." [...] Source: De Standaard 6/6/08
Classifying solar systems
Spam filters:
Content analysis details: (5.7 points, 5.0 required)
pts   rule name           description
----  ------------------  -------------------------------------------------
0.6   NO_REAL_NAME        From: does not include a real name
0.0   NORMAL_HTTP_TO_IP   URI: Uses a dotted-decimal IP address in URL
2.0   RCVD_IN_SORBS_DUL   RBL: SORBS: sent directly from dynamic IP address [122.164.179.102 listed in dnsbl.sorbs.net]
3.1   RCVD_IN_XBL         RBL: Received via a relay in Spamhaus XBL [122.164.179.102 listed in zen.spamhaus.org]
0.0   RCVD_IN_PBL         RBL: Received via a relay in Spamhaus PBL [122.164.179.102 listed in zen.spamhaus.org]
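The spam report on this slide is a small rule-based classifier: each rule that fires contributes points, and the total is compared with a required score. A minimal sketch of that arithmetic follows. The rule names and points are copied from the report; the function names are hypothetical, and flagging a message once the total reaches the required score is an assumption about how the threshold is applied.

```python
# Rule hits copied from the report on the slide: (points, rule name).
RULE_HITS = [
    (0.6, "NO_REAL_NAME"),
    (0.0, "NORMAL_HTTP_TO_IP"),
    (2.0, "RCVD_IN_SORBS_DUL"),
    (3.1, "RCVD_IN_XBL"),
    (0.0, "RCVD_IN_PBL"),
]
REQUIRED = 5.0  # "5.0 required" in the report header


def spam_score(hits):
    """Sum the points of all rules that fired."""
    return sum(points for points, _name in hits)


def is_spam(hits, required=REQUIRED):
    """Assumption: a message is flagged once the total reaches the required score."""
    return spam_score(hits) >= required


print(round(spam_score(RULE_HITS), 1), is_spam(RULE_HITS))
```

The two blacklist rules carry most of the weight here; the message would not be flagged on the missing real name alone.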

Technique 1: Classification. The course will cover: different algorithms (decision tree construction, nearest neighbor, Naïve Bayes); how to combine classifiers; how to evaluate the performance of classifiers.

Technique 2: Clustering Automatically dividing data into homogeneous groups

Technique 2: Clustering Example:

Technique 2: Clustering. Clustering stocks with similar behavior.

Technique 2: Clustering. The course will cover: agglomerative clustering; distance based; density based; hierarchical clustering; how to measure cluster quality.
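To make "dividing data into homogeneous groups" concrete, here is a minimal sketch of agglomerative clustering on one-dimensional values. It assumes single linkage (the distance between two clusters is the distance between their closest members), which is only one of several possible choices; the function name and the sample values are made up for illustration.

```python
def single_linkage(points, num_clusters):
    """Agglomerative clustering of 1-D points, merging the two closest clusters each round."""
    clusters = [[p] for p in points]  # start with every point in its own cluster
    while len(clusters) > num_clusters:
        best = None  # (distance, i, j) of the closest pair of clusters seen so far
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i].extend(clusters[j])  # merge cluster j into cluster i
        del clusters[j]
    return [sorted(c) for c in clusters]


values = [1.0, 1.2, 1.1, 5.0, 5.3, 9.8, 10.0]  # illustrative data
print(single_linkage(values, 3))
```

Stopping at a fixed number of clusters is a simplification; recording every merge instead would give the full hierarchy.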

Technique 3: Pattern Mining Find regularities, trends, patterns that frequently occur in the data

Technique 3: Pattern Mining Other example:

Technique 3: Pattern Mining. The course will cover: algorithms (Apriori, FPGrowth); output reduction; condensed representations.
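As a rough illustration of the Apriori idea named above, the sketch below counts itemsets level by level and only extends itemsets whose subsets are all frequent. It is a simplified version written for readability, not an optimized implementation; the function name, the baskets, and the support threshold are invented for the example.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: count} for every itemset contained in at least min_support transactions."""
    transactions = [frozenset(t) for t in transactions]
    items = sorted({item for t in transactions for item in t})
    candidates = [frozenset([item]) for item in items]
    frequent = {}
    k = 1
    while candidates:
        # Count how many transactions contain each candidate.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        level = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(level)
        # Join step: unions of two frequent k-itemsets that have exactly k + 1 items.
        k += 1
        joined = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: keep a candidate only if all its (k - 1)-subsets are frequent.
        candidates = [c for c in joined
                      if all(frozenset(s) in level for s in combinations(c, k - 1))]
    return frequent


baskets = [  # made-up transactions
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
    {"bread", "milk", "beer"},
]
for itemset, count in apriori(baskets, min_support=3).items():
    print(sorted(itemset), count)
```

With these baskets only the four single items and the pair bread + milk reach the threshold, so no three-item candidate is ever generated.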

Techniques: Summary. Current state of the art in Data Mining: a toolbox. Many different techniques; also deviation/outlier detection, regression, web mining, ... Typically, Data Mining involves many different steps. Not one optimal algorithm. Interactive process.

Outline. Three main categories: classification, clustering, pattern mining. Case study: heating and cooling. Potential dangers of Data Mining: meaningless discoveries, overfitting, bad experimental design.

Case Study. Optimizing energy usage for heating and cooling: complex system; dynamics only partially known; lots of data being generated.

Case Study. The performance of individual components in idealized conditions is well known. Reality turns out not to be so nice. Different parameters are constantly being monitored: room temperature, temperature in boiler, flow of water.

Case Study. Data mining helps: Model «normal» behavior of the system: learned from observations; classification/regression; difficult to model statistically. Monitor when the system no longer follows the model; alarm function: something changed. Find regularities in the irregularities.
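The "model normal behavior, then raise an alarm" idea can be sketched with a simple regression. The example below assumes, purely for illustration, that room temperature is modeled as a linear function of boiler temperature; the readings, the tolerance, and the function names are all invented and are not taken from the case study.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a * x + b; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, mean_y - a * mean_x


def alarms(model, xs, ys, tolerance):
    """Indices of observations whose deviation from the model exceeds the tolerance."""
    a, b = model
    return [i for i, (x, y) in enumerate(zip(xs, ys))
            if abs(y - (a * x + b)) > tolerance]


# Invented readings from a period considered normal.
boiler_normal = [60.0, 65.0, 70.0, 75.0, 80.0]
room_normal = [19.0, 19.5, 20.0, 20.5, 21.0]
model = fit_line(boiler_normal, room_normal)

# Invented new readings; the last one no longer follows the learned relation.
boiler_new = [62.0, 72.0, 78.0]
room_new = [19.2, 20.2, 17.5]
print(alarms(model, boiler_new, room_new, tolerance=1.0))
```

The alarm only says that something changed; finding out what changed is a separate step.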

Case Study: Conclusion. Real applications need: physics; statistics, to estimate situation-dependent parameters; data mining, for finding unexpected patterns and modelling complex systems.

Outline. Three main categories: classification, clustering, pattern mining. Case study. Potential dangers of Data Mining: meaningless discoveries, overfitting, bad experimental design.

Meaningless Discoveries. Implication ≠ causality. Simpson's paradox. Data dredging. Redundancy. No new information.

Implication ≠ Causality. Diet Coke → obesity. Intensive care → death. Beach: ice cream sales go up → # drowned goes up; # drowned goes up → ice cream sales go up.

Simpson's Paradox. Two hospitals: academic hospital, local hospital. Success rate of simple and complex operations is measured:
          Academic   Local
Simple    95%        92%
Complex   75%        60%
Total     78%        89%

Simpson's Paradox. Two hospitals: academic hospital, local hospital. Success rate of simple and complex operations is measured:
          Academic   Local
Simple    190/200    920/1000
Complex   750/1000   60/100
Total     940/1200   980/1100
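The reversal can be checked directly from the counts on this slide. The short sketch below recomputes the per-category and overall success rates; the counts come from the slide, while the variable and function names are only illustrative.

```python
# (successes, operations) per hospital and operation type, from the slide.
results = {
    "Academic": {"Simple": (190, 200), "Complex": (750, 1000)},
    "Local": {"Simple": (920, 1000), "Complex": (60, 100)},
}


def rate(successes, total):
    """Success rate as a fraction between 0 and 1."""
    return successes / total


def overall(hospital):
    """Success rate over all operations of one hospital."""
    wins = sum(s for s, _ in results[hospital].values())
    ops = sum(n for _, n in results[hospital].values())
    return rate(wins, ops)


for kind in ("Simple", "Complex"):
    print(kind, {h: round(rate(*results[h][kind]), 3) for h in results})
print("Total", {h: round(overall(h), 3) for h in results})
```

The academic hospital is better in both categories but worse in total, because most of its operations are complex ones, which have a lower success rate everywhere.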

Data Dredging. Torturing the data until they confess: if you keep trying, eventually you will succeed.
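One way to see why "if you keep trying, eventually you will succeed" is to compute the chance of at least one spurious finding when many hypotheses are tested on data that contains no real effect. The sketch below assumes independent tests and a 5% false-positive rate per test; both are assumptions made for illustration and are not stated on the slide.

```python
def chance_of_spurious_hit(num_tests, alpha=0.05):
    """Probability that at least one of num_tests independent tests on pure noise looks significant."""
    return 1.0 - (1.0 - alpha) ** num_tests


for m in (1, 14, 100):
    print(m, round(chance_of_spurious_hit(m), 3))
```

Under these assumptions, fourteen attempts already give a better than even chance of a "discovery", and a hundred attempts make one almost certain.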

Redundancy. Often the number of frequent sets is extremely large. (Figure labels: Data, Patterns.)

No New Information. Most frequent patterns = most well-known patterns. Many interesting patterns are infrequent; otherwise we would already know them.

Overfitting. Setting: training data, plus a separate set for testing. We keep updating the model: make it more and more specific, make it better and better on the training data. What happens to the generalization power?

Overfitting

Overfitting (figure regions: underfitting, overfitting). Underfitting: the model did not see enough data. Overfitting: the model learns peculiarities of the input data.

Overfitting Due to Noise. Two-dimensional data, class + or -. (Figure: scatter plot of + and - points over attributes A and B.)

Overfitting Due to Noise. Good model. (Figure: the same scatter plot with the good model drawn in.)

Overfitting Due to Noise. Bad model with better training performance. (Figure: the same scatter plot with the bad model drawn in.)
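The contrast between a good model and a bad model with better training performance can be reproduced on synthetic data. The sketch below is an illustration under stated assumptions, not the data from the slides: points are drawn uniformly from the unit square, the true class is + when A < 0.5, and 15% of the labels are flipped as noise. The "bad" model simply memorizes the training set (1-nearest neighbor); all names are hypothetical.

```python
import random


def make_data(n, rng, noise=0.15):
    """Points (a, b) in the unit square; '+' when a < 0.5, with a fraction of labels flipped."""
    data = []
    for _ in range(n):
        a, b = rng.random(), rng.random()
        label = "+" if a < 0.5 else "-"
        if rng.random() < noise:  # label noise
            label = "-" if label == "+" else "+"
        data.append(((a, b), label))
    return data


def simple_model(point):
    """The 'good' model: one threshold on attribute A."""
    return "+" if point[0] < 0.5 else "-"


def memorizer(train):
    """The 'bad' model: return the label of the nearest training point, noise included."""
    def predict(point):
        nearest = min(train, key=lambda ex: (ex[0][0] - point[0]) ** 2
                      + (ex[0][1] - point[1]) ** 2)
        return nearest[1]
    return predict


def accuracy(model, data):
    return sum(model(x) == y for x, y in data) / len(data)


rng = random.Random(0)
train_set, test_set = make_data(200, rng), make_data(200, rng)
bad_model = memorizer(train_set)
for name, model in (("simple", simple_model), ("memorizer", bad_model)):
    print(name, "train:", accuracy(model, train_set), "test:", accuracy(model, test_set))
```

The memorizer is perfect on the training data because every point is its own nearest neighbor. On fresh data one would expect it to do worse than the simple rule, since it also reproduces the flipped labels.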

Bad Experimental Design. Keep in mind: Never, ever test the performance of your solutions on data that is used in the training process. Always keep in mind the scenario in which you will deploy your method.

Bad Experimental Design. Example: nearest neighbor classification. A training set has been given:
A     B     C     D     Class
0.5   0.3   0.1   7.5   +
0.3   0.1   0.7   8.9   -
0.4   0.2   0.8   4.2   +
Classifying a new example p: find the closest example q in the training set; assign the label of q to p.

Bad Experimental Design

Bad Experimental Design. How do we measure the distance? Weighted Euclidean distance between a new point (p_1, ..., p_n) and (q_1, ..., q_n): $dist(p, q) = \sqrt{\sum_{k=1}^{n} w_k (p_k - q_k)^2}$. We try some different settings for the weights. Equal weights: accuracy of 56%. Standardized weights: accuracy of 65%. Giving more weight to C: accuracy of 75%.
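The nearest neighbor rule with a weighted Euclidean distance can be sketched in a few lines. The three training rows are the ones given in the example; the query point and the two weight settings are made up to show that the choice of weights can change the predicted class.

```python
import math

# Training set from the example: attributes (A, B, C, D) and class label.
TRAIN = [
    ((0.5, 0.3, 0.1, 7.5), "+"),
    ((0.3, 0.1, 0.7, 8.9), "-"),
    ((0.4, 0.2, 0.8, 4.2), "+"),
]


def weighted_distance(p, q, weights):
    """Weighted Euclidean distance: sqrt(sum_k w_k * (p_k - q_k)^2)."""
    return math.sqrt(sum(w * (a - b) ** 2 for w, a, b in zip(weights, p, q)))


def classify(p, train, weights):
    """Assign p the label of its closest training example."""
    _, label = min(train, key=lambda ex: weighted_distance(p, ex[0], weights))
    return label


query = (0.5, 0.3, 0.65, 7.5)  # hypothetical new example
print(classify(query, TRAIN, weights=(1, 1, 1, 1)))  # equal weights
print(classify(query, TRAIN, weights=(0, 0, 1, 0)))  # only attribute C counts
```

With equal weights the large range of D dominates the distance; putting all the weight on C makes the second row the nearest neighbor instead.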

Bad Experimental Design. We draw the following conclusions: Standardized weights with a small correction to increase the weight of C give the best results. We can get an accuracy as high as 75%. WHAT IS WRONG? (Problem reported by Eamonn Keogh)
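One reading of the problem, in line with the rule about never testing on data used in the training process, is that the weights were chosen by looking at the very accuracy that is then reported. A possible remedy is sketched below: pick the weights on a separate validation set and score the chosen setting once on held-out data. The tiny data sets, the candidate weights, and the function names are all invented for illustration.

```python
import math


def classify(p, train, weights):
    """1-nearest neighbor with a weighted Euclidean distance."""
    def dist(q):
        return math.sqrt(sum(w * (a - b) ** 2 for w, a, b in zip(weights, p, q)))
    return min(train, key=lambda ex: dist(ex[0]))[1]


def accuracy(train, data, weights):
    return sum(classify(x, train, weights) == y for x, y in data) / len(data)


def select_and_evaluate(train, validation, held_out, candidate_weights):
    """Choose weights using the validation set only, then score once on held-out data."""
    best = max(candidate_weights, key=lambda w: accuracy(train, validation, w))
    return best, accuracy(train, held_out, best)


# Invented toy data: only the first attribute is informative.
train = [((0.0, 0.9), "+"), ((1.0, 0.1), "-")]
validation = [((0.2, 0.0), "+"), ((0.8, 1.0), "-")]
held_out = [((0.3, 0.2), "+"), ((0.7, 0.7), "-")]
candidates = [(1.0, 1.0), (1.0, 0.0), (0.0, 1.0)]

print(select_and_evaluate(train, validation, held_out, candidates))
```

The number reported at the end comes from data that played no part in choosing the weights, so it is not inflated by the search over settings.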

Conclusions. Three main techniques: classification, pattern mining, clustering. Many dangers: under/overfitting, meaningless discoveries, bad experimental design.

See you again next week!