Data Mining: An Overview. David Madigan http://www.stat.columbia.edu/~madigan



Similar documents
Data Mining: Foundation, Techniques and Applications

Principles of Data Mining by Hand, Mannila & Smyth

Cross-validation for detecting and preventing overfitting

Data Mining. Nonlinear Classification

STA 4273H: Statistical Machine Learning

Statistics Graduate Courses

New Work Item for ISO Predictive Analytics (Initial Notes and Thoughts) Introduction

LCs for Binary Classification

Data, Measurements, Features

HT2015: SC4 Statistical Data Mining and Machine Learning

Data Mining and Exploration. Data Mining and Exploration: Introduction. Relationships between courses. Overview. Course Introduction

Knowledge Discovery and Data Mining. Structured vs. Non-Structured Data

CART 6.0 Feature Matrix

Data Mining: Concepts and Techniques. Jiawei Han. Micheline Kamber. Simon Fraser University. Morgan Kaufmann Publishers, an imprint of Elsevier

The Artificial Prediction Market

Data Mining Techniques Chapter 6: Decision Trees

Machine Learning and Data Analysis overview. Department of Cybernetics, Czech Technical University in Prague.

Supervised Learning (Big Data Analytics)

Data Mining and Data Warehousing. Henryk Maciejewski. Data Mining Predictive modelling: regression

Introduction to Learning & Decision Trees

Introduction to Data Mining

Perspectives on Data Mining

Data Preprocessing. Week 2

Statistical Machine Learning

Introduction. A. Bellaachia

CS 2750 Machine Learning. Lecture 1. Machine Learning. CS 2750 Machine Learning.

Statistical Models in R

Example: credit card default; we may be more interested in predicting the probability of a default than in classifying individuals as default or not.

Analytics on Big Data

Basics of Statistical Machine Learning

Principles of Data Mining. Pham Tho Hoan, hoanpt@hnue.edu.vn

Insurance Analytics: data analysis and predictive modelling in insurance. Pavel Kříž. Seminar in Actuarial Sciences, MFF 4.

Data Mining Part 5. Prediction

Azure Machine Learning, SQL Data Mining and R

Statistical Models in Data Mining

SAS Software to Fit the Generalized Linear Model

EXPLORING & MODELING USING INTERACTIVE DECISION TREES IN SAS ENTERPRISE MINER. Copyright 2013, SAS Institute Inc. All rights reserved.

Introduction to nonparametric regression: Least squares vs. Nearest neighbors

SAS Certificate Applied Statistics and SAS Programming

WebFOCUS RStat. RStat. Predict the Future and Make Effective Decisions Today. WebFOCUS RStat

Exploratory Data Analysis

How To Understand The Theory Of Probability

Server Load Prediction

Data Mining and Knowledge Discovery in Databases (KDD) State of the Art. Prof. Dr. T. Nouri Computer Science Department FHNW Switzerland

Supervised Feature Selection & Unsupervised Dimensionality Reduction

Data Mining: Partially from: Introduction to Data Mining by Tan, Steinbach, Kumar

DATA MINING CLUSTER ANALYSIS: BASIC CONCEPTS

Example application (1) Telecommunication. Lecture 1: Data Mining Overview and Process. Example application (2) Health

Neural Network Add-in

Chapter 4 Data Mining A Short Introduction. 2006/7, Karl Aberer, EPFL-IC, Laboratoire de systèmes d'informations répartis Data Mining - 1

OUTLIER ANALYSIS. Data Mining 1

3F3: Signal and Pattern Processing

CS 688 Pattern Recognition Lecture 4. Linear Models for Classification

Machine Learning and Data Mining. Regression Problem. (adapted from) Prof. Alexander Ihler

Question 2 Naïve Bayes (16 points)

Data Mining Analytics for Business Intelligence and Decision Support

Supervised and unsupervised learning - 1

Generalized Linear Models

Probabilistic Models for Big Data. Alex Davies and Roger Frigola University of Cambridge 13th February 2014

Local classification and local likelihoods

STATISTICA Formula Guide: Logistic Regression. Table of Contents

Lecture 3: Linear methods for classification

Predict Influencers in the Social Network

Model-Based Recursive Partitioning for Detecting Interaction Effects in Subgroups

Imputing Values to Missing Data

Chapter 12 Discovering New Knowledge Data Mining

Why do statisticians "hate" us?

Practical Data Science with Azure Machine Learning, SQL Data Mining, and R

Predictive Data modeling for health care: Comparative performance study of different prediction models

Network Anomaly Detection: A Machine Learning Perspective. Dhruba Kumar Bhattacharyya, Jugal Kumar Kalita. CRC Press, Taylor & Francis Group

Data Mining Practical Machine Learning Tools and Techniques

Acknowledgments. Data Mining with Regression. Data Mining Context. Overview. Colleagues

Machine Learning with MATLAB David Willingham Application Engineer

Binary Logistic Regression

Foundations of Artificial Intelligence. Introduction to Data Mining

Learning outcomes. Knowledge and understanding. Competence and skills

Linear Threshold Units

Chapter 20: Data Analysis

EFFECTIVE USE OF THE KDD PROCESS AND DATA MINING FOR COMPUTER PERFORMANCE PROFESSIONALS

Predictive Modeling Techniques in Insurance

Learning is a very general term denoting the way in which agents:

The Data Mining Process

TOWARDS SIMPLE, EASY TO UNDERSTAND, AN INTERACTIVE DECISION TREE ALGORITHM

Data Mining Apriori Algorithm

What is Data mining?

Location matters. 3 techniques to incorporate geo-spatial effects in one's predictive model

Data Mining: Algorithms and Applications Matrix Math Review

Information Management course

Car Insurance. Havránek, Pokorný, Tomášek

Detection of changes in variance using binary segmentation and optimal partitioning

Classification and Prediction

Assessment. Presenter: Yupu Zhang, Guoliang Jin, Tuo Wang Computer Vision 2008 Fall

Data Mining Algorithms Part 1. Dejan Sarka

itesla Project Innovative Tools for Electrical System Security within Large Areas

Classification of Bad Accounts in Credit Card Industry

Simple and efficient online algorithms for real world applications

Transcription:

Data Mining: An Overview David Madigan http://www.stat.columbia.edu/~madigan

Overview Brief Introduction to Data Mining Data Mining Algorithms Specific Examples Algorithms: Disease Clusters Algorithms: Model-Based Clustering Algorithms: Frequent Items and Association Rules Future Directions, etc.

Of Laws, Monsters, and Giants. Moore's law: processing capacity doubles every 18 months (CPU, cache, memory). Its more aggressive cousin: disk storage capacity doubles every 9 months. What do the two laws combined produce? A rapidly growing gap between our ability to generate data and our ability to make use of it. [Chart: disk TB shipped per year, 1988-2000, log scale from 1E+3 to 1E+7; disk TB growth 112%/year versus Moore's law at 58.7%/year. Source: 1998 Disk Trend (Jim Porter), http://www.disktrend.com/pdf/portrpkg.pdf]

What is Data Mining? Finding interesting structure in data. Structure refers to statistical patterns, predictive models, hidden relationships. Examples of tasks addressed by data mining: predictive modeling (classification, regression), segmentation (data clustering), summarization, visualization.

Ronny Kohavi, ICML 1998

Data Mining Algorithms. "A data mining algorithm is a well-defined procedure that takes data as input and produces output in the form of models or patterns" (Hand, Mannila, and Smyth). "Well-defined": can be encoded in software. "Algorithm": must terminate after some finite number of steps.

Algorithm Components
1. The task the algorithm is used to address (e.g. classification, clustering, etc.)
2. The structure of the model or pattern we are fitting to the data (e.g. a linear regression model)
3. The score function used to judge the quality of the fitted models or patterns (e.g. accuracy, BIC, etc.)
4. The search or optimization method used to search over parameters and/or structures (e.g. steepest descent, MCMC, etc.)
5. The data management technique used for storing, indexing, and retrieving data (critical when the data are too large to reside in memory)

Backpropagation data mining algorithm. A network with four inputs x_1, ..., x_4, two hidden units h_1, h_2, and output y:

$$s_j = \sum_{i=1}^{4} \alpha_{ij} x_i, \qquad h_j = \sigma(s_j) = \frac{1}{1 + e^{-s_j}}, \qquad \hat{y} = \sum_{j=1}^{2} w_j h_j$$

In words: the vector of p input values is multiplied by a p × d_1 weight matrix; the resulting d_1 values are individually transformed by a non-linear function; the resulting d_1 values are multiplied by a d_1 × d_2 weight matrix.

Backpropagation (cont.) Parameters: $\alpha_1, \ldots, \alpha_4, \beta_1, \ldots, \beta_4, w_1, w_2$. Score: $SSE = \sum_{i=1}^{n} \big(y(i) - \hat{y}(i)\big)^2$. Search: steepest descent; search for structure?
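A minimal numerical sketch of this component view in Python/NumPy: the model is the 4-input, 2-hidden-unit network above, the score is SSE, and the search is steepest descent via backpropagated gradients. The data, target function, learning rate, and initialization below are illustrative assumptions, not from the slides.

import numpy as np

rng = np.random.default_rng(0)

# Illustrative data: n observations of p = 4 inputs and one response (assumed).
n, p, d1 = 100, 4, 2
X = rng.normal(size=(n, p))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=n)

alpha = rng.normal(scale=0.5, size=(p, d1))   # input-to-hidden weights
w = rng.normal(scale=0.5, size=d1)            # hidden-to-output weights

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

eta = 0.01  # learning rate (assumed)
for _ in range(2000):
    S = X @ alpha            # s_j = sum_i alpha_ij x_i
    H = sigmoid(S)           # h_j = 1 / (1 + exp(-s_j))
    y_hat = H @ w            # y_hat = sum_j w_j h_j
    r = y_hat - y            # residuals; SSE = sum(r**2)
    # Steepest descent on SSE: gradients via the chain rule (backpropagation).
    grad_w = 2 * H.T @ r
    grad_alpha = 2 * X.T @ (np.outer(r, w) * H * (1 - H))
    w -= eta * grad_w / n
    alpha -= eta * grad_alpha / n

print("final SSE:", np.sum((sigmoid(X @ alpha) @ w - y) ** 2))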

Models and Patterns. Models:

Prediction: linear regression; piecewise linear; nonparametric regression; classification (logistic regression, naïve Bayes/TAN/Bayesian networks, NN, support vector machines, trees, etc.)

Probability Distributions: parametric models; mixtures of parametric models; graphical Markov models (categorical, continuous, mixed)

Structured Data: time series (Markov models, Mixture Transition Distribution models, hidden Markov models); spatial models

Markov Models. First-order:

$$p(y_1, \ldots, y_T) = p(y_1) \prod_{t=2}^{T} p(y_t \mid y_{t-1})$$

e.g.:

$$p(y_t \mid y_{t-1}) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left( -\frac{1}{2\sigma^2} \big(y_t - g(y_{t-1})\big)^2 \right)$$

With g linear this is the standard first-order autoregressive model, y_t = β_0 + β_1 y_{t-1} + e_t with e_t ~ N(0, σ²). (Diagram: y_1 → y_2 → y_3 → ... → y_T.)
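A short simulation sketch of the autoregressive special case in Python/NumPy (the parameter values β_0 = 0.5, β_1 = 0.8, σ = 1 are illustrative assumptions):

import numpy as np

rng = np.random.default_rng(1)
T = 500
beta0, beta1, sigma = 0.5, 0.8, 1.0    # assumed AR(1) parameters

y = np.empty(T)
y[0] = beta0 / (1 - beta1)             # start at the stationary mean
for t in range(1, T):
    # y_t = beta0 + beta1 * y_{t-1} + e_t, with e_t ~ N(0, sigma^2)
    y[t] = beta0 + beta1 * y[t - 1] + rng.normal(scale=sigma)

# Under the first-order Markov property the lag-1 correlation should be near beta1.
print("lag-1 autocorrelation:", np.corrcoef(y[:-1], y[1:])[0, 1])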

First-Order HMM/Kalman Filter. (Diagram: hidden states x_1 → x_2 → x_3 → ... → x_T, each x_t emitting an observation y_t.)

$$p(y_1, \ldots, y_T, x_1, \ldots, x_T) = p(x_1)\, p(y_1 \mid x_1) \prod_{t=2}^{T} p(x_t \mid x_{t-1})\, p(y_t \mid x_t)$$

Note: to compute p(y_1, ..., y_T) we need to sum/integrate over all possible state sequences...
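That sum over all state sequences is tractable via the forward recursion. Below is a sketch for a discrete-state HMM with Gaussian emissions in Python/NumPy; the two-state transition matrix, emission means, and toy observation sequence are illustrative assumptions.

import numpy as np
from scipy.stats import norm

# Assumed 2-state HMM: initial distribution, transition matrix, Gaussian emissions.
pi = np.array([0.5, 0.5])
A = np.array([[0.9, 0.1],
              [0.2, 0.8]])
means, sd = np.array([0.0, 3.0]), 1.0

def log_likelihood(y):
    """Compute log p(y_1, ..., y_T) by the forward algorithm, summing over states."""
    alpha = pi * norm.pdf(y[0], means, sd)    # alpha_1(x) = p(x_1) p(y_1 | x_1)
    log_p = 0.0
    for t in range(1, len(y)):
        c = alpha.sum()                       # rescale to avoid numerical underflow
        log_p += np.log(c)
        alpha = (alpha / c) @ A * norm.pdf(y[t], means, sd)
    return log_p + np.log(alpha.sum())

y = np.array([0.1, -0.3, 2.9, 3.2, 0.2])      # toy observation sequence (assumed)
print("log p(y_1,...,y_T) =", log_likelihood(y))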

Bias-Variance Tradeoff. High bias - low variance at one extreme; low bias - high variance at the other, where overfitting means modeling the random component. The score function should embody the compromise between the two.
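As a quick numerical illustration (a sketch with assumed data: a noisy sine curve and an even/odd train/test split, none of which come from the slides), fitting polynomials of increasing degree shows training error falling monotonically while held-out error eventually rises, the signature of overfitting:

import numpy as np

rng = np.random.default_rng(3)

# Illustrative data: noisy sine curve, split into train and test halves.
x = np.linspace(0, 1, 100)
y = np.sin(2 * np.pi * x) + 0.3 * rng.normal(size=x.size)
xtr, ytr, xte, yte = x[::2], y[::2], x[1::2], y[1::2]

for degree in [1, 3, 7, 12]:
    coefs = np.polyfit(xtr, ytr, degree)      # least-squares polynomial fit
    sse_train = np.sum((np.polyval(coefs, xtr) - ytr) ** 2)
    sse_test = np.sum((np.polyval(coefs, xte) - yte) ** 2)
    print(f"degree {degree:2d}: train SSE {sse_train:6.2f}, test SSE {sse_test:6.2f}")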

The Curse of Dimensionality. X ~ MVN_p(0, I); Gaussian kernel density estimation with bandwidth chosen to minimize MSE at the mean. Suppose we want the relative MSE at x = 0 to be small:

$$\frac{E\big[(\hat{p}(0) - p(0))^2\big]}{p(0)^2} < 0.1$$

Number of data points required:

Dimension   # data points
1           4
2           19
3           67
6           2,790
10          842,000
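A small simulation sketch of the table's message in Python/NumPy: estimate the relative MSE of a Gaussian kernel density estimate at the origin as the dimension grows. The rule-of-thumb bandwidth here is an assumption (the table uses the MSE-optimal bandwidth), so the numbers are indicative rather than an exact reproduction.

import numpy as np

rng = np.random.default_rng(2)

def relative_mse_at_zero(p, n, reps=200):
    """Monte Carlo estimate of E[(phat(0) - p(0))^2] / p(0)^2 for a Gaussian
    kernel density estimate at the mean of a MVN_p(0, I) sample."""
    p0 = (2 * np.pi) ** (-p / 2)               # true density at the origin
    h = n ** (-1.0 / (p + 4))                  # rule-of-thumb bandwidth (assumed)
    errs = []
    for _ in range(reps):
        X = rng.normal(size=(n, p))
        # Gaussian KDE at 0: average of kernels centered at the data points.
        k = np.exp(-np.sum(X**2, axis=1) / (2 * h**2))
        phat = k.mean() / ((2 * np.pi) ** (p / 2) * h**p)
        errs.append((phat - p0) ** 2)
    return np.mean(errs) / p0**2

for p, n in [(1, 4), (2, 19), (3, 67)]:
    print(f"dim {p}, n = {n}: relative MSE ~ {relative_mse_at_zero(p, n):.2f}")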

Patterns. Global: clustering via partitioning; hierarchical clustering; mixture models. Local: outlier detection; changepoint detection; bump hunting; scan statistics; association rules.

Scan Statistics via Permutation Tests. The curve represents a road. Each x marks an accident; red denotes an injury accident, black no injury. Is there a stretch of road where there is an unusually large fraction of injury accidents?

Scan with Fixed Window. If we know the length of the stretch of road that we seek, we could slide a window of that length along the road and find the most unusual window location.

How Unusual is a Window? Let p_1 and p_0 denote the true probability of being red inside and outside the window, respectively, and let (x_1, n_1) and (x_0, n_0) denote the corresponding counts. Use the GLRT for comparing H_0: p_1 = p_0 versus H_1: p_1 ≠ p_0:

$$\lambda = \frac{\left(\frac{x_1 + x_0}{n_1 + n_0}\right)^{x_1 + x_0} \left(1 - \frac{x_1 + x_0}{n_1 + n_0}\right)^{(n_1 + n_0) - (x_1 + x_0)}}{\left(\frac{x_1}{n_1}\right)^{x_1}\!\left(1 - \frac{x_1}{n_1}\right)^{n_1 - x_1} \left(\frac{x_0}{n_0}\right)^{x_0}\!\left(1 - \frac{x_0}{n_0}\right)^{n_0 - x_0}}$$

Here -2 log λ has an asymptotic chi-square distribution with 1 df; λ measures how unusual a window is.
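A direct transcription of this statistic in Python, working with -2 log λ via Bernoulli log-likelihoods (a sketch; the small clamp guarding log(0) and the example counts are assumptions):

import numpy as np

def log_lik(x, n, p):
    """Bernoulli log-likelihood of x successes in n trials, safe at p = 0 or 1."""
    eps = 1e-12
    p = min(max(p, eps), 1 - eps)
    return x * np.log(p) + (n - x) * np.log(1 - p)

def neg2_log_lambda(x1, n1, x0, n0):
    """-2 log GLRT statistic for H0: p1 = p0 vs H1: p1 != p0;
    asymptotically chi-square with 1 df under H0."""
    p_pool = (x1 + x0) / (n1 + n0)
    ll0 = log_lik(x1 + x0, n1 + n0, p_pool)
    ll1 = log_lik(x1, n1, x1 / n1) + log_lik(x0, n0, x0 / n0)
    return -2 * (ll0 - ll1)

# e.g. 8 of 10 injury accidents inside the window versus 20 of 90 outside:
print(neg2_log_lambda(8, 10, 20, 90))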

Permutation Test. Since we look at the smallest λ over all window locations, we need the distribution of the smallest λ under the null hypothesis that there are no clusters. Look at the distribution of the smallest λ over, say, 999 random relabellings of the colors of the x's: smallest-λ 0.376, 0.233, 0.412, 0.222, ... Look at the position of the observed smallest λ in this distribution to get the scan statistic p-value (e.g., if the observed smallest λ is 5th smallest, the p-value is 0.005).
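A sketch of the whole procedure in Python/NumPy, combining the fixed-width window scan with the permutation test and reusing the neg2_log_lambda function sketched above. Since -2 log λ decreases monotonically in λ, the slide's smallest λ corresponds to the largest -2 log λ used here. The road positions, planted cluster, and window width are simulated assumptions; the 999 relabellings follow the slide.

import numpy as np

rng = np.random.default_rng(4)

# Simulated accidents along a road of length 1: positions and injury indicators,
# with an elevated injury rate planted on the stretch (0.4, 0.5).
pos = np.sort(rng.uniform(size=200))
injury = rng.uniform(size=200) < 0.2 + 0.4 * ((pos > 0.4) & (pos < 0.5))

def scan_statistic(pos, injury, width=0.1):
    """Largest -2 log lambda over all fixed-width windows (the slide's smallest lambda)."""
    best = 0.0
    for start in np.arange(0.0, 1.0 - width, 0.01):   # slide the window along the road
        inside = (pos >= start) & (pos < start + width)
        n1, x1 = inside.sum(), injury[inside].sum()
        n0, x0 = (~inside).sum(), injury[~inside].sum()
        if n1 > 0 and n0 > 0:
            best = max(best, neg2_log_lambda(x1, n1, x0, n0))
    return best

observed = scan_statistic(pos, injury)
# Permutation null: relabel the injury colors at random, keeping positions fixed.
null = [scan_statistic(pos, rng.permutation(injury)) for _ in range(999)]
p_value = (1 + sum(s >= observed for s in null)) / (1 + len(null))
print("scan statistic:", round(observed, 2), " p-value:", p_value)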

Variable Length Window. No need to use a fixed-length window; examine all possible windows up to, say, half the length of the entire road. (Figure legend: O = fatal accident, O = non-fatal accident.)

Spatial Scan Statistics Spatial scan statistic uses, e.g., circles instead of line segments

Spatial-Temporal Scan Statistics. The spatial-temporal scan statistic uses cylinders, where the height of the cylinder represents a time window.

Other Issues. A Poisson model is also common (instead of the Bernoulli model). Covariate adjustment. Andrew Moore's group at CMU: efficient algorithms for scan statistics.

Software: SaTScan + others http://www.satscan.org http://www.phrl.org http://www.terraseer.com

Association Rules: Support and Confidence. (Venn diagram: customer buys beer, customer buys diaper, customer buys both.) Find all rules Y ⇒ Z with minimum confidence and support: support, s, is the probability that a transaction contains {Y & Z}; confidence, c, is the conditional probability that a transaction containing Y also contains Z.

Transaction ID   Items Bought
2000             A, B, C
1000             A, C
4000             A, D
5000             B, E, F

With minimum support 50% and minimum confidence 50%, we have A ⇒ C (50%, 66.6%) and C ⇒ A (50%, 100%).

Mining Association Rules: An Example. Using the same transactions with minimum support 50% and minimum confidence 50%: for the rule A ⇒ C, support = support({A & C}) = 50%, and confidence = support({A & C}) / support({A}) = 66.6%. The Apriori principle: any subset of a frequent itemset must be frequent.

Frequent Itemset   Support
{A}                75%
{B}                50%
{C}                50%
{A,C}              50%
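These support and confidence numbers are easy to verify in a few lines of Python (using the four transactions from the slide; the helper names are illustrative):

# Transactions from the slide.
db = {2000: {"A", "B", "C"}, 1000: {"A", "C"}, 4000: {"A", "D"}, 5000: {"B", "E", "F"}}

def support(itemset):
    """Fraction of transactions containing every item in itemset."""
    return sum(itemset <= t for t in db.values()) / len(db)

def confidence(lhs, rhs):
    """Conditional probability that a transaction with lhs also contains rhs."""
    return support(lhs | rhs) / support(lhs)

# Rule A => C: support 50%, confidence 66.6%; rule C => A: support 50%, confidence 100%.
print(support({"A", "C"}), confidence({"A"}, {"C"}), confidence({"C"}, {"A"}))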

Mining Frequent Itemsets: the Key Step. Find the frequent itemsets: the sets of items that have minimum support. A subset of a frequent itemset must also be a frequent itemset; i.e., if {A, B} is frequent, both {A} and {B} must be frequent. Iteratively find frequent itemsets with cardinality from 1 to k (k-itemsets). Then use the frequent itemsets to generate association rules.

The Apriori Algorithm. Join step: C_k is generated by joining L_{k-1} with itself. Prune step: any (k-1)-itemset that is not frequent cannot be a subset of a frequent k-itemset. Pseudo-code (C_k: candidate itemsets of size k; L_k: frequent itemsets of size k):

L_1 = {frequent items};
for (k = 1; L_k != ∅; k++) do begin
    C_{k+1} = candidates generated from L_k;
    for each transaction t in database do
        increment the count of all candidates in C_{k+1} that are contained in t;
    L_{k+1} = candidates in C_{k+1} with min_support;
end;
return ∪_k L_k;

The Apriori Algorithm — Example (minimum support: 2 of 4 transactions).

Database D:
TID   Items
100   1 3 4
200   2 3 5
300   1 2 3 5
400   2 5

Scan D → C_1:
itemset  {1} {2} {3} {4} {5}
support   2   3   3   1   3

L_1 (prune {4}): {1}: 2, {2}: 3, {3}: 3, {5}: 3

C_2 (join L_1 with itself), scan D:
itemset  {1 2} {1 3} {1 5} {2 3} {2 5} {3 5}
support    1     2     1     2     3     2

L_2: {1 3}: 2, {2 3}: 2, {2 5}: 3, {3 5}: 2

C_3: {2 3 5}; scan D → L_3: {2 3 5}: 2
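A compact Python sketch of the pseudo-code above, run on database D; with minimum support of 2 transactions it reproduces L_1 through L_3 = {2 3 5}. The frequent() helper and set-based join are illustrative implementation choices.

from itertools import combinations

# Database D from the slide.
D = {100: {1, 3, 4}, 200: {2, 3, 5}, 300: {1, 2, 3, 5}, 400: {2, 5}}
min_count = 2                       # minimum support: 50% of 4 transactions

def frequent(candidates):
    """Scan D and keep the candidates that meet minimum support."""
    counts = {c: sum(c <= t for t in D.values()) for c in candidates}
    return {c for c, n in counts.items() if n >= min_count}

items = {i for t in D.values() for i in t}
L = frequent({frozenset([i]) for i in items})        # L_1
all_frequent = set(L)
k = 1
while L:
    # Join step: unions of L_k members that have size k+1. Prune step: every
    # k-subset of a candidate must itself be frequent (the Apriori principle).
    C = {a | b for a in L for b in L if len(a | b) == k + 1}
    C = {c for c in C if all(frozenset(s) in L for s in combinations(c, k))}
    L = frequent(C)
    all_frequent |= L
    k += 1

for itemset in sorted(all_frequent, key=lambda s: (len(s), sorted(s))):
    print(sorted(itemset))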

Association Rule Mining: A Road Map. Boolean vs. quantitative associations (based on the types of values handled):

buys(x, "SQLServer") ^ buys(x, "DMBook") ⇒ buys(x, "DBMiner") [0.2%, 60%]
age(x, "30..39") ^ income(x, "42..48K") ⇒ buys(x, "PC") [1%, 75%]

Single-dimension vs. multi-dimensional associations (see the examples above). Single-level vs. multiple-level analysis: what brands of beers are associated with what brands of diapers? Various extensions (thousands!).