Data Mining Chapter 6: Models and Patterns Fall 2011 Ming Li Department of Computer Science and Technology Nanjing University


Models vs. Patterns
- Model: a high-level, global description of a data set.
- Pattern: a local characteristic of the data, perhaps holding for only a few records or a few variables.
Types of models and patterns
- Descriptive: summarize the data in a convenient and concise way.
- Predictive: allow one to make statements about the population from which the data were drawn, or about likely future values.

Modeling
- Model structure: the functional form of the model, which contains a set of parameters.
- Parameters: the variables of a particular functional form; a specific parameter setting instantiates a concrete model.
- Modeling is essential to data mining: the goal is to determine specific parameter values for a pre-defined model structure, so as to capture the relationships of interest in the data.

Predictive modeling
Tasks
- Classification: predicts categorical labels.
- Regression: predicts real-valued labels.
Modeling
- Classification: model the boundaries that separate data with different labels in d-dimensional space.
- Regression: model how the data points spread in (d+1)-dimensional space.

Linear models
- Linear model for regression: a hyperplane in (d+1)-dimensional space (the d predictors plus the response).
- Linear model for classification: a separating hyperplane in d-dimensional space.
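A minimal sketch of fitting a one-dimensional linear regression model by ordinary least squares; the data below are illustrative:

```python
# Fit y = w*x + b by ordinary least squares (closed form, 1-d case).
def fit_line(xs, ys):
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    # Slope: covariance of (x, y) divided by variance of x.
    w = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    b = my - w * mx  # intercept passes through the mean point
    return w, b

# Illustrative data lying exactly on y = 2x + 1.
w, b = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
```

In d dimensions the same idea generalizes to fitting a hyperplane, typically via the normal equations or gradient descent.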

Linear models
Pros
- Simple.
- Easy to interpret the individual contribution of each attribute.
Cons
- Unable to reflect non-linear relationships in the data, which are common in many applications.
- The individual contributions make little sense if attributes are highly correlated with each other.

From linearity to non-linearity
Typical ways to turn linear models into non-linear models:
- Non-linear mappings + linear models
- Stacking many simple non-linear models to achieve flexible non-linearity
- Local piecewise linear models
- Memory-based local models

Non-linear mapping-based linear models
- Extend linear models to non-linearity by transforming the predictor variables X1, X2, ... and/or the response variable with smooth non-linear functions g() and h().
- Limitation: the appropriate functional forms of g() and h() have to be chosen manually beforehand.
- Polynomial regression is a special case, where g is a polynomial function (e.g., regressing Y on X, X^2, X^3).
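The mapping idea can be sketched as follows, assuming a fixed cubic feature map g(x) = (x, x^2, x^3) and illustrative (not fitted) weights:

```python
# Expand x into polynomial features, then apply a *linear* model
# in the expanded space; the result is non-linear in x.
def poly_features(x, degree=3):
    return [x ** d for d in range(1, degree + 1)]

# A linear model over the mapped features: y = w1*x + w2*x^2 + w3*x^3.
weights = [1.0, -0.5, 0.25]  # illustrative, not fitted to any data

def predict(x):
    return sum(w * f for w, f in zip(weights, poly_features(x)))
```

The model stays linear in its parameters, so it can still be fitted by least squares; only the input representation changed.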

Stacking many simple non-linear models
- Stack simple non-linear models to achieve more flexible non-linearity, e.g., the multi-layer perceptron.
- Each perceptron is a simple non-linear model: a linear combination of inputs passed through a sigmoid function.
- These simple non-linear models are stacked to approximate arbitrary (non-linear) functions.
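A minimal sketch of stacking: a two-layer perceptron with hand-picked, illustrative weights that computes XOR, a function no single linear model can represent:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Two sigmoid units in a hidden layer, one sigmoid output unit.
# Weights are chosen by hand for illustration, not learned.
def xor_net(x1, x2):
    h1 = sigmoid(20 * x1 + 20 * x2 - 10)    # behaves like OR
    h2 = sigmoid(-20 * x1 - 20 * x2 + 30)   # behaves like NAND
    return sigmoid(20 * h1 + 20 * h2 - 30)  # AND(h1, h2) = XOR
```

In practice the weights would be learned from data (e.g., by backpropagation); the point here is that stacking simple sigmoid units yields a decision function no single hyperplane can express.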

Local piecewise models
- Combine a set of different linear models, each covering a region of the input space disjoint from the others.
- E.g., classification / regression trees.

Local piecewise models: remarks
- Piecewise linearity is a very important idea in data mining and machine learning: build simple models or patterns locally, then combine them into complex global structures.
- Locality provides a framework for decomposing a complex model into simple local patterns (global model structures <-> piecewise linear <-> local pattern structures).
- Problem: it is difficult to decide the size of each region; one must balance approximation ability against the number of data points available for estimating the parameters.

Memory-based local models
A lazy approach:
- Retain all the data points.
- Postpone estimating the predicted value Y until a prediction is actually required.
Key idea: use the labels of stored examples similar to the instance to be predicted.
How do similar instances contribute?
- Hard version: only the nearest neighbours contribute.
- Soft version: every stored instance contributes some weight.

Memory-based local models: hard version
k-nearest neighbour classification and regression:
- Compute the distance between the current instance and every instance stored in the data set.
- Find the k nearest neighbours.
- Use a majority vote for classification and the average for regression.
(Figure: whether the query point is labelled blue or red can change with k, e.g., k = 3 vs. k = 5.)
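The hard version can be sketched as follows; the toy points and labels are illustrative:

```python
import math
from collections import Counter

# k-nearest-neighbour classification with majority vote.
def knn_classify(train, query, k=3):
    # train: list of ((x, y), label) pairs.
    by_dist = sorted(train, key=lambda p: math.dist(p[0], query))
    votes = Counter(label for _, label in by_dist[:k])
    return votes.most_common(1)[0][0]

# Illustrative 2-d data set with two classes.
points = [((0, 0), 'blue'), ((1, 0), 'blue'), ((0, 1), 'blue'),
          ((5, 5), 'red'), ((6, 5), 'red')]
label = knn_classify(points, (0.5, 0.5), k=3)
```

For regression, the majority vote would be replaced by the average of the k neighbours' real-valued labels.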

Memory-based local models: soft version (kernel estimate)
- Select a single-peaked kernel function K with a bandwidth σ.
- Evaluate the kernel function between the current instance and the instances stored in the data set.
- Use a weighted vote for classification and a weighted average for regression, where each weight is the kernel value.
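The soft version for regression can be sketched as a kernel-weighted average over all stored examples, assuming a Gaussian kernel with bandwidth sigma:

```python
import math

# Gaussian kernel: weight decays smoothly with distance.
def kernel(u, sigma=1.0):
    return math.exp(-u * u / (2 * sigma ** 2))

# Prediction = kernel-weighted average of all stored labels.
def kernel_regress(train, query, sigma=1.0):
    # train: list of (x, y) pairs with scalar x.
    weights = [kernel(x - query, sigma) for x, _ in train]
    return sum(w * y for w, (_, y) in zip(weights, train)) / sum(weights)
```

Unlike the hard k-NN version, every stored example contributes, with the bandwidth sigma controlling how quickly influence fades with distance.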

Model complexity: the more complex the better?
- The expressive power of the model structure increases as the model becomes more complex.
- Piecewise linear model: the number of pieces determines the expressive power.
- Multi-layer perceptron: the number of hidden units determines the expressive power.
- Classification and regression tree: the size of the tree determines the expressive power.

Model complexity: the more complex the better?
- Both a simple model and a complex model can fit the training data well; the complex model may fit the training data even better than the simple one.
(Figure: which curve to prefer, the simple black one or the complex blue one?)

Overfitting
- Overfitting: the model describes random error, noise, or the idiosyncrasies of individual data points instead of the underlying relationship.
- Overfitting occurs when the model is too expressive (has too many degrees of freedom) relative to the amount of data available: a complex model plus a relatively small amount of data.
- An overfitted model usually achieves good performance on the training data but does not generalize well to the underlying distribution.

Selecting appropriate model complexity
- In general, designing a good score function that reflects the performance of the predictive model over the whole distribution, rather than just the training data, helps strike a compromise between simplicity and complexity of the model structure.
Widely used approaches to avoid overfitting:
- Goodness-of-fit + a penalty on model complexity
- Cross-validation
- Early stopping
- Bayesian priors
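Cross-validation, one of the approaches listed above, can be sketched as a k-fold index splitter; this is a minimal illustration, not a full model-selection loop:

```python
# Split n examples into k folds; each fold serves once as the
# validation set while the rest form the training set. Validation
# error, averaged over folds, guides the choice of complexity.
def kfold_indices(n, k):
    folds = [list(range(i, n, k)) for i in range(k)]
    for i in range(k):
        valid = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, valid
```

One would fit each candidate model complexity on every training split, score it on the corresponding validation split, and keep the complexity with the best average validation score.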

Descriptive modeling
Goal: produce a summary or description of the data.
Model structures:
- The model structure attempts to describe the data by modeling the distribution of the data, as if the data were generated from the (probabilistic) model structure.
- Such models are usually referred to as generative models.
Major classes of distribution / density models:
- Parametric
- Non-parametric

Mixtures of parametric models
- A mixture model is a probabilistic model for density estimation that uses a mixture distribution with K components.
- Each component k is a relatively simple parametric model p_k(x | θ_k) with parameters θ_k, chosen with mixing probability π_k.
The generative story:
1. Randomly pick a mixture component k according to the probabilities π_k.
2. Generate a data point x from the component density p_k(x | θ_k).

Mixtures of parametric models
- A widely used mixture model is the Gaussian Mixture Model (GMM): if each component density p_k(x | θ_k) is a Normal distribution, the mixture model is called a Gaussian Mixture Model.
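The two-step generative story can be sketched directly, assuming a two-component univariate GMM with illustrative parameters:

```python
import random

# Step 1: pick component k with probability pi_k.
# Step 2: sample x from that component's Normal density N(mu_k, sigma_k).
def sample_gmm(pis, mus, sigmas, rng):
    k = rng.choices(range(len(pis)), weights=pis)[0]
    return rng.gauss(mus[k], sigmas[k])

rng = random.Random(0)  # fixed seed for reproducibility
# Two well-separated components centred at -5 and +5 (illustrative).
xs = [sample_gmm([0.5, 0.5], [-5.0, 5.0], [1.0, 1.0], rng)
      for _ in range(1000)]
```

Fitting the parameters (π_k, μ_k, σ_k) from data would typically be done with the EM algorithm; here they are simply fixed to show the generative direction.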

Computing the joint distribution
The joint distribution is usually difficult to compute:
- More parameters to estimate than for the marginal distributions.
- Sensitive to dimensionality (especially for nominal data).
- Needs more data to achieve reliable estimates of the parameters.
Can we find an effective way to compute the joint distribution?
- Factorize the joint distribution into simpler models, based on assumptions of independence between variables.

Factorization of density functions (I)
- Independence assumption: the individual variables x_1, ..., x_d are independent of each other, so p(x_1, ..., x_d) = p(x_1) p(x_2) ... p(x_d).
- Conditional independence assumption (Naïve Bayes): the individual variables are independent of each other given a (latent) variable y, so p(x_1, ..., x_d | y) = p(x_1 | y) ... p(x_d | y).
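The naïve Bayes factorization can be sketched with class-conditional tables; the words, classes, and probabilities below are made up purely for illustration:

```python
# Under conditional independence, the joint factorises as
#   p(x1, ..., xd, y) = p(y) * prod_i p(xi | y).
p_y = {'spam': 0.4, 'ham': 0.6}                     # class prior
p_x_given_y = {                                      # per-word tables
    ('free', 'spam'): 0.8, ('free', 'ham'): 0.1,
    ('hello', 'spam'): 0.3, ('hello', 'ham'): 0.7,
}

def joint(words, y):
    prob = p_y[y]
    for w in words:
        prob *= p_x_given_y[(w, y)]  # one small factor per variable
    return prob
```

Comparing joint(words, 'spam') against joint(words, 'ham') gives the naïve Bayes classification decision, since both share the same normalizing constant p(x).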

Factorization of density functions (II)
- Markov assumption: each variable depends on the preceding variables.
- Markov chain: each variable depends only on the immediately preceding variable, so p(x_1, ..., x_d) = p(x_1) p(x_2 | x_1) ... p(x_d | x_{d-1}).
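The Markov-chain factorization can be sketched the same way; the states and transition probabilities below are illustrative:

```python
# p(x1, ..., xT) = p(x1) * prod_t p(x_t | x_{t-1})
p_init = {'sun': 0.6, 'rain': 0.4}                       # p(x1)
p_trans = {('sun', 'sun'): 0.8, ('sun', 'rain'): 0.2,    # p(x_t | x_{t-1})
           ('rain', 'sun'): 0.3, ('rain', 'rain'): 0.7}

def chain_prob(seq):
    prob = p_init[seq[0]]
    for prev, cur in zip(seq, seq[1:]):
        prob *= p_trans[(prev, cur)]  # one transition factor per step
    return prob
```

The full joint over T binary-like states would need exponentially many entries; the chain needs only the initial distribution plus one transition table.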

Modeling structured data
- In many real-world applications, the collected data have inherent structure: sequential data (e.g., time series, sentences, audio) and spatial data (e.g., maps, images).
- The structural information in the data needs to be modeled explicitly.
- Sequential data: HMMs, Conditional Random Fields, etc.
- Spatial data: Markov Random Fields, etc.

Hidden Markov model
Model specification: besides the observed random variable Y, there is a hidden state random variable X (a discrete variable taking m values) at each time t.
The generative story:
1. The state at time t is drawn according to p(x_t | x_{t-1}).
2. The observation y_t is drawn according to the distribution p(y_t | x_t).
(Graph: hidden chain x_1 -> x_2 -> ... -> x_T, each x_t emitting an observation y_t.)
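The generative story can be sketched for a tiny two-state HMM; all probability tables and symbols are illustrative:

```python
import random

# Hidden states {'A', 'B'}, observed symbols {'0', '1'}.
trans = {'A': {'A': 0.9, 'B': 0.1}, 'B': {'A': 0.2, 'B': 0.8}}
emit = {'A': {'0': 0.7, '1': 0.3}, 'B': {'0': 0.1, '1': 0.9}}

def sample_hmm(T, rng, start='A'):
    states, obs = [], []
    x = start
    for _ in range(T):
        states.append(x)
        # Step 2: emit y_t according to p(y_t | x_t).
        y = rng.choices(list(emit[x]), weights=emit[x].values())[0]
        obs.append(y)
        # Step 1 (for the next step): draw x_{t+1} from p(x_{t+1} | x_t).
        x = rng.choices(list(trans[x]), weights=trans[x].values())[0]
    return states, obs

rng = random.Random(1)
states, obs = sample_hmm(5, rng)
```

In practice only the observation sequence is seen; inference algorithms (forward-backward, Viterbi) recover information about the hidden states.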

Pattern structures
- Unlike a model structure, which describes the whole data set, a pattern structure characterizes some local aspect of the data.
Two important aspects of a pattern structure:
- Syntax: how the pattern is described.
- Semantics: the information the pattern conveys.

Patterns for data matrices
Syntax
- Primitive patterns: elementary conditions on individual variables.
- Complex patterns: primitives combined using logical connectives.
- Pattern class: the set of legal patterns, consisting of all primitive patterns and all legal combinations of primitives via logical connectives; this defines the form of the pattern structures.

Patterns for data matrices
Pattern discovery: find all patterns from the class that satisfy certain conditions with respect to the data set.
- Informativeness, novelty, and understandability should be considered in the discovery process.
- Association rule: accuracy and frequency measures (p_c, p_s) are attached to the rule to describe how accurate and how frequent it is.
- Functional dependency: the rule should be exactly accurate.
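Support (the frequency measure p_s) and confidence (the accuracy measure p_c) of an association rule A -> B can be computed directly; a sketch over a made-up transaction database:

```python
# Illustrative transaction database (market-basket style).
transactions = [{'bread', 'milk'}, {'bread', 'butter'},
                {'bread', 'milk', 'butter'}, {'milk'}]

# Support p_s: fraction of transactions containing the itemset.
def support(itemset):
    return sum(itemset <= t for t in transactions) / len(transactions)

# Confidence p_c of A -> B: support of (A union B) over support of A.
def confidence(lhs, rhs):
    return support(lhs | rhs) / support(lhs)
```

A functional dependency corresponds to the limiting case where the rule's confidence must equal 1 on the data.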

Let's move on to Chapter 7.