
Machine Learning
Javier Béjar cbea
LSI - FIB
Term 2012/2013

Outline

1. Introduction to Inductive Learning
2. Search and Inductive Learning
3. Hypothesis Spaces
4. Formal View
5. Model Selection
6. Model Performance

1. Introduction to Inductive Learning

Inductive Learning

- It is the area of machine learning with the largest number of methods
- Goal: to discover general concepts from a limited set of examples (experience)
- It is based on the search for characteristics shared among examples (common patterns)
- All its methods are based on inductive reasoning

(Figure: example images of cats and dogs.)

Inductive Reasoning vs. Deductive Reasoning

Inductive reasoning:
- It obtains general knowledge (a model) from specific information
- The knowledge obtained is new
- It is not truth-preserving (new information can invalidate the knowledge obtained)
- It has no well-founded theory (it is heuristic)

(Figure: induction builds concept descriptions C1 and C2 from positive and negative examples; new examples can force the descriptions to be revised.)

Inductive Reasoning vs. Deductive Reasoning

Deductive reasoning:
- It obtains knowledge using well-established methods (logic)
- The knowledge is not new (it is implicit in the initial knowledge)
- New knowledge cannot invalidate the knowledge already obtained
- Its basis is mathematical logic

(Figure: deduction derives conclusions for C1 and C2 by applying rules R1, R2, R3, ... from a domain theory; new examples add rules but do not invalidate previous conclusions.)

Inductive Learning

- From a formal point of view, the knowledge that we obtain is not valid: we assume that a limited number of examples represents the characteristics of the concept that we want to learn
- A single counterexample invalidates the result
- But most human learning is inductive!

Types of Inductive Learning: Supervised

- Examples are usually described by attribute-value pairs
- Each example is labeled with the concept it belongs to
- Learning is performed by contrasting concepts
- A set of heuristics allows us to generate different hypotheses
- A preference criterion (bias) is used to choose the hypothesis most suitable for the examples
- Result: the concept or concepts that best describe the examples, e.g.
  (Whiskers=yes, Tail=long, Ears=pointy => Cat)
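
As a concrete illustration, the hypothesis on the slide translates into a minimal Python predicate; the attribute encoding and examples below are illustrative, not from a real dataset:

    # The hypothesis learned on the slide: Whiskers=yes AND Tail=long AND
    # Ears=pointy => Cat (attribute names and encoding are illustrative).
    def is_cat(example):
        return (example["whiskers"] == "yes"
                and example["tail"] == "long"
                and example["ears"] == "pointy")

    examples = [
        {"whiskers": "yes", "tail": "long", "ears": "pointy"},   # a cat
        {"whiskers": "yes", "tail": "short", "ears": "floppy"},  # a dog
    ]
    print([is_cat(e) for e in examples])  # [True, False]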

Types of Inductive Learning: Unsupervised

- Examples are not labeled
- We want to discover a suitable way to group the objects into clusters
- Learning is based on the similarity/dissimilarity among examples
- Heuristic preference criteria guide the search
- Result: a partition of the examples and a characterization of each group

(Figure: unlabeled objects split into Group 1 and Group 2.)

2. Search and Inductive Learning

Learning as Search (I)

- The usual way to view inductive learning is as a search problem
- The goal is to discover a function/representation that summarizes the characteristics of a set of examples
- The search space is the set of all concepts that can be built
- There are different ways to perform the search

Learning as Search (II)

- Search space: the language used to describe the concepts, i.e. the set of concepts that can be described
- Search operators: heuristic operators that allow us to explore the space of concepts
- Heuristic function: the preference function/criterion that guides the search (the bias)

A small sketch of this view follows.
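
As a hedged illustration of this view (not an algorithm from the course), here is a minimal greedy search over pure conjunctive hypotheses, where the search operator adds one literal and the heuristic function is training accuracy; all names and data are illustrative:

    # Hypotheses are sets of required features; an example is classified as
    # positive when the hypothesis is a subset of its feature set.
    def accuracy(hypothesis, examples):
        return sum(
            (hypothesis <= features) == label for features, label in examples
        ) / len(examples)

    def greedy_learn(all_features, examples):
        hypothesis = set()                           # start with the empty conjunction
        best = accuracy(hypothesis, examples)
        improved = True
        while improved:
            improved = False
            for feat in all_features - hypothesis:   # operator: add one literal
                candidate = hypothesis | {feat}
                score = accuracy(candidate, examples)
                if score > best:                     # heuristic: training accuracy
                    hypothesis, best, improved = candidate, score, True
        return hypothesis

    examples = [({"whiskers", "tail"}, True), ({"tail"}, False), ({"whiskers"}, False)]
    print(greedy_learn({"whiskers", "tail"}, examples))  # {'whiskers', 'tail'}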

3. Hypothesis Spaces

Hypothesis Spaces

- There are multiple models that we can use for inductive learning, each with its advantages and disadvantages
- The most common are:
  - Logic formulas/rules
  - Probabilistic models (discrete/continuous, parametric/non-parametric)
  - General functions (linear/non-linear)
- Depending on the problem, some may model its characteristics better than others

Hypothesis Space: Logic Formulas

Logic as a model: the language used defines the size of the search space and the kind of concepts that can be learned. It represents hard linear decisions.

- Pure conjunctive/disjunctive formulas: 2^n, e.g. A ∧ B ∧ C ∧ D ∧ E
- k-term-CNF/k-term-DNF: 2^{kn}, e.g. (A ∧ B ∧ C) ∨ (A ∧ B ∧ D ∧ E)
- k-CNF/k-DNF: 2^{n^k}, e.g. (A ∧ B) ∨ (A ∧ D) ∨ (B ∧ E)
- CNF/DNF: 2^{2^n}

Hypothesis Space: Logic Formulas

(Figure: diagram comparing the expressiveness of pure conjunctive, k-term-DNF, k-DNF, and DNF formulas.)

Hypothesis Space: Probabilistic Functions

- We assume that the attributes follow specific probability density functions (PDFs)
- The complexity of the concepts depends on the PDF
- The size of the search space depends on the parameters needed to estimate the PDF
- It allows soft decisions (the prediction is a probability); a small sketch follows
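
As a hedged example of such a space, here is a sketch with one Gaussian per class, where the prediction is a posterior probability (a soft decision); the names, prior, and data are all illustrative:

    import numpy as np

    def fit_gaussian(x):
        """Estimate the two parameters of a 1-D Gaussian from a sample."""
        return x.mean(), x.std()

    def posterior_class1(x, params0, params1, prior1=0.5):
        """P(class=1 | x) via Bayes' rule with Gaussian class-conditionals."""
        def pdf(x, mu, sigma):
            return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
        p0 = pdf(x, *params0) * (1 - prior1)
        p1 = pdf(x, *params1) * prior1
        return p1 / (p0 + p1)

    rng = np.random.default_rng(0)
    class0, class1 = rng.normal(0, 1, 100), rng.normal(2, 1, 100)
    # Halfway between the class means the prediction is uncertain:
    print(posterior_class1(1.0, fit_gaussian(class0), fit_gaussian(class1)))  # ~0.5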

Hypothesis Space: General Functions

- We can assume a specific type of function or family of functions (linear, non-linear)
- The complexity of the concepts depends on the function
- The size of the search space depends on the parameters of the function

4. Formal View

Supervised Inductive Learning: A Formal View

We can formally define the task of supervised inductive learning as follows. Given:

- a set E of n examples described by input-output pairs (x_1, y_1), (x_2, y_2), ..., (x_n, y_n), each one generated by an unknown function y = f(x) with probability P(x, y),
- a hypothesis space H = {h(x; α) | α ∈ Π}, where Π is a set of parameters, and
- a loss function L(y, h(x; α)) (the cost of being wrong),

discover the function h_α that approximates f.

The learning process has to search the hypothesis space for the function h that performs well on all possible values (even the ones we have not seen).

Supervised Inductive Learning: A Formal View

This formalization allows us to compute the true error of the hypothesis h_α as the expected loss, i.e. the weighted average of the errors over all possible values:

    E_true(h_α) = ∫ L(y, h_α(x)) P(x, y) dx dy

Because in practice we only have a subset of the examples, we can only compute the empirical error:

    E_emp(h_α) = (1/n) Σ_{(x,y) ∈ E} L(y, h_α(x))

The goal is then to solve:

    h*_α = argmin_{h_α ∈ H} E_emp(h_α)
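
As a hedged illustration of minimizing the empirical error, here is a toy sketch over a one-dimensional threshold hypothesis space H = {h_t(x) = 1 if x > t, else 0}; the data and thresholds are illustrative:

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.uniform(0, 1, 100)
    y = (x > 0.6).astype(int)                             # unknown target f
    y = np.where(rng.uniform(size=100) < 0.1, 1 - y, y)   # 10% label noise

    def empirical_error(t, x, y):
        """E_emp(h_t) = (1/n) * sum of 0/1 losses."""
        predictions = (x > t).astype(int)
        return np.mean(predictions != y)

    # Search the hypothesis space for the minimizer of the empirical error.
    thresholds = np.linspace(0, 1, 101)
    errors = [empirical_error(t, x, y) for t in thresholds]
    best = thresholds[int(np.argmin(errors))]
    print(f"best threshold: {best:.2f}, empirical error: {min(errors):.2f}")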

Supervised Inductive Learning: A Formal View

- Usually x is a vector of attribute-value pairs, where each attribute is discrete or continuous; more complex representations for x are also possible
- Depending on the values of y we can distinguish two different tasks:
  - If y is a finite set of values we have classification (with only two values it is usually called binary classification)
  - If y is a number we have regression
- The usual loss functions for these tasks are:
  - 0/1 loss (classification): L_{0/1}(y, ŷ) = 0 if y = ŷ, else 1
  - Square error loss (regression): L_2(y, ŷ) = (y − ŷ)²
  - Absolute error loss (regression): L_1(y, ŷ) = |y − ŷ|
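
The three loss functions translate directly into code; this numpy sketch uses illustrative names and data:

    import numpy as np

    def zero_one_loss(y, y_hat):
        return np.where(y == y_hat, 0.0, 1.0)   # classification

    def squared_loss(y, y_hat):
        return (y - y_hat) ** 2                  # regression

    def absolute_loss(y, y_hat):
        return np.abs(y - y_hat)                 # regression

    y = np.array([1.0, 2.0, 3.0])
    y_hat = np.array([1.0, 2.5, 2.0])
    print(squared_loss(y, y_hat).mean())   # 0.4166...
    print(absolute_loss(y, y_hat).mean())  # 0.5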

5. Model Selection

Supervised Inductive Learning: Learning the Hypothesis

The choice of the space of functions H depends on the problem; it is a parameter, and we need to find a compromise:

- A hypothesis space that is too complex needs more time to be explored
- A hypothesis space that is too simple may not contain a good approximation of f

Minimizing the empirical error alone usually does not give enough information for the search:

- Usually, the more complex h_α is, the more we can reduce the error
- But we also have to keep in mind that we want to predict unseen examples: a complex hypothesis can fit the seen data too well and fail to generalize (overfitting)

Complex Hypotheses vs. Overfitting

Fitting the training data perfectly can lead to poor generalization.

(Figure: a hypothesis that fits every training example exactly makes large errors on unseen examples.)

Supervised Inductive Learning: Model Selection

- A usual criterion for guiding the learning process is, given a choice of different hypotheses, to prefer the simpler one
- This principle is called Ockham's razor: "Plurality ought never be posited without necessity"
- The justification for using this principle is that a simpler hypothesis has a lower probability of containing unnecessary conditions
- The way this preference is enforced is usually called the bias or hypothesis preference of the learning algorithm

Model Selection: Error Estimation

In order to know how well our hypothesis generalizes, we need to estimate its error on unseen examples. There are several methods for estimating this error (a k-fold sketch follows this list):

- Holdout cross-validation: we split the data into a training set and a test set; the hypothesis h_α is obtained by minimizing the error on the training set, and the expected error is estimated on the test set (usually a poor estimate)
- k-fold cross-validation: we split the data into k subsets and perform k learning processes, each using 1/k of the examples as the test set and the rest as the training set
- Leave-one-out cross-validation: we repeat the learning process n times, each time leaving out a single example to measure the error (useful when the dataset is small)
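
Here is a minimal sketch of k-fold cross-validation; the fit(x, y) factory and the model.predict interface are illustrative assumptions, not an API from the course:

    import numpy as np

    def k_fold_error(fit, x, y, k=5, seed=0):
        """Average 0/1 test error over k folds (x, y are numpy arrays)."""
        indices = np.random.default_rng(seed).permutation(len(x))
        folds = np.array_split(indices, k)
        errors = []
        for i in range(k):
            test = folds[i]                                            # 1/k as test set
            train = np.concatenate([folds[j] for j in range(k) if j != i])
            model = fit(x[train], y[train])                            # learn on the rest
            errors.append(np.mean(model.predict(x[test]) != y[test]))  # test error
        return float(np.mean(errors))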

Model Selection: Selecting Complexity

If the complexity of the model is a parameter, we can use the error estimates to find an adequate value for our hypothesis:

    def crossvalidate_model_complexity(learner, k, examples):
        size = 1
        error_train, error_val = {}, {}
        while True:
            # cross_validation returns the training and validation errors
            # for a model of the given size (k-fold, as in the previous slide)
            error_train[size], error_val[size] = cross_validation(learner, size, k, examples)
            if size > 1 and error_train[size] >= error_train[size - 1]:
                break  # training error has stopped decreasing
            size += 1
        best_size = min(error_val, key=error_val.get)  # lowest validation error
        return learner(best_size, examples)

- When the empirical error on the training set converges, we begin to overfit
- The size with the lowest validation error gives the best hypothesis size

Model Selection: Selecting Complexity

(Figure: training and validation error as a function of model complexity; the training error keeps decreasing while the validation error reaches a minimum and then grows as the model overfits.)

Model Selection: Selecting Complexity

There are other methods for selecting the complexity of the model, such as (a regularization sketch follows this list):

- Regularization: assuming we can somehow measure the complexity of the model, Compl(h_α), we look for the hypothesis that optimizes

      h*_α = argmin_{h_α ∈ H} E_emp(h_α) + λ · Compl(h_α)

  where λ is the weight given to the complexity of the model
- Feature selection: we determine which attributes are relevant during the search
- Minimum description length (MDL): we minimize the number of bits needed to encode the hypothesis plus the number of bits needed to encode the prediction errors on the examples (complexity and error are measured in the same units)
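
As one concrete (and commonly used) instance of regularization, not necessarily the one intended in the course, here is a ridge-regression sketch where Compl(h_w) = ||w||² and the empirical error is the mean squared error; the data are illustrative:

    import numpy as np

    def ridge_fit(X, y, lam):
        """Minimize (1/n)*||Xw - y||^2 + lam*||w||^2 (standard closed form)."""
        n, d = X.shape
        return np.linalg.solve(X.T @ X / n + lam * np.eye(d), X.T @ y / n)

    rng = np.random.default_rng(0)
    X = rng.normal(size=(50, 3))
    y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=50)
    print(ridge_fit(X, y, lam=0.0))   # close to the true weights
    print(ridge_fit(X, y, lam=10.0))  # shrunk toward zero: simpler hypothesis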

6. Model Performance

Model Performance

- The measured empirical error is an estimate of how well our model generalizes
- In this error measurement we are using the whole dataset; for example, each cross-validation fold shares 90% of the data with the other folds
- In order to obtain a better estimate we need a separate test set
- A poor result on this test set could mean that our learning sample is biased and not representative of the task

Model Performance: Uncertainty of the Estimate

The errors we commit have several sources:

- The specific hypothesis space: the best hypothesis in it is at some distance from the true function
- The specific sample of the task: different samples give different information
- The uncertainty of the values/representation of the sample:
  - We may be using only a partial view of the task (some variables are not observed)
  - We may have a corrupted version of the sample (the task labels are wrong)

Model Performance: Uncertainty of the Estimate

Our error can be decomposed as:

    Error = systematic error (from the hypothesis space)
          + dependence on the specific sample
          + the random nature of the process

This error can also be described by what is called the bias-variance decomposition:

- Bias: how much the average prediction deviates from the truth
- Variance: how sensitive our prediction is to the sample used

A specific expression for this decomposition can be computed for the different loss functions. In general, when the complexity of the model increases the bias decreases but the variance increases; the usual way to reduce both is to increase the size of the sample. A small simulation sketch follows.
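
As a hedged illustration, a small simulation can estimate both terms at a single test point, using the standard squared-error decomposition E[(h(x0) − f(x0))²] = bias² + variance; the target function, noise level, and model degree are illustrative choices:

    import numpy as np

    rng = np.random.default_rng(0)
    f = lambda x: np.sin(2 * np.pi * x)    # true function
    x0, degree, n = 0.3, 3, 30             # test point, model complexity, sample size

    predictions = []
    for _ in range(500):                   # many independent training samples
        x = rng.uniform(0, 1, n)
        y = f(x) + 0.3 * rng.normal(size=n)
        coeffs = np.polyfit(x, y, degree)  # fit a degree-3 polynomial
        predictions.append(np.polyval(coeffs, x0))

    predictions = np.array(predictions)
    bias2 = (predictions.mean() - f(x0)) ** 2   # average prediction vs. truth
    variance = predictions.var()                # sensitivity to the sample
    print(f"bias^2 = {bias2:.4f}, variance = {variance:.4f}")

Re-running with a larger degree should decrease the bias term and increase the variance term, matching the trade-off described above.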