CS 2750 Machine Learning
Lecture 1: Machine Learning
http://www.cs.pitt.edu/~milos/courses/cs2750/




Milos Hauskrecht
milos@cs.pitt.edu
539 Sennott Square
http://www.cs.pitt.edu/~milos/courses/cs2750/

Administration
Instructor: Milos Hauskrecht, milos@cs.pitt.edu, 539 Sennott Square. Office hours: TBA.
TA: Michael Moeng, moeng@cs.pitt.edu, Sennott Square. Office hours: TBA.

Administration: study material
Handouts, your notes, and course readings.
Primary textbook:
- C. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.
Other books:
- Hastie, Tibshirani, Friedman. The Elements of Statistical Learning. Springer.
- Duda, Hart, Stork. Pattern Classification. 2nd edition. Wiley, 2001.
- C. Bishop. Neural Networks for Pattern Recognition. Oxford University Press, 1995.
- T. Mitchell. Machine Learning. McGraw-Hill, 1997.
- J. Han, M. Kamber. Data Mining. Morgan Kaufmann.

Administration
Homeworks: weekly.
Programming tool: Matlab (CSSD machines and labs). Matlab tutorial: next week.
Exams: midterm (March), final (April).
Final project: written report + oral presentation (April).
Lectures: attendance and activity.

Tentative topics
- Introduction to machine learning.
- Density estimation.
- Supervised learning: linear models for regression and classification; multi-layer neural networks; support vector machines; kernel methods.
- Unsupervised learning: learning Bayesian networks; latent variable models; expectation maximization; clustering.

Tentative topics (cont.)
- Dimensionality reduction: feature extraction, principal component analysis (PCA).
- Ensemble methods: mixture models, bagging and boosting.
- Hidden Markov models.
- Reinforcement learning.

Machine Learning
The field of machine learning studies the design of computer programs (agents) capable of learning from past experience or of adapting to changes in the environment. The need for agents capable of learning is everywhere: predictions in medicine, text and web-page classification, speech recognition, image/text retrieval, commercial software.

Learning
Learning process: a learner (a computer program) processes data D representing past experiences and tries either to develop an appropriate response to future data, or to describe the data seen in some meaningful way.
Example: the learner sees a set of patient cases (patient records) with corresponding diagnoses. It can try either:
- to predict the presence of a disease for future patients, or
- to describe the dependencies between diseases and symptoms.

Types of learning
- Supervised learning: learning a mapping between an input x and a desired output y. A teacher gives the learner the y's for learning purposes.
- Unsupervised learning: learning relations between data components. No specific outputs are given by a teacher.
- Reinforcement learning: learning a mapping between an input x and a desired output y. A critic does not give the learner the y's; instead it gives a signal (reinforcement) of how good the answer was.
- Other types of learning: concept learning, active learning.

Supervised learning
Data: $D = \{d_1, d_2, \dots, d_n\}$, a set of n examples, where $d_i = \langle \mathbf{x}_i, y_i \rangle$, $\mathbf{x}_i$ is an input vector, and $y_i$ is the desired output (given by a teacher).
Objective: learn the mapping $f : X \to Y$ such that $y_i \approx f(\mathbf{x}_i)$ for all $i = 1, \dots, n$.
Two types of problems (a toy sketch of both follows below):
- Regression: X discrete or continuous, Y continuous.
- Classification: X discrete or continuous, Y discrete.

Supervised learning examples
- Regression (Y continuous): debt/equity, earnings, future product orders → company stock price.
- Classification (Y discrete): handwritten digit (array of 0s and 1s) → label, e.g. "3".
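To make the two problem types concrete, here is a minimal sketch on assumed synthetic data: the same inputs carry a continuous target (regression) or a discrete target (classification). The data-generating functions are invented for illustration.

```python
# Toy supervised data D = {<x_i, y_i>}: continuous targets -> regression,
# discrete targets -> classification. Synthetic data, for illustration only.
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(100, 1))                    # input vectors x_i

y_reg = 2.0 * X[:, 0] + 0.3 * rng.standard_normal(100)   # continuous y: regression
y_cls = (X[:, 0] > 0).astype(int)                        # discrete y in {0, 1}: classification

D_reg = list(zip(X, y_reg))                              # examples d_i = <x_i, y_i>
D_cls = list(zip(X, y_cls))
print(D_reg[0], D_cls[0])
```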

Unsupervised learning
Data: $D = \{d_1, d_2, \dots, d_n\}$, where $d_i = \mathbf{x}_i$ is a vector of values. No target value (output) y.
Objective: learn relations between samples and between components of samples.
Types of problems:
- Clustering: group together similar examples, e.g. patient cases.
- Density estimation: model probabilistically the population of samples.

Unsupervised learning example
Clustering: group together similar examples, $d_i = \mathbf{x}_i$.
[Figure: scatter plot of 2-D points forming several clusters.]
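As an illustration of clustering, here is a minimal k-means sketch on assumed synthetic 2-D blobs; k-means is one standard clustering algorithm, not necessarily the one behind the figure.

```python
# Minimal k-means: group similar examples with no teacher-given outputs.
# Pure numpy; two synthetic Gaussian blobs stand in for the data.
import numpy as np

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-2, 0.5, (50, 2)), rng.normal(2, 0.5, (50, 2))])

k = 2
centers = X[rng.choice(len(X), k, replace=False)]  # random initial centers
for _ in range(20):
    # assign each point to its nearest center
    labels = np.argmin(((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1), axis=1)
    # move each center to the mean of its assigned points
    centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])

print("cluster centers:\n", centers)
```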

Unsupervised learning example
Density estimation: we want to build a probability model P(x) of the population from which we draw the examples $d_i = \mathbf{x}_i$.
[Figure: scatter plot of the same 2-D sample, viewed as draws from an unknown density.]
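Density estimation can be sketched the same way. Here is a minimal EM fit of an assumed two-component 1-D Gaussian mixture on synthetic data; the 2-D mixture-of-Gaussians model on the next slide is the same idea in two dimensions.

```python
# Minimal EM for a two-component 1-D Gaussian mixture P(x).
# Synthetic data assumed; parameters are weights w, means mu, variances var.
import numpy as np

rng = np.random.default_rng(6)
x = np.concatenate([rng.normal(-2, 0.5, 100), rng.normal(2, 1.0, 100)])

w, mu, var = np.array([0.5, 0.5]), np.array([-1.0, 1.0]), np.array([1.0, 1.0])

def gauss(x, mu, var):
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

for _ in range(50):
    # E-step: responsibility of each component for each point
    r = w[None, :] * gauss(x[:, None], mu[None, :], var[None, :])
    r /= r.sum(axis=1, keepdims=True)
    # M-step: re-estimate parameters from the responsibilities
    n_k = r.sum(axis=0)
    w = n_k / len(x)
    mu = (r * x[:, None]).sum(axis=0) / n_k
    var = (r * (x[:, None] - mu[None, :]) ** 2).sum(axis=0) / n_k

print("weights", w.round(2), "means", mu.round(2), "vars", var.round(2))
```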

Unsupervised learning: density estimation
A probability density over points in the two-dimensional space. Model used here: a mixture of Gaussians.

Reinforcement learning
We want to learn $f : X \to Y$. We see samples of x, but not y. Instead of y we get feedback (reinforcement) from a critic about how good our output was:
input sample → Learner → output → Critic → reinforcement (fed back to the Learner)
The goal is to select outputs that lead to the best reinforcement.
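A toy sketch of the learner-critic loop, using an assumed multi-armed-bandit setup: the learner never sees the correct output y, only a scalar reinforcement, and adjusts its preferences accordingly.

```python
# Learner-critic loop, bandit style (illustrative assumption, not from the slides):
# the critic returns a scalar reward r, never the desired output y.
import numpy as np

rng = np.random.default_rng(2)
n_actions = 3
values = np.zeros(n_actions)             # learner's estimate of each output's worth
counts = np.zeros(n_actions)
true_reward = np.array([0.2, 0.8, 0.5])  # known only to the critic

for t in range(1000):
    # epsilon-greedy output selection
    a = rng.integers(n_actions) if rng.random() < 0.1 else int(np.argmax(values))
    r = float(rng.random() < true_reward[a])   # critic: noisy scalar feedback
    counts[a] += 1
    values[a] += (r - values[a]) / counts[a]   # running-average update

print("learned action values:", values.round(2))  # best action should score highest
```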

Learning: first look
Assume we see examples of pairs (x, y) in D, and we want to learn the mapping $f : X \to Y$ to predict y for some future x. We get the data D: what should we do?
[Figure: scatter plot of the observed (x, y) pairs.]

Problem: many possible functions $f : X \to Y$ exist for representing the mapping between x and y. Which one should we choose? Many examples are still unseen!

Solution: make an assumption about the model, say,
$f(x) = ax + b + \varepsilon$, where $\varepsilon \sim N(0, \sigma^2)$ is random (normally distributed) noise.
The restriction to a linear model is an example of a learning bias.

Bias provides the learner with some basis for choosing among possible representations of the function. Forms of bias: constraints, restrictions, model preferences.
Important: there is no learning without a bias!

Choosing a parametric model or a set of models is not enough: there are still too many functions $f(x) = ax + b + \varepsilon$, $\varepsilon \sim N(0, \sigma^2)$, one for every pair of parameters a, b.

Fitting the data to the model
We want the best set of model parameters.
Objective: find the parameters that reduce the misfit between the model M and the observed data D, or, in other words, that explain the data best.
Objective function: an error function that measures the misfit between D and M.
Examples of error functions (both are sketched in code below):
- Average squared error: $\frac{1}{n}\sum_{i=1}^{n}(y_i - f(\mathbf{x}_i))^2$
- Average misclassification error (the average number of misclassified cases): $\frac{1}{n}\sum_{i=1}^{n}\mathbf{1}[y_i \neq f(\mathbf{x}_i)]$
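Both error functions, written directly from the formulas above (toy arrays assumed):

```python
# The two error functions from the slide, as plain numpy.
import numpy as np

def avg_squared_error(y, y_hat):
    """(1/n) * sum_i (y_i - f(x_i))^2"""
    return np.mean((y - y_hat) ** 2)

def avg_misclassification(y, y_hat):
    """(1/n) * number of i with y_i != f(x_i)"""
    return np.mean(y != y_hat)

y = np.array([1.0, 2.0, 3.0])
print(avg_squared_error(y, np.array([1.1, 1.9, 3.2])))                   # small misfit
print(avg_misclassification(np.array([0, 1, 1]), np.array([0, 1, 0])))  # 1/3
```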

Fitting the data to the model
The linear regression problem minimizes the squared error function for the linear model:
$\min_{a,b} \frac{1}{n}\sum_{i=1}^{n}(y_i - f(x_i))^2$
[Figure: least-squares line through the (x, y) data points.]

Learning: summary
Three basic steps (sketched in code below):
1. Select a model or a set of models (with parameters), e.g. $y = ax + b$.
2. Select the error function to be optimized, e.g. $\frac{1}{n}\sum_{i=1}^{n}(y_i - f(x_i))^2$.
3. Find the set of parameters optimizing the error function.
The model and parameters with the smallest error represent the best fit of the model to the data. But there are problems one must be careful about.
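A minimal sketch of the three steps for the linear model $y = ax + b$, fit by least squares on assumed synthetic data:

```python
# Step 1: model y = a*x + b.  Step 2: squared error.  Step 3: optimize.
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(-2, 2, 50)
y = 1.5 * x - 0.5 + 0.3 * rng.standard_normal(50)   # data from a noisy line

# least-squares solution of the linear system [x 1] @ [a, b] ~ y
A = np.column_stack([x, np.ones_like(x)])
(a, b), *_ = np.linalg.lstsq(A, y, rcond=None)

mse = np.mean((y - (a * x + b)) ** 2)               # the error being minimized
print(f"a={a:.2f}, b={b:.2f}, training MSE={mse:.3f}")
```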

Learning problem
We fit the model based on past examples observed in D, but ultimately we are interested in learning a mapping that performs well on the whole population of examples.
Training data: the data used to fit the parameters of the model.
Training error: $Error(D, f) = \frac{1}{n}\sum_{i=1}^{n}(y_i - f(x_i))^2$
True (generalization) error, over the whole (unknown) population: the mean squared error $E_{(x,y)}[(y - f(x))^2]$.
The training error tries to approximate the true error. Does a good training error imply a good generalization error?

Overfitting
Assume we have a set of points and we consider polynomial functions as our possible models.
[Figure: scatter plot of the data points.]

Fitting a linear function with the squared error: the error is nonzero.
[Figure: least-squares line through the points.]

Linear vs. cubic polynomial: the higher-order polynomial leads to a better fit and a smaller error.
[Figure: linear and cubic fits to the same points.]

Is it always good to minimize the error on the observed data?

For 10 data points, a degree-9 polynomial gives a perfect fit (Lagrange interpolation); the error is zero. Is it always good to minimize the training error? NO!! What matters more is how we perform on unseen data.
[Figure: degree-9 polynomial passing exactly through every data point.]
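A sketch of this effect on assumed synthetic data: on 10 noisy points, a degree-9 polynomial drives the training error to (nearly) zero while the error on fresh samples from the same population grows.

```python
# Overfitting demo: training error vs. error on unseen samples.
# (numpy may warn that the degree-9 fit is poorly conditioned; that is the point.)
import numpy as np

rng = np.random.default_rng(4)
def sample(n):
    x = rng.uniform(-1, 1, n)
    return x, np.sin(2 * np.pi * x) + 0.2 * rng.standard_normal(n)

x_train, y_train = sample(10)
x_test, y_test = sample(1000)

for degree in (1, 3, 9):
    coeffs = np.polyfit(x_train, y_train, degree)   # least-squares polynomial fit
    train_mse = np.mean((y_train - np.polyval(coeffs, x_train)) ** 2)
    test_mse = np.mean((y_test - np.polyval(coeffs, x_test)) ** 2)
    print(f"degree {degree}: train MSE {train_mse:.4f}, test MSE {test_mse:.4f}")
```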

Overfitting: the situation when the training error is low but the generalization error is high. Causes of the phenomenon:
- a model with a large number of parameters (degrees of freedom);
- a small data size (compared to the complexity of the model).

How to evaluate the learner's performance?
The generalization error is the true error for the population of examples, the quantity we would really like to optimize: $E_{(x,y)}[(y - f(x))^2]$. But it cannot be computed exactly; a sample mean only approximates the true mean. Optimizing the (mean) training error $\frac{1}{n}\sum_{i=1}^{n}(y_i - f(\mathbf{x}_i))^2$ can lead to overfitting, i.e. the training error may not properly reflect the generalization error. So how do we test the generalization error?

Two ways to assess the generalization error:
- Theoretical: the law of large numbers gives statistical bounds on the difference between the true error and the sample mean error.
- Practical: use a separate data set with m data samples to test the model. (Mean) test error: $\frac{1}{m}\sum_{j=1}^{m}(y_j - f(\mathbf{x}_j))^2$

Basic experimental setup to test the learner's performance:
1. Take a dataset D and divide it into a training data set and a testing data set.
2. Use the training set and your favorite ML algorithm to train the learner.
3. Test (evaluate) the learner on the testing data set.
The results on the testing set can be used to compare different learners powered by different models and learning algorithms. A minimal sketch of this setup follows.
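The sketch below follows the three steps above, with assumed synthetic data and a plain least-squares line as the learner:

```python
# Train/test protocol: fit on the training split, report mean test error.
import numpy as np

rng = np.random.default_rng(5)
x = rng.uniform(-2, 2, 200)
y = 0.7 * x + 1.0 + 0.4 * rng.standard_normal(200)

# Step 1: divide D into training and testing sets
idx = rng.permutation(200)
train, test = idx[:150], idx[150:]

# Step 2: train the learner (least-squares line) on the training set only
A = np.column_stack([x[train], np.ones(len(train))])
(a, b), *_ = np.linalg.lstsq(A, y[train], rcond=None)

# Step 3: evaluate on the held-out testing set
test_mse = np.mean((y[test] - (a * x[test] + b)) ** 2)
print(f"mean test error: {test_mse:.3f}")
```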