Introduction to Machine Learning


Ronald J. Williams, CSG220, Spring 2007

What is learning?
Learning means improving with experience. A learning agent consists of a performance element plus a learning element. The performance element decides what actions to take (e.g., identify this image, or choose a move in this game); the learning element modifies the performance element so that it makes better decisions.

Learning agent design
- Which components of the performance element are to be learned?
- What feedback is available to learn these components?
- What representation is to be used for the components?
- How is performance to be measured (i.e., what is meant by "better decisions")?

Wide range of possible goals
- Given a set of data, find potentially predictive patterns in it (data mining, scientific discovery).
- As a result of acquiring new data, gain knowledge allowing an agent to exploit its environment (robot navigation). The acquisition of new knowledge may be passive or active (e.g., exploration or queries).

- Given experience in some problem domain, improve performance in it (game playing, robotics). Rote learning qualifies, but the more interesting and challenging aspect is to generalize successfully beyond actual experiences.

Learning vs. programming
- Learning is essential for unknown environments, i.e., when the designer lacks omniscience.
- Learning is essential in changing environments.
- Learning is useful as a system construction method: expose the agent to reality rather than trying to write it down.

Application examples
- Robot control
- Playing a game
- Recognizing handwritten digits
- Various bioinformatics applications
- Filtering email (e.g., spam detection)
- Intelligent user interfaces

Relevant Disciplines
- Artificial intelligence
- Probability & statistics
- Control theory
- Computational complexity
- Philosophy
- Psychology
- Neurobiology

Three categories of learning problem
- Supervised learning
- Unsupervised learning
- Reinforcement learning
This is not an exhaustive list, and the categories are not necessarily mutually exclusive.

Supervised learning
Also called inductive inference. Given training data {(x_i, y_i)}, where y_i = f(x_i) for some unknown function f : X → Y, find an approximation h to f. This is called a classification problem if Y is a small discrete set (e.g., {+, −}), and a regression problem if Y is a continuous set (e.g., a subset of R). More realistic, but harder: each observed y_i is a noise-corrupted approximation to f(x_i).
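To make the setup concrete, here is a minimal sketch (not from the slides; the toy data and the choice of a 1-nearest-neighbor rule are mine) of building an approximation h from training pairs {(x_i, y_i)}:

```python
# Supervised learning in miniature: given training pairs (x_i, y_i)
# with y_i = f(x_i), build an approximation h to f. Here h is a
# 1-nearest-neighbor rule on a toy 1-D classification problem
# with Y = {'+', '-'}.
def nearest_neighbor_h(training_data):
    """Return h: X -> Y that copies the label of the closest training point."""
    def h(x):
        _, label = min(training_data, key=lambda pair: abs(pair[0] - x))
        return label
    return h

train = [(0.5, '+'), (1.0, '+'), (3.0, '-'), (4.2, '-')]
h = nearest_neighbor_h(train)
print(h(0.8))   # '+'  (closest training point is x = 1.0)
print(h(3.9))   # '-'  (closest training point is x = 4.2)
```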

X is called the instance space. We construct or adjust h to agree with f on the training set; h is consistent if it agrees with f on all training examples (an inappropriate goal if noise is assumed present). If Y = {+, −}, define {x ∈ X : f(x) = +}, the set of all positive instances, to be a concept. Thus 2-class classification problems may also be called concept learning problems.

Classification (supervised)
[Figure: labeled data in the instance space X = R^2, target space Y = {+, −}, shown as + and − points in the plane.]
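Consistency is easy to state in code; a one-function sketch (the name is mine):

```python
# h is consistent if it agrees with the labels on every training example.
def is_consistent(h, training_data):
    return all(h(x) == y for x, y in training_data)
```

For example, the nearest-neighbor rule above is consistent on its own training data (assuming distinct x values, each point is its own nearest neighbor), which illustrates the slide's caveat: consistency is inappropriate as a goal when the labels are noisy.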

Regression (supervised)
[Figure sequence: curve fitting with instance space X = R and target space Y = R; successive slides show candidate curves fit to the same data points.]

Regression
Which curve is best?

Unsupervised learning
We have an instance space X (no labels). Possible objectives: clustering, characterizing the distribution, principal component analysis. One possible use is novelty detection: "this newly observed instance is different." Unsupervised learning also includes such things as association rules in data mining: "people who buy diapers tend to also buy beer."
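To make clustering concrete, here is a minimal one-dimensional k-means sketch (my own illustration, not from the slides): alternately assign points to the nearest center, then move each center to the mean of its cluster.

```python
# Minimal 1-D k-means sketch: alternate between assigning points to the
# nearest center and moving each center to the mean of its points.
def kmeans_1d(points, centers, iters=10):
    for _ in range(iters):
        clusters = {c: [] for c in centers}
        for p in points:
            nearest = min(centers, key=lambda c: abs(c - p))
            clusters[nearest].append(p)
        # Recompute centers as cluster means, dropping empty clusters.
        centers = [sum(ps) / len(ps) for ps in clusters.values() if ps]
    return centers

data = [1.0, 1.2, 0.8, 5.0, 5.3, 4.7]
print(kmeans_1d(data, centers=[0.0, 10.0]))  # converges to [1.0, 5.0]
```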

Unsupervised
[Figure: unlabeled data in the instance space X = R^2.]

But also...
[Figure: a mix of labeled and unlabeled data, instance space X = R^2, target space Y = {+, −} (the semi-supervised setting).]

Overfitting
Overfitting is especially applicable in supervised learning, but may also appear in other types of learning problems. It often manifests itself as the learner performing worse on test data even as it gets better at fitting the training data. There are practical techniques as well as theoretical approaches for trying to avoid this problem.
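The classic demonstration, sketched below with numpy (my own example, not from the course): fit polynomials of increasing degree to noisy samples of a smooth target, and watch training error fall while test error eventually rises.

```python
# Fit polynomials of increasing degree to noisy samples of a smooth
# function; training error keeps falling, but test error typically
# rises at high degree: overfitting.
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(2 * np.pi * x)          # "unknown" target function

def noisy_sample(n):
    x = rng.uniform(0, 1, n)
    return x, f(x) + rng.normal(0, 0.2, n)   # noise-corrupted labels

x_train, y_train = noisy_sample(12)
x_test, y_test = noisy_sample(200)

for degree in (1, 3, 9):
    h = np.poly1d(np.polyfit(x_train, y_train, degree))  # least squares
    train_mse = np.mean((h(x_train) - y_train) ** 2)
    test_mse = np.mean((h(x_test) - y_test) ** 2)
    print(f"degree {degree}: train MSE {train_mse:.3f}, test MSE {test_mse:.3f}")
```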

Performance measurement for supervised learning
How do we know that h ≈ f? (Hume's Problem of Induction.) Either use theorems of computational/statistical learning theory, or try h on a new test set of examples (drawn from the same distribution over the instance space as the training set). A learning curve plots the percentage correct on the test set as a function of training set size.
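Computing a learning curve is straightforward; a minimal sketch (toy 1-D concept and a 1-nearest-neighbor learner, both invented for illustration):

```python
# Estimate a learning curve: test-set accuracy as a function of the
# number of training examples, for a 1-nearest-neighbor learner on a
# toy 1-D concept. Train and test sets share the same distribution.
import random

random.seed(0)
f = lambda x: '+' if x > 0.3 else '-'        # true target concept

def sample(n):
    return [(x, f(x)) for x in (random.uniform(-1, 1) for _ in range(n))]

def one_nn(train):
    return lambda x: min(train, key=lambda p: abs(p[0] - x))[1]

test = sample(500)
for m in (1, 2, 4, 8, 16, 32, 64):
    h = one_nn(sample(m))
    acc = sum(h(x) == y for x, y in test) / len(test)
    print(f"{m:3d} training examples: {acc:.0%} correct on test set")
```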

The learning curve depends on:
- realizable (the hypothesis class can express the target function) vs. nonrealizable; nonrealizability can be due to missing attributes, or to a restricted hypothesis class that excludes the true hypothesis (called selection bias)
- redundant expressiveness (e.g., many irrelevant attributes)
- size of the hypothesis space

Occam's razor: maximize a combination of consistency and simplicity. In this form it is just an informal principle, and it involves a tradeoff. Attempts to formalize it penalize more complex hypotheses: Minimum Description Length, Kolmogorov complexity. An alternative is the Bayesian approach: start with an a priori distribution over hypotheses.

Reinforcement learning
Applies to choosing sequences of actions to obtain a good long-term outcome (game playing, controlling dynamical systems). The key feature is that the system is not told directly how to behave; it is only given a performance score ("that was good/bad", "that was worth a 9.5").
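A minimal taste of this setting is a multi-armed bandit (my own illustration; the arm payoffs and the epsilon-greedy strategy are assumptions, not course material): the agent only ever sees a numeric reward, never the correct action.

```python
# Epsilon-greedy bandit: the agent only receives a numeric payoff
# ("that was worth a 9.5"), never the right answer, yet learns which
# arm pays best by balancing exploration and exploitation.
import random

random.seed(0)
true_means = [1.0, 5.0, 3.0]                 # hidden payoff per arm
pull = lambda a: random.gauss(true_means[a], 1.0)

estimates = [0.0] * 3                        # running payoff estimates
counts = [0] * 3
for _ in range(1000):
    if random.random() < 0.1:                # explore 10% of the time
        a = random.randrange(3)
    else:                                    # otherwise exploit best guess
        a = max(range(3), key=lambda i: estimates[i])
    r = pull(a)
    counts[a] += 1
    estimates[a] += (r - estimates[a]) / counts[a]  # incremental mean

print([round(e, 1) for e in estimates])      # arm 1 should look best
```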

Issues for a learning system designer
- How to represent performance element inputs and outputs: symbolic logical expressions, or numerical attribute vectors?
- How to represent the input/output mapping: artificial neural network, decision tree, Bayes network, general computer program?
- What kind of prior knowledge to use, and how to represent it and/or take advantage of it during learning?

Contrasting representations of X (and Y and h, if applicable)
- Symbolic, with logical rules (e.g., X = shapes with size and color specified):
  instances: (Shape=circle)^(Size=large)^(Color=red)
  rules: IF (Shape=circle)^(Size=large) THEN (Interesting=yes)
         IF (Shape=square)^(Color=green) THEN (Interesting=no)
- Numeric, e.g., points in Euclidean space R^n
(A small sketch contrasting the two representations follows.)
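A small sketch of the contrast (attribute names and values invented for illustration):

```python
# The same kind of object in two representations: a symbolic
# attribute/value description vs. a point in Euclidean space,
# plus one logical rule expressed as a predicate.
symbolic = {"shape": "circle", "size": "large", "color": "red"}
numeric = (2.7, 0.31)                      # e.g., (diameter_cm, hue)

def rule_interesting(instance):
    # IF (Shape=circle)^(Size=large) THEN Interesting=yes
    return instance["shape"] == "circle" and instance["size"] == "large"

print(rule_interesting(symbolic))          # True
```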

Learning as search through a hypothesis space H
Inductive bias comes in two forms:
- Selection bias: only hypotheses h ∈ H are allowed.
- Preference bias: H includes all possible hypotheses, but if more than one fits the data, choose the best among these (e.g., Occam's razor: simpler hypotheses are better).
Selection bias leads to less of an overfitting problem, but runs the risk of eliminating the true hypothesis (i.e., the true hypothesis is unrealizable in H).

Selection bias example
H = pure conjunctive concepts in some attribute/value description language:
  (Shape=square)^(Size=large)
  (Shape=circle)^(Size=small)^(Color=red)
  Boolean: A^~B (equivalent to (A=true)^(B=false))
The description of all positive instances is restricted to have this form. (A brute-force "search through H" sketch follows.)
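With a small H, learning-as-search can literally be enumeration. A brute-force sketch (encoding and data invented for illustration) that keeps every pure conjunctive hypothesis consistent with the training examples:

```python
# Learning as search through H, brute force: H = all pure conjunctive
# concepts over n Boolean attributes ('T' = required true, 'F' =
# required false, '*' = ignored). Keep every h in H consistent with
# the training examples.
from itertools import product

def matches(spec, x):
    return all(s == '*' or (s == 'T') == v for s, v in zip(spec, x))

def consistent_hypotheses(n, examples):
    return [spec for spec in product('TF*', repeat=n)
            if all(matches(spec, x) == label for x, label in examples)]

# Three Boolean attributes (A, B, C); the target concept is A^~B.
examples = [((True, False, True), True),
            ((True, False, False), True),
            ((True, True, True), False),
            ((False, False, True), False)]
for spec in consistent_hypotheses(3, examples):
    print(spec)    # ('T', 'F', '*'), i.e., A^~B, ignoring C
```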

Another selection bias example
H = the space of all linear separators in the plane. The separating line can be anywhere, and either side can be the + side. Such a separator can be specified by 3 real numbers.
[Figure: a line separating + points from − points in the plane.]

Hypothesis space size
How many distinct Boolean functions of n Boolean attributes?

The number of distinct Boolean functions equals the number of distinct truth tables with 2^n rows, which is 2^(2^n). E.g., with 6 Boolean attributes there are 18,446,744,073,709,551,616 different possible Boolean hypotheses.

How many purely conjunctive concepts over n Boolean attributes? Each attribute can be required to be true, required to be false, or ignored, so there are 3^n distinct purely conjunctive hypotheses. (A brute-force check for small n appears below.)
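Both counts can be checked by brute force for small n; a sketch (the representation choices are mine):

```python
# Brute-force check of the counting arguments, for small n.
from itertools import product

def count_boolean_functions(n):
    rows = list(product([False, True], repeat=n))   # 2^n input rows
    return 2 ** len(rows)                           # label each row freely

def count_conjunctive_concepts(n):
    # Each attribute: required true ('T'), required false ('F'), or
    # ignored ('*'). Identify each concept with the set of inputs it
    # accepts, and count the distinct sets.
    concepts = set()
    for spec in product('TF*', repeat=n):
        accepted = frozenset(
            row for row in product([False, True], repeat=n)
            if all(s == '*' or (s == 'T') == v for s, v in zip(spec, row)))
        concepts.add(accepted)
    return len(concepts)

for n in (1, 2, 3):
    print(n, count_boolean_functions(n), count_conjunctive_concepts(n), 3 ** n)
# n=3 prints: 3 256 27 27 -- matching 2^(2^n) and 3^n
```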

A more expressive hypothesis space increases the chance that the target function can be expressed, but it also increases the number of hypotheses consistent with the training set, and so may yield worse predictions.

Hypothesis space size (cont.)
How many distinct linear separators are there in n-dimensional Euclidean space?

Infinitely many. How many distinct quadratic separators in n-dimensional Euclidean space (e.g., with quadratic curves as separators in R^2)? Also infinitely many. It's clear that allowing more complex separators gives rise to a more expressive hypothesis space, but a simple count of hypotheses doesn't measure it.

Preview of coming attractions: there is a measure of hypothesis space size that applies even to these infinite hypothesis spaces, the Vapnik-Chervonenkis (VC) dimension.