Machine Learning Overview

Similar documents
Statistical Models in Data Mining

STA 4273H: Statistical Machine Learning

Introduction to Machine Learning Lecture 1. Mehryar Mohri Courant Institute and Google Research

CS 2750 Machine Learning. Lecture 1. Machine Learning. CS 2750 Machine Learning.

Linear Classification. Volker Tresp Summer 2015

Machine Learning.

Statistical Machine Learning

Classification Problems

Data, Measurements, Features

: Introduction to Machine Learning Dr. Rita Osadchy

Machine Learning CS Lecture 01. Razvan C. Bunescu School of Electrical Engineering and Computer Science

Basics of Statistical Machine Learning

Sense Making in an IOT World: Sensor Data Analysis with Deep Learning

Logistic Regression. Jia Li. Department of Statistics The Pennsylvania State University. Logistic Regression

CHAPTER 2 Estimating Probabilities

HT2015: SC4 Statistical Data Mining and Machine Learning

Statistics Graduate Courses

Supervised Learning (Big Data Analytics)

MACHINE LEARNING IN HIGH ENERGY PHYSICS

Learning is a very general term denoting the way in which agents:

CCNY. BME I5100: Biomedical Signal Processing. Linear Discrimination. Lucas C. Parra Biomedical Engineering Department City College of New York

Example: Credit card default, we may be more interested in predicting the probabilty of a default than classifying individuals as default or not.

Neural Networks for Machine Learning. Lecture 13a The ups and downs of backpropagation

Linear Threshold Units

Statistical Machine Learning from Data

PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION

Bayes and Naïve Bayes. cs534-machine Learning

Search Taxonomy. Web Search. Search Engine Optimization. Information Retrieval

Christfried Webers. Canberra February June 2015

Introduction to Machine Learning Using Python. Vikram Kamath

Big Data Analytics CSCI 4030

Predict Influencers in the Social Network

Bayesian Machine Learning (ML): Modeling And Inference in Big Data. Zhuhua Cai Google, Rice University

MA2823: Foundations of Machine Learning

Cell Phone based Activity Detection using Markov Logic Network

Azure Machine Learning, SQL Data Mining and R

Semi-Supervised Support Vector Machines and Application to Spam Filtering

Question 2 Naïve Bayes (16 points)

Bayesian probability theory

Social Media Mining. Data Mining Essentials

Monotonicity Hints. Abstract

PS 271B: Quantitative Methods II. Lecture Notes

Introduction to Data Mining

Analysis of kiva.com Microlending Service! Hoda Eydgahi Julia Ma Andy Bardagjy December 9, 2010 MAS.622j

INTRODUCTION TO MACHINE LEARNING 3RD EDITION

Is a Data Scientist the New Quant? Stuart Kozola MathWorks

Bayesian networks - Time-series models - Apache Spark & Scala

These slides follow closely the (English) course textbook Pattern Recognition and Machine Learning by Christopher Bishop

Bayesian Statistics: Indian Buffet Process

Supervised and unsupervised learning - 1

Machine Learning. Chapter 18, 21. Some material adopted from notes by Chuck Dyer

Network Machine Learning Research Group. Intended status: Informational October 19, 2015 Expires: April 21, 2016

Lecture 3: Linear methods for classification

Data Mining - Evaluation of Classifiers

Lecture 9: Introduction to Pattern Analysis

Machine learning for algo trading

Machine Learning and Statistics: What s the Connection?

Class #6: Non-linear classification. ML4Bio 2012 February 17 th, 2012 Quaid Morris

An Introduction to Machine Learning

Content-Based Recommendation

Data Mining Algorithms Part 1. Dejan Sarka

Big Data, Machine Learning, Causal Models

BIOINF 585 Fall 2015 Machine Learning for Systems Biology & Clinical Informatics

Spatial Statistics Chapter 3 Basics of areal data and areal data modeling

Machine Learning. CUNY Graduate Center, Spring Professor Liang Huang.

Practical Data Science with Azure Machine Learning, SQL Data Mining, and R

Probabilistic Linear Classification: Logistic Regression. Piyush Rai IIT Kanpur

Simple and efficient online algorithms for real world applications

Logistic Regression. Vibhav Gogate The University of Texas at Dallas. Some Slides from Carlos Guestrin, Luke Zettlemoyer and Dan Weld.

Machine Learning Introduction

Course: Model, Learning, and Inference: Lecture 5

Applying Deep Learning to Car Data Logging (CDL) and Driver Assessor (DA) October 22-Oct-15

Tensor Methods for Machine Learning, Computer Vision, and Computer Graphics

Machine Learning for Data Science (CS4786) Lecture 1

Supporting Online Material for

Machine Learning and Data Analysis overview. Department of Cybernetics, Czech Technical University in Prague.

Facebook Friend Suggestion Eytan Daniyalzade and Tim Lipus

ARTIFICIAL INTELLIGENCE (CSCU9YE) LECTURE 6: MACHINE LEARNING 2: UNSUPERVISED LEARNING (CLUSTERING)

Introduction to Online Learning Theory

Logistic Regression (1/24/13)

An Introduction to Data Mining. Big Data World. Related Fields and Disciplines. What is Data Mining? 2/12/2015

Introduction to Pattern Recognition

Predictive Modeling Techniques in Insurance

Markov Chain Monte Carlo Simulation Made Simple

CS 688 Pattern Recognition Lecture 4. Linear Models for Classification

Dirichlet Processes A gentle tutorial

Tagging with Hidden Markov Models

Learning outcomes. Knowledge and understanding. Competence and skills

Java Modules for Time Series Analysis

Invited Applications Paper

Novelty Detection in image recognition using IRF Neural Networks properties

NAVIGATING SCIENTIFIC LITERATURE A HOLISTIC PERSPECTIVE. Venu Govindaraju

Predict the Popularity of YouTube Videos Using Early View Data

Comparison of Non-linear Dimensionality Reduction Techniques for Classification with Gene Expression Microarray Data

Machine Learning with MATLAB David Willingham Application Engineer

Steven C.H. Hoi School of Information Systems Singapore Management University

An Introduction to Data Mining

Introduction to Learning & Decision Trees

The Scientific Data Mining Process

Transcription:

Machine Learning Overview Sargur N. Srihari University at Buffalo, State University of New York USA 1

Outline 1. What is Machine Learning (ML)? 1. As a scientific Discipline 2. As an area of Computer Science/AI 2. Core learning methods 1. Supervised (Regression/Classification/Deep) 2. Unsupervised (PCA, Clustering, Topic Models) 3. Reinforcement 3. Main drivers 1. Mobile systems (big data) 2. Personalization 2

Machine Learning as a Discipline Focused on two inter-related fundamental scientific/engineering questions 1. How can one construct computer systems that automatically improve through experience? 2. What are the statistical-computational-informationtheoretic laws that govern all learning systems Including computers, humans and organizations? Machine learning is also important for highly practical computer software fielded across many applications 3

Machine Learning as Software Area Programming computers to: Perform tasks that humans perform well but difficult to specify algorithmically Principled way of building high performance information processing systems Probabilistic responses to queries IR Adaptive user interfaces, personalized assistants (information systems) Scientific/engineering applications 4

ML within AI ML has emerged as method of choice for practical software for: Computer vision Speech recognition Natural language processing Robot control Other applications Far easier to train by showing examples of input-output behavior Than manually anticipate response for every input 5

Example Problem: Handwritten Digit Recognition Wide variability of same numeral Handcrafted rules will result in large no of rules and exceptions Better to have a machine that learns from a large training set Handwriting recognition cannot be done without machine learning! 6

Most Successful Application of ML Learning to recognize spoken words Speaker-specific strategies for recognizing primitive sounds (phonemes) and words from speech signal Neural networks and methods for learning HMMs for customizing to individual speakers, vocabularies and microphone characteristics Recently Google increased accuracy for Table 1.1 Android by 25% 7

ML Example: Self-Driving Vehicle ALVINN: Drive at 70mph for 90 miles on public highways Google Prototype Tesla Autopilot Learning to drive an autonomous vehicle Train computer-controlled vehicles to steer correctly Associate steering commands with image sequences Deployment: Taxi Courier Service 8

Drivers of ML Progress Mobile systems gather/transport vast amounts of data: Big data Turn to ML solutions to obtain insights, predictions, decisions Granularized personalized data Personalization: relevance of posts shown Advertising copywriting Historical medical records: Determine treatment Historical traffic data: Control congestion 9

Learning Problem Definition Improving some measure of performance P when executing some task T through some type of training experience E Example: Learning to detect credit card fraud Task T Assign label of fraud or not fraud to credit card transaction Performance measure P Accuracy of fraud classifier With higher penalty when fraud is labeled as not fraud Training experience E Historical credit card transactions labeled as fraud or not 10

The ML Approach Data Collection Samples Model Selection Probability distribution to model process Parameter Estimation Values/distributions Generalization (Training) Inference Find responses to queries Decision (Inference OR Testing) 11

ML History within AI ML/PR Methods around for over 50 years 12

Core Methods 1. Supervised Learning Training data consists of (x,y) pairs Goal is prediction y* for input x* 2. Unsupervised Learning Analysis of unlabeled data 3. Reinforcement Learning Training data inbetween supervised/unsupervised Indication of whether action is correct or not Rewad signal may refer to an entire input sequence 13

Supervised Learning Most widely used methods of ML, e.g., Spam classification of email Face recognizers over images Medical diagnosis systems Inputs x are vectors or more complex objects documents, DNA sequences or graphs Outputs are binary, multiclass(k), Multi-label (more than one class), ranking, Structured: y is a graph satisfying constraints, e.g., POS tagging 14 Real-valued or mixture of discrete and real-valued

Supervised Classification Example Off-shore oil transfer pipelines Non-invasive measurement of proportion of oil,water, gas Called Three-phase Oil/Water/Gas Flow Input data: Dual-energy gamma densitometry Beam of gamma rays passed through pipe Attenuation in intensity indicates density of material Single beam insufficient Two degrees of freedom: fraction of oil, fraction of water One beam of Gamma rays of two energies (frequencies) Detector Six Beams 12 measurements attenuation

Prediction Problems 1. Predict Volume Fractions of oil/water/gas 2. Predict configuration (one of three) Twelve Features Three classes Two variables, 100 points shown Naïve cell based voting fails exponential growth of cells with dimensionality 12 dimensions discretized into 6 gives 3 million cells Hardly any points in each cell Which class should x belong to? 16

Probability Theory Sum Rule for Marginalization L p(x = x i ) = p(x = x i,y = y j ) j=1 Product Rule: for combining p(x,y ) = n ij N = p(y X)p(X) Bayes Rule p( X Y) p( Y) p ( Y X ) = where p ( X ) = p( X Y) p( Y) Y p( X ) Viewed as Posterior α likelihood x prior Fully Bayesian approach Conjugate distributions Feasible with increased computational power Intractable posterior handled using either Variational Bayes or Stochastic sampling e.g., Markov Chain Monte Carlo, Gibbs 17

Probability Distributions Discrete- Binary Binomial N samples of Bernoulli N=1 Bernoulli Single binary variable Conjugate Prior Beta Continuous variable between {0,1] Discrete- Multi-valued Large N Multinomial One of K values = K-dimensional binary vector Conjugate Prior K=2 Dirichlet K random variables between [0.1] Continuous Gaussian Student s-t Generalization of Gaussian robust to Outliers Infinite mixture of Gaussians Gamma ConjugatePrior of univariate Gaussian precision Gaussian-Gamma Conjugate prior of univariate Gaussian Unknown mean and precision Wishart Conjugate Prior of multivariate Gaussian precision matrix Gaussian-Wishart Conjugate prior of multi-variate Gaussian Unknown mean and precision matrix Exponential Special case of Gamma Angular Von Mises Uniform 18

Statistical Models Generative Naïve Bayes Mixtures of multinomials Mixtures of Gaussians Hidden Markov Models (HMM) Bayesian networks Markov random fields Discriminative Logistic regression SVMs Traditional neural networks Nearest neighbor Conditional Random Fields (CRF) 19

HMMs for Speech Recognition Three distinct layers 1. Language Model: generates sentences as sequences of words 2. Word Model: described as a sequence of phonemes /p//u//sh/ 3. Acoustic model: shows progression of the acoustic signal through a phoneme 20

DBN for monitoring a vehicle Represents system dynamics X 5 : Observation depends on car s location (and map not modeled) and error status of sensor (failure) (X 4 ) Weather Weather' Weather 0 X 1 : Bad weather makes sensor likely to fail (X 4 ) Velocity Location Velocity' Location' Velocity 0 Location 0 X 3 : Location depends on previous position and velocity (X 2 ) Failure Failure' Obs' Failure 0 Ob Time slice t Time slice t +1 Time slice 0 (a) 21 (b)

Regression Problem data set Corresponding inverse problem by reversing x and t Red curve is result of fitting a two-layer neural network by minimizing squared error Very poor fit to data: GMMs used here 22

Regression: Learning to Rank Input (x i ): (d Features of Query-URL pair) (d >200) In LETOR 4.0 dataset 46 query-document features Maximum of 124 URLs/query Log frequency of query in anchor text Query word in color on page # of images on page # of (out) links on page PageRank of page URL length URL contains ~ Page length Traditional IR uses TF/IDF Output (y): Relevance Value Target Variable - Point-wise (0,1,2,3) - Regression returns continuous value - Allows fine-grained ranking of URLs

Deep Learning Multilayer stack of simple modules subject to: Learning, Non-linear map (ReLU) 5 to 20 layers Sensitive to minute details (Samoyeds from white wolves) Invariant (Background, pose, lighting, other objects) Convolutional Nets alternate convolutional layer and pooling layer Stunning success ConvNet +Recurrent Net 1. Representation by CNN 24 2. RNN trained to translate

Unsupervised Learning Labeled data under assumption of underlying structure of data, e.g., 1. Clustering is to find partition of data 2. Identify a low-dimensional manifold PCA, manifold learning, factor analysis, random projections, auto-encoders Topic modeling, Recommendation systems A criterion function is used e.g., max likelihood Computational complexity is key to exploit large unlabeled data sets 25

Clustering Finding a partition for observed data And a rule for predicting future data Old Faithful Geyser in Yellowstone Simple Gaussian unable to capture structure Linear superposition of two Gaussians is better Gaussian cannot model such data sets Gaussian Mixture Models give very complex densities K p( x) = π N( x µ, Σ ) k = 1 k k k 272 observations Duration (mins, horiz axis) vs Time to next eruption (vertical) p k are mixing coefficients that sum to one Log-likelihood function is N K ln p ( X π, µ, Σ) = ln π k ( n k Σk ) = N x µ, = n 1 k 1 There is no closed-form solution Use either iterative numerical optimization techniques or Expectation Maximization One dimension Three Gaussians in blue Sum in red 26

Topic Models Unsupervised methods to analyze documents Topics are distributions over words A document is a distribution across topics Methods: SVD, Collaborative Filtering Topics Topic 1 training 0.08 network 0.05 neural 0.03 Topic 2 noise 0.017 uncertain 0.011 reliability 0.010 positive 0.0084.. Topic 3 data 0.1041 estimate 0.020 estimation 0.019. An Example of Topic Modeling The ability to learn from data with uncertain and missing information is a fundamental requirement for learning systems. In the "real world", features are missing due to unrecorded information or due to occlusion in vision, and measurements are affected by noise. In some cases the experimenter might want to assign varying degrees of reliability to the data. In regression, uncertainty is typically attributed to the dependent variable which is assumed to be disturbed by additive noise. But there is no reason to assume that input features might not be uncertain as well or even missing completely. In some cases, we can ignore the problem: instead of trying to model the relationship between the true input and the output we are satisfied with modeling the relationship between the uncertain input and the output. But there are at least two reasons why we might want to explicitly deal with uncertain inputs. First, we might be interested in the underlying relationship between the true input and the output (e.g. the relationship has some physical meaning). Second, the problem might be non-stationary in the sense that for different samples different inputs are uncertain or missing or the levels of uncertainty vary. The naive strategy of training networks for all possible input combinations explodes in complexity and would require sufficient data for all relevant cases. It makes more sense to define one underlying true model and relate all data to this one model. Ahmad and Tresp (1993) have shown how to include uncertainty during recall under the assumption that the network approximates the "true" underlying function. In this paper, we first show how input uncertainty can be taken into account in the training of a feedforward neural network. Then we show that for networks of Gaussian basis functions it is possible to obtain closed-form solutions. We validate the solutions on two applications. Topic Distribution Topic 1 Topic 2 Topic 3 27

Recommendation Systems Data indicates links between users and items Suggest other items to a user based on data across all users Solution: SVD, Collaborative Filtering 28

Reinforcement Learning Dog is given a reward/punishment for an action Policies: what actions to take in a particular situation Utility estimation: how good is state (à used by policy) No supervised output but delayed reward Credit assignment what was responsible for outcome Applications: Game playing Robot in a maze Multiple agents, partial observability, 29

Causal Learning A Causal Bayesian Network Example of Inference: Cancer is independent of Age and Gender given exposure to Toxics and Smoking Computationally feasible inference 30

Summary Machine Learning Discipline Study how systems improve with experience Study statistical-computational-information-theoretic laws governing learning systems Used with success in all AI applications Core methods: Supervised Classification, Regression, Ranking Fully Bayesian approach together with Variational methods and Monte Carlo sampling Deep Learning Unsupervised (PCA, Topic Models, Clustering) Reinforcement Drivers are mobile systems (big data), personalization 31