An Introduction to Machine Learning

An Introduction to Machine Learning L5: Novelty Detection and Regression Alexander J. Smola Statistical Machine Learning Program Canberra, ACT 0200 Australia Alex.Smola@nicta.com.au Tata Institute, Pune, January 2007 Alexander J. Smola: An Introduction to Machine Learning 1 / 40

Overview L1: Machine learning and probability theory Introduction to pattern recognition, classification, regression, novelty detection, probability theory, Bayes rule, inference L2: Density estimation and Parzen windows Nearest Neighbor, kernel density estimation, Silverman's rule, Watson-Nadaraya estimator, cross-validation L3: Perceptron and Kernels Hebb's rule, perceptron algorithm, convergence, kernels L4: Support Vector estimation Geometrical view, dual problem, convex optimization, kernels L5: Support Vector estimation Regression, novelty detection L6: Structured Estimation Sequence annotation, web page ranking, path planning, implementation and optimization Alexander J. Smola: An Introduction to Machine Learning 2 / 40

L5 Novelty Detection and Regression Novelty Detection Basic idea Optimization problem Stochastic Approximation Examples Regression Additive noise Regularization Examples SVM Regression Quantile Regression Alexander J. Smola: An Introduction to Machine Learning 3 / 40

Novelty Detection Data Observations x_i generated from some P(x), e.g., network usage patterns, handwritten digits, alarm sensors, factory status. Task Find unusual events, clean database, distinguish typical examples. Alexander J. Smola: An Introduction to Machine Learning 4 / 40

Applications Network Intrusion Detection Detect whether someone is trying to hack the network, downloading tons of MP3s, or doing anything else unusual on the network. Jet Engine Failure Detection You can't destroy jet engines just to see how they fail. Database Cleaning We want to find out whether someone stored bogus information in a database (typos, etc.), mislabelled digits, ugly digits, bad photographs in an electronic album. Fraud Detection Credit Cards, Telephone Bills, Medical Records Self-calibrating alarm devices Car alarms (adjust themselves to where the car is parked), home alarms (furniture, temperature, windows, etc.) Alexander J. Smola: An Introduction to Machine Learning 5 / 40

Novelty Detection via Densities Key Idea Novel data is data that we don't see frequently; it must lie in low-density regions. Step 1: Estimate density Observations x_1, ..., x_m Density estimate via Parzen windows Step 2: Thresholding the density Sort data according to density and use it for rejection Practical implementation: compute p(x_i) = (1/m) Σ_j k(x_i, x_j) for all i ≤ m and sort according to magnitude. Pick the smallest p(x_i) as novel points. Alexander J. Smola: An Introduction to Machine Learning 6 / 40
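
A minimal sketch of the practical implementation above, assuming a Gaussian Parzen window; the bandwidth, the 5% rejection fraction, and the toy data are illustrative choices rather than anything fixed by the lecture.

```python
import numpy as np

def parzen_novelty_scores(X, bandwidth=1.0):
    """Parzen-window estimate p(x_i) = (1/m) sum_j k(x_i, x_j) with a
    Gaussian kernel; smaller scores mean more novel points."""
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    K = np.exp(-sq_dists / (2.0 * bandwidth ** 2))
    return K.mean(axis=1)

# Toy data: 95 typical points plus 5 scattered ones; flag the lowest-density 5%.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (95, 2)), rng.uniform(-6.0, 6.0, (5, 2))])
scores = parzen_novelty_scores(X, bandwidth=0.5)
novel_idx = np.argsort(scores)[: int(0.05 * len(X))]
```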

Typical Data Alexander J. Smola: An Introduction to Machine Learning 7 / 40

Outliers Alexander J. Smola: An Introduction to Machine Learning 8 / 40

A better way... Alexander J. Smola: An Introduction to Machine Learning 9 / 40

A better way... Problems We do not care about estimating the density properly in regions of high density (waste of capacity). We only care about the relative density for thresholding purposes. We want to eliminate a certain fraction of observations and tune our estimator specifically for this fraction. Solution Areas of low density can be approximated as the level set of an auxiliary function. No need to estimate p(x) directly; use a proxy for p(x). Specifically: find f(x) such that x is novel if f(x) ≤ c, where c is some constant, i.e., f(x) describes the amount of novelty. Alexander J. Smola: An Introduction to Machine Learning 10 / 40

Maximum Distance Hyperplane Idea Find hyperplane, given by f(x) = ⟨w, x⟩ + b = 0, that has maximum distance from the origin yet is still closer to the origin than the observations. Hard Margin minimize (1/2)‖w‖² subject to ⟨w, x_i⟩ ≥ 1 Soft Margin minimize (1/2)‖w‖² + C Σ_{i=1}^m ξ_i subject to ⟨w, x_i⟩ ≥ 1 − ξ_i and ξ_i ≥ 0 Alexander J. Smola: An Introduction to Machine Learning 11 / 40

The ν-trick Problem Depending on C, the number of novel points will vary. We would like to specify the fraction ν beforehand. Solution Use hyperplane separating data from the origin H := {x : ⟨w, x⟩ = ρ} where the threshold ρ is adaptive. Intuition Let the hyperplane shift by shifting ρ. Adjust it such that the right number of observations is considered novel. Do this automatically. Alexander J. Smola: An Introduction to Machine Learning 12 / 40

The ν-trick Primal Problem minimize (1/2)‖w‖² + Σ_{i=1}^m ξ_i − mνρ where ⟨w, x_i⟩ − ρ + ξ_i ≥ 0 and ξ_i ≥ 0 Dual Problem minimize (1/2) Σ_{i,j=1}^m α_i α_j ⟨x_i, x_j⟩ where α_i ∈ [0, 1] and Σ_{i=1}^m α_i = νm. Similar to SV classification problem, use standard optimizer for it. Alexander J. Smola: An Introduction to Machine Learning 13 / 40
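
As a rough illustration of the ν-trick in practice, scikit-learn's OneClassSVM exposes the target fraction directly through its nu parameter; the RBF kernel, gamma value, and toy data below are assumptions made for this sketch, not part of the lecture.

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X_train = rng.normal(0.0, 1.0, (200, 2))

# nu plays the role of the fraction of observations allowed to be novel.
clf = OneClassSVM(kernel="rbf", gamma=0.5, nu=0.05)
clf.fit(X_train)

X_test = np.vstack([rng.normal(0.0, 1.0, (10, 2)), [[5.0, 5.0]]])
labels = clf.predict(X_test)             # +1 = typical, -1 = novel
scores = clf.decision_function(X_test)   # signed distance to the separating surface
```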

USPS Digits Better estimates since we only optimize in low density regions. Specifically tuned for small number of outliers. Only estimates of a level-set. For ν = 1 we get the Parzen-windows estimator back. Alexander J. Smola: An Introduction to Machine Learning 14 / 40

A Simple Online Algorithm Objective Function (1/2)‖w‖² + (1/m) Σ_{i=1}^m max(0, ρ − ⟨w, φ(x_i)⟩) − νρ Stochastic Approximation (1/2)‖w‖² + max(0, ρ − ⟨w, φ(x_i)⟩) − νρ Gradient ∂_w[...] = w − φ(x_i) if ⟨w, φ(x_i)⟩ < ρ, and w otherwise; ∂_ρ[...] = 1 − ν if ⟨w, φ(x_i)⟩ < ρ, and −ν otherwise Alexander J. Smola: An Introduction to Machine Learning 15 / 40

Practical Implementation Update in coefficients α_j ← (1 − η) α_j for j ≠ i α_i ← η if Σ_{j=1}^{i−1} α_j k(x_i, x_j) < ρ, and 0 otherwise ρ ← ρ + η(ν − 1) if Σ_{j=1}^{i−1} α_j k(x_i, x_j) < ρ, and ρ + ην otherwise Using learning rate η. Alexander J. Smola: An Introduction to Machine Learning 16 / 40
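
A direct transcription of the update rule above into numpy, assuming a Gaussian kernel; the kernel, learning rate η, and fraction ν are illustrative choices.

```python
import numpy as np

def gaussian_kernel(x, y, gamma=0.5):
    return np.exp(-gamma * np.sum((x - y) ** 2))

def online_novelty(X, nu=0.05, eta=0.1, gamma=0.5):
    """Online one-class update: decay old coefficients, add the new point
    only if it currently scores below rho, then shift rho."""
    alpha = np.zeros(len(X))
    rho = 0.0
    for i in range(len(X)):
        # Score of x_i under the expansion built from the previous points.
        f_i = sum(alpha[j] * gaussian_kernel(X[i], X[j], gamma) for j in range(i))
        alpha[:i] *= 1.0 - eta            # decay all previous coefficients
        if f_i < rho:                     # x_i looks novel right now
            alpha[i] = eta
            rho += eta * (nu - 1.0)
        else:
            rho += eta * nu
    return alpha, rho
```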

Online Training Run Alexander J. Smola: An Introduction to Machine Learning 17 / 40

Worst Training Examples Alexander J. Smola: An Introduction to Machine Learning 18 / 40

Worst Test Examples Alexander J. Smola: An Introduction to Machine Learning 19 / 40

Mini Summary Novelty Detection via Density Estimation Estimate density, e.g., via Parzen windows Threshold it at some level and pick low-density regions as novel Novelty Detection via SVM Find halfspace bounding the data Quadratic programming solution Use existing tools Online Version Stochastic gradient descent Simple update rule: keep data if novel, but only with fraction ν, and adjust the threshold. Easy to implement Alexander J. Smola: An Introduction to Machine Learning 20 / 40

A simple problem Alexander J. Smola: An Introduction to Machine Learning 21 / 40

Inference p(weight | height) = p(height, weight) / p(height) ∝ p(height, weight) Alexander J. Smola: An Introduction to Machine Learning 22 / 40

Bayesian Inference HOWTO Joint Probability We have a distribution over y and y′, given training and test data x, x′. Bayes Rule This gives us the conditional probability via p(y, y′ | x, x′) = p(y′ | y, x, x′) p(y | x) and hence p(y′ | y) ∝ p(y, y′ | x, x′) for fixed y. Alexander J. Smola: An Introduction to Machine Learning 23 / 40

Normal Distribution in R^n Normal Distribution in R p(x) = (1/√(2πσ²)) exp(−(x − µ)²/(2σ²)) Normal Distribution in R^n p(x) = (1/√((2π)^n det Σ)) exp(−(1/2)(x − µ)^⊤ Σ^{-1} (x − µ)) Parameters µ ∈ R^n is the mean. Σ ∈ R^{n×n} is the covariance matrix. Σ has only nonnegative eigenvalues: the variance of a random variable is never negative. Alexander J. Smola: An Introduction to Machine Learning 24 / 40
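
The density formulas can be checked with a few lines of numpy; the mean, covariance, and evaluation point below are arbitrary.

```python
import numpy as np

def mvn_density(x, mu, Sigma):
    """Density of N(mu, Sigma) at x, following the formula above."""
    n = len(mu)
    diff = x - mu
    norm = np.sqrt((2.0 * np.pi) ** n * np.linalg.det(Sigma))
    return np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff)) / norm

mu = np.array([0.0, 0.0])
Sigma = np.array([[1.0, 0.3], [0.3, 2.0]])
print(mvn_density(np.array([0.5, -1.0]), mu, Sigma))
```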

Gaussian Process Inference Our Model We assume that all y_i are related, as given by some covariance matrix K. More specifically, we assume that Cov(y_i, y_j) is given by two terms: A general correlation term, parameterized by k(x_i, x_j) An additive noise term, parameterized by δ_ij σ² Practical Solution Since y′ | y ∼ N(µ̃, K̃), we only need to collect all terms in p(y, y′) depending on y′; by matrix inversion, K̃ = K_{y′y′} − K_{y′y} K_{yy}^{-1} K_{yy′} (independent of y) and µ̃ = µ′ + K_{y′y} K_{yy}^{-1} (y − µ). Key Insight We can use this for regression of y′ given y. Alexander J. Smola: An Introduction to Machine Learning 25 / 40
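
A tiny numerical check of the conditioning formulas, assuming a two-dimensional joint Gaussian split into an observed block y and a test block y′; the numbers are arbitrary.

```python
import numpy as np

# Blocks of the joint covariance of (y, y') and the joint mean.
K_yy, K_yyp = np.array([[2.0]]), np.array([[0.8]])
K_ypy, K_ypyp = np.array([[0.8]]), np.array([[1.0]])
mu_y, mu_yp = np.array([0.5]), np.array([-0.2])

y_observed = np.array([1.3])

# Conditional (predictive) mean and covariance of y' given y.
K_yy_inv = np.linalg.inv(K_yy)
mu_tilde = mu_yp + K_ypy @ K_yy_inv @ (y_observed - mu_y)
K_tilde = K_ypyp - K_ypy @ K_yy_inv @ K_yyp
```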

Some Covariance Functions Observation Any function k leading to a symmetric matrix with nonnegative eigenvalues is a valid covariance function. Necessary and sufficient condition (Mercer's Theorem) k needs to be a nonnegative integral kernel. Examples of kernels k(x, x′): Linear ⟨x, x′⟩ Laplacian RBF exp(−λ‖x − x′‖) Gaussian RBF exp(−λ‖x − x′‖²) Polynomial (⟨x, x′⟩ + c)^d with c ≥ 0, d ∈ N B-Spline B_{2n+1}(x − x′) Cond. Expectation E_c[p(x|c) p(x′|c)] Alexander J. Smola: An Introduction to Machine Learning 26 / 40
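
Minimal numpy versions of a few kernels from the table (linear, Laplacian RBF, Gaussian RBF, polynomial), plus a helper that assembles the covariance/Gram matrix; parameter values are illustrative, and the B-spline and conditional-expectation kernels are omitted.

```python
import numpy as np

def linear_kernel(x, y):
    return np.dot(x, y)

def laplacian_rbf(x, y, lam=1.0):
    return np.exp(-lam * np.linalg.norm(x - y))

def gaussian_rbf(x, y, lam=1.0):
    return np.exp(-lam * np.linalg.norm(x - y) ** 2)

def polynomial_kernel(x, y, c=1.0, d=3):
    return (np.dot(x, y) + c) ** d

def gram_matrix(X, kernel):
    """Covariance/Gram matrix with entries K[i, j] = k(x_i, x_j)."""
    m = len(X)
    return np.array([[kernel(X[i], X[j]) for j in range(m)] for i in range(m)])
```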

Linear Covariance Alexander J. Smola: An Introduction to Machine Learning 27 / 40

Laplacian Covariance Alexander J. Smola: An Introduction to Machine Learning 28 / 40

Gaussian Covariance Alexander J. Smola: An Introduction to Machine Learning 29 / 40

Polynomial (Order 3) Alexander J. Smola: An Introduction to Machine Learning 30 / 40

B 3 -Spline Covariance Alexander J. Smola: An Introduction to Machine Learning 31 / 40

Gaussian Processes and Kernels Covariance Function Function of two arguments Leads to matrix with nonnegative eigenvalues Describes correlation between pairs of observations Kernel Function of two arguments Leads to matrix with nonnegative eigenvalues Similarity measure between pairs of observations Lucky Guess We suspect that kernels and covariance functions are the same... Alexander J. Smola: An Introduction to Machine Learning 32 / 40

Training Data Alexander J. Smola: An Introduction to Machine Learning 33 / 40

Mean k(x)^⊤ (K + σ² 1)^{-1} y Alexander J. Smola: An Introduction to Machine Learning 34 / 40

Variance k(x, x) + σ² − k(x)^⊤ (K + σ² 1)^{-1} k(x) Alexander J. Smola: An Introduction to Machine Learning 35 / 40

Putting everything together... Alexander J. Smola: An Introduction to Machine Learning 36 / 40

Another Example Alexander J. Smola: An Introduction to Machine Learning 37 / 40

The ugly details Covariance Matrices Additive noise K = K_kernel + σ² 1 Predictive mean and variance K̃ = K_{y′y′} − K_{y′y} K_{yy}^{-1} K_{yy′} and µ̃ = K_{y′y} K_{yy}^{-1} y Pointwise prediction K_{yy} = K + σ² 1, K_{y′y′} = k(x, x) + σ², K_{yy′} = (k(x_1, x), ..., k(x_m, x)) Plug this into the mean and covariance equations. Alexander J. Smola: An Introduction to Machine Learning 38 / 40
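
Putting the pieces together, a small numpy sketch of Gaussian process regression that plugs K_yy = K + σ²·1 into the predictive mean and variance formulas above; the Gaussian RBF covariance, noise level, and toy data are assumptions made for the example.

```python
import numpy as np

def rbf(a, b, lam=1.0):
    return np.exp(-lam * np.sum((a - b) ** 2))

def gp_predict(X_train, y_train, X_test, sigma=0.1, lam=1.0):
    """Predictive mean k(x)^T (K + sigma^2 I)^-1 y and variance
    k(x, x) + sigma^2 - k(x)^T (K + sigma^2 I)^-1 k(x)."""
    m = len(X_train)
    K = np.array([[rbf(xi, xj, lam) for xj in X_train] for xi in X_train])
    K_inv = np.linalg.inv(K + sigma ** 2 * np.eye(m))
    means, variances = [], []
    for x in X_test:
        k_star = np.array([rbf(xi, x, lam) for xi in X_train])
        means.append(k_star @ K_inv @ y_train)
        variances.append(rbf(x, x, lam) + sigma ** 2 - k_star @ K_inv @ k_star)
    return np.array(means), np.array(variances)

# One-dimensional toy example.
X_train = np.linspace(-3.0, 3.0, 20).reshape(-1, 1)
y_train = np.sin(X_train).ravel() + 0.1 * np.random.default_rng(0).normal(size=20)
mu, var = gp_predict(X_train, y_train, np.array([[0.5], [2.5]]))
```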

Mini Summary Gaussian Process Like a function, just random Mean and covariance determine the process Can use it for estimation Regression Jointly normal model Additive noise to deal with errors in measurements Estimate for mean and uncertainty Alexander J. Smola: An Introduction to Machine Learning 39 / 40

Support Vector Regression Loss Function Given y, find f(x) such that the loss l(y, f(x)) is minimized. Squared loss (y − f(x))². Absolute loss |y − f(x)|. ε-insensitive loss max(0, |y − f(x)| − ε). Quantile regression loss max(τ(y − f(x)), (1 − τ)(f(x) − y)). Expansion f(x) = ⟨φ(x), w⟩ + b Optimization Problem minimize over w: Σ_{i=1}^m l(y_i, f(x_i)) + (λ/2)‖w‖² Alexander J. Smola: An Introduction to Machine Learning 40 / 40
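
The four loss functions written out in numpy for reference; the ε and τ defaults are illustrative.

```python
import numpy as np

def squared_loss(y, f):
    return (y - f) ** 2

def absolute_loss(y, f):
    return np.abs(y - f)

def eps_insensitive_loss(y, f, eps=0.1):
    return np.maximum(0.0, np.abs(y - f) - eps)

def quantile_loss(y, f, tau=0.9):
    return np.maximum(tau * (y - f), (1.0 - tau) * (f - y))
```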

Regression loss functions Alexander J. Smola: An Introduction to Machine Learning 41 / 40

Summary Novelty Detection Basic idea Optimization problem Stochastic Approximation Examples LMS Regression Additive noise Regularization Examples SVM Regression Alexander J. Smola: An Introduction to Machine Learning 42 / 40