Risk and Loss Functions
Yong Wang, Dongqing Zhang
Friday, February 6, 2004

Outline
- Loss functions
- Test error and expected risk
- A statistical perspective
- Robust estimators
- Summary

Loss Functions: Overview
- Motivation: determine a criterion by which to assess the quality of an estimate of the target, based on the observations made during learning.
- Non-trivial: the options are plentiful.
- Definition: a loss function is a mapping c taking the triple (x, y, f(x)) to [0, ∞), with c(x, y, y) = 0.
- The minimum of the loss is therefore 0, and it is attainable, at least for a given (x, y).
- In practice, the incurred loss is not always the quantity we actually attempt to minimize, for reasons of feasibility or desired confidence levels.
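As a minimal sketch of this definition (the function names are illustrative, not from the slides), here are two standard losses satisfying c(x, y, y) = 0:

```python
# Minimal sketch of the definition: a loss maps (x, y, f(x)) to [0, inf)
# and vanishes when the prediction equals the target.

def zero_one_loss(x, y, f_x):
    """Misclassification error: 0 if the prediction matches the label, else 1."""
    return 0.0 if y == f_x else 1.0

def squared_loss(x, y, f_x):
    """Squared error: non-negative, exactly 0 when f(x) = y."""
    return (y - f_x) ** 2

assert zero_one_loss(None, 1, 1) == 0.0     # c(x, y, y) = 0
assert squared_loss(None, 2.0, 2.0) == 0.0  # the minimum is attainable
```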

Some Examples
In classification:
- Misclassification error
- Input-dependent loss
- Asymmetric loss
- Soft margin loss
- Logistic loss
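A hedged sketch of two of these losses, for labels y in {-1, +1} and a real-valued score f(x); the margin constant 1 is the usual convention, not stated on the slide:

```python
import math

def soft_margin_loss(y, f_x):
    """Hinge loss: zero once the margin y * f(x) reaches 1, linear below that."""
    return max(0.0, 1.0 - y * f_x)

def logistic_loss(y, f_x):
    """Smooth, strictly positive surrogate for the misclassification error."""
    return math.log(1.0 + math.exp(-y * f_x))

print(soft_margin_loss(+1, 0.3))  # 0.7: correct side, but inside the margin
print(logistic_loss(+1, 0.3))     # ~0.554
```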

Some Examples (cont.)
In regression, the loss is usually a function of the size of the difference y − f(x):
- Squared error
- ε-insensitive loss
Criteria in practice:
- Cheap to compute
- Few discontinuities in the first derivative
- Convex, to ensure uniqueness of the solution
- Resistant to outliers
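The two regression losses named above, as a short sketch; the value of eps is an arbitrary illustration:

```python
def squared_error(y, f_x):
    """Classical quadratic loss (y - f(x))^2."""
    return (y - f_x) ** 2

def eps_insensitive(y, f_x, eps=0.1):
    """|y - f(x)|_eps = max(0, |y - f(x)| - eps): deviations below eps are free."""
    return max(0.0, abs(y - f_x) - eps)

print(squared_error(1.0, 1.05))    # 0.0025
print(eps_insensitive(1.0, 1.05))  # 0.0: inside the insensitive tube
```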

Test Error & Expected Risk
- Motivation: given penalties on specific instances (x, y, f(x)), how do we combine them to assess a particular estimate f?
- Definition of the test error: hard to compute directly.
- Definition of the expected risk: the situation is no better, since P(x, y) is unknown.
- Simplification: empirical estimation using the training patterns.

Approximations
Assumptions:
- An underlying probability distribution P(x, y) governing the data generation exists.
- The data (x, y) are drawn i.i.d. from P(x, y), and a density p(x, y) exists.
Substituting the empirical density for p leads to a quantity reasonably close to the expected risk: the empirical risk R_emp[f] = (1/m) Σ_i c(x_i, y_i, f(x_i)).
Dangers: ill-posed problems and overfitting.
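A sketch of this simplification: the empirical risk averages the loss over training patterns as a plug-in estimate of the expected risk. The toy model and data below are illustrative assumptions:

```python
import numpy as np

# Empirical risk R_emp[f] = (1/m) * sum_i c(x_i, y_i, f(x_i)) as a plug-in
# estimate of the expected risk under the unknown P(x, y).

def empirical_risk(loss, f, X, Y):
    return np.mean([loss(x, y, f(x)) for x, y in zip(X, Y)])

rng = np.random.default_rng(0)
X = rng.normal(size=200)
Y = 2.0 * X + rng.normal(scale=0.5, size=200)  # toy data: y = 2x plus noise

squared = lambda x, y, f_x: (y - f_x) ** 2
print(empirical_risk(squared, lambda x: 2.0 * x, X, Y))  # ~0.25, the noise variance
```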

Ill-posed Problem: Example
- Address a regression problem using the quadratic loss function.
- Work with a linear class of functions f(x) = Σ_i α_i f_i(x), built from basis functions f_i.
- If we have more basis functions f_i than observations, there is a whole subspace of solutions with zero empirical risk.
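An illustration of this (not from the slides): with 10 basis functions and 5 observations, the quadratic-loss fit is underdetermined, and any null-space direction of the design matrix can be added to the coefficients without changing the fit:

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 5, 10                        # 5 observations, 10 basis functions
F = rng.normal(size=(m, n))         # design matrix F[i, j] = f_j(x_i)
y = rng.normal(size=m)

alpha, *_ = np.linalg.lstsq(F, y, rcond=None)  # minimum-norm solution
null_dir = np.linalg.svd(F)[2][-1]             # a right singular vector in F's null space

print(np.allclose(F @ alpha, y))                     # True: zero residual
print(np.allclose(F @ (alpha + 3.0 * null_dir), y))  # True: same zero residual
```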

A Statistical Perspective
- For a given observation and its estimate, besides the risk we can expect, we may be interested in the probability with which a particular loss is going to occur.
- This requires computing p(y|x).
- Be aware that two approximations are now involved: first model the density p, then compute a minimizer of the expected risk.
- This can lead to inferior, or at least not easily predictable, results: the additional approximation step may make the estimates worse.

Maximum Likelihood Estimation
- Likelihood and log-likelihood of the sample.
- Minimizing the negative log-likelihood coincides with minimizing the empirical risk if the loss function c is chosen as c(x, y, f(x)) = −ln p(y|x, f(x)).
- For regression: ξ = y − f(x) is the additive noise on f(x), with density p_ξ, giving c(x, y, f(x)) = −ln p_ξ(y − f(x)).
- For classification: the loss is likewise the negative log of the conditional class probability.
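A sketch of this correspondence for Gaussian additive noise: the loss −ln p_ξ(y − f(x)) equals the squared error scaled by 1/(2σ²) plus a constant. The value of sigma is an assumed noise level:

```python
import math

sigma = 0.5  # assumed noise level

def gaussian_nll_loss(y, f_x):
    """-ln p_xi(y - f(x)) for additive noise xi ~ N(0, sigma^2)."""
    xi = y - f_x
    return xi ** 2 / (2 * sigma ** 2) + 0.5 * math.log(2 * math.pi * sigma ** 2)

# Up to the additive constant, this is the squared error scaled by 1/(2 sigma^2):
const = 0.5 * math.log(2 * math.pi * sigma ** 2)
print(gaussian_nll_loss(1.3, 1.0) - const)   # 0.18
print((1.3 - 1.0) ** 2 / (2 * sigma ** 2))   # 0.18 as well
```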

Density Modeling
Possible models:
- Logistic transfer function
- Probit model
- Inverse complementary log-log model
Q: what is the policy for selecting a suitable model?
- For classification: the logistic model and its loss function.
- For regression: see the next slide.
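The three transfer functions named above, written out as a sketch; each maps a real score to a probability in (0, 1):

```python
import math

def logistic(t):
    return 1.0 / (1.0 + math.exp(-t))

def probit(t):
    # Standard normal CDF, expressed via the error function.
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))

def inverse_cloglog(t):
    # Inverse of the complementary log-log link.
    return 1.0 - math.exp(-math.exp(t))

for g in (logistic, probit, inverse_cloglog):
    print(g.__name__, round(g(0.0), 3))   # 0.5, 0.5, ~0.632
```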

Loss Functions & Density Models
[Table lost in transcription: density models for regression and their associated loss functions, e.g. Gaussian noise with squared error, Laplacian noise with absolute error.]

Practical Considerations
- Loss functions resulting from maximum-likelihood reasoning may be non-convex.
- Strong assumption: we explicitly know P(y|x, f).
- The minimization of the log-likelihood still depends on the class of functions, so we are in no better a situation than when minimizing the empirical risk.
- Is the choice of loss function arbitrary? Is there a good means of assessing the performance of an estimator?
- Answer: efficiency, i.e. how noisy an estimator is with respect to a reference estimator.

Estimators
- Denote by P(y|θ) a distribution of y depending on the parameter θ (possibly a vector), and by Y = {y_1, …, y_m} an m-sample drawn i.i.d. from P(y|θ).
- An estimator of the parameter θ is computed from Y; we assume it is unbiased.
- A convenient way to compare unbiased estimators is to compute their variance: the smaller the variance, the lower the probability that the estimate deviates far from θ.
- Variance thus serves as a one-number performance measure.
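An illustrative comparison (not from the slides): two unbiased estimators of the mean θ of N(θ, 1), compared by this one-number measure over repeated m-samples:

```python
import numpy as np

rng = np.random.default_rng(2)
theta, m, trials = 1.0, 50, 20000

samples = rng.normal(loc=theta, size=(trials, m))
for name, est in [("mean", samples.mean(axis=1)),
                  ("median", np.median(samples, axis=1))]:
    print(name,
          "bias ~", round(est.mean() - theta, 4),  # ~0: both are unbiased
          "variance ~", round(est.var(), 5))       # mean: ~1/m; median: larger
```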

Fisher Information, etc.
- The score function V_θ(Y) = ∂/∂θ ln P(Y|θ) indicates how much the data affect the choice of θ.
- The covariance of V_θ(Y) is called the Fisher information matrix I.
- B denotes the covariance of the estimator.
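A sketch for y ~ N(θ, 1): the score of one observation is d/dθ ln p(y|θ) = y − θ, so the score of an i.i.d. m-sample is Σ_i (y_i − θ), and its variance, the Fisher information, should be m:

```python
import numpy as np

rng = np.random.default_rng(3)
theta, m, trials = 0.0, 50, 20000

Y = rng.normal(loc=theta, size=(trials, m))
score = (Y - theta).sum(axis=1)            # score of each m-sample
print(round(score.mean(), 3))              # ~0: the score has zero expectation
print(round(score.var(), 2), "vs m =", m)  # ~50: empirical Fisher information
```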

The Cramer-Rao Bound
- Any unbiased estimator satisfies det(IB) ≥ 1: the covariance B cannot be made arbitrarily small, so the estimate must deviate from θ by a certain amount.
- This motivates a one-number summary of the properties of an estimator, namely how closely the inequality is met.
- Efficiency: e = 1/det(IB). The closer e is to 1, the lower the variance of the estimator.
- For a special class of estimators, B and e can be computed efficiently.
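Continuing the Gaussian sketch, the scalar efficiency e = 1/(I·B) can be estimated by simulation, with I = m for a unit-variance Gaussian m-sample and B the estimator's variance; the sample mean should attain e ≈ 1, whereas the median reaches only about 2/π ≈ 0.64:

```python
import numpy as np

rng = np.random.default_rng(4)
m, trials = 50, 20000
Y = rng.normal(size=(trials, m))

for name, est in [("mean", Y.mean(axis=1)), ("median", np.median(Y, axis=1))]:
    e = 1.0 / (m * est.var())          # scalar efficiency e = 1 / (I * B)
    print(name, "efficiency ~", round(e, 3))
```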

Efficiency
- Compare estimators by their asymptotic variance.
- Maximum likelihood (ML) is asymptotically efficient, i.e. e → 1 as m → ∞.

ML in Reality: Not Perfect
- ML is efficient only asymptotically; for finite sample sizes it is possible to do better than ML estimation.
- Practical considerations, such as the goal of a sparse decomposition, may lead to the choice of a non-optimal loss function.
- We may not know the true density model P(y|θ), which is required to define the ML estimator. We can certainly guess, but a bad guess can lead to large errors.
- Solution: robust estimators.

Robust Estimators
Practical assumptions:
- P(Y) is chosen from a certain class of distributions.
- Training and test data are identically distributed.
Robust estimators safeguard us against the cases where these assumptions do not hold:
- A certain fraction of bad observations (outliers) should not seriously affect the quality of the estimate.
- The influence of individual patterns should be bounded from above.

Robustness via Loss Functions
- Basic idea (Huber): take a loss function as provided by the ML framework and modify it so as to limit the influence of each individual pattern.
- This is achieved by imposing an upper bound on the slope of −ln p(y|θ).
- Examples: trimmed mean or median, ε-insensitive loss function.
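A sketch of Huber's construction: a loss that is quadratic near zero but linear beyond a threshold δ, so its slope, and hence the influence of any single pattern, is bounded. The value of delta is an assumed tuning constant:

```python
def huber_loss(residual, delta=1.0):
    r = abs(residual)
    if r <= delta:
        return 0.5 * r * r            # behaves like squared error near zero
    return delta * (r - 0.5 * delta)  # grows only linearly in |residual|

print(huber_loss(0.5))    # 0.125: quadratic regime
print(huber_loss(10.0))   # 9.5: an outlier's influence stays bounded
```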

Robust Loss Function
[Statement of the robust loss function theorem (Theorem 3.15 of the textbook) lost in transcription.]

Practical Considerations
- Even though a loss function as defined in Theorem 3.15 is generally desirable, we may be less cautious and use a different loss function for improved performance when we have additional knowledge of the distribution.
- Trimmed mean estimator (Remark 3.17): discards a fraction of the data; effectively, all y_i deviating from the mean by more than a threshold are ignored and the mean is recomputed.
- As that fraction tends to 1, we recover the median estimator: all patterns but the median one are discarded.
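A sketch in the spirit of the trimmed mean just described; the exact construction in Remark 3.17 may differ in details such as rounding:

```python
import numpy as np

def trimmed_mean(y, nu):
    """Drop a fraction nu of the most extreme observations, average the rest."""
    y = np.sort(np.asarray(y, dtype=float))
    k = int(len(y) * nu / 2)        # observations dropped on each side
    k = min(k, (len(y) - 1) // 2)   # always keep the middle observation(s)
    return y[k:len(y) - k].mean()

y = [0.9, 1.1, 1.0, 0.8, 1.2, 25.0]  # one gross outlier
print(np.mean(y))                    # 5.0: ruined by the outlier
print(trimmed_mean(y, 0.34))         # 1.05: the outlier is discarded
# As nu -> 1, only the middle observation(s) remain: the median estimator.
```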

Efficiency & the ε-Insensitive Loss Function
- Using the efficiency theorem, the performance of the ε-insensitive loss function can be estimated when it is applied to different types of noise models.
- Gaussian noise: if the underlying noise model is Gaussian with standard deviation σ and the ε-insensitive loss function is used, the most efficient estimator in this family is obtained at ε = 0.612σ.
- More generally, σ has to be known in advance; otherwise, use adaptive loss functions.
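A hedged simulation of this claim: estimate a location parameter under unit-variance Gaussian noise by minimizing the summed ε-insensitive loss over a grid, and compare the estimator's variance for a few values of ε. The grid and sample sizes are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(5)
m, trials = 100, 500
grid = np.linspace(-1.0, 1.0, 401)   # candidate location values

def eps_estimator(y, eps):
    """Location estimate minimizing the summed eps-insensitive loss on the grid."""
    risk = np.array([np.maximum(np.abs(y - t) - eps, 0.0).sum() for t in grid])
    flat = grid[risk <= risk.min() + 1e-9]   # the minimizer can be a plateau
    return flat.mean()

for eps in (0.0, 0.612, 1.5):
    est = [eps_estimator(rng.normal(size=m), eps) for _ in range(trials)]
    print(f"eps = {eps}: variance ~ {np.var(est):.5f}")
# If the slide's claim holds, the variance is smallest near eps = 0.612
# (eps = 0 reduces to the absolute loss, i.e. the median estimator).
```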

Adaptive Loss Functions
- In the ε-insensitive case, ε can be adjusted adaptively by observing how the loss changes under small perturbations of ε.
- Idea: for a given p(y|θ), determine the optimal value of ε by computing the corresponding fraction ν of patterns outside the interval [θ − ε, θ + ε]; ν is found via Theorem 3.21.
- Theorem 3.21: given the type of additive noise, we can determine the value of ν that yields the asymptotically efficient estimator.
- Case study: the polynomial noise model.

Summary
Two complementary views of how risk and loss functions should be designed:
- Data-driven: uses the incurred loss as its principal guideline (empirical risk, expected risk).
- Density-based: estimates the distribution that may have generated the data; ML is conceptually rather similar to the notions of risk and loss.
Estimator performance can be evaluated via the Cramer-Rao theorem.
Loss functions can adapt themselves to the amount of noise and achieve optimal performance; the ε-insensitive loss function is discussed extensively as a case study.