Curve Fitting & Multisensory Integration

Curve Fitting & Multisensory Integration Using Probability Theory Hannah Chen, Rebecca Roseman and Adam Rule COGS 202 4.14.2014

Overheard at Porters: "You're optimizing for the wrong thing!"

What would you say? What makes for a good model? Best performance? Fewest assumptions? Most elegant? Coolest name?

Agenda Motivation Tools Applications Finding Patterns Probability Theory Curve Fitting Multimodal Sensory Integration

The Motivation FINDING PATTERNS

You have data... now find patterns

You have data... now find patterns Unsupervised x = data (training) y(x) = model Clustering Density estimation

You have data... now find patterns Unsupervised x = data (training) y(x) = model Supervised x = data (training) t = target vector y(x) = model Clustering Density estimation Classification Regression

Important Questions What kind of model is appropriate? What makes a model accurate? Can a model be too accurate? What are our prior beliefs about the model?

The Tools PROBABILITY THEORY

Properties of a distribution x = event p(x) = prob. of event 1. p(x) ≥ 0 for every event x 2. The probabilities sum (or, for a density, integrate) to 1 over all events

Rules Sum rule: p(X) = Σ_Y p(X, Y) Product rule: p(X, Y) = p(Y | X) p(X)

Example (Discrete)
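The slide's discrete example is not in the transcript; as a stand-in, here is a minimal, hypothetical illustration of the sum rule, product rule, and Bayes' rule on a made-up joint distribution (all numbers are invented for illustration):

```python
# Hypothetical joint distribution p(X, Y) over two binary events (numbers are made up).
p_joint = {(0, 0): 0.3, (0, 1): 0.1,
           (1, 0): 0.2, (1, 1): 0.4}

# Sum rule: p(X) = sum over Y of p(X, Y)   (marginalization)
p_x = {x: sum(p for (xi, _), p in p_joint.items() if xi == x) for x in (0, 1)}
p_y = {y: sum(p for (_, yi), p in p_joint.items() if yi == y) for y in (0, 1)}

# Product rule: p(X, Y) = p(Y | X) p(X)  =>  p(Y | X) = p(X, Y) / p(X)
p_y_given_x = {(x, y): p_joint[(x, y)] / p_x[x] for (x, y) in p_joint}

# Bayes' rule: p(X | Y) = p(Y | X) p(X) / p(Y)
p_x_given_y = {(x, y): p_y_given_x[(x, y)] * p_x[x] / p_y[y] for (x, y) in p_joint}

print(p_x, p_y_given_x[(0, 1)], p_x_given_y[(0, 1)])
```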

Bayes Rule (review) posterior = (likelihood × prior) / evidence, i.e. p(Y | X) = p(X | Y) p(Y) / p(X)

Probability Density vs. Mass Function
PDF (continuous). Intuition: how much probability f(x) is concentrated near x per length dx, i.e. how dense the probability is near x. Notation: p(x). A density p(x) can be greater than 1, as long as the integral over the entire interval is 1; the probability of a small interval of width dx is p(x) dx.
PMF (discrete). Intuition: probability mass has the same interpretation but from a discrete point of view: here f(x) is the probability of each individual point, whereas for a PDF, f(x) dx gives the probability of an interval of width dx. Notation: P(x).
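To make the "p(x) can be greater than 1" point concrete, here is a small illustrative sketch (the uniform-on-[0, 0.5] density and the toy PMF are assumptions, not the slide's example):

```python
import numpy as np

# A density can exceed 1 as long as it integrates to 1:
# the uniform density on [0, 0.5] is p(x) = 2 on its support.
x = np.linspace(0.0, 0.5, 100_001)
p = np.full_like(x, 2.0)      # density value 2 > 1 everywhere on the support
print(np.trapz(p, x))         # ~1.0: the total probability is still one

# A probability mass function, by contrast, assigns a probability to each point directly:
pmf = {1: 0.2, 2: 0.5, 3: 0.3}
print(sum(pmf.values()))      # exactly 1.0, and each individual value is <= 1
```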

Expectation & Covariance
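The slide's equations are not in the transcript; the standard definitions (as in Bishop, Sec. 1.2.2) are:

```latex
\mathbb{E}[f] = \sum_x p(x)\, f(x) \quad \text{(discrete)}, \qquad
\mathbb{E}[f] = \int p(x)\, f(x)\, \mathrm{d}x \quad \text{(continuous)}

\operatorname{var}[f] = \mathbb{E}\big[ (f(x) - \mathbb{E}[f(x)])^2 \big]

\operatorname{cov}[x, y] = \mathbb{E}_{x,y}\big[ (x - \mathbb{E}[x])(y - \mathbb{E}[y]) \big]
                         = \mathbb{E}_{x,y}[x y] - \mathbb{E}[x]\,\mathbb{E}[y]
```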

Application CURVE FITTING

Curve Fitting Observed Data: Given n points (x, t)

Curve Fitting Observed Data: Given n points (x, t) Textbook Example: generate x uniformly from range [0,1] and calculate target data t with the sin(2πx) function + noise with a Gaussian distribution Why?

Curve Fitting Observed Data: Given n points (x, t) Textbook Example: generate x uniformly from range [0,1] and calculate target data t with the sin(2πx) function + noise with a Gaussian distribution Why? Real data sets typically have an underlying regularity that we are trying to learn.
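The data-generation step is easy to sketch in code; a minimal version of the textbook example might look like this (the sample size n = 10 and noise standard deviation 0.3 are assumptions, not values from the slide):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10                                                      # number of observations (assumed)
x = rng.uniform(0.0, 1.0, size=n)                           # inputs drawn uniformly from [0, 1]
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=n)   # sin(2*pi*x) plus Gaussian noise
```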

Curve Fitting Observed Data: Given n points (x, t) Goal: use observed data to predict new target values t for new values of x

Curve Fitting Observed Data: Given n points (x, t) Can we fit a polynomial function to this data? What values of the coefficients w and order M fit this data well?

How to measure goodness of fit? Minimize an error function: the sum-of-squares error
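A minimal least-squares sketch (not the slide's own code), fitting a degree-M polynomial and evaluating the sum-of-squares error E(w) = ½ Σ_n (y(x_n, w) − t_n)²; the degree M = 3 is an assumption for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=10)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=10)

M = 3                                     # polynomial order (assumed for illustration)
w = np.polyfit(x, t, deg=M)               # least-squares coefficients, highest power first
residuals = np.polyval(w, x) - t
sse = 0.5 * np.sum(residuals ** 2)        # the sum-of-squares error being minimized
print(w, sse)
```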

Overfitting

Combatting Overfitting 1. Increase data points

Combatting Overfitting 1. Increase data points Data observations may have just been noisy With more data, we can see whether data variation is due to noise or is part of the underlying relationship between observations

Combatting Overfitting 1. Increase data points Data observations may have just been noisy With more data, we can see whether data variation is due to noise or is part of the underlying relationship between observations 2. Regularization Introduce a penalty term Trade off between good fit and penalty The hyperparameter λ is an input to the model; increasing it reduces overfitting, in turn reducing variance and increasing bias (the difference between the estimated and true target)
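A minimal sketch of the regularized (ridge) fit, assuming a polynomial basis and the closed-form solution of the penalized least-squares problem; the values M = 9 and λ = e⁻¹⁸ are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=10)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=10)

# Minimize  1/2 * sum_n (y(x_n, w) - t_n)^2  +  lam/2 * ||w||^2,
# whose closed-form solution is  w = (Phi^T Phi + lam * I)^(-1) Phi^T t.
M = 9                                          # deliberately flexible model (assumed)
lam = np.exp(-18)                              # regularization hyperparameter lambda (assumed)
Phi = np.vander(x, M + 1, increasing=True)     # design matrix with columns x^0 ... x^M
w = np.linalg.solve(Phi.T @ Phi + lam * np.eye(M + 1), Phi.T @ t)
print(w)
```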

How to check for overfitting? Split the data into training and validation subsets. Heuristic: # of data points > (some multiple) × (# of parameters). Training vs. Testing: don't touch the test set until you are actually evaluating the experiment!

Cross-validation 1. Use a portion, (S-1)/S, of the data for training (white blocks) 2. Assess performance on the held-out fold (red block) 3. Repeat for each run 4. Average the performance scores 4-fold cross-validation (S=4)
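A minimal S-fold cross-validation sketch, assuming the polynomial model used above (S = 4 and M = 3 are illustrative choices, not values from the slide):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=20)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=20)

S, M = 4, 3
idx = rng.permutation(len(x))
folds = np.array_split(idx, S)
scores = []
for s in range(S):
    val = folds[s]                                                   # held-out fold ("red")
    train = np.concatenate([folds[i] for i in range(S) if i != s])   # remaining (S-1)/S ("white")
    w = np.polyfit(x[train], t[train], deg=M)
    scores.append(np.mean((np.polyval(w, x[val]) - t[val]) ** 2))    # validation error
print(np.mean(scores))                                               # average over the S runs
```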

Cross-validation When to use?

Cross-validation When to use? When the validation set would otherwise be small. If there is very little data, use S = number of observed data points (leave-one-out)

Cross-validation When to use? When the validation set would otherwise be small. If there is very little data, use S = number of observed data points (leave-one-out) Limitations: Computationally expensive - the number of training runs increases by a factor of S A model might have multiple complexity parameters - this may lead to a number of training runs that is exponential in the number of parameters

Alternative Approach: Want an approach that depends only on the training data and requires only a single training run, e.g. the Akaike information criterion (AIC) or the Bayesian information criterion (BIC, Sec 4.4)

Akaike Information Criterion (AIC) Choose the model with the largest value of ln p(D | w_ML) − M, i.e. the best-fit log likelihood minus the number of adjustable model parameters M
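A rough sketch of using AIC in this form to compare polynomial orders, assuming a Gaussian noise model evaluated at its maximum-likelihood variance; the parameter count k = M + 1 (one coefficient per power of x) is a counting convention assumed here, not taken from the slide:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=10)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=10)

def max_gaussian_loglik(residuals):
    """Log likelihood of a Gaussian noise model evaluated at its ML variance."""
    sigma2 = np.mean(residuals ** 2)
    return -0.5 * len(residuals) * (1.0 + np.log(2 * np.pi * sigma2))

for M in range(1, 7):
    w = np.polyfit(x, t, deg=M)
    ll = max_gaussian_loglik(np.polyval(w, x) - t)
    k = M + 1                         # adjustable coefficients (counting convention assumed)
    print(M, ll - k)                  # pick the order with the largest value
```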

Gaussian Distribution This satisfies the two properties for a probability density! (what are they?)
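The formula itself is not in the transcript; the Gaussian density being referred to (Bishop, Sec. 1.2.4) is:

```latex
\mathcal{N}(x \mid \mu, \sigma^2)
  = \frac{1}{\sqrt{2\pi\sigma^2}}
    \exp\!\left\{ -\frac{(x - \mu)^2}{2\sigma^2} \right\}
```

The two properties: it is non-negative everywhere, and it integrates to 1 over x.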

Likelihood Function for Gaussian Assumption: data points x drawn independently from same Gaussian distribution defined by unknown mean and variance parameters, i.e. independently and identically distributed (i.i.d)
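For i.i.d. data x = (x_1, ..., x_N), the likelihood and log likelihood referred to here (Bishop, Sec. 1.2.4) are:

```latex
p(\mathbf{x} \mid \mu, \sigma^2) = \prod_{n=1}^{N} \mathcal{N}(x_n \mid \mu, \sigma^2)

\ln p(\mathbf{x} \mid \mu, \sigma^2)
  = -\frac{1}{2\sigma^2} \sum_{n=1}^{N} (x_n - \mu)^2
    - \frac{N}{2} \ln \sigma^2
    - \frac{N}{2} \ln(2\pi)
```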

Curve Fitting (ver. 1) Assumptions: 1. Given a value of x, the corresponding target value t has a Gaussian distribution with mean y(x, w) 2. Data {x, t} are drawn independently from this distribution, giving the likelihood:
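With noise precision β = 1/σ², the likelihood described on this slide (Bishop, Sec. 1.2.5) is:

```latex
p(\mathbf{t} \mid \mathbf{x}, \mathbf{w}, \beta)
  = \prod_{n=1}^{N} \mathcal{N}\big( t_n \mid y(x_n, \mathbf{w}),\, \beta^{-1} \big)
```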

Maximum Likelihood Estimation (MLE) Log likelihood: What does maximizing the log likelihood look similar to?

Maximum Likelihood Estimation (MLE) Log likelihood: What does maximizing the log likelihood look similar to? Maximizing w.r.t. w is equivalent to minimizing the sum-of-squares error:
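The corresponding log likelihood and maximum-likelihood results (Bishop, Sec. 1.2.5) are:

```latex
\ln p(\mathbf{t} \mid \mathbf{x}, \mathbf{w}, \beta)
  = -\frac{\beta}{2} \sum_{n=1}^{N} \big( y(x_n, \mathbf{w}) - t_n \big)^2
    + \frac{N}{2} \ln \beta - \frac{N}{2} \ln(2\pi)

% Maximizing w.r.t. w leaves only the sum-of-squares term, so w_ML is the least-squares
% solution; maximizing w.r.t. beta then gives
\frac{1}{\beta_{\mathrm{ML}}}
  = \frac{1}{N} \sum_{n=1}^{N} \big( y(x_n, \mathbf{w}_{\mathrm{ML}}) - t_n \big)^2
```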

Maximum Posterior (MAP) Simpler example: use a Gaussian prior over w of the form p(w | α) = N(w | 0, (1/α) I) With Bayes' rule we can calculate the posterior: p(w | x, t, α, β) ∝ p(t | x, w, β) p(w | α)

Maximum Posterior (MAP) Determine w by maximizing the posterior distribution Equivalent to minimizing the negative log of the posterior, combining the Gaussian prior and the log likelihood function from earlier...

Maximum Posterior (MAP) Minimum of the resulting regularized error What does this look like?

Maximum Posterior (MAP) Minimum of the resulting regularized error What does this look like? What is the regularization parameter?
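The quantity being minimized (Bishop, Sec. 1.2.5) is the regularized sum-of-squares error, which answers both questions on the slide:

```latex
\frac{\beta}{2} \sum_{n=1}^{N} \big( y(x_n, \mathbf{w}) - t_n \big)^2
  + \frac{\alpha}{2}\, \mathbf{w}^{\mathsf{T}} \mathbf{w}
% i.e. regularized least squares with regularization parameter lambda = alpha / beta.
```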

Bayesian Curve Fitting However, MLE and MAP are not fully Bayesian because they still rely on point estimates of w

Curve Fitting (ver. 2) Given training data {x, t} and a new point x, predict the target value t Assume the parameters are fixed Evaluate the predictive distribution p(t | x, {x, t})

Bayesian Curve Fitting A fully Bayesian approach requires integrating over all values of w, by applying the sum and product rules of probability

Bayesian Curve Fitting p(t | x, {x, t}) = ∫ p(t | x, w) p(w | {x, t}) dw (marginalization over w) The posterior p(w | {x, t}) is Gaussian and can be evaluated analytically (Sec 3.3)

Bayesian Curve Fitting The predictive distribution is a Gaussian of the form N(t | m(x), s²(x)), with mean m(x) and variance s²(x) defined in terms of a matrix S (see below)
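The mean, variance, and matrix referred to here are not in the transcript; reconstructed from Bishop's treatment of Bayesian curve fitting (Sec. 1.2.6), with φ(x) the vector whose elements are φ_i(x) = x^i for i = 0, ..., M, they are:

```latex
p(t \mid x, \mathbf{x}, \mathbf{t}) = \mathcal{N}\big( t \mid m(x),\, s^2(x) \big)

m(x) = \beta\, \phi(x)^{\mathsf{T}} \mathbf{S} \sum_{n=1}^{N} \phi(x_n)\, t_n

s^2(x) = \beta^{-1} + \phi(x)^{\mathsf{T}} \mathbf{S}\, \phi(x)

\mathbf{S}^{-1} = \alpha \mathbf{I} + \beta \sum_{n=1}^{N} \phi(x_n)\, \phi(x_n)^{\mathsf{T}}
```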

Bayesian Curve Fitting Need to define the matrix S and the basis vector φ(x) Mean and variance depend on x as a result of the marginalization (not the case in MLE/MAP)

Application MULTIMODAL SENSORY INTEGRATION

Two Models Visual Capture [plot: probability vs. location for the visual (v) and auditory (a) cues]

Two Models Visual Capture vs. MLE [plots: probability vs. location for the visual (v) and auditory (a) cues under each model]

Procedure

Final Result [plot: probability vs. location for the visual (v) and auditory (a) cues]

The Math (MLE Model) Likelihood
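The slide's equations are not transcribed; the standard maximum-likelihood cue-combination result assumed here is that, with independent Gaussian visual and auditory likelihoods centered at the single-cue estimates ŝ_v and ŝ_a with variances σ_v² and σ_a², the combined estimate is a reliability-weighted average:

```latex
\hat{s} = w_v \hat{s}_v + w_a \hat{s}_a, \qquad
w_v = \frac{1/\sigma_v^2}{1/\sigma_v^2 + 1/\sigma_a^2}, \quad
w_a = \frac{1/\sigma_a^2}{1/\sigma_v^2 + 1/\sigma_a^2}

\sigma_{\hat{s}}^2 = \frac{\sigma_v^2\, \sigma_a^2}{\sigma_v^2 + \sigma_a^2}
  \;\le\; \min(\sigma_v^2, \sigma_a^2)
```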

The Math (Bayesian Model) Likelihood × Prior (Uniform, Inverse Gamma)

The Math (Empirical) likelihood, logistic function, location estimates, weight constraint

Final Result

QUESTIONS?

Two Models (Prediction) Visual Capture vs. MLE [plots: visual weight vs. visual noise under each model]