# Classification Problems


## 1 Classification

Read Chapter 4 in the text by Bishop, except omit Sections 4.1.6, 4.1.7, 4.2.4, 4.3.3, 4.3.5, 4.3.6, 4.4, and 4.5. Also, review Sections 1.5.1, 1.5.2, 1.5.3, and …

## 2 Classification Problems

Many machine learning applications can be seen as classification problems: given a vector of $D$ inputs describing an item, predict which class the item belongs to. Examples:

- Given anatomical measurements of an animal, predict which species the animal belongs to.
- Given information on the credit history of a customer, predict whether or not they would pay back a loan.
- Given an image of a hand-written digit, predict which digit (0-9) it is.
- Given the proportions of iron, nickel, carbon, etc. in a type of steel, predict whether the steel will rust in the presence of moisture.

We assume that the set of possible classes is known, with labels $C_1, \ldots, C_K$. We have a training set of items in which we know both the inputs and the class, from which we will somehow learn how to do the classification. Once we've learned a classifier, we use it to predict the class of future items, given only the inputs for those items.

## 3 Approaches to Classification

Classification problems can be solved in (at least) three ways:

- Learn how to directly produce a class from the inputs. That is, we learn some function that maps an input vector, $x$, to a class, $C_k$.
- Learn a discriminative model for the probability distribution over classes for given inputs. That is, learn $P(C_k \mid x)$ as a function of $x$. From $P(C_k \mid x)$ and a loss function, we can make the best prediction for the class of an item.
- Learn a generative model for the probability distribution of the inputs for each class. That is, learn $P(x \mid C_k)$ for each class $k$. From this, and the class probabilities, $P(C_k)$, we can find $P(C_k \mid x)$ using Bayes' Rule.

Note that the last option above makes sense only if there is some well-defined distribution of items in a class. This isn't the case for the example of determining whether or not a type of steel will rust.

## 4 Loss Functions and Classification

Learning $P(C_k \mid x)$ allows one to make a prediction for the class in a way that depends on a loss function, which says how costly different kinds of errors are.

We define $L_{kj}$ to be the loss we incur if we predict that an item is in class $C_j$ when it is actually in class $C_k$. We'll assume that losses are non-negative and that $L_{kk} = 0$ for all $k$ (i.e., there's no loss when the prediction is correct). Only the relative values of losses will matter. If all errors are equally bad, we would let $L_{kj}$ be the same for all $k \ne j$.

Example: Giving a loan to someone who doesn't pay it back (class $C_1$) is much more costly than not giving a loan to someone who would pay it back (class $C_0$). So for this application we might define $L_{01} = 1$ and $L_{10} = 10$.

Note that we should define the loss function to account both for monetary consequences (money not repaid, or interest not earned) and other effects that don't have immediate monetary consequences, such as customer dissatisfaction when their loan isn't approved.

## 5 Predicting to Minimize Expected Loss

A basic principle of decision theory is that we should take the action (here, make the prediction) that minimizes the expected loss, according to our probabilistic model.

If we predict that an item with inputs $x$ is in class $C_j$, the expected loss is

$$\sum_{k=1}^{K} L_{kj}\, P(C_k \mid x)$$

We should predict that this item is in the class, $C_j$, for which this expected loss is smallest. (The minimum might not be unique, in which case more than one prediction would be optimal.)

If all errors are equally bad (say, loss of 1), the expected loss when predicting $C_j$ is $1 - P(C_j \mid x)$, so we should predict the class with the highest probability given $x$.

For binary classification ($K = 2$, with classes labelled by 0 and 1), minimizing expected loss is equivalent to predicting that an item is in class 1 if

$$\frac{P(C_1 \mid x)}{P(C_0 \mid x)} \cdot \frac{L_{10}}{L_{01}} > 1$$
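
To make the decision rule concrete, here is a minimal Python sketch (not part of the original notes) that applies it with the loan losses from the previous slide; the posterior probabilities are made-up numbers, not output from any real model.

```python
import numpy as np

# Hypothetical loss matrix for the loan example: L[k, j] is the loss from
# predicting class j when the item is actually in class k.
# Class 1 = customer would not pay back the loan, class 0 = would pay it back.
L = np.array([[0.0,  1.0],    # L_00 = 0,  L_01 = 1
              [10.0, 0.0]])   # L_10 = 10, L_11 = 0

def predict_min_expected_loss(p, L):
    """Return the class j minimizing sum_k L[k, j] * p[k], where p[k] = P(C_k | x)."""
    expected_loss = p @ L      # entry j is sum_k p[k] * L[k, j]
    return int(np.argmin(expected_loss))

# Made-up posterior for one applicant: a 15% chance of default.
p = np.array([0.85, 0.15])
print(predict_min_expected_loss(p, L))   # 1: deny the loan, since 10*0.15 > 1*0.85
```

Note that with these losses the loan is denied even at a 15% default probability, because the loss from a bad loan outweighs the loss from a refused good one.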

## 6 Classification from Generative Models Using Bayes' Rule

In the generative model approach to classification, we learn models from the training data for the probability or probability density of the inputs, $x$, for items in each of the possible classes, $C_k$. That is, we learn models for $P(x \mid C_k)$ for $k = 1, \ldots, K$.

To do classification, we instead need $P(C_k \mid x)$. We can get these conditional class probabilities using Bayes' Rule:

$$P(C_k \mid x) = \frac{P(C_k)\, P(x \mid C_k)}{\sum_{j=1}^{K} P(C_j)\, P(x \mid C_j)}$$

Here, $P(C_k)$ is the prior probability of class $C_k$. We can easily estimate these probabilities by the frequencies of the classes in the training data. Alternatively, we may have good information about $P(C_k)$ from other sources (e.g., census data).

For binary classification, with classes $C_0$ and $C_1$, we get

$$P(C_1 \mid x) = \frac{P(C_1)\, P(x \mid C_1)}{P(C_0)\, P(x \mid C_0) + P(C_1)\, P(x \mid C_1)} = \frac{1}{1 + P(C_0)\, P(x \mid C_0)\, /\, P(C_1)\, P(x \mid C_1)}$$
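
A minimal Python sketch of this computation, with made-up priors and likelihood values; the function name is hypothetical, chosen only for illustration.

```python
import numpy as np

def class_posteriors(priors, likelihoods):
    """Bayes' Rule: given priors[k] = P(C_k) and likelihoods[k] = P(x | C_k)
    for one input x, return the posterior P(C_k | x) for each class."""
    joint = priors * likelihoods    # numerator: P(C_k) P(x | C_k)
    return joint / joint.sum()      # denominator: sum_j P(C_j) P(x | C_j)

# Made-up two-class example: priors from class frequencies in the training
# data, densities from whatever generative model was fit for each class.
print(class_posteriors(np.array([0.7, 0.3]), np.array([0.02, 0.05])))
# -> approximately [0.483, 0.517]
```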

## 7 Naive Bayes Models for Binary Inputs

When the inputs are binary (i.e., $x$ is a vector of 1s and 0s) we can use the following simple generative model:

$$P(x \mid C_k) = \prod_{i=1}^{D} \mu_{ki}^{x_i}\, (1 - \mu_{ki})^{1 - x_i}$$

Here, $\mu_{ki}$ is the estimated probability that input $i$ will have the value 1 in items from class $k$. The maximum likelihood estimate for $\mu_{ki}$ is simply the fraction of training items in class $k$ for which input $i$ is 1.

This is called the naive Bayes model: "Bayes" because we use it with Bayes' Rule to do classification, and "naive" because this model assumes that inputs are independent given the class, which is something a naive person might assume, though it's usually not true.

It's easy to generalize naive Bayes models to discrete inputs with more than two values, and further generalizations (keeping the independence assumption) are also possible.
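
A minimal sketch of these maximum likelihood estimates, assuming the inputs arrive as a 0/1 NumPy array; `fit_naive_bayes` and `log_lik` are hypothetical helper names, not from the notes.

```python
import numpy as np

def fit_naive_bayes(X, y, K):
    """Maximum likelihood fit of the naive Bayes model for binary inputs.
    X: (N, D) array of 0s and 1s; y: (N,) array of class labels in 0..K-1.
    Returns priors[k] = P(C_k) and mu[k, i] = estimated P(x_i = 1 | C_k)."""
    priors = np.array([np.mean(y == k) for k in range(K)])
    # mu[k, i] = fraction of training items in class k with input i equal to 1
    mu = np.array([X[y == k].mean(axis=0) for k in range(K)])
    return priors, mu

def log_lik(x, mu_k):
    """log P(x | C_k) = sum_i [ x_i log mu_ki + (1 - x_i) log(1 - mu_ki) ]."""
    return np.sum(x * np.log(mu_k) + (1 - x) * np.log(1 - mu_k))
```

In practice one might smooth the estimates slightly, since a $\mu_{ki}$ of exactly 0 or 1 makes the log likelihood infinite for some items.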

## 8 Binary Classification Using Naive Bayes Models

When there are two classes ($C_0$ and $C_1$) and binary inputs, applying Bayes' Rule with naive Bayes gives the following probability for $C_1$ given $x$:

$$P(C_1 \mid x) = \frac{P(C_1)\, P(x \mid C_1)}{P(C_0)\, P(x \mid C_0) + P(C_1)\, P(x \mid C_1)} = \frac{1}{1 + P(C_0)\, P(x \mid C_0)\, /\, P(C_1)\, P(x \mid C_1)} = \frac{1}{1 + \exp(-a(x))}$$

where

$$a(x) = \log\frac{P(C_1)\, P(x \mid C_1)}{P(C_0)\, P(x \mid C_0)} = \log\left[\frac{P(C_1)}{P(C_0)} \prod_{i=1}^{D} \left(\frac{\mu_{1i}}{\mu_{0i}}\right)^{x_i} \left(\frac{1 - \mu_{1i}}{1 - \mu_{0i}}\right)^{1 - x_i}\right]$$

$$= \log\left[\frac{P(C_1)}{P(C_0)} \prod_{i=1}^{D} \frac{1 - \mu_{1i}}{1 - \mu_{0i}}\right] + \sum_{i=1}^{D} x_i \log\frac{\mu_{1i}/(1 - \mu_{1i})}{\mu_{0i}/(1 - \mu_{0i})}$$
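
The final expression shows that $a(x)$ is linear in $x$, so the intercept and weights can be precomputed once from the fitted parameters. A sketch, using the parameter layout of the hypothetical `fit_naive_bayes` above:

```python
import numpy as np

def naive_bayes_weights(priors, mu):
    """Intercept w0 and weights w such that a(x) = w0 + x @ w, following
    the derivation above (two classes, binary inputs)."""
    w0 = np.log(priors[1] / priors[0]) + np.sum(np.log((1 - mu[1]) / (1 - mu[0])))
    w = np.log((mu[1] / (1 - mu[1])) / (mu[0] / (1 - mu[0])))
    return w0, w

def prob_class1(x, w0, w):
    """P(C_1 | x) = 1 / (1 + exp(-a(x)))."""
    return 1.0 / (1.0 + np.exp(-(w0 + x @ w)))
```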

## 9 Gaussian Generative Models

When the inputs are real-valued, a Gaussian model for the distribution of inputs in each class may be appropriate. If we also assume that the covariance matrix is the same for all classes, the class probabilities for binary classification turn out to depend on a linear function of the inputs.

For this model,

$$P(x \mid C_k) = (2\pi)^{-D/2}\, |\Sigma|^{-1/2} \exp\!\left(-(x - \mu_k)^T \Sigma^{-1} (x - \mu_k)\, /\, 2\right)$$

where $\mu_k$ is an estimate of the mean vector for class $C_k$, and $\Sigma$ is an estimate for the covariance matrix (the same for all classes).

As shown in Bishop's book, the maximum likelihood estimate for $\mu_k$ is the sample mean of the input vectors for items in class $C_k$, and the maximum likelihood estimate for $\Sigma$ is

$$\sum_{k=1}^{K} \frac{N_k}{N}\, S_k$$

where $N_k$ is the number of training items in class $C_k$, and $S_k$ is the usual maximum likelihood estimate for the covariance matrix in class $C_k$.
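
A minimal sketch of these estimates (the function name is hypothetical, and nothing here guards against a singular estimate of $\Sigma$):

```python
import numpy as np

def fit_gaussian_classes(X, y, K):
    """Maximum likelihood estimates for the shared-covariance Gaussian model.
    X: (N, D) array of real-valued inputs; y: (N,) class labels in 0..K-1.
    Returns priors, class means (K, D), and the pooled covariance Sigma."""
    N, D = X.shape
    priors = np.array([np.mean(y == k) for k in range(K)])
    means = np.array([X[y == k].mean(axis=0) for k in range(K)])
    Sigma = np.zeros((D, D))
    for k in range(K):
        centred = X[y == k] - means[k]
        Sigma += centred.T @ centred / N   # accumulates (N_k / N) * S_k
    return priors, means, Sigma
```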

## 10 Classification Using Gaussian Models for Each Class

For binary classification, we can now apply Bayes' Rule to get the probability of class 1 from a Gaussian model with the same covariance matrix in each class. As for the naive Bayes model:

$$P(C_1 \mid x) = \frac{1}{1 + \exp(-a(x))} \quad \text{where} \quad a(x) = \log\frac{P(C_1)\, P(x \mid C_1)}{P(C_0)\, P(x \mid C_0)}$$

Substituting the Gaussian densities, we get

$$a(x) = \log\frac{P(C_1)}{P(C_0)} + \log\frac{\exp\!\left(-(x - \mu_1)^T \Sigma^{-1} (x - \mu_1)\, /\, 2\right)}{\exp\!\left(-(x - \mu_0)^T \Sigma^{-1} (x - \mu_0)\, /\, 2\right)}$$

$$= \log\frac{P(C_1)}{P(C_0)} + \frac{1}{2}\left(\mu_0^T \Sigma^{-1} \mu_0 - \mu_1^T \Sigma^{-1} \mu_1\right) + x^T \Sigma^{-1} (\mu_1 - \mu_0)$$

The quadratic terms of the form $x^T \Sigma^{-1} x\, /\, 2$ cancel, producing a linear function of the inputs, as was also the case for naive Bayes models.
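
As with naive Bayes, $a(x)$ is linear, so the intercept and weights can be computed once from the fitted model. A sketch, assuming parameters in the form returned by the hypothetical `fit_gaussian_classes` above:

```python
import numpy as np

def gaussian_weights(priors, means, Sigma):
    """Intercept w0 and weights w with a(x) = w0 + x @ w, for two Gaussian
    classes sharing the covariance Sigma, following the derivation above."""
    Sinv = np.linalg.inv(Sigma)
    w = Sinv @ (means[1] - means[0])
    w0 = (np.log(priors[1] / priors[0])
          + 0.5 * (means[0] @ Sinv @ means[0] - means[1] @ Sinv @ means[1]))
    return w0, w
```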

## 11 Logistic Regression

We see that binary classification using either naive Bayes or Gaussian generative models leads to the probability of class $C_1$ given inputs $x$ having the form

$$P(C_1 \mid x) = \frac{1}{1 + \exp(-a(x))}$$

where $a(x)$ is a linear function of $x$, which can be written as $a(x) = w_0 + x^T w$.

Rather than start with a generative model, however, we could simply start with this formula, and estimate $w_0$ and $w$ from the training data. Maximum likelihood estimation for $w_0$ and $w$ is not hard, though there is no explicit formula. This is a discriminative training procedure that estimates $P(C_k \mid x)$ without estimating $P(x \mid C_k)$ for each class.
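
Since there is no explicit formula, the maximum likelihood estimates are found iteratively. A minimal sketch using plain gradient ascent on the log likelihood; the step size and iteration count are arbitrary illustrative choices, and faster methods (such as iteratively reweighted least squares) exist.

```python
import numpy as np

def fit_logistic(X, y, step=0.1, n_iter=5000):
    """Maximum likelihood logistic regression by gradient ascent.
    X: (N, D) array of inputs; y: (N,) array of 0/1 class labels."""
    N, D = X.shape
    w0, w = 0.0, np.zeros(D)
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(w0 + X @ w)))   # current P(C_1 | x_i)
        w0 += step * np.sum(y - p) / N            # log-likelihood gradient, w0
        w += step * (X.T @ (y - p)) / N           # log-likelihood gradient, w
    return w0, w
```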

## 12 Which is Better: Generative or Discriminative?

Even though logistic regression uses the same formula for the probability of $C_1$ given $x$ as was derived for the earlier generative models, maximum likelihood logistic regression does not in general give the same values for $w_0$ and $w$ as would be found with maximum likelihood estimation for the generative model.

So which gives better results? It depends...

If the generative model accurately represents the distribution of inputs for each class, it should give better results than discriminative training, since it effectively has more information to use when estimating parameters. (The discussion of this in Bishop's book, bottom of page 205, is misleading.)

However, if the generative model is not a good match for the actual distributions, using it might produce very bad results, even when logistic regression would work well. The independence assumption for naive Bayes and the equal covariance assumption for Gaussian models are often rather dubious. Similarly, logistic regression may be less sensitive to outliers than a Gaussian generative model.

## 13 Non-linear Logistic Models

As we saw earlier for neural networks, the form

$$P(C_1 \mid x) = \frac{1}{1 + \exp(-a(x))}$$

for the probability of class $C_1$ given $x$ can be used with $a(x)$ being a non-linear function of $x$.

If $a(x)$ can equally well be any non-linear function, the choice of this form for $P(C_1 \mid x)$ doesn't really matter, since any function $P(C_1 \mid x)$ could be obtained with an appropriate $a(x)$. However, in practice, non-linear models are biased towards some non-linear functions more than others, so it does matter that a logistic model is being used, though not as much as when $a(x)$ must be linear.

## 14 Probit Models

An alternative to logistic models is the probit model, in which we let

$$P(C_1 \mid x) = \Phi(a(x))$$

where $\Phi$ is the cumulative distribution function of the standard normal distribution:

$$\Phi(a) = \int_{-\infty}^{a} (2\pi)^{-1/2} \exp(-x^2/2)\, dx$$

This would be the right model if the class depended on the sign of $a(x)$ plus a standard normal random variable. But this isn't a reasonable model for most applications. It might still be useful, though.
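
For completeness, a one-line sketch of this model with a linear $a(x)$, using SciPy's standard normal CDF; the weights themselves would still have to be estimated by maximum likelihood, which is not shown here.

```python
import numpy as np
from scipy.stats import norm

def probit_prob_class1(x, w0, w):
    """Probit model with linear a(x): P(C_1 | x) = Phi(w0 + x @ w)."""
    return norm.cdf(w0 + x @ w)   # Phi is the standard normal CDF
```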
