CSCI567 Machine Learning (Fall 2014)


1 CSCI567 Machine Learning (Fall 2014). Drs. Sha & Liu ({feisha,yanliu.cs}@usc.edu). September 22, 2014.

2 Outline
1 Administration
2 Logistic Regression - continued
3 Multiclass classification

3 Administration
A few announcements: Homework 1 is due 9/24 (see the homework sheets for detailed submission information). Revised lecture slides are on Blackboard and the course website.

4 Outline
1 Administration
2 Logistic Regression - continued: logistic regression, numerical optimization, gradient descent, gradient descent for logistic regression, Newton method
3 Multiclass classification

5 Logistic regression: setup for two classes
Input: x ∈ R^D. Output: y ∈ {0, 1}. Training data: D = {(x_n, y_n), n = 1, 2, ..., N}.
Model of the conditional distribution:
p(y = 1 | x; b, w) = σ[g(x)], where g(x) = b + Σ_d w_d x_d = b + w^T x
Linear decision boundary: g(x) = b + w^T x = 0

6 Logistic regression: maximum likelihood estimation
Cross-entropy error (negative log-likelihood):
E(b, w) = -Σ_n { y_n log σ(b + w^T x_n) + (1 - y_n) log[1 - σ(b + w^T x_n)] }
Numerical optimization: gradient descent is simple and scalable to large-scale problems; the Newton method is fast but not scalable.
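As a concrete reference, here is a minimal NumPy sketch of the sigmoid and this cross-entropy error (a sketch only; the function names are illustrative and not part of the course materials):

```python
import numpy as np

def sigmoid(a):
    # Logistic function: sigma(a) = 1 / (1 + exp(-a))
    return 1.0 / (1.0 + np.exp(-a))

def cross_entropy(b, w, X, y):
    # Negative log-likelihood E(b, w); X holds one example x_n per row, y holds labels in {0, 1}
    p = sigmoid(b + X @ w)                       # p_n = sigma(b + w^T x_n)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
```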

7 Logistic regression: shorthand notation
For convenience, append 1 to x: x ← [1 x_1 x_2 ... x_D]^T, and prepend b to w: w ← [b w_1 w_2 ... w_D]^T.
The cross-entropy error is then
E(w) = -Σ_n { y_n log σ(w^T x_n) + (1 - y_n) log[1 - σ(w^T x_n)] }
NB: for cosmetic reasons, we do not use special symbols for the augmented x and w (as several textbooks do).

8 Numerical optimization: how do we find the optimal parameters for logistic regression?
We minimize the error function
E(w) = -Σ_n { y_n log σ(w^T x_n) + (1 - y_n) log[1 - σ(w^T x_n)] }
However, this function is complex and does not admit the kind of simple closed-form solution we found for Naive Bayes, so we need numerical methods. Numerical methods are messier than clean analytic solutions; in practice, we often have to tune a few optimization parameters, so patience is necessary.

9 An overview of numerical methods
We describe two. Gradient descent (our focus in lecture) is simple and especially effective for large-scale problems. The Newton method is a classical and powerful method. Gradient descent is often referred to as a first-order method, as it only requires computing the gradients (i.e., the first-order derivatives) of the function. In contrast, the Newton method is often referred to as a second-order method.

10 Gradient descent: an example
Minimize f(θ) = 0.5 (θ_1^2 - θ_2)^2 + 0.5 (θ_1 - 1)^2. We compute the gradients
∂f/∂θ_1 = 2 (θ_1^2 - θ_2) θ_1 + θ_1 - 1   (1)
∂f/∂θ_2 = -(θ_1^2 - θ_2)   (2)
Use the following iterative procedure for gradient descent:
1 Initialize θ_1^(0) and θ_2^(0), and set t = 0.
2 do
θ_1^(t+1) ← θ_1^(t) - η [ 2 ((θ_1^(t))^2 - θ_2^(t)) θ_1^(t) + θ_1^(t) - 1 ]   (3)
θ_2^(t+1) ← θ_2^(t) + η ((θ_1^(t))^2 - θ_2^(t))   (4)
t ← t + 1   (5)
3 until f(θ^(t)) does not change much
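A short NumPy sketch of this toy example (the step size, starting point, and stopping tolerance below are illustrative choices, not values from the lecture):

```python
import numpy as np

def f(theta):
    t1, t2 = theta
    return 0.5 * (t1**2 - t2)**2 + 0.5 * (t1 - 1)**2

def grad_f(theta):
    # Gradients (1)-(2)
    t1, t2 = theta
    return np.array([2 * (t1**2 - t2) * t1 + t1 - 1,
                     -(t1**2 - t2)])

theta = np.array([0.0, 0.0])      # initialization theta^(0)
eta = 0.1                         # step size
prev = f(theta)
for t in range(10000):
    theta = theta - eta * grad_f(theta)     # updates (3)-(4)
    if abs(f(theta) - prev) < 1e-10:        # "until f does not change much"
        break
    prev = f(theta)
print(theta)   # approaches the minimizer (1, 1)
```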

11 Gradient descent: general form
For minimizing f(θ):
θ^(t+1) ← θ^(t) - η ∇f(θ^(t))
Remarks: η is often called the step size; literally, it controls how far the update moves along the direction of the negative gradient. Note that this is for minimizing a function, hence the subtraction (-η). With a suitable choice of η, the iterative procedure converges to a stationary point where ∇f(θ) = 0. Being a stationary point is only a necessary condition for being a minimum.

12-13 Seeing gradient descent in action
Choosing the right η is important: a small η is too slow, while a large η is too unstable.

14-16 Gradient descent for logistic regression: how do we do this for logistic regression?
Simple fact: the derivative of σ(a) is
dσ(a)/da = d/da [ 1 / (1 + e^(-a)) ] = e^(-a) / (1 + e^(-a))^2 = σ(a) [1 - σ(a)]
and therefore
d log σ(a)/da = 1 - σ(a)
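A quick numerical sanity check of these identities (a minimal sketch; the test point and tolerance are arbitrary):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

a, h = 0.7, 1e-6
finite_diff = (sigmoid(a + h) - sigmoid(a - h)) / (2 * h)   # numerical d sigma / da
analytic = sigmoid(a) * (1 - sigmoid(a))                    # sigma(a) [1 - sigma(a)]
assert abs(finite_diff - analytic) < 1e-8
```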

17 Gradients of the cross-entropy error function
∇E(w) = Σ_n { -y_n [1 - σ(w^T x_n)] x_n + (1 - y_n) σ(w^T x_n) x_n }   (6)
      = Σ_n { σ(w^T x_n) - y_n } x_n   (7)
Remarks: e_n = σ(w^T x_n) - y_n is called the error for the nth training sample. At a stationary point (in this case, the optimum):
Σ_n σ(w^T x_n) x_n = Σ_n y_n x_n
Intuition: on average, the error is zero.

18-19 Numerical optimization: gradient descent for logistic regression
Choose a proper step size η > 0, then iteratively update the parameters following the negative gradient to minimize the error function:
w^(t+1) ← w^(t) - η Σ_n { σ(w^T x_n) - y_n } x_n
Remarks: The step size needs to be chosen carefully to ensure convergence. The step size can also be adaptive (i.e., varying from iteration to iteration); for example, we can use techniques such as line search. There is a variant called stochastic gradient descent, also popularly used (covered later in this semester).
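Putting the pieces together, here is a minimal batch gradient descent sketch for logistic regression in NumPy, assuming the augmented notation with the bias folded into w (the step size and iteration count are illustrative; in practice one would monitor E(w) and stop when it no longer decreases):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def fit_logistic(X, y, eta=0.1, num_iters=1000):
    # X: N x D matrix whose first column is all ones (bias absorbed into w)
    # y: length-N vector of labels in {0, 1}
    w = np.zeros(X.shape[1])
    for _ in range(num_iters):
        errors = sigmoid(X @ w) - y        # e_n = sigma(w^T x_n) - y_n
        w = w - eta * (X.T @ errors)       # w <- w - eta * sum_n e_n x_n
    return w
```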

20 Intuition for the Newton method
Approximate the true function with an easy-to-solve optimization problem.
[Figure: the true function f(x) and its quadratic approximation f_quad(x) around the current iterate x_k; minimizing the approximation gives the step to x_k + d_k.]

21 Newton method: approximation
Second-order Taylor expansion of the cross-entropy function:
E(w) ≈ E(w^(t)) + (w - w^(t))^T ∇E(w^(t)) + (1/2) (w - w^(t))^T H^(t) (w - w^(t))
where ∇E(w^(t)) is the gradient and H^(t) is the Hessian matrix, both evaluated at w^(t).
Example with a scalar function:
sin(θ) ≈ sin(0) + θ cos(0) + (1/2) θ^2 [-sin(0)] = θ
using d sin(θ)/dθ = cos(θ) and H = d cos(θ)/dθ = -sin(θ).

22 So what is the Hessian matrix?
It is the matrix of second-order derivatives:
H = ∂²E(w) / (∂w ∂w^T), in other words, H_ij = ∂/∂w_j ( ∂E(w)/∂w_i )
So the Hessian matrix is in R^(D×D), where w ∈ R^D.

23-24 Newton method: optimizing the approximation
Minimize the approximation
E(w) ≈ E(w^(t)) + (w - w^(t))^T ∇E(w^(t)) + (1/2) (w - w^(t))^T H^(t) (w - w^(t))
and use the solution as the new estimate of the parameters:
w^(t+1) ← argmin_w [ (w - w^(t))^T ∇E(w^(t)) + (1/2) (w - w^(t))^T H^(t) (w - w^(t)) ]
The quadratic minimization has a closed form, so we obtain
w^(t+1) ← w^(t) - (H^(t))^(-1) ∇E(w^(t))
i.e., the Newton method.
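A minimal sketch of this update for a generic smooth function, assuming callables for the gradient and Hessian are available (the names and iteration count are illustrative):

```python
import numpy as np

def newton_method(grad, hess, w0, num_iters=20):
    # Newton update: w <- w - H(w)^{-1} grad(w)
    w = np.array(w0, dtype=float)
    for _ in range(num_iters):
        step = np.linalg.solve(hess(w), grad(w))   # solve H d = grad E rather than inverting H
        w = w - step
    return w
```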

25 Contrasting gradient descent and the Newton method
Similarity: both are iterative procedures.
Differences: the Newton method requires second-order derivatives, and the Newton method does not have the magic η to be set.

26-27 Other important facts about the Hessian
Our cross-entropy error function is convex.
∇E(w) = Σ_n { σ(w^T x_n) - y_n } x_n   (8)
H = ∂²E(w) / (∂w ∂w^T) = homework   (9)
For any vector v, v^T H v = homework ≥ 0. Thus H is positive definite, and the cross-entropy error function is convex, with only one global optimum.
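Without giving away the homework derivation, the claim can be checked numerically: build a finite-difference Hessian of E(w) on random data and test the quadratic form for a random v (a rough sketch; the data, step size h, and tolerance are arbitrary):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def E(w, X, y):
    p = sigmoid(X @ w)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = rng.integers(0, 2, size=50).astype(float)
w = rng.normal(size=3)
h, I = 1e-5, np.eye(3)
# Forward-difference Hessian: H_ij ~ [E(w + h e_i + h e_j) - E(w + h e_i) - E(w + h e_j) + E(w)] / h^2
H = np.array([[(E(w + h * I[i] + h * I[j], X, y) - E(w + h * I[i], X, y)
                - E(w + h * I[j], X, y) + E(w, X, y)) / h**2
               for j in range(3)] for i in range(3)])
v = rng.normal(size=3)
print(v @ H @ v >= -1e-4)   # True: the quadratic form is (numerically) nonnegative
```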

28 What is good about the Newton method?
It is fast! Suppose we want to minimize f(x) = x^2 + 2x and our current estimate is x^(t). What is the next estimate?
x^(t+1) ← x^(t) - [f''(x^(t))]^(-1) f'(x^(t)) = x^(t) - (1/2)(2 x^(t) + 2) = -1
Namely, the next step (of iteration) immediately gives us the global optimum! (In optimization, this is called a superlinear convergence rate.) In general, the better our approximation, the faster the Newton method is at solving our optimization problem.
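A tiny sketch illustrating this one-step behaviour on the quadratic above (values hard-coded from the example):

```python
def f_prime(x):
    return 2 * x + 2    # f'(x) for f(x) = x^2 + 2x

def f_double_prime(x):
    return 2.0          # f''(x) is constant for this quadratic

x = 5.0                                        # any starting point
x_next = x - f_prime(x) / f_double_prime(x)    # Newton step
print(x_next)                                  # -1.0, the global minimizer, after a single step
```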

29 What is bad about the Newton method?
It is not scalable! Computing and inverting the Hessian matrix can be very expensive for large-scale problems where the dimensionality D is very large. The Newton method also does not guarantee convergence if the starting point is far away from the optimum.
NB: there are fixes and alternatives, such as quasi-Newton (quasi-second-order) methods.

30 Outline
1 Administration
2 Logistic Regression - continued
3 Multiclass classification: use binary classifiers as building blocks; multinomial logistic regression

31 Multiclass classification: setup
Suppose we need to predict multiple classes/outcomes C_1, C_2, ..., C_K. Examples: weather prediction (sunny, cloudy, raining, etc.); optical character recognition (10 digits + 26 characters in lower and upper cases + special characters, etc.).
Methods studied so far: nearest neighbor classifier, Naive Bayes, Gaussian discriminant analysis, logistic regression.

32-35 Use binary classifiers as building blocks: logistic regression for predicting multiple classes? Easy.
The approach of one versus the rest. For each class C_k, change the problem into binary classification:
1 Relabel training data with label C_k as positive (or "1").
2 Relabel all the rest of the data as negative (or "0").
This step is often called 1-of-K encoding: only one entry is nonzero and everything else is zero. Example: for class C_2, the data go through the following change:
(x_1, C_1) → (x_1, 0), (x_2, C_3) → (x_2, 0), ..., (x_n, C_2) → (x_n, 1), ...
Train K binary classifiers using logistic regression, each differentiating one class from the rest. When predicting on x, combine the outputs of all binary classifiers:
1 What if all the classifiers say negative?
2 What if multiple classifiers say positive?
Take-home exercise: there are different combination strategies. Can you think of any? (One concrete choice is sketched below.)
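As one concrete illustration (not necessarily the strategy the exercise has in mind), here is a sketch of one versus the rest built on the fit_logistic and sigmoid routines sketched earlier; ties are resolved by taking the class with the largest predicted probability:

```python
import numpy as np
# assumes sigmoid and fit_logistic from the earlier gradient descent sketch are in scope

def fit_one_vs_rest(X, y, num_classes, **kwargs):
    # One binary logistic regression per class: class k (positive) vs. the rest (negative)
    return [fit_logistic(X, (y == k).astype(float), **kwargs) for k in range(num_classes)]

def predict_one_vs_rest(ws, X):
    # Score each class with sigma(w_k^T x) and pick the largest
    scores = np.column_stack([sigmoid(X @ w) for w in ws])
    return np.argmax(scores, axis=1)
```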

36-38 Yet another easy approach: one versus one
For each pair of classes C_k and C_k', change the problem into binary classification:
1 Relabel training data with label C_k as positive (or "1").
2 Relabel training data with label C_k' as negative (or "0").
3 Disregard all other data.
Example: for classes C_1 and C_2,
(x_1, C_1), (x_2, C_3), (x_3, C_2), ... → (x_1, 1), (x_3, 0), ...
Train K(K-1)/2 binary classifiers using logistic regression, one per pair of classes. When predicting on x, combine the outputs of all binary classifiers; there are K(K-1)/2 votes!
Take-home exercise: can you think of any good combination strategies? (A simple majority-vote version is sketched below.)
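A matching sketch for one versus one, again reusing fit_logistic; the K(K-1)/2 outputs are combined here by simple majority voting, which is just one of several reasonable strategies:

```python
import numpy as np
from itertools import combinations
# assumes sigmoid and fit_logistic from the earlier gradient descent sketch are in scope

def fit_one_vs_one(X, y, num_classes, **kwargs):
    # One binary classifier per pair (k, l), trained only on data from those two classes
    models = {}
    for k, l in combinations(range(num_classes), 2):
        mask = (y == k) | (y == l)
        models[(k, l)] = fit_logistic(X[mask], (y[mask] == k).astype(float), **kwargs)
    return models

def predict_one_vs_one(models, X, num_classes):
    votes = np.zeros((X.shape[0], num_classes))
    for (k, l), w in models.items():
        picks_k = sigmoid(X @ w) > 0.5   # True where the pairwise classifier votes for class k
        votes[:, k] += picks_k
        votes[:, l] += ~picks_k
    return np.argmax(votes, axis=1)      # class with the most pairwise votes
```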

39-41 Contrasting the two approaches
Pros and cons of each approach:
One versus the rest only needs to train K classifiers, which makes a huge difference if you have a lot of classes to go through. Can you think of a good application example where there are a lot of classes?
One versus one only needs to train each classifier on a smaller subset of the data (only the data labeled with the two classes involved), which makes a huge difference if you have a lot of data to go through.
What is bad about both of them: combining the classifiers' outputs seems to be a bit tricky. Any other good methods?

42 Multinomial logistic regression
Intuition: from the decision rule of our Naive Bayes classifier,
y = argmax_c p(y = c | x) = argmax_c log p(x | y = c) p(y = c)   (10)
  = argmax_c [ log π_c + Σ_k z_k log θ_ck ] = argmax_c w_c^T x   (11)
Essentially, we are comparing
w_1^T x, w_2^T x, ..., w_C^T x   (12)
one for each category.

43-44 First try
So, can we define the following conditional model?
p(y = c | x) = σ[w_c^T x]
This would not work, at least for the reason that
Σ_c p(y = c | x) = Σ_c σ[w_c^T x] ≠ 1
as each summand can (independently) be any number between 0 and 1. But we are close.

45-46 Definition of multinomial logistic regression
Model: for each class C_k, we have a parameter vector w_k and model the posterior probability as
p(C_k | x) = exp(w_k^T x) / Σ_k' exp(w_k'^T x)
This is called the softmax function.
Decision boundary: assign x the label with the maximum posterior probability,
argmax_k P(C_k | x), which is equivalent to argmax_k w_k^T x
Note: the notation has changed to denote the classes as C_k instead of just c.
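A minimal sketch of the softmax posterior and the corresponding decision rule (the max-subtraction is a standard numerical-stability trick, not something from the slides):

```python
import numpy as np

def softmax_posterior(W, x):
    # W: K x D matrix whose k-th row is w_k; returns p(C_k | x) for k = 1, ..., K
    scores = W @ x
    scores = scores - np.max(scores)    # subtract the max for numerical stability
    exp_scores = np.exp(scores)
    return exp_scores / np.sum(exp_scores)

def predict(W, x):
    # argmax_k p(C_k | x), which is equivalent to argmax_k w_k^T x
    return int(np.argmax(W @ x))
```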
