# CSCI567 Machine Learning (Fall 2014)

Save this PDF as:

Size: px
Start display at page:

## Transcription

1 CSCI567 Machine Learning (Fall 2014) Drs. Sha & Liu September 22, 2014 Drs. Sha & Liu CSCI567 Machine Learning (Fall 2014) September 22, / 31

2 Outline Administration 1 Administration 2 Logistic Regression - continued 3 Multiclass classification Drs. Sha & Liu CSCI567 Machine Learning (Fall 2014) September 22, / 31

3 Administration A few announcements Homework 1: due 9/24 (see the homework sheets for detailed submission information) Revised lecture slides are on Blackboard and the course website Drs. Sha & Liu CSCI567 Machine Learning (Fall 2014) September 22, / 31

4 Outline Logistic Regression - continued 1 Administration 2 Logistic Regression - continued Logistic regression Numerical optimization Gradient descent Gradient descent for logistic regression Newton method 3 Multiclass classification Drs. Sha & Liu CSCI567 Machine Learning (Fall 2014) September 22, / 31

5 Logistic Regression - continued Logistic regression Logistic classification Setup for two classes Input: x R D Output: y {0, 1} Training data: D = {(x n, y n ), n = 1, 2,..., N} Model of conditional distribution p(y = 1 x; b, w) = σ[g(x)] where g(x) = b + d w d x d = b + w T x Linear decision boundary g(x) = b + w T x = 0 Drs. Sha & Liu CSCI567 Machine Learning (Fall 2014) September 22, / 31

6 Logistic Regression - continued Logistic regression Maximum likelihood estimation Cross-entropy error (negative log-likelihood) E(b, w) = n {y n log σ(b + w T x n ) + (1 y n ) log[1 σ(b + w T x n )]} Numerical optimization Gradient descent: simple, scalable to large-scale problems Newton method: fast but not scalable Drs. Sha & Liu CSCI567 Machine Learning (Fall 2014) September 22, / 31

7 Logistic Regression - continued Logistic regression Shorthand notation This is for convenience Append 1 to x x [1 x 1 x 2 x D ] Append b to w w [b w 1 w 2 w D ] Cross-entropy is then E(w) = n {y n log σ(w T x n ) + (1 y n ) log[1 σ(w T x n )]} NB. We are not using the x and w (as in several textbooks) for cosmetic reasons. Drs. Sha & Liu CSCI567 Machine Learning (Fall 2014) September 22, / 31

8 Logistic Regression - continued Numerical optimization How to find the optimal parameters for logistic regression? We will minimize the error function E(w) = n {y n log σ(w T x n ) + (1 y n ) log[1 σ(w T x n )]} However, this function is complex and we cannot find the simple solution as we did in Naive Bayes. So we need to use numerical methods. Numerical methods are messier, in contrast to cleaner analytic solutions. In practice, we often have to tune a few optimization parameters patience is necessary. Drs. Sha & Liu CSCI567 Machine Learning (Fall 2014) September 22, / 31

9 Logistic Regression - continued Numerical optimization An overview of numerical methods We describe two Gradient descent (our focus in lecture): simple, especially effective for large-scale problems Newton method: classical and powerful method Gradient descent is often referred to as an first-order method as it requires only to compute the gradients (i.e., the first-order derivative) of the function. In contrast, Newton method is often referred as to an second-order method. Drs. Sha & Liu CSCI567 Machine Learning (Fall 2014) September 22, / 31

10 Logistic Regression - continued Gradient descent Example: min f(θ) = 0.5(θ 2 1 θ 2 ) (θ 1 1) 2 We compute the gradients f = 2(θ1 2 θ 2 )θ 1 + θ 1 1 θ 1 (1) f = (θ1 2 θ 2 ) θ 2 (2) Use the following iterative procedure for gradient descent 1 Initialize θ (0) 1 and θ (0) 2, and t = 0 2 do [ θ (t+1) 1 θ (t) 1 η 2(θ (t) 2 ] (t) 1 θ 2 )θ(t) 1 + θ (t) 1 1 [ θ (t+1) 2 θ (t) 2 η (θ (t) 2 ] (t) 1 θ 2 ) 3 until f(θ (t) ) does not change much (3) (4) t t + 1 (5) Drs. Sha & Liu CSCI567 Machine Learning (Fall 2014) September 22, / 31

11 Logistic Regression - continued Gradient descent Gradient descent General form for minimizing f(θ) Remarks θ t+1 θ η f θ η is often called step size literally, how far our update will go along the the direction of the negative gradient Note that this is for minimizing a function, hence the subtraction ( η) With a suitable choice of η, the iterative procedure converges to a stationary point where f θ = 0 A stationary point is only necessary for being the minimum. Drs. Sha & Liu CSCI567 Machine Learning (Fall 2014) September 22, / 31

12 Seeing in action Logistic Regression - continued Gradient descent Choose the right η is important small η is too slow? Drs. Sha & Liu CSCI567 Machine Learning (Fall 2014) September 22, / 31

13 Seeing in action Logistic Regression - continued Gradient descent Choose the right η is important small η is too slow? large η is too unstable? Drs. Sha & Liu CSCI567 Machine Learning (Fall 2014) September 22, / 31

14 Logistic Regression - continued Gradient descent for logistic regression How do we do this for logistic regression? Simple fact: derivatives of σ(a) d σ(a) d a = d ( ) 1 d a 1 + e a = (1 + e a ) (1 + e a ) 2 Drs. Sha & Liu CSCI567 Machine Learning (Fall 2014) September 22, / 31

15 Logistic Regression - continued Gradient descent for logistic regression How do we do this for logistic regression? Simple fact: derivatives of σ(a) d σ(a) d a = d ( 1 d a 1 + e a = ) = (1 + e a ) e a (1 + e a ) 2 = e a (1 + e a ) ( e a ) Drs. Sha & Liu CSCI567 Machine Learning (Fall 2014) September 22, / 31

16 Logistic Regression - continued Gradient descent for logistic regression How do we do this for logistic regression? Simple fact: derivatives of σ(a) d σ(a) d a d log σ(a) d a = d ( 1 d a 1 + e a = ) = (1 + e a ) e a (1 + e a ) 2 = e a = σ(a)[1 σ(a)] = 1 σ(a) (1 + e a ) ( e a ) Drs. Sha & Liu CSCI567 Machine Learning (Fall 2014) September 22, / 31

17 Logistic Regression - continued Gradient descent for logistic regression Gradients of the cross-entropy error function Gradients E(w) w = n { yn [1 σ(w T x n )]x n (1 y n )σ(w T x n )x n } (6) = n { σ(w T x n ) y n } xn (7) Remarks e n = { σ(w T } x n ) y n is called error for the nth training sample. Stationary point (in this case, the optimum): σ(w T x n )x n = x n y n n n Intuition: on average, the error is zero. Drs. Sha & Liu CSCI567 Machine Learning (Fall 2014) September 22, / 31

18 Logistic Regression - continued Gradient descent for logistic regression Numerical optimization Gradient descent Choose a proper step size η > 0 Drs. Sha & Liu CSCI567 Machine Learning (Fall 2014) September 22, / 31

19 Logistic Regression - continued Gradient descent for logistic regression Numerical optimization Gradient descent Choose a proper step size η > 0 Iteratively update the parameters following the negative gradient to minimize the error function w (t+1) w (t) η n { σ(w T x n ) y n } xn Remarks The step size needs to be chosen carefully to ensure convergence. The step size can be adaptive (i.e. varying from iteration to iteration). For example, we can use techniques such as line search There is a variant called stochastic gradient descent, also popularly used (later in this semester). Drs. Sha & Liu CSCI567 Machine Learning (Fall 2014) September 22, / 31

20 Logistic Regression - continued Intuition for Newton method Newton method Approximate the true function with an easy-to-solve optimization problem f(x) f quad (x) f(x) f quad (x) x k x k +d k x k x k +d k Drs. Sha & Liu CSCI567 Machine Learning (Fall 2014) September 22, / 31

21 Logistic Regression - continued Newton method Approximation Taylor expansion of the cross-entropy function E(w) E(w (t) ) + (w w (t) ) T E(w (t) ) (w w(t) ) T H (t) (w w (t) ) where E(w (t) ) is the gradient H (t) is the Hessian matrix evaluated at w (t) Example: a scalar function sin(θ) sin(0) + θ cos(θ = 0) θ2 [ sin(θ = 0)] = θ where sin(θ) = cos(θ) and H = cos(θ) = sin(θ) Drs. Sha & Liu CSCI567 Machine Learning (Fall 2014) September 22, / 31

22 Logistic Regression - continued So what is the Hessian matrix? Newton method The matrix of second-order derivatives In other words, H = 2 E(w) ww T H ij = ( ) E(w) w j w i So the Hessian matrix is R D D, where w R D. Drs. Sha & Liu CSCI567 Machine Learning (Fall 2014) September 22, / 31

23 Logistic Regression - continued Optimizing the approximation Newton method Minimize the approximation E(w) E(w (t) ) + (w w (t) ) T E(w (t) ) (w w(t) ) T H (t) (w w (t) ) and use the solution as the new estimate of the parameters w (t+1) min w (w w(t) ) T E(w (t) ) (w w(t) ) T H (t) (w w (t) ) Drs. Sha & Liu CSCI567 Machine Learning (Fall 2014) September 22, / 31

24 Logistic Regression - continued Newton method Optimizing the approximation Minimize the approximation E(w) E(w (t) ) + (w w (t) ) T E(w (t) ) (w w(t) ) T H (t) (w w (t) ) and use the solution as the new estimate of the parameters w (t+1) min w (w w(t) ) T E(w (t) ) (w w(t) ) T H (t) (w w (t) ) The quadratic function minimization has a closed form, thus, we have i.e., the Newton method. ( w (t+1) w (t) H (t)) 1 E(w (t) ) Drs. Sha & Liu CSCI567 Machine Learning (Fall 2014) September 22, / 31

25 Logistic Regression - continued Newton method Contrast gradient descent and Newton method Similar Both are iterative procedures. Difference Newton method requires second-order derivatives. Newton method does not have the magic η to be set. Drs. Sha & Liu CSCI567 Machine Learning (Fall 2014) September 22, / 31

26 Logistic Regression - continued Newton method Other important things about Hessian Our cross-entropy error function is convex E(w) w = n {σ(w T x n ) y n }x n (8) H = 2 E(w) = homework (9) wwt Drs. Sha & Liu CSCI567 Machine Learning (Fall 2014) September 22, / 31

27 Logistic Regression - continued Newton method Other important things about Hessian Our cross-entropy error function is convex For any vector v, E(w) w = n {σ(w T x n ) y n }x n (8) H = 2 E(w) = homework (9) wwt v T Hv = homework 0 Thus, positive definite. Thus, the cross-entropy error function is convex, with only one global optimum. Drs. Sha & Liu CSCI567 Machine Learning (Fall 2014) September 22, / 31

28 Logistic Regression - continued Good about Newton method Newton method Fast! Suppose we want to minimize f(x) = x 2 + 2x and we have its current estimate at x (t) 1. So what is the next estimate? x (t+1) x (t) [f (x)] 1 f (x) = x (t) 1 2 (2x(t) + 2) = 1 Namely, the next step (of iteration) immediately tells us the global optimum! (In optimization, this is called superlinear convergence rate). In general, the better our approximation, the fast the Newton method is in solving our optimization problem. Drs. Sha & Liu CSCI567 Machine Learning (Fall 2014) September 22, / 31

29 Logistic Regression - continued Newton method Bad about Newton method Not scalable! Computing and inverting Hessian matrix can be very expensive for large-scale problems where the dimensionally D is very large. Newton method does not guarantee convergence if your starting point is far away from the optimum NB. There are fixes and alternatives, such as Quasi-Newton/Quasi-second order method. Drs. Sha & Liu CSCI567 Machine Learning (Fall 2014) September 22, / 31

30 Outline Multiclass classification 1 Administration 2 Logistic Regression - continued 3 Multiclass classification Use binary classifiers as building blocks Multinomial logistic regression Drs. Sha & Liu CSCI567 Machine Learning (Fall 2014) September 22, / 31

31 Multiclass classification Setup Suppose we need to predict multiple classes/outcomes: C 1, C 2,..., C K Weather prediction: sunny, cloudy, raining, etc Optical character recognition: 10 digits + 26 characters (lower and upper cases) + special characters, etc Studied methods Nearest neighbor classifier Naive Bayes Gaussian discriminant analysis Logistic regression Drs. Sha & Liu CSCI567 Machine Learning (Fall 2014) September 22, / 31

32 Multiclass classification Use binary classifiers as building blocks Logistic regression for predicting multiple classes? Easy The approach of one versus the rest For each class C k, change the problem into binary classification 1 Relabel training data with label C k, into positive (or 1 ) 2 Relabel all the rest data into negative (or 0 ) This step is often called 1-of-K encoding. That is, only one is nonzero and everything else is zero. Example: for class C 2, data go through the following change (x 1, C 1 ) (x 1, 0), (x 2, C 3 ) (x 2, 0),..., (x n, C 2 ) (x n, 1),..., Drs. Sha & Liu CSCI567 Machine Learning (Fall 2014) September 22, / 31

33 Multiclass classification Use binary classifiers as building blocks Logistic regression for predicting multiple classes? Easy The approach of one versus the rest For each class C k, change the problem into binary classification 1 Relabel training data with label C k, into positive (or 1 ) 2 Relabel all the rest data into negative (or 0 ) This step is often called 1-of-K encoding. That is, only one is nonzero and everything else is zero. Example: for class C 2, data go through the following change (x 1, C 1 ) (x 1, 0), (x 2, C 3 ) (x 2, 0),..., (x n, C 2 ) (x n, 1),..., Train K binary classifiers using logistic regression to differentiate the two classes Drs. Sha & Liu CSCI567 Machine Learning (Fall 2014) September 22, / 31

34 Multiclass classification Use binary classifiers as building blocks Logistic regression for predicting multiple classes? Easy The approach of one versus the rest For each class C k, change the problem into binary classification 1 Relabel training data with label C k, into positive (or 1 ) 2 Relabel all the rest data into negative (or 0 ) This step is often called 1-of-K encoding. That is, only one is nonzero and everything else is zero. Example: for class C 2, data go through the following change (x 1, C 1 ) (x 1, 0), (x 2, C 3 ) (x 2, 0),..., (x n, C 2 ) (x n, 1),..., Train K binary classifiers using logistic regression to differentiate the two classes When predicting on x, combine the outputs of all binary classifiers 1 What if all the classifiers say negative? Drs. Sha & Liu CSCI567 Machine Learning (Fall 2014) September 22, / 31

35 Multiclass classification Use binary classifiers as building blocks Logistic regression for predicting multiple classes? Easy The approach of one versus the rest For each class C k, change the problem into binary classification 1 Relabel training data with label C k, into positive (or 1 ) 2 Relabel all the rest data into negative (or 0 ) This step is often called 1-of-K encoding. That is, only one is nonzero and everything else is zero. Example: for class C 2, data go through the following change (x 1, C 1 ) (x 1, 0), (x 2, C 3 ) (x 2, 0),..., (x n, C 2 ) (x n, 1),..., Train K binary classifiers using logistic regression to differentiate the two classes When predicting on x, combine the outputs of all binary classifiers 1 What if all the classifiers say negative? 2 What if multiple classifiers say positive? Take-home exercise: there are different combination strategies. Can you think of any? Drs. Sha & Liu CSCI567 Machine Learning (Fall 2014) September 22, / 31

36 Multiclass classification Use binary classifiers as building blocks Yet, another easy approach The approach of one versus one For each pair of classes C k and C k, change the problem into binary classification 1 Relabel training data with label C k, into positive (or 1 ) 2 Relabel training data with label C k into negative (or 0 ) 3 Disregard all other data Ex: for class C 1 and C 2, (x 1, C 1 ), (x 2, C 3 ), (x 3, C 2 ),... (x 1, 1), (x 3, 0),... Drs. Sha & Liu CSCI567 Machine Learning (Fall 2014) September 22, / 31

37 Multiclass classification Use binary classifiers as building blocks Yet, another easy approach The approach of one versus one For each pair of classes C k and C k, change the problem into binary classification 1 Relabel training data with label C k, into positive (or 1 ) 2 Relabel training data with label C k into negative (or 0 ) 3 Disregard all other data Ex: for class C 1 and C 2, (x 1, C 1 ), (x 2, C 3 ), (x 3, C 2 ),... (x 1, 1), (x 3, 0),... Train K(K 1)/2 binary classifiers using logistic regression to differentiate the two classes Drs. Sha & Liu CSCI567 Machine Learning (Fall 2014) September 22, / 31

38 Multiclass classification Use binary classifiers as building blocks Yet, another easy approach The approach of one versus one For each pair of classes C k and C k, change the problem into binary classification 1 Relabel training data with label C k, into positive (or 1 ) 2 Relabel training data with label C k into negative (or 0 ) 3 Disregard all other data Ex: for class C 1 and C 2, (x 1, C 1 ), (x 2, C 3 ), (x 3, C 2 ),... (x 1, 1), (x 3, 0),... Train K(K 1)/2 binary classifiers using logistic regression to differentiate the two classes When predicting on x, combine the outputs of all binary classifiers There are K(K 1)/2 votes! Take-home exercise: can you think of any good combination strategies? Drs. Sha & Liu CSCI567 Machine Learning (Fall 2014) September 22, / 31

39 Multiclass classification Use binary classifiers as building blocks Contrast these two approaches Pros and cons of each approach one versus the rest: only needs to train K classifiers. Make a huge difference if you have a lot of classes to go through. Can you think of a good application example where there are a lot of classes? Drs. Sha & Liu CSCI567 Machine Learning (Fall 2014) September 22, / 31

40 Multiclass classification Use binary classifiers as building blocks Contrast these two approaches Pros and cons of each approach one versus the rest: only needs to train K classifiers. Make a huge difference if you have a lot of classes to go through. Can you think of a good application example where there are a lot of classes? one versus one: only needs to train a smaller subset of data (only those labeled with those two classes would be involved). Make a huge difference if you have a lot of data to go through. Drs. Sha & Liu CSCI567 Machine Learning (Fall 2014) September 22, / 31

41 Multiclass classification Use binary classifiers as building blocks Contrast these two approaches Pros and cons of each approach one versus the rest: only needs to train K classifiers. Make a huge difference if you have a lot of classes to go through. Can you think of a good application example where there are a lot of classes? one versus one: only needs to train a smaller subset of data (only those labeled with those two classes would be involved). Make a huge difference if you have a lot of data to go through. Bad about both of them Combining classifiers outputs seem to be a bit tricky. Any other good methods? Drs. Sha & Liu CSCI567 Machine Learning (Fall 2014) September 22, / 31

42 Multiclass classification Multinomial logistic regression Multinomial logistic regression Intuition: from the decision rule of our naive Bayes classifier y = arg max c p(y = c x) = arg max c log p(x y = c)p(y = c) (10) = arg max c log π c + z k log θ ck = arg max c wc T x (11) k Essentially, we are comparing with one for each category. w T 1 x, w T 2 x,, w T C x (12) Drs. Sha & Liu CSCI567 Machine Learning (Fall 2014) September 22, / 31

43 First try Multiclass classification Multinomial logistic regression So, can we define the following conditional model? p(y = c x) = σ[w T c x] Drs. Sha & Liu CSCI567 Machine Learning (Fall 2014) September 22, / 31

44 First try Multiclass classification Multinomial logistic regression So, can we define the following conditional model? p(y = c x) = σ[w T c x] This would not work at least for the reason p(y = c x) = σ[wc T x] 1 c c as each the summand can be any number (independently) between 0 and 1. But we are close Drs. Sha & Liu CSCI567 Machine Learning (Fall 2014) September 22, / 31

45 Multiclass classification Multinomial logistic regression Definition of multinomial logistic regression Model For each class C k, we have a parameter vector w k and model the posterior probability as p(c k x) = ewt k x k ewt k x This is called softmax function Drs. Sha & Liu CSCI567 Machine Learning (Fall 2014) September 22, / 31

46 Multiclass classification Multinomial logistic regression Definition of multinomial logistic regression Model For each class C k, we have a parameter vector w k and model the posterior probability as p(c k x) = ewt k x k ewt k x This is called softmax function Decision boundary: assign x with the label that is the maximum of posterior arg max k P (C k x) arg max k w T k x Note: the notation is changed to denote the classes as C k instead of just c Drs. Sha & Liu CSCI567 Machine Learning (Fall 2014) September 22, / 31

### Probabilistic Linear Classification: Logistic Regression. Piyush Rai IIT Kanpur

Probabilistic Linear Classification: Logistic Regression Piyush Rai IIT Kanpur Probabilistic Machine Learning (CS772A) Jan 18, 2016 Probabilistic Machine Learning (CS772A) Probabilistic Linear Classification:

### PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION

PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION Introduction In the previous chapter, we explored a class of regression models having particularly simple analytical

### CS 688 Pattern Recognition Lecture 4. Linear Models for Classification

CS 688 Pattern Recognition Lecture 4 Linear Models for Classification Probabilistic generative models Probabilistic discriminative models 1 Generative Approach ( x ) p C k p( C k ) Ck p ( ) ( x Ck ) p(

### Christfried Webers. Canberra February June 2015

c Statistical Group and College of Engineering and Computer Science Canberra February June (Many figures from C. M. Bishop, "Pattern Recognition and ") 1of 829 c Part VIII Linear Classification 2 Logistic

### STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.cs.toronto.edu/~rsalakhu/ Lecture 6 Three Approaches to Classification Construct

### CSC 411: Lecture 07: Multiclass Classification

CSC 411: Lecture 07: Multiclass Classification Class based on Raquel Urtasun & Rich Zemel s lectures Sanja Fidler University of Toronto Feb 1, 2016 Urtasun, Zemel, Fidler (UofT) CSC 411: 07-Multiclass

### Lecture 3: Linear methods for classification

Lecture 3: Linear methods for classification Rafael A. Irizarry and Hector Corrada Bravo February, 2010 Today we describe four specific algorithms useful for classification problems: linear regression,

### Lecture 8 February 4

ICS273A: Machine Learning Winter 2008 Lecture 8 February 4 Scribe: Carlos Agell (Student) Lecturer: Deva Ramanan 8.1 Neural Nets 8.1.1 Logistic Regression Recall the logistic function: g(x) = 1 1 + e θt

### Logistic Regression. Jia Li. Department of Statistics The Pennsylvania State University. Logistic Regression

Logistic Regression Department of Statistics The Pennsylvania State University Email: jiali@stat.psu.edu Logistic Regression Preserve linear classification boundaries. By the Bayes rule: Ĝ(x) = arg max

### These slides follow closely the (English) course textbook Pattern Recognition and Machine Learning by Christopher Bishop

Music and Machine Learning (IFT6080 Winter 08) Prof. Douglas Eck, Université de Montréal These slides follow closely the (English) course textbook Pattern Recognition and Machine Learning by Christopher

### Statistical Machine Learning

Statistical Machine Learning UoC Stats 37700, Winter quarter Lecture 4: classical linear and quadratic discriminants. 1 / 25 Linear separation For two classes in R d : simple idea: separate the classes

### Linear Threshold Units

Linear Threshold Units w x hx (... w n x n w We assume that each feature x j and each weight w j is a real number (we will relax this later) We will study three different algorithms for learning linear

### Linear Classification. Volker Tresp Summer 2015

Linear Classification Volker Tresp Summer 2015 1 Classification Classification is the central task of pattern recognition Sensors supply information about an object: to which class do the object belong

### Machine Learning and Pattern Recognition Logistic Regression

Machine Learning and Pattern Recognition Logistic Regression Course Lecturer:Amos J Storkey Institute for Adaptive and Neural Computation School of Informatics University of Edinburgh Crichton Street,

### Pattern Analysis. Logistic Regression. 12. Mai 2009. Joachim Hornegger. Chair of Pattern Recognition Erlangen University

Pattern Analysis Logistic Regression 12. Mai 2009 Joachim Hornegger Chair of Pattern Recognition Erlangen University Pattern Analysis 2 / 43 1 Logistic Regression Posteriors and the Logistic Function Decision

### Logistic Regression. Vibhav Gogate The University of Texas at Dallas. Some Slides from Carlos Guestrin, Luke Zettlemoyer and Dan Weld.

Logistic Regression Vibhav Gogate The University of Texas at Dallas Some Slides from Carlos Guestrin, Luke Zettlemoyer and Dan Weld. Generative vs. Discriminative Classifiers Want to Learn: h:x Y X features

### Predict Influencers in the Social Network

Predict Influencers in the Social Network Ruishan Liu, Yang Zhao and Liuyu Zhou Email: rliu2, yzhao2, lyzhou@stanford.edu Department of Electrical Engineering, Stanford University Abstract Given two persons

### Linear Models for Classification

Linear Models for Classification Sumeet Agarwal, EEL709 (Most figures from Bishop, PRML) Approaches to classification Discriminant function: Directly assigns each data point x to a particular class Ci

### Lecture 2: The SVM classifier

Lecture 2: The SVM classifier C19 Machine Learning Hilary 2015 A. Zisserman Review of linear classifiers Linear separability Perceptron Support Vector Machine (SVM) classifier Wide margin Cost function

### MACHINE LEARNING IN HIGH ENERGY PHYSICS

MACHINE LEARNING IN HIGH ENERGY PHYSICS LECTURE #1 Alex Rogozhnikov, 2015 INTRO NOTES 4 days two lectures, two practice seminars every day this is introductory track to machine learning kaggle competition!

### Example: Credit card default, we may be more interested in predicting the probabilty of a default than classifying individuals as default or not.

Statistical Learning: Chapter 4 Classification 4.1 Introduction Supervised learning with a categorical (Qualitative) response Notation: - Feature vector X, - qualitative response Y, taking values in C

### Natural Language Processing. Today. Logistic Regression Models. Lecture 13 10/6/2015. Jim Martin. Multinomial Logistic Regression

Natural Language Processing Lecture 13 10/6/2015 Jim Martin Today Multinomial Logistic Regression Aka log-linear models or maximum entropy (maxent) Components of the model Learning the parameters 10/1/15

### University of Cambridge Engineering Part IIB Module 4F10: Statistical Pattern Processing Handout 8: Multi-Layer Perceptrons

University of Cambridge Engineering Part IIB Module 4F0: Statistical Pattern Processing Handout 8: Multi-Layer Perceptrons x y (x) Inputs x 2 y (x) 2 Outputs x d First layer Second Output layer layer y

### CCNY. BME I5100: Biomedical Signal Processing. Linear Discrimination. Lucas C. Parra Biomedical Engineering Department City College of New York

BME I5100: Biomedical Signal Processing Linear Discrimination Lucas C. Parra Biomedical Engineering Department CCNY 1 Schedule Week 1: Introduction Linear, stationary, normal - the stuff biology is not

### A crash course in probability and Naïve Bayes classification

Probability theory A crash course in probability and Naïve Bayes classification Chapter 9 Random variable: a variable whose possible values are numerical outcomes of a random phenomenon. s: A person s

### (Quasi-)Newton methods

(Quasi-)Newton methods 1 Introduction 1.1 Newton method Newton method is a method to find the zeros of a differentiable non-linear function g, x such that g(x) = 0, where g : R n R n. Given a starting

### A Logistic Regression Approach to Ad Click Prediction

A Logistic Regression Approach to Ad Click Prediction Gouthami Kondakindi kondakin@usc.edu Satakshi Rana satakshr@usc.edu Aswin Rajkumar aswinraj@usc.edu Sai Kaushik Ponnekanti ponnekan@usc.edu Vinit Parakh

### Machine Learning in Spam Filtering

Machine Learning in Spam Filtering A Crash Course in ML Konstantin Tretyakov kt@ut.ee Institute of Computer Science, University of Tartu Overview Spam is Evil ML for Spam Filtering: General Idea, Problems.

### Linear smoother. ŷ = S y. where s ij = s ij (x) e.g. s ij = diag(l i (x)) To go the other way, you need to diagonalize S

Linear smoother ŷ = S y where s ij = s ij (x) e.g. s ij = diag(l i (x)) To go the other way, you need to diagonalize S 2 Online Learning: LMS and Perceptrons Partially adapted from slides by Ryan Gabbard

### Introduction to Logistic Regression

OpenStax-CNX module: m42090 1 Introduction to Logistic Regression Dan Calderon This work is produced by OpenStax-CNX and licensed under the Creative Commons Attribution License 3.0 Abstract Gives introduction

### Pa8ern Recogni6on. and Machine Learning. Chapter 4: Linear Models for Classiﬁca6on

Pa8ern Recogni6on and Machine Learning Chapter 4: Linear Models for Classiﬁca6on Represen'ng the target values for classifica'on If there are only two classes, we typically use a single real valued output

### The Steepest Descent Algorithm for Unconstrained Optimization and a Bisection Line-search Method

The Steepest Descent Algorithm for Unconstrained Optimization and a Bisection Line-search Method Robert M. Freund February, 004 004 Massachusetts Institute of Technology. 1 1 The Algorithm The problem

### Machine Learning. CUNY Graduate Center, Spring 2013. Professor Liang Huang. huang@cs.qc.cuny.edu

Machine Learning CUNY Graduate Center, Spring 2013 Professor Liang Huang huang@cs.qc.cuny.edu http://acl.cs.qc.edu/~lhuang/teaching/machine-learning Logistics Lectures M 9:30-11:30 am Room 4419 Personnel

### Classification Problems

Classification Read Chapter 4 in the text by Bishop, except omit Sections 4.1.6, 4.1.7, 4.2.4, 4.3.3, 4.3.5, 4.3.6, 4.4, and 4.5. Also, review sections 1.5.1, 1.5.2, 1.5.3, and 1.5.4. Classification Problems

### Lecture 6: Logistic Regression

Lecture 6: CS 194-10, Fall 2011 Laurent El Ghaoui EECS Department UC Berkeley September 13, 2011 Outline Outline Classification task Data : X = [x 1,..., x m]: a n m matrix of data points in R n. y { 1,

### Classification using Logistic Regression

Classification using Logistic Regression Ingmar Schuster Patrick Jähnichen using slides by Andrew Ng Institut für Informatik This lecture covers Logistic regression hypothesis Decision Boundary Cost function

### Neural Networks. CAP5610 Machine Learning Instructor: Guo-Jun Qi

Neural Networks CAP5610 Machine Learning Instructor: Guo-Jun Qi Recap: linear classifier Logistic regression Maximizing the posterior distribution of class Y conditional on the input vector X Support vector

### Gaussian Classifiers CS498

Gaussian Classifiers CS498 Today s lecture The Gaussian Gaussian classifiers A slightly more sophisticated classifier Nearest Neighbors We can classify with nearest neighbors x m 1 m 2 Decision boundary

### Introduction to Machine Learning

Introduction to Machine Learning Prof. Alexander Ihler Prof. Max Welling icamp Tutorial July 22 What is machine learning? The ability of a machine to improve its performance based on previous results:

### Linear Discrimination. Linear Discrimination. Linear Discrimination. Linearly Separable Systems Pairwise Separation. Steven J Zeil.

Steven J Zeil Old Dominion Univ. Fall 200 Discriminant-Based Classification Linearly Separable Systems Pairwise Separation 2 Posteriors 3 Logistic Discrimination 2 Discriminant-Based Classification Likelihood-based:

### LOGISTIC REGRESSION. Nitin R Patel. where the dependent variable, y, is binary (for convenience we often code these values as

LOGISTIC REGRESSION Nitin R Patel Logistic regression extends the ideas of multiple linear regression to the situation where the dependent variable, y, is binary (for convenience we often code these values

### Inner Product Spaces

Math 571 Inner Product Spaces 1. Preliminaries An inner product space is a vector space V along with a function, called an inner product which associates each pair of vectors u, v with a scalar u, v, and

### Parametric Models Part I: Maximum Likelihood and Bayesian Density Estimation

Parametric Models Part I: Maximum Likelihood and Bayesian Density Estimation Selim Aksoy Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr CS 551, Fall 2015 CS 551, Fall 2015

### CSC 321 H1S Study Guide (Last update: April 3, 2016) Winter 2016

1. Suppose our training set and test set are the same. Why would this be a problem? 2. Why is it necessary to have both a test set and a validation set? 3. Images are generally represented as n m 3 arrays,

### An Introduction to Machine Learning

An Introduction to Machine Learning L5: Novelty Detection and Regression Alexander J. Smola Statistical Machine Learning Program Canberra, ACT 0200 Australia Alex.Smola@nicta.com.au Tata Institute, Pune,

### CS229 Lecture notes. Andrew Ng

CS229 Lecture notes Andrew Ng Supervised learning Let s start by talking about a few examples of supervised learning problems Suppose we have a dataset giving the living areas and prices of 47 houses from

### Supervised Learning (Big Data Analytics)

Supervised Learning (Big Data Analytics) Vibhav Gogate Department of Computer Science The University of Texas at Dallas Practical advice Goal of Big Data Analytics Uncover patterns in Data. Can be used

### Feature Engineering in Machine Learning

Research Fellow Faculty of Information Technology, Monash University, Melbourne VIC 3800, Australia August 21, 2015 Outline A Machine Learning Primer Machine Learning and Data Science Bias-Variance Phenomenon

### 1 Maximum likelihood estimation

COS 424: Interacting with Data Lecturer: David Blei Lecture #4 Scribes: Wei Ho, Michael Ye February 14, 2008 1 Maximum likelihood estimation 1.1 MLE of a Bernoulli random variable (coin flips) Given N

### Dot product and vector projections (Sect. 12.3) There are two main ways to introduce the dot product

Dot product and vector projections (Sect. 12.3) Two definitions for the dot product. Geometric definition of dot product. Orthogonal vectors. Dot product and orthogonal projections. Properties of the dot

### Probabilistic Models for Big Data. Alex Davies and Roger Frigola University of Cambridge 13th February 2014

Probabilistic Models for Big Data Alex Davies and Roger Frigola University of Cambridge 13th February 2014 The State of Big Data Why probabilistic models for Big Data? 1. If you don t have to worry about

### Cross product and determinants (Sect. 12.4) Two main ways to introduce the cross product

Cross product and determinants (Sect. 12.4) Two main ways to introduce the cross product Geometrical definition Properties Expression in components. Definition in components Properties Geometrical expression.

### 2.3 Convex Constrained Optimization Problems

42 CHAPTER 2. FUNDAMENTAL CONCEPTS IN CONVEX OPTIMIZATION Theorem 15 Let f : R n R and h : R R. Consider g(x) = h(f(x)) for all x R n. The function g is convex if either of the following two conditions

### MVA ENS Cachan. Lecture 2: Logistic regression & intro to MIL Iasonas Kokkinos Iasonas.kokkinos@ecp.fr

Machine Learning for Computer Vision 1 MVA ENS Cachan Lecture 2: Logistic regression & intro to MIL Iasonas Kokkinos Iasonas.kokkinos@ecp.fr Department of Applied Mathematics Ecole Centrale Paris Galen

### Learning from Data: Naive Bayes

Semester 1 http://www.anc.ed.ac.uk/ amos/lfd/ Naive Bayes Typical example: Bayesian Spam Filter. Naive means naive. Bayesian methods can be much more sophisticated. Basic assumption: conditional independence.

### Question 2 Naïve Bayes (16 points)

Question 2 Naïve Bayes (16 points) About 2/3 of your email is spam so you downloaded an open source spam filter based on word occurrences that uses the Naive Bayes classifier. Assume you collected the

### A Simple Introduction to Support Vector Machines

A Simple Introduction to Support Vector Machines Martin Law Lecture for CSE 802 Department of Computer Science and Engineering Michigan State University Outline A brief history of SVM Large-margin linear

### Modern Optimization Methods for Big Data Problems MATH11146 The University of Edinburgh

Modern Optimization Methods for Big Data Problems MATH11146 The University of Edinburgh Peter Richtárik Week 3 Randomized Coordinate Descent With Arbitrary Sampling January 27, 2016 1 / 30 The Problem

### Programming Exercise 3: Multi-class Classification and Neural Networks

Programming Exercise 3: Multi-class Classification and Neural Networks Machine Learning November 4, 2011 Introduction In this exercise, you will implement one-vs-all logistic regression and neural networks

### Making Sense of the Mayhem: Machine Learning and March Madness

Making Sense of the Mayhem: Machine Learning and March Madness Alex Tran and Adam Ginzberg Stanford University atran3@stanford.edu ginzberg@stanford.edu I. Introduction III. Model The goal of our research

### 3F3: Signal and Pattern Processing

3F3: Signal and Pattern Processing Lecture 3: Classification Zoubin Ghahramani zoubin@eng.cam.ac.uk Department of Engineering University of Cambridge Lent Term Classification We will represent data by

### Simple and efficient online algorithms for real world applications

Simple and efficient online algorithms for real world applications Università degli Studi di Milano Milano, Italy Talk @ Centro de Visión por Computador Something about me PhD in Robotics at LIRA-Lab,

### i=1 In practice, the natural logarithm of the likelihood function, called the log-likelihood function and denoted by

Statistics 580 Maximum Likelihood Estimation Introduction Let y (y 1, y 2,..., y n be a vector of iid, random variables from one of a family of distributions on R n and indexed by a p-dimensional parameter

### The equivalence of logistic regression and maximum entropy models

The equivalence of logistic regression and maximum entropy models John Mount September 23, 20 Abstract As our colleague so aptly demonstrated ( http://www.win-vector.com/blog/20/09/the-simplerderivation-of-logistic-regression/

### Introduction to Online Learning Theory

Introduction to Online Learning Theory Wojciech Kot lowski Institute of Computing Science, Poznań University of Technology IDSS, 04.06.2013 1 / 53 Outline 1 Example: Online (Stochastic) Gradient Descent

### large-scale machine learning revisited Léon Bottou Microsoft Research (NYC)

large-scale machine learning revisited Léon Bottou Microsoft Research (NYC) 1 three frequent ideas in machine learning. independent and identically distributed data This experimental paradigm has driven

### Big Data Analytics. Lucas Rego Drumond

Big Data Analytics Lucas Rego Drumond Information Systems and Machine Learning Lab (ISMLL) Institute of Computer Science University of Hildesheim, Germany Going For Large Scale Going For Large Scale 1

### L3: Statistical Modeling with Hadoop

L3: Statistical Modeling with Hadoop Feng Li feng.li@cufe.edu.cn School of Statistics and Mathematics Central University of Finance and Economics Revision: December 10, 2014 Today we are going to learn...

### Lecture notes: single-agent dynamics 1

Lecture notes: single-agent dynamics 1 Single-agent dynamic optimization models In these lecture notes we consider specification and estimation of dynamic optimization models. Focus on single-agent models.

### Introduction to Convex Optimization for Machine Learning

Introduction to Convex Optimization for Machine Learning John Duchi University of California, Berkeley Practical Machine Learning, Fall 2009 Duchi (UC Berkeley) Convex Optimization for Machine Learning

### Classification by Pairwise Coupling

Classification by Pairwise Coupling TREVOR HASTIE * Stanford University and ROBERT TIBSHIRANI t University of Toronto Abstract We discuss a strategy for polychotomous classification that involves estimating

### Machine Learning Logistic Regression

Machine Learning Logistic Regression Jeff Howbert Introduction to Machine Learning Winter 2012 1 Logistic regression Name is somewhat misleading. Really a technique for classification, not regression.

### Class #6: Non-linear classification. ML4Bio 2012 February 17 th, 2012 Quaid Morris

Class #6: Non-linear classification ML4Bio 2012 February 17 th, 2012 Quaid Morris 1 Module #: Title of Module 2 Review Overview Linear separability Non-linear classification Linear Support Vector Machines

### Bayes and Naïve Bayes. cs534-machine Learning

Bayes and aïve Bayes cs534-machine Learning Bayes Classifier Generative model learns Prediction is made by and where This is often referred to as the Bayes Classifier, because of the use of the Bayes rule

### Logistic Regression for Spam Filtering

Logistic Regression for Spam Filtering Nikhila Arkalgud February 14, 28 Abstract The goal of the spam filtering problem is to identify an email as a spam or not spam. One of the classic techniques used

### CSE 473: Artificial Intelligence Autumn 2010

CSE 473: Artificial Intelligence Autumn 2010 Machine Learning: Naive Bayes and Perceptron Luke Zettlemoyer Many slides over the course adapted from Dan Klein. 1 Outline Learning: Naive Bayes and Perceptron

### Lecture 4: Equality Constrained Optimization. Tianxi Wang

Lecture 4: Equality Constrained Optimization Tianxi Wang wangt@essex.ac.uk 2.1 Lagrange Multiplier Technique (a) Classical Programming max f(x 1, x 2,..., x n ) objective function where x 1, x 2,..., x

### t := maxγ ν subject to ν {0,1,2,...} and f(x c +γ ν d) f(x c )+cγ ν f (x c ;d).

1. Line Search Methods Let f : R n R be given and suppose that x c is our current best estimate of a solution to P min x R nf(x). A standard method for improving the estimate x c is to choose a direction

### GI01/M055 Supervised Learning Proximal Methods

GI01/M055 Supervised Learning Proximal Methods Massimiliano Pontil (based on notes by Luca Baldassarre) (UCL) Proximal Methods 1 / 20 Today s Plan Problem setting Convex analysis concepts Proximal operators

### 203.4770: Introduction to Machine Learning Dr. Rita Osadchy

203.4770: Introduction to Machine Learning Dr. Rita Osadchy 1 Outline 1. About the Course 2. What is Machine Learning? 3. Types of problems and Situations 4. ML Example 2 About the course Course Homepage:

### Lecture 9: Introduction to Pattern Analysis

Lecture 9: Introduction to Pattern Analysis g Features, patterns and classifiers g Components of a PR system g An example g Probability definitions g Bayes Theorem g Gaussian densities Features, patterns

### Sampling via Moment Sharing: A New Framework for Distributed Bayesian Inference for Big Data

Sampling via Moment Sharing: A New Framework for Distributed Bayesian Inference for Big Data (Oxford) in collaboration with: Minjie Xu, Jun Zhu, Bo Zhang (Tsinghua) Balaji Lakshminarayanan (Gatsby) Bayesian

### Numerisches Rechnen. (für Informatiker) M. Grepl J. Berger & J.T. Frings. Institut für Geometrie und Praktische Mathematik RWTH Aachen

(für Informatiker) M. Grepl J. Berger & J.T. Frings Institut für Geometrie und Praktische Mathematik RWTH Aachen Wintersemester 2010/11 Problem Statement Unconstrained Optimality Conditions Constrained

### Introduction to Machine Learning Lecture 1. Mehryar Mohri Courant Institute and Google Research mohri@cims.nyu.edu

Introduction to Machine Learning Lecture 1 Mehryar Mohri Courant Institute and Google Research mohri@cims.nyu.edu Introduction Logistics Prerequisites: basics concepts needed in probability and statistics

### 5.3 The Cross Product in R 3

53 The Cross Product in R 3 Definition 531 Let u = [u 1, u 2, u 3 ] and v = [v 1, v 2, v 3 ] Then the vector given by [u 2 v 3 u 3 v 2, u 3 v 1 u 1 v 3, u 1 v 2 u 2 v 1 ] is called the cross product (or

### Machine Learning: Multi Layer Perceptrons

Machine Learning: Multi Layer Perceptrons Prof. Dr. Martin Riedmiller Albert-Ludwigs-University Freiburg AG Maschinelles Lernen Machine Learning: Multi Layer Perceptrons p.1/61 Outline multi layer perceptrons

### Big Data Analytics CSCI 4030

High dim. data Graph data Infinite data Machine learning Apps Locality sensitive hashing PageRank, SimRank Filtering data streams SVM Recommen der systems Clustering Community Detection Web advertising

### Mechanics 1: Conservation of Energy and Momentum

Mechanics : Conservation of Energy and Momentum If a certain quantity associated with a system does not change in time. We say that it is conserved, and the system possesses a conservation law. Conservation

### Introduction to Machine Learning. Speaker: Harry Chao Advisor: J.J. Ding Date: 1/27/2011

Introduction to Machine Learning Speaker: Harry Chao Advisor: J.J. Ding Date: 1/27/2011 1 Outline 1. What is machine learning? 2. The basic of machine learning 3. Principles and effects of machine learning

### Big Data - Lecture 1 Optimization reminders

Big Data - Lecture 1 Optimization reminders S. Gadat Toulouse, Octobre 2014 Big Data - Lecture 1 Optimization reminders S. Gadat Toulouse, Octobre 2014 Schedule Introduction Major issues Examples Mathematics

### Multi-Class and Structured Classification

Multi-Class and Structured Classification [slides prises du cours cs294-10 UC Berkeley (2006 / 2009)] [ p y( )] http://www.cs.berkeley.edu/~jordan/courses/294-fall09 Basic Classification in ML Input Output

### Basics of Statistical Machine Learning

CS761 Spring 2013 Advanced Machine Learning Basics of Statistical Machine Learning Lecturer: Xiaojin Zhu jerryzhu@cs.wisc.edu Modern machine learning is rooted in statistics. You will find many familiar

### Machine Learning. CS 188: Artificial Intelligence Naïve Bayes. Example: Digit Recognition. Other Classification Tasks

CS 188: Artificial Intelligence Naïve Bayes Machine Learning Up until now: how use a model to make optimal decisions Machine learning: how to acquire a model from data / experience Learning parameters

### A fast multi-class SVM learning method for huge databases

www.ijcsi.org 544 A fast multi-class SVM learning method for huge databases Djeffal Abdelhamid 1, Babahenini Mohamed Chaouki 2 and Taleb-Ahmed Abdelmalik 3 1,2 Computer science department, LESIA Laboratory,

### Lecture 6. Artificial Neural Networks

Lecture 6 Artificial Neural Networks 1 1 Artificial Neural Networks In this note we provide an overview of the key concepts that have led to the emergence of Artificial Neural Networks as a major paradigm

### Multiclass Classification. 9.520 Class 06, 25 Feb 2008 Ryan Rifkin

Multiclass Classification 9.520 Class 06, 25 Feb 2008 Ryan Rifkin It is a tale Told by an idiot, full of sound and fury, Signifying nothing. Macbeth, Act V, Scene V What Is Multiclass Classification? Each

### Bayesian Multinomial Logistic Regression for Author Identification

Bayesian Multinomial Logistic Regression for Author Identification David Madigan,, Alexander Genkin, David D. Lewis and Dmitriy Fradkin, DIMACS, Rutgers University Department of Statistics, Rutgers University

### Wes, Delaram, and Emily MA751. Exercise 4.5. 1 p(x; β) = [1 p(xi ; β)] = 1 p(x. y i [βx i ] log [1 + exp {βx i }].

Wes, Delaram, and Emily MA75 Exercise 4.5 Consider a two-class logistic regression problem with x R. Characterize the maximum-likelihood estimates of the slope and intercept parameter if the sample for