Logistic Regression for Spam Filtering

Nikhila Arkalgud

February 14, 2008

Abstract

The goal of the spam filtering problem is to identify an email as spam or not spam. One of the classic techniques used in spam filtering is to predict using logistic regression. Words that frequently occur in spam emails are used as the feature set in the regression problem. In this report we examine some of the different techniques used for minimizing the logistic loss function and provide a performance analysis of these techniques. Specifically, three different types of minimization techniques were implemented and tested: the regular (batch) gradient descent algorithm, the regularized gradient descent algorithm, and the stochastic gradient descent algorithm.

1 Introduction and Problem Description

What is spam? One definition could be: electronic junk mail or junk newsgroup postings. Some people define spam even more generally as any unsolicited e-mail. A spam filter is a software tool used to separate spam emails from genuine emails; that is, the spam filter predicts which class an email belongs to, spam or not spam. This problem has been addressed using several techniques such as SVMs, Naive Bayes, and logistic regression.

Logistic regression is a model used for predicting the probability of occurrence of an event. It makes use of several predictor variables (features) that may be either numerical or categorical. Other names for logistic regression used in various application areas include logistic model, logit model, and maximum-entropy classifier. Logistic regression is one of a class of models known as generalized linear models.

In this report three different techniques for minimizing the logistic loss function have been studied: normal gradient descent, regularized gradient descent, and stochastic gradient descent. The logistic regression algorithm is introduced in Section 2, the minimization techniques are explained in detail in Sections 3, 4, and 5, and a detailed experimental analysis is provided in Section 6.

2 Logistic Regression

An explanation of logistic regression begins with an explanation of the logistic function (also called the sigmoid function):

    f(z) = 1 / (1 + e^{-z}).

The logistic function is useful because it can take as input any value from negative infinity to positive infinity, whereas the output is confined to values between 0 and 1. The variable z represents the exposure to some set of risk factors, while f(z) represents the probability of a particular outcome given that set of risk factors. The variable z is a measure of the total contribution of all the risk factors used in the model and is known as the logit. It is usually defined as

    z = \sum_{i=1}^{n} w_i x_i,

where x_1, ..., x_n are the features and w_1, ..., w_n are the regression coefficients (weights).

The logistic regression algorithm is as given below:

1. Initialize the weight vector to zero.
2. Train the weights by minimizing the logistic loss:
   while \|gradient\|_1 > precision do
     - calculate the new prediction vector: ỹ = 1 / (1 + e^{-w·x})
     - calculate the gradient vector: gradient = \sum_t (ỹ_t - y_t) x_t
     - update the weights: w_{t+1} = w_t - η · gradient
3. Calculate the logistic loss on the test set using

    loss(y, ỹ) = y ln(y/ỹ) + (1 - y) ln((1 - y)/(1 - ỹ))
               = -ln(1 - ỹ) = ln(1 + e^{w·x})          if y = 0
               = -ln(ỹ) = ln(1 + e^{w·x}) - w·x        if y = 1,

which is the negative log likelihood.
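The report notes in Section 6 that the algorithms were implemented in Matlab, but no code is included. The following is a minimal Matlab/Octave sketch of the batch training loop above; the function name logreg_batch_gd, the variable names, and the choice to average (rather than sum) the loss over the examples are illustrative assumptions, not taken from the original implementation.

```matlab
function [w, train_loss] = logreg_batch_gd(X, y, eta, precision)
% Batch gradient descent for logistic regression (Section 2).
% X: n-by-d feature matrix, y: n-by-1 vector of 0/1 labels,
% eta: learning rate, precision: stopping threshold on the 1-norm of the gradient.
    d = size(X, 2);
    w = zeros(d, 1);                        % step 1: initialize the weights to zero
    sigmoid = @(z) 1 ./ (1 + exp(-z));

    yhat = sigmoid(X * w);                  % predictions for all examples
    grad = X' * (yhat - y);                 % sum_t (yhat_t - y_t) * x_t
    while norm(grad, 1) > precision
        w = w - eta * grad;                 % gradient descent step
        yhat = sigmoid(X * w);
        grad = X' * (yhat - y);
    end

    % logistic loss (negative log-likelihood), averaged over the examples
    train_loss = mean(-y .* log(yhat) - (1 - y) .* log(1 - yhat));
end
```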

3 Minimization of the Logistic Loss Using Normal (Batch) Gradient Descent

One of the standard methods for minimizing any convex function is gradient descent, where the optimal solution is found where the gradient of the function equals zero, by taking steps proportional to the negative of the gradient. As shown in the algorithm above, at each gradient step the gradient of the loss over all the examples in the batch is computed and the weight for each feature is updated. This is continued until the 1-norm of the gradient falls below some threshold precision value. The loss, gradient, and weight update equations are:

    loss(y, ỹ) = y ln(y/ỹ) + (1 - y) ln((1 - y)/(1 - ỹ))
    gradient = \sum_t (ỹ_t - y_t) x_t
    w_{t+1} = w_t - η · gradient

Gradient descent can be time consuming, since it may take many iterations to converge to a minimum, and it can also lead to overfitting of the data. Overfitting the training data is undesirable because it leads to a higher loss on the test data. Hence it is common to carry out the minimization using other iterative methods.

4 Minimization of the Logistic Loss Using Regularized Gradient Descent

A common technique used to prevent overfitting of the training data is to regularize the weights. Regularization, as defined in [1], is "any tunable method that increases the average loss on the training set, but decreases the average loss on the test set." Some of the techniques used for regularization are stopping the training early, regularizing with relative entropies, feature selection, and clipping the range of the labels. We have implemented regularization using the following minimization problem:

    \inf_w [ (1/(2η)) \|w\|_2^2 + (1/T) \sum_{t=1}^{T} loss(y_t, ỹ_t) ]

and train until the gradient satisfies

    \| (1/η) w + (1/T) \sum_{t=1}^{T} (ỹ_t - y_t) x_t \|_1 ≤ precision.
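A corresponding sketch of the regularized variant follows. It implements the reconstructed objective above, so the penalty strength is tied to 1/(2η); the experiments in Section 6.3 also quote a separate regularization parameter λ, so the exact parameterization used in the original code is uncertain and this should be read as an assumption.

```matlab
function w = logreg_regularized_gd(X, y, eta, precision)
% Gradient descent on the regularized objective of Section 4:
%   (1/(2*eta)) * ||w||^2 + (1/T) * sum_t loss(y_t, yhat_t)
% whose gradient is (1/eta)*w + (1/T) * sum_t (yhat_t - y_t)*x_t.
    [T, d] = size(X);
    w = zeros(d, 1);
    sigmoid = @(z) 1 ./ (1 + exp(-z));

    grad = (1/eta) * w + (1/T) * (X' * (sigmoid(X * w) - y));
    while norm(grad, 1) > precision
        w = w - eta * grad;     % descent step on the regularized objective
        grad = (1/eta) * w + (1/T) * (X' * (sigmoid(X * w) - y));
    end
end
```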

5 Minimization of the Logistic Loss Using Stochastic Gradient Descent with Simulated Annealing

In standard (or "batch") gradient descent, the true gradient is used to update the parameters of the model. The true gradient is the sum of the gradients contributed by each individual training example, and the parameter vector is adjusted by the negative of the true gradient multiplied by a step size. Batch gradient descent therefore requires one sweep through the training set before any parameters can be changed. In stochastic (or "online") gradient descent, the true gradient is approximated by the gradient of the cost function evaluated on a single training example, and the parameters are adjusted by an amount proportional to this approximate gradient. The parameters of the model are thus updated after each training example. For large data sets, online gradient descent can be much faster than batch gradient descent.

Instead of using a constant learning rate for each gradient update, a variable learning rate was implemented whose value is gradually reduced in order to control the weight vector. This technique is intuitively similar to annealing [2], where a metal is heated to a high temperature and then gradually cooled: the heat causes the atoms to become unstuck from their initial positions (a local minimum of the internal energy) and wander randomly through states of higher energy, and the slow cooling gives them more chances of finding configurations with lower internal energy than the initial one. We cool the learning rate η by a factor of α^{i-1}, where i is the iteration (or pass) number. The loss, gradient, and weight update equations are:

    loss(y, ỹ) = y ln(y/ỹ) + (1 - y) ln((1 - y)/(1 - ỹ))
    gradient_t = (ỹ_t - y_t) x_t
    w_{t+1} = w_t - (η · α^{i-1}) · gradient_t

6 Experimental Results

The logistic regression algorithms were implemented in Matlab. Several tests were conducted, a detailed analysis of which is presented in the following sections. The spam dataset (provided on the class website) was used for the analysis. The number of trials was limited to 2. In total, 2 features were used for each example.

6.1 Cross Validation

A 5-fold cross validation over 10 runs was used to obtain the average training and test losses for each algorithm. We used 2 examples to train and test the algorithms, with each example containing 2 features (words). The procedure was as follows (a Matlab sketch of this protocol is given at the end of Section 6.2):

1. for i = 1 to 10:
   - permute the data and split it into a 3/4 training set and a 1/4 test set
   - perform 5-fold cross validation to determine the best model parameters:
     * partition the training set into 5 parts
     * for each of the 5 holdouts, train all models on the 4/5 part (training set) and record the average logistic loss on the remaining 1/5 part (validation set)
     * the best model is chosen as the one with the best average loss over the 5 holdouts
   - evaluate the best model by computing the average logistic loss on the 1/4 test set
2. compute the average performance of the best model over the 10 runs

6.2 Logistic Regression using Gradient Descent

The regular (batch) gradient descent algorithm was implemented, and the training and test losses were found using the 5-fold cross validation described in Section 6.1. Values were obtained for all of the precision values with 10 runs, except for 0.1, which completed only 5 runs after running for 2 days; for that setting the training and test losses were therefore computed over only 5 runs.
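The batch gradient descent results above, and the parameter sweeps that follow, were obtained with the protocol of Section 6.1. A Matlab sketch of that protocol is given below; train_model and test_loss are hypothetical helpers standing in for training one of the algorithms above with a given parameter setting and for computing its average logistic loss, and retraining the selected model on the full 3/4 training split before testing is an assumption about a detail the report does not spell out.

```matlab
function mean_test_loss = evaluate_protocol(X, y, param_grid, num_runs)
% Sketch of the evaluation protocol of Section 6.1 (illustrative names).
    n = size(X, 1);
    test_losses = zeros(num_runs, 1);
    for run = 1:num_runs
        idx = randperm(n);                              % permute the data
        n_train = floor(3 * n / 4);
        tr = idx(1:n_train);                            % 3/4 training set
        te = idx(n_train+1:end);                        % 1/4 test set

        % 5-fold cross validation on the training set to pick the best parameter
        fold = mod(0:numel(tr)-1, 5) + 1;               % fold label per training example
        cv_loss = zeros(numel(param_grid), 1);
        for p = 1:numel(param_grid)
            for k = 1:5
                w = train_model(X(tr(fold ~= k), :), y(tr(fold ~= k)), param_grid(p));
                cv_loss(p) = cv_loss(p) + test_loss(w, X(tr(fold == k), :), y(tr(fold == k))) / 5;
            end
        end
        [~, best] = min(cv_loss);                       % best average over the 5 holdouts

        % retrain with the best parameter and evaluate on the held-out 1/4
        w = train_model(X(tr, :), y(tr), param_grid(best));
        test_losses(run) = test_loss(w, X(te, :), y(te));
    end
    mean_test_loss = mean(test_losses);                 % average over the runs
end
```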

6.2.1 Effect of Early Stopping of the Training

Running the gradient descent algorithm to minimize the total logistic loss and stopping the training early (not allowing the gradient to go to zero) amounts to implicit regularization, since the weights are initially small. Figure 1 shows the variation of the training loss and test loss as the precision value is varied. It can be seen that for precision 10^{-4} the training leads to overfitting of the data, so the training loss is very low but the test loss is high. As we relax the precision (early stopping), the training loss increases, that is, the weights are controlled, and hence the test loss is lowered. A slight increase in both the test loss and the training loss is again observed at precision 10^{0}; this could be because the weights were not trained enough. From the graph, we can therefore select 0.1 as a good value for the stopping point of the gradient.

Figure 1: Normal Gradient Descent: variation of the mean logistic loss as a function of the gradient stopping point (test loss and training loss vs. gradient at the stopping point, 10^{-4} to 10^{0}).

6.3 Logistic Regression using Regularized Gradient Descent

Logistic regression using regularized gradient descent was implemented as explained in Section 4. A 5-fold cross validation over 10 runs was conducted to obtain the mean logistic loss on the test and validation data sets. After conducting the tests, a precision of 0.1, λ = 0.1, and η = 0.2 were found to be good choices for the parameters.

6.3.1 Effect of the Regularization Parameter λ

Figure 2 shows the variation of the training loss and the test loss with respect to the regularization parameter λ. It was observed that for λ ≥ 0.1 the logistic test and training losses remained almost constant, as expected, implying that regularization helps in preventing overfitting. For lower values of λ the effect of the regularization is reduced, so overfitting on the training data is observed and high test losses are recorded.
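A sweep such as the one behind Figure 2 could be reproduced with the evaluate_protocol sketch from Section 6.2 by evaluating one parameter value at a time. The λ grid below mirrors the 10^{-4} to 10^{0} range of the figure, and train_model is assumed to wrap the regularized training for the given λ; both are illustrative assumptions.

```matlab
% Illustrative lambda sweep (grid values assumed from the range of Figure 2).
lambdas = [1e-4 1e-3 1e-2 1e-1 1];
losses = zeros(size(lambdas));
for j = 1:numel(lambdas)
    losses(j) = evaluate_protocol(X, y, lambdas(j), 10);   % 10 runs per value
end
```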

Figure 2: Regularized Gradient Descent: variation of the mean logistic loss as a function of λ (training and test loss vs. the regularization parameter λ, 10^{-4} to 10^{0}).

6.3.2 Effect of the Learning Rate η

Figure 3 shows the variation of the logistic loss for different learning rates. Varying the learning rate did not have a pronounced effect on the test and training losses. The lowest test loss was recorded at η = 0.2, and the losses were only slightly higher for the other learning rates. This could be due to the regularization: the 2-norm regularization of the weights ensures a faster convergence of the gradient, and hence even for low learning rates the redundant weights do not influence the loss function.

6.4 Logistic Regression using Stochastic Gradient Descent with Simulated Annealing

Logistic regression using stochastic gradient descent was implemented as explained in Section 5. A 5-fold cross validation over 10 runs was conducted to obtain the mean logistic loss on the test and validation data sets. After studying the effects of the different parameters (α, η, and the number of passes), the parameters were set at passes = 10, η = 0.2, and α = 0.5.

6.4.1 Effect of Varying the Number of Passes

Figure 4 shows the variation of the mean logistic loss on the test and validation sets as the number of training passes (iterations over the data) is varied. The loss is lowest at 10 passes, but the differences in loss across different numbers of passes are not significant, and it is unclear whether an optimal number of passes exists for stochastic gradient descent, since the weights are updated after seeing each example and the gradient values are therefore noisy after each update. However, when the number of passes is restricted to 1, the loss on both the test and training sets increases significantly. This is because with a single pass the learning rate remains constant for the entire algorithm, whereas with more passes the learning rate is reduced by a factor of α^{i-1}, where i is the pass number, and this helps in controlling the weight vector through simulated annealing.
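The per-example update of Section 5, with the learning rate cooled on each pass, could be sketched as follows; the random ordering of the examples within each pass is an assumption not stated in the report.

```matlab
function w = logreg_sgd_annealed(X, y, eta, alpha, num_passes)
% Stochastic gradient descent with simulated annealing (Section 5):
% one weight update per example, with the learning rate cooled by
% alpha^(i-1) on pass i.
    [n, d] = size(X);
    w = zeros(d, 1);
    sigmoid = @(z) 1 ./ (1 + exp(-z));

    for i = 1:num_passes
        rate = eta * alpha^(i - 1);                 % cooled learning rate for this pass
        for t = randperm(n)                         % visit the examples in random order
            x_t = X(t, :)';                         % d-by-1 feature vector
            yhat_t = sigmoid(w' * x_t);             % prediction for this example
            w = w - rate * (yhat_t - y(t)) * x_t;   % single-example update
        end
    end
end
```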

Figure 3: Regularized Gradient Descent: mean logistic loss (training and test loss) as a function of the learning rate η.

Figure 4: Stochastic Gradient Descent: mean logistic loss (training and test loss) as a function of the number of passes (1 to 10).

6.4.2 Effect of Varying η

Figure 5 shows the variation of the mean logistic loss as a function of the learning rate η. Clearly, higher learning rates do not work well for stochastic gradient descent. This could be due to the randomness in the gradients: the weight vector is not controlled at higher learning rates, even with a low cooling rate of α = 0.5. From the figure, η = 0.2 seems to be a good value for the learning rate parameter.

Figure 5: Stochastic Gradient Descent: mean logistic loss (training and test loss) as a function of the learning rate η.

6.4.3 Effect of Varying α

Figure 6 shows the variation of the mean logistic loss as a function of the cooling rate α. As explained in Section 5, α gradually reduces the learning rate applied to the weight vector. The performance of the algorithm was similar over the range 0.5 ≤ α ≤ 0.95, but when α < 0.5 the loss on both the test and training sets went up, and the same was the case when α = 1 (a constant learning rate). This shows that the simulated annealing technique helps in a faster and better convergence of the gradient.
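As a concrete illustration of the cooling schedule, assuming the values quoted above (η = 0.2 and α = 0.5), the effective step size on pass i is

    η · α^{i-1} = 0.2, 0.1, 0.05, 0.025, ...   for i = 1, 2, 3, 4, ...

so that by the tenth pass it has fallen to about 0.2 · 0.5^9 ≈ 4 × 10^{-4}, and later passes make only small adjustments to the weights.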

Figure 6: Stochastic Gradient Descent: mean logistic loss (training and test loss) as a function of the cooling rate α.

7 Conclusion

Three different logistic loss minimization techniques were implemented and studied: normal gradient descent with varying gradient stopping points, regularized gradient descent with different λ and η values, and stochastic (online) gradient descent with simulated annealing with varying α and η values.

In the case of normal gradient descent, it was observed that early stopping of the gradient helped prevent overfitting on the training data and thus improved the performance on the test set. Using a 2-norm regularizer of the weights along with the logistic loss helped in obtaining a faster and better convergence of the gradient; the regularizer acted like a relative entropy, controlling the learning of the weights, which prevented overfitting on the training data and led to better performance on the test set. In the stochastic gradient descent with simulated annealing technique, overfitting was prevented by starting with a low learning rate η and further reducing the learning rate using the cooling rate α. Since the gradient values vary on each update, it is still unclear how to optimally control the weight vector. An extension of this algorithm would be to include a relative entropy term in the minimization function and then apply stochastic gradient descent with varying learning rates.

8 References

[1] Manfred K. Warmuth. Shrink-Stretch of labels for regularizing logistic regression.
[2] Wikipedia, "Simulated annealing".