Logistic Regression for Spam Filtering


 Marcus Wright
 3 years ago
 Views:
Transcription
1 Logistic Regression for Spam Filtering Nikhila Arkalgud February 14, 28 Abstract The goal of the spam filtering problem is to identify an as a spam or not spam. One of the classic techniques used in spam filtering is to predict using logistic regression. Words that frequently occur in a spam are used as the feature set in the regression problem. This in this report, we examine some of the different techniques used for minimizing the logistic loss function and provide a performance analysis of the differnt techniques. Specifically three diffrent types of minimization techniques were implemented and tested: Regular Batch Gradient Descent Algorithm, Regularized Gradient Descent Algorithm and Stochastic Gradient Descent Algorithm. 1 Introduction and Problem Description What is SPAM?  One of the definition could be, electronic junk mail or junk newsgroup postings. Some people define spam even more generally as any unsolicited . And a spam filter is a software tool used to classify spam s from genuine s. Hence the spam filter predicts which class the belongs to spam/no spam. This problem has been addressed using several techniques such as SVMs, Naive Bayes, and Logistic Regression. Logistic regression is a model used for prediction of the probability of occurrence of an event. It makes use of several predictor variables (features) that may be either numerical or categories. Other names for logistic regression used in various other application areas include logistic model, logit model, and maximumentropy classifier. Logistic regression is one of a class of models known as generalized linear models. In this report three different minimization techniques have been studied to minimize the logistic loss function. The Normal Gradient Descent, Regularized Gradient Descent and the Stochastic Gradient Descent. The Logistic Regression algorithm is introduced in section[2], the minimization techniques are explained in detail in the sections[3,4 and 5]. A detailed experimental analysis is provided in section6. 2 Logistic Regression An explanation of logistic regression begins with an explanation of the logistic function(also called the sigmoid 1 function): f(z) =. The logistic function is useful because it can take as an input, any value from 1 + e z negative infinity to positive infinity, whereas the output is confined to values between and 1. The variable, z represents the exposure to some set of risk factors, while f(z) represents the probability of a particular outcome, given that set of risk factors. The variable z is a measure of the total contribution of all the risk factors used in the model and is known as the logit. The variable z is usually defined as: z = n i=1 w ix i, where x 1..x n are the features and w 1..w n are the regression coefficients (weights). The Logistic Regression algorithm is as given below: 1. initialize weight vector to zero 2. train the features by minimizing the logistic loss while ( gradient 1 > precision) do calculate the new prediction ỹ vector using: Ỹ = e W.X 1
2 t calculate the gradient vector using: gradient = ((ỹ t y t )x t ) T update the weights using: w t+1 = w t + η.gradient 3. calculate the logistic loss on the test set using loss(y,ỹ) = yln( ỹ y ) + (1 y)ln(1 y 1 ỹ ) { ln(1 ỹ) = ln(1 + e = w.x ) if y = ln(ỹ) = ln(1 + e w.x ) w.x if y = 1 = negative log likelihood 3 Minimization of logistic loss using Normal(batch) Gradient Descent One of the standard methods used for minimization of any convex function is the method of Gradient Descent, where the optimal solution is found when the gradient of the function is equal to zero by taking steps proportional to the negative of the gradient of the function. As shown in the algorithm above, for each gradient step, the gradient of the loss vector for all the examples in the batch is found and the weights for each feature is updated. This is continued till the gradient is less than some threshold precision value. The gradient equation and the weight update equations are as given below: loss(y,ỹ) = yln( ỹ y ) + (1 y)ln(1 y 1 ỹ ) t gradient = ((ỹ t y t )x t ) T w t+1 = w t + η.gradient Gradient Descent could be time consuming as it may take many iterations to converge to a local minima. Gradient descent could also lead to overfitting of the data. Overfitting the training data is undesirable as it would lead to a higher loss on the test data. Hence it is common to find the gradient using other iterative methods. 4 Minimization of logistic function using Regularized Gradient Descent A common technique used to prevent overfitting of the training data is to regularise the weights. Regularization as defined in [1] is Any tunable method that increases the average loss on the training set, but decreases the average loss on the test set. Some of the techniques used in regularization are, to stop training early, to regularise with relative entropies, feature selection, clipping the range of labels etc. We have implemented regularization using the following minimization function: loss(y,ỹ) = yln( ỹ y ) + (1 y)ln(1 y 1 ỹ ) t inf (( 1 2η w 2 2) + (loss(y t,ỹ))) w T and train until the gradient: 1 η w + 1 T t (ỹ y t)x t 1 precision 5 Minimization of logistic function using Stochastic Gradient Descent with Simulated Annealing In standard (or batch ) gradient descent, the true gradient is used to update the parameters of the model. The true gradient is usually the sum of the gradients caused by each individual training example. The 2
3 parameter vectors are adjusted by the negative of the true gradient multiplied by a step size. Therefore, batch gradient descent requires one sweep through the training set before any parameters can be changed. In stochastic (or online ) gradient descent, the true gradient is approximated by the gradient of the cost function only evaluated on a single training example. The parameters are then adjusted by an amount proportional to this approximate gradient. Therefore, the parameters of the model are updated after each training example. For large data sets, online gradient descent is found to be much faster than batch gradient descent. Instead of using a constant learning rate for each gradient update, a variable learning rate was implemented, whose value is gradually reduced to control the weight vectors. This technique is intutively similar to the Annealing technique[2] where the metal is heated to a high temprature and then gradually cooled. The heat causes the atoms to become unstuck from their initial positions (a local minimum of the internal energy) and wander randomly through states of higher energy; the slow cooling gives them more chances of finding configurations with lower internal energy than the initial one. We are cooling the learning rate η by an α i 1 factor where i is the iteration(or pass) number. The gradient and the weight update equations are as given below: loss(y,ỹ) = yln( ỹ y ) + (1 y)ln(1 y 1 ỹ ) t gradient = ((ỹ t y t )x t ) T w t+1 = w t + (η α t 1 ).gradient 6 Experimental Results Logistic regression algorithms were implemented in Matlab. Several tests were conducted, a detailed analysis of which will be presented in the following sections. The spam dataset (given in the class website) was used for analysis. The number of trials were limited to 2. In total 2 features were used for each example. 6.1 Cross Validation A 5 fold Cross validation over 1 runs were used to obtain the average training and on each algorithm. We have used 2 examples to train and test the algorithms. With each example containing 2 features (words). 1. for i= 1 to 1 permute data split into 3/4 training and 1/4 testing set perform 5 fold cross validation to determine the best model parameters:  partition the training set into 5 parts  for each of the 5 holdouts * train all models on the 4/5 part  training set * record the average logistic loss on 1/5 part  validation set  the best model is chosen as the one with the best average over 5 holdouts compute the best model by computing the average logistic loss on the 1/4 test set 2. compute the average performance of the best model on the 1 runs 6.2 Logistic Regression using Gradient Descent The regular(batch) gradient descent algorithm was implemented and the training and es were found using the 5 fold cross validation as stated in section[6.1]. Values were obtained for all of the precision values with 1 runs, except for.1, which completed only 5 runs after running for 2 days. Hence the training loss and were computed for only 5 runs. 3
4 6.2.1 Effect of Early Stopping of the training By running the gradient descent algorithm for minimization of the total logistic loss and stopping the training early (not allow the gradient to go to zero) amounts to implicit regularization as the weights are initially small. Figure[1] shows the variation of the training loss and as the precision values are varied. It can be seen that for precision 1 4 the training leads to overfitting of data due to which the training loss is very low, but the is high. But as we reduce the precision(early stopping), the training loss increases, that is we control the weights, and hence the is lowered. Again a slight increase in and training loss is observed at precision 1, this could be due to the fact that the weights were not trained enough. Hence from the graph, we can select.1 as a good value for the stopping point of the gradient..35 Early Stopping for Logistic Regression using normal batch gradient descent Test Loss Training Loss Gradient at stopping point Figure 1: Normal Gradient Descent: Variation of mean logistic loss as a function of gradient stopping point 6.3 Logistic Regression using Regularized Gradient Descent Logistic regression using regularised gradient descent was implemented as explained in section[4]. A 5fold crossvalidation over 1 runs were conducted to obtain the mean logistic loss on the test and validation data sets. After conducting the tests, precision set at.1 and λ=.1 and η=.2 were found to be good choices for the parameters Effect of regularization parameter λ Figure[2] shows the variation of the training loss and the with respect to the regularization parameter λ. It was observed that for λ.1 the logistic test and training losses remained almost constant as expected, thus implying, regularization helps in preventing overfitting. For lower values of λ it is observed that the effect of the regularization is reduced and hence overfitting on the trainng data is observed due to which high es are recorded. 4
5 .5.45 Logistic Regression using 2 Norm regularizer training loss Regularization Parameter: lamda Figure 2: Regularized Gradient Descent: Variation of mean logistic loss as a function of λ Effect of learning rate η Figure[3] shows the variation of the logistic loss for different learning rates. It was observed that varying the learning rates did not have a pronounced effect on the test and training losses. The lowest was recorded at η =.2 and the losses were slightly higher for the other learning rates. This could be due to the regularization. The 2 norm regularization of weights is ensuring a faster convergence of the gradient, and hence even for low learning rates, the redundant weights are not influencing the loss function. 6.4 Logistic Regression using Stochastic Gradient Descent with Simulated Annealing Logistic Regression using Stochastic Gradient Descent was implemented as explained in section[5]. A 5fold crossvalidation over 1 runs were conducted to obtain the mean logistic loss on the test and validation data sets. After studing the varies effects of the different parameters (α, η, passes), the parameters were set at, passes=1, η=.2 and α= Effect of varying the number of passes Figure[4] shows the variation of the mean logistic loss on the test and the validation set as the number of training passes(iterations) is varied. It can be seen that the loss is the lowest for pass=1. But the difference in the loss values is not significantly different for different number of passes. It is unclear if an optimal number of passes exists for stochastic gradient descent. Since the weights are updated after seeing each example, the gradient values would be random after each update. But when the number of passes is restricted to 1, the loss on both the test set and training sets increase significantly. This is due to the fact that when pass=1, the learning rate remains a constant for the entire algorithm. But when the number of passes is increased, the learning rate is reduced by an α i 1 factor, where i is the iteration number. And this helps in controlling the weight vector due to simulated annealing. 5
6 .22 vs learning rate eta training loss Learning Rate eta Figure 3: Regularized gradient Descent: as a function of η.12 Stochastic Gradient Descent: Mean logistic loss vs Number of Passes train loss Number of passes Figure 4: Stochastic Gradient Descent: as a function of number of passes 6
7 6.4.2 Effect of varying η Figure[5] shows the variation of the mean logistic loss as a function of the learning rate η. Clearly, higher learning rates do not work well for stochastic gradient descent. This could be due to the randomness in the gradients. The weight vectors are not controlled at higher learning rates even with a low cooling rate of α =.5. From the figure[5], η =.2 seems to be a good value as a learning rate parameter Stochastic Gradient Descent: vs Learning Rate eta training loss 1.2 Mean Logistic loss Learning rate eta Figure 5: Stochastic Gradient Descent: as a function of learning rate η Effect of varying α Figure[6] shows the variation of the mean logistic loss as a function of the cooling rate α. As explained in section[5], α helps in gradually reducing the learning rate of the weight vector. It was observed that the performance of the algorithm was similar for a range of.5 α.95. But when α.5, the loss on both the test and the training sets went up, same was the case when α = 1(constant learning rate). This shows that using simulated annealing technique helps in faster and better convergence of the gradient. 7
8 Stochastic Gradient Descent: vs Cooling Rate alpha training loss Cooling Rate alpha Figure 6: Stochastic Gradient Descent: as a function of Cooling rate α 7 Conclusion Three different logistic loss minimization techniques were implemented and studied. Normal Gradient Descent with varying gradient stopping points, Regularized Gradient Descent with different λ and η values, and the Stochastic(online) Gradient Descent with Simulated Annealing with varying α and η values were studied. Incase of a normal gradient descent, it was observed that, early stopping of the gradient helped in preventing overfitting on the training data and thus improved the performance on the test set. Using a 2norm regularizer of the weights along with the logistic loss helped in obtaining a faster and better convergence of the gradient. This was due the fact that the regularizer acted as a relative entropy, thus controlling the learning of the weights. This prevented the weights from overfitting on the training data, leading to a better performance on the test set. In the stochastic gradient descent with simulated annealing technique, the over fitting was prevented by starting with a low learning rate η and further reducing the learning rate using the cooling rate α. Since the gradient values vary on each update it is still unclear how to optimally control the weight vector. An extension on this algorithm would be include a relative entropy term in the minimization function and then apply stochastic gradient descent with varying learning rates. 8 References [1] ShrikStretch of labels for regularizing logistic regression by Manfred K. Warmuth [2] Wikipedia on Simulated Annealing 8
Introduction to Logistic Regression
OpenStaxCNX module: m42090 1 Introduction to Logistic Regression Dan Calderon This work is produced by OpenStaxCNX and licensed under the Creative Commons Attribution License 3.0 Abstract Gives introduction
More informationData Mining. Nonlinear Classification
Data Mining Unit # 6 Sajjad Haider Fall 2014 1 Nonlinear Classification Classes may not be separable by a linear boundary Suppose we randomly generate a data set as follows: X has range between 0 to 15
More informationCS 688 Pattern Recognition Lecture 4. Linear Models for Classification
CS 688 Pattern Recognition Lecture 4 Linear Models for Classification Probabilistic generative models Probabilistic discriminative models 1 Generative Approach ( x ) p C k p( C k ) Ck p ( ) ( x Ck ) p(
More informationLogistic Regression. Vibhav Gogate The University of Texas at Dallas. Some Slides from Carlos Guestrin, Luke Zettlemoyer and Dan Weld.
Logistic Regression Vibhav Gogate The University of Texas at Dallas Some Slides from Carlos Guestrin, Luke Zettlemoyer and Dan Weld. Generative vs. Discriminative Classifiers Want to Learn: h:x Y X features
More informationData Mining  Evaluation of Classifiers
Data Mining  Evaluation of Classifiers Lecturer: JERZY STEFANOWSKI Institute of Computing Sciences Poznan University of Technology Poznan, Poland Lecture 4 SE Master Course 2008/2009 revised for 2010
More informationMachine Learning and Pattern Recognition Logistic Regression
Machine Learning and Pattern Recognition Logistic Regression Course Lecturer:Amos J Storkey Institute for Adaptive and Neural Computation School of Informatics University of Edinburgh Crichton Street,
More informationMachine Learning Logistic Regression
Machine Learning Logistic Regression Jeff Howbert Introduction to Machine Learning Winter 2012 1 Logistic regression Name is somewhat misleading. Really a technique for classification, not regression.
More informationStatistical Machine Learning
Statistical Machine Learning UoC Stats 37700, Winter quarter Lecture 4: classical linear and quadratic discriminants. 1 / 25 Linear separation For two classes in R d : simple idea: separate the classes
More informationMaking Sense of the Mayhem: Machine Learning and March Madness
Making Sense of the Mayhem: Machine Learning and March Madness Alex Tran and Adam Ginzberg Stanford University atran3@stanford.edu ginzberg@stanford.edu I. Introduction III. Model The goal of our research
More informationQuestion 2 Naïve Bayes (16 points)
Question 2 Naïve Bayes (16 points) About 2/3 of your email is spam so you downloaded an open source spam filter based on word occurrences that uses the Naive Bayes classifier. Assume you collected the
More informationExample: Credit card default, we may be more interested in predicting the probabilty of a default than classifying individuals as default or not.
Statistical Learning: Chapter 4 Classification 4.1 Introduction Supervised learning with a categorical (Qualitative) response Notation:  Feature vector X,  qualitative response Y, taking values in C
More informationSupervised Learning (Big Data Analytics)
Supervised Learning (Big Data Analytics) Vibhav Gogate Department of Computer Science The University of Texas at Dallas Practical advice Goal of Big Data Analytics Uncover patterns in Data. Can be used
More informationImproving Generalization
Improving Generalization Introduction to Neural Networks : Lecture 10 John A. Bullinaria, 2004 1. Improving Generalization 2. Training, Validation and Testing Data Sets 3. CrossValidation 4. Weight Restriction
More informationNatural Language Processing. Today. Logistic Regression Models. Lecture 13 10/6/2015. Jim Martin. Multinomial Logistic Regression
Natural Language Processing Lecture 13 10/6/2015 Jim Martin Today Multinomial Logistic Regression Aka loglinear models or maximum entropy (maxent) Components of the model Learning the parameters 10/1/15
More informationIntroduction to Online Learning Theory
Introduction to Online Learning Theory Wojciech Kot lowski Institute of Computing Science, Poznań University of Technology IDSS, 04.06.2013 1 / 53 Outline 1 Example: Online (Stochastic) Gradient Descent
More information1 Maximum likelihood estimation
COS 424: Interacting with Data Lecturer: David Blei Lecture #4 Scribes: Wei Ho, Michael Ye February 14, 2008 1 Maximum likelihood estimation 1.1 MLE of a Bernoulli random variable (coin flips) Given N
More informationSTA 4273H: Statistical Machine Learning
STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.cs.toronto.edu/~rsalakhu/ Lecture 6 Three Approaches to Classification Construct
More informationPATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION
PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION Introduction In the previous chapter, we explored a class of regression models having particularly simple analytical
More informationAzure Machine Learning, SQL Data Mining and R
Azure Machine Learning, SQL Data Mining and R Daybyday Agenda Prerequisites No formal prerequisites. Basic knowledge of SQL Server Data Tools, Excel and any analytical experience helps. Best of all:
More informationLinear Threshold Units
Linear Threshold Units w x hx (... w n x n w We assume that each feature x j and each weight w j is a real number (we will relax this later) We will study three different algorithms for learning linear
More informationT61.3050 : Email Classification as Spam or Ham using Naive Bayes Classifier. Santosh Tirunagari : 245577
T61.3050 : Email Classification as Spam or Ham using Naive Bayes Classifier Santosh Tirunagari : 245577 January 20, 2011 Abstract This term project gives a solution how to classify an email as spam or
More informationPredicting the Stock Market with News Articles
Predicting the Stock Market with News Articles Kari Lee and Ryan Timmons CS224N Final Project Introduction Stock market prediction is an area of extreme importance to an entire industry. Stock price is
More informationActive Learning with Boosting for Spam Detection
Active Learning with Boosting for Spam Detection Nikhila Arkalgud Last update: March 22, 2008 Active Learning with Boosting for Spam Detection Last update: March 22, 2008 1 / 38 Outline 1 Spam Filters
More informationLogistic Regression (1/24/13)
STA63/CBB540: Statistical methods in computational biology Logistic Regression (/24/3) Lecturer: Barbara Engelhardt Scribe: Dinesh Manandhar Introduction Logistic regression is model for regression used
More informationA Content based Spam Filtering Using Optical Back Propagation Technique
A Content based Spam Filtering Using Optical Back Propagation Technique Sarab M. Hameed 1, Noor Alhuda J. Mohammed 2 Department of Computer Science, College of Science, University of Baghdad  Iraq ABSTRACT
More informationCOMP 598 Applied Machine Learning Lecture 21: Parallelization methods for largescale machine learning! Big Data by the numbers
COMP 598 Applied Machine Learning Lecture 21: Parallelization methods for largescale machine learning! Instructor: (jpineau@cs.mcgill.ca) TAs: PierreLuc Bacon (pbacon@cs.mcgill.ca) Ryan Lowe (ryan.lowe@mail.mcgill.ca)
More informationKnowledge Discovery and Data Mining
Knowledge Discovery and Data Mining Unit # 11 Sajjad Haider Fall 2013 1 Supervised Learning Process Data Collection/Preparation Data Cleaning Discretization Supervised/Unuspervised Identification of right
More informationPMR5406 Redes Neurais e Lógica Fuzzy Aula 3 Multilayer Percetrons
PMR5406 Redes Neurais e Aula 3 Multilayer Percetrons Baseado em: Neural Networks, Simon Haykin, PrenticeHall, 2 nd edition Slides do curso por Elena Marchiori, Vrie Unviersity Multilayer Perceptrons Architecture
More informationPractical Data Science with Azure Machine Learning, SQL Data Mining, and R
Practical Data Science with Azure Machine Learning, SQL Data Mining, and R Overview This 4day class is the first of the two data science courses taught by Rafal Lukawiecki. Some of the topics will be
More informationThe Artificial Prediction Market
The Artificial Prediction Market Adrian Barbu Department of Statistics Florida State University Joint work with Nathan Lay, Siemens Corporate Research 1 Overview Main Contributions A mathematical theory
More informationProbabilistic Linear Classification: Logistic Regression. Piyush Rai IIT Kanpur
Probabilistic Linear Classification: Logistic Regression Piyush Rai IIT Kanpur Probabilistic Machine Learning (CS772A) Jan 18, 2016 Probabilistic Machine Learning (CS772A) Probabilistic Linear Classification:
More informationMachine Learning in Spam Filtering
Machine Learning in Spam Filtering A Crash Course in ML Konstantin Tretyakov kt@ut.ee Institute of Computer Science, University of Tartu Overview Spam is Evil ML for Spam Filtering: General Idea, Problems.
More informationSimple and efficient online algorithms for real world applications
Simple and efficient online algorithms for real world applications Università degli Studi di Milano Milano, Italy Talk @ Centro de Visión por Computador Something about me PhD in Robotics at LIRALab,
More information1 Introduction. 2 Prediction with Expert Advice. Online Learning 9.520 Lecture 09
1 Introduction Most of the course is concerned with the batch learning problem. In this lecture, however, we look at a different model, called online. Let us first compare and contrast the two. In batch
More informationLecture 6: Logistic Regression
Lecture 6: CS 19410, Fall 2011 Laurent El Ghaoui EECS Department UC Berkeley September 13, 2011 Outline Outline Classification task Data : X = [x 1,..., x m]: a n m matrix of data points in R n. y { 1,
More informationCSCI567 Machine Learning (Fall 2014)
CSCI567 Machine Learning (Fall 2014) Drs. Sha & Liu {feisha,yanliu.cs}@usc.edu September 22, 2014 Drs. Sha & Liu ({feisha,yanliu.cs}@usc.edu) CSCI567 Machine Learning (Fall 2014) September 22, 2014 1 /
More informationLecture 3: Linear methods for classification
Lecture 3: Linear methods for classification Rafael A. Irizarry and Hector Corrada Bravo February, 2010 Today we describe four specific algorithms useful for classification problems: linear regression,
More informationPredict Influencers in the Social Network
Predict Influencers in the Social Network Ruishan Liu, Yang Zhao and Liuyu Zhou Email: rliu2, yzhao2, lyzhou@stanford.edu Department of Electrical Engineering, Stanford University Abstract Given two persons
More informationPredicting borrowers chance of defaulting on credit loans
Predicting borrowers chance of defaulting on credit loans Junjie Liang (junjie87@stanford.edu) Abstract Credit score prediction is of great interests to banks as the outcome of the prediction algorithm
More informationIntroduction to Machine Learning and Data Mining. Prof. Dr. Igor Trajkovski trajkovski@nyus.edu.mk
Introduction to Machine Learning and Data Mining Prof. Dr. Igor Trakovski trakovski@nyus.edu.mk Neural Networks 2 Neural Networks Analogy to biological neural systems, the most robust learning systems
More informationProbabilistic user behavior models in online stores for recommender systems
Probabilistic user behavior models in online stores for recommender systems Tomoharu Iwata Abstract Recommender systems are widely used in online stores because they are expected to improve both user
More informationThese slides follow closely the (English) course textbook Pattern Recognition and Machine Learning by Christopher Bishop
Music and Machine Learning (IFT6080 Winter 08) Prof. Douglas Eck, Université de Montréal These slides follow closely the (English) course textbook Pattern Recognition and Machine Learning by Christopher
More informationThe Impact of Big Data on Classic Machine Learning Algorithms. Thomas Jensen, Senior Business Analyst @ Expedia
The Impact of Big Data on Classic Machine Learning Algorithms Thomas Jensen, Senior Business Analyst @ Expedia Who am I? Senior Business Analyst @ Expedia Working within the competitive intelligence unit
More informationMachine Learning Final Project Spam Email Filtering
Machine Learning Final Project Spam Email Filtering March 2013 Shahar Yifrah Guy Lev Table of Content 1. OVERVIEW... 3 2. DATASET... 3 2.1 SOURCE... 3 2.2 CREATION OF TRAINING AND TEST SETS... 4 2.3 FEATURE
More informationData Mining Algorithms Part 1. Dejan Sarka
Data Mining Algorithms Part 1 Dejan Sarka Join the conversation on Twitter: @DevWeek #DW2015 Instructor Bio Dejan Sarka (dsarka@solidq.com) 30 years of experience SQL Server MVP, MCT, 13 books 7+ courses
More informationChapter 6. The stacking ensemble approach
82 This chapter proposes the stacking ensemble approach for combining different data mining classifiers to get better performance. Other combination techniques like voting, bagging etc are also described
More informationMAXIMIZING RETURN ON DIRECT MARKETING CAMPAIGNS
MAXIMIZING RETURN ON DIRET MARKETING AMPAIGNS IN OMMERIAL BANKING S 229 Project: Final Report Oleksandra Onosova INTRODUTION Recent innovations in cloud computing and unified communications have made a
More informationOn Attacking Statistical Spam Filters
On Attacking Statistical Spam Filters Gregory L. Wittel and S. Felix Wu Department of Computer Science University of California, Davis One Shields Avenue, Davis, CA 95616 USA Paper review by Deepak Chinavle
More informationCheng Soon Ong & Christfried Webers. Canberra February June 2016
c Cheng Soon Ong & Christfried Webers Research Group and College of Engineering and Computer Science Canberra February June (Many figures from C. M. Bishop, "Pattern Recognition and ") 1of 31 c Part I
More informationDistributed Machine Learning and Big Data
Distributed Machine Learning and Big Data Sourangshu Bhattacharya Dept. of Computer Science and Engineering, IIT Kharagpur. http://cse.iitkgp.ac.in/~sourangshu/ August 21, 2015 Sourangshu Bhattacharya
More informationClassification using Logistic Regression
Classification using Logistic Regression Ingmar Schuster Patrick Jähnichen using slides by Andrew Ng Institut für Informatik This lecture covers Logistic regression hypothesis Decision Boundary Cost function
More informationCross Validation. Dr. Thomas Jensen Expedia.com
Cross Validation Dr. Thomas Jensen Expedia.com About Me PhD from ETH Used to be a statistician at Link, now Senior Business Analyst at Expedia Manage a database with 720,000 Hotels that are not on contract
More informationLecture 13: Validation
Lecture 3: Validation g Motivation g The Holdout g Resampling techniques g Threeway data splits Motivation g Validation techniques are motivated by two fundamental problems in pattern recognition: model
More informationA semisupervised Spam mail detector
A semisupervised Spam mail detector Bernhard Pfahringer Department of Computer Science, University of Waikato, Hamilton, New Zealand Abstract. This document describes a novel semisupervised approach
More informationMachine Learning Big Data using Map Reduce
Machine Learning Big Data using Map Reduce By Michael Bowles, PhD Where Does Big Data Come From? Web data (web logs, click histories) ecommerce applications (purchase histories) Retail purchase histories
More informationLecture 2: The SVM classifier
Lecture 2: The SVM classifier C19 Machine Learning Hilary 2015 A. Zisserman Review of linear classifiers Linear separability Perceptron Support Vector Machine (SVM) classifier Wide margin Cost function
More informationINTRODUCTION TO NEURAL NETWORKS
INTRODUCTION TO NEURAL NETWORKS Pictures are taken from http://www.cs.cmu.edu/~tom/mlbookchapterslides.html http://research.microsoft.com/~cmbishop/prml/index.htm By Nobel Khandaker Neural Networks An
More informationIdentifying SPAM with Predictive Models
Identifying SPAM with Predictive Models Dan Steinberg and Mikhaylo Golovnya Salford Systems 1 Introduction The ECMLPKDD 2006 Discovery Challenge posed a topical problem for predictive modelers: how to
More informationNeural Networks. CAP5610 Machine Learning Instructor: GuoJun Qi
Neural Networks CAP5610 Machine Learning Instructor: GuoJun Qi Recap: linear classifier Logistic regression Maximizing the posterior distribution of class Y conditional on the input vector X Support vector
More informationFoundations of Machine Learning OnLine Learning. Mehryar Mohri Courant Institute and Google Research mohri@cims.nyu.edu
Foundations of Machine Learning OnLine Learning Mehryar Mohri Courant Institute and Google Research mohri@cims.nyu.edu Motivation PAC learning: distribution fixed over time (training and test). IID assumption.
More informationCross Validation techniques in R: A brief overview of some methods, packages, and functions for assessing prediction models.
Cross Validation techniques in R: A brief overview of some methods, packages, and functions for assessing prediction models. Dr. Jon Starkweather, Research and Statistical Support consultant This month
More informationFINDING SUBGROUPS OF ENHANCED TREATMENT EFFECT. Jeremy M G Taylor Jared Foster University of Michigan Steve Ruberg Eli Lilly
FINDING SUBGROUPS OF ENHANCED TREATMENT EFFECT Jeremy M G Taylor Jared Foster University of Michigan Steve Ruberg Eli Lilly 1 1. INTRODUCTION and MOTIVATION 2. PROPOSED METHOD Random Forests Classification
More informationLogistic Regression. Jia Li. Department of Statistics The Pennsylvania State University. Logistic Regression
Logistic Regression Department of Statistics The Pennsylvania State University Email: jiali@stat.psu.edu Logistic Regression Preserve linear classification boundaries. By the Bayes rule: Ĝ(x) = arg max
More informationData Mining Practical Machine Learning Tools and Techniques
Ensemble learning Data Mining Practical Machine Learning Tools and Techniques Slides for Chapter 8 of Data Mining by I. H. Witten, E. Frank and M. A. Hall Combining multiple models Bagging The basic idea
More informationA Logistic Regression Approach to Ad Click Prediction
A Logistic Regression Approach to Ad Click Prediction Gouthami Kondakindi kondakin@usc.edu Satakshi Rana satakshr@usc.edu Aswin Rajkumar aswinraj@usc.edu Sai Kaushik Ponnekanti ponnekan@usc.edu Vinit Parakh
More informationData Mining Methods: Applications for Institutional Research
Data Mining Methods: Applications for Institutional Research Nora Galambos, PhD Office of Institutional Research, Planning & Effectiveness Stony Brook University NEAIR Annual Conference Philadelphia 2014
More informationSocial Media Mining. Data Mining Essentials
Introduction Data production rate has been increased dramatically (Big Data) and we are able store much more data than before E.g., purchase data, social media data, mobile phone data Businesses and customers
More information11. Analysis of Casecontrol Studies Logistic Regression
Research methods II 113 11. Analysis of Casecontrol Studies Logistic Regression This chapter builds upon and further develops the concepts and strategies described in Ch.6 of Mother and Child Health:
More informationNEURAL NETWORKS A Comprehensive Foundation
NEURAL NETWORKS A Comprehensive Foundation Second Edition Simon Haykin McMaster University Hamilton, Ontario, Canada Prentice Hall Prentice Hall Upper Saddle River; New Jersey 07458 Preface xii Acknowledgments
More informationInsurance Analytics  analýza dat a prediktivní modelování v pojišťovnictví. Pavel Kříž. Seminář z aktuárských věd MFF 4.
Insurance Analytics  analýza dat a prediktivní modelování v pojišťovnictví Pavel Kříž Seminář z aktuárských věd MFF 4. dubna 2014 Summary 1. Application areas of Insurance Analytics 2. Insurance Analytics
More information3F3: Signal and Pattern Processing
3F3: Signal and Pattern Processing Lecture 3: Classification Zoubin Ghahramani zoubin@eng.cam.ac.uk Department of Engineering University of Cambridge Lent Term Classification We will represent data by
More informationRegression Clustering
Chapter 449 Introduction This algorithm provides for clustering in the multiple regression setting in which you have a dependent variable Y and one or more independent variables, the X s. The algorithm
More informationKnowledge Discovery and Data Mining
Knowledge Discovery and Data Mining Unit # 6 Sajjad Haider Fall 2014 1 Evaluating the Accuracy of a Classifier Holdout, random subsampling, crossvalidation, and the bootstrap are common techniques for
More informationAn Introduction to Neural Networks
An Introduction to Vincent Cheung Kevin Cannons Signal & Data Compression Laboratory Electrical & Computer Engineering University of Manitoba Winnipeg, Manitoba, Canada Advisor: Dr. W. Kinsner May 27,
More informationBIDM Project. Predicting the contract type for IT/ITES outsourcing contracts
BIDM Project Predicting the contract type for IT/ITES outsourcing contracts N a n d i n i G o v i n d a r a j a n ( 6 1 2 1 0 5 5 6 ) The authors believe that data modelling can be used to predict if an
More informationSentiment analysis using emoticons
Sentiment analysis using emoticons Royden Kayhan Lewis Moharreri Steven Royden Ware Lewis Kayhan Steven Moharreri Ware Department of Computer Science, Ohio State University Problem definition Our aim was
More informationArtificial Neural Networks and Support Vector Machines. CS 486/686: Introduction to Artificial Intelligence
Artificial Neural Networks and Support Vector Machines CS 486/686: Introduction to Artificial Intelligence 1 Outline What is a Neural Network?  Perceptron learners  Multilayer networks What is a Support
More informationEvolutionary Detection of Rules for Text Categorization. Application to Spam Filtering
Advances in Intelligent Systems and Technologies Proceedings ECIT2004  Third European Conference on Intelligent Systems and Technologies Iasi, Romania, July 2123, 2004 Evolutionary Detection of Rules
More informationDetecting Email Spam. MGS 8040, Data Mining. Audrey Gies Matt Labbe Tatiana Restrepo
Detecting Email Spam MGS 8040, Data Mining Audrey Gies Matt Labbe Tatiana Restrepo 5 December 2011 INTRODUCTION This report describes a model that may be used to improve likelihood of recognizing undesirable
More informationAsymmetric Gradient Boosting with Application to Spam Filtering
Asymmetric Gradient Boosting with Application to Spam Filtering Jingrui He Carnegie Mellon University 5 Forbes Avenue Pittsburgh, PA 523 USA jingruih@cs.cmu.edu ABSTRACT In this paper, we propose a new
More informationEnsemble Methods. Knowledge Discovery and Data Mining 2 (VU) (707.004) Roman Kern. KTI, TU Graz 20150305
Ensemble Methods Knowledge Discovery and Data Mining 2 (VU) (707004) Roman Kern KTI, TU Graz 20150305 Roman Kern (KTI, TU Graz) Ensemble Methods 20150305 1 / 38 Outline 1 Introduction 2 Classification
More informationHow I won the Chess Ratings: Elo vs the rest of the world Competition
How I won the Chess Ratings: Elo vs the rest of the world Competition Yannis Sismanis November 2010 Abstract This article discusses in detail the rating system that won the kaggle competition Chess Ratings:
More informationLecture 10: Regression Trees
Lecture 10: Regression Trees 36350: Data Mining October 11, 2006 Reading: Textbook, sections 5.2 and 10.5. The next three lectures are going to be about a particular kind of nonlinear predictive model,
More informationLinear Classification. Volker Tresp Summer 2015
Linear Classification Volker Tresp Summer 2015 1 Classification Classification is the central task of pattern recognition Sensors supply information about an object: to which class do the object belong
More informationChristfried Webers. Canberra February June 2015
c Statistical Group and College of Engineering and Computer Science Canberra February June (Many figures from C. M. Bishop, "Pattern Recognition and ") 1of 829 c Part VIII Linear Classification 2 Logistic
More informationIntroduction to Machine Learning Using Python. Vikram Kamath
Introduction to Machine Learning Using Python Vikram Kamath Contents: 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. Introduction/Definition Where and Why ML is used Types of Learning Supervised Learning Linear Regression
More informationRuntime Hardware Reconfiguration using Machine Learning
Runtime Hardware Reconfiguration using Machine Learning Tanmay Gangwani University of Illinois, UrbanaChampaign gangwan2@illinois.edu Abstract Tailoring the machine hardware to varying needs of the software
More informationSupporting Online Material for
www.sciencemag.org/cgi/content/full/313/5786/504/dc1 Supporting Online Material for Reducing the Dimensionality of Data with Neural Networks G. E. Hinton* and R. R. Salakhutdinov *To whom correspondence
More informationSemiSupervised Support Vector Machines and Application to Spam Filtering
SemiSupervised Support Vector Machines and Application to Spam Filtering Alexander Zien Empirical Inference Department, Bernhard Schölkopf Max Planck Institute for Biological Cybernetics ECML 2006 Discovery
More informationAuxiliary Variables in Mixture Modeling: 3Step Approaches Using Mplus
Auxiliary Variables in Mixture Modeling: 3Step Approaches Using Mplus Tihomir Asparouhov and Bengt Muthén Mplus Web Notes: No. 15 Version 8, August 5, 2014 1 Abstract This paper discusses alternatives
More informationAdaptive Online Gradient Descent
Adaptive Online Gradient Descent Peter L Bartlett Division of Computer Science Department of Statistics UC Berkeley Berkeley, CA 94709 bartlett@csberkeleyedu Elad Hazan IBM Almaden Research Center 650
More informationPattern Analysis. Logistic Regression. 12. Mai 2009. Joachim Hornegger. Chair of Pattern Recognition Erlangen University
Pattern Analysis Logistic Regression 12. Mai 2009 Joachim Hornegger Chair of Pattern Recognition Erlangen University Pattern Analysis 2 / 43 1 Logistic Regression Posteriors and the Logistic Function Decision
More informationLinear Models for Classification
Linear Models for Classification Sumeet Agarwal, EEL709 (Most figures from Bishop, PRML) Approaches to classification Discriminant function: Directly assigns each data point x to a particular class Ci
More informationW6.B.1. FAQs CS535 BIG DATA W6.B.3. 4. If the distance of the point is additionally less than the tight distance T 2, remove it from the original set
http://wwwcscolostateedu/~cs535 W6B W6B2 CS535 BIG DAA FAQs Please prepare for the last minute rush Store your output files safely Partial score will be given for the output from less than 50GB input Computer
More informationPoisson Models for Count Data
Chapter 4 Poisson Models for Count Data In this chapter we study loglinear models for count data under the assumption of a Poisson error structure. These models have many applications, not only to the
More informationSome Essential Statistics The Lure of Statistics
Some Essential Statistics The Lure of Statistics Data Mining Techniques, by M.J.A. Berry and G.S Linoff, 2004 Statistics vs. Data Mining..lie, damn lie, and statistics mining data to support preconceived
More informationIn this tutorial, we try to build a roc curve from a logistic regression.
Subject In this tutorial, we try to build a roc curve from a logistic regression. Regardless the software we used, even for commercial software, we have to prepare the following steps when we want build
More informationCCNY. BME I5100: Biomedical Signal Processing. Linear Discrimination. Lucas C. Parra Biomedical Engineering Department City College of New York
BME I5100: Biomedical Signal Processing Linear Discrimination Lucas C. Parra Biomedical Engineering Department CCNY 1 Schedule Week 1: Introduction Linear, stationary, normal  the stuff biology is not
More informationModels for Count Data With Overdispersion
Models for Count Data With Overdispersion Germán Rodríguez November 6, 2013 Abstract This addendum to the WWS 509 notes covers extrapoisson variation and the negative binomial model, with brief appearances
More informationChapter 13 Introduction to Nonlinear Regression( 非 線 性 迴 歸 )
Chapter 13 Introduction to Nonlinear Regression( 非 線 性 迴 歸 ) and Neural Networks( 類 神 經 網 路 ) 許 湘 伶 Applied Linear Regression Models (Kutner, Nachtsheim, Neter, Li) hsuhl (NUK) LR Chap 10 1 / 35 13 Examples
More information