Lab 13: Logistic Regression

Spam Emails

Today we will be working with a corpus of emails received by a single Gmail account over the first three months of 2012. Just like any other email address, this account sent and received regular email, as well as receiving a large amount of spam (unsolicited bulk email). We will use what we have learned about logistic regression to see if we can build a model that predicts whether or not a message is spam, based on a variety of characteristics of the email (e.g. inclusion of words like winner, inherit, or password; the number of exclamation marks used; etc.). While the spam filters used by large corporations like Google and Microsoft are quite a bit more complex, the fundamental idea is the same: binary classification based on a set of predictors.

Template for lab report

Write your report, or at least run the code and create the plots, as you go, so that if you get errors you can ask your TA for help on the spot. Knit often to more easily determine the source of any error.

download.file("http://stat.duke.edu/courses/spring13/sta102.001/lab/lab13.rmd",
              destfile = "lab13.rmd")

The data

download.file("http://stat.duke.edu/courses/spring13/sta102.001/lab/email.rdata",
              destfile = "email.rdata")
load("email.rdata")

List of variables:

1. spam: Indicator for whether the email was spam.
2. to_multiple: Indicator for whether the email was addressed to more than one recipient.
3. from: Whether the message was listed as from anyone (this is usually set by default for regular outgoing email).
4. cc: Indicator for whether anyone was CCed.
5. sent_email: Indicator for whether the sender had been sent an email in the last 30 days.
6. image: Indicates whether any images were attached.
7. attach: Indicates whether any files were attached.
8. dollar: Indicates whether a dollar sign or the word dollar appeared in the email.
9. winner: Indicates whether winner appeared in the email.
10. inherit: Indicates whether inherit (or an extension, such as inheritance) appeared in the email.
11. password: Indicates whether password appeared in the email.
12. num_char: The number of characters in the email, in thousands.
13. line_breaks: The number of line breaks in the email (does not count text wrapping).
14. format: Indicates whether the email was written using HTML (e.g. may have included bolding or active links) or plaintext.

15. re_subj: Indicates whether the subject started with Re:, RE:, re:, or re :.
16. exclaim_subj: Indicates whether there was an exclamation point in the subject.
17. urgent_subj: Indicates whether the word urgent was in the email subject.
18. exclaim_mess: The number of exclamation points in the email message.
19. number: Factor variable saying whether there was no number, a small number (under 1 million), or a big number.

Exercise 1 In this lab we will focus on predicting whether an email is spam or not. How many emails make up this data set? What proportion of the emails were spam?

A simple spam filter

We will start with a simple spam filter that uses only a single predictor, to_multiple, to classify a message as spam or not. To do this we will fit a logistic regression model of spam on to_multiple using the glm function. This is done in the same way that a simple or multiple regression model is fit in R, except that we use the glm function instead of lm, and we must indicate that we wish to fit a logistic model by including the argument family = binomial.

g_simple = glm(spam ~ to_multiple, data = email, family = binomial)
summary(g_simple)

Call:
glm(formula = spam ~ to_multiple, family = binomial, data = email)

Deviance Residuals:
   Min      1Q  Median      3Q     Max
-0.477  -0.477  -0.477  -0.477   2.809

Coefficients:
               Estimate Std. Error z value Pr(>|z|)
(Intercept)     -2.1161     0.0562  -37.67  < 2e-16 ***
to_multipleyes  -1.8092     0.2969   -6.09  1.1e-09 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 2437.2  on 3920  degrees of freedom
Residual deviance: 2372.0  on 3919  degrees of freedom
AIC: 2376

Number of Fisher Scoring iterations: 6

Exercise 2 Based on the results of this logistic regression, does the inclusion of multiple recipients make a message more or less likely to be spam? Explain your reasoning.

Exercise 3 Using these results, calculate the probability that a message is spam if it has multiple recipients. What is the probability if it does not have multiple recipients?

Exercise 4 Pick the one of the other 17 remaining predictors that you think is most likely to predict an email's spam status and fit a glm model using it. Describe how this predictor affects the probability of an email being spam.

A more complex spam filter

We will now fit the fully specified model by including all the predictors in our model. Note that we use . as shorthand to indicate that we wish to include all predictors in our model. For the time being we will ignore the warning that is produced when we fit the model.

g_full = glm(spam ~ ., data = email, family = binomial)

Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

summary(g_full)

Call:
glm(formula = spam ~ ., family = binomial, data = email)

Deviance Residuals:
   Min      1Q  Median      3Q     Max
-2.001  -0.437  -0.175   0.000   3.390

Coefficients:
                 Estimate Std. Error z value Pr(>|z|)
(Intercept)      2.00e+01   5.92e+03    0.00  0.99731
to_multipleyes  -2.64e+00   3.31e-01   -7.96  1.7e-15 ***
fromyes         -2.06e+01   5.92e+03    0.00  0.99722
ccyes           -5.98e-01   3.41e-01   -1.75  0.07947 .
sent_emailyes   -1.71e+01   2.84e+02   -0.06  0.95195
imageyes        -1.72e+00   8.01e-01   -2.14  0.03210 *
attachyes        1.28e+00   2.71e-01    4.72  2.3e-06 ***
dollaryes        1.67e-01   1.81e-01    0.92  0.35516
winneryes        1.92e+00   3.61e-01    5.33  1.0e-07 ***
inherityes       3.08e-02   3.40e-01    0.09  0.92785
passwordyes     -1.79e+00   5.42e-01   -3.30  0.00097 ***
num_char         3.03e-02   2.42e-02    1.25  0.21065
line_breaks     -4.77e-03   1.35e-03   -3.53  0.00041 ***
formatplain      6.40e-01   1.51e-01    4.25  2.1e-05 ***
re_subjyes      -1.40e+00   4.11e-01   -3.41  0.00064 ***
exclaim_subjyes -1.41e-02   2.41e-01   -0.06  0.95343
urgent_subjyes   3.58e+00   1.24e+00    2.88  0.00396 **
exclaim_mess     9.64e-03   1.78e-03    5.41  6.3e-08 ***
numbersmall     -1.23e+00   1.58e-01   -7.82  5.3e-15 ***
numberbig       -4.62e-01   2.30e-01   -2.01  0.04430 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 2437.2  on 3920  degrees of freedom
Residual deviance: 1668.5  on 3901  degrees of freedom
AIC: 1708

Number of Fisher Scoring iterations: 18
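As a side note (not part of the original lab handout): glm reports its coefficients on the log-odds scale, and the inverse-logit function plogis() converts a fitted linear predictor into a probability. The sketch below uses the two g_simple coefficients reported above, and assumes to_multiple is a factor with levels "no" and "yes" (as the coefficient name to_multipleyes suggests).

```r
# plogis(eta) computes exp(eta) / (1 + exp(eta)), the inverse logit.
b0 <- -2.1161   # (Intercept) from summary(g_simple)
b1 <- -1.8092   # to_multipleyes from summary(g_simple)

plogis(b0)        # predicted probability of spam, single recipient
plogis(b0 + b1)   # predicted probability of spam, multiple recipients

# Equivalently, predict() with type = "response" applies the inverse logit
# for us (type = "link" would return the raw log-odds):
predict(g_simple, newdata = data.frame(to_multiple = c("no", "yes")),
        type = "response")
```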

Exercise 5 Imagine we are using this logistic regression model as a spam filter: every new message you receive is analyzed and its characteristics calculated for each of the 18 predictor variables. If the message contained an image, is it more or less likely to be flagged as spam? What if it only had an attachment? What if it had an image and an attachment?

Exercise 6 Which variables appear to be meaningful for identifying spam? Describe how each of these variables affects the probability of an email being spam; numerical answers are not needed here. (For categorical predictors, be sure to indicate the reference level and how this affects the interpretation.)

Exercise 7 Which predictor appears to have the largest effect? Does this agree with what you have seen in the spam you receive?

Assessing our spam filter

While not quite the same as the residual plots we saw in simple and multiple linear regression, we can create a plot that shows how well our logistic regression model is doing by plotting our predicted probabilities against the response variable (spam or not spam). We jitter the y coordinates slightly so that emails with similar probabilities are not directly on top of one another in the plot.

set.seed(1)
jitter = rnorm(nrow(email), sd = 0.08)
plot(g_full$fitted.values, email$spam + jitter,
     xlim = 0:1, ylim = c(-0.5, 1.5), axes = FALSE,
     xlab = "Predicted probability", ylab = "",
     col = adjustcolor("blue", 0.2), pch = 16)
axis(1)
axis(2, at = c(0, 1), labels = c("0 (not spam)", "1 (spam)"))

[Plot: jittered dot plot of predicted probability (x axis, 0 to 1) against spam status, 0 (not spam) vs. 1 (spam)]

We can also see the difference in the distribution of probabilities for the two classes by plotting side by side box plots.

plot(factor(email$spam, labels = c("not spam", "spam")), g_full$fitted.values)

[Plot: side-by-side box plots of the predicted probabilities for the not spam and spam groups]

From both plots it is clear that, in general, spam messages have higher predicted probabilities, which is what we expect to see if our spam filter is working well.

Exercise 8 Fit another glm model with the subset of the predictors that you think will result in the best possible spam filter. Your criteria should be based both on what you know about spam and on the results of fitting the full model. Describe why you chose this particular model and how its results differ from those of the full model.

Exercise 9 Recreate the above plots for your new model. Based on these plots, discuss whether your model appears to be a better or worse spam filter than the full model. What other information do you think you would need to better make this decision?

Extra Credit In the dot plot above there appears to be a banding pattern in the predicted probabilities, with many points falling at the same or very nearly the same predicted probabilities. Explain why this pattern is occurring.
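Beyond these plots, one concrete summary that may help when comparing candidate filters in Exercises 8 and 9 is a confusion matrix. The sketch below is not part of the original lab; it assumes email and g_full are still in the workspace and uses an arbitrary 0.5 cutoff.

```r
# Classify each email as spam when its predicted probability exceeds 0.5,
# then cross-tabulate the classifications against the true spam indicator.
pred_spam <- as.numeric(g_full$fitted.values > 0.5)
table(predicted = pred_spam, actual = email$spam)

# The two off-diagonal cells are the filter's mistakes: legitimate mail
# flagged as spam (false positives) and spam that slips through (false
# negatives). A deployed filter might use a cutoff well above 0.5, since
# losing a real email is usually costlier than seeing one more spam message.
```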