Lab 13: Logistic Regression
|
|
|
- Jessica Morton
- 10 years ago
- Views:
Transcription
1 Lab 13: Logistic Regression Spam s Today we will be working with a corpus of s received by a single gmail account over the first three months of Just like any other address this account received and sent regular s as well as receiving a large amount of spam, unsolicited bulk . We will be using what we have learned about logistic regression models to see if we can build a model that is able to predict whether or not a message is spam based on a variety of characteristics of the (e.g. inclusion of words like winner, inherit, or password, the number of exclamation marks used, etc.) While the spam filters used by large coorportations like Google and Microsoft are quite a bit more complex the fundamental idea is the same - binary classification based on a set of predictors. Template for lab report Write your report, or at least run the code and create the plots, as you go so that if you get errors you can ask your TA to help on the spot. Knit often to more easily determine the source of the error. download.file(" destfile = "lab13.rmd") The data download.file(" destfile = " .rdata") load(" .rdata") List of variables: 1. spam: Indicator for whether the was spam. 2. to multiple: Indicator for whether the was addressed to more than one recipient. 3. from: Whether the message was listed as from anyone (this is usually set by default for regular outgoing ). 4. cc: Indicator for whether anyone was CCed. 5. sent Indicator for whether the sender had been sent an in the last 30 days. 6. image: Indicates whether any images were attached. 7. attach: Indicates whether any files were attached. 8. dollar: Indicates whether a dollar sign or the word dollar appeared in the winner: Indicates whether winner appeared in the inherit: Indicates whether inherit (or an extension, such as inheritance) appeared in the password: Indicates whether password appeared in the num char: The number of characters in the , in thousands. 13. line breaks: The number of line breaks in the (does not count text wrapping). 14. format: Indicates whether the was written using HTML (e.g. may have included bolding or active links) or plaintext. 1
2 15. re subj: Indicates whether the subject started with Re:, RE:, re:, or re : 16. exclaim subj: Indicates whether there was an exclamation point in the subject. 17. urgent subj: Indicates whether the word urgent was in the subject. 18. exclaim mess: The number of exclamation points in the message. 19. number: Factor variable saying whether there was no number, a small number (under 1 million), or a big number. Exercise 1 In this lab we will focus on predicting whether an is spam or not. How many s make up this data set? What proportion of the s were spam? A simple spam filter We will start with a simple spam filter that will only use a single predictor to multiple to classify a message as spam or not. To do this we will fit a logistic regression model between spam and to multiple using the glm function. This is done in the same way that a simple or multiple regression model is fit in R, except we use the glm function instead of lm, and we must indicate that we wish to fit a logistic model by include the argument family=binomial. g_simple = glm(spam ~ to_multiple, data = , family = binomial) summary(g_simple) Call: glm(formula = spam ~ to_multiple, family = binomial, data = ) Deviance Residuals: Min 1Q Median 3Q Max Coefficients: Estimate Std. Error z value Pr(> z ) (Intercept) < 2e-16 *** to_multipleyes e-09 *** --- Signif. codes: 0 '***' '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 (Dispersion parameter for binomial family taken to be 1) Null deviance: on 3920 degrees of freedom Residual deviance: on 3919 degrees of freedom AIC: 2376 Number of Fisher Scoring iterations: 6 Exercise 2 Based on the results of this logistic regression does the inclusion of multiple recipients make a message more or less likely to be spam? Explain your reasoning. Exercise 3 Using these results calculate the probability that a message is spam if it has multiple recipients, what is the probability if it does not have multiple recipients. Exercise 4 Pick one of the other 17 remaining predictors that you think is most likely to predict 2
3 an s spam status and fit a glm model using it. Describe how this predictor affects the probability of an being spam. A more complex spam filter We will now fit the fully specified model by including all the predictors in our model. Note that we use. as short hand to indicate that we wish to include all predictors in our model. For the time being we will ignore the warning that is produced when we fit the model. g_full = glm(spam ~., data = , family = binomial) Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred summary(g_full) Call: glm(formula = spam ~., family = binomial, data = ) Deviance Residuals: Min 1Q Median 3Q Max Coefficients: Estimate Std. Error z value Pr(> z ) (Intercept) 2.00e e to_multipleyes -2.64e e e-15 *** fromyes -2.06e e ccyes -5.98e e sent_ yes -1.71e e imageyes -1.72e e * attachyes 1.28e e e-06 *** dollaryes 1.67e e winneryes 1.92e e e-07 *** inherityes 3.08e e passwordyes -1.79e e *** num_char 3.03e e line_breaks -4.77e e *** formatplain 6.40e e e-05 *** re_subjyes -1.40e e *** exclaim_subjyes -1.41e e urgent_subjyes 3.58e e ** exclaim_mess 9.64e e e-08 *** numbersmall -1.23e e e-15 *** numberbig -4.62e e * --- Signif. codes: 0 '***' '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 (Dispersion parameter for binomial family taken to be 1) Null deviance: on 3920 degrees of freedom Residual deviance: on 3901 degrees of freedom AIC: 1708 Number of Fisher Scoring iterations: 18 3
4 Exercise 5 Imagine we are using this logistic regression model as a spam filter, every new message you receive is analyzed and its characteristics calculated for each of the 18 predictor variables. If the message contained an image is it more or less likely to be flagged as spam? What if it only had an attachment? What if it had an image and an attachment? Exercise 6 Which variables appear to be meaningful for identifying spam? Describe how each of these variables affects the probability of an being spam, numerical answers are not needed here. (For categorical predictors be sure to indicate the reference level and how this affects the interpretation) Exercise 7 Which predictor appears to have the largest effect? Does this agree with what you have seen in the spam you receive? Assessing our spam filter While not quite the same as the residual plots we saw in simple and multiple linear regression we can create a plot that shows how well out logistic regession model is doing by plotting our predicted probability against the response variable (spam or not spam). We jitter the y coordinates slightly so that s with similar probabilities are not directly on top of one another in the plot. set.seed(1) jitter = rnorm(nrow( ), sd = 0.08) plot(g_full$fitted.values, $spam + jitter, xlim = 0:1, ylim = c(-0.5, 1.5), axes = FALSE, xlab = "Predicted probability", ylab = "", col = adjustcolor("blue", 0.2), pch = 16) axis(1) axis(2, at = c(0, 1), labels = c("0 (not spam)", "1 (spam)")) 0 (not spam) 1 (spam) Predicted probability We can also see the difference in the distribution of probabilities for the two classes by plotting side by side box plots. 4
5 plot(factor( $spam, labels = c("not spam", "spam")), g_full$fitted.values) not spam spam From both plots it is clear that in general spam messages have higher probabilities, which is what we expect to see if our spam filter is working well. Exercise 8 Fit another glm model with the subset of the predictors that you think will result in the best possible spam filter. Your criteria should be based on both what you know about spam and the results of fitting the full model. Describe why you chose this particular model and how its results differ from the full model. Exercise 9 Recreate the above plots for your new model, based on these plots discuss if your model appears to be a better or worse spam filter than the full model. What other information do you think you would need to better make this decision? Extra Credit In the dot plot above, there appears to be a banding pattern in the predicted probabilities, many points falling at the same or very nearly the same predicted probabilities. Explain why this pattern is occuring. 5
Generalized Linear Models
Generalized Linear Models We have previously worked with regression models where the response variable is quantitative and normally distributed. Now we turn our attention to two types of models where the
A Handbook of Statistical Analyses Using R. Brian S. Everitt and Torsten Hothorn
A Handbook of Statistical Analyses Using R Brian S. Everitt and Torsten Hothorn CHAPTER 6 Logistic Regression and Generalised Linear Models: Blood Screening, Women s Role in Society, and Colonic Polyps
Logistic Regression (a type of Generalized Linear Model)
Logistic Regression (a type of Generalized Linear Model) 1/36 Today Review of GLMs Logistic Regression 2/36 How do we find patterns in data? We begin with a model of how the world works We use our knowledge
Multivariate Logistic Regression
1 Multivariate Logistic Regression As in univariate logistic regression, let π(x) represent the probability of an event that depends on p covariates or independent variables. Then, using an inv.logit formulation
Régression logistique : introduction
Chapitre 16 Introduction à la statistique avec R Régression logistique : introduction Une variable à expliquer binaire Expliquer un risque suicidaire élevé en prison par La durée de la peine L existence
Multiple Linear Regression
Multiple Linear Regression A regression with two or more explanatory variables is called a multiple regression. Rather than modeling the mean response as a straight line, as in simple regression, it is
Statistical Models in R
Statistical Models in R Some Examples Steven Buechler Department of Mathematics 276B Hurley Hall; 1-6233 Fall, 2007 Outline Statistical Models Linear Models in R Regression Regression analysis is the appropriate
STATISTICA Formula Guide: Logistic Regression. Table of Contents
: Table of Contents... 1 Overview of Model... 1 Dispersion... 2 Parameterization... 3 Sigma-Restricted Model... 3 Overparameterized Model... 4 Reference Coding... 4 Model Summary (Summary Tab)... 5 Summary
Psychology 205: Research Methods in Psychology
Psychology 205: Research Methods in Psychology Using R to analyze the data for study 2 Department of Psychology Northwestern University Evanston, Illinois USA November, 2012 1 / 38 Outline 1 Getting ready
Logistic regression (with R)
Logistic regression (with R) Christopher Manning 4 November 2007 1 Theory We can transform the output of a linear regression to be suitable for probabilities by using a logit link function on the lhs as
Regression 3: Logistic Regression
Regression 3: Logistic Regression Marco Baroni Practical Statistics in R Outline Logistic regression Logistic regression in R Outline Logistic regression Introduction The model Looking at and comparing
Simple example of collinearity in logistic regression
1 Confounding and Collinearity in Multivariate Logistic Regression We have already seen confounding and collinearity in the context of linear regression, and all definitions and issues remain essentially
Outline. Dispersion Bush lupine survival Quasi-Binomial family
Outline 1 Three-way interactions 2 Overdispersion in logistic regression Dispersion Bush lupine survival Quasi-Binomial family 3 Simulation for inference Why simulations Testing model fit: simulating the
Basic Statistics and Data Analysis for Health Researchers from Foreign Countries
Basic Statistics and Data Analysis for Health Researchers from Foreign Countries Volkert Siersma [email protected] The Research Unit for General Practice in Copenhagen Dias 1 Content Quantifying association
Tutorial 3: Graphics and Exploratory Data Analysis in R Jason Pienaar and Tom Miller
Tutorial 3: Graphics and Exploratory Data Analysis in R Jason Pienaar and Tom Miller Getting to know the data An important first step before performing any kind of statistical analysis is to familiarize
Journal of Statistical Software
JSS Journal of Statistical Software January 2011, Volume 38, Issue 5. http://www.jstatsoft.org/ Lexis: An R Class for Epidemiological Studies with Long-Term Follow-Up Martyn Plummer International Agency
SAS Software to Fit the Generalized Linear Model
SAS Software to Fit the Generalized Linear Model Gordon Johnston, SAS Institute Inc., Cary, NC Abstract In recent years, the class of generalized linear models has gained popularity as a statistical modeling
MSwM examples. Jose A. Sanchez-Espigares, Alberto Lopez-Moreno Dept. of Statistics and Operations Research UPC-BarcelonaTech.
MSwM examples Jose A. Sanchez-Espigares, Alberto Lopez-Moreno Dept. of Statistics and Operations Research UPC-BarcelonaTech February 24, 2014 Abstract Two examples are described to illustrate the use of
Comparing Nested Models
Comparing Nested Models ST 430/514 Two models are nested if one model contains all the terms of the other, and at least one additional term. The larger model is the complete (or full) model, and the smaller
EDUCATION AND VOCABULARY MULTIPLE REGRESSION IN ACTION
EDUCATION AND VOCABULARY MULTIPLE REGRESSION IN ACTION EDUCATION AND VOCABULARY 5-10 hours of input weekly is enough to pick up a new language (Schiff & Myers, 1988). Dutch children spend 5.5 hours/day
DEPARTMENT OF PSYCHOLOGY UNIVERSITY OF LANCASTER MSC IN PSYCHOLOGICAL RESEARCH METHODS ANALYSING AND INTERPRETING DATA 2 PART 1 WEEK 9
DEPARTMENT OF PSYCHOLOGY UNIVERSITY OF LANCASTER MSC IN PSYCHOLOGICAL RESEARCH METHODS ANALYSING AND INTERPRETING DATA 2 PART 1 WEEK 9 Analysis of covariance and multiple regression So far in this course,
Statistical Models in R
Statistical Models in R Some Examples Steven Buechler Department of Mathematics 276B Hurley Hall; 1-6233 Fall, 2007 Outline Statistical Models Structure of models in R Model Assessment (Part IA) Anova
Chapter 7: Simple linear regression Learning Objectives
Chapter 7: Simple linear regression Learning Objectives Reading: Section 7.1 of OpenIntro Statistics Video: Correlation vs. causation, YouTube (2:19) Video: Intro to Linear Regression, YouTube (5:18) -
Local classification and local likelihoods
Local classification and local likelihoods November 18 k-nearest neighbors The idea of local regression can be extended to classification as well The simplest way of doing so is called nearest neighbor
A survey analysis example
A survey analysis example Thomas Lumley November 20, 2013 This document provides a simple example analysis of a survey data set, a subsample from the California Academic Performance Index, an annual set
Exchange Rate Regime Analysis for the Chinese Yuan
Exchange Rate Regime Analysis for the Chinese Yuan Achim Zeileis Ajay Shah Ila Patnaik Abstract We investigate the Chinese exchange rate regime after China gave up on a fixed exchange rate to the US dollar
Lecture 6: Poisson regression
Lecture 6: Poisson regression Claudia Czado TU München c (Claudia Czado, TU Munich) ZFS/IMS Göttingen 2004 0 Overview Introduction EDA for Poisson regression Estimation and testing in Poisson regression
Effect Displays in R for Generalised Linear Models
Effect Displays in R for Generalised Linear Models John Fox McMaster University Hamilton, Ontario, Canada Abstract This paper describes the implementation in R of a method for tabular or graphical display
We extended the additive model in two variables to the interaction model by adding a third term to the equation.
Quadratic Models We extended the additive model in two variables to the interaction model by adding a third term to the equation. Similarly, we can extend the linear model in one variable to the quadratic
Detecting Email Spam. MGS 8040, Data Mining. Audrey Gies Matt Labbe Tatiana Restrepo
Detecting Email Spam MGS 8040, Data Mining Audrey Gies Matt Labbe Tatiana Restrepo 5 December 2011 INTRODUCTION This report describes a model that may be used to improve likelihood of recognizing undesirable
Lecture 8: Gamma regression
Lecture 8: Gamma regression Claudia Czado TU München c (Claudia Czado, TU Munich) ZFS/IMS Göttingen 2004 0 Overview Models with constant coefficient of variation Gamma regression: estimation and testing
Homework Assignment 7
Homework Assignment 7 36-350, Data Mining Solutions 1. Base rates (10 points) (a) What fraction of the e-mails are actually spam? Answer: 39%. > sum(spam$spam=="spam") [1] 1813 > 1813/nrow(spam) [1] 0.3940448
Examining a Fitted Logistic Model
STAT 536 Lecture 16 1 Examining a Fitted Logistic Model Deviance Test for Lack of Fit The data below describes the male birth fraction male births/total births over the years 1931 to 1990. A simple logistic
Section 6: Model Selection, Logistic Regression and more...
Section 6: Model Selection, Logistic Regression and more... Carlos M. Carvalho The University of Texas McCombs School of Business http://faculty.mccombs.utexas.edu/carlos.carvalho/teaching/ 1 Model Building
Lucky vs. Unlucky Teams in Sports
Lucky vs. Unlucky Teams in Sports Introduction Assuming gambling odds give true probabilities, one can classify a team as having been lucky or unlucky so far. Do results of matches between lucky and unlucky
MISSING DATA TECHNIQUES WITH SAS. IDRE Statistical Consulting Group
MISSING DATA TECHNIQUES WITH SAS IDRE Statistical Consulting Group ROAD MAP FOR TODAY To discuss: 1. Commonly used techniques for handling missing data, focusing on multiple imputation 2. Issues that could
Basic Statistical and Modeling Procedures Using SAS
Basic Statistical and Modeling Procedures Using SAS One-Sample Tests The statistical procedures illustrated in this handout use two datasets. The first, Pulse, has information collected in a classroom
Response variables assume only two values, say Y j = 1 or = 0, called success and failure (spam detection, credit scoring, contracting.
Prof. Dr. J. Franke All of Statistics 1.52 Binary response variables - logistic regression Response variables assume only two values, say Y j = 1 or = 0, called success and failure (spam detection, credit
Getting started with qplot
Chapter 2 Getting started with qplot 2.1 Introduction In this chapter, you will learn to make a wide variety of plots with your first ggplot2 function, qplot(), short for quick plot. qplot makes it easy
Each function call carries out a single task associated with drawing the graph.
Chapter 3 Graphics with R 3.1 Low-Level Graphics R has extensive facilities for producing graphs. There are both low- and high-level graphics facilities. The low-level graphics facilities provide basic
1 Logistic Regression
Generalized Linear Models 1 May 29, 2012 Generalized Linear Models (GLM) In the previous chapter on regression, we focused primarily on the classic setting where the response y is continuous and typically
NCSS Statistical Software Principal Components Regression. In ordinary least squares, the regression coefficients are estimated using the formula ( )
Chapter 340 Principal Components Regression Introduction is a technique for analyzing multiple regression data that suffer from multicollinearity. When multicollinearity occurs, least squares estimates
ECLT5810 E-Commerce Data Mining Technique SAS Enterprise Miner -- Regression Model I. Regression Node
Enterprise Miner - Regression 1 ECLT5810 E-Commerce Data Mining Technique SAS Enterprise Miner -- Regression Model I. Regression Node 1. Some background: Linear attempts to predict the value of a continuous
5. Multiple regression
5. Multiple regression QBUS6840 Predictive Analytics https://www.otexts.org/fpp/5 QBUS6840 Predictive Analytics 5. Multiple regression 2/39 Outline Introduction to multiple linear regression Some useful
T O P I C 1 2 Techniques and tools for data analysis Preview Introduction In chapter 3 of Statistics In A Day different combinations of numbers and types of variables are presented. We go through these
SUGI 29 Statistics and Data Analysis
Paper 194-29 Head of the CLASS: Impress your colleagues with a superior understanding of the CLASS statement in PROC LOGISTIC Michelle L. Pritchard and David J. Pasta Ovation Research Group, San Francisco,
WatchGuard QMS End User Guide
WatchGuard QMS End User Guide WatchGuard QMS Overview The WatchGuard QMS device enables spam messages from the WatchGuard XCS to be directed to a local quarantine area that provides spam storage for each
5. Linear Regression
5. Linear Regression Outline.................................................................... 2 Simple linear regression 3 Linear model............................................................. 4
L3: Statistical Modeling with Hadoop
L3: Statistical Modeling with Hadoop Feng Li [email protected] School of Statistics and Mathematics Central University of Finance and Economics Revision: December 10, 2014 Today we are going to learn...
Email Marketing 10Mistakes
Most Common Email Marketing 10Mistakes At Upper Case, we see very smart customers make mistakes that cause their email response rates to suffer. Here are the most common mistakes we encounter... 01 Not
XPost: Excel Workbooks for the Post-estimation Interpretation of Regression Models for Categorical Dependent Variables
XPost: Excel Workbooks for the Post-estimation Interpretation of Regression Models for Categorical Dependent Variables Contents Simon Cheng [email protected] php.indiana.edu/~hscheng/ J. Scott Long [email protected]
Viewing Ecological data using R graphics
Biostatistics Illustrations in Viewing Ecological data using R graphics A.B. Dufour & N. Pettorelli April 9, 2009 Presentation of the principal graphics dealing with discrete or continuous variables. Course
Package dsmodellingclient
Package dsmodellingclient Maintainer Author Version 4.1.0 License GPL-3 August 20, 2015 Title DataSHIELD client site functions for statistical modelling DataSHIELD
MIXED MODEL ANALYSIS USING R
Research Methods Group MIXED MODEL ANALYSIS USING R Using Case Study 4 from the BIOMETRICS & RESEARCH METHODS TEACHING RESOURCE BY Stephen Mbunzi & Sonal Nagda www.ilri.org/rmg www.worldagroforestrycentre.org/rmg
GLMs: Gompertz s Law. GLMs in R. Gompertz s famous graduation formula is. or log µ x is linear in age, x,
Computing: an indispensable tool or an insurmountable hurdle? Iain Currie Heriot Watt University, Scotland ATRC, University College Dublin July 2006 Plan of talk General remarks The professional syllabus
Using R for Linear Regression
Using R for Linear Regression In the following handout words and symbols in bold are R functions and words and symbols in italics are entries supplied by the user; underlined words and symbols are optional
E(y i ) = x T i β. yield of the refined product as a percentage of crude specific gravity vapour pressure ASTM 10% point ASTM end point in degrees F
Random and Mixed Effects Models (Ch. 10) Random effects models are very useful when the observations are sampled in a highly structured way. The basic idea is that the error associated with any linear,
ANALYSING LIKERT SCALE/TYPE DATA, ORDINAL LOGISTIC REGRESSION EXAMPLE IN R.
ANALYSING LIKERT SCALE/TYPE DATA, ORDINAL LOGISTIC REGRESSION EXAMPLE IN R. 1. Motivation. Likert items are used to measure respondents attitudes to a particular question or statement. One must recall
Adjust Webmail Spam Settings
Adjust Webmail Spam Settings An unsolicited bulk email message is known as "spam." Spam, which usually contains some sort of commercial advertising or proposition, is sent to a large number of recipients
USER GUIDE: MANAGING NARA EMAIL RECORDS WITH GMAIL AND THE ZL UNIFIED ARCHIVE
USER GUIDE: MANAGING NARA EMAIL RECORDS WITH GMAIL AND THE ZL UNIFIED ARCHIVE Version 1.0 September, 2013 Contents 1 Introduction... 1 1.1 Personal Email Archive... 1 1.2 Records Management... 1 1.3 E-Discovery...
Google Apps Sync for Microsoft Outlook
Sync for Microsoft Sync for Microsoft is a plug-in for 2003, 2007, or 2010 that lets you use your familiar interface to manage your mail, calendar, and contacts It also lets you import data from to your
Example: Credit card default, we may be more interested in predicting the probabilty of a default than classifying individuals as default or not.
Statistical Learning: Chapter 4 Classification 4.1 Introduction Supervised learning with a categorical (Qualitative) response Notation: - Feature vector X, - qualitative response Y, taking values in C
Package smoothhr. November 9, 2015
Encoding UTF-8 Type Package Depends R (>= 2.12.0),survival,splines Package smoothhr November 9, 2015 Title Smooth Hazard Ratio Curves Taking a Reference Value Version 1.0.2 Date 2015-10-29 Author Artur
How To Prevent Spam From Being Filtered Out Of Your Email Program
Active Carrot - Avoiding Spam Filters Table of Contents What is Spam?... 3 How Spam Filters Work... 3 Avoid these common mistakes... 3 Preventing False Abuse Reports... 4 How Abuse Reports Work... 4 Reasons
Barracuda Spam Firewall Users Guide. Greeting Message Obtaining a new password Summary report Quarantine Inbox Preferences
Barracuda Spam Firewall Users Guide Greeting Message Obtaining a new password Summary report Quarantine Inbox Preferences Greeting Message The first time the Barracuda Spam Firewall quarantines an email
ANOVA. February 12, 2015
ANOVA February 12, 2015 1 ANOVA models Last time, we discussed the use of categorical variables in multivariate regression. Often, these are encoded as indicator columns in the design matrix. In [1]: %%R
Setting up Microsoft Outlook to reject unsolicited email (UCE or Spam )
Reference : USER 191 Issue date : January 2004 Updated : January 2008 Classification : Staff Authors : Matt Vernon, Richard Rogers Setting up Microsoft Outlook to reject unsolicited email (UCE or Spam
SPSS Explore procedure
SPSS Explore procedure One useful function in SPSS is the Explore procedure, which will produce histograms, boxplots, stem-and-leaf plots and extensive descriptive statistics. To run the Explore procedure,
Graphics in R. Biostatistics 615/815
Graphics in R Biostatistics 615/815 Last Lecture Introduction to R Programming Controlling Loops Defining your own functions Today Introduction to Graphics in R Examples of commonly used graphics functions
Junk Email Filtering System. User Manual. Copyright Corvigo, Inc. 2002-03. All Rights Reserved. 509-8282-00 Rev. C
Junk Email Filtering System User Manual Copyright Corvigo, Inc. 2002-03. All Rights Reserved 509-8282-00 Rev. C The Corvigo MailGate User Manual This user manual will assist you in initial configuration
HYPOTHESIS TESTING: CONFIDENCE INTERVALS, T-TESTS, ANOVAS, AND REGRESSION
HYPOTHESIS TESTING: CONFIDENCE INTERVALS, T-TESTS, ANOVAS, AND REGRESSION HOD 2990 10 November 2010 Lecture Background This is a lightning speed summary of introductory statistical methods for senior undergraduate
Cluster Analysis using R
Cluster analysis or clustering is the task of assigning a set of objects into groups (called clusters) so that the objects in the same cluster are more similar (in some sense or another) to each other
Using Excel for Statistics Tips and Warnings
Using Excel for Statistics Tips and Warnings November 2000 University of Reading Statistical Services Centre Biometrics Advisory and Support Service to DFID Contents 1. Introduction 3 1.1 Data Entry and
Correlation and Simple Linear Regression
Correlation and Simple Linear Regression We are often interested in studying the relationship among variables to determine whether they are associated with one another. When we think that changes in a
BIOL 933 Lab 6 Fall 2015. Data Transformation
BIOL 933 Lab 6 Fall 2015 Data Transformation Transformations in R General overview Log transformation Power transformation The pitfalls of interpreting interactions in transformed data Transformations
Classification and Regression by randomforest
Vol. 2/3, December 02 18 Classification and Regression by randomforest Andy Liaw and Matthew Wiener Introduction Recently there has been a lot of interest in ensemble learning methods that generate many
Package MDM. February 19, 2015
Type Package Title Multinomial Diversity Model Version 1.3 Date 2013-06-28 Package MDM February 19, 2015 Author Glenn De'ath ; Code for mdm was adapted from multinom in the nnet package
FireEye Email Threat Prevention Cloud Evaluation
Evaluation Prepared for FireEye June 9, 2015 Tested by ICSA Labs 1000 Bent Creek Blvd., Suite 200 Mechanicsburg, PA 17050 www.icsalabs.com Table of Contents Executive Summary... 1 Introduction... 1 About
More Details About Your Spam Digest & Dashboard
TABLE OF CONTENTS The Spam Digest What is the Spam Digest? What do I do with the Spam Digest? How do I view a message listed in the Spam Digest list? How do I release a message from the Spam Digest? How
Logs Transformation in a Regression Equation
Fall, 2001 1 Logs as the Predictor Logs Transformation in a Regression Equation The interpretation of the slope and intercept in a regression change when the predictor (X) is put on a log scale. In this
Credit Risk Analysis Using Logistic Regression Modeling
Credit Risk Analysis Using Logistic Regression Modeling Introduction A loan officer at a bank wants to be able to identify characteristics that are indicative of people who are likely to default on loans,
Anti Spamming Techniques
Anti Spamming Techniques Written by Sumit Siddharth In this article will we first look at some of the existing methods to identify an email as a spam? We look at the pros and cons of the existing methods
E-MAIL FILTERING FAQ
V8.3 E-MAIL FILTERING FAQ COLTON.COM Why? Why are we switching from Postini? The Postini product and service was acquired by Google in 2007. In 2011 Google announced it would discontinue Postini. Replacement:
sample median Sample quartiles sample deciles sample quantiles sample percentiles Exercise 1 five number summary # Create and view a sorted
Sample uartiles We have seen that the sample median of a data set {x 1, x, x,, x n }, sorted in increasing order, is a value that divides it in such a way, that exactly half (i.e., 50%) of the sample observations
Setting up and controlling E-mail
Setting up and controlling E-mail Two methods Web based PC based Setting up and controlling E-mail Web based the messages are on the Internet accessed by dial-up or broadband at your Internet Service Provider
Exercise 1.12 (Pg. 22-23)
Individuals: The objects that are described by a set of data. They may be people, animals, things, etc. (Also referred to as Cases or Records) Variables: The characteristics recorded about each individual.
GLM with a Gamma-distributed Dependent Variable
GLM with a Gamma-distributed Dependent Variable Paul E. Johnson October 6, 204 Introduction I started out to write about why the Gamma distribution in a GLM is useful. In the end, I ve found it difficult
Life after Microsoft Outlook Google Apps
Welcome Welcome to Gmail! Now that you ve switched from Microsoft Outlook to, here are some tips on beginning to use Gmail. Google Apps What s Different? Here are some of the differences you ll notice
Lecture 3: Linear methods for classification
Lecture 3: Linear methods for classification Rafael A. Irizarry and Hector Corrada Bravo February, 2010 Today we describe four specific algorithms useful for classification problems: linear regression,
Package polynom. R topics documented: June 24, 2015. Version 1.3-8
Version 1.3-8 Package polynom June 24, 2015 Title A Collection of Functions to Implement a Class for Univariate Polynomial Manipulations A collection of functions to implement a class for univariate polynomial
Time Series Analysis with R - Part I. Walter Zucchini, Oleg Nenadić
Time Series Analysis with R - Part I Walter Zucchini, Oleg Nenadić Contents 1 Getting started 2 1.1 Downloading and Installing R.................... 2 1.2 Data Preparation and Import in R.................
A quick guide to... Permission: Single or Double Opt-in?
A quick guide to... Permission: Single or Double Opt-in? In this guide... Learn how to improve campaign results by sending new contacts a confirmation email to verify their intention to join. Table of
2. Making example missing-value datasets: MCAR, MAR, and MNAR
Lecture 20 1. Types of missing values 2. Making example missing-value datasets: MCAR, MAR, and MNAR 3. Common methods for missing data 4. Compare results on example MCAR, MAR, MNAR data 1 Missing Data
