How much do E-mail users feel annoying with spam mail? : measuring the inconvenience costs of spam


Yuri Park *1, Yeonbae Kim 1, Jeong-Dong Lee 1, Jongsu Lee 1

1 Ph.D. Candidate, research professor, associate professor, assistant professor, respectively, in the Techno-Economics and Policy Program, Seoul National University, Seoul, Korea

* Corresponding author (Techno-Economics and Policy Program, Seoul National University, Shillim-Dong, Kwanak-Ku, Seoul , South Korea; [email protected]; Phone: ; Fax: )

JEL classification: L86; C11; C25; C42; D18

Key Words: Spam mail, negative externality, inconvenience cost, conjoint analysis

I. INTRODUCTION

Some thirty billion messages per day travelled the Internet in 2003; by 2006 that number could rise to sixty billion, and more than half of those messages could be spam (OECD, 2003). Definitions of spam vary, but broadly speaking, spam comprises all unsolicited or unwanted commercial electronic messages sent to a large number of users without regard to the identity of the individual user (ITU, 2004). We believe most people are annoyed by the volume of obscene, commercial, or other types of spam they receive. Besides the psychological costs involved, those on the receiving end of spam incur various costs such as decreased labour productivity, wasted time, the potential loss of useful e-mail mixed in with spam, and wasted bandwidth occupied by spam. Perhaps the most serious problem is that spam could eventually play a part in collapsing the foundation of trust supporting online society.

Part of the reason e-mail boxes are flooded with spam is that, in determining which mailing lists to target with their advertising, spammers do not take into consideration the costs consumers incur in processing the messages. Spammers therefore impose a negative externality on users: they send messages to users who may not want them but are forced to read them in order to find other valuable messages. Moreover, the marginal costs of sending spam are so low that spammers have an economic incentive to send bulk mailings as long as just a few receivers respond. If the costs to society of a particular mailing, including sending and receiving costs, are greater than the expected benefits to spammers and users of the messages (for example, benefits accruing to firms and to the consumers of the goods that are sold by means of the mailing), then from a social planner's point of view more messages are being sent than is optimal. We call this situation excessive message sending (Shiman, 1996, p. 37). Shiman uses theoretical modelling to show how excessive message sending exists when firms use the e-mail system to advertise their goods.

Given that spam constitutes a negative externality, an intervention such as a spam-control measure is needed to internalise the externality. Van Zandt (2004) and Loder et al. (2004) suggest various alternatives to protect against spam, for example an attention bond mechanism [1] and increasing communications costs. Various other tools exist for controlling spam, but these have limitations, such as false positive determinations and difficulty of enforcement. At this point no one can say what the best solution is.

[1] The attention bond mechanism is an economic solution to spam that allocates receivers' attention. Each receiver sets the size of an attention bond, which can be adjusted to the receiver's opportunity costs, and e-mail is delivered after the receiver's permission.

In order to address the spam problem and evaluate the effectiveness of spam-control alternatives, we first need to measure the magnitude of spam costs. Researchers have attempted to estimate the social costs imposed by spam in various ways; as might be expected, governments as well as the business world have a keen interest in this project. [2] Studies show that the various costs of spam are considerable; however, in method these studies have so far merely added up possible costs rather than performing a true estimation. In addition, most studies calculate the costs of dealing with spam as opportunity costs for labour. They do not consider users' disutility, even though receivers of spam are certainly annoyed by having to filter, delete, or read spam messages even when they are not at work. To our knowledge, almost no one has attempted to estimate the disutility of spam receivers [3], perhaps because such intangible losses are hard to quantify.

[2] According to Ferris Research (2003), spam cost U.S. corporations more than $10 billion. The Korean Information Security Agency (2003) and Nara Research (2004) reported that in Korea spam cost $11 billion and $43 billion in 2003 and 2004, respectively.

[3] Yoo et al. (2003) estimated consumers' willingness to pay for a spam-blocking program using the contingent valuation method.

In this paper, we use conjoint analysis of stated-preference data to estimate the inconvenience costs incurred by e-mail users who receive spam. Using stated-preference data gathered from users, we can directly elicit users' valuation of the negative effects of spam, taking into account such costs as time spent, loss of useful mail, intangible psychological distress, decreased labour productivity, and the inconvenience of having to avoid using e-mail. Such an estimate of the inconvenience costs of spam should prove valuable to further research into the spam phenomenon.

II. METHODOLOGY

Conjoint Analysis

Conjoint analysis enables us to use the stated choices of survey respondents to measure their preferences in hypothetical situations. In such analysis, levels of attributes describing a good or service are combined to build descriptions of hypothetical bundles. Each bundle is described on a card, and respondents are asked to state their preferences by ranking the cards, rating them, or choosing one of them (Alvarez-Farizo and Hanley, 2002).

In order to estimate the cost of spam in terms of inconvenience, we combined five service attributes (and their levels) to build a description of a hypothetical e-mail service package (Table 1). We defined spam for the respondents as unwanted, commercial, bulk e-mail. The first attribute is the volume of spam messages received each day; we set the range from 10 to 50 messages per day. We assume only two types of spam, which constitute a major portion of spam mailings (KISA, 2003): spam messages with commercial content and those with obscene content. Use (or non-use) of an antispam program is included as an attribute because such programs are an important tool in protecting against spam. Useful e-mail may not be delivered when the amount of spam exceeds the storage limit, so storage capacity is an important factor and is included as an attribute. Service price is included as an attribute in order to evaluate the other attributes in monetary terms. The survey was designed to ask respondents to rank-order a set of conjoint cards (bundles of attributes) according to their preference.

Table 1. Attributes and Levels Measured by the Conjoint Analysis Cards

Attribute | Level | Description
Number of spam messages | 10 messages, 30 messages, 50 messages | Number of spam messages delivered per day. Each spam message consists of commercial and obscene contents.
Antispam program | Self-conducted | Respondent is using an antispam program provided by government or a company.
Antispam program | Offered by e-mail service provider | Respondent is using an antispam program run by an e-mail service provider.
Antispam program | None | Respondent uses no antispam program.
E-mail storage capacity | 5 Mbyte, 20 Mbyte, 50 Mbyte | The storage capacity of the e-mail box.
E-mail service price | 500 won*/month (US$0.42/month), 1,500 won*/month (US$1.26/month), 2,500 won*/month (US$2.1/month) | Monthly cost of using the e-mail service under the conditions written on the conjoint card.

* As of June 21, 2005, US$1 is equivalent to 1,013.4 Korean won.

The survey was administered to 1,000 residents of Seoul, Korea, in May 2004; the sample was drawn on the basis of the age and sex distribution of the population of Seoul. Responses were obtained face-to-face by well-trained interviewers. The size of the sample used for empirical analysis was 537; the screening question excluded 463 respondents who did not use an e-mail service.

Model Specification

In this survey, each respondent was asked to rank the alternative cards in order of preference. In contingent ranking conjoint analysis, a rank-ordered logit model is generally used for the estimation (Layton, 2000; Calfee et al., 2001). Although this model has the advantage that the ranking (choice) probability has a simple closed form, it imposes restrictions on the ordering of preferences, such that the coefficients for each attribute are estimated to be the same across all consumers. Therefore we use the random coefficient model for the purpose of estimation. The random coefficient discrete choice model captures preference variation by introducing stochastic terms into the coefficients, created by deviations from mean preferences, and allowing these terms to be correlated with each other. With this method, the stochastic component of the utility function is correlated across the choice alternatives through the model's attributes. That is, the model does not impose the independence of irrelevant alternatives property (Calfee et al., 2001).

Procedures for estimating random coefficient discrete choice models have been developed within both the classical and Bayesian frameworks. Classical methods of estimation are generally based on maximum likelihood estimation. Problems related to the computational burden of integrating multivariate (normal) density functions are overcome using the simulated maximum likelihood estimation method. Applications of the random coefficient discrete choice model using the classical approach are presented by Brownstone and Train (1999), Hensher (2001), Layton (2000), and Calfee et al. (2001). In particular, Layton (2000) and Calfee et al. (2001) used the classical approach to estimate the random coefficient discrete choice model in the framework of contingent ranking conjoint analysis.

Allenby and Rossi (1999), Chiang et al. (1999), Huber and Train (2001), and Train (2003) have developed Bayesian approaches to random coefficient discrete choice modeling. These methods construct a Markov chain Gibbs sampler that can be used for drawing directly from the exact posterior distribution and perform finite sample likelihood inference to any degree of accuracy (McCulloch and Rossi, 1994). These procedures have certain advantages over the classical approach.

First, one can avoid direct evaluation of the nontrivial likelihood function and the associated problem of approximating the choice probabilities that arises in applying the classical method. In addition, the mathematical properties of the multinomial model do not guarantee convergence of the maximum likelihood estimation process to the global maximum, and the solution obtained by the nonlinear programming optimizer may depend critically on the starting point of the search. The Bayesian procedures therefore offer an advantage, since they do not involve maximization of the likelihood function.

Second, the results of Bayesian procedures can be interpreted simultaneously from both the Bayesian and classical perspectives, drawing on the insights afforded by each tradition. The Bernstein-von Mises theorem states that, under the conditions maintained in this study's methods, the mean of the Bayesian posterior is a classical estimator asymptotically equivalent to the maximum likelihood estimator. The theorem also establishes that the covariance of the posterior is the asymptotic covariance of this estimator (Train and Sonnier, 2003).

Third, desirable estimation properties such as consistency and efficiency can be attained under more relaxed conditions with Bayesian procedures than with the classical methods. Maximum simulated likelihood is consistent only if the number of draws used in the simulation increases with the sample size. Efficiency is attained only if the number of draws increases faster than the square root of the sample size. In contrast, Bayesian estimators are consistent for a fixed number of draws used in the simulation and are efficient if the number of draws rises at any rate with the sample size (Train, 2003).

Following the random utility framework proposed by McFadden (1974), we assume that an individual i faces a choice among J alternatives in each of T choice sets in a survey, and is asked to rank the alternatives in order of preference. In the empirical setting, the alternatives are e-mail services in Korea. We can then represent the utility derived by individual i from alternative j in choice set t as follows:

\[ U_{ijt} = \beta_i' x_{ijt} + \varepsilon_{ijt} \qquad (1) \]

where $x_{ijt}$ is the vector of attributes associated with alternative j, $\beta_i$ is the vector of coefficients of the attribute vector $x_{ijt}$, and $\varepsilon_{ijt}$ is a random disturbance. The random disturbance $\varepsilon_{ijt}$ is assumed to have an independent and identical extreme value distribution. The coefficient vector $\beta_i$ is assumed to be distributed normally across the population with mean vector b and variance-covariance matrix W; that is, $\beta_i$ follows an unbounded normal distribution.

In our setting of contingent ranking conjoint analysis, we can adopt the same procedure for Bayesian estimation as in Train (2003). The only difference is that the Metropolis-Hastings (M-H) algorithm uses the probability of the individual's sequence of rankings, instead of the probability based on the most preferred choice as in Train (2003). The probability of an individual i's observed sequence of rankings among alternatives is

\[ L(r_{i1}, \ldots, r_{iT} \mid \beta_i) = \prod_{t=1}^{T} \prod_{j=1}^{J-1} \frac{e^{\beta_i' x_{ijt}}}{\sum_{k=j}^{J} e^{\beta_i' x_{ikt}}} \qquad (3) \]

where $r_{it} = \{r_{i1t}, r_{i2t}, \ldots, r_{iJt}\}$ is the vector of individual i's ranking responses in choice set t, with the alternatives indexed in descending order of preference.

The unbounded normal distribution has some undesirable properties when applied to the price coefficient. For example, a normal distribution for the price coefficient implies that some share of the population actually prefers higher prices. The existence of price coefficients with the wrong sign renders the model unusable for calculating the WTP (willingness to pay) and other welfare measures. Also, if the distribution of the price coefficient overlaps zero, then the WTP becomes infinitely large for some customers (Train and Sonnier, 2003). In this study, we assume that the price coefficient has a log-normal distribution. This distribution has better properties in that it restricts the price coefficient to have the same sign for all respondents, and the price coefficient cannot take the value zero. This distribution can be obtained as a simple transformation of a normally distributed $\beta$: $C = \exp(\beta)$.

The unbounded normal distribution is also inappropriate for the coefficient of a desirable attribute that is valued by all customers. For example, it is implausible that there would be users who dislike higher quality of service. That is, users who choose a low-quality service level priced the same as a higher-quality service level are unlikely to exist in the real world. Accordingly, we also assume that the coefficients for the anti-spam program and e-mail storage capacity have a log-normal distribution. When a transformation is used for bounded distributions of coefficients, the utility function is specified as follows:

\[ U_{ijt} = C(\beta_i)' x_{ijt} + \varepsilon_{ijt} \qquad (4) \]

where C is a transformation function. This transformation requires only minor changes in the estimation procedure: the probability of the individual's sequence of rankings, which is used in the M-H algorithm, must be computed from the transformed $\beta_i$ (Train and Sonnier, 2003).

III. ESTIMATION RESULTS

We generated 20,000 draws with Gibbs sampling. The first ten thousand draws were discarded. Of the second ten thousand draws, every tenth was retained, and the resulting one thousand retained draws were used to draw inferences. The means of the one thousand draws of b and of the diagonal elements of W are shown in Table 2. From the Bayesian perspective, these are posterior means of b and of the diagonal elements of W. From the classical perspective, they represent the estimated mean and variance of the $\beta_i$ in the population. Generally, every estimated parameter is statistically significant.

Table 2. Estimation results (before transformation)

Attribute | Variable | Mean (b) of β | Variance (W) of β
Number of spam messages | SPAM | (0.0090) a | (0.0022)
Antispam program (self-conducted) | SELF | (0.7786) | ( )
Antispam program (offered by e-mail service provider) | PROV | (0.5224) | (2.4187)
E-mail storage capacity | CAP | (0.0071) | (0.0015)
E-mail service price | PRICE b | (0.3224) | (1.8689)

a (Posterior) standard deviations in parentheses.
b The PRICE variables are entered as the negative of PRICE.

Since b and W are the mean and variance of the $\beta_i$ in the population from the classical perspective, the distribution of the coefficient of each variable is obtained by simulation from the estimated values of b and W. Two thousand draws of $\beta_i$ were taken from a normal distribution with mean equal to the estimated value of b and variance equal to the estimated value of W. Each draw of $\beta_i$ was then transformed to obtain a draw of the coefficients shown in Table 3. The negative sign of SPAM means that people dislike an increasing volume of spam mail. The mean coefficients for the anti-spam program suggest that people prefer a self-conducted program to one offered by the service provider, but the variance of SELF is very large compared to that of PROV. Based on the median estimates, people prefer the anti-spam program offered by the service provider.

Table 3. Transformed random coefficient estimates

Variable | Mean | Variance | Median
SPAM | | |
SELF | | |
PROV | | |
CAP | | |
PRICE | | |

We now calculate the welfare change associated with changes in the level of attributes. The compensating variation associated with a one-unit increase in an attribute is the ratio of the coefficient for that attribute to the price coefficient. We calculate the marginal willingness to pay (MWTP) of individual i from each draw of the parameters ($\beta_i$) in the same simulation described earlier, and we obtain distributions of MWTP for a change in each attribute [4] (see Table 4).

Table 4. Marginal Willingness to Pay (MWTP) in US$/month

Attribute | Median
SPAM |
Anti-spam program (base: none), SELF |
Anti-spam program (base: none), PROV |
CAP |

WTP for a decrease of one spam message a day is US$ per month. That is, users are willing to pay, on average, about US$ per message for a one-unit decrease per day. This can be understood as the marginal externality cost of one spam mail received. We also find that people are willing to pay US$4.25 and US$5.57 per month for a self-conducted antispam program and an antispam program offered by the service provider, respectively. Users' preferences between the two kinds of antispam program do not differ greatly, but having an antispam program is an important factor, as it has a relatively high WTP. WTP for a 1-megabyte increase in storage capacity is US$ per month, which roughly matches the price in the real world. (A user of the popular Hotmail service pays about US$19.95 per year for 2 gigabytes of capacity.)

IV. CONCLUSION

In this paper, e-mail users' overall inconvenience cost created by spam is estimated from stated-preference data using conjoint analysis. First, we estimate the marginal unit cost to the receiver of spam to be about US$ for each spam message. The social costs of receiving spam can be calculated from current basic statistics regarding spam, that is, how many messages are delivered, how large a share of all messages spam occupies, how many accounts people have, how many people use an e-mail service, and so on.

We infer that e-commerce may atrophy if consumers reach the point where they are unlikely to buy products or services online because they receive too many fraudulent or deceitful advertising messages. [5] That is, we can envision a scenario under which the trust that supports online society is undermined. If an extremely low marginal sending cost and the existence of an externality are enabling the flood of spam, the best solution is to internalise that externality. One way to accomplish that is to increase the sending cost to cover the social costs of spam, including users' inconvenience cost. Hence we see the value of an estimate of the marginal externality cost of spam.

Governments and businesses worldwide are seeking solutions to the problems caused by spam and are considering various technological or legislative measures. The results of this study should prove useful for comparing alternatives and for making judgements about mitigating spam and increasing social welfare.

[4] Since we use a utility function that does not include an income term, the compensating variation for an attribute change is equal to MWTP.

[5] According to the International Telecommunication Union (2004), some 36 percent of spam consists of an attempt to deceptively sell something or perpetrate a scam or fraud.
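The MWTP simulation described in Section III can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' code. The values of b and of the diagonal of W are hypothetical placeholders (the Table 2 means are not available here), the function name simulate_mwtp is ours, and off-diagonal elements of W are ignored. SPAM is kept normally distributed, while SELF, PROV, CAP and PRICE are transformed with C = exp(β), following the text.

```python
import numpy as np

# HYPOTHETICAL placeholder parameters: these are NOT the paper's estimates.
# Order of variables follows Table 2; PRICE is entered as the negative of price.
VARIABLES = ["SPAM", "SELF", "PROV", "CAP", "PRICE"]
b = np.array([-0.05, 0.5, 0.8, -3.0, -1.0])     # assumed means of beta
w_diag = np.array([0.01, 1.0, 0.5, 0.5, 0.3])   # assumed variances of beta
# True where the coefficient is log-normal, i.e. C(beta) = exp(beta).
LOGNORMAL = np.array([False, True, True, True, True])


def simulate_mwtp(b, w_diag, lognormal, n_draws=2000, seed=0):
    """Draw beta ~ N(b, diag(w_diag)), transform, and form MWTP ratios.

    Returns the transformed coefficients (n_draws x K) and the MWTP of
    every non-price attribute (n_draws x (K - 1)).
    """
    rng = np.random.default_rng(seed)
    # Independent normal draws; correlations in W are ignored (assumption).
    beta = rng.normal(loc=b, scale=np.sqrt(w_diag), size=(n_draws, b.size))
    # Apply the exponential transformation only to the log-normal terms.
    coef = np.where(lognormal, np.exp(beta), beta)
    # MWTP = attribute coefficient / price coefficient, draw by draw.
    price_coef = coef[:, -1]
    mwtp = coef[:, :-1] / price_coef[:, None]
    return coef, mwtp


coef, mwtp = simulate_mwtp(b, w_diag, LOGNORMAL)
medians = dict(zip(VARIABLES[:-1], np.median(mwtp, axis=0)))
for name, value in medians.items():
    print(f"{name}: median MWTP = {value:.4f} (price units per unit of attribute)")
```

Because the price coefficient is log-normal, the ratio is always finite and its sign is determined by the numerator. The sketch reports medians, in line with Table 4, since ratios with a log-normal denominator tend to have long tails that make means less stable.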

REFERENCES

Allenby, G.M. and P.E. Rossi (1999) Marketing models of consumer heterogeneity, Journal of Econometrics 89.

Alvarez-Farizo, B. and N. Hanley (2002) Using conjoint analysis to quantify public preferences over the environmental impacts of wind farms: An example from Spain, Energy Policy 30.

Brownstone, D. and K. Train (1999) Forecasting new product penetration with flexible substitution patterns, Journal of Econometrics 89.

Calfee, J., C. Winston and R. Stempski (2001) Econometric issues in estimating consumer preferences from stated preference data: A case study of the value of automobile travel time, Review of Economics and Statistics 83.

Chapman, R.G. and R. Staelin (1982) Exploiting rank ordered choice set data within the stochastic utility model, Journal of Marketing Research 22.

Chiang, J., S. Chib and C. Narasimhan (1999) Markov chain Monte Carlo and models of consideration set and parameter heterogeneity, Journal of Econometrics 89.

Ferris Research (2003) Spam control: Problems and opportunities, San Francisco, Ferris Research.

Huber, J. and K. Train (2001) On the similarity of classical and Bayesian estimates of individual mean partworths, Marketing Letters 12.

International Telecommunication Union (ITU) (2004) Spam in the information society: Building frameworks for international cooperation, Geneva, International Telecommunication Union.

Korean Information Security Agency (KISA) (2003) Spam mail: Circulated volume and its damages, Seoul, Korean Information Security Agency.

Layton, D.F. (2000) Random coefficient models for stated preference surveys, Journal of Environmental Economics and Management 40.

Loder, T., M. Van Alstyne and R. Wash (2004) Information asymmetry and thwarting spam, Technical report, University of Michigan.

McCulloch, R. and P.E. Rossi (1994) An exact likelihood analysis of the multinomial probit model, Journal of Econometrics 64.

McFadden, D. (1974) Conditional logit analysis of qualitative choice behavior, in P. Zarembka (Ed.), Frontiers in Econometrics, Academic Press, NY.

Nara Research (2004) A survey on the current situation of spam mail, Seoul, Nara Research.

Organisation for Economic Co-operation and Development (OECD) (2003) Background paper for the OECD workshop on SPAM, Paris, Organisation for Economic Co-operation and Development.

Shiman, D.R. (1996) When e-mail becomes junk mail: The welfare implications of the advancement of communications technology, Review of Industrial Organization 11.

Train, K.E. (2003) Discrete choice methods with simulation, Cambridge University Press, Cambridge.

Train, K. and G. Sonnier (2003) Mixed logit with bounded distribution of partworths, working paper (University of California, Berkeley and Los Angeles).

Van Zandt, T. (2004) Information overload in a network of targeted communication, Rand Journal of Economics 35.

Yoo, S.H., S.J. Kwak and C.O. Shin (2003) Measuring the inconvenience costs of spam mail in Korea using the contingent valuation method, Applied Economics Letters, under review.

Keep It Simple: Easy Ways To Estimate Choice Models For Single Consumers

Keep It Simple: Easy Ways To Estimate Choice Models For Single Consumers Keep It Simple: Easy Ways To Estimate Choice Models For Single Consumers Christine Ebling, University of Technology Sydney, [email protected] Bart Frischknecht, University of Technology Sydney,

More information

Statistics Graduate Courses

Statistics Graduate Courses Statistics Graduate Courses STAT 7002--Topics in Statistics-Biological/Physical/Mathematics (cr.arr.).organized study of selected topics. Subjects and earnable credit may vary from semester to semester.

More information

11. Time series and dynamic linear models

11. Time series and dynamic linear models 11. Time series and dynamic linear models Objective To introduce the Bayesian approach to the modeling and forecasting of time series. Recommended reading West, M. and Harrison, J. (1997). models, (2 nd

More information

Decentralized Utility-based Sensor Network Design

Decentralized Utility-based Sensor Network Design Decentralized Utility-based Sensor Network Design Narayanan Sadagopan and Bhaskar Krishnamachari University of Southern California, Los Angeles, CA 90089-0781, USA [email protected], [email protected]

More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! [email protected]! http://www.cs.toronto.edu/~rsalakhu/ Lecture 6 Three Approaches to Classification Construct

More information

Overview of Violations of the Basic Assumptions in the Classical Normal Linear Regression Model

Overview of Violations of the Basic Assumptions in the Classical Normal Linear Regression Model Overview of Violations of the Basic Assumptions in the Classical Normal Linear Regression Model 1 September 004 A. Introduction and assumptions The classical normal linear regression model can be written

More information

The Cost of Annoying Ads

The Cost of Annoying Ads The Cost of Annoying Ads DANIEL G. GOLDSTEIN Microsoft Research and R. PRESTON MCAFEE Microsoft Corporation and SIDDHARTH SURI Microsoft Research Display advertisements vary in the extent to which they

More information

The Study on the Value of New & Renewable Energy as a Future Alternative Energy Source in Korea

The Study on the Value of New & Renewable Energy as a Future Alternative Energy Source in Korea , pp.26-31 http://dx.doi.org/10.14257/astl.2015.86.06 The Study on the Value of New & Renewable Energy as a Future Alternative Energy Source in Korea Woo-Jin Jung 1, Tae-Hwan Kim 2, and Sang-Ying Tom Lee

More information

A Game Theoretical Framework for Adversarial Learning

A Game Theoretical Framework for Adversarial Learning A Game Theoretical Framework for Adversarial Learning Murat Kantarcioglu University of Texas at Dallas Richardson, TX 75083, USA muratk@utdallas Chris Clifton Purdue University West Lafayette, IN 47907,

More information

Least Squares Estimation

Least Squares Estimation Least Squares Estimation SARA A VAN DE GEER Volume 2, pp 1041 1045 in Encyclopedia of Statistics in Behavioral Science ISBN-13: 978-0-470-86080-9 ISBN-10: 0-470-86080-4 Editors Brian S Everitt & David

More information

Marketing Mix Modelling and Big Data P. M Cain

Marketing Mix Modelling and Big Data P. M Cain 1) Introduction Marketing Mix Modelling and Big Data P. M Cain Big data is generally defined in terms of the volume and variety of structured and unstructured information. Whereas structured data is stored

More information

The HB. How Bayesian methods have changed the face of marketing research. Summer 2004

The HB. How Bayesian methods have changed the face of marketing research. Summer 2004 The HB How Bayesian methods have changed the face of marketing research. 20 Summer 2004 Reprinted with permission from Marketing Research, Summer 2004, published by the American Marketing Association.

More information

Better decision making under uncertain conditions using Monte Carlo Simulation

Better decision making under uncertain conditions using Monte Carlo Simulation IBM Software Business Analytics IBM SPSS Statistics Better decision making under uncertain conditions using Monte Carlo Simulation Monte Carlo simulation and risk analysis techniques in IBM SPSS Statistics

More information

Incorporating prior information to overcome complete separation problems in discrete choice model estimation

Incorporating prior information to overcome complete separation problems in discrete choice model estimation Incorporating prior information to overcome complete separation problems in discrete choice model estimation Bart D. Frischknecht Centre for the Study of Choice, University of Technology, Sydney, [email protected],

More information

Introduction to Markov Chain Monte Carlo

Introduction to Markov Chain Monte Carlo Introduction to Markov Chain Monte Carlo Monte Carlo: sample from a distribution to estimate the distribution to compute max, mean Markov Chain Monte Carlo: sampling using local information Generic problem

More information

Modeling and Analysis of Call Center Arrival Data: A Bayesian Approach

Modeling and Analysis of Call Center Arrival Data: A Bayesian Approach Modeling and Analysis of Call Center Arrival Data: A Bayesian Approach Refik Soyer * Department of Management Science The George Washington University M. Murat Tarimcilar Department of Management Science

More information
