A Direct Approach to Data Fusion


Zvi Gilula, Department of Statistics, Hebrew University
Robert E. McCulloch and Peter E. Rossi, Graduate School of Business, University of Chicago, 1101 E. 58th Street, Chicago, IL

July 2003; this version, May 2004

Abstract

The generic data fusion problem is to make inferences about the joint distribution of two sets of variables without any direct observations of the joint distribution. Instead, information is only available about each set separately along with some other set of common variables. The standard approach to data fusion creates a fused data set with the variables of interest and the common variables. Our approach directly estimates the joint distribution of just the variables of interest. For the case of either discrete or continuous variables, our approach yields a solution that can be implemented with standard statistical models and software. In typical marketing applications, the common variables are psychographic or demographic variables and the variables to be fused involve media viewing and product purchase. For this example, our approach will directly estimate the joint distribution of media viewing and product purchase without including the common variables. This is the object required for marketing decisions. In marketing applications, fusion of discrete variables is required. We develop a method for relaxing the assumption of conditional independence for this case. We illustrate our approach with product purchase and media viewing data from a large survey of British consumers.

Keywords: Data fusion, predictive distributions, Bayesian analysis, media planning and buying

Acknowledgements: The authors wish to thank BMRB, London for access to their survey data. Rossi thanks the James M. Kilts Center for Marketing, Graduate School of Business, University of Chicago for partial support of this research.
1. Introduction

Data fusion is the problem of how to make inferences about the joint distribution of two sets of random variables (hereafter called the target variables) when only information on the marginal distribution of each set is available. For example, separate surveys are conducted about buying or purchase behavior and media viewing behavior. Information is available on the marginal distribution of buying behavior and media viewing, but there is no direct observation of the joint distribution. In media planning problems, inferences about the joint distribution of buying and viewing are desired. The problem, therefore, is to make inferences about the joint distribution of media viewing and buying without direct observation of these two sets of variables jointly.

The general problem of inferring a joint distribution on the basis of the marginals alone is not solvable. There are many possible joint distributions consistent with the same marginal distributions, so the joint distribution is not identified by knowledge of the marginals alone. Additional information must be brought to bear on the problem to solve it. In the case of the data fusion problem, fusion is made possible by some group of variables common to both sources of information on the marginals of the two sets of target variables. An example of this common information is demographic or psychographic variables. In the media planning example, demographic information is available both in the survey of buying as well as in the survey of media viewership. Data fusion methods utilize this common information to make inferences about the joint distribution. It should be clear that the presence of common variables alone is not sufficient to identify the joint distribution of the two sets of target variables. Additional assumptions must be made about the conditional distribution of the target variables given the common variables in order to achieve identification.
The term data fusion was coined for this problem to connote the merging or fusion of two data sets. One data set has one set of target variables and the common variables, and another data set has the other set of target variables (and the same common variables). The data set with the buying data, for example, must be fused with the data set with media viewing habits via the information in the common set of demographic variables. If all dependence between buying and media viewing is via the common variables, then it might be natural to view the data fusion problem as a sort of matching problem (see Kadane (1978) and Rodgers (1984)). A given record from the buying data must be matched with one or more records from the viewing data. One example of this matching is the hot deck method, in which records from one file are matched with one record from the other file. If the common variables are discrete, one could simply find all the records in the other file that have exactly the same values of the common variables. If there is only one such record in the other file, then we simply impute or match this one record. If there is more than one record in the other file with identical values of the common variables, we use some measure of the average value of the target variable or use multiple imputed values as in Rubin (1986).

In marketing problems, the data sets are often produced by surveys and all variables are discrete. Moreover, many of the important variables are categorical in nature. The ultimate target variables whose joint distribution we wish to estimate are also discrete. Examples include media viewership and purchase, which are binary variables. Multiple imputation methods based on multivariate normal distributional assumptions (see Rassler (2002) for an excellent discussion of multiple imputation methods) are not applicable to many if not most marketing applications.
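The hot deck matching described above can be sketched in a few lines. This is a minimal sketch for illustration, not the procedure evaluated in the paper: the record layout, the exact-match rule, and the single-random-draw handling of ties are all assumptions (a multiple-imputation variant, as in Rubin (1986), would draw several donors per record).

```python
import random
from collections import defaultdict

def hot_deck_fuse(buying_records, viewing_records, rng=random.Random(0)):
    """Impute a media-viewing value onto each buying record by matching on
    identical values of the discrete common variables (hot deck).
    Records are (common_vars_tuple, target_value) pairs; the fixed seed is
    only for reproducibility of this sketch."""
    # Index donor records by their common-variable values.
    donors = defaultdict(list)
    for x, m in viewing_records:
        donors[x].append(m)
    fused = []
    for x, b in buying_records:
        pool = donors.get(x)
        if not pool:
            continue  # no exact match; a real system would relax the match rule
        # One matching donor -> impute it directly; several -> draw one at
        # random (a stand-in for averaging or multiple imputation).
        m = pool[0] if len(pool) == 1 else rng.choice(pool)
        fused.append((x, b, m))
    return fused

# Tiny illustration: the common variable is a hypothetical (age_band, region).
buy = [(("18-34", "N"), 1), (("35-54", "S"), 0)]
view = [(("18-34", "N"), 1), (("18-34", "N"), 0), (("35-54", "S"), 1)]
print(hot_deck_fuse(buy, view))
```

Note that with continuous or high-dimensional common variables, exact matching fails and a distance metric over x must be substituted, which is exactly where the ad hoc choices criticized later in the paper enter.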
The basic idea behind matching is to form groups of observations that are similar as measured by their common variable values. The groups of observations can be used to impute the values of the target variables that are not observed or missing in a particular data set. As such, the problem of data fusion can be cast as a missing data problem, as emphasized by Rubin (1986). The imputation groups can be formed by simple rules such as having identical values of the demographic variables, or close values as defined by a distance metric (Rassler 2002 (pp. 19, 56, 68), Moriarity and Scheuren (2001, 2003)). Kamakura and Wedel (1997, 2000) generalize this notion quite elegantly by defining imputation groups implicitly via a finite mixture model. All observations in a given mixture component form the imputation group, and the imputation group memberships are inferred from the data rather than imposed via some sort of ad hoc distance metric. In the Kamakura and Wedel approach, multiple imputations or simulation methods can be used to conduct formal inference, properly accounting for uncertainty about imputation group assignment.

Our approach is to directly estimate the joint distribution of the target variables rather than use a matching or concatenation approach. The joint distribution can then be used to solve the inference problems desired for marketing decisions. Our approach works equally well with either discrete or continuous target and common variables. In particular, we do not require any explicit modeling of the distribution of the common variables and instead condition on these variables. This reduces the number of parameters estimated as well as the possible specification errors that might occur from postulating a joint distribution of the common variables.

Our focus is on marketing decisions for which the joint distribution is the ultimate goal of the analysis. Multiple imputation and other fusion approaches are designed for
more generic situations in which the ultimate goal of the analysis is not known at the time of fusion. Our approach is also designed to exploit existing methods for modeling the conditional distribution of the target variables given the common variables rather than requiring specialized code. Standard methods (such as logit or regression models) can be used interchangeably with more involved state-of-the-art techniques.

The paper is organized as follows. In the next section, we outline a general framework for the data fusion problem and present our general approach. The assumption of conditional independence plays an important role in many approaches to data fusion. We discuss how other approaches relate to our general formulation of the problem and how these approaches either do or do not employ conditional independence. In section 3, we develop a method for relaxing the assumption of conditional independence that is useful for the case where some fused data or prior information is available. In section 4, we illustrate the value of our approach by using buying and media viewing data from a large survey of British consumers. We show that our approach achieves highly accurate fusion without the use of highly parameterized models or specialized code.
2. A Framework for Data Fusion

In order to develop a general framework for data fusion, we need a precise definition of the data fusion problem. Much of the data fusion literature takes the view that the goal of data fusion is to combine or fuse two datasets into one complete data set. To establish notation, let D_b = {(x_i, b_i), i = 1, ..., N_b} denote a dataset of observations on one target variable (denoted by b) and the common variables x. D_m = {(x_i, m_i), i = 1, ..., N_m} denotes the dataset of observations on the other target variable (with their associated common variables). We label the target variables b and m to suggest a media buying situation in which b would denote product purchase or usage and m would denote media viewing. Typically, x is a high-dimensional vector of variables. While this notation conforms to our data application, the problem is obviously more general.

The data fusion literature regards the problem as the creation of one combined dataset with observations on all three sets of variables (x, b, m) (this is sometimes referred to as file concatenation). For example, Rubin (1986) views the problem as a missing data problem: the information on m is missing from the D_b dataset and must be imputed.

Our view is that the goal of data fusion is to form inferences about the joint distribution of (b, m) using the information in the data D = (D_b, D_m). Estimates of the joint distribution of (b, m) can then be used to solve whatever decision problem is required by the marketing application. For example, in media planning, media choices that have a high proportion of viewers who purchase the advertised product are considered desirable. Therefore, media choice depends on aspects of the joint distribution of b and m. Below we discuss under what circumstances we require the joint probability of b and m or simply the conditional probability of b given m.
Our goal can then be succinctly stated as the computation of the predictive distribution of (b, m) given the data D = (D_b, D_m). The predictive distribution is obtained by integrating out over the parameters of the joint distribution:

p(b, m | D) = ∫ p(b, m | θ, D) p(θ | D) dθ.

Since b and m are not observed jointly but only separately along with the common variables, we must provide a model for the conditional distribution of (b, m) given x. As discussed below in the section on identification, some further assumptions are required to identify this model. We will start with the assumption of conditional independence; that is,

(1) p(b, m | x, θ) = p(b | x, θ) p(m | x, θ).

The idea here is that the source of commonality between b and m is the x variables and that, after controlling for or conditioning on these variables, the dependence between b and m is eliminated. For situations with a rich array of x variables, this may be a reasonable approximation. What is important to emphasize is that some assumption regarding dependence must be made to solve the data fusion problem. We start with the assumption of conditional independence, for which we believe a reasonable argument can be made. However, without direct information on the joint distribution of b and m, this assumption cannot be tested. Parts of the literature on data fusion do not make explicit mention of the assumption of conditional independence but are implicitly assuming it. Others, such as Rodgers (1984), make the assumption explicitly. In the identification section below, we discuss the other approaches and the implicit and explicit assumptions made regarding conditional independence. We will also develop a method for relaxing conditional independence that can apply to many marketing applications.
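A small numerical example may help fix ideas: conditional independence given x, as in (1), does not imply that b and m are marginally independent; the dependence carried by the common variables is exactly what makes fusion informative. The probabilities below are hypothetical.

```python
# Toy population: x in {0, 1}; given x, b and m are independent Bernoulli
# variables whose success probabilities both depend on x (hypothetical numbers).
p_x = {0: 0.5, 1: 0.5}
p_b_given_x = {0: 0.2, 1: 0.8}
p_m_given_x = {0: 0.3, 1: 0.7}

def joint(b, m):
    """p(b, m) = sum_x p(b|x) p(m|x) p(x), applying the factorization (1)."""
    total = 0.0
    for x, px in p_x.items():
        pb = p_b_given_x[x] if b else 1 - p_b_given_x[x]
        pm = p_m_given_x[x] if m else 1 - p_m_given_x[x]
        total += pb * pm * px
    return total

p_bm = joint(1, 1)
p_b = sum(joint(1, m) for m in (0, 1))
p_m = sum(joint(b, 1) for b in (0, 1))
# Conditional independence given x does NOT give marginal independence:
print(round(p_bm, 3), round(p_b * p_m, 3))  # 0.31 vs 0.25
```

Here Pr(b = 1, m = 1) = 0.31 exceeds Pr(b = 1) Pr(m = 1) = 0.25, so b and m are positively associated marginally even though they are independent within each x cell.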
Under the assumption of conditional independence, the predictive distribution of b and m can be computed as follows:

p(b, m | D) = ∫∫ p(b, m | x, θ) p(x) p(θ | D) dx dθ = ∫∫ p(b | x, θ) p(m | x, θ) p(x) p(θ | D) dx dθ.

p(θ | D) is the posterior distribution of the parameters given the two datasets, and p(x) is the marginal distribution of the common variables. In general, x may not be continuous. Therefore, it may be useful to view the inner integral above as the expectation of the conditional distribution of (b, m) given x and θ with respect to the marginal distribution of the x variables:

p(b, m | D) = ∫ E_x[ p(b, m | x, θ) ] p(θ | D) dθ = E_{θ|D} E_x[ p(b, m | x, θ) ].

In order to compute the expectation with respect to the marginal distribution of x, it is not necessary to model the distribution of (b, x) and (m, x) or even just the marginal of x. We only require the ability to take the expectation with respect to this distribution. The x variables may exhibit many forms of dependence and mixtures of discrete and continuous distributions. Given that we only require the expectation and not the entire distribution, we can simply approximate this expectation by summing over the observations. This avoids making arbitrary distributional assumptions or facing the very difficult nonparametric problem of approximating the high-dimensional distribution of x. In survey work, we typically have samples of several thousand or more, so this approximation is apt to be very accurate. Our approach to computing the predictive distribution of b and m is simply to form the expectation,
(2) p(b, m | D) = E_{θ|D} E_x[ p(b, m | x, θ) ] ≈ E_{θ|D} [ (1/(N_b + N_m)) Σ_{x_i ∈ D} p(b | x_i, θ) p(m | x_i, θ) ].

The summation is over all observations of x in both datasets. The outer expectation with respect to the posterior distribution of θ can easily be computed by using draws from modern MCMC methods or by even less computationally demanding methods such as importance sampling.

As a practical matter, this means that we only have to model the conditional distribution of b given x and of m given x to perform data fusion. In typical situations, each element of b and m can be either a binary variable, in which case simple logit models might suffice, or a continuous variable, for which standard regression models could be used. Diagnostics can be performed on these model fits to select among alternative specifications. Modeling the conditional distribution of b given x or m given x is considerably less demanding than modeling the joint distribution of (b, x) and/or (m, x). This reduces computational requirements and guards against model misspecification.

2.1 Joint or Conditional Probability?

In order to determine which aspects of the joint distribution of b and m are required, we must examine the media buying decision. Consider the problem of allocating a media buying budget over K possible media (in our case, over K possible TV shows). We view the objective as maximizing the total exposure to consumers who have revealed an interest in the product via purchase (b). Thus, the media buying decision could be formalized as

max_{Q_1, ..., Q_K}  Σ_k Pr(b = 1 and m_k = 1) Q_k    s.t.  Σ_k P_k Q_k = E.
E is the total media budget, P_k is the price per exposure for medium k, and Q_k is the number of exposures purchased for medium k. Note that the total number of exposures to purchasers will be proportional to the probability of consumers viewing medium k and purchasing the product, which is simply the joint probability of b and m_k. This takes into account both the total viewership of medium k as well as the proportion of viewers of medium k who have expressed an interest in the product category via purchase. The solution to the problem posed above is to purchase the medium with the highest ratio, Pr(b = 1 and m_k = 1)/P_k. This implies that the joint probability of b and m_k is the object of interest for media planning. However, if the price of a medium is proportional to the size of its viewership, P_k = c Pr(m_k = 1), then the optimality condition simply becomes to choose the medium with the highest conditional probability:

Pr(b = 1 and m_k = 1)/P_k = Pr(b = 1 and m_k = 1)/(c Pr(m_k = 1)) = (1/c) Pr(b = 1 | m_k = 1).

2.2 Situations in which only b | m is Desired

As discussed above, there are some situations in which we do not require estimates of the entire joint distribution of (b, m) but only require the conditional distribution of b | m. In these cases, some computational simplifications can be achieved over the approach outlined above. The goal now becomes to compute the predictive distribution of the conditional distribution of b | m:

p(b | m, D) = ∫ p(b, θ | m, D) dθ = ∫ p(b | θ, m, D) p(θ | m, D) dθ.

We now introduce the conditioning x variables into the expression:

= ∫∫ p(b, x | θ, m) dx p(θ | D) dθ = ∫∫ p(b | x, θ, m) p(x | θ, m) dx p(θ | D) dθ.
Using the assumption of conditional independence, we obtain

= ∫∫ p(b | x, θ) p(x | θ, m) dx p(θ | D) dθ = ∫∫ p(b | x, θ) p(x | m) dx p(θ | D) dθ.

This expression means that we average the conditional distribution of b | x with respect to the conditional distribution of x | m:

= ∫ E_{x|m}[ p(b | x, θ) ] p(θ | D) dθ.

We can approximate this conditional expectation by summing over the observations of x for the given value of m:

(3) p(b | m, D) ≈ (1/N_{x|m}) Σ_{x_i : m} ∫ p(b | x_i, θ) p(θ | D) dθ.

N_{x|m} is the number of x observations for which m takes on the particular value. In the media viewing case, this simply means that we sum over the empirical distribution of x for a specific medium. Thus, if we are only interested in computing b | m, we can simply model b | x and sum over the relevant values of the x variables. We avoid the effort and possible model misspecification errors associated with modeling m | x.

2.3 Identification and the Assumption of Conditional Independence

There is a fundamental identification issue in the data fusion problem (see also Rassler (2002), p. 5). The identification problem stems from the fact that we only observe data on the two marginal distributions of (b, x) and (m, x). The goal is to make inferences about the joint distribution of (b, m). In our data fusion method, the distribution of (b, m) is obtained from the conditional distribution by averaging over the marginal distribution of x, p(b, m) = ∫ p(b, m | x) p(x) dx. To see the identification problem, consider the alternative
definition that the joint distribution of (b, m) is a marginal of the joint distribution of (b, m, x), p(b, m) = ∫ p(b, m, x) dx. For any given marginal distributions of (b, x) and (m, x), there are many possible joint distributions (b, m, x). This means that the data fusion problem is fundamentally unidentified without some sort of restriction on the joint distribution of (b, m, x) or, equivalently, on the conditional distribution of (b, m) | x.

We start with the restriction that b and m are independent conditional on x. This is based on the view that if the x vector is rich enough, then b and m can be approximately independent. Clearly, if the x vector does not have sufficient explanatory power, the assumption of conditional independence can be violated. If a source of prior information (e.g., a sample of fused data) is available, we can incorporate deviations from conditional independence as illustrated in section 3.

In many situations, the assumption of conditional independence may be reasonable. However, it is clear that there may be some situations in which the content of the x vector may not be sufficient to ensure conditional independence. For example, consider the case in which the x vector contains only demographic variables. In order to ensure conditional independence, there must be no common component between category purchase and media viewership conditional on x. If the medium is narrowly focused on a specific interest, then the assumption of conditional independence might be violated. For example, consider the category of photographic equipment. Interest in photography is certainly related to demographics but is unlikely to be perfectly predicted by demographics alone. This means that there is likely to be a common component (interest in photography) that is present in both b (purchase of cameras) and m (readership of a photographic magazine).
However, for more general media such as TV programs, radio shows, newspapers, and general interest magazines, this is less likely to be a problem. It is important to realize that some restriction
is required and that, without an additional source of data on (b, m), this assumption cannot be tested.

It is instructive to examine other approaches to data fusion to see what identification assumptions are imposed, either explicitly or implicitly. The oldest approaches to data fusion involve data matching of some sort. Equivalent groups of observations are identified using the x variables. For example, in the hot deck approach, observations with the same values of the x vector are assumed to be equivalent, or to be a random sample from the conditional distribution of (b, m) given x. While it is not explicitly stated, the justification for these matching procedures is that conditional independence approximately holds (see also Rodgers (1984)). Data matching approaches that define a distance metric in the x space (e.g., Soong and de Montigny (2001)) and use observations that are close in terms of their x values also use the assumption of conditional independence. Kamakura and Wedel (1997) do not assume conditional independence and use a finite mixture of independent multinomials to approximate the joint distribution of (b, m, x). It is not clear whether their procedure will give rise to estimates of the joint distribution that display conditional independence.
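To summarize the framework, the estimators in equations (2) and (3) can be sketched as follows. This is a minimal plug-in sketch, not the paper's implementation: a single coefficient vector stands in for posterior draws of θ, and all coefficients and data rows are hypothetical.

```python
import math

def logit_prob(x, theta):
    """Pr(y = 1 | x) under a binary logit model with coefficient vector theta."""
    z = sum(t * xi for t, xi in zip(theta, x))
    return 1.0 / (1.0 + math.exp(-z))

def joint_estimate(xs_pooled, theta_b, theta_m):
    """Equation (2): average Pr(b=1|x) * Pr(m=1|x) over the pooled x rows
    from both datasets (conditional independence applied cell by cell)."""
    return sum(logit_prob(x, theta_b) * logit_prob(x, theta_m)
               for x in xs_pooled) / len(xs_pooled)

def direct_estimate(xs_viewers, theta_b):
    """Equation (3): average Pr(b=1|x) over only the x rows with m = 1,
    so that only the b|x model is needed."""
    return sum(logit_prob(x, theta_b) for x in xs_viewers) / len(xs_viewers)

# Hypothetical fitted coefficients and data (each x row includes an intercept).
theta_b = [0.5, -1.0]
theta_m = [-0.2, 0.8]
xs_pooled = [(1.0, 0.0), (1.0, 1.0), (1.0, 2.0)]
xs_viewers = [(1.0, 0.0), (1.0, 1.0)]  # rows whose respondents view medium m
print(round(joint_estimate(xs_pooled, theta_b, theta_m), 4))  # Pr(b=1, m=1)
print(round(direct_estimate(xs_viewers, theta_b), 4))         # Pr(b=1 | m=1)
```

In a full implementation the plug-in θ would be replaced by an average over MCMC or importance-sampling draws from p(θ | D), as the text describes.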
3. Relaxing the Conditional Independence Assumption

Our view is that conditional independence is a useful default or maintained model assumption. If the set of x variables is comprehensive and predictive of b and m behavior, conditional independence is likely to hold. Relaxing the assumption of conditional independence requires additional information beyond the sample information, as the joint distribution of b and m is not identified. Supplemental information can come from a variety of sources. We will consider the possibility that a subset of data is available for which the complete distribution of b, m, and x is observed.

There are many ways to incorporate conditional dependence by replacing (1) with some model of the conditional joint distribution of (b, m) | x. For example, Rassler (2002) introduces a prior distribution that captures some view of dependence for the case of multivariate normal b, m, and x variables. The problem is that the results are very sensitive to the choice of this prior, and assessment of the prior can be difficult. Our view is that this prior information must ultimately come from jointly observing data on comparable (b, m, x). Models of conditional dependence will depend on whether b and m are discrete or continuous and, even in the discrete case, on the number of values that b and m can take on. The literature has focused on multivariate normal models, which are of questionable relevance in marketing applications. Here we develop an approach to adding dependence for binary b and m variables, which is the most important case for many marketing applications. The 2 x 2 table of the joint distribution given x is a four-cell multinomial distribution with probabilities
p11(x) = Pr(b = 0, m = 0 | x)
p12(x) = Pr(b = 0, m = 1 | x)
p21(x) = Pr(b = 1, m = 0 | x)
p22(x) = Pr(b = 1, m = 1 | x).

In general, our approach involves building models for b | x and m | x. Let θ = (θ_b, θ_m), where θ_b denotes the parameters of the model for b | x and θ_m denotes the parameters of the model for m | x. Let p_b = Pr(b = 1 | x, θ_b) and p_m = Pr(m = 1 | x, θ_m). For example, if we use a binary logit model, p_b = exp(x'θ_b) / (1 + exp(x'θ_b)). If we assume conditional independence, the multinomial probabilities are given in the array

P = [p_ij] = [ (1 - p_b)(1 - p_m)   (1 - p_b) p_m
                p_b (1 - p_m)        p_b p_m ].

We can provide for a departure from conditional independence by introducing a parameter λ (|λ| ≤ 1). For positive λ, let a = λ min((1 - p_b) p_m, p_b (1 - p_m)). For negative λ, let a = λ min((1 - p_b)(1 - p_m), p_b p_m). a can be used to perturb the array P to represent a new multinomial distribution with conditional dependence:

(4) P = [ (1 - p_b)(1 - p_m) + a   (1 - p_b) p_m - a
           p_b (1 - p_m) - a        p_b p_m + a ].

If |λ| < 1, then this will constitute a valid multinomial distribution. Positive values of λ will provide for positive conditional dependence and vice versa. We note that the parameterization in (4) preserves the marginals of b and m while accommodating a specific degree of conditional dependence indexed by λ. The likelihood function for (λ, θ_b, θ_m) is given in equation (5).
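The perturbation in equation (4) can be sketched directly. The function below builds the perturbed 2 x 2 cell array; the input probabilities and λ value are illustrative, and the printed checks confirm that the cells sum to one and both marginals are preserved.

```python
def perturbed_table(p_b, p_m, lam):
    """Cell probabilities of equation (4): the conditional-independence table
    shifted by a = lam * min(...), which preserves the b and m marginals while
    adding dependence indexed by lam in (-1, 1)."""
    if lam >= 0:
        a = lam * min((1 - p_b) * p_m, p_b * (1 - p_m))
    else:
        a = lam * min((1 - p_b) * (1 - p_m), p_b * p_m)
    return [[(1 - p_b) * (1 - p_m) + a, (1 - p_b) * p_m - a],
            [p_b * (1 - p_m) - a,       p_b * p_m + a]]

# Hypothetical marginals and a positive-dependence perturbation.
P = perturbed_table(0.3, 0.4, 0.5)
print(round(sum(sum(row) for row in P), 10))   # cells sum to 1
print(round(P[1][0] + P[1][1], 10))            # Pr(b = 1) stays 0.3
print(round(P[0][1] + P[1][1], 10))            # Pr(m = 1) stays 0.4
```

The min(...) bound on a is what guarantees that no cell is driven below zero for |λ| < 1, so every λ in the interval yields a valid multinomial.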
(5) L(λ, θ_b, θ_m) = Π_{ℓ=1}^{n} Π_{i=1}^{2} Π_{j=1}^{2} p_{i,j,ℓ}^{I_{i,j,ℓ}}

I_{i,j,ℓ} is an indicator function for each of the four possibilities represented in the multinomial distribution. Given a prior on λ, we can easily implement a conditional Bayesian analysis. We have prior information that whatever conditional dependence exists is likely to be small. A reasonable prior for this case would be

(6) p(λ) ∝ (1 - |λ|)^α.

Note that (4) gives a joint model for (b, m) | x, θ_b, θ_m, λ. If we integrate out either b or m from the joint model, we will obtain the same marginal model for b | x or m | x as used to construct the joint. Thus, in the empirical application, we will infer about λ conditional on the values of θ̂_b, θ̂_m obtained by fitting the models b | x, θ_b and m | x, θ_m. While one could estimate all model parameters jointly, we do not expect to lose much precision with our conditional approach. The conditional approach has the benefit of a simple implementation.

4. An Empirical Application

One of the most common applications of data fusion methods is the fusing of buying behavior and media exposure. There are general purpose surveys of exposure to print and television media. Typically, these surveys collect demographic information as well. If a marketer is designing a marketing communication strategy for a product or group of products in a particular category, it is useful to know which media types are efficient for communication. This means that the marketer is interested in b | m for a specific set of m's whose coverage is observed in a media exposure survey. Fusion is made feasible by a set of
demographic variables, available in a separate buying survey, that are common to both the b and m data sets.

4.1 The BMRB dataset

Our data come from a survey of Great Britain consumers conducted by the British Market Research Bureau (BMRB). This is a general purpose survey of more than 20,000 consumers. The BMRB survey collects detailed information on viewership of the most popular GB TV shows along with extensive demographic information. Table 1 lists 19 demographic variables available in these data, including standard household demographics as well as geographic information. The BMRB survey also collects information on purchases of a variety of different categories of products. Table 1 lists 15 such product categories, with penetration rates of between 20 and 86 per cent. We chose these 15 product categories from the approximately 35 available in the data, including only those categories for which there was no missing data. In a typical application, the buying information would come from a separate survey designed specifically for this task or from diary panel data sets in which there would be few missing values. The BMRB survey is designed mostly for the purpose of obtaining purchase data and lifestyle information. This includes measuring media exposure. We confined attention to information on TV viewing of the 64 surveyed shows with no missing data (table 1 provides a list of the shows). All of our b variables and m variables are binary, with b = 1 defined as using the product and m = 1 as specifically choosing to watch the program. The sample size is 24,497. The BMRB data set provides fused data in the sense that both b and m variables are observed for the same survey respondent. This enables us to gauge the performance of our proposed methods.
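Before turning to the estimates, the conditional Bayesian analysis of λ from section 3 can be sketched as a simple grid approximation. This is a toy sketch, not the paper's computation: the plug-in cell probabilities, the cell counts, the prior exponent α, and the grid resolution are all hypothetical, and a single fitted (p_b, p_m) pair stands in for the observation-specific logit probabilities.

```python
import math

def perturbed_cell(p_b, p_m, b, m, lam):
    """One cell probability of equation (4): +a on the concordant cells
    (b == m), -a on the discordant cells."""
    if lam >= 0:
        a = lam * min((1 - p_b) * p_m, p_b * (1 - p_m))
    else:
        a = lam * min((1 - p_b) * (1 - p_m), p_b * p_m)
    base = (p_b if b else 1 - p_b) * (p_m if m else 1 - p_m)
    return base + a if b == m else base - a

def lambda_posterior(data, alpha=2.0, grid_n=199):
    """Grid approximation to the posterior of lam, combining the likelihood
    of (5) with the prior (6) ~ (1 - |lam|)^alpha.
    data: list of (p_b, p_m, b, m) tuples with plug-in fitted probabilities."""
    grid = [-0.99 + 1.98 * i / (grid_n - 1) for i in range(grid_n)]
    logpost = []
    for lam in grid:
        lp = alpha * math.log(1 - abs(lam))
        for p_b, p_m, b, m in data:
            lp += math.log(perturbed_cell(p_b, p_m, b, m, lam))
        logpost.append(lp)
    mx = max(logpost)                       # normalize in log space
    w = [math.exp(l - mx) for l in logpost]
    s = sum(w)
    return grid, [wi / s for wi in w]

# Toy fused subsample with more concordant (0,0)/(1,1) cells than conditional
# independence would imply, i.e., mild positive dependence.
data = ([(0.3, 0.4, 1, 1)] * 6 + [(0.3, 0.4, 0, 0)] * 10 +
        [(0.3, 0.4, 1, 0)] * 2 + [(0.3, 0.4, 0, 1)] * 2)
grid, post = lambda_posterior(data)
lam_mean = sum(g * p for g, p in zip(grid, post))
print(lam_mean > 0)  # posterior mass tilts toward positive dependence
```

The conditional approach in the text fixes θ̂_b and θ̂_m from the marginal fits, so only this one-dimensional posterior over λ needs to be computed.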
Ultimately, the goal of data fusion is to estimate the joint distribution of b and m. Specifically, we will estimate the conditional distribution, b | m, which we have indicated would be used to make media selection decisions. In the BMRB dataset, each of the b and m variables is binary and we have an extensive set of x variables. Our predictive approach requires estimation of the conditional distribution b | x and, in the case of the joint approach (eqn (2)), m | x as well. We start with a logistic regression specification of both conditional distributions. The x variables are a mixture of ordinal, categorical, and discretized continuous variables (age and education). We specify a logit fit with all variables, except age, entered as dummy variables for all (except one, of course) of their possible values. This logit specification guards against potential misspecifications in which the independent variables enter via additive, but possibly nonlinear, functions; it does not defend against misspecification of the probability locus or the single-index assumption.

To check for violations of model assumptions, we perform a simple graphical diagnostic. For each of the 15 b variables, we have a separate fitted logit model and associated fitted probabilities. We sort the data into k groups on the basis of the fitted probability. Actual frequencies for the dependent variable are then plotted against the expected frequency, or average probability, from the model fit. We use k = 20 groups in this example, which means that each group comprises over 1000 observations and the sample frequencies will be highly accurate estimates of the true probability that the dependent variable is 1.
If the model is properly specified, the sample frequencies and model-based expected frequencies should agree closely (note: the Hosmer-Lemeshow test for model misspecification is based on a test statistic which is the sum of the squared discrepancies between the sample and expected frequencies). We find the plot to be more informative.
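The grouped diagnostic just described can be sketched as follows. The fitted probabilities and binary outcomes below are hypothetical stand-ins for the logit fits, and k is kept small so the toy groups are visible.

```python
def calibration_table(fitted, actual, k=4):
    """Sort observations by fitted probability into k groups and compare the
    mean fitted probability with the observed frequency in each group (the
    same comparison that underlies the Hosmer-Lemeshow statistic, here
    inspected directly rather than summarized into one number)."""
    pairs = sorted(zip(fitted, actual))
    size = len(pairs) // k
    rows = []
    for g in range(k):
        chunk = pairs[g * size:(g + 1) * size] if g < k - 1 else pairs[g * size:]
        mean_fit = sum(p for p, _ in chunk) / len(chunk)
        freq = sum(y for _, y in chunk) / len(chunk)
        rows.append((round(mean_fit, 3), round(freq, 3)))
    return rows

# Roughly calibrated toy fit: outcomes broadly track the probabilities.
fitted = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
actual = [0,   0,   0,   1,   0,   1,   1,   1]
for mean_fit, freq in calibration_table(fitted, actual, k=4):
    print(mean_fit, freq)
```

Plotting the two columns of this table against each other, as in the paper's figure 1, should give points close to the 45 degree line when the model is well specified.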
Figure 1 displays the best (7. Restaurant patronage in the evening) and the worst performing b variables (15. Vitamins). Even the worst (in the sense of greatest deviation from a line) case provides strong evidence in favor of the logit specification. Similar results were obtained for the 64 models of m | x.

Given the fitted logit models, we implement the joint method (equation 2), in which we average the distribution of (b, m) | x over all x values in the data. An alternative is to fit only b | x and average over only the x values for which m = 1 or 0 (equation 3). The first method we call the joint method and the second the direct method. It should be noted that it is not clear which method will do a better job of estimating the conditional distribution of b | m. The joint method averages over a large sample of x values (the entire dataset) but incurs the sampling and misspecification error associated with modeling m | x. The direct method avoids the cost of modeling m | x but averages over only a smaller subset of x values. Given the large size of our dataset and the fact that the logit models seem to be very well specified, it is no surprise that the results based on the joint and direct methods agree quite closely. There are negligible differences between these two estimates over the 960 = 15 (b vars) x 64 (m vars) pairs in our data (the correlation between these probabilities is .99985).

Given that we have direct measurement of the joint distribution of b and m in our data, we can check our estimates against the value of b | m in the data:

p̂_actual_{i,j} = (1/dim(M_j)) Σ_{ℓ ∈ M_j} b_{i,ℓ},   where M_j = {ℓ : m_{j,ℓ} = 1}.

We do not need to subset our data to test our method since we do not use any aspect of the joint distribution of b and m in computing our estimator in (3). Figure 2 plots p̂_direct_{i,j} vs. p̂_actual_{i,j} for all of the 960 pairs. As pointed out in Kamakura and Wedel (1997), it would be
deceptive to plot the raw values of these probability estimates against each other. If the marginal probability of b varies a great deal over the set of 15 b variables, then even a terrible estimate, such as reporting only the marginal of each b variable, would still have a reasonably high correlation with the actual sample values. For this reason, we subtract the marginal probability of each b variable from the estimates. That is, we plot \tilde{p}^{direct}_{i,j} vs. \tilde{p}^{actual}_{i,j}, where \tilde{p}_{i,j} = \hat{p}_{i,j} - \hat{p}_i and \hat{p}_i is the marginal probability of b variable i. Figure 2 shows a very close correspondence between our estimates and the actual sample values based on the full sample of 24,497 pairs of b and m. The correlation is .98, and the mean absolute deviation, MAD = \sum_{i,j} |\tilde{p}^{direct}_{i,j} - \tilde{p}^{actual}_{i,j}| / (IJ), is correspondingly small. The dark line in Figure 2 is the 45 degree line. The dotted line is a least squares line fitted through the cloud of points. It is evident that the direct and actual estimates differ along two dimensions. The first, and most evident, is that the bulk of the points lie below the 45 degree line, indicating that our estimates are slightly too low. This downward bias is slight but discernible. The second is that the point cloud is rotated slightly clockwise from the 45 degree line, as indicated by the difference between the 45 degree and least squares lines. As we will show, both of these departures from a perfect fit (up to sampling error) are the result of the assumption of conditional independence. The rotation is caused by a combination of both positive and negative association deviations from conditional independence. The downward bias is caused by the preponderance of positive association deviations relative to negative ones.

4.2 Comparison to Matching Procedures