Quantitative Marketing and Economics, 3, 71–93, 2005. © 2005 Springer Science + Business Media, Inc. Printed in the United States.

Estimating Discrete Joint Probability Distributions for Demographic Characteristics at the Store Level Given Store Level Marginal Distributions and a City-Wide Joint Distribution

CHARLES J. ROMEO
Economist, Antitrust Division, U.S. Department of Justice, Washington, D.C., 20530
E-mail: charles.romeo@usdoj.gov

Abstract. This paper provides a solution to the problem of estimating a joint distribution using the associated marginal distributions and a related joint distribution. The particular application we have in mind is estimating joint distributions of demographic characteristics corresponding to market areas for individual retail stores. Marginal distributions are generally available at the census tract level, but joint distributions are only available for Metropolitan Statistical Areas, which are generally much larger than the market for a single retail store. Joint distributions over demographics are an important input into mixed logit demand models for aggregate data. Market shares that vary systematically with demographics are essential for relieving the restrictions imposed by the Independence from Irrelevant Alternatives property of the logit model. We approach this problem by formulating a parametric function that incorporates both the city-wide joint distributional information and marginal information specific to the retail store's market area. To estimate the function, we form moment conditions equating the moments of the parametric function to observed data, and we input these into a GMM objective. In one of our illustrations we use four marginal demographic distributions from each of eight stores in Dominick's Finer Foods data archive to estimate a four-dimensional joint distribution for each store.
Our results show that our GMM approach produces estimated joint distributions that differ substantially from the product of marginal distributions and yield marginals that closely match the observed marginal distributions. Mixed logit demand estimates are also presented which show the estimates to be sensitive to the formulation of the demographics distribution.

Key words: mixed logit, discrete joint probability distributions, generalized method of moments

JEL Classifications: C51, C81

1. Introduction

The advantage of the mixed logit model for aggregate data pioneered by Berry (1994) and Berry, Levinsohn, and Pakes (1995) (henceforth BLP) is that it allows one to solve for the primitives of a flexible differentiated products model using only aggregate data on prices, quantities sold, and product characteristics. Heterogeneity is introduced by interacting randomly generated consumer tastes with the characteristics of products in a logit demand function. The BLP paper has been followed by a steady stream of papers in the

The views expressed are not purported to reflect those of the United States Department of Justice.

economics and marketing literatures, as it offers the possibility of flexible inference with readily available data. 1 However, the flexibility engendered by mixing the logit model does not come about magically. Generating elasticities relieved of the restrictions imposed by the Independence from Irrelevant Alternatives (IIA) property of the logit model generally requires information beyond just aggregate data on product characteristics. BLP recognized this in their seminal paper by introducing demographic data on income in addition to normal random variates to represent individual types. Extensions produced since BLP have pushed in the direction of incorporating additional demographic information into the model. Nevo (2001) and Davis (1998) introduced draws from a joint distribution of demographic information into the second stage of the demand hierarchy. Berry, Levinsohn, and Pakes (2003) and Petrin (2002) incorporated moment conditions based on consumer survey data into the GMM objective to improve the fit of certain aspects of the model. Dube (2002) and Hendel (1999) present multiple discrete choice models that completely mix micro with aggregate data, while Chintagunta and Dube (2003) present a BLP-type model in which they integrate household level purchase data with store level market share data to improve the estimates of both the mean response and the heterogeneity distribution over what could be obtained with a single data source. A difficulty that one sometimes faces with these models is in obtaining a joint distribution of demographics that matches the contours of the market for the products under study. The particular application we have in mind for this paper is the Dominick's store level data available on the University of Chicago, Graduate School of Business web site. 2 This data archive contains as many as 400 weeks of store level observations on a myriad of supermarket products.
In addition, a file of store level demographic distributions is available that provides a snapshot of the characteristics of the households and the local economy for each of the 89 Dominick's stores. However, all the distributions in the demographic file are marginals, and this limits their usefulness for mixing with BLP class models. One is either limited to drawing a single demographic characteristic, as BLP did in their original work, or to forming store level joint distributions as a product of marginals and hoping that the difference between joint distributions approximated in this manner and the true store level joint distributions is empirically unimportant. 3 This limitation is not specific to the Dominick's data. Marginal distributional information is available for numerous demographic variables at the census tract level, while joint distributions can only be formed for a few variables. Consequently, the potential is there for researchers to face this limitation whenever the focus is on modeling demand at the retail outlet level and the market area for the outlet's goods is a subset of the census tracts in the Metropolitan Statistical Area (MSA). 4

1 Kadiyali, Sudhir, and Rao (2001) provide a survey.
3 Meza and Sudhir (2003) appear to take this approach.
4 At the MSA or Primary MSA (PMSA) level, joint distributional information is readily available for a wide variety of variables from the Current Population Survey web site (

The innovation offered by this paper is to develop a Generalized Method of Moments (GMM) approach for consistently estimating discrete store level joint distributions by combining discrete store level marginal distributions with information from a discrete joint distribution for the same set of variables from the associated MSA. The essence of our approach is to use the available store level and MSA level information to form an initial estimate of the joint distribution of interest that contains all of the elements of variation of the true store level distribution. For example, suppose we are interested in the joint distribution of income and race for an individual store. To form an initial estimate of this joint distribution, we could specify a parametric function that incorporates the MSA level joint distribution over income and race, to capture joint variation in these two variables that is not specific to the individual retail outlet, and the store level marginal distributions for income and race, to capture information specific to the individual store. This function varies over both income and race, and is specific to each individual retail outlet. Moment conditions are then formed that equate moments formulated using this parametric function to observed moments. Previous researchers have faced this problem in other contexts, and two previous solutions have been offered. To our knowledge the first solution is attributed to Deming and Stephan (1940). Their method, Iterative Proportional Fitting (IPF), was used to estimate internal cells of a two-way contingency table for the total census population. The inputs they used were a two-way contingency table for the same variables generated from a five percent census sample and marginal frequencies for the total census population. The approach is iterative in that row frequencies are matched first, then column frequencies.
Matching column frequencies alters the row frequencies and vice versa, so each is then updated in second and subsequent iterations until convergence is achieved. The objective function underlying IPF is a constrained minimum chi-square. This is an intrinsically statistical objective, but, to our knowledge, the statistical properties of the IPF estimator have never been developed. 5 We do not develop them here as that is outside the scope of this research. More recently, Putler, Kalyanam, and Hodges (1996) have offered a Bayesian solution. Their interest is in estimating joint distributions over demographics to improve the targeting of marketing efforts. They too use MSA level Census data, in their case to provide a prior joint distribution, and they use smaller area marginal distributions as data inputs to the posterior. They form a posterior distribution over free cells, i.e., those not constrained by the marginal information. In comparison to our moment based approach, the Putler et al. Bayesian approach has the advantage of incorporating the structural likelihood information, which should improve the fit of the model to the data. On the other hand, the number of parameters to be estimated grows rapidly with both the number of cells in a given dimension and the number of dimensions. This limits the

5 This was an active area of research throughout the 1940s until at least the early 1960s. Researchers offered a variety of modifications to Deming and Stephan's IPF algorithm (Stephan, 1942; Smith, 1947; Friedlander, 1961), but the focus was generally on providing a more efficient algorithm to reduce computational costs. In the days when computational power was defined by pencil and paper, statistical distributions that were not feasible to calculate may have been perceived as too esoteric to invest in describing.
The most recent discussion of IPF that we have found is in Bishop, Fienberg, and Holland (1975), which also does not contain any discussion of the statistical properties of this estimator.
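The alternating row/column matching just described can be sketched in a few lines. This is a generic illustration of the IPF update, not code from the paper; the seed table and target margins below are hypothetical:

```python
import numpy as np

def ipf(seed, row_targets, col_targets, tol=1e-10, max_iter=1000):
    """Iterative proportional fitting: rescale a seed two-way table until its
    row and column margins match the target marginal frequencies."""
    p = np.asarray(seed, dtype=float).copy()
    row_targets = np.asarray(row_targets, dtype=float)
    col_targets = np.asarray(col_targets, dtype=float)
    for _ in range(max_iter):
        p *= (row_targets / p.sum(axis=1))[:, None]   # match row margins
        p *= col_targets / p.sum(axis=0)              # match column margins
        if np.allclose(p.sum(axis=1), row_targets, atol=tol):
            break
    return p

# hypothetical inputs: a "five percent sample" joint table and census margins
sample_joint = np.array([[0.20, 0.30], [0.40, 0.10]])
fitted = ipf(sample_joint, row_targets=[0.6, 0.4], col_targets=[0.5, 0.5])
```

In Deming and Stephan's application, the seed would be the sample contingency table and the targets the full-census marginal frequencies.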

feasibility of the Putler et al. approach to estimating joint distributions with relatively few free cells. Two illustrations are provided. In the first illustration, we use an example from Putler et al. to facilitate a comparison of the three approaches to solving this problem: IPF, Bayesian, and GMM. This example shows that all three methods produce similar estimates of the joint distribution. In the second illustration, we estimate a four-dimensional joint distribution over demographics for each of eight Dominick's stores using only the GMM approach. Our results show that the model produces an excellent fit to the moment conditions, and that the estimated joint distributions produce marginal distributions that are generally the same as the observed marginals to at least two decimal places. In addition, we evaluate the empirical importance of the demographic distribution formulation in a mixed logit demand model. To do this, we generate two sets of estimates for an equilibrium mixed logit demand and supply system for bath tissue data from these eight Dominick's stores. For the first estimates, we draw the demographics from a joint distribution formulated as a product of marginal distributions, while for the second estimates we draw demographics from the joint distribution estimated by GMM. The results show substantial, though generally not statistically significant, differences between the estimates. The remainder of this paper is organized as follows. Section 2 contains the methodology for initializing and estimating store level discrete joint distributions. This section is developed in five parts. In part one, we formulate the parametric function and moment conditions for a two-by-two discrete probability distribution. In part two, we formulate the parametric function for the general case, while part three contains the associated moment conditions.
Differences in model parameterization for the GMM, IPF, and Bayesian approaches to inference are discussed in part four, and part five discusses GMM estimation. Section 3 contains the two illustrations, and Section 4 contains conclusions.

2. Formulating and estimating a discrete joint distribution

2.1. The two-by-two case

Suppose we observe joint probabilities for a city-wide area, and marginal probabilities for the market area associated with a retail outlet within the city, as in Table 1, where \(c_{jk} \geq 0\), \(\sum_{j,k} c_{jk} = 1\), and \(d_j, d'_k \geq 0\), \(\sum_j d_j = 1\), \(\sum_k d'_k = 1\). Our goal is to use this information to parameterize a joint probability distribution for the retail outlet. Let \(p(\theta) = (p_{11}(\theta), p_{12}(\theta), p_{21}(\theta), p_{22}(\theta))\) be the unknown joint probabilities formulated in terms of observed data and unknown parameters \(\theta\). Suppose now we take log odds transformations of the data in Table 1 and incorporate this information into a logistic

Table 1. A city-wide joint probability distribution with associated retail outlet marginal distributions for the two-by-two case.

City-wide joint distribution:

            z2 = 1    z2 = 2
  z1 = 1    c_{11}    c_{12}
  z1 = 2    c_{21}    c_{22}

Retail outlet marginal distributions:

  z1:  d_1, d_2        z2:  d'_1, d'_2

distribution for \(p(\theta)\) as follows,

\[
p_{jk}(\theta) = \frac{\exp\{A_{jk}\ln(c_{jk}/c_{22}) + \beta_1 I_{[j=1]}\ln(d_j/d_2) + \beta_2 I_{[k=1]}\ln(d'_k/d'_2)\}}{\sum_{s,r=1,2}\exp\{A_{sr}\ln(c_{sr}/c_{22}) + \beta_1 I_{[s=1]}\ln(d_s/d_2) + \beta_2 I_{[r=1]}\ln(d'_r/d'_2)\}}. \tag{1}
\]

The log odds transformations are formed by dividing each element of the distributions in Table 1 by the last element in that distribution and then taking logs. Since, by construction, \(\ln(c_{22}/c_{22}) = \ln(d_2/d_2) = \ln(d'_2/d'_2) = 0\), the specifications for \(p_{12}(\theta)\), \(p_{21}(\theta)\), and \(p_{22}(\theta)\) each have fewer terms than that for \(p_{11}(\theta)\). Define \(A = [A_{jr}]_{j,r=1,2}\), and \(\theta = (A, \beta_j,\ j = 1,2)\). The \(I_{[\cdot]}\) are indicator functions equal to one if the condition in the brackets is satisfied. The advantage of specifying a parametric form for \(p(\theta)\) is that it contains all the elements of variation of the true unknown joint distribution. Incorporating the \(\ln(c_{jk}/c_{22})\) terms reflects the joint variation at the city level, while the \(\ln(d_j/d_2)\) and \(\ln(d'_k/d'_2)\) terms incorporate information specific to the retail outlet into \(p(\theta)\) that adjusts the city level joint variation. Using a parametric form for \(p(\theta)\) enables us to fit the moment conditions with fewer parameters than required for either the IPF or Bayesian approaches. In addition, the number of parameters to be estimated will grow more slowly with problem size than for either of these other approaches. We form three sets of moment conditions and an associated GMM objective to consistently estimate \(\theta\). The first set of conditions is formed as the difference between the estimated and observed marginals: 6

\[
p_{j\cdot}(\theta) - d_j = 0,\ j = 1,2, \qquad p_{\cdot r}(\theta) - d'_r = 0,\ r = 1,2. \tag{2}
\]

Given the adding up constraints on the marginal distributions, only two of the four moment conditions above are independent. Using these four moment conditions alone will

6 It is a slight abuse of notation to set the following moment conditions exactly to zero. Rather, the GMM objective will make the discrepancies between the estimated and observed moments as small as possible. We extend our notation to remedy this abuse in Section 2.3.

only enable us to consistently estimate \(\beta_1\) and \(\beta_2\), with \(A\) set to \([1]\), a matrix of ones. 7 To estimate parameters in \(A\) we introduce a second condition relating the city-wide and retail outlet covariances:

\[
\operatorname{cov}(z_1, z_2; \theta) - \text{city-wide } \operatorname{cov}(z_1, z_2) = 0, \tag{3}
\]

where

\[
\operatorname{cov}(z_1, z_2; \theta) = \sum_{j,k=1,2} (z_{1j} - E[z_1; \theta])(z_{2k} - E[z_2; \theta])\, p_{jk}(\theta),
\]

and \(E[z_1; \theta] = \sum_{j=1,2} z_{1j}\, p_{j\cdot}(\theta)\), and \(E[z_2; \theta]\) is similarly formulated. Covariance is not a very meaningful measure in discrete distributions. We chose to use a condition based on covariance discrepancies because covariances are the simplest moments that are formulated using bivariate distributional information. This condition penalizes differences in the estimated and city-wide bivariate distributions, \(p_{jk}(\theta)\) and the \(c_{jk}\) respectively. Including it improved the model fit in both our illustrations. The model in (1) contains five parameters; \(A_{22}\) does not enter the model. For purposes of parsimony and identification, we structure \(A\) as the product of two \(2 \times 1\) vectors \(\alpha_1\) and \(\alpha_2\), such that \(A = \alpha_1 \alpha_2'\), and set one of these vectors to a column of ones. 8 For the third set of conditions we specify the joint distribution for each cell as the product of a conditional distribution derived from (1) and a retail outlet marginal distribution. Taking the difference between two formulations of each cell provides the moment conditions. Specifically, form moment conditions as

\[
d_j\, p_{jr}(\theta)/p_{j\cdot}(\theta) - d'_r\, p_{jr}(\theta)/p_{\cdot r}(\theta) = d_j\, p_{\cdot r}(\theta) - d'_r\, p_{j\cdot}(\theta) = 0,\quad j, r = 1,2. \tag{4}
\]

As the second expression in (4) shows, this condition simplifies to a difference of products of marginal distributions of \(d\) and \(p\). There is a condition in the form of (4) for each cell in the joint probability distribution, and this penalizes the model for any discrepancies in the joint probabilities. Since this condition relies on the same sample moments as (2), it does not increase the number of parameters in \(\theta\) that can be identified, but, as our illustrations show, including this condition improves model fit.

7 For problems larger than \(2 \times 2\), enough independent marginal moment conditions are available to make elements in \(A\) estimable with these conditions alone, at least in principle. In one of our illustrations, however, estimating the \(\beta\)s and elements in \(A\) with just these conditions produced estimated joint distributions that were very close to joint distributions formulated as a product of marginal distributions.

8 In the \(2 \times 2\) case we still have four parameters and only three independent moment conditions if \(A\) is structured this way, and hence we still could not estimate either \(\alpha\) vector. This is only a problem for this particular case.
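To make the 2 × 2 construction concrete, the sketch below evaluates equation (1) and stacks the moment discrepancies of (2)–(4) into the GMM criterion, here with A fixed at a matrix of ones. This is our own illustration, not code from the paper; the data, helper names, and support points are hypothetical:

```python
import numpy as np

def p_theta(c, d_row, d_col, A, b1, b2):
    """Eq. (1): logistic joint distribution built from log odds of the
    city-wide joint c and the store marginals d_row (z1) and d_col (z2)."""
    x = np.log(c / c[1, 1])          # log odds of the city-wide joint
    y1 = np.log(d_row / d_row[1])    # log odds of the z1 marginal; last entry 0
    y2 = np.log(d_col / d_col[1])    # log odds of the z2 marginal; last entry 0
    e = np.exp(A * x + b1 * y1[:, None] + b2 * y2[None, :])
    return e / e.sum()

def cov2(p, z1, z2):
    """Covariance implied by a 2x2 joint distribution on supports z1, z2."""
    e1, e2 = z1 @ p.sum(axis=1), z2 @ p.sum(axis=0)
    return ((z1 - e1)[:, None] * (z2 - e2)[None, :] * p).sum()

def gmm_objective(b1, b2, c, d_row, d_col, z1, z2):
    p = p_theta(c, d_row, d_col, np.ones((2, 2)), b1, b2)
    m1 = np.concatenate([p.sum(axis=1) - d_row, p.sum(axis=0) - d_col])  # (2)
    m2 = np.array([cov2(p, z1, z2) - cov2(c, z1, z2)])                   # (3)
    m3 = (np.outer(d_row, p.sum(axis=0))
          - np.outer(p.sum(axis=1), d_col)).ravel()                      # (4)
    T = np.concatenate([m1, m2, m3])
    return T @ T

# hypothetical city-wide joint and store marginals
c = np.array([[0.30, 0.20], [0.20, 0.30]])
d_row, d_col = np.array([0.55, 0.45]), np.array([0.60, 0.40])
z1 = z2 = np.array([1.0, 2.0])
val = gmm_objective(1.0, 1.0, c, d_row, d_col, z1, z2)
```

Any derivative-free minimizer or a coarse grid search over (β1, β2) then delivers the GMM estimate; that step is omitted here.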

2.2. The general case: formulating an initial estimate of the store level joint probabilities

Let \(z = (z_1, \ldots, z_J)\) be a discrete random vector. For each \(z_j\), \(j = 1, \ldots, J\), let \(m_j = (m_{j1}, \ldots, m_{jk_j})\) be a vector of \(k_j < \infty\) support points, and let \(m = \{(m_{1l_1}, m_{2l_2}, \ldots, m_{Jl_J}) : l_j = 1, \ldots, k_j,\ j = 1, \ldots, J\}\) denote the set of all \(J\)-dimensional support points for \(z\). Indexing stores by \(s = 1, \ldots, S\), our goal is to consistently estimate the true joint probabilities \(P_z^0(m \mid s) = [P_z^0(m_{1l_1}, m_{2l_2}, \ldots, m_{Jl_J} \mid s)]\), of dimension \((k_J \cdots k_2) \times k_1\) (joint over the random vector \(z\)), for each \(s\), given the known joint probabilities \(C_z(m) = [C_z(m_{1l_1}, m_{2l_2}, \ldots, m_{Jl_J})]\), also \((k_J \cdots k_2) \times k_1\), for the city as a whole, and known marginal distributions \(D_{z_j}(m_j \mid s) = [D_{z_j}(m_{jl_j} \mid s)]\), \(k_j \times 1\) (marginal with respect to the \(z_j\)). Our first step in estimating \(P_z^0(m \mid s)\) is to use the available information to parameterize a function, say \(P_z(m \mid s; \theta)\), that contains all of the elements of variation in \(P_z^0(m \mid s)\). To formulate \(P_z(m \mid s; \theta)\) we generate the necessary data inputs from log odds transformations of our known city-wide and store level distributions. From the city-wide distribution we generate the data

\[
x(m_{1l_1}, \ldots, m_{Jl_J}) = \ln\!\left(\frac{C_z(m_{1l_1}, \ldots, m_{Jl_J})}{C_z(m_{1k_1}, \ldots, m_{Jk_J})}\right), \quad l_j = 1, \ldots, k_j,\ j = 1, \ldots, J, \tag{5}
\]

while we use the store level marginals to provide data

\[
y_j(m_{jl_j} \mid s) = \ln\!\left(\frac{D_{z_j}(m_{jl_j} \mid s)}{D_{z_j}(m_{jk_j} \mid s)}\right), \quad l_j = 1, \ldots, k_j,\ j = 1, \ldots, J, \tag{6}
\]

for each \(s\). It is useful to organize the matrix \(x = [x(m_{1l_1}, \ldots, m_{Jl_J})]\) so that it is of dimension \((k_J k_{J-1} \cdots k_2) \times k_1\), as we are going to organize the corresponding matrix of unknown parameters \(A\) conformably with \(x\). Specifically, define the set of parameter vectors \(\{\alpha_1, \ldots, \alpha_J\}\) such that \(\alpha_j\) is \(k_j \times 1\) for each \(j\), and formulate the matrix \(A\) of parameters as \(A = \alpha_J \otimes \cdots \otimes \alpha_3 \otimes (\alpha_2 \alpha_1')\). \(A\) has the same dimensions as \(x\), and \(A(m_{1l_1}, \ldots, m_{Jl_J}) = \alpha_{1l_1} \alpha_{2l_2} \cdots \alpha_{Jl_J}\). In general, we set \(\{\alpha_j = \iota_{k_j},\ j = 1, \ldots, J,\ j \neq r\}\), where \(\iota_{k_j}\) is a \(k_j\)-vector of ones, so that only one \(\alpha\) vector is estimated. The choice of which vector to allow to vary freely, \(\alpha_r\), depends in part on the number of linear and covariance constraints available, and in part on model fit criteria. Use (5) and (6) to formulate \(P_z(m \mid s; \theta) = [P_z(m_{1l_1}, m_{2l_2}, \ldots, m_{Jl_J} \mid s; \theta)]\), of dimension \((k_J \cdots k_2) \times k_1\), as a logistic distribution having elements

\[
P_z(m_{1l_1}, \ldots, m_{Jl_J} \mid s; \theta) = \frac{\exp\big(A(m_{1l_1}, \ldots, m_{Jl_J})\, x(m_{1l_1}, \ldots, m_{Jl_J}) + \beta_1 y_1(m_{1l_1} \mid s) + \cdots + \beta_J y_J(m_{Jl_J} \mid s)\big)}{\sum_{l_1, \ldots, l_J} \exp\big(A(m_{1l_1}, \ldots, m_{Jl_J})\, x(m_{1l_1}, \ldots, m_{Jl_J}) + \beta_1 y_1(m_{1l_1} \mid s) + \cdots + \beta_J y_J(m_{Jl_J} \mid s)\big)}, \quad l_j = 1, \ldots, k_j,\ j = 1, \ldots, J, \tag{7}
\]

where the \(\beta_j\) are scalars and \(\theta = (\alpha_r, \beta_1, \ldots, \beta_J)\). Equation (7) contains all the elements of variation contained in the unknown store level distributions \(P_z^0(m \mid s)\): \(x(m_{1l_1}, \ldots, m_{Jl_J})\) allows for variation among the \((z_1, \ldots, z_J)\) unconditional on \(s\), while each of the \(y_j(\cdot \mid s)\) allows for adjustments to the city-wide distribution for a particular \(z_j\) conditional on \(s\). Finally, setting \(\alpha_r = \iota_{k_r}\) and setting \(\beta_j = 1\) for all \(j\) gives us an initial estimate of \(P_z^0(m \mid s)\). To improve upon this estimate, we specify a set of moment conditions having \(P_z^0(m \mid s)\) as their unique solution. Choosing \(\theta\) to minimize the GMM criterion formed from these moment conditions provides a consistent estimate of \(P_z^0(m \mid s)\).

2.3. Moment conditions

Using (7), specify marginal probabilities as

\[
P_{z_r}(m_{rl_r} \mid s; \theta) = \sum_{l_1, \ldots, l_{r-1}, l_{r+1}, \ldots, l_J} P_z(m_{1l_1}, \ldots, m_{Jl_J} \mid s; \theta). \tag{8}
\]

Form the following three sets of moment conditions for each \(s\):

(M1) \(P_{z_j}(m_{jl_j} \mid s; \theta) - D_{z_j}(m_{jl_j} \mid s) = \delta_{jl_j}\), \(l_j = 1, \ldots, k_j\), \(j = 1, \ldots, J\);

(M2) \(\operatorname{Cov}(z_j, z_r \mid s; \theta) - \text{city-wide } \operatorname{Cov}(z_j, z_r) = \eta_{jr}\), each \(j, r\), \(j \neq r\);

(M3) \(P_{z_g}(m_{gl_g} \mid s; \theta)\, D_{z_r}(m_{rl_r} \mid s) - P_{z_r}(m_{rl_r} \mid s; \theta)\, D_{z_g}(m_{gl_g} \mid s) = \nu^{rg}_{l_1, \ldots, l_J}\), \(l_r = 1, \ldots, k_r\), \(l_g = 1, \ldots, k_g\), \(r, g = 1, \ldots, J\), \(r < g\).

Condition (M1) is formed as the difference between the estimated and observed marginal distributions at each point of support. There are \(k_j\) moment conditions constraining the marginals for each \(j\). The difference between the estimated and observed marginals is assumed to equal an error \(\delta_{jl_j}\) having the properties \(E[\delta_{jl_j}] = 0\) and \(\operatorname{Var}[\delta_{jl_j}] < \infty\), at each \(l_j\), all \(j\). Moment condition (M2) imposes covariance assumptions in the estimation. There are \(\binom{J}{2} = J(J-1)/2\) of these conditions available. We assume the difference between the estimated and observed covariance to equal an error \(\eta_{jr}\) having the properties \(E[\eta_{jr}] = 0\) and \(\operatorname{Var}[\eta_{jr}] < \infty\), all \(j, r\). As shown in (4) above, condition (M3) is equivalent to formulating the joint distribution \(P_z(m \mid s; \theta)\) two different ways at each point of support \(m\), with each formulation mixing a different estimated conditional distribution with an observed store level marginal. There are \(k_1 k_2 \cdots k_J\) conditions (M3), corresponding to the same number of points of support of \(P_z(m \mid s; \theta)\), for each \((r, g)\) pair, and there are \(\binom{J}{2} = J(J-1)/2\) \((r, g)\) pairs. The difference in the two formulations of the joint moments is assumed to be equal to an error \(\nu^{rg}_{l_1, \ldots, l_J}\) that has the properties \(E[\nu^{rg}_{l_1, \ldots, l_J}] = 0\) and \(\operatorname{Var}[\nu^{rg}_{l_1, \ldots, l_J}] < \infty\), at each \(l_1, \ldots, l_J\), all \(r, g\).
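For the general case it is convenient to hold the joint distribution as a J-dimensional array, so that the marginal sums in (8) and the residuals (M1)–(M3) reduce to a few array operations. The sketch below is our own illustration, not code from the paper; helper names are hypothetical, and (M3) is implemented in the simplified pairwise-marginal form of equation (4):

```python
import numpy as np
from itertools import combinations

def marginal(P, j):
    """Eq. (8): sum the joint array over every axis except j."""
    return P.sum(axis=tuple(a for a in range(P.ndim) if a != j))

def cov_pair(P, support, j, r):
    """Covariance of (z_j, z_r) implied by joint array P (assumes j < r)."""
    Ej = support[j] @ marginal(P, j)
    Er = support[r] @ marginal(P, r)
    Pjr = P.sum(axis=tuple(a for a in range(P.ndim) if a not in (j, r)))
    return ((support[j] - Ej)[:, None] * (support[r] - Er)[None, :] * Pjr).sum()

def stacked_residuals(P, C, D, support):
    """Stack (M1), (M2), and pairwise (M3) residuals into one long vector T;
    the GMM criterion is then T @ T."""
    J = P.ndim
    m1 = [marginal(P, j) - D[j] for j in range(J)]                        # (M1)
    m2 = [np.array([cov_pair(P, support, j, r) - cov_pair(C, support, j, r)])
          for j, r in combinations(range(J), 2)]                          # (M2)
    m3 = [np.outer(marginal(P, g), D[r]) - np.outer(D[g], marginal(P, r))
          for r, g in combinations(range(J), 2)]                          # (M3)
    return np.concatenate([v.ravel() for v in m1 + m2 + m3])

# a 2x2x2 sanity check: when the store equals the city, all residuals vanish
C = np.ones((2, 2, 2)) / 8.0
D = [marginal(C, j) for j in range(3)]
support = [np.array([0.0, 1.0])] * 3
T = stacked_residuals(C, C, D, support)
```

Here the residual vector has length (2+2+2) + 3 + 3·(2·2) = 21 and is identically zero, since the candidate joint reproduces both the marginals and the city-wide covariances.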

2.4. Differences in parameterization of the GMM, IPF, and Bayesian approaches to inference

In the general case, there are \(P_J = \prod_{j=1}^J k_j - 1\) independent cells in the joint probability distribution and \(S_J = \sum_{j=1}^J k_j - J\) independent marginal relationships. In addition, we add \(J(J-1)/2\) covariance conditions, yielding a total of \(S_J + J(J-1)/2\) independent constraints for identifying \(\theta\). Identification requires the number of free parameters in \(\theta\) to be less than or equal to the number of independent constraints. In general, our approach requires many fewer free parameters than there are independent constraints to achieve a good model fit. We associate one \(\beta\) parameter with each store level marginal, and allow one \(\alpha\) vector, the \(r\)th, to be free, for a total of \(J + k_r\) free parameters. Underlying the IPF estimator is a chi-square criterion that minimizes the difference between the unknown probabilities \(P_z^0(m \mid s)\) and the observed city-wide joint distribution \(C_z(m)\). This criterion is then subject to the \(S_J\) marginal constraints, each interacted with an unknown Lagrange multiplier. Estimation of the Lagrange multipliers is the focus of the Iterative Proportional Fitting algorithm. Then, as Deming and Stephan (1940) show, using the estimated Lagrange multipliers one can infer estimates for all the free cells in \(P_z^0(m \mid s)\). In general, the number of Lagrange multipliers, \(S_J\), exceeds the \(J + k_r\) parameters estimated using our GMM approach, but not substantially so. Putler, Kalyanam, and Hodges (1996) take a Bayesian approach. They use the city-wide joint distribution to form a Dirichlet prior and specify a multinomial likelihood over the store level joint distribution. They do not parameterize the probabilities in the multinomial distribution and, as such, this leaves them with a posterior distribution over \(DF = P_J - S_J\) free parameters. Since \(DF\) grows quickly in both the number of cells in any given dimension and in the number of dimensions, the computational cost of posterior inference grows more quickly than for either the parametric GMM approach we propose or the IPF approach. The Putler et al. approach can be extended by specifying parametric functions for the Dirichlet prior probabilities and for the probabilities in the likelihood to reduce the dimensionality of the estimation problem. This will, however, complicate the structure of the posterior, as conjugacy of the prior and the likelihood will be lost by introducing a parametric form for \(P_z(m \mid s)\).

2.5. Estimation

To form a GMM objective function, we input the moment conditions in one long vector. We chose this formulation because there is a different number of moments associated with each set of moment conditions, and for moment condition (M1), the number of moments varies with each marginal distribution. Hence there is no natural way to allow the moment conditions to freely correlate. Define

\[
\Delta(\theta) = \big[\delta_{11}(\theta), \delta_{12}(\theta), \ldots, \delta_{1k_1}(\theta), \ldots, \delta_{Jk_J}(\theta)\big]',
\]
\[
H(\theta) = \big[\eta_{12}(\theta), \eta_{13}(\theta), \ldots, \eta_{1J}(\theta), \ldots, \eta_{J-1,J}(\theta)\big]',
\]
\[
V(\theta) = \big[\nu^{1,2}_{1,\ldots,1}(\theta), \ldots, \nu^{1,2}_{k_1,\ldots,k_J}(\theta), \ldots, \nu^{J-1,J}_{k_1,\ldots,k_J}(\theta)\big]',
\]

and define \(T(\theta) = [\Delta(\theta)', H(\theta)', V(\theta)']'\), where the vector \(T(\theta)\) has length \(\sum_{j=1}^J k_j + J(J-1)/2 + (k_1 k_2 \cdots k_J) J(J-1)/2\). Now specify the objective function as

\[
\hat{\theta} = \operatorname{argmin}_\theta \{T(\theta)' T(\theta)\}. \tag{9}
\]

Estimates of \(\theta\) from (9) have asymptotic normal distribution \(\hat{\theta} \sim N(\theta, \sigma^2 (G'G)^{-1})\), where \(\sigma^2 = \operatorname{Var}[T(\hat{\theta})]\) and \(G = \partial T(\hat{\theta})/\partial \hat{\theta}'\). In addition, \(P_z(m \mid s; \theta)\) is asymptotically normally distributed with mean \(P_z(m \mid s; \hat{\theta})\) and variance \(\sigma^2 H(G'G)^{-1} H'\), where \(H = \partial P_z(m \mid s; \hat{\theta})/\partial \hat{\theta}'\).

3. Illustrations

We present two illustrations. The first uses data and results from an illustration presented in Putler et al. The authors present joint distributions estimated three ways: as a product of marginals, using IPF, and conducting posterior inference. To these results we add a column obtained using our GMM approach. The second illustration uses demographic data from eight Dominick's stores and from the Chicago PMSA. We use the GMM approach to estimate a joint distribution over demographics for each of these stores, and present model fit statistics. We then determine if the formulation of the joint demographic distribution has empirically important effects on the results of an equilibrium mixed logit demand-supply model for bath tissue consumption. Two sets of results are generated. For the first set of results we draw individual types from a joint demographic distribution formulated from a product of marginals, while for the second, we take draws from our GMM estimate of the joint demographic distribution.

3.1. Targeting the market for stain resistant carpeting

As discussed in Putler et al., the target market for stain resistant carpeting is married couples who are homeowners with young children living at home. To estimate a joint distribution for these three variables for Sioux Falls, South Dakota, the authors use marginals for Sioux Falls, and a prior joint distribution that corresponds to the whole state of South Dakota.
The variables are each coded into binary categories: (renter, homeowner), (married, unmarried), (children under 18, no children under 18). The true distribution for Sioux Falls is available for evaluating goodness of fit. These data are presented in Table 2, along with estimates of the joint distribution derived four different ways: independence estimate, IPF, posterior mode, and GMM estimate. As the table shows, the independence estimate is a poor approximation to the true joint distribution. In fact, this estimate is considerably worse than using the prior distribution for the entire state of South Dakota to represent the distribution of these variables in Sioux Falls. On the other hand, the IPF, posterior, and GMM results each produce substantial improvements over both the independence estimate and the prior. All three approaches produce very similar estimates of the joint distribution, and all three are found to be statistically indistinguishable

Table 2. Estimates of joint probabilities for stain-resistant carpeting direct mail campaign. Columns: cell descriptor a (housing, married, children <18); actual proportion a; prior proportion a; independence estimate a; IPF a; posterior mode a; GMM estimate. Rows: (rent, no, no), (rent, no, yes), (rent, yes, no), (rent, yes, yes), (own, no, no), (own, no, yes), (own, yes, no), (own, yes, yes); χ2 goodness of fit measure b. [Numeric entries not recoverable from this transcription; standard errors for the posterior mode and GMM columns appeared in parentheses.]

a Source: Table 5, Putler et al. (1996). b All χ2 are based on 2083 observations.

from the true joint distribution by a χ2 goodness of fit test at the one percent significance level. 9 To produce the GMM results we formulated \(A = \alpha_r \otimes (\alpha_m \alpha_a')\), with \(r, m, a\) representing the housing status, marital status, and child status dimensions respectively, and we tested which \(2 \times 1\) \(\alpha\) vectors to estimate and which ones to set equal to \(\iota\), a vector of ones. Since we have three independent moment conditions on the marginal distributions and three more covariance conditions, in principle we can identify up to six parameters in \(\theta\). In practice we found that the χ2 goodness of fit statistic was smallest with \(A\) set to \([1]\). Estimating any of the \(\alpha\) vectors produced a small increase in the χ2 statistic. We also tried estimating the model using just marginal moment conditions (M1), and including only conditions (M2) or only (M3) in addition to (M1). We found that excluding conditions of type (M2) and/or (M3) caused a slight deterioration in the fit. For example, estimating the three \(\beta\) parameters using only moment conditions (M1) produced a χ2 statistic equal to 16.54, up from the value obtained using all three sets of moment conditions.

9 Putler et al. also include posterior mean estimates that yield χ2 statistics as small as 16.1.
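The Kronecker structuring of A used in this example can be written out directly; a minimal sketch with illustrative values, setting every α vector to ones as in the best-fitting specification reported above:

```python
import numpy as np

# A = alpha_r ⊗ (alpha_m alpha_a'), with 2x1 vectors for the binary
# housing (r), marital (m), and child (a) status dimensions
alpha_r = np.ones(2)   # each alpha set to a vector of ones (iota) here
alpha_m = np.ones(2)
alpha_a = np.ones(2)
A = np.kron(alpha_r[:, None], np.outer(alpha_m, alpha_a))
# A is (2*2) x 2, conformable with x, and with all alphas equal to
# iota it is a matrix of ones ("A set = [1]")
```

Estimating one α vector while fixing the others at ι adds at most two free parameters, which is why at most six parameters are identifiable from the six independent conditions noted above.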

3.2. Dominick's data

As stated in the introduction, the motivation behind this research is to estimate joint distributions corresponding to a subset of variables in the store level marginal distributions provided in the Dominick's data archive. These estimated joint distributions will then be used as our source of individual types in a mixed logit model for aggregate data. Data for the Chicago PMSA from the March 1996 Current Population Survey is used to provide the city-wide joint distribution for our GMM procedure. To begin, we extracted demographic data from the Dominick's data archive for eight stores that were reasonably representative of the total population of Dominick's stores. We limit attention to eight stores to keep sample size within the range of computational feasibility for mixed logit estimation. Three criteria were used to choose stores. First, we wanted to have at least two stores from each of the three pricing regimes that Dominick's employs. To this end we chose two low price, three medium price, and three high price stores, while the population of Dominick's stores contains 9.4, 64.7, and 25.9 percent low, medium, and high price stores, respectively. Second, the stores were chosen to exhibit substantial variation over our four variables of interest. Third, we closely matched the means and correlations of our sample and the Dominick's population for these variables in order to produce a sample that is representative of the Dominick's population. The variables we chose are income, number of persons in a household, race, and number of units in a housing structure. Two pieces of information are provided regarding the income distribution for the market area surrounding each Dominick's store: the log median income and the standard deviation of income. The researcher is left to guess what continuous probability distribution these variables parameterize.
It is straightforward to show that the log median of a lognormal distribution is the mean of the associated normal distribution, hence we inferred that a lognormal was used.¹⁰ However, since the standard deviations provided appear to be for a lognormal, we are left with a normal mean and a lognormal standard error. To make these two statistics coherent with one another, we solved for the lognormal mean, θ, using the relationship µ = 2 ln θ − 0.5 ln(θ² + λ²) (see, e.g., Greene, 1997, p. 71), where µ is the normal mean and λ² is the lognormal variance.¹¹ With the income distribution parameterized, we index it by i and discretize it into 17 adjacent intervals (in $000s): [0,10), [10,20), [20,30), [30,40), [40,50), [50,60), [60,70), [70,80), [80,90), [90,100), [100,125), [125,150), [150,175), [175,200), [200,300), [300,400), [400,∞). For number of persons in household, n, the Dominick's data provides a distribution with four points of support: 1 person, 2 persons, 3 or 4 persons, and 5 or more persons. For race, indexed by r, the data provides the percentage of nonwhites. The percentage of detached houses, u, is our housing units variable. Corresponding variables and the city-wide joint distribution were readily available for the Chicago PMSA from the March 1996 Current Population Survey.

10 Specifically, if x has a lognormal distribution with parameters µ and σ², such that E[x] = exp(µ + σ²/2), Var[x] ≡ λ² = exp(2µ + σ²)(exp(σ²) − 1), and median(x) = γ, then y = ln(x) ~ N(µ, σ²) and ln(γ) = µ.

11 We used a bisection algorithm to solve this relationship for θ.
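The two steps above (the bisection of footnote 11, then discretizing the resulting lognormal into the 17 income intervals) can be sketched as follows. This is our own illustration, not the paper's GAUSS code, and the $25,000 standard deviation is a hypothetical input.

```python
import math

def lognormal_mean_from_median(mu, lam2, lo=1e-6, hi=1e9, tol=1e-6):
    """Solve mu = 2*ln(theta) - 0.5*ln(theta**2 + lam2) for the lognormal
    mean theta, given the normal mean mu (the log median) and the lognormal
    variance lam2.  The left-hand side is increasing in theta, so bisection
    brackets a unique root."""
    f = lambda th: 2.0 * math.log(th) - 0.5 * math.log(th * th + lam2) - mu
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if f(mid) < 0.0 else (lo, mid)
    return 0.5 * (lo + hi)

def interval_probs(mu, sigma2, cuts):
    """P[cuts[k] <= income < cuts[k+1]) under a lognormal(mu, sigma2),
    via the normal CDF applied in logs."""
    def cdf(x):
        if x <= 0.0:
            return 0.0
        if x == math.inf:
            return 1.0
        z = (math.log(x) - mu) / math.sqrt(sigma2)
        return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return [cdf(b) - cdf(a) for a, b in zip(cuts[:-1], cuts[1:])]

# Illustration: median income $40,000 (so mu = ln 40000) and a
# hypothetical lognormal standard deviation of $25,000.
mu = math.log(40000.0)
lam2 = 25000.0 ** 2
theta = lognormal_mean_from_median(mu, lam2)   # lognormal mean
sigma2 = 2.0 * (math.log(theta) - mu)          # since mu = ln(theta) - sigma2/2
cuts = [0.0] + [c * 1000.0 for c in
                (10, 20, 30, 40, 50, 60, 70, 80, 90, 100,
                 125, 150, 175, 200, 300, 400)] + [math.inf]
probs = interval_probs(mu, sigma2, cuts)       # 17 interval probabilities
```

The recovered σ² uses the identity µ = ln θ − σ²/2 noted in footnote 10, so the lognormal mean always exceeds the median whenever λ² > 0.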

Table 3 contains the raw data for our eight store sample, and descriptive statistics to compare our eight stores with the whole population of Dominick's stores and the Chicago PMSA. As the table shows, the means for our eight store sample match the Dominick's population quite closely for all variables except the proportion of nonwhites: our sample contains nearly five percent more nonwhites than the Dominick's population. In comparison with the March 1996 CPS for the Chicago PMSA, the Dominick's market areas have similar means for ln(median income) and proportion of nonwhites, but have fewer persons per household and fewer detached houses. A comparison of sample correlations between our sample and the Dominick's population is contained in Table 4. While less closely matched than the means, the correlations all have the same signs and are similar in magnitude. In this example, J = 4, k_i = 17, k_n = 4, k_r = 2, and k_u = 2. This yields k_i + k_n + k_r + k_u = 25 marginal distribution moment conditions of type (M1), J(J − 1)/2 = 6 covariance conditions of type (M2), and k_i × k_n × k_r × k_u = 272 moment conditions of type (M3) for each pair of variables, with J(J − 1)/2 = 6 variable pairs. Together, these yield a total of 1663 moment conditions, 31 of which provide identifying information about θ. A separate joint distribution is estimated for the marketing area corresponding to each of our eight stores, and three criteria are used to gauge model fit for each store: the Euclidean distance of the estimated joint distribution from a joint distribution formed from a product of marginal distributions, the weighted average Euclidean distance of all moment conditions from zero, and the GMM function value.
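The second criterion, the weighted average Euclidean distance of the moment conditions from zero, can be sketched as follows. This is a minimal illustration with hypothetical residual vectors standing in for the (M1)-(M3) sets; the weights follow footnote 12.

```python
import numpy as np

def weighted_avg_distance(moment_sets):
    """Weighted average Euclidean distance of several sets of moment
    conditions from zero, where each set's weight is its share of the
    total number of conditions."""
    sizes = np.array([len(g) for g in moment_sets], dtype=float)
    weights = sizes / sizes.sum()
    dists = np.array([np.linalg.norm(np.asarray(g, dtype=float))
                      for g in moment_sets])
    return float(weights @ dists)

# Hypothetical residuals: a 2-element set at distance 5 from zero and a
# 3-element set exactly at zero give 0.4*5 + 0.6*0 = 2.0.
d = weighted_avg_distance([[3.0, 4.0], [0.0, 0.0, 0.0]])
```

Because excluded moment sets are still passed in, the metric penalizes specifications that drop conditions which then drift away from zero.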
The first criterion enables us to gauge the impact of excluding moment conditions (M2) and/or (M3), and of the parameterization of A, on our ability to estimate a joint distribution that differs substantially from a joint distribution formed under the assumption of independence. To form the second criterion we first evaluate the Euclidean distance of each set of moment conditions from zero, and then use the weighted average of these distances as a model fit metric.¹² In forming this measure we incorporate both included and excluded moment conditions. So, for example, if we estimate the model without moment conditions of type (M3), this will enable us to determine whether a substantial deterioration in fit occurred, as the (M3) moment conditions still get included in the metric. The third criterion, the GMM function value, provides a fit metric that is affected only by the included moment conditions. This metric is less reliable, as it can show improvements even if overall fit, as measured by the Euclidean distance metric, has deteriorated. To parameterize θ, we tested letting each of the α vectors in A be free, individually and in various combinations. These tests showed that allowing α_n to vary freely produced marginally better results than allowing α_r or α_u to be free. Allowing α_i to vary freely caused problems with inverting (G′G), as did allowing the α's to vary freely in any combination. In addition, we estimated the model with different combinations of the moment conditions included and with A = [1]. Table 5 contains the results of tests of various moment and parameter configurations for one of our Dominick's stores (store #111). Results for other stores were similar. The results in Column 1 include all three sets of moment conditions in the model and allow α_n to vary freely. This combination produces the best fit statistics. The estimated

12 The proportions of moment conditions of each type are used as the weights.

Table 3. The eight store sample, with a comparison of descriptive statistics for the eight stores and the Dominick's population. [For each store (two low-priced, three medium-priced, three high-priced) the table reports the persons/household shares (1, 2, 3 or 4, 5+), mean persons/household, ln(median income), proportion of detached houses, and proportion of nonwhites, followed by means (std. dev.) for the eight store sample, the Dominick's population, and the Chicago PMSA. Mean persons/household is calculated by indexing 1, 2, 3 or 4, and 5+ persons with 1, 2, 3.5, and 6 respectively. The numeric entries were lost in transcription.]

Table 4. A comparison of sample correlations between the eight store sample and the Dominick's population. [The table reports the pairwise correlations among ln(median income), mean persons/household, detached house, and nonwhite for the eight store sample and for the Dominick's population. The numeric entries were lost in transcription.]

Table 5. Effect of model specification on fit criteria for Dominick's store 111.

                                               (1)   (2)   (3)   (4)   (5)   (6)
  Marginal conditions (M1) included            Yes   Yes   Yes   Yes   Yes   Yes
  Covariance conditions (M2) included          Yes   Yes   Yes   No    No    No
  Individual cell conditions (M3) included     Yes   No    Yes   Yes   Yes   No
  α_n free                                     Yes   Yes   No    No    Yes   Yes

  Fit criteria: distance ||P_z(u, r, n, i | s; θ) − P(u)P(r)P(n)P(i)||, weighted average distance of all moment conditions from 0, and GMM function value. [The numeric fit values were lost in transcription.]

joint distribution is substantially different from a product of marginal distributions, and the weighted average distance of all moment conditions from 0 is the smallest of the first four columns. In Columns 2-4, we exclude moment conditions (M3), fix α_n = ı_4, and exclude conditions (M2) and fix α_n = ı_4, respectively. Each of these changes produces a deterioration in fit as measured by the weighted average distance metric, and some changes in the distance of the joint distribution from a product of marginals. In Columns 5 and 6 we exclude the covariance conditions (M2) and let α_n vary freely. This produces large changes in the estimated distribution, as it moves much closer to a product of independent marginals. The important point to take from the results in Columns 5 and 6 is that one has to be wary when estimating an α vector without including covariance moment conditions. The GMM and weighted average Euclidean distance criteria both show that the model in these two columns fits better than in any of the previous columns. The reality, however,

is that the GMM objective is optimized by placing very little weight on the city-wide joint distribution. Each estimated element of α_n is less than 0.01 in absolute value and, as such, yields a joint distribution that is very close to a product of marginals.

Table 6. Summary measures of model fit: weighted average Euclidean distances and GMM objective function values for a model including all three types of moment conditions and letting α_n vary freely. [For each store number the table reports the initial and final weighted average Euclidean distance, the Euclidean and maximum absolute distance between the estimated and independence distributions, and the GMM objective function value. The numeric entries were lost in transcription.]

Table 6 includes summary measures of model fit for all eight Dominick's stores for the model including all three sets of moment conditions and with θ = (α_n, β_j), j = u, r, n, i.¹³ The results indicate that the model fits the moment conditions well for all eight stores, and it estimates joint distributions that differ substantially from joints formed as a product of marginal distributions. In addition to the three fit measures discussed above, we include the maximum absolute distance between the estimated and independence joint distributions. This is done to give the reader a better feel for how far the estimated model is from a joint distribution formed as a product of marginals. More specifically, these joint probability distributions each contain 272 cells. Hence, the average probability associated with each cell is 1/272 ≈ 0.0037. The Euclidean distance is an order of magnitude larger than this value for seven of eight stores, and the maximum absolute distance is an order of magnitude larger for all eight stores, indicating that the differences from independence are substantial. Tables 7a and 7b provide the estimated joint probability distribution and the observed and estimated marginal distributions for stores 18 and 111 respectively.
These tables are provided to show that the estimated marginals closely match the observed marginals, and that the estimated joint distributions for the two stores are substantially different. The two stores serve very different demographic populations. Store 18's market population is 91 percent white, 61 percent of whom live in detached houses. For store 111, 99.5 percent of its market population is nonwhite, and only 31 percent live in detached homes. There are also substantial differences in the income distributions, and store 111 has a higher average number of persons per household.

13 We estimated this model using GAUSS on a 2 GHz Pentium 4. Total estimation time for all eight stores was 3.2 seconds.
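The distance-from-independence measures reported in Table 6 can be sketched as follows. In the application, `marginals` would be the four estimated store-level marginals (17, 4, 2, and 2 points of support, giving 272 joint cells); the two binary marginals below are toy inputs for illustration only.

```python
import numpy as np

def independence_joint(marginals):
    """Joint distribution formed as the product of marginal distributions,
    returned as a flat vector (272 cells in the Dominick's application)."""
    joint = np.ones(1)
    for m in marginals:
        joint = np.outer(joint, np.asarray(m, dtype=float)).ravel()
    return joint

def distance_from_independence(p_hat, marginals):
    """Euclidean and maximum absolute distance between an estimated joint
    distribution and the joint formed under independence."""
    diff = np.asarray(p_hat, dtype=float) - independence_joint(marginals)
    return float(np.linalg.norm(diff)), float(np.abs(diff).max())

# Toy example with two binary marginals: the independence joint is
# [0.125, 0.375, 0.125, 0.375]; an estimate equal to it is at distance 0.
marg = [[0.5, 0.5], [0.25, 0.75]]
eucl, max_abs = distance_from_independence([0.125, 0.375, 0.125, 0.375], marg)
```

Comparing these distances to the average cell probability (1/272 in the application) gives the order-of-magnitude check used in the text.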

Table 7a. Estimated joint and marginal probability distributions for store 18. [Rows cross units (attached, detached), race (white, nonwhite), and persons (1, 2, 3 or 4, 5+); columns are income interval midpoints (in $000s). Below the joint distribution, the table reports the observed and estimated store level marginal distributions for income, persons (1, 2, 3 or 4, 5+), race (white, nonwhite), and units (attached, detached). Notes: flagged entries indicate that a probability element's value is in the interval [10⁻⁸, 10⁻³), while a 0 entry indicates the element is strictly less than 10⁻⁸. Standard errors not reported to reduce clutter. The numeric entries were lost in transcription.]

Table 7b. Estimated joint and marginal probability distributions for store 111. [Rows cross units (attached, detached), race (white, nonwhite), and persons (1, 2, 3 or 4, 5+); columns are income interval midpoints (in $000s). Below the joint distribution, the table reports the observed and estimated store level marginal distributions for income, persons (1, 2, 3 or 4, 5+), race (white, nonwhite), and units (attached, detached). Notes: flagged entries indicate that a probability element's value is in the interval [10⁻⁸, 10⁻³), while a 0 entry indicates the element is strictly less than 10⁻⁸. Standard errors not reported to reduce clutter. The numeric entries were lost in transcription.]

The observed and estimated marginals match to the second decimal place in most cases. The estimated race distribution for store 111 misses the observed distribution by a full three percent, but this is because of the extreme skewing of the distribution toward nonwhites. The model does a much better job of matching the somewhat less skewed race distribution of store 18.

4. A study of the effect of formulation of the joint demographic distribution on the results of an equilibrium model for demand and supply

We estimate an equilibrium demand-supply model for bathroom tissue using one year of weekly data from each of the eight Dominick's stores; the demand model is mixed logit, and a static Bertrand-Nash equilibrium condition is used to generate the supply function.¹⁴ Two versions of the model are run, each using a different estimate of the joint demographic distribution for each of the eight stores. In the first version, the joint demographic distribution is estimated as a product of marginals; in the second version it is estimated using GMM.

4.1. Demand model. We use a random coefficients specification to represent the conditional indirect utility of consumption, c_ijmt, for consumer i from product j purchased from store m in week t, yielding

    U(c_ijmt; θ_d) = x_j^a θ^a + x_j^b θ_im^b − p_jmt α_im + ξ_j + ξ_jmt + ε_ijmt,
        i = 1, ..., N, j = 0, ..., J_mt, m = 1, ..., M, t = 1, ..., T,    (9)

where for each product j we observe characteristics x_j = (x_j^a, x_j^b) and prices p_jmt. We decompose x_j into subvectors x^a and x^b to highlight the point that we restrict x^a to enter only the mean, while x^b is allowed to influence both the mean and the random coefficients. In addition, the x's are subscripted only by j because all product characteristics other than price remain constant across stores and throughout the time period.
Different products may be available over time or across stores, but the characteristics of products with a particular UPC number do not change.¹⁵ Examples of product characteristics are the color of the tissue, the size of the roll (single or double), the ply of the paper (1- or 2-ply), the lotion content of the paper (with or without lotion), and the scent of the paper (scented or scent free). ξ_j is the mean valuation of the unobserved (by the econometrician) product characteristic across all of the stores in our dataset, and ξ_jmt is the store-week specific deviation from that mean. Following Nevo (2001), we use brand dummy variables to control for the ξ_j, leaving the ξ_jmt as our error terms. We expect that consumers and firms take the characteristics of all J_mt products into consideration when making decisions, and hence the ξ_jmt will be

14 We sketch out the demand model here, but do not present the supply function. A structural supply function is incorporated to provide additional structure that improves the precision of the demand model estimates. Supply function estimates, however, are not substantially affected by the choice of demographic distribution. Hence its development is ancillary to our main focus. The supply function is developed in detail in Romeo and Sullivan (2004).

15 In each store-week we observe between 28 and 42 bath tissue UPCs.

correlated with the prices of all products available in store m in week t. We define alternative j = 0 as the outside good. Since we do not have detailed information about this alternative, we retain the intercept and normalize the other elements of x, p_0mt, ξ_0, and ξ_0mt to equal 0. To control for the effect of demographic differences on bath tissue choices in each store's market area, we specify the vector of consumer taste parameters (θ_im^b, α_im) as a function of store area demographics and a random normal component, as in

    (θ_im^b, α_im) = (θ̄^b, ᾱ) + Γ a_im + Υ ν_im,    i = 1, ..., N, m = 1, ..., M,    (10)

where each a_im draw is an L × 1 vector having probability distribution P̂(l_1, ..., l_L), and Γ is a matrix of unknown parameters. P̂ is the estimated joint demographic distribution, Υ is specified as a diagonal matrix of unknown random coefficient parameters, and the ν_im are draws from N(0, I). Finally, the ε_ijmt are unobserved buyer attributes that are assumed to follow a type 1 extreme value distribution that is independent across individuals, products, and time periods. In addition, a_im, ν_im, ξ_jmt, and ε_ijmt are assumed to be mutually independent. Aggregate demand shares are obtained by assuming that individuals make utility-maximizing choices in their consumption of bath tissue, and integrating the ε_ijmt, a_im, and ν_im over the appropriate regions. Integration over ε yields the logit choice probabilities. a and ν are integrated numerically by drawing 200 samples from P̂ and N(0, I) for each store. As discussed in BLP, Nevo (2001), and Romeo and Sullivan (2004), integration of a and ν is embedded in a contraction mapping for determining mean utility.

4.2. Demand model results. Table 8 contains estimates of the product characteristic parameters for the demand model.¹⁶ The table contains two columns of results.
In the first column, results incorporate draws taken from demographic distributions formulated as products of marginal distributions, while the second-column results incorporate draws taken from the demographic distributions estimated using GMM; 200 draws for each store are used in both cases. The results show that the choice of demographic distribution produces substantially different results, though the differences are generally not statistically significant. For example, the price coefficient is two units larger in absolute value when the GMM-estimated distribution is used. This difference produces own- and cross-elasticity estimates (not reported here) that are 5-10 percent larger for the model using the GMM-estimated demographic distributions. Finally, the objective function value indicates that the model using the GMM-estimated joint distributions fits the data better. The overidentifying conditions are not rejected at the 5 percent level for this model, while they are rejected at the 5 percent level for the model with draws from the product of the marginals.

16 The demand model also includes month and brand dummies. Instrumental variables issues are discussed in Romeo and Sullivan (2004).
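The demand model's share simulation — 200 draws of a_im from the estimated joint demographic distribution P̂ and of ν_im from N(0, I), coefficient deviations Γa_im + Υν_im, and logit probabilities averaged over draws — can be sketched as follows. This is our own minimal illustration (all names and shapes are assumptions), and it omits the contraction mapping that recovers the mean utilities.

```python
import numpy as np

def simulated_shares(delta, x_b, p_hat, demog, Gamma, Upsilon,
                     n_draws=200, seed=0):
    """Simulated mixed logit shares for one store-week.

    delta   : (J,) mean utilities (covers x_j^a theta^a, x_j^b theta-bar^b,
              -p_jmt alpha-bar, and brand effects)
    x_b     : (J, K) characteristics with random coefficients
              (minus price would be one column)
    p_hat   : (C,) estimated joint demographic distribution over C cells
    demog   : (C, L) demographic vector attached to each cell
    Gamma   : (K, L) demographic interaction parameters
    Upsilon : (K,) diagonal of the random-coefficient scale matrix
    """
    rng = np.random.default_rng(seed)
    cells = rng.choice(len(p_hat), size=n_draws, p=p_hat)  # a_im ~ P-hat
    a = demog[cells]                                       # (n_draws, L)
    nu = rng.standard_normal((n_draws, x_b.shape[1]))      # nu_im ~ N(0, I)
    coef_dev = a @ Gamma.T + nu * Upsilon                  # (n_draws, K)
    u = delta[None, :] + coef_dev @ x_b.T                  # (n_draws, J)
    expu = np.exp(u)
    # Outside good j = 0 has utility normalized to zero, hence the 1 + ...
    probs = expu / (1.0 + expu.sum(axis=1, keepdims=True))
    return probs.mean(axis=0)
```

With Γ = 0 and Υ = 0 the demographics drop out and this collapses to plain logit shares, which is precisely the restriction that mixing over P̂ is meant to relax.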


BayesX - Software for Bayesian Inference in Structured Additive Regression

BayesX - Software for Bayesian Inference in Structured Additive Regression BayesX - Software for Bayesian Inference in Structured Additive Regression Thomas Kneib Faculty of Mathematics and Economics, University of Ulm Department of Statistics, Ludwig-Maximilians-University Munich

More information

Marketing Mix Modelling and Big Data P. M Cain

Marketing Mix Modelling and Big Data P. M Cain 1) Introduction Marketing Mix Modelling and Big Data P. M Cain Big data is generally defined in terms of the volume and variety of structured and unstructured information. Whereas structured data is stored

More information

Gaussian Conjugate Prior Cheat Sheet

Gaussian Conjugate Prior Cheat Sheet Gaussian Conjugate Prior Cheat Sheet Tom SF Haines 1 Purpose This document contains notes on how to handle the multivariate Gaussian 1 in a Bayesian setting. It focuses on the conjugate prior, its Bayesian

More information

SUMAN DUVVURU STAT 567 PROJECT REPORT

SUMAN DUVVURU STAT 567 PROJECT REPORT SUMAN DUVVURU STAT 567 PROJECT REPORT SURVIVAL ANALYSIS OF HEROIN ADDICTS Background and introduction: Current illicit drug use among teens is continuing to increase in many countries around the world.

More information

Confidence Intervals for the Difference Between Two Means

Confidence Intervals for the Difference Between Two Means Chapter 47 Confidence Intervals for the Difference Between Two Means Introduction This procedure calculates the sample size necessary to achieve a specified distance from the difference in sample means

More information

A logistic approximation to the cumulative normal distribution

A logistic approximation to the cumulative normal distribution A logistic approximation to the cumulative normal distribution Shannon R. Bowling 1 ; Mohammad T. Khasawneh 2 ; Sittichai Kaewkuekool 3 ; Byung Rae Cho 4 1 Old Dominion University (USA); 2 State University

More information

Spatial Statistics Chapter 3 Basics of areal data and areal data modeling

Spatial Statistics Chapter 3 Basics of areal data and areal data modeling Spatial Statistics Chapter 3 Basics of areal data and areal data modeling Recall areal data also known as lattice data are data Y (s), s D where D is a discrete index set. This usually corresponds to data

More information

A Simple Model of Price Dispersion *

A Simple Model of Price Dispersion * Federal Reserve Bank of Dallas Globalization and Monetary Policy Institute Working Paper No. 112 http://www.dallasfed.org/assets/documents/institute/wpapers/2012/0112.pdf A Simple Model of Price Dispersion

More information

Maximum Likelihood Estimation

Maximum Likelihood Estimation Math 541: Statistical Theory II Lecturer: Songfeng Zheng Maximum Likelihood Estimation 1 Maximum Likelihood Estimation Maximum likelihood is a relatively simple method of constructing an estimator for

More information

Markups and Firm-Level Export Status: Appendix

Markups and Firm-Level Export Status: Appendix Markups and Firm-Level Export Status: Appendix De Loecker Jan - Warzynski Frederic Princeton University, NBER and CEPR - Aarhus School of Business Forthcoming American Economic Review Abstract This is

More information

1 Teaching notes on GMM 1.

1 Teaching notes on GMM 1. Bent E. Sørensen January 23, 2007 1 Teaching notes on GMM 1. Generalized Method of Moment (GMM) estimation is one of two developments in econometrics in the 80ies that revolutionized empirical work in

More information

A Log-Robust Optimization Approach to Portfolio Management

A Log-Robust Optimization Approach to Portfolio Management A Log-Robust Optimization Approach to Portfolio Management Dr. Aurélie Thiele Lehigh University Joint work with Ban Kawas Research partially supported by the National Science Foundation Grant CMMI-0757983

More information

A LOGNORMAL MODEL FOR INSURANCE CLAIMS DATA

A LOGNORMAL MODEL FOR INSURANCE CLAIMS DATA REVSTAT Statistical Journal Volume 4, Number 2, June 2006, 131 142 A LOGNORMAL MODEL FOR INSURANCE CLAIMS DATA Authors: Daiane Aparecida Zuanetti Departamento de Estatística, Universidade Federal de São

More information

Multivariate Logistic Regression

Multivariate Logistic Regression 1 Multivariate Logistic Regression As in univariate logistic regression, let π(x) represent the probability of an event that depends on p covariates or independent variables. Then, using an inv.logit formulation

More information

Recent Developments of Statistical Application in. Finance. Ruey S. Tsay. Graduate School of Business. The University of Chicago

Recent Developments of Statistical Application in. Finance. Ruey S. Tsay. Graduate School of Business. The University of Chicago Recent Developments of Statistical Application in Finance Ruey S. Tsay Graduate School of Business The University of Chicago Guanghua Conference, June 2004 Summary Focus on two parts: Applications in Finance:

More information

Financial Market Microstructure Theory

Financial Market Microstructure Theory The Microstructure of Financial Markets, de Jong and Rindi (2009) Financial Market Microstructure Theory Based on de Jong and Rindi, Chapters 2 5 Frank de Jong Tilburg University 1 Determinants of the

More information

Logistic Regression (1/24/13)

Logistic Regression (1/24/13) STA63/CBB540: Statistical methods in computational biology Logistic Regression (/24/3) Lecturer: Barbara Engelhardt Scribe: Dinesh Manandhar Introduction Logistic regression is model for regression used

More information

Life Table Analysis using Weighted Survey Data

Life Table Analysis using Weighted Survey Data Life Table Analysis using Weighted Survey Data James G. Booth and Thomas A. Hirschl June 2005 Abstract Formulas for constructing valid pointwise confidence bands for survival distributions, estimated using

More information

Chapter 3 RANDOM VARIATE GENERATION

Chapter 3 RANDOM VARIATE GENERATION Chapter 3 RANDOM VARIATE GENERATION In order to do a Monte Carlo simulation either by hand or by computer, techniques must be developed for generating values of random variables having known distributions.

More information

Analysis of Bayesian Dynamic Linear Models

Analysis of Bayesian Dynamic Linear Models Analysis of Bayesian Dynamic Linear Models Emily M. Casleton December 17, 2010 1 Introduction The main purpose of this project is to explore the Bayesian analysis of Dynamic Linear Models (DLMs). The main

More information

Overview Classes. 12-3 Logistic regression (5) 19-3 Building and applying logistic regression (6) 26-3 Generalizations of logistic regression (7)

Overview Classes. 12-3 Logistic regression (5) 19-3 Building and applying logistic regression (6) 26-3 Generalizations of logistic regression (7) Overview Classes 12-3 Logistic regression (5) 19-3 Building and applying logistic regression (6) 26-3 Generalizations of logistic regression (7) 2-4 Loglinear models (8) 5-4 15-17 hrs; 5B02 Building and

More information

Machine Learning and Pattern Recognition Logistic Regression

Machine Learning and Pattern Recognition Logistic Regression Machine Learning and Pattern Recognition Logistic Regression Course Lecturer:Amos J Storkey Institute for Adaptive and Neural Computation School of Informatics University of Edinburgh Crichton Street,

More information

I L L I N O I S UNIVERSITY OF ILLINOIS AT URBANA-CHAMPAIGN

I L L I N O I S UNIVERSITY OF ILLINOIS AT URBANA-CHAMPAIGN Beckman HLM Reading Group: Questions, Answers and Examples Carolyn J. Anderson Department of Educational Psychology I L L I N O I S UNIVERSITY OF ILLINOIS AT URBANA-CHAMPAIGN Linear Algebra Slide 1 of

More information

Institute of Actuaries of India Subject CT3 Probability and Mathematical Statistics

Institute of Actuaries of India Subject CT3 Probability and Mathematical Statistics Institute of Actuaries of India Subject CT3 Probability and Mathematical Statistics For 2015 Examinations Aim The aim of the Probability and Mathematical Statistics subject is to provide a grounding in

More information

On Correlating Performance Metrics

On Correlating Performance Metrics On Correlating Performance Metrics Yiping Ding and Chris Thornley BMC Software, Inc. Kenneth Newman BMC Software, Inc. University of Massachusetts, Boston Performance metrics and their measurements are

More information

Inflation. Chapter 8. 8.1 Money Supply and Demand

Inflation. Chapter 8. 8.1 Money Supply and Demand Chapter 8 Inflation This chapter examines the causes and consequences of inflation. Sections 8.1 and 8.2 relate inflation to money supply and demand. Although the presentation differs somewhat from that

More information

The Loss in Efficiency from Using Grouped Data to Estimate Coefficients of Group Level Variables. Kathleen M. Lang* Boston College.

The Loss in Efficiency from Using Grouped Data to Estimate Coefficients of Group Level Variables. Kathleen M. Lang* Boston College. The Loss in Efficiency from Using Grouped Data to Estimate Coefficients of Group Level Variables Kathleen M. Lang* Boston College and Peter Gottschalk Boston College Abstract We derive the efficiency loss

More information

Probability and Statistics Prof. Dr. Somesh Kumar Department of Mathematics Indian Institute of Technology, Kharagpur

Probability and Statistics Prof. Dr. Somesh Kumar Department of Mathematics Indian Institute of Technology, Kharagpur Probability and Statistics Prof. Dr. Somesh Kumar Department of Mathematics Indian Institute of Technology, Kharagpur Module No. #01 Lecture No. #15 Special Distributions-VI Today, I am going to introduce

More information

Multiple Choice Models II

Multiple Choice Models II Multiple Choice Models II Laura Magazzini University of Verona laura.magazzini@univr.it http://dse.univr.it/magazzini Laura Magazzini (@univr.it) Multiple Choice Models II 1 / 28 Categorical data Categorical

More information

The Proportional Odds Model for Assessing Rater Agreement with Multiple Modalities

The Proportional Odds Model for Assessing Rater Agreement with Multiple Modalities The Proportional Odds Model for Assessing Rater Agreement with Multiple Modalities Elizabeth Garrett-Mayer, PhD Assistant Professor Sidney Kimmel Comprehensive Cancer Center Johns Hopkins University 1

More information

Department of Mathematics, Indian Institute of Technology, Kharagpur Assignment 2-3, Probability and Statistics, March 2015. Due:-March 25, 2015.

Department of Mathematics, Indian Institute of Technology, Kharagpur Assignment 2-3, Probability and Statistics, March 2015. Due:-March 25, 2015. Department of Mathematics, Indian Institute of Technology, Kharagpur Assignment -3, Probability and Statistics, March 05. Due:-March 5, 05.. Show that the function 0 for x < x+ F (x) = 4 for x < for x

More information

DATA ANALYSIS II. Matrix Algorithms

DATA ANALYSIS II. Matrix Algorithms DATA ANALYSIS II Matrix Algorithms Similarity Matrix Given a dataset D = {x i }, i=1,..,n consisting of n points in R d, let A denote the n n symmetric similarity matrix between the points, given as where

More information

Reject Inference in Credit Scoring. Jie-Men Mok

Reject Inference in Credit Scoring. Jie-Men Mok Reject Inference in Credit Scoring Jie-Men Mok BMI paper January 2009 ii Preface In the Master programme of Business Mathematics and Informatics (BMI), it is required to perform research on a business

More information

Internet Appendix for Money Creation and the Shadow Banking System [Not for publication]

Internet Appendix for Money Creation and the Shadow Banking System [Not for publication] Internet Appendix for Money Creation and the Shadow Banking System [Not for publication] 1 Internet Appendix: Derivation of Gross Returns Suppose households maximize E β t U (C t ) where C t = c t + θv

More information

Penalized regression: Introduction

Penalized regression: Introduction Penalized regression: Introduction Patrick Breheny August 30 Patrick Breheny BST 764: Applied Statistical Modeling 1/19 Maximum likelihood Much of 20th-century statistics dealt with maximum likelihood

More information

Multivariate normal distribution and testing for means (see MKB Ch 3)

Multivariate normal distribution and testing for means (see MKB Ch 3) Multivariate normal distribution and testing for means (see MKB Ch 3) Where are we going? 2 One-sample t-test (univariate).................................................. 3 Two-sample t-test (univariate).................................................

More information

Goodness of fit assessment of item response theory models

Goodness of fit assessment of item response theory models Goodness of fit assessment of item response theory models Alberto Maydeu Olivares University of Barcelona Madrid November 1, 014 Outline Introduction Overall goodness of fit testing Two examples Assessing

More information

A Basic Introduction to Missing Data

A Basic Introduction to Missing Data John Fox Sociology 740 Winter 2014 Outline Why Missing Data Arise Why Missing Data Arise Global or unit non-response. In a survey, certain respondents may be unreachable or may refuse to participate. Item

More information

11. Analysis of Case-control Studies Logistic Regression

11. Analysis of Case-control Studies Logistic Regression Research methods II 113 11. Analysis of Case-control Studies Logistic Regression This chapter builds upon and further develops the concepts and strategies described in Ch.6 of Mother and Child Health:

More information

LOGISTIC REGRESSION. Nitin R Patel. where the dependent variable, y, is binary (for convenience we often code these values as

LOGISTIC REGRESSION. Nitin R Patel. where the dependent variable, y, is binary (for convenience we often code these values as LOGISTIC REGRESSION Nitin R Patel Logistic regression extends the ideas of multiple linear regression to the situation where the dependent variable, y, is binary (for convenience we often code these values

More information

MBA 611 STATISTICS AND QUANTITATIVE METHODS

MBA 611 STATISTICS AND QUANTITATIVE METHODS MBA 611 STATISTICS AND QUANTITATIVE METHODS Part I. Review of Basic Statistics (Chapters 1-11) A. Introduction (Chapter 1) Uncertainty: Decisions are often based on incomplete information from uncertain

More information

4. Continuous Random Variables, the Pareto and Normal Distributions

4. Continuous Random Variables, the Pareto and Normal Distributions 4. Continuous Random Variables, the Pareto and Normal Distributions A continuous random variable X can take any value in a given range (e.g. height, weight, age). The distribution of a continuous random

More information

Statistical tests for SPSS

Statistical tests for SPSS Statistical tests for SPSS Paolo Coletti A.Y. 2010/11 Free University of Bolzano Bozen Premise This book is a very quick, rough and fast description of statistical tests and their usage. It is explicitly

More information

Principle of Data Reduction

Principle of Data Reduction Chapter 6 Principle of Data Reduction 6.1 Introduction An experimenter uses the information in a sample X 1,..., X n to make inferences about an unknown parameter θ. If the sample size n is large, then

More information

Summary of Formulas and Concepts. Descriptive Statistics (Ch. 1-4)

Summary of Formulas and Concepts. Descriptive Statistics (Ch. 1-4) Summary of Formulas and Concepts Descriptive Statistics (Ch. 1-4) Definitions Population: The complete set of numerical information on a particular quantity in which an investigator is interested. We assume

More information

Ordinal Regression. Chapter

Ordinal Regression. Chapter Ordinal Regression Chapter 4 Many variables of interest are ordinal. That is, you can rank the values, but the real distance between categories is unknown. Diseases are graded on scales from least severe

More information

1 The Brownian bridge construction

1 The Brownian bridge construction The Brownian bridge construction The Brownian bridge construction is a way to build a Brownian motion path by successively adding finer scale detail. This construction leads to a relatively easy proof

More information

Average Redistributional Effects. IFAI/IZA Conference on Labor Market Policy Evaluation

Average Redistributional Effects. IFAI/IZA Conference on Labor Market Policy Evaluation Average Redistributional Effects IFAI/IZA Conference on Labor Market Policy Evaluation Geert Ridder, Department of Economics, University of Southern California. October 10, 2006 1 Motivation Most papers

More information

What s New in Econometrics? Lecture 8 Cluster and Stratified Sampling

What s New in Econometrics? Lecture 8 Cluster and Stratified Sampling What s New in Econometrics? Lecture 8 Cluster and Stratified Sampling Jeff Wooldridge NBER Summer Institute, 2007 1. The Linear Model with Cluster Effects 2. Estimation with a Small Number of Groups and

More information

Machine Learning Logistic Regression

Machine Learning Logistic Regression Machine Learning Logistic Regression Jeff Howbert Introduction to Machine Learning Winter 2012 1 Logistic regression Name is somewhat misleading. Really a technique for classification, not regression.

More information

LABEL PROPAGATION ON GRAPHS. SEMI-SUPERVISED LEARNING. ----Changsheng Liu 10-30-2014

LABEL PROPAGATION ON GRAPHS. SEMI-SUPERVISED LEARNING. ----Changsheng Liu 10-30-2014 LABEL PROPAGATION ON GRAPHS. SEMI-SUPERVISED LEARNING ----Changsheng Liu 10-30-2014 Agenda Semi Supervised Learning Topics in Semi Supervised Learning Label Propagation Local and global consistency Graph

More information

Part 2: One-parameter models

Part 2: One-parameter models Part 2: One-parameter models Bernoilli/binomial models Return to iid Y 1,...,Y n Bin(1, θ). The sampling model/likelihood is p(y 1,...,y n θ) =θ P y i (1 θ) n P y i When combined with a prior p(θ), Bayes

More information

Lecture notes: single-agent dynamics 1

Lecture notes: single-agent dynamics 1 Lecture notes: single-agent dynamics 1 Single-agent dynamic optimization models In these lecture notes we consider specification and estimation of dynamic optimization models. Focus on single-agent models.

More information

Linear Classification. Volker Tresp Summer 2015

Linear Classification. Volker Tresp Summer 2015 Linear Classification Volker Tresp Summer 2015 1 Classification Classification is the central task of pattern recognition Sensors supply information about an object: to which class do the object belong

More information

Standard errors of marginal effects in the heteroskedastic probit model

Standard errors of marginal effects in the heteroskedastic probit model Standard errors of marginal effects in the heteroskedastic probit model Thomas Cornelißen Discussion Paper No. 320 August 2005 ISSN: 0949 9962 Abstract In non-linear regression models, such as the heteroskedastic

More information