Does Sample Size Still Matter?

David Bakken and Megan Bond, KJT Group

Introduction

The survey has been an important tool in academic, governmental, and commercial research since the 1930s. Because in most cases the intent of a survey is to measure or estimate the value that a variable takes on in some population of interest, the development of sampling science has been integral to the advancement of survey research. While it may be possible to conduct a census of a small, easily accessed population, in most cases observing or measuring a sample of members of the population is necessary for reasons of cost, timing, and practicality.

Most of our understanding of sampling theory and method is based on probability sampling. A probability sample is one in which all members of the population of interest have a known probability of being included in the sample. The most basic form of probability sampling is the simple random sample (SRS) without replacement, in which each population member or unit has an equal probability of being selected for the sample (that probability being 1/N, where N is the size of the population).

The importance of probability sampling becomes apparent when we want to make statements about the degree of difference between the value of a parameter (such as a mean, a proportion, or a regression coefficient) observed in the sample and the true population value of that parameter. Probability sampling allows us to estimate the error attributable to looking at a sample rather than the entire population. The math of probability sampling (based on the number of possible permutations, such as the number of ways that you can get a result of seven by rolling a pair of dice) is such that if we took an infinitely large number of samples of a given size and measured a parameter for each sample, such as the mean of a variable, the distribution of these sample means (a.k.a.
the sampling distribution of the mean) would be normal and its mean would equal the population mean. Furthermore, we can calculate a margin of error around our sample mean based on this sampling distribution of means. It turns out that the margin of error for a sample estimate is related to the size of the sample: larger probability samples, all other things being equal, have smaller sampling errors. If we were to compare the sampling distributions of means based on SRS samples of 1,000 and of 100, we would expect to find greater variability in the means based on samples of size 100. In other words, larger samples lead to more precise estimates of the parameter under study.

This property has guided the design of survey samples. Most market researchers understand the relationship between population size, sample size, and precision (or margin of error), and they may apply relatively simple formulas to determine the sample size needed to achieve a specific level of precision.
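The effect of sample size on the spread of the sampling distribution is easy to demonstrate with a short Monte Carlo sketch (illustrative only; the population values and sizes here are invented, not drawn from any survey):

```python
import random
import statistics

random.seed(42)

# Synthetic population of 100,000 values (purely illustrative).
population = [random.gauss(50, 10) for _ in range(100_000)]

def sampling_distribution(n, replicates=1_000):
    """Means of `replicates` simple random samples (without replacement) of size n."""
    return [statistics.mean(random.sample(population, n)) for _ in range(replicates)]

# Standard deviation of the sample means = the standard error of the mean.
sd_100 = statistics.stdev(sampling_distribution(100))
sd_1000 = statistics.stdev(sampling_distribution(1_000))

# Larger samples yield a tighter sampling distribution; the standard error
# shrinks roughly with the square root of the sample size.
print(f"SE at n=100:  {sd_100:.3f}")
print(f"SE at n=1000: {sd_1000:.3f}")
```

With a ten-fold increase in sample size, the spread of the sample means shrinks by roughly the square root of ten.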
Different areas of research practice have different standards or expectations for survey sampling error. Opinion polls conducted to forecast the outcome of an election may be designed for a margin of error of around 3% at a stated likelihood (usually 95%), by which we mean that if we repeated the poll 100 times with the same size probability sample, we would expect the estimated vote share to fall within three percentage points on either side of the sample estimate in 95 of those samples. For commercial purposes, the desired precision or margin of error is more likely to be a function of the cost of making a bad bet on some future outcome (this is known as the loss function) and the magnitude of a meaningful difference in the real world. For example, a small difference in market share may represent a significant increase in revenue for one company but mere accounting noise for another, and each company will have different requirements for precision in order to make the right bet on a particular action.

Precision comes with a cost, however, and as Figure 1 illustrates, the relationship between precision and sample size is non-linear. Reducing the margin of error at 95% confidence from 3% to 2% requires more than doubling the sample size; reducing it from 3% to 1% requires roughly a nine-fold increase in sample size. For that reason, researchers must find the appropriate trade-off between cost and precision for a particular survey problem.

We should mention two other considerations with respect to precision. When estimating proportions, the formula for calculating the margin of error for a specific sample size is:

ME = z √( p(1 − p) / n )

where p is the expected proportion. The margin of error for a given sample size is greatest when that proportion is exactly 50%. If we have a prior belief that the population proportion of interest is less than 50%, we may be able to achieve a specified level of precision with a smaller sample.
However, in the absence of that prior belief, 50% is the most conservative estimate and many people use that value as a default. Similarly, the degree of variability in the population affects precision, and if we have prior beliefs about the degree of homogeneity or heterogeneity in the population, we may be able to achieve the precision required for our decision-making needs with a smaller sample.

Despite the well-known math of probability sampling, market researchers often fail to conduct studies with samples that are large enough (based on sampling theory) to support their conclusions. Many researchers develop heuristics to simplify decisions about sample size. For example, psychology graduate students of a certain era were taught that a small sample (in particular for a randomized control-group experiment) was 30, because that was the point at which one could switch from Student's t to a z-test to compare means. Market researchers have similar rules of thumb for determining the minimum number of elements from a population subgroup or segment to include in a sample. These rules of thumb are often intuitive rather than empirically based.
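Solving the margin-of-error formula for n makes the trade-off concrete: n = z²p(1 − p)/ME². The sketch below is illustrative (the proportions are invented), but it shows both the non-linear cost of precision and the saving from a prior belief that the proportion is below 50%:

```python
import math

def required_n(margin_of_error, p=0.5, z=1.96):
    """Sample size needed for a given margin of error at ~95% confidence.

    p = 0.5 is the conservative default; a smaller expected proportion
    reduces the required sample size."""
    return math.ceil(z**2 * p * (1 - p) / margin_of_error**2)

for me in (0.03, 0.02, 0.01):
    print(f"ME = {me:.0%}: n = {required_n(me)}")

# A prior belief that the proportion is well below 50% shrinks the requirement.
print(f"ME = 3%, p = 0.2: n = {required_n(0.03, p=0.2)}")
```

Because n grows with 1/ME², halving the margin of error roughly quadruples the required sample size.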
The Shrinking Market Research Survey Sample

Market researchers face a number of challenges in designing and implementing sampling schemes for survey research. Unlike public opinion polling, where the target population may be more or less the same from one poll to another, market research surveys serve a wide variety of information objectives, and last week's survey may have targeted a completely different population from this week's. The advent of online research, in particular online panels, promised to make very large samples affordable. Alas, while online panels have driven down the cost per interview (CPI), small samples (with perhaps fewer than 100 respondents) have become commonplace. Reasons include the targeting of niche and otherwise low-incidence segments and declining response rates.

Faced with the need to help marketers make reasonable business decisions using survey data obtained from relatively small samples, we set out to investigate the relationship between sample size, the variability of parameter estimates based on those sample sizes, and the implications for managerial decision-making. We could, of course, calculate sampling errors for our different sample sizes and let it go at that. In fact, the frequentist approach, based on the long-term frequency with which a parameter estimate occurs (such as the sampling distribution of the mean), stops at this point. However, this approach assumes that we are completely ignorant about the true population parameter value (even if we have measured it previously).

Our research was inspired in part by the story of Jean Baptiste Eugène Estienne, a French Army general who devised a method using Bayes' theorem that enabled assessment of the overall quality of a batch of 20,000 artillery shells by destructive testing of no more than 20 shells. At the outset of World War I Germany seized much of France's manufacturing capability, making the existing ammunition stores that much more precious.
Applying the standard frequentist approach (calculating a sample size based on an acceptable margin of error around some criterion, such as 10% of all shells) would have required destruction of a few hundred shells. Estienne's method relied on updating the probability that a batch overall was defective (i.e., contained 10% or more bad shells) with each successive detonation.

Thomas Bayes was an 18th-century English clergyman and amateur mathematician who proposed a rule for accounting for uncertainty. Bayes' theorem, as it is known, was described in a paper published posthumously in 1763 by the Royal Society. This theorem is the foundation of Bayesian statistical inference. In Bayesian statistics, probabilities reflect a belief about the sample of data under study rather than about the frequency of events across hypothetical samples. In effect, the Bayesian statistician asks: given the data I have in hand, what is the probability of any specific hypothesis about the population parameter value? In contrast, the frequentist asks how probable the data are, given the hypothesis. In effect, the frequentist approach decides whether to accept the data as real.

With respect to small samples, we speculated that a Bayesian approach to inference would provide a means to account for uncertainty in a way that gives managers a better understanding of the probability of the sample data with respect to a specific decision. In this approach, we take the data as given and then calculate the probability of different possible true values. This requires a shift in thinking about the marketer's decision problem. Suppose that a company is planning to launch a new product and wants to determine the potential adoption rate at a few
different price points. Imagine that the company conducts a survey employing a simple direct elicitation of willingness to pay, such as the Gabor-Granger method. Further imagine that the results indicate that 15% of the target market says they will definitely purchase the product at a price of $15 or less. The company has determined that it needs to achieve at least 20% market adoption at a price of $15 in order to move ahead with the launch. The standard frequentist approach is not much help in this case. If the survey sample is relatively small, the 20% threshold is likely to fall within the margin of error; if the sample is large, the resulting increase in precision will shrink the confidence interval around the 15% estimate such that the 20% threshold looks extremely unlikely.

We can use Bayes' theorem to reduce the uncertainty. Bayes' theorem exploits the fact that the joint probability of two events, A and B, can be written as the product of the probability of one event and the conditional probability of the second event, given the first event. While there are different ways to express the theorem, here is a simple representation:

Prob(H | data) = xy / (xy + z(1 − x))

We wish to estimate the probability of our hypothesis H (for example, that the adoption rate will be 20%). The value x reflects our best guess about the likelihood of the hypothesis in the absence of any data (our prior probability belief), y is the probability of observing the data if the hypothesis is true, and z is the probability of observing the data if the hypothesis is not true.

Overview of Our Study

The overall objective of this study, as noted previously, was to assess the variability in parameter estimates for samples of different sizes. We followed the classic paradigm for evaluation of parameter estimates under varying treatments or methods: we started with a population where the parameter values were known.
In many studies such a population is synthetic; the observations are generated by specifying the parameter values and then using Monte Carlo simulation methods to create one or more synthetic populations with those parameter values. In our case, we started with a reasonably large sample of actual survey responses and, treating that sample as the population, drew multiple simple random samples of varying size (as described below). Using responses to a choice-based conjoint exercise that was embedded in an online survey of 897 individuals, we created a series of samples of different sizes using different restrictions to reflect the ways in which both probability and convenience samples might be generated. The choice-based conjoint was a simple brand-and-price exercise that included four brands of LCD television and four price levels. We conducted two separate experiments, as described below.
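The replicate-drawing scheme used in the experiments can be sketched as follows (a simplified illustration: respondent IDs stand in for the actual survey records):

```python
import random

random.seed(7)

population_ids = list(range(897))  # one ID per respondent in the "population"
SAMPLE_SIZES = (25, 50, 75, 100, 150, 225, 450)
REPLICATES = 10

# For each size, draw ten independent simple random samples without replacement.
samples = {
    n: [random.sample(population_ids, n) for _ in range(REPLICATES)]
    for n in SAMPLE_SIZES
}

total = sum(len(reps) for reps in samples.values())
print(f"{total} individual samples drawn")  # 7 sizes x 10 replicates
```

Each replicate is drawn without replacement within itself, but replicates are drawn independently of one another, so a respondent can appear in more than one replicate.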
Experiment 1: We drew ten random samples at each of seven sizes (25, 50, 75, 100, 150, 225, and 450) from our population of 897 respondents, resulting in 70 individual samples. We estimated HB models for each sample (using Sawtooth Software's CBC-HB program).

Experiment 2: We repeated the method of Experiment 1 but altered the sampling strategy so that samples were more homogeneous. We used two different sets of restrictions to achieve this, one based on demographics and one based on an attitudinal measure in the original survey. We applied the same overall design, with ten samples at each of the sizes 25, 50, 75, and 100, resulting in a total of 40 samples based on the demographic restriction and 40 based on the attitudinal restriction.

Results

When using results from choice-based conjoint analysis for research-on-research, we usually employ choice shares predicted by a market simulator (employing a logit transformation to generate purchase probabilities). This method is preferable to comparing different samples using model-based parameters (e.g., regression coefficients) because, in the multinomial logit model that captures the likelihood of choosing an alternative given the alternative's attributes, each sample has a unique scaling parameter. Transforming the model coefficients into predicted choice shares removes this difference between samples. In addition to comparing samples of different sizes with respect to the variance in predicted choice shares and deviation from the true population value, we also looked at aggregate and individual (i.e., "hit rate") validation using holdout choice tasks.

Experiment 1

Figure 2 shows the average prediction variance across the 10 replicates at each sample size. There are two interesting patterns here. First, some brands have smaller prediction variance. These happen to be somewhat larger brands than the other two.
The second pattern is that prediction variance shrinks as sample size increases, dropping by roughly half when the sample size is at least 100, compared to samples of 25.

Insert Figure 2 here.

Figure 3 compares aggregate holdout prediction errors for each of the sample replicates. Aggregate holdout prediction error is the difference between the shares predicted for each brand at the prices set for a holdout task (one not included in the modeling) and the actual choices that respondents made in that task. Larger errors reflect more noise in the parameters, and we see that these errors are both larger on average and more variable when the sample is small than when it is large.

Insert Figure 3 here.
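The logit transformation and the aggregate holdout error described above can be sketched as follows (the utilities and observed shares are hypothetical, not the study's data):

```python
import math

def choice_shares(utilities):
    """Multinomial logit transformation: share of preference for each alternative."""
    exps = [math.exp(u) for u in utilities]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical total utilities for four brand/price alternatives in a holdout task.
predicted = choice_shares([1.2, 0.8, 0.3, -0.1])

# Hypothetical observed shares from respondents' actual holdout choices.
observed = [0.40, 0.30, 0.18, 0.12]

# Aggregate holdout prediction error: mean absolute difference in shares.
mae = sum(abs(p - o) for p, o in zip(predicted, observed)) / len(observed)
print(f"predicted shares: {[round(s, 3) for s in predicted]}")
print(f"mean absolute error: {mae:.3f}")
```

Because the logit shares sum to one within each sample, comparisons between samples are not distorted by each sample's unique scaling parameter.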
Figure 4 compares individual hit rates for each of the sample replicates. The hit rate is the proportion of times the predicted choice for a given respondent matches the actual choice the respondent made in the holdout task. With one notable exception (samples of 100), the average hit rates and the variability in hit rates are similar across the different sample sizes. This is probably a consequence of the HB method used to estimate the individual-level utilities, which borrows data from other respondents to derive an individual model for each respondent. It is possible that the hit rates for smaller samples are the result of over-fitting, since there are fewer cases to borrow data from (which pulls the individual models in the direction of the overall average), while with larger samples the individual parameter space is better represented, so the borrowed data is more probable.

Insert Figure 4 here.

The final indication of the potential error associated with sample size is reflected in the differences between predicted choice shares based on each sample replicate and the overall population value (the modeled choice shares using the entire sample). Figure 5 shows these errors in predicted choice shares for just one of the brands. As with the other measures, individual sample prediction errors are larger for smaller samples, but when the predictions are averaged (within sample size), they are quite close to the actual population value.

Insert Figure 5 here.

Experiment 2

As we noted in the description of our second experiment, market research samples often are restricted in ways that might affect the variability or heterogeneity within the sample. All other things being equal, samples from populations that are more homogeneous should produce more consistent parameter estimates (as long as the population variability is related to the parameter of interest).
We devised two constrained sampling approaches to yield samples that would be either demographically more similar (using age) or attitudinally more similar. Overall, as Figures 6 and 7 indicate, the patterns of variability in predicted choice shares in these constrained samples are similar to those in the unconstrained samples. Since our sample restrictions were arbitrary and only two of many possible restrictions, it is possible that any resulting increase in homogeneity was either small or not relevant to the parameters of interest. It is also possible that the HB method attenuates the impact of increased homogeneity on the individual-level choice models.

Insert Figures 6 and 7 about here.
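The demographically constrained sampling can be sketched like this (the age band, the synthetic ages, and the record layout are all hypothetical, not the study's actual restriction):

```python
import random

random.seed(11)

# Hypothetical respondent records: (id, age).
population = [(i, random.randint(18, 75)) for i in range(897)]

# Demographic restriction: keep only a narrow age band, then sample within it.
eligible = [rec for rec in population if 35 <= rec[1] <= 49]

def draw_replicates(pool, size, replicates=10):
    """Ten simple random samples of `size` from the restricted pool."""
    return [random.sample(pool, size) for _ in range(replicates)]

restricted_samples = {n: draw_replicates(eligible, n) for n in (25, 50, 75, 100)}
print(f"{len(eligible)} eligible respondents; "
      f"{sum(len(v) for v in restricted_samples.values())} restricted samples")
```

The attitudinal restriction works the same way, with the filter applied to a survey attitude measure instead of age.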
Accounting for Uncertainty

Looking across these sample replicates, we want to know, for a given sample size, how likely we are to make a seriously wrong decision. We applied Bayes' theorem to estimate the uncertainty associated with samples of different sizes. Knowing that the population choice share for Toshiba at a particular price is roughly 19%, and that if the price is lowered by $100 the choice share doubles, we can calculate the uncertainty for each of the samples. Figure 8 compares the results of this calculation for samples of 25 and 100. We can see that we should have greater confidence in any one sample of 100 than in any one sample of 25.

Insert Figure 8 about here.

Conclusions

Our experiments indicate that sample size does still matter. Moreover, we now have greater confidence in drawing the line for a minimum sample size at about 100 respondents, at least for studies involving relatively simple choice-based conjoint models estimated with a hierarchical Bayesian method. Regardless of the sample size, Bayes' theorem offers a way to quantify the uncertainty around population parameters. Bayes' theorem requires that we alter our way of thinking about the data: rather than base our inferences on the long-term frequencies from hypothetical sample replicates, it allows us to ground our estimates in the data at hand. We do not view Bayesian inference as a total replacement for frequentist methods of estimating sampling error. Instead, we see Bayes' theorem as an additional tool that can help managers make the best possible decisions, or bets, based on all the information available.
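The kind of calculation applied in the Accounting for Uncertainty section can be sketched as a two-hypothesis comparison using binomial likelihoods in the xy / (xy + z(1 − x)) form of Bayes' theorem given earlier. The counts and candidate shares below are illustrative, not the study's data:

```python
import math

def binom_pmf(k, n, p):
    """Probability of k successes in n trials with success probability p."""
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

def posterior(k, n, p_h=0.20, p_alt=0.15, prior=0.5):
    """Posterior probability of hypothesis H (true share = p_h),
    via Bayes' theorem in the form xy / (xy + z(1 - x))."""
    y = binom_pmf(k, n, p_h)    # probability of the data if H is true
    z = binom_pmf(k, n, p_alt)  # probability of the data if H is false
    return prior * y / (prior * y + (1 - prior) * z)

# The same 16% observed share: 4 of 25 respondents vs. 16 of 100.
print(f"n=25:  P(share is 20%) = {posterior(4, 25):.2f}")
print(f"n=100: P(share is 20%) = {posterior(16, 100):.2f}")
```

With identical observed proportions, the larger sample shifts the posterior further from 50/50, which is the sense in which any one sample of 100 deserves more confidence than any one sample of 25.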
Figures

Figure 1.
Figure 2.
Figure 3.
Figure 4.
Figure 5.
Figure 6.
Figure 7.
Figure 8.