Social Search in Small-World Experiments
Sharad Goel (1), Roby Muhamad (2), and Duncan Watts (1,2)
(1) Yahoo! Research, 111 West 40th Street, New York, NY
(2) Department of Sociology, Columbia University, 420 West 118th Street, New York, NY
[email protected], [email protected], [email protected]

Copyright is held by the International World Wide Web Conference Committee (IW3C2). Distribution of these papers is limited to classroom use, and personal use by others. WWW 2009, April 20-24, 2009, Madrid, Spain.

ABSTRACT

The algorithmic small-world hypothesis states that not only are pairs of individuals in a large social network connected by short paths, but that ordinary individuals can find these paths. Although theoretically plausible, empirical evidence for the hypothesis is limited, as most chains in small-world experiments fail to complete, thereby biasing estimates of true chain lengths. Using data from two recent small-world experiments, comprising a total of 162,328 message chains directed at one of 30 targets spread across 19 countries, we model heterogeneity in chain attrition rates as a function of individual attributes. We then introduce a rigorous way of estimating true chain lengths that is provably unbiased and can account for empirically observed variation in attrition rates. Our findings provide mixed support for the algorithmic hypothesis. On the one hand, it appears that roughly half of all chains can be completed in 6-7 steps, supporting the "six degrees of separation" assertion; but on the other hand, estimates of the mean are much longer, suggesting that for at least some of the population, the world is not small in the algorithmic sense. We conclude that search distances in social networks are fundamentally different from topological distances, for which the mean and median of the shortest path lengths between nodes tend to be similar.

Categories and Subject Descriptors
J.4 [Computer Applications]: Social and Behavioral Sciences

General Terms
Experimentation, Human Factors

Keywords
Social search, small-world experiment, attrition

1. INTRODUCTION

For forty years, the provocative small-world experiments of Stanley Milgram and colleagues [17, 25, 34] have been cited as evidence that everyone in the world is connected to everyone else via "six degrees of separation." This claim, however, can be interpreted in two very different ways. The first interpretation, which we call the topological version of the hypothesis, holds only that for a randomly chosen pair of individuals, there exists with high probability a short chain of intermediaries that connects them, where "short" is usually interpreted as proportional to the logarithm of the population size [38]. The second interpretation makes a much stronger claim, namely that ordinary individuals can effectively navigate these short chains themselves, with every individual having only local knowledge of the social network in question [1, 2, 13, 14, 21, 26, 33, 37]. For this reason, it has been labeled the algorithmic small-world problem [13, 14]. Distinguishing between the topological and algorithmic interpretations of the small-world hypothesis is important because each is relevant to different social processes [36].
The spread of a sexually transmitted disease along networks of sexual relations, for example, does not require that participants have any awareness of the disease, or any intention to spread it; thus for an individual to be at risk of acquiring an infection, he or she need only be connected in the topological sense to existing infectives. By contrast, individuals attempting to "network" in order to locate some resource, such as a new job [9] or a service provider [19], must actively traverse chains of referrals, and thus must be connected in the algorithmic sense. Depending on the application of interest, therefore, either the topological or the algorithmic distance between individuals may be more relevant, or possibly both together.

Given the distinct implications of topological versus algorithmic connectivity, it is also important to note that the two interpretations are supported by very different kinds of evidence. In recent years, hundreds of empirical studies of large-scale networks have been conducted across a number of domains, including not only social networks [8, 18, 20, 27], but also biological [12, 35, 38], technological [31, 38], organizational [16], and virtual networks [4]. Two near-universal findings of these studies are: (1) that a majority of individuals in a given population are connected in a single "giant component"; and (2) that the typical shortest path length connecting pairs of nodes within the giant component is on the order of the logarithm of the system size. In a recent study of a network of 180M instant messenger users (where a link is defined as two users nominating each other as IM "buddies"), for example, Leskovec and Horvitz [20] found that users were separated by a mean of 6.6 steps and a median of 7 steps. Taken together, therefore, these studies provide overwhelming evidence for the topological interpretation of the small-world hypothesis, confirming it
for networks spanning six orders of magnitude in size (from hundreds to hundreds of millions), and across many substantive domains.

Empirical evidence in favor of the algorithmic small-world hypothesis, that individuals can locate these paths, is, however, much less conclusive [15, 36]. A handful of classic sociological case studies, such as Lee's The Search for an Abortionist [19] and Granovetter's Getting a Job [9], along with numerous anecdotal examples from the growing literature of networking books [24, 28, 30], suggest that in at least some circumstances individuals can indeed navigate their social networks to locate useful resources. It is unclear, however, whether these examples provide evidence for a general "searchability" property of social networks that allows even quite ordinary individuals to perform successful social searches under routine conditions, or instead represent a collection of special cases, either because the individuals in question, or their circumstances, are somehow atypical. In recent years, a number of mathematical and simulation models [1, 2, 13, 14, 21, 26, 33, 37] have been proposed with the objective of making a theoretical case in favor of the generic searchability of social networks. Although suggestive, these results rely on some form of simulation of the underlying social network, of the search procedure, or of both together; thus they still do not constitute direct empirical evidence for the key claim of the algorithmic small-world hypothesis, that ordinary individuals in large social networks can routinely locate short paths. More seriously, perhaps, the models described above all tend to make certain homogeneity assumptions that treat all individuals as equivalent; hence, they have little to say about the impact on social search of well-known sources of heterogeneity and inequality in real social networks [5, 29].

Mathematical models, computer simulations, and isolated examples aside, therefore, the primary source of empirical evidence for the algorithmic small-world hypothesis continues to be that provided by small-world type experiments of the kind invented by Milgram. How persuasive, then, are these findings? As Kleinberg [13] noted in his initial formulation of the algorithmic small-world problem, the experiment conducted by Travers and Milgram [25, 34] clearly demonstrated that at least some people are able to construct short chains, comprising an average of approximately five intermediaries, that connect distant senders to a chosen target individual. But as Travers and Milgram themselves emphasized, it was also the case that most of the chains that started (roughly 80% in their case) never reached the target. Subsequent experiments [6, 11, 17, 22, 23, 32] have demonstrated much the same pattern: on the one hand, chains that reach their targets tend to be short; but on the other hand, the rate of chain completion tends to be low. In Korte and Milgram's follow-up study of white senders in Los Angeles attempting to reach one of 270 black targets in New York [17], the average length of completed chains was 7, but the completion rate was only 13%. And in the most recently repeated small-world experiment, conducted by Dodds et al. [6], the pattern was even more striking: completed chains were only 4 steps long, but only 0.4% of about 24,000 chains that started (a total of 384) reached their targets.
References to "six degrees of separation" [10] typically emphasize the first of these results, that completed chains tend to be short; but here we wish to emphasize the second finding, that the vast majority of chains never reach their ultimate target, and therefore that most of the desired data on chain length is effectively missing. Determining what chain attrition tells us about the searchability of social networks is therefore the main object of this paper, which is organized as follows. In the next section, we review the standard approach to handling chain attrition, and raise two possible objections that have not been resolved by previous studies. In Section 3, we describe data from two recent small-world experiments, and model attrition as a function of individual and relational attributes. In Section 4, we derive a general correction for missing data based on the classical statistical idea of importance sampling. In Section 5, we estimate chain length distributions based on the attrition model and the importance sampling estimator. We find that although the median chain length is consistent with earlier estimates, the mean is significantly longer than previously believed. Finally, in Section 6, we close with a discussion of the new evidence, concluding that the usual interpretation of the small-world hypothesis requires some caveats.

2. RELATED WORK

The conventional wisdom regarding chain attrition in small-world experiments, initially proposed by White [39], is that message chains terminate for reasons that are unrelated either to the topology of the underlying network, or to the search process. Rather, it is assumed that participants fail to pass on messages either because they are not sufficiently motivated to do so, or else because they fail to receive them in the first place. Message passing can therefore be viewed as a stochastic process that takes place on some network, where the only parameter governing the success of a search is the probability of termination at each step (footnote 1).

A convenient feature of this stochastic attrition interpretation is that it can be invoked to derive estimates of the true length of chains, that is, the length distribution of chains that would have been observed had no attrition taken place. In particular, White [39] proposed the estimator

$$\hat{p}_l = \frac{p_l}{\prod_{j=0}^{l-2} (1 - r_j)} \qquad (1)$$

where $\hat{p}_l$ is an estimate of the true percentage of paths with length $l$, $p_l$ is the observed percentage of completed chains with length $l$, and $r_j$ is the probability of attrition from step $j$ to step $j+1$. In deriving this estimator, White assumed that individuals who knew the target directly would always pass on the message; thus last-step attrition would be zero. Although plausible, this assumption is not inevitable, nor was it based on empirical evidence. Dodds et al. [6] therefore removed this assumption, resulting in the estimator

$$\hat{p}_l = \frac{p_l}{\prod_{j=0}^{l-1} (1 - r_j)} \qquad (2)$$

which is identical to (1), except that the product in the denominator runs from $j = 0$ to $j = l-1$. Importantly, it follows from either estimator that the true average path length is longer than that for observed chains, for the simple reason that longer chains are more likely to terminate than shorter chains.

Footnote 1: In the original Travers and Milgram experiment, for example, roughly 75% of messages were passed on at each step, corresponding to a 25% attrition rate.
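As a concrete illustration of how these corrections behave, the sketch below applies estimator (1) (White's assumption of no last-step attrition) and estimator (2) (the Dodds et al. variant) to an observed distribution of completed-chain lengths. This is only a minimal Python sketch under assumed inputs: the chain counts and attrition rates are hypothetical placeholders, not values from any of the experiments discussed here.

```python
import numpy as np

def corrected_length_distribution(observed_counts, attrition, last_step_attrition=True):
    """Re-weight the observed completed-chain length distribution to estimate the
    distribution that would be seen with zero attrition.

    observed_counts[l-1] is the number of completed chains of length l (l = 1, 2, ...),
    and attrition[j] is the assumed probability r_j of termination between step j and j+1.
    With last_step_attrition=False the final step is assumed never to fail
    (estimator (1)); otherwise every step is subject to attrition (estimator (2)).
    """
    observed_counts = np.asarray(observed_counts, dtype=float)
    p_obs = observed_counts / observed_counts.sum()        # observed fractions p_l
    p_hat = np.empty_like(p_obs)
    for idx, p_l in enumerate(p_obs):
        l = idx + 1
        n_terms = l if last_step_attrition else l - 1      # product over j = 0..l-1 or 0..l-2
        survival = np.prod([1.0 - attrition[j] for j in range(n_terms)])
        p_hat[idx] = p_l / survival                        # long chains get inflated the most
    return p_hat / p_hat.sum()                             # renormalize to a distribution

# Hypothetical example: completed chains of lengths 1..8, constant 25% per-step attrition.
counts = [5, 20, 60, 90, 80, 45, 20, 5]
rates = [0.25] * 8
print(corrected_length_distribution(counts, rates))                              # estimator (2)
print(corrected_length_distribution(counts, rates, last_step_attrition=False))   # estimator (1)
```

Note how the re-weighting shifts mass toward longer chains, which is exactly why both estimators imply a true average path length longer than the observed one.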
For example, White estimated that in the absence of attrition, Milgram's experiment would have yielded chain lengths in the vicinity of eight steps, not six. The resulting estimates, however, are still "short" in the sense of being comparable to that of an equivalent random graph (i.e., of the same size and average density); thus the upshot of these methods, and the conclusion of most previous studies, is that as long as the stochastic failure assumption is reasonable, the observation of even a small fraction of short chains in small-world experiments is sufficient to support the algorithmic small-world hypothesis.

An alternative interpretation of results from small-world experiments, however, takes direct issue with the stochastic attrition assumption, contending that low completion rates should be seen not merely as missing data, but as evidence that most pairs of individuals are not connected by short, navigable paths. For example, Kleinfeld [15] has argued that "The research on the small-world problem suggests not a counter-intuitive triumph of social research, but an all-too familiar pattern: We live in a world where social capital, the ability to make personal connections, is not widespread and more apt to be a possession of people of higher social status. Many small worlds do exist, such as scientists with worldwide connections or university administrators at a single campus. Rather than living in a small, small world, we may live in a world that looks a lot like a bowl of lumpy oatmeal, with many small worlds loosely connected and perhaps some small worlds not connected at all."

Objections of this kind raise two distinct ways in which low completion rates may be systematically related to the searchability of the network itself. First, it may be the case simply that some people are good at conducting social searches, while others are not. Exceptionally determined or resourceful individuals, that is, may be able to construct short paths to distant targets; but most people, lacking such social capital [5, 29], are effectively isolated from one another. A second objection, compounding the first, is that chains terminate because most source-target pairs live in effectively separate populations that can only be reached via long or otherwise undiscoverable paths, that is, "many small worlds, loosely connected." Thus, pairs of individuals who are already close, in the sense of sharing social, geographical, and demographic attributes, can find each other, but distant pairs cannot. Because, moreover, many more pairs are distant than close, the bulk of message chains should not be expected to complete, as indeed is observed in small-world experiments.

Although distinct, both objections call into question the validity of the estimators (1) and (2). Specifically, the first objection implies that chain attrition is not merely a function of unrelated extrinsic factors, as assumed by the stochastic attrition hypothesis, but rather reflects variability among individuals. Accordingly, one would expect chain attrition to correlate strongly with individual-level attributes such as education, income, and so on, that are typically related to social capital [5, 29]. Meanwhile, the second objection, that many source-target pairs live in distant groups, implies that long chains are not being counted in the estimation procedure, and that the resulting estimators are therefore biased.
Because the estimators (1) and (2) fail to account for heterogeneity in attrition, and also provide no guarantee that they are unbiased, one might suspect that the true chain length estimates are higher, possibly much higher, than conventional wisdom allows. In what follows, we examine data from two small-world experiments more systematically than in previous studies, and introduce an estimator that (a) is provably unbiased, and (b) allows for heterogeneous attrition rates. We find support for each of the interpretations discussed above: specifically, the median true chain length (i.e., in a hypothetical experiment with zero attrition) can be estimated robustly to be about 6-7; however, the mean of this distribution is likely to be much larger. In other words, at least half of all chains that start would be expected to reach their targets within about six steps; but at least some chains should be expected to be very long, consistent with the claim that some pairs of individuals are extremely unlikely to be able to locate each other.

3. MODELING ATTRITION

Our data come from two recent experiments, one of which has been reported on previously [6], that were designed to replicate Milgram's small-world method, except using email instead of physical packets, a change that permitted access to a much larger and more diverse population than was available to Milgram, and at lower cost [3]. The first of these experiments was conducted between December 2001 and August 2003; the second version followed immediately thereafter, and ran until December. In the first experiment, 98,865 people from 168 countries initiated 106,295 chains directed at 18 targets in 13 countries. In the second experiment, 85,621 people from 163 countries participated in 56,033 chains, directed at 21 targets in 13 countries. In both cases, participants were mostly from the United States and Western Europe, were predominantly white and Christian, and were largely young, college-educated, middle-class professionals (footnote 2).

Footnote 2: Five targets were recruited directly by members of the research teams, and the remaining targets were selected from more than 4,000 volunteers (recruited via the website) with the aim of achieving as diverse a pool as possible. Initial senders were recruited from among people who had heard about the experiment from the media or by word of mouth, where, in order to obtain as many participants as possible, there was no attempt to control the characteristics of senders. Participants registered at the website and were asked to reach target persons, with the restriction that they could only advance the chain through people whom they already knew. The following message was then emailed to the chosen recipients: "We are testing the idea that everyone in the world can be reached through a short chain of social acquaintances. Your objective is to use your social connections to move a message closer to a particular target person." Upon receiving this message, recipients had to verify whether they knew the senders and then send the message on to another person whom they knew; and this process continued until the target was reached. Thus, participants created a series of message chains from initiators to targets. The experiment also recorded demographic data and relational attributes (type, origin, strength of relationship, and reasons for choosing recipients) between senders and recipients.
The results exhibit the same combination of short path lengths and low completion rates that typify small-world experiments. The completion rates for these experiments, however, were much lower than those recorded by Milgram and his colleagues: whereas Travers and Milgram experienced roughly 20% completion, in Experiment 1 a total of 491 chains (0.5%) successfully reached their targets (footnote 3), and in Experiment 2 the completion rate was 0.1% (61 chains). These ultra-low completion rates are directly attributable to the peculiar design of the small-world method, in which chain completion rates diminish exponentially with chain length (footnote 4).

In order to address the first critique above, that the estimators (1) and (2) do not account for individual-level heterogeneity, we first need to model attrition in terms of individual-level attributes, like socioeconomic status, gender, education, and age, that are typically associated with differences in social capital [5, 29]. Unfortunately, measuring the impact of individual attributes on chain attrition is complicated by another feature of small-world experiments: the majority of people who did not continue chains never came to the experiment website; thus we do not have data on their individual attributes. As a substitute, therefore, we instead estimate the next-step continuance probability: given an individual A who forwards the message to B (and assuming B is not the ultimate target), we estimate the probability that B continues the chain. In this sense, we are effectively modeling the search ability of our participants (i.e., their ability to choose someone who will pass along the message). As such, our model also includes relational attributes (between A and B) and is adjusted for the target. In total, we analyzed 88,875 sender-recipient pairs (footnote 5), of which 32% comprised recipients who forwarded messages (continued links), and 68% comprised recipients who did not (terminated links).

We estimate the next-step continuance probability by logistic multilevel regression (also known as "hierarchical linear modeling"), a standard statistical tool for modeling data with group structure [7]. In our case, for example, high school, college, and graduate school would be separate groups in the category of education.

Footnote 3: Although we have used the same raw data as Dodds et al. [6], changes in our coding and some subsequent cleaning of the data have resulted in somewhat different figures for the number of participants and completed chains. There are three main reasons for these discrepancies. First, we now include a number of chains that were originally excluded by Dodds et al. [6] because they contained unidentified individuals. Second, Dodds et al. [6] required actual emails to be sent between two people in order to establish a connection. After closer examination, however, we found that there were people, especially those directly adjacent to targets, who received more than one message but who had forwarded only one. Here we count a connection whenever we know that it exists from previous exchanges; thus we have a higher total number of both incomplete and completed chains. Finally, we have added three demographic variables (ethnicity, work industry, and work position) that were inaccessible previously due to many participants answering in the "other" category. We solved the problem by manually checking all uncategorized answers and putting them into the relevant categories.

Footnote 4: For example, if 100,000 chains are initiated with a constant 25% attrition rate, after six removes there are about 17,800 chains left, whereas with a 67% attrition rate, only 129 survive.
Multilevel regression can be thought of as a compromise between two extreme approaches to pooling data from different groups: no pooling and complete pooling. No pooling corresponds to treating different groups within the same category as unrelated; thus one would fit distinct parameters for each group (e.g., high school versus college) without imposing any relationship among them. By contrast, complete pooling effectively corresponds to ignoring the presence of groups within a particular category, for example, treating all individuals as the same regardless of educational status. In the middle ground of multilevel models, one allows for the possibility that groups within a category are related, without specifying a hard constraint on the strength of their relationship. Specifically, our model is of the form

$$P(y_i = 1) = \mathrm{logit}^{-1}\big(\gamma + \beta_{\mathrm{nonwhite}} X_{\mathrm{nonwhite},i} + \beta_{\mathrm{female}} X_{\mathrm{female},i} + \alpha_{j_1[i]} + \cdots + \alpha_{j_9[i]}\big)$$

where the outcome variable $y_i$ indicates next-step continuance, $\gamma$ is the intercept, the two $\beta$ terms are fixed effects for female and nonwhite participants respectively, and the $\alpha_{j_k[i]}$ correspond to the nine group effects. For each category $k$ (e.g., education), $j_k[i]$ is the group (e.g., high school, college, graduate school, etc.) of the $i$th response, and we model the group parameters within each category as coming from a normal distribution: $\alpha_{j_k[i]} \sim N(0, \sigma_k^2)$. The subscript $j_k[i]$, in other words, can be thought of as a mapping from a particular response $i$ to a particular group $j_k[i]$ within category $k$.

Our model therefore includes a total of 66 parameters: an overall intercept term; one parameter each for gender (male/female) and race (white/non-white); 54 distinct attribute parameters, which in turn are grouped into nine categories (footnote 6) (age, education, work field, work position, income, strength of relationship with message recipient, reason for choosing recipient, origin of relationship with recipient, and target); and one variance parameter for each category. The estimated attrition rate for a particular individual (e.g., female, in a given age group, with a college degree, etc.) is obtained by adding together the relevant terms in the model and taking $\mathrm{logit}^{-1}$ of the sum.

The variance parameters $\sigma_k^2$ indicate how closely related the groups are within each category $k$, and are themselves inferred from the data; large variance corresponds to weak association of groups (i.e., no pooling), and small variance corresponds to strong association (i.e., complete pooling). Table 1 shows the standard deviations $\sigma_k$ for the nine categories (footnote 7) (first column). Although these $\sigma_k$ capture the typical effect of the category on attrition, they do so on the logit scale, which is difficult to interpret. To aid interpretation, therefore, these values are translated in the second column to a probability scale that is relative to the baseline attrition rate. For example, differences between targets account for a 2% absolute change in attrition rates, while differences in the education of senders account for 3%. Since the baseline attrition rate is 30%, a 2% absolute change corresponds to a 7% relative change. Furthermore, as discussed below, these differences are amplified by the correlation of attributes together with the compounding effects of chain-based message propagation.

Footnote 5: By contrast, we note that there were only 491 completed chains; thus by studying links instead of completed chains, we dramatically increase the number of relevant data points.

Footnote 6: We found that religion, country, type of relationship, and current chain length were not statistically significant predictors.
Footnote 7: Although it is difficult to compute overall standard errors for the variance parameters, Table 2 reports standard errors for each group-level coefficient within every category.
Table 1: Category standard deviation parameters from a multilevel logistic regression model of next-step continuance probabilities. Attrition is stated as the typical deviation from the baseline of 30% for white males.

    Category                          σ_k      Attrition
    Target                            0.09     ±0.02
    Age                               0.12     ±0.02
    Relationship Origin               0.04     ±0.01
    Income                            0.07     ±0.01
    Work Position                     0.03     ±0.01
    Work Field                        0.08     ±0.02
    Reason for Choosing Recipient     0.05     ±0.01
    Relationship Strength             0.11     ±0.02
    Education Level                   0.14     ±0.03

Table 2 develops this analysis, where now we examine the effects of individual and relational attributes on the next-step continuance probability, relative to the baseline attrition of 30% (for typical white males). Each row in Table 2 corresponds to a group (e.g., college) within the nine attribute categories (e.g., education and age), as well as the overall intercept and the two fixed effects for females and non-whites. For any given group (i.e., row in the table), the first table column is the estimated coefficient for that group (footnote 8), along with its associated standard error; and the second column gives the corresponding effect on the next-step continuance rate. Consistent with our interpretation of Table 1, Table 2 reveals a small but significant range of attrition rates, with individuals possessing the attributes of high social capital associated with higher continuance probabilities: possessing a graduate education, for example, increased pass-along by 4% above the baseline, whereas having only a high-school education diminished it by 3%; and whereas participants earning over $100,000 were 2% more likely than average to pass along a chain, those earning less than $25,000 were 1% less likely. As one might suspect, individual-level heterogeneity in attrition rates tends to be correlated across attributes (e.g., high education is associated with high income); thus the overall distribution of estimated attrition rates for participants is considerably greater than is indicated by any single group effect, with attrition rates varying from 60% to 80%, as shown in Figure 1. Overall, therefore, our analysis shows that high-status individuals are more likely to pass along messages to friends who again pass them along, and that these differences, once compounded over multiple attributes, can be large. One might also note, however, that the distribution in Figure 1 is sharply peaked around the mean; thus although individual differences can be large, they are typically small. Regardless, the homogeneity assumption used in estimators (1) and (2) is clearly invalid. To understand the relation between attrition and chain completion, therefore, we require an estimator that can account for heterogeneous attrition rates.

Footnote 8: Since gender and race are fixed effects in our model, we use white males as the baseline group.

Table 2: Coefficient estimates from a multilevel logistic regression model of next-step continuance probabilities. The probabilities are stated as deviations from the baseline of 30% for white males.
    Attributes                          Coef. (S.E.)     Probability
    Age
      Under                                  (0.11)
                                             (0.06)
                                             (0.06)
                                             (0.06)
                                             (0.06)
      Above                                  (0.07)
    Education Level
      Graduate school                   0.18 (0.08)      0.04
      College/University                     (0.08)      0.0
      High school                            (0.08)
      Elementary school                      (0.11)
    Work Field
      Media/Advertising/Arts                 (0.05)      0.02
      Education/Science                      (0.04)      0.01
      IT/Telecommunication                   (0.05)      0.0
      Government                             (0.05)
      Other                                  (0.04)
    Work Position
      Specialist/Technical                   (0.03)      0.01
      Student                                (0.03)      0.0
      Other                                  (0.02)      0.0
      Unemployed/Retired                     (0.03)      0.0
      Executive/Manager                      (0.02)
    Income
      More than $100,000                     (0.04)      0.02
      $50,000 - $100,000                     (0.04)      0.01
      $25,000 - $49,999                      (0.04)      0.0
      $2,000 - $24,999                       (0.04)
      Less than $2,000                       (0.05)
    Relationship Strength
      Extremely close                   0.13 (0.06)      0.03
      Very close                             (0.05)      0.0
      Fairly close                      0.05 (0.05)      0.01
      Casually                               (0.05)      0.0
      Not close                              (0.07)
    Reason for Choosing Recipient
      Profession                             (0.04)      0.01
      Education                              (0.04)      0.01
      Work brings contact                    (0.04)      0.0
      Geography                              (0.03)      0.0
      Other                                  (0.03)
    Relationship Origin
      Work                                   (0.03)      0.01
      School                                 (0.03)      0.0
      Internet                               (0.03)      0.0
      Mutual friend                          (0.03)      0.0
      Relative                               (0.03)
      Other                                  (0.03)
    Fixed Effects
      Intercept                              (0.12)      NA
      Female                                 (0.025)
      Nonwhite                               (0.041)     -0.03

Figure 1: The estimated distribution of attrition over individuals. Average attrition: 0.7.
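To make concrete how coefficients like those in Tables 1 and 2 translate into an individual-level prediction, the following Python sketch adds an intercept, fixed effects, and group effects on the logit scale and applies the inverse logit, as described in Section 3. All coefficient values below are hypothetical placeholders chosen for illustration, not fitted values from the model.

```python
import math

def inv_logit(x):
    """logit^{-1}(x) = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

# Hypothetical coefficients on the logit scale (placeholders, not fitted values).
INTERCEPT = -0.85                                   # gamma: baseline (white male)
FIXED = {"female": -0.10, "nonwhite": -0.15}        # fixed effects
GROUP = {                                           # group effects alpha_{j_k[i]}
    "education": {"graduate school": 0.20, "high school": -0.15},
    "income": {"more than $100,000": 0.10, "less than $25,000": -0.05},
    "relationship_strength": {"extremely close": 0.15, "not close": -0.10},
}

def next_step_continuance(is_female, is_nonwhite, attributes):
    """Estimated probability that the recipient continues the chain, given the
    sender's attributes (a dict mapping category -> group within that category)."""
    eta = INTERCEPT
    eta += FIXED["female"] * int(is_female)
    eta += FIXED["nonwhite"] * int(is_nonwhite)
    for category, level in attributes.items():
        eta += GROUP.get(category, {}).get(level, 0.0)   # unknown levels contribute 0
    return inv_logit(eta)

# Example: a female sender with a graduate degree and an extremely close tie to the recipient.
p = next_step_continuance(True, False, {"education": "graduate school",
                                        "relationship_strength": "extremely close"})
print(f"estimated continuance: {p:.2f}; estimated attrition: {1 - p:.2f}")
```

The partial pooling itself (shrinking each group effect toward zero with a category-level variance) happens at fitting time; the prediction step shown here is simply a sum of the relevant terms followed by the inverse logit.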
We also wish to address the second criticism raised in Section 2, which is that the estimators (1) and (2) have not been proven to be unbiased, and therefore may not account correctly for the presence of long, but unobserved, chains.

4. CORRECTING FOR MISSING DATA

We address both these problems by introducing a general technique to correct for missing data based on the classical statistical idea of importance sampling. Although at a high level importance sampling is similar to estimators (1) and (2), its more rigorous formulation permits us to guarantee that estimates are unbiased, and to specify errors on the estimates. It also allows us to incorporate heterogeneous attrition rates in a straightforward manner.

Intuitively, our estimation procedure works in the following manner. For each non-missing data point (i.e., completed chain), we estimate the likelihood of that data point having been observed. We then re-weight the non-missing values to account for the fact that some of the data points were more likely to have been observed (i.e., the shorter chains) than others (i.e., the longer chains).

More formally, we consider the space of all possible paths between all pairs of individuals. In this hypothetical space, there may be many paths that connect any two given individuals, and so we reasonably assume that some paths are more likely to be traversed than others. In an ideal small-world experiment without attrition, we would repeatedly observe the lengths of these random paths between random pairs of individuals. In the more general setting, we suppose there is a discrete outcome space $\Omega$ (e.g., the space of paths) that is equipped with a probability $P$ (e.g., $P$ describes the likelihood of selecting any particular path amongst all paths between all pairs of individuals). An experimental trial corresponds to observing an outcome (e.g., a path) $\omega \in \Omega$ drawn with probability $P$. In the case of missing data, however, we do not always get to see the uncorrupted outcome $\omega \in \Omega$ (e.g., the path). To model this missing-data situation, we suppose that after an outcome $\omega$ is drawn, it is observed with probability $Q(\omega)$ and reported missing with probability $1 - Q(\omega)$. In our small-world setting, "missing" corresponds to an incomplete chain, and due to attrition, $Q(\omega)$ is generally smaller for longer paths; longer true paths are more likely to be observed as incomplete chains (footnote 9). Hence, the outcome $\omega \in \Omega$ is ultimately observed to complete with probability $P(\omega)Q(\omega)$. In particular, summing over all possible outcomes, a non-missing value is observed on a given trial with probability $\sum_{\omega} P(\omega)Q(\omega)$, and a missing value is observed with probability $1 - \sum_{\omega} P(\omega)Q(\omega)$.

Let $X_1, \ldots, X_n$ denote $n$ such independent trials (in this case, $n$ is the total number of chains that are initiated), where $X_i$ is either an uncorrupted observation (e.g., a completed chain) or a missing value (e.g., an incomplete chain). That is, $X_i \in \Omega \cup \{\mathrm{NA}\}$, where NA denotes a missing value. The true expected chain length without attrition, $\mu$, can be expressed as a weighted average over the space of all paths (higher weights correspond to more likely paths):

$$\mu = \sum_{\omega} f(\omega) P(\omega)$$

where $f(\omega)$ denotes the length of the path $\omega$ (footnote 10). Without missing values (i.e., $Q(\omega) = 1$), the usual unbiased estimator for $\mu$ is the sample average $\frac{1}{n}\sum_{i=1}^{n} f(X_i)$. With missing values (i.e., $Q(\omega) < 1$), averaging over all the non-missing values in the sample biases our estimate toward outcomes that we are more likely to observe (footnote 11).
Footnote 9: We emphasize that our use of the term "observed" corresponds to its usage in probability theory, that is, as a random trial in an experiment, and does not imply that the chain is observed to complete. Indeed, the point is that most chains that are observed in the statistical sense do not complete. Thus one should think of all chains as being observed, where some (a minority) are observed to complete, while others (a majority) are observed not to complete.

Footnote 10: More generally, we might be interested in the expectation of an arbitrary function $f$.

Footnote 11: We emphasize again that this problem is particularly egregious in the case of estimating chain lengths, since the probability of observing a chain tends to decrease exponentially with its length; thus we are much more likely to see short chains, and hence underestimate mean chain length.

We adjust for this biased observation of outcomes by re-weighting samples by their inverse probability of observation, an idea based on the classical statistical technique of importance sampling. This re-weighting results in an unbiased estimator, as given in Theorem 1.

Theorem 1. In the general setting described above, an unbiased estimate of the mean $\mu = \sum_{\omega} f(\omega)P(\omega)$ is given by

$$\hat{\mu} = \frac{1}{n} \sum_{i=1}^{m} \frac{f(X_{k_i})}{Q(X_{k_i})}$$

where $X_{k_1}, \ldots, X_{k_m}$ are the $m$ observed, non-missing values, and $Q(\omega)$ is the probability that $\omega$ is observed uncorrupted after it has been sampled.

Proof. First extend $f$ to a function $\bar{f}$ defined on $\Omega \cup \{\mathrm{NA}\}$ (where NA denotes a missing value) by setting $\bar{f}(\omega) = f(\omega)$ for $\omega \in \Omega$ and $\bar{f}(\mathrm{NA}) = 0$. Then we can rewrite the estimator as

$$\hat{\mu} = \frac{1}{n} \sum_{i=1}^{n} \frac{\bar{f}(X_i)}{Q(X_i)}$$

where the sum is taken over all samples (including the missing values). Since the samples $X_i$ are identically distributed,
$\mathbb{E}\hat{\mu} = \mathbb{E}\big[\bar{f}(X_i)/Q(X_i)\big]$. Finally, since $\omega$ is observed non-missing with probability $P(\omega)Q(\omega)$, and $\bar{f}(\mathrm{NA}) = 0$, we have

$$\mathbb{E}\left[\frac{\bar{f}(X_i)}{Q(X_i)}\right] = \sum_{\omega \in \Omega} \frac{f(\omega)}{Q(\omega)} \, P(\omega)Q(\omega) = \sum_{\omega \in \Omega} f(\omega)P(\omega) = \mu.$$

Hence, $\hat{\mu}$ is unbiased.

Theorem 1 shows that $\hat{\mu}$ generates an unbiased estimate of $\mu = \sum_{\omega} f(\omega)P(\omega)$ for any function $f$. When $f(\omega)$ is the length of a chain $\omega$, $\hat{\mu}$ estimates the mean chain length:

$$\hat{\mu} = \frac{1}{n} \sum_{i=1}^{m} \frac{L(X_{k_i})}{Q(X_{k_i})} \qquad (3)$$

where $X_{k_1}, \ldots, X_{k_m}$ are the $m$ observed completed chains, $L(\omega)$ is the length of chain $\omega$, and $n$ is the total number of complete and incomplete chains in the sample.

Theorem 1 can also be used to estimate the entire chain length distribution. Set $f_i(\omega) = 1$ if $\omega$ is a chain of length $i$, and $f_i(\omega) = 0$ otherwise. Then the expectation of $f_i$ is given by

$$p_i = \sum_{\omega \in \Omega} f_i(\omega)P(\omega) = \sum_{\omega : L(\omega) = i} P(\omega).$$

That is, $p_i$ is the probability that a randomly chosen chain has true length $i$. Applying Theorem 1 to this indicator function $f_i$ yields an unbiased estimate of $p_i$:

$$\hat{p}_i = \frac{1}{n} \sum_{j=1}^{m} \frac{f_i(X_{k_j})}{Q(X_{k_j})} = \frac{1}{n} \sum_{X_k : L(X_k) = i} \frac{1}{Q(X_k)}. \qquad (4)$$

Computing $\hat{p}_i$ for $i = 1, 2, \ldots$ gives an approximation of the true chain length distribution, and in particular allows us to estimate the true median chain length in an idealized world without attrition.

Theorem 1 shows that $\hat{\mu}$ is unbiased; Corollary 1, below, derives the variance of $\hat{\mu}$, and in particular shows that the variance of $\hat{\mu}$ increases as the probability of observation $Q(\omega)$ decreases. That is, when data are more likely to go missing, our estimates are understandably more variable.

Corollary 1. The variance of $\hat{\mu}$ satisfies

$$\mathrm{Var}(\hat{\mu}) = \frac{1}{n}\left[\sum_{\omega \in \Omega} \frac{f^2(\omega)}{Q(\omega)}\, P(\omega) - \mu^2\right].$$

Proof. Using the notation of Theorem 1, by independence,

$$\mathrm{Var}(\hat{\mu}) = \frac{1}{n}\, \mathrm{Var}\!\left(\frac{\bar{f}(X_1)}{Q(X_1)}\right).$$

Since $\hat{\mu}$ is unbiased,

$$\mathrm{Var}\!\left(\frac{\bar{f}(X_1)}{Q(X_1)}\right) = \mathbb{E}\!\left[\frac{\bar{f}^2(X_1)}{Q^2(X_1)}\right] - \mu^2 = \sum_{\omega \in \Omega} \frac{f^2(\omega)}{Q(\omega)}\, P(\omega) - \mu^2,$$

and the result is shown.

In Corollary 1, the variance of $\hat{\mu}$ is expressed in terms of the true probability of selection $P(\omega)$, which is not usually available in applications. Hence, the variance cannot generally be computed directly by this formula. Furthermore, in practice, variability in estimates of the mean is further increased by noise in estimates of attrition (i.e., noise in estimates of $Q(\omega)$). We account for these two sources of variance by a modified bootstrap sampling procedure described in Section 5.

5. ESTIMATING CHAIN LENGTHS

Applying the attrition model of Section 3 together with the missing-data correction of Section 4, we now estimate the true chain length distribution based on the small-world data. We start by examining the Travers and Milgram [34] data and the Dodds et al. [6] data under the assumption of homogeneous attrition, and then proceed with the main analysis, an investigation of chain length with heterogeneous attrition.

Homogeneous Attrition. For estimation based on the homogeneous attrition model, we assume that individuals terminate chains (i.e., they do not forward messages) with probability $r$, independent of their personal attributes; hence, individuals independently forward messages with probability $1 - r$. Since a chain $\omega$ of length $L(\omega)$ reaches its target only if each individual in the chain forwards the message, in the homogeneous attrition model the chain is observed as complete with probability $Q(\omega) = (1-r)^{L(\omega)}$. We generate confidence intervals for our estimates by bootstrap sampling.
From the original set of $n$ complete and incomplete chains, we first resample $n$ chains with replacement, generating a bootstrap sample $S_1$, where we note that in the sample some of the original chains appear more than once while others do not appear at all. Repeating this resampling procedure k = 10,000 times produces $k$ bootstrap samples $S_1, \ldots, S_k$, where each sample $S_i$ is itself a random resampling of the original data. These randomly generated sets of chains simulate what would have been observed had we been able to repeat the entire experiment $k$ times; thus we can obtain $k$ estimates of the true mean chain length (footnote 12). The confidence interval for the mean chain length is then defined by the range of the middle 95% of these $k$ bootstrap estimates.

In the Travers and Milgram data, we empirically find an attrition rate of $r = 0.25$, and so have $Q(\omega) = (0.75)^{L(\omega)}$. Applying the estimator (3) with this choice of $Q(\omega)$ yields an estimated true mean of 11.8 (95% CI: 8.5-15), compared with the observed mean of 6.2. The estimator (4) correspondingly allows us to approximate the entire chain length distribution, from which a median of 7 (95% CI: 6-7) is computed, in approximate agreement (footnote 13) with the previously reported median of 8 [39].

Footnote 12: We apply the same estimator (3) to each experiment, but because the data given to the estimator change with each bootstrap sample, we obtain different estimates.

Footnote 13: The agreement is even closer than it seems, on account of an additional assumption made by White that senders at the last remove, being personally acquainted with the target (by definition), would always (i.e., with 100% likelihood) pass on the letter. As noted earlier, by effectively increasing the probability that chains would complete, this assumption actually increases the estimated chain length by one, versus the alternative assumption (made by Dodds et al. [6]) that all links were equally susceptible to attrition. Thus White's estimate of 8 is essentially the same as our estimate of 7.
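The sketch below shows, under stated assumptions, how the importance-sampling estimators (3) and (4) combine with the bootstrap procedure in the homogeneous-attrition case: each completed chain of length L is weighted by 1/Q = (1-r)^(-L), and confidence intervals come from resampling chains with replacement. The chain data are simulated placeholders; the attrition rate of 0.25 is used purely because it matches the Travers-Milgram figure quoted above.

```python
import numpy as np

rng = np.random.default_rng(0)

def mean_and_median(lengths, completed, r):
    """Estimators (3) and (4) under homogeneous attrition: a chain of length L is
    observed complete with probability Q = (1 - r) ** L."""
    lengths = np.asarray(lengths, dtype=float)
    completed = np.asarray(completed, dtype=bool)
    n = len(lengths)                                   # all chains, complete and incomplete
    L = lengths[completed]                             # lengths of completed chains only
    w = (1.0 - r) ** (-L)                              # importance weights 1 / Q(omega)
    mean_hat = float(w @ L) / n                        # estimator (3)
    p_hat = np.array([w[L == l].sum() / n for l in range(1, int(L.max()) + 1)])
    p_hat /= p_hat.sum()                               # estimator (4), normalized
    median_hat = int(np.searchsorted(np.cumsum(p_hat), 0.5)) + 1
    return mean_hat, median_hat

def bootstrap_mean_ci(lengths, completed, r, k=1000):
    """Resample chains with replacement and report the middle 95% of mean estimates."""
    lengths, completed = np.asarray(lengths), np.asarray(completed)
    n = len(lengths)
    means = [mean_and_median(lengths[idx], completed[idx], r)[0]
             for idx in (rng.integers(0, n, size=n) for _ in range(k))]
    return np.percentile(means, [2.5, 97.5])

# Hypothetical data: 10,000 chains whose completion is simulated with 25% per-step attrition.
true_lengths = rng.integers(2, 12, size=10_000)
completed = rng.random(10_000) < (1 - 0.25) ** true_lengths
print(mean_and_median(true_lengths, completed, r=0.25))
print(bootstrap_mean_ci(true_lengths, completed, r=0.25))
```

Because long chains are rarely observed yet carry the largest weights, the bootstrap distribution of the mean is wide even in this simulated setting, which anticipates the sensitivity discussed below.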
Next we consider the Dodds et al. [6] data, again under a homogeneous attrition assumption. Here, however, we consider a more nuanced attrition model in which the first individual in a chain has attrition $r_0 = 0.41$ and all the remaining individuals in the chain have attrition $r = 0.70$. These values of attrition were empirically determined from the data, and the substantial difference found between $r_0$ and $r$ motivates our decision to treat them separately (footnote 14). To reach the target, the chain initiator must first forward the message to the second participant in the chain (which happens with probability $1 - r_0$), and then each of the subsequent participants must also forward the message (which occurs with probability $(1-r)^{L(\omega)-1}$). Consequently, in this model, the probability $Q(\omega)$ that a chain reaches its target (i.e., is observed as complete) is given by

$$Q(\omega) = (1 - r_0)(1 - r)^{L(\omega)-1} = (1 - 0.41)(1 - 0.70)^{L(\omega)-1}.$$

With this choice of $Q(\omega)$, the estimator (3) for mean chain length and the estimator (4) for the full chain length distribution yield a true mean length of 41.5 (95% CI: 20-68) and a robust median of 6 (95% CI: 6-6). This estimate of the median is in agreement with Dodds et al.'s estimated range of 5-7 [6].

Heterogeneous Attrition. We now proceed with the main analysis, estimating the chain length distribution while allowing for heterogeneous attrition, as modeled in Section 3. Recall that due to limits in the experimental design, we estimate the next-step continuance probability: given an individual A who forwards the message to B (and assuming B is not the ultimate target), we estimate the probability that B continues the chain based on A's attributes. In particular, the next-step continuance probability is a function of A's gender, race, age, education, work field, work position, income, strength of A's relationship with B, A's reason for selecting B, origin of A's relationship with B, and the final target.

We first define the probability $Q(\omega)$ of observing a completed chain in this heterogeneous attrition model. For a chain $\omega$ of length $L(\omega)$, enumerate the links in the chain as $\omega_0 \to \omega_1, \omega_1 \to \omega_2, \ldots, \omega_{L(\omega)-1} \to \omega_{L(\omega)}$, and let $r_{\omega_i \omega_{i+1}}$ denote the probability that the $i$th participant in the chain fails to forward the message. We assume the chain initiator fails to forward the message with fixed probability $r_{\omega_0 \omega_1} = 0.41$, found empirically from the data. For $i > 1$, $r_{\omega_i \omega_{i+1}}$ is estimated from the next-step continuance model of Section 3 based on the attributes of the $(i-1)$st participant in the chain $\omega$. For example, if the $(i-1)$st participant in $\omega$ is a white male in a given age group with a graduate degree, etc., we may estimate $r_{\omega_i \omega_{i+1}} = 0.67$. Combining these continuance probabilities from each step, we define the probability of observing a complete chain (i.e., a chain that makes it to its target) by

$$Q(\omega) = (1 - r_{\omega_0 \omega_1})(1 - r_{\omega_1 \omega_2}) \cdots (1 - r_{\omega_{L(\omega)-1} \omega_{L(\omega)}}).$$

Footnote 14: The reason for the much lower attrition at the first step derives from the nature of the experiment, which relied on initial senders signing up at the web site to participate. Naturally, subsequent senders in a given chain did not volunteer, but were instead recruited by previous senders, and thus were presumably not as motivated to complete the exercise as those who volunteered to initiate the chains.
To generate confidence intervals for estimates based on heterogeneous attrition, we require a slightly more complicated bootstrap sampling method than for the homogeneous case above, as now we also need to account for uncertainty in our estimates of attrition, which in turn translates to uncertainty in $Q(\omega)$. As in the case of homogeneous attrition, bootstrap samples $S_1, \ldots, S_k$ are generated by resampling n = 162,328 chains from the original set of $n$ complete and incomplete chains. Each bootstrap sample is a simulated dataset that mimics what could have occurred had the entire small-world experiment been repeated. Next, we simulate vectors of regression coefficients $\beta_1, \ldots, \beta_k$ for the attrition model of Section 3, where each vector of parameters $\beta_i$ is a complete set of coefficients for the model. In particular, $\beta_i$ consists of parameter values for each group-level effect (e.g., graduate school), and allows one to estimate next-step continuance probabilities for given attribute data. These vectors of coefficients are generated by taking into account both the uncertainty in individual parameters, and the correlation between parameters (footnote 15). In short, for a given chain $\omega$, each coefficient vector $\beta_i$ produces different estimates for the probability $r_{\omega_i \omega_{i+1}}$ of failing to forward a message across each link in the chain, and hence generates different estimates of the probability $Q_{\beta_i}(\omega)$ that the chain $\omega$ is observed, reflecting uncertainty in the attrition model. The mean for each bootstrap sample $S_i$ is computed by way of the observation probability $Q_{\beta_i}(\cdot)$, as derived from the attrition model with parameter vector $\beta_i$. As before, this procedure results in k = 10,000 different estimates of the mean chain length, and the reported confidence interval is the range of the middle 95% of these estimates.

Appealing as before to the estimators (3) and (4), we find the mean under this heterogeneous attrition model is 22 (95% CI: ), and the median is 7 (95% CI: 6-8.5). Estimation of mean chain length is clearly sensitive to uncertainty in estimates of the probability of observation $Q(\omega)$, resulting in a wide confidence interval. The median, however, appears to be much more stable. The entire estimated cumulative distribution function (CDF) for chain length is shown in Figure 2. Since there are relatively few long chains, the variance in the estimated CDF grows with chain length.

Figure 2: The estimated CDF of chain length under the heterogeneous attrition model. Error bars indicate 95% confidence intervals.

Footnote 15: In a Bayesian interpretation, the coefficient vectors are generated from a non-informative prior, and represent configurations of parameters that are consistent with the data (see Gelman & Hill [7], Section 7.2).
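As a minimal sketch of the heterogeneous case, the snippet below assembles Q(ω) for a single chain as the product of per-link continuance probabilities and shows the resulting importance weight. The per-link attrition values and the chain itself are hypothetical; in the actual analysis these values would come from the fitted model of Section 3 (and, for the confidence intervals, from the simulated coefficient vectors β_i).

```python
import numpy as np

FIRST_STEP_ATTRITION = 0.41   # r for the chain initiator, as estimated empirically above

def chain_observation_probability(link_attritions):
    """Q(omega) for one chain: the product of per-link continuance probabilities,
    where link_attritions[i] is the probability that the i-th participant in the
    chain fails to forward the message (the first entry is the initiator's)."""
    link_attritions = np.asarray(link_attritions, dtype=float)
    return float(np.prod(1.0 - link_attritions))

# Hypothetical completed chain of length 5: the initiator plus four further links,
# with per-link attrition estimates standing in for the multilevel-model predictions.
chain_attrition = [FIRST_STEP_ATTRITION, 0.62, 0.71, 0.66, 0.69]
Q = chain_observation_probability(chain_attrition)
print(f"Q(omega) = {Q:.5f}; importance weight 1/Q = {1.0 / Q:.1f}")

# Estimators (3) and (4) are then applied exactly as in the homogeneous case,
# except that each completed chain carries its own weight 1 / Q(omega).
```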
Randomized Attrition. One possible objection to these estimates is that, since individuals in completed chains by definition must have passed on messages, we are at risk of over-fitting the attrition probabilities, mistakenly inferring that attributes of participants who happen to be in completed chains predict lower attrition rates. Indeed, the average estimated attrition for individuals in completed chains is 3% lower than the average estimated attrition for individuals in incomplete chains. To address this possibility of over-fitting, we consider a second heterogeneous attrition model, in which attrition probabilities $R_i$ are randomly generated from the distribution of estimated attrition rates shown in Figure 1. That is, we assume individuals have attrition rates that are randomly drawn from this estimated population distribution, and define the probability of observing a completed chain $\omega$ of length $L(\omega)$ to be

$$Q(\omega) = (1 - R_{\omega_0})(1 - R_{\omega_1}) \cdots (1 - R_{\omega_{L(\omega)-1}}).$$

Each individual in each chain is independently assigned an attrition rate chosen from the population distribution. Again invoking the estimators (3) and (4), we find that under this modified attrition model, the true mean chain length is now 49 (95% CI: 37-63), where confidence intervals are generated analogously to the previous heterogeneous attrition case. As before, we also estimate the median, which we find to be 6 (95% CI: 6-6).

All four estimates, one based on Travers and Milgram's data and three based on the experiments described above under assumptions of (a) homogeneous attrition, (b) empirically observed heterogeneous attrition, and (c) randomized heterogeneous attrition, are presented in Table 3, which prompts three observations. First, estimates of the median are extremely stable, with all four procedures predicting a true median chain length in the range of 6-7 steps; thus, for approximately half the population, the claim that everyone is connected by "six degrees of separation" appears to be valid not only in the topological sense, but also in the algorithmic sense. Second, in contrast with the median, estimates of the mean are extremely unstable, ranging from 11.8 to 49 steps, and with 95% confidence intervals ranging from 4.5 to 68. That estimated means can vary so wildly as a function of different assumptions about the heterogeneity of attrition is perhaps not surprising given the known sensitivity of chain completion to attrition, but it does suggest that any conclusions regarding the mean should be treated with caution. In spite of this, however, a third conclusion appears warranted, namely that at least some, and possibly many, chains are much, much longer than the median. Thus although the algorithmic small-world phenomenon appears to be satisfied for at least half the population, it also appears not to be satisfied for at least some fraction.

Table 3: Summary of true average algorithmic distance under homogeneous and heterogeneous attrition models.

    Model                                        Mean (95% CI)    Median (95% CI)
    Homogeneous attrition (Travers/Milgram)      11.8 (8.5-15)    7 (6-7)
    Homogeneous attrition (Dodds et al.)         41.5 (20-68)     6 (6-6)
    Heterogeneous attrition (Dodds et al.)       22 ( )           7 (6-8.5)
    Randomized attrition (Dodds et al.)          49 (37-63)       6 (6-6)

6. CONCLUSION

In concluding, we return to the distinction posed in the Introduction between the topological and algorithmic versions of the small-world hypothesis. Empirically, the most striking contrast between the two is that whereas in our analysis we find very large differences between the mean and the median path lengths, studies of topological path lengths typically find that the two measures are almost interchangeable: in the largest such study to date, for example, Leskovec and Horvitz [20] found that the mean shortest path length was 6.6 while the median was 7. Precisely why topological and algorithmic path lengths differ in this manner is ultimately
unclear; however, it probably derives from the observation that the number of steps in a search chain depends not only on the actual (i.e., topological) distance between the source and target, but also on the search strategies of the intermediaries. Two individuals, in other words, may be the same distance from a given target, but if one has a better search strategy, the resulting chain will be shorter. Moreover, if one sender merely perceives the task to be easier, the two chains may experience different outcomes (for example, one may terminate while the other continues), thus leading to the appearance of a difference in difficulty when in fact the real difficulty is the same. Given only a set of complete and incomplete chains, therefore, it is arguably impossible to determine how much of the observed differences in chain length arise from (a) differences in topological distance, (b) differences in search strategies, and (c) differences in beliefs and motivations. In simulation exercises, these ambiguities can be eliminated by assuming a search rule [13, 37], but of course there is no guarantee that these rules resemble those being used by the participants in real-world experiments; thus simulations alone cannot resolve the problem either.

In principle, the problem might be resolved with an alternative experimental design, in which one could sample pairs of individuals according to their known topological distances, and also so as to represent a wide range of individual-level attributes. One could then run a large number of small-world type experiments, where participants would be drawn from the known sample and would be required to pass messages exclusively through the known network. By comparing the lengths of completed chains with known topological shortest paths, one could obtain a measure of the true difficulty of a search between pairs with given attributes. And by varying incentives to participants, as well as performing follow-up surveys, search difficulty could be further parsed in terms of differences in search strategies, perceived difficulty, and apathy. Although non-trivial to implement, the recent explosion of large social networking sites like Facebook, as well as email [18] and instant messaging networks [20], at least renders such a design feasible. The results of such a study, moreover, would not only help to settle a decades-long debate over the effective connectivity of the social world, but would also shed new light on the relation between individual social capital, topological network structure, and the algorithmic properties of social search.
Acknowledgments

The authors wish to acknowledge helpful conversations with P. Bearman and A. Gelman. Research supported in part by NSF grants SES and SES, and by the James S. McDonnell Foundation.

7. REFERENCES

[1] L. Adamic and E. Adar. How to search a social network. Social Networks, 27.
[2] L. Adamic, R. Lukose, A. Puniyani, and B. Huberman. Search in power-law networks. Physical Review E, 64.
[3] J. Best, B. Krueger, and C. Smith. An assessment of the generalizability of internet surveys. Social Science Computer Review, 19.
[4] A. Broder, R. Kumar, F. Maghoul, P. Raghavan, S. Rajagopalan, R. Stata, A. Tomkins, and J. W. Graph structure in the web. Computer Networks.
[5] J. S. Coleman. Social capital in the creation of human capital. The American Journal of Sociology, 94:S95-S120.
[6] P. Dodds, R. Muhamad, and D. J. Watts. An experimental study of search in global social networks. Science, 301.
[7] A. Gelman and J. Hill. Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge University Press.
[8] S. Golder, D. Wilkinson, and B. A. Huberman. Rhythms of social interaction: Messaging within a massive online network. In 3rd International Conference on Communities and Technologies.
[9] M. Granovetter. Getting a Job: A Study of Contacts and Careers. Harvard University Press, 2nd edition.
[10] J. Guare. Six Degrees of Separation.
[11] J. Guiot. A modification of Milgram's small world method. European Journal of Social Psychology, 6.
[12] H. Jeong, B. Tombor, R. Albert, Z. N. Oltvai, and A. L. Barabási. The large-scale organization of metabolic networks. Nature, 407.
[13] J. Kleinberg. The small-world phenomenon: An algorithmic perspective. In 32nd ACM Symposium on Theory of Computing.
[14] J. Kleinberg. Navigation in a small world. Nature, 406.
[15] J. Kleinfeld. Could it be a big world after all? Society, 39:61-66.
[16] B. Kogut and G. Walker. The small world of Germany and the durability of national networks. American Sociological Review, 66.
[17] C. Korte and S. Milgram. Acquaintance networks between racial groups: Application of the small world method. Journal of Personality and Social Psychology, 15.
[18] G. Kossinets and D. Watts. Empirical analysis of an evolving social network. Science, 311:88-90.
[19] N. H. Lee. The Search for an Abortionist. The University of Chicago Press.
[20] J. Leskovec and E. Horvitz. Planetary-scale views on a large instant-messaging network. In 17th International World Wide Web Conference.
[21] D. Liben-Nowell, J. Novak, R. Kumar, P. Raghavan, and A. Tomkins. Geographic routing in social networks. Proc. Natl. Acad. Sci. USA, 102.
[22] N. Lin, P. Dayton, and P. Greenwald. The urban communication network and social stratification: A small world experiment. In B. D. Ruben, editor, Communication Yearbook. Transaction Books.
[23] C. Lundberg. Patterns of acquaintanceship in society and complex organization: A comparative study of the small world problem. Pacific Sociological Review, 18.
[24] J. W. Meshel. One Phone Call Away: Secrets of a Master Networker. Portfolio.
[25] S. Milgram. The small world problem. Psychology Today, 1:60-67.
[26] A. E. Motter, T. Nishikawa, and Y. C. Lai. Large-scale structural organization of social networks. Physical Review E, 68.
[27] M. E. J. Newman. The structure of scientific collaboration networks. Proc. Natl. Acad. Sci. USA, 98.
[28] O. Pierson. The Unwritten Rules of Highly Effective Job Search. McGraw-Hill.
Social capital: Its origins and applications in modern sociology. Annual Review of Sociology, 24:1 24, [30] D. Rezac. Work the Pond! Berkley Publishing Group, [31] P. Sen, S. Dasgupta, A. Chatterjee, P. A. Sreeram, G. Mukherjee, and S. S. Manna. Small-world properties of the indian railway network. Physical Review E, 67, [32] R. Shotland. University Communication Networks: The Small World Method. Wiley, [33] O. Simsek and D. Jensen. Decentralized search in networks using homophily and degree disparity. In Proceedings of the Ninenteenth International Joint Conference on Artificial Intelligence, [34] J. Travers and S. Milgram. An experimental study of the small world problem. Sociometry, 32: , [35] A. Wagner and D. A. Fell. The small world inside large metabolic networks. Proceedings of the Royal Society of London Series B-Biological Sciences, 268: , [36] D. J. Watts. Six Degrees: The Science of A Connected Age. W. W. Norton, [37] D. J. Watts, P. S. Dodds, and M. E. J. Newman. Identity and search in social networks. Science, 296: , [38] D. J. Watts and S. H. Strogatz. Collective dynamics of small-world networks. Nature, 393: , [39] H. C. White. Search parameters for small world problem. Social Forces, 49: , 1970.