Protocols for Randomized Experiments to Identify Network Contagion

A.C. Thomas and Michael Finegold

March 14, 2013

Abstract

Identifying the existence and magnitude of social contagion, or the spread of an individual trait along ties in a social network, is a challenging task due in part to the tendency of individuals with similar characteristics to connect, also known as homophily. While randomized experiments on individuals of a network would seem to be the ideal method for establishing contagion, there are still considerable methodological issues stemming from structural considerations, including the likelihood of inclusion in the sample group and the implicit dependence between units due to latent homophily. We construct a protocol that correctly adjusts for these factors in a number of experimental situations.

1 INTRODUCTION

The phenomenon of contagion on social networks is of considerable interest to researchers and practitioners in marketing, sociology, biology and technology, as evidenced by the very nature of viral marketing as a field of study and investment; it assumes that one need only manipulate a small number of nodes on a network before a message is able to spread across a much greater share of the population. While the promise of its exploitation is considerable, identifying whether or not a trait is viral, that is, whether a particular behavior can actually spread on a network, is difficult to do accurately.

The general term network autocorrelation encompasses many reasons why two members of a social network would have characteristics similar to one another. Two of the simplest explanations are contagion, in which one friend influences another's adoption choices, and homophily, in which similar individuals become friends on the basis of shared interests or opportunities (Manski, 1993). While many observational studies have sought to distinguish the two, it is rarely simple; under latent homophily, for example, an unobserved factor that causes a friendship to form may also cause a change in personal characteristics, which can be mistaken for contagion (Shalizi and Thomas, 2011). All these problems have
emphasized the role that experiments can play in identifying whether or not contagion has actually taken place. By randomizing a treatment condition over a number of units, one can isolate the direct contribution from observable characteristics while removing the contribution of confounding factors.

One potential approach is to construct a social network from scratch, under the complete control of the experimenter, such as those commissioned by Centola (2010; 2011). In this case, a health website with social network components was created de novo, and new visitors were organized into small-scale social networks according to their characteristics and particular experimental conditions on network topology and enforced homophily. The adoption of a trait (the use and maintenance of a diet log) could only be achieved if a directly connected peer had themselves adopted it, so that the trait could spread only through the social network. While this approach has its merits in a smaller-scale system, it has little applicability to large social network experiments in vivo, in which the conditions that spawned and grew the network are largely unknown.

The typical social network experiment is what we will call the viral marketing design: distribute an incentive-style treatment to a random sample of the population, the manipulated group, and compare the rate at which their contacts adopt a certain behavior to the rate at which an unmanipulated group's contacts adopt the same behavior. In particular, if a real contagion effect is small enough, we may be justified in ignoring spillover effects, or extensions of the contagious process beyond the original units. Adoption of an online service is a common choice in information systems (Bapna and Umyarov, 2012; Aral and Walker, 2012), and the effect sizes in these studies tend to be small enough that our assumption is reasonable.

The principle at work in this experimental design is that the manipulation can be very subtly achieved: one can select a small subset of nodes in a network and indirectly manipulate many more, proportional to at least the number of individuals who are directly connected, which makes this design seem compelling. Since the manipulated units were chosen at random, they should have no systematic differences from the population. The problem lies in the fact that we are measuring outcomes on the followers of the manipulated and unmanipulated individuals, not the manipulated and unmanipulated groups themselves. The groups of contacts may be systematically different from groups chosen at random from the population. Two effects have particular impact on this sort of study, even when the experimental set-up is taken into account:

Latent homophily. Individuals with similar characteristics tend to form more connections with each other than with those whose characteristics are different. If these factors also contribute to the observed behavior of an individual, this can be mistaken for contagion
(Shalizi and Thomas, 2011); in general, it can be difficult to disentangle this even with time-dependent models. In a perfectly balanced experiment, each treated unit will have a matching control, and so the expected value of the difference will not be affected. Because the units within each group can be slightly dependent by design, however, the variance of the estimator can be considerably higher, owing to the covariance between units; without taking this into account, standard test statistics for differences, such as the two-sample t-test, will not have the coverage advertised by the number of discrete units in each sample. Since we suspect dependence between units, while standard tests rely on a null distribution with independent units, we show that using clusters of friends, grouped by the original seed nodes, corrects this discrepancy in coverage properties.

Inclusion by degree. Even though one can select a subset of the population uniformly at random, it is not at all clear that the actual distribution of experimental units, the social followers of the original subset, will be representative of the population. In particular, while the selection process for seed nodes can be uniformly random, it is well known that processes that crawl the social graph are biased in favor of well-connected individuals. If the probability of inclusion is dependent on the outcome being measured, then the estimator of the population mean will be biased. A standard correction for this is the estimator of Horvitz and Thompson (1952); this has since been used in some form in methods for sampling from a network that crawl the social graph, such as Respondent-Driven Sampling (Heckathorn, 1997; Gile and Handcock, 2010).

We demonstrate that by incorporating each of these corrections into our estimators, we can obtain appropriate coverage probabilities for this class of network experiments. We continue in Section 2 by illustrating the failure of standard statistical tests under this paradigm, particularly in Section 2.1 with simulated networks and Section 2.2 with a real-world network subset. We demonstrate how this applies in a quasi-experimental setting in Section 3 before discussing future extensions in Section 4.

2 DEMONSTRATION

In these experiments, we have four primary groups of individuals on the network under inspection: two groups of nodes, the manipulated M and the unmanipulated U, corresponding to the original sample of individuals on the network (let L be the union of these
groups). These two groups together are typically sampled uniformly at random on the network, and nodes are partitioned between the groups through straight or blocked randomization. Two further groups of nodes, the treated T and the control C, are those individuals who name members of M and U in their social networks (that is, the followers of M and U respectively). The differential exposure to a manipulated node is the treatment in question. Nodes that are exposed to members of M and U simultaneously are excluded for balance purposes rather than simply included in T (Bapna and Umyarov, 2012); if the sample is small compared to the population, the impact on selection probability for the remaining nodes is minimal.

We first generate synthetic networks to demonstrate how the standard sampling scheme can lead to unexpected consequences.

2.1 Simulated Networks and Outcomes

For demonstration purposes, we consider a simple network model with two features: variability in the number of connections made by each individual (in terms of both inbound and outbound links), and a binary factor that drives a homophily-based mechanism. For each individual $i \in \{1, \ldots, N\}$, the binary factor $X_i$ is drawn from the Bernoulli distribution, $X_i \sim \mathrm{Be}(p)$, and the propensities to form inbound ties, $\alpha_i$, and outbound ties, $\beta_i$, are drawn from the bivariate normal distribution,

$$ \begin{bmatrix} \alpha_i \\ \beta_i \end{bmatrix} \sim N_2\!\left( \begin{bmatrix} 0 \\ 0 \end{bmatrix}, \begin{bmatrix} \sigma_\alpha^2 & \rho\sigma_\alpha\sigma_\beta \\ \rho\sigma_\alpha\sigma_\beta & \sigma_\beta^2 \end{bmatrix} \right), $$

so that there is variability in both follower and followee count, which may be related. Each potential directed edge, denoted $Z_{ij}$, is then drawn from a Bernoulli distribution,

$$ Z_{ij} \sim \mathrm{Be}\!\left( \Phi\big(\mu + \alpha_i + \beta_j + \gamma I(X_i = X_j)\big) \right), $$

where $\gamma > 0$ ensures that ties are more likely to form between nodes with the same binary factor. This model draws elements from the $p_1$ model (Holland and Leinhardt, 1981) and the stochastic block model (Holland et al., 1983), but the characteristics of the networks it generates are known to be common to many real-world networks.
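To make this generative model concrete, the following Python sketch draws a single network from it; the parameter values are illustrative placeholders rather than the settings used in our simulations.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(42)

# Illustrative placeholder parameters (not the values used in the paper).
N, p = 1000, 0.5            # number of nodes, P(X_i = 1)
mu, gamma = -2.5, 0.5       # baseline tie propensity, homophily bonus
sigma_a, sigma_b, rho = 0.5, 0.5, 0.3

# Binary homophilous factor X_i ~ Be(p).
X = rng.binomial(1, p, size=N)

# Per-node tie propensities (alpha_i, beta_i) from a bivariate normal.
cov = np.array([[sigma_a**2,              rho * sigma_a * sigma_b],
                [rho * sigma_a * sigma_b, sigma_b**2]])
alpha, beta = rng.multivariate_normal([0.0, 0.0], cov, size=N).T

# Edge probabilities: P(Z_ij = 1) = Phi(mu + alpha_i + beta_j + gamma * I(X_i = X_j)).
same_factor = (X[:, None] == X[None, :]).astype(float)
edge_prob = norm.cdf(mu + alpha[:, None] + beta[None, :] + gamma * same_factor)
np.fill_diagonal(edge_prob, 0.0)   # no self-ties

# Directed adjacency matrix.
Z = rng.binomial(1, edge_prob)
```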
Define $Z_{i\cdot} = \sum_j Z_{ij}$ and $Z_{\cdot j} = \sum_i Z_{ij}$, the out-degree and in-degree respectively, and let $\bar{Z} = \sum_i Z_{i\cdot}/n$ be the grand mean degree.

Once the network has been established, we generate a pair of potential outcomes for each individual corresponding to the treatment and control conditions. For the sake of exposition, we consider two parts to this effect: the average treatment effect, and a variability in effect due to the number of a person's connections, particularly the number of people whom they identify as friends (corresponding to out-degree). It is the average treatment effect that is typically of greatest interest. First, we generate the baseline (control) outcome according to the normal distribution,

$$ Y_i(c) \sim N(\theta_1 X_i,\, 1), $$

where we choose $\theta_1 > 0$ so that those units with the positive binary factor tend to have higher outcomes. We then generate the outcome under exposure to the treatment according to the normal distribution,

$$ Y_i(t) \sim N\!\left( Y_i(c) + \theta_2 (Z_{i\cdot} - \bar{Z}) + \tau,\, 1 \right), $$

where $\tau$ is the average treatment effect, and $\theta_2 > 0$ means that units with higher out-degree tend to be more positively affected by the treatment.

It remains to choose the parameters $(\rho, \sigma_\alpha, \sigma_\beta, \mu, \gamma, \theta_1, \theta_2, \tau)$ to test the mechanism. For the sake of these trials, we choose parameters that lead to social networks with reasonable properties: a network with $n = 10,000$ nodes and a mean degree of 10, with the majority of individual degrees between 3 and 30, is adequate for our purposes. In particular, we choose $\theta_1 > 0$ and $\theta_2 > 0$ for clarity of explanation, though neither of these signs must be constrained in real examples.

Once we identify the status of our nodes as being from the treatment or the control, we construct the test statistic in the usual fashion,

$$ \hat{\tau} = \frac{\sum_k Y_k W_k I(k \in T)}{\sum_k W_k I(k \in T)} - \frac{\sum_k Y_k W_k I(k \in C)}{\sum_k W_k I(k \in C)}, $$

where the weight $W_k$ is uniform in the standard case, and $1/Z_{k\cdot}$ in the Horvitz-Thompson case.

Figure 1 demonstrates the distribution of p-values under the null hypothesis of $\tau = 0$ for comparing the original manipulated node sets M and U, as well as the experimental unit sets T and C. While the distribution is uniform as expected in the first case, in a network with either latent homophily or a dependence on degree, the p-value distribution for the two-sided t-test is heavily skewed towards zero.
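Continuing the sketch above, the outcome model and the weighted difference-in-means estimator might look as follows; the boolean masks `in_T` and `in_C`, marking followers of M-only and U-only seeds, are assumed to have been constructed separately, and the effect parameters are again placeholders.

```python
# Out-degree Z_i. (row sums, per the paper's convention) and its grand mean.
Z_out = Z.sum(axis=1)
Z_bar = Z_out.mean()

# Illustrative placeholder effect parameters.
theta1, theta2, tau = 1.0, 0.1, 0.0

# Potential outcomes under control and under exposure to the treatment.
Y_c = rng.normal(theta1 * X, 1.0)
Y_t = rng.normal(Y_c + theta2 * (Z_out - Z_bar) + tau, 1.0)

def tau_hat(Y_obs, in_T, in_C, weights):
    """Weighted difference in mean outcomes between T and C."""
    mean_T = np.sum(weights[in_T] * Y_obs[in_T]) / np.sum(weights[in_T])
    mean_C = np.sum(weights[in_C] * Y_obs[in_C]) / np.sum(weights[in_C])
    return mean_T - mean_C

# Uniform weights versus Horvitz-Thompson weights (inverse out-degree),
# e.g. with Y_obs = np.where(in_T, Y_t, Y_c):
# w_uniform = np.ones(N)
# w_ht = 1.0 / np.maximum(Z_out, 1)   # guard against zero out-degree
```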
Figure 1: Distributions of p-values under standard t-tests for various groups of nodes from simulated networks. Each group represents nodes under hypothetical treatment and control, under the null hypothesis, when latent homophily is present. Left: the comparison of M and U, who were chosen uniformly at random from the population; the distribution is uniform as expected. Right: the comparison of T and C, who are followers of M and U respectively, and are autocorrelated due to latent homophily; the distribution of p-values is shifted in the extreme towards zero.

Figure 2 shows how the situation improves with each change to the process. First, consider permutation tests for the null hypothesis. The simplest method is to permute the labels for membership in T and C directly, which yields incorrect coverage. The next step is to add the Horvitz-Thompson correction, so that only latent homophily still plays a role, inflating the effective variance. Continuing further, we permute instead the labels on M and U, and reassign the labels of T and C accordingly. This move from full node permutation to cluster-based permutation, permuting groups so that all followers of any particular seed node in M or U stay together, restores the uniform distribution of p-values under the null.

Second, we explore the properties of bootstrap confidence intervals under the alternative hypothesis $\tau \neq 0$ under the same principles. As expected, Horvitz-Thompson estimation alone is insufficient to correct the distribution of p-values with respect to the true generative value of $\tau$. Simple block bootstrapping does not completely fix the problem; as in the Horvitz-Thompson estimator, the probability of sampling a cluster for the bootstrap must also be weighted by the inverse of the number of nodes to correct for oversampling, before the HT correction to the estimate is made once again.
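The following Python sketch illustrates the cluster-based permutation step, reusing `tau_hat` from the previous sketch. Here `followers_of_seed` (a hypothetical mapping from each seed node in M or U to the indices of its followers) and `seed_treated` (whether each seed belongs to M) are assumed to be built from the observed network; nodes exposed to both groups are excluded in each permutation, mirroring the design.

```python
def block_permutation_pvalue(Y_obs, weights, followers_of_seed, seed_treated,
                             n_perm=2000, rng=None):
    """Two-sided p-value for tau_hat, permuting treatment labels over the seed
    nodes in M and U so that all followers of a given seed stay together."""
    if rng is None:
        rng = np.random.default_rng()
    seeds = list(followers_of_seed)                       # seed node ids in M and U
    treated = np.array([seed_treated[s] for s in seeds])  # True for seeds in M

    def estimate(labels):
        in_T = np.zeros(len(Y_obs), dtype=bool)
        in_C = np.zeros(len(Y_obs), dtype=bool)
        for s, is_treated in zip(seeds, labels):
            idx = followers_of_seed[s]                    # follower indices of seed s
            (in_T if is_treated else in_C)[idx] = True
        both = in_T & in_C                                # exposed to both groups:
        in_T &= ~both                                     # excluded, as in the design
        in_C &= ~both
        return tau_hat(Y_obs, in_T, in_C, weights)

    observed = estimate(treated)
    null = np.array([estimate(rng.permutation(treated)) for _ in range(n_perm)])
    return float(np.mean(np.abs(null) >= abs(observed)))
```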
[Figure 2: Distributions of p-values and bootstrap estimates under successive corrections. Panels: "Node Perm Test, tau=0"; "Node Perm Test, tau=0, HT"; "Block Perm Test, tau=0, HT"; "Node Bootstrap, HT"; "Block Bootstrap, HT"; "Weighted Block Bootstrap, HT".]
2.2 Real-World Networks with Autocorrelation

Our simulations demonstrate that the design and analysis previously described can lead to false discoveries of contagion if there is sufficient latent homophily present, let alone a differential effect size due to degree. For the sorts of networks that we have access to, and wish to conduct experiments on, it is prudent to investigate whether there is significant homophily on observable characteristics in these real-world networks, and whether this is a practical concern.

We consider a subset of the Twitter network representing Singapore followers of Korean pop music (or K-pop). Elicited behaviors on Twitter can include the spread of information, from simple hashtags, to the sharing of news articles, to a full viral marketing campaign that leads to the purchasing of real-world products. A typical treatment applied to the target group is a message with a link to a website and an enrollment incentive, to see if it has any effect on the enrollment of their followers.

Our subset was derived from a complete, multi-stage snowball sample (Goodman, 1961) whose seed nodes originate in Singapore. We excluded private accounts for which we could not gather information, and expanded the set to include the neighbors of the remaining public users. Of these, we selected those users who follow one of 50 identified K-pop news sources. The final set contains 7,283 users.

To conduct our experiment, we choose L to be a uniform random sample of 500 nodes from the final set, randomly assigning 250 to M and the other 250 to U. We then identify our treatment and control groups from the entire population of Twitter users. We identify the set of users T, who are in neither M nor U, follow at least one user in M, but follow no users in U. Similarly, we identify the set C as those users in neither M nor U who follow at least one user in U, but follow no users in M.

At this point, to demonstrate the properties of this sampling mechanism when there is no treatment, we apply our phantom manipulation (such as Send No Message, or SNM) to users in M and do absolutely nothing (DAN) to users in U. We then measure the outcomes of interest for T and C and perform significance tests in the usual manner. In particular, we measure four response variables: time of last tweet, friend count, whether the user is in the Singapore time zone, and whether the user's language is English. For the first two we perform a standard t-test and for the last two we perform Fisher's exact test. We compare the responses for two pairs: between M and U and between T and C. In all, we perform a total of eight tests and collect eight p-values. We repeat the entire process (sample M and U, create T and C, and measure the outcomes) 1000 times. Since there is no actual difference between our mock treatment and control conditions, a proper significance test should assign p-values according to the Uniform(0, 1) distribution.
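A sketch of a single replication of this screening procedure is given below. The inputs are hypothetical: a pandas DataFrame `users`, indexed by user id, with one column per outcome for every candidate user, and a dictionary `follows` mapping each candidate to the set of accounts they follow; the M-versus-U comparisons would be run analogously.

```python
import numpy as np
import pandas as pd
from scipy.stats import ttest_ind, fisher_exact

def one_replication(users, follows, n_seeds=500, rng=None):
    """One phantom-manipulation replication: sample M and U, build T and C from
    follower relations, and test the four outcomes under no actual treatment."""
    if rng is None:
        rng = np.random.default_rng()
    seeds = rng.choice(users.index.to_numpy(), size=n_seeds, replace=False)
    M, U = set(seeds[: n_seeds // 2]), set(seeds[n_seeds // 2:])

    # T: users (outside M and U) following at least one member of M and none of U;
    # C: the reverse.
    T, C = [], []
    for u, followees in follows.items():
        if u in M or u in U:
            continue
        fm, fu = bool(followees & M), bool(followees & U)
        if fm and not fu:
            T.append(u)
        elif fu and not fm:
            C.append(u)
    t_grp, c_grp = users.loc[T], users.loc[C]

    def fisher_p(col):
        # 2x2 table of group membership against the binary outcome.
        a = int(t_grp[col].sum()); b = len(t_grp) - a
        c = int(c_grp[col].sum()); d = len(c_grp) - c
        _, p = fisher_exact([[a, b], [c, d]])
        return p

    return {
        "tweet_time":   ttest_ind(t_grp["tweet_time"], c_grp["tweet_time"]).pvalue,
        "friend_count": ttest_ind(t_grp["friend_count"], c_grp["friend_count"]).pvalue,
        "sg_timezone":  fisher_p("sg_timezone"),
        "english":      fisher_p("english"),
    }
```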
Figure 3 suggests that while the p-values for comparing M to U do indeed follow an approximately uniform distribution (since M and U are drawn uniformly at random from the population), the comparisons of T and C show serious deviations from the uniform distribution for each outcome. In particular, p-values less than 0.05 occur 10-14 times more often than would be expected. This demonstrates the practical implications for studies that treat such p-values as indicators of statistical significance.

It is true that the network autocorrelation for both tweet time and friend count is plausibly caused by contagious mechanisms; a tweet by a user might cause several followers to retweet, or to follow the same users. It is far less plausible that time zone and language are contagious in this way. Friends could conceivably move closer to each other, or learn a new language, but structural homophily seems a more reasonable explanation for the observed autocorrelation; that is, those with a common language and in a common location are more likely to become friends (or followers, in this case) than those with different languages or in different locations. In either case, the difference between the T and C groups clearly cannot be explained by the different effects of our treatment and control conditions. A real-world test we might perform would tend to falsely conclude that a certain manipulation of one user has an effect on that user's followers' behavior.

It would be to an experimenter's benefit to pre-screen the data in this manner, by first performing simulated no-treatment experiments as just described; if tests comparing the outcome of interest (e.g., whether a subscription was purchased in the previous month) between the two groups yield near-uniform p-values, then one can conduct the desired experiment with different treatments. This is less than ideal, since there is no guarantee that network behavior is static over time, and in any case it is better to proceed according to the methods we have described.

3 QUASI-EXPERIMENTAL APPLICATION

We can further assess the usefulness of the method by applying it in quasi-experimental situations. The propagation of information on Twitter, for example, may be treated as entirely endogenous, making a true experimental design quite difficult to achieve. However, we can still consider a quasi-experimental design to estimate network treatment effects.

Consider another extraction of the Singapore Twitter network; in this subset, we take those users with 100 or more total followers and construct a new sub-network. On August 9, 2012, the country celebrated Singapore's 47th National Day with a parade, and a unique hashtag for the event, #ndp2012, was used by many users before and during the day, peaking while the parade itself was in progress.
[Figure 3: Distributions of p-values for the comparisons M vs U and T vs C on each of the four outcomes: tweet time, friend count, Singapore time zone, and English language.]
Size of subnetwork                        4586
Total uses in Hour 1                        33
Total uses in Hours 2-4                    334
Fraction of uses (T)                     0.120
Fraction of uses (C)                     0.069
Standard t-test p-value (one-sided)       0.02
Block permutation p-value (one-sided)     0.11

Table 1: Properties of the propagation test of hashtag #ndp2012 on August 9, 2012 for the Singapore Twitter community. The unadjusted t-test p-value is less than 0.05; with the appropriate test, correcting for homophily and follower bias, the p-value rises dramatically.

The particular hashtag was publicly advertised in advance of the event, rather than arising spontaneously from one or more users of the microblogging service. For a quasi-experimental design, we consider the users who employed the hashtag during a one-hour period before the beginning of the parade as the initial manipulated group M; the treated user set T consists of those users' followers. We then select a control set U by matching on in-degree and out-degree; the followers of U then become the control group C. The outcome of interest is the use of the hashtag in each of the three hours that followed.

As shown in Table 1, the original t-test comparing the use of the hashtag in T and C gives a p-value of 0.02 for the fraction of uses of the hashtag in the treated group compared to the control, which is statistically significant at $\alpha = 0.05$; the effective use rate is roughly doubled, which suggests a reasonably strong effect size overall. However, latent homophily appears to play a strong role in this difference as well; under the block permutation test, the p-value rises to 0.11, removing the statistical significance of this result.
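One possible implementation of the matching step is sketched below: each seed (a user of the hashtag in the pre-parade hour) is paired with the nearest unused node in (in-degree, out-degree) space, without replacement. The nearest-neighbor rule and the input arrays are assumptions for illustration; the analysis above does not depend on this particular matching algorithm.

```python
import numpy as np

def match_controls(in_deg, out_deg, used_hashtag):
    """For each seed (hashtag user in the pre-parade hour), pick the unused node
    closest in (in-degree, out-degree), without replacement."""
    seeds = np.flatnonzero(used_hashtag)
    candidates = set(np.flatnonzero(~used_hashtag))
    controls = []
    for s in seeds:
        cand = np.array(sorted(candidates))
        # Squared distance in degree space to each remaining candidate.
        d2 = (in_deg[cand] - in_deg[s]) ** 2 + (out_deg[cand] - out_deg[s]) ** 2
        best = cand[int(np.argmin(d2))]
        controls.append(best)
        candidates.remove(best)
    return seeds, np.array(controls)
```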
4 DISCUSSION

Many attempts have been made to identify influence in social networks. The difficulty of distinguishing social influence from latent homophily in observational studies has been demonstrated, even for dynamic studies over long time periods. One prescription for identifying influence is the design of randomized experiments. We demonstrate in this paper, however, that even in the case of randomized experiments, researchers can still mistake homophily for influence if adequate care is not given to the method of randomization.

We have shown how one can correct for latent homophily and degree bias when analyzing the results from a particular experimental design. Given these complications, it is worth asking how we can design experiments with greater power. The experimental protocol we describe implies that we can only manipulate the status of selected nodes, then observe the outcomes on the rest of the network, beginning with their neighbors. The very notion that homophily exists, however, suggests that we can get the most leverage by using a mechanism that can block-randomize on members of the same cluster. Manipulating the exposure of each individual in a group to the treatment would seem to be the most effective means of achieving that. That is, select a sample of users for inclusion in M. Then, for each user in M, randomly select a subset of their followers and remove their ability to see the results of the manipulation.

We are not always in a position to manipulate what an individual sees, however, and are often limited to the incentive-style framework described in this paper. We then seek to ask, given a fixed number of manipulations, whether there is a better way to choose the set M. We may suspect, for example, that users who follow only a few other users are more likely to be influenced by a manipulation of a user they follow. The original sampling scheme will under-sample users with low out-degree. Even if corrected for in the analysis, the amount of under-sampling may lead to very low power. One easy correction for this is to do a weighted sampling of the M group, with higher selection probability given to users whose followers have low out-degree. There is an inevitable trade-off, however, between the selection bias corrected for in the groups T and C and the selection bias introduced for the groups M and U. Improving the design may depend heavily on modeling assumptions, prior assessment of parameter variability, and the specific domain.

Acknowledgements

This work was partly funded through the authors' involvement with the Living Analytics Research Center. Thanks to Xiao Hui Tai and Agus Kwee for preparing the Twitter networks for analysis.

References

Aral, S. and Walker, D. (2012). Identifying Influential and Susceptible Members of Social Networks. Science, 337.

Bapna, R. and Umyarov, A. (2012). Do Your Online Friends Make You Pay? A Randomized Field Experiment in an Online Music Social Network. NBER Working Paper Series.
Centola, D. (2010). The Spread of Behavior in an Online Social Network Experiment. Science, 329, 1194-1197.

Centola, D. (2011). An Experimental Study of Homophily in the Adoption of Health Behavior. Science, 334.

Gile, K. J. and Handcock, M. S. (2010). Respondent-Driven Sampling: An Assessment of Current Methodology. Sociological Methodology, 40.

Goodman, L. (1961). Snowball Sampling. Annals of Mathematical Statistics, 32, 148-170.

Heckathorn, D. D. (1997). Respondent-Driven Sampling: A New Approach to the Study of Hidden Populations. Social Problems, 44, 174-199.

Holland, P., Laskey, K. and Leinhardt, S. (1983). Stochastic Block Models: First Steps. Social Networks, 5, 109-137.

Holland, P. and Leinhardt, S. (1981). An Exponential Family of Probability Distributions for Directed Graphs. Journal of the American Statistical Association, 76, 33-65.

Horvitz, D. G. and Thompson, D. J. (1952). A Generalization of Sampling Without Replacement From a Finite Universe. Journal of the American Statistical Association, 47, 663-685.

Manski, C. F. (1993). Identification of Endogenous Social Effects: The Reflection Problem. Review of Economic Studies, 60, 531-542.

Shalizi, C. R. and Thomas, A. C. (2011). Homophily and Contagion Are Generically Confounded in Observational Social Network Studies. Sociological Methods and Research, 40, 211-239.