Homophily in Online Social Networks

Bassel Tarbush and Alexander Teytelboym
Department of Economics, University of Oxford

Abstract. We develop a parsimonious and tractable dynamic social network formation model in which agents interact in overlapping social groups. The model allows us to analyse network properties and homophily patterns simultaneously. We derive analytical expressions for the distributions of degree and, importantly, of homophily indices, using mean-field approximations. We test our model using a large dataset from Facebook covering student friendship networks in ten American colleges in 00. We find that our analytical expressions and simulations fit the homophily patterns, degree distribution, and individual clustering coefficients in the data well.

1 Introduction

Friendships are an essential part of economic life and social networks affect many areas of public policy. In many social network formation models in the economics literature, agents are anonymous and the network structure depends entirely on the formation process. Yet we can think of numerous examples, such as information transmission, peer-to-peer lending, or sexual contacts, which suggest that the network topology is explained not only by the network formation process, but also by node characteristics. We develop a dynamic network formation model that uses information on node characteristics to explain friendship patterns in online social networks, and we test it against the data on Facebook networks in American colleges.

In our model, agents spend time interacting with others across various social categories, such as attending lectures and spending time in their dorm. Naturally, the time allocation could be established institutionally by timetables or geographical proximity. The time allocation determines whom agents are likely to meet and with whom they document their resulting friendship on Facebook.
Our parsimonious model has only three parameters and is simple enough to allow us to derive analytic solutions for structural properties of the network. Conceptually, the model is related to the affiliation networks introduced by [1]. However, these models typically contain a large number of parameters and most, such as [2,3,4], rely entirely on simulations.

A particular focus of this paper is homophily, the tendency of individuals to associate with those similar to themselves, which has been well documented in sociology [5]. Currarini, Jackson, and Pin [6] make it clear that the observed racial homophily patterns in American high schools do not necessarily arise from an exogenous bias in preferences towards people of the same race. In our model, we do not assume that agents have any preference bias.

Rather, the entire process is governed by the allocation of time and by the relative size of the social groups in which agents interact. Homophily therefore emerges purely from the correlations in agents' likelihood of interaction in similar social groups.

The empirical part of this paper provides striking support for our model. Using the analytical expressions, we find the best-fitting parameter values, which determine the allocation of time across social categories, for ten separate Facebook networks. Students' friendships reveal that they spend more time socialising in class than in their dorms. Despite its parsimony, the model closely matches the empirical degree and homophily distributions in gender and year at the best-fitting parameter values. Remarkably, the simulations run at these values show that the individual clustering distributions also match the empirical clustering patterns.

2 Model

2.1 Characteristics of agents

Let $K = [K^0, \ldots, K^R]$ be a finite ordered list of social categories. An element $K^r$ is the $r$th category and $k \in K^r$ is a characteristic within that category. Let $\mathcal{R} = \{0, 1, \ldots, R\}$. Every agent $i \in N$ is represented by a vector $k_i = (k_i^0, \ldots, k_i^R)$ of characteristics, where $k_i^r \in K^r$ for each $r \in \mathcal{R}$. For any pair $i, j \in N$, let $k_i^0 = k_j^0$. For each $r \in \mathcal{R}$, define a social group

$$\gamma_i^r = \{ j \in N \mid k_i^r = k_j^r \} \setminus \{i\},$$

which is the set of all agents (other than $i$) that share the characteristic $k_i^r$ within the social category $r$ with $i$. Note that $\gamma_i^0 = N \setminus \{i\}$. Finally, for each non-empty subset of social category indices $S \subseteq \mathcal{R}$, define

$$\pi_i(S) = \bigcap_{r \in S} \gamma_i^r \setminus \bigcup_{r \in \mathcal{R} \setminus (S \cup \{0\})} \gamma_i^r, \tag{1}$$

which induces a partition $\Pi_i = \{\pi_i(S) \mid S \subseteq \mathcal{R}, S \neq \emptyset\}$ on $N \setminus \{i\}$. Therefore, $\pi_i(S)$ is the set of agents (other than $i$) that share with $i$ only the characteristics within the set of categories indexed by $S$.

Example. In a university context, we could have $K = [K^0, K^1, K^2, K^3, K^4]$ = [student, class, dorm, gender, year of graduation]. All agents are students ($k_i^0 = k_j^0$ for all $i, j \in N$).
$K^1 \in K$, which represents class, can include $k^1 \in \{\text{maths}, \text{literature}, \text{biology}\}$. Suppose that agent $i$ is represented by the vector $k_i = (\text{student}, \text{maths}, \text{campus}, \text{female}, 00)$. Let us consider $S = \{1, 3\}$. Then $\gamma_i^1$ is the set of all maths students other than $i$ and $\gamma_i^3$ is the set of all female students other than $i$. Therefore, $\pi_i(S)$ is the set of female maths students who do not live on campus and are of a different graduating year than $i$. Similarly, $\pi_i(\{0\})$ would be the set of all male non-mathematicians who do not live on campus and are of a different graduating year than $i$. $\Pi_i$ partitions the students other than $i$ into disjoint sets according to exactly which social categories they share with $i$.

The zeroth category, which greatly simplifies notation, is one in which all agents share the same characteristic; this does not restrict the characteristics space in any way. Note that $\pi_i(S) = \pi_i(S \cup \{0\})$ for all non-empty $S \subseteq \{0, 1, \ldots, R\}$. Furthermore, since $\gamma_i^r = \bigcup_{\pi \in \{\pi_i(S) \mid r \in S\}} \pi$, a social group is a union of disjoint partition elements.
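The definitions of social groups and partition elements can be made concrete with a minimal Python sketch. The five-agent population, the agent labels, and the helper names `social_group`, `partition_element`, and `partition` are all hypothetical and chosen only to mirror the university example; the year labels `y1` and `y2` are placeholders.

```python
from itertools import combinations

# Hypothetical toy population; every name and value is illustrative only.
# Category order: 0 = student, 1 = class, 2 = dorm, 3 = gender, 4 = year.
AGENTS = {
    "i": ("student", "maths", "campus", "female", "y1"),
    "a": ("student", "maths", "off", "female", "y2"),
    "b": ("student", "maths", "campus", "male", "y1"),
    "c": ("student", "biology", "off", "male", "y2"),
    "d": ("student", "biology", "off", "female", "y1"),
}
NUM_CATEGORIES = 5


def social_group(agents, i, r):
    """gamma_i^r: agents other than i who share i's characteristic in category r."""
    return {j for j, k in agents.items() if j != i and k[r] == agents[i][r]}


def partition_element(agents, i, S):
    """pi_i(S): agents who share with i the categories in S and no other
    category (apart from the zeroth, which everyone shares)."""
    S = set(S)
    shared = set.intersection(*(social_group(agents, i, r) for r in S))
    others = set(range(NUM_CATEGORIES)) - (S | {0})
    excluded = set().union(*(social_group(agents, i, r) for r in others))
    return shared - excluded


def partition(agents, i):
    """Pi_i as a dict mapping frozenset(S), with 0 in S, to each non-empty pi_i(S)."""
    result = {}
    rest = range(1, NUM_CATEGORIES)
    for size in range(len(rest) + 1):
        for combo in combinations(rest, size):
            S = frozenset(combo) | {0}
            element = partition_element(agents, i, S)
            if element:
                result[S] = element
    return result


for S, members in partition(AGENTS, "i").items():
    print(sorted(S), sorted(members))
```

Because $\pi_i(S) = \pi_i(S \cup \{0\})$, the sketch only enumerates index sets that contain 0, which avoids listing the same partition element twice.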

2.2 Network formation process

We model our network as a simple undirected graph with a finite set of nodes $N$ (which represent agents), a finite set of edges (which represent friendships), and no self-loops. The degree of an agent is the number of the agent's friends. At time period $t = 0$ all agents are active and have no friends. Let $q = (q^0, \ldots, q^R)$ with $\sum_{r=0}^{R} q^r = 1$. In each period $t \in \{1, 2, \ldots\}$, an active agent $i$ interacts with agents in the social group $\gamma_i^r$ with probability $q^r \geq 0$. We can thus interpret $q^r$ as the proportion of time in period $t$ that agent $i$ spends with agents in the social group $\gamma_i^r$ (one can think of $\gamma_i^0 = N \setminus \{i\}$ as the social group that $i$ interacts with during $i$'s "free time").

During the interaction in a social group, the agent is linked uniformly at random to another active agent in that group with whom the agent is not yet a friend. If the agent is already linked to every other active agent in that social group, the agent makes no friends in that period. Friendships are always reciprocal, so all links are undirected. Finally, in every period, an agent remains active until the following period with a given probability $p \in (0, 1)$ and becomes inactive with probability $1 - p$. If agent $i$ becomes inactive, $i$ retains all friendships, but can no longer form any links with other agents in any subsequent period.

There must be reasons, other than having linked with every user in the network, why people stop adding new friends online: losing interest, finding an alternative online social network, reaching a cognitive capacity for social interaction, and so on. Including all these explanations would require a much richer model, so we simply capture them as a random process with the inactivity probability $1 - p$.

We are interested in how the agents' degrees change over time. Let us call $d_i(t)$ the expected degree of agent $i$ in period $t$. We analyse a mean-field approximation to this dynamic system.
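The formation process described above can be simulated directly. The following is a minimal sketch, not the authors' simulation code: the function names, the dictionary-based data layout, and in particular the within-period ordering (agents act one after another, and inactivity is drawn at the end of the period) are assumptions made for illustration.

```python
import random


def build_groups(characteristics):
    """groups[r][i] is gamma_i^r: everyone except i sharing i's value in category r."""
    num_categories = len(next(iter(characteristics.values())))
    return [
        {
            i: {j for j, kj in characteristics.items() if j != i and kj[r] == ki[r]}
            for i, ki in characteristics.items()
        }
        for r in range(num_categories)
    ]


def simulate(characteristics, q, p, max_periods=1000, seed=0):
    """Run the link formation process and return each agent's set of friends.

    q[r] is the probability of interacting within category r, and p is the
    probability that an active agent stays active for another period.
    """
    rng = random.Random(seed)
    groups = build_groups(characteristics)
    friends = {i: set() for i in characteristics}
    active = set(characteristics)
    for _ in range(max_periods):
        if not active:
            break
        # Assumption: active agents take turns in a fixed order within a period.
        for i in sorted(active):
            r = rng.choices(range(len(q)), weights=q)[0]
            # Active members of the chosen group who are not yet friends of i.
            candidates = sorted((groups[r][i] & active) - friends[i])
            if candidates:
                j = rng.choice(candidates)
                friends[i].add(j)  # links are reciprocal
                friends[j].add(i)
        # Assumption: inactivity is drawn once per agent at the end of the period.
        active = {i for i in sorted(active) if rng.random() < p}
    return friends
```

A call such as `simulate(chars, q=[0.2, 0.5, 0.3], p=0.9)` would take `chars` as a dictionary mapping each agent to a tuple of characteristics whose first entry is common to all agents.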
This technique is commonly used in statistical mechanics in order to simplify many-body systems. Essentially, it assumes that the realisation of any random variable in any time period is its expected value. Hence, we approximate our model by a discrete-time system, which changes deterministically at a rate proportional to the expected change (see [7,8]).

The probability with which agent $i$ interacts with an agent from $\pi_i(S)$ is given by

$$q_{\pi_i(S)} = |\pi_i(S)| \left[ \sum_{r \in S \cup \{0\}} \frac{q^r}{|\gamma_i^r|} \right]. \tag{2}$$

Indeed, with probability $q^r$, an agent is assigned to social group $\gamma_i^r$, and the probability that the agent meets an agent in $\pi_i(S) \subseteq \gamma_i^r$ is given by $|\pi_i(S)| / |\gamma_i^r|$. Note that $\sum_{\pi \in \Pi_i} q_\pi = 1$.

For every $\pi \in \Pi_i$, let $R_\pi(t)$ be the number of remaining active agents in $\pi$ at $t$ (other than $i$) with whom $i$ is not yet linked. Furthermore, recall that an agent makes a link in every period and on average receives a link with probability $1 / R_\pi(t)$ from each of the $R_\pi(t)$ agents (in each $\pi$ weighted by $q_\pi$). Since $i$ interacts with agents in $\pi$ with probability $q_\pi$, $i$ makes $2 q_\pi$ links with agents in $\pi$ in every period until $T_\pi$, the expected number of periods for $i$ to form links with every agent in $\pi$. We find $T_\pi$ by solving

$$R_\pi(t + 1) = p \left[ R_\pi(t) - 2 q_\pi \right]. \tag{3}$$

This difference equation states that $R_\pi(t + 1)$ is the number of agents who remain active in $\pi$ out of $R_\pi(t)$ less the number of agents that $i$ links with in $\pi$ at $t$. Solving for $R_\pi(t)$ with initial condition $R_\pi(0) = |\pi|$ gives $R_\pi(t) = p^t \left( |\pi| + \frac{2 q_\pi p}{1 - p} \right) - \frac{2 q_\pi p}{1 - p}$, and setting $R_\pi(T_\pi) = 0$ gives us

$$T_\pi = \frac{\ln \left( \dfrac{2 q_\pi p}{2 q_\pi p + (1 - p) |\pi|} \right)}{\ln(p)} \quad \text{(except that } T_\pi = 0 \text{ if } q_\pi = 0\text{)}. \tag{4}$$

This allows us to obtain the expected degree of agent $i$ at time $t$,

$$d_i(t) = \sum_{\pi \in \Pi_i} d_i^\pi(t) = \sum_{\pi \in \Pi_i} 2 q_\pi \left[ t \, \mathbf{1}(t \leq T_\pi) + T_\pi \, \mathbf{1}(t > T_\pi) \right], \tag{5}$$

where $d_i^\pi(t)$ is the expected number of links $i$ has with agents in $\pi \in \Pi_i$ in period $t$. Note that $d_i(t)$ is concave, piecewise linear, and strictly increasing in the range $[0, \max_{\pi \in \Pi_i} \{T_\pi\}]$. Hence, active agents make friends at a decreasing rate over time.

Since an agent remains active for exactly $x$ periods with probability $p^x (1 - p)$, we have that $\Pr(t \leq x) = \sum_{t=0}^{x} p^t (1 - p) = 1 - p^{x+1}$. Therefore, the probability that node $i$ has degree at most $d$ is given by

$$G_i(d) \equiv \Pr(d_i(t) \leq d) = \Pr(t \leq t_i(d)) = 1 - p^{t_i(d) + 1},$$

where

$$t_i(d) \equiv d_i^{-1}(d) = \frac{d - \sum_{\pi \in \Pi_i} 2 q_\pi T_\pi \, \mathbf{1}(d > d_i(T_\pi))}{\sum_{\pi \in \Pi_i} 2 q_\pi \, \mathbf{1}(d \leq d_i(T_\pi))}. \tag{6}$$

Finally, the overall average degree distribution is $G(d) = \frac{1}{|N|} \sum_{i \in N} G_i(d)$.

2.3 Homophily

Homophily captures the tendency of agents to form links with those similar to themselves. Let $\Pi_i^r = \{\pi_i(S) \in \Pi_i \mid r \in S\}$ be the set of partition elements containing agents that share the characteristic $k_i^r$ in category $r$ with $i$. The individual homophily index in social category $r$ of agent $i$ in period $t$ is defined as

$$H_i^r(t) = \frac{\text{number of friends of } i \text{ at } t \text{ that share } k_i^r}{\text{number of friends of } i \text{ at } t} = \frac{\sum_{\pi \in \Pi_i^r} d_i^\pi(t)}{d_i(t)}. \tag{7}$$

This is a standard definition from which we can easily recover various other definitions of homophily given in []. Finally, it will be useful to define a composition function $h_i^r(d) \equiv (H_i^r \circ t_i)(d)$, which expresses individual homophily as a function of degree rather than as a function of time.

2.4 Test of the mean-field approximation

Since we used a mean-field method to derive the analytical expressions, we must test the accuracy of its approximations against simulations [].
We did this for the degree distribution and the individual homophily distribution against an average of 00 runs of the simulation for multiple parameter values. In general, the fits were good. An example is illustrated in Fig.. There is some loss of accuracy at extreme values of the cumulative distribution of the individual homophily index: equation (7), the definition of the individual homophily index, makes it clear that the index is unlikely to be near 0 or 1. Yet the mean-field approximation of the average is good.
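The mean-field expressions of the model section can be evaluated numerically with a short Python sketch. It assumes the reading in which an agent gains $2 q_\pi$ links per period in each partition element $\pi$ (one made, one received on average); the function names and the list-based layout (one entry per partition element) are hypothetical.

```python
import math


def interaction_probability(pi_size, S, q, gamma_sizes):
    """q_pi for pi = pi_i(S): |pi| times the sum over r in S and r = 0 of q^r / |gamma_i^r|."""
    return pi_size * sum(q[r] / gamma_sizes[r] for r in set(S) | {0} if gamma_sizes[r] > 0)


def saturation_time(q_pi, pi_size, p):
    """T_pi: periods until i has linked with everyone in pi (zero when q_pi = 0)."""
    if q_pi == 0:
        return 0.0
    c = 2 * q_pi * p
    return math.log(c / (c + (1 - p) * pi_size)) / math.log(p)


def expected_degree(t, q_pis, T):
    """d_i(t); note that t*1(t <= T_pi) + T_pi*1(t > T_pi) equals min(t, T_pi)."""
    return sum(2 * qp * min(t, Tp) for qp, Tp in zip(q_pis, T))


def time_to_degree(d, q_pis, T):
    """t_i(d), the inverse of d_i; infinite if d exceeds the maximal expected degree."""
    saturated = [d > expected_degree(Tp, q_pis, T) for Tp in T]
    numerator = d - sum(2 * qp * Tp for qp, Tp, s in zip(q_pis, T, saturated) if s)
    denominator = sum(2 * qp for qp, s in zip(q_pis, saturated) if not s)
    return numerator / denominator if denominator > 0 else math.inf


def degree_cdf(d, q_pis, T, p):
    """G_i(d) = 1 - p^(t_i(d) + 1)."""
    t = time_to_degree(d, q_pis, T)
    return 1.0 if math.isinf(t) else 1 - p ** (t + 1)


def homophily(t, q_pis, T, shares_category):
    """H_i^r(t): share of expected links (t > 0) going to elements that share category r."""
    same = sum(2 * qp * min(t, Tp) for qp, Tp, s in zip(q_pis, T, shares_category) if s)
    return same / expected_degree(t, q_pis, T)


# Illustrative values only: two partition elements of sizes 3 and 1.
p, q_pis, sizes = 0.5, [0.5, 0.5], [3, 1]
T = [saturation_time(qp, n, p) for qp, n in zip(q_pis, sizes)]
print(T, expected_degree(1.5, q_pis, T), time_to_degree(2.5, q_pis, T))
```

Averaging `degree_cdf` over all agents would give the overall degree distribution $G(d)$, and composing `homophily` with `time_to_degree` gives homophily as a function of degree.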

[Figure: Results for all colleges (Harvard, Columbia, Stanford, Yale, Cornell, Dartmouth, UPenn, MIT, NYU, BU). Panels: best-fit time allocation ($q^0$, $q^1$, $q^2$); average of the cumulative degree distribution; average individual clustering coefficient; average individual homophily coefficient (gender); average individual homophily coefficient (year). Legend: empirical average with Chebyshev confidence intervals; analytic result at best fit; simulation result at best fit.]

[Figure: Detailed results for Harvard University. Panels: degree distribution (log-log plot of the frequency distribution, $\ln f(x)$ against $\ln x$); cumulative distribution $F(x)$ of individual clustering coefficients; cumulative distribution $F(x)$ of the individual homophily index (gender); cumulative distribution $F(x)$ of the individual homophily index (year). Black: empirical; red: analytical; blue: simulation.]

3 Data

We use the September 00 cross-section of the complete structures of social connections on Facebook within (but not across) the first ten American colleges that joined Facebook (see [10]). We observe six social categories for each user: gender, year of graduation, major, minor, dorm, and high school. Since all personal data were provided voluntarily, some users did not submit all their information. We dropped any user (and their links) who had not provided all the personal characteristics other than high school. We therefore look only at students graduating between 00 and 00 who have supplied all the relevant personal characteristics (except high school).

4 Empirical strategy

We test our model against the data using the social categories identified in the Example. Using the available information in our dataset, we define agents $i$ and $j$ to be in the same class if they are in the same year and major or in the same year and minor. We assume that every agent $i$ interacts in $i$'s class and dorm with respective probabilities $q^1$ and $q^2$. The probabilities of interacting within the gender and year social categories are set to zero ($q^3 = q^4 = 0$), since it is unreasonable to suppose that agents allocate time specifically to interacting with agents in these categories. Meeting agents of the same gender or year happens only through the interactions in the other social groups. Finally, $q^0 = 1 - q^1 - q^2$ is the proportion of time spent interacting with all other agents (their "free time"). Hence, the model has four parameters ($q^0$, $q^1$, $q^2$, and $p$) and three degrees of freedom.

We focus on explaining empirical homophily patterns in gender and year of graduation. Measuring homophily in these social categories is appropriate because gender and year of graduation are entirely immutable agent categories: unlike class and dorm, there is no feedback loop between social category membership and homophily.
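The data cleaning step and the "same class" rule can be illustrated with a small Python sketch. The field names, the treatment of `None` or an empty string as missing, and the dictionary layout of the user records are assumptions; the actual dataset format is not described here.

```python
# Hypothetical field names; high school is deliberately not required.
REQUIRED_FIELDS = ("gender", "year", "major", "minor", "dorm")


def is_complete(user):
    """True if the user supplied every required characteristic."""
    return all(user.get(field) not in (None, "") for field in REQUIRED_FIELDS)


def filter_network(users, edges):
    """Drop incomplete users together with all of their links."""
    kept = {uid: u for uid, u in users.items() if is_complete(u)}
    kept_edges = [(a, b) for a, b in edges if a in kept and b in kept]
    return kept, kept_edges


def same_class(a, b):
    """Same year and major, or same year and minor."""
    return a["year"] == b["year"] and (a["major"] == b["major"] or a["minor"] == b["minor"])


# Illustrative records only.
users = {
    1: {"gender": "f", "year": "y1", "major": "maths", "minor": "physics", "dorm": "A"},
    2: {"gender": "m", "year": "y1", "major": "biology", "minor": "physics", "dorm": "B"},
    3: {"gender": "m", "year": "y2", "major": "maths", "minor": "physics", "dorm": "A"},
    4: {"gender": "f", "year": "y1", "major": "maths", "minor": "", "dorm": "A"},
}
kept, kept_edges = filter_network(users, [(1, 2), (1, 4), (2, 3)])
print(sorted(kept), kept_edges, same_class(users[1], users[2]))
```

Under this rule the class relation need not be transitive (two students can each share a class with a third without sharing one with each other), so in this sketch an agent's class group is best computed per agent.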
4.1 Fitting the model to data

In order to fit the model to the data (degree distribution and homophily), we used a grid search on the parameters $q^0$, $q^1$, $q^2$, and $p$. (For $q^0$, $q^1$, and $q^2$ we took values from 0 to 1 in steps of 0.0…; for $p$, we took values from 0.… to 0.… in steps of ….) For the degree distribution, we computed the analytical degree distribution, and, for homophily, we found the analytical homophily index in gender and year as a function of $i$'s empirical degree at each point in the grid. We then found the values of $q^0$, $q^1$, $q^2$, and $p$ that minimise an intuitive loss function, which measures the overall error of the fit by taking the product of the normalised sums of squared distances between the analytical and the empirical distributions for degree and for homophily in gender and year at each point in the grid.

4.2 Results

For each college, we ran 00 simulations at its best-fitting values of $q^0$, $q^1$, $q^2$, and $p$; the results shown are averages over the 00 runs. The figure "Results for all colleges" presents the results for every college.
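The grid search of Section 4.1 can be sketched in Python as follows. This is an illustration under stated assumptions rather than the authors' procedure: `predict` is a hypothetical callable that returns the analytical degree and homophily curves for a parameter tuple, the normalisation divides each sum of squares by the number of points, and the grid resolution and the candidate values of `p` are left to the caller.

```python
def normalised_sse(model, data):
    """Sum of squared distances divided by the number of points.
    The exact normalisation is an assumption of this sketch."""
    return sum((m - e) ** 2 for m, e in zip(model, data)) / len(data)


def loss(predict, params, empirical):
    """Product of the normalised errors for degree and for homophily in gender and year."""
    model = predict(params)
    value = 1.0
    for key in ("degree", "homophily_gender", "homophily_year"):
        value *= normalised_sse(model[key], empirical[key])
    return value


def simplex_grid(step_count):
    """Yield every (q0, q1, q2) on a regular grid with q0 + q1 + q2 = 1.
    Integer counters avoid floating-point drift in the constraint."""
    for a in range(step_count + 1):
        for b in range(step_count + 1 - a):
            yield (step_count - a - b) / step_count, a / step_count, b / step_count


def grid_search(predict, empirical, step_count, p_values):
    """Return the (q0, q1, q2, p) tuple with the smallest loss."""
    candidates = (
        (q0, q1, q2, p)
        for q0, q1, q2 in simplex_grid(step_count)
        for p in p_values
    )
    return min(candidates, key=lambda params: loss(predict, params, empirical))
```

One property of a multiplicative loss is worth keeping in mind: it is zero whenever any single factor is zero, so a perfect fit on one distribution would dominate the search regardless of the others.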

The model closely matches average degree, average homophily, and the average individual clustering coefficient (see [9] for a standard definition). Unsurprisingly, students spend most of their time interacting with others in their class. Interestingly, $q^0$ is small, which suggests that friendship patterns are far from random. The figure "Detailed results for Harvard University" shows the empirical, analytical, and simulated degree, homophily (in gender and year), and individual clustering distributions for Harvard University. These fits are representative of the other colleges.

5 Conclusions

We presented a network formation model that provides rich microfoundations for the macroscopic properties of online social networks. The friendship and homophily patterns generated by the model find good support in the data. We were also able to estimate how much time agents spend in particular social groups. There is still scope for further theoretical work, including finding accurate analytical approximations to the clustering measures and diameter.

Acknowledgments. We would like to thank Edo Gallo, Manuel Mueller-Frank, and John Quah for valuable discussions and three anonymous referees for their excellent suggestions. Bernie Hogan introduced us to digital social science research.

References

1. Breiger, R.L.: The duality of persons and groups. Social Forces.
2. Leskovec, J., Kleinberg, J., Faloutsos, C.: Graphs over time: Densification laws, shrinking diameters and possible explanations. In: KDD, Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining.
3. Leskovec, J., Lang, K.J., Dasgupta, A., Mahoney, M.W.: Statistical properties of community structure in large social and information networks. In: WWW, Proceedings of the international conference on World Wide Web.
4. Foudalis, I., Jain, K., Papadimitriou, C.H., Sideri, M.: Modeling social networks through user background and behavior. In: International Workshop on Algorithms and Models for the Web Graph (WAW).
5. McPherson, M., Smith-Lovin, L., Cook, J.M.: Birds of a feather: Homophily in social networks. Annual Review of Sociology.
6. Currarini, S., Jackson, M.O., Pin, P.: An Economic Model of Friendship: Homophily, Minorities, and Segregation. Econometrica.
7. Barabási, A.L., Albert, R., Jeong, H.: Mean-field theory for scale-free random networks. Physica A.
8. Jackson, M.O., Rogers, B.W.: Meeting strangers and friends of friends: How random are social networks? American Economic Review.
9. Jackson, M.O.: Social and Economic Networks. Princeton University Press.
10. Traud, A.L., Mucha, P.J., Porter, M.A.: Social structure of Facebook networks. Physica A.

Note on the clustering fit. Clustering appears to fit relatively well even though it did not appear in our loss function.

Note on confidence intervals. In order to avoid making any assumptions about the distributions, we estimated standard errors around the empirical averages non-parametrically. The figure "Results for all colleges" therefore shows Chebyshev confidence intervals at the …% and …% levels.
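A distribution-free interval of the kind mentioned in the note on confidence intervals can be sketched with Chebyshev's inequality, which bounds the probability that a sample mean deviates from the true mean by more than $k$ standard errors by $1/k^2$. The sketch below is an assumption about how such intervals could be built, not a description of the computation behind the figure; `chebyshev_interval` and `alpha` are hypothetical names, and the sample standard deviation is used in place of the unknown true one.

```python
import math
import statistics


def chebyshev_interval(sample, alpha):
    """Interval around the sample mean covering the true mean with
    probability at least 1 - alpha, by Chebyshev's inequality.

    Setting 1 / k^2 = alpha gives k = 1 / sqrt(alpha) standard errors.
    Assumption: the sample standard deviation stands in for the true one.
    """
    n = len(sample)
    mean = statistics.fmean(sample)
    standard_error = statistics.stdev(sample) / math.sqrt(n)
    half_width = standard_error / math.sqrt(alpha)
    return mean - half_width, mean + half_width


# Illustrative values only.
low, high = chebyshev_interval([1.0, 2.0, 3.0, 4.0, 5.0], alpha=0.05)
print(low, high)
```

Chebyshev intervals are wider than intervals based on a normal approximation at the same level, which is the price of making no distributional assumption.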