Social Influence and Its Models

Transcription

Influence and Correlation in Social Networks

Aris Anagnostopoulos, Ravi Kumar, Mohammad Mahdian
Yahoo! Research, 701 First Ave., Sunnyvale, CA
{aris,ravikumar,mahdian}@yahoo-inc.com

ABSTRACT

In many online social systems, social ties between users play an important role in dictating their behavior. One of the ways this can happen is through social influence, the phenomenon that the actions of a user can induce his/her friends to behave in a similar way. In systems where social influence exists, ideas, modes of behavior, or new technologies can diffuse through the network like an epidemic. Therefore, identifying and understanding social influence is of tremendous interest from both analysis and design points of view. This is a difficult task in general, since there are factors such as homophily or unobserved confounding variables that can induce statistical correlation between the actions of friends in a social network. Distinguishing influence from these is essentially the problem of distinguishing correlation from causality, a notoriously hard statistical problem. In this paper we study this problem systematically. We define fairly general models that replicate the aforementioned sources of social correlation. We then propose two simple tests that can identify influence as a source of social correlation when the time series of user actions is available. We give a theoretical justification of one of the tests by proving that with high probability it succeeds in ruling out influence in a rather general model of social correlation. We also simulate our tests on a number of examples designed by randomly generating actions of nodes on a real social network (from Flickr) according to one of several models. Simulation results confirm that our test performs well on these data. Finally, we apply them to real tagging data on Flickr, exhibiting that while there is significant social correlation in tagging behavior on this system, this correlation cannot be attributed to social influence.
Categories and Subject Descriptors: J.4 [Computer Applications]: Social and Behavioral Sciences - Sociology

General Terms: Economics, Human Factors

Keywords: Social influence, Social networks, Correlation, Tagging

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. KDD'08, August 24-27, 2008, Las Vegas, Nevada, USA. Copyright 2008 ACM /08/08...$

1. INTRODUCTION

Online social networks are playing an ever-important role in shaping the behavior of users on the web. Popular social sites such as Facebook, MySpace, Flickr, and del.icio.us are enjoying increasing traffic and are turning into community spaces, where users interact with their friends and acquaintances. The availability of such rich data at never-before-seen scales makes it possible to analyze user actions at an individual level in order to understand user behavior at large. In particular, questions about interpreting a user's action in the context of his/her online friends, and about correlating the actions of socially connected users, become highly interesting.

There has been some theoretical and empirical work on how a user's actions can be correlated to his/her social affiliations. Backstrom et al. [1] examined the membership problem in an online community. They observed correlation between the action of a user joining an online community and the number of friends who are already members of that community. Marlow et al. [5] considered the tag usage problem in Flickr and studied the set of tags placed by a user and those placed by the friends of the user. They exhibited a correlation between social connectivity and tag vocabulary. While these studies have established the existence of correlation between user actions and social affiliations, they do not address the source of the correlation.
Causes of correlation in social networks can be categorized into roughly three types. The first is influence (also known as induction), where the action of a user is triggered by one of his/her friends' recent actions. An example of this scenario is when a user buys a product because one of his/her friends has recently bought the same product. The second is homophily, which means that individuals often befriend others who are similar to them, and hence perform similar actions. For example, two individuals who own Xboxes are more likely to become friends due to the common interest. The third is environment (also known as confounding factors or external influence), where external factors are correlated both with the event that two individuals become friends and also with their actions. For example, two friends are likely to live in the same city, and therefore to post pictures of the same landmarks in an online photo sharing system.

From a practical point of view, identifying situations where social influence is the source of correlation is important. In the presence of social influence, an idea, norm of behavior, or a product diffuses through the social network like an epidemic. A marketing firm, for example, can use this information to design viral marketing campaigns or give out coupons to influential nodes in the network, or a system

designer can take advantage of this information in order to induce the users to follow a desired mode of behavior. There has already been significant research on methods for designing strategies to leverage social influence in such systems [3] and on the effect of influence on the growth pattern of new products [8]. The main idea in all viral marketing strategies is essentially that in cases where influence between users is prevalent, careful targeting can have a cascading effect on the adoption of a product/technology. Therefore, being able to identify in which cases influence prevails is an important step to strategy design.

Our contributions. Given the significance of social influence, it is important to be able to test if a given social system exhibits signs of social influence. This is a particularly difficult problem in online settings where individuals are often anonymous and therefore it is impossible to control for all potential confounding factors. We overcome this problem by taking advantage of the availability of data about the timing of actions in online settings. We propose a statistical test (called the shuffle test) based on the intuition that if influence is not a likely source of correlation in a system, timing of actions should not matter, and therefore reshuffling the time stamps of the actions should not significantly change the amount of correlation. We prove that in a rather general model of homophily and confounding, this test succeeds in ruling out influence as the source of social correlation. We also show the effectiveness of our test using simulations. Our test cases are based on a large social network from Flickr. We generate the action data randomly from a model with or without social influence, and run our test on this data set to decide whether the correlation is caused by influence. Our results show that in nearly all cases our algorithm succeeds in identifying the source of correlation. We also present results for another test (called the edge-reversal test) inspired by a recent study on the spread of obesity in real-world social networks [2].
Finally, we apply our algorithms on real tagging data in Flickr. Our results show that even though tagging behavior in this system exhibits a considerable degree of social correlation, this cannot be attributed to social influence.

Organization. In Section 2 we detail the different forms of social correlation. In Section 3 we describe our methodology, and present a theoretical analysis in a model of homophily and confounding. We describe our data generation models and present the results of simulations in Section 4. We describe our experiments on Flickr tags in Section 5.

2. MODELS OF SOCIAL CORRELATION

We study a setting where a group of individuals (also called agents or users) are nodes of a social network G. In general, G is a directed graph and is generated from an unknown probability distribution. We are concerned with individuals performing a certain action for the first time, e.g., purchasing a product, visiting a web page, or tagging a photo with a particular tag (footnote 1). After an agent performs the action, we say that the agent has become active. We observe the system for a certain time period [0, T]. Let W denote the set of agents that are active at the end of this time period.

Footnote 1: In many cases, e.g., purchasing certain products or using certain tags, an individual might perform the action multiple times. We focus on the first time the action is performed by each individual, since subsequent occurrences of the same action by the same individual are often more dependent on the first occurrence than on the social network.

Social correlation, i.e., correlation between the behavior of affiliated agents in a social network, is a well-known phenomenon. Formally, this means that for two nodes u and v that are adjacent in G, the event that u becomes active is correlated with the event that v becomes active. There are three primary explanations for this phenomenon: homophily, the environment (or confounding factors), and social influence.

Homophily. Homophily is the tendency of individuals to choose friends with similar characteristics [4, 6].
This is a pervasive phenomenon, and not surprisingly, leads to correlation between the actions of adjacent nodes in a social network. For example, one plausible hypothesis for why there is social correlation in membership in an online community is that individuals might know each other and become friends after joining the community. Mathematically, in a pure homophily model, the set W of active nodes is first selected according to some distribution, and then the graph G is picked from a distribution that depends on W.

Confounding. The second explanation for correlation between actions of adjacent agents in a social network is external influence from elements in the environment (also referred to as confounding factors), which are more likely to affect individuals that are located close to each other in the social network. Mathematically, this means that there is a confounding variable X, and both the network G and the set of active individuals W come from distributions correlated with X. For example, two individuals who live in the same city are more likely to become friends than two random individuals, and they are also more likely to take pictures of similar scenery and post them on Flickr with the same tag.

Note that there is a fine distinction between this explanation and homophily: homophily refers to situations where the set W affects individuals' choices to become friends, while in confounding, both the choices of individuals to become friends and their choice to become active are affected by the same unobserved variable. It is possible to distinguish between these models by looking at the time when the edges of G are established. The focus of this paper, however, is on distinguishing social influence from other types of social correlation. Therefore, we study a common generalization of the confounding and the homophily model as follows: first, the pair (G, W) is selected according to a joint probability distribution, and then the time of activation for individuals in W is picked i.i.d. according to a distribution $\mathcal{T}$ on [0, T]. We call this model the correlation model.
The main assumption here is that the probability that an individual is active can be affected by whether their friends become active, but not by when they become active. This is in contrast with the influence model, as defined below.

Influence. The third, and perhaps the most consequential, explanation for social correlation is social influence. This refers to the phenomenon that the action of individuals can induce their friends to act in a similar way. This can be through setting an example for their friends (as in the case of fashion), informing them about the action (as in the case of viral marketing), or increasing the value of an action for them (as in the case of adoption of a technology). Mathematically, this can be modeled as follows: first, the graph G is drawn according to some distribution. Then, in each of the time steps 1, ..., T, each non-active agent decides whether

to become active. The probability of becoming active for each agent u is a function p(x) of the number x of other agents v that have an edge to u and are already active (footnote 2). Here, p(·) can be any increasing function, although later in the paper we consider a special class of functions that provides a good fit with the real data and also corresponds to a commonly used statistical model for estimating the probability of binary events, namely logistic regression.

3. METHODOLOGY

In this section we present the methodology that we use to measure social correlation and test whether influence is a source of such correlation. We start in Section 3.1 by explaining how logistic regression can be used to quantify the extent of social correlation. In Section 3.2 we define the shuffle test for deciding if influence is a likely source of correlation, and prove that this test successfully rules out influence as the source of correlation in the correlation (confounding/homophily) model defined in Section 2. Finally, in Section 3.3 we define another test called the edge-reversal test, which we evaluate experimentally.

3.1 Measuring social correlation

The first step in our analysis is to obtain a measure of social correlation between the actions of an individual and that of her friends in the network. This measure is designed to recover the activation probability, assuming that the agents follow the influence model defined in Section 2. Recall that in the influence model, each individual flips an independent coin in every time step to decide whether or not to become active. In principle, the probability of this coin can vary from agent to agent and from time to time; in the simplest model, which is the focus of most of this paper, we measure this probability as a function of only one variable: the number of already-active friends the agent has (footnote 3). Note that the parameter we use is the number of friends that have become active at any earlier time step, as opposed to friends who have become active immediately before. This is because in online systems like Flickr actions are stored, and might be observed by others much later.
As it turns out, for most tags in the Flickr data set, a logistic function with the logarithm of the number of friends as the explanatory variable provides a good fit for the probability. Therefore, for simplicity and to reduce the possibility of overfitting, we use the logistic function with this variable, that is, we estimate the probability p(a) of activation for an agent with a already-active friends as follows (footnote 4):

$$p(a) = \frac{e^{\alpha \ln(a+1) + \beta}}{1 + e^{\alpha \ln(a+1) + \beta}}, \qquad (1)$$

where α and β are coefficients. Equivalently,

$$\ln\left(\frac{p(a)}{1 - p(a)}\right) = \alpha \ln(a+1) + \beta. \qquad (2)$$

The coefficient α measures social correlation: a large value of α indicates a large degree of correlation.

Footnote 2: This model assumes that time progresses in discrete steps. A similar model with continuous time can be defined using the Poisson distribution.

Footnote 3: We also considered using the fraction of the total population that is active as another explanatory variable in our estimation on the Flickr data set, but the results indicated that this parameter is of no value: the corresponding coefficient is insignificant for almost all tags.

Footnote 4: We have also duplicated some of our experiments using a as the explanatory variable. The results are not qualitatively different, and almost always the likelihood of the fit is better with the logarithmic variable.

We estimate α and β using maximum likelihood logistic regression. More precisely, let $Y_{a,t}$ be the number of users who at the beginning of time t had a active friends and started using the tag at time t. Similarly, let $N_{a,t}$ be the number of users who at time t were inactive, had a active friends, but did not start using the tag (at time t). Finally, let $Y_a = \sum_t Y_{a,t}$ and $N_a = \sum_t N_{a,t}$. Then we compute the values of α and β that maximize the expression

$$\prod_a p(a)^{Y_a} \, (1 - p(a))^{N_a}, \qquad (3)$$

where p(a) is defined in (1). Typically, the values of $Y_a$ and $N_a$ decrease quickly and lose their statistical significance as a grows. Therefore, for practical reasons, we may restrict the likelihood expression (3) to $a \le R$, for a carefully chosen value of R, while we accumulate all the values corresponding to $a > R$ into $Y_{R+1}$ and $N_{R+1}$.
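The counting and maximization steps described above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the names `count_y_n` and `fit_logistic`, the dictionary `in_neighbors` (mapping each user to the contacts whose actions the user can see), discrete time steps 1..T, and the use of plain Newton iterations on the grouped likelihood (3) are all choices made for this example.

```python
import math
from collections import defaultdict


def count_y_n(in_neighbors, act_time, T, R):
    """Return (Y, N) with Y[a], N[a] as in Section 3.1; values a > R go to key R + 1.

    in_neighbors[u]: contacts of u (hypothetical representation).
    act_time[u]: first activation step in 1..T; users that never act are absent.
    """
    Y, N = defaultdict(int), defaultdict(int)
    for u, friends in in_neighbors.items():
        tu = act_time.get(u)
        last = tu if tu is not None else T
        friend_times = sorted(act_time[v] for v in friends if v in act_time)
        k = 0  # friends that became active strictly before step t
        for t in range(1, last + 1):
            while k < len(friend_times) and friend_times[k] < t:
                k += 1
            a = min(k, R + 1)
            if t == tu:
                Y[a] += 1  # u starts using the tag at step t
            else:
                N[a] += 1  # u is inactive at step t and stays inactive
    return dict(Y), dict(N)


def fit_logistic(Y, N, iters=50):
    """Maximize expression (3) over (alpha, beta) with Newton's method."""
    alpha, beta = 0.0, 0.0
    for _ in range(iters):
        g_a = g_b = h_aa = h_ab = h_bb = 0.0
        for a in set(Y) | set(N):
            y, n = Y.get(a, 0), N.get(a, 0)
            x = math.log(a + 1)
            z = max(-700.0, min(700.0, alpha * x + beta))  # avoid overflow
            p = 1.0 / (1.0 + math.exp(-z))
            r = y - (y + n) * p        # gradient term of the log-likelihood
            w = (y + n) * p * (1 - p)  # curvature term
            g_a += r * x
            g_b += r
            h_aa += w * x * x
            h_ab += w * x
            h_bb += w
        det = h_aa * h_bb - h_ab * h_ab
        if abs(det) < 1e-12:
            break  # degenerate data: alpha and beta are not identifiable
        da = (h_bb * g_a - h_ab * g_b) / det
        db = (h_aa * g_b - h_ab * g_a) / det
        alpha, beta = alpha + da, beta + db
        if abs(da) + abs(db) < 1e-10:
            break
    return alpha, beta
```

Calling `fit_logistic(*count_y_n(in_neighbors, act_time, T, R))` would give the estimate of α for one tag. The counting loop costs on the order of T steps per user, which is fine for a sketch but would need an event-based rewrite for a network of Flickr's size.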
While in general there is no closed-form solution, there are many software packages that can solve such a problem quite efficiently; we used Matlab's statistics toolbox in our experiments.

3.2 The shuffle test

In this section we introduce the shuffle test for identifying social influence. It is based on the idea that if influence does not play a role, even though an agent's probability of activation could depend on her friends, the timing of such activation should be independent of the timing of other agents.

Let G be the social network, and $W = \{w_1, \dots, w_l\}$ be the set of users that are activated during the period [0, T]. Recall that in the correlation model, (G, W) is drawn from an arbitrary joint distribution. Assume that user $w_i$ is first activated at time $t_i$. Using the method in Section 3.1, we compute $Y_a$ and $N_a$, for $a \le R$, where R is a constant, and use the maximum likelihood method to estimate α. Next, we create a second problem instance with the same graph G and the same set W of active nodes, by picking a random permutation π of $\{1, \dots, l\}$, and setting the time of activation of node $w_i$ to $t'_i := t_{\pi(i)}$. Again we use the method in Section 3.1 to compute $Y'_a$ and $N'_a$ for $a \le R$, and the social correlation coefficient α'. The shuffle test declares that the model exhibits no social influence if the values of α and α' are close to each other.

Intuitively, the reason that the shuffle test correctly rules out social influence in instances generated according to the correlation model is the following: in an instance generated from this model, the time stamps $t_i$ are independent, identically distributed (i.i.d.) from a distribution $\mathcal{T}$ over [0, T]. The second instance constructed above only permutes all time stamps, and hence the new $t'_i$ are still i.i.d. from the same distribution $\mathcal{T}$. Therefore, the two instances come from the exact same distribution, and hence they should lead to the same expected social correlation coefficient α.
The only thing that remains to be proven is that this coefficient is concentrated around its expectation (where the expectation is taken over the random choice of the time stamps, conditioning on a fixed choice of G and W). In the next section, we formalize this intuition, leading to Theorem 1.
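The shuffle test itself is short once an estimator for α is available. The sketch below is an illustration under assumptions, not the authors' code: `estimate_alpha` is a hypothetical callable standing in for the regression of Section 3.1, `in_neighbors` and `act_time` are hypothetical dictionary representations of G and of the activation times, and averaging over several permutations (the `trials` parameter) is a choice made here, whereas the text describes a single random permutation. How close the two values must be is left to the caller, since the text does not fix a threshold.

```python
import random


def shuffle_times(act_time, rng):
    """Permute the activation times among the active nodes: t'_i = t_pi(i).

    The set W of active nodes and the multiset of time stamps are unchanged.
    """
    nodes = list(act_time)
    times = [act_time[u] for u in nodes]
    rng.shuffle(times)
    return dict(zip(nodes, times))


def shuffle_test(in_neighbors, act_time, estimate_alpha, trials=20, seed=0):
    """Return (alpha, mean alpha over shuffled instances).

    estimate_alpha(in_neighbors, act_time) -> float is supplied by the caller.
    """
    rng = random.Random(seed)
    alpha = estimate_alpha(in_neighbors, act_time)
    shuffled = [
        estimate_alpha(in_neighbors, shuffle_times(act_time, rng))
        for _ in range(trials)
    ]
    return alpha, sum(shuffled) / trials
```

A small gap between the two returned values is what the test reads as "no evidence of influence"; a clear drop after shuffling points toward influence.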

3.2.1 Theoretical analysis

To aid our analysis, we make three simplifying assumptions. First, we assume that the distribution $\mathcal{T}$ of the activation times is uniform over [0, T]. Second, we modify the test to pick each $t'_i$ independently from $\mathcal{T}$, instead of using a permutation of the original time stamps. Neither of these assumptions is necessary, but they simplify the arguments without substantively changing the techniques.

The third set of assumptions ensures that there are enough data to gather statistics. Let $d_i^-$ ($d_i^+$) be the indegree (outdegree) of node $w_i$, and let $d_i^{W-}$ ($d_i^{W+}$) be the indegree (outdegree) of node $w_i$ in the subgraph induced by W (recall that W is the set of users that became active). Also, let $\tilde{W} = \{w_1, \dots, w_{\tilde{l}}\}$, where $\tilde{l} \ge l$, be the set of nodes in W and their neighbors (note that the first l nodes are those in W). Then we make the following assumptions:

1. $l = \Theta(n)$.
2. $d_i^-, d_i^+ \le d_{\max}$, for $i \le \tilde{l}$ and for some constant $d_{\max}$.
3. $|\{i : d_i^{W-} \ge R + 1\}| = \Theta(n)$.

These assumptions are not the strictest possible for our results to hold, but they are nevertheless quite natural and simple to state. In particular, we make the first assumption only to simplify the notation (otherwise the results hold with probabilities that depend on $l$ and $\tilde{l}$ instead of n).

Theorem 1. Let G = (V, E) be a directed graph on n nodes and let $W = \{w_1, \dots, w_l\} \subseteq V$ be the set of nodes that become active during the time period [0, T]. Assume that the activation time $t_i$ of the node $w_i$ is picked i.i.d. from the uniform distribution over $\{1, \dots, T\}$, and assume that the three assumptions hold. Let α denote the social correlation coefficient computed using the method in Section 3.1. Then, with high probability (footnote 5), the value of α is close to its expectation, where the probabilities are over random choices of the activation times.

Proof. The main part of the proof is Lemma 2, where we show that the values of $Y_a$ and $N_a$ are concentrated. This is proved using concentration inequalities for martingales.
We can then show (details deferred to the full version of the work) that when we apply logistic regression with inputs that are close to each other, the social correlation values α recovered are also close to each other. Therefore, with high probability the value of α recovered is close to its expectation.

Footnote 5: The term with high probability, abbreviated whp, refers to an event that holds with probability that tends to 1 as $n \to \infty$.

Lemma 2. Assume the conditions of Theorem 1, and let $Y_a$ and $N_a$, $a \le R + 1$, be defined as in Section 3.1. Then $Y_a$ and $N_a$ are close to their expectations whp.

Proof. First we calculate $E[Y_a]$ for a fixed a. We introduce some notation. Let $Y_a^i = 1$ if node $w_i$ had a active neighbors when it used the tag, and 0 otherwise. Notice that $Y_a = \sum_{i=1}^{l} Y_a^i$. The probability that exactly a of the neighbors are active when node $w_i$ used the tag is 0 if $d_i^{W-} < a$. Otherwise, if $a \le R$, this probability is $1/(d_i^{W-} + 1)$, since node $w_i$ and its neighbors are equally likely to occupy any position in the order of activation, and $w_i$ must be the (a+1)-th among them to use the tag. Finally, if $a = R + 1$ (recall that R + 1 corresponds to the ensemble of all the values greater than R), then the probability is $(d_i^{W-} - R)/(d_i^{W-} + 1)$. Thus, we have

$$E[Y_a] = \sum_{i=1}^{l} E[Y_a^i] = \sum_{i \,:\, d_i^{W-} \ge a} \frac{1}{d_i^{W-} + 1},$$

for $a \le R$, and

$$E[Y_a] = \sum_{i=1}^{l} E[Y_a^i] = \sum_{i \,:\, d_i^{W-} \ge R+1} \frac{d_i^{W-} - R}{d_i^{W-} + 1},$$

for $a = R + 1$. One can verify that from our assumptions both of these quantities are $\Theta(n)$.

Note that the terms $Y_a^i$ are not independent. Thus, to show concentration, we will employ Azuma's inequality [7]. For a fixed a we define the (Doob) martingale $X_i = E[Y_a \mid t_1, t_2, \dots, t_i]$. We have that $X_0 = E[Y_a]$ and $X_l = Y_a$. Note that $|X_i - X_{i-1}| \le d_i^{W+} + 1$, since a node affects only itself and the nodes for which it is a contact. Then Azuma's inequality implies that

$$\Pr(|Y_a - E[Y_a]| > \lambda) = \Pr(|X_l - X_0| > \lambda) \le 2 \exp\left(-\frac{\lambda^2}{2 \sum_{i=1}^{l} (d_i^{W+} + 1)^2}\right),$$

which is o(1) for $\lambda = \omega(\sqrt{n})$.

To compute the value of $E[N_a]$ we have to be a bit more careful, since a node can contribute multiple time periods to $N_a$. First, note that we have to count also the neighbors of the nodes in W.
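Recall that $\tilde{W} = \{w_1, \dots, w_{\tilde{l}}\}$ is the set of active nodes and their neighbors. Let us write $N_a = \sum_{i=1}^{\tilde{l}} N_a^i$, where $N_a^i$ counts the number of timesteps during which node $w_i$ had not yet become active (if it becomes active at all) and had exactly a active contacts. Let us compute $E[N_a^i]$, first for $i \le l$. Of course, this equals 0 if $d_i^{W-} < a$. Otherwise, the expected time until one of the $d_i^{W-} + 1$ nodes ($w_i$ and its contacts) becomes activated is $T/(d_i^{W-} + 2)$, thus

$$E[N_0^i] = \frac{T}{d_i^{W-} + 2}.$$

With probability $d_i^{W-}/(d_i^{W-} + 1)$ the first node is not $w_i$, hence we have

$$E[N_1^i] = \frac{d_i^{W-}}{d_i^{W-} + 1} \cdot \frac{T}{d_i^{W-} + 2}.$$

More generally we get that

$$E[N_a^i] = \frac{d_i^{W-} - a + 1}{d_i^{W-} + 1} \cdot \frac{T}{d_i^{W-} + 2},$$

for $a \le R$, and

$$E[N_a^i] = \frac{d_i^{W-} - R}{d_i^{W-} + 1} \cdot \frac{d_i^{W-} - R + 1}{d_i^{W-} + 2} \cdot \frac{T}{2},$$

for $a = R + 1$. (The first fraction is the probability that $w_i$ becomes activated after R + 1 neighbors, and then it is expected to arrive in the middle of the leftover period.) For $i > l$ we can show with similar arguments that $E[N_a^i] = 0$ if $d_i^{W-} < a$, otherwise

$$E[N_a^i] = \frac{T}{d_i^{W-} + 1},$$

for $a \le R$, and

$$E[N_a^i] = \frac{d_i^{W-} - R}{d_i^{W-} + 1} \, T,$$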

for $a = R + 1$. By our assumptions for the graph we have that $E[N_a] = \sum_{i=1}^{\tilde{l}} E[N_a^i] = \Theta(Tn)$.

Again we show concentration by using Azuma's inequality. We define $Z_i = E[N_a \mid t_1, t_2, \dots, t_i]$, and notice that $|Z_i - Z_{i-1}| \le T (d_i^+ + 1)$, with the same reasoning as previously. So we get that

$$\Pr(|N_a - E[N_a]| > \lambda) = \Pr(|Z_{\tilde{l}} - Z_0| > \lambda) \le 2 \exp\left(-\frac{\lambda^2}{2 T^2 \sum_{i=1}^{\tilde{l}} (d_i^+ + 1)^2}\right),$$

which is o(1) for $\lambda = \omega(T\sqrt{n})$.

3.2.2 Detecting influence

We showed that the values of α that we obtain with the correlation model are close to each other with high probability with and without the timestep shuffle. Now we contrast this with the influence model and we show that in the latter case the values of α that we compute with and without the timestep shuffle are in general different. We demonstrate this fact with a simple example.

Consider a line graph with n + 1 nodes, $v_0, v_1, v_2, \dots, v_n$, and edge set $\{(v_i, v_{i+1}) : i = 0, 1, \dots, n - 1\}$. For simplicity we assume that node $v_0$ has initially used the tag; this does not change the nature of our example. For some $p \in [0, 1]$, consider now the influence model with $\alpha = \log_2(p/(1 - p))$ and $\beta = 0$, and we observe the system for T time steps (with Tp being sufficiently small, say Tp < n/2). During the T steps, the nodes will start to use the tag from left to right, and at each step, the probability that the leftmost inactive node will become active equals p. Then at the end of the T steps, if the number of new active nodes is denoted by L, we have $E[Y_1] = E[L] = Tp$ and $E[N_1] = T(1 - p)$.

Assume now that we perform the shuffle test. Then for $i = 1, \dots, L$, let $Y_1^i$ be 1 if node $v_i$ became active after node $v_{i-1}$, and let $N_1^i$ be the number of time steps in which node $v_i$ did not become active although node $v_{i-1}$ was active (0 if node $v_{i-1}$ became active after node $v_i$). Then we have $Y_1 = \sum_{i=1}^{L} Y_1^i$ and $E[Y_1] = E[L/2] = Tp/2$, since the probability that node $v_{i-1}$ becomes active before node $v_i$ is 1/2. Similarly, $N_1 = \sum_{i=1}^{L} N_1^i$ and

$$E[N_1^i] = \sum_{j=1}^{T} \frac{1}{T} \cdot \frac{j}{Tp} \cdot \left(\frac{j}{2} - 1\right).$$
This follows since node $v_i$ becomes active at time step j with probability 1/T, the probability that node $v_{i-1}$ arrives before is $j/(Tp)$, and in that case the arrival time is uniformly distributed in [0, j], so the expected number of time steps in which node $v_i$ does not become active is $j/2 - 1$. Therefore,

$$E[N_1^i] = \frac{(T + 1)(2T - 5)}{12\,Tp},$$

and so, since $E[L] = Tp$,

$$E[N_1] = \frac{(T + 1)(2T - 5)}{12}.$$

Hence we see that the input to the regression function is in general very different and as a result the values of α will in general be very different.

3.3 The edge-reversal test

In this section we introduce the second test for distinguishing influence, similar to the one used in the obesity study [2]: we reverse the direction of all the edges and run logistic regression on the data using the new graph (which we call the reverse graph) as well (footnote 6). Since other forms of social correlation (other than social influence) are only based on the fact that two friends often share common characteristics or are affected by the same external variables, and are independent of which of these two individuals has named the other as a friend, we intuitively expect reversing the edges not to change our estimate of the social correlation significantly. On the other hand, social influence spreads in the direction specified by the edges of the graph, and hence reversing the edges should intuitively change the estimate of the correlation. We will test this hypothesis on several classes of instances generated using probabilistic models of different forms of social correlation.

4. SIMULATIONS

4.1 Generative models

To verify the validity of the techniques described in Section 3, we define three generative models: one corresponding to a setting where there is no social correlation, one corresponding to a setting where there is only social influence, and one where there is social correlation but not influence. In each model, we will try to keep other aspects of the model as close to Flickr's data as possible.
In particular, in all models the network (both number of users and connections) grows at the same rate as in the real Flickr data, and we will try to let the number of users that become active in each time step follow the pattern corresponding to a tag in the real data. The first model concerns a setting where there is no social correlation, influence or otherwise, in the pattern of activations. The second model is for a setting where influence is the only form of social correlation; this model is defined to match the logistic regression model described earlier. The third model seeks to capture situations where agents that are close to each other in the network are affected by the same external factors (the environment) that make them more likely to be activated. We now describe the models.

The no-correlation model. For every tag in the real data, we can generate a no-correlation instance as follows: the network grows exactly in the same way as in the real data. In each time step, we look at the real data to see how many new agents use the tag, and pick the same number of agents uniformly at random from the set of agents that have already joined the network and have not been picked yet.

The influence model. This model is parameterized in terms of two parameters, α and β. The network, and the growth pattern of the network, are kept as in the real data. In every time step, each node in the set of nodes that have joined the network but have not been activated yet flips a coin independently to decide whether to become active in this time step. The probability of activation for this node is computed using (2), where a is the number of friends of this node that have become active in one of the previous time steps.

Footnote 6: Note that we are only able to use this test because in the Flickr data set a significant number of edges are directed.
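The coin-flipping process of the influence model can be sketched as follows. This is a simplified illustration, not the authors' generator: it assumes a static network (the paper grows the network as in the real Flickr data), and the names `activation_prob`, `simulate_influence`, and the dictionary `in_neighbors` are hypothetical.

```python
import math
import random


def activation_prob(a, alpha, beta):
    """Logistic activation probability p(a) with explanatory variable ln(a + 1)."""
    z = alpha * math.log(a + 1) + beta
    return 1.0 / (1.0 + math.exp(-z))


def simulate_influence(in_neighbors, T, alpha, beta, seed=0):
    """Run the influence model for steps 1..T on a fixed graph (an assumption).

    in_neighbors[u]: contacts whose earlier activations can affect u.
    Returns a dict mapping each activated node to its activation step.
    """
    rng = random.Random(seed)
    act_time = {}
    for t in range(1, T + 1):
        newly_active = []
        for u, friends in in_neighbors.items():
            if u in act_time:
                continue
            # Only friends activated in previous steps count toward a.
            a = sum(1 for v in friends if v in act_time)
            if rng.random() < activation_prob(a, alpha, beta):
                newly_active.append(u)
        # Commit after the scan so same-step activations do not influence each other.
        for u in newly_active:
            act_time[u] = t
    return act_time
```

Feeding the returned activation times into the estimator of Section 3.1 should recover values near the chosen α, which is the behavior reported for Figure 2.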

Figure 1: Distribution of α for the no-correlation model. (a) Histogram. (b) Empirical CDF.

Figure 2: Distribution of α for the influence model. (a) Histogram. (b) Empirical CDF.

The correlation (no-influence) model. Again, we keep the network and the pattern of growth of the population the same as in the real data. The model is parameterized in terms of one parameter L, and follows the pattern of a given tag in the real data. Before generating the action data, we select a set S of nodes by sequentially picking a number of centers at random, and adding a ball of radius 2 around each to S (footnote 7). We stop this process as soon as the size of S reaches the prespecified number L. Then, we generate the set of agents that become active in each time step in a manner similar to the one in the no-correlation model, except that in each time step we pick the set of agents to become active uniformly at random from S.

4.2 Measuring correlation

Our first set of experiments focuses on the measurement of correlation in the network. In Figure 1 we display the results of the application of logistic regression to the no-correlation model. We can see that the distribution of the values of α is centered at zero and most of the mass is around there.

In Figure 2 we can see the application of the logistic regression to the influence model. Recall from Section 3 that this model is based on the logistic function, which we are trying to fit. Not surprisingly, we recover the values of α that we set in our model. Thus, Figure 2 essentially displays those values of α.

Finally, in Figure 3 we see the results in the correlation model. Note that here as well the values of α that we recover are positive.

4.3 Distinguishing influence

After establishing the presence of correlation in users' behavior, we turn to tests for the source of this correlation. First we apply the shuffle test and then we turn to the edge-reversal test.

4.3.1 Shuffle test

Let us first observe the influence model, where the values of α with the original tagging times are high.
From the intuition gained in Section 3.2, we expect to see those values decrease when we shuffle the tagging timesteps. In Figure 4(a) we can observe the results for some of the tags. Notice how the cumulative distribution function (CDF) is shifted to the left, which means that when we shuffle the timesteps the value of α decreases. In Figure 4(b) we can see the values in absolute terms.

Footnote 7: We have chosen a radius of 2 here because the network is highly connected: a ball of radius 3 can become very large, while a ball of radius 1 only consists of the neighbors of a node, which is often too small.

Figure 3: Distribution of α for the correlation model. (a) Histogram. (b) Empirical CDF.

Now we switch to the correlation model. According to the analytical findings of Section 3.2, with high probability the values of α that we obtain with and without the shuffling should not differ. Figure 5 confirms our analytical findings and shows that for almost all tags the values of α retrieved are very close with and without the shuffle.

4.3.2 Edge-reversal test

Now we present the results of our second influence-detection test, the edge reversal, confirming the results of the previous section. First we apply it to the influence model, depicting the results in Figure 6. Similarly to the previous test, there is a significant difference in the values of α in the forward and backward direction.

In contrast, in the correlation model, as seen in Figure 7, the values of α essentially coincide. In Figure 7(a) we can notice that the two CDFs essentially coincide. In Figure 7(b) we see a more detailed picture. Here every point corresponds to a tag, and the graph shows the value of α in the network versus the value of α in the network with the edges reversed. Take notice of the proximity of the points to the line y = x.
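The edge-reversal comparison used above amounts to re-running the same estimator on the reverse graph. The sketch below illustrates this under assumptions: `estimate_alpha` is a hypothetical callable standing in for the regression of Section 3.1, and `in_neighbors[u]` is a hypothetical dictionary listing the nodes that have an edge to u.

```python
def reverse_graph(in_neighbors):
    """Return the reverse graph: an edge v -> u becomes u -> v."""
    rev = {u: [] for u in in_neighbors}
    for u, contacts in in_neighbors.items():
        for v in contacts:
            # v had an edge to u, so in the reverse graph u has an edge to v.
            rev.setdefault(v, []).append(u)
    return rev


def edge_reversal_test(in_neighbors, act_time, estimate_alpha):
    """Return (alpha on the original graph, alpha on the reverse graph).

    estimate_alpha(in_neighbors, act_time) -> float is supplied by the caller.
    """
    alpha_forward = estimate_alpha(in_neighbors, act_time)
    alpha_reversed = estimate_alpha(reverse_graph(in_neighbors), act_time)
    return alpha_forward, alpha_reversed
```

Plotting the two returned values against each other, one point per tag, gives the kind of scatter described for Figures 6(b) and 7(b): points near the line y = x suggest correlation without influence.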

Figure 4: Shuffle test for the influence model. (a) Empirical probability density. (b) α of original and shuffled tagging timesteps.

Figure 5: Shuffle test for the correlation model. (a) Empirical probability density. (b) α of original and shuffled tagging timesteps.

Figure 6: Edge-reversal test for the influence model. (a) Empirical probability density. (b) α of direct vs. reversed edges.

Figure 7: Edge-reversal test for the correlation model. (a) Empirical probability density. (b) α of direct vs. reversed edges.

5. EXPERIMENTS ON REAL DATA

After verifying that our techniques are effective for the simulated data, we apply them on real-world data, namely on the Flickr social network. First we describe the data set. Then we show that there is positive correlation in the users' behavior. Finally, we address the issue of the source of correlation. We apply the tests of Section 3, and we conclude that influence is not a likely source of the correlation.

5.1 The Flickr dataset

We analyzed the tagging behavior of users for a period of 16 months. The final number of users was about 800K. Since the majority of users did not exhibit any tagging behavior at all, we restricted our attention to the set of users who have tagged any photo with any tag, which is about 340K users. Looking at this subgraph at the end of the 16-month period, the size of the giant component is 160K users, the second one has size 16, and there are 165K isolated users. The number of directed edges between the users is 2.8M and, on average, for a given user u, the proportion of u's contacts that do not have u as a contact is 28.5%. In Figure 8 we depict the size of the subgraph that we analyze as a function of time. (The growth rate of the entire network exhibits a very similar behavior.)

Out of a collection of about 10K tags that users had used, we selected a set of 1,700, and analyzed each of them independently.
We selected tags of various types (event, colors, objects, etc.), various numbers of users (most of them were used by more than 1,000 users), and various growth patterns: bursty (e.g., "halloween", "katrina"), smooth (e.g., "photos") and periodic (e.g., "moon").

5.2 Measuring correlation

First we confirm that, as expected, correlation exists in the Flickr data set. In Figure 9 we can see the distribution of α across the tags of Flickr. Note that for almost all the tags the value is higher than 1, suggesting that correlation is prevalent in users' tagging activities for almost all the tags. This correlation is not necessarily due to social influence; we examine this issue next.

5.3 Distinguishing influence

After establishing the presence of correlation in users' behavior, we turn to the test for the source of this correlation.
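As a reminder of what the two tests do to the data for a single tag, here is a minimal sketch. It assumes that the shuffle test permutes the observed tagging timesteps uniformly at random among the users who used the tag, and that the edge-reversal test flips the direction of every edge. The names `shuffle_timestamps`, `reverse_edges`, and `run_tests` are hypothetical, and `estimate_alpha` stands in for the estimator of α from Section 3, which is passed in as a parameter rather than reimplemented here.

```python
import random


def shuffle_timestamps(activation_time, rng):
    """Return a copy in which the observed times are permuted among the same users.

    activation_time maps each user who used the tag to the timestep of first use.
    The input dict is left unchanged.
    """
    users = list(activation_time)
    times = [activation_time[u] for u in users]
    rng.shuffle(times)
    return dict(zip(users, times))


def reverse_edges(edges):
    """Return the directed edge set with every edge (u, v) replaced by (v, u)."""
    return {(v, u) for (u, v) in edges}


def run_tests(edges, activation_time, estimate_alpha, seed=0):
    """Return alpha on the original data, after shuffling, and after edge reversal.

    estimate_alpha(edges, activation_time) is a caller-supplied function
    (an assumption of this sketch) that returns the correlation coefficient alpha.
    """
    rng = random.Random(seed)
    alpha = estimate_alpha(edges, activation_time)
    alpha_shuffled = estimate_alpha(edges, shuffle_timestamps(activation_time, rng))
    alpha_reversed = estimate_alpha(reverse_edges(edges), activation_time)
    return alpha, alpha_shuffled, alpha_reversed
```

If influence were a significant source of correlation, one would expect the second and third values to be noticeably smaller than the first; if the three values are close, influence is an unlikely explanation.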

Figure 8: Growth of the Flickr network.

Figure 9: Distribution of α for the Flickr social network. (a) Histogram. (b) Empirical CDF.

Figure 10: Shuffle test for the Flickr social network. (a) Empirical probability density. (b) α of original timesteps vs. shuffled timesteps.

Figure 11: Edge-reversal test for the Flickr social network. (a) Empirical probability density. (b) α of direct vs. reversed edges.

First we apply the shuffle test and then we turn to the edge-reversal test. In Figure 10 we show the results of applying the shuffle test on the Flickr data set. In Figure 10(a), notice that the two cumulative distribution functions essentially coincide. It seems that the correlation that we observed in Section 5.2 cannot be attributed to influence. This indicates that either users do not tend to browse their contacts' photos to a large extent, or even when they browse, they do not tend to start using the tags they see. In Figure 10(b) we see more details. Once again, every point corresponds to a tag, and the graph shows the value of α with the original tagging timesteps versus the value of α with the timesteps shuffled. As before, notice the striking proximity of the points to the line y = x. Finally, in Figure 11 we observe the results of applying the edge-reversal test to the Flickr network, which once again confirms all our previous observations.

5.4 Some influence in Flickr

While it is true that influence does not play an important role in users' tagging behavior in Flickr, we can actually discover that there is some limited effect by looking at the difference between similar tags. As a concrete example, consider the tag "graffiti"; the difference between the values of α in the two edge directions is essentially 0. A lot of users used the misspelled tag "grafitti". Here the difference turns out to

be slightly larger (still small though). It is easy to imagine that indeed there is some propagation of the misspelled version. (The analogy with the TA who grades two homeworks with the same mistakes should make this concept clear!) Finally, with a third, even less common spelling ("graffitti"), the difference increased yet more.

6. CONCLUSIONS

In this paper we applied statistical analysis to the data from a large social system in order to identify and measure social influence as a source of correlation between the actions of individuals with social ties. This is an instance of the age-old problem of distinguishing correlation from causation. This problem is very difficult in general; however, in our case, we used the availability of data about the time-step of each action, as well as asymmetric social ties between the agents, in order to study this problem.

There are still many interesting open directions left for future research. First, our techniques provide only a qualitative indication of the existence of influence and not a quantitative measure. Furthermore, we do not provide any formal verification of our results. For example, is it indeed the case that in Flickr users' tagging behavior, influence has a limited role? Or, can we pinpoint social networks and behaviors where influence is indeed prevalent and verify our tests? Also, what happens when different sources of social correlation are present, as is usually the case? All these important questions might be tricky to answer and probably require the design of controlled user experiments.

Furthermore, it would be very interesting to extend our theoretical model for distinguishing between social influence and other forms of correlation in social networks. Under what conditions is the information about the time step of events enough to achieve this goal? How can the pattern of the spread of an action be used to identify social influence even in a setting where all social ties are symmetric? How can we find an influential node just by looking at the data about the spread of an action?
Given the great potential of viral marketing technologies to shape the future of marketing on the Internet, this and many other related questions are of tremendous practical value.

Acknowledgments

We thank Alex Jaffe, Malcolm Slaney, and Duncan Watts for invaluable discussions, as well as the anonymous reviewers for insightful comments.

7. REFERENCES

[1] L. Backstrom, D. Huttenlocher, J. Kleinberg, and X. Lan. Group formation in large social networks: Membership, growth, and evolution. In 12th KDD, pages 44-54.
[2] N. A. Christakis and J. H. Fowler. The spread of obesity in a large social network over 32 years. The New England Journal of Medicine, 357(4).
[3] D. Kempe, J. Kleinberg, and E. Tardos. Maximizing the spread of influence through a social network. In 9th KDD.
[4] P. Lazarsfeld and R. K. Merton. Friendship as a social process: A substantive and methodological analysis. In M. Berger, T. Abel, and C. H. Page, editors, Freedom and Control in Modern Society. Van Nostrand.
[5] C. Marlow, M. Naaman, D. Boyd, and M. Davis. HT06, tagging paper, taxonomy, Flickr, academic article, to read. In 17th HYPERTEXT, pages 31-40.
[6] M. McPherson, L. Smith-Lovin, and J. M. Cook. Birds of a feather: Homophily in social networks. Annual Review of Sociology, 27.
[7] M. Mitzenmacher and E. Upfal. Probability and Computing. Cambridge University Press.
[8] P. Young. The diffusion of innovations in social networks. In L. E. Blume and S. N. Durlauf, editors, The Economy as a Complex Evolving System, volume III. Oxford University Press, 2003.