www.sciencemag.org/cgi/content/full/311/5757/88/dc1

Supporting Online Material for
Empirical Analysis of an Evolving Social Network
Gueorgi Kossinets and Duncan J. Watts*
*To whom correspondence should be addressed. E-mail: djw24@columbia.edu

This PDF file includes: Materials and Methods, References

Published 6 January 2006, Science 311, 88 (2006)
DOI: 10.1126/science.1116869
Empirical Analysis of an Evolving Social Network
Supporting Online Material
Gueorgi Kossinets and Duncan J. Watts
Department of Sociology, and Institute for Social and Economic Research and Policy, Columbia University, 420 West 118th Street, MC 3355, New York, NY 10027, USA.

Data

Our population consists of 43,553 undergraduate and graduate students, faculty and staff at a large US university who sent and received e-mail using a university e-mail address during academic year 2003-2004. The data were collected and anonymized on our behalf by the university IT department. The dataset consists of three parts: (1) the registry of e-mail interactions obtained from the university e-mail server; (2) the table of personal attributes (status, gender, age, departmental affiliation, number of years in the community, dormitory and home zip code for undergraduate students); (3) the lists of classes attended or taught in every semester, respectively for students and instructors. For each e-mail message the time, sender, and list of recipients (but not the content) were recorded. To ensure that our data represent genuine interpersonal communication (as opposed to bulk mailings) we filtered out messages with more than 4 recipients (95% of all messages had 4 or fewer addressees). For purposes of this report, we treat each message with n recipients as n simultaneous messages each with a single recipient. After filtering, there are 14,584,423 messages exchanged by 43,553 individuals during 355 days of observation. As a privacy protection measure, all individual e-mail addresses and group identifiers (such as course numbers or department names) were encrypted; so it is possible to tell, for example, whether two anonymous individuals were in the same class together but not what class that was. Anonymization was necessary in order to qualify for an exemption from full review by the Institutional Review Board; otherwise researchers are required to obtain written consent from every human subject, which would not be feasible for a project of such a scale as ours.
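The filtering and expansion rule described above (drop messages with more than 4 recipients, then treat a message with n recipients as n single-recipient messages) might be sketched as follows. This is only an illustration in Python, not the C and Perl code used for the study; the `Message` record and the function name `expand_messages` are hypothetical.

```python
from typing import Iterable, Iterator, NamedTuple, Tuple

MAX_RECIPIENTS = 4  # threshold stated in the text


class Message(NamedTuple):
    """Hypothetical log record: time stamp, sender, and recipient list."""
    time: float
    sender: str
    recipients: Tuple[str, ...]


def expand_messages(
    log: Iterable[Message], max_recipients: int = MAX_RECIPIENTS
) -> Iterator[Tuple[float, str, str]]:
    """Yield one (time, sender, recipient) triple per recipient.

    Messages with more than `max_recipients` recipients are treated
    as bulk mailings and skipped entirely.
    """
    for msg in log:
        if len(msg.recipients) > max_recipients:
            continue  # bulk mailing: filtered out
        for recipient in msg.recipients:
            yield (msg.time, msg.sender, recipient)
```

Each yielded triple corresponds to one of the single-recipient messages counted after filtering.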
All computations were performed using custom-written programs in C and Perl on a 2GHz Linux workstation with 2GB of RAM. The data (daily e-mail logs, snapshots of employee database, course registration file, lists of encrypted university and outside addresses) were made available to us as gzipped plain text files on a per-semester basis. Each installment required from 1.5 to 3.6 GB of disk space. We parsed the gzipped files directly using a library available in Perl. The computationally more intensive routines were implemented in C and the wrapping code was programmed in Perl. When the data structures were too large to fit into computer memory (for example, estimating cyclic closure bias required storing a triangular matrix of pairwise distances for approximately $9.5 \times 10^8$ vertex pairs), we used packed arrays and temporary disk files. Statistical analysis was carried out in R and Matlab. More technical details will be forthcoming in our future publications as well as in GK's doctoral dissertation. We also intend to post the programs that we developed on our web-site, in the hope that other researchers will use them and improve upon them.
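As an illustration of the packed triangular storage mentioned above, the following Python sketch maps each unordered vertex pair to an offset in a flat byte array. The layout (row-major strict upper triangle), the one-byte-per-distance assumption, and the names `pair_index` and `make_distance_store` are assumptions made for this example; the text does not specify the packing scheme actually used.

```python
from array import array


def pair_index(i: int, j: int, n: int) -> int:
    """Offset of unordered pair {i, j} (0-based, i != j) in a flat array
    holding the strict upper triangle of an n x n matrix, row by row."""
    if i == j or not (0 <= i < n and 0 <= j < n):
        raise ValueError("need two distinct vertices in range(n)")
    if i > j:
        i, j = j, i
    # Rows 0..i-1 contribute (n-1) + (n-2) + ... + (n-i) entries.
    return i * n - i * (i + 1) // 2 + (j - i - 1)


def make_distance_store(n: int, fill: int = 255) -> array:
    """One unsigned byte per vertex pair; `fill` marks 'no path known'
    (assumes geodesic distances are below 255)."""
    return array("B", [fill]) * (n * (n - 1) // 2)
```

For n = 43,553 vertices this layout has 948,410,128 entries, consistent with the approximately $9.5 \times 10^8$ vertex pairs quoted above.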
Relevance of e-mail data

E-mail communication is strongly correlated with other kinds of social interaction, such as face-to-face and telephone conversations (1-6). Moreover, the extent to which people use e-mail vis-à-vis other media appears to reflect their inherent sociability (2, 3, 6). Recent findings suggest that e-mail serves as much social function as face-to-face interactions or phone calls (5, 6), particularly with nearby friends (4). Instead of a trade-off between face-to-face interactions and e-mail communication, college students have been found to expand existing face-to-face relationships to include telephone and online interactions (6). Although instant messaging popularity is on the rise, recent reports estimate that e-mail accounts for 62 to 70% of students' online interactions (5, 6). While individuals may vary in their e-mail usage, both overall and in particular social situations (7), the large size of the community that we study implies a reduction to the mean in terms of both individual and dyadic behavior. We expect that by averaging over thousands of observed relationships, e-mail communication will reflect the intensity and directionality of underlying relationships within our university community.

Our data on e-mail communication have been collected from the university e-mail server, and as such provide a full record of communication between the university e-mail addresses. However, it is common for individuals to maintain multiple e-mail addresses (1, 5, 8). According to the Pew Internet Research Project survey, about 66% of college students use at least two e-mail addresses (5). On the other hand, individuals rarely use more than three e-mail addresses for personal communication (8). Typically, multiple addresses are used in order to separate social roles (professional, academic, anonymous, etc.) or specific tasks (e.g. personal communication, shopping, or registration for services), as well as for technical reasons (e.g. to circumvent institutional policies or to transfer large files).
Although different roles may not always correspond to specific e-mail addresses, we find it likely that the communication within the university community that we study is largely related to the activities associated with the university and hence reflects the primary roles (statuses) of individuals. In addition, based on information from the University IT department, it seems likely that the students at the university in question may well prefer their official e-mail over free mail accounts for all kinds of personal communication. There are a number of reasons for that: (1) all students are required to use their university e-mail to receive official communication and access various services, such as libraries, course materials, etc.; (2) a university e-mail connotes prestige and status; (3) some very popular online services for undergraduates (such as facebook.com) require a college e-mail account; (4) it is easy to find people using an online university directory; (5) the university has an efficient spam-filtering system which is superior to many free services; it also provides a streamlined, advertisement-free web-interface in addition to free, convenient access to e-mail from various e-mail applications. Thus while individuals indeed tend to use multiple e-mail accounts to compartmentalize tasks and relationships, there are reasons to believe that in our dataset, the university e-mail addresses are used preferentially for university-related communication, and by extension, for various kinds of communication with other individuals at the university.
Constructing network time series from discrete dyadic interactions

Ongoing social relationships produce observable spikes of e-mail communication (9-12); therefore it is possible to create an approximation of the instantaneous social network by applying a filter (13). We approximate instantaneous strength of a relationship $w_{ij}(t, \tau)$ by the average geometric rate of bilateral e-mail exchange within a window of width τ:

$$w_{ij}(t, \tau) = \sqrt{m_{ij} m_{ji}} / \tau,$$

where $m_{ij}$ and $m_{ji}$ are respective counts of messages from person $i$ to person $j$ and back during the period $(t - \tau, t]$. This parameterization allows us to recover the network at arbitrary times by including only ties with non-zero instantaneous strength $w_{ij}(t, \tau) > 0$. The geometric average serves as a conservative measure of intensity: tie strength is high if both directed links are strong; it is low if either directed link in the pair has low intensity. Therefore, a tie is present in the instantaneous network at time t if and only if there are messages in both directions during $(t - \tau, t]$.

The width of the smoothing window τ effectively sets a relevancy horizon; that is, it determines which past events are relevant to the current state of the network. In addition, the frequency with which the network is measured (sampling frequency or, equivalently, sampling period) determines which events will be considered simultaneous and independent of each other. It is important to choose the two time scales, smoothing window τ and sampling period δ, appropriately. If smoothing window τ is too short, some ongoing ties will be misclassified as ties that have been terminated and then re-enacted; if τ is too large, many past interactions which are not likely to be relevant to the present state of the relationship will be nevertheless included in the calculation of relationship strength. If sampling period δ is too large, then a sequence of events may be misclassified as independent, simultaneous events; on the other hand, δ should not be chosen too small, or event history may be biased by the errors present in time measurements.
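The sliding-window tie strength defined above can be illustrated with a short Python sketch. It assumes the message log is an iterable of (time, sender, recipient) triples with time measured in days; the function name `tie_strengths` and this input layout are hypothetical and are not taken from the study's code.

```python
import math
from collections import Counter
from typing import Dict, Iterable, Tuple


def tie_strengths(
    events: Iterable[Tuple[float, str, str]], t: float, tau: float = 60.0
) -> Dict[Tuple[str, str], float]:
    """Return {(i, j): w_ij(t, tau)} for pairs with i < j and w_ij > 0.

    w_ij = sqrt(m_ij * m_ji) / tau, where m_ij counts messages from i to j
    with time stamps in the half-open window (t - tau, t].
    """
    counts: Counter = Counter()
    for time, sender, recipient in events:
        if t - tau < time <= t and sender != recipient:
            counts[(sender, recipient)] += 1

    strengths: Dict[Tuple[str, str], float] = {}
    for (i, j), m_ij in counts.items():
        m_ji = counts.get((j, i), 0)
        # Keep each unordered pair once; unreciprocated links give w = 0.
        if i < j and m_ji > 0:
            strengths[(i, j)] = math.sqrt(m_ij * m_ji) / tau
    return strengths
```

The keys of the returned dictionary form the edge set of the instantaneous network at time t; calling the function for successive values of t spaced δ apart gives a network time series.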
We use τ = 60 days because the rate of new tie formation stabilizes after approximately 60 days since the beginning of observation, which suggests that 60 days is close to the characteristic tie formation scale for our network. This choice is supported by analyzing the distribution of dyadic response times (about 90% of pooled response times are within 60 days, accounting for censored observations). The edge set of the instantaneous network at any point in time therefore consists of all pairs of individuals that exchanged one or more messages within the past 60 days. With this choice of τ, the first 60 days of data collection are used to estimate the network at day 61, so the effective span of the data is 295 days (day 61 through day 355). We also checked that our results are robust for τ = 30 and 90 days. Figures 1, 2, and 4 were created using days 61 through 270, that is, not including the Summer break, because of a substantial drop in activity associated with individuals leaving the university for the holidays and also because there are very few regular courses offered during the summer.

The appropriate sampling period may be calculated by applying the Nyquist sampling theorem (14) to the maximum rate of tie formation. Although there are a few periods of high activity in our network (for example, at the beginning of the Spring semester, when the changing class attendance pattern leads to formation of many new social ties), we estimated that sampling for structural changes every δ = 1 day produced a reasonable approximation, taking into account the natural periodicity of human activities. We checked this assumption by comparing network time series obtained with δ = 1 hour and δ = 1 day, finding qualitatively similar results. We use
daily measurements to calculate the parameters of tie formation and hourly resolution for the multivariate survival analysis of triadic closure, to improve model sensitivity.

We note that there are other smoothing methods available for constructing the network from dyadic interaction data; for example, the exponentially weighted moving average filter (9, 15). However, with respect to the time of tie activation in unweighted networks, the exponentially weighted moving average and the sliding window filter produce identical results if calibrated appropriately (13).

Cyclic and focal closure

To produce Figure 1, we computed geodesic distance $d_{ij}$ for all pairs of individuals in the network from day 61 through 270 (Fall and Spring semesters) with a 1-day resolution, and at each step identified ties not present in the network on the previous day. The average per-day empirical probability of a new tie as a function of network distance $d_{ij}$ and the number of shared foci $s_{ij}$ is computed as

$$P_{new}(d_{ij}, s_{ij}) = \sum_{t=61}^{270} M_{new}(d_{ij}, s_{ij}, t) \Big/ \sum_{t=61}^{270} M(d_{ij}, s_{ij}, t),$$

where $t$ is time in days, $M(d_{ij}, s_{ij}, t)$ is the number of vertex pairs in category $(d_{ij}, s_{ij})$ at time $t$, and $M_{new}(d_{ij}, s_{ij}, t)$ is the number of new ties in this category since time $t - 1$. Summer (85 days) was excluded from this calculation as there are very few regular courses offered during the Summer semester. Because the first 60 days of data are used to approximate the network at day 61, the effective time span for this calculation is 355 - 85 - 60 = 210 days. Also, the effects of common department affiliation are much weaker than those of shared classes, and do not alter any of our conclusions; hence we did not include them in our report.

Multivariate survival analysis of triadic closure

To examine the determinants of triadic closure, we used the Cox proportional hazards model (16) of the form

$$h(t, x_1, x_2, \ldots) = h_0(t) \exp(\beta_1 x_1 + \beta_2 x_2 + \ldots).$$

Here $h(t, x_1, x_2, \ldots)$
is instantaneous hazard, that is, the probability of event (closure) at time $t$ given that the observation with covariates $(x_1, x_2, \ldots)$ has survived to time $t$; and $h_0(t)$ is baseline hazard that describes temporal dependence of the hazard rate common to all observations. The quantity $g = \exp(\beta)$ is called a hazard ratio and means that instantaneous probability of closure increases ($g > 1$) or decreases ($g < 1$) by a factor of $g$ with a unit change in the covariate $x$ or relative to the reference category; $g = 1$ indicates that covariate $x$ has no effect on the probability of outcome. Because triadic closure is a rare event (p < 0.001), a retrospective (case-control) sampling scheme was used (17): we first sampled cases, that is, vertex pairs that transitioned to distance $d_{ij} = 2$ and subsequently formed a tie during observation days 61-270, and then matched each case with 10 controls, that is, pairs that entered the risk set ($d_{ij} = 2$) at approximately the same time as the respective case but did not develop a tie by the time the respective case did. In order to minimize possible correlations between observations, the final sample was composed of pairs that formed a maximal
independent vertex set in the dependence graph (18) of the cumulative network constructed from all pairs that exchanged e-mail during days 61-270. We estimated a number of survival regression models (not shown); the following dyadic variables were considered:

(a) Strong indirect: for each pair in the sample we compute indirect interaction strength as

$$\omega_{ij}(t) = \frac{1}{k_{ij} \tau} \sum_{q=1}^{k_{ij}} \sqrt{(m_{iq} + m_{qi})(m_{jq} + m_{qj})},$$

where $k_{ij}$ is the number of mutual neighbors possessed by vertices $i$ and $j$, and $m_{iq}$ is the number of messages from $i$ to $q$ during the period $(t - \tau, t]$. The sum $m_{iq} + m_{qi}$ is therefore the total volume of traffic between vertices $i$ and $q$ during that period. For ease of interpretation, we dichotomize this quantity such that pairs that have $\omega_{ij}(t)$ above the sample median are assigned 1 and pairs below sample median are assigned 0. The resulting binary variable indicates pairs that are indirectly strongly connected.
(b) Acquaintances: the number of mutual network neighbors less 1, at the time of sampling.
(c) Classes: the number of jointly attended classes at the time of sampling.
(d) Acquaintances*Classes: interaction effect showing whether the effect of the number of mutual acquaintances is different depending on the number of shared classes, and vice versa.
(e) Gender: male-male, female-female and female-male, the latter serving as the reference category.
(f) Same age: 1 if the absolute difference in age between the members of the pair is less than or equal to one year, 0 otherwise.
(g) Same year: 1 if the absolute difference in years in the community between the members of the pair is less than or equal to one year, 0 otherwise.
(h) Same status: 1 if both members of the pair are of the same status (Faculty, Graduate student, Undergraduate student, Staff, Other), 0 otherwise.
(i) Obstruction: 1 if no mutual acquaintance has the same status as either member of the pair, 0 otherwise.
(j) Same dormitory (undergraduate students only): 1 if both members of the pair live in the same dormitory, 0 otherwise.
(k) Americans (students only): 1 if both members of the pair have home address in the US, 0 otherwise.

The model presented in Figure 2 of the Report is for a sample of 1190 pairs of graduate and undergraduate students and contains the best combination of interesting predictors (some variables are available mostly, and others exclusively, for students). The results suggest that strongly indirectly connected pairs enjoy approximately 2.7 times higher rate of closure than pairs with weak indirect connection. Also, every additional mutual acquaintance increases the likelihood of triadic closure by a factor of 1.4, and each shared class by a factor of 1.5. However, the joint effect of mutual acquaintances and shared foci exhibits saturation, as indicated by the statistically significant, negative interaction term. For example, having 5 mutual acquaintances and sharing 1 class increases the likelihood of closure by a factor of $1.39^4 \times 1.46 \times 0.75^4 \approx 1.7$ relative to pairs with just one mutual acquaintance, instead of 5.5, which would be expected without the interaction term.
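The worked example above follows directly from the multiplicative form of the Cox model: the combined hazard ratio is the exponential of the summed log hazard ratios, each weighted by its covariate value. The Python sketch below reproduces that arithmetic using the three ratios quoted in the example (1.39, 1.46, and 0.75); the function name `combined_hazard_ratio` is hypothetical, and the baseline is taken to be a pair with one mutual acquaintance and no shared classes.

```python
import math


def combined_hazard_ratio(
    hr_acquaintance: float,
    hr_class: float,
    hr_interaction: float,
    mutual_acquaintances: int,
    shared_classes: int,
) -> float:
    """Hazard ratio relative to a pair with one mutual acquaintance
    and no shared classes, under a Cox model with an interaction term."""
    x_acq = mutual_acquaintances - 1  # 'Acquaintances' = mutual neighbors less 1
    x_cls = shared_classes
    log_hr = (
        x_acq * math.log(hr_acquaintance)
        + x_cls * math.log(hr_class)
        + x_acq * x_cls * math.log(hr_interaction)  # Acquaintances*Classes
    )
    return math.exp(log_hr)


with_interaction = combined_hazard_ratio(1.39, 1.46, 0.75, 5, 1)
without_interaction = combined_hazard_ratio(1.39, 1.46, 1.0, 5, 1)
print(f"{with_interaction:.2f} vs {without_interaction:.2f}")
```

Setting the interaction ratio to 1.0 removes the saturation effect, which is how the comparison value without the interaction term is obtained.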
References

1. A. Lenhart, L. Rainie, O. Lewis, Teenage life online: The rise of the instant-message generation and the Internet's impact on friendships and family relationships (Pew Internet & American Life Project, 2001).
2. W. Chen, J. Boase, B. Wellman, in The Internet in Everyday Life, B. Wellman, C. Haythornthwaite, Eds. (Blackwell, Oxford, 2002), pp. 74-113.
3. J. I. Copher, A. G. Kanfer, M. B. Walker, in The Internet in Everyday Life, B. Wellman, C. Haythornthwaite, Eds. (Blackwell, Oxford, 2002), pp. 263-288.
4. A. Quan-Haase, B. Wellman, J. Witte, K. N. Hampton, in The Internet in Everyday Life, B. Wellman, C. Haythornthwaite, Eds. (Blackwell, Oxford, 2002), pp. 291-324.
5. S. Jones, The Internet Goes to College: How students are living in the future with today's technology (Pew Internet & American Life Project, 2002).
6. N. K. Baym, Y. B. Zhang, M. Lin, New Media & Society 6, 299 (2004).
7. J. A. Bargh, K. Y. A. McKenna, Ann. Rev. Psych. 55, 573 (2004).
8. B. M. Gross, paper presented at the First Conference on E-mail and Anti-Spam (CEAS), Mountain View, CA, July 30-31, 2004.
9. C. Cortes, D. Pregibon, C. Volinsky, J. Comp. Graph. Stat. 12, 950 (2003).
10. J. P. Eckmann, E. Moses, D. Sergi, Proc. Natl. Acad. Sci. U.S.A. 101, 14333 (2004).
11. F. Rieke, D. Warland, R. R. v. Steveninck, W. Bialek, Spikes: Exploring the Neural Code (MIT Press, Cambridge, MA, 1997).
12. J. R. Tyler, D. M. Wilkinson, B. A. Huberman, in Communities and Technologies, M. Huysman, E. Wenger, V. Wulf, Eds. (Kluwer B.V., Deventer, The Netherlands, 2003), pp. 81-96.
13. G. Kossinets, D. J. Watts, in preparation.
14. A. V. Oppenheim, R. W. Schafer, Discrete-Time Signal Processing (Prentice-Hall, Englewood Cliffs, NJ, 1989).
15. S. Hill, D. Agarwal, R. Bell, C. Volinsky, J. Comp. Graph. Stat., in press.
16. D. W. Hosmer, S. Lemeshow, Applied survival analysis: Regression modeling of time to event data (Wiley, New York, 1999).
17. G. King, L. Zeng, Statistics in Medicine 21, 1409 (2002).
18. S. Wasserman, P. Pattison, Psychometrika 61, 401 (1996).