Supporting Online Material for




www.sciencemag.org/cgi/content/full/311/5757/88/dc1

Supporting Online Material for
Empirical Analysis of an Evolving Social Network
Gueorgi Kossinets and Duncan J. Watts*

*To whom correspondence should be addressed. E-mail: djw24@columbia.edu

This PDF file includes:
Materials and Methods
References

Published 6 January 2006, Science 311, 88 (2006)
DOI: 10.1126/science.1116869

Empirical Analysis of an Evolving Social Network
Supporting Online Material

Gueorgi Kossinets and Duncan J. Watts

Department of Sociology, and Institute for Social and Economic Research and Policy, Columbia University, 420 West 118th Street, MC 3355, New York, NY 10027, USA.

Data

Our population consists of 43,553 undergraduate and graduate students, faculty and staff at a large US university who sent and received e-mail using a university e-mail address during academic year 2003-2004. The data were collected and anonymized on our behalf by the university IT department. The dataset consists of three parts: (1) the registry of e-mail interactions obtained from the university e-mail server; (2) the table of personal attributes (status, gender, age, departmental affiliation, number of years in the community, dormitory and home zip code for undergraduate students); (3) the lists of classes attended or taught in every semester, respectively for students and instructors. For each e-mail message the time, sender, and list of recipients (but not the content) were recorded. To ensure that our data represent genuine interpersonal communication (as opposed to bulk mailings) we filtered out messages with more than 4 recipients (95% of all messages had 4 or fewer addressees). For purposes of this report, we treat each message with n recipients as n simultaneous messages each with a single recipient. After filtering, there are 14,584,423 messages exchanged by 43,553 individuals during 355 days of observation. As a privacy protection measure, all individual e-mail addresses and group identifiers (such as course numbers or department names) were encrypted; so it is possible to tell, for example, whether two anonymous individuals were in the same class together but not what class that was. Anonymization was necessary in order to qualify for an exemption from full review by the Institutional Review Board; otherwise researchers are required to obtain written consent from every human subject, which would not be feasible for a project of such a scale as ours.

All computations were performed using custom-written programs in C and Perl on a 2GHz Linux workstation with 2GB of RAM. The data (daily e-mail logs, snapshots of employee database, course registration file, lists of encrypted university and outside addresses) were made available to us as gzipped plain text files on a per-semester basis. Each installment required from 1.5 to 3.6 GB of disk space. We parsed the gzipped files directly using a library available in Perl. The computationally more intensive routines were implemented in C and the wrapping code was programmed in Perl. When the data structures were too large to fit into computer memory (for example, estimating cyclic closure bias required storing a triangular matrix of pairwise distances for approximately $9.5 \times 10^8$ vertex pairs), we used packed arrays and temporary disk files. Statistical analysis was carried out in R and Matlab. More technical details will be forthcoming in our future publications as well as in GK's doctoral dissertation. We also intend to post the programs that we developed on our web-site, in the hope that other researchers will use them and improve upon them.
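The filtering and expansion step described under Data (drop messages with more than 4 recipients, then treat each remaining message with n recipients as n single-recipient messages) can be illustrated with a short sketch. This is not the authors' C/Perl code; the `Message` record, the function name `to_dyadic`, and the toy log are assumptions made purely for illustration.

```python
from typing import Iterable, Iterator, NamedTuple, Tuple

MAX_RECIPIENTS = 4  # threshold stated in the text


class Message(NamedTuple):
    """Hypothetical log record: only time, sender and recipients are kept."""
    time: float                  # e.g. days since the start of observation
    sender: str                  # anonymized sender identifier
    recipients: Tuple[str, ...]  # anonymized recipient identifiers


def to_dyadic(messages: Iterable[Message]) -> Iterator[Tuple[float, str, str]]:
    """Drop bulk mailings and emit one (time, sender, recipient) record
    per recipient of every remaining message."""
    for msg in messages:
        if len(msg.recipients) > MAX_RECIPIENTS:
            continue  # more than 4 recipients: treated as a bulk mailing
        for recipient in msg.recipients:
            yield (msg.time, msg.sender, recipient)


# Toy log (made-up identifiers).
log = [
    Message(1.0, "a", ("b",)),
    Message(1.5, "b", ("a", "c")),
    Message(2.0, "c", ("a", "b", "d", "e", "f")),  # 5 recipients: dropped
]
dyadic = list(to_dyadic(log))
```

A generator is used here so that a large log could be streamed rather than held in memory, in the spirit of the memory constraints mentioned above.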

Relevance of e-mail data

E-mail communication is strongly correlated with other kinds of social interaction, such as face-to-face and telephone conversations (1-6). Moreover, the extent to which people use e-mail vis-à-vis other media appears to reflect their inherent sociability (2, 3, 6). Recent findings suggest that e-mail serves as much social function as face-to-face interactions or phone calls (5, 6), particularly with nearby friends (4). Instead of a trade-off between face-to-face interactions and e-mail communication, college students have been found to expand existing face-to-face relationships to include telephone and online interactions (6). Although instant messaging popularity is on the rise, recent reports estimate that e-mail accounts for 62 to 70% of students' online interactions (5, 6). While individuals may vary in their e-mail usage, both overall and in particular social situations (7), the large size of the community that we study implies a reduction to the mean in terms of both individual and dyadic behavior. We expect that by averaging over thousands of observed relationships, e-mail communication will reflect the intensity and directionality of underlying relationships within our university community.

Our data on e-mail communication have been collected from the university e-mail server, and as such provide a full record of communication between the university e-mail addresses. However, it is common for individuals to maintain multiple e-mail addresses (1, 5, 8). According to the Pew Internet Research Project survey, about 66% of college students use at least two e-mail addresses (5). On the other hand, individuals rarely use more than three e-mail addresses for personal communication (8). Typically, multiple addresses are used in order to separate social roles (professional, academic, anonymous, etc.) or specific tasks (e.g. personal communication, shopping, or registration for services), as well as for technical reasons (e.g. to circumvent institutional policies or to transfer large files). Although different roles may not always correspond to specific e-mail addresses, we find it likely that the communication within the university community that we study is largely related to the activities associated with the university and hence reflects the primary roles (statuses) of individuals. In addition, based on information from the University IT department, it seems likely that the students at the university in question may well prefer their official e-mail over free mail accounts for all kinds of personal communication. There are a number of reasons for that: (1) all students are required to use their university e-mail to receive official communication and access various services, such as libraries, course materials, etc.; (2) a university e-mail connotes prestige and status; (3) some very popular online services for undergraduates (such as facebook.com) require a college e-mail account; (4) it is easy to find people using an online university directory; (5) the university has an efficient spam-filtering system which is superior to many free services; it also provides a streamlined, advertisement-free web-interface in addition to free, convenient access to e-mail from various e-mail applications. Thus while individuals indeed tend to use multiple e-mail accounts to compartmentalize tasks and relationships, there are reasons to believe that in our dataset, the university e-mail addresses are used preferentially for university-related communication, and by extension, for various kinds of communication with other individuals at the university.

Constructing network time series from discrete dyadic interactions

Ongoing social relationships produce observable spikes of e-mail communication (9-12); therefore it is possible to create an approximation of the instantaneous social network by applying a filter (13). We approximate instantaneous strength of a relationship $w_{ij}(t, \tau)$ by the average geometric rate of bilateral e-mail exchange within a window of width $\tau$:

$$w_{ij}(t, \tau) = \sqrt{m_{ij}\, m_{ji}} \,/\, \tau,$$

where $m_{ij}$ and $m_{ji}$ are respective counts of messages from person i to person j and back during the period $(t - \tau, t]$. This parameterization allows us to recover the network at arbitrary times by including only ties with non-zero instantaneous strength $w_{ij}(t, \tau) > 0$. The geometric average serves as a conservative measure of intensity: tie strength is high if both directed links are strong; it is low if either directed link in the pair has low intensity. Therefore, a tie is present in the instantaneous network at time t if and only if there are messages in both directions during $(t - \tau, t]$. The width of the smoothing window $\tau$ effectively sets a relevancy horizon; that is, it determines which past events are relevant to the current state of the network. In addition, the frequency with which the network is measured (sampling frequency or, equivalently, sampling period) determines which events will be considered simultaneous and independent of each other.

It is important to choose the two time scales, smoothing window $\tau$ and sampling period $\delta$, appropriately. If smoothing window $\tau$ is too short, some ongoing ties will be misclassified as ties that have been terminated and then re-enacted; if $\tau$ is too large, many past interactions which are not likely to be relevant to the present state of the relationship will be nevertheless included in the calculation of relationship strength. If sampling period $\delta$ is too large, then a sequence of events may be misclassified as independent, simultaneous events; on the other hand, $\delta$ should not be chosen too small, or event history may be biased by the errors present in time measurements.
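As an illustration of the sliding-window definition above, the following sketch computes the instantaneous tie strength for every pair with messages in both directions during the window $(t - \tau, t]$. It assumes the log is a list of `(time, sender, recipient)` tuples with time in days; the function name `tie_strengths` and the toy data are hypothetical and are not taken from the authors' programs.

```python
from collections import Counter
from math import sqrt


def tie_strengths(dyadic, t, tau):
    """Return {(i, j): w_ij(t, tau)} for pairs (i < j) with non-zero strength.

    dyadic: iterable of (time, sender, recipient) records.
    A pair appears only if messages went in both directions in (t - tau, t].
    """
    counts = Counter()
    for time, sender, recipient in dyadic:
        if t - tau < time <= t and sender != recipient:
            counts[(sender, recipient)] += 1  # directed count m_ij

    strengths = {}
    for (i, j), m_ij in counts.items():
        m_ji = counts.get((j, i), 0)
        if i < j and m_ji > 0:
            # geometric mean of the two directed counts, per unit time
            strengths[(i, j)] = sqrt(m_ij * m_ji) / tau
    return strengths


# Toy log: a and b exchange messages both ways, a writes to c with no reply.
dyadic = [
    (10.0, "a", "b"), (12.0, "a", "b"),
    (20.0, "b", "a"), (20.0, "b", "a"),
    (30.0, "a", "c"),
    (100.0, "b", "a"),
]
w_day60 = tie_strengths(dyadic, t=60, tau=60)
w_day130 = tie_strengths(dyadic, t=130, tau=60)
```

In the toy data only the pair (a, b) has a tie at day 60, because c never replies; at day 130 the window holds a single one-directional message, so the network is empty.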
We use $\tau$ = 60 days because the rate of new tie formation stabilizes after approximately 60 days since the beginning of observation, which suggests that 60 days is close to the characteristic tie formation scale for our network. This choice is supported by analyzing the distribution of dyadic response times (about 90% of pooled response times are within 60 days, accounting for censored observations). The edge set of the instantaneous network at any point in time therefore consists of all pairs of individuals that exchanged one or more messages within the past 60 days. With this choice of $\tau$, the first 60 days of data collection are used to estimate the network at day 61, so the effective span of the data is 295 days (day 61 through day 355). We also checked that our results are robust for $\tau$ = 30 and 90 days. Figures 1, 2, and 4 were created using days 61 through 270, that is, not including the Summer break, because of a substantial drop in activity associated with individuals leaving the university for the holidays and also because there are very few regular courses offered during the summer.

The appropriate sampling period may be calculated by applying the Nyquist sampling theorem (14) to the maximum rate of tie formation. Although there are a few periods of high activity in our network (for example, at the beginning of the Spring semester, when the changing class attendance pattern leads to formation of many new social ties), we estimated that sampling for structural changes every $\delta$ = 1 day produced a reasonable approximation, taking into account the natural periodicity of human activities. We checked this assumption by comparing network time series obtained with $\delta$ = 1 hour and $\delta$ = 1 day, finding qualitatively similar results. We use daily measurements to calculate the parameters of tie formation and hourly resolution for the multivariate survival analysis of triadic closure, to improve model sensitivity.

We note that there are other smoothing methods available for constructing the network from dyadic interaction data; for example, the exponentially weighted moving average filter (9, 15). However, with respect to the time of tie activation in unweighted networks, the exponentially weighted moving average and the sliding window filter produce identical results if calibrated appropriately (13).

Cyclic and focal closure

To produce Figure 1, we computed geodesic distance $d_{ij}$ for all pairs of individuals in the network from day 61 through 270 (Fall and Spring semesters) with a 1-day resolution, and at each step identified ties not present in the network on the previous day. The average per-day empirical probability of a new tie as a function of network distance $d_{ij}$ and the number of shared foci $s_{ij}$ is computed as

$$P_{new}(d_{ij}, s_{ij}) = \sum_{t=61}^{270} M_{new}(d_{ij}, s_{ij}, t) \Big/ \sum_{t=61}^{270} M(d_{ij}, s_{ij}, t),$$

where t is time in days, $M(d_{ij}, s_{ij}, t)$ is the number of vertex pairs in category $(d_{ij}, s_{ij})$ at time t, and $M_{new}(d_{ij}, s_{ij}, t)$ is the number of new ties in this category since time $t - 1$. Summer (85 days) was excluded from this calculation as there are very few regular courses offered during the Summer semester. Because the first 60 days of data are used to approximate the network at day 61, the effective time span for this calculation is 355 - 85 - 60 = 210 days. Also, the effects of common department affiliation are much weaker than those of shared classes, and do not alter any of our conclusions; hence we did not include them in our report.

Multivariate survival analysis of triadic closure

To examine the determinants of triadic closure, we used the Cox proportional hazards model (16) of the form

$$h(t, x_1, x_2, \ldots) = h_0(t) \exp(\beta_1 x_1 + \beta_2 x_2 + \ldots).$$

Here $h(t, x_1, x_2, \ldots)$
is the instantaneous hazard, that is, the probability of event (closure) at time t given that the observation with covariates $(x_1, x_2, \ldots)$ has survived to time t; and $h_0(t)$ is the baseline hazard that describes temporal dependence of the hazard rate common to all observations. The quantity $g_i = \exp(\beta_i)$ is called a hazard ratio and means that instantaneous probability of closure increases ($g_i > 1$) or decreases ($g_i < 1$) by a factor of $g_i$ with a unit change in the covariate $x_i$ or relative to the reference category; $g_i = 1$ indicates that covariate $x_i$ has no effect on the probability of outcome.

Because triadic closure is a rare event (p<0.001), a retrospective (case-control) sampling scheme was used (17): we first sampled cases (vertex pairs that transitioned to distance $d_{ij} = 2$ and subsequently formed a tie during observation days 61 through 270), and then matched each case with 10 controls (pairs that entered the risk set, $d_{ij} = 2$, at approximately the same time as the respective case but did not develop a tie by the time the respective case did). In order to minimize possible correlations between observations, the final sample was composed of pairs that formed a maximal independent vertex set in the dependence graph (18) of the cumulative network constructed from all pairs that exchanged e-mail during days 61 through 270.

We estimated a number of survival regression models (not shown); the following dyadic variables were considered:

(a) Strong indirect: for each pair in the sample we compute indirect interaction strength as

$$\omega_{ij}(t) = \frac{1}{k_{ij}\,\tau} \sum_{q=1}^{k_{ij}} \sqrt{(m_{iq} + m_{qi})(m_{jq} + m_{qj})},$$

where $k_{ij}$ is the number of mutual neighbors possessed by vertices i and j, and $m_{iq}$ is the number of messages from i to q during the period $(t - \tau, t]$. The sum $m_{iq} + m_{qi}$ is therefore the total volume of traffic between vertices i and q during that period. For ease of interpretation, we dichotomize this quantity such that pairs that have $\omega_{ij}(t)$ above the sample median are assigned 1 and pairs below sample median are assigned 0. The resulting binary variable indicates pairs that are indirectly strongly connected.
(b) Acquaintances: the number of mutual network neighbors less 1, at the time of sampling.
(c) Classes: the number of jointly attended classes at the time of sampling.
(d) Acquaintances*Classes: interaction effect showing whether the effect of the number of mutual acquaintances is different depending on the number of shared classes, and vice versa.
(e) Gender: male-male, female-female and female-male, the latter serving as the reference category.
(f) Same age: 1 if the absolute difference in age between the members of the pair is less or equal to one year, 0 otherwise.
(g) Same year: 1 if the absolute difference in years in the community between the members of the pair is less or equal to one year, 0 otherwise.
(h) Same status: 1 if both members of the pair are of the same status (Faculty, Graduate student, Undergraduate student, Staff, Other), 0 otherwise.
(i) Obstruction: 1 if no mutual acquaintance has the same status as either member of the pair, 0 otherwise.
(j) Same dormitory (undergraduate students only): 1 if both members of the pair live in the same dormitory, 0 otherwise.
(k) Americans (students only): 1 if both members of the pair have home address in the US, 0 otherwise.

The model presented in Figure 2 of the Report is for a sample of 1190 pairs of graduate and undergraduate students and contains the best combination of interesting predictors (some variables are available mostly, and others exclusively, for students). The results suggest that strongly indirectly connected pairs enjoy approximately 2.7 times higher rate of closure than pairs with weak indirect connection. Also, every additional mutual acquaintance increases the likelihood of triadic closure by a factor of 1.4, and each shared class by a factor of 1.5. However, the joint effect of mutual acquaintances and shared foci exhibits saturation, as indicated by the statistically significant, negative interaction term. For example, having 5 mutual acquaintances and sharing 1 class increases the likelihood of closure by a factor of $1.39^4 \times 1.46 \times 0.75^4 \approx 1.7$ relative to pairs with just one mutual acquaintance, instead of 5.5, which would be expected without the interaction term.
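The arithmetic behind the saturation example can be checked with a few lines of code. The sketch below multiplies the hazard ratios quoted in the example (1.39 per additional mutual acquaintance, 1.46 per shared class, 0.75 per unit of the Acquaintances*Classes interaction); the function name and its argument names are hypothetical, and the baseline is assumed to be a pair with one mutual acquaintance and no shared classes.

```python
def combined_hazard_ratio(extra_acquaintances, classes,
                          g_acq=1.39, g_class=1.46, g_inter=0.75):
    """Multiply hazard ratios for a pair relative to the assumed baseline
    (one mutual acquaintance, no shared classes).

    extra_acquaintances: mutual acquaintances less 1 (variable (b) above).
    classes: number of shared classes (variable (c) above).
    Setting g_inter=1.0 switches the interaction term off.
    """
    interaction = extra_acquaintances * classes  # variable (d) above
    return (g_acq ** extra_acquaintances
            * g_class ** classes
            * g_inter ** interaction)


# 5 mutual acquaintances (4 beyond the first) and 1 shared class.
with_interaction = combined_hazard_ratio(4, 1)
without_interaction = combined_hazard_ratio(4, 1, g_inter=1.0)
print(round(with_interaction, 1), round(without_interaction, 1))
```

With the interaction term the product comes to about 1.7; setting its ratio to 1 gives about 5.5, which matches the two figures quoted in the text.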

References

1. A. Lenhart, L. Rainie, O. Lewis, Teenage life online: The rise of the instant-message generation and the Internet's impact on friendships and family relationships (Pew Internet & American Life Project, 2001).
2. W. Chen, J. Boase, B. Wellman, in The Internet in Everyday Life, B. Wellman, C. Haythornthwaite, Eds. (Blackwell, Oxford, 2002), pp. 74-113.
3. J. I. Copher, A. G. Kanfer, M. B. Walker, in The Internet in Everyday Life, B. Wellman, C. Haythornthwaite, Eds. (Blackwell, Oxford, 2002), pp. 263-288.
4. A. Quan-Haase, B. Wellman, J. Witte, K. N. Hampton, in The Internet in Everyday Life, B. Wellman, C. Haythornthwaite, Eds. (Blackwell, Oxford, 2002), pp. 291-324.
5. S. Jones, The Internet Goes to College: How students are living in the future with today's technology (Pew Internet & American Life Project, 2002).
6. N. K. Baym, Y. B. Zhang, M. Lin, New Media & Society 6, 299 (2004).
7. J. A. Bargh, K. Y. A. McKenna, Ann. Rev. Psych. 55, 573 (2004).
8. B. M. Gross, paper presented at the First Conference on E-mail and Anti-Spam (CEAS), Mountain View, CA, July 30-31, 2004.
9. C. Cortes, D. Pregibon, C. Volinsky, J. Comp. Graph. Stat. 12, 950 (2003).
10. J. P. Eckmann, E. Moses, D. Sergi, Proc. Natl. Acad. Sci. U.S.A. 101, 14333 (2004).
11. F. Rieke, D. Warland, R. R. v. Steveninck, W. Bialek, Spikes: Exploring the Neural Code (MIT Press, Cambridge, MA, 1997).
12. J. R. Tyler, D. M. Wilkinson, B. A. Huberman, in Communities and Technologies, M. Huysman, E. Wenger, V. Wulf, Eds. (Kluwer B.V., Deventer, The Netherlands, 2003), pp. 81-96.
13. G. Kossinets, D. J. Watts, in preparation.
14. A. V. Oppenheim, R. W. Schafer, Discrete-Time Signal Processing (Prentice-Hall, Englewood Cliffs, NJ, 1989).
15. S. Hill, D. Agarwal, R. Bell, C. Volinsky, J. Comp. Graph. Stat., in press.
16. D. W. Hosmer, S. Lemeshow, Applied survival analysis: Regression modeling of time to event data (Wiley, New York, 1999).
17. G. King, L. Zeng, Statistics in Medicine 21, 1409 (2002).
18. S. Wasserman, P. Pattison, Psychometrika 61, 401 (1996).