Network Structure & Information Advantage




Sinan Aral, Stern School of Business, NYU & Sloan School of Management, MIT. sinana@mit.edu
Marshall Van Alstyne, Boston University School of Management & Sloan School of Management, MIT. mva@bu.edu

We investigate the long held but empirically untested assumption that diverse networks drive performance by providing access to novel information. We build and validate an analytical model of information diversity, develop theory linking network structure to the distribution of novel information among actors and their performance, and test our theory using a unique ten month panel of email communication patterns, message content and performance data from a medium sized executive recruiting firm. While our theory and results demonstrate that network structures predict performance due to their impact on access to information, we also find important theoretically driven non-linearities in these relationships. Novel and diverse information are increasing in network size and network diversity, but with diminishing marginal returns. There are also diminishing marginal productivity returns to novel information, consistent with theories of cognitive capacity, bounded rationality, and information overload. Network diversity contributes to performance even when controlling for the performance effects of novel information, suggesting additional benefits to diverse networks beyond those conferred through information advantage. Our theory and results suggest subtle nuances in relationships between networks, information and economic performance, and the methods and tools developed are replicable, opening a new line of inquiry into these relationships.

Keywords: Social Networks, Information Economics, Information Content, Information Diversity, Network Size, Network Diversity, Performance, Productivity, Information Work.
We are grateful to Erik Brynjolfsson, Emilio Castilla, Ezra Zuckerman, Arun Sundararajan, Erol Pekoz, and seminar participants at the Workshop on Information Systems Economics, the Sunbelt Social Networks Conference, the International Conference on Network Science, the Academy of Management, NYU and MIT for valuable comments, and to the National Science Foundation (Career Award IIS-9876233 and grant IIS-0085725), Cisco Systems, France Telecom and the MIT Center for Digital Business for generous funding. We thank Tim Choe, Petch Manoharn and Jun Zhang for their tireless research assistance.

Electronic copy available at: http://ssrn.com/abstract=958158

1. Introduction

A growing body of evidence links the structural properties of individuals' and groups' networked relationships to various dimensions of economic performance. However, the mechanisms driving this linkage, thought to be related to the value of the information flowing between connected actors, are typically inferred, and rarely empirically demonstrated. As a consequence, our understanding of how and why social structure explains economic outcomes remains underdeveloped, and competing explanations of the causal mechanisms underlying structural advantage proliferate. For instance, we know little about the relative importance of information and control benefits to social structure, and numerous puzzles remain concerning the situational importance of network cohesion and brokerage (Burt 1992, Coleman 1988), and the tradeoffs between the knowledge and power benefits derived from network structures (Reagans & Zuckerman 2006). At the heart of these puzzles lie foundational questions about the degree to which social structure creates intermediate information benefits, and how different network topologies enable these benefits. Comprehensive theories of the structure-performance relationship require a more thorough examination of the intermediate mechanisms through which social structure affects economic advantage. The strategy of this paper is to narrowly examine one of these mechanisms, the relationship between network structure and information benefits, in detail.

One of the most prominent mechanisms theorized to drive the relationship between social structure and performance is the existence of information benefits to network structure. According to this argument, actors in favorable structural positions enjoy social and economic advantages based on their access to specific types of information. Burt (1992) convincingly shows that individuals with structurally diverse networks (networks low in (a) cohesion, and (b) structural equivalence) are more successful in terms of wages, promotion, job placement, and creativity (Burt 2004a). He argues that these performance differentials can be explained in part by actors' access to diverse pools of knowledge, and their ability to efficiently gather non-redundant information.^1 Aral, Brynjolfsson and Van Alstyne (2006) demonstrate that structural diversity is associated with higher levels of economic productivity for task-based information workers. These studies, and numerous others, infer that network diversity is associated with performance in part because diverse contacts provide access to novel information. Novel information is thought to be valuable due to its local scarcity. Actors with scarce, novel information in a given network neighborhood are better positioned to broker opportunities, use information as a commodity, or apply information to problems that are intractable given local knowledge.

While this argument is intuitively appealing, there are important theoretical and empirical reasons to be skeptical about whether structural diversity drives performance by providing access to diverse, novel information. Simultaneous consideration of within channel novelty (the novelty of information received from the same alter over time) and across channel novelty (the relative novelty of information received from different alters over time) may lead to indeterminate or nonlinear predictions about the relationship between structural diversity and access to novel information. If the information actors receive through diverse networks tends to have high topic variety across channels but low topic variety within channels, it could be that diverse networks provide less total novel information on average or that there are diminishing marginal contributions to novelty from increasing structural diversity. In addition, the greater structural awareness of actors in constrained networks (Coleman 1988) may enable alters to differentiate their information flows from one another, allowing them to avoid transmitting redundant information. There may also be non-information based benefits driving the relationship between network diversity and performance, or limits to the benefits of novel information itself.

^1 Coleman's (1988) argument, that focused information from cohesive networks provides more precise signals of actors' environments, also assumes that cohesive networks provide focused (while diverse networks provide diverse) information.
Although theories of the value of information and empirical evidence on the relationship between network structure and performance exist, little theory, and almost no empirical evidence addresses how network structure influences the nature of the information distributed across a network: the network's information structure.^2 To build theory relating network structure to information structure we investigate how topological properties of individuals' network positions (network size and network diversity) impact the abundance and diversity of the information they receive and distribute, and whether this in turn explains productivity. We test the implications of our theory using empirical evidence from a ten month panel of email communication patterns and message content among information workers in a medium sized executive recruiting firm.

Our findings indicate that (1) the total amount of novel information and the diversity of information flowing to actors are increasing in actors' network size and network diversity, but (2) diminishing marginal returns set in at two levels. Network size is a concave predictor of information diversity, and there are diminishing marginal productivity returns to novel information. Part of the explanation for the decreasing marginal contribution of network size to information diversity is that (3) network diversity is increasing in network size, but with diminishing marginal returns. As actors establish relationships with a finite set of possible contacts in an organization, the probability that a marginal relationship will be non-redundant, and provide access to novel information, decreases as possible alters in the network are exhausted. We also find that (4) network diversity contributes to performance even when controlling for the positive performance effects of access to novel information, suggesting additional benefits to network diversity beyond those conferred through information advantage. Surprisingly, (5) traditional demographic and human capital variables (e.g. age, gender, industry experience, education) have little effect on access to diverse information, highlighting the importance of network structure for information advantage.

These results represent some of the first evidence on the relationship between network structure and information content and reveal subtle nuances and non-linearities in their relationships. Our findings advance our understanding of the economic value of information and the intermediate mechanisms driving the relationship between social structure and productivity. Our methods for analyzing network structure and information content in email data are replicable, opening a new line of inquiry into the relationship between networks, information and performance.

^2 The term information structure is used in the economics literature to denote the mapping of states of nature to signals, i.e. news, received by a decision maker (see Arrow 1985).

2. Theory

2.1. Network Structure & Information Advantage: A Critical Inference

The assumption that network structure influences the distribution of information and knowledge in social groups (and thus characteristics of the information to which individuals have access) underpins a significant amount of theory linking social structure to economic outcomes. Granovetter (1973) argues that topological properties of friendship networks, constrained by basic norms of social interaction, empower weak ties to deliver information about socially distant opportunities more effectively than strong ties. He posits that contacts maintained through weak ties typically "move in circles different from our own and thus have access to information different from that which we receive [and are therefore] the channels through which ideas, influence, or information socially distant from ego may reach him" (Granovetter 1973: 1371). Burt (1992) argues that networks rich in structural diversity confer information benefits by providing access to diverse perspectives, ideas and information. As information in local network neighborhoods tends to be redundant, structurally diverse contacts provide channels through which novel information flows to individuals from distinct pools of social activity. Redundant information is less valuable because many actors are aware of it at the same time, reducing opportunities associated with its use. Structural redundancy is also inefficient because actors incur costs to maintain redundant contacts while receiving no new information from them (Burt 1992). In contrast, exposure to diverse ideas, perspectives, and solutions is thought to enable information arbitrage, the creation of new innovations, and access to economic opportunities. Hargadon and Sutton (1997) describe how engineers use their structural positions between diverse engineering and scientific disciplines to broker the flow of information and knowledge from unconnected industrial sectors, creating novel design solutions. As Burt (2004b) puts it, "creativity is an import-export game, not a creation game."

The economic value of information in a network stems from its uneven distribution across actors and resides in pockets of distinct and diverse pools of information and expertise in local network neighborhoods. Actors with access to these diverse pools benefit from "disparities in the level and value of particular knowledge held by different groups" (Hargadon & Sutton 1997: 717), and one of the key mechanisms through which network structures are theorized to improve performance is through access to novel, non-redundant information (Burt 1992).

While the argument that network structures influence performance through their effect on the distribution of information is intuitive and appealing, the vast majority of empirical work on networks and information advantage remains "content agnostic" (Hansen 1999: 83), and infers the relationship between network structure and information structure from evidence of a link between networks and performance (e.g. Sparrowe et al. 2001, Cummings & Cross 2003). Reagans & Zuckerman (2001) infer that productivity gains from the external networks of corporate R&D teams are due in part to information benefits, a broader array of ideas and opportunities, and access to different skills, information and experience. Burt (1992, 2004a) also makes this empirical leap, inferring that the observed co-variation of wages, promotion, job placement, and creativity with network diversity is due in part to access to diverse and novel information.

Others equate network content with the social function of relationships. For example, Burt (2000: 45) refers to network content as the substance of relationships, qualities defined by distinctions such as friendship versus business versus authority. In one of the first studies to explore this type of network content, Podolny & Baron (1997) showed that while cohesive ties are beneficial in buy-in networks and for those contacts that have control over the fate of employees, structural holes are important for collecting advice and information. We take a different view of network content, focused on the subject matter of communication rather than the social function of relationships.

The limited research that does empirically examine networks and information content has either focused on identifying tie and network characteristics that facilitate effective knowledge transfers, or on types of information (e.g. complex or simple; tacit or explicit) most effectively transferred through different types of ties. As a result, the fundamental assumption that structurally diverse network contacts provide access to diverse and novel information remains unexplored.
For example, several studies examine how characteristics of dyadic relationships, like the strength of ties, impact the effectiveness of knowledge transfer, and how knowledge transfer processes in turn affect performance (Granovetter 1973, Uzzi 1996, 1997, Hansen 1999). These studies infer the impact of network structure on the effectiveness of knowledge sharing from the strength of individual dyadic relationships. Reagans & McEvily (2003) extend this work by simultaneously examining the effects of tie strength and network structure on the ease of transferring knowledge between individuals. These studies either examine the strength of dyadic ties or the impact of network structure on discrete dyadic information transfer events, instead of on the information actors receive from all their network contacts in concert. Others examine characteristics of the information transferred across different types of ties. For example, Hansen (1999, 2002) and Uzzi (1996, 1997) explore the degree to which knowledge being transferred is tacit or codifiable, simple or complex, and related or unrelated to a focal actor's knowledge.

To complement this research, we ask a related, yet fundamentally different question: Do networks affect the acquisition of diverse and novel information and to what extent does this intermediate mechanism predict performance? In pursuing this question, we undertake two fundamental departures from the current literature. First, by exploring the relative information content differences among different network contacts, we explore an actor's information diversity in relation to the body of information available in the entire network. Second, we focus on subject matter. Rather than characterizing the simplicity or complexity of information, or the degree to which knowledge is codifiable or tacit, we explore the topical content being discussed. Both simple and complex information can be either focused or diverse in terms of subject matter. Complexity and codifiability do not describe whether information is topically similar or dissimilar, or novel relative to a larger body of knowledge. As the theoretical mechanism linking structure to performance through information rests on the relative novelty of the information to which actors have access, these two departures from previous research are critical to effectively exploring the dimensions of information theorized to drive value in networks.

2.2. A Need for Skepticism

More detailed theoretical and empirical examinations of information advantage are warranted because it is not obvious that network diversity necessarily delivers more novel information or that novel information contributes to performance. Four arguments highlight the need for skepticism: the first two examine whether diverse networks provide access to more novel information; the second two show that even with new information, productivity need not rise.

First, consider the distinction between novelty across channels and novelty within channels. A simple model demonstrates that although a diverse network of weak ties ("diverse-weak") can provide access to more novel information than a constrained network of strong ties ("constrained-strong"), the converse is also possible. This indeterminacy arises from a basic tradeoff: while constrained ties favor redundant information, they are also typically stronger (Granovetter 1973, Burt 1992), implying greater bandwidth. Weak ties are by their nature lower bandwidth conduits for information (Granovetter 1973, Burt 1992). Information flows less frequently (Granovetter 1973), with lower complexity (Hansen 1999) and detail (Uzzi 1999), and along fewer topical dimensions (see Granovetter 1973: p 1361) through weak ties. This implies that the total amount of novel information flowing within each channel in a diverse network could be lower than the amount of novel information flowing within each channel in a constrained network, where stronger ties enable thicker communication between actors.^3 An ego might therefore receive greater novelty from either strong yet constrained ties or weak yet diverse ties depending on the relative importance of bandwidth and bias in determining the type of content received.

To illustrate, let $E$ represent the event that an ego encounters new information through a new link. If $n$ is a subset of all possible topics $T$ ($n < T$), then an actor receives biased content if she is more likely to receive news on one set of topics than another ($p_1 > p_2$), where $p_1$ and $p_2$ are the probabilities of receiving information from topics $n_1$ and $n_2$. More precisely, a person with biased content has an asymmetric distribution over the likelihood of seeing different topics. Note that basic laws of probability require $n p_1 + (T - n) p_2 = 1$. Since the likelihood of encountering new information depends on what ego has learned from existing links, let $L$ represent current contacts.^4 Then $P[E^c]$, the probability of encountering novel information from a new constrained link, can be described as:^5

$$P[E^c] = p_1 n (1 - p_1)^L + p_2 (T - n)(1 - p_2)^L \qquad [1]$$

Unbiased content implies $p_1 = p_2 = p$, so that Equation 1 reduces to $P[E^D] = pT(1 - p)^L$, where $E^c$ and $E^D$ represent the events of forging a constrained and a diverse link and getting new information.^6 To model the more frequent communication of the higher bandwidth tie, let $B$ represent additional chances to cover new material over the constrained link during any given interval. Simplifying with $n_2 = T - n_1$ gives:

$$P[E_L^C] = \sum_{l=L}^{L+B} P[E_l^c] = p_1 n_1 (1 - p_1)^L + p_2 n_2 (1 - p_2)^L + \dots + p_1 n_1 (1 - p_1)^{L+B} + p_2 n_2 (1 - p_2)^{L+B} \qquad [2]$$

To see that a constrained-strong tie could offer more novel information, let bias be negligible with $p_1 = p_2 + \varepsilon$ so that $P[E^c] \approx P[E^D]$. Then choose any $B$ large enough such that the following inequality is strict:

$$P[E_L^c] + P[E_{L+1}^c] + \dots + P[E_{L+B}^c] \approx P[E_L^D] + P[E_{L+1}^D] + \dots + P[E_{L+B}^D] > P[E_L^D] \qquad [3]$$

This demonstrates the original claim. When the advantage of bandwidth swamps the disadvantage of bias, an ego always prefers the constrained-strong tie to the diverse-weak tie to increase the chances of encountering novel information.

To see when a diverse-weak tie could be preferred, let a "group think" network spread its bandwidth only over the subset of $n$ topics with probability $p_1 = B/T$ (such bias necessarily constrains $p_2$ to be negligible, on the order of $\varepsilon$). For ease of simplification, let $n = T/B$. Then algebra reduces the relative probabilities to:

$$P[E_L^c] = \left(1 - \frac{B}{T}\right)^L < \left(1 - \frac{1}{T}\right)^L = P[E_L^D] \qquad [4]$$

This alternative case demonstrates the counterclaim. Although $P[E_L^C] = P[E_L^c] + \dots + P[E_{L+B}^c]$ and increasing $B$ adds more terms to $P[E_L^C]$ and none to $P[E_L^D]$, it also causes each term to approach 0 faster. No matter how large the bandwidth on constrained ties, there always exists a fixed number of links $L$ such that link $L+1$ should be an unconstrained tie. When the disadvantage of bias swamps the advantage of bandwidth, an ego always prefers the diverse-weak tie to the constrained-strong tie to increase chances of encountering novel information.

^3 Theoretical arguments concerning network diversity and novel information have thus far focused almost exclusively on the relative diversity of the information received across alters in a network.

^4 More precisely, $l$ represents an information exchange with an existing link. In probabilistic terms it is a sample on link $l$ such that ego receives information on a given topic $n$ with probability $p$ from each sample, making the likelihood of receiving new information a function of the number of samples (or analogously, the thickness of the communication channel).
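The two limiting cases can also be checked numerically. The Python sketch below is an illustration only, not part of the original analysis: it evaluates Equation 1, its unbiased reduction, and the sum in Equation 2 for assumed parameter values (100 topics, and the specific probabilities shown), and the function names are hypothetical. It then searches for the smallest number of existing links at which a single diverse link offers more expected novelty than a constrained link with B extra samples.

```python
def p_new(p1, p2, n, T, L):
    """Equation 1: chance that one more sample yields a topic unseen in L prior samples."""
    return p1 * n * (1 - p1) ** L + p2 * (T - n) * (1 - p2) ** L


def p_new_diverse(T, L):
    """Unbiased link: p1 = p2 = 1/T, so Equation 1 reduces to (1 - 1/T)**L."""
    p = 1.0 / T
    return p_new(p, p, 0, T, L)


def p_new_constrained(p1, p2, n, T, L, B):
    """Equation 2: sum of Equation 1 over samples L, L+1, ..., L+B on a strong tie."""
    return sum(p_new(p1, p2, n, T, k) for k in range(L, L + B + 1))


def first_link_favoring_diverse(p1, p2, n, T, B, max_links=10_000):
    """Smallest L at which a diverse link beats a constrained link with B extra samples."""
    for L in range(max_links):
        if p_new_diverse(T, L) > p_new_constrained(p1, p2, n, T, L, B):
            return L
    return None


T = 100  # assumed number of topics

# Case 1: negligible bias (p1 = p2 + epsilon), so bandwidth dominates.
n1 = 50
p1 = 0.0101
p2 = (1 - n1 * p1) / (T - n1)  # enforces n*p1 + (T - n)*p2 = 1
bandwidth_wins = p_new_constrained(p1, p2, n1, T, L=10, B=5) > p_new_diverse(T, L=10)

# Case 2: "group think" bias with p1 = B/T on n = T/B topics and p2 = 0.
B = 10
crossover = first_link_favoring_diverse(B / T, 0.0, T // B, T, B)

print(bandwidth_wins, crossover)
```

Under these assumed values the first case favors the constrained-strong tie, while in the second case the search returns a finite crossover, in line with the claim that some link L+1 should eventually be an unconstrained tie.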
While an enormous range of intermediate cases span these two extremes, conditions exist when a person could always prefer one or the other type of link depending on bias, bandwidth, and the number of links already present.

^5 Since our purpose is illustrative rather than proof theoretic, we refrain from presenting non-essential primitives and assumptions here and present the derivation of Equation 1 in Appendix A.

Second, greater structural awareness of actors in constrained networks may enable them to differentiate their information flows and avoid transmitting redundant information. Prior research suggests that actors in constrained networks are more aware of other actors, what they know and whom they know. Coleman's (1988) argument about the value of network closure relies in part on actors' awareness of the knowledge of others in their immediate network. Information exchange in constrained networks may therefore exhibit greater specialization as actors are more aware of the information flowing to and from other actors in the network. Actors may avoid transmitting repetitive information knowing that such information is flowing to their contacts from others in the network. For example, two immediate subordinates working on a portfolio of projects for a manager may divide their information flows across subjects to maximize the value of their limited communication time with the manager. Such optimization may be more likely in organizational settings where time is scarce and information is critical to work.

Third, other mechanisms can explain the observed relationship between network diversity and performance. Network contacts could provide resources other than information (e.g. Podolny & Baron 1997), there could be power or control benefits to network structure independent of information flows (e.g. Burt 1992), and structural diversity could reduce dependence, place individuals in favorable trading relationships (e.g. Emerson 1962) or entitle them to benefits from informal reciprocity (e.g. Cook, Emerson & Gilmore 1983). These alternate mechanisms could also explain the link between structural diversity and performance without any prediction concerning actors' information access. Indeed, our empirical results in Section 4.3 suggest that non-information benefits to network structure also affect productivity.
Finally, several fundamental results from information economics show that complex nonlinearities in the value of information affect the quality of decision making. Arrow (1985) demonstrates that the expected payoffs from decisions about uncertain events are concave in the amount of information the decision maker obtains, implying diminishing marginal returns to more information. As measured by decision relevance, value only increases when new information leads to different and better decisions (Arrow 1985, Hirshleifer 1973). Information is novel if it provides an alternate perspective on a known topic or knowledge of an altogether new topic. As new information on known topics accumulates, beliefs tend to converge on a particular view of the world, making further confirmation unnecessary. Expected convergence under Bayes' Rule, for example, exhibits clear diminishing returns such that, beyond some threshold, more news has no more value. As new information on new topics accumulates, value is likely to exhibit diminishing marginal returns due to decision irrelevance. As actors' information space becomes disparate, ideas are less likely to connect in complementary ways and each bit of information is less likely to be relevant to the space of decisions and actions the actor is interested in. We find evidence of diminishing marginal returns to novel information in our own theoretical model above, and in our empirical analysis below.

Collectively, these arguments suggest that non-linearities may exist in relationships between networks, information and performance, and they help explain the current lack of empirical evidence relating novel information to performance, evidence we seek to provide.

^6 The likelihood of encountering novel information (for both constrained and unconstrained ties) decreases strictly and asymptotically toward 0 with each additional tie $L$. This exactly mirrors the pattern we observe empirically as shown later in Figure 5.

2.3. Network Determinants of Information Advantage

Two network characteristics in particular are theorized to drive access to diverse, novel information: network size and network diversity. These characteristics are fundamental because they represent the two dimensions of structure most directly related to information acquisition. As Burt (1992: 16) argues, "everything else constant, a large, diverse network is the best guarantee of having a contact present where useful information is aired."

Network Size. The size of $i$'s network ($S_i$) is simply the number of contacts with whom $i$ exchanges at least one message.
Size is the most familiar network characteristic related to information benefits and is a good proxy for a variety of characteristics, like degree centrality, betweenness centrality and network reach, which describe the breadth and range of actors' networks (see Burt 1992: 1). In our data, network size is significantly correlated with degree centrality (ρ = .70; p < .001), betweenness centrality (ρ = .77; p < .001), and reach (ρ = .56; p < .001), demonstrating its value as a proxy for network breadth.

The greater the size of an actor's network, the more likely she is to have access to more information and to multiple social circles, increasing the diversity of her information. However, size may not matter if each additional contact is embedded in the same social circles, biasing the information she receives. Network diversity may therefore be more important in providing access to diverse information.

Network Diversity. Network diversity determines the number of non-redundant pools of information to which an actor is connected and therefore the channels through which new, diverse information might flow. Network diversity describes the degree to which contacts are structurally non-redundant, and there are both first order and second order dimensions of redundancy, as shown in Figure 1. In the first order, direct contacts can be connected to each other. Individuals who are in contact are likely to share information and be aware of the same opportunities, ideas and expertise. Formally, networks in which contacts are highly connected are termed cohesive. In the second order, contacts in a network can themselves be connected to the same people, connecting the focal actor indirectly to redundant sources of information. Contacts that are themselves connected to the same people are termed structurally equivalent.

Figure 1. Structurally diverse networks are low in a) cohesion and b) structural equivalence. Actor A has two unconnected contacts which display no structural equivalence, while B has two redundant contacts that are connected and maximally structurally equivalent. (Panels A and B each show first order direct contacts and second order indirect contacts.)

We measure redundancy in the first order of direct contacts by the lack of constraint in actors' networks, and in the second order by the average structural equivalence of actors' contacts.
We define the constraint $C_i$ (Burt 1992: 55)⁷ of an actor's network as the degree to which an individual's contacts are connected to each other, such that

$$C_i = \sum_j \Big( p_{ij} + \sum_q p_{iq}\, p_{qj} \Big)^2, \quad q \neq i, j;$$

and the structural diversity $D_i$ of an actor's network as $1 - C_i$. We use the standard definition of the structural equivalence of two actors, measured as the Euclidean distance of their contact vectors.⁸

In our setting, we expect the disadvantage of bias to swamp the advantage of bandwidth. Interviews indicate that the dimensionality of information content in executive recruiting is limited (in the parlance of our model, T, the space of topics, is small), meaning thicker channels are not as necessary to communicate information on more topics. Therefore, as individuals communicate with more contacts, and as individuals' networks connect them to actors that are themselves unconnected and structurally non-equivalent, we expect the information they receive to be more diverse and we expect them to receive more total novel information:

H1: Network size and network diversity are positively associated with receiving more diverse information and less redundant information.

While a greater number of contacts are likely to provide access to more diverse, non-redundant information, the probability that an additional contact will have novel information is likely decreasing in the size of an individual's network. This expectation is a direct result of our model and is also supported by prior empirical evidence on network formation. Social networks tend to cluster into homophilous cliques (for a review see McPherson, Smith-Lovin, & Cook 2001). Since individuals usually make connections through contacts they already have, in bounded networks the likelihood that a marginal contact will be redundant should increase in the number of people already known.⁹ As actors establish relationships with a finite set of alters, the probability that a marginal relationship will be structurally non-redundant should decrease as possible alters in the network are exhausted. We therefore expect marginal increases in information diversity and network diversity are decreasing in network size:

H2a: The marginal increase in information diversity is decreasing in network size.

⁷ Where $p_{ij} + \sum_q p_{iq}\, p_{qj}$ measures the proportion of i's network contacts that directly or indirectly involve j and $C_i$ sums this across all of i's contacts.
⁸ Euclidean distance measures the square root of the sum of squared distances between two contact vectors, or the degree to which contacts are connected to the same people. We measure the average structural equivalence of actors' direct contacts.
⁹ We focus on internal networks due to difficulties in collecting reliable data outside the firm and in estimating accurate network structures without access to whole network data (see Barnes 1979, Marsden 1990). As Burt (1992: 17) demonstrates, however, "little evidence of hole effects [are] lost... when sociometric choices [are] restricted to relations within the firm."
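The two redundancy measures defined above can be illustrated with a minimal sketch. It assumes the network is stored as a symmetric dictionary of tie weights; the function names and the two toy networks (loosely modeled on actors A and B in Figure 1) are hypothetical and are not taken from the study's data or code. The distance function drops the two compared actors from their own contact vectors, which is one common convention and an assumption of this sketch.

```python
import math

def proportions(ties, i):
    """p_ij: share of i's total tie weight invested in each contact j."""
    total = sum(ties[i].values())
    return {j: w / total for j, w in ties[i].items()}

def constraint(ties, i):
    """C_i = sum_j (p_ij + sum_q p_iq * p_qj)^2, with q != i, j."""
    p_i = proportions(ties, i)
    total = 0.0
    for j in p_i:
        # Indirect investment in j through i's other contacts q.
        indirect = sum(
            p_i[q] * proportions(ties, q).get(j, 0.0)
            for q in p_i if q != j
        )
        total += (p_i[j] + indirect) ** 2
    return total

def structural_distance(ties, a, b):
    """Euclidean distance between the contact vectors of a and b (a and b excluded)."""
    others = [k for k in ties if k not in (a, b)]
    return math.sqrt(sum((ties[a].get(k, 0) - ties[b].get(k, 0)) ** 2 for k in others))

# Hypothetical toy networks: A's contacts are unconnected and reach different people,
# B's contacts are connected to each other and to the same third party.
net_a = {"A": {"x": 1, "y": 1}, "x": {"A": 1, "s1": 1}, "y": {"A": 1, "s2": 1},
         "s1": {"x": 1}, "s2": {"y": 1}}
net_b = {"B": {"u": 1, "v": 1}, "u": {"B": 1, "v": 1, "t": 1},
         "v": {"B": 1, "u": 1, "t": 1}, "t": {"u": 1, "v": 1}}

constraint_a = constraint(net_a, "A")
constraint_b = constraint(net_b, "B")
diversity_a = 1 - constraint_a
diversity_b = 1 - constraint_b
```

In this toy example actor A is less constrained than actor B, and B's two contacts have identical contact vectors (distance zero), so a smaller distance indicates greater structural equivalence.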

H2b: The marginal increase in structural network diversity is decreasing in network size.

2.4. Non-Network Determinants of Information Advantage

Several other factors could affect access to diverse information and individual performance. We therefore examine five possible alternative explanations as controls: demography, human capital, total communication volume, unobservable individual characteristics, and temporal shocks to the flow of information in the firm.

Demography can influence performance, learning capabilities and the variety of ideas to which individuals have access (e.g. Ancona & Caldwell 1992, Reagans & Zuckerman 2001). Older employees may have prior knowledge on a wider variety of topics or be more aware of experts. Employment discrimination and interpersonal difference could also impact the relative performance and information seeking habits of men and women. We therefore control for the age and gender of employees.

Greater industry experience, education or individuals' organizational position could also create variation in access to diverse and novel information and performance. As individuals gain experience, they may collect expertise across several domains, or specialize and focus their work and communication on a limited number of topics. We therefore control for education, industry experience and organizational position.¹⁰

As previous studies have demonstrated the importance of controlling for communication volume to isolate structural effects (e.g. Cummings & Cross 2003), we include controls for total email communication. At the same time, some employees may simply be more social or more ambitious, creating variation in information seeking habits and performance. To control for unobservable individual characteristics we test fixed effects specifications of each hypothesis.

Finally, temporal shocks could affect demand for the firm's services and information seeking activities associated with more work.¹¹ These exogenous shocks to demand could drive simultaneous increases in project workload, information seeking, and revenue generation, creating a spurious correlation between information flows and output.
We therefore control for temporal variation with dummy variables for each month and year.

¹⁰ Employees are partners, consultants or researchers; we include dummy variables for each of these positions.
¹¹ In our data, business exhibits seasonal variation, with demand for the firm's services picking up sharply in January and declining over the next eight months. There may also be transitory shocks to demand in a given month or year.

2.5. The Setting: Executive Recruiting

We studied a medium-sized executive recruiting firm with fourteen offices in the U.S. Interviews revealed that the core of executive recruiters' work involves matching job candidates to clients' requirements. This matching process is information-intensive and requires activities geared toward assembling, analyzing, and making decisions based on information gathered from team members, other firm employees, and contacts outside the firm. Qualitative studies show that executive recruiters fill brokerage positions between clients and candidates and rely heavily on information flows to complete their work effectively (Finlay & Coverdill 2000). In our context, more precise or accurate information about the candidate pool reduces time wasted interviewing unsuitable candidates and increases the quality of placement decisions (Bulkley & Van Alstyne 2004). In addition, the sharing of procedural information can improve efficiency and effectiveness (Szulanski 1996) and executive recruiters report learning to deal with difficult situations through communication with peers.

Recruiters generate revenue by filling vacancies rather than billing hourly. Therefore, the speed with which vacancies are filled is an important intermediate measure of workers' productivity. Contract completion implies that recruiters have met a client's minimum thresholds of candidate fit and quality. Project duration can therefore be interpreted as a quality controlled measure of productivity. In assessing individual recruiters' performance, we measure revenues generated per month, projects completed per month and average project duration per month.

Effective recruiters rely on being "in the know" and delivering candidates that display specific professional and personal attributes. To accomplish this, recruiters must be aware of several different information channels to match different candidates with different client requirements. We therefore expect recruiters with diverse and non-redundant information to complete more projects, to complete projects faster, and to generate more revenue for the firm per unit time.
H3: Access to non-redundant and diverse information is positively associated with more project completions, faster project completions and more revenue generated per unit time.

While we expect network structure to impact performance through its effects on access to diverse and novel information, there could be other intermediate mechanisms tying structure to performance as outlined in 2.2. We therefore hypothesize that network diversity is positively associated with performance even controlling for access to novel information.

H4: Network diversity is positively associated with more project completions, faster project completions and more revenue generated per unit time, controlling for access to novel information.

Finally, as argued in 2.2, there is reason to suspect that there are diminishing marginal returns to novel information. In particular, our formal model showed that the likelihood of novel information decreases with each additional link. Further, information economic arguments show that, regardless of source, incremental news has no benefit past the point of decision relevance. Therefore:

H5: The marginal increase in performance associated with access to novel information is decreasing in the amount of novel information to which actors have access.

3. Methods

By analyzing email communication patterns and message content, we are able to match information channels to the subject matter of the content flowing through them. Our empirical approach also addresses another methodological puzzle that has historically troubled network research. In traditional network studies, a fundamental tradeoff exists between comprehensive observation of whole networks and the accuracy of respondents' recall. Most research elicits network data from respondents who have difficulty recalling their networks (e.g. Bernard et al. 1981), especially among individuals socially distant to themselves (Krackhardt & Kilduff 1999). The inaccuracy of respondent recall and the bias associated with recall at social distance creates inaccurate estimates of network variables (Kumbasar, Romney & Batchelder 1994), forcing most empirical studies to artificially limit the boundary of estimated networks to local areas around respondents (e.g. Reagans & McEvily 2003). Such empirical strategies create estimation challenges due to the sensitivity of network metrics to the completeness of data (Marsden 1990). If important areas of the network are not captured, estimates of network positions can be biased.
Furthermore, as our content measures consider the similarity of topics across the entire network, poor coverage of the firm could bias our estimates of the relative novelty or diversity of topics discussed via email. We therefore take several steps to ensure a high level of participation (described below). As 87% of eligible recruiters agreed to participate, and given that our inability to observe the remaining 13% is limited to messages between two employees who both opted out of the study, we collect email network and individual content data with nearly full coverage of the firm.¹²

3.1. Data

Our data come from four sources: (i) detailed accounting records of individual project assignments and performance, (ii) email data from the corporate server, (iii) survey data on demographic characteristics, human capital and information seeking behaviors, and (iv) data from the web site Wikipedia.org used to validate our analytical models of information diversity.

Internal accounting data describe: revenues generated by individual recruiters, contract start and stop dates, projects handled simultaneously by each recruiter, project team composition, and job levels of recruiters and placed candidates. These provide excellent performance measures that can be normalized for quality.

Email data cover 10 months of complete email history at the firm. The data were captured from the corporate mail server during two equal periods from October 1, 2002 to March 1, 2003 and from October 1, 2003 to March 1, 2004. Participants received $100 in exchange for permitting use of their data, resulting in 87% coverage of eligible recruiters and more than 15,000 email messages captured. Details of email data collection are described by Aral, Brynjolfsson & Van Alstyne (2006).

The third data set contains survey responses on demographic and human capital variables such as age, education, industry experience, and information-seeking behaviors. Survey questions were generated from a review of relevant literature and interviews with recruiters. Experts in survey methods at the Inter-University Consortium for Political and Social Science Research vetted the survey instrument, which was then pre-tested for comprehension and ease-of-use. Individual participants received $5 for completed surveys and participation exceeded 85%.

The fourth dataset is a set of 91 entries collected from Wikipedia.org, which we describe in detail in the section pertaining to the validity of our information diversity metrics (see Appendix C).

¹² F-tests comparing performance levels of those who opted out with those who remained did not show statistically significant differences. F (Sig): Rev 0.95 (.136), Comp 0.837 (.365), Multitasking .386 (.538).

Table 1: Descriptive Statistics

Variable | Obs. | Mean | SD | Min | Max
Age | 5 | 4.36 | 10.94 | 4 | 67
Gender (1=male) | 657 | .56 | .50 | 0 | 1
Industry Experience | 5 | 1.5 | 9.5 | 1 | 39
Years Education | 5 | 17.66 | 1.33 | 15 | 1
Total Incoming Emails | 563 | 80.31 | 59.67 | 0 | 34
Information Diversity | 563 | .57 | .14 | 0 | .87
Total Non-Redundant Information | 563 | 47.94 | 35.97 | 0 | 3.30
Network Size | 563 | 16.81 | 8.79 | 1 | 58
Structural Holes | 563 | .71 | .17 | 0 | .91
Structural Equivalence | 563 | 77.5 | 16.3 | 7.35 | 175.86
Revenue | 630 | 096.03 | 18843.16 | 0 | 80808.41
Completed Projects | 630 | .39 | .36 | 0 | 1.69
Average Project Duration (Days) | 630 | 5.3 | 165.77 | 0 | 91.04

Table 2: Pairwise Correlations Between Independent Variables

Measure | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13
1. Age | 1.00
2. Gender | .11* | 1.00
3. Industry Experience | .73* | .0* | 1.00
4. Years Education | .38* | .06 | .15* | 1.00
5. Total Incoming Email | -.33* | -.10* | -.8* | -.15* | 1.00
6. Information Diversity | .09 | .05 | .16* | .05 | .9* | 1.00
7. Non-redundant Information | -.3* | -.09* | -.7* | -.1* | .98* | .36* | 1.00
8. Network Size | -.07 | .0 | -.01 | .09 | .63* | .45* | .64* | 1.00
9. Network Diversity | .1* | .0 | .5* | .01 | .34* | .71* | .35* | .6* | 1.00
10. Structural Equivalence | -.19* | -.06 | -.4* | -.06 | .3* | -.08 | .3* | -.05 | -.16* | 1.00
11. Revenue | .44* | -.0 | .33* | .15* | -.09* | .3* | -.1* | -.1* | .7* | -..16* | 1.00
12. Completed Projects | .41* | -.01 | .9* | .11* | -.09* | .3* | -.11* | -.09* | .5* | -.14* | .9* | 1.00
13. Average Project Duration | .50* | .1* | .49* | .1* | -.30* | .14* | -.31* | -.07 | .18* | -.1* | .54* | .47* | 1.00

* p < .05

Descriptive statistics and correlations are provided in Tables 1 & 2. An observation is a person-month.¹³

3.2.1. Modeling & Measuring Topics in Email: A Vector Space Model of Communication Content

We model and measure the diversity of information in individuals' email using a Vector Space Model of the topics present in email content (e.g. Salton et al. 1975).¹⁴ Vector Space Models are widely used in information retrieval and search query optimization algorithms to identify documents that are similar to each other or pertain to topics identified by search terms. They represent textual content as vectors of topics in multidimensional space based on the relative prevalence of topic keywords. In our model, each email is represented as a multidimensional topic vector whose elements are the frequencies of keywords in the email. The prevalence of certain keywords indicates that a topic that corresponds to those keywords is being discussed. For example, an email about pets might include frequent mentions of the words dog, cat, and veterinarian; while an email about econometrics might mention the words variance, specification, and heteroskedasticity. The relative topic similarity of two emails can then be assessed by topic vector convergence or divergence, that is, the degree to which the vectors point in the same or orthogonal directions.¹⁵ To measure content diversity, we characterize all emails as topic vectors and measure the variance or spread of topic vectors in individuals' inboxes and outboxes. Emails about similar topics contain similar language on average, and vectors used to represent them are therefore closer in multidimensional space, reducing their collective variance or spread.

¹³ We wrote and developed email capture software specific to this project and took multiple steps to maximize data integrity. New code was tested at Microsoft Research Labs for server load, accuracy and completeness of message capture, and security exposure. To account for differences in user deletion patterns, we set administrative controls to prevent data expunging for 4 hours. The project went through nine months of human subjects review and content was masked using cryptographic techniques to preserve privacy (see Van Alstyne & Zhang 2003). Spam messages were excluded by eliminating external contacts who did not receive at least one message from someone inside the firm.
¹⁴ While email is not the only source of employees' communication, it is one of the most pervasive media that preserves content. It is also a good proxy for other social sources of information in organizations where email is widely used. In our data, the average number of contacts by phone (ρ = .30, p < .01) and instant messenger (ρ = .15, p < .01) are positively and significantly correlated with email contacts. Our interviews indicate that in our firm, email is a primary communication medium.
¹⁵ Each email may pertain to multiple topics based on keyword prevalence, and topic vectors representing emails can emphasize one topic more than another based on the relative frequencies of keywords associated with different topics. In this way, our framework captures nuances of emails that may pertain to several topics of differing emphasis.

3.2.2. Construction of Topic Vectors & Keyword Selection
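As a concrete illustration of the vector representation introduced in Section 3.2.1, the following minimal sketch builds term-frequency topic vectors and compares them by cosine similarity. The keyword list reuses the pets and econometrics example from the text; the function names, the sample sentences and the whitespace tokenization are assumptions made for illustration and do not reproduce the study's keyword selection procedure.

```python
import math
from collections import Counter

# Hypothetical keyword list, borrowed from the pets / econometrics example in the text.
KEYWORDS = ["dog", "cat", "veterinarian", "variance", "specification", "heteroskedasticity"]

def topic_vector(text, keywords=KEYWORDS):
    """Term-frequency vector: one count per keyword, in keyword order."""
    counts = Counter(text.lower().split())
    return [counts[k] for k in keywords]

def cosine_similarity(u, v):
    """1.0 for vectors pointing the same way, 0.0 for orthogonal vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Invented sample messages (not drawn from the study's email corpus).
pets_1 = topic_vector("the dog and the cat saw the veterinarian")
pets_2 = topic_vector("my cat chased the dog")
metrics = topic_vector("variance and specification tests for heteroskedasticity")

same_topic = cosine_similarity(pets_1, pets_2)
different_topic = cosine_similarity(pets_1, metrics)
```

The two pet messages share keywords and therefore point in similar directions, while the pet and econometrics messages share none and are orthogonal.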

Vector Space Models characterize documents $D_i$ by keywords $k_{ij}$ weighted according to their frequency of use (or with 0 weights for words excluded from the analysis, called "stop words"). Each document is represented as an n-dimensional vector of keywords in topic space,

$$D_i = (k_{i1}, k_{i2}, \ldots, k_{in}),$$

where $k_{ij}$ represents the weight of the jth keyword.

Figure 2. A three dimensional Vector Space Model of three documents is shown on the left. A Vector Space Model containing a test inbox with emails clustered along three dimensions is shown on the right. (Left panel: documents $D_1$, $D_2$ and $D_3$ plotted against keyword axes $K_1$, $K_2$ and $K_3$.)

Weights define the degree to which a particular keyword impacts the vector characterization of a document. Words that discriminate topics are weighted more heavily than words less useful in distinguishing topics. As terms that appear frequently in a document are typically thematic and relate to the document's subject matter, we use the term frequency of keywords in email as weights to construct topic vectors and refine our keyword selection with criteria designed to select words that distinguish and represent topics.¹⁶

In order to minimize their impact on the clustering process, we initialized our data by removing common stop words, such as a, an, the, and, and other common words with high frequency across all emails that are likely to create noise in content measures. We then implemented an iterative, k-means clustering algorithm to group emails into clusters that use the same words, similar words or words that frequently appeared together.¹⁷ The result of iterative k-means clustering is a series of assignments of emails to clusters based on their language similarity.

Rather than imposing exogenous keywords on the topic space, we extract topic keywords likely to characterize topics by using a series of algorithms guided by three basic principles. First, in order to identify distinct topics in our corpus, keywords should distinguish topics from one another. We therefore chose keywords that maximize the variance of their mean frequencies across k-means clusters. This refinement favors words with widely differing mean frequencies across clusters, suggesting an ability to distinguish between topics. In our data, we find the coefficient of variation of the mean frequencies across topics to be a good indicator of this dispersion.¹⁸

$$C_v = \frac{1}{M}\sqrt{\frac{1}{n}\sum_i (m_i - M)^2}$$

Second, keywords should represent the topics they are intended to identify. In other words, keywords identifying a given topic should frequently appear in emails about that topic. To achieve this goal we chose keywords that minimize the mean frequency variance within clusters, favoring words that are consistently used across emails discussing a particular topic:¹⁹

$$ITF = \sum_c \sum_i \frac{(f_i - M_c)^2}{M_c^2}$$

Third, keywords should not occur too infrequently. Infrequent keywords will not represent or distinguish topics and will create sparse topic vectors that are difficult to compare. We therefore select high frequency words (not eliminated by the stop word list of common words) that maximize the inter-topic coefficient of variation and minimize intra-topic mean frequency variation. This process generated topical keywords from usage characteristics of the email communication of employees at our research site.²⁰

¹⁶ Another common weighting scheme is the term-frequency/inverse-document frequency. However, we use a more sophisticated keyword selection refinement method specific to this dataset, described in detail in the remainder of this section.
¹⁷ K-means clustering generates clusters by locally optimizing the mean squared distance of all documents in a corpus. The algorithm first creates an initial set of clusters based on language similarities, computes the centroid of each cluster, and then reassigns documents to clusters whose centroid is the closest to that document in topic space. The algorithm stops iterating when no reassignment is performed or when the objective function falls below a pre-specified threshold.
¹⁸ The coefficient of variation is particularly useful due to its scale invariance, enabling comparisons of datasets, like ours, with heterogeneous mean values (Ancona & Caldwell 1992). To ease computation we use the square of the coefficient of variation, which produces a monotonic transformation of the coefficient without affecting our keyword selection.
¹⁹ i indexes emails and c indexes k-means clusters. We squared the variation to ease computation as in footnote 18.

3.2.3. Measuring Email Content Diversity

Using the keywords generated by our usage analysis, we populated topic vectors representing the subject matter of the emails in our data. We then measured the degree to which the emails in an individual employee's inbox or outbox were focused or diverse by measuring the spread or variance of their topic vectors. We created five separate diversity measurement specifications based on techniques from the information retrieval, document similarity and information theory literatures. The approach of all five measures is to compare individuals' emails to each other, and to characterize the degree to which emails are about a set of focused topics, or rather about a wider set of diverse topics. We used two common document similarity measures (Cosine similarity and Dice's coefficient) and three measures enhanced by an information theoretic weighting of emails based on their information content.²¹ We performed extensive validation tests of our diversity measures and their correlations, including application to an independent dataset from Wikipedia. A detailed description of the validation process and results appears in Appendix C. As all diversity measures are highly correlated (~ corr = .98; see Appendix B), our specifications use the average cosine distance of employees' incoming email topic vectors $d_{ij}^I$ from the mean vector of their topic space $M_i^I$ to represent incoming information diversity ($ID_i^I$):

$$ID_i^I = \frac{1}{N}\sum_{j=1}^{N}\Big(1 - Cos\big(d_{ij}^I, M_i^I\big)\Big), \quad \text{where} \quad Cos\big(d_{ij}^I, M_i^I\big) = \frac{d_{ij}^I \cdot M_i^I}{|d_{ij}^I|\,|M_i^I|} = \frac{\sum_j w_{ij}\, w_{Mj}}{\sqrt{\sum_j w_{ij}^2 \sum_j w_{Mj}^2}},$$

such that $0 \le ID_i^I \le 1$. This measure aggregates the cosine distance of email vectors in an inbox from the mean topic vector of that inbox, approximating the spread or variance of topics in incoming email for a given individual.
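The diversity measure can be sketched in a few lines. This is a minimal illustration that follows the verbal definition above (the average cosine distance of an inbox's topic vectors from their mean vector); the function names and the two toy inboxes are hypothetical, and no information theoretic weighting is applied.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two topic vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u) * sum(b * b for b in v))
    return dot / norm if norm else 0.0

def information_diversity(vectors):
    """Mean cosine distance (1 - cosine) of topic vectors from their mean vector."""
    n = len(vectors)
    mean = [sum(column) / n for column in zip(*vectors)]
    return sum(1.0 - cosine(v, mean) for v in vectors) / n

# Hypothetical inboxes: rows are emails, columns are keyword weights.
focused_inbox = [[3, 1, 0], [6, 2, 0], [9, 3, 0]]   # every email on the same topic mix
diverse_inbox = [[4, 0, 0], [0, 4, 0], [0, 0, 4]]   # each email on a different topic

focused_id = information_diversity(focused_inbox)
diverse_id = information_diversity(diverse_inbox)
```

Because the focused inbox contains only vectors pointing in the same direction, its diversity is zero, while the inbox whose emails cover three unrelated topics scores higher.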
We measure the total amount of i's incoming email communication as a count of incoming email messages, $E_i^I = \sum_j m_{ji}$, where $m_{ji}$ represents a message sent from j to i; and the total amount of non-redundant information flowing to each actor i ($NRI_i^I$) as diversity times total incoming email: $NRI_i^I = (E_i^I * ID_i^I)$.

²⁰ We conducted sensitivity analysis of our keyword selection process by choosing different thresholds at which to select words based on our criteria and found results were robust to all specifications and generated keyword sets more precise than those used in traditional term frequency/inverse document frequency weighted vector space models that do not refine keyword selection.

3.3. Statistical Specifications

We began by examining the structural determinants of access to diverse and novel information. We first estimated an equation relating network structure to the diversity of information flowing into actors' email inboxes using pooled OLS specifications controlling for individual characteristics and fixed effects models on monthly panels of individuals' networks and information diversity. The estimating equation is specified as follows:

$$ID_{it}^I = \gamma_i + \beta_1 E_{it}^I + \beta_2 NS_{it} + \beta_3 NS_{it}^2 + \beta_4 ND_{it} + \beta_5 SE_{it} + \sum_j B_j HC_{ji} + \sum_m B_m Month_m + \varepsilon_{it} \quad [5],$$

where $ID_{it}^I$ represents the diversity of the information in a given individual's inbox, $E_{it}^I$ represents the total number of incoming messages received by i, $NS_{it}$ represents the size of i's network, $NS_{it}^2$ represents network size squared, $ND_{it}$ represents structural diversity (measured by one minus constraint), $SE_{it}$ represents average structural equivalence, $\sum_j B_j HC_{ji}$ represents controls for human capital and demographic variables (Age, Gender, Education, Industry Experience, and Managerial Level), and $\sum_m B_m Month_m$ represents temporal controls for each month/year.

We then examined the relationship between network structure and the total amount of novel information flowing into actors' email inboxes ($NRI_{it}^I$), again testing pooled OLS and fixed effects specifications using the following model:

$$NRI_{it}^I = \gamma_i + \beta_1 NS_{it} + \beta_2 NS_{it}^2 + \beta_3 ND_{it} + \beta_4 SE_{it} + \sum_j B_j HC_{ji} + \sum_m B_m Month_m + \varepsilon_{it} \quad [6].$$

²¹ Information Content is used to describe how informative a word or phrase is based on its level of abstraction. Formally, the information content of a concept c is quantified as its negative log likelihood $-\log p(c)$.
²² We focus in this paper on incoming information for two reasons. First, we expect network structure to influence incoming information more than outgoing information. Second, the theory we intend to test is about the information to which individuals have access as a result of their network structure, not the information individuals send. These dimensions are highly correlated.
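The role of the squared network size term in these specifications can be made explicit with a short derivation. Writing $b_1$ and $b_2$ (labels introduced here only for illustration) for the coefficients on network size and network size squared, and holding all other regressors fixed, the marginal effect of network size on a dependent variable $Y_{it}$ is:

$$\frac{\partial Y_{it}}{\partial NS_{it}} = b_1 + 2\, b_2\, NS_{it}$$

With $b_1 > 0$ and $b_2 < 0$ the relationship is positive but concave: each additional contact adds less than the one before it, and the fitted curve reaches its maximum at $NS^{*} = -b_1 / (2 b_2)$. A positive coefficient on size combined with a negative coefficient on size squared is therefore the pattern that diminishing marginal returns to network size would produce.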

To explore the mechanisms driving the non-linear relationship between network size and information diversity, we tested the hypothesis (2b) that while structural diversity is increasing in size, there are diminishing marginal diversity returns to size in bounded networks. If this is the case, we should see a non-linear positive relationship between network size and structural diversity, such that the marginal increase in structural diversity is decreasing in size in the following model:

$$ND_{it} = \gamma_i + \beta_1 NS_{it} + \beta_2 NS_{it}^2 + \sum_j B_j HC_{ji} + \sum_m B_m Month_m + \varepsilon_{it} \quad [7].$$

Finally, we tested the relationship between non-redundant information ($NRI_{it}^I$) and performance ($P_{it}$), and included our measure of structural network diversity ($ND_{it}$) in the specification.

$$P_{it} = \gamma_i + \beta_1 NRI_{it}^I + \beta_2 ND_{it} + \beta_3 NS_{it} + \sum_j B_j HC_{ji} + \sum_m B_m Month_m + \varepsilon_{it} \quad [8].$$

If information benefits to network diversity exist, network diversity should be positively associated with access to diverse and non-redundant information, and non-redundant information should be positively associated with performance. If network diversity confers additional benefits beyond information advantage (such as power or favorable trading conditions), network diversity should contribute to performance beyond its contribution through information diversity.²³ Finally, if there are diminishing marginal returns to novel information, we should see a non-linear relationship between novel information and productivity.

²³ We were unable to reject the hypothesis of no heteroskedasticity and report standard errors according to the White correction (White 1980). White's approach is conservative. Estimated coefficients are unbiased but not efficient. In small samples, we may observe low t-statistics even when variables exert a real influence. As there may be idiosyncratic error at the level of individuals, for OLS analyses we report robust standard errors clustered by individual. Clustered robust standard errors are robust to correlations within observations of each individual, but are never fully efficient. They are conservative estimates of standard errors.

4. Results

4.1. Network Structure & Access to Diverse, Non-Redundant Information

We first estimated the relationships between network size, network diversity and access to diverse information controlling for demographic factors, human capital, unobservable individual characteristics, temporal shocks and the total volume of communication. Our results, shown in Table 3 Models 1-4, demonstrate that the diversity of information flowing to an actor is increasing in the actor's network size and network diversity, while the marginal increase in information diversity is decreasing in network size, supporting hypotheses 1 and 2a. A one standard deviation increase in the size of recruiters' networks (approximately 8 additional contacts) is associated with a 1. standard deviation increase in information diversity; while the coefficient on network size squared is negative and significant, indicating diminishing marginal diversity returns to network size.²⁴ As actors add network contacts, the contribution to information diversity lessens, implying that information benefits to network size are constrained.

Network diversity is also positively and significantly associated with greater information diversity in incoming email. The first order diversity variable, which measures the lack of constraint in an actor's network, is highly significant in all specifications, while the average structural equivalence of actors' contacts does not influence access to diverse information controlling for network size and first order structural diversity. These results demonstrate that large diverse networks provide access to diverse, novel sets of information.

We then tested relationships between network size, network diversity and the total amount of novel information that accrues to recruiters in incoming email. Our results, shown in Table 3 Models 5-8, demonstrate that the amount of novel information flowing to an actor is increasing in the actor's network size and network diversity. Network diversity has a strong positive relationship with the total amount of novel information flowing into actors' inboxes (Models 5 & 6), but is not significant when controlling for network size (Models 7 & 8). The impact of size on total novel information dominates that of structural diversity because of the strong relationship between size and total incoming email, a critical driver of the total amount of novel information (pairwise correlation: ρ = .98, p < .01). This result highlights the importance of information flows over time. The amount of novel information flowing in networks of similar structural diversity is greater in larger networks.
We would also expect network diversity to drive greater access to total non-redundant information, controlling for network size. However, our model and results imply that while structural diversity has a strong impact on characteristics of the information actors receive (greater information diversity per unit of information), variation in the total amount of novel information received is determined mostly by network size, a key determinant (along with tie strength) of total communication bandwidth.

²⁴ We also tested a negative exponential specification of this relationship with very similar results. Both models fit well.

Table 3. Network Structure & Access to Diverse, Novel Information

Variable | Model 1 | Model 2 | Model 3 | Model 4 | Model 5 | Model 6 | Model 7 | Model 8
Dependent Variable | Information Diversity | Information Diversity | Information Diversity | Information Diversity | NRI | NRI | NRI | NRI
Specification | FE | OLS-c | FE | OLS-c | FE | OLS-c | FE | OLS-c
Age | | .006 (.009) | | -.001 (.006) | | .000 (.01) | | -.005 (.010)
Gender | | .003 (.135) | | .135 (.097) | | -.006 (.188) | | -.17 (.155)
Education | | -.061 (.006) | | -.00 (.04) | | -.068 (.06) | | -.098 (.053)
Industry Experience | | .010 (.010) | | -.001 (.007) | | -.09** (.013) | | -.015 (.010)
Partner | | -.147 (.84) | | .175 (.188) | | -.480 (.437) | | -.44 (.395)
Consultant | | -.006 (.46) | | .1 (.168) | | -.839 (.318) | | -.403 (.96)
Total Email Incoming | -.001 (.001) | .000 (.001) | -.001 (.001) | .001 (.001) | | | |
Network Size | 1.99*** (.133) | 1.38*** (.301) | .474*** (.114) | .96* (.138) | | | .711*** (.17) | 1.195*** (.34)
Network Size-Squared | -.880*** (.11) | -1.048*** (.66) | -.7** (.089) | -.40* (.139) | | | -.109 (.103) | -.518* (.63)
Network Diversity | | | .18** (.05) | .68*** (.07) | .9*** (.061) | .530*** (.131) | -.070 (.060) | -.138 (.103)
Structural Equivalence | | | -.005 (.033) | .06 (.096) | -.053 (.043) | -.000 (.006) | .0 (.037) | -.138 (.10)
Constant | .059 (.094) | .863 (.895) | .18* (.075) | .016 (.634) | -.81*** (.079) | 1.655** (1.090) | -.47*** (.068) | 1.784** (.890)
Temporal Controls | Month / Year | Month / Year | Month / Year | Month / Year | Month / Year | Month / Year | Month / Year | Month / Year
F-Value (d.f.) | 13.70*** (11) | 3.76*** (17) | 5.61*** (13) | 5.03*** (19) | 10.54*** (10) | 1.86*** (16) | 5.05*** (1) | 15.85*** (18)
R² | .4 | .38 | .14 | .4 | .19 | .35 | .40 | .55
Obs. | 563 | 448 | 540 | 434 | 540 | 434 | 540 | 434

These results also suggest a nuanced relationship between size and diversity, which we explore next.

4.2. Tradeoffs between Network Size & Network Diversity

There is a strong, positive, but non-linear relationship between network size and network diversity in our data: structural diversity is increasing in network size, but with diminishing marginal returns (see Table 4).
This result supports Hypothesis 2b, and demonstrates why information benefits to larger networks may be constrained in bounded organizational networks. As recruiters contact more colleagues, each additional contact adds to the structural diversity of a focal actor's network, but by less than the contact before it. The implications of a fundamental trade off between size and structural diversity complement Burt's (1992: 167) concepts of effective size and efficiency. [25]

Table 4. Network Size & Structural Network Diversity
Model 1  Model 2  Model 3  Model 4
Dependent Variable: Network Diversity (Models 1-2); Structural Equivalence (Models 3-4)
Specification: Fixed Effects  OLS-c  Fixed Effects  OLS-c
Age  -.005  .016**  (.006)  (.005)
Gender  -.156*  .04  (.091)  (.10)
Education  -.030  .011  (.034)  (.045)
Industry Experience  .05** (.009)  -.01 (.007)
Partner  -.004  -1.01***  (.186)  (.0)
Consultant  .19  -.940***  (.140)  (.167)
Network Size  1.585***  1.66***  -.077  -.14  (.113)  (.09)  (.145)  (.9)
Network Size-Squared  -1.038*** (.098)  -1.069*** (.190)  -.109 (.1)  -.006 (.171)
Constant  .083  .651  -.907***  -.946  (.064)  (.630)  (.074)  (.784)
Temporal Controls  Month / Year in all models
F-Value (d.f.)  33.39***  15.58***  6.39***  59.97***  (10)  (16)  (10)  (16)
R²  .41  .64  .58  .58
Obs.  563  448  540  434

Figure 5 displays graphs relating network size, network diversity and information diversity, clearly showing the positive, non-linear relationships.

Figure 5. Graphs showing relationships between network size, network diversity and information diversity.

4.3. Network Structure, Information Diversity & Performance

Finally, we test the performance implications of network structure and access to diverse, non-redundant information, with performance measured by revenues generated per month, projects completed per month, and the average duration of projects. [26] Table 5 displays strong evidence of a positive relationship between access to non-redundant information and performance. In fixed effects models, which control for variation explained by unobserved, time invariant characteristics of individuals, a one unit increase in the amount of non-redundant information flowing to individuals is associated on average with just over $3,800 more revenue generated, an extra one tenth of one project completed, and 14 days shorter average project duration per person per month. Between estimates are all in the same direction and of similar magnitude, although only the relationship with revenue is significant. Pooled OLS estimates also show that access to non-redundant information is associated with higher productivity across all measures. These results support Hypothesis 3 and provide evidence for information advantages to network structure. Tables 3, 4 and 5 together demonstrate that diverse networks provide access to diverse, non-redundant information, which in turn drives performance in information intensive work.

We also uncover evidence of alternative mechanisms linking network structure to performance. Table 5 shows network diversity is positively associated with performance even when holding access to novel information constant, providing preliminary evidence of additional benefits to network structure beyond those conferred through information advantage. Controlling for access to novel information, network diversity is associated with greater revenue generation in fixed effects and pooled OLS specifications, more completed projects in pooled OLS specifications, and with faster project completion in fixed effects specifications. These results leave open the possibility that some benefits to network diversity come not from access to novel, non-redundant information, but rather from other mechanisms, like access to job support, power or organizational influence.

[25] In fact, Burt (1992: 169) finds stronger evidence of hole effects with the constraint measures we employ than with effective size, demonstrating exclusive access is a critical quality of relations that span structural holes.

[26] As there are some employees who do not take on projects or who are not involved in any projects in a given month, we only estimate equations for individuals with non-zero revenues in a given month.

Table 5. Network Structure, Non-Redundant Information and Individual Performance
Dependent Variable: Revenue (Models 1-3); Completed Projects (Models 4-6); Project Duration (Models 7-9)
Specification: Fixed Effects, Within Estimator (Models 1, 4, 7); Between Estimator (Models 2, 5, 8); OLS-c (Models 3, 6, 9)
Age: M3 -41.75 (94.08); M6 -.006 (.005); M9 .344 (.147)
Gender: M3 -617.33 (3816.54); M6 -.096 (.056); M9 -1.155 (6.346)
Education: M3 -774.60 (1103.03); M6 -.003 (.0); M9 17.769 (10.686)
Industry Experience: M3 -91.58 (78.91); M6 .00 (.006); M9 4.51 (.59)
Partner: M3 1979.80 (8533.10); M6 .156 (.159); M9 -83.39 (79.35)
Consultant: M3 977.93 (6763.74); M6 .50** (.11); M9 -104.555 (57.056)
Non-Redundant Information: M1 3806.1** (111.06); M2 476.45* (783.69); M3 7709.13** (3143.6); M4 .097*** (.04); M5 .084 (.059); M6 .17** (.050); M7 -14.11** (5.44); M8 -35.33 (5.516); M9 -6.461* (14.931)
Network Diversity: M1 165.14 (931.5); M2 5558.04* (368.6); M3 30.45* (1779.18); M4 .1 (.018); M5 .070 (.069); M6 .057* (.03); M7 -1.735** (4.18); M8 33.38 (9.961); M9 -14.764 (11.499)
Constant: M1 3538.48*** (144.79); M2 891.45* (1614.11); M3 5619.10** (0886.57); M4 .660*** (.08); M5 .40 (.344); M6 .873** (.431); M7 88.96*** (6.48); M8 43.07 (148.63); M9 -36.571 (190.419)
Temporal Controls: Month / Year in all models
F-Value (d.f.): M1 .16** (10); M2 5.1*** (8); M3 3.97*** (16); M4 3.15*** (10); M5 3.46** (8); M6 4.7*** (16); M7 3.3*** (10); M8 1.7 (8); M9 4.06*** (16)
R²: .06  .49  .4  .08  .39  .7  .08  .4  .8
Obs.: 40  40  30  40  40  30  40  40  30

Finally, we tested whether the positive relationship between access to novel information and performance was strictly linear, or rather whether access to novel information displayed diminishing marginal performance returns (Hypothesis 5). We found across the board that access to non-redundant information had diminishing marginal performance returns in each of our performance measures (see Table 6).

Table 6. Non-Redundant Information and Performance
Dependent Variable: Revenue (Models 1-2); Completed Projects (Models 3-4); Project Duration (Models 5-6)
Specification: Fixed Effects (Models 1, 3, 5); OLS-c (Models 2, 4, 6)
Age: M2 -33.39 (80.47); M4 -.006 (.005); M6 .58 (.191)
Gender: M2 -6539.65 (3930.3); M4 -.101* (.059); M6 -8.43 (6.907)
Education: M2 -1061.88 (1131.07); M4 -.008 (.0); M6 18.050 (11.195)
Industry Experience: M2 -.493 (61.17); M4 .003 (.005); M6 3.817 (.531)
Partner: M2 13891.09* (7971.08); M4 .17 (.157); M6 -96.117 (8.379)
Consultant: M2 9457.81 (6055.44); M4 .5** (.115); M6 -115.09* (59.89)
Non-Redundant Information: M1 6096.58*** (187.8); M2 9310.37*** (58.86); M3 .15*** (.05); M4 .01*** (.040); M5 -3.11*** (5.94); M6 -31.870** (16.047)
Non-Redundant Information Squared: M1 -370.83*** (775.10); M2 -6659.35*** (1194.65); M3 -.074*** (.015); M4 -.11*** (.06); M5 9.59*** (3.58); M6 4.17 (9.686)
Constant: M1 37808.06*** (1489.1); M2 64901.77** (0890.10); M3 .74*** (.09); M4 1.03** (.443); M5 76.75*** (6.87); M6 -44.411 (197.604)
Temporal Controls: Month / Year in all models
F-Value (d.f.): M1 4.04*** (10); M2 5.56*** (16); M3 3.10*** (10); M4 5.69*** (16); M5 3.10*** (10); M6 3.09** (16)
R²: .10  .30  .08  .31  .08  .7
Obs.: 40  30  40  30  40  30

These parameter estimates suggest that the positive performance impacts of novel information are much lower when employees already have access to significant amounts of novel information.

Figure 6. Graphs of the relationships between novel information, completed projects and revenue.
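The diminishing returns in Table 6 come from a quadratic specification of the form performance = a + b1·NRI + b2·NRI², in which a squared-term coefficient of opposite sign to the linear term means that each additional unit of novel information changes performance by less than the previous one. As a minimal sketch, the marginal return and the level of novel information at which the return changes sign can be computed as below. The function names and the coefficient values are hypothetical and chosen only for illustration; they are not the estimates reported in Table 6.

```python
def marginal_return(b1, b2, x):
    """Derivative of a + b1*x + b2*x**2 with respect to x."""
    return b1 + 2.0 * b2 * x


def turning_point(b1, b2):
    """Value of x at which the marginal return equals zero."""
    if b2 == 0:
        raise ValueError("no turning point: the relationship is linear")
    return -b1 / (2.0 * b2)


# Hypothetical coefficients, for illustration only.
b1, b2 = 2.0, -0.5

print("turning point:", turning_point(b1, b2))
print("marginal return at x = 0:", marginal_return(b1, b2, 0.0))
print("marginal return at x = 3:", marginal_return(b1, b2, 3.0))
```

With a positive b1 and a negative b2, the marginal return is positive below the turning point and negative above it, which is the pattern of returns described for Figure 6.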

In fact, as the graphs in Figure 6 demonstrate, there seem to be negative returns to more novel information beyond the normalized mean. [27] These non-linearities in the value of novel information likely arise for the reasons outlined in 2.2. First, beyond the threshold for decision relevance, new information adds no value. Second, an employee's capacity to process new information may be constrained, making them less able to use novel information once they receive too much of it. This explanation is consistent with theories of bounded rationality, cognitive capacity and information overload.

5. Conclusion

We present some of the first empirical evidence on the relationship between network structure and the content of information flowing to and from actors in a network. We develop theory detailing how network structures enable information benefits with measurable performance implications, and build and validate an analytical model to measure the diversity of information in email communication. Our results lend broad support to the argument that network structures predict performance due to their impact on individuals' access to diverse information. But we also find subtle non-linearities in the relationships between network structure and information access, and between information access and performance. The total amount of novel information and the diversity of information flowing to actors are increasing in actors' network size and network diversity, but the marginal increase in information diversity is decreasing in network size. Part of the explanation for the decreasing marginal contribution of network size to information diversity is that network diversity is increasing in network size, but with diminishing marginal returns. As actors establish relationships with a finite set of possible contacts in an organization, the probability that a marginal relationship will be non-redundant, and provide access to novel information, decreases as possible alters in the network are exhausted.
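The exhaustion argument can be made concrete with the closed form derived in Online Appendix A, P(Ψ) = n1·p1·(1 − p1)^L + n2·p2·(1 − p2)^L, the probability that link L+1 delivers an idea not seen on the previous L links. The following is a minimal sketch of that expression; the function names and the example parameter values are assumptions made for illustration and do not come from the paper's data.

```python
def prob_new_idea(n1, p1, n2, p2, links):
    """P(link L+1 reaches a new idea) after `links` prior links.

    n1, n2: number of topics in each topic set.
    p1, p2: per-topic likelihoods, with n1*p1 + n2*p2 == 1.
    """
    if abs(n1 * p1 + n2 * p2 - 1.0) > 1e-9:
        raise ValueError("likelihoods must satisfy n1*p1 + n2*p2 = 1")
    return n1 * p1 * (1.0 - p1) ** links + n2 * p2 * (1.0 - p2) ** links


def biased_likelihoods(n1, n2, bias):
    """Per-topic likelihoods when set-1 topics are `bias` times as likely
    as under unbiased sampling: p1 = bias/T, and p2 follows from the
    constraint n1*p1 + n2*p2 = 1."""
    total = n1 + n2
    p1 = bias / total
    p2 = (1.0 - n1 * p1) / n2
    if p2 < 0:
        raise ValueError("bias too large: requires n1 * bias <= T")
    return p1, p2


# Hypothetical example: 5 in-group topics, 5 out-group topics.
for bias in (1.0, 1.5):
    p1, p2 = biased_likelihoods(5, 5, bias)
    probs = [prob_new_idea(5, p1, 5, p2, links) for links in (0, 5, 20)]
    print(bias, probs)
```

With no prior links the probability is one, and it falls toward zero as links accumulate, which matches the first two properties noted in the appendix.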
We also find that there are diminishing marginal productivity returns to novel information, a result consistent with anecdotal evidence of information overload, and theories of bounded rationality and limits to cognitive capacity. In our context, network diversity contributes to performance even when controlling for the positive performance effects of access to novel information, suggesting additional benefits to network diversity beyond those conferred through information advantage. Surprisingly, traditional demographic and human capital variables (e.g. age, gender, industry experience, education) have little effect on access to diverse information, highlighting the importance of network structure for information advantage. These results represent some of the first evidence on the relationship between network structure and information advantage. But relationships between social structure, information access and economic outcomes are subtle and complex and require more detailed theoretical development and empirical inquiry across different contexts. Our methods for analyzing network structure and information content in email data are replicable, opening a new line of inquiry into the relationships between networks, information and economic performance.

[27] For novel information greater than the normalized mean, coefficients in revenue regressions are negative and significant (β_FE = -3340.33**; β_OLS = -3661.60*), and in completed projects regressions are negative, though not significant (β_FE = -.04; β_OLS = -.05).

References

Ancona, D.G. & Caldwell, D.F. 1992. Demography & Design: Predictors of New Product Team Performance. Organization Science, 3(3): 321-341.
Aral, S., Brynjolfsson, E., & Van Alstyne, M. 2006. Information, Technology and Information Worker Productivity: Task Level Evidence. Proceedings of the 27th Annual International Conference on Information Systems, Milwaukee, Wisconsin.
Arrow, K.J. 1985. Informational Structure of the Firm. AEA Papers and Proceedings, 75(2): 303-307.
Bernard, H.R., Killworth, P., & Sailer, L. 1981. Summary of research on informant accuracy in network data and the reverse small world problem. Connections, 4(2): 11-25.
Bulkley, N. & Van Alstyne, M. 2004. Why Information Should Influence Productivity. In Manuel Castells (ed.), The Network Society: A Global Perspective. Edward Elgar Publishers. pp. 145-173.
Burt, R. 1992. Structural Holes: The Social Structure of Competition. Harvard University Press, Cambridge, MA.
Burt, R. 2000. The network structure of social capital. In B. Staw & R. Sutton (Eds.), Research in Organizational Behavior (Vol. 22). New York, NY, JAI Press.
Burt, R. 2004a. Structural Holes & Good Ideas. American Journal of Sociology, 110: 349-99.
Burt, R. 2004b. Where to get a good idea: Steal it outside your group. As quoted by Michael Erard in The New York Times, May 22.
Coleman, J.S. 1988. Social Capital in the Creation of Human Capital. American Journal of Sociology, 94: S95-S120.
Cook, K.S., Emerson, R.M., Gilmore, M.R., & Yamagishi, T. 1983. The distribution of power in exchange networks. American Journal of Sociology, 89: 275-305.
Cummings, J., & Cross, R. 2003. Structural properties of work groups and their consequences for performance. Social Networks, 25(3): 197-210.
Emerson, R. 1962. Power-Dependence Relations. American Sociological Review, 27: 31-41.
Finlay, W. & Coverdill, J.E. 2000. Risk, Opportunism & Structural Holes: How headhunters manage clients and earn fees. Work & Occupations, 27: 377-405.
Granovetter, M. 1973. The strength of weak ties. American Journal of Sociology, 78: 1360-80.
Hansen, M. 1999. The search-transfer problem: The role of weak ties in sharing knowledge across organization subunits. Administrative Science Quarterly, 44(1): 82-111.
Hansen, M. 2002. Knowledge networks: Explaining effective knowledge sharing in multiunit companies. Organization Science, 13(3): 232-248.
Hargadon, A. & Sutton, R. 1997. Technology brokering and innovation in a product development firm. Administrative Science Quarterly, 42: 716-49.
Hirshleifer, J. 1973. Where are we in the theory of information? American Economic Review, 63: 31-39.
Krackhardt, D. & Kilduff, M. 1999. Whether close or far: Social distance effects on perceived balance in friendship networks. Journal of Personality and Social Psychology, 76: 770-82.
Kumbasar, E., Romney, A.K., & Batchelder, W.H. 1994. Systematic biases in social perception. American Journal of Sociology, 100: 477-505.
Marsden, P. 1990. Network Data & Measurement. Annual Review of Sociology, 16: 435-463.
McPherson, M., Smith-Lovin, L., & Cook, J. 2001. Birds of a Feather: Homophily in Social Networks. Annual Review of Sociology, 27: 415-444.
Podolny, J. & Baron, J. 1997. Resources and relationships: Social networks and mobility in the workplace. American Sociological Review, 62(5): 673-693.
Reagans, R. & McEvily, B. 2003. Network Structure & Knowledge Transfer: The Effects of Cohesion & Range. Administrative Science Quarterly, 48: 240-67.
Reagans, R. & Zuckerman, E. 2001. Networks, diversity, and productivity: The social capital of corporate R&D teams. Organization Science, 12(4): 502-517.
Reagans, R. & Zuckerman, E. 2006. Why Knowledge Does Not Equal Power: The Network Redundancy Tradeoff. Working Paper, Sloan School of Management, pp. 1-67.
Salton, G., Wong, A., & Yang, C.S. 1975. A Vector Space Model for Automatic Indexing. Communications of the ACM, 18(11): 613-620.
Sparrowe, R., Liden, R., Wayne, S., & Kraimer, M. 2001. Social networks and the performance of individuals and groups. Academy of Management Journal, 44(2): 316-325.
Szulanski, G. 1996. Exploring internal stickiness: Impediments to the transfer of best practice within the firm. Strategic Management Journal, 17: 27-43.
Uzzi, B. 1996. The sources and consequences of embeddedness for the economic performance of organizations: The network effect. American Sociological Review, 61: 674-98.
Uzzi, B. 1997. Social structure and competition in interfirm networks: The paradox of embeddedness. Administrative Science Quarterly, 42: 35-67.
Van Alstyne, M. & Zhang, J. 2003. EmailNet: A System for Automatically Mining Social Networks from Organizational Email Communication. NAACSOS.
White, H. 1980. A heteroscedasticity-consistent covariance matrix estimator and a direct test for heteroscedasticity. Econometrica, 48(4): 817-838.

Online Appendix A. Model Derivation

This short section provides the derivation for Equation 1. Let there be n_1 topics in topic set N_1 and n_2 topics in topic set N_2, for a total of n_1 + n_2 = T. Define the likelihoods of encountering N_1 and N_2 topics as p_1 and p_2 respectively. It follows that n_1 p_1 + n_2 p_2 = 1. Further, define the following:

\[ I_{lk} = \begin{cases} 1 & \text{if link } l \text{ connects to idea } k \\ 0 & \text{otherwise} \end{cases} \qquad J_k = \begin{cases} 1 & \text{if } \sum_{l=1}^{L} I_{lk} = 0 \\ 0 & \text{otherwise} \end{cases} \]

\[ \Psi = \{\text{Event that link } L+1 \text{ connects to a new idea}\} \]

Here, J_k indicates whether idea k has failed to appear among the information provided by any of the links 1 ... L. With this terminology, we can now derive P(Ψ), the probability of encountering a new idea given that there are k ideas remaining to be seen.

\[ P(\Psi) = E[P(\Psi \mid J_1, \ldots, J_T)] = E\Big[\sum_{i=1}^{n_1} J_i p_1 + \sum_{h=n_1+1}^{T} J_h p_2\Big] = n_1 p_1 E[J_i] + n_2 p_2 E[J_h] = n_1 p_1 (1 - p_1)^L + n_2 p_2 (1 - p_2)^L \]

The last step arises because an idea that occurs with probability p must not have occurred in any of the previous L draws. This completes the derivation.

It is useful to note three properties. First, having no prior links (L = 0) implies that a new idea is encountered with certainty. Second, increasing links without bound (L → ∞) implies the chances of encountering a new idea approach 0. Third, unbiased information implies p_1 = p_2 = 1/T. Further, if ideas in N_1 become B times more likely to appear among in-group communications, then p_1 = B/T, which implies that

\[ p_2 = \frac{1 - n_1 B / T}{T - n_1} \]

(with n_1 < T, B < T, and n_1 B ≤ T), which simplifies the final derivation in the main text.

Online Appendix B. Descriptions & Correlations of Information Diversity Metrics

1. Cosine Distance Variance

Variance based on cosine distance (cosine similarity):

\[ ID^I = \frac{1}{N} \sum_{j=1}^{N} \big( Cos(d_j^I, M^I) \big)^2, \quad \text{where } Cos(d_j, M) = \frac{d_j \cdot M}{|d_j|\,|M|} = \frac{\sum_k w_{jk} w_{Mk}}{\sqrt{\sum_k w_{jk}^2}\,\sqrt{\sum_k w_{Mk}^2}} \]

We measure the variance of deviation of email topic vectors from the mean topics vector and average the deviation across emails in a given inbox or outbox. The distance measurement is derived from a well-known document similarity measure, the cosine similarity of two topic vectors.
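The cosine-based measure can be sketched in a few lines. The example below is illustrative rather than the authors' implementation: the function names are hypothetical, the topic vectors are made up, and the sketch assumes that the distance of an email from the mean topic vector is one minus their cosine similarity, which the text above does not state explicitly.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length topic vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for empty vectors
    return dot / (norm_u * norm_v)


def mean_vector(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine_variance_diversity(vectors):
    """Average squared cosine distance of each vector from the mean vector.

    Assumption: distance = 1 - cosine similarity.
    """
    m = mean_vector(vectors)
    return sum((1.0 - cosine_similarity(d, m)) ** 2 for d in vectors) / len(vectors)


# Made-up topic-weight vectors for two small mailboxes.
focused = [[1.0, 0.0], [1.0, 0.0]]
diverse = [[1.0, 0.0], [0.0, 1.0]]
print(cosine_variance_diversity(focused))
print(cosine_variance_diversity(diverse))
```

A mailbox whose emails all share one topic profile scores zero, while a mailbox whose emails point in different topic directions scores higher.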
2. Dice's Coefficient Variance

Variance based on Dice's Distance and Dice's Coefficient:

\[ VarDice^I = \frac{1}{N} \sum_{j=1}^{N} \big( DistDice(d_j^I) \big)^2 \]

where DistDice(d) = DiceDist(d, M) = 1 - Dice(d, M), and where

\[ Dice(D_1, D_2) = \frac{2 \sum_{j=1}^{T} t_{D_1 j}\, t_{D_2 j}}{\sum_{j=1}^{T} \big( t_{D_1 j} + t_{D_2 j} \big)} \]

Similar to VarCos, variance is used to reflect the deviation of the topic vectors from the mean topic vector. Dice's coefficient is used as an alternative measure of the similarity of two email topic vectors.

3. Average Common Cluster

AvgCommon measures the level to which the documents in the document set reside in different k-means clusters produced by the eClassifier algorithm:

\[ AvgCommon^I = \frac{1}{N} \sum_{j=1}^{N} CommonDist(d_{1j}^I, d_{2j}^I) \]

where (d_{1j}^I, d_{2j}^I) represents a given pair of documents (1 and 2) in an inbox and j indexes all pairs of documents in an inbox, and where:

\[ CommonDist(d_{1j}^I, d_{2j}^I) = 1 - CommonSim(d_{1j}^I, d_{2j}^I), \qquad CommonSim(d_{1j}^I, d_{2j}^I) = \frac{Iterations\_in\_same\_cluster}{Iterations} \]

AvgCommon is derived from the concept that documents are similar if they are clustered together by k-means clustering and dissimilar if they are not clustered together. The k-means clustering procedure is repeated several times, creating several clustering results with 5, 10, 20, 30, 40, ..., 200 clusters. This measure counts the number of times during this iterative process that two emails were clustered together, divided by the number of clustering iterations. Therefore, every two emails in an inbox and outbox that are placed in separate clusters contribute to higher diversity values.

4. Average Common Cluster with Information Content

AvgCommonIC weights co-membership by a measure of the information content of the cluster in which two emails reside. AvgCommonIC extends the AvgCommon concept by compensating for the different amount of information provided by the fact that two emails reside in the same bucket, depending on whether that bucket is highly diverse or tightly clustered. For example, the fact that two emails are both in a cluster with low intra-cluster diversity is likely to imply more similarity between the two emails than the fact that two emails reside in a cluster with high intra-cluster diversity.
\[ CommonICSim(D_1, D_2) = \frac{1}{total\_number\_of\_bucket\_levels} \sum_{D_1, D_2\ in\ same\ bucket} \frac{\log\big( documents\_in\_the\_bucket / all\_documents \big)}{\log\big( 1 / all\_documents \big)} \]

\[ CommonICDist(D_1, D_2) = 1 - CommonICSim(D_1, D_2) \]

\[ AvgCommonIC = \operatorname{average}_{d_1, d_2 \in documents} \{ CommonICDist(d_1, d_2) \} \]

5. Average Cluster Distance

AvgBucDiff measures diversity using the similarity/distance between the clusters that contain the emails:

\[ AvgBucDiff = \operatorname{average}_{d_1, d_2 \in documents} \{ DocBucketDist(d_1, d_2) \} \]

where

\[ DocBucketDist(D_1, D_2) = \frac{1}{cluster\_iterations} \sum_{i \in cluster\_iterations} BucketDist\big( B_{iteration=i,\,D_1},\ B_{iteration=i,\,D_2} \big) \]

and:

\[ BucketDist(B_1, B_2) = CosDist(m_{B_1}, m_{B_2}) \]

AvgBucDiff extends the concept of AvgCommon by using the similarity/distance between clusters. While AvgCommon only differentiates whether two emails are in the same cluster, AvgBucDiff also considers the distance between the clusters that contain the emails.

Correlations Between the Five Measures of Information Diversity
Measure              1       2       3       4       5
1. VarCosSim         1.0000
2. VarDiceSim        0.9999  1.0000
3. AvgCommon         0.9855  0.9845  1.0000
4. AvgCommonIC       0.9943  0.9937  0.9973  1.0000
5. AvgClusterDist    0.9790  0.9778  0.9993  0.9939  1.0000

Online Appendix C: External Validation of Diversity Measures

We validated our diversity measurement using an independent, publicly available corpus of documents from Wikipedia.org. Wikipedia.org, the user created online encyclopedia, stores entries according to a hierarchy of topics representing successively fine-grained classifications. For example, the page describing genetic algorithms is assigned to the Genetic Algorithms category, found under Evolutionary Algorithms, Machine Learning, Artificial Intelligence, and subsequently under Technology and Applied Sciences. This hierarchical structure enables us to construct clusters of entries on diverse and focused subjects and to test whether our diversity measurement can characterize diverse and focused clusters accurately.

We created a range of high to low diversity clusters of Wikipedia entries by selecting entries from either the same sub-category in the topic hierarchy to create focused clusters, or from a diverse set of unrelated subtopics to create diverse clusters. For example, we created a minimum diversity cluster (Type-0) using a fixed number of documents from the same third level sub-category of the topic hierarchy, and a maximum diversity cluster (Type-9) using documents from unrelated third level sub-categories. We then constructed a series of document clusters (Type-0 to Type-9) ranging from low to high topic diversity from 91 individual entries as shown in Figure 3. [28] The topic hierarchy from which documents were selected appears at the end of this section.
If our measurement is robust, our diversity measures should identify Type-0 clusters as the least diverse and Type-9 clusters as the most diverse. We expect diversity will increase relatively monotonically from Type-0 to Type-9 clusters, although there could be debate, for example, about whether Type-4 clusters are more diverse than Type-3 clusters. [29] After creating this independent dataset, we used the Wikipedia entries to generate keywords and measure diversity using the methods described above. Our methods were very successful in characterizing diversity and ranking clusters from low to high diversity. Figure 3 displays cosine similarity metrics for Type-0 to Type-9 clusters using 30, 60, and 90 documents to populate clusters. All five diversity measures return increasing diversity scores for clusters selected from successively more diverse topics. [30] Overall, these results give us confidence in the ability of our diversity measurement to characterize the subject diversity of groups of text documents of varying sizes.

[28] We created several sets of clusters for each type and averaged diversity scores for clusters of like type. We repeated the process using 3, 6 and 9 document samples per cluster type to control for the effects of the number of documents on diversity measures.

[29] Whether Type-3 or Type-4 clusters are more diverse depends on whether the similarity of two documents in the same third level sub category is greater or less than the difference of similarities between documents in the same second level sub category as compared to documents in categories from the first hierarchical layer onwards. This is, to some extent, an empirical question.

[30] The measures produce remarkably consistent diversity scores for each cluster type and the diversity scores increase relatively monotonically from Type-0 to Type-9 clusters. The diversity measures are not monotonically increasing for all successive sets, such as Type-4, and it is likely that the information contained in Type-4 clusters is less diverse than that in Type-3 clusters due simply to the fact that two Type-4 documents are taken from the same third level sub category.
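One way to quantify "relatively monotonically" is a rank correlation between cluster type and diversity score, together with a list of the types at which the score falls relative to the preceding type. The sketch below is a hedged illustration: the function names are hypothetical, the scores are made-up values that are not the validation results, and the Spearman formula used assumes there are no tied scores.

```python
def ranks(values):
    """Rank positions (1 = smallest); assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman_rho(xs, ys):
    """Spearman rank correlation for untied data."""
    n = len(xs)
    rx, ry = ranks(xs), ranks(ys)
    d_squared = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1.0 - 6.0 * d_squared / (n * (n * n - 1))


def non_monotonic_types(scores):
    """Cluster types whose score is lower than that of the previous type."""
    return [t for t in range(1, len(scores)) if scores[t] < scores[t - 1]]


# Made-up diversity scores for Type-0 .. Type-9, for illustration only.
types = list(range(10))
scores = [0.30, 0.35, 0.40, 0.47, 0.45, 0.52, 0.55, 0.58, 0.61, 0.64]

print("rank correlation:", spearman_rho(types, scores))
print("types that break monotonicity:", non_monotonic_types(scores))
```

A correlation near one with a single dip at Type-4 would correspond to the pattern described in note 30.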

Figure C1. Wikipedia.org Document Clusters and Diversity Measurement Validation Results. Panels: document clusters selected from Wikipedia.org; diversity measurement validation results, plotting Diversity against cluster Type (0 to 9) for the Variance of Cosine Distances with 30, 60 and 90 documents.

Wikipedia.org Categories

+ Computer science >
  + Artificial intelligence
    + Machine learning
    + Natural language processing
    + Computer vision
  + Cryptography
    + Theory of cryptography
    + Cryptographic algorithms
    + Cryptographic protocols
  + Computer graphics
    + 3D computer graphics
    + Image processing
    + Graphics cards
+ Geography >
  + Climate
    + Climate change
    + History of climate
    + Climate forcing
  + Cartography
    + Maps
    + Atlases
    + Navigation
  + Exploration
    + Space exploration
    + Exploration of Australia
+ Technology >
  + Robotics
    + Robots
    + Robotics competitions
  + Engineering
    + Electrical engineering
    + Bioengineering
    + Chemical engineering
  + Video and movie technology
    + Display technology
    + Video codecs
    + Digital photography