Statistical Analysis of Complete Social Networks
Introduction to networks
Christian Steglich
c.e.g.steglich@rug.nl

[Figure: plot with axes "median geodesic distance between groups", "transitivity", and "homophily"]

$$\ln \frac{\Pr(x^c \stackrel{i}{\to} x^b)}{\Pr(x^c \stackrel{i}{\to} x^a)} = \sum_{k=1}^{K} \beta_k \left( s_{ik}(x^b) - s_{ik}(x^a) \right)$$
Overview of topics
- Network research in the
  - Some characterising features
- Network data
  - Definitions
  - Data collection & related problems
  - Different types of network data
- A quick overview of concepts in network analysis
  - This will help in understanding later course topics
What are social networks?
- Social networking web services
  - Stay in contact with friends
  - Find new partners
- Social networking as strategy
  - Career building for individuals
  - Trust building among organisations
- Social networks as organisation form
  - Terrorist networks
  - Flash mobs
More systematically...
When is network research appropriate?
Generally, whenever the detailed functioning of a social system is studied: not just the actors of the system and their individual properties (composition of the system), but how they relate to each other (structure of the system)!
Compared to other disciplines, network research has been labelled holistic, non-reductionist, organic, or explanatory.
Its characteristic feature is the interdependence between the different actors in the system.
The meat grinder argument (Barton, 1968)
Survey data cannot possibly reveal anything about how social systems work. The network approach can!
[Diagram labels: social system; independence assumptions; population of social units; survey sample]
The social network approach is quite universal
Social actors and social relations can be many things. One can distinguish between the domains of constructed social order (government, organisations) and spontaneous social order (markets, primary socialisation). In all four domains, social networks can be studied:
Government
- International scale: political alliances, trade, all sorts of contracted agreements between countries.
- Domestic scale: networking of different government agencies (e.g., agencies involved in water management of river X).
Markets
- Trade networks: flow of goods, supply chains, auctions, buyer-seller (bipartite) networks.
- Labour markets: vacancy chains, getting jobs.
Organisations
- Intra-organisational: formal and informal relations at work.
- Inter-organisational: contracting, interlocking directorates.
Primary Social Order (PSO)
- Kin, friendship, confiding, and other private relations.
The domains overlap: for example, research on social capital addresses (in part) how PSO and informal organisational ties explain the market position of social actors.
Networks in general might be even more universal
Physicists' take on networks as universal entities:
- Gene networks (one gene activating another gene)
- Neural networks (brain cells activating each other)
- Food webs (see diagram)
- Transport networks (power grids, highway nets, telecommunication)
Are all these different things indeed similar, on some level?
Formal vs. Informal / Design vs. Emergence
Social systems in constructed social orders can be part of the construction (formal networks, designed interdependence) or not (informal networks, emergent interdependence).
- Department structure (formal group) vs. communities (informal)
- Job description (formal role) vs. emergent (informal) roles
- Organigram (formal power) vs. collegial power attribution
Informal, emergent phenomena are what network research typically focuses on.
Statistical inference for interdependent data?
The statistical procedures that hold for random samples do generally NOT hold for a network data set! Interdependence of social actors precludes this.
- Procedures from classical statistics are not applicable (no regression, no ANOVA, no t-tests...).
- Special statistical procedures are required.
Qualitative or Quantitative?
Social network research has a huge toolbox of quantitative measures: centrality, power, geodesic distances, etc. Typically, however, just one social system is under study. SNA uses quantitative tools for qualitative case studies.
- Some form of external validation of SNA results is always desirable.
- Replication of results is essential.
- Better yet, work with a random sample of networks. Then you can also generalise results to a population (of networks).
Handling networks: Data issues
- Definition & Notation
- Data collection
- Data storage
- Bipartite networks, multiplexity, components
- Classical data sets
Definition & Notation
Mathematically, a (binary) network is defined as a graph $G=(V,E)$, where $V=\{1,2,\dots,n\}$ is a set of vertices (or nodes) and $E \subseteq \{\langle i,j\rangle \mid i,j \in V\}$ is a set of edges (or ties, arcs), one edge just being a pair of vertices.
In social network analysis, nodes are typically called (social) actors (individuals, firms, countries, ...).
We write $x_{ij}=1$ if actors $i$ and $j$ are related to each other (i.e., if $\langle i,j\rangle \in E$), and $x_{ij}=0$ otherwise.
Networks can be symmetric ($x_{ij}=x_{ji}$ for all $i,j$), and the definition can be extended to valued networks ($x_{ij} \in \mathbb{R}$).
Network data collection
- Ways to collect network data
  - Design of network studies
  - Measurement of networks
- Methodological issues when collecting network data
- Ethical issues when collecting network data
Data collection design
- Socio-centric or complete network design: identify a group, assess ties within this group. (Default case.)
- Ego-centric design, can be part of a classical survey: ask a (random) sample of respondents about their contacts, and about the contact persons' interconnectedness.
- Respondent-driven design, used in open system networks: the design exploits network structure.
  - Necessary for the study of hidden populations (e.g., injection drug users); let early respondents recruit subsequent respondents.
  - In huge networks: a method to extract meaningful communities.
Measurement
- Observation: a professional observer codes visible interactions.
- Name generators in questionnaires or interviews:
  - cued recall (from lists of available candidates) vs. free recall (candidates from own memory);
  - limited vs. unlimited nominations;
  - resource-based vs. role-based.
- Mining: exploit relational databases (internet based, stock exchange, trade auctions, ...).
Methodological issues
- Boundary specification: where does the network end?
- Sensitivity to missing data
  - More problematic than in random samples, as structure can be affected crucially by non-response.
  - For some data collection methods, it is difficult to distinguish missingness from non-response.
- Informant accuracy is lacking (Killworth, Bernard, et al.)
  - People systematically forget about lost relations.
  - Feld & Carter (2003): expansiveness and attractiveness bias.
Ethical issues
General points:
- Respect, beneficence, justice;
- Informed consent, assessing risks & benefits;
- Anonymity, confidentiality, privacy.
Network-specific points:
- Possibility to reconstruct identities even after anonymisation;
- Passive consent vs. active consent, and the double role of what participation in the study means.
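Data storage
Formats: adjacency matrix, node list, edge list.

Adjacency matrix (rows and columns ordered $a_1, \dots, a_5$):

$$X = \begin{pmatrix} 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 1 & 1 \\ 0 & 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 1 & 0 \end{pmatrix}$$

List form of the same network (each sender followed by the actors it sends ties to):
  a1: a3
  a2: (none)
  a3: a2, a4, a5
  a4: a5
  a5: a4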
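The storage formats carry the same information, so one can be converted into another. The following is a minimal sketch in plain Python using the five-actor matrix from this slide; the function names (`to_edge_list`, `to_node_list`, `to_matrix`) are illustrative and not taken from any particular network package.

```python
# Adjacency matrix from the slide; rows are senders, columns are receivers.
X = [
    [0, 0, 1, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 1, 0, 1, 1],
    [0, 0, 0, 0, 1],
    [0, 0, 0, 1, 0],
]
labels = ["a1", "a2", "a3", "a4", "a5"]


def to_edge_list(matrix, names):
    """One (sender, receiver) pair per tie."""
    return [(names[i], names[j])
            for i, row in enumerate(matrix)
            for j, x in enumerate(row) if x]


def to_node_list(matrix, names):
    """Each sender mapped to the list of its receivers."""
    return {names[i]: [names[j] for j, x in enumerate(row) if x]
            for i, row in enumerate(matrix)}


def to_matrix(edges, names):
    """Rebuild the binary adjacency matrix from (sender, receiver) pairs."""
    index = {name: k for k, name in enumerate(names)}
    matrix = [[0] * len(names) for _ in names]
    for sender, receiver in edges:
        matrix[index[sender]][index[receiver]] = 1
    return matrix


edge_list = to_edge_list(X, labels)
node_list = to_node_list(X, labels)
```

Note that the list formats need the full set of actor names passed separately, because an actor without outgoing or incoming ties (an isolate) would otherwise be lost from an edge list.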
Two-mode data
Two-mode (a.k.a. bipartite) networks consist of ties between two qualitatively distinct sets of nodes, often events and the actors attending them.
They can be projected to two one-mode networks: between actors (relation: sharing events) and between events (relation: sharing actors).
On the right, you see hangjongeren (adolescents, blue circles) by police-registered incidents (red squares) in a village in the province of Drenthe (NL).
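The two projections can be sketched as follows. The attendance data are hypothetical (three invented actors and three invented events), and weighting each projected tie by the number of shared events or actors is one possible choice, not the only one.

```python
from itertools import combinations

# Hypothetical two-mode data: actor -> set of events attended.
attendance = {
    "Ann": {"e1", "e2"},
    "Bob": {"e1"},
    "Cas": {"e2", "e3"},
}


def project_actors(att):
    """Actor-by-actor network; tie weight = number of shared events."""
    return {(a, b): len(att[a] & att[b])
            for a, b in combinations(sorted(att), 2)
            if att[a] & att[b]}


def project_events(att):
    """Event-by-event network; tie weight = number of shared actors."""
    members = {}
    for actor, events in att.items():
        for event in events:
            members.setdefault(event, set()).add(actor)
    return {(e, f): len(members[e] & members[f])
            for e, f in combinations(sorted(members), 2)
            if members[e] & members[f]}


actor_net = project_actors(attendance)
event_net = project_events(attendance)
```

Both projections are symmetric by construction, so each pair is stored only once.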
An almost bipartite network: Romantic and Sexual Relations at Jefferson High School
Characteristic feature: few cycles, in particular few 4-cycles.
The 4-cycle prohibition for kinship relations is partly coded into law (figure from Lin Freeman's 2004 book on the history of SNA).
Multiplexity
There can be multiple social relations between the same actors; this is called multiplexity.
[Figure panels: formal job hierarchy; workflow]
Components
A network can split into mutually unconnected regions. These regions are called components of the network.
On the right you see a network separating into seven components.
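Components can be found with a breadth-first search that ignores tie direction. The sketch below assumes nodes are numbered 0 to n-1 and ties are given as pairs; the six-node example network is made up for illustration.

```python
from collections import deque


def components(n, edges):
    """Components of a network on nodes 0..n-1, treating ties as undirected."""
    neighbours = {v: set() for v in range(n)}
    for i, j in edges:
        neighbours[i].add(j)
        neighbours[j].add(i)

    seen, result = set(), []
    for start in range(n):
        if start in seen:
            continue
        # Breadth-first search collects everything reachable from `start`.
        seen.add(start)
        queue, comp = deque([start]), []
        while queue:
            v = queue.popleft()
            comp.append(v)
            for w in neighbours[v] - seen:
                seen.add(w)
                queue.append(w)
        result.append(sorted(comp))
    return result


# Hypothetical network: a chain 0-1-2, a pair 3-4, and the isolate 5.
example = components(6, [(0, 1), (1, 2), (3, 4)])
```

An isolate counts as a component of size one, which is why the example yields three components.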
Classical data sets
- Newcomb's Fraternity data (1961): liking rankings among 17 university students.
- Sampson's Monastery data (1968): trust & liking scales among 25 novice monks.
- Zachary's Karate Club data (1977): friendship among 34 members of a karate club.
- Padgett's Florentine Families data (1994): marriage ties among families in late medieval Florence.
- ... and a couple of others.
All classics are typically provided with software, e.g., Ucinet.
Some concepts in descriptive network analysis
- Dyad census and triad census
- Distances & geodesic distributions
- Degree distributions
- Centrality measures
- Status / power measures
- Positional measures
- Macro measures: segmentation, segregation, centralisation, hierarchisation, autocorrelation
Dyad census
A dyad is a pair of actors $\langle i,j\rangle$ in the network, plus the configuration of tie variables $\langle x_{ij}, x_{ji}\rangle$ between them.
In a directed, binary network, there are $n(n-1)$ tie variables located in $n(n-1)/2$ dyads.
Dyads can be of three types:
- M: mutual
- A: asymmetric
- N: null

  relation (n=75)     M     A      N
  friendship        214   313   2248
  advice             45   217   2513

A simple count of types (dyad census) gives information about the degree to which the network is symmetric.
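Counting the three dyad types from an adjacency matrix takes one pass over all unordered pairs. The sketch below applies this to the small five-actor matrix from the data storage slide, since the friendship and advice matrices themselves are not reproduced here.

```python
from itertools import combinations


def dyad_census(X):
    """Count mutual (M), asymmetric (A) and null (N) dyads of a binary network."""
    counts = {"M": 0, "A": 0, "N": 0}
    for i, j in combinations(range(len(X)), 2):
        ties = X[i][j] + X[j][i]  # 2 = mutual, 1 = asymmetric, 0 = null
        if ties == 2:
            counts["M"] += 1
        elif ties == 1:
            counts["A"] += 1
        else:
            counts["N"] += 1
    return counts


# Five-actor example matrix from the data storage slide.
X = [
    [0, 0, 1, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 1, 0, 1, 1],
    [0, 0, 0, 0, 1],
    [0, 0, 0, 1, 0],
]
census = dyad_census(X)
```

The three counts always add up to n(n-1)/2, which is a useful sanity check; for the table above, 214 + 313 + 2248 = 45 + 217 + 2513 = 2775 = 75 × 74 / 2.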
Network indices based on the dyad census
Density of the network
Defined as the proportion of actually observed ties among the potentially observable ones.
- Actually observed are $2M + A$.
- Potentially observable are $n(n-1)$.

$$\text{density} = \frac{2M + A}{n(n-1)}$$

Reciprocity index of the network
Can be defined as the proportion of actually reciprocated ties among the potentially reciprocable ones.
- Actually reciprocated are $2M$.
- Potentially reciprocable are $2M + A$.

$$\text{reciprocity} = \frac{2M}{2M + A}$$
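Both indices follow directly from the dyad census. As a small sketch, the functions below evaluate them for the friendship and advice counts given on the dyad census slide (n = 75); the function names are illustrative.

```python
def density(n, M, A):
    """Observed ties (2M + A) divided by possible ties n(n-1)."""
    return (2 * M + A) / (n * (n - 1))


def reciprocity(M, A):
    """Reciprocated ties (2M) divided by all observed ties (2M + A)."""
    return 2 * M / (2 * M + A)


# Dyad census counts from the slides (n = 75 actors).
friendship_density = density(75, 214, 313)
advice_density = density(75, 45, 217)
friendship_reciprocity = reciprocity(214, 313)
advice_reciprocity = reciprocity(45, 217)
```

The null dyads N enter only through n: they contribute to the denominator of the density but play no role in the reciprocity index.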
Indices as (conditional) tie probabilities
Assume two nodes $i, j$ are randomly sampled from the network. Then
- the (unconditional) probability that $i$ is tied to $j$ is $\Pr(x_{ij} = 1) = \text{density}$;
- the conditional probability that $i$ is tied to $j$, given that $j$ is tied to $i$, is $\Pr(x_{ij} = 1 \mid x_{ji} = 1) = \text{reciprocity}$.

  relation (n=75)     M     A      N   density  log-odds  reciprocity  log-odds  effect
  friendship        214   313   2248     13.3%     -1.87        57.6%      0.31    2.18
  advice             45   217   2513      5.5%     -2.84        29.3%     -0.88    1.96

The last column is the effect of the incoming tie $x_{ji}$ on the log-odds of the outgoing tie $x_{ij}$, i.e., the difference between the two log-odds columns.
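The log-odds columns can be recomputed from the dyad census. This sketch assumes the natural logarithm and the density and reciprocity definitions of the previous slide; `reciprocity_effect` is an illustrative name.

```python
from math import log


def logit(p):
    """Log-odds of a probability p."""
    return log(p / (1 - p))


def reciprocity_effect(n, M, A):
    """Return (log-odds of density, log-odds of reciprocity, their difference)."""
    dens = (2 * M + A) / (n * (n - 1))
    recip = 2 * M / (2 * M + A)
    return logit(dens), logit(recip), logit(recip) - logit(dens)


friendship = reciprocity_effect(75, 214, 313)
advice = reciprocity_effect(75, 45, 217)
```

A positive difference means that an incoming tie raises the odds of an outgoing tie, which is the case for both relations.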
Triad census
A triad is a set of three actors in the network, together with the configuration of ties between them.
In a directed, binary network, there are sixteen triad types, typically indicated by their dyad census MAN plus (where necessary) a distinguishing letter:
- C = cyclical
- U = up
- D = down
- T = transitive
Network indices based on the triad census
Transitivity index of the network
Defined as the proportion of actually observed transitively closed triples of nodes $\langle i,k,j\rangle$ among the observed potentially closed paths of length 2 from $i$ to $j$ via $k$.
- Actually closed are
  030T + 2×120D + 2×120U + 120C + 3×210 + 6×300
- Potentially closed are
  021C + 111D + 111U + 030T + 3×030C + 2×201 + 2×120D + 2×120U + 3×120C + 4×210 + 6×300
Transitivity index as conditional probability
The index can also be calculated directly from the matrix:

$$\text{transitivity} = \frac{\sum_{i} \sum_{k \neq i} \sum_{j \neq i,k} x_{ik}\, x_{kj}\, x_{ij}}{\sum_{i} \sum_{k \neq i} \sum_{j \neq i,k} x_{ik}\, x_{kj}}$$

  relation (n=75)   actually transitive   potentially transitive   transitivity index
  friendship                       4143                     9546                 0.43
  advice                            369                     1213                 0.30

Other transitivity indices are also used in the network literature!
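The matrix formula translates into a triple loop over distinct actors i, k, j. The sketch below evaluates it on the five-actor matrix from the data storage slide, since the friendship and advice matrices are not reproduced here.

```python
def transitivity(X):
    """Share of two-paths i -> k -> j (i, k, j distinct) that are closed by a tie i -> j."""
    n = len(X)
    closed = potential = 0
    for i in range(n):
        for k in range(n):
            if k == i or not X[i][k]:
                continue
            for j in range(n):
                if j == i or j == k or not X[k][j]:
                    continue
                potential += 1       # two-path i -> k -> j exists
                closed += X[i][j]    # ... and is closed if i -> j is present
    return closed / potential if potential else float("nan")


# Five-actor example matrix from the data storage slide.
X = [
    [0, 0, 1, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 1, 0, 1, 1],
    [0, 0, 0, 0, 1],
    [0, 0, 0, 1, 0],
]
index = transitivity(X)
```

In this small network there are five two-paths, of which two (a3 → a4 → a5 and a3 → a5 → a4) are closed. Returning NaN for a network without any two-path is a choice made for this sketch.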
Distances in a network
The length of a shortest connecting path defines the (sociometric) distance between two actors in a network.
[Figure: example of a shortest path of length 5; node labels 15, 33, 9, 1, 7, 17]
Calculating distances
Matrix calculus: the adjacency matrix $X$ indicates which row actor is directly connected to which column actor. So the squared adjacency matrix $X^2$ indicates which row actor can reach which column actor in two steps, and so on for higher powers. This gives the sociometric shortest distances, a.k.a. geodesic distances.

$$X^2 = \begin{pmatrix} 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 1 & 1 \\ 0 & 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 1 & 0 \end{pmatrix} \begin{pmatrix} 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 1 & 1 \\ 0 & 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 1 & 0 \end{pmatrix} = \begin{pmatrix} 0 & 1 & 0 & 1 & 1 \\ 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 & 1 \\ 0 & 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 & 1 \end{pmatrix}$$

[Figure: the network on actors $a_1, \dots, a_5$]
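The matrix-power idea can be turned into a distance calculation: the geodesic distance from i to j is the smallest power k for which entry (i, j) of X^k is positive. The following is a minimal sketch in plain Python (a breadth-first search would be more efficient for large networks); unreachable pairs get infinite distance.

```python
from math import inf


def mat_mult(A, B):
    """Product of two square matrices given as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def geodesic_distances(X):
    """Distance from i to j = smallest k such that (X^k)[i][j] > 0."""
    n = len(X)
    dist = [[0 if i == j else inf for j in range(n)] for i in range(n)]
    power = [row[:] for row in X]          # current power X^steps
    for steps in range(1, n):              # a geodesic has at most n-1 steps
        for i in range(n):
            for j in range(n):
                if power[i][j] and dist[i][j] == inf:
                    dist[i][j] = steps
        power = mat_mult(power, X)
    return dist


# Five-actor example matrix from the slide.
X = [
    [0, 0, 1, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 1, 0, 1, 1],
    [0, 0, 0, 0, 1],
    [0, 0, 0, 1, 0],
]
X2 = mat_mult(X, X)
D = geodesic_distances(X)
```

Here actor a1 reaches a3 in one step and a2, a4, a5 in two steps, while a2 sends no ties and therefore reaches nobody.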
Geodesic distributions
For each network, you can calculate how often each distance occurs. This generates the distribution of geodesic distances.
Actors in separate components have infinite distance.
On the right you see a series of distributions of geodesic distances in English/Welsh school-based friendship networks over 3 years.
[Figure: distributions of geodesic distances, series T2, T3, T4]
Degree distributions
For each network, you can calculate how often each in- or outdegree occurs. This generates the so-called degree distributions.
Outliers at the high-degree end (on the right of the distribution) are called hubs.
On the right you see a series of distributions of outdegrees (top) and indegrees (bottom) in a Scottish school-based friendship network over 3 years.
[Figure: outdegree (top) and indegree (bottom) distributions, series T1, T2, T3]
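Degrees are row sums (outdegree) and column sums (indegree) of the adjacency matrix, and the degree distribution is the relative frequency of each value. The sketch below uses the five-actor matrix from the data storage slide, not the Scottish school data.

```python
from collections import Counter


def degrees(X):
    """Return (outdegrees, indegrees) of a binary adjacency matrix."""
    n = len(X)
    out_deg = [sum(row) for row in X]                            # row sums
    in_deg = [sum(X[i][j] for i in range(n)) for j in range(n)]  # column sums
    return out_deg, in_deg


def distribution(values):
    """Relative frequency of each observed degree."""
    counts = Counter(values)
    return {d: counts[d] / len(values) for d in sorted(counts)}


# Five-actor example matrix from the data storage slide.
X = [
    [0, 0, 1, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 1, 0, 1, 1],
    [0, 0, 0, 0, 1],
    [0, 0, 0, 1, 0],
]
out_deg, in_deg = degrees(X)
out_dist = distribution(out_deg)
```

Outdegrees and indegrees always have the same sum (the number of ties), but their distributions can differ considerably.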
Scale-free networks
Degree distributions are often depicted on a log-log scale (physics tradition).
If the distribution is linear on this scale, the network is called scale-free (A.-L. Barabási).
On the right, you see the indegree distribution of the previous slide's data set on a log-log scale.
[Figure: indegree distributions on log-log axes, series T1, T2, T3]
Centrality and Centralisation
Basic idea behind these concepts: well-connected actors are in a structurally advantageous position.
What makes an actor well-connected? Different notions of centrality try to capture such aspects.
When actors in a network differ strongly in terms of their centrality, the network is called centralised.
Structurally advantageous positions
A star structure expresses the centrality concept most clearly:
- The central actor is connected to everyone and dominates access.
- Peripheral actors are only indirectly connected to each other; they rely on the central actor for access to each other.
There is some nuance to this.
Three centrality measures
- Degree centrality: number of people nominating you (indegree centrality) or nominated by you (outdegree centrality).
- Eigenvector centrality: your centrality is proportional to the sum of your nominees' / nominators' centrality.
- Betweenness centrality: sum of fractions of shortest paths between any two nodes that pass through a given node.
[Figure: three network panels with nodes labelled I1, I3, S1, S2, S4, W1 to W9]
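Degree and betweenness centrality can be illustrated on the star structure from the previous slide. The sketch below is one way to compute betweenness for an undirected network, using the shortest-path counting scheme commonly attributed to Brandes; the star with one centre and four peripheral actors is a made-up example, and eigenvector centrality is left out.

```python
from collections import deque


def degree_centrality(adj):
    """Number of neighbours of each node in an undirected network."""
    return {v: len(nbrs) for v, nbrs in adj.items()}


def betweenness(adj):
    """Sum over node pairs of the fraction of shortest paths passing through each node."""
    bc = {v: 0.0 for v in adj}
    for s in adj:
        # Breadth-first search from s, counting shortest paths (sigma).
        sigma = {v: 0 for v in adj}
        dist = {v: -1 for v in adj}
        preds = {v: [] for v in adj}
        sigma[s], dist[s] = 1, 0
        order, queue = [], deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate path fractions, starting from the nodes farthest from s.
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                bc[w] += delta[w]
    # In an undirected network each pair is visited from both ends.
    return {v: b / 2 for v, b in bc.items()}


# Hypothetical star: one central actor and four peripheral actors.
star = {
    "centre": ["p1", "p2", "p3", "p4"],
    "p1": ["centre"], "p2": ["centre"], "p3": ["centre"], "p4": ["centre"],
}
deg = degree_centrality(star)
btw = betweenness(star)
```

In the star, every shortest path between two peripheral actors passes through the centre, so the centre's betweenness equals the number of peripheral pairs (six), while all peripheral actors score zero.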
Status / prestige measures
Centrality measures work in the first place for symmetric (undirected) networks. When evaluated on directed networks, they tend to acquire the meaning of prestige or status measures.
- Katz's sociometric status (1953) is essentially directed eigenvector centrality.
- Other status measures: popularity / indegree; Hubbell's & Taylor's proposals, ...
Positional measures
- Positions: group member, peripheral, bridge, isolate.
  - Problem: very algorithmic, various definitions of what a group is, etc.
- Structural holes / brokerage
  - Actors can exploit other actors' disconnectedness.
Macro measures
- Segmentation: degree to which the network falls apart into components.
- Segregation: degree to which the network falls apart into components that are homogeneous on some actor property (e.g., sex, race).
- Centralisation: degree to which actors differ in centrality.
More macro measures
- Hierarchisation (Krackhardt, 1994): degree to which the network exhibits a tree structure.
- Autocorrelation (Moran, Geary): degree to which actors who are similar on a variable are also directly connected in the network.
- Density, reciprocity, transitivity, ...
- Median geodesic distance, average ego-network density, etc.
Understanding the hype
Why is social network research hot now?
- Data collection is easier than in the past. Mining opens up a lot of possibilities, especially when coupled with new social media.
- Data analysis is possible only now. Statistical methods for hypothesis testing are computationally intensive and require high-end computing power.
- Mainstream awareness of networks has risen. Political events & information technology.