Social network analysis with R sna package

Similar documents

Social Media Mining. Network Measures

Practical Graph Mining with R. 5. Link Analysis

Outline. NP-completeness. When is a problem easy? When is a problem hard? Today. Euler Circuits

Univariate Regression

Social Media Mining. Graph Essentials

Package sna. February 20, 2015

MEASURES OF VARIATION

USING SPECTRAL RADIUS RATIO FOR NODE DEGREE TO ANALYZE THE EVOLUTION OF SCALE- FREE NETWORKS AND SMALL-WORLD NETWORKS

DATA ANALYSIS II. Matrix Algorithms

Imputation of missing network data: Some simple procedures

Data Structure [Question Bank]

Social and Technological Network Analysis. Lecture 3: Centrality Measures. Dr. Cecilia Mascolo (some material from Lada Adamic s lectures)

Point Biserial Correlation Tests

SGL: Stata graph library for network analysis

Graph/Network Visualization

Course on Social Network Analysis Graphs and Networks

Part 2: Community Detection

Linear Threshold Units

Graphs over Time Densification Laws, Shrinking Diameters and Possible Explanations

Graph Theory and Complex Networks: An Introduction. Chapter 08: Computer networks

Additional sources Compilation of sources:

Introduction to Multivariate Analysis

! E6893 Big Data Analytics Lecture 10:! Linked Big Data Graph Computing (II)

Statistical Analysis of Complete Social Networks

Institute of Actuaries of India Subject CT3 Probability and Mathematical Statistics

CRLS Mathematics Department Algebra I Curriculum Map/Pacing Guide

Data Mining: Algorithms and Applications Matrix Math Review

Week 3. Network Data; Introduction to Graph Theory and Sociometric Notation

Strong and Weak Ties

Sections 2.11 and 5.8

Simple Linear Regression Inference

CSE 326, Data Structures. Sample Final Exam. Problem Max Points Score 1 14 (2x7) 2 18 (3x6) Total 92.

IBM SPSS Modeler Social Network Analysis 15 User Guide

Module 3: Correlation and Covariance

Example: Credit card default, we may be more interested in predicting the probabilty of a default than classifying individuals as default or not.

Part 2: Analysis of Relationship Between Two Variables

Mean = (sum of the values / the number of the value) if probabilities are equal

Principal Component Analysis

MISSING DATA TECHNIQUES WITH SAS. IDRE Statistical Consulting Group

B490 Mining the Big Data. 2 Clustering

containing Kendall correlations; and the OUTH = option will create a data set containing Hoeffding statistics.

Distributed Computing over Communication Networks: Maximal Independent Set

Graph Mining and Social Network Analysis

How To Understand The Network Of A Network

Chapter 29 Scale-Free Network Topologies with Clustering Similar to Online Social Networks

Big Data Analytics of Multi-Relationship Online Social Network Based on Multi-Subnet Composited Complex Network

CMPSCI611: Approximating MAX-CUT Lecture 20

Equivalence Concepts for Social Networks

Performance Metrics for Graph Mining Tasks

How To Cluster Of Complex Systems

Business Statistics. Successful completion of Introductory and/or Intermediate Algebra courses is recommended before taking Business Statistics.

Multivariate Analysis of Ecological Data

Exercise 1.12 (Pg )

PITFALLS IN TIME SERIES ANALYSIS. Cliff Hurvich Stern School, NYU

1 o Semestre 2007/2008

MATH 10: Elementary Statistics and Probability Chapter 5: Continuous Random Variables

Groups and Positions in Complete Networks

WebFOCUS RStat. RStat. Predict the Future and Make Effective Decisions Today. WebFOCUS RStat

Improving the Performance of Data Mining Models with Data Preparation Using SAS Enterprise Miner Ricardo Galante, SAS Institute Brasil, São Paulo, SP

Linear functions Increasing Linear Functions. Decreasing Linear Functions

Systems and Algorithms for Big Data Analytics

Cluster Analysis: Advanced Concepts

Homework 11. Part 1. Name: Score: / null

Multivariate Normal Distribution

Complex Networks Analysis: Clustering Methods

How To Understand Multivariate Models

Virtual Landmarks for the Internet

Improving Experiments by Optimal Blocking: Minimizing the Maximum Within-block Distance

5. Linear Regression

Using Excel for inferential statistics

Cluster Analysis. Isabel M. Rodrigues. Lisboa, Instituto Superior Técnico

Data Mining Cluster Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining

Network Analysis Basics and applications to online data

Course Text. Required Computing Software. Course Description. Course Objectives. StraighterLine. Business Statistics

Class #6: Non-linear classification. ML4Bio 2012 February 17 th, 2012 Quaid Morris

The Goldberg Rao Algorithm for the Maximum Flow Problem

Sampling Biases in IP Topology Measurements

Statistics. Measurement. Scales of Measurement 7/18/2012

Lecture 15 An Arithmetic Circuit Lowerbound and Flows in Graphs

Section Format Day Begin End Building Rm# Instructor. 001 Lecture Tue 6:45 PM 8:40 PM Silver 401 Ballerini

430 Statistics and Financial Mathematics for Business

An Empirical Study of Two MIS Algorithms

DATA MINING CLUSTER ANALYSIS: BASIC CONCEPTS

The right edge of the box is the third quartile, Q 3, which is the median of the data values above the median. Maximum Median

HISTOGRAMS, CUMULATIVE FREQUENCY AND BOX PLOTS

Overview of Violations of the Basic Assumptions in the Classical Normal Linear Regression Model

Review Jeopardy. Blue vs. Orange. Review Jeopardy

BNG 202 Biomechanics Lab. Descriptive statistics and probability distributions I

List of Examples. Examples 319

Logistic Regression (a type of Generalized Linear Model)

WAN Wide Area Networks. Packet Switch Operation. Packet Switches. COMP476 Networked Computer Systems. WANs are made of store and forward switches.

Distance Degree Sequences for Network Analysis

Design & Analysis of Ecological Data. Landscape of Statistical Methods...

Network Analysis For Sustainability Management

Information Network or Social Network? The Structure of the Twitter Follow Graph

AP STATISTICS REVIEW (YMS Chapters 1-8)

Protein Protein Interaction Networks

Algebra 1 Course Information

Transcription:

Social network analysis with R sna package George Zhang iresearch Consulting Group (China) bird@iresearch.com.cn birdzhangxiang@gmail.com

Social network (graph) definition G = (V,E) Max edges = N All possible E edge graphs = Linear graph(without parallel edges and slings) N E V V V V V V6 V7 V V9 V0 0 0 0 0 0 0 0 0 0 9 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 6 0 0 0 0 0 0 0 0 0 0 0 7 7 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 9 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 6 Erdös, P. and Rényi, A. (960). On the Evolution of Random Graphs.

Different kinds of networks Random graphs a graph that is generated by some random process Scale free network whose degree distribution follows a power law Small world most nodes are not neighbors of one another, but most nodes can be reached from every other by a small number of hops or steps

Differ by Graph index Degree distribution average node-to-node distance average shortest path length clustering coefficient Global, local

network examples Butts, C.T. (006). Cycle Census Statistics for Exponential Random Graph Models.

GLI-Graph level index

Array statistic Mean Variance Standard deviation Skr Graph statistic Degree Density Reciprocity Centralization

Simple graph measurements Degree Number of links to a vertex(indegree, outdegree ) Density sum of tie values divided by the number of possible ties Reciprocity the proportion of dyads which are symmetric Mutuality the number of complete dyads Transtivity the total number of transitive triads is computed

Example Degree sum(g) = Density 9 gden(g) = /90 = 0. Reciprocity grecip(g, measure="dyadic") = 0.76 0 7 grecip(g,measure="edgewise") = 0 Mutuality 6 mutuality(g) = 0 Transtivity gtrans(g) = 0.

Path and Cycle statistics kpath.census kcycle.census dyad.census Triad.census Butts, C.T. (006). Cycle Census Statistics for Exponential Random Graph Models.

Multi graph measurements Graph mean In dichotomous case, graph mean corresponds to graph s density Graph covariance gcov/gscov Graph correlation gcor/gscor Structural covariance unlabeled graph Butts, C.T., and Carley, K.M. (00). Multivariate Methods for Interstructural Analysis.

Example gcov(g,g) = -0.0096 9 gscov(g,g,exchange.list=:0) = - 0.0096 gscov(g,g)=0.00 unlabeled graph 0 7 6 gcor(g,g) = -0.0076 gscor(g,g,exchange.list=:0) = - 0.0076 gscor(g,g) = 0.099 7 6 0 unlabeled graph 9

gcov

Connectedness 0 for no edges for N edges Hierarchy Measure of structure 0 for all two-way links for all one-way links Efficiency 0 for edgs N V - N( N ) - V MaxV for N- edges Least Upper Boundedness (lubness) 0 for all vertex link into one for all outtree V - MaxV Krackhardt,David.(99). Graph Theoretical Dimensionsof Informal Organizations.

Example Outtree Connectedness= Hierarchy= Efficiency= Lubness=

Graph centrality Degree Number of links to a vertex(indegree, outdegree ) Betweenness Number of shortest paths pass it Closeness Length to all other vertices Centralization by ways above 0 for all vertices has equal position(central score) for vertex be the center of the graph See also evcent, bonpow, graphcent, infocent, prestige Freeman,L.C.(979). Centrality in Social Networks-Conceptual Clarification

Example > centralization(g,degree,mode="graph") [] 0.9 > centralization(g,betweenness,mode="graph") [] 0.06 > centralization(g,closeness,mode="graph") [] 0 6 0 7 9 Mode= graph means only consider indegree

Example > centralization(g,degree,mode="graph") [] 0.9 > centralization(g,betweenness,mode="graph") [] 0.06 > centralization(g,closeness,mode="graph") [] 0 Degree center 6 0 7 9 Mode= graph means only consider indegree

Example > centralization(g,degree,mode="graph") [] 0.9 > centralization(g,betweenness,mode="graph") [] 0.06 > centralization(g,closeness,mode="graph") [] 0 Betweeness center Degree center 6 0 7 9 Mode= graph means only consider indegree

Example > centralization(g,degree,mode="graph") [] 0.9 > centralization(g,betweenness,mode="graph") [] 0.06 > centralization(g,closeness,mode="graph") [] 0 Betweeness center Degree center 6 0 7 Mode= graph means only consider indegree 9 Both closeness centers

GLI relation

GLI map density Connectedness Efficiency Hierarchy 0 size Centralization(degree) Centralization(betweenness) Centralization(closeness) Anderson,B.S.;Butts,C. T.;and Carley,K. M.(999). The Interaction of Size and Density with Graph-Level Indices.

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 connectedness distribution by graph size and density density size / /6 / 6 Anderson,B.S.;Butts,C. T.;and Carley,K. M.(999). The Interaction of Size and Density with Graph-Level Indices.

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 efficiency distribution by graph size and density density size /6 / / 6

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 hierarchy distribution by graph size and density density size /6 / / 6

GLI map R code compare<-function(size,den) { g=rgraph(n=size,m=00,tprob=den) gli=apply(g,,connectedness) gli=apply(g,,efficiency) gli=apply(g,,hierarchy) gli=apply(g,,function(x) centralization(x,degree)) gli=apply(g,,function(x) centralization(x,betweenness)) gli6=apply(g,,function(x) centralization(x,closeness)) x=mean(gli,na.rm=t) x=mean(gli,na.rm=t) x=mean(gli,na.rm=t) x=mean(gli,na.rm=t) x=mean(gli,na.rm=t) x6=mean(gli6,na.rm=t) return(c(x,x,x,x,x,x6)) } nx=0 ny=0 res=array(0,c(nx,ny,6)) size=:6 den=seq(0.0,0.,length.out=0) for(i in :nx) for(j in :ny) res[i,j,]=compare(size[i],den[j]) #image(res,col=gray(000:/000)) par(mfrow=c(,)) image(res[,,],col=gray(000:/000),main="connectedness") image(res[,,],col=gray(000:/000),main="efficiency") image(res[,,],col=gray(000:/000),main="hierarchy") image(res[,,],col=gray(000:/000),main="centralization(degree)") image(res[,,],col=gray(000:/000),main="centralization(betweenness)") image(res[,,6],col=gray(000:/000),main="centralization(closeness)")

GLI distribution R code par(mfrow=c(,)) for(i in :) for(j in :) hist(centralization(rgraph(*î,00,tprob=j/),betweenness),main="",xlab="",ylab="",xlim=range(0:),ylim=range(0:0)) hist(centralization(rgraph(*î,00,tprob=j/),degree),main="",xlab="",ylab="",xlim=range(0:),ylim=range(0:0)) hist(hierarchy(rgraph(*î,00,tprob=j/6)),main="",xlab="",ylab="",xlim=range(0:),ylim=range(0:0)) hist(efficiency(rgraph(*î,00,tprob=j/6)),main="",xlab="",ylab="",xlim=range(0:),ylim=range(0:0)) hist(connectedness(rgraph(*î,00,tprob=j/)),main="",xlab="",ylab="",xlim=range(0:),ylim=range(0:00))

Graph distance Clustering, MDS

Distance between graphs Hamming(labeling) distance e : ( e E( G ), e E( G)) ( e E( G ), e E( G number of addition/deletion operations required to turn the edge set of G into that of G hdist for typical hamming distance matrix Structure distance d (G,H L,L ) min d( (G), (H)) S G H L G, L H structdist & sdmat for structure distance with exchange.list of vertices )) Butts, C.T., and Carley, K.M. (00). Multivariate Methods for Interstructural Analysis.

Example 6 7 9 0 6 7 9 0 6 7 9 0 6 7 9 0 6 7 9 0

Example hdist(g) 0 9 9 0 9 9 0 0 9 9 0 sdmat(g) [,] [,] [,] [,] [,] [,] 0 7 [,] 0 7 9 [,] 0 6 [,] 7 6 0 [,] 7 9 0 structdist(g) 0 0 0 0 0 0 0 0 0

Measure Distance 0.0 0.06 0.0-0 -0 0 0 0 Inter-Graph MDS gdist.plotstats Plot by distances between graphs Add graph level index as third or forth dimension -0-0 0 0 0 > g.h<-hdist(g) #sample graph used before > gdist.plotdiff(g.h,gden(g),lm.line=true) > gdist.plotstats(g.h,cbind(gden(g),grecip(g))) 0 0 Inter-Graph Distance

0.6 0.0 0. 0. Height 0 0 Graph clustering Use hamming distance g.h=hdist(g) g.c<-hclust(as.dist(g.h)) rect.hclust(g.c,) g.cg<-gclust.centralgraph(g.c,,g) gplot(g.cg[,,]) gplot(g.cg[,,]) gclust.boxstats(g.c,,gden(g)) Cluster Dendrogram hc hc as.dist(g.h) hclust (*, "complete") X X

Distance between vertices Structural equivalence sedist with methods:. correlation: the product-moment correlation. euclidean: the euclidean distance. hamming: the Hamming distance. gamma: the gamma correlation Path distance geodist with shortest path distance and the number of shortest pathes Breiger, R.L.; Boorman, S.A.; and Arabie, P. (97). An Algorithm for Clustering Relational Data with Applications to Social Network Analysis and Comparison with Multidimensional Scaling. Brandes, U. (000). Faster Evaluation of Shortest-Path Based Centrality Indices.

sedist Example 9 7 0 6 sedist(g) = sedist(g,mode="graph") [,] [,] [,] [,] [,] [,6] [,7] [,] [,9] [,0] [,] 0 0 7 9 [,] 0 0 7 9 [,] 0 6 6 6 6 6 [,] 7 7 6 0 6 6 6 6 [,] 6 0 [6,] 6 6 0 6 [7,] 6 6 0 6 [,] 6 6 0 6 [9,] 0 6 [0,] 9 9 6 6 6 6 6 0

geodist Example 7 6 0 9 geodist(g) $counts [,] [,] [,] [,] [,] [,6] [,7] [,] [,9][,0] [,] [,] 0 [,] 0 0 [,] 0 0 0 [,] 0 0 0 [6,] 0 0 0 [7,] 0 0 0 [,] 0 0 0 [9,] 0 0 0 [0,] 0 0 0 $gdist [,] [,] [,] [,] [,] [,6] [,7] [,] [,9][,0] [,] 0 [,] Inf 0 [,] Inf Inf 0 [,] Inf Inf Inf 0 [,] Inf Inf Inf 0 6 [6,] Inf Inf Inf 0 [7,] Inf Inf Inf 0 [,] Inf Inf Inf 0 [9,] Inf Inf Inf 0 [0,] Inf Inf Inf 0

geodist Example 7 6 0 9 geodist(g) $counts [,] [,] [,] [,] [,] [,6] [,7] [,] [,9][,0] [,] [,] 0 [,] 0 0 [,] 0 0 0 [,] 0 0 0 [6,] 0 0 0 [7,] 0 0 0 [,] 0 0 0 [9,] 0 0 0 [0,] 0 0 0 $gdist [,] [,] [,] [,] [,] [,6] [,7] [,] [,9][,0] [,] 0 [,] Inf 0 [,] Inf Inf 0 [,] Inf Inf Inf 0 [,] Inf Inf Inf 0 6 [6,] Inf Inf Inf 0 [7,] Inf Inf Inf 0 [,] Inf Inf Inf 0 [9,] Inf Inf Inf 0 [0,] Inf Inf Inf 0

geodist Example 7 6 0 9 geodist(g) $counts [,] [,] [,] [,] [,] [,6] [,7] [,] [,9][,0] [,] [,] 0 [,] 0 0 [,] 0 0 0 [,] 0 0 0 [6,] 0 0 0 [7,] 0 0 0 [,] 0 0 0 [9,] 0 0 0 [0,] 0 0 0 $gdist [,] [,] [,] [,] [,] [,6] [,7] [,] [,9][,0] [,] 0 [,] Inf 0 [,] Inf Inf 0 [,] Inf Inf Inf 0 [,] Inf Inf Inf 0 6 [6,] Inf Inf Inf 0 [7,] Inf Inf Inf 0 [,] Inf Inf Inf 0 [9,] Inf Inf Inf 0 [0,] Inf Inf Inf 0

geodist Example 7 6 0 9 geodist(g) $counts [,] [,] [,] [,] [,] [,6] [,7] [,] [,9][,0] [,] [,] 0 [,] 0 0 [,] 0 0 0 [,] 0 0 0 [6,] 0 0 0 [7,] 0 0 0 [,] 0 0 0 [9,] 0 0 0 [0,] 0 0 0 $gdist [,] [,] [,] [,] [,] [,6] [,7] [,] [,9][,0] [,] 0 [,] Inf 0 [,] Inf Inf 0 [,] Inf Inf Inf 0 [,] Inf Inf Inf 0 6 [6,] Inf Inf Inf 0 [7,] Inf Inf Inf 0 [,] Inf Inf Inf 0 [9,] Inf Inf Inf 0 [0,] Inf Inf Inf 0

geodist reachability gplot(reachability(g),label=:0) 6 7 9 7 0 0 9 6

6 7 9 Height 0 6 0 Graph vertices clustering by sedist General clustering methods equiv.clust for vertices clustering by Structural equivalence( sedist ) Cluster Dendrogram 9 7 0 6 as.dist(equiv.dist) hclust (*, "complete")

Distance 0. 0. 0. 0.7 Graph structure by geodist structure.statistics > ss<-structure.statistics(g) > plot(0:9,ss,xlab="mean Coverage",ylab="Distance") 9 7 0 6 0 6 Mean Coverage

Graph cov based function Regression, principal component, canonical correlation

Multi graph measurements Graph mean In dichotomous case, graph mean corresponds to graph s density Graph covariance gcov/gscov Graph correlation gcor/gscor Structural covariance unlabeled graph Butts, C.T., and Carley, K.M. (00). Multivariate Methods for Interstructural Analysis.

Correlation statistic model Canonical correlation netcancor Linear regression netlm Logistic regression netlogit Linear autocorrelation model lnam nacf

Random graph models

Graph evolution Random Biased Phases

Proportion Reached 0. 0. Mut Asym Null 00 0 0 0D 0U 0C D U 00T 00C 0 0D 0U 0C 0 00 Count Count 0 0 Biased net model graph generate: rgbn graph prediction: bn Predicted Dyad Census Predicted Triad Census 9 6 Dyad Type Triad Type 0 7 Predicted Structure Statistics 0 6 Distance

Graph statistic test cugtest qaptest

Thanks