Enron Network Analysis Tutorial



Enron Tutorial

We provide this Enron Tutorial as an appendix to the paper in the Journal of Statistics Education, Network Analysis with the Enron Email Corpus. The paper describes the centrality measures in detail, and we go through the steps in the R analysis here. As in the .Rmd file with the R code, be sure to install the packages WGCNA and igraph.

The first step of the analysis is to import the data and create the adjacency matrix of our choice. AM represents emails sent from node i to node j, with messages sent via CC weighted as described in the paper. The transpose of the matrix, AMt, represents emails received by node i from node j (again, with CC values weighted differently than messages sent directly to an individual). The sum of the two matrices, AM2, represents the total email correspondence (sent and received) between nodes i and j.

    library(WGCNA)         # TOMsimilarity, TOMdist, cutreeStaticColor, plotDendroAndColors
    library(igraph)        # graph construction and centrality measures
    library(gplots)        # heatmap.2
    library(RColorBrewer)  # brewer.pal

    setwd("h:/teaching/researchcircle/spring2014 - DataScience/JSE.DSS")
    AM = as.matrix(read.csv("Final Adjacency Matrix.csv", sep=",", header=TRUE, row.names=1)) # sent emails
    AMlist = read.csv("Final Adjacency Matrix.csv", sep=",", header=TRUE, row.names=1) # sent emails as list
    # employee information might be interesting to analyze for considering relationships
    # within the company
    # enronemployees = read.table("Enron Employee Information.csv", sep=",", header=T)
    AMt = t(AM) # received emails
    AM2 <- AM + t(AM) - 2*diag(diag(AM)) # sent and received emails (diagonal set to zero)

We can represent the adjacency matrix graphically using a heatmap.

    # label only three rows/columns (positions 21, 66, and 69); all others stay blank
    AM.names = c(rep(NA,20), row.names(AM)[21], rep(NA,44), row.names(AM)[66],
                 rep(NA,2), row.names(AM)[69], rep(NA,87))
    heatmap.2(log2(AM+1), Rowv=FALSE, Colv=FALSE, dendrogram="none",
              col=(brewer.pal(9,"Blues")), scale="none", trace="none",
              labRow=AM.names, labCol=AM.names, colsep=FALSE, density.info="none",
              key.title="", key.xlab="# of emails (log2 scale)", margins=c(8,8))
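To make the construction of AM, AMt, and AM2 concrete, the following is a minimal sketch on a three-person toy matrix. The matrix `toy` and its values are hypothetical and are not taken from the Enron data; the formula is the same one used for AM2 above. Note that the last term removes the doubled diagonal, so self-addressed email does not count toward correspondence.

```r
# Hypothetical email counts: entry [i, j] is the number of emails sent from i to j.
# The 7 on the diagonal stands for email that A sent to A (self-addressed).
toy <- matrix(c(7, 3, 1,
                2, 0, 0,
                5, 4, 0),
              nrow = 3, byrow = TRUE,
              dimnames = list(c("A", "B", "C"), c("A", "B", "C")))

toy_t <- t(toy)                              # [i, j] = emails i received from j
toy2  <- toy + t(toy) - 2 * diag(diag(toy))  # sent + received, diagonal zeroed

toy2["A", "B"]     # 3 sent from A to B plus 2 sent from B to A
toy2["A", "A"]     # self-addressed email is removed
isSymmetric(toy2)  # the combined matrix no longer has a direction
```

Because `toy2` is symmetric, it can be treated as an undirected weighted network, which is what the clustering steps later in the tutorial rely on.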

[Figure: heatmap of the adjacency matrix on a log2 scale. Labeled rows and columns: Jeff Dasovich, Louise Kitchen, Kenneth Lay. Color key: # of emails (log2 scale), 0 to 8.]

Eigenvector Centrality

The first measure of centrality that we use is eigenvector centrality; the evcent function is available in the igraph package. Degree, betweenness, and closeness centrality measures are also available in igraph.

    # eigenvector centrality (on both directed graphs),
    # degree, betweenness, and closeness
    eng <- graph.adjacency(do.call(rbind,AMlist)) # creates a network graph using the adjacency matrix
    engt <- graph.adjacency(do.call(cbind,AMlist)) # creates a network graph using the transpose of the adjacency matrix
    eigcent <- igraph::evcent(eng, directed=TRUE) # eigenvector centrality
    eigcentt <- igraph::evcent(engt, directed=TRUE) # eigenvector centrality on transpose of graph
    dcent <- igraph::degree(eng) # degree centrality
    bmeas <- igraph::betweenness(eng) # betweenness
    cmeas <- igraph::closeness(eng) # closeness

    # TOM
    AM2 <- AM2 / max(AM2) # set values between 0 and 1
    TOM <- TOMsimilarity(AM2) # create TOM

    ## ..connectivity..
    ## ..matrix multiplication..
    ## ..normalization..
    ## ..done.
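As a rough illustration of what eigenvector centrality measures, the sketch below computes it by hand for a small symmetric network using base R's `eigen`. The four-node matrix `A` is hypothetical, and the sketch covers only the undirected, unweighted case; igraph's own routine also handles direction, weights, and scaling options, so this is not a substitute for it.

```r
# Hypothetical undirected network: node 1 is linked to everyone,
# nodes 2 and 3 are also linked to each other, node 4 only knows node 1.
A <- matrix(c(0, 1, 1, 1,
              1, 0, 1, 0,
              1, 1, 0, 0,
              1, 0, 0, 0),
            nrow = 4, byrow = TRUE)

e <- eigen(A, symmetric = TRUE)  # eigenvalues are returned in decreasing order
v <- abs(e$vectors[, 1])         # leading eigenvector; its overall sign is arbitrary
ev_cent <- v / max(v)            # rescale so the most central node scores 1

ev_cent
order(-ev_cent)                  # node indices from most to least central
```

Each node's score is proportional to the sum of its neighbours' scores, which is why node 4 ranks last even though it is attached to the most central node: it has only that one connection.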

    TOMrank <- as.matrix(apply(TOM,1,sum)) # grab its row-sums
    rownames(TOMrank) <- rownames(AM)
    colnames(TOMrank) <- "value"

Initially, we plot the ranks of the individuals based on the different measures of centrality. The ranks are clearly correlated, but the measures also appear to capture different qualities of the email correspondence matrix.

    comptable <- matrix(ncol=6, nrow=dim(AM)[1])
    comptable[,1] <- rank(dcent)
    comptable[,2] <- rank(eigcent$vector)
    comptable[,3] <- rank(eigcentt$vector)
    comptable[,4] <- rank(cmeas)
    comptable[,5] <- rank(bmeas)
    comptable[,6] <- rank(TOMrank)
    # panel.cor is not defined in this transcript; it is presumably the
    # correlation panel function from the examples in ?pairs
    pairs(comptable[,1:6], pch=20, main="Ranking Metrics Comparison",
          labels=c("Degree","EV Cent.", "EV Cent. (T)","Closeness","Betweenness", "TOM"),
          cex=.5, xlim=c(0,160), ylim=c(0,160), lower.panel=panel.cor)

[Figure: Ranking Metrics Comparison. Pairwise scatterplots of the six rankings, with the correlations below shown in the lower panels.]

                   Degree   EV Cent.   EV Cent. (T)   Closeness   Betweenness
    EV Cent.        0.80
    EV Cent. (T)    0.69     0.80
    Closeness       0.53     0.58       0.71
    Betweenness     0.67     0.65       0.54           0.59
    TOM             0.99     0.78       0.71           0.56        0.64

Next, we list the 10 most central individuals for each metric. Note that we use the negative of the centrality measure so that the order function returns the most central individual first.

    rankedenron <- data.frame(Degree = rownames(AM)[order(-dcent)],
                              EVcent = rownames(AM)[order(-eigcent$vector)],
                              EVcentT = rownames(AM)[order(-eigcentt$vector)],
                              Close = rownames(AM)[order(-cmeas)],
                              Between = rownames(AM)[order(-bmeas)],
                              TOM = rownames(AM)[order(-TOMrank)])
    rankedenron[1:10,]

    ##    Degree           EVcent           EVcentT           Close
    ## 1  Jeff Dasovich    Tana Jones       Sara Shackleton   Robert Benson
    ## 2  Mike Grigsby     Sara Shackleton  Susan Bailey      Mike Grigsby
    ## 3  Tana Jones       Stephanie Panus  Marie Heard       Louise Kitchen
    ## 4  Sara Shackleton  Marie Heard      Tana Jones        Kevin M. Presto
    ## 5  Richard Shapiro  Susan Bailey     Stephanie Panus   Susan Scott
    ## 6  Steven J. Kean   Kay Mann         Elizabeth Sager   Scott Neal
    ## 7  Louise Kitchen   Louise Kitchen   Jason Williams    Barry Tycholiz
    ## 8  Susan Scott      Elizabeth Sager  Louise Kitchen    Greg Whalley
    ## 9  Michelle Lokay   Jason Williams   Jeffrey T. Hodge  Phillip K. Allen
    ## 10 Chris Germany    Jeff Dasovich    Gerald Nemec      Jeff Dasovich
    ##    Between          TOM
    ## 1  Louise Kitchen   Jeff Dasovich
    ## 2  Mike Grigsby     Richard Shapiro
    ## 3  Susan Scott      Steven J. Kean
    ## 4  Jeff Dasovich    Mike Grigsby
    ## 5  Mary Hain        Tana Jones
    ## 6  Sally Beck       Sara Shackleton
    ## 7  Kenneth Lay      Mary Hain
    ## 8  Scott Neal       Marie Heard
    ## 9  Kate Symes       Stephanie Panus
    ## 10 Cara Semperger   Susan Scott

Hierarchical Clustering

Below, we build hierarchical clusterings from both the symmetric (sent and received) email adjacency matrix and the TOM adjacency built from the symmetric measures. After building each dendrogram, we find groups of employees who are strongly linked and report the names of the individuals.
    # dissimilarity is 1 - number of sent and received / max of sent and received over all individuals
    dissam2 = 1 - AM2
    # Create the hierarchical clustering
    hieram2 = hclust(as.dist(dissam2), method="average")
    groups.9 = as.character(cutreeStaticColor(hieram2, cutHeight=.9, minSize=4))
    # Plot results of all module detection methods together:
    plotDendroAndColors(dendro = hieram2, colors=data.frame(groups.9),
                        dendroLabels = FALSE, abHeight=.9,
                        marAll = c(0.2, 5, 2.7, 0.2), hang=.05,
                        main="min 4 per group, cutoff=0.9", ylab="1 - S&R/max(S&R)")
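The idea behind the constant-height cut with a minimum group size can be sketched with base R alone. The example below is an approximation under stated assumptions, not WGCNA's implementation: the six-person similarity matrix `S` is hypothetical, `cutree` stands in for `cutreeStaticColor`, and labelling undersized branches "grey" is assumed from the grey group that appears in the tutorial's output.

```r
# Hypothetical normalised correspondence strengths among six people (symmetric, in [0, 1]).
people <- paste0("p", 1:6)
S <- matrix(0.05, 6, 6, dimnames = list(people, people))
S[1:3, 1:3] <- 0.9   # p1, p2, p3 email each other heavily
S[4:5, 4:5] <- 0.8   # p4 and p5 form a smaller pair
diag(S) <- 1

D  <- as.dist(1 - S)                  # dissimilarity, analogous to 1 - AM2
hc <- hclust(D, method = "average")   # average-linkage dendrogram
branch <- cutree(hc, h = 0.5)         # cut the tree at a fixed height

# Keep only branches with at least min_size members; everything else is "grey".
min_size <- 3
sizes  <- tabulate(branch)            # sizes[b] = number of members in branch b
groups <- ifelse(sizes[branch] >= min_size, paste0("group", branch), "grey")
names(groups) <- names(branch)

table(groups)
```

Here the pair p4 and p5 is tightly linked but falls below the minimum size, so it is left unassigned along with the isolated p6. The same mechanism explains why most employees in the tutorial end up in the grey group.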

[Figure: average-linkage dendrogram of as.dist(dissam2), titled "min 4 per group, cutoff=0.9". Vertical axis: 1 - S&R/max(S&R), from 0.0 to 1.0. The color band below the dendrogram shows groups.9.]

    table(groups.9)

    ## groups.9
    ##      blue      grey turquoise
    ##         4       147         5

    row.names(AM)[groups.9=="turquoise"]

    ## [1] "Susan Bailey"    "Marie Heard"     "Tana Jones"      "Stephanie Panus"
    ## [5] "Sara Shackleton"

    row.names(AM)[groups.9=="blue"]

    ## [1] "Jeff Dasovich"   "Mary Hain"       "Steven J. Kean"  "Richard Shapiro"

    ### Now cluster with TOM
    # for the next plot, dissimilarity uses TOM metric to incorporate neighbors
    disstom = TOMdist(AM2)

    ## ..connectivity..
    ## ..matrix multiplication..
    ## ..normalization..
    ## ..done.
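To show how a topological overlap measure brings shared neighbors into the dissimilarity, here is a minimal base-R sketch of the commonly cited unsigned TOM formula, TOM[i, j] = (sum_u a[i, u] a[u, j] + a[i, j]) / (min(k_i, k_j) + 1 - a[i, j]). The function name `tom_sketch` and the three-node matrix are hypothetical, and the sketch has not been checked against WGCNA's implementation, which may differ in details such as diagonal handling.

```r
# Assumed unsigned TOM formula; input must be symmetric with entries in [0, 1].
tom_sketch <- function(A) {
  diag(A) <- 0                    # ignore self-links
  L <- A %*% A                    # L[i, j] = strength of paths i -> u -> j via shared neighbours
  k <- rowSums(A)                 # connectivity of each node
  tom <- (L + A) / (outer(k, k, pmin) + 1 - A)
  diag(tom) <- 1                  # each node overlaps completely with itself
  tom
}

# Hypothetical network: node 1 is linked to nodes 2 and 3; nodes 2 and 3 are not linked.
A <- matrix(c(0, 1, 1,
              1, 0, 0,
              1, 0, 0),
            nrow = 3, byrow = TRUE)

tom      <- tom_sketch(A)
tom_dist <- 1 - tom               # dissimilarity, analogous to what TOMdist returns

tom[2, 3]       # positive although A[2, 3] is 0, because both are linked to node 1
tom_dist[2, 3]
```

Under the plain dissimilarity 1 - A, nodes 2 and 3 would be maximally far apart; with the overlap measure they are pulled closer because they share a neighbor. That is the sense in which the TOM-based clustering incorporates neighbors.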

    rownames(disstom) <- rownames(AM)
    colnames(disstom) <- rownames(AM)
    # Create the hierarchical clustering
    hiertom = hclust(as.dist(disstom), method="average")
    groupstom.95 = as.character(cutreeStaticColor(hiertom, cutHeight=.95, minSize=4))
    # Plot results of all module detection methods together:
    plotDendroAndColors(dendro = hiertom, colors=data.frame(groupstom.95),
                        abHeight=.95, dendroLabels = FALSE,
                        marAll = c(0.2, 5, 2.7, 0.2),
                        main="min 4 per group, cutoff=0.95", ylab="TOM dissimilarity")

[Figure: average-linkage dendrogram of as.dist(disstom), titled "min 4 per group, cutoff=0.95". Vertical axis: TOM dissimilarity, from 0.2 to 1.0. The color band below the dendrogram shows groupstom.95.]

    table(groupstom.95)

    ## groupstom.95
    ##      blue     brown      grey turquoise    yellow
    ##         6         4       136         6         4

    row.names(AM)[groupstom.95=="turquoise"]

    ## [1] "Robert Badeer"    "Jeff Dasovich"    "Mary Hain"
    ## [4] "Steven J. Kean"   "Richard Shapiro"  "James D. Steffes"

    row.names(AM)[groupstom.95=="blue"]

    ## [1] "Susan Bailey"    "Marie Heard"     "Tana Jones"      "Stephanie Panus"
    ## [5] "Elizabeth Sager" "Sara Shackleton"

    row.names(AM)[groupstom.95=="brown"]

    ## [1] "Lindy Donoho"    "Michelle Lokay"  "Mark McConnell"  "Kimberly Watson"

    row.names(AM)[groupstom.95=="yellow"]

    ## [1] "Drew Fossum"   "Steven Harris" "Kevin Hyatt"   "Susan Scott"