Tutorial for the WGCNA package for R: I. Network analysis of liver expression data in female mice



Similar documents
Tutorial for the WGCNA package for R: I. Network analysis of liver expression data in female mice

Tutorial for the WGCNA package for R I. Network analysis of liver expression data in female mice

Methods for network visualization and gene enrichment analysis July 17, Jeremy Miller Scientist I jeremym@alleninstitute.org

Statistical Applications in Genetics and Molecular Biology

A General Framework for Weighted Gene Co-expression Network Analysis

Cluster Analysis using R

Enron Network Analysis Tutorial r date()

COC131 Data Mining - Clustering

Journal of Statistical Software

Clustering Artificial Intelligence Henry Lin. Organizing data into clusters such that there is

Distances, Clustering, and Classification. Heatmaps

A Demonstration of Hierarchical Clustering

Each function call carries out a single task associated with drawing the graph.

7 Time series analysis

Visualizing non-hierarchical and hierarchical cluster analyses with clustergrams

Tutorial 3: Graphics and Exploratory Data Analysis in R Jason Pienaar and Tom Miller

Designing forms for auto field detection in Adobe Acrobat

UNSUPERVISED MACHINE LEARNING TECHNIQUES IN GENOMICS

Hierarchical Clustering Analysis

Basic Analysis of Microarray Data

Time series clustering and the analysis of film style

Getting Started with Command Prompts

TIBCO Spotfire Business Author Essentials Quick Reference Guide. Table of contents:

Table of Contents. Online backup Manager User s Guide

Module 9. User Interface Design. Version 2 CSE IIT, Kharagpur

Learning Module 4 - Thermal Fluid Analysis Note: LM4 is still in progress. This version contains only 3 tutorials.

Bioinformatics: Network Analysis

An Order-Invariant Time Series Distance Measure [Position on Recent Developments in Time Series Analysis]

Cluster software and Java TreeView

Clustering UE 141 Spring 2013

Smart Connection 9 Element Labels

Text Analytics using High Performance SAS Text Miner

Step-by-Step Guide to Bi-Parental Linkage Mapping WHITE PAPER

PRODUCT INFORMATION. Insight+ Uses and Features

This chapter will demonstrate how to perform multiple linear regression with IBM SPSS

Mathematics (Project Maths Phase 1)

Tutorial for proteome data analysis using the Perseus software platform

Cluster Analysis. Aims and Objectives. What is Cluster Analysis? How Does Cluster Analysis Work? Postgraduate Statistics: Cluster Analysis

Scientific Graphing in Excel 2010

Visualization Plugin for ParaView

Steven M. Ho!and. Department of Geology, University of Georgia, Athens, GA

There are six different windows that can be opened when using SPSS. The following will give a description of each of them.

Describe the Create Profile dialog box. Discuss the Update Profile dialog box.examine the Annotate Profile dialog box.

JustClust User Manual

Essay 5 Tutorial for a Three-Dimensional Heat Conduction Problem Using ANSYS Workbench

DeCyder Extended Data Analysis module Version 1.0

Formulas, Functions and Charts

LabVIEW Day 6: Saving Files and Making Sub vis

Improving the Performance of Data Mining Models with Data Preparation Using SAS Enterprise Miner Ricardo Galante, SAS Institute Brasil, São Paulo, SP

Quick Start. Creating a Scoring Application. RStat. Based on a Decision Tree Model

Pavement Management System Overview

Visualization with OpenDX

PURPOSE OF GRAPHS YOU ARE ABOUT TO BUILD. To explore for a relationship between the categories of two discrete variables

Package copa. R topics documented: August 9, 2016

WEKA Explorer User Guide for Version 3-4-3

Cluster Analysis. Isabel M. Rodrigues. Lisboa, Instituto Superior Técnico

2 SYSTEM DESCRIPTION TECHNIQUES

SAS Analyst for Windows Tutorial

Tutorial: 3D Pipe Junction Using Hexa Meshing

Using Illumina BaseSpace Apps to Analyze RNA Sequencing Data

Package bigrf. February 19, 2015

Protein Protein Interaction Networks

R software tutorial: Random Forest Clustering Applied to Renal Cell Carcinoma Steve Horvath and Tao Shi

Planning Terrestrial Radio Networks

Knowledge Discovery and Data Mining

Practical Differential Gene Expression. Introduction

ECONOMICS 351* -- Stata 10 Tutorial 2. Stata 10 Tutorial 2

sample median Sample quartiles sample deciles sample quantiles sample percentiles Exercise 1 five number summary # Create and view a sorted

Analyzing the Effect of Treatment and Time on Gene Expression in Partek Genomics Suite (PGS) 6.6: A Breast Cancer Study

SPSS Tutorial, Feb. 7, 2003 Prof. Scott Allard

They can be obtained in HQJHQH format directly from the home page at:

Step-by-Step Guide to Basic Expression Analysis and Normalization

Arena Tutorial 1. Installation STUDENT 2. Overall Features of Arena

Chapter 1: Looking at Data Section 1.1: Displaying Distributions with Graphs

SeattleSNPs Interactive Tutorial: Web Tools for Site Selection, Linkage Disequilibrium and Haplotype Analysis

Exiqon Array Software Manual. Quick guide to data extraction from mircury LNA microrna Arrays

MPLAB X + CCS C Compiler Tutorial

Guide for Data Visualization and Analysis using ACSN

Visual Tutorial Basic Edition 1. Visual. Basic Edition Tutorial.

How To Cluster Of Complex Systems

Classifying Large Data Sets Using SVMs with Hierarchical Clusters. Presented by :Limou Wang

LSA SAF products: files and formats

Create Charts in Excel

Categorical Data Visualization and Clustering Using Subjective Factors

How To Cluster

QuantStudio 3D AnalysisSuite Software: Relative Quantification

Period04 User Guide. Abstract. 1 Introduction. P. Lenz, M. Breger

A Short Guide to R with RStudio

Big Data: Rethinking Text Visualization

Lab 0: Preparing your laptop for the course OS X

Two-Way ANOVA tests. I. Definition and Applications...2. II. Two-Way ANOVA prerequisites...2. III. How to use the Two-Way ANOVA tool?...

Strong and Weak Ties

Immunophenotyping Flow Cytometry Tutorial. Contents. Experimental Requirements...1. Data Storage...2. Voltage Adjustments...3. Compensation...

Neural Networks Lesson 5 - Cluster Analysis

2. Simple Linear Regression

Exercise 1: How to Record and Present Your Data Graphically Using Excel Dr. Chris Paradise, edited by Steven J. Price

Transcription:

Tutorial for the WGCNA package for R: I. Network analysis of liver expression data in female mice 2.b Step-by-step network construction and module detection Peter Langfelder and Steve Horvath November 25, 2014 Contents 0 Preliminaries: setting up the R session 1 2 Step-by-step construction of the gene network and identification of modules 2 2.b Step-by-step network construction and module detection.......................... 2 2.b.1 Choosing the soft-thresholding power: analysis of network topology................ 2 2.b.2 Co-expression similarity and adjacency................................ 3 2.b.3 Topological Overlap Matrix (TOM)................................. 3 2.b.4 Clustering using TOM......................................... 3 2.b.5 Merging of modules whose expression profiles are very similar................... 5 0 Preliminaries: setting up the R session Here we assume that a new R session has just been started. We load the WGCNA package, set up basic parameters and load data saved in the first part of the tutorial. Important note: The code below uses parallel computation where multiple cores are available. This works well when R is run from a terminal or from the Graphical User Interface (GUI) shipped with R itself, but at present it does not work with RStudio and possibly other third-party R environments. If you use RStudio or other third-party R environments, skip the enablewgcnathreads() call below. # Display the current working directory getwd(); # If necessary, change the path below to the directory where the data files are stored. # "." means current directory. On Windows use a forward slash / instead of the usual \. workingdir = "."; setwd(workingdir); # Load the WGCNA package library(wgcna) # The following setting is important, do not omit. options(stringsasfactors = FALSE); # Allow multi-threading within WGCNA. At present this call is necessary. # Any error here may be ignored but you may want to update WGCNA if you see one. # Caution: skip this line if you run RStudio or other third-party R environments. # See note above. enablewgcnathreads() # Load the data saved in the first part lnames = load(file = "FemaleLiver-01-dataInput.RData");

#The variable lnames contains the names of loaded variables. lnames We have loaded the variables datexpr and dattraits containing the expression and trait data, respectively. 2 Step-by-step construction of the gene network and identification of modules This step is the bedrock of all network analyses using the WGCNA methodology. We present three different ways of constructing a network and identifying modules: a. Using a convenient 1-step network construction and module detection function, suitable for users wishing to arrive at the result with minimum effort; b. Step-by-step network construction and module detection for users who would like to experiment with customized/alternate methods; c. An automatic block-wise network construction and module detection method for users who wish to analyze data sets too large to be analyzed all in one. In this tutorial section, we illustrate the step-by-step network construction and module detection. 2.b Step-by-step network construction and module detection 2.b.1 Choosing the soft-thresholding power: analysis of network topology Constructing a weighted gene network entails the choice of the soft thresholding power β to which co-expression similarity is raised to calculate adjacency [1]. The authors of [1] have proposed to choose the soft thresholding power based on the criterion of approximate scale-free topology. We refer the reader to that work for more details; here we illustrate the use of the function picksoftthreshold that performs the analysis of network topology and aids the user in choosing a proper soft-thresholding power. The user chooses a set of candidate powers (the function provides suitable default values), and the function returns a set of network indices that should be inspected, for example as follows: # Choose a set of soft-thresholding powers powers = c(c(1:10), seq(from = 12, to=20, by=2)) # Call the network topology analysis function sft = picksoftthreshold(datexpr, powervector = powers, verbose = 5) # Plot the results: sizegrwindow(9, 5) par(mfrow = c(1,2)); cex1 = 0.9; # Scale-free topology fit index as a function of the soft-thresholding power plot(sft$fitindices[,1], -sign(sft$fitindices[,3])*sft$fitindices[,2], xlab="soft Threshold (power)",ylab="scale Free Topology Model Fit,signed R^2",type="n", main = paste("scale independence")); text(sft$fitindices[,1], -sign(sft$fitindices[,3])*sft$fitindices[,2], labels=powers,cex=cex1,col="red"); # this line corresponds to using an R^2 cut-off of h abline(h=0.90,col="red") # Mean connectivity as a function of the soft-thresholding power plot(sft$fitindices[,1], sft$fitindices[,5], xlab="soft Threshold (power)",ylab="mean Connectivity", type="n", main = paste("mean connectivity")) text(sft$fitindices[,1], sft$fitindices[,5], labels=powers, cex=cex1,col="red") The result is shown in Fig. 1. We choose the power 6, which is the lowest power for which the scale-free topology fit index reaches 0.90.

Scale independence Mean connectivity Scale Free Topology Model Fit,signed R^2 0.0 0.2 0.4 0.6 0.8 1 2 3 6 7 8 16 18 20 9 14 10 12 5 4 5 10 15 20 Soft Threshold (power) Mean Connectivity 0 200 400 600 1 2 3 4 5 6 7 8 9 10 12 14 16 18 20 5 10 15 20 Soft Threshold (power) Figure 1: Analysis of network topology for various soft-thresholding powers. The left panel shows the scale-free fit index (y-axis) as a function of the soft-thresholding power (x-axis). The right panel displays the mean connectivity (degree, y-axis) as a function of the soft-thresholding power (x-axis). 2.b.2 Co-expression similarity and adjacency We now calculate the adjacencies, using the soft thresholding power 6: softpower = 6; adjacency = adjacency(datexpr, power = softpower); 2.b.3 Topological Overlap Matrix (TOM) To minimize effects of noise and spurious associations, we transform the adjacency into Topological Overlap Matrix, and calculate the corresponding dissimilarity: # Turn adjacency into topological overlap TOM = TOMsimilarity(adjacency); disstom = 1-TOM 2.b.4 Clustering using TOM We now use hierarchical clustering to produce a hierarchical clustering tree (dendrogram) of genes. Note that we use the function hclust that provides a much faster hierarchical clustering routine than the standard hclust function. # Call the hierarchical clustering function genetree = hclust(as.dist(disstom), method = "average"); # Plot the resulting clustering tree (dendrogram) sizegrwindow(12,9) plot(genetree, xlab="", sub="", main = "Gene clustering on TOM-based dissimilarity", labels = FALSE, hang = 0.04); The clustering dendrogram plotted by the last command is shown in Figure 2. In the clustering tree (dendrogram), each leaf, that is a short vertical line, corresponds to a gene. Branches of the dendrogram group together densely interconnected, highly co-expressed genes. Module identification amounts to the identification of individual branches ( cutting the branches off the dendrogram ). There are several methods for branch cutting; our standard method is the Dynamic Tree Cut from the package dynamictreecut. The next snippet of code illustrates its use.

Gene clustering on TOM based dissimilarity Height 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 Figure 2: Clustering dendrogram of genes, with dissimilarity based on topological overlap. # We like large modules, so we set the minimum module size relatively high: minmodulesize = 30; # Module identification using dynamic tree cut: dynamicmods = cutreedynamic(dendro = genetree, distm = disstom, deepsplit = 2, pamrespectsdendro = FALSE, minclustersize = minmodulesize); table(dynamicmods) The function returned 22 modules labeled 1 22 largest to smallest. Label 0 is reserved for unassigned genes. The above command lists the sizes of the modules. We now plot the module assignment under the gene dendrogram: # Convert numeric lables into colors dynamiccolors = labels2colors(dynamicmods) table(dynamiccolors) # Plot the dendrogram and colors underneath sizegrwindow(8,6) plotdendroandcolors(genetree, dynamiccolors, "Dynamic Tree Cut", dendrolabels = FALSE, hang = 0.03, addguide = TRUE, guidehang = 0.05, main = "Gene dendrogram and module colors") The result is shown in Fig. 3.

Gene dendrogram and module colors Height 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 Dynamic Tree Cut as.dist(disstom) hclust (*, "average") Figure 3: Clustering dendrogram of genes, with dissimilarity based on topological overlap, together with assigned module colors. 2.b.5 Merging of modules whose expression profiles are very similar The Dynamic Tree Cut may identify modules whose expression profiles are very similar. It may be prudent to merge such modules since their genes are highly co-expressed. To quantify co-expression similarity of entire modules, we calculate their eigengenes and cluster them on their correlation: # Calculate eigengenes MEList = moduleeigengenes(datexpr, colors = dynamiccolors) MEs = MEList$eigengenes # Calculate dissimilarity of module eigengenes MEDiss = 1-cor(MEs); # Cluster module eigengenes METree = hclust(as.dist(mediss), method = "average"); # Plot the result sizegrwindow(7, 6) plot(metree, main = "Clustering of module eigengenes", xlab = "", sub = "") We choose a height cut of 0.25, corresponding to correlation of 0.75, to merge (see Fig. 4):

MEDissThres = 0.25 # Plot the cut line into the dendrogram abline(h=medissthres, col = "red") # Call an automatic merging function merge = mergeclosemodules(datexpr, dynamiccolors, cutheight = MEDissThres, verbose = 3) # The merged module colors mergedcolors = merge$colors; # Eigengenes of the new merged modules: mergedmes = merge$newmes; To see what the merging did to our module colors, we plot the gene dendrogram again, with the original and merged module colors underneath (Figure 5). sizegrwindow(12, 9) #pdf(file = "Plots/geneDendro-3.pdf", wi = 9, he = 6) plotdendroandcolors(genetree, cbind(dynamiccolors, mergedcolors), c("dynamic Tree Cut", "Merged dynamic"), dendrolabels = FALSE, hang = 0.03, addguide = TRUE, guidehang = 0.05) #dev.off() In the subsequent analysis, we will use the merged module colors in mergedcolors. We save the relevant variables for use in subsequent parts of the tutorial: # Rename to modulecolors modulecolors = mergedcolors # Construct numerical labels corresponding to the colors colororder = c("grey", standardcolors(50)); modulelabels = match(modulecolors, colororder)-1; MEs = mergedmes; # Save module colors and labels for use in subsequent parts save(mes, modulelabels, modulecolors, genetree, file = "FemaleLiver-02-networkConstruction-stepByStep.RData") References [1] B. Zhang and S. Horvath. A general framework for weighted gene co-expression network analysis. Statistical Applications in Genetics and Molecular Biology, 4(1):Article 17, 2005.

Consensus clustering of consensus module eigengenes Height 0.0 0.2 0.4 0.6 0.8 1.0 MElightgreen MElightyellow MEblack MEturquoise MEbrown MEroyalblue MEgreen MEred MEpurple MEtan MEmagenta MEyellow MEblue MEmidnightblue MEdarkgreen MEsalmon MEgrey MEpink MEdarkred MEcyan MEgreenyellow MEgrey60 MElightcyan Figure 4: Clustering dendrogram of genes, with dissimilarity based on topological overlap, together with assigned merged module colors and the original module colors.

Cluster Dendrogram Height 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 Dynamic Tree Cut Merged dynamic d hclust (*, "average") Figure 5: Clustering dendrogram of genes, with dissimilarity based on topological overlap, together with assigned merged module colors and the original module colors.