Gene Expression Analysis Processing, Querying and Visualizing Gene Expression Data CANB 7640

Similar documents
Cluster software and Java TreeView

Software and Methods for the Analysis of Affymetrix GeneChip Data. Rafael A Irizarry Department of Biostatistics Johns Hopkins University

Analysis of gene expression data. Ulf Leser and Philippe Thomas

Using the Grid for the interactive workflow management in biomedicine. Andrea Schenone BIOLAB DIST University of Genova

Lecture 11 Data storage and LIMS solutions. Stéphane LE CROM

Analyzing the Effect of Treatment and Time on Gene Expression in Partek Genomics Suite (PGS) 6.6: A Breast Cancer Study

Row Quantile Normalisation of Microarrays

Step by Step Guide to Importing Genetic Data into JMP Genomics

Exiqon Array Software Manual. Quick guide to data extraction from mircury LNA microrna Arrays

Importance of Statistics in creating high dimensional data

NCBI resources III: GEO and ftp site. Yanbin Yin Spring 2013

Tutorial for proteome data analysis using the Perseus software platform

Gene expression analysis. Ulf Leser and Karin Zimmermann

Measuring gene expression (Microarrays) Ulf Leser

Quality Assessment of Exon and Gene Arrays

GAIA: Genomic Analysis of Important Aberrations

Basic processing of next-generation sequencing (NGS) data

DATA MINING TOOL FOR INTEGRATED COMPLAINT MANAGEMENT SYSTEM WEKA 3.6.7

Data Acquisition. DNA microarrays. The functional genomics pipeline. Experimental design affects outcome data analysis

#jenkinsconf. Jenkins as a Scientific Data and Image Processing Platform. Jenkins User Conference Boston #jenkinsconf

Analysis of Illumina Gene Expression Microarray Data

Step-by-Step Guide to Basic Expression Analysis and Normalization

A truly robust Expression analyzer

MultiAlign Software. Windows GUI. Console Application. MultiAlign Software Website. Test Data

Analyzing microrna Data and Integrating mirna with Gene Expression Data in Partek Genomics Suite 6.6

Methods for network visualization and gene enrichment analysis July 17, Jeremy Miller Scientist I jeremym@alleninstitute.org

JustClust User Manual

Big Data Visualization for Genomics. Luca Vezzadini Kairos3D

Applying data integration into reconstruction of gene networks from micro

Basic Analysis of Microarray Data

Protein Protein Interaction Networks

Package GEOquery. August 18, 2015

AGILENT S BIOINFORMATICS ANALYSIS SOFTWARE

Delivering the power of the world s most successful genomics platform

Reporting. Understanding Advanced Reporting Features for Managers

Data, Measurements, Features

FlipFlop: Fast Lasso-based Isoform Prediction as a Flow Problem

Data deluge (and it s applications) Gianluigi Zanetti. Data deluge. (and its applications) Gianluigi Zanetti

Euro-BioImaging European Research Infrastructure for Imaging Technologies in Biological and Biomedical Sciences

MultiExperiment Viewer Quickstart Guide

Comparison of Non-linear Dimensionality Reduction Techniques for Classification with Gene Expression Microarray Data

Frozen Robust Multi-Array Analysis and the Gene Expression Barcode

Advanced Big Data Analytics with R and Hadoop

Analysis of ChIP-seq data in Galaxy

European Genome-phenome Archive database of human data consented for use in biomedical research at the European Bioinformatics Institute

A pretty picture, or a measurement? Retinal Imaging

Guide for Data Visualization and Analysis using ACSN

Package hoarder. June 30, 2015

Structural Health Monitoring Tools (SHMTools)

Data processing goes big

itunes Store Publisher User Guide Version 1.1

PREDA S4-classes. Francesco Ferrari October 13, 2015

DeCyder Extended Data Analysis (EDA) Software

Visualizing Networks: Cytoscape. Prat Thiru

Microarray Data Analysis. A step by step analysis using BRB-Array Tools

Step-by-Step Guide to Bi-Parental Linkage Mapping WHITE PAPER

Bioinformatics Resources at a Glance

Using Illumina BaseSpace Apps to Analyze RNA Sequencing Data

ADVANCED DATA VISUALIZATION

Course on Functional Analysis. ::: Gene Set Enrichment Analysis - GSEA -

Euro-BioImaging European Research Infrastructure for Imaging Technologies in Biological and Biomedical Sciences

A Knowledge Base Approach for Genomics Data Analysis

Case Study Life Sciences Data

HETEROGENEOUS DATA INTEGRATION FOR CLINICAL DECISION SUPPORT SYSTEM. Aniket Bochare - aniketb1@umbc.edu. CMSC Presentation

Understanding Data: A Comparison of Information Visualization Tools and Techniques

Open Access to Manuscripts, Open Science, and Big Data

LAB 1 Intro to Ucinet & Netdraw

University of Glasgow - Programme Structure Summary C1G MSc Bioinformatics, Polyomics and Systems Biology

NaviCell Data Visualization Python API

Why Taking This Course? Course Introduction, Descriptive Statistics and Data Visualization. Learning Goals. GENOME 560, Spring 2012

UNSUPERVISED MACHINE LEARNING TECHNIQUES IN GENOMICS

DataPA OpenAnalytics End User Training

PPInterFinder A Web Server for Mining Human Protein Protein Interaction

ENABLING DATA TRANSFER MANAGEMENT AND SHARING IN THE ERA OF GENOMIC MEDICINE. October 2013

Creating a New Annotation Package using SQLForge

Top 10 Things to Know about WRDS

DeCyder Extended Data Analysis module Version 1.0

The Scientific Data Mining Process

DEVELOPMENT OF AN ANALYSIS AND REPORTING TOOL FOR ORACLE FORMS SOURCE CODES

Complexity and Scalability in Semantic Graph Analysis Semantic Days 2013

BBSRC TECHNOLOGY STRATEGY: TECHNOLOGIES NEEDED BY RESEARCH KNOWLEDGE PROVIDERS

Gene Expression Analysis

Healthcare data analytics. Da-Wei Wang Institute of Information Science

SAP Data Services 4.X. An Enterprise Information management Solution

Course DSS. Business Intelligence: Data Warehousing, Data Acquisition, Data Mining, Business Analytics, and Visualization

Predictive Gene Signature Selection for Adjuvant Chemotherapy in Non-Small Cell Lung Cancer Patients

Big Data: Rethinking Text Visualization

LifeScope Genomic Analysis Software 2.5

Downloading & Using Data from the STORET Warehouse: An Exercise

MetaXpress Image Acquisition and Analysis Software. Analysis Guide Version 2.0 for Microsoft Windows XP

MultiQuant Software 2.0 for Targeted Protein / Peptide Quantification

A Comparison of Normalization Methods for High Density Oligonucleotide Array Data Based on Variance and Bias

Core Technology Development Team Meeting

Exercise with Gene Ontology - Cytoscape - BiNGO

EMC Documentum Content Services for SAP iviews for Related Content

ENHANCED PUBLICATIONS IN THE CZECH REPUBLIC

Fast. Integrated Genome Browser & DAS. Easy. Flexible. Free. bioviz.org/igb

LabVIEW Day 6: Saving Files and Making Sub vis

Microarray Technology

LONI IMAGE & DATA ARCHIVE USER MANUAL

Transcription:

Gene Expression Analysis Processing, Querying and Visualizing Gene Expression Data CANB 7640 Aik Choon Tan, Ph.D. Associate Professor of Bioinformatics Division of Medical Oncology Department of Medicine aikchoon.tan@ucdenver.edu 10/18/2016 http://tanlab.ucdenver.edu/labhomepage/teaching/canb7640/

Outline Processing Raw Data Normalization Concepts Affymetrix Power Tools Querying Public Gene Expression Database NCBI GEO Rules for Reusing Public Gene Expression Data Visualizing matrix as heat map Heat map concept matrix2png

Analytical Process in Microarray Experiments gene expression data analysis Cluster analysis Candidate gene analysis Gene Set analysis gene expression secondary analysis Clusters Gene Lists Gene Sets gene expression visualization Clusters Gene Lists Gene Sets

So far, Tools Covered in the Analytical Process in Microarray Experiments gene expression data analysis (CLUSTER 3.0) (SAM) Cluster analysis Candidate gene analysis Gene Set analysis (GSEA) (Leading Edge Analysis) gene expression secondary analysis Clusters Gene Lists Gene Sets (Java TreeView) (NIH DAVID, KEGG) gene expression visualization Clusters Gene Lists Gene Sets

Complete Analytical Process in Microarray Experiments gene expression data analysis Cluster analysis Candidate gene analysis Gene Set analysis gene expression secondary analysis Clusters Gene Lists Gene Sets reuse and re-discovery Published Papers Public Databases publication and data sharing Publication Deposit into Public Databases gene expression visualization Clusters Gene Lists Gene Sets

Tools to Learn Today in the Complete Analytical Process in Microarray Experiments gene expression data analysis Cluster analysis Candidate gene analysis Gene Set analysis Affymetrix Power Tools (APT) gene expression secondary analysis Clusters Gene Lists Gene Sets reuse and re-discovery Published Papers Public Databases NCBI GEO publication and data sharing Publication Deposit into Public Databases NCBI GEO Heat map tool -matrix2png gene expression visualization Clusters Gene Lists Gene Sets

Extracting Signals from Microarray Image file to pixel Potential biases in scanner, distribution of the fluorescence dyes, over/under exposure of a particular corner in the array (Batch) Potential biases in sample preparation steps (Technical) Normalization is a way to account for these biases and hope to generate a comparable measurements across samples

Normalization Methods in Microarray Image file to pixel Potential biases in scanner, distribution of the fluorescence dyes, over/under exposure of a particular corner in the array (Batch) Potential biases in sample preparation steps (Technical) Normalization is a way to account for these biases and hope to generate a comparable measurements across samples for downstream analysis

Robust Multi-array Average (RMA) M-A plot AvgDiff q-q plot MAS5.0 M=log 2 (y/x) Li and Wang Observed Quantiles RMA (Cited by 6094) A=1/2(log 2 (x*y)) Reference Quantiles

Affymetrix Power Tools Affymetrix Power Tools (APT) are a set of cross-platform command line programs that implement algorithms for analyzing and working with Affymetrix GeneChip arrays. APT is an open-source project licensed under the GNU General Public License (GPL). (Developers who need a non- GPL license may purchase a commercial license from Affymetrix.) APT programs are intended for "power users" who prefer programs that can be utilized in scripting environments and are sophisticated enough to handle the complexity of extra features and functionality. The vision is that APT provides a platform for developing and deploying new algorithms without waiting for the GUI implementations.

How to get Affymetrix Power Tools? Webpage: www.affymetrix.com Register for Free Login

User Manual (Help Page) http://media.affymetrix.com/support/developer/powertools/changelog/apt-probeset-summarize.html

Usage Call the program Type of analysis Chip annotanons: - d chip.cdf or - p chip.pgf - c chip.clf Output Folder Input CEL files

NCBI GEO http://www.ncbi.nlm.nih.gov/geo/

NCBI GEO Data Formats GPL* GSM* GSE* GSD*

NCBI GEO Data Formats (e.g HG- U133 Plus 2.0 microarray) (e.g Sample1) (e.g Cancer_vs_No rmal)

NCBI GEO Data Formats

Querying NCBI GEO Look for Data Sets/ Samples Query by keyword

Read and Locate Data from NCBI GEO

Querying NCBI GEO by GSE ID Query by GSE ID GSE18088

Different Normalization Method (MAS5.0)

Other Platform (different array)

Other formats (RNAseq)

Different Array layouts within a Data Set

Types of Data Available for Download SOFT data MINiML data SERIES data RAW data

SOFT Format SOFT/: This directory contains files in Simple Omnibus Format in Text (SOFT). SOFT files are generated for DataSet entries, as well as for Series and Platform entries (subdirectories are included for each entry type). The Series and Platform files are actually family files that include the metadata and complete data tables of all related entries in the family. In contrast, the DataSet SOFT files include the metadata of the DataSet entry only, plus a matrix table containing the extracted gene annotations and Sample values used in GEO Profiles.

MINiML format MINiML/: This directory includes files in MINiML (MIAME Notation in Markup Language) format. MINiML is essentially an XML rendering of SOFT format, and the files provided here are the XML-equivalents of the Series and Platform family files provided in the SOFT/ directory.

MIAME Six Parts of MIAME 1. Experimental design: the set of hybridization experiments as a whole 2. Array design: each array used and each element (spot, feature) on the array 3. Samples: samples used, extract preparation and labeling 4. Hybridizations: procedures and parameters 5. Measurements: images, quantification and specifications 6. Normalization controls: types, values and specifications

Series Matrix Format SeriesMatrix/: This directory contains tab-delimited value-matrices generated from the VALUE column of the Sample tables of each Series entry. Files also include Series and Sample metadata and are ideal for opening in spreadsheet applications such as MicrosoftExcel. Most users find SeriesMatrix files the most convenient format for handling data that have not been assembled into a DataSet

Series Information

Sample Descriptions

Start of the table (!series_matrix_table_begin)

End of the table (!series_matrix_table_end)

GEO2R Analysis Pipeline GEO2R Analysis Pipeline

GEO2R - Analysis

GEO2R Samples in the Series

GEO2R Define Groups

GEO2R Define Samples

GEO2R Run Analysis

GEO2R Top250 DEG

GEO2R Box Plot Distributions

GEO2R Options for Analysis

GEO2R Profile Graph

GEO2R R Script

Reuse of public genome-wide gene expression data

Major Microarray Data Repositories (Rung & Brazma, Nat Rev Genetics 2013)

Disease Specific Microarray Data Databases (Rung & Brazma, Nat Rev Genetics 2013)

Other Microarray Data Repositories (Rung & Brazma, Nat Rev Genetics 2013)

Things to Consider when Reusing of Raw Data from Public Databases 1. Quality control. Public archives store data as they have been received from the submitter. Include only arrays that pass quality-control criteria in further analysis. Be aware that some studies, often from the same laboratory, contain identical raw data files, such as when a set of control samples has been used independently in two different studies. 2. Revise annotation. Annotation of public data can be incomplete, not-up-to-date or conflicting (different terms to annotate samples and experimental factors). 3. Array selection. Experiments that reuse data need experimental design (like new experiment). Include only arrays in the study that address the intended question. By excluding non-informative arrays, decrease the data heterogeneity and improve the conditions for accurate statistical tests concerning the goal of the study. (Rung & Brazma, Nat Rev Genetics 2013)

Things to Consider when Reusing of Raw Data from Public Databases 4. Define and annotate probe sets. Different platforms, and sometimes even different versions of arrays, may define probe sets or the individual probe sequences differently. Original manufacturer annotation may be outdated. This may have a serious impact on the data analysis 5. Normalize and analyze across all arrays and experimental conditions as if it were a single data set. Cross-platform normalization needs special attention and should be dealt with carefully, or the biases introduced may outweigh the benefit of combining many samples. In downstream analysis, adjusting for study effect and other biases may be necessary. (Rung & Brazma, Nat Rev Genetics 2013)

Presenting the Microarray Results: Gene List vs Heat Map vs

Heat map Heat maps represent two-dimensional tables of numbers as shades of colors. This is a popular plotting technique in biology, used to depict gene expression and other multivariate data. The dense and intuitive display makes heat maps well-suited for presentation of high-throughput data. Hundreds of rows and columns can be displayed on a screen. Heat maps rely fundamentally on color encoding and on meaningful reordering of the rows and columns. When either of these components is compromised, the utility of the visualization suffers. (Gehlenborg & Wong, Nat Methods 2012)

Examples of Heat maps (Gehlenborg & Wong, Nat Methods 2012)

matrix2png - Heat map Generation Tool http://www.chibi.ubc.ca/matrix2png/

matrix2png hzp://www.chibi.ubc.ca/matrix2png/

Examples of matrix2png hzp://www.chibi.ubc.ca/matrix2png/

Examples of matrix2png hzp://www.chibi.ubc.ca/matrix2png/

Examples of Heat maps (Tan et al, Bioinformatics 2005)

Examples of Heat maps (Diamond et al, Clinical Cancer Research 2013)

Take Home Message Know your data Know how to extract and normalize your microarray gene expression data. If data set was downloaded from public databases, know how the data were processed. Best to download raw data and do your own normalization. Know where to find data Know how to locate public data from published papers. Know how to download data sets for correlative analysis / metaanalysis. Know how to integrate data for hypothesis generation / discovery. Know how to present your data analysis Heat map is one of the methods to visualize gene expression and other high-throughput data. Need to know how to add value to heat map.