Comparing Methods for Identifying Transcription Factor Target Genes

Alena van Bömmel (R 3.3.73), Matthew Huska (R 3.3.18)
Max Planck Institute for Molecular Genetics

Transcriptional Regulation
o TF not bound = no gene expression
o TF bound = gene expression
Problem: there are many genes and many TFs. How do we identify the targets of a TF?

Methods for Identifying TF Target Genes
o PWM Genome Scan
o Microarray
o ChIP-seq

PWM Genome Scan
Purely computational method
Input:
o position weight matrix for your TF
o genomic region(s) of interest
o score threshold
Pros:
o No need to do wet lab experiments
Cons:
o Many false positives, not able to take biological conditions into account

PWM genome scan
1) Download the PWMs of your TF of interest from the database (they might include more than one motif)
2) Define the sequences to analyze (promoter sequences)
3) Run the PWM genome scan (hit-based method or affinity prediction method)
4) Rank the genomic sequences by the affinity signal
Suggested Reading:
o Roider et al. Predicting transcription factor affinities to DNA from a biophysical model. Bioinformatics (2007).
o Thomas-Chollier et al. Transcription factor binding predictions using TRAP for the analysis of ChIP-seq data and regulatory SNPs. Nature Protocols (2011).
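The four steps above can be sketched in a few lines of Python. This is a minimal illustration, not TRAP or matrix-scan: the count matrix, the promoter sequences, the pseudocount and the uniform background of 0.25 are all assumptions made for the example, and sequences are ranked by their best forward-strand window score.

```python
import math

BASES = "ACGT"

def counts_to_log_odds(counts, pseudocount=1.0, background=0.25):
    """Turn a position count matrix (one dict per motif position) into log2-odds scores."""
    pwm = []
    for column in counts:
        total = sum(column[base] for base in BASES) + 4 * pseudocount
        pwm.append({
            base: math.log2(((column[base] + pseudocount) / total) / background)
            for base in BASES
        })
    return pwm

def window_scores(pwm, sequence):
    """Score every window of motif length on the forward strand (A/C/G/T only)."""
    width = len(pwm)
    return [
        sum(pwm[i][sequence[start + i]] for i in range(width))
        for start in range(len(sequence) - width + 1)
    ]

def rank_sequences(pwm, sequences):
    """Rank named sequences by their best window score, highest first."""
    best = {name: max(window_scores(pwm, seq)) for name, seq in sequences.items()}
    return sorted(best.items(), key=lambda item: item[1], reverse=True)

# Hypothetical 3-bp motif counts favouring "TGA"; not taken from JASPAR.
toy_counts = [
    {"A": 0, "C": 1, "G": 0, "T": 9},
    {"A": 1, "C": 0, "G": 9, "T": 0},
    {"A": 9, "C": 0, "G": 1, "T": 0},
]
# Hypothetical "promoter" sequences, far shorter than real ones.
toy_promoters = {"geneA": "CCTGACC", "geneB": "CCCCCCC", "geneC": "CTTGCAC"}

pwm = counts_to_log_odds(toy_counts)
ranking = rank_sequences(pwm, toy_promoters)
for name, score in ranking:
    print(name, round(score, 2))
```

With real data the count matrix would come from the motif database and the sequences from the promoter definition chosen in step 2.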

PWM-PSCM
Stat3 PSCM: binding motif for the transcription factor Stat3 from a ChIP-seq experiment in mouse (JASPAR ID: MA0144.1)

TRAP
1) Convert the PSSM (position specific scoring matrix) to a PSEM (position specific energy matrix)
2) Scan the sequences of interest with TRAP
3) Results in 1 score per sequence = binding affinity
4) Doesn't separate the exact TF binding sites (easier for ranking)
5) Sequences must have the same length!
ANNOTATE=/project/gbrowse/Pipeline/ANNOTATE_v3.02/Release
TRAP: trap.molgen.mpg.de/cgi-bin/home.cgi

Matrix-scan
1) Uses the PSSM directly
2) Finds all TFBS which exceed a predefined threshold (e.g. p-value)
3) More complicated to create ranked lists of genomic sequences (more hits in the sequence)
4) Exact location of the binding site reported
matrix-scan: http://rsat.ulb.ac.be/
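A hit-based scan of this kind can be sketched as follows. This is not the RSAT matrix-scan implementation: the PSSM values are made up, and the threshold here is a raw score cutoff rather than a p-value, which a real tool would derive from a background model. Both strands are scanned and the exact position of each hit is reported.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(sequence):
    """Reverse-complement a DNA string containing only A, C, G, T."""
    return sequence.translate(COMPLEMENT)[::-1]

def find_hits(pssm, sequence, threshold):
    """Return (start, strand, score) for every window scoring at or above threshold.

    start is 0-based and always refers to the forward strand.
    """
    width = len(pssm)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        for strand, site in (("+", window), ("-", reverse_complement(window))):
            score = sum(pssm[i][base] for i, base in enumerate(site))
            if score >= threshold:
                hits.append((start, strand, score))
    return hits

# Hypothetical log-odds PSSM whose best site is "TGA" (score 4.5).
toy_pssm = [
    {"A": -2.0, "C": -1.0, "G": -2.0, "T": 1.5},
    {"A": -1.0, "C": -2.0, "G": 1.5, "T": -2.0},
    {"A": 1.5, "C": -2.0, "G": -1.0, "T": -2.0},
]

hits = find_hits(toy_pssm, "AATGATTCAGG", threshold=4.0)
for start, strand, score in hits:
    print(start, strand, score)
```

Because one sequence can yield several hits, turning this output into a ranked list of sequences needs an extra rule, for example best hit or number of hits per sequence.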

Finding the target genes
o Target genes will be the top-ranked genes (promoters)
o Which are the top-ranked genes? (top 100, 500, 1000...?)
o There's no exact definition of promoters, usually 2000 bp upstream and 500 bp downstream of the TSS
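The promoter window has to be flipped for genes on the minus strand, which is easy to get wrong. A minimal sketch, assuming 0-based coordinates, half-open intervals and a window that includes the TSS base itself (these conventions are choices made for the example, not part of the slides):

```python
def promoter_interval(tss, strand, upstream=2000, downstream=500):
    """Return a 0-based, half-open (start, end) promoter interval around a TSS.

    tss is the 0-based position of the TSS base. The window covers `upstream`
    bases before the TSS and `downstream` bases starting at the TSS, read in
    the direction of transcription.
    """
    if strand == "+":
        start, end = tss - upstream, tss + downstream
    elif strand == "-":
        start, end = tss - downstream + 1, tss + upstream + 1
    else:
        raise ValueError("strand must be '+' or '-'")
    # Clip at the chromosome start; the chromosome end is not known here.
    return max(start, 0), end

print(promoter_interval(10_000, "+"))
print(promoter_interval(10_000, "-"))
```

Both intervals are 2500 bp long and contain the TSS; only the side on which the 2000 bp lie changes with the strand.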

Microarrays
R/Bioconductor (details later)

Microarrays (2)
Pros:
o There is a lot of microarray data already available (might not have to generate the data yourself)
o Inexpensive and not very difficult to perform
o Computational workflow is well established
Cons:
o Cannot distinguish between indirect regulation and direct regulation

ChIP-seq
o Map reads to the genome
o Call peaks to determine most likely TF binding locations

ChIP-seq (2)
Pros:
o Direct measure of genome-wide protein-DNA interaction(*)
Cons:
o Don't know whether binding causes changes in gene expression
o More complicated experimentally and in terms of computational analysis
o Most expensive
o Need an antibody against your protein of interest
o Biases are not as well understood as with microarrays

ChIP-seq analysis
1) Download the reads from the given source (experiments and controls)
2) Quality control of the reads and statistics (→ fastqc)
3) Mapping the reads to the reference genome (→ bwa/Bowtie)
4) Peak calling (→ MACS)
5) Visualization of the peaks in a genome browser (genome browser, IGV)
6) Finding the closest genes to the peaks (→ Bioconductor/ChIPpeakAnno)
Figure: visualised peaks in a genome browser
Suggested Reading:
o Bailey et al. Practical Guidelines for the Comprehensive Analysis of ChIP-seq Data. PLoS Comput Biol (2013).
o Thomas-Chollier et al. A complete workflow for the analysis of full-size ChIP-seq (and similar) data sets using peak-motifs. Nature Protocols (2012).

Sequencing data
o raw data = reads
o usually very large file (few GB)
o format: fastq (ENCODE) or SRA (Sequence Read Archive of NCBI)
Analysis
1) Quality control with fastqc
2) Filtering of reads with adapter sequences
3) Mapping of the reads to the reference genome (bwa or Bowtie)
Figure: example of fastq data file
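To make the fastq format concrete, here is a small reader for the common four-lines-per-record layout (header starting with "@", sequence, separator starting with "+", quality string). It is a sketch for inspection only; the toy records are invented, and multi-line FASTQ variants are not handled.

```python
import io

def read_fastq(handle):
    """Yield (read_id, sequence, quality) from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of file
        if not header.strip():
            continue  # tolerate blank lines between records
        sequence = handle.readline().rstrip("\n")
        separator = handle.readline()
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError("malformed FASTQ record near: " + header.strip())
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lengths differ for " + header.strip())
        yield header[1:].strip(), sequence, quality

# Two invented reads; real files would be opened with open(path) instead.
toy_fastq = io.StringIO("@read1\nACGTAC\n+\nIIIIII\n@read2\nTTGCA\n+\n!!#II\n")
records = list(read_fastq(toy_fastq))
for read_id, sequence, quality in records:
    print(read_id, sequence, quality)
```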

Quality control with fastqc
o per base quality
o sequence quality (avg. > 20)
o sequence length
o sequence duplication level (duplication by PCR)
o overrepresented sequences/k-mers (adapter sequences)
o produces an HTML report
manual (read it!)
Figure: example of per base sequence quality scores
software at the MPI: FASTQC=/scratch/ngsvin/bin/chip-seq/fastqc/fastqc/fastqc
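The per base and per sequence quality checks boil down to decoding the quality string. A minimal sketch, assuming Phred+33 encoding (an assumption; older Illumina files may use an offset of 64) and using invented quality strings. fastqc itself reports much more than this.

```python
def phred_scores(quality, offset=33):
    """Decode a FASTQ quality string into integer Phred scores."""
    return [ord(char) - offset for char in quality]

def mean_read_quality(quality, offset=33):
    """Average Phred score over one read."""
    scores = phred_scores(quality, offset)
    return sum(scores) / len(scores)

def per_base_mean_quality(qualities, offset=33):
    """Mean Phred score at each read position, for reads of possibly different length."""
    totals, counts = [], []
    for quality in qualities:
        for position, score in enumerate(phred_scores(quality, offset)):
            if position == len(totals):  # first read reaching this position
                totals.append(0)
                counts.append(0)
            totals[position] += score
            counts[position] += 1
    return [total / count for total, count in zip(totals, counts)]

# Invented quality strings: 'I' = 40, '5' = 20, '#' = 2 under Phred+33.
toy_qualities = ["IIII", "I###", "I5"]

per_base = per_base_mean_quality(toy_qualities)
passing = [q for q in toy_qualities if mean_read_quality(q) > 20]
print(per_base)
print(passing)
```

The cutoff of 20 mirrors the "avg. > 20" rule of thumb on the slide.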

Mapping with bwa
mapping the sequencing reads to a reference genome
manual (read it!)
map the experiments and the controls
1) reference genome in fasta format (hg19)
2) create an index of the reference file for faster mapping (only if not available)
3) align the reads (specify parameters e.g. for # of mismatches, read trimming, threads used...)
4) generate alignments in the SAM format (different commands for single-end and paired-end reads!)
software and data at the MPI:
BWA = /scratch/ngsvin/bin/executables/bwa
hg19: /scratch/ngsvin/mappingindices/hg19.fa
bwa index: /scratch/ngsvin/mappingindices/bwa/hg19
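Steps 3 and 4 can be sketched as command lists built in Python. The sketch assumes the `bwa aln` workflow, where `bwa samse` handles single-end and `bwa sampe` handles paired-end reads; the slides do not say which bwa algorithm to use, and the FASTQ and SAM file names are hypothetical. Nothing is executed here; check every option against the bwa manual before running.

```python
def bwa_aln_commands(bwa, index_prefix, fastq_files, sam_out, threads=4):
    """Build argument lists for `bwa aln` followed by `bwa samse` or `bwa sampe`.

    One FASTQ file means single-end reads, two files mean paired-end reads.
    The commands are only constructed, not run.
    """
    if len(fastq_files) not in (1, 2):
        raise ValueError("expected one (single-end) or two (paired-end) FASTQ files")
    sai_files = [fastq + ".sai" for fastq in fastq_files]
    # Step 3: one alignment run per FASTQ file, written to a .sai file.
    commands = [
        [bwa, "aln", "-t", str(threads), "-f", sai, index_prefix, fastq]
        for fastq, sai in zip(fastq_files, sai_files)
    ]
    # Step 4: convert to SAM; the subcommand depends on the library type.
    sam_step = "samse" if len(fastq_files) == 1 else "sampe"
    commands.append([bwa, sam_step, "-f", sam_out, index_prefix] + sai_files + list(fastq_files))
    return commands

BWA = "/scratch/ngsvin/bin/executables/bwa"
INDEX = "/scratch/ngsvin/mappingindices/bwa/hg19"

single = bwa_aln_commands(BWA, INDEX, ["chip.fastq"], "chip.sam")
paired = bwa_aln_commands(BWA, INDEX, ["r1.fastq", "r2.fastq"], "ctl.sam")
for command in single + paired:
    print(" ".join(command))
```

Each list could be passed to `subprocess.run(command, check=True)` once the options have been verified.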

File manipulation with samtools
utilities that manipulate SAM/BAM files
manual (read it!)
1) merge the replicates in one file (still separate experiment and control)
2) convert the SAM file into BAM file (binary version of SAM, smaller)
3) sort and index the BAM file
now the sequencing files are ready for further analysis
software at the MPI: SAMTOOLS = /scratch/ngsvin/bin/executables/samtools
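It helps to know what a SAM record contains before handing files to samtools. The sketch below parses the mandatory tab-separated fields of an alignment line and decodes two flag bits (0x4 = unmapped, 0x10 = reverse strand). The example lines are invented, and this is for understanding the format only; samtools remains the tool for real files.

```python
def parse_sam_line(line):
    """Parse the mandatory fields of one SAM alignment line into a dict."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("a SAM alignment line needs at least 11 fields")
    flag = int(fields[1])
    return {
        "read": fields[0],
        "flag": flag,
        "chrom": fields[2],
        "pos": int(fields[3]),          # 1-based leftmost mapped position
        "mapq": int(fields[4]),
        "unmapped": bool(flag & 0x4),   # read has no alignment
        "reverse": bool(flag & 0x10),   # read maps to the reverse strand
    }

def mapped_fraction(lines):
    """Fraction of mapped reads, skipping header lines that start with '@'."""
    records = [parse_sam_line(line) for line in lines if not line.startswith("@")]
    mapped = sum(1 for record in records if not record["unmapped"])
    return mapped / len(records)

toy_sam = [
    "@SQ\tSN:chr1\tLN:1000",
    "r1\t0\tchr1\t101\t37\t4M\t*\t0\t0\tACGT\tIIII",
    "r2\t16\tchr1\t201\t30\t4M\t*\t0\t0\tTTGA\tIIII",
    "r3\t4\t*\t0\t0\t*\t*\t0\t0\tGGGG\tIIII",
]

second = parse_sam_line(toy_sam[2])
fraction = mapped_fraction(toy_sam)
print(second["chrom"], second["pos"], second["reverse"])
print(fraction)
```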

Peak finding with MACS
find the peaks, i.e. the regions with a high density of reads, where the studied TF was bound
manual (read it!)
1) call the peaks using the experiment (treatment) data vs. control
2) set the parameters e.g. fragment length, treatment of duplicate reads
3) analyse the MACS results (BED file with peaks/summits)
software at the MPI: MACS = /scratch/ngsvin/bin/executables/macs
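The idea of "high read density in treatment versus control" can be illustrated with a toy window counter. This is deliberately not the MACS algorithm (no fragment shifting, no statistical model, no duplicate handling); the window size, thresholds, pseudocount and read positions are all assumptions chosen for the example.

```python
from collections import Counter

def window_counts(read_starts, window):
    """Count read start positions per fixed-size window (keyed by window index)."""
    return Counter(start // window for start in read_starts)

def call_enriched_windows(treatment, control, window=200, min_reads=5, min_fold=3.0):
    """Return (start, end, treatment_count, fold) for windows enriched over control.

    The control is scaled to the treatment library size, and 1 is added to the
    expected count so that windows without control reads do not divide by zero.
    """
    treatment_bins = window_counts(treatment, window)
    control_bins = window_counts(control, window)
    scale = len(treatment) / len(control) if control else 1.0
    peaks = []
    for index in sorted(treatment_bins):
        count = treatment_bins[index]
        expected = control_bins[index] * scale + 1.0
        fold = count / expected
        if count >= min_reads and fold >= min_fold:
            peaks.append((index * window, (index + 1) * window, count, fold))
    return peaks

# Invented read start positions on a single chromosome.
toy_treatment = [1000, 1010, 1020, 1050, 1100, 1150, 1190, 5000, 5100, 9000]
toy_control = [5000, 5050, 9000, 9100]

peaks = call_enriched_windows(toy_treatment, toy_control)
print(peaks)
```

Only the window with seven treatment reads and no control reads is reported; the windows that also have control reads are not.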

Finding the target genes
o find the genes which are in the closest distance to the (significant) peaks
o how to define the closest distance? (± X kb)
o use ChIPpeakAnno in Bioconductor or bedtools
Figure: UCSC genome browser view (hg18, chr10, around 69,200,000 to 69,350,000, scale bar 100 kb) showing UCSC Genes (DNAJC12, SIRT1, HERC4, KIAA1593), ENCODE TFBS Yale/UCD/Harvard ChIP-seq peaks and signal for c-myc in GM12878 and K562 cells, and RepeatMasker repeating elements
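The peak-to-gene assignment can be sketched as a nearest-TSS search with a distance cutoff. The gene names, coordinates and the 10 kb cutoff below are hypothetical; ChIPpeakAnno and bedtools do this far more efficiently and with real annotations.

```python
def assign_peaks_to_genes(peaks, genes, max_distance=10_000):
    """Map each peak summit to the gene with the nearest TSS on the same chromosome.

    peaks: list of (chrom, summit); genes: list of (name, chrom, tss).
    Peaks farther than max_distance from every TSS are left unassigned.
    Returns {gene_name: [(chrom, summit, distance), ...]}.
    """
    targets = {}
    for chrom, summit in peaks:
        candidates = [
            (abs(summit - tss), name)
            for name, gene_chrom, tss in genes
            if gene_chrom == chrom
        ]
        if not candidates:
            continue
        distance, name = min(candidates)
        if distance <= max_distance:
            targets.setdefault(name, []).append((chrom, summit, distance))
    return targets

# Hypothetical genes and peak summits, not real annotations.
toy_genes = [
    ("geneA", "chr10", 69_300_000),
    ("geneB", "chr10", 69_350_000),
    ("geneC", "chr11", 1_000_000),
]
toy_peaks = [
    ("chr10", 69_302_500),  # 2.5 kb from geneA
    ("chr10", 69_330_000),  # 20 kb from the nearest TSS, beyond the cutoff
    ("chr11", 995_000),     # 5 kb from geneC
    ("chr12", 500),         # no gene on this chromosome
]

targets = assign_peaks_to_genes(toy_peaks, toy_genes)
print(targets)
```

Changing `max_distance` is the "± X kb" decision from the slide, and it directly changes which genes count as targets.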

Methods for Identifying TF Target Genes
o PWM Genome Scan
o Microarray
o ChIP-seq
Thresholds

Bioinformatics
o Read mapping (Bowtie/bwa)
o Peak Calling (MACS/Bioconductor)
o Peak-Target Analysis (Bioconductor)
o Microarray data analysis (Bioconductor)
o Differential Genes (R)
o GSEA
o PWM Genome Scan (TRAP/MatScan)
o Statistics (R)
o Data Integration (R/Python/Perl)
o Statistical Analysis (R)
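One simple form of data integration is to apply a top-k threshold to each method's ranked gene list and count how many methods support each gene. The ranked lists and the value of k below are invented for illustration; how to choose the thresholds is exactly the open question of the project.

```python
from collections import Counter

def method_support(ranked_lists, k):
    """Count, for each gene, how many methods place it in their top k.

    ranked_lists: {method_name: [gene, ...]} with the best gene first.
    """
    support = Counter()
    for genes in ranked_lists.values():
        support.update(set(genes[:k]))
    return support

def genes_with_support(support, min_methods):
    """Genes found by at least min_methods methods, sorted by name."""
    return sorted(gene for gene, n in support.items() if n >= min_methods)

# Hypothetical ranked gene lists from the three methods.
toy_rankings = {
    "pwm_scan": ["g1", "g2", "g3", "g4"],
    "microarray": ["g2", "g5", "g1", "g6"],
    "chip_seq": ["g2", "g3", "g7", "g1"],
}

support = method_support(toy_rankings, k=3)
in_all_three = genes_with_support(support, 3)
in_at_least_two = genes_with_support(support, 2)
print(in_all_three, in_at_least_two)
```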

Bioinformatics tools
READ THE MANUALS!
o Bowtie: bowtie-bio.sourceforge.net/manual.shtml
o bwa: bio-bwa.sourceforge.net/bwa.shtml
o MACS: github.com/taoliu/macs/blob/macs_v1/readme.rst
o TRAP: trap.molgen.mpg.de/cgi-bin/home.cgi
o matrix-scan: http://rsat.ulb.ac.be/
o Bioconductor: www.bioconductor.org/ (more info in R course)
Databases
o GEO: www.ncbi.nlm.nih.gov/geo/
o ENCODE: genome.ucsc.edu/encode/
o SRA: www.ncbi.nlm.nih.gov/sra
o JASPAR: http://jaspar.genereg.net/

Schedule
03.03. Introduction lecture, R course
04.03. R & Bioconductor homework submission
11.03. Presentation of the detailed plan of each group (which TF, cell line, tools, data, data integration, team work), 10:30am, 11:30am
every Tuesday, 10:30am, 11:30am: progress meetings
17.04. Final report deadline
24.04. (tentative) Presentations
28.04. Final meeting, discussion of final reports

GR Group
Expression and ChIP-seq data: Luca F, Maranville JC, et al., PLoS ONE, 2013
PWM database: jaspar.genereg.net

c-myc Group
Expression data:
o Cappellen, Schlange, Bauer et al., EMBO reports, 2007
o Musgrove et al., PLoS One, 2008
ChIP-seq data: ENCODE Project
PWM database: jaspar.genereg.net

Additional analysis: binding motifs
o Are the overrepresented motifs in the ChIP-peak regions different?
o Do we find any co-factors?
Recommended tool: RSAT, rsat.ulb.ac.be
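Overrepresentation can be illustrated by comparing k-mer frequencies in peak sequences against background sequences. This is a crude sketch, not what RSAT does: the sequences, k, and the pseudocount are assumptions, and only the forward strand is counted.

```python
from collections import Counter

def kmer_counts(sequences, k):
    """Count all overlapping k-mers on the forward strand."""
    counts = Counter()
    for sequence in sequences:
        for start in range(len(sequence) - k + 1):
            counts[sequence[start:start + k]] += 1
    return counts

def enriched_kmers(peak_seqs, background_seqs, k=6, pseudocount=1.0, top=3):
    """Rank k-mers seen in peaks by their frequency ratio, peaks over background."""
    foreground = kmer_counts(peak_seqs, k)
    background = kmer_counts(background_seqs, k)
    fg_total = sum(foreground.values()) + pseudocount
    bg_total = sum(background.values()) + pseudocount
    ratios = {
        kmer: ((count + pseudocount) / fg_total)
        / ((background[kmer] + pseudocount) / bg_total)
        for kmer, count in foreground.items()
    }
    # Highest ratio first; ties broken alphabetically for a stable order.
    return sorted(ratios.items(), key=lambda item: (-item[1], item[0]))[:top]

# Invented sequences: every "peak" contains CACGTG, no background sequence does.
toy_peaks = ["AACACGTGAA", "TTCACGTGTT", "GGCACGTGCC"]
toy_background = ["AACCGGTTAA", "TTGGCCAATT", "ACGTACGTAC"]

top_kmers = enriched_kmers(toy_peaks, toy_background)
print(top_kmers)
```

A k-mer that is enriched but does not match the motif of the studied TF is one possible hint at a co-factor, which is the second question on the slide.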