Basic processing of next-generation sequencing (NGS) data




Getting from raw sequence data to expression analysis!

Reminder: we are measuring expression of protein-coding genes by transcript abundance
Biology: Chromosome -> Gene -> Transcript -> Protein
mRNA abundance (transcript sequence copies)

A typical experimental setup
Condition 1: Tissue sample 1, Tissue sample 2, ..., Tissue sample n
Condition 2: Tissue sample 1, Tissue sample 2, ..., Tissue sample n
mRNA sequencing (RNA-seq) -> Data processing -> Expression analysis (comparisons)

Data processing overview
1. Raw sequence generation
2. Sequence filtration
3. Mapping
4. Mapping filtration
5. Computing gene expression values
6. Gene expression analysis

Sequence file formats
Output files from sequencing equipment
Can be MANY and HUGE data files!
A common file format is FASTQ (36 bp reads example):

Sequence number 1:
    @OBAN:8:1:2:902#0/1
    AAAGCTTGTTTTTTCCCTACANCTGTATCCTTTCTT
    +OBAN:8:1:2:902#0/1
    aaava]ay]aaba_ryz``x[dn_[x_]pu_aaz_a
Sequence number 2:
    @OBAN:8:1:2:1718#0/1
    TAAATATAACATTCTTTCCACNACACTTTCTAGGAC
    +OBAN:8:1:2:1718#0/1
    aaaaaaaaa`aaa]aaa[`_xd\\`\`a^[^]pfxw
    ...
Sequence number n:
    @OBAN:8:1:1114:370#0/1
    GGAAGGCAGCGAACATCTGTTCAATCTCCTCCTTGG
    +OBAN:8:1:1114:370#0/1
    a^aba\]_[_]wa[[z] M^YNX``]]^a`_XTXB

In each record, the second line is the DNA sequence and the fourth line is the quality sequence.
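The four-line record layout above can be read with a few lines of code. The following is a minimal sketch, assuming every record occupies exactly four lines (no wrapped sequences); the names `FastqRecord` and `read_fastq` are made up for this illustration and do not come from any of the tools mentioned in these slides.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    name: str       # header line without the leading "@"
    sequence: str   # DNA sequence
    quality: str    # quality sequence, one character per base


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield records from a FASTQ stream with four lines per record."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:  # end of file (or a blank line) stops the reader
            break
        sequence = handle.readline().rstrip("\n")
        separator = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield FastqRecord(header[1:], sequence, quality)


# First record from the slide, used as in-memory test input.
example = (
    "@OBAN:8:1:2:902#0/1\n"
    "AAAGCTTGTTTTTTCCCTACANCTGTATCCTTTCTT\n"
    "+OBAN:8:1:2:902#0/1\n"
    "aaava]ay]aaba_ryz``x[dn_[x_]pu_aaz_a\n"
)
records = list(read_fastq(io.StringIO(example)))
```

Because the reader is a generator, it holds only one record in memory at a time, which matters when the input files are as large as the slide warns.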

Sequence format conversions
If a specific sequence file format is required for downstream analysis/processing, conversion can be done with programs like maq or fq_all2std.pl

    Program: maq (Mapping and Assembly with Qualities)
    Version: 0.7.1
    Contact: Heng Li <lh3@sanger.ac.uk>
    Usage: maq <command> [options]
    Format converting:
      sol2sanger   convert Solexa FASTQ to standard/sanger FASTQ
      mapass2maq   convert mapass2's map format to maq's map format
      bfq2fastq    convert BFQ to FASTQ format
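To illustrate what a Solexa-to-Sanger conversion involves, here is a minimal sketch of the quality-string part only. It assumes the commonly documented conventions: Solexa qualities are stored as ASCII code minus 64, Sanger (Phred) qualities as ASCII code minus 33, and the two scales are related by Q_phred = 10 * log10(10^(Q_solexa / 10) + 1). The function names are made up for this example; this is not maq's own code.

```python
import math


def solexa_to_phred(q_solexa: int) -> int:
    """Convert one Solexa-scaled quality value to the Phred scale."""
    return round(10 * math.log10(10 ** (q_solexa / 10) + 1))


def sol2sanger_quality(quality: str) -> str:
    """Re-encode a Solexa quality string (offset 64) as Sanger (offset 33)."""
    return "".join(chr(solexa_to_phred(ord(ch) - 64) + 33) for ch in quality)


# "h" encodes Solexa quality 40, "@" encodes Solexa quality 0.
converted_high = sol2sanger_quality("h")
converted_zero = sol2sanger_quality("@")
```

Whether a given file actually uses the Solexa scale depends on the instrument and pipeline version that produced it, so check this before converting.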

Filtering on raw sequences
Quality
Read count per tissue/sample
Uniqueness
Trimming
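The filtering criteria above could be sketched as follows. This is an illustration only: the thresholds (`min_quality=20`, `min_length`, `max_n`) and the Sanger quality offset of 33 are assumptions, not values given in the slides, and all function names are made up.

```python
from typing import List, Tuple


def phred_scores(quality: str, offset: int = 33) -> List[int]:
    """Decode a quality string into integer scores (offset is an assumption)."""
    return [ord(ch) - offset for ch in quality]


def trim_3prime(sequence: str, quality: str,
                min_quality: int = 20, offset: int = 33) -> Tuple[str, str]:
    """Trimming: drop trailing bases whose quality is below min_quality."""
    scores = phred_scores(quality, offset)
    end = len(scores)
    while end > 0 and scores[end - 1] < min_quality:
        end -= 1
    return sequence[:end], quality[:end]


def keep_read(sequence: str, min_length: int = 20, max_n: int = 0) -> bool:
    """Quality: keep reads that are long enough and have few uncalled bases."""
    return len(sequence) >= min_length and sequence.upper().count("N") <= max_n


def unique_sequences(sequences: List[str]) -> List[str]:
    """Uniqueness: collapse identical reads, keeping first-seen order."""
    return list(dict.fromkeys(sequences))


trimmed_seq, trimmed_qual = trim_3prime("ACGTACGTAC", "IIIIIII###")
kept = keep_read(trimmed_seq, min_length=5)
distinct = unique_sequences(["ACGT", "ACGT", "TTGA"])
```

The read count per tissue/sample is then simply the number of reads that survive these steps, which can be compared across samples.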

Mapping sequences
Choice of suitable program
Depends on the purpose of your analysis. There are many alignment programs; here are some commonly used examples:
Bowtie: Basic mapping to reference sequence
Maq/BWA: Mapping and SNP/indel detection
TopHat: Spliced alignment for studying exon splice junctions

Mapping sequences
Choice of reference sequence database: what to map against?
Genome reference: common if you want to do transcript assembly, study alternative splicing and detect SNPs/indels
Transcript reference: common if you want to do simple read mapping, count transcript copies and analyze expression levels

Mapping sequences
Common issues
Reference database not available. If you do not have a reference, you need to consider building your own

Mapping sequences
Common issues
Lack of unique mapping will typically lead to random mapping
Gene A: ATCATCGGGCCATCGATTAGCTGATCGGACGCTA
Gene B: TTTTCCTCTTTATCGATTAGCTGGGGGT
Read:   ATCGATTAGCTG (sequence read with multiple mapping options)
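The multiple-mapping problem can be made concrete with the sequences from the slide. This sketch uses plain exact substring search, which is a stand-in chosen for brevity; real aligners use indexed search and tolerate mismatches. The function name `exact_hits` is made up.

```python
from typing import Dict, List

# Reference sequences and read taken from the slide.
references: Dict[str, str] = {
    "GeneA": "ATCATCGGGCCATCGATTAGCTGATCGGACGCTA",
    "GeneB": "TTTTCCTCTTTATCGATTAGCTGGGGGT",
}
read = "ATCGATTAGCTG"


def exact_hits(read: str, references: Dict[str, str]) -> List[str]:
    """Return the names of all references that contain the read exactly."""
    return sorted(name for name, seq in references.items() if read in seq)


hits = exact_hits(read, references)
is_unique = len(hits) == 1  # False here: the read fits both genes
```

A read with more than one hit cannot be attributed to a single gene, so it is worth deciding up front whether such reads are discarded, assigned at random, or counted for every hit.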

Mapping sequences
Paired-end reads

Example: map reads to reference database with Bowtie
Build reference database index:
    bowtie-build NC_002127.fna e_coli_o157_h7
Test build:
    bowtie -c e_coli_o157_h7 GCGTGAGCTATGAGAAAGCGCCACGCTTCC
Map/align reads to references:
    bowtie -S e_coli reads/e_coli_1000.fq

Sequence Alignment/Map format (SAM; Bowtie output example)
Standard alignment file format:

    OBAN:8:1:3:1366#0/1 16 ENSSSCG00000004803 ENSSSCT00000005302 ACTC1 1250 255 36M * 0 0 GTCTACTTTACGTTCAGGATGACAGGTTAATGCTTC VXG^Z^`_ZVUYZ]T`ZQ\_U\W[X_^aaa`[R\\R XA:i:0 MD:Z:36 NM:i:0
    OBAN:8:1:3:285#0/1 4 * 0 0 * * 0 0 AGGTATTGGGTTTGGGGGCCTTACACACCAGGTGGA `VOW^b`RVRS`aUQMT[Z^_a_`_Y_]Y]\RNTVW XM:i:0
    OBAN:8:1:3:672#0/1 4 * 0 0 * * 0 0 TGGGTATACAGTTCATCCAGTACCCGCTCCGGCTTC a`^\y_^``]``aq]a^^_vp`[[\^[sy[yqjwub XM:i:0
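A SAM alignment line is tab-separated with eleven mandatory fields, and the FLAG field (second column) is a bit mask: bit 0x4 means the read is unmapped and bit 0x10 means it aligned to the reverse strand. That is why the records above show 16 for the mapped read and 4 for the unmapped ones. The sketch below parses such a line; the read names, the reference name `GeneA` and the quality strings are synthetic placeholders, and the function name is made up.

```python
from typing import Dict, List, Union

FLAG_UNMAPPED = 0x4   # read has no alignment
FLAG_REVERSE = 0x10   # read aligned to the reverse strand


def parse_sam_line(line: str) -> Dict[str, Union[str, int, List[str]]]:
    """Split one SAM alignment line into named fields."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("a SAM alignment line needs 11 mandatory fields")
    return {
        "qname": fields[0],
        "flag": int(fields[1]),
        "rname": fields[2],
        "pos": int(fields[3]),
        "mapq": int(fields[4]),
        "cigar": fields[5],
        "seq": fields[9],
        "qual": fields[10],
        "tags": fields[11:],  # optional TAG:TYPE:VALUE fields
    }


# Synthetic records shaped like the Bowtie output on the slide.
mapped_line = "\t".join(
    ["read1", "16", "GeneA", "1250", "255", "36M", "*", "0", "0",
     "A" * 36, "I" * 36, "XA:i:0", "MD:Z:36", "NM:i:0"])
unmapped_line = "\t".join(
    ["read2", "4", "*", "0", "0", "*", "*", "0", "0",
     "C" * 36, "I" * 36, "XM:i:0"])

mapped = parse_sam_line(mapped_line)
unmapped = parse_sam_line(unmapped_line)
mapped_is_reverse = bool(mapped["flag"] & FLAG_REVERSE)
unmapped_is_unmapped = bool(unmapped["flag"] & FLAG_UNMAPPED)
```

Header lines in a SAM file start with "@" and should be skipped before calling a parser like this one.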

Filtering sequence mapping
Check each sample for:
Low percentage of reads mapped
Low number of genes covered
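Both checks can be computed from the FLAG and reference-name columns of the mapping output. The following is a minimal sketch; the input format (a list of `(flag, reference_name)` pairs), the function names and the cut-off values are assumptions made for this example.

```python
from typing import Dict, Iterable, Tuple, Union

FLAG_UNMAPPED = 0x4


def mapping_summary(
    alignments: Iterable[Tuple[int, str]]
) -> Dict[str, Union[int, float]]:
    """Summarise one sample: reads mapped and distinct genes covered."""
    total = 0
    mapped = 0
    genes = set()
    for flag, rname in alignments:
        total += 1
        if not flag & FLAG_UNMAPPED:
            mapped += 1
            genes.add(rname)
    percent = 100.0 * mapped / total if total else 0.0
    return {
        "total_reads": total,
        "mapped_reads": mapped,
        "percent_mapped": percent,
        "genes_covered": len(genes),
    }


def sample_passes(summary: Dict[str, Union[int, float]],
                  min_percent_mapped: float = 50.0,
                  min_genes_covered: int = 2) -> bool:
    """Apply hypothetical cut-offs; choose real ones per experiment."""
    return (summary["percent_mapped"] >= min_percent_mapped
            and summary["genes_covered"] >= min_genes_covered)


summary = mapping_summary([(16, "GeneA"), (0, "GeneA"), (0, "GeneB"), (4, "*")])
ok = sample_passes(summary)
```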

Computing absolute expression
Counting transcript copies

Mapping output, Sample 1:
    GeneA  ATCGATTAGAC
    GeneA  ATGGGCTGCAG
    GeneB  ATTTCGGCTGC
    GeneC  ATCCCTCCCTA
    GeneC  GGGCTGGCTGC
    GeneC  GCCGGCGGCAA

Count, Sample 1:
    GeneA  2
    GeneB  1
    GeneC  3

Count copies, e.g. with a Perl script
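The counting step amounts to tallying how many mapped reads carry each gene ID. The slides suggest a Perl script; the sketch below shows the same idea in Python, using the Sample 1 mapping output from the slide as input.

```python
from collections import Counter

# (gene ID, read sequence) pairs for Sample 1, as shown on the slide.
mapping_output = [
    ("GeneA", "ATCGATTAGAC"),
    ("GeneA", "ATGGGCTGCAG"),
    ("GeneB", "ATTTCGGCTGC"),
    ("GeneC", "ATCCCTCCCTA"),
    ("GeneC", "GGGCTGGCTGC"),
    ("GeneC", "GCCGGCGGCAA"),
]

# One count per mapped read, keyed by the gene it mapped to.
counts = Counter(gene for gene, _sequence in mapping_output)

for gene in sorted(counts):
    print(gene, counts[gene])
```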

Creating a gene expression matrix
Concat and Pivot
Column IDs defined by tissue samples
Row IDs defined by gene/transcript IDs

    Gene    Sample1  Sample2  ...  SampleN
    GeneA   2        4
    GeneB   1        0
    GeneC   3        14
    ...
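One way to sketch the concat-and-pivot step: start from one count table per sample and turn them into a single table with genes as rows and samples as columns. The function name is made up, and treating a gene that is absent from a sample as a count of 0 is an assumption of this example.

```python
from typing import Dict, List, Tuple


def build_expression_matrix(
    per_sample_counts: Dict[str, Dict[str, int]]
) -> Tuple[List[str], Dict[str, List[int]]]:
    """Pivot {sample: {gene: count}} into gene rows with one column per sample."""
    samples = list(per_sample_counts)
    genes = sorted({gene for counts in per_sample_counts.values()
                    for gene in counts})
    matrix = {
        gene: [per_sample_counts[sample].get(gene, 0) for sample in samples]
        for gene in genes
    }
    return samples, matrix


# Counts from the slide; GeneB is simply missing from Sample2's table.
per_sample = {
    "Sample1": {"GeneA": 2, "GeneB": 1, "GeneC": 3},
    "Sample2": {"GeneA": 4, "GeneC": 14},
}
samples, matrix = build_expression_matrix(per_sample)

# Print as a tab-separated table with a header row.
print("\t".join(["Gene"] + samples))
for gene, row in matrix.items():
    print("\t".join([gene] + [str(value) for value in row]))
```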

Filtering genes by absolute expression
Number of reads per gene per sample (alternatively per total samples)
Overall tissue sample assessment table
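The two filtering variants (reads per gene per sample, or reads per gene summed over all samples) might look like this. The cut-off values in the example are arbitrary illustrations, and the function name is made up.

```python
from typing import Dict, List, Optional


def filter_genes(matrix: Dict[str, List[int]],
                 min_per_sample: Optional[int] = None,
                 min_total: Optional[int] = None) -> Dict[str, List[int]]:
    """Keep genes that meet the requested minimum read counts."""
    kept = {}
    for gene, counts in matrix.items():
        # Per-sample rule: every sample must reach the minimum.
        if min_per_sample is not None and min(counts) < min_per_sample:
            continue
        # Total rule: the sum over all samples must reach the minimum.
        if min_total is not None and sum(counts) < min_total:
            continue
        kept[gene] = counts
    return kept


expression = {"GeneA": [2, 4], "GeneB": [1, 0], "GeneC": [3, 14]}
kept_per_sample = filter_genes(expression, min_per_sample=2)
kept_total = filter_genes(expression, min_total=10)
```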

Transformation of expression values
Adjust for technical differences like total read count per sample
Depends on the downstream analysis tool
Some tools read absolute counts and do the transformation/normalization themselves
Relative abundance (RA)
Log transformation of RA
Reads per Kilobase per Million Reads (RPKM; Mortazavi et al., 2008)
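The three transformations can be written out as small functions. RPKM is computed here as 10^9 * C / (N * L), where C is the read count for the gene, N the total number of mapped reads in the sample and L the transcript length in base pairs. The pseudocount used to avoid taking the logarithm of zero is an assumption of this sketch, as are the function names and the example numbers.

```python
import math
from typing import List


def relative_abundance(counts: List[int]) -> List[float]:
    """RA: each gene's share of the sample's total read count."""
    total = sum(counts)
    return [count / total for count in counts] if total else [0.0] * len(counts)


def log2_relative_abundance(counts: List[int],
                            pseudocount: float = 1e-6) -> List[float]:
    """Log-transformed RA; the pseudocount keeps zero counts finite."""
    return [math.log2(ra + pseudocount) for ra in relative_abundance(counts)]


def rpkm(count: int, transcript_length_bp: int, total_mapped_reads: int) -> float:
    """Reads per kilobase of transcript per million mapped reads."""
    return 1e9 * count / (total_mapped_reads * transcript_length_bp)


ra = relative_abundance([2, 1, 3])   # GeneA, GeneB, GeneC in Sample 1
log_ra = log2_relative_abundance([2, 1, 3])
# Hypothetical gene: 500 reads, 2000 bp transcript, 10 million mapped reads.
example_rpkm = rpkm(500, 2000, 10_000_000)
```

If the downstream tool expects absolute counts and normalizes internally, these transformations should be skipped for that tool.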

Carry on with gene expression analysis!
Differential expression
Clustering
Etc.

Some general considerations
Linux is the best environment for handling and analyzing huge data files
Learning some of the Linux commands can be helpful (grep, sed, cut, awk)
Learning Perl/R programming can also help with processing text data files
Use batch files to build data processing pipelines (documentation and re-use)
Get used to switching between various tools for processing, analysis and visualization
Check input/output files: you are responsible, not the software/script authors!

Resources for NGS data
Forum for discussing NGS data analysis: http://www.seqanswers.com
Galaxy online tools: http://main.g2.bx.psu.edu
NCBI Short Read Archive (SRA): http://trace.ncbi.nlm.nih.gov/traces/sra
Bioconductor packages for NGS: http://www.bioconductor.org/help/workflows/high-throughput-sequencing
ShortRead, Biostrings, edgeR, Rsamtools, biomaRt