A Complete Example of Next- Gen DNA Sequencing Read Alignment. Presentation Title Goes Here

Similar documents
Practical Guideline for Whole Genome Sequencing

Comparing Methods for Identifying Transcription Factor Target Genes

Tutorial for Windows and Macintosh. Preparing Your Data for NGS Alignment

Hadoop. Bioinformatics Big Data

Data formats and file conversions

Text file One header line meta information lines One line : variant/position

Removing Sequential Bottlenecks in Analysis of Next-Generation Sequencing Data

Analysis of ChIP-seq data in Galaxy

Version 5.0 Release Notes

Introduction to NGS data analysis

Introduction. Overview of Bioconductor packages for short read analysis

About the Princess Margaret Computational Biology Resource Centre (PMCBRC) cluster

Analysis of NGS Data

Next generation sequencing (NGS)

Bismark Bisulfite Mapper User Guide - v0.15.0

Basic processing of next-generation sequencing (NGS) data

UGENE Quick Start Guide

Data Analysis & Management of High-throughput Sequencing Data. Quoclinh Nguyen Research Informatics Genomics Core / Medical Research Institute

NGS Data Analysis: An Intro to RNA-Seq

Databases and mapping BWA. Samtools

BioHPC Web Computing Resources at CBSU

Workflow. Reference Genome. Variant Calling. Galaxy Format Conversion Groomer. Mapping BWA GATK Preprocess

An example of bioinformatics application on plant breeding projects in Rijk Zwaan

-> Integration of MAPHiTS in Galaxy

8/7/2012. Experimental Design & Intro to NGS Data Analysis. Examples. Agenda. Shoe Example. Breast Cancer Example. Rat Example (Experimental Design)

A Tutorial in Genetic Sequence Classification Tools and Techniques

Linux command line. An introduction to the Linux command line for genomics. Susan Fairley

How Sequencing Experiments Fail

Sequence Alignment/Map Format Specification

Subread/Rsubread Users Guide

Human Genomes and Big Data Challenges QUANTITY, QUALITY AND QUANDRY Gerry Higgins, M.D., Ph.D. AssureRx Health, Inc.

Hadoop-BAM and SeqPig

BLAST. Anders Gorm Pedersen & Rasmus Wernersson

SRA File Formats Guide

Module 1. Sequence Formats and Retrieval. Charles Steward

Bio-Informatics Lectures. A Short Introduction

Step by Step Guide to Importing Genetic Data into JMP Genomics

Writing & Running Pipelines on the Open Grid Engine using QMake. Wibowo Arindrarto DTLS Focus Meeting


CSE-E5430 Scalable Cloud Computing. Lecture 4

Scalable Cloud Computing Solutions for Next Generation Sequencing Data

MiSeq: Imaging and Base Calling

Next Generation Sequencing

TZWorks Windows Event Log Viewer (evtx_view) Users Guide

LifeScope Genomic Analysis Software 2.5

DNA Sequencing Overview

Nebula A web-server for advanced ChIP-seq data analysis. Tutorial. by Valentina BOEVA

17 July 2014 WEB-SERVER MANUAL. Contact: Michael Hackenberg

Processing NGS Data with Hadoop-BAM and SeqPig

Elementary Sequence Analysis

HiSeq Analysis Software v0.9 User Guide

2.3 Identify rrna sequences in DNA

Building Highly-Optimized, Low-Latency Pipelines for Genomic Data Analysis

Module 3. Genome Browsing. Using Web Browsers to View Genome Annota4on. Kers4n Howe Wellcome Trust Sanger Ins4tute zfish-

Gene Models & Bed format: What they represent.

Searching Nucleotide Databases

RETRIEVING SEQUENCE INFORMATION. Nucleotide sequence databases. Database search. Sequence alignment and comparison

Workload Characteristics of DNA Sequence Analysis: from Storage Systems Perspective

Welcome to the Plant Breeding and Genomics Webinar Series

How To Use The Librepo Software On A Linux Computer (For Free)

PREREQUISITES LOGGING IN

This document presents the new features available in ngklast release 4.4 and KServer 4.2.

RNA Express. Introduction 3 Run RNA Express 4 RNA Express App Output 6 RNA Express Workflow 12 Technical Assistance

Introduction to Bioinformatics 3. DNA editing and contig assembly

7 Why Use Perl for CGI?

When you install Mascot, it includes a copy of the Swiss-Prot protein database. However, it is almost certain that you and your colleagues will want

Bioinformatics Resources at a Glance

Parallel Compression and Decompression of DNA Sequence Reads in FASTQ Format

Integrated Rule-based Data Management System for Genome Sequencing Data

A Multiple DNA Sequence Translation Tool Incorporating Web Robot and Intelligent Recommendation Techniques

Introduction to analyzing big data using Amazon Web Services

SeqScape Software Version 2.5 Comprehensive Analysis Solution for Resequencing Applications

Simplifying Data Interpretation with Nexus Copy Number

Command Line Crash Course For Unix

GDC Data Transfer Tool User s Guide. NCI Genomic Data Commons (GDC)

Reduced Representation Bisulfite-Seq A Brief Guide to RRBS

Using the Yale HPC Clusters

Data Algorithms. Mahmoud Parsian. Tokyo O'REILLY. Beijing. Boston Farnham Sebastopol

Introduction to next-generation sequencing data

Module 10: Bioinformatics

Lab 2 : Basic File Server. Introduction

Methods, tools, and pipelines for analysis of Ion PGM Sequencer mirna and gene expression data

Database Searching Tutorial/Exercises Jimmy Eng

Automatic Network Protocol Analysis

Supervised DNA barcodes species classification: analysis, comparisons and results. Tutorial. Citations

Managed File Transfer with Universal File Mover

Unemployment Insurance Data Validation Operations Guide

SMALL INDEX LARGE INDEX (SILT)

Accessing the 1000 Genomes Data. Paul Flicek European BioinformaMcs InsMtute

Radius Maps and Notification Mailing Lists

RNA-Seq Tutorial 1. John Garbe Research Informatics Support Systems, MSI March 19, 2012

Storing Measurement Data

WEB USER GUIDE VAULT MEDIA STORAGE

Data Processing of Nextera Mate Pair Reads on Illumina Sequencing Platforms

Transcription:

A Complete Example of Next- Gen DNA Sequencing Read Alignment Presentation Title Goes Here 1

FASTQ Format: The de- facto file format for sharing sequence read data Sequence and a per- base quality score SAM (Sequence Alignment/Map) format: A unified format for storing read alignments to a reference genome. Generally large files (a byte per bp) Very compact in size but computagonally efficient to access. BAM (Binary Alignment/Map) format: A Binary equivalent to SAM. Developed for fast processing and indexing hmp://bioinformagcs.oxfordjournals.org/cgi/reprint/btp352v1

FASTQ Files Sequence Id (Illumina) @HWUSI-EAS610_0001:3:1:4:1405#0/1 GATAGTTCAATTCCAGAGATCAGAGAGAGGTGAGTG + B;30;<4@7/5@=?5?7?1>A2?0<6?<<80>79## 36 bps read 36 Quality scores The de-facto file format for sharing DNA sequence read data 4 Lines per read Sequence line and a per-base Phred quality score line per read FASTQ Files are Text files There is No file Header

An Introduction to Phred Quality Score ε =10 Q Phred 10 Q Phred = 10 log 10 (ε) ε is the Error Probability: The probability that a base call is wrong. Q: Phred Quality Score Q ε Probability the base call in wrong (confidence) 40 0.0001 0.01% (99.99%) 30 0.001 0.1% (99.9%) 20 0.01 1% (99%) 10 0.1 10% (90%) Phred Quality Score encoding in FASTQ/SAM files: ASCII Character = Q + 33 FASTQ Files: Q represents Base Call Quality: Probability the base call is wrong. SAM Files: Q represents Mapping Quality: Probability the mapping position of the read is incorrect. $perl e print chr(33); http://en.wikipedia.org/wiki/fastq_format

Exercise: Hands- On: Examining a FASTQ File We placed a few FASTQ files in /ifs/data/tutorials/hpcclass/resources/sequencing/ 1. Use ls to list the files. 2. Use head, tail, more, less to look at the contents of one (or more) of the files. 3. How long is each DNA read? 4. Count the number of DNA Reads in one of the fastq files. 5. Create a new directory under your home directory, called project1. Generate a new file, called mydata.fastq that contains the first 1000 DNA reads in file data01.fastq

Exercise: Examining a FASTQ File with fastqc demo@phoenix1 project1]$ module load fastqc [demo@phoenix1 project1]$ fastqc mydata.fastq Started analysis of mydata.fastq Started analysis of mydata.fastq Approx 5% complete for mydata.fastq Approx 10% complete for mydata.fastq.. demo@phoenix1 project1]$ ls ltra L> scp efstae01@phoenix.med.nyu.edu:~/project1/mydata_fastqc.zip. L> unzip mydata_fastqc.zip Open the file mydata_fastqc/fastqc_report.html with the web browser.

The Reference and the Reference Index files The Reference Genome file is a text file containing the genome sequence in FASTA format. [efstae01@phoenix1 ~]$ ls lh /ifs/data/tutorials/mm9/mm9.fasta - rw- r- - - - - 1 efstae01 efstae01 2.6G Jun 21 16:12 /ifs/data/tutorial/mm9/mm9.fasta The Reference Index (lookup table) file helps access any region (sub- sequence) of the reference genome quickly. Text file containing one line for each chromosome (congg). Format: Sequence Name, Sequence Length, Offset of first base of the sequence in the file, Length (number of bases) in each line in Reference FASTA file, Number of Bytes in each line. [efstae01@phoenix1 ~]$ more /ifs/data/tutorials/mm9/mm9.fasta.fai chr10 129993255 7 50 51 chr11 121843856 132593135 50 51 chr12 121257530 256873876 50 51 chr13 120284312 380556564 50 51 chr14 125194864 503246570 50 51 chr15 103494974 630945339 50 51 chr16 98319150 736510220 50 51....

Ready-to-use References and Annotations: igenomes A collecgon of reference genomes and annotagon files for commonly analyzed organisms. hmp://support.illumina.com/sequencing/sequencing_sokware/igenome.ilmn Exercise: [efstae01@phoenix1 ~]$ module load igenomes [efstae01@phoenix1 ~]$ echo $IGENOMES_ROOT [efstae01@phoenix1 ~]$ ls l $IGENOMES_ROOT/Mycobacterium_tuberculosis_H37RV/NCBI/ 2001-09- 07/Sequence/ /phoenix/igenomes/[organism]/[source]/[build]/sequence/bwaindex/genome.fa [organism] organism of interest (ex. Mycobacterium_tuberculosis_H37RV ) [source] source of the sequence (ex. NCBI, UCSC) [build] genome draft (ex. mm10)

Using BWA [efstae01@phoenix1 ~]$ module avail bwa [efstae01@phoenix1 ~]$ module load bwa [efstae01@phoenix1 ~]$ module display bwa [efstae01@phoenix1 ~]$ export REF=$IGENOMES_ROOT/ Mycobacterium_tuberculosis_H37RV/NCBI/2001-09- 07/Sequence/BWAIndex/ genome.fa The BWA aln command generates the alignments in Suffix Array (SA) coordinates [efstae01@phoenix1 ~]$ bwa aln $REF mydata.fastq - f mydata.sai The BWA samse command converts to chromosomal coordinates [efstae01@phoenix1 ~]$ bwa samse $REF mydata.sai mydata.fastq - f mydata.sam

[efstae01@phoenix1 ~]$ more mydata.sam @SQ SN:chr10 LN:130694993 @SQ SN:chr11 LN:122082543 @SQ SN:chr12 LN:120129022 @SQ SN:chr13 LN:120421639 @SQ SN:chr14 LN:124902244 @SQ SN:chr15 LN:104043685 @SQ SN:chr16 LN:98207768 @SQ SN:chr17 LN:94987271 @SQ SN:chr18 LN:90702639 The SAM file Exercise: How many alignments are listed in the SAM file? @SQ SN:chr19 LN:61431566 @SQ SN:chr1 LN:195471971 @SQ SN:chr2 LN:182113224 @SQ SN:chr3 LN:160039680 @SQ SN:chr4 LN:156508116 @SQ SN:chr5 LN:151834684 @SQ SN:chr6 LN:149736546 @SQ SN:chr7 LN:145441459 @SQ SN:chr8 LN:129401213 @SQ SN:chr9 LN:124595110 @SQ SN:chrM LN:16299 @SQ SN:chrX LN:171031299 @SQ SN:chrY LN:91744698 HWUSI- EAS610_0001:3:1:4:1405#0 16 chr19 60798324 37 36M * 0 0 CACTCACCTCTCTCTGATCTCTGGAATTGAACTATC ##97>08<<?6<0?2A>1?7?5?=@5/7@4<;03;B XT:A:U NM:i:1 X0:i:1 X1:i:0 XM:i: XO:i:0 XG:i:0 MD:Z:9T26 HWUSI- EAS610_0001:3:1:5:1490#0 0 chr19 32885883 37 36M * 0 0 GGGCTGGTGGAGTGATCCCAAGGGGTGGGGATGGGG B@A?AAA1BB;A5B44>AA3'@AB>+>@AB94A?A? XT:A:U NM:i:0 X0:i:1 X1:i:0 XM:i: XO:i:0 XG:i:0 MD:Z:36 HWUSI- EAS610_0001:3:1:6:388#0 16 chr19 18103063 37 36M * 0 0 GTCTAGGAAGACTAGAGGCCTATTTCATGAACTCTG @AABAAB@@B@?BBBABABA;?<CBBBBBA>C;BBB XT:A:U NM:i:0 X0:i:1 X1:i:0 XM:i: XO:i:0 XG:i:0 MD:Z:36 HWUSI- EAS610_0001:3:1:7:1045#0 16 chr19 60622265 37 36M * 0 0 ATGTGAGGCAATGTGCTCCATTTCCTTTCCCTATCC =>6AB?@BA<;:?AA@9AB87;.=@=:>@B@>3,?B XT:A:U NM:i:0 X0:i:1 X1:i:0 XM:i: XO:i:0 XG:i:0 MD:Z:36

hmp://samtools.sourceforge.net/sam1.pdf

Mapping Quality (MAPQ) in BWA Mapping Quality is a funcgon of Edit Distance and the Uniqueness of the alignment. BWA Mapping Quality 0 1 23 25 37 A read aligns equally well to mulgple posigons (hits). BWA picks randomly one of the posigons and assigns MAPQ=0 Only 1 Best hit (with no subopgmal hits) with more than 2 mismatches. Or Only 1 Best hit, with 1 subopgmal hit. Only 1 Best hit (no subopgmal hits), with up to 2 mismatches (edit distance could be more than 2)

Header secgon @HD @SQ @RG @RG SAM/BAM format VN:1.0 SN:chr20 AS:HG18 LN:62435964 ID:L1 PU:SC_1_10 LB:SC_1 SM:NA12891 ID:L2 PU:SC_2_12 LB:SC_2 SM:NA12891 posigon of alignment Alignment secgon Query Name Ref sequence query sequence (same strand as ref) query quality V00-HWI-EAS132:3:38:959:2035#0 147 chr1 28 255 36M = 79 0 GATCTGATGGCAGAAAACCCCTCTCAGTCCGTCGTG aax`[\`y^y^]zx``\ev_bbbbbbbbbbbbbbbb NM:i:1 V00-HWI-EAS132:4:99:122:772#0 177 chr1 32 255 36M = 9127 0 AAAGGATCTGATGGCAGAAAACCCCTCTCAGTCCGT aaaaaa\owai_\wl\aa`xa^]\zuaa[xwt\^xr NM:i:1 V00-HWI-EAS132:4:44:473:970#0 25 chr1 40 255 36M * 0 0 GTCGTGGTGAAGGATCTGATGGCAGAAAACACCTCT YaZ`W[aZNUZ[U[_TL[KVVX^QURUTDRVZBB NM:i:2 V00-HWI-EAS132:4:29:113:1934#0 99 chr1 44 255 36M = 107 0 GGGTTTTCTGCCATCAGATCCTTTACCACGACAGAC aaaqaa ``]\\_^``^a^`a`_^^^_xq[zs\xx NM:i:1

Post- processing: Tools and programming APIs for parsing and manipulagng alignments: Samtools: hmp://samtools.sourceforge.net/ Convert SAM to BAM and vice versa Sort and Index BAM files Merge mulgple BAM files Show alignments in text viewer Remove Duplicates from PCR amplificagon step Picard Tools: (Java- based) hmp://picard.sourceforge.net/index.shtml

Converting the SAM file to a BAM file Binary, plaporm independent format, resulgng in more efficient storage. [efstae01@phoenix1 ~]$ module avail samtools [efstae01@phoenix1 ~]$ module load samtools/0.1.19 [efstae01@phoenix1 ~]$ module display samtools [efstae01@phoenix1 ~]$ samtools view bt $REF o mydata.bam mydata.sam [samopen] SAM header is present: 22 sequences. [efstae01@phoenix1 ~]$ samtools sort mydata.bam mydata.sorted [efstae01@phoenix1 ~]$ samtools index mydata.sorted.bamefstae01@phoenix1 ~]

Examining the BAM file [efstae01@phoenix1 ~]$ samtools view c mydata.sorted.bam [efstae01@phoenix1 ~]$ samtools view c q 30 mydata.sorted.bam [efstae01@phoenix1 ~]$ samtools view c q 30 mydata.sorted.bam \ chr19:10,000,000-11,000,000 [efstae01@phoenix1 ~]$ samtools view c f 4 mydata.sorted.bam [efstae01@phoenix1 ~]$ samtools view c F 4 mydata.sorted.bam

In your project1 directory: -bash-3.2$ pwd /ifs/home/demo/project1 Putting it all together -bash-3.2$ cp /ifs/data/tutorials/align1.sge. == Examine the file using nano: -bash-3.2$ nano align1.sge == Submit the job using qsub: -bash-3.2$ qsub align1.sge - bash- 3.2$ qstat

logout Presentation Title Goes Here 19