A Complete Example of Next-Gen DNA Sequencing Read Alignment

Stanley Crawford
10 years ago

1 A Complete Example of Next-Gen DNA Sequencing Read Alignment

2 FASTQ format: the de facto file format for sharing sequence read data.
  Stores each sequence together with a per-base quality score.
SAM (Sequence Alignment/Map) format: a unified format for storing read alignments to a reference genome.
  Generally large files (a byte per bp).
BAM (Binary Alignment/Map) format: a binary equivalent of SAM, developed for fast processing and indexing.
  Very compact in size but computationally efficient to access.
http://bioinformatics.oxfordjournals.org/cgi/reprint/btp352v1

3 FASTQ Files
The de facto file format for sharing DNA sequence read data.
Example record:
  Sequence Id
  GATAGTTCAATTCCAGAGATCAGAGAGAGGTGAGTG
  +
  B;30;<4@7/5@=?5?7?1>A2?0<6?<<80>79##
A 36 bp read with 36 quality scores.
4 lines per read, including a sequence line and a per-base Phred quality score line.
FASTQ files are text files.
There is no file header.
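The four-line record layout can be checked mechanically. The following is a minimal sketch, assuming an uncompressed FASTQ file whose records are never wrapped across lines. The function name check_fastq, the file name demo.fastq and the read ID @read1 are invented for illustration; only the sequence and quality strings come from the slide.

```shell
# Hypothetical helper: verify the 4-line FASTQ layout and report the read count.
check_fastq() {
    awk '
        NR % 4 == 1 && substr($0, 1, 1) != "@" { print "line " NR ": ID line must start with @"; bad = 1 }
        NR % 4 == 2 { seqlen = length($0) }
        NR % 4 == 3 && substr($0, 1, 1) != "+" { print "line " NR ": separator must start with +"; bad = 1 }
        NR % 4 == 0 && length($0) != seqlen    { print "line " NR ": quality length differs from sequence length"; bad = 1 }
        END {
            if (NR % 4 != 0) { print "file does not contain a multiple of 4 lines"; bad = 1 }
            if (bad) exit 1
            print NR / 4 " reads OK"
        }
    ' "$1"
}

# Tiny demo file built from the record shown on the slide (the ID is invented).
printf '%s\n' '@read1' \
    'GATAGTTCAATTCCAGAGATCAGAGAGAGGTGAGTG' \
    '+' \
    'B;30;<4@7/5@=?5?7?1>A2?0<6?<<80>79##' > demo.fastq

check_fastq demo.fastq
```

The check compares the length of the quality line with the length of the sequence line, which is the property behind "36 bp read, 36 quality scores".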

4 An Introduction to Phred Quality Score
$\varepsilon = 10^{-Q_{\mathrm{Phred}}/10}$
$Q_{\mathrm{Phred}} = -10 \log_{10}(\varepsilon)$
ε is the error probability: the probability that a base call is wrong.
Q is the Phred quality score.

Q    ε        Probability the base call is wrong (confidence)
40   0.0001   0.01% (99.99%)
30   0.001    0.1% (99.9%)
20   0.01     1% (99%)
10   0.1      10% (90%)

Phred quality score encoding in FASTQ/SAM files: ASCII character = Q + 33
FASTQ files: Q represents base call quality, the probability that the base call is wrong.
SAM files: Q represents mapping quality, the probability that the mapping position of the read is incorrect.
$ perl -e 'print chr(33);'
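The encoding rule (ASCII character = Q + 33) and the error formula can be applied to a single quality character as follows. This is a minimal sketch: the function name phred_decode is invented, and it assumes the Phred+33 encoding stated on the slide.

```shell
# Hypothetical helper: decode one Phred+33 quality character.
phred_decode() {
    # printf with a leading quote yields the numeric (ASCII) code of the character.
    code=$(printf '%d' "'$1")
    q=$((code - 33))
    # Error probability: epsilon = 10^(-Q/10)
    err=$(awk -v q="$q" 'BEGIN { printf "%.6f", 10 ^ (-q / 10) }')
    printf '%s Q=%d error=%s\n' "$1" "$q" "$err"
}

phred_decode 'B'   # first quality character of the example read: Q=33
phred_decode '#'   # last quality character of the example read: Q=2
```

The character printed by the perl one-liner, chr(33), is "!", which corresponds to Q=0.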

5 Exercise (Hands-On): Examining a FASTQ File
We placed a few FASTQ files in /ifs/data/tutorials/hpcclass/resources/sequencing/
1. Use ls to list the files.
2. Use head, tail, more, less to look at the contents of one (or more) of the files.
3. How long is each DNA read?
4. Count the number of DNA reads in one of the FASTQ files.
5. Create a new directory under your home directory called project1. Generate a new file called mydata.fastq that contains the first 1000 DNA reads in file data01.fastq.
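One possible approach to steps 4 and 5 relies on the fact that every read occupies exactly four lines. The sketch below uses a small stand-in file (demo_reads.fastq, invented here) instead of data01.fastq and keeps the first 2 reads rather than 1000, so that it runs anywhere. The function names are also invented.

```shell
# Build a stand-in FASTQ file with 3 invented reads.
printf '@r1\nACGT\n+\nIIII\n@r2\nTTGA\n+\nIIII\n@r3\nGGCC\n+\nIIII\n' > demo_reads.fastq

# Count reads: number of lines divided by 4.
count_reads() {
    echo $(( $(wc -l < "$1") / 4 ))
}

# Keep the first N reads, i.e. the first 4*N lines.
first_reads() {
    head -n $(( $2 * 4 )) "$1"
}

count_reads demo_reads.fastq
first_reads demo_reads.fastq 2 > demo_subset.fastq
count_reads demo_subset.fastq
```

For the exercise itself, the same idea means taking the first 4000 lines of data01.fastq and redirecting them into mydata.fastq inside the new project1 directory.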

6 Exercise: Examining a FASTQ File with fastqc
[demo@phoenix1 project1]$ module load fastqc
[demo@phoenix1 project1]$ fastqc mydata.fastq
Started analysis of mydata.fastq
Started analysis of mydata.fastq
Approx 5% complete for mydata.fastq
Approx 10% complete for mydata.fastq
..
[demo@phoenix1 project1]$ ls -ltra
L> scp [email protected]:~/project1/mydata_fastqc.zip .
L> unzip mydata_fastqc.zip
Open the file mydata_fastqc/fastqc_report.html in a web browser.

7 The Reference and the Reference Index files
The Reference Genome file is a text file containing the genome sequence in FASTA format.
[efstae01@phoenix1 ~]$ ls -lh /ifs/data/tutorials/mm9/mm9.fasta
-rw-r efstae01 efstae01 2.6G Jun 21 16:12 /ifs/data/tutorials/mm9/mm9.fasta
The Reference Index (lookup table) file helps access any region (sub-sequence) of the reference genome quickly.
It is a text file containing one line for each chromosome (contig).
Format of each line:
  1. Sequence name
  2. Sequence length
  3. Offset of the first base of the sequence in the file
  4. Length (number of bases) of each line in the reference FASTA file
  5. Number of bytes in each line
[efstae01@phoenix1 ~]$ more /ifs/data/tutorials/mm9/mm9.fasta.fai
[listing: one line per chr entry; the values are not recoverable]
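The five index columns are enough to compute the byte position of any base without reading the whole file. The following is a minimal sketch on a toy reference. The file names, sequence names, sequences and the helper fetch_base are all invented, and the index is written by hand here; in practice samtools faidx creates the .fai file.

```shell
# Toy reference, 8 bases per line (sequences and names invented for the demo).
printf '>chrA\nACGTACGT\nACGTAC\n>chrB\nTTGGCCAA\n' > toy.fasta

# Hand-written index in the five-column layout:
# name, length, offset of first base, bases per line, bytes per line.
printf 'chrA\t14\t6\t8\t9\nchrB\t8\t28\t8\t9\n' > toy.fasta.fai

# Hypothetical helper: print the base at a 1-based position of a sequence.
fetch_base() {   # usage: fetch_base FASTA NAME POSITION
    byte=$(awk -v name="$2" -v pos="$3" '
        $1 == name {
            i = pos - 1                           # zero-based index into the sequence
            print $3 + int(i / $4) * $5 + i % $4  # full lines skipped + column in line
        }' "$1.fai")
    dd if="$1" bs=1 skip="$byte" count=1 2>/dev/null
    echo
}

fetch_base toy.fasta chrA 10
fetch_base toy.fasta chrB 3
```

The bytes-per-line column is one larger than the bases-per-line column here because each line ends with a single newline character.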

8 Ready-to-use References and Annotations: iGenomes
A collection of reference genomes and annotation files for commonly analyzed organisms.
http://support.illumina.com/sequencing/sequencing_software/igenome.ilmn
Exercise:
[efstae01@phoenix1 ~]$ module load igenomes
[efstae01@phoenix1 ~]$ echo $IGENOMES_ROOT
[efstae01@phoenix1 ~]$ ls -l $IGENOMES_ROOT/Mycobacterium_tuberculosis_H37RV/NCBI/ /Sequence/
Path layout: /phoenix/igenomes/[organism]/[source]/[build]/sequence/bwaindex/genome.fa
  [organism]: organism of interest (e.g. Mycobacterium_tuberculosis_H37RV)
  [source]: source of the sequence (e.g. NCBI, UCSC)
  [build]: genome draft (e.g. mm10)

9 Using BWA
[efstae01@phoenix1 ~]$ module avail bwa
[efstae01@phoenix1 ~]$ module load bwa
[efstae01@phoenix1 ~]$ module display bwa
[efstae01@phoenix1 ~]$ export REF=$IGENOMES_ROOT/Mycobacterium_tuberculosis_H37RV/NCBI/ /Sequence/BWAIndex/genome.fa
The BWA aln command generates the alignments in Suffix Array (SA) coordinates:
[efstae01@phoenix1 ~]$ bwa aln $REF mydata.fastq -f mydata.sai
The BWA samse command converts them to chromosomal coordinates:
[efstae01@phoenix1 ~]$ bwa samse $REF mydata.sai mydata.fastq -f mydata.sam

10 The SAM file
[efstae01@phoenix1 ~]$ more
Header (@SQ lines, one per reference sequence):
SN:chr10 SN:chr11 SN:chr12 SN:chr13 SN:chr14 SN:chr15 SN:chr16 SN:chr17 SN:chr18 LN:
SN:chr19 SN:chr1 SN:chr2 SN:chr3 SN:chr4 SN:chr5 SN:chr6 SN:chr7 SN:chr8 SN:chr9 SN:chrM SN:chrX SN:chrY LN:
Alignment records:
HWUSI-EAS610_0001:3:1:4:1405#0 16 chr M * 0 0 CACTCACCTCTCTCTGATCTCTGGAATTGAACTATC ##97>08<<?6<0?2A>1?7?5?=@5/7@4<;03;B XT:A:U NM:i:1 X0:i:1 X1:i:0 XM:i: XO:i:0 XG:i:0 MD:Z:9T26
HWUSI-EAS610_0001:3:1:5:1490#0 0 chr M * 0 0 GGGCTGGTGGAGTGATCCCAAGGGGTGGGGATGGGG B@A?AAA1BB;A5B44>AA3'@AB>+>@AB94A?A? XT:A:U NM:i:0 X0:i:1 X1:i:0 XM:i: XO:i:0 XG:i:0 MD:Z:36
HWUSI-EAS610_0001:3:1:6:388#0 16 chr M * 0 0 XT:A:U NM:i:0 X0:i:1 X1:i:0 XM:i: XO:i:0 XG:i:0 MD:Z:36
HWUSI-EAS610_0001:3:1:7:1045#0 16 chr M * 0 0 ATGTGAGGCAATGTGCTCCATTTCCTTTCCCTATCC =>6AB?@BA<;:?AA@9AB87;.=@=:>@B@>3,?B XT:A:U NM:i:0 X0:i:1 X1:i:0 XM:i: XO:i:0 XG:i:0 MD:Z:36
Exercise: How many alignments are listed in the SAM file?
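One way to approach the exercise: in a SAM file, header lines start with "@" and every other line is one alignment record. The sketch below demonstrates this on a toy file; the file name toy.sam and all of its values are invented.

```shell
# Toy SAM file: 2 header lines and 3 alignment records (all values invented).
{
    printf '@SQ\tSN:chr1\tLN:1000\n'
    printf '@SQ\tSN:chr2\tLN:2000\n'
    printf 'r1\t0\tchr1\t10\t37\t4M\t*\t0\t0\tACGT\tIIII\n'
    printf 'r2\t16\tchr2\t20\t0\t4M\t*\t0\t0\tTTGA\tIIII\n'
    printf 'r3\t4\t*\t0\t0\t*\t*\t0\t0\tGGCC\tIIII\n'
} > toy.sam

# Header lines start with "@"; every other line is one alignment record.
grep -c '^@' toy.sam     # number of header lines
grep -vc '^@' toy.sam    # number of alignment records
```

Note that the unmapped read (r3, flag 4) still appears as a record, so the record count is not the same as the number of reads that aligned.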

11 http://samtools.sourceforge.net/sam1.pdf

12 Mapping Quality (MAPQ) in BWA
Mapping Quality is a function of Edit Distance and the Uniqueness of the alignment.
BWA Mapping Quality:
  A read aligns equally well to multiple positions (hits). BWA picks one of the positions at random and assigns MAPQ=0.
  Only 1 best hit (with no suboptimal hits) with more than 2 mismatches, or only 1 best hit with 1 suboptimal hit.
  Only 1 best hit (no suboptimal hits), with up to 2 mismatches (edit distance could be more than 2).

13 SAM/BAM format
Header section:
@HD VN:1.0
@SQ SN:chr20 AS:HG18 LN:
@RG ID:L1 PU:SC_1_10 LB:SC_1 SM:NA12891
@RG ID:L2 PU:SC_2_12 LB:SC_2 SM:NA12891
Alignment section (annotated fields: Query Name, Ref sequence, position of alignment, query sequence (same strand as ref), query quality):
V00-HWI-EAS132:3:38:959:2035#0 147 chr M = 79 0 GATCTGATGGCAGAAAACCCCTCTCAGTCCGTCGTG aax`[\`y^y^]zx``\ev_bbbbbbbbbbbbbbbb NM:i:1
V00-HWI-EAS132:4:99:122:772#0 177 chr M = AAAGGATCTGATGGCAGAAAACCCCTCTCAGTCCGT aaaaaa\owai_\wl\aa`xa^]\zuaa[xwt\^xr NM:i:1
V00-HWI-EAS132:4:44:473:970#0 25 chr M * 0 0 GTCGTGGTGAAGGATCTGATGGCAGAAAACACCTCT YaZ`W[aZNUZ[U[_TL[KVVX^QURUTDRVZBB NM:i:2
V00-HWI-EAS132:4:29:113:1934#0 99 chr M = GGGTTTTCTGCCATCAGATCCTTTACCACGACAGAC aaaqaa ``]\\_^``^a^`a`_^^^_xq[zs\xx NM:i:1
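The second column of each alignment record (147, 177, 25, 99 above) is the bitwise FLAG field. A minimal sketch of how such a value breaks down into individual bits is shown below. The function name decode_flag and the short labels are invented for this sketch, and only the eight lowest bits defined by the SAM format are covered.

```shell
# Hypothetical helper: list the SAM FLAG bits that are set in a decimal flag value.
decode_flag() {
    flag=$1
    out=""
    for pair in 1:paired 2:proper_pair 4:unmapped 8:mate_unmapped \
                16:reverse 32:mate_reverse 64:first_in_pair 128:second_in_pair; do
        bit=${pair%%:*}     # numeric value of the bit
        name=${pair#*:}     # label chosen for this sketch
        if [ $(( flag & bit )) -ne 0 ]; then
            out="$out $name"
        fi
    done
    echo "$flag:$out"
}

decode_flag 147
decode_flag 99
decode_flag 16
```

For example, 147 = 128 + 16 + 2 + 1, so the read is paired, mapped in a proper pair, on the reverse strand, and the second read of its pair.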

15 Post-processing
Tools and programming APIs for parsing and manipulating alignments:
Samtools: http://samtools.sourceforge.net/
  Convert SAM to BAM and vice versa
  Sort and index BAM files
  Merge multiple BAM files
  Show alignments in text viewer
  Remove duplicates from PCR amplification step
Picard Tools (Java-based): http://picard.sourceforge.net/index.shtml

16 Converting the SAM file to a BAM file
Binary, platform-independent format, resulting in more efficient storage.
[efstae01@phoenix1 ~]$ module avail samtools
[efstae01@phoenix1 ~]$ module load samtools/
[efstae01@phoenix1 ~]$ module display samtools
[efstae01@phoenix1 ~]$ samtools view -bt $REF -o mydata.bam mydata.sam
[samopen] SAM header is present: 22 sequences.
[efstae01@phoenix1 ~]$ samtools sort mydata.bam mydata.sorted
[efstae01@phoenix1 ~]$ samtools index mydata.sorted.bam

17 Examining the BAM file
[efstae01@phoenix1 ~]$ samtools view -c mydata.sorted.bam
[efstae01@phoenix1 ~]$ samtools view -c -q 30 mydata.sorted.bam
[efstae01@phoenix1 ~]$ samtools view -c -q 30 mydata.sorted.bam \
    chr19:10,000,000-11,000,000
[efstae01@phoenix1 ~]$ samtools view -c -f 4 mydata.sorted.bam
[efstae01@phoenix1 ~]$ samtools view -c -F 4 mydata.sorted.bam
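To make clear what these counts select, the sketch below reproduces the same filters with awk on a small SAM text file: all records (-c), records with MAPQ of at least 30 (-q 30), records with flag bit 4 set, i.e. unmapped (-f 4), and records with bit 4 clear, i.e. mapped (-F 4). The helper count_sam, its mode names and the toy data are invented; the region query is left out because it needs the BAM index.

```shell
# Toy SAM records (invented): columns are QNAME FLAG RNAME POS MAPQ CIGAR ...
{
    printf '@SQ\tSN:chr1\tLN:1000\n'
    printf 'r1\t0\tchr1\t10\t37\t4M\t*\t0\t0\tACGT\tIIII\n'
    printf 'r2\t16\tchr1\t20\t0\t4M\t*\t0\t0\tTTGA\tIIII\n'
    printf 'r3\t4\t*\t0\t0\t*\t*\t0\t0\tGGCC\tIIII\n'
} > filters.sam

# Hypothetical helper mirroring the samtools counts on SAM text.
# mode: all | q30 (MAPQ >= 30) | unmapped (flag bit 4 set) | mapped (bit 4 clear)
count_sam() {
    awk -F '\t' -v mode="$2" '
        /^@/ { next }                            # skip header lines
        { unmapped = int($2 / 4) % 2 }           # test flag bit 4 without bitwise ops
        mode == "all"                     { n++ }
        mode == "q30"      && $5 >= 30    { n++ }
        mode == "unmapped" && unmapped    { n++ }
        mode == "mapped"   && !unmapped   { n++ }
        END { print n + 0 }
    ' "$1"
}

count_sam filters.sam all
count_sam filters.sam q30
count_sam filters.sam unmapped
count_sam filters.sam mapped
```

The mapped and unmapped counts always add up to the total, which is a quick sanity check on a real BAM file as well.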

18 Putting it all together
In your project1 directory:
-bash-3.2$ pwd
/ifs/home/demo/project1
-bash-3.2$ cp /ifs/data/tutorials/align1.sge .
== Examine the file using nano:
-bash-3.2$ nano align1.sge
== Submit the job using qsub:
-bash-3.2$ qsub align1.sge
-bash-3.2$ qstat
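The contents of align1.sge are not shown in these slides. As a rough idea of what a job script of this kind might look like, the sketch below writes a hypothetical script, my_align.sge, that chains the commands from the earlier slides. The file name, the job name and the choice of scheduler directives are assumptions, and the real align1.sge may differ.

```shell
# Write a hypothetical job script; the real align1.sge may differ.
cat > my_align.sge <<'EOF'
#!/bin/bash
#$ -S /bin/bash
#$ -cwd
#$ -V
#$ -N align1
# Assumes REF was exported as on the "Using BWA" slide (passed in by -V)
# and that the bwa and samtools modules exist on the cluster.
module load bwa
module load samtools
bwa aln $REF mydata.fastq -f mydata.sai
bwa samse $REF mydata.sai mydata.fastq -f mydata.sam
samtools view -bt $REF -o mydata.bam mydata.sam
samtools sort mydata.bam mydata.sorted
samtools index mydata.sorted.bam
EOF

# Check the script for shell syntax errors without running it.
bash -n my_align.sge && echo "syntax OK"
```

The samtools sort line keeps the output-prefix form used in these slides; newer samtools releases expect the output file to be given with -o instead.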

19 logout
