Databases and mapping BWA. Samtools




File formats
- FASTQ, SFF, bax.h5
- ACE, FASTG
- FASTA
- BAM/SAM
- GFF, BED
- GenBank/EMBL/DDBJ
- many more

FASTQ Output format from Illumina and IonTorrent sequencers. Quality scores:
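To make the quality scores concrete, here is a minimal Python sketch of how a Sanger-encoded quality string (ASCII offset 33, as named on the next slide) can be decoded into Phred scores and error probabilities. The function names and the sample string are illustrative assumptions, not part of the lecture material.

```python
def phred_scores(quality_string, offset=33):
    """Decode a FASTQ quality string into Phred scores.

    Sanger encoding stores each score as the ASCII character chr(score + 33).
    """
    return [ord(ch) - offset for ch in quality_string]


def error_probability(q):
    """Probability that a base call with Phred score q is wrong: P = 10^(-q/10)."""
    return 10 ** (-q / 10)


# Made-up quality string: 'I' encodes Q40, '5' encodes Q20, '!' encodes Q0.
scores = phred_scores("II5!")
print(scores)
print([error_probability(q) for q in scores])
```

A score of 20 corresponds to a 1 in 100 chance of a wrong base call, and a score of 40 to 1 in 10,000.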

FASTQ headers (Casava 1.8, qualities Sanger encoded)
@EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG
- EAS139: the unique instrument name
- 136: the run id
- FC706VJ: the flowcell id
- 2: flowcell lane
- 2104: tile number within the flowcell lane
- 15343: 'x'-coordinate of the cluster within the tile
- 197393: 'y'-coordinate of the cluster within the tile
- 1: the member of a pair, 1 or 2 (paired-end or mate-pair reads only)
- Y: Y if the read fails filter (read is bad), N otherwise
- 18: 0 when none of the control bits are on, otherwise it is an even number
- ATCACG: index sequence
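The header layout above can be sketched as a small Python parser. The function name and the dictionary keys are hypothetical choices for illustration. The sketch assumes the header has exactly the Casava 1.8 layout shown on the slide.

```python
def parse_casava_header(header):
    """Split a Casava 1.8 FASTQ header into named fields (illustrative key names)."""
    if not header.startswith("@"):
        raise ValueError("FASTQ header must start with '@'")
    # The read identifier and the description are separated by a single space.
    read_id, _, description = header[1:].partition(" ")
    instrument, run_id, flowcell, lane, tile, x, y = read_id.split(":")
    pair, failed, control_bits, index = description.split(":")
    return {
        "instrument": instrument,
        "run_id": int(run_id),
        "flowcell": flowcell,
        "lane": int(lane),
        "tile": int(tile),
        "x": int(x),
        "y": int(y),
        "pair": int(pair),
        "failed_filter": failed == "Y",  # Y means the read failed the filter
        "control_bits": int(control_bits),
        "index": index,
    }


fields = parse_casava_header("@EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG")
print(fields["flowcell"], fields["lane"], fields["failed_filter"])
```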

SFF Standard Flowgram Format - binary format used to encode results from 454 sequencers - can be converted to fasta/fastq (sff2fastq tool)

PacBio files
.bax.h5: the .bax.h5 files contain the sequence data.
.bas.h5: the .bas.h5 file now contains only the information necessary to dereference the ZMW-level data by hole number.
There are currently several different combinations of polymerases (P1-5) and chemistries (C1-3) used by PacBio. They differ in their output files and error profiles, so it is good to know which combination generated your data.

ACE Stores complete data about genomic contigs. All assemblers can be run with this or similar file output. Recommended for your final assembly! You can have a look at broken pairs of reads, browse differences in sequencing coverage,...

FASTG
A format for faithfully representing genome assemblies in the face of allelic polymorphism and assembly uncertainty. The G stands for graph.
http://fastg.sourceforge.net
#FASTG:begin;
#FASTG:version=1.0:assembly_name="tiny example";
>chr1:chr1;
ACGANNNNN[5:gap:size=(5,4..6)]CAGGC[1:alt:allele C,G]TATACG
>chr2;4
ACATACGCATATATATATATATATATAT[20:tandem:size=(10,8..12) AT]TCA GGCA[1:alt A,T,TT]GGAC
#FASTG:end;

FASTA
Be consistent when naming your FASTA files! Avoid special characters and spaces in headers.
Common file extensions: .fa, .fas, .fasta, .fna, .faa
>sequence_name
GGAGGGGACGACGTCAAGTCATCATGGCCTTTATGGGTGGGGCTTCACACGTCATACAATGGTTGGAGCA
AAGGGTCGCCAACTCGAGAGAGGGAGCTAATCCCACAAACCCAGCCCCAGTTCGGATTGGAGTCTGCAAC
TCGACTCCATGAAGTAGGAATCGCTAGTAATCGTGGATCAGCATGCCACGGTGAATACGTTCCCGGGTCT
TGTACACACCGCCCGTCACACCATGGAAGTAGGCCGCATCCGAAGCAGCCTCCCTAACCCTATTGCTGGG
AAGGAGGCTGCGAAGGTGGGGTCTATGACTGGGGTGAAGTCGTAACAAGGTAGCCGTACCGGAAGGTGCG
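As a rough illustration of why spaces in headers cause trouble, here is a minimal Python FASTA reader. It assumes, as many tools do, that the sequence name is the header text up to the first whitespace. The function name and the sample records are made up for this sketch.

```python
def read_fasta(lines):
    """Read FASTA records from an iterable of lines into a {name: sequence} dict.

    The record name is taken as the header text up to the first whitespace,
    so anything after a space in the header is dropped.
    """
    records = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            parts = line[1:].split()
            if not parts:
                raise ValueError("FASTA header without a sequence name")
            name = parts[0]
            records[name] = []
        elif name is not None:
            records[name].append(line)
    # Join the wrapped sequence lines of each record into one string.
    return {n: "".join(chunks) for n, chunks in records.items()}


example = [">seq_1 some description", "ACGT", "GGCC", ">seq_2", "TTAA"]
print(read_fasta(example))
```

With real data the same function can be called on an open file handle, since a file iterates over its lines.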

BAM/SAM
The SAM format is a text format for storing sequence data in a series of tab-delimited ASCII columns. Most often it is generated as a human-readable version of its sister BAM format, which stores the same data in a compressed, indexed, binary form. SAM/BAM is the output of aligners that read FASTQ files and assign the sequences to a position with respect to a known reference genome. A file contains a header section and an alignment section.
A .bam.bai file is the index of a sorted BAM file and allows quick access to the data. (A .sai file is not a SAM index; it is the intermediate output of bwa aln.)

GFF/GTF
General Feature Format, currently GFF3. The GTF (General Transfer Format) is a refinement of GFF version 2 and is sometimes referred to as GFF2.5.
- used for describing genes and other features of DNA, RNA, and protein sequences
http://www.sequenceontology.org/gff3.shtml
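A GFF3 feature line has nine tab-separated columns, with 1-based, inclusive coordinates and key=value attributes separated by semicolons. The Python sketch below parses one such line. The function name and the example feature are made up, and URL-decoding of escaped attribute values is left out for brevity.

```python
GFF3_COLUMNS = ["seqid", "source", "type", "start", "end",
                "score", "strand", "phase", "attributes"]


def parse_gff3_line(line):
    """Parse one GFF3 feature line into a dict (coordinates are 1-based, inclusive)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(GFF3_COLUMNS):
        raise ValueError("a GFF3 feature line needs exactly 9 tab-separated columns")
    feature = dict(zip(GFF3_COLUMNS, fields))
    feature["start"] = int(feature["start"])
    feature["end"] = int(feature["end"])
    # Column 9 holds key=value pairs separated by semicolons.
    attributes = {}
    for pair in feature["attributes"].split(";"):
        if "=" in pair:
            key, value = pair.split("=", 1)
            attributes[key] = value
    feature["attributes"] = attributes
    # Both ends are included, hence the +1.
    feature["length"] = feature["end"] - feature["start"] + 1
    return feature


# Made-up feature line for illustration.
gene = parse_gff3_line("chr1\texample\tgene\t1000\t2000\t.\t+\t.\tID=gene0001;Name=demo")
print(gene["type"], gene["length"], gene["attributes"]["ID"])
```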

BED BED format provides a flexible way to define the data lines that are displayed in an annotation track. BED lines have three required fields and nine additional optional fields. The number of fields per line must be consistent throughout any single set of data in an annotation track. The order of the optional fields is binding: lower-numbered fields must always be populated if higher-numbered fields are used. http://genome.ucsc.edu/faq/faqformat.html#format1
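The field rules above can be checked with a short Python sketch. BED coordinates are 0-based with an exclusive end, so the feature length is simply end minus start. The function name is hypothetical and the example line is made up.

```python
BED_FIELDS = ["chrom", "chromStart", "chromEnd", "name", "score", "strand",
              "thickStart", "thickEnd", "itemRgb",
              "blockCount", "blockSizes", "blockStarts"]


def parse_bed_line(line):
    """Parse one BED data line; the first three fields are required.

    Because the optional fields have a fixed order, the first N field names
    always describe a line with N columns.
    """
    values = line.split()
    if not 3 <= len(values) <= len(BED_FIELDS):
        raise ValueError("a BED line has 3 required and up to 9 optional fields")
    record = dict(zip(BED_FIELDS, values))
    record["chromStart"] = int(record["chromStart"])
    record["chromEnd"] = int(record["chromEnd"])
    # 0-based start, exclusive end: no +1 needed.
    record["length"] = record["chromEnd"] - record["chromStart"]
    return record


# Made-up six-column line for illustration.
rec = parse_bed_line("chr7\t127471196\t127472363\tPos1\t0\t+")
print(rec["chrom"], rec["length"], rec["strand"])
```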

GenBank/EMBL/DDBJ
http://www.ncbi.nlm.nih.gov/nuccore/11466244?report=genbank
The GenBank sequence format is a rich format for storing sequences and associated annotations. It shares a feature table vocabulary and format with the EMBL and DDBJ formats.
FEATURES: source, gene, CDS

Software needed for this lecture
A directory with all programs will be distributed on a USB flash drive. Add the folder with all binaries (executable files) to your path or run the individual programs locally.
We will use Blast+, Hmmer, Bowtie/BWA and Samtools today. These programs are used on a daily basis by almost every bioinformatician dealing with genomic data, and they can easily be run on a laptop.
BWA and Samtools need to be compiled from source. If the compilation fails (e.g. missing zlib for samtools), please let me know. Type:
cd bwa-0.7.8; make
cd samtools-0.1.19; make

Sequence databases
- NR: non-redundant proteins from GenBank CDS translations, RefSeq Proteins, PDB, SwissProt, PIR and PRF; produced by NCBI.
- RefSeq: NCBI reference sequence collection, a set of taxonomically diverse, non-redundant and richly annotated sequences.
- UniProtKB: comprehensive resource for protein sequence and annotation data, produced by the Universal Protein Resource consortium.
- Pfamseq: the underlying sequence database that Pfam is built upon. As there should be no overlaps between Pfam domains, this provides a stable sequence database for investigating domains and domain architectures.
- Swiss-Prot: manually reviewed, high-quality protein sequence and functional annotation; produced by UniProt.
- PDB: sequences with an experimentally determined structure.

Databases of metabolic pathways & enzyme nomenclature
- KEGG http://www.genome.jp/kegg/
- BioCyc http://biocyc.org/
- ExPASy Enzyme http://enzyme.expasy.org/
- BRENDA http://www.brenda-enzymes.info/
Keep in mind that an identical enzymatic reaction can be carried out by enzymes coded by completely different genes.
The Enzyme Commission number (EC number) is a numerical classification for enzymes, based on the chemical reactions they catalyze.

How to make custom blast databases
makeblastdb -in fastafile -dbtype nucl/prot
update_blastdb.pl
- perl script bundled with the blast programs
- allows downloading/upgrading blast databases such as nr/nt/refseq/swissprot from NCBI
ftp://ftp.ncbi.nlm.nih.gov/blast/db/

BLAST+ programs
ftp://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/
- blastn: search a nucleotide database using a nucleotide query
- blastx: search a protein database using a translated nucleotide query
- blastp/psiblast: search a protein database using a protein query
- tblastn: search a translated nucleotide database using a protein query
- tblastx: search a translated nucleotide database using a translated nucleotide query
blastx -help prints help for the particular blast program

Using blast
blastp -db databasename -query yourfastafile > seq_vs_database.blastp
Useful blast arguments:
-h (prints help for a particular blast program)
-outfmt
-num_descriptions
-num_alignments
-evalue
-num_threads

Blast exercise
Try to find immune proteins in the recently published tsetse fly genome by searching it with Drosophila melanogaster immunity proteins as queries. Use several different e-value cut-offs (1, 1e-3, 1e-6, 1e-8, ...) and output formats.
makeblastdb -in Glossina_morsitans.faa -dbtype prot
blastp -db Glossina_morsitans.faa -query Drosophila_melanogaster_imunity.faa -num_threads 4 > Dmel_immunity_vs_Gmors.blastp

Blast output formats
*** Formatting options
-outfmt <String> alignment view options:
0 = pairwise,
1 = query-anchored showing identities,
2 = query-anchored no identities,
3 = flat query-anchored, show identities,
4 = flat query-anchored, no identities,
5 = XML Blast output,
6 = tabular,
7 = tabular with comment lines,
8 = Text ASN.1,
9 = Binary ASN.1,
10 = Comma-separated values,
11 = BLAST archive format (ASN.1)
Options 6, 7, and 10 can be additionally configured to produce a custom format specified by space-delimited format specifiers.
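Tabular output (-outfmt 6) is the easiest format to post-process. The Python sketch below assumes the default twelve columns (no custom format specifiers) and keeps only hits at or below an e-value cut-off. The function name and the two sample hit lines are made up for illustration.

```python
# Default column order of BLAST+ tabular output (-outfmt 6).
BLAST6_COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
                  "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def parse_blast6(lines, max_evalue=1e-6):
    """Return hits (as dicts) whose e-value is at or below max_evalue."""
    hits = []
    for line in lines:
        line = line.rstrip("\n")
        # Skip blank lines and the comment lines produced by -outfmt 7.
        if not line or line.startswith("#"):
            continue
        values = line.split("\t")
        if len(values) != len(BLAST6_COLUMNS):
            raise ValueError("expected the 12 default tabular columns")
        hit = dict(zip(BLAST6_COLUMNS, values))
        hit["evalue"] = float(hit["evalue"])
        hit["bitscore"] = float(hit["bitscore"])
        if hit["evalue"] <= max_evalue:
            hits.append(hit)
    return hits


# Two made-up hit lines: one strong, one weak.
sample = [
    "q1\ts1\t85.00\t100\t15\t0\t1\t100\t1\t100\t1e-30\t150",
    "q1\ts2\t30.00\t50\t35\t0\t1\t50\t10\t59\t0.5\t25.0",
]
for hit in parse_blast6(sample, max_evalue=1e-6):
    print(hit["qseqid"], hit["sseqid"], hit["evalue"])
```

Changing max_evalue to 1 keeps both sample hits, which mirrors re-running a search with a looser -evalue setting.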

HMMER HMMER is mainly used for searching sequence databases for homologs of protein sequences, and for making protein sequence alignments. It implements methods using probabilistic models called profile hidden Markov models (profile HMMs). Compared to BLAST and other sequence alignment and database search tools based on older scoring methodology, HMMER aims to be significantly more accurate and more able to detect remote homologs because of the strength of its underlying mathematical models. In the past, this strength came at significant computational expense, but in the new HMMER3 project, HMMER is now essentially as fast as BLAST. Webserver: http://hmmer.janelia.org/search/phmmer User's guide: [PDF, 116 pages] ftp://selab.janelia.org/pub/software/hmmer3/3.1b1/userguide.pdf

Standalone HMMER3
http://hmmer.janelia.org/software
1) Build models and align sequences (DNA or protein)
- hmmbuild: build a profile HMM from an input multiple alignment.
- hmmalign: make a multiple alignment of many sequences to a common profile HMM.

Individual hmmer programs
Search protein queries against a protein database
- phmmer: search a single protein sequence against a protein sequence database (BLASTP-like).
- jackhmmer: iteratively search a protein sequence against a protein sequence database (PSIBLAST-like).
- hmmsearch: search a protein profile HMM against a protein sequence database.
- hmmscan: search a protein sequence against a protein profile HMM database.
Search DNA queries against a DNA database
- nhmmer: search a DNA sequence, alignment, or profile HMM against a DNA sequence database (BLASTN-like).
- nhmmscan: search a DNA sequence against a DNA profile HMM database.

Searching a protein sequence database with a single protein profile HMM
The subdirectory /tutorial in the HMMER distribution contains the files used in the tutorial, as well as a number of examples of various file formats that HMMER reads.
hmmbuild globins4.hmm tutorial/globins4.sto
hmmsearch globins4.hmm tutorial/globins45.fa > globins4.out
phmmer tutorial/hbb_human tutorial/globins45.fa
jackhmmer tutorial/hbb_human tutorial/globins45.fa

Searching a profile HMM database with a query sequence
hmmbuild globins4.hmm tutorial/globins4.sto
hmmbuild fn3.hmm tutorial/fn3.sto
hmmbuild Pkinase.hmm tutorial/pkinase.sto
cat globins4.hmm fn3.hmm Pkinase.hmm > minifam
hmmpress minifam
hmmscan minifam tutorial/7less_drome

Target profile HMM databases
- Gene3D: a collection of models that are based on CATH structural protein domains.
- Pfam: a large comprehensive collection of protein families.
- Superfamily: a collection of models that represent structural protein domains at the SCOP superfamily level.
- TIGRFAMS: models that are designed for automated sequence annotation and that are aimed at matching the full length (or near) of the sequence.

Mapping high-throughput sequencing data
- Blat, Bowtie, BWA, MAQ, TopHat, Mummer, ...
- there are over a hundred tools available
- Illumina, 454, IonT, Sanger, GridIon/MinIon, PacBio, ...
- some of them are extremely fast and some of them are accurate; different mappers give different results!
- the two most cited short read aligners are Bowtie and BWA
http://en.wikipedia.org/wiki/list_of_sequence_alignment_software#short-Read_Sequence_Alignment
http://wwwdev.ebi.ac.uk/fg/hts_mappers/

Burrows-Wheeler algorithm (transform)
http://www.homolog.us/animation/bwt-b.html
Compression techniques work by finding repeated patterns in the data and encoding the duplications more compactly. The Burrows-Wheeler transform (BWT, also called block-sorting compression) rearranges a character string into runs of similar characters. This is useful for compression, since it tends to be easy to compress a string that has runs of repeated characters by techniques such as move-to-front transform and run-length encoding. More importantly, the transformation is reversible, without needing to store any additional data.
Working with short-read aligners
- create an index for a set of FASTA files obtained from any source
- align your reads
- analyze SAM and BAM alignment files (SAMtools)
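The transform and its inverse can be sketched in a few lines of Python. This is a naive textbook version that sorts all rotations of the string, shown only to illustrate the idea. It is not how BWA or Bowtie build their indexes, and it assumes a terminator character "$" that does not occur in the text and sorts before every other character.

```python
def bwt(text, terminator="$"):
    """Burrows-Wheeler transform via sorted rotations (naive, for illustration)."""
    if terminator in text:
        raise ValueError("text must not contain the terminator character")
    s = text + terminator
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    # The transform is the last column of the sorted rotation table.
    return "".join(rotation[-1] for rotation in rotations)


def inverse_bwt(last_column, terminator="$"):
    """Rebuild the original text from the BWT output alone."""
    n = len(last_column)
    table = [""] * n
    # Repeatedly prepend the last column and re-sort; after n rounds the
    # table holds every rotation of the original string in sorted order.
    for _ in range(n):
        table = sorted(last_column[i] + table[i] for i in range(n))
    original = next(row for row in table if row.endswith(terminator))
    return original[:-1]


transformed = bwt("banana")
print(transformed)               # annb$aa
print(inverse_bwt(transformed))  # banana
```

In the output the two n characters and the two trailing a characters end up adjacent, which is the clustering that makes the result easy to compress.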

Maq Eland Soap Bowtie BWA Soap2

BWA (Burrows-Wheeler Aligner)
http://sourceforge.net/projects/bio-bwa/files/
- BWA-MEM: for 70bp or longer Illumina, 454, Ion Torrent and Sanger reads, assembly contigs and BAC sequences
- BWA-backtrack: for short sequences
- BWA-SW: may have better sensitivity when alignment gaps are frequent
For all the algorithms, BWA first needs to construct the FM-index for the reference genome (the index command). Alignment algorithms are invoked with different sub-commands: aln/samse/sampe for BWA-backtrack, bwasw for BWA-SW and mem for the BWA-MEM algorithm.
bwa index ref.fa
bwa mem ref.fa reads_f.fastq reads_r.fastq > aln-pe.sam

Using BWA and Bowtie2
bwa index lambda_virus.fa
bwa mem lambda_virus.fa reads_f.fastq reads_r.fastq > bwa-mem_pe.sam
bowtie2-build lambda_virus.fa lambda_virus
bowtie2 -x lambda_virus -1 reads_f.fastq -2 reads_r.fastq > bowtie2_pe.sam
http://bowtie-bio.sourceforge.net/bowtie2/index.shtml

SAMTOOLS
http://sourceforge.net/projects/samtools/files/
A BAM file is just a SAM file stored in binary.
- import: SAM-to-BAM conversion
- view: BAM-to-SAM conversion and subalignment retrieval
- sort: sorting alignment
- merge: merging multiple sorted alignments
- index: indexing sorted alignment
- faidx: FASTA indexing and subsequence retrieval
- tview: text alignment viewer
- pileup: generating position-based output and consensus/indel calling

Using SAMTOOLS
Convert and sort:
samtools view -bS bowtie2_pe.sam > bowtie2-pe.bam
samtools sort bowtie2-pe.bam bowtie2-pe.sorted
(-S tells samtools 0.1.19 that the input is SAM; sort takes an output prefix and appends .bam itself, producing bowtie2-pe.sorted.bam)
Create a BAM index file:
samtools index bowtie2-pe.sorted.bam bowtie2-pe.sorted.bam.bai
Try it with both the bowtie2 and bwa-mem SAM files. Aligned reads (.sorted.bam file) can be viewed in genome browsers (e.g. Artemis).
Filter out unmapped reads:
samtools view -h -F 4 -b test.bam > test_only_mapped.bam
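The -F 4 filter works on the FLAG column (the second column) of each alignment line, where bit 0x4 means "read unmapped". The Python sketch below mimics that filter on SAM text lines, keeping header lines as -h does. It only illustrates the flag logic and is not a replacement for samtools. The function names and the sample records are made up.

```python
UNMAPPED = 0x4  # SAM FLAG bit: the read itself is unmapped


def is_mapped(sam_line):
    """True if the FLAG column of an alignment line does not have bit 0x4 set."""
    flag = int(sam_line.split("\t")[1])
    return not flag & UNMAPPED


def keep_mapped(lines):
    """Mimic `samtools view -h -F 4` on SAM text: keep headers and mapped reads."""
    kept = []
    for line in lines:
        # Header lines start with '@' and are always kept.
        if line.startswith("@") or is_mapped(line):
            kept.append(line)
    return kept


# Made-up records: FLAG 99 is a mapped paired read, FLAG 77 has bit 0x4 set.
sam = [
    "@HD\tVN:1.0\tSO:unsorted",
    "read1\t99\tlambda\t100\t42\t5M\t=\t300\t205\tACGTA\tIIIII",
    "read2\t77\t*\t0\t0\t*\t*\t0\t0\tACGTA\tIIIII",
]
for line in keep_mapped(sam):
    print(line.split("\t")[0])
```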