Welcome to the Plant Breeding and Genomics Webinar Series



Similar documents
Pairwise Sequence Alignment

RETRIEVING SEQUENCE INFORMATION. Nucleotide sequence databases. Database search. Sequence alignment and comparison

BLAST. Anders Gorm Pedersen & Rasmus Wernersson

Bioinformatics Resources at a Glance

BIOL 3200 Spring 2015 DNA Subway and RNA-Seq Data Analysis

Introduction to NGS data analysis

Databases and mapping BWA. Samtools

Just the Facts: A Basic Introduction to the Science Underlying NCBI Resources

Next Generation Sequencing: Technology, Mapping, and Analysis

An example of bioinformatics application on plant breeding projects in Rijk Zwaan

Version 5.0 Release Notes

A Tutorial in Genetic Sequence Classification Tools and Techniques

Next generation sequencing (NGS)

Next Generation Sequencing

Similarity Searches on Sequence Databases: BLAST, FASTA. Lorenza Bordoli Swiss Institute of Bioinformatics EMBnet Course, Basel, October 2003

Introduction to transcriptome analysis using High Throughput Sequencing technologies (HTS)

Module 1. Sequence Formats and Retrieval. Charles Steward

CRAC: An integrated approach to analyse RNA-seq reads Additional File 3 Results on simulated RNA-seq data.

SMRT Analysis v2.2.0 Overview. 1. SMRT Analysis v SMRT Analysis v2.2.0 Overview. Notes:

Introduction to next-generation sequencing data

This chapter introduces you to Microso2 Office Access The chapter focuses on what a database is, the components of a database, what a database

Gene Models & Bed format: What they represent.

Tutorial for Windows and Macintosh. Preparing Your Data for NGS Alignment

Next generation DNA sequencing technologies. theory & prac-ce

Basic processing of next-generation sequencing (NGS) data

When you install Mascot, it includes a copy of the Swiss-Prot protein database. However, it is almost certain that you and your colleagues will want

Using Illumina BaseSpace Apps to Analyze RNA Sequencing Data

CD-HIT User s Guide. Last updated: April 5,

Data Processing of Nextera Mate Pair Reads on Illumina Sequencing Platforms

Data Analysis & Management of High-throughput Sequencing Data. Quoclinh Nguyen Research Informatics Genomics Core / Medical Research Institute

Shouguo Gao Ph. D Department of Physics and Comprehensive Diabetes Center

Focusing on results not data comprehensive data analysis for targeted next generation sequencing

Molecular Databases and Tools

Bioinformatics Grid - Enabled Tools For Biologists.

Data formats and file conversions

Introduction to Bioinformatics 3. DNA editing and contig assembly

Lectures 1 and February 7, Genomics 2012: Repetitorium. Peter N Robinson. VL1: Next- Generation Sequencing. VL8 9: Variant Calling

SGI. High Throughput Computing (HTC) Wrapper Program for Bioinformatics on SGI ICE and SGI UV Systems. January, Abstract. Haruna Cofer*, PhD

Geospiza s Finch-Server: A Complete Data Management System for DNA Sequencing

Searching Nucleotide Databases

NGS data analysis. Bernardo J. Clavijo

A Primer of Genome Science THIRD

How Sequencing Experiments Fail

UGENE Quick Start Guide

Sequence Formats and Sequence Database Searches. Gloria Rendon SC11 Education June, 2011

3. About R2oDNA Designer

17 July 2014 WEB-SERVER MANUAL. Contact: Michael Hackenberg

Lecture 4: Exact string searching algorithms. Exact string search algorithms. Definitions. Exact string searching or matching

PreciseTM Whitepaper

Biotechnology: DNA Technology & Genomics

BIOINFORMATICS TUTORIAL

Rapid alignment methods: FASTA and BLAST. p The biological problem p Search strategies p FASTA p BLAST

Bioinformática BLAST. Blast information guide. Buscas de sequências semelhantes. Search for Homologies BLAST

Analysis of ChIP-seq data in Galaxy

IOmark Suite. Benchmarking Storage with Applica4on Workloads August, Evaluator Group, Inc.

BioHPC Web Computing Resources at CBSU

Go where the biology takes you. Genome Analyzer IIx Genome Analyzer IIe

Package hoarder. June 30, 2015

Analysis of NGS Data

SeattleSNPs Interactive Tutorial: Web Tools for Site Selection, Linkage Disequilibrium and Haplotype Analysis

Note: This document wh_informatics_practical.doc and supporting materials can be downloaded at

NECC History. Karl V. Steiner 2011 Annual NECC Meeting, Orono, Maine March 15, 2011

Challenges associated with analysis and storage of NGS data

The Chinese University of Hong Kong igem 2010 Bacterial based storage and encryp2on device

Genetic Analysis. Phenotype analysis: biological-biochemical analysis. Genotype analysis: molecular and physical analysis

Efficient Parallel Execution of Sequence Similarity Analysis Via Dynamic Load Balancing

ACAAGGGACTAGAGAAACCAAAA AGAAACCAAAACGAAAGGTGCAGAA AACGAAAGGTGCAGAAGGGGAAACAGATGCAGA CHAPTER 3

8/7/2012. Experimental Design & Intro to NGS Data Analysis. Examples. Agenda. Shoe Example. Breast Cancer Example. Rat Example (Experimental Design)

FlipFlop: Fast Lasso-based Isoform Prediction as a Flow Problem

PrimePCR Assay Validation Report

Frequently Asked Questions Next Generation Sequencing

Design Style of BLAST and FASTA and Their Importance in Human Genome.

SeqScape Software Version 2.5 Comprehensive Analysis Solution for Resequencing Applications

Biological Databases and Protein Sequence Analysis

Choices, choices, choices... Which sequence database? Which modifications? What mass tolerance?

Ordered Index Seed Algorithm for Intensive DNA Sequence Comparison

Exercises for the UCSC Genome Browser Introduction

Protein & DNA Sequence Analysis. Bobbie-Jo Webb-Robertson May 3, 2004

DNA Sequencing Overview

Hadoopizer : a cloud environment for bioinformatics data analysis

Guide for Bioinformatics Project Module 3

De Novo Assembly Using Illumina Reads

AS Replaces Page 1 of 50 ATF. Software for. DNA Sequencing. Operators Manual. Assign-ATF is intended for Research Use Only (RUO):

Nazneen Aziz, PhD. Director, Molecular Medicine Transformation Program Office

GenBank: A Database of Genetic Sequence Data

Comparing Methods for Identifying Transcription Factor Target Genes

org.rn.eg.db December 16, 2015 org.rn.egaccnum is an R object that contains mappings between Entrez Gene identifiers and GenBank accession numbers.

PrimePCR Assay Validation Report

G E N OM I C S S E RV I C ES

High Performance Compu2ng Facility

BIO 3350: ELEMENTS OF BIOINFORMATICS PARTIALLY ONLINE SYLLABUS

Apply PERL to BioInformatics (II)

RNAseq / ChipSeq / Methylseq and personalized genomics

PROC. CAIRO INTERNATIONAL BIOMEDICAL ENGINEERING CONFERENCE

Genomes and SNPs in Malaria and Sickle Cell Anemia

Vector NTI Advance 11 Quick Start Guide

New solutions for Big Data Analysis and Visualization

Module 3. Genome Browsing. Using Web Browsers to View Genome Annota4on. Kers4n Howe Wellcome Trust Sanger Ins4tute zfish-

A Complete Example of Next- Gen DNA Sequencing Read Alignment. Presentation Title Goes Here

Discovery and Quantification of RNA with RNASeq Roderic Guigó Serra Centre de Regulació Genòmica (CRG)

Transcription:

Welcome to the Plant Breeding and Genomics Webinar Series Today s Presenter: Dr. Candice Hansey Presentation: http://www.extension.org/pages/ 60428 Host: Heather Merk Technical Production: John McQueen PBG home page: www.extension.org/plant_breeding_genomics Sign up for PBG News: http://pbgworks.org

Please fill out the survey evaluation! (You will be contacted via email) Watch past webinars and sign up for future webinars! http://www.extension.org/pages/60426

How to Align Sequences Presenter: Candice Hansey hansey @msu.edu Michigan State University

Overview NavigaAng NCBI to obtain sequences Using BLAST for sequence alignment Using other programs for specialized sequence alignment Next generaaon sequence alignment programs

Overview NavigaAng NCBI to obtain sequences Using BLAST for sequence alignment Using other programs for specialized sequence alignment Next generaaon sequence alignment programs

Goal Obtain sequence for the maize teosinte branched1 gene and then find organisms with orthologous genes

NCBI

NCBI Search specifying Zea Mays as the organism resulted in thousands of sequences

NCBI

NCBI

NCBI

NCBI

Overview NavigaAng NCBI to obtain sequences Using BLAST for sequence alignment Using other programs for specialized sequence alignment Next generaaon sequence alignment programs

BLAST BLAST = Basic Local Alignment Search Tool

BLAST BLAST = Basic Local Alignment Search Tool Used for searching a query sequence or sequences against a database or subject sequence

BLAST BLAST = Basic Local Alignment Search Tool Used for searching a query sequence or sequences against a database or subject sequence Uses a local alignment, which searches for regions in the query that locally align to subject sequences

BLAST BLAST = Basic Local Alignment Search Tool Used for searching a query sequence or sequences against a database or subject sequence Uses a local alignment, which searches for regions in the query that locally align to subject sequences BLAST uses word based heurisacs to approximate the Smith- Waterman algorithm to find the near- opamal local alignments quickly, thus gaining speed over sensiavity.

BLAST BLAST = Basic Local Alignment Search Tool Used for searching a query sequence or sequences against a database or subject sequence Uses a local alignment, which searches for regions in the query that locally align to subject sequences BLAST uses word based heurisacs to approximate the Smith- Waterman algorithm to find the near- opamal local alignments quickly, thus gaining speed over sensiavity. Two main versions: NCBI Blast and WU- BLAST

Local vs Global Alignments For sequences that are divergent, the opamal global alignment introduces gaps that can hide biologically relevant informaaon such as moafs.

BLAST The process starts by searching the sequences for exact matches of small fixed length strings from the query called words.

BLAST The process starts by searching the sequences for exact matches of small fixed length strings from the query called words. These matches are the seeds for the local alignments.

BLAST The process starts by searching the sequences for exact matches of small fixed length strings from the query called words. These matches are the seeds for the local alignments. The next step is to extend the seed by aligning the sequence unal a gap is found mismatches are inserted here.

BLAST The process starts by searching the sequences for exact matches of small fixed length strings from the query called words. These matches are the seeds for the local alignments. The next step is to extend the seed by aligning the sequence unal a gap is found mismatches are inserted here. A gapped alignment is then performed using a modified Smith- Waterman algorithm indels are added here.

BLAST The process starts by searching the sequences for exact matches of small fixed length strings from the query called words. These matches are the seeds for the local alignments. The next step is to extend the seed by aligning the sequence unal a gap is found mismatches are inserted here. A gapped alignment is then performed using a modified Smith- Waterman algorithm indels are added here. Only results scoring above a threshold (expect value or e value) are reported back to the user

BLAST Programs blastn - query DNA, subject DNA blastp- query Protein, subject Protein blastx - query DNA (6 frame translaaon) subject Protein tblastn - query protein - subject DNA (6 frame translaaon) - slow tblastx - query DNA (6 frame translaaon), subject DNA (6 frame translaaon) - very slow

Which Program Should You Use What database do you have and how sensiave does your search has to be? blastn, blastp good for idenafying sequences that are already in a database, finding local regions of similarity in closely related organisms.

Which Program Should You Use What database do you have and how sensiave does your search has to be? blastn, blastp good for idenafying sequences that are already in a database, finding local regions of similarity in closely related organisms. blastx - Use when you have a nucleoade sequence with an unknown reading frame and/or sequencing errors that would lead to frame shi`s or coding errors such as ESTs. More sensiave.

Which Program Should You Use What database do you have and how sensiave does your search has to be? blastn, blastp good for idenafying sequences that are already in a database, finding local regions of similarity in closely related organisms. blastx - Use when you have a nucleoade sequence with an unknown reading frame and/or sequencing errors that would lead to frame shi`s or coding errors such as ESTs. More sensiave. tblastn- useful for finding homologs in sequence where the frame is unknown or sequencing errors are likely to be present such as ESTs and dra` sequence.

Which Program Should You Use What database do you have and how sensiave does your search has to be? blastn, blastp good for idenafying sequences that are already in a database, finding local regions of similarity in closely related organisms. blastx - Use when you have a nucleoade sequence with an unknown reading frame and/or sequencing errors that would lead to frame shi`s or coding errors such as ESTs. More sensiave. tblastn- useful for finding homologs in sequence where the frame is unknown or sequencing errors are likely to be present such as ESTs and dra` sequence. tblastx - used to detect novel ORFs/exons. Very Slow, use as the last resort.

Performing BLAST

Performing BLAST

Performing BLAST

Output from BLAST

Output from BLAST

Output from BLAST

BLAST For addiaonal informaaon on BLAST go to hdp:// blast.ncbi.nlm.nih.gov/blast.cgi?cmd=web&page_type=blastdocs

BLAST For addiaonal informaaon on BLAST go to hdp:// blast.ncbi.nlm.nih.gov/blast.cgi?cmd=web&page_type=blastdocs Web based BLAST interfaces are designed for low throughput searching Personalized databases can be generated and subsequent BLASTs performed using the command line Using the command line BLAST, mulaple sequences can be aligned simultaneously

Ubuntu For PC users I highly recommend gegng Ubuntu This can be run as a virtual machine on your PC through programs such as VirtualBox hdps://www.virtualbox.org/

Overview NavigaAng NCBI to obtain sequences Using BLAST for sequence alignment Using other programs for specialized sequence alignment Next generaaon sequence alignment programs

EST and cdna Alignment Exonerate is a generic tool for pairwise sequence comparison (hdp:// www.ebi.ac.uk/~guy/exonerate/) and comes grid ready with the ability to chunk files directly through exonerate

EST and cdna Alignment Exonerate is a generic tool for pairwise sequence comparison (hdp:// www.ebi.ac.uk/~guy/exonerate/) and comes grid ready with the ability to chunk files directly through exonerate Exonerate can be run with many different models for gapped and ungapped alignments

EST and cdna Alignment Exonerate is a generic tool for pairwise sequence comparison (hdp:// www.ebi.ac.uk/~guy/exonerate/) and comes grid ready with the ability to chunk files directly through exonerate Exonerate can be run with many different models for gapped and ungapped alignments The est2genom model can be used for alignment of EST sequences to genomic sequence and the cdna2genome model can be used to align full length cdnas that can be flanked by UTRs

EST and cdna Alignment Exonerate is a generic tool for pairwise sequence comparison (hdp:// www.ebi.ac.uk/~guy/exonerate/) and comes grid ready with the ability to chunk files directly through exonerate Exonerate can be run with many different models for gapped and ungapped alignments The est2genom model can be used for alignment of EST sequences to genomic sequence and the cdna2genome model can be used to align full length cdnas that can be flanked by UTRs For addiaonal models see the man page online at hdp://www.ebi.ac.uk/ ~guy/exonerate/exonerate.man.html or on the command line with the - h opaon exonerate - h

EST and cdna Alignment Command line example exonerate - - query est_sequences.fasta - - querytype DNA - - target genome_sequence.fasta targedype DNA - - model est2genome showalignment no - - showvulgar no - - showtargetgff no - - ryo %qi\t%a\t%qab\t%qae\t%tab\t%tae \n minintron 10 - - maxintron 1500 >exonerate_alignment.txt

EST and cdna Alignment Command line example exonerate - - query est_sequences.fasta - - querytype DNA - - target genome_sequence.fasta targedype DNA - - model est2genome showalignment no - - showvulgar no - - showtargetgff no - - ryo %qi\t%a\t%qab\t%qae\t%tab\t%tae \n minintron 10 - - maxintron 1500 >exonerate_alignment.txt The output from this command will be a tab delimited file with the following columns for each alignment: query_id, target_id, query_start, query_end, target_start, target_end Note: using this output format the posiaons are in interbase coordinates For complete list of opaons see the man page online at hdp://www.ebi.ac.uk/~guy/ exonerate/exonerate.man.html or on the command line with the - h opaon

Whole Genome Alignment MUMmer designed for rapid alignment of enare genomes NUCmer designed for alignment of conags to another set of conags or a genome PROmer designed for alignment of species too divergent for DNA alignment using six- frame translaaon of both input sequences hdp://mummer.sourceforge.net/

Whole Genome Alignment mummer can handle mulaple reference and query sequences Command line usage mummer [opaons] <reference- file> <query- files> mummerplot [opaons] <match file> Example command line mummer mum b c genotype1.fasta genotype2.fasta > output.mums - mum finds all maximal unique matches, - b will compute forward and reverse complement matches, - c reports all match posiaons relaave to the forward starnd mummerplot x [0,275287] y [0,265111] png output.mums - x sets the x- axis range in the plot, - y sets the y- axis range in the plot, - - png outputs the plot in.png format AddiAonal opaons are available in the manual or by using the - h opaon for both mummer and mummerplot Examples for NUCmer and PROmer are available at hdp://mummer.sourceforge.net/ examples/

Whole Genome Alignment Forward MUMs are red and reverse MUMs are green. Dots on a line with a slope = 1 are unchanged, and dots on a line with a slope = - 1 are inverted. hdp://mummer.sourceforge.net/examples/

Short Sequence Alignment Vmatch is a tool that is ideal for aligning short sequences such as probes, primers, and SNPs with short context sequence This program requires a license hdp://www.vmatch.de/

Overview NavigaAng NCBI to obtain sequences Using BLAST for sequence alignment Using other programs for specialized sequence alignment Next generaaon sequence alignment programs

High Throughput Sequencing Plaxorms Illumina HiSeq 1000 and HiSeq 2000 Illumina Genome Analyzer IIx Life Sciences/Roche 454 pyrosequencing Pacific Biosciences Ion Torrent

High Throughput Sequencing HiSeq 2000 Highly parallel sequencing by synthesis Single and paired- end reads between 50 bp and 100 bp 187 million single end or 374 million paired- end reads per lane High error rate in the 3 end

Sequence File Format Read Name Sequence Quality Quality scores are in ASCII characters that are converted to Phred scores. These scores provide a likelihood that the base was called incorrectly. 10 1 in 10 chance the base call is incorrect 20 1 in 100 chance the base call is incorrect 30 1 in 1000 chance the base call is incorrect Suggest checking the quality of the reads prior to performing sequence alignments

Read Quality with the FASTX- Toolkit hdp://hannonlab.cshl.edu/fastx_toolkit/

Read Quality with the FASTX- Toolkit >fastx_quality_stats i sequence_file.fastq o stats.txt >fastx_quality_boxplot_graph.sh i stats.txt t sample1 o quality.png

Read Quality with the FASTX- Toolkit Bad Sequence Good Sequence

High Throughput Sequence Alignment TradiAonal sequence alignment algorithms cannot be scaled to align millions of reads

High Throughput Sequence Alignment TradiAonal sequence alignment algorithms cannot be scaled to align millions of reads UAlize genome indexing such as Burrows- Wheeler for ultrafast and memory- efficient alignment programs

High Throughput Sequence Alignment TradiAonal sequence alignment algorithms cannot be scaled to align millions of reads UAlize genome indexing such as Burrows- Wheeler for ultrafast and memory- efficient alignment programs Next generaaon sequence alignment algorithms are rapidly evolving to accommodate the increasing sequence throughput

Tuxedo Suite BowAe fast and quality aware short read aligner for aligning DNA and RNA sequence reads (hdp:// bowae- bio.sourceforge.net/index.shtml) TopHat fast, splice juncaon mapper for RNA- Seq reads built on the BowAe aligner (hdp:// tophat.cbcb.umd.edu/) Cufflinks assembles transcripts, esamates their abundances, and test for differenaal expression and regulaaon using the alignments from BowAe and TopHat

BowAe Available for Windows, Mac OS X, Linux, and Solaris

BowAe Available for Windows, Mac OS X, Linux, and Solaris Not a general- purpose alignment tool like MUMmer, BLAST, or Vmatch Ideal for aligning short reads to large genomes

BowAe Available for Windows, Mac OS X, Linux, and Solaris Not a general- purpose alignment tool like MUMmer, BLAST, or Vmatch Ideal for aligning short reads to large genomes Forms the basis for TopHat, Cufflinks, Crossbow, and Myrna

BowAe Available for Windows, Mac OS X, Linux, and Solaris Not a general- purpose alignment tool like MUMmer, BLAST, or Vmatch Ideal for aligning short reads to large genomes Forms the basis for TopHat, Cufflinks, Crossbow, and Myrna Online manual (hdp://bowae- bio.sourceforge.net/manual.shtml) is very helpful to understand the opaons that are available

BowAe BowAe Index To print a help screen with the opaonal parameters bowae- build h Command line usage bowae- build [opaons]* <reference_in> <ebwt_base> Example command line bowae- build chr1.fa,chr2.fa,chr3.fa,chr4.fa test_index AddiAonal opaons can be used to improve performance

BowAe BowAe Index To print a help screen with the opaonal parameters bowae- build h Command line usage bowae- build [opaons]* <reference_in> <ebwt_base> Example command line bowae- build chr1.fa,chr2.fa,chr3.fa,chr4.fa test_index AddiAonal opaons can be used to improve performance Alignment with BowAe To print a help screen with the opaonal parameters bowae h Command line usage bowae [opaons]* <ebwt> {- 1 <m1> - 2 <m2> - - 12 <r> <s>} [<hit>] Example command line for single end reads with our test_index BowAe test_index - - solexa1.3- quals test_index sequence_file.fastq > output_file

BowAe AddiAonal OpAons: Quality aware MAQ- like mode is the default, BowAe can also be run in a non- quality aware SOAP- like mode BowAe test_index - - solexa1.3- quals v 2 test_index sequence_file.fastq > output_file - v 2 is specifying that there can only be two mismatches in an alignment - a report all valid alignments - k 2 report up to 2 valid alignments - - best report the best alignment - m 2 do not report any alignments for reads with greater than 2 reportable alignments - S print alignments in SAM format which is compaable with SAMtools for subsequent variant calling and manipulaaon of the alignments Many other opaons available

TopHat Available for Linux and OS X

TopHat Available for Linux and OS X Built on BowAe and uses the same genome index

TopHat Available for Linux and OS X Built on BowAe and uses the same genome index Used for alignment of RNA- Seq reads to a genome

TopHat Available for Linux and OS X Built on BowAe and uses the same genome index Used for alignment of RNA- Seq reads to a genome OpAmized for paired- end, Illumina sequence reads >70bp

TopHat Available for Linux and OS X Built on BowAe and uses the same genome index Used for alignment of RNA- Seq reads to a genome OpAmized for paired- end, Illumina sequence reads >70bp Online manual (hdp://tophat.cbcb.umd.edu/ manual.html) is very helpful to understand the opaons that are available

TopHat

TopHat Alignment with TopHat To print a help screen with the opaonal parameters tophat h Command line usage tophat [opaons]* <index_base> <reads1_1[,...,readsn_1]> [reads1_2,...readsn_2] Example command line for single end reads with our test_index from before Tophat o output_directory - - solexa1.3- quals test_index sequence_file.fastq AddiAonal opaons - - max- intron size (default is 500kb) - - min- intron size (default is 70bp) - - max- mulahits (default is 40) - - mate- inner- dis (required for paired- end alignment mode) - - num_threads N (runs on N CPUs)

TopHat Wiggle tracks can be generated from the TopHat output file accepted_hits.bam Convert to SAM file samtools view h o accepted_hits.sam accepted_hits.bam Generate wiggle file wiggles accepted_hits.sam coverage.wig Wiggle tracks are viewable in programs such as Integrated Genome Browser (hdp://bioviz.org/igb/)

AddiAonal Alignment Programs Nature Biotechnology 27, 455-457 (2009)

Valuable Resources

QuesAons