Pairwise Sequence Alignment




Pairwise Sequence Alignment carolin.kosiol@vetmeduni.ac.at SS 2013

Outline
Pairwise sequence alignment
global: Needleman-Wunsch / Gotoh algorithm
local: Smith-Waterman algorithm
BLAST: heuristics

What is a Sequence Alignment?
Quite simply, the comparison of two or more DNA or protein sequences with each other. The purpose of alignment is to highlight similarity between sequences. Alignment is the procedure of writing two (or more) sequences so that a maximum of identical or similar characters are placed in the same column, by inserting gaps ("-") where needed.

Word Alignment
Unaligned:
Species 1: SOMEONE
Species 2: AWESOME
Aligned:
Species 1: ---SOMEONE
Species 2: AWESOME---

Less trivial
Species 1: ACGTTAGA
Species 2: CGTTGAA

Scoring: gap = -1, match = 1

Species 1: -------ACGTTAGA
Species 2: CGTTGAA--------
score: -15

Species 1: ACGTTAGA-
Species 2: -CGTT-GAA
score: 3
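The two scores can be checked column by column. The sketch below is illustrative only: it assumes a mismatch score of 0 (the slide does not state one) and assumes the second alignment carries a trailing gap in Species 1, so that both rows have nine columns. The function name is hypothetical.

```python
def score_alignment(row1, row2, match=1, mismatch=0, gap=-1):
    """Score two aligned rows column by column.

    Assumption: mismatch = 0, which the slide does not specify.
    """
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have the same length")
    total = 0
    for a, b in zip(row1, row2):
        if a == "-" or b == "-":
            total += gap        # a column containing a gap
        elif a == b:
            total += match      # identical residues
        else:
            total += mismatch   # different residues
    return total


# The shifted alignment: every one of the 15 columns contains a gap.
shifted = score_alignment("-------ACGTTAGA", "CGTTGAA--------")
# The gapped alignment: 6 matching columns and 3 gap columns.
gapped = score_alignment("ACGTTAGA-", "-CGTT-GAA")
print(shifted, gapped)  # -15 3
```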

FASTA Format - Input
Standard input format for alignment programs:
>Name1
ASEQUENCE1
>Name2 comments
SEQUCE2
Strictly speaking, the input should not contain gaps.

FASTA Format - Output
Increasingly, multiple alignments are returned in a FASTA-like format:
>Name1
ASEQUENCE1
>Name2 comments
-SEQU--CE2
etc.
The order of sequences in the output may differ from the input.
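Reading this format takes only a few lines. The following is a minimal sketch, not a full parser: it assumes the first token after ">" is the name and the rest of the header line is a comment, and the function name is hypothetical.

```python
def parse_fasta(text):
    """Return a list of (name, sequence) pairs from FASTA-formatted text."""
    records = []
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            header = line[1:].split()
            # Assumption: first token is the name, the rest is a comment.
            name = header[0] if header else ""
            chunks = []
        elif name is not None:
            chunks.append(line)  # sequences may span several lines
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


aligned = parse_fasta(">Name1\nASEQUENCE1\n>Name2 comments\n-SEQU--CE2\n")
# Removing the gap characters recovers the unaligned input sequences.
ungapped = [(name, seq.replace("-", "")) for name, seq in aligned]
print(aligned)
print(ungapped)
```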

Relatedness of residues in same column
Making these alignments is EASY when we know where and which evolutionary events occurred. Usually we do not know, and must infer them.

Quiz Which alignment (X, Y or Z) shows only residues related by substitution events in the same column?

Types of alignment methods
We cannot enumerate all possible alignments. Approaches are:
Dot matrix
Dynamic programming
Word-based or k-tuple methods (database searches)

Dot Matrix Given two

In a dot matrix we can identify:
Existing alignable parts of sequences
Possible indels
Duplicated sequences and repeats
Self-complementarity
Gene-order differences among genomes
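A dot matrix is simple to build: put one sequence along the rows, the other along the columns, and mark every cell where the two symbols are equal. The sketch below uses no window or threshold filtering, which real dot plot tools usually add to suppress noise; the function names are hypothetical.

```python
def dot_matrix(seq_a, seq_b):
    """Boolean matrix: cell [i][j] is True where seq_a[i] == seq_b[j]."""
    return [[a == b for b in seq_b] for a in seq_a]


def render(seq_a, seq_b):
    """Draw the dot matrix as text, '*' for a match and '.' otherwise."""
    matrix = dot_matrix(seq_a, seq_b)
    lines = ["  " + " ".join(seq_b)]
    for a, row in zip(seq_a, matrix):
        lines.append(a + " " + " ".join("*" if hit else "." for hit in row))
    return "\n".join(lines)


# Diagonal runs of '*' mark stretches that can be aligned without gaps.
plot = render("ACGTTAGA", "CGTTGAA")
print(plot)
```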

Dot plots

a) A continuous main diagonal shows perfect similarity for symbols with the same indices.
b) Parallels to the main diagonal indicate repeated regions in the same reading direction on different parts of the sequences. In this case a region D is found twice in the sequence (D1, D2).
c) Lines perpendicular to the main diagonal indicate palindromic areas. In this case the sequence is completely palindromic in the displayed area.
d) Partially palindromic sequence. (For DNA sequences this refers to a perfect match of the normal strand with its reverse complement, which is frequently found in many transposable elements.)
e) Bold blocks on the main diagonal indicate repetition of the same symbol in both sequences, e.g. (G)50, so-called microsatellite repeats.
f) Parallel lines indicate tandem repeats of a larger motif in both sequences, e.g. (AGCTCTGAC)20, so-called minisatellite patterns. The distance between the diagonals equals the length of the motif.
g) A discontinuous diagonal indicates that the sequences T1 and T2 share a common source. In text analyses we may be dealing with plagiarism; in DNA analyses the sequences may be homologous because of a common ancestor. The number of interruptions increases with modifications of the text, or with the time of independent evolution and the mutation rate.
h) Indels: these can often be observed for many different types of domains that were lost or substituted during evolution.

Aligning a pair of sequences (Dynamic Programming)
Scoring: gap = -15, match = +10, mismatch = 0
Aim: get from one corner of the table to the other
Moves have a cost
Choose the cheapest way
Fill in the table
Trace the route backwards to find the alignment
A G G G A - - G C
A G G G T T T G C

Needleman-Wunsch Algorithm
Initialize an NxM matrix with the sequences A and B of length N and M. Starting at the top left corner set the intermediate scoring value =
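The fill-in and traceback steps can be sketched as follows. Since the recurrence on the slide is cut off, this uses the standard textbook recurrence with a linear gap cost, which is an assumption rather than the lecture's exact formulation. The default scores are the ones given above (gap = -15, match = +10, mismatch = 0); the function name is hypothetical.

```python
def needleman_wunsch(a, b, match=10, mismatch=0, gap=-15):
    """Global alignment with a linear gap cost (textbook recurrence)."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,      # gap in b
                score[i][j - 1] + gap,      # gap in a
            )
    # Trace the route backwards from the bottom right corner.
    row_a, row_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            row_a.append(a[i - 1]); row_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            row_a.append(a[i - 1]); row_b.append("-"); i -= 1
        else:
            row_a.append("-"); row_b.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(row_a)), "".join(reversed(row_b))


best, top, bottom = needleman_wunsch("ACGT", "AGT")
print(best)    # 15
print(top)     # ACGT
print(bottom)  # A-GT
```

When several alignments share the best score, this traceback prefers the diagonal move; other tie-breaking rules give equally valid alignments.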

Substitution matrices
Some amino acids are more similar than others. Adjust the cost according to a similarity matrix, e.g. BLOSUM62:
Leu -> Leu: 4
Leu -> Met: 2
Leu -> Pro: -3
etc.
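In code, a substitution matrix is just a lookup table keyed by residue pairs. The sketch below contains only the three values listed above, not the full BLOSUM62 matrix, and the names are hypothetical.

```python
# Excerpt only: the three values quoted above (one-letter codes L, M, P).
BLOSUM62_EXCERPT = {
    ("L", "L"): 4,
    ("L", "M"): 2,
    ("L", "P"): -3,
}


def substitution_score(a, b, matrix=BLOSUM62_EXCERPT):
    """Look up a pair score; the matrix is symmetric, so try both orders."""
    if (a, b) in matrix:
        return matrix[(a, b)]
    if (b, a) in matrix:
        return matrix[(b, a)]
    raise KeyError(f"no score for pair {a}/{b} in this excerpt")


def score_ungapped(seq_a, seq_b, matrix=BLOSUM62_EXCERPT):
    """Sum the pair scores of two equal-length sequences without gaps."""
    return sum(substitution_score(a, b, matrix) for a, b in zip(seq_a, seq_b))


total = score_ungapped("LLL", "LMP")
print(total)  # 4 + 2 - 3 = 3
```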

Gap penalties
Gaps tend to occur together, so a single per-position penalty is unrealistic: a gap of length three should not cost three times as much as a gap of length one.
Use an affine gap cost: make extending an already existing gap cheaper than opening a new one.
Gap opening (G) / gap extension (E)
Total cost for a gap of length x: G + x E
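A short numeric comparison shows the effect. The values G = 10 and E = 1 are made up for illustration, and the formula is the one from the slide (G + x E); some programs instead charge G + (x - 1) E.

```python
def linear_gap_cost(length, per_position):
    """Every gap position costs the same."""
    return length * per_position


def affine_gap_cost(length, gap_open, gap_extend):
    """Cost G + x*E for a gap of length x, as on the slide."""
    if length <= 0:
        return 0
    return gap_open + length * gap_extend


# Hypothetical values: G = 10, E = 1; the linear cost per position is
# chosen so that a gap of length 1 costs the same under both schemes.
for x in (1, 2, 3):
    print(x, linear_gap_cost(x, 11), affine_gap_cost(x, 10, 1))
```

With these numbers a gap of length three costs 33 under the linear scheme but only 13 under the affine one.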

Global vs Local Alignment Global: Find the best overall alignment between sequences. Local: Find short regions of highly conserved sequence.

Global vs Local
Unaligned:
Species 1: SOMEONE
Species 2: AWESOME
Global:
Species 1: ---SOMEONE
Species 2: AWESOME---
Local:
Species 1: SOME
Species 2: SOME

Smith-Waterman Algorithm
Instead of looking at each sequence in its entirety, this compares segments of all possible lengths (LOCAL alignments) and chooses whichever maximizes the similarity.
For every cell the algorithm considers ALL possible paths leading to it. These paths can be of any length and can contain insertions and deletions.
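A minimal sketch of the local variant follows. Compared with global alignment, cell scores are never allowed to drop below zero, and the traceback starts at the highest-scoring cell and stops at the first zero. The scores used here (match = 1, mismatch = -1, gap = -1) are assumptions chosen for the example, and the function name is hypothetical.

```python
def smith_waterman(a, b, match=1, mismatch=-1, gap=-1):
    """Local alignment with a linear gap cost (illustrative scores)."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                0,                          # start a new local alignment
                score[i - 1][j - 1] + sub,
                score[i - 1][j] + gap,
                score[i][j - 1] + gap,
            )
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)
    # Trace back from the best cell until a zero is reached.
    row_a, row_b = [], []
    i, j = best_pos
    while i > 0 and j > 0 and score[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + sub:
            row_a.append(a[i - 1]); row_b.append(b[j - 1]); i -= 1; j -= 1
        elif score[i][j] == score[i - 1][j] + gap:
            row_a.append(a[i - 1]); row_b.append("-"); i -= 1
        else:
            row_a.append("-"); row_b.append(b[j - 1]); j -= 1
    return best, "".join(reversed(row_a)), "".join(reversed(row_b))


local_score, local_a, local_b = smith_waterman("SOMEONE", "AWESOME")
print(local_score, local_a, local_b)  # 4 SOME SOME
```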

Calculating significance
We have calculated the optimal alignment: the alignment with the best score, whether the sequences are related or not. Call this the maximum segment pair (MSP).
How many MSPs with at least the same score do we expect by chance?

Calculating significance
We make use of the extreme value distribution (EVD) to calculate the number of alignments between random sequences that we expect with our score or better. This is known as the e-value:
$E(S) = K m n \, e^{-\lambda S}$
$K$ and $\lambda$ = scaling parameters calculated based on the search space ($K$) and the scoring scheme ($\lambda$)
$m$, $n$ = size of the search space (the lengths of the query and the database)
The probability of finding at least one match with our score (the p-value) is $1 - e^{-E(S)}$.
As both the e-value and the p-value decrease, the significance of the alignment increases.
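The two quantities are easy to compute once K and lambda are known. In the sketch below the values K = 0.1 and lambda = 0.3 are placeholders chosen for illustration, not parameters of any real scoring scheme; the function names are hypothetical.

```python
import math


def expected_hits(score, m, n, k, lam):
    """E-value: E(S) = K * m * n * exp(-lambda * S)."""
    return k * m * n * math.exp(-lam * score)


def p_value(e_value):
    """Probability of at least one chance hit: 1 - exp(-E)."""
    # expm1 keeps precision when E is very small.
    return -math.expm1(-e_value)


# Placeholder parameters (assumptions): K = 0.1, lambda = 0.3.
e_small_db = expected_hits(score=50, m=100, n=1_000_000, k=0.1, lam=0.3)
e_large_db = expected_hits(score=50, m=100, n=2_000_000, k=0.1, lam=0.3)
print(e_small_db, p_value(e_small_db))
print(e_large_db, p_value(e_large_db))
```

Doubling the database size doubles the e-value, while a higher score lowers it exponentially. For small E the p-value is almost equal to the e-value.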

BLAST
Basic Local Alignment Search Tool: used to find local sequence alignments between protein or nucleotide sequences (Altschul et al., 1990, cited over 43,000 times).
It is a heuristic, so it returns an approximate best match (Smith-Waterman guarantees the optimum).
It calculates the high scoring matches instead of the maximum scoring matches (HSP instead of MSP).

BLAST
Split the query into overlapping words (... 28, we will look at 4):
GTTCACATCATCCTGC
GTTC
TTCA
TCAC
CACA
ACAT
CATC
ATCA
...

BLAST
For each word, list the similar words (... on scoring matrices); you could call this the neighborhood:
GTTCACATCATCCTGC
GTTC: CTTC, GTTC, GATC...
TTCA: TTCT, TTGA, TTGT...
TCAC: AGAC, CCAC, TCTG...
CACA: ...
ACAT: ...
CATC: ...
ATCA: ...
...
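Word splitting and neighborhood generation can be sketched in a few lines. This is an illustration of the idea, not BLAST's actual implementation: the match/mismatch scores and the threshold are assumptions, and the function names are hypothetical.

```python
from itertools import product


def words(query, k=4):
    """All overlapping words of length k in the query."""
    return [query[i:i + k] for i in range(len(query) - k + 1)]


def neighborhood(word, threshold, match=1, mismatch=-1, alphabet="ACGT"):
    """All words of the same length scoring at least `threshold` against word."""
    result = []
    for candidate in product(alphabet, repeat=len(word)):
        s = sum(match if a == b else mismatch for a, b in zip(word, candidate))
        if s >= threshold:
            result.append("".join(candidate))
    return result


query_words = words("GTTCACATCATCCTGC")
print(query_words[:7])

# With these scores a threshold of 2 admits the word itself (score 4)
# and every word that differs in exactly one position (score 2).
near = neighborhood("GTTC", threshold=2)
print(len(near), near)
```

Lowering the threshold enlarges the neighborhood, which makes the search more sensitive but slower.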

BLAST
Calculate E-values: the expectation that you would get that alignment by chance given the database of sequences.
Return significant results.
We already talked about these e-values and p-values under Smith-Waterman significance.

BLAST Types:
Nucleotide vs Nucleotide: blastn
Protein vs Protein: blastp
Translated Nucleotide vs Protein: blastx
Protein vs Translated Nucleotide: tblastn
Translated Nucleotide vs Translated Nucleotide database: tblastx

DNA vs protein
Should you use blastn or blastp?
There are four possible nucleotides (A, C, G, T) and therefore four potential states.
There are 20 standard amino acids and therefore 20 potential states.
blastp should be more sensitive than blastn, because the larger state space lowers the chance of a random hit.
If there is the possibility of highly similar sequences, DNA works well:
intergenic spacers
RNA genes

Things to consider
Nothing is "90% homologous": sequences are either homologous or not, although there may be a degree of belief in homology.
Statistical significance depends on the size of the alignments and of the database:
e-value increases as the database gets bigger (more chance for a random hit)
e-value decreases as alignments get longer (the longer the alignment, the more significant)

Therefore
Sequence similarity can suggest homology.
A significant alignment over the length of both sequences strongly suggests homology.
Homologous sequences do not always produce significant alignments!
Regions with low complexity (that are not cleaned out by the initial steps in BLAST) can produce significant alignments with no homology.

Rules
There are no hard and fast rules.
Nucleotides: it has been suggested that sequence identity of more than 70% suggests homology; e-values of 10^-6 or less. Too bad.
Proteins: 25% or more sequence identity; e-values of 10^-3 or less. Nope.
You have to verify somehow, and if you are high throughput, there will be errors.
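Percent identity itself is straightforward to compute from an alignment, which is part of why it is tempting as a rule of thumb. The sketch below divides by the full alignment length including gap columns; this is one of several conventions in use, so the choice is an assumption, and the function name is hypothetical.

```python
def percent_identity(row1, row2):
    """Percentage of alignment columns holding two identical residues.

    Assumption: the denominator is the alignment length, gaps included.
    """
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have the same length")
    if not row1:
        return 0.0
    identical = sum(1 for a, b in zip(row1, row2) if a == b and a != "-")
    return 100.0 * identical / len(row1)


# 6 identical columns out of 9 in the gapped DNA example.
identity = percent_identity("ACGTTAGA-", "-CGTT-GAA")
print(round(identity, 1))  # 66.7
```

Under the 70% suggestion above this pair would fall just short, which illustrates why such cut-offs should be treated as hints rather than rules.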

Next
We will go over some examples in lab:
Needleman-Wunsch
BLAST