Sequence Alignment Young-Rae Cho


BINF 3350, Genomics and Bioinformatics
Sequence Alignment
Young-Rae Cho, Associate Professor, Department of Computer Science, Baylor University

BINF 3350, Chapter 4, Sequence Alignment
1. Sequence Alignment
2. Dynamic Programming
3. Scoring Alignments
4. Gap Penalty
5. Global vs. Local Alignment
6. Pairwise vs. Multiple Sequence Alignment
7. Sequence Homolog Search
8. Motif Search

Sequence Homology
- Homologs: similar sequence and common ancestor; similar sequence and same function (in divergent evolution)
  - Orthologs: homologous sequences in different species, arising by species divergence
  - Paralogs: homologous sequences in the same species, arising by gene duplication
- Analogs: similar sequence and no common ancestor (in convergent evolution)

Sequence Similarity
- Importance of finding similar (DNA or protein) sequences
  - Evolutionary closeness: relationship between sequences and evolution
  - Functional similarity: relationship between sequences and functions
- How to measure sequence similarity
  - (Method 1) Counting identical letters at each position
      A T G T T A T
      T C G T A C T
  - (Method 2) Inserting gaps to maximize the number of identical letters: sequence alignment

Sequence Alignment
- Sequence alignment: aligning two or more sequences, including gaps, to maximize their similarity
- How to find a sequence alignment?
  (1) Measuring edit distance

Edit Distance (1)
- Definition: the edit distance between two sequences x and y is the minimum number of editing operations (insertion, deletion, substitution) needed to transform x into y
- Example: x = TGCATAT (m = 7), y = ATCCGAT (n = 7)
    TGCATAT
    ATGCATAT     insertion of A
    ATCCATAT     substitution of G with C
    ATCCGATAT    insertion of G
    ATCCGATT     deletion of A
    ATCCGAT      deletion of T
  Edit distance = 5?

Edit Distance (2)
- Example: x = TGCATAT (m = 7), y = ATCCGAT (n = 7)
    TGCATAT
    ATGCATAT     insertion of A
    ATGCAAT      deletion of T
    ATCCAAT      substitution of G with C
    ATCCGAT      substitution of A with G
  Edit distance = 4? Can it be done in 3 steps?
- How to measure edit distance efficiently?

Edit Distance (3)
- Example in 2-row representation: x = ATCTGATG (m = 8), y = TGCATAC (n = 7)
  - Alignment 1: 4 matches, 4 insertions, 3 deletions
  - Alignment 2: 4 matches, 3 insertions, 2 deletions, 1 substitution
- Edit distance = #insertions + #deletions + #substitutions
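The question of how to measure edit distance efficiently can be answered with a small dynamic-programming table. The following is a minimal sketch, not the course's own code: the function name `edit_distance` and the unit cost for every operation are assumptions made for illustration.

```python
def edit_distance(x: str, y: str) -> int:
    """Minimum number of insertions, deletions and substitutions turning x into y."""
    m, n = len(x), len(y)
    # dist[i][j] = edit distance between the prefixes x[:i] and y[:j]
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dist[i][0] = i  # delete all i letters of x[:i]
    for j in range(n + 1):
        dist[0][j] = j  # insert all j letters of y[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            # A substitution costs 0 when the two letters already match.
            substitution = dist[i - 1][j - 1] + (x[i - 1] != y[j - 1])
            deletion = dist[i - 1][j] + 1
            insertion = dist[i][j - 1] + 1
            dist[i][j] = min(substitution, deletion, insertion)
    return dist[m][n]
```

Calling `edit_distance("TGCATAT", "ATCCGAT")` is one way to check the "Can it be done in 3 steps?" question from the slide. The table has (m + 1) × (n + 1) cells, so the running time grows with the product of the two sequence lengths.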

Hamming Distance vs. Edit Distance
- Hamming distance
  - Compares the letters at the same position in two sequences
  - Not good for measuring evolutionary distance between DNA sequences
- Edit distance
  - Compares the letters of two sequences after inserting gaps
  - Allows comparison of two sequences of different lengths
  - Good for measuring evolutionary distance between DNA sequences
- Example: x = ATATATAT, y = TATATATA
  - Hamming distance between x and y?
  - Edit distance between x and y?

Sequence Alignment
- Sequence alignment: aligning two or more sequences, including gaps, to maximize their similarity
- How to find a sequence alignment?
  (1) Measuring edit distance
  (2) Finding the longest common subsequence
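To make the Hamming side of the comparison concrete, here is a short sketch. The function name `hamming_distance` is an illustrative choice, and the sketch assumes both sequences have the same length, since the position-by-position comparison is otherwise undefined.

```python
def hamming_distance(x: str, y: str) -> int:
    """Number of positions at which two equal-length sequences differ."""
    if len(x) != len(y):
        raise ValueError("Hamming distance needs sequences of equal length")
    # Compare the letters position by position, with no gaps allowed.
    return sum(a != b for a, b in zip(x, y))
```

For x = ATATATAT and y = TATATATA every position differs, so the Hamming distance is 8, even though y is just x shifted by one letter. Deleting the leading A of x and appending an A at the end turns x into y, so the edit distance is only 2.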

Longest Common Subsequence (1)
- Subsequence of x: an ordered sequence of letters from x, not necessarily consecutive
  e.g., x = ATTGCTA: AGCA? TCG? ATCT? TGAT?
- Common subsequence of x and y
  e.g., x = ATCTGAT and y = TGCATA: TCTA? TGAT? TATA?
- Longest common subsequence (LCS) of x and y?

Longest Common Subsequence (2)
- Example: x = ATCTGATG (m = 8), y = TGCATAC (n = 7)
  - LCS of x and y? (2-row representation)
- How to find the LCS efficiently?

Sequence Alignment
- Sequence alignment: aligning two or more sequences, including gaps, to maximize their similarity
- How to find a sequence alignment?
  (1) Measuring edit distance
  (2) Finding the longest common subsequence
- Dynamic programming

Dynamic Programming
- Definition
  - An algorithm that solves a complex problem by breaking it down into simpler subproblems
  - The result of one subproblem is used to solve the next subproblem
- Features
  - Optimization: finding an optimal solution
  - Saving memory space
- Examples
  - Binary search tree
  - Sequence alignment

Dynamic Programming for Sequence Alignment
- Edit graph
  - 2-D grid structure with a diagonal edge at each position where the two letters are the same
  - Weight the diagonal edges as 1
  - Weight the other edges as 0
- Goal: finding the strongest path from source to sink
- Algorithm
  (1) Compute the max score for each node (the max score is the max count of identical letters from the source to that node)
  (2) When reaching the sink, trace backward to find the LCS
[Figure: edit graph with ATCGTAC as columns and ATGTTAT as rows, from source to sink]

Sequence Alignment Example
- Example: max score at each node, from source (top left) to sink (bottom right)

          A  T  C  G  T  A  C
       0  0  0  0  0  0  0  0
    A  0  1  1  1  1  1  1  1
    T  0  1  2  2  2  2  2  2
    G  0  1  2  2  3  3  3  3
    T  0  1  2  2  3  4  4  4
    T  0  1  2  2  3  4  4  4
    A  0  1  2  2  3  4  5  5
    T  0  1  2  2  3  4  5  5

- Sequence alignment of ATCGTAC and ATGTTAT
[Figure: resulting alignment]

Quiz
- Example: X = ATGCGT, Y = AGACAT
- Sequence alignment?
[Figure: blank edit graph with ATGCGT as columns and AGACAT as rows, from source to sink]
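The score table and the backward trace can be sketched in a few lines. This is a minimal illustration under the edit-graph weights given earlier (diagonal edges on matching letters weigh 1, all other edges weigh 0); the function name `lcs` is an assumption, and when several longest common subsequences exist the sketch returns only one of them.

```python
def lcs(x: str, y: str) -> str:
    """Return one longest common subsequence of x and y."""
    m, n = len(x), len(y)
    # score[i][j] = max count of identical letters from the source to node (i, j)
    score = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if x[i - 1] == y[j - 1]:
                score[i][j] = score[i - 1][j - 1] + 1  # diagonal edge, weight 1
            else:
                score[i][j] = max(score[i - 1][j], score[i][j - 1])  # weight 0
    # Trace backward from the sink, collecting the letters on diagonal edges.
    letters = []
    i, j = m, n
    while i > 0 and j > 0:
        if x[i - 1] == y[j - 1]:
            letters.append(x[i - 1])
            i, j = i - 1, j - 1
        elif score[i - 1][j] >= score[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(letters))
```

For ATGTTAT and ATCGTAC the returned subsequence has length 5, which agrees with the value at the sink of the table above. The same function can be used to check an answer to the quiz.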

Scoring Alignments: Percent Identity (1)
- Identity: degree of identical matches between sequences
- Percent identity: percentage of identical matches
- Dot-plot representation: a method for visualizing identity
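Percent identity can be computed directly from an alignment. The sketch below is an illustration under stated assumptions: the two inputs are already aligned rows of equal length, `-` marks a gap, and the denominator is the number of alignment columns (other conventions, such as dividing by the shorter sequence length, are also in use). The name `percent_identity` is hypothetical.

```python
def percent_identity(aligned_x: str, aligned_y: str, gap: str = "-") -> float:
    """Percentage of alignment columns holding two identical, non-gap letters."""
    if len(aligned_x) != len(aligned_y) or not aligned_x:
        raise ValueError("expected two non-empty aligned rows of equal length")
    identical = sum(
        a == b and a != gap for a, b in zip(aligned_x, aligned_y)
    )
    # Assumption: the denominator is the total number of alignment columns.
    return 100.0 * identical / len(aligned_x)
```

For example, the rows `AT-C` and `ATGC` share three identical columns out of four, which gives 75.0.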

Scoring Alignments: Percent Identity (2)
- Dot-plot representation of self-alignment
- The background noise can be removed by setting a threshold on the minimum identity score within a fixed window

Scoring Alignments: Percent Similarity
- Percent similarity
  - Percentage of amino acid pairs with similar biochemical structure (protein)
  - Percentage of nucleotide pairs with similar biochemical structure (DNA)
- Advanced scoring schemes
  - Varying scores by similarity of biochemical structure
  - Penalties (negative scores) for strong mismatches
  - Relative likelihood of an evolutionary relationship
  - Probability of mutations
- Minimum acceptance score
  - More than 30% sequence identity: 90% of such sequence pairs are homologs
  - 20~30% sequence identity: twilight zone
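The dot-plot noise filter mentioned at the top of this slide can be sketched as follows. This is only an illustration: the function name `dot_plot` and the parameters `window` and `min_identity` are assumptions, and the sketch compares windows of exact letters rather than using a substitution matrix.

```python
def dot_plot(x: str, y: str, window: int = 1, min_identity: int = 1):
    """Return (i, j) positions where the windows starting at x[i] and y[j]
    contain at least min_identity identical letters."""
    dots = []
    for i in range(len(x) - window + 1):
        for j in range(len(y) - window + 1):
            # Count identical letters inside the fixed-size window.
            matches = sum(x[i + k] == y[j + k] for k in range(window))
            if matches >= min_identity:
                dots.append((i, j))
    return dots
```

With `window=1` every single matching letter produces a dot, which is the noisy plot. Raising `window` and `min_identity` keeps only positions where a whole stretch agrees, so the diagonal of a self-alignment stands out.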

Substitution Matrices (1)
- Substitution matrix: score matrix among nucleotides or amino acids
  - 4 × 4 array for DNA sequences, or (4+1) × (4+1) array
  - 20 × 20 array for protein sequences, or (20+1) × (20+1) array
  - Entry δ(i,j) holds the score between i and j, i.e., the rate at which i is substituted with j over time

Substitution Matrices (2)
- PAM (Point Accepted Mutations)
  - For protein sequence alignment
  - Amino acid substitution frequency in mutations
  - Logarithmic matrix of mutation probabilities
  - PAM120: results from 120 mutations per 100 residues
  - PAM120 vs. PAM240
- BLOSUM (Block Substitution Matrix)
  - For protein sequence alignment
  - Applied for local sequence alignments
  - Substitution frequencies between clustered groups
  - BLOSUM-62: results with a threshold (cut-off) of 62% identity
  - BLOSUM-62 vs. BLOSUM-50

Substitution Matrices (3)
- Substitution matrix examples
  - BLOSUM-62
  - PAM120

Theory of Scoring Alignments
- Random model
- Non-random model
- Odds ratio
  - Odds ratio for each position
  - Odds ratio for the entire alignment
- Log-odds ratio (a score in a substitution matrix)
- Expected score
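The slide lists these terms without formulas, so the sketch below uses the standard textbook formulation as an assumption: the non-random model gives the probability q_ab of seeing residues a and b aligned, the random model gives p_a · p_b, and the score is the logarithm of their ratio. The function names and the numbers in the usage note are illustrative only, not real substitution data.

```python
import math

def log_odds_score(pair_freq: float, freq_a: float, freq_b: float,
                   base: float = 2.0) -> float:
    """log of the odds ratio q_ab / (p_a * p_b) for one aligned pair."""
    odds_ratio = pair_freq / (freq_a * freq_b)  # non-random vs. random model
    return math.log(odds_ratio, base)

def alignment_log_odds(columns) -> float:
    """Sum of per-position log-odds scores.

    columns is a list of (pair_freq, freq_a, freq_b) tuples. Summing logs
    corresponds to multiplying the per-position odds ratios.
    """
    return sum(log_odds_score(q, pa, pb) for q, pa, pb in columns)
```

With made-up frequencies q_ab = 0.04 and p_a = p_b = 0.1, the pair is four times more likely under the non-random model, so the base-2 score is about 2. A pair seen less often than chance gets a negative score.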

Gap Penalty (1)
- Gaps
  - Contiguous sequence of spaces in one of the aligned sequences
  - Gaps are inserted as the result of insertions and deletions (indels)
- Gap penalties
  - High penalties vs. low penalties
  - Fixed penalties vs. flexible penalties depending on residues
  - No penalty on start gaps and end gaps
- Finding the optimal number of gaps for the best score in sequence alignment: dynamic programming

Gap Penalty (2)
- Examples of high penalties and low penalties

Affine Gap Penalty (1)
- Motivation
  - -σ for 1 gap (insertion or deletion)
  - -2σ for 2 consecutive gaps (insertions or deletions)
  - -3σ for 3 consecutive gaps (insertions or deletions), etc.
  - Too severe a penalty for a series of 100 consecutive gaps
- Example
  - x = ATAGC, y = ATATTGC (single event)
  - x = ATAGGC, y = ATGTGC

Affine Gap Penalty (2)
- Linear gap penalty: score for a gap of length x is -σx
- Constant gap penalty: score for a gap of length x is -ρ
- Affine gap penalty: score for a gap of length x is -(ρ + σx)
  - ρ: gap opening penalty, σ: gap extension penalty (typically ρ > σ)
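The three schemes differ only in how the score depends on the gap length. A minimal sketch, with hypothetical function names and made-up penalty values in the usage note:

```python
def linear_gap_score(length: int, sigma: float) -> float:
    """Score -sigma * x for a gap of length x."""
    return -sigma * length

def constant_gap_score(length: int, rho: float) -> float:
    """Score -rho for any gap, regardless of its length."""
    return -rho if length > 0 else 0.0

def affine_gap_score(length: int, rho: float, sigma: float) -> float:
    """Score -(rho + sigma * x): opening penalty plus extension penalty."""
    return -(rho + sigma * length) if length > 0 else 0.0
```

Taking a gap of 100 letters as in the motivation slide, and assuming σ = 1 for the linear scheme, the score is -100. With an affine scheme and assumed values ρ = 5 and σ = 0.5, the same gap scores -55, so one long indel event is punished less than 100 independent ones.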

Global vs. Local Alignment
- Global alignment
  - Finding a sequence alignment across the whole length of the sequences
  - Dynamic programming (Needleman-Wunsch algorithm)
- Local alignment
  - Finding significant similarity in a part of the sequences
  - Dynamic programming (Smith-Waterman algorithm)
- Example
    x = TCAGTGTCGAAGTTA
    y = TAGGCTAGCAGTGTA
[Figure: global alignment and local alignment of x and y]

Local Alignment Example
- Local alignment: applied to multi-domain protein sequences
- Protein domain
  - Basic functional block
  - Evolutionarily conserved
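The two algorithms share the same table-filling idea and differ mainly in initialization and in whether a cell may fall back to zero. The sketch below computes scores only (no traceback) and assumes a simple scheme that is not taken from the slides: +1 for a match, -1 for a mismatch, and a linear gap penalty of -1. The function names are illustrative.

```python
def global_score(x: str, y: str, match=1, mismatch=-1, gap=-1) -> int:
    """Needleman-Wunsch score: best alignment over the full length of x and y."""
    prev = [j * gap for j in range(len(y) + 1)]  # leading gaps are penalized
    for i in range(1, len(x) + 1):
        curr = [i * gap]
        for j in range(1, len(y) + 1):
            diag = prev[j - 1] + (match if x[i - 1] == y[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]  # score at the sink

def local_score(x: str, y: str, match=1, mismatch=-1, gap=-1) -> int:
    """Smith-Waterman score: best alignment between any two substrings."""
    best = 0
    prev = [0] * (len(y) + 1)  # an alignment may start anywhere for free
    for i in range(1, len(x) + 1):
        curr = [0]
        for j in range(1, len(y) + 1):
            diag = prev[j - 1] + (match if x[i - 1] == y[j - 1] else mismatch)
            cell = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(cell)
            best = max(best, cell)  # an alignment may also end anywhere
        prev = curr
    return best
```

Because a local alignment may ignore poorly matching flanks, `local_score` is never lower than `global_score` for the same inputs and scoring scheme, for instance on the slide's x = TCAGTGTCGAAGTTA and y = TAGGCTAGCAGTGTA.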

Multiple Alignment (1)
- Pairwise alignment
  - Alignment of two sequences
  - Sometimes two sequences are functionally similar or have a common ancestor even though their sequence similarity is weak
- Multiple alignment
  - Alignment of three or more sequences simultaneously
  - Finds similarity that is invisible in pairwise alignment

Multiple Alignment (2)
- Example
- Dynamic programming? Computationally not acceptable
- Need heuristic methods

Hierarchical Method (1)
- Hierarchical method
  (1) Compares all sequences in pairwise alignments
  (2) Creates a guide tree (hierarchy)
  (3) Follows the guide tree for a series of pairwise alignments
- Example: pairwise scores for sequences v1 to v4

          v1    v2    v3    v4
    v1    -     .17   .87   .28
    v2          -     .59   .33
    v3                -     .62
    v4                      -

Hierarchical Method (2)
- Features
  - Also called progressive alignment
  - More intelligent strategy at each step
  - Uses a consensus sequence to compare groups of sequences
  - Gaps are permanent ("once a gap, always a gap")
  - Works well for close sequences
- Application tools
  - ClustalW: compares residues one pair at a time and imposes gap penalties
  - DIALIGN: finds pairs of equal-length gap-free segments

Divide-and-Conquer Method
- Process
- Features: fast alignment of long sequences

Multiple Alignment Results
- Examples

Summary of PSA & MSA Algorithms
- Rigorous algorithms
- Heuristic algorithms

Searching Databases
- Sequence homolog search: searching a database for sequences similar to a query sequence
- Computational issues
  - Dynamic programming algorithms (N-W / S-W) are rigorous
  - But they are inefficient for searching a huge database
  - Need heuristic approaches
- Sequence homolog search tools
  - FASTA
  - BLAST

FASTA (1)
- FASTA
  - DNA / protein sequence alignment tool (local alignment)
  - Applies dynamic programming in scoring selected sequences
  - Heuristic method in candidate sequence search
- Algorithm
  (1) Finding all pairwise k-tuples (at least k contiguous matching residues)
  (2) Scoring the k-tuples by a substitution matrix
  (3) Selecting sequences with high scores for alignment

FASTA (2)
- Indexing (or hashing)
- Indexing process in FASTA
  (1) Find all k-tuples from a query sequence and calculate c_i
  (2) Build an index table
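The indexing step can be illustrated with a dictionary that maps each k-tuple of the query to the positions where it occurs. This is a simplified sketch and not FASTA's actual implementation: it uses the k-tuple string itself as the table key instead of a computed code such as c_i, and the names `build_index` and `find_hits` are hypothetical.

```python
from collections import defaultdict

def build_index(query: str, k: int) -> dict:
    """Index table: k-tuple -> list of start positions in the query."""
    index = defaultdict(list)
    for i in range(len(query) - k + 1):
        index[query[i:i + k]].append(i)
    return index

def find_hits(index: dict, target: str, k: int) -> list:
    """All (query_pos, target_pos) pairs that share an identical k-tuple."""
    hits = []
    for j in range(len(target) - k + 1):
        # One table lookup per target position replaces a scan of the query.
        for i in index.get(target[j:j + k], []):
            hits.append((i, j))
    return hits
```

For the query `ATATC` with k = 2, the table holds `AT` at positions 0 and 2, `TA` at 1, and `TC` at 3. Scanning the database sequence `GATC` against it yields the hits (0, 1), (2, 1) and (3, 2). Building the table once and reusing it for every database sequence is what makes the candidate search fast.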

FASTA Package

    FASTA package      Query sequence             Database
    fasta              protein                    protein
    fasta              DNA                        DNA
    fastx / fasty      DNA (all reading frames)   protein
    tfastx / tfasty    protein                    DNA (all reading frames)

- ssearch: applies dynamic programming (S-W algorithm)

BLAST (1)
- BLAST (Basic Local Alignment Search Tool)
  - DNA / protein sequence alignment tool
  - Finds local alignments
  - Heuristic method in sequence search
  - Runs faster than FASTA
- Algorithm
  (1) Makes a list of words (word pairs) from the query sequence
  (2) Chooses high-scoring words
  (3) Searches the database for matches (hits) with the high-scoring words
  (4) Extends the matches in both directions to find a high-scoring segment pair (HSP)
  (5) Selects the sequences that have two or more HSPs for S-W alignment
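Step (4) of the BLAST algorithm, extending a hit in both directions, can be sketched in a heavily simplified form. The assumption here is that extension continues only while the letters are identical; BLAST itself scores the extension with a substitution matrix and stops when the score drops too far below the best value seen. The name `extend_hit` and its parameters are hypothetical.

```python
def extend_hit(query: str, target: str, i: int, j: int, w: int):
    """Extend a word hit of length w at query[i:] and target[j:] without gaps.

    Returns (query_start, target_start, length) of the extended segment pair.
    Simplification: extension stops at the first non-identical letter.
    """
    left = 0
    while (i - left - 1 >= 0 and j - left - 1 >= 0
           and query[i - left - 1] == target[j - left - 1]):
        left += 1
    right = 0
    while (i + w + right < len(query) and j + w + right < len(target)
           and query[i + w + right] == target[j + w + right]):
        right += 1
    return (i - left, j - left, left + w + right)
```

For the query `GGACGTACC` and the database sequence `TTACGTATT`, the word `CGT` is a hit at position 3 in both. Extending it gives (2, 2, 5), which is the shared segment `ACGTA`.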

BLAST (2)
- Deterministic finite automata (DFA)
- DFA analysis process in BLAST
  (1) Build a DFA using the high-scoring words
  (2) Read the sequences in the database and trace the DFA
  (3) Output the positions of hits

BLAST Package

    BLAST program   Query sequence             Database
    blastp          protein                    protein
    blastn          DNA                        DNA
    blastx          DNA (all reading frames)   protein
    tblastn         protein                    DNA (all reading frames)
    tblastx         DNA (all reading frames)   DNA (all reading frames)

Search Results
- BLAST search results
- FASTA search results

E-value
- E-value: average number of alignments with a score of at least S that would be expected by chance alone when searching a database of n sequences
- Range of E-value: 0 ~ n
- High alignment score S: low E-value
- Low alignment score S: high E-value
- Factors
  - Alignment score
  - The number of sequences in the database
  - Sequence length
- Default E-value threshold: 0.01 ~ 0.001

Filtering
- Low-complexity region
  - Highly biased amino acid composition
  - Lowers significant hits in sequence alignment
  - BLAST filters the query sequence for low-complexity regions and marks them with X

Summary of Homolog Search Algorithms
- Rigorous algorithms
- Heuristic algorithms

Motifs
- Motifs
  - Short sequence patterns
  - Functionally related sequences share similarly distributed patterns (motifs) of critical functional residues
- Types of motif search
  - Search for a query sequence in a motif database
  - Search for a pattern in a sequence database
  - Find a pattern from a set of sequences
- Motif finding: consensus method by global multiple alignment

Motif Search Tools (1)
- BLOCKS
- Logos
  - Size of letters: conservation levels
  - Color of letters: biochemical properties

Motif Search Tools (2)
- MEME
  - Summary motif information
  - Location of motifs in sequences

Motif Databases
- PROSITE
- Code for patterns
  - Each letter represents an amino acid residue
  - All positions are separated by -

    Code    Description                             Example
    X       any amino acid                          G-X-L-M-S-A-D-F-F-F
    []      two or more possible amino acids        G-[LI]-L-M-S-A-D-F-F-F
    {}      disallowed amino acids                  G-[LI]-L-M-S-A-{RK}-F-F-F
    (n)     repetition by n of the amino acid       G-[LI]-L-M-S-A-{RK}-F(3)
    (n,m)   a range: only allowed with X            G-[LI]-L-M-S-A-{RK}-X(1,3)
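The pattern codes above map closely onto regular expressions, which makes it easy to search a sequence for a pattern. The sketch below handles only the five codes listed in the table and is an illustration, not a full PROSITE parser. The function name `prosite_to_regex` is an assumption, and the sketch does not enforce the rule that a range is only allowed with X.

```python
import re

# One pattern element: a residue code, optionally followed by (n) or (n,m).
_ELEMENT = re.compile(
    r"(X|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?"
)

def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern into a Python regular expression."""
    parts = []
    for element in pattern.split("-"):
        found = _ELEMENT.fullmatch(element)
        if found is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        residue, low, high = found.groups()
        if residue == "X":
            regex = "."                           # any amino acid
        elif residue.startswith("{"):
            regex = "[^" + residue[1:-1] + "]"    # disallowed amino acids
        else:
            regex = residue                       # single letter or [..] choice
        if low is not None:
            # (n) becomes {n}; (n,m) becomes {n,m}
            regex += "{" + low + ("," + high if high else "") + "}"
        parts.append(regex)
    return "".join(parts)
```

For example, `G-[LI]-L-M-S-A-{RK}-X(1,3)` becomes `G[LI]LMSA[^RK].{1,3}`, which can be passed to `re.search` to locate the motif in a protein sequence.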