Global and Local Sequence Alignment

Similar documents

Pipe Cleaner Proteins. Essential question: How does the structure of proteins relate to their function in the cell?

BOC334 (Proteomics) Practical 1. Calculating the charge of proteins

Concluding lesson. Student manual. What kind of protein are you? (Basic)

IV. -Amino Acids: carboxyl and amino groups bonded to -Carbon. V. Polypeptides and Proteins

Advanced Medicinal & Pharmaceutical Chemistry CHEM 5412 Dept. of Chemistry, TAMUK

Guidelines for Writing a Scientific Paper

Molecular Facts and Figures

Amino Acids, Peptides, Proteins

Part A: Amino Acids and Peptides (Is the peptide IAG the same as the peptide GAI?)

PRACTICE TEST QUESTIONS

Pairwise Sequence Alignment

Shu-Ping Lin, Ph.D.

RNA and Protein Synthesis

Academic Nucleic Acids and Protein Synthesis Test

UNIVERSITETET I OSLO Det matematisk-naturvitenskapelige fakultet

Application Note. Determination of Amino acids by UHPLC with automated OPA- Derivatization by the Autosampler. Summary. Fig. 1.

Name Class Date. Figure Which nucleotide in Figure 13 1 indicates the nucleic acid above is RNA? a. uracil c. cytosine b. guanine d.

Multiple Choice Write the letter that best answers the question or completes the statement on the line provided.

Application Note. Determination of 17 AQC derivatized Amino acids in baby food samples. Summary. Introduction. Category Bio science, food Matrix

Introduction to Bioinformatics 2. DNA Sequence Retrieval and comparison

Chapter 9. Applications of probability. 9.1 The genetic code

The Organic Chemistry of Amino Acids, Peptides, and Proteins

THE CHEMICAL SYNTHESIS OF PEPTIDES

INFORMATIKA ANGOL NYELVEN INFORMATION TECHNOLOGY

Network Protocol Analysis using Bioinformatics Algorithms

H H N - C - C 2 R. Three possible forms (not counting R group) depending on ph

DNA Replication & Protein Synthesis. This isn t a baaaaaaaddd chapter!!!

Chapter 26 Biomolecules: Amino Acids, Peptides, and Proteins

DNA is found in all organisms from the smallest bacteria to humans. DNA has the same composition and structure in all organisms!

Amino Acids. Amino acids are the building blocks of proteins. All AA s have the same basic structure: Side Chain. Alpha Carbon. Carboxyl. Group.

Hidden Markov Models in Bioinformatics. By Máthé Zoltán Kőrösi Zoltán 2006

Chemistry 110. Bettelheim, Brown, Campbell & Farrell. Introduction to General, Organic and Biochemistry Chapter 22 Proteins

A. A peptide with 12 amino acids has the following amino acid composition: 2 Met, 1 Tyr, 1 Trp, 2 Glu, 1 Lys, 1 Arg, 1 Thr, 1 Asn, 1 Ile, 1 Cys

Provincial Exam Questions. 9. Give one role of each of the following nucleic acids in the production of an enzyme.

From DNA to Protein. Proteins. Chapter 13. Prokaryotes and Eukaryotes. The Path From Genes to Proteins. All proteins consist of polypeptide chains

The sequence of bases on the mrna is a code that determines the sequence of amino acids in the polypeptide being synthesized:

Journal of Chemical and Pharmaceutical Research

HIV NOMOGRAM USING BIG DATA ANALYTICS

Structure and properties of proteins. Vladimíra Kvasnicová

Human Tubal Fluid (HTF) Media & Modifi ed Human Tubal Fluid (mhtf) Medium with Gentamicin

Lab # 12: DNA and RNA

2. The number of different kinds of nucleotides present in any DNA molecule is A) four B) six C) two D) three

Peptide bonds: resonance structure. Properties of proteins: Peptide bonds and side chains. Dihedral angles. Peptide bond. Protein physics, Lecture 5

BLAST. Anders Gorm Pedersen & Rasmus Wernersson

RETRIEVING SEQUENCE INFORMATION. Nucleotide sequence databases. Database search. Sequence alignment and comparison

Hands on Simulation of Mutation

AMINO ACIDS & PEPTIDE BONDS STRUCTURE, CLASSIFICATION & METABOLISM

Bioinformatics, Sequences and Genomes

Part ONE. a. Assuming each of the four bases occurs with equal probability, how many bits of information does a nucleotide contain?

Database searching with DNA and protein sequences: An introduction Clare Sansom Date received (in revised form): 12th November 1999

ISTEP+: Biology I End-of-Course Assessment Released Items and Scoring Notes

Ch18_PT MULTIPLE CHOICE. Choose the one alternative that best completes the statement or answers the question.

Ms. Campbell Protein Synthesis Practice Questions Regents L.E.

agucacaaacgcu agugcuaguuua uaugcagucuua

Thymine = orange Adenine = dark green Guanine = purple Cytosine = yellow Uracil = brown

HENIPAVIRUS ANTIBODY ESCAPE SEQUENCING REPORT

Transcription and Translation These terms describe the two steps used to transform the information carried in genes into useful products.

The chemistry of insulin

Chapter 2: Biochemistry Problems

Name Date Period. 2. When a molecule of double-stranded DNA undergoes replication, it results in

Chapter 11: Molecular Structure of DNA and RNA

DNA Insertions and Deletions in the Human Genome. Philipp W. Messer

Basic Concepts of DNA, Proteins, Genes and Genomes

Name: Date: Problem How do amino acid sequences provide evidence for evolution? Procedure Part A: Comparing Amino Acid Sequences

Genomes and SNPs in Malaria and Sickle Cell Anemia

Introduction to Chemical Biology

CHAPTER 15: ANSWERS TO SELECTED PROBLEMS

Amino Acids and Proteins

13.2 Ribosomes & Protein Synthesis

Molecular Genetics. RNA, Transcription, & Protein Synthesis

Algorithms in Computational Biology (236522) spring 2007 Lecture #1

CHAPTER 29 AMINO ACIDS, POLYPEPTIDES, AND PROTEINS SOLUTIONS TO REVIEW QUESTIONS

Recap. Lecture 2. Protein conformation. Proteins. 8 types of protein function 10/21/10. Proteins.. > 50% dry weight of a cell

Clone Manager. Getting Started

Problem Set 1 KEY

From Sequence to Structure

Structure and Function of DNA

Paper: 6 Chemistry University I Chemistry: Models Page: 2 of Which of the following weak acids would make the best buffer at ph = 5.0?

AMINO ACIDS QUANTITATION IN BIOLOGICAL MEDIA. Monica Culea

Activity IT S ALL RELATIVES The Role of DNA Evidence in Forensic Investigations

Lecture 4: Exact string searching algorithms. Exact string search algorithms. Definitions. Exact string searching or matching

Protein Structure and Function

Rapid alignment methods: FASTA and BLAST. p The biological problem p Search strategies p FASTA p BLAST

Rapid and Reproducible Amino Acid Analysis of Physiological Fluids for Clinical Research Using LC/MS/MS with the atraq Kit

A COMPARISON OF COMPUTATION TECHNIQUES FOR DNA SEQUENCE COMPARISON

Mutation. Mutation provides raw material to evolution. Different kinds of mutations have different effects

12.1 The Role of DNA in Heredity

Coding sequence the sequence of nucleotide bases on the DNA that are transcribed into RNA which are in turn translated into protein

a. Ribosomal RNA rrna a type ofrna that combines with proteins to form Ribosomes on which polypeptide chains of proteins are assembled

Mutation, Repair, and Recombination

Bio-Informatics Lectures. A Short Introduction

UNIT (12) MOLECULES OF LIFE: NUCLEIC ACIDS

Protein & DNA Sequence Analysis. Bobbie-Jo Webb-Robertson May 3, 2004

NO CALCULATORS OR CELL PHONES ALLOWED

Developing an interactive webbased learning. environment for bioinformatics. Master thesis. Daniel Løkken Rustad UNIVERSITY OF OSLO

AP BIOLOGY 2010 SCORING GUIDELINES (Form B)

Similarity Searches on Sequence Databases: BLAST, FASTA. Lorenza Bordoli Swiss Institute of Bioinformatics EMBnet Course, Basel, October 2003

Sickle cell anemia: Altered beta chain Single AA change (#6 Glu to Val) Consequence: Protein polymerizes Change in RBC shape ---> phenotypes

Transcription:

Global and Local Sequence Alignment Stephen F. Altschul National enter for Biotechnology Information National Library of Medicine National Institutes of Health

The Structure of DNA Watson, J.D. &rick, F.H.. (1953) Molecular structure of nucleic acids - A structure for deoxyribose nucleic acid. Nature 171:737-738.

The Structure of DNA in the Abstract 5 3 A G G T A Replication T A G G G T 3 5 Information

Watson-rick Base Pairs Purines Guanine (G) Adenine (A) Pyrimidines ytosine () Thymine (T)

The entral Dogma Replication Transcription Translation DNA RNA Protein Reverse transcription

Amino Acid Structure

The Twenty ommon Amino Acids One-letter code Three-letter code Name A D E F G H I K L Ala ys Asp Glu Phe Gly His Ile Lys Leu Alanine ysteine Aspartic acid Glutamic acid Phenylalanine Glycine Histidine Isoleucine Lysine Leucine M N P Q R S T V W Y Met Asn Pro Gln Arg Ser Thr Val Trp Tyr One-letter code Three-letter code Name Methionine Asparagine Proline Glutamine Arginine Serine Threonine Valine Tryptophan Tyrosine

hemical Structures of the Twenty ommon Amino Acids

Protein Domain Tertiary Structures http://www.lbl.gov/publications/urrents/archive/apr-1-25.html

The Genetic ode Marshall Nirenberg 1927-21

entral Issues in Biological Sequence omparison Definitions: What is one trying to find or optimize? Algorithms: an one find the proposed object optimally or in reasonable time optimize? Statistics: an one s result be explained by chance? In general there is a tension between questions. A definition that is too simple may allow efficient algorithms, but may not yield results of biological interest. However, a definition that includes most of the relevant biology may entail intractable algorithms and statistics. The most successful approaches find a balance between these considerations.

The Problem Given: Two protein or DNA sequences where the and are chosen from a finite alphabet, e.g.,,,. How can one define the distancebetween the sequences and, or alternatively their similarity? We shall adopt the somewhat more flexible formalism of similarity, with higher values considered better. Although there are other possibilities, similarity is generally defined with reference to a sequence alignment, in which individual letters from each sequence are placed into correspondence.

Examples of Sequence Alignment groan : grown colo-r colour theatre :: theater theatre theater elephant : eleg-ant vermiform :: ::::: formation vermiform----- about -----formation disestablishment dis-------s--ent disestablishment : dis---------sent

Applications Sequence alignment arises in many fields: Molecular biology Inexact text matching (e.g. spell checkers; web page search) Speech recognition In general: The precise definition of what constitutes an alignment may vary by field, and even within a field. Many different alignments of two sequences are possible, so to select among them one requires an objective (score) function on alignments. The number of possible alignments of two sequences grows exponentially with the length of the sequences. Good algorithms are required.

NBI BLAST Protein Database Search BLAST is a widely-used program for searching DNA and protein sequence databases for sequences similar to a query sequence. Here is one alignment returned by a BLAST protein database search: >sp Q99728.2 BARD1_HUMAN Length=777 GENE ID: 58 BARD1 BRA1 associated RING domain 1 [Homo sapiens] Score = 53.1 bits (126), Expect = 3e-7, Method: omposition-based stats. Identities = 32/111 (29%), Positives = 55/111 (5%), Gaps = 15/111 (14%) Query 24 THVVMKTDAEFVERTLKYFLGIAGGKWVVSYFWVTQSIKERKMLNEHDFEVRGDVVNGR 83 THVV+ DA + TLK LGI G W++ + WV ++ + E +E+ Sbjct 65 THVVVPGDA---VQSTLKMLGILNGWILKFEWVKALRRKVEQEEKYEIP------- 654 Query 84 NHQGPKRARESQDR---KIFRGLEIYGPFTNMPTDQLEWMVQLGASVV 131 +GP+R+R ++++ K+F G +G F + P D L +V G ++ Sbjct 655 --EGPRRSRLNREQLLPKLFDGYFYLWGTFKHHPKDNLIKLVTAGGGQIL 73

Elements of Global Sequence Alignment No crossings allowed. For algorithmic reasons, it is fortunate that, although there are natural mechanisms (mutations) that lead to amino acid or nucleotide substitutions, insertions and deletions, there are none that yield transpositions, unlike with keyboardproduced text. In contrast, when analyzing RNA folding, one may choose for algorithmic reasons to exclude pseudoknots, which do in fact occur naturally. Gaps. An arbitrary number of nullcharacters (represented by dashes) may be placed into either sequence, and aligned with letters in the other sequence. Two nulls may not be aligned. Depending upon one s perspective, the alignment of a letter with a null may be understood as the insertionof a letter into one sequence, or the deletionof a letter from the other. Therefore, a letter aligned with a null is sometimes called an indel. Alignment scores. The score for an alignment is taken to be the sum of scores for aligned pairs of letters, and scores for letters aligned with nulls. Each such pairing is called an alignment column. Substitution scores. Scores for aligned pairs of letters are called substitution scores, whether the letter aligned are identical or not. Most simply, substitution scores may take the form of match scores and mismatch scores. Gap scores. The score for a letter aligned with a null is called a gap score. Usually gap scores are letter-independent. Global alignment. All letters and nulls in each sequence must be aligned.

Sequence Similarity Define the similarityof two sequences as the score of their highest-scoring (optimal) alignment. How do we find the an optimal alignment of two sequence, and its score? Brute force enumeration is impractical, because the number of possible alignments becomes astronomically large for even fairly short sequences. Fortunately, the problem is soluble efficiently using a technique called dynamic programming.

Dynamic Programming and Global Alignment Dynamic programming is a method by which a larger problem may be solved by first solving smaller, partial versions of the problem. We demonstrate here how it may be applied to global sequence alignment, where at first we are interested only in the similarity of two sequences, and not the alignment that yields this score. Definitions:, the substitution score for aligning letters and the gap score for aligning any letter to a null the partial sequence consisting of the first letters of the partial sequence consisting of the first letters of (, ) the similarity of and onsider the last column of an optimal alignment of and. This column either aligns to, or to a null, or to a null. Because we do not allow crossing, there are no other possibilities. This observation yields the following recurrence:, + (, ), =, +, + and aligned aligned with a null aligned with a null In brief, we can solve for (, )by solving smaller versions of the problem first.

Path graphs A global alignment may be viewed as a path through a directed path graphwhich begins at the upper left corner and ends at the lower right. Diagonal steps correspond to substitutions, while horizontal or vertical steps correspond to indels. Scores are associated with each edge, and the score of an alignment is the sum of the scores of the edges it traverses. Each alignment corresponds to a unique path, and vice versa. Start End

Dynamic programming on path graphs One may associate a partial similarity with each node of a path graph. If the values of 1, 1, 1, and, 1 are known, the value of, may be calculated. (, ) (, ) (, ) (, ) (, )

A G -1-2 -3-4 -1-1 -2-1 1-1 An Example -1-3 -4-5 -6-1 -2-3 -4-1 1-1 -2-1 -1 1-1 -2-1 1 1 G A T A Scores: Match +1 Mismatch Gap -1

What is the optimal alignment? A G -1-2 -3-4 -1-2 -3-4 -5-6 -1-2 -3-4 -1 1-1 -2-1 -1 1-1 -2-1 1 1 G A T A Record tracebackinformation: Which edge or edges led to the optimal score at each node?

Follow the traceback edges from the final node A G -1-2 -3-4 -1-2 -3-4 -5-6 -1-2 -3-4 -1 1-1 -2-1 -1 1-1 -2-1 1 1 G A T A Optimal alignments: -AG- GATA and -A-G GATA

Pseudocode for Finding Sequence Similarity Similarity(X,Y): For i =,...,m: SIM[i,] = i*g For j = 1,...,n: SIM[,j] = j*g For i = 1,...,m: For j = 1,...,n: SIM[i,j] = max( SIM[i-1,j-1] + s(x[i],y[j]), SIM[i-1,j]+g, SIM[i,j-1]+g ) EndFor EndFor Return SIM[m,n] Exercise: Generalize the code to include tracebackinformation, and produce one optimal alignment. Note: This is generally known as the Needleman-Wunschalgorithm, after the first paper in the field of computational molecular biology to apply dynamic programming to the global alignment problem. However, the paper actually describes a somewhat different algorithm which is almost never used. Needleman, S.B. & Wunsch,.D. (197) A general method applicable to the search for similarities in the amino acid sequences of two proteins. J. Mol. Biol. 48:443-453.

Observations and Generalizations The nodes can be expanded in a variety of orders, so long as all nodes that feed into a given node are expanded before that node. Possible expansion orders are: The time complexity of the algorithm is. If only the similarity is desired, the space complexity is min, ; if an optimal alignment is sought, the space complexity is, but as we shall see, this too can be reduced to [min, ]. It is possible to save time (but in general no more than a constant factor) by not expanding nodes that can not possibly participate in an optimal path. Fickett, J.W. (1984) Nucl. Acids Res. 12:175-18; Spouge, J.L. (1989) SIAM J. Appl. Math. 49:1552-1566.

Global Alignment Scores Multiplying all substitution and gap scores by a positive constant does not change the optimal alignment. Why? Adding a constant to all substitution scores, and /2to all gap scores, does not change the optimal alignment. Why? A global alignment scoring system with the three nominal parameters of match score, mismatch score, and gap score, in fact has a single free parameter. For example, assuming >, one can always construct an equivalent scoring system with =1and =. What is the scoring system of this form equivalent to ( =1, =, = 1)? Modifying global alignment scores so that =can speed up the inner loop of the dynamic programming algorithm.

Local Alignment: Motivation In the early days of protein sequence comparison, most known related proteins, were related over their whole lengths. However, soon proteins that shared only isolated regions of similarity were found. A schematic of a protein superfamilyis shown here, with related domains represented by similar boxes. The measure of global sequence similarity, and the Needleman-Wunsch alignment algorithm, was not welladapted to finding such domains. A new definition of local similarity was required, along with a new algorithm for finding locally optimal alignments.

Local Alignment: Definition During the 197s and early 198s, a variety of definitions for local alignment were proposed. The one that eventually gained the greatest popularity, along with an associated algorithm, is due to Smith & Waterman. Smith & Waterman proposed simply that a local alignment of two sequences allow arbitrary-length segments of each sequence to be aligned, with no penalty for the unaligned portions of the sequences. Otherwise, the score for a local alignment is calculated the same way as that for a global alignment. It would at first appear that the problem of finding an optimal local alignment should be significantly more complex than the problem of finding an optimal global alignment, because the start and stop positions of the alignment must be located as well. However, only a constant factor more calculation is necessary. Smith, T.F. & Waterman, M.S. (1981) Identification of common molecular subsequences. J. Mol. Biol. 147:195-197.

. The Smith-Waterman Algorithm Two modifications to Needleman-Wunsch: 1) Allow a node to start at. 2) Record the highestscoring node, and trace back from there. Why does this algorithm yield an optimal local alignment? A G 4 2 4 2 3 1 2 4 2 1 2 5 3 2 3 1 3 4 G A T T G A Scores: Match +4 Mismatch -1 Gap -2

Pseudocode for Finding Local Sequence Similarity Local_Similarity(X,Y): S= For i =,...,m: SIM[i,] = i*g For j = 1,...,n: SIM[,j] = j*g For i = 1,...,m: For j = 1,...,n: SIM[i,j] = max(, SIM[i-1,j-1] + s(x[i],y[j]), SIM[i-1,j]+g, SIM[i,j-1]+g ) S=max(S,SIM[i,j]) EndFor EndFor Return S Exercise: Generalize the code to include tracebackinformation, and produce one optimal local alignment. Multiplying all substitution and gap scores by a positive constant does not change the optimal alignment. Why? Adding a constant to all substitution scores, and /2 to all gap scores, canchange the optimal alignment. Why?

. The Smith-Waterman Algorithm: Traceback Optimal local alignments, or subalignments: A-G ATTG and Questions: A-G ATTG an one find other locally optimal subalignments? How can they be defined? A G 4 2 4 2 3 1 2 4 2 1 2 5 3 2 3 1 3 4 G A T T G A Scores: Match +4 Mismatch -1 Gap -2

Local optimality: Definitions and Algorithms A definition of local optimality was proposed in 1984, along with an algorithm to find all locally optimal subalignments. [Sellers, P.H. (1984) Bull. Math. Biol. 46:51-514.] A subalignmentis locally optimal if its score is greater than or equal to that of any subalignment it touches. A provably ( )algorithm for finding all locally optimal subalignmentswas subsequently described. [Altschul, S.F. & Erickson, B.W (1986) Bull. Math. Biol. 48:633-66.] Problem: By Sellers definition, a strong subalignmentcan suppress, by means of intermediaries, subalignmentsit does not actually touch. This can be a particular problem if one is seeking internal approximate repeats. One may advance an alternative definition to address this problem: A subalignmentis weakly locally optimal if it touches no weakly locally optimal subalignmentthat has greater score (Altschul & Erickson, 1986). This definition is not circular, but recursive. No algorithm for finding all weakly locally optimal subalignmentsof two sequences has been described, although several incorrect ones have been published. > > ( )

. Locally Optimal Subalignments Optimal subalignments: A-G ATTG and A-G ATTG Additional, locally optimal subalignments: G G and A A A G 4 2 4 2 3 1 2 4 2 1 2 5 3 2 3 1 3 4 G A T T G A Scores: Match +4 Mismatch -1 Gap -2

Semi-Global Alignment Biological problem: Find approximate matches to a given pattern within a large sequence. For example, seek promoters within a DNA sequence, or a copies of a domain within a protein sequence. Solution: Semi-global alignment. Needleman-Wunschalgorithm with two modifications: 1) Penalize end gaps in the pattern, but not in the long sequence; 2) Trace back from the highest scoring node along the long edge of the path graph. -2-4 -6-8 Erickson, B.W. & Sellers, P.H. (1983) Recognition of patterns in genetic sequences. In Time Warps, String Edits and Macromolecules: The Theory and Practice of Sequence omparison, D. Sankoff& J.B. Kruskal(eds.), pp. 55-91, Addison-Wesley, Reading, MA.