CSE8393 Introduction to Bioinformatics Lecture 3: More problems, Global Alignment. DNA sequencing



DNA Sequencing

Recall that in biological experiments only relatively short segments of DNA can be investigated. To investigate the structure of a long DNA molecule, the DNA is broken into smaller fragments that can be studied more easily by biological experiments. For instance, physical mapping aims at ordering these fragments based on overlap information obtained from hybridization experiments with a set of probes.

Shortest Superstring

In DNA sequencing, we deal with more accurate overlap data for the fragments. For instance, in such an experiment, we manage to sequence each of the individual fragments, i.e. obtain the sequence of nucleotides for each fragment (there might be some experimental errors). Now we have to map them back onto the original DNA. This leads to the following nice theoretical abstraction, the Shortest Superstring (one that does not necessarily capture reality): given a set of strings s_1, s_2, ..., s_k, find the shortest string s such that each s_i appears as a substring of s.

Intuitively speaking, the shortest superstring uses as much overlap as possible to construct the original DNA (because it is the shortest string that contains all fragments). Is that a realistic thing to do? No. What if the DNA contains repeats? Then the shortest covering string will not capture them. Furthermore, the model does not account for possible experimental errors.

Here's an example. Consider the following strings (all binary strings of length 3): 000, 001, 010, 011, 100, 101, 110, 111. A trivial superstring would be the concatenation 000 001 010 011 100 101 110 111, of length 24. However, a shortest superstring would be 0001110100, of length 10: every binary string of length 3 appears in it as a substring.
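For a set this small, the shortest superstring can be found by brute force: try every ordering of the strings and merge consecutive strings with maximum overlap. The sketch below is purely illustrative (the helper names `overlap` and `merge_in_order` are made up, and the approach assumes no string is a substring of another, which holds here since all strings have the same length):

```python
from itertools import permutations

def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def merge_in_order(strings):
    """Merge the strings left to right, reusing maximum overlap at each step."""
    s = strings[0]
    for t in strings[1:]:
        s += t[overlap(s, t):]
    return s

def shortest_superstring(strings):
    """Try every ordering; feasible only for a handful of strings."""
    return min((merge_in_order(list(p)) for p in permutations(strings)), key=len)

fragments = ["000", "001", "010", "011", "100", "101", "110", "111"]
best = shortest_superstring(fragments)
# best has length 10, and every fragment occurs in it as a substring
```

Of course, trying all k! orderings is hopeless for real instances, which is consistent with the hardness result below.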
The shortest superstring problem is NP-hard.

Sequencing by Hybridization

Although the sequencing problem can be solved by fast and efficient techniques these days, no such techniques existed 10 years ago. One method for DNA sequencing, known as Sequencing by Hybridization, was suggested by four biologists independently in 1988. They suggested building a DNA chip (called a microarray) that would contain many small fragments (probes) acting as the memory of the chip. Each fragment gives some information about the DNA molecule at hand (by hybridization), and collectively the fragments solve the DNA sequencing problem. Of course, no one believed at the time that this would work. Now microarrays constitute an industry.

Theoretically speaking, how can we use such a microarray to sequence the DNA? We can make the memory of the chip consist of all possible probes of length l. The chip will determine whether each of the probes hybridizes with the DNA molecule. Note that every substring of length l in the original DNA will hybridize with some probe in the microarray. Therefore, we collect information about all substrings of length l in the original DNA. This information can be used to figure out the DNA sequence. We will call this problem, sequencing by hybridization, SBH. Note that SBH is theoretically not much different from the shortest superstring problem described above. In fact, it is a special case where all strings have the same length l. Although the shortest superstring problem is NP-hard, SBH can be solved efficiently, as we will see later on.

Similarity Search

After sequencing, the biologist would have no idea what a newly sequenced gene does. In a similarity search, we are interested in comparing the newly sequenced gene with others whose functionality is known. This will help us determine the functionality of the newly sequenced gene. Whether it is a gene, a protein, or just a DNA fragment, we can look at the new sequence as a string over some alphabet (A, G, C, T, for instance, in the case of DNA).
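Returning to SBH for a moment: the information an all-probes chip of length l provides is exactly the set of length-l substrings of the molecule, sometimes called its spectrum. A minimal sketch (the toy molecule is made up, and real hybridization data would of course contain errors):

```python
def spectrum(dna, l):
    """All length-l substrings of dna: the information an all-probes
    microarray with probe length l would reveal (ignoring errors)."""
    return {dna[i:i + l] for i in range(len(dna) - l + 1)}

# a made-up toy molecule, purely for illustration
print(sorted(spectrum("ATGCTGA", 3)))   # ['ATG', 'CTG', 'GCT', 'TGA', 'TGC']
```

Note that the spectrum is a set: repeated l-mers collapse, which is one source of the difficulties with repeats mentioned earlier.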
Using a distance metric, we can then determine to what extent the strings are alike. This is often called an edit distance. The edit distance is a measure of how many operations are needed to transform one string into the other. The operations can be insertion, deletion, or substitution of a symbol. Hence the edit distance is a natural measure of the similarity between two strings, because these are the three kinds of mutations that occur in DNA and protein sequences. We will see when we discuss alignment algorithms that we can work with a more elaborate weighted edit distance, where each operation is assigned a weight, or a score.

As an example for now, if we decide to limit the mutations to just insertions and deletions and no substitutions, then the edit distance problem is equivalent to finding the longest common subsequence. Given two strings u = u_1...u_m and v = v_1...v_n, a common subsequence of u and v of length k is a set of indices i_1 < i_2 < ... < i_k and j_1 < j_2 < ... < j_k such that u_(i_x) = v_(j_x) for x = 1...k. Let LCS(u, v) denote the length of the longest common subsequence; then the minimum number of insertions and deletions is clearly equal to m - LCS(u, v) + n - LCS(u, v) = m + n - 2 LCS(u, v). This problem is a basic case of a general problem known as Sequence Alignment.

Finding CG Islands

Given a DNA sequence, we would like to find out which stretches of the DNA are genes, i.e. code for proteins. We face the gene prediction problem. One approach to solving the gene prediction problem is by predicting what are known as CG islands. CG is the most infrequently occurring dinucleotide in many genomes, because CG tends to mutate to TG by a process called methylation. However, CG tends to appear frequently around genes, in areas called CG islands. Therefore, given the DNA, determining which areas are CG islands and which areas are not helps us to identify genes.

Theoretically, this problem is analogous to the following dishonest casino problem: at the casino, a biased coin is used from time to time. The players cannot notice, since the switch between the two coins happens with a small probability p. Given a sequence of coin tosses that you collect over a period of time, can you find out when the biased coin was used and when the fair coin was used? Where is the analogy here? Well, given the DNA sequence, can you tell which parts are CG islands and which parts are not? We will use a Hidden Markov Model to abstract this probabilistic setting and answer questions such as the one above.
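Before moving on: the identity m + n - 2 LCS(u, v) from the similarity-search discussion can be checked with the classic dynamic program for LCS length. A short sketch (the example strings are made up):

```python
def lcs_length(u, v):
    """Length of the longest common subsequence, by dynamic programming:
    L[i][j] is the LCS length of the prefixes u_1..u_i and v_1..v_j."""
    m, n = len(u), len(v)
    L = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if u[i - 1] == v[j - 1]:
                L[i][j] = L[i - 1][j - 1] + 1
            else:
                L[i][j] = max(L[i - 1][j], L[i][j - 1])
    return L[m][n]

u, v = "AGCGA", "ACGCA"            # made-up example strings
k = lcs_length(u, v)               # k == 4, e.g. "ACGA"
indels = len(u) + len(v) - 2 * k   # minimum insertions + deletions == 2
```

The shape of this table, filled row by row from three neighboring entries, is exactly the shape of the alignment algorithm developed below.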
Protein Folding

In protein folding, the main goal is to determine the three-dimensional structure of the protein based on its amino acid sequence. Determining the 3D structure is important for understanding the protein's functionality. Many protein folding models use lattices to describe the space of conformations that a protein can assume. The folding is a self-avoiding path in the lattice in which the vertices are labeled by the amino acids. The protein folds in a way that minimizes the conformation energy, i.e. maximizes the number of bonds between certain amino acids. A lot of disagreement still exists with regard to the structure of the protein and ways to compute the energy of a given configuration. For instance, the Hydrophobic-Hydrophilic model (HP model) of protein folding states that the protein folds by the hydrophobic amino acids going first into the core of the protein to protect themselves from water. Under this model, we aim at maximizing the number of bonds between hydrophobic amino acids. The protein folding problem is NP-complete under various models and lattices.

Global Alignment

In order to detect similarities/differences between two sequences, we align the two sequences one above the other. For instance, the following two sequences look very much alike: CGAGGCTTGA and CGTAGGCTAGA. The similarity becomes more obvious when we align the sequences as follows:

CG-AGGCTTGA
CGTAGGCTAGA

A gap was inserted in the first sequence to properly align the two sequences. Sequence alignment is the insertion of gaps (or spaces) at arbitrary locations in the sequences so that they end up with the same size, provided that no gap in one sequence is aligned to a gap in the other. We are interested in finding the best alignments between two sequences. But how do we formalize the notion of best alignment? We will define a scoring scheme for the alignment. The best alignment is the alignment with the highest score.
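The definition of an alignment above is mechanical enough to state as code. A minimal validity check (the function name is made up, and the example pair is just a small illustration): two aligned strings must have equal length, no column may align a gap with a gap, and stripping the gaps must give back the original sequences.

```python
def is_valid_alignment(a, b, x, y):
    """Check that (a, b) is an alignment of the sequences x and y:
    equal lengths, no gap-with-gap column, and removing the gap
    symbols recovers the original sequences."""
    return (len(a) == len(b)
            and all(not (p == "-" and q == "-") for p, q in zip(a, b))
            and a.replace("-", "") == x
            and b.replace("-", "") == y)

print(is_valid_alignment("CG-AGGCTTGA", "CGTAGGCTAGA",
                         "CGAGGCTTGA", "CGTAGGCTAGA"))   # True
```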
Since our goal is to determine how similar two sequences are, our scoring scheme should reflect the concept of similarity. A basic scoring scheme would be the following: two identical characters receive a score of +1, two different characters receive a score of -1, and a character aligned with a gap receives a score of -2. The score of an alignment is then the sum of the individual scores of the columns of the alignment. For example, the score of the alignment above is +1(9) - 1(1) - 2(1) = 6. Here we choose to penalize a gap more than a mismatch because insertions and deletions are less likely to occur in practice than substitutions.
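This column-by-column scoring is easy to express directly. A minimal sketch with the (+1, -1, -2) scheme and a small made-up alignment (it assumes, as above, that no column aligns a gap with a gap):

```python
MATCH, MISMATCH, GAP = 1, -1, -2

def alignment_score(a, b):
    """Score two already-aligned strings of equal length, column by column.
    Assumes no column aligns a gap with a gap."""
    score = 0
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            score += GAP
        elif x == y:
            score += MATCH
        else:
            score += MISMATCH
    return score

# a made-up alignment: 4 matches and 1 gap
print(alignment_score("ACG-T", "ACGAT"))   # 4*1 - 2 = 2
```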

In general, our scoring could be +m for a match, -s for a mismatch, and -d for a gap. The score of the alignment will therefore be:

+m(#matches) - s(#mismatches) - d(#gaps)

The scoring scheme could be generalized even further. For instance, we could have a scoring matrix S where S(a, b) is the score for matching character a with character b. The gap (or space) symbol could be made part of that scoring matrix; however, it is usually dealt with separately. As we will see later on, the score for a gap could depend on the length of the gap. For instance, two gap characters in a row could be scored differently than twice the score of a single gap. The reason for such an approach is that gaps tend to occur in bunches, i.e. once we have a gap, we are likely to encounter another one again, so we might want to penalize the start of a gap sequence more. This results in a non-linear gap penalty function that cannot be captured by the scoring matrix S.

Now that we have an idea of how to score our alignment, we go back to the question of how to compute the optimal alignment. One possibility is a brute-force algorithm which looks at all possible alignments and determines the optimal one. However, the number of alignments grows exponentially with the sequence length, resulting in a slow algorithm. We will instead make use of dynamic programming to solve the problem. In dynamic programming, we first find solutions for small sub-problems, then store the results obtained and use them in the computation of larger sub-instances of the problem. The dynamic programming algorithm is going to align prefixes of the two sequences x and y. By looking at the optimal alignments of prefixes of x and y, we will be able to obtain the optimal alignments for larger prefixes. At the end, we have the alignment between x and y.
In developing our dynamic programming algorithm, we will rely on two main observations.

First, using our scoring scheme of (+m, -s, -d), the score of an alignment (whether optimal or not) is additive. In other terms, if the alignment is split into two parts and each part is scored separately, then the sum of the two scores is equal to the score of the alignment. For instance, the alignment above can be split as follows:

CG-   AGGCTTGA
CGT   AGGCTAGA

The score of the alignment can be obtained as the sum of the scores of the two parts, i.e. (1 + 1 - 2) + (1 + 1 + 1 + 1 - 1 + 1 + 1 + 1) = 0 + 6 = 6. This is true for any split of the alignment. The additive property implies that if the alignment is optimal, so are the alignments corresponding to the two parts (because otherwise we could obtain a better alignment).

Second, the optimal alignment between x_1...x_i and y_1...y_j can end with one of three possibilities:

x_i is aligned to y_j
x_i is aligned with a gap
y_j is aligned with a gap

Using the additivity property mentioned above, the score of an optimal alignment ending in the first case will be OPT(x_1...x_(i-1), y_1...y_(j-1)) + s(x_i, y_j), where OPT(x_1...x_(i-1), y_1...y_(j-1)) is the score of the optimal alignment between x_1...x_(i-1) and y_1...y_(j-1), and s(x_i, y_j) = +m if x_i = y_j and s(x_i, y_j) = -s otherwise. Similarly, the score of an optimal alignment ending in the second case will be OPT(x_1...x_(i-1), y_1...y_j) - d. Finally, the score of an optimal alignment ending in the third case will be OPT(x_1...x_i, y_1...y_(j-1)) - d.

Let A(i, j) be the score of the optimal alignment between x_1...x_i and y_1...y_j. Then, according to the discussion above, if we know A(i-1, j-1), A(i-1, j), and A(i, j-1), then we can determine A(i, j). How?
Well, one of the three possibilities above has to be true for the optimal alignment between x_1...x_i and y_1...y_j; therefore, A(i, j) should be computed as

A(i, j) = max[ A(i-1, j-1) + s(x_i, y_j), A(i-1, j) - d, A(i, j-1) - d ]

If we view A as a matrix, then an entry A(i, j) depends on three other entries, as illustrated in Figure 1.

Figure 1: Computing A(i, j) from A(i-1, j-1), A(i-1, j), and A(i, j-1)

We can therefore carry out the computation row by row from left to right (or column by column downward) until we obtain the value A(m, n), which is the score of the optimal alignment between x and y. But how do we start the process? We need some initialization. For instance, A(1, 1) requires knowledge of A(0, 0), A(0, 1), and A(1, 0). Well, A(0, 0) corresponds to aligning the empty prefixes of x and y; therefore, it must have a score of zero. A(i, 0) for i = 1...m corresponds to aligning the prefix x_1...x_i with the empty string of y, i.e. with gaps. Therefore, A(i, 0) should be set to -d*i for i = 1...m. Similarly, A(0, j) should be set to -d*j for j = 1...n. Figure 2 shows an example of aligning the sequences x = AAAC and y = AGC (with m = 1, s = 1, d = 2):

         A    G    C
     0  -2   -4   -6
A   -2   1   -1   -3
A   -4  -1    0   -2
A   -6  -3   -2   -1
C   -8  -5   -4   -1

Figure 2: Example

The question now is how to obtain the optimal alignment itself, not just its score. We will simply keep pointers in the dynamic programming table to keep a trace of how we obtained each entry; therefore, for each entry A(i, j) we have pointer(s) to the entry(ies) that resulted in the value of A(i, j). At the end, we trace back along a path from (m, n) to (0, 0). On such a path, a diagonal pointer at (i, j) signifies the alignment of x_i with y_j, an upward pointer at (i, j) signifies the alignment of x_i with a gap, and a left pointer at (i, j) signifies the alignment of y_j with a gap. Figure 3 shows how pointers are kept during the computation; tracing back from A(4, 3) in the example above yields the alignment

AAAC
AG-C

Figure 3: Keeping pointers

Figure 4 illustrates the final algorithm, known as the Needleman-Wunsch algorithm. The algorithm has O(mn) time and space complexity, since it involves only the computation of the matrix A, and each entry requires constant time to compute (checking 3 other entries and setting at most three pointers).

1. Initialization
   A(0, 0) = 0
   A(i, 0) = -i*d for i = 1...m
   A(0, j) = -j*d for j = 1...n

2. Main Iteration (aligning prefixes)
   for each i = 1...m
      for each j = 1...n
         A(i, j) = max of:
            A(i-1, j-1) + s(x_i, y_j)   [case 1]
            A(i-1, j) - d               [case 2]
            A(i, j-1) - d               [case 3]
         Ptr(i, j) = Diag [case 1], Up [case 2], or Left [case 3],
                     according to the case achieving the max

3. Termination
   A(m, n) is the optimal score, and from Ptr(m, n) we can trace back an optimal alignment.

Figure 4: The Needleman-Wunsch algorithm
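The algorithm of Figure 4 translates almost line for line into code. The sketch below is one possible transcription: instead of storing explicit pointers, the traceback recomputes which case achieved the max at each cell, which gives the same result with less bookkeeping.

```python
def needleman_wunsch(x, y, m=1, s=1, d=2):
    """Global alignment by the Needleman-Wunsch dynamic program.
    Scores: +m for a match, -s for a mismatch, -d per gap character.
    Returns (optimal score, aligned x, aligned y)."""
    A = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i in range(1, len(x) + 1):
        A[i][0] = -i * d                      # x_1..x_i vs the empty string
    for j in range(1, len(y) + 1):
        A[0][j] = -j * d
    for i in range(1, len(x) + 1):
        for j in range(1, len(y) + 1):
            sub = m if x[i - 1] == y[j - 1] else -s
            A[i][j] = max(A[i - 1][j - 1] + sub,   # case 1: x_i with y_j
                          A[i - 1][j] - d,         # case 2: x_i with a gap
                          A[i][j - 1] - d)         # case 3: y_j with a gap
    # trace back from (len(x), len(y)) to (0, 0)
    ax, ay, i, j = "", "", len(x), len(y)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and \
           A[i][j] == A[i - 1][j - 1] + (m if x[i - 1] == y[j - 1] else -s):
            ax, ay, i, j = x[i - 1] + ax, y[j - 1] + ay, i - 1, j - 1
        elif i > 0 and A[i][j] == A[i - 1][j] - d:
            ax, ay, i = x[i - 1] + ax, "-" + ay, i - 1
        else:
            ax, ay, j = "-" + ax, y[j - 1] + ay, j - 1
    return A[len(x)][len(y)], ax, ay

score, ax, ay = needleman_wunsch("AAAC", "AGC")
print(score)   # -1
```

On x = AAAC, y = AGC with the (+1, -1, -2) scheme, the optimal score is -1; when ties occur in the max, different tie-breaking orders can produce different (equally optimal) alignments.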

References

Pevzner P., Computational Molecular Biology, Chapter 1.
Setubal J., Meidanis J., Introduction to Computational Molecular Biology, Chapter 3.