Chapter 3. Multiple Sequence Alignment
- Darcy Perkins
- 7 years ago
3.1 Definitions

Let $S_1, S_2, \ldots, S_k$ be $k$ sequences over an alphabet $X$. We use $|S_i|$ to denote the length of $S_i$. An alignment of $S_1, S_2, \ldots, S_k$ is given by a $k \times n$ matrix $A$, where $n \ge |S_i|$ for every $i \le k$, such that

- Row $i$ contains the characters of $S_i$ in order, interspersed with $n - |S_i|$ spaces, and
- Each column contains at least one letter from $X$.

Example: The following is an alignment of 4 sequences:

    M Q P I L L - G
    M L R - L L - -
    M K - I L L L -
    M P P V L L I -
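The two conditions in the definition can be checked mechanically. The following is a minimal sketch, assuming each row of the matrix is stored as a Python string and that `-` stands for a space; the function name `is_alignment` is an illustrative choice, not something from the chapter.

```python
GAP = "-"

def is_alignment(rows, seqs, gap=GAP):
    """Return True if `rows` is a valid alignment of `seqs`."""
    if not rows or len(rows) != len(seqs):
        return False
    n = len(rows[0])
    # All rows must have the same length n (a k x n matrix).
    if any(len(r) != n for r in rows):
        return False
    # Row i, with gaps removed, must spell S_i in order.
    if any(r.replace(gap, "") != s for r, s in zip(rows, seqs)):
        return False
    # Every column must contain at least one letter (not all gaps).
    return all(any(r[j] != gap for r in rows) for j in range(n))

# The 4-sequence example from the definition.
rows = ["MQPILL-G", "MLR-LL--", "MK-ILLL-", "MPPVLLI-"]
seqs = ["MQPILLG", "MLRLL", "MKILLL", "MPPVLLI"]
print(is_alignment(rows, seqs))  # True
```

A matrix with an all-gap column, such as `["A-", "A-"]` for the sequences `["A", "A"]`, is rejected by the last check.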
3.2 Biological motivation

Multiple sequence alignments are the starting point for many analyses of protein families. A good multiple alignment allows us to:

- Find common conserved regions (or motif patterns) among sequences.
- Detect members of a gene family. Proteins are categorized into families. A protein family is a class of homologous proteins with similar sequence, structure, function, and/or evolutionary history. When an unknown protein is newly sequenced, one would often like to know which family it belongs to, as this can be a clue to its function. One approach to finding the correct family for a protein is to compare its sequence to the alignment of each family.
- Backtrack evolutionary paths through sequence similarity. By counting the mutations that are necessary to explain the transformation from an ancestor sequence to a current sequence, one can estimate the evolutionary time at which two sequences diverged.
3.3 Scoring a multiple sequence alignment

The score of a multiple alignment is defined as the sum of the scores of its columns. Various scoring schemes have been proposed to score a column.

1. The Sum of Pairs (SP) scoring scheme. The SP score of a column in the alignment is the sum of the scores of all pairs of characters in the column. For example, in the example above, the score of the 3rd column is

$$SP(P, R, \text{-}, P) = s(P, R) + s(P, \text{-}) + s(P, P) + s(R, \text{-}) + s(R, P) + s(\text{-}, P)$$

where $s(a, b)$ is the score of the pair $a$ and $b$ (including spaces) and $s(\text{-}, \text{-}) = 0$.

Exercise: Assume we score a match 4, a mismatch -2, and an indel -1. Find the SP score of the above alignment.

2. Consensus score. The consensus of a multiple alignment is the sequence of the most common characters in each column of the alignment. For example,

                M Q P I L - L
                M L R - L - L
                M K - I L L L
                M P P V L L I
    consensus   M Q P I L L L

The consensus score of a column is the number of characters (including spaces) that are identical to the consensus character in the column.
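Both column scores are short to compute. The sketch below assumes rows are stored as strings with `-` for a space and uses the scores from the exercise (match 4, mismatch -2, indel -1) as default parameters; the function names are illustrative. The alignment score is taken as the sum of column scores, as defined above.

```python
from collections import Counter
from itertools import combinations

GAP = "-"

def pair_score(a, b, match=4, mismatch=-2, indel=-1):
    """s(a, b) for two characters of a column, with s(-, -) = 0."""
    if a == GAP and b == GAP:
        return 0
    if a == GAP or b == GAP:
        return indel
    return match if a == b else mismatch

def sp_column(column):
    """Sum of s(a, b) over all unordered pairs of characters in a column."""
    return sum(pair_score(a, b) for a, b in combinations(column, 2))

def sp_score(rows):
    """SP score of an alignment: sum of the SP scores of its columns."""
    return sum(sp_column(col) for col in zip(*rows))

def consensus_score(rows):
    """Sum over columns of the count of the most common character."""
    return sum(max(Counter(col).values()) for col in zip(*rows))

# Third column of the first example: P, R, -, P.
print(sp_column("PR-P"))  # -3
# Consensus score of the second example.
print(consensus_score(["MQPIL-L", "MLR-L-L", "MK-ILLL", "MPPVLLI"]))  # 18
```

Counting the most common character per column avoids having to break ties when two characters are equally frequent, since the count is the same whichever one is chosen as the consensus character.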
Multiple Sequence Alignment Problem (MSA)

Instance: A set of $k$ sequences and a scoring scheme (say, SP with the substitution matrix BLOSUM62).
Question: Find an alignment of the given sequences that has the maximum score.

Remarks:

1. The pairwise alignment problem is the special case of the MSA problem in which there are only two sequences.
2. The optimal multiple alignment of all the sequences does not necessarily induce an optimal alignment of a given pair. For instance, consider this optimal alignment

    A T R
    A - R
    - T R
    A T R
    A T R

in which a mismatch and an indel each score -2. In this alignment, the 2nd and 3rd sequences are not optimally aligned. Their optimal alignment is

    A R
    T R
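The second remark can be checked numerically by projecting the multiple alignment onto two rows and comparing scores. In the sketch below, mismatch and indel score -2 as in the remark; the match score is not given in the text, so a value of +1 is assumed purely for illustration.

```python
GAP = "-"

def induced_pair(rows, i, j):
    """Project a multiple alignment onto rows i and j, dropping all-gap columns."""
    cols = [(a, b) for a, b in zip(rows[i], rows[j]) if (a, b) != (GAP, GAP)]
    return "".join(a for a, _ in cols), "".join(b for _, b in cols)

def pair_alignment_score(x, y, match=1, mismatch=-2, indel=-2):
    """Score of a given pairwise alignment (match=1 is an assumed value)."""
    total = 0
    for a, b in zip(x, y):
        if a == GAP or b == GAP:
            total += indel
        else:
            total += match if a == b else mismatch
    return total

msa = ["ATR", "A-R", "-TR", "ATR", "ATR"]
induced = induced_pair(msa, 1, 2)          # ("A-R", "-TR")
print(pair_alignment_score(*induced))      # -3: two indels and one match
print(pair_alignment_score("AR", "TR"))    # -1: one mismatch and one match
```

Under these assumed scores the induced alignment of the 2nd and 3rd sequences is worse than their direct pairwise alignment, which is the point of the remark.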
3.4 Dynamic Programming Algorithm for MSA

To solve the MSA problem for $k$ sequences $S_1, S_2, \ldots, S_k$, we generalize the two-sequence case. For simplicity, we assume each sequence is of length $n$. Instead of a 2-dimensional table, we have a $k$-dimensional table $T$ of size $(n+1) \times (n+1) \times \cdots \times (n+1)$ with $(n+1)^k$ entries. For each entry $(i_1, i_2, \ldots, i_k)$ of $T$, $s(i_1, i_2, \ldots, i_k)$ is the score of the optimal alignment of the length-$i_1$ prefix of $S_1$, the length-$i_2$ prefix of $S_2$, ..., up to the length-$i_k$ prefix of $S_k$. Finally, we use $S_1[j]$ to denote the $j$th letter of $S_1$.

Recall that for the case $k = 2$, we have the recurrence relation

$$s(i_1, i_2) = \max \begin{cases} s(i_1 - 1, i_2 - 0) + \delta(S_1[i_1], \text{-}), & S_1[i_1] \text{ vs } \text{-} \\ s(i_1 - 0, i_2 - 1) + \delta(\text{-}, S_2[i_2]), & \text{-} \text{ vs } S_2[i_2] \\ s(i_1 - 1, i_2 - 1) + \delta(S_1[i_1], S_2[i_2]), & S_1[i_1] \text{ vs } S_2[i_2] \end{cases} \qquad (3.1)$$

where $S_1[i_1]$ and $S_2[i_2]$ are the last characters of the two prefixes of $S_1$ and $S_2$, respectively.

The recurrence relation (3.1) can be rewritten as

$$s(\bar{i}) = \max \{\, s(\bar{i} - \bar{b}) + \delta(\bar{c}) \mid \bar{b} = (1, 0), (0, 1), (1, 1) \,\}$$

where $\bar{i} = (i_1, i_2)$ and $\bar{c} = (c_1, c_2)$ is defined as follows: $c_1 = S_1[i_1]$ if $b[1] = 1$ and - (space) otherwise; $c_2 = S_2[i_2]$ if $b[2] = 1$ and - (space) otherwise.
The recurrence relation for $k$ sequences becomes

$$s(\bar{i}) = \max \{\, s(\bar{i} - \bar{b}) + SP(\bar{c}) \mid \bar{b} \in \{0, 1\}^k \setminus \{(0, 0, \ldots, 0)\} \,\}$$

where $\bar{i} = (i_1, i_2, \ldots, i_k)$ and $\bar{c} = (c_1, c_2, \ldots, c_k)$: $c_j = S_j[i_j]$ if $b[j] = 1$ and - (space) otherwise, for $j = 1, 2, \ldots, k$.

Exercise: What are the base conditions for $k$ sequences?

This generalization takes $O(k^2 2^k n^k)$ time. The reason is that there are $(n+1)^k = O(n^k)$ entries in the table; for each entry, we consider $2^k - 1$ possibilities; and the naive calculation of the SP score for a column takes $O(k^2)$ steps. The space complexity is $O(n^k)$. Hence, this approach is no longer efficient for aligning a protein family with hundreds of proteins. In addition, it is very unlikely that a polynomial time algorithm for MSA exists, because of the following complexity result.

Theorem: Finding an optimal multiple alignment with the SP scoring scheme is NP-hard.
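The $k$-sequence recurrence can be written down almost literally with memoized recursion. The following is a sketch only: the pair scores (match 1, mismatch -1, indel -2) are assumed for illustration, and the base condition $s(0, \ldots, 0) = 0$ is one natural choice that the code relies on. It is exponential in $k$, as the analysis above says, so it is only usable for a few short sequences.

```python
from functools import lru_cache
from itertools import combinations, product

GAP = "-"

def pair_score(a, b, match=1, mismatch=-1, indel=-2):
    """Assumed pair scores; s(-, -) = 0."""
    if a == GAP and b == GAP:
        return 0
    if a == GAP or b == GAP:
        return indel
    return match if a == b else mismatch

def sp(column):
    """Naive O(k^2) SP score of one column."""
    return sum(pair_score(a, b) for a, b in combinations(column, 2))

def msa_score(seqs):
    """Optimal SP score of a multiple alignment of `seqs`."""
    k = len(seqs)
    # All vectors b in {0,1}^k except the all-zero vector: 2^k - 1 moves.
    moves = [b for b in product((0, 1), repeat=k) if any(b)]

    @lru_cache(maxsize=None)
    def s(idx):
        if not any(idx):
            return 0  # assumed base condition: all prefixes empty
        best = float("-inf")
        for b in moves:
            # Skip moves that would push some index below zero.
            if any(bj > ij for bj, ij in zip(b, idx)):
                continue
            prev = tuple(ij - bj for ij, bj in zip(idx, b))
            # c_j = S_j[i_j] if b[j] = 1, and a space otherwise.
            column = [seqs[j][idx[j] - 1] if b[j] else GAP for j in range(k)]
            best = max(best, s(prev) + sp(column))
        return best

    return s(tuple(len(x) for x in seqs))

print(msa_score(["AR", "TR"]))      # 0: one mismatch and one match
print(msa_score(["A", "A", "A"]))   # 3: one column with three matching pairs
```

The legality check on each move takes care of the table boundaries, so only the all-zero entry needs an explicit base value in this formulation.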
3.5 Progressive alignment approaches

Since multiple sequence alignment is NP-hard, various heuristic approaches have been proposed. A commonly used one is progressive alignment. This approach works by aligning sequences using a series of pairwise alignments: initially, two closely related sequences are aligned, and this alignment is fixed. Then a third sequence is chosen and aligned to the first alignment, and so on. This process is iterated until all sequences have been aligned. The heuristic is fast, but it does not guarantee an optimal alignment.
Star Alignment

The idea of the star alignment is to find a sequence that is most similar to all the rest, and then to use it as the center of a star to which all the other sequences are aligned.

Consider 5 sequences

    S_1 = ATTCGGATT
    S_2 = ATCCGGATT
    S_3 = ATGGAATTTT
    S_4 = ATGTTGTT
    S_5 = AGTCAGG

To calculate the center sequence, we compute all the pairwise alignment scores. Assume these pairwise alignment scores are given in the following matrix:

[Matrix of pairwise alignment scores; rows and columns are $S_1, S_2, S_3, S_4, S_5$.]

Summing the pairwise scores in each row of the matrix, we obtain that $S_1$ is closest to all the other sequences. Hence, $S_1$ is selected to be at the center of the star.
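Center selection amounts to computing all pairwise global alignment scores and taking the row with the largest sum. The sketch below uses a standard Needleman-Wunsch score with assumed parameters (match 1, mismatch -1, indel -2); the chapter does not state which scores produced its matrix, so the center chosen by this code for the five sequences above need not agree with the text.

```python
def global_score(x, y, match=1, mismatch=-1, indel=-2):
    """Score of an optimal global alignment of x and y (assumed scores)."""
    prev = [j * indel for j in range(len(y) + 1)]
    for i in range(1, len(x) + 1):
        cur = [i * indel]
        for j in range(1, len(y) + 1):
            diag = prev[j - 1] + (match if x[i - 1] == y[j - 1] else mismatch)
            cur.append(max(diag, prev[j] + indel, cur[j - 1] + indel))
        prev = cur
    return prev[-1]

def star_center(seqs):
    """Index of the sequence with the largest sum of pairwise scores."""
    k = len(seqs)
    totals = [
        sum(global_score(seqs[i], seqs[j]) for j in range(k) if j != i)
        for i in range(k)
    ]
    return max(range(k), key=totals.__getitem__)

seqs = ["ATTCGGATT", "ATCCGGATT", "ATGGAATTTT", "ATGTTGTT", "AGTCAGG"]
center = star_center(seqs)
print("center:", "S_%d" % (center + 1))
```

Ties are broken in favor of the lowest index, which is an arbitrary choice of this sketch.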
1. Calculating the optimal pairwise alignments between $S_1$ and all other sequences. Assume they are

    S_1: A T T C G G A T T
    S_2: A T C C G G A T T

    S_1: A T T C G G A T T - -
    S_3: A T G - G A A T T T T

    S_1: A T T C G G A T T
    S_4: A T G T T G - T T

    S_1: A T T C G G A T T
    S_5: A G T C A G G - -
2. Merging all the alignments using the "once a gap, always a gap" principle. We start with the alignment of $S_1$ and $S_2$. Then we add $S_3$. Since two spaces follow $S_1$ in the alignment of $S_1$ and $S_3$, two spaces need to be added to the ends of $S_1$ and $S_2$.

    S_2: A T C C G G A T T - -
    S_1: A T T C G G A T T - -
    S_3: A T G - G A A T T T T

These gaps are never removed from the sequences in the alignments ("once a gap, always a gap"). Finally, we add $S_4$ and $S_5$ in order.

    S_2: A T C C G G A T T - -
    S_3: A T G - G A A T T T T
    S_1: A T T C G G A T T - -
    S_4: A T G T T G - T T - -

    S_2: A T C C G G A T T - -
    S_3: A T G - G A A T T T T
    S_4: A T G T T G - T T - -
    S_1: A T T C G G A T T - -
    S_5: A G T C A G G - - - -
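The merging step can be sketched as follows, assuming each pairwise alignment is given as a pair of strings `(center_row, other_row)` with `-` for a space. For every position of the center sequence, the merged alignment keeps the largest number of gaps that any pairwise alignment placed in front of it, which is one way to realize "once a gap, always a gap"; the function names and the left-justified padding are choices of this sketch.

```python
GAP = "-"

def gaps_before(center_row):
    """gaps[i] = number of gaps in front of the i-th center letter;
    the final entry counts the gaps after the last letter."""
    gaps, run = [], 0
    for ch in center_row:
        if ch == GAP:
            run += 1
        else:
            gaps.append(run)
            run = 0
    gaps.append(run)
    return gaps

def merge_star(pairwise):
    """Merge (center_row, other_row) alignments that share one center.
    Returns [merged_center, merged_other_1, merged_other_2, ...]."""
    per_pair = [gaps_before(c) for c, _ in pairwise]
    need = [max(g) for g in zip(*per_pair)]  # once a gap, always a gap
    center = pairwise[0][0].replace(GAP, "")
    merged_center = "".join(GAP * need[i] + ch for i, ch in enumerate(center))
    rows = [merged_center + GAP * need[-1]]
    for (_, other), have in zip(pairwise, per_pair):
        out, pos = [], 0
        for i, g in enumerate(have):
            # Letters facing existing center gaps, padded up to need[i].
            out.append(other[pos:pos + g] + GAP * (need[i] - g))
            pos += g
            if i < len(center):
                out.append(other[pos])  # character facing center letter i
                pos += 1
        rows.append("".join(out))
    return rows

pairwise = [
    ("ATTCGGATT", "ATCCGGATT"),      # S_1 with S_2
    ("ATTCGGATT--", "ATG-GAATTTT"),  # S_1 with S_3
    ("ATTCGGATT", "ATGTTG-TT"),      # S_1 with S_4
    ("ATTCGGATT", "AGTCAGG--"),      # S_1 with S_5
]
for row in merge_star(pairwise):
    print(row)
```

On the example above this yields the same five rows as the final alignment in the text, with the center row listed first.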
Complexity Analysis: Given $k$ sequences of length $n$, the star alignment approach takes $O(k^2 n^2)$ time to calculate all the pairwise alignment scores and then find the sequence $S$ that is at the center of the star. It then takes $O(k l)$ time to merge all the pairwise alignments into a multiple alignment, where $l$ is the length of the resulting alignment.
Exercise: The star alignment approach is not guaranteed to give an optimal alignment because of the "once a gap, always a gap" principle, which tends to introduce excessive gaps. Here is an example to show this. Consider the following three sequences

    S_1: TCCGAA
    S_2: TCGAGA
    S_3: TCCAGA

Optimal pairwise alignments are

    S_1: TCCGA-A
    S_2: TC-GAGA

    S_1: TCC-GAA
    S_3: TCCAGA-

When these alignments are merged, the resulting multiple alignment

    S_1: TCC-GA-A
    S_2: TC--GAGA
    S_3: TCCAGA--

is not optimal since it is worse than the alignment

    S_1: TCCGA-A
    S_2: TC-GAGA
    S_3: TCC-AGA
ClustalW

INPUT: Sequences $S_1, S_2, \ldots, S_k$.

(1) Compute all the $\binom{k}{2}$ pairwise alignment scores and convert them into distances.
(2) Construct a guide tree $T$ from the pairwise distances using the Neighbor-Joining method.
(3) Gradually build up the multiple sequence alignment following the order in the guide tree $T$.

In Step (3), sequence-sequence alignments can be done with the dynamic programming approach.
A sequence is added to an existing group by aligning it to each sequence in the group in turn. The highest-scoring pairwise alignment is used to merge the sequence into the alignment of the group, following the principle "once a gap, always a gap". For example, consider the following group alignment

    S_1: A G - A T -
    S_2: - G A A T C

and a sequence $S$: CGAAATC. The highest-scoring pairwise alignment is

    S_2: - G - A A T C
    S:   C G A A A T C

Hence, $S$ is merged into the group alignment as

    S_1: A G - - A T -
    S_2: - G - A A T C
    S:   C G A A A T C
To align a group with a group, all sequence pairs between the two groups are tried. The highest-scoring pairwise alignment determines the alignment of the two groups. For instance, consider the following two groups:

    S_1: A T T G C C A T T - -
    S_2: A T C - C A A T T T T

    S_3: A T G G C C A T T
    S_4: A T C T T C - T T

The alignment of $S_1$ and $S_3$,

    S_1: A T T G C C A T T
    S_3: A T G G C C A T T

has the maximum score. Thus it is used for aligning the two groups as

    S_2: A T C - C A A T T T T
    S_1: A T T G C C A T T - -
    S_3: A T G G C C A T T - -
    S_4: A T C T T C - T T - -
ClustalW suffers from the following problems:

(a) An optimal alignment may not be found.
(b) The guide tree is derived from pairwise distances and is therefore less reliable.
(c) When all the sequences are highly divergent (say, less than 25-30% identity between any pair of sequences), this progressive approach becomes less reliable.
T-Coffee

T-Coffee is another fast multiple sequence alignment method. Its name stands for Tree-based Consistency Objective Function for alignment Evaluation.

The progressive alignment method suffers from its greediness: errors made in the first alignments cannot be rectified later as the rest of the sequences are added. T-Coffee was proposed to minimize that effect. Another motivation for T-Coffee is to use properties of both local and global pairwise alignments of the given sequences. It has 3 steps:

Step 1: Generate a primary library of pairwise alignments.
Step 2: Extend the library.
Step 3: Progressively align all the sequences.
Step 1. Generating a library of pairwise alignments.

The primary library contains a set of pairwise alignments between all the sequences to be aligned. The global alignment library is generated using ClustalW; the local alignment library is generated using Lalign from the FASTA package.

Each pairwise alignment is considered as a list of pairwise residue matches (residue $a$ of sequence A is aligned with residue $b$ of sequence B). Each of these matches is a constraint. The constraints are weighted for later use, since some may come from parts of alignments that are more likely to be correct. Each aligned pair (a constraint) in a pairwise alignment receives a weight equal to the percent identity within that alignment.

To combine local and global alignment information, when a pair is duplicated between the two libraries, it is merged into a single entry whose weight is the sum of the two weights.
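A primary library of this kind can be held in a dictionary keyed by residue pairs. The sketch below is an illustration under stated assumptions: percent identity is computed over the columns where both rows have a residue (the text does not spell out the denominator), residues are numbered from 0, and the names `aligned_pairs`, `percent_identity`, and `add_to_library` are hypothetical helpers rather than part of T-Coffee.

```python
GAP = "-"

def aligned_pairs(row_a, row_b):
    """Yield (i, j): residue i of A is aligned with residue j of B."""
    i = j = 0
    for a, b in zip(row_a, row_b):
        if a != GAP and b != GAP:
            yield i, j
        i += a != GAP  # advance only past real residues
        j += b != GAP

def percent_identity(row_a, row_b):
    """Identical columns as a percentage of gap-free columns (assumed)."""
    cols = [(a, b) for a, b in zip(row_a, row_b) if a != GAP and b != GAP]
    if not cols:
        return 0.0
    return 100.0 * sum(a == b for a, b in cols) / len(cols)

def add_to_library(library, name_a, row_a, name_b, row_b):
    """Add every aligned pair with the alignment's weight; a pair that is
    already present (e.g. from the other library) gets the summed weight."""
    w = percent_identity(row_a, row_b)
    for i, j in aligned_pairs(row_a, row_b):
        key = ((name_a, i), (name_b, j))
        library[key] = library.get(key, 0.0) + w
    return library

library = {}
add_to_library(library, "A", "ACGT", "B", "ACGA")  # a global alignment, 75%
add_to_library(library, "A", "ACG", "B", "ACG")    # a local alignment, 100%
print(library[(("A", 0), ("B", 0))])  # 175.0
print(library[(("A", 3), ("B", 3))])  # 75.0
```

The sketch assumes the two sequences are always passed in the same order; otherwise keys would have to be normalized before summing.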
[Figure: example pairwise alignments of four sequences SeqA, SeqB, SeqC, and SeqD.]

    SeqA GARFIELD THE LAST FAT CAT
    SeqB GARFIELD THE FAST CAT

    SeqA GARFIELD THE LAST FA-T CAT
    SeqC GARFIELD THE VERY FAST CAT

    SeqA GARFIELD THE LAST FAT CAT
    SeqD THE ---- FAT CAT

    SeqB GARFIELD THE FAST CAT
    SeqC GARFIELD THE VERY FAST CAT

    SeqB GARFIELD THE FAST CAT
    SeqD THE FA-T CAT

    SeqC GARFIELD THE VERY FAST CAT
    SeqD THE ---- FA-T CAT

[Figure: SeqA and SeqB aligned directly, through SeqC, and through SeqD.]

    SeqA GARFIELD THE LAST FAT CAT
    SeqB GARFIELD THE FAST CAT

    SeqA GARFIELD THE LAST FAT CAT
    SeqC GARFIELD THE VERY FAST CAT
    SeqB GARFIELD THE FAST CAT

    SeqA GARFIELD THE LAST FAT CAT
    SeqD THE FAT CAT
    SeqB GARFIELD THE FAST CAT
Step 2. Extending the library.

Each pair of aligned residues in the library is reassigned a weight that reflects some of the information contained in the whole library. The triplet approach is used for reassigning a weight to each aligned pair: it takes each aligned residue pair from the library and checks the alignment of the two residues with residues from the remaining sequences. The weight associated with the pair is the sum of all the weights gathered through the examination of all the triplets involving that pair. The more intermediate sequences support an aligned pair, the higher its weight.
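The triplet re-weighting can be sketched on top of a library dictionary of the form `{((seq_a, i), (seq_b, j)): weight}`. The text above does not say how much weight one triplet contributes; the sketch assumes the rule described in the T-Coffee paper, where a triplet through a residue of a third sequence contributes the minimum of its two library weights. It only re-weights pairs already present in the library, and the weights in the usage example are hypothetical.

```python
def extend_library(library):
    """Return a new library in which each pair's weight is its own weight
    plus, for every third-sequence residue aligned with both members of
    the pair, the minimum of the two connecting weights (assumed rule)."""
    # Symmetric lookup: w[x][y] is the library weight between residues x and y.
    w = {}
    for (x, y), weight in library.items():
        w.setdefault(x, {})[y] = weight
        w.setdefault(y, {})[x] = weight

    extended = {}
    for (x, y), weight in library.items():
        total = weight
        for z, w_xz in w[x].items():
            # z must belong to a third sequence and also be aligned with y.
            if z[0] not in (x[0], y[0]) and z in w[y]:
                total += min(w_xz, w[y][z])
        extended[(x, y)] = total
    return extended

# Hypothetical weights for one residue in each of three sequences.
library = {
    (("A", 0), ("B", 0)): 80.0,
    (("A", 0), ("C", 0)): 60.0,
    (("B", 0), ("C", 0)): 100.0,
}
extended = extend_library(library)
print(extended[(("A", 0), ("B", 0))])  # 140.0 = 80 + min(60, 100)
```

A pair with no supporting intermediate sequence simply keeps its primary weight, which matches the remark that more supporting sequences mean a higher weight.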
Step 3. Progressively align all the sequences.

The weight is zero for any residue pair that never occurs in the library. (This is true for the majority of residue pairs.) Thus, for any residue $a$ in SeqA and $b$ in SeqB, a weight is assigned to $(a, b)$ by Step 1 and Step 2, and all the weights together form a position-dependent scoring scheme $\delta_W$ for alignment.

Step 3.1 Pairwise alignments are first made using the scoring scheme $\delta_W$ to produce a distance matrix between all the sequences. The matrix in turn is used to produce a guide tree $T$ using the Neighbor-Joining method.

Step 3.2 Gradually build up the multiple sequence alignment following the order in the guide tree $T$.

In Steps 3.1 and 3.2, the gap penalty is set to 0. This stems from the fact that the library weights were computed from pairwise alignments in which such penalties had already been applied.
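With a position-dependent scoring scheme and a gap penalty of 0, the pairwise dynamic program becomes very simple. The sketch below computes only the best total weight; `weight(i, j)` is an assumed callback returning the library weight for residue `i` of the first sequence and residue `j` of the second (0-based), and the weights in the example are hypothetical.

```python
def align_with_library(len_a, len_b, weight):
    """Best total weight of an alignment of two sequences of the given
    lengths, scoring aligned residues with weight(i, j) and gaps with 0."""
    s = [[0.0] * (len_b + 1) for _ in range(len_a + 1)]
    for i in range(1, len_a + 1):
        for j in range(1, len_b + 1):
            s[i][j] = max(
                s[i - 1][j],                               # residue i vs space
                s[i][j - 1],                               # space vs residue j
                s[i - 1][j - 1] + weight(i - 1, j - 1),    # residue i vs residue j
            )
    return s[len_a][len_b]

# Hypothetical extended-library weights; missing pairs weigh 0.
lib = {(0, 0): 140.0, (1, 2): 100.0, (2, 1): 90.0}
best = align_with_library(3, 3, lambda i, j: lib.get((i, j), 0.0))
print(best)  # 240.0: pairs (0, 0) and (1, 2); (2, 1) would cross (1, 2)
```

Because gaps cost nothing, the first row and column of the table stay at 0 and the result is the heaviest set of library pairs that can appear together in one alignment.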
Comparison with other methods

T-Coffee performed much better than other top alignment methods on the benchmark alignment data BaliBase.

    Method     Cat1 (81)  Cat2 (23)  Cat3  Cat4  Cat5  All
    ClustalW
    Prrp
    T-Coffee

Questions

1. Can T-Coffee be improved by using a different weighting scheme?
2. Can T-Coffee be improved by building a library of 3-sequence alignments?

References

1. Gotoh, O. (1996) Significant improvement in accuracy of multiple protein sequence alignments by iterative refinements as assessed by reference to structural alignment. JMB 264.
2. Thompson, J., Plewniak, F. and Poch, O. (1999) BaliBase: A benchmark alignment database for the evaluation of multiple sequence alignment programs. Bioinformatics 15.
3. Thompson, J., Plewniak, F. and Poch, O. (1999) A comprehensive comparison of multiple sequence alignment programs. NAR 27.
4. Notredame, C., Higgins, D. and Heringa, J. (2000) T-Coffee:... JMB 302.
More informationLecture 1: Systems of Linear Equations
MTH Elementary Matrix Algebra Professor Chao Huang Department of Mathematics and Statistics Wright State University Lecture 1 Systems of Linear Equations ² Systems of two linear equations with two variables
More informationPractical Guide to the Simplex Method of Linear Programming
Practical Guide to the Simplex Method of Linear Programming Marcel Oliver Revised: April, 0 The basic steps of the simplex algorithm Step : Write the linear programming problem in standard form Linear
More informationSolving Systems of Linear Equations
LECTURE 5 Solving Systems of Linear Equations Recall that we introduced the notion of matrices as a way of standardizing the expression of systems of linear equations In today s lecture I shall show how
More informationARTIFICIAL INTELLIGENCE (CSCU9YE) LECTURE 6: MACHINE LEARNING 2: UNSUPERVISED LEARNING (CLUSTERING)
ARTIFICIAL INTELLIGENCE (CSCU9YE) LECTURE 6: MACHINE LEARNING 2: UNSUPERVISED LEARNING (CLUSTERING) Gabriela Ochoa http://www.cs.stir.ac.uk/~goc/ OUTLINE Preliminaries Classification and Clustering Applications
More informationIntroduction to Phylogenetic Analysis
Subjects of this lecture Introduction to Phylogenetic nalysis Irit Orr 1 Introducing some of the terminology of phylogenetics. 2 Introducing some of the most commonly used methods for phylogenetic analysis.
More informationChapter 20: Data Analysis
Chapter 20: Data Analysis Database System Concepts, 6 th Ed. See www.db-book.com for conditions on re-use Chapter 20: Data Analysis Decision Support Systems Data Warehousing Data Mining Classification
More informationHMM : Viterbi algorithm - a toy example
MM : Viterbi algorithm - a toy example.5.3.4.2 et's consider the following simple MM. This model is composed of 2 states, (high C content) and (low C content). We can for example consider that state characterizes
More informationHow To Cluster
Data Clustering Dec 2nd, 2013 Kyrylo Bessonov Talk outline Introduction to clustering Types of clustering Supervised Unsupervised Similarity measures Main clustering algorithms k-means Hierarchical Main
More informationEmbedded Systems 20 BF - ES
Embedded Systems 20-1 - Multiprocessor Scheduling REVIEW Given n equivalent processors, a finite set M of aperiodic/periodic tasks find a schedule such that each task always meets its deadline. Assumptions:
More informationOverview of Violations of the Basic Assumptions in the Classical Normal Linear Regression Model
Overview of Violations of the Basic Assumptions in the Classical Normal Linear Regression Model 1 September 004 A. Introduction and assumptions The classical normal linear regression model can be written
More informationClone Manager. Getting Started
Clone Manager for Windows Professional Edition Volume 2 Alignment, Primer Operations Version 9.5 Getting Started Copyright 1994-2015 Scientific & Educational Software. All rights reserved. The software
More informationMultisequence Alignment as a new tool for Network Traffic Analysis
Multisequence Alignment as a new tool for Network Traffic Analysis Krzysztof Fabjański 1, Adam Kozakiewicz 1, Anna Felkner 1, Piotr Kijewski 1 and Tomasz Kruk 1 1 NASK, Research and Academic Computer Network:
More informationA Review And Evaluations Of Shortest Path Algorithms
A Review And Evaluations Of Shortest Path Algorithms Kairanbay Magzhan, Hajar Mat Jani Abstract: Nowadays, in computer networks, the routing is based on the shortest path problem. This will help in minimizing
More informationInferring Probabilistic Models of cis-regulatory Modules. BMI/CS 776 www.biostat.wisc.edu/bmi776/ Spring 2015 Colin Dewey cdewey@biostat.wisc.
Inferring Probabilistic Models of cis-regulatory Modules MI/S 776 www.biostat.wisc.edu/bmi776/ Spring 2015 olin Dewey cdewey@biostat.wisc.edu Goals for Lecture the key concepts to understand are the following
More information1 Solving LPs: The Simplex Algorithm of George Dantzig
Solving LPs: The Simplex Algorithm of George Dantzig. Simplex Pivoting: Dictionary Format We illustrate a general solution procedure, called the simplex algorithm, by implementing it on a very simple example.
More informationLinear Algebra Notes
Linear Algebra Notes Chapter 19 KERNEL AND IMAGE OF A MATRIX Take an n m matrix a 11 a 12 a 1m a 21 a 22 a 2m a n1 a n2 a nm and think of it as a function A : R m R n The kernel of A is defined as Note
More information6.045: Automata, Computability, and Complexity Or, Great Ideas in Theoretical Computer Science Spring, 2010. Class 4 Nancy Lynch
6.045: Automata, Computability, and Complexity Or, Great Ideas in Theoretical Computer Science Spring, 2010 Class 4 Nancy Lynch Today Two more models of computation: Nondeterministic Finite Automata (NFAs)
More informationBioinformatics Resources at a Glance
Bioinformatics Resources at a Glance A Note about FASTA Format There are MANY free bioinformatics tools available online. Bioinformaticists have developed a standard format for nucleotide and protein sequences
More informationLecture 4: Exact string searching algorithms. Exact string search algorithms. Definitions. Exact string searching or matching
COSC 348: Computing for Bioinformatics Definitions A pattern (keyword) is an ordered sequence of symbols. Lecture 4: Exact string searching algorithms Lubica Benuskova http://www.cs.otago.ac.nz/cosc348/
More informationMolecular Databases and Tools
NWeHealth, The University of Manchester Molecular Databases and Tools Afternoon Session: NCBI/EBI resources, pairwise alignment, BLAST, multiple sequence alignment and primer finding. Dr. Georgina Moulton
More informationJoint models for classification and comparison of mortality in different countries.
Joint models for classification and comparison of mortality in different countries. Viani D. Biatat 1 and Iain D. Currie 1 1 Department of Actuarial Mathematics and Statistics, and the Maxwell Institute
More informationLecture 3: Finding integer solutions to systems of linear equations
Lecture 3: Finding integer solutions to systems of linear equations Algorithmic Number Theory (Fall 2014) Rutgers University Swastik Kopparty Scribe: Abhishek Bhrushundi 1 Overview The goal of this lecture
More informationEmbedded Systems 20 REVIEW. Multiprocessor Scheduling
Embedded Systems 0 - - Multiprocessor Scheduling REVIEW Given n equivalent processors, a finite set M of aperiodic/periodic tasks find a schedule such that each task always meets its deadline. Assumptions:
More informationA successful market segmentation initiative answers the following critical business questions: * How can we a. Customer Status.
MARKET SEGMENTATION The simplest and most effective way to operate an organization is to deliver one product or service that meets the needs of one type of customer. However, to the delight of many organizations
More informationNotes on Factoring. MA 206 Kurt Bryan
The General Approach Notes on Factoring MA 26 Kurt Bryan Suppose I hand you n, a 2 digit integer and tell you that n is composite, with smallest prime factor around 5 digits. Finding a nontrivial factor
More information2.3 Identify rrna sequences in DNA
2.3 Identify rrna sequences in DNA For identifying rrna sequences in DNA we will use rnammer, a program that implements an algorithm designed to find rrna sequences in DNA [5]. The program was made by
More informationData Mining Clustering (2) Sheets are based on the those provided by Tan, Steinbach, and Kumar. Introduction to Data Mining
Data Mining Clustering (2) Toon Calders Sheets are based on the those provided by Tan, Steinbach, and Kumar. Introduction to Data Mining Outline Partitional Clustering Distance-based K-means, K-medoids,
More informationNOTES ON LINEAR TRANSFORMATIONS
NOTES ON LINEAR TRANSFORMATIONS Definition 1. Let V and W be vector spaces. A function T : V W is a linear transformation from V to W if the following two properties hold. i T v + v = T v + T v for all
More informationCompact Representations and Approximations for Compuation in Games
Compact Representations and Approximations for Compuation in Games Kevin Swersky April 23, 2008 Abstract Compact representations have recently been developed as a way of both encoding the strategic interactions
More informationLinear Programming. March 14, 2014
Linear Programming March 1, 01 Parts of this introduction to linear programming were adapted from Chapter 9 of Introduction to Algorithms, Second Edition, by Cormen, Leiserson, Rivest and Stein [1]. 1
More informationLinearly Independent Sets and Linearly Dependent Sets
These notes closely follow the presentation of the material given in David C. Lay s textbook Linear Algebra and its Applications (3rd edition). These notes are intended primarily for in-class presentation
More information11 Multivariate Polynomials
CS 487: Intro. to Symbolic Computation Winter 2009: M. Giesbrecht Script 11 Page 1 (These lecture notes were prepared and presented by Dan Roche.) 11 Multivariate Polynomials References: MC: Section 16.6
More informationFUZZY CLUSTERING ANALYSIS OF DATA MINING: APPLICATION TO AN ACCIDENT MINING SYSTEM
International Journal of Innovative Computing, Information and Control ICIC International c 0 ISSN 34-48 Volume 8, Number 8, August 0 pp. 4 FUZZY CLUSTERING ANALYSIS OF DATA MINING: APPLICATION TO AN ACCIDENT
More informationStep by Step Guide to Importing Genetic Data into JMP Genomics
Step by Step Guide to Importing Genetic Data into JMP Genomics Page 1 Introduction Data for genetic analyses can exist in a variety of formats. Before this data can be analyzed it must imported into one
More information