Supplementary material: A benchmark of multiple sequence alignment programs upon structural RNAs Paul P. Gardner a Andreas Wilm b Stefan Washietl c



Similar documents
A short guide to phylogeny reconstruction

Mr G Baktavatchalam Student, ME Software Engineering PSG College of Technology gmbakthavatchalam@gmail.com ABSTRACT

HOBIT at the BiBiServ

Bio-Informatics Lectures. A Short Introduction

Protein & DNA Sequence Analysis. Bobbie-Jo Webb-Robertson May 3, 2004

SGI. High Throughput Computing (HTC) Wrapper Program for Bioinformatics on SGI ICE and SGI UV Systems. January, Abstract. Haruna Cofer*, PhD

FlipFlop: Fast Lasso-based Isoform Prediction as a Flow Problem

OD-seq: outlier detection in multiple sequence alignments

Bioinformatics Grid - Enabled Tools For Biologists.

PHYML Online: A Web Server for Fast Maximum Likelihood-Based Phylogenetic Inference

Arbres formels et Arbre(s) de la Vie

Pairwise Sequence Alignment

THREE DIMENSIONAL REPRESENTATION OF AMINO ACID CHARAC- TERISTICS

UGENE Quick Start Guide

Sequence homology search tools on the world wide web

PROC. CAIRO INTERNATIONAL BIOMEDICAL ENGINEERING CONFERENCE

Genome Explorer For Comparative Genome Analysis

Current Motif Discovery Tools and their Limitations

Database searching with DNA and protein sequences: An introduction Clare Sansom Date received (in revised form): 12th November 1999

EMBL-EBI Web Services

How To Teach An Rna Bioinformatics Course In Copenhagen

Integrating DNA Motif Discovery and Genome-Wide Expression Analysis. Erin M. Conlon

Three-Dimensional Window Analysis for Detecting Positive Selection at Structural Regions of Proteins

Lecture/Recitation Topic SMA 5303 L1 Sampling and statistical distributions

Phylogenetic Trees Made Easy

Analytical Study of Hexapod mirnas using Phylogenetic Methods

Multisequence Alignment as a new tool for Network Traffic Analysis

Optimal Contact Map Alignment of Protein-Protein Interfaces Vinay Pulim, 1 Bonnie Berger, 1,2 * Jadwiga Bienkowska, 1,3,* 1

Protein Sequence Analysis - Overview -

BIOINFORMATICS ORIGINAL PAPER doi: /bioinformatics/btm146

RNA Movies 2: sequential animation of RNA secondary structures

Hidden Markov Models in Bioinformatics. By Máthé Zoltán Kőrösi Zoltán 2006

BLAST. Anders Gorm Pedersen & Rasmus Wernersson

1. Introduction Gene regulation Genomics and genome analyses Hidden markov model (HMM)

Network Protocol Analysis using Bioinformatics Algorithms

Crowdclustering with Sparse Pairwise Labels: A Matrix Completion Approach

BIO 3350: ELEMENTS OF BIOINFORMATICS PARTIALLY ONLINE SYLLABUS

Inferring Non-Coding RNA Families and Classes by Means of Genome-Scale Structure-Based Clustering

Maximum-Likelihood Estimation of Phylogeny from DNA Sequences When Substitution Rates Differ over Sites1

2.3 Identify rrna sequences in DNA

Consensus alignment server for reliable comparative modeling with distant templates

Sequence Analysis 15: lecture 5. Substitution matrices Multiple sequence alignment

Large-N Random Matrices for RNA Folding

ALTER: program-oriented conversion of DNA and protein alignments

Activity IT S ALL RELATIVES The Role of DNA Evidence in Forensic Investigations

Core Bioinformatics. Degree Type Year Semester Bioinformàtica/Bioinformatics OB 0 1

Functional Architecture of RNA Polymerase I

Algorithms in Computational Biology (236522) spring 2007 Lecture #1

CD-HIT User s Guide. Last updated: April 5,

Multiple Sequence Alignment. Hot Topic 5/24/06 Kim Walker

Conditional Random Fields: An Introduction

Web Data Extraction: 1 o Semestre 2007/2008

BIOINFORMATICS ORIGINAL PAPER doi: /bioinformatics/btm377

The MIGenAS integrated bioinformatics toolkit for web-based sequence analysis

DNA, RNA and Protein Structure Prediction

A greedy algorithm for the DNA sequencing by hybridization with positive and negative errors and information about repetitions

Clone Manager. Getting Started

Distributed Dynamic Load Balancing for Iterative-Stencil Applications

INTREPID: a web server for prediction of functionally important residues by evolutionary analysis

Introduction to Bioinformatics AS Laboratory Assignment 6

Final Project Report

Evolution at Molecular Resolution

Protein Protein Interaction Networks

Visualizing RNA base pairing probabilities with RNAbow diagrams

Hidden Markov models in biological sequence analysis

university of copenhagen Bioinformatics Master Program

The Ramachandran Map of More Than. 6,500 Perfect Polypeptide Chains

Cell Phone based Activity Detection using Markov Logic Network

MATCH Commun. Math. Comput. Chem. 61 (2009)

1) Orthology of zebrafish HoxD4 and euteleost HoxD4a:

Sequence Conservation Metrics: Implementation and Comparative Analysis

A Tutorial in Genetic Sequence Classification Tools and Techniques

FART Neural Network based Probabilistic Motif Discovery in Unaligned Biological Sequences

Inferring Probabilistic Models of cis-regulatory Modules. BMI/CS Spring 2015 Colin Dewey

Pipeline Pilot Enterprise Server. Flexible Integration of Disparate Data and Applications. Capture and Deployment of Best Practices

Syllabus of B.Sc. (Bioinformatics) Subject- Bioinformatics (as one subject) B.Sc. I Year Semester I Paper I: Basic of Bioinformatics 85 marks

ClusterControl: A Web Interface for Distributing and Monitoring Bioinformatics Applications on a Linux Cluster

T cell Epitope Prediction

Algorithms in Bioinformatics I, WS06/07, C.Dieterich 47. This lecture is based on the following, which are all recommended reading:

Supporting Online Material for

Learning is a very general term denoting the way in which agents:

Using MATLAB: Bioinformatics Toolbox for Life Sciences

Protein annotation and modelling servers at University College London

NOVEL GENOME-SCALE CORRELATION BETWEEN DNA REPLICATION AND RNA TRANSCRIPTION DURING THE CELL CYCLE IN YEAST IS PREDICTED BY DATA-DRIVEN MODELS

Unipro UGENE User Manual Version

Phylogenetic Analysis using MapReduce Programming Model

Methods for big data in medical genomics

Similarity Searches on Sequence Databases: BLAST, FASTA. Lorenza Bordoli Swiss Institute of Bioinformatics EMBnet Course, Basel, October 2003

Bioinformatics: course introduction

Feb , Atlanta, Georgia. Evolutionary Bioinformatics Education: a National Science Foundation Chautauqua Course

The Central Dogma of Molecular Biology

FBIO - Fundations of Bioinformatics

Information Retrieval and Web Search Engines

Cross-references to the corresponding SWISS-PROT entries as well as to matched sequences from the PDB 3D-structure database 2 are also provided.

Data Mining and Knowledge Discovery in Databases (KDD) State of the Art. Prof. Dr. T. Nouri Computer Science Department FHNW Switzerland

DNA Mapping/Alignment. Team: I Thought You GNU? Lars Olsen, Venkata Aditya Kovuri, Nick Merowsky

Data Structures and Algorithms Written Examination

TITLE MOTIVATION OBJECTIVES AUDIENCE COURSE INSTRUCTORS. Analysis of regulatory sequences controlling the expression of gene networks

Keywords: evolution, genomics, software, data mining, sequence alignment, distance, phylogenetics, selection

Transcription:

Supplementary material: A benchmark of multiple sequence alignment programs upon structural RNAs Paul P. Gardner a Andreas Wilm b Stefan Washietl c a Department of Evolutionary Biology, University of Copenhagen, Universitetsparken 15, 2100 Copenhagen Ø, Denmark b Institut für Physikalische Biologie, Heinrich-Heine-Universität Düsseldorf, Universitätsstr. 1, D 40225 Düsseldorf, Germany c Institut für Theoretische Chemie und Molekulare Strukturbiologie, Universität Wien, Währingerstraße 17, A-1090 Wien, Austria Supplementary Materials 1

Program Version Reference Short description Align-m 2.1 [1] Uses a non-progressive local approach to guide a global alignment. ClustalW 1.82 [2, 3] The classic progressive alignment program. DIALIGN 2.2 [4 6] Aligns gap-free segments as a whole without introducing gaps. Handel 0.1 (dart) [7, 8] Phylogenetic alignment using a evolutionary hidden Markov model based on Thorne-Kishino-Felsenstein evolutionary model. MAFFT 4.22 [9] Rapid group-to-group alignment by fast Fourier transformation. MUSCLE 3.51 [10, 11] Employs a draft progressive step followed by an improved progressive and iterative refinement steps. PCMA 2.0 [12] Progressive method which aligns highly similar sequences as ClustalW and divergent groups by the T-Coffee strategy. POA 2 [13] Represents alignments as graphs which are directly aligned without the need for profiles. ProAlign 0.5 [14] Probabilistic progressive alignment combining a pair hidden Markov model and an evolutionary model. Prrn 3.0 (scc) [15] Doubly nested randomised iterative alignment where group-to-group alignments are repeated to improve the overall score. T-Coffee 1.37 [16] Uses an alignment library to seek for maximum consistency of each residue pair with all other pairs of this library and guides the progressive step by means of this library. Dynalign second edition [17, 18] Simultaneously aligns and predicts the lowest free energy RNA secondary structure common to two sequences. Foldalign 2.0.0 [19] Structurally aligns two sequences using a light weight energy model in combination with RIBOSUM-like score matrices. PMcomp N/A [20] Computes and aligns base-pair probability matrices (calculated using Mc- Caskill s algorithm [21]). Stemloc 0.2 (dart) [22, 23] Alignment of RNA sequences using pre-folding and pre-alignment envelope heuristics. Table 1 This table summarises the alignment methods used in this study. 2

Label Command Reference Sequence Alignment Align-m (1) align_m -m RNA2 Align-m (2) align_m -m RNA2 -p2m_fmin 0.7 -p2m_nseq_min 5 Align-m (3) align_m -m RNA2 -s2p_go 10 -s2p_ge 1 Align-m (4) align_m -m RNA2 -s2p_go 10 -s2p_ge 1 -p2m_fmin 0.7 -p2m_nseq_min 5 Align-m (5) align_m -m RNA2 -s2p_w 3 ClustalW clustalw -type=dna -align [2, 3] ClustalW (qt) DIALIGN dialign2-2 -n DIALIGN (it) clustalw -type=dna -align -quicktree dialign2-2 -n -it DIALIGN (o) dialign2-2 -n -o DIALIGN (it,o) dialign2-2 -n -it -o Handel MAFFT (fftnsi) MAFFT (fftns) MAFFT (nwnsi) MAFFT (nwns) MUSCLE MUSCLE (nj) handalign.pl fftnsi fftns nwnsi nwns muscle MUSCLE (mi32) muscle -maxiters 32 MUSCLE (nj,mi32) MUSCLE (m6) muscle -maxtrees 6 MUSCLE (nj,mt6) muscle -cluster1 neighborjoining -cluster2 neighborjoining muscle -maxiters 32 -cluster1 neighborjoining -cluster2 neighborjoining muscle -maxtrees 6 -cluster1 neighborjoining -cluster2 neighborjoining MUSCLE (mi32,mt6) muscle -maxiters 32 -maxtrees 6 MUSCLE (nj,mi32,mt6) PCMA PCMA (agi20) PCMA (agi60) muscle -maxiters 32 -maxtrees 6 -cluster1 neighborjoining -cluster2 neighborjoining pcma pcma -ave_grp_id=20 pcma -ave_grp_id=60 POA poa -v blosum80.mat [13] POA (g) POA (p) POA (g,p) poa -do_global -v blosum80.mat poa -do_progressive -v blosum80.mat poa -do_global -do_progressive -v blosum80.mat ProAlign (bw400) java -Xmx256m -jar ProAlign_0.5a0.jar -bwidth=400 [14] Prrn prrn Prrn (S10) prrn -S10 T-Coffee T-Coffee (c) T-Coffee (f) T-Coffee (s) t_coffee t_coffee -in=mlalign_id_pair,mclustalw_pair t_coffee -in=mlalign_id_pair,mfast_pair t_coffee -in=mlalign_id_pair,mslow_pair Table 2 This table summarises parameters and references for applied sequence alignment methods corresponding to abbreviations used in the body of the manuscript. Due to space constraints we only display those parameters affecting algorithm methodology and not those for data input/output. 3 [1] [4 6] [7, 8] [9] [10, 11] [12] [15] [16]

Label Command Reference Structural Alignment Dynalign dynalign len2-len1+5 0.4 5 20 2 1 0 [17, 18] Foldalign foldalign -global -max_diff 25 -score_matrix global.fmat [19] PMcomp pmcomp.pl [20] PMcomp (fast) pmcomp.pl --fast [20, 24] Stemloc (slow) stemloc --global --multiple -verbose --nfold 1000 --norndfold [22, 23] Stemloc (fast) Statistics stemloc --global --multiple -verbose --nfold 110 --norndfold SPS bali_score [25] SCI RNAz [26] Percent Sequence Identity alistat [27] Table 3 This table summarises parameters and references for applied structural alignment methods corresponding to abbreviations used in the body of the manuscript. Due to space constraints we only display those parameters affecting algorithm methodology and not those for data input/output. Note that len1 corresponds to the length of the shortest sequence and len2 corresponds to the length of the longest sequence. 4

References [1] Van Walle, I., Lasters, I., and Wyns, L. (2004) Align-m: a new algorithm for multiple alignment of highly divergent sequences. Bioinformatics, 20(9), 1428 1435. [2] Thompson, J., Higgins, D., and Gibson, T. (1994) CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice.. Nucl. Acids Res., 22, 4673 4680. [3] Chenna, R., Sugawara, H., Koike, T., Lopez, R., Gibson, T. J., Higgins, D. G., and Thompson, J. D. (2003) Multiple sequence alignment with the Clustal series of programs. Nucl. Acids Res., 31(13), 3497 3500. [4] Morgenstern, B., Frech, K., Dress, A., and Werner, T. (1998) DIALIGN: finding local similarities by multiple sequence alignment. Bioinformatics, 14(3), 290 294. [5] Morgenstern, B. (1999) DIALIGN 2: improvement of the segment-to-segment approach to multiple sequence alignment. Bioinformatics, 15(3), 211 218. [6] Morgenstern, B. (2004) DIALIGN: multiple DNA and protein sequence alignment at BiBiServ. Nucl. Acids Res., 32(suppl 2), W33 36. [7] Holmes, I. and Bruno, W. J. (2001) Evolutionary HMMs: a Bayesian approach to multiple alignment. Bioinformatics, 17(9), 803 820. [8] Holmes, I. (2003) Using guide trees to construct multiple-sequence evolutionary HMMs. Bioinformatics, 19(90001), 147i 157. [9] Katoh, K., Misawa, K., Kuma, K.-i., and Miyata, T. (2002) MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform. Nucl. Acids Res., 30(14), 3059 3066. [10] Edgar, R. C. (2004) Muscle: a multiple sequence alignment method with reduced time and space complexity. BMC Bioinformatics, 5(1), 113. [11] Edgar, R. C. (2004) MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucl. Acids Res., 32(5), 1792 1797. [12] Pei, J., Sadreyev, R., and Grishin, N. V. (2003) PCMA: fast and accurate multiple sequence alignment based on profile consistency. Bioinformatics, 19(3), 427 428. [13] Lee, C., Grasso, C., and Sharlow, M. F. (2002) Multiple sequence alignment using partial order graphs. Bioinformatics, 18(3), 452 464. [14] Löytynoja, A. and Milinkovitch, M. C. (2003) A hidden Markov model for progressive multiple alignment. Bioinformatics, 19(12), 1505 1513. [15] Gotoh, O. (1996) Significant improvement in accuracy of multiple protein sequence alignments by iterative refinement as assessed by reference to structural alignments.. J. Mol. Biol., 264, 823 838. 5

[16] Notredame, C., Higgins, D., and J., H. (2000) T-Coffee: A novel method for fast and accurate multiple sequence alignment.. J. Mol. Biol., 302, 205 217. [17] Mathews, D. and Turner, D. (2002) Dynalign: An algorithm for finding the secondary structure common to two RNA sequences. J. Mol. Biol., 317(2), 191 203. [18] Mathews, D. (2005) Predicting a set of minimal free energy RNA secondary structures common to two sequences. Bioinformatics, (Advance Access published February 24). [19] Hull Havgaard, J., Lyngsø, R., Stormo, G., and Gorodkin, J. (2005) Pairwise local structural alignment of RNA sequences with sequence similarity less than 40%. Bioinformatics, (Advance Access published January 18). [20] Hofacker, I., Bernhart, S., and Stadler, P. (2004) Alignment of RNA base pairing probability matrices. Bioinformatics, 20(14), 2222 2227. [21] McCaskill, J. S. (1990) The equilibrium partition function and base pair binding probabilities for RNA secondary structures. Biopolymers, 29, 1105 1119. [22] Holmes, I. (2004) A probabilistic model for the evolution of RNA structure. BMC Bioinformatics, 5(166). [23] Holmes, I. (2005) Accelerated probabilistic inference of RNA structure evolution. BMC Bioinformatics, 6(1), 73. [24] Bonhoeffer, S., McCaskill, J., Stadler, P., and Schuster, P. (1993) RNA multi-structure landscapes. A study based on temperature dependent partition functions. Eur Biophys J, 22(1), 13 24. [25] Thompson, J., Plewniak, F., and Poch, O. (1999) A comprehensive comparison of multiple sequence alignment programs. Nucleic Acids Res, 27(13). [26] Washietl, S., Hofacker, I., and Stadler, P. (2005) Fast and reliable prediction of noncoding RNAs. Proc Natl Acad Sci U S A, 102, 2454 2459. [27] Eddy, S. SQUID - C function library for sequence analysis.. 6