2.3 Identify rrna sequences in DNA

Similar documents
Bioinformatics Grid - Enabled Tools For Biologists.

Lab 2/Phylogenetics/September 16, PHYLOGENETICS

Bioinformatics Resources at a Glance

Version 5.0 Release Notes

CD-HIT User s Guide. Last updated: April 5,

Genome Explorer For Comparative Genome Analysis

A Tutorial in Genetic Sequence Classification Tools and Techniques

The Central Dogma of Molecular Biology

A data management framework for the Fungal Tree of Life

Name Class Date. binomial nomenclature. MAIN IDEA: Linnaeus developed the scientific naming system still used today.

Visualization of Phylogenetic Trees and Metadata

DNA Sequencing Overview

Next Generation Sequencing Technologies in Microbial Ecology. Frank Oliver Glöckner

Pairwise Sequence Alignment

Guide for Bioinformatics Project Module 3

COMPARING DNA SEQUENCES TO DETERMINE EVOLUTIONARY RELATIONSHIPS AMONG MOLLUSKS

UGENE Quick Start Guide

RETRIEVING SEQUENCE INFORMATION. Nucleotide sequence databases. Database search. Sequence alignment and comparison

Phylogenetic Trees Made Easy

Analyzing A DNA Sequence Chromatogram

PROC. CAIRO INTERNATIONAL BIOMEDICAL ENGINEERING CONFERENCE

BLAST. Anders Gorm Pedersen & Rasmus Wernersson

Protein & DNA Sequence Analysis. Bobbie-Jo Webb-Robertson May 3, 2004

Protein Sequence Analysis - Overview -

Activity IT S ALL RELATIVES The Role of DNA Evidence in Forensic Investigations

Name Class Date. Figure Which nucleotide in Figure 13 1 indicates the nucleic acid above is RNA? a. uracil c. cytosine b. guanine d.

Introduction to Bioinformatics 3. DNA editing and contig assembly

Genetic information (DNA) determines structure of proteins DNA RNA proteins cell structure enzymes control cell chemistry ( metabolism )

BIOL 3200 Spring 2015 DNA Subway and RNA-Seq Data Analysis

Focusing on results not data comprehensive data analysis for targeted next generation sequencing

GenBank, Entrez, & FASTA

From DNA to Protein. Proteins. Chapter 13. Prokaryotes and Eukaryotes. The Path From Genes to Proteins. All proteins consist of polypeptide chains

Introduction to Bioinformatics AS Laboratory Assignment 6

Introduction to Phylogenetic Analysis

BASIC STATISTICAL METHODS FOR GENOMIC DATA ANALYSIS

Sequence Analysis 15: lecture 5. Substitution matrices Multiple sequence alignment

Gene Models & Bed format: What they represent.

Introduction to NGS data analysis

The world of non-coding RNA. Espen Enerly

Tutorial for Windows and Macintosh. Preparing Your Data for NGS Alignment

Vector NTI Advance 11 Quick Start Guide

Clone Manager. Getting Started

Central Dogma. Lecture 10. Discussing DNA replication. DNA Replication. DNA mutation and repair. Transcription

DnaSP, DNA polymorphism analyses by the coalescent and other methods.

RAST Automated Analysis. What is RAST for?

Localised Sex, Contingency and Mutator Genes. Bacterial Genetics as a Metaphor for Computing Systems

A Complete Example of Next- Gen DNA Sequencing Read Alignment. Presentation Title Goes Here

The Steps. 1. Transcription. 2. Transferal. 3. Translation

MiSeq: Imaging and Base Calling

SGI. High Throughput Computing (HTC) Wrapper Program for Bioinformatics on SGI ICE and SGI UV Systems. January, Abstract. Haruna Cofer*, PhD

User Manual for SplitsTree4 V4.14.2

Amino Acids and Their Properties

Current Motif Discovery Tools and their Limitations

DNA Sequence Alignment Analysis

CREATING AN IMAGE FROM AUTOCAD CADD NOTE 16. MENU: AutoCAD, File, Plot COMMAND: plot ICON:

Final Project Report

An Introduction to the Use of Bayesian Network to Analyze Gene Expression Data

UMass High Performance Computing Center

2 Decision tree + Cross-validation with R (package rpart)

HOW TO CREATE AN HTML5 JEOPARDY- STYLE GAME IN CAPTIVATE

Biological Sequence Data Formats

CDD user guide. PsN Revised

SeqScape Software Version 2.5 Comprehensive Analysis Solution for Resequencing Applications

Joomla! template Blendvision v 1.0 Customization Manual

Sequence Formats and Sequence Database Searches. Gloria Rendon SC11 Education June, 2011

Apply PERL to BioInformatics (II)

Section 3 Comparative Genomics and Phylogenetics

Structural Health Monitoring Tools (SHMTools)

AmphoraNet: Taxonomic Composition Analysis of Metagenomic Shotgun Sequencing Data

LifeScope Genomic Analysis Software 2.5

Bio-Informatics Lectures. A Short Introduction

(A GUIDE for the Graphical User Interface (GUI) GDE)

In this session, we use the table ZZTELE with approx. 115,000 records for the examples. The primary key is defined on the columns NAME,VORNAME,STR

Name: Class: Date: Multiple Choice Identify the choice that best completes the statement or answers the question.

Systematic discovery of regulatory motifs in human promoters and 30 UTRs by comparison of several mammals

Instructions for update installation of ElsaWin 5.00

Load testing with. WAPT Cloud. Quick Start Guide

UCINET Quick Start Guide

An introduction to bioinformatic tools for population genomic and metagenetic data analysis, 2.5 higher education credits Third Cycle

A Step-by-Step Tutorial: Divergence Time Estimation with Approximate Likelihood Calculation Using MCMCTREE in PAML

Chapter 2. imapper: A web server for the automated analysis and mapping of insertional mutagenesis sequence data against Ensembl genomes

Data Crow Creating Reports

Web Data Extraction: 1 o Semestre 2007/2008

MetaPathways v1.0 Installation

Protein Phylogenies and Signature Sequences: A Reappraisal of Evolutionary Relationships among Archaebacteria, Eubacteria, and Eukaryotes

Introduction to GCG and SeqLab

CLC Bioinformatics Database

Introduction to Machine Learning and Data Mining. Prof. Dr. Igor Trajkovski

Using Illumina BaseSpace Apps to Analyze RNA Sequencing Data

SeattleSNPs Interactive Tutorial: Web Tools for Site Selection, Linkage Disequilibrium and Haplotype Analysis

4. GS Reporter Application 65

Algorithms in Computational Biology (236522) spring 2007 Lecture #1

The sequence of bases on the mrna is a code that determines the sequence of amino acids in the polypeptide being synthesized:

14.3 Studying the Human Genome

E. coli plasmid and gene profiling using Next Generation Sequencing

Transcription:

2.3 Identify rrna sequences in DNA For identifying rrna sequences in DNA we will use rnammer, a program that implements an algorithm designed to find rrna sequences in DNA [5]. The program was made by modeling a large number of known rrna sequences and making them into generalized patterns. These patterns are then used as search models in a new DNA sequence. If a part of the DNA matches the model, the sequence is extracted as a likely rrna sequence. The help page for rnammer has the following description for the program: "RNAmmer predicts ribosomal RNA genes in full genome sequences by utilizing two levels of Hidden Markov Models: An initial spotter model searches both strands. The spotter model is constructed from highly conserved loci within a structural alignment of known rrna sequences. Once the spotter model detects an approximate position of a gene, flanking regions are extracted and parsed to the full model which matches the entire gene. By enabling a two-level approach it is avoided to run a full model through an entire genome sequence allowing faster predictions. RNAmmer consists of two components: A core Perl program, core-rnammer, and a wrapper, rnammer. The wrapper sets up the search by writing on or more temporary configuration(s). The wrapper requires the super kingdom of the input sequence (bacterial, archaeal, or eukaryotic) and the molecule type (5/8, 16/17s, and 23/28s) to search for. When the configuration files are written, they are parsed in parallel to individual instances of the core program. Each instance of the core program will in parallel search both strands, so a maximum of 3x2 hmmsearch processes will run simultaneously. The input sequences are read from sequence and must be in Pearson FASTA format. " The program is run as follows, specifying the taxonomical kingdom (bac) and the type of molecules to search for (tsu for 5/8s rrna, ssu for 16/18s rrna, lsu for 23/28s rrna). The parameter -f specifies the name of the output file. Listing 2.5: Identify 16S rrna sequences in genomic DNA $ rnammer -S bac -m ssu -f <name >. rrna <name >. gbk. dna # Example, one line command : $ rnammer -S bac -m ssu -f Neisseria_meningitidis_Z2491_ID_AL157959. gbk. dna. rrna Neisseria_meningitidis_Z2491_ID_AL157959. gbk. dna Note that this takes time to go through a single genome, and if there is more than one rrna operon (which is true for many bacteria), then multiple 16S rrna sequenced will be found per genome. See the example below, where first the N. meningitidis genomes are done, and then the N. gonorrhoeae genomes are done. for x in Neisseria * gbk. dna > do > echo $x > rnammer -S bac -m ssu -f $x. rrna $x > done Listing 2.6: Identify 16S rrna sequences in genomic DNA - loop Have a look at the FASTA header line for the 16S rrna sequences and look at the score and the lengths. Use grep to get the header lines for each file. 30

2.3.1 Exercises 1. Run rnammer on all *.gbk.dna files 2. The output files from RNAmmer are FASTA formatted; count the number of sequences in each file, why are there multiple entries? 3. Does the program only find 16S rrna sequences? What type of molecules has the algorithm found? 4. Are there some files that are empty? What does it mean if the file is empty? 5. What kind of information can you get from the header lines of these FASTA files? 31

2.4 Multiple sequence alignment of 16S rrna sequences One way of comparing the 16 srna sequences is to do a multiple sequence alignment. Multiple alignments are often used in identifying conserved sequence regions across a group of sequences hypothesized to be evolutionarily related. If two sequences in an alignment share a common ancestor, mismatches can be interpreted as point mutations and gaps as indels (that is, insertion or deletion mutations) introduced in one or both lineages in the time since they diverged from one another. In this course we use clustal to align the sequences and find the distance/differences between them [6]. The greater the distance between two sequences the greater the difference between the organisms from which the sequences came. The rnammer program finds all possible rrna sequences in a genome. Some, and indeed many, genomes have more than one copy of this operon. In order to do a 16S rrna tree you should pick one sequence. Here we choose the one that has the highest score according to the rnammer models. The program rnammer_extractseqs takes all files in a directory named *.rrna and selects the best sequence within each file (each organism). For some organisms, rnammer might not find a sequence with a good enough score. This means that the model does not find a sequence that looks sufficiently like a rrna sequence. These sequences, and organisms, will be excluded from the analysis. The sequences will now be evaluated based on fixed criteria for a 16S rrna sequence. These criteria include length and fitness score to the rnammer models. This extraction procedure will only work for 16s rrna sequences that fulfill the criteria. Listing 2.7: Identify 16S rrna sequences in DNA - select sequences # The following code works on all files in the current working directory. $ rnammer_extractseqs <allfile >. RRNA $ rnammer_extractseqs_alllengths <allfile >. RRNA $ rnammer_extractseqs all. RRNA $ rnammer_extractseqs_alllengths alllengths. RRNA The output is a FASTA formatted file with rrna genes in DNA code. Count the number of sequences in this FASTA file (using grep). Try using grep without the -c option and look at what the headers of this FASTA file looks like. Look at the header lines for the selected sequences. The alignment is performed using clustalw and the file holding one rrna sequence from each organism. Listing 2.8: Multiple alignment of 16S rrna sequences $ clustalw <allfile >. RRNA $ clustalw all. RRNA # The following files are created : all. aln all. dnd 32

Next we create a distance tree for the alignment, using a bootstrap value of. Here is a quote from Wiki about bootstrapping in statistics: " In statistics, bootstrapping is a computer-based method for assigning measures of accuracy to sample estimates (Efron and Tibshirani 1994). This technique allows estimation of the sample distribution of almost any statistic using only very simple methods (Varian 2005). Generally, it falls in the broader class of resampling methods. Bootstrapping is the practice of estimating properties of an estimator (such as its variance) by measuring those properties when sampling from an approximating distribution. One standard choice for an approximating distribution is the empirical distribution of the observed data. In the case where a set of observations can be assumed to be from an independent and identically distributed population, this can be implemented by constructing a number of resamples of the observed dataset (and of equal size to the observed dataset), each of which is obtained by random sampling with replacement from the original dataset. " When using bootstrapping, different versions of the distance tree will be constructed and each time a branching point will be recorded. With a bootstrap value of, the number on the tree will indicate how many, out of trees, have this branching. The closer to the more sure we are of that branch. Listing 2.9: Multiple alignment of 16S rrna sequences - bootstrap $ clustalw <allfile >. RRNA - bootstrap = $ clustalw alllengths. RRNA - bootstrap = # The following files are created : all. phb The tree construction is simply a drawing program called njplot [9]. $ njplot all. RRNA. phb Listing 2.10: View 16S rrna tree Open the bootstrap tree file *.phb and tick the Display setting by clicking Bootstrap values. Click "File" and "Save Rooted tree". Output file is names all_root.phb. Under "File", save the tree as a PDF format and open a word processor (Word, Pages) or presentation software (PowerPoint, Keynote) on your LOCAL computer. Get the PDF from the shared folder and put the picture into your presentation. Add colors to indicate taxonomic groupings or others clusters (See Figure 2.2). 2.4.1 Exercises 1. Extract the best scoring sequences within the length criteria for all *.rrna files and store in file all.rrna 2. How many sequences are there in the all.rrna file? 3. Extract the best scoring sequences for all *.rrna files and store in file 4. How many sequences are there in the all.rrna file? 5. Run a multiple alignment on selected sequences 6. Generate a phylogenetic tree 7. Save tree as PDF and insert into presentation 33

0.02 993 Veillonella sp oral taxon 158 str F0422 Veillonella dispar ATCC 17748 Veillonella parvula DSM 2008 509 747 Veillonella sp 6 1 27 856Veillonella sp 3 1 44 926 Veillonella parvula ATCC 17745 Veillonella atypica ACS-134-V-Col7a 847 Veillonella atypica ACS-049-V-Sch6 * Veillonella sp oral taxon 780 str F0422 Megasphaera micronuciformis F0359 Megasphaera elsdenii DSM 20460 851 Megasphaera sp UPII 135-E Megasphaera sp UPII 199-6 894 Megasphaera genomosp type 1 str 28L Dialister invisus DSM 15470 Dialister microaerophilus UPII 345-E * Dialister micraerophilus DSM 19965 Thermosinus carboxydivorans Nor1 727 Selenomonas sp oral taxon 149 str 67H29BP * Selenomonas flueggei ATCC 43531 * Selenomonas noxia ATCC 43541 * 878 Selenomonas sp oral taxon 137 str F0430 Selenomonas artemidis F0399 * Selenomonas sputigena ATCC 35185 865 650 Mitsuokella multacida DSM 20544 Phascolarctobacterium succinatutens YIT 12067 Acidaminococcus fermentans DSM 20731 Acidaminococcus sp D21 Acidaminococcus intestini RyC-MR95 Cluster I Cluster II Cluster III Figure 2.2: 16S rrna tree. Each genome sequence was searched for 16S rrna patterns and candidate sequences were extracted. The best sequence from each genome was selected. For two genomes, no sequences were found, Centipeda periodontii DSM 2778, Megamonas hypermegale ART12 1. For 6 additional genomes, the located sequences were shorter than the default acceptable length. The short sequences sequences are marked with a *. Length criteria was changed from minimum 1 400 to 1 100 and maximum 1 800 unchanged. The distance tree was made with 1 000 bootstraps. 34