A new type of Hidden Markov Models to predict complex domain architecture in protein sequences

Similar documents
Bioinformatics Grid - Enabled Tools For Biologists.

Guide for Bioinformatics Project Module 3

Sequence Information. Sequence information. Good web sites. Sequence information. Sequence. Sequence

Hidden Markov Models in Bioinformatics. By Máthé Zoltán Kőrösi Zoltán 2006

T cell Epitope Prediction

Introduction to Genome Annotation

Linear Sequence Analysis. 3-D Structure Analysis

A Primer of Genome Science THIRD

Databases and mapping BWA. Samtools

SGI. High Throughput Computing (HTC) Wrapper Program for Bioinformatics on SGI ICE and SGI UV Systems. January, Abstract. Haruna Cofer*, PhD

Sequence homology search tools on the world wide web

Managing extended organizations and data governance : a foucauldian perspective

Hidden Markov models in biological sequence analysis

NLP Programming Tutorial 5 - Part of Speech Tagging with Hidden Markov Models

statistical significance estimation

Activity recognition in ADL settings. Ben Kröse

Error Log Processing for Accurate Failure Prediction. Humboldt-Universität zu Berlin

Current Motif Discovery Tools and their Limitations

Hardware Implementation of Probabilistic State Machine for Word Recognition

Course: Model, Learning, and Inference: Lecture 5

Finding soon-to-fail disks in a haystack

Classification/Decision Trees (II)

Stock Trading by Modelling Price Trend with Dynamic Bayesian Networks

Core Bioinformatics. Degree Type Year Semester Bioinformàtica/Bioinformatics OB 0 1

Yale Pseudogene Analysis as part of GENCODE Project

AUTOMATIC PHONEME SEGMENTATION WITH RELAXED TEXTUAL CONSTRAINTS

Second International Workshop on Preservation of Evolving Big Data - Panel on Big Data Quality

Consensus alignment server for reliable comparative modeling with distant templates

MK Gibson et al. Training and Benchmarking of Resfams profile HMMs

Machine Learning and Statistics: What s the Connection?

Tagging with Hidden Markov Models

HMM : Viterbi algorithm - a toy example

Budapest University of Technology and Economics Department of Measurement and Information Systems. Business Process Modeling

Learning is a very general term denoting the way in which agents:

Advanced Signal Processing and Digital Noise Reduction

Introduction to MDM Part 4 - Enterprise Data Architecture Master Data Management

Cell Phone based Activity Detection using Markov Logic Network

Intrusion Detection via Machine Learning for SCADA System Protection

Lecture/Recitation Topic SMA 5303 L1 Sampling and statistical distributions

博 士 論 文 ( 要 約 ) A study on enzymatic synthesis of. stable cyclized peptides which. inhibit protein-protein interactions

PHYML Online: A Web Server for Fast Maximum Likelihood-Based Phylogenetic Inference

Conditional Random Fields: An Introduction

Automatic segmentation of text into structured records

Location: Events Center - Mandalay Bay Events Center Agenda Status: Default

Word Completion and Prediction in Hebrew

Pricing Indicators. National Pricing Indicators. For the week beginning 6/6/11. (Additional indexes and products inside)

Version 2.3.1; June 2003

Bio-Informatics Lectures. A Short Introduction

Terminology Extraction from Log Files

UGENE Quick Start Guide

REGULATIONS FOR THE DEGREE OF MASTER OF SCIENCE IN COMPUTER SCIENCE (MSc[CompSc])

Automated Content Analysis of Discussion Transcripts

RETRIEVING SEQUENCE INFORMATION. Nucleotide sequence databases. Database search. Sequence alignment and comparison

Bayesian Networks. Read R&N Ch Next lecture: Read R&N

Evaluation of Optimizations for Object Tracking Feedback-Based Head-Tracking

How To Calculate Energy From Water

Case Study: How Generic Domain Names Impact a SEM Campaign

Introduction. Key Benefits. Data Storage. Data Retrieval. TechTip: Wonderware Historian a Look at Different Data Retrieval Methods.

THREE DIMENSIONAL REPRESENTATION OF AMINO ACID CHARAC- TERISTICS

FBIO - Fundations of Bioinformatics

DYNAMIC CHORD ANALYSIS FOR SYMBOLIC MUSIC

THE TURING DEGREES AND THEIR LACK OF LINEAR ORDER

Annex 6: Nucleotide Sequence Information System BEETLE. Biological and Ecological Evaluation towards Long-Term Effects

Quick DDNS Quick Start Guide

Hidden Markov Models

Data Mining. Dr. Saed Sayad. University of Toronto

Predicting Student Retention in Massive Open Online Courses using Hidden Markov Models

Part A: Data Definition Language (DDL) Schema and Catalog CREAT TABLE. Referential Triggered Actions. CSC 742 Database Management Systems

Chapter 5 More SQL: Complex Queries, Triggers, Views, and Schema Modification

Spot me if you can: Uncovering spoken phrases in encrypted VoIP conversations

Cursive Handwriting Recognition for Document Archiving

Quick DDNS Quick Start Guide

Hidden Markov Models Chapter 15

LabGenius. Technical design notes. The world s most advanced synthetic DNA libraries. hi@labgeni.us V1.5 NOV 15

AUTOMATIC VIDEO STRUCTURING BASED ON HMMS AND AUDIO VISUAL INTEGRATION

Transcription:

A new type of Hidden Markov Models to predict complex domain architecture in protein sequences Raluca Uricaru, Laurent Bréhélin and Eric Rivals LIRMM, CNRS Université de Montpellier 2 14 Juin 2007 Raluca Uricaru (LIRMM/CNRS) Cyclic profile HMMs (CpHMMs) 14 Juin 2007 1 / 12

Summary 1 General Context 2 Proposed solution : CpHMM 3 Pentatrico-Peptide Repeat (PPR) protein family 4 Results Raluca Uricaru (LIRMM/CNRS) Cyclic profile HMMs (CpHMMs) 14 Juin 2007 2 / 12

General Context General Context Motifs, Domains, PFAM Motifs = sequence regions conserved in a protein family Domains = conserved sequence regions with structural properties many proteins are composed of several domains Pfam collection of protein domains & families uses phmms (via HMMER software) to model domains Raluca Uricaru (LIRMM/CNRS) Cyclic profile HMMs (CpHMMs) 14 Juin 2007 3 / 12

General Context phmms Profile Hidden Markov Models (phmms) HMM = probabilistic automaton ; phmm = HMM designed to represent domains/motifs ; test a protein s membership of a known family recognition tag the precise position of a domain in the sequence tagging Raluca Uricaru (LIRMM/CNRS) Cyclic profile HMMs (CpHMMs) 14 Juin 2007 4 / 12

General Context phmms (2) Difficulties proteins are usually composed of several different domains ; rearrangements & duplications different domain architectures in the same family. Raluca Uricaru (LIRMM/CNRS) Cyclic profile HMMs (CpHMMs) 14 Juin 2007 5 / 12

General Context phmms (2) Difficulties proteins are usually composed of several different domains ; rearrangements & duplications different domain architectures in the same family. phmms : have a linear structure, model a single domain unable to deal with complex domain architectures Raluca Uricaru (LIRMM/CNRS) Cyclic profile HMMs (CpHMMs) 14 Juin 2007 5 / 12

Proposed solution : CpHMM Proposed solution : generalized phmm Cyclic profile HMM (CpHMM) - made up of nested phmms Raluca Uricaru (LIRMM/CNRS) Cyclic profile HMMs (CpHMMs) 14 Juin 2007 6 / 12

Proposed solution : CpHMM CpHMM : advantages capitalizes on already developed phmms (PFAM) ; deals with both : variable number of repeated units & variable relative order of units ; takes in consideration both : sequence similarity & domain context ; Raluca Uricaru (LIRMM/CNRS) Cyclic profile HMMs (CpHMMs) 14 Juin 2007 7 / 12

Proposed solution : CpHMM CpHMM : advantages (2) built with or without prior knowledge can be adjusted & parameterised for a specific family ; automatically performs both tagging and recognition ; yields a globally optimal of several domains ; multiple tagging gives a global E-value (computed as in HMMER) ; efficient in execution terms ( 20 minutes for 40000 arabidopsis proteins) Raluca Uricaru (LIRMM/CNRS) Cyclic profile HMMs (CpHMMs) 14 Juin 2007 8 / 12

Pentatrico-Peptide Repeat (PPR) protein family PPR protein family Protein family with complex multi-domain architecture The PPR protein family is particularly large in plants Raluca Uricaru (LIRMM/CNRS) Cyclic profile HMMs (CpHMMs) 14 Juin 2007 9 / 12

Pentatrico-Peptide Repeat (PPR) protein family PPR protein family (2) Motif structure of PPR proteins for Arabidopsis Raluca Uricaru (LIRMM/CNRS) Cyclic profile HMMs (CpHMMs) 14 Juin 2007 10 / 12

Pentatrico-Peptide Repeat (PPR) protein family PPR protein family (3) PPR motif multiple alignment Raluca Uricaru (LIRMM/CNRS) Cyclic profile HMMs (CpHMMs) 14 Juin 2007 11 / 12

Results Validation comparison between automatic annotation (CpHMM) & expert manual annotation on PCMP subfamily in arabidopsis : 197 protein sequences regular expression for PCMP subfamily : (P L S ) (P L 2 S ) [E [E + [Dyw]]] distribution of the number of PPR motifs per protein : avg = 15, min = 7, max = 28 diversity of architectures ; identical improved prediction slightly different different #proteins 98 30 24 45 automatic annotation valid in 88% of the motifs ; Raluca Uricaru (LIRMM/CNRS) Cyclic profile HMMs (CpHMMs) 14 Juin 2007 12 / 12