Motifs and Profile Analysis

COS 597c: Topics in Computational Molecular Biology
Lecture 0: October, 1999
Lecturer: Mona Singh
Scribes: Dawn Brooks

Note: Portions of these notes are adapted from lecture notes originally scribed by Xuxia Kuang in Fall 1998.

Broadly speaking, a sequence motif is a conserved element of a sequence alignment. Its function or structure may be known, or its significance may be unknown. Thus, one way to get functional or structural information about a sequence is to determine what motifs it contains. In a sense, we have already talked about this problem in the context of detecting sequence similarity using pairwise and multiple sequence alignments. Today we will discuss another method, where we build profiles (which have also been called, among other things, weight matrices, templates, and position-specific scoring matrices) for motifs or sequence families.

There are many websites that contain collections of amino acid sequence motifs that indicate particular structural or functional elements; searching these websites with a new sequence allows inferences about its potential functional roles. Such websites include BLOCKS, PRINTS and Pfam.

Unlike the alignment methods we have talked about thus far, profile analysis intuitively uses the fact that certain positions in a family are more conserved than others, and allows substitutions less readily in these conserved positions. The latest approaches to building profiles are based on hidden Markov models, which we will talk about in a subsequent lecture. We will restrict our discussion below to protein families, but the same techniques are also applicable to DNA sequences. (For example, profile analysis has also been applied to recognizing promoter sequences.)

Profiles for Aligned Sequences

The general framework for profile analysis is: given subsequences that are members of a family of interest, devise methods to determine whether a new sequence is a member of this family.

Here is an overview of the profile method:

- Align the sequences in the family.
- Use the alignment to create a profile.
- Test new sequences against the profile.

Initially, we will assume that there are no gaps in the alignment.

1. We look at the alignment of $N$ sequences of $l$ positions as follows:

\[
\begin{array}{c|ccccc}
\text{Sequence} \backslash \text{Position} & 1 & 2 & 3 & \cdots & l \\
\hline
1 & a_{11} & a_{12} & a_{13} & \cdots & a_{1l} \\
2 & a_{21} & a_{22} & a_{23} & \cdots & a_{2l} \\
3 & a_{31} & \cdots & & & \\
\vdots & & & & & \\
N & a_{N1} & a_{N2} & a_{N3} & \cdots & a_{Nl}
\end{array}
\]

where $a_{ij}$ denotes the amino acid from the $i$th sequence at the $j$th position.

2. We build the profile as follows. We compute:

$f_{ij}$ = fraction of column $j$ that is amino acid $i$
$b_i$ = fraction of the background that is amino acid $i$

The background can be computed, for example, from a large sequence database, from a genome, or from some particular protein family. Now compute the $20 \times l$ array $P_{ij}$, where

\[ P_{ij} = \frac{f_{ij}}{b_i} \tag{1} \]

Intuitively, $P_{ij}$ is the propensity for amino acid $i$ in the $j$th position of the alignment. This gives us the following table, with one row per amino acid:

\[
\begin{array}{c|ccccc}
\text{Amino acid} \backslash \text{Position} & 1 & 2 & 3 & \cdots & l \\
\hline
L & P_{L1} & P_{L2} & P_{L3} & \cdots & P_{Ll} \\
V & P_{V1} & P_{V2} & P_{V3} & \cdots & P_{Vl} \\
F & P_{F1} & \cdots & & & \\
\vdots & & & & &
\end{array}
\]

We use this table to compute:

\[ \text{Score}_{ij} = \log(P_{ij}) \tag{2} \]

For example, say we have the following alignment ($N = 4$, $l = 4$):

LEVK
LDIR
LEIK
LDVE

Assume for simplicity that all amino acids are equally likely in the background (e.g., the fraction of amino acid $i$ in some large database is $1/20$). Since L fills all of column 1 and D fills half of column 2, we have

\[ P_{L1} = \frac{1}{1/20} = 20, \qquad P_{D2} = \frac{1/2}{1/20} = 10. \]

3. To use the profile to score a new sequence, we do the following:

(a) Slide a window of width $l$ over the new sequence.

(b) The score of the window equals the sum of the scores of each position in the window. For example, for the sequence LEVEER:

Score of the first window = $\text{Score}_{L1} + \text{Score}_{E2} + \text{Score}_{V3} + \text{Score}_{E4}$
Score of the second window = $\text{Score}_{E1} + \text{Score}_{V2} + \text{Score}_{E3} + \text{Score}_{E4}$

(c) If the score of the window is higher than a cutoff, which is determined empirically, we predict that the window is a member of the family. The higher the score, the more confident the prediction.
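The three steps above can be sketched in a few lines of Python. This is only an illustrative sketch, not code from the lecture: the function names `build_profile` and `window_scores` are hypothetical, the background is assumed to be uniform (1/20 per residue) as in the worked example, and the natural logarithm is assumed because the notes do not fix a base. A residue that never occurs in a column is given a score of negative infinity here, since log(0) is undefined.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def build_profile(alignment, background=None):
    """Return one dict per column holding the propensities P_ij = f_ij / b_i."""
    n = len(alignment)
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("rows of an ungapped alignment must have equal length")
    if background is None:
        # Assumption: uniform background, as in the worked example.
        background = {aa: 1 / len(AMINO_ACIDS) for aa in AMINO_ACIDS}
    profile = []
    for j in range(length):
        counts = Counter(seq[j] for seq in alignment)  # missing residues count 0
        profile.append({aa: (counts[aa] / n) / background[aa] for aa in AMINO_ACIDS})
    return profile


def window_scores(profile, sequence):
    """Slide a window of width l; a window's score is the sum of log(P_ij)."""
    width = len(profile)
    scores = []
    for start in range(len(sequence) - width + 1):
        total = 0.0
        for j, aa in enumerate(sequence[start:start + width]):
            p = profile[j][aa]
            # log(0) is undefined; treat an unseen residue as -infinity.
            total += math.log(p) if p > 0 else float("-inf")
        scores.append(total)
    return scores


profile = build_profile(["LEVK", "LDIR", "LEIK", "LDVE"])
print(profile[0]["L"], profile[1]["D"])  # the propensities P_L1 and P_D2
print(window_scores(profile, "LEVEER"))  # one score per window
```

With this alignment, the second window of LEVEER (EVEE) places E in column 1, where it never occurs, so its score is negative infinity.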

Simple probabilistic interpretation

Note that the profile method just described can be justified from a log-odds perspective. In particular, when scoring a subsequence, profile methods assume that each position is independent, and estimate:

\[ \log\left( \frac{\Pr(\text{subsequence} \mid \text{family})}{\Pr(\text{subsequence} \mid \text{not family})} \right) \]

You can show this simply by repeated application of the definition of conditional probability, and by using our assumption of position independence: the ratio factors into a product over positions, so its logarithm is a sum of per-position terms $\log(f_{ij}/b_i)$, which are exactly the scores $\text{Score}_{ij}$. Probabilities in the numerator are estimated from the frequencies of each amino acid in each position of the alignment, and probabilities in the denominator are estimated from the background frequencies.

Common Extensions to Profile Methods

There are several variations on the method we just described.

1. We can modify the profile method to incorporate gaps. We now have a $21 \times l$ matrix, where $l$ is the length of the alignment. Gap costs vary at different positions. For example, we tend to see more gaps at certain positions in loops of protein structures.

2. The most important extension is how to handle the zero-frequency case. In practice, handling the zero-frequency case is essential for good performance. If an amino acid does not occur in a column, $P_{ij}$ is zero and $\text{Score}_{ij} = \log(0)$, which is undefined. One common approach for dealing with this problem is to calculate $P_{ij}$ as:

\[ P_{ij} = \frac{\left(\#\text{ of amino acid } i \text{ in position } j + b\right) / \left(N + b\right)}{\text{fraction of amino acid } i \text{ in GenBank}} \tag{3} \]

where $b$ is usually some small value. This is often known as the pseudo-count method, and can be justified in a Bayesian framework. There are many other methods to deal with the zero-frequency case.
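As a minimal sketch of the pseudo-count formula (3), the Python below recomputes the log scores for the earlier example alignment. The function name `pseudocount_scores`, the choice b = 0.5, the uniform background (standing in for GenBank frequencies), and the natural logarithm are all assumptions made for illustration; none of them come from the lecture.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def pseudocount_scores(alignment, b=0.5, background=None):
    """Return per-column dicts of log(P_ij), with P_ij computed as in formula (3)."""
    n = len(alignment)
    if background is None:
        # Assumption: uniform background instead of database frequencies.
        background = {aa: 1 / len(AMINO_ACIDS) for aa in AMINO_ACIDS}
    scores = []
    for column in zip(*alignment):  # iterate over alignment columns
        counts = Counter(column)    # residues absent from the column count 0
        scores.append({
            aa: math.log(((counts[aa] + b) / (n + b)) / background[aa])
            for aa in AMINO_ACIDS
        })
    return scores


scores = pseudocount_scores(["LEVK", "LDIR", "LEIK", "LDVE"])
print(scores[0]["E"])                    # finite, although E never occurs in column 1
print(scores[0]["L"] > scores[0]["E"])   # the observed residue still scores higher
```

Every residue now receives a finite score in every column, so a window such as EVEE can be scored and compared against the cutoff rather than being rejected outright.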

3. When several sequences are closely related, we do not want to overweight these sequences; instead, we want more remote homologs to count equally. Thus, one common extension to the profile method is to weight the sequences used in building the profile. If sequence $k$ is weighted $W_k$ and $\sum_{k=1}^{N} W_k = W$, then

\[ P_{ij} = \frac{\left( \sum_{k=1}^{N} W_k \, \delta_{kij} \right) / W}{\text{fraction of amino acid } i \text{ in GenBank}} \tag{4} \]

where $\delta_{kij} = 1$ if $a_{kj}$ = residue $i$, and 0 otherwise. (Sanity check: note that if each sequence is weighted equally, i.e. $W_k = 1$, then $W = N$ and we get back the original formula.)

4. Sometimes profile methods incorporate known substitution matrices used for alignments. (The PAM250 matrix is a common example.) If $\delta$ is our substitution matrix, we compute:

\[ \text{Score}_{ij} = \sum_{d=1}^{20} \delta(i, d) \, \frac{\#\text{ of amino acid } d \text{ in position } j \text{ of alignment}}{N} \tag{5} \]

for $i \in \{L, V, \ldots\}$ and $1 \le j \le l$, where the sum runs over the 20 amino acids $d$. This variation is actually how profile methods were first described. Note that this formula gives an alternate way to handle the zero-frequency case; however, it does not fit into the log-odds perspective.

References

[1] R. Durbin, S. Eddy, A. Krogh and G. Mitchison. Biological Sequence Analysis. Cambridge University Press, 1998.
[2] M. Gribskov, A. D. McLachlan and D. Eisenberg (1987). Profile analysis: detection of distantly related proteins. Proc. Natl. Acad. Sci. U.S.A. 84(13): 4355-4358.