Protein & DNA Sequence Analysis. Bobbie-Jo Webb-Robertson May 3, 2004




Sequence Analysis
  Anything connected to extracting higher-level biological meaning from raw sequence data

Genomic & Proteomic Data
  Sequence alignment was one of the first bioinformatics techniques (~1970)
  It pre-dates high-throughput sequencing techniques
  [Figure: data volume in petabytes (log scale, 0.00001 to 1000) by year, 1988 to 2016, for GenBank and proteomic data]

Motivation: Structure Prediction Challenge

Guiding Principle
  The basic guiding principle is EVOLUTION
  Regions of the DNA/protein sequence that are functional units are less tolerant of mutation
  Examples: Myoglobin vs Hemoglobin; Zinc Finger

Outline
  Pairwise Sequence Alignment
    Dynamic Programming
    Heuristic
    Bayesian
  Multiple Sequence Alignment
    Heuristic
    Statistics (Gibbs Sampling)
  Conclusions

Pairwise Sequence Alignment
  One of the most commonly performed tasks in bioinformatics
  A method to compare two sequences and make inferences about the relationship between them
  Homologs = two molecules that share a common ancestor

Searching for Homology
  Query (unknown function/structure)
>d1npx_2 3.4.1.4.5 (120-242) NADH peroxidase
IPGKDLDNIYLMRGRQWAIKLKQKTVDPEVNNVVVIGSGYIGIEAAEAFAKAGKKVTVID
ILDRPLGVYLDKEFTDVLTEEMEANNITIATGETVERYEGDGRVQKVVTDKNAYDADLVV
VAV
  Target (known function/structure)
>d3lada2 3.4.1.4.6 (159-227) Dihydrolipoamide dehydrogenase
PAPVDQDVIVDSTGALDFQNVPGKLGVIGAGVIGLELGSVWARLGAEVTVLEAMDKFLPA
VDEQVAKEAQKILTKQGLKILLGARVTGTEVKNKQVTVKFVDAEGEKSQAFDKLIVAVG

Pairwise Sequence Alignment
  A one-to-one correspondence between the residues of two sequences:
    $R^{(1)} = \{R^{(1)}_1, \ldots, R^{(1)}_I\} = \{\mathrm{HEAGAWGHE}\}$
    $R^{(2)} = \{R^{(2)}_1, \ldots, R^{(2)}_J\} = \{\mathrm{GHEE}\}$
  Global:
    HEAGAWGHE-
    ------GHEE
  Local:
    GHE
    GHE
  Almost all alignment is done at the local level

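Why Sequence Alignment Algorithms?
  There are a huge number of possible alignments
    Two sequences of length 10: about 185,000 alignments
    Two sequences of length 20: about 138 billion alignments
  Exhaustive enumeration grows exponentially; dynamic programming needs only O(n^2) operations
  n is the length of the longest sequence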
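The two counts on this slide agree with the central binomial coefficient C(2n, n) for two sequences of length n. Treating that as the counting rule is an assumption, since the slide does not say how the numbers were obtained. A quick check in Python (the function name is illustrative):

```python
import math


def alignment_count(n: int) -> int:
    """Central binomial coefficient C(2n, n).

    Assumed counting rule for two sequences of length n (see the lead-in).
    """
    return math.comb(2 * n, n)


for n in (10, 20):
    print(n, alignment_count(n))
```

The growth is roughly 4^n, which is why enumerating alignments is hopeless even for short sequences.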

Evolution (mutation)
  Coded for at the DNA level
  Captured in 2 parameters
    Scoring matrices
    Gap penalties

Scoring Matrices
  Characterize the probability that one residue was substituted for another (log odds-ratio)
  BLOSUM62 (1992):
       A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V
    A  4 -1 -2 -2  0 -1 -1  0 -2 -1 -1 -1 -1 -2 -1  1  0 -3 -2  0
    R -1  5  0 -2 -3  1  0 -2  0 -3 -2  2 -1 -3 -2 -1 -1 -3 -2 -3
    N -2  0  6  1 -3  0  0  0  1 -3 -3  0 -2 -3 -2  1  0 -4 -2 -3
    D -2 -2  1  6 -3  0  2 -1 -1 -3 -4 -1 -3 -3 -1  0 -1 -4 -3 -3
    C  0 -3 -3 -3  9 -3 -4 -3 -3 -1 -1 -3 -1 -2 -3 -1 -1 -2 -2 -1
    Q -1  1  0  0 -3  5  2 -2  0 -3 -2  1  0 -3 -1  0 -1 -2 -1 -2
    E -1  0  0  2 -4  2  5 -2  0 -3 -3  1 -2 -3 -1  0 -1 -3 -2 -2
    G  0 -2  0 -1 -3 -2 -2  6 -2 -4 -4 -2 -3 -3 -2  0 -2 -2 -3 -3
    H -2  0  1 -1 -3  0  0 -2  8 -3 -3 -1 -2 -1 -2 -1 -2 -2  2 -3
    I -1 -3 -3 -3 -1 -3 -3 -4 -3  4  2 -3  1  0 -3 -2 -1 -3 -1  3
    L -1 -2 -3 -4 -1 -2 -3 -4 -3  2  4 -2  2  0 -3 -2 -1 -2 -1  1
    K -1  2  0 -1 -3  1  1 -2 -1 -3 -2  5 -1 -3 -1  0 -1 -3 -2 -2
    M -1 -1 -2 -3 -1  0 -2 -3 -2  1  2 -1  5  0 -2 -1 -1 -1 -1  1
    F -2 -3 -3 -3 -2 -3 -3 -3 -1  0  0 -3  0  6 -4 -2 -2  1  3 -1
    P -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4  7 -1 -1 -4 -3 -2
    S  1 -1  1  0 -1  0  0  0 -1 -2 -2  0 -1 -2 -1  4  1 -3 -2 -2
    T  0 -1  0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1  1  5 -2 -2  0
    W -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1  1 -4 -3 -2 11  2 -3
    Y -2 -2 -2 -3 -2 -1 -2 -3  2 -1 -1 -2 -1  3 -3 -2 -2  2  7 -1
    V  0 -3 -3 -3 -1 -2 -2 -3 -3  3  1 -2  1 -1 -2 -2  0 -3 -1  4
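To make the "log odds-ratio" idea concrete, here is a minimal sketch of how one matrix entry could be computed from frequencies. The pair frequency p_ab and the background frequencies q_a, q_b used below are made-up illustrative numbers, not values from the BLOSUM data; the scale of 2 reflects the half-bit units used by BLOSUM62.

```python
import math


def log_odds_score(p_ab: float, q_a: float, q_b: float, scale: float = 2.0) -> int:
    """Rounded log-odds score: scale * log2(p_ab / (q_a * q_b)).

    p_ab is the observed frequency of the aligned pair (a, b);
    q_a * q_b is the frequency expected if the residues were independent.
    """
    return round(scale * math.log2(p_ab / (q_a * q_b)))


# Hypothetical frequencies, chosen only to illustrate the sign of the score.
print(log_odds_score(0.02, 0.05, 0.05))     # pair seen more often than chance
print(log_odds_score(0.00125, 0.05, 0.05))  # pair seen less often than chance
```

A positive entry means the pair occurs in alignments more often than chance would predict; a negative entry means it occurs less often.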

Gap Penalties
  Characterize the probability that a residue was inserted or deleted
  For a gap of length g:
    Linear: $\gamma(g) = -g d$
    Affine: $\gamma(g) = -d - (g - 1) e$
  d = gap opening penalty
  e = gap extension penalty
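The two penalty schemes can be sketched as small functions, using the convention that a gap contributes a negative amount to the alignment score. The function names and the numeric values of d and e below are illustrative assumptions.

```python
def linear_gap(g: int, d: float) -> float:
    """Linear penalty: every gap position costs d."""
    return -g * d


def affine_gap(g: int, d: float, e: float) -> float:
    """Affine penalty: d to open the gap, e for each further position."""
    return -d - (g - 1) * e


# Hypothetical values: opening penalty 8, extension penalty 1.
for g in (1, 2, 3):
    print(g, linear_gap(g, 8), affine_gap(g, 8, 1))
```

With e smaller than d, the affine scheme makes one long gap cheaper than several short ones, which matches the idea that a single insertion or deletion event can span many residues.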

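Typical Output
  Sequence alignment, with a marker between each pair of aligned residues:
    a match marker (commonly |) means two residues are identical
    : means that two residues have similar physicochemical properties
  Typical objective function, with gap penalties counted as positive magnitudes:
    $\mathrm{Score} = \sum_{\mathrm{start}}^{\mathrm{end}} \mathrm{ScoreMatrix} - \sum_{\mathrm{start}}^{\mathrm{end}} \mathrm{GapPenalties}$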

Pairwise Sequence Alignment Algorithms
  Exhaustive: Dynamic Programming
    Needleman-Wunsch (1970)
    Smith-Waterman (1981)
  Approximate: Heuristic Methods
    BLAST (PSI-BLAST) (1990, 1997)
    FASTA (1988)
  Statistical
    Bayes Block Aligner (1998)
    BALSA (2002)

Dynamic Programming
  An optimization method that solves the problem through a sequence of decisions
  Global: Needleman-Wunsch (1970)
  Local: Smith-Waterman (1981)
  Almost all alignment is done at the local level

Dynamic Programming Algorithms
  The optimal alignment, $A^*$, is found by fixing the scoring matrix, $\Theta_0$, and the gap penalties, $\Lambda_0$, and maximizing the log-likelihood:
    $\log P(R^{(1)}, R^{(2)}, A^* \mid \Theta_0, \Lambda_0) = \max_A \{ \log P(R^{(1)}, R^{(2)} \mid A, \Theta_0) + \log P(A \mid \Lambda_0) \}$
  First term (scoring matrix): $s(R^{(1)}_i, R^{(2)}_j)$
  Second term (gap penalties): d = gap opening penalty, e = gap extension penalty
  Alignment: $A_{i,j} = 1$ if $R^{(1)}_i$ is aligned with $R^{(2)}_j$, and $A_{i,j} = 0$ otherwise

Standard Dynamic Programming Choices
  Match          Insertion into Sequence 1   Deletion from Sequence 1
  $R^{(1)}_i$    $R^{(1)}_i$                 -
  $R^{(2)}_j$    -                           $R^{(2)}_j$

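Algorithm (Smith-Waterman, simple gap penalty)
  $F(i, j) = \max \{ F(i-1, j-1) + s(R^{(1)}_i, R^{(2)}_j),\ F(i-1, j) - d,\ F(i, j-1) - d,\ 0 \}$
  Scoring matrix: +1 on the diagonal (match), -1 elsewhere (mismatch)
  Gap penalty: d = 1
  Example, GCG (columns) against GAG (rows):
             G   C   G
         0   0   0   0
     G   0   1   0   1
     A   0   0   0   0
     G   0   1   0   1
  First cell: the candidates are +1 (diagonal), -1 (from above), -1 (from the left) and 0, so F(1, 1) = +1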
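The recursion on this slide can be sketched in a few lines of Python. This is a minimal illustration that uses the slide's toy scoring (match +1, mismatch -1, gap penalty d = 1); the function name and its parameters are illustrative choices, not part of any published implementation.

```python
def smith_waterman_matrix(seq1, seq2, match=1, mismatch=-1, d=1):
    """Fill the local-alignment score matrix F with a simple (linear) gap penalty.

    Rows follow seq1, columns follow seq2; row 0 and column 0 stay at zero.
    """
    rows, cols = len(seq1) + 1, len(seq2) + 1
    F = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if seq1[i - 1] == seq2[j - 1] else mismatch
            F[i][j] = max(
                F[i - 1][j - 1] + s,  # align the two residues
                F[i - 1][j] - d,      # gap in seq2
                F[i][j - 1] - d,      # gap in seq1
                0,                    # start a new local alignment
            )
    return F


for row in smith_waterman_matrix("GAG", "GCG"):
    print(row)
```

The zero in the max is what makes the alignment local: a poorly scoring prefix is dropped instead of being carried forward.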

Reconstructing the Alignment
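Reconstruction is usually done by tracing back from the highest-scoring cell until a cell with score 0 is reached. The sketch below assumes match +1, mismatch -1 and a linear gap penalty d = 1; the function name and the tie-breaking order (diagonal first) are illustrative choices.

```python
def local_align(seq1, seq2, match=1, mismatch=-1, d=1):
    """Smith-Waterman fill plus traceback; returns (score, aligned1, aligned2)."""
    n, m = len(seq1), len(seq2)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if seq1[i - 1] == seq2[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s, F[i - 1][j] - d, F[i][j - 1] - d, 0)
            if F[i][j] > best:
                best, best_pos = F[i][j], (i, j)

    # Trace back from the best cell until the score drops to zero.
    i, j = best_pos
    top, bottom = [], []
    while i > 0 and j > 0 and F[i][j] > 0:
        s = match if seq1[i - 1] == seq2[j - 1] else mismatch
        if F[i][j] == F[i - 1][j - 1] + s:      # came from the diagonal
            top.append(seq1[i - 1])
            bottom.append(seq2[j - 1])
            i, j = i - 1, j - 1
        elif F[i][j] == F[i - 1][j] - d:        # gap in seq2
            top.append(seq1[i - 1])
            bottom.append("-")
            i -= 1
        else:                                   # gap in seq1
            top.append("-")
            bottom.append(seq2[j - 1])
            j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


print(local_align("HEAGAWGHE", "GHEE"))
```

With these toy scores the two example sequences used earlier in the talk give the local alignment GHE over GHE.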

Smith-Waterman Availability
  http://www.dna.affrc.go.jp/htdocs/swsrch/
  http://www.ebi.ac.uk/mpsrch/
  http://www-hto.usc.edu/software/seqaln/seqaln-query.html

Results from SWsrch with hemoglobin
Title: >hemoglobin
mvlsaddktnikncwgkigghggeygeealqrmfaafpttktyfshidvspgsaqvkahg
kkvadalakaadhvedlpgalstlsdlhahklrvdpvnfkflshcllvtlachhpgdftp
amhasldkflasvstvltskyr
Perfect Score: 750
Sequence: 1 MVLSADDKTNIKNCWGKIGG...HASLDKFLASVSTVLTSKYR 142
SUMMARIES
Result         Query %
   No.  Score    Match  Length  DB  ID  Description
     1    750    100.0     142   1  HART1(I54239;I68531;A26903;A93047;A90284;A90285;A02268) hemoglobin alpha-1 chain - rat &RATHBAM_1(M17083 pid:
     2    745     99.3     141   1  (P01946) Hemoglobin alpha-1 and alpha-2 c
     3    742     98.9     142   1  RN2A1GL_1(X56325 pid:g3367722) R.norvegicus
     4    632     84.3     142   1  AK003077_1(AK003077 pid:none) Mus musculus a
     5    632     84.3     142   1  HAMS(A90791;I49720;A45964;I49722;I49721;B43560;A92945;) hemoglobin alpha chains - mouse &MMAGL1_1(V00714 pid:
     6    630     84.0     142   1  AK011076_1(AK011076 pid:none) Mus musculus 1
     7    627     83.6     141   1  (P01942) Hemoglobin alpha chain.
     8    623     83.1     141   1  (P20854) Hemoglobin alpha chain. &HARTNG
     9    623     83.1     142   1  L75940_1(L75940 pid:none) Mus musculus alpha
    10    623     83.1     142   1  AK010422_1(AK010422 pid:none) Mus musculus E
    11    615     82.0     141   1  HASL1W(S10481)hemoglobin alpha-i chain - Wed
    12    611     81.5     141   1  (P01938) Hemoglobin alpha chain. &HALRN(
    13    610     81.3     141   1  (P09420) Hemoglobin alpha chain. &A25359
    14    609     81.2     141   1  (P01930) Hemoglobin alpha chain. &HAMQB(
    15    607     80.9     141   1  (P18969) Hemoglobin alpha chain. &HAFQL(
    16    605     80.7     141   1  (P01974) Hemoglobin alpha chain. &HACMA(
    17    605     80.7     141   1  HAMN2F(S11533)hemoglobin alpha-ii chain - do
    18    605     80.7     141   1  (P01945) Hemoglobin alpha chain. &HAHY(A
    19    604     80.5     141   1  (P15163) Hemoglobin alpha-i and alpha-ii
    20    603     80.4     141   1  (P01928) Hemoglobin alpha chain. &HAMQA(

Results from SWsrch with hemoglobin cont.
RESULT 9
>L75940_1(L75940 pid:none) Mus musculus alpha-globin mRNA, complete cds. &MUSALGL_1(L75940 pid:g1162945)
Query Match 83.1%; Score 623; DB 1; Length 142;
Best Local Similarity 83.8%; Pred. No. 1.80e-63;
Matches 119; Conservative 7; Mismatches 16; Indels 0; Gaps 0;
Inserts 0; InsGaps 0; Deletes 0; DelGaps 0;
Db   1 mvlsgedksnikaawgkigghgaeyvaealermfasfpttktyfphfdvshgsaqvkghg 60
Qy   1 mvlsaddktnikncwgkigghggeygeealqrmfaafpttktyfshidvspgsaqvkahg 60
Db  61 kkvadalasaaghlddlpgalsalsdlhahklrvdpvnfkllshcllvtlashhpadftp 120
Qy  61 kkvadalakaadhvedlpgalstlsdlhahklrvdpvnfkflshcllvtlachhpgdftp 120
Db 121 avhasldkflasvstvltskyr 142
Qy 121 amhasldkflasvstvltskyr 142

BLAST (Basic Local Alignment Search Tool)
  Indexes the database
  Calculates the neighborhood of each word in the query using a scoring matrix and a probability threshold
  Looks up all query words and their neighbors in the database index
  Extends high-scoring segment pairs (HSPs) left and right to maximal length
  Finds maximal segment pairs (MSPs) between query and database
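The steps above can be sketched as a toy seed-and-extend search. This is a simplified illustration, not the NCBI implementation: it assumes a nucleotide alphabet, replaces the scoring matrix with match +1 / mismatch -1, uses a score threshold for the word neighborhood, and stops ungapped extension with a simple drop-off rule (x_drop). All function and parameter names are hypothetical.

```python
from itertools import product

ALPHABET = "ACGT"  # assumption: nucleotide sequences


def score(a, b, match=1, mismatch=-1):
    return match if a == b else mismatch


def build_index(db, w):
    """Map every length-w word of the database sequence to its start positions."""
    index = {}
    for pos in range(len(db) - w + 1):
        index.setdefault(db[pos:pos + w], []).append(pos)
    return index


def neighborhood(word, threshold):
    """All words of the same length that score >= threshold against `word`."""
    return [
        "".join(cand)
        for cand in product(ALPHABET, repeat=len(word))
        if sum(score(a, b) for a, b in zip(word, cand)) >= threshold
    ]


def _extend_one_way(query, db, qi, di, step, x_drop):
    """Walk from (qi, di) in direction `step`; return (best gain, residues kept)."""
    best = run = kept = length = 0
    while 0 <= qi < len(query) and 0 <= di < len(db):
        run += score(query[qi], db[di])
        length += 1
        if run > best:
            best, kept = run, length
        elif best - run > x_drop:
            break
        qi += step
        di += step
    return best, kept


def extend_seed(query, db, q_pos, d_pos, w, x_drop):
    """Ungapped extension; returns (score, query start, db start, length)."""
    seed = sum(score(query[q_pos + k], db[d_pos + k]) for k in range(w))
    r_gain, r_len = _extend_one_way(query, db, q_pos + w, d_pos + w, 1, x_drop)
    l_gain, l_len = _extend_one_way(query, db, q_pos - 1, d_pos - 1, -1, x_drop)
    return seed + l_gain + r_gain, q_pos - l_len, d_pos - l_len, w + l_len + r_len


def blast_like_search(query, db, w=3, threshold=3, x_drop=2):
    index = build_index(db, w)
    hsps = set()
    for q_pos in range(len(query) - w + 1):
        for word in neighborhood(query[q_pos:q_pos + w], threshold):
            for d_pos in index.get(word, []):
                hsps.add(extend_seed(query, db, q_pos, d_pos, w, x_drop))
    return sorted(hsps, reverse=True)


print(blast_like_search("CCACGTGG", "TTACGTAA"))
```

Lowering `threshold` below the word length admits inexact seed words, which is what lets a word-based search find diverged matches that an exact word index would miss.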

BLAST database search

PSI-BLAST (Position-Specific Iterative BLAST)
  A profile (position-specific scoring matrix, PSSM) is constructed from a multiple alignment of the highest-scoring hits in a BLAST search
  The PSSM is generated by calculating position-specific scores for each position in the alignment
  The profile is used to perform a second (third, ...) BLAST search, and the results of each "iteration" are used to refine the profile
  This iterative searching strategy results in increased sensitivity
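The profile-building step can be illustrated with a minimal sketch. It assumes an ungapped alignment of equal-length sequences, a uniform background distribution unless one is supplied, and a simple additive pseudocount; PSI-BLAST's own sequence weighting and pseudocount scheme are more elaborate and are not reproduced here. The function names are illustrative.

```python
import math
from collections import Counter


def build_pssm(aligned, alphabet, background=None, pseudocount=1.0):
    """Column-wise log-odds scores (in bits) from equal-length aligned sequences."""
    if background is None:
        background = {a: 1.0 / len(alphabet) for a in alphabet}
    pssm = []
    for column in zip(*aligned):
        counts = Counter(column)
        total = len(column) + pseudocount * len(alphabet)
        pssm.append({
            a: math.log2(((counts[a] + pseudocount) / total) / background[a])
            for a in alphabet
        })
    return pssm


def score_window(pssm, window):
    """Score a candidate segment of the same width as the profile."""
    return sum(col[res] for col, res in zip(pssm, window))


# Toy nucleotide alignment, used only to keep the example small.
profile = build_pssm(["ACG", "ACG", "ATG", "ACG"], "ACGT")
print(score_window(profile, "ACG"), score_window(profile, "TTT"))
```

Each column gets its own scores, so a residue that is strongly conserved at one position is rewarded there and nowhere else, which a single substitution matrix cannot express.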

Types of BLAST
  BLASTP: searches a protein sequence against a protein database
  BLASTN: searches a nucleotide sequence against a nucleotide database
  TBLASTN: searches a protein sequence against a nucleotide database by translating each database nucleotide sequence in all 6 reading frames
  BLASTX: searches a nucleotide sequence against a protein database by first translating the query nucleotide sequence in all 6 reading frames
  PSI-BLAST: builds a profile from identified homologs, searches with it, and iteratively updates it (especially good for identifying remote homologies)
  PHI-BLAST: enforces the presence of a motif in addition to the usual PSI-BLAST criteria for matching
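The six-frame translation used by TBLASTN and BLASTX can be sketched as follows. The sketch assumes the standard genetic code and unambiguous DNA letters (A, C, G, T); the function names are illustrative, and stop codons are written as `*`.

```python
from itertools import product

# Standard genetic code, codons enumerated with bases in the order T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna):
    """Translate complete codons from the start of `dna`; a trailing partial codon is ignored."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


def six_frames(dna):
    """Return the translations of the three forward and three reverse-complement frames."""
    dna = dna.upper()
    reverse = dna.translate(COMPLEMENT)[::-1]
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna[offset:])
        frames[f"-{offset + 1}"] = translate(reverse[offset:])
    return frames


print(six_frames("ATGGCC"))
```

Translating in all six frames is needed because the search does not know in advance which strand or which reading frame encodes the protein.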

Assessing Evidence for Homology
  Is the score higher than expected from 2 random sequences (non-homologs)?
  BLAST scores are not independent of the length of the sequences being aligned
  Fit an extreme value distribution to the scores of randomly shuffled sequences
    BLAST returns the maximum score
    The maximum of a large number of i.i.d. random variables tends to an extreme value distribution
  Expected number of HSPs with score at least S for 2 sequences of length m and n:
    $E = K m n e^{-\lambda S}$
  A similar approach is used for Smith-Waterman
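The expectation formula can be evaluated directly. In the sketch below the values of K and lambda are placeholders chosen for illustration; in practice they depend on the scoring matrix and gap penalties. The bit-score conversion is included because BLAST reports scores in bits, and both routes should give the same expected count.

```python
import math


def expected_hsps(S, m, n, K, lam):
    """E = K * m * n * exp(-lambda * S)."""
    return K * m * n * math.exp(-lam * S)


def bit_score(S, K, lam):
    """Normalized score in bits: (lambda * S - ln K) / ln 2."""
    return (lam * S - math.log(K)) / math.log(2)


def expected_from_bits(bits, m, n):
    """Equivalent form of the expectation: E = m * n * 2^(-bits)."""
    return m * n * 2.0 ** (-bits)


# Hypothetical parameters: query length 142, database length 1e6.
K, LAM = 0.13, 0.32
for raw in (30, 50, 70):
    print(raw, expected_hsps(raw, 142, 1e6, K, LAM))
```

Because E scales with m * n, the same raw score is less convincing against a larger database, which is the length dependence noted on the slide.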

BLAST Availability
  http://www.ncbi.nlm.nih.gov/education/BLASTinfo/information3.html
  http://www.ch.embnet.org/software/bottomBLAST.html
  http://hits.isb-sib.ch/cgi-bin/hits_psi_blast

Results from BLAST with hemoglobin
  [Figure: distribution of 100 BLAST hits on the query sequence]

Results from BLAST cont.
Sequences producing significant alignments:                            Score (bits)  E-Value
gi 6981010 ref NP_037228.1 hemoglobin, alpha 1 [Rattus nor...          214  4e-55
gi 34870607 ref XP_340780.1 similar to hemoglobin alpha ch...          211  4e-54
gi 12845853 dbj BAB26925.1 unnamed protein product [Mus mu...          184  4e-46
gi 12833511 dbj BAB22552.1 unnamed protein product [Mus mu...          183  8e-46
gi 26345020 dbj BAC36159.1 unnamed protein product [Mus mu...          183  9e-46
gi 12846963 dbj BAB27381.1 unnamed protein product [Mus mu...          183  1e-45
gi 122280 sp P11755 HBA1_TADBR Hemoglobin alpha-1 chain >gi...         182  1e-45
gi 6680175 ref NP_032244.1 hemoglobin alpha, adult chain 1...          182  1e-45
gi 1162945 gb AAB59723.1 alpha-globin [Mus musculus]                   182  2e-45
gi 122352 sp P14387 HBA_ANTPA Hemoglobin alpha chain >gi 28...         180  6e-45
gi 70217 pir HASL1W hemoglobin alpha-i chain - Weddell seal            180  7e-45
gi 122341 sp P18969 HBA_AILFU Hemoglobin alpha chain >gi 70...         179  1e-44
gi 418658 pir HASHR2 hemoglobin alpha-ii chain - aoudad (t...          179  2e-44
gi 122491 sp P01950 HBA_SUNMU Hemoglobin alpha chain >gi 70...         179  2e-44
gi 122446 sp P26915 HBA_NASNA Hemoglobin alpha chain >gi 10...         178  2e-44
gi 14194808 sp Q9XSN3 HBA1_EQUBU Hemoglobin alpha-1 chain >...         178  2e-44

Results from BLAST cont.
>gi 122352 sp P14387 HBA_ANTPA Hemoglobin alpha chain gi 281094 pir A29702 hemoglobin alpha chain - pallid bat
Length = 141
Score = 180 bits (457), Expect = 6e-45
Identities = 96/141 (68%), Positives = 104/141 (73%)
Query: 2   VLSADDKTNIKNCWXXXXXXXXXXXXXALQRMFAAFPTTKTYFSHIDVSPGSAQVKAHGX 61
           VLS  DKTN+K  W             AL+RMF +FPTTKTYF H D+S GSAQVK HG
Sbjct: 1   VLSPADKTNVKAAWDKVGGHAGDYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGK 60
Query: 62  XXXXXXXXXXXHVEDLPGALSTLSDLHAHKLRVDPVNFKFLSHCLLVTLACHHPGDFTPA 121
                      H++DLPGALS LSDLHA+KLRVDPVNFK LSHCLLVTLACHHPGDFTPA
Sbjct: 61  KVGDALGNAVAHMDDLPGALSALSDLHAYKLRVDPVNFKLLSHCLLVTLACHHPGDFTPA 120
Query: 122 MHASLDKFLASVSTVLTSKYR 142
           +HASLDKFLASVSTVL SKYR
Sbjct: 121 VHASLDKFLASVSTVLVSKYR 141

BALSA (Bayesian Algorithm for Local Sequence Alignment)
  Smith-Waterman recursion with sums
  Formulates sequence alignment as a Bayesian inference problem
  Everything is a random variable
  Allows multiple scoring matrices
  Makes inferences on the parameters

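BALSA Methodology
  Joint = likelihood * priors
    $P(R^{(1)}, R^{(2)}, A, \Theta, \Lambda) = P(R^{(1)}, R^{(2)} \mid A, \Theta)\, P(A \mid \Lambda)\, P(\Theta, \Lambda)$
  Priors
    $P(\Theta, \Lambda) = \frac{1}{N_{\Theta, \Lambda}}$
  A priori
    $P(A \mid \Lambda) = P(A \mid \lambda_o, \lambda_e) = \frac{\lambda_o^{k_g(A)}\, \lambda_e^{l_g(A) - k_g(A)}}{\sum_{A'} \lambda_o^{k_g(A')}\, \lambda_e^{l_g(A') - k_g(A')}}$
    where $k_g(A)$ is the number of gaps in alignment A and $l_g(A)$ is their total length
  Algorithm
    $P(R^{(1)}, R^{(2)} \mid \Theta, \Lambda) = \sum_A P(R^{(1)}, R^{(2)} \mid A, \Theta)\, P(A \mid \Lambda) = \frac{\sum_A P(R^{(1)}, R^{(2)} \mid A, \Theta)\, \lambda_o^{k_g(A)}\, \lambda_e^{l_g(A) - k_g(A)}}{\sum_A \lambda_o^{k_g(A)}\, \lambda_e^{l_g(A) - k_g(A)}}$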
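The sum over all alignments can be computed with the same kind of recursion as the optimal alignment, replacing the max by a sum and the additive scores by multiplicative weights. The sketch below is a deliberately simplified illustration of that idea, not BALSA's recursion: it sums over global alignments and uses a single per-position gap weight instead of separate opening and extension terms. The function name and the weights are hypothetical.

```python
def alignment_sum(seq1, seq2, pair_weight, gap_weight):
    """Sum over all global alignments of the product of their weights.

    Each aligned pair (a, b) contributes pair_weight(a, b) and each gap
    position contributes gap_weight; two gap orderings count as two alignments.
    """
    n, m = len(seq1), len(seq2)
    Z = [[0.0] * (m + 1) for _ in range(n + 1)]
    Z[0][0] = 1.0
    for i in range(n + 1):
        for j in range(m + 1):
            if i > 0 and j > 0:
                Z[i][j] += Z[i - 1][j - 1] * pair_weight(seq1[i - 1], seq2[j - 1])
            if i > 0:
                Z[i][j] += Z[i - 1][j] * gap_weight   # residue of seq1 against a gap
            if j > 0:
                Z[i][j] += Z[i][j - 1] * gap_weight   # residue of seq2 against a gap
    return Z[n][m]


def toy_weight(a, b):
    # Hypothetical likelihood-ratio style weights for match and mismatch.
    return 2.0 if a == b else 0.5


print(alignment_sum("A", "A", toy_weight, 0.5))
```

For the one-residue example the three alignments contribute 2.0, 0.25 and 0.25, so the sum is 2.5, whereas the max-based recursion would keep only the 2.0 term.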

Assessing Evidence for Homology
  The scores are independent of the length of the sequences being aligned
    $\mathrm{Score} = \frac{P(R^{(1)}, R^{(2)} \mid H)}{P(R^{(1)})\, P(R^{(2)})}$
  Directly calculate the probability of not being homologous from the score
    $P(\bar{H} \mid R^{(1)}, R^{(2)}) = \frac{1}{\mathrm{Score} \cdot \frac{P(H)}{P(\bar{H})} + 1}$
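The conversion from score to posterior probability follows from Bayes' rule when the score is read as a likelihood ratio and non-homologous sequences are treated as independent. A minimal sketch, with an illustrative function name and made-up prior values:

```python
def prob_not_homologous(score: float, prior_h: float) -> float:
    """Posterior probability of non-homology.

    score   : likelihood ratio P(R1, R2 | H) / (P(R1) * P(R2))
    prior_h : prior probability P(H) that the pair is homologous (0 < prior_h < 1)
    """
    prior_odds = prior_h / (1.0 - prior_h)
    return 1.0 / (score * prior_odds + 1.0)


# Hypothetical priors: an uninformative 0.5 and a sceptical 0.1.
print(prob_not_homologous(1.0, 0.5))
print(prob_not_homologous(9.0, 0.1))
```

A score of 1 leaves the prior unchanged, and a larger score is needed to reach the same posterior when the prior probability of homology is small.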

BALSA Availability
  http://bayesweb.wadsworth.org/balsa/balsa.html

Conclusions on Pairwise Sequence Alignment
  Choice of algorithm is based on need
  Trade-off between sensitivity and speed
  SCOP40 sensitivity at 1% EPQ (errors per query):
    Method      Sensitivity
    BLAST       14.8%
    FASTA       16.7%
    SSEARCH     18.4%
    BALSA w/1   19.2%
    BALSA w/4   19.8%
  [Figure axis label: Speed]


Multiple Sequence Alignment
  Multiple sequence alignment is generally concerned with finding structural or functional patterns shared between sequences
  Develop relationships and phylogenies
  Determine a consensus sequence
  Build gene families
  Model protein structure for threading and fold prediction

Motivation: Example Motif Discovery

Motif -> Scoring Matrix

Motif Finding Programs (search against a database of motifs)
  GCG SEQWEB programs (http://pmgm.stanford.edu)
    STRINGSEARCH
    FINDPATTERNS
    MOTIFS
  PROSITE web programs
    PROSITE: http://www.expasy.ch/prosite
    PROSITE SCAN: http://www.expasy.ch/tools/scanprosite
  eMOTIF web programs
    eMOTIF: http://motif.stanford.edu/emotif
    eMOTIF-search: http://motif.stanford.edu/emotif-search
    eMOTIF-scan: http://motif.stanford.edu/emotif-scan
    3MOTIF: http://3motif.stanford.edu

Multiple Sequence Alignment Challenges
  Computational complexity: O(n^k) for k sequences of length n
  Space requirements: O(n^k) for k sequences of length n
  Sequence clusters require a weighting function
    Weighted alignments tend to overweight erroneous sequences
  Approximations must be used for real-world data
    Linked lists are used to find exact words shared between k sequences
    BLAST can find inexact shared words between k sequences
    FASTA can be used to do progressive pairwise alignments
    Gibbs sampling finds the best overall alignment stochastically
  The final alignment often depends on the order in which the data are presented
  Gaps make alignments unnaturally long

Multiple Alignment
  Multiple Sequence Aligner (MSA): builds a linked list of words
  GenAlign: iteratively adds sequences
  ClustalW: progressively adds sequences using clusters (a dendrogram)
  Gibbs Sampling: generates a random alignment of size k, then iteratively samples and updates the alignment until convergence

ClustalW Step 1
  Generate all pairwise alignments
  Generate a dendrogram from the alignment scores

ClustalW Step 2
  Align the most similar pair
  Align the next most similar pair
  Combine the 2 alignments
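The order in which a progressive aligner joins sequences can be sketched with a small clustering routine. The sketch assumes the pairwise distances have already been derived from the pairwise alignment scores and uses average linkage for brevity; ClustalW itself builds its guide tree with neighbor-joining. The function name, the input format and the distances are illustrative.

```python
from itertools import combinations


def guide_order(names, distance):
    """Return the sequence of joins, closest clusters first (average linkage).

    `distance` maps frozenset({a, b}) to the distance between sequences a and b.
    """
    clusters = [(name,) for name in names]

    def avg_dist(c1, c2):
        total = sum(distance[frozenset((a, b))] for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    joins = []
    while len(clusters) > 1:
        c1, c2 = min(combinations(clusters, 2), key=lambda pair: avg_dist(*pair))
        joins.append((c1, c2))
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 + c2]
    return joins


# Hypothetical distances between four sequences.
toy = {
    frozenset("AB"): 0.1, frozenset("CD"): 0.2,
    frozenset("AC"): 0.8, frozenset("AD"): 0.9,
    frozenset("BC"): 0.7, frozenset("BD"): 0.8,
}
for left, right in guide_order(["A", "B", "C", "D"], toy):
    print(left, "+", right)
```

For the toy distances the joins come out as A with B, then C with D, then the two pair alignments combined, which is the order described on the slide.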

ClustalW General Approach

ClustalW Availability
  http://www.ebi.ac.uk/clustalw/
  http://bioweb.pasteur.fr/seqanal/interfaces/clustalw.html
  http://www.ch.embnet.org/software/clustalw.html

Gibbs Sampling
  Traditional Gibbs sampling
    1. Sample an alignment given the parameters: $P(A \mid \Theta, R)$
    2. Sample the parameters given an alignment: $P(\Theta \mid A, R)$

Phylogenetic Footprinting
  Find DNA functional elements and signals in the non-coding region surrounding a gene

Transcription Regulation
  Gene transcription and regulation
    Transcription is initiated by RNA polymerase binding
    Enhancers and repressors
    Binding of transcription factors inhibits or enhances expression
  [Figure: 5' to 3' strand showing the promoter region, RNA polymerase, and the start codon AUG]

Example: Corepressor
  [Figure panels: Transcription in process; Transcription inhibited]

Motif Alignment Model
  [Figure: k sequences, sequence k of length $n_k$, with motif start positions $a_1, a_2, \ldots, a_k$; motif width = w]
  The missing data is the alignment variable: $A = \{a_1, a_2, \ldots, a_k\}$
  Alignment: starting positions of the binding sites in each sequence
  A priori, all positions are equally likely
  The final alignment depends on the DNA sequence

Gibbs Sampler Algorithm
  Initialize by choosing random starting positions $a_1^{(0)}, a_2^{(0)}, \ldots, a_k^{(0)}$
  Iterate the following steps many times:
    Randomly or systematically choose a sequence, say sequence k, to exclude
    Carry out the predictive-updating step to update $a_k$
  Stop when changes are infrequent or some other criterion is met
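The loop above can be sketched as a small site sampler. This is a minimal illustration under stated assumptions, not the Wadsworth Gibbs sampler: it expects at least two DNA sequences containing only A, C, G and T, assumes exactly one site of fixed width w per sequence, scans the sequences systematically, uses an additive pseudocount, and runs for a fixed number of sweeps instead of testing for convergence. All names and default values are hypothetical.

```python
import random
from collections import Counter


def gibbs_motif_sampler(seqs, w, sweeps=200, alphabet="ACGT", pseudocount=1.0, seed=0):
    """Return one sampled motif start position per sequence."""
    rng = random.Random(seed)
    # Initialize with random starting positions.
    positions = [rng.randrange(len(s) - w + 1) for s in seqs]

    # Background residue frequencies estimated from all sequences.
    counts = Counter("".join(seqs))
    total = sum(counts[a] for a in alphabet) + pseudocount * len(alphabet)
    background = {a: (counts[a] + pseudocount) / total for a in alphabet}

    for _ in range(sweeps):
        for z, seq in enumerate(seqs):  # systematic choice of the excluded sequence
            # Motif model built from the current sites of all other sequences.
            sites = [s[p:p + w] for k, (s, p) in enumerate(zip(seqs, positions)) if k != z]
            denom = len(sites) + pseudocount * len(alphabet)
            model = []
            for column in zip(*sites):
                col_counts = Counter(column)
                model.append({a: (col_counts[a] + pseudocount) / denom for a in alphabet})

            # Predictive update: weight each start by motif versus background likelihood.
            weights = []
            for start in range(len(seq) - w + 1):
                ratio = 1.0
                for j in range(w):
                    residue = seq[start + j]
                    ratio *= model[j][residue] / background[residue]
                weights.append(ratio)
            positions[z] = rng.choices(range(len(weights)), weights=weights)[0]
    return positions


toy_seqs = ["ACGTACGTAC", "TTACGTTTAA", "GGGACGTGGG"]
print(gibbs_motif_sampler(toy_seqs, w=4))
```

Sampling the new position in proportion to the weights, instead of always taking the best one, is what lets the procedure escape poor starting alignments.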

Gibbs Sampler Availability
  http://bayesweb.wadsworth.org/gibbs/gibbs.html
  http://www.bioinformatics.ubc.ca/resources/tools/index.php?name=gibbs

Conclusions
  Sequence analysis is the most commonly performed task in bioinformatics
  The choice of algorithm depends on needs
    Pairwise: homology detection
    Multiple: motif detection, building gene families, phylogenetic footprinting
  The future is in whole-genome comparisons

Other Sources of Information
  http://www.ncbi.nlm.nih.gov/blast/ (extensive tutorials)
  Bioinformatics books
    Biological Sequence Analysis (Durbin et al.)
    Bioinformatics: The Machine Learning Approach (Baldi & Brunak)
    Computational Molecular Biology (Pevzner)
  Journal articles
    Altschul et al. Journal of Molecular Biology, 1990. 215; p403-410 (original BLAST paper)
    Altschul et al. Nucleic Acids Research, 1997. 25; p3389-3402 (PSI-BLAST paper)
    McCue et al. Nucleic Acids Research, 2001. 29; p774-782 (phylogenetic footprinting)