Multiple Sequence Alignment

Similar documents
Pairwise Sequence Alignment

Protein Sequence Analysis - Overview -

Phylogenetic Trees Made Easy

Sequence Analysis 15: lecture 5. Substitution matrices Multiple sequence alignment

Bioinformatics Grid - Enabled Tools For Biologists.

Guide for Bioinformatics Project Module 3

Introduction to Bioinformatics AS Laboratory Assignment 6

The Central Dogma of Molecular Biology

Genome Explorer For Comparative Genome Analysis

Rapid alignment methods: FASTA and BLAST. p The biological problem p Search strategies p FASTA p BLAST

Amino Acids and Their Properties

RETRIEVING SEQUENCE INFORMATION. Nucleotide sequence databases. Database search. Sequence alignment and comparison

Worksheet - COMPARATIVE MAPPING 1

Bio-Informatics Lectures. A Short Introduction

PROC. CAIRO INTERNATIONAL BIOMEDICAL ENGINEERING CONFERENCE

Lab 2/Phylogenetics/September 16, PHYLOGENETICS

Network Protocol Analysis using Bioinformatics Algorithms

Introduction to Phylogenetic Analysis

Introduction to Bioinformatics 3. DNA editing and contig assembly

Clone Manager. Getting Started

Multiple Sequence Alignment. Hot Topic 5/24/06 Kim Walker

Activity IT S ALL RELATIVES The Role of DNA Evidence in Forensic Investigations

T cell Epitope Prediction

MAKING AN EVOLUTIONARY TREE

BIO 3350: ELEMENTS OF BIOINFORMATICS PARTIALLY ONLINE SYLLABUS

Similarity Searches on Sequence Databases: BLAST, FASTA. Lorenza Bordoli Swiss Institute of Bioinformatics EMBnet Course, Basel, October 2003

BLAST. Anders Gorm Pedersen & Rasmus Wernersson

Arbres formels et Arbre(s) de la Vie

Protein & DNA Sequence Analysis. Bobbie-Jo Webb-Robertson May 3, 2004

Protein Protein Interaction Networks

Core Bioinformatics. Degree Type Year Semester Bioinformàtica/Bioinformatics OB 0 1

Graph theoretic approach to analyze amino acid network

A Step-by-Step Tutorial: Divergence Time Estimation with Approximate Likelihood Calculation Using MCMCTREE in PAML

Linear Sequence Analysis. 3-D Structure Analysis

THREE DIMENSIONAL REPRESENTATION OF AMINO ACID CHARAC- TERISTICS

BIOINFORMATICS TUTORIAL

Bioinformatics: course introduction

Building a phylogenetic tree

Lecture 19: Proteins, Primary Struture

Systematic discovery of regulatory motifs in human promoters and 30 UTRs by comparison of several mammals

AP Biology Essential Knowledge Student Diagnostic

Introduction to Principal Components and FactorAnalysis

Principles of Evolution - Origin of Species

Data for phylogenetic analysis

Section 3 Comparative Genomics and Phylogenetics

Hidden Markov Models in Bioinformatics. By Máthé Zoltán Kőrösi Zoltán 2006

Current Motif Discovery Tools and their Limitations

Heuristics for the Sorting by Length-Weighted Inversions Problem on Signed Permutations

Optimal Contact Map Alignment of Protein-Protein Interfaces Vinay Pulim, 1 Bonnie Berger, 1,2 * Jadwiga Bienkowska, 1,3,* 1

2.3 Identify rrna sequences in DNA

Molecular Databases and Tools

Focusing on results not data comprehensive data analysis for targeted next generation sequencing

BASIC STATISTICAL METHODS FOR GENOMIC DATA ANALYSIS

Supplementary material: A benchmark of multiple sequence alignment programs upon structural RNAs Paul P. Gardner a Andreas Wilm b Stefan Washietl c

A successful market segmentation initiative answers the following critical business questions: * How can we a. Customer Status.

Name: Date: Problem How do amino acid sequences provide evidence for evolution? Procedure Part A: Comparing Amino Acid Sequences

MATCH Commun. Math. Comput. Chem. 61 (2009)

Visualization of Phylogenetic Trees and Metadata

Supplementary Information

OD-seq: outlier detection in multiple sequence alignments

Computational localization of promoters and transcription start sites in mammalian genomes

Lecture/Recitation Topic SMA 5303 L1 Sampling and statistical distributions

Integrating DNA Motif Discovery and Genome-Wide Expression Analysis. Erin M. Conlon

A short guide to phylogeny reconstruction

Bioinformatics Resources at a Glance

Interaktionen von Nukleinsäuren und Proteinen

CSC 2427: Algorithms for Molecular Biology Spring Lecture 16 March 10

Hierarchical Bayesian Modeling of the HIV Response to Therapy

Bayesian Phylogeny and Measures of Branch Support

Bioinformatics: Network Analysis

Dmitri Krioukov CAIDA/UCSD

Module 10: Bioinformatics

Optimal Network Alignment with Graphlet Degree Vectors

Crime Scene Investigation (Adopted from Forensics in the Classroom developed by Court TV)

DNA Insertions and Deletions in the Human Genome. Philipp W. Messer

Biological Sciences Initiative. Human Genome

Genetomic Promototypes

Consensus alignment server for reliable comparative modeling with distant templates

Evidence for evolution factsheet

Lecture 4: Exact string searching algorithms. Exact string search algorithms. Definitions. Exact string searching or matching

Summary Genes and Variation Evolution as Genetic Change. Name Class Date

Biology & Big Data. Debasis Mitra Professor, Computer Science, FIT

Multiple Sequence Alignment and Analysis: Part I An Introduction to the Theory and Application of Multiple Sequence Analysis.

Data Integration via Constrained Clustering: An Application to Enzyme Clustering

PHYLOGENETIC ANALYSIS

Online Resource 9. Figure S8: Amino acid alignments.

Maximum-Likelihood Estimation of Phylogeny from DNA Sequences When Substitution Rates Differ over Sites1

2. The number of different kinds of nucleotides present in any DNA molecule is A) four B) six C) two D) three

EMBOSS A data analysis package

DNA Sequence Alignment Analysis

June 09, 2009 Random Mutagenesis

Annex 6: Nucleotide Sequence Information System BEETLE. Biological and Ecological Evaluation towards Long-Term Effects

A branch-and-bound algorithm for the inference of ancestral. amino-acid sequences when the replacement rate varies among

The sequence of bases on the mrna is a code that determines the sequence of amino acids in the polypeptide being synthesized:

Name: Class: Date: Multiple Choice Identify the choice that best completes the statement or answers the question.

Core Bioinformatics. Titulació Tipus Curs Semestre Bioinformàtica/Bioinformatics OB 0 1

Human-Mouse Synteny in Functional Genomics Experiment

Using MATLAB: Bioinformatics Toolbox for Life Sciences

CCR Biology - Chapter 10 Practice Test - Summer 2012

Transcription:

Multiple Sequence Alignment

Multiple Sequence Alignment! Collection of three or more amino acid (or nucleic acid) sequences partially or completely aligned.! Aligned residues tend to occupy corresponding positions in the 3-D structure of each aligned protein.

General steps to multiple alignment. Create Alignment Edit the alignment to ensure that regions of functional or structural similarity are preserved USED FOR: Phylogenetic Analysis Structure Analysis Find conserved motifs to deduce function Design of PCR primers

Practical use of MSA! Helps to place protein into a group of related proteins. It will provide insight into function, structure and evolution.! Helps to detect homologs! Identifies sequencing errors! Identifies important regulatory regions in the promoters of genes.

Clustal W (Thompson et al., 1994)! CLUSTAL=Cluster alignment! The underlying concept is that groups of sequences are phylogenetically related. If they can be aligned, then one can construct a phylogenetic tree.! Phylogenetic tree-a tree showing the evolutionary relationships among various biological species or other entities that are believed to have a common ancestor.

Flowchart of computation steps in Clustal W (Thompson et al., 1994) Pairwise alignment: calculation of distance matrix Creation of unrooted neighbor-joining tree Rooted NJ tree (guide tree) and calculation of sequence weights Progressive alignment following the guide tree

Preliminary pairwise alignments Compare each pair of sequences. Different sequences A - B.87 - C.59.60 - A B C Each number represents the number of exact matches divided by the sequence length (ignoring gaps). Thus, the higher the number the more closely related the two sequences are. In this matrix, sequence A is 87% identical to sequence B

Step 1-Calculation of Distance Matrix Use the Distance Matrix to create a Guide Tree to determine the order of the sequences. Hbb-Hu 1 - Hbb-Ho 2.17 - Hba-Hu 3.59.60 - Hba-Ho 4.59.59.13 - Myg-Ph 5.77.77.75.75 - Gib-Pe 6.81.82.73.74.80 - Lgb-Lu 7.87.86.86.88.93.90-1 2 3 4 5 6 7 D = 1 (I) D = Difference score I = # of identical aa s in pairwise global alignment total number of aa s in shortest sequence

Step 2-Create an unrooted NJ tree Hba-Ho Myg-Ph Hba-Hu Hbb-Ho Gib-Pe Hbb-Hu Lgb-Lu

Step 3-Create Rooted NJ Tree Weight Alignment Order of alignment: 1 Hba-Hu vs Hba-Ho 2 Hbb-Hu vs Hbb-Ho 3 A vs B 4 Myg-Ph vs C 5 Gib-Pe vs D 6 Lgh-Lu vs E

Step 4-Progressive alignment

Step 4-Progressive alignment Scoring during progressive alignment

Rules for alignment! Short stretches of 5 hydrophilic residues often indicate loop or random coil regions (not essential for structure) and therefore gap penalties are reduced reduced for such stretches.! Gap penalties for closely related sequences are lowered compared to more distantly related sequences ( once a gap always a gap rule). It is thought that those gaps occur in regions that do not disrupt the structure or function.! Alignments of proteins of known structure show that proteins gaps do not occur more frequently than every eight residues. Therefore penalties for gaps increase when required at 8 residues or less for alignment. This gives a lower alignment score in that region.! A gap weight is assigned after each aa according the frequency that such a gap naturally occurs after that aa in nature

Amino acid weight matrices! As we know, there are many scoring matrices that one can use depending on the relatedness of the aligned proteins.! As the alignment proceeds to longer branches the aa scoring matrices are changed to more divergent scoring matrices. The length of the branch is used to determine which matrix to use and contributes to the alignment score.

Example of Sequence Alignment using Clustal W Asterisk represents identity : represents high similarity. represents low similarity

Multiple Alignment Considerations! Quality of guide tree. It would be good to have a set of closely related sequences in the alignment to set the pattern for more divergent sequences.! If the initial alignments have a problem, the problem is magnified in subsequent steps.! CLUSTAL W is best when aligning sequences that are related to each other over their entire lengths! Do not use when there are variable N- and C- terminal regions! If protein is enriched for G,P,S,N,Q,E,K,R then these residues should be removed from gap penalty list. (what types of residues are these?) Reference: http://www-igbmc.u-strasbg.fr/bioinfo/clustalw/