Hierarchical Classification:


Genome Bioinformatics
Protein Families, Annotation, Phylogeny I

Molecule: Compare domains? Compare expression? Similar proteins? Expression.
TLR

What is a phylogenetic tree? How to make a phylogenetic tree?

Some of the slides in this lecture are courtesy of Jaap Heringa, Anders Gorm Pedersen and Michael Rosenberg.

Hierarchical Classification: Linnaeus
Tree: depiction (formalization) of classification.
Carl Linnaeus, 1707-1778

Theory of evolution
The only figure in Darwin's On the Origin of Species is a tree diagram.
Charles Darwin, 1809-1882

Phylogenetic trees
Historical pattern of relationships among organisms. Interpretation of a tree, e.g. flow of time.

How to read a phylogenetic tree?
[Figure: tree with the ancestors marked]

Trees are useful in bioinformatics beyond phylogeny of species. Where else can phylogenetic trees be used?

Progressive multiple alignment: general principles
Pairwise scores, similarity matrix, scores to distances, guide tree, multiple alignment.

Other trees (= clusters): gene expression

Phylogenetic trees: unrooted, rooted

Unrooted vs rooted trees
Trees and evolutionary time

Phylogenies using characters
Faster evolution
Molecular phylogeny changed taxonomy

Three main classes of phylogenetic methods
Distance based: uses pairwise distances; fastest approach.
Parsimony: fewest number of evolutionary events (mutations); attempts to construct the maximum parsimony tree.
Maximum likelihood

Phylogenetic tree by distance methods (clustering)
Multiple alignment, similarity criterion, scores, distance matrix, phylogenetic tree.

Distances
Evolutionary sequence distance = sequence dissimilarity

Human         -KITVVGVGAVGMACAISILMKDLADELALVDVIEDKLKGEMMDLQHGSLFLRTPKIVSGKDYNVTANSKLVIITAGARQ
Chicken       -KISVVGVGAVGMACAISILMKDLADELTLVDVVEDKLKGEMMDLQHGSLFLKTPKITSGKDYSVTAHSKLVIVTAGARQ
Dogfish       KITVVGVGAVGMACAISILMKDLADEVALVDVMEDKLKGEMMDLQHGSLFLHTAKIVSGKDYSVSAGSKLVVITAGARQ
Lamprey       SKVTIVGVGQVGMAAAISVLLRDLADELALVDVVEDRLKGEMMDLLHGSLFLKTAKIVADKDYSVTAGSRLVVVTAGARQ
Barley        TKISVIGAGNVGMAIAQTILTQNLADEIALVDALPDKLRGEALDLQHAAAFLPRVRI-SGTDAAVTKNSDLVIVTAGARQ
Maizey casei  -KVILVGDGAVGSSYAYAMVLQGIAQEIGIVDIFKDKTKGDAIDLSNALPFTSPKKIYSA-EYSDAKDADLVVITAGAPQ
Bacillus      TKVSVIGAGNVGMAIAQTILTRDLADEIALVDAVPDKLRGEMLDLQHAAAFLPRTRLVSGTDMSVTRGSDLVIVTAGARQ
Lacto ste     -RVVVIGAGFVGASYVFALMNQGIADEIVLIDANESKAIGDAMDFNHGKVFAPKPVDIWHGDYDDCRDADLVVICAGANQ
Lacto_plant   QKVVLVGDGAVGSSYAFAMAQQGIAEEFVIVDVVKDRTKGDALDLEDAQAFTAPKKIYSG-EYSDCKDADLVVITAGAPQ
Therma_mari   MKIGIVGLGRVGSSTAFALLMKGFAREMVLIDVDKKRAEGDALDLIHGTPFTRRANIYAG-DYADLKGSDVVIVAAGVPQ
Bifido        -KLAVIGAGAVGSTLAFAAAQRGIAREIVLEDIAKERVEAEVLDMQHGSSFYPTVSIDGSDDPEICRDADMVVITAGPRQ
Thermus_aqua  MKVGIVGSGFVGSATAYALVLQGVAREVVLVDLDRKLAQAHAEDILHATPFAHPVWVRSGW-YEDLEGARVVIVAAGVAQ
Mycoplasma    -KIALIGAGNVGNSFLYAAMNQGLASEYGIIDINPDFADGNAFDFEDASASLPFPISVSRYEYKDLKDADFIVITAGRPQ

Distance matrix
[Table: 13 x 13 symmetric matrix of pairwise distances, diagonal 0.000, rows and columns in the order Human, Chicken, Dogfish, Lamprey, Barley, Maizey, Lacto_casei, Bacillus_stea, Lacto_plant, Therma_mari, Bifido, Thermus_aqua, Mycoplasma]

NB: because the distances are evolutionary distances, we obtain a phylogenetic tree.

Clustering
Scores, cluster criterion, distance matrix, phylogenetic tree.
Cluster criteria:
Single linkage (nearest neighbour)
Complete linkage (furthest neighbour)
Group averaging (UPGMA)
Neighbour joining

Clustering algorithm: UPGMA

          human   mouse   fugu    Yeast
human     -
mouse             -
fugu      4       4       -
Yeast     8       8       8       -

[Figure: UPGMA tree with leaves human, mouse, fugu and Yeast]

Evolutionary clock speeds
Uniform clock: ultrametric distances lead to identical distances from root to leaves.
Non-uniform evolutionary clock: leaves have different distances to the root.
UPGMA trees would be correct if evolution had a uniform clock, but it often did not!

Neighbour-Joining (Saitou and Nei, 1987)
Global: keeps total branch length minimal.
At each step, join the two nodes that are closest, considering their respective distances to all other nodes.
Leads to an unrooted tree.
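The distance-based route described above (aligned sequences, pairwise dissimilarity, UPGMA clustering) can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's own code: the function names `p_distance` and `upgma` are hypothetical, gap columns are simply skipped, and the toy matrix assumes a human-mouse distance of 2, a value that is not given in the slide text.

```python
from itertools import combinations


def p_distance(a, b):
    """Fraction of differing positions between two aligned sequences.

    Columns where either sequence has a gap ('-') are skipped.
    """
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        raise ValueError("no comparable columns")
    return sum(x != y for x, y in pairs) / len(pairs)


def upgma(names, dist):
    """Cluster with UPGMA.

    dist maps frozenset({a, b}) -> distance for every pair of names.
    Returns a nested tuple (left, right, height); leaves are plain names.
    """
    clusters = {name: 1 for name in names}  # cluster -> number of leaves
    d = dict(dist)
    while len(clusters) > 1:
        # Join the two clusters with the smallest distance.
        a, b = min(combinations(clusters, 2), key=lambda p: d[frozenset(p)])
        merged = (a, b, d[frozenset((a, b))] / 2)
        na, nb = clusters.pop(a), clusters.pop(b)
        # Distance to the new cluster is the size-weighted group average.
        for c in clusters:
            d[frozenset((merged, c))] = (
                na * d[frozenset((a, c))] + nb * d[frozenset((b, c))]
            ) / (na + nb)
        clusters[merged] = na + nb
    return next(iter(clusters))


# Toy matrix; the human-mouse value of 2 is an assumption for illustration.
toy = {
    frozenset(("human", "mouse")): 2,
    frozenset(("human", "fugu")): 4,
    frozenset(("mouse", "fugu")): 4,
    frozenset(("human", "yeast")): 8,
    frozenset(("mouse", "yeast")): 8,
    frozenset(("fugu", "yeast")): 8,
}
tree = upgma(["human", "mouse", "fugu", "yeast"], toy)
print(tree)
```

Because the toy distances are ultrametric, every leaf ends up at the same distance from the root, which is the uniform-clock situation in which UPGMA is expected to recover the correct tree.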

Neighbour joining
At each step all possible neighbour joinings are checked and the one corresponding to the minimal total tree length (calculated by adding all branch lengths) is taken.

Introduce a root
[Figure: unrooted and rooted trees of Yeast, fugu, mouse and human, labelled with root, branch, internal node (ancestor) and leaf]

How to root a tree
Outgroup: place the root between a distant (still homologous) sequence and the rest of the group.
Midpoint: place the root at the midpoint of the longest path (sum of branches between any two leaves).
Gene duplication: place the root between paralogous gene copies.
[Figure: example trees with leaves Y (Yeast), f (fugu), m (mouse) and h (human), and with the paralogous copies f-α, f-β, h-α, h-β]
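The neighbour-joining selection step described above is commonly implemented with the Q criterion, which picks the pair that is closest once each node's distances to all other nodes are taken into account. The sketch below shows only that selection step; the function name `nj_pick_pair` and the five-taxon matrix are illustrative assumptions, not data from the lecture.

```python
def nj_pick_pair(names, d):
    """Return the pair minimising the neighbour-joining Q criterion.

    d is a symmetric distance matrix (list of lists) indexed like names.
    Q(i, j) = (n - 2) * d[i][j] - sum_k d[i][k] - sum_k d[j][k]
    """
    n = len(names)
    row_sums = [sum(row) for row in d]
    best, best_q = None, None
    for i in range(n):
        for j in range(i + 1, n):
            q = (n - 2) * d[i][j] - row_sums[i] - row_sums[j]
            if best_q is None or q < best_q:
                best, best_q = (names[i], names[j]), q
    return best, best_q


# Hypothetical distances between five taxa a..e.
names = ["a", "b", "c", "d", "e"]
d = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]
pair, q = nj_pick_pair(names, d)
print(pair, q)
```

In this matrix the smallest raw distance is between d and e (3), yet the criterion selects a and b, because both are far from everything else. That is the difference between neighbour joining and simple nearest-neighbour clustering.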

Orthologs and paralogs

Gene duplication and gene loss
Simple real-life example. Kinase-5: essential for centrosome separation in mitosis.
Gene duplication: divergence of a gene within one genome.

Let's tell a story: Vertebrate Toll-Like Receptors
Spanish Flu (1918)
Roach, Jared C. et al. (2005) Proc. Natl. Acad. Sci. USA 102, 9577-9582

Parsimony

     1 2 3 4 5 6 7
A    a c a t g a a
B    a c t t g a a
C    a c a t g t a
D    a c a t g t a

Informative sites are the sites where at least two different characters occur at least twice.

Another example

             1 2 3 4 5 6 7
Human        c c t t g a a
Chimp        c c t t g a a
Gorilla      c c t a g t a
Gibbon       t c a a g a a
Orangutan    t c a a g a t

[Figure: tree with leaves Chimp, Gibbon, Human, Gorilla and Orangutan]

Maximum likelihood
Data = alignment, hypothesis = tree. Under a given evolutionary model (e.g. a substitution matrix), compute the likelihood that the hypothesis (tree) results in the observed data (the multiple sequence alignment).
Maximum likelihood selects the hypothesis (tree) that maximises the likelihood of the observed data.
Extremely time-consuming method.
Often regarded as the best approach to find the true tree.
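The definition of informative sites given in the parsimony slides can be checked mechanically. The sketch below is a minimal illustration with a hypothetical function name, `informative_sites`. It counts the characters in each alignment column and reports the 1-based positions where at least two different characters each occur at least twice.

```python
from collections import Counter


def informative_sites(seqs):
    """Return 1-based positions of parsimony-informative columns."""
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("sequences must be aligned to the same length")
    sites = []
    for col in range(length):
        counts = Counter(s[col] for s in seqs)
        # Informative: at least two characters that each occur twice or more.
        if sum(1 for c in counts.values() if c >= 2) >= 2:
            sites.append(col + 1)
    return sites


four_taxa = ["acatgaa", "acttgaa", "acatgta", "acatgta"]           # A, B, C, D
apes = ["ccttgaa", "ccttgaa", "cctagta", "tcaagaa", "tcaagat"]    # Human .. Orangutan
print(informative_sites(four_taxa))
print(informative_sites(apes))
```

Columns where a character occurs only once, such as position 3 of the four-taxon example, are variable but not informative, since they cost one change on every possible tree.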

Parsimony, Maximum Likelihood or Neighbor-Joining?
Common practice: use all methods and compare trees.
Data is of greater importance than method.
As with alignments, one must remember that a phylogenetic tree is a hypothesis of the true evolutionary history. As a hypothesis it could be right or wrong or a bit of both. If we knew the true tree of life, we would also know which method is best.

How to assess confidence in a tree
Distance method bootstrap:
Select multiple alignment columns with replacement.
Recalculate the tree.
Compare branches with the original tree.
Repeat 100-1000 times, so calculate 100-1000 different trees.
How often is branching preserved for each internal node?
Uses samples of the data.

The Bootstrap

Original
1 2 3 4 5 6 7 8
C C V K V I Y S
M A V R L I F S
M A L R L L F S

Scrambled
3 4 3 8 6 6 8 6
V K V S I I S I
V R V S I I S I
L R L S L L S L

[Figure label: Nonsupportive]
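The column-resampling step of the bootstrap described above can be sketched as follows. The helper names `resample_columns` and `bootstrap_replicate` are hypothetical, and the tree-building and branch-comparison steps are left out, since they depend on whichever method (for example neighbour joining) is being assessed.

```python
import random


def resample_columns(alignment, columns):
    """Build a pseudo-alignment from the given 1-based column numbers."""
    return ["".join(row[c - 1] for c in columns) for row in alignment]


def bootstrap_replicate(alignment, rng):
    """Draw as many columns as the alignment has, with replacement."""
    length = len(alignment[0])
    columns = [rng.randint(1, length) for _ in range(length)]
    return columns, resample_columns(alignment, columns)


original = ["CCVKVIYS", "MAVRLIFS", "MALRLLFS"]

# One explicit draw, to show that whole columns are copied together.
picked = resample_columns(original, [3, 4, 3, 8, 6, 6, 8, 6])
print(picked)

# A random replicate; a fixed seed keeps the run reproducible.
columns, replicate = bootstrap_replicate(original, random.Random(0))
print(columns, replicate)
```

In a full analysis this draw would be repeated 100 to 1000 times, a tree rebuilt from each replicate, and the support of an internal node reported as the fraction of replicate trees that contain the same branching.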

The Bootstrap
Bootstrap example
[Figure: tree annotated with bootstrap counts; a branching found 85 times is labelled 85]

Horizontal (lateral) gene transfer
The evolutionary history of a gene is not always consistent with the history of the species!

Detecting HGT in trees
Discover horizontal gene transfer by comparing the phylogenetic tree of the species (SSU rRNA) with that of the gene in question.
Be careful, however! The sequences have to be orthologous to each other. Ancient gene duplications followed by differential loss can also give rise to trees that look like horizontal gene transfer.

Leucine aminoacyl-tRNA synthetase
[Figure: gene tree with the groups Eukaryotes, Archaea and Bacteria]
No apparent horizontal gene transfer in the evolution of leucine aminoacyl-tRNA synthetase (the phylogeny of the sequences fits more or less the species phylogeny).

Proline aminoacyl-tRNA synthetase
[Figure: gene tree with the groups Archaea, Eukaryotes and Bacteria?]

Detecting HGT in trees
[Figure: gene tree with the groups Archaea, Eukaryotes and Bacteria]
Apparent horizontal gene transfer to the parasites Bbu (B. burgdorferi) and Mge, Mpe (Mycoplasmas) from the Eukaryotes, represented by Cel (C. elegans) and Sce (S. cerevisiae).

Let's tell a story: MHC molecules

Another use of phylogenies