How to Build a Phylogenetic Tree




A phylogenetic tree is a structure in which species are arranged on branches that link them according to their relationship and/or evolutionary descent. A typical rooted tree with scaled branches is illustrated in Fig 1 [1]. This short article briefly reviews different tree-building methods and compares their validity, reliability and efficiency. The calculation of genetic distance and methods of rooting trees are also discussed. For other issues, such as tree evaluation and search algorithms, see reference [2].

Genetic distance

Genetic distance is the number of mutation/evolutionary events between species since their divergence. The simplest way to estimate it is to count the number of differences between two sequences. However, this method, often referred to as the Hamming distance, does not reflect the evolutionary history of the sequences, because not all mutation events are recorded in the current sequences. Fig 2 shows an example in which only 3 of 12 mutation events are detected between homologous sequences [3].

The Jukes-Cantor model, proposed in 1969, was the first attempt to cope with this problem. It gives the corrective formula $K = -\frac{3}{4}\ln\left(1 - \frac{4}{3}p\right)$ (Fig 3), where K is the genetic distance (the mean number of substitutions per site) and p is the percent difference expressed as a fraction of sites. This relation shows that the correction grows with the genetic distance and that, once the percent difference saturates (at 75%), the true genetic distance can no longer be recovered [4]. More recent models take into account different transition/transversion rates, unequal base frequencies and non-uniform substitution rates among sites. All these models give similar results at low divergence. At high divergence, however, the choice of a proper substitution model becomes crucial.
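As a rough illustration of the two distance measures discussed above, the following Python sketch computes the uncorrected proportion of differences p and the Jukes-Cantor corrected distance K. The function names and the two toy sequences are made up for this example, and gaps or ambiguous bases are not handled.

```python
import math


def p_distance(seq_a: str, seq_b: str) -> float:
    """Proportion of aligned sites at which the two sequences differ."""
    if len(seq_a) != len(seq_b) or not seq_a:
        raise ValueError("sequences must be aligned, equal-length and non-empty")
    mismatches = sum(1 for a, b in zip(seq_a, seq_b) if a != b)
    return mismatches / len(seq_a)


def jukes_cantor_distance(p: float) -> float:
    """Jukes-Cantor corrected distance K = -(3/4) ln(1 - (4/3) p)."""
    if not 0.0 <= p < 0.75:
        # At p = 0.75 the logarithm diverges: the sequences are saturated.
        raise ValueError("p must lie in [0, 0.75)")
    return -0.75 * math.log(1.0 - (4.0 / 3.0) * p)


# Toy sequences (hypothetical data, for illustration only).
seq_1 = "ACGTACGTAC"
seq_2 = "ACGTTCGTAA"

p = p_distance(seq_1, seq_2)
k = jukes_cantor_distance(p)
print(f"p = {p:.3f}, K = {k:.3f}")
```

The corrected distance K is always at least as large as p, and the gap between them widens as p approaches 0.75, which is the saturation effect described in the text.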
Tree-Building Methods

The most popular and frequently used tree-building methods fall into two major categories: phenetic methods based on distances and cladistic methods based on characters. The former measure the pairwise distance/dissimilarity between genes, the actual value of which depends on the distance definition used, and construct the tree entirely from the resulting distance matrix. The latter evaluate all possible trees and seek the one that best explains the evolution of the observed characters under an optimality criterion.

Distance-Based Methods

The most popular distance-based methods are the unweighted pair group method with arithmetic mean (UPGMA), neighbor joining (NJ) and those that optimize the additivity of a distance tree (FM and ME) [2].

UPGMA Method

This method follows a clustering procedure:
(1) Assume that initially each species is a cluster on its own.
(2) Join the two closest clusters and recalculate the distances from the joined pair to all other clusters by taking the average.
(3) Repeat this process until all species are connected in a single cluster.

Strictly speaking, this algorithm is phenetic and does not aim to reflect evolutionary descent. It assigns equal weight to the distances and assumes a molecular clock, i.e. a constant rate of evolution in all lineages. WPGMA is a similar algorithm but assigns different weights to the distances. UPGMA is simple, fast and has been used extensively in the literature. However, it behaves poorly in most cases where the above assumptions are not met.

Neighbor Joining Method (NJ)

This algorithm does not assume a molecular clock and adjusts for rate variation among branches. It begins with an unresolved star-like tree (Fig 4(a)). Each pair of terminals is evaluated for being joined, and the sum of all branch lengths of the resulting tree is calculated. The pair that yields the smallest sum is considered the closest neighbors and is joined. A new branch is inserted between them and the rest of the tree (Fig 4(b)), and the branch lengths are recalculated. This process is repeated until only one terminal remains. NJ is comparatively rapid and generally gives better results than UPGMA. However, it produces only one tree and neglects other possible trees, which might be as good as the NJ tree, if not significantly better. Moreover, since errors in distance estimates are exponentially larger for longer distances, under some conditions this method will yield a biased tree [5].

Weighted Neighbor-Joining (Weighbor)

This method was proposed more recently [5]. The Weighbor criterion consists of two terms, an additivity term (for external branches) and a positivity term (for internal branches), that quantify the implications of joining a pair. Weighbor gives less weight to the longer distances in the distance matrix, and the resulting trees are less sensitive to specific biases than NJ trees and relatively immune to the "long-branch attraction/distraction" drawbacks observed with other methods.

Fitch-Margoliash (FM) and Minimum Evolution (ME) Methods

In 1967 Fitch and Margoliash proposed a criterion (the FM method) for fitting trees to distance matrices [2]. This method seeks the least-squares fit of all observed pairwise distances to the distances expected from a tree. The ME method seeks the tree with the minimum sum of branch lengths. Rather than using all pairwise distances as FM does, it fixes the internal nodes by using the distances to external nodes and then optimizes the internal branch lengths. FM and ME perform best among the distance-based methods, but they are much slower than NJ, which generally yields a tree very close to theirs.
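The three-step UPGMA clustering procedure described above can be sketched in Python as follows. This is a minimal illustration rather than a reference implementation: the function name `upgma`, the nested-tuple tree representation and the toy distance matrix are all assumptions made for this example.

```python
from itertools import combinations


def upgma(names, dist):
    """Cluster taxa with UPGMA.

    names: list of taxon labels
    dist:  dict mapping a pair (a, b) of labels to their distance
    Returns the tree as nested tuples and the list of (cluster, height) merges.
    """
    # Step 1: every species starts as its own cluster (value = number of leaves).
    clusters = {name: 1 for name in names}
    d = {frozenset(pair): value for pair, value in dist.items()}
    merges = []
    while len(clusters) > 1:
        # Step 2: join the two closest clusters.
        a, b = min(combinations(clusters, 2), key=lambda pair: d[frozenset(pair)])
        height = d[frozenset((a, b))] / 2.0  # node height under a molecular clock
        size_a, size_b = clusters.pop(a), clusters.pop(b)
        merged = (a, b)
        # Distance from the new cluster to each remaining cluster is the
        # average over all leaf pairs, i.e. a size-weighted mean.
        for other in clusters:
            d[frozenset((merged, other))] = (
                size_a * d[frozenset((a, other))] + size_b * d[frozenset((b, other))]
            ) / (size_a + size_b)
        clusters[merged] = size_a + size_b
        merges.append((merged, height))
        # Step 3: repeat until a single cluster remains.
    (tree,) = clusters
    return tree, merges


# Toy clock-like distance matrix (hypothetical values).
toy_dist = {
    ("A", "B"): 2.0, ("A", "C"): 6.0, ("B", "C"): 6.0,
    ("A", "D"): 10.0, ("B", "D"): 10.0, ("C", "D"): 10.0,
}
tree, merges = upgma(["A", "B", "C", "D"], toy_dist)
print(tree)
for cluster, height in merges:
    print(cluster, height)
```

Weighting by cluster size is what makes the average "unweighted" with respect to individual species; using a plain mean of the two cluster distances instead would give WPGMA.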

Character-Based Methods

Distance-based methods are faster and less computationally intensive than character-based methods, but the actual characters are discarded once the distance matrix is derived. Character-based methods, on the other hand, make use of all known evolutionary information, i.e. the individual substitutions among the sequences, to determine the most likely ancestral relationships.

Maximum parsimony (MP)

The MP criterion is that the simplest explanation of the data is preferred, because it requires the fewest conjectures. By this criterion, the MP tree is the one that requires the fewest substitutions/evolutionary changes for all sequences to derive from a common ancestor. For each site in the alignment, all possible trees are evaluated and given a score based on the number of evolutionary changes needed to produce the observed sequence differences. The best tree is the one that minimizes the total number of mutations over all sites. MP is faster than ML, and weighted parsimony schemes can deal with most of the models used by ML. However, MP yields little information about branch lengths and suffers badly from long-branch attraction: long branches become artificially connected because of the accumulation of non-homologous similarities, even if they are not phylogenetically close. MP can also yield more than one tree with the same score.

Maximum Likelihood (ML)

Like MP, the ML method uses each position in an alignment and evaluates all possible trees. It calculates the likelihood of each tree and seeks the one with the maximum likelihood. For a given tree, the likelihood at each site is the probability that a given evolutionary model generated the observed data. The likelihoods for all sites are then multiplied to give the likelihood of the tree.
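To make the per-site likelihood calculation concrete, the sketch below treats the simplest possible case: a "tree" of just two sequences joined by a single branch, evaluated under the Jukes-Cantor model. The function name, the toy sequences and the grid search over branch lengths are assumptions chosen for illustration; real ML programs handle many taxa and use far more efficient optimizers.

```python
import math


def jc_log_likelihood(seq_a: str, seq_b: str, d: float) -> float:
    """Log-likelihood of two aligned sequences separated by a branch of
    length d (expected substitutions per site) under Jukes-Cantor."""
    decay = math.exp(-4.0 * d / 3.0)
    p_same = 0.25 + 0.75 * decay  # probability that a base is unchanged
    p_diff = 0.25 - 0.25 * decay  # probability of one specific different base
    log_l = 0.0
    for a, b in zip(seq_a, seq_b):
        # Site likelihood: base frequency (1/4) times the transition probability.
        site_likelihood = 0.25 * (p_same if a == b else p_diff)
        # Adding logarithms is equivalent to multiplying the site likelihoods
        # and avoids numerical underflow on long alignments.
        log_l += math.log(site_likelihood)
    return log_l


# Toy sequences (hypothetical data, for illustration only).
seq_1 = "ACGTACGTAC"
seq_2 = "ACGTTCGTAA"

# Crude grid search for the branch length with the highest likelihood.
grid = [i / 1000.0 for i in range(1, 2001)]
best_d = max(grid, key=lambda d: jc_log_likelihood(seq_1, seq_2, d))
print(f"maximum-likelihood branch length = {best_d:.3f}")
```

For two sequences under this model, the branch length that maximizes the likelihood coincides with the Jukes-Cantor corrected distance, so the grid search should land close to the value given by the corrective formula.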
ML is the slowest and most computationally intensive method, though it seems to give the best results and the most informative tree.

Rooting trees

The root is the common ancestor of the species under study. Most phylogenetic methods do not locate the root of a tree, and unrooted trees reflect only the relationships among species, not the evolutionary path. Fig 5(a) shows an unrooted tree of species A, B, C and D. If one assumes that the molecular clock hypothesis is valid, which asserts that informational macromolecules evolve at rates that are constant through time and across lineages, then the root is simply the midpoint of the longest span across the tree. For example, for the unrooted tree in Fig 5(a), the root is the midpoint of the longest span A-B (Fig 5(b)). Whether the molecular clock assumption is valid is still controversial [6]. However, evidence from many studies has shown that such a model is oversimplified and sometimes behaves erratically.

A more commonly used method is to root the tree with an outgroup, i.e. a distantly related species. For instance, the mouse can be used as an outgroup to root a tree of primates [3] (see Fig 5(c)). However, this method gives rise to a dilemma. If the outgroup species is very closely related to the ingroup, it may simply be a mistakenly excluded member of it. (In our example the mouse is obviously not a primate, but this is not always apparent in other cases.) On the other hand, a very divergent outgroup may have accumulated so many non-homologous similarities that it becomes artificially connected to the ingroup (see the long-branch attraction mentioned under the MP method).

References

[1] http://allserv.rug.ac.be/~avierstr/principles/phylogeny.html, Phylogenetics (1999)
[2] Andreas D. Baxevanis & B.F. Francis Ouellette, Bioinformatics: a practical guide to the analysis of genes and proteins (2001)
[3] P. Higgs & Manchester, Introduction to Phylogenetics Methods (ITP series on-line seminars) (2001)
[4] http://hiv-web.lanl.gov/, How to make a phylogenetic tree (1999)
[5] Bruno WJ, Socci ND, Halpern AL, Mol Biol Evol, 17(1), 189-97 (2000)
[6] Wen-Hsiung Li & Dan Graur, Fundamentals of Molecular Evolution (1991)

Fig 1. A typical rooted tree with scaled branches [1]. For definitions of the terminology, see http://allserv.rug.ac.be/~avierstr/principles/phylogeny.html.

Fig 2. Two homologous DNA sequences (on the right) that descended from an ancestral sequence (on the left). Although 12 mutations have accumulated since their divergence from each other, only 3 of them are recorded in the current sequences [3].

Fig 3. Jukes-Cantor model of the relation between the percent difference p and the genetic distance K: $K = -\frac{3}{4}\ln\left(1 - \frac{4}{3}p\right)$ [4].

Fig 4. NJ method: (a) an unresolved star-like tree; (b) the closest terminals (for the definition, see text) 1 and 2 are joined, a branch is inserted between them and the remainder of the tree, and the value of the branch is taken to be the mean of the two original values. This process is continued until only one terminal is left.
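A single joining step of the kind shown in Fig 4 can be sketched in Python as below. Note the assumptions: the sketch selects the pair using the widely used Q-criterion, which is generally presented as equivalent to choosing the pair that minimizes the total branch length, and it uses the standard NJ formulas for the two new branch lengths and for the distances from the new node. The function name `nj_step`, the node label "u" and the five-taxon distance matrix are illustrative only.

```python
from itertools import combinations


def nj_step(names, dist, new_name="u"):
    """Perform one neighbor-joining step.

    names: list of current terminal labels (at least three)
    dist:  dict mapping frozenset({a, b}) to the distance between a and b
    Returns the joined pair, their branch lengths to the new node,
    the reduced list of terminals and the reduced distance dict.
    """
    n = len(names)
    if n < 3:
        raise ValueError("need at least three terminals")
    # Sum of distances from each terminal to all others.
    total = {i: sum(dist[frozenset((i, k))] for k in names if k != i) for i in names}

    def q(pair):
        i, j = pair
        return (n - 2) * dist[frozenset(pair)] - total[i] - total[j]

    # The pair with the smallest Q value is joined.
    i, j = min(combinations(names, 2), key=q)
    d_ij = dist[frozenset((i, j))]
    # Branch lengths from i and j to the new internal node.
    len_i = 0.5 * d_ij + (total[i] - total[j]) / (2 * (n - 2))
    len_j = d_ij - len_i
    # Distances from the new node to every remaining terminal.
    rest = [k for k in names if k not in (i, j)]
    new_dist = {frozenset(p): dist[frozenset(p)] for p in combinations(rest, 2)}
    for k in rest:
        new_dist[frozenset((new_name, k))] = 0.5 * (
            dist[frozenset((i, k))] + dist[frozenset((j, k))] - d_ij
        )
    return (i, j), (len_i, len_j), rest + [new_name], new_dist


# Toy five-taxon distance matrix (hypothetical values).
raw = {
    ("a", "b"): 5, ("a", "c"): 9, ("a", "d"): 9, ("a", "e"): 8,
    ("b", "c"): 10, ("b", "d"): 10, ("b", "e"): 9,
    ("c", "d"): 8, ("c", "e"): 7, ("d", "e"): 3,
}
toy = {frozenset(pair): value for pair, value in raw.items()}
pair, lengths, names_left, reduced = nj_step(["a", "b", "c", "d", "e"], toy)
print(pair, lengths, names_left)
```

Calling `nj_step` repeatedly on the reduced terminal list and distance dict, with a fresh node label each time, continues the process until the tree is fully resolved.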

Fig 5. (a) An unrooted tree; (b) the same tree rooted under the molecular clock assumption; (c) rooting a tree by using the mouse as an outgroup [3].
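The midpoint rooting of Fig 5(b) can be illustrated with the short Python sketch below. The unrooted tree is stored as a dictionary of neighbors with branch lengths; the internal node names X and Y and all branch lengths are invented for this example (the figure itself only tells us that A-B is the longest span), and the function names are hypothetical.

```python
from itertools import combinations


def path_between(tree, start, goal):
    """Return the list of nodes on the unique path from start to goal."""
    stack = [(start, [start])]
    while stack:
        node, path = stack.pop()
        if node == goal:
            return path
        for neighbor in tree[node]:
            if neighbor not in path:
                stack.append((neighbor, path + [neighbor]))
    raise ValueError("nodes are not connected")


def midpoint_root(tree):
    """Locate the midpoint of the longest leaf-to-leaf span.

    Returns (a, b, offset): the root lies on branch a-b, offset away from a.
    """
    leaves = [node for node, nbrs in tree.items() if len(nbrs) == 1]

    def length(path):
        return sum(tree[a][b] for a, b in zip(path, path[1:]))

    # Longest span across the tree, measured between terminal nodes.
    longest = max(
        (path_between(tree, a, b) for a, b in combinations(leaves, 2)), key=length
    )
    half = length(longest) / 2.0
    walked = 0.0
    for a, b in zip(longest, longest[1:]):
        if walked + tree[a][b] >= half:
            return a, b, half - walked
        walked += tree[a][b]


# Unrooted four-species tree with made-up branch lengths.
toy_tree = {
    "A": {"X": 6.0},
    "C": {"X": 1.0},
    "B": {"Y": 3.0},
    "D": {"Y": 2.0},
    "X": {"A": 6.0, "C": 1.0, "Y": 1.0},
    "Y": {"B": 3.0, "D": 2.0, "X": 1.0},
}
print(midpoint_root(toy_tree))
```

With these branch lengths the longest span is A-B, so the root is placed on the long branch leading to A, halfway along the A-B path. Whether that placement is meaningful depends entirely on the molecular clock assumption discussed in the text.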