Molecular Clocks and Tree Dating with r8s and BEAST




Integrative Biology 200B
University of California, Berkeley
Principles of Phylogenetics: Ecology and Evolution
Spring 2011
Updated by Nick Matzke

Today we are going to use several different methods of testing the molecular clock and estimating node times. We will use a couple of likelihood ratio tests to test the molecular clock against a totally unconstrained tree and against a tree with a few branches allowed to vary independently. We will also use several rate-smoothing methods to infer divergence times. We will not cover several commonly used methods; in particular, we will not use any relative rate tests to test the molecular clock. This is a very active field, and new methods and new programs are constantly being developed.

Setup

Get on the web and download:
r8s: http://loco.biosci.arizona.edu/r8s/
BEAST, BEAUTi, TreeAnnotator, Tracer, FigTree: http://beast.bio.ed.ac.uk/programs

Assignment

First, work through the short R script on the website: _dating_code_v1.r

Second, we will switch to BEAST. Work through the online tutorial "Divergence Dating (Primates) v1.1a.zip" (BEAST v1.5.x), which is online at: http://beast.bio.ed.ac.uk/tutorials

Third, I recently (last week) figured out a way that might let us use BEAST with the fossils as tips, rather than as calibration points. I tried this on a simulated dataset, and it works, depending on what priors you use. The true simulated tree with fossils is simt_connected100_86_newick.txt. Look at it in R or another tree-viewing program. The root of the true tree is about 48 million years ago.

Now we will see how well we do at estimating the tree and its dates in BEAST, using our simulated character data and fossils. Import the NEXUS file with the 100 simulated morphological characters (simmorph_4states100.nex) into BEAUTi.

We are going to specify the ages of the tips:
Change "Since some time in the past" to "Before the Present".
Click "Guess Dates".
At "Defined by its order", change "first" to "last" (I have put the tip date on the end of each tip name).
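For intuition, the "last" setting asks BEAUTi to read each tip's date from the final field of its name. The Python sketch below mimics that idea only; it is not BEAUTi's own code. The tip names and the underscore delimiter are hypothetical, since the handout does not show the actual labels in simmorph_4states100.nex.

```python
def guess_tip_date(name, delimiter="_"):
    """Return the last delimiter-separated field of a tip name as a float."""
    last_field = name.rsplit(delimiter, 1)[-1]
    return float(last_field)

# Hypothetical tip labels: taxon name, then age before the present.
tips = ["taxonA_0", "taxonB_12.5", "fossilC_35"]

tip_dates = {name: guess_tip_date(name) for name in tips}
for name, age in tip_dates.items():
    print(f"{name}: {age} before present")
```

If the dates BEAUTi guesses look wrong, checking a few tip names by hand in this way is a quick sanity test.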

Click OK.
Click through the other panels and play with the various priors.

Now, in real life, I had to do a bunch more stuff to make BEAST take morphology data instead of sequence data. This involved manually editing the XML file that BEAUTi produces, which is a pain. So: shut down BEAUTi and open BEAST. Pick one of my XML files, online here: http://ib.berkeley.edu/courses/ib200b/labs/beast_inputs/ and run it in BEAST. It should take ~10 minutes. Process the results in TreeAnnotator, and display the resulting tree in FigTree.

Compare your results to those from the other XML files run by your classmates. I was checking the influence of different tree priors on the results, and also the effect of using the true model of character evolution from the simulation (clocklike evolution) versus estimating the rate (uncorrelated between branches, lognormal rate prior). The differences can sometimes be large!

(Note: We don't have time today to go through testing the molecular clock with PAUP* and the local clock with PAML; we typically learn that in 200A, but if you ever do serious dating, you should work through the exercises below.)

Exercises from 200A

Testing for Global Molecular Clock

Under the null hypothesis, the phylogeny is rooted and the branch lengths are constrained such that all of the tips can be drawn at a single time plane. Under the alternative hypothesis, each branch is allowed to vary independently. The alternative hypothesis invokes s - 2 additional parameters, where s is the number of sequences. The likelihood ratio test statistic is

    -2 log Λ = -2(log L0 - log L1) = 2(log L1 - log L0),

where Λ = L0/L1, and L0 and L1 are the likelihoods under the null and alternative hypotheses, respectively. The significance of the likelihood ratio test statistic can be approximated using a chi-square distribution (with s - 2 degrees of freedom). The following example shows how to perform the likelihood ratio test of the molecular clock using PAUP*.

1. Execute the file Cephalopod.nex (available on the IB 200B website). This file contains molecular data, and it also contains one tree. For this exercise, we have accepted this tree as our working phylogenetic hypothesis, and we are now going to test whether it obeys a molecular clock. You can look at the tree if you want using showtrees. First, we will calculate the likelihood of this tree without enforcing a molecular clock. For speed, we'll use the Hasegawa, Kishino, and Yano (1985) (HKY85) model of DNA substitution, with among-site rate variation described using a gamma distribution. In PAUP*, this model is set with variant=hky under the likelihood settings (lset).

2. To estimate the Ts:Tv ratio and the gamma distribution shape parameter, use these commands:
    lset tratio=estimate variant=hky shape=estimate;
    lscores;

3. Record the -lnL score (PAUP* reports the negative log likelihood). This is the likelihood score for the alternative hypothesis, which allows branches to vary independently.

4. Now, we will change the likelihood settings to enforce a molecular clock:
    lset tratio=estimate variant=hky shape=estimate clock=yes;

5. Recalculate the likelihood score under this null model:
    lscores;

Conduct a likelihood ratio test in Excel to determine whether you can reject the null model. As you know, the likelihood ratio test compares a simple model to a more complex one, to see if adding the extra parameters offers a significant improvement to the model. This is necessary since adding parameters will always improve the fit, at least a little bit. Since a molecular clock only allows a single rate, it can be considered a simpler version of the HKY85 model without a clock. In testing a molecular clock, the degrees of freedom are the number of taxa - 2 (Felsenstein 1981).

6. Open an Excel file.

7. The likelihood ratio (LR) can be calculated as:
    LR = 2 × ((HKY85 + clock -lnL) - (HKY85 -lnL))
(I believe this is because subtracting natural logs is the same as dividing the likelihoods.)

8. The degrees of freedom (DF) can be calculated as:
    DF = number of taxa - 2
The cephalopod matrix has 15 taxa, so there are 13 degrees of freedom.

9. Use the CHIDIST function in Excel to get a p-value:
    =CHIDIST(LR, DF)
If the p-value is less than 0.05, you can reject the simpler model (in this case, the global molecular clock). In that case the null hypothesis, that the rate of evolution is homogeneous among all branches in the phylogeny, is rejected: rates of substitution vary significantly among branches, and a molecular clock is inappropriate.

Why is the likelihood score of the alternative model higher than that of the null model?

Testing for a Local Molecular Clock

In the previous example we tested whether the entire tree fits a clock, as opposed to every branch on the tree having an independent rate. We could also test whether a clade has a different rate from the rest of the tree.
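The likelihood ratio arithmetic from steps 6 to 9 above is also what the model comparisons in this section need, and it can be done without Excel. The following is a minimal Python sketch using only the standard library. The function names are hypothetical, and the two -lnL values are placeholders, not results from Cephalopod.nex. The chi2_sf function plays the role of Excel's CHIDIST for whole-number degrees of freedom.

```python
import math

def chi2_sf(x, df):
    """Upper-tail chi-square probability for integer df >= 1 (like CHIDIST)."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    # Closed-form starting points: df = 2 (even case) or df = 1 (odd case).
    if df % 2 == 0:
        q, k = math.exp(-half), 2
    else:
        q, k = math.erfc(math.sqrt(half)), 1
    # Recurrence: Q(x; k + 2) = Q(x; k) + (x/2)^(k/2) * exp(-x/2) / Gamma(k/2 + 1)
    while k < df:
        q += half ** (k / 2.0) * math.exp(-half) / math.gamma(k / 2.0 + 1.0)
        k += 2
    return q

def clock_lrt(neg_lnl_clock, neg_lnl_free, n_taxa):
    """Global clock test from two -lnL scores as reported by PAUP*."""
    lr = 2.0 * (neg_lnl_clock - neg_lnl_free)
    df = n_taxa - 2
    return lr, df, chi2_sf(lr, df)

# Placeholder -lnL scores for illustration only.
lr, df, p = clock_lrt(neg_lnl_clock=5130.0, neg_lnl_free=5110.0, n_taxa=15)
print(f"LR = {lr:.2f}, DF = {df}, p = {p:.3g}")
print("reject the clock" if p < 0.05 else "cannot reject the clock")
```

For the local clock comparisons, the same chi2_sf function can be reused, with the degrees of freedom set to the difference in the number of free parameters between the two models being compared.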
We cannot do this in PAUP*, because PAUP* does not allow us to specify different rates on different branches. Instead we will use BASEML, a program from the PAML package of phylogeny programs by Ziheng Yang. This program does ML analysis of DNA sequences, and allows us to specify a tree and different distributions of rates on the tree. All these programs can be found at http://abacus.gene.ucl.ac.uk/software/paml.html. BASEML is entirely controlled by its input files. You will need to download these from the web.

10. Go to the syllabus page of the IB 200B website. Download three files: CephTree.trees, BaseML.ctl, CephSeq.nuc

The first file is CephTree.trees; open it with a text editor. As you can see, this file contains the same tree, in Newick format, that we used in the previous example. You will also see a $1 after the clade containing Joubiniteuthis and Moroteuthis. This specifies that all the branches in this clade will have a different rate than the other branches in the tree.

Open the file BaseML.ctl with a text editor. This is the control file for the BaseML program. When BASEML.exe is run, it automatically opens the control file, which must be in the same folder as the program. The first line of the file specifies the file with the DNA sequences. The second line specifies the tree file. There are many other options in this file, but the only one that we are concerned with here is the clock option at the bottom of the file. Here you can specify how the rates on the branches are grouped:
    0 allows the rates on all the branches to vary independently;
    1 enforces a molecular clock;
    2 enforces separate molecular clocks on each set of specified branches.
We'll start with 0.

11. To run BaseML, just double-click on the BaseML program. (If you ever want to use BaseML from a Windows computer, it is slightly better to run it from the command prompt, but double-clicking will also work if there are no errors.)

12. When it is done, record the likelihood score. (It will be at the bottom of the screen, after lnL = )

13. Open up the BaseML.ctl file in a text editor. Change the clock setting to 1. Run BASEML and record the likelihood score. Repeat the process for a clock setting of 2.

14. Use the likelihood ratio test to compare these models. Which model has the highest likelihood? Why? Which model is the best?

Estimating Divergence Times Using r8s

Now we will use r8s (that's a pun, pronounced "rates") to estimate divergence times. r8s takes a tree with branch lengths derived from another program, and then tries to estimate the node times by some measure of the rate differences between these branches. It is by Mike Sanderson and is freely available at http://loco.biosci.arizona.edu/r8s/. r8s uses normal NEXUS files as input, but you need to add a few additional commands. In particular, you need to specify the ages of the nodes that we can locate in time.
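One quick way to see the rate differences that r8s has to reconcile is to compare root-to-tip path lengths on the input tree: under a strict clock they would all be equal, and the tree would be ultrametric. The Python sketch below computes them with a deliberately simple Newick reader. The three-taxon tree is made up for illustration and is not the cephalopod tree. The parser assumes plain labels with no quotes or comments.

```python
def root_to_tip_distances(newick):
    """Map each tip label to the summed branch length from the root."""
    text = newick.strip().rstrip(";")
    pos = 0

    def parse_label():
        nonlocal pos
        start = pos
        while pos < len(text) and text[pos] not in ":,()":
            pos += 1
        return text[start:pos]

    def parse_length():
        nonlocal pos
        if pos < len(text) and text[pos] == ":":
            pos += 1
            start = pos
            while pos < len(text) and text[pos] not in ",()":
                pos += 1
            return float(text[start:pos])
        return 0.0  # no length given (e.g. at the root)

    def parse_node():
        """Return (tip depths below this node, length of the branch above it)."""
        nonlocal pos
        if text[pos] == "(":
            pos += 1  # skip "("
            tips = {}
            while True:
                child_tips, length = parse_node()
                for tip, depth in child_tips.items():
                    tips[tip] = depth + length
                if text[pos] == ",":
                    pos += 1
                else:
                    break
            pos += 1  # skip ")"
            parse_label()  # optional internal node name, not needed here
        else:
            tips = {parse_label(): 0.0}
        return tips, parse_length()

    tips, _root_branch = parse_node()
    return tips

# Made-up example with a named internal node, as in an r8s input tree.
example = "((A:1.0,B:3.0)clade1:2.0,C:4.0);"
depths = root_to_tip_distances(example)
spread = max(depths.values()) - min(depths.values())
print(depths)
print("ultrametric" if spread < 1e-9 else f"not ultrametric (spread {spread:g})")
```

A large spread in root-to-tip distances is an informal hint that a single constant rate will fit poorly. It is no substitute for the likelihood ratio tests described earlier.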
As we learned in lecture, it is very difficult to locate a node in time.

Open the CEPHr8s.nex file in a text editor. The tree block is the same as you would find in any NEXUS file. Branch lengths are included after the colons. Some of the internal nodes are also named, after the closing parentheses but before the colons. It is necessary to name nodes so that dates can be assigned to them. The r8s block has several commands:

lengths=persite means that the branch lengths are in changes per site, not total number of changes.
ultrametric=no means that the input tree is not ultrametric.
fixage taxon=clade1 age=150 sets the age of node clade1 to 150.
constrain taxon=node2 min_age=200 max_age=300 forces node2 to be between 200 and 300. These times are measured backwards from the present, so min_age=200 means that this divergence happened at least 200 (million?) years ago. I believe the units are relative, depending on what you input.
divtime method=lf starts the fitting algorithm using the Langley-Fitch method, which deduces node times by maximum likelihood from the branch lengths, assuming a constant rate of substitution.

15. Download the file CEPHr8s.nex from the 200B website.

16. Open the command terminal, type cd followed by a space, drag the folder labeled r8s into the terminal, and hit return. Type ./r8s v and hit return, and then type execute CEPHr8s.nex
This will execute the file and the commands that we put in the r8s block. According to the instructions this should be easier to do, but I couldn't get it to work the other way.

17. When the program is done running, type showage to output a table of node ages.

18. Now, let's try a different divergence time method, Non-Parametric Rate Smoothing. Open the CEPHr8s.nex file in a text editor again, and change the method to method=NPRS. When this option is set, the program minimizes the squared differences between the rates on adjacent branches.

19. Run r8s again. How do the node estimates for the two methods compare?
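To make the contrast between the two methods more concrete: once node ages are proposed, every branch has an implied rate (its length divided by its duration). Langley-Fitch assumes one rate for the whole tree, while rate smoothing penalizes rate changes between a branch and the branch above it. The Python sketch below computes such a penalty for a made-up set of branches and ages. It is a simplification for illustration, not the r8s implementation: it does no optimization and ignores any special treatment of branches at the root.

```python
# Each branch: (name, parent branch or None, length in substitutions per site,
#               age of its upper node, age of its lower node).
# All values are invented for illustration.
branches = [
    ("to_clade1", None, 0.20, 100.0, 60.0),
    ("to_A", "to_clade1", 0.30, 60.0, 0.0),
    ("to_B", "to_clade1", 0.12, 60.0, 0.0),
]

def branch_rates(branches):
    """Implied rate on each branch: length divided by elapsed time."""
    return {name: length / (top - bottom)
            for name, _parent, length, top, bottom in branches}

def smoothing_penalty(branches):
    """Sum of squared rate differences between each branch and its parent."""
    rates = branch_rates(branches)
    return sum((rates[name] - rates[parent]) ** 2
               for name, parent, *_ in branches
               if parent is not None)

for name, rate in branch_rates(branches).items():
    print(f"{name}: {rate:.4f} substitutions per site per unit time")
print(f"smoothing penalty: {smoothing_penalty(branches):.3g}")
```

Changing the proposed age of the lower node of to_clade1 changes all three rates at once, which is why the two methods can give different node ages from the same branch lengths.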