Finding Clusters in Phylogenetic Trees: A Special Type of Cluster Analysis




Finding Clusters in Phylogenetic Trees: A Special Type of Cluster Analysis. Why try to identify clusters in phylogenetic trees? Example: the origin of HIV. NUMBER: Why are there so many distinct clusters? SYNCHRONY: Was the onset of diversification synchronized?

Example. Observe the main features of HIV-1, type M:
- Approx. 10 distinct subtypes
- Subtypes are approx. equidistant ("sunburst")
Question: Could these features have arisen naturally?
Approach: quantitative comparison to a simulated African epidemic. Models/tools for the simulation:
- coalescent theory, phylogenetic tree estimation,
- estimating the number of subtypes, and
- classical statistics: are the main features outliers with respect to our forward model?
FOCUS: Estimate the number of subtypes.

This talk focuses on how to choose groups. Candidate approaches:
- Model-based clustering (Raftery et al.: mclust in S+)
- Maximum likelihood + bootstrap (state of the art: PHYLIP, others)
- Markov Chain Monte Carlo (MCMC)

Complicated Genetic Data Structure. (Two aligned example sequences were shown, each labeled with its isolate identifier.) Each sequence identifier encodes the subtype, the isolation year (e.g., 94), the country of origin (e.g., CY), the isolate number, and the clone number.
Ensure: global coverage, all known subtypes included, the widest possible span of isolation times, and more than one region of the genome.
Avoid: more than one clone from the same isolate.
Issues: the genealogy implies correlation among sequences; choice of evolution model.

Distance measures / micro-evolutionary models. P_ij(t) is the 4-by-4 transition probability matrix: P(A -> C in time t) = P_AC(t), etc. For some P matrices one can define an evolutionary distance between taxa x and y, each with N nucleotides (must correct for multiple substitutions).

Observed pair frequencies between x and y:
NF_xy =
  [ n_AA  n_AC  n_AG  n_AT ]
  [ n_CA  n_CC  n_CG  n_CT ]
  [ n_GA  n_GC  n_GG  n_GT ]
  [ n_TA  n_TC  n_TG  n_TT ]

GTR rate matrix (rows and columns in the order A, C, G, T), with P = e^{Qt}:
Q_ij/µ =
  [   -      a π_C   b π_G   c π_T ]
  [ a π_A     -      d π_G   e π_T ]
  [ b π_A   d π_C     -      f π_T ]
  [ c π_A   e π_C   f π_G     -    ]

GTR satisfies π_i P_ij = π_j P_ji and has 8 free parameters. Common models are special cases with fewer parameters. Use NF_xy to estimate the parameters.

JC (Jukes-Cantor): P_ij(t) = 1/4 + (3/4) e^{-µt} for i = j, and 1/4 - (1/4) e^{-µt} for i != j.
K2P (Kimura two-parameter): P_ij(t) = 1/4 + (1/4) e^{-µt} + (1/2) e^{-µt(κ+1)/2} for i = j, etc., where κ is the transition/transversion ratio.
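To make the distance calculation concrete, here is a minimal Python sketch (not from the talk) that tallies the pair-frequency matrix NF_xy for two aligned sequences and converts the observed mismatch fraction into a Jukes-Cantor distance; the sequences and function names are illustrative assumptions.

import numpy as np

BASES = "ACGT"
IDX = {b: i for i, b in enumerate(BASES)}

def pair_frequency_matrix(x, y):
    """NF_xy[i, j] = fraction of aligned sites with base i in x and base j in y."""
    nf = np.zeros((4, 4))
    n = 0
    for a, b in zip(x, y):
        if a in IDX and b in IDX:          # skip gaps and ambiguity codes
            nf[IDX[a], IDX[b]] += 1
            n += 1
    return nf / n

def jukes_cantor_distance(x, y):
    """JC69 distance: corrects the observed mismatch fraction p for multiple hits."""
    nf = pair_frequency_matrix(x, y)
    p = 1.0 - np.trace(nf)                 # fraction of mismatched sites
    return -0.75 * np.log(1.0 - 4.0 * p / 3.0)

seq1 = "ACGTACGTACGTACGTACGT"               # made-up toy alignment
seq2 = "ACGTACGAACGTTCGTACGT"
print(pair_frequency_matrix(seq1, seq2))
print(jukes_cantor_distance(seq1, seq2))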

Number of subtypes: model-based clustering. (Figure: model-based clustering results plotted against the number of subtypes for the Env and Gag genes.)

Simulated data: 4 macro growth rates:
(a) N = N_0 e^{rt}
(b) N = N_0
(c) N = N_0, then N = N_0 e^{rt}
(d) N is quadratic from 1970 to 1990
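As a sketch of how these four demographic scenarios might be encoded for a simulation, the following Python function returns N(t) under (a)-(d); the parameter values (N0, r, the change point, and the quadratic form) are placeholders, not values from the talk.

import numpy as np

def pop_size(t, scenario, N0=1000.0, r=0.1, t_star=20.0):
    """Population size N(t) under the four macro growth-rate scenarios (a)-(d)."""
    if scenario == "a":                      # exponential growth
        return N0 * np.exp(r * t)
    if scenario == "b":                      # constant size
        return N0
    if scenario == "c":                      # constant, then exponential growth
        return N0 if t < t_star else N0 * np.exp(r * (t - t_star))
    if scenario == "d":                      # quadratic growth over a fixed window
        return N0 * (1.0 + (t / t_star) ** 2)
    raise ValueError(scenario)

print([pop_size(30.0, s) for s in "abcd"])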

Example: Real Trees. (Figure: estimated env and gag trees with subtype labels A through K.) The ML + bootstrap approach suggests 7 clusters (subtypes) among the env sequences and likewise identifies clusters (subtypes) among the 88 gag sequences. The data are available at hiv-web.lanl.gov, and accession numbers are available upon request. NOTE: B and D are similar, and H and J are rare (omitted in this analysis).

Model-based clustering as in mclust: an approximate Bayes method to choose the number of groups G. First assume G is known and the data are n cases of p-dimensional observations x = (x_1, x_2, ..., x_n), with probability density f_k(x; θ) for observations from group k. Let γ = (γ_1, ..., γ_n) be the group labels. Choose (θ, γ) to maximize L(θ; γ) = Π_i f_{γ_i}(x_i; θ). If f is MVN(µ_k, Σ_k), this gives a sum-of-squares criterion, with variations depending on the assumptions on Σ_k. Banfield and Raftery (1993) use hierarchical agglomeration and iterative reallocation to maximize the classification likelihood

L(x | θ, γ) = Π_{i=1}^{n} φ(x_i | µ_{γ_i}, Σ_{γ_i}),

where φ is the MVN density (Banfield and Raftery, Biometrics, 1993).
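The classification-likelihood idea can be sketched as a hard-assignment loop: estimate each group's MVN parameters from the current labels, then reassign each case to the group with the highest density. This Python sketch only illustrates the criterion above; it is not the mclust / Banfield-Raftery implementation (which initializes by hierarchical agglomeration), and the random initialization and ridge constant are assumptions.

import numpy as np
from scipy.stats import multivariate_normal

def classification_em(x, G, n_iter=50, seed=0):
    """Hard-assignment maximization of the classification likelihood (a sketch)."""
    rng = np.random.default_rng(seed)
    n, p = x.shape
    labels = rng.integers(G, size=n)
    loglik = -np.inf
    for _ in range(n_iter):
        # "M-step": per-group MVN parameters from the current hard labels
        params = []
        for k in range(G):
            xk = x[labels == k]
            if len(xk) < p + 1:                       # guard against tiny/empty groups
                xk = x
            params.append((xk.mean(axis=0),
                           np.cov(xk, rowvar=False) + 1e-6 * np.eye(p)))
        # "C-step": reassign each case to the group with the highest density
        logdens = np.column_stack(
            [multivariate_normal(mu, cov).logpdf(x) for mu, cov in params])
        new_labels = logdens.argmax(axis=1)
        loglik = logdens[np.arange(n), new_labels].sum()   # log classification likelihood
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels, loglik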

Model-based clustering as in mclust. BR approach: use the spectral decomposition Σ_k = λ_k D_k A_k D_k^T, where λ_k, A_k, and D_k control the volume, shape, and orientation of group k. Next, to estimate p(G = r | x), approximate the distribution of the Bayes factor p(x | G = r) / p(x | G = s). Allow a noise component for new-cluster cases, and use a heuristic method to address the failure of a regularity condition in the clustering context.
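A small 2-D Python sketch of the spectral parameterization: λ_k scales the volume, a unit-determinant diagonal A_k sets the shape, and a rotation D_k sets the orientation. The numbers are illustrative only.

import numpy as np

def covariance_from_volume_shape_orientation(lam, shape, theta):
    """Build Sigma = lam * D @ A @ D.T with A = diag(shape) normalized to det(A) = 1."""
    shape = np.asarray(shape, dtype=float)
    A = np.diag(shape) / np.prod(shape) ** (1.0 / len(shape))
    D = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])
    return lam * D @ A @ D.T

sigma = covariance_from_volume_shape_orientation(lam=2.0, shape=[4.0, 1.0], theta=np.pi / 6)
print(np.linalg.det(sigma))   # = lam^p * det(A) = 4.0 for p = 2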

Simulated Example. (Figure: scatter plot of the simulated data, top; model-selection criterion vs. number of clusters for several covariance models, bottom.) Evaluation of EMclust for a simulated data set with observations from each of 3 clusters (labeled 1, 2, 3 in the top plot), with true model VEV, denoting that the volume varies (V) among clusters, the shape does not vary (E for equal) among clusters, and the orientation varies (V) among clusters. The BIC correctly chooses 3 clusters but chooses VVV rather than the correct VEV.
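The model-selection step (choose both the number of clusters and the covariance model by BIC) can be approximated with scikit-learn's GaussianMixture, whose covariance families ("full", "tied", "diag", "spherical") are a coarser analogue of mclust's volume/shape/orientation models. This is an analogy under that assumption, not the EMclust software from the talk, and the synthetic data are made up.

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Three synthetic 2-D clusters with differing volumes and orientations (illustrative only)
x = np.vstack([
    rng.multivariate_normal([0, 0], [[1.0, 0.6], [0.6, 1.0]], size=100),
    rng.multivariate_normal([6, 0], [[0.3, 0.0], [0.0, 0.3]], size=100),
    rng.multivariate_normal([3, 5], [[2.0, -0.9], [-0.9, 1.0]], size=100),
])

best = None
for n_clusters in range(1, 7):
    for cov_type in ("full", "tied", "diag", "spherical"):
        gm = GaussianMixture(n_components=n_clusters, covariance_type=cov_type,
                             n_init=5, random_state=0).fit(x)
        bic = gm.bic(x)                      # smaller BIC = preferred model
        if best is None or bic < best[0]:
            best = (bic, n_clusters, cov_type)

print("BIC-selected model:", best)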

mclust suggests fewer subtypes (it tends to merge B and D). (Figure) Env data: (Top) hierarchical clustering; (Middle) principal coordinate plot with subtype labels; (Bottom) results of mclust, BIC vs. number of clusters for several covariance models.
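For reference, a principal coordinate (classical MDS) embedding like the one in the middle panel can be computed from a matrix of pairwise evolutionary distances by double-centering the squared distances and taking the leading eigenvectors; these coordinates are what a distance-based clustering would then operate on. The toy distance matrix below is illustrative only.

import numpy as np

def principal_coordinates(dist, n_dims=2):
    """Classical MDS: double-center the squared distances and eigendecompose."""
    d2 = np.asarray(dist, dtype=float) ** 2
    n = d2.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n          # centering matrix
    B = -0.5 * J @ d2 @ J
    eigval, eigvec = np.linalg.eigh(B)
    order = np.argsort(eigval)[::-1][:n_dims]    # keep the largest eigenvalues
    return eigvec[:, order] * np.sqrt(np.clip(eigval[order], 0, None))

toy_dist = np.array([[0.0, 0.1, 0.4],
                     [0.1, 0.0, 0.5],
                     [0.4, 0.5, 0.0]])
print(principal_coordinates(toy_dist))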

MCMC. On different data with fewer taxa: compare MCMC to ML + bootstrap in a case where the groups are chosen in advance. (Figure: group probability via MCMC plotted against probability via ML + bootstrap; panels include (c) influenza, HA gene, and (d) influenza, NP gene.)

Summary. The present methods for choosing the number of groups via ML + bootstrap or MCMC amount to trial and error: usually the human eye studies the tree and selects groups, and then ML + bootstrap is run on the specified groups; similarly with MCMC. Model-based clustering offers an automatic way to choose groups, but it relies on pairwise distances (less efficient than likelihood). FUTURE: consider how to automate (without the human eye) the cluster choices in ML + bootstrap or MCMC (or any other method, such as weighted parsimony + bootstrap). Increasing the number of taxa: MCMC and ML are very slow, so they are currently limited to a few hundred taxa. Consider: identify groups once, then assign new taxa to the existing groups.
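One way the "assign new taxa to existing groups" idea could look in practice is a nearest-group rule on mean pairwise distance; the group names and distance values below are hypothetical placeholders, not results from the talk.

import numpy as np

def assign_to_group(dist_to_members):
    """dist_to_members maps a group label to the new taxon's distances to that group's members."""
    return min(dist_to_members, key=lambda g: dist_to_members[g].mean())

new_taxon_distances = {
    "subtype_A": np.array([0.12, 0.10, 0.15]),
    "subtype_B": np.array([0.31, 0.28, 0.35]),
}
print(assign_to_group(new_taxon_distances))   # -> "subtype_A"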

References
Banfield, J., & Raftery, A. (1993). Model-based Gaussian and non-Gaussian clustering. Biometrics, 49, 803-821.
Burr, T., Myers, G., & Hyman, J. (2001). The origin of AIDS: Darwinian or Lamarckian? Phil. Trans. R. Soc. Lond. B, 877-887.
Burr, T., Skourikhine, A.N., Macken, C., & Bruno, W. (1999). Confidence measures for evolutionary trees: applications to molecular epidemiology. Proc. of the 1999 IEEE Inter. Conference on Information, Intelligence and Systems, 107-114.
Burr, T., Doak, J., Gattiker, J., & Stanbro, W. (2002a). Assessing confidence in phylogenetic trees: bootstrap versus Markov Chain Monte Carlo. Mathematical and Engineering Methods in Medicine and Biological Sciences.
Burr, T., Gattiker, J., & LaBerge, G. (2002b). Genetic subtyping using cluster analysis. Special Interest Group on Knowledge Discovery and Data Mining (SIGKDD) Explorations, 3(2), 33-42.