Finding Clusters in Phylogenetic Trees: A Special Type of Cluster Analysis




Finding Clusters in Phylogenetic Trees: A Special Type of Cluster Analysis. Why try to identify clusters in phylogenetic trees? Example: the origin of HIV. NUMBER: Why are there so many distinct clusters? SYNCHRONY: Was the onset of diversification synchronized?

Example. Observe the main features of HIV-1, type M:
- Approx. 10 distinct subtypes
- Subtypes are approx. equidistant ("sunburst")
Question: Could these features have arisen naturally?
Approach: quantitative comparison to a simulated African epidemic. Models/tools for the simulation:
- coalescent theory, phylogenetic tree estimation,
- estimating the number of subtypes, and
- classical statistics: are the main features outliers with respect to our forward model?
FOCUS: Estimate the number of subtypes.

This talk focuses on how to choose groups. Candidate approaches:
- Model-based clustering (Raftery et al.: mclust in S+)
- Maximum likelihood + bootstrap (state of the art: PHYLIP, others)
- Markov Chain Monte Carlo (MCMC)

Complicated Genetic Data Structure. (Two aligned example sequences were shown, each labeled with its isolate identifier.) Each sequence identifier encodes the subtype, the isolation year (e.g., 94), the country of origin (e.g., CY), the isolate number, and the clone number.
Ensure: global coverage, all known subtypes included, the widest possible span of isolation times, and more than one region of the genome.
Avoid: more than one clone from the same isolate.
Issues: the genealogy implies correlation among sequences; choice of evolution model.

Distance measures / micro-evolutionary models. P_ij(t) is the 4-by-4 transition probability matrix: P(A -> C in time t) = P_AC(t), etc. For some P matrices one can define an evolutionary distance between taxa x and y, each with N nucleotides (must correct for multiple substitutions).

Observed pair frequencies between x and y:
NF_xy =
  [ n_AA  n_AC  n_AG  n_AT ]
  [ n_CA  n_CC  n_CG  n_CT ]
  [ n_GA  n_GC  n_GG  n_GT ]
  [ n_TA  n_TC  n_TG  n_TT ]

GTR rate matrix (rows and columns in the order A, C, G, T), with P = e^{Qt}:
Q_ij/µ =
  [   -      a π_C   b π_G   c π_T ]
  [ a π_A     -      d π_G   e π_T ]
  [ b π_A   d π_C     -      f π_T ]
  [ c π_A   e π_C   f π_G     -    ]

GTR satisfies π_i P_ij = π_j P_ji and has 8 free parameters. Common models are special cases with fewer parameters. Use NF_xy to estimate the parameters.

JC (Jukes-Cantor): P_ij(t) = 1/4 + (3/4) e^{-µt} for i = j, and 1/4 - (1/4) e^{-µt} for i != j.
K2P (Kimura two-parameter): P_ij(t) = 1/4 + (1/4) e^{-µt} + (1/2) e^{-µt(κ+1)/2} for i = j, etc., where κ is the transition/transversion ratio.
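To make the distance calculation concrete, here is a minimal Python sketch (not from the talk) that tallies the pair-frequency matrix NF_xy for two aligned sequences and converts the observed mismatch fraction into a Jukes-Cantor distance; the sequences and function names are illustrative assumptions.

import numpy as np

BASES = "ACGT"
IDX = {b: i for i, b in enumerate(BASES)}

def pair_frequency_matrix(x, y):
    """NF_xy[i, j] = fraction of aligned sites with base i in x and base j in y."""
    nf = np.zeros((4, 4))
    n = 0
    for a, b in zip(x, y):
        if a in IDX and b in IDX:          # skip gaps and ambiguity codes
            nf[IDX[a], IDX[b]] += 1
            n += 1
    return nf / n

def jukes_cantor_distance(x, y):
    """JC69 distance: corrects the observed mismatch fraction p for multiple hits."""
    nf = pair_frequency_matrix(x, y)
    p = 1.0 - np.trace(nf)                 # fraction of mismatched sites
    return -0.75 * np.log(1.0 - 4.0 * p / 3.0)

seq1 = "ACGTACGTACGTACGTACGT"               # made-up toy alignment
seq2 = "ACGTACGAACGTTCGTACGT"
print(pair_frequency_matrix(seq1, seq2))
print(jukes_cantor_distance(seq1, seq2))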

Number of subtypes: model-based clustering. (Figure: model-based clustering results plotted against the number of subtypes for the Env and Gag genes.)

Simulated data: 4 macro growth rates:
(a) N = N_0 e^{rt}
(b) N = N_0
(c) N = N_0, then N = N_0 e^{rt}
(d) N is quadratic from 1970 to 1990
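As a sketch of how these four demographic scenarios might be encoded for a simulation, the following Python function returns N(t) under (a)-(d); the parameter values (N0, r, the change point, and the quadratic form) are placeholders, not values from the talk.

import numpy as np

def pop_size(t, scenario, N0=1000.0, r=0.1, t_star=20.0):
    """Population size N(t) under the four macro growth-rate scenarios (a)-(d)."""
    if scenario == "a":                      # exponential growth
        return N0 * np.exp(r * t)
    if scenario == "b":                      # constant size
        return N0
    if scenario == "c":                      # constant, then exponential growth
        return N0 if t < t_star else N0 * np.exp(r * (t - t_star))
    if scenario == "d":                      # quadratic growth over a fixed window
        return N0 * (1.0 + (t / t_star) ** 2)
    raise ValueError(scenario)

print([pop_size(30.0, s) for s in "abcd"])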

Example: Real Trees. (Figure: estimated env and gag trees with subtype labels A through K.) The ML + bootstrap approach suggests 7 clusters (subtypes) among the env sequences and likewise identifies clusters (subtypes) among the 88 gag sequences. The data are available at hiv-web.lanl.gov, and accession numbers are available upon request. NOTE: B and D are similar, and H and J are rare (omitted in this analysis).

Model-based clustering as in mclust: an approximate Bayes method to choose the number of groups G. First assume G is known and the data are n cases of p-dimensional observations x = (x_1, x_2, ..., x_n), with probability density f_k(x; θ) for observations from group k. Let γ = (γ_1, ..., γ_n) be the group labels. Choose (θ, γ) to maximize L(θ; γ) = Π_i f_{γ_i}(x_i; θ). If f is MVN(µ_k, Σ_k), this gives a sum-of-squares criterion, with variations depending on the assumptions on Σ_k. Banfield and Raftery (1993) use hierarchical agglomeration and iterative reallocation to maximize the classification likelihood

L(x | θ, γ) = Π_{i=1}^{n} φ(x_i | µ_{γ_i}, Σ_{γ_i}),

where φ is the MVN density (Banfield and Raftery, Biometrics, 1993).
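The classification-likelihood idea can be sketched as a hard-assignment loop: estimate each group's MVN parameters from the current labels, then reassign each case to the group with the highest density. This Python sketch only illustrates the criterion above; it is not the mclust / Banfield-Raftery implementation (which initializes by hierarchical agglomeration), and the random initialization and ridge constant are assumptions.

import numpy as np
from scipy.stats import multivariate_normal

def classification_em(x, G, n_iter=50, seed=0):
    """Hard-assignment maximization of the classification likelihood (a sketch)."""
    rng = np.random.default_rng(seed)
    n, p = x.shape
    labels = rng.integers(G, size=n)
    loglik = -np.inf
    for _ in range(n_iter):
        # "M-step": per-group MVN parameters from the current hard labels
        params = []
        for k in range(G):
            xk = x[labels == k]
            if len(xk) < p + 1:                       # guard against tiny/empty groups
                xk = x
            params.append((xk.mean(axis=0),
                           np.cov(xk, rowvar=False) + 1e-6 * np.eye(p)))
        # "C-step": reassign each case to the group with the highest density
        logdens = np.column_stack(
            [multivariate_normal(mu, cov).logpdf(x) for mu, cov in params])
        new_labels = logdens.argmax(axis=1)
        loglik = logdens[np.arange(n), new_labels].sum()   # log classification likelihood
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels, loglik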

Model-based clustering as in mclust. BR approach: use the spectral decomposition Σ_k = λ_k D_k A_k D_k^T, where λ_k, A_k, and D_k control the volume, shape, and orientation of group k. Next, to estimate p(G = r | x), approximate the distribution of the Bayes factor p(x | G = r) / p(x | G = s). Allow a noise component for new-cluster cases, and use a heuristic method to address the failure of a regularity condition in the clustering context.
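A small 2-D Python sketch of the spectral parameterization: λ_k scales the volume, a unit-determinant diagonal A_k sets the shape, and a rotation D_k sets the orientation. The numbers are illustrative only.

import numpy as np

def covariance_from_volume_shape_orientation(lam, shape, theta):
    """Build Sigma = lam * D @ A @ D.T with A = diag(shape) normalized to det(A) = 1."""
    shape = np.asarray(shape, dtype=float)
    A = np.diag(shape) / np.prod(shape) ** (1.0 / len(shape))
    D = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])
    return lam * D @ A @ D.T

sigma = covariance_from_volume_shape_orientation(lam=2.0, shape=[4.0, 1.0], theta=np.pi / 6)
print(np.linalg.det(sigma))   # = lam^p * det(A) = 4.0 for p = 2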

Simulated Example. (Figure: scatter plot of the simulated data, top; model-selection criterion vs. number of clusters for several covariance models, bottom.) Evaluation of EMclust for a simulated data set with observations from each of 3 clusters (labeled 1, 2, 3 in the top plot), with true model VEV, denoting that the volume varies (V) among clusters, the shape does not vary (E for equal) among clusters, and the orientation varies (V) among clusters. The BIC correctly chooses 3 clusters but chooses VVV rather than the correct VEV.
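The model-selection step (choose both the number of clusters and the covariance model by BIC) can be approximated with scikit-learn's GaussianMixture, whose covariance families ("full", "tied", "diag", "spherical") are a coarser analogue of mclust's volume/shape/orientation models. This is an analogy under that assumption, not the EMclust software from the talk, and the synthetic data are made up.

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Three synthetic 2-D clusters with differing volumes and orientations (illustrative only)
x = np.vstack([
    rng.multivariate_normal([0, 0], [[1.0, 0.6], [0.6, 1.0]], size=100),
    rng.multivariate_normal([6, 0], [[0.3, 0.0], [0.0, 0.3]], size=100),
    rng.multivariate_normal([3, 5], [[2.0, -0.9], [-0.9, 1.0]], size=100),
])

best = None
for n_clusters in range(1, 7):
    for cov_type in ("full", "tied", "diag", "spherical"):
        gm = GaussianMixture(n_components=n_clusters, covariance_type=cov_type,
                             n_init=5, random_state=0).fit(x)
        bic = gm.bic(x)                      # smaller BIC = preferred model
        if best is None or bic < best[0]:
            best = (bic, n_clusters, cov_type)

print("BIC-selected model:", best)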

mclust suggests fewer subtypes (it tends to merge B and D). (Figure) Env data: (Top) hierarchical clustering; (Middle) principal coordinate plot with subtype labels; (Bottom) results of mclust, BIC vs. number of clusters for several covariance models.
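For reference, a principal coordinate (classical MDS) embedding like the one in the middle panel can be computed from a matrix of pairwise evolutionary distances by double-centering the squared distances and taking the leading eigenvectors; these coordinates are what a distance-based clustering would then operate on. The toy distance matrix below is illustrative only.

import numpy as np

def principal_coordinates(dist, n_dims=2):
    """Classical MDS: double-center the squared distances and eigendecompose."""
    d2 = np.asarray(dist, dtype=float) ** 2
    n = d2.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n          # centering matrix
    B = -0.5 * J @ d2 @ J
    eigval, eigvec = np.linalg.eigh(B)
    order = np.argsort(eigval)[::-1][:n_dims]    # keep the largest eigenvalues
    return eigvec[:, order] * np.sqrt(np.clip(eigval[order], 0, None))

toy_dist = np.array([[0.0, 0.1, 0.4],
                     [0.1, 0.0, 0.5],
                     [0.4, 0.5, 0.0]])
print(principal_coordinates(toy_dist))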

MCMC. On different data with fewer taxa: compare MCMC to ML + bootstrap in a case where the groups are chosen in advance. (Figure: group probability via MCMC plotted against probability via ML + bootstrap; panels include (c) influenza, HA gene, and (d) influenza, NP gene.)

Summary. The present methods for choosing the number of groups via ML + bootstrap or MCMC amount to trial and error: usually the human eye studies the tree and selects groups, and then ML + bootstrap is run on the specified groups; similarly with MCMC. Model-based clustering offers an automatic way to choose groups, but it relies on pairwise distances (less efficient than likelihood). FUTURE: consider how to automate (without the human eye) the cluster choices in ML + bootstrap or MCMC (or any other method, such as weighted parsimony + bootstrap). Increasing the number of taxa: MCMC and ML are very slow, so they are currently limited to a few hundred taxa. Consider: identify groups once, then assign new taxa to the existing groups.
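One way the "assign new taxa to existing groups" idea could look in practice is a nearest-group rule on mean pairwise distance; the group names and distance values below are hypothetical placeholders, not results from the talk.

import numpy as np

def assign_to_group(dist_to_members):
    """dist_to_members maps a group label to the new taxon's distances to that group's members."""
    return min(dist_to_members, key=lambda g: dist_to_members[g].mean())

new_taxon_distances = {
    "subtype_A": np.array([0.12, 0.10, 0.15]),
    "subtype_B": np.array([0.31, 0.28, 0.35]),
}
print(assign_to_group(new_taxon_distances))   # -> "subtype_A"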

References
Banfield, J., & Raftery, A. (1993). Model-based Gaussian and non-Gaussian clustering. Biometrics, 49, 803-821.
Burr, T., Myers, G., & Hyman, J. (2001). The origin of AIDS: Darwinian or Lamarckian? Phil. Trans. R. Soc. Lond. B, 877-887.
Burr, T., Skourikhine, A.N., Macken, C., & Bruno, W. (1999). Confidence measures for evolutionary trees: applications to molecular epidemiology. Proc. of the 1999 IEEE Inter. Conference on Information, Intelligence and Systems, 107-114.
Burr, T., Doak, J., Gattiker, J., & Stanbro, W. (2002a). Assessing confidence in phylogenetic trees: bootstrap versus Markov Chain Monte Carlo. Mathematical and Engineering Methods in Medicine and Biological Sciences.
Burr, T., Gattiker, J., & LaBerge, G. (2002b). Genetic subtyping using cluster analysis. Special Interest Group on Knowledge Discovery and Data Mining (SIGKDD) Explorations, 3(2), 33-42.