Data visualization and clustering. Genomics is to no small extent a data science

Laurence Turner
7 years ago

1 Data visualization and clustering. Genomics is to no small extent a data science.

2 Data visualization and clustering. Genomics is to no small extent a data science. [Andersson et al., Nature 2015]

3 Data visualization and clustering. Data visualization: look at the data. Why?
- Quality control: did an experiment work?
- Exploratory data analysis: what does the data say?
- Sanity checks: does my code work?
- Interpreting data: making a point.
Example: the CAGE signature correlates with other enhancer marks (figure axes: fraction of enhancers, mean signal). [Andersson et al., Nature 2015]


4 Data visualization and clustering
1. Tools for data visualization. How to:
- Visualize distributions and correlations
- Visualize group structure: clustering
- Visualize data along genomic coordinates
- Visualize dependencies and interactions: graphs/networks and layouts
2. Examples


5 Visualizing distributions. Discrete RV, continuous RV. A random variable $X$ yields observed realizations $\{x_1, x_2, \ldots, x_n\}$ ($n$ data points).

6 Visualizing distributions: histogram. $X \to \{x_1, x_2, \ldots, x_n\}$; what distribution do $\{x_1, x_2, \ldots, x_n\}$ come from? Plot the number of realizations in each bin. Bins are intervals, e.g. $[0.5, 0.6)$.
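The binning step can be sketched in a few lines. This is a minimal illustration, assuming NumPy is available; the data values and the choice of ten equal-width bins are made up for the example.

```python
import numpy as np

# Hypothetical observed realizations x_1, ..., x_n of a random variable X.
x = np.array([0.12, 0.48, 0.51, 0.55, 0.58, 0.61, 0.63, 0.95])

# Ten equal-width bins on [0, 1]. Each bin is half-open, e.g. [0.5, 0.6),
# except the last one, which NumPy closes on the right.
edges = np.linspace(0.0, 1.0, 11)
counts, edges = np.histogram(x, bins=edges)

# Print the number of realizations per bin.
for left, right, c in zip(edges[:-1], edges[1:], counts):
    print(f"[{left:.1f}, {right:.1f}): {c}")
```

The bar heights of a histogram plot are exactly the values in `counts`; changing `edges` changes the picture, so the bin choice is part of the visualization.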

7 Visualizing distributions: density. A density plot is an ESTIMATE of the continuous p.d.f. The scale of the x-axis matters! Figure annotation: second mode. [Gentleman et al. 2006]
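One common way to obtain such an estimate is a Gaussian kernel density estimate. The sketch below is an assumption about the method (the slide does not say which estimator was used); the helper name `kde_gaussian`, the bimodal sample, and the bandwidth of 0.5 are all hypothetical.

```python
import numpy as np

def kde_gaussian(samples, grid, bandwidth):
    """Average of Gaussian bumps centred on each sample, evaluated on grid."""
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z**2) / np.sqrt(2.0 * np.pi)
    return kernels.mean(axis=1) / bandwidth

rng = np.random.default_rng(0)
# Hypothetical bimodal sample: two groups of 300 points around 0 and 5.
samples = np.concatenate([rng.normal(0, 1, 300), rng.normal(5, 1, 300)])

grid = np.linspace(-5, 10, 301)          # grid[100] = 0, grid[150] = 2.5, grid[200] = 5
density = kde_gaussian(samples, grid, bandwidth=0.5)

# A density integrates to (approximately) one over the grid.
area = np.sum(density) * (grid[1] - grid[0])
print(f"area under the estimate: {area:.3f}")
```

The bandwidth plays the role that the bin width plays for a histogram: too small and the estimate is noisy, too large and the second mode is smoothed away.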

8 Visualizing distributions: 2d histograms. Axes: binned realizations of RV X and binned realizations of RV Y, with contour lines. The plot carries information about the DEPENDENCE between X and Y:
$$P(X, Y) = P(X \mid Y)\, P(Y)$$
where $P(X, Y)$ is the joint distribution, $P(X \mid Y)$ the conditional distribution, and $P(Y)$ the marginal distribution. Independence: $P(X \mid Y) = P(X)$. Are the rows in the plot similar?
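The factorization can be checked numerically on binned data. This is a sketch assuming NumPy; the simulated sample, the six bins per axis, and the variable names are illustrative choices, not taken from the slides.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=5000)
y = 0.8 * x + 0.6 * rng.normal(size=5000)   # y depends on x by construction

# counts[i, j]: number of points with x in bin i and y in bin j.
counts, x_edges, y_edges = np.histogram2d(x, y, bins=6, range=[[-3, 3], [-3, 3]])

joint = counts / counts.sum()        # estimate of P(X, Y)
marginal_x = joint.sum(axis=1)       # estimate of P(X)
marginal_y = joint.sum(axis=0)       # estimate of P(Y)

# P(X | Y): divide each y-bin slice of the joint by P(Y), skipping empty y bins.
cond_x_given_y = np.divide(
    joint, marginal_y,
    out=np.zeros_like(joint),
    where=(marginal_y > 0)[None, :],
)

# Under independence every slice P(X | Y = y) would equal P(X).
deviation = np.abs(cond_x_given_y - marginal_x[:, None]).max()
print(f"largest deviation of P(X|Y) from P(X): {deviation:.2f}")
```

Each column of `cond_x_given_y` is one conditional distribution; comparing the columns is the numerical version of asking whether the slices of the plot look alike.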


9 Correlation. Joint realizations of X and Y: $\{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$. Figures: scatterplot; scatterplot with regression line. Linear regression models the conditional expectation as a linear function:
$$E[Y \mid X = x] = f(x) = \alpha + \beta x$$
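A least-squares fit of the regression line might look as follows. This is a sketch assuming NumPy; the simulated data and the "true" intercept 1.5 and slope 0.7 are hypothetical values chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(2)
# Hypothetical joint realizations (x_i, y_i) scattered around a line.
x = rng.uniform(0, 10, 200)
y = 1.5 + 0.7 * x + rng.normal(0, 0.5, 200)

# np.polyfit returns coefficients from highest degree down: slope, then intercept.
beta, alpha = np.polyfit(x, y, deg=1)

def f(x_new):
    """Fitted conditional expectation E[Y | X = x_new]."""
    return alpha + beta * x_new

print(f"intercept = {alpha:.2f}, slope = {beta:.2f}")
```

Plotting `f` over the scatterplot gives the "scatterplot with regression line" view.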

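10 Correlation. Linear regression: $E[Y \mid X = x] = f(x) = \alpha + \beta x$. How well does this work? Coefficient of determination, with $f_i = f(x_i)$:
$$\underbrace{\sum_i (y_i - \bar{y})^2}_{SS_{tot}} = \underbrace{\sum_i (f_i - \bar{y})^2}_{SS_{reg}} + \underbrace{\sum_i (y_i - f_i)^2}_{SS_{res}}$$
$$R^2 = 1 - \frac{SS_{res}}{SS_{tot}}$$
- $SS_{tot}$ ~ variance of $Y$
- $SS_{reg}$ ~ variance of the regression line
- $SS_{res}$ ~ variance not explained by $f$
Linear relation good: $R^2 \approx 1$. Linear relation bad: $R^2 \approx 0$. Can do the same for non-linear $f$.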
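The sums of squares can be computed directly from a fitted line. A minimal sketch, assuming NumPy and a least-squares fit with intercept (for which the decomposition holds exactly); the data are simulated for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.uniform(0, 10, 100)
y = 2.0 + 0.5 * x + rng.normal(0, 1.0, 100)   # hypothetical noisy linear data

beta, alpha = np.polyfit(x, y, deg=1)
f = alpha + beta * x                           # fitted values f_i

ss_tot = np.sum((y - y.mean()) ** 2)           # total variation of Y
ss_reg = np.sum((f - y.mean()) ** 2)           # variation of the regression line
ss_res = np.sum((y - f) ** 2)                  # variation not explained by f

r_squared = 1.0 - ss_res / ss_tot
print(f"R^2 = {r_squared:.3f}")
```

For a non-linear `f` the same definition of `r_squared` can be used, but `ss_tot` no longer has to split exactly into `ss_reg + ss_res`.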


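11 Correlation. Linear regression: $E[Y \mid X = x] = f(x) = \alpha + \beta x$. How well does this work? Correlation coefficient. Pearson's correlation coefficient:
$$\rho = \frac{\mathrm{Cov}(X, Y)}{\sigma_X \sigma_Y}, \qquad \mathrm{Cov}(X, Y) = E\big[(X - E[X])(Y - E[Y])\big]$$
$$\hat{\rho} = r = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2}\, \sqrt{\sum_i (y_i - \bar{y})^2}}$$
$r^2 = R^2$: the coefficient of determination for the linear model.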
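The sample formula and its link to the coefficient of determination can be verified on simulated data. A sketch assuming NumPy; the data-generating slope and noise level are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.normal(size=200)
y = 0.6 * x + rng.normal(scale=0.8, size=200)   # hypothetical correlated data

# Pearson's r from the sample formula.
dx = x - x.mean()
dy = y - y.mean()
r = np.sum(dx * dy) / (np.sqrt(np.sum(dx ** 2)) * np.sqrt(np.sum(dy ** 2)))

# Coefficient of determination of the least-squares line.
beta, alpha = np.polyfit(x, y, deg=1)
residuals = y - (alpha + beta * x)
r_squared = 1.0 - np.sum(residuals ** 2) / np.sum(dy ** 2)

print(f"r = {r:.3f}, r^2 = {r ** 2:.3f}, R^2 = {r_squared:.3f}")
```

`np.corrcoef(x, y)[0, 1]` returns the same value as the hand-written `r`.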


12 Correlation coefficient. Pearson's correlation coefficient takes values in [-1, 1] and measures linear dependence.

13 Correlation coefficient. Pearson's correlation coefficient takes values in [-1, 1] and measures linear dependence. There are measures that capture non-linear correlations.
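Spearman's rank correlation is one such measure; it captures monotone (not arbitrary) non-linear relations. The slides do not name a specific measure, so this choice is an assumption. The sketch assumes NumPy and data without ties; the helper `ranks` is hypothetical.

```python
import numpy as np

def ranks(v):
    """Rank of each entry (0 = smallest); assumes there are no ties."""
    return np.argsort(np.argsort(v))

x = np.linspace(0, 5, 50)
y = np.exp(x)                      # monotone but strongly non-linear in x

pearson = np.corrcoef(x, y)[0, 1]
# Spearman's coefficient is Pearson's coefficient computed on the ranks.
spearman = np.corrcoef(ranks(x), ranks(y))[0, 1]

print(f"Pearson r = {pearson:.3f}, Spearman = {spearman:.3f}")
```

Pearson's coefficient stays clearly below 1 here because the relation is not linear, while the rank-based coefficient reports a perfect monotone relation.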

14 Bar and Boxplots: comparing distributions [Spitzer et al., Nature Methods 2014]

15 Data visualization and clustering
1. Tools for data visualization. How to:
- Visualize distributions and correlations: histogram, density estimate, 2d histogram, coefficient of determination, correlation coefficient, scatterplot, boxplot (and variations thereof)
- Visualize group structure: clustering
- Visualize data along genomic coordinates
- Visualize dependencies and interactions: graphs/networks and layouts
2. Examples


16 Clustering: Grouping data Measurement of Y Measurement of X

17 Clustering: Grouping data 1. Organize data into clusters 2. No prior information (unsupervised) 3. Need some notion of distance/similarity

18 Hierarchical clustering - Euclidean distance - Agglomerative scheme - Average linkage. Dendrogram: leaves are data points.

19 Hierarchical clustering: distance. Euclidean distance.
- A distance $D$ satisfies:
  $D(a, b) = D(b, a)$
  $D(a, b) = 0$ iff $a = b$
  $D(a, b) \ge 0$
  $D(a, b) \le D(a, c) + D(c, b)$
- Euclidean: $D(x, y) = \sqrt{\sum_i (x_i - y_i)^2}$

20 Hierarchical clustering: distance matrix. Euclidean distance $D(x, y) = \sqrt{\sum_i (x_i - y_i)^2}$. Heatmap of distances for all pairs of data points:
- Magnitude is color coded
- Matrix is symmetric
- No apparent order
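A distance matrix for a handful of points can be built with broadcasting. This is a sketch assuming NumPy; the five random 2-d points stand in for real measurements.

```python
import numpy as np

rng = np.random.default_rng(3)
points = rng.normal(size=(5, 2))          # hypothetical data: 5 points in 2-d

# diff[a, b, :] = points[a] - points[b]; D[a, b] is the Euclidean distance.
diff = points[:, None, :] - points[None, :, :]
D = np.sqrt((diff ** 2).sum(axis=-1))

# shortest_detour[a, b] = min over c of D[a, c] + D[c, b] (triangle inequality check).
shortest_detour = (D[:, :, None] + D[None, :, :]).min(axis=1)

print(np.round(D, 2))
```

`D` is the matrix that the heatmap color codes: it is symmetric with a zero diagonal, and its rows come in whatever order the data points were stored.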

21 Hierarchical clustering: linkage. Agglomerative scheme: start with each data point as its own cluster, then repeat until done: merge the two closest clusters.
- Need closeness between points: Euclidean distance.
- Need closeness between clusters (sets of points):
  - Average linkage: average distance over all pairs of points, one from each cluster.
  - Single linkage: take the closest pair.
  - Complete linkage: take the furthest pair.
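The agglomerative scheme can be written out naively as below. This is an illustrative sketch assuming NumPy, not an efficient or library-grade implementation; the function name `agglomerative`, its parameters, and the toy points are hypothetical. Stopping once `n_clusters` remain corresponds to cutting the dendrogram at that level.

```python
import numpy as np

def agglomerative(points, n_clusters, linkage="average"):
    """Merge the two closest clusters until n_clusters remain.

    Returns a list of clusters, each a list of row indices into points.
    """
    diff = points[:, None, :] - points[None, :, :]
    D = np.sqrt((diff ** 2).sum(axis=-1))      # pairwise Euclidean distances
    # Closeness between clusters: summarize all cross-cluster point distances.
    summarize = {"average": np.mean, "single": np.min, "complete": np.max}[linkage]

    clusters = [[i] for i in range(len(points))]   # every point starts alone
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = summarize(D[np.ix_(clusters[a], clusters[b])])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]    # merge b into a (b > a)
        del clusters[b]
    return clusters

# Two well-separated hypothetical groups of three points each.
points = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]], dtype=float)
for method in ("average", "single", "complete"):
    print(method, agglomerative(points, n_clusters=2, linkage=method))
```

On clearly separated groups all three linkage methods agree; they start to differ when clusters are elongated or overlap.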

22 Hierarchical clustering - Euclidean distance - Agglomerative scheme - Average linkage Dendrogram

23 Hierarchical clustering - Euclidean distance - Agglomerative scheme - Linkage method matters Average linkage Single linkage (nearest neighbor)

24 Hierarchical clustering. Distance matrix (heatmap) with ordered examples. Internal nodes (not all are highlighted). Subtrees can rotate around nodes, so the ordering of the leaves is only partially defined.

25 Hierarchical clustering. Distance matrix (heatmap) with ordered examples. Cutting the dendrogram defines clusters (here Cluster A and Cluster B), but it is often not clear how many to choose.

26 Prominent example: clustering gene expression data. Heatmap with samples on one axis and genes on the other; samples are annotated by group. [Gentleman et al. 2006]

27 Clustering: disclaimer. There are a lot more clustering methods:
- Partition clustering: no hierarchy, just disjoint clusters. Example: k-means.
- Model-based clustering: mixture distributions.
- Others.
Mixture distribution:
$$P(X, Y) = P(X, Y \mid \text{cluster } 1)\, P(\text{cluster } 1) + P(X, Y \mid \text{cluster } 2)\, P(\text{cluster } 2) + \ldots$$
$$P(X, Y) = \sum_z P(X, Y, Z = z)$$
where $Z$ is an unobserved cluster indicator random variable. For each $(x_i, y_i)$: find the most likely $z_i$. This is the clustering.
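The "find the most likely $z_i$" step can be sketched for a mixture whose parameters are already known. Everything concrete here is an assumption made for illustration: two isotropic 2-d Gaussian clusters, their means, a common standard deviation, and equal cluster probabilities. In practice these parameters would have to be estimated, which this sketch does not attempt.

```python
import numpy as np

def log_gauss_iso(points, mean, sigma):
    """Log density of an isotropic 2-d Gaussian with the given mean and sigma."""
    sq_dist = ((points - mean) ** 2).sum(axis=1)
    return -sq_dist / (2.0 * sigma ** 2) - np.log(2.0 * np.pi * sigma ** 2)

# Hypothetical, already-known mixture parameters.
means = np.array([[0.0, 0.0], [4.0, 4.0]])
sigma = 1.0
weights = np.array([0.5, 0.5])             # P(cluster 1), P(cluster 2)

points = np.array([[0.2, -0.1], [3.8, 4.3], [1.0, 1.2], [3.0, 2.9]])

# log P(x, y, Z = k) = log P(x, y | cluster k) + log P(cluster k)
log_joint = np.stack(
    [log_gauss_iso(points, m, sigma) for m in means], axis=1
) + np.log(weights)

# Most likely cluster indicator z_i for each point.
z = log_joint.argmax(axis=1)

# Posterior P(Z = k | x, y), normalized per point.
unnorm = np.exp(log_joint - log_joint.max(axis=1, keepdims=True))
posterior = unnorm / unnorm.sum(axis=1, keepdims=True)

print(z)
```

Unlike k-means, the posterior also says how confident each assignment is, which is useful for points that lie between clusters.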

28 Data visualization and clustering
1. Tools for data visualization. How to:
- Visualize distributions and correlations
- Visualize group structure: clustering. Distance, distance matrix, heatmap, dendrogram, linkage (single, complete, average), partition clustering, model-based clustering
- Visualize data along genomic coordinates
- Visualize dependencies and interactions: graphs/networks and layouts
2. Examples

29 Plotting data along linear genomic coordinates UCSC genome browser

30 Plotting data along linear genomic coordinates UCSC genome browser [Rosenbloom et al., NAR 2015]

31 Circular arrangement Circular visualization: circos Enrichment analysis [Saben et al., Placenta, 2013]

32 Circular arrangement Circular visualization: circos [Zhang et al. 2013]

33 Data visualization and clustering
1. Tools for data visualization. How to:
- Visualize distributions and correlations
- Visualize group structure: clustering
- Visualize data along genomic coordinates: UCSC browser, circular visualization
- Visualize dependencies and interactions: graphs/networks and layouts
2. Examples

34 Graphs. Vertices $v \in V$, edges $e \in E$; edges can be directed or undirected. $G = (V, E)$: graph. Graphs represent entities and relations between entities. Tree: acyclic and connected. [Gentleman et al. 2006] [dzone.com]

35 Graphs Tree: acyclic and connected [dzone.com] Directed: edges/arcs have direction
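The tree definition translates into a short check. This sketch uses only the Python standard library and relies on the fact that an undirected graph is a tree exactly when it is connected and has |V| - 1 edges; the function name `is_tree` and the vertex labels are hypothetical.

```python
from collections import deque

def is_tree(vertices, edges):
    """Return True if the undirected graph (V, E) is acyclic and connected."""
    if not vertices or len(edges) != len(vertices) - 1:
        return False
    adjacency = {v: set() for v in vertices}
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)
    # Breadth-first search from an arbitrary vertex to test connectivity.
    start = next(iter(vertices))
    seen = {start}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for w in adjacency[u] - seen:
            seen.add(w)
            queue.append(w)
    return len(seen) == len(vertices)

vertices = ["a", "b", "c", "d"]
print(is_tree(vertices, [("a", "b"), ("a", "c"), ("c", "d")]))   # a tree
print(is_tree(vertices, [("a", "b"), ("b", "c"), ("c", "a")]))   # cycle, d isolated
```

For a directed graph the same edge list would be stored with a direction, for example as (source, target) pairs, and only `adjacency[source]` would be updated.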

36 Graphs [dzone.com]

37 Rooted trees and DAGs. DAG: directed acyclic graph. Rooted tree: DAG where each node except the root has exactly one parent. [dzone.com] [cs.cornell.edu]

38 Rooted trees and DAGs. DAG: directed acyclic graph. Rooted tree: DAG where each node except the root has exactly one parent. Examples: Gene Ontology (heart development), phylogenetic tree. [dzone.com]
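Both definitions can be tested on an edge list of (parent, child) pairs. The sketch below uses Kahn's algorithm for the acyclicity check, which is one standard choice rather than anything prescribed by the slides; the function names and the toy node labels are hypothetical and only loosely modeled on an ontology where a term may have two parents.

```python
from collections import deque

def topological_order(nodes, edges):
    """Kahn's algorithm: a topological order, or None if there is a cycle."""
    indegree = {v: 0 for v in nodes}
    children = {v: [] for v in nodes}
    for parent, child in edges:
        children[parent].append(child)
        indegree[child] += 1
    queue = deque(v for v in nodes if indegree[v] == 0)
    order = []
    while queue:
        v = queue.popleft()
        order.append(v)
        for c in children[v]:
            indegree[c] -= 1
            if indegree[c] == 0:
                queue.append(c)
    return order if len(order) == len(nodes) else None

def is_dag(nodes, edges):
    return topological_order(nodes, edges) is not None

def is_rooted_tree(nodes, edges):
    """DAG with a single root and exactly one parent for every other node."""
    if not nodes or not is_dag(nodes, edges):
        return False
    n_parents = {v: 0 for v in nodes}
    for _, child in edges:
        n_parents[child] += 1
    counts = sorted(n_parents.values())
    return counts[0] == 0 and all(c == 1 for c in counts[1:])

nodes = ["root", "a", "b", "c"]
tree_edges = [("root", "a"), ("root", "b"), ("a", "c")]
dag_edges = tree_edges + [("b", "c")]          # c now has two parents
cyclic_edges = tree_edges + [("c", "root")]

print(is_dag(nodes, dag_edges), is_rooted_tree(nodes, dag_edges))
```

A phylogenetic tree would pass `is_rooted_tree`, while a structure in which a term can have several parents is a DAG but not a rooted tree.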

39 Plotting of graphs: layout. Same graph, three pictures [Gentleman et al. 2006]:
- dot: hierarchical
- neato: no edge crossing
- twopi: circular structure

40 Plotting of graphs: hairballs. Different layout algorithms applied to an interaction network. Nodes are genes, edges are interactions (Gene A, interaction, Gene B). Interactions are inferred by:
- experimental assay
- in-silico analyses

41 Data visualization and clustering
1. Tools for data visualization. How to:
- Visualize distributions and correlations
- Visualize group structure: clustering
- Visualize data along genomic coordinates
- Visualize dependencies and interactions
2. Examples

42 Quality control: Color Number of cells in a well: Handling problem [Gentleman et al. 2006]

43 Example figure Mean signal Fraction enhancers

44 Example figure

45 Example figure

46 Visualizing distributions: microarray probes Intensity stratified by G+C [Gentleman et al. 2006]
