MIC - Detecting Novel Associations in Large Data Sets by Nico Güttler, Andreas Ströhlein and Matt Huska
Outline Motivation Method Results Criticism Conclusions
Motivation - Goal Identify important, as-yet-undiscovered relationships in data sets with many variables Do so efficiently
Motivation
Reshef et al. 2011
MIC - Maximal Information Coefficient Measure of variable dependence Association between variable pairs (bivariate) Detects functional & non-functional dependence
MIC - Maximal Information Coefficient Functional relationships: MIC ≈ R² Range: 0 (statistical independence) to 1 (noiseless relationship) For linear relationships: MIC ≈ (Pearson correlation coefficient)²
MIC - Main Properties 1. Generality Provided sufficient sample size: detects a wide range of relationships Including non-functional types (e.g. superpositions of functions) 2. Equitability Similar scores for equally noisy relationships Regardless of relationship type
MIC - Generality & Equitability
Noise vs. Spearman Rank Correlation Noise: 1 − R²
Noise vs. MIC Score Noise: 1 − R²
Example - Pearson Correlation
Example - MIC
Calculating MIC - Central Idea If a relationship exists between two variables, then a grid can be drawn on the scatterplot of the two variables that partitions the data to encapsulate that relationship. Need to find the best: number of partitions (a.k.a. grid resolution) placement of the partitions
Scatterplots and Grids 2-variable plot Grid resolution Partition placements
Scoring Grids Resolution: MIC tries all resolutions (x, y) where x·y < n^0.6 Partitioning: For each resolution (x, y), MIC finds the grid partition placement with the highest mutual information Use an approximation algorithm to reduce the number of partition placements considered Mutual information: I(X; Y) = Σ_{x,y} p(x, y) log( p(x, y) / (p(x) p(y)) ) X, Y: random variables p(x, y): joint probability distribution function p(x), p(y): marginal probability distribution functions
Mutual information Probability of a box = # of data points in that box / total # of data points
Mutual information -0.00244 + -0.00912 + 0.0144 + -0.0231 + 0.0558 + -0.0223 + 0.0304 + -0.0336 + 0.0134 ≈ 0.0234
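The per-box mutual information computation on this slide can be sketched in Python (a minimal illustration, not the authors' MINE implementation; the function name is ours):

```python
import numpy as np

def grid_mutual_information(counts):
    """Mutual information (in bits) of a 2D grid of point counts.

    counts[i][j] = number of data points falling in grid cell (i, j);
    the probability of a box is its count divided by the total count.
    """
    counts = np.asarray(counts, dtype=float)
    p_xy = counts / counts.sum()           # joint distribution over cells
    p_x = p_xy.sum(axis=1, keepdims=True)  # row marginals
    p_y = p_xy.sum(axis=0, keepdims=True)  # column marginals
    nz = p_xy > 0                          # treat 0 * log 0 as 0
    return float((p_xy[nz] * np.log2(p_xy[nz] / (p_x @ p_y)[nz])).sum())
```

Independent cell counts give 0 bits; a perfectly diagonal 2×2 grid gives 1 bit.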
Characteristic matrix & Normalization Highest mutual information score for each resolution is stored in the characteristic matrix M(x, y) Different resolution grids have different maximum possible mutual information scores, so we normalize them: M(x, y) = I*(x, y) / log min(x, y) Resulting normalized values range over [0, 1]
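The normalization and the MIC itself (the maximum entry of the normalized characteristic matrix) can be sketched as follows; the dictionary representation of the matrix is our assumption, not the paper's data structure:

```python
import numpy as np

def mic_from_characteristic_matrix(best_mi, n):
    """MIC = maximum of the normalized characteristic matrix.

    best_mi maps a grid resolution (x, y) to the highest mutual
    information (in bits) found at that resolution; n is the sample
    size. Only resolutions with x * y < n ** 0.6 are considered,
    and each entry is normalized by log2(min(x, y)).
    """
    mic = 0.0
    for (x, y), mi in best_mi.items():
        if x * y >= n ** 0.6 or min(x, y) < 2:  # skip over-fine or degenerate grids
            continue
        mic = max(mic, mi / np.log2(min(x, y)))
    return mic
```

For example, with n = 100 only resolutions with x·y < 100^0.6 ≈ 15.8 contribute.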
Characteristic matrix M=
M as a Surface
Measures based on MIC We can calculate other interesting statistics using MIC and the characteristic matrix M: Maximum Asymmetry Score (MAS): deviation from monotonicity Minimum Cell Number (MCN): complexity measure - the minimum number of grid cells needed to achieve the MIC score Collection of statistics: MINE - Maximal Information-based Nonparametric Exploration
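MAS and MCN can both be read off the characteristic matrix; a sketch following the paper's definitions (the dictionary representation of M is our assumption):

```python
import math

def mas(M):
    """Maximum Asymmetry Score: largest |M[x,y] - M[y,x]| over resolutions."""
    return max(abs(v - M[(y, x)]) for (x, y), v in M.items() if (y, x) in M)

def mcn(M, eps=0.0):
    """Minimum Cell Number: log2 of the smallest grid size x*y whose
    normalized score is within eps of the MIC (the matrix maximum)."""
    mic = max(M.values())
    return math.log2(min(x * y for (x, y), v in M.items() if v >= (1 - eps) * mic))
```

A monotonic relationship scores similarly at resolutions (x, y) and (y, x), so MAS stays near 0; a complex relationship needs a fine grid to reach its MIC, so MCN is large.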
MINE statistics [table: Pearson correlation for each example relationship: 1.00, −0.09, 0.01, 0.61, −0.02, 0.00, −0.1]
MIC - Gene expression data Spellman data set from Monday MIC and MAS applied to time series gene expression data Method 1: MIC score of time vs. expression
Results - Reshef et al. 2011
MIC - Gene expression data Method 2: Calculated P-value for each MIC score by permuting one of the variables FDR controlled using Benjamini & Hochberg Resulting genes sorted by MAS scores (periodicity)
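The permutation p-value and FDR steps above can be sketched generically (here `stat` stands in for the MIC computation; permutation counts and seeds are illustrative, not the values used in the paper):

```python
import numpy as np

def permutation_pvalue(x, y, stat, n_perm=1000, seed=0):
    """P-value for stat(x, y) by permuting y, which breaks any association."""
    rng = np.random.default_rng(seed)
    observed = stat(x, y)
    hits = sum(stat(x, rng.permutation(y)) >= observed for _ in range(n_perm))
    return (hits + 1) / (n_perm + 1)   # add-one correction avoids p = 0

def benjamini_hochberg(pvals, alpha=0.05):
    """Indices of hypotheses rejected at FDR level alpha (BH step-up)."""
    p = np.asarray(pvals, dtype=float)
    m = len(p)
    order = np.argsort(p)
    below = p[order] <= alpha * np.arange(1, m + 1) / m
    if not below.any():
        return np.array([], dtype=int)
    k = np.nonzero(below)[0].max()     # largest i with p_(i) <= alpha * i / m
    return np.sort(order[:k + 1])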
Results
MIC on microarray data (B1, 104 genes)
Criticism of MIC Comment to Science (Simon and Tibshirani 2012) MIC was shown to have low power in comparison to another method, distance correlation (dcor) (Szekely, Rizzo, and Bakirov 2007) Simulated pairs of variables with varying amounts of noise added Power: Probability test will correctly reject H0 lower power = more false positives
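The power comparison can be reproduced in miniature by simulation (a generic sketch with a linear relationship plus Gaussian noise; Simon and Tibshirani plugged MIC, Pearson, and dcor in as `stat` and varied the relationship type):

```python
import numpy as np

def estimated_power(stat, noise, n=100, n_sim=200, alpha=0.05, seed=0):
    """Estimate power: the fraction of noisy datasets whose statistic
    exceeds the (1 - alpha) quantile of the statistic under independence."""
    rng = np.random.default_rng(seed)
    null = [stat(rng.uniform(size=n), rng.uniform(size=n)) for _ in range(n_sim)]
    cutoff = np.quantile(null, 1 - alpha)  # rejection threshold under H0
    hits = 0
    for _ in range(n_sim):
        x = rng.uniform(size=n)
        y = x + noise * rng.standard_normal(n)  # linear signal + noise
        hits += stat(x, y) > cutoff
    return hits / n_sim
```

With little noise any reasonable statistic rejects nearly every time; as noise grows, a lower-power statistic starts missing true relationships sooner.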
MIC vs. Pearson vs. dcor
Conclusion General tool for data exploration Not specific to certain data sets Find potential relationships of any kind Useful tool for identification and characterization of structure in data
References Reshef DN, Reshef YA, Finucane HK, Grossman SR, McVean G, Turnbaugh PJ, Lander ES, Mitzenmacher M, Sabeti PC. Detecting novel associations in large data sets. Science. 2011 Dec 16;334(6062):1518-24. PMID: 22174245. Simon N, Tibshirani R. Comment on "Detecting novel associations in large data sets" by Reshef et al., Science Dec 16, 2011. 2012. Székely GJ, Rizzo ML, Bakirov NK. Measuring and testing dependence by correlation of distances. Ann Stat. 2007;35(6):2769-94.