MIC - Detecting Novel Associations in Large Data Sets. by Nico Güttler, Andreas Ströhlein and Matt Huska




Outline
- Motivation
- Method
- Results
- Criticism
- Conclusions

Motivation - Goal
- Determine important undiscovered relationships in data sets with lots of variables
- Efficiently identify the important relationships

Motivation

Reshef et al. 2011

MIC - Maximal Information Coefficient
- Measure of variable dependence
- Association between variable pairs
- Univariate
- Detects functional & non-functional dependence

MIC - Maximal Information Coefficient
- Functional relationships: MIC ~= R^2
- Range: 0 (statistical independence) to 1 (no noise)
- For linear relationships: MIC ~= (Pearson correlation coefficient)^2

MIC - Main Properties
1. Generality
   - Provided sufficient sample size: detects a wide range of relationships
   - Including non-functional types (e.g. functional superposition)
2. Equitability
   - Similar scores to equally noisy relationships
   - Independent of relationship type

MIC - Generality & Equitability

Noise vs. Spearman Rank Correlation
Noise: 1 - R^2

Noise vs. MIC Score
Noise: 1 - R^2

Example - Pearson Correlation

Example - MIC

Calculating MIC - Central Idea
If a relationship exists between two variables, then a grid can be drawn on the scatterplot of the two variables that partitions the data to encapsulate that relationship.
Need to find the best:
- number of partitions (a.k.a. grid resolution)
- placement of the partitions

Scatterplots and Grids
[Figure panels: 2-variable plot, grid resolution, partition placements]

Scoring Grids
- Resolution: MIC tries all resolutions (x,y) where xy < n^0.6
- Partitioning: for each resolution (x,y), MIC finds the grid partition placement with the highest mutual information
- An approximation algorithm is used to reduce the number of partition placements considered
- Mutual information: $I(X;Y) = \sum_{x,y} p(x,y) \log \frac{p(x,y)}{p(x)\,p(y)}$
  - X, Y: random variables
  - p(x,y): joint probability distribution function
  - p(x), p(y): marginal probability distribution functions

Mutual information
Probability of a box = (# of data points in that box) / (total # of data points)

Mutual information -0.00244 + -0.00912 + 0.0144 + -0.0231 + 0.0558 + -0.0223 + 0.0304 + -0.0336 + 0.0134 = 0.153
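The sum over boxes can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function name `grid_mutual_information` is hypothetical, and the base-2 logarithm is an assumption, since the slides do not state which base is used.

```python
import math

def grid_mutual_information(counts):
    """Mutual information (in bits) of a grid given as rows of per-box counts."""
    total = sum(sum(row) for row in counts)
    row_totals = [sum(row) for row in counts]
    col_totals = [sum(col) for col in zip(*counts)]
    mi = 0.0
    for i, row in enumerate(counts):
        for j, count in enumerate(row):
            if count == 0:
                continue  # empty boxes contribute nothing (0 * log 0 is taken as 0)
            p_xy = count / total            # probability of this box
            p_row = row_totals[i] / total   # marginal probability of its row
            p_col = col_totals[j] / total   # marginal probability of its column
            mi += p_xy * math.log2(p_xy / (p_row * p_col))
    return mi
```

Each box contributes one term, and terms can be negative even though the total is never negative. A grid whose points all fall on the diagonal, such as `[[5, 0], [0, 5]]`, gives 1 bit, while a uniform grid such as `[[1, 1], [1, 1]]` gives 0.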

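Characteristic matrix & Normalization
- The highest mutual information score for each resolution is stored in the characteristic matrix M(x,y)
- Grids of different resolution have different maximum mutual information scores, so we need to normalize them: $M(x,y) = \frac{I^{*}(x,y)}{\log \min(x,y)}$, where $I^{*}(x,y)$ is the highest mutual information found for an x-by-y grid
- Resulting normalized values lie in the range [0, 1]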
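A rough sketch of how a characteristic matrix could be filled in is given below. It is deliberately simplified: instead of searching for the best partition placement, it uses equal-frequency bins on both axes, so it only gives a lower bound on the real scores. The names `equal_frequency_bins`, `binned_mutual_information` and `naive_mic` are hypothetical, and the normalization by log min(x,y) and the base-2 logarithm are assumptions taken from Reshef et al. 2011 rather than from these slides.

```python
import math

def equal_frequency_bins(values, n_bins):
    """Bin index (0..n_bins-1) for each value, assigned by rank; ties are split arbitrarily."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * n_bins // len(values)
    return bins

def binned_mutual_information(bx, by, nx, ny):
    """Mutual information (bits) between two lists of bin indices."""
    n = len(bx)
    joint = [[0] * ny for _ in range(nx)]
    for a, b in zip(bx, by):
        joint[a][b] += 1
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    mi = 0.0
    for a in range(nx):
        for b in range(ny):
            c = joint[a][b]
            if c:
                mi += (c / n) * math.log2(c * n / (px[a] * py[b]))
    return mi

def naive_mic(xs, ys, exponent=0.6):
    """Return (score, matrix) using equal-frequency grids only (a simplification)."""
    max_cells = len(xs) ** exponent   # resolutions are limited to x*y < n^0.6
    matrix = {}
    nx = 2
    while nx * 2 < max_cells:
        ny = 2
        while nx * ny < max_cells:
            bx = equal_frequency_bins(xs, nx)
            by = equal_frequency_bins(ys, ny)
            mi = binned_mutual_information(bx, by, nx, ny)
            matrix[(nx, ny)] = mi / math.log2(min(nx, ny))  # normalize to [0, 1]
            ny += 1
        nx += 1
    score = max(matrix.values()) if matrix else 0.0
    return score, matrix
```

For a noiseless linear relationship such as `ys = [2 * x + 1 for x in range(100)]` against `xs = list(range(100))`, the 2-by-2 grid already reaches a normalized score of 1. The published method replaces the fixed bins with an approximate search over partition placements.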

Characteristic matrix M=

M as a Surface

Measures based on MIC
We can calculate other interesting statistics using MIC and the characteristic matrix M:
- Maximum Asymmetry Score (MAS): deviation from monotonicity
- Minimum Cell Number (MCN): complexity measure; tells you the minimum number of partitions needed to get the MIC score
Collection of statistics: MINE - Maximal Information-based Nonparametric Exploration
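As a hedged sketch, these statistics can be read off a characteristic matrix stored as a dictionary keyed by resolution `(x, y)`. The formulas are assumptions based on the definitions in Reshef et al. 2011 (MAS as the largest absolute difference between M(x,y) and M(y,x), MCN as the smallest log2(xy) at which the score is within a tolerance `eps` of MIC); the function name `mine_statistics` and the dictionary layout are hypothetical.

```python
import math

def mine_statistics(matrix, eps=0.0):
    """Compute MIC, MAS and MCN from a dict {(x, y): normalized score}."""
    mic = max(matrix.values())
    # MAS: largest gap between a grid and its transposed resolution
    mas = max(
        (abs(score - matrix[(y, x)])
         for (x, y), score in matrix.items() if (y, x) in matrix),
        default=0.0,
    )
    # MCN: log2 of the fewest cells whose score reaches (1 - eps) * MIC
    mcn = min(
        math.log2(x * y)
        for (x, y), score in matrix.items()
        if score >= (1.0 - eps) * mic
    )
    return {"MIC": mic, "MAS": mas, "MCN": mcn}
```

For example, `mine_statistics({(2, 2): 0.5, (2, 3): 0.9, (3, 2): 0.4})` reports a MIC of 0.9, a MAS of 0.5 and an MCN of log2(6).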

MINE statistics
Pearson: 1.00, -0.09, 0.01, 0.61, -0.02, 0.00, -0.1

MIC - Gene expression data
- Spellman data set from Monday
- MIC and MAS applied to time series gene expression data
- Method 1: MIC score of time vs. expression

Results - Reshef et al. 2011

MIC - Gene expression data
Method 2:
- Calculated a P-value for each MIC score by permuting one of the variables
- FDR controlled using Benjamini & Hochberg
- Resulting genes sorted by MAS scores (periodicity)
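The two steps of Method 2 can be sketched as follows. This is an illustrative outline under stated assumptions, not the analysis actually run: `permutation_p_value` and `benjamini_hochberg` are hypothetical names, `statistic` stands for any association score (MIC in the slides), and the number of permutations and the "+1" correction in the P-value are choices made here.

```python
import random

def permutation_p_value(xs, ys, statistic, n_permutations=199, seed=0):
    """P-value for statistic(xs, ys), obtained by shuffling ys to break any association."""
    rng = random.Random(seed)
    observed = statistic(xs, ys)
    shuffled = list(ys)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        if statistic(xs, shuffled) >= observed:
            hits += 1
    # +1 in numerator and denominator keeps the estimate away from exactly 0
    return (1 + hits) / (1 + n_permutations)

def benjamini_hochberg(p_values, alpha=0.05):
    """Return a list of booleans: True where the hypothesis is rejected at FDR alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= alpha * rank / m:
            cutoff_rank = rank  # keep the largest rank that passes its threshold
    rejected = [False] * m
    for i in order[:cutoff_rank]:
        rejected[i] = True
    return rejected
```

In a gene expression setting, one P-value would be computed per gene (time vs. expression) and the whole list passed to `benjamini_hochberg`; the genes that survive could then be sorted by a second statistic such as MAS.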

Results

MIC on microarray data (B1, 104 genes)

Criticism of MIC
- Comment to Science (Simon and Tibshirani 2012)
- MIC was shown to have low power in comparison to another method, distance correlation (dcor) (Szekely, Rizzo, and Bakirov 2007)
- Comparison based on simulated pairs of variables with varying amounts of noise added
- Power: probability that the test correctly rejects H0 when H0 is false
- Lower power = more false negatives (true associations that go undetected)
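For reference, the sample distance correlation used as the comparison method can be sketched in plain Python. This is a minimal O(n^2) version of the standard double-centering formula for one-dimensional variables; the function names are hypothetical and it is not the code used in the comment.

```python
import math

def _centered_distances(values):
    """Pairwise distance matrix with row, column and grand means removed."""
    n = len(values)
    d = [[abs(a - b) for b in values] for a in values]
    row_means = [sum(row) / n for row in d]
    grand_mean = sum(row_means) / n
    # The distance matrix is symmetric, so column means equal row means.
    return [[d[i][j] - row_means[i] - row_means[j] + grand_mean
             for j in range(n)] for i in range(n)]

def distance_correlation(xs, ys):
    """Sample distance correlation between two equal-length numeric lists."""
    n = len(xs)
    a = _centered_distances(xs)
    b = _centered_distances(ys)

    def mean_product(p, q):
        return sum(p[i][j] * q[i][j] for i in range(n) for j in range(n)) / (n * n)

    dcov2 = mean_product(a, b)
    dvar_x = mean_product(a, a)
    dvar_y = mean_product(b, b)
    if dvar_x == 0 or dvar_y == 0:
        return 0.0  # a constant variable carries no dependence information
    return math.sqrt(max(dcov2, 0.0) / math.sqrt(dvar_x * dvar_y))
```

Unlike the Pearson coefficient, this score is non-zero for a symmetric parabola such as `xs = [-2, -1, 0, 1, 2]`, `ys = [4, 1, 0, 1, 4]`, and it equals 1 for an exact linear relationship. Power could then be estimated by repeating a permutation test on many simulated noisy data sets and counting how often H0 is rejected.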

MIC vs. Pearson vs. dcor

Conclusion
- General tool for data exploration
- Not specific to certain data sets
- Find potential relationships of any kind
- Useful tool for identification and characterization of structure in data

References
Reshef DN, Reshef YA, Finucane HK, Grossman SR, McVean G, Turnbaugh PJ, Lander ES, Mitzenmacher M, Sabeti PC. Detecting novel associations in large data sets. Science. 2011 Dec 16;334(6062):1518-24. PubMed PMID: 22174245.
Simon N, Tibshirani R. Comment on "Detecting novel associations in large data sets" by Reshef et al., Science Dec 16, 2011. 2012.