Monte Carlo testing with Big Data



Monte Carlo testing with Big Data
Patrick Rubin-Delanchy, University of Bristol & Heilbronn Institute for Mathematical Research
Joint work with: Axel Gandy (Imperial College London)
With contributions from: Nick Heard (Imperial College London), Dan Lawson (University of Bristol), Niall Adams (Imperial College London), Melissa Turcotte (Los Alamos National Laboratory)
8th January 2015

Michael Jordan: "When you have large amounts of data, your appetite for hypotheses tends to get even larger." (IEEE Spectrum, 20th October 2014.)

His point is about the risk of spurious discoveries when the number of possible hypotheses grows exponentially. Whilst our focus is mostly on the algorithmic challenge of implementing many tests, the theory of multiple testing is at the heart of our approach.

The hypothesis testing framework

The null hypothesis, H0, gives a probabilistic description of the data if there is no effect. The alternative, H1, describes the case where the effect is present. We have a test statistic t which would tend to be small under H0 and large under H1. The p-value of the test is

p = P(T ≥ t),

where T is a replicate of t under H0. It is straightforward to prove that p is uniformly distributed on [0, 1] under H0 (if T is absolutely continuous). A low p-value, e.g. 1/100, is considered evidence in favour of the alternative.
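The uniformity claim is easy to check by simulation. The sketch below is my own illustration, not from the talk: it repeatedly draws a statistic and its null replicates from the same distribution, computes p = P(T ≥ t), and confirms that the resulting p-values look uniform on [0, 1].

```python
import numpy as np

rng = np.random.default_rng(0)

def null_p_value(rng, n_null=10_000):
    """Draw t and its null replicates T from the same (null) distribution
    and return p = P(T >= t), estimated from the replicates."""
    t = rng.normal()                  # observed statistic, H0 true
    T = rng.normal(size=n_null)      # replicates of t under H0
    return np.mean(T >= t)

ps = np.array([null_p_value(rng) for _ in range(2_000)])
# A uniform[0, 1] sample has mean 1/2 and about 10% of values below 0.1.
print(round(float(ps.mean()), 2), round(float(np.mean(ps < 0.1)), 2))
```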

Monte Carlo test

Often P(T ≥ t) cannot be calculated exactly. A typical solution: simulate replicates of T under the null hypothesis, T_1, ..., T_N, and estimate

p̂ = (1/N) Σ_{j=1}^N I(T_j ≥ t),

where I is the indicator function and N is large.
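A minimal implementation of this estimator might look as follows; the function name and the exponential toy example are my own choices for illustration.

```python
import numpy as np

def monte_carlo_p(t, simulate_null, N, rng):
    """Estimate p = P(T >= t) from N simulated null replicates of T."""
    T = np.array([simulate_null(rng) for _ in range(N)])
    return np.mean(T >= t)           # p-hat = (1/N) * sum of I(T_j >= t)

rng = np.random.default_rng(1)
# Toy example: a statistic with a standard exponential null distribution,
# observed value t = 3, so the exact p-value is exp(-3) ≈ 0.0498.
p_hat = monte_carlo_p(3.0, lambda r: r.exponential(), N=100_000, rng=rng)
print(round(float(p_hat), 3))
```

With N = 100,000 the estimate is typically within about 0.001 of the exact value.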

Permutation test

Let A and B denote two groups of data. Under H0, the data are exchangeable: the A/B labelling is arbitrary. Let t be some measure of discrepancy between A and B (e.g. the difference in means). Simulate T_j by randomly relabelling the data and recomputing the discrepancy. Estimate

p̂ = (1/N) Σ_{j=1}^N I(T_j ≥ t).
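The recipe above can be written directly. This sketch is my own illustration, using the absolute difference in means as the discrepancy and two hypothetical Gaussian samples.

```python
import numpy as np

def permutation_p(a, b, N=10_000, seed=0):
    """Permutation p-value with |mean(A) - mean(B)| as the discrepancy t."""
    rng = np.random.default_rng(seed)
    pooled = np.concatenate([a, b])
    t = abs(a.mean() - b.mean())
    count = 0
    for _ in range(N):
        perm = rng.permutation(pooled)     # random relabelling of the data
        T = abs(perm[:len(a)].mean() - perm[len(a):].mean())
        count += T >= t
    return count / N

rng = np.random.default_rng(42)
a = rng.normal(0.0, 1.0, size=50)          # group A
b = rng.normal(1.5, 1.0, size=50)          # group B, mean shifted by 1.5
p_ab = permutation_p(a, b)
print(p_ab)
```

With a shift this large relative to the noise, the estimated p-value is far below conventional thresholds.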

The permutation test essentially works by combining two important ideas: exchangeability and conditioning.

1. If the false positive rate is controlled at a level α (often 5%) given X = x, for all x, then it is also controlled marginally.
2. The simulated T_j, in the permutation test, come from conditioning on the observed values, but not on their labelling as A or B.

More generally, my view is that exchangeability and other sorts of stochastic orders are key to robust inference on Big Data.

Running a large number of Monte Carlo tests

Some possible principles:

1. Make each individual test faster; see e.g. Gandy (2009).
2. View the task as a resource-allocation problem: define an error and try to minimise its mean square.

Choosing the error without taking into account the special nature of the problem, e.g. using the Euclidean distance from the true vector of p-values, doesn't make much sense. For example, we could easily end up investing most of the effort on p-values around 0.5, which are typically not of interest. I will argue that for some Big Data problems it is natural to define the error on the basis of a method to combine p-values.


Combining p-values

Consider the global hypothesis test: H0: all null hypotheses hold, and the p-values are independent and uniformly distributed; versus H1: not all null hypotheses hold. Combining p-values is testing the p-values, rather than the original data. Note that we are using the p-values together to make one, global, decision. We are ignoring the local errors of accepting or rejecting each individual hypothesis. This can be reasonable if:

1. The data are heterogeneous: there isn't an obvious way to combine them into an overall test statistic.
2. There is too much data: a divide-and-conquer strategy is necessary.

Some well-known test statistics and their distributions under H0 (when known):

1. Fisher's method: −2 Σ log(p_i) ~ χ²_{2m} (Mosteller and Fisher, 1948).
2. Stouffer's score: Σ Φ⁻¹(1 − p_i)/√m ~ normal(0, 1) (Stouffer et al., 1949).
3. Simes's test: min_i {p_(i) m/i} ~ uniform[0, 1] (Simes, 1986).
4. Donoho's statistic (Donoho and Jin, 2004): max_{1 ≤ i ≤ α₀m} √m {i/m − p_(i)} / √{p_(i)(1 − p_(i))}.
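For concreteness, here is an illustrative standard-library implementation of the first three combiners (Donoho's higher-criticism statistic is omitted); the function names are my own.

```python
import math
from statistics import NormalDist

def fisher_p(pvals):
    """Fisher: -2 * sum(log p_i) ~ chi-squared with 2m d.f. under H0."""
    m = len(pvals)
    x = -2.0 * sum(math.log(p) for p in pvals)
    # The chi-squared survival function has a closed form for even d.f. 2m.
    return math.exp(-x / 2) * sum((x / 2) ** j / math.factorial(j) for j in range(m))

def stouffer_p(pvals):
    """Stouffer: sum(Phi^{-1}(1 - p_i)) / sqrt(m) ~ normal(0, 1) under H0."""
    nd = NormalDist()
    z = sum(nd.inv_cdf(1.0 - p) for p in pvals) / math.sqrt(len(pvals))
    return 1.0 - nd.cdf(z)

def simes_stat(pvals):
    """Simes: min over i of p_(i) * m / i, uniform[0, 1] under H0."""
    m = len(pvals)
    return min(p * m / (i + 1) for i, p in enumerate(sorted(pvals)))

pvals = [0.001, 0.2, 0.5, 0.8]   # one small p-value among four
print(fisher_p(pvals), stouffer_p(pvals), simes_stat(pvals))
```

Note how the combiners disagree on this input: Fisher and Simes are driven by the single small p-value, while Stouffer, which averages the z-scores, is less sensitive to it.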

Probabilistic framework

1. Let p = (p_1, ..., p_m)ᵀ be the vector of true (and unobservable) p-values.
2. We can construct any number of independent Bernoulli variables with success probability p_i, for i = 1, ..., m.
3. Let f be the function used to combine p-values, and let var_N{f(p̂)} denote the variance of N^{1/2}{f(p̂) − f(p)}, where N is the total number of samples.
4. We want to adaptively allocate the number of samples per test in order to minimise var_N{f(p̂)}.

Potential gains over even allocation

For tests that focus attention on one p-value, e.g. min_i{p_(i) m/i} (Simes), we can reduce the asymptotic variance by a factor of m (the number of tests). This is also true for some smooth functions of the p-values, including Fisher and Stouffer:

Lemma. If there exists i ∈ {1, ..., m} such that

min_{k ≠ i} sup_{x ∈ [0,1]^m} |∂_i f(x)| / |∂_k f(x)| = ∞,

then

sup_{p ∈ (0,1)^m} var{f(p̂_naïve)} / var{f(p̂_opt)} = m.

Outline of algorithm

For known p and f we can calculate the optimal asymptotic allocation. E.g. with Fisher's method, p_i should receive a share of the samples proportional to {p_i/(1 − p_i)}^{−1/2}, so that smaller p-values receive more effort. In batches:

1. Decide how to allocate samples amongst the m streams using the current p̂.
2. Sample test replicates.
3. Update p̂.

Various optimisations are possible (and necessary), e.g. how to estimate p, how to estimate the allocation from p, and whether to try to correct for previously misallocated effort. The main point is that our algorithm seems to achieve the optimal asymptotic variance.
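The batch loop can be sketched as follows. This is my own illustration, not the authors' code: it assumes Fisher's method, splits each batch proportionally to √{(1 − p̂_i)/p̂_i} (my reading of the allocation rule, putting more effort on small p-values), and uses pseudo-counts to keep early estimates away from 0 and 1. The batch size and helper names are invented for the example.

```python
import math
import random

def adaptive_mc(tests, total, batch=1000, seed=0):
    """Adaptively allocate Monte Carlo samples across m test streams (sketch).

    tests[i]() returns one Bernoulli draw: True when a fresh null replicate
    exceeds the observed statistic t_i, so it has success probability p_i.
    """
    random.seed(seed)
    m = len(tests)
    hits = [1] * m   # pseudo-counts keep p_hat strictly inside (0, 1)
    n = [2] * m
    spent = 0
    while spent < total:
        p_hat = [hits[i] / n[i] for i in range(m)]
        w = [math.sqrt((1 - p) / p) for p in p_hat]   # allocation weights
        s = sum(w)
        for i in range(m):
            k = max(1, round(batch * w[i] / s))       # this stream's share
            hits[i] += sum(tests[i]() for _ in range(k))
            n[i] += k
            spent += k
    return [hits[i] / n[i] for i in range(m)]

# Hypothetical example: three streams whose true p-values are known.
truth = [0.02, 0.3, 0.5]
tests = [(lambda p=p: random.random() < p) for p in truth]
est = adaptive_mc(tests, total=60_000)
print([round(e, 3) for e in est])
```

Running this, the small-p stream ends up with the bulk of the 60,000 samples, mirroring the sample counts reported for the most and least significant edges later in the talk.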


Cyber-security example: change detection in Netflow

Traffic from one computer on Imperial College's network (with thanks to Andy Thomas) over a day. The data have an artificial changepoint where the user knowingly changed behaviour. We split the computer's traffic by edge (the other IP address), bin the data per hour, and throw away any edge with fewer than three bins. This results in approximately 100 time series.

Cyber-security example: change detection in Netflow (cont'd)

1. For any proposed changepoint, compute the absolute difference between the number of flows before and after the changepoint on each edge, resulting in a statistic t_i.
2. For each edge:
   a. Randomly permute the binned data.
   b. Recompute the absolute difference between the number of flows, resulting in a simulated statistic T_ij.
   c. Get a running estimate of the changepoint p-value for that edge, p̂_i.
3. Use our algorithm to combine the p-values using Fisher's method.
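Step 2 for a single edge might be sketched as follows; the bin values, function name and add-one correction (which keeps the estimated p-value away from exactly zero) are my own illustrative choices.

```python
import random

def edge_p_value(bins, cp, N=2_000, seed=0):
    """Permutation p-value for a changepoint at index cp in one edge's
    hourly flow counts. Statistic: |flows before - flows after|."""
    random.seed(seed)
    t = abs(sum(bins[:cp]) - sum(bins[cp:]))
    count = 0
    for _ in range(N):
        perm = bins[:]
        random.shuffle(perm)        # bins are exchangeable under 'no change'
        count += abs(sum(perm[:cp]) - sum(perm[cp:])) >= t
    # Add-one correction so the estimated p-value is never exactly zero.
    return (count + 1) / (N + 1)

# Hypothetical edge whose flow counts jump after hour 12.
bins = [5, 6, 4, 5, 6, 5, 4, 6, 5, 5, 6, 4,
        20, 22, 19, 21, 20, 22, 21, 20, 19, 21, 22, 20]
print(edge_p_value(bins, cp=12))
```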

[Figure: overall p-value in favour of a changepoint, plotted against time of day (0–24 h), with the p-value axis on a log scale from 1e−07 to 1e−01.]

[Figure: the five most significant edges, plotted against time of day (0–24 h). Samples taken: 15039, 14767, 11598, 7985, 6931.]

[Figure: the five least significant edges, plotted against time of day (0–24 h). Samples taken: 55, 56, 58, 50, 61.]

Other applications in Big Data

Detecting changes in user authentication graphs (joint work with Melissa Turcotte and Los Alamos National Laboratory). Hypothesis tests on networks (e.g. finding link predictors, scoring a graph motif). Information flow (e.g. correlations in timing, matching words or phrases). Generic model-checking and anomaly detection. And much more...

Conclusion

The proposed procedure has two advantages:

1. It focusses simulation effort on the interesting hypotheses.
2. It gives a running estimate of the overall significance of the experiment, guarding against multiple testing.

Some questions: how to integrate with a search (e.g. for a changepoint)? Can this procedure arrive at significance in fewer samples than a theoretically optimal test?

References

David Donoho and Jiashun Jin. Higher criticism for detecting sparse heterogeneous mixtures. Annals of Statistics, 32(3):962–994, 2004.
Axel Gandy. Sequential implementation of Monte Carlo tests with uniformly bounded resampling risk. Journal of the American Statistical Association, 104(488):1504–1511, 2009.
Michael I. Jordan. Machine-learning maestro Michael Jordan on the delusions of Big Data and other huge engineering efforts. IEEE Spectrum, October 2014. http://spectrum.ieee.org/robotics/artificial-intelligence/machinelearning-maestromichael-jordan-on-the-delusions-of-big-data-and-other-huge-engineering-efforts. Accessed: 2015-01-03.

Frederick Mosteller and R. A. Fisher. Questions and answers. The American Statistician, 2(5):30–31, 1948.
R. John Simes. An improved Bonferroni procedure for multiple tests of significance. Biometrika, 73(3):751–754, 1986.
Samuel A. Stouffer, Edward A. Suchman, Leland C. DeVinney, Shirley A. Star, and Robin M. Williams Jr. The American Soldier: Adjustment During Army Life (Studies in Social Psychology in World War II, Vol. 1). Princeton University Press, 1949.