Finding statistical patterns in Big Data



Finding statistical patterns in Big Data. Patrick Rubin-Delanchy, University of Bristol & Heilbronn Institute for Mathematical Research. IAS Research Workshop: Data science for the real world (workshop 1), 1st May 2015.

This talk is about hypothesis testing, feature extraction and anomaly detection for problems involving large amounts of data. Computational issues are largely ignored; instead, the focus is on some of the pitfalls of hypothesis testing. The solutions presented are simple. The recommendations are meant to help exploratory research; I'm not taking any position on publishing standards.


Pitfalls of hypothesis testing with Big Data

This talk focusses on two pervasive issues:
1. Hypothesis testing when the model is wrong
2. Multiple testing

The cyber-security application

Network flow data at Los Alamos National Laboratory: 30 GB/day.

Typical attack pattern:
A. Opportunistic infection
B. Network traversal
C. Data exfiltration

Figure: Network traversal. Source: Neil et al. (2013)

The cyber-security application

The global cost of cybercrime is estimated at $400 billion (CSIS, 2014). A botnet behind a third of the spam sent in 2010 earned about $2.7 million; spam prevention cost more than $1 billion (Anderson et al., 2013).

UK National Security Strategy Priority Risks (Cabinet Office, 2010):
- International terrorism affecting the UK or its interests, including a chemical, biological, radiological or nuclear attack by terrorists; and/or a significant increase in the levels of terrorism relating to Northern Ireland.
- Hostile attacks upon UK cyber space by other states and large scale cyber crime.
- A major accident or natural hazard which requires a national response, such as severe coastal flooding affecting three or more regions of the UK, or an influenza pandemic.
- An international military crisis between states, drawing in the UK, and its allies as well as other states and non-state actors.

Hypothesis testing framework

1. Null hypothesis H0, e.g., the drug has no effect; any difference between the two groups is due to chance
2. Alternative hypothesis H1, e.g., the drug has a (positive) effect
3. Test statistic T, e.g., difference in treatment outcomes
4. P-value: p = P0(T' ≥ T), where P0 is the distribution of T under H0 and T' is a replicate of T under H0
5. We reject the null hypothesis when p is small, e.g., less than 5%
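Step 4 can be made concrete with simulation: when we can sample from P0, the p-value p = P0(T' ≥ T) can be estimated by drawing replicates of T under H0. A minimal Python sketch (an illustration added here, not from the talk; the names and the toy statistic, the mean of 20 standard-normal observations, are my own):

```python
import random
import statistics

def monte_carlo_p_value(t_obs, simulate_t_under_h0, num_draws=9999, seed=0):
    """Estimate p = P_0(T' >= T) from num_draws replicates of T under H_0."""
    rng = random.Random(seed)
    exceed = sum(simulate_t_under_h0(rng) >= t_obs for _ in range(num_draws))
    return exceed / num_draws

# Toy null distribution: the statistic is the mean of 20 N(0, 1) observations.
def mean_of_normals(rng, n=20):
    return statistics.mean(rng.gauss(0, 1) for _ in range(n))
```

An observed value near the null's centre gives p near 0.5; a value far into the tail gives p near 0.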

Hypothesis testing with the wrong model

Example hypothesis test:
H0: the data are Gaussian [FALSE], and µ1 = µ2 [TRUE]
H1: the data are Gaussian [FALSE], and µ1 > µ2 [FALSE]

It's right to reject the null hypothesis, but wrong to accept the alternative. In practice this seems to lead to p-values with U-shaped distributions.


Example: two-sample t-test

G1      G2
10^1    10^3
10^8    10^6
10^4    10^6
10^1    10^3
10^6    10^6

Difference: T = Ḡ1 − Ḡ2 ≈ 3.39 × 10^7
t-statistic: (Ḡ1 − Ḡ2) / (s12 √(1/n1 + 1/n2)) ≈ 2.03
P-value: p ≈ 0.03
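The statistic here is the standard pooled-variance two-sample t statistic, T = (Ḡ1 − Ḡ2) / (s12 √(1/n1 + 1/n2)). A minimal Python sketch (an illustration added here; the function name is mine):

```python
import math
import statistics

def pooled_t_statistic(g1, g2):
    """Two-sample t statistic with pooled standard deviation s12."""
    n1, n2 = len(g1), len(g2)
    # Pooled variance with n1 + n2 - 2 degrees of freedom.
    pooled_var = ((n1 - 1) * statistics.variance(g1) +
                  (n2 - 1) * statistics.variance(g2)) / (n1 + n2 - 2)
    return ((statistics.mean(g1) - statistics.mean(g2)) /
            math.sqrt(pooled_var * (1 / n1 + 1 / n2)))
```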

Example: two-sample t-test (continued)

The t-test assumes:
1. H0: the data are independent and Gaussian, with the same mean and variance
2. H1: µ1 > µ2

Although we might correctly reject H0, we don't know if we are rejecting:
a) the data are Gaussian
b) the data have the same mean
c) the data have the same variance

Example: two-sample t-test (continued)

A vector X1, ..., Xn is exchangeable if its joint distribution is the same as that of Xσ(1), ..., Xσ(n) for any permutation σ. Instead of assuming a Gaussian model under H0, assume:
1. H0: the data are exchangeable
2. H1: the data are not exchangeable; µ1 > µ2

We can then think of T as a random draw from T1, ..., TM, where Ti is the test statistic computed for the ith permutation of G1 and G2. (Formally, we are conditioning on a sufficient statistic for the unknowns.)


Example: two-sample t-test (a permutation resample)

G1      G2
10^6    10^3
10^6    10^6
10^8    10^1
10^6    10^4
10^1    10^3

Resampled difference: T' = Ḡ1 − Ḡ2 ≈ 9.95 × 10^6
Resampled statistic: (Ḡ1 − Ḡ2) / (s12 √(1/n1 + 1/n2)) ≈ 0.53

Example: two-sample t-test (another permutation resample)

G1      G2
10^1    10^6
10^4    10^8
10^1    10^6
10^6    10^6
10^3    10^3

Resampled difference: T' ≈ 3.34 × 10^7
Resampled statistic: ≈ 2.07

Example: two-sample t-test (permutation p-value)

Observed difference: T = Ḡ1 − Ḡ2 ≈ 3.39 × 10^7; observed statistic ≈ 2.03

P-value (permutation-based): p̂ = (1/(M+1)) Σ_{i=0}^{M} I(Ti ≥ T) ≈ 0.21, where T0 = T.
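The whole permutation test fits in a few lines of Python (an illustration added here, not the talk's code; the statistic is taken to be the difference in means, and the names are my own):

```python
import random
import statistics

def permutation_p_value(g1, g2, num_resamples=999, seed=0):
    """Permutation p-value p_hat = (1/(M+1)) * sum_{i=0}^{M} I(T_i >= T),
    where T_0 = T is the observed statistic and T_1, ..., T_M are computed
    on random relabellings of the pooled data."""
    rng = random.Random(seed)
    pooled = list(g1) + list(g2)
    n1 = len(g1)
    def diff_in_means(values):
        return statistics.mean(values[:n1]) - statistics.mean(values[n1:])
    t_obs = diff_in_means(pooled)
    # Counting T_0 = T itself means p_hat is at least 1/(M+1), never zero.
    count = 1
    for _ in range(num_resamples):
        rng.shuffle(pooled)
        if diff_in_means(pooled) >= t_obs:
            count += 1
    return count / (num_resamples + 1)
```

Counting the observed statistic as its own permutation keeps the estimated p-value away from exactly zero, which is what the formula with T0 = T requires.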

Reasons to consider a non-parametric approach (i.e., reasons modelling could be hard):
1. Visualisation and curation are difficult (e.g. for logistical or privacy reasons)
2. The same analytic is to be used on different data sources
3. A lack of domain expertise
4. The data are complicated objects, e.g. graphs

Reasons not to:
1. Sometimes a non-parametric approach is not available
2. Model-based approaches can have greater power
3. There is a question of how to balance computational effort against simulation error


Multiple testing

Michael Jordan: "When you have large amounts of data, your appetite for hypotheses tends to get even larger." (IEEE Spectrum, 20th October 2014). In fact, the number of hypotheses tested often grows much faster than the data.

The basic problem: as the number of tests gets large, the probability of finding a significant result becomes very high.
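To make the basic problem concrete: with n independent tests, each at level α, the chance of at least one false positive among true nulls is 1 − (1 − α)^n. A quick check in Python (an illustration added here):

```python
def prob_at_least_one_false_positive(n, alpha=0.05):
    """P(some test among n independent level-alpha tests rejects a true null)."""
    return 1 - (1 - alpha) ** n

# At alpha = 0.05: n = 10 gives about 0.40, n = 100 about 0.99,
# and by n = 1000 a "significant" result is a near certainty.
```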

Example: spurious correlations

Figure: Spurious correlations. Source: http://www.tylervigen.com

Two approaches to multiple testing

Suppose we have p-values p1, ..., pn. The two canonical tasks are:
1. sub-select a set for further analysis
2. combine the p-values into one overall score of significance

Sub-selection

Define the false discovery rate (Benjamini and Hochberg, 1995):
Q = 0 if no hypothesis is rejected; otherwise Q = (#incorrect rejections) / (#total rejections)

Benjamini and Hochberg (1995) propose that Q is the quantity we want to control:
1. Let p(1) ≤ ... ≤ p(n) denote the ordered p-values
2. Let k be the largest i such that p(i) ≤ (i/n) q
3. Reject the hypotheses corresponding to p(1), ..., p(k)
4. Then, if the p-values corresponding to the true null hypotheses are independent, E(Q) ≤ q

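The step-up procedure reads off directly as code. A minimal Python sketch (an illustration added here; the function name is mine):

```python
def benjamini_hochberg(p_values, q=0.05):
    """Benjamini-Hochberg step-up procedure: return the indices of the
    rejected hypotheses, i.e. those with the k smallest p-values, where k
    is the largest i with p_(i) <= (i / n) * q."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / n * q:
            k = rank
    return sorted(order[:k])
```

Note the "step-up" character: a p-value above its own threshold can still be rejected if a larger one passes, which is why k is the largest passing rank rather than the first failure.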

Combining p-values I

In the second approach, we consider the joint hypothesis test:
H0: all of the null hypotheses hold
H1: at least one alternative holds

One method for combining p-values:
1. Let π = min_{i = 1, ..., n} { n p(i) / i }
2. Then, if the p-values are independent under H0 (Simes, 1986), π ~ uniform[0, 1]
3. Furthermore, if the p-values are positively dependent (Sarkar and Chang, 1997), π ≥st uniform[0, 1], where ≥st denotes the usual stochastic order. In statistical terminology, rejecting on the basis of π is conservative.

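The Simes combination is a one-liner over the ordered p-values. A minimal Python sketch (an illustration added here; the function name is mine):

```python
def simes_combined_p(p_values):
    """Simes combination pi = min_i { n * p_(i) / i } over the ordered
    p-values p_(1) <= ... <= p_(n)."""
    n = len(p_values)
    return min(n * p / i for i, p in enumerate(sorted(p_values), start=1))
```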

Combining p-values II

Consider the more specific needles-in-a-haystack scenario:
1. n very large
2. Most of the tests are expected to have no signal:
H0: all of the null hypotheses hold
H1: a vanishing proportion of the alternatives hold

Let π = max_{0 < α ≤ α0} √n [ (fraction significant at α) − α ] / √(α(1 − α))

Then, under sparse conditions set out in Donoho and Jin (2004), π will detect the alternative whenever it is asymptotically theoretically possible.

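Since the maximum over α is attained at one of the observed p-values, π can be computed in a single scan of the sorted p-values. A Python sketch (an illustration added here; I am assuming the usual higher-criticism normalisation √n from Donoho and Jin (2004), and the function name is mine):

```python
import math

def higher_criticism(p_values, alpha0=0.5):
    """Higher-criticism statistic: the maximum over alpha in (0, alpha0] of
    sqrt(n) * ((fraction of p-values <= alpha) - alpha) / sqrt(alpha * (1 - alpha)).
    Scans only the observed p-values, where the maximum is attained.
    Returns -inf if no p-value falls in (0, alpha0]."""
    n = len(p_values)
    hc = -math.inf
    for i, p in enumerate(sorted(p_values), start=1):
        if 0.0 < p <= alpha0:
            hc = max(hc, math.sqrt(n) * (i / n - p) / math.sqrt(p * (1 - p)))
    return hc
```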

Conclusion

1. We've advertised a few simple techniques that can help with hypothesis testing at scale
2. Computational issues have been largely ignored, e.g. the permutation test is more effort (but we can control that)
3. Some of the concepts touched upon, e.g. exchangeability, dependence and stochastic orders, have a much deeper theory that is possibly very relevant to the theory of Big Data

References

Anderson, R., Barton, C., Böhme, R., Clayton, R., Van Eeten, M. J., Levi, M., Moore, T., and Savage, S. (2013). Measuring the cost of cybercrime. In The Economics of Information Security and Privacy, pages 265-300. Springer.

Benjamini, Y. and Hochberg, Y. (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society, Series B (Methodological), pages 289-300.

Cabinet Office (2010). A Strong Britain in an Age of Uncertainty: The National Security Strategy.

Center for Strategic and International Studies (2014). Net Losses: Estimating the Global Cost of Cybercrime. Economic Impact of Cybercrime II.

Donoho, D. and Jin, J. (2004). Higher criticism for detecting sparse heterogeneous mixtures. Annals of Statistics, pages 962-994.

Neil, J., Hash, C., Brugh, A., Fisk, M., and Storlie, C. B. (2013). Scan statistics for the online detection of locally anomalous subgraphs. Technometrics, 55(4):403-414.

Sarkar, S. K. and Chang, C.-K. (1997). The Simes method for multiple hypothesis testing with positively dependent test statistics. Journal of the American Statistical Association, 92(440):1601-1608.

Simes, R. J. (1986). An improved Bonferroni procedure for multiple tests of significance. Biometrika, 73(3):751-754.