Finding statistical patterns in Big Data

1 Finding statistical patterns in Big Data. Patrick Rubin-Delanchy, University of Bristol & Heilbronn Institute for Mathematical Research. IAS Research Workshop: Data science for the real world (workshop 1), 1st May 2015.

2 This talk is about hypothesis testing/feature extraction/anomaly detection for problems involving large amounts of data. Computational issues are largely ignored. Instead, the focus is on some of the pitfalls of hypothesis testing. The solutions presented are simple. Recommendations are meant to help exploratory research; I'm not taking any position on publishing standards.

7 Pitfalls of hypothesis testing with Big Data. This talk focusses on two pervasive issues: (1) hypothesis testing when the model is wrong; (2) multiple testing.

8 The cyber-security application. Network flow data at Los Alamos National Laboratory: 30 GB/day. Typical attack pattern: A. Opportunistic infection; B. Network traversal; C. Data exfiltration. Figure: Network traversal, source: Neil et al. (2013).

9 The cyber-security application. The global cost of cyber-crime is estimated at $400 billion (CSIS, 2014). The botnet behind a third of the spam sent in 2010 earned about $2.7 million, while spam prevention cost more than $1 billion (Anderson et al., 2013). UK National Security Strategy Priority Risks (Cabinet Office, 2010): (1) International terrorism affecting the UK or its interests, including a chemical, biological, radiological or nuclear attack by terrorists; and/or a significant increase in the levels of terrorism relating to Northern Ireland. (2) Hostile attacks upon UK cyber space by other states and large scale cyber crime. (3) A major accident or natural hazard which requires a national response, such as severe coastal flooding affecting three or more regions of the UK, or an influenza pandemic. (4) An international military crisis between states, drawing in the UK, and its allies as well as other states and non-state actors.

10 Hypothesis testing framework. (1) Null hypothesis $H_0$, e.g., the drug has no effect; any difference between the two groups is due to chance. (2) Alternative hypothesis $H_1$, e.g., the drug has a (positive) effect. (3) Test statistic $T$, e.g., difference in treatment outcomes. (4) P-value: $p = P_0(T^* \geq T)$, where $P_0$ is the distribution of $T$ under $H_0$ and $T^*$ is a replicate of $T$ under $H_0$. (5) We reject the null hypothesis when $p$ is small, e.g., less than 5%.

11 Hypothesis testing with the wrong model. Example hypothesis test: $H_0: \underbrace{\text{the data are Gaussian}}_{\text{FALSE}},\ \underbrace{\mu_1 = \mu_2}_{\text{TRUE}}$; $H_1: \underbrace{\text{the data are Gaussian}}_{\text{FALSE}},\ \underbrace{\mu_1 > \mu_2}_{\text{FALSE}}$. It's right to reject the null hypothesis, but wrong to accept the alternative. In practice this seems to lead to p-values with U-shaped distributions.

14 Example: two-sample t-test. [Figure: samples $G_1$ and $G_2$.] Difference: $\bar{G}_1 - \bar{G}_2$. t-test: $T = \frac{\bar{G}_1 - \bar{G}_2}{s_{12}\sqrt{1/n_1 + 1/n_2}}$. P-value: $p \approx 0.03$.
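
A minimal sketch of the two-sample statistic on this slide, assuming that $s_{12}$ denotes the pooled standard deviation of the two groups (the slide does not define it). The function name is hypothetical, and only the statistic is computed, not the Gaussian-theory p-value.

```python
import math
import statistics


def pooled_t_statistic(g1, g2):
    """Two-sample t statistic, assuming s_12 is the pooled standard deviation."""
    n1, n2 = len(g1), len(g2)
    mean_difference = statistics.fmean(g1) - statistics.fmean(g2)
    # Pooled variance: the two sample variances weighted by their degrees of freedom.
    pooled_variance = (
        (n1 - 1) * statistics.variance(g1) + (n2 - 1) * statistics.variance(g2)
    ) / (n1 + n2 - 2)
    return mean_difference / (math.sqrt(pooled_variance) * math.sqrt(1 / n1 + 1 / n2))
```

Under the Gaussian null hypothesis this statistic would be compared with a t distribution on $n_1 + n_2 - 2$ degrees of freedom.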

15 Example: two-sample t-test. [Figure: samples $G_1$ and $G_2$.] The t-test assumes: (1) $H_0$: the data are independent and Gaussian with the same mean and variance; (2) $H_1$: $\mu_1 > \mu_2$. Although we might correctly reject $H_0$, we don't know if we are rejecting: a) the data are Gaussian; b) the data have the same mean; c) the data have the same variance.

16 Example: two-sample t-test. [Figure: samples $G_1$ and $G_2$.] A vector $X_1, \ldots, X_n$ is exchangeable if its joint distribution is the same as that of $X_{\sigma(1)}, \ldots, X_{\sigma(n)}$ for any permutation $\sigma$. Instead of assuming a Gaussian model under $H_0$, assume: (1) $H_0$: the data are exchangeable; (2) $H_1$: the data are not exchangeable, $\mu_1 > \mu_2$. Under $H_0$, we can think of $T$ as a random draw from $T_1, \ldots, T_M$, where $T_i$ is the statistic computed on the $i$th permutation of $G_1$ and $G_2$. (Formally, we are conditioning on a sufficient statistic for the unknowns.)

17 Example: two-sample t-test. [Figure: samples $G_1$ and $G_2$.] Difference: $\bar{G}_1 - \bar{G}_2$. t-test: $T = \frac{\bar{G}_1 - \bar{G}_2}{s_{12}\sqrt{1/n_1 + 1/n_2}}$.

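18 Example: two-sample t-test. [Figure: resampled $G_1$ and $G_2$.] Resampled difference: $\bar{G}^*_1 - \bar{G}^*_2$. Resampled statistic: $T^* = \frac{\bar{G}^*_1 - \bar{G}^*_2}{s^*_{12}\sqrt{1/n_1 + 1/n_2}}$.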

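20 Example: two-sample t-test. [Figure: samples $G_1$ and $G_2$.] Difference: $\bar{G}_1 - \bar{G}_2$. Statistic: $T = \frac{\bar{G}_1 - \bar{G}_2}{s_{12}\sqrt{1/n_1 + 1/n_2}}$. P-value (permutation-based): $\hat{p} = \frac{1}{M+1} \sum_{i=0}^{M} I(T_i \geq T) \approx 0.21$, where $T_0 = T$.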
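
The permutation-based p-value can be sketched as follows. This is an illustration under stated assumptions, not the code behind the slides: it draws M random permutations rather than enumerating all of them, it uses the plain difference in means as the default statistic, and the function names, the default M = 999 and the fixed seed are all hypothetical choices.

```python
import random
import statistics


def mean_difference(g1, g2):
    """Default statistic (an assumption): difference in group means."""
    return statistics.fmean(g1) - statistics.fmean(g2)


def permutation_p_value(g1, g2, statistic=mean_difference,
                        num_permutations=999, seed=0):
    """One-sided Monte Carlo permutation p-value for H1: mu_1 > mu_2."""
    rng = random.Random(seed)
    observed = statistic(g1, g2)
    pooled = list(g1) + list(g2)
    n1 = len(g1)
    count = 1  # the i = 0 term: T_0 = T always satisfies T_0 >= T
    for _ in range(num_permutations):
        rng.shuffle(pooled)  # relabel the observations at random
        if statistic(pooled[:n1], pooled[n1:]) >= observed:
            count += 1
    return count / (num_permutations + 1)
```

Counting the observed statistic as $T_0$ means the estimate is never smaller than $1/(M+1)$, so M bounds how small a p-value can be reported. Any statistic with the same two-argument signature can be passed in place of the default.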

21 Reasons to consider a non-parametric approach (roughly, reasons modelling could be hard): (1) visualisation and curation are difficult (e.g. for logistical or privacy reasons); (2) the same analytic is to be used on different data sources; (3) a lack of domain expertise; (4) the data are complicated objects, e.g. graphs. Reasons not to: (1) sometimes a non-parametric approach is not available; (2) model-based approaches can have greater power; (3) there is a question of how to balance computational effort against simulation error.

23 Multiple testing. Michael Jordan: "When you have large amounts of data, your appetite for hypotheses tends to get even larger." (IEEE Spectrum, 20th October 2014). In fact, the number of hypotheses tested often grows much faster than the data. The basic problem: as the number of tests gets large, the probability of finding at least one significant result by chance alone becomes very high.
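
To put rough numbers on the basic problem, here is a small sketch under two assumptions that are not in the talk: the tests are independent, and every null hypothesis is true, so each test is falsely significant with probability alpha. The function name is illustrative.

```python
def prob_at_least_one_false_positive(num_tests, alpha=0.05):
    """P(at least one p-value <= alpha) for independent tests of true nulls."""
    # Each test avoids a false positive with probability 1 - alpha.
    return 1.0 - (1.0 - alpha) ** num_tests


for n in (1, 10, 100, 1000):
    print(n, prob_at_least_one_false_positive(n))
```

Under these assumptions the probability is already above 0.99 at 100 tests, which is why an uncorrected "significant" finding carries little weight at scale.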

24 Example: spurious correlations Figure : Spurious correlations, source:

25 Two approaches to multiple testing. Suppose we have p-values $p_1, \ldots, p_n$. The two canonical tasks are: (1) sub-select a set for further analysis; (2) combine the p-values into one overall score of significance.

26 Sub-selection. Define the false discovery rate to be (Benjamini and Hochberg, 1995): $Q = \begin{cases} 0 & \text{if no hypothesis is rejected} \\ \frac{\#\text{incorrect rejections}}{\#\text{total rejections}} & \text{otherwise} \end{cases}$. Benjamini and Hochberg (1995) propose that $Q$ is the quantity we want to control. (1) Let $p_{(1)} \leq \cdots \leq p_{(n)}$ denote the ordered p-values. (2) Let $k$ be the largest $i$ such that $p_{(i)} \leq \frac{i}{n} q$. (3) Reject the hypotheses corresponding to $p_{(1)}, \ldots, p_{(k)}$. (4) Then, if the p-values corresponding to the true null hypotheses are independent, $E(Q) \leq q$.
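
The four steps above can be sketched in a few lines. The function name and the default level q = 0.05 are illustrative choices, not taken from the slides.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return the (sorted) indices of the hypotheses rejected at level q."""
    n = len(p_values)
    # Step 1: indices of the p-values in increasing order.
    order = sorted(range(n), key=lambda i: p_values[i])
    # Step 2: largest rank i with p_(i) <= (i / n) * q; k = 0 means no rejection.
    k = 0
    for rank, index in enumerate(order, start=1):
        if p_values[index] <= rank * q / n:
            k = rank
    # Step 3: reject the hypotheses with the k smallest p-values.
    return sorted(order[:k])
```

Because k is the largest rank that meets its threshold, a hypothesis can be rejected even if its own p-value misses its own threshold. For example, with p-values 0.02, 0.021, 0.022, 0.9 and q = 0.05, the first three are rejected although 0.02 exceeds 0.05/4.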

28 Combining p-values I. In the second approach, we consider the joint hypothesis test: $H_0$: all of the null hypotheses hold; $H_1$: at least one alternative holds. One method for combining p-values: (1) Let $\pi = \min_{i \in \{1, \ldots, n\}} \left\{ \frac{n p_{(i)}}{i} \right\}$. (2) Then if the p-values are independent under $H_0$ (Simes, 1986), $\pi \sim \text{uniform}[0, 1]$. (3) Furthermore, if the p-values are positively dependent (Sarkar and Chang, 1997), $\pi \geq_{st} \text{uniform}[0, 1]$, where $\geq_{st}$ denotes the usual stochastic order. In statistical terminology, rejecting on the basis of $\pi$ is conservative.
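
A short sketch of the combined score $\pi$ defined above; the function name is hypothetical.

```python
def simes_combined_p_value(p_values):
    """Combined score: min over i of n * p_(i) / i, with p_(i) the ordered p-values."""
    n = len(p_values)
    ordered = sorted(p_values)
    return min(n * p / i for i, p in enumerate(ordered, start=1))
```

The i = 1 term is n times the smallest p-value, which is the Bonferroni-adjusted minimum, so the combined score is never larger than the Bonferroni one.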

32 Combining p-values II. Consider the more specific needles-in-a-haystack scenario: (1) $n$ very large; (2) most of the tests are expected to have no signal. $H_0$: all of the null hypotheses hold; $H_1$: a vanishing proportion of the alternatives hold. Let $\pi = \max_{0 < \alpha \leq \alpha_0} \sqrt{n} \left\{ \frac{(\text{fraction significant at } \alpha) - \alpha}{\sqrt{\alpha(1 - \alpha)}} \right\}$. Then under the sparse conditions set out in Donoho and Jin (2004), $\pi$ will manage to detect the alternative whenever it is asymptotically theoretically possible.
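
A rough sketch of this statistic, under assumptions that go beyond the slide. The maximum is evaluated only at the observed p-values that are at most $\alpha_0$, which is where the fraction significant jumps. The default $\alpha_0 = 0.5$ and the function name are illustrative. "Significant at $\alpha$" is read as $p \leq \alpha$.

```python
import math


def higher_criticism(p_values, alpha0=0.5):
    """Largest standardised excess of p-values below alpha, over alpha <= alpha0."""
    n = len(p_values)
    ordered = sorted(p_values)
    # Below the smallest p-value the statistic is negative and tends to 0 as
    # alpha -> 0, so 0 is used as the starting value for the maximum.
    best = 0.0
    for i, p in enumerate(ordered, start=1):
        if p > alpha0:
            break
        if p <= 0.0 or p >= 1.0:
            continue  # the standardisation is undefined at alpha = 0 or 1
        # i / n is the fraction of p-values <= p; with ties, the last tied
        # index gives the correct fraction and the max picks it up.
        fraction_significant = i / n
        score = math.sqrt(n) * (fraction_significant - p) / math.sqrt(p * (1.0 - p))
        best = max(best, score)
    return best
```

A large value suggests more small p-values than a uniform sample would produce. Calibrating a rejection threshold, for example by simulating uniform p-values, is left out of the sketch.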

34 Conclusion. (1) We've advertised a few simple techniques that can help with hypothesis testing at scale. (2) Computational issues have been largely ignored, e.g. the permutation test is more effort (but we can control that). (3) Some of the concepts touched upon, e.g. exchangeability, dependence and stochastic orders, have a much deeper theory that is possibly very relevant to the theory of Big Data.

35 Anderson, R., Barton, C., Böhme, R., Clayton, R., Van Eeten, M. J., Levi, M., Moore, T., and Savage, S. (2013). Measuring the cost of cybercrime. In The economics of information security and privacy. Springer.
Benjamini, Y. and Hochberg, Y. (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society. Series B (Methodological).
Cabinet Office and National security and intelligence (2010). A strong Britain in an age of uncertainty: the national security strategy.
Center for Strategic and International Studies (2014). Net losses: Estimating the global cost of cybercrime. Economic impact of cybercrime II.
Donoho, D. and Jin, J. (2004). Higher criticism for detecting sparse heterogeneous mixtures. Annals of Statistics.
Neil, J., Hash, C., Brugh, A., Fisk, M., and Storlie, C. B. (2013). Scan statistics for the online detection of locally anomalous subgraphs. Technometrics, 55(4).
Sarkar, S. K. and Chang, C.-K. (1997). The Simes method for multiple hypothesis testing with positively dependent test statistics. Journal of the American Statistical Association, 92(440).
Simes, R. J. (1986). An improved Bonferroni procedure for multiple tests of significance. Biometrika, 73(3).