Combining Weak Statistical Evidence in Cyber Security

Nick Heard
Department of Mathematics, Imperial College London; Heilbronn Institute for Mathematical Research, University of Bristol

Intelligent Data Analysis XIV, 23 October 2015
Collaborative work with:
- Patrick Rubin-Delanchy, University of Bristol
- Melissa Turcotte & Alex Kent, Los Alamos National Laboratory
- Josh Neil, Ernst & Young
Combining p-values
Abstract setting

It will be supposed that n independent hypothesis tests are being conducted. For the i-th test, let:
- H_{0,i}, H_{1,i} be the null and alternative hypotheses;
- p_i be the p-value derived from some test statistic t_i; for example, the upper tail probability of t_i under H_{0,i}, p_i = Pr_{H_{0,i}}(T_i ≥ t_i);
- t_i have a continuous distribution, so that under H_{0,i}, p_i ∼ U(0, 1).
The global hypothesis test that will be considered is:

H_0: ∀ i ∈ {1, ..., n}, H_{0,i} is true
H_1: ∃ I_n ⊆ {1, ..., n}, I_n ≠ ∅, s.t. ∀ i ∈ I_n, H_{1,i} is true.

The test statistic T(p_1, ..., p_n) will be a combiner of the p-values into a single value.

Finally, for defining some such test statistics, let p_(1) ≤ p_(2) ≤ ... ≤ p_(n) be the order statistics of the n p-values.
In some settings, it is more sensible (and more powerful) to combine individual test statistics rather than their p-values. But sometimes, as in meta-analysis, the individual test statistics may no longer be available; or the individual tests are very different in nature and difficult to combine.

Combining p-values is mathematically equivalent to the multiple comparison problem in bulk hypothesis testing, but with a different (in my opinion, less vague) motivation.
Motivating example: cyber security

"Low and slow": a sophisticated cyber attack will try to blend in with the existing traffic in a computer network, gradually achieving its objectives. Possibly only a small subset of the traffic will be malicious and yield significant p-values from statistical models of normal network/user behaviour.
⇒ The signal of the intruder can be drowned out.

Need to:
- detect the intruder from a weak signal;
- identify specific compromised services to limit damage.

Question: How can p-values be filtered and then combined to most efficiently lead to detection?
Fisher's method

s̃_n = −∑_{i=1}^n log p_i = −log ∏_{i=1}^n p_i

Under H_0, 2 s̃_n ∼ χ²_{2n}, and so the upper tail probability from that distribution is U(0, 1).
- Mathematically convenient
- Intuitive: s̃_n is (a monotonic transformation of) the joint cdf of the p-values under H_0. [Recall the p-values are independent and, since p_i ∼ U(0, 1), Pr(p_i ≤ p) = p.]
⇒ the combined p-value derived from Fisher's method reports the probability of observing an even lower joint cdf value
- LRT: Under a special case of the global alternative hypothesis,
  H_1: ∃ a ∈ (0, 1) s.t. ∀ i, f_{1,i}(p_i) ∝ p_i^{a−1},
  this is the uniformly most powerful (UMP) test.
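As an illustration (not from the slides), Fisher's combined p-value can be computed without a special-function library: for the even degrees of freedom 2n arising here, the χ² upper tail has the closed form e^{−s} ∑_{j<n} s^j/j! with s = s̃_n.

```python
import math

def fisher_combine(pvals):
    """Fisher's method: s = -sum(log p_i); under H0, 2s ~ chi^2 with 2n df.
    For even df = 2n the chi-squared upper tail is exp(-s) * sum_{j<n} s^j / j!."""
    n = len(pvals)
    s = -sum(math.log(p) for p in pvals)
    term, total = 1.0, 1.0
    for j in range(1, n):
        term *= s / j   # builds s^j / j! incrementally
        total += term
    return math.exp(-s) * total
```

A quick sanity check: for a single p-value the method returns it unchanged.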
Sum of p-values

s_n = ∑_{i=1}^n p_i

Under H_0, s_n follows an Irwin-Hall distribution, which can be well approximated by N(n/2, n/12) for n > 20.
- Mathematically less convenient
- Unintuitive: the sum has a more awkward probabilistic interpretation under H_0, as the mixture cdf of the p-values
- Almost nobody uses it
- LRT: And yet, under another special case of the global alternative hypothesis,
  H_1: ∃ b > 0 s.t. ∀ i, f_{1,i}(p_i) ∝ e^{−b p_i},
  the sum provides the UMP test.
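A hedged sketch of the corresponding combiner, using the normal approximation N(n/2, n/12) to the Irwin-Hall null distribution (so only reasonable for n > 20); the combined p-value is the lower tail probability, since small p-values make the sum small.

```python
import math

def sum_combine(pvals):
    """Combine p-values by their sum, approximating the Irwin-Hall null
    distribution with N(n/2, n/12); returns the lower tail probability."""
    n = len(pvals)
    z = (sum(pvals) - n / 2) / math.sqrt(n / 12)
    return 0.5 * math.erfc(-z / math.sqrt(2))  # standard normal cdf at z
```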
Linear combination

w ∑_{i=1}^n log p_i + (1 − w) ∑_{i=1}^n p_i

- Mathematically inconvenient: no simple closed form under H_0
- Unintuitive: an even more awkward probabilistic interpretation under H_0, a weighted mixture of the mixture and joint cdfs of the p-values!
- Nobody uses this
- LRT: Under yet another special case of the global alternative hypothesis,
  H_1: ∃ a ∈ (0, 1), b > 0 s.t. ∀ i, f_{1,i}(p_i) ∝ p_i^{a−1} e^{−b p_i},
  the linear combination with w = (1 − a)/(1 + b − a) provides the UMP test
⇒ knowledge of a and b would be required to choose w
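Since the null distribution has no simple closed form, the natural route is Monte Carlo. A minimal sketch, under the assumption that the statistic w ∑ log p_i + (1 − w) ∑ p_i is rejected for small values:

```python
import math
import random

def linear_combine_mc(pvals, w, n_mc=20000, seed=0):
    """Monte Carlo p-value for the linear combination statistic, whose
    null distribution has no simple closed form. Rejects for small values."""
    def stat(ps):
        return w * sum(map(math.log, ps)) + (1 - w) * sum(ps)
    n = len(pvals)
    obs = stat(pvals)
    rng = random.Random(seed)
    hits = sum(stat([rng.random() for _ in range(n)]) <= obs
               for _ in range(n_mc))
    return (1 + hits) / (1 + n_mc)  # add-one correction avoids p = 0
```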
Each of the LRT alternative densities is a decreasing function of p_i. Note that as p_i → 0, the first density (∝ p_i^{a−1}) → ∞, whereas the second density (∝ e^{−b p_i}) → 1.
⇒ s̃_n is optimal when some p-values are extreme; s_n is more appropriate for bunched p-values.
Figure: Significance levels from two p-values combined using s̃_2 (left) and s_2 (right), plotted as surfaces over (p_1, p_2) ∈ (0, 1)².
Many other plausible combiners exist. Birnbaum (1954): any combiner which is monotonic in the p-values provides the most powerful test for some special case of the alternative hypothesis

H_1: ∃ I_n ⊆ {1, ..., n}, I_n ≠ ∅, s.t. ∀ i ∈ I_n, p_i ∼ f_{1,i},

where the alternative densities f_{1,i} are non-increasing.

Question: Which combiners are good when |I_n| ≪ n?
Standard normal example: change in mean

Donoho and Jin (2004): Suppose
H_{0,i}: t_i ∼ N(0, 1)
H_{1,i}: t_i ∼ N(√(2 r log n), 1), r ∈ (0, 1),
and p_i = 1 − Φ(t_i).

If H_{1,i} holds ∀ i ∈ I_n and |I_n| ≪ n, then for a range of suitably small r, as n → ∞:
- Fisher's method cannot separate H_0 from H_1.
- Simes' method and Higher Criticism separate H_0 from H_1 w.p. 1.
Simes' method

T_Simes = n min_{1≤i≤n} p_(i)/i

Simes (1986): Under H_0, T_Simes ∼ U(0, 1).
- Equivalent to the Benjamini-Hochberg (1995) procedure for controlling FDR
- Simes' method is a KS-type test of the ratio of lower tail probabilities for two distribution functions. That is,
  T_Simes^{−1} = sup_{p ∈ (0,1)} F_n(p)/p,
  where F_n is the ecdf of the n p-values (Chang, 1955; Mason and Schuenemeyer, 1983)
- Under H_0, lim_{n→∞} Pr(p_(1) = argmin_{1≤i≤n} p_(i)/i) = e^{−1} ≈ 0.37
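A minimal sketch of the Simes statistic; under H_0 with independent uniform p-values it is itself uniform, so it can be used directly as a combined p-value.

```python
def simes_combine(pvals):
    """Simes' statistic: n * min_i p_(i)/i over the sorted p-values.
    Under H0 with independent uniforms this is itself U(0, 1)."""
    n = len(pvals)
    return n * min(p / i for i, p in enumerate(sorted(pvals), start=1))
```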
Higher Criticism

Donoho and Jin (2004):

HC_n = max_{1≤i≤n} (i/n − p_(i)) / √(p_(i)(1 − p_(i))/n) = max_{1≤i≤n} (i − n p_(i)) / √(n p_(i)(1 − p_(i)))

Distribution under H_0 obtained by Monte Carlo.
- Can be viewed as a weighted KS test,
  HC_n = sup_{p ∈ (0,1)} w(p)(F_n(p) − p),
  where w(p) = (p(1 − p)/n)^{−1/2}, the reciprocal standard deviation of F_n(p);
- or a GLRT, approximating binomial distributions with Gaussians: for n i.i.d. U(0, 1) variates, the number ≤ p_(i) is Binomial(n, p_(i)) ≈ N(n p_(i), n p_(i)(1 − p_(i))).
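A hedged sketch of the Higher Criticism statistic in the first form above; in practice its null distribution would be tabulated by simulating i.i.d. U(0, 1) p-values.

```python
import math

def higher_criticism(pvals):
    """HC statistic: max over i of (i/n - p_(i)) / sqrt(p_(i)(1 - p_(i))/n).
    Assumes all p-values lie strictly inside (0, 1)."""
    n = len(pvals)
    return max((i / n - p) / math.sqrt(p * (1 - p) / n)
               for i, p in enumerate(sorted(pvals), start=1))
```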
Under

H_1: ∃ I_n ⊆ {1, ..., n}, I_n ≠ ∅, and a ∈ [0, 1) s.t. ∀ i ∈ I_n, f_{1,i}(p_i) ∝ p_i^{a−1},

or

H_1: ∃ I_n ⊆ {1, ..., n}, I_n ≠ ∅, and b > 0 s.t. ∀ i ∈ I_n, f_{1,i}(p_i) ∝ e^{−b p_i},

both Simes' method and HC can still lack power, as they are invariant to local changes in the smallest p-values which are deemed significant. It is desirable to make use of all of the significant p-values.
Partial Fisher scores/sums

Two proposed combiners for those alternatives are

min_{1≤k≤n} ζ̃_k(s̃_k),  s̃_k = −(1/k) ∑_{i=1}^k log p_(i),
min_{1≤k≤n} ζ_k(s_k),  s_k = ∑_{i=1}^k p_(i),

where ζ̃_k, ζ_k approximate the corresponding distribution functions.
Under H_0, s̃_k has a closed-form cdf, expressible as a finite alternating sum of terms involving the lower incomplete gamma function γ; but for n > 13 this expression is not nice to work with numerically. There is no closed form for the cdf of s_k under H_0.
Instead, we follow Donoho and Jin and approximate the true distributions with Gaussians, yielding simple test statistics

S̃_n = min_{1≤k≤n} {(s̃_k − μ̃_k)/σ̃_k},  S_n = min_{1≤k≤n} {(s_k − μ_k)/σ_k},

where
μ̃_k = 1 − ψ(k + 1) + ψ(n + 1),  σ̃²_k = 1/k + ψ₁(k + 1) − ψ₁(n + 1),
μ_k = k(k + 1)/(2(n + 1)),  σ²_k = k(k + 1)(2n(1 + 2k) − (3k + 2)(k − 1))/(12(n + 1)²(n + 2)),

and ψ and ψ₁ are the digamma and trigamma functions. The distributions of S̃_n and S_n are much simpler to obtain by Monte Carlo.
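A sketch of the two standardized partial statistics, following my reading of the formulas above (with s̃_k taken as the average −k^{−1} ∑ log p_(i), so that μ̃_k and σ̃²_k are its exact mean and variance). At integer arguments the digamma and trigamma functions reduce to harmonic-type sums, so no special-function library is needed; null quantiles would then come from Monte Carlo over i.i.d. uniforms.

```python
import math

EULER_GAMMA = 0.5772156649015329

def psi_int(m):
    """Digamma at a positive integer: psi(m) = -gamma + sum_{j<m} 1/j."""
    return -EULER_GAMMA + sum(1.0 / j for j in range(1, m))

def psi1_int(m):
    """Trigamma at a positive integer: psi_1(m) = pi^2/6 - sum_{j<m} 1/j^2."""
    return math.pi ** 2 / 6 - sum(1.0 / j ** 2 for j in range(1, m))

def partial_statistics(pvals):
    """Standardized partial Fisher score and partial sum statistics,
    each minimized over the subset size k."""
    n = len(pvals)
    q = sorted(pvals)
    S_tilde, S = math.inf, math.inf
    log_cum, cum = 0.0, 0.0
    for k in range(1, n + 1):
        log_cum += math.log(q[k - 1])
        cum += q[k - 1]
        mu_t = 1 - psi_int(k + 1) + psi_int(n + 1)
        sd_t = math.sqrt(1.0 / k + psi1_int(k + 1) - psi1_int(n + 1))
        S_tilde = min(S_tilde, (-log_cum / k - mu_t) / sd_t)
        mu = k * (k + 1) / (2 * (n + 1))
        var = (k * (k + 1) * (2 * n * (1 + 2 * k) - (3 * k + 2) * (k - 1))
               / (12 * (n + 1) ** 2 * (n + 2)))
        S = min(S, (cum - mu) / math.sqrt(var))
    return S_tilde, S
```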
Figure: Significance levels from two p-values combined using S̃_2 (left) and S_2 (right), plotted as surfaces over (p_1, p_2) ∈ (0, 1)².
Empirical comparison

Recall the example of Donoho and Jin (2004),
H_{0,i}: t_i ∼ N(0, 1)
H_{1,i}: t_i ∼ N(√(2 r log n), 1), r ∈ (0, 1),
where H_{1,i} holds for some non-empty I_n ⊆ {1, ..., n}, s.t. |I_n| ≪ n.

For illustration, let
- |I_n| = ⌈n^{1/3}⌉
- r = 4/15 (the smallest detectable r, assuming |I_n| = n^{1/3}, is 1/6).
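For readers wanting to reproduce curves of this flavour, a hypothetical simulation of one replicate of the p-values under H_1 (the function name and the ⌈n^{1/3}⌉ choice are assumptions following the setup above):

```python
import math
import random

def donoho_jin_pvalues(n, r, rng=None):
    """Simulate n one-sided p-values: ceil(n^(1/3)) tests have shifted mean
    mu_n = sqrt(2 r log n) under H1, the rest are standard normal."""
    rng = rng or random.Random(0)
    mu = math.sqrt(2 * r * math.log(n))
    m = math.ceil(n ** (1 / 3))
    ts = [rng.gauss(mu, 1) for _ in range(m)]
    ts += [rng.gauss(0, 1) for _ in range(n - m)]
    # p_i = 1 - Phi(t_i), the upper tail of N(0, 1)
    return [0.5 * math.erfc(t / math.sqrt(2)) for t in ts]
```

Feeding many such replicates through each combiner, against the combiner's Monte Carlo null quantiles, traces out the power curves that follow.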
Power curves (power F_1(p) against significance level p) comparing the partial product, Higher Criticism, Simes' method and Fisher's method against the H_0 diagonal:
- n = 2: |I_n| = 1, μ_n = 0.6
- n = 10: |I_n| = 2, μ_n = 1.1
- n = 100: |I_n| = 4, μ_n = 1.6
- n = 1,000: |I_n| = 10, μ_n = 1.9
- n = 10,000: |I_n| = 21, μ_n = 2.2
- n = 100,000: |I_n| = 46, μ_n = 2.5
- n = 1,000,000: |I_n| = 100, μ_n = 2.7
- n = 10,000,000: |I_n| = 215, μ_n = 2.9 (final plot also overlays the partial sum)
Figure: Distribution (log scale) of the significant subset size under H_1 for n = 1,000, |I_n| = 10, comparing the partial product, Higher Criticism and Simes' method over subset sizes 1 to 20.
Concluding remarks on combining p-values

The asymptotic assumptions from Donoho and Jin (2004) used here imply:
- a decreasing proportion of the alternative hypotheses are true;
- the effect size increases with the number of tests;
which seems an unlikely scenario (but allowed some impressive maths).

A more realistic scenario might assume:
- a constant proportion of the alternative hypotheses are true;
- individual test effect sizes which decrease with n.

Finding the correct number of significant p-values seems very difficult. In many (most?) applications, Fisher's method is really good.
Discrete p-values
Discrete p-values

In many practical applications, some of the test statistics for the individual hypothesis tests will be discrete: either because they naturally are, or because they are recorded with limited precision, leading to interval censoring.

Even under H_0, discrete p-values are not U(0, 1); they are stochastically larger.
⇒ Discrete p-values are conservative
⇒ Combining discrete p-values can be really conservative
⇒ Subset selection becomes even more important
Mid-p-values

To combat the conservatism of discrete p-values, some practitioners advocate mid-p-values. If the regular p-value is

p_i = Pr_{H_{0,i}}(T_i ≥ t_i),

the corresponding mid-p-value is

p_{i,mid} = ½ Pr_{H_{0,i}}(T_i ≥ t_i) + ½ Pr_{H_{0,i}}(T_i > t_i).

Mid-p-values are not stochastically larger than U(0, 1). They are:
- less than U(0, 1) in the convex order: they have mean ½, but are less variable;
- not conservative: for some values of α, Pr_{H_{0,i}}(p_{i,mid} ≤ α) = 2α.
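A small worked example (hypothetical, using a binomial test statistic) showing the regular and mid-p-values side by side:

```python
from math import comb

def binom_p_and_midp(k, n, theta):
    """Upper-tail p-value Pr(X >= k) and mid-p-value for X ~ Binomial(n, theta)."""
    pmf = [comb(n, j) * theta ** j * (1 - theta) ** (n - j) for j in range(n + 1)]
    p = sum(pmf[k:])            # Pr(X >= k)
    p_mid = p - 0.5 * pmf[k]    # (1/2) Pr(X >= k) + (1/2) Pr(X > k)
    return p, p_mid
```

With n = 2 fair-coin trials and k = 1 success, the regular p-value is 0.75 while the mid-p-value is 0.5: the discreteness correction in action.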
Random p-values

If we randomly draw

p_{i,random} ∼ U(Pr_{H_{0,i}}(T_i > t_i), Pr_{H_{0,i}}(T_i ≥ t_i)),

then marginally p_{i,random} ∼ U(0, 1).

Very natural to think this way when discreteness arises through censoring. But when combining random p-values, there is no rule which could guarantee monotonicity in the p-values.
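The construction above is one line of code; a hedged sketch, taking the two tail probabilities of the discrete statistic as inputs:

```python
import random

def random_p_value(tail_gt, tail_geq, rng=random):
    """Draw p uniformly on (Pr(T > t), Pr(T >= t)); marginally U(0,1) under H0."""
    return rng.uniform(tail_gt, tail_geq)
```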
Cyber security
Cyber security: big business

Former MI5 director Jonathan Evans, October 2015:
- "Cyber security is a recognised major international issue."
- "There are now a lot of serious people focused on cyber security and there is a lot of investment, not only in resilience, but also by venture capitalists in cyber security startups."
- "The amount of money, intellectual activity and resource going into this will have an impact and, as this matures, there will be a balancing out between the attackers and the defenders, but right now there is still a lot of work to be done."
Statistical challenges in cyber

No network can be made fully secure. It is not unreasonable to assume that most networks are compromised to some degree.

Statistical monitoring of network traffic offers a robust ("signature free") second level of defence. For those attacks which do penetrate a network perimeter, we need technologies for identifying malign activity amongst the bulk traffic. This is a data mining problem.

A model-based approach: we can learn about normal behaviour in a network by gathering data and building statistical models (null hypotheses). Anomalous behaviour w.r.t. those models can indicate potential breaches requiring further inspection.

We are less able to gather data on potential attack behaviours, which are more adaptive
⇒ cannot assume a model for compromised behaviour
⇒ reliant on p-values from the null model, rather than e.g. Bayes factors
User credentials: authentication & computer event logs

Computer event logs are a critical resource for investigating security incidents, providing detailed information at a machine level:
- authentication, logons, ...
- processes
- applications/services

Many of these log entries are tied to a user credential action.
- Reusable user credentials are one of the most powerful items an attacker can obtain
- Adversaries require user credentials to traverse the network
- Due to single sign-on (favoured for convenience and usability), credentials and hashed passwords are stored in computer memory, making them simple to obtain and reuse

Can we detect network intrusions from event logs by ranking the most unusual behaviour according to statistical models of each user credential?
Los Alamos National Laboratory computer network

LANL is a US Department of Energy research lab. Their
- substantial Internet presence,
- state-of-the-art computer systems, and
- huge stores of proprietary information
make LANL a prime target for hacking. They face several million cyber attacks each day.

Here we consider Windows-based authentication event logs from the LANL enterprise computer network. Features:
- Two months of data: 444 million events for 10k users
- Month-long red-team exercise in the second month of data, with 78 known compromised credentials
- Random selection of 1,000 credentials plus the compromised credentials: 50 million associated events

These data are now freely available at http://csr.lanl.gov/data/cyber1.
Figure: Distribution (log scale) of authentication event types in the LANL network: network logon, interactive logon, remote desktop, process start, kerberos, other.
A user credential authentication modelling approach

Each user credential is monitored for the first month of data. The authentication events on the network using that credential form a sequence (c_t, s_t, e_t), t ∈ N, where
- c_t ∈ C is the client IP address,
- s_t ∈ S is the server IP address,
- e_t ∈ E is the authentication event type.

From our model we should like to capture statistical surprise in each of these three data items. Is this a new client/server for the user? And if so, is this an unusual choice of client/server? And given the client-server pair, is this an unusual authentication event type?

Although each user will be modelled independently for tractability and parallelisation, hyperparameters will be pooled across users to learn typical client/server behaviour across the network.
To specify the model, let C_t, S_t respectively be the sets of unique clients or servers authenticated on by the user prior to time t, and define binary variables χ_t, σ_t ∈ {0, 1} s.t.

χ_t = 1 ⟺ c_t ∉ C_t,  σ_t = 1 ⟺ s_t ∉ S_t.

Then the probability mass function for the triple (c_t, s_t, e_t) is specified through

Pr(c_t, s_t, e_t) = Pr(χ_t | χ_{t−1}) Pr(c_t | χ_t, c_{t−1}) Pr(σ_t | c_t, σ_{t′}) Pr(s_t | c_t, σ_t, s_{t′}) Pr(e_t | c_t, s_t),

where t′ is the last event time previous to t at which the client was c_t.

Bayesian multinomial-Dirichlet distributions are fit to each of these conditional distributions using the first month of data, assuming either an i.i.d. sequence or a first-order Markov chain.
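Each conditional factor above is a categorical distribution with a Dirichlet prior, so its posterior predictive probability has the familiar closed form. A hedged sketch (the slides' pooling of hyperparameters across users is omitted):

```python
def dirichlet_predictive(counts, alpha, x):
    """Posterior predictive Pr(next = x) for a categorical likelihood with
    Dirichlet(alpha) prior: (alpha_x + n_x) / (sum(alpha) + N)."""
    total = sum(alpha.values()) + sum(counts.values())
    return (alpha.get(x, 0.0) + counts.get(x, 0)) / total
```

For example, observed server counts {"s1": 3, "s2": 1} with a unit prior on each give Pr(next = "s1") = 4/6.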
In cases where a new client or server is used (χ_t = 1 or σ_t = 1), time-varying Dirichlet prior parameters for the possible IP addresses are set proportional to the current in/out degrees of the chosen client/server at discrete time t. When the client or server is drawn from C_t or S_t, the Dirichlet parameters are set equal to the empirical frequencies of each IP address.

Similarly for event types: for new client-server edges the Dirichlet prior parameters are determined by the observed event type frequencies from other users on the same client-server edge, and by the empirical distribution otherwise.
Given this predictive mass function, the (discrete) p-value for an observation being as unlikely as (c_t, s_t, e_t) is given by

p_t = ∑_{(c,s,e) ∈ C×S×E} I{Pr(c, s, e) ≤ Pr(c_t, s_t, e_t)} Pr(c, s, e).

The corresponding mid-p-value is given by

p_{t,mid} = ½ p_t + ½ ∑_{(c,s,e) ∈ C×S×E} I{Pr(c, s, e) < Pr(c_t, s_t, e_t)} Pr(c, s, e).

Detecting fraudulent misuse of user credentials is a task of combining subsets of small p-values indicating anomalous authentication behaviour mixed amongst the bulk traffic.
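The two displays above translate directly to code. A minimal sketch over an explicit pmf dictionary (in practice the product space C × S × E would be enumerated from the fitted predictive model):

```python
def discrete_and_mid_p(pmf, observed):
    """p-value: total predictive mass on outcomes no more likely than the
    observed one; mid-p: half weight on outcomes exactly as likely."""
    p_obs = pmf[observed]
    p = sum(q for q in pmf.values() if q <= p_obs)
    p_strict = sum(q for q in pmf.values() if q < p_obs)
    return p, 0.5 * p + 0.5 * p_strict
```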
p-values from authentication modelling

Figure: Density functions of discrete p-values for infected (left) and uninfected (right) user IDs.
Figure: ROC curves (true positive rate against false positive rate, with the H_0 diagonal) for combiners applied to the discrete p-values; Fisher's method, Simes' method, Higher Criticism, the partial product and the partial sum are added across successive builds.
Figure: ROC curves for the same five combiners applied to mid-p-values.
Figure: ROC curves for the same five combiners applied to the median from random p-values.
Concluding remarks

- Combining p-values is an increasingly important topic in the Big Data paradigm shift. We cannot build a big joint distribution of everything and calculate the true p-value.
- Away from authentication, there are many contexts just within cyber where combining weak evidence, in the form of p-values or weak signals, is important.
- In NetFlow data, for example, we get a different view, from the router level, of IP-IP communications: service port numbers, TCP flags, numbers of packets/bytes, timings (typically to the millisecond). All of these can be informative.
- Beyond signature detection, there is often no smoking gun in cyber; just bits and pieces of evidence to piece together.
- Statistical cyber security is a data fusion exercise, and combining p-values badly can undo the benefits of good modelling.