Anomaly detection for Big Data, networks and cyber-security

Transcription

1 Anomaly detection for Big Data, networks and cyber-security. Patrick Rubin-Delanchy, University of Bristol & Heilbronn Institute for Mathematical Research. Joint work with Nick Heard (Imperial College London), Dan Lawson (University of Bristol), Niall Adams (Imperial College London) and Melissa Turcotte (Los Alamos National Laboratory). Dynamic Networks and Cyber-security, 24th June 2015.

3 NetFlow: [Figure: Five minutes of NetFlow, generated by a computer on the Imperial College London network. Axes: server IP address:port against time (minutes:seconds).]

4 Sessionization: [Figure: Sessionization of NetFlow. Axes: server IP address:port against time (minutes:seconds).] Bayesian temporal clustering.

5 Filtering: [Figure: two panels, probability against time of day (hours) and probability against time.]

6 Point process modelling: [Figure: Non-parametric Bayesian intensity estimation, $\lambda(t)$ against time.] Enron data.

7 De Finetti's representation theorem: Let $X_1, X_2, \ldots$ be an infinite exchangeable sequence of Bernoulli variables. Then the $X_i$ are independent conditional on a common success probability, $p$, where $p$ is a random variable on $[0, 1]$.

8 Aldous-Hoover theorem: Let $G$ be an infinite undirected exchangeable graph. Then each edge $i$-$j$ is a conditionally independent Bernoulli variable with success probability $f(u_i, u_j)$, where $u_1, u_2, \ldots$ are independent uniform random variables on $[0, 1]$ and $f$ is a random symmetric function $[0, 1]^2 \to [0, 1]$ (called a graphon).
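The sampling mechanism in the theorem can be sketched in a few lines. This is a minimal illustration, not the authors' code: it conditions on one fixed graphon (the theorem allows a random one), and the names `sample_graph` and `example_graphon`, as well as the choice f(x, y) = xy, are assumptions made only for the example.

```python
import random


def sample_graph(n, graphon, seed=0):
    """Sample an undirected graph on n nodes from a fixed graphon f: [0,1]^2 -> [0,1]."""
    rng = random.Random(seed)
    # Latent positions u_1, ..., u_n: independent uniforms on [0, 1].
    u = [rng.random() for _ in range(n)]
    edges = set()
    for i in range(n):
        for j in range(i + 1, n):
            # Given the latent positions, edge i-j is an independent
            # Bernoulli variable with success probability f(u_i, u_j).
            if rng.random() < graphon(u[i], u[j]):
                edges.add((i, j))
    return u, edges


def example_graphon(x, y):
    # Hypothetical symmetric graphon, chosen only for illustration.
    return x * y


if __name__ == "__main__":
    positions, edges = sample_graph(30, example_graphon)
    print(len(edges), "edges among", len(positions), "nodes")
```

Nodes with large latent positions attract more edges under this particular graphon; a different symmetric function would give a different structure.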

9 Block approximation: [Figure: A graphon and its block approximations. Panels: a) graphon, b) block approximation (coarse), c) block approximation (fine).]

10 Stochastic block model: Partitioning of nodes into communities, represented by a vector of labels $\ell$. Marginal likelihood of the graph: $\mathrm{P}(G \mid \ell) = \prod_{k \le l} \int_0^1 p^{n_{kl}} (1 - p)^{o_{kl} - n_{kl}} \, dF(p)$, where (1) $n_{kl}$ is the number of edges between communities $k$ and $l$, (2) $o_{kl}$ is the number of possible edges between them, and (3) $F$ is the prior distribution function on the edge probability (assuming an IID prior).
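When the prior F is a Beta(a, b) distribution, each integral in the marginal likelihood has the closed form B(a + n_kl, b + o_kl - n_kl) / B(a, b). The sketch below evaluates the log marginal likelihood under that assumption; the Beta prior, the function names and the exclusion of self-loops are choices made for this example, not details stated on the slide.

```python
import math
from itertools import combinations


def log_beta(a, b):
    """Log of the Beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def sbm_log_marginal(edges, labels, a=1.0, b=1.0):
    """log P(G | labels) for an undirected graph without self-loops,
    assuming an IID Beta(a, b) prior on each block edge probability."""
    edge_set = {tuple(sorted(e)) for e in edges}
    n_kl = {}  # observed edges between communities k <= l
    o_kl = {}  # possible edges between communities k <= l
    for i, j in combinations(range(len(labels)), 2):
        key = tuple(sorted((labels[i], labels[j])))
        o_kl[key] = o_kl.get(key, 0) + 1
        if (i, j) in edge_set:
            n_kl[key] = n_kl.get(key, 0) + 1
    total = 0.0
    for key, o in o_kl.items():
        n = n_kl.get(key, 0)
        # Integral of p^n (1 - p)^(o - n) against the Beta(a, b) prior.
        total += log_beta(a + n, b + o - n) - log_beta(a, b)
    return total


if __name__ == "__main__":
    edges = [(0, 1), (2, 3)]
    print(sbm_log_marginal(edges, [0, 0, 1, 1]))  # labels matching the edges
    print(sbm_log_marginal(edges, [0, 1, 0, 1]))  # labels cutting across them
```

Comparing the two printed values shows how the marginal likelihood favours a labelling that groups connected nodes together.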

11 Anomaly detection with complex models: Data: $D$. Anomaly: $f(D)$. P-value: $p = \mathrm{P}\{f(D^*) \ge f(D) \mid D\}$.

12 Anomaly detection with complex models: Data: $D$. Model: $D \mid \theta$. Unknowns: $\theta$. Anomaly: $f(D, \theta)$. P-value: $p = \mathrm{P}\{f(D^*, \theta) \ge f(D, \theta) \mid D\}$.

13 Anomaly detection with complex models: Data: $D$. Model: $D \mid \theta$. Unknowns: $\theta$. Anomaly: $f(D, \theta)$. P-value: $p = \mathrm{P}\{f(D^*, \theta) \ge f(D, \theta) \mid D\}$. Conservative p-value: $p^+ = \sup_\theta (p)$.

14 Anomaly detection with complex models: Data: $D$. Model: $D \mid \theta$. Unknowns: $\theta$. Anomaly: $f(D, \theta)$. P-value: $p = \mathrm{P}\{f(D^*, \theta) \ge f(D, \theta) \mid D\}$. Expected p-value: $\mathrm{E}_\theta(p)$.

15 Posterior predictive p-value: The posterior predictive p-value is (Meng, 1994; Gelman et al., 1996, Eq. 2.8, Eq. 7) $P = \mathrm{P}\{f(D^*, \theta) \ge f(D, \theta) \mid D\}$, (1) where $\theta$ represents the model parameters, $D$ is the observed dataset, $D^*$ is a hypothetical replicated dataset generated from the model with parameters $\theta$, and $\mathrm{P}(\cdot \mid D)$ is the joint posterior distribution of $(\theta, D^*)$ given $D$. In words: if a new dataset were generated from the same model and parameters, what is the probability that the new discrepancy would be as large? Discussion: Guttman (1967), Box (1980), Rubin (1984), Bayarri and Berger (2000), Hjort et al. (2006). Posterior predictive p-values in practice: Huelsenbeck et al. (2001), Sinharay and Stern (2003), Thornton and Andolfatto (2006), Steinbakk and Storvik (2009).

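16 Two estimates: Posterior sample: $\theta_1, \ldots, \theta_n$ from $\theta \mid D$. Estimate I: (1) Simulate data $D^*_1, \ldots, D^*_n$. (2) Estimate $\hat{P} = \frac{1}{n} \sum_{i=1}^{n} I\{f(D^*_i, \theta_i) \ge f(D, \theta_i)\}$. Estimate II: (1) Calculate p-values $Q_i = \mathrm{P}\{f(D^*, \theta_i) \ge f(D, \theta_i)\}$, for $i = 1, \ldots, n$. (2) Estimate $\hat{P} = \frac{1}{n} \sum_{i=1}^{n} Q_i$.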
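The two estimates can be compared on a toy problem. The sketch below is an illustration under assumptions that are not in the slides: data are N(θ, 1), the prior is θ ~ N(0, prior_var), and the discrepancy is f(D, θ) = |mean(D) - θ|, for which the p-value Q_i in Estimate II is available in closed form. All function names are hypothetical.

```python
import math
import random
from statistics import fmean


def normal_cdf(z):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def discrepancy(data, theta):
    """Assumed discrepancy f(D, theta) = |sample mean - theta|."""
    return abs(fmean(data) - theta)


def two_estimates(data, n=4000, prior_var=10.0, seed=0):
    """Return (Estimate I, Estimate II) of the posterior predictive p-value."""
    rng = random.Random(seed)
    m = len(data)
    # Conjugate posterior of theta for N(theta, 1) data and a N(0, prior_var) prior.
    post_var = 1.0 / (m + 1.0 / prior_var)
    post_mean = post_var * sum(data)
    hits = 0
    q_sum = 0.0
    for _ in range(n):
        theta = rng.gauss(post_mean, math.sqrt(post_var))
        observed = discrepancy(data, theta)
        # Estimate I: simulate a replicate dataset and compare discrepancies.
        replicate = [rng.gauss(theta, 1.0) for _ in range(m)]
        hits += discrepancy(replicate, theta) >= observed
        # Estimate II: exact p-value given theta, since the replicate mean
        # minus theta is N(0, 1/m) under the model.
        q_sum += 2.0 * (1.0 - normal_cdf(math.sqrt(m) * observed))
    return hits / n, q_sum / n


if __name__ == "__main__":
    rng = random.Random(1)
    data = [rng.gauss(0.3, 1.0) for _ in range(20)]
    print(two_estimates(data))
```

Both numbers target the same quantity; Estimate II replaces the simulated indicator with its conditional expectation, which is why it typically fluctuates less when the per-θ p-value can be computed.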

18 Interpreting posterior predictive p-values: Recall: $P = \mathrm{P}\{f(D^*, \theta) \ge f(D, \theta) \mid D\}$. How should we interpret $P$? For example: Is $P = 40\%$ good? If $P = 10^{-6}$, is there cause for alarm? What if 500 tests are performed, and $\min_i(P_i) = 10^{-6}$? Some comments: "The interpretation and comparison of posterior predictive p-values [is] a difficult and risky matter" (Hjort et al., 2006). "Its main weakness is that there is an apparent double use of the data...this double use of the data can induce unnatural behavior" (Bayarri and Berger, 2000). Donald Rubin alludes to some "conservative operating characteristics" (Rubin, 1996). The basic problem is: $P$ is not uniformly distributed under the null hypothesis.

20 Meng's result: From the last pages of Meng (1994), in our notation: suppose $\theta$ and $D$ are drawn from the prior and model respectively, and $f(D, \theta)$ is absolutely continuous. Then (i) $\mathrm{E}(P) = \mathrm{E}(U) = 1/2$ and (ii) for any convex function $h$, $\mathrm{E}\{h(P)\} \le \mathrm{E}\{h(U)\}$, where $U$ is a uniform random variable on $[0, 1]$.

21 A consequence: Meng (1994) next finds the bound $\mathrm{P}(P \le \alpha) \le 2\alpha$.

22 The convex order: Let $X$ and $Y$ be two random variables with probability measures $\mu$ and $\nu$ respectively. We say that $\mu \le_{cx} \nu$ if, for any convex function $h$, $\mathrm{E}\{h(X)\} \le \mathrm{E}\{h(Y)\}$, whenever the expectations exist (Shaked and Shanthikumar, 2007). We say that a probability measure $\mathcal{P}$ is sub-uniform if $\mathcal{P} \le_{cx} \mathcal{U}$, where $\mathcal{U}$ is a uniform distribution on $[0, 1]$. Meng's theorem says that posterior predictive p-values have a sub-uniform distribution.
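One way to see how sub-uniformity yields a tail bound of the form P(P ≤ α) ≤ 2α is the short argument below. It is a sketch using a particular convex function chosen for this illustration, and is not claimed to reproduce the proof in Meng (1994).

```latex
% Take 0 < \alpha \le 1/2 (for \alpha > 1/2 the bound 2\alpha \ge 1 is trivial).
% The function h(p) = (2\alpha - p)_+ is convex and satisfies
% h(p) \ge \alpha \, I\{p \le \alpha\} for all p \in [0, 1].
\begin{align*}
\alpha \, \mathrm{P}(P \le \alpha)
  &\le \mathrm{E}\{h(P)\}
   \le \mathrm{E}\{h(U)\}
   = \int_0^{2\alpha} (2\alpha - u) \, du
   = 2\alpha^2, \\
\text{hence} \quad \mathrm{P}(P \le \alpha) &\le 2\alpha .
\end{align*}
```

The middle inequality is exactly the convex-order condition applied to h, so the bound holds for every sub-uniform distribution.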

23 Reflecting on the theorem: (1) Proof is in fact quite straightforward (Jensen's inequality). (2) Is there something more we can say? (a) The $2\alpha$ bound suggests that $P$ could be liberal, whereas we had the impression they were conservative. Can the bound be improved? (b) The theorem seems to be describing a huge space of distributions. Can it somehow be reduced?

24 On taking the average: Theorem (Strassen's theorem). For two probability measures $\mu$ and $\nu$ on the real line the following conditions are equivalent: (1) $\mu \le_{cx} \nu$; (2) there are random variables $X$ and $Y$ with marginal distributions $\mu$ and $\nu$ respectively such that $\mathrm{E}(Y \mid X) = X$. Theorem due to Strassen (1965) (see also references therein), this version due to Müller and Rüschendorf (2001). In other words: taking an average is reducing in the convex order.
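A small numerical check makes the "averaging reduces" statement concrete. In this sketch (an illustration, with hypothetical names), Y = U is uniform on [0, 1] and X is the average of U over whichever of k equal-width bins it falls into, so E(Y | X) = X and X should be below U in the convex order. The three convex test functions and their exact uniform expectations are choices made for the example.

```python
import math


def mean_h_of_binned_uniform(h, k):
    """E{h(X)} where X = E(U | bin containing U), with k equal-width bins on [0, 1].

    X takes the bin midpoints (i + 0.5) / k, each with probability 1 / k.
    """
    return sum(h((i + 0.5) / k) for i in range(k)) / k


# Convex test functions h and the exact values of E{h(U)} for U uniform on [0, 1].
CONVEX_FUNCTIONS = {
    "square": (lambda x: x * x, 1.0 / 3.0),
    "exp": (math.exp, math.e - 1.0),
    "abs_dev": (lambda x: abs(x - 0.3), (0.3 ** 2 + 0.7 ** 2) / 2.0),
}

if __name__ == "__main__":
    for k in (1, 2, 5, 10):
        for name, (h, uniform_value) in CONVEX_FUNCTIONS.items():
            binned_value = mean_h_of_binned_uniform(h, k)
            print(k, name, round(binned_value, 6), "<=", round(uniform_value, 6))
```

With k = 1 the variable X is the constant 1/2, the smallest element in the convex order with mean 1/2; as k grows, X approaches the uniform distribution from below.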

26 Theorem: Let $\mu$ and $\nu$ be two probability measures on the real line where $\nu$ is absolutely continuous. The following conditions are equivalent: (1) $\mu \le_{cx} \nu$; (2) there exist random variables $X$ and $Y$ with marginal distributions $\mu$ and $\nu$ respectively such that $\mathrm{E}(Y \mid X) = X$ and the random variable $Y \mid X$ is either singular, i.e. $Y = X$, or absolutely continuous with $\mu$-probability one. Theorem (Posterior predictive p-values and the convex order): $\mathcal{P}$ is a sub-uniform probability measure if and only if there exist random variables $P$, $D$, $\theta$ and an absolutely continuous discrepancy $f(D, \theta)$ such that $P = \mathrm{P}\{f(D^*, \theta) \ge f(D, \theta) \mid D\}$, where $P$ has measure $\mathcal{P}$, $D^*$ is a replicate of $D$ conditional on $\theta$ and $\mathrm{P}(\cdot \mid D)$ is the joint posterior distribution of $(\theta, D^*)$ given $D$.

28 Exploring the set of sub-uniform distributions: The integrated distribution function of a random variable $X$ with distribution function $F_X$ is $\psi_X(x) = \int_{-\infty}^{x} F_X(t) \, dt$. We have (Müller and Rüschendorf, 2001): (1) $\psi_X$ is non-decreasing and convex; (2) its right derivative $\psi_X^+(x)$ exists and $0 \le \psi_X^+(x) \le 1$; (3) $\lim_{x \to -\infty} \psi_X(x) = 0$ and $\lim_{x \to \infty} \{x - \psi_X(x)\} = \mathrm{E}(X)$. Furthermore, for any function $\psi$ satisfying these properties, there is a random variable $X$ such that $\psi$ is the integrated distribution function of $X$. The right derivative of $\psi$ is the distribution function of $X$, $F_X(x) = \psi^+(x)$. Let $Y$ be another random variable with integrated distribution function $\psi_Y$. Then $X \le_{cx} Y$ if and only if $\psi_X \le \psi_Y$ and $\lim_{x \to \infty} \{\psi_Y(x) - \psi_X(x)\} = 0$.
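The criterion can be checked numerically for a specific pair of distributions. The sketch below, under assumptions made for illustration, integrates the distribution functions of Beta(2,2) and the uniform on [0, 1] with the trapezoid rule and compares the resulting integrated distribution functions on a grid. The function names and grid sizes are hypothetical choices.

```python
def integrated_cdf(cdf, x, steps=2000):
    """Trapezoid approximation of psi(x) = integral of cdf(t) dt from 0 to x.

    Assumes the distribution is supported on [0, 1], so the cdf vanishes below 0.
    """
    h = x / steps
    total = 0.5 * (cdf(0.0) + cdf(x))
    total += sum(cdf(i * h) for i in range(1, steps))
    return total * h


def uniform_cdf(t):
    """Distribution function of the uniform distribution on [0, 1]."""
    return min(max(t, 0.0), 1.0)


def beta22_cdf(t):
    """Distribution function of Beta(2, 2): 3 t^2 - 2 t^3 on [0, 1]."""
    t = min(max(t, 0.0), 1.0)
    return 3.0 * t * t - 2.0 * t ** 3


if __name__ == "__main__":
    for i in range(0, 21, 4):
        x = i / 20.0
        print(x, integrated_cdf(beta22_cdf, x), integrated_cdf(uniform_cdf, x))
```

On [0, 1] the two curves are x^3 - x^4/2 and x^2/2; their difference is x^2 (1 - x)^2 / 2, which is non-negative and vanishes at x = 1, consistent with Beta(2,2) being sub-uniform.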

32 [Figure: Uniform distribution. Panels: $\psi(t)$ against $t$, and density against $t$.]

33 [Figure: Some sub-uniform distributions. Blue: uniform, green: Beta(2,2). Panels: $\psi(t)$ against $t$, and density against $t$.]

34 [Figure: Some sub-uniform distributions. Blue: uniform, green: Beta(2,2), red: a distribution with maximal mass at 0.3. Panels: $\psi(t)$ against $t$, and density against $t$.]

35 [Figure: Some constructive examples. Panels: a) Model 1, b) Model 1 discrepancy, c) Model 2, d) Model 2 discrepancy.]

36 Combining p-values: Independent p-values $P_1, \ldots, P_n$. Two canonical tasks: (1) sub-select a set for further analysis; (2) combine the p-values into one overall score of significance.

37 Discrete p-values: Ordinary p-value ($T = f(D)$): $P = \mathrm{P}(T^* \ge T \mid T)$. Mid-p-value (Lancaster, 1952): $Q = \frac{1}{2} \mathrm{P}(T^* \ge T \mid T) + \frac{1}{2} \mathrm{P}(T^* > T \mid T)$. Lemma (Fisher's method for mid-p-values): Under the null hypothesis, (1) $Q \le_{cx} U$, where $U$ is a uniform random variable on the unit interval; (2) as a consequence, for $t \ge 2n$, $\mathrm{P}(F_n \ge t) \le \exp\{n - t/2 - n \log(2n/t)\}$, where $F_n = -2 \sum_{i=1}^{n} \log(Q_i)$.
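The quantities on this slide can be computed directly for a simple discrete statistic. The sketch below assumes, purely for illustration, that each test statistic is a Binomial(m, p) count with large values treated as anomalous. The function names are hypothetical, and the bound is evaluated exactly as stated in the lemma.

```python
import math


def binomial_pmf(k, m, p):
    """P(T = k) for T ~ Binomial(m, p)."""
    return math.comb(m, k) * p ** k * (1.0 - p) ** (m - k)


def ordinary_and_mid_p(t_obs, m, p=0.5):
    """Upper-tail ordinary p-value and mid-p-value for an observed count t_obs."""
    p_geq = sum(binomial_pmf(k, m, p) for k in range(t_obs, m + 1))
    p_gt = sum(binomial_pmf(k, m, p) for k in range(t_obs + 1, m + 1))
    return p_geq, 0.5 * p_geq + 0.5 * p_gt


def fisher_statistic(p_values):
    """Fisher's combined statistic F_n = -2 * sum(log(q_i))."""
    return -2.0 * sum(math.log(q) for q in p_values)


def fisher_mid_p_bound(t, n):
    """Bound exp{n - t/2 - n log(2n/t)} on P(F_n >= t), valid for t >= 2n."""
    if t < 2 * n:
        raise ValueError("the bound is stated only for t >= 2n")
    return math.exp(n - t / 2.0 - n * math.log(2.0 * n / t))


if __name__ == "__main__":
    counts = [9, 8, 10, 7, 9]  # hypothetical observed counts out of m = 10
    mid_ps = [ordinary_and_mid_p(c, 10)[1] for c in counts]
    stat = fisher_statistic(mid_ps)
    n = len(mid_ps)
    if stat >= 2 * n:
        print(stat, fisher_mid_p_bound(stat, n))
```

The mid-p-value is always smaller than the ordinary p-value for a discrete statistic, so the combined statistic is larger; the lemma's bound is what keeps the resulting significance statement valid.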

40 [Figure: Fisher's method with mid-p-values. Columns: 50/50, random binary, grid of ten. Rows: n = 10, β = 20 and n = 100, β = 5. Each panel plots $\hat{F}(x)$ against $x$ on [0, 1] for ordinary p-values and mid-p-values, with the line y = x for reference.]

41 Aldous, D. J. (1981). Representations for partially exchangeable arrays of random variables. Journal of Multivariate Analysis, 11(4):
Bayarri, M. and Berger, J. O. (2000). P values for composite null models. Journal of the American Statistical Association, 95(452):
Box, G. E. (1980). Sampling and Bayes inference in scientific modelling and robustness. Journal of the Royal Statistical Society. Series A (General), pages
Gelman, A., Meng, X.-L., and Stern, H. (1996). Posterior predictive assessment of model fitness via realized discrepancies. Statistica Sinica, 6(4):
Guttman, I. (1967). The use of the concept of a future observation in goodness-of-fit problems. Journal of the Royal Statistical Society. Series B (Methodological), pages
Hjort, N. L., Dahl, F. A., and Steinbakk, G. H. (2006). Post-processing posterior predictive p values. Journal of the American Statistical Association, 101(475):
Hoeffding, W. (1963). Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301):
Hoover, D. N. (1979). Relations on probability spaces and arrays of random variables. Preprint, Institute for Advanced Study, Princeton, NJ, 2.
Huelsenbeck, J. P., Ronquist, F., Nielsen, R., and Bollback, J. P. (2001). Bayesian inference of phylogeny and its impact on evolutionary biology. Science, 294(5550):
Lancaster, H. (1952). Statistical control of counting experiments. Biometrika, pages
Lehmann, E. L. and Romano, J. P. (2006). Testing Statistical Hypotheses. Springer Science & Business Media.
Meng, X.-L. (1994). Posterior predictive p-values. The Annals of Statistics, 22(3):
Müller, A. and Rüschendorf, L. (2001). On the optimal stopping values induced by general dependence structures. Journal of Applied Probability, 38(3):
Rubin, D. B. (1984). Bayesianly justifiable and relevant frequency calculations for the applied statistician. The Annals of Statistics, 12(4):
Rubin, D. B. (1996). Posterior predictive assessment of model fitness via realized discrepancies. Statistica Sinica, 6(4): Discussion of Gelman, Meng and Stern.
Rubin-Delanchy, P. and Heard, N. A. (2014). A test for dependence between two point processes on the real line. arXiv preprint arXiv:
Rubin-Delanchy, P. and Lawson, D. J. (2014). Posterior predictive p-values and the convex order. arXiv preprint arXiv:
Shaked, M. and Shanthikumar, J. G. (2007). Stochastic Orders. Springer.
Sinharay, S. and Stern, H. S. (2003). Posterior predictive model checking in hierarchical models. Journal of Statistical Planning and Inference, 111(1):
Steinbakk, G. H. and Storvik, G. O. (2009). Posterior predictive p-values in Bayesian hierarchical models. Scandinavian Journal of Statistics, 36(2):
Strassen, V. (1965). The existence of probability measures with given marginals. The Annals of Mathematical Statistics, 36(2):
Thornton, K. and Andolfatto, P. (2006). Approximate Bayesian inference reveals evidence for a recent, severe bottleneck in a Netherlands population of Drosophila melanogaster. Genetics, 172(3):