Anomaly detection for Big Data, networks and cyber-security




Anomaly detection for Big Data, networks and cyber-security. Patrick Rubin-Delanchy, University of Bristol & Heilbronn Institute for Mathematical Research. Joint work with Nick Heard (Imperial College London), Dan Lawson (University of Bristol), Niall Adams (Imperial College London) and Melissa Turcotte (Los Alamos National Laboratory). Dynamic Networks and Cyber-security, 24th June 2015.

Netflow

Figure: Five minutes of NetFlow, generated by a computer on the Imperial College London network (server IP address:port plotted against time, in minutes:seconds).

Sessionization

Figure: Sessionization of NetFlow by Bayesian temporal clustering (same axes as the previous figure).

Filtering

Figure: Estimated event probability against time of day (hours), shown on a linear axis and as a 24-hour clock plot.

Point process modelling

Figure: Non-parametric Bayesian intensity estimation λ(t) for the Enron email data.

De Finetti's representation theorem

Let X_1, X_2, ... be an infinite exchangeable sequence of Bernoulli variables. Then the X_i are independent conditional on a common success probability p, where p is a random variable on [0, 1].

Aldous-Hoover theorem

Let G be an infinite undirected exchangeable graph. Then each edge i ~ j is a conditionally independent Bernoulli variable with success probability f(u_i, u_j), where u_1, u_2, ... are independent uniform random variables on [0, 1] and f is a random symmetric function from [0, 1]^2 to [0, 1] (called a graphon).
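The representation suggests a direct recipe for simulating an exchangeable graph: draw latent uniforms u_i, then include each edge independently with probability f(u_i, u_j). A minimal Python sketch (the function name and the example graphon f(x, y) = xy are illustrative choices, not from the talk):

```python
import random

def sample_graphon_graph(n, f, seed=0):
    """Sample an n-node undirected graph via the Aldous-Hoover
    representation: draw u_1, ..., u_n ~ Uniform(0, 1) independently,
    then include edge i-j with probability f(u_i, u_j).

    f should be a symmetric function [0, 1]^2 -> [0, 1] (here a fixed
    graphon passed in by the caller, rather than a random one)."""
    rng = random.Random(seed)
    u = [rng.random() for _ in range(n)]
    edges = set()
    for i in range(n):
        for j in range(i + 1, n):
            # conditionally independent Bernoulli edge given (u_i, u_j)
            if rng.random() < f(u[i], u[j]):
                edges.add((i, j))
    return u, edges

# Example: nodes with larger latent positions connect more often
u, edges = sample_graphon_graph(50, lambda x, y: x * y)
```

With the constant graphon f = p this reduces to an Erdős-Rényi graph, and a piecewise-constant f gives exactly the stochastic block model of the next slides.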

Block approximation

Figure: A graphon and its block approximations: a) the graphon, b) a coarse block approximation, c) a fine block approximation.

Stochastic block model

The partitioning of nodes into communities is represented by a vector of labels l. Marginal likelihood of the graph:

P(G | l) = ∏_{k ≤ l} ∫_0^1 p^{n_kl} (1 − p)^{o_kl − n_kl} dF(p),

where
1. n_kl is the number of edges between communities k and l,
2. o_kl is the number of possible edges,
3. F is the prior distribution function on the edge probability (assuming an IID prior).
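If the prior F is taken to be Beta(a, b) (an assumed instance of the IID prior, not specified in the talk), each Beta-Bernoulli integral in the marginal likelihood has the closed form B(a + n_kl, b + o_kl − n_kl)/B(a, b). A small illustrative Python sketch (helper names are my own):

```python
from math import lgamma
from itertools import combinations_with_replacement

def log_beta(a, b):
    """Log of the Beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def sbm_log_marginal(adj, labels, a=1.0, b=1.0):
    """Log marginal likelihood of an undirected graph under a stochastic
    block model, integrating out each block's edge probability against an
    IID Beta(a, b) prior (an assumed form of F, giving a closed form).

    adj: symmetric 0/1 adjacency matrix (list of lists);
    labels: community label for each node."""
    communities = sorted(set(labels))
    logml = 0.0
    for k, l in combinations_with_replacement(communities, 2):
        nodes_k = [i for i, c in enumerate(labels) if c == k]
        nodes_l = [i for i, c in enumerate(labels) if c == l]
        if k == l:
            pairs = [(i, j) for i in nodes_k for j in nodes_k if i < j]
        else:
            pairs = [(i, j) for i in nodes_k for j in nodes_l]
        o_kl = len(pairs)                        # possible edges between k and l
        n_kl = sum(adj[i][j] for i, j in pairs)  # observed edges
        # closed form of the integral of p^n_kl (1-p)^(o_kl - n_kl) dF(p)
        logml += log_beta(a + n_kl, b + o_kl - n_kl) - log_beta(a, b)
    return logml
```

For example, for a 4-node graph with two 2-node communities, one edge inside each community and none between them, the marginal likelihood under a uniform prior is (1/2)(1/2)(1/5) = 0.05.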

Anomaly detection with complex models

Data: D. Anomaly score: f(D). P-value: p = P{f(D′) ≥ f(D) | D}, where D′ is a replicate dataset.

With unknown parameters: Data: D. Model: D | θ. Unknowns: θ. Anomaly score: f(D, θ). P-value: p = P{f(D′, θ) ≥ f(D, θ) | D}.

Two ways to eliminate θ: the conservative p-value p⁺ = sup_θ (p), or the expected p-value p̄ = E_θ (p).

Posterior predictive p-value

The posterior predictive p-value is (Meng, 1994; Gelman et al., 1996, Eq. 2.8, Eq. 7)

P = P{f(D′, θ) ≥ f(D, θ) | D},   (1)

where θ represents the model parameters, D is the observed dataset, D′ is a hypothetical replicated dataset generated from the model with parameters θ, and P(· | D) is the joint posterior distribution of (θ, D′) given D. In words: if a new dataset were generated from the same model and parameters, what is the probability that the new discrepancy would be as large?

Discussion: Guttman (1967), Box (1980), Rubin (1984), Bayarri and Berger (2000), Hjort et al. (2006). Posterior predictive p-values in practice: Huelsenbeck et al. (2001), Sinharay and Stern (2003), Thornton and Andolfatto (2006), Steinbakk and Storvik (2009).

Two estimates

Posterior sample: θ_1, ..., θ_n from θ | D.

Estimate I:
1. Simulate data D′_1, ..., D′_n (with D′_i generated under θ_i).
2. Estimate P̂ = (1/n) ∑_{i=1}^n I{f(D′_i, θ_i) ≥ f(D, θ_i)}.

Estimate II:
1. Calculate p-values Q_i = P{f(D′, θ_i) ≥ f(D, θ_i)}, for i = 1, ..., n.
2. Estimate P̂ = (1/n) ∑_{i=1}^n Q_i.

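Both estimates can be sketched on a toy Beta-Binomial model (all modelling choices below, including the discrepancy f(D, θ) = |D/m − θ|, are illustrative, not from the talk). Estimate I simulates one replicate per posterior draw; Estimate II averages the exact conditional p-values Q_i:

```python
import random
from math import comb

def binom_pmf(k, n, p):
    """Binomial(n, p) probability mass at k."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def posterior_predictive_pvalue(x, m, a=1.0, b=1.0, n=2000, seed=0):
    """Monte Carlo estimates I and II of the posterior predictive p-value
    for x successes out of m trials, a Beta(a, b) prior on theta, and the
    (illustrative) discrepancy f(D, theta) = |D/m - theta|."""
    rng = random.Random(seed)
    est1_hits = 0
    est2_sum = 0.0
    for _ in range(n):
        theta = rng.betavariate(a + x, b + m - x)  # posterior draw
        f_obs = abs(x / m - theta)
        # Estimate I: indicator from one simulated replicate dataset
        d_rep = sum(rng.random() < theta for _ in range(m))
        est1_hits += abs(d_rep / m - theta) >= f_obs
        # Estimate II: exact p-value Q_i given theta_i (Rao-Blackwellised)
        est2_sum += sum(binom_pmf(k, m, theta)
                        for k in range(m + 1)
                        if abs(k / m - theta) >= f_obs)
    return est1_hits / n, est2_sum / n

p1, p2 = posterior_predictive_pvalue(x=7, m=20)
```

Estimate II has lower variance (each term is already an expectation over D′), so the two agree up to Monte Carlo error.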

Interpreting posterior predictive p-values

Recall: P = P{f(D′, θ) ≥ f(D, θ) | D}. How should we interpret P? For example: Is P = 40% good? If P = 10^-6, is there cause for alarm? What if 500 tests are performed, and min(P_i) = 10^-6?

Some comments:
- "The interpretation and comparison of posterior predictive p-values [is] a difficult and risky matter" (Hjort et al., 2006)
- "Its main weakness is that there is an apparent double use of the data... this double use of the data can induce unnatural behavior" (Bayarri and Berger, 2000)
- Donald Rubin alludes to some conservative operating characteristics (Rubin, 1996)

The basic problem is: P is not uniformly distributed under the null hypothesis.


Meng's result

In the last pages of Meng (1994), in our notation: suppose θ and D are drawn from the prior and model respectively, and f(D, θ) is absolutely continuous. Then
(i) E(P) = E(U) = 1/2, and
(ii) for any convex function h, E{h(P)} ≤ E{h(U)},
where U is a uniform random variable on [0, 1].
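Meng's moment result is easy to check by simulation in a model where P has a closed form. Below is an illustrative conjugate setup of my own choosing (not from the talk): θ ~ N(0, 1), D | θ ~ N(θ, 1), and discrepancy f(D, θ) = D. Then θ | D ~ N(D/2, 1/2), hence D′ | D ~ N(D/2, 3/2) and P = 1 − Φ(D/(2√1.5)), with D drawn from the prior predictive N(0, 2):

```python
import random
from math import erf, sqrt

def Phi(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def simulate_ppp(n=100000, seed=0):
    """Draw posterior predictive p-values under the toy model
    theta ~ N(0, 1), D | theta ~ N(theta, 1), f(D, theta) = D.
    Here D' | D ~ N(D/2, 3/2), so P = 1 - Phi(D / (2 * sqrt(1.5)))."""
    rng = random.Random(seed)
    return [1.0 - Phi(rng.gauss(0.0, sqrt(2.0)) / (2.0 * sqrt(1.5)))
            for _ in range(n)]

ps = simulate_ppp()
mean_p = sum(ps) / len(ps)                          # Meng (i): about 1/2
var_p = sum((p - 0.5) ** 2 for p in ps) / len(ps)   # Meng (ii), h(p) = (p - 1/2)^2:
                                                    # at most Var(U) = 1/12
```

Here Var(P) comes out around 0.04, comfortably below 1/12 ≈ 0.083: this P is strictly more concentrated around 1/2 than a uniform p-value, i.e. conservative.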

A consequence

Meng (1994) next finds that, as a consequence, P(P ≤ α) ≤ 2α for all α in [0, 1].

The convex order

Let X and Y be two random variables with probability measures μ and ν respectively. We say that μ ≤_cx ν if, for any convex function h, E{h(X)} ≤ E{h(Y)} whenever the expectations exist (Shaked and Shanthikumar, 2007). We say that a probability measure P is sub-uniform if P ≤_cx U, where U is the uniform distribution on [0, 1]. Meng's theorem says that posterior predictive p-values have a sub-uniform distribution.

Reflecting on the theorem

1. The proof is in fact quite straightforward (Jensen's inequality).
2. Is there something more we can say?
   1. The 2α bound suggests that P could be liberal, whereas we had the impression that posterior predictive p-values were conservative. Can the bound be improved?
   2. The theorem seems to describe a huge space of distributions. Can it somehow be reduced?

On taking the average

Theorem (Strassen's theorem). For two probability measures μ and ν on the real line, the following conditions are equivalent:
1. μ ≤_cx ν;
2. there are random variables X and Y with marginal distributions μ and ν respectively such that E(Y | X) = X.

The theorem is due to Strassen (1965) (see also references therein); this version is due to Müller and Rüschendorf (2001). In other words: taking an average is reducing in the convex order.


Theorem. Let μ and ν be two probability measures on the real line, where ν is absolutely continuous. The following conditions are equivalent:
1. μ ≤_cx ν;
2. there exist random variables X and Y with marginal distributions μ and ν respectively such that E(Y | X) = X, and the conditional distribution of Y given X is, with μ-probability one, either singular, i.e. Y = X, or absolutely continuous.

Theorem (Posterior predictive p-values and the convex order). P is a sub-uniform probability measure if and only if there exist random variables P, D, θ and an absolutely continuous discrepancy f(D, θ) such that P = P{f(D′, θ) ≥ f(D, θ) | D}, where P has distribution P, D′ is a replicate of D conditional on θ, and P(· | D) is the joint posterior distribution of (θ, D′) given D.


Exploring the set of sub-uniform distributions

The integrated distribution function of a random variable X with distribution function F_X is

ψ_X(x) = ∫_{−∞}^x F_X(t) dt.

We have (Müller and Rüschendorf, 2001):
1. ψ_X is non-decreasing and convex;
2. its right derivative ψ′_X(x) exists and satisfies 0 ≤ ψ′_X(x) ≤ 1;
3. lim_{x→−∞} ψ_X(x) = 0 and lim_{x→∞} {x − ψ_X(x)} = E(X).

Furthermore, for any function ψ satisfying these properties, there is a random variable X such that ψ is the integrated distribution function of X. The right derivative of ψ is the distribution function of X: F_X(x) = ψ′(x). Let Y be another random variable with integrated distribution function ψ_Y. Then X ≤_cx Y if and only if ψ_X ≤ ψ_Y and lim_{x→∞} {ψ_Y(x) − ψ_X(x)} = 0.

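For the uniform and Beta(2, 2) examples plotted next, the integrated distribution functions have simple closed forms: ψ_U(x) = x²/2 and, since the Beta(2, 2) distribution function is F(t) = 3t² − 2t³ on [0, 1], ψ_Beta(2,2)(x) = x³ − x⁴/2. The pointwise domination characterising Beta(2, 2) ≤_cx U(0, 1) can be checked directly (a small illustrative sketch):

```python
def psi_uniform(x):
    """Integrated distribution function of Uniform(0, 1): x^2 / 2 on [0, 1]."""
    x = min(max(x, 0.0), 1.0)
    return x * x / 2.0

def psi_beta22(x):
    """Integrated distribution function of Beta(2, 2): since its CDF is
    F(t) = 3 t^2 - 2 t^3 on [0, 1], integrating gives x^3 - x^4 / 2."""
    x = min(max(x, 0.0), 1.0)
    return x ** 3 - x ** 4 / 2.0

grid = [i / 1000.0 for i in range(1001)]
# Beta(2, 2) <=_cx Uniform(0, 1): psi_beta22 <= psi_uniform pointwise ...
dominated = all(psi_beta22(x) <= psi_uniform(x) + 1e-12 for x in grid)
# ... with equal limits (equal means): psi(1) = 1 - E(X) = 1/2 for both
equal_limit = abs(psi_beta22(1.0) - psi_uniform(1.0)) < 1e-12
```

Algebraically, ψ_U(x) − ψ_Beta(2,2)(x) = (x²/2)(x − 1)² ≥ 0, confirming the domination.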

Figure: The uniform distribution: integrated distribution function ψ(t) (left) and density (right).

Figure: Some sub-uniform distributions: integrated distribution functions ψ(t) (left) and densities (right). Blue: uniform, green: Beta(2,2).

Figure: Some sub-uniform distributions: integrated distribution functions ψ(t) (left) and densities (right). Blue: uniform, green: Beta(2,2), red: a distribution with maximal mass at 0.3.

Figure: Some constructive examples: a) Model 1, b) Model 1 discrepancy f(x, θ), c) Model 2 densities p(x | θ = 0) and p(x | θ = 1), d) Model 2 discrepancy.

Combining p-values

Independent p-values P_1, ..., P_n. Two canonical tasks:
1. sub-select a set for further analysis;
2. combine the p-values into one overall score of significance.

Discrete p-values

Ordinary p-value (T = f(D)): P = P(T′ ≥ T | T).
Mid-p-value (Lancaster, 1952): Q = ½ P(T′ ≥ T | T) + ½ P(T′ > T | T).

Lemma (Fisher's method for mid-p-values). Under the null hypothesis,
1. Q ≤_cx U, where U is a uniform random variable on the unit interval;
2. as a consequence, for t ≥ 2n,

P(F_n ≥ t) ≤ exp{n − t/2 − n log(2n/t)},

where F_n = −2 ∑_{i=1}^n log(Q_i).

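The quantities in the lemma are straightforward to compute. The sketch below (hypothetical helper names, my own) evaluates the mid-p-value of an observed statistic against a discrete null pmf, Fisher's combined statistic F_n, and the stated tail bound:

```python
from math import exp, log

def mid_p_value(t_obs, pmf):
    """Lancaster's mid-p-value for a discrete statistic: the average of
    P(T >= t_obs) and P(T > t_obs) under the null pmf (a dict mapping
    values of T to probabilities)."""
    p_ge = sum(p for t, p in pmf.items() if t >= t_obs)
    p_gt = sum(p for t, p in pmf.items() if t > t_obs)
    return 0.5 * p_ge + 0.5 * p_gt

def fisher_statistic(qs):
    """Fisher's combined statistic F_n = -2 * sum_i log(Q_i)."""
    return -2.0 * sum(log(q) for q in qs)

def fisher_tail_bound(t, n):
    """Upper bound on P(F_n >= t) for n sub-uniform p-values, valid for
    t >= 2n: exp{n - t/2 - n log(2n / t)}."""
    assert t >= 2 * n
    return exp(n - t / 2.0 - n * log(2.0 * n / t))

# Mid-p-value of observing T = 2 under a Binomial(2, 1/2) null:
# P(T > 2) + (1/2) P(T = 2) = 0 + 0.125
q = mid_p_value(2, {0: 0.25, 1: 0.5, 2: 0.25})
```

Note the bound equals 1 at t = 2n (the mean of F_n for exactly uniform p-values, since each −2 log Q_i is then exponential with mean 2) and decays exponentially beyond it.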

Figure: Fisher's method with mid-p-values. Empirical distribution functions F̂(x) against x for three scenarios (50/50, random binary, grid of ten), with n = 10, β = 20 (top row) and n = 100, β = 5 (bottom row), comparing the ordinary and mid-p-value versions against the line y = x.

References

Aldous, D. J. (1981). Representations for partially exchangeable arrays of random variables. Journal of Multivariate Analysis, 11(4):581-598.
Bayarri, M. and Berger, J. O. (2000). P values for composite null models. Journal of the American Statistical Association, 95(452):1127-1142.
Box, G. E. (1980). Sampling and Bayes' inference in scientific modelling and robustness. Journal of the Royal Statistical Society, Series A (General), pages 383-430.
Gelman, A., Meng, X.-L., and Stern, H. (1996). Posterior predictive assessment of model fitness via realized discrepancies. Statistica Sinica, 6(4):733-760.
Guttman, I. (1967). The use of the concept of a future observation in goodness-of-fit problems. Journal of the Royal Statistical Society, Series B (Methodological), pages 83-100.
Hjort, N. L., Dahl, F. A., and Steinbakk, G. H. (2006). Post-processing posterior predictive p values. Journal of the American Statistical Association, 101(475):1157-1174.
Hoeffding, W. (1963). Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301):13-30.
Hoover, D. N. (1979). Relations on probability spaces and arrays of random variables. Preprint, Institute for Advanced Study, Princeton, NJ.
Huelsenbeck, J. P., Ronquist, F., Nielsen, R., and Bollback, J. P. (2001). Bayesian inference of phylogeny and its impact on evolutionary biology. Science, 294(5550):2310-2314.
Lancaster, H. (1952). Statistical control of counting experiments. Biometrika, pages 419-422.
Lehmann, E. L. and Romano, J. P. (2006). Testing Statistical Hypotheses. Springer Science & Business Media.
Meng, X.-L. (1994). Posterior predictive p-values. The Annals of Statistics, 22(3):1142-1160.
Müller, A. and Rüschendorf, L. (2001). On the optimal stopping values induced by general dependence structures. Journal of Applied Probability, 38(3):672-684.
Rubin, D. B. (1984). Bayesianly justifiable and relevant frequency calculations for the applied statistician. The Annals of Statistics, 12(4):1151-1172.
Rubin, D. B. (1996). Discussion of Gelman, Meng and Stern, "Posterior predictive assessment of model fitness via realized discrepancies". Statistica Sinica, 6(4):733-760.
Rubin-Delanchy, P. and Heard, N. A. (2014). A test for dependence between two point processes on the real line. arXiv preprint arXiv:1408.3845.
Rubin-Delanchy, P. and Lawson, D. J. (2014). Posterior predictive p-values and the convex order. arXiv preprint arXiv:1412.3442.
Shaked, M. and Shanthikumar, J. G. (2007). Stochastic Orders. Springer.
Sinharay, S. and Stern, H. S. (2003). Posterior predictive model checking in hierarchical models. Journal of Statistical Planning and Inference, 111(1):209-221.
Steinbakk, G. H. and Storvik, G. O. (2009). Posterior predictive p-values in Bayesian hierarchical models. Scandinavian Journal of Statistics, 36(2):320-336.
Strassen, V. (1965). The existence of probability measures with given marginals. The Annals of Mathematical Statistics, 36(2):423-439.
Thornton, K. and Andolfatto, P. (2006). Approximate Bayesian inference reveals evidence for a recent, severe bottleneck in a Netherlands population of Drosophila melanogaster. Genetics, 172(3):1607-1619.