Signal processing tools for large-scale network monitoring: a state of the art
1 Signal processing tools for large-scale network monitoring: a state of the art
Kavé Salamatian, Université de Savoie
Outline:
- Introduction & methodologies
- Anomaly detection basics
- Statistical anomaly detection basis
- Parametric methods: a refresher on linear system theory, model calibration, filtering and processing, a refresher on Neyman-Pearson test theory
- Non-parametric methods: sketches and hashes, random distributions and KL distance, Sanov theorem and Stein lemma
- Distributed anomaly detection (if we have time): distributed KLT
2 Introduction and methodology
The network state results from:
- Traffic demand: the traffic matrix, not directly observable
- Capacity offer: routing matrix, link capacities, traffic engineering, etc., monitored by SNMP, etc.
The network manager's goal is to drive this equilibrium to the most beneficial point by managing the capacity offer. Traffic engineering is the art of managing offered capacity.
3 Network monitoring
Monitoring means being able to separate:
- What is predictable: expected, normal, under control
- What is not predictable: unexpected, abnormal
Interpretation framework: only what is unpredictable carries meaning; what can be predicted carries no information.
Classes of anomaly detection:
- Deterministic approaches: signature-based; complexity vs. exhaustivity trade-off
- Statistical approaches: probabilistic; false positive vs. detection rate trade-off
4 Deterministic approaches
Each observation is assumed to result from a known causality chain. Every anomaly is characterized by a causality chain that describes its signature, and anomaly detection consists of backtracking the causality sequence. This needs an exhaustive anomaly-signature database and has to check all possible signatures: not scalable.
Statistical approach
A dataset is a unique sequence of deterministic observations that is no longer perfectly reproducible. The statistical approach assumes that observations result from a random function, i.e. that they come from a random choice in an (in)finite set of "possible" observations. Is this assumption sound? It gives access to an arsenal of probabilistic methods.
Stationarity is the intrinsic hypothesis: statistical properties hold over time. This is a bet on the future, with the possibility of radical error.
Essential: the false alarm vs. misdetection trade-off.
5 Empirical modeling
Challenge: a measurement contains two components:
- A structured component that reflects essential properties of the phenomenon under study
- A random component that represents fluctuations put aside from the model
The modeling challenge is to separate structure from randomness and to characterize it; signal processing's role is to do this filtering.
About interpretation: we have measurements, but what do they mean? Interpreting means relating effects to causes, being able to predict behaviours at different timescales, and being able to react. Interpretation needs an a priori.
6 Plato's cave allegory
Socrates: "Compare our nature in respect of education and its lack to such an experience as this... picture men dwelling in a sort of cavern... Picture further the light from a fire burning higher up and at a distance behind them, and between the fire and the prisoners and above them men carrying past the wall implements of all kinds."
Glaucon: "A strange image you speak of, and strange prisoners."
Socrates: "Like to us, for, to begin with, tell me, do you think that these men would have seen anything of themselves or of one another except the shadows cast from the fire on the wall of the cave that fronted them?"
Modelling approaches in networking
Descriptive approach (top-down), the one most used in measurement papers: the network is a black box with unknown structure, and observations are described only through descriptive statistics (mean, variance, Hurst or multifractal parameters, etc.). Begin with observations and derive descriptive parameters. Drawbacks: it does not explain why, it does not answer what if, the parameters are difficult to interpret (interpretation needs an a priori), and it does not use all the available information, since we may have a priori knowledge of the process generating the observations.
Constructive approach (bottom-up), the classical approach: derive IP performance through an explicative model of the processes involved in the network. The network is constituted of queues and routers; this uses simulation (ns) or analytic tools (queuing theory, network calculus, etc.). Begin from an input scenario and the network structure and derive performance measures. Drawbacks: generalization is difficult, there are too many parameters, simulation results do not describe real measurements, and the approach is open-loop.
7 Interpretation framework
What are the hidden causes (X and θ) that have led to observing Y? The a priori model compresses all of our understanding of the process involved in the generation of the observations into Y = M(X, θ), where θ is the context and X the hidden input.
Interpretation: we have to deal with two main inverse problems:
- Modelling problem: what are the context parameters θ that best describe the environment?
- Interpretation problem: knowing the context θ, what is the hidden input X that best describes the observations?
A lot of measurement problems can be expressed in this framework: anomaly detection, flow classification and recognition, etc.
8 Statistical anomaly detection: basics
What is an anomaly? What is normal? What is likely?
Assumptions:
- Anomalies are different from the norm
- Anomalies are rare
- Anomalous activity is malicious (reverse assumption: anomalies represent attacks or malicious behavior)
Are these assumptions correct?
9 What is likely? Typical set
The typical set $A_\epsilon^{(n)}$ is the set of observed sequences $Y_k^{k+n}$ such that
$2^{-n(H(Y)+\epsilon)} \le \Pr\{Y_k^{k+n}\} \le 2^{-n(H(Y)-\epsilon)}$
From the AEP: $\Pr\{A_\epsilon^{(n)}\} \ge 1-\epsilon$ and $|A_\epsilon^{(n)}| \le 2^{n(H(Y)+\epsilon)}$.
Universal anomaly detector
A simple anomaly detector: calculate the likelihood $\Pr\{Y_k^{k+n}\}$ of an observed sequence and check whether
$2^{-n(H(Y)+\epsilon)} \le \Pr\{Y_k^{k+n}\} \le 2^{-n(H(Y)-\epsilon)}$
If not, it is an anomaly. This is a non-parametric and universal anomaly detector. But how do we calculate the likelihood? We need simplifying hypotheses.
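The typicality test above can be sketched for an i.i.d. discrete source, assuming the source distribution p is known (the alphabet, windows, and ε value here are illustrative choices, not from the slides):

```python
import math

def is_typical(seq, p, eps=0.5):
    """Check whether a sequence lies in the eps-typical set of an i.i.d. source p."""
    n = len(seq)
    H = -sum(q * math.log2(q) for q in p.values())   # source entropy H(Y), in bits
    log_prob = sum(math.log2(p[s]) for s in seq)     # log2 Pr{seq} under independence
    # Typicality: 2^{-n(H+eps)} <= Pr{seq} <= 2^{-n(H-eps)}
    return -n * (H + eps) <= log_prob <= -n * (H - eps)

p = {"a": 0.9, "b": 0.1}
normal = ["a"] * 9 + ["b"]     # matches the source statistics
anomaly = ["b"] * 10           # far too many rare symbols
print(is_typical(normal, p), is_typical(anomaly, p))   # → True False
```

The anomalous window is rejected because its likelihood falls below the lower AEP bound, exactly the detector described on the slide.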
10 Filtering
What if the observations are independent? Then $\Pr\{Y_k^{k+n}\} = \prod_{i=k}^{k+n} \Pr\{Y_i\}$, and $(\Pr\{Y_k\})^n$ for identically distributed samples, so the likelihood becomes easy to compute.
We have to transform the observations so that they become independent: a basic signal processing question.
Pipeline: input signal → pre-processing → flattened signal → anomaly detector.
Traffic dynamics (figure)
11 OD dynamics
We need to capture all correlations:
- Temporal correlation: short-scale, daily, weekly, etc.
- Spatial correlation: same-PoP traffic
Anomaly detector structure: two generic steps
1. Entropy reduction / filtering: remove dependencies in the data, compress the data (information bottleneck), remove the predictable component
2. Decision / detection: apply the typicality test on the resulting, hopefully independent, signal; a statistical test in the Neyman-Pearson framework, or another (Bayesian, etc.)
12 Taxonomy of approaches
- Parametric: assume a parametric dependency model
- Non-parametric: break the dependencies in a universal way
Parametric approaches
Assume a parametric structure for the dependencies (a normal-behavior model), calibrate the model (MLE estimation), and filter the model prediction from the observations. This results in an i.i.d. Gaussian innovation process; then apply a Neyman-Pearson statistical test with H0: the observation is compatible with a zero mean and a given variance.
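A minimal sketch of the final decision step, assuming the innovation really is i.i.d. zero-mean Gaussian with known variance under H0 (the function name, window contents, and the ~1% two-sided false-alarm level are illustrative assumptions):

```python
import math

def np_test(innovation, sigma, threshold=2.576):
    """Flag a window whose standardized mean exceeds the H0 threshold.

    Under H0 the innovation is i.i.d. N(0, sigma^2), so the standardized
    window mean is N(0, 1); threshold=2.576 gives ~1% false alarms (two-sided).
    """
    n = len(innovation)
    z = sum(innovation) / (sigma * math.sqrt(n))
    return abs(z) > threshold   # True => reject H0 => anomaly

normal_window = [0.5, -0.3, 0.1, -0.2] * 25      # mean ~ 0: compatible with H0
shifted_window = [2.0] * 100                     # mean shift: an anomaly
print(np_test(normal_window, sigma=1.0), np_test(shifted_window, sigma=1.0))
```

The threshold trades false alarms against misdetections, which is exactly the Neyman-Pearson trade-off named earlier in the talk.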
13 Non-parametric approach
- Apply a de-correlating transform: random projections, hashing, sketches, etc.
- Learn the resulting distribution of normal behavior: clustering, SVM, etc.
- Detect anomalies by checking deviation from the distribution of normal behavior, e.g. with the Kullback-Leibler distance.
Parametric techniques
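The deviation check can be sketched with an empirical Kullback-Leibler distance between a learned baseline histogram and the current window (the protocol-mix example and the helper names are hypothetical):

```python
import math
from collections import Counter

def kl_divergence(p, q, eps=1e-9):
    """D(p || q) in bits between two distributions over the same support."""
    return sum(pi * math.log2(pi / max(q.get(k, 0.0), eps))
               for k, pi in p.items() if pi > 0)

def empirical(samples):
    """Empirical distribution (normalized histogram) of a sample list."""
    counts = Counter(samples)
    n = len(samples)
    return {k: v / n for k, v in counts.items()}

baseline = empirical(["tcp"] * 80 + ["udp"] * 15 + ["icmp"] * 5)
window = empirical(["tcp"] * 40 + ["udp"] * 10 + ["icmp"] * 50)  # ICMP flood?
print(round(kl_divergence(window, baseline), 3))
```

A window matching the baseline gives a divergence near zero; a large value signals a deviation from normal behavior.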
14 Anomaly detection steps
Modeling → filtering → decision. How can we find a model?
- Physical models: a direct relation with physical parameters; high order, approximative, need complete knowledge of the process, and the physical parameters must be known. For example, TCP models or Bianchi's CSMA/CA model.
- System identification: based on measured input/output data; an inverse statistical problem: find the model that most likely generated the observed data. Simple and efficient, with relatively little knowledge required, but of limited validity (operating point, type of input, sensors, measurement noise, unknown model structure). This is the most frequent framework in practice.
15 Model structure
The first step is to define a correlation/model structure integrating temporal and spatial correlation. Traffic is a dynamic entity. Assume a state vector
$X_k = (x^1_k, x^1_{k-1}, \ldots, x^1_{k-T},\; x^2_k, \ldots, x^2_{k-T},\; \ldots,\; x^L_k, \ldots, x^L_{k-T})$
Generic dynamical model:
$X_{k+1} = f(X_k) + g(U_k)$
$Y_k = h(X_k) + i(U_k)$
Linear approximation: linearize the generic dynamic model using a Taylor series expansion,
$f(X_k) = f(0) + \left.\frac{\partial f}{\partial X_k}\right|_0 X_k + \left.\frac{\partial^2 f}{\partial X_k^2}\right|_0 \frac{X_k^2}{2} + \cdots$
One can extend the state vector to integrate powers of the state, $X_k^2, \ldots, X_k^N$: any non-linear dynamical model can be represented by a linear dynamical model with a large enough state vector.
16 Linear time-invariant systems
After linearization:
$X_{k+1} = A X_k + B U_k$
$Y_k = C X_k + D U_k$
A, B, C and D can change with time (a time-variant system). The system might be without input, or have a random input.
Linear system representations
- Time-domain representations: input/output; state-space
- Frequency-domain representations: frequency response; poles and zeros
All four are equivalent; sometimes one of these representations is more useful than the others.
17 Time-domain representations
(1) Input/output: $y(k) = H[x(k)]$. If one knows the impulse response $h(k) = H[\delta(k)]$ of a system, one can derive the response of the system to any input:
$y(k) = x(k) * h(k) = \sum_{l=0}^{k} x(k-l)\, h(l)$
(2) State-space representation:
$X_{k+1} = A X_k + B U_k$
$Y_k = C X_k + D U_k$
For a random i.i.d. input $\epsilon_k$:
$X_{k+1} = A X_k + B \epsilon_k$
$Y_k = C X_k + D \epsilon_k$
This is an autoregressive model when $B = I$, and an ARMA model in general.
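The convolution sum above can be sketched directly; feeding a unit impulse through it recovers the impulse response, as the definition promises (the smoother used for h is a made-up toy filter):

```python
def convolve(x, h):
    """Discrete convolution y(k) = sum_{l=0..k} x(k-l) * h(l)."""
    y = [0.0] * (len(x) + len(h) - 1)
    for k in range(len(y)):
        for l in range(len(h)):
            if 0 <= k - l < len(x):
                y[k] += x[k - l] * h[l]
    return y

# Truncated impulse response of a first-order smoother: h(l) = 0.5 * 0.5^l
h = [0.5 * 0.5 ** l for l in range(4)]
x = [1.0, 0.0, 0.0, 0.0]           # unit impulse delta(k)
print(convolve(x, h)[:4])          # → [0.5, 0.25, 0.125, 0.0625]
```

Any other input is then handled by the same sum, which is the point of knowing h(k).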
18 Frequency representations
(1) Frequency response: $\mathcal{F}(h(k)) = H(\omega)$, and $Y(\omega) = H(\omega) X(\omega)$.
For random signals: $S_X(\omega) = \mathcal{F}(R_{XX}(\tau))$ and $S_Y(\omega) = |H(\omega)|^2 S_X(\omega)$.
(2) Z-transform (or Laplace transform): $X(z) = \mathcal{Z}(x(k)) = \sum_{k=0}^{\infty} x(k) z^{-k}$, and $Y(z) = H(z) X(z)$.
Poles and zeros are the values of z for which the transfer function H(z) becomes infinite or zero, respectively. For stochastic processes the mean of the z-transform gives the characteristic function, and the moments of the distribution are derivatives of the characteristic function.
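The identity $Y(\omega) = H(\omega) X(\omega)$ can be checked numerically: for finite sequences, linear convolution in time corresponds exactly to a product of discrete-time Fourier transforms (the filter taps, input, and frequency below are arbitrary illustrative values):

```python
import cmath

def dtft(x, w):
    """Discrete-time Fourier transform of a finite sequence, at frequency w."""
    return sum(xk * cmath.exp(-1j * w * k) for k, xk in enumerate(x))

def convolve(x, h):
    """Linear convolution y = x * h."""
    y = [0.0] * (len(x) + len(h) - 1)
    for k in range(len(y)):
        for l in range(len(h)):
            if 0 <= k - l < len(x):
                y[k] += x[k - l] * h[l]
    return y

h = [0.5, 0.3, 0.2]            # impulse response of a toy FIR filter
x = [1.0, -1.0, 2.0, 0.5]      # arbitrary input
w = 1.3                        # any frequency in radians/sample
lhs = dtft(convolve(x, h), w)  # Y(w): transform of the time-domain output
rhs = dtft(h, w) * dtft(x, w)  # H(w) * X(w)
print(abs(lhs - rhs) < 1e-9)   # → True
```

The same relation, applied to power rather than amplitude, gives the PSD formula $S_Y = |H|^2 S_X$ on the slide.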
19 Example
An AR(4) process $x(k) = 0.91\,x(k-1) + \cdots$ in $x(k-1), \ldots, x(k-4)$ (the remaining coefficients are not legible in the transcription).
Model calibration: several approaches
- Linear regression: $x(k) = a_1 x(k-1) + a_2 x(k-2) + a_3 x(k-3) + a_4 x(k-4)$; there is an issue with the correlation between values
- Maximum likelihood: find the model that has the largest probability of producing the output; EM algorithm
- Approximation with minimal squared error: PCA / Karhunen-Loève expansion, finding the best approximation of the normal behavior with a given dimension
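The linear-regression calibration can be sketched as a least-squares fit of AR coefficients (a sketch under stated assumptions: noiseless synthetic AR(2) data, plain normal equations with Gaussian elimination; real tools would use Levinson-Durbin or a linear-algebra library):

```python
def fit_ar(x, order):
    """Least-squares fit of a_1..a_p in x(k) = sum_i a_i x(k-i)."""
    p = order
    rows = [x[k - 1::-1][:p] for k in range(p, len(x))]   # [x(k-1), ..., x(k-p)]
    y = x[p:]
    # Normal equations: (R^T R) a = R^T y
    A = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    b = [sum(r[i] * yk for r, yk in zip(rows, y)) for i in range(p)]
    # Gaussian elimination, then back-substitution
    for i in range(p):
        piv = A[i][i]
        for j in range(i + 1, p):
            f = A[j][i] / piv
            A[j] = [aj - f * ai for aj, ai in zip(A[j], A[i])]
            b[j] -= f * b[i]
    a = [0.0] * p
    for i in reversed(range(p)):
        a[i] = (b[i] - sum(A[i][j] * a[j] for j in range(i + 1, p))) / A[i][i]
    return a

# Synthetic AR(2): x(k) = 0.6 x(k-1) - 0.2 x(k-2), noiseless for illustration
x = [1.0, 0.5]
for _ in range(50):
    x.append(0.6 * x[-1] - 0.2 * x[-2])
print([round(c, 3) for c in fit_ar(x, 2)])   # → [0.6, -0.2]
```

With noiseless data the true coefficients are recovered exactly; with real, correlated traffic the slide's caveat about correlation between regressors applies.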
20 Biased/unbiased correlation (figure)
Principal component analysis
Suppose we have correlated, centered Gaussian RVs $(X_1, \ldots, X_K)$. PCA finds an orthonormal basis $P = (\phi_1, \ldots, \phi_K)$, with $Y = PX$, such that the variance along the principal components is maximized. After applying PCA, one can write X in the new coordinate system as $X = \sum_{j=1}^{K} Y_j \phi_j$.
How to derive the basis? Use the covariance matrix $\Sigma = E\{XX^T\}$: the basis vectors are its eigenvectors, $\Sigma \phi_i = \lambda_i \phi_i$, and $\mathrm{var}\{Y_i\} = \lambda_i$.
21 PCA as an approximation
Transform matrix $U = [\phi_1\, \phi_2 \cdots \phi_K]$, $Y = U X$. An L-dimensional approximation of X can be derived by choosing L < K columns of U: $U_L = [\phi_1\, \phi_2 \cdots \phi_L]$, $Y_L = U_L X$, and
$\hat{X} = U_L^T Y_L = U_L^T U_L X = \sum_{j=1}^{L} Y_j \phi_j$
with error
$E\{(X - \hat{X})^2\} = \sum_{j=L+1}^{K} \lambda_j$
This gives the MMSE approximation of dimension L.
Extension to stochastic processes: the Karhunen-Loève expansion theorem. One can rewrite a zero-mean stationary stochastic process X(t) as an orthogonal series expansion
$X(t) = \sum_{i=1}^{\infty} Y_i \Phi_i(t)$
where the $Y_i$ are pairwise independent RVs and the $\Phi_i(t)$ are pairwise orthogonal deterministic functions. The KLE theorem for discrete processes is
$X[k] = \sum_{i=1}^{\infty} Y_i \Phi_i[k]$
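PCA via the covariance eigen-decomposition can be sketched in the 2-D case, where the eigenvalues have a closed form (the synthetic data and function name are illustrative; real use would call a linear-algebra library for K > 2):

```python
import math

def pca_2d(samples):
    """PCA for centered 2-D data: eigen-decomposition of the 2x2 covariance."""
    n = len(samples)
    cxx = sum(x * x for x, _ in samples) / n
    cyy = sum(y * y for _, y in samples) / n
    cxy = sum(x * y for x, y in samples) / n
    # Eigenvalues of Sigma = [[cxx, cxy], [cxy, cyy]]
    tr, det = cxx + cyy, cxx * cyy - cxy * cxy
    l1 = tr / 2 + math.sqrt(tr * tr / 4 - det)
    l2 = tr / 2 - math.sqrt(tr * tr / 4 - det)
    # Leading eigenvector phi_1: (Sigma - l1*I) v = 0
    v = (cxy, l1 - cxx) if abs(cxy) > 1e-12 else (1.0, 0.0)
    norm = math.hypot(*v)
    return (l1, l2), (v[0] / norm, v[1] / norm)

# Strongly correlated centered data: nearly all variance along one direction
samples = [(t, t + 0.01 * ((-1) ** i)) for i, t in enumerate(
    [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0])]
(l1, l2), phi1 = pca_2d(samples)
print(l1 > 100 * l2)   # → True: one dominant principal component
```

Keeping only $\phi_1$ here is the L = 1 MMSE approximation; the discarded error is exactly the small eigenvalue $\lambda_2$, as the error formula on the slide states.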
22 How to derive the KLE basis
For a time-continuous process, use a Galerkin approximation: the basis functions solve the integral eigen-equation
$\int \Phi_l(s)\, R(s-t)\, ds = \lambda_l \Phi_l(t)$
with R the autocovariance kernel. For discrete processes, apply PCA to the lagged-trajectory matrix built by stacking shifted windows $x(k), x(k+1), \ldots, x(k+n)$ of the observed signal, yielding $X[k] = \sum_{i=1}^{K} Y_i \Phi_i[k]$.
KLE for modeling
Assume a signal with model $X_{k+1} = A X_k + \epsilon_k$. In PCA coordinates, using the KLE approximation:
$Y_{k+1} = U^T A U\, Y_k + U^T \epsilon_k$
This has effects on the poles and zeros.
23 Normal behaviour?
What is the normal behavior characterized by the model? The time and frequency representations are equivalent. PCA model (figures)
24 Second step: filtering
A filter separates the signal space into two subspaces: signals that pass through the filter and signals that are rejected by it. The anomaly detection filter should pass everything that is compatible with the normal behavior and block everything that is not.
Simplistic approach: subtract the signal predicted by the normal-behavior model from the observation,
$e(t) = X(t) - \hat{X}(t)$
But what to do if we have measurement errors? How do we ensure that the error does not propagate in the prediction?
25 Kalman filter
Filtering what is compatible with the model, in two steps:
- Prediction: $\hat{X}^-(t+1) = A \hat{X}(t)$
- Correction: $\hat{X}(t+1) = \hat{X}^-(t+1) + K(t+1)\,\big(Y(t+1) - C \hat{X}^-(t+1)\big)$
Innovation: $\epsilon(t+1) = Y(t+1) - C \hat{X}^-(t+1)$
Kalman filter effect (figure)
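The predict/correct loop can be sketched in the scalar case, with the gain K computed from the error covariance (a minimal sketch assuming a known scalar model x(t+1) = a x(t) + w, y(t) = c x(t) + v; the noise variances q, r and the toy data are illustrative):

```python
def kalman_1d(ys, a, c, q, r, x0=0.0, p0=1.0):
    """Scalar Kalman filter; returns the innovation sequence, which should
    look like white noise when the observations match the model (H0)."""
    x, p = x0, p0
    innovations = []
    for y in ys:
        # Prediction
        x_pred = a * x
        p_pred = a * a * p + q
        # Innovation and gain
        innov = y - c * x_pred
        s = c * c * p_pred + r
        k = p_pred * c / s
        # Correction
        x = x_pred + k * innov
        p = (1 - k * c) * p_pred
        innovations.append(innov)
    return innovations

# Noiseless AR(1) observations: the filter locks on and innovations vanish
ys = [1.0]
for _ in range(30):
    ys.append(0.9 * ys[-1])
inn = kalman_1d(ys, a=0.9, c=1.0, q=1e-6, r=1e-6)
print(abs(inn[-1]) < 1e-3)   # → True
```

The innovation sequence is then what the Neyman-Pearson test of the parametric pipeline is applied to.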
26 Kalman filter results (figures): innovation; recalibration needed
27 State of the art
Model + filtering + decision:
- PCA + difference + thresholding (Lakhina et al., 04)
- KLE + difference + thresholding (Brauckhoff et al., 09)
- ML + Kalman + thresholding (Soule et al., 05)
- KLE + Kalman + thresholding (Ndong et al., 11)
- KLE + Kalman + fuzzy (Ndong et al., 11)
Effect of sampling: Hohn et al., 03; Brauckhoff et al., 10
Results (figures)
28 Distributed case
How do we distribute anomaly detection over a large-scale network by exchanging information between nodes? Constraints: the amount of traffic used, selfishness, and privacy issues.
State sharing
- Each node maintains a state vector
- How to share useful information about states with the interested nodes?
- In the simplest setting, all nodes are interested in all states
- Nodes should be able to define which variables they are interested in
29 A crash course in linear estimation
Let's assume that we observe Y and want to estimate X. The MMSE estimator is known to be $\hat{X} = E\{X \mid Y\}$. For jointly Gaussian variables this reduces to a linear estimator.
A crash course in source compression
How to represent n vectors using nR bits? The optimal local compression scheme consists of two stages:
- a projection stage, where the state vector is projected linearly into an orthonormal space
- a quantization stage, which assigns the bit budget to the different projection dimensions following a water-filling argument
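For centered jointly Gaussian scalars, the linear MMSE estimator is the gain cov(X, Y)/var(Y) applied to the observation (the toy observation model y = x + v and the numbers below are illustrative assumptions):

```python
def lmmse(cov_xy, var_y):
    """Return the linear MMSE estimator y -> x_hat for centered scalars."""
    g = cov_xy / var_y
    return lambda y: g * y

# Toy model: y = x + v with var(x) = 4, var(v) = 1, x and v independent
# => cov(x, y) = 4 and var(y) = 5, so x_hat = 0.8 * y
est = lmmse(cov_xy=4.0, var_y=5.0)
print(est(2.5))   # → 2.0
```

The gain shrinks the observation toward zero in proportion to how noisy it is, which is the behavior the vector-valued estimator generalizes.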
30 Projection phase
Applying the KLT/PCA maps the correlated state vector into an uncorrelated vector. Choosing just K columns gives an estimator: the MMSE estimator of dimension K.
Quantization phase
Choose the number of components to transmit, then assign bits to each chosen component. This yields the signal that is actually sent.
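The bit-budget assignment can be sketched with a greedy allocation that repeatedly gives one bit to the component whose quantization error is currently largest (a simple stand-in for the slide's water-filling argument, not the exact scheme; the variances and the "one bit quarters the error" model are illustrative assumptions):

```python
def allocate_bits(variances, budget):
    """Greedily allocate `budget` bits across components by current error."""
    bits = [0] * len(variances)
    err = list(variances)
    for _ in range(budget):
        i = max(range(len(err)), key=lambda j: err[j])
        bits[i] += 1
        err[i] /= 4.0   # each extra bit reduces quantization error ~6 dB
    return bits

# Component variances after the KLT (eigenvalues), largest first
print(allocate_bits([8.0, 2.0, 0.5], budget=6))   # → [3, 2, 1]
```

High-variance principal components get more bits, exactly the qualitative outcome water filling produces.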
31 Single-hop case
m nodes are all connected (directly or through an overlay) to a node c. An approximation of the states of the m nodes should be derived at c, so compression is needed: sampling, local compression, or distributed compression.
Local compression
Each node locally:
- calculates its covariance
- applies a KLT
- applies the resulting projection
- does the quantization
- forwards the compressed state vector to node c
The state vector is then reconstructed at node c.
32 Distributed compression
Node i sends node c a noisy projection of its state. What are the optimal projection and the optimal quantization? From what is received at node c, the states can be estimated; a node therefore does not need to send the full state, just the estimation error.
An iterative approach
Each terminal optimizes its local encoder while all the other encoders are held fixed. The algorithm terminates when it has converged, and leads to a local minimum. It can be seen as a distributed learning phase using an EM loop.
(Figure: an iteration of the distributed KLT algorithm; C2 and C3 are kept fixed while encoder 1 is chosen optimally.)
33 Optimum projection (figure)
Multi-hop scenario
A node maintains three data structures:
- a vector of local states
- a preference list of state variables, with a weighting list and a maximal variance
- a list of received projections
34 Reception processing
- Extract the node and state-variable IDs and assign each received value to the correct variables.
- If a new projection or a new remote state variable is observed, update the data structures.
- Re-estimate the covariance matrix every 30 transmissions.
- Knowing the covariance, the estimation of the remote state variables proceeds.
Updating the covariance estimation
Remote variables are not available locally, but projections are; from them one estimates the covariance between the states in nodes i and j, and the covariance between state variables in neighbor nodes.
35 Preference-list processing
At time k = 0 the list contains the IDs of the state variables node i is interested in, with a high weight. This list is forwarded to the neighbors. After receiving a neighbor's preference list, a lower weight is assigned to the new values.
Forwarding scheme
- Forward variables that are correlated with the neighbors' interests
- An incentive/punishment mechanism
- The node implements the distributed compression: apply the optimal projection, and forward the projections when they change
36 13/06/11 Incentive or punishment?
The node adds estimated variables from its own preference list to the forwarded variables. This adds its estimation error as noise into the neighbors' estimation process, so the neighbors have an incentive to help it reduce its estimation errors: cooperation by punishment rather than by incentive.
Enforcing collaboration
The proposed punishment mechanism acts as a shadow price. If a mechanism for enforcing shadow prices exists, we can implement a Pareto-optimal cooperation mechanism: being social becomes helpful. How to enforce shadow prices? By entangling the performance of our neighbors' estimation with our own estimation; linear projections provide an elegant solution. Cooperation by punishment, not by incentives.
37 Cooperation scheme
- Cooperate: forward projections
- Learn the environment: infer the covariance with the neighbors and use it to estimate the best projections
- Benefit from it: estimate the variables in your preference list
- Want more benefits? Behave better with your neighbors
Performance (figures)
38 Convergence (figures): being social is helpful
39 Conclusion & perspectives
We defined a cooperative framework for networks, illustrated with distributed state sharing. It is applicable to a large set of scenarios: DTNs, distributed compression/transmission. The essence of the distributed setting is social intelligence; node selfishness is essential.
Conclusion
A new job is emerging: network monitor. It needs new skills, mixing computer science, networking, signal processing, statistics, etc.: a new educational challenge. Network monitoring and anomaly detection are becoming mature, but there are issues in transferring research results to practice: high sophistication, a lack of sources of real traffic to evaluate and calibrate the methods, tools that are mostly very simplistic, and issues with ground truth and the real value/cost of anomaly detection and non-detection.
More informationCS 591.03 Introduction to Data Mining Instructor: Abdullah Mueen
CS 591.03 Introduction to Data Mining Instructor: Abdullah Mueen LECTURE 3: DATA TRANSFORMATION AND DIMENSIONALITY REDUCTION Chapter 3: Data Preprocessing Data Preprocessing: An Overview Data Quality Major
More informationSome Research Challenges for Big Data Analytics of Intelligent Security
Some Research Challenges for Big Data Analytics of Intelligent Security Yuh-Jong Hu hu at cs.nccu.edu.tw Emerging Network Technology (ENT) Lab. Department of Computer Science National Chengchi University,
More informationAn Introduction to Machine Learning
An Introduction to Machine Learning L5: Novelty Detection and Regression Alexander J. Smola Statistical Machine Learning Program Canberra, ACT 0200 Australia Alex.Smola@nicta.com.au Tata Institute, Pune,
More informationA Learning Based Method for Super-Resolution of Low Resolution Images
A Learning Based Method for Super-Resolution of Low Resolution Images Emre Ugur June 1, 2004 emre.ugur@ceng.metu.edu.tr Abstract The main objective of this project is the study of a learning based method
More informationIntrusion Detection via Machine Learning for SCADA System Protection
Intrusion Detection via Machine Learning for SCADA System Protection S.L.P. Yasakethu Department of Computing, University of Surrey, Guildford, GU2 7XH, UK. s.l.yasakethu@surrey.ac.uk J. Jiang Department
More informationHow To Understand The Theory Of Probability
Graduate Programs in Statistics Course Titles STAT 100 CALCULUS AND MATR IX ALGEBRA FOR STATISTICS. Differential and integral calculus; infinite series; matrix algebra STAT 195 INTRODUCTION TO MATHEMATICAL
More informationPredict Influencers in the Social Network
Predict Influencers in the Social Network Ruishan Liu, Yang Zhao and Liuyu Zhou Email: rliu2, yzhao2, lyzhou@stanford.edu Department of Electrical Engineering, Stanford University Abstract Given two persons
More informationEnhancing the SNR of the Fiber Optic Rotation Sensor using the LMS Algorithm
1 Enhancing the SNR of the Fiber Optic Rotation Sensor using the LMS Algorithm Hani Mehrpouyan, Student Member, IEEE, Department of Electrical and Computer Engineering Queen s University, Kingston, Ontario,
More informationComponent Ordering in Independent Component Analysis Based on Data Power
Component Ordering in Independent Component Analysis Based on Data Power Anne Hendrikse Raymond Veldhuis University of Twente University of Twente Fac. EEMCS, Signals and Systems Group Fac. EEMCS, Signals
More informationSpatial Statistics Chapter 3 Basics of areal data and areal data modeling
Spatial Statistics Chapter 3 Basics of areal data and areal data modeling Recall areal data also known as lattice data are data Y (s), s D where D is a discrete index set. This usually corresponds to data
More informationAutomated Stellar Classification for Large Surveys with EKF and RBF Neural Networks
Chin. J. Astron. Astrophys. Vol. 5 (2005), No. 2, 203 210 (http:/www.chjaa.org) Chinese Journal of Astronomy and Astrophysics Automated Stellar Classification for Large Surveys with EKF and RBF Neural
More informationPricing and calibration in local volatility models via fast quantization
Pricing and calibration in local volatility models via fast quantization Parma, 29 th January 2015. Joint work with Giorgia Callegaro and Martino Grasselli Quantization: a brief history Birth: back to
More informationCONTROL SYSTEMS, ROBOTICS, AND AUTOMATION - Vol. V - Relations Between Time Domain and Frequency Domain Prediction Error Methods - Tomas McKelvey
COTROL SYSTEMS, ROBOTICS, AD AUTOMATIO - Vol. V - Relations Between Time Domain and Frequency Domain RELATIOS BETWEE TIME DOMAI AD FREQUECY DOMAI PREDICTIO ERROR METHODS Tomas McKelvey Signal Processing,
More informationClassification of Bad Accounts in Credit Card Industry
Classification of Bad Accounts in Credit Card Industry Chengwei Yuan December 12, 2014 Introduction Risk management is critical for a credit card company to survive in such competing industry. In addition
More informationAUTOMATION OF ENERGY DEMAND FORECASTING. Sanzad Siddique, B.S.
AUTOMATION OF ENERGY DEMAND FORECASTING by Sanzad Siddique, B.S. A Thesis submitted to the Faculty of the Graduate School, Marquette University, in Partial Fulfillment of the Requirements for the Degree
More informationContent. Professur für Steuerung, Regelung und Systemdynamik. Lecture: Vehicle Dynamics Tutor: T. Wey Date: 01.01.08, 20:11:52
1 Content Overview 1. Basics on Signal Analysis 2. System Theory 3. Vehicle Dynamics Modeling 4. Active Chassis Control Systems 5. Signals & Systems 6. Statistical System Analysis 7. Filtering 8. Modeling,
More informationAdaptive Demand-Forecasting Approach based on Principal Components Time-series an application of data-mining technique to detection of market movement
Adaptive Demand-Forecasting Approach based on Principal Components Time-series an application of data-mining technique to detection of market movement Toshio Sugihara Abstract In this study, an adaptive
More informationAPPM4720/5720: Fast algorithms for big data. Gunnar Martinsson The University of Colorado at Boulder
APPM4720/5720: Fast algorithms for big data Gunnar Martinsson The University of Colorado at Boulder Course objectives: The purpose of this course is to teach efficient algorithms for processing very large
More informationMachine Learning in Statistical Arbitrage
Machine Learning in Statistical Arbitrage Xing Fu, Avinash Patra December 11, 2009 Abstract We apply machine learning methods to obtain an index arbitrage strategy. In particular, we employ linear regression
More informationJava Modules for Time Series Analysis
Java Modules for Time Series Analysis Agenda Clustering Non-normal distributions Multifactor modeling Implied ratings Time series prediction 1. Clustering + Cluster 1 Synthetic Clustering + Time series
More informationNeuro-Dynamic Programming An Overview
1 Neuro-Dynamic Programming An Overview Dimitri Bertsekas Dept. of Electrical Engineering and Computer Science M.I.T. September 2006 2 BELLMAN AND THE DUAL CURSES Dynamic Programming (DP) is very broadly
More informationThe Data Mining Process
Sequence for Determining Necessary Data. Wrong: Catalog everything you have, and decide what data is important. Right: Work backward from the solution, define the problem explicitly, and map out the data
More informationMachine Learning and Pattern Recognition Logistic Regression
Machine Learning and Pattern Recognition Logistic Regression Course Lecturer:Amos J Storkey Institute for Adaptive and Neural Computation School of Informatics University of Edinburgh Crichton Street,
More informationCopyright. Network and Protocol Simulation. What is simulation? What is simulation? What is simulation? What is simulation?
Copyright Network and Protocol Simulation Michela Meo Maurizio M. Munafò Michela.Meo@polito.it Maurizio.Munafo@polito.it Quest opera è protetta dalla licenza Creative Commons NoDerivs-NonCommercial. Per
More informationEstimating an ARMA Process
Statistics 910, #12 1 Overview Estimating an ARMA Process 1. Main ideas 2. Fitting autoregressions 3. Fitting with moving average components 4. Standard errors 5. Examples 6. Appendix: Simple estimators
More informationDetection. Perspective. Network Anomaly. Bhattacharyya. Jugal. A Machine Learning »C) Dhruba Kumar. Kumar KaKta. CRC Press J Taylor & Francis Croup
Network Anomaly Detection A Machine Learning Perspective Dhruba Kumar Bhattacharyya Jugal Kumar KaKta»C) CRC Press J Taylor & Francis Croup Boca Raton London New York CRC Press is an imprint of the Taylor
More informationNetwork Tomography Based on to-end Measurements
Network Tomography Based on end-to to-end Measurements Francesco Lo Presti Dipartimento di Informatica - Università dell Aquila The First COST-IST(EU)-NSF(USA) Workshop on EXCHANGES & TRENDS IN NETWORKING
More informationMaster s thesis tutorial: part III
for the Autonomous Compliant Research group Tinne De Laet, Wilm Decré, Diederik Verscheure Katholieke Universiteit Leuven, Department of Mechanical Engineering, PMA Division 30 oktober 2006 Outline General
More informationDefending Against Traffic Analysis Attacks with Link Padding for Bursty Traffics
Proceedings of the 4 IEEE United States Military Academy, West Point, NY - June Defending Against Traffic Analysis Attacks with Link Padding for Bursty Traffics Wei Yan, Student Member, IEEE, and Edwin
More informationADVANCED APPLICATIONS OF ELECTRICAL ENGINEERING
Development of a Software Tool for Performance Evaluation of MIMO OFDM Alamouti using a didactical Approach as a Educational and Research support in Wireless Communications JOSE CORDOVA, REBECA ESTRADA
More informationIP Forwarding Anomalies and Improving their Detection using Multiple Data Sources
IP Forwarding Anomalies and Improving their Detection using Multiple Data Sources Matthew Roughan (Univ. of Adelaide) Tim Griffin (Intel Research Labs) Z. Morley Mao (Univ. of Michigan) Albert Greenberg,
More informationUnsupervised Data Mining (Clustering)
Unsupervised Data Mining (Clustering) Javier Béjar KEMLG December 01 Javier Béjar (KEMLG) Unsupervised Data Mining (Clustering) December 01 1 / 51 Introduction Clustering in KDD One of the main tasks in
More informationIntroduction to Kalman Filtering
Introduction to Kalman Filtering A set of two lectures Maria Isabel Ribeiro Associate Professor Instituto Superior écnico / Instituto de Sistemas e Robótica June All rights reserved INRODUCION O KALMAN
More informationTwo Topics in Parametric Integration Applied to Stochastic Simulation in Industrial Engineering
Two Topics in Parametric Integration Applied to Stochastic Simulation in Industrial Engineering Department of Industrial Engineering and Management Sciences Northwestern University September 15th, 2014
More informationlarge-scale machine learning revisited Léon Bottou Microsoft Research (NYC)
large-scale machine learning revisited Léon Bottou Microsoft Research (NYC) 1 three frequent ideas in machine learning. independent and identically distributed data This experimental paradigm has driven
More informationCITY UNIVERSITY LONDON. BEng Degree in Computer Systems Engineering Part II BSc Degree in Computer Systems Engineering Part III PART 2 EXAMINATION
No: CITY UNIVERSITY LONDON BEng Degree in Computer Systems Engineering Part II BSc Degree in Computer Systems Engineering Part III PART 2 EXAMINATION ENGINEERING MATHEMATICS 2 (resit) EX2005 Date: August
More informationData Mining mit der JMSL Numerical Library for Java Applications
Data Mining mit der JMSL Numerical Library for Java Applications Stefan Sineux 8. Java Forum Stuttgart 07.07.2005 Agenda Visual Numerics JMSL TM Numerical Library Neuronale Netze (Hintergrund) Demos Neuronale
More informationEvaluation of Machine Learning Techniques for Green Energy Prediction
arxiv:1406.3726v1 [cs.lg] 14 Jun 2014 Evaluation of Machine Learning Techniques for Green Energy Prediction 1 Objective Ankur Sahai University of Mainz, Germany We evaluate Machine Learning techniques
More informationAdaptive Equalization of binary encoded signals Using LMS Algorithm
SSRG International Journal of Electronics and Communication Engineering (SSRG-IJECE) volume issue7 Sep Adaptive Equalization of binary encoded signals Using LMS Algorithm Dr.K.Nagi Reddy Professor of ECE,NBKR
More informationStructural Analysis of Network Traffic Flows Eric Kolaczyk
Structural Analysis of Network Traffic Flows Eric Kolaczyk Anukool Lakhina, Dina Papagiannaki, Mark Crovella, Christophe Diot, and Nina Taft Traditional Network Traffic Analysis Focus on Short stationary
More informationMultidimensional Network Monitoring for Intrusion Detection
Multidimensional Network Monitoring for Intrusion Detection Vladimir Gudkov and Joseph E. Johnson Department of Physics and Astronomy University of South Carolina Columbia, SC 29208 gudkov@sc.edu; jjohnson@sc.edu
More informationForecasting and Hedging in the Foreign Exchange Markets
Christian Ullrich Forecasting and Hedging in the Foreign Exchange Markets 4u Springer Contents Part I Introduction 1 Motivation 3 2 Analytical Outlook 7 2.1 Foreign Exchange Market Predictability 7 2.2
More informationLogistic Regression. Vibhav Gogate The University of Texas at Dallas. Some Slides from Carlos Guestrin, Luke Zettlemoyer and Dan Weld.
Logistic Regression Vibhav Gogate The University of Texas at Dallas Some Slides from Carlos Guestrin, Luke Zettlemoyer and Dan Weld. Generative vs. Discriminative Classifiers Want to Learn: h:x Y X features
More informationSupervised and unsupervised learning - 1
Chapter 3 Supervised and unsupervised learning - 1 3.1 Introduction The science of learning plays a key role in the field of statistics, data mining, artificial intelligence, intersecting with areas in
More informationCHAPTER 1 INTRODUCTION
21 CHAPTER 1 INTRODUCTION 1.1 PREAMBLE Wireless ad-hoc network is an autonomous system of wireless nodes connected by wireless links. Wireless ad-hoc network provides a communication over the shared wireless
More informationInternational Journal of Computer Science Trends and Technology (IJCST) Volume 3 Issue 3, May-June 2015
RESEARCH ARTICLE OPEN ACCESS Data Mining Technology for Efficient Network Security Management Ankit Naik [1], S.W. Ahmad [2] Student [1], Assistant Professor [2] Department of Computer Science and Engineering
More informationGaussian Processes in Machine Learning
Gaussian Processes in Machine Learning Carl Edward Rasmussen Max Planck Institute for Biological Cybernetics, 72076 Tübingen, Germany carl@tuebingen.mpg.de WWW home page: http://www.tuebingen.mpg.de/ carl
More informationMultisensor Data Fusion and Applications
Multisensor Data Fusion and Applications Pramod K. Varshney Department of Electrical Engineering and Computer Science Syracuse University 121 Link Hall Syracuse, New York 13244 USA E-mail: varshney@syr.edu
More information