Skewness and Kurtosis in Function of Selection of Network Traffic Distribution



Similar documents
NETWORK STATISTICAL ANOMALY DETECTION BASED ON TRAFFIC MODEL

SKEWNESS. Measure of Dispersion tells us about the variation of the data set. Skewness tells us about the direction of variation of the data set.

MBA 611 STATISTICS AND QUANTITATIVE METHODS

BNG 202 Biomechanics Lab. Descriptive statistics and probability distributions I

seven Statistical Analysis with Excel chapter OVERVIEW CHAPTER

Quantitative Methods for Finance

Why Taking This Course? Course Introduction, Descriptive Statistics and Data Visualization. Learning Goals. GENOME 560, Spring 2012

Lecture 2: Descriptive Statistics and Exploratory Data Analysis

Descriptive Statistics. Purpose of descriptive statistics Frequency distributions Measures of central tendency Measures of dispersion

Probability and Statistics Prof. Dr. Somesh Kumar Department of Mathematics Indian Institute of Technology, Kharagpur

Week 1. Exploratory Data Analysis

How To Write A Data Analysis

Summarizing and Displaying Categorical Data

1) Write the following as an algebraic expression using x as the variable: Triple a number subtracted from the number

MEASURES OF VARIATION

DESCRIPTIVE STATISTICS AND EXPLORATORY DATA ANALYSIS

SECTION 2-1: OVERVIEW SECTION 2-2: FREQUENCY DISTRIBUTIONS

STT315 Chapter 4 Random Variables & Probability Distributions KM. Chapter 4.5, 6, 8 Probability Distributions for Continuous Random Variables

Descriptive Statistics

Exploratory Data Analysis

Institute of Actuaries of India Subject CT3 Probability and Mathematical Statistics

Review of Random Variables

Density Curve. A density curve is the graph of a continuous probability distribution. It must satisfy the following properties:

The right edge of the box is the third quartile, Q 3, which is the median of the data values above the median. Maximum Median

EMPIRICAL FREQUENCY DISTRIBUTION

Measures of Central Tendency and Variability: Summarizing your Data for Others

Chapter 1: Looking at Data Section 1.1: Displaying Distributions with Graphs

Data Mining Techniques Chapter 5: The Lure of Statistics: Data Mining Using Familiar Tools

Normality Testing in Excel

Descriptive statistics parameters: Measures of centrality

What Does the Normal Distribution Sound Like?

Java Modules for Time Series Analysis

BASIC STATISTICAL METHODS FOR GENOMIC DATA ANALYSIS

AMS 5 CHANCE VARIABILITY

Course Text. Required Computing Software. Course Description. Course Objectives. StraighterLine. Business Statistics

Business Statistics. Successful completion of Introductory and/or Intermediate Algebra courses is recommended before taking Business Statistics.

Fairfield Public Schools

Exploratory Data Analysis. Psychology 3256

Common Tools for Displaying and Communicating Data for Process Improvement

determining relationships among the explanatory variables, and

An Empirical Examination of Investment Risk Management Models for Serbia, Hungary, Croatia and Slovenia

Exercise 1.12 (Pg )

COMMON CORE STATE STANDARDS FOR

Data Exploration Data Visualization

1.3 Measuring Center & Spread, The Five Number Summary & Boxplots. Describing Quantitative Data with Numbers

Statistics I for QBIC. Contents and Objectives. Chapters 1 7. Revised: August 2013

Lecture 1: Review and Exploratory Data Analysis (EDA)

business statistics using Excel OXFORD UNIVERSITY PRESS Glyn Davis & Branko Pecar

On Correlating Performance Metrics

Practical Calculation of Expected and Unexpected Losses in Operational Risk by Simulation Methods

Statistics Revision Sheet Question 6 of Paper 2

The Big Picture. Describing Data: Categorical and Quantitative Variables Population. Descriptive Statistics. Community Coalitions (n = 175)

Diagrams and Graphs of Statistical Data

Descriptive statistics Statistical inference statistical inference, statistical induction and inferential statistics

INTRODUCING THE NORMAL DISTRIBUTION IN A DATA ANALYSIS COURSE: SPECIFIC MEANING CONTRIBUTED BY THE USE OF COMPUTERS

Describing Data: Measures of Central Tendency and Dispersion

Measurement and Modelling of Internet Traffic at Access Networks

A logistic approximation to the cumulative normal distribution

Geostatistics Exploratory Analysis

Exploratory data analysis (Chapter 2) Fall 2011

Module 4: Data Exploration

Center: Finding the Median. Median. Spread: Home on the Range. Center: Finding the Median (cont.)

Engineering Problem Solving and Excel. EGN 1006 Introduction to Engineering

Simulation Exercises to Reinforce the Foundations of Statistical Thinking in Online Classes

Multivariate Normal Distribution

STATS8: Introduction to Biostatistics. Data Exploration. Babak Shahbaba Department of Statistics, UCI

modeling Network Traffic

Manhattan Center for Science and Math High School Mathematics Department Curriculum

What is the purpose of this document? What is in the document? How do I send Feedback?

Data exploration with Microsoft Excel: univariate analysis

Descriptive Statistics and Measurement Scales

4.1 Exploratory Analysis: Once the data is collected and entered, the first question is: "What do the data look like?"

Descriptive Statistics and Exploratory Data Analysis

Lecture 2. Summarizing the Sample

What is Data Analysis. Kerala School of MathematicsCourse in Statistics for Scientis. Introduction to Data Analysis. Steps in a Statistical Study

2: Frequency Distributions

A Primer on Mathematical Statistics and Univariate Distributions; The Normal Distribution; The GLM with the Normal Distribution

Curriculum Map Statistics and Probability Honors (348) Saugus High School Saugus Public Schools

Summary of Formulas and Concepts. Descriptive Statistics (Ch. 1-4)

Simple linear regression

( ) = 1 x. ! 2x = 2. The region where that joint density is positive is indicated with dotted lines in the graph below. y = x

Aachen Summer Simulation Seminar 2014

STAT2400 STAT2400 STAT2400 STAT2400 STAT2400 STAT2400 STAT2400 STAT2400&3400 STAT2400&3400 STAT2400&3400 STAT2400&3400 STAT3400 STAT3400

Basic Tools for Process Improvement

3.2 Measures of Spread

How To Understand And Solve A Linear Programming Problem

Statistics Chapter 2

Northumberland Knowledge

Dongfeng Li. Autumn 2010

An Oracle White Paper October Optimizing Loan Portfolios

2. Filling Data Gaps, Data validation & Descriptive Statistics

430 Statistics and Financial Mathematics for Business

Foundation of Quantitative Data Analysis

$ ( $1) = 40

MATH BOOK OF PROBLEMS SERIES. New from Pearson Custom Publishing!

Common Core Unit Summary Grades 6 to 8

Chapter 4: Average and standard deviation

Module 3: Correlation and Covariance

DESCRIPTIVE STATISTICS. The purpose of statistics is to condense raw data to make it easier to answer specific questions; test hypotheses.

Transcription:

Acta Polytechnica Hungarica Vol. 7, No., Skewness and Kurtosis in Function of Selection of Network Traffic Distribution Petar Čisar Telekom Srbija, Subotica, Serbia, petarc@telekom.rs Sanja Maravić Čisar Subotica Tech College of Applied Sciences, Subotica, Serbia, sanjam@vts.su.ac.rs Abstract: The available literature is not completely certain what type(s) of probability distribution best models network traffic. Thus, for example, the uniform, Poisson, lognormal, Pareto and Rayleigh distributions were used in different applications. Statistical analysis presented in this paper aims to show how skewness and kurtosis of network traffic samples in a certain time interval may be criterions for selection of appropriate distribution type. The creation of histogram and probability distribution of network traffic samples is also discussed and demonstrated on a real case. Keywords: skewness; kurtosis; network traffic; histogram; probability distribution Introduction Skewness characterizes the degree of asymmetry of a given distribution around its mean. If the distribution of the data are symmetric then skewness will be close to. Positive skewness indicates a distribution with an asymmetric tail extending toward more positive values. Negative skewness indicates a distribution with an asymmetric tail extending toward more negative values. A measure of the standard error of skewness (SES) can roughly be estimated, according to [9], as the square root of 6/N, where N represents the number of samples. If the skewness is more than twice this amount, then it indicates that the distribution of the data is non-symmetric and it can be assumed that the distribution is significantly skewed. If the skewness is within the expected range of chance fluctuations in that statistic (i.e. ± SES), that would indicate a distribution with no significant skewness problem. 95

P. Čisar et al. Skewness and Kurtosis in Function of Selection of Network Traffic Distribution Kurtosis characterizes the relative peakedness or flatness of a distribution compared with the normal distribution. For normally distributed data the kurtosis is. Positive kurtosis indicates a relatively peaked distribution. Negative kurtosis indicates a relatively flat distribution. As with skewness, if the value of kurtosis is too big or too small, there is concern about the normality of the distribution. In this case, a rough formula for the standard error for kurtosis (SEK) is the square root of /N. Since the value of kurtosis falls within two standard errors (i.e. ± SEK), the data may be considered to meet the criteria for normality by this measure. These measures of skewness and kurtosis are one method of examining the distribution of the data. However, they are not definitive in concluding normality. What should also be examined is a graph (histogram) of the data; and further, one should consider performing other tests for normality such as the Shapiro-Wilk or the Kolmogorov-Smirnov test. Figure Illustration of skewness and kurtosis Skewness is a network characteristic that can be successfully implemented in statistical anomaly detection algorithms, as is shown in []. When trying to distort the distribution parameters (mean, median, etc.), the attacker puts more weight into one of the tails of the probability density function, which leads to an asymmetric, skewed distribution. It is well-known that the standard characterizing parameters of a distribution are the mean (or median), the standard deviation, the kurtosis, and the skewness. It is reasonable to assume that there will be some empirical distribution of the skewness in the case when there is no attack is available, since most of the time, the system is not normally attacked. The statistical intrusion detection algorithm can compare the skewness of the sample to the expected value of the skewness in order to decide if an attack is taking place or not. The authors have dealt with the topic of statistical intrusion detection in publications []-[8]. 96

Acta Polytechnica Hungarica Vol. 7, No., Different Types of Probability Distribution Using the software package "Matlab", sequences of random numbers with various types of probability distributions are generated (Figure function Export ), as the way to simulate network traffic. Figure Random number generation (Poisson distribution) The following different types of distribution are examined: Pareto, normal, Poisson, lognormal, Rayleigh, chi-square and Weibull. Their graphical representation is shown in Figure. Pareto,5,5,5,5 5 7 9 5 7 9 5 7 9 5 7 9 97

P. Čisar et al. Skewness and Kurtosis in Function of Selection of Network Traffic Distribution Uniform,,8,6,, 5 7 9 5 7 9 5 7 9 5 7 9 Normal - 5 7 9 5 7 9 5 7 9 5 7 9 - - Poisson 8 6 5 7 9 5 7 9 5 7 9 5 7 9 98

Acta Polytechnica Hungarica Vol. 7, No., Lognormal,5,5,5 5 7 9 5 7 9 5 7 9 5 7 9 Rayleigh 6 5 5 7 9 5 7 9 5 7 9 5 7 9 Chi-square 5,5,5,5,5,5 5 7 9 5 7 9 5 7 9 5 7 9 99

P. Čisar et al. Skewness and Kurtosis in Function of Selection of Network Traffic Distribution Weibull,5,5,5 5 7 9 5 7 9 5 7 9 5 7 9 Figure Examined types of distribution For the purpose of the statistical comparison of generated curves, the authentic traffic samples y t (local maxima) of a real ISP are also analyzed and graphically represented. Figure Network traffic curves

Acta Polytechnica Hungarica Vol. 7, No., Table Samples of network traffic Sample y t (daily) y t (weekly) y t (monthly).5.5 8.5 7.5 7 5 8.5 5 6.5 7 7 5.5 8 9.9 5 5 5.5.5 6.5.5 6.5 7.5 8 5.5 7 5 5 7 6 8.5 7 7 6.5 8 8 9 5.5.5.5 9 7 6.5 5 9 6 6 6.5 8 5 6 7.5 6 9 6 7.5 5 8 8.5 9 8.5.5.5 7 9 5 5 5 5

P. Čisar et al. Skewness and Kurtosis in Function of Selection of Network Traffic Distribution The numerical values of all samples are presented in the following table. Table Samples for various distributions As emphasized in Chapter, if the skewness and kurtosis are within the expected ranges of chance fluctuations in that statistic (i.e. ± SES and ± SEK), this implies that the distribution has no significant skewness problem. The different distributions shown in the table above (N = ) result in SEK =.55 and SES =.77. Also, for the ISP (N = 5) the values of SEK =.66 and SES =.8 were obtained. By calculating the kurtosis and skewness for each distribution, the values presented in the following table were obtained.

Acta Polytechnica Hungarica Vol. 7, No., Table Kurtosis and skewness for analyzed distributions In the table above, the values outside the calculated limits are marked in darker shading, indicating distribution types that have greater skewness and kurtosis than is allowed. In this sense, these distributions are not appropriate in applications that describe network traffic. The Variations of Skewness and Kurtosis Daily, weekly and monthly network traffic curves of the Internet user were analyzed in order to determine the extent of variations of skewness and kurtosis. Based on these curves, the appropriate descriptive statistics are calculated (Table ). Table Descriptive statistics of network samples daily weekly monthly Mean 9,75857,68579,85786 Standard Error,95887,9986,55985 Median,5 5 Mode 7 Standard Deviation 6,59679,568756,697 Sample Variance,67887 6,86556 7,586 Kurtosis -,5685 -,67696 -,6989 Skewness -,78,7858,955 Range 5, Minimum 8,5 Maximum,9 Sum 69, 86 87 Count 5 5 5 Confidence Level (99.%),975559,757,9 By analyzing the results obtained in the table above, it can be concluded that the values for skewness and kurtosis remain within allowable limits for all the time periods. Only in the case of skewness is a change of sign detected.

P. Čisar et al. Skewness and Kurtosis in Function of Selection of Network Traffic Distribution Histogram and Probability Distribution According to [], the purpose of a histogram is to graphically summarize the distribution of a univariate data set. The histogram graphically shows the following: center of the data spread (i.e. the scale) of the data skewness of the data presence of outliers 5 presence of multiple modes in the data These features provide strong indications of the proper distributional model for the data. The most common form of the histogram is obtained by splitting the range of the data into equal-sized bins (called classes). Then for each bin, the number of points from the data set that fall into each bin is counted. That is vertical axis - frequency (i.e. counts for each bin) horizontal axis - response variable The histogram of the frequency distribution can be converted to a probability distribution by dividing the tally in each group by the total number of data points to achieve the relative frequency []. In accordance with the aforementioned, the following graphical presentation related to network samples of ISP is obtained. Based on Figure 5, it can be concluded that the most frequent samples approximately correspond to the mean of network traffic flow. histogram probability distribution frequency 8 6 probability,5,,5,,5,,5 8 6 8 6 8 5 7 9 5 7 9 5 7 samples

Acta Polytechnica Hungarica Vol. 7, No., frequency 7 6 5 probability,8,6,,,,8,6, 5 6 7 8 9 samples, 5 6 7 8 9 9 8,5 7 frequency 6 5 probability,,5,,5 5 6 7 8 9 5 6 7 8 9 samples Conclusions Figure 5 Histogram and probability distribution of network samples Through analysis of the appropriate values, it can be concluded that the Pareto and Weibull distributions provide kurtosis and skewness that are significantly outside the permitted range. Looking at the skewness, it is necessary to note that the chisquare distribution gives values also higher than the allowed. Among the analyzed types of distribution, based on kurtosis and skewness as criteria for describing network traffic, the appropriate distributions are the uniform, the normal, the Poisson, the lognormal and the Rayleigh. It is also shown that the most frequent network sample in the histogram is approximately equal to the mean of the traffic flow. In the next phase of research, the skewness and kurtosis of network traffic will be examined in terms of the reliability of network anomaly detection. The expectation is to try to find their application in the field of information systems security. References [] Sorensen, S.: Competitive Overview of Statistical Anomaly Detection, White Paper, Juniper Networks, 5

P. Čisar et al. Skewness and Kurtosis in Function of Selection of Network Traffic Distribution [] Gong, F.: Deciphering Detection Techniques: Part II Anomaly-based Intrusion Detection, White Paper, McAfee Security, [] SANS Intrusion Detection FAQ: Can You Explain Traffic Analysis and Anomaly Detection? Available at http://www.sans.org/resources/idfaq/ anomaly_detection.php [] Dulanović, N., Hinić, D., Simić, D.: An Intrusion Prevention System as a Proactive Security Mechanism in Network Infrastructure, YUJOR - Yugoslav Journal of Operations Research, Vol. 8, No., pp. 9-, 8 [5] Lazarevic, A., Kumar, V., Srivastava, J.: Managing Cyber Threats: Issues, Approaches and Challenges, Chapter: A survey of Intrusion Detection techniques, Kluwer Academic Publishers, Boston, 5 [6] Lončarić, S.: Osnove slučajnih procesa, Sveučilište u Zagrebu, Fakultet elektrotehnike i računarstva, Zavod za elektroničke sustave i obradbu informacija, Available at http://spus.zesoi.fer.hr [7] Siris, V., Papagalou, F.: Application of Anomaly Detection Algorithms for Detecting SYN Flooding Attacks. Available at http://www.ist-scampi.org /publications/papers/siris-globecom.pdf [8] CAIDA, the Cooperative Association for Internet Data Analysis: Inferring Internet Denial-of-Service Activity, University of California, San Diego, [9] Tabachnick, B. G., Fidell, L. S.: Using Multivariate Statistics ( rd ed.), New York: Harper Collins, 996 [] National Institute of Standards and Technology, Engineering Statistics Handbook, http://www.itl.nist.gov/div898/handbook/pmc/pmc.htm [] NetMBA-Business Knowledge Center, Available at http://www.netmba.com/ statistics/histogram/ [] Buttyan, L., Schaffer, P., Vajda, I.: Resilient Aggregation with Attack Detection in Sensor Networks, Second IEEE International Workshop on Sensor Networks and Systems for Pervasive Computing (PerSeNS), IEEE Computer Society Press, Pisa, Italy, 6 6