On the Use of Compression Algorithms for Network Traffic Classification

Christian CALLEGARI
Department of Information Engineering, University of Pisa

23 September 2008
COST-TMA Meeting, Samos, Greece
Outline

1 Introduction
    Motivations
    Theoretical Background
2 Lempel-Ziv-Welch
    Huffman
    Dynamic Markov Compression
3
4 Data-Set
    Results

C. Callegari and Traffic Classification 2 / 17
Motivations

Language classification:
    "Language trees and zipping"
    D. Benedetto, E. Caglioti, and V. Loreto
    Physical Review Letters, January 2002

Traffic classification based on the TCP flags:
    "A Markovian signature-based approach to IP traffic classification"
    H. Dahmouni, S. Vaton, and D. Rossé
    Proceedings of the 3rd annual ACM workshop on Mining network data, 2006
Theoretical Background

Entropy
The entropy H of a discrete random variable X is a measure of the amount of uncertainty associated with the value of X.

The starting point: for an alphabet of n distinct symbols, where symbol i occurs with probability p_i,

    H = - Σ_{i=1}^{n} p_i log2(p_i)   bit/symbol

The entropy represents a lower bound on the compression rate that we can obtain: the more redundant the data, the better we can compress them.
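As an illustrative sketch (not part of the original slides; the function name is my own), the empirical entropy of a symbol sequence can be computed directly from the formula above:

```python
import math
from collections import Counter

def entropy_bits_per_symbol(symbols):
    """Empirical entropy H = -sum_i p_i * log2(p_i), in bits per symbol."""
    counts = Counter(symbols)
    total = len(symbols)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# A uniform 4-symbol alphabet needs 2 bits per symbol
print(entropy_bits_per_symbol("abcd"))  # 2.0
```

A constant sequence such as "aaaa" gives H = 0: fully redundant data compresses best.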
Dictionary-based algorithms: based on the use of a dictionary, which can be static or dynamic; they code each symbol or group of symbols with an element of the dictionary
    Lempel-Ziv-Welch

Model-based algorithms: each symbol or group of symbols is encoded with a variable-length code, according to some probability distribution
    Huffman
    Dynamic Markov Compression
Lempel-Ziv-Welch

created by Abraham Lempel, Jacob Ziv, and Terry Welch; published by Welch in 1984 as an improved implementation of the LZ78 algorithm, published by Lempel and Ziv in 1978
universal adaptive(1) lossless data compression algorithm
builds a translation table (also called a dictionary) from the text being compressed
the string translation table maps the message strings to fixed-length codes

(1) The coding scheme used for the k-th character of a message is based on the characteristics of the preceding k-1 characters in the message
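The dictionary-building step can be sketched as follows; this is a minimal encoder for illustration (function name and output format are my own), omitting the fixed-width packing of the emitted codes:

```python
def lzw_encode(data: bytes) -> list[int]:
    # Start from a dictionary containing every single byte (codes 0-255)
    dictionary = {bytes([i]): i for i in range(256)}
    w = b""
    codes = []
    for b in data:
        wc = w + bytes([b])
        if wc in dictionary:
            w = wc  # keep extending the current match
        else:
            codes.append(dictionary[w])       # emit code for the longest match
            dictionary[wc] = len(dictionary)  # learn the new string
            w = bytes([b])
    if w:
        codes.append(dictionary[w])
    return codes

print(lzw_encode(b"ABABABA"))  # [65, 66, 256, 258]
```

Note how repeated substrings ("AB", "ABA") are replaced by single dictionary codes: the more the input repeats itself, the shorter the output.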
Huffman

developed by Huffman (1952)
based on the use of a variable-length code table for encoding each source symbol
the variable-length code table is derived from a binary tree built from the estimated probability of occurrence of each possible value of the source symbols
prefix-free code(2) that expresses the most common symbols using shorter strings of bits than are used for less common symbols

(2) The bit string representing some particular symbol is never a prefix of the bit string representing any other symbol
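Building the code table can be sketched with a standard heap-based construction (a generic illustration, not the authors' implementation):

```python
import heapq
from collections import Counter

def huffman_codes(message):
    """Return {symbol: bit string}; frequent symbols get shorter codes."""
    freq = Counter(message)
    # Heap entries: (frequency, tie-breaker, {symbol: partial code})
    heap = [(f, i, {s: ""}) for i, (s, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)   # the two least frequent subtrees
        f2, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (f1 + f2, tie, merged))
        tie += 1
    return heap[0][2]

print(huffman_codes("aaaabbc"))
```

For "aaaabbc" the most frequent symbol "a" receives a 1-bit code while "b" and "c" receive 2-bit codes, and no code is a prefix of another.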
Dynamic Markov Compression

developed by Gordon Cormack and Nigel Horspool (1987)
adaptive lossless data compression algorithm
based on modelling the binary source to be encoded by means of a Markov chain, which describes the transition probabilities between the symbol 0 and the symbol 1
the built model is used to predict the next bit of the message; the predicted bit is then coded using arithmetic coding
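Full DMC grows its state machine dynamically (via state cloning) and drives an arithmetic coder; as a simplified, hypothetical sketch of the modelling idea only, an adaptive order-1 binary Markov model looks like this:

```python
class BitPredictor:
    """Adaptive order-1 model: P(next bit | previous bit), Laplace-smoothed.

    Real DMC uses a dynamically grown state machine, not a fixed order-1
    chain; this class only illustrates the transition-counting idea.
    """

    def __init__(self):
        self.counts = {0: [1, 1], 1: [1, 1]}  # state -> [count of 0s, count of 1s]
        self.state = 0

    def prob_one(self):
        c0, c1 = self.counts[self.state]
        return c1 / (c0 + c1)  # probability handed to the arithmetic coder

    def update(self, bit):
        self.counts[self.state][bit] += 1
        self.state = bit  # the previous bit is the next state

model = BitPredictor()
for bit in [0, 1, 0, 1, 0, 1]:
    model.update(bit)
# After alternating bits, a 1 is rarely followed by another 1
print(model.prob_one())  # 0.25
```

The arithmetic coder then spends few bits on well-predicted inputs, so a sequence that matches the learned transition probabilities compresses well.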
Input

the system input is given by raw traffic traces in libpcap format
the 5-tuple is used to identify a connection, while the values of the TCP flags are used to build the profile
a value s_i is associated to each packet:

    s_i = SYN + 2·ACK + 4·PSH + 8·RST + 16·URG + 32·FIN

thus each mono-directional connection is represented by a sequence of symbols s_i, which are integers in {0, 1, ..., 63}
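The flag-to-symbol mapping above is straightforward to implement; for illustration (the function name is hypothetical):

```python
def flag_symbol(syn, ack, psh, rst, urg, fin):
    """Map the six TCP flags (each 0 or 1) to a symbol in {0, ..., 63},
    using the weights given on the slide."""
    return syn + 2 * ack + 4 * psh + 8 * rst + 16 * urg + 32 * fin

print(flag_symbol(1, 1, 0, 0, 0, 0))  # 3: a SYN+ACK packet
print(flag_symbol(1, 1, 1, 1, 1, 1))  # 63: all flags set
```

Each connection thus becomes a string over a 64-symbol alphabet, which is exactly the kind of input the three compressors accept.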
Training Phase

choose one of the three previously described algorithms (Huffman, DMC, or LZW)
the compression algorithms have been modified so that learning stops at the end of the training phase:
    Huffman case: the occurrence frequency of each symbol is estimated only on the training data-set
    DMC case: the estimate of the Markov chain is updated only during the training phase
    LZW case: the construction of the dictionary is stopped after the training phase
classification is then performed with a compression scheme that is optimal for the application used to build the considered profile and suboptimal for the others
Classification

append each distinct observed connection b to the training sequence A_i of application i
compute the compression rate per symbol:

    L_i = (dim([A_i b]) - dim([A_i])) / Length(b)        (1)

where [X] represents the compressed version of X
choose

    argmin_i (L_i)        (2)
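Equations (1) and (2) can be sketched with an off-the-shelf compressor standing in for the frozen Huffman/DMC/LZW models (zlib is my substitution here, not one of the compressors used in the talk; profile contents are invented):

```python
import zlib

def rate_per_symbol(training, connection):
    # L_i = (dim([A_i b]) - dim([A_i])) / Length(b), with zlib as [.]
    base = len(zlib.compress(training))
    extended = len(zlib.compress(training + connection))
    return (extended - base) / len(connection)

def classify(profiles, connection):
    # profiles: {application name: training sequence A_i}
    rates = {app: rate_per_symbol(seq, connection) for app, seq in profiles.items()}
    return min(rates, key=rates.get)  # argmin_i L_i

profiles = {
    "http": b"GET /index.html HTTP/1.0\r\n" * 50,
    "ftp":  b"USER anonymous\r\nPASS guest\r\n" * 50,
}
print(classify(profiles, b"GET /index.html HTTP/1.0\r\n" * 3))  # http
```

A connection that resembles a profile adds almost nothing to its compressed size, so its rate per symbol is smallest for the matching application.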
Data-Set 1

The 1999 DARPA/MIT IDS evaluation program
it provides a corpus of data which models the network traffic measured between a US Air Force base and the Internet
5 weeks of data:
    week 1: used for training
    week 3: used for classification
Considered applications (several thousand connections per application): FTP, SSH, SMTP, and HTTP
Data-Set 2

Corpus of data collected in the TLC Net Group Laboratory, University of Pisa
Considered applications (four hundred connections per application): FTP, SSH, SMTP, HTTP, and HTTPS

Data-Set 3

Corpus of data provided by the Italian research project (PRIN) RECIPE
Considered applications (several thousand connections per application): POP3, SMTP, and HTTP
Results

         LZW                  DMC                  Huffman
         D-1    D-2    D-3    D-1    D-2    D-3    D-1    D-2    D-3
FTP      100%   70%    -      100%   0%     -      100%   100%   -
SSH      95%    100%   -      0%     100%   -      50%    97%    -
SMTP     94%    60%    96%    100%   99%    -      98%    70%    100%
HTTP     95%    73%    97%    100%   76%    -      83%    45%    52%
HTTPS    -      32%    -      -      33%    -      -      35%    -
POP3     -      -      98%    -      -      -      -      -      100%
Results 2: some more details

Huffman     HTTP    POP3    SMTP
HTTP        53%     47%     0%
HTTP nom    36%     64%     0%
POP3        0%      100%    0%
POP3 nom    0%      100%    0%
SMTP        0%      0%      100%
SMTP nom    0%      0%      100%

LZW         HTTP    POP3    SMTP
HTTP        96%     3.5%    0.5%
HTTP nom    97%     3%      0%
POP3        0%      98%     2%
POP3 nom    1%      95%     4%
SMTP        1%      3%      96%
SMTP nom    0%      0%      100%
Future Works

More applications
Background traffic
Combine several statistical methods (e.g., compression + traffic descriptor statistics)
Application to anomaly detection
Thank You for your attention

Any questions?