Statistical traffic classification in IP networks: challenges, research directions and applications

Luca Salgarelli <luca.salgarelli@ing.unibs.it>
A joint work with M. Crotti, M. Dusi, A. Este and F. Gringoli
Dipartimento di Elettronica per l'Automazione, Facoltà di Ingegneria, Università Degli Studi di Brescia
Via Branze 38, 25123 Brescia, Italy

Outline
- The networking group at the University of Brescia
- Traffic classification: objective, motivation, state of the art
- Statistical classification: two approaches
  - Simple statistical fingerprinting: when good old Naïve Bayes still has something to tell us
  - Machine learning: an approach based on Support Vector Machines
- One application: detection of tunneled applications
- Issues and further research
- Conclusions
Slide 2

Telecommunications group at the University of Brescia
- Director: Riccardo Leonardi
- Seven faculty, 15/20 post-docs, Ph.D. students, etc.
- Two major areas:
  - Multimedia: signal processing, audio/video coding, etc.
  - Networking
Slide 3

The networking group at the University of Brescia
- Small and young group, started in 2004
- Two faculty: L. Salgarelli, F. Gringoli
- Several Ph.D. students
- [Systems] research and education related to networking:
  - Network security
  - Traffic characterization and its applications
  - Wireless networks: 3G->4G, QoS in wireless LANs
Slide 4

Traffic classification: motivation, state of the art

Traffic classification in IP networks
- A procedure for classifying IP traffic according to the application layer protocol that generated it
- Coarse classification: traffic classes
  - Bulk transfer
  - Interactive
  - VoIP
  - Etc.
- Fine-grained classification: per protocol
  - E-mail (SMTP, POP, etc.)
  - FTP
  - HTTP
  - SSH
  - Etc.
[Figure: mixed Chat, HTTP, SMTP and SSH traffic entering the classifier]
Slide 6

Objective: robust and efficient classification
- Robust
  - To application layer tunnels: SSH over HTTP, Chat over DNS, etc.
  - To end-to-end encryption: both transport layer (TLS) and IP layer (IPSec)
  - To changes in application layer protocols: HTTP 1.1 -> HTTP 1.x
  - To changes in network conditions
- Efficient
  - Up to 10 Gb/s links
  - Run on moderately-priced hardware
  - No decoding of application layer state machines
Slide 7

Motivation: service differentiation (QoS support)
- QoS support requires identification of traffic classes
- Usually performed at network edges
- When done in the backbone, it might require dealing with end-to-end encryption
- Independent verification of pre-classified traffic (enforcement of SLAs, etc.)
[Figure: HTTP, SMTP and SSH flows enter the classifier, which sorts them into high-, medium- and low-priority classes]
Slide 8

Motivation: enforcement of security policies
- Stateful firewalls are not enough anymore
  - Port-based filtering fails with many applications (e.g., peer-to-peer)
  - ALGs are too computationally demanding
- Proxies can be tricked
  - E.g., tunneling of applications on top of HTTP
- Once more, trouble ahead with encrypted traffic
[Figure: HTTP, SMTP and SSH flows enter the classifier, which can block selected flows]
Slide 9

Motivation: pricing and billing
- Network operators are looking into creative (!) billing platforms
  - 1 x /month: DSL service without access to VoIP
  - 2 x /month: DSL service with VoIP service included
- These are not necessarily evil practices
  - These kinds of techniques could even help lower service costs to end users
  - Do we believe this? We'll see...
[Figure: the classifier blocks VoIP traffic on the low-cost DSL service]
Slide 10

State of the art: port-based and payload-based classification
- Many tools available for port-based classification: CoralReef, Tstat, snort, etc.
  - High performance: for example, Tstat can do layer-4 traffic analysis, including TCP and UDP port analysis, at rates of several Gb/s
- Several open-source and commercial payload-based analyzers
  - Open-source: l7filter
  - Commercial: ALGs, intrusion-detection systems, Packeteer's packet-shaper, etc.
- Problems
  - Ports change, and some applications do not even use standard ports
  - Tunneled traffic
  - Encrypted traffic
  - Complexity (?) of deep packet inspection
Slide 11

State of the art: behavior-based classifiers
[Figure: timeline of prior work, grouped into machine learning, heuristic and statistical approaches]
- 2001: HIDE [7]
- 2003: Dewes et al. [6]
- 2004: McGregor et al. [3]; Roughan et al. [1]; T. Karagiannis et al. [8]
- 2005: Moore et al. [4]; BLINC [5]
- 2006: Salamatian et al. [2]; our first approach (Naïve Bayes-based)
- 2007: our second approach (SVM)
Slide 12

Traffic classification through simple statistical fingerprinting

High-level scheme
- Training: derive protocol fingerprints (Φ) from basic statistical properties of known traffic
- Classify traffic in real time, based on protocol fingerprints
- Update protocol fingerprints as network conditions or protocol specifications change
[Figure: the classifier extracts (size, Δt, pkt#) from the traffic and compares it with the fingerprint Φ of each protocol P, e.g. HTTP]
Slide 14

Definition: TCP flow in our context
- Client-server, connection-oriented applications (HTTP, SSH, POP, etc.)
- Flow = unidirectional, ordered sequence of packets from client to server or vice versa
- One application layer session = two flows (Fclient and Fserver)
[Figure: the classifier sits between client and server; Fclient carries Pkt1, Pkt2, ..., Pktn from the client, and Fserver carries Pkt1, ..., Pktm from the server]
Slide 15

Training phase: protocol mask vectors
- Extract L flows, all generated by the same protocol p.
- Number the packets in the order in which they're seen by the classifier.
- Extract (size, Δt) for each packet.
- For each packet number, calculate PDF(s, Δt): the protocol PDF vector.
- Reduce high-frequency noise: apply a Gaussian filter to each PDF to obtain the protocol mask vector.
- Obtain one mask vector for each protocol.
[Figure: flows Flow1, Flow2, ..., FlowL feed the extraction of (s, Δt); the per-packet PDFs PDF1, PDF2, PDF3 pass through a Gaussian filter to give the masks M1, M2, M3]
Slide 16
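The training steps above could be sketched as follows. This is a minimal illustration rather than the authors' implementation: the bin widths `SIZE_BIN` and `DT_BIN`, the truncated Gaussian kernel, and the function names are all assumptions, and each flow is assumed to be a list of (size in bytes, Δt in seconds) pairs.

```python
import math
from collections import defaultdict

SIZE_BIN = 100   # bytes per histogram cell (assumed value)
DT_BIN = 0.01    # seconds per histogram cell (assumed value)

def build_pdf_vector(flows, n_packets):
    """Return one empirical PDF per packet number: {(size_bin, dt_bin): probability}."""
    pdfs = []
    for i in range(n_packets):
        counts = defaultdict(int)
        for flow in flows:
            if i < len(flow):                      # this flow has an i-th packet
                size, dt = flow[i]
                counts[(int(size // SIZE_BIN), int(dt // DT_BIN))] += 1
        total = sum(counts.values())
        pdfs.append({cell: c / total for cell, c in counts.items()} if total else {})
    return pdfs

def gaussian_smooth(pdf, sigma=1.0, radius=2):
    """Low-pass filter a PDF with a truncated 2-D Gaussian kernel.

    Total probability mass is preserved; cells near the origin may spill
    some mass into negative bin indices, which this sketch does not clip.
    """
    kernel = {(dx, dy): math.exp(-(dx * dx + dy * dy) / (2.0 * sigma ** 2))
              for dx in range(-radius, radius + 1)
              for dy in range(-radius, radius + 1)}
    norm = sum(kernel.values())
    mask = defaultdict(float)
    for (x, y), p in pdf.items():
        for (dx, dy), w in kernel.items():
            mask[(x + dx, y + dy)] += p * w / norm
    return dict(mask)

def build_mask_vector(flows, n_packets):
    """Protocol mask vector: one smoothed PDF per packet number."""
    return [gaussian_smooth(pdf) for pdf in build_pdf_vector(flows, n_packets)]
```

Calling `build_mask_vector(flows, 4)` on the training flows of one protocol would give the four masks M1 to M4 for that protocol under these assumptions.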

Anomaly score
- Anomaly score Sn(F, M): measures how far flow F is from protocol mask M
- Each of flow F's packets Pi contributes to Sn by means of the value of Mi at Pi
- Subscript n in Sn indicates at which packet number the evaluation of S(F, M) was stopped
  - Useful for real-time classification of flows
[Figure: flows F1 and F2 plotted against masks M1, M2, M3 in the (s, Δt) plane; S3(F1, M) << S3(F2, M)]
Slide 17
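The slide does not give the formula for Sn, so the sketch below uses one plausible definition and should be read as an assumption: each packet contributes the log of its mask value, floored at a small `EPSILON`, and the result is normalized so that 0 means a perfect match and 1 means the flow never touches the mask. The names `anomaly_score` and `EPSILON` are hypothetical.

```python
import math

EPSILON = 1e-6   # floor for mask values, avoids log(0) (assumed value)

def anomaly_score(flow_cells, masks, n):
    """Score the first n packets of a flow against a protocol mask vector.

    flow_cells[i] is the (size_bin, dt_bin) cell of packet i+1;
    masks[i] maps cells to mask values for packet i+1.
    Returns a value in [0, 1]; lower means closer to the protocol.
    """
    n = min(n, len(flow_cells), len(masks))
    if n == 0:
        raise ValueError("need at least one packet and one mask")
    total = 0.0
    for i in range(n):
        value = masks[i].get(flow_cells[i], 0.0)
        value = min(1.0, max(EPSILON, value))      # clamp into [EPSILON, 1]
        total += math.log(value) / math.log(EPSILON)
    return total / n
```

Because the score is an average over the packets seen so far, it can be re-evaluated after every new packet, which matches the real-time use mentioned on the slide.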

Anomaly score: the basis of our classification algorithm
S3(F1, M_P1) << S3(F1, M_P2)
[Figure: flow F1 plotted against the mask vectors (M1, M2, M3) of protocols P1 and P2 in the (s, Δt) plane]
Slide 18

Anomaly scores are not enough: thresholds
- Anomaly scores give an indication of how far a flow is from a given protocol mask
- Problem: not all protocols can be fingerprinted
  - Using min{S(F, M)} to classify flow F is not enough
  - Example: F is an SSH flow
    - S5(F, M_HTTP) = 0.7
    - S5(F, M_POP3) = 0.81
    - S5(F, M_HTTP) < S5(F, M_POP3), but F is not HTTP
- Idea: use more information than just protocol masks
  - Calculate the average and standard deviation of the anomaly scores of the flows used to build each protocol mask: define thresholds
Slide 19

Computing thresholds
[Figure: the training pipeline: flows Flow1, Flow2, ..., FlowL, extraction of (s, Δt), per-packet PDFs PDF1 to PDF3, Gaussian filter, mask vectors M1 to M3]
Slide 20

Protocol fingerprints
- Protocol p's fingerprint is the union of its mask vector and its threshold vector
- Φ_p := protocol p's fingerprint
- Note that each threshold could actually be computed as a linear function of μ and σ
  - For example, we will see classification results when using T = μ + x·σ, with x ∈ [1, 10]
- Two fingerprints per protocol: one derived from Fclient and one derived from Fserver flows
[Figure: mask vectors M1 to M4 paired with thresholds T1 to T4 in the (s, Δt) plane]
Slide 21
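The rule T = μ + x·σ can be illustrated with a few lines of code. This is only a sketch: the function name `threshold_vector` and the layout of its input are assumptions, and it uses the population standard deviation, whereas the slides do not say which estimator was used.

```python
from statistics import mean, pstdev

def threshold_vector(training_scores, x=1.0):
    """Compute one threshold per packet number as T = mu + x * sigma.

    training_scores[i] holds the anomaly scores, evaluated at packet i+1,
    of the training flows that were used to build the protocol mask.
    """
    return [mean(scores) + x * pstdev(scores) for scores in training_scores]

# Example: scores of three training flows at packet numbers 1 and 2.
thresholds = threshold_vector([[0.10, 0.20, 0.30], [0.05, 0.05, 0.05]], x=2.0)
```

A larger x gives a more permissive threshold, which should raise the hit ratio for the protocol at the cost of more false positives.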

Classification algorithm Slide 22

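...at a glance
[Figure: the classifier scores flows F1 and F2 against the fingerprints Φ1, Φ2 and Φ3 and compares each score with the corresponding threshold. Example values: against Φ1, S1,1 = 0.1 and S2,1 > T1, with T1 = 0.2; against Φ2, S1,2 = 0.08 and S2,2 > T2, with T2 = 0.1; against Φ3, S1,3 > T3 and S2,3 > T3, with T3 = 0.15]
Slide 23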
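The transcript does not preserve the algorithm itself, so the following is a hedged guess at a decision rule consistent with the earlier slides: a protocol is a candidate only if the flow's anomaly score does not exceed that protocol's threshold, the candidate with the lowest score wins, and a flow with no candidates is left unclassified. The function name `classify` and the `"UNKNOWN"` label are hypothetical.

```python
def classify(scores, thresholds):
    """Pick a protocol for one flow at a fixed packet number.

    scores[p] is the flow's anomaly score against protocol p's mask;
    thresholds[p] is protocol p's threshold at the same packet number.
    """
    candidates = {p: s for p, s in scores.items() if s <= thresholds[p]}
    if not candidates:
        return "UNKNOWN"            # no fingerprint is close enough
    return min(candidates, key=candidates.get)

# The SSH example: HTTP has the smaller score, yet neither score is under its
# (made-up) threshold, so the flow is not forced into HTTP.
label = classify({"HTTP": 0.7, "POP3": 0.81}, {"HTTP": 0.3, "POP3": 0.3})
```

Under this rule the thresholds are what keeps traffic from non-fingerprinted protocols out of the nearest fingerprinted class.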

Does it work? Experimental analysis

Testbed setup: training phase
- Collect one week's worth of outgoing TCP traffic at relevant well-known ports: HTTP, POP3 and SMTP
- Have l7filter validate each flow by pattern matching
- Flows that pass l7filter's validation become the training set: around 20K flows for each of the three protocols considered
- Calculate Φ_HTTP, Φ_POP3 and Φ_SMTP
- Note: this training mechanism is very inefficient
[Figure: testbed: ing.unibs.it (800 users), UniBS main router, GARR and the classifier; link speeds shown: 100 Mb/s and 24 Mb/s]
Slide 25

Testbed setup: evaluation phase
- After two weeks, collect another week's worth of outgoing TCP traffic, this time without any port-based filters
- Pre-classify a subset of the flows: obtain an evaluation set composed of 10K certified flows for each of the fingerprinted protocols, and 5K for non-fingerprinted ones
  - Certification done by hand and by application-layer pattern matching
- We'll see results for Fclient only
[Figure: testbed: ing.unibs.it (800 users), UniBS main router, GARR and the classifier; link speeds shown: 100 Mb/s and 24 Mb/s]
Slide 26

Hit ratio and false positives in our scenario
- Certified evaluation set: E_HTTP, E_POP3, E_SMTP, E_OTH
- E_p = number of flows of protocol p in the evaluation set
- e_p = number of flows classified as protocol p
- ě_p = number of flows correctly classified as protocol p
- ê_p = number of flows incorrectly classified as protocol p
- e_p := ě_p + ê_p, for p in {HTTP, POP3, SMTP, OTH}
- Hit ratio for p = ě_p / E_p
- False positive for p = ê_p / e_p
- Other := not produced by any of the fingerprinted protocols
[Figure: the certified evaluation set enters the classifier, which outputs the HTTP, POP3, SMTP and OTHER classes]
Slide 27
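As a small worked example of the two definitions above, the sketch below counts E_p, ě_p and ê_p from a list of certified labels and a list of classifier outputs. The function name and the convention of returning `None` for an empty denominator are assumptions made for illustration.

```python
def hit_ratio_and_false_positive(true_labels, predicted_labels, protocol):
    """Return (hit ratio, false positive) for one protocol.

    hit ratio      = correctly classified as p / flows of p in the evaluation set
    false positive = incorrectly classified as p / all flows classified as p
    """
    in_set = correct = wrong = 0
    for truth, guess in zip(true_labels, predicted_labels):
        if truth == protocol:
            in_set += 1                      # contributes to E_p
        if guess == protocol:
            if truth == protocol:
                correct += 1                 # contributes to the correct count
            else:
                wrong += 1                   # contributes to the incorrect count
    classified = correct + wrong             # e_p
    hit = correct / in_set if in_set else None
    false_pos = wrong / classified if classified else None
    return hit, false_pos

truth = ["HTTP", "HTTP", "POP3", "OTHER"]
guess = ["HTTP", "POP3", "POP3", "HTTP"]
http_metrics = hit_ratio_and_false_positive(truth, guess, "HTTP")
```

Note that the two ratios have different denominators: the hit ratio is relative to the certified set, while the false positive is relative to what the classifier assigned to the protocol.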

Results: hit ratio and false positives
[Figure: two plots: hit ratio vs. packet #; false positives vs. packet #]
Best classification results (pkt #4, T = μ + σ):
Protocol   Hr (%)   F+ (%)
POP3       94.58    3.06
SMTP       94.51    3.08
HTTP       91.76    6.38
OTHER      90.64    N.A.
Slide 28

Hit ratio and false positives: take two
[Figure: four plots: hit ratio vs. size of training set; hit ratio vs. threshold; false positives vs. size of training set; false positives vs. threshold]
Slide 29

Comparison with a payload-based classifier
L7-based classifier:
Protocol   Hr (%)   Pkt #
POP3       65.77    2 (Fserver)
SMTP       90.54    2/3 (Fserver)
HTTP       99.25    3/4 (Fserver)
OTHER      99+      N.A.
Our approach:
Protocol   Hr (%)   F+ (%)   Pkt #
POP3       94.58    3.06     4 (Fclient)
SMTP       94.51    3.08     4 (Fclient)
HTTP       91.76    6.38     4 (Fclient)
OTHER      90.64    N.A.     4 (Fclient)
- Classification of unmatched flows is as good as that of matched flows
- All protocols are matched with a hit ratio of over 90%
Slide 30

Traffic classification through machine learning: an approach based on Support Vector Machines (SVM)

An SVM-based classifier: motivation
- Although preliminary results look promising, our Naïve Bayes-like classifier still needs some refining touches
  - For it to be effective, it needs quite large training sets
    - On the order of 10k flows for each protocol
    - Relatively slow and complex training phase
  - We have not found out why (yet!), but this approach cannot work well without considering interarrival times
    - Problems with network noise
    - Less robust
- Solution A: investigate and find fixes (we are working on it)
- Solution B: try other approaches: SVM
  - Very significant reduction of the training set (at least in theory!)
  - Much less dependent on interarrival times: can perform very well considering just packet size
Slide 32

Basic concepts: binary SVMs
- Let x ∈ R^n be an attribute vector
  - E.g., a series of packet size values
- Let y ∈ {-1, 1} be a class label associated with each attribute vector
- The purpose of a binary SVM is to create a statistical model that predicts a label value y_i by evaluating its feature vector x_i
- High-level overview:
  - Create an ideal hyperplane that separates two training classes (the ones identified by label -1 and the ones identified by label 1)
  - The hyperplane can be found by solving a convex quadratic-programming problem with linear constraints
  - This surface is described by means of Support Vectors
  - Non-linear separation by remapping the samples to a higher-dimensional space, using a non-linear mapping function φ
  - The hyperplane is defined in the remapped space
  - A kernel function K allows us not to specify φ explicitly, for example a Gaussian kernel
Slide 33
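For reference, the textbook forms of the three expressions this slide refers to (the hyperplane, the kernel function and the Gaussian kernel) are sketched below. They are the standard SVM formulation and not necessarily the notation used on the slide: the weight vector w, the offset b and the 2σ² scaling of the Gaussian kernel are assumptions.

```latex
\begin{align}
  \mathbf{w} \cdot \varphi(\mathbf{x}) + b &= 0
    && \text{(separating hyperplane in the remapped space)} \\
  K(\mathbf{x}_i, \mathbf{x}_j) &= \varphi(\mathbf{x}_i) \cdot \varphi(\mathbf{x}_j)
    && \text{(kernel function)} \\
  K(\mathbf{x}_i, \mathbf{x}_j) &= \exp\!\left(-\frac{\lVert \mathbf{x}_i - \mathbf{x}_j \rVert^{2}}{2\sigma^{2}}\right)
    && \text{(Gaussian kernel)}
\end{align}
```

Since only inner products of remapped samples are needed, K can be evaluated directly on the original attribute vectors and φ never has to be computed.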

Basic concepts: single and multi-class SVM
- Single class SVM
  - Extension of the original binary SVM
  - Introduction of ν ∈ (0, 1]: determines the tolerance to noise of the system with respect to the training set
  - A kernel transformation maps the training data into a feature space
  - The hyperplane defined by the Support Vectors separates the training vectors from the origin with the maximum margin
- Multi class SVM
  - Multiple labels y ∈ {1, 2, ..., M}
  - Simplest solution: one against all approach
    - M binary SVMs that each separate one class from the remaining (M-1) ones
    - M decision functions
    - Assign a sample x to the class whose decision function has the largest value
Slide 34
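The one against all rule reduces to an argmax over the M decision functions. The sketch below shows only that selection step: the decision functions are toy linear functions with made-up weights standing in for trained binary SVMs, and the names `one_against_all`, `linear` and `DECISIONS` are hypothetical.

```python
def one_against_all(x, decision_functions):
    """Return the label whose binary decision function is largest for sample x."""
    return max(decision_functions, key=lambda label: decision_functions[label](x))

def linear(weights, offset):
    """Toy decision function f(x) = w . x + b (stand-in for a trained SVM)."""
    return lambda x: sum(w * xi for w, xi in zip(weights, x)) + offset

# Feature vector: sizes (in bytes) of the first two packets; weights are made up.
DECISIONS = {
    "HTTP": linear((0.002, -0.001), -0.5),
    "SMTP": linear((-0.002, 0.001), 0.5),
    "POP3": linear((-0.001, -0.001), 0.2),
}

label = one_against_all((1000, 100), DECISIONS)
```

Note that this rule always returns one of the M labels, so by itself it has no way of rejecting a sample that belongs to none of the classes.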

Training an SVM classifier Feature extraction Slide 35

Training an SVM classifier Bi-dimensional space, HTTP traffic Slide 36

Training an SVM classifier
Bi-dimensional space, HTTP traffic
- σ = parameter in the Gaussian kernel K
- ν = confidence level: indicates the confidence with which the surface contains the training set
Slide 37

Our SVM-based classifier
- Training phase
  - Find single class surfaces
  - Optimize parameters
    - N = number of features (i.e., packets)
    - Note: here we do NOT separate Fclient from Fserver
  - Training procedure searches for optimal parameters in the single class case through a grid search
  - Multi-class case: pre-set (fixed) parameters
  - 360 vectors for each class
    - Not exactly low complexity, but this is just the training phase
    - Problem: 360 vectors out of 10k
- Classification algorithm:
  - IF there is only one surface that contains the vector under analysis, assign it to the corresponding protocol
  - ELSE IF there are multiple surfaces containing the vector, use [one against all] multi class SVM
  - ELSE [IF there is no surface containing the vector,] assign it to UNKNOWN
Slide 38
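The three-branch classification algorithm can be sketched as below. The trained single-class surfaces and the multi-class decision functions are replaced by toy stand-ins (balls around made-up centers), so every name and number here is hypothetical. Restricting the multi-class step to the protocols whose surfaces overlap is also an assumption, since the slide does not say whether all classes take part in it.

```python
UNKNOWN = "UNKNOWN"

def classify_vector(x, surfaces, decisions):
    """Cascade: single-class surfaces first, multi-class SVM only on overlaps.

    surfaces[p](x)  -> True if x lies inside protocol p's single-class surface
    decisions[p](x) -> value of p's one-against-all decision function
    """
    inside = [p for p, contains in surfaces.items() if contains(x)]
    if len(inside) == 1:
        return inside[0]
    if len(inside) > 1:
        # Assumption: arbitrate only among the overlapping protocols.
        return max(inside, key=lambda p: decisions[p](x))
    return UNKNOWN

def ball(center, radius):
    """Toy single-class surface: a ball around a center point."""
    return lambda x: sum((a - b) ** 2 for a, b in zip(x, center)) <= radius ** 2

def closeness(center):
    """Toy decision function: negative squared distance to a center point."""
    return lambda x: -sum((a - b) ** 2 for a, b in zip(x, center))

CENTERS = {"HTTP": (500, 1400), "SMTP": (100, 200), "POP3": (150, 300)}
SURFACES = {"HTTP": ball(CENTERS["HTTP"], 300),
            "SMTP": ball(CENTERS["SMTP"], 150),
            "POP3": ball(CENTERS["POP3"], 150)}
DECISIONS = {p: closeness(c) for p, c in CENTERS.items()}
```

With these stand-ins, a vector inside only the HTTP ball is labeled HTTP, one inside both the SMTP and POP3 balls goes to the multi-class step, and one far from every ball is labeled UNKNOWN.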

Does it work? Experimental results

UniBS set Training phase: parameters Slide 40

UniBS set Results Slide 41

LBNL set Training phase: parameters Slide 42

LBNL set Results Slide 43

CAIDA set Training phase: parameters Slide 44

CAIDA set Results Slide 45

An application of statistical techniques: detection of HTTP tunnels

Context: enforcing security at corporate network boundaries
- The task of administrators is to guarantee the correct operation of their network, especially at its boundary
  - QoS related to actual requirements of users
  - Block unwanted protocols: for example, chat or peer-to-peer
- Problem: smart users can tunnel forbidden protocols into allowed ones
- Accurate network control is a very hard task
Slide 47

The tunnel as a security threat
- Are firewalls and Application Level Gateways enough?
- Tunneling of a generic protocol over another one is a widespread method to circumvent security restrictions
  - A protocol allowed by the security policy is used as the transport protocol
    - e.g., chat sessions carried over HTTP
  - Forbidden protocols are encapsulated according to the specifications of the allowed protocol
Slide 48

Tunnel hunter: basic idea
- Statistical analysis of the behavior at network layer (IP) of the HTTP application protocol
  - Training phase: determination of the HTTP fingerprint
  - Evaluation phase: is a flow generated by an HTTP application?
- Training phase
  - Validation of real HTTP traffic
  - Building of the statistical model
- Evaluation phase
  - Definition of the anomaly score
  - Classification algorithm
Slide 49

Results
- Only client->server (Fclient) flows are considered
  - One direction is enough to block all non-conforming traffic
- Best results at the 3rd packet
  - Fast detection
- Real HTTP traffic detected with over 99% accuracy
- Decreasing trend as the number of sections increases
Protocol          Hit ratio
HTTP              99.78%
POP3 over HTTP    100%
SMTP over HTTP    100%
CHAT over HTTP    100%
[Figure: plot with #packets on the x-axis]
Slide 50

Challenges ahead

Efficient fingerprinting, i.e., robust pre-classification
- Improve payload-based techniques to achieve perfect results
  - Even if they become very computationally intensive, they could be used at the very least for obtaining good training sets
- Combination of payload-based and statistical techniques
- How often do fingerprints have to be recomputed?
- Are fingerprints transportable?
Slide 52

Improve the algorithms
- The current algorithms are pretty simple: we can expect to improve their effectiveness substantially by introducing several new elements
[Table columns: Statistical fingerprinting approach; SVM; Both]
- Correlate Fclient and Fserver
- Don't stop with (s, Δt): there are other statistical quantities that can be evaluated
- Smarter multi-class approach
- Evaluate other mapping functions (other kernels)
- Adaptive algorithm (per protocol pkt#, threshold, etc.)
- Finding optimal parameters is a tough problem
- Derive classification algorithms from other fields (e.g., signal processing)
Slide 53

Encrypted traffic
- Statistical techniques should work with encrypted traffic, but do they really?
- How to detect flow boundaries in layer-3 or even layer-4 encrypted tunnels?
- How do we fingerprint starting from encrypted flows?
Slide 54

High-performance implementation
- These kinds of statistical techniques seem to be lightweight: are they really?
- Is it really possible to implement them on commodity, HW-accelerated cards (network processors, FPGA, ASIC-based)?
- Would they scale to tens of Gb/s?
Slide 55

Real comparisons with other approaches: the trouble with publicly-available traces
- Many organizations routinely release backbone packet traces (CAIDA, NLANR, etc.). However...
- ...these traces are fully anonymized and stripped of most of the payload
  - As they are, they cannot be used for research in traffic classification: there is no means of verifying the application layer protocol that generated each flow
- It would be useful if researchers in this area started to systematically release anonymized traces with full metadata, including the application layer protocol information
  - Finding good, secure anonymization practices for these kinds of traces could be an interesting piece of research (see Pang '06)
Slide 56

Conclusions

Conclusions
- Traffic classification is a tough problem
- Simple statistical fingerprinting can work, even in its most basic forms
  - It can serve at least to offload the majority of traffic from more complex and computationally expensive classifiers
  - It can be useful in data centers to trigger intrusion-prevention mechanisms on non-conforming traffic
- Further research on more complex algorithms can only improve today's results
- Next logical step: tunneled and encrypted traffic
Slide 58

References
[1] M. Roughan, S. Sen, O. Spatscheck, and N. Duffield. Class-of-service mapping for QoS: a statistical signature-based approach to IP traffic classification. In IMC '04: Proceedings of the 4th ACM SIGCOMM conference on Internet measurement, pages 135-148, New York, NY, USA, 2004. ACM Press.
[2] L. Bernaille, R. Teixeira, and K. Salamatian. Early Application Identification. In The 2nd ADETTI/ISCTE CoNEXT Conference, Lisboa, Portugal, December 2006.
[3] A. McGregor, M. Hall, P. Lorier, and J. Brunskill. Flow Clustering Using Machine Learning Techniques. In Proceedings of the Fifth Passive and Active Measurement Workshop (PAM 2004), Mar. 2004.
[4] A. W. Moore and D. Zuev. Internet traffic classification using bayesian analysis techniques. In SIGMETRICS '05: Proceedings of the 2005 ACM SIGMETRICS international conference on Measurement and modeling of computer systems, pages 50-60, New York, NY, USA, 2005. ACM Press.
[5] T. Karagiannis, K. Papagiannaki, and M. Faloutsos. BLINC: multilevel traffic classification in the dark. In SIGCOMM '05: Proceedings of the 2005 conference on Applications, technologies, architectures, and protocols for computer communications, pages 229-240, New York, NY, USA, 2005. ACM Press.
[6] C. Dewes, A. Wichmann, and A. Feldmann. An analysis of Internet chat systems. In IMC '03: Proceedings of the 3rd ACM SIGCOMM conference on Internet measurement, pages 51-64, New York, NY, USA, 2003.
A. W. Moore and K. Papagiannaki. Toward the Accurate Identification of Network Applications. In Proceedings of the Sixth Passive and Active Measurement Workshop (PAM 2005), Oct. 2005.
[7] HIDE: a Hierarchical Network Intrusion Detection System Using Statistical Preprocessing and Neural Network Classification. In Proceedings of the 2001 IEEE Workshop on Information Assurance and Security, United States Military Academy, West Point, NY, 5-6 June 2001.
[8] T. Karagiannis, A. Broido, M. Faloutsos, and K. C. Claffy. Transport layer identification of P2P traffic. In IMC '04: Proceedings of the 4th ACM SIGCOMM conference on Internet measurement, pages 121-134, New York, NY, USA, 2004. ACM Press.
[9] C. V. Wright, F. Monrose, and G. M. Masson. On Inferring Application Protocol Behaviors in Encrypted Network Traffic. Journal of Machine Learning Research, 7:2745-2769, 2006.
[10] M. Crotti, M. Dusi, F. Gringoli, and L. Salgarelli. Traffic Classification through Simple Statistical Fingerprinting. ACM SIGCOMM Computer Communication Review, Vol. 37, No. 1, pages 5-16, Jan. 2007.
[11] M. Crotti, M. Dusi, F. Gringoli, and L. Salgarelli. Detecting HTTP Tunnels with Statistical Mechanisms. In The 2007 IEEE International Conference on Communications, Glasgow, Jun. 2007.
Slide 59