Toward Line-Rate Traffic Classification
Niccolo' Cascarano
Politecnico di Torino
http://sites.google.com/site/fulviorisso/

Background
- In recent years, many new traffic classification algorithms based on statistical approaches have been proposed
- One of the claims of these new algorithms is that their computational requirements are lower than those of Deep Packet Inspection (DPI) [3-8]
- DPI is commonly considered too expensive
  - Is that true?
  - Can DPI be further improved?
  - Is there anything better than DPI?

The path toward the answers
- Create a model of some classifiers (currently DPI, Naïve Bayes and SVM) and compare their complexity
  - Joint work with Università di Brescia
- Improve the DPI engine itself
- Service-based traffic classification

Question 1: is DPI so computationally complex?

What is DPI?
- DPI = pattern matching through regular expressions
- Two main flavors:
  - Packet-Based per-Flow State (PBFS): network data are analyzed packet by packet, as soon as packets are received by the classifier
  - Message-Based per-Flow State (MBFS): network data are analyzed as a single stream of data after TCP/IP normalization
- PBFS seems roughly equivalent to MBFS with respect to traffic classification [1-2]
- We use a PBFS DPI classifier, plus the capability to analyze correlated sessions (e.g., FTP and SIP)

Methodology
- Cost modeling
  - Average cost per packet (instead of worst-case)
- Modeled each classifier
- Derived the cost of each block
- Determined the transition probability from one block to the other by analyzing real traces (with ground truth [26])
- Derived the min/max/average cost per packet
  - Average: cost of each block times its transition probability

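The averaging step above can be sketched as a weighted sum. This is only an illustration of the idea: the block names follow the model slide, but every cost and probability below is a made-up placeholder, not a measurement from the UNIBS or POLITO traces.

```python
from typing import Dict, Tuple

# (cost in CPU ticks, probability that the block runs for a given packet)
Block = Tuple[float, float]


def average_cost(blocks: Dict[str, Block]) -> float:
    """Expected cost per packet: sum over blocks of cost * execution probability."""
    return sum(cost * prob for cost, prob in blocks.values())


def worst_case_cost(blocks: Dict[str, Block]) -> float:
    """Cost when every block runs for every packet (slow path taken each time)."""
    return sum(cost for cost, _ in blocks.values())


# Placeholder numbers for illustration only, NOT values from the talk.
example_blocks: Dict[str, Block] = {
    "session_id_extraction": (100.0, 1.0),
    "session_lookup": (200.0, 1.0),
    "pattern_matching": (4000.0, 0.1),
    "session_update": (300.0, 0.1),
}

if __name__ == "__main__":
    print("average ticks/packet:", average_cost(example_blocks))
    print("worst-case ticks/packet:", worst_case_cost(example_blocks))
```

With these placeholder values, the expensive pattern-matching block contributes little to the average because it runs for only a small fraction of packets.
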
Models
[Figure: block diagrams of the DPI and SVM classifiers]
- Session ID extraction: extracts the L3 and L4 information from network packets
- Session lookup: checks within the session table whether a packet belongs to an already classified session
- Pattern matching: implements the pattern matching algorithm (DPI only)
- SVM decision: implements the SVM classification algorithm (SVM only)
- Session update: updates the session table with the outcome of the classification
- Correlated session: analyzes the application data to obtain information on correlated sessions (DPI only)

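A minimal sketch of how these blocks might chain together for one packet is shown below. The class and method names are hypothetical, the direction-independent session key is an assumption, and the payload classifier is passed in as a plain function standing in for the pattern matching or SVM decision block.

```python
from typing import Callable, Dict, Optional, Tuple

Endpoint = Tuple[str, int]
SessionId = Tuple[str, int, str, int, str]


class SessionClassifier:
    """Hypothetical fast-path / slow-path pipeline built around a session table."""

    def __init__(self, classify_payload: Callable[[bytes], Optional[str]]):
        self.sessions: Dict[SessionId, str] = {}   # session table
        self.classify_payload = classify_payload   # stands in for DPI or SVM
        self.slow_path_packets = 0

    @staticmethod
    def session_id(src: Endpoint, dst: Endpoint, proto: str) -> SessionId:
        # Session ID extraction. Sorting the endpoints makes both directions
        # of a session map to the same key (an assumption of this sketch).
        lo, hi = sorted([src, dst])
        return (lo[0], lo[1], hi[0], hi[1], proto)

    def process(self, src: Endpoint, dst: Endpoint, proto: str,
                payload: bytes) -> Optional[str]:
        sid = self.session_id(src, dst, proto)
        label = self.sessions.get(sid)             # session lookup
        if label is not None:
            return label                           # fast path: already classified
        self.slow_path_packets += 1
        label = self.classify_payload(payload)     # slow path: inspect the packet
        if label is not None:
            self.sessions[sid] = label             # session update
        return label
```

Once a session has a label, later packets in either direction return it from the table without touching the payload, which is what the best-case (fast path) cost refers to.
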
Basic blocks implementation
- Session ID extraction: native assembly code for IA32 generated by the NetVM framework [19]
- Session lookup and session update: C++ code using the hash_map container of the extended STL C++ library [18]
- Pattern matching: C++ code implementing a DFA-based algorithm generated by Flex [20]. About 30 application protocols are recognized (NOTE: the cost of this block does NOT depend on the number of protocols recognized)
- SVM decision: C++ code written exploiting the multivariate Gaussian joint density function. We generated the models for recognizing about 10 application protocols (NOTE: the cost of this block DEPENDS linearly on the number of protocols recognized)
- Correlated session: purpose-written C++ code, deriving the correlated session rules for the FTP and SIP protocols from the NetPDL database [17]

Experimental evaluation
- Costs of each block measured with the RDTSC instruction
- Costs that depend on the input traffic (e.g. DFA) are further characterized, so that the relevant parameters appear in the final formula
- Traffic traces
  - UNIBS: contains a large percentage of P2P traffic, known to be challenging for DPI classifiers
  - POLITO: traffic from a medium-size campus network (~6000 hosts within the network)

Absolute costs of each basic block
- Pattern matching depends on the packet size
- SVM depends on the number of protocols examined

Comparison

Comparison
- Legend
  - Best case: all the packets belong to already classified sessions (fast path)
  - Worst case: all the packets need to take the slow path
  - Average case: the costs are weighted by the execution probability of each basic block
- Results
  - The cost of the DPI classifier is of the same order of magnitude as that of the other classifiers, even for the challenging UNIBS trace
  - May be better on some traces
  - Comparison not exactly fair (48 protocols for DPI against 12)

Conclusion 1
- Packet-based DPI may not be as complex as we thought, as far as pure traffic classification is concerned

Question 2: can we reduce DPI cost?

Yes, We Can
- if we focus on traffic classification and not on network security

(1) Use fast algorithms

Algorithm               Min (ticks)   Avg (ticks)   Max (ticks)
Flex (canonical DFA)    76            3980          19147
PCRE (NFA-based)        35.7K         2.08M         9.16M

- DFA is simple and O(payload_length)
- Key question: is the DFA usable?

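The O(payload_length) property comes from doing one table lookup per input byte. The sketch below hand-builds a tiny DFA for payloads starting with "GET "; it is not the Flex-generated automaton used in the talk, and the transition table, state numbers and function name are all illustrative assumptions.

```python
from typing import Dict, Set, Tuple

# Hand-written DFA that accepts payloads beginning with b"GET ".
# State 4 is accepting; a missing transition means "dead state" (no match).
TRANSITIONS: Dict[Tuple[int, int], int] = {
    (0, ord("G")): 1,
    (1, ord("E")): 2,
    (2, ord("T")): 3,
    (3, ord(" ")): 4,
}
ACCEPTING: Set[int] = {4}


def run_dfa(payload: bytes,
            transitions: Dict[Tuple[int, int], int] = TRANSITIONS,
            accepting: Set[int] = ACCEPTING,
            start: int = 0) -> bool:
    """Walk the DFA over the payload with one lookup per byte.

    Stops as soon as an accepting state is reached (the match is a prefix
    match in this sketch) or as soon as no transition exists.
    """
    state = start
    for byte in payload:
        if state in accepting:
            return True
        nxt = transitions.get((state, byte))
        if nxt is None:
            return False
        state = nxt
    return state in accepting
```

The work per byte is constant and there is no backtracking, so the cost grows with the payload length rather than with the number of alternatives encoded in the automaton.
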
(2) Use friendly regular expressions (preliminary results)

(2) and convert some into friendly ones

Average cost on HTTP

Regular expression                        Match (ticks)   No match (ticks)
Anchored                                  1663            1415
Anchored + Kleene                         5622            1367
Not anchored + Kleene                     5503            3300
Not anchored + Kleene and backtracking    5290            13659

Baseline: not anchored + Kleene

                                          http            unknown
Anchored (on UNIBS-GT)                    0%              0%
Anchored + Kleene (on UNIBS-GT)           0%              0%

                                          unknown         http
Anchored (on POLITO)                      0.004%          0.38%
Anchored + Kleene (on POLITO)             0.005%          0%

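The anchored versus not anchored distinction in the table above can be illustrated with Python's `re` module. The patterns are hypothetical examples written for this sketch, not the signatures used in the measurements, and Python's engine is a backtracking matcher rather than a DFA, so only the matching behavior (not the tick counts) carries over.

```python
import re

# Hypothetical HTTP response signatures, for illustration only.
ANCHORED = re.compile(rb"^HTTP/1\.[01] \d{3}")      # must match at offset 0
NOT_ANCHORED = re.compile(rb"HTTP/1\.[01] \d{3}")   # may match anywhere


def is_http_anchored(payload: bytes) -> bool:
    # match() only tries offset 0, so a non-matching payload is rejected
    # after inspecting its first few bytes.
    return ANCHORED.match(payload) is not None


def is_http_not_anchored(payload: bytes) -> bool:
    # search() tries every offset, so a non-matching payload is scanned
    # in full before the matcher gives up.
    return NOT_ANCHORED.search(payload) is not None
```

The trade-off is that the anchored form misses a signature that does not start at the first payload byte, which is one way an anchored expression can leave a session unclassified.
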
(3) Use a packet-based approach

Trace       Unknown TCP traffic   Additional classified TCP traffic
POLITO      23.5GB                2.6MB
UNIBS-GT    870MB                 0B

(4) Snapshot-based classification
- No difference in accuracy when the snapshot length is >= 256 bytes

(4) Snapshot-based classification
- Fair speedup with TCP traffic

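Snapshot-based classification can be sketched as truncating each payload before handing it to the matcher. The function names are hypothetical; the 256-byte default comes from the accuracy result above, and the classifier is passed in as a plain function.

```python
from typing import Callable, Optional

SNAPSHOT_LEN = 256  # bytes; accuracy was reported unchanged at or above this length


def snapshot(payload: bytes, length: int = SNAPSHOT_LEN) -> bytes:
    """Return at most `length` leading bytes of the payload."""
    return payload[:length]


def classify_with_snapshot(payload: bytes,
                           classify: Callable[[bytes], Optional[str]],
                           length: int = SNAPSHOT_LEN) -> Optional[str]:
    """Run the classifier on the snapshot instead of the full payload."""
    return classify(snapshot(payload, length))
```

Since a DFA costs one step per byte, capping the input at 256 bytes bounds the per-packet matching cost regardless of the packet size.
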
(5) Limiting classification attempts

Trace               Avg # pkts   Std dev
UNIBS-GT (TCP)      654          4619
POLITO-GT (TCP)     563          3659
POLITO (TCP)        68           1879
UNIBS-GT (UDP)      2.62         0.71
POLITO-GT (UDP)     6.05         26.4
POLITO (UDP)        9.17         476

Protocol               Avg # pkts   Std dev
Bittorrent (TCP)       1            0
Samba (TCP)            1.01         0.29
HTTP (TCP)             1.05         15.6
Skype (UDP)            1.7          437
SSL(UDP)               1.92         267
Telnet (TCP)           2599         3276
Direct Connect (TCP)   30694        60076

(5) Limiting classification attempts

(5) Limiting classification attempts
- Accuracy is stable for TCP and may decrease for UDP; almost no misclassifications in either case

(5) Limiting classification attempts
- Possible high speedup with TCP traffic

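One way to picture the attempts limit is a per-session counter: once a session has used up its attempts without a match, it is marked unknown and its later packets skip the matcher. The class below is a hypothetical sketch of that idea, not the implementation evaluated in the talk.

```python
from typing import Callable, Dict, Hashable, Optional

UNKNOWN = "unknown"


class AttemptLimiter:
    """Give up on a session after `max_attempts` unsuccessful classifications."""

    def __init__(self, classify: Callable[[bytes], Optional[str]],
                 max_attempts: int):
        self.classify = classify
        self.max_attempts = max_attempts
        self.labels: Dict[Hashable, str] = {}    # final label per session
        self.attempts: Dict[Hashable, int] = {}  # attempts used so far

    def process(self, session_id: Hashable, payload: bytes) -> Optional[str]:
        if session_id in self.labels:
            return self.labels[session_id]       # decided: no further inspection
        used = self.attempts.get(session_id, 0) + 1
        self.attempts[session_id] = used
        label = self.classify(payload)
        if label is None and used >= self.max_attempts:
            label = UNKNOWN                      # attempts exhausted
        if label is not None:
            self.labels[session_id] = label
            del self.attempts[session_id]
        return label                             # None means "still undecided"
```

With a small limit, long-lived sessions that never match stop consuming matcher time after their first few packets, which is where the speedup would come from.
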
(4)+(5) Snapshot + Attempts limit
- Distribution of classified traffic changes; no clear understanding of the new parameters

Conclusions 2
- DFA is OK for traffic classification
- Fast algorithms
  - Up to 3 orders of magnitude
- Friendly regex
  - May achieve up to 5 times speedup
- No message-based processing
- Snapshot = 256 for UDP and a fair attempts limit (e.g. 10)
  - Fairly small packets; signatures that operate on packet sequences
- Strict attempts limit for TCP (N=2)
  - Able to catch response packets
- A speedup of 15 on the results in Conclusion 1 gives 20Mpps on a 3GHz CPU

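The last figure can be unpacked with a back-of-the-envelope calculation. It assumes one tick per clock cycle and a CPU fully dedicated to classification; the implied baseline cost is an inference from the three numbers on the slide, not a value stated in the talk.

```python
CPU_HZ = 3e9        # 3 GHz CPU, from the slide
TARGET_PPS = 20e6   # 20 Mpps, from the slide
SPEEDUP = 15        # overall speedup, from the slide

# Tick budget per packet needed to sustain the target rate.
ticks_per_packet_after = CPU_HZ / TARGET_PPS

# Average per-packet cost that this budget implies before the speedup.
ticks_per_packet_before = ticks_per_packet_after * SPEEDUP

if __name__ == "__main__":
    print("budget after speedup (ticks/packet):", ticks_per_packet_after)
    print("implied baseline (ticks/packet):", ticks_per_packet_before)
```

In other words, 20 Mpps on a 3 GHz CPU leaves 150 ticks per packet, which corresponds to an average of about 2250 ticks per packet before the optimizations.
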
Addendum
- What are regex?
  - We usually assume regex = regular expressions (e.g. Perl)
  - We believe this model is not powerful enough to cope with modern traffic classification
  - We have to think about a more extended model
    - E.g. Skype and RTP are currently detected with some imperative code in addition to regex
  - Left to future work

Is there anything better than DPI?

Perhaps not better, but Service-Based Traffic Classification is surely an answer
- Not exactly a replacement for DPI
- Instead, something orthogonal to (I would like to say most) traffic classification approaches
- Service-Based Classification: once (IP, port) has been associated with a service S, all established sessions that involve that endpoint are associated with S without further analysis

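The definition above amounts to a lookup table keyed by (IP, port). The sketch below is a hypothetical illustration of that idea only; the class name, the method names and the choice to check both endpoints of a session are assumptions, since the talk gives no implementation details.

```python
from typing import Dict, Optional, Tuple

Endpoint = Tuple[str, int]  # (IP address, transport port)


class ServiceTable:
    """Maps a server endpoint to the service learned from an earlier session."""

    def __init__(self) -> None:
        self.services: Dict[Endpoint, str] = {}

    def learn(self, ip: str, port: int, service: str) -> None:
        # Called once some other classifier (e.g. DPI) has identified a session.
        self.services[(ip, port)] = service

    def classify_session(self, src: Endpoint, dst: Endpoint) -> Optional[str]:
        # A session inherits the service of whichever endpoint is known.
        # Checking both sides is an assumption of this sketch.
        label = self.services.get(dst)
        if label is None:
            label = self.services.get(src)
        return label
```

Because the lookup never touches the payload, a later session to the same endpoint gets a label even when its content is encrypted.
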
Service-Based Traffic Classification
- No further details are provided in this presentation
- However, extensive analysis confirms that it really works
- By-product: if the first classification is correct, much more traffic gets classified
  - E.g. a service with a few sessions in the clear and most of its traffic encrypted

SBC: Services vs. sessions
[Figure: number of services and of sessions over time (hours)]
- Session table is one order of magnitude larger than service table

Conclusions
- The well-known limit of DPI is encrypted sessions
  - No way to cope with that with DPI alone
- DPI (for traffic classification) may not be so costly compared to its competitors, and has many advantages
  - E.g. no training (regex are simple to derive)
  - Simple implementation
  - Most of the time, it walks over small portions of the DFA (in cache)
- Service-Based Classification may be a good complement to the previous solutions
- My 2c: statistical traffic classifiers may be a better fit for a limited number of protocols (e.g. if you want to identify just P2P), but are not applicable to hundreds of protocols

Questions?

References
[1] A. Moore, K. Papagiannaki, Toward the Accurate Identification of Network Applications, 6th International Workshop on Passive and Active Network Measurement, Boston MA, USA, pp. 41-54, May 2005.
[2] F. Risso, A. Baldini, M. Baldi, P. Monclus, O. Morandi, Lightweight, Payload-Based Traffic Classification: An Experimental Evaluation, IEEE International Conference on Communications (ICC 2008), Beijing (China), pp. 5869-5875, May 2008.
[3] J. Erman, A. Mahanti, M. Arlitt, C. Williamson, Identifying and discriminating between web and peer-to-peer traffic in the network core, Proceedings of the 16th International Conference on World Wide Web, Banff, Alberta, Canada, pp. 883-892, 2007.
[4] J. Erman, M. Arlitt, A. Mahanti, Traffic classification using clustering algorithms, Proceedings of the 2006 SIGCOMM, Pisa, Italy, pp. 281-286, 2006.
[5] L. Bernaille, R. Teixeira, I. Akodkenou, Traffic classification on the fly, 4th ACM/IEEE Symposium on Architectures for Networking and Communications Systems, San Jose, CA, pp. 40-49, 2008.
[6] S. Zander, T. Nguyen, G. Armitage, Self-learning IP traffic classification based on statistical flow characteristics, International Workshop on Passive and Active Network Measurement, Boston MA, pp. 325-328, 2005.
[7] M. Crotti, M. Dusi, F. Gringoli, L. Salgarelli, Traffic Classification through Simple Statistical Fingerprinting, ACM SIGCOMM Computer Communication Review, Vol. 37, No. 1, pp. 5-16, Jan. 2007.
[8] L. Bernaille, R. Teixeira, K. Salamatian, Early Application Identification, 2nd CoNEXT Conference, Lisboa, Portugal, Dec. 2006.
[9] A. Este, F. Gringoli, L. Salgarelli, Support Vector Machines for TCP Traffic Classification, Università degli Studi di Brescia, Technical Report 08-07, Jul. 2008.
[10] N. Williams, S. Zander, G. Armitage, A Preliminary Performance Comparison of Five Machine Learning Algorithms for Practical IP Traffic Flow Classification, SIGCOMM Computer Communication Review, Vol. 36, No. 5, pp. 7-15, Oct. 2006.
[11] H. Kim, Kc Claffy, M. Fomenkova, D. Barman, M. Faloutsos, Internet Traffic Classification Demystified: The Myths, Caveats and Best Practices, ACM CoNEXT, Madrid, Spain, Dec. 2008.
[12] WEKA, http://www.cs.waikato.ac.nz/ml/weka
[13] T. Karagiannis, K. Papagiannaki, M. Faloutsos, BLINC: Multilevel traffic classification in the Dark, ACM SIGCOMM, Aug. 2005.
[14] A. Este, F. Gargiulo, F. Gringoli, L. Salgarelli, C. Sansone, Pattern Recognition Approaches for Classifying IP Flows, 7th International Workshop on Statistical Pattern Recognition, Orlando, FL, Dec. 2008.
[15] V.N. Vapnik, Statistical Learning Theory, John Wiley and Sons, New York, 1998.
[16] B. Scholkopf, J.C. Platt, J. Shawe-Taylor, A.J. Smola, R.C. Williamson, Estimating the Support of a High-Dimensional Distribution, Neural Computation, 13, pp. 1443-1471, 2001.
[17] Computer Networks Group (NetGroup) at Politecnico di Torino, The NetBee Library, August 2004. [Online] Available at http://www.nbee.org/
[18] Hash map container reference, http://www.sgi.com/tech/stl/hash_map.html
[19] O. Morandi, F. Risso, M. Baldi, A. Baldini, Enabling flexible protocol processing through dynamic code generation, International Conference on Communications, Beijing (China), pp. 5849-5856, May 2008.
[20] flex: The Fast Lexical Analyzer, http://flex.sourceforge.net/
[21] R. Smith, C. Estan, S. Jha, S. Kong, Deflating the big bang: fast and scalable deep packet inspection with extended finite automata, ACM SIGCOMM Computer Communication Review, Vol. 38, No. 4, pp. 207-218, Oct. 2008.
[22] M. Becchi, P. Crowley, Efficient regular expression evaluation: Theory to practice, Proceedings of the 4th ACM/IEEE Symposium on Architectures for Networking and Communications Systems, San Jose, California, pp. 50-59, 2008.
[23] S. Kumar, S. Dharmapurikar, F. Yu, P. Crowley, J. Turner, Algorithms to accelerate multiple regular expressions matching for deep packet inspection, ACM SIGCOMM Computer Communication Review, Vol. 36, No. 4, pp. 339-350, Oct. 2006.
[24] File Transfer Protocol (FTP), RFC 959, http://www.ietf.org/rfc/rfc959.txt
[25] N. Brownlee, Traffic flow measurement: Meter MIB, Request for Comments RFC 2064, Internet Engineering Task Force, January 1997.
[26] F. Gringoli, L. Salgarelli, M. Dusi, N. Cascarano, F. Risso, K.C. Claffy, GT: picking up the truth from the ground for Internet traffic, ACM Computer Communication Review, October 2009.