CLASSIFYING NETWORK TRAFFIC IN THE BIG DATA ERA Professor Yang Xiang Network Security and Computing Laboratory (NSCLab) School of Information Technology Deakin University, Melbourne, Australia
http://anss.org.au/nsclab
Major Research Themes Security and Privacy: large-scale attacks and defence, malware modelling and classification, trusted computing and authentication, IP traceback. Networking: network analytics, traffic classification, big data analytics, CPS, IoT, and RFID, social networks.
Publications Related to This Talk Jun Zhang, Yang Xiang, Yu Wang, Wanlei Zhou, Yong Xiang, and Yong Guan, Network Traffic Classification Using Correlation Information, IEEE Transactions on Parallel and Distributed Systems, vol. 24, no. 1, pp. 104-117, 2013. Jun Zhang, Chao Chen, Yang Xiang, Wanlei Zhou, and Yong Xiang, Internet Traffic Classification by Aggregating Correlated Naive Bayes Predictions, IEEE Transactions on Information Forensics and Security, vol. 8, no. 1, pp. 5-15, 2013. Jun Zhang, Chao Chen, Yang Xiang, Wanlei Zhou, and Athanasios V. Vasilakos, An Effective Network Traffic Classification Method with Unknown Flow Detection, IEEE Transactions on Network and Service Management, vol. 10, no. 2, pp. 133-147, 2013.
Agenda Introduction Related Work Our Innovations Conclusion and Future Directions
Introduction Big data features: the 3Vs (volume, velocity, and variety). The phenomenon behind big data.
Big Network Traffic Data Internet traffic doubles every year, according to Cisco. New applications emerge every day. No existing device can record all network traffic.
Challenges to Future Networks Things change significantly in future networks: infrastructure is exposed, humans are involved, the number of devices increases, and network traffic increases.
Traffic Classification Global Internet Traffic Expected to Quadruple by 2015 By 2015, about 3 billion people will be using the Internet 1 zettabyte in traffic per year (A zettabyte is equal to 1,000,000,000,000,000,000,000 bytes) http://www.theatlantic.com/technology/archive/2011/06/infographic-global-internet-traffic-expected-to-quadruple-by-2015/240182/
Traffic Classification What is in the traffic?
Traffic Classification A mixture of everything!
Traffic Classification Do you want to tell which is which? Technique: classifying network traffic flows by the applications that generate them.
Traffic Classification: Edge Link Example [Figure: ART-TC, the Real-Time Traffic Classification Engine, on an edge link between local hosts (10.0.0.1, 10.0.0.2, 10.0.0.5) and Internet servers (74.125.237.114, 74.125.237.96, 117.121.253.57). Classification result per flow, all over Ethernet II and IPv4. Flow #1: 10.0.0.1->74.125.237.114, TCP, port 49698->port 80, HTTP (web browsing). Flow #2: 10.0.0.2->74.125.237.96, TCP, port 7845->port 80, HTTP (streaming). Flow #3: 10.0.0.7->117.121.253.57, UDP, port 3074->port 3074, Gaming (XBOXLIVE).]
Traffic Classification vs. Packet Classification A packet classifier is an actuator: it applies a sequence of pre-defined rules to incoming packets. Each rule is a predicate over some packet header fields plus a decision to be taken on the matching packets. Challenge: huge rule sets and high-speed links. A traffic classifier is a predictor: it observes or extracts features of incoming flows or packets (packet header fields, payloads, flow statistics), then predicts the underlying applications and applies labels. Challenge: accuracy, efficiency, human effort. It uses packet classifiers to group packets into flows, and it is usually used to generate rules for packet classifiers.
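To make the "actuator" side of this contrast concrete, here is a minimal sketch of a rule-based packet classifier: an ordered list of (predicate, decision) pairs where the first match wins. The field names, the rules, and the queue names are hypothetical illustrations, not taken from the talk.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Packet:
    # Hypothetical header fields used by the predicates below.
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    protocol: str  # "TCP" or "UDP"


@dataclass(frozen=True)
class Rule:
    predicate: Callable[[Packet], bool]  # test over packet header fields
    decision: str                        # action taken on matching packets


def classify_packet(packet: Packet, rules: List[Rule], default: str = "forward") -> str:
    """Return the decision of the first rule whose predicate matches."""
    for rule in rules:
        if rule.predicate(packet):
            return rule.decision
    return default


# Illustrative rule set (assumed, not from the source).
RULES = [
    Rule(lambda p: p.protocol == "UDP" and p.dst_port == 3074, "low-latency-queue"),
    Rule(lambda p: p.protocol == "TCP" and p.dst_port == 80, "best-effort-queue"),
]
```

A traffic classifier would sit upstream of such a rule set: it predicts the application behind each flow, and those predictions can then be turned into rules like the ones above.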
Methods of Traffic Classification The unit of traffic under consideration is usually the flow (also called a connection, session, or conversation). Port number fields. Application payload (Deep Packet Inspection). Flow statistics: describe flows with feature vectors by extracting pre-defined features, so that flows become data points in the feature space. If the data are labelled: supervised learning. If the data are unlabelled: clustering. Example features: inter-packet time (max/min/mean/std.dev), packet size (max/min/mean/std.dev), flow duration.
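The feature-vector idea can be sketched in a few lines. The function below turns one flow, given as a list of (timestamp, packet size) pairs, into the example statistics named on the slide. The input layout and the feature names are assumptions made for illustration.

```python
from statistics import mean, pstdev


def flow_features(packets):
    """Compute simple statistical features for one flow.

    packets: list of (timestamp_seconds, size_bytes) tuples in arrival
    order (an assumed input layout).
    """
    times = [t for t, _ in packets]
    sizes = [s for _, s in packets]
    # Inter-packet times; a single-packet flow gets one zero gap.
    gaps = [later - earlier for earlier, later in zip(times, times[1:])] or [0.0]
    return {
        "pkt_size_min": min(sizes),
        "pkt_size_max": max(sizes),
        "pkt_size_mean": mean(sizes),
        "pkt_size_std": pstdev(sizes),
        "ipt_min": min(gaps),
        "ipt_max": max(gaps),
        "ipt_mean": mean(gaps),
        "ipt_std": pstdev(gaps),
        "duration": times[-1] - times[0],
    }
```

Each flow then becomes one point in a nine-dimensional feature space, ready for a supervised classifier or a clustering algorithm.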
Agenda Introduction Related Work Our Innovations Conclusion and Future Directions
Traffic Classification: Techniques [Figure: classification pipeline for an example flow 10.0.0.1->74.125.237.114 (Ethernet II, IPv4, TCP port 49698->port 80). Flow reassembly is followed by feature extraction from the packet header (e.g. TCP DST port 80), the packet payload, and flow statistics; the ART-TC intelligent decision engine (machine learning) then produces the classification result: HTTP (web browsing, Google). Uses of the result: security and QoS control, traffic analytics, advanced data mining, user profiling. Inset: traffic statistics example, FTP-DATA vs. TELNET (x-axis: avg. inter-packet time; y-axis: avg. packet size).]
Traffic Classification Methods Chapter 5, WAN and Application Optimization Solution Guide, CISCO
Flow Statistical Feature Based Methods Supervised classification Parametric classifiers (C4.5 decision tree, neural network) Non-parametric classifiers (k-nn) Unsupervised classification Clustering + Mapping Difficult to map a large number of clusters to a small number of applications
Supervised Traffic Classification Supervised algorithms + flow statistical feature Naïve Bayes (Moore and Zuev 2005) C4.5 decision tree (Williams et al. 2006) k-nn (Roughan et al. 2004) Bayesian network (Williams et al. 2006) Neural network (Auld et al. 2007) SVM (Kim et al. 2008, Este et al. 2009) Supervised algorithms + IP payload Naïve Bayes, AdaBoost, EM (Haffner et al. 2005) SVM (Finamore et al. 2010)
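As an illustration of a non-parametric classifier over flow statistical features, here is a minimal k-nn sketch. The two features (average inter-packet time, average packet size) follow the FTP-DATA vs. TELNET example earlier in the talk; the training values are invented for illustration, and real use would normally rescale the features first.

```python
import math
from collections import Counter


def knn_predict(train, x, k=1):
    """Predict the label of feature vector x by majority vote of its k nearest
    training samples (Euclidean distance).

    train: list of (feature_vector, label) pairs.
    """
    nearest = sorted(train, key=lambda item: math.dist(item[0], x))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical labelled flows: (avg inter-packet time in s, avg packet size in bytes).
TRAIN = [
    ((0.01, 1400.0), "FTP-DATA"),
    ((0.02, 1300.0), "FTP-DATA"),
    ((0.80, 60.0), "TELNET"),
    ((1.20, 80.0), "TELNET"),
]
```

With only a handful of labelled samples per class, as in this toy set, a single noisy neighbour can flip the prediction, which is the small-sample problem the later innovations address.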
Unsupervised Traffic Classification Traffic clustering EM (McGregor et al. 2004) AutoClass (Zander et al. 2005) k-means (Bernaille et al. 2006) DBSCAN (Erman et al. 2006) Combine flow statistical features and IP payload information (Wang et al. 2010; Finamore et al. 2011) Semi-supervised clustering k-means + few supervised samples (Erman et al. 2007)
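The "clustering + mapping" step can be sketched as follows, assuming the clustering itself (k-means or similar) has already assigned a cluster id to every flow. Each cluster is mapped to the majority application among its few labelled flows. Marking clusters without any labelled flow as "unknown" is an assumption of this sketch.

```python
from collections import Counter, defaultdict


def map_clusters(cluster_ids, labels):
    """Map each cluster id to an application name.

    cluster_ids[i]: cluster of flow i (output of any clustering algorithm).
    labels[i]: application name of flow i, or None if the flow is unlabelled.
    """
    votes = defaultdict(Counter)
    for cid, label in zip(cluster_ids, labels):
        if label is not None:
            votes[cid][label] += 1
    mapping = {}
    for cid in set(cluster_ids):
        if votes[cid]:
            mapping[cid] = votes[cid].most_common(1)[0][0]
        else:
            mapping[cid] = "unknown"  # no labelled flow fell into this cluster
    return mapping
```

The more clusters there are relative to the labelled samples, the more clusters end up without a label, which is the mapping difficulty noted for unsupervised methods.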
Agenda Introduction Related Work Our Innovations Conclusion and Future Directions
Challenges of Traffic Classification for Big Network Data Challenge 1: Big network data, small samples Challenge 2: Processing traffic accurately, at high speed Challenge 3: Unknown applications
Our Innovations Solving Challenge 1: Big network data, small samples Solving Challenge 2: Processing traffic accurately, at high speed Solving Challenge 3: Unknown applications
Innovation 1: Traffic Classification Using Correlation Information Problem: big network data, small samples. Observation: correlation among flows benefits traffic classification. Idea: supervised classification using flow correlation, which effectively improves classification accuracy when only a small number of supervised training samples are available.
Major Contributions New approach: propose a novel non-parametric approach to incorporate flow correlation into the classification process. Theoretical study: provide a detailed theoretical analysis of the novel classification approach and its performance benefit. Empirical study: validate the effectiveness by comparing the classification performance of the proposed approach with state-of-the-art methods.
Correlation Analysis: Example Video Text Image
System Model: TCC
Correlation Analysis 3-tuple heuristic: within a certain period of time, the flows sharing the same 3-tuple {dst_ip, dst_port, protocol} form a Bag of Flows (BoF). In this example, flows AD, BD, and CD are generated by the same application, so they form a BoF.
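The 3-tuple heuristic amounts to a group-by over the destination IP, destination port, and protocol. A minimal sketch, assuming flows arrive as dictionaries and that the input has already been restricted to one time window (both are assumptions of this example):

```python
from collections import defaultdict


def build_bofs(flows):
    """Group flows into Bags of Flows keyed by (dst_ip, dst_port, protocol).

    flows: iterable of dicts with at least the keys "dst_ip", "dst_port",
    and "protocol", all observed within the same time window.
    """
    bofs = defaultdict(list)
    for flow in flows:
        key = (flow["dst_ip"], flow["dst_port"], flow["protocol"])
        bofs[key].append(flow)
    return dict(bofs)


# Hosts A, B, and C all talk to the same service on host D, plus one unrelated flow.
FLOWS = [
    {"src_ip": "A", "dst_ip": "D", "dst_port": 80, "protocol": "TCP"},
    {"src_ip": "B", "dst_ip": "D", "dst_port": 80, "protocol": "TCP"},
    {"src_ip": "C", "dst_ip": "D", "dst_port": 80, "protocol": "TCP"},
    {"src_ip": "A", "dst_ip": "E", "dst_port": 53, "protocol": "UDP"},
]
```

Here flows AD, BD, and CD land in the same bag, while the DNS-like flow to E forms a bag of its own.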
Performance Benefit
Performance Benefit
Classification Method
Performance Evaluation Datasets Experiments Statistical features Performance metrics Results Overall performance Per-experiment performance Per-class performance Comparison with other existing methods Summary
Real-world Network Traffic Datasets wide: P2P, DNS, FTP, WWW, CHAT, MAIL isp: BT, DNS, ebuddy, FTP, HTTP, IMAP, MSN, POP3, RSP, SMTP, SSH, SSL, XMPP, YahooMsg
Statistical Features
Performance Metrics Overall accuracy: the ratio of the number of correctly classified flows to the total number of testing flows; it measures the accuracy of a classifier on the whole testing data. F-measure: F-measure = (2 × precision × recall) / (precision + recall); it evaluates the per-class performance.
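Both metrics are short enough to write out directly. The sketch below computes them from parallel lists of true and predicted labels; returning 0.0 when precision or recall is undefined is a convention chosen for this example.

```python
def overall_accuracy(y_true, y_pred):
    """Fraction of testing flows whose predicted label is correct."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def f_measure(y_true, y_pred, cls):
    """Per-class F-measure: 2 * precision * recall / (precision + recall)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == cls and p == cls for t, p in pairs)  # true positives
    fp = sum(t != cls and p == cls for t, p in pairs)  # false positives
    fn = sum(t == cls and p != cls for t, p in pairs)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

For example, with true labels WWW, WWW, DNS, FTP and predictions WWW, DNS, DNS, FTP, three of four flows are correct, while the WWW class has precision 1 and recall 0.5.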
Overall Performance - wide
Overall Performance - isp
Summary - Overall Performance Compared with the NN classifier, the proposed methods effectively improve the overall performance of traffic classification.
Per-Experiment Performance 10 training samples per class
Per-Experiment Performance 20 training samples per class
Summary - Per-Experiment Performance The proposed methods improve classification accuracy robustly: a consistent improvement is achieved in almost every experiment.
F-measure Per-Class - wide
F-measure Per-Class - isp
F-measure Per-Class - isp
Summary - F-measure Per-Class The proposed methods improve the F-measure of every class, and significant improvements are obtained in most classes.
Comparison with Other Methods - wide
Comparison with Other Methods - isp
Summary - Comparison TCC outperforms the existing traffic classification methods in this comparison, which shows that flow correlation can be applied to effectively improve traffic classification performance.
Innovation 2: Bag of Flow Framework Problem: processing traffic accurately, at high speed. We propose a new traffic classification scheme that utilizes the information among the correlated traffic flows generated by an application. We provide a theoretical study of the proposed scheme: a theoretical framework of classifier combination, and an analysis of the sensitivity to prediction errors of the different aggregation rules employed in the proposed scheme.
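The idea of aggregating per-flow predictions inside one BoF can be sketched with three textbook classifier-combination rules: sum, product, and majority vote. These rules are assumptions chosen for illustration; the slide does not say which aggregation rules the scheme actually employs. The per-flow posteriors would come from a classifier such as Naive Bayes and are given directly here.

```python
import math
from collections import Counter


def aggregate(posteriors, rule="sum"):
    """Combine per-flow class posteriors from one BoF into a single label.

    posteriors: list of dicts mapping class name -> probability, one per flow.
    rule: "sum", "product", or "vote" (illustrative rule names).
    """
    classes = list(posteriors[0].keys())
    if rule == "sum":
        scores = {c: sum(p[c] for p in posteriors) for c in classes}
    elif rule == "product":
        scores = {c: math.prod(p[c] for p in posteriors) for c in classes}
    elif rule == "vote":
        # Each flow votes for its own most probable class.
        votes = Counter(max(p, key=p.get) for p in posteriors)
        scores = {c: votes[c] for c in classes}
    else:
        raise ValueError(f"unknown rule: {rule}")
    return max(scores, key=scores.get)


# Two mildly confident flows and one very confident dissenting flow (made-up values).
BOF_POSTERIORS = [
    {"HTTP": 0.55, "FTP": 0.45},
    {"HTTP": 0.55, "FTP": 0.45},
    {"HTTP": 0.01, "FTP": 0.99},
]
```

On this toy bag the vote rule picks HTTP while the sum and product rules pick FTP, which illustrates why the sensitivity of each rule to individual prediction errors is worth analysing.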
Classification Process of Correlated Traffic
Evaluation
Evaluation
Evaluation
Innovation 3: Compound Classification Framework Problem: Unknown applications
Statistics-based Traffic Classification Very high accuracy. [Figure: a classifier built by supervised learning on a labelled training set (HTTP, FTP, SMTP) predicts the classes of an unlabelled testing set.]
Unknown Classes are Overlooked In classifier design, most previous works assumed: all classes are known during training; all classes have sufficient data for training. In evaluation, they got good results by excluding unwanted data: classifiers were trained with a limited number of classes; classifiers were tested against only data from the trained classes. [Figure: training set with known classes HTTP, FTP, SMTP; testing set with known classes HTTP, FTP, SMTP plus BitTorrent (unknown class).]
Innovation 3: Compound Classification Framework Problem: unknown applications. We aim to tackle the problem of unknown flows in a semi-supervised framework. This work considers very few labelled training samples and investigates flow correlation in a real-world network environment, which distinguishes it from previous works. Flow label propagation automatically labels relevant flows from a large unlabelled dataset. We propose compound classification to jointly identify the correlated flows in order to further boost the classification accuracy. We provide a theoretical justification of the performance benefit of applying these two new techniques to network traffic classification.
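One plausible reading of flow label propagation is sketched below: an unlabelled flow inherits the label of a labelled flow that shares its (dst_ip, dst_port, protocol) 3-tuple. This is an illustrative interpretation based on the BoF heuristic from earlier in the talk, not a confirmed description of the authors' procedure, and the data layout is assumed.

```python
def bof_key(flow):
    """3-tuple used to decide which flows are correlated."""
    return (flow["dst_ip"], flow["dst_port"], flow["protocol"])


def propagate_labels(flows):
    """Return one label per flow, copying known labels to correlated flows.

    flows: list of dicts with "dst_ip", "dst_port", "protocol", and "label"
    (an application name, or None when the flow is unlabelled).
    Flows with no labelled neighbour stay None.
    """
    known = {}
    for flow in flows:
        if flow["label"] is not None:
            # The first label seen for a 3-tuple wins (a simplification).
            known.setdefault(bof_key(flow), flow["label"])
    return [
        flow["label"] if flow["label"] is not None else known.get(bof_key(flow))
        for flow in flows
    ]
```

Flows that remain unlabelled after propagation are the ones that still need a classifier, and some of them may belong to applications that never appear in the labelled set.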
System Model
Flow Label Propagation
Nearest Cluster-based Classifier
Compound Classification
Impact of Unknown Applications
Overall Accuracy and F-Measure
F-Measure on isp Data
Comparison against Other Methods
Comparison against Other Methods
Comparison against Other Methods
Comparison against Other Methods
Agenda Introduction Related Work Our Innovations Conclusion and Future Directions
Conclusion and Future Directions We proposed three frameworks to deal with three major challenges of network traffic classification in the big data era. Solving Challenge 1: Big network data, small samples. Solving Challenge 2: Processing traffic accurately, at high speed. Solving Challenge 3: Unknown applications.
Future Directions Cloud computing: classifying encrypted traffic. More than half of the traffic is HTTP: further classifying HTTP traffic. Building user profiles based on traffic classification. CPS/IoT/Cloud: classifying data link layer traffic.
Thank You! More about Yang Xiang: http://anss.org.au/yang