CLASSIFYING NETWORK TRAFFIC IN THE BIG DATA ERA



CLASSIFYING NETWORK TRAFFIC IN THE BIG DATA ERA Professor Yang Xiang Network Security and Computing Laboratory (NSCLab) School of Information Technology Deakin University, Melbourne, Australia

http://anss.org.au/nsclab

Major Research Themes Security and Privacy Large-scale attacks and defence Malware modelling and classification Trusted computing and authentication IP traceback Networking Network analytics Traffic classification Big data analytics CPS, IoT, and RFID Social networks

Publications Related to This Talk
- Jun Zhang, Yang Xiang, Yu Wang, Wanlei Zhou, Yong Xiang, and Yong Guan, "Network Traffic Classification Using Correlation Information", IEEE Transactions on Parallel and Distributed Systems, vol. 24, no. 1, pp. 104-117, 2013.
- Jun Zhang, Chao Chen, Yang Xiang, Wanlei Zhou, and Yong Xiang, "Internet Traffic Classification by Aggregating Correlated Naive Bayes Predictions", IEEE Transactions on Information Forensics and Security, vol. 8, no. 1, pp. 5-15, 2013.
- Jun Zhang, Chao Chen, Yang Xiang, Wanlei Zhou, and Athanasios V. Vasilakos, "An Effective Network Traffic Classification Method with Unknown Flow Detection", IEEE Transactions on Network and Service Management, vol. 10, no. 2, pp. 133-147, 2013.

Agenda Introduction Related Work Our Innovations Conclusion and Future Directions

Introduction The big data Features: 3Vs: volume, velocity, and variety The phenomenon behind big data

Big Network Traffic Data
- Internet traffic doubles every year, according to Cisco
- New applications emerge every day
- No existing device can record all network traffic

Challenges to the Future Networks
Things change significantly in future networks:
- Infrastructure is exposed
- Humans are involved
- The number of devices increases
- Network traffic increases

Traffic Classification Global Internet Traffic Expected to Quadruple by 2015 By 2015, about 3 billion people will be using the Internet 1 zettabyte in traffic per year (A zettabyte is equal to 1,000,000,000,000,000,000,000 bytes) http://www.theatlantic.com/technology/archive/2011/06/infographic-global-internet-traffic-expected-to-quadruple-by-2015/240182/

Traffic Classification What is in the traffic?

Traffic Classification A mixture of everything!

Traffic Classification
Do you want to tell which is which?
Technique: classify network traffic flows by the applications that generate them

Traffic Classification: Edge Link Example
[Figure: hosts on an edge link (10.0.0.1, 10.0.0.2, 10.0.0.5) reach Internet servers (74.125.237.114, 74.125.237.96, 117.121.253.57) through ART-TC, the Real-Time Traffic Classification Engine.]
ART-TC classification result (Link: Ethernet II; Internet: IPv4):
- Flow #1: 10.0.0.1->74.125.237.114, TCP, port 49698->port 80, HTTP (web browsing)
- Flow #2: 10.0.0.2->74.125.237.96, TCP, port 7845->port 80, HTTP (streaming)
- Flow #3: 10.0.0.7->117.121.253.57, UDP, port 3074->port 3074, Gaming (XBOXLIVE)

Traffic Classification vs. Packet Classification
- A packet classifier is an actuator
  - It applies a sequence of pre-defined rules to incoming packets
  - Each rule is a predicate over some packet header fields plus a decision to be taken on the matching packets
  - Challenge: huge rule sets and high-speed links
- A traffic classifier is a predictor
  - It observes/extracts features of incoming flows/packets: packet header fields, payloads, flow statistics
  - It then predicts the underlying applications and applies labels
  - Challenges: accuracy, efficiency, human effort
  - It uses packet classifiers to classify packets into flows
  - It is usually used to generate rules for packet classifiers
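
The packet-classifier side of this comparison can be sketched in a few lines of Python. This is only an illustration of "a predicate over header fields plus a decision", not code from the talk: the `Rule` type, the field names (`proto`, `dst_port`) and the decision strings are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# A packet is modelled as a dict of header fields (hypothetical field names).
Packet = Dict[str, object]


@dataclass
class Rule:
    predicate: Callable[[Packet], bool]  # test over packet header fields
    decision: str                        # action taken on matching packets


def classify_packet(packet: Packet, rules: List[Rule], default: str = "drop") -> str:
    """Apply the rules in order and return the decision of the first match."""
    for rule in rules:
        if rule.predicate(packet):
            return rule.decision
    return default


# Hypothetical rule set; a traffic classifier could be used to generate rules like these.
RULES = [
    Rule(lambda p: p["proto"] == "TCP" and p["dst_port"] == 80, "queue:web"),
    Rule(lambda p: p["proto"] == "UDP" and p["dst_port"] == 3074, "queue:gaming"),
]

print(classify_packet({"proto": "TCP", "dst_port": 80}, RULES))
```

A linear scan like this is exactly what becomes a bottleneck with huge rule sets on high-speed links, which is the challenge named on the slide.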

Methods of Traffic Classification
The unit of traffic under consideration is usually the flow (also called connection, session, conversation)
- Port number fields
- Application payload (Deep Packet Inspection)
- Flow statistics
  - Describe flows with feature vectors by extracting pre-defined features
  - Flows become data points in the feature space
  - Data are labelled: supervised learning
  - Data are unlabelled: clustering
Example flow features: packet size (max/min/mean/std. dev.), inter-packet time (max/min/mean/std. dev.), flow duration
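
As a rough sketch of how a flow becomes a data point in the feature space, the function below computes the statistics named on the slide. The input layout (a list of timestamp and size pairs) and the feature names are assumptions made for this example.

```python
from statistics import mean, pstdev
from typing import Dict, List, Tuple


def flow_features(packets: List[Tuple[float, int]]) -> Dict[str, float]:
    """Build a feature vector from one flow.

    packets: (timestamp in seconds, size in bytes) pairs, sorted by timestamp.
    """
    times = [t for t, _ in packets]
    sizes = [s for _, s in packets]
    # Inter-packet times; a single-packet flow gets one zero gap by convention.
    gaps = [b - a for a, b in zip(times, times[1:])] or [0.0]
    return {
        "size_min": min(sizes), "size_max": max(sizes),
        "size_mean": mean(sizes), "size_std": pstdev(sizes),
        "ipt_min": min(gaps), "ipt_max": max(gaps),
        "ipt_mean": mean(gaps), "ipt_std": pstdev(gaps),
        "duration": times[-1] - times[0],
    }


features = flow_features([(0.0, 100), (1.0, 300), (3.0, 200)])
print(features["size_mean"], features["ipt_mean"], features["duration"])
```

With labels attached, vectors like these feed a supervised classifier; without labels, they are the input to clustering.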

Agenda Introduction Related Work Our Innovations Conclusion and Future Directions

Traffic Classification: Techniques Example
[Figure: ART-TC processing pipeline. Packets are reassembled into flows, features are extracted from the packet header, packet payload and flow statistics, and a machine-learning decision engine outputs the classification result. Example flow: Ethernet II, IPv4, 10.0.0.1->74.125.237.114, TCP port 49698->port 80, classified as HTTP (web browsing, Google). Inset plot: traffic statistics of FTP-DATA vs. TELNET (x-axis: avg. inter-packet time; y-axis: avg. packet size). Downstream uses: security & QoS control, traffic analytics, advanced data mining, user profiling.]

Traffic Classification Methods
[Figure. Source: Chapter 5, WAN and Application Optimization Solution Guide, Cisco]

Flow Statistical Feature Based Methods
- Supervised classification
  - Parametric classifiers (C4.5 decision tree, neural network)
  - Non-parametric classifiers (k-NN)
- Unsupervised classification
  - Clustering + mapping
  - Difficult to map a large number of clusters to a small number of applications
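
To make the non-parametric case concrete, here is a minimal nearest-neighbour (k-NN with k = 1) sketch over two flow features. The training points are made-up values chosen only to echo the FTP-DATA vs. TELNET example; a real system would normalise features with different units before measuring distance.

```python
import math
from typing import List, Sequence, Tuple

# (feature vector, application label); features are
# (avg. inter-packet time in s, avg. packet size in bytes). Values are invented.
Sample = Tuple[Sequence[float], str]


def nn_classify(x: Sequence[float], training: List[Sample]) -> str:
    """Label x with the class of its nearest training sample (Euclidean distance)."""
    _, label = min(training, key=lambda sample: math.dist(x, sample[0]))
    return label


TRAINING: List[Sample] = [
    ((0.01, 1400.0), "FTP-DATA"),
    ((0.02, 1300.0), "FTP-DATA"),
    ((0.80, 70.0), "TELNET"),
    ((1.20, 60.0), "TELNET"),
]

print(nn_classify((0.05, 1250.0), TRAINING))
print(nn_classify((0.90, 80.0), TRAINING))
```

Nothing is fitted in advance: the training samples themselves are the model, which is what "non-parametric" refers to here.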

Supervised Traffic Classification
- Supervised algorithms + flow statistical features
  - Naïve Bayes (Moore and Zuev 2005)
  - C4.5 decision tree (Williams et al. 2006)
  - k-NN (Roughan et al. 2004)
  - Bayesian network (Williams et al. 2006)
  - Neural network (Auld et al. 2007)
  - SVM (Kim et al. 2008, Este et al. 2009)
- Supervised algorithms + IP payload
  - Naïve Bayes, AdaBoost, EM (Haffner et al. 2005)
  - SVM (Finamore et al. 2010)

Unsupervised Traffic Classification
- Traffic clustering
  - EM (McGregor et al. 2004)
  - AutoClass (Zander et al. 2005)
  - k-means (Bernaille et al. 2006)
  - DBSCAN (Erman et al. 2006)
  - Combine flow statistical features and IP payload information (Wang et al. 2010; Finamore et al. 2011)
- Semi-supervised clustering
  - k-means + few supervised samples (Erman et al. 2007)
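
The "clustering + few supervised samples" idea can be illustrated by the mapping step alone. The sketch below assumes a clustering algorithm (for example k-means) has already assigned every flow a cluster id, and maps each cluster to the majority label among its few labelled members. The function name and the "unknown" fallback are choices made for this example, not details taken from the cited papers.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Optional


def map_clusters(cluster_ids: List[int], labels: List[Optional[str]]) -> Dict[int, str]:
    """Map each cluster to the majority label of its labelled flows.

    labels[i] is None when flow i is unlabelled. Clusters without any
    labelled flow are mapped to "unknown".
    """
    votes: Dict[int, Counter] = defaultdict(Counter)
    for cid, label in zip(cluster_ids, labels):
        votes[cid]  # make sure every cluster appears, even with no labels
        if label is not None:
            votes[cid][label] += 1
    return {
        cid: (counter.most_common(1)[0][0] if counter else "unknown")
        for cid, counter in votes.items()
    }


mapping = map_clusters(
    cluster_ids=[0, 0, 0, 1, 1, 2],
    labels=["HTTP", None, "HTTP", "DNS", None, None],
)
print(mapping)
```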

Agenda Introduction Related Work Our Innovations Conclusion and Future Directions

Challenges of Traffic Classification for Big Network Data
- Challenge 1: Big network data, small samples
- Challenge 2: Processing traffic accurately and at high speed
- Challenge 3: Unknown applications

Our Innovations
- Solving Challenge 1: Big network data, small samples
- Solving Challenge 2: Processing traffic accurately and at high speed
- Solving Challenge 3: Unknown applications

Innovation 1: Traffic Classification Using Correlation Information
- Problem: big network data, small samples
- Observation: correlation among flows is beneficial to traffic classification
- Idea: supervised classification using flow correlation
  - Effectively improves classification accuracy when only a small number of labelled training samples are available

Major Contributions
- New approach: propose a novel non-parametric approach to incorporate flow correlation into the classification process
- Theoretical study: provide a detailed theoretical analysis of the novel classification approach and its performance benefit
- Empirical study: validate the effectiveness by comparing the classification performance of the proposed approach with state-of-the-art methods

Correlation Analysis: Example Video Text Image

System Model: TCC

Correlation Analysis
- 3-tuple heuristic: within a certain period of time, the flows sharing the same 3-tuple {dst_ip, dst_port, protocol} form a Bag of Flows (BoF)
- In this example, flows AD, BD, and CD are generated by the same application and can therefore form a BoF.
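
The 3-tuple heuristic amounts to a group-by. The sketch below is one possible reading of it: the `Flow` record, the fixed 60-second window used to model "a certain period of time", and the host names A to E are assumptions for illustration.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple, Tuple


class Flow(NamedTuple):
    src_ip: str
    dst_ip: str
    dst_port: int
    protocol: str
    start: float  # flow start time in seconds


# (dst_ip, dst_port, protocol) plus the index of the time window (an assumption).
BofKey = Tuple[str, int, str, int]


def build_bofs(flows: List[Flow], window: float = 60.0) -> Dict[BofKey, List[Flow]]:
    """Group flows that share the same 3-tuple within the same time window."""
    bofs: Dict[BofKey, List[Flow]] = defaultdict(list)
    for flow in flows:
        key = (flow.dst_ip, flow.dst_port, flow.protocol, int(flow.start // window))
        bofs[key].append(flow)
    return dict(bofs)


flows = [
    Flow("A", "D", 80, "TCP", 1.0),
    Flow("B", "D", 80, "TCP", 5.0),
    Flow("C", "D", 80, "TCP", 20.0),
    Flow("A", "E", 53, "UDP", 7.0),
]
for key, bag in build_bofs(flows).items():
    print(key, [f.src_ip for f in bag])
```

Flows AD, BD and CD land in one bag because they target the same service on D, while the flow to E forms a bag of its own.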

Performance Benefit

Classification Method

Performance Evaluation
- Datasets
- Experiments
  - Statistical features
  - Performance metrics
- Results
  - Overall performance
  - Per-experiment performance
  - Per-class performance
  - Comparison with other existing methods
- Summary

Real-world Network Traffic Datasets
- wide: P2P, DNS, FTP, WWW, CHAT, MAIL
- isp: BT, DNS, ebuddy, FTP, HTTP, IMAP, MSN, POP3, RSP, SMTP, SSH, SSL, XMPP, YahooMsg

Statistical Features

Performance Metrics
- Overall accuracy
  - Ratio of the number of correctly classified flows to the total number of testing flows
  - Measures the accuracy of a classifier on the whole testing data
- F-measure
  - F-measure = (2 × precision × recall) / (precision + recall)
  - Evaluates the per-class performance
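
Both metrics are easy to compute from true and predicted labels. The following is a small sketch using the standard definitions of precision and recall; the function names and the toy label lists are made up for the example.

```python
from typing import List


def overall_accuracy(y_true: List[str], y_pred: List[str]) -> float:
    """Correctly classified flows divided by all testing flows."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f_measure(y_true: List[str], y_pred: List[str], cls: str) -> float:
    """Harmonic mean of precision and recall for one class."""
    tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
    fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
    fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


y_true = ["HTTP", "HTTP", "DNS", "FTP"]
y_pred = ["HTTP", "DNS", "DNS", "FTP"]
print(overall_accuracy(y_true, y_pred))
print(f_measure(y_true, y_pred, "HTTP"))
```

In this toy run three of four flows are correct, and HTTP has precision 1 and recall 0.5, so its F-measure is 2/3.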

Overall Performance - wide

Overall Performance - isp

Summary - Overall Performance
Compared with the NN classifier, the proposed methods effectively improve the overall performance of traffic classification.

Per-Experiment Performance 10 training samples per class

Per-Experiment Performance 20 training samples per class

Summary - Per-Experiment Performance The proposed methods can improve the classification accuracy in a robust way and consistent improvement is achieved in almost every experiment.

F-measure Per-Class - wide

F-measure Per-Class - isp

Summary - F-measure Per-Class
The proposed methods can improve the F-measure of every class, and significant improvements are obtained in most classes.

Comparison with Other Methods - wide

Comparison with Other Methods - isp

Summary - Comparison TCC is superior to the existing traffic classification methods since it demonstrates the ability of applying flow correlation to effectively improve traffic classification performance.

Innovation 2: Bag of Flow Framework
Problem: processing traffic accurately and at high speed
- We propose a new traffic classification scheme that utilizes the information among the correlated traffic flows generated by an application
- We provide a theoretical study of the proposed scheme
  - Theoretical framework of classifier combination
  - Analysis of the sensitivity to prediction errors of the different aggregation rules employed in the proposed scheme
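
A hedged sketch of what aggregating predictions over a bag of flows can look like: each flow in a BoF gets a per-class posterior from some base classifier, and one rule combines them into a single decision for the whole bag. The slide does not list the rules, so the sum, product and majority-vote rules below are assumed here as common classifier-combination rules, and the posterior values are invented.

```python
import math
from collections import Counter
from typing import Dict, List

Posterior = Dict[str, float]  # class name -> posterior probability for one flow


def aggregate(posteriors: List[Posterior], rule: str = "sum") -> str:
    """Combine the per-flow predictions of one BoF into a single class."""
    classes = list(posteriors[0])
    if rule == "sum":
        score = {c: sum(p[c] for p in posteriors) for c in classes}
    elif rule == "product":
        score = {c: math.prod(p[c] for p in posteriors) for c in classes}
    elif rule == "majority":
        votes = Counter(max(p, key=p.get) for p in posteriors)
        score = {c: float(votes.get(c, 0)) for c in classes}
    else:
        raise ValueError(f"unknown rule: {rule}")
    return max(score, key=score.get)


# Three correlated flows; taken alone, the second one would be labelled FTP.
bag = [
    {"HTTP": 0.9, "FTP": 0.1},
    {"HTTP": 0.4, "FTP": 0.6},
    {"HTTP": 0.7, "FTP": 0.3},
]
for rule in ("sum", "product", "majority"):
    print(rule, aggregate(bag, rule))
```

The rules react differently to a single badly wrong prediction, which is the kind of sensitivity to prediction errors that the theoretical study on this slide refers to.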

Classification Process of Correlated Traffic

Evaluation

Innovation 3: Compound Classification Framework Problem: Unknown applications

Statistics-based Traffic Classification
Very high accuracy
[Figure: a classifier built by supervised learning from a labelled training set (HTTP, FTP, SMTP) predicts the classes of an unlabelled testing set.]

Unknown Classes are Overlooked
- In classifier design, most previous works assumed:
  - All classes are known during training
  - All classes have sufficient data for training
- In evaluation, they got good results by excluding unwanted data
  - Classifiers were trained with a limited number of classes
  - Classifiers were tested against only data from the trained classes
[Figure: training set with known classes (HTTP, FTP, SMTP); testing set with the same known classes (HTTP, FTP, SMTP), while BitTorrent (unknown class) is left out.]

Innovation 3: Compound Classification Framework
Problem: unknown applications
- We aim to tackle the problem of unknown flows in a semi-supervised framework
- Unlike previous work, this work considers very few labelled training samples and investigates flow correlation in a real-world network environment
- Flow label propagation automatically labels relevant flows from a large unlabelled dataset
- Compound classification jointly identifies correlated flows to further boost classification accuracy
- We provide a theoretical justification of the performance benefit of applying these two new techniques to network traffic classification
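
One simple way to picture flow label propagation is shown below. It assumes that correlation is judged by the 3-tuple {dst_ip, dst_port, protocol} used earlier in the talk: an unlabelled flow inherits the label of a labelled flow with the same 3-tuple. The function name, the conflict handling and the example addresses are assumptions for this sketch, not details confirmed by the slides.

```python
from typing import Dict, List, Optional, Tuple

ThreeTuple = Tuple[str, int, str]  # (dst_ip, dst_port, protocol)


def propagate_labels(
    labelled: List[Tuple[ThreeTuple, str]],
    unlabelled: List[ThreeTuple],
) -> List[Optional[str]]:
    """Give each unlabelled flow the label of a labelled flow sharing its 3-tuple.

    Returns None where no labelled flow matches, or where labelled flows
    with the same 3-tuple disagree (conflicts are not propagated).
    """
    lookup: Dict[ThreeTuple, Optional[str]] = {}
    for key, label in labelled:
        if lookup.setdefault(key, label) != label:
            lookup[key] = None  # conflicting evidence for this 3-tuple
    return [lookup.get(key) for key in unlabelled]


labelled = [
    (("74.125.237.114", 80, "TCP"), "HTTP"),
    (("10.1.1.1", 25, "TCP"), "SMTP"),
]
unlabelled = [("74.125.237.114", 80, "TCP"), ("8.8.8.8", 53, "UDP")]
print(propagate_labels(labelled, unlabelled))
```

Flows that stay unlabelled after this step remain candidates for the unknown class, which is the case the compound classification framework targets.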

System Model

Flow Label Propagation

Nearest Cluster-based Classifier

Compound Classification

Impact of Unknown Applications

Overall Accuracy and F-Measure

F-Measure on isp Data

Comparison against Other Methods

Agenda Introduction Related Work Our Innovations Conclusion and Future Directions

Conclusion and Future Directions
We proposed three frameworks to deal with three major challenges of network traffic classification in the big data era
- Solving Challenge 1: Big network data, small samples
- Solving Challenge 2: Processing traffic accurately and at high speed
- Solving Challenge 3: Unknown applications

Future Directions
- Cloud computing: classifying encrypted traffic
- More than half of the traffic is HTTP: further classifying HTTP traffic
- Building user profiles based on traffic classification
- CPS/IoT/Cloud: classifying data link layer traffic

Thank You!
More about Yang Xiang: http://anss.org.au/yang