NetFlow Analysis with MapReduce

Similar documents

Scalable NetFlow Analysis with Hadoop Yeonhee Lee and Youngseok Lee

Scalable Cloud Computing Solutions for Next Generation Sequencing Data

CSE-E5430 Scalable Cloud Computing Lecture 2

Big Data Analytics for Net Flow Analysis in Distributed Environment using Hadoop

Signature-aware Traffic Monitoring with IPFIX 1

Toward Scalable Internet Traffic Measurement and Analysis with Hadoop

Jeffrey D. Ullman slides. MapReduce for data intensive computing

Comparison of Different Implementation of Inverted Indexes in Hadoop

Efficacy of Live DDoS Detection with Hadoop

Get Your FIX: Flow Information export Analysis and Visualization

Hadoop: A Framework for Data- Intensive Distributed Computing. CS561-Spring 2012 WPI, Mohamed Y. Eltabakh

Data Migration from Grid to Cloud Computing

DESIGN ARCHITECTURE-BASED ON WEB SERVER AND APPLICATION CLUSTER IN CLOUD ENVIRONMENT

Advanced SQL Query To Flink Translator

Enhancing Massive Data Analytics with the Hadoop Ecosystem

Improving MapReduce Performance in Heterogeneous Environments

Data and Algorithms of the Web: MapReduce

Hadoop IST 734 SS CHUNG

CREDIT CARD DATA PROCESSING AND E-STATEMENT GENERATION WITH USE OF HADOOP

Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware

Performance Overhead on Relational Join in Hadoop using Hive/Pig/Streaming - A Comparative Analysis

Hadoop and Hive. Introduction,Installation and Usage. Saatvik Shah. Data Analytics for Educational Data. May 23, 2014

Detection of Distributed Denial of Service Attack with Hadoop on Live Network

A STUDY ON HADOOP ARCHITECTURE FOR BIG DATA ANALYTICS

Big Data Rethink Algos and Architecture. Scott Marsh Manager R&D Personal Lines Auto Pricing

Approaches for parallel data loading and data querying

Network Traffic Analysis using HADOOP Architecture. Zeng Shan ISGC2013, Taibei

CASE STUDY OF HIVE USING HADOOP 1

Introduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data

A Small-time Scale Netflow-based Anomaly Traffic Detecting Method Using MapReduce

Department of Computer Science University of Cyprus EPL646 Advanced Topics in Databases. Lecture 15

Packet Flow Analysis and Congestion Control of Big Data by Hadoop

Problem Solving Hands-on Labware for Teaching Big Data Cybersecurity Analysis

INTERNATIONAL JOURNAL OF INFORMATION TECHNOLOGY & MANAGEMENT INFORMATION SYSTEM (IJITMIS)

Big Data and Apache Hadoop s MapReduce

11/18/15 CS q Hadoop was not designed to migrate data from traditional relational databases to its HDFS. q This is where Hive comes in.

Suresh Lakavath csir urdip Pune, India

CLOUDDMSS: CLOUD-BASED DISTRIBUTED MULTIMEDIA STREAMING SERVICE SYSTEM FOR HETEROGENEOUS DEVICES

Hadoop Technology for Flow Analysis of the Internet Traffic

and reporting Slavko Gajin

Hadoop Distributed File System. Dhruba Borthakur Apache Hadoop Project Management Committee

CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop)

A Network Monitoring Tool for CCN

Analysis and Optimization of Massive Data Processing on High Performance Computing Architecture

Mobile Cloud Computing for Data-Intensive Applications

MapReduce and Hadoop Distributed File System

Open source Google-style large scale data analysis with Hadoop

Research Article Hadoop-Based Distributed Sensor Node Management System

Unstructured Data Accelerator (UDA) Author: Motti Beck, Mellanox Technologies Date: March 27, 2012

Department of Computer Science University of Cyprus EPL646 Advanced Topics in Databases. Lecture 14

BIG DATA WEB ORGINATED TECHNOLOGY MEETS TELEVISION BHAVAN GHANDI, ADVANCED RESEARCH ENGINEER SANJEEV MISHRA, DISTINGUISHED ADVANCED RESEARCH ENGINEER

Chapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related

Application and practice of parallel cloud computing in ISP. Guangzhou Institute of China Telecom Zhilan Huang

INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY

An overview of traffic analysis using NetFlow

A Performance Analysis of Distributed Indexing using Terrier

DESIGN AN EFFICIENT BIG DATA ANALYTIC ARCHITECTURE FOR RETRIEVAL OF DATA BASED ON

MapReduce (in the cloud)

GraySort and MinuteSort at Yahoo on Hadoop 0.23

SARAH Statistical Analysis for Resource Allocation in Hadoop

Analysis and Modeling of MapReduce s Performance on Hadoop YARN

Scalable Cloud Computing

Implement Hadoop jobs to extract business value from large and varied data sets

MapReduce and Hadoop Distributed File System V I J A Y R A O

Chapter 7. Using Hadoop Cluster and MapReduce

A Hybrid Scheduling Approach for Scalable Heterogeneous Hadoop Systems

How To Identify Different Operating Systems From A Set Of Network Flows

CIS 4930/6930 Spring 2014 Introduction to Data Science /Data Intensive Computing. University of Florida, CISE Department Prof.

Flow Based Traffic Analysis

Introduction to Big Data! with Apache Spark" UC#BERKELEY#

Exploring the Efficiency of Big Data Processing with Hadoop MapReduce

SEAIP 2009 Presentation

A Cost-Evaluation of MapReduce Applications in the Cloud

Evaluating HDFS I/O Performance on Virtualized Systems

See Spot Run: Using Spot Instances for MapReduce Workflows

Limitations of Packet Measurement

The SCAMPI Scaleable Monitoring Platform for the Internet. Baiba Kaskina TERENA

Big Data Analytics Hadoop and Spark

Big Data With Hadoop

HIVE. Data Warehousing & Analytics on Hadoop. Joydeep Sen Sarma, Ashish Thusoo Facebook Data Team

NFQL: A Tool for Querying Network Flow Records [6]

Hadoop. MPDL-Frühstück 9. Dezember 2013 MPDL INTERN

Enabling High performance Big Data platform with RDMA

Final Project Proposal. CSCI.6500 Distributed Computing over the Internet

Case Study : 3 different hadoop cluster deployments

MapReduce, Hadoop and Amazon AWS

Management by Network Search

Log Mining Based on Hadoop s Map and Reduce Technique

An Industrial Perspective on the Hadoop Ecosystem. Eldar Khalilov Pavel Valov

BIG DATA What it is and how to use?

MyCloudLab: An Interactive Web-based Management System for Cloud Computing Administration

The Performance Characteristics of MapReduce Applications on Scalable Clusters

Comprehensive IP Traffic Monitoring with FTAS System

Hadoop/Hive General Introduction

A Brief Outline on Bigdata Hadoop

Integrating Big Data into the Computing Curricula

Virtual Machine Based Resource Allocation For Cloud Computing Environment

Big Data Storage, Management and challenges. Ahmed Ali-Eldin

Transcription:

NetFlow Analysis with MapReduce Wonchul Kang, Yeonhee Lee, Youngseok Lee Chungnam National University {teshi85, yhlee06, lee}@cnu.ac.kr 2010.04.24(Sat) based on "An Internet Traffic Analysis Method with MapReduce", Cloudman workshop, April 2010 1

Introduction Flow-based traffic monitoring Volume of processed data is reduced Popular flow statistics tools : Cisco NetFlow [1] Traditional flow-based traffic monitoring Run on a high performance central server Routers Flow Data Storag e High Performance Server 2

Motivation A huge amount of flow data Long-term collection of flow data Flow data in our campus network ( /16 prefix ) #ofrouters 1Day 1 Month 1 Year Short-term period of flow data 1 1.2 GB 13 GB 156 GB 5 6 GB 65 GB 780 GB 10 12 GB 130 GB 1.5 TB 200 240 GB 2.6 TB 30 TB Massive flow data from anomaly traffic data of Internet worm and DDoS Cluster file system and cloud computing platform Google s programming model, MapReduce, big table [8] Open-source system, Hadoop [9] 3

MapReduce MapReduce is a programming g model for large data set First suggested by Google J. Dean and S. Ghemawat, MapReduce: Simplified Data Processing on Large Cluster, OSDI, 2004 [8] User only specify a map and a reduce function Automatically parallelized and executed on a large cluster 4

MapReduce Split 1 Split 2 Split 3 Map Map Shuffle & Sort Reduce Reduce Result Split 4 ( K1, V ) List ( K2, V2 ) ( k2, list ( v2 ) ) List ( v3 ) Map : return a list containing zero or more ( k, v ) pair Output can be a different key from the input Output can have same key Reduce : return a new list of reduced output from input 5

Hadoop Open-source framework for running applications on large clusters built of commodity hardware Implementation of MapReduce and HDFS MapReduce : computational paradigm HDFS : distributed file system Node failures are automatically handled by framework Hadoop Amazon : EC2, S3 service Facebook : analyze the web log data 6

Related Work Widely used tools for flow statistics Flow-tools, flowscan or CoralReef[5] P2P-based distributed analysis of flow data DIPStorage : each storage tank associated with a rule [11] MapReduce software Snort log analysis : NCHC cloud computing research group [16] 7

Contribution A flow analysis method with MapReduce Process flow data in a cloud computing platform, hadoop Implementation of flow analysis programs with Hadoop Decrease flow computation time Enhance fault-tolerant tolerant of flow analysis jobs 8

Architecture of Flow Measurement and danalysis System Each router exports flow data to cluster node Cluster master manages cluster nodes 9

Components of Cluster Node flowtools Flow File Input Processor Cluster File Cluster SystemFile ( System HDFS ) ( HDFS ) Flow Analysis Map Map Flow Analysis Reduce Reduce Flow file input processor Flow analysis map/reduce Flow-tools Hadoop ( HDFS ) Flow analysis MapReduce Library Hadoop Java Virtual Machine Operating System ( Linux ) Hardware ( CPU, HDD, Memory, NIC ) HDFS MapReduce Java VM OS : Linux 10

Flow File Input Processor Local Disk NetFlow v5 Save NetFlow data in binary flow file Cluster Master Flow File ( Binary Format ) Convert Flow File ( Text Format ) Convert binary flow file into text file HDFS Copy Copy text t file to HDFS Cluster Nodes 11

Flow Analysis Map/Reduce Flow Flow Flow 53 128 64 Dst Port Flow Octet 53 [64, 128] 53 192 Read text flow files Run map tasks Read each line (Validation Check) Parsing flow data Save result into temporary files (key, value) Run reduce tasks Read temporary files (Key, List[Value]) Run sum process Write results to a file 12

Performance Evaluation Data: flow data from /24 subnet Duration Flow count (million) Environment Flow file count Total binary file size (GB) Total text file size (GB) 1 day 3.2 228 0.2 1.2 1 week 19.0 1596 0.3 2.3 1 month 109.1 7068 2.0 13.1 Compared methods : computing byte count per destination port flow-tools : flow-cat [flow data folder] flow-stat f 5 Our implementation with Hadoop Performance metric flow statistics ti ti computation ti time Fault recovery against map/reduce tasks 13

Our Testbed Internet Chungnam National University Router Hadoop 0.18.3 Cluster master x 1 Core 2 Duo 2.33 GHz Memory 2GB 1 GE Cluster node x 4 Core 2 Quad 2.83 GHz Memory 4GB HDD 1.5 TB 1 GE NetFlow v5 Data Export Cluster master Cluster nodes Gigabit Ethernet 14

Flow Statistics Computation Time Port-breakdown Computation Time flow-tools : 4h 30m 23s 18000 ime (sec) Port Break kdown Running ti 16000 14000 12000 10000 8000 6000 4000 2000 flow-tools MR (1) MR (2) MR (3) MR (4) 0 3.2 million (One Day) 19 million (One Week) 109.1 million (One Month) number of flows (duration) MR(4) : 1h 15m 49s Port breakdown computation time 72% decrease with MR(4) on Hadoop 15

Single Node Failure : Map Task Under 4 cluster nodes Map task fail time 4 sec (M : 9% R : 0%) Map task recover time 266 sec (M : 99% R : 32%) Fail time 4 sec Recover time 266 sec 16

Single Node Failure : Reduce Task Under 4 cluster nodes Reduce task fail time 29 sec (M : 41% R : 10% ) Reduce task recover time 320 sec (M : 99% R : 32% ) Fail time 29 sec Recover time 320 sec 17

Text vs. Binary NetFlow Files Flow Analyzer on Hadoop Packet Flow Exporter Netflow Packet Flow Collecter Binary flow file Text Converter Text flow file HDFS Flow analysis with text files TextInputFormat Map TextOutputFormat Reduce K : Text V : LongWritable Flow Analyzer on Hadoop Packet Flow Exporter Netflow Packet Flow Collector Binary flow file HDFS BinaryInputFormat BinaryOutputFormat Flow analysis with binary files Map Reduce K : BytesWritable V : BytesWritable 18

Binary Input in Hadoop Currently developing BinaryInputFormat module for Hadoop Small storage by binary NetFlow files Reduces # of Map tasks increasing gperformance Decreasing computation time By 18% ~ 55% for a single flow analysis job By 58% ~ 75% for two flow analysis jobs 19

Prototype 20

Summary NetFlow data analysis with MapReduce Easy management of big flow data Decreasing computation time Fault-tolerant service against a single machine failure Ongoing work Supporting binary NetFlow files Enhancing fast processing of NetFlow files 21

References [1] Cisco NetFlow, http://www.cisco.com/web/go/netflow. [2] L. Deri, nprobe: an Open Source NetFlow Probe for Gigabit Networks, TERENA Networking Conference, May 2003. [3] J. Quittek, T. Zseby, B. Claise, and S. Zander, Requirements for IP Flow Information Export (IPFIX), IETF RFC 3917, October 2004. [4] tcpdump, http://www.tcpdump.org. [5] CAIDA CoralReef Software Suite, http://www.caida.org/tools/measurement/coralreef. [6] M. Fullmer and S. Romig, The OSU Flow-tools Package and Cisco NetFlow Logs, USENIX LISA, 2000. [7] D. Plonka, FlowScan: a Network Traffic Flow Reporting and Visualizing Tool, USENIX Conference on System Administration, 2000. [8] J. Dean and S. Ghemawat, MapReduce: Simplified Data Processing on Large Cluster, OSDI, 2004. [9] Hadoop, http://hadoop.apache.org/. [10] H. Kim, K. Claffy, M. Fomenkov, D. Barman, M. Faloutsos, and K. Lee, Internet Traffic Classification Demystified: Myths, Caveats, and the Best Practices, ACM CoNEXT, 2008. [11] C. Morariu, T. Kramis, B. Stiller DIPStorage: Distributed Architecture for Storage of IP Flow Records., 16thWorkshop on Local and Metropolitan Area Networks, September 2008. [12] M. Roesch, Snort - Lightweight Intrusion Detection for Networks, USENIX LISA, 1999. [13] W. Chen and J. Wang, Building a Cloud Computing Analysis System for Intrusion Detection System, CloudSlam 2009. [14] Ashish Thusoo, Joydeep Sen Sarma, Namit Jain, Zheng Shao, Prasad Chakka, Suresh Anthony, Hao Liu, Pete Wyckoff, Raghotham Murthy Hive: a warehousing solution over a map-reduce framework., Proceedings of the VLDB Endowment Volume 2, Issue 2 (August 2009) Pages: 1626-1629 [15] HBase, http://hadoop.apache.org/hbaseapache org/hbase [16] Wei-Yu Chen and Jazz Wang. Building a Cloud Computing Analysis System for Intrusion Detection System, CloudSlam'09 22