Real-Time Handling of Network Monitoring Data Using a Data-Intensive Framework




Real-Time Handling of Network Monitoring Data Using a Data-Intensive Framework
Aryan TaheriMonfared, Tomasz Wiktor Wlodarczyk, Chunming Rong
Department of Electrical Engineering and Computer Science, University of Stavanger
CloudCom, 2013

Problem? & Solution: Problem?
- Proper network operation requires efficient monitoring
- Different monitoring instruments and protocols exist
- Monitoring data are huge
- Diverse query types are required (planned vs. ad-hoc)

Problem? & Solution: Contributions
A mechanism for:
- Scalable and flexible storage
- Real-time processing, long-term analysis
- Protocol independent

Norwegian NREN Backbone Network: Data Characteristics
- Norwegian NREN backbone network
- Flow information from two core routers
- Anonymized records
- Average number of NetFlow records: 22m/day
- Average volume of NetFlow records: 60 GB/day
- Sampling rate: 8

Overview: Solution Overview
- Hadoop framework: HDFS, HBase, MapReduce
- HBase: NoSQL data store (row key, column families, columns)
- Row key: facilitates access to a specific data point or a range of them

Schema: Schemas
- Composite row key: {src, dst}{addr, port}{ts}
- Three table types are required:
  - IP Based Tables
  - Port Based Tables
  - Time Based Tables
- A single table holds the actual data; the others are lookup tables
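The composite row key above can be illustrated with a minimal sketch. The field widths (4-byte IPv4 address, 2-byte port, 8-byte timestamp), the big-endian encoding, and the name make_row_key are assumptions made for illustration; the slides give only the field order. Big-endian, fixed-width fields are chosen here because they make the byte-wise key order agree with the numeric order of each field.

```python
import socket
import struct

def make_row_key(addr: str, port: int, ts: int) -> bytes:
    """Pack {addr}{port}{ts} into a fixed-width byte string (hypothetical layout)."""
    addr_bytes = socket.inet_aton(addr)           # 4 bytes, network (big-endian) order
    # ">HQ": big-endian, no padding, 2-byte unsigned port, 8-byte unsigned timestamp
    return addr_bytes + struct.pack(">HQ", port, ts)

if __name__ == "__main__":
    key = make_row_key("10.0.0.1", 80, 1370000000)
    print(len(key), key.hex())
```

Because the address leads the key, all records for one host sort next to each other, which is what allows a single point or a contiguous range to be fetched.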

Implementation
- Initial data collection didn't perform well
- For a single day of NetFlow data:
  - HBase max # op/s: 50
  - HBase max op latency: 2.3 s
  - HDFS max # written bytes/s: 81 MB/s
  - MR job duration: 45.46 min
- This is not good at all

Implementation: What is wrong?
- Non-uniform distribution of data across regions (hot regions)
- Write Ahead Log
- Concurrent-Mark-Sweep Garbage Collection (CMS-GC)
- Old generation heap fragmentation
- etc.

Performance Tuning: What to do?
- Using Compression
- Tuning Swap
- Disabling Write Ahead Log
- Enabling Deferred Log Flush
- Increasing Heap Size
- Specifying Concurrent-Mark-Sweep Garbage Collection
- Enabling MemStore-Local Allocation Buffers (MSLAB)
- Pre-Splitting Regions

Performance Tuning: Regions
- Basic element of availability and distribution for tables
- Has start and end row keys
- Two splitting strategies:
  1. Uniform splitting over the leading field of the row key
     - IP in IP Based tables: (2^32 - 1)/#Regions
     - Port in Port Based tables: (2^16 - 1)/#Regions
  2. Empirical study of the leading field's value domain
     - Norwegian IP blocks
     - Popular src, dst
     - Popular services
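The uniform splitting strategy can be sketched as follows, assuming the leading field is stored as a fixed-width big-endian integer. The function name uniform_split_keys is hypothetical, and the integer division used for the step is one possible reading of (2^bits - 1)/#Regions; the slides do not say how the boundaries were rounded.

```python
from typing import List

def uniform_split_keys(bits: int, num_regions: int) -> List[bytes]:
    """Return the num_regions - 1 split points over a leading field of `bits` bits."""
    max_value = (1 << bits) - 1          # 2^bits - 1, e.g. 2^32 - 1 for IPv4
    step = max_value // num_regions      # width of each region's key range
    width = bits // 8                    # bytes needed for the leading field
    return [(i * step).to_bytes(width, "big") for i in range(1, num_regions)]

if __name__ == "__main__":
    for split in uniform_split_keys(32, 8):   # IP Based table, 8 regions
        print(split.hex())
    for split in uniform_split_keys(16, 4):   # Port Based table, 4 regions
        print(split.hex())
```

Split points like these would be supplied when the table is created, so that writes are spread over all regions from the start. The empirical strategy replaces the evenly spaced boundaries with ones derived from the observed value distribution.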

Performance Tuning: Pre-Splitting Regions
1) Uniform distribution results:
   - x30 more operations/s
   - x14 faster operations
   - x3 shorter duration
2) Empirical study results:
   - x64 more operations/s
   - x80 faster operations
   - x7.5 shorter duration

Top-N Host Pairs: Results
- Finding the host pairs which exchanged the most traffic
- Belongs to the long-term query family
- Aggregation of input and output bytes for all host pairs
- Query on Reference table (T1) with 5 billion records
- Traditional tools (e.g. nfdump): not capable of handling this much data
- Chaining MapReduce jobs: 26.10 min (average response time)
- Reasonable duration
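The logic of the Top-N host pairs query can be sketched on a single machine as two chained stages: one that sums bytes per host pair and one that selects the N largest totals. This is only an in-memory illustration, not the authors' MapReduce jobs. The record layout (src, dst, nbytes) and the choice to treat a pair as unordered, so that both directions are summed together, are assumptions.

```python
from collections import Counter
import heapq

def sum_bytes_per_pair(records):
    """Stage 1: aggregate bytes for each host pair, ignoring direction."""
    totals = Counter()
    for src, dst, nbytes in records:
        pair = tuple(sorted((src, dst)))   # (a, b) and (b, a) share one entry
        totals[pair] += nbytes
    return totals

def top_n_pairs(totals, n):
    """Stage 2: pick the n pairs with the largest byte totals."""
    return heapq.nlargest(n, totals.items(), key=lambda item: item[1])

if __name__ == "__main__":
    flows = [("a", "b", 10), ("b", "a", 5), ("a", "c", 7), ("c", "d", 1)]
    print(top_n_pairs(sum_bytes_per_pair(flows), 2))
```

In a MapReduce setting the first stage corresponds to a job keyed by host pair and the second to a follow-up job over the much smaller set of totals, which is why the jobs are chained.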

Service Server Discovery: Service Server Discovery for a Given Period
- Criteria: port number and time range
- Four methods of execution:
  1. HBase
  2. OpenTSDB
  3. NFD1 (over complete dataset)
  4. NFD2 (dataset limited by time)
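A query by port number and time range maps naturally onto a range scan when the row key starts with the port followed by the timestamp. The sketch below imitates such a scan over an in-memory sorted list. The key layout (2-byte port, 8-byte timestamp), the helper names, and storing the server address as the row value are all assumptions for illustration, not the schema used in the talk.

```python
import bisect
import struct

def port_time_key(port: int, ts: int) -> bytes:
    """Hypothetical Port Based key: big-endian 2-byte port, then 8-byte timestamp."""
    return struct.pack(">HQ", port, ts)

def discover_servers(rows, port, ts_start, ts_end):
    """Return server addresses seen on `port` within [ts_start, ts_end].

    rows: list of (row_key, server_addr) tuples sorted by row_key.
    """
    keys = [key for key, _ in rows]
    lo = bisect.bisect_left(keys, port_time_key(port, ts_start))   # scan start row
    hi = bisect.bisect_right(keys, port_time_key(port, ts_end))    # scan stop row
    return {addr for _, addr in rows[lo:hi]}

if __name__ == "__main__":
    rows = sorted([
        (port_time_key(80, 10), "10.0.0.1"),
        (port_time_key(80, 20), "10.0.0.2"),
        (port_time_key(80, 30), "10.0.0.3"),
        (port_time_key(443, 15), "10.0.0.4"),
    ])
    print(discover_servers(rows, 80, 10, 20))
```

Only the rows between the start and stop keys are touched, whereas a tool that reads flow files sequentially has to filter every record in the files it opens.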

Service Server Discovery: Results
- HBase x87 faster than OpenTSDB
- HBase x4472 faster than NFD1

Summary
- Data-intensive frameworks are effective for network monitoring
- Solutions should be protocol independent
- Designing a proper data structure is crucial
- Data characteristics should be well studied
- Different query types have heterogeneous demands
- One size doesn't fit all

Ongoing Research
- End-to-end secure virtual layer 2 networks