Network Traffic Analysis using HADOOP Architecture. Zeng Shan, ISGC 2013, Taipei. zengshan@ihep.ac.cn


Outline
- Flow vs. packet: what are NetFlows?
- Flow tools used in the system
  - nprobe
  - nfdump
- Introduction to Hadoop
  - HDFS
  - Map/Reduce
- Architecture of our system
- Function display

Flow VS Packet

What is a flow?
- A network flow is a sequence of packets from a source computer to a destination, which may be another host, a multicast group, or a broadcast domain.
- A network flow measures sequences of IP packets sharing a common feature as they pass between devices.
- Flow formats: NetFlow (Cisco), J-Flow (Juniper), sFlow (HP).

NetFlow
- The NetFlow concept was developed by Cisco.
- It defines a protocol and packet content for collecting network traffic data.
- Several versions of the protocol exist; v5 is the most widespread.
NetFlow v5
- A flow is a unique tuple of (router port, i.e. input interface; source IP + port; destination IP + port; IP protocol, e.g. TCP).
- A flow record also contains the TCP flags and the numbers of packets and bytes transferred.
- Flows are cached on the router and sent regularly to a recording node.
- If the load on a router becomes high, flows can get lost because the cache may be flushed.
- Typically more than 2000 flows/s can be recorded.
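The flow definition can be illustrated with a minimal Python sketch that groups packets by a v5-style key and accumulates packet counts, byte counts, and TCP flags. The `Packet` fields and the sample addresses are assumptions made for this example, not the router's actual export format.

```python
from collections import namedtuple

# Hypothetical, simplified packet record (not the NetFlow wire format).
Packet = namedtuple(
    "Packet", "in_if src_ip src_port dst_ip dst_port proto size tcp_flags"
)

def aggregate_flows(packets):
    """Group packets into flows keyed by interface, endpoints and protocol."""
    flows = {}
    for p in packets:
        key = (p.in_if, p.src_ip, p.src_port, p.dst_ip, p.dst_port, p.proto)
        flow = flows.setdefault(key, {"packets": 0, "bytes": 0, "tcp_flags": 0})
        flow["packets"] += 1
        flow["bytes"] += p.size
        flow["tcp_flags"] |= p.tcp_flags  # flags are OR-ed over the whole flow
    return flows

packets = [
    Packet(1, "192.0.2.10", 40000, "198.51.100.5", 80, 6, 60, 0x02),    # TCP SYN
    Packet(1, "192.0.2.10", 40000, "198.51.100.5", 80, 6, 1500, 0x10),  # TCP ACK
    Packet(1, "192.0.2.11", 53000, "198.51.100.9", 53, 17, 80, 0x00),   # UDP
]
flows = aggregate_flows(packets)
for key, stats in flows.items():
    print(key, stats)
```

The two TCP packets share one key and collapse into a single flow record, which is why flow data is so much smaller than the packets it summarizes.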

Flow VS Packet
- Flows contain the information needed to characterize and analyze the traffic.
- Flows contain less information than packets, so recording flows does not lead to high network load.
- Storing flows requires much less disk space than storing packets: one day of IHEP traffic, stored as flows, is roughly 1.5 GB.
- Instead of recording all flows, sampling can also be used.
- Flows can be delivered by the router or created from packets by nprobe; nprobe is an alternative if the router doesn't output NetFlow.
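The sampling remark can be illustrated with a small sketch. It assumes deterministic 1-in-N sampling and scales the sampled counters back up by N; the rate of 100 and the synthetic packet sizes are made up for the example.

```python
def sample_and_estimate(packet_sizes, n):
    """Keep every n-th packet and scale the counters back up by n."""
    sampled = packet_sizes[::n]
    return {
        "sampled_packets": len(sampled),
        "est_packets": len(sampled) * n,
        "est_bytes": sum(sampled) * n,
    }

# Synthetic traffic: 9,000 packets cycling through three typical sizes.
# The pattern is chosen so that the estimate happens to be exact here;
# on real traffic the scaled-up values are only approximations.
sizes = [(64, 576, 1500)[i % 3] for i in range(9_000)]
estimate = sample_and_estimate(sizes, n=100)
print(estimate, "true bytes:", sum(sizes))
```

Only 90 of the 9,000 packets are inspected, which is the trade-off sampling offers: much less work on the exporter in exchange for estimated rather than exact counters.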

Flow Tools used in the System

What is nprobe?
- nprobe is an open-source tool.
- It captures packets flowing on an Ethernet segment, computes flows, and exports them to the specified collectors.
Features:
- Keeps up with Gbit speeds on Ethernet networks, handling thousands of packets per second without packet sampling on commodity hardware.
- Supports the major operating systems, including Unix, Windows, and Mac OS X.
- Designed for environments with limited resources.

nfdump
- nfdump fulfils two tasks: data storage and data filtering/resolution.
  - nfcapd receives flows on a configurable port and writes them to a new file every n minutes.
  - nfdump reads the dump files containing flow data, filters the data, and resolves it into readable output.
- nfdump is IPv6 enabled.
- The current stable version is 1.6.9.
- nfdump usage is very similar to tcpdump, especially for the filtering rules.
Example: look for any activity of the host with IP 192.168.23.2 on 14 Mar 2013 from 11:00 to 11:59:
    nfdump -R nfcapd.201303141100:nfcapd.201303141155 'ip 192.168.23.2'
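A minimal sketch of how the file range for such a query could be built programmatically. It assumes that nfcapd rotates its files every 5 minutes and names them nfcapd.YYYYmmddHHMM, as the example file names suggest, and that the window boundaries are aligned to that interval. The helper names are hypothetical, and the command is only assembled, not executed.

```python
from datetime import datetime, timedelta

ROTATION = timedelta(minutes=5)  # assumed nfcapd rotation interval

def nfcapd_name(ts):
    """File name for the capture interval that starts at ts."""
    return "nfcapd." + ts.strftime("%Y%m%d%H%M")

def nfdump_command(start, end, flow_filter):
    """Build an nfdump argument list covering the window [start, end)."""
    last = end - ROTATION  # the last file starts one interval before `end`
    file_range = f"{nfcapd_name(start)}:{nfcapd_name(last)}"
    # -R reads a range of files; the filter is passed as the final argument.
    return ["nfdump", "-R", file_range, flow_filter]

cmd = nfdump_command(
    datetime(2013, 3, 14, 11, 0),
    datetime(2013, 3, 14, 12, 0),
    "ip 192.168.23.2",
)
print(" ".join(cmd))
```

Returning an argument list rather than a single string keeps the filter expression as one argument if the list is later handed to `subprocess.run`.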

Information contained in nfdump output

Tag    Description                    Tag    Description
%ts    Start time (first seen)        %in    Input interface number
%te    End time (last seen)           %out   Output interface number
%td    Duration                       %pkt   Packet count in this flow
%pr    Protocol                       %byt   Byte count in this flow
%sa    Source address                 %fl    Number of flows
%da    Destination address            %flg   TCP flags
%sap   Source address:port            %tos   Type of service
%dap   Destination address:port       %bps   Bits per second
%sp    Source port                    %pps   Packets per second
%dp    Destination port               %bpp   Bytes per packet
%sas   Source AS                      %das   Destination AS
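The three rate fields (%bps, %pps, %bpp) follow from the duration, packet count, and byte count. A short sketch of that arithmetic, using the definitions implied by the field names; it assumes the duration is given in seconds, and reporting zero rates for a zero-duration flow is a choice made for this example.

```python
def derived_rates(duration_s, packets, bytes_):
    """Compute %bps, %pps and %bpp style values from %td, %pkt and %byt."""
    if duration_s > 0:
        bps = 8 * bytes_ / duration_s   # bits per second
        pps = packets / duration_s      # packets per second
    else:
        bps = pps = 0.0                 # e.g. a single-packet flow
    bpp = bytes_ / packets if packets else 0.0  # bytes per packet
    return {"bps": bps, "pps": pps, "bpp": bpp}

rates = derived_rates(duration_s=2.0, packets=10, bytes_=15000)
print(rates)
```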

Introduction to Hadoop

What can Hadoop do?
- Hadoop is an open-source software framework.
- It was originally developed to support distribution for the Nutch search engine project.
- It is licensed under the Apache v2 license.
- It supports data-intensive distributed applications.
- It enables applications to work with thousands of computationally independent computers and petabytes of data.
- Storage: HDFS
- Compute: Map/Reduce

HDFS
- HDFS is short for Hadoop Distributed File System.
- Differences from other distributed file systems:
  - Highly fault-tolerant
  - Designed to be deployed on low-cost hardware
- Portability across heterogeneous hardware and software platforms
- Applications that run on HDFS need streaming access to their data sets.
- Provides high-throughput access to application data
- Suitable for applications that have large data sets
- Moving computation is cheaper than moving data.

HDFS master/slave architecture
NameNode
- Manages the name space of the file system
- Regulates access to files by clients
- Determines the mapping of blocks to DataNodes
DataNodes
- Serve read and write requests from the clients
- Perform block creation, deletion, and replication upon instruction from the NameNode

Data Replication in HDFS
- To ensure fault tolerance in HDFS, the blocks of a file are replicated.
- The number of replicas of a block (the replication factor) can be specified by the application.
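A toy sketch of why replication gives fault tolerance. It places each block on `replication` distinct DataNodes in round-robin order, which is a simplification for illustration and not HDFS's actual rack-aware placement policy. The node names and block count are made up; 3 is HDFS's default replication factor.

```python
def place_blocks(num_blocks, datanodes, replication=3):
    """Map each block id to `replication` distinct DataNodes (round-robin)."""
    if replication > len(datanodes):
        raise ValueError("not enough DataNodes for the requested replication")
    return {
        block: [datanodes[(block + i) % len(datanodes)] for i in range(replication)]
        for block in range(num_blocks)
    }

def surviving_replicas(placement, failed):
    """Replicas still available for each block after the given nodes fail."""
    return {
        block: [node for node in nodes if node not in failed]
        for block, nodes in placement.items()
    }

nodes = ["dn1", "dn2", "dn3", "dn4"]  # hypothetical DataNode names
placement = place_blocks(6, nodes, replication=3)
after = surviving_replicas(placement, failed={"dn2", "dn3"})
print(after)
```

With three replicas per block, every block in this example remains readable even after two of the four nodes fail.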

Map/Reduce
- MapReduce is a programming model for processing large data sets.
- MapReduce is typically used to do distributed computing on clusters of computers.
- MapReduce can take advantage of data locality, processing data on or near the storage assets to decrease the transmission of data.
- The model is inspired by the map and reduce functions commonly used in functional programming.
- "Map" step: the master node takes the input, divides it into smaller sub-problems, and distributes them to slave nodes. Each slave node processes its smaller problem and passes the answer back to the master node.
- "Reduce" step: the master node then collects the answers to all the sub-problems and combines them in some way to form the final output.
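The map and reduce steps can be sketched in-process as follows. This is plain Python that mimics the model (map, group by key, reduce) on a handful of made-up flow records; it is not Hadoop code and not necessarily the job the system runs.

```python
from collections import defaultdict

# Made-up flow records: (source address, destination address, bytes).
flows = [
    ("10.0.0.1", "10.0.1.9", 1200),
    ("10.0.0.2", "10.0.1.9", 300),
    ("10.0.0.1", "10.0.2.7", 800),
]

def map_flow(record):
    """Map step: emit (key, value) pairs, here (source address, bytes)."""
    src, _dst, nbytes = record
    yield src, nbytes

def reduce_bytes(key, values):
    """Reduce step: combine all values that share a key."""
    return key, sum(values)

def run_job(records):
    groups = defaultdict(list)  # shuffle phase: group mapped values by key
    for record in records:
        for key, value in map_flow(record):
            groups[key].append(value)
    return dict(reduce_bytes(key, values) for key, values in groups.items())

totals = run_job(flows)
print(totals)
```

Because each record is mapped independently and each key is reduced independently, both phases can be spread over many nodes, which is what the framework does at scale.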

Architecture of our System

Architecture

Drawing tools
RRDtool
- RRDtool is an acronym for round-robin database tool.
- The data are stored in a round-robin database (circular buffer).
- It also includes tools to extract RRD data in a graphical format.
- It draws the flow trend graph in three dimensions: flow count, packet count, and traffic count.
Highstock
- Highstock lets you create stock or general timeline charts in pure JavaScript, including sophisticated navigation options like a small navigator series, preset date ranges, date picker, scrolling, and panning.
- We only need to write an API between HDFS and Highstock.
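The round-robin idea and the hand-off of data to the charting layer can be sketched together. The sketch assumes 5-minute slots and a fixed number of retained points, and it emits `[timestamp in milliseconds, value]` pairs, the point format that Highcharts/Highstock accepts for time series. The class name and sample values are hypothetical, and RRDtool's consolidation of older data is not modeled.

```python
import json
from collections import deque

SLOT_SECONDS = 300  # assumed 5-minute slots

class RoundRobinSeries:
    """Fixed-size series: once full, each new point evicts the oldest one."""

    def __init__(self, max_points):
        self.points = deque(maxlen=max_points)

    def add(self, epoch_seconds, value):
        self.points.append((epoch_seconds, value))

    def to_chart_data(self):
        """Return [[timestamp_ms, value], ...] in chronological order."""
        return [[ts * 1000, value] for ts, value in self.points]

series = RoundRobinSeries(max_points=3)
start = 1363258800  # 2013-03-14 11:00:00 UTC
for i, flow_count in enumerate([120, 95, 143, 110]):
    series.add(start + i * SLOT_SECONDS, flow_count)

chart_data = series.to_chart_data()
print(json.dumps(chart_data))
```

The fourth point pushes out the first, so storage stays constant no matter how long the collector runs.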

Function Display

Summary
- Network trend chart for IHEP every 5 minutes in three dimensions: flow count, packet count, and traffic count
- Detailed traffic information chart
  - Select a single timeslot and get the detailed traffic information
  - Select a time window and get the detailed traffic information
  - On hovering over the chart, a tooltip with the traffic information for each point and series is displayed; the tooltip follows the mouse as the user moves it over the graph
- Traffic information related to the HEP experiments
  - Available once the IP addresses of the machines involved in an experiment's data transfer are specified
  - We already have DYB/YBJ/CMS/ATLAS
- Intrusion detection is also possible
  - Writing rules, generating alerts
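Attributing traffic to experiments by address can be sketched as a lookup from networks to experiment names. The subnets below are documentation-range placeholders, not IHEP's real addresses, and the mapping and function names are hypothetical.

```python
import ipaddress
from collections import Counter

# Placeholder networks (RFC 5737 documentation ranges), not real IHEP hosts.
EXPERIMENT_NETS = {
    "DYB": ipaddress.ip_network("192.0.2.0/25"),
    "CMS": ipaddress.ip_network("198.51.100.0/24"),
    "ATLAS": ipaddress.ip_network("203.0.113.0/24"),
}

def experiment_of(address):
    """Return the experiment whose network contains `address`, else None."""
    ip = ipaddress.ip_address(address)
    for name, net in EXPERIMENT_NETS.items():
        if ip in net:
            return name
    return None

def bytes_per_experiment(flows):
    """Sum the bytes of flows whose source or destination is an experiment host."""
    totals = Counter()
    for src, dst, nbytes in flows:
        name = experiment_of(src) or experiment_of(dst)
        if name:
            totals[name] += nbytes
    return dict(totals)

flows = [
    ("192.0.2.17", "10.1.1.1", 5000),
    ("10.1.1.2", "198.51.100.8", 700),
    ("10.1.1.3", "10.1.1.4", 999),       # unrelated traffic is ignored
    ("203.0.113.5", "192.0.2.200", 40),
]
usage = bytes_per_experiment(flows)
print(usage)
```

An alerting rule could be layered on the same lookup, for example by flagging flows that match no known network, though the rule format used by the system is not described here.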

Thank You Questions?