Network Traffic Analysis using HADOOP Architecture. Shan Zeng HEPiX, Beijing 17 Oct 2012





Outline
- Introduction to Hadoop
- Traffic Information Capture
- Traffic Information Resolution
- Traffic Information Storage
- Traffic Information Analysis
- Traffic Information Display
- Conclusion

Introduction to Hadoop

What can Hadoop do?
- Hadoop is an open-source software framework, licensed under the Apache License 2.0.
- It was originally developed to support distribution for the Nutch search engine project.
- It supports data-intensive distributed applications.
- It enables applications to work with thousands of independent computers and petabytes of data.

Traffic Information Capture

What is a flow?
- A network flow is a sequence of packets from a source computer to a destination, which may be another host, a multicast group, or a broadcast domain.
- A flow record measures a sequence of IP packets that share common features as they pass between devices.
- Flow formats: NetFlow (Cisco), J-Flow (Juniper), sFlow (HP).
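The idea of packets "sharing a common feature" can be illustrated with a small sketch. It assumes the shared feature is the usual 5-tuple (source address, destination address, source port, destination port, protocol); the slides do not say which key is used, and the packet dictionaries and the function name `group_into_flows` are hypothetical.

```python
from collections import defaultdict


def group_into_flows(packets):
    """Aggregate packets into flows keyed by the 5-tuple (an assumed flow key)."""
    flows = defaultdict(lambda: {"packets": 0, "bytes": 0})
    for p in packets:
        key = (p["src"], p["dst"], p["sport"], p["dport"], p["proto"])
        flows[key]["packets"] += 1
        flows[key]["bytes"] += p["size"]
    return dict(flows)


# Hypothetical packets using documentation-only IP addresses.
packets = [
    {"src": "192.0.2.1", "dst": "198.51.100.7", "sport": 40000,
     "dport": 80, "proto": "TCP", "size": 1500},
    {"src": "192.0.2.1", "dst": "198.51.100.7", "sport": 40000,
     "dport": 80, "proto": "TCP", "size": 500},
    {"src": "192.0.2.2", "dst": "198.51.100.7", "sport": 40001,
     "dport": 53, "proto": "UDP", "size": 80},
]
flows = group_into_flows(packets)
```

The first two packets share a key and collapse into one flow with two packets; the third forms its own flow. Real exporters also expire flows on timeouts, which this sketch leaves out.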

What is nprobe?
- nprobe is an open-source tool that captures packets flowing on an Ethernet segment, computes flows, and exports them to the specified collectors.
- Features:
  - Keeps up with Gbit speeds on Ethernet networks, handling thousands of packets per second without packet sampling on commodity hardware.
  - Supports the major operating systems, including Unix, Windows and Mac OS X.
  - Designed for environments with limited resources.

Traffic Information Resolution

nfcapd
- nfcapd is the netflow capture daemon. It reads netflow data from the network and stores it in files.
- The output file is automatically rotated and renamed every n minutes, typically every 5 minutes. For example, nfcapd.201205030900 contains the data from May 3rd 2012, 09:00 onward.
- Usage:
    /usr/local/bin/nfcapd -p 2055 -t 300 -l /home/zengshan/netflow/nfcapd_file/ihep &
- Options:
    -p portnum          Specifies the port number to listen on. The default port is 9995.
    -t interval         Specifies the time interval in seconds to rotate files.
    -l base_directory   Specifies the base directory to store the output files.
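The file naming rule above can be turned into a time window with a few lines of Python. This is a minimal sketch that assumes the timestamp suffix has the form YYYYMMDDhhmm, as in the example file name, and that the rotation interval matches the `-t 300` setting; the helper name `nfcapd_window` is hypothetical.

```python
from datetime import datetime, timedelta


def nfcapd_window(filename, interval_seconds=300):
    """Return (start, end) of the capture window encoded in an nfcapd file name."""
    # The timestamp is everything after the last dot, e.g. "201205030900".
    stamp = filename.rsplit(".", 1)[1]
    start = datetime.strptime(stamp, "%Y%m%d%H%M")
    return start, start + timedelta(seconds=interval_seconds)


start, end = nfcapd_window("nfcapd.201205030900")
```

With the default 5-minute interval, the example file covers 09:00 to 09:05 on May 3rd 2012.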

nfdump
- nfdump reads the netflow data from the files stored by nfcapd and dumps it as text.

nfdump output text format

Tag    Description                    Tag    Description
%ts    Start time (first seen)        %in    Input interface number
%te    End time (last seen)           %out   Output interface number
%td    Duration                       %pkt   Packet count in this flow
%pr    Protocol                       %byt   Byte count in this flow
%sa    Source address                 %fl    Number of flows
%da    Destination address            %flg   TCP flags
%sap   Source address:port            %tos   Type of service
%dap   Destination address:port       %bps   Bits per second
%sp    Source port                    %pps   Packets per second
%dp    Destination port               %bpp   Bytes per packet
%sas   Source AS                      %das   Destination AS
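Once nfdump has written text records, each line has to be turned back into typed fields before it can be analysed. The sketch below assumes a hypothetical layout in which the fields %ts, %td, %pr, %sa, %da, %sp, %dp, %pkt and %byt are printed in that order, separated by "|", with counters as plain integers; the slides do not show the actual format string, so the layout and the sample line are illustrative only.

```python
# Assumed field order of one text record (named after the nfdump tags above).
FIELDS = ["ts", "td", "pr", "sa", "da", "sp", "dp", "pkt", "byt"]
INT_FIELDS = {"sp", "dp", "pkt", "byt"}


def parse_flow_line(line, sep="|"):
    """Parse one delimited flow record into a dict with typed values."""
    values = [v.strip() for v in line.split(sep)]
    if len(values) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(values)}")
    record = dict(zip(FIELDS, values))
    for name in INT_FIELDS:
        record[name] = int(record[name])
    record["td"] = float(record["td"])  # duration in seconds
    return record


# Hypothetical sample line with documentation-only addresses.
sample = "2012-05-03 09:00:01.123|4.500|TCP|192.0.2.1|198.51.100.7|40000|80|12|9000"
record = parse_flow_line(sample)
```

Rejecting lines with the wrong number of fields keeps header and summary lines out of the data set instead of silently misaligning columns.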

Traffic Information Storage

HDFS
- HDFS is short for Hadoop Distributed File System.
- Differences from other distributed file systems:
  - It is highly fault-tolerant.
  - It is designed to be deployed on low-cost hardware.
  - It is portable across heterogeneous hardware and software platforms.
- Applications that run on HDFS need streaming access to their data sets.
- HDFS provides high-throughput access to application data and is suitable for applications that have large data sets.
- Moving computation is cheaper than moving data.

HDFS master/slave architecture
- NameNode
  - Manages the name space of the file system.
  - Regulates access to files by clients.
  - Determines the mapping of blocks to DataNodes.
- DataNodes
  - Are responsible for serving read and write requests from the clients.
  - Perform block creation, deletion and replication upon instruction from the NameNode.

Data Replication in HDFS
- To ensure fault tolerance in HDFS, the blocks of a file are replicated.
- The number of replicas of a block can be specified by the application.
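The storage cost of replication is simple arithmetic, sketched below. The replication factor of 3 and the 64 MiB block size are assumptions (common defaults in Hadoop releases of that period), not values stated in the slides, and the helper name `hdfs_footprint` is hypothetical.

```python
import math


def hdfs_footprint(file_bytes, replication=3, block_bytes=64 * 1024**2):
    """Estimate block count and raw storage for one file.

    replication and block_bytes are assumed defaults, not values from the talk.
    """
    blocks = math.ceil(file_bytes / block_bytes)
    return {
        "blocks": blocks,
        "block_replicas": blocks * replication,
        # The last block only occupies its real size, so raw usage is
        # simply the file size times the replication factor.
        "raw_bytes": file_bytes * replication,
    }


fp = hdfs_footprint(200 * 1024**2)  # a 200 MiB file
```

Under these assumptions a 200 MiB file is split into 4 blocks, stored as 12 block replicas, and occupies 600 MiB of raw disk across the DataNodes.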

Traffic Information Analysis

Map/Reduce
- MapReduce is a programming model for processing large data sets.
- MapReduce is typically used to do distributed computing on clusters of computers.
- MapReduce can take advantage of data locality, processing data on or near the storage assets to decrease data transmission.
- The model is inspired by the map and reduce functions commonly used in functional programming.
- "Map" step: The master node takes the input, divides it into smaller sub-problems, and distributes them to slave nodes. Each slave node processes its smaller problem and passes the answer back to the master node.
- "Reduce" step: The master node then collects the answers to all the sub-problems and combines them in some way to form the final output.
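The two steps can be sketched in a single Python process. This is not Hadoop code and not the job used in the talk: it only imitates the map, shuffle and reduce phases in memory, and it assumes the goal is to sum bytes per source address. The record layout and all function names are hypothetical.

```python
from collections import defaultdict


def map_flow(record):
    """Map step: emit one (source address, byte count) pair per flow record."""
    yield record["sa"], record["byt"]


def shuffle(pairs):
    """Group emitted values by key, as the framework does between the two steps."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_bytes(key, values):
    """Reduce step: combine all values for one key into a single total."""
    return key, sum(values)


def run_job(records):
    pairs = [kv for record in records for kv in map_flow(record)]
    return dict(reduce_bytes(k, vs) for k, vs in shuffle(pairs).items())


# Hypothetical flow records with documentation-only addresses.
records = [
    {"sa": "192.0.2.1", "byt": 9000},
    {"sa": "192.0.2.2", "byt": 500},
    {"sa": "192.0.2.1", "byt": 1000},
]
totals = run_job(records)
```

On a cluster the map calls would run in parallel on the nodes that hold the input blocks, which is where the data-locality benefit mentioned above comes from.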

Traffic Information Display

RRDtool
- RRDtool is a drawing tool; the name is an acronym for round-robin database tool.
- The data are stored in a round-robin database (circular buffer).
- It also includes tools to extract RRD data in a graphical format.
- We use it to draw flow trend graphs in three dimensions: flow count, packet count and traffic count.

Highstock
- Highstock lets you create stock or general timeline charts in pure JavaScript, including sophisticated navigation options like a small navigator series, preset date ranges, a date picker, scrolling and panning.
- We just need to write an API between HDFS and Highstock.
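The round-robin idea, where the newest sample overwrites the oldest once the buffer is full, can be sketched with a bounded deque. This is an illustration of the storage principle only, not RRDtool's file format or API. The class `RoundRobinSeries` and its methods are hypothetical, and `as_pairs` assumes the chart expects [timestamp in milliseconds, value] pairs, which is the usual input shape for a Highstock series.

```python
from collections import deque


class RoundRobinSeries:
    """Fixed-size series of (timestamp, flows, packets, traffic_bytes) samples."""

    COLUMNS = {"flows": 1, "packets": 2, "traffic_bytes": 3}

    def __init__(self, slots):
        # A deque with maxlen drops the oldest entry when a new one arrives.
        self._slots = deque(maxlen=slots)

    def update(self, timestamp, flows, packets, traffic_bytes):
        self._slots.append((timestamp, flows, packets, traffic_bytes))

    def fetch(self):
        return list(self._slots)

    def as_pairs(self, column):
        """Return [[timestamp_ms, value], ...] for one of the three dimensions."""
        index = self.COLUMNS[column]
        return [[row[0] * 1000, row[index]] for row in self._slots]


series = RoundRobinSeries(slots=3)
for i in range(5):  # five 5-minute samples into a three-slot buffer
    series.update(i * 300, flows=i, packets=i * 10, traffic_bytes=i * 1000)
points = series.fetch()
```

After five updates only the three most recent samples remain, so the storage size stays constant no matter how long the collector runs.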

Architecture

Conclusion
- Network flow trend chart from IHEP, updated every 5 minutes, in three dimensions: flow count, packet count and traffic count.
- Detailed traffic information chart:
  - Select a single timeslot and get the detailed traffic information.
  - Select a time window and get the detailed traffic information.
  - On hovering over the chart, a tooltip with traffic information on each point and series is displayed. The tooltip follows the mouse as the user moves it over the graph.
- Traffic information related to the HEP experiments:
  - It is available once the IP addresses of the machines involved in an experiment's data transfers are specified.
  - We already have DYB, YBJ, CMS and ATLAS.
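Attributing traffic to an experiment from a list of IP addresses can be sketched as a lookup table. The address ranges below are documentation-only placeholders, not the real IHEP assignments, and the table and function names are hypothetical.

```python
import ipaddress

# Placeholder networks per experiment; the real address lists are not in the slides.
EXPERIMENT_NETWORKS = {
    "DYB": [ipaddress.ip_network("192.0.2.0/25")],
    "YBJ": [ipaddress.ip_network("192.0.2.128/25")],
    "CMS": [ipaddress.ip_network("198.51.100.0/24")],
    "ATLAS": [ipaddress.ip_network("203.0.113.0/24")],
}


def experiment_for(address):
    """Return the experiment whose networks contain the address, or None."""
    ip = ipaddress.ip_address(address)
    for name, networks in EXPERIMENT_NETWORKS.items():
        if any(ip in network for network in networks):
            return name
    return None


label = experiment_for("198.51.100.42")
```

Applying this lookup to the source or destination address of each flow record would let the per-experiment totals be computed with the same aggregation used for the overall charts.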

Thank You Questions?