Network Traffic Analysis Using Hadoop Architecture
Zeng Shan
ISGC2013, Taipei
zengshan@ihep.ac.cn
Outline
- Flow vs. Packet: what are netflows?
- Flow tools used in the system: nprobe, nfdump
- Introduction to Hadoop: HDFS, Map/Reduce
- Architecture of our system
- Function Display
Flow vs. Packet
What is a flow? A network flow is a sequence of packets from a source computer to a destination, which may be another host, a multicast group, or a broadcast domain. A network flow measures sequences of IP packets sharing a common feature as they pass between devices. Flow formats: NetFlow (Cisco), J-Flow (Juniper), sFlow (HP).
Netflows
The NetFlow concept was developed by Cisco. It defines a protocol and packet format for collecting network traffic data. Several versions of the protocol exist; v5 is the most widespread.
NetFlow v5: a flow is a unique tuple of (router port, source IP+port, destination IP+port, IP protocol). A flow also records the TCP flags and the numbers of packets and bytes transferred. Flows are cached on the router and sent regularly to a recording node. If the load on the router becomes high, flows can get lost, as the cache may be flushed. Typically more than 2000 flows/s can be recorded.
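The flow cache described above can be sketched in a few lines of Python. This is an illustration only, not code from any NetFlow implementation; the names `FlowKey`, `FlowStats` and `aggregate` are invented for the example.

```python
from collections import namedtuple

# Illustrative model of the NetFlow v5 flow key: router input interface,
# source IP+port, destination IP+port, and IP protocol number.
FlowKey = namedtuple(
    "FlowKey",
    ["input_if", "src_ip", "src_port", "dst_ip", "dst_port", "proto"],
)

# Per-flow counters carried alongside the key: TCP flags, packets, bytes.
FlowStats = namedtuple("FlowStats", ["tcp_flags", "packets", "bytes"])

def aggregate(records):
    """Merge per-packet records that share a key into one flow,
    much as a router's flow cache does before exporting."""
    cache = {}
    for key, stats in records:
        if key in cache:
            old = cache[key]
            cache[key] = FlowStats(
                old.tcp_flags | stats.tcp_flags,  # OR the observed TCP flags
                old.packets + stats.packets,      # sum packet counts
                old.bytes + stats.bytes,          # sum byte counts
            )
        else:
            cache[key] = stats
    return cache

key = FlowKey(1, "192.168.23.2", 51234, "202.122.32.1", 80, 6)
flows = aggregate([
    (key, FlowStats(0x02, 1, 60)),   # SYN packet
    (key, FlowStats(0x10, 2, 120)),  # two ACK packets
])
print(flows[key])  # one flow: flags 0x12, 3 packets, 180 bytes
```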
Flow vs. Packet
Flows contain the relevant information to characterize and analyze the traffic, yet less information than full packets, so recording flows does not impose a high network load. Storing flows, compared with storing packets, requires much less disk space: one day of IHEP traffic stored as flows is roughly 1.5 GB. Instead of recording all flows, sampling can also be used. Flows can be exported by the router or created from packets by nprobe; nprobe is an alternative if the router doesn't export NetFlow.
Flow Tools used in the System
What is nprobe? nprobe is an open-source tool that captures packets flowing on an Ethernet segment, computes flows, and exports them to the specified collectors.
Features: keeps up with Gbit speeds on Ethernet networks, handling thousands of packets per second without packet sampling on commodity hardware; supports major operating systems including Unix, Windows and Mac OS X; designed for environments with limited resources.
nfdump fulfils two tasks: data storage and data filtering/resolution.
nfcapd receives flows on a configurable port and stores them to disk every n minutes.
nfdump reads the dump files containing flow data, filters them, and resolves them into readable output.
nfdump is IPv6-enabled. The current stable version is 1.6.9.
nfdump usage is very similar to tcpdump, especially the filtering rules.
Example: looking for any activity of the host with IP 192.168.23.2 on Mar 14 2013 from 11:00 to 11:59am:
nfdump -R nfcapd.201303141100:nfcapd.201303141155 'ip 192.168.23.2'
Information contained in nfdump output
%ts   Start Time - first seen        %in   Input Interface num
%te   End Time - last seen           %out  Output Interface num
%td   Duration                       %pkt  Packet count in this flow
%pr   Protocol                       %byt  Byte count in this flow
%sa   Source Address                 %fl   Number of flows
%da   Destination Address            %flg  TCP Flags
%sap  Source Address:Port            %tos  Type of Service
%dap  Destination Address:Port       %bps  bits per second
%sp   Source Port                    %pps  packets per second
%dp   Destination Port               %bpp  Bytes per packet
%sas  Source AS                      %das  Destination AS
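As a sketch of consuming this output: assuming nfdump is run with a comma-separated custom format string built from the tags above (e.g. -o "fmt:%ts,%td,%pr,%sa,%sp,%da,%dp,%pkt,%byt"), each line can be parsed into a record like this. The field order and the sample line are illustrative, not real IHEP data.

```python
# Field names matching a hypothetical format string:
#   -o "fmt:%ts,%td,%pr,%sa,%sp,%da,%dp,%pkt,%byt"
FIELDS = ["ts", "td", "pr", "sa", "sp", "da", "dp", "pkt", "byt"]

def parse_flow_line(line):
    """Split one comma-separated nfdump output line into a dict
    keyed by the (percent-stripped) format tags."""
    parts = [p.strip() for p in line.split(",")]
    rec = dict(zip(FIELDS, parts))
    # Convert the numeric counters; ports, packet and byte counts are ints.
    for f in ("sp", "dp", "pkt", "byt"):
        rec[f] = int(rec[f])
    return rec

# Invented sample line in the assumed format:
line = ("2013-03-14 11:00:02.123, 4.520, TCP, 192.168.23.2, 51234, "
        "202.122.32.1, 80, 12, 3456")
rec = parse_flow_line(line)
print(rec["sa"], "->", rec["da"], rec["byt"], "bytes")
```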
Introduction to Hadoop
What can Hadoop do? Hadoop is an open-source software framework, originally developed to support distribution for the Nutch search engine project and licensed under the Apache v2 license. It supports data-intensive distributed applications, enabling them to work with thousands of computationally independent nodes and petabytes of data. Storage: HDFS. Compute: Map/Reduce.
HDFS is short for Hadoop Distributed File System. Differences from other distributed file systems: highly fault-tolerant; designed to be deployed on low-cost hardware; portable across heterogeneous hardware and software platforms. Applications that run on HDFS need streaming access to their data sets; HDFS provides high-throughput access to application data and suits applications with large data sets. Moving computation is cheaper than moving data.
HDFS has a master/slave architecture.
NameNode: manages the namespace of the file system; regulates access to files by clients; determines the mapping of blocks to DataNodes.
DataNodes: serve read and write requests from clients; perform block creation, deletion and replication upon instruction from the NameNode.
Data Replication in HDFS: to ensure fault tolerance, the blocks of a file are replicated; the number of replicas of a block can be specified by the application.
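A toy model of what replication buys: each block of a file is placed on several distinct DataNodes, so losing any one node leaves every block readable. This is only an illustration of the idea; real HDFS uses a rack-aware placement policy, not the round-robin scheme below.

```python
def place_blocks(n_blocks, datanodes, replication=3):
    """Toy placement: put each block on `replication` distinct DataNodes,
    round-robin. Requires replication <= len(datanodes) for distinctness."""
    placement = {}
    for b in range(n_blocks):
        placement[b] = [
            datanodes[(b + i) % len(datanodes)] for i in range(replication)
        ]
    return placement

nodes = ["dn1", "dn2", "dn3", "dn4"]
placement = place_blocks(4, nodes, replication=3)
# Every block survives the loss of any single DataNode, since each
# block has copies on two other nodes.
for block, replicas in placement.items():
    print(block, replicas)
```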
Map/Reduce
MapReduce is a programming model for processing large data sets, typically used for distributed computing on clusters of computers. MapReduce can take advantage of data locality, processing data on or near the storage assets to reduce data transmission. The model is inspired by the map and reduce functions of functional programming.
"Map" step: the master node takes the input, divides it into smaller sub-problems, and distributes them to slave nodes. Each slave node processes its sub-problem and passes the answer back to the master node.
"Reduce" step: the master node collects the answers to all the sub-problems and combines them in some way to form the final output.
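The two steps above can be sketched in-process on our kind of data, here summing bytes per source IP over flow records. This is a single-machine simulation of the model, not Hadoop code; on a cluster the same two functions would run as (for example) Hadoop Streaming mapper and reducer scripts, and all names here are illustrative.

```python
from itertools import groupby
from operator import itemgetter

def map_flow(record):
    """Map step: emit (key, value) pairs, here (source IP, byte count)."""
    yield record["sa"], record["byt"]

def reduce_bytes(key, values):
    """Reduce step: combine all values collected for one key."""
    return key, sum(values)

def run_mapreduce(records):
    # "Map" phase: apply the mapper to every input record.
    pairs = [kv for rec in records for kv in map_flow(rec)]
    # Shuffle/sort phase: group pairs by key, as the framework would.
    pairs.sort(key=itemgetter(0))
    # "Reduce" phase: one reducer call per distinct key.
    return dict(
        reduce_bytes(k, [v for _, v in grp])
        for k, grp in groupby(pairs, key=itemgetter(0))
    )

records = [
    {"sa": "192.168.23.2", "byt": 100},
    {"sa": "192.168.23.5", "byt": 40},
    {"sa": "192.168.23.2", "byt": 60},
]
print(run_mapreduce(records))  # {'192.168.23.2': 160, '192.168.23.5': 40}
```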
Architecture of our System
Architecture
Drawing tools
RRDtool, an acronym for round-robin database tool: the data are stored in a round-robin database (circular buffer), and it includes tools to extract RRD data in graphical form. We use it to draw the flow trend graph in three dimensions: flow count, packet count and traffic volume.
Highstock lets you create stock or general timeline charts in pure JavaScript, including sophisticated navigation options such as a small navigator series, preset date ranges, a date picker, scrolling and panning. We just need to write an API between HDFS and Highstock.
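One hedged sketch of that API: Highstock consumes time-series data as [timestamp-in-milliseconds, value] pairs, so a minimal bridge turns 5-minute (time, flow-count) buckets into that JSON shape. The bucket values below are invented for illustration.

```python
import json
from datetime import datetime, timezone

def to_highstock_series(buckets):
    """buckets: iterable of (datetime, count) pairs (UTC assumed).
    Returns JSON in the [timestamp_ms, value] shape Highstock expects."""
    return json.dumps([
        [int(ts.replace(tzinfo=timezone.utc).timestamp() * 1000), count]
        for ts, count in buckets
    ])

# Two illustrative 5-minute buckets of flow counts.
buckets = [
    (datetime(2013, 3, 14, 11, 0), 1823),
    (datetime(2013, 3, 14, 11, 5), 1907),
]
print(to_highstock_series(buckets))
```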
Function Display
Summary
Network trend chart for IHEP every 5 minutes in three dimensions: flow count / packet count / traffic count.
Detailed traffic information chart: select a single timeslot, or a time window, to get the detailed traffic information; on hovering over the chart, a tooltip with traffic information for each point and series is displayed, and the tooltip follows as the user moves the mouse over the graph.
Traffic information related to the HEP experiments, once the IP addresses of the machines involved in an experiment's data transfers are specified; we already cover DYB/YBJ/CMS/ATLAS.
Intrusion detection is also possible: writing rules, generating alerts.
Thank You Questions?