Big Data Analytics for NetFlow Analysis in a Distributed Environment using Hadoop

1 Amreesh Kumar Patel, 2 D.S. Bhilare, 3 Sushil Buriya, 4 Satyendra Singh Yadav
School of Computer Science & IT, DAVV, Indore, M.P., India
Email: 1 amreesh21@gnail.com, 2 bhilare@hotmail.com, 3 sushil.buriya@gmail.com, 4 sat1237@gmail.com

Abstract
Network traffic measurement and analysis have traditionally been performed on a single high-performance server that collects and analyzes packet flows. When detailed statistics must be computed over the traffic of a large-scale network, a single server cannot easily handle tera- or petabytes of data, and thousands of machines are needed. Distributed parallel processing schemes built on cluster file systems have recently been developed and can be applied beneficially to analyzing big network traffic data. Hadoop is a popular parallel processing framework that is widely used for working with large datasets. We analyze NetFlow data on Hadoop clusters ranging from a single node to multiple nodes and provide an algorithm that computes the packet count and total packet size of each source IP address over a fixed interval of time, with a low rate of false positives when used to detect malicious activity. Finally, we highlight the performance and benefits of a distributed Hadoop cluster on large as well as small data sets.

I. INTRODUCTION
Data becomes big data when its velocity, volume, and variety exceed the abilities of IT systems to store, analyze, and process it. Today, many organizations have the equipment and expertise to handle large amounts of structured data, but with the increasing volume and rate of data flows, they lack the ability to mine the data and discover actionable intelligence. Not only is the volume of this data growing too rapidly for traditional analytics, but the speed of data arrival and the variety of data types require new types of data processing and analytic solutions [1].

Big Data analytics is the process of mining and analyzing big data to produce useful decision-making information and discover actionable intelligence at an unprecedented scale and specificity. Big Data analytics can be leveraged for network security by analyzing network traffic to identify anomalies and suspicious activities. Traditional technologies fail to provide tools for long-term, large-scale analysis of network traffic because they are not as flexible as Big Data analytics tools with respect to data formats. Big Data analytic systems also use cluster computing infrastructures that are more reliable and available than traditional security analytics technologies [2]. In this paper we use the power of Big Data analytics to analyze network flows and detect malicious behaviour in the network using Apache Hadoop.

The rest of the paper is organized as follows. Section II provides background on Apache Hadoop and NetFlow data, and Section III presents related work in big data analytics and security intelligence. Section IV describes MapReduce-based flow analysis. Section V presents the experimental setup and discusses the results. Finally, Section VI concludes the paper and points towards future work.

II. BACKGROUND
In this section we provide detailed information about Apache Hadoop and NetFlow data.

1. APACHE HADOOP
Apache Hadoop provides a framework for the distributed processing of large data sets across a cluster. Hadoop has two parts: the HDFS (Hadoop Distributed File System) file system and the MapReduce programming paradigm [3].
1.1 Hadoop Distributed File System
Data in a Hadoop cluster is broken down into small pieces and distributed across the cluster. Map and reduce functions can then be executed on these smaller pieces of a large dataset, which provides the scalability needed for Big Data processing. Hadoop's data distribution logic is managed by a special server called the NameNode, which keeps track of all data sets in HDFS. The nodes on which the data is stored are known as DataNodes. A replica of the NameNode, called the Secondary NameNode, exists for backup purposes [4].
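To make the NameNode/DataNode division concrete, here is a minimal sketch, our own illustration rather than anything from the paper, that copies a local trace file into HDFS through the standard FileSystem API and asks the NameNode for the resulting block size and replication factor. The file paths are hypothetical.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class FlowUpload {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();      // reads core-site.xml / hdfs-site.xml
            FileSystem fs = FileSystem.get(conf);          // client handle backed by the NameNode

            Path local  = new Path("flows.txt");           // hypothetical local trace file
            Path remote = new Path("/netflow/flows.txt");  // hypothetical HDFS destination
            fs.copyFromLocalFile(local, remote);           // the blocks end up on the DataNodes

            FileStatus st = fs.getFileStatus(remote);      // metadata served by the NameNode
            System.out.println("block size = " + st.getBlockSize()
                    + " bytes, replication = " + st.getReplication());
        }
    }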
1.2 MapReduce
MapReduce is a programming paradigm that allows massive scalability across the hundreds or thousands of nodes of a Hadoop cluster. MapReduce performs two separate and distinct tasks. The first is the map task, which takes a set of data and converts it into another set of data expressed as key/value pairs. The reduce task takes the output of a map task as its input and combines those key/value pairs. In a Hadoop cluster, a MapReduce program is referred to as a job, and a job is executed by breaking it down into tasks. An application submits a job to a specific node called the JobTracker. The JobTracker communicates with the NameNode and then breaks the job into map and reduce tasks. A set of continually running agents called TaskTrackers monitors the status of each task [4].

2. NETFLOW
The NetFlow data were collected with the Wireshark tool in .pcap format and then converted to text. The resulting data set is in plain text, and each line represents a record of several fields separated by commas [5]. Network traffic data provide a mechanism for exporting summaries of the traffic observed at networking equipment such as routers and switches. A flow is defined as a set of packets within a time frame that share a certain set of attributes; here these attributes are the following six IP packet header fields [6]: time, source IP address, destination IP address, protocol, length, and info.
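As a small illustration of this record format, the following sketch, again ours and with the field order assumed from the list above, parses one line of the converted text trace into typed fields.

    public class FlowRecord {
        final String time, srcIp, dstIp, protocol, info;
        final long length;                       // packet length in bytes

        FlowRecord(String line) {
            // Split on the first five commas only: the trailing "info" field
            // exported by Wireshark may itself contain commas.
            String[] f = line.split(",", 6);
            time     = f[0].trim();
            srcIp    = f[1].trim();
            dstIp    = f[2].trim();
            protocol = f[3].trim();
            length   = Long.parseLong(f[4].trim());
            info     = f[5].trim();
        }
    }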
III. RELATED WORK
Today's networks are becoming more complex and assorted because of the appearance of various applications and services. Network traffic is therefore growing rapidly, while methods for network traffic analysis have not been developed to capture this trend of increasing usage. Most methods for network traffic analysis operate in a single-server environment, and as the amount of traffic data increases, they run into limits of memory, processing speed, and data storage capacity. Shim et al. provide a traffic classification based on payload signatures in a Hadoop distributed computing environment; although a Hadoop-based system is not much more effective for small amounts of traffic data, it shows a big advantage in processing speed and storage capacity for big data [6].

The storage, management, analysis, and measurement of NetFlow data have emerged as very significant issues. Many studies on detecting anomalous NetFlow have been done so far, but measurement and analysis studies of big data in distributed computing environments based on Hadoop are not yet actively pursued [7].

Anomaly detection is essential for preventing network outages and keeping network resources available. Fontugne et al. investigate the benefits of recent distributed computing approaches for the real-time analysis of non-sampled network traffic. Focusing on the MapReduce model, their study uncovers a fundamental difficulty in detecting network traffic anomalies with Hadoop: the classical data slicing used for textual documents breaks spatial and temporal traffic structures, which dramatically deteriorates anomaly detector performance [8].

A Distributed Denial of Service (DDoS) attack is launched to make an Internet resource unavailable. As attack sizes increase, attack log files have also grown greatly, and mining the logs for meaningful analyses with conventional techniques takes a long time; Hadoop MapReduce can deduce the same results efficiently and quickly [9].

In Internet traffic measurement and analysis, flow-based traffic monitoring methods are widely deployed throughout Internet Service Providers (ISPs). Popular tools such as tcpdump or CoralReef are usually run on a single host to capture and process packets at a specific monitoring point. To analyze flow data for a large-scale network, a few tera- or petabytes of packet or flow files must be handled and managed simultaneously, and when global Internet worms or DDoS attacks break out, a large volume of flow data must also be processed quickly at once. MapReduce is a software framework that supports distributed computing with the two functions map and reduce over large data sets on clusters [10].
IV. MAPREDUCE BASED FLOW ANALYSIS
MapReduce-based flow analysis is presented in two ways, as an algorithm and as a data flow diagram, both described below.

Algorithm for packet counting
The algorithm for NetFlow data analysis in the Hadoop distributed environment does very simple filtering, counting, and summing of the packet sizes of every source IP address within a fixed interval of time, yet offers a very useful and fundamental analysis of the NetFlow data set. We can use it to compute each source IP address's share of the network flow, or simply to filter out the records we need by outputting them directly from the algorithm. Consider a specific scenario in which we want the volume of data flowing from each IP address. A mapper may output the key/value pairs (10.0.22.1, 1 byte) and (10.0.22.1, 2 bytes) from two records it has processed and write both pairs to disk. If the two pairs are instead combined into one, (10.0.22.1, 3 bytes), during the map phase, the mapper writes only one key/value pair. In this way the volume of intermediate data between the map phase and the reduce phase is reduced. Local reducers of this kind are called combiners and can be used to scale up aggregation.

Data flow analysis
Hadoop consists of two main components: the Hadoop Distributed File System (HDFS), which can be deployed across thousands of machines, and a parallel processing framework implementing MapReduce. Figure 1 illustrates the typical data flow of the source-IP packet-counting example; the data flow is the same when the packet size per IP address is computed over a fixed interval of time. Suppose we have a file textfile1 in HDFS that contains some text, and we want to count each IP address in the file [13]. HDFS uses two blocks to store textfile1. First the input file is split between two mappers, and each mapper parses its assigned portion of the text and emits <key (source IP address), value> pairs. The Hadoop framework then shuffles the output of the mappers, sorts it, partitions the result by key, and generates <key, list(values)> pairs. Finally, the two reducers process the <key, list(values)> pairs, i.e., sum up all the values in each list, and output the results to their separate files.

Figure 1: Data flow of the IP counting MapReduce job
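The data flow of Figure 1 and the count-and-sum algorithm above can be written in Hadoop's Java MapReduce API roughly as follows. This is a minimal sketch of our own, not the authors' implementation: the 60-second interval, the class names, and the record layout (Section II.2) are assumptions.

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class NetflowAnalysis {

        // Map: one text flow record in, one (srcIP@interval, "packets,bytes") pair out.
        public static class FlowMapper extends Mapper<LongWritable, Text, Text, Text> {
            private static final long INTERVAL_MS = 60_000;   // assumed fixed interval

            @Override
            protected void map(LongWritable offset, Text line, Context ctx)
                    throws IOException, InterruptedException {
                String[] f = line.toString().split(",", 6);   // time,src,dst,proto,length,info
                if (f.length < 6) return;                     // skip malformed records
                try {
                    long bucket = (long) (Double.parseDouble(f[0].trim()) * 1000) / INTERVAL_MS;
                    String key = f[1].trim() + "@" + bucket;  // source IP within its interval
                    ctx.write(new Text(key), new Text("1," + Long.parseLong(f[4].trim())));
                } catch (NumberFormatException e) {
                    // ignore records whose time or length field is not numeric
                }
            }
        }

        // Reduce: sum packet counts and byte totals per key. The sum is commutative
        // and associative, so the very same class can also serve as the combiner.
        public static class FlowReducer extends Reducer<Text, Text, Text, Text> {
            @Override
            protected void reduce(Text key, Iterable<Text> values, Context ctx)
                    throws IOException, InterruptedException {
                long packets = 0, bytes = 0;
                for (Text v : values) {
                    String[] p = v.toString().split(",");
                    packets += Long.parseLong(p[0]);
                    bytes   += Long.parseLong(p[1]);
                }
                ctx.write(key, new Text(packets + "," + bytes));
            }
        }
    }

Registering FlowReducer as the combiner is exactly the local pre-aggregation described above: the two pairs for 10.0.22.1 are merged on the map side before being written to disk.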
Note that we implement only the mapper code and the reducer code; Hadoop takes care of orchestrating the mappers and reducers and of sorting the mappers' output, i.e., the intermediate results, in an efficient way. An interesting note about HDFS is that files are stored in blocks of large size: the default HDFS block size is 128 MB. This is because HDFS is designed to handle large files, and for every block of every file the NameNode, which manages all file metadata in HDFS, must keep a record.

V. PERFORMANCE EVALUATION
To evaluate the performance of flow analysis with MapReduce, we built a small Hadoop testbed consisting of a master node and four data nodes. Each node has a Core i3 1.78 GHz CPU, 4 GB of memory, and a 12 GB hard disk, and all Hadoop nodes are connected with 1 Gigabit Ethernet cards. With Wireshark we collected NetFlow packets carrying the flow data of a Gigabit Ethernet link in our campus network, in order to evaluate the flow statistics computation time for large and small data sets. We fed the text flow files to our MapReduce program and compared the flow statistics program on a single node and on multiple nodes. The tested programs compute the packet count and the packet size within a fixed interval of time for each source IP address. To observe the impact of the number of data nodes on the performance of the MapReduce program, we carried out the experiments with 1, 2, 3, and 4 data nodes [10]. After computing the packet count and packet size per fixed interval of time, we analyzed large and small data sets as a function of the HDFS block size. All three performance evaluations are illustrated in Figures 2, 3, and 4.

A bigger block size can reduce the workload of the NameNode and, for big files, can reduce the network load when a distant host requests data from a local host in the cluster. The block size also affects the number of mappers in a MapReduce (MR) job, because it is the upper bound of the split size that Hadoop uses to split input files into fragments and allocate them to mappers. We evaluated the performance on a 1 GB data set with block sizes of 32 MB, 64 MB, 128 MB, 256 MB, 512 MB, and 1024 MB. As the block size increases, the computation time decreases, so a bigger block size takes less time, as shown in Figure 2.

Figure 2: Computation time as a function of the HDFS block size (32 MB to 1024 MB)

We evaluated a small data set (1 GB) on the Hadoop cluster, processing it on one, two, three, and four nodes. As the number of DataNodes increases, the computation time also increases, as shown in Figure 3. We can therefore say that Hadoop is not useful for small NetFlow data sets.

Figure 3: Computation time of the small data set (1 GB) versus the number of data nodes

We evaluated a larger data set on the distributed Hadoop cluster, again processing it on one, two, three, and four nodes. In this case the computation time decreases as the number of DataNodes increases, as shown in Figure 4. We can therefore say that Hadoop is very useful for big NetFlow data analysis.

Figure 4: Computation time of the large data set versus the number of data nodes

Using Hadoop distributed cluster computing, the computation time for big data is inversely proportional to the number of nodes. Together, the three graphs show the performance of Hadoop on small data sets and on big data sets.
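For completeness, here is a sketch of a driver that could wire the mapper, combiner, and reducer of Section IV together and cap the input split size, the knob behind the block-size results above. It is our own illustration using Hadoop 2 conventions, not the authors' code, and it assumes the NetflowAnalysis classes sketched earlier.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class NetflowDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Upper bound on the input split size; together with the HDFS block
            // size this controls how many mappers the job is given.
            conf.setLong("mapreduce.input.fileinputformat.split.maxsize",
                    128L * 1024 * 1024);

            Job job = Job.getInstance(conf, "netflow packet count and size");
            job.setJarByClass(NetflowDriver.class);
            job.setMapperClass(NetflowAnalysis.FlowMapper.class);
            job.setCombinerClass(NetflowAnalysis.FlowReducer.class);  // map-side pre-aggregation
            job.setReducerClass(NetflowAnalysis.FlowReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(Text.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));     // e.g. /netflow
            FileOutputFormat.setOutputPath(job, new Path(args[1]));   // must not exist yet
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }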
VI. CONCLUSIONS
We have presented a scalable Hadoop-based distributed packet processor that can analyze large packet trace files. Our proposal can easily manage packet trace files of tera- or petabytes, because we employ the MapReduce platform for parallel processing. On the Hadoop system, we evaluated the performance of the MapReduce-based flow analysis method by developing a program that computes the packet count and packet size within a fixed interval of time. We experimented with four DataNodes, with a small data set (1 GB), a larger data set, and varying HDFS block sizes. From the evaluation results we draw two recommendations: first, use as large an HDFS block size as possible; second, the Hadoop Distributed File System is very useful for big data sets but not for small ones. Overall, Hadoop is most useful for NetFlow data analysis and IP address feature extraction.

References
[1] Big Data Analytics: Advanced Analytics in Oracle Database, Oracle White Paper, March 2013.
[2] Alvaro A. Cárdenas et al., Big Data Analytics for Security Intelligence, University of Texas at Dallas / Cloud Security Alliance, 2013.
[3] http://hadoop.apache.org/
[4] Paul C. Zikopoulos et al., Understanding Big Data: Analytics for Enterprise Class Hadoop and Streaming Data, New York: McGraw-Hill, 2012.
[5] Bingdong Li et al., A Survey of Network Flow Applications, Journal of Network and Computer Applications, June 28, 2012.
[6] Kyu-Seok Shim, Su-Kang Lee et al., Application Traffic Classification in Hadoop Distributed Computing Environment, Asia-Pacific Network Operations and Management Symposium (APNOMS), Korea, 2014.
[7] JongSuk R. Lee, Sang-Kug Ye et al., Detecting Anomaly Teletraffic Using Stochastic Self-Similarity Based on Hadoop, 16th International Conference on Network-Based Information Systems, Korea, 2013.
[8] Romain Fontugne, Johan Mazel et al., Hashdoop: A MapReduce Framework for Network Anomaly Detection, IEEE INFOCOM Workshop on Security and Privacy in Big Data, Tokyo, Japan, 2014.
[9] Rana Khattak, Shehar Bano et al., DOFUR: DDoS Forensics Using MapReduce, IEEE 2011, DOI 10.1109/FIT.2011.29.
[10] Youngseok Lee, Wonchul Kang et al., An Internet Traffic Analysis Method with MapReduce, Chungnam National University, Springer-Verlag Berlin Heidelberg, 2011.
[11] Ravi Sharma, Study of Latest Emerging Trends on Cyber Security and its Challenges to Society, International Journal of Scientific & Engineering Research, Volume 3, Issue 6, June 2012, ISSN 2229-5518.
[12] Yeonhee Lee, Wonchul Kang et al., A Hadoop-Based Packet Trace Processing Tool, Chungnam National University, Springer-Verlag Berlin Heidelberg, 2011.
[13] Jan Tore Morken et al., Distributed NetFlow Processing Using the Map-Reduce Model, NTNU, Norway, 2010.
[14] Zeng Shan et al., Network Traffic Analysis using HADOOP Architecture, ISGC 2013, Taipei.