Big Data Analytics for Net Flow Analysis in Distributed Environment using Hadoop
1 Amreesh Kumar Patel, 2 D.S. Bhilare, 3 Sushil Buriya, 4 Satyendra Singh Yadav
School of Computer Science & IT, DAVV, Indore, M.P., India
Email id: 1 amreesh21@gnail.com, 2 bhilare@hotmail.com, 3 sushil.buriya@gmail.com, 4 sat1237@gmail.com

Abstract

Network traffic measurement and analysis have regularly been performed on a high-performance server that collects and analyzes packet flows. When we monitor a large volume of network traffic data for detailed statistics on a large-scale network, it is not easy to handle tera- or petabytes of data with a single server; thousands of machines are needed. Distributed parallel processing schemes have recently been developed on top of cluster file systems and can be beneficially applied to analyzing big network traffic data. Hadoop is a popular parallel processing framework that is widely used for working with large datasets. We analyze NetFlow data on Hadoop clusters ranging from a single node to multiple nodes and provide an algorithm that calculates the packet count and packet size of each source IP address for every fixed interval of time, with a low rate of false positives in detecting malicious activity. Finally, we highlight the performance and benefits of a Hadoop distributed cluster on large data sets as well as small data sets.

I. INTRODUCTION

Data becomes big data when its velocity, volume, and variety exceed the abilities of IT systems to store, analyze, and process it. Today, many organizations have the equipment and expertise to handle large amounts of structured data, but with the increasing volume and high rate of data flows, they lack the ability to mine it and discover actionable intelligence. Not only is the volume of this data growing too rapidly for traditional analytics, but the speed of data arrival and the variety of data types require new types of data processing and analytic solutions [1].
Big Data analytics is the process of mining and analyzing Big Data to produce useful decision-making information and discover actionable intelligence at an unprecedented scale and specificity. Big Data analytics can be leveraged for network security by analyzing network traffic to identify anomalies and suspicious activities. Traditional technologies fail to provide tools for long-term, large-scale analysis of network traffic, because those tools are not as flexible with respect to data formats as Big Data analytics tools. Big Data analytic systems use cluster computing infrastructures that are more reliable and available than traditional security analytics technologies [2]. In this paper we use the power of Big Data analytics to analyze network flows and detect malicious behaviour in a network using Apache Hadoop.

The rest of the paper is organized as follows. Section II presents background on Apache Hadoop and NetFlow data, and Section III reviews related work in the field of big data analytics and security intelligence. Section IV describes MapReduce-based flow analysis. Section V demonstrates the experimental setup and discusses the results. Finally, Section VI concludes the paper and points toward future work.

II. BACKGROUND

In this section we provide detailed information about Apache Hadoop and NetFlow data.

1. APACHE HADOOP

Apache Hadoop provides a framework that allows for the distributed processing of large data sets across a cluster. Hadoop has two parts: the HDFS (Hadoop Distributed File System) file system and the MapReduce programming paradigm [3].

1.1 Hadoop Distributed File System

Data in a Hadoop cluster is broken down into small pieces and distributed through the cluster. The map and reduce functions can then be executed on these smaller pieces of large datasets, which provides scalability for Big Data processing. Hadoop's data distribution logic
is managed by a special server called the NameNode. The NameNode keeps track of all data sets in HDFS. The nodes where the data is stored are known as DataNodes. A replica of the NameNode, called the Secondary NameNode, exists for backup purposes [4].

1.2 MapReduce

MapReduce is a programming paradigm that allows for massive scalability across hundreds or thousands of nodes in a Hadoop cluster. MapReduce performs two separate and distinct tasks. The first is the Map task, which takes a set of data and converts it into another set of data (key/value pairs). The Reduce task takes the output of a map as its input and combines those key/value pairs. In a Hadoop cluster, a MapReduce program is referred to as a job, and a job is executed by breaking it down into tasks. An application submits a job to a specific node called the JobTracker. The JobTracker communicates with the NameNode and then breaks the job into map and reduce tasks. A set of continually running agents called TaskTrackers monitor the status of each task [4].

2. NETFLOW

NetFlow data is collected with the Wireshark tool in .pcap format and then converted to text format. The resulting data set is in plain text, and each line represents a record of several fields separated by commas [5]. Network traffic data provides a mechanism for exporting summaries of the traffic observed in networking equipment such as routers and switches. A flow is defined as a set of packets within a time frame that share a certain set of attributes. These attributes have been defined to be the following [6]: time, source IP address, destination IP address, protocol, length, and info. All six attributes in this list are simply IP packet header fields, as shown in the figure.

III. RELATED WORK

Today, networks are becoming more complex and assorted because of the appearance of various applications and services.
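As a toy illustration of such a record, the conversion described above can be sketched in Python. The field order, comma delimiter, and sample values below are assumptions made for illustration, not the exact Wireshark export format:

```python
# Each line of the converted text file is assumed to carry the six flow
# attributes, in order, separated by commas.
FIELDS = ["time", "src_ip", "dst_ip", "protocol", "length", "info"]

def parse_flow_record(line):
    """Split one comma-separated flow record into a dict of attributes."""
    # maxsplit keeps the trailing "info" field intact even if it has commas
    parts = line.strip().split(",", len(FIELDS) - 1)
    record = dict(zip(FIELDS, parts))
    record["time"] = float(record["time"])    # capture timestamp (seconds)
    record["length"] = int(record["length"])  # packet size in bytes
    return record

# A made-up sample record:
rec = parse_flow_record("0.125,10.0.0.1,10.0.0.2,TCP,1500,ACK")
print(rec["src_ip"], rec["length"])  # 10.0.0.1 1500
```

A parser of this shape is what a mapper would apply to each input line before emitting key/value pairs keyed by source IP address.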
Therefore, network traffic is growing rapidly, yet methods for network traffic analysis have not been developed to keep up with this increasing usage. Most methods for network traffic analysis operate in a single-server environment; as the amount of traffic data increases, the existing methods reach their limits in terms of memory, processing speed, and data storage capacity. One study provides traffic classification based on payload signatures in a Hadoop distributed computing environment: although a Hadoop-based system is not very effective for analyzing small amounts of traffic data, it showed a big advantage in processing speed and storage capacity for big data [6]. The need for data storage, management, analysis, and measurement of NetFlow data has emerged as a very significant issue, and many studies for detecting anomalous NetFlow traffic have been done; however, measurement and analysis studies for big data in distributed computing environments based on Hadoop are not actively being made [7]. Anomaly detection is essential for preventing network outages and keeping network resources available. Another study investigates the benefits of recent distributed computing approaches for real-time analysis of non-sampled network traffic; focusing on the MapReduce model, it uncovers a fundamental difficulty in detecting network traffic anomalies with Hadoop: the classical data slicing used for textual documents breaks spatial and temporal traffic structures, which dramatically deteriorates anomaly detector performance [8]. A Distributed Denial of Service (DDoS) attack is launched to make an Internet resource unavailable. As attacks have grown, the size of attack log files has also increased greatly, and mining those logs for meaningful analysis with conventional techniques takes a long time; Hadoop MapReduce can deduce such results efficiently and quickly.
[9] In Internet traffic measurement and analysis, flow-based traffic monitoring methods are widely deployed throughout Internet Service Providers (ISPs). Popular tools such as tcpdump or CoralReef are usually run on a single host to capture and process packets at a specific monitoring point. When we analyze flow data for a large-scale network, we need to handle and manage a few tera- or petabytes of packet or flow files simultaneously. When an outbreak of global Internet worms or DDoS attacks happens, we also have to quickly process a large volume of flow data at once. MapReduce is a software framework that supports distributed computing with
the two functions map and reduce on large data sets on clusters [1].

IV. MAPREDUCE BASED FLOW ANALYSIS

We present MapReduce-based flow analysis in two ways: first as an algorithm, and then as a data flow diagram, as described below.

Algorithm for packet counting

Our algorithm for NetFlow data analysis in a Hadoop distributed environment does very simple filtering, counting, and summing of the packet-size volume of every source IP address in a fixed interval of time, yet offers a very useful and fundamental analysis of the NetFlow data set. We can use it to count the percentage of network flow belonging to each source IP, or simply to filter out the records we need by directly outputting them in the algorithm. Here is a specific scenario. If we are trying to get the volume of data flow from each IP address, a mapper may output the key/value pairs (IP, 1 byte) and (IP, 2 bytes) from two records of the same source IP that it has processed, and write the two key/value pairs to disk. If we instead combine the two key/value pairs into one (IP, 3 bytes) in the map phase, the mapper will write only one key/value pair. This way we can reduce the size of the intermediate data between the map phase and the reduce phase. This kind of local reducer is called a combiner and can be used to scale up aggregation.

Data flow analysis

Hadoop consists of two main components: the Hadoop Distributed File System (HDFS), which can be deployed across thousands of machines, and a parallel processing framework implementing MapReduce. Figure 1 illustrates the typical data flow of the source IP packet-counting example; the data flow is the same for calculating the packet size per IP address in a fixed interval of time. Suppose we have a file textfile1 in HDFS which contains some text, and we want to calculate the count of each IP address in the file [13]. We can see that HDFS uses two blocks to store file textfile1.
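The byte-aggregation scenario above can be sketched in plain Python. This is a simulation of the combiner logic, not actual Hadoop API code, and the record layout is an assumption for illustration:

```python
from collections import defaultdict

def map_phase(records):
    """Mapper: emit one (source IP, byte count) pair per packet record."""
    for rec in records:
        yield rec["src_ip"], rec["length"]

def combine(pairs):
    """Combiner: locally sum bytes per source IP before the shuffle, so the
    mapper writes one pair per IP instead of one pair per packet."""
    totals = defaultdict(int)
    for src_ip, nbytes in pairs:
        totals[src_ip] += nbytes
    return dict(totals)

# The 1-byte and 2-byte records from the same source IP collapse into a
# single intermediate pair, as in the scenario above:
records = [{"src_ip": "10.0.0.1", "length": 1},
           {"src_ip": "10.0.0.1", "length": 2}]
print(combine(map_phase(records)))  # {'10.0.0.1': 3}
```

In a real Hadoop job the same aggregation function can usually serve as both the combiner and the reducer, since summing is associative and commutative.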
First the input text file is split between two mappers, where each mapper parses its assigned portion of the text and emits <key (source IP address), value> pairs. The Hadoop framework then shuffles the output of the mappers, sorts it, partitions the result by key, and generates <key, list(values)> pairs. Finally, the two reducers process the <key, list(values)> pairs, i.e., sum up all the values in each list, and output the results to their separate files.

Figure 1: Data flow of the IP counting MapReduce job
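The map, shuffle/sort, and reduce steps of Figure 1 can be simulated end to end in plain Python. The comma-separated record layout and the sample addresses are assumptions for illustration:

```python
from itertools import groupby
from operator import itemgetter

def mapper(line):
    """Emit <source IP, (1 packet, length bytes)> for one flow record."""
    fields = line.split(",")  # assumed order: time,src,dst,proto,length,info
    return fields[1], (1, int(fields[4]))

def reducer(src_ip, values):
    """Sum the packet counts and byte volumes for one source IP."""
    packets = sum(v[0] for v in values)
    nbytes = sum(v[1] for v in values)
    return src_ip, packets, nbytes

lines = ["0.1,10.0.0.1,10.0.0.2,TCP,100,SYN",
         "0.2,10.0.0.3,10.0.0.2,UDP,200,DNS",
         "0.3,10.0.0.1,10.0.0.2,TCP,300,ACK"]

# Shuffle and sort: group the intermediate pairs by key (source IP).
intermediate = sorted((mapper(l) for l in lines), key=itemgetter(0))
results = [reducer(ip, [v for _, v in grp])
           for ip, grp in groupby(intermediate, key=itemgetter(0))]
print(results)  # [('10.0.0.1', 2, 400), ('10.0.0.3', 1, 200)]
```

The `sorted`/`groupby` step stands in for what the Hadoop framework does between the map and reduce phases; only the mapper and reducer would be written by hand.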
We implement only the mapper code and the reducer code; Hadoop takes care of orchestrating the mappers and reducers and sorting the output of the mappers, i.e., the intermediate results, in an efficient way. An interesting note about HDFS is that files in HDFS are stored in blocks of large size: the default HDFS block size is 128 MB. This is because HDFS is designed to handle large files, and for every block of every file, the NameNode, which manages all the metadata of files in HDFS, needs to keep a record.

V. PERFORMANCE EVALUATION

For the performance evaluation of flow analysis with MapReduce, we built a small Hadoop testbed consisting of a master node and four data nodes. Each node has a Core i-series CPU, 4 GB memory, and a 12 GB hard disk. All Hadoop nodes are connected with 1 Gigabit Ethernet cards. With flow tools (Wireshark) we collected NetFlow packets exported for a Gigabit Ethernet link in our campus network, in order to evaluate the flow statistics computation time for large data sets and small data sets. We fed the text flow files to our MapReduce program and compared the flow statistics program on a single node and on multiple nodes. The purpose of the tested programs is to compute the packet count and packet size in a fixed interval of time for each source IP address. To observe the impact of the number of data nodes on the performance of the MapReduce program, we carried out the experiments with 1, 2, 3, and 4 data nodes [1]. After computing the packet count and packet size in a fixed interval of time, we analyzed large data sets and small data sets according to the block size of HDFS; the default HDFS block size is 128 MB. All three performance evaluations are illustrated in the graphs of Figures 2, 3, and 4. A bigger block size can help reduce the workload of the NameNode, and reduce the network load when a distant host requests data from a local host in the cluster for big files.
Block size also affects the number of mappers in a MapReduce (MR) job, because it is the upper bound of the split size that Hadoop uses to split input files into fragments and allocate them to mappers. We evaluated performance on a 1 GB data set with block sizes of 32 MB, 64 MB, 128 MB, 512 MB, and 1024 MB. As the block size increases, the computation time decreases, so we can say a bigger block size takes less time, as shown in the graph.

Figure 2: Computation time according to HDFS block size

We evaluated small data sets (1 GB) in the Hadoop cluster system, processing the data sets on one, two, three, and four nodes. We saw that as the number of DataNodes increases, the computation time also increases, as shown in the graph. We conclude that Hadoop is not useful for small NetFlow data sets.

Figure 3: Computation time of small data sets

We evaluated big data sets (1 GB) in the Hadoop distributed cluster system, processing the data sets on one, two, three, and four nodes. We saw that as the number of DataNodes increases, the computation time decreases, as shown in the graph. We conclude that Hadoop is very useful for big NetFlow data analysis.
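Since the block size bounds the input split size, the number of map tasks for a 1 GB input at the block sizes tested above can be estimated with a back-of-the-envelope calculation. This sketch assumes the split size equals the block size, with no custom input format:

```python
import math

def num_map_tasks(file_size_mb, block_size_mb):
    """One map task per input split: roughly ceil(file size / block size)."""
    return math.ceil(file_size_mb / block_size_mb)

# 1 GB input at the block sizes used in the evaluation above:
for bs in (32, 64, 128, 512, 1024):
    print(f"{bs:>4} MB blocks -> {num_map_tasks(1024, bs):>2} map tasks")
# Fewer, larger blocks mean fewer map tasks and less per-task scheduling
# overhead, consistent with the shorter times observed at larger block sizes.
```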
Figure 4: Computation time for big data sets

Using Hadoop distributed cluster computing, the Big Data computation time is inversely proportional to the number of nodes. All three graphs together show the performance of Hadoop on small data sets and on Big Data sets.

VI. CONCLUSIONS

We have presented a scalable Hadoop-based distributed packet processor that can analyze large packet trace files. Our proposal can easily manage packet trace files of tera- or petabytes, because we employ the MapReduce platform for parallel processing. On the Hadoop system, we evaluated the performance of the MapReduce-based flow analysis method by developing a program that computes the packet count and packet size in a fixed interval of time. We experimented with four DataNodes, with small data sets (1 GB), big data sets (1 GB), and changes in the HDFS block size. From the evaluation results we propose two guidelines: first, use as large a block size as possible in HDFS; second, the Hadoop distributed file system is very useful for big data sets but not for small ones. Finally, we can say Hadoop is most useful for NetFlow data analysis and IP address feature extraction.

References

[1] Big Data Analytics: Advanced Analytics in Oracle Database, Oracle White Paper, March 2013.
[2] Alvaro A. Cárdenas et al., "Big Data Analytics for Security Intelligence," University of Texas at Dallas, Cloud Security Alliance, 2013.
[3]
[4] Paul C. Zikopoulos et al., Understanding Big Data: Analytics for Enterprise Class Hadoop and Streaming Data, New York: Oxford University Press, 2013.
[5] Bingdong Li et al., "A Survey of Network Flow Applications," Journal of Network and Computer Applications, 2012.
[6] Kyu-Seok Shim, Su-Kang Lee et al., "Application Traffic Classification in Hadoop Distributed Computing Environment," Asia-Pacific Network Operation and Management Symposium (APNOMS), Korea, 2014.
[7] JongSuk R. Lee, Sang-Kug Ye et al., "Detecting Anomaly Teletraffic Using Stochastic Self-similarity Based on Hadoop," 16th International Conference on Network-Based Information Systems, Korea, 2013.
[8] Romain Fontugne, Johan Mazel et al., "Hashdoop: A MapReduce Framework for Network Anomaly Detection," IEEE INFOCOM Workshop on Security and Privacy in Big Data, Tokyo, Japan, 2014.
[9] Rana Khattak, Shehar Bano et al., "DOFUR: DDoS Forensics Using MapReduce," IEEE, 2011.
[10] Youngseok Lee, Wonchul Kang et al., "An Internet Traffic Analysis Method with MapReduce," Chungnam National University, Springer-Verlag Berlin Heidelberg, 2011.
[11] Ravi Sharma, "Study of Latest Emerging Trends on Cyber Security and its challenges to Society," International Journal of Scientific & Engineering Research, Volume 3, Issue 6, June 2012.
[12] Yeonhee Lee, Wonchul Kang et al., "A Hadoop-Based Packet Trace Processing Tool," Chungnam National University, Springer-Verlag Berlin Heidelberg, 2011.
[13] Jan Tore Morken et al., "Distributed NetFlow Processing Using the Map-Reduce Model," NTNU, Norway.
[14] Zeng Shan et al., "Network Traffic Analysis using HADOOP Architecture," ISGC 2013, Taipei.
Exploring Netflow Data using Hadoop X. Zhou 1, M. Petrovic 2,T. Eskridge 3,M. Carvalho 4,X. Tao 5 Xiaofeng Zhou, CISE Department, University of Florida Milenko Petrovic, Florida Institute for Human and
More informationINTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY
INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY A PATH FOR HORIZING YOUR INNOVATIVE WORK A COMPREHENSIVE VIEW OF HADOOP ER. AMRINDER KAUR Assistant Professor, Department
More informationMobile Cloud Computing for Data-Intensive Applications
Mobile Cloud Computing for Data-Intensive Applications Senior Thesis Final Report Vincent Teo, vct@andrew.cmu.edu Advisor: Professor Priya Narasimhan, priya@cs.cmu.edu Abstract The computational and storage
More informationBig Data With Hadoop
With Saurabh Singh singh.903@osu.edu The Ohio State University February 11, 2016 Overview 1 2 3 Requirements Ecosystem Resilient Distributed Datasets (RDDs) Example Code vs Mapreduce 4 5 Source: [Tutorials
More informationInternational Journal of Innovative Research in Computer and Communication Engineering
FP Tree Algorithm and Approaches in Big Data T.Rathika 1, J.Senthil Murugan 2 Assistant Professor, Department of CSE, SRM University, Ramapuram Campus, Chennai, Tamil Nadu,India 1 Assistant Professor,
More informationOpen source large scale distributed data management with Google s MapReduce and Bigtable
Open source large scale distributed data management with Google s MapReduce and Bigtable Ioannis Konstantinou Email: ikons@cslab.ece.ntua.gr Web: http://www.cslab.ntua.gr/~ikons Computing Systems Laboratory
More informationA Multilevel Secure MapReduce Framework for Cross-Domain Information Sharing in the Cloud
A Multilevel Secure MapReduce Framework for Cross-Domain Information Sharing in the Cloud Thuy D. Nguyen, Cynthia E. Irvine, Jean Khosalim Department of Computer Science Ground System Architectures Workshop
More informationComparative analysis of mapreduce job by keeping data constant and varying cluster size technique
Comparative analysis of mapreduce job by keeping data constant and varying cluster size technique Mahesh Maurya a, Sunita Mahajan b * a Research Scholar, JJT University, MPSTME, Mumbai, India,maheshkmaurya@yahoo.co.in
More informationNetwork Traffic Analysis using HADOOP Architecture. Zeng Shan ISGC2013, Taibei zengshan@ihep.ac.cn
Network Traffic Analysis using HADOOP Architecture Zeng Shan ISGC2013, Taibei zengshan@ihep.ac.cn Flow VS Packet what are netflows? Outlines Flow tools used in the system nprobe nfdump Introduction to
More informationJournal of science STUDY ON REPLICA MANAGEMENT AND HIGH AVAILABILITY IN HADOOP DISTRIBUTED FILE SYSTEM (HDFS)
Journal of science e ISSN 2277-3290 Print ISSN 2277-3282 Information Technology www.journalofscience.net STUDY ON REPLICA MANAGEMENT AND HIGH AVAILABILITY IN HADOOP DISTRIBUTED FILE SYSTEM (HDFS) S. Chandra
More informationMapReduce and Hadoop. Aaron Birkland Cornell Center for Advanced Computing. January 2012
MapReduce and Hadoop Aaron Birkland Cornell Center for Advanced Computing January 2012 Motivation Simple programming model for Big Data Distributed, parallel but hides this Established success at petabyte
More informationHadoop & its Usage at Facebook
Hadoop & its Usage at Facebook Dhruba Borthakur Project Lead, Hadoop Distributed File System dhruba@apache.org Presented at the The Israeli Association of Grid Technologies July 15, 2009 Outline Architecture
More informationA very short Intro to Hadoop
4 Overview A very short Intro to Hadoop photo by: exfordy, flickr 5 How to Crunch a Petabyte? Lots of disks, spinning all the time Redundancy, since disks die Lots of CPU cores, working all the time Retry,
More informationEnhancing MapReduce Functionality for Optimizing Workloads on Data Centers
Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology IJCSMC, Vol. 2, Issue. 10, October 2013,
More informationConcept and Project Objectives
3.1 Publishable summary Concept and Project Objectives Proactive and dynamic QoS management, network intrusion detection and early detection of network congestion problems among other applications in the
More informationHadoop Technology for Flow Analysis of the Internet Traffic
Hadoop Technology for Flow Analysis of the Internet Traffic Rakshitha Kiran P PG Scholar, Dept. of C.S, Shree Devi Institute of Technology, Mangalore, Karnataka, India ABSTRACT: Flow analysis of the internet
More informationEnhancing Dataset Processing in Hadoop YARN Performance for Big Data Applications
Enhancing Dataset Processing in Hadoop YARN Performance for Big Data Applications Ahmed Abdulhakim Al-Absi, Dae-Ki Kang and Myong-Jong Kim Abstract In Hadoop MapReduce distributed file system, as the input
More informationBig Data with Rough Set Using Map- Reduce
Big Data with Rough Set Using Map- Reduce Mr.G.Lenin 1, Mr. A. Raj Ganesh 2, Mr. S. Vanarasan 3 Assistant Professor, Department of CSE, Podhigai College of Engineering & Technology, Tirupattur, Tamilnadu,
More informationHadoop MapReduce and Spark. Giorgio Pedrazzi, CINECA-SCAI School of Data Analytics and Visualisation Milan, 10/06/2015
Hadoop MapReduce and Spark Giorgio Pedrazzi, CINECA-SCAI School of Data Analytics and Visualisation Milan, 10/06/2015 Outline Hadoop Hadoop Import data on Hadoop Spark Spark features Scala MLlib MLlib
More informationBig Application Execution on Cloud using Hadoop Distributed File System
Big Application Execution on Cloud using Hadoop Distributed File System Ashkan Vates*, Upendra, Muwafaq Rahi Ali RPIIT Campus, Bastara Karnal, Haryana, India ---------------------------------------------------------------------***---------------------------------------------------------------------
More informationDistributed Computing and Big Data: Hadoop and MapReduce
Distributed Computing and Big Data: Hadoop and MapReduce Bill Keenan, Director Terry Heinze, Architect Thomson Reuters Research & Development Agenda R&D Overview Hadoop and MapReduce Overview Use Case:
More informationMining Large Datasets: Case of Mining Graph Data in the Cloud
Mining Large Datasets: Case of Mining Graph Data in the Cloud Sabeur Aridhi PhD in Computer Science with Laurent d Orazio, Mondher Maddouri and Engelbert Mephu Nguifo 16/05/2014 Sabeur Aridhi Mining Large
More informationParallel Data Mining and Assurance Service Model Using Hadoop in Cloud
Parallel Data Mining and Assurance Service Model Using Hadoop in Cloud Aditya Jadhav, Mahesh Kukreja E-mail: aditya.jadhav27@gmail.com & mr_mahesh_in@yahoo.co.in Abstract : In the information industry,
More informationPacket Flow Analysis and Congestion Control of Big Data by Hadoop
Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology IJCSMC, Vol. 4, Issue. 6, June 2015, pg.456
More informationPerformance Analysis of Book Recommendation System on Hadoop Platform
Performance Analysis of Book Recommendation System on Hadoop Platform Sugandha Bhatia #1, Surbhi Sehgal #2, Seema Sharma #3 Department of Computer Science & Engineering, Amity School of Engineering & Technology,
More informationOracle s Big Data solutions. Roger Wullschleger. <Insert Picture Here>
s Big Data solutions Roger Wullschleger DBTA Workshop on Big Data, Cloud Data Management and NoSQL 10. October 2012, Stade de Suisse, Berne 1 The following is intended to outline
More informationISSN: 2320-1363 CONTEXTUAL ADVERTISEMENT MINING BASED ON BIG DATA ANALYTICS
CONTEXTUAL ADVERTISEMENT MINING BASED ON BIG DATA ANALYTICS A.Divya *1, A.M.Saravanan *2, I. Anette Regina *3 MPhil, Research Scholar, Muthurangam Govt. Arts College, Vellore, Tamilnadu, India Assistant
More informationRole of Cloud Computing in Big Data Analytics Using MapReduce Component of Hadoop
Role of Cloud Computing in Big Data Analytics Using MapReduce Component of Hadoop Kanchan A. Khedikar Department of Computer Science & Engineering Walchand Institute of Technoloy, Solapur, Maharashtra,
More informationImproving Job Scheduling in Hadoop
Improving Job Scheduling in Hadoop MapReduce Himangi G. Patel, Richard Sonaliya Computer Engineering, Silver Oak College of Engineering and Technology, Ahmedabad, Gujarat, India. Abstract Hadoop is a framework
More informationHadoop & its Usage at Facebook
Hadoop & its Usage at Facebook Dhruba Borthakur Project Lead, Hadoop Distributed File System dhruba@apache.org Presented at the Storage Developer Conference, Santa Clara September 15, 2009 Outline Introduction
More informationThis exam contains 13 pages (including this cover page) and 18 questions. Check to see if any pages are missing.
Big Data Processing 2013-2014 Q2 April 7, 2014 (Resit) Lecturer: Claudia Hauff Time Limit: 180 Minutes Name: Answer the questions in the spaces provided on this exam. If you run out of room for an answer,
More informationTake An Internal Look at Hadoop. Hairong Kuang Grid Team, Yahoo! Inc hairong@yahoo-inc.com
Take An Internal Look at Hadoop Hairong Kuang Grid Team, Yahoo! Inc hairong@yahoo-inc.com What s Hadoop Framework for running applications on large clusters of commodity hardware Scale: petabytes of data
More informationThe Recovery System for Hadoop Cluster
The Recovery System for Hadoop Cluster Prof. Priya Deshpande Dept. of Information Technology MIT College of engineering Pune, India priyardeshpande@gmail.com Darshan Bora Dept. of Information Technology
More informationFrom GWS to MapReduce: Google s Cloud Technology in the Early Days
Large-Scale Distributed Systems From GWS to MapReduce: Google s Cloud Technology in the Early Days Part II: MapReduce in a Datacenter COMP6511A Spring 2014 HKUST Lin Gu lingu@ieee.org MapReduce/Hadoop
More informationHadoop and Map-reduce computing
Hadoop and Map-reduce computing 1 Introduction This activity contains a great deal of background information and detailed instructions so that you can refer to it later for further activities and homework.
More informationCyber Forensic for Hadoop based Cloud System
Cyber Forensic for Hadoop based Cloud System ChaeHo Cho 1, SungHo Chin 2 and * Kwang Sik Chung 3 1 Korea National Open University graduate school Dept. of Computer Science 2 LG Electronics CTO Division
More informationRecognization of Satellite Images of Large Scale Data Based On Map- Reduce Framework
Recognization of Satellite Images of Large Scale Data Based On Map- Reduce Framework Vidya Dhondiba Jadhav, Harshada Jayant Nazirkar, Sneha Manik Idekar Dept. of Information Technology, JSPM s BSIOTR (W),
More informationLimitations of Packet Measurement
Limitations of Packet Measurement Collect and process less information: Only collect packet headers, not payload Ignore single packets (aggregate) Ignore some packets (sampling) Make collection and processing
More informationProcessing of Hadoop using Highly Available NameNode
Processing of Hadoop using Highly Available NameNode 1 Akash Deshpande, 2 Shrikant Badwaik, 3 Sailee Nalawade, 4 Anjali Bote, 5 Prof. S. P. Kosbatwar Department of computer Engineering Smt. Kashibai Navale
More informationA Survey on Internet Traffic Measurement and Analysis
A Survey on Internet Traffic Measurement and Analysis Parekh Nilaykumar B. Department of CS&E, Governmernt Engineering Collage,Modasa, Aravalli,Gujarat,India Tel: +91 9913885777 E-mailnilaybparekh@yahoo.in
More informationIntroduction to Hadoop
1 What is Hadoop? Introduction to Hadoop We are living in an era where large volumes of data are available and the problem is to extract meaning from the data avalanche. The goal of the software tools
More information