Shrida Girme 1, C.A.Laulkar 2 1,2 Sinhgad College of Engineering, Pune


Clustering Algorithm for Log File Analysis on Top of Hadoop

Shrida Girme 1, C. A. Laulkar 2
1,2 Sinhgad College of Engineering, Pune

Abstract

Software developers usually instrument their code to track every crucial step in the business logic. Using the logging module, a system administrator constantly watches for fatal errors and warnings by searching through substantial log files. Log files are usually stored as simple text files on the server or at a remote site. However, with the increasing scale and complexity of business systems, log files keep growing, and the logs generated often reach terabytes or petabytes. Processing such huge log files with legacy methods is time consuming: analyzing system logs on a standalone machine is inefficient, and it can take a whole day to analyze gigabytes of logs. The main idea of this paper is distributed log file processing using the MapReduce programming methodology together with data mining algorithms such as Apriori and K-means clustering. Hadoop implements the MapReduce programming model described in Google's research paper; this model divides a task across distributed nodes, collects the output from multiple peers, and integrates it. Hence, this project introduces an integrated methodology for analyzing huge log files using the MapReduce programming model on top of Hadoop. The data mining algorithms help analyze the logs to understand user behavior.

Index Terms: Log files, MapReduce, Hadoop, Apriori algorithm, K-means.

I. INTRODUCTION

Development, maintenance, and performance refinement are very important in large-scale distributed systems. They can be supported by parsing web logs to retrieve important information for analysis. Analyzing the web logs produced by distributed web servers is a good way to troubleshoot and diagnose problems.
Nowadays, however, the increasing scale and complexity of distributed systems make the logs very large, so analyzing large web server logs on a single node with common methods becomes inefficient. Hence, there is a demand for a distributed method of parsing and analyzing web logs. In this paper, a MapReduce-based framework is implemented to analyze web logs and produce the statistics required by the system administrator, such as aggregate data about which pages of a particular website are popular among users. The framework is built on top of Hadoop, an open source distributed file system and MapReduce implementation. The system first uses the K-means clustering algorithm to integrate the collected logs. A MapReduce-based algorithm is then applied to parse these clustered log files. Finally, to make the best use of the collected data, the monitoring and analysis results are displayed in a flexible and powerful way. The project thus helps answer the range of queries a system administrator may ask, so that the data mining answers can be used to enhance decision support systems and to implement business intelligence in the organization to increase revenue.

II. RELATED WORK

A lot of work has been carried out on log file analysis to date, and many log file analyzers are available, but they all work on a standalone system. We propose a log file analysis system that supports a distributed environment. Wichian Premchaiswadi proposed a novel and efficient web log mining model for clustering web users [1]. Zhang, using a Clickstream Data Warehouse (CDW), proposed a hybrid data extraction strategy based on time characteristics in the server layer [2].
Hu and Zhong proposed an effective method that uses Web farming to collect clickstream logs with a servlet filter, and extended the common Web log standard to model the clickstream log format [3]. Antonellis extended the model and specialized in the use of access log data in the form of clickstreams. They used an online component store that summarized data in temporary memory, where the web log data was merged, rather than using a storage system [5]. Hofgesang and Kowalczyk used standard statistical and data mining techniques for exploratory data analysis and anomaly detection, with the intention of applying finite mixtures of multinomial distributions to the session data to extract user profiles for further analysis [6]. Banerjee and Ghosh introduced the idea of concept-based paths, so that paths through conceptually similar pages have a non-null intersection even without any direct webpage overlap [7].

In our paper we extend the concept of log file analysis to use data mining algorithms based on the MapReduce programming model, which not only increases the usefulness of the system by generating data mining results but also reduces the processing time for huge log files. This section presents three core topics related to our work: the Hadoop MapReduce framework, the K-means algorithm, and the Apriori algorithm for log file analysis.

A. Hadoop MapReduce Framework

Fig. 1 Hadoop Distributed File System Architecture Overview (Courtesy: The Hadoop Distributed File System: Architecture and Design)

As Figure 1 illustrates, a Hadoop cluster on the Hadoop Distributed File System (HDFS) has a master/slave architecture that consists of several components with specific roles:

1) Name Node: an HDFS cluster has a single Name Node, which manages the file system namespace and regulates access to files by clients.
2) Data Nodes: each slave machine in the cluster hosts a Data Node daemon that performs the grunt work of the distributed file system, both reading and writing.
3) Secondary Name Node (SNN): an assistant daemon for monitoring the state of the HDFS cluster.
4) Job Tracker: the Job Tracker daemon is the liaison between the application and Hadoop.
5) Task Tracker: manages the execution of individual tasks on each slave node.

MapReduce is a programming model for processing large-scale datasets on computer clusters. The model consists of two functions, map() and reduce(), with the following signatures:

map(k1, v1) → list(k2, v2)
reduce(k2, list(v2)) → list(v2)

Users implement their own processing logic by specifying customized map() and reduce() functions. The map() function takes an input key/value pair and produces a list of intermediate key/value pairs.
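The map() and reduce() signatures above can be illustrated with a minimal single-process sketch. It is not Hadoop code: `run_mapreduce`, `map_status`, and `reduce_sum` are hypothetical names, the grouping loop merely stands in for the framework, and the record layout (status code as the last field) is an assumption made for the example.

```python
from collections import defaultdict

def run_mapreduce(records, map_fn, reduce_fn):
    """Single-process stand-in for the framework: map, group by key, reduce."""
    groups = defaultdict(list)
    for k1, v1 in records:
        # map(k1, v1) -> list(k2, v2)
        for k2, v2 in map_fn(k1, v1):
            groups[k2].append(v2)
    results = []
    for k2 in sorted(groups):
        # reduce(k2, list(v2)) -> list(v2)
        for out in reduce_fn(k2, groups[k2]):
            results.append((k2, out))
    return results

def map_status(offset, line):
    # Assumed layout: the last whitespace-separated field is the status code.
    fields = line.split()
    if fields:
        yield fields[-1], 1

def reduce_sum(key, values):
    yield sum(values)

# Made-up input records keyed by line offset.
records = [
    (0, "GET /support.html 200"),
    (1, "POST /aboutme.html 200"),
    (2, "GET /missing.html 404"),
]
print(run_mapreduce(records, map_status, reduce_sum))
# [('200', 2), ('404', 1)]
```

In a real cluster the grouping step is carried out by the framework across machines; only the two user functions would be supplied by the application.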
The fixed ("frozen") part of MapReduce is a large distributed sort, around which six functions are arranged: 1) an input reader, 2) a Map function, 3) a partition function, 4) a compare function, 5) a Reduce function, and 6) an output writer, as shown in Figure 2.

Fig 2 Map Phase and Reduce Phase in HDFS (Courtesy: The Hadoop Distributed File System: Architecture and Design)

As illustrated in Figure 2, the input reader divides the input into appropriately sized splits (in practice typically 16 MB to 128 MB), and the framework assigns one split to each Map function. The input reader reads data from stable storage (typically a distributed file system) and generates key/value pairs. Each Map function takes a series of key/value pairs, processes each, and generates zero or more output key/value pairs. Each Map function output is allocated to a particular reducer by the application's partition function for sharding purposes: the partition function is given the key and the number of reducers and returns the index of the desired reducer. Between the map and reduce stages, the data is shuffled in order to move it from the map node that produced it to the shard in which it will be reduced. The shuffle can sometimes take longer than the computation itself, depending on network bandwidth, CPU speeds, the amount of data produced, and the time taken by the map and reduce computations. The input for each Reduce is pulled from the machine where the Map ran and sorted using the application's comparison function. The framework calls the application's Reduce function once for each unique key, in sorted order. The Reduce can iterate through the values associated with that key and output zero or more values. Finally, the output writer writes the output of the Reduce to stable storage, usually a distributed file system.

B. K-Means Clustering Algorithm

K-means works on objects that can be represented in an n-dimensional vector space and for which a distance measure is defined. The algorithm is as follows:

Initialize k clusters
Until converged:
  Compute the probability of a point belonging to a cluster for every <point, cluster> pair
  Recompute the cluster centers using the above probability membership values of points to clusters

The objective function is

$$J = \sum_{j=1}^{k} \sum_{i=1}^{n} \left\| x_i^{(j)} - c_j \right\|^2$$

where $\| x_i^{(j)} - c_j \|^2$ is a chosen distance measure between a data point $x_i^{(j)}$ and the cluster centre $c_j$, and $J$ is an indicator of the distance of the n data points from their respective cluster centers. The K-means clustering algorithm only needs to perform a distance calculation.
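The iteration and objective above can be sketched in a few lines. This is a minimal sketch under stated assumptions, not the paper's implementation: it uses the common hard-assignment variant (each point belongs to its nearest centre with membership 0 or 1 rather than a probability), Euclidean distance, and caller-supplied initial centres. The function names and sample points are hypothetical.

```python
import math

def kmeans(points, centers, max_iter=100):
    """Hard-assignment K-means: assign to nearest centre, then recompute centres."""
    centers = [tuple(c) for c in centers]
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster of its nearest centre.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda j: math.dist(p, centers[j]))
            clusters[nearest].append(p)
        # Update step: each centre becomes the mean of its members
        # (an empty cluster keeps its previous centre).
        new_centers = [
            tuple(sum(coord) / len(members) for coord in zip(*members)) if members else centers[j]
            for j, members in enumerate(clusters)
        ]
        if new_centers == centers:  # converged: centres stopped moving
            break
        centers = new_centers
    return centers, clusters

def objective(centers, clusters):
    """J: sum of squared distances from each point to its cluster centre."""
    return sum(math.dist(p, c) ** 2 for c, members in zip(centers, clusters) for p in members)

# Made-up 2-D points and initial centres.
points = [(1.0, 1.0), (1.0, 2.0), (8.0, 8.0), (9.0, 8.0)]
centers, clusters = kmeans(points, centers=[(1.0, 1.0), (9.0, 8.0)])
print(centers)                        # [(1.0, 1.5), (8.5, 8.0)]
print(objective(centers, clusters))   # 1.0
```

Each of the four points ends up 0.5 away from its centre, so J = 4 × 0.25 = 1.0.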

K-Means MapReduce design

Driver
1. Runs multiple iteration jobs using mapper + combiner + reducer
2. Runs the final clustering job using only the mapper

Mapper
1. Configure: single file containing encoded clusters
2. Input: file split containing encoded vectors
3. Output: vectors keyed by nearest cluster

Combiner
1. Input: vectors keyed by nearest cluster
2. Output: cluster centroid vectors keyed by cluster

Reducer
1. Input: cluster centroid vectors
2. Output: single file containing vectors keyed by cluster

C. Apriori Algorithm

The Apriori algorithm is closely associated with the market basket problem. It is a data mining algorithm for finding frequent itemsets. Apriori was selected for its performance: it can run the mining process in a short time. Apriori is currently in common use for generating association rules in web usage mining. It is the best-known association rule algorithm and relies on the large itemset property, which states that any subset of a large itemset is also large.

Association analysis: used for discovering relevant and useful relations between variables in large databases.

Definition of support: Equation (1) defines the support S of an item as the ratio of the number of transactions that contain the item to the total number of transactions in the database, expressed as a percentage. Equation (2) is a simpler version that defines the support of an item as the number of times the item occurs across all transactions.

$$\mathrm{Support}(S) = \frac{\text{number of transactions in which the item is present}}{\text{total number of transactions}} \times 100 \qquad (1)$$

$$\mathrm{Support}(S) = \text{number of occurrences of the item in all transactions} \qquad (2)$$

Definition of large itemset: when the frequency of occurrence of a specific pattern is above the threshold S, the itemset is called a large itemset (L).
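The support definition in Equation (1) and the large itemset property can be sketched as a small level-wise search. This is a minimal illustration, not the paper's implementation: the function names are hypothetical, the sessions are made-up sets of visited pages standing in for transactions, and a threshold of 50 percent is assumed.

```python
from itertools import combinations

def support(itemset, transactions):
    """Equation (1): percentage of transactions that contain every item in itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return 100.0 * hits / len(transactions)

def apriori(transactions, min_support):
    """Level-wise search: grow itemsets one item at a time, keeping only large ones."""
    items = sorted({i for t in transactions for i in t})
    level = [frozenset([i]) for i in items
             if support(frozenset([i]), transactions) >= min_support]
    frequent = {}
    k = 1
    while level:
        for s in level:
            frequent[s] = support(s, transactions)
        k += 1
        prev = set(level)
        # Join step: merge pairs of large (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be large.
        level = [c for c in candidates
                 if all(frozenset(sub) in prev for sub in combinations(c, k - 1))
                 and support(c, transactions) >= min_support]
    return frequent

# Made-up sessions: each set holds the pages one visitor requested.
sessions = [
    {"/index.html", "/support.html"},
    {"/index.html", "/support.html", "/aboutme.html"},
    {"/index.html", "/aboutme.html"},
    {"/support.html"},
]
frequent = apriori(sessions, min_support=50.0)
for itemset, s in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), s)
```

With these sessions the pair of /support.html and /aboutme.html appears in only one of four sessions (25 percent), so it is not large, and the three-page candidate is pruned without counting because one of its subsets is not large.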
III. MATHEMATICAL MODEL

Consider N log files $L_1, L_2, L_3, \ldots, L_N$, each of size 100 MB to 2 TB. Each file i contains a number of records, so the total number of records is

$$LR = \sum_{i=1}^{N} R_i$$

where $R_i$ is the last record index of file i. The goal of the mathematical model is to compare the total cost of evaluating these log records on a single standalone machine with the cost on a distributed (Hadoop) cluster.

Part 1: Let c be the cost of processing each record independently on a machine or commodity hardware. The cost of processing LR records on a single commodity machine is then given by [equation missing in source], where p is a processing factor that depends on the configuration of the commodity hardware. From this equation, the cost of processing on a single machine is a multiple of the number of records, which is quite heavy.

Part 2: Consider a Hadoop cluster on HDFS with parallel processing nodes $\{HN_1, HN_2, \ldots, HN_x\}$ and a master node MHN. The log files are stored in chunks $CH_1, CH_2, \ldots, CH_x$, where x depends on the maximum chunk size on each $HN_x$ and on MHN. These chunks are distributed over the nodes $HN_x$ according to the balancing condition of the cluster.

Part 3: Applying the data mining algorithm to the various chunks of log files.

Mapper: [equation missing in source], where S is the set of <key, value> pairs generated on each HN, and DMQ is the condition for the selection operator, which extracts part of the string based on the user's data mining query.

Reducer: [equation missing in source], where S is the set of <key, value> pairs generated by the mapper based on the DMQ, and the result is a set of <key, value> pairs carrying the information that belongs to the DMQ given by the user.

Part 4: Calculation of the cost of evaluating the log records (LR) on a Hadoop cluster with nodes $HN_1, HN_2, HN_3, \ldots, HN_x$, on which each mapper and reducer equation is implemented based on the user's DMQ. Consider:

Mappers: $MR_1, \ldots, MR_N$
Reducers: $RR_1, \ldots, RR_N$
Execution cost of mapper processes (CMR) = [equation missing in source]
Execution cost of reducer processes (CRR) = [equation missing in source]

where the terms are the cost of evaluating the records in $CH_x$ (each chunk) at $HN_i$ and the cost of evaluating the <key, value> pairs generated by the mappers.

Total cost of processing the log files on the Hadoop cluster = CMR + CRR.

IV. DYNAMIC PROGRAMMING AND SERIALIZATION

The log file analysis system is designed to run on the open source Apache Hadoop Distributed File System. Hadoop runs in a scale-out, shared-nothing fashion across many commodity servers and employs a divide-and-conquer methodology for tackling big data use cases.
Hadoop is a framework for running applications on large clusters of commodity hardware using a computational paradigm named map/reduce, in which the application is divided into many small fragments of work, each of which may be executed on any node in the cluster. Hadoop provides a storage layer that holds vast amounts of data and an execution layer for running an application in parallel across the cluster against parts of the stored data. The storage layer looks like a single storage volume that has been optimized for many concurrent serialized reads of large data files, where "large" ranges from gigabytes to petabytes. However, it supports only a single writer, and efficient random access to the data is not really possible. These restrictions are what make it performant and reliable: reliable in part because they allow the data to be replicated across the cluster, reducing the chance of data loss. The execution layer relies on a divide-and-conquer strategy called MapReduce.

Fig 3: Divide and Conquer Strategy in Map Reduce Programming (a MapReduce job with a Data Mining Query (DMQ) is passed by the Job Tracker to mappers (MR) and reducers (RR), producing data mining information output based on the DMQ)

V. DATA INDEPENDENCE AND DATA FLOW ARCHITECTURE

The dataflow architecture of the proposed system is as follows. The N log files are taken as input by the Hadoop Distributed File System. The Job Tracker keeps track of all jobs submitted to the cluster. The MapReduce programs, along with the data mining algorithms, are fed to the MapReduce clients. The final computation results are displayed through a purpose-built portal. The map function processes logs of web page requests and outputs <URL, 1>. The reduce function adds together all values for the same URL and emits a <URL, total count> pair.

Fig 4: Data Flow Architecture for Log File Analysis Summarization and Graphical Representation on Web Browser (log files 1 to N, HDFS, Job Tracker, MapReduce client with MR programs and data mining algorithms, outputs 1 to N)
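The <URL, 1> map step and the summing reduce step described in this section can be sketched as follows. This is an illustrative single-process sketch rather than Hadoop code: the function names are hypothetical, the regular expression assumes the request is logged as a quoted "METHOD URL HTTP/version" field, and the sample lines (including their IP addresses, status codes, and byte counts) are made up.

```python
import re
from collections import defaultdict

# Assumed log layout: the request appears as a quoted field such as
# "GET /support.html HTTP/1.1".
REQUEST = re.compile(r'"(?:GET|POST|HEAD|PUT|DELETE) (\S+) HTTP/[\d.]+"')

def map_url(line):
    """Map: emit <URL, 1> for every parsable request line."""
    match = REQUEST.search(line)
    if match:
        yield match.group(1), 1

def reduce_count(url, counts):
    """Reduce: add together all values for the same URL."""
    return url, sum(counts)

def count_hits(lines):
    grouped = defaultdict(list)  # stands in for the shuffle between map and reduce
    for line in lines:
        for url, one in map_url(line):
            grouped[url].append(one)
    return dict(reduce_count(url, counts) for url, counts in grouped.items())

sample = [
    '10.0.0.1 - - [11/April/2012:05:17:00] "GET /support.html HTTP/1.1" 200 512',
    '10.0.0.2 - - [12/April/2012:06:17:00] "POST /aboutme.html HTTP/1.1" 200 128',
    '10.0.0.3 - - [12/April/2012:06:17:00] "GET /support.html HTTP/1.1" 404 0',
    'malformed line without a request field',
]
hits = count_hits(sample)
print(hits)  # {'/support.html': 2, '/aboutme.html': 1}
```

Lines that do not match the assumed layout are skipped by the mapper instead of failing the job, which is one reasonable choice for noisy production logs.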

VI. RESULTS AND DISCUSSIONS

In this project the results are categorized into four sections: daily activity, the most popular web pages, hourly activity, and the activity in each clustering group. Figure 5 shows a sample of the log file used for the analysis. It includes parameters such as IP address, timestamp, method, protocol, operating system, bytes transferred, status code, and browser. From these parameters it is easy to answer a query given by the user.

[11/April/2012:05:17: ] "GET /support.html HTTP/1.1" " "Chrome/5.0 (MacOS; U; MacOS 10.2; en-us; rv: ) Gecko/ Chrome/5.0"
[12/April/2012:06:17: ] "POST /aboutme.html HTTP/1.1" " "Mozilla/5.0 (Ubuntu; U; Ubuntu 12.04; en-us; rv: ) Gecko/ Firefox/ "
[12/April/2012:06:17: ] "GET /support.html HTTP/1.1" " "Mozilla/5.0 (Windows; U; Windows NT 5.2; en-us; rv: ) Gecko/ Firefox/ "

Fig 5. A typical weblog file sample, generated using a Java program

Fig 6. Graphical Representation of Log Analysis (hits, page views, and visitors for each day of the week, Monday to Sunday)

From Figure 6 it can be concluded that the number of hits and page views was highest on Friday. These figures were obtained by applying the data mining algorithm to the given log file.

Fig 7: Top 4 most popular web pages (hits and visitors)

As Figure 7 illustrates, the most visited link had 500 visitors and 467 hits. In this way, a number of parameters can be analyzed using the log files generated for a particular web server.

The experimental setup for log file processing is to be built on the Hadoop platform. Our machine specification is as follows: an Intel X58 Express processor, 4 MB cache, 4 GB (2 × 2 GB) DDR-1333 ECC unbuffered RAM, one CPU, and 7200 rpm SATA hard disks. We use Hadoop running on Java for the planned experiment.

VII. CONCLUSION

In this paper we aim to investigate the behavior of visitors to a particular website. For experimental purposes we created our own log file, which is analyzed to produce a report that aggregates data about which pages visitors visit, in what order, and as the result of which succession of mouse clicks. Moreover, we applied the Hadoop MapReduce framework to make the response time as short as possible. We also applied data mining algorithms, namely K-means clustering and Apriori, to make the results obtained more useful. Based on the results of the study, we can conclude which pages are most popular among visitors and make the corresponding changes in web filtration. In future work, we will combine this feature with a Bayesian network in order to predict user behavior, and compare the system with commercial software such as Google Analytics, the Piwik open source web analytics platform, and Yahoo web analytics.

REFERENCES

[1] W. Premchaiswadi and W. Romsaiyud, "Extracting weblog of Siam University for learning user behavior on MapReduce," International Conference on Intelligent and Advanced Systems (ICIAS), vol. 1, June 2012.
[2] Li-na Zhang, Jie Liu and Yue Zhang, "Data Mixed-extraction Strategy based on the Time Characteristics in CDW," First International Conference on Pervasive Computing, Signal Processing and Applications.
[3] Jia Hu and Ning Zhong, "Clickstream Log Acquisition with Web Farming," IEEE Int'l Conf. on Web Intelligence (WI'05).
[4] R. O. Duda, P. E. Hart et al., Pattern Classification, 2nd edition, John Wiley and Sons Inc, Singapore, 2001.
[5] Panagiotis Antonellis, Christos Makris and Nikos Tsirakis, "Algorithms for clustering clickstream data," Information Processing Letters, Vol. 109, Issue 8, March 2009.
[6] Peter I. Hofgesang and Wojtek Kowalczyk, "Analyzing Clickstream Data: From Anomaly Detection to Visitor Profiling," ECML/PKDD Discovery Challenge 2005.
[7] Arindam Banerjee and Joydeep Ghosh, "Clickstream clustering using Weighted Longest Common Subsequences," Web Mining Workshop at the 1st SIAM Conference on Data Mining, Chicago.
[8] Veeramalai, N. Jaisankar, A. Kannan, "Efficient Web Log Mining Using Enhanced Apriori Algorithm with Hash Tree and Fuzzy," IJCSIT, Vol. 2, No. 4, August.
[9] J. Dean and S. Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters," USENIX OSDI, December.
[10] K. Shvachko, Hairong Kuang, S. Radia and R. Chansler, "The Hadoop Distributed File System," IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST), pp. 1-10, 3-7 May 2010.
[11] S. Ghemawat, H. Gobioff, S. Leung, "The Google file system," in Proc. of ACM Symposium on Operating Systems Principles, Lake George, NY, Oct.


More information

Open source large scale distributed data management with Google s MapReduce and Bigtable

Open source large scale distributed data management with Google s MapReduce and Bigtable Open source large scale distributed data management with Google s MapReduce and Bigtable Ioannis Konstantinou Email: ikons@cslab.ece.ntua.gr Web: http://www.cslab.ntua.gr/~ikons Computing Systems Laboratory

More information

Big Data and Apache Hadoop s MapReduce

Big Data and Apache Hadoop s MapReduce Big Data and Apache Hadoop s MapReduce Michael Hahsler Computer Science and Engineering Southern Methodist University January 23, 2012 Michael Hahsler (SMU/CSE) Hadoop/MapReduce January 23, 2012 1 / 23

More information

MapReduce and Hadoop. Aaron Birkland Cornell Center for Advanced Computing. January 2012

MapReduce and Hadoop. Aaron Birkland Cornell Center for Advanced Computing. January 2012 MapReduce and Hadoop Aaron Birkland Cornell Center for Advanced Computing January 2012 Motivation Simple programming model for Big Data Distributed, parallel but hides this Established success at petabyte

More information

Department of Computer Science University of Cyprus EPL646 Advanced Topics in Databases. Lecture 15

Department of Computer Science University of Cyprus EPL646 Advanced Topics in Databases. Lecture 15 Department of Computer Science University of Cyprus EPL646 Advanced Topics in Databases Lecture 15 Big Data Management V (Big-data Analytics / Map-Reduce) Chapter 16 and 19: Abideboul et. Al. Demetris

More information

Parallel Databases. Parallel Architectures. Parallelism Terminology 1/4/2015. Increase performance by performing operations in parallel

Parallel Databases. Parallel Architectures. Parallelism Terminology 1/4/2015. Increase performance by performing operations in parallel Parallel Databases Increase performance by performing operations in parallel Parallel Architectures Shared memory Shared disk Shared nothing closely coupled loosely coupled Parallelism Terminology Speedup:

More information

Comparison of Different Implementation of Inverted Indexes in Hadoop

Comparison of Different Implementation of Inverted Indexes in Hadoop Comparison of Different Implementation of Inverted Indexes in Hadoop Hediyeh Baban, S. Kami Makki, and Stefan Andrei Department of Computer Science Lamar University Beaumont, Texas (hbaban, kami.makki,

More information

Data-Intensive Computing with Map-Reduce and Hadoop

Data-Intensive Computing with Map-Reduce and Hadoop Data-Intensive Computing with Map-Reduce and Hadoop Shamil Humbetov Department of Computer Engineering Qafqaz University Baku, Azerbaijan humbetov@gmail.com Abstract Every day, we create 2.5 quintillion

More information

Analysis of MapReduce Algorithms

Analysis of MapReduce Algorithms Analysis of MapReduce Algorithms Harini Padmanaban Computer Science Department San Jose State University San Jose, CA 95192 408-924-1000 harini.gomadam@gmail.com ABSTRACT MapReduce is a programming model

More information

Cloud Computing at Google. Architecture

Cloud Computing at Google. Architecture Cloud Computing at Google Google File System Web Systems and Algorithms Google Chris Brooks Department of Computer Science University of San Francisco Google has developed a layered system to handle webscale

More information

HMR LOG ANALYZER: ANALYZE WEB APPLICATION LOGS OVER HADOOP MAPREDUCE

HMR LOG ANALYZER: ANALYZE WEB APPLICATION LOGS OVER HADOOP MAPREDUCE HMR LOG ANALYZER: ANALYZE WEB APPLICATION LOGS OVER HADOOP MAPREDUCE Sayalee Narkhede 1 and Tripti Baraskar 2 Department of Information Technology, MIT-Pune,University of Pune, Pune sayleenarkhede@gmail.com

More information

Hadoop: A Framework for Data- Intensive Distributed Computing. CS561-Spring 2012 WPI, Mohamed Y. Eltabakh

Hadoop: A Framework for Data- Intensive Distributed Computing. CS561-Spring 2012 WPI, Mohamed Y. Eltabakh 1 Hadoop: A Framework for Data- Intensive Distributed Computing CS561-Spring 2012 WPI, Mohamed Y. Eltabakh 2 What is Hadoop? Hadoop is a software framework for distributed processing of large datasets

More information

Take An Internal Look at Hadoop. Hairong Kuang Grid Team, Yahoo! Inc hairong@yahoo-inc.com

Take An Internal Look at Hadoop. Hairong Kuang Grid Team, Yahoo! Inc hairong@yahoo-inc.com Take An Internal Look at Hadoop Hairong Kuang Grid Team, Yahoo! Inc hairong@yahoo-inc.com What s Hadoop Framework for running applications on large clusters of commodity hardware Scale: petabytes of data

More information

A Study on Workload Imbalance Issues in Data Intensive Distributed Computing

A Study on Workload Imbalance Issues in Data Intensive Distributed Computing A Study on Workload Imbalance Issues in Data Intensive Distributed Computing Sven Groot 1, Kazuo Goda 1, and Masaru Kitsuregawa 1 University of Tokyo, 4-6-1 Komaba, Meguro-ku, Tokyo 153-8505, Japan Abstract.

More information

Chapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related

Chapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related Chapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related Summary Xiangzhe Li Nowadays, there are more and more data everyday about everything. For instance, here are some of the astonishing

More information

ImprovedApproachestoHandleBigdatathroughHadoop

ImprovedApproachestoHandleBigdatathroughHadoop Global Journal of Computer Science and Technology: C Software & Data Engineering Volume 14 Issue 9 Version 1.0 Type: Double Blind Peer Reviewed International Research Journal Publisher: Global Journals

More information

DESIGN ARCHITECTURE-BASED ON WEB SERVER AND APPLICATION CLUSTER IN CLOUD ENVIRONMENT

DESIGN ARCHITECTURE-BASED ON WEB SERVER AND APPLICATION CLUSTER IN CLOUD ENVIRONMENT DESIGN ARCHITECTURE-BASED ON WEB SERVER AND APPLICATION CLUSTER IN CLOUD ENVIRONMENT Gita Shah 1, Annappa 2 and K. C. Shet 3 1,2,3 Department of Computer Science & Engineering, National Institute of Technology,

More information

Analysis and Modeling of MapReduce s Performance on Hadoop YARN

Analysis and Modeling of MapReduce s Performance on Hadoop YARN Analysis and Modeling of MapReduce s Performance on Hadoop YARN Qiuyi Tang Dept. of Mathematics and Computer Science Denison University tang_j3@denison.edu Dr. Thomas C. Bressoud Dept. of Mathematics and

More information

Processing of Hadoop using Highly Available NameNode

Processing of Hadoop using Highly Available NameNode Processing of Hadoop using Highly Available NameNode 1 Akash Deshpande, 2 Shrikant Badwaik, 3 Sailee Nalawade, 4 Anjali Bote, 5 Prof. S. P. Kosbatwar Department of computer Engineering Smt. Kashibai Navale

More information

The Performance Characteristics of MapReduce Applications on Scalable Clusters

The Performance Characteristics of MapReduce Applications on Scalable Clusters The Performance Characteristics of MapReduce Applications on Scalable Clusters Kenneth Wottrich Denison University Granville, OH 43023 wottri_k1@denison.edu ABSTRACT Many cluster owners and operators have

More information

A Cost-Benefit Analysis of Indexing Big Data with Map-Reduce

A Cost-Benefit Analysis of Indexing Big Data with Map-Reduce A Cost-Benefit Analysis of Indexing Big Data with Map-Reduce Dimitrios Siafarikas Argyrios Samourkasidis Avi Arampatzis Department of Electrical and Computer Engineering Democritus University of Thrace

More information

MapReduce and Hadoop Distributed File System

MapReduce and Hadoop Distributed File System MapReduce and Hadoop Distributed File System 1 B. RAMAMURTHY Contact: Dr. Bina Ramamurthy CSE Department University at Buffalo (SUNY) bina@buffalo.edu http://www.cse.buffalo.edu/faculty/bina Partially

More information

International Journal of Emerging Technology & Research

International Journal of Emerging Technology & Research International Journal of Emerging Technology & Research High Performance Clustering on Large Scale Dataset in a Multi Node Environment Based on Map-Reduce and Hadoop Anusha Vasudevan 1, Swetha.M 2 1, 2

More information

Optimization and analysis of large scale data sorting algorithm based on Hadoop

Optimization and analysis of large scale data sorting algorithm based on Hadoop Optimization and analysis of large scale sorting algorithm based on Hadoop Zhuo Wang, Longlong Tian, Dianjie Guo, Xiaoming Jiang Institute of Information Engineering, Chinese Academy of Sciences {wangzhuo,

More information

Mining of Web Server Logs in a Distributed Cluster Using Big Data Technologies

Mining of Web Server Logs in a Distributed Cluster Using Big Data Technologies Mining of Web Server Logs in a Distributed Cluster Using Big Data Technologies Savitha K Dept. of Computer Science, Research Scholar PSGR Krishnammal College for Women Coimbatore, India. Vijaya MS Dept.

More information

Survey on Scheduling Algorithm in MapReduce Framework

Survey on Scheduling Algorithm in MapReduce Framework Survey on Scheduling Algorithm in MapReduce Framework Pravin P. Nimbalkar 1, Devendra P.Gadekar 2 1,2 Department of Computer Engineering, JSPM s Imperial College of Engineering and Research, Pune, India

More information

Parallel Programming Map-Reduce. Needless to Say, We Need Machine Learning for Big Data

Parallel Programming Map-Reduce. Needless to Say, We Need Machine Learning for Big Data Case Study 2: Document Retrieval Parallel Programming Map-Reduce Machine Learning/Statistics for Big Data CSE599C1/STAT592, University of Washington Carlos Guestrin January 31 st, 2013 Carlos Guestrin

More information

Enhancing MapReduce Functionality for Optimizing Workloads on Data Centers

Enhancing MapReduce Functionality for Optimizing Workloads on Data Centers Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology IJCSMC, Vol. 2, Issue. 10, October 2013,

More information

Parallel Processing of cluster by Map Reduce

Parallel Processing of cluster by Map Reduce Parallel Processing of cluster by Map Reduce Abstract Madhavi Vaidya, Department of Computer Science Vivekanand College, Chembur, Mumbai vamadhavi04@yahoo.co.in MapReduce is a parallel programming model

More information

IJREAT International Journal of Research in Engineering & Advanced Technology, Volume 2, Issue 1, Feb-Mar, 2014 ISSN: 2320-8791 www.ijreat.

IJREAT International Journal of Research in Engineering & Advanced Technology, Volume 2, Issue 1, Feb-Mar, 2014 ISSN: 2320-8791 www.ijreat. Design of Log Analyser Algorithm Using Hadoop Framework Banupriya P 1, Mohandas Ragupathi 2 PG Scholar, Department of Computer Science and Engineering, Hindustan University, Chennai Assistant Professor,

More information

Weekly Report. Hadoop Introduction. submitted By Anurag Sharma. Department of Computer Science and Engineering. Indian Institute of Technology Bombay

Weekly Report. Hadoop Introduction. submitted By Anurag Sharma. Department of Computer Science and Engineering. Indian Institute of Technology Bombay Weekly Report Hadoop Introduction submitted By Anurag Sharma Department of Computer Science and Engineering Indian Institute of Technology Bombay Chapter 1 What is Hadoop? Apache Hadoop (High-availability

More information

Associate Professor, Department of CSE, Shri Vishnu Engineering College for Women, Andhra Pradesh, India 2

Associate Professor, Department of CSE, Shri Vishnu Engineering College for Women, Andhra Pradesh, India 2 Volume 6, Issue 3, March 2016 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Special Issue

More information

The Comprehensive Performance Rating for Hadoop Clusters on Cloud Computing Platform

The Comprehensive Performance Rating for Hadoop Clusters on Cloud Computing Platform The Comprehensive Performance Rating for Hadoop Clusters on Cloud Computing Platform Fong-Hao Liu, Ya-Ruei Liou, Hsiang-Fu Lo, Ko-Chin Chang, and Wei-Tsong Lee Abstract Virtualization platform solutions

More information

Reduction of Data at Namenode in HDFS using harballing Technique

Reduction of Data at Namenode in HDFS using harballing Technique Reduction of Data at Namenode in HDFS using harballing Technique Vaibhav Gopal Korat, Kumar Swamy Pamu vgkorat@gmail.com swamy.uncis@gmail.com Abstract HDFS stands for the Hadoop Distributed File System.

More information

An Experimental Approach Towards Big Data for Analyzing Memory Utilization on a Hadoop cluster using HDFS and MapReduce.

An Experimental Approach Towards Big Data for Analyzing Memory Utilization on a Hadoop cluster using HDFS and MapReduce. An Experimental Approach Towards Big Data for Analyzing Memory Utilization on a Hadoop cluster using HDFS and MapReduce. Amrit Pal Stdt, Dept of Computer Engineering and Application, National Institute

More information

A STUDY ON HADOOP ARCHITECTURE FOR BIG DATA ANALYTICS

A STUDY ON HADOOP ARCHITECTURE FOR BIG DATA ANALYTICS A STUDY ON HADOOP ARCHITECTURE FOR BIG DATA ANALYTICS Dr. Ananthi Sheshasayee 1, J V N Lakshmi 2 1 Head Department of Computer Science & Research, Quaid-E-Millath Govt College for Women, Chennai, (India)

More information

Intro to Map/Reduce a.k.a. Hadoop

Intro to Map/Reduce a.k.a. Hadoop Intro to Map/Reduce a.k.a. Hadoop Based on: Mining of Massive Datasets by Ra jaraman and Ullman, Cambridge University Press, 2011 Data Mining for the masses by North, Global Text Project, 2012 Slides by

More information

Hadoop implementation of MapReduce computational model. Ján Vaňo

Hadoop implementation of MapReduce computational model. Ján Vaňo Hadoop implementation of MapReduce computational model Ján Vaňo What is MapReduce? A computational model published in a paper by Google in 2004 Based on distributed computation Complements Google s distributed

More information

Department of Computer Science University of Cyprus EPL646 Advanced Topics in Databases. Lecture 14

Department of Computer Science University of Cyprus EPL646 Advanced Topics in Databases. Lecture 14 Department of Computer Science University of Cyprus EPL646 Advanced Topics in Databases Lecture 14 Big Data Management IV: Big-data Infrastructures (Background, IO, From NFS to HFDS) Chapter 14-15: Abideboul

More information

HiBench Introduction. Carson Wang (carson.wang@intel.com) Software & Services Group

HiBench Introduction. Carson Wang (carson.wang@intel.com) Software & Services Group HiBench Introduction Carson Wang (carson.wang@intel.com) Agenda Background Workloads Configurations Benchmark Report Tuning Guide Background WHY Why we need big data benchmarking systems? WHAT What is

More information

Analyzing Log Files to Find Hit Count Through the Utilization of Hadoop MapReduce in Cloud Computing Environmen

Analyzing Log Files to Find Hit Count Through the Utilization of Hadoop MapReduce in Cloud Computing Environmen Analyzing Log Files to Find Hit Count Through the Utilization of Hadoop MapReduce in Cloud Computing Environmen Anil G, 1* Aditya K Naik, 1 B C Puneet, 1 Gaurav V, 1 Supreeth S 1 Abstract: Log files which

More information

INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY

INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY A PATH FOR HORIZING YOUR INNOVATIVE WORK A COMPREHENSIVE VIEW OF HADOOP ER. AMRINDER KAUR Assistant Professor, Department

More information

The Hadoop Framework

The Hadoop Framework The Hadoop Framework Nils Braden University of Applied Sciences Gießen-Friedberg Wiesenstraße 14 35390 Gießen nils.braden@mni.fh-giessen.de Abstract. The Hadoop Framework offers an approach to large-scale

More information

Performance Comparison of SQL based Big Data Analytics with Lustre and HDFS file systems

Performance Comparison of SQL based Big Data Analytics with Lustre and HDFS file systems Performance Comparison of SQL based Big Data Analytics with Lustre and HDFS file systems Rekha Singhal and Gabriele Pacciucci * Other names and brands may be claimed as the property of others. Lustre File

More information

marlabs driving digital agility WHITEPAPER Big Data and Hadoop

marlabs driving digital agility WHITEPAPER Big Data and Hadoop marlabs driving digital agility WHITEPAPER Big Data and Hadoop Abstract This paper explains the significance of Hadoop, an emerging yet rapidly growing technology. The prime goal of this paper is to unveil

More information

HDFS Space Consolidation

HDFS Space Consolidation HDFS Space Consolidation Aastha Mehta*,1,2, Deepti Banka*,1,2, Kartheek Muthyala*,1,2, Priya Sehgal 1, Ajay Bakre 1 *Student Authors 1 Advanced Technology Group, NetApp Inc., Bangalore, India 2 Birla Institute

More information

http://www.wordle.net/

http://www.wordle.net/ Hadoop & MapReduce http://www.wordle.net/ http://www.wordle.net/ Hadoop is an open-source software framework (or platform) for Reliable + Scalable + Distributed Storage/Computational unit Failures completely

More information

International Journal of Innovative Research in Computer and Communication Engineering

International Journal of Innovative Research in Computer and Communication Engineering FP Tree Algorithm and Approaches in Big Data T.Rathika 1, J.Senthil Murugan 2 Assistant Professor, Department of CSE, SRM University, Ramapuram Campus, Chennai, Tamil Nadu,India 1 Assistant Professor,

More information

Big Data Technology Map-Reduce Motivation: Indexing in Search Engines

Big Data Technology Map-Reduce Motivation: Indexing in Search Engines Big Data Technology Map-Reduce Motivation: Indexing in Search Engines Edward Bortnikov & Ronny Lempel Yahoo Labs, Haifa Indexing in Search Engines Information Retrieval s two main stages: Indexing process

More information

Hadoop IST 734 SS CHUNG

Hadoop IST 734 SS CHUNG Hadoop IST 734 SS CHUNG Introduction What is Big Data?? Bulk Amount Unstructured Lots of Applications which need to handle huge amount of data (in terms of 500+ TB per day) If a regular machine need to

More information

Evaluating HDFS I/O Performance on Virtualized Systems

Evaluating HDFS I/O Performance on Virtualized Systems Evaluating HDFS I/O Performance on Virtualized Systems Xin Tang xtang@cs.wisc.edu University of Wisconsin-Madison Department of Computer Sciences Abstract Hadoop as a Service (HaaS) has received increasing

More information

Journal of science STUDY ON REPLICA MANAGEMENT AND HIGH AVAILABILITY IN HADOOP DISTRIBUTED FILE SYSTEM (HDFS)

Journal of science STUDY ON REPLICA MANAGEMENT AND HIGH AVAILABILITY IN HADOOP DISTRIBUTED FILE SYSTEM (HDFS) Journal of science e ISSN 2277-3290 Print ISSN 2277-3282 Information Technology www.journalofscience.net STUDY ON REPLICA MANAGEMENT AND HIGH AVAILABILITY IN HADOOP DISTRIBUTED FILE SYSTEM (HDFS) S. Chandra

More information

Distributed Apriori in Hadoop MapReduce Framework

Distributed Apriori in Hadoop MapReduce Framework Distributed Apriori in Hadoop MapReduce Framework By Shulei Zhao (sz2352) and Rongxin Du (rd2537) Individual Contribution: Shulei Zhao: Implements centralized Apriori algorithm and input preprocessing

More information

DATA MINING WITH HADOOP AND HIVE Introduction to Architecture

DATA MINING WITH HADOOP AND HIVE Introduction to Architecture DATA MINING WITH HADOOP AND HIVE Introduction to Architecture Dr. Wlodek Zadrozny (Most slides come from Prof. Akella s class in 2014) 2015-2025. Reproduction or usage prohibited without permission of

More information

Introduction to Parallel Programming and MapReduce

Introduction to Parallel Programming and MapReduce Introduction to Parallel Programming and MapReduce Audience and Pre-Requisites This tutorial covers the basics of parallel programming and the MapReduce programming model. The pre-requisites are significant

More information

Hadoop and Map-Reduce. Swati Gore

Hadoop and Map-Reduce. Swati Gore Hadoop and Map-Reduce Swati Gore Contents Why Hadoop? Hadoop Overview Hadoop Architecture Working Description Fault Tolerance Limitations Why Map-Reduce not MPI Distributed sort Why Hadoop? Existing Data

More information

Manifest for Big Data Pig, Hive & Jaql

Manifest for Big Data Pig, Hive & Jaql Manifest for Big Data Pig, Hive & Jaql Ajay Chotrani, Priyanka Punjabi, Prachi Ratnani, Rupali Hande Final Year Student, Dept. of Computer Engineering, V.E.S.I.T, Mumbai, India Faculty, Computer Engineering,

More information

Parallel Data Mining and Assurance Service Model Using Hadoop in Cloud

Parallel Data Mining and Assurance Service Model Using Hadoop in Cloud Parallel Data Mining and Assurance Service Model Using Hadoop in Cloud Aditya Jadhav, Mahesh Kukreja E-mail: aditya.jadhav27@gmail.com & mr_mahesh_in@yahoo.co.in Abstract : In the information industry,

More information

A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM

A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM Sneha D.Borkar 1, Prof.Chaitali S.Surtakar 2 Student of B.E., Information Technology, J.D.I.E.T, sborkar95@gmail.com Assistant Professor, Information

More information

Introduction to MapReduce and Hadoop

Introduction to MapReduce and Hadoop Introduction to MapReduce and Hadoop Jie Tao Karlsruhe Institute of Technology jie.tao@kit.edu Die Kooperation von Why Map/Reduce? Massive data Can not be stored on a single machine Takes too long to process

More information

Prepared By : Manoj Kumar Joshi & Vikas Sawhney

Prepared By : Manoj Kumar Joshi & Vikas Sawhney Prepared By : Manoj Kumar Joshi & Vikas Sawhney General Agenda Introduction to Hadoop Architecture Acknowledgement Thanks to all the authors who left their selfexplanatory images on the internet. Thanks

More information