Shrida Girme 1 and C. A. Laulkar 2, 1,2 Sinhgad College of Engineering, Pune
Clustering Algorithm for Log File Analysis on Top of Hadoop

Abstract: Software developers usually hook logging code into the business logic to track every crucial operation. Using the log module, the system administrator keeps an eye on fatal errors and warnings by hunting through substantial log files. Log files are usually located on the server side or at a remote site as simple text files. However, due to the increasing scale and complexity of business systems, log files are growing ever larger, often reaching terabytes or even petabytes, and processing such huge log files with legacy methods is quite time consuming. It is usually inefficient to analyze system logs on a standalone machine: analyzing even gigabytes of logs can take a whole day. The main idea of this paper is distributed log file processing using the MapReduce programming methodology together with data mining algorithms such as Apriori and K-means clustering. Hadoop implements the MapReduce programming model described in the research paper by Google; this model divides a task in a distributed manner, then collects and integrates the output from multiple peers. Hence, this project introduces an integrative methodology for analyzing huge log files using a MapReduce-based programming model on top of Hadoop. The data mining algorithms help in analyzing the logs to understand user behavior.

Index Terms: Log files, MapReduce, Hadoop, Apriori algorithm, K-means.

I. INTRODUCTION

Development, maintenance and performance refinement are very important in large-scale distributed systems, and parsing weblogs to retrieve important information supports all three. Analyzing the web logs produced by distributed web servers is a good way to carry out troubleshooting and problem diagnosis.
Nowadays, due to the increasing scale and complexity of distributed systems, logs are becoming very large, and it is inefficient for common methods to analyze large web server logs on a single node. Hence, there is a demand for a distributed method of parsing web logs. In this paper, a MapReduce-based framework is implemented to analyze web logs and compute the various statistics required by the system administrator, such as aggregate data about which pages of a particular website were popular among users. The framework is built on top of Hadoop, an open-source distributed file system and MapReduce implementation. The system first uses the K-means clustering algorithm to integrate the collected logs. After that, a MapReduce-based algorithm is applied to parse these clustered log files. Finally, to make the best use of the collected data, a flexible and powerful interface displays the monitoring and analysis results. This system helps answer the possible queries posed by the system administrator, so that the data mining answers can enhance decision support systems and help the administrator apply business intelligence in the organization to increase revenue.

II. RELATED WORK

To date a lot of work has been carried out on log file analysis, and many log file analyzers are available, but they all work on standalone systems. We propose a log file analysis system that supports a distributed environment. Wichian Premchaiswadi proposed a novel and efficient web log mining model for clustering web users [1]. Zhang, using a Clickstream Data Warehouse (CDW), proposed a hybrid data extraction strategy based on time characteristics in the server layer [2].
Hu and Zhong proposed an effective method that uses web farming to collect clickstream logs with a servlet filter and extended the common web log standard to model a clickstream log format [3]. Antonellis extended the model, specializing in the use of access log data in the form of clickstreams; they used an online component store that summarized data in temporary memory, where the web log data was merged, rather than using a storage system [4]. Hofgesang and Kowalczyk used standard statistical and data mining techniques for exploratory data analysis and anomaly detection, fitting finite mixtures of multinomial distributions to session data in order to extract user profiles for further analysis [5]. Banerjee and Ghosh introduced the idea of concept-based paths, so that paths through conceptually similar pages have a non-null intersection even without any direct webpage overlap [6]. In our paper we have extended the concept of log file analysis to use data mining algorithms based on the MapReduce programming model, which not only increases the efficiency of the system by generating data mining results but also reduces the processing time for huge log files. This section presents three cores related to our work: the Hadoop MapReduce framework, the K-means algorithm, and the Apriori algorithm for log file analysis.

A. Hadoop MapReduce Framework

Fig. 1 Hadoop Distributed File System architecture overview (Courtesy: The Hadoop Distributed File System: Architecture and Design)

As Figure 1 illustrates, the Hadoop Distributed File System (HDFS) has a master/slave architecture consisting of components with the following specific roles:
1) NameNode: an HDFS cluster contains a single NameNode, which manages the file system namespace and regulates access to files by clients.
2) DataNodes: each slave machine in the cluster hosts a DataNode daemon, which performs the grunt work of the distributed file system, both reading and writing.
3) SecondaryNameNode (SNN): an assistant daemon that monitors the state of the cluster HDFS.
4) JobTracker: the JobTracker daemon is the liaison between the application and Hadoop.
5) TaskTracker: manages the execution of individual tasks on each slave node.

MapReduce is a programming model for processing large-scale datasets on computer clusters. The model consists of two functions, map() and reduce(), whose signatures are as follows:

Map (k1, v1) -> list (k2, v2)
Reduce (k2, list (v2)) -> list (v2)

Users implement their own processing logic by specifying customized map() and reduce() functions. The map() function takes an input key/value pair and produces a list of intermediate key/value pairs.
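As a minimal illustration of these signatures, the classic word-count job can be simulated in plain Python (this is only a single-process sketch of the model, not actual Hadoop code):

```python
from collections import defaultdict

def map_fn(k1, v1):
    """Map(k1, v1) -> list(k2, v2): emit (word, 1) for every word in a line."""
    return [(word, 1) for word in v1.split()]

def reduce_fn(k2, values):
    """Reduce(k2, list(v2)) -> list(v2): sum the counts for one key."""
    return [sum(values)]

def run_job(records):
    # Map phase: apply map_fn to every (key, value) input record.
    intermediate = defaultdict(list)
    for k1, v1 in records:
        for k2, v2 in map_fn(k1, v1):
            intermediate[k2].append(v2)   # shuffle: group values by key
    # Reduce phase: one reduce_fn call per unique intermediate key.
    return {k2: reduce_fn(k2, vs)[0] for k2, vs in intermediate.items()}

lines = [(0, "error disk full"), (1, "error network down")]
print(run_job(lines))   # {'error': 2, 'disk': 1, 'full': 1, 'network': 1, 'down': 1}
```

The framework only ever sees the two user-supplied functions; the grouping of intermediate values by key is exactly what the shuffle performs in a real cluster.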
The frozen part of MapReduce is a large distributed sort, which consists of six functions: 1) an input reader, 2) the map function, 3) a partition function, 4) a compare function, 5) the reduce function, and 6) an output writer, as shown in Figure 2.

Fig. 2 Map phase and reduce phase in HDFS (Courtesy: The Hadoop Distributed File System: Architecture and Design)

As illustrated in Figure 2, the input reader divides the input into appropriately sized 'splits' (in practice typically 16 MB to 128 MB) and the framework assigns one split to each map function. The input reader reads data from stable storage (typically a distributed file system) and generates key/value pairs. Each map function takes a series of key/value pairs, processes each, and generates zero or more output key/value pairs. Each map output is allocated to a particular reducer by the application's partition function for sharding purposes: the partition function is given the key and the number of reducers and returns the index of the desired reducer. Between the map and reduce stages, the data is shuffled in order to move each record from the map node that produced it to the shard in which it will be reduced. The shuffle can sometimes take longer than the computation itself, depending on network bandwidth, CPU speed, the amount of data produced, and the time taken by the map and reduce computations. The input for each reduce is pulled from the machines where the maps ran and sorted using the application's comparison function. The framework calls the application's reduce function once for each unique key, in sorted order. The reduce can iterate through the values associated with that key and output zero or more values. Finally, the output writer writes the output of the reduce to stable storage, usually a distributed file system.

B. K-Means Clustering Algorithm

K-means works on objects that can be represented in an n-dimensional vector space for which a distance measure is defined. The algorithm is as below.
Initialize k clusters
Until converged:
  For every <point, cluster> pair, compute the membership of the point in the cluster
  Recompute the cluster centers using the above membership values of points to clusters

The objective minimized is

J = \sum_{j=1}^{k} \sum_{i=1}^{n} \left\| x_i^{(j)} - c_j \right\|^2

where \| x_i^{(j)} - c_j \|^2 is a chosen distance measure between a data point x_i^{(j)} and the cluster centre c_j, and J is an indicator of the distance of the n data points from their respective cluster centers. The K-means clustering algorithm only needs to perform a distance calculation.
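A minimal single-machine sketch of this loop (illustrative Python with a fixed iteration budget and a deterministic initialization from the first k points; the MapReduce version is described next):

```python
def kmeans(points, k, iters=20):
    """Plain K-means: assign each point to its nearest centre (squared
    Euclidean distance), then recompute each centre as the mean of its
    assigned points."""
    centres = [tuple(p) for p in points[:k]]   # simple deterministic init
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # index of the nearest centre under squared Euclidean distance
            j = min(range(k),
                    key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centres[j])))
            clusters[j].append(p)
        for j, members in enumerate(clusters):
            if members:   # keep the old centre if no points were assigned
                centres[j] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centres

pts = [(1.0, 1.0), (1.2, 0.8), (8.0, 8.0), (8.2, 7.9)]
print(sorted(kmeans(pts, 2)))   # two centres, near (1.1, 0.9) and (8.1, 7.95)
```

Each iteration is exactly the two steps of the pseudocode above: an assignment pass (the distance calculation) and a centre-recomputation pass.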
K-Means MapReduce Design

Driver:
1. Runs multiple iteration jobs using mapper + combiner + reducer.
2. Runs the final clustering job using only the mapper.
Mapper:
1. Configure: single file containing the encoded clusters.
2. Input: file split containing encoded vectors.
3. Output: vectors keyed by nearest cluster.
Combiner:
1. Input: vectors keyed by nearest cluster.
2. Output: cluster centroid vectors keyed by cluster.
Reducer:
1. Input: cluster centroid vectors.
2. Output: single file containing vectors keyed by cluster.

C. Apriori Algorithm

The Apriori algorithm, often discussed in terms of the market basket problem, is a data mining algorithm for finding frequent itemsets. Apriori was selected because of its performance: it is able to run the mining process in a short period. Currently, the Apriori algorithm is commonly used for generating association rules for web usage mining. It is the best-known association rule algorithm and uses the large itemset property, which states that every subset of a large itemset is also large.

Association analysis is used for discovering relevant and useful relations between variables in large databases.

Definition of support: Equation (1) defines the support S of an item as the ratio of the number of transactions in which the item is present to the total number of transactions in the database; (2) defines a simpler version, in which the support of an item is its number of occurrences across all transactions.

Support(S) = (number of transactions in which the item is present / total number of transactions) * 100   (1)

Support(S) = number of transactions in which the item is present   (2)

Definition of large itemset: when the frequency of occurrence of a specific pattern is above the threshold S, we call that itemset a large itemset (L).
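The large itemset property is what makes Apriori efficient: a candidate of size k+1 need only be counted if all of its size-k subsets are already large. A hedged single-machine sketch, using the raw-count definition of support from equation (2) and hypothetical page-visit baskets:

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Find all large (frequent) itemsets: a candidate of size k+1 is kept
    only if every size-k subset is already large (the Apriori property)."""
    def support(itemset):
        # definition (2): number of transactions containing the itemset
        return sum(1 for t in transactions if itemset <= t)

    items = sorted({i for t in transactions for i in t})
    large = {}                                 # itemset -> support count
    current = [frozenset([i]) for i in items]  # size-1 candidates
    k = 1
    while current:
        frequent = [c for c in current if support(c) >= min_support]
        large.update({c: support(c) for c in frequent})
        freq_set = set(frequent)
        # size-(k+1) candidates whose k-subsets are all frequent
        current = sorted(
            {a | b for a in frequent for b in frequent
             if len(a | b) == k + 1
             and all(frozenset(s) in freq_set for s in combinations(a | b, k))},
            key=sorted)
        k += 1
    return large

baskets = [frozenset(t) for t in (["/index", "/support"],
                                  ["/index", "/about"],
                                  ["/index", "/support"])]
print(apriori(baskets, min_support=2))
```

With a threshold of 2, {/index}, {/support} and {/index, /support} come out large, while {/about} (support 1) is pruned before any pair containing it is ever counted.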
III. MATHEMATICAL MODEL

Consider N log files L_1, L_2, ..., L_N, each of size 100 MB to 2 TB, where each file i contains R_i records. The total number of records is

LR = \sum_{i=1}^{N} R_i

where R_i is the index of the last record of file i. The goal of the mathematical model is to compare the total cost of evaluating these log records on a single standalone machine against a distributed (Hadoop) cluster.

Part 1: Let c be the cost of processing each record independently on a machine of commodity hardware. The cost of processing LR records on a single commodity machine is then

C_single = p \cdot c \cdot LR

where p is a processing factor that depends on the configuration of the commodity hardware. This equation shows that the cost of processing on a single machine is a multiple of the number of records, which is quite heavy.

Part 2: Consider a Hadoop cluster (HDFS) having parallel processing nodes {HN_1, HN_2, ..., HN_x} and a master node MHN. The log files are stored in chunks CH_1, CH_2, ..., CH_x, where x is bounded by the maximum chunk capacity of each HN_x and MHN. These chunks are distributed across the various HN_x based on the balancing condition of the cluster.

Part 3: Applying the data mining algorithm to the various chunks of log files. Mapper: S is the set of key/value pairs generated on each HN, and DMQ is the condition for the selection operator, which extracts part of each record based on the user's data mining query. Reducer: takes S, the set of key/value pairs generated by the mappers based on DMQ, and produces the set of key/value pairs containing the information that belongs to the user's DMQ.

Part 4: Calculation of the cost of evaluating the log records (LR) on a Hadoop cluster with nodes (HN_1, HN_2, ..., HN_x), on which the mapper and reducer equations are implemented based on the user's DMQ. Let the mappers be MR_1, ..., MR_N and the reducers RR_1, ..., RR_N. The execution cost of the mapper processes is

CMR = \sum_{x} m_x

where m_x is the cost of evaluating the records of chunk CH_x at node HN_i, and the execution cost of the reducer processes is

CRR = \sum_{x} r_x

where r_x is the cost of evaluating the <key, value> pairs generated by the mappers. The total cost of processing the log file on the Hadoop cluster is CMR + CRR.

IV. DYNAMIC PROGRAMMING AND SERIALIZATION

The log file analysis system runs on the open-source Apache Hadoop Distributed File System. Hadoop runs in a scale-out, shared-nothing fashion across many commodity servers and employs a divide-and-conquer methodology for tackling big data use cases.
Hadoop is a framework for running applications on large clusters of commodity hardware using a computational paradigm named map/reduce, in which the application is divided into many small fragments of work, each of which may be executed on any node in the cluster. Hadoop provides a storage layer that holds vast amounts of data, and an execution layer for running an application in parallel across the cluster against parts of the stored data. The storage layer looks like a single storage volume that has been optimized for many concurrent serialized reads of large data files, where "large" ranges from gigabytes to petabytes. However, it supports only a single writer, and random access to the data is not really possible in an efficient manner. This restriction is also why it is so performant and reliable: reliable in part because it allows the data to be replicated across the cluster, reducing the chance of data loss. The execution layer relies on a divide-and-conquer strategy called MapReduce.

Fig. 3: Divide-and-conquer strategy in MapReduce programming (the JobTracker distributes map tasks MR and reduce tasks RR for a data mining query DMQ, producing the data mining information output)

V. DATA INDEPENDENCE AND DATA FLOW ARCHITECTURE

The dataflow architecture for the proposed system is as follows. The N log files are taken as input by the Hadoop Distributed File System. The JobTracker keeps track of all jobs submitted to HDFS. The MapReduce programs, along with the data mining algorithms, are fed to the MapReduce clients. The final computation results are displayed through an efficient portal. The map function processes logs of web page requests and outputs <URL, 1> pairs. The reduce function adds together all values for the same URL and emits a <URL, total count> pair.

Fig. 4: Data flow architecture for log file analysis, with summarization and graphical representation on a web browser
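The <URL, 1> / <URL, total count> flow above can be sketched in the style of a streaming mapper and reducer (a hypothetical plain-Python stand-in for the Hadoop job; the simplified "<ip> <url> <status>" record layout is an assumption for illustration):

```python
from itertools import groupby

def mapper(log_lines):
    # emit (url, 1) for every request; assumes "<ip> <url> <status>" records
    for line in log_lines:
        url = line.split()[1]
        yield (url, 1)

def reducer(sorted_pairs):
    # sum the counts per URL, relying on the shuffle having sorted by key
    for url, group in groupby(sorted_pairs, key=lambda kv: kv[0]):
        yield (url, sum(count for _, count in group))

logs = ["1.2.3.4 /index.html 200",
        "5.6.7.8 /support.html 200",
        "1.2.3.4 /index.html 304"]
shuffled = sorted(mapper(logs))      # the framework's sort/shuffle step
print(dict(reducer(shuffled)))       # {'/index.html': 2, '/support.html': 1}
```

The explicit `sorted()` call plays the role of the shuffle: in a real job the framework sorts and groups the mapper output before the reducer ever sees it.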
VI. RESULTS AND DISCUSSIONS

In this project the results are categorized into four sections: the daily activity, the most popular web pages, the hourly activity, and the activity in each clustering group. Figure 5 shows a sample of the log file used for the analysis, which includes parameters such as IP address, timestamp, method, protocol, OS used, bytes transferred, status code, and browser used. From these parameters it is easy to analyze any query given by the user.

[11/April/2012:05:17: ] "GET /support.html HTTP/1.1" " "Chrome/5.0 (MacOS; U; MacOS 10.2; en-us; rv: ) Gecko/ Chrome/5.0"
[12/April/2012:06:17: ] "POST /aboutme.html HTTP/1.1" " "Mozilla/5.0 (Ubuntu; U; Ubuntu 12.04; en-us; rv: ) Gecko/ Firefox/ "
[12/April/2012:06:17: ] "GET /support.html HTTP/1.1" " "Mozilla/5.0 (Windows; U; Windows NT 5.2; en-us; rv: ) Gecko/ Firefox/ "

Fig. 5. A typical weblog file sample, generated using a Java program

Fig. 6. Graphical representation of log analysis (hits, page views and visitors per weekday, Monday through Sunday)

From Figure 6 it can be concluded that the numbers of hits and page views were highest on Friday. These figures are obtained by applying the data mining algorithm to the given log file.

Fig. 7: Top 4 most popular web pages (hits and visitors)

As Figure 7 illustrates, the most-visited link had 500 visitors, while its number of hits was 467. In this way a number of parameters can be analyzed using the log files generated for a particular web server. The experimental setup for log file processing is built on the Hadoop platform. Our machine specification is as follows: the Intel X58 Express processor W GHz, 4 MB cache, 4 GB (2 x 2 GB) DDR-1333 ECC unbuffered RAM, single CPU, and 7200 rpm SATA hard disks. We are using Hadoop, running Java programs, for the planned experiment.
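Extracting the fields listed above from each log line is the first step of any of these analyses. A hedged sketch of parsing one Apache-style access log line (the exact layout of the generated log may differ; the field names and the sample line are illustrative):

```python
import re

# Combined-Log-Format-style pattern: ip, timestamp, request, status, bytes.
LOG_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) (?P<protocol>\S+)" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-)')

def parse(line):
    """Return a dict of named fields, or None if the line does not match."""
    m = LOG_RE.match(line)
    return m.groupdict() if m else None

line = ('10.0.0.1 - - [11/Apr/2012:05:17:02 +0000] '
        '"GET /support.html HTTP/1.1" 200 2326')
rec = parse(line)
print(rec["url"], rec["status"])   # /support.html 200
```

Lines that fail to parse return None, so malformed records can simply be skipped before the counting stage.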
VII. CONCLUSION

In this paper we aim to investigate the visitor behavior of a particular website. For experimental purposes we created our own log file, which is analyzed to produce a report that aggregates data about which pages visitors visit and in what order, as the result of the succession of mouse clicks each visitor makes. Moreover, we applied the Hadoop MapReduce framework to improve the response time, and we applied data mining algorithms, namely K-means clustering and the Apriori algorithm, to increase the quality of the results obtained. Based on the results of the study, we can conclude which pages are most popular among visitors and make the corresponding changes in web filtration. For future work, we will combine this approach with a Bayesian network to predict user behavior and compare the system with commercial software such as Google Analytics, the Piwik open-source web analytics platform, and Yahoo Web Analytics.

REFERENCES
[1] Premchaiswadi, W.; Romsaiyud, W., "Extracting weblog of Siam University for learning user behavior on MapReduce," Intelligent and Advanced Systems (ICIAS), International Conference on, vol. 1, June 2012.
[2] Li-na Zhang, Jie Liu and Yue Zhang, "Data Mixed-extraction Strategy based on the Time Characteristics in CDW," First International Conference on Pervasive Computing, Signal Processing and Applications.
[3] Jia Hu and Ning Zhong, "Clickstream Log Acquisition with Web Farming," IEEE Int'l Conf. on Web Intelligence (WI 05).
[4] Duda, R.O., Hart, P.E. et al., Pattern Classification, 2nd edition, John Wiley and Sons Inc, Singapore, 2001.
[5] Panagiotis Antonellis, Christos Makris and Nikos Tsirakis, "Algorithms for clustering clickstream data," Information Processing Letters, Vol. 109, Issue 8, March 2009.
[6] Peter I. Hofgesang and Wojtek Kowalczyk, "Analyzing Clickstream Data: From Anomaly Detection to Visitor Profiling," ECML/PKDD Discovery Challenge 2005.
[7] Arindam Banerjee and Joydeep Ghosh, "Clickstream clustering using Weighted Longest Common Subsequences," Web Mining Workshop at the 1st SIAM Conference on Data Mining, Chicago.
[8] Veeramalai, N. Jaisankar, A. Kannan, "Efficient Web Log Mining Using Enhanced Apriori Algorithm with Hash Tree and Fuzzy," IJCSIT, Vol. 2, No. 4, August.
[9] J. Dean and S. Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters," USENIX OSDI, December 2004.
[10] Shvachko, K.; Hairong Kuang; Radia, S.; Chansler, R., "The Hadoop Distributed File System," Mass Storage Systems and Technologies (MSST), 2010 IEEE 26th Symposium on, pp. 1-10, 3-7 May 2010.
[11] S. Ghemawat, H. Gobioff, S. Leung, "The Google file system," in Proc. of ACM Symposium on Operating Systems Principles, Lake George, NY, Oct. 2003.
International Journal of Advance Research in Computer Science and Management Studies, Volume 2, Issue 8, August 2014. ISSN: 2321-7782 (Online).
More informationHMR LOG ANALYZER: ANALYZE WEB APPLICATION LOGS OVER HADOOP MAPREDUCE
HMR LOG ANALYZER: ANALYZE WEB APPLICATION LOGS OVER HADOOP MAPREDUCE Sayalee Narkhede 1 and Tripti Baraskar 2 Department of Information Technology, MIT-Pune,University of Pune, Pune sayleenarkhede@gmail.com
More informationHadoop: A Framework for Data- Intensive Distributed Computing. CS561-Spring 2012 WPI, Mohamed Y. Eltabakh
1 Hadoop: A Framework for Data- Intensive Distributed Computing CS561-Spring 2012 WPI, Mohamed Y. Eltabakh 2 What is Hadoop? Hadoop is a software framework for distributed processing of large datasets
More informationTake An Internal Look at Hadoop. Hairong Kuang Grid Team, Yahoo! Inc hairong@yahoo-inc.com
Take An Internal Look at Hadoop Hairong Kuang Grid Team, Yahoo! Inc hairong@yahoo-inc.com What s Hadoop Framework for running applications on large clusters of commodity hardware Scale: petabytes of data
More informationA Study on Workload Imbalance Issues in Data Intensive Distributed Computing
A Study on Workload Imbalance Issues in Data Intensive Distributed Computing Sven Groot 1, Kazuo Goda 1, and Masaru Kitsuregawa 1 University of Tokyo, 4-6-1 Komaba, Meguro-ku, Tokyo 153-8505, Japan Abstract.
More informationChapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related
Chapter 11 Map-Reduce, Hadoop, HDFS, Hbase, MongoDB, Apache HIVE, and Related Summary Xiangzhe Li Nowadays, there are more and more data everyday about everything. For instance, here are some of the astonishing
More informationImprovedApproachestoHandleBigdatathroughHadoop
Global Journal of Computer Science and Technology: C Software & Data Engineering Volume 14 Issue 9 Version 1.0 Type: Double Blind Peer Reviewed International Research Journal Publisher: Global Journals
More informationDESIGN ARCHITECTURE-BASED ON WEB SERVER AND APPLICATION CLUSTER IN CLOUD ENVIRONMENT
DESIGN ARCHITECTURE-BASED ON WEB SERVER AND APPLICATION CLUSTER IN CLOUD ENVIRONMENT Gita Shah 1, Annappa 2 and K. C. Shet 3 1,2,3 Department of Computer Science & Engineering, National Institute of Technology,
More informationAnalysis and Modeling of MapReduce s Performance on Hadoop YARN
Analysis and Modeling of MapReduce s Performance on Hadoop YARN Qiuyi Tang Dept. of Mathematics and Computer Science Denison University tang_j3@denison.edu Dr. Thomas C. Bressoud Dept. of Mathematics and
More informationProcessing of Hadoop using Highly Available NameNode
Processing of Hadoop using Highly Available NameNode 1 Akash Deshpande, 2 Shrikant Badwaik, 3 Sailee Nalawade, 4 Anjali Bote, 5 Prof. S. P. Kosbatwar Department of computer Engineering Smt. Kashibai Navale
More informationThe Performance Characteristics of MapReduce Applications on Scalable Clusters
The Performance Characteristics of MapReduce Applications on Scalable Clusters Kenneth Wottrich Denison University Granville, OH 43023 wottri_k1@denison.edu ABSTRACT Many cluster owners and operators have
More informationA Cost-Benefit Analysis of Indexing Big Data with Map-Reduce
A Cost-Benefit Analysis of Indexing Big Data with Map-Reduce Dimitrios Siafarikas Argyrios Samourkasidis Avi Arampatzis Department of Electrical and Computer Engineering Democritus University of Thrace
More informationMapReduce and Hadoop Distributed File System
MapReduce and Hadoop Distributed File System 1 B. RAMAMURTHY Contact: Dr. Bina Ramamurthy CSE Department University at Buffalo (SUNY) bina@buffalo.edu http://www.cse.buffalo.edu/faculty/bina Partially
More informationInternational Journal of Emerging Technology & Research
International Journal of Emerging Technology & Research High Performance Clustering on Large Scale Dataset in a Multi Node Environment Based on Map-Reduce and Hadoop Anusha Vasudevan 1, Swetha.M 2 1, 2
More informationOptimization and analysis of large scale data sorting algorithm based on Hadoop
Optimization and analysis of large scale sorting algorithm based on Hadoop Zhuo Wang, Longlong Tian, Dianjie Guo, Xiaoming Jiang Institute of Information Engineering, Chinese Academy of Sciences {wangzhuo,
More informationMining of Web Server Logs in a Distributed Cluster Using Big Data Technologies
Mining of Web Server Logs in a Distributed Cluster Using Big Data Technologies Savitha K Dept. of Computer Science, Research Scholar PSGR Krishnammal College for Women Coimbatore, India. Vijaya MS Dept.
More informationSurvey on Scheduling Algorithm in MapReduce Framework
Survey on Scheduling Algorithm in MapReduce Framework Pravin P. Nimbalkar 1, Devendra P.Gadekar 2 1,2 Department of Computer Engineering, JSPM s Imperial College of Engineering and Research, Pune, India
More informationParallel Programming Map-Reduce. Needless to Say, We Need Machine Learning for Big Data
Case Study 2: Document Retrieval Parallel Programming Map-Reduce Machine Learning/Statistics for Big Data CSE599C1/STAT592, University of Washington Carlos Guestrin January 31 st, 2013 Carlos Guestrin
More informationEnhancing MapReduce Functionality for Optimizing Workloads on Data Centers
Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology IJCSMC, Vol. 2, Issue. 10, October 2013,
More informationParallel Processing of cluster by Map Reduce
Parallel Processing of cluster by Map Reduce Abstract Madhavi Vaidya, Department of Computer Science Vivekanand College, Chembur, Mumbai vamadhavi04@yahoo.co.in MapReduce is a parallel programming model
More informationIJREAT International Journal of Research in Engineering & Advanced Technology, Volume 2, Issue 1, Feb-Mar, 2014 ISSN: 2320-8791 www.ijreat.
Design of Log Analyser Algorithm Using Hadoop Framework Banupriya P 1, Mohandas Ragupathi 2 PG Scholar, Department of Computer Science and Engineering, Hindustan University, Chennai Assistant Professor,
More informationWeekly Report. Hadoop Introduction. submitted By Anurag Sharma. Department of Computer Science and Engineering. Indian Institute of Technology Bombay
Weekly Report Hadoop Introduction submitted By Anurag Sharma Department of Computer Science and Engineering Indian Institute of Technology Bombay Chapter 1 What is Hadoop? Apache Hadoop (High-availability
More informationAssociate Professor, Department of CSE, Shri Vishnu Engineering College for Women, Andhra Pradesh, India 2
Volume 6, Issue 3, March 2016 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Special Issue
More informationThe Comprehensive Performance Rating for Hadoop Clusters on Cloud Computing Platform
The Comprehensive Performance Rating for Hadoop Clusters on Cloud Computing Platform Fong-Hao Liu, Ya-Ruei Liou, Hsiang-Fu Lo, Ko-Chin Chang, and Wei-Tsong Lee Abstract Virtualization platform solutions
More informationReduction of Data at Namenode in HDFS using harballing Technique
Reduction of Data at Namenode in HDFS using harballing Technique Vaibhav Gopal Korat, Kumar Swamy Pamu vgkorat@gmail.com swamy.uncis@gmail.com Abstract HDFS stands for the Hadoop Distributed File System.
More informationAn Experimental Approach Towards Big Data for Analyzing Memory Utilization on a Hadoop cluster using HDFS and MapReduce.
An Experimental Approach Towards Big Data for Analyzing Memory Utilization on a Hadoop cluster using HDFS and MapReduce. Amrit Pal Stdt, Dept of Computer Engineering and Application, National Institute
More informationA STUDY ON HADOOP ARCHITECTURE FOR BIG DATA ANALYTICS
A STUDY ON HADOOP ARCHITECTURE FOR BIG DATA ANALYTICS Dr. Ananthi Sheshasayee 1, J V N Lakshmi 2 1 Head Department of Computer Science & Research, Quaid-E-Millath Govt College for Women, Chennai, (India)
More informationIntro to Map/Reduce a.k.a. Hadoop
Intro to Map/Reduce a.k.a. Hadoop Based on: Mining of Massive Datasets by Ra jaraman and Ullman, Cambridge University Press, 2011 Data Mining for the masses by North, Global Text Project, 2012 Slides by
More informationHadoop implementation of MapReduce computational model. Ján Vaňo
Hadoop implementation of MapReduce computational model Ján Vaňo What is MapReduce? A computational model published in a paper by Google in 2004 Based on distributed computation Complements Google s distributed
More informationDepartment of Computer Science University of Cyprus EPL646 Advanced Topics in Databases. Lecture 14
Department of Computer Science University of Cyprus EPL646 Advanced Topics in Databases Lecture 14 Big Data Management IV: Big-data Infrastructures (Background, IO, From NFS to HFDS) Chapter 14-15: Abideboul
More informationHiBench Introduction. Carson Wang (carson.wang@intel.com) Software & Services Group
HiBench Introduction Carson Wang (carson.wang@intel.com) Agenda Background Workloads Configurations Benchmark Report Tuning Guide Background WHY Why we need big data benchmarking systems? WHAT What is
More informationAnalyzing Log Files to Find Hit Count Through the Utilization of Hadoop MapReduce in Cloud Computing Environmen
Analyzing Log Files to Find Hit Count Through the Utilization of Hadoop MapReduce in Cloud Computing Environmen Anil G, 1* Aditya K Naik, 1 B C Puneet, 1 Gaurav V, 1 Supreeth S 1 Abstract: Log files which
More informationINTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY
INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY A PATH FOR HORIZING YOUR INNOVATIVE WORK A COMPREHENSIVE VIEW OF HADOOP ER. AMRINDER KAUR Assistant Professor, Department
More informationThe Hadoop Framework
The Hadoop Framework Nils Braden University of Applied Sciences Gießen-Friedberg Wiesenstraße 14 35390 Gießen nils.braden@mni.fh-giessen.de Abstract. The Hadoop Framework offers an approach to large-scale
More informationPerformance Comparison of SQL based Big Data Analytics with Lustre and HDFS file systems
Performance Comparison of SQL based Big Data Analytics with Lustre and HDFS file systems Rekha Singhal and Gabriele Pacciucci * Other names and brands may be claimed as the property of others. Lustre File
More informationmarlabs driving digital agility WHITEPAPER Big Data and Hadoop
marlabs driving digital agility WHITEPAPER Big Data and Hadoop Abstract This paper explains the significance of Hadoop, an emerging yet rapidly growing technology. The prime goal of this paper is to unveil
More informationHDFS Space Consolidation
HDFS Space Consolidation Aastha Mehta*,1,2, Deepti Banka*,1,2, Kartheek Muthyala*,1,2, Priya Sehgal 1, Ajay Bakre 1 *Student Authors 1 Advanced Technology Group, NetApp Inc., Bangalore, India 2 Birla Institute
More informationhttp://www.wordle.net/
Hadoop & MapReduce http://www.wordle.net/ http://www.wordle.net/ Hadoop is an open-source software framework (or platform) for Reliable + Scalable + Distributed Storage/Computational unit Failures completely
More informationInternational Journal of Innovative Research in Computer and Communication Engineering
FP Tree Algorithm and Approaches in Big Data T.Rathika 1, J.Senthil Murugan 2 Assistant Professor, Department of CSE, SRM University, Ramapuram Campus, Chennai, Tamil Nadu,India 1 Assistant Professor,
More informationBig Data Technology Map-Reduce Motivation: Indexing in Search Engines
Big Data Technology Map-Reduce Motivation: Indexing in Search Engines Edward Bortnikov & Ronny Lempel Yahoo Labs, Haifa Indexing in Search Engines Information Retrieval s two main stages: Indexing process
More informationHadoop IST 734 SS CHUNG
Hadoop IST 734 SS CHUNG Introduction What is Big Data?? Bulk Amount Unstructured Lots of Applications which need to handle huge amount of data (in terms of 500+ TB per day) If a regular machine need to
More informationEvaluating HDFS I/O Performance on Virtualized Systems
Evaluating HDFS I/O Performance on Virtualized Systems Xin Tang xtang@cs.wisc.edu University of Wisconsin-Madison Department of Computer Sciences Abstract Hadoop as a Service (HaaS) has received increasing
More informationJournal of science STUDY ON REPLICA MANAGEMENT AND HIGH AVAILABILITY IN HADOOP DISTRIBUTED FILE SYSTEM (HDFS)
Journal of science e ISSN 2277-3290 Print ISSN 2277-3282 Information Technology www.journalofscience.net STUDY ON REPLICA MANAGEMENT AND HIGH AVAILABILITY IN HADOOP DISTRIBUTED FILE SYSTEM (HDFS) S. Chandra
More informationDistributed Apriori in Hadoop MapReduce Framework
Distributed Apriori in Hadoop MapReduce Framework By Shulei Zhao (sz2352) and Rongxin Du (rd2537) Individual Contribution: Shulei Zhao: Implements centralized Apriori algorithm and input preprocessing
More informationDATA MINING WITH HADOOP AND HIVE Introduction to Architecture
DATA MINING WITH HADOOP AND HIVE Introduction to Architecture Dr. Wlodek Zadrozny (Most slides come from Prof. Akella s class in 2014) 2015-2025. Reproduction or usage prohibited without permission of
More informationIntroduction to Parallel Programming and MapReduce
Introduction to Parallel Programming and MapReduce Audience and Pre-Requisites This tutorial covers the basics of parallel programming and the MapReduce programming model. The pre-requisites are significant
More informationHadoop and Map-Reduce. Swati Gore
Hadoop and Map-Reduce Swati Gore Contents Why Hadoop? Hadoop Overview Hadoop Architecture Working Description Fault Tolerance Limitations Why Map-Reduce not MPI Distributed sort Why Hadoop? Existing Data
More informationManifest for Big Data Pig, Hive & Jaql
Manifest for Big Data Pig, Hive & Jaql Ajay Chotrani, Priyanka Punjabi, Prachi Ratnani, Rupali Hande Final Year Student, Dept. of Computer Engineering, V.E.S.I.T, Mumbai, India Faculty, Computer Engineering,
More informationParallel Data Mining and Assurance Service Model Using Hadoop in Cloud
Parallel Data Mining and Assurance Service Model Using Hadoop in Cloud Aditya Jadhav, Mahesh Kukreja E-mail: aditya.jadhav27@gmail.com & mr_mahesh_in@yahoo.co.in Abstract : In the information industry,
More informationA REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM
A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM Sneha D.Borkar 1, Prof.Chaitali S.Surtakar 2 Student of B.E., Information Technology, J.D.I.E.T, sborkar95@gmail.com Assistant Professor, Information
More informationIntroduction to MapReduce and Hadoop
Introduction to MapReduce and Hadoop Jie Tao Karlsruhe Institute of Technology jie.tao@kit.edu Die Kooperation von Why Map/Reduce? Massive data Can not be stored on a single machine Takes too long to process
More informationPrepared By : Manoj Kumar Joshi & Vikas Sawhney
Prepared By : Manoj Kumar Joshi & Vikas Sawhney General Agenda Introduction to Hadoop Architecture Acknowledgement Thanks to all the authors who left their selfexplanatory images on the internet. Thanks
More information