Analysing Large Web Log Files in a Hadoop Distributed Cluster Environment

S Saravanan, B Uma Maheswari
Department of Computer Science and Engineering, Amrita School of Engineering, Amrita Vishwa Vidyapeetham, Bangalore, India.
saranpons3@gmail.com, umarahul@rediffmail.com

Abstract

Analysing web log files has become an important task for E-Commerce companies that want to predict customer behaviour and improve their business. Each click on an E-Commerce web page creates about 100 bytes of data. Large E-Commerce websites like flipkart.com, amazon.in and ebay.in are visited by millions of customers simultaneously. As a result, these customers generate petabytes of data in the web log files. Because the web log files are so large, processing them requires parallel processing and a reliable data storage system. The Hadoop framework meets both requirements: it provides the Hadoop Distributed File System (HDFS) and the MapReduce programming model for processing huge datasets efficiently and effectively. In this paper, a NASA web log file is analysed using the Hadoop framework to calculate the total number of hits received by each web page of a website and the total number of hits received by the website in each hour, and it is shown that the Hadoop framework takes less response time to produce accurate results.

Keywords - Hadoop, MapReduce, Log Files, Parallel Processing, Hadoop Distributed File System, E-Commerce

1. Introduction

E-Commerce is a rapidly growing industry all over the world. The biggest challenge for most E-Commerce businesses is to collect, store, analyse and organize data from multiple data sources. There is certainly a lot of data waiting to be analysed, and it is a daunting task for some E-Commerce businesses to make sense of it all [1]. One kind of data that has to be analysed in an E-Commerce business is the web log file. A web log file contains the following details: the IP address of the computer making the request (i.e. the visitor), the date and time of the hit, the request method, the location and name of the requested file, the HTTP status code, the size of the requested file, and so on.

Mining the web log file, which is called Web Usage Mining, helps E-Commerce companies increase their profits because it lets them predict the behaviour of their online customers. With these predictions, E-Commerce companies can offer an online customer a personalized experience, including content and promotions. They can also provide product recommendations to customers based on their browsing behaviour. E-Commerce companies can do a lot more by mining the web log file.

As the number of customers visiting E-Commerce web sites increases, the size of the web log file increases as well, and nowadays it is measured in petabytes. Pattern discovery data mining techniques for analysing web log files already exist. These techniques store the web log file in a traditional DBMS and analyse it there. But in the current scenario, the number of online customers increases day by day, and each click from a web page creates on the order of 100 bytes of data in a typical website log file [2]. Consequently, large websites handling millions of simultaneous visitors can generate hundreds of gigabytes or even terabytes of logs per day. For example, eBay processes petabytes of data stored in web log files to create a better shopping experience. So, to analyse such big web log files efficiently and effectively, we need to develop faster, parallel and scalable data mining algorithms. We also need a cluster of storage devices to store petabytes of web log data and a parallel computing model for analysing it.
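To make the scale concrete, the arithmetic behind the "100 bytes per click" figure from [2] can be written out. This is only a back-of-the-envelope sketch: the click counts below are hypothetical traffic levels chosen for illustration, not measurements from any website, and the helper names are invented here.

```python
BYTES_PER_CLICK = 100  # order-of-magnitude figure quoted from [2]


def daily_log_bytes(clicks_per_day, bytes_per_click=BYTES_PER_CLICK):
    """Return the number of log bytes written per day."""
    return clicks_per_day * bytes_per_click


def human(n_bytes):
    """Format a byte count using decimal units (1 KB = 1000 B)."""
    units = ["B", "KB", "MB", "GB", "TB", "PB"]
    value = float(n_bytes)
    for unit in units:
        if value < 1000 or unit == units[-1]:
            return f"{value:.1f} {unit}"
        value /= 1000


# Hypothetical traffic levels (assumptions, not data from the paper).
for clicks in (10**9, 10**10, 10**11):
    print(f"{clicks:>15,d} clicks/day -> {human(daily_log_bytes(clicks))} of logs/day")
```

Under these assumptions a billion clicks a day produces about 100 GB of logs, and a site logging 10 TB a day accumulates a petabyte in roughly 100 days, which is the kind of volume that pushes the analysis beyond a single machine.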
The Hadoop framework provides a reliable cluster storage facility that keeps large web log files in a distributed manner, and a parallel processing feature to process them efficiently and effectively. The remainder of the paper is organized as follows. Section 2 summarizes the related work. In Section 3, the system architecture is discussed. Section 4 presents the proposed scheme. Section 5 discusses the experimental results, and Section 6 concludes the paper.

2. Related work

In [3], the SQL DBMS and Hadoop MapReduce are compared and it is suggested that Hadoop MapReduce performs better than the SQL DBMS. In [4], it is mentioned that a traditional DBMS cannot handle a large dataset, so Big Data technologies like the Hadoop framework are needed. Hadoop MapReduce [4][5][6] is used in many areas for big data analysis. Hadoop is a good platform for analysing web log files, as their size keeps increasing [7][8].

Apache Hadoop is an open-source project created by Doug Cutting and developed by the Apache Software Foundation. The Hadoop platform allows us to store large-scale data on thousands of nodes and analyse it. As described in [5], a Hadoop cluster generally has thousands of nodes which store multiple blocks of log files. Hadoop fragments log files into blocks, and these blocks are evenly distributed over hundreds of nodes in a Hadoop cluster. It also replicates these blocks over multiple nodes to achieve reliability and fault tolerance. MapReduce achieves parallel computation by breaking an analysis job into a number of tasks.

3. System architecture

Figure 1 shows the cluster configuration of the Hadoop system implemented in this paper. There are 2 nodes in the cluster: one is called the master node and the other the slave node. The architecture is divided into two layers, the HDFS layer and the MapReduce layer. The Hadoop Distributed File System (HDFS) is a Java-based file system that provides scalable and reliable data storage and is designed to span large clusters of commodity servers [9]. The MapReduce layer reads data from and writes data to HDFS storage and processes the data in parallel.

The NameNode keeps track of how the web log file is broken down into file blocks and which nodes store those blocks. The secondary NameNode periodically reads the HDFS file system changes log and applies the changes to the fsimage file. The DataNode stores the replicated blocks of the web log file. The JobTracker determines the execution plan by deciding which files are to be processed, assigns nodes to different tasks, and keeps track of all tasks as they are running. The TaskTracker is responsible for the execution of individual tasks on each slave node.

Figure 1. Two node Hadoop cluster system architecture (master node and slave node; MapReduce layer with Job Tracker and Task Trackers; HDFS layer with Name Node and Data Nodes)

4. Proposed scheme

4.1. Calculating the total number of hits received by each URL

Figure 2. Calculating total number of hits received by each URL (stages: input file, split into File Block1 to File BlockN, map, shuffle, reduce, output)
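The calculation this subsection describes (map each log line to a (key, 1) pair, shuffle the pairs into groups, reduce each group to a sum) can be tried without a cluster. The following is a minimal single-process sketch in plain Python, not the authors' Hadoop job: the regular expression assumes the Common Log Format, since the paper does not give its own pattern, and the sample lines and all function names are invented for illustration. The key extractor is a parameter, so the same code covers grouping by URL and grouping by hour.

```python
import re
from collections import defaultdict
from itertools import chain

# Assumed Common Log Format pattern; the paper does not show its regular expression.
LOG_PATTERN = re.compile(
    r'^(?P<host>\S+) \S+ \S+ '
    r'\[[^:\]]+:(?P<hour>\d{2}):\d{2}:\d{2} [^\]]+\] '
    r'"(?P<method>\S+) (?P<url>\S+)[^"]*" (?P<status>\d{3}) (?P<size>\S+)$'
)


def url_key(match):
    """Key for the hits-per-URL calculation."""
    return match.group("url")


def hour_key(match):
    """Key for the hits-per-hour calculation, e.g. 'hour0' ... 'hour23'."""
    return "hour" + str(int(match.group("hour")))


def map_block(lines, key_fn):
    """Map: emit (key, 1) for every line of one block that parses."""
    for line in lines:
        match = LOG_PATTERN.match(line.strip())
        if match:
            yield key_fn(match), 1


def shuffle(pairs):
    """Shuffle: collect the values of all pairs that share a key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_groups(groups):
    """Reduce: sum the values of each group."""
    return {key: sum(values) for key, values in groups.items()}


def count_hits(blocks, key_fn):
    """Run map over every block, then shuffle and reduce the combined output."""
    mapped = chain.from_iterable(map_block(block, key_fn) for block in blocks)
    return reduce_groups(shuffle(mapped))


# Hypothetical log lines, made up for this sketch (not taken from the NASA log).
BLOCK_1 = [
    'host1.example.com - - [01/Jul/1995:00:00:01 -0400] "GET /index.html HTTP/1.0" 200 1024',
    'host2.example.com - - [01/Jul/1995:00:15:09 -0400] "GET /images/logo.gif HTTP/1.0" 200 512',
]
BLOCK_2 = [
    'host3.example.com - - [01/Jul/1995:09:30:00 -0400] "GET /index.html HTTP/1.0" 304 0',
]

hits_by_url = count_hits([BLOCK_1, BLOCK_2], url_key)
hits_by_hour = count_hits([BLOCK_1, BLOCK_2], hour_key)
print(hits_by_url)   # {'/index.html': 2, '/images/logo.gif': 1}
print(hits_by_hour)  # {'hour0': 2, 'hour9': 1}
```

In a real Hadoop job the blocks live on different nodes, the map calls run in parallel, and the shuffle also sorts the keys; the sketch keeps only the data flow.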

Figure 2 depicts the MapReduce function that processes the web log file and calculates the total number of hits received by each URL. The input to this function is a web log file. For each hit on the web site, a line is added to the web log file. Each line contains the following fields: client IP address, user name, server name, date, time, request method, requested resource, HTTP version, HTTP status and bytes sent. Example line from a NASA web log file:

in24.inetnebr.com - [01/Aug/1995:00:00: ] GET /shuttle/missions/sts-68/news/sts-68-mcc-05.txt HTTP/

The web log file is split into blocks by the Hadoop framework and stored in the 2 node cluster. Each block of the web log file is given as input to a map function, which parses each line using a regular expression and emits the URL as the key along with the value 1: (URL1,1), (URL2,1), (URL3,1), ..., (URLn,1). After mapping, the shuffle stage collects all the (key, value) pairs that have the same URL from the different map functions and forms a group. After this process, the Group1 entries are (URL1,1), (URL1,1), (URL1,1) and so on, and the Group2 entries are (URL2,1), (URL2,1) and so on. Then the reduce function calculates the sum for each URL group. The result of the reduce function is (URL1,SUM), (URL2,SUM), ..., (URLn,SUM).

4.2. Calculating the total number of hits received by a website in each hour

Figure 3 depicts the MapReduce function that processes the web log file and calculates the total number of hits received in every hour. The input to this function is a web log file, which is split into blocks. Each block is given as input to a map function, which parses each line using a regular expression and emits the hour as the key along with the value 1: (hour0,1), (hour1,1), (hour2,1), ..., (hour23,1). After mapping, the shuffle stage collects all the (key, value) pairs that have the same hour from the different map functions and forms a group. After this, Group1 is (hour0,1), (hour0,1), (hour0,1) and so on, and Group2 is (hour1,1), (hour1,1) and so on. The reduce function calculates the sum for each hour group. The result of the reduce function is (hour0,SUM), (hour1,SUM), ..., (hour23,SUM).

Figure 3. Calculating total number of hits received in every hour (stages: input file, split into File Block1 to File BlockN, map, shuffle, reduce, output)

5. Experimental results

This section discusses the results obtained from the experiment.

5.1. Experimental setup

To calculate the total number of hits received by each URL and by a web site in each hour, a 2 node Hadoop cluster is set up with the configuration shown in Table 1.

Table 1. System configuration

Operating System                 Ubuntu
Hadoop Version                   Hadoop
Number of nodes in the cluster   2
Dataset                          NASA Access Log (July 1 to July 31, 1995)
Dataset Size                     195 MB

5.2. Results of calculating the total number of hits received by each URL

Before executing the MapReduce code in the 2 node cluster environment, the web log file is loaded into the HDFS of the Hadoop framework. The first log was collected from 00:00:00 July 1, 1995 through 23:59:59 July 31, 1995, a total of 31 days [10]. Total number of hits in the web log file is

Figure 4 shows the contents of the output directory named no_of_hits_by_url in HDFS. The output is stored in a file called part_r_

Figure 5 shows a chunk of the output file which is generated when the MapReduce code for calculating the number of hits received by each URL is executed on the input web log file.

Figure 4. no_hits_by_url output directory in HDFS

When the MapReduce function to calculate the total number of hits received by each URL is executed, CPU time spent is Milliseconds. The number of map tasks launched is 3 and the number of reduce tasks launched is 1. The time taken by the map tasks is 32 seconds and by the reduce task 44 seconds.

Figure 5. A chunk of the number of hits received by each URL output file in HDFS

5.3. Results of calculating the total number of hits received by website in each hour

Figure 6. no_hits_by_hour output directory in HDFS

Figure 7. Output: Number of hits received in each hour

When the MapReduce function to calculate the total number of hits received by a website in each hour is executed, CPU time spent is Milliseconds. The number of map tasks launched to process the dataset is 3 and the number of reduce tasks launched is 1. The time taken by the map tasks is 38 seconds and by the reduce task 23 seconds.
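Both runs report 3 map tasks for the 195 MB input. One plausible explanation is the way Hadoop cuts an input file into splits, with one map task per split. The sketch below is a plain Python model of that rule, not the Hadoop source: it assumes a 64 MB split size (the paper does not state its block size) and the 10% tolerance that Hadoop's FileInputFormat applies before cutting a final short split. The function name is invented for this example.

```python
MB = 1024 * 1024
SPLIT_SLOP = 1.1  # tolerance applied by Hadoop's FileInputFormat to the last split


def input_splits(file_size, split_size):
    """Return the sizes of the input splits for one file.

    Full-size splits are cut while the remainder is more than
    SPLIT_SLOP times the split size; whatever is left becomes the
    final split, which may therefore be up to 10% larger than the rest.
    """
    splits = []
    remaining = file_size
    while remaining / split_size > SPLIT_SLOP:
        splits.append(split_size)
        remaining -= split_size
    if remaining > 0:
        splits.append(remaining)
    return splits


# 195 MB dataset from Table 1; the 64 MB block size is an assumption.
splits = input_splits(195 * MB, 64 * MB)
print(len(splits), [size // MB for size in splits])  # 3 [64, 64, 67]
```

Under these assumptions the file yields three splits of 64, 64 and 67 MB, which is consistent with the 3 map tasks reported. With a 128 MB split size the same function returns two splits, so the count depends on the cluster's configuration.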

Figure 6 shows the contents of the output directory named no_of_hits_by_hour in HDFS. The output is stored in a file called part_r_

Figure 7 shows the number of hits received by a web site in each hour. This output is generated in HDFS storage after executing the MapReduce code on the input web log file.

Figure 8. Pictorial representation of number of hits received in each hour

Figure 8 shows the pictorial representation of the number of hits received by a web site in each hour. From the graph, it can be seen that the maximum number of hits is received during the 9th hour.

6. Conclusion

A web log file is stored in a 2 node Hadoop distributed cluster environment and analysed. The response time taken to analyse the web log file is low because the file is broken into blocks, stored on the 2 node cluster and analysed in parallel. The MapReduce programming model of the Hadoop framework is used to analyse the web log file in parallel. In this paper, the total number of hits received by each URL and the total number of hits received by a website in each hour are calculated.

In the future, the number of nodes in the cluster can be increased, and data mining techniques such as recommendation, clustering and classification can be applied to the web log file stored in the Hadoop file system to extract useful patterns from it, so that E-Commerce companies can provide a better shopping experience to their online customers and increase their profits.

7. References

[1] Why Big Data is a must in E-Commerce, Guest post by Jerry Jao, CEO of Retention Science.
[2] 3 approaches to big data analysis with Apache Hadoop by Dave Jaffe.
[3] Andrew Pavlo, Erik Paulson, Alexander Rasin, Daniel J. Abadi, David J. DeWitt, Samuel Madden, Michael Stonebraker, (2009) A Comparison of Approaches to Large-Scale Data Analysis, ACM SIGMOD 09.
[4] Yogesh Pingle, Vaibhav Kohli, Shruti Kamat, Nimesh Poladia, (2012) Big Data Processing using Apache Hadoop in Cloud System, National Conference on Emerging Trends in Engineering & Technology.
[5] Tom White, (2009) Hadoop: The Definitive Guide. O'Reilly, Sebastopol, California.
[6] Apache-Hadoop,
[7] Jeffrey Dean and Sanjay Ghemawat, (2004) MapReduce: Simplified Data Processing on Large Clusters, Google Research Publication.
[8] Sayalee Narkhede and Tripti Baraskar, (2013) HMR Log Analyzer: Analyze Web Application Logs Over Hadoop MapReduce, International Journal of UbiComp (IJU), Vol.4, No.3, July 2013.
[9]
[10]