Comparison of Different Implementations of Inverted Indexes in Hadoop




Hediyeh Baban, S. Kami Makki, and Stefan Andrei
Department of Computer Science, Lamar University, Beaumont, Texas
(hbaban, kami.makki, stefan.andrei)@lamar.edu

ABSTRACT

There is a growing trend of applications that need to handle Big Data, as many corporations and organizations are required to collect ever more data from their operations. Processing Big Data using the MapReduce structure has recently become very popular, because traditional data warehousing solutions are not feasible for handling such datasets. Hadoop provides an environment for executing MapReduce programs over distributed-memory clusters, which supports the processing of large datasets in a distributed computing environment. Information retrieval systems facilitate searching the content of books or journals based on metadata or indexing. An inverted index is a data structure that stores a mapping from content, such as words or numbers, to its locations in one or more documents. In this paper we propose three different implementations of inverted indexes (Indexer, IndexerCombiner, and IndexerMap) in the Hadoop environment using the MapReduce programming model, and compare their performance to evaluate the impact of factors such as data format and output file format in MapReduce.

KEYWORDS

Big Data; Cluster; E-book; Hadoop; Inverted Index.

1 INTRODUCTION

The proliferation of social networking sites, combined with the seamless interconnection of many everyday basic services such as healthcare, has caused the accumulation of large amounts of data. Dealing with terabytes or even petabytes of data has therefore become a necessary daily process for many companies [1, 2], and these companies are building environments for handling Big Data processing so that they can remain in business. With the development of Cloud and IT technologies, processing Big Data requires more advanced and sophisticated applications. New programming models such as Google's MapReduce [3] and Hadoop [4] are therefore gaining popularity through their fast and efficient data-intensive processing abilities [3]. MapReduce is one of the most popular programming models for processing large datasets, and Hadoop is Apache's free and open-source implementation of MapReduce.

Inverted index structures are a core element of current text retrieval systems. They can be constructed quickly using offline approaches in which one or more passes are made over a static set of input data; at the completion of the process, standard indexing methods support search queries in a variety of content-based applications [5]. Most modern search engines use some form of inverted index to process user-submitted queries. In its most basic form, an inverted index is a simple hash table that maps words in the documents to some sort of document identifiers [4].

In this paper, we implement three different inverted indexes (Indexer, IndexerCombiner, and IndexerMap) for electronic documents (e.g., E-books) using the MapReduce paradigm in the Hadoop environment. The main goal is to identify and evaluate the factors that have an impact on the execution time of these three implementations.

2 PROBLEM DESCRIPTIONS

An efficient method for searching the contents of E-books has become necessary with the ever-increasing use of electronic readers. Given a user query, a searching method searches terabytes of textual data, retrieves the relevant pages, and sends them back to the user. However, this strategy is problematic for processing several large documents, as retrieving relevant information in a Big Data environment is a huge computational task: it may take several days just to read in the data on a stand-alone computer. This problem can be avoided, and Big Data processed more efficiently, using large clusters of computers. However, in large clusters, machines fail occasionally and data can be lost or corrupted, so programmers must concern themselves with a variety of processing errors, communication failures, etc. In the Hadoop environment, by contrast, files are divided into uniform blocks and distributed across the nodes of a cluster. The Hadoop Distributed File System (HDFS) performs the data block placement on different nodes, and the information stored in the data blocks is replicated for performance and to handle hardware failure. HDFS also keeps checksums of the data for corruption detection and recovery.

3 RELATED WORKS

Information retrieval is the process of finding relevant information in a set of information resources. Using an information retrieval system, one can search the content of books, journals, or other documents based on metadata or on full-text indexing [4]. Information retrieval is therefore organized into two main processes: indexing and retrieval. The indexing process involves preprocessing a collection of documents and storing a description of it as an index. The retrieval process involves issuing a query against that index to find the documents relevant to the query. To handle queries, search engines need fast access to all documents containing the set of search terms [5].

3.1 Inverted Index

An inverted index is an index structure that stores a mapping from content to its location in a file [6]. The inverted index is used in a variety of applications, such as document retrieval systems (e.g., search engines) [5, 6], and it improves the management and retrieval time for huge amounts of information [6]. There are two main variants of inverted indexes [7]:

1) The inverted file index, which contains a list of references to documents for each word.

2) The inverted list index, which contains the positions of each word within a document.

An inverted index can also store the position of each word or term within a document [5]. Figure 1 shows the structure of an inverted index.

[Figure 1. Simple design of an inverted index: a column of words (word1, word2, word3) mapped to a column of postings lists of (document id, payload) pairs.]

This structure contains one column for the words or terms in the document or series of documents, and another column for the postings lists. A postings list, as shown in Figure 1, is comprised of individual postings, each of which consists of a document id and information about the occurrences of the word or term in the document(s) (the payload). In a simple inverted index, no information is needed in the postings column other than the document id; the existence of the posting itself indicates the occurrence of the word or term in the document. The most common payload, however, is the number of times a word or term occurs in the document (i.e., the term frequency). More complex payloads also include the positions of every occurrence of the word in the document. Furthermore, properties of the word (such as whether it occurred on a specific page or not) allow document ranking based on specific characteristics or notions of importance. In this paper, the payload indicates the word frequency and the location of the word in the document [3].
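
To make the structure concrete, the following is a minimal in-memory sketch of such an index in Java, with the term frequency as the payload. The class and member names (Posting, InvertedIndex, add, lookup) are our own illustration, not part of the implementations discussed later:

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // One posting: a document id plus a payload (here, the term frequency).
    class Posting {
        final String docId;
        int frequency;

        Posting(String docId) {
            this.docId = docId;
            this.frequency = 1;
        }
    }

    // Maps each word to its postings list, one posting per document.
    class InvertedIndex {
        private final Map<String, List<Posting>> index = new HashMap<String, List<Posting>>();

        // Record one occurrence of word in the document docId.
        void add(String word, String docId) {
            List<Posting> postings = index.get(word);
            if (postings == null) {
                postings = new ArrayList<Posting>();
                index.put(word, postings);
            }
            for (Posting p : postings) {
                if (p.docId.equals(docId)) {
                    p.frequency++;
                    return;
                }
            }
            postings.add(new Posting(docId));
        }

        // Return the postings list for a word (empty if the word never occurred).
        List<Posting> lookup(String word) {
            List<Posting> postings = index.get(word);
            return postings == null ? new ArrayList<Posting>() : postings;
        }
    }
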
3.2 Big Data

Traditionally, large computational problems are best tackled through a divide-and-conquer approach. The basic idea is to partition the problem into smaller sub-problems, which can then be distributed to multiple machines and processed in parallel [3, 4]. The same strategy can be applied to processing Big Data. Although this strategy improves performance, the failure of one machine in such an environment can jeopardize the result. In a distributed environment, data may not arrive at a particular point in time due to unexpected network congestion. Individual compute nodes may overheat, crash, experience hard drive failures, or run out of memory or disk space. Nodes may fail to synchronize, so locks on files may not be released on time; nodes involved in distributed atomic transactions may lose their network connections; and so on. In each of these cases, a distributed system must provide the proper mechanisms to recover from such failures or transient error conditions and continue to make progress.

3.3 MapReduce

In recent years, MapReduce programming has become progressively more popular for processing Big Data. Google uses MapReduce in hundreds of applications on thousands of machines with terabytes of data [3]. MapReduce is a programming model for performing distributed computations on huge datasets, and an execution framework for large-scale data processing on clusters of computers [8, 9]. It was originally developed by Google based on the principles of parallel and distributed processing [8, 9]. MapReduce has become popular via its open-source implementation Hadoop, which was developed by Yahoo! and is now an Apache project [4].

One of the most important ideas in MapReduce is to separate the distributed processing, referred to as a job, from the other activities involved in executing it. The programmer submits the job to the submission node of a cluster (in Hadoop, the job tracker [4]), and the execution framework (runtime) then takes care of everything else: it handles all other aspects of distributed code execution, whether on one node or a few hundred nodes. Each job is divided into smaller parts called tasks. For example, a map task may be responsible for processing a certain block of input key-value pairs (called an input split in Hadoop), and a reduce task may handle a portion of the intermediate key space.

MapReduce also has desirable fault-tolerance properties. Since tasks are executed independently, they do not depend on one another for intermediate results. If the system detects the failure of map tasks running on a node (a worker node), those tasks can simply be re-executed. This gives MapReduce the flexibility to handle large-scale worker failures [3].

One of the most significant advantages of MapReduce is that it hides many system-level details from the programmer, which reduces the complexity of processing large datasets. Its handling of synchronization, the mechanism by which multiple concurrently running processes join together, is another great feature. MapReduce achieves flexible scalability by organizing blocks of data across the different nodes of a cluster: the runtime system automatically divides the input dataset into (as nearly as possible) equal-sized data blocks and dynamically sends each block to an available compute node for execution.

A MapReduce program consists of two components: one that implements the mapper function and one that implements the reducer function, as the job driver sketch below illustrates.
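
The following is a minimal, hedged sketch of how such a job might be configured and submitted through Hadoop's Java API. The driver class name is our own, and IndexerMapper and IndexerReducer stand for the implementation classes described in Section 4 (a sketch of them appears there):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class IndexerDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "inverted-index");
            job.setJarByClass(IndexerDriver.class);

            // The two components of a MapReduce program.
            job.setMapperClass(IndexerMapper.class);
            job.setReducerClass(IndexerReducer.class);

            // Input format and output key-value types.
            job.setInputFormatClass(TextInputFormat.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(Text.class);

            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));

            // The runtime handles scheduling, data movement, and failures.
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }
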
The first phase of a MapReduce program is called mapping. A list of data elements is provided, one at a time, to a Mapper function, which transforms each element of the input into an output element. That is, the Mapper takes an input pair and produces a set of intermediate key-value pairs. To process a dataset using MapReduce, the programmer defines the mapper and reducer functions as follows [9]:

    Map:    (k1, v1) → [(k2, v2)]
    Reduce: (k2, [v2]) → [(k3, v3)]

Every value (v) has a key (k) associated with it; each key identifies a value, and these key-value pairs form the basic data structure in MapReduce. The keys and values may be primitives such as integers, floating-point values, and strings, or they may be complex structures. Programmers usually need to define their own data types, although libraries such as Protocol Buffers, Thrift, and Avro may simplify the task.
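
In Hadoop's Java API these signatures appear as generic type parameters. As a small illustrative sketch (the class names MyMapper and MyReducer are placeholders, and we assume plain text read line by line):

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Mapper<k1, v1, k2, v2>: with TextInputFormat, k1 is the byte offset
    // of a line (LongWritable) and v1 is the line itself (Text).
    class MyMapper extends Mapper<LongWritable, Text, Text, Text> {
        // map(k1, v1) emits zero or more intermediate (k2, v2) pairs
        // via context.write(...).
    }

    // Reducer<k2, v2, k3, v3>: receives each k2 with the list [v2] of all
    // values produced for it, and emits the final (k3, v3) pairs.
    class MyReducer extends Reducer<Text, Text, Text, Text> {
    }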

3.4 Hadoop Limitations

Each node in a Hadoop cluster typically has a few gigabytes of memory. If the input dataset is several terabytes, holding the data in RAM would require a thousand or more computers, and no single machine would be able to process all of the data. Hard drives are much larger, and a single machine can these days hold multiple terabytes of information on its hard drives, but the intermediate results generated by a large-scale data processing task can quickly fill up more space than the original input required. During processing, some of the hard drives employed by the system may become full, and the system may need to route data to other nodes that can store the overflow. Finally, bandwidth is limited even on an internal network. While a set of nodes may be directly connected by a high-capacity network link, if all of the nodes transmit multi-gigabyte datasets at once they can easily saturate the network's capacity, and remote procedure calls and other data transfer requests sharing a channel may be delayed or simply dropped. To remain robust, a large-scale distributed system must therefore manage its resources efficiently: it must allocate some of its resources to maintaining the system as a whole, while devoting as much time as possible to the actual computation.

4 IMPLEMENTATION

This section presents the implementation of the E-book content indexing system, which realizes the inverted index using MapReduce within the Hadoop environment. We show three different implementations of the inverted index in MapReduce: Indexer, IndexerCombiner, and IndexerMap.

4.1 Indexer

The Indexer is the simplest implementation of the inverted index. Like any other MapReduce program, it consists of two classes (Map and Reduce). For the input of the mapper class, Hadoop splits the data based on the input format specified in the job configuration. In this project the input format is TextInputFormat, which means each file in the dataset has at least one split; a split should be no larger than 64 MB (the default block size), so if a file is larger than the block size, Hadoop divides it into more than one split. These splits are the inputs to the mapper class. In the map class, the mapper function tokenizes the words within a file's text and obtains the name of the file. It then emits a <word, file name> pair for each word occurrence and forwards these pairs to the reducers. The output of the mapper is also called the intermediate key [10].

4.2 IndexerCombiner

In the IndexerCombiner implementation, the mapper class has the extra responsibility of partially reducing its own output during the map phase. This reduction is called a combiner. Instead of emitting a <word, file name> pair for every word it finds, the mapper counts the occurrences of each word in its split and emits a <word, <file name: count>> pair, so the number of intermediate keys is smaller than in Indexer. The reducer class of IndexerCombiner is very similar to the reducer class of Indexer; however, it handles the list of values differently. A sketch of the Indexer pair, and of where a combiner would fit, is given below.
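
The paper does not reproduce the source of these classes; the following is a minimal sketch of how the Indexer pair described above might look, where the tokenization and output formatting details are our assumptions:

    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;
    import java.util.StringTokenizer;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileSplit;

    // Emits a <word, file name> pair for every word occurrence in a split.
    class IndexerMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            String fileName = ((FileSplit) context.getInputSplit()).getPath().getName();
            StringTokenizer tokens = new StringTokenizer(line.toString());
            while (tokens.hasMoreTokens()) {
                context.write(new Text(tokens.nextToken().toLowerCase()), new Text(fileName));
            }
        }
    }

    // For each word, counts how many times it occurred in each file and
    // emits the postings list as a single line of text.
    class IndexerReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text word, Iterable<Text> fileNames, Context context)
                throws IOException, InterruptedException {
            Map<String, Integer> counts = new HashMap<String, Integer>();
            for (Text f : fileNames) {
                String name = f.toString();
                Integer c = counts.get(name);
                counts.put(name, c == null ? 1 : c + 1);
            }
            context.write(word, new Text(counts.toString()));
        }
    }

In the IndexerCombiner variant, a combiner class registered via job.setCombinerClass(...) would aggregate the per-split counts on the map side, which is what shrinks the intermediate data.
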
4.3 IndexerMap

Apache Hadoop's SequenceFile provides a data structure for key-value pairs. This structure is append-only: a specific element cannot be edited or removed from the file. A MapFile is a directory that contains two SequenceFiles: the data file ("/data") and the index file ("/index"). The data file contains all the key-value pairs in sorted order. The index file contains keys together with a LongWritable (a Hadoop-defined type) that records the starting byte position of the corresponding record; it does not contain every key, but only a fraction of the keys, in order to save memory.

[Figure 2. The layout of SequenceFile and MapFile files: a SequenceFile holds a flat sequence of key-value pairs in its data file, while a MapFile pairs a sorted data file of key-value pairs with an index file of keys.]
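
As an illustration of this on-disk structure (a sketch only, using the Hadoop 1.x-era constructors; the paper's implementation produces its MapFile through the job's output format rather than directly), a MapFile can be written and probed with Hadoop's I/O classes:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.io.MapFile;
    import org.apache.hadoop.io.Text;

    public class MapFileDemo {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            String dir = "index.map";  // a directory holding /data and /index

            // Keys must be appended in sorted order, since the data file is sorted.
            MapFile.Writer writer = new MapFile.Writer(conf, fs, dir, Text.class, Text.class);
            writer.append(new Text("hadoop"), new Text("book1.txt: 12"));
            writer.append(new Text("index"), new Text("book2.txt: 4"));
            writer.close();

            // The index file allows random access by key without a full scan.
            MapFile.Reader reader = new MapFile.Reader(fs, dir, conf);
            Text value = new Text();
            reader.get(new Text("hadoop"), value);
            System.out.println(value);
            reader.close();
        }
    }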

Similar to the Indexer's reducer, the IndexerMap reducer creates a HashMap and counts the occurrences of each file name in the list of values. Hadoop provides several OutputFormat instances for writing to files. The basic (default) instance is TextOutputFormat, which writes the key-value pairs on individual lines of a text file; IndexerMap instead writes its output as a MapFile (Figure 2).

5 TESTING AND RESULTS

In this project, we developed two series of implementations over two different datasets, and the experiments were run on two nodes. The reason for using two different datasets is to demonstrate the effect of the data itself on computation time and memory allocation. We tested our implementations on 1 Gigabyte of each dataset; the reason for using only a fraction of the whole data was the limited HDFS capacity of our nodes, which were not dedicated to Hadoop but were dual-boot systems with limited capacity. At the end we compare the results.

5.1 Datasets

The first dataset was collected from the Gutenberg project [11]. This project offers over 42,000 free E-books in different formats (text, epub, and html), and annually offers DVD or CD images that contain a large number of E-books. We downloaded the April 2010 dual-layer DVD image, which includes over 26,000 text files as E-books; no file was larger than 500 KB, and the overall size of the dataset was 7 Gigabytes.

The second dataset was collected from WestburyLab (2010) [12], a project developed to serve researchers in natural language processing. The data we collected from this project was the Wikipedia corpus, created from a snapshot of all the articles in the English part of Wikipedia taken in April 2010. This dataset is a single large text file containing all of the articles; its size was 6 Gigabytes.

5.2 Metrics and Measured Items

In order to compare the different implementations, we consider the following measurements:

Execution time: the CPU time taken to complete a MapReduce job, and the CPU time taken to complete each phase (mapping/reducing) individually.

Size of the output file: the size of the resulting index files for each implementation.

5.3 Web Interfaces

The Hadoop environment gives the user the ability to track the NameNode status, the running and completed jobs, and the running tasks, through a web page user interface. The user can view the status of each part simply by connecting to the corresponding page; we used this facility to obtain the results for this project. Additionally, it allows the user to browse the HDFS namespace and view the files inside.

6 DISCUSSIONS

First, we ran the test on 1 Gigabyte of the Gutenberg project dataset, which consists of 3,500 small files. When we ran the test for each of the implementations, Hadoop launched 3,500 map tasks (one task per file) and 1 reduce task. Figure 3 shows the execution time of each implementation for this test.

[Figure 3. Comparison chart of execution time using the WestburyLab dataset.]

According to the test results, the IndexerCombiner class has the lowest execution time, with IndexerMap in second place. Because IndexerCombiner partially reduces the output of the mapper class, the reducer has fewer records to process, and since we have only one reducer, having fewer records affects the reducer's computation significantly. IndexerMap performs the mapping similarly to Indexer, but writing a MapFile takes more time than writing a text file as output; nevertheless, the total execution time is still higher for Indexer.

[Figure 4. Execution time comparison by phase using the Gutenberg dataset.]

Figure 4 shows the execution time of each phase individually. Since the reducer starts while the mapper is still running, adding the execution times of the phases does not give the total execution time. The Indexer and IndexerMap classes spent almost the same amount of time in mapping, with IndexerMap spending slightly more due to contention for computation power (reducing takes more memory in IndexerMap). As mentioned, the largest amount of reducing time belongs to IndexerMap, because creating the MapFile takes more time. Although IndexerCombiner performs more operations in its mapping phase, its mapping phase takes less time than that of Indexer and IndexerMap: fewer intermediate records and a lighter reducing load leave more computation power for the mapping phase and finish the job sooner.

7 CONCLUSIONS

This paper has presented three implementations of an E-book and online document content indexing system, which realizes an inverted index using MapReduce programming within the Hadoop environment, together with an analysis comparing them. Specifically, we compared three implementations of the inverted index in MapReduce: Indexer, IndexerCombiner, and IndexerMap. The Hadoop environment has a huge impact on the overall performance of the inverted index: using only one machine without any parallel processing, the same task load on datasets of the given size would take far longer.

Hadoop provides several output format instances for writing to files. The default instance is TextOutputFormat, which writes (key, value) pairs on individual lines of a text file. Another format is SequenceFile, which stores the (key, value) pairs in a binary format, as shown in Figure 2. A special type of SequenceFile is the MapFile, which adds an index structure to the output and thus gives faster access to the generated data; we found, however, that a MapFile takes much more space than a plain text file.

Overall, reducing the number of messages between the mapper tasks and the reducer tasks had the greatest effect on the execution time of the indexers. In addition, the type of output file affects the size of the output: a MapFile is much larger than the corresponding text output, and generating it takes more execution time than generating a text file.

8 REFERENCES

[1] A. Thusoo, J. Sen Sarma, N. Jain, Z. Shao, P. Chakka, N. Zhang, S. Antony, H. Liu, and R. Murthy, "Hive - a petabyte scale data warehouse using Hadoop," in Proceedings of the International Conference on Data Engineering (ICDE '10), pp. 996-1005, 2010.

[2] N. Dzugan, L. Fannin, and S. K. Makki, "A recommendation scheme utilizing collaborative filtering," in Proceedings of the 8th International Conference for Internet Technology and Secured Transactions (ICITST), pp. 96-100, 2013.

[3] J. Dean and S. Ghemawat, "MapReduce: simplified data processing on large clusters," Communications of the ACM, pp. 107-113, 2008.

[4] Yahoo, "Apache Hadoop," 2010. [Online]. Available: http://developer.yahoo.com/hadoop/tutorial/module4.html. [Accessed June 2013].

[5] J. Zobel, A. Moffat, and K. Ramamohanarao, "Inverted files versus signature files for text indexing," ACM Transactions on Database Systems (TODS), pp. 453-490, 1998.

[6] X. Liu, "Efficient maintenance scheme of inverted index for large-scale full-text retrieval," in Proceedings of the 2nd International Conference on Future Computer and Communication (ICFCC), Wuhan, 2010.

[7] X. Liu, "An efficient random access inverted index for information retrieval," in Proceedings of the 19th International Conference on World Wide Web, Raleigh, North Carolina, USA, ACM, pp. 1153-1154, 2010.

[8] J. Dean and S. Ghemawat, "MapReduce: simplified data processing on large clusters," in Proceedings of the 6th Symposium on Operating System Design and Implementation (OSDI 2004), pp. 137-150, San Francisco, California, 2004.

[9] J. Lin, "Exploring large data issues in the curriculum: a case study with MapReduce," in Proceedings of the Third Workshop on Issues in Teaching Computational Linguistics (TeachCL '08) at ACL 2008, pp. 54-61, Columbus, Ohio, 2008.

[10] D. Jiang, B. C. Ooi, L. Shi, and S. Wu, "The performance of MapReduce: an in-depth study," in Proceedings of VLDB, pp. 472-483, 2010.

[11] Project Gutenberg. [Online]. Available: http://www.gutenberg.org/. [Accessed 3 June 2013].

[12] WestburyLab. [Online]. Available: http://www.psych.ualberta.ca/~westburylab/. [Accessed 3 June 2013].