A Novel Parallel Architecture Design of Information Retrieval System for Scientific Papers


Aziz Murtazaev 1, Sanggil Kang 2 and Sangyoon Oh 3

1 Samsung Electronics, Suwon, South Korea
2 Department of Computer Science and Information Engineering, Inha University, Incheon, South Korea
3 School of Information and Communication, Ajou University, Suwon, South Korea

az.murtazaev@samsung.com, sgkang@inha.ac.kr, syoh@ajou.ac.kr

Abstract

Indexing converts a raw document collection into an easily searchable representation. Indexing at larger scales poses challenges such as how to distribute the indexing computation efficiently over a cluster of nodes. The MapReduce framework can be an effective tool for parallelizing tasks such as inverted index construction. We propose SciPDFindexer, a distributed information retrieval system for scientific articles in PDF. Given a large collection of scientific articles in PDF, our system parses and extracts metadata from the articles, and then indexes the extracted content using our proposed scheme. Our contribution is the design of a distributed IR system and an indexing scheme that improve overall indexing performance.

Keywords: distributed system, Hadoop, indexing, MapReduce, scientific articles, SciPDFindexer

1. Introduction

When searching over the full contents of a collection of documents, scanning them one by one is inefficient because of the long response time. Larger collections are therefore scanned, analyzed, and indexed before any query is run on them, which greatly reduces search response time. Since a single node may take intolerably long to perform large-scale indexing, a distributed set of nodes is usually employed for such tasks. Large-scale indexing poses the challenge of performing index construction efficiently on a distributed system. Google performs very well at indexing enormously large data: the number of indexed web pages is estimated at around 45 billion [1], and it still delivers sub-second query response times.

An indexing job can be performed efficiently on a distributed system because it lends itself to divide-and-conquer processing. One parallel processing technique suitable for this type of problem is the MapReduce programming model introduced by Google [2]. MapReduce has shown excellent scalability and performance, sorting 1 TB of data in 68 seconds on 1,000 machines and 1 PB of data in 6 hours and 2 minutes on 4,000 machines [3]. MapReduce provides a simple interface to programmers in the form of map() and reduce() functions, while the underlying framework handles parallelization issues such as splitting the input data, moving intermediate data to the corresponding nodes, and sorting and grouping intermediate keys. Although MapReduce scales automatically, the choice of key-value pairs and how they are processed in the map and reduce phases affect overall job performance.
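To make the model concrete, the following is a minimal sketch of the canonical inverted-index construction in Hadoop's Java MapReduce API. It is illustrative only and is not the scheme proposed in this paper (our scheme appears in Section 3); the class names and the assumption that the input format delivers <docId, text> pairs (e.g., via KeyValueTextInputFormat) are ours.

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map: for every term occurrence in a document, emit a <term, docId> pair.
public class InvertedIndexMapper extends Mapper<Text, Text, Text, Text> {
    @Override
    protected void map(Text docId, Text body, Context context)
            throws IOException, InterruptedException {
        for (String token : body.toString().toLowerCase().split("\\W+")) {
            if (!token.isEmpty()) {
                context.write(new Text(token), docId);
            }
        }
    }
}

// Reduce: the framework sorts and groups pairs by term; concatenate the
// grouped document IDs into the term's posting list.
class InvertedIndexReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text term, Iterable<Text> docIds, Context context)
            throws IOException, InterruptedException {
        StringBuilder postings = new StringBuilder();
        for (Text id : docIds) {
            if (postings.length() > 0) postings.append(',');
            postings.append(id.toString());
        }
        context.write(term, new Text(postings.toString()));
    }
}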

In the case of indexing, most indexing schemes resemble each other, but the way the documents are indexed differs from one scheme to another. We believe that choosing the right scheme is important for indexing efficiency, measured as indexing throughput.

Information retrieval (IR) for scientific papers is not as thoroughly researched a domain as general IR. Moreover, some specifics need to be considered when designing IR systems for scientific papers. One such specific is the document structure, such as title, abstract, and body; recent scientific papers are usually provided as PDFs, in various layouts, and need to be converted to a proper textual format before their text content can be analyzed. Even though various studies on IR systems for scientific papers have been conducted, few consider the architecture as a whole system and describe all aspects of the design, from parsing to indexing to querying, in detail. This whole-system view is important because the parts of the architecture are interrelated: how we parse and what structure we obtain from parsing affect how we index documents, and the way we index documents and the index structures we choose affect querying performance. Among the few, notable examples are NEC's CiteSeer (and its descendant CiteSeerX), Google Scholar, and MS Academic Search, all of which index academic literature in electronic formats (e.g., PostScript files) [4].

We propose an IR system, SciPDFindexer, for parsing, indexing, and querying scientific articles in PDF. Given a large corpus of scientific articles in PDF, the proposed system parses and extracts the article contents along with additional metadata, such as the title and abstract. Next, it indexes the extracted contents using the MapReduce framework on a distributed system. Our querying system, to which we also applied parallelism using a distributed database, enables free-text querying over the resulting indices. Our main focus in this work is indexing performance, achieved by designing an efficient distributed indexing algorithm.

The rest of the paper is organized as follows. Background and related work are described in Section 2. In Section 3, we discuss the design and implementation of the SciPDFindexer system. We conclude our work and provide insights into future work in Section 4.

2. Background and Related Work

Our research is related to several disciplines, such as parallel computing with the MapReduce framework, distributed indexing schemes, and information retrieval of scientific papers. Distributed indexing schemes are discussed in Section 3.3.

MapReduce framework. MapReduce is a programming model introduced by Google in which the user specifies two functions: map, which processes a key/value pair and generates intermediate key/value pairs, and reduce, which merges all intermediate values associated with the same intermediate key [2]. The MapReduce framework lets programmers focus on these key components, while infrastructure management logic, such as fault tolerance, scheduling, replication, and job tracking, is handled by the underlying framework. We used the Hadoop implementation of MapReduce in our system to parallelize the parsing and indexing processes. The Hadoop Distributed File System (HDFS) [5] is used as storage for the collection of PDF documents that serves as input to the MapReduce jobs.
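For reference, a Hadoop job of this kind is wired together with a small driver program. The sketch below is our own illustration of configuring a job that reads its input from HDFS, not the authors' actual configuration; the class names reuse the hypothetical mapper and reducer sketched above.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class IndexingDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "inverted-index");
        job.setJarByClass(IndexingDriver.class);
        job.setMapperClass(InvertedIndexMapper.class);
        job.setReducerClass(InvertedIndexReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        // Assumes one <docId, text> record per input line, tab-separated.
        job.setInputFormatClass(KeyValueTextInputFormat.class);
        // Input and output paths live on HDFS, e.g. args[0] = /papers/text.
        KeyValueTextInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}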
Information retrieval of scientific papers. There have been several works related to information extraction from scientific papers. S. Lawrence et al. [6] proposed an Autonomous Citation Indexing (ACI) system named CiteSeer (and its updated successor, CiteSeerX [7]). Their main goal is to organize the scientific literature openly available on the Web by automating the creation of citation indices. Their system crawls scientific articles from the Web, extracts citations, and indexes the full-text articles as well; users can then query these articles, and the resulting documents can be sorted by the number of citations to each document.

Verstak and Acharya released Google Scholar, a freely accessible search engine for scientific papers, in 2004. Along with its vast number of indexed articles and its unique ranking algorithm, Google Scholar provides many convenient features such as "group of" and "cited by". Developed by Microsoft Research Asia, Microsoft Academic Search is also one of the most popular free search engines for scientific papers; it focuses on computer science, electrical engineering, and physics [8]. Unlike Google Scholar, which does not disclose its list of coverage, it lists its publishers online.

CiteSeerX, Google Scholar, and MS Academic Search have sophisticated ranking methods, citation indexing features, and crawling systems (CiteSeerX in particular excels at building citation indexes). Our focus, however, is indexing performance (indexing the content of the scientific articles) on a distributed system, and that is how our work differs from those systems, which do not explain their indexing in detail. Also, we work only with a pre-defined repository, while those systems try to cover the whole Web.

3. Design and Implementation of SciPDFindexer System

The overall architecture of our system is depicted in Figure 1. SciPDFindexer accomplishes two tasks: indexing documents and querying the resulting indices. As the figure shows, these two tasks correspond to the two major components: Indexer and QueryParser. The Indexer takes a collection of PDFs as input and parses them into an appropriate textual representation using the PDFparser subcomponent. The textual data is then analyzed by the TextAnalyzer, which extracts the basic morphological forms of the words, removes frequently occurring words (such as articles and prepositions) that carry little information, and counts word occurrences in each document. The resulting data structure is flushed into the Index Database, which we query for keywords through the Search UI. The Ranking subcomponent shows the most relevant documents at the top of the search list.

Figure 1. Overall Architecture of SciPDFindexer

So far, a simplified overview of the proposed architecture has been described from the perspective of component interactions. We now discuss indexing in more detail. Our Indexer component consists of a complex workflow of distributed jobs and is therefore described in detail in the following paragraphs.

We decided to split the indexing process into two parts, preprocessing and text-indexing, for two reasons. First, in the MapReduce programming model, map tasks are independent of each other and do not share any information at runtime, whereas indexing requires a global document-to-DocumentID mapping for all documents, so that each mapper can produce <term, docid> pairs from the documents it processes; those mappings must therefore be known in advance. Second, we want to logically separate two different operations: PDF parsing and indexing. By doing so, we convert the PDFs into plain text and bind that text to document IDs, and then use these text files, containing only the information necessary for indexing, to analyze the text, chop it into tokens, normalize the tokens into terms, and create <term, posting-list> mappings. In this way we make our architecture more modular.

Figure 2 gives a detailed picture of how the indexing job is organized as a job workflow in the distributed system:

PDF files -> Preprocessing (parse the PDF format, extract document fields) -> Text files (document text with fields) -> Text-Indexing (analyze text chunks: tokenizing, eliminating stopwords, lemmatizing) -> Text files (tokens with posting lists) -> Saving to DB (save indices from the text files to the database) -> DB (indices with posting lists)

Figure 2. Indexing Process Workflow

We designed preprocessing and text-indexing as two consecutive MapReduce operations, where the output of the first operation is used as the input to the second. The preprocessing step parses PDF documents into a text representation; text-indexing analyzes the text chunks and creates the index structures. Finally, the index is saved to a database. This additional step is necessary to avoid database concurrency issues: for example, when several reducers attempt to insert data into the database simultaneously, there is a concurrency problem to address. Additionally, we use a distributed database system so that the indices can be queried in parallel.

The preprocessing step is responsible for parsing scientific articles in PDF into text files and creating a document structure for the given PDF files, which usually contain no hints for recreating the structure of a scientific paper. We designed a special parser algorithm that assumes scientific articles share some common layouts, which helps us rebuild that structure. The algorithm divides the document content into three zones: title, abstract, and body. This is done so that search results can be displayed compactly and each zone can be ranked differently. We used the PDFTextStream library [9] to extract text from the PDFs.

After parsing the scientific-article PDF files into the proper text structure, the actual indexing begins. One of the most common index implementations used in search engines is the inverted index, and the MapReduce framework is especially well suited to inverted index construction: terms from each document are typically used as keys, the keys are sorted and grouped by the framework itself, and the posting lists for each term are finally collected in the reduce phase. Several distributed indexing schemes using the MapReduce framework exist. Conceptually they all do the same thing, ultimately outputting posting lists from a collection of documents; however, they differ in how the postings are created, in the structure of the posting lists (which are used later in querying), and in performance. In our scheme, each call to the map function analyzes a document, calculates term frequencies, and aggregates the local results by emitting <term, local_postings> key-value pairs, where local_postings contains an array of posting objects belonging to a single term and each posting is a (docid, title_freq, abstract_freq, body_freq) tuple. The reduce function only aggregates local_postings into a posting list, thereby emitting the final <term, posting_list> pairs.
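A minimal sketch of this scheme follows, under the assumption that preprocessing delivers each document to the mapper as a <docId, zoned text> pair. The zone separator, the string encoding of postings, and the class names are our own illustrative choices (the paper does not specify them); a production version would likely use a custom Writable for the posting tuple.

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map: analyze one whole document, aggregate per-zone term frequencies
// locally, and emit a single <term, local_posting> pair per distinct term.
// A posting is encoded here as "docId:titleFreq:abstractFreq:bodyFreq".
public class ZoneIndexMapper extends Mapper<Text, Text, Text, Text> {
    @Override
    protected void map(Text docId, Text document, Context context)
            throws IOException, InterruptedException {
        // Assumed (hypothetical) layout: title, abstract, and body zones
        // separated by a control character inserted during preprocessing.
        String[] zones = document.toString().split("\u0001", 3);
        Map<String, int[]> freqs = new HashMap<>();
        for (int zone = 0; zone < zones.length; zone++) {
            for (String token : zones[zone].toLowerCase().split("\\W+")) {
                if (token.isEmpty()) continue;
                freqs.computeIfAbsent(token, t -> new int[3])[zone]++;
            }
        }
        for (Map.Entry<String, int[]> e : freqs.entrySet()) {
            int[] f = e.getValue();
            context.write(new Text(e.getKey()),
                    new Text(docId + ":" + f[0] + ":" + f[1] + ":" + f[2]));
        }
    }
}

// Reduce: concatenate a term's local postings into its final posting list,
// emitting <term, posting_list>.
class ZoneIndexReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text term, Iterable<Text> postings, Context context)
            throws IOException, InterruptedException {
        StringBuilder list = new StringBuilder();
        for (Text p : postings) {
            if (list.length() > 0) list.append(' ');
            list.append(p.toString());
        }
        context.write(term, new Text(list.toString()));
    }
}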
We are able to optimize indexing performance by moving some of the reduce-side computation into the map side, which results in less data being copied to the reduce side. This allowed us to improve indexing throughput compared with the baseline scheme described in the original MapReduce paper [2].
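To see the effect, consider a purely illustrative example (the figures are hypothetical, not measurements from this paper): for a document of 3,000 tokens containing 800 distinct terms, a per-occurrence baseline in the style of [2] emits on the order of 3,000 intermediate <term, docid> pairs, whereas the aggregated scheme emits one <term, local_postings> pair per distinct term, i.e., about 800 pairs. The intermediate data shuffled to the reducers thus shrinks roughly by the ratio of total tokens to distinct terms.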

4. Conclusion and Future Works

In this work we addressed the information retrieval problem for scientific articles and provided our solution to it. Our focus was, first, the design of an IR system for scientific articles and, second, improving indexing performance on a distributed set of machines so that a large corpus of scientific articles can be indexed efficiently in parallel with the MapReduce framework. We designed and implemented a full IR system for scientific articles in PDF, SciPDFindexer, which uses the distributed indexing scheme and the parameters discussed above. Our system performs both the indexing job and query execution in parallel, using a distributed set of nodes to deal with large scale.

As future work, we intend to extend our system to support dynamic and incremental indexing, for the case when new documents are regularly added to the collection and the indices need to stay up to date. This poses new challenges: how to manage the new indices and how to merge them with the old ones.

Acknowledgements

This work was jointly supported by the MKE, Korea under the ITRC support program supervised by NIPA (NIPA-2012-(C1090-1221-0011)) and by the Basic Science Research Program through the NRF of Korea (No. 2011-0015089).

References

[1] WorldWideWebSize.com. http://www.worldwidewebsize.com/ [cited in 2011].
[2] J. Dean and S. Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters", OSDI '04: Sixth Symposium on Operating System Design and Implementation, (2004) December.
[3] E. Lai, "Google claims MapReduce sets data-sorting record, topping Yahoo, conventional databases", ComputerWorld, (2008) November 28. Available at: http://www.computerworld.com/s/article/9121278/google_claims_mapreduce_sets_data_sorting_record_topping_yahoo_conventional_databases [cited in 2011].
[4] C. Lee Giles, K. D. Bollacker and S. Lawrence, "CiteSeer: An Automatic Citation Indexing System", Proceedings of the 3rd ACM Conference on Digital Libraries, New York, pp. 89-98, (1998).
[5] Hadoop Distributed File System. http://hadoop.apache.org/hdfs/ [cited in 2011].
[6] S. Lawrence, C. Lee Giles and K. Bollacker, "Digital Libraries and Autonomous Citation Indexing", IEEE Computer, vol. 32, no. 6, (1999).
[7] H. Li, I. Councill, W. Lee and C. Lee Giles, "CiteSeerX: an architecture and web service design for an academic document search engine", Proceedings of the 15th International Conference on World Wide Web (WWW '06), (2006).
[8] M. Thelwall, "Extracting accurate and complete results from search engines: Case study Windows Live", Journal of the American Society for Information Science and Technology, vol. 59, no. 1, pp. 38-50, (2008).
[9] PDFTextStream: PDF Text Extraction library for Java, .NET, Python. http://snowtide.com/pdftextstream [cited in 2011].

Authors

Aziz Murtazaev received his B.A. in Economics from the National University of Uzbekistan in 2007 and his M.S. in Computer Engineering from Ajou University, South Korea in 2011. He is currently working at Samsung Electronics as a software engineer. His research interests include distributed systems, cloud computing, information retrieval, and large-scale software systems.

Sanggil Kang received his M.S. and Ph.D. degrees in Electrical Engineering from Columbia University and Syracuse University, USA, in 1995 and 2002, respectively. He is currently an Associate Professor in the Department of Computer Science and Information Engineering at Inha University, Korea. His research interests include the Semantic Web, artificial intelligence, multimedia systems, and inference systems.

Sangyoon Oh received his Ph.D. in Computer Science from Indiana University Bloomington, USA. He is an assistant professor in the School of Information and Computer Engineering at Ajou University, South Korea. Before joining Ajou University, he worked for SK Telecom, South Korea. His main research interest is the design and development of web-based large-scale software systems, and he has published papers in the areas of mobile software systems, collaboration systems, Web Service technology, Grid systems, and Service-Oriented Architecture (SOA).