A Novel Parallel Architecture Design of Information Retrieval System for Scientific Papers


Aziz Murtazaev 1, Sanggil Kang 2 and Sangyoon Oh 3

1 Samsung Electronics, Suwon, South Korea
2 Department of Computer Science and Information Engineering, Inha University, Incheon, South Korea
3 School of Information and Communication, Ajou University, Suwon, South Korea

az.murtazaev@samsung.com, sgkang@inha.ac.kr, syoh@ajou.ac.kr

Abstract

Indexing converts a raw document collection into an easily searchable representation. Indexing at larger scales poses challenges such as how to distribute the indexing computation efficiently over a cluster of nodes. The MapReduce framework can be an effective tool for parallelizing tasks such as inverted index construction. We propose SciPDFindexer, a distributed information retrieval system for scientific articles in PDF. Given a large collection of scientific articles in PDF, our system parses and extracts metadata from the articles, and then indexes the extracted content using our proposed scheme. Our contribution is the design of a distributed IR system and an indexing scheme that improve overall indexing performance.

Keywords: distributed system, Hadoop, indexing, MapReduce, scientific articles, SciPDFindexer

1. Introduction

When searching over the full contents of a collection of documents, scanning them one by one is inefficient because of the long response time. Larger collections are therefore scanned, analyzed, and indexed before any query is run on them, which greatly reduces search response time. Since a single node may take intolerably long to perform large-scale indexing, a distributed set of nodes is usually employed for such tasks. Large-scale indexing poses the challenge of performing index construction efficiently on a distributed system. Google performs very well at indexing enormously large data: the number of indexed web pages is estimated at around 45 billion [1], and it still delivers sub-second query response times.

An indexing job can be performed efficiently on a distributed system because it lends itself to divide-and-conquer processing. One parallel processing technique suitable for this type of problem is the MapReduce programming model introduced by Google [2]. MapReduce has shown excellent scalability and performance, sorting 1 TB of data in 68 seconds on 1,000 machines and 1 PB of data in 6 hours and 2 minutes on 4,000 machines [3]. MapReduce provides a simple interface to programmers in the form of map() and reduce() functions, while the underlying framework handles parallelization issues such as splitting the input data, moving intermediate data to the corresponding nodes, and sorting and grouping intermediate keys. Although MapReduce scales automatically, the choice of key-value pairs and how they are processed in the map and reduce phases affect overall job performance.
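To make the model concrete, the following is a minimal sketch of the canonical inverted-index construction in Hadoop's Java MapReduce API. It is illustrative only and is not the scheme proposed in this paper (our scheme appears in Section 3); the class names and the assumption that the input format delivers <docId, text> pairs (e.g., via KeyValueTextInputFormat) are ours.

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map: for every term occurrence in a document, emit a <term, docId> pair.
public class InvertedIndexMapper extends Mapper<Text, Text, Text, Text> {
    @Override
    protected void map(Text docId, Text body, Context context)
            throws IOException, InterruptedException {
        for (String token : body.toString().toLowerCase().split("\\W+")) {
            if (!token.isEmpty()) {
                context.write(new Text(token), docId);
            }
        }
    }
}

// Reduce: the framework sorts and groups pairs by term; concatenate the
// grouped document IDs into the term's posting list.
class InvertedIndexReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text term, Iterable<Text> docIds, Context context)
            throws IOException, InterruptedException {
        StringBuilder postings = new StringBuilder();
        for (Text id : docIds) {
            if (postings.length() > 0) postings.append(',');
            postings.append(id.toString());
        }
        context.write(term, new Text(postings.toString()));
    }
}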

In the case of indexing, most indexing schemes resemble each other, but the way the documents are indexed differs from one scheme to another. We believe that choosing the right scheme is important for indexing efficiency, measured as indexing throughput.

Information retrieval (IR) for scientific papers is not as thoroughly researched a domain as general IR. Moreover, some specifics need to be considered when designing IR systems for scientific papers. One such specific is the document structure, such as title, abstract, and body; recent scientific papers are usually provided as PDFs, in various layouts, and need to be converted to a proper textual format before their text content can be analyzed. Even though various studies on IR systems for scientific papers have been conducted, few consider the architecture as a whole system and describe all aspects of the design, from parsing to indexing to querying, in detail. This whole-system view is important because the parts of the architecture are interrelated: how we parse and what structure we obtain from parsing affect how we index documents, and the way we index documents and the index structures we choose affect querying performance. Among the few, notable examples are NEC's CiteSeer (and its descendant CiteSeerX), Google Scholar, and MS Academic Search, all of which index academic literature in electronic formats (e.g., PostScript files) [4].

We propose an IR system, SciPDFindexer, for parsing, indexing, and querying scientific articles in PDF. Given a large corpus of scientific articles in PDF, the proposed system parses and extracts the article contents along with additional metadata, such as the title and abstract. Next, it indexes the extracted contents using the MapReduce framework on a distributed system. Our querying system, to which we also applied parallelism using a distributed database, enables free-text querying over the resulting indices. Our main focus in this work is indexing performance, achieved by designing an efficient distributed indexing algorithm.

The rest of the paper is organized as follows. Background and related work are described in Section 2. In Section 3, we discuss the design and implementation of the SciPDFindexer system. We conclude our work and provide insights into future work in Section 4.

2. Background and Related Work

Our research is related to several disciplines, such as parallel computing with the MapReduce framework, distributed indexing schemes, and information retrieval of scientific papers. Distributed indexing schemes are discussed in Section 3.3.

MapReduce framework. MapReduce is a programming model introduced by Google in which the user specifies two functions: map, which processes a key/value pair and generates intermediate key/value pairs, and reduce, which merges all intermediate values associated with the same intermediate key [2]. The MapReduce framework lets programmers focus on these key components, while infrastructure management logic, such as fault tolerance, scheduling, replication, and job tracking, is handled by the underlying framework. We used the Hadoop implementation of MapReduce in our system to parallelize the parsing and indexing processes. The Hadoop Distributed File System (HDFS) [5] is used as storage for the collection of PDF documents that serves as input to the MapReduce jobs.
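For reference, a Hadoop job of this kind is wired together with a small driver program. The sketch below is our own illustration of configuring a job that reads its input from HDFS, not the authors' actual configuration; the class names reuse the hypothetical mapper and reducer sketched above.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class IndexingDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "inverted-index");
        job.setJarByClass(IndexingDriver.class);
        job.setMapperClass(InvertedIndexMapper.class);
        job.setReducerClass(InvertedIndexReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        // Assumes one <docId, text> record per input line, tab-separated.
        job.setInputFormatClass(KeyValueTextInputFormat.class);
        // Input and output paths live on HDFS, e.g. args[0] = /papers/text.
        KeyValueTextInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}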
Information retrieval of scientific papers. There have been several works related to information extraction from scientific papers. S. Lawrence et al. [6] proposed an Autonomous Citation Indexing (ACI) system named CiteSeer (and its updated successor, CiteSeerX [7]). Their main goal is to organize the scientific literature openly available on the Web by automating the creation of citation indices. Their system crawls scientific articles from the Web, extracts citations, and indexes the full-text articles as well; users can then query these articles, and the resulting documents can be sorted by the number of citations to each document.

Verstak and Acharya released Google Scholar, a freely accessible search engine for scientific papers, in 2004. Along with its vast number of indexed articles and its unique ranking algorithm, Google Scholar provides many convenient features such as "group of" and "cited by". Developed by Microsoft Research Asia, Microsoft Academic Search is also one of the most popular free search engines for scientific papers; it focuses on computer science, electrical engineering, and physics [8]. Unlike Google Scholar, which does not disclose its list of coverage, it lists its publishers online.

CiteSeerX, Google Scholar, and MS Academic Search have sophisticated ranking methods, citation indexing features, and crawling systems (CiteSeerX in particular excels at building citation indexes). Our focus, however, is indexing performance (indexing the content of the scientific articles) on a distributed system, and that is how our work differs from those systems, which do not explain their indexing in detail. Also, we work only with a pre-defined repository, while those systems try to cover the whole Web.

3. Design and Implementation of SciPDFindexer System

The overall architecture of our system is depicted in Figure 1. SciPDFindexer accomplishes two tasks: indexing documents and querying the resulting indices. As the figure shows, these two tasks correspond to the two major components: Indexer and QueryParser. The Indexer takes a collection of PDFs as input and parses them into an appropriate textual representation using the PDFparser subcomponent. The textual data is then analyzed by the TextAnalyzer, which extracts the basic morphological forms of the words, removes frequently occurring words (such as articles and prepositions) that carry little information, and counts word occurrences in each document. The resulting data structure is flushed into the Index Database, which we query for keywords through the Search UI. The Ranking subcomponent shows the most relevant documents at the top of the search list.

Figure 1. Overall Architecture of SciPDFindexer

So far, a simplified overview of the proposed architecture has been described from the perspective of component interactions. We now discuss indexing in more detail. Our Indexer component consists of a complex workflow of distributed jobs and is therefore described in detail in the following paragraphs.

We decided to split the indexing process into two parts, preprocessing and text-indexing, for two reasons. First, in the MapReduce programming model, map tasks are independent of each other and do not share any information at runtime, whereas indexing requires a global document-to-DocumentID mapping for all documents, so that each mapper can produce <term, docid> pairs from the documents it processes; those mappings must therefore be known in advance. Second, we want to logically separate two different operations: PDF parsing and indexing. By doing so, we convert the PDFs into plain text and bind that text to document IDs, and then use these text files, containing only the information necessary for indexing, to analyze the text, chop it into tokens, normalize the tokens into terms, and create <term, posting-list> mappings. In this way we make our architecture more modular.

Figure 2 gives a detailed picture of how the indexing job is organized as a job workflow in the distributed system:

PDF files -> Preprocessing (parse the PDF format, extract document fields) -> Text files (document text with fields) -> Text-Indexing (analyze text chunks: tokenizing, eliminating stopwords, lemmatizing) -> Text files (tokens with posting lists) -> Saving to DB (save indices from the text files to the database) -> DB (indices with posting lists)

Figure 2. Indexing Process Workflow

We designed preprocessing and text-indexing as two consecutive MapReduce operations, where the output of the first operation is used as the input to the second. The preprocessing step parses PDF documents into a text representation; text-indexing analyzes the text chunks and creates the index structures. Finally, the index is saved to a database. This additional step is necessary to avoid database concurrency issues: for example, when several reducers attempt to insert data into the database simultaneously, there is a concurrency problem to address. Additionally, we use a distributed database system so that the indices can be queried in parallel.

The preprocessing step is responsible for parsing scientific articles in PDF into text files and creating a document structure for the given PDF files, which usually contain no hints for recreating the structure of a scientific paper. We designed a special parser algorithm that assumes scientific articles share some common layouts, which helps us rebuild that structure. The algorithm divides the document content into three zones: title, abstract, and body. This is done so that search results can be displayed compactly and each zone can be ranked differently. We used the PDFTextStream library [9] to extract text from the PDFs.

After parsing the scientific-article PDF files into the proper text structure, the actual indexing begins. One of the most common index implementations used in search engines is the inverted index, and the MapReduce framework is especially well suited to inverted index construction: terms from each document are typically used as keys, the keys are sorted and grouped by the framework itself, and the posting lists for each term are finally collected in the reduce phase. Several distributed indexing schemes using the MapReduce framework exist. Conceptually they all do the same thing, ultimately outputting posting lists from a collection of documents; however, they differ in how the postings are created, in the structure of the posting lists (which are used later in querying), and in performance. In our scheme, each call to the map function analyzes a document, calculates term frequencies, and aggregates the local results by emitting <term, local_postings> key-value pairs, where local_postings contains an array of posting objects belonging to a single term and each posting is a (docid, title_freq, abstract_freq, body_freq) tuple. The reduce function only aggregates local_postings into a posting list, thereby emitting the final <term, posting_list> pairs.
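A minimal sketch of this scheme follows, under the assumption that preprocessing delivers each document to the mapper as a <docId, zoned text> pair. The zone separator, the string encoding of postings, and the class names are our own illustrative choices (the paper does not specify them); a production version would likely use a custom Writable for the posting tuple.

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map: analyze one whole document, aggregate per-zone term frequencies
// locally, and emit a single <term, local_posting> pair per distinct term.
// A posting is encoded here as "docId:titleFreq:abstractFreq:bodyFreq".
public class ZoneIndexMapper extends Mapper<Text, Text, Text, Text> {
    @Override
    protected void map(Text docId, Text document, Context context)
            throws IOException, InterruptedException {
        // Assumed (hypothetical) layout: title, abstract, and body zones
        // separated by a control character inserted during preprocessing.
        String[] zones = document.toString().split("\u0001", 3);
        Map<String, int[]> freqs = new HashMap<>();
        for (int zone = 0; zone < zones.length; zone++) {
            for (String token : zones[zone].toLowerCase().split("\\W+")) {
                if (token.isEmpty()) continue;
                freqs.computeIfAbsent(token, t -> new int[3])[zone]++;
            }
        }
        for (Map.Entry<String, int[]> e : freqs.entrySet()) {
            int[] f = e.getValue();
            context.write(new Text(e.getKey()),
                    new Text(docId + ":" + f[0] + ":" + f[1] + ":" + f[2]));
        }
    }
}

// Reduce: concatenate a term's local postings into its final posting list,
// emitting <term, posting_list>.
class ZoneIndexReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text term, Iterable<Text> postings, Context context)
            throws IOException, InterruptedException {
        StringBuilder list = new StringBuilder();
        for (Text p : postings) {
            if (list.length() > 0) list.append(' ');
            list.append(p.toString());
        }
        context.write(term, new Text(list.toString()));
    }
}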
We are able to optimize indexing performance by moving some of the reduce-side computation into the map side, which results in less data being copied to the reduce side. This allowed us to improve indexing throughput compared with the baseline scheme described in the original MapReduce paper [2].
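To see the effect, consider a purely illustrative example (the figures are hypothetical, not measurements from this paper): for a document of 3,000 tokens containing 800 distinct terms, a per-occurrence baseline in the style of [2] emits on the order of 3,000 intermediate <term, docid> pairs, whereas the aggregated scheme emits one <term, local_postings> pair per distinct term, i.e., about 800 pairs. The intermediate data shuffled to the reducers thus shrinks roughly by the ratio of total tokens to distinct terms.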

4. Conclusion and Future Works

In this work we addressed the information retrieval problem for scientific articles and provided our solution to it. Our focus was, first, the design of an IR system for scientific articles and, second, improving indexing performance on a distributed set of machines so that a large corpus of scientific articles can be indexed efficiently in parallel with the MapReduce framework. We designed and implemented a full IR system for scientific articles in PDF, SciPDFindexer, which uses the distributed indexing scheme and the parameters discussed above. Our system performs both the indexing job and query execution in parallel, using a distributed set of nodes to deal with large scale.

As future work, we intend to extend our system to support dynamic and incremental indexing, for the case when new documents are regularly added to the collection and the indices need to stay up to date. This poses new challenges: how to manage the new indices and how to merge them with the old ones.

Acknowledgements

This work was jointly supported by the MKE, Korea under the ITRC support program supervised by NIPA (NIPA-2012-(C1090-1221-0011)) and by the Basic Science Research Program through the NRF of Korea (No. 2011-0015089).

References

[1] WorldWideWebSize.com. http://www.worldwidewebsize.com/ [cited in 2011].
[2] J. Dean and S. Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters", OSDI '04: Sixth Symposium on Operating System Design and Implementation, (2004) December.
[3] E. Lai, "Google claims MapReduce sets data-sorting record, topping Yahoo, conventional databases", ComputerWorld, (2008) November 28. Available at: http://www.computerworld.com/s/article/9121278/google_claims_mapreduce_sets_data_sorting_record_topping_yahoo_conventional_databases [cited in 2011].
[4] C. Lee Giles, K. D. Bollacker and S. Lawrence, "CiteSeer: An Automatic Citation Indexing System", Proceedings of the 3rd ACM Conference on Digital Libraries, New York, pp. 89-98, (1998).
[5] Hadoop Distributed File System. http://hadoop.apache.org/hdfs/ [cited in 2011].
[6] S. Lawrence, C. Lee Giles and K. Bollacker, "Digital Libraries and Autonomous Citation Indexing", IEEE Computer, vol. 32, no. 6, (1999).
[7] H. Li, I. Councill, W. Lee and C. Lee Giles, "CiteSeerX: an architecture and web service design for an academic document search engine", Proceedings of the 15th International Conference on World Wide Web (WWW '06), (2006).
[8] M. Thelwall, "Extracting accurate and complete results from search engines: Case study Windows Live", Journal of the American Society for Information Science and Technology, vol. 59, no. 1, pp. 38-50, (2008).
[9] PDFTextStream: PDF Text Extraction library for Java, .NET, Python. http://snowtide.com/pdftextstream [cited in 2011].

Authors

Aziz Murtazaev received his B.A. in Economics from the National University of Uzbekistan in 2007 and his M.S. in Computer Engineering from Ajou University, South Korea in 2011. He is currently working at Samsung Electronics as a software engineer. His research interests include distributed systems, cloud computing, information retrieval, and large-scale software systems.

Sanggil Kang received his M.S. and Ph.D. degrees in Electrical Engineering from Columbia University and Syracuse University, USA, in 1995 and 2002, respectively. He is currently an Associate Professor in the Department of Computer Science and Information Engineering at Inha University, Korea. His research interests include the Semantic Web, artificial intelligence, multimedia systems, and inference systems.

Sangyoon Oh received his Ph.D. in Computer Science from Indiana University Bloomington, USA. He is an assistant professor in the School of Information and Computer Engineering at Ajou University, South Korea. Before joining Ajou University, he worked for SK Telecom, South Korea. His main research interest is the design and development of web-based large-scale software systems, and he has published papers in the areas of mobile software systems, collaboration systems, Web Service technology, Grid systems, and Service-Oriented Architecture (SOA).