Efficient Processing of XML Documents in Hadoop Map Reduce



Dmitry Vasilenko, Mahesh Kurapati
Business Analytics, IBM, Chicago, USA
dvasilen@us.ibm.com, mkurapati@us.ibm.com

Abstract — XML has dominated the enterprise landscape for fifteen years and still remains the most commonly used data format. Despite its popularity, the use of XML for "Big Data" is challenging due to its semi-structured nature, its rather demanding memory requirements, and its lack of support for some complex data structures such as maps. While a number of tools and technologies for XML processing are readily available, the common approach in map-reduce environments is to create a "custom solution" based, for example, on Apache Hive User Defined Functions (UDF). As XML processing is a common use case, this paper describes a generic approach to handling XML based on the Apache Hive architecture. The described functionality complements the existing family of Hive serializers/deserializers for other popular data formats, such as JSON, and makes it much easier for users to deal with large amounts of data in XML format.

Keywords — XML, Apache Hadoop, Apache Hive, Map-Reduce, VTD-XML, XPath.

I. INTRODUCTION

While the enthusiasm around XML as the universal data format readable by humans and computers appears to have subsided [1], XML still remains a widely accepted data format of choice. As a result, there is a growing demand for efficient processing of large amounts of XML data using Apache Hadoop map-reduce functionality. The most common approach to processing XML data is to introduce a "custom" solution based on user-defined functions or scripts. The typical choices range from introducing an ETL process that extracts the data of interest to transforming the XML into other formats natively supported by Hive. Another possibility is to utilize the Apache Hive XPath UDFs, but these functions can only be used in Hive views and SELECT statements, not in the table CREATE DDL.
A comprehensive survey of other available technologies for XML processing can be found in [2]. The approach proposed in this document enables the user to define extraction rules directly in the Hive table CREATE statement by introducing a new XML serializer/deserializer (SerDe). Additionally, a specialization of the Apache Hadoop TextInputFormat is introduced to enable parallel processing of plain or compressed XML files.

II. ARCHITECTURAL OVERVIEW

To support efficient processing of large amounts of XML data, the proposed implementation relies on the Apache Hadoop Map Reduce stack. The Apache Hive data warehouse allows the user to create a custom SerDe [3] to handle a particular type of data. Additionally, the user can implement a Hadoop InputFormat to create appropriate input splits and associated record readers.

A. Collaboration Diagram

The collaboration diagram for the proposed implementation is shown below. ISSN : 0975-3397 Vol. 6 No.09 Sep 2014 329

Figure 1. XML SerDe collaboration diagram: 1. Input Format; 2. Record Reader; 3. SerDe; 4. XPath Processor; 5. XML Spec; 6. Hive Record.

1) Input Format. The input format implementation splits the input files into fragments delimited by the start and stop byte sequences for the given tag. The implementation evolved from the open source XmlInputFormat of the Apache Mahout project [4] into a collection of input format classes that handle compressed and uncompressed data. Splittable compressed formats such as BZ2 and CMX can be efficiently processed by a number of Hadoop mappers in parallel. The input splits generated by the InputFormat are used by the record reader to create text records for further processing.

2) Record Reader. The record reader iterates over the text records based on the start and stop sequences and is used by the SerDe to transform the records into Hive rows.

3) SerDe. The deserializer interacts with the XPath Processor to query the XML and generate values for the Hive rows. The transformation specification for the processor is stored in the Hive SERDEPROPERTIES as described below.

4) XPath Processor. The processor is a pluggable component implementing the XmlProcessor interface [5]. The implementation transforms the XML text into the data types supported by Apache Hive. At the time of this writing there are two publicly available implementations of this interface: one based on the JDK XPath [5] and another based on the "Virtual Token Descriptor" (VTD-XML) [6].

5) XML Spec. The specification defines the information necessary for mapping XPath query results to Hive column values, as well as a set of rules for converting XML elements to the Hive map data types.

6) Hive Record. The Hive record for the given XML input.

B. Hive Table Specification

The Hive CREATE TABLE statement for the proposed SerDe is defined as follows.
CREATE [EXTERNAL] TABLE <table_name> (<column_specifications>)
ROW FORMAT SERDE "com.ibm.spss.hive.serde2.xml.XmlSerDe"
WITH SERDEPROPERTIES (
    ["xml.processor.class"="<xml_processor_class_name>",]
    "column.xpath.<column_name>"="<xpath_query>",
    ...
    ["xml.map.specification.<element_name>"="<map_specification>" ...]
)
STORED AS
    INPUTFORMAT "<hadoop_input_format>"
    OUTPUTFORMAT "org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat"
[LOCATION "<data_location>"]
TBLPROPERTIES (
    "xmlinput.start"="<start_tag>",
    "xmlinput.end"="<end_tag>"
);
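The effect of the xmlinput.start/xmlinput.end table properties and the column.xpath.* SerDe properties can be sketched outside of Hive. The following is a minimal, illustrative Python sketch — not the actual Java implementation — that splits a stream into fragments by the start/end byte sequences and evaluates a per-column path against each fragment (the book/title/author names and the simplified path handling are invented for the demo):

```python
import xml.etree.ElementTree as ET

# Hypothetical table definition: property names mirror the CREATE TABLE
# template above; the values are made up for this example.
PROPS = {
    "xmlinput.start": "<book",
    "xmlinput.end": "</book>",
    "column.xpath.title": "title",    # simplified path, relative to fragment
    "column.xpath.author": "author",
}

def split_records(stream, start, end):
    """Yield raw XML fragments delimited by the start/end sequences,
    the way the record reader produces one text record per fragment."""
    pos = 0
    while True:
        s = stream.find(start, pos)
        if s < 0:
            return
        e = stream.find(end, s)
        if e < 0:
            return
        yield stream[s:e + len(end)]
        pos = e + len(end)

def to_row(fragment, props):
    """Deserialize one fragment into a Hive-like row dict by evaluating
    the per-column path expressions (stand-in for the XPath processor)."""
    root = ET.fromstring(fragment)
    row = {}
    for key, path in props.items():
        if key.startswith("column.xpath."):
            column = key[len("column.xpath."):]
            node = root.find(path)
            row[column] = node.text if node is not None else None
    return row

data = ("<book><title>Pro Hadoop</title><author>S. Wadkar</author></book>"
        "<book><title>Hive</title><author>E. Capriolo</author></book>")

rows = [to_row(f, PROPS)
        for f in split_records(data, PROPS["xmlinput.start"],
                               PROPS["xmlinput.end"])]
print(rows)
```

In the real implementation the splitting is done by the Hadoop InputFormat across block boundaries and the extraction by a full XPath engine; the sketch only shows how the two sets of properties cooperate to turn a byte stream into rows.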

The table parameters are described below:

1) xml.processor.class defines the Java class name of the XPath processor. Defaults to the JDK XPath implementation. There is also a publicly available implementation based on VTD-XML.
2) column.xpath.<column_name> defines the XPath query used to retrieve the value for the column.
3) xml.map.specification.<element_name> defines the mapping specification for the element. The details are discussed in the XML to Hive Record Mapping section of this document.
4) <hadoop_input_format> defines the implementation class that splits the input files into logical InputSplits, each of which is then assigned to an individual mapper.
5) xmlinput.start and xmlinput.end define the byte sequences for the start and end of the XML fragment to be processed.

III. XML TO HIVE RECORD MAPPING

As was noted earlier, there are a number of tools and techniques for mapping XML markup to object instances. Support for complex data types in Apache Hive, however, allows the implementor to do such mapping directly, bypassing the intermediate data types generated by the transformation tools. The following sections show how the proposed implementation deduces the complex types from the results of the XPath queries.

A. Primitive Types

The attribute values or text content of XML elements can be directly mapped to the Hive primitive types as shown below.

TABLE I. PRIMITIVE TYPE MAPPING
XML: <result>03.06.2009</result>
XPath: /result/text()
Hive column: result (string)
Value: 03.06.2009

B. Structures

An XML element can be directly mapped to the Hive structure type so that all its attributes become data members. The content of the element becomes an additional member of primitive or complex type.

TABLE II. STRUCTURE MAPPING
XML: <result name="DATE">03.06.2009</result>
XPath: /result
Hive type: struct<name:string,result:string>
Value: name: DATE, result: 03.06.2009

C. Arrays

Sequences of XML elements can be represented as Hive arrays of primitive or complex type.

TABLE III. ARRAY MAPPING
XML: <result>03.06.2009</result> <result>03.06.2010</result> <result>03.06.2011</result>
XPath: /result/text()
Hive type: array<string>
Value: 03.06.2009, 03.06.2010, 03.06.2011

D. Modeling Maps in XML

XML Schema does not provide native support for maps. This issue is well understood, and there is a proposal to add a map type in the W3C XSLT 3.0 Working Draft [7]. Three common approaches to modeling maps in XML are described in the following sections.

1) Element Name to Element Content

The name of the element is used as a key and its content as a value. This is one of the most common techniques and is used by default when mapping XML to Hive map types. The obvious limitation of this approach is that the map key can only be of type string and the key names are fixed.

TABLE IV. ELEMENT NAME TO ELEMENT CONTENT MAPPING
XML: <entry1>value1</entry1> <entry2>value2</entry2> <entry3>value3</entry3>
Map: entry1->value1, entry2->value2, entry3->value3
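The type deductions in Tables I–IV can be imitated with a short, illustrative Python sketch. This is not the SerDe's actual code: the helper names are invented, a full XPath engine is replaced by ElementTree lookups, and the sample fragments are wrapped in container elements so they parse as standalone documents:

```python
import xml.etree.ElementTree as ET

def primitive(root, tag):
    # Table I: the text content of a single element -> Hive primitive.
    return root.findtext(tag)

def structure(elem):
    # Table II: attributes become struct members; the element content
    # becomes an additional member named after the element.
    fields = dict(elem.attrib)
    fields[elem.tag] = elem.text
    return fields

def array(root, tag):
    # Table III: a repeated element -> Hive array.
    return [e.text for e in root.findall(tag)]

def default_map(root):
    # Table IV: element name -> element content (the default map rule).
    return {child.tag: child.text for child in root}

results = ET.fromstring(
    "<results><result>03.06.2009</result>"
    "<result>03.06.2010</result><result>03.06.2011</result></results>")
entries = ET.fromstring(
    "<entries><entry1>value1</entry1>"
    "<entry2>value2</entry2><entry3>value3</entry3></entries>")
row_struct = structure(ET.fromstring('<result name="DATE">03.06.2009</result>'))

print(primitive(results, "result"))   # 03.06.2009
print(array(results, "result"))       # the three date strings
print(row_struct)                     # name/result struct members
print(default_map(entries))           # entryN -> valueN
```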

2) Attribute Name to Element Content

Another popular approach is to use an attribute value as the key and the element content as the value.

TABLE V. ATTRIBUTE NAME TO ELEMENT CONTENT MAPPING
XML: <entry name="key1">value1</entry> <entry name="key2">value2</entry> <entry name="key3">value3</entry>
Map: key1->value1, key2->value2, key3->value3

3) Attribute Name to Attribute Value

Yet another approach is to use two attributes to model map entries, as shown in the following table.

TABLE VI. ATTRIBUTE NAME TO ATTRIBUTE VALUE MAPPING
XML: <entry name="key1" value="value1"/> <entry name="key2" value="value2"/> <entry name="key3" value="value3"/>
Map: key1->value1, key2->value2, key3->value3

E. Handling XML Maps

To accommodate the different ways of modeling maps in XML, the proposed implementation introduces the following syntax:

"xml.map.specification.<element_name>"="<key>-><value>"

where:
element_name - the name of the element to be considered as a map entry
key - the node to be used as the map entry key
value - the node to be used as the map entry value

The map specification for the given element should be defined in the SERDEPROPERTIES section of the Hive table creation DDL. The keys and values can be specified using the syntax shown in the following table.

TABLE VII. MAPPING SYNTAX
Syntax: @attribute; Example: @name; Description: the value of the attribute is used as the key or value of the map.
Syntax: element; Example: entry; Description: the element name is used as the key or value.
Syntax: #content; Example: #content; Description: the content of the element is used as the key or value. As map keys can only be of primitive type, complex content is converted to string.

IV. COMPRESSION

Compressing the XML data can significantly reduce storage requirements while still allowing for optimal performance during processing. To handle compressed and raw XML data, the framework provides three different implementations of the Hadoop InputFormat, documented below.

TABLE VIII. COMPRESSION FORMATS
com.ibm.spss.hive.serde2.xml.XmlInputFormat - uncompressed XML or non-splittable compression formats such as gzip.
com.ibm.spss.hive.serde2.xml.SplittableXmlInputFormat - splittable bzip2 compression format.
com.ibm.spss.hive.serde2.xml.CMXXmlInputFormat - splittable IBM InfoSphere BigInsights CMX compression format.

The com.ibm.spss.hive.serde2.xml.XmlInputFormat is publicly available [5]. The splittable input formats for BZ2 and IBM CMX compression are distributed with the IBM SPSS Analytic Server [8].

V. CONCLUSION

This paper describes the proposed and implemented approach to handling large amounts of XML data using map-reduce functionality. The implementation creates logical splits for the input files, each of which is assigned to an individual mapper. The mapper relies on the implemented Apache Hive SerDe to break the split into XML fragments using the specified start/end byte sequences. Each fragment corresponds to a single Hive record. The fragments are then handled by the XPath processor to extract values for the record columns using the specified XPath queries. The reduce phase is not required.
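Returning to the map specification syntax of Table VII, the "<key>-><value>" grammar can be made concrete with a small, hypothetical interpreter. This is only an illustration of the rules from Tables V and VI — the function names are invented and the real SerDe resolves the selectors against XPath results rather than raw elements:

```python
import xml.etree.ElementTree as ET

def selector(spec_part):
    """Translate one side of a '<key>-><value>' specification into a
    function over a map-entry element, per the syntax of Table VII."""
    if spec_part == "#content":
        return lambda e: e.text          # element content
    if spec_part.startswith("@"):
        attr = spec_part[1:]
        return lambda e: e.get(attr)     # attribute value
    return lambda e: e.tag               # element name

def apply_map_spec(root, entry_name, spec):
    """Build a Hive-like map from all <entry_name> elements using a
    specification such as "@name->#content" (Table V) or
    "@name->@value" (Table VI)."""
    key_spec, value_spec = spec.split("->")
    key_fn, value_fn = selector(key_spec), selector(value_spec)
    return {key_fn(e): value_fn(e) for e in root.iter(entry_name)}

table5 = ET.fromstring(
    '<m><entry name="key1">value1</entry>'
    '<entry name="key2">value2</entry>'
    '<entry name="key3">value3</entry></m>')
table6 = ET.fromstring(
    '<m><entry name="key1" value="value1"/>'
    '<entry name="key2" value="value2"/></m>')

m5 = apply_map_spec(table5, "entry", "@name->#content")
m6 = apply_map_spec(table6, "entry", "@name->@value")
print(m5)
print(m6)
```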

In the course of this work, two variants of the XPath processor have been implemented and made publicly available. The JDK XPath based implementation can be used for relatively small XML documents. Large XML documents can be processed with the VTD-XML based XPath processor [6]. Additionally, the implemented Hadoop InputFormats allow for processing of plain and compressed XML files.

ACKNOWLEDGMENTS

The authors thank Prof. Arti Arya of PES University, Bangalore, India, for her invaluable comments during the preparation of this paper.

REFERENCES
[1] Jim Fuller, "Big Data and Modern XML", Keynote, XML Amsterdam, 2013.
[2] C. Subhashini and Arti Arya, "A Framework For Extracting Information From Web Using VTD-XML's XPath", International Journal on Computer Science and Engineering (IJCSE), Vol. 4, No. 03, 2012, pp. 463-468.
[3] Apache Hive SerDe, https://cwiki.apache.org/confluence/display/hive/serde
[4] Apache Mahout Project, https://mahout.apache.org/
[5] Apache Hive XML SerDe, https://github.com/dvasilen/hive-xml-serde
[6] Apache Hive VTD-XML SerDe, https://github.com/dvasilen/hive-xml-serde-vtd
[7] XSL Transformations (XSLT) Version 3.0, W3C Working Draft, 12 December 2013.
[8] IBM SPSS Analytic Server, http://www-03.ibm.com/software/products/en/spss-analytic-server

AUTHORS PROFILE

Dmitry Vasilenko, IBM, 200 West Madison Street, Chicago, IL 60606 (dvasilen@us.ibm.com). Mr. Vasilenko is a Senior Software Engineer in the Business Analytics Department of the IBM Software Group. He received an M.S. degree in Electrical Engineering from Novosibirsk State Technical University, Russian Federation, in 1986. Before joining IBM SPSS in 1997, Mr. Vasilenko led Computer Aided Design projects in the area of Electrical Engineering at the Institute of Electric Power Systems and Electric Transmission Networks. During his tenure with IBM, Mr. Vasilenko received three technical excellence awards for his work in Business Analytics. He is an author or coauthor of 11 technical papers and a pending US patent.
Mahesh Kurapati, IBM, 200 West Madison Street, Chicago, IL 60606 (mkurapati@us.ibm.com). Mr. Kurapati is an Advisory Software Engineer in the Business Analytics Department of the IBM Software Group. He received a B.E. degree in Electronics Engineering from Bangalore University, India, in 1993 and a specialization in PC-based Instrumentation from the Indian Institute of Science, Bangalore, India. Before joining IBM in 2006, Mr. Kurapati was involved in various telecommunications and data mining projects. At IBM, Mr. Kurapati's primary focus is on the development of the SPSS Statistics and SPSS Analytic Server products.