Enhancing HiveQL Engine Using Map-Join-Reduce

Amruta Kulkarni, Pune Institute Of Computer Technology, Pune, India
Prof. Shweta Dharmadhikari, Pune Institute Of Computer Technology, Pune, India

Abstract

Today we are facing an information explosion, which brings the challenge of building systems that can handle huge volumes of data. Hive is a data warehouse infrastructure built on the Hadoop platform. It provides mechanisms for organizing huge data sets, extracting data using MapReduce, and analyzing large data sets stored in HDFS. HiveQL is the query language Hive uses for data extraction. In addition to the traditional MapReduce functionality, it allows custom MapReduce functions to be plugged in. This HiveQL MapReduce processing is considered here for a MapJoinReduce enhancement, which leads to a detailed study of performance improvement. The MapReduce processing strategy frequently checkpoints and shuffles intermediate results. MapReduce can be made more scalable and efficient by improving how this intermediate data is handled. The proposed solution is Map-Join-Reduce, which simplifies data handling by removing the burden of presenting complex join algorithms. We first present the enhanced Map-Join-Reduce architecture for the HiveQL engine; this design sheds light on how Hadoop and Hive work together during query processing. We then present the performance measurements taken on the existing system to set a benchmark for the system under development. Together these lead to the enhanced query processing architecture and provide the performance baseline for the next level of development.

Keywords- Hadoop, Hive, HiveQL

I. INTRODUCTION

In Hive, MapReduce is responsible for filtering data and aggregating it according to the extraction requirements. The Map function first filters the required data, which is then passed to the Reduce function for aggregation and computation of the result. In HiveQL, the join takes place on the Map side. As the data grows, the amount of checkpointing and shuffling increases.
The objective of this research paper is to develop a solution for MapReduce-based query engines such as Hive and Pig by adding a new building block for generating query plans. The paper elaborates how to enhance the HiveQL architecture for MapJoinReduce, along with performance measurement and benchmarking. We start with a literature survey that covers the Hive architecture, then state the research objectives, and finally look at the performance measurements taken on the existing HiveQL system for benchmarking.

II. LITERATURE SURVEY

A. Hive Architecture

The Hive architecture contains three main components: Serializers/Deserializers (trunk/serde), MetaStore (trunk/metastore), and Query Processor (trunk/ql). [7]

1. Serializer/Deserializer
This component can be found in trunk/serde. It has built-in libraries for serialization and deserialization, and it also allows developers to create their own serializers and deserializers for their own data formats. [7]

2. MetaStore
This component can be found in trunk/metastore. It is responsible for maintaining the metadata of the warehouse. [7]

3. Query Processor
This component can be found in trunk/ql. It is responsible for converting SQL into a graph of MapReduce jobs, so the MapJoinReduce enhancement is made here. [7]

B. Join Strategy in Hive

The type of join selected for MapReduce in Hive depends on the data configuration and the size of the data. The data we loaded for performance testing has a star schema configuration: one heavily loaded table connected to many other small tables. In this case, the MapJoinOperator.java class handles the MapReduce operation. [13]
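To illustrate why a star schema suits a map-side join, the following is a minimal Python sketch of the general technique: the small table is held in memory as a hash table and the large table is streamed past it. It is not Hive's MapJoinOperator implementation. The function names `build_hash_table` and `map_side_join` and the sample rows are hypothetical; only the column names are borrowed from the queries in this paper.

```python
from collections import defaultdict


def build_hash_table(small_rows, key):
    """Index the small (dimension) table in memory by its join key."""
    table = defaultdict(list)
    for row in small_rows:
        table[row[key]].append(row)
    return table


def map_side_join(fact_rows, hash_table, fact_key):
    """Stream the large (fact) table and probe the in-memory hash table.

    Each mapper is assumed to hold a full copy of the small table, so the
    join needs no shuffle of the large table.
    """
    for fact in fact_rows:
        for dim in hash_table.get(fact[fact_key], []):
            # Merge both rows; the join key has the same value in each.
            yield {**dim, **fact}


# Hypothetical sample rows (not taken from the paper's data set).
room = [{"RoomID": 1, "Location": "Park"}, {"RoomID": 2, "Location": "Lake"}]
klass = [
    {"ClassID": 10, "RoomID": 1},
    {"ClassID": 11, "RoomID": 2},
    {"ClassID": 12, "RoomID": 1},
    {"ClassID": 13, "RoomID": 9},  # no matching room: dropped by the inner join
]

joined = list(map_side_join(klass, build_hash_table(room, "RoomID"), "RoomID"))
```

In this sketch the cost of the join is one pass over the large table plus the memory needed for the small one, which is why the approach only pays off when the small tables really are small.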

III. PROPOSED WORK

A. Hadoop-Hive Interaction for MapJoinReduce

[Figure 1. Hadoop-Hive Interaction For Map-Join-Reduce. Hive component: SerDe, MetaStore, Query Processor with the Map-Join-Reduce Execution Engine (ql/exec) and the Hadoop Record Readers, Input and Output Formatters For Hive (ql/io). The Map-Join-Reduce job configuration is submitted to Hadoop for execution. Hadoop components: NameNode, DataNode, JobTracker, and Task1, Task2, Task3, each with Map, Join, and Reduce.]

As stated earlier, Hadoop is the base for the Hive query execution plan. A query submitted for execution is given to Hive, which converts it into MapReduce tasks that filter data, aggregate data, and compute the result. This requires Hadoop's support, which executes the query by dividing it into small jobs called tasks. The execution of these small tasks is handled by the Hadoop task tracker. The data being manipulated is supervised and tracked by the Hadoop NameNode, the master of HDFS, which directs the DataNodes for local data tasks. The JobTracker manages tasks, processes, node assignment, and jobs so that all tasks are tracked and executed over the distributed system without failure.

The proposed Hadoop-Hive interaction is represented in Figure 1, which shows the Hive component and the Hadoop components. The Hive component has the SerDe, the MetaStore, and the Query Processor. The Query Processor has the MapJoinReduce execution engine (ql/exec) and the Hadoop record readers and input/output formatters for Hive (ql/io). During Hive query execution, intermediate results are generated, so a temporary cache is maintained to hold them while the result is computed. Hive achieves this with the SerDe system, which serializes and deserializes the intermediate data. The Query Processor of the Hive component carries the MapJoinReduce functionality, which is proposed for better efficiency. With the help of the Hadoop record readers and the Hive input/output formatters, the MapJoinReduce configuration is given to Hadoop for execution.
The Hadoop JobTracker receives the MapJoinReduce jobs and allocates them to different tasks for execution.

B. Detail Level Design

We now elaborate the design for HiveQL MapJoinReduce. Hive provides a mechanism for extracting data from huge data sets using HiveQL. HiveQL allows traditional MapReduce as well as custom MapReduce, as required.

[Figure 2. Detail Level Design For Hive Query Execution. Flow: Hive query for execution, Hive_CLI, Syntactic Analysis, Semantic Analysis, Compilation, MapJoinReduce Job Configuration (Generate MapJoinReduce Graph), Execute Map Task (generate intermediate results: filtered data), Execute Join (generate intermediate result), Execute Reduce Task (generate final result: aggregated data).]
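The three execution stages named in Figure 2 (Map, Join, Reduce) can be sketched in Python as three plain functions chained together. This is a single-process illustration under stated assumptions, not the proposed Hive implementation: the function names `map_task`, `join_task`, and `reduce_task` and the sample rows are hypothetical, and the distribution of work across Hadoop tasks is left out entirely.

```python
from collections import Counter


def map_task(rows, predicate):
    """Map stage: keep only the rows that satisfy the filter."""
    return [row for row in rows if predicate(row)]


def join_task(left, right, key):
    """Join stage: inner equi-join of two filtered intermediate results."""
    index = {}
    for right_row in right:
        index.setdefault(right_row[key], []).append(right_row)
    return [
        {**right_row, **left_row}
        for left_row in left
        for right_row in index.get(left_row[key], [])
    ]


def reduce_task(rows, group_key):
    """Reduce stage: aggregate the joined rows (here, a count per group)."""
    return Counter(row[group_key] for row in rows)


# Hypothetical sample rows; column names follow the queries in this paper.
person = [
    {"PersonID": 1, "gender": "f"},
    {"PersonID": 2, "gender": "m"},
    {"PersonID": 3, "gender": "f"},
]
student = [
    {"PersonID": 1, "status": "regular"},
    {"PersonID": 2, "status": "regular"},
    {"PersonID": 3, "status": "guest"},
]

filtered_person = map_task(person, lambda r: r["gender"] == "f")
filtered_student = map_task(student, lambda r: r["status"] == "regular")
joined = join_task(filtered_person, filtered_student, "PersonID")
result = reduce_task(joined, "gender")
```

Filtering before the join keeps the intermediate results small, which is the property the design relies on: only the filtered data is carried from the Map stage into the Join stage, and only joined rows reach the Reduce stage.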

The user submits a data extraction query from Hive_CLI (the Hive command line interface). The Hive system then performs syntactic analysis to find syntax errors in the submitted query, followed by semantic analysis. Both syntactic and semantic analysis are performed on the client side. Once the query is compiled, Hive generates a Map-Reduce configuration; here we enhance it to generate a MapJoinReduce configuration. This job is given to Hadoop, which provides the platform for job execution. For Hadoop, these are again Map-Join-Reduce tasks.

IV. HANDS ON EXISTING SYSTEM FOR BENCHMARKING

Before working hands-on with the existing HiveQL engine, we need to select the environment setup for Hive so that further development is easier.

A. Operating System
I selected the operating system for its development-friendly environment and set up the system on 32-bit Ubuntu LTS.

B. IDE
This project needs the Java platform, so the Eclipse Kepler IDE is installed on my system.

C. Hadoop
The project is Hive based, so it needs the Hadoop platform. A single-node cluster setup is completed so that Map-Reduce operations can be performed on this system.

D. Hive
On top of the Hadoop system, we have the Hive system to run our queries. The Apache Hive stable distribution is installed for this system.

E. Git And Hadoop
A Git repository is linked to the system for project management purposes. A copy of git is needed on the system, so a local git repository is cloned from the Apache repository.

F. Data Setup And Generation
Unlike other database systems, Hive stores data in flat files, so when creating tables we have to specify the delimiters for columns and rows. A university database is created for the system, with 13 tables: address, class, country, course, payment, person, remark, room, staff, state, student, studentclass, term.

G. Data Loading
The database is loaded with data from a data generator. The generated test data is loaded into the local HDFS system using the LOAD command, which loads data from a given path into the specified table of the database. All 13 tables are loaded.

H. Queries For Data Extraction
To perform black box testing on the existing system, join queries are written and executed against it with the loaded data.

1) Query1: How many girls get a scholarship

select count(*) from Person
JOIN student ON (person.PersonID = student.PersonID)
JOIN StudentClass ON (student.studentID = studentClass.studentID)
JOIN remark ON (studentclass.remarkID = remark.remarkID)
JOIN payment ON (payment.paymentid = student.paymentID)
where person.gender='f' AND remark.remark='good' AND student.status='regular' AND payment.amount

Person table: 15000 rows
Student table: 12000 rows
studentclass table: 12000 rows
remark table: 12000 rows
payment table: 12000 rows
total: 63000 rows
time taken: seconds
number of joins: 4

2) Query2: Arrange park visit by male professors for third term
How many staff members of gender M with designation XXX are assigned to a room at location YYY near the park for term TTT (1, 2, 3, 4)

select count(*) from Person
JOIN staff ON (Person.PersonID = Staff.PersonID)
JOIN Class ON (Class.ClassID = Staff.StaffID)
JOIN Room ON (Class.RoomID = Room.RoomID)
JOIN Term ON (Class.TermID = Term.TermID)
where Person.gender='M' AND staff.designation='lecturer' AND Term.TermID='3' AND Room.Location='*.Park';

Person: 15000
staff: 2000
class: 2000
room: 70
term: 28
total: 19098
time taken: seconds
number of joins: 4

3) Query3: How many students have changed course in semester 2 in a class

select count(*) from student
JOIN studentclass ON (student.studentID = studentclass.studentID)
JOIN class ON (studentclass.classID = class.ClassID)
JOIN Course ON (class.CourseID = Course.CourseID)
where class.termid='2'

student: 12000
studentclass: 12000
class: 12000
course: 2000
total: 38000
time taken: seconds
number of joins: 3

a) First Data Load Results
The query execution results are tabulated to analyze the relation between data size, number of joins, and the resulting execution time of each query.

TABLE I. FIRST DATA LOAD RESULT

Query     Number of Rows   Number of Joins   Time of Execution (in sec)
Query1    63000            4
Query2    19098            4
Query3    38000            3

From this table, we can see that the number of joins directly affects the time of execution. Query1 and Query2 each have 4 joins, so their execution time is high. Query3 has 3 joins, and it takes less time to execute than Query1 and Query2.

b) First Data Load Result Chart For 3 Queries
The data load results are plotted as a chart. This helps us analyze the effect of data size and number of joins on the execution time of query processing.

[Figure 3. Data Load Result Chart. Series: Time (in sec), No. of Joins.]
X-axis: number of rows.
Y-axis: execution time in seconds taken for result calculation, and number of joins for the query.

CONCLUSION

A solution to the problem is proposed: enhancing the design architecture of the HiveQL engine for MapJoinReduce. The paper also presents the Hadoop-Hive interaction design, in which Map-Join-Reduce tasks are executed by Hadoop, since Hadoop is the platform on which Hive queries are executed. Implementing this system led us to benchmark the performance of the existing system. This benchmark will help us measure the improvement achieved by the enhanced HiveQL system.
REFERENCES

[1] For Hadoop setup
[2] For Hive installation guidelines
[3] Hive stable version
[4] For Git Repository
[5] For generating ssh keys git hub
[6] Data generator
[7] Language manual for Hive Manual+Types And Manual+DML
[8] MAP-JOIN-REDUCE: Toward Scalable and Efficient Data Analysis on Large Clusters, Dawei Jiang, Anthony K. H. Tung, and Gang Chen. IEEE Transactions on Knowledge and Data Engineering, Vol. 23, No. 9, September 2011
[9] A Comparison of Join Algorithms for Log Processing in MapReduce, Spyros Blanas, Jignesh M. Patel, Vuk Ercegovac, Jun Rao
[10] Eugene J. Shekita, Yuanyuan Tian, SIGMOD'10, June 6-11, 2010, Indianapolis, Indiana, USA.
[11] Optimizing Joins in a Map-Reduce Environment, Foto N. Afrati, Jeffrey D. Ullman, ACM. EDBT 2010, March 22-26
[12] Hadoop in Action, Chuck Lam, Volume 1
[13] DevelopersGuide-Apache Hive-Apache Software Foundation