International Conference on Applied Science and Engineering Innovation (ASEI 2015)

An Efficient Join Engine for SQL Queries Based on Hive with HBase

Zhao Zhi-cheng & Jiang Yi
Institute of Computer Forensics, Chongqing University of Posts and Telecommunications, Chongqing 400065, China

KEYWORDS: HBase; Hive; multi-join; Join Engine

ABSTRACT. The combined HBase and Hive platform makes it easy to analyze huge amounts of data, but multi-join queries are the bottleneck that restricts its performance. To solve this problem, we design a Join Engine between HBase and Hive that reads HBase data and completes multi-table joins and optimization in advance of queries. The Join Engine reduces the time the MapReduce process spends on data sorting and shuffling, effectively improving the efficiency of combined HBase and Hive queries. Experimental results show that the solution with the Join Engine clearly shortens query time for multi-table joins and supports quasi-real-time HBase applications.

1 INTRODUCTION

With the development of the Internet, data has grown explosively. Unstructured data has long since outgrown traditional structured data, and traditional databases can no longer meet the requirements. How to read and analyze large amounts of unstructured data efficiently has become an issue of common concern. The distributed column-oriented database Hadoop Database (HBase) [1] not only offers high scalability and large capacity but also runs on the Hadoop platform, which has made HBase very popular. However, it does not support SQL queries [2][3], and this restricts its adoption. At present, HBase relies on Hive's query mechanism to solve this problem.

1.1 Related work

The HBase community has put forward two kinds of composite architectures to compensate for HBase's lack of SQL support. One is integration with a MySQL framework [4]; the other is integration with Hive.
The first solution adds a MySQL layer with an indirect link to HBase: whenever the data in HBase is updated, the scheme must copy the HBase data back into the MySQL database before SQL queries can run. The second scheme uses HBaseIntegration to establish a direct connection between HBase and Hive, so Hive supports real-time queries even as the HBase database is updated. Hive is an open-source data-warehouse project on the Hadoop platform [5]. It parses SQL statements into an execution plan and runs them as MapReduce jobs, and it is widely used in massive log analysis. The HBase-with-Hive scheme can make full use of MapReduce's parallelism to provide users with convenient SQL query and analysis of their data [6], so it has been widely adopted in the big-data web era [7]. But the scheme cannot avoid the problem of multi-table join queries [8]. Many applications reorder the data before the join; although this improves performance slightly, the optimizer must know the SQL statement before it can optimize the order of the join data [10]. In conclusion, integrating HBase with Hive is the best available solution for SQL queries on HBase, but its MapReduce jobs cost so much time that the scheme cannot be used in real-time or quasi-real-time applications.

2 THE HBASE AND HIVE INTEGRATION FRAMEWORK

Hive establishes a one-to-one relationship with the original HBase tables, parses the SQL queries, and runs the MapReduce tasks.

© 2015 The authors. Published by Atlantis Press.

Figure 1 Hive and HBase integration

As shown in Figure 1, the HBase application programming interface HBaseStorageHandler supports the one-to-one mapping with Hive. Hive is responsible for parsing the SQL statement and executing the query task, while HBase provides the storage capacity. The basic SQL query process in the HBase-Hive integration is: first establish an association between HBase and Hive, then let Hive parse and decompose the SQL statement into an execution plan, and finally execute the MapReduce tasks and return the results to HBase and the client. In detail:

1) The mapping between Hive and HBase must be established before any query. Hive typically creates an external table on the client side that maps to the physical storage in HBase. After the mapping, Hive can read and write the data stored in HBase.

2) Hive parses the SQL statement into an execution plan: when the Hive client receives the user's SQL statement, the Hive parser converts it into an ASTNode syntax tree, then generates an operator tree/DAG, applies some optimizations, and finally turns the tree/DAG into several MapReduce tasks.

3) The task information generated in the compilation stage is serialized to plan.xml; then the MapReduce job starts, the corresponding data is read from HBase and fed to the map function, and plan.xml is deserialized during configuration.

4) After the map and reduce functions return a result set, the results are saved through the output Collector class, and the RecordWriter sets the file type used to store them. The HiveBaseResultSet class then writes the results back to HBase and displays them to the client.

3 HBASE HIVE QUERY OPTIMIZATION

The HBase-Hive query mechanism consists of three modules: the HBase-Hive connection, SQL parsing, and MapReduce jobs. The MapReduce job is the most time-consuming step, because the table-join operation is processed entirely by the map-join function inside the MapReduce tasks.
3.1 Multi-table query problem analysis

Multi-table join queries are a very effective data-analysis method, but HBase does not support them. The HBase-Hive integration solves this, yet the efficiency of multi-table queries becomes a new challenge. At present most commercial applications and research only adjust the join order to improve query efficiency, and some inefficient SQL statements can even make the process collapse. The core problem is that the multi-table join inside the MapReduce tasks needs a lot of processing time and extra jobs. For instance, a query joining three tables will generate at least two MR jobs:

SELECT a.val, b.val, c.val FROM a JOIN b ON (a.key = b.key1) JOIN c ON (c.key = b.key2)

Here three tables are connected on two keys. How does the MapReduce task connect them? In the first job, tables a and b are joined on the first key; in the second job, the result of the first job is joined with table c on the second key in a second map-join task. The SQL query has to wait for two MR jobs just for the table connections, which makes this the most time-consuming step in an HBase-Hive integration query.
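The two-job plan for the query above can be sketched as two chained hash joins, each standing in for one MR job. The sketch below is a toy illustration with made-up sample data, not Hive's actual execution code:

```python
# Sketch of the two MR jobs generated for
#   SELECT ... FROM a JOIN b ON (a.key = b.key1) JOIN c ON (c.key = b.key2)
# Each hash_join call stands in for one MapReduce job.

def hash_join(left, right, left_key, right_key):
    """Equi-join two lists of dict rows: build a hash table on one side
    (as a map join does with the small table) and probe it with the other."""
    table = {}
    for row in right:
        table.setdefault(row[right_key], []).append(row)
    out = []
    for row in left:
        for match in table.get(row[left_key], []):
            out.append({**row, **match})
    return out

a = [{"key": 1, "a_val": "x"}]
b = [{"key1": 1, "key2": 7, "b_val": "y"}]
c = [{"key": 7, "c_val": "z"}]

# Job 1: join a with b on a.key = b.key1.
ab = hash_join(a, b, "key", "key1")
# Job 2: join the intermediate result with c on c.key = b.key2 --
# it cannot start until job 1 has finished, hence the serial wait.
abc = hash_join(ab, c, "key2", "key")
```

The second call consumes the output of the first, which is exactly why the query must wait for two sequential MR jobs.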

3.2 HBase and Hive integration optimization

Because of HBase's column-oriented structure, each HBase table usually has only one or two column families; if there are too many, HBase's read and write speed suffers. In a column-oriented database the unit of reading and writing is a column, so it is not necessary to store many columns in a single table, and compared with a traditional database HBase can afford to store more redundant data. Duplicated column data means that equi-joins occur frequently in multi-table join queries. Using the principles of pre-processing and caching, this article designs a Join Engine that pre-establishes the multi-table joins and sorts the data after optimization. When a client query arrives, the results can be fetched directly from the Join Engine, greatly improving the efficiency of client queries. In practice it is impossible to know in advance which tables should be read into the Join Engine: if the Join Engine loads tables according to some rule but the user never queries them, the effort is wasted. Queries, however, exhibit locality, in that the same queries are often repeated on the same system, so it is best to read the related tables that appear in the query history.

3.3 Join Engine

By separating multi-table joining from the query operation and exploiting the principle of locality together with log analysis, this article changes the traditional HBase-Hive integration framework and adds a Join Engine, as shown in Figure 2.

Figure 2 Join-Engine

The Join Engine consists of two modules. The pre-reading mechanism reads the relevant data from HBase based on the query history in Hive. The optimized-connection module completes the multi-table join operations and sort-optimizes the data before MapReduce to raise query efficiency. The pre-reading module selects tables in HBase according to the query-history record; in practical systems it can also be configured manually for more accurate results. The optimized-connection module sorts and merges the tables.
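A minimal sketch of the two modules just described: pre-reading driven by the query history, then pre-joining and pre-sorting so that a later query can be answered from the cache. All class and parameter names here are invented for illustration; this is not the paper's actual implementation.

```python
from collections import Counter

class ToyJoinEngine:
    """Toy Join Engine: pre-reads table pairs that recur in the query
    history, pre-joins them, and keeps each result sorted by join key."""

    def __init__(self, storage, history, min_hits=2):
        self.storage = storage          # table name -> list of dict rows
        self.cache = {}                 # (t1, t2, key) -> pre-joined rows
        # Pre-reading module: only pairs queried at least min_hits times
        # in the history are worth loading (locality of queries).
        for (t1, t2, key), n in Counter(history).items():
            if n >= min_hits:
                self._prejoin(t1, t2, key)

    def _prejoin(self, t1, t2, key):
        # Optimized-connection module: hash join, then sort by the join
        # key so the later MapReduce phase receives pre-sorted input.
        index = {}
        for row in self.storage[t2]:
            index.setdefault(row[key], []).append(row)
        joined = [{**l, **r}
                  for l in self.storage[t1]
                  for r in index.get(l[key], [])]
        self.cache[(t1, t2, key)] = sorted(joined, key=lambda r: r[key])

    def query(self, t1, t2, key):
        # Fast path: serve the pre-joined, pre-sorted result if cached.
        return self.cache.get((t1, t2, key))

storage = {
    "a": [{"id": 2, "name": "bob"}, {"id": 1, "name": "alice"}],
    "b": [{"id": 1, "tag": "x"}, {"id": 2, "tag": "y"}],
}
# ("a", "b", "id") recurs in the history, so it is pre-joined;
# ("a", "c", "id") was asked only once and is skipped.
history = [("a", "b", "id"), ("a", "b", "id"), ("a", "c", "id")]
engine = ToyJoinEngine(storage, history)
result = engine.query("a", "b", "id")
```

A query that misses the cache (here, any pair not in the history often enough) would fall back to the ordinary HBase path, mirroring the two query routes in the paper's Figure 3.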
After adding the Join Engine:

1) The map-join operation, the most time-consuming step, is completed before the SQL query arrives. Without the Join Engine, a Hash Table file must first be generated from the small table and distributed to the cache, and then MapJoinOperator loads the small table (the Hash Table) from the Distributed Cache to execute the map-join, so the join is a complex and time-consuming part of the MapReduce process.

2) The Join Engine sort-optimizes the data while the join operation is in progress, so the map input splits can reach their optimal layout: similar data is assigned to the same split, which speeds up the shuffle work after the map phase. Without the Join Engine, splits are loaded into the map function in random order, and before the reduce function receives the data it must pass through the shuffle, which repeatedly reads data into RAM to sort and merge it. Pre-optimized joins therefore effectively reduce the sorting and merging work of the shuffle, and query time is shortened.
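The effect described in point 2 can be illustrated by contrasting how a reduce-side grouping handles pre-sorted versus unsorted map output: pre-sorted runs only need a cheap streaming merge, while unsorted output must be fully sorted first. This is a small Python illustration of the idea, not Hadoop's actual shuffle code:

```python
import heapq
from itertools import groupby

def shuffle_and_reduce(map_outputs, presorted):
    """Group (key, value) pairs by key, as a shuffle + reduce would.
    map_outputs: list of per-mapper lists of (key, value) pairs."""
    if presorted:
        # Pre-sorted runs: one streaming merge suffices, mimicking the
        # cheap shuffle the Join Engine's pre-sorting makes possible.
        merged = heapq.merge(*map_outputs)
    else:
        # Unsorted runs: the shuffle must fully sort everything in RAM.
        merged = sorted(pair for run in map_outputs for pair in run)
    return {k: [v for _, v in grp]
            for k, grp in groupby(merged, key=lambda p: p[0])}

unsorted_runs = [[("b", 2), ("a", 1)], [("a", 3), ("c", 4)]]
sorted_runs = [sorted(run) for run in unsorted_runs]

# Both paths produce the same grouping; only the sorting work differs.
r1 = shuffle_and_reduce(unsorted_runs, presorted=False)
r2 = shuffle_and_reduce(sorted_runs, presorted=True)
```

Merging m pre-sorted runs costs O(n log m), whereas a full sort costs O(n log n), which is the saving the pre-sorting step buys.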

3.4 Optimized HBase and Hive SQL query

Figure 3 shows the SQL query process of the HBase-Hive integration after the Join Engine is added.

Figure 3 SQL query process

There are two ways to process an SQL query: either the MapReduce job reads its data directly from the Join Engine, or it reads from the original HBase tables.

4 THE DESIGN OF EXPERIMENT

4.1 The data source of the experiment

The experimental data comes from two custom data sets, A and B, with the same structure: table a <column family: name, column family: id>, table b <column family: id, column family: motive>, table c <column family: name, column family: tag>. In set A the tables share a lot of data in the id and name columns; set B has scarcely any duplicate data.

4.2 The results and analysis of the experiment

Experimental method: compare SQL query times with the Join Engine against the solution without it. Ten groups of comparative SQL query experiments were run; each group was executed ten times and the mean taken. Groups 1-3 are multi-table join queries on data set A; groups 4-6 run the same SQL queries as groups 1-3 but on data set B; groups 7-10 are non-multi-table-join queries on data set A. The results are shown in Figure 4.

Figure 4 SQL query time

As the figure shows, the group 1-3 experiments show a very large gap between the Join Engine solution and the old one, while with the same SQL statements groups 4-6 show little

difference in SQL query time. The only difference is the data sets. Data set A contains a lot of duplicated data, so the map-join produces many more results, while data set B generates far less data at the map-join because it has little correlated data. Every 100 000 rows of correlated input can return billions of joined rows, consuming a great deal of memory and CPU time; with data set B the correlated values barely exist, the multi-table join returns almost nothing, and the cost is very small. The efficiency of the Join Engine is therefore closely tied to the data set: the more correlated column values a data set has, the more the Join Engine improves multi-table SQL queries. The group 7-10 experiments show that the Join Engine has little influence on single-table SQL queries.

Experiment 2 tests the impact of sorted data on MapReduce query efficiency. Two tables containing the same 100 000 rows were used, one sorted and the other unordered, and their SQL query times were compared. The results are shown in Figure 5.

Figure 5 Sorted and unsorted data query time comparison

As the figure shows, in experiments 2-4 the sorted data source performs better during MapReduce execution: those SQL queries contain conditions that require sorted results, so the reduce function must sort the data, and a pre-sorted source effectively reduces that sorting work. The first SQL test is a plain SELECT *, so the reduce function does not need to sort the results and MapReduce performance is unrelated to the ordering of the source data.

The experiments show that adding the Join Engine effectively removes the map-join time from the MapReduce operation. For highly correlated data sets the Join Engine sharply reduces multi-table join query time, because the higher the correlation of the data set, the more pre-processing work the Join Engine performs.
This means the Join Engine can better support quasi-real-time SQL query applications, and the effect of the pre-processing mechanism grows more pronounced. The pre-sorting step reduces the MapReduce sort operation when the query contains ordering conditions, effectively reducing MapReduce time. The Join Engine therefore improves query efficiency in two ways: it reduces the MapReduce time spent on map-join operations and on sorting operations.

5 CONCLUSION AND FUTURE WORK

As the integration technology of HBase and Hive has developed, multi-table query performance has become the key factor restricting it. This article presents an HBase-Hive integration optimization that adds a Join Engine to pre-process multi-table joining and sorting. Experimental results show that the added Join Engine gives higher efficiency on SQL queries, especially multi-table join queries. The next step is to continue optimizing the design of the Join Engine so that its analysis of the original data can adapt to any HBase database.

6 REFERENCES

[1] HBase [EB/OL]. http://hbase.apache.org/
[2] J. Kennedy. Indexed Transactional HBase [EB/OL]. https://github.com/hbase-trx/, 2011.
[3] L. George. HBase: The Definitive Guide. Sebastopol: O'Reilly Media, 2011.
[4] HBaseIntegration [EB/OL]. https://cwiki.apache.org/confluence/display/Hive/HBaseIntegration/
[5] Hive [EB/OL]. http://hive.apache.org/
[6] ZHAO Long, JIANG Rong-an. Research of massive searching logs analysis system based on Hive [J]. Application Research of Computers, 2013, 30(11): 3343-3345.
[7] Doshi K A, Zhong T, Lu Z, et al. Blending SQL and NewSQL approaches: reference architectures for enterprise big data challenges [C]// Cyber-Enabled Distributed Computing and Knowledge Discovery (Cyber), 2013 International Conference on. IEEE, 2013: 163-170.
[8] Barbierato E, Gribaudo M, Iacono M. Performance evaluation of NoSQL big-data applications using multi-formalism models [J]. Future Generation Computer Systems, 2014, 37: 345-353.
[9] WANG Mei, XING Lulu, SUN Li. MapReduce based heuristic multi-join optimization under hybrid storage [J]. Journal of Frontiers of Computer Science and Technology, 2014, 8(11): 1334-1344.
[10] WANG Jing, WANG Teng-jiao, YANG Dong-qing, LI Hong-yan. A filter-based multi-join algorithm in cloud computing environment [J]. Journal of Computer Research and Development, 2011 (S3): 245-253.