Massive Data Query Optimization on Large Clusters

Journal of Computational Information Systems 8: 8 (2012) 3191-3198
Available at http://www.jofcis.com

Guigang ZHANG, Chao LI, Yong ZHANG, Chunxiao XING

Research Institute of Information Technology, Tsinghua National Laboratory for Information Science and Technology, Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China

Abstract

The growing demand for massive data processing and analysis has led both academia and industry to design many new types of highly scalable, data-intensive computing platforms based on large clusters in the cloud environment. Achieving a fast query response time, especially for ad hoc queries, is becoming very important in large-cluster environments. In this paper, we design a series of algorithms for query optimization and an efficient massive data query and optimization mechanism, SemanQuery. SemanQuery has two characteristics. First, it has better semantics; that is, it shows some intelligence when processing massive data queries by means of a semantic matching algorithm. Second, to reduce the query cost, we construct a very large query network in SemanQuery and optimize it. Simulation experiments show that SemanQuery improves query efficiency on large clusters.

Keywords: Query Processing; Query Optimization; Query Network; SemanQuery; Optimization Algorithm

1 Introduction

With the development of cloud computing technologies, the Internet, the mobile Internet and the Internet of Things, the number of terminals and information collectors of all kinds is increasing rapidly. Every thing and every person will produce massive data. IDC predicts that the volume of data will reach 8 ZB in 2015. All these big data will enter information systems, where they need to be stored, analyzed and used. Processing these big data in the cloud environment is very difficult.
Massive data is also known as big data. It has become a hot research topic in both academia and industry. Querying data that satisfy the requirements of millions of users from the massive data held in files, relational databases and cloud databases is a great challenge. The challenge can be summarized as: how can the data that satisfy the requirements of millions of users be found quickly?

To address this challenge, this paper proposes techniques that combine semantic technologies and query optimization technologies. We first propose a very big query network. The big query network includes all the SQL query plans that satisfy the requirements of millions of users. We assume that all file query plans in file systems such as HDFS and GFS, and all data query plans in databases, can be converted into SQL query plans. We then propose a semantic matching algorithm between the users' query requirements and the big query network. The semantic matching algorithm mainly helps us find a good query path in the big query network.

Project supported by the National Basic Research Program of China (973 Program) No. 2011CB302302, the National Natural Science Foundation of China under Grant No. 61170061, and the Research Foundation of the Ministry of Railways and Tsinghua University under Grant No. J2010Z057, J2010Z059.
Corresponding author. Email address: guigang@mail.tsinghua.edu.cn (Guigang ZHANG).
1553-9105 / Copyright 2012 Binary Information Press, April 2012

2 Related Work

Data query and optimization is very important in the database and data processing area, and it has been studied extensively. The work can be divided into RDBMS query and optimization, OODBMS query and optimization, traditional database optimization, and distributed database query and optimization. Paper [1] analyzes an object-oriented implementation for extensible database query optimization. In the early stage, most research on query and optimization was aimed at traditional databases such as MySQL, Oracle, Sybase and DB2. Later, more and more data had to be stored in distributed environments, and many research theories and methods came to focus on them. For example, paper [2] proposed efficient parallel skyline processing using hyperplane projections. With the development of cloud computing and cloud storage technologies, big data [3] query and optimization have become increasingly popular in recent research. Big data processing includes structured, semi-structured and unstructured data processing, and especially the query and optimization processing of uncertain [4] data.
Massive unstructured data processing mainly uses the MapReduce computing model [5]. Building on MapReduce, and in order to improve the efficiency of big data query and processing, many other Internet massive data computing frameworks have been proposed in recent years, such as Twister, HaLoop, Hadoop++, Spark, CrowdDB [6] and Yale University's HadoopDB. The objective of all these new computing frameworks is to improve the efficiency of querying big data. Query and optimization methods can greatly improve the efficiency of massive data processing. Researchers have designed many such methods for data processing, such as Top-K [7], processing over joins [8] and two-way selection [9], as well as query workload balance [10].

3 SemanQuery Architecture

The SemanQuery architecture is shown in Fig. 1. It can be described as follows:

(1) All files are stored in the local file system. The file system may be a Windows file system, a Linux file system or another file system.

(2) An RDBMS (Relational Database Management System) may run on top of the local file system or of a DFS (Distributed File System) such as the Google File System or the Hadoop Distributed File System. Cloud databases mainly store the index information and metadata of the massive files and manage all these files. All these files (text files, video files, picture files and so on) and the final files of the cloud databases run on top of the distributed file system.

(3) All RDBMS databases and cloud databases are made up of many tables, and all our queries target these tables. The big query network obtains data from these tables, and the data are processed by the query plans.

(4) Users enter query requirements and receive query results through the user web interface.

(5) When SemanQuery receives the users' query requirements, it performs a semantic matching against the big query network. If the big query network already contains the corresponding query plans, SemanQuery obtains their query paths in the big query network and executes the query plans along those paths. If SemanQuery cannot find the users' query plans in the big query network, it adds them to the big query network, creating a new big query network. Once the new big query network has been created, the users' query requirements are submitted again and the query plans are executed.

(6) After the query plans have been executed, the results are displayed in the user web interface.

Fig. 1: SemanQuery architecture

4 SemanQuery Implementation Method

4.1 Big query network

In the cloud environment, millions of query plans are submitted by users every day. All these query plans can be assembled into a very big query network. We first look at Example 1.

[Example 1] There are four selection plans, as follows:

S1: SELECT T1.A FROM T1 WHERE T1.A > x

S2: SELECT T2.B, T2.C FROM T2 WHERE T2.D > y
S3: SELECT T2.B FROM T2 WHERE T2.D > z
S4: SELECT T1.A FROM T1, T2 WHERE T1.A = T2.A AND T1.A > x

Fig. 2: Graphic representation of the S1, S2, S3 and S4 query plans

Fig. 2 shows a graphic representation of the S1, S2, S3 and S4 query plans. The big query network generation algorithm is as follows:

[Algorithm 1] Big Query Network Generation Algorithm
Input: Tables, Query Conditions
Output: Big Query Network
1. For (i = 0; i < QueryPlan.numbers; i++) {
2.     Parse(QueryPlan[i]);
3.     Get the Tables from QueryPlan[i];
4.     Construct a node for every table;
5.     Get the Conditions from QueryPlan[i];
6.     Construct a node for every table;
7.     Get the Query Out from QueryPlan[i];
8.     Construct a node for every table;
9.     Generate a QueryPlan[i] Tree;
10. }
11. Find the same Tables in all QueryPlan[i] Trees;
12. Combine all the same Tables;
13. End
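The idea behind Algorithm 1 can be illustrated with a minimal Python sketch built on the four plans of Example 1. It is not the authors' implementation. It assumes the plans have already been parsed into tables, conditions and outputs, and the names `QueryPlan`, `QueryNetwork` and `build_network` are hypothetical. Sharing identical condition nodes goes one step beyond Algorithm 1, which only merges identical tables, and is an assumption made here for illustration.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class QueryPlan:
    """A pre-parsed selection plan (SQL parsing itself is out of scope here)."""
    name: str
    tables: tuple
    conditions: tuple
    outputs: tuple


@dataclass
class QueryNetwork:
    """Hypothetical container for the merged network of all plans."""
    tables: set = field(default_factory=set)      # table nodes, merged by name
    conditions: set = field(default_factory=set)  # condition nodes, merged by text (assumption)
    results: dict = field(default_factory=dict)   # plan name -> output columns
    edges: set = field(default_factory=set)       # (source, target) pairs

    def add_plan(self, plan):
        # A set keeps one node per table, so identical tables are combined.
        self.tables.update(plan.tables)
        for cond in plan.conditions:
            self.conditions.add(cond)
            # Link each table that the condition mentions to the condition node.
            for table in plan.tables:
                if f"{table}." in cond:
                    self.edges.add((table, cond))
            # Link the condition node to the result node of this plan.
            self.edges.add((cond, plan.name))
        self.results[plan.name] = plan.outputs


def build_network(plans):
    network = QueryNetwork()
    for plan in plans:
        network.add_plan(plan)
    return network


plans = [
    QueryPlan("S1", ("T1",), ("T1.A > x",), ("T1.A",)),
    QueryPlan("S2", ("T2",), ("T2.D > y",), ("T2.B", "T2.C")),
    QueryPlan("S3", ("T2",), ("T2.D > z",), ("T2.B",)),
    QueryPlan("S4", ("T1", "T2"), ("T1.A = T2.A", "T1.A > x"), ("T1.A",)),
]
network = build_network(plans)
print(sorted(network.tables))
print(sorted(network.conditions))
```

In this sketch the four plans collapse onto two table nodes, and the condition `T1.A > x` used by both S1 and S4 is stored once and linked to both results.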

Fig. 3: Part of a big query network after optimization

Fig. 4: A semantic matching result at time [1]

4.2 Semantic matching

We let T denote a table node, C a computing node (such as Selection or Union), R a result node and U a user node. Fig. 3 shows part of a big query network after optimization. Fig. 4 shows a semantic matching result at time [1] and Fig. 5 shows a semantic matching result at time [2]. From Fig. 4 and Fig. 5, we can see that the query plan requirements at time [1] and time [2] are both subsets of the big query network. The query plan requirements at time [1] are the yellow part of the big query network shown in Fig. 4, and those at time [2] are the green part shown in Fig. 5.

Fig. 5: A semantic matching result at time [2]

Fig. 6 differs from Fig. 4 and Fig. 5. Assume that at time [3] the users' whole query plan cannot be found completely in the big query network; that is, the set of query plans is not a subset of the big query network. As shown in the right part of Fig. 6, only the light red part is a subset of the big query network, and the red part (C14, C9 and U9) is not. We therefore add the red part to the big query network, as shown in the left part of Fig. 6.

Fig. 6: A semantic matching result at time [3]

In this section we have described the optimization methods of the big query network and used Fig. 4, Fig. 5 and Fig. 6 to show the state of the query network at time [1], time [2] and time [3]. Algorithm 2 below is the semantic matching algorithm. It produces the state of the query network at any time, including time [1], time [2] and time [3].

[Algorithm 2] Semantic Matching Algorithm
Input: A series of query plans at time [i], Big Query Network
Output: A query sub-network, or a new big query network together with a query sub-network
1. Start
2. For (every time[i], i >= 1) {
3.     For (j = 1; j <= TotalNumbers; j++) {
4.         Get the Tables from QueryPlan[j];
5.         If (QueryPlan[j].Tables ⊆ BigQueryNetwork.Tables)
6.             Flag all QueryPlan[j].Tables in BigQueryNetwork.Tables;
7.         Else
8.             Add all those tables {tables ∈ QueryPlan[j].Tables, tables ∉ BigQueryNetwork.Tables};
9.         If (QueryPlan[j].Conditions ⊆ BigQueryNetwork.Conditions)
10.             Flag all QueryPlan[j].Conditions in BigQueryNetwork.Conditions;
11.         Else
12.             Add all those conditions {conditions ∈ QueryPlan[j].Conditions, conditions ∉ BigQueryNetwork.Conditions};
13.         Add all the users' output nodes;
14.         Connect all these flagged nodes with edges;
15.         Get the query sub-network in the old big query network or in the new big query network;
16.         Optimize the query sub-network using equivalent substitution methods;
17. }}
18. End

5 Simulation Experiment

In this simulation experiment, we compute and compare the costs of the two networks in Fig. 7. Part (a) of Fig. 7 is not optimized and part (b) has been optimized.

Fig. 7: Simulation experiment

According to the big query network described above, the cost of part (a) is given by formula (1) and the cost of part (b) by formula (2):

\[ \mathrm{Cost}(\mathrm{NoOptimization}) = \sum_{i=1}^{n} \mathrm{Cost}(C[i]) + \sum_{i=1}^{n} \mathrm{Cost}(U[i].T1(\,)) = n \, \bigl( \mathrm{Cost}(C1) + \mathrm{Cost}(U1.T1(\,)) \bigr) \tag{1} \]

\[ \mathrm{Cost}(\mathrm{Optimization}) = \mathrm{Cost}(C1) + \sum_{i=1}^{n} \mathrm{Cost}(U[i].T1(\,)) = \mathrm{Cost}(C1) + n \, \mathrm{Cost}(U1.T1(\,)) \tag{2} \]

Assume that table T1 has k records. C1 = C2 = Cn (their conditions are the same: T1.A > x), and so f1 = f2 = f3 = k. The filtering rate of C1, C2 and Cn is θ, so f4 = f5 = f6 = f1 · θ. The unit transmission time of a record of table T1 is π, and the unit computing time of C1, C2 and Cn is l. Assume π = t, l = 0.25t and θ = 0.5, and a unit traverse time of 0.25t for C1, C2 and Cn. The simulation result is shown in Fig. 8.
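To make formulas (1) and (2) concrete, the following Python sketch evaluates both costs for a few values of n. The paper does not spell out how Cost(C1) and Cost(U1.T1( )) depend on the parameters, so the two helper functions are assumptions: a condition node is assumed to traverse and evaluate each of the k input records, and a user transfer is assumed to move the k · θ filtered records at the unit transmission time π. The value k = 1000 and all function names are illustrative only.

```python
def cost_condition(k, traverse, compute):
    """Assumed cost of one condition node: each of the k records is traversed and evaluated."""
    return k * (traverse + compute)


def cost_transfer(k, theta, pi):
    """Assumed cost of sending the k * theta filtered records to one user."""
    return k * theta * pi


def cost_no_optimization(n, k, theta, pi, traverse, compute):
    # Formula (1): each of the n users has a private copy of the condition node.
    return n * (cost_condition(k, traverse, compute) + cost_transfer(k, theta, pi))


def cost_optimization(n, k, theta, pi, traverse, compute):
    # Formula (2): one shared condition node, followed by n result transfers.
    return cost_condition(k, traverse, compute) + n * cost_transfer(k, theta, pi)


t = 1.0   # base time unit
k = 1000  # number of records in T1 (illustrative value, not taken from the paper)
params = dict(k=k, theta=0.5, pi=t, traverse=0.25 * t, compute=0.25 * t)

for n in (1, 10, 100):
    print(n, cost_no_optimization(n, **params), cost_optimization(n, **params))
```

Under these assumptions the two costs coincide for a single user, and the gap grows linearly with n because the shared condition node is evaluated only once.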

Fig. 8: Simulation experiment result

6 Conclusions and Future Work

In the future, we will develop better query and optimization methods for the management of the metadata index and of the original massive data.

References

[1] Navin Kabra, David J. DeWitt. OPT++: An Object-Oriented Implementation for Extensible Database Query Optimization [J]. VLDB J. 8 (1): 55-78 (1999).
[2] Henning Köhler, Jing Yang, Xiaofang Zhou. Efficient parallel skyline processing using hyperplane projections. Proceedings of the SIGMOD 2011. pages: 85-96.
[3] Yuan LIN, Hongfei LIN, Li HE. A Cluster-based Resource Correlative Query Expansion in Distributed Information Retrieval. Journal of Computational Information Systems, 1 (2012), 31-38.
[4] Li YE, Zhiguang QIN. Uncertain Range Queries for Revised Bead Model. Journal of Computational Information Systems, 1 (2012), 81-89.
[5] Eaman Jahani, Michael J. Cafarella, Christopher Ré. Automatic Optimization for MapReduce Programs. Proceedings of the VLDB 2011. pages: 385-396.
[6] Michael J. Franklin, Donald Kossmann, Tim Kraska, Sukriti Ramesh, Reynold Xin. CrowdDB: answering queries with crowdsourcing. Proceedings of the SIGMOD 2011. pages: 61-72.
[7] Minji Wu, Laure Berti-Equille, Amélie Marian, Cecilia M. Procopiuc, Divesh Srivastava. Processing Top-k Join Queries. Proceedings of the VLDB 2010. pages: 860-870.
[8] Akrivi Vlachou, Christos Doulkeridis, Neoklis Polyzotis. Skyline query processing over joins. Proceedings of the SIGMOD 2011. pages: 73-84.
[9] Xavier Martinez-Palau, David Dominguez-Sal, Josep-Lluis Larriba-Pey. Two-way Replacement Selection. Proceedings of the VLDB 2010. pages: 871-881.
[10] Eric Lo, Nick Cheng, Wing-Kai Hon. Generating Databases for Query Workloads. Proceedings of the VLDB 2010. pages: 848-859.