Parallel Processing of JOIN Queries in OGSA-DAI


Parallel Processing of JOIN Queries in OGSA-DAI

Fan Zhu

Aug 21, 2009

MSc in High Performance Computing
The University of Edinburgh
Year of Presentation: 2009

Abstract

The JOIN query is the most important and often the most expensive of all relational operations, especially when its input comes from large tables in a distributed heterogeneous database. Since parallel join processing is a well-understood technique for obtaining results as quickly as possible, one way to speed up query execution is to exploit parallelism; and because most real queries join several tables, efficient join execution is very important. This thesis focuses on query processing in a distributed heterogeneous database, not inside a single DBMS. The aims of the project are: a) to investigate methods for the parallel execution of join queries, which are usually used to optimize a single join operation; b) to analyze the difference in performance caused by different query plans, which can be used to speed up complex queries that contain multiple join operations. The main steps and achievements of this project are the following: a) The first step of the project was to study and extend my knowledge of relational algebra and the OGSA-DAI (Open Grid Service Architecture - Data Access and Integration) software. As OGSA-DAI middleware can process queries and transform data from distributed resources, the mechanisms and interfaces defined in OGSA and the primary components of the OGSA-DAI middleware are used in our experiments. b) The second step was to design efficient parallel approaches to optimize the join execution strategies currently used by OGSA-DAI. The most important work was to analyze and investigate the parallel mechanisms available when executing complex join queries on large tables: Independent Parallelism, Pipelined Parallelism, Partitioned Parallelism and Mixed Parallelism. Based on this parallelism analysis, two parallel join algorithms - the Hash Split Join algorithm and the Sorted Merge Join algorithm - were adopted in the project. All functional modules were divided into OGSA-DAI activities, and the function of each implemented activity is described in detail. c) The third step was to implement the parallel algorithms and to evaluate the performance of parallel queries. The thesis discusses and analyzes the performance of every functional activity, such as the SQL Query Activity, Tuple Sort Activity and Sorted Merge Activity. It analyzes the performance of queries based on two-table joins, multi-table joins and joins on a distributed heterogeneous database. Based on our experiments, it points out the effect of different query plans.

Keywords: SQL Query, Join Query, OGSA-DAI, Parallelization.

Contents

Chapter 1 Introduction
  1.1 Project Aims
  1.2 Research Methods
  1.3 Thesis Structure
Chapter 2 Background Knowledge
  2.1 SQL and Relational Theory
  2.2 Query Graphs and Query Plans
  2.3 OGSA-DAI
Chapter 3 Analysis and Design of Parallel Algorithms
  3.1 Requirements Capture
  3.2 Mechanisms of Parallel Query Execution
  3.3 Partitioning Algorithms
  3.4 Parallel Join Implementations
  3.5 User Side Workflow
Chapter 4 Performance Analysis
  4.1 Experimental Setup
  4.2 Performance Analysis for Single Activity
  4.3 Single Join
  4.4 Multiple Join
  4.5 Join on Distributed Heterogeneous Database
Chapter 5 Conclusions
Appendix A Source Code
Appendix B Submission Script
References

List of Tables

Table 1 Bandwidth of SQL Query Activity
Table 2 Bandwidth of Hash Split Activity
Table 2 Sorted Merge Activity and Union All Activity
Table 3 Overall Activity Performance
Table 4 Query Plan 1 vs. Query Plan 2
Table 5 Performance on Different Databases
Table 6 Performance of Heterogeneous Database

List of Figures

Figure 1 Logical Query Plan
Figure 2 Inner Join
Figure 3 Query Graph Example
Figure 4 Query Plan Example
Figure 5 OGSA Services Framework
Figure 6 The Architecture of OGSA-DAI
Figure 7 OGSA-DAI Runtime Overview
Figure 8 Independent Parallelism
Figure 9 Pipelined Parallelism
Figure 10 Partitioned Parallelism
Figure 11 Independent and Pipelined Mixed Parallelism
Figure 12 Serial Join Workflow
Figure 13 Hash Split Join
Figure 14 Sorted Merge Join
Figure 15 Query Graph
Figure 16 Query Tree
Figure 17 Running Time of Reproduced Test
Figure 18 Workflow without Swallow Activity
Figure 19 Workflow with Swallow Activity
Figure 20 Performance of Hash Split Activity
Figure 21 Array List vs. Linked List
Figure 22 Performance of Tuple Sort Activity
Figure 23 Query Plan 1
Figure 24 Query Plan 2
Figure 25 Re-use in Hash Split Join
Figure 26 DB2 vs. MySQL

Acknowledgements

First of all, I would like to express my deepest thanks to my supervisor, Mr. Bartosz Dobrzelecki, who has provided me with valuable suggestions and guidance throughout this dissertation. I also want to extend my thanks to all my friends for their encouragement and support.

Chapter 1 Introduction

1.1 Project Aims

With the wide application of digital technology, the amount of data to be processed increases at a higher rate than the speed of processing units. This means that traditional database query algorithms may no longer be well suited to the massive distributed data sets found on the internet. If a query operation takes a long time to produce its final result, the information it generates may already be obsolete in many application domains. Can we reduce the running time of a query by some technique? On the other hand, given a query on multiple tables in a database application system, there are many schemes that a database management system can follow to process it and produce its results. Although all schemes will produce equivalent results, their running costs vary: the amount of time two schemes need can differ, sometimes enormously. Which scheme needs the least amount of time?

The research motivation can also be stated as a problem statement: the join query is the most expensive operation executed by databases, and it has been shown that it can be optimized by parallelization, so parallelizing a join query is an important part of this project. Besides, different query plans may affect performance considerably, so matching a query to the most suitable plan is also very helpful.

The fundamental goal of this project is to investigate and address the join processing problem using parallelization techniques. The project develops OGSA-DAI implementations of several parallel join algorithms. We also want to gather experimental data that helps us understand which approaches to parallel join execution are most beneficial.

1.2 Research Methods

The research methods of this project are:

- To research and analyze efficient parallelization approaches that optimize existing implementations of join operators which take significant time to execute.

- To design and implement parallelization algorithms useful for querying distributed data based on OGSA-DAI.

1.3 Thesis Structure

The thesis is organised into five chapters. This chapter describes the project's purposes, roadmap and the methods adopted in the project research. The rest of the thesis is structured as follows:

Chapter 2 introduces the basics of SQL (Structured Query Language) and OGSA-DAI (Open Grid Service Architecture - Data Access and Integration), so that our work and the techniques used in the project are easier to understand. In Section 2.1 the four subsets of the declarative database language SQL are described and the SELECT query on multiple tables is introduced; then a basic set of relational operators is described. The query graph, which is used as a graphical tool for analysing query operations, is introduced in Section 2.2. A description of the mechanisms and interfaces defined in OGSA and the primary components of the OGSA-DAI architecture is given in Section 2.3. We consider the query requirements of database integration by OGSA applications, and the ways in which consumers make requests to an OGSA-DAI product are described in detail.

Chapter 3 discusses the design and implementation of the algorithms. As this project is implemented on the OGSA-DAI framework, the functional modules are divided into OGSA-DAI activities. The most important work is to analyze and investigate the parallel mechanisms available when executing complex join queries on large tables: Independent Parallelism, Pipelined Parallelism, Partitioned Parallelism and Mixed Parallelism. Based on our parallelism analysis for join queries, two parallel join algorithms - the Hash Split Join algorithm and the Sorted Merge Join algorithm - are used in the project. Section 3.4 shows the functionality of the implemented activities in detail. Finally, we discuss how the activities are assembled into OGSA-DAI workflows; in Section 3.5 we give the implementation details of the Hash Split Join and Sorted Merge Join workflows.

Chapter 4 contains the performance analysis of our parallel implementations. First of all, it describes the software and hardware test environment, the test data set and the test join query based on the TPC Benchmark H (TPC-H) [4]. Then it discusses and analyzes the performance of every functional activity, such as the SQL Query Activity, Tuple Sort Activity and Sorted Merge Activity. Sections 4.3 to 4.5 analyze the performance of queries based on two-table joins, multi-table joins and joins on a distributed heterogeneous database. The chapter explores the reasons for the performance differences by analysing the implementations, illustrates the overall workflow performance and how it behaves on different databases, and, based on our experiments, points out the effect of different query plans. It provides some conclusions about how to match a query to a plan.

In Chapter 5, the final part of the thesis, conclusions are presented based on our analysis and experiments. Our discussion in this thesis focuses on join query optimization for sequential processing via parallelization, and on query plan selection for complex requests.

It touches upon issues and techniques related to optimizing join queries in distributed heterogeneous database environments.

Chapter 2 Background Knowledge

2.1 SQL and Relational Theory

SQL (Structured Query Language) is a declarative database language designed for the management and retrieval of data in an RDBMS (Relational Database Management System). There are four important parts of the SQL language: Data Manipulation Language (DML), Data Definition Language (DDL), Data Control Language (DCL) and Transactional Control Language (TCL). This project is concerned with the DML part of SQL, which is used to retrieve, store, modify, delete, update and manage data in a database. DML allows users to describe the desired properties of the result without specifying how to obtain it; this is why SQL is called a declarative language. The most common operation in SQL is result retrieval, which is performed with the keyword SELECT. A SELECT query can retrieve data from one or more tables; join operations are needed in order to combine multiple tables. This project focuses on SELECT queries joining multiple tables.

2.1.1 Relational Algebra

In order to define the database structure and constraints, a data model must include a set of operations to manipulate the data. A basic set of relational model operations constitutes the relational algebra. Relational algebra is used to represent declarative SQL queries in a procedural form which can be executed; a sequence of relational algebra operations forms a relational algebra expression [5]. A SQL query is a relational algebra expression and can be performed with relational algebra operations such as SELECT, PROJECT, JOIN, UNION, INTERSECTION and CARTESIAN PRODUCT. Select and Join are used in this project and are explained in the next sections.

2.1.2 Select Statement and Logical Query Plan

The Select statement, which retrieves data from the specified table(s), is the most commonly used SQL expression. For example, here is a simple Select-From-Where query:

SELECT id, name, job FROM employee WHERE salary > 100

To be able to execute this declarative query, a logical query plan needs to be compiled. A SELECT query is translated into a relational expression using Projection, Selection and Table Scan. The simple query above translates to the following logical query plan (Figure 1):

Figure 1 Logical Query Plan

On execution, the system fetches all records stored in the employee table (TABLE SCAN), then filters the records, discarding all those for which salary is <= 100 (SELECT). Finally, the PROJECT operation keeps only three attributes of each record. The equivalent relational algebra expression is shown below.

However, if more tables are to be joined, the number of possible plans rapidly explodes. All these plans generate identical results but have different costs. Due to this combinatorial explosion it is not possible to perform an exhaustive search for the best query plan. In this project, we try to devise heuristic rules that help us choose the most promising plan in a limited time.
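Using the relational algebra operator names introduced above (PROJECT and SELECT), the plan in Figure 1 corresponds to the expression below; this rendering is implied by the figure rather than copied from it:

    PROJECT[id, name, job]( SELECT[salary > 100]( TABLE SCAN(employee) ) )

Reading the expression inside-out matches the bottom-up execution order of the plan: the table scan produces the employee records, the selection filters on salary, and the projection keeps the three requested attributes.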

2.1.3 Join Query

The JOIN query is the most important of all relational operations. A join clause combines tuples from two source tables. The SQL language supports four types of joins: INNER, OUTER, LEFT and RIGHT JOIN. This project focuses on the INNER JOIN, which is the most commonly used in applications and is also the default join type.

An INNER JOIN essentially combines the records of two tables (A and B) based on a given join predicate. The result of the join can be defined as the outcome of first taking the Cartesian product (or cross join) of all records in the tables (combining every record in table A with every record in table B) and then returning all records which satisfy the join predicate [1].

People
Name   ID
Betty  100
Jones  101
Jack   102

Nationality
ID   Country
100  United Kingdom
101  Australia

SELECT * FROM People INNER JOIN Nationality ON People.ID = Nationality.ID;

People.Name  People.ID  Nationality.Country  Nationality.ID
Betty        100        United Kingdom       100
Jones        101        Australia            101

Figure 2 Inner Join

An EQUIJOIN is a specific type of comparator-based join which uses only equality comparisons (=) in the join predicate. Figure 2 is an example of an EQUIJOIN. Tuples with ID equal to 100 or 101 are accepted because these values appear in both tables. The tuple with People.ID = 102 is discarded, as there is no related value in the Nationality table.

SQL queries often include multiple joins. The SQL language allows joins to be defined explicitly using the JOIN keyword (see the example above). However, user queries usually contain implicit joins, with the join predicates given in the WHERE clause.

2.2 Query Graphs and Query Plans

A query graph is a single graph corresponding to a query. It does not specify the order in which the operations are performed. For example, the join query in the previous section can be translated into Figure 3.

Figure 3 Query Graph Example

A query plan (Figure 4) presents a specific order of operations for executing a query. It is a set of steps used to access and modify data in a SQL RDBMS. Since SQL is declarative, there are typically a large number of alternative ways to execute a given query, with widely varying performance. When a query is submitted to the database, the query optimizer evaluates some of the different, correct possible plans for executing the query and returns what it considers the best alternative [2].

Figure 4 Query Plan Example

In this project, an SQL query is first analysed and parsed into a query graph. From this query graph, a query plan is then chosen based on our heuristic rules. More details are given in later sections.

2.3 OGSA-DAI

OGSA-DAI stands for Open Grid Services Architecture - Data Access and Integration. The aim of OGSA-DAI is to develop a standard interface for distributed data resources on the Grid. Nowadays there is a great deal of data available, but it is scattered over different databases that are often not linked together: these islands of data need a way to be integrated. An OGSA-DAI web service allows data to be queried, updated, transformed and delivered. OGSA-DAI web services offer data integration functionality to clients and can be deployed within a Grid environment; OGSA-DAI thereby provides a means for users to Grid-enable their data resources [3].

2.3.1 OGSA Grid Environment

The Grid is defined as an infrastructure consisting of multiple computers connected via network technologies that together give the impression of one computer system. In 2001, researchers led by Globus and IBM began developing new Grid standards and technology. The aim was to merge the understanding developed through the design of early Grid applications with Web Services middleware, allowing Grid developers to exploit the huge commercial investment in Web Services infrastructure. The result was the Open Grid Services Architecture (OGSA): a high-level framework designed to support dynamic virtual organizations that share independently administered data and resources seamlessly across a network of heterogeneous computers.

OGSA is used to identify the components needed in a grid system and defines a service-based structure for creating a grid computing environment. Still under development, this architecture defines the major functional components required to meet those requirements. Prof. Ian Foster gave a description of the mechanisms and interfaces defined in OGSA [10][11]. The OGSA services framework is shown in Figure 5. The services are built on Web service standards, with semantics, additions, extensions and modifications that are relevant to Grids [11].

Figure 5 OGSA Services Framework. Cylinders represent individual services.

The important points are the following: An important motivation for OGSA is the composition paradigm, or building block approach, where a set of functions is built or adapted as required. This provides the adaptability, flexibility and robustness to change that is required in the architecture.

The entire set of OGSA capabilities does not have to be present in a system; a system may choose to utilize or provide only a subset of services from any capability. OGSA specifies the services, their interfaces, and the semantics, behaviour and interaction of these services. The architecture is not layered, where the implementation of one service would be built upon another.

2.3.2 OGSA-DAI Software

With the increase of data produced in research and business environments, data management is increasingly challenging. Since 2002, the Open Grid Service Architecture - Data Access and Integration (OGSA-DAI) project, funded by the UK e-Science Programme, has been working to develop an effective solution to the data management challenge, and in particular to data access and integration problems. OGSA-DAI facilitates the data access and integration of data resources, such as relational databases, within a Grid. Paper [12] presents a status report on OGSA-DAI activities and announces future directions. Paper [13] describes a new architecture for future OGSA-DAI releases and its rationale. OGSA-DAI 3.0 is a complete top-to-bottom redesign and reimplementation of the OGSA-DAI product; paper [14] describes the motivation behind this redesign and provides an overview of OGSA-DAI 3.0, comparing and contrasting it with earlier OGSA-DAI releases. (Papers [12] and [13] discuss the old OGSA-DAI product, while paper [14] relates to the current one.)

2.3.3 OGSA-DAI Framework

OGSA-DAI is a framework that enables existing data resources to be integrated into a grid environment. It is middleware for interfacing with databases, allowing data resources such as file systems, relational or XML databases to be accessed, federated and integrated across the network [15]. As well as accessing and updating data in a database, OGSA-DAI offers an extensibility mechanism, making it possible to add further user-defined activities that can be executed in addition to the activities already offered by OGSA-DAI, such as SQL query and update.

The primary components of the new OGSA-DAI architecture are shown in Figure 6 [13]. (This figure is for OGSA-DAI 2.x; the architecture used by OGSA-DAI 3.x is slightly different.) The architecture looks forward to multiple data services administered through a consistent regime. There are three data services: one serves OGSA-DAI, one serves the WS-DAI standard (perhaps as a configuration of OGSA-DAI) and one serves Mobius.

Figure 6 The Architecture of OGSA-DAI

2.3.4 OGSA-DAI Activity

An activity is a workflow unit implementing a certain function linked with a specific name. Any data-related function can be encapsulated as an activity, and activities can be combined to provide complex functionality. OGSA-DAI comes with a default set of activities, such as an SQL query activity, data format transformation activities and a data set union activity. These activities fall into several categories, such as delivery activities and relational activities.

As shown in Figure 7, every activity has client side code and server side code, matched by a unique ID. There are actually three parts in an activity workflow:

1. User code: The client toolkit API allows the user to assemble workflows by connecting activities. It also provides methods for submitting workflows to OGSA-DAI services. It calls the client side code to fill required inputs and declare outputs. Note that one piece of user code can call more than one activity, and every activity can have multiple instances. Executed on the user side.

2. Client side: Manages the inputs and outputs of an activity. Inputs are sent to the server side code and outputs are forwarded to the user. Also executed on the user side.

3. Server side: The part that actually performs the functional task. It may connect to a database (for SQL-related activities). Executed on the server side.

Figure 7 OGSA-DAI Runtime Overview
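As a rough illustration of this split, the sketch below shows the general shape of the server side part of an activity: a function from an input tuple stream to an output tuple stream. Every name in it (TupleListActivity, UpperCaseActivity, the process method) is a hypothetical placeholder invented for illustration, not the actual OGSA-DAI API.

    import java.util.Iterator;

    // Hypothetical stand-in for a server side activity: NOT the real OGSA-DAI interface.
    interface TupleListActivity {
        Iterator<Object[]> process(Iterator<Object[]> input);
    }

    // Example: an activity that upper-cases one column of every tuple.
    class UpperCaseActivity implements TupleListActivity {
        private final int column;

        UpperCaseActivity(int column) { this.column = column; }

        @Override
        public Iterator<Object[]> process(final Iterator<Object[]> input) {
            // Stream tuples through one at a time, as OGSA-DAI pipes do,
            // instead of materialising the whole list in memory.
            return new Iterator<Object[]>() {
                public boolean hasNext() { return input.hasNext(); }
                public Object[] next() {
                    Object[] tuple = input.next();
                    tuple[column] = tuple[column].toString().toUpperCase();
                    return tuple;
                }
            };
        }
    }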

Chapter 3 Analysis and Design of Parallel Algorithms

This chapter contains the requirements capture and the system design. Some implementation details are also introduced in order to give a low-level overview of the main functions and of the solutions to common problems.

3.1 Requirements Capture

The main aim of this project is to use parallelization to optimize the processing of SQL join queries that have huge input tables on a distributed database system. The other aim of this project is to analyse how different query plans affect performance. OGSA-DAI is designed to enable remote access to data. It is a well-designed framework with advantages in the management of distributed database systems, and it simplifies building distributed data processing systems. By using OGSA-DAI in our project we can focus on the parallel algorithms and not worry about the details of distributed processing. The following sections present the basic and additional functional goals of this project.

3.1.1 Basic Functional Goals

This project has four main goals:

- Implementation of the Hash Join algorithm and the Sorted Merge algorithm for query execution on OGSA-DAI.
- Parallelisation of the above algorithms.
- Performance analysis of these two join algorithms.
- Performance analysis of different query plans.

3.1.2 Performance Goals

Because this project is about optimization, it focuses on performance. As a client/server framework, OGSA-DAI may introduce some overhead during execution. In this project we investigate whether this overhead is damaging, how bad it is, and where exactly the time is spent. We also try to find the bottlenecks and to determine whether parallelization reduces execution times.

3.2 Mechanisms of Parallel Query Execution

When executing query operations on large tables, poor performance may occur, especially for complex join operations. There are two limiting factors: the amount of available main memory and the computational complexity. Consequently, we use parallel mechanisms, which address both limitations, to improve runtime efficiency. In the distributed context, where queries may be executed by middleware sitting on top of an RDBMS, we cannot use the foreign-key-based indexes that are available to the local RDBMS. Besides, we have fewer plans to choose from, because the tables are in different locations. In this project we investigate three basic mechanisms for bringing parallelism into join execution. Taking the query R1 JOIN R2 JOIN R3 JOIN R4 (where Ri stands for an input table) as an example, the three mechanisms are:

3.2.1 Independent Parallelism

The above query could be executed in the following three steps:

Step 1: R1 JOIN R2 => R12
Step 2: R3 JOIN R4 => R34
Step 3: R12 JOIN R34 => R1234

Independent parallelism is illustrated in Figure 8. Here the independent steps (Steps 1 and 2 in this case) can be fully parallelised, which can lead to a large speedup. However, scalability is limited: to execute 4 joins independently you need at least 8 relations where every pair is joined independently, which is not a frequent scenario. We get some parallelism, but it will rarely allow us to use, say, 8 processors.

Figure 8 Independent Parallelism

3.2.2 Pipelined Parallelism

Another approach is to build a data processing pipeline with these steps:

Step 1: R1 JOIN R2 => R12
Step 2: R12 JOIN R3 => R123
Step 3: R123 JOIN R4 => R1234

Pipelined parallelism is illustrated in Figure 9. Here there are data dependencies between the steps, which means the steps have to execute one after another: the output of the first operation is used as input to the second. However, if the first operation can be carried out so that partial results are produced and immediately channelled to the second operation, then the first operation can produce its next partial result while the second operation processes the earlier ones. A sketch of this overlap is given below.

Figure 9 Pipelined Parallelism
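A minimal Java sketch of this producer/consumer overlap, using a bounded queue in place of an OGSA-DAI pipe. The two threads, the queue capacity and the stand-in tuples are illustrative assumptions, not project code:

    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;

    public class PipelineSketch {
        // Sentinel marking the end of the tuple stream.
        private static final Object[] EOS = new Object[0];

        public static void main(String[] args) throws InterruptedException {
            // Bounded pipe: the first join pushes partial results here
            // while the second join is already consuming them.
            final BlockingQueue<Object[]> pipe = new ArrayBlockingQueue<>(1024);

            Thread firstJoin = new Thread(() -> {
                try {
                    for (int i = 0; i < 10000; i++) {
                        pipe.put(new Object[] { i, "r12-" + i }); // a partial result of R1 JOIN R2
                    }
                    pipe.put(EOS);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });

            Thread secondJoin = new Thread(() -> {
                try {
                    long consumed = 0;
                    Object[] tuple;
                    while ((tuple = pipe.take()) != EOS) {
                        consumed++; // stand-in for probing R3 with the tuple
                    }
                    System.out.println("second join consumed " + consumed + " tuples");
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });

            firstJoin.start();
            secondJoin.start();
            firstJoin.join();
            secondJoin.join();
        }
    }

The bounded capacity also models the blocking behaviour of a real pipe: if the second join falls behind, the first join stalls instead of flooding memory.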

3.2.3 Partitioned Parallelism

Partitioned parallelism is used for a single join operation (Ri JOIN Rj => Rij). There are three steps to join two tables together using partitioned parallelism:

Step 1: Split the input data into small sets.
Step 2: Join related sets together.
Step 3: Union the previous results together.

Figure 10 shows how partitioned parallelism works.

Figure 10 Partitioned Parallelism

3.2.4 Mixed Parallelism

Different parallelization mechanisms can be applied to different parts of a query plan; a query plan may be divided into parts that use different mechanisms. For example, Figure 11 presents one possible query plan for the following query:

R1 JOIN R2 JOIN R3 JOIN R4 JOIN R5 JOIN R6 JOIN R7

The following equivalence holds for EQUIJOIN:

(R1 JOIN R2) JOIN R3 = R1 JOIN (R2 JOIN R3)

Therefore we can rewrite our query as:

((R1 JOIN R2) JOIN (R3 JOIN R4)) JOIN ((R5 JOIN R6) JOIN R7)

In this plan, the inner bracketed pairs are executed with independent parallelism, while the remaining joins are executed with pipelined parallelism.

Figure 11 Independent and Pipelined Mixed Parallelism

3.3 Partitioning Algorithms

Data partitioning is used in the partitioned parallelism approach to distribute data over a number of processing elements. Each processing element then executes simultaneously with the other processing elements, thereby creating parallelism. It is the basic step of parallel query processing. When partitioning the workload, four partitioning algorithms are taken into consideration [3]; an implementation sketch of the hash variant follows in the next section.

1. Round-robin data partitioning. Data is partitioned by its record number: if the data is partitioned into n parts, the (xn+i)th record is put in the ith block. The biggest advantage of this algorithm is its perfect load balance; every part has the same amount of data (plus or minus one record).

2. Hash data partitioning. Data is partitioned by applying a hash function, so every new work set has its own specific set of attribute values. However, load balance will be poor if the distribution of values is skewed. For example, suppose we partition the work set {1, 2, 3, 4, 6, 7, 8, 11, 12, 16} into five parts with the hash function (x mod 5). The result of the partitioning is:

Work set 1: {1, 6, 11, 16}
Work set 2: {2, 7, 12}
Work set 3: {3, 8}
Work set 4: {4}
Work set 5: {}

This shows the potentially bad load balance of hash data partitioning. On the other hand, hash data partitioning is the best way to handle EQUIJOIN operations, while range data partitioning suits joins with greater-than / less-than predicates. If we use the same hash function to partition both join inputs, related tuples end up in buckets with the same index, so when processing an EQUIJOIN we can easily match the related hash-split work sets together.

3. Range data partitioning. A simple example makes this algorithm easy to understand: partition a set of discrete numbers into three subsets.

All numbers less than 100 go into set 1, numbers in the range [101, 1000] form set 2, and numbers larger than 1000 form the last set. Range data partitioning has pros and cons similar to those of hash data partitioning.

4. Random-unequal data partitioning. The partitioning function may be a hash or range function, or simply unknown; data is grouped randomly.

All these partitioning algorithms have their advantages and disadvantages, so the partitioning algorithm should be chosen based on the type of join algorithm. This project uses hash data partitioning and round-robin partitioning. The former is used for the Hash Split Join algorithm, because it splits data based on the values of the input tuples; the latter is used for the Sorted Merge Join algorithm, because when splitting the input for that algorithm we do not care about tuple values and split the data randomly in order to ensure a good load balance. Further information is available in the following sections.

3.4 Parallel Join Implementations

There are many parallel join algorithms, but this project focuses on two of them: the Hash Split Join algorithm and the Sorted Merge Join algorithm. This section describes the structure of the OGSA-DAI server side and client side code.

3.4.1 Hash Split Join

In this algorithm, both input tuple sets are first split on their join key using a default hash function. Given a value K, the hash value produced by the default hash function is:

hash(K) = K.hashCode() mod NUM

where NUM stands for the number of output subsets and hashCode() is the Java library method of the Object class that returns an integer. After splitting, related sets can be joined in parallel; every set contains one part of the final result. The last step is to union all join results into the final result.

The following activities were implemented to support the hash split join algorithm in OGSA-DAI. For each activity the inputs, outputs and behaviour are described, after a short sketch of the split function.
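A minimal Java sketch of this split step. Math.floorMod is used because hashCode() may return a negative value, a detail the formula above glosses over; representing tuples as Object[] and addressing the key by column index are simplifying assumptions of the sketch:

    import java.util.ArrayList;
    import java.util.List;

    public class HashSplitSketch {
        // Partition tuples into num buckets on the value of the key column.
        public static List<List<Object[]>> hashSplit(Iterable<Object[]> tuples,
                                                     int keyIndex, int num) {
            List<List<Object[]>> buckets = new ArrayList<>();
            for (int i = 0; i < num; i++) {
                buckets.add(new ArrayList<>());
            }
            for (Object[] tuple : tuples) {
                // hash(K) = K.hashCode() mod NUM, kept non-negative
                int bucket = Math.floorMod(tuple[keyIndex].hashCode(), num);
                buckets.get(bucket).add(tuple);
            }
            return buckets;
        }
    }

Because both join inputs are split with the same function, tuples with equal keys land in buckets with the same index, so bucket i of the left input only ever needs to be joined with bucket i of the right input.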

Hash Split Activity

Splits the input data, on the column with the given name, into a given number of sets using the hash function.

Activity inputs:
Data. Type: OGSA-DAI list of Tuples. A stream of tuples to be split.
Name. Type: String. The name of the column to split on.
Number. Type: Integer. The number of output sets.
Activity outputs:
Result. Type: Array of OGSA-DAI lists of Tuples.

Hash Join Activity

Joins two sets together as an inner join. Note that this is a generic activity: it can also be used to join un-split input sets. Its core build-and-probe loop is sketched after this list of activities.

Activity inputs:
Data1. Type: OGSA-DAI list of Tuples. The first dataset to be joined.
Data2. Type: OGSA-DAI list of Tuples. The second dataset to be joined.
Name1. Type: String. The name of the column to use for the join from the first dataset.
Name2. Type: String. The name of the column to use for the join from the second dataset.
Activity outputs:
Result. Type: OGSA-DAI list of Tuples.

Union All Activity

Unions the given array of tuple lists into one list. This activity is used to generate the final result.

Activity inputs:
Data. Type: Array of OGSA-DAI lists of Tuples. The datasets to be unioned together.
Number. Type: Integer. The number of datasets to be unioned together.
Activity outputs:
Result. Type: OGSA-DAI list of Tuples.
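The core of the Hash Join Activity can be sketched as a classic build-and-probe loop. This is a sketch only, with tuples as Object[] values held in Java lists rather than streamed OGSA-DAI tuple lists:

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class HashJoinSketch {
        // Inner equi-join of two tuple sets on one column from each side.
        public static List<Object[]> hashJoin(List<Object[]> left, int leftKey,
                                              List<Object[]> right, int rightKey) {
            // Build phase: index the (ideally smaller) left input by its key.
            Map<Object, List<Object[]>> index = new HashMap<>();
            for (Object[] l : left) {
                index.computeIfAbsent(l[leftKey], k -> new ArrayList<>()).add(l);
            }
            // Probe phase: scan the right input and emit matching combinations.
            List<Object[]> result = new ArrayList<>();
            for (Object[] r : right) {
                List<Object[]> matches = index.get(r[rightKey]);
                if (matches == null) continue; // no partner, tuple is discarded
                for (Object[] l : matches) {
                    Object[] joined = new Object[l.length + r.length];
                    System.arraycopy(l, 0, joined, 0, l.length);
                    System.arraycopy(r, 0, joined, l.length, r.length);
                    result.add(joined);
                }
            }
            return result;
        }
    }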

Hash Split Join User Side Code

This is the user side function that manages all hash-join-related activities. It connects each activity's outputs to the appropriate inputs; it is the part that builds the entire workflow from single activities.

Inputs:
Query. Type: SQL query. The request to be executed.
Number. Type: Integer. The number of processors.
Outputs:
Result. Type: OGSA-DAI list of Tuples.

3.4.2 Sorted Merge Join

The sorted merge join algorithm is different from the hash split join algorithm. It needs four steps: split, sort the split sets, merge the split sets, and sorted join. The algorithm uses parallelization to sort the original input set and performs the low-complexity sorted join as the last step. First of all, the input tuples are split into subsets. These balanced sets can then be sorted in parallel and merged together. The last step is to join the two ordered sets into the final result. The following activities were used to form the sorted merge join algorithm in OGSA-DAI.

Random Split Activity

Splits the input data into subsets of equal size.

Activity inputs:
Data. Type: OGSA-DAI list of Tuples. A stream of tuples to be split.
Number. Type: Integer. The number of output sets.
Activity outputs:
Result. Type: Array of OGSA-DAI lists of Tuples.

Tuple Sort Activity

Sorts the input data by the given column.

Activity inputs:
Data. Type: OGSA-DAI list of Tuples. A stream of tuples to be sorted.
Name. Type: String. The name of the column to sort on.
Activity outputs:
Result. Type: OGSA-DAI list of Tuples.

Sorted Merge Activity

Merges sorted sets into one. This function only needs to scan every input set once, which leads to good performance; a sketch of the single-pass merge follows this list of activities.

Activity inputs:
Data. Type: Array of OGSA-DAI lists of Tuples. The ordered datasets to be merged together.
Number. Type: Integer. The number of sets.
Name. Type: String. The name of the column used for the merge.
Activity outputs:
Result. Type: OGSA-DAI list of Tuples.

Sorted Join Activity

Joins two ordered sets together. This function also needs to scan each input set only once.

Activity inputs:
Data1. Type: OGSA-DAI list of Tuples. The first dataset to be joined.
Data2. Type: OGSA-DAI list of Tuples. The second dataset to be joined.
Name1. Type: String. The name of the column to use for the join from the first dataset.
Name2. Type: String. The name of the column to use for the join from the second dataset.
Activity outputs:
Result. Type: OGSA-DAI list of Tuples.
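The single scan claimed for the Sorted Merge Activity is a standard k-way merge. Below is a minimal sketch using a priority queue keyed on the merge column; comparable key values and Object[] tuples are assumptions of the sketch:

    import java.util.ArrayList;
    import java.util.Iterator;
    import java.util.List;
    import java.util.PriorityQueue;

    public class SortedMergeSketch {
        // Merge k tuple streams that are each already sorted on column keyIndex.
        @SuppressWarnings("unchecked")
        public static List<Object[]> merge(List<Iterator<Object[]>> inputs, final int keyIndex) {
            // Heap entries are pairs: { tuple, index of the stream it came from }.
            PriorityQueue<Object[]> heap = new PriorityQueue<>((a, b) ->
                    ((Comparable<Object>) ((Object[]) a[0])[keyIndex])
                            .compareTo(((Object[]) b[0])[keyIndex]));
            for (int i = 0; i < inputs.size(); i++) {
                if (inputs.get(i).hasNext()) {
                    heap.add(new Object[] { inputs.get(i).next(), i });
                }
            }
            List<Object[]> out = new ArrayList<>();
            while (!heap.isEmpty()) {
                Object[] entry = heap.poll();
                out.add((Object[]) entry[0]);
                int src = (Integer) entry[1];
                // Refill from the stream that supplied the smallest tuple.
                if (inputs.get(src).hasNext()) {
                    heap.add(new Object[] { inputs.get(src).next(), src });
                }
            }
            return out;
        }
    }

Each tuple passes through the heap exactly once, so merging N tuples from k sets costs O(N log k).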

Sorted Merge Join User Side Code

This is the user side code. Like the Hash Split Join user side code, this function manages all sorted-merge-join-related activities; it is the part that builds the entire workflow from single activities.

Inputs:
Query. Type: SQL query. The request to be executed.
Number. Type: Integer. The number of processors.
Outputs:
Result. Type: OGSA-DAI list of Tuples.

3.5 User Side Workflow

3.5.1 OGSA-DAI Workflow

The OGSA-DAI class PipelineWorkflow is used to assemble activities into a workflow. Using this class, activities can be organized in their logical order, and independent activities can run in parallel. In detail, this is how OGSA-DAI workflows are assembled and executed programmatically:

Step 1: Initialize all the activities in the workflow.
Step 2: Connect the activities' inputs and outputs.
Step 3: Add the activities to the pipeline.
Step 4: Get a handle to an OGSA-DAI DataRequestExecutionResource object, then execute the entire pipeline on this object.

Note that this class is called a pipeline only because it organizes related activities as a pipeline; it still allows parallelization. For example, several sort activities can run in parallel. A sketch of this assembly pattern is shown below.
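The four steps above can be sketched as follows. PipelineWorkflow and DataRequestExecutionResource are named in the OGSA-DAI client toolkit, but the activity classes and the connect/get/execute method names shown here are assumptions mirroring the description above rather than verified signatures, and the surrounding server setup is omitted:

    // Step 1: initialize the activities (the resource name and query are made up).
    SQLQuery query = new SQLQuery();
    query.setResourceID("MySQLResource");
    query.addExpression("SELECT * FROM orders");
    TupleSort sort = new TupleSort();

    // Step 2: connect the output of one activity to the input of the next.
    sort.connectDataInput(query.getDataOutput());

    // Step 3: add the activities to the pipeline.
    PipelineWorkflow pipeline = new PipelineWorkflow();
    pipeline.add(query);
    pipeline.add(sort);

    // Step 4: execute the pipeline on a DataRequestExecutionResource.
    DataRequestExecutionResource drer =
            server.getDataRequestExecutionResource("DataRequestExecutionResource");
    drer.execute(pipeline, RequestExecutionType.SYNCHRONOUS);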

3.5.2 Serial Join

The serial join reads both tables and joins them together without any parallel optimization. It is used to generate comparable results for testing and a baseline execution time for performance comparison. The main steps in the workflow are the following:

Step 1: Read the tables from the database using the SQLQuery activity.
Step 2: Sort both the left side and the right side table using the TupleSort activity.
Step 3: Join the tables using the default OGSA-DAI TupleMergeJoin activity.

Figure 12 shows the workflow of the serial join. It uses the default join activity provided by OGSA-DAI; because the default join activity requires ordered inputs, the input tuples need to be sorted first.

Figure 12 Serial Join Workflow

3.5.3 Hash Split Join

As mentioned in Section 3.4.1, Figure 13 illustrates the Hash Split Join user side workflow:

Step 1: Read the tables from the database using the SQLQuery activity.
Step 2: Split both the left side and the right side table into hash sets using the HashSplit activity.
Step 3: Use a sort-merge join as the smallest join unit:
Step 3.1: Sort every hash set using the TupleSort activity.
Step 3.2: Join the ordered sets using the SortedMergeJoin activity.
Step 4: Union all the results using the UnionAll activity.

Figure 13 Hash Split Join

3.5.4 Sorted Merge Join

As mentioned in Section 3.4.2, Figure 14 illustrates the Sorted Merge Join user side workflow, which is quite similar to the Hash Split Join workflow:

Step 1: Read the tables from the database using the SQLQuery activity.
Step 2: Split both the left side and the right side table into sets randomly using the RandomSplit activity.
Step 3: Sort every set using the TupleSort activity.
Step 4: Combine the sorted sets from Step 3 into one big sorted set using the SortedMerge activity.
Step 5: Join the two sorted sets using the SortedJoin activity.

Figure 14 Sorted Merge Join

Chapter 4 Performance Analysis

4.1 Experimental Setup

4.1.1 Test Environment

Key software used in our tests:

Linux: el5 x86_64 GNU/Linux
Tomcat
OGSA-DAI-3.1-axis

The test machine is Ness, a parallel machine based on AMD Opteron processors running Linux, with a shared memory architecture. The system consists of two back-end X4600 SMP nodes; each node contains 16 processor cores with 2 GB of memory per core. This project uses only one of the back-end nodes, with a maximum of 16 cores. Furthermore, all queries request data from an IBM DB2 server if not specifically mentioned otherwise. More details about the databases and the machines they run on are included later in this chapter.

4.1.2 Test Data Set

In order to evaluate correctness and performance, the TPC-H benchmark is used. TPC is the Transaction Processing Performance Council, and TPC benchmarks are widely used to evaluate the performance and verify the correctness of database systems. The TPC Benchmark H (TPC-H) is a decision support benchmark. It consists of a suite of business-oriented ad-hoc queries and concurrent data modifications. The queries and the data populating the database have been chosen to have broad industry-wide relevance. This benchmark illustrates decision support systems that examine large volumes of data, execute queries with a high degree of complexity, and give answers to critical business questions [4].

The default TPC-H data has a uniform distribution of values. In order to analyse performance under various distributions, TPC-H Skew, a modified version of the benchmark provided by Surajit Chaudhuri and Vivek Narasayya from Microsoft, is used to test the performance on unbalanced input data.

The TPC-H generator allows the size of the dataset to be chosen; this project uses the default setting of 100 MB as the database size.

4.1.3 Test Query

As the test query, this project chooses the TPC-H query that joins the largest number of tables:

SELECT *
FROM customer, orders, lineitem, supplier, nation, region
WHERE c_custkey = o_custkey
  AND l_orderkey = o_orderkey
  AND l_suppkey = s_suppkey
  AND c_nationkey = s_nationkey
  AND s_nationkey = n_nationkey
  AND n_regionkey = r_regionkey

This query involves the two largest tables in the database: lineitem (600,000 tuples) and orders (150,000 tuples). The query is complex enough for our needs, as it joins six tables together. The bad news is that it is hard to reuse previously sorted tuple streams in consecutive joins: when the query is analysed, it is clear that the tables are joined on different keys, which means tuples must be split by different keys and sorted in different orders. In this situation there is no way to reuse previous results. However, after analysing the other TPC-H queries, it is clear that this is the case for most of them.

4.1.4 Query Graph

The large query of the previous section can be represented graphically as a query graph (Figure 15). Nodes represent the source tables, and JOIN operations are represented by the graph edges. The query graph is a compact and convenient representation of join queries, and it is used in join ordering algorithms.

Figure 15 Query Graph

4.1.5 Query Plan

The query graph presented in Figure 15 can be mapped to a logical query plan, shown in Figure 16. Note that there may be several mappings that result in semantically equivalent query plans.

Figure 16 Query Tree

The last step is to translate the logical query plan into executable code. Alternatively, the query graph can be transformed into other query plans; other examples of equivalent query plans are presented later in the thesis.

4.1.6 Measurement of Time

The OGSA-DAI workflow is defined only on the client side, and OGSA-DAI activities are initialized only in user side code; however, the workflow executes on the server side after its initialization.

As a result, only the overall running time is available in the user side test code; we cannot measure the cost of each activity there. An alternative solution is to add timers before and after each OGSA-DAI execution unit as debug information, which yields an approximate running time for each single activity. As the initialization cost of an activity is excluded from this measurement, the time measured this way is smaller than the real time, because it does not account for pre-processing and post-processing.

However, the OGSA-DAI mechanism is not well suited to timing individual activities. When the output stream of one activity is connected to another's input stream, the sender produces its data in small chunks and inserts it into the connecting pipe block by block. Once the first data chunk has been sent, the receiving activity starts and blocks waiting for its input stream, so the timelines of the two activities may overlap. That is also why the sum of the individual activity running times may be larger than the overall running time measured on the user side.

4.1.7 Script for Submission

As this project measures performance on up to 16 processors, OGSA-DAI must run on the back end of Ness. Because OGSA-DAI runs on the Tomcat server, the submission script is organized in five steps: 1) set the environment parameters; 2) start up the Tomcat server; 3) sleep for a while until its service has booted successfully; 4) run the test case; 5) shut down the Tomcat server. The script is available in Appendix B.

4.1.8 Reproducibility of Measurements

One thing should be noted: Java needs a pre-run to warm up. A warm-up helps OGSA-DAI initialize its context on the first request and perform just-in-time compilation. Without the warm-up phase, the performance of the initial test run may differ significantly from subsequent runs; in our case the initial run is about four times slower than the second run. In order to solve this problem, every test contains an inner loop that executes the same query ten times. The running time of each query is measured as the average time of the last nine iterations (the result of the first one is discarded), as sketched below.
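The measurement loop can be sketched as follows; runQuery stands in for one complete workflow submission and execution:

    public class TimingSketch {
        public static void main(String[] args) {
            final int rounds = 10;
            long total = 0;
            for (int i = 0; i < rounds; i++) {
                long start = System.nanoTime();
                runQuery(); // stand-in for executing the whole workflow once
                long elapsed = System.nanoTime() - start;
                if (i > 0) {
                    total += elapsed; // discard the warm-up round
                }
            }
            System.out.printf("average over last %d runs: %.3f s%n",
                    rounds - 1, total / (double) (rounds - 1) / 1e9);
        }

        private static void runQuery() {
            try { Thread.sleep(10); } catch (InterruptedException ignored) {} // placeholder workload
        }
    }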

Besides, the JIT (just-in-time) compiler may also play a role in these tests. This technique is used to improve the runtime performance of a program; it is an automatic optimisation based on runtime analysis and dynamic translation. It improves on interpretation by speeding up the hot spots of the code, and it can recompile code when this is found to be advantageous. Figure 17 illustrates that there is a more than four-fold speedup when executing a query more than once. This can be attributed to the fact that OGSA-DAI only initializes its context on the first request; repeated query requests save this time in the remaining executions. In the real world, the OGSA-DAI service would already be initialized, so the first (and slowest) query running time is discarded. Furthermore, it is hard to identify any benefit brought by the JIT technique itself. If it mattered, the running time of the second round should be slightly slower than the later ones, because the extra recompilation time would be counted in round two while the resulting speedup would only appear in the remaining rounds. It may also be the case that all JIT optimisations are applied during the initial run.

Figure 17 Running Time of Reproduced Test. The result is based on the large parallel test (150,000 tuples).

4.2 Performance Analysis for Single Activity

The performance and analysis of every activity is presented in this section. This information is used to spot the bottlenecks in the join workflow. Bandwidth, calculated by dividing the number of processed tuples by the processing time, is used to measure the performance of these activities.

Every activity has pre-process, process and post-process steps. When evaluating an activity, the timer starts at its pre-process step and ends at its post-process step.

4.2.1 Swallow Activity

The swallow activity is used to empty a given tuple list. It goes through its input (an OGSA-DAI requirement) and returns an empty list or a count as output. This activity is very fast due to its empty body: it basically swallows input tuples. It has two purposes:

a) It removes noise from the time measurements. As OGSA-DAI requires all activity inputs and outputs in the workflow to be connected, we need to add some activities that are not essential for the JOIN operation. For example, the TupleToWebRowSetCharArrays activity, which transforms a tuple list into a web-readable format, is the last part of the workflow. When its input set is large, this activity is really slow and contributes more than 80% of the overall running time. However, this activity only transfers data and is not part of the actual join processing. The solution is to add a swallow activity before the TupleToWebRowSetCharArrays activity: with an empty input set, the latter takes only a negligible time to execute and no longer damages the performance.

b) The measured performance of an individual activity may be misleading due to the mechanics of an OGSA-DAI workflow. As shown in Figure 18, Activity 3 needs the output of both Activity 1 and Activity 2. If Activity 1 is slower than Activity 2, Activity 2 will block and wait for Activity 1 to finish; in this case the measured time of Activity 2 will be larger than its true cost.

Figure 18 Workflow without Swallow Activity

A swallow activity handles this problem easily. As shown in Figure 19, when measuring the running time of an individual activity, a swallow activity is added behind the target activity. In the case above, Activity 2 then ends without waiting; the waiting time is transferred to the swallow activity and no longer distorts the measurement.

Figure 19 Workflow with Swallow Activity

To conclude, a swallow activity is inserted after every measured activity in order to time one part of the workflow; it is a very helpful activity that also simplifies the code.

4.2.2 Performance of SQL Query Activities

Table 1 shows the performance of the SQL Query Activity. This is a serial activity. The workflow for this test is:

SQL Query -> Swallow -> TupleToWebRowSetCharArrays

The running time in this section is measured as the overall running time of this workflow. As the workflow contains two additional, inexpensive activities, the measured time is slightly larger than the activity's true cost. According to the table, this activity has a bandwidth of 45,000 to 95,000 tuples per second, depending on the data size, and the bandwidth increases with the number of tuples. As the SQL Query activity contains steps with a steady, non-negligible cost (such as setup) as well as steps whose cost is closely related to the data size, this activity performs better when handling large data sets.

Table 1 Bandwidth of SQL Query Activity

4.2.3 Performance of Split Activities

Both the Hash Split Activity and the Random Split Activity are O(N) tasks. As the number of tuples decreases, the running times of both activities decrease in the same pattern. Because the Hash Split Activity applies an extra hash function, it is slightly slower than the Random Split Activity; however, this extra hash function contributes only a small part of the overall activity running time, so the performance of the two split activities is almost the same. In this section the Hash Split Activity is used to illustrate the performance. The execution time reported here measures only the split activity. This is a serial activity. The workflow used to evaluate the performance is:

SQL Query -> Split -> Swallow -> TupleToWebRowSetCharArrays

For the reasons pointed out in Section 4.1.6, the SQL Query activity introduces some noise. Table 2 shows the bandwidth of this activity.

Table 2 Bandwidth of Hash Split Activity


More information

In-Memory Databases Algorithms and Data Structures on Modern Hardware. Martin Faust David Schwalb Jens Krüger Jürgen Müller

In-Memory Databases Algorithms and Data Structures on Modern Hardware. Martin Faust David Schwalb Jens Krüger Jürgen Müller In-Memory Databases Algorithms and Data Structures on Modern Hardware Martin Faust David Schwalb Jens Krüger Jürgen Müller The Free Lunch Is Over 2 Number of transistors per CPU increases Clock frequency

More information

Duration Vendor Audience 5 Days Oracle End Users, Developers, Technical Consultants and Support Staff

Duration Vendor Audience 5 Days Oracle End Users, Developers, Technical Consultants and Support Staff D80198GC10 Oracle Database 12c SQL and Fundamentals Summary Duration Vendor Audience 5 Days Oracle End Users, Developers, Technical Consultants and Support Staff Level Professional Delivery Method Instructor-led

More information

Elena Baralis, Silvia Chiusano Politecnico di Torino. Pag. 1. Physical Design. Phases of database design. Physical design: Inputs.

Elena Baralis, Silvia Chiusano Politecnico di Torino. Pag. 1. Physical Design. Phases of database design. Physical design: Inputs. Phases of database design Application requirements Conceptual design Database Management Systems Conceptual schema Logical design ER or UML Physical Design Relational tables Logical schema Physical design

More information

Chapter 5 More SQL: Complex Queries, Triggers, Views, and Schema Modification

Chapter 5 More SQL: Complex Queries, Triggers, Views, and Schema Modification Chapter 5 More SQL: Complex Queries, Triggers, Views, and Schema Modification Copyright 2011 Pearson Education, Inc. Publishing as Pearson Addison-Wesley Chapter 5 Outline More Complex SQL Retrieval Queries

More information

Architectures for Big Data Analytics A database perspective

Architectures for Big Data Analytics A database perspective Architectures for Big Data Analytics A database perspective Fernando Velez Director of Product Management Enterprise Information Management, SAP June 2013 Outline Big Data Analytics Requirements Spectrum

More information

Parallel & Distributed Data Management

Parallel & Distributed Data Management Parallel & Distributed Data Management Kai Shen Data Management Data management Efficiency: fast reads/writes Durability and consistency: data is safe and sound despite failures Usability: convenient interfaces

More information

Chapter 1. Dr. Chris Irwin Davis Email: cid021000@utdallas.edu Phone: (972) 883-3574 Office: ECSS 4.705. CS-4337 Organization of Programming Languages

Chapter 1. Dr. Chris Irwin Davis Email: cid021000@utdallas.edu Phone: (972) 883-3574 Office: ECSS 4.705. CS-4337 Organization of Programming Languages Chapter 1 CS-4337 Organization of Programming Languages Dr. Chris Irwin Davis Email: cid021000@utdallas.edu Phone: (972) 883-3574 Office: ECSS 4.705 Chapter 1 Topics Reasons for Studying Concepts of Programming

More information

A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM

A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM Sneha D.Borkar 1, Prof.Chaitali S.Surtakar 2 Student of B.E., Information Technology, J.D.I.E.T, sborkar95@gmail.com Assistant Professor, Information

More information

Raima Database Manager Version 14.0 In-memory Database Engine

Raima Database Manager Version 14.0 In-memory Database Engine + Raima Database Manager Version 14.0 In-memory Database Engine By Jeffrey R. Parsons, Senior Engineer January 2016 Abstract Raima Database Manager (RDM) v14.0 contains an all new data storage engine optimized

More information

PostgreSQL Business Intelligence & Performance Simon Riggs CTO, 2ndQuadrant PostgreSQL Major Contributor

PostgreSQL Business Intelligence & Performance Simon Riggs CTO, 2ndQuadrant PostgreSQL Major Contributor PostgreSQL Business Intelligence & Performance Simon Riggs CTO, 2ndQuadrant PostgreSQL Major Contributor The research leading to these results has received funding from the European Union's Seventh Framework

More information

SQL Server Query Tuning

SQL Server Query Tuning SQL Server Query Tuning Klaus Aschenbrenner Independent SQL Server Consultant SQLpassion.at Twitter: @Aschenbrenner About me Independent SQL Server Consultant International Speaker, Author Pro SQL Server

More information

BENCHMARKING CLOUD DATABASES CASE STUDY on HBASE, HADOOP and CASSANDRA USING YCSB

BENCHMARKING CLOUD DATABASES CASE STUDY on HBASE, HADOOP and CASSANDRA USING YCSB BENCHMARKING CLOUD DATABASES CASE STUDY on HBASE, HADOOP and CASSANDRA USING YCSB Planet Size Data!? Gartner s 10 key IT trends for 2012 unstructured data will grow some 80% over the course of the next

More information

Planning the Installation and Installing SQL Server

Planning the Installation and Installing SQL Server Chapter 2 Planning the Installation and Installing SQL Server In This Chapter c SQL Server Editions c Planning Phase c Installing SQL Server 22 Microsoft SQL Server 2012: A Beginner s Guide This chapter

More information

Improving SQL Server Performance

Improving SQL Server Performance Informatica Economică vol. 14, no. 2/2010 55 Improving SQL Server Performance Nicolae MERCIOIU 1, Victor VLADUCU 2 1 Prosecutor's Office attached to the High Court of Cassation and Justice 2 Prosecutor's

More information

Big Data Technology Map-Reduce Motivation: Indexing in Search Engines

Big Data Technology Map-Reduce Motivation: Indexing in Search Engines Big Data Technology Map-Reduce Motivation: Indexing in Search Engines Edward Bortnikov & Ronny Lempel Yahoo Labs, Haifa Indexing in Search Engines Information Retrieval s two main stages: Indexing process

More information

Information and Communications Technology Courses at a Glance

Information and Communications Technology Courses at a Glance Information and Communications Technology Courses at a Glance Level 1 Courses ICT121 Introduction to Computer Systems Architecture This is an introductory course on the architecture of modern computer

More information

Locality-Sensitive Operators for Parallel Main-Memory Database Clusters

Locality-Sensitive Operators for Parallel Main-Memory Database Clusters Locality-Sensitive Operators for Parallel Main-Memory Database Clusters Wolf Rödiger, Tobias Mühlbauer, Philipp Unterbrunner*, Angelika Reiser, Alfons Kemper, Thomas Neumann Technische Universität München,

More information

In-Memory Columnar Databases HyPer. Arto Kärki University of Helsinki 30.11.2012

In-Memory Columnar Databases HyPer. Arto Kärki University of Helsinki 30.11.2012 In-Memory Columnar Databases HyPer Arto Kärki University of Helsinki 30.11.2012 1 Introduction Columnar Databases Design Choices Data Clustering and Compression Conclusion 2 Introduction The relational

More information

Optimizing the Performance of Your Longview Application

Optimizing the Performance of Your Longview Application Optimizing the Performance of Your Longview Application François Lalonde, Director Application Support May 15, 2013 Disclaimer This presentation is provided to you solely for information purposes, is not

More information

Performance Comparison of Database Access over the Internet - Java Servlets vs CGI. T. Andrew Yang Ralph F. Grove

Performance Comparison of Database Access over the Internet - Java Servlets vs CGI. T. Andrew Yang Ralph F. Grove Performance Comparison of Database Access over the Internet - Java Servlets vs CGI Corresponding Author: T. Andrew Yang T. Andrew Yang Ralph F. Grove yang@grove.iup.edu rfgrove@computer.org Indiana University

More information

The Classical Architecture. Storage 1 / 36

The Classical Architecture. Storage 1 / 36 1 / 36 The Problem Application Data? Filesystem Logical Drive Physical Drive 2 / 36 Requirements There are different classes of requirements: Data Independence application is shielded from physical storage

More information

Reversing Statistics for Scalable Test Databases Generation

Reversing Statistics for Scalable Test Databases Generation Reversing Statistics for Scalable Test Databases Generation Entong Shen Lyublena Antova Pivotal (formerly Greenplum) DBTest 2013, New York, June 24 1 Motivation Data generators: functional and performance

More information

Graph Database Proof of Concept Report

Graph Database Proof of Concept Report Objectivity, Inc. Graph Database Proof of Concept Report Managing The Internet of Things Table of Contents Executive Summary 3 Background 3 Proof of Concept 4 Dataset 4 Process 4 Query Catalog 4 Environment

More information

FileMaker 12. ODBC and JDBC Guide

FileMaker 12. ODBC and JDBC Guide FileMaker 12 ODBC and JDBC Guide 2004 2012 FileMaker, Inc. All Rights Reserved. FileMaker, Inc. 5201 Patrick Henry Drive Santa Clara, California 95054 FileMaker and Bento are trademarks of FileMaker, Inc.

More information

Technical Writing - A Practical Case Study on ehl 2004r3 Scalability testing

Technical Writing - A Practical Case Study on ehl 2004r3 Scalability testing ehl 2004r3 Scalability Whitepaper Published: 10/11/2005 Version: 1.1 Table of Contents Executive Summary... 3 Introduction... 4 Test setup and Methodology... 5 Automated tests... 5 Database... 5 Methodology...

More information

MapReduce Jeffrey Dean and Sanjay Ghemawat. Background context

MapReduce Jeffrey Dean and Sanjay Ghemawat. Background context MapReduce Jeffrey Dean and Sanjay Ghemawat Background context BIG DATA!! o Large-scale services generate huge volumes of data: logs, crawls, user databases, web site content, etc. o Very useful to be able

More information

IT2305 Database Systems I (Compulsory)

IT2305 Database Systems I (Compulsory) Database Systems I (Compulsory) INTRODUCTION This is one of the 4 modules designed for Semester 2 of Bachelor of Information Technology Degree program. CREDITS: 04 LEARNING OUTCOMES On completion of this

More information

In-memory Tables Technology overview and solutions

In-memory Tables Technology overview and solutions In-memory Tables Technology overview and solutions My mainframe is my business. My business relies on MIPS. Verna Bartlett Head of Marketing Gary Weinhold Systems Analyst Agenda Introduction to in-memory

More information

Database Application Developer Tools Using Static Analysis and Dynamic Profiling

Database Application Developer Tools Using Static Analysis and Dynamic Profiling Database Application Developer Tools Using Static Analysis and Dynamic Profiling Surajit Chaudhuri, Vivek Narasayya, Manoj Syamala Microsoft Research {surajitc,viveknar,manojsy}@microsoft.com Abstract

More information

Report Paper: MatLab/Database Connectivity

Report Paper: MatLab/Database Connectivity Report Paper: MatLab/Database Connectivity Samuel Moyle March 2003 Experiment Introduction This experiment was run following a visit to the University of Queensland, where a simulation engine has been

More information

Using In-Memory Computing to Simplify Big Data Analytics

Using In-Memory Computing to Simplify Big Data Analytics SCALEOUT SOFTWARE Using In-Memory Computing to Simplify Big Data Analytics by Dr. William Bain, ScaleOut Software, Inc. 2012 ScaleOut Software, Inc. 12/27/2012 T he big data revolution is upon us, fed

More information

Language Evaluation Criteria. Evaluation Criteria: Readability. Evaluation Criteria: Writability. ICOM 4036 Programming Languages

Language Evaluation Criteria. Evaluation Criteria: Readability. Evaluation Criteria: Writability. ICOM 4036 Programming Languages ICOM 4036 Programming Languages Preliminaries Dr. Amirhossein Chinaei Dept. of Electrical & Computer Engineering UPRM Spring 2010 Language Evaluation Criteria Readability: the ease with which programs

More information

Understanding the Benefits of IBM SPSS Statistics Server

Understanding the Benefits of IBM SPSS Statistics Server IBM SPSS Statistics Server Understanding the Benefits of IBM SPSS Statistics Server Contents: 1 Introduction 2 Performance 101: Understanding the drivers of better performance 3 Why performance is faster

More information

Facebook s Petabyte Scale Data Warehouse using Hive and Hadoop

Facebook s Petabyte Scale Data Warehouse using Hive and Hadoop Facebook s Petabyte Scale Data Warehouse using Hive and Hadoop Why Another Data Warehousing System? Data, data and more data 200GB per day in March 2008 12+TB(compressed) raw data per day today Trends

More information

Data Grids. Lidan Wang April 5, 2007

Data Grids. Lidan Wang April 5, 2007 Data Grids Lidan Wang April 5, 2007 Outline Data-intensive applications Challenges in data access, integration and management in Grid setting Grid services for these data-intensive application Architectural

More information

Objectives. Distributed Databases and Client/Server Architecture. Distributed Database. Data Fragmentation

Objectives. Distributed Databases and Client/Server Architecture. Distributed Database. Data Fragmentation Objectives Distributed Databases and Client/Server Architecture IT354 @ Peter Lo 2005 1 Understand the advantages and disadvantages of distributed databases Know the design issues involved in distributed

More information

Performance Tuning for the Teradata Database

Performance Tuning for the Teradata Database Performance Tuning for the Teradata Database Matthew W Froemsdorf Teradata Partner Engineering and Technical Consulting - i - Document Changes Rev. Date Section Comment 1.0 2010-10-26 All Initial document

More information

Integrating Big Data into the Computing Curricula

Integrating Big Data into the Computing Curricula Integrating Big Data into the Computing Curricula Yasin Silva, Suzanne Dietrich, Jason Reed, Lisa Tsosie Arizona State University http://www.public.asu.edu/~ynsilva/ibigdata/ 1 Overview Motivation Big

More information

Chapter 1: Introduction. Database Management System (DBMS) University Database Example

Chapter 1: Introduction. Database Management System (DBMS) University Database Example This image cannot currently be displayed. Chapter 1: Introduction Database System Concepts, 6 th Ed. See www.db-book.com for conditions on re-use Database Management System (DBMS) DBMS contains information

More information

The Sierra Clustered Database Engine, the technology at the heart of

The Sierra Clustered Database Engine, the technology at the heart of A New Approach: Clustrix Sierra Database Engine The Sierra Clustered Database Engine, the technology at the heart of the Clustrix solution, is a shared-nothing environment that includes the Sierra Parallel

More information

Efficient database auditing

Efficient database auditing Topicus Fincare Efficient database auditing And entity reversion Dennis Windhouwer Supervised by: Pim van den Broek, Jasper Laagland and Johan te Winkel 9 April 2014 SUMMARY Topicus wants their current

More information

Files. Files. Files. Files. Files. File Organisation. What s it all about? What s in a file?

Files. Files. Files. Files. Files. File Organisation. What s it all about? What s in a file? Files What s it all about? Information being stored about anything important to the business/individual keeping the files. The simple concepts used in the operation of manual files are often a good guide

More information

Postgres Plus xdb Replication Server with Multi-Master User s Guide

Postgres Plus xdb Replication Server with Multi-Master User s Guide Postgres Plus xdb Replication Server with Multi-Master User s Guide Postgres Plus xdb Replication Server with Multi-Master build 57 August 22, 2012 , Version 5.0 by EnterpriseDB Corporation Copyright 2012

More information

MOC 20461C: Querying Microsoft SQL Server. Course Overview

MOC 20461C: Querying Microsoft SQL Server. Course Overview MOC 20461C: Querying Microsoft SQL Server Course Overview This course provides students with the knowledge and skills to query Microsoft SQL Server. Students will learn about T-SQL querying, SQL Server

More information

Oracle Database: SQL and PL/SQL Fundamentals NEW

Oracle Database: SQL and PL/SQL Fundamentals NEW Oracle University Contact Us: + 38516306373 Oracle Database: SQL and PL/SQL Fundamentals NEW Duration: 5 Days What you will learn This Oracle Database: SQL and PL/SQL Fundamentals training delivers the

More information

PART IV Performance oriented design, Performance testing, Performance tuning & Performance solutions. Outline. Performance oriented design

PART IV Performance oriented design, Performance testing, Performance tuning & Performance solutions. Outline. Performance oriented design PART IV Performance oriented design, Performance testing, Performance tuning & Performance solutions Slide 1 Outline Principles for performance oriented design Performance testing Performance tuning General

More information

irods and Metadata survey Version 0.1 Date March Abhijeet Kodgire akodgire@indiana.edu 25th

irods and Metadata survey Version 0.1 Date March Abhijeet Kodgire akodgire@indiana.edu 25th irods and Metadata survey Version 0.1 Date 25th March Purpose Survey of Status Complete Author Abhijeet Kodgire akodgire@indiana.edu Table of Contents 1 Abstract... 3 2 Categories and Subject Descriptors...

More information

A Comparison of Database Query Languages: SQL, SPARQL, CQL, DMX

A Comparison of Database Query Languages: SQL, SPARQL, CQL, DMX ISSN: 2393-8528 Contents lists available at www.ijicse.in International Journal of Innovative Computer Science & Engineering Volume 3 Issue 2; March-April-2016; Page No. 09-13 A Comparison of Database

More information

So today we shall continue our discussion on the search engines and web crawlers. (Refer Slide Time: 01:02)

So today we shall continue our discussion on the search engines and web crawlers. (Refer Slide Time: 01:02) Internet Technology Prof. Indranil Sengupta Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur Lecture No #39 Search Engines and Web Crawler :: Part 2 So today we

More information

Hadoop and Map-Reduce. Swati Gore

Hadoop and Map-Reduce. Swati Gore Hadoop and Map-Reduce Swati Gore Contents Why Hadoop? Hadoop Overview Hadoop Architecture Working Description Fault Tolerance Limitations Why Map-Reduce not MPI Distributed sort Why Hadoop? Existing Data

More information

HansaWorld SQL Training Material

HansaWorld SQL Training Material HansaWorld University HansaWorld SQL Training Material HansaWorld Ltd. January 2008 Version 5.4 TABLE OF CONTENTS: TABLE OF CONTENTS:...2 OBJECTIVES...4 INTRODUCTION...5 Relational Databases...5 Definition...5

More information

Virtuoso and Database Scalability

Virtuoso and Database Scalability Virtuoso and Database Scalability By Orri Erling Table of Contents Abstract Metrics Results Transaction Throughput Initializing 40 warehouses Serial Read Test Conditions Analysis Working Set Effect of

More information

1/20/2016 INTRODUCTION

1/20/2016 INTRODUCTION INTRODUCTION 1 Programming languages have common concepts that are seen in all languages This course will discuss and illustrate these common concepts: Syntax Names Types Semantics Memory Management We

More information

SAP Business Objects Business Intelligence platform Document Version: 4.1 Support Package 7 2015-11-24. Data Federation Administration Tool Guide

SAP Business Objects Business Intelligence platform Document Version: 4.1 Support Package 7 2015-11-24. Data Federation Administration Tool Guide SAP Business Objects Business Intelligence platform Document Version: 4.1 Support Package 7 2015-11-24 Data Federation Administration Tool Guide Content 1 What's new in the.... 5 2 Introduction to administration

More information

Big Systems, Big Data

Big Systems, Big Data Big Systems, Big Data When considering Big Distributed Systems, it can be noted that a major concern is dealing with data, and in particular, Big Data Have general data issues (such as latency, availability,

More information

Cloud Computing at Google. Architecture

Cloud Computing at Google. Architecture Cloud Computing at Google Google File System Web Systems and Algorithms Google Chris Brooks Department of Computer Science University of San Francisco Google has developed a layered system to handle webscale

More information

Decomposition into Parts. Software Engineering, Lecture 4. Data and Function Cohesion. Allocation of Functions and Data. Component Interfaces

Decomposition into Parts. Software Engineering, Lecture 4. Data and Function Cohesion. Allocation of Functions and Data. Component Interfaces Software Engineering, Lecture 4 Decomposition into suitable parts Cross cutting concerns Design patterns I will also give an example scenario that you are supposed to analyse and make synthesis from The

More information

How to Design and Create Your Own Custom Ext Rep

How to Design and Create Your Own Custom Ext Rep Combinatorial Block Designs 2009-04-15 Outline Project Intro External Representation Design Database System Deployment System Overview Conclusions 1. Since the project is a specific application in Combinatorial

More information

Query Optimization in Teradata Warehouse

Query Optimization in Teradata Warehouse Paper Query Optimization in Teradata Warehouse Agnieszka Gosk Abstract The time necessary for data processing is becoming shorter and shorter nowadays. This thesis presents a definition of the active data

More information

Maximizing Hadoop Performance and Storage Capacity with AltraHD TM

Maximizing Hadoop Performance and Storage Capacity with AltraHD TM Maximizing Hadoop Performance and Storage Capacity with AltraHD TM Executive Summary The explosion of internet data, driven in large part by the growth of more and more powerful mobile devices, has created

More information

3. Relational Model and Relational Algebra

3. Relational Model and Relational Algebra ECS-165A WQ 11 36 3. Relational Model and Relational Algebra Contents Fundamental Concepts of the Relational Model Integrity Constraints Translation ER schema Relational Database Schema Relational Algebra

More information

MS SQL Performance (Tuning) Best Practices:

MS SQL Performance (Tuning) Best Practices: MS SQL Performance (Tuning) Best Practices: 1. Don t share the SQL server hardware with other services If other workloads are running on the same server where SQL Server is running, memory and other hardware

More information

Portable Scale-Out Benchmarks for MySQL. MySQL User Conference 2008 Robert Hodges CTO Continuent, Inc.

Portable Scale-Out Benchmarks for MySQL. MySQL User Conference 2008 Robert Hodges CTO Continuent, Inc. Portable Scale-Out Benchmarks for MySQL MySQL User Conference 2008 Robert Hodges CTO Continuent, Inc. Continuent 2008 Agenda / Introductions / Scale-Out Review / Bristlecone Performance Testing Tools /

More information

Object Oriented Database Management System for Decision Support System.

Object Oriented Database Management System for Decision Support System. International Refereed Journal of Engineering and Science (IRJES) ISSN (Online) 2319-183X, (Print) 2319-1821 Volume 3, Issue 6 (June 2014), PP.55-59 Object Oriented Database Management System for Decision

More information

Bringing Big Data Modelling into the Hands of Domain Experts

Bringing Big Data Modelling into the Hands of Domain Experts Bringing Big Data Modelling into the Hands of Domain Experts David Willingham Senior Application Engineer MathWorks david.willingham@mathworks.com.au 2015 The MathWorks, Inc. 1 Data is the sword of the

More information

Oracle Database: SQL and PL/SQL Fundamentals

Oracle Database: SQL and PL/SQL Fundamentals Oracle University Contact Us: 1.800.529.0165 Oracle Database: SQL and PL/SQL Fundamentals Duration: 5 Days What you will learn This course is designed to deliver the fundamentals of SQL and PL/SQL along

More information

ICOM 6005 Database Management Systems Design. Dr. Manuel Rodríguez Martínez Electrical and Computer Engineering Department Lecture 2 August 23, 2001

ICOM 6005 Database Management Systems Design. Dr. Manuel Rodríguez Martínez Electrical and Computer Engineering Department Lecture 2 August 23, 2001 ICOM 6005 Database Management Systems Design Dr. Manuel Rodríguez Martínez Electrical and Computer Engineering Department Lecture 2 August 23, 2001 Readings Read Chapter 1 of text book ICOM 6005 Dr. Manuel

More information

OPTIMIZING QUERIES IN SQL SERVER 2008

OPTIMIZING QUERIES IN SQL SERVER 2008 Scientific Bulletin Economic Sciences, Vol. 9 (15) - Information technology - OPTIMIZING QUERIES IN SQL SERVER 2008 Professor Ph.D. Ion LUNGU 1, Nicolae MERCIOIU 2, Victor VLĂDUCU 3 1 Academy of Economic

More information

Database Management. Chapter Objectives

Database Management. Chapter Objectives 3 Database Management Chapter Objectives When actually using a database, administrative processes maintaining data integrity and security, recovery from failures, etc. are required. A database management

More information

Liferay Portal Performance. Benchmark Study of Liferay Portal Enterprise Edition

Liferay Portal Performance. Benchmark Study of Liferay Portal Enterprise Edition Liferay Portal Performance Benchmark Study of Liferay Portal Enterprise Edition Table of Contents Executive Summary... 3 Test Scenarios... 4 Benchmark Configuration and Methodology... 5 Environment Configuration...

More information

10g versions followed on separate paths due to different approaches, but mainly due to differences in technology that were known to be huge.

10g versions followed on separate paths due to different approaches, but mainly due to differences in technology that were known to be huge. Oracle BPM 11g Platform Analysis May 2010 I was privileged to be invited to participate in "EMEA BPM 11g beta bootcamp" in April 2010, where I had close contact with the latest release of Oracle BPM 11g.

More information