Testing Database Performance with HelperCore on Multi-Core Processors




Project Report on
Testing Database Performance with HelperCore on Multi-Core Processors

Submitted by
Mayuresh P. Kunjir, M.E. (CSA)
Mahesh R. Bale, M.E. (CSA)

Under the guidance of
Dr. T. Matthew Jacob

Problem Definition

Study the performance of database queries on a multi-core processor by measuring last-level cache misses, and analyze the effect of the HelperCore [1] approach on that performance.

1. Introduction

The ability to pack billions of transistors on-chip has opened the door to an important trend in building high-performance computer systems. Until recently, microprocessors were developed by increasing their frequency and their capability to extract more instruction-level parallelism; this led to severe limitations in power consumption and design complexity. The trend has therefore shifted to keeping the design simple while packing more processors onto the same die to achieve better performance. Such chips are called Chip Multiprocessors (CMPs) or multi-cores. CMPs allow compiler/runtime support to exploit parallelism in programs at a coarser granularity (thread/process level), rather than leaving it to the hardware to extract parallelism at the instruction level on a single, larger multiple-issue core. It is now possible to split a single application into multiple tasks and execute them in parallel. Nevertheless, an increasing number of tasks produces more simultaneous memory requests, so access to memory becomes a bottleneck; it may also incur a larger synchronization penalty.

In practice, the use of multi-cores is often limited to exploiting them for higher throughput. Current database management systems such as PostgreSQL have no sophisticated techniques to exploit the benefits offered by multi-core technology. What is currently possible is issuing queries in parallel, i.e. on different processors. But because the different query executions must synchronize to keep the database consistent, this parallelization stops being beneficial once the number of execution cores grows beyond a certain point, and the increasing number of memory accesses to shared resources becomes a bottleneck. Taking this into consideration, Papadopoulos et al. [1] proposed a new way to exploit the cores of a multi-core architecture: instead of executing application code on all cores, some cores execute code that indirectly yields a performance benefit for the application. We call these the Helper Cores. An example of such a use of a multi-core chip is depicted in Figure 1, a high-level diagram of an application executing on a 16-core chip in which 12 cores run the application and 4 serve as Helper Cores.

It is observed that a typical database system does not exploit all the cores available to it. The performance of database applications can therefore be improved by putting the unused cores to work as helper cores. We apply this approach to the PostgreSQL database system and study the performance by measuring last-level (L2) cache misses.

The report is organized as follows. Section 2 summarizes related work in the area. Section 3 explains the idea of the HelperCore approach in detail. Section 4 gives the implementation details. Section 5 describes the experiments carried out and their results. Section 6 concludes.

2. Related Work

Data pre-fetching is a technique that reduces memory latency by fetching data into the cache before it is requested by the CPU. Pre-fetching can be broadly classified into two categories: hardware-based and software-based. In the first category, the pre-fetch requests are generated automatically by the hardware, which monitors cache misses, predicts the next data to be requested, and issues the appropriate pre-fetch requests. The second category requires the programmer or compiler to explicitly insert pre-fetch instructions in the code, provided the microprocessor supports such instructions.
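
As an illustration of the software-based category, compilers such as GCC and Clang expose a portable prefetch builtin that lowers to the target's prefetch instruction. A minimal sketch; the array, the function, and the look-ahead distance AHEAD are illustrative assumptions, not code from this project:

#include <stddef.h>

/* Software pre-fetching sketch: hint that a[i + AHEAD] will be read soon
   while a[i] is being processed. AHEAD is an illustrative constant. */
#define AHEAD 64

long sum_with_prefetch(const long *a, size_t n)
{
    long sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + AHEAD < n)
            /* arguments: address, 0 = read access, 1 = low temporal locality */
            __builtin_prefetch(&a[i + AHEAD], 0, 1);
        sum += a[i];
    }
    return sum;
}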

An alternative to traditional pre-fetching, known as preloading, uses an extra thread created by the programmer or the system. Zhou et al. [2] proposed this technique: a helper thread executes a subset of the code ahead of the main execution, so that data is pre-fetched into the shared level of the memory hierarchy, the working threads almost always find the required data, and hence execute faster.
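
A minimal sketch of the preloading idea with a busy-polling helper thread; the data layout, chunk size, and progress-publishing scheme are our illustrative assumptions, not the mechanism of [2]:

#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

#define N     (1 << 22)
#define CHUNK 8192              /* elements the helper tries to stay ahead by */

static long data[N];
static atomic_size_t pos;       /* index the worker is currently processing */

/* Helper thread: touch the chunk just ahead of the worker so that it is
   resident in the shared cache level by the time the worker reads it. */
static void *preloader(void *arg)
{
    (void)arg;
    volatile long sink = 0;
    size_t p;
    while ((p = atomic_load(&pos)) < N) {
        size_t end = (p + 2 * CHUNK < N) ? p + 2 * CHUNK : N;
        for (size_t i = p + CHUNK; i < end; i++)
            sink += data[i];    /* the load is what matters, not the sum */
    }
    return NULL;
}

int main(void)
{
    pthread_t helper;
    pthread_create(&helper, NULL, preloader, NULL);

    long sum = 0;
    for (size_t i = 0; i < N; i++) {
        sum += data[i];
        if (i % CHUNK == 0)
            atomic_store(&pos, i);   /* publish progress to the helper */
    }
    atomic_store(&pos, N);           /* tell the helper to stop */
    pthread_join(helper, NULL);
    return (int)(sum & 1);           /* keep the result observable */
}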

Query processing in database systems can also be made faster by exploiting intra-operator as well as inter-operator parallelism for typical database operators such as join. [8] proposed the idea of encapsulating relational operators to utilize resources efficiently, but identifying and encapsulating the available operator parallelism is the critical problem there.

3. HelperCore Idea

The design of HelperCoreDB must address three issues: (1) it must fetch data from memory before the database application uses it, but just early enough that the data is not evicted before its use; (2) synchronization between HelperCoreDB and the database threads should add minimal overhead; and (3) the implementation of HelperCoreDB should be simple.

To fetch the data before the application needs it, we must decide what code to execute on the HelperCore. Memory access latency is one of the most serious bottlenecks in memory-resident database systems, and we tackle it with a form of data pre-fetching: the pre-fetching is performed on the HelperCore, while the pre-fetched data is used by worker processes running on other cores. The HelperCore and the worker cores must therefore share some level of the memory hierarchy; we use the L2 cache as that shared level.

Database systems organize data in blocks. A block is a construct of a specified fixed size that generally holds tuples of data and control information; blocks reside on permanent storage as pages and in main memory as buffers. For certain algorithms in database applications, such as the sequential scan, whenever a buffer is accessed most of the data in that buffer is likely to be used at that time. Taking advantage of this fact, HelperCoreDB is designed to pre-fetch at buffer granularity. Consequently, the worker process (the process executing the database application) needs to communicate to the helper process (the process pre-fetching the data) only the next block to be processed by the query plan. This leads to both a simpler implementation and less synchronization overhead, satisfying criteria (2) and (3).

After being triggered to fetch a block, the helper process fetches d consecutive blocks, on the assumption that those blocks will be used in the near future. The number d of consecutive blocks requested is called the pre-fetch distance. To satisfy criterion (1), d must be chosen carefully: a small value may not bring in enough data to overcome the overheads, while a large value may bring in more data than is needed at that point, potentially evicting other useful data. Different queries call for different values of d.

The HelperCore could be implemented by creating a separate helper thread and using either a busy-wait loop or a blocking synchronization primitive, such as the POSIX thread pthread_cond_wait() or the Linux-specific futex() system call. Blocking synchronization adds unacceptable overhead to the worker thread, so it is not suitable. Busy-wait synchronization is possible, but it requires one helper thread per worker process, i.e. one per client of the database application. Another way of sharing memory is to run a single background helper process: the client processes share with it, through the shm API, the block number to be read and the relation (table) number to read it from, and access to this shared data is synchronized using semaphores (the sem API). We implemented this technique in the PostgreSQL database; the details are given in the next section.

4. Implementation

This section first describes, in subsection 4.1, the shm API used to share data between the helper and worker processes. Subsection 4.2 gives an overview of perfmon2, the performance monitoring tool used to count L2 cache references and misses for the analysis in our project. Subsection 4.3 lists the modifications made to the PostgreSQL database system.

4.1 Shared Memory

Shared memory is a method of inter-process communication (IPC), i.e. a way of exchanging data between programs running at the same time: one process creates an area in RAM which other processes can access. In our project we need to share data between each worker process and the helper process. We use the shared memory API provided by UNIX System V; the corresponding function prototypes are declared in the header file <sys/shm.h>. A process creates a shared memory segment using shmget(). Once created, a shared segment can be attached to a process address space using shmat() and detached using shmdt(). The attaching process must have the appropriate permissions for shmat(); once attached, the process can read or write the segment as allowed by the permissions requested in the attach operation.
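
A minimal sketch of these System V calls; the key, the permission bits, and the segment layout are illustrative assumptions, loosely modelled on (but not identical to) the segment described in Section 4.3:

#include <stdio.h>
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/shm.h>

#define MAX_WORKERS 32

struct helper_shm {
    int workers;               /* number of active worker processes */
    int rel_no[MAX_WORKERS];   /* relation requested by each worker */
    int block_no[MAX_WORKERS]; /* block requested by each worker    */
};

int main(void)
{
    /* create (or look up) a segment large enough for the request table */
    int shmid = shmget((key_t)0x48454C50, sizeof(struct helper_shm),
                       IPC_CREAT | 0600);
    if (shmid == -1) { perror("shmget"); return 1; }

    /* attach the segment into this process's address space */
    struct helper_shm *shm = shmat(shmid, NULL, 0);
    if (shm == (void *)-1) { perror("shmat"); return 1; }

    shm->workers = 0;          /* the segment now behaves like ordinary memory */

    shmdt(shm);                /* detach when done */
    return 0;
}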

4.2 Perfmon tool

All modern processors have a hardware performance monitoring unit (PMU) that exports a set of counters, called hardware performance counters, which record micro-architectural events such as the number of elapsed cycles or the number of L1 (or L2) cache misses. Exploiting these counters to analyze the performance of key applications and operating systems is becoming common practice. perfmon2 [4] is a hardware-based performance monitoring interface for Linux. It requires kernel support, which is available as a standalone kernel patch that must be applied before the perfmon2 interface can be used; we applied the perfmon patch for Linux kernel version 2.6.27. Perfmon offers profiling at various levels of granularity: one can profile the entire system, a single process, all processes on a single CPU, or even individual functions within a single process. It counts micro-architectural events such as CPU cycles, cache misses, cache references, and TLB accesses.
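
perfmon2's own calls are specific to the kernel patch, so as an illustration of counter-based measurement we sketch the equivalent using perf_event_open(2), the interface that later replaced perfmon2 in mainline Linux; the event choice and the region markers are illustrative assumptions:

#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <linux/perf_event.h>

/* open one hardware counter for this process on any CPU */
static int perf_open(uint64_t config)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = config;
    attr.disabled = 1;          /* start the counter stopped */
    attr.exclude_kernel = 1;    /* user-level events only */
    return (int)syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
}

int main(void)
{
    int fd = perf_open(PERF_COUNT_HW_CACHE_MISSES);
    if (fd == -1) { perror("perf_event_open"); return 1; }

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

    /* ... region of interest, e.g. run the query workload ... */

    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
    uint64_t misses = 0;
    read(fd, &misses, sizeof(misses));
    printf("cache misses: %llu\n", (unsigned long long)misses);
    close(fd);
    return 0;
}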

4.3 Modifications to PostgreSQL

We used PostgreSQL [5, 6] version 8.3.4 for our work. In PostgreSQL the postmaster is the main process; it spawns a new process, which we call a WorkerProcess, whenever a new client request arrives. Postgres also runs some background processes alongside it, e.g. the auto-vacuum garbage collector, the write-ahead logger, and the statistics collector. We added one more background process, the HelperProcess, which is also a child of the postmaster.

We modified the PostmasterMain() function of the postmaster so that it creates the shared memory segment (using shmget()). The segment contains a variable workers (the number of workers) and two arrays, rel_no[MAX_WORKERS] and block_no[MAX_WORKERS], where MAX_WORKERS is a macro giving the maximum number of WorkerProcesses. Each array has one entry per WorkerProcess: rel_no stores the requested relation number, and block_no the requested block number. Whenever a new client is created, the postmaster calls the function BackendStartup(), which does some initialization work and sets up the client; we attach the shared memory segment for the WorkerProcess in this function using shmat().

Whenever the WorkerProcess needs a block of data, it asks the buffer manager for it. The file bufmgr.c provides this functionality through the function ReadBuffer(), the single point of access to the data: it takes the relation number and the block number to be fetched and finds the memory buffer that holds that block, reading it from permanent storage if it is not already in main memory. We modified ReadBuffer() so that, in addition to fetching the required block, it writes the relation number and block number to the shared memory locations rel_no[i] and block_no[i], where i is the index of the WorkerProcess. Each relation is identified by a structure RelFileNode, whose field relNode gives the relation number; the variable blocknum holds the block number to be fetched. Each block is 8 kB, the default block size used by PostgreSQL.

The HelperProcess maintains two private arrays, old_rel_no[] and old_block_no[], and a macro D, the pre-fetch distance. The entries old_rel_no[i] and old_block_no[i] record the relation number and block number last requested by WorkerProcess i. The HelperProcess runs a continuous loop; in each iteration it successively checks the locations in the shared arrays rel_no and block_no to see whether any worker process has requested a new block that was not covered by its last pre-fetch. When a new request is found, the HelperProcess starts pre-fetching: its objective is to read D consecutive data blocks starting from the block currently requested by the worker, so that data blocks are effectively pre-fetched ahead of their use. Helper and worker processes are synchronized using semaphores; the sys/sem.h header provides the functions semget(), semctl(), and semop() used to manage them. The code fragments for the ReadBuffer() function and the HelperProcess loop are given below.
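
The fragments assume an Empty/Full semaphore pair and the operation descriptors WaitEmpty, SignalEmpty, WaitFull, and SignalFull. A minimal sketch of how that pair might be created with the sem API; the single shared pair is a simplification (one pair per worker slot would decouple the workers), and the key and permissions are illustrative:

#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/sem.h>

/* required by semctl(SETVAL) on most systems */
union semun { int val; struct semid_ds *buf; unsigned short *array; };

#define SEM_EMPTY 0   /* the slot is free for a worker to write a request */
#define SEM_FULL  1   /* a request is waiting for the helper to consume   */

int semid;

/* operation descriptors passed to semop() in the fragments below */
struct sembuf WaitEmpty   = { SEM_EMPTY, -1, 0 };
struct sembuf SignalEmpty = { SEM_EMPTY, +1, 0 };
struct sembuf WaitFull    = { SEM_FULL,  -1, 0 };
struct sembuf SignalFull  = { SEM_FULL,  +1, 0 };

void init_helper_semaphores(key_t key)
{
    union semun arg;
    semid = semget(key, 2, IPC_CREAT | 0600);  /* a set of two semaphores */
    arg.val = 1;
    semctl(semid, SEM_EMPTY, SETVAL, arg);     /* slot initially empty */
    arg.val = 0;
    semctl(semid, SEM_FULL, SETVAL, arg);      /* no request pending   */
}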

/* Shared memory variables */
int *workers;                       /* number of workers currently in execution */
RelFileNode *rel_no[MAX_WORKERS];   /* relation requested by each worker */
int *block_no[MAX_WORKERS];         /* current block requested by each worker */

/* ReadBuffer function (worker side)
 *   i     - worker process identifier
 *   reln  - relation structure pointer
 *   semid - semaphore set identifier */
semop(semid, &WaitEmpty, 1);                    /* Wait(Empty) */
/* write the relation number and block number into shared memory */
*rel_no[i] = reln->rd_node;
*block_no[i] = blocknum;
semop(semid, &SignalFull, 1);                   /* Signal(Full) */
/* fetch the current block as usual */
ReadBuffer_common(reln, blocknum, zeropage, strategy);

/* HelperProcess loop body (executed repeatedly in its continuous loop) */
for (i = 0; i < *workers; i++)
{
    semop(semid, &WaitFull, 1);                 /* Wait(Full) */
    /* read the relation number and block number from shared memory */
    tmp_rel_no = *rel_no[i];
    tmp_block_no = *block_no[i];
    /* pre-fetch only if the request is not covered by the previous one:
       a different relation, or a block outside the window already fetched */
    if (tmp_rel_no.relNode != old_rel_no[i] ||
        tmp_block_no < old_block_no[i] ||
        tmp_block_no > old_block_no[i] + D)
    {
        for (k = 1; k <= D; k++)
        {
            /* read the next block; stop at the end of the relation */
            if (my_readbuffer(tmp_rel_no, tmp_block_no + k, NULL) == -1)
                break;
        }
        /* remember the last relation and block pre-fetched for worker i */
        old_block_no[i] = tmp_block_no;
        old_rel_no[i] = tmp_rel_no.relNode;
    }
    semop(semid, &SignalEmpty, 1);              /* Signal(Empty) */
}

5. Experimental Setup

5.1 System characteristics

All our experiments were executed natively on an Intel Core 2 Duo processor; the system's characteristics are listed in Table 1. L1 is private to each core while L2 is shared.

Processor                 Intel Core 2 Duo
Number of cores           2
Frequency                 2 GHz
L1 data cache             private to each core
L1 data cache size        16 KB
L1 data cache line size   64 bytes
L2 cache                  shared by both cores
L2 cache size             2048 KB
L2 cache line size        128 bytes
Main memory (RAM)         2 GB DDR2

Table 1: System characteristics

The experiments were carried out on Linux kernel version 2.6.27 with performance monitoring support, for which we applied the perfmon patch to the kernel.

5.2 Workload

We used the TPC-H [3] database benchmark workload for our experiments. As the sequential scan is the operation for which pre-fetching should give the most benefit, we tested queries involving sequential scans over various relations of the TPC-H database. For each query we measured three parameters:

1. Number of L2 references
2. Number of L2 misses
3. Execution time

All three parameters were measured both with and without the HelperCore, varying the pre-fetch distance D; D = 0 is equivalent to normal execution without HelperCore support. The parameters were counted by profiling the worker process executing the database queries. We present the results for three queries below.

5.3 Results

We first present the results of two queries:

Query 1: Select * from customer;
Query 2: Select * from part;

Figure 2 shows the effect of varying the pre-fetch distance D on L2 cache references. The number of L2 references changes little with D; the slight reduction occurs because, as D increases, more and more blocks are pre-fetched and the worker process finds them in the L1 cache. Figure 3 shows the effect of D on L2 cache misses. As more data is pre-fetched, the L2 misses should fall; but for higher values of D the newly pre-fetched blocks replace the earlier ones, and the misses increase. Our experiments show that a pre-fetch distance of 3 is optimal: it reduces the number of misses to approximately two thirds of the misses without pre-fetching (D = 0). We also observe that values of D greater than 5 increase the cache misses.

[Figure 2: L2 cache references plotted against pre-fetch distance, for Query 1 and Query 2]

[Figure 3: L2 cache misses plotted against pre-fetch distance, for Query 1 and Query 2]

[Figure 4: Execution time (ms) plotted against pre-fetch distance, for Query 1 and Query 2]

Figure 4 shows how the execution time changes as we vary D. No improvement is observed: the helper and worker processes are synchronized, so the worker spends time waiting on the semaphore. Although the time to fetch the data is reduced by the drop in cache misses, the waiting time increases the total execution time, and execution without helper support (D = 0) takes the least time. We also tested TPC-H benchmark query template 6, a sequential scan on the relation lineitem. The results, shown in Table 2 below, follow the same patterns.

QT6: select sum(l_extendedprice * l_discount) as revenue from lineitem where l_extendedprice > 0

D    L2 references   L2 misses   Time (seconds)
0    45106742        180475      12.24
1    44789333        143256      14.82
2    44120988        124532      13.05
3    43050196        113243      12.58
4    42242583        129876      13.67
5    42229009        168730      13.15
6    41100230        178593      13.46
7    39880133        219224      12.68
8    38896054        232451      13.25
9    37531957        255832      13.92
10   36791351        261783      12.84

Table 2: QT6 results

6. Conclusion

We implemented the HelperCoreDB approach in the PostgreSQL database system and analyzed its effect on performance. We found that L2 cache misses fall by approximately 33% at the optimal pre-fetch distance, which is 3 for our system configuration. The original paper also reported a reduction of about 22% in execution time, but our results show no improvement in execution time, owing to synchronization overheads. The original paper used a separate helper thread per client, which enables sharing of global data but adds the overhead of one helper thread for every client. We instead chose a single background process that acts as the helper for all clients, with data sharing achieved through the shm API. As the results show, this design does reduce L2 cache misses, but overall performance does not improve because of the synchronization and shared memory overheads.

References

[1] Kostas Papadopoulos et al., HelperCoreDB: Exploiting Multicore Technology to Improve Database Performance, IPDPS 2008.
[2] J. Zhou, J. Cieslewicz, K. A. Ross, and M. Shah, Improving Database Performance on Simultaneous Multithreading Processors, in Proceedings of the 31st International Conference on Very Large Data Bases (VLDB '05), pages 49-60, VLDB Endowment, 2005.
[3] Transaction Processing Performance Council, http://www.tpc.org, 2008.
[4] perfmon2 profiling interface, http://perfmon2.sourceforge.net/
[5] The PostgreSQL database system website, http://www.postgresql.org
[6] PostgreSQL source code documentation, http://doxygen.postgresql.org
[7] Shared memory semantics, http://www.cs.cf.ac.uk/dave/c/node27.html
[8] Ralph Acker et al., Parallel Query Processing in Databases on Multicore Architectures, ICA3PP 2008, LNCS 5022, pages 2-12, 2008.
[9] Ailamaki et al., DBMSs on a Modern Processor: Where Does Time Go?, Technical Report 1394, University of Wisconsin-Madison, 1999.