Testing Database Performance with HelperCore on Multi-Core Processors
Project Report
on
Testing Database Performance with HelperCore on Multi-Core Processors

Submitted by
Mayuresh P. Kunjir, M.E. (CSA)
Mahesh R. Bale, M.E. (CSA)

Under the Guidance of
Dr. T. Matthew Jacob
Problem Definition

Study the performance of database queries on a multi-core processor by measuring last-level cache misses, and analyze the effect of the HelperCore [1] approach on that performance.

1. Introduction

The ability to pack billions of transistors on-chip has opened the doors to an important trend in building high-performance computer systems. Until recently, microprocessors were developed by increasing their frequency and their capability to extract more instruction-level parallelism. This led to severe limitations in terms of power consumption and design complexity. The trend has therefore shifted to keeping the design simple while packing more processors onto the same die in order to achieve better performance. These chips are called Chip Multiprocessors (CMPs) or multi-cores.

CMPs offer a higher granularity (thread/process level) at which parallelism in programs can be exploited by compiler/runtime support, rather than leaving it to the hardware to extract the parallelism at the instruction level on a single (larger) multiple-issue core. It is now possible to split a single application into multiple tasks and execute them in parallel. Nevertheless, an increasing number of tasks results in more simultaneous memory requests, so access to memory becomes a bottleneck. An increasing number of tasks may also incur a larger synchronization penalty.

Currently, the use of multi-cores is largely limited to achieving higher throughput. Database management systems like PostgreSQL have no sophisticated techniques to exploit the benefits offered by multi-core technology. What is currently possible is issuing queries in parallel, i.e. on different processors. But because the different query executions must synchronize to keep the database consistent, this parallelization stops being beneficial once the number of execution cores grows beyond a certain point. In addition, the growing number of memory accesses to shared resources becomes a bottleneck. Taking this into consideration, [1] proposed a new approach that exploits the cores of a multi-core architecture in an alternative way: instead of executing application code on all cores, some cores execute code that indirectly results in a performance benefit for the application. We call these the Helper Cores. An example of such a use of a multi-core chip is depicted in Figure 1, which presents a high-level diagram of an application executing on a 16-core chip where 12 cores run the application and 4 cores act as Helper Cores.
In a typical database system, it is observed that the applications do not exploit all the cores. The performance of database applications can therefore be improved by putting the unused cores to work as helper cores. We apply this approach to the PostgreSQL database system and study the performance by measuring last-level cache (L2) misses.

The report is organized as follows. Section 2 summarizes the related work in the area. Section 3 explains the idea of the HelperCore approach in detail. The implementation details are given in Section 4. Section 5 describes the experiments carried out and their results. The conclusion is given in Section 6.

2. Related Work

Data pre-fetching is a technique that reduces memory latency by fetching data into the cache before it is requested by the CPU. Pre-fetching can be broadly classified into two categories: hardware-based and software-based. In the first category, the pre-fetch requests are generated automatically by the hardware: it monitors cache misses, predicts the next data to be requested, and issues the appropriate pre-fetch requests. The second category requires the programmer or compiler to explicitly insert pre-fetch instructions into the code, given that the microprocessor supports such instructions.

An alternative to traditional pre-fetching, proposed by Zhou et al. [2], is known as preloading, which uses an extra thread created by the programmer or the system. This helper thread executes a subset of the code ahead of the execution processor. Its execution pre-fetches data into the shared level of the memory hierarchy, so the working threads almost always find the required data there and hence execute faster.

Query processing in database systems can also be made faster by exploiting intra-operator as well as inter-operator parallelism for typical database operators such as join. [8] proposed the idea of encapsulating relational operators to utilize the resources efficiently, but identifying and encapsulating the available operator parallelism remains the critical problem there.

3. HelperCore Idea

The design criteria for HelperCoreDB are:
(1) It must fetch data from memory before the database application uses it, but just early enough that the data is not evicted before its use.
(2) Synchronization between HelperCoreDB and the database threads should incur minimal overhead.
(3) The implementation of HelperCoreDB should be simple.

To fetch the data before the application needs it, we need to consider what code to execute on the HelperCore. Memory access latency is one of the most serious bottlenecks in memory-resident database systems, and we use a form of data pre-fetching to tackle it. The pre-fetching is performed on the HelperCore, while the pre-fetched data is used by worker processes running on other cores. The HelperCore and the worker cores must therefore share some level of the memory hierarchy; we use L2 as that shared level.

Database systems organize data in blocks. A block is a construct of a specified fixed size that generally holds tuples of data and control information. Blocks reside on permanent storage as pages and in main memory as buffers.
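By way of illustration (this sketch is ours, not code from [1] or from our implementation), a helper core can pre-fetch a buffer into the shared cache level simply by reading one byte from each of its cache lines. The block size below matches PostgreSQL's default; the cache-line size and the function name are assumptions for illustration.

#include <stddef.h>

#define BLCKSZ     8192   /* PostgreSQL's default block size (8 kB) */
#define CACHE_LINE 64     /* assumed cache-line size in bytes */

/* Touch one byte per cache line so the whole buffer is loaded into the
 * cache level shared with the worker cores (L2 on our system). */
static void prefetch_block(const char *buf)
{
    volatile char sink;   /* keeps the loads from being optimized away */
    for (size_t off = 0; off < BLCKSZ; off += CACHE_LINE)
        sink = buf[off];  /* each load pulls in one cache line */
    (void) sink;
}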
In database applications, for certain algorithms such as the Sequential Scan, whenever a buffer is accessed most of the data in that buffer is likely to be used at about the same time. Taking advantage of this fact, HelperCoreDB is designed to pre-fetch at buffer granularity. Consequently, the worker process (the process executing the database application) needs to communicate to the helper process (the process pre-fetching the data) only the next block to be processed by the query plan. This leads to both a simpler implementation and less synchronization overhead, thus satisfying criteria (2) and (3).

After being triggered to fetch a block, the helper process fetches d consecutive blocks, on the assumption that those blocks will be used in the near future. The number d of consecutive blocks requested is called the pre-fetch distance. To satisfy criterion (1), d must be determined carefully: a small value of d may not bring in enough data to overcome the overheads, while a large value may bring in more data than is needed at that point, potentially evicting other useful data. Different queries should have different values of d.

The HelperCore may be implemented by creating a separate helper thread and then using either a busy-wait loop or a blocking synchronization primitive, such as the POSIX thread pthread_cond_wait() or the Linux-specific futex() system call. Blocking synchronization adds unacceptable overhead to the worker thread, so it is not suitable. Busy-wait synchronization is possible, but it requires one helper thread per worker process, i.e. one per client of the database application. Another way of sharing memory is to run a single background helper process. The client processes share with it the block number to be read and the number of the relation (table) to read from, using the shm API; synchronization over this shared data is achieved with semaphores (the sem API). We implemented this technique in the PostgreSQL database; the details are given in the next section.

4. Implementation

Subsection 4.1 describes the shm API, which is used to share data between the helper and worker processes. Subsection 4.2 gives an overview of the performance monitoring tool perfmon2, which we use to count L2 cache references and misses. Subsection 4.3 lists the modifications made to the PostgreSQL database system.

4.1 Shared Memory

Shared memory is a method of inter-process communication (IPC), i.e. a way of exchanging data between programs running at the same time: one process creates an area in RAM which other processes can access. In our project we need to share data between each worker process and the helper process. We used the shared memory API provided by UNIX System V; the corresponding function prototypes are declared in the header file <sys/shm.h>. A process creates a shared memory segment using shmget(). Once created, a shared segment can be attached to a process's address space using shmat() and detached using shmdt(). The attaching process must have the appropriate permissions for shmat(). Once attached, the process can read or write the segment, as allowed by the permissions requested in the attach operation.
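For concreteness, here is a minimal, self-contained sketch of this System V API; the key and segment size are arbitrary illustrative values.

#include <stdio.h>
#include <stdlib.h>
#include <sys/ipc.h>
#include <sys/shm.h>

int main(void)
{
    key_t key = 0x1234;                       /* illustrative key */
    /* Create (or look up) a 4 kB segment with owner read/write permission. */
    int shmid = shmget(key, 4096, IPC_CREAT | 0600);
    if (shmid == -1) { perror("shmget"); exit(1); }

    /* Attach the segment to this process's address space. */
    char *mem = shmat(shmid, NULL, 0);
    if (mem == (void *) -1) { perror("shmat"); exit(1); }

    mem[0] = 42;      /* any process that attaches the segment sees this */

    shmdt(mem);                         /* detach */
    shmctl(shmid, IPC_RMID, NULL);      /* mark the segment for removal */
    return 0;
}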
4.2 Perfmon tool

All modern processors have a hardware performance monitoring unit (PMU) that exports a set of counters, called hardware performance counters, to collect micro-architectural events such as the number of elapsed cycles or the number of L1 (or L2) cache misses. Exploiting these counters to analyze the performance of key applications and operating systems is becoming common practice. perfmon2 [4] is a hardware-based performance monitoring interface for Linux. It requires kernel support: the interface is distributed as a standalone kernel patch, which must be applied before the perfmon2 interface can be used. We applied the perfmon patch to our Linux kernel. Perfmon offers profiling at various levels of granularity: one can profile the entire system, a single process, all processes on a single CPU, or even individual functions within a single process. It counts micro-architectural events such as CPU cycles, cache misses, cache references, and TLB accesses.

4.3 Modifications to PostgreSQL

We used PostgreSQL [5,6] for our work. In PostgreSQL, the postmaster is the main process; it spawns a new process, which we call a WorkerProcess, whenever a new client request arrives. PostgreSQL also runs several background processes, e.g. the auto-vacuum garbage collector, the write-ahead logger, and the statistics collector. We added one more background process, the HelperProcess, which is also a child of the postmaster.

We modified the PostmasterMain() function of the postmaster so that it creates the shared memory segment (using shmget()). The shared memory segment contains a variable workers (the number of workers) and two arrays, rel_no[MAX_WORKERS] and block_no[MAX_WORKERS], where MAX_WORKERS is a macro giving the maximum number of WorkerProcesses. Both arrays have one entry per WorkerProcess: rel_no stores the requested relation number, and block_no the requested block number. Whenever a new client connects, the postmaster calls a function BackendStartup(), which does some initialization work and sets up the client; we attach the shared memory for the WorkerProcess in this function using shmat().

Whenever the WorkerProcess needs a block of data, it asks the buffer manager for it. The file bufmgr.c provides this functionality. Its function ReadBuffer() is the single point of access to the data: it takes the relation number and block number to be fetched and finds the memory buffer holding that block; if the buffer is not in main memory, it reads the block from permanent storage. We modified ReadBuffer() so that, in addition to fetching the required block, it writes the relation number and block number to the shared memory locations rel_no[i] and block_no[i], where i is the index of the WorkerProcess. Each relation is identified by a structure RelFileNode, whose field relnode gives the relation number; the variable blocknum contains the block number to be fetched. Each block is 8 kB, the default block size used by PostgreSQL.
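Putting this together, the shared state might be laid out as below. This is our own sketch rather than the project's actual code: the struct name, field types, and key are illustrative assumptions, while MAX_WORKERS and the field names follow the report.

#include <sys/ipc.h>
#include <sys/shm.h>

#define MAX_WORKERS 64                 /* report's macro; the value is an assumption */

/* Hypothetical layout of the segment created by the modified PostmasterMain(). */
typedef struct HelperShm
{
    int      workers;                  /* number of workers currently executing */
    unsigned rel_no[MAX_WORKERS];      /* relation last requested by each worker */
    unsigned block_no[MAX_WORKERS];    /* block last requested by each worker */
} HelperShm;

/* Created once at startup; each backend and the HelperProcess attach with shmat(). */
static HelperShm *create_helper_shm(key_t key)
{
    int shmid = shmget(key, sizeof(HelperShm), IPC_CREAT | 0600);
    if (shmid == -1)
        return NULL;
    return (HelperShm *) shmat(shmid, NULL, 0);
}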
The HelperProcess maintains two private arrays, old_rel_no[] and old_block_no[], and a macro D, the pre-fetch distance. The entries old_rel_no[i] and old_block_no[i] record the relation number and block number last requested by WorkerProcess i. The HelperProcess runs a continuous loop. In each iteration, it scans the shared arrays rel_no and block_no to check whether any worker process has requested a new block that was not already pre-fetched after its last request. When a new request is found, the HelperProcess starts pre-fetching data blocks: its objective is to read D consecutive data blocks starting from the block currently requested by the worker, so that data blocks are effectively pre-fetched ahead of their use. The helper and worker processes are synchronized with semaphores; the header <sys/sem.h> declares the functions semget(), semctl(), and semop() used to manage them. The code fragments for the ReadBuffer() function and the HelperProcess loop are given below.

/* Shared memory variables */
// number of workers currently in execution
int *workers;
// relation requested by each worker (RelFileNode; its field relnode is the relation number)
RelFileNode *rel_no[MAX_WORKERS];
// block number of the current block for each worker
BlockNumber *block_no[MAX_WORKERS];

/* Semaphore handling (declarations added here for completeness;
   the semaphore numbering is illustrative) */
int semid;                                  // obtained from semget()
struct sembuf WaitEmpty   = {0, -1, 0};     // P(Empty)
struct sembuf SignalEmpty = {0, +1, 0};     // V(Empty)
struct sembuf WaitFull    = {1, -1, 0};     // P(Full)
struct sembuf SignalFull  = {1, +1, 0};     // V(Full)

/* ReadBuffer function */
// i is the worker process identifier
// reln is the relation structure pointer

// Wait(Empty)
semop(semid, &WaitEmpty, 1);
// write the relation number and block number into shared memory
*rel_no[i] = reln->rd_node;
*block_no[i] = blocknum;
// Signal(Full)
semop(semid, &SignalFull, 1);
// Fetch the current block
ReadBuffer_common(reln, blocknum, zeropage, strategy);

/* Helper process */
RelFileNode tmp_rel_no;
BlockNumber tmp_block_no;
int i, k;

for (;;)                                    // the HelperProcess loops forever
{
    for (i = 0; i < *workers; i++)
    {
        // Wait(Full)
        semop(semid, &WaitFull, 1);
        // read the relation number and block number from shared memory
        tmp_rel_no = *rel_no[i];
        tmp_block_no = *block_no[i];
        // pre-fetch only if the request falls outside the last pre-fetched range
        if ((tmp_rel_no.relnode != old_rel_no[i]) ||
            (tmp_block_no < old_block_no[i]) ||
            (tmp_block_no > old_block_no[i] + D))
        {
            for (k = 1; k <= D; k++)
            {
                // Read the next block; stop if it cannot be read
                if (my_readbuffer(tmp_rel_no, tmp_block_no + k, NULL) == -1)
                    break;
            }
            // update the old block number and relation number
            old_block_no[i] = tmp_block_no;
            old_rel_no[i] = tmp_rel_no.relnode;
        }
        // Signal(Empty)
        semop(semid, &SignalEmpty, 1);
    }
}

5. Experimental Setup

5.1 System characteristics

All our experiments were executed natively on an Intel Core 2 Duo processor; the system's characteristics are listed in Table 1. L1 is private to each core while L2 is shared.

Processor:                Intel Core 2 Duo
Number of cores:          2
Frequency:                2 GHz
L1 data cache:            private to each core
L1 data cache size:       16 KB
L1 data cache line size:  64 bytes
L2 cache:                 shared by both cores
L2 cache size:            2048 KB
L2 cache line size:       128 bytes
Main memory (RAM):        2 GB DDR2

Table 1. System characteristics

The experiments were carried out on a Linux kernel with performance monitoring support, for which we applied the perfmon patch to the kernel.

5.2 Workload

We used the TPC-H database benchmark workload for our experiments. As sequential scan is the operation for which pre-fetching should give the most benefit, we tested queries involving sequential scans over various relations of the TPC-H database. For each query we measured three parameters, viz.
1. Number of L2 references
2. Number of L2 misses
3. Execution time
The three parameters were measured both with and without the HelperCore, and we varied the value of the pre-fetch distance D; when D equals zero, execution is equivalent to normal execution without HelperCore support. We counted these parameters by profiling the worker process executing the database queries. We present the results for three queries below.

5.3 Results

We first present the results of two queries:

Query 1: select * from customer;
Query 2: select * from part;

Figure 2 shows the effect of varying the pre-fetch distance D on L2 cache references. The number of L2 references does not change much with D. The slight reduction comes from the fact that, as we increase D, more and more blocks are pre-fetched and the worker process finds the blocks in the L1 cache. Figure 3 shows the effect of D on L2 cache misses. As more data is pre-fetched, L2 misses should fall; for higher values of D, however, the newly pre-fetched blocks replace the earlier ones, so L2 misses increase again. Our experiments show that a pre-fetch distance of 3 is optimal: it reduces the number of misses to approximately two thirds of the misses without pre-fetching (D = 0). We also observed that values of D greater than 5 increase the cache misses.

Figure 2: L2 cache references plotted against pre-fetch distance (Query 1 and Query 2)
Figure 3: L2 cache misses plotted against pre-fetch distance (Query 1 and Query 2)

Figure 4: Execution time (ms) plotted against pre-fetch distance (Query 1 and Query 2)

Figure 4 shows the change in execution time as we vary D. No improvement is observed. This is because the helper and worker processes are synchronized, which introduces waiting time on the semaphore: although the time to fetch the data is reduced by the drop in cache misses, the waiting time increases the total execution time. Execution without helper-process support, i.e. D = 0, is found to take the least time.

We also tested TPC-H benchmark query template 6, a sequential scan on the relation lineitem. The results are shown in Table 2 below and exhibit the same patterns.
QT6: select sum(l_extendedprice * l_discount) as revenue from lineitem where l_extendedprice > 0

D    L2 references    L2 misses    Time (seconds)

Table 2: QT6 results

6. Conclusion

We implemented the HelperCoreDB approach in the PostgreSQL database system and analyzed its effects on performance. We found that L2 cache misses drop by approximately 33% at the optimal pre-fetch distance, which is 3 for our system configuration. The original paper reported a reduction of about 22% in execution time, but our results show no improvement in execution time, due to synchronization overheads. The original paper used a separate helper thread per client, which enables sharing of global data; creating a separate helper thread for each client, however, adds too much overhead to the system. We instead chose a single background process that acts as the helper for all clients, with data sharing achieved through the shm API. As the results show, this idea does reduce L2 cache misses, but performance does not improve because of the synchronization and shared memory overheads.

References

[1] Kostas Papadopoulos et al. HelperCoreDB: Exploiting Multicore Technology to Improve Database Performance. IPDPS.
[2] J. Zhou, J. Cieslewicz, K. A. Ross, and M. Shah. Improving Database Performance on Simultaneous Multithreading Processors. In Proceedings of the 31st International Conference on Very Large Data Bases (VLDB '05). VLDB Endowment, 2005.
[3] Transaction Processing Performance Council.
[4] Perfmon profiling interface.
[5] The PostgreSQL database system website.
[6] PostgreSQL source code documentation.
[7] Shared memory semantics.
[8] Ralph Acker et al. Parallel Query Processing in Databases on Multicore Architectures. ICA3PP 2008, LNCS 5022, pp. 2-12, 2008.
[9] Ailamaki et al. DBMSs on Modern Processors: Where Does Time Go? Technical Report 1394, University of Wisconsin-Madison, 1999.