Dynamic Profiling and Load-balancing of N-body Computational Kernels on Heterogeneous Architectures


Dynamic Profiling and Load-balancing of N-body Computational Kernels on Heterogeneous Architectures

A THESIS SUBMITTED TO THE UNIVERSITY OF MANCHESTER FOR THE DEGREE OF MASTER OF SCIENCE IN THE FACULTY OF ENGINEERING AND PHYSICAL SCIENCES

2010

By Andrew Attwood
School of Computer Science

Contents

List of Tables
List of Figures
List of Equations
Abstract
Declaration
Copyright
Acknowledgements
List of Abbreviations
Chapter 1 Introduction
    Motivation and Context of the Study
    Project Aims and Methodology
    Project Phases
    Structure for the Dissertation
Chapter 2 Heterogeneous Architectures
    Chapter Overview
    Cell Broadband Engine
        Development of the CBE
    Nvidia GPU with Intel Nehalem
        Nvidia Architecture
        Nvidia Development
        Intel Nehalem Architecture
        Intel Nehalem Development
    Chapter Summary
Chapter 3 Scheduling, Load Balancing and Profiling
    Chapter Overview
    Computational Scheduling
        Assessment of the N-Body Problem
        Block and Cyclic Scheduling
        Dynamic Scheduling
        OS Thread Scheduling
        Profiling
    Chapter Summary
Chapter 4 Heterogeneous Development Methodologies and Patterns
    Chapter Overview
    Development Methodologies
        Software Development Phases
        Algorithm Design Considerations
        Patterns
        Implementation
        Testing
    Chapter Summary
Chapter 5 Algorithm Analysis and Design
    Chapter Overview
    Analysis
        Amdahl's analysis of Nmod
        Nmod Application Structure Analysis
    SMP Design
    Cell Design
    CUDA Design
    Load Balancing and Profiling Algorithm Design
    Chapter Summary
Chapter 6 Implementation
    Chapter Overview
    Platform Configuration
        CELL Development Environment
        GPU Development Environment
    SMP Implementation
    CELL Implementation
    GPU Implementation
    Load-balancing and Profiling Integration
    Chapter Summary
Chapter 7 Testing and Evaluation
    Chapter Overview
    Performance Testing
    Cell Processor Testing
    Nehalem GPU Testing
    Chapter Summary
Chapter 8 Conclusions and Future Work
    Future Work
        Stored Knowledge
        Web Services Hint
        Operating System Enhancements
        Generic Framework
        Continuous Monitoring
        Predictive Initial Conditions
    Conclusion
Bibliography
Appendix A: CELL SDK Installation

Word Count: 23,380

List of Tables

Table 2.1: LIBSPE2 functions 27
Table 2.2: SPE DMA functions 27
Table 2.3: Thread and fork creation overhead 28
Table 3.1: Offset pattern for 100 elements 42
Table 4.1: Parallel languages 53
Table 5.1: Gprof output non-threaded 10 particles 10 steps 61
Table 5.2: Gprof output non-threaded 60 particles 10 steps 62
Table 5.3: Nmod speed-up values 63
Table 7.1: Single and dual thread performance 101
Table 7.2: SPE timing data 102
Table 7.3: CELL automated profiling algorithm timing 103
Table 7.4: GPU/Nehalem timing data 105

List of Figures

Figure 2.1: CBE block diagram 23
Figure 2.2: Division of computation CBE 24
Figure 2.3: Stream processing CBE 25
Figure 2.4: CBE compile commands 26
Figure 2.5: Bus capacity 29
Figure 2.6: GPU CPU footprint comparison 31
Figure 2.7: Nvidia processor architecture 32
Figure 2.8: Two dual threaded SMP processors 34
Figure 2.9: Two SMP single thread processors 34
Figure 3.1: Computation level given index 39
Figure 3.2: 5-body example 39
Figure 3.3: Body index calculation ratio 40
Figure 3.4: Block cyclic schedule 41
Figure 3.5: Pseudo code for in-process offset calculation 41
Figure 3.6: Cyclic schedule 43
Figure 3.7: Equal area schedule 43
Figure 3.8: DAG approach 45
Figure 4.1: Waterfall model 50
Figure 4.2: Message passing pattern 54
Figure 4.3: The divide and conquer pattern 56
Figure 4.4: CELL development process 57
Figure 4.5: Structure of arrays and arrays of structures 58
Figure 4.6: Implementation plan 59
Figure 5.1: N-body test pattern loaded into MATLAB 64
Figure 5.2: Application code to generate random body distribution and mass 65
Figure 5.3: JSP diagram of the NMOD application 66
Figure 5.4: Original function declaration for findgravitationalinteractions 67
Figure 5.5: Thread communication structure 68
Figure 5.6: Pseudo code for thread creation and balancing on SMP 69
Figure 5.7: Nmod SMP cache coherent program design 69
Figure 5.8: CELL SPE program design 70
Figure 5.9: Posix memalign function prototype 71
Figure 5.10: Nvidia GPU design 72
Figure 5.11: Load-balancing and profiling diagram 75
Figure 6.1: Cell compilation commands 78
Figure 6.2: Compile command for threaded SMP 78
Figure 6.3: SMP multi-threaded implementation with load balance 81
Figure 6.4: Calculate gravitational interactions function SMP 84
Figure 6.5: Force accumulator for each SPE aligned to 16 byte borders 86
Figure 6.6: Trans data structure initialisation 86
Figure 6.7: Loading SPE image file 87
Figure 6.8: Passing arg structure to SPE 88
Figure 6.9: SPE context creation 88
Figure 6.10: SPE DMA transfer operation 90
Figure 6.11: CUDA call find gravitational interactions 92
Figure 6.12: CUDA find gravitational interactions 93
Figure 6.13: Load balance and profile GPU/NVIDIA 95

Figure 6.14: Load-balancing implementation 97
Figure 6.15: Finalising test 98
Figure 6.16: Assembly commands for hardware timers 99
Figure 7.1: CBE 2 PPE thread speed-up 101
Figure 7.2: 6 SPE speed-up 102
Figure 7.3: Speed-up on Nehalem and GPU 106

List of Equations

Equation 5.1: Amdahl's Law 62
Equation 5.2: NMOD speed-up value 63
Equation 5.3: Runge-Kutta 4th order integrator 66

Abstract

Increasingly, devices are moving away from a single-architecture, single-core design. From the fastest supercomputer to the smallest mobile phone, devices are now being constructed with heterogeneous architectures. The reason for this heterogeneity is, in part, the slowing of speed increases in single-chip, single-core devices, but equally the realisation that coupling specific devices to specific problems provides increased performance and power efficiency. Application development processes for multicore heterogeneous technologies are still in their infancy. Developing high-performance applications for the scientific community places the responsibility on the developer to maximise the use of the underlying architecture. Traditional approaches to the development process are inadequate to deal with the complexity of instantiating computation on heterogeneous architectures, and current load-balancing algorithms fail to provide the dynamism required to best fit computation to the available resource. This project seeks to design an algorithm to optimise the application of a computational problem to the available heterogeneous resource through runtime profiling and load-balancing. CELL and Nehalem/GPU heterogeneous architectures are targeted with an N-body simulator combined with the implementation of a profiling and load-balancing algorithm. This implementation is subsequently tested under different computational load conditions.

Declaration

No portion of the work referred to in this report has been submitted in support of an application for another degree or qualification of this or any other university or other institute of learning.

Copyright

i. The author of this dissertation (including any appendices and/or schedules to this dissertation) owns any copyright in it (the "Copyright") and s/he has given The University of Manchester the right to use such copyright for any administrative, promotional, educational and/or teaching purposes.

ii. Copies of this dissertation, either in full or in extracts, may be made only in accordance with the regulations of the John Rylands University Library of Manchester. Details of these regulations may be obtained from the Librarian. This page must form part of any such copies made.

iii. The ownership of any patents, designs, trademarks and any and all other intellectual property rights except for the Copyright (the "Intellectual Property Rights") and any reproductions of copyright works, for example graphs and tables ("Reproductions"), which may be described in this dissertation, may not be owned by the author and may be owned by third parties. Such Intellectual Property Rights and Reproductions cannot and must not be made available for use without the prior written permission of the owner(s) of the relevant Intellectual Property Rights and/or Reproductions.

iv. Further information on the conditions under which disclosure, publication and exploitation of this dissertation, the Copyright and any Intellectual Property Rights and/or Reproductions described in it may take place is available from the Head of Department of Computer Science.

Acknowledgements

I would like to take this opportunity to thank my supervisor, Dr John Brooke, for his support and guidance during the past 16 months. My thanks also go to Dr Carey Pridgeon for developing the nmod application, which provides the computational base for my algorithm, and for his kind responses to my questions. I would also like to thank Dr Jonathan Follows for providing the CELL and Nvidia CUDA training at STFC Daresbury. Finally, I would also like to thank my family for their help and support over the past two years.

List of Abbreviations

CBE    Cell broadband engine
SPE    Synergistic processing element
PPU    Power processing unit
SIMD   Single instruction multiple data
SPU    Synergistic processing unit
MFC    Memory flow controller
LS     Local store
DMA    Direct memory access controller
EIB    Element interconnect bus
SPUFS  Synergistic processing unit file system
SIMT   Single instruction multiple thread
ILP    Instruction level parallelism
SMT    Symmetric multi-threading
JSP    Jackson structured programming
HPC    High performance computing
GCC    GNU C compiler
AGP    Advanced graphics port
GPU    Graphics processing unit
PCIe   Peripheral component interconnect express
CPU    Central processing unit
RAM    Random access memory

Chapter 1 Introduction

Heterogeneous computing is the coupling of dissimilar processor architectures, either within a single processor or as separate processors interconnected by an in-system bus. Heterogeneous computing is viewed as the next evolutionary step in the development of the CPU. Over the past five years, we have witnessed great advances in the availability of processing cycles on devices other than the core central processing unit. However, our reliance on a single fast processor core has been challenged, as the speed increases of these devices have slowed [1]. In order to combat this issue, most manufacturers have developed multi-core variants of their standard processors. Over the past ten years, graphics card manufacturers have been increasing the capability of their cards, mainly to keep pace with the requirements of users, who now demand ever more realistic game graphics and physics. Recently, we have seen libraries released which enable users to make use of the GPU as a massively powerful compute resource. Notably, at the time of writing, it is common to see quad-core chips in standard desktop machines, combined with a powerful compute-capable GPU. We also see heterogeneous chips, such as the Cell Broadband Engine, used in the world's fastest computer [2]. This chip, in contrast to the CPU-GPU relationship, combines heterogeneous components into a single processor, as opposed to discrete elements connected by an external system bus.

Developing applications for single-threaded execution can be difficult enough, but when also facing the challenge of multiple cores and heterogeneous compute elements, the difficulty increases. Multi-threading libraries support the instantiation of virtual threads, and the underlying operating system schedules these virtual threads onto the available homogeneous hardware. Operating systems are unable to schedule threads across heterogeneous compute components, which leaves the control of execution on these elements to the application that wants to make use of them. Application developers need to decide how best to apply the available computation to the hardware elements in the target system. Load-balancing concerns the assignment of computational tasks to the available elements in the machine. Failing to distribute load properly and fairly over the available elements will result in less than optimal runtime performance [3]. In a high-performance compute environment, it is essential that idle time is reduced as much as possible so as to maximise the return on the investment made in the data centre infrastructure. During the lifetime of a computational task, the quantity of work assigned to an individual subsystem can change, and an imbalance can emerge at any point during the computation. Design-time profiling is a common step in the development process. As an application is being developed, profiling is conducted in order to determine the sections of code that would benefit from additional fine-tuning. These sections usually consume a disproportionate amount of the whole application runtime, and it is usually these sections that we strive to split between the available resources; the complexity arises in deciding how to divide them. It is only at runtime that we can determine the existence and capability of heterogeneous components and the cost of computation instantiation. This leads to the requirement to profile the application at runtime.

Planning for the development or redeployment of application code to a heterogeneous high-performance environment is a challenging activity. Conventional multi-threaded development is supported by a number of methodologies and design techniques which aid the development and transformation process. The ability to structure the development activity and to support it through a design procedure is an important aspect of this project.

1.1 Motivation and Context of the Study

This project is concerned with the development of a runtime profiling and load-balancing algorithm to support the deployment of computation to the available computing resource. With the ever-increasing complexity and heterogeneous mix of components in both desktops and servers, the ability to correctly assign tasks to resources will be critical when striving to realise the full potential of these machines. It should be noted, however, that this project is concerned with floating-point applications as opposed to integer applications. High-performance applications typically spend most of their time executing a small number of instructions over a large data set; this data set, especially in physical simulations, is iterated over many times as the simulation evolves. Integer applications typically use a large number of instructions and smaller data sets; office applications typically fall under this application type. Successful algorithm development will require different heterogeneous components to target. As outlined in the previous section, GPU components are becoming increasingly common. We believe that targeting a heterogeneous system which comprises a GPU and a host CPU is an essential architecture mix to include in this project. These devices are coupled using a PCI Express system bus. The second architecture type should not be constrained by the connectivity of a system bus; architectures of this type are referred to as single-chip heterogeneous platforms. In order to target a single-chip heterogeneous platform, there are few affordable options available. At the time of writing, the only realistic option is to make use of the Cell Broadband Engine as found in the PlayStation 3 games console.
It may seem unusual to use a games console in the development of a complex, high-performance application; nevertheless, as shown by Buttari et al., the PlayStation 3 is more than capable for scientific research [4].

1.2 Project Aims and Methodology

The aim of this project is to design an algorithm capable of determining the best pattern of instantiation on the available computational resource for a given computational problem. The algorithm will then be implemented on different heterogeneous platforms to validate the effectiveness of the approach; this will be achieved through the transformation of an existing HPC application using a methodology for targeting heterogeneous architectures. The objectives of this project are summarised as follows:

- To understand the development process and profiling requirements of high-performance applications on heterogeneous architectures; and
- To obtain speed-up on each target architecture for the N-body simulation.

It is important that we have a deep understanding of the development process of heterogeneous applications, and that we can profile applications in real time in order to better match computation to the available resource. To prove that the suggested approach for runtime profiling is effective, we need to show speed-up over the single-threaded versions of the application. Accordingly, in order to achieve the project aims, the following key points will need to be addressed:

- Understanding the target heterogeneous architectures;
- Understanding the development process for heterogeneous systems;
- Identifying the methods which enable runtime profiling and load-balancing; and
- Researching the available tools and technologies to enable the implementation.

To realise the development of a heterogeneous application, a thorough literature review of heterogeneous development will be undertaken. Two target architectures will be used to validate the profiling algorithm: the first is the Cell Broadband Engine, developed by Sony, Toshiba and IBM; the second is the Nvidia 9800 GTX graphics card with an Intel Nehalem host processor. These devices have a total of four discrete computational units, each with its own toolset and architectural nuances which will need to be explored before the implementation phase. In order to successfully implement a profiling algorithm and profile the target application, the author will review the current literature and example case studies of multi-threaded and heterogeneous development; this will continue with a further general review of application development methodologies for both multi-threaded and heterogeneous development. In the same section, we will suggest an approach for development that will be used in the design and implementation. Following on from the literature review, we will detail the algorithm design, followed by details of the implementation. To test the load-balancing and profiling algorithm, a target application is required. Developing such an application is outside the scope, and indeed the time constraints, of this project. Nevertheless, a number of applications were considered after referring to the well-known list of computational dwarfs detailed in [5]. As a result, we decided to implement an N-body simulator, as it has sufficient complexity and can be easily adjusted to vary the computational problem size. A suitable open source project was found, titled N-Mod, hosted on Google Code and developed by Dr Carey Pridgeon; the application is provided as C source code. For the rest of the project, N-Mod will be referred to as the target application.
Once the target application has been modified to run on the target architectures, the speed-up results will be documented without the use of the load-balancing algorithm. The load-balancing algorithm will then be applied, and a comparison as to its effectiveness will be conducted.

1.3 Project Phases

Below is a summary of the main project phases:

1. Literature and background research.
2. Algorithm design.
3. Prototype algorithm implementation:
   a. Target platform installation and configuration.
   b. Development of the N-Body algorithm.
   c. Implementation of a flexible load-balancing system.
   d. Implementation of the load-balancing algorithm.
4. Testing and timing.
5. Dissertation writing.

1.4 Structure for the Dissertation

This dissertation contains eight chapters. The first chapter provides the introduction and details the project's background, the significance of the problem area and the motivation for the research. Chapter One also details the aims, a methodology to realise those aims, and the main phases of the project. Chapter Two investigates the architectural and development environments for the target heterogeneous architectures, as well as projects which have been successfully developed on these systems. Chapter Three researches the area of computational load-balancing, a component which is fundamental to the development of an algorithm that scales independently of architectural constraints. In Chapter Four, current profiling techniques are reviewed, including both static and runtime techniques. Chapter Five reviews development methodologies and design formalisation techniques for both single- and multi-threaded development; additionally, Chapter Five brings the research together to propose a design for heterogeneous profiling and a load-balancing algorithm. In Chapter Six, the implementation of the algorithm using the target application is detailed, highlighting how various challenges were overcome. Chapter Seven presents detailed testing of the converted algorithm on the target architectures, including a review of the speed-up obtained using various core and load combinations. The load-balancing algorithm is also tested to demonstrate that it can determine the best architecture for a given computational load. Chapter Eight provides the conclusions of the project and recommends future work concerning the implementation and design of the load-balancing algorithm, as well as the methodology followed in its development.

Chapter 2 Heterogeneous Architectures

2.1 Chapter Overview

Computer systems have traditionally been deployed using single-core homogeneous architectures. In order to meet growing user requirements, we have witnessed the development of systems which combine differing architecture types. However, the integration of different architectures to meet the demands of the user brings with it its own problems. For example, each architecture can differ drastically from the standard x86 design and requires specific compiler support [6]. One main difference is the way in which these architectures interact with main memory. This chapter will survey the literature relating to the target architectures that will be used in the application of the profiling algorithm.

2.2 Cell Broadband Engine

The Cell Broadband Engine is a single-chip heterogeneous system developed by a partnership of Sony, Toshiba and IBM [7]. It has a revolutionary design consisting of a single PPU (a modified PowerPC core) which controls 8 SPEs (6 of which are available in the Sony PlayStation 3). The PPU is a 64-bit dual simultaneous multi-threading processor, similar to the PowerPC 970, with 32 KB of level-one cache split between separate instruction and data caches, and 512 KB of level-two cache. The PPU has a SIMD engine based on the Altivec instruction set [8]. Although

powerful, the PPU is included as the controller of the SPEs and is not seen as a target for computational load. Usually, the PPE runs the operating system and controls the operations of the HPC application. Notably, the PPE is more capable of task-switching than the individual SPEs, which lack branch prediction logic [9]. Each SPE consists of an SPU, 256 KB of LS and an MFC. The SPU is a 128-bit vector engine which uses a different instruction set to that of the PPE. Moreover, the SPU has a 128 by 128-bit register bank, a large number of registers in comparison to a standard x86 processor; however, there is no on-chip cache. Furthermore, each SPU is not coherent with the main system memory, and can only access its own LS and local MFC. The MFC is a messaging and DMA controller; it communicates with other MFCs, the PPE and the main memory controller using the EIB. The EIB consists of four unidirectional buses which interconnect the PPE, SPEs, IO and memory controller. In order to meet the required communication demands, the EIB provides a peak bandwidth of 204.8 GB/s. With this in mind, Figure 2.1 shows the interconnections of all the major components of the CBE.

Figure 2.1: CBE block diagram

The PPE is capable of running application code which is compatible with the PowerPC architecture. Application code is executed from main memory, being transparently transferred to the L1 and L2 caches using a cache coherency protocol. SPU application code is located in main memory, and a pointer to it is provided to the SPU by the PPU. The SPU can then direct the MFC to transfer the application code into the LS and execute it directly from there. Once the SPU has finished with any variables in its LS, it can direct the MFC to transfer the data back into main memory. Messages can then be sent between the SPU and the PPU to indicate the completion of processing [10]. Figure 2.2 details the division of two computational tasks between the PPE and the SPEs. The PPU performs various initialisations before starting the task on the individual SPEs; the SPEs compute their assigned sub-sections of the task before informing the PPE; the PPE collects and combines the results before preparing and starting the next stage of the model evolution; and the PPE then continues evolving the model, performs further compute stages directing the SPEs, or terminates [9].

Figure 2.2: Division of computation CBE

SPEs and the PPE can communicate over the EIB by sending messages to each other's mailboxes. This messaging capability provides alternative kernel instantiation patterns. One common method is to have each SPE conduct a unique stage of processing, as shown in Figure 2.3; this essentially creates a computational chain, commonly found in video processing.

Figure 2.3: Stream processing CBE

There are additional benefits to using a chain of SPEs when the application code is too big to fit into a single SPE: we can reduce the swapping of application space to and from the LS by dividing the workload between the individual SPEs.

2.2.1 Development of the CBE

Only the Linux kernel can communicate directly with the individual SPEs, and so drivers have been developed in order to provide easy access to the developer.

The following functions are provided by the underlying driver [11]:

- Loading a program binary into an SPU;
- Transferring memory between an SPU program and a Linux user space application; and
- Synchronising execution.

The application running on the PPE makes use of LIBSPE2 (SPE Runtime Management Library V2) to control execution on the SPEs. LIBSPE2 in turn uses the SPUFS, which is provided by the driver in kernel space. The PPE application uses SPE contexts provided by LIBSPE2 to load the SPE application image, deploy the image to the SPE, trigger execution and finally destroy the SPE context [12]. The SPE contexts are key to controlling the execution of the binary image on the CBE. Each context needs to be run in an individual PPE thread spawned from the main PPE application. Due to the heterogeneous nature of the devices, the application machine code for the PPE and the SPE needs to be compiled separately. Sony provides a GCC toolchain which creates a binary executable for each architecture.

$ gcc -lspe2 helloworld_ppe.c -o helloworld_ppe.elf
$ spu-gcc helloworld_spe.c -o helloworld_spe.elf

Figure 2.4: CBE compile commands

In Figure 2.4, GCC compiles the PPE ELF executable, whereas the SPE ELF file is compiled using the spu-gcc command. To execute the application, the PPE ELF file is launched, which in turn loads the SPE ELF into memory and directs the individual SPEs to execute through each SPE's context. Each SPE requires its own context structure, defined as spe_context_ptr_t; this structure stores the information required to communicate with the SPE. The functions

listed in Table 2.1 are used to load the SPE ELF image and, using the context structure, to load the image onto the SPE, initiate execution, and finally destroy and clear the SPE image from memory.

spe_image_open()       Load the image from disk into main memory and return a pointer to the image.
spe_context_create()   Create a context, populating the supplied spe_context_ptr_t structure.
spe_program_load()     Load the supplied image pointer onto the SPE defined by its context.
spe_context_run()      Run the current image on the context provided.
spe_context_destroy()  Destroy the SPE context.
spe_image_close()      Remove the image from main memory.

Table 2.1: LIBSPE2 functions

The application image loaded onto an SPE must be supplied with the data required for processing. This data will be located in main memory or on a disk drive. It is the responsibility of each individual SPE to DMA the section of data it wishes to process from main memory to its LS. DMA operations are provided to the developer through intrinsic functions defined in the spu_intrinsics.h and spu_mfcio.h header files. Intrinsic functions wrap a number of machine language commands that are not available through the GCC compiler.

spu_mfcdma64()   Instructs the DMA controller in the SPE to transfer the specified number of bytes to or from main memory/LS. A tag is specified so that a number of DMA transfers can be in flight at once.
spu_writech()    A macro used to communicate between the SPU and the MFC.
spu_mfcstat()    Called to stall the application until a transfer is complete. A technique referred to as double buffering allows the application to process the data it already has instead of waiting for all data to be transferred.

Table 2.2: SPE DMA functions

Using the functions in Table 2.2 and a pointer to a data structure containing the required addresses, the SPE can instruct the DMA controller to access main memory and make the data required for processing available to the SPU [12]. Each SPE has its own LS and communicates with the system memory using DMA transfers.


More information

GPUs for Scientific Computing

GPUs for Scientific Computing GPUs for Scientific Computing p. 1/16 GPUs for Scientific Computing Mike Giles mike.giles@maths.ox.ac.uk Oxford-Man Institute of Quantitative Finance Oxford University Mathematical Institute Oxford e-research

More information

Introducing PgOpenCL A New PostgreSQL Procedural Language Unlocking the Power of the GPU! By Tim Child

Introducing PgOpenCL A New PostgreSQL Procedural Language Unlocking the Power of the GPU! By Tim Child Introducing A New PostgreSQL Procedural Language Unlocking the Power of the GPU! By Tim Child Bio Tim Child 35 years experience of software development Formerly VP Oracle Corporation VP BEA Systems Inc.

More information

Programming the Cell Multiprocessor: A Brief Introduction

Programming the Cell Multiprocessor: A Brief Introduction Programming the Cell Multiprocessor: A Brief Introduction David McCaughan, HPC Analyst SHARCNET, University of Guelph dbm@sharcnet.ca Overview Programming for the Cell is non-trivial many issues to be

More information

Control 2004, University of Bath, UK, September 2004

Control 2004, University of Bath, UK, September 2004 Control, University of Bath, UK, September ID- IMPACT OF DEPENDENCY AND LOAD BALANCING IN MULTITHREADING REAL-TIME CONTROL ALGORITHMS M A Hossain and M O Tokhi Department of Computing, The University of

More information

Parallel Programming Survey

Parallel Programming Survey Christian Terboven 02.09.2014 / Aachen, Germany Stand: 26.08.2014 Version 2.3 IT Center der RWTH Aachen University Agenda Overview: Processor Microarchitecture Shared-Memory

More information

evm Virtualization Platform for Windows

evm Virtualization Platform for Windows B A C K G R O U N D E R evm Virtualization Platform for Windows Host your Embedded OS and Windows on a Single Hardware Platform using Intel Virtualization Technology April, 2008 TenAsys Corporation 1400

More information

Chapter 1 Computer System Overview

Chapter 1 Computer System Overview Operating Systems: Internals and Design Principles Chapter 1 Computer System Overview Eighth Edition By William Stallings Operating System Exploits the hardware resources of one or more processors Provides

More information

Next Generation GPU Architecture Code-named Fermi

Next Generation GPU Architecture Code-named Fermi Next Generation GPU Architecture Code-named Fermi The Soul of a Supercomputer in the Body of a GPU Why is NVIDIA at Super Computing? Graphics is a throughput problem paint every pixel within frame time

More information

Lecture 11: Multi-Core and GPU. Multithreading. Integration of multiple processor cores on a single chip.

Lecture 11: Multi-Core and GPU. Multithreading. Integration of multiple processor cores on a single chip. Lecture 11: Multi-Core and GPU Multi-core computers Multithreading GPUs General Purpose GPUs Zebo Peng, IDA, LiTH 1 Multi-Core System Integration of multiple processor cores on a single chip. To provide

More information

This Unit: Putting It All Together. CIS 501 Computer Architecture. Sources. What is Computer Architecture?

This Unit: Putting It All Together. CIS 501 Computer Architecture. Sources. What is Computer Architecture? This Unit: Putting It All Together CIS 501 Computer Architecture Unit 11: Putting It All Together: Anatomy of the XBox 360 Game Console Slides originally developed by Amir Roth with contributions by Milo

More information

CUDA programming on NVIDIA GPUs

CUDA programming on NVIDIA GPUs p. 1/21 on NVIDIA GPUs Mike Giles mike.giles@maths.ox.ac.uk Oxford University Mathematical Institute Oxford-Man Institute for Quantitative Finance Oxford eresearch Centre p. 2/21 Overview hardware view

More information

An Implementation Of Multiprocessor Linux

An Implementation Of Multiprocessor Linux An Implementation Of Multiprocessor Linux This document describes the implementation of a simple SMP Linux kernel extension and how to use this to develop SMP Linux kernels for architectures other than

More information

Optimizing Code for Accelerators: The Long Road to High Performance

Optimizing Code for Accelerators: The Long Road to High Performance Optimizing Code for Accelerators: The Long Road to High Performance Hans Vandierendonck Mons GPU Day November 9 th, 2010 The Age of Accelerators 2 Accelerators in Real Life 3 Latency (ps/inst) Why Accelerators?

More information

Implementation of Canny Edge Detector of color images on CELL/B.E. Architecture.

Implementation of Canny Edge Detector of color images on CELL/B.E. Architecture. Implementation of Canny Edge Detector of color images on CELL/B.E. Architecture. Chirag Gupta,Sumod Mohan K cgupta@clemson.edu, sumodm@clemson.edu Abstract In this project we propose a method to improve

More information

Computer Graphics Hardware An Overview

Computer Graphics Hardware An Overview Computer Graphics Hardware An Overview Graphics System Monitor Input devices CPU/Memory GPU Raster Graphics System Raster: An array of picture elements Based on raster-scan TV technology The screen (and

More information

Hardware-Aware Analysis and. Presentation Date: Sep 15 th 2009 Chrissie C. Cui

Hardware-Aware Analysis and. Presentation Date: Sep 15 th 2009 Chrissie C. Cui Hardware-Aware Analysis and Optimization of Stable Fluids Presentation Date: Sep 15 th 2009 Chrissie C. Cui Outline Introduction Highlights Flop and Bandwidth Analysis Mehrstellen Schemes Advection Caching

More information

GPU File System Encryption Kartik Kulkarni and Eugene Linkov

GPU File System Encryption Kartik Kulkarni and Eugene Linkov GPU File System Encryption Kartik Kulkarni and Eugene Linkov 5/10/2012 SUMMARY. We implemented a file system that encrypts and decrypts files. The implementation uses the AES algorithm computed through

More information

22S:295 Seminar in Applied Statistics High Performance Computing in Statistics

22S:295 Seminar in Applied Statistics High Performance Computing in Statistics 22S:295 Seminar in Applied Statistics High Performance Computing in Statistics Luke Tierney Department of Statistics & Actuarial Science University of Iowa August 30, 2007 Luke Tierney (U. of Iowa) HPC

More information

Stream Processing on GPUs Using Distributed Multimedia Middleware

Stream Processing on GPUs Using Distributed Multimedia Middleware Stream Processing on GPUs Using Distributed Multimedia Middleware Michael Repplinger 1,2, and Philipp Slusallek 1,2 1 Computer Graphics Lab, Saarland University, Saarbrücken, Germany 2 German Research

More information

Binary search tree with SIMD bandwidth optimization using SSE

Binary search tree with SIMD bandwidth optimization using SSE Binary search tree with SIMD bandwidth optimization using SSE Bowen Zhang, Xinwei Li 1.ABSTRACT In-memory tree structured index search is a fundamental database operation. Modern processors provide tremendous

More information

High-Performance Modular Multiplication on the Cell Processor

High-Performance Modular Multiplication on the Cell Processor High-Performance Modular Multiplication on the Cell Processor Joppe W. Bos Laboratory for Cryptologic Algorithms EPFL, Lausanne, Switzerland joppe.bos@epfl.ch 1 / 19 Outline Motivation and previous work

More information

Delivering Quality in Software Performance and Scalability Testing

Delivering Quality in Software Performance and Scalability Testing Delivering Quality in Software Performance and Scalability Testing Abstract Khun Ban, Robert Scott, Kingsum Chow, and Huijun Yan Software and Services Group, Intel Corporation {khun.ban, robert.l.scott,

More information

~ Greetings from WSU CAPPLab ~

~ Greetings from WSU CAPPLab ~ ~ Greetings from WSU CAPPLab ~ Multicore with SMT/GPGPU provides the ultimate performance; at WSU CAPPLab, we can help! Dr. Abu Asaduzzaman, Assistant Professor and Director Wichita State University (WSU)

More information

Lecture 3: Modern GPUs A Hardware Perspective Mohamed Zahran (aka Z) mzahran@cs.nyu.edu http://www.mzahran.com

Lecture 3: Modern GPUs A Hardware Perspective Mohamed Zahran (aka Z) mzahran@cs.nyu.edu http://www.mzahran.com CSCI-GA.3033-012 Graphics Processing Units (GPUs): Architecture and Programming Lecture 3: Modern GPUs A Hardware Perspective Mohamed Zahran (aka Z) mzahran@cs.nyu.edu http://www.mzahran.com Modern GPU

More information

NVIDIA GeForce GTX 580 GPU Datasheet

NVIDIA GeForce GTX 580 GPU Datasheet NVIDIA GeForce GTX 580 GPU Datasheet NVIDIA GeForce GTX 580 GPU Datasheet 3D Graphics Full Microsoft DirectX 11 Shader Model 5.0 support: o NVIDIA PolyMorph Engine with distributed HW tessellation engines

More information

OpenPOWER Outlook AXEL KOEHLER SR. SOLUTION ARCHITECT HPC

OpenPOWER Outlook AXEL KOEHLER SR. SOLUTION ARCHITECT HPC OpenPOWER Outlook AXEL KOEHLER SR. SOLUTION ARCHITECT HPC Driving industry innovation The goal of the OpenPOWER Foundation is to create an open ecosystem, using the POWER Architecture to share expertise,

More information

Cluster Computing at HRI

Cluster Computing at HRI Cluster Computing at HRI J.S.Bagla Harish-Chandra Research Institute, Chhatnag Road, Jhunsi, Allahabad 211019. E-mail: jasjeet@mri.ernet.in 1 Introduction and some local history High performance computing

More information

Performance Characteristics of VMFS and RDM VMware ESX Server 3.0.1

Performance Characteristics of VMFS and RDM VMware ESX Server 3.0.1 Performance Study Performance Characteristics of and RDM VMware ESX Server 3.0.1 VMware ESX Server offers three choices for managing disk access in a virtual machine VMware Virtual Machine File System

More information

Multi-core architectures. Jernej Barbic 15-213, Spring 2007 May 3, 2007

Multi-core architectures. Jernej Barbic 15-213, Spring 2007 May 3, 2007 Multi-core architectures Jernej Barbic 15-213, Spring 2007 May 3, 2007 1 Single-core computer 2 Single-core CPU chip the single core 3 Multi-core architectures This lecture is about a new trend in computer

More information

Load Balancing on a Non-dedicated Heterogeneous Network of Workstations

Load Balancing on a Non-dedicated Heterogeneous Network of Workstations Load Balancing on a Non-dedicated Heterogeneous Network of Workstations Dr. Maurice Eggen Nathan Franklin Department of Computer Science Trinity University San Antonio, Texas 78212 Dr. Roger Eggen Department

More information

Graphics Cards and Graphics Processing Units. Ben Johnstone Russ Martin November 15, 2011

Graphics Cards and Graphics Processing Units. Ben Johnstone Russ Martin November 15, 2011 Graphics Cards and Graphics Processing Units Ben Johnstone Russ Martin November 15, 2011 Contents Graphics Processing Units (GPUs) Graphics Pipeline Architectures 8800-GTX200 Fermi Cayman Performance Analysis

More information

NVIDIA CUDA Software and GPU Parallel Computing Architecture. David B. Kirk, Chief Scientist

NVIDIA CUDA Software and GPU Parallel Computing Architecture. David B. Kirk, Chief Scientist NVIDIA CUDA Software and GPU Parallel Computing Architecture David B. Kirk, Chief Scientist Outline Applications of GPU Computing CUDA Programming Model Overview Programming in CUDA The Basics How to Get

More information

Reconfigurable Architecture Requirements for Co-Designed Virtual Machines

Reconfigurable Architecture Requirements for Co-Designed Virtual Machines Reconfigurable Architecture Requirements for Co-Designed Virtual Machines Kenneth B. Kent University of New Brunswick Faculty of Computer Science Fredericton, New Brunswick, Canada ken@unb.ca Micaela Serra

More information

Real-time processing the basis for PC Control

Real-time processing the basis for PC Control Beckhoff real-time kernels for DOS, Windows, Embedded OS and multi-core CPUs Real-time processing the basis for PC Control Beckhoff employs Microsoft operating systems for its PCbased control technology.

More information

Evaluation of CUDA Fortran for the CFD code Strukti

Evaluation of CUDA Fortran for the CFD code Strukti Evaluation of CUDA Fortran for the CFD code Strukti Practical term report from Stephan Soller High performance computing center Stuttgart 1 Stuttgart Media University 2 High performance computing center

More information

GeoImaging Accelerator Pansharp Test Results

GeoImaging Accelerator Pansharp Test Results GeoImaging Accelerator Pansharp Test Results Executive Summary After demonstrating the exceptional performance improvement in the orthorectification module (approximately fourteen-fold see GXL Ortho Performance

More information

Using Synology SSD Technology to Enhance System Performance Synology Inc.

Using Synology SSD Technology to Enhance System Performance Synology Inc. Using Synology SSD Technology to Enhance System Performance Synology Inc. Synology_SSD_Cache_WP_ 20140512 Table of Contents Chapter 1: Enterprise Challenges and SSD Cache as Solution Enterprise Challenges...

More information

Parallel Algorithm Engineering

Parallel Algorithm Engineering Parallel Algorithm Engineering Kenneth S. Bøgh PhD Fellow Based on slides by Darius Sidlauskas Outline Background Current multicore architectures UMA vs NUMA The openmp framework Examples Software crisis

More information

High Performance Computing in CST STUDIO SUITE

High Performance Computing in CST STUDIO SUITE High Performance Computing in CST STUDIO SUITE Felix Wolfheimer GPU Computing Performance Speedup 18 16 14 12 10 8 6 4 2 0 Promo offer for EUC participants: 25% discount for K40 cards Speedup of Solver

More information

LBM BASED FLOW SIMULATION USING GPU COMPUTING PROCESSOR

LBM BASED FLOW SIMULATION USING GPU COMPUTING PROCESSOR LBM BASED FLOW SIMULATION USING GPU COMPUTING PROCESSOR Frédéric Kuznik, frederic.kuznik@insa lyon.fr 1 Framework Introduction Hardware architecture CUDA overview Implementation details A simple case:

More information

Chapter 4 System Unit Components. Discovering Computers 2012. Your Interactive Guide to the Digital World

Chapter 4 System Unit Components. Discovering Computers 2012. Your Interactive Guide to the Digital World Chapter 4 System Unit Components Discovering Computers 2012 Your Interactive Guide to the Digital World Objectives Overview Differentiate among various styles of system units on desktop computers, notebook

More information

Cell-SWat: Modeling and Scheduling Wavefront Computations on the Cell Broadband Engine

Cell-SWat: Modeling and Scheduling Wavefront Computations on the Cell Broadband Engine Cell-SWat: Modeling and Scheduling Wavefront Computations on the Cell Broadband Engine Ashwin Aji, Wu Feng, Filip Blagojevic and Dimitris Nikolopoulos Forecast Efficient mapping of wavefront algorithms

More information

Oracle Database Scalability in VMware ESX VMware ESX 3.5

Oracle Database Scalability in VMware ESX VMware ESX 3.5 Performance Study Oracle Database Scalability in VMware ESX VMware ESX 3.5 Database applications running on individual physical servers represent a large consolidation opportunity. However enterprises

More information

PARALLEL & CLUSTER COMPUTING CS 6260 PROFESSOR: ELISE DE DONCKER BY: LINA HUSSEIN

PARALLEL & CLUSTER COMPUTING CS 6260 PROFESSOR: ELISE DE DONCKER BY: LINA HUSSEIN 1 PARALLEL & CLUSTER COMPUTING CS 6260 PROFESSOR: ELISE DE DONCKER BY: LINA HUSSEIN Introduction What is cluster computing? Classification of Cluster Computing Technologies: Beowulf cluster Construction

More information

Why Computers Are Getting Slower (and what we can do about it) Rik van Riel Sr. Software Engineer, Red Hat

Why Computers Are Getting Slower (and what we can do about it) Rik van Riel Sr. Software Engineer, Red Hat Why Computers Are Getting Slower (and what we can do about it) Rik van Riel Sr. Software Engineer, Red Hat Why Computers Are Getting Slower The traditional approach better performance Why computers are

More information

High Performance Computing. Course Notes 2007-2008. HPC Fundamentals

High Performance Computing. Course Notes 2007-2008. HPC Fundamentals High Performance Computing Course Notes 2007-2008 2008 HPC Fundamentals Introduction What is High Performance Computing (HPC)? Difficult to define - it s a moving target. Later 1980s, a supercomputer performs

More information

Enabling Technologies for Distributed and Cloud Computing

Enabling Technologies for Distributed and Cloud Computing Enabling Technologies for Distributed and Cloud Computing Dr. Sanjay P. Ahuja, Ph.D. 2010-14 FIS Distinguished Professor of Computer Science School of Computing, UNF Multi-core CPUs and Multithreading

More information

Performance Monitoring of the Software Frameworks for LHC Experiments

Performance Monitoring of the Software Frameworks for LHC Experiments Proceedings of the First EELA-2 Conference R. mayo et al. (Eds.) CIEMAT 2009 2009 The authors. All rights reserved Performance Monitoring of the Software Frameworks for LHC Experiments William A. Romero

More information

Symmetric Multiprocessing

Symmetric Multiprocessing Multicore Computing A multi-core processor is a processing system composed of two or more independent cores. One can describe it as an integrated circuit to which two or more individual processors (called

More information

CHAPTER 1 INTRODUCTION

CHAPTER 1 INTRODUCTION 1 CHAPTER 1 INTRODUCTION 1.1 MOTIVATION OF RESEARCH Multicore processors have two or more execution cores (processors) implemented on a single chip having their own set of execution and architectural recourses.

More information

The Bus (PCI and PCI-Express)

The Bus (PCI and PCI-Express) 4 Jan, 2008 The Bus (PCI and PCI-Express) The CPU, memory, disks, and all the other devices in a computer have to be able to communicate and exchange data. The technology that connects them is called the

More information

Parallel Processing and Software Performance. Lukáš Marek

Parallel Processing and Software Performance. Lukáš Marek Parallel Processing and Software Performance Lukáš Marek DISTRIBUTED SYSTEMS RESEARCH GROUP http://dsrg.mff.cuni.cz CHARLES UNIVERSITY PRAGUE Faculty of Mathematics and Physics Benchmarking in parallel

More information

Measuring Cache and Memory Latency and CPU to Memory Bandwidth

Measuring Cache and Memory Latency and CPU to Memory Bandwidth White Paper Joshua Ruggiero Computer Systems Engineer Intel Corporation Measuring Cache and Memory Latency and CPU to Memory Bandwidth For use with Intel Architecture December 2008 1 321074 Executive Summary

More information

Maximize Performance and Scalability of RADIOSS* Structural Analysis Software on Intel Xeon Processor E7 v2 Family-Based Platforms

Maximize Performance and Scalability of RADIOSS* Structural Analysis Software on Intel Xeon Processor E7 v2 Family-Based Platforms Maximize Performance and Scalability of RADIOSS* Structural Analysis Software on Family-Based Platforms Executive Summary Complex simulations of structural and systems performance, such as car crash simulations,

More information

Seeking Opportunities for Hardware Acceleration in Big Data Analytics

Seeking Opportunities for Hardware Acceleration in Big Data Analytics Seeking Opportunities for Hardware Acceleration in Big Data Analytics Paul Chow High-Performance Reconfigurable Computing Group Department of Electrical and Computer Engineering University of Toronto Who

More information

Infrastructure Matters: POWER8 vs. Xeon x86

Infrastructure Matters: POWER8 vs. Xeon x86 Advisory Infrastructure Matters: POWER8 vs. Xeon x86 Executive Summary This report compares IBM s new POWER8-based scale-out Power System to Intel E5 v2 x86- based scale-out systems. A follow-on report

More information

Understanding the Performance of an X550 11-User Environment

Understanding the Performance of an X550 11-User Environment Understanding the Performance of an X550 11-User Environment Overview NComputing's desktop virtualization technology enables significantly lower computing costs by letting multiple users share a single

More information

Performance Evaluation of NAS Parallel Benchmarks on Intel Xeon Phi

Performance Evaluation of NAS Parallel Benchmarks on Intel Xeon Phi Performance Evaluation of NAS Parallel Benchmarks on Intel Xeon Phi ICPP 6 th International Workshop on Parallel Programming Models and Systems Software for High-End Computing October 1, 2013 Lyon, France

More information

Choosing a Computer for Running SLX, P3D, and P5

Choosing a Computer for Running SLX, P3D, and P5 Choosing a Computer for Running SLX, P3D, and P5 This paper is based on my experience purchasing a new laptop in January, 2010. I ll lead you through my selection criteria and point you to some on-line

More information

Radeon GPU Architecture and the Radeon 4800 series. Michael Doggett Graphics Architecture Group June 27, 2008

Radeon GPU Architecture and the Radeon 4800 series. Michael Doggett Graphics Architecture Group June 27, 2008 Radeon GPU Architecture and the series Michael Doggett Graphics Architecture Group June 27, 2008 Graphics Processing Units Introduction GPU research 2 GPU Evolution GPU started as a triangle rasterizer

More information

Data Centric Systems (DCS)

Data Centric Systems (DCS) Data Centric Systems (DCS) Architecture and Solutions for High Performance Computing, Big Data and High Performance Analytics High Performance Computing with Data Centric Systems 1 Data Centric Systems

More information

Chapter 11 I/O Management and Disk Scheduling

Chapter 11 I/O Management and Disk Scheduling Operating Systems: Internals and Design Principles, 6/E William Stallings Chapter 11 I/O Management and Disk Scheduling Dave Bremer Otago Polytechnic, NZ 2008, Prentice Hall I/O Devices Roadmap Organization

More information

A Review of Customized Dynamic Load Balancing for a Network of Workstations

A Review of Customized Dynamic Load Balancing for a Network of Workstations A Review of Customized Dynamic Load Balancing for a Network of Workstations Taken from work done by: Mohammed Javeed Zaki, Wei Li, Srinivasan Parthasarathy Computer Science Department, University of Rochester

More information

Discovering Computers 2011. Living in a Digital World

Discovering Computers 2011. Living in a Digital World Discovering Computers 2011 Living in a Digital World Objectives Overview Differentiate among various styles of system units on desktop computers, notebook computers, and mobile devices Identify chips,

More information

HP Z Turbo Drive PCIe SSD

HP Z Turbo Drive PCIe SSD Performance Evaluation of HP Z Turbo Drive PCIe SSD Powered by Samsung XP941 technology Evaluation Conducted Independently by: Hamid Taghavi Senior Technical Consultant June 2014 Sponsored by: P a g e

More information

The Truth Behind IBM AIX LPAR Performance

The Truth Behind IBM AIX LPAR Performance The Truth Behind IBM AIX LPAR Performance Yann Guernion, VP Technology EMEA HEADQUARTERS AMERICAS HEADQUARTERS Tour Franklin 92042 Paris La Défense Cedex France +33 [0] 1 47 73 12 12 info@orsyp.com www.orsyp.com

More information

Removing Performance Bottlenecks in Databases with Red Hat Enterprise Linux and Violin Memory Flash Storage Arrays. Red Hat Performance Engineering

Removing Performance Bottlenecks in Databases with Red Hat Enterprise Linux and Violin Memory Flash Storage Arrays. Red Hat Performance Engineering Removing Performance Bottlenecks in Databases with Red Hat Enterprise Linux and Violin Memory Flash Storage Arrays Red Hat Performance Engineering Version 1.0 August 2013 1801 Varsity Drive Raleigh NC

More information

GPU Architectures. A CPU Perspective. Data Parallelism: What is it, and how to exploit it? Workload characteristics

GPU Architectures. A CPU Perspective. Data Parallelism: What is it, and how to exploit it? Workload characteristics GPU Architectures A CPU Perspective Derek Hower AMD Research 5/21/2013 Goals Data Parallelism: What is it, and how to exploit it? Workload characteristics Execution Models / GPU Architectures MIMD (SPMD),

More information

GPU Hardware and Programming Models. Jeremy Appleyard, September 2015

GPU Hardware and Programming Models. Jeremy Appleyard, September 2015 GPU Hardware and Programming Models Jeremy Appleyard, September 2015 A brief history of GPUs In this talk Hardware Overview Programming Models Ask questions at any point! 2 A Brief History of GPUs 3 Once

More information

Parallel Computing. Benson Muite. benson.muite@ut.ee http://math.ut.ee/ benson. https://courses.cs.ut.ee/2014/paralleel/fall/main/homepage

Parallel Computing. Benson Muite. benson.muite@ut.ee http://math.ut.ee/ benson. https://courses.cs.ut.ee/2014/paralleel/fall/main/homepage Parallel Computing Benson Muite benson.muite@ut.ee http://math.ut.ee/ benson https://courses.cs.ut.ee/2014/paralleel/fall/main/homepage 3 November 2014 Hadoop, Review Hadoop Hadoop History Hadoop Framework

More information

High Performance. CAEA elearning Series. Jonathan G. Dudley, Ph.D. 06/09/2015. 2015 CAE Associates

High Performance. CAEA elearning Series. Jonathan G. Dudley, Ph.D. 06/09/2015. 2015 CAE Associates High Performance Computing (HPC) CAEA elearning Series Jonathan G. Dudley, Ph.D. 06/09/2015 2015 CAE Associates Agenda Introduction HPC Background Why HPC SMP vs. DMP Licensing HPC Terminology Types of

More information

Introduction to GPGPU. Tiziano Diamanti t.diamanti@cineca.it

Introduction to GPGPU. Tiziano Diamanti t.diamanti@cineca.it t.diamanti@cineca.it Agenda From GPUs to GPGPUs GPGPU architecture CUDA programming model Perspective projection Vectors that connect the vanishing point to every point of the 3D model will intersecate

More information

September 25, 2007. Maya Gokhale Georgia Institute of Technology

September 25, 2007. Maya Gokhale Georgia Institute of Technology NAND Flash Storage for High Performance Computing Craig Ulmer cdulmer@sandia.gov September 25, 2007 Craig Ulmer Maya Gokhale Greg Diamos Michael Rewak SNL/CA, LLNL Georgia Institute of Technology University

More information

E6895 Advanced Big Data Analytics Lecture 14:! NVIDIA GPU Examples and GPU on ios devices

E6895 Advanced Big Data Analytics Lecture 14:! NVIDIA GPU Examples and GPU on ios devices E6895 Advanced Big Data Analytics Lecture 14: NVIDIA GPU Examples and GPU on ios devices Ching-Yung Lin, Ph.D. Adjunct Professor, Dept. of Electrical Engineering and Computer Science IBM Chief Scientist,

More information

Lattice QCD Performance. on Multi core Linux Servers

Lattice QCD Performance. on Multi core Linux Servers Lattice QCD Performance on Multi core Linux Servers Yang Suli * Department of Physics, Peking University, Beijing, 100871 Abstract At the moment, lattice quantum chromodynamics (lattice QCD) is the most

More information

BEAGLEBONE BLACK ARCHITECTURE MADELEINE DAIGNEAU MICHELLE ADVENA

BEAGLEBONE BLACK ARCHITECTURE MADELEINE DAIGNEAU MICHELLE ADVENA BEAGLEBONE BLACK ARCHITECTURE MADELEINE DAIGNEAU MICHELLE ADVENA AGENDA INTRO TO BEAGLEBONE BLACK HARDWARE & SPECS CORTEX-A8 ARMV7 PROCESSOR PROS & CONS VS RASPBERRY PI WHEN TO USE BEAGLEBONE BLACK Single

More information

How System Settings Impact PCIe SSD Performance

How System Settings Impact PCIe SSD Performance How System Settings Impact PCIe SSD Performance Suzanne Ferreira R&D Engineer Micron Technology, Inc. July, 2012 As solid state drives (SSDs) continue to gain ground in the enterprise server and storage

More information

A Powerful solution for next generation Pcs

A Powerful solution for next generation Pcs Product Brief 6th Generation Intel Core Desktop Processors i7-6700k and i5-6600k 6th Generation Intel Core Desktop Processors i7-6700k and i5-6600k A Powerful solution for next generation Pcs Looking for

More information

Benchmarking Large Scale Cloud Computing in Asia Pacific

Benchmarking Large Scale Cloud Computing in Asia Pacific 2013 19th IEEE International Conference on Parallel and Distributed Systems ing Large Scale Cloud Computing in Asia Pacific Amalina Mohamad Sabri 1, Suresh Reuben Balakrishnan 1, Sun Veer Moolye 1, Chung

More information

QCD as a Video Game?
