Automated Software Testing of Memory Performance in Embedded GPUs. Sudipta Chattopadhyay, Petru Eles and Zebo Peng, Linköping University

1 Automated Software Testing of Memory Performance in Embedded GPUs. Sudipta Chattopadhyay, Petru Eles and Zebo Peng, Linköping University

2 State-of-the-art in Detecting Performance Loss. Program profiling: input, program, profiler, hotspots

3 State-of-the-art in Detecting Performance Loss. Program profiling: input, program, profiler, hotspots. Program inputs that expose performance loss. Detecting performance loss

4 Overall Context: programming abstractions (CUDA, OpenMPI); high-performance embedded platforms (GPGPUs, multi-cores)

5 Overall Context: write efficient software; programming abstractions (CUDA, OpenMPI); tools and techniques; high-performance embedded platforms (GPGPUs, multi-cores)

6 Overview: tools and techniques for efficient software (performance testing, performance debugging, refactoring); high-performance embedded platforms

7 Overview: tools and techniques for efficient software (performance testing, performance debugging, refactoring); high-performance embedded platforms; embedded GPUs

8 Embedded GPUs (figure): SIMD cores, streaming multiprocessors, caches, interconnect, DRAM

9 So, what is the problem? Automatically generate test scenarios. Expose performance bottlenecks. What is a performance bottleneck? What is a test scenario? Generation of test scenarios

10 Performance Bottleneck: a longer delay does not necessarily mean a bottleneck (heavy but unavoidable computation)

11 Embedded GPUs (figure): SIMD cores, streaming multiprocessors, caches, interconnect, DRAM. Interferences in cache and memory

12 Performance bottleneck, embedded GPUs: the memory subsystem is several orders of magnitude slower than on-chip registers/memories. Thread group 1: computation, memory, computation, memory. Thread group 2: memory, computation, memory, computation

13 Performance bottleneck, embedded GPUs: the memory subsystem is several orders of magnitude slower than on-chip registers/memories. Thread group 1: computation, memory, computation, memory. Thread group 2: computation, wait, memory, computation. Bottleneck due to DRAM bank conflict

14 Performance bottleneck, embedded GPUs: the memory subsystem is several orders of magnitude slower than on-chip registers/memories. Thread group 1: computation, memory, computation, memory. Thread group 2: cache conflict, memory, computation, memory, computation. Bottleneck due to cache conflict

15 State-of-the-art in Detecting Performance Loss. Program profiling: input, program, profiler, execution state. Program inputs that expose performance loss. Detecting performance loss

16 Test Scenarios, input selection (figure: Thread 1, Thread 2 and Thread 3 with branch conditions y1!=1, y2!=1, y3!=1, y/n outcomes, and two DRAM accesses). Random testing: DRAM contention probability = 1/2^64. Symbolic (path-based): DRAM contention probability = 1/4

17 Test Scenarios, schedule selection (figure: Thread 1, Thread 2 and Thread 3 with branch conditions y1!=1, y2!=1, y3!=1, y/n outcomes, and two DRAM accesses). Random selection: DRAM contention probability = 1/2^n, where n = number of schedule points
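The slides give only the final probabilities. A minimal sketch of arithmetic that reproduces them, under assumptions that are not stated in the slides: each thread input is a uniformly random 32-bit value, exactly one input value per thread leads to the DRAM access, two threads must reach DRAM together for contention, and every schedule point is a binary choice with a single contending schedule. All names and constants below are illustrative.

```python
from fractions import Fraction

# Assumptions for illustration only (not taken from the slides).
INPUT_BITS = 32          # assumed width of each thread's input
CONTENDING_THREADS = 2   # assumed number of threads that must reach DRAM together
PATHS_PER_THREAD = 2     # one branch per thread, hence two paths


def random_input_probability(bits=INPUT_BITS, threads=CONTENDING_THREADS):
    """Chance that random inputs put every contending thread on the one
    input value that reaches its DRAM access."""
    return Fraction(1, 2 ** bits) ** threads


def path_based_probability(paths=PATHS_PER_THREAD, threads=CONTENDING_THREADS):
    """Chance when one input is picked per program path instead of per value."""
    return Fraction(1, paths) ** threads


def random_schedule_probability(schedule_points):
    """Chance that a random binary choice at each schedule point yields the
    single contending schedule."""
    return Fraction(1, 2 ** schedule_points)


if __name__ == "__main__":
    print(random_input_probability())       # 1/18446744073709551616, i.e. 1/2^64
    print(path_based_probability())         # 1/4
    print(random_schedule_probability(10))  # 1/1024
```

Under these assumptions, choosing inputs per path rather than per value already raises the chance of exposing contention from 1/2^64 to 1/4, while the schedule dimension still shrinks exponentially with the number of schedule points.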

18 Test Scenarios: input selection; thread schedule selection; potentially unbounded combinations

19 Test Generation Approach. Two-step approach: static analysis to summarize the memory footprint of individual threads; directed test generation using the summary

20 Test Generation Approach: Thread 1, Thread 2, Thread n; Static Analyzer; Summary 1, Summary 2, Summary n - Cache hits or cache misses (Ferdinand et al., 2000) - The reuse of cache content - Memory accesses for an uninterrupted execution

21 Test Generation Approach: concrete execution + symbolic state of all threads (captures the set of paths taken by each thread)

22 Test Generation Approach. Diversion of path: more DRAM conflicts (from cache miss information); more cache conflicts (from memory access information); purely static. Concrete execution + symbolic state of all threads (captures the set of paths taken by each thread). Divert to a different set of paths

23 Test Generation Approach: x < 2, y < 2, z > 3, DRAM. Control Dependency Graph (reaching path to potential DRAM access)

24 Test Generation Approach: x < 2, y < 2, z > 3, DRAM. Control Dependency Graph. Generate inputs satisfying (x >= 2 /\ z <= 3)
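A minimal sketch of how a constraint such as (x >= 2 /\ z <= 3) can be derived from a path in the control dependency graph and turned into a concrete input. The graph shape is an assumption: the DRAM access is taken to be reached when both x < 2 and z > 3 evaluate to false, which matches the constraint on the slide. The brute-force search over a small domain is a stand-in for a real constraint solver; the talk's implementation uses GKLEE and STP for this step. All names are hypothetical.

```python
from itertools import product

# Hypothetical control dependency path to the DRAM access. Each entry is
# (label, predicate, outcome the predicate must have to stay on the path).
PATH_TO_DRAM = [
    ("x < 2", lambda v: v["x"] < 2, False),
    ("z > 3", lambda v: v["z"] > 3, False),
]


def reaches_dram(values, path=PATH_TO_DRAM):
    """True if every branch on the path takes the required outcome."""
    return all(pred(values) == wanted for _, pred, wanted in path)


def find_input(variables=("x", "y", "z"), domain=range(0, 5), path=PATH_TO_DRAM):
    """Brute-force stand-in for a solver: return the first assignment in
    the domain that satisfies the path constraint, or None."""
    for candidate in product(domain, repeat=len(variables)):
        values = dict(zip(variables, candidate))
        if reaches_dram(values, path):
            return values
    return None


if __name__ == "__main__":
    print(find_input())  # {'x': 2, 'y': 0, 'z': 0}
```

The variable y is left unconstrained because, in the assumed graph, the branch on y does not lie on the path to the DRAM access.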

25 Test Generation Approach, on-the-fly: thread selection point; memory access information (static); DRAM state (dynamic); DRAM bank conflicts

26 Test Generation Approach (figure): thread selection point; DRAM bank queue m3, m2, m1; Bank 0, Bank 1; accesses ma, mb, mx, my, mz with their bank mapping; dynamic DRAM state; DRAM accesses/cache misses (from summary)

27 Test Generation Approach (figure): thread selection point; DRAM bank queue m3, m2, m1; Bank 0, Bank 1; schedule choice; accesses ma, mb, mx, my, mz with their bank mapping; dynamic DRAM state; DRAM accesses/cache misses (from summary)
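A minimal sketch of the schedule choice for DRAM bank conflicts: at a thread selection point, prefer the thread whose upcoming DRAM accesses (taken from its static summary) map to the banks with the most requests already queued. The bank mapping, the scoring rule and all names are assumptions made for illustration; the slides do not specify them.

```python
# Assumed DRAM geometry, for illustration only.
NUM_BANKS = 2
ROW_BYTES = 64


def bank_of(address, num_banks=NUM_BANKS, row_bytes=ROW_BYTES):
    """Hypothetical bank mapping: consecutive rows rotate across banks."""
    return (address // row_bytes) % num_banks


def expected_conflicts(pending_accesses, bank_queues):
    """Score a thread by how many queued requests its pending DRAM
    accesses would have to wait behind."""
    return sum(len(bank_queues.get(bank_of(a), [])) for a in pending_accesses)


def choose_thread(summaries, bank_queues):
    """Pick the thread with the highest conflict score. Ties go to the
    thread whose name sorts first."""
    return max(sorted(summaries),
               key=lambda t: expected_conflicts(summaries[t], bank_queues))


if __name__ == "__main__":
    # Dynamic DRAM state: three requests queued in bank 0, none in bank 1.
    bank_queues = {0: ["m1", "m2", "m3"], 1: []}
    # Static summaries: addresses each thread would miss on and send to DRAM.
    summaries = {"thread_a": [0, 128], "thread_b": [64, 192, 320]}
    print(choose_thread(summaries, bank_queues))  # thread_a
```

Here thread_a is selected because both of its pending accesses fall into the busy bank 0, whereas every access of thread_b maps to the idle bank 1.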

28 Test Generation Approach, on-the-fly: thread selection point; cache reuse information (static); cache state (dynamic); cache conflicts

29 Test Generation Approach (figure): thread selection point; Set 0 (from summary) unused; Set 1 reused; accesses ma, mb, mx, mz, mc, md with their cache mapping; dynamic cache state; memory accesses (from summary)

30 Test Generation Approach (figure): schedule choice; thread selection point; Set 0 (from summary) unused; Set 1 reused; accesses ma, mb, mx, mz, mc, md with their cache mapping; dynamic cache state; memory accesses (from summary)
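A minimal sketch of the corresponding schedule choice for cache conflicts: prefer the thread whose memory accesses (from its static summary) map to cache sets whose content would otherwise be reused. The set mapping and the simplification that any access to such a set counts as a potential conflict are assumptions; associativity and replacement policy are ignored. All names are hypothetical.

```python
# Assumed cache geometry, for illustration only.
NUM_SETS = 2
LINE_BYTES = 32


def set_of(address, num_sets=NUM_SETS, line_bytes=LINE_BYTES):
    """Hypothetical cache mapping: consecutive lines rotate across sets."""
    return (address // line_bytes) % num_sets


def potential_conflicts(accesses, reused_sets):
    """Count accesses that map to a set holding content marked as reused."""
    return sum(1 for a in accesses if set_of(a) in reused_sets)


def choose_thread_for_cache(summaries, reused_sets):
    """Pick the thread most likely to disturb reused cache content.
    Ties go to the thread whose name sorts first."""
    return max(sorted(summaries),
               key=lambda t: potential_conflicts(summaries[t], reused_sets))


if __name__ == "__main__":
    # Cache state plus reuse information: only set 1 holds reused content.
    reused_sets = {1}
    # Static summaries: memory addresses each thread would access.
    summaries = {"thread_a": [0, 64], "thread_b": [32, 96, 128]}
    print(choose_thread_for_cache(summaries, reused_sets))  # thread_b
```

In this example thread_b is selected because two of its accesses map to set 1, while both accesses of thread_a map to the unused set 0.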

31 Test Generation Approach, full picture: GPGPU program; Summary 1, Summary 2, Summary n (static information); input selection (symbolic testing); schedule selection; execution state (cache, DRAM) (dynamic information); input; execute

32 Implementation: GPGPU-Sim, a cycle-accurate GPU simulator; LLVM, compiler infrastructure; GKLEE, for generating symbolic constraints along a program path; STP, theorem prover to solve symbolic constraints

33 Number of performance bottlenecks w.r.t. time (figure)

34 Evaluation with CUDA kernels (figure)

35 Summary: software testing to discover performance bottlenecks in embedded GPUs; systematic exploration of test inputs and thread schedules; no false positives, but some bottlenecks might be missed; usage in optimization such as cache locking and memory layout modification (see the paper). Future perspective: diagnosis of root cause; automatic fixing of performance bottlenecks

36 Thank you


More information

21. Software Development Team

21. Software Development Team 21. Software Development Team 21.1. Team members Kazuo MINAMI (Team Head) Masaaki TERAI (Research & Development Scientist) Atsuya UNO (Research & Development Scientist) Akiyoshi KURODA (Research & Development

More information

Optimizing Configuration and Application Mapping for MPSoC Architectures

Optimizing Configuration and Application Mapping for MPSoC Architectures Optimizing Configuration and Application Mapping for MPSoC Architectures École Polytechnique de Montréal, Canada Email : Sebastien.Le-Beux@polymtl.ca 1 Multi-Processor Systems on Chip (MPSoC) Design Trends

More information

Software Driven Embedded Systems Design. A Use Case Analysis: Avoiding a hardware dependent software disaster using Virtual System Prototyping

Software Driven Embedded Systems Design. A Use Case Analysis: Avoiding a hardware dependent software disaster using Virtual System Prototyping Software Driven Embedded Systems Design A Use Case Analysis: Avoiding a hardware dependent software disaster using Virtual System Prototyping Overview Traditional System Development: A use case Traditional

More information

Case Study: Load Testing and Tuning to Improve SharePoint Website Performance

Case Study: Load Testing and Tuning to Improve SharePoint Website Performance Case Study: Load Testing and Tuning to Improve SharePoint Website Performance Abstract: Initial load tests revealed that the capacity of a customized Microsoft Office SharePoint Server (MOSS) website cluster

More information

Integrated Communication Systems

Integrated Communication Systems Integrated Communication Systems Courses, Research, and Thesis Topics Prof. Paul Müller University of Kaiserslautern Department of Computer Science Integrated Communication Systems ICSY http://www.icsy.de

More information

Performance Tuning and Optimizing SQL Databases 2016

Performance Tuning and Optimizing SQL Databases 2016 Performance Tuning and Optimizing SQL Databases 2016 http://www.homnick.com marketing@homnick.com +1.561.988.0567 Boca Raton, Fl USA About this course This four-day instructor-led course provides students

More information

Graphical Processing Units to Accelerate Orthorectification, Atmospheric Correction and Transformations for Big Data

Graphical Processing Units to Accelerate Orthorectification, Atmospheric Correction and Transformations for Big Data Graphical Processing Units to Accelerate Orthorectification, Atmospheric Correction and Transformations for Big Data Amanda O Connor, Bryan Justice, and A. Thomas Harris IN52A. Big Data in the Geosciences:

More information

Software Engineering Best Practices. Christian Hartshorne Field Engineer Daniel Thomas Internal Sales Engineer

Software Engineering Best Practices. Christian Hartshorne Field Engineer Daniel Thomas Internal Sales Engineer Software Engineering Best Practices Christian Hartshorne Field Engineer Daniel Thomas Internal Sales Engineer 2 3 4 Examples of Software Engineering Debt (just some of the most common LabVIEW development

More information

An Evaluation of OpenMP on Current and Emerging Multithreaded/Multicore Processors

An Evaluation of OpenMP on Current and Emerging Multithreaded/Multicore Processors An Evaluation of OpenMP on Current and Emerging Multithreaded/Multicore Processors Matthew Curtis-Maury, Xiaoning Ding, Christos D. Antonopoulos, and Dimitrios S. Nikolopoulos The College of William &

More information

Intel DPDK Boosts Server Appliance Performance White Paper

Intel DPDK Boosts Server Appliance Performance White Paper Intel DPDK Boosts Server Appliance Performance Intel DPDK Boosts Server Appliance Performance Introduction As network speeds increase to 40G and above, both in the enterprise and data center, the bottlenecks

More information

GPU Programming Strategies and Trends in GPU Computing

GPU Programming Strategies and Trends in GPU Computing GPU Programming Strategies and Trends in GPU Computing André R. Brodtkorb 1 Trond R. Hagen 1,2 Martin L. Sætra 2 1 SINTEF, Dept. Appl. Math., P.O. Box 124, Blindern, NO-0314 Oslo, Norway 2 Center of Mathematics

More information

A Software Approach to Unifying Multicore Caches

A Software Approach to Unifying Multicore Caches A Software Approach to Unifying Multicore Caches Silas Boyd-Wickizer, M. Frans Kaashoek, Robert Morris, and Nickolai Zeldovich Abstract Multicore chips will have large amounts of fast on-chip cache memory,

More information

COMP 422, Lecture 3: Physical Organization & Communication Costs in Parallel Machines (Sections 2.4 & 2.5 of textbook)

COMP 422, Lecture 3: Physical Organization & Communication Costs in Parallel Machines (Sections 2.4 & 2.5 of textbook) COMP 422, Lecture 3: Physical Organization & Communication Costs in Parallel Machines (Sections 2.4 & 2.5 of textbook) Vivek Sarkar Department of Computer Science Rice University vsarkar@rice.edu COMP

More information

Technical Paper. Moving SAS Applications from a Physical to a Virtual VMware Environment

Technical Paper. Moving SAS Applications from a Physical to a Virtual VMware Environment Technical Paper Moving SAS Applications from a Physical to a Virtual VMware Environment Release Information Content Version: April 2015. Trademarks and Patents SAS Institute Inc., SAS Campus Drive, Cary,

More information

GPU Architecture. Michael Doggett ATI

GPU Architecture. Michael Doggett ATI GPU Architecture Michael Doggett ATI GPU Architecture RADEON X1800/X1900 Microsoft s XBOX360 Xenos GPU GPU research areas ATI - Driving the Visual Experience Everywhere Products from cell phones to super

More information

OpenSPARC T1 Processor

OpenSPARC T1 Processor OpenSPARC T1 Processor The OpenSPARC T1 processor is the first chip multiprocessor that fully implements the Sun Throughput Computing Initiative. Each of the eight SPARC processor cores has full hardware

More information

Scheduling. Scheduling. Scheduling levels. Decision to switch the running process can take place under the following circumstances:

Scheduling. Scheduling. Scheduling levels. Decision to switch the running process can take place under the following circumstances: Scheduling Scheduling Scheduling levels Long-term scheduling. Selects which jobs shall be allowed to enter the system. Only used in batch systems. Medium-term scheduling. Performs swapin-swapout operations

More information

Analyzing IBM i Performance Metrics

Analyzing IBM i Performance Metrics WHITE PAPER Analyzing IBM i Performance Metrics The IBM i operating system is very good at supplying system administrators with built-in tools for security, database management, auditing, and journaling.

More information

Fine-Grained Multiprocessor Real-Time Locking with Improved Blocking

Fine-Grained Multiprocessor Real-Time Locking with Improved Blocking Fine-Grained Multiprocessor Real-Time Locking with Improved Blocking Bryan C. Ward James H. Anderson Dept. of Computer Science UNC-Chapel Hill Motivation Locks can be used to control access to: Shared

More information

Hardware Acceleration for Just-In-Time Compilation on Heterogeneous Embedded Systems

Hardware Acceleration for Just-In-Time Compilation on Heterogeneous Embedded Systems Hardware Acceleration for Just-In-Time Compilation on Heterogeneous Embedded Systems A. Carbon, Y. Lhuillier, H.-P. Charles CEA LIST DACLE division Embedded Computing Embedded Software Laboratories France

More information

Performance analysis of a Linux based FTP server

Performance analysis of a Linux based FTP server Performance analysis of a Linux based FTP server A Thesis Submitted in Partial Fulfillment of the Requirements for the Degree of Master of Technology by Anand Srivastava to the Department of Computer Science

More information

Chapter 18: Database System Architectures. Centralized Systems

Chapter 18: Database System Architectures. Centralized Systems Chapter 18: Database System Architectures! Centralized Systems! Client--Server Systems! Parallel Systems! Distributed Systems! Network Types 18.1 Centralized Systems! Run on a single computer system and

More information

Stochastic Analysis of a Queue Length Model Using a Graphics Processing Unit

Stochastic Analysis of a Queue Length Model Using a Graphics Processing Unit Stochastic Analysis of a ueue Length Model Using a Graphics Processing Unit J. Přiryl* Faculty of Transportation Sciences, Czech University of Technology, Prague, Czech Republic Institute of Information

More information