VALAR: A BENCHMARK SUITE TO STUDY THE DYNAMIC BEHAVIOR OF HETEROGENEOUS SYSTEMS

1 VALAR: A BENCHMARK SUITE TO STUDY THE DYNAMIC BEHAVIOR OF HETEROGENEOUS SYSTEMS
Perhaad Mistry, Yash Ukidave, Dana Schaa, David Kaeli
Department of Electrical and Computer Engineering, Northeastern University, Boston, USA
ASPLOS 2013, Houston, TX, 16th March
GPGPU 6, March 2013

2 WHAT IS THIS TALK ABOUT?
A benchmark suite for heterogeneous computing, written in OpenCL, that allows us to study the interaction between compute devices in heterogeneous application environments

3 TOPICS
Goals of an alternative benchmark suite for heterogeneous computing
Classifying heterogeneous applications based on their behavior and their mapping to compute devices
Brief overview of Valar's benchmarks
Evaluation methodology
Example exploration studies
Conclusions and future work

4 MOTIVATION
Benchmarks for evaluating workload partitioning on CPU-GPU systems
Most open-source benchmark suites for heterogeneous systems do not use both the CPU and the GPU device(s) for compute in OpenCL
Allow a wide range of behaviors within the same application, to evaluate data movement optimizations
A benchmark suite with different behavior scenarios of heterogeneous applications
To evaluate runtimes and schedulers targeting heterogeneous systems
Fit somewhere between microbenchmarks and complete applications

5 APPLICATION CLASSIFICATION - IMPLEMENTATION
Implementation classification covers the mapping of computation onto the compute devices present
Mapping can be static or decided dynamically
Determined by the algorithm's development and its mapping to the compute device
Compute Pipeline: large stream of kernels and minimal IO
Multidevice Execution: computation partitioned over multiple devices, with or without frequent communication

6 APPLICATION CLASSIFICATION - BEHAVIORAL
Behavioral classification covers the algorithm's usage scenario
The implementation of an application and its behavior are discussed separately
Quality of Service Behavior: application depends on error or data characteristics
Multiple Independent Behavior: small independent tasks continuously offloaded
High B/W Input Behavior: large data streams, high-bandwidth GPU workloads

7 VALAR'S APPLICATIONS - PHYSICS SIMULATION
Collision pipeline: a physics application in which the combination of large and small particles defines workload behavior
GPU performs the small-small collisions
CPU performs the large-small and large-large collisions
Behavioral space explored using:
  Number of particles
  Ratio of large to small particles
[Figure: CPU-GPU pipeline diagram with labels GPU, (Posn, Vel, Force), CPU, Build Grid, Synchronization, SS Collide, LS Collide, ForceLS, LL Collide, Synchronization, S Integrate, LL Integrate]
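
The device split described on this slide can be illustrated with a small host-side sketch. It is not Valar's code: the function names are hypothetical, and it assumes every pair of particles is a collision candidate (the slide's diagram mentions a Build Grid stage, which would prune pairs). It only shows how the ratio of large to small particles shifts candidate pairs between the GPU and the CPU.

```python
from itertools import combinations


def assign_collision_device(is_large_a, is_large_b):
    """Return the device that handles a pair under the slide's split."""
    if not is_large_a and not is_large_b:
        return "GPU"  # small-small collisions
    return "CPU"      # large-small and large-large collisions


def count_pairs_per_device(num_small, num_large):
    """Count candidate pairs per device, assuming all-pairs testing (no grid)."""
    sizes = [False] * num_small + [True] * num_large
    counts = {"GPU": 0, "CPU": 0}
    for a, b in combinations(sizes, 2):
        counts[assign_collision_device(a, b)] += 1
    return counts
```

Calling `count_pairs_per_device` with a fixed particle total and a growing number of large particles shows the CPU share of pairs rising, which is the knob the slide calls the ratio of large to small particles.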

8 VALAR'S APPLICATIONS - FINITE IMPULSE RESPONSE FILTER (FIR)
Adaptive FIR: a streaming DSP application used in audio filtering, speech recognition, and pulse detection
Output signal is generated by multiplying the input samples with a set of taps
Adaptive FIR changes the weights of the filter taps on a separate command queue, based on signal characteristics
Behavioral space explored using:
  Filter block size and number of taps: compute intensity
  Dispatch frequency: IO frequency and size
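
As a rough illustration of the block-based filtering described here, the following sketch filters one block of samples at a time and lets the caller swap the taps between blocks. It is a plain CPU sketch under assumptions, not the Valar kernel: `fir_block` and its `history` argument are hypothetical names, and the adaptation policy that decides new tap weights is left out entirely.

```python
def fir_block(samples, taps, history):
    """Filter one block of input samples.

    history holds the last len(taps) - 1 inputs of the previous block,
    so the output is continuous across block boundaries.
    """
    padded = list(history) + list(samples)
    n_taps = len(taps)
    out = []
    for i in range(len(samples)):
        # y[n] = sum over k of taps[k] * x[n - k]
        acc = 0.0
        for k in range(n_taps):
            acc += taps[k] * padded[i + n_taps - 1 - k]
        out.append(acc)
    # Keep the most recent n_taps - 1 inputs for the next block.
    new_history = padded[len(padded) - (n_taps - 1):]
    return out, new_history
```

A caller would process the stream block by block, passing `new_history` into the next call and replacing `taps` whenever the adaptive step produces new weights. Block size and tap count set the work per call, which matches the compute-intensity knob on the slide.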

9 VALAR'S APPLICATIONS - SEARCH
Search application: a simple application that searches for a range of values in data
GPU OpenCL kernel searches for a set of target data values in blocks of data
Application hands off the resulting data to the CPU for a final reduction
Behavioral space explored using:
  Interval: communication frequency of results from GPU to CPU
  Data pool size: size of GPU kernel
[Figure: CPU-GPU timeline with labels Initialize Data Range, Synchronization, Search Kernel, Initial Reduction, Synchronization, Final Reduction & Init new data range]
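
The interplay between the search kernel, the initial reduction and the final reduction can be sketched on the host alone. The code below is an assumption-laden stand-in, not the benchmark: `search_block` and `run_search` are hypothetical names, the per-block match count plays the role of the GPU kernel plus initial reduction, and `interval` is interpreted as the number of blocks processed between hand-offs to the CPU.

```python
def search_block(block, lo, hi):
    """Stand-in for the search kernel and initial reduction: count matches."""
    return sum(1 for v in block if lo <= v <= hi)


def run_search(data, block_size, interval, lo, hi):
    """Return (total matches, number of GPU-to-CPU hand-offs)."""
    total = 0
    handoffs = 0
    pending = []  # partial results not yet handed to the "CPU"
    for start in range(0, len(data), block_size):
        pending.append(search_block(data[start:start + block_size], lo, hi))
        if len(pending) == interval:
            total += sum(pending)  # final reduction on the "CPU"
            handoffs += 1
            pending = []
    if pending:  # flush whatever is left at the end of the data
        total += sum(pending)
        handoffs += 1
    return total, handoffs
```

With the same data, a larger `interval` gives the same total with fewer hand-offs, which is the coupling knob the slide varies.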

10 VALAR'S APPLICATIONS - SPEEDED UP ROBUST FEATURES (SURF)
SURF: feature detection application that summarizes an image into a number of interest points
Applications in object recognition, tracking, and image stitching
Behavioral space explored using:
  Image size: host-device I/O size and compute intensity
  Image color patterns: compute intensity

11 VALAR'S APPLICATIONS - TRAFFIC
Traffic application: cellular automaton model (NS model) for road traffic flow that reproduces traffic jams
Models traffic jams as an emergent phenomenon arising from the interaction between cars on the road
Behavioral space explored using:
  Number of cars and their distribution: compute intensity of kernels
  Maximum velocity: affects the number of kernel calls per timestep
Simple OpenCL kernel called over multiple strides
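
For readers unfamiliar with this kind of cellular automaton, here is a minimal single-lane sketch. It assumes that "NS model" refers to the Nagel-Schreckenberg model and implements that model's usual three rules (accelerate, brake to the gap, random slowdown) on a circular road. The function name and parameters are hypothetical, the update is sequential Python rather than an OpenCL kernel, and it does not reproduce the strided kernel calls mentioned on the slide.

```python
import random


def ns_step(positions, velocities, road_length, v_max, p_slow, rng):
    """One synchronous update of all cars on a circular single-lane road."""
    n = len(positions)
    order = sorted(range(n), key=lambda i: positions[i])
    new_pos = list(positions)
    new_vel = list(velocities)
    for idx, i in enumerate(order):
        ahead = order[(idx + 1) % n]
        # Number of empty cells between this car and the car ahead.
        gap = (positions[ahead] - positions[i] - 1) % road_length
        v = min(velocities[i] + 1, v_max)     # accelerate
        v = min(v, gap)                        # brake to avoid a collision
        if v > 0 and rng.random() < p_slow:    # random slowdown
            v -= 1
        new_vel[i] = v
        new_pos[i] = (positions[i] + v) % road_length
    return new_pos, new_vel
```

Every car is updated from the previous step's positions, so the per-car work is independent, which is what makes the model a natural fit for a data-parallel kernel.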

12 PERFORMANCE ANALYSIS IN A HETEROGENEOUS HIERARCHY
Categorization goal: reflect algorithm, data mapping and kernel optimization in benchmark selection
Layers to study heterogeneous application performance:
  AL0: application input
  AL1: OpenCL-level behavior (host-device behavior induced by input arguments)
  AL2: compute-device-specific hardware counter statistics
[Table: Abstraction Layer vs. Performance and Behavior Metrics]
Abstraction layers: AL0 Benchmark Options; AL1 Host-Device Interaction; AL2 Device H/W Perf. Counters (Southern Islands GPUs)
Performance and behavior metrics: input arguments and data to benchmarks; kernel execution frequency vs. IO; kernel calls on CPU vs. GPU; memory transaction frequency; memory transaction size; Vector ALU Busy %; Scalar ALU Busy %; Mem-Unit Busy %; registers used; local memory used; throughput and time

13 PERFORMANCE ANALYSIS IN A HETEROGENEOUS HIERARCHY
Categorization goal: reflect algorithm, data mapping and kernel optimization in benchmark selection
Layers to study heterogeneous application performance:
  AL0: application input
  AL1: OpenCL-level behavior (host-device behavior induced by input arguments)
  AL2: compute-device-specific hardware counter statistics
Tools: argument tracking, OpenCL event-based profiler, AMD APP Profiler

14 EXPERIMENTAL EVALUATION
Kernel optimization studies are possible with Valar
OpenCL kernels are optimized while maintaining correctness on all OpenCL-compliant platforms
Experiments based on host-device interaction can be used for the following architectural research:
  Effects of data-dependent kernels
  Benefits of host-device IO optimizations such as write combining
  Kernel call and communication cost
  Different OpenCL buffer management strategies

15 OPENCL KERNELS - DATA DEPENDENT KERNELS IN VALAR
Vector ALU utilization and memory unit utilization on AMD Southern Islands GPUs
Performance variation seen over the runtime of the application for representative input cases

16 INTERACTION RESULTS - FIR
The effect of write combining on application throughput on fused and discrete devices
Dispatch denotes the number of blocks combined in one kernel invocation
Requires an application with enough flexibility in host-device IO and kernel
Limited performance benefit seen for fused platforms and higher dispatch sizes

17 INTERACTION RESULTS - SEARCH
Search: a less coupled application, where CPU-GPU communication is less frequent
Effect of communication on application throughput in heterogeneous systems
Comparing a midrange discrete GPU with an APU device
APU system throughput is comparable for small communication intervals

18 INTERACTION RESULTS - SEARCH
CPU performance: discrete vs. APU
  At high communication, CPU kernel performance on the APU drops
  CPU kernel does gain from quad core with HT vs. quad core
GPU performance: discrete vs. APU
  Improvement for less frequent communication, with more work on the GPU
  High BW of SI GPUs vs. APU is decisive for throughput as communication reduces

19 INTERACTION RESULTS - PHYSICS
Effect of CPU compute capacity on application throughput for a coupled application
Application throughput for different particle distributions
Throughput for APU and discrete is in a similar range
Time per step is affected by large particle counts

20 INTERACTION RESULTS - PHYSICS
Effect of CPU compute capacity on application throughput for a coupled application
Throughput for different large particle counts
More large particles increase the amount of work on the CPU
Substantial reduction in throughput
Time per step is affected by large particle counts

21 CONCLUSIONS AND FUTURE WORK
Conclusions:
  Valar attempts to provide benchmarks that can generate a range of heterogeneous behavior for architectural research and application comparison
Future Work - Architectural Research:
  Compare against discrete implementations and other programming models
  Evaluate power swishing on APUs and evaluate mobile low-power SoCs
Future Work - Applications:
  Predator algorithm (TLD): coupled machine learning and feature detection
  More applications required, especially ones that use concurrent command queues
  Physics needs a CPU OpenCL command queue instead of a thread pool
  Traffic needs a better algorithm and an improved lane change model

22 THANK YOU! QUESTIONS? COMMENTS?
Perhaad Mistry

23 INTERACTION RESULTS - SURF IMAGE COMPARE
Preprocessing added on the CPU device at the beginning of the pipeline
Comparison kernel calculates the difference between two gray-scale images
Preprocessing result decides whether to launch the pipeline
Higher threshold values improve performance because more frames are skipped
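
The frame-skipping idea can be sketched as follows. This is an illustration under assumptions rather than the benchmark's preprocessing step: the function names are hypothetical, the difference measure is a mean absolute pixel difference, and each frame is compared against the last frame that was actually processed. The slide does not specify the metric or the reference frame.

```python
def mean_abs_diff(frame_a, frame_b):
    """Mean absolute difference between two equally sized gray-scale frames."""
    total = 0
    count = 0
    for row_a, row_b in zip(frame_a, frame_b):
        for a, b in zip(row_a, row_b):
            total += abs(a - b)
            count += 1
    return total / count


def frames_to_process(frames, threshold):
    """Return the indices of frames for which the pipeline would be launched."""
    launched = []
    reference = None  # last frame that went through the pipeline
    for i, frame in enumerate(frames):
        if reference is None or mean_abs_diff(reference, frame) > threshold:
            launched.append(i)
            reference = frame
    return launched
```

Raising `threshold` shortens the returned list, so fewer frames reach the expensive pipeline, which is consistent with the performance trend reported on the slide.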

25 EXTRA STUFF

26 PERFORMANCE RESULTS - SURF ORIENTATION COMPARE
Orientation comparison is useful if there is no camera rotation
Test case for overhead, since the orientation step is < 10% of SURF computation
Execution of the compute pipeline is interrupted to compare orientation against the previous frame
Frequency of orientation comparison is increased; native denotes no HAPTIC
More degradation in average performance is seen for small videos

27 VALAR'S APPLICATIONS - PHYSICS SIMULATION
Collision detection pipeline
The combination of large and small particles decides workload behavior
GPU performs the small-small collisions
CPU performs the large-small and large-large collisions
Behavioral space explored using:
  Number of particles
  Ratio of large to small particles


More information

A Scalable VISC Processor Platform for Modern Client and Cloud Workloads

A Scalable VISC Processor Platform for Modern Client and Cloud Workloads A Scalable VISC Processor Platform for Modern Client and Cloud Workloads Mohammad Abdallah Founder, President and CTO Soft Machines Linley Processor Conference October 7, 2015 Agenda Soft Machines Background

More information

Lecture 11: Multi-Core and GPU. Multithreading. Integration of multiple processor cores on a single chip.

Lecture 11: Multi-Core and GPU. Multithreading. Integration of multiple processor cores on a single chip. Lecture 11: Multi-Core and GPU Multi-core computers Multithreading GPUs General Purpose GPUs Zebo Peng, IDA, LiTH 1 Multi-Core System Integration of multiple processor cores on a single chip. To provide

More information

E6895 Advanced Big Data Analytics Lecture 14:! NVIDIA GPU Examples and GPU on ios devices

E6895 Advanced Big Data Analytics Lecture 14:! NVIDIA GPU Examples and GPU on ios devices E6895 Advanced Big Data Analytics Lecture 14: NVIDIA GPU Examples and GPU on ios devices Ching-Yung Lin, Ph.D. Adjunct Professor, Dept. of Electrical Engineering and Computer Science IBM Chief Scientist,

More information

Improving the performance of data servers on multicore architectures. Fabien Gaud

Improving the performance of data servers on multicore architectures. Fabien Gaud Improving the performance of data servers on multicore architectures Fabien Gaud Grenoble University Advisors: Jean-Bernard Stefani, Renaud Lachaize and Vivien Quéma Sardes (INRIA/LIG) December 2, 2010

More information

Measuring Cache and Memory Latency and CPU to Memory Bandwidth

Measuring Cache and Memory Latency and CPU to Memory Bandwidth White Paper Joshua Ruggiero Computer Systems Engineer Intel Corporation Measuring Cache and Memory Latency and CPU to Memory Bandwidth For use with Intel Architecture December 2008 1 321074 Executive Summary

More information

GPU File System Encryption Kartik Kulkarni and Eugene Linkov

GPU File System Encryption Kartik Kulkarni and Eugene Linkov GPU File System Encryption Kartik Kulkarni and Eugene Linkov 5/10/2012 SUMMARY. We implemented a file system that encrypts and decrypts files. The implementation uses the AES algorithm computed through

More information

Write a technical report Present your results Write a workshop/conference paper (optional) Could be a real system, simulation and/or theoretical

Write a technical report Present your results Write a workshop/conference paper (optional) Could be a real system, simulation and/or theoretical Identify a problem Review approaches to the problem Propose a novel approach to the problem Define, design, prototype an implementation to evaluate your approach Could be a real system, simulation and/or

More information

GraySort on Apache Spark by Databricks

GraySort on Apache Spark by Databricks GraySort on Apache Spark by Databricks Reynold Xin, Parviz Deyhim, Ali Ghodsi, Xiangrui Meng, Matei Zaharia Databricks Inc. Apache Spark Sorting in Spark Overview Sorting Within a Partition Range Partitioner

More information

Using VMware VMotion with Oracle Database and EMC CLARiiON Storage Systems

Using VMware VMotion with Oracle Database and EMC CLARiiON Storage Systems Using VMware VMotion with Oracle Database and EMC CLARiiON Storage Systems Applied Technology Abstract By migrating VMware virtual machines from one physical environment to another, VMware VMotion can

More information

Evaluation Methodology of Converged Cloud Environments

Evaluation Methodology of Converged Cloud Environments Krzysztof Zieliński Marcin Jarząb Sławomir Zieliński Karol Grzegorczyk Maciej Malawski Mariusz Zyśk Evaluation Methodology of Converged Cloud Environments Cloud Computing Cloud Computing enables convenient,

More information

IMCM: A Flexible Fine-Grained Adaptive Framework for Parallel Mobile Hybrid Cloud Applications

IMCM: A Flexible Fine-Grained Adaptive Framework for Parallel Mobile Hybrid Cloud Applications Open System Laboratory of University of Illinois at Urbana Champaign presents: Outline: IMCM: A Flexible Fine-Grained Adaptive Framework for Parallel Mobile Hybrid Cloud Applications A Fine-Grained Adaptive

More information

Real-Time Operating Systems for MPSoCs

Real-Time Operating Systems for MPSoCs Real-Time Operating Systems for MPSoCs Hiroyuki Tomiyama Graduate School of Information Science Nagoya University http://member.acm.org/~hiroyuki MPSoC 2009 1 Contributors Hiroaki Takada Director and Professor

More information

Intel DPDK Boosts Server Appliance Performance White Paper

Intel DPDK Boosts Server Appliance Performance White Paper Intel DPDK Boosts Server Appliance Performance Intel DPDK Boosts Server Appliance Performance Introduction As network speeds increase to 40G and above, both in the enterprise and data center, the bottlenecks

More information

Implementing an In-Service, Non- Intrusive Measurement Device in Telecommunication Networks Using the TMS320C31

Implementing an In-Service, Non- Intrusive Measurement Device in Telecommunication Networks Using the TMS320C31 Disclaimer: This document was part of the First European DSP Education and Research Conference. It may have been written by someone whose native language is not English. TI assumes no liability for the

More information

PERFORMANCE TUNING ORACLE RAC ON LINUX

PERFORMANCE TUNING ORACLE RAC ON LINUX PERFORMANCE TUNING ORACLE RAC ON LINUX By: Edward Whalen Performance Tuning Corporation INTRODUCTION Performance tuning is an integral part of the maintenance and administration of the Oracle database

More information

find model parameters, to validate models, and to develop inputs for models. c 1994 Raj Jain 7.1

find model parameters, to validate models, and to develop inputs for models. c 1994 Raj Jain 7.1 Monitors Monitor: A tool used to observe the activities on a system. Usage: A system programmer may use a monitor to improve software performance. Find frequently used segments of the software. A systems

More information

Operating Systems 4 th Class

Operating Systems 4 th Class Operating Systems 4 th Class Lecture 1 Operating Systems Operating systems are essential part of any computer system. Therefore, a course in operating systems is an essential part of any computer science

More information

Ensuring Collective Availability in Volatile Resource Pools via Forecasting

Ensuring Collective Availability in Volatile Resource Pools via Forecasting Ensuring Collective Availability in Volatile Resource Pools via Forecasting Artur Andrzejak andrzejak[at]zib.de Derrick Kondo David P. Anderson Zuse Institute Berlin (ZIB) INRIA UC Berkeley Motivation

More information

Software and the Concurrency Revolution

Software and the Concurrency Revolution Software and the Concurrency Revolution A: The world s fastest supercomputer, with up to 4 processors, 128MB RAM, 942 MFLOPS (peak). 2 Q: What is a 1984 Cray X-MP? (Or a fractional 2005 vintage Xbox )

More information

Secure Containers. Jan 2015 www.imgtec.com. Imagination Technologies HGI Dec, 2014 p1

Secure Containers. Jan 2015 www.imgtec.com. Imagination Technologies HGI Dec, 2014 p1 Secure Containers Jan 2015 www.imgtec.com Imagination Technologies HGI Dec, 2014 p1 What are we protecting? Sensitive assets belonging to the user and the service provider Network Monitor unauthorized

More information

Experimental Evaluation of Distributed Middleware with a Virtualized Java Environment

Experimental Evaluation of Distributed Middleware with a Virtualized Java Environment Experimental Evaluation of Distributed Middleware with a Virtualized Java Environment Nuno A. Carvalho, João Bordalo, Filipe Campos and José Pereira HASLab / INESC TEC Universidade do Minho MW4SOC 11 December

More information

Step by Step Guide To vstorage Backup Server (Proxy) Sizing

Step by Step Guide To vstorage Backup Server (Proxy) Sizing Tivoli Storage Manager for Virtual Environments V6.3 Step by Step Guide To vstorage Backup Server (Proxy) Sizing 12 September 2012 1.1 Author: Dan Wolfe, Tivoli Software Advanced Technology Page 1 of 18

More information

Introduction to Cloud Computing

Introduction to Cloud Computing Introduction to Cloud Computing Parallel Processing I 15 319, spring 2010 7 th Lecture, Feb 2 nd Majd F. Sakr Lecture Motivation Concurrency and why? Different flavors of parallel computing Get the basic

More information

DELL s Oracle Database Advisor

DELL s Oracle Database Advisor DELL s Oracle Database Advisor Underlying Methodology A Dell Technical White Paper Database Solutions Engineering By Roger Lopez Phani MV Dell Product Group January 2010 THIS WHITE PAPER IS FOR INFORMATIONAL

More information

CS423 Spring 2015 MP4: Dynamic Load Balancer Due April 27 th at 9:00 am 2015

CS423 Spring 2015 MP4: Dynamic Load Balancer Due April 27 th at 9:00 am 2015 CS423 Spring 2015 MP4: Dynamic Load Balancer Due April 27 th at 9:00 am 2015 1. Goals and Overview 1. In this MP you will design a Dynamic Load Balancer architecture for a Distributed System 2. You will

More information

@IJMTER-2015, All rights Reserved 355

@IJMTER-2015, All rights Reserved 355 e-issn: 2349-9745 p-issn: 2393-8161 Scientific Journal Impact Factor (SJIF): 1.711 International Journal of Modern Trends in Engineering and Research www.ijmter.com A Model for load balancing for the Public

More information

IoT: Smart Vision Leads The Way

IoT: Smart Vision Leads The Way IoT: Smart Vision Leads The Way Peter McGuinness Multimedia Technology Marketing www.imgtec.com IoT is changing from amorphous to concrete: Imagination Technologies US Summit May 2015 2 IoT is changing

More information

theguard! ApplicationManager System Windows Data Collector

theguard! ApplicationManager System Windows Data Collector theguard! ApplicationManager System Windows Data Collector Status: 10/9/2008 Introduction... 3 The Performance Features of the ApplicationManager Data Collector for Microsoft Windows Server... 3 Overview

More information

Go Faster - Preprocessing Using FPGA, CPU, GPU. Dipl.-Ing. (FH) Bjoern Rudde Image Acquisition Development STEMMER IMAGING

Go Faster - Preprocessing Using FPGA, CPU, GPU. Dipl.-Ing. (FH) Bjoern Rudde Image Acquisition Development STEMMER IMAGING Go Faster - Preprocessing Using FPGA, CPU, GPU Dipl.-Ing. (FH) Bjoern Rudde Image Acquisition Development STEMMER IMAGING WHO ARE STEMMER IMAGING? STEMMER IMAGING is: Europe's leading independent provider

More information

A bachelor of science degree in electrical engineering with a cumulative undergraduate GPA of at least 3.0 on a 4.0 scale

A bachelor of science degree in electrical engineering with a cumulative undergraduate GPA of at least 3.0 on a 4.0 scale What is the University of Florida EDGE Program? EDGE enables engineering professional, military members, and students worldwide to participate in courses, certificates, and degree programs from the UF

More information

Intel Ethernet Switch Load Balancing System Design Using Advanced Features in Intel Ethernet Switch Family

Intel Ethernet Switch Load Balancing System Design Using Advanced Features in Intel Ethernet Switch Family Intel Ethernet Switch Load Balancing System Design Using Advanced Features in Intel Ethernet Switch Family White Paper June, 2008 Legal INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL

More information

Multi-Threading Performance on Commodity Multi-Core Processors

Multi-Threading Performance on Commodity Multi-Core Processors Multi-Threading Performance on Commodity Multi-Core Processors Jie Chen and William Watson III Scientific Computing Group Jefferson Lab 12000 Jefferson Ave. Newport News, VA 23606 Organization Introduction

More information

The High Performance Internet of Things: using GVirtuS for gluing cloud computing and ubiquitous connected devices

The High Performance Internet of Things: using GVirtuS for gluing cloud computing and ubiquitous connected devices WS on Models, Algorithms and Methodologies for Hierarchical Parallelism in new HPC Systems The High Performance Internet of Things: using GVirtuS for gluing cloud computing and ubiquitous connected devices

More information

Recent Advances in Periscope for Performance Analysis and Tuning

Recent Advances in Periscope for Performance Analysis and Tuning Recent Advances in Periscope for Performance Analysis and Tuning Isaias Compres, Michael Firbach, Michael Gerndt Robert Mijakovic, Yury Oleynik, Ventsislav Petkov Technische Universität München Yury Oleynik,

More information

How To Test For Performance And Scalability On A Server With A Multi-Core Computer (For A Large Server)

How To Test For Performance And Scalability On A Server With A Multi-Core Computer (For A Large Server) Scalability Results Select the right hardware configuration for your organization to optimize performance Table of Contents Introduction... 1 Scalability... 2 Definition... 2 CPU and Memory Usage... 2

More information

Radeon HD 2900 and Geometry Generation. Michael Doggett

Radeon HD 2900 and Geometry Generation. Michael Doggett Radeon HD 2900 and Geometry Generation Michael Doggett September 11, 2007 Overview Introduction to 3D Graphics Radeon 2900 Starting Point Requirements Top level Pipeline Blocks from top to bottom Command

More information

Visualization à la Unix TM

Visualization à la Unix TM Visualization à la Unix TM Hans-Peter Bischof (hpb [at] cs.rit.edu) Department of Computer Science Golisano College of Computing and Information Sciences Rochester Institute of Technology One Lomb Memorial

More information