HETEROGENEOUS SYSTEM COHERENCE FOR INTEGRATED CPU-GPU SYSTEMS
|
|
|
- Ralf Cunningham
- 10 years ago
- Views:
Transcription
1 HETEROGENEOUS SYSTEM COHERENCE FOR INTEGRATED CPU-GPU SYSTEMS JASON POWER*, ARKAPRAVA BASU*, JUNLI GU, SOORAJ PUTHOOR, BRADFORD M BECKMANN, MARK D HILL*, STEVEN K REINHARDT, DAVID A WOOD* *University of Wisconsin-Madison Advanced Micro Devices, Inc.
2 Powerpoint version available on: 2 HETEROGENEOUS SYSTEM COHERENCE DECEMBER 11, 2013 MICRO-46
3 ABSTRACT Hardware coherence can increase the utility of heterogeneous systems Major bottlenecks in current coherence implementations High bandwidth difficult to support at directory Extreme resource requirements We propose Heterogeneous System Coherence Leverages spatial locality and region coherence Reduces bandwidth by 94% Reduces resource requirements by 95% 4 HETEROGENEOUS SYSTEM COHERENCE DECEMBER 11, 2013 MICRO-46
4 PHYSICAL INTEGRATION 5 HETEROGENEOUS SYSTEM COHERENCE DECEMBER 11, 2013 MICRO-46
5 PHYSICAL INTEGRATION 6 HETEROGENEOUS SYSTEM COHERENCE DECEMBER 11, 2013 MICRO-46
6 PHYSICAL INTEGRATION 7 HETEROGENEOUS SYSTEM COHERENCE DECEMBER 11, 2013 MICRO-46
7 PHYSICAL INTEGRATION Stacked High-bandwidth DRAM GPU CPU Credit: IBM Cores 8 HETEROGENEOUS SYSTEM COHERENCE DECEMBER 11, 2013 MICRO-46
8 LOGICAL INTEGRATION General-purpose GPU computing OpenCL CUDA Heterogeneous Uniform Memory Access (huma) Shared virtual address space Cache coherence Allows new heterogeneous apps 9 HETEROGENEOUS SYSTEM COHERENCE DECEMBER 11, 2013 MICRO-46
9 OUTLINE Motivation Background System overview Cache architecture reminder Heterogeneous System Bottlenecks Heterogeneous System Coherence Details Results Conclusions 10 HETEROGENEOUS SYSTEM COHERENCE DECEMBER 11, 2013 MICRO-46
10 SYSTEM OVERVIEW SYSTEM LEVEL Highbandwidth interconnect Accelerated Processing Unit (APU) DRAM Channels 11 HETEROGENEOUS SYSTEM COHERENCE DECEMBER 11, 2013 MICRO-46
11 SYSTEM OVERVIEW APU GPU compute accesses must stay coherent Direct-access bus (used for graphics) GPU Cluster APU CPU Cluster Directory Arrow thickness bandwidth Invalidation traffic To DRAM 12 HETEROGENEOUS SYSTEM COHERENCE DECEMBER 11, 2013 MICRO-46
12 SYSTEM OVERVIEW GPU CU L1 CU L1 CU L1 I-Fetch / Decode GPU Cluster Very high bandwidth: CU CU CU CU CU CU CU CU CU CU CU CU L2 has high Local miss Scratchpad rate L1 L1 L1 L1 L1 L1 L1 L1 L1 L1 L1 L1 Register File CU Memory CU L1 Ex Ex Ex Ex Ex Ex Ex Ex Ex Ex Ex Ex GPU L2 Cache L1 L1 L1 L1 L1 L1 L1 L1 L1 L1 L1 L1 L1 L1 L1 L1 CU CU CU CU CU CU CU CU CU CU CU CU CU CU CU CU Ex Ex Ex Ex Coalescer To L1 13 HETEROGENEOUS SYSTEM COHERENCE DECEMBER 11, 2013 MICRO-46
13 SYSTEM OVERVIEW CPU Cluster CPU Core Low bandwidth: Low L2 miss rate L1 L2 CPU Core L1 To Dir L1 L1 CPU Core CPU Core 14 HETEROGENEOUS SYSTEM COHERENCE DECEMBER 11, 2013 MICRO-46
14 CACHE ARCHITECTURE REMINDER CPU/GPU L2 CACHE Demand Requests Core Data Responses Demand requests Cache Tag Arrays Searches cache tags from L1 Allocates cache an for MSHR a tag match Tag hit on probe: send MSHRs entry On a directory data to other core On a miss, send probe, check On request a hit, return to directory data to the L1 MSHR Entries Miss Requests Data MSHRs Hit and tags Responses Probe Requests Coherent Network Interface 15 HETEROGENEOUS SYSTEM COHERENCE DECEMBER 11, 2013 MICRO-46
15 DIRECTORY ARCHITECTURE REMINDER DIRECTORY Coherent Block Requests Demand Block requests Directory Searches Tag Array cache Block tags Probe Requests/ from L2 Allocates cache an for MSHR a tag match Responses On a miss, the entry data comes Allocate and send MSHRs from DRAM Probe Request RAM probes to L2 caches Hit MSHR Entries Miss PR Entries To DRAM 16 HETEROGENEOUS SYSTEM COHERENCE DECEMBER 11, 2013 MICRO-46
16 BACKGROUND SUMMARY System under investigation Heterogeneous CPU-GPU on chip High-bandwidth DRAM Directory pipeline complex MSHR array is associative Difficult to pipeline with more than 1 request per cycle Important resources: MSHR entries 17 HETEROGENEOUS SYSTEM COHERENCE DECEMBER 11, 2013 MICRO-46
17 OUTLINE Motivation Background Heterogeneous System Bottlenecks Simulation overview Directory bandwidth MSHRs Performance is significantly affected Heterogeneous System Coherence Details Results Conclusions 18 HETEROGENEOUS SYSTEM COHERENCE DECEMBER 11, 2013 MICRO-46
18 SIMULATION DETAILS gem5 simulator Simple CPU GPU simulator based on AMD GCN All memory requests through gem5 CPU Clock 2 GHz CPU Cores 2 CPU Shared L2 2 MB (16-way banked) GPU Clock 1 GHz Compute Units 32 GPU Shared L2 4 MB (64-way banked) L3 (Memory-side) 16 MB (16-way banked) DRAM DDR3, 16 channels Peak Bandwidth 700 GB/s Baseline Directory 256k entries (8-way banked) Workloads Modified to use huma Rodinia & AMD APP SDK 19 HETEROGENEOUS SYSTEM COHERENCE DECEMBER 11, 2013 MICRO-46
19 GPGPU BENCHMARKS Rodinia benchmarks bp trains the connection weights on a neural network bfs breadth-first search hs performs a transient 2D thermal simulation (5-point stencil) lud matrix decomposition nw performs a global optimization for DNA sequence alignment km does k-means clustering sd speckle-reducing anisotropic diffusion AMD SDK bn bitonic sort dct discrete cosine transform hg histogram mm matrix multiplication 20 HETEROGENEOUS SYSTEM COHERENCE DECEMBER 11, 2013 MICRO-46
20 SYSTEM BOTTLENECKS Difficult to scale directory bandwidth Difficult to multi-port GPU Complicated pipeline Cluster High resource usage APU CPU Cluster Must allocate MSHR for entire duration of request MSHR array difficult to scale Directory High bandwidth Designed to support CPU bandwidth To DRAM 21 HETEROGENEOUS SYSTEM COHERENCE DECEMBER 11, 2013 MICRO-46
21 DIRECTORY TRAFFIC 4.5 Directory accesses per GPU cycle Difficult to support >1 request per cycle bp bfs hs lud nw km sd bn dct hg mm 22 HETEROGENEOUS SYSTEM COHERENCE DECEMBER 11, 2013 MICRO-46
22 RESOURCE USAGE Maximum MSHRs Very difficult to scale MSHR array Steady state at 700 GB/s Causes significant back-pressure on L2s bp bfs hs lud nw km sd bn dct hg mm 23 HETEROGENEOUS SYSTEM COHERENCE DECEMBER 11, 2013 MICRO-46
23 PERFORMANCE OF BASELINE COMPARED TO UNCONSTRAINED RESOURCES Slow down Back-pressure from limited MSHRs and bandwidth bp bfs hs lud nw km sd bn dct hg mm 24 HETEROGENEOUS SYSTEM COHERENCE DECEMBER 11, 2013 MICRO-46
24 BOTTLENECKS SUMMARY Directory bandwidth Must support up to 4 requests per cycle Difficult to construct pipeline Resource usage MSHRs are a constraining resource Need more than 10,000 Without resource constraints, up to 4x better performance 25 HETEROGENEOUS SYSTEM COHERENCE DECEMBER 11, 2013 MICRO-46
25 OUTLINE Motivation Background Heterogeneous System Bottlenecks Heterogeneous System Coherence Details Overall system design Region buffer design Region directory design Example Hardware complexity Results Conclusions 26 HETEROGENEOUS SYSTEM COHERENCE DECEMBER 11, 2013 MICRO-46
26 BASELINE DIRECTORY COHERENCE GPU Cluster APU CPU Cluster Initialization Kernel Launch Directory Read result To DRAM 27 HETEROGENEOUS SYSTEM COHERENCE DECEMBER 11, 2013 MICRO-46
27 HETEROGENEOUS SYSTEM COHERENCE (HSC) GPU Cluster APU CPU Cluster Initialization Kernel Launch Directory To DRAM 28 HETEROGENEOUS SYSTEM COHERENCE DECEMBER 11, 2013 MICRO-46
28 HETEROGENEOUS SYSTEM COHERENCE (HSC) Direct-access bus GPU Cluster Region Buffer APU CPU Region Cluster Buffer Region buffers coordinate with region directory Region Directory Directory To DRAM 29 HETEROGENEOUS SYSTEM COHERENCE DECEMBER 11, 2013 MICRO-46
29 HSC: EXAMPLE MEMORY REQUEST GPU L2 Cache APU GPU Region Buffer GPU Region Cluster Buffer CPU Region Cluster Buffer Region Directory Region Directory To DRAM 32 HETEROGENEOUS SYSTEM COHERENCE DECEMBER 11, 2013 MICRO-46
30 HSC: L2 CACHE & REGION BUFFER Demand Requests MSHRs Core Data Responses Core Data Responses Region tags and Cache Tag Arrays Cache Tag Arrays permissions Only region-level MSHRs permission traffic Interface for Miss direct-access bus Hit MSHR Entries MSHR Entries Hit Miss Region Buffer Hit Probe Requests Direct Access Bus Interface Miss Requests Miss Hit Probe Data Requests Responses Coherent Coherent Network Network Interface 33 HETEROGENEOUS SYSTEM COHERENCE DECEMBER 11, 2013 MICRO-46
31 HSC: REGION DIRECTORY Region Permission Requests MSHRs Coherent Block Requests MSHRs MSHR Entries MSHR Entries Block Directory Tag Array Region Directory Tag Array Region tags, sharers, and permissions Miss Miss Hit Hit Block Probe Requests/ Responses Block Probe Requests/Responses Probe Request RAM Probe Request RAM PR Entries PR Entries To DRAM 34 HETEROGENEOUS SYSTEM COHERENCE DECEMBER 11, 2013 MICRO-46
32 HSC: HARDWARE COMPLEXITY Region protocols reduce directory size Region directory: 8x fewer entries Region buffers At each L2 cache 1-KB region (16 64-B blocks) 16-K region entries Overprovisioned for low-locality workloads (a) Region Directory Entry Region Tag 18 bits (b) Region Buffer Entry State CPU GPU 2 bits 1 valid bit per cluster Region Tag State B 0 B 1 B 2... B bits 2 bits 1 valid bit per block in the region 35 HETEROGENEOUS SYSTEM COHERENCE DECEMBER 11, 2013 MICRO-46
33 HSC SUMMARY Key insight GPU-CPU applications exhibit high spatial locality Use direct-access bus present in systems Offload bandwidth onto direct-access bus Use coherence network only for permission Add region buffer to track region information At each L2 cache Bypass coherence network and directory Replace directory with region directory Significantly reduces total size needed 36 HETEROGENEOUS SYSTEM COHERENCE DECEMBER 11, 2013 MICRO-46
34 OUTLINE Motivation Background Heterogeneous System Bottlenecks Heterogeneous System Coherence Details Results Speed-up Latency of loads Bandwidth MSHR usage Conclusions 37 HETEROGENEOUS SYSTEM COHERENCE DECEMBER 11, 2013 MICRO-46
35 THREE CACHE-COHERENCE PROTOCOLS Broadcast: Null-directory that broadcasts on all requests Baseline: Block-based, mostly inclusive, directory HSC: Region-based directory with 1-KB region size 38 HETEROGENEOUS SYSTEM COHERENCE DECEMBER 11, 2013 MICRO-46
36 HSC PERFORMANCE Normalized speed-up Largest slow-downs slowdowns from constrained resources Broadcast Baseline HSC bp bfs hs lud nw km sd bn dct hg mm 39 HETEROGENEOUS SYSTEM COHERENCE DECEMBER 11, 2013 MICRO-46
37 DIRECTORY TRAFFIC REDUCTION Normalized directory bandwidth broadcast baseline HSC Average bandwidth significantly reduced Theoretical reduction from 16 block regions bp bfs hs lud nw km sd bn dct hg mm 40 HETEROGENEOUS SYSTEM COHERENCE DECEMBER 11, 2013 MICRO-46
38 HSC RESOURCE USAGE Normalized directory MSHRs required Maximum MSHRs significantly reduced bp bfs hs lud nw km sd bn dct hg mm 41 HETEROGENEOUS SYSTEM COHERENCE DECEMBER 11, 2013 MICRO-46
39 RESULTS SUMMARY Used a detailed timing simulator for CPU and GPU HSC significantly improves performance Reduces the average load latency Decreases bandwidth requirement of directory HSC reduces the required MSHRs at the directory 42 HETEROGENEOUS SYSTEM COHERENCE DECEMBER 11, 2013 MICRO-46
40 RELATED WORK Coarse-grained coherence Region coherence Applied to snooping systems [Cantin, ISCA 2005] [Moshovos, ISCA 2005] [Zebchuk, MICRO 2007] Extended to directories [Fang, PACT 2013] [Zebchuk, MICRO 2013] Spatiotemporal coherence [Alisafaee, MICRO 2012] Dual-grain directory coherence [Basu, UW-TR 2013] Primarily focused on directory size GPU coherence [Singh et al. HPCA 2013] Intra-GPU coherence 43 HETEROGENEOUS SYSTEM COHERENCE DECEMBER 11, 2013 MICRO-46
41 CONCLUSIONS Hardware coherence can increase the utility of heterogeneous systems Major bottlenecks in current coherence implementations High bandwidth difficult to support at directory Extreme resource requirements We propose Heterogeneous System Coherence Leverages spatial locality and region coherence Reduces bandwidth by 94% Reduces resource requirements by 95% 44 HETEROGENEOUS SYSTEM COHERENCE DECEMBER 11, 2013 MICRO-46
42 Questions? Contact: 45 HETEROGENEOUS SYSTEM COHERENCE DECEMBER 11, 2013 MICRO-46
43 DISCLAIMER & ATTRIBUTION The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors. The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes. AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION. AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES. ATTRIBUTION 2013 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. SPEC is a registered trademark of the Standard Performance Evaluation Corporation (SPEC). Other names are for informational purposes only and may be trademarks of their respective owners. 46 HETEROGENEOUS SYSTEM COHERENCE DECEMBER 11, 2013 MICRO-46
44 Backup Slides
45 LOAD LATENCY Normalized load latency Average load time significantly reduced broadcast baseline HSC bp bfs hs lud nw km sd bn dct hg mm 48 HETEROGENEOUS SYSTEM COHERENCE DECEMBER 11, 2013 MICRO-46
46 EXECUTION TIME BREAKDOWN GPU CPU Execution time (%) bp bfs hs lud nw km sd bn dct hg mm 49 HETEROGENEOUS SYSTEM COHERENCE DECEMBER 11, 2013 MICRO-46
A Vision for Tomorrow s Hosting Data Center
A Vision for Tomorrow s Hosting Data Center JOHN WILLIAMS CORPORATE VICE PRESIDENT, SERVER MARKETING MARCH 2013 THE EVOLVING HOSTING MARKET NEW OPPORTUNITIES: HOSTING IN THE CLOUD Hosted Server shipments
ATI Radeon 4800 series Graphics. Michael Doggett Graphics Architecture Group Graphics Product Group
ATI Radeon 4800 series Graphics Michael Doggett Graphics Architecture Group Graphics Product Group Graphics Processing Units ATI Radeon HD 4870 AMD Stream Computing Next Generation GPUs 2 Radeon 4800 series
GPU ACCELERATED DATABASES Database Driven OpenCL Programming. Tim Child 3DMashUp CEO
GPU ACCELERATED DATABASES Database Driven OpenCL Programming Tim Child 3DMashUp CEO SPEAKERS BIO Tim Child 35 years experience of software development Formerly VP Engineering, Oracle Corporation VP Engineering,
FLOATING-POINT ARITHMETIC IN AMD PROCESSORS MICHAEL SCHULTE AMD RESEARCH JUNE 2015
FLOATING-POINT ARITHMETIC IN AMD PROCESSORS MICHAEL SCHULTE AMD RESEARCH JUNE 2015 AGENDA The Kaveri Accelerated Processing Unit (APU) The Graphics Core Next Architecture and its Floating-Point Arithmetic
"JAGUAR AMD s Next Generation Low Power x86 Core. Jeff Rupley, AMD Fellow Chief Architect / Jaguar Core August 28, 2012
"JAGUAR AMD s Next Generation Low Power x86 Core Jeff Rupley, AMD Fellow Chief Architect / Jaguar Core August 28, 2012 TWO X86 CORES TUNED FOR TARGET MARKETS Mainstream Client and Server Markets Bulldozer
Radeon GPU Architecture and the Radeon 4800 series. Michael Doggett Graphics Architecture Group June 27, 2008
Radeon GPU Architecture and the series Michael Doggett Graphics Architecture Group June 27, 2008 Graphics Processing Units Introduction GPU research 2 GPU Evolution GPU started as a triangle rasterizer
AMD Product and Technology Roadmaps
AMD Product and Technology Roadmaps AMD 2013 DESKTOP OEM GRAPHICS ROADMAP 2012: AMD RADEON 7000/6000 SERIES 2013: AMD RADEON HD 8000 SERIES Enthusiast AMD Radeon 7900 Series GPU AMD Radeon 8990 Series
Optimizing SQL Server AlwaysOn Implementations with OCZ s ZD-XL SQL Accelerator
White Paper Optimizing SQL Server AlwaysOn Implementations with OCZ s ZD-XL SQL Accelerator Delivering Accelerated Application Performance, Microsoft AlwaysOn High Availability and Fast Data Replication
PHYSICAL CORES V. ENHANCED THREADING SOFTWARE: PERFORMANCE EVALUATION WHITEPAPER
PHYSICAL CORES V. ENHANCED THREADING SOFTWARE: PERFORMANCE EVALUATION WHITEPAPER Preface Today s world is ripe with computing technology. Computing technology is all around us and it s often difficult
MS Exchange Server Acceleration
White Paper MS Exchange Server Acceleration Using virtualization to dramatically maximize user experience for Microsoft Exchange Server Allon Cohen, PhD Scott Harlin OCZ Storage Solutions, Inc. A Toshiba
Answering the Requirements of Flash-Based SSDs in the Virtualized Data Center
White Paper Answering the Requirements of Flash-Based SSDs in the Virtualized Data Center Provide accelerated data access and an immediate performance boost of businesscritical applications with caching
Delivering Accelerated SQL Server Performance with OCZ s ZD-XL SQL Accelerator
enterprise White Paper Delivering Accelerated SQL Server Performance with OCZ s ZD-XL SQL Accelerator Performance Test Results for Analytical (OLAP) and Transactional (OLTP) SQL Server 212 Loads Allon
Accelerating Database Applications on Linux Servers
White Paper Accelerating Database Applications on Linux Servers Introducing OCZ s LXL Software - Delivering a Data-Path Optimized Solution for Flash Acceleration Allon Cohen, PhD Yaron Klein Eli Ben Namer
ECLIPSE Performance Benchmarks and Profiling. January 2009
ECLIPSE Performance Benchmarks and Profiling January 2009 Note The following research was performed under the HPC Advisory Council activities AMD, Dell, Mellanox, Schlumberger HPC Advisory Council Cluster
Accelerating Server Storage Performance on Lenovo ThinkServer
Accelerating Server Storage Performance on Lenovo ThinkServer Lenovo Enterprise Product Group April 214 Copyright Lenovo 214 LENOVO PROVIDES THIS PUBLICATION AS IS WITHOUT WARRANTY OF ANY KIND, EITHER
White Paper AMD PROJECT FREESYNC
White Paper AMD PROJECT FREESYNC TABLE OF CONTENTS INTRODUCTION 3 PROJECT FREESYNC USE CASES 4 Gaming 4 Video Playback 5 System Power Savings 5 PROJECT FREESYNC IMPLEMENTATION 6 Implementation Overview
Graphics Cards and Graphics Processing Units. Ben Johnstone Russ Martin November 15, 2011
Graphics Cards and Graphics Processing Units Ben Johnstone Russ Martin November 15, 2011 Contents Graphics Processing Units (GPUs) Graphics Pipeline Architectures 8800-GTX200 Fermi Cayman Performance Analysis
GPU System Architecture. Alan Gray EPCC The University of Edinburgh
GPU System Architecture EPCC The University of Edinburgh Outline Why do we want/need accelerators such as GPUs? GPU-CPU comparison Architectural reasons for GPU performance advantages GPU accelerated systems
Lecture 11: Multi-Core and GPU. Multithreading. Integration of multiple processor cores on a single chip.
Lecture 11: Multi-Core and GPU Multi-core computers Multithreading GPUs General Purpose GPUs Zebo Peng, IDA, LiTH 1 Multi-Core System Integration of multiple processor cores on a single chip. To provide
Supreme Court of Italy Improves Oracle Database Performance and I/O Access to Court Proceedings with OCZ s PCIe-based Virtualized Solution
enterprise Case Study Supreme Court of Italy Improves Oracle Database Performance and I/O Access to Court Proceedings with OCZ s PCIe-based Virtualized Solution Combination of Z-Drive R4 PCIe SSDs and
Redefining Flash Storage Solution
Redefining Flash Storage Solution Through Capacity + Efficiency + Performance + Form PRODUCT GUIDE Holistic Approach to Redefine Flash Storage Novachips is a leading provider of a broad range of Flash
New!! - Higher performance for Windows and UNIX environments
New!! - Higher performance for Windows and UNIX environments The IBM TotalStorage Network Attached Storage Gateway 300 (NAS Gateway 300) is designed to act as a gateway between a storage area network (SAN)
LS DYNA Performance Benchmarks and Profiling. January 2009
LS DYNA Performance Benchmarks and Profiling January 2009 Note The following research was performed under the HPC Advisory Council activities AMD, Dell, Mellanox HPC Advisory Council Cluster Center The
Accelerating MS SQL Server 2012
White Paper Accelerating MS SQL Server 2012 Unleashing the Full Power of SQL Server 2012 in Virtualized Data Centers Allon Cohen, PhD Scott Harlin OCZ Storage Solutions, Inc. A Toshiba Group Company 1
Family 12h AMD Athlon II Processor Product Data Sheet
Family 12h AMD Athlon II Processor Publication # 50322 Revision: 3.00 Issue Date: December 2011 Advanced Micro Devices 2011 Advanced Micro Devices, Inc. All rights reserved. The contents of this document
Optimizing GPU-based application performance for the HP for the HP ProLiant SL390s G7 server
Optimizing GPU-based application performance for the HP for the HP ProLiant SL390s G7 server Technology brief Introduction... 2 GPU-based computing... 2 ProLiant SL390s GPU-enabled architecture... 2 Optimizing
CUDA Optimization with NVIDIA Tools. Julien Demouth, NVIDIA
CUDA Optimization with NVIDIA Tools Julien Demouth, NVIDIA What Will You Learn? An iterative method to optimize your GPU code A way to conduct that method with Nvidia Tools 2 What Does the Application
AMD Processor Performance. AMD Phenom II Processors Discrete Platform Benchmarks December 2008
AMD Processor Performance AMD Phenom II Processors Discrete Platform Benchmarks December 2008 AMD Phenom II Performance Overall Performance of Office Productivity + Digital Media + Games AMD Phenom II
Introducing PgOpenCL A New PostgreSQL Procedural Language Unlocking the Power of the GPU! By Tim Child
Introducing A New PostgreSQL Procedural Language Unlocking the Power of the GPU! By Tim Child Bio Tim Child 35 years experience of software development Formerly VP Oracle Corporation VP BEA Systems Inc.
Accelerating Microsoft Exchange Servers with I/O Caching
Accelerating Microsoft Exchange Servers with I/O Caching QLogic FabricCache Caching Technology Designed for High-Performance Microsoft Exchange Servers Key Findings The QLogic FabricCache 10000 Series
VALAR: A BENCHMARK SUITE TO STUDY THE DYNAMIC BEHAVIOR OF HETEROGENEOUS SYSTEMS
VALAR: A BENCHMARK SUITE TO STUDY THE DYNAMIC BEHAVIOR OF HETEROGENEOUS SYSTEMS Perhaad Mistry, Yash Ukidave, Dana Schaa, David Kaeli Department of Electrical and Computer Engineering Northeastern University,
The Orca Chip... Heart of IBM s RISC System/6000 Value Servers
The Orca Chip... Heart of IBM s RISC System/6000 Value Servers Ravi Arimilli IBM RISC System/6000 Division 1 Agenda. Server Background. Cache Heirarchy Performance Study. RS/6000 Value Server System Structure.
This Unit: Putting It All Together. CIS 501 Computer Architecture. Sources. What is Computer Architecture?
This Unit: Putting It All Together CIS 501 Computer Architecture Unit 11: Putting It All Together: Anatomy of the XBox 360 Game Console Slides originally developed by Amir Roth with contributions by Milo
HOW MANY USERS CAN I GET ON A SERVER? This is a typical conversation we have with customers considering NVIDIA GRID vgpu:
THE QUESTION HOW MANY USERS CAN I GET ON A SERVER? This is a typical conversation we have with customers considering NVIDIA GRID vgpu: How many users can I get on a server? NVIDIA: What is their primary
QLIKVIEW SERVER MEMORY MANAGEMENT AND CPU UTILIZATION
QLIKVIEW SERVER MEMORY MANAGEMENT AND CPU UTILIZATION QlikView Scalability Center Technical Brief Series September 2012 qlikview.com Introduction This technical brief provides a discussion at a fundamental
Measuring Cache and Memory Latency and CPU to Memory Bandwidth
White Paper Joshua Ruggiero Computer Systems Engineer Intel Corporation Measuring Cache and Memory Latency and CPU to Memory Bandwidth For use with Intel Architecture December 2008 1 321074 Executive Summary
EUCIP IT Administrator - Module 1 PC Hardware Syllabus Version 3.0
EUCIP IT Administrator - Module 1 PC Hardware Syllabus Version 3.0 Copyright 2011 ECDL Foundation All rights reserved. No part of this publication may be reproduced in any form except as permitted by ECDL
Motivation: Smartphone Market
Motivation: Smartphone Market Smartphone Systems External Display Device Display Smartphone Systems Smartphone-like system Main Camera Front-facing Camera Central Processing Unit Device Display Graphics
Emerging storage and HPC technologies to accelerate big data analytics Jerome Gaysse JG Consulting
Emerging storage and HPC technologies to accelerate big data analytics Jerome Gaysse JG Consulting Introduction Big Data Analytics needs: Low latency data access Fast computing Power efficiency Latest
OpenPOWER Outlook AXEL KOEHLER SR. SOLUTION ARCHITECT HPC
OpenPOWER Outlook AXEL KOEHLER SR. SOLUTION ARCHITECT HPC Driving industry innovation The goal of the OpenPOWER Foundation is to create an open ecosystem, using the POWER Architecture to share expertise,
Parallel Firewalls on General-Purpose Graphics Processing Units
Parallel Firewalls on General-Purpose Graphics Processing Units Manoj Singh Gaur and Vijay Laxmi Kamal Chandra Reddy, Ankit Tharwani, Ch.Vamshi Krishna, Lakshminarayanan.V Department of Computer Engineering
GPU File System Encryption Kartik Kulkarni and Eugene Linkov
GPU File System Encryption Kartik Kulkarni and Eugene Linkov 5/10/2012 SUMMARY. We implemented a file system that encrypts and decrypts files. The implementation uses the AES algorithm computed through
By Citrix Consulting Services. Citrix Systems, Inc.
By Citrix Consulting Services Citrix Systems, Inc. Disclaimer The objective of this white paper is to provide recommendations for ICA Client settings based on network environment configuration. The testing
Configuring Memory on the HP Business Desktop dx5150
Configuring Memory on the HP Business Desktop dx5150 Abstract... 2 Glossary of Terms... 2 Introduction... 2 Main Memory Configuration... 3 Single-channel vs. Dual-channel... 3 Memory Type and Speed...
DELL. Virtual Desktop Infrastructure Study END-TO-END COMPUTING. Dell Enterprise Solutions Engineering
DELL Virtual Desktop Infrastructure Study END-TO-END COMPUTING Dell Enterprise Solutions Engineering 1 THIS WHITE PAPER IS FOR INFORMATIONAL PURPOSES ONLY, AND MAY CONTAIN TYPOGRAPHICAL ERRORS AND TECHNICAL
Virtualizing a Virtual Machine
Virtualizing a Virtual Machine Azeem Jiva Shrinivas Joshi AMD Java Labs TS-5227 Learn best practices for deploying Java EE applications in virtualized environment 2008 JavaOne SM Conference java.com.sun/javaone
Summary. Key results at a glance:
An evaluation of blade server power efficiency for the, Dell PowerEdge M600, and IBM BladeCenter HS21 using the SPECjbb2005 Benchmark The HP Difference The ProLiant BL260c G5 is a new class of server blade
solution brief September 2011 Can You Effectively Plan For The Migration And Management of Systems And Applications on Vblock Platforms?
solution brief September 2011 Can You Effectively Plan For The Migration And Management of Systems And Applications on Vblock Platforms? CA Capacity Management and Reporting Suite for Vblock Platforms
Achieving Nanosecond Latency Between Applications with IPC Shared Memory Messaging
Achieving Nanosecond Latency Between Applications with IPC Shared Memory Messaging In some markets and scenarios where competitive advantage is all about speed, speed is measured in micro- and even nano-seconds.
Leveraging Aparapi to Help Improve Financial Java Application Performance
Leveraging Aparapi to Help Improve Financial Java Application Performance Shrinivas Joshi, Software Performance Engineer Abstract Graphics Processing Unit (GPU) and Accelerated Processing Unit (APU) offload
NVIDIA GeForce GTX 580 GPU Datasheet
NVIDIA GeForce GTX 580 GPU Datasheet NVIDIA GeForce GTX 580 GPU Datasheet 3D Graphics Full Microsoft DirectX 11 Shader Model 5.0 support: o NVIDIA PolyMorph Engine with distributed HW tessellation engines
GPU Architectures. A CPU Perspective. Data Parallelism: What is it, and how to exploit it? Workload characteristics
GPU Architectures A CPU Perspective Derek Hower AMD Research 5/21/2013 Goals Data Parallelism: What is it, and how to exploit it? Workload characteristics Execution Models / GPU Architectures MIMD (SPMD),
NVIDIA GRID DASSAULT CATIA V5/V6 SCALABILITY GUIDE. NVIDIA Performance Engineering Labs PerfEngDoc-SG-DSC01v1 March 2016
NVIDIA GRID DASSAULT V5/V6 SCALABILITY GUIDE NVIDIA Performance Engineering Labs PerfEngDoc-SG-DSC01v1 March 2016 HOW MANY USERS CAN I GET ON A SERVER? The purpose of this guide is to give a detailed analysis
can you effectively plan for the migration and management of systems and applications on Vblock Platforms?
SOLUTION BRIEF CA Capacity Management and Reporting Suite for Vblock Platforms can you effectively plan for the migration and management of systems and applications on Vblock Platforms? agility made possible
Intel Data Direct I/O Technology (Intel DDIO): A Primer >
Intel Data Direct I/O Technology (Intel DDIO): A Primer > Technical Brief February 2012 Revision 1.0 Legal Statements INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE,
Hardware Level IO Benchmarking of PCI Express*
White Paper James Coleman Performance Engineer Perry Taylor Performance Engineer Intel Corporation Hardware Level IO Benchmarking of PCI Express* December, 2008 321071 Executive Summary Understanding the
In-Memory Databases Algorithms and Data Structures on Modern Hardware. Martin Faust David Schwalb Jens Krüger Jürgen Müller
In-Memory Databases Algorithms and Data Structures on Modern Hardware Martin Faust David Schwalb Jens Krüger Jürgen Müller The Free Lunch Is Over 2 Number of transistors per CPU increases Clock frequency
THE AMD MISSION 2 AN INTRODUCTION TO AMD NOVEMBER 2014
THE AMD MISSION To be the leading designer and integrator of innovative, tailored technology solutions that empower people to push the boundaries of what is possible 2 AN INTRODUCTION TO AMD NOVEMBER 2014
Parallel Programming Survey
Christian Terboven 02.09.2014 / Aachen, Germany Stand: 26.08.2014 Version 2.3 IT Center der RWTH Aachen University Agenda Overview: Processor Microarchitecture Shared-Memory
Rambus Smart Data Acceleration
Rambus Smart Data Acceleration Back to the Future Memory and Data Access: The Final Frontier As an industry, if real progress is to be made towards the level of computing that the future mandates, then
Virtualization Performance Analysis November 2010 Effect of SR-IOV Support in Red Hat KVM on Network Performance in Virtualized Environments
Virtualization Performance Analysis November 2010 Effect of SR-IOV Support in Red Hat KVM on Network Performance in Steve Worley System x Performance Analysis and Benchmarking IBM Systems and Technology
Memory Architecture and Management in a NoC Platform
Architecture and Management in a NoC Platform Axel Jantsch Xiaowen Chen Zhonghai Lu Chaochao Feng Abdul Nameed Yuang Zhang Ahmed Hemani DATE 2011 Overview Motivation State of the Art Data Management Engine
Leading Virtualization Performance and Energy Efficiency in a Multi-processor Server
Leading Virtualization Performance and Energy Efficiency in a Multi-processor Server Product Brief Intel Xeon processor 7400 series Fewer servers. More performance. With the architecture that s specifically
Improve Business Productivity and User Experience with a SanDisk Powered SQL Server 2014 In-Memory OLTP Database
WHITE PAPER Improve Business Productivity and User Experience with a SanDisk Powered SQL Server 2014 In-Memory OLTP Database 951 SanDisk Drive, Milpitas, CA 95035 www.sandisk.com Table of Contents Executive
Accelerating High-Speed Networking with Intel I/O Acceleration Technology
White Paper Intel I/O Acceleration Technology Accelerating High-Speed Networking with Intel I/O Acceleration Technology The emergence of multi-gigabit Ethernet allows data centers to adapt to the increasing
Enhancing Cloud-based Servers by GPU/CPU Virtualization Management
Enhancing Cloud-based Servers by GPU/CPU Virtualiz Management Tin-Yu Wu 1, Wei-Tsong Lee 2, Chien-Yu Duan 2 Department of Computer Science and Inform Engineering, Nal Ilan University, Taiwan, ROC 1 Department
Infor Web UI Sizing and Deployment for a Thin Client Solution
Infor Web UI Sizing and Deployment for a Thin Client Solution Copyright 2012 Infor Important Notices The material contained in this publication (including any supplementary information) constitutes and
APU/GPGPU-BASED SECURITY SOLUTIONS. Vikenty Frantsev ALTELL CEO
APU/GPGPU-BASED SECURITY SOLUTIONS Vikenty Frantsev ALTELL CEO ALTELL: KEY FACTS Core business: IT security, software development, network appliances design & manufacturing Founded: Year 2006 Vertical
HP Z Turbo Drive PCIe SSD
Performance Evaluation of HP Z Turbo Drive PCIe SSD Powered by Samsung XP941 technology Evaluation Conducted Independently by: Hamid Taghavi Senior Technical Consultant June 2014 Sponsored by: P a g e
Introduction to GP-GPUs. Advanced Computer Architectures, Cristina Silvano, Politecnico di Milano 1
Introduction to GP-GPUs Advanced Computer Architectures, Cristina Silvano, Politecnico di Milano 1 GPU Architectures: How do we reach here? NVIDIA Fermi, 512 Processing Elements (PEs) 2 What Can It Do?
Performance Analysis and Software Optimization on Systems Using the LAN91C111
AN 10.12 Performance Analysis and Software Optimization on Systems Using the LAN91C111 1 Introduction This application note describes one approach to analyzing the performance of a LAN91C111 implementation
FPGA Acceleration using OpenCL & PCIe Accelerators MEW 25
FPGA Acceleration using OpenCL & PCIe Accelerators MEW 25 December 2014 FPGAs in the news» Catapult» Accelerate BING» 2x search acceleration:» ½ the number of servers»
Removing Performance Bottlenecks in Databases with Red Hat Enterprise Linux and Violin Memory Flash Storage Arrays. Red Hat Performance Engineering
Removing Performance Bottlenecks in Databases with Red Hat Enterprise Linux and Violin Memory Flash Storage Arrays Red Hat Performance Engineering Version 1.0 August 2013 1801 Varsity Drive Raleigh NC
Cloud Data Center Acceleration 2015
Cloud Data Center Acceleration 2015 Agenda! Computer & Storage Trends! Server and Storage System - Memory and Homogenous Architecture - Direct Attachment! Memory Trends! Acceleration Introduction! FPGA
Moving Beyond CPUs in the Cloud: Will FPGAs Sink or Swim?
Moving Beyond CPUs in the Cloud: Will FPGAs Sink or Swim? Successful FPGA datacenter usage at scale will require differentiated capability, programming ease, and scalable implementation models Executive
big.little Technology Moves Towards Fully Heterogeneous Global Task Scheduling Improving Energy Efficiency and Performance in Mobile Devices
big.little Technology Moves Towards Fully Heterogeneous Global Task Scheduling Improving Energy Efficiency and Performance in Mobile Devices Brian Jeff November, 2013 Abstract ARM big.little processing
Best Practices for Installing and Configuring the Hyper-V Role on the LSI CTS2600 Storage System for Windows 2008
Best Practices Best Practices for Installing and Configuring the Hyper-V Role on the LSI CTS2600 Storage System for Windows 2008 Installation and Configuration Guide 2010 LSI Corporation August 13, 2010
Graphical Processing Units to Accelerate Orthorectification, Atmospheric Correction and Transformations for Big Data
Graphical Processing Units to Accelerate Orthorectification, Atmospheric Correction and Transformations for Big Data Amanda O Connor, Bryan Justice, and A. Thomas Harris IN52A. Big Data in the Geosciences:
How To Compare Two Servers For A Test On A Poweredge R710 And Poweredge G5P (Poweredge) (Power Edge) (Dell) Poweredge Poweredge And Powerpowerpoweredge (Powerpower) G5I (
TEST REPORT MARCH 2009 Server management solution comparison on Dell PowerEdge R710 and HP Executive summary Dell Inc. (Dell) commissioned Principled Technologies (PT) to compare server management solutions
Compatibility Matrix BES12. September 16, 2015
Compatibility Matrix BES12 September 16, 2015 Published: 2015-09-16 SWD-20150916153710116 Contents Introduction... 4 Legend...5 BES12 server... 6 Operating system...6 Database server...6 Browser... 8 Mobile
Tips and Best Practices for Managing a Private Cloud
Deploying and Managing Private Clouds The Essentials Series Tips and Best Practices for Managing a Private Cloud sponsored by Tip s and Best Practices for Managing a Private Cloud... 1 Es tablishing Policies
Linux VM Infrastructure for memory power management
Linux VM Infrastructure for memory power management Ankita Garg Vaidyanathan Srinivasan IBM Linux Technology Center Agenda - Saving Power Motivation - Why Save Power Benefits How can it be achieved Role
HP ProLiant BL660c Gen9 and Microsoft SQL Server 2014 technical brief
Technical white paper HP ProLiant BL660c Gen9 and Microsoft SQL Server 2014 technical brief Scale-up your Microsoft SQL Server environment to new heights Table of contents Executive summary... 2 Introduction...
XID ERRORS. vr352 May 2015. XID Errors
ID ERRORS vr352 May 2015 ID Errors Introduction... 1 1.1. What Is an id Message... 1 1.2. How to Use id Messages... 1 Working with id Errors... 2 2.1. Viewing id Error Messages... 2 2.2. Tools That Provide
The Transition to PCI Express* for Client SSDs
The Transition to PCI Express* for Client SSDs Amber Huffman Senior Principal Engineer Intel Santa Clara, CA 1 *Other names and brands may be claimed as the property of others. Legal Notices and Disclaimers
Overview on Modern Accelerators and Programming Paradigms Ivan Giro7o [email protected]
Overview on Modern Accelerators and Programming Paradigms Ivan Giro7o [email protected] Informa(on & Communica(on Technology Sec(on (ICTS) Interna(onal Centre for Theore(cal Physics (ICTP) Mul(ple Socket
LOAD BALANCING IN THE MODERN DATA CENTER WITH BARRACUDA LOAD BALANCER FDC T740
LOAD BALANCING IN THE MODERN DATA CENTER WITH BARRACUDA LOAD BALANCER FDC T740 Balancing Web traffic across your front-end servers is key to a successful enterprise Web presence. Web traffic is much like
A Close Look at PCI Express SSDs. Shirish Jamthe Director of System Engineering Virident Systems, Inc. August 2011
A Close Look at PCI Express SSDs Shirish Jamthe Director of System Engineering Virident Systems, Inc. August 2011 Macro Datacenter Trends Key driver: Information Processing Data Footprint (PB) CAGR: 100%
Outline. Introduction. Multiprocessor Systems on Chip. A MPSoC Example: Nexperia DVP. A New Paradigm: Network on Chip
Outline Modeling, simulation and optimization of Multi-Processor SoCs (MPSoCs) Università of Verona Dipartimento di Informatica MPSoCs: Multi-Processor Systems on Chip A simulation platform for a MPSoC
Intel Itanium Quad-Core Architecture for the Enterprise. Lambert Schaelicke Eric DeLano
Intel Itanium Quad-Core Architecture for the Enterprise Lambert Schaelicke Eric DeLano Agenda Introduction Intel Itanium Roadmap Intel Itanium Processor 9300 Series Overview Key Features Pipeline Overview
DIABLO TECHNOLOGIES MEMORY CHANNEL STORAGE AND VMWARE VIRTUAL SAN : VDI ACCELERATION
DIABLO TECHNOLOGIES MEMORY CHANNEL STORAGE AND VMWARE VIRTUAL SAN : VDI ACCELERATION A DIABLO WHITE PAPER AUGUST 2014 Ricky Trigalo Director of Business Development Virtualization, Diablo Technologies
Compatibility Matrix BES10. April 27, 2016. Version 10.2 and later
Compatibility Matrix BES10 April 27, 2016 Version 10.2 and later Published: 2016-04-28 SWD-20160428152359812 Contents Enterprise Service 10 Compatibility Matrix... 4 Introduction...4 Legend... 4 Operating
Sockets vs. RDMA Interface over 10-Gigabit Networks: An In-depth Analysis of the Memory Traffic Bottleneck
Sockets vs. RDMA Interface over 1-Gigabit Networks: An In-depth Analysis of the Memory Traffic Bottleneck Pavan Balaji Hemal V. Shah D. K. Panda Network Based Computing Lab Computer Science and Engineering
