FLOATING-POINT ARITHMETIC IN AMD PROCESSORS
MICHAEL SCHULTE
AMD RESEARCH
JUNE 2015




AGENDA
  The Kaveri Accelerated Processing Unit (APU)
  The Graphics Core Next Architecture and its Floating-Point Arithmetic
  The Steamroller x86 CPU and its Floating-Point Arithmetic
  Summary and Future Directions

THE KAVERI ACCELERATED PROCESSING UNIT (APU)

[Block diagram: Kaveri APU with 4 Steamroller CPU cores, 8 GCN GPU compute units, hUMA shared memory controller, multimedia blocks (AMD TrueAudio, UVD, VCE), connectivity (PCIe Gen 3, PCIe Gen 2), and display (DisplayPort 1.2, HDMI 1.4a, DVI, VGA)]

EXCELLENT COMPUTE PERFORMANCE
  4 "Steamroller" CPU cores
  8 Graphics Core Next (GCN) GPU compute units
  Heterogeneous System Architecture (HSA) enabled
ENHANCED MULTIMEDIA EXPERIENCES
  Video acceleration
  AMD TrueAudio
  4 display heads
HIGH PERFORMANCE CONNECTIVITY
  128-bit DDR3 up to 2133 MHz
  PCI-Express Gen3 x16 for discrete graphics upgrade
  PCI-Express for direct-attach NVM SSD

THE KAVERI APU AND HSA

EQUAL ACCESS TO ENTIRE MEMORY (hUMA)
  GPU and CPU have uniform visibility into entire memory system
APU GFLOPS
  737.3 GPU GFLOPS
  118.4 CPU GFLOPS
  Programs have access to all of Kaveri APU compute power
ALL-PROCESSORS-EQUAL (hQ)
  GPU and CPU have equal flexibility to be used to create and dispatch work items
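The two GFLOPS figures on this slide can be reproduced from the unit counts given elsewhere in the deck. The sketch below is only a back-of-the-envelope check: the unit counts come from the slides, but both clock rates are assumptions chosen because they reproduce the quoted numbers, and the split of the four CPU cores into two shared-FPU modules is a reading of the Steamroller slides rather than something stated here.

```python
# Peak single-precision GFLOPS estimate for the Kaveri APU.
# Unit counts are from the slides; the clock rates are ASSUMPTIONS.

GPU_COMPUTE_UNITS = 8       # slides: 8 GCN compute units
SIMDS_PER_CU = 4            # slides: 4 SIMD vector units per CU
LANES_PER_SIMD = 16         # slides: 16-lane SIMD units
FLOPS_PER_FMA = 2           # a fused multiply-add counts as 2 FLOPs
GPU_CLOCK_GHZ = 0.72        # ASSUMPTION: not stated in the slides

CPU_MODULES = 2             # ASSUMPTION: 4 cores arranged as 2 two-core modules
FMAC_PIPES_PER_MODULE = 2   # slides: dual 128-bit FMAC pipes in the shared FPU
SP_LANES_PER_PIPE = 128 // 32  # 32-bit single-precision lanes per 128-bit pipe
CPU_CLOCK_GHZ = 3.7         # ASSUMPTION: not stated in the slides


def gpu_peak_gflops() -> float:
    """One FMA per lane per clock across every SIMD lane of the GPU."""
    lanes = GPU_COMPUTE_UNITS * SIMDS_PER_CU * LANES_PER_SIMD
    return lanes * FLOPS_PER_FMA * GPU_CLOCK_GHZ


def cpu_peak_gflops() -> float:
    """One FMA per lane per clock across every FMAC pipe of the CPU."""
    lanes = CPU_MODULES * FMAC_PIPES_PER_MODULE * SP_LANES_PER_PIPE
    return lanes * FLOPS_PER_FMA * CPU_CLOCK_GHZ


if __name__ == "__main__":
    gpu = gpu_peak_gflops()
    cpu = cpu_peak_gflops()
    print(f"GPU: {gpu:.1f} GFLOPS, CPU: {cpu:.1f} GFLOPS, total: {gpu + cpu:.1f}")
```

With these assumed clocks the GPU term rounds to the slide's 737.3 and the CPU term matches 118.4; other clock rates would scale the results linearly.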

GRAPHICS CORE NEXT ARCHITECTURE

"KAVERI" GPU: GRAPHICS CORE NEXT (GCN) ARCHITECTURE

[Block diagram: Graphics Command Processor and 8 ACEs feeding a Shader Engine (Geometry Processor, Rasterizer, 8 CUs, 2 RBs), with Global Data Share and L2 Cache. Each CU contains: Branch & Message Unit, Scheduler, Vector Units (4x SIMD-16), Scalar Unit, Texture Filter Units (4), Texture Fetch Load / Store Units (16), Vector Registers (4x 64KB), Local Data Share (64KB), Scalar Registers (8KB), L1 Cache (16KB)]

  47% of Kaveri is dedicated to the GPU
  Graphics Command Processor and 8 Asynchronous Compute Engines (ACE) for Multitasking Compute
  Shader Engine for Graphics and Compute
  8 Compute Units (CUs) = 32 16-lane SIMD Units for Arithmetic Processing

GCN COMPUTE UNIT: SIMD SPECIFICS

[Block diagram of one CU: Branch & Message Unit, Scheduler, Vector Units (4x SIMD-16), Scalar Unit, Texture Filter Units (4), Texture Fetch Load / Store Units (16), Vector Registers (4x 64KB), Local Data Share (64KB), Scalar Registers (8KB), L1 Cache (16KB)]

  Each Compute Unit (CU) contains 4 SIMD Vector Units.
  Each SIMD Vector Unit has:
    A 16-lane integer and floating-point vector ALU
    A 64KB Vector General Purpose Register (VGPR) file
    A 48-bit Program Counter
    An instruction buffer for 10 wavefronts
  A wavefront is a group of 64 threads: the size of one logical VGPR
  A 64-thread wavefront issues to a 16-lane SIMD Unit over four cycles
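The arithmetic behind the last two statements can be sketched as follows. The sizes are taken from the slide; the 4-byte register width per thread and the idea that threads are issued in consecutive groups of 16 are assumptions made for illustration, and the function names are hypothetical.

```python
# How a 64-thread wavefront maps onto a 16-lane SIMD and its 64KB VGPR file.

WAVEFRONT_SIZE = 64            # slide: threads per wavefront
SIMD_LANES = 16                # slide: lanes per SIMD vector unit
VGPR_FILE_BYTES = 64 * 1024    # slide: 64KB VGPR file per SIMD
BYTES_PER_THREAD_REGISTER = 4  # ASSUMPTION: one 32-bit value per thread


def issue_cycles(wavefront_size: int = WAVEFRONT_SIZE,
                 lanes: int = SIMD_LANES) -> int:
    """Cycles needed to push every thread of a wavefront through the lanes."""
    return -(-wavefront_size // lanes)  # ceiling division


def logical_vgprs(file_bytes: int = VGPR_FILE_BYTES) -> int:
    """Number of wavefront-wide (64 x 32-bit) registers that fit in the file."""
    bytes_per_logical_vgpr = WAVEFRONT_SIZE * BYTES_PER_THREAD_REGISTER
    return file_bytes // bytes_per_logical_vgpr


def lane_schedule() -> list:
    """Thread indices issued in each cycle (ASSUMED consecutive grouping)."""
    return [range(start, min(start + SIMD_LANES, WAVEFRONT_SIZE))
            for start in range(0, WAVEFRONT_SIZE, SIMD_LANES)]


if __name__ == "__main__":
    print("issue cycles:", issue_cycles())
    print("logical VGPRs per SIMD:", logical_vgprs())
    for cycle, threads in enumerate(lane_schedule()):
        print(f"cycle {cycle}: threads {threads.start}..{threads.stop - 1}")
```

Under these assumptions one logical VGPR occupies 256 bytes, so a 64KB file holds 256 of them per SIMD.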

GCN COMPUTE UNIT: SIMD SPECIFICS (CONTINUED)

  Each CU can have 40 wavefronts in flight
    Up to 2560 parallel threads
  Each 16-lane SIMD Vector Unit
    Supports half, single, and double precision floating-point arithmetic
    Issues one half or single precision instruction per lane per clock (including FMA)
    Issues double-precision operations at a reduced rate
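The 40-wavefront and 2560-thread limits follow directly from the per-SIMD figures. The sketch below also shows one plausible way register usage could lower the number of wavefronts in flight; that occupancy model is an assumption for illustration (it ignores allocation granularity and other resources such as scalar registers and local memory), and the function names are hypothetical.

```python
# In-flight wavefront limits for one GCN compute unit.

SIMDS_PER_CU = 4               # slides: 4 SIMD vector units per CU
MAX_WAVES_PER_SIMD = 10        # slides: instruction buffer for 10 wavefronts
WAVEFRONT_SIZE = 64            # slides: 64 threads per wavefront
LOGICAL_VGPRS_PER_SIMD = 256   # derived: 64KB / (64 threads * 4 bytes), 32-bit registers assumed


def max_threads_per_cu() -> int:
    """Upper bound on threads in flight when nothing else limits occupancy."""
    return SIMDS_PER_CU * MAX_WAVES_PER_SIMD * WAVEFRONT_SIZE


def waves_per_simd(vgprs_per_thread: int) -> int:
    """ASSUMED model: wavefronts per SIMD limited only by VGPR usage."""
    if vgprs_per_thread <= 0:
        raise ValueError("vgprs_per_thread must be positive")
    return min(MAX_WAVES_PER_SIMD, LOGICAL_VGPRS_PER_SIMD // vgprs_per_thread)


def threads_in_flight(vgprs_per_thread: int) -> int:
    """Threads in flight per CU under the assumed VGPR-only model."""
    return SIMDS_PER_CU * waves_per_simd(vgprs_per_thread) * WAVEFRONT_SIZE


if __name__ == "__main__":
    print("peak threads per CU:", max_threads_per_cu())
    for regs in (24, 64, 128):
        print(f"{regs} VGPRs/thread -> {waves_per_simd(regs)} waves per SIMD, "
              f"{threads_in_flight(regs)} threads per CU")
```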

VECTOR INSTRUCTIONS

[Instruction format diagram: 2 Operand, 1 Operand, Compare Ops, Vector Interpolation, 3 Operand]

  1-3 source operands from Vector General Purpose Register (VGPR) file
  1 source operand can be from one of several other sources (e.g., 32-bit constant, scalar unit, local memory)
  32-bit instruction encoding for most common vector ALU ops
  64-bit instruction encoding enables 3 source operands and input/output modifiers
    Can apply absolute value or negation to any input operand
    Can multiply result by 0.5, 1, 2, or 4
    Can clamp result to the range [-1.0, +1.0]
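The input and output modifiers listed above can be modeled in software as a wrapper around an ordinary two-operand operation. This is a behavioral sketch only: the function name and keyword arguments are hypothetical, the ordering (absolute value before negation, scaling before clamping) is an assumption, and the clamp range is the one quoted on the slide.

```python
import operator

ALLOWED_OUTPUT_SCALES = (0.5, 1.0, 2.0, 4.0)  # slide: multiply result by 0.5, 1, 2, or 4


def vector_alu_op(op, a, b, *, abs_a=False, neg_a=False,
                  abs_b=False, neg_b=False,
                  output_scale=1.0, clamp=False):
    """Apply a two-operand op with input and output modifiers (per-lane model)."""
    if output_scale not in ALLOWED_OUTPUT_SCALES:
        raise ValueError("output_scale must be one of 0.5, 1, 2, 4")

    def modify(x, use_abs, use_neg):
        # ASSUMED order: absolute value first, then negation, so -|x| is expressible.
        if use_abs:
            x = abs(x)
        if use_neg:
            x = -x
        return x

    result = op(modify(a, abs_a, neg_a), modify(b, abs_b, neg_b))
    result *= output_scale                     # output multiplier
    if clamp:
        result = max(-1.0, min(1.0, result))   # range as stated on the slide
    return result


if __name__ == "__main__":
    # |-3.0| + 0.5 = 3.5, halved to 1.75, then clamped to 1.0
    print(vector_alu_op(operator.add, -3.0, 0.5,
                        abs_a=True, output_scale=0.5, clamp=True))
```

Folding these modifiers into the instruction encoding means operations such as `-|a| + b` need no separate instructions for the absolute value or negation.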

VECTOR UNIT FLOATING-POINT ARITHMETIC: DATATYPES AND MODES

  Floating-point Arithmetic Datatypes
    16-bit, 32-bit, and 64-bit floating-point numbers as defined by IEEE 754-2008
    32-bit constant floating-point values can be expanded to 64 bits with zero padding on LSBs
  IEEE Rounding Modes
    Round to nearest even, Round toward +Infinity, Round toward -Infinity, Round toward zero
    Provided under software control anywhere in the program
    Single and double precision rounding modes are controlled separately
  Sub-normal Handling Modes
    Separate control for input sub-normal flush to zero and underflow flush to zero
    Provided under software control anywhere in the program
    Separate control for single and double precision numbers
  Exceptions Support
    Inexact, underflow, overflow, division by zero, sub-normal, invalid operation, and integer divide by zero
    Provided in hardware with mechanisms for software recording and reporting
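To make the four IEEE rounding modes concrete, the sketch below rounds an exact rational value to a given number of significand bits under each mode. It is a software illustration, not a description of the hardware: the function and mode names are hypothetical, and the exponent range is treated as unbounded, so overflow, underflow, and sub-normal handling are deliberately left out.

```python
from fractions import Fraction
import math


def round_to_precision(x: Fraction, precision_bits: int, mode: str) -> Fraction:
    """Round x to precision_bits significand bits (exponent range unbounded)."""
    if x == 0:
        return Fraction(0)
    sign = -1 if x < 0 else 1
    mag = abs(x)

    # Find e with 2**e <= mag < 2**(e + 1).
    e = mag.numerator.bit_length() - mag.denominator.bit_length()
    if Fraction(2) ** e > mag:
        e -= 1

    ulp = Fraction(2) ** (e - precision_bits + 1)  # spacing of representable values
    scaled = mag / ulp
    lower = math.floor(scaled)
    remainder = scaled - lower

    if remainder == 0:
        n = lower
    elif mode == "nearest_even":
        half = Fraction(1, 2)
        round_up = remainder > half or (remainder == half and lower % 2 == 1)
        n = lower + 1 if round_up else lower
    elif mode == "toward_zero":
        n = lower
    elif mode == "toward_pos_inf":
        n = lower + 1 if sign > 0 else lower
    elif mode == "toward_neg_inf":
        n = lower + 1 if sign < 0 else lower
    else:
        raise ValueError(f"unknown rounding mode: {mode}")
    return sign * n * ulp


if __name__ == "__main__":
    tenth = Fraction(1, 10)
    for mode in ("nearest_even", "toward_pos_inf", "toward_neg_inf", "toward_zero"):
        # 24 significand bits corresponds to IEEE single precision
        print(f"{mode:>15}: {float(round_to_precision(tenth, 24, mode)):.12f}")
```

For 1/10 at 24 bits, round-to-nearest-even and round-toward-positive-infinity pick the neighbor just above 0.1, while the other two modes pick the neighbor just below it.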

VECTOR UNIT FLOATING-POINT ARITHMETIC: OPERATIONS

  FMA (Fused Multiply Add)
    Single cycle issue instruction
    IEEE 754-2008 precise with all round modes and proper handling of NaN/Inf/Zero
    Full sub-normal support in hardware
  MULADD (Multiply Add)
    Single cycle issue instruction
    IEEE MUL followed by IEEE ADD with round and normalization after both multiplication and subsequent addition
  VCMP (Vector Compare)
    A full set of compare operations designed to fully implement all the IEEE 754-2008 comparison predicates
    Return 1 bit per work-item to a named scalar register pair called the Vector Condition Code (VCC) and optionally to an Execute Mask register
  3 Operand Selection Operations
    V_MIN3, V_MAX3, V_MED3
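The difference between FMA (one rounding) and MULADD (a rounding after the multiply and another after the add) shows up whenever the exact product does not fit in the working precision. The following sketch emulates both in double precision using exact rational arithmetic; it assumes finite inputs and round-to-nearest-even, and the helper names are hypothetical.

```python
from fractions import Fraction


def muladd(a: float, b: float, c: float) -> float:
    """Multiply, round, add, round again (two roundings)."""
    product = a * b      # rounded to double precision
    return product + c   # rounded a second time


def fma(a: float, b: float, c: float) -> float:
    """Fused multiply-add emulation: exact a*b + c, rounded once.

    Finite inputs only; converting a Fraction to float is correctly rounded.
    """
    exact = Fraction(a) * Fraction(b) + Fraction(c)
    return float(exact)


if __name__ == "__main__":
    a = 1.0 + 2.0 ** -30
    b = 1.0 - 2.0 ** -30
    c = -1.0
    # Exact product is 1 - 2**-60, which rounds to 1.0 in double precision.
    print("muladd:", muladd(a, b, c))  # the small term is lost
    print("fma:   ", fma(a, b, c))     # the small term survives
```

Here the two-rounding version returns zero because the product rounds to exactly 1.0 before the addition, while the fused version keeps the residual of -2^-60.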

VECTOR UNIT FLOATING-POINT ARITHMETIC: OPERATIONS (CONTINUED)

  FP Conversion Ops
    Between 16-bit, 32-bit, and 64-bit floating-point values with full IEEE 754-2008 precision and rounding
  64-bit Transcendental Approximations
    Hardware based double precision approximations for reciprocal, reciprocal square root and square root
  16-bit and 32-bit Transcendental Approximations
    Hardware based approximations for reciprocal, reciprocal square root, square root, exponent, logarithm, sine, cosine
  Several other FP Arithmetic Operations
    Basic arithmetic operations, min and max, interpolation, rounding and truncation, extract exponent or mantissa, etc.
  24-bit Integer MUL/MULADD/LOGICAL/SPECIAL at full single-precision rates
    Heavily utilized for integer thread group address calculation

References: http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2013/07/amd_gcn3_instruction_set_architecture.pdf
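The effect of converting between the three IEEE 754 widths can be illustrated on a host machine with Python's standard `struct` module, whose `e`, `f`, and `d` format codes correspond to the 16-bit, 32-bit, and 64-bit binary formats. This only demonstrates the rounding behavior of the formats themselves under round-to-nearest; it says nothing about how the GPU conversion instructions are implemented, and the helper names are hypothetical.

```python
import struct


def to_half(x: float) -> float:
    """Round a double to the nearest 16-bit float and widen it back."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_single(x: float) -> float:
    """Round a double to the nearest 32-bit float and widen it back."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


if __name__ == "__main__":
    x = 0.1
    print(f"double: {x:.20f}")
    print(f"single: {to_single(x):.20f}")
    print(f"half:   {to_half(x):.20f}")
    # Widening is exact: every half value is representable in single precision.
    print("half -> single exact:", to_single(to_half(x)) == to_half(x))
```

Narrowing conversions lose information and therefore depend on the rounding mode, whereas widening conversions are always exact.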

STEAMROLLER X86 CPU

STEAMROLLER CORE MICROARCHITECTURE: SHARED FPU

[Block diagram of a Steamroller module: shared Fetch, two Decode units, two Integer Schedulers with their own L1 DCaches, one shared FP Scheduler feeding two 128-bit FMAC pipes and an MMX Unit, and a Shared L2 Cache]

  Two tightly linked cores share resources to increase efficiency
  Floating-point co-processor organization
    Dual 128-bit FMAC pipes
    MMX Unit for packed integer arithmetic
    Unified scheduler for both threads

STEAMROLLER FPU: INSTRUCTION SET ARCHITECTURE

  Supports all of the x87 FP operations
    80-bit floating-point numbers
  Full support for IEEE 754-2008
  Several ISA extensions including MMX, SSE to SSE 4.2, AES, CLM, AVX, and AVX2
    Vector operations for multimedia, security, and scientific computing
  Several integer and floating-point instructions with up to 3 source operands

SUMMARY AND FUTURE DIRECTIONS

  Vector integer and floating-point arithmetic is very important in CPUs and GPUs
    GPUs are optimized for high throughput and power efficiency
    CPUs are optimized for low latency and power efficiency
    Integrating CPUs and GPUs provides efficient processing of parallel and serial code using high-level languages
  Support for IEEE 754-2008 floating-point arithmetic is essential
    Several additional operations provided for graphics, multimedia, and scientific computing
  Future Directions
    Power-efficient floating-point arithmetic
    Efficient support for multiple precisions
    Efficient vector floating-point reduction and fused operations
    New operations for emerging graphics, multimedia, and scientific computing applications

DISCLAIMER & ATTRIBUTION

The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors. The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes.

AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION. AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.

ATTRIBUTION

© 2015 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo, AMD Opteron, Radeon and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. Adobe is a registered trademark of Adobe Systems Inc. ARM is a registered trademark of ARM Limited. Linux is a registered trademark of Linus Torvalds. OpenCL is a trademark of Apple Inc. used by permission of Khronos. Windows and Microsoft are registered trademarks of Microsoft Corporation. Other names are for informational purposes only and may be trademarks of their respective owners.