Vector Architectures




EE482C: Advanced Computer Organization, Lecture #11: Stream Processor Architecture
Stanford University, Thursday, 9 May 2002
Lecturer: Prof. Bill Dally
Scribe: James Bonanno
Reviewer: Mattan Erez

Vector Architectures

Logistics:
- Project update: presentations next Tuesday (May 14)
- Read all project proposals
- People not yet in project groups should form them soon

The topic of this lecture was vector architectures. After discussing a definition of vector architectures, the similarities and differences between them and the stream architecture (as idealized in Imagine) were discussed, as were the corresponding advantages and disadvantages. The discussion concluded by focusing on reasons for the apparent fall in popularity of vector machines compared to the "killer micros", and how this relates to stream processor architecture. The two papers read in preparation for this discussion were:

- Roger Espasa, Mateo Valero, and James E. Smith, "Vector Architectures: Past, Present, and Future," ICS '98: Proceedings of the 1998 International Conference on Supercomputing, July 13-17, 1998, Melbourne, Australia. ACM, 1998.
- John Wawrzynek, Krste Asanovic, and Brian Kingsbury, "Spert-II: A Vector Microprocessor System," IEEE Computer, March 1996.

1 What is a Vector Processor?

It was suggested that a key aspect of vector architecture is the single-instruction-multiple-data (SIMD) execution model. SIMD support follows from the type of data supported by the instruction set, and from how instructions operate on that data. In a traditional scalar processor, the basic data type is an n-bit word. The architecture often exposes a register file of words, and the instruction set is composed of instructions that operate on individual words. In a vector architecture, there is support for a vector data type, where a vector is a collection of VL n-bit words (VL is the vector length). There may also be a vector register file, which was a key innovation of the Cray architecture.
Previously, vector machines operated on vectors stored in main memory. Figures 1 and 2 illustrate the difference between vector and scalar data types, and the operations that can be performed on them.
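The scalar/vector contrast of Figures 1 and 2 can be sketched in NumPy, used here purely as a software stand-in for the hardware semantics: a scalar ADD produces one result word, while a single VADD produces VL result words (the register names in the comments follow the figures).

```python
import numpy as np

VL = 8  # vector length, matching the 8-element vectors of Figure 1

# Scalar model: one ADD instruction produces one result word.
r1, r2 = 3, 5
r3 = r1 + r2                    # ADD R3,R1,R2

# Vector model: one VADD instruction produces VL result words.
v1 = np.arange(1, VL + 1)       # V1 = [1, 2, ..., 8]
v2 = np.arange(9, 9 + VL)       # V2 = [9, 10, ..., 16]
v3 = v1 + v2                    # VADD V3,V1,V2 -- one instruction, VL adds

print(r3)             # 8
print(v3.tolist())    # [10, 12, 14, 16, 18, 20, 22, 24]
```

The element values match those shown in Figure 2; the point is the instruction count, not the arithmetic: the vector version amortizes one instruction fetch and decode over VL operations.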

Figure 1: (A) a 64-bit word, and (B) a vector of 8 64-bit words

2 Comparing Vector and Stream Architectures

The following table lists similarities and differences between vector and stream architectures that were discussed in class. These aspects are described below in more detail.

Similarities           Differences
SIMD                   cross-pipe communication
vector load/store      record vs. operation order
pipes (DLP support)    local register files (LRF) and microcode

2.1 Similarities

2.1.1 SIMD

Vector architectures exhibit SIMD behaviour by having vector operations that are applied to all elements in a vector. Likewise, in stream architectures, the same sequence of computation (microprogram) is performed on every stream element. SIMD has three main advantages:
1. It requires lower instruction bandwidth.
2. It allows cheap synchronization.
3. It allows easier addressing.

2.1.2 Vector Load/Store

Vector load/store instructions provide the ability to do strided and scatter/gather memory accesses, which take data elements distributed throughout memory and pack them

into sequential vectors/streams placed in vector/stream registers. This promotes data locality. It results in less data pollution, since only useful data is loaded from the memory system. It also provides latency tolerance, because there can be many simultaneous outstanding memory accesses. Vector instructions such as VLD and VST provide this capability. As illustrated in Figure 3, VLD V0,(V1) loads into V0 the memory elements specified by the addresses (indices) stored in V1. VST (V1),V0 works in a similar manner. VST (S1),4,V0 stores the contents of V0 starting at the address contained in S1 with a stride of 4.

Figure 2: Difference between scalar and vector add instructions (ADD R3,R1,R2 vs. VADD V3,V1,V2)

2.1.3 Pipes

Having multiple pipes allows easy exploitation of DLP. A vector architecture specifies that the same operation is performed on every element in a vector; it does not specify how this is implemented in the microarchitecture. For example, the T0 processor has 8 pipes, thereby allowing a vector operation to be performed in parallel on 8 elements of the vector. Figure 4 shows how the T0 processor structures its vectors. In that architecture, the vector length is 32. Having 8 pipes therefore results in an arithmetic operation latency of 4 cycles.
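The indexed-load and strided-store semantics described above can be sketched with NumPy indexing, again as a software model rather than the hardware mechanism; the addresses and element values follow the Figure 3 example, and the stride-4 store assumes the VST (S1),4,V0 form from the text.

```python
import numpy as np

memory = np.zeros(64, dtype=np.int64)
memory[1], memory[8], memory[63] = 5, 3, 9   # scattered data, as in Figure 3

# Indexed (gather) load: VLD V0,(V1)
v1 = np.array([1, 8, 63])     # addresses held in vector register V1
v0 = memory[v1]               # packs the scattered elements into V0
print(v0.tolist())            # [5, 3, 9]

# Strided store: VST (S1),4,V0 -- base address in S1, stride of 4
s1 = 0
memory[s1 : s1 + 4 * len(v0) : 4] = v0
print(memory[0], memory[4], memory[8])   # 5 3 9
```

Only the three useful words move in the gather, which is the "less data pollution" point: no surrounding cache line of unused data is dragged along.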

Figure 3: Indexed vector load instruction (VLD V0,(V1))

Figure 4: T0 vector organization (16 vector registers, 8 pipes, 4-cycle latency to perform an operation on all vector elements)

2.2 Differences

2.2.1 Cross-Pipe Communication

A stream architecture may allow cross-pipe communication (in Imagine, this is inter-cluster communication), while such communication in vector processors is only possible by rearranging the data ordering with load/store instructions to/from the memory system.
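A minimal sketch of that restriction, using NumPy as the model: to move a value between element positions ("pipes"), a vector machine spills the vector to memory and reloads it through an index vector encoding the desired permutation. The shift-by-one pattern here is a hypothetical example, not taken from the lecture.

```python
import numpy as np

memory = np.zeros(8, dtype=np.int64)
v0 = np.array([10, 20, 30, 40, 50, 60, 70, 80])

memory[:8] = v0                               # VST: spill the vector to memory
rotate = np.array([1, 2, 3, 4, 5, 6, 7, 0])   # index vector: rotate left by one
v0 = memory[rotate]                           # indexed VLD: reload, rearranged

print(v0.tolist())   # [20, 30, 40, 50, 60, 70, 80, 10]
```

The round trip through the memory system is the cost being contrasted with Imagine's direct inter-cluster network.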

2.2.2 Record vs. Operation Order, LRF, and Microcode

Vector processors perform single operations on entire vectors, while stream processors like Imagine execute entire microprograms on each data element of a stream. The effect of this difference is that in vector architectures, the intermediate vectors produced by each instruction are stored in the vector register file, while in a stream processor, intermediate values are consumed locally: they are stored in local register files (LRFs) and accessed at higher bandwidth. A possible disadvantage is the apparent increase in code complexity in such a stream processor.

3 Memory Bandwidth

It was suggested that the true reason vector supercomputers are fast is the memory bandwidth they provide. They achieve it by supporting banked memory and allowing several simultaneous outstanding memory requests. Such architectures do not rely on cache techniques to achieve reasonable performance, and they therefore often outperform cache-reliant architectures in applications that make random memory accesses.

4 Killer Micros

Key reasons why today's microprocessors have displaced widespread use of vector architectures include:
1. CMOS became a universal technology, while many vector supercomputers continued to use bipolar technology, where yield was poor on dense chips.
2. A processor could be implemented on a single chip in CMOS, while older vector machines required multi-chip implementations.
3. Several applications achieve very good performance on non-vector machines.
4. Caches exploited locality to improve memory system performance without the expensive memory-system optimizations often found in vector machines.

Figure 5 shows the relative performance growth curves of CMOS and bipolar technology, and also of memory performance.

4.1 Relevance to Stream Processors

It seems prudent to use CMOS technology for the foreseeable future, to optimize for high bandwidth, and to provide latency tolerance and SIMD execution.
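The banked-memory argument in Section 3 can be made concrete with a toy model (an illustration, not any specific machine): consecutive addresses are interleaved across banks (bank = address mod NBANKS), so a burst of independent requests proceeds in parallel unless several requests map to the same bank.

```python
# Toy model of memory-bank interleaving with a hypothetical 8-bank system.
NBANKS = 8

def busiest_bank_load(addresses):
    """Max number of requests landing on any single bank;
    1 means the burst is fully parallel, len(addresses) means serialized."""
    counts = [0] * NBANKS
    for a in addresses:
        counts[a % NBANKS] += 1
    return max(counts)

unit_stride = [i for i in range(8)]            # stride 1: one request per bank
bad_stride  = [i * NBANKS for i in range(8)]   # stride 8: all requests hit bank 0

print(busiest_bank_load(unit_stride))   # 1
print(busiest_bank_load(bad_stride))    # 8
```

With many outstanding requests spread across banks, sustained bandwidth approaches NBANKS words per bank cycle, which is what lets these machines do without caches even on random access patterns.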
Finally, the question was posed of why there have been so few architectural innovations in vector processors.

Figure 5: Performance growth of (A) CMOS vs. bipolar technology, and (B) memory performance

Perhaps this is because key innovations are often made during the earlier generations of an architecture.