1.6 Serial, Vector and Multiprocessor Computers


In this section we consider the components of a computer and the various ways they are connected. In particular, vector pipelines and multiprocessors will be introduced, and a model for speedup is presented. The sizes of matrix models increase substantially when heat and mass transfer in two or three directions is modeled, and this is one reason for considering vector and multiprocessing computers.

The von Neumann definition of a computer contains three parts: main memory, input-output device and central processing unit (CPU). The CPU has three components: the arithmetic logic unit, the control unit and the local memory. The arithmetic logic unit does the floating point calculations, while the control unit governs the instructions and data. The local memory is small compared to the main memory, but moving data within the CPU is usually very fast. Hence, it is important to move data from the main memory to the local memory and do as much computation with this data as possible before moving it back to the main memory. Algorithms that have been optimized for a particular computer take these facts into careful consideration.

Another way of describing a computer is a hierarchical classification of its components. There are three levels: the processor level with wide-band communication paths, the register level with pathways several bytes wide (8 bits per byte), and the gate or logic level with pathways several bits wide. The figure below illustrates two processor level descriptions of computers. The top is a von Neumann computer with the three basic components. The lower part depicts a multiprocessing computer with four CPUs. The CPUs communicate with each other

via the shared memory. The switch controls access to the shared memory, and here there is a potential for a bottleneck. The purpose of multiprocessors is to do more computation in less time. This is critical in many applications such as weather prediction.

[Figure: Processor Level Computers. Top: a serial (von Neumann) computer with memory, CPU and I-O. Bottom: a shared memory multiprocessor with four CPUs connected through a switch to the shared memory.]

Within the CPU is the arithmetic logic unit, and here there are many floating point adders. These can be described as register level devices. A floating point add can be

described as four distinct steps, each requiring a distinct hardware segment. For example, use four digits to do a floating point add of 100.1 + (-3.6):

CE: compare exponents    .1001 × 10^3  and  -.36 × 10^1
AE: mantissa alignment   .1001 × 10^3  and  -.0036 × 10^3
AD: mantissa add         .1001 - .0036 = .0965
NR: normalization        .9650 × 10^2

This can be depicted by the following figure, where the lines indicate communication pathways with several bytes of data. The data moves from left to right in time intervals called the clock cycle time of the particular computer. If each step takes one clock cycle and the clock cycle time is 6 nanoseconds (10^-9 sec.), then a floating point add takes 24 nanoseconds.

[Figure: Register Level Floating Point Add, with the four segments CE, AE, AD and NR connected left to right.]

Vector pipelines will be introduced so as to make greater use of the register level hardware. We will focus on the operation of floating point addition, which requires four distinct steps for each addition. The segments of the device that execute these steps are each busy for only one fourth of the time needed to perform a floating point add. The objective is to

design a computer so that all of the segments are busy most of the time. In the case of the four segment floating point adder this could give a speedup of up to four. A vector pipeline is a register level device, usually located in either the control unit or the arithmetic logic unit. It has a collection of distinct hardware modules or segments that (i) execute the distinct steps of an operation and (ii) are each kept busy once the pipeline is full.

Vector Pipeline for Floating Point Additions (clock cycles run left to right):

Segments   Clock Cycles
  CE       D1  D2  D3  D4
  AE           D1  D2  D3  D4
  AD               D1  D2  D3  D4
  NR                   D1  D2  D3  D4
           startup/fillup      full

The first pair of floating point numbers is denoted by D1, and this pair enters the pipeline in the upper left of the above figure. Segment CE operates on D1 during the first clock cycle. During the second clock cycle D1 moves to segment AE, and the second pair of floating point numbers D2 enters segment CE. After three clock cycles the pipeline is full, and thereafter a floating point add is produced every clock cycle. So, for a large number of floating point adds with four segments the ideal speedup is four.
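The four-step add and the pipeline cycle counts above can be sketched as follows. This is a minimal illustration, assuming 4-digit decimal mantissas and one clock cycle per segment; the function names are hypothetical, not from the text.

```python
# A sketch of the four-segment floating point add (CE, AE, AD, NR)
# with 4-digit decimal mantissas.

def fp_add(m1, e1, m2, e2):
    """Add m1*10^e1 + m2*10^e2, mantissas rounded to 4 decimal digits."""
    # CE: compare exponents; put the larger-exponent operand first
    if e1 < e2:
        m1, e1, m2, e2 = m2, e2, m1, e1
    # AE: align the mantissa of the smaller operand to the larger exponent
    m2 = round(m2 * 10 ** (e2 - e1), 4)
    # AD: add the aligned mantissas
    m, e = m1 + m2, e1
    # NR: normalize so the mantissa lies in [.1, 1)
    while abs(m) >= 1:
        m, e = m / 10, e + 1
    while 0 < abs(m) < 0.1:
        m, e = m * 10, e - 1
    return round(m, 4), e

# 100.1 + (-3.6) = .1001 x 10^3 + (-.36 x 10^1)
print(fp_add(0.1001, 3, -0.36, 1))   # (0.965, 2), i.e. .9650 x 10^2

# Cycle counts: n adds on the 4-segment pipeline versus an unpipelined unit
def pipelined_cycles(n, segments=4):
    return segments + (n - 1)        # fill the pipeline, then one result per cycle

def serial_cycles(n, segments=4):
    return segments * n              # each add occupies all segments in turn

print(serial_cycles(1000) / pipelined_cycles(1000))   # close to the ideal speedup 4
```

For 1000 adds the pipeline needs 4 + 999 = 1003 cycles instead of 4000, so the speedup approaches, but never reaches, four.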

A multiprocessing computer is a computer with more than one "tightly" coupled CPU. Here "tightly" means that there is relatively fast communication among the CPUs; this is in contrast with a "network" of computers. There are several classification schemes commonly used to describe various multiprocessors: memory, data streams, and interconnection.

Two examples of the memory classification are shared and distributed. The shared memory multiprocessors communicate via the global shared memory. The distributed memory multiprocessors communicate by explicit message passing, which must be part of the computer code. Shared memory multiprocessors often have in-code directives that indicate which parts are to be executed concurrently. The Cray Y-MP is an example of a shared memory computer, and an Intel hypercube is an example of a distributed memory computer, as given in the following figure.

[Figure: Two Common Memory Types. Left: a shared memory multiprocessor with CPUs connected through a data switch to the shared memory. Right: a distributed memory multiprocessor (hypercube).]

Classification by data streams has two main categories: SIMD and MIMD. The first represents single instruction and multiple data, and an example is a vector pipeline. The second is multiple instruction and multiple data. The Cray Y-MP is an example of MIMD: one can send different data and different code to the various processors. However, MIMD computers are often programmed like SIMD computers, that is, the same code is executed, but different data is input to the various CPUs.

Interconnection schemes are important because of certain types of applications. For example, in a closed loop system a ring interconnection might be the best. Or, if a problem requires a great deal of communication between processors, then the complete interconnection scheme might be appropriate. The ring interconnection has only two paths per processor regardless of the number of processors. The complete interconnection has p - 1 paths per processor, where p is the number of processors (see the figure below). An interconnection scheme that is a compromise between these two extremes is the hypercube distributed memory computer, which has p = 2^d processors with d paths per processor (see the above figure, where d = 3).

[Figure: Ring and Complete Interconnection.]
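The path counts per processor for the three schemes can be tabulated directly; the following is a small sketch under the counting rules stated above, with illustrative function names.

```python
# Direct links (paths) per processor for the three interconnection schemes;
# the hypercube has p = 2**d processors and d links per processor.

def ring_paths(p):
    return 2              # left and right neighbors, independent of p

def complete_paths(p):
    return p - 1          # a direct link to every other processor

def hypercube_paths(d):
    return d              # p = 2**d processors

print("p   ring  complete  hypercube")
for d in range(1, 5):
    p = 2 ** d
    print(p, ring_paths(p), complete_paths(p), hypercube_paths(d))
```

Note how the complete interconnection grows linearly in p while the hypercube grows only logarithmically, which is the compromise mentioned above.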

Multiprocessing computers have been introduced to obtain more rapid computations. Basically, there are two ways to do this: either use faster computers or use faster algorithms. There are natural limits on the speed of computers. Signals cannot travel faster than the speed of light, and light takes about one nanosecond to travel one foot. In order to reduce communication times, the devices must be moved closer together. Eventually, the devices will be so small that either uncertainty principles will become dominant or the fabrication of chips will become too expensive. An alternative is to use more than one processor on those problems that have a number of independent calculations. One class of problems with many independent matrix products is the area of visualization, and here the use of multiprocessors is very common. But not all computations have a large number of independent calculations.

It is important to understand the relationship between the number of processors and the number of independent parts in a calculation. In order to use p processors effectively, one must have p independent tasks to be performed. Very rarely is this exactly the case; parts of the code may have no independent parts, two independent parts and so forth. In order to model the effectiveness of a multiprocessor with p processors, Amdahl's timing model has been widely used. It makes the assumption that α is the fraction of the work with p independent parts and the rest (1 - α) has one independent part.

Amdahl's Timing Model. Let

p = the number of processors,
α = the fraction with p independent parts,
1 - α = the fraction with one independent part,
T1 = serial execution time,
(1 - α)T1 = execution time for the one independent part, and
αT1/p = execution time for the p independent parts.

Speedup = Sp(α) = T1 / ((1 - α)T1 + αT1/p) = 1 / ((1 - α) + α/p).

Example. Consider a dot product of two vectors of dimension 100. There are 100 scalar products and 99 additions, and we may measure execution time in terms of operations so that T1 = 199. If p = 4 and the dot product is broken into four smaller dot products of dimension 25, then the parallel part has 4(49) = 196 operations and the serial part requires 3 operations to add the four smaller dot products. Thus, α = 196/199 and S4 = 199/52.

If α = 1, then the speedup is p, the ideal case. If α = 0, then the speedup is 1! Another parameter is the efficiency, which is defined to be the speedup divided by the number of processors. Thus, for a fixed code the α will be fixed, and the efficiency will decrease as the number of processors increases. Another way to view this is in the following table, where α = .9 and p varies from 2 to 16.

Table: Speedup and Efficiency for α = .9

Processors   Speedup   Efficiency
     2         1.8        .90
     4         3.1        .78
     8         4.7        .59
    16         6.4        .40
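The speedup formula, the dot product example and the table can all be checked numerically. This is a sketch using exact rational arithmetic for the example and floating point for the table; the function name is illustrative.

```python
from fractions import Fraction

# Amdahl's timing model: S_p(alpha) = 1 / ((1 - alpha) + alpha/p)
def speedup(alpha, p):
    return 1 / ((1 - alpha) + alpha / p)

# Dot product example: T1 = 199 operations, alpha = 196/199, p = 4
print(speedup(Fraction(196, 199), 4))   # 199/52, about 3.83

# The table: alpha = .9, efficiency = speedup / p
for p in (2, 4, 8, 16):
    s = speedup(0.9, p)
    print(p, round(s, 1), round(s / p, 2))
```

The printed values agree with the table above up to rounding in the last digit of the efficiency column.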