1.6 Serial, Vector and Multiprocessor Computers

In this section we consider the components of a computer and the various ways they are connected. In particular, vector pipelines and multiprocessors are introduced, and a model for speedup is presented. The sizes of matrix models increase substantially when heat and mass transfer in two or three directions is modeled, and this is one reason for considering vector and multiprocessing computers.

The von Neumann definition of a computer contains three parts: main memory, input-output device and central processing unit (CPU). The CPU has three components: the arithmetic logic unit, the control unit and the local memory. The arithmetic logic unit does the floating point calculations, while the control unit governs the instructions and data. The local memory is small compared to the main memory, but moving data within the CPU is usually very fast. Hence, it is important to move data from the main memory to the local memory and do as much computation with this data as possible before moving it back to the main memory. Algorithms that have been optimized for a particular computer take these facts into careful consideration.

Another way of describing a computer is a hierarchical classification of its components. There are three levels: the processor level with wide-band communication paths, the register level with pathways several bytes (8 bits per byte) wide, and the gate or logic level with pathways several bits wide. The figure below illustrates two processor level descriptions of computers. The top is a von Neumann computer with the three basic components. The lower part depicts a multiprocessing computer with four CPUs. The CPUs communicate with each other
via the shared memory. The switch controls access to the shared memory, and here there is a potential for a bottleneck. The purpose of multiprocessors is to do more computation in less time. This is critical in many applications such as weather prediction.

[Figure: Processor Level Computers. Top, a serial computer: memory, CPU and I-O. Bottom, a shared memory multiprocessor: four CPUs connected through a switch to a shared memory and I-O.]

Within the CPU is the arithmetic logic unit, and here there are many floating point adders. These can be described as register level devices. A floating point add can be
described as four distinct steps, each requiring a distinct hardware segment. For example, use four digits to do a floating point add of 100.1 + (-3.6):

CE: compare exponents    .1001 x 10^3 and -.36 x 10^1
AE: mantissa alignment   .1001 x 10^3 and -.0036 x 10^3
AD: mantissa add         1001 - 0036 = 0965
NR: normalization        .0965 x 10^3 = .9650 x 10^2

This can be depicted by the following figure, where the lines indicate communication pathways with several bytes of data. The data moves from left to right in time intervals called the clock cycle time of the particular computer. If each step takes one clock cycle and the clock cycle time is 6 nanoseconds, then a floating point add takes 24 nanoseconds (1 nanosecond = 10^-9 sec.).

[Figure: Register Level Floating Point Add. Four segments connected in sequence: CE, AE, AD, NR.]

Vector pipelines will be introduced so as to make greater use of the register level hardware. We will focus on the operation of floating point addition, which requires four distinct steps for each addition. The segments of the device that execute these steps are each busy for only one fourth of the time needed to perform a floating point add. The objective is to
design a computer so that all of the segments will be busy most of the time. In the case of the four segment floating point adder this could give a speedup of up to four.

A vector pipeline is a register level device, usually located in either the control unit or the arithmetic logic unit. It has a collection of distinct hardware modules or segments that (i) execute the distinct steps of an operation and (ii) are each kept busy once the pipeline is full.

Segments
  CE   D1  D2  D3  D4
  AE       D1  D2  D3  D4
  AD           D1  D2  D3  D4
  NR               D1  D2  D3  D4
       startup | fillup |  full      -> Clock Cycles

Vector Pipeline for Floating Point Additions

The first pair of floating point numbers is denoted by D1, and this pair enters the pipeline in the upper left of the above figure. Segment CE operates on D1 during the first clock cycle. During the second clock cycle D1 moves to segment AE, and the second pair of floating point numbers, D2, enters segment CE. After three clock cycles the pipeline is full, and thereafter a floating point add is produced every clock cycle. So, for a large number of floating point adds with four segments the ideal speedup is four.
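The four segment add and the pipeline timing can both be illustrated with a short sketch. This is illustrative Python under simple assumptions (decimal mantissas kept to four digits, one clock cycle per segment); the function names fp_add, serial_cycles and pipeline_cycles are ours, not from the text.

```python
# Sketch of the four segment floating point add (CE, AE, AD, NR) and of
# the pipeline timing. Numbers are (mantissa, exponent) pairs standing
# for mantissa * 10**exponent with |mantissa| < 1.

def fp_add(x, y):
    (mx, ex), (my, ey) = x, y
    # CE: compare exponents and order so that x has the larger one
    if ex < ey:
        (mx, ex), (my, ey) = (my, ey), (mx, ex)
    # AE: mantissa alignment, shift the smaller operand right
    my = my / 10 ** (ex - ey)
    # AD: mantissa add (mantissa overflow, |m| >= 1, is ignored here)
    m, e = mx + my, ex
    # NR: normalization so that .1 <= |mantissa| < 1
    while m != 0 and abs(m) < 0.1:
        m, e = m * 10, e - 1
    return round(m, 4), e   # keep four decimal digits

def serial_cycles(n, s=4):
    # without a pipeline each add occupies all s segments in turn
    return n * s

def pipeline_cycles(n, s=4):
    # s - 1 cycles to fill the pipeline, then one result per clock cycle
    return (s - 1) + n

print(fp_add((0.1001, 3), (-0.36, 1)))   # → (0.965, 2), i.e. .9650 x 10^2
for n in (1, 4, 1000):
    print(n, serial_cycles(n) / pipeline_cycles(n))   # speedup approaches 4
```

For n = 1000 adds the pipeline needs 1003 cycles instead of 4000, a speedup of about 3.99, which is the sense in which the ideal speedup is four.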
A multiprocessing computer is a computer with more than one "tightly" coupled CPU. Here "tightly" means that there is relatively fast communication among the CPUs; this is in contrast with a "network" of computers. There are several classification schemes commonly used to describe various multiprocessors: memory, data streams, and interconnection.

Two examples of the memory classification are shared and distributed. Shared memory multiprocessors communicate via the global shared memory. Distributed memory multiprocessors communicate by explicit message passing, which must be part of the computer code. Shared memory multiprocessors often have in-code directives that indicate which parts of the code are to be executed concurrently. The Cray Y-MP is an example of a shared memory computer, and an Intel hypercube is an example of a distributed memory computer, as given in the following figure.

[Figure: Two Common Memory Types. Left, shared memory: CPUs connected through a data switch to a shared memory. Right, distributed memory: CPUs connected by communication links.]
Classification by data streams has two main categories: SIMD and MIMD. The first represents single instruction and multiple data, and an example is a vector pipeline. The second represents multiple instruction and multiple data. The Cray Y-MP is an example of an MIMD; one can send different data and different code to the various processors. However, MIMD computers are often programmed like SIMD computers, that is, the same code is executed, but different data is input to the various CPUs.

Interconnection schemes are important because of certain types of applications. For example, in a closed loop system a ring interconnection might be the best. Or, if a problem requires a great deal of communication between processors, then the complete interconnection scheme might be appropriate. The ring interconnection has only two paths per processor regardless of the number of processors. The complete interconnection has p - 1 paths per processor, where p is the number of processors (see the figure below). An interconnection scheme that is a compromise between these two extremes is the hypercube distributed memory computer, which has p = 2^d processors with d paths per processor (see the above figure, where d = 3).

[Figure: Ring and Complete Interconnection]
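The path counts for the three interconnection schemes can be tabulated with a few lines of code. This is a sketch with our own function names; the hypercube count applies only when p = 2^d.

```python
# Communication paths per processor for p processors.

def ring_paths(p):
    return 2  # each processor is linked only to its two neighbors

def complete_paths(p):
    return p - 1  # each processor is linked to every other processor

def hypercube_paths(p):
    # requires p = 2**d; each processor has d links, one per dimension
    d = p.bit_length() - 1
    assert p == 2 ** d, "hypercube needs p = 2**d processors"
    return d

for p in (8, 64):
    print(p, ring_paths(p), complete_paths(p), hypercube_paths(p))
```

For p = 8 (the d = 3 hypercube in the figure) this gives 2, 7 and 3 paths per processor; at p = 64 the complete scheme already needs 63 paths per processor while the hypercube needs only 6.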
Multiprocessing computers have been introduced to obtain more rapid computations. Basically, there are two ways to do this: either use faster computers or use faster algorithms. There are natural limits on the speed of computers. Signals cannot travel faster than the speed of light, and it takes about one nanosecond for light to travel one foot. In order to reduce communication times, the devices must be moved closer together. Eventually, the devices will be so small that either uncertainty principles will become dominant or the fabrication of chips will become too expensive.

An alternative is to use more than one processor on those problems that have a number of independent calculations. One class of problems with many matrix products, which are independent calculations, is the area of visualization, and here the use of multiprocessors is very common. But not all computations have a large number of independent calculations. It is important to understand the relationship between the number of processors and the number of independent parts in a calculation. In order to effectively use p processors, one must have p independent tasks to be performed. Very rarely is this exactly the case; parts of the code may have no independent parts, two independent parts and so forth. In order to model the effectiveness of a multiprocessor with p processors, Amdahl's timing model has been widely used. It makes the assumption that α is the fraction of the calculation with p independent parts and the rest (1 - α) has one independent part.
Amdahl's Timing Model. Let

p = the number of processors,
α = the fraction with p independent parts,
1 - α = the fraction with one independent part,
T_1 = serial execution time,
(1 - α)T_1 = execution time for the one independent part and
αT_1/p = execution time for the p independent parts.

Speedup = S_p(α) = T_1 / ((1 - α)T_1 + αT_1/p) = 1 / ((1 - α) + α/p).

Example. Consider a dotproduct of two vectors of dimension 100. There are 100 scalar products and 99 additions, and we may measure execution time in terms of operations so that T_1 = 199. If p = 4 and the dotproduct is broken into four smaller dotproducts of dimension 25, then the parallel part will have 4(49) operations (each smaller dotproduct requires 25 products and 24 additions), and the serial part will require 3 operations to add the smaller dotproducts. Thus, α = 196/199 and S_4 = 199/52.

If α = 1, then the speedup is p, the ideal case. If α = 0, then the speedup is 1! Another parameter is the efficiency, which is defined to be the speedup divided by the number of processors. Thus, for a fixed code the α will be fixed, and the efficiency will decrease as the number of processors increases. Another way to view this is in the following table, where α = .9 and p varies from 2 to 16.
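The dotproduct example is easy to verify numerically. A minimal sketch, where speedup implements S_p(α) from the formula above (the function name is ours); exact rational arithmetic is used so that the answer 199/52 appears exactly.

```python
from fractions import Fraction

def speedup(alpha, p):
    # S_p(alpha) = 1 / ((1 - alpha) + alpha / p)
    return 1 / ((1 - alpha) + alpha / p)

# Dotproduct example: T_1 = 199 operations, alpha = 196/199, p = 4.
alpha = Fraction(196, 199)
print(speedup(alpha, 4))          # → 199/52
print(float(speedup(alpha, 4)))   # about 3.83, against an ideal of 4
```

Note that speedup(1, p) returns p and speedup(0, p) returns 1, matching the two extreme cases described above.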
Table: Speedup and Efficiency for α = .9

Processors   Speedup   Efficiency
     2         1.8        .90
     4         3.1        .78
     8         4.7        .59
    16         6.4        .40
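The table can be regenerated from the speedup formula. A sketch assuming α = .9; as in the table, the efficiency is computed from the speedup rounded to one decimal place.

```python
def speedup(alpha, p):
    # S_p(alpha) = 1 / ((1 - alpha) + alpha / p)
    return 1 / ((1 - alpha) + alpha / p)

alpha = 0.9
print("Processors  Speedup  Efficiency")
for p in (2, 4, 8, 16):
    s = round(speedup(alpha, p), 1)   # rounded speedup, as tabulated
    print(f"{p:10d}  {s:7.1f}  {s / p:10.2f}")
```

Doubling the processors from 8 to 16 raises the speedup only from 4.7 to 6.4 while the efficiency falls from .59 to .40, which is the diminishing return the table illustrates.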