Advanced Computer Architecture




Advanced Computer Architecture
Institute for Multimedia and Software Engineering
Conduction of exercises: Institute for Multimedia and Software Engineering, BB 315c, Tel: 379-1174, E-mail: marius.rosu@uni-due.de

Execution Time, Throughput, Speedup

What is better? The question is not precise!

  Aeroplane    NY to Paris   Speed      Passengers   Throughput (persons/h)
  Boeing 747   6.5 h         610 mph    470          72.3
  Concorde     3 h           1350 mph   132          44.0

Execution time T (response time, latency): [sec], [h], ...
Throughput X (bandwidth): [1/sec], [1/h], ...
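To make the definitions concrete, here is a minimal Python sketch, added for illustration only (the flight data come from the table above; the code and variable names are not part of the slides). It computes throughput as passengers per hour and the speedup of the Concorde over the Boeing 747.

```python
# Throughput and speedup for the aeroplane example (table values from the slide).

flights = {
    "Boeing 747": {"time_h": 6.5, "passengers": 470},
    "Concorde":   {"time_h": 3.0, "passengers": 132},
}

# Throughput X = passengers / execution time  [persons/h]
for name, f in flights.items():
    print(f"{name}: throughput = {f['passengers'] / f['time_h']:.1f} persons/h")

# Speedup S = T(B) / T(A): A (Concorde) is S times faster than B (Boeing 747)
speedup = flights["Boeing 747"]["time_h"] / flights["Concorde"]["time_h"]
print(f"Speedup: {speedup:.3f}")   # 2.167
```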

Definition of Speedup

Speedup S (acceleration): A is S times faster than B:

  S = T(B) / T(A) = 6.5 h / 3 h = 2.167

Speedup is a measure for judging the processing of a single task (one passenger). Throughput is a measure for judging the processing of the whole workload (with which aeroplane type can an airline transport more passengers?).

Amdahl's Law

In 1967, Gene Amdahl (developer of the IBM 360/xx computer) defined the performance increase of a program with fixed problem size under parallel processing as:

  Speedup S(p) = T_s / (f * T_s + (1 - f) * T_s / p)    (sequential execution time divided by the seq. + parallel execution time)

with
  T_s : execution time for sequential processing of the whole task
  f   : fraction of the execution time in program segments which cannot run in parallel (f = 0..1)
  p   : number of parallel processing elements (processors)

For p -> infinity: S(p) = 1/f; for f -> 0: S(p) = p.
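A short Python sketch of Amdahl's law (an added illustration, not from the slides); note that T_s cancels out, so the speedup depends only on the sequential fraction f and the processor count p.

```python
def amdahl_speedup(f, p):
    """Amdahl's law: S(p) = T_s / (f*T_s + (1-f)*T_s/p) = 1 / (f + (1-f)/p)."""
    return 1.0 / (f + (1.0 - f) / p)

# Even a small sequential fraction limits the achievable speedup:
for p in (2, 8, 64, 1024):
    print(p, round(amdahl_speedup(0.05, p), 2))

# For p -> infinity the speedup approaches 1/f = 20 (here f = 0.05);
# for f -> 0 it approaches p.
```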

Acceleration of Programs

For efficient parallel processing it is necessary to achieve speedups that are close to the number of processors used.

[Figure: speedup S(p) versus number of processors p; S(p) = p (ideal), S(p) < p (real)]

Definition of Efficiency

Efficiency is the ratio of speedup to the number of processors used; it indicates which share of the processor performance can actually be utilized:

  E(p) = S(p) / p = T(sequential) / (p * T(parallel)),   with 0 < E(p) <= 1

[Figure: efficiency versus number of processors p]
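Continuing the illustrative Python sketch (again not part of the slides), efficiency can be combined directly with Amdahl's law to show how quickly the usable share of processor performance drops when the sequential fraction is fixed:

```python
def efficiency(speedup, p):
    """Efficiency E(p) = S(p) / p, with 0 < E(p) <= 1."""
    return speedup / p

# With a sequential fraction of 5%, the efficiency drops quickly as p grows:
for p in (2, 8, 64, 1024):
    s = 1.0 / (0.05 + 0.95 / p)          # Amdahl speedup for f = 0.05
    print(p, round(efficiency(s, p), 3))
```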

Example: Dot Product

x, y are vectors with n elements. Distribute x and y to p processors. The partial dot products are computed locally by each processor; the global dot product is then computed with a reduction algorithm by summation of the partial results.

Example: Dot Product (continued)

Reduction algorithm for p = 2^3 = 8: the reduction requires d = log2(p) steps.

[Figure: reduction tree over the processor index k and the step index i]

In step i (i = 0, ..., d-1), processor k + 2^i sends its local partial result β_(k+2^i)^(i) to processor k, and processor k then computes

  β_k^(i+1) = β_k^(i) + β_(k+2^i)^(i)

with k = 0, 2^(i+1), 2 * 2^(i+1), ... (stride 2^(i+1), k < p).
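The following Python sketch simulates this reduction on p = 8 "processors" (an added illustration with made-up partial results, not part of the slides); each list entry plays the role of one processor's β, and after d = log2(p) steps entry 0 holds the global dot product.

```python
import math

def tree_reduce(beta):
    """Simulate the reduction: in step i, 'processor' k + 2**i sends its value
    to processor k (k a multiple of 2**(i+1)), which adds it to its own."""
    beta = list(beta)
    p = len(beta)                       # number of processors, assumed a power of two
    d = int(math.log2(p))
    for i in range(d):                  # steps i = 0, ..., d-1
        for k in range(0, p, 2 ** (i + 1)):
            beta[k] = beta[k] + beta[k + 2 ** i]
    return beta[0]                      # processor 0 holds the global result

# Made-up partial dot products of p = 8 processors:
partials = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
print(tree_reduce(partials), sum(partials))   # both print 36.0
```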

Example: Dot Product (continued)

[Slide formulas for p = 2^d: T(seq), the parallel time including the word-transfer time t_word per reduction step, and the speedup for n >> p; figure: speedup versus problem size of the algorithm.]

Example: Dot Product - Efficiency

E(p) = S(p) / p. The efficiency increases with growing problem size n (for fixed p) and decreases with a growing number of processors p (for fixed n). How must the problem size n grow with increasing p if the efficiency is to remain constant? For this algorithm, the efficiency remains constant if n grows with p * log2(p).
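A hedged Python sketch of this isoefficiency statement. The cost model below (one time unit per local multiply-add, t_word time units per reduction step) is an assumption made for illustration and not the exact model from the slides; with it, choosing n proportional to p * log2(p) keeps the efficiency constant.

```python
import math

T_CALC = 1.0    # assumed cost of one local multiply-add (model parameter, not from the slides)
T_WORD = 10.0   # assumed cost of one reduction/communication step (model parameter, not from the slides)

def dot_efficiency(n, p):
    """E(p) = T(seq) / (p * T(par)) for the distributed dot product under this simple model."""
    t_seq = n * T_CALC
    t_par = (n / p) * T_CALC + math.log2(p) * T_WORD   # local work + log2(p) reduction steps
    return t_seq / (p * t_par)

c = 100  # grow the problem size as n = c * p * log2(p)
for p in (2, 8, 64, 1024):
    n = c * p * math.log2(p)
    print(p, round(dot_efficiency(n, p), 3))   # constant: 100 / 110 ~ 0.909 for every p
```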

Scalability

A computer architecture or a program is scalable if the efficiency of program processing remains constant for an increasing number of processors. In general, this is only possible with a simultaneous increase of the problem size. A program (an algorithm) is perfectly scalable if a linear increase of n is sufficient, for a linear increase of p, to achieve constant efficiency.

Speedup is usually reduced by additional parallel overhead:

  V(p) = p * T(p) - T(seq)

Scalability (continued)

Definition of the mean parallel overhead (overhead per processor):

  V̄(p) = V(p) / p

Causes of overhead:
- startup costs of an event (process or communication start)
- costs for the distribution/administration of shared data
- costs for synchronization

What is better?
- less communication through bigger work packages for fewer processors (from fine to coarse granularity), or
- smaller work packages distributed to more processors?
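As a final illustration (an added Python sketch with assumed timing numbers, not taken from the slides), the overhead definition turns this trade-off into something measurable:

```python
def parallel_overhead(t_seq, t_par, p):
    """Total parallel overhead V(p) = p * T(p) - T(seq)."""
    return p * t_par - t_seq

# Assumed (illustrative) measurements: 100 s sequentially, 15 s on 8 processors.
t_seq, t_par, p = 100.0, 15.0, 8
print("speedup    =", t_seq / t_par)                           # 6.67
print("efficiency =", t_seq / (p * t_par))                     # ~0.83
print("V(p)       =", parallel_overhead(t_seq, t_par, p))      # 20.0 s of extra work in total
print("V(p)/p     =", parallel_overhead(t_seq, t_par, p) / p)  # 2.5 s mean overhead per processor
```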