Relating Empirical Performance Data to Achievable Parallel Application Performance




Published in Proceedings of the International Conference on Parallel and Distributed Processing Techniques and Applications (PDPTA'99), Vol. III, Las Vegas, Nev., USA, June 28-July 1, 1999, pp. 1627-1633.

Roy W. Melton, Cecil O. Alford, Philip R. Bingham, and Tsai Chi Huang
Computer Engineering Research Laboratory, School of Electrical and Computer Engineering, Georgia Institute of Technology, 777 Atlantic Dr. Ste. 496, Atlanta, GA 30332-0250

Abstract. Parallel computing offers the best execution performance for many large, compute-intensive applications. Whereas the overall computing requirements of such applications lead to a clear choice of a parallel computing paradigm, the specific suitability of one particular parallel computer versus another for a given application is often less clear. Research is underway to approach this issue of computational suitability by seeking to predict the parallel performance of an application based on the underlying computational metrics of that application and of candidate architectures. This paper presents two case studies in which empirical performance data have been used to predict parallel application performance.

Keywords: MIMD; Parallel Applications; Performance Analysis; Performance Prediction

1 Parallel Applications and Performance

Parallel computing has emerged as a means toward achieving the fastest execution of many computationally challenging applications. Reports of applications ported to various parallel architectures account for much of the scientific computing literature and communications published over the last decade, and metrics such as parallel efficiency and speedup are the ubiquitous hallmarks of this body of work summarizing specific case studies and implementations. While such analysis confirms the rewards of a particular parallelization effort, it often provides no explicit insight into the performance of significantly differing parallelization efforts. This absence of performance insight handicaps the development and deployment of applications to their full parallel potential.

For the most computationally demanding applications, obtaining the fastest execution time is the ultimate goal. Its realization for a particular application implies a change of focus from specific implementations to generalized parallel performance characteristics of that application. Such performance research then provides a path to optimal performance of a given application on its intended parallel computing platform.

The foundations for this parallel performance research lie in work addressing the parallel scalability of various algorithms and applications. Typically, such work is based on an algorithmic analysis in terms of the order of operations required. This analysis is then applied to existing or theoretical future parallel computing architectures to project upper and lower bounds on parallel efficiency and/or speedup. Current published parallel performance research reflects this body of work.

Moving beyond this status quo of generalized asymptotic estimates of performance scalability toward more realistic estimates of realizable performance requires a change in the underlying analysis. Algorithmic analysis expressed in order of operations allows asymptotic parallel performance estimates for general classes of parallel computers; more accurate computational metrics of algorithms and their aggregate applications will permit better estimates of parallel performance on actual hardware. Two case studies follow that investigate the use of measured operational performance on actual hardware to predict parallel application performance.
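Throughout, speedup and parallel efficiency carry their standard definitions; in the notation of section 3.2, with t_S the serial execution time and t_N the execution time on N processors:

    \[
      S_N = \frac{t_S}{t_N},
      \qquad
      E_N = \frac{S_N}{N}
    \]

By these definitions, the spectral transform result of section 3, a speedup of 11.4 on 14 processors, corresponds to a parallel efficiency of about 0.81.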

2 Image Processing Application

A parallel performance estimation technique has been applied to an image processing application in order to establish the computational requirements necessary to support the application. The Computer Engineering Research Laboratory at Georgia Tech designed and implemented a set of parallel processors for a real-time object detection application as part of a research contract. During the term of the contract, it was known that no commercially available processors could implement the application; however, in anticipation of projected increases in the computing power of future commercially available processors, there was interest in quantifying the computational requirements of the application.

Consisting of six custom processors, the Georgia Tech VLSI signal processor chip set (GT-VSP) [1] performs parallel operations to identify desired objects within a real-time image stream. Each of the six types of GT-VSP elements implements a different image processing task: non-uniformity compensation (NUC), temporal filtering (TF), spatial filtering (SF), thresholding (THR), clustering (CLS), and centroiding (CTR). The heterogeneous processors operate in pipeline fashion (as shown in Figure 1) to process image frames of 128x128 pixels in parallel at up to 100 frames per second (fps) [2]. Although the custom GT-VSP intrinsically supports the application, its processing capabilities in terms of standard computational metrics (e.g., millions of instructions per second [MIPS], millions of floating point operations per second [MFLOPS], and millions of operations per second [MOPS]) suitable for comparison with commercially available processors are not readily apparent.

[Figure 1. GT-VSP Processing Pipeline: a 16-bit pixel stream (Pixel_Int[15:0]) with Begin_Frame/End_Frame and Begin_Row/End_Row control flows through NUC (GT-VNUC, 5 x 256 x 256 x 16 memory), TF (4 x 256 x 256 x 16 memory), SF, THR, CLS (GT-VCLS), and CTR (GT-VCTR, X and Y) stages.]

Therefore, to derive GT-VSP computational requirements, the GT-VSP algorithms were implemented in software [3] and run on a well-benchmarked machine. Scaling the empirical performance data by this machine's published performance figures yielded standardized performance estimates for the GT-VSP system implementation.

2.1 Empirical Performance Data

A Sun SPARCstation 2 (SS2) in single-user mode executed the software versions of GT-VSP repeatedly to generate corresponding execution timings. Each GT-VSP algorithm was benchmarked both in direct form from [3] as C code and in one or more computationally optimized C versions; Figure 2 (below) shows the two benchmarked versions of THR in simple mode, and the version with the best timing was used for the standardized performance estimation.
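A minimal harness along these lines can produce such timings (a sketch: the ANSI clock() timer, repetition count, and data values are illustrative assumptions, and the harness links against whichever SimpleThreshold version of Figure 2 is under test):

    #include <stdio.h>
    #include <time.h>

    #define rows    128
    #define columns 128
    #define RUNS    1000     /* repetitions to smooth timer granularity */

    typedef short pixel_type;        /* assumed 16-bit pixels */

    void SimpleThreshold (pixel_type Lower, pixel_type Upper, long *Count,
                          pixel_type In [rows][columns],
                          pixel_type Out [rows][columns]);

    static pixel_type In [rows][columns], Out [rows][columns];

    int main (void)
    {
        long    Count;
        int     Row, Column, Run;
        clock_t Start, Stop;
        double  FrameTime;

        /* Worst-case data: every pixel lies inside the threshold band. */
        for (Row = 0; Row < rows; Row++)
            for (Column = 0; Column < columns; Column++)
                In [Row][Column] = 1;

        Start = clock ();
        for (Run = 0; Run < RUNS; Run++)
            SimpleThreshold (1, 255, &Count, In, Out);
        Stop = clock ();

        FrameTime = ((double) (Stop - Start) / CLOCKS_PER_SEC) / RUNS;
        printf ("frame time = %g s, MFR = %g fps\n", FrameTime, 1.0 / FrameTime);
        return 0;
    }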

Figure 2. THR Simple Algorithm: Direct and Optimized C Versions (two alternative implementations of the same routine)

    /* Direct version: straightforward array indexing, as in [3]. */
    void SimpleThreshold (pixel_type Lower, pixel_type Upper, long *Count,
                          pixel_type In [rows][columns],
                          pixel_type Out [rows][columns])
    {
        int Row, Column;
        pixel_type Pixel;

        *Count = 0;
        for (Row = 0; Row < rows; Row++) {
            for (Column = 0; Column < columns; Column++) {
                Pixel = In [Row][Column];
                if ((Pixel >= Lower) && (Pixel <= Upper)) {
                    (*Count)++;
                    Out [Row][Column] = Pixel;
                } else {
                    Out [Row][Column] = (pixel_type) 0;
                }
            }
        }
    }

    /* Optimized version: array indices replaced with direct pixel
       pointers that sweep across the frame.                        */
    void SimpleThreshold (pixel_type Lower, pixel_type Upper, long *Count,
                          pixel_type In [rows][columns],
                          pixel_type Out [rows][columns])
    {
        int I, J;
        pixel_type *InPtr, *OutPtr, Pixel;

        *Count = 0;
        for (I = 0, InPtr = In [0], OutPtr = Out [0]; I < rows; I++) {
            for (J = 0; J < columns; J++, InPtr++, OutPtr++) {
                Pixel = *InPtr;
                if ((Pixel >= Lower) && (Pixel <= Upper)) {
                    (*Count)++;
                    *OutPtr = Pixel;
                } else {
                    *OutPtr = (pixel_type) 0;
                }
            }
        }
    }

Two main issues drove the performance measurement methodology. First, for the methodology to be successful, the software implementation needed to reflect the actual hardware operation as fully as possible. Hardware and software typically impart differing optimizations to a given algorithmic implementation, so comparing implementations across the two paradigms requires compensating for their intrinsic optimization techniques.

An optimized software implementation can avoid some operations inherent in a hardware solution. Software running on a general-purpose processor affords performance optimizations that do not reflect the custom GT-VSP's operation. Whereas compilers and/or processor hardware can minimize operations based on test conditions and image data (e.g., avoiding multiplication by zero or one), GT-VSP must execute the worst-case computation as though it occurred for every pixel, because GT-VSP guarantees 100 fps processing. The functions within the algorithms and the implementation constraints impact the hardware in ways that are not reflected in software. Thus, GT-VSP performance is best characterized by benchmarking the software with worst-case image data (i.e., the highest number of objects to detect, non-zero/non-unity filter coefficients, etc.); this scenario corresponds to a real processing situation, and any replacement computing hardware must be able to solve it within the required time constraint.

For most of the GT-VSP algorithms, worst-case image data corresponds to a single image; benchmarking CLS, however, required evaluating an ensemble of images. Unlike the other algorithms, CLS is affected not only by how many object (i.e., non-zero) pixels are in a frame but also by how they are arranged in the frame, since it identifies all contiguous non-zero pixels as belonging to a unique object. As pixels are evaluated in raster-scan order, sometimes what appears to be a unique object turns out to be part of a larger object, and the object data must then be merged. Thus, characterizing CLS performance required various images that could quantify its object detection and object merging capabilities, as illustrated by the sketch below.
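For illustration, a generator of two such frame types might look as follows (a sketch; the specific pixel patterns and connectivity assumptions are illustrative, not taken from the benchmark suite [3]). One frame maximizes the number of isolated objects; the other is built from U-shaped objects whose two arms appear to be distinct objects in raster-scan order until the bottom row joins them, forcing merges:

    #include <string.h>

    #define rows    128
    #define columns 128

    typedef short pixel_type;   /* assumed 16-bit pixels */

    /* Many isolated single-pixel objects: stresses object detection. */
    void MakeDetectionCase (pixel_type Img [rows][columns])
    {
        int Row, Column;
        memset (Img, 0, rows * columns * sizeof (pixel_type));
        for (Row = 0; Row < rows; Row += 2)
            for (Column = 0; Column < columns; Column += 2)
                Img [Row][Column] = 1;
    }

    /* U-shaped objects: each pair of vertical arms is tracked as two
       objects until the bottom row connects them, stressing merging. */
    void MakeMergeCase (pixel_type Img [rows][columns])
    {
        int Row, Column;
        memset (Img, 0, rows * columns * sizeof (pixel_type));
        for (Column = 0; Column + 2 < columns; Column += 4) {
            for (Row = 0; Row < rows - 1; Row++) {
                Img [Row][Column]     = 1;   /* left arm  */
                Img [Row][Column + 2] = 1;   /* right arm */
            }
            Img [rows - 1][Column]     = 1;  /* bottom of the U */
            Img [rows - 1][Column + 1] = 1;
            Img [rows - 1][Column + 2] = 1;
        }
    }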
On the other hand, an optimized hardware implementation can avoid some data access overhead inherent in a software abstraction. GT-VSP does not have to maintain pixel array indices within an image as the software abstractions do, so in the software implementations, array indices were replaced with direct pointers to pixels everywhere practical for performance optimization.

Secondly, the software benchmarks of custom hardware needed to handle varying data representations. GT-VSP operates in fixed-point arithmetic using precision of 16 to 35 bits as needed to preserve accuracy. Therefore, both 32-bit integer and 32-bit floating-point software versions were evaluated.

2.2 Parallel Performance Estimation

The Sun SPARCstation 2 (SS2) is a machine with many existing benchmarks; its published performance figures are 28.5 MIPS and 4.2 MFLOPS. The time required to process a single image frame with each GT-VSP algorithm on this computer was measured.

Then, the SS2's maximum frame rate (MFR) for each algorithm was calculated as the reciprocal of the measured frame time. Table 1 shows the MFRs along with the estimated GT-VSP MIPS requirement. The MIPS requirement was determined by scaling the SS2's published 28.5 MIPS by the factor 100/MFR (included in Table 1), which converts the SS2's achieved frame rate to GT-VSP's required 100 fps.

Table 1. GT-VSP Performance Estimate in MIPS from SS2 Benchmark

  Floating Point Operations
  GT-VSP Algorithm    MFR (fps)    100/MFR    MIPS for 100 fps
  NUC                 10.73         9.324      266
  SF                  16.78         5.959      170
  TF                   8.47        11.804      336
  THR                  3.63        27.568      786
  CLS/CTR              6.41        15.608      445
  Total                            70.264     2003

  Integer Operations
  GT-VSP Algorithm    MFR (fps)    100/MFR    MIPS for 100 fps
  NUC                  8.14        12.290      350
  SF                   9.70        10.310      294
  TF                   3.04        32.855      936
  THR                  7.95        12.577      358
  CLS/CTR              5.69        17.564      501
  Total                            85.596     2439

Table 2 shows the operation count for each GT-VSP algorithm in MOPS, along with the computed SS2 MFLOPS requirement for 100 fps and the ratio of MFLOPS to MOPS. The SS2 MFLOPS requirement was determined by scaling the MIPS requirement from Table 1 by a factor of 4.2/28.5, the ratio of the SS2's MFLOPS to its MIPS.

Table 2. GT-VSP Operation Count and Estimate in MFLOPS from SS2 Benchmark

  GT-VSP Algorithm    MOPS at 100 fps    MFLOPS for 100 fps    Ratio MFLOPS/MOPS
  NUC                  13                  39.2                 3.01
  SF                   20                  25.0                 1.25
  TF                   24                  49.6                 2.07
  THR                  53                 115.8                 2.18
  CLS/CTR             543                  65.6                 0.121
  Total               653                 295.1                 0.452

The fact that the GT-VSP MOPS figures and the SS2 MIPS requirements differ illustrates that the GT-VSP and software implementations handle calculations differently. The higher ratios for the essentially algebraic NUC, SF, TF, and THR algorithms indicate additional overhead in the software versions, whereas the GT-VSP version of the CLS/CTR function exhibits overhead not present in the software version.
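The scaling arithmetic can be checked directly; a short sketch (illustrative, not the original analysis code) reproduces the floating-point rows of Tables 1 and 2 from the measured MFRs, up to rounding of the published values:

    #include <stdio.h>

    #define SS2_MIPS    28.5    /* published SS2 figures      */
    #define SS2_MFLOPS   4.2
    #define TARGET_FPS 100.0    /* required GT-VSP frame rate */

    int main (void)
    {
        static const char   *Alg [] = { "NUC", "SF", "TF", "THR", "CLS/CTR" };
        static const double  Mfr [] = { 10.73, 16.78, 8.47, 3.63, 6.41 };
        int i;

        for (i = 0; i < 5; i++) {
            double Scale  = TARGET_FPS / Mfr [i];  /* 100/MFR column         */
            double Mips   = Scale * SS2_MIPS;      /* Table 1, MIPS column   */
            double Mflops = Scale * SS2_MFLOPS;    /* Table 2, MFLOPS column */
            printf ("%-8s 100/MFR = %7.3f  MIPS = %4.0f  MFLOPS = %6.1f\n",
                    Alg [i], Scale, Mips, Mflops);
        }
        return 0;
    }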

3 Global Climate Modeling Application

In addition to image processing, parallel performance estimation based on empirical computational measurements has been applied to facilitate the efficient parallelization of a global climate change model. Execution profiles of the original serial version of the model were used to determine that one algorithmic component was responsible for most of the model's serial execution time. Subsequently, a parallel version of this component was evaluated, and the measured performance results were used to predict the performance of the parallel model. After the model was parallelized, its performance was measured and compared to the predictions.

The Enhanced Dynamical/Chemical Model of Atmospheric Ozone [4], a global climate change model used and refined over more than two decades [5], solves a set of time-dependent partial differential equations in three spatial dimensions (illustrated in Figure 3). It uses a finite difference method in time and in the vertical dimension, whose coordinate is pressure discretized into 32 atmospheric levels extending from the earth's surface upward 85 km. The horizontal dimensions are solved using a spectral method truncated triangularly at wave number 19 (T19), which corresponds to a 64x28 longitude-latitude grid. During each time step, data are transformed between physical grid space, where model physics are computed, and spectral space, where model dynamics are computed.

The first parallel version of the model distributed data and computations across 32 Inmos T800 Transputers by atmospheric level (one processor per level) [6]. This vertical parallelization executed more efficiently than the serial version.

[Figure 3. Global Climate Modeling Application: 32 atmospheric levels, 64 longitudes, and 28 latitudes (14 north/south latitude pairs) mapped onto a Transputer array by level and latitude pair.]

Further model parallelization required splitting the compute-intensive spectral transform, so preliminary computational analysis was conducted in hopes of achieving a more efficient result.

3.1 Empirical Performance Data

Execution profiles of the original serial code revealed that over 70% of execution time is spent in the spectral transform code. The transform consists of two phases: a Gaussian quadrature approximation to a Legendre transform for each model grid longitude (spectral wave number), and a Fourier transform phase for each model grid latitude (spectral wave index). The performance of the transform parallelized across latitudes was evaluated, since this distribution keeps the computationally efficient FFTs local to a single processor. Fully parallelized across 14 latitude processors (28 model latitudes in north/south pairs), the spectral transform exhibited a speedup of 11.4. This measured transform performance is in line with results published for other spectral models [7, 8].

Following this preliminary performance analysis of the transform, the model was parallelized across the latitude dimension to produce the scalable processor mapping shown in Figure 3. The correctness of the parallel model was verified by comparing its output to that of the original serial model. Then, the performance of several processor configurations was measured in terms of the average time to complete a model time step. For these measurements, no model data were output to disk; the unoptimized parallel I/O would have introduced significant serial overhead not reflected by the spectral transform benchmark.

3.2 Parallel Performance Estimation

Applying Amdahl's law, expressed as

    t_N = (f_S + f_P / N) t_S,

where N is the number of processors, t_N is the execution time on N processors, f_S is the serial fraction, f_P is the parallel fraction, and t_S is the serial execution time, the serial fraction of the spectral transform is 0.077. Most overhead incurred in the latitude parallelization impacts only the spectral transform. Therefore, based on the execution profile of the original serial code (see section 3.1), the serial fraction of the overall model is roughly 0.053, i.e., 70% of the spectral transform's 0.077. From the measured model execution time and the derived serial fraction, parallel performance estimates were computed for various degrees of latitude parallelization, as shown in Table 3. Table 3 also gives the actual performance measurements, the percentage error of the estimates, and the distribution profile of latitudes to processors.

Table 3. Parallel GCCM Performance by Latitude

  Total       Latitude         Latitude            Actual         Estimated    Error
  Processors  Processors, N    Distribution        Time, T_N (s)  Time (s)     (%)
   32         1                14                  134.3          N/A          N/A
   64         2                7, 7                 67.5           70.6         4.59
   96         3                5, 4, 5              49.5           49.5         0.00
  128         4                3, 4, 3, 4           41.3           38.9         5.81
  160         5                3, 3, 3, 2, 3        32.6           32.6         0.00
  192         6                2, 2, 2, 2, 3, 3     33.1           28.3        14.5

The parallel performance estimates are within about 5% of the actual realized performance except for the last case. Whereas the latitude distributions from 2 up to 5 processors resulted in a strictly decreasing maximum number of latitudes per processor, the last distribution, to 6 processors, did not decrease the maximum number of latitudes per processor. To evaluate the full scalable potential of the model, additional Transputers would need to be configured. Only two meaningful distributions (i.e., those which reduce the maximum number of latitudes per processor with a minimum number of processors) remain to be measured: 7 latitude processors (224 total processors) and 14 latitude processors (448 total processors).
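Both the estimates and the remaining-distribution argument follow mechanically from the numbers above. A short sketch (illustrative code, not from the paper, assuming the f_S = 0.053 and t_S = 134.3 s values) reproduces the Table 3 estimates and shows that the maximum number of latitude pairs per processor, ceil(14/N), next decreases only at N = 7 and N = 14:

    #include <stdio.h>

    #define F_S   0.053    /* derived serial fraction of the model        */
    #define T_S   134.3    /* measured time on one latitude processor (s) */
    #define PAIRS 14       /* 28 model latitudes in north/south pairs     */

    int main (void)
    {
        int N;
        for (N = 1; N <= PAIRS; N++) {
            double Estimate = (F_S + (1.0 - F_S) / N) * T_S;  /* Amdahl     */
            int    MaxPairs = (PAIRS + N - 1) / N;            /* ceil(14/N) */
            printf ("N = %2d  estimated t_N = %5.1f s  max pairs = %d\n",
                    N, Estimate, MaxPairs);
        }
        return 0;
    }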

Currently, only 192 processors are configured for the GCCM.

4 Conclusions

The use of measured operational performance on actual hardware to predict parallel performance has been presented for two applications: image processing and global climate modeling. In the first case, performance data were used to characterize the computational requirements met by custom parallel hardware. In the second case, performance data on a major algorithmic component were used to predict the performance resulting from a given parallelization effort. While these two examples illustrate that parallel performance can be predicted from empirical results, they also suggest that care must be taken when comparing serial and parallel implementations (e.g., ensuring that the measured serial program reflects the computational methodology of the parallel application).

The objective of using empirical computational metrics rather than traditional algorithmic order-of-operations analysis is to produce more accurate parallel application performance predictions rather than loose performance bounds. For many applications, achieving the fastest solution in today's implementation is more important than knowing the theoretical ideal implementation. Real performance data from target processors can elucidate how to obtain the fastest available execution of an application, provided the necessary parameters are measured and accounted for. Further research is needed to determine that set of parameters and the equations required to translate them into performance estimates.

5 References

[1] W. S. Tan et al., A High-Performance Modular Signal Processor for Object Detection, Proceedings of the 1990 Government Microcircuit Applications Conference (GOMAC), Las Vegas, Nev., USA, November 4-8, 1990, pp. 489-492.

[2] R. W. Melton et al., A VLSI System Implementation for Real-Time Object Detection, 1996 IEEE International Symposium on Circuits and Systems (ISCAS'96), Vol. 4, Atlanta, Ga., USA, May 12-15, 1996, pp. 320-323.

[3] A. M. Henshaw et al., Signal Processing Algorithms - Georgia Tech Benchmark, Special Technical Report No. STR-0142-90-008, Computer Engineering Research Laboratory, School of Electrical Engineering, Georgia Institute of Technology, February 27, 1990.

[4] F. N. Alyea et al., An Enhanced Dynamical/Chemical Model of Atmospheric Ozone, School of Geophysical Sciences, Georgia Institute of Technology, Atlanta, Ga., USA, July 1985.

[5] D. Cunnold et al., A Three-Dimensional Dynamical-Chemical Model of Atmospheric Ozone, Journal of the Atmospheric Sciences, Vol. 32, Jan. 1975, pp. 170-194.

[6] R. W. Melton et al., A Transputer-Based Scalable, Parallel Global Climate Change Model, Transputer Research and Applications Conference 7 (NATUG 7), Athens, Ga., USA, October 23-25, 1994, pp. 60-67.

[7] G. Carver, A Spectral Meteorological Model on the ICL DAP, Parallel Computing, Vol. 8, No. 1-3, Oct. 1988, pp. 121-126.

[8] D. F. Snelling, A High Resolution Parallel Legendre Transform Algorithm, Supercomputing (Proceedings of the First International Conference), 1988, pp. 854-862.