Relating Empirical Performance Data to Achievable Parallel Application Performance
Published in Proceedings of the International Conference on Parallel and Distributed Processing Techniques and Applications (PDPTA'99), Vol. III, Las Vegas, Nev., USA, June 28-July 1, 1999, pp.

Roy W. Melton, Cecil O. Alford, Philip R. Bingham, and Tsai Chi Huang
Computer Engineering Research Laboratory, School of Electrical and Computer Engineering, Georgia Institute of Technology
777 Atlantic Dr. Ste. 496, Atlanta, GA

Abstract. Parallel computing offers the best execution performance for many large, compute-intensive applications. Whereas the overall computing requirements of such applications lead to a clear choice of a parallel computing paradigm, the specific suitability of one particular parallel computer versus another for a given application is often less clear. Research is underway to approach this issue of computational suitability by seeking to predict the parallel performance of an application based on the underlying computational metrics of that application and of candidate architectures. This paper presents two case studies where empirical performance data have been used to predict parallel application performance.

Keywords: MIMD; Parallel Applications; Performance Analysis; Performance Prediction

1 Parallel Applications and Performance

Parallel computing has emerged as a means toward achieving the fastest execution of many computationally challenging applications. Reports of various applications ported to various parallel architectures account for much of scientific computing literature and communications published over the last decade. Such metrics as parallel efficiency and speedup are the ubiquitous hallmarks of this body of work summarizing specific case studies and implementations.
While such analysis confirms the rewards of particular parallelization efforts, it often does not explicitly provide insight on the performance of significantly differing parallelization efforts. This absence of performance insight serves to handicap the development and deployment of applications to their full parallel potential. For the most computationally demanding applications, obtaining the fastest execution time is the ultimate goal. Its realization for a particular application implies a change of focus from specific implementations to generalized parallel performance characteristics for that application. Such performance research will then provide a path to optimal performance of a given application on its intended parallel computing platform. The foundations for this parallel performance research lie in work addressing the parallel scalability of various algorithms and applications. Typically, such work is based on an algorithmic analysis in terms of the order of operations required. This analysis is then applied to either existing or theoretical future parallel computing architectures to project upper and lower bounds on parallel efficiency and/or speedup. Current published parallel performance research reflects this body of work. Moving beyond this status quo of generalized asymptotic estimates of performance scalability toward more realistic estimates of realizable performance requires a change in the underlying analysis. Current algorithmic analysis expressed in order of operations allows asymptotic parallel performance estimates for general classes of parallel computers. More
accurate computational metrics of algorithms and their aggregate applications will permit better estimates of parallel performance on actual hardware. Two case studies follow that investigate the use of measured operational performance on actual hardware to predict parallel application performance.

2 Image Processing Application

A parallel performance estimation technique has been applied to an image processing application in order to establish the computational requirements necessary to support the application. The Computer Engineering Research Laboratory at Georgia Tech designed and implemented a set of parallel processors for a real-time object detection application as part of a research contract. During the term of the contract, it was known that no commercially available processors could implement the application; however, in anticipation of projected increases in the computing power of future commercially available processors, there was interest in quantifying the computational requirements of the application.

Consisting of six custom processors, the Georgia Tech VLSI signal processor chip set (GT-VSP) [1] performs parallel operations to identify desired objects within a real-time image stream. Each of the six types of GT-VSP elements implements a different image processing task: non-uniformity compensation (NUC); temporal filtering (TF); spatial filtering (SF); thresholding (THR); clustering (CLS); and centroiding (CTR). The heterogeneous processors operate in a pipeline fashion (as shown in Figure 1) to process image frames of 128x128 pixels in parallel at up to 100 frames per second (fps) [2].
Although the custom GT-VSP intrinsically supports the application, its processing capabilities in terms of standard computational metrics (e.g., millions of instructions per second [MIPS], millions of floating point operations per second [MFLOPS], and millions of operations per second [MOPS]) suitable for comparison to commercially available processors are not readily apparent. Therefore, to derive GT-VSP computational requirements, the GT-VSP algorithms were implemented in software [3] and run on a well-benchmarked machine. Scaling the empirical performance data by this machine's published performance figures yielded standardized performance estimates for the GT-VSP system implementation.

Figure 1. GT-VSP Processing Pipeline (stages: Non-Uniformity Compensation, Temporal Filtering, Spatial Filtering, Thresholding, Clustering, Centroiding (X) and Centroiding (Y); NUC Memory 5 x 256 x 256 x 16; TF Memory 4 x 256 x 256 x 16)

2.1 Empirical Performance Data

A Sun SPARCstation 2 (SS2) in single-user mode executed the software versions of GT-VSP repeatedly to generate corresponding execution timings. Each GT-VSP algorithm was benchmarked both in direct form from [3] as C code and in one or more computationally optimized C versions. Figure 2 shows the benchmarks for THR in simple mode. The algorithmic version with the best timing was then used for the standardized performance estimation.

Direct version:

    void SimpleThreshold (pixel_type Lower, pixel_type Upper, long *Count,
                          pixel_type In [rows][columns],
                          pixel_type Out [rows][columns])
    {
        int Row, Column;
        pixel_type Pixel;

        *Count = 0;
        for (Row = 0; Row < rows; Row++) {
            for (Column = 0; Column < columns; Column++) {
                Pixel = In [Row][Column];
                if ((Pixel >= Lower) && (Pixel <= Upper)) {
                    (*Count)++;                  /* pixel passes the threshold */
                    Out [Row][Column] = Pixel;
                } else {
                    Out [Row][Column] = (pixel_type) 0;
                }
            }
        }
    }

Optimized version:

    void SimpleThreshold (pixel_type Lower, pixel_type Upper, long *Count,
                          pixel_type In [rows][columns],
                          pixel_type Out [rows][columns])
    {
        int I, J;
        pixel_type *InPtr, *OutPtr, Pixel;

        *Count = 0;
        /* Walk both images with pointers instead of recomputing array indices. */
        for (I = 0, InPtr = In [0], OutPtr = Out [0]; I < rows; I++) {
            for (J = 0; J < columns; J++, InPtr++, OutPtr++) {
                Pixel = *InPtr;
                if ((Pixel >= Lower) && (Pixel <= Upper)) {
                    (*Count)++;
                    *OutPtr = Pixel;
                } else {
                    *OutPtr = (pixel_type) 0;
                }
            }
        }
    }

Figure 2. THR Simple Algorithm: Direct and Optimized C Versions

Two main issues drove the performance measurement methodology. First, for the methodology to be successful, the software implementation needed to reflect the actual hardware operation as fully as possible. Individually, hardware and software typically impart differing optimizations to a given algorithmic implementation. Therefore, implementation evaluation between these paradigms requires compensation for their intrinsic optimization techniques.

An optimized software implementation can avoid some operations inherent in a hardware solution. Software operating on general-purpose processors affords some performance optimizations that do not reflect the custom GT-VSP's operation. Whereas compilers and/or processor hardware facilitate the minimizing of operations based on test conditions and image data (e.g., avoiding multiplication by zero or one), GT-VSP must execute for any pixel the worst case computation as though it occurred for every pixel, because GT-VSP ensures 100 fps processing. The functions within the algorithms and the implementation constraints impact the hardware in ways that are not reflected in software. Thus, GT-VSP performance is best characterized by benchmarking the software with the worst case image data (i.e., the highest number of images to detect, non-zero/non-unity filter coefficients, etc.); this scenario corresponds to a real processing situation, and any replacement computing hardware must be able to solve it within the required time constraint.

For most of the GT-VSP algorithms, worst case image data corresponds to a single image to be processed; however, benchmarking CLS required evaluating an ensemble of images. Unlike the other algorithms, CLS is affected not only by how many object (i.e., non-zero) pixels are in a frame but also by how they are arranged in the frame, since it identifies all contiguous non-zero pixels as belonging to a unique object. As pixels are evaluated in raster-scan order, sometimes what appears to be a unique object turns out to be part of a larger object, and thus the object data must be merged. Characterizing CLS performance therefore required various images that could quantify its object detection and object merging capabilities.

On the other hand, an optimized hardware implementation can avoid some data access overhead inherent in a software abstraction. GT-VSP does not have to maintain the pixel array indices within an image as the software abstractions do, so in the software implementations, array indices were replaced with direct pointers to pixels everywhere practical for performance optimization.

Secondly, software benchmarks of custom hardware needed to handle varying data representations. GT-VSP operates in fixed point arithmetic using precision of 16 to 35 bits as needed to preserve accuracy.
Therefore, both 32-bit integer and 32-bit floating point software versions were evaluated.

2.2 Parallel Performance Estimation

The Sun SPARCstation 2 (SS2) is a machine with many existing benchmarks. Published performance figures for this computer are 28.5 MIPS and 4.2 MFLOPS. The time required to process a single image frame with each GT-VSP algorithm on this computer was measured. Then, the SS2's maximum frame rate (MFR) for each algorithm was calculated by taking the reciprocal of the observed frame time. Table 1 shows the MFRs along with the estimated GT-VSP MIPS requirement. The MIPS requirement was determined by scaling the SS2's published MIPS rating by the factor 100/MFR (included in Table 1) needed to raise the SS2 MFR to the 100 fps processing performance of GT-VSP.

Table 1. GT-VSP Performance Estimate in MIPS from SS2 Benchmark
(Columns: GT-VSP Algorithm; MFR (fps); 100/MFR (1/[100 fps]); MIPS for 100 fps. Rows, given separately for Floating Point Operations and for Integer Operations: NUC; SF; TF; THR; CLS/CTR; Total.)

Table 2 shows the operation count for each GT-VSP algorithm in MOPS along with the computed SS2 MFLOPS requirement for 100 fps and the ratio of MFLOPS to MOPS. The SS2 MFLOPS requirement was determined by scaling the MIPS requirement from Table 1 by a factor of 4.2/28.5, the ratio of SS2 MFLOPS to SS2 MIPS. The fact that the MOPS and MIPS numbers differ illustrates that the GT-VSP and software implementations handle calculations differently. The higher ratios for the essentially algebraic NUC, SF, TF, and THR algorithms indicate additional overhead in the software version, whereas the GT-VSP version of the CLS/CTR function exhibits overhead not present in the software version.

3 Global Climate Modeling Application

In addition to image processing, parallel performance estimation based on empirical computational measurements has been applied to facilitate the efficient parallelization of a global climate change model. Execution profiles of the original serial version of the model were used to determine that one algorithmic component was responsible for most of the model's serial execution time. Subsequently, a parallel version of this component was evaluated. The measured performance results were then used to predict the performance of the parallel model. After the model was parallelized, its performance was measured and compared to the predictions.

Table 2. GT-VSP Operation Count and Estimate in MFLOPS from SS2 Benchmark
(Columns: GT-VSP Algorithm; MOPS at 100 fps; MFLOPS for 100 fps; Ratio MFLOPS/MOPS. Rows: NUC; SF; TF; THR; CLS/CTR; Total.)
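The scaling behind Tables 1 and 2 (Section 2.2) can be sketched in a few lines of C. This is a minimal illustration under our reading of the text, namely that the requirement is the published SS2 rating multiplied by 100/MFR; the function names and the target frame rate parameter are hypothetical and do not come from the benchmark code in [3].

```c
#include <assert.h>

/* Published SS2 ratings quoted in Section 2.2. */
#define SS2_MIPS   28.5
#define SS2_MFLOPS 4.2

/* Maximum frame rate (fps): reciprocal of the measured time per frame. */
double max_frame_rate(double frame_time_s)
{
    assert(frame_time_s > 0.0);
    return 1.0 / frame_time_s;
}

/* Assumed scaling: MIPS needed to sustain target_fps, obtained by
 * multiplying the SS2 MIPS rating by target_fps / MFR. */
double required_mips(double mfr_fps, double target_fps)
{
    assert(mfr_fps > 0.0 && target_fps > 0.0);
    return SS2_MIPS * (target_fps / mfr_fps);
}

/* Convert a MIPS requirement to MFLOPS with the SS2 ratio 4.2 / 28.5. */
double required_mflops(double mips)
{
    return mips * (SS2_MFLOPS / SS2_MIPS);
}
```

As an illustration with made-up numbers (not measurements from the paper), a frame time of 0.5 s gives an MFR of 2 fps, so `required_mips(2.0, 100.0)` yields 1425 MIPS and `required_mflops` of that value yields about 210 MFLOPS.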
The Enhanced Dynamical/Chemical Model of Atmospheric Ozone [4], a global climate change model (GCCM) used and refined over more than two decades [5], solves a set of time-dependent partial differential equations in three spatial dimensions (illustrated in Figure 3). It uses a finite difference method in time and in the vertical dimension, whose coordinate is pressure discretized into 32 atmospheric levels representing the atmosphere from the earth's surface up to 85 km. The horizontal dimensions are solved using a spectral method truncated triangularly at wave number 19 (T19), which corresponds to a 64x28 longitude-latitude grid. During each time step, data are transformed between physical grid space, where model physics are computed, and spectral space, where model dynamics are computed.

The first parallel version of the model distributed data and computations across 32 Inmos T800 Transputers by atmospheric level (one processor per level) [6]. This vertical parallelization executed more efficiently than the serial version. Further model parallelization required splitting the compute-intensive spectral transform, so preliminary computational analysis was conducted in hopes of achieving a more efficient result.

Figure 3. Global Climate Modeling Application (labels: Transputer Array; Latitude Pair; Level; EARTH; 32 Atmospheric Levels; 64 Longitudes; 28 Latitudes)

3.1 Empirical Performance Data

Execution profiles of the original serial code revealed that over 70% of execution time is spent in the spectral transform code. The transform consists of two phases: a Gaussian quadrature approximation to a Legendre transform for each model grid longitude (spectral wave number); and a Fourier transform phase for each model grid latitude (spectral wave index). The performance of the transform parallelized across latitudes was evaluated since this distribution keeps the known computationally efficient FFTs local to a single processor. Fully parallelized across 14 latitude processors (28 model latitudes in north/south pairs), the spectral transform exhibited a speedup of [...]. This measured transform performance is in line with results published for other spectral models [7, 8].

Following this preliminary performance analysis of the transform, the model was parallelized across the latitude dimension to produce the scalable processor mapping shown in Figure 3. The correctness of the parallel model was verified by comparing its output to that of the original serial model. Then, the performance of several processor configurations was measured in terms of the average time to complete a model time step. For these measurements, no model data were output to disk; the unoptimized parallel I/O would have introduced significant serial overhead not reflected by the spectral transform benchmark.
3.2 Parallel Performance Estimation

Applying Amdahl's law expressed as

$$ t_N = \left( f_S + \frac{f_P}{N} \right) t_S $$

where N is the number of processors, t_N is the execution time on N processors, f_S is the serial fraction, f_P is the parallel fraction, and t_S is the serial execution time, the serial fraction of the spectral transform is [...]. Most overhead incurred in the latitude parallelization impacts only the spectral transform. Therefore, based on the execution profile of the original serial code (see Section 3.1), the serial fraction of the overall model is roughly 0.053, 70% of the spectral transform's [...].

From the measured model execution time and the derived serial fraction, parallel performance estimates were computed for various degrees of latitude parallelization, as shown in Table 3. Table 3 also gives the actual performance measurements, the percentage error of the estimates, and the distribution profile of latitudes to processors. The parallel performance estimates are within 5% of the actual realized performance except for the last case. Whereas the latitude distributions from 2 up to 5 processors resulted in a strictly decreasing maximum number of latitudes per processor, the last distribution to 6 processors did not decrease the maximum number of latitudes per processor.

Table 3. Parallel GCCM Performance by Latitude
(Columns: Total Processors; Latitude Processors, N; Latitude Distribution; Actual Time, T_N (s); Estimated Time (s); Error (%).)

To evaluate the full scalable potential of the model, additional Transputers would need to be configured. Only two meaningful distributions (i.e., those which reduce the maximum number of latitudes per processor with a minimum number of processors) remain to be measured: 7 latitude processors (224 total processors); and 14 latitude processors (448 total processors). Currently, only 192 processors are configured for the GCCM.

4 Conclusions

The use of measured operational performance on actual hardware to predict parallel performance has been presented for two applications: image processing and global climate modeling. In the first case, performance data were used to characterize computational requirements to match custom parallel hardware. In the second case, performance data on a major algorithmic component were used to predict the performance resulting from a given parallel performance effort. While these two examples illustrate that parallel performance can be predicted from empirical results, their results suggest that care must be taken when comparing serial and parallel implementations (e.g., that the measured serial program reflects the computational methodology of the parallel application).

The objective of using empirical computational metrics rather than traditional algorithmic order of operations analysis is to produce more accurate parallel application performance predictions rather than loose performance bounds. For many applications, achieving the fastest solution in today's implementation is more important than knowing the theoretical ideal implementation. Real performance data from target processors can elucidate how to obtain the fastest available execution of an application, given that the necessary parameters are measured and accounted for. Further research is needed to determine the required set of parameters and the equations that translate them into performance estimates.

5 References

[1] W. S. Tan et al., "A High-Performance Modular Signal Processor for Object Detection," Proceedings of the 1990 Government Microcircuit Applications Conference (GOMAC), Las Vegas, Nev., USA, November 4-8, 1990, pp.
[2] R. W. Melton et al., "A VLSI System Implementation for Real-Time Object Detection," 1996 IEEE International Symposium on Circuits and Systems (ISCAS'96), Vol. 4, Atlanta, Ga., USA, May 12-15, 1996, pp.
[3] A. M. Henshaw et al., "Signal Processing Algorithms-Georgia Tech Benchmark," Special Technical Report, Report No. STR, Computer Engineering Research Laboratory, School of Electrical Engineering, Georgia Institute of Technology, February 27.
[4] F. N. Alyea et al., "An Enhanced Dynamical/Chemical Model of Atmospheric Ozone," School of Geophysical Sciences, Georgia Institute of Technology, Atlanta, Ga., USA, July.
[5] D. Cunnold et al., "A Three-Dimensional Dynamical-Chemical Model of Atmospheric Ozone," Journal of the Atmospheric Sciences, Vol. 32, Jan., 1975, pp.
[6] R. W. Melton et al., "A Transputer-Based Scalable, Parallel Global Climate Change Model," Transputer Research and Applications Conference 7 (NATUG 7), Athens, Ga., USA, October 23-25, 1994, pp.
[7] G. Carver, "A Spectral Meteorological Model on the ICL DAP," Parallel Computing, Vol. 8, No. 1-3, Oct., 1988, pp.
[8] D. F. Snelling, "A High Resolution Parallel Legendre Transform Algorithm," Supercomputing (Proceedings of the First International Conference), 1988, pp.
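The estimation procedure of Section 3.2 can be sketched in C as follows. This is a minimal sketch, not the authors' code: it assumes Amdahl's law with f_P = 1 - f_S, and all function names are hypothetical. The second function inverts the law to recover a serial fraction from a measured speedup, and the third computes the largest number of latitude pairs assigned to any processor when the pairs are spread as evenly as possible.

```c
#include <assert.h>

/* Amdahl's law: estimated time on n processors, given the serial time
 * and the serial fraction (the parallel fraction is 1 - f_serial). */
double amdahl_time(double t_serial, double f_serial, int n)
{
    assert(n >= 1);
    assert(f_serial >= 0.0 && f_serial <= 1.0);
    return (f_serial + (1.0 - f_serial) / n) * t_serial;
}

/* Invert Amdahl's law: serial fraction implied by a speedup measured on
 * n processors. From 1/S = f + (1 - f)/n it follows that
 * f = (1/S - 1/n) / (1 - 1/n). Requires n > 1. */
double serial_fraction_from_speedup(double speedup, int n)
{
    assert(n > 1 && speedup > 0.0);
    return (1.0 / speedup - 1.0 / n) / (1.0 - 1.0 / n);
}

/* Largest number of latitude pairs on any one processor when `pairs`
 * are distributed as evenly as possible over n processors (ceiling). */
int max_pairs_per_processor(int pairs, int n)
{
    assert(pairs > 0 && n > 0);
    return (pairs + n - 1) / n;
}
```

For the 14 latitude pairs of the model, `max_pairs_per_processor` returns 7, 5, 4, 3, and 3 for 2 through 6 processors, then 2 for 7 and 1 for 14. This matches the observation in Section 3.2 that moving from 5 to 6 latitude processors does not reduce the maximum load. The plain Amdahl estimate in `amdahl_time` divides the parallel work by N and so does not capture that imbalance, which may be one reason the last case in Table 3 shows the largest error.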
More informationRN-Codings: New Insights and Some Applications
RN-Codings: New Insights and Some Applications Abstract During any composite computation there is a constant need for rounding intermediate results before they can participate in further processing. Recently
More informationA General Framework for Tracking Objects in a Multi-Camera Environment
A General Framework for Tracking Objects in a Multi-Camera Environment Karlene Nguyen, Gavin Yeung, Soheil Ghiasi, Majid Sarrafzadeh {karlene, gavin, soheil, majid}@cs.ucla.edu Abstract We present a framework
More information4.3. David E. Rudack*, Meteorological Development Laboratory Office of Science and Technology National Weather Service, NOAA 1.
43 RESULTS OF SENSITIVITY TESTING OF MOS WIND SPEED AND DIRECTION GUIDANCE USING VARIOUS SAMPLE SIZES FROM THE GLOBAL ENSEMBLE FORECAST SYSTEM (GEFS) RE- FORECASTS David E Rudack*, Meteorological Development
More informationFPGA area allocation for parallel C applications
1 FPGA area allocation for parallel C applications Vlad-Mihai Sima, Elena Moscu Panainte, Koen Bertels Computer Engineering Faculty of Electrical Engineering, Mathematics and Computer Science Delft University
More informationSelection of Techniques and Metrics
Performance Evaluation: Selection of Techniques and Metrics Hongwei Zhang http://www.cs.wayne.edu/~hzhang Acknowledgement: this lecture is partially based on the slides of Dr. Raj Jain. Outline Selecting
More informationComputer Architecture TDTS10
why parallelism? Performance gain from increasing clock frequency is no longer an option. Outline Computer Architecture TDTS10 Superscalar Processors Very Long Instruction Word Processors Parallel computers
More informationProposal and Development of a Reconfigurable Parallel Job Scheduling Algorithm
Proposal and Development of a Reconfigurable Parallel Job Scheduling Algorithm Luís Fabrício Wanderley Góes, Carlos Augusto Paiva da Silva Martins Graduate Program in Electrical Engineering PUC Minas {lfwgoes,capsm}@pucminas.br
More information18-742 Lecture 4. Parallel Programming II. Homework & Reading. Page 1. Projects handout On Friday Form teams, groups of two
age 1 18-742 Lecture 4 arallel rogramming II Spring 2005 rof. Babak Falsafi http://www.ece.cmu.edu/~ece742 write X Memory send X Memory read X Memory Slides developed in part by rofs. Adve, Falsafi, Hill,
More informationTechnical White Paper. Symantec Backup Exec 10d System Sizing. Best Practices For Optimizing Performance of the Continuous Protection Server
Symantec Backup Exec 10d System Sizing Best Practices For Optimizing Performance of the Continuous Protection Server Table of Contents Table of Contents...2 Executive Summary...3 System Sizing and Performance
More informationBuilding Scalable Applications Using Microsoft Technologies
Building Scalable Applications Using Microsoft Technologies Padma Krishnan Senior Manager Introduction CIOs lay great emphasis on application scalability and performance and rightly so. As business grows,
More informationControl 2004, University of Bath, UK, September 2004
Control, University of Bath, UK, September ID- IMPACT OF DEPENDENCY AND LOAD BALANCING IN MULTITHREADING REAL-TIME CONTROL ALGORITHMS M A Hossain and M O Tokhi Department of Computing, The University of
More informationBuilding an Inexpensive Parallel Computer
Res. Lett. Inf. Math. Sci., (2000) 1, 113-118 Available online at http://www.massey.ac.nz/~wwiims/rlims/ Building an Inexpensive Parallel Computer Lutz Grosz and Andre Barczak I.I.M.S., Massey University
More informationRN-coding of Numbers: New Insights and Some Applications
RN-coding of Numbers: New Insights and Some Applications Peter Kornerup Dept. of Mathematics and Computer Science SDU, Odense, Denmark & Jean-Michel Muller LIP/Arénaire (CRNS-ENS Lyon-INRIA-UCBL) Lyon,
More informationPIXEL-LEVEL IMAGE FUSION USING BROVEY TRANSFORME AND WAVELET TRANSFORM
PIXEL-LEVEL IMAGE FUSION USING BROVEY TRANSFORME AND WAVELET TRANSFORM Rohan Ashok Mandhare 1, Pragati Upadhyay 2,Sudha Gupta 3 ME Student, K.J.SOMIYA College of Engineering, Vidyavihar, Mumbai, Maharashtra,
More informationData Centric Systems (DCS)
Data Centric Systems (DCS) Architecture and Solutions for High Performance Computing, Big Data and High Performance Analytics High Performance Computing with Data Centric Systems 1 Data Centric Systems
More informationWhite Paper. Recording Server Virtualization
White Paper Recording Server Virtualization Prepared by: Mike Sherwood, Senior Solutions Engineer Milestone Systems 23 March 2011 Table of Contents Introduction... 3 Target audience and white paper purpose...
More informationIBM SPSS Direct Marketing 23
IBM SPSS Direct Marketing 23 Note Before using this information and the product it supports, read the information in Notices on page 25. Product Information This edition applies to version 23, release
More informationReliable Systolic Computing through Redundancy
Reliable Systolic Computing through Redundancy Kunio Okuda 1, Siang Wun Song 1, and Marcos Tatsuo Yamamoto 1 Universidade de São Paulo, Brazil, {kunio,song,mty}@ime.usp.br, http://www.ime.usp.br/ song/
More informationChapter 2 Basic Structure of Computers. Jin-Fu Li Department of Electrical Engineering National Central University Jungli, Taiwan
Chapter 2 Basic Structure of Computers Jin-Fu Li Department of Electrical Engineering National Central University Jungli, Taiwan Outline Functional Units Basic Operational Concepts Bus Structures Software
More informationPerformance metrics for parallelism
Performance metrics for parallelism 8th of November, 2013 Sources Rob H. Bisseling; Parallel Scientific Computing, Oxford Press. Grama, Gupta, Karypis, Kumar; Parallel Computing, Addison Wesley. Definition
More informationA Comparison of General Approaches to Multiprocessor Scheduling
A Comparison of General Approaches to Multiprocessor Scheduling Jing-Chiou Liou AT&T Laboratories Middletown, NJ 0778, USA jing@jolt.mt.att.com Michael A. Palis Department of Computer Science Rutgers University
More informationA Review of Customized Dynamic Load Balancing for a Network of Workstations
A Review of Customized Dynamic Load Balancing for a Network of Workstations Taken from work done by: Mohammed Javeed Zaki, Wei Li, Srinivasan Parthasarathy Computer Science Department, University of Rochester
More informationFacts about Visualization Pipelines, applicable to VisIt and ParaView
Facts about Visualization Pipelines, applicable to VisIt and ParaView March 2013 Jean M. Favre, CSCS Agenda Visualization pipelines Motivation by examples VTK Data Streaming Visualization Pipelines: Introduction
More informationPerformance Workload Design
Performance Workload Design The goal of this paper is to show the basic principles involved in designing a workload for performance and scalability testing. We will understand how to achieve these principles
More informationOn some Potential Research Contributions to the Multi-Core Enterprise
On some Potential Research Contributions to the Multi-Core Enterprise Oded Maler CNRS - VERIMAG Grenoble, France February 2009 Background This presentation is based on observations made in the Athole project
More informationFAULT TOLERANCE FOR MULTIPROCESSOR SYSTEMS VIA TIME REDUNDANT TASK SCHEDULING
FAULT TOLERANCE FOR MULTIPROCESSOR SYSTEMS VIA TIME REDUNDANT TASK SCHEDULING Hussain Al-Asaad and Alireza Sarvi Department of Electrical & Computer Engineering University of California Davis, CA, U.S.A.
More informationA STUDY OF TASK SCHEDULING IN MULTIPROCESSOR ENVIROMENT Ranjit Rajak 1, C.P.Katti 2, Nidhi Rajak 3
A STUDY OF TASK SCHEDULING IN MULTIPROCESSOR ENVIROMENT Ranjit Rajak 1, C.P.Katti, Nidhi Rajak 1 Department of Computer Science & Applications, Dr.H.S.Gour Central University, Sagar, India, ranjit.jnu@gmail.com
More informationMonday January 19th 2015 Title: "Transmathematics - a survey of recent results on division by zero" Facilitator: TheNumberNullity / James Anderson, UK
Monday January 19th 2015 Title: "Transmathematics - a survey of recent results on division by zero" Facilitator: TheNumberNullity / James Anderson, UK It has been my pleasure to give two presentations
More informationBenchmarking Large Scale Cloud Computing in Asia Pacific
2013 19th IEEE International Conference on Parallel and Distributed Systems ing Large Scale Cloud Computing in Asia Pacific Amalina Mohamad Sabri 1, Suresh Reuben Balakrishnan 1, Sun Veer Moolye 1, Chung
More informationIntel Cloud Builders Guide to Cloud Design and Deployment on Intel Platforms
Intel Cloud Builders Guide Intel Xeon Processor-based Servers RES Virtual Desktop Extender Intel Cloud Builders Guide to Cloud Design and Deployment on Intel Platforms Client Aware Cloud with RES Virtual
More informationSUPER RESOLUTION FROM MULTIPLE LOW RESOLUTION IMAGES
SUPER RESOLUTION FROM MULTIPLE LOW RESOLUTION IMAGES ABSTRACT Florin Manaila 1 Costin-Anton Boiangiu 2 Ion Bucur 3 Although the technology of optical instruments is constantly advancing, the capture of
More informationFloating Point Fused Add-Subtract and Fused Dot-Product Units
Floating Point Fused Add-Subtract and Fused Dot-Product Units S. Kishor [1], S. P. Prakash [2] PG Scholar (VLSI DESIGN), Department of ECE Bannari Amman Institute of Technology, Sathyamangalam, Tamil Nadu,
More informationJunghyun Ahn Changho Sung Tag Gon Kim. Korea Advanced Institute of Science and Technology (KAIST) 373-1 Kuseong-dong, Yuseong-gu Daejoen, Korea
Proceedings of the 211 Winter Simulation Conference S. Jain, R. R. Creasey, J. Himmelspach, K. P. White, and M. Fu, eds. A BINARY PARTITION-BASED MATCHING ALGORITHM FOR DATA DISTRIBUTION MANAGEMENT Junghyun
More informationtelemetry Rene A.J. Chave, David D. Lemon, Jan Buermans ASL Environmental Sciences Inc. Victoria BC Canada rchave@aslenv.com I.
Near real-time transmission of reduced data from a moored multi-frequency sonar by low bandwidth telemetry Rene A.J. Chave, David D. Lemon, Jan Buermans ASL Environmental Sciences Inc. Victoria BC Canada
More informationThis Unit: Floating Point Arithmetic. CIS 371 Computer Organization and Design. Readings. Floating Point (FP) Numbers
This Unit: Floating Point Arithmetic CIS 371 Computer Organization and Design Unit 7: Floating Point App App App System software Mem CPU I/O Formats Precision and range IEEE 754 standard Operations Addition
More informationEE361: Digital Computer Organization Course Syllabus
EE361: Digital Computer Organization Course Syllabus Dr. Mohammad H. Awedh Spring 2014 Course Objectives Simply, a computer is a set of components (Processor, Memory and Storage, Input/Output Devices)
More informationInterconnect Efficiency of Tyan PSC T-630 with Microsoft Compute Cluster Server 2003
Interconnect Efficiency of Tyan PSC T-630 with Microsoft Compute Cluster Server 2003 Josef Pelikán Charles University in Prague, KSVI Department, Josef.Pelikan@mff.cuni.cz Abstract 1 Interconnect quality
More informationCurrent Standard: Mathematical Concepts and Applications Shape, Space, and Measurement- Primary
Shape, Space, and Measurement- Primary A student shall apply concepts of shape, space, and measurement to solve problems involving two- and three-dimensional shapes by demonstrating an understanding of:
More informationThere are a number of factors that increase the risk of performance problems in complex computer and software systems, such as e-commerce systems.
ASSURING PERFORMANCE IN E-COMMERCE SYSTEMS Dr. John Murphy Abstract Performance Assurance is a methodology that, when applied during the design and development cycle, will greatly increase the chances
More informationPHOTOGRAMMETRIC TECHNIQUES FOR MEASUREMENTS IN WOODWORKING INDUSTRY
PHOTOGRAMMETRIC TECHNIQUES FOR MEASUREMENTS IN WOODWORKING INDUSTRY V. Knyaz a, *, Yu. Visilter, S. Zheltov a State Research Institute for Aviation System (GosNIIAS), 7, Victorenko str., Moscow, Russia
More informationParallels Virtuozzo Containers vs. VMware Virtual Infrastructure:
Parallels Virtuozzo Containers vs. VMware Virtual Infrastructure: An Independent Architecture Comparison TABLE OF CONTENTS Introduction...3 A Tale of Two Virtualization Solutions...5 Part I: Density...5
More informationEight Ways to Increase GPIB System Performance
Application Note 133 Eight Ways to Increase GPIB System Performance Amar Patel Introduction When building an automated measurement system, you can never have too much performance. Increasing performance
More informationCharacterizing the Performance of Dynamic Distribution and Load-Balancing Techniques for Adaptive Grid Hierarchies
Proceedings of the IASTED International Conference Parallel and Distributed Computing and Systems November 3-6, 1999 in Cambridge Massachusetts, USA Characterizing the Performance of Dynamic Distribution
More informationThe Piranha computer algebra system. introduction and implementation details
: introduction and implementation details Advanced Concepts Team European Space Agency (ESTEC) Course on Differential Equations and Computer Algebra Estella, Spain October 29-30, 2010 Outline A Brief Overview
More informationSystem Interconnect Architectures. Goals and Analysis. Network Properties and Routing. Terminology - 2. Terminology - 1
System Interconnect Architectures CSCI 8150 Advanced Computer Architecture Hwang, Chapter 2 Program and Network Properties 2.4 System Interconnect Architectures Direct networks for static connections Indirect
More informationUnderstanding the Benefits of IBM SPSS Statistics Server
IBM SPSS Statistics Server Understanding the Benefits of IBM SPSS Statistics Server Contents: 1 Introduction 2 Performance 101: Understanding the drivers of better performance 3 Why performance is faster
More informationThe Methodology of Application Development for Hybrid Architectures
Computer Technology and Application 4 (2013) 543-547 D DAVID PUBLISHING The Methodology of Application Development for Hybrid Architectures Vladimir Orekhov, Alexander Bogdanov and Vladimir Gaiduchok Department
More informationScaling Objectivity Database Performance with Panasas Scale-Out NAS Storage
White Paper Scaling Objectivity Database Performance with Panasas Scale-Out NAS Storage A Benchmark Report August 211 Background Objectivity/DB uses a powerful distributed processing architecture to manage
More informationMicrocontroller-based experiments for a control systems course in electrical engineering technology
Microcontroller-based experiments for a control systems course in electrical engineering technology Albert Lozano-Nieto Penn State University, Wilkes-Barre Campus, Lehman, PA, USA E-mail: AXL17@psu.edu
More informationKeywords: Dynamic Load Balancing, Process Migration, Load Indices, Threshold Level, Response Time, Process Age.
Volume 3, Issue 10, October 2013 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Load Measurement
More informationA Cloud Computing Approach for Big DInSAR Data Processing
A Cloud Computing Approach for Big DInSAR Data Processing through the P-SBAS Algorithm Zinno I. 1, Elefante S. 1, Mossucca L. 2, De Luca C. 1,3, Manunta M. 1, Terzo O. 2, Lanari R. 1, Casu F. 1 (1) IREA
More informationMEng, BSc Applied Computer Science
School of Computing FACULTY OF ENGINEERING MEng, BSc Applied Computer Science Year 1 COMP1212 Computer Processor Effective programming depends on understanding not only how to give a machine instructions
More informationWhite Paper February 2010. IBM InfoSphere DataStage Performance and Scalability Benchmark Whitepaper Data Warehousing Scenario
White Paper February 2010 IBM InfoSphere DataStage Performance and Scalability Benchmark Whitepaper Data Warehousing Scenario 2 Contents 5 Overview of InfoSphere DataStage 7 Benchmark Scenario Main Workload
More information