Empirical Auto-tuning Code Generator for FFT and Trigonometric Transforms
1 Empirical Auto-tuning Code Generator for FFT and Trigonometric Transforms. Ayaz Ali and Prof. Lennart Johnsson, Texas Learning and Computation Center, University of Houston; Dr. Dragan Mirkovic, MD Anderson Cancer Center, University of Texas, Houston.
2 Outline: Motivation, Introduction, Empirical Optimization, Results, Conclusion.
3 Hardware/Software Gap: The end of Moore's Law has pushed the trend toward specialized architectures. Result: it is increasingly difficult for compilers to perform optimizations. Compilers have always lagged behind advances in hardware technology.
4 Hardware/Software Gap: Traditional ways to achieve higher performance: vendor compilers; source-to-source transformations, mainly targeting loops; hand tuning. Compilers struggle to optimize even regular matrix-matrix multiplication (MMM).
5 Adaptive Auto-tuning Libraries: Auto-tuning is performed in two stages. The code generator adapts to the microprocessor architecture by generating highly optimized blocks of code (micro-kernels) at installation time. The run-time adapts to the memory system by scheduling the execution of those blocks. Examples: ATLAS, FFTW, UHFFT.
6 Adaptive FFT Libraries: UHFFT and FFTW use two-stage optimization. The code generator generates micro-kernels of optimized straight-line C code (radix codelets), each of which computes part of the FFT problem. The run-time schedules the execution of an FFT problem by composing the parameterized codelets that the code generator produced at installation time. SPIRAL uses a single stage (a code generator system).
7 Tuning FFT: Many formulas can represent a single FFT, with different operation counts. The FFT's cache-oblivious schedule allows register blocking, but it can be improved for larger codelets. Because of the recursive schedule, loop-level optimizations are not as effective as in BLAS. Data access in the FFT is strided (non-sequential).
8 FFT Formula: Many formulas can represent an FFT (e.g. size 12). A lower operation count does not always lead to higher performance. [Figure: Size 12 FFT]
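To give a rough sense of why many formulas exist for one size, the sketch below counts only the ordered ways to split a size into factors greater than 1; each ordering stands for one nesting of a mixed-radix decomposition. This is an illustrative assumption rather than the generator's actual search space: the name count_factorizations is hypothetical, and a real formula space also includes algorithm choices (such as PFA or Rader) that this count ignores.

```c
#include <assert.h>

/* Count the ordered factorizations of n into factors greater than 1.
 * Hypothetical helper for illustration only: each ordering corresponds
 * to one way of nesting a mixed-radix FFT decomposition of size n. */
unsigned long count_factorizations(unsigned n)
{
    assert(n >= 1);
    if (n == 1)
        return 1;               /* empty product: nothing left to split */

    unsigned long total = 0;
    for (unsigned d = 2; d <= n; ++d) {
        if (n % d == 0)         /* d is the outermost factor */
            total += count_factorizations(n / d);
    }
    return total;
}
```

For size 12 this gives 8 orderings: 12; 2x6, 6x2, 3x4, 4x3; 2x2x3, 2x3x2, 3x2x2.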
9 Instruction Scheduling
Two versions of the size-4 codelet that differ only in instruction order. Up to 7% improvement.
Version 1:
    tmp0i = Im(x[0])+Im(x[2*is]);
    tmp0r = Re(x[0])+Re(x[2*is]);
    tmp1i = Im(x[0])-Im(x[2*is]);
    tmp1r = Re(x[0])-Re(x[2*is]);
    tmp2i = Im(x[is])+Im(x[3*is]);
    tmp2r = Re(x[is])+Re(x[3*is]);
    tmp3i = Im(x[is])-Im(x[3*is]);
    tmp3r = Re(x[is])-Re(x[3*is]);
    Re(y[0]) = tmp0r+tmp2r;
    Im(y[0]) = tmp0i+tmp2i;
    Re(y[os]) = tmp1r+tmp3i;
    Im(y[os]) = tmp1i-tmp3r;
    Re(y[2*os]) = tmp0r-tmp2r;
    Im(y[2*os]) = tmp0i-tmp2i;
    Re(y[3*os]) = tmp1r-tmp3i;
    Im(y[3*os]) = tmp1i+tmp3r;
Version 2:
    tmp3r = Re(x[is])-Re(x[3*is]);
    tmp2r = Re(x[is])+Re(x[3*is]);
    tmp3i = Im(x[is])-Im(x[3*is]);
    tmp2i = Im(x[is])+Im(x[3*is]);
    tmp1r = Re(x[0])-Re(x[2*is]);
    tmp0r = Re(x[0])+Re(x[2*is]);
    tmp1i = Im(x[0])-Im(x[2*is]);
    tmp0i = Im(x[0])+Im(x[2*is]);
    Re(y[2*os]) = tmp0r-tmp2r;
    Re(y[0]) = tmp0r+tmp2r;
    Im(y[os]) = tmp1i-tmp3r;
    Im(y[3*os]) = tmp1i+tmp3r;
    Re(y[3*os]) = tmp1r-tmp3i;
    Re(y[os]) = tmp1r+tmp3i;
    Im(y[2*os]) = tmp0i-tmp2i;
    Im(y[0]) = tmp0i+tmp2i;
Measured difference between the two versions (the MFLOPS values for Ver:1 and Ver:2 are missing from this transcript; only the relative differences remain):
    Platform             Compiler     Difference
    Itanium2 (1.5GHz)    gcc          3-4%
    Itanium2             icc          7%
    Itanium2             icc          3-4%
    Opteron (2.0GHz)     Pathscale    5-6%
    Opteron              gcc          5-6%
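The size-4 codelet on this slide can be turned into a compilable sketch. The statements are taken from the slide; everything around them is an assumption made so the code stands alone: the cplx struct, the Re/Im macros, and the function names fft4_v1 and fft4_v2 are hypothetical, and UHFFT's real codelet signature may differ. Both functions compute the same forward size-4 DFT and differ only in statement order.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed complex type and accessors; not taken from UHFFT. */
typedef struct { double re, im; } cplx;
#define Re(z) ((z).re)
#define Im(z) ((z).im)

/* One statement order; is/os are the input and output strides. */
void fft4_v1(const cplx *x, cplx *y, size_t is, size_t os)
{
    assert(x != NULL && y != NULL);
    double tmp0i = Im(x[0]) + Im(x[2*is]);
    double tmp0r = Re(x[0]) + Re(x[2*is]);
    double tmp1i = Im(x[0]) - Im(x[2*is]);
    double tmp1r = Re(x[0]) - Re(x[2*is]);
    double tmp2i = Im(x[is]) + Im(x[3*is]);
    double tmp2r = Re(x[is]) + Re(x[3*is]);
    double tmp3i = Im(x[is]) - Im(x[3*is]);
    double tmp3r = Re(x[is]) - Re(x[3*is]);
    Re(y[0])    = tmp0r + tmp2r;
    Im(y[0])    = tmp0i + tmp2i;
    Re(y[os])   = tmp1r + tmp3i;
    Im(y[os])   = tmp1i - tmp3r;
    Re(y[2*os]) = tmp0r - tmp2r;
    Im(y[2*os]) = tmp0i - tmp2i;
    Re(y[3*os]) = tmp1r - tmp3i;
    Im(y[3*os]) = tmp1i + tmp3r;
}

/* Same arithmetic, different statement order. */
void fft4_v2(const cplx *x, cplx *y, size_t is, size_t os)
{
    assert(x != NULL && y != NULL);
    double tmp3r = Re(x[is]) - Re(x[3*is]);
    double tmp2r = Re(x[is]) + Re(x[3*is]);
    double tmp3i = Im(x[is]) - Im(x[3*is]);
    double tmp2i = Im(x[is]) + Im(x[3*is]);
    double tmp1r = Re(x[0]) - Re(x[2*is]);
    double tmp0r = Re(x[0]) + Re(x[2*is]);
    double tmp1i = Im(x[0]) - Im(x[2*is]);
    double tmp0i = Im(x[0]) + Im(x[2*is]);
    Re(y[2*os]) = tmp0r - tmp2r;
    Re(y[0])    = tmp0r + tmp2r;
    Im(y[os])   = tmp1i - tmp3r;
    Im(y[3*os]) = tmp1i + tmp3r;
    Re(y[3*os]) = tmp1r - tmp3i;
    Re(y[os])   = tmp1r + tmp3i;
    Im(y[2*os]) = tmp0i - tmp2i;
    Im(y[0])    = tmp0i + tmp2i;
}
```

Because the two orderings are arithmetically identical, any speed difference comes from how the compiler schedules and allocates registers for each ordering, which is why it has to be measured rather than predicted.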
10 Code Generation (UHFFT1.6 vs FFTW3)
FFT Formula Minimization: heuristics based.
Schedule: FFTW: cache-oblivious followed by DAG reordering. UHFFT1.6 (older): cache-oblivious schedule of instructions.
Register Blocking: FFTW: yes. UHFFT1.6: no.
Architecture (ISA) Specific Codelets: FFTW: yes (e.g. fma, sse). UHFFT1.6: relies on the compiler completely.
Codelet Types: FFTW: straight-line and single loop (vector recursion). UHFFT1.6: straight-line.
Generator: FFTW: written in Caml (rule-based language). UHFFT1.6: written in C (allows more flexibility).
Further Optimizations: FFTW: expert user selects the best after some trials. UHFFT1.6: expert user selects the best after some trials.
11 Empirical Optimization: Iterative empirical optimization through compiler feedback. Auto-tune the non-deterministic elements in code performance. Adapt to the architecture as well as the compiler. Recent research focuses on iterative compiler optimization of whole applications. Successfully used in ATLAS for BLAS codes. Costly but manageable (especially for micro-kernels).
12 Empirical Optimization: Basic ingredient (iterative compilation loop): generate variants (varying parameters); compile dynamically; evaluate performance (run time); guide the tuning process.
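The evaluate-and-select part of this loop can be sketched as follows. This is a minimal illustration under stated assumptions, not the UHFFT generator: the variants are precompiled C functions instead of dynamically compiled code, timing uses the standard clock() function, and the names variant_fn, variant_a, variant_b, time_variant, and select_best_variant are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* A variant is one generated implementation of the same kernel. */
typedef void (*variant_fn)(const double *in, double *out, size_t n);

/* Two hypothetical variants of a radix-2 pass over pairs of reals;
 * they differ only in statement order. */
void variant_a(const double *in, double *out, size_t n)
{
    for (size_t i = 0; i + 1 < n; i += 2) {
        out[i]     = in[i] + in[i + 1];
        out[i + 1] = in[i] - in[i + 1];
    }
}

void variant_b(const double *in, double *out, size_t n)
{
    for (size_t i = 0; i + 1 < n; i += 2) {
        out[i + 1] = in[i] - in[i + 1];
        out[i]     = in[i] + in[i + 1];
    }
}

/* Evaluate: run one variant reps times and return elapsed CPU seconds. */
double time_variant(variant_fn f, const double *in, double *out,
                    size_t n, int reps)
{
    clock_t start = clock();
    for (int r = 0; r < reps; ++r)
        f(in, out, n);
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}

/* Guide: return the index of the fastest variant on this machine. */
size_t select_best_variant(const variant_fn *variants, size_t count,
                           const double *in, double *out,
                           size_t n, int reps)
{
    assert(variants != NULL && count > 0 && reps > 0);
    size_t best = 0;
    double best_time = time_variant(variants[0], in, out, n, reps);
    for (size_t v = 1; v < count; ++v) {
        double t = time_variant(variants[v], in, out, n, reps);
        if (t < best_time) {
            best_time = t;
            best = v;
        }
    }
    return best;
}
```

In a real tuner the winner would feed back into the next round of variant generation; here the loop stops after a single round of selection.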
13 UHFFT Empirical Auto-tuning Code Generator
14 FFDL Grammar: The FFT Formula Description Language (FFDL) is used to specify the FFT formula. [Figure: grammar]
15 Codelet Types: Compact string representation of codelet types.
16 Codelet Types
Twiddle codelets: perform the FFT and multiply with twiddle factors.
Vector codelets: compute multiple FFTs on 2D blocks by pushing the loop inside the codelet:
    CodeletN( ) { for (i = 1 to r) compute fftn }
Rotated codelets: for the PFA (prime factor algorithm); fewer multiplications and no twiddle multiplications.
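The vector-codelet idea (the loop over r transforms lives inside the codelet rather than in the caller) can be sketched for size 2. The function name fft2_vector, the cplx type, and the extra stride parameters vis and vos (distance between consecutive transforms on input and output) are assumptions for illustration; the slides do not give the actual interface.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed complex type; not taken from UHFFT. */
typedef struct { double re, im; } cplx;

/* Hypothetical size-2 vector codelet: computes r independent size-2 FFTs.
 * is/os  : stride between the two elements of one transform
 * vis/vos: stride between consecutive transforms */
void fft2_vector(const cplx *x, cplx *y,
                 size_t is, size_t os,
                 size_t vis, size_t vos, size_t r)
{
    assert(x != NULL && y != NULL);
    for (size_t i = 0; i < r; ++i) {
        const cplx *xi = x + i * vis;
        cplx *yi = y + i * vos;

        /* Load both inputs first so the codelet also works in place. */
        cplx a = xi[0];
        cplx b = xi[is];

        yi[0].re  = a.re + b.re;
        yi[0].im  = a.im + b.im;
        yi[os].re = a.re - b.re;
        yi[os].im = a.im - b.im;
    }
}
```

Keeping the loop inside the codelet gives the compiler one function body to schedule across iterations and avoids one call per transform.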
17 Level 1 Optimizations
    Mode     Rader Implementation
    CIRC     Circulant permutation
    SKEW     Skew circulant
    SKEWP    Skew circulant with partial diagonalization
18 Level 1 Optimizations Real FFT
19 Level 2 Optimizations
Instruction sequence:
    Re(y[os]) = Re(x[0])-Re(x[is]);
    Im(y[0]) = Im(x[0])+Im(x[is]);
    Re(y[0]) = Re(x[0])+Re(x[is]);
    Im(y[os]) = Im(x[0])-Im(x[is]);
Blocked instruction sequence:
    { // block 4
      int reg0,reg1,reg2,reg3;
      { // block 2
        int reg4,reg5;
      }
      Im(y[0]) = Im(x[0])+Im(x[is]);
      Re(y[0]) = Re(x[0])+Re(x[is]);
      Im(y[os]) = Im(x[0])-Im(x[is]);
      Re(y[os]) = Re(x[0])-Re(x[is]);
    }
    { // block 2
      int reg6,reg7;
    }
20 Level 2 Optimizations Real FFT
21 Level 3 Optimizations
Before:
    Y[0] = X[0]+X[1];
    Y[1] = X[0]-X[1];
After (array references replaced by scalar temporaries):
    reg0 = X[0];
    reg1 = X[1];
    reg2 = reg0 + reg1;
    reg3 = reg0 - reg1;
    Y[0] = reg2;
    Y[1] = reg3;
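The transformation on this slide can be written as two compilable functions for comparison. The function names butterfly_direct and butterfly_scalar and the use of double are assumptions; the statements follow the slide. One reason the scalar form can help is that a C compiler cannot always prove that X and Y do not overlap, so in the direct form it may have to reload X[0] after the store to Y[0].

```c
#include <assert.h>
#include <stddef.h>

/* Direct form: every operand is an array reference.
 * Not safe in place, because Y[0] is written before X[0] is read again. */
void butterfly_direct(const double *X, double *Y)
{
    assert(X != NULL && Y != NULL && X != Y);
    Y[0] = X[0] + X[1];
    Y[1] = X[0] - X[1];
}

/* Scalar-replaced form: each input is loaded once into a local,
 * all arithmetic is on locals, and each output is stored once. */
void butterfly_scalar(const double *X, double *Y)
{
    assert(X != NULL && Y != NULL);
    double reg0 = X[0];
    double reg1 = X[1];
    double reg2 = reg0 + reg1;
    double reg3 = reg0 - reg1;
    Y[0] = reg2;
    Y[1] = reg3;
}
```

A side effect of loading everything before storing anything is that the scalar form also gives correct results when X and Y are the same array.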
22 Level 3 Optimizations Real FFT
23 Results: Experimental setup: two architectures (Itanium2 and Opteron); two compilers on each architecture (icc and gcc on Itanium2, pathcc and gcc on Opteron); performance metric: MFlop/s. Codelet strides are varied to mimic actual execution inside the FFT schedule (larger strides => lower performance).
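A stride-varying measurement of this kind might look like the sketch below. It is not the benchmark used for these results: the kernel is a hypothetical real radix-2 butterfly standing in for a codelet, flops are counted as the actual additions and subtractions executed, timing uses clock(), and the names mflops, butterfly2, and measure_pass are assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* MFlop/s from an operation count and an elapsed time in seconds. */
double mflops(double flops, double seconds)
{
    assert(flops >= 0.0 && seconds > 0.0);
    return flops / (seconds * 1e6);
}

/* Stand-in codelet: one real butterfly (2 flops) whose two operands
 * lie `stride` elements apart. */
void butterfly2(const double *x, double *y, size_t stride)
{
    y[0]      = x[0] + x[stride];
    y[stride] = x[0] - x[stride];
}

/* Run a full radix-2 pass over len elements at the given stride,
 * repeated reps times, and return the measured rate in MFlop/s. */
double measure_pass(const double *x, double *y,
                    size_t len, size_t stride, int reps)
{
    assert(x != NULL && y != NULL);
    assert(stride > 0 && len >= 2 * stride && reps > 0);

    double calls = 0.0;
    clock_t start = clock();
    for (int r = 0; r < reps; ++r) {
        for (size_t base = 0; base + 2 * stride <= len; base += 2 * stride) {
            for (size_t j = 0; j < stride; ++j) {
                butterfly2(x + base + j, y + base + j, stride);
                calls += 1.0;
            }
        }
    }
    double seconds = (double)(clock() - start) / CLOCKS_PER_SEC;

    /* Guard against a timer too coarse to register a short run. */
    if (seconds <= 0.0)
        seconds = 1.0 / CLOCKS_PER_SEC;

    return mflops(2.0 * calls, seconds);
}
```

Calling measure_pass on the same buffer with stride 1, 2, 4, and so on gives one rate per stride; with buffers larger than the cache, the larger strides would be expected to show the drop noted on the slide.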
24 Codelets (Itanium)
25 Codelets (Opteron)
26 Codelet Size vs Performance: Itanium has more registers. Opteron has a larger instruction cache.
27 Performance Improvement (UHFFT2): FFT (power-of-2 sizes) on Itanium2 using ICC. Major gain due to vector codelets.
28 Performance Improvement (UHFFT2): FFT (non-power-of-2 sizes) on Itanium2 using ICC. Major gain due to special PFA codelets.
29 Performance Improvement (UHFFT2): FFT (power-of-2 sizes) on Itanium2 using GCC. Major gain due to scalar replacement.
30 Performance Improvement (UHFFT2): FFT (non-power-of-2 sizes) on Itanium2 using GCC. Major gain due to special PFA codelets.
31 Conclusion: Mixed results; this technique generates codelets at least as good as any heuristics. The better the compiler, the smaller the impact. Performance portability across compilers. Major performance improvement for larger codelets, but large codelets are rarely used because of their lower performance. Simple heuristics could be used in some cases instead of expensive empirical auto-tuning, especially for power-of-two codelets.
32 Thank You!
33 Baseline Performance: UHFFT performance (prior to empirical tuning).
34 Baseline Performance: Special codelets (e.g. fma) in FFTW and MKL. UHFFT is slightly better at managing memory.
35 Code Generation (FFTW3 vs UHFFT2)
FFT Formula Minimization: FFTW: uses heuristics. UHFFT: tries a limited number of (heuristics-based) formulas plus any user-supplied formulas.
Schedule: FFTW: cache-oblivious followed by DAG reordering. UHFFT: cache-oblivious schedule followed by DAG reordering.
Register Blocking: FFTW: yes. UHFFT: multiple blockings empirically evaluated.
Architecture (ISA) Specific Codelets: FFTW: yes (e.g. fma, sse). UHFFT: no (relies on the compiler completely).
Codelet Types: FFTW: straight-line and looped FFT, twiddle, and in-place (transposed) codelets (no PFA). UHFFT: straight-line and looped FFT, twiddle, in-place, and rotated PFA codelets.
Generator: FFTW: written in Caml (rule-based language). UHFFT: written in C (allows more flexibility).
Adaptation: FFTW: adapts to the architecture. UHFFT: adapts to both the architecture and the compiler.
Optimizations: FFTW: based on heuristics (trial-and-error method for more options). UHFFT: empirically tuned automatic code generator.
36 Our Solution: Generate more, and more specialized, micro-kernels. Consider more FFT formulas. Use register blocking to reduce register pressure. Use iterative empirical optimization through compiler feedback.
