DARPA, NSF-NGS/ITR,ACR,CPA,
|
|
|
- Cuthbert Reginald Watts
- 10 years ago
- Views:
Transcription
1 Spiral Automating Library Development Markus Püschel and the Spiral team (only part shown) With: Srinivas Chellappa Frédéric de Mesmay Franz Franchetti Daniel McFarlin Yevgen Voronenko Electrical and Computer Engineering Carnegie Mellon University This work was supported by DARPA, NSF-NGS/ITR,ACR,CPA, Intel, Mercury, National Instruments
2 Positions and Thoughts Autotuning definition Search over space of alternatives and Parameter-based tuning are very important but fails to address some key problems; we need to think about Raising the level of abstraction: Enables Use of domain knowledge Difficult optimizations: parallelization, vectorization, etc. Faster porting to new platforms and platform paradigms Possibly automatic software development We need coarse platform abstractions We need more interdisciplinary collaborations Metrics Time for code development, porting to new platforms Performance
3 DFT Plot: Analysis Multiple threads: 2x Memory hierarchy: 5x Vector instructions: 3x High performance library development has become a nightmare
4 Spiral Research Goal: Teach computers to write fast libraries Complete automation of implementation and optimization Including vectorization, parallelization Functionality: Linear transforms (discrete Fourier transform, filters, wavelets) BLAS SAR imaging En/decoding (Viterbi, Ebcot in JPEG2000) more Platforms: Desktop (vector, SMP), FPGAs, GPUs, distributed, hybrid Collaboration with Intel (Kuck, Tang, Sabanin) Parts of MKL/IPP generated with Spiral IPP 6.0: ippg domain for Spiral generated code
5 human effort automated Carnegie Mellon automated Vision Behind Spiral Current Numerical problem Future Numerical problem C program algorithm selection implementation algorithm selection implementation compilation compilation Computing platform C code a singularity: Compiler has no access to high level information Computing platform Challenge: conquer the high abstraction level for complete automation
6 Organization Spiral s framework: Example transforms Complete automation achieved Beyond transforms Conclusions and thoughts
7 Linear Transforms Mathematically: Matrix-vector multiplication Output vector Transform = matrix Input vector Example: Discrete Fourier transform (DFT)
8 Transform Algorithms: Example 4-point FFT Cooley/Tukey fast Fourier transform (FFT): j 1 j j 1 j 1 1 j Fourier transform Diagonal matrix (twiddles) Kronecker product Identity Permutation Algorithms are divide-and-conquer: Breakdown rules Mathematical, declarative representation: SPL (signal processing language) SPL describes the structure of the dataflow
9 Breakdown Rules (>200 for >50 Transforms) Combining these rules yields many algorithms for every given transform
10 SPL to Sequential Code Example: tensor product Correct code: easy fast code: very difficult
11 Program Generation in Spiral (Sketched) Transform user specified Fast algorithm in SPL many choices -SPL: [PLDI 05] Optimization at all abstraction levels parallelization vectorization loop optimizations C Code: Iteration of this process constant folding to search for the fastest scheduling But that s not all
12 SPL to Shared Memory Code: Basic Idea [SC 06] Governing construct: tensor product A A A A Processor 0 Processor 1 Processor 2 Processor 3 p-way embarrassingly parallel, load-balanced Problematic construct: permutations produce false sharing x y Task: Rewrite formulas to extract tensor product + keep contiguous blocks x y
13 Parallelization by Rewriting coarse platform model Load-balanced No false sharing
14 Same Approach for Other Parallel Paradigms Message Passing Vectorization MPI Cg/OpenGL for GPUs: Verilog for FPGAs:
15 Example Results CPU+FPGA CPU CPU + GPU CELL GPU
16 Summary: Complete Automation for Transforms Platform: Off-the-shelf desktop Often: generated code faster than competition (if exists) Memory hierarchy optimization Rewriting and search for algorithm selection Rewriting for loop optimizations Vectorization Rewriting Parallelization Rewriting fixed input size code Derivation of library structure Rewriting Other methods general input size library
17 Generated Libraries RDFT DHT DCT2 DCT3 DCT4 DFT 2-way vectorized, 2-threaded Most are faster than hand-written libs Recursion steps: 4 17 Code size: kloc or MB Generation time: 1 3 hours Filter Wavelet
18 Organization Spiral s framework: Example transforms Complete automation achieved Beyond transforms Operator language BLAS, Viterbi decoding, SAR imaging, Ebcot encoding Conclusions and thoughts
19 Going Beyond Transforms Transform = linear operator with one vector input and one vector output linear Key ideas: Generalize to (possibly nonlinear) operators with several inputs and several outputs Generalize SPL (including tensor product) to OL (operator language) Generalize rewriting systems for parallelizations
20 Operator Language Carnegie Mellon
21 Example: Matrix Multiplication (MMM) Breakdown rules = algorithm knowledge: capture various forms of blocking
22 Parallelization through rewriting Load-balanced No false sharing
23 Speed-up (m x k) times (k x n) example: k MMM Speedup over MKL, n = 32, Core 2 Duo x x 'ratio32' u 1:2: m Comparison to GotoBLAS similar
24 Viterbi Decoding in OL Operator for Viterbi decoder Breakdown rules
25 Results Karn s implementation: hand-written assembly for 4 Viterbi codes Spiral 16-way 8-way 4-way scalar Karn 16-way 8-way 4-way scalar
26 EBCOT Coding in OL Carnegie Mellon
27 Organization Spiral s framework: Example transforms Complete automation achieved Beyond transforms Conclusions and thoughts
28 Raising the Abstraction Level Formally describe and structure algorithms/applications eternally valid In Spiral Domain-specific, declarative, mathematical language OL Difficult optimizations/transformations by rewriting What it enables Vectorization, parallelization using domain knowledge Efficient retargeting to new platforms and new platform paradigms Complete automation in some cases Other examples Libraries Identification and definition of BLAS Parameter tuning Indispensable tool but cannot achieve the above
29 Interdisciplinary Research Needed Programming languages Program generation Symbolic Computation Rewriting Software Scientific Computing Automating High-Performance Parallel Library Development Algorithms Mathematics Compilers We Need to Work Together
Medical Image Processing on the GPU. Past, Present and Future. Anders Eklund, PhD Virginia Tech Carilion Research Institute [email protected].
Medical Image Processing on the GPU Past, Present and Future Anders Eklund, PhD Virginia Tech Carilion Research Institute [email protected] Outline Motivation why do we need GPUs? Past - how was GPU programming
Image Compression through DCT and Huffman Coding Technique
International Journal of Current Engineering and Technology E-ISSN 2277 4106, P-ISSN 2347 5161 2015 INPRESSCO, All Rights Reserved Available at http://inpressco.com/category/ijcet Research Article Rahul
How to Write Fast Code SIMD Vectorization 18-645, spring 2008 13 th and 14 th Lecture
How to Write Fast Code SIMD Vectorization 18-645, spring 2008 13 th and 14 th Lecture Instructor: Markus Püschel Guest Instructor: Franz Franchetti TAs: Srinivas Chellappa (Vas) and Frédéric de Mesmay
A Multi-layered Domain-specific Language for Stencil Computations
A Multi-layered Domain-specific Language for Stencil Computations Christian Schmitt, Frank Hannig, Jürgen Teich Hardware/Software Co-Design, University of Erlangen-Nuremberg Workshop ExaStencils 2014,
PARALLEL PROGRAMMING
PARALLEL PROGRAMMING TECHNIQUES AND APPLICATIONS USING NETWORKED WORKSTATIONS AND PARALLEL COMPUTERS 2nd Edition BARRY WILKINSON University of North Carolina at Charlotte Western Carolina University MICHAEL
Data Analysis with MATLAB. 2013 The MathWorks, Inc. 1
Data Analysis with MATLAB 2013 The MathWorks, Inc. 1 Agenda Introduction Data analysis with MATLAB and Excel Break Developing applications with MATLAB Solving larger problems Summary 2 Modeling the Solar
5MD00. Assignment Introduction. Luc Waeijen 16-12-2014
5MD00 Assignment Introduction Luc Waeijen 16-12-2014 Contents EEG application Background on EEG Early Seizure Detection Algorithm Implementation Details Super Scalar Assignment Description Tooling (simple
Optimizing a 3D-FWT code in a cluster of CPUs+GPUs
Optimizing a 3D-FWT code in a cluster of CPUs+GPUs Gregorio Bernabé Javier Cuenca Domingo Giménez Universidad de Murcia Scientific Computing and Parallel Programming Group XXIX Simposium Nacional de la
Performance Evaluation of NAS Parallel Benchmarks on Intel Xeon Phi
Performance Evaluation of NAS Parallel Benchmarks on Intel Xeon Phi ICPP 6 th International Workshop on Parallel Programming Models and Systems Software for High-End Computing October 1, 2013 Lyon, France
Empirical Auto-tuning Code Generator for FFT and Trigonometric Transforms
Empirical Auto-tuning Code Generator for FFT and Trigonometric Transforms Ayaz Ali and Prof. Lennart Johnsson Texas Learning and Computation Center University of Houston Dr. Dragan Mirkovic MD Anderson
Mathematical Libraries on JUQUEEN. JSC Training Course
Mitglied der Helmholtz-Gemeinschaft Mathematical Libraries on JUQUEEN JSC Training Course May 10, 2012 Outline General Informations Sequential Libraries, planned Parallel Libraries and Application Systems:
How To Write Fast Numerical Code: A Small Introduction
How To Write Fast Numerical Code: A Small Introduction Srinivas Chellappa, Franz Franchetti, and Markus Püschel Electrical and Computer Engineering Carnegie Mellon University {schellap, franzf, pueschel}@ece.cmu.edu
Polytechnic University of Puerto Rico Department of Electrical Engineering Master s Degree in Electrical Engineering.
Polytechnic University of Puerto Rico Department of Electrical Engineering Master s Degree in Electrical Engineering Course Syllabus Course Title : Algorithms for Digital Signal Processing Course Code
Spring 2011 Prof. Hyesoon Kim
Spring 2011 Prof. Hyesoon Kim Today, we will study typical patterns of parallel programming This is just one of the ways. Materials are based on a book by Timothy. Decompose Into tasks Original Problem
Trends in High-Performance Computing for Power Grid Applications
Trends in High-Performance Computing for Power Grid Applications Franz Franchetti ECE, Carnegie Mellon University www.spiral.net Co-Founder, SpiralGen www.spiralgen.com This talk presents my personal views
Performance Analysis and Optimization Tool
Performance Analysis and Optimization Tool Andres S. CHARIF-RUBIAL [email protected] Performance Analysis Team, University of Versailles http://www.maqao.org Introduction Performance Analysis Develop
OpenFOAM: Computational Fluid Dynamics. Gauss Siedel iteration : (L + D) * x new = b - U * x old
OpenFOAM: Computational Fluid Dynamics Gauss Siedel iteration : (L + D) * x new = b - U * x old What s unique about my tuning work The OpenFOAM (Open Field Operation and Manipulation) CFD Toolbox is a
ultra fast SOM using CUDA
ultra fast SOM using CUDA SOM (Self-Organizing Map) is one of the most popular artificial neural network algorithms in the unsupervised learning category. Sijo Mathew Preetha Joy Sibi Rajendra Manoj A
High Performance Computing in CST STUDIO SUITE
High Performance Computing in CST STUDIO SUITE Felix Wolfheimer GPU Computing Performance Speedup 18 16 14 12 10 8 6 4 2 0 Promo offer for EUC participants: 25% discount for K40 cards Speedup of Solver
2IP WP8 Materiel Science Activity report March 6, 2013
2IP WP8 Materiel Science Activity report March 6, 2013 Codes involved in this task ABINIT (M.Torrent) Quantum ESPRESSO (F. Affinito) YAMBO + Octopus (F. Nogueira) SIESTA (G. Huhs) EXCITING/ELK (A. Kozhevnikov)
Parallel Ray Tracing using MPI: A Dynamic Load-balancing Approach
Parallel Ray Tracing using MPI: A Dynamic Load-balancing Approach S. M. Ashraful Kadir 1 and Tazrian Khan 2 1 Scientific Computing, Royal Institute of Technology (KTH), Stockholm, Sweden [email protected],
Multicore Parallel Computing with OpenMP
Multicore Parallel Computing with OpenMP Tan Chee Chiang (SVU/Academic Computing, Computer Centre) 1. OpenMP Programming The death of OpenMP was anticipated when cluster systems rapidly replaced large
Overview of HPC Resources at Vanderbilt
Overview of HPC Resources at Vanderbilt Will French Senior Application Developer and Research Computing Liaison Advanced Computing Center for Research and Education June 10, 2015 2 Computing Resources
Part I Courses Syllabus
Part I Courses Syllabus This document provides detailed information about the basic courses of the MHPC first part activities. The list of courses is the following 1.1 Scientific Programming Environment
Control 2004, University of Bath, UK, September 2004
Control, University of Bath, UK, September ID- IMPACT OF DEPENDENCY AND LOAD BALANCING IN MULTITHREADING REAL-TIME CONTROL ALGORITHMS M A Hossain and M O Tokhi Department of Computing, The University of
Introduction to MATLAB Gergely Somlay Application Engineer [email protected]
Introduction to MATLAB Gergely Somlay Application Engineer [email protected] 2012 The MathWorks, Inc. 1 What is MATLAB? High-level language Interactive development environment Used for: Numerical
A Review of Customized Dynamic Load Balancing for a Network of Workstations
A Review of Customized Dynamic Load Balancing for a Network of Workstations Taken from work done by: Mohammed Javeed Zaki, Wei Li, Srinivasan Parthasarathy Computer Science Department, University of Rochester
RevoScaleR Speed and Scalability
EXECUTIVE WHITE PAPER RevoScaleR Speed and Scalability By Lee Edlefsen Ph.D., Chief Scientist, Revolution Analytics Abstract RevoScaleR, the Big Data predictive analytics library included with Revolution
ACCELERATING COMMERCIAL LINEAR DYNAMIC AND NONLINEAR IMPLICIT FEA SOFTWARE THROUGH HIGH- PERFORMANCE COMPUTING
ACCELERATING COMMERCIAL LINEAR DYNAMIC AND Vladimir Belsky Director of Solver Development* Luis Crivelli Director of Solver Development* Matt Dunbar Chief Architect* Mikhail Belyi Development Group Manager*
Retargeting PLAPACK to Clusters with Hardware Accelerators
Retargeting PLAPACK to Clusters with Hardware Accelerators Manuel Fogué 1 Francisco Igual 1 Enrique S. Quintana-Ortí 1 Robert van de Geijn 2 1 Departamento de Ingeniería y Ciencia de los Computadores.
Mathematical Libraries and Application Software on JUROPA and JUQUEEN
Mitglied der Helmholtz-Gemeinschaft Mathematical Libraries and Application Software on JUROPA and JUQUEEN JSC Training Course May 2014 I.Gutheil Outline General Informations Sequential Libraries Parallel
A Performance Study of Load Balancing Strategies for Approximate String Matching on an MPI Heterogeneous System Environment
A Performance Study of Load Balancing Strategies for Approximate String Matching on an MPI Heterogeneous System Environment Panagiotis D. Michailidis and Konstantinos G. Margaritis Parallel and Distributed
Design and Optimization of OpenFOAM-based CFD Applications for Hybrid and Heterogeneous HPC Platforms
Design and Optimization of OpenFOAM-based CFD Applications for Hybrid and Heterogeneous HPC Platforms Amani AlOnazi, David E. Keyes, Alexey Lastovetsky, Vladimir Rychkov Extreme Computing Research Center,
OpenMP & MPI CISC 879. Tristan Vanderbruggen & John Cavazos Dept of Computer & Information Sciences University of Delaware
OpenMP & MPI CISC 879 Tristan Vanderbruggen & John Cavazos Dept of Computer & Information Sciences University of Delaware 1 Lecture Overview Introduction OpenMP MPI Model Language extension: directives-based
Petascale Software Challenges. William Gropp www.cs.illinois.edu/~wgropp
Petascale Software Challenges William Gropp www.cs.illinois.edu/~wgropp Petascale Software Challenges Why should you care? What are they? Which are different from non-petascale? What has changed since
Bringing Big Data Modelling into the Hands of Domain Experts
Bringing Big Data Modelling into the Hands of Domain Experts David Willingham Senior Application Engineer MathWorks [email protected] 2015 The MathWorks, Inc. 1 Data is the sword of the
WESTMORELAND COUNTY PUBLIC SCHOOLS 2011 2012 Integrated Instructional Pacing Guide and Checklist Computer Math
Textbook Correlation WESTMORELAND COUNTY PUBLIC SCHOOLS 2011 2012 Integrated Instructional Pacing Guide and Checklist Computer Math Following Directions Unit FIRST QUARTER AND SECOND QUARTER Logic Unit
Parallel Programming at the Exascale Era: A Case Study on Parallelizing Matrix Assembly For Unstructured Meshes
Parallel Programming at the Exascale Era: A Case Study on Parallelizing Matrix Assembly For Unstructured Meshes Eric Petit, Loïc Thebault, Quang V. Dinh May 2014 EXA2CT Consortium 2 WPs Organization Proto-Applications
ACCELERATING SELECT WHERE AND SELECT JOIN QUERIES ON A GPU
Computer Science 14 (2) 2013 http://dx.doi.org/10.7494/csci.2013.14.2.243 Marcin Pietroń Pawe l Russek Kazimierz Wiatr ACCELERATING SELECT WHERE AND SELECT JOIN QUERIES ON A GPU Abstract This paper presents
Exploiting GPU Hardware Saturation for Fast Compiler Optimization
Exploiting GPU Hardware Saturation for Fast Compiler Optimization Alberto Magni School of Informatics University of Edinburgh United Kingdom [email protected] Christophe Dubach School of Informatics
Four Keys to Successful Multicore Optimization for Machine Vision. White Paper
Four Keys to Successful Multicore Optimization for Machine Vision White Paper Optimizing a machine vision application for multicore PCs can be a complex process with unpredictable results. Developers need
1.4 Fast Fourier Transform (FFT) Algorithm
74 CHAPTER AALYSIS OF DISCRETE-TIME LIEAR TIME-IVARIAT SYSTEMS 4 Fast Fourier Transform (FFT Algorithm Fast Fourier Transform, or FFT, is any algorithm for computing the -point DFT with a computational
Scalable Developments for Big Data Analytics in Remote Sensing
Scalable Developments for Big Data Analytics in Remote Sensing Federated Systems and Data Division Research Group High Productivity Data Processing Dr.-Ing. Morris Riedel et al. Research Group Leader,
22S:295 Seminar in Applied Statistics High Performance Computing in Statistics
22S:295 Seminar in Applied Statistics High Performance Computing in Statistics Luke Tierney Department of Statistics & Actuarial Science University of Iowa August 30, 2007 Luke Tierney (U. of Iowa) HPC
The Scientific Data Mining Process
Chapter 4 The Scientific Data Mining Process When I use a word, Humpty Dumpty said, in rather a scornful tone, it means just what I choose it to mean neither more nor less. Lewis Carroll [87, p. 214] In
CS 575 Parallel Processing
CS 575 Parallel Processing Lecture one: Introduction Wim Bohm Colorado State University Except as otherwise noted, the content of this presentation is licensed under the Creative Commons Attribution 2.5
HPC Wales Skills Academy Course Catalogue 2015
HPC Wales Skills Academy Course Catalogue 2015 Overview The HPC Wales Skills Academy provides a variety of courses and workshops aimed at building skills in High Performance Computing (HPC). Our courses
Data-Flow Awareness in Parallel Data Processing
Data-Flow Awareness in Parallel Data Processing D. Bednárek, J. Dokulil *, J. Yaghob, F. Zavoral Charles University Prague, Czech Republic * University of Vienna, Austria 6 th International Symposium on
Lesson 7: SYSTEM-ON. SoC) AND USE OF VLSI CIRCUIT DESIGN TECHNOLOGY. Chapter-1L07: "Embedded Systems - ", Raj Kamal, Publs.: McGraw-Hill Education
Lesson 7: SYSTEM-ON ON-CHIP (SoC( SoC) AND USE OF VLSI CIRCUIT DESIGN TECHNOLOGY 1 VLSI chip Integration of high-level components Possess gate-level sophistication in circuits above that of the counter,
What s New in MATLAB and Simulink
What s New in MATLAB and Simulink Kevin Cohan Product Marketing, MATLAB Michael Carone Product Marketing, Simulink 2015 The MathWorks, Inc. 1 What was new for Simulink in R2012b? 2 What Was New for MATLAB
HPC enabling of OpenFOAM R for CFD applications
HPC enabling of OpenFOAM R for CFD applications Towards the exascale: OpenFOAM perspective Ivan Spisso 25-27 March 2015, Casalecchio di Reno, BOLOGNA. SuperComputing Applications and Innovation Department,
2: Computer Performance
2: Computer Performance http://people.sc.fsu.edu/ jburkardt/presentations/ fdi 2008 lecture2.pdf... John Information Technology Department Virginia Tech... FDI Summer Track V: Parallel Programming 10-12
Graphical Processing Units to Accelerate Orthorectification, Atmospheric Correction and Transformations for Big Data
Graphical Processing Units to Accelerate Orthorectification, Atmospheric Correction and Transformations for Big Data Amanda O Connor, Bryan Justice, and A. Thomas Harris IN52A. Big Data in the Geosciences:
Recent Advances in Periscope for Performance Analysis and Tuning
Recent Advances in Periscope for Performance Analysis and Tuning Isaias Compres, Michael Firbach, Michael Gerndt Robert Mijakovic, Yury Oleynik, Ventsislav Petkov Technische Universität München Yury Oleynik,
Chapter 12: Multiprocessor Architectures. Lesson 01: Performance characteristics of Multiprocessor Architectures and Speedup
Chapter 12: Multiprocessor Architectures Lesson 01: Performance characteristics of Multiprocessor Architectures and Speedup Objective Be familiar with basic multiprocessor architectures and be able to
A Pattern-Based Approach to. Automated Application Performance Analysis
A Pattern-Based Approach to Automated Application Performance Analysis Nikhil Bhatia, Shirley Moore, Felix Wolf, and Jack Dongarra Innovative Computing Laboratory University of Tennessee (bhatia, shirley,
Accelerating Fast Fourier Transforms Using Hadoop and CUDA
Accelerating Fast Fourier Transforms Using Hadoop and CUDA Rostislav Tsiomenko SAIC Columbia, Md., USA [email protected] 1 Abstract There has been considerable research into improving Fast
An Introduction to Parallel Computing/ Programming
An Introduction to Parallel Computing/ Programming Vicky Papadopoulou Lesta Astrophysics and High Performance Computing Research Group (http://ahpc.euc.ac.cy) Dep. of Computer Science and Engineering European
GPU-based Decompression for Medical Imaging Applications
GPU-based Decompression for Medical Imaging Applications Al Wegener, CTO Samplify Systems 160 Saratoga Ave. Suite 150 Santa Clara, CA 95051 [email protected] (888) LESS-BITS +1 (408) 249-1500 1 Outline
and RISC Optimization Techniques for the Hitachi SR8000 Architecture
1 KONWIHR Project: Centre of Excellence for High Performance Computing Pseudo-Vectorization and RISC Optimization Techniques for the Hitachi SR8000 Architecture F. Deserno, G. Hager, F. Brechtefeld, G.
Parallel Image Processing with CUDA A case study with the Canny Edge Detection Filter
Parallel Image Processing with CUDA A case study with the Canny Edge Detection Filter Daniel Weingaertner Informatics Department Federal University of Paraná - Brazil Hochschule Regensburg 02.05.2011 Daniel
Multicore Programming with LabVIEW Technical Resource Guide
Multicore Programming with LabVIEW Technical Resource Guide 2 INTRODUCTORY TOPICS UNDERSTANDING PARALLEL HARDWARE: MULTIPROCESSORS, HYPERTHREADING, DUAL- CORE, MULTICORE AND FPGAS... 5 DIFFERENCES BETWEEN
Operation Count; Numerical Linear Algebra
10 Operation Count; Numerical Linear Algebra 10.1 Introduction Many computations are limited simply by the sheer number of required additions, multiplications, or function evaluations. If floating-point
Functional Data Analysis of MALDI TOF Protein Spectra
Functional Data Analysis of MALDI TOF Protein Spectra Dean Billheimer [email protected]. Department of Biostatistics Vanderbilt University Vanderbilt Ingram Cancer Center FDA for MALDI TOF
Introduction to MATLAB for Data Analysis and Visualization
Introduction to MATLAB for Data Analysis and Visualization Sean de Wolski Application Engineer 2014 The MathWorks, Inc. 1 Data Analysis Tasks Files Data Analysis & Modeling Reporting and Documentation
GPGPU accelerated Computational Fluid Dynamics
t e c h n i s c h e u n i v e r s i t ä t b r a u n s c h w e i g Carl-Friedrich Gauß Faculty GPGPU accelerated Computational Fluid Dynamics 5th GACM Colloquium on Computational Mechanics Hamburg Institute
Best Practises for LabVIEW FPGA Design Flow. uk.ni.com ireland.ni.com
Best Practises for LabVIEW FPGA Design Flow 1 Agenda Overall Application Design Flow Host, Real-Time and FPGA LabVIEW FPGA Architecture Development FPGA Design Flow Common FPGA Architectures Testing and
Parallel Programming for Multi-Core, Distributed Systems, and GPUs Exercises
Parallel Programming for Multi-Core, Distributed Systems, and GPUs Exercises Pierre-Yves Taunay Research Computing and Cyberinfrastructure 224A Computer Building The Pennsylvania State University University
Speeding Up RSA Encryption Using GPU Parallelization
2014 Fifth International Conference on Intelligent Systems, Modelling and Simulation Speeding Up RSA Encryption Using GPU Parallelization Chu-Hsing Lin, Jung-Chun Liu, and Cheng-Chieh Li Department of
MPI and Hybrid Programming Models. William Gropp www.cs.illinois.edu/~wgropp
MPI and Hybrid Programming Models William Gropp www.cs.illinois.edu/~wgropp 2 What is a Hybrid Model? Combination of several parallel programming models in the same program May be mixed in the same source
Optimizing Symmetric Dense Matrix-Vector Multiplication on GPUs
Optimizing Symmetric Dense Matrix-Vector Multiplication on GPUs Rajib Nath Computer Science and Engineering University of California, San Diego [email protected] Stanimire Tomov Electrical Engineering and
Hardware-Aware Analysis and. Presentation Date: Sep 15 th 2009 Chrissie C. Cui
Hardware-Aware Analysis and Optimization of Stable Fluids Presentation Date: Sep 15 th 2009 Chrissie C. Cui Outline Introduction Highlights Flop and Bandwidth Analysis Mehrstellen Schemes Advection Caching
GeoImaging Accelerator Pansharp Test Results
GeoImaging Accelerator Pansharp Test Results Executive Summary After demonstrating the exceptional performance improvement in the orthorectification module (approximately fourteen-fold see GXL Ortho Performance
GEDAE TM - A Graphical Programming and Autocode Generation Tool for Signal Processor Applications
GEDAE TM - A Graphical Programming and Autocode Generation Tool for Signal Processor Applications Harris Z. Zebrowitz Lockheed Martin Advanced Technology Laboratories 1 Federal Street Camden, NJ 08102
HPC with Multicore and GPUs
HPC with Multicore and GPUs Stan Tomov Electrical Engineering and Computer Science Department University of Tennessee, Knoxville CS 594 Lecture Notes March 4, 2015 1/18 Outline! Introduction - Hardware
Introduction to Cloud Computing
Introduction to Cloud Computing Parallel Processing I 15 319, spring 2010 7 th Lecture, Feb 2 nd Majd F. Sakr Lecture Motivation Concurrency and why? Different flavors of parallel computing Get the basic
Best practices for efficient HPC performance with large models
Best practices for efficient HPC performance with large models Dr. Hößl Bernhard, CADFEM (Austria) GmbH PRACE Autumn School 2013 - Industry Oriented HPC Simulations, September 21-27, University of Ljubljana,
GPUs for Scientific Computing
GPUs for Scientific Computing p. 1/16 GPUs for Scientific Computing Mike Giles [email protected] Oxford-Man Institute of Quantitative Finance Oxford University Mathematical Institute Oxford e-research
(!' ) "' # "*# "!(!' +,
MATLAB is a numeric computation software for engineering and scientific calculations. The name MATLAB stands for MATRIX LABORATORY. MATLAB is primarily a tool for matrix computations. It was developed
Cluster performance, how to get the most out of Abel. Ole W. Saastad, Dr.Scient USIT / UAV / FI April 18 th 2013
Cluster performance, how to get the most out of Abel Ole W. Saastad, Dr.Scient USIT / UAV / FI April 18 th 2013 Introduction Architecture x86-64 and NVIDIA Compilers MPI Interconnect Storage Batch queue
Speed up numerical analysis with MATLAB
2011 Technology Trend Seminar Speed up numerical analysis with MATLAB MathWorks: Giorgia Zucchelli Marieke van Geffen Rachid Adarghal TU Delft: Prof.dr.ir. Kees Vuik Thales Nederland: Dènis Riedijk 2011
INTEL PARALLEL STUDIO XE EVALUATION GUIDE
Introduction This guide will illustrate how you use Intel Parallel Studio XE to find the hotspots (areas that are taking a lot of time) in your application and then recompiling those parts to improve overall
Programming Languages & Tools
4 Programming Languages & Tools Almost any programming language one is familiar with can be used for computational work (despite the fact that some people believe strongly that their own favorite programming
Applications to Computational Financial and GPU Computing. May 16th. Dr. Daniel Egloff +41 44 520 01 17 +41 79 430 03 61
F# Applications to Computational Financial and GPU Computing May 16th Dr. Daniel Egloff +41 44 520 01 17 +41 79 430 03 61 Today! Why care about F#? Just another fashion?! Three success stories! How Alea.cuBase
Is a Data Scientist the New Quant? Stuart Kozola MathWorks
Is a Data Scientist the New Quant? Stuart Kozola MathWorks 2015 The MathWorks, Inc. 1 Facts or information used usually to calculate, analyze, or plan something Information that is produced or stored by
Distributed communication-aware load balancing with TreeMatch in Charm++
Distributed communication-aware load balancing with TreeMatch in Charm++ The 9th Scheduling for Large Scale Systems Workshop, Lyon, France Emmanuel Jeannot Guillaume Mercier Francois Tessier In collaboration
THE NAS KERNEL BENCHMARK PROGRAM
THE NAS KERNEL BENCHMARK PROGRAM David H. Bailey and John T. Barton Numerical Aerodynamic Simulations Systems Division NASA Ames Research Center June 13, 1986 SUMMARY A benchmark test program that measures
David Rioja Redondo Telecommunication Engineer Englobe Technologies and Systems
David Rioja Redondo Telecommunication Engineer Englobe Technologies and Systems About me David Rioja Redondo Telecommunication Engineer - Universidad de Alcalá >2 years building and managing clusters UPM
A HIGH PERFORMANCE SOFTWARE IMPLEMENTATION OF MPEG AUDIO ENCODER. Figure 1. Basic structure of an encoder.
A HIGH PERFORMANCE SOFTWARE IMPLEMENTATION OF MPEG AUDIO ENCODER Manoj Kumar 1 Mohammad Zubair 1 1 IBM T.J. Watson Research Center, Yorktown Hgts, NY, USA ABSTRACT The MPEG/Audio is a standard for both
Performance Metrics and Scalability Analysis. Performance Metrics and Scalability Analysis
Performance Metrics and Scalability Analysis 1 Performance Metrics and Scalability Analysis Lecture Outline Following Topics will be discussed Requirements in performance and cost Performance metrics Work
E6895 Advanced Big Data Analytics Lecture 14:! NVIDIA GPU Examples and GPU on ios devices
E6895 Advanced Big Data Analytics Lecture 14: NVIDIA GPU Examples and GPU on ios devices Ching-Yung Lin, Ph.D. Adjunct Professor, Dept. of Electrical Engineering and Computer Science IBM Chief Scientist,
