Fast Exponential Computation on SIMD Architectures



Similar documents
Numerical Matrix Analysis

Big Data Visualization on the MIC

Hardware-Aware Analysis and. Presentation Date: Sep 15 th 2009 Chrissie C. Cui

A numerically adaptive implementation of the simplex method

IBM CELL CELL INTRODUCTION. Project made by: Origgi Alessandro matr Teruzzi Roberto matr IBM CELL. Politecnico di Milano Como Campus

Adaptive Stable Additive Methods for Linear Algebraic Calculations

Lecture 3: Modern GPUs A Hardware Perspective Mohamed Zahran (aka Z) mzahran@cs.nyu.edu

Binary Number System. 16. Binary Numbers. Base 10 digits: Base 2 digits: 0 1

FLOATING-POINT ARITHMETIC IN AMD PROCESSORS MICHAEL SCHULTE AMD RESEARCH JUNE 2015

Binary search tree with SIMD bandwidth optimization using SSE

Introducing PgOpenCL A New PostgreSQL Procedural Language Unlocking the Power of the GPU! By Tim Child

A new binary floating-point division algorithm and its software implementation on the ST231 processor

FMA Implementations of the Compensated Horner Scheme

FAST INVERSE SQUARE ROOT

Performance Evaluation of NAS Parallel Benchmarks on Intel Xeon Phi

Speeding Up RSA Encryption Using GPU Parallelization

Next Generation GPU Architecture Code-named Fermi

Numbering Systems. InThisAppendix...

SmartArrays and Java Frequently Asked Questions

Numerical Analysis. Professor Donna Calhoun. Fall 2013 Math 465/565. Office : MG241A Office Hours : Wednesday 10:00-12:00 and 1:00-3:00

Introduction to GPU Programming Languages

THE NAS KERNEL BENCHMARK PROGRAM

Operation Count; Numerical Linear Algebra

Floating-point control in the Intel compiler and libraries or Why doesn t my application always give the expected answer?

CSCI567 Machine Learning (Fall 2014)

Computer Architecture TDTS10

High-Performance Modular Multiplication on the Cell Processor

ELECTENG702 Advanced Embedded Systems. Improving AES128 software for Altera Nios II processor using custom instructions

Computation of 2700 billion decimal digits of Pi using a Desktop Computer

The mathematics of RAID-6

Fast Logarithms on a Floating-Point Device

CSE 6040 Computing for Data Analytics: Methods and Tools

Evaluation of CUDA Fortran for the CFD code Strukti

Maximize Performance and Scalability of RADIOSS* Structural Analysis Software on Intel Xeon Processor E7 v2 Family-Based Platforms

Copyright Soleran, Inc. esalestrack On-Demand CRM. Trademarks and all rights reserved. esalestrack is a Soleran product Privacy Statement

PERFORMANCE ENHANCEMENTS IN TreeAge Pro 2014 R1.0

Performance Monitoring of the Software Frameworks for LHC Experiments

Ashraf Abusharekh Kris Gaj Department of Electrical & Computer Engineering George Mason University

Mathematical Libraries on JUQUEEN. JSC Training Course

The string of digits in the binary number system represents the quantity

Intro to GPU computing. Spring 2015 Mark Silberstein, , Technion 1

5MD00. Assignment Introduction. Luc Waeijen

ECE 0142 Computer Organization. Lecture 3 Floating Point Representations

3. Interpolation. Closing the Gaps of Discretization... Beyond Polynomials

Temperature Control Loop Analyzer (TeCLA) Software

Numerical Analysis An Introduction

degrees of freedom and are able to adapt to the task they are supposed to do [Gupta].

Characteristics of Java (Optional) Y. Daniel Liang Supplement for Introduction to Java Programming

CSC384 Intro to Artificial Intelligence

Problems and Measures Regarding Waste 1 Management and 3R Era of public health improvement Situation subsequent to the Meiji Restoration

Lattice QCD Performance. on Multi core Linux Servers

BEAGLEBONE BLACK ARCHITECTURE MADELEINE DAIGNEAU MICHELLE ADVENA

Floating Point Fused Add-Subtract and Fused Dot-Product Units

Chapter 2 Logic Gates and Introduction to Computer Architecture

A simple and fast algorithm for computing exponentials of power series

Chapter 13 Introduction to Nonlinear Regression( 非 線 性 迴 歸 )

Monday January 19th 2015 Title: "Transmathematics - a survey of recent results on division by zero" Facilitator: TheNumberNullity / James Anderson, UK

Practical Calculation of Expected and Unexpected Losses in Operational Risk by Simulation Methods

Divide: Paper & Pencil. Computer Architecture ALU Design : Division and Floating Point. Divide algorithm. DIVIDE HARDWARE Version 1

1. Memory technology & Hierarchy

İSTANBUL AYDIN UNIVERSITY

On the effect of forwarding table size on SDN network utilization

Inline Deduplication

Habanero Extreme Scale Software Research Project

Machine Learning over Big Data

Attention: This material is copyright Chris Hecker. All rights reserved.

Computer Architecture

Exploiting SIMD Instructions

IMPROVING PERFORMANCE OF RANDOMIZED SIGNATURE SORT USING HASHING AND BITWISE OPERATORS

Numerical Analysis I

Accurate and robust image superresolution by neural processing of local image representations

DNA Mapping/Alignment. Team: I Thought You GNU? Lars Olsen, Venkata Aditya Kovuri, Nick Merowsky

Scalable Data Analysis in R. Lee E. Edlefsen Chief Scientist UserR! 2011

Interactive Level-Set Deformation On the GPU

Improved Single and Multiple Approximate String Matching

How To Write A Hexadecimal Program

Chapter 07: Instruction Level Parallelism VLIW, Vector, Array and Multithreaded Processors. Lesson 05: Array Processors

Speed up numerical analysis with MATLAB

DNA Data and Program Representation. Alexandre David

A linear algebraic method for pricing temporary life annuities

Implementation of Canny Edge Detector of color images on CELL/B.E. Architecture.

AN INTRODUCTION TO NUMERICAL METHODS AND ANALYSIS

OpenCL Optimization. San Jose 10/2/2009 Peng Wang, NVIDIA

Efficient Iceberg Query Evaluation for Structured Data using Bitmap Indices

Recent Advances and Future Trends in Graphics Hardware. Michael Doggett Architect November 23, 2005

Toolkit for Bar Code Recognition and Resolving on Camera Phones - Jump-Starting the Internet of Things

AN IMPROVED DESIGN OF REVERSIBLE BINARY TO BINARY CODED DECIMAL CONVERTER FOR BINARY CODED DECIMAL MULTIPLICATION

The BBP Algorithm for Pi

Application Notes "EPCF 1%' 1SJOU &OHJOF "11&

A HIGH PERFORMANCE SOFTWARE IMPLEMENTATION OF MPEG AUDIO ENCODER. Figure 1. Basic structure of an encoder.

A Novel Fourier Transform B-spline Method for Option Pricing*

64-Bit versus 32-Bit CPUs in Scientific Computing

Index Terms Cloud Storage Services, data integrity, dependable distributed storage, data dynamics, Cloud Computing.

Motorola 8- and 16-bit Embedded Application Binary Interface (M8/16EABI)

Neon: A Domain-Specific Programming Language for Image Processing

Transcription:

, HiPEAC15 - WAPCO, Amsterdam (NL) Fast Exponential Computation on SIMD Architectures A. Cristiano I. MALOSSI, Yves INEICHEN, Costas BEKAS and Alessandro CURIONI @ IBM Research Zurich, Switzerland

Motivations Thousands of applications: Fourier transform Neural networks Lumped models Radioactive decay Population grown Existing techniques: Power series Look up tables IEEE-745 manipulation 2

Drawbacks of state of the art practices (1) Power series/taylor expansions: PROS: make use of arithmetics (can use SIMD unit) PROS: flexible accuracy CONS: convergence is very slow (unusable for high accuracy) CONS: even using Horner's rule it requires too many floating-point multiply-add Look-up tables: PROS: faster than Power series/taylor expansions CONS: do not make use of arithmetics (only partial SIMD use) 3

Drawbacks of state of the art practices (2) IEEE-745 manipulations, by N. N. Schraudolph in 1998: The idea is to profit from the power of 2 implied in the floating point representation PROS: extremely fast CONS: very inaccurate (max 1 or 2 digits accurate) 4

Derivation of methodology (1) IDEA: combine IEEE-745 manipulations with polynomial interpolation Starting point: 1) Use a single 64 bit long int to profit from 52 digits of mantissa (i.e., S=252); 2) Set C=0 (useless in our method) 3) Write the equality 4) Determine the analytic correction factor 5) 5

Derivation of methodology (2) IDEA: combine IEEE-745 manipulations with polynomial interpolation 6

Algorithm work-flow Main features All the computational instructions are on the SIMD unit Architecture independent implementation are possible for the scalar version (no SIMD) Additional notes The constants 252, log2(e), B as well as {a, b, c, } are precomputed (no additional cost) Steps 3, 4, and 5 can be joined in a single code line (improve compiler optimization) In C++ step 5 can be performed as a reinterpret_cast<double &> 7

Flexible accuracy Different polynomials can be employed in the same scheme (same cost) Accuracy can be tuned changing the degree of the polynomial Each degree implies an additional FMADD 8

Performance on L1-cache: time percent reduction IBM BG/Q IBM Power 755 9

Performance on DDR RAM on Power 755 (similar results on BG/Q) Time per exponential (percent reduction) Net energy per exponential (percent reduction) 10

Summary and conclusions Main ingredients IEEE-745 manipulation is very fast Polynomial interpolation is suitable for 100% SIMD unit use (multiply-add instructions) Main resulting advantages Huge reduction in the time-to-solution (up to 96 % on IBM BG/Q and Power7) Huge reduction in the energy-to-solution (up to 93 % on IBM BG/Q and Power7) Low-to-High accuracy flexibility (the user can control the degree of the polynomial) Architecture flexibility (specific SIMD implementation possible on any modern architecture) Further advantages Scalar versions (no SIMD) are still much faster and energy efficient than classical strategies OpenMP (multithread) implementations possible and efficient for big vectors 11