A new binary floating-point division algorithm and its software implementation on the ST231 processor

Similar documents

A new binary floating-point division algorithm and its software implementation on the ST231 processor

This Unit: Floating Point Arithmetic. CIS 371 Computer Organization and Design. Readings. Floating Point (FP) Numbers

Numerical Matrix Analysis

Divide: Paper & Pencil. Computer Architecture ALU Design : Division and Floating Point. Divide algorithm. DIVIDE HARDWARE Version 1

Some Functions Computable with a Fused-mac

Binary Division. Decimal Division. Hardware for Binary Division. Simple 16-bit Divider Circuit

Influences in low-degree polynomials

Measures of Error: for exact x and approximation x Absolute error e = x x. Relative error r = (x x )/x.

arxiv: v1 [math.pr] 5 Dec 2011

RN-Codings: New Insights and Some Applications

CHAPTER 5 Round-off errors

RN-coding of Numbers: New Insights and Some Applications

Chapter 4 -- Decimals

CORRELATED TO THE SOUTH CAROLINA COLLEGE AND CAREER-READY FOUNDATIONS IN ALGEBRA

Method To Solve Linear, Polynomial, or Absolute Value Inequalities:

Information Theory and Coding Prof. S. N. Merchant Department of Electrical Engineering Indian Institute of Technology, Bombay

CS321. Introduction to Numerical Methods

Contrôle dynamique de méthodes d approximation

! Solve problem to optimality. ! Solve problem in poly-time. ! Solve arbitrary instances of the problem. !-approximation algorithm.

Binary Number System. 16. Binary Numbers. Base 10 digits: Base 2 digits: 0 1

What are the place values to the left of the decimal point and their associated powers of ten?

FLOATING-POINT ARITHMETIC IN AMD PROCESSORS MICHAEL SCHULTE AMD RESEARCH JUNE 2015

Modern Optimization Methods for Big Data Problems MATH11146 The University of Edinburgh

! Solve problem to optimality. ! Solve problem in poly-time. ! Solve arbitrary instances of the problem. #-approximation algorithm.

Math 120 Final Exam Practice Problems, Form: A

Arithmetic Coding: Introduction

Efficient and Robust Allocation Algorithms in Clouds under Memory Constraints

Notes from Week 1: Algorithms for sequential prediction

1.3 Algebraic Expressions

5 INTEGER LINEAR PROGRAMMING (ILP) E. Amaldi Fondamenti di R.O. Politecnico di Milano 1

Chapter 13: Binary and Mixed-Integer Programming

Detection of changes in variance using binary segmentation and optimal partitioning

2010/9/19. Binary number system. Binary numbers. Outline. Binary to decimal

Binary search tree with SIMD bandwidth optimization using SSE

The Goldberg Rao Algorithm for the Maximum Flow Problem

Chapter Load Balancing. Approximation Algorithms. Load Balancing. Load Balancing on 2 Machines. Load Balancing: Greedy Scheduling

What Every Computer Scientist Should Know About Floating-Point Arithmetic

A Note on Maximum Independent Sets in Rectangle Intersection Graphs

MA107 Precalculus Algebra Exam 2 Review Solutions

1. Give the 16 bit signed (twos complement) representation of the following decimal numbers, and convert to hexadecimal:

Analysis of Algorithms I: Optimal Binary Search Trees

Efficiency of algorithms. Algorithms. Efficiency of algorithms. Binary search and linear search. Best, worst and average case.

Data Storage 3.1. Foundations of Computer Science Cengage Learning

Simplification of Radical Expressions

The program also provides supplemental modules on topics in geometry and probability and statistics.

Precision & Performance: Floating Point and IEEE 754 Compliance for NVIDIA GPUs

Current Standard: Mathematical Concepts and Applications Shape, Space, and Measurement- Primary

9231 FURTHER MATHEMATICS

Number Systems and Radix Conversion

Adaptive Stable Additive Methods for Linear Algebraic Calculations

0.8 Rational Expressions and Equations

Algebra Practice Problems for Precalculus and Calculus

Chapter 2: Linear Equations and Inequalities Lecture notes Math 1010

Simplify the rational expression. Find all numbers that must be excluded from the domain of the simplified rational expression.

Correctly Rounded Floating-point Binary-to-Decimal and Decimal-to-Binary Conversion Routines in Standard ML. By Prashanth Tilleti

Week 1: Introduction to Online Learning

Notes on Factoring. MA 206 Kurt Bryan

FMA Implementations of the Compensated Horner Scheme

Chapter 7 - Roots, Radicals, and Complex Numbers

Discrete Optimization

Welcome to Math Video Lessons. Stanley Ocken. Department of Mathematics The City College of New York Fall 2013

Operation Count; Numerical Linear Algebra

5.1 Radical Notation and Rational Exponents

PURSUITS IN MATHEMATICS often produce elementary functions as solutions that need to be

Algebra 1 Course Information

Assessment Schedule 2013

Fast Arithmetic Coding (FastAC) Implementations

22S:295 Seminar in Applied Statistics High Performance Computing in Statistics

This 3-digit ASCII string could also be calculated as n = (Data[2]-0x30) +10*((Data[1]-0x30)+10*(Data[0]-0x30));

Piecewise Cubic Splines

AN617. Fixed Point Routines FIXED POINT ARITHMETIC INTRODUCTION. Thi d t t d ith F M k Design Consultant

International Journal of Information Technology, Modeling and Computing (IJITMC) Vol.1, No.3,August 2013

EQUATIONS and INEQUALITIES

MPI Hands-On List of the exercises

Message-passing sequential detection of multiple change points in networks

5 Numerical Differentiation

A science-gateway workload archive application to the self-healing of workflow incidents

Approximation Algorithms

Algebra Unpacked Content For the new Common Core standards that will be effective in all North Carolina schools in the school year.

Zuse's Z3 Square Root Algorithm Talk given at Fall meeting of the Ohio Section of the MAA October College of Wooster

Solution: start more than one instruction in the same clock cycle CPI < 1 (or IPC > 1, Instructions per Cycle) Two approaches:

1 Formulating The Low Degree Testing Problem

1 Review of Least Squares Solutions to Overdetermined Systems

Decentralized Utility-based Sensor Network Design

Florida Math for College Readiness

Copy in your notebook: Add an example of each term with the symbols used in algebra 2 if there are any.

Example: Credit card default, we may be more interested in predicting the probabilty of a default than classifying individuals as default or not.

Chapter 5. Linear Inequalities and Linear Programming. Linear Programming in Two Dimensions: A Geometric Approach

Polynomials. Dr. philippe B. laval Kennesaw State University. April 3, 2005

Applied Algorithm Design Lecture 5

Performance metrics for parallelism

4. Continuous Random Variables, the Pareto and Normal Distributions

on an system with an infinite number of processors. Calculate the speedup of

Transcription:

19th IEEE Symposium on Computer Arithmetic (ARITH 19) Portland, Oregon, USA, June 8-10, 2009 A new binary floating-point division algorithm and its software implementation on the ST231 processor Claude-Pierre Jeannerod 1,2 Hervé Knochel 4 Christophe Monat 4 Guillaume Revy 2,1 Gilles Villard 3,2,1 Arénaire Inria project-team (LIP, ENS Lyon) 1 Université de Lyon 2 CNRS 3 Compilation Expertise Centre (STMicroelectronics Grenoble) 4 Guillaume Revy A new binary floating-point division algorithm and its software implementation on the ST231 processor 1/25

Context and objectives Context FLIP software library http://flip.gforge.inria.fr/ support for floating-point arithmetic on integer processors low latency implementation of binary floating-point division targets a VLIW integer processor of the ST200 family no support of subnormal numbers input/output: ±0, ±, NaN or normal number Objectives faster software implementation (compared to FLIP 0.3) expose instruction-level parallelism via bivariate polynomial evaluation correctly rounded rounding-to-nearest even Guillaume Revy A new binary floating-point division algorithm and its software implementation on the ST231 processor 2/25

Notation and assumptions Input (x, y): two positive normal numbers precision p, extremal exponents (e min, e max) 8 >< s x {0, 1} x = ( 1) sx m x 2 ex with m x = 1.m x,1... m x,p 1 [1, 2) >: e x {e min,..., e max} Computation: k-bit unsigned integers register size k Example for binary32 format: (k, p, e max) = (32, 24, 127) Guillaume Revy A new binary floating-point division algorithm and its software implementation on the ST231 processor 3/25

Outline of the talk Division via polynomial evaluation Generation of an efficient evaluation program Validation of the generated evaluation program Experimental results Concluding remarks Guillaume Revy A new binary floating-point division algorithm and its software implementation on the ST231 processor 4/25

Division via polynomial evaluation Outline of the talk Division via polynomial evaluation Generation of an efficient evaluation program Validation of the generated evaluation program Experimental results Concluding remarks Guillaume Revy A new binary floating-point division algorithm and its software implementation on the ST231 processor 5/25

Division via polynomial evaluation Division algorithm flowchart Definition c = ( 1 if m x m y, 0 if m x < m y. Guillaume Revy A new binary floating-point division algorithm and its software implementation on the ST231 processor 6/25

Division via polynomial evaluation Division algorithm flowchart Definition c = ( 1 if m x m y, 0 if m x < m y. Range reduction x/y = `2m x/m y 2 c 2 ex ey 1+c l = 2m x/m y 2 c RN p(l) l [1, 2) d = e x e y 1 + c RN p(l) [1, 2) RN p(x/y) = RN p(l) 2 d Guillaume Revy A new binary floating-point division algorithm and its software implementation on the ST231 processor 6/25

Division via polynomial evaluation Division algorithm flowchart Definition c = ( 1 if m x m y, 0 if m x < m y. Range reduction x/y = `2m x/m y 2 c 2 ex ey 1+c l = 2m x/m y 2 c RN p(l) l [1, 2) d = e x e y 1 + c RN p(l) [1, 2) RN p(x/y) = RN p(l) 2 d How to compute the correctly rounded significand RN p(l)? Guillaume Revy A new binary floating-point division algorithm and its software implementation on the ST231 processor 6/25

Division via polynomial evaluation How to compute a correctly rounded significand? Iterative methods (restoring, non-restoring,...) Oberman and Flynn (1997) minimal instruction-level parallelism exposure, sequential algorithm Guillaume Revy A new binary floating-point division algorithm and its software implementation on the ST231 processor 7/25

Division via polynomial evaluation How to compute a correctly rounded significand? Iterative methods (restoring, non-restoring,...) Oberman and Flynn (1997) minimal instruction-level parallelism exposure, sequential algorithm Multiplicative methods (Newton-Raphson, Goldschmidt) Piñeiro and Bruguera (2002) Raina s Ph.D/FLIP (2006) more instruction-level parallelism exposure previous implementation of division (FLIP 0.3) Guillaume Revy A new binary floating-point division algorithm and its software implementation on the ST231 processor 7/25

Division via polynomial evaluation How to compute a correctly rounded significand? Iterative methods (restoring, non-restoring,...) Oberman and Flynn (1997) minimal instruction-level parallelism exposure, sequential algorithm Multiplicative methods (Newton-Raphson, Goldschmidt) Piñeiro and Bruguera (2002) Raina s Ph.D/FLIP (2006) more instruction-level parallelism exposure previous implementation of division (FLIP 0.3) Polynomial-based methods Agarwal, Gustavson and Schmookler (1999) univariate polynomial evaluation Our approach single bivariate polynomial evaluation Guillaume Revy A new binary floating-point division algorithm and its software implementation on the ST231 processor 7/25

Division via polynomial evaluation Truncated one-sided approximation See for example, Ercegovac and Lang (2004) 3 steps 1. compute v = (01.v 1... v k 2 ) such that 2 p l v < 0 that is implied by (l + 2 p 1 ) v < 2 p 1 2. truncate v after p fraction bits 3. obtain RN p(l) after possibly adding 2 p Guillaume Revy A new binary floating-point division algorithm and its software implementation on the ST231 processor 8/25

Division via polynomial evaluation Truncated one-sided approximation See for example, Ercegovac and Lang (2004) 3 steps 1. compute v = (01.v 1... v k 2 ) such that 2 p l v < 0 that is implied by (l + 2 p 1 ) v < 2 p 1 2. truncate v after p fraction bits 3. obtain RN p(l) after possibly adding 2 p How to compute the one-sided approximation v? Guillaume Revy A new binary floating-point division algorithm and its software implementation on the ST231 processor 8/25

Division via polynomial evaluation Computation of the one-sided approximation 1. Consider l + 2 p 1 as the exact result of the function F(s,t) = s/(1 + t) + 2 p 1, at the points s = 2 1 c m x and t = m y 1: l + 2 p 1 = F(s, t ). Guillaume Revy A new binary floating-point division algorithm and its software implementation on the ST231 processor 9/25

Division via polynomial evaluation Computation of the one-sided approximation 1. Consider l + 2 p 1 as the exact result of the function F(s,t) = s/(1 + t) + 2 p 1, at the points s = 2 1 c m x and t = m y 1: l + 2 p 1 = F(s, t ). 2. Approximate F(s, t) by a bivariate polynomial P(s, t) P(s, t) = s a(t) + 2 p 1. a(t): univariate polynomial approximant of 1/(1 + t) approximation entails an error ǫ approx Guillaume Revy A new binary floating-point division algorithm and its software implementation on the ST231 processor 9/25

Division via polynomial evaluation Computation of the one-sided approximation 1. Consider l + 2 p 1 as the exact result of the function F(s,t) = s/(1 + t) + 2 p 1, at the points s = 2 1 c m x and t = m y 1: l + 2 p 1 = F(s, t ). 2. Approximate F(s, t) by a bivariate polynomial P(s, t) P(s, t) = s a(t) + 2 p 1. a(t): univariate polynomial approximant of 1/(1 + t) approximation entails an error ǫ approx 3. Evaluate P(s, t) by a well-chosen efficient evaluation program P v = P(s, t ). evaluation entails an error ǫ eval Guillaume Revy A new binary floating-point division algorithm and its software implementation on the ST231 processor 9/25

Division via polynomial evaluation Computation of the one-sided approximation 1. Consider l + 2 p 1 as the exact result of the function F(s,t) = s/(1 + t) + 2 p 1, at the points s = 2 1 c m x and t = m y 1: l + 2 p 1 = F(s, t ). 2. Approximate F(s, t) by a bivariate polynomial P(s, t) P(s, t) = s a(t) + 2 p 1. a(t): univariate polynomial approximant of 1/(1 + t) approximation entails an error ǫ approx 3. Evaluate P(s, t) by a well-chosen efficient evaluation program P v = P(s, t ). evaluation entails an error ǫ eval How to ensure that (l + 2 p 1 ) v < 2 p 1? Guillaume Revy A new binary floating-point division algorithm and its software implementation on the ST231 processor 9/25

Division via polynomial evaluation Sufficient error bounds Since by triangular inequality (l + 2 p 1 ) v µ ǫ approx + ǫ eval with µ = max{s } = max{2 1 c m x} = (4 2 3 p ) Guillaume Revy A new binary floating-point division algorithm and its software implementation on the ST231 processor 10/25

Division via polynomial evaluation Sufficient error bounds Since by triangular inequality (l + 2 p 1 ) v µ ǫ approx + ǫ eval with µ = max{s } = max{2 1 c m x} = (4 2 3 p ) One has to ensure µ ǫ approx + ǫ eval < 2 p 1 Sufficient conditions can be obtained ǫ approx < 2 p 1 /µ and ǫ eval < 2 p 1 µ ǫ approx Guillaume Revy A new binary floating-point division algorithm and its software implementation on the ST231 processor 10/25

Generation of an efficient evaluation program Outline of the talk Division via polynomial evaluation Generation of an efficient evaluation program Validation of the generated evaluation program Experimental results Concluding remarks Guillaume Revy A new binary floating-point division algorithm and its software implementation on the ST231 processor 11/25

Generation of an efficient evaluation program Automatic generation of an efficient evaluation program Evaluation program P = main part of the full software implementation dominates the cost By efficient, one means an evaluation program that reduces the evaluation latency reduces the number of multiplications is accurate enough Guillaume Revy A new binary floating-point division algorithm and its software implementation on the ST231 processor 12/25

Generation of an efficient evaluation program Automatic generation of an efficient evaluation program Evaluation program P = main part of the full software implementation dominates the cost By efficient, one means an evaluation program that reduces the evaluation latency reduces the number of multiplications is accurate enough Target architecture : ST231 4-issue VLIW integer processor with at most 2 mul. per cycle latencies: addition = 1 cycle, multiplication = 3 cycles Guillaume Revy A new binary floating-point division algorithm and its software implementation on the ST231 processor 12/25

Generation of an efficient evaluation program Automatic generation of an efficient evaluation program Evaluation program P = main part of the full software implementation dominates the cost By efficient, one means an evaluation program that reduces the evaluation latency reduces the number of multiplications is accurate enough Target architecture : ST231 4-issue VLIW integer processor with at most 2 mul. per cycle latencies: addition = 1 cycle, multiplication = 3 cycles Which evaluation program to evaluate the polynomial P(s,t)? Guillaume Revy A new binary floating-point division algorithm and its software implementation on the ST231 processor 12/25

Generation of an efficient evaluation program Example for the binary32 implementation: (k, p) = (32, 24) X10 P(s, t) = 2 p 1 + s a it i Horner s scheme: (3 + 1) 11 = 44 cycles sequential scheme, no instruction-level parallelism exposure i=0 Guillaume Revy A new binary floating-point division algorithm and its software implementation on the ST231 processor 13/25

Generation of an efficient evaluation program Example for the binary32 implementation: (k, p) = (32, 24) X10 P(s, t) = 2 p 1 + s a it i Horner s scheme: (3 + 1) 11 = 44 cycles i=0 sequential scheme, no instruction-level parallelism exposure Estrin s scheme: 20 cycles more instruction-level parallelism a last multiplication by s 2 cycles save by distributing the multiplication by s in the evaluation of the univariate polynomial a(t) Guillaume Revy A new binary floating-point division algorithm and its software implementation on the ST231 processor 13/25

Generation of an efficient evaluation program Example for the binary32 implementation: (k, p) = (32, 24) X10 P(s, t) = 2 p 1 + s a it i Horner s scheme: (3 + 1) 11 = 44 cycles i=0 sequential scheme, no instruction-level parallelism exposure Estrin s scheme: 20 cycles... more instruction-level parallelism a last multiplication by s 2 cycles save by distributing the multiplication by s in the evaluation of the univariate polynomial a(t) We can do much better. But how to explore the solution space and choose an efficient evaluation program? interest of automatic generation Guillaume Revy A new binary floating-point division algorithm and its software implementation on the ST231 processor 13/25

Generation of an efficient evaluation program Efficient evaluation tree generation Similar to Harrison, Kubaska, Story and Tang (1999) Assumption unbounded parallelism latencies of arithmetic operators: + and Guillaume Revy A new binary floating-point division algorithm and its software implementation on the ST231 processor 14/25

Generation of an efficient evaluation program Efficient evaluation tree generation Similar to Harrison, Kubaska, Story and Tang (1999) Assumption unbounded parallelism latencies of arithmetic operators: + and Two sub-steps 1. determine a target latency τ ie. τ = 3 log 2 (deg(p)) + 1 2. generate automatically a set of evaluation trees, with height τ Guillaume Revy A new binary floating-point division algorithm and its software implementation on the ST231 processor 14/25

Generation of an efficient evaluation program Efficient evaluation tree generation Similar to Harrison, Kubaska, Story and Tang (1999) Assumption unbounded parallelism latencies of arithmetic operators: + and Two sub-steps 1. determine a target latency τ ie. τ = 3 log 2 (deg(p)) + 1 2. generate automatically a set of evaluation trees, with height τ if no tree satisfies τ then increase τ and restart Guillaume Revy A new binary floating-point division algorithm and its software implementation on the ST231 processor 14/25

Generation of an efficient evaluation program Efficient evaluation tree generation Similar to Harrison, Kubaska, Story and Tang (1999) Assumption unbounded parallelism latencies of arithmetic operators: + and Two sub-steps 1. determine a target latency τ ie. τ = 3 log 2 (deg(p)) + 1 2. generate automatically a set of evaluation trees, with height τ if no tree satisfies τ then increase τ and restart Number of evaluation trees = extremely large several filters Guillaume Revy A new binary floating-point division algorithm and its software implementation on the ST231 processor 14/25

Generation of an efficient evaluation program Efficient evaluation tree generation X10 P(s, t) = 2 p 1 + s a it i i=0 first target latency τ = 13 no tree found Guillaume Revy A new binary floating-point division algorithm and its software implementation on the ST231 processor 15/25

Generation of an efficient evaluation program Efficient evaluation tree generation X10 P(s, t) = 2 p 1 + s a it i i=0 first target latency τ = 13 no tree found second target latency τ = 14 obtained in about 10 sec. 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0 t a1 a0 s a2 t a3 2 p 1 a4 t a5 s t v a6 a7 t t a10 t a9 a8 Guillaume Revy A new binary floating-point division algorithm and its software implementation on the ST231 processor 15/25

Generation of an efficient evaluation program Efficient evaluation tree generation X10 P(s, t) = 2 p 1 + s a it i i=0 first target latency τ = 13 no tree found second target latency τ = 14 obtained in about 10 sec. distribute the multiplication by s otherwise: 18 cycles too difficult to find such tree by hand 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0 t a1 a0 s a2 t a3 2 p 1 a4 t a5 s t v a6 a7 t t a10 t a9 a8 Guillaume Revy A new binary floating-point division algorithm and its software implementation on the ST231 processor 15/25

Generation of an efficient evaluation program Arithmetic operator choice Polynomial coefficients implemented in absolute value All intermediate values have constant sign not store the sign: more accuracy Guillaume Revy A new binary floating-point division algorithm and its software implementation on the ST231 processor 16/25

Generation of an efficient evaluation program Arithmetic operator choice Polynomial coefficients implemented in absolute value All intermediate values have constant sign not store the sign: more accuracy Label evaluation trees by appropriate arithmetic operator: + or Guillaume Revy A new binary floating-point division algorithm and its software implementation on the ST231 processor 16/25

Generation of an efficient evaluation program Arithmetic operator choice Polynomial coefficients implemented in absolute value All intermediate values have constant sign not store the sign: more accuracy Label evaluation trees by appropriate arithmetic operator: + or If the sign of an intermediate value changes when the input varies then the evaluation tree is rejected implementation with certified interval arithmetic (MPFI) Guillaume Revy A new binary floating-point division algorithm and its software implementation on the ST231 processor 16/25

Generation of an efficient evaluation program Practical scheduling checking Schedule the evaluation trees on a simplified model of a real target architecture operator costs, nb. issues, constraints on operators no syllables constraint Guillaume Revy A new binary floating-point division algorithm and its software implementation on the ST231 processor 17/25

Generation of an efficient evaluation program Practical scheduling checking Schedule the evaluation trees on a simplified model of a real target architecture operator costs, nb. issues, constraints on operators no syllables constraint Check if no increase of latency in practice compared to the latency on unbounded parallelism if practical latency > theoretical latency then the evaluation tree is rejected implementation using naive list scheduling algorithm is enough Guillaume Revy A new binary floating-point division algorithm and its software implementation on the ST231 processor 17/25

Validation of the generated evaluation program Outline of the talk Division via polynomial evaluation Generation of an efficient evaluation program Validation of the generated evaluation program Experimental results Concluding remarks Guillaume Revy A new binary floating-point division algorithm and its software implementation on the ST231 processor 18/25

Validation of the generated evaluation program Example for the binary32 implementation: (k, p) = (32, 24) Approximation of 1/(1 + t) by truncated Remez polynomial of degree 10 6e 09 Approximation error 4e 09 Approximation error 2e 09 0 2e 09 4e 09 6e 09 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 t = my 1 ǫ approx 2 27.41... 6.0 e-9 < 2 25 /(4 2 21 ) 7.4 e-9 Guillaume Revy A new binary floating-point division algorithm and its software implementation on the ST231 processor 19/25

Validation of the generated evaluation program Example for the binary32 implementation: (k, p) = (32, 24) Approximation of 1/(1 + t) by truncated Remez polynomial of degree 10 6e 09 Approximation error 4e 09 Approximation error 2e 09 0 2e 09 4e 09 6e 09 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 t = my 1 ǫ approx 2 27.41... 6.0 e-9 < 2 25 /(4 2 21 ) 7.4 e-9 Deduction of the evaluation error bound from ǫ approx ǫ eval < 2 25 (4 2 21 ) 2 27.41... 2 26.9999... 7.4 e-9. Guillaume Revy A new binary floating-point division algorithm and its software implementation on the ST231 processor 19/25

Validation of the generated evaluation program Example for the binary32 implementation: (k, p) = (32, 24) Case 1: m x m y condition satisfied Case 2: m x < m y condition not satisfied ie. s = 3.935581684112548828125 and t = 0.97490441799163818359375 8e 09 7e 09 6e 09 2 26.9988 Absolute evaluation error Evaluation error bound Absolute evaluation error 5e 09 4e 09 3e 09 2e 09 1e 09 0 0.97 0.975 0.98 0.985 0.99 0.995 t = my 1 Guillaume Revy A new binary floating-point division algorithm and its software implementation on the ST231 processor 20/25

Validation of the generated evaluation program Example for the binary32 implementation: (k, p) = (32, 24) Case 1: m x m y condition satisfied Case 2: m x < m y condition not satisfied ie. s = 3.935581684112548828125 and t = 0.97490441799163818359375 8e 09 7e 09 6e 09 Absolute evaluation error Evaluation error bound 1. determine an interval I around this point Absolute evaluation error 5e 09 4e 09 3e 09 2e 09 1e 09 0 0.97 0.975 0.98 0.985 0.99 0.995 t = my 1 Guillaume Revy A new binary floating-point division algorithm and its software implementation on the ST231 processor 20/25

Validation of the generated evaluation program Example for the binary32 implementation: (k, p) = (32, 24) Case 1: m x m y condition satisfied Case 2: m x < m y condition not satisfied ie. s = 3.935581684112548828125 and t = 0.97490441799163818359375 Approximation error 6e 09 4e 09 2e 09 0 2e 09 4e 09 6e 09 Approximation error 1. determine an interval I around this point 2. compute ǫ approx overi 3. determine an evaluation error bound η 4. check if ǫ eval < η? 0.97 0.975 0.98 0.985 0.99 0.995 t = my 1 Guillaume Revy A new binary floating-point division algorithm and its software implementation on the ST231 processor 20/25

Validation of the generated evaluation program Evaluation program validation strategy Find a splitting of the input interval into n subinterval(s) T (i), and check that µ ǫ (i) approx + ǫ (i) eval < 2 p 1 on each subinterval. Guillaume Revy A new binary floating-point division algorithm and its software implementation on the ST231 processor 21/25

Validation of the generated evaluation program Evaluation program validation strategy Find a splitting of the input interval into n subinterval(s) T (i), and check that µ ǫ (i) approx + ǫ (i) eval < 2 p 1 on each subinterval. Implementation of the splitting by dichotomy for each T (i) 1. compute a certified approximation error bound ǫ (i) approx 2. determine an evaluation error bound ǫ (i) eval 3. check this bound if this bound is not satisfied, T (i) is split up into 2 subintervals implemented using Sollya (steps 1 and 2) and Gappa (step 3) Guillaume Revy A new binary floating-point division algorithm and its software implementation on the ST231 processor 21/25

Validation of the generated evaluation program Evaluation program validation strategy Find a splitting of the input interval into n subinterval(s) T (i), and check that µ ǫ (i) approx + ǫ (i) eval < 2 p 1 on each subinterval. Implementation of the splitting by dichotomy for each T (i) 1. compute a certified approximation error bound ǫ (i) approx 2. determine an evaluation error bound ǫ (i) eval 3. check this bound if this bound is not satisfied, T (i) is split up into 2 subintervals implemented using Sollya (steps 1 and 2) and Gappa (step 3) Example of binary32 implementation launched on a 64 processor grid 36127 subintervals found in several hours ( 5h.) Guillaume Revy A new binary floating-point division algorithm and its software implementation on the ST231 processor 21/25

Experimental results Outline of the talk Division via polynomial evaluation Generation of an efficient evaluation program Validation of the generated evaluation program Experimental results Concluding remarks Guillaume Revy A new binary floating-point division algorithm and its software implementation on the ST231 processor 22/25

Experimental results Experimental results Performances on ST231 Nb. of instructions Latency (# cycles) IPC Code size (bytes) rounding to nearest 86 27 3.18 416 speed-up by a factor of about 1.78 in rounding to nearest compared to the previous implementation (48 cycles) optimized implementation efficient ST200 compiler (st200cc) high IPC value: confirms the parallel nature of our approach Guillaume Revy A new binary floating-point division algorithm and its software implementation on the ST231 processor 23/25

Concluding remarks Outline of the talk Division via polynomial evaluation Generation of an efficient evaluation program Validation of the generated evaluation program Experimental results Concluding remarks Guillaume Revy A new binary floating-point division algorithm and its software implementation on the ST231 processor 24/25

Concluding remarks Concluding remarks Contributions New approach for the implementation of binary floating-point division bivariate polynomial-based algorithm automatic generation and validation of efficient evaluation program implementation targeted ST231 VLIW integer processor Speed-up by a factor of about 1.78 in rounding to nearest compared to the previous implementation Since then Extension to subnormal numbers support implementation in 31 cycles: 4 extra cycles Implementation of other functions Latency (# cycles) IPC Code size (bytes) Speed-up square root 21 2.47 276 2.38 reciprocal 22 2.59 336 1.75 reciprocal square root 29 2.24 368 2.27 Guillaume Revy A new binary floating-point division algorithm and its software implementation on the ST231 processor 25/25