Reliable Implementation of Real Number Algorithms: Theory and Practice
Dagstuhl Seminar 06021, January 08-13, 2006

FMA Implementations of the Compensated Horner Scheme

Stef GRAILLAT, Philippe LANGLOIS and Nicolas LOUVET
University of Perpignan, France. http://webdali.univ-perp.fr/
Outline

1. Motivations and previous results: the Compensated Horner Scheme (CHS)
2. How to benefit from the Fused Multiply and Add operator (FMA)? How to derive Error Free Transformations for polynomial evaluation with an FMA?
   (a) From HornerFMA to CompensatedHorner3FMA
   (b) From CompensatedHorner to CompensatedHornerFMA
3. Theoretical and numerical results
   (a) The accuracy exhibits a "twice the working precision" behavior
   (b) The running time is about twice as fast as the double-double Horner scheme
General motivation

Provide numerical algorithms and software that are:
- a few times more accurate than the result computed in IEEE-754 working precision: the actual accuracy is proven to satisfy improved versions of the classic rule of thumb;
- efficient in terms of running time, without sacrificing too much portability: only working with the IEEE-754 precisions (single, double);
- together with a residual error bound to control the accuracy of the computed result: a dynamic and validated error bound computable in IEEE-754 arithmetic.

Example for polynomial evaluation with the Horner scheme: the Compensated Horner Scheme.

S. Graillat, Ph. Langlois, N. Louvet. Compensated Horner Scheme. Research Report, DALI-LP2A, University of Perpignan, Aug. 2005 (submitted).
The general rule of thumb for the accuracy of a backward stable algorithm

IEEE-754 double precision, with rounding to the nearest: u = 2^-53 ≈ 1.1 × 10^-16, i.e. about 16 digits.

The general rule of thumb (RoT) for backward stable algorithms, where u is the computing precision:

    solution accuracy ≲ u × condition number of the problem.

Condition number for the evaluation of p(x) = Σ_{i=0}^{n} a_i x^i:

    cond(p, x) = (Σ_{i=0}^{n} |a_i| |x|^i) / |p(x)|,  always ≥ 1.

If cond(p, x) ≈ 10^10, with u ≈ 1.1 × 10^-16: solution accuracy ≈ 10^-6, only 6 digits!
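The condition number above can be evaluated with two plain Horner passes, one on the absolute coefficients at |x| and one on p itself. A minimal C sketch of this (the helper names `horner` and `poly_cond` are ours, not from the talk, and the computation is itself subject to rounding):

```c
#include <math.h>

/* Plain Horner evaluation of p at x; p[0..n] holds a_0..a_n. */
static double horner(const double *p, int n, double x) {
    double s = p[n];
    for (int i = n - 1; i >= 0; i--)
        s = s * x + p[i];
    return s;
}

/* cond(p, x) = (sum_i |a_i| |x|^i) / |p(x)|, via two Horner passes. */
double poly_cond(const double *p, int n, double x) {
    double num = fabs(p[n]);
    for (int i = n - 1; i >= 0; i--)
        num = num * fabs(x) + fabs(p[i]);
    return num / fabs(horner(p, n, x));
}

/* demo: p(x) = x^2 - 2x + 1 at x = 3: (1 + 6 + 9) / |4| = 4 */
double demo_poly_cond(void) {
    const double p[] = {1.0, -2.0, 1.0};   /* a_0, a_1, a_2 */
    return poly_cond(p, 2, 3.0);
}
```

Here the evaluation is well conditioned (cond = 4); the ill-conditioned regime of the slides arises when |p(x)| is tiny compared to Σ|a_i||x|^i.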
The compensated rule of thumb

Compensated RoT for the "twice the current working precision" behavior:

    accuracy of the computed solution ≲ precision + condition number × precision².

Theorem 1. Given p a polynomial of degree n with floating point coefficients, and x a floating point value, if no underflow occurs,

    |CompensatedHorner(p, x) − p(x)| / |p(x)| ≤ u + γ_{2n}² cond(p, x),  with γ_{2n} ≈ 2nu.

Proofs are similar to those of the Ogita-Rump-Oishi 2005 SISC paper. A K-fold compensated rule of thumb also holds.
CHS: accuracy experiments exhibit a "twice the working precision" behavior

[Figure: relative forward error vs. condition number (from 10^5 to 10^35, log-log scale) for Horner, CompensatedHorner and DDHorner, together with u and the a priori bounds γ_{2n} cond and u + γ_{2n}² cond.]

CompensatedHorner and DDHorner (double-double) provide the same accuracy.
Compensated Horner Scheme: measured and theoretical ratios of the running time

Pentium 4, 3.0 GHz, 1024 kB L2 cache, GCC 3.4.1

ratio                        minimum   mean   maximum   theoretical
CompensatedHorner/Horner     1.5       2.9    3.2       13
DDHorner/Horner              2.3       8.4    9.4       17
Compensated Horner Scheme: a dynamic and validated error bound

Theorem 2. Given a polynomial p with floating point coefficients, and a floating point value x, let res = CompensatedHorner(p, x). The absolute forward error affecting the evaluation is bounded according to

    |CompensatedHorner(p, x) − p(x)| ≤ fl(u |res| + (γ_{4n+2} HornerSum(|p_π|, |p_σ|, |x|) + 2u² |res|)).

The dynamic bound is computed entirely in floating point arithmetic, with the round to the nearest mode only. The dynamic bound is less pessimistic than the a priori one. The proof uses EFT and the standard model of floating point arithmetic (as in the Ogita-Rump-Oishi paper).
Today

1. Motivations and previous results: the Compensated Horner Scheme (CHS)
2. How to benefit from the Fused Multiply and Add operator (FMA)? How to derive Error Free Transformations for polynomial evaluation and FMA?
   (a) From Horner to HornerFMA
   (b) From HornerFMA to CompensatedHorner3FMA
   (c) From CompensatedHorner to CompensatedHornerFMA
3. Theoretical and numerical results
   (a) The accuracy exhibits a "twice the working precision" behavior
   (b) The running time is about twice as fast as the double-double Horner scheme
Let us consider the availability of an FMA operator

What is a Fused Multiply and Add (FMA) in floating point arithmetic? Given three floating point values a, b and c, FMA(a, b, c) computes a × b + c rounded according to the current rounding mode: only one rounding error for two arithmetic operations!

The FMA is available on Intel Itanium, IBM PowerPC...

So, when we evaluate polynomials with the Horner scheme:
- the number of floating point operations (flops) is divided by two,
- the number of rounding errors is also divided by two!

For general polynomials and arguments, do we obtain a more accurate evaluation?
Relative error bounds for the Horner scheme, with or without FMA

Algorithm 1. Classic Horner scheme
function [s_0] = Horner(p, x)
    s_n = a_n
    for i = n−1 : −1 : 0
        p_i = s_{i+1} ⊗ x
        s_i = p_i ⊕ a_i
    end

    |Horner(p, x) − p(x)| / |p(x)| ≤ γ_{2n} cond(p, x),  where γ_{2n} = 2nu / (1 − 2nu) ≈ 2nu.

Algorithm 2. Horner scheme with an FMA
function [s_0] = HornerFMA(p, x)
    s_n = a_n
    for i = n−1 : −1 : 0
        s_i = FMA(s_{i+1}, x, a_i)
    end

    |HornerFMA(p, x) − p(x)| / |p(x)| ≤ γ_n cond(p, x),  where γ_n = nu / (1 − nu) ≈ nu.

Almost the same accuracy!
Accuracy of HornerFMA: solution accuracy ≈ u × condition number of the problem

Test polynomials: p_k(x) = (1 − x)^k, in expanded form, evaluated at x = fl(1.333); as k increases, cond(p, x) increases!

[Figure: relative forward error vs. condition number for Horner and HornerFMA, together with u and the a priori bound γ_n cond.]

More accuracy for ill-conditioned problems?
More accuracy for the Horner Scheme
More accuracy thanks to...

1. More bits. Many solutions: the 80-bit registers of x86, quadruple precision (u²), MP, MPFR, ARPREC/MPFUN. The reference in our context is the fixed-length floating point expansion, such as double-double (u²) and quad-double (u⁴).

2. Compensated algorithms. General idea: correcting the generated rounding errors. Many examples: Kahan's compensated summation (65), ..., Priest's doubly compensated summation (92), ..., Ogita-Rump-Oishi (SISC 05).

Sometimes, more bits (together with some extra work) yield the definitive solution: accuracy ≈ u. In the more general case: at a reasonable cost we can get a "twice the current working precision" behavior...
EFT for the summation

Error Free Transformations (EFT) are properties and algorithms to compute the generated rounding errors at the current working precision:

    F(e_i) = F̂(e_i) + G(ε_i),  with cond(G, ε_i) < cond(F, e_i).

Summation: Knuth (74). (x, y) = 2Sum(a, b) is such that a + b = x + y and x = a ⊕ b. Only in the round to the nearest rounding mode: Bohlender et al. (91). 2Sum requires 6 flops.
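Knuth's 2Sum can be written directly in C (the function name `two_sum` is ours); in round to nearest, the six operations below recover the rounding error of the sum exactly:

```c
/* Knuth's 2Sum (6 flops): x = fl(a + b) and *err is the exact
   rounding error, so that a + b = x + *err (round to nearest). */
double two_sum(double a, double b, double *err) {
    double x = a + b;
    double z = x - a;
    *err = (a - (x - z)) + (b - z);
    return x;
}

/* demo: fl(1 + 1e-20) = 1; the error term recovers 1e-20 exactly */
double demo_two_sum(void) {
    double err;
    double x = two_sum(1.0, 1e-20, &err);
    return (x == 1.0 && err == 1e-20) ? 1.0 : 0.0;
}
```

No branch and no magnitude test on a and b are needed, which is what makes 2Sum attractive inside a loop.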
EFT for the product, without and with the FMA

Product: well known algorithm from Veltkamp and Dekker (71). (x, y) = 2Prod(a, b) is such that a × b = x + y and x = a ⊗ b. In all the rounding modes: Bohlender et al. (91), Boldo and Daumas (03). Not valid in case of underflow. 2Prod requires 17 flops...

... but only two flops with an FMA! Indeed, y = a × b − x = FMA(a, b, −x).

Algorithm 3. EFT for the product of two floating point numbers
function [x, y] = 2ProdFMA(a, b)
    x = a ⊗ b
    y = FMA(a, b, −x)
A recent result: EFT for the FMA

FMA: Boldo and Muller, 2005. (x, y, z) = 3FMA(a, b, c) is such that x = FMA(a, b, c) and a × b + c = x + y + z. Three floating point values are needed to represent the exact result. Only available in round to the nearest.

Algorithm 4. EFT for the FMA operation
function [x, y, z] = 3FMA(a, b, c)
    x = FMA(a, b, c)
    (u_1, u_2) = 2ProdFMA(a, b)
    (α_1, z) = 2Sum(c, u_2)
    (β_1, β_2) = 2Sum(u_1, α_1)
    y = (β_1 ⊖ x) ⊕ β_2

3FMA requires 17 flops.
Two Compensated Horner Schemes with FMA

1. Improve the EFT for the multiplication in Horner with 2ProdFMA: EFTHornerFMA and CompHornerFMA.
2. Apply the EFT for the FMA in HornerFMA with 3FMA: EFTHorner3FMA and CompHorner3FMA.

We prove that the two corresponding algorithms verify the "twice the current working precision" behavior.
An EFT for the Horner scheme, with 2ProdFMA for Horner

If no underflow occurs,

    p(x) = Horner(p, x) + (p_π + p_σ)(x),

with p_π(X) = Σ_{i=0}^{n−1} π_i X^i, p_σ(X) = Σ_{i=0}^{n−1} σ_i X^i, and π_i and σ_i in F.

Algorithm 5. Classic Horner scheme
function [h] = Horner(p, x)
    s_n = a_n
    for i = n−1 : −1 : 0
        p_i = s_{i+1} ⊗ x    % rounding error π_i
        s_i = p_i ⊕ a_i      % rounding error σ_i
    end
    h = s_0

Algorithm 6. EFT for the Horner scheme
function [h, p_π, p_σ] = EFTHornerFMA(p, x)
    s_n = a_n
    for i = n−1 : −1 : 0
        [p_i, π_i] = 2ProdFMA(s_{i+1}, x)
        [s_i, σ_i] = 2Sum(p_i, a_i)
    end
    Let π_i be the coefficient of degree i in p_π
    Let σ_i be the coefficient of degree i in p_σ
    h = s_0
A compensated Horner scheme, with 2ProdFMA for Horner

Since p(x) = Horner(p, x) + (p_π + p_σ)(x), we correct Horner(p, x) by computing an approximation of (p_π + p_σ)(x).

Algorithm 7. Compensated Horner scheme
function [res] = CompHornerFMA(p, x)
    [h, p_π, p_σ] = EFTHornerFMA(p, x)
    c = HornerFMA(p_π ⊕ p_σ, x)
    res = h ⊕ c

Theorem 3. Given p a polynomial of degree n with floating point coefficients, and x a floating point value,

    |CompHornerFMA(p, x) − p(x)| / |p(x)| ≤ u + (1 + u) γ_n γ_{2n} cond(p, x) ≈ u + 2n² u² cond(p, x) + O(u³),

with γ_k = ku / (1 − ku) and (1 + u) γ_n γ_{2n} ≈ 2n² u².
Another EFT, with 3FMA for HornerFMA

If no underflow occurs,

    p(x) = HornerFMA(p, x) + (p_ε + p_ϕ)(x),

with p_ε(X) = Σ_{i=0}^{n−1} ε_i X^i, p_ϕ(X) = Σ_{i=0}^{n−1} ϕ_i X^i, and ε_i and ϕ_i in F.

Algorithm 8. Horner scheme with an FMA
function [h] = HornerFMA(p, x)
    u_n = a_n
    for i = n−1 : −1 : 0
        u_i = FMA(u_{i+1}, x, a_i)    % rounding error ε_i + ϕ_i
    end
    h = u_0

Algorithm 9. EFT for the Horner scheme with FMA
function [h, p_ε, p_ϕ] = EFTHorner3FMA(p, x)
    u_n = a_n
    for i = n−1 : −1 : 0
        [u_i, ε_i, ϕ_i] = 3FMA(u_{i+1}, x, a_i)
    end
    Let ε_i be the coefficient of degree i in p_ε
    Let ϕ_i be the coefficient of degree i in p_ϕ
    h = u_0
Compensating the Horner scheme with 3FMA

Since p(x) = HornerFMA(p, x) + (p_ε + p_ϕ)(x), we correct HornerFMA(p, x) by computing an approximation of (p_ε + p_ϕ)(x).

Algorithm 10. Compensated Horner scheme with 3FMA
function [res] = CompHorner3FMA(p, x)
    [h, p_ε, p_ϕ] = EFTHorner3FMA(p, x)
    c = HornerFMA(p_ε ⊕ p_ϕ, x)
    res = h ⊕ c

Theorem 4. Given p a polynomial of degree n with floating point coefficients, and x a floating point value,

    |CompHorner3FMA(p, x) − p(x)| / |p(x)| ≤ u + (1 + u) γ_n² cond(p, x) ≈ u + n² u² cond(p, x) + O(u³),

with γ_n = nu / (1 − nu) and (1 + u) γ_n² ≈ n² u².
Summary of the results

Algorithm         EFT                         Relative error bound              flops
CompHornerFMA     EFTHornerFMA (2ProdFMA)     u + 2n² u² cond(p, x) + O(u³)     10n − 1
CompHorner3FMA    EFTHorner3FMA (3FMA)        u + n² u² cond(p, x) + O(u³)      19n

Almost the same accuracy! But EFTHornerFMA requires about half as many flops as EFTHorner3FMA.
Numerical experiments

Experimental results exhibit:
- the actual accuracy: the "twice the current working precision" behavior,
- the actual speed: about twice as fast as the corresponding double-double subroutine.
Tested algorithms

We compare:
- HornerFMA: IEEE-754 double precision Horner scheme + FMA;
- CompHornerFMA: compensated Horner scheme with 2ProdFMA;
- CompHorner3FMA: compensated HornerFMA scheme with 3FMA;
- DDHorner: classic Horner scheme + internal double-double computation, based on Bailey's double-double library; it also benefits from the FMA.

All computations are performed in C and IEEE-754 double precision.
Accuracy experiments exhibit a "twice the working precision" behavior

[Figure: relative forward error vs. condition number (from 10^5 to 10^40) for Horner, HornerFMA, CompHornerFMA, CompHorner3FMA and DDHorner, together with u and the a priori bounds γ_n cond and u + (1 + u) γ_n² cond.]

CompHornerFMA, CompHorner3FMA and DDHorner provide the same accuracy.
Numerical experiments: testing the speed efficiency

What is measured? The overhead needed to double the accuracy, via the time ratios CompHornerFMA/HornerFMA, CompHorner3FMA/HornerFMA and DDHorner/HornerFMA.
Speed efficiency: measured and theoretical ratios

                                  CompHornerFMA/   CompHorner3FMA/   DDHorner/
                                  HornerFMA        HornerFMA         HornerFMA
environment                       mean    theo.    mean    theo.     mean    theo.
Itanium 1, 733 MHz, GCC 2.96      2.6     10       4.8     19        6.5     20
Itanium 2, 900 MHz, GCC 3.3.5     2.5     10       4.4     19        7.9     20
Itanium 2, 1.6 GHz, GCC 3.4.4     3.8     10       6.0     19        7.3     20

The corrected algorithms run about twice as fast as the corresponding double-double implementation.
Conclusion

For general polynomials, the FMA mainly improves the efficiency, but not the accuracy.

To improve the accuracy, we have presented two compensated Horner schemes, CompHorner3FMA and CompHornerFMA. Both exhibit the "twice the current working precision" behavior.

CompHornerFMA is much faster than CompHorner3FMA: it is more interesting to correct the product (with 2ProdFMA) than the FMA (with 3FMA). CompHornerFMA is about twice as fast as DDHorner, based on double-double.

K-fold versions follow from the EFT for polynomial evaluation.

Research Reports at http://webdali.univ-perp.fr
Speed efficiency: measured and theoretical ratios (details)

ratio                           minimum   mean   maximum   theoretical

Itanium 1, 733 MHz, GCC 2.9.6
CompHornerFMA/HornerFMA         1.7       2.6    2.7       10
CompHorner3FMA/HornerFMA        2.3       4.8    5.1       19
DDHorner/HornerFMA              2.9       6.5    7.0       20

Itanium 2, 1.9 GHz, GCC 3.3.3
CompHornerFMA/HornerFMA         1.7       2.5    2.6       10
CompHorner3FMA/HornerFMA        2.4       4.4    4.6       19
DDHorner/HornerFMA              3.0       7.9    8.5       20

Itanium 2, 1.6 GHz, GCC 3.4.4
CompHornerFMA/HornerFMA         2.4       3.8    4.1       10
CompHorner3FMA/HornerFMA        3.4       6.0    6.3       19
DDHorner/HornerFMA              3.8       7.3    7.6       20