Adaptive Stable Additive Methods for Linear Algebraic Calculations




Adaptive Stable Additive Methods for Linear Algebraic Calculations
József Smidla, Péter Tar, István Maros
University of Pannonia, Veszprém, Hungary
4th of July 2014

1 / 21 József Smidla, Péter Tar, István Maros Adaptive Stable Additive Methods for Linear Algebraic Calculations

Outline

1. Linear algebraic kernel
   - Dot product
2. Hilbert matrix
   - Condition number
   - Large condition number aware logic
3. Stable dot product
   - Primary large condition number detector

Linear algebraic kernel: dot product

Pannon Optimizer: a linear programming solver.

Linear programming problem:
  min c^T x
  Ax = b
  x_j ≥ 0, j = 1..n

The linear algebraic kernel provides linear algebraic algorithms and data structures:
- vector operations (e.g. dot product)
- FTRAN: α = B⁻¹a
- BTRAN: π^T = h^T B⁻¹
where B is the actual basis.

Dot product: floating point errors

Floating point numbers: (−1)^s · 1.m_1 m_2 … m_n · 2^e
- s ∈ {0, 1}: sign
- m_i: i-th bit of the mantissa
- e: exponent

Errors:
- Rounding: A' = A + B, where A ≫ B and B ≠ 0; the computed A' equals A, so B is silently lost.
- Cancellation: given A and B ≠ 0 with A ≈ −B, C = A + B. Expectation: C = 0. Actual result: C = ±ε.

These errors can create many fake nonzeros, lead to wrong results, and slow down the algorithms.

Intel's SIMD architecture

Parallel operations on multiple data.

SSE2: 128-bit wide XMM registers. One register holds:
- 4 single precision floating point numbers, or
- 2 doubles, or
- 4 32-bit integers, or
- 2 64-bit integers
Available operations: single and double precision arithmetic (add, multiply, etc.), bitwise operations, integer operations, logical operations, move operations.

AVX: 256-bit wide YMM registers, i.e. 4 double precision floating point numbers per register.

Naive add implementation

Given vectors A and B, compute C := A + B, where c_i := a_i + b_i.

Requirements:
- avoid cancellation errors
- minimize the overhead

Naive implementation:
  Input: A, B
  Output: C
  For each element of A and B: c_i := a_i + b_i

The naive implementation does not avoid cancellation errors. Stabilize the result using a relative tolerance ε_r:
  c_i := a_i + b_i
  if (|a_i| + |b_i|) · ε_r ≥ |c_i| then c_i := 0

Operations per element: 2 additions, 1 multiplication, 2 assignments, 3 absolute values, 1 comparison, 1 jump.

The result is stable, but the algorithm contains conditional jumps, which slow it down.
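This tolerance test translates directly into C; a minimal sketch (the function name `stable_add` is ours, not from the solver's source):

```c
#include <math.h>
#include <stddef.h>

/* Tolerance-stabilized vector add: c := a + b, where each c_i is
   snapped to zero when it falls below the relative tolerance eps_r,
   i.e. when (|a_i| + |b_i|) * eps_r >= |c_i|. */
void stable_add(const double *a, const double *b, double *c,
                size_t n, double eps_r) {
    for (size_t i = 0; i < n; ++i) {
        c[i] = a[i] + b[i];
        /* the conditional jump whose cost the slides count */
        if ((fabs(a[i]) + fabs(b[i])) * eps_r >= fabs(c[i]))
            c[i] = 0.0;
    }
}
```

With eps_r = 1e-10, adding the vectors (0.1 + 0.2, 1.0) and (−0.3, 2.0) yields an exact 0 in the first slot instead of a fake nonzero, while the second slot is computed normally.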

Our accelerated stable add method

Use Intel's AVX instruction set: the results of the parallel comparisons are placed in a YMM register, and can be used for bit masking:
  1. mask := 00…0₂
  2. if (|a_i| + |b_i|) · ε_r < |a_i + b_i| then mask := 11…1₂
  3. c_i := (a_i + b_i) bitwise AND mask

The comparison in step 2 is a single AVX instruction, so there is no jump in the implementation. The absolute value is also computed by bit masking (a bitwise AND that clears the sign bit).

[Worked example on the slide: YMM registers hold (|a_i| + |b_i|) · ε_r and a_i + b_i; the comparison yields an all-ones/all-zeros mask per lane, and ANDing the mask with the sums zeroes exactly the entries below the tolerance.]
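The masking idea can be modeled in portable scalar C: build an all-ones or all-zeros 64-bit mask from the comparison and AND it with the sum's bit pattern. The actual implementation would use AVX intrinsics (`_mm256_cmp_pd`, `_mm256_and_pd`) to do this for four doubles per instruction with no branch; this scalar sketch only illustrates the bit manipulation:

```c
#include <math.h>
#include <stdint.h>
#include <string.h>

/* Branch-free (in the AVX version) stable add for one element:
   an all-ones mask is produced when the sum passes the tolerance
   test, an all-zeros mask otherwise, and the mask is ANDed into
   the sum's bit pattern. */
double stable_add_masked(double a, double b, double eps_r) {
    double sum = a + b;
    /* mask = 11...1 if (|a| + |b|) * eps_r < |a + b|, else 00...0;
       AVX's compare instruction yields exactly this lane mask. */
    uint64_t mask = ((fabs(a) + fabs(b)) * eps_r < fabs(sum))
                        ? ~UINT64_C(0) : UINT64_C(0);
    uint64_t bits;
    memcpy(&bits, &sum, sizeof bits);
    bits &= mask;                 /* zeroes the sum when mask is 0 */
    memcpy(&sum, &bits, sizeof sum);
    return sum;
}
```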

Naive dot product implementation

We have two n-dimensional vectors, a and b:
  a^T b = Σ_{i=1}^{n} a_i b_i

Problem: we have to use floating point arithmetic, so rounding and cancellation errors occur.

Stable dot product implementation

Separate the negative and positive products into two variables:
- N: sum of negative products
- P: sum of positive products

Algorithm:
  1. read a_i and b_i
  2. p := a_i · b_i
  3. if p < 0 then
  4.   N := N + p
  5. else
  6.   P := P + p
  Final result := N + P
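The algorithm above can be sketched directly in C (the function name is ours); products of like sign are accumulated together, so cancellation can only happen once, in the final N + P:

```c
#include <stddef.h>

/* Stable dot product: negative products go into N, positive
   products into P, and the single potentially cancelling
   addition is deferred to the end. */
double stable_dot(const double *a, const double *b, size_t n) {
    double N = 0.0, P = 0.0;
    for (size_t i = 0; i < n; ++i) {
        const double p = a[i] * b[i];
        if (p < 0.0)       /* the conditional jump removed later */
            N += p;
        else
            P += p;
    }
    return N + P;
}
```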

Our accelerated stable dot product implementation

Conditional jumps can be avoided using pointer arithmetic:

  union Number {
      double num;
      unsigned long long int bits;
  } number;
  double negpos[2] = {0.0, 0.0};
  [...]
  const double prod = a * b;
  number.num = prod;
  *(negpos + (number.bits >> 63)) += prod;

AVX can accelerate the stable dot product even further.
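For completeness, a self-contained, runnable version of this fragment (the loop and function wrapper are our additions): the sign bit of the product, bit 63, indexes negpos, selecting the positive accumulator for prod ≥ 0 and the negative one for prod < 0, with no branch:

```c
#include <stddef.h>

/* Branchless stable dot product using the union / pointer
   arithmetic trick from the slide. */
double stable_dot_branchless(const double *a, const double *b, size_t n) {
    union Number {
        double num;
        unsigned long long int bits;
    } number;
    double negpos[2] = {0.0, 0.0};   /* [0]: positive sums, [1]: negative */
    for (size_t i = 0; i < n; ++i) {
        const double prod = a[i] * b[i];
        number.num = prod;
        /* bits >> 63 is 0 for prod >= 0 and 1 for prod < 0 */
        *(negpos + (number.bits >> 63)) += prod;
    }
    return negpos[0] + negpos[1];
}
```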

Hilbert matrix

Hilbert matrix: H_{n,n}, where h_{i,j} = 1/(i+j−1), i, j = 1, …, n.

Example:
  H_{4,4} =
    [ 1    1/2  1/3  1/4 ]
    [ 1/2  1/3  1/4  1/5 ]
    [ 1/3  1/4  1/5  1/6 ]
    [ 1/4  1/5  1/6  1/7 ]

We can construct the following LP problem:
  min 0
  H_{n,n} x = b
  x_j ≥ 0, j = 1..n, and b_j = Σ_{i=1}^{n} 1/(i+j−1)

Solvers and the Hilbert matrix

It is clear that the solution is optimal if and only if x_j = 1, j = 1..n. We have tested CLP and GLPK:

  Size     | GLPK                  | Exact GLPK       | CLP
  3×3      | x_j = 1 ± 3.997·10⁻⁵  | x_j = 1          | x_j = 1
  4×4      | x_j = 1 ± 8.27·10⁻³   | x_j = 1          | x_j = 1
  5×5      | x_j = 1 ± 1.75·10⁻¹   | x_j = 1          | x_j = 1
  6×6      | x_j = 1 ± 2.66·10⁰    | x_j = 1          | INFEASIBLE
  7×7      | x_j = 1 ± 1.57        | x_j = 1          | INFEASIBLE
  8×8      | x_j = 1 ± 1.600       | x_j = 1 ± 0.120  | INFEASIBLE
  20×20    | x_j = 1 ± 6.298       | x_j = 1 ± 4.24   | INFEASIBLE
  100×100  | 0 ≤ x_j ≤ 24.009      | x_j = 1 ± 2.682  | INFEASIBLE

We used CLP and GLPK as libraries; the models were generated and solved by C++ programs.

Condition number

The condition number measures how much the output changes if the input changes:
  κ(B) = ‖B‖ · ‖B⁻¹‖

Problems with computing κ(B):
- the matrix changes in every iteration
- if κ(B) is large, computing B⁻¹ is difficult

The condition number of the n×n Hilbert matrix is very large; it grows as O((1+√2)^{4n} / √n):
  κ(H_{6,6}) ≈ 2.907·10⁷
  κ(H_{10,10}) ≈ 3.536·10¹³
  κ(H_{100,100}) ≈ 2.42·10¹⁴⁸

Primary large condition number detector

We propose: we cannot compute the condition number directly, but we can detect the effect of a large condition number.

- The input of the classic FTRAN is vector a: B⁻¹a = α.
- Create a perturbed copy ā of a.
- Use a modified FTRAN, which computes B⁻¹a = α and B⁻¹ā = ᾱ; the modified FTRAN perturbs every sum while computing ᾱ.
- If r = max{‖α‖, ‖ᾱ‖} / min{‖α‖, ‖ᾱ‖} is greater than a threshold, the condition number is too large: primary alarm.

Large condition number aware logic

- The primary detector is executed when an error occurs, for example a fallback to phase-1.
- If a primary alarm occurs, the algorithm runs the primary detector in the following iterations as well. If primary alarms occur in every subsequent iteration and r does not decrease: secondary alarm, and the algorithm stops.
- If a primary alarm occurs, the algorithm also performs a sensitivity analysis. If the sensitivity analysis finds that the result is extremely unstable: secondary alarm.
- If a secondary alarm occurs, the software restarts from the last basis with modified parameters (enabled scaling, switching to LU decomposition, etc.).
- As a last resort, the software restarts from the last basis with enhanced precision arithmetic.

Next steps

- Integrate the enhanced precision arithmetic into the Pannon Optimizer.
- Integrate the large condition number recognizer algorithm.
- The large condition number recognizer can be accelerated with low-level optimization (SIMD architecture).

Our goal: implement a solver that runs fast on stable problems, but recognizes excessively numerically unstable problems, switches to more precise arithmetic, and solves those problems too.

Benchmark

CPU: Intel Core i5-3210M, 2.50 GHz
Vector lengths: 10⁵
Dot product operations repeated 10⁵ times

[Bar chart: running times in seconds of the naive, conditional jump, SSE2, and AVX implementations; the conditional jump variant is the slowest at 28.83 s.]

Stable dot product

CPU: Intel Core i5-3210M, 2.50 GHz
Vector lengths: 10⁶
Dot product operations repeated 10⁴ times

[Bar chart: running times in seconds of the naive, conditional jump, pointer arithmetic, SSE2, and AVX implementations; the conditional jump variant is the slowest at 63.35 s.]

Primary large condition number detector

Output of the detector:
  r = max{‖α‖, ‖ᾱ‖} / min{‖α‖, ‖ᾱ‖},  δ = r − 1

  Problem          | Value of δ after the last iteration
  25FV47.MPS       | 3.66059e-08
  STOCFOR3.MPS     | 3.07735e-08
  PILOT.MPS        | 9.22276e-06
  MAROS-R7.MPS     | 1.39086e-10
  Hilbert 7×7      | 0.06745
  Hilbert 8×8      | 0.524724
  Hilbert 20×20    | 2.05845
  Hilbert 26×26    | 5.4588
  Hilbert 100×100  | 0.32362

Thank you for your attention!

This publication/research has been supported by the European Union and Hungary and co-financed by the European Social Fund through the project TÁMOP-4.2.2.C-11/1/KONV-2012-0004, National Research Center for Development and Market Introduction of Advanced Information and Communication Technologies.