Adaptive Stable Additive Methods for Linear Algebraic Calculations

Size: px
Start display at page:

Download "Adaptive Stable Additive Methods for Linear Algebraic Calculations"

Transcription

1 Adaptive Stable Additive Methods for Linear Algebraic Calculations József Smidla, Péter Tar, István Maros University of Pannonia Veszprém, Hungary 4 th of July 204. / 2 József Smidla, Péter Tar, István Maros Adaptive Stable Additive Methods for Linear Algebraic Calculations

2 Outline Linear algebraic kernel Dot product 2 Hilbert matrix Condition number Large condition number aware logic 3 Stable dot product Primary large condition number detector 2 / 2 József Smidla, Péter Tar, István Maros Adaptive Stable Additive Methods for Linear Algebraic Calculations

3 Linear algebraic kernel Dot product Pannon Optimizer: linear programming solver Linear programming problem min c T x Ax = b x j 0, j =..n Linear algebraic kernel Provides linear algebraic algorithms and data structures: vector operations (e.g. dot product) FTRAN: α = B a BTRAN: π T = h T B where B is the actual basis 3 / 2 József Smidla, Péter Tar, István Maros Adaptive Stable Additive Methods for Linear Algebraic Calculations

4 Dot product Floating point numbers: ( ) s.m m 2...m n 2 e Errors s {0,}: sign m i : i th bit of the mantissa e: exponent Rounding error: A = A + B, where A» B, and B 0 Cancellation: Given A and B 0, A = -B C = A + B Expectation: C = 0 Error: C = ±ε These errors can create a lot of fake nonzeros, lead to wrong results and slow down the algorithms. 4 / 2 József Smidla, Péter Tar, István Maros Adaptive Stable Additive Methods for Linear Algebraic Calculations

5 Intel s SIMD architecture Dot product Paralell operations on multiple data SSE2: 28 bit wide XMM registers One register: 4 single precision floating point numbers, or 2 doubles, or 4 32 bit integers, or 2 64 bit integers Single precision and double precision operations (add, multiply, etc...) Bitwise operations Integer operations Logical operations Moving operations AVX: 256 bit wide YMM registers 4 double precision floating point numbers per register 5 / 2 József Smidla, Péter Tar, István Maros Adaptive Stable Additive Methods for Linear Algebraic Calculations

6 Naive add implementation Dot product Given A and B vectors C := A+B, where c i := a i + b i Requirement: Avoid cancellation errors Minimize the overhead Naive implementation Input: A, B Output: C For each element of A and B: c i := a i + b i 6 / 2 József Smidla, Péter Tar, István Maros Adaptive Stable Additive Methods for Linear Algebraic Calculations

7 Linear algebraic kernel Dot product Naive implementation: Does not avoid cancellation errors Stabilize the result using relateive tolerance ǫ r : c i := a i + b i if ( a i + b i )ǫ r c i then c i := 0 Operations: 2 additions, multiplication, 2 assignments 3 absolute values jump comparison The result is stable, but the algorithm contains conditional jumps slows down 7 / 2 József Smidla, Péter Tar, István Maros Adaptive Stable Additive Methods for Linear Algebraic Calculations

8 Dot product Our accelerated stable add method Use Intel s AVX instruction set of the parallel comparisons are in a YMM register These results can be used for bit masking: mask := if ( a i + b i )ǫ r < a i + b i then mask :=... 3 c i := ((a i + b i )) bitwise and with mask The comparison in step 2 is an AVX instruction There is no jumping in the implementation! Absolute value: bit masking (bitwise and) ( a + b ) ε i a i + b i compare a i + bi result i r e-5 4.5e-4 2.e-6 4e-2 3e e-0 4e e e-0-4e e-0 YMM0 YMM YMM2 YMM3 YMM4 8 / 2 József Smidla, Péter Tar, István Maros Adaptive Stable Additive Methods for Linear Algebraic Calculations

9 Naive dot product implementation Dot product We have two n dimensional vectors: a and b n a T b = a i b i i= Problem: We have to use floating point arithmetic Rounding and cancellation errors 9 / 2 József Smidla, Péter Tar, István Maros Adaptive Stable Additive Methods for Linear Algebraic Calculations

10 Stable dot product implementation Dot product Separate the negative and positive products Two variables: N: sum of negative products P: sum of positive products Algorithm: Read the a i and b i 2 p := a i b i 3 if p < 0 then 4 N := N + p 5 else 6 P := P + p Final result := N + P 0 / 2 József Smidla, Péter Tar, István Maros Adaptive Stable Additive Methods for Linear Algebraic Calculations

11 Dot product Our accelerated stable dot product implementation Conditional jumping can be avoided using pointer arithmetic: union Number { double num; unsigned long long int bits; } number; double negpos[2] = {0.0, 0.0}; [...] const double prod = a * b; number.num = prod; *(negpos + (number.bits >> 63)) += prod; The AVX can give more enhancement for the stable dot product / 2 József Smidla, Péter Tar, István Maros Adaptive Stable Additive Methods for Linear Algebraic Calculations

12 Hilbert matrix Linear algebraic kernel Hilbert matrix Condition number Large condition number aware logic Hilbert matrix: H n,n, where h i,j = Example: H 4,4 = i+j , i,j =,...,n We can construct the following LP problem: min 0 H n,n x = b x j 0,j =..n, and b j = n i= i + j 2 / 2 József Smidla, Péter Tar, István Maros Adaptive Stable Additive Methods for Linear Algebraic Calculations

13 Solvers and the Hilbert matrix Hilbert matrix Condition number Large condition number aware logic It is clear that if and only if x j =,j =..n, the solution is optimal We have tested the CLP and the GLPK Size GLPK Exact GLPK CLP 3 3 x j = ± x j = x j = 4 4 x j = ± x j = x j = 5 5 x j = ±.75 0 x j = x j = 6 6 x j = ± x j = INFEASIBLE 7 7 x j = ±.57 x j = INFEASIBLE 8 8 x j = ±.600 x j = ±0.20 INFEASIBLE x j = ±6.298 x j = ±4.24 INFEASIBLE x j x j = ±2.682 INFEASIBLE We have used Clp ang GLPK as libraries, the models were generated and solved by C++ programs. 3 / 2 József Smidla, Péter Tar, István Maros Adaptive Stable Additive Methods for Linear Algebraic Calculations

14 Condition number Hilbert matrix Condition number Large condition number aware logic Measures, how much the output changes if the input changes κ(b) = B B Problems with computing κ(b): The matrix changes in every iterations If κ(b) is large, computing B is difficult The condition number of the n*n Hilbert matrix is very large, it grows as ( (+ ) 2) 4n O n κ(h 6,6 ) κ(h 0,0 ) κ(h 00,00 ) / 2 József Smidla, Péter Tar, István Maros Adaptive Stable Additive Methods for Linear Algebraic Calculations

15 Hilbert matrix Condition number Large condition number aware logic Primary large condition number detector We propose: We can not compute the condition number directly However, we can detect the effect of the large condition number! The input of the classic FTRAN is vector a: B a = α Create the perturbed ā copy of a Use a modified FTRAN, which computes B a = α and B ā = ᾱ The modified FTRAN perturbs every sum during computing ᾱ If r = max{ α, ᾱ } min{ α, ᾱ } is greater than a threshold, it means that the condition number is too large primary alarm 5 / 2 József Smidla, Péter Tar, István Maros Adaptive Stable Additive Methods for Linear Algebraic Calculations

16 Large condition number aware logic Hilbert matrix Condition number Large condition number aware logic The primary detector is executed An error occurs, for example: fallback to phase- If a primary alarm occurs, the algorithm performs primary detector in the following iterations If primary alarms occur in every next iteration and r does not decrease secondary alarm The algorithm ends If a primary alarm occurs, the algorithm performs a sensitivity analysis If the sensitivity analysis finds that the result is extremely instable secondary alarm If a secondary alarm occurs, the software restarts from the last basis with modified parameters (enabled scaling, switching to LU decomposition, etc.) In the last resort: The software restarts from the last basis with enhanced precision arithmetic 6 / 2 József Smidla, Péter Tar, István Maros Adaptive Stable Additive Methods for Linear Algebraic Calculations

17 Next steps Linear algebraic kernel Hilbert matrix Condition number Large condition number aware logic We have to integrate the enhanced precision arithmetic to the Pannon Optimizer We have to integrate the large condition number recognizer algorithm The large condition number recognizer can be accelerated with low-level optimization (SIMD architecture) Our goal: Implement a solver which runs fast on the stable problems, but recognizes the excessively numerical instable problems Switches to more precise arithmetic, and solves this problems too 7 / 2 József Smidla, Péter Tar, István Maros Adaptive Stable Additive Methods for Linear Algebraic Calculations

18 Linear algebraic kernel Stable dot product Primary large condition number detector CPU: Intel Core i5-320m, 2.50 GHz Vector lengths: 0 5 Dot product operations repeated 0 5 times 35,00 30,00 28,83 25,00 time [sec] 20,00 5,00 0,00 0,94 8,82 0,55 5,00 0,00 naive conditional jump SSE2 AVX 8 / 2 József Smidla, Péter Tar, István Maros Adaptive Stable Additive Methods for Linear Algebraic Calculations

19 Stable dot product Stable dot product Primary large condition number detector CPU: Intel Core i5-320m, 2.50 GHz Vector lengths: 0 6 Dot product operations repeated 0 4 times 70,00 63,35 60,00 50,00 time [sec] 40,00 30,00 20,00 0,00 2,3 2,35 6,76 0,0 0,00 naive conditional jump pointer arithmetic SSE2 AVX 9 / 2 József Smidla, Péter Tar, István Maros Adaptive Stable Additive Methods for Linear Algebraic Calculations

20 Primary large condition number detector Stable dot product Primary large condition number detector Output of the detector: r = max{ α, ᾱ } min{ α, ᾱ } δ = r Problem 25FV47.MPS STOCFOR3.MPS PILOT.MPS MAROS-R7.MPS Value of δ after the last iteration e e e e-0 Hilbert 7* Hilbert 8* Hilbert 20* Hilbert 26* Hilbert 00* / 2 József Smidla, Péter Tar, István Maros Adaptive Stable Additive Methods for Linear Algebraic Calculations

21 Stable dot product Primary large condition number detector Thank you for your attention! This publication/research has been supported by the European Union and Hungary and co-financed by the European Social Fund through the project TÁMOP C-//KONV National Research Center for Development and Market Introduction of Advanced Information and Communication Technologies. 2 / 2 József Smidla, Péter Tar, István Maros Adaptive Stable Additive Methods for Linear Algebraic Calculations

A numerically adaptive implementation of the simplex method

A numerically adaptive implementation of the simplex method A numerically adaptive implementation of the simplex method József Smidla, Péter Tar, István Maros Department of Computer Science and Systems Technology University of Pannonia 17th of December 2014. 1

More information

DNA Data and Program Representation. Alexandre David 1.2.05 adavid@cs.aau.dk

DNA Data and Program Representation. Alexandre David 1.2.05 adavid@cs.aau.dk DNA Data and Program Representation Alexandre David 1.2.05 adavid@cs.aau.dk Introduction Very important to understand how data is represented. operations limits precision Digital logic built on 2-valued

More information

Solution of Linear Systems

Solution of Linear Systems Chapter 3 Solution of Linear Systems In this chapter we study algorithms for possibly the most commonly occurring problem in scientific computing, the solution of linear systems of equations. We start

More information

7. LU factorization. factor-solve method. LU factorization. solving Ax = b with A nonsingular. the inverse of a nonsingular matrix

7. LU factorization. factor-solve method. LU factorization. solving Ax = b with A nonsingular. the inverse of a nonsingular matrix 7. LU factorization EE103 (Fall 2011-12) factor-solve method LU factorization solving Ax = b with A nonsingular the inverse of a nonsingular matrix LU factorization algorithm effect of rounding error sparse

More information

The mathematics of RAID-6

The mathematics of RAID-6 The mathematics of RAID-6 H. Peter Anvin 1 December 2004 RAID-6 supports losing any two drives. The way this is done is by computing two syndromes, generally referred P and Q. 1 A quick

More information

Operation Count; Numerical Linear Algebra

Operation Count; Numerical Linear Algebra 10 Operation Count; Numerical Linear Algebra 10.1 Introduction Many computations are limited simply by the sheer number of required additions, multiplications, or function evaluations. If floating-point

More information

Numerical Matrix Analysis

Numerical Matrix Analysis Numerical Matrix Analysis Lecture Notes #10 Conditioning and / Peter Blomgren, blomgren.peter@gmail.com Department of Mathematics and Statistics Dynamical Systems Group Computational Sciences Research

More information

Floating-point control in the Intel compiler and libraries or Why doesn t my application always give the expected answer?

Floating-point control in the Intel compiler and libraries or Why doesn t my application always give the expected answer? Floating-point control in the Intel compiler and libraries or Why doesn t my application always give the expected answer? Software Solutions Group Intel Corporation 2012 *Other brands and names are the

More information

How To Write A Hexadecimal Program

How To Write A Hexadecimal Program The mathematics of RAID-6 H. Peter Anvin First version 20 January 2004 Last updated 20 December 2011 RAID-6 supports losing any two drives. syndromes, generally referred P and Q. The way

More information

A Simultaneous Solution for General Linear Equations on a Ring or Hierarchical Cluster

A Simultaneous Solution for General Linear Equations on a Ring or Hierarchical Cluster Acta Technica Jaurinensis Vol. 3. No. 1. 010 A Simultaneous Solution for General Linear Equations on a Ring or Hierarchical Cluster G. Molnárka, N. Varjasi Széchenyi István University Győr, Hungary, H-906

More information

Hardware-Aware Analysis and. Presentation Date: Sep 15 th 2009 Chrissie C. Cui

Hardware-Aware Analysis and. Presentation Date: Sep 15 th 2009 Chrissie C. Cui Hardware-Aware Analysis and Optimization of Stable Fluids Presentation Date: Sep 15 th 2009 Chrissie C. Cui Outline Introduction Highlights Flop and Bandwidth Analysis Mehrstellen Schemes Advection Caching

More information

SOLVING LINEAR SYSTEMS

SOLVING LINEAR SYSTEMS SOLVING LINEAR SYSTEMS Linear systems Ax = b occur widely in applied mathematics They occur as direct formulations of real world problems; but more often, they occur as a part of the numerical analysis

More information

Sources: On the Web: Slides will be available on:

Sources: On the Web: Slides will be available on: C programming Introduction The basics of algorithms Structure of a C code, compilation step Constant, variable type, variable scope Expression and operators: assignment, arithmetic operators, comparison,

More information

THE NAS KERNEL BENCHMARK PROGRAM

THE NAS KERNEL BENCHMARK PROGRAM THE NAS KERNEL BENCHMARK PROGRAM David H. Bailey and John T. Barton Numerical Aerodynamic Simulations Systems Division NASA Ames Research Center June 13, 1986 SUMMARY A benchmark test program that measures

More information

Next Generation GPU Architecture Code-named Fermi

Next Generation GPU Architecture Code-named Fermi Next Generation GPU Architecture Code-named Fermi The Soul of a Supercomputer in the Body of a GPU Why is NVIDIA at Super Computing? Graphics is a throughput problem paint every pixel within frame time

More information

Numerical Methods I Eigenvalue Problems

Numerical Methods I Eigenvalue Problems Numerical Methods I Eigenvalue Problems Aleksandar Donev Courant Institute, NYU 1 donev@courant.nyu.edu 1 Course G63.2010.001 / G22.2420-001, Fall 2010 September 30th, 2010 A. Donev (Courant Institute)

More information

ECE 0142 Computer Organization. Lecture 3 Floating Point Representations

ECE 0142 Computer Organization. Lecture 3 Floating Point Representations ECE 0142 Computer Organization Lecture 3 Floating Point Representations 1 Floating-point arithmetic We often incur floating-point programming. Floating point greatly simplifies working with large (e.g.,

More information

Divide: Paper & Pencil. Computer Architecture ALU Design : Division and Floating Point. Divide algorithm. DIVIDE HARDWARE Version 1

Divide: Paper & Pencil. Computer Architecture ALU Design : Division and Floating Point. Divide algorithm. DIVIDE HARDWARE Version 1 Divide: Paper & Pencil Computer Architecture ALU Design : Division and Floating Point 1001 Quotient Divisor 1000 1001010 Dividend 1000 10 101 1010 1000 10 (or Modulo result) See how big a number can be

More information

FLOATING-POINT ARITHMETIC IN AMD PROCESSORS MICHAEL SCHULTE AMD RESEARCH JUNE 2015

FLOATING-POINT ARITHMETIC IN AMD PROCESSORS MICHAEL SCHULTE AMD RESEARCH JUNE 2015 FLOATING-POINT ARITHMETIC IN AMD PROCESSORS MICHAEL SCHULTE AMD RESEARCH JUNE 2015 AGENDA The Kaveri Accelerated Processing Unit (APU) The Graphics Core Next Architecture and its Floating-Point Arithmetic

More information

Measures of Error: for exact x and approximation x Absolute error e = x x. Relative error r = (x x )/x.

Measures of Error: for exact x and approximation x Absolute error e = x x. Relative error r = (x x )/x. ERRORS and COMPUTER ARITHMETIC Types of Error in Numerical Calculations Initial Data Errors: from experiment, modeling, computer representation; problem dependent but need to know at beginning of calculation.

More information

A Static Analyzer for Large Safety-Critical Software. Considered Programs and Semantics. Automatic Program Verification by Abstract Interpretation

A Static Analyzer for Large Safety-Critical Software. Considered Programs and Semantics. Automatic Program Verification by Abstract Interpretation PLDI 03 A Static Analyzer for Large Safety-Critical Software B. Blanchet, P. Cousot, R. Cousot, J. Feret L. Mauborgne, A. Miné, D. Monniaux,. Rival CNRS École normale supérieure École polytechnique Paris

More information

Binary Number System. 16. Binary Numbers. Base 10 digits: 0 1 2 3 4 5 6 7 8 9. Base 2 digits: 0 1

Binary Number System. 16. Binary Numbers. Base 10 digits: 0 1 2 3 4 5 6 7 8 9. Base 2 digits: 0 1 Binary Number System 1 Base 10 digits: 0 1 2 3 4 5 6 7 8 9 Base 2 digits: 0 1 Recall that in base 10, the digits of a number are just coefficients of powers of the base (10): 417 = 4 * 10 2 + 1 * 10 1

More information

Ridgeway Kite Innova've Technology for Reservoir Engineers A Massively Parallel Architecture for Reservoir Simula'on

Ridgeway Kite Innova've Technology for Reservoir Engineers A Massively Parallel Architecture for Reservoir Simula'on Innova've Technology for Reservoir Engineers A Massively Parallel Architecture for Reservoir Simula'on Garf Bowen 16 th Dec 2013 Summary Introduce RKS Reservoir Simula@on HPC goals Implementa@on Simple

More information

December 4, 2013 MATH 171 BASIC LINEAR ALGEBRA B. KITCHENS

December 4, 2013 MATH 171 BASIC LINEAR ALGEBRA B. KITCHENS December 4, 2013 MATH 171 BASIC LINEAR ALGEBRA B KITCHENS The equation 1 Lines in two-dimensional space (1) 2x y = 3 describes a line in two-dimensional space The coefficients of x and y in the equation

More information

HSL and its out-of-core solver

HSL and its out-of-core solver HSL and its out-of-core solver Jennifer A. Scott j.a.scott@rl.ac.uk Prague November 2006 p. 1/37 Sparse systems Problem: we wish to solve where A is Ax = b LARGE Informal definition: A is sparse if many

More information

Software implementation of Post-Quantum Cryptography

Software implementation of Post-Quantum Cryptography Software implementation of Post-Quantum Cryptography Peter Schwabe Radboud University Nijmegen, The Netherlands October 20, 2013 ASCrypto 2013, Florianópolis, Brazil Part I Optimizing cryptographic software

More information

This Unit: Floating Point Arithmetic. CIS 371 Computer Organization and Design. Readings. Floating Point (FP) Numbers

This Unit: Floating Point Arithmetic. CIS 371 Computer Organization and Design. Readings. Floating Point (FP) Numbers This Unit: Floating Point Arithmetic CIS 371 Computer Organization and Design Unit 7: Floating Point App App App System software Mem CPU I/O Formats Precision and range IEEE 754 standard Operations Addition

More information

Chapter 07: Instruction Level Parallelism VLIW, Vector, Array and Multithreaded Processors. Lesson 05: Array Processors

Chapter 07: Instruction Level Parallelism VLIW, Vector, Array and Multithreaded Processors. Lesson 05: Array Processors Chapter 07: Instruction Level Parallelism VLIW, Vector, Array and Multithreaded Processors Lesson 05: Array Processors Objective To learn how the array processes in multiple pipelines 2 Array Processor

More information

DEFERRED IMAGE PROCESSING IN INTEL IPP LIBRARY

DEFERRED IMAGE PROCESSING IN INTEL IPP LIBRARY DEFERRED IMAGE PROCESSING IN INTEL IPP LIBRARY Alexander Kibkalo (alexander.kibkalo@intel.com), Michael Lotkov (michael.lotkov@intel.com), Ignat Rogozhkin (ignat.rogozhkin@intel.com), Alexander Turovets

More information

Lecture 3: Finding integer solutions to systems of linear equations

Lecture 3: Finding integer solutions to systems of linear equations Lecture 3: Finding integer solutions to systems of linear equations Algorithmic Number Theory (Fall 2014) Rutgers University Swastik Kopparty Scribe: Abhishek Bhrushundi 1 Overview The goal of this lecture

More information

Binary search tree with SIMD bandwidth optimization using SSE

Binary search tree with SIMD bandwidth optimization using SSE Binary search tree with SIMD bandwidth optimization using SSE Bowen Zhang, Xinwei Li 1.ABSTRACT In-memory tree structured index search is a fundamental database operation. Modern processors provide tremendous

More information

Quantum Computing and Grover s Algorithm

Quantum Computing and Grover s Algorithm Quantum Computing and Grover s Algorithm Matthew Hayward January 14, 2015 1 Contents 1 Motivation for Study of Quantum Computing 3 1.1 A Killer App for Quantum Computing.............. 3 2 The Quantum Computer

More information

Faculty of Engineering Student Number:

Faculty of Engineering Student Number: Philadelphia University Student Name: Faculty of Engineering Student Number: Dept. of Computer Engineering Final Exam, First Semester: 2012/2013 Course Title: Microprocessors Date: 17/01//2013 Course No:

More information

64-Bit versus 32-Bit CPUs in Scientific Computing

64-Bit versus 32-Bit CPUs in Scientific Computing 64-Bit versus 32-Bit CPUs in Scientific Computing Axel Kohlmeyer Lehrstuhl für Theoretische Chemie Ruhr-Universität Bochum March 2004 1/25 Outline 64-Bit and 32-Bit CPU Examples

More information

Embedded Systems. Review of ANSI C Topics. A Review of ANSI C and Considerations for Embedded C Programming. Basic features of C

Embedded Systems. Review of ANSI C Topics. A Review of ANSI C and Considerations for Embedded C Programming. Basic features of C Embedded Systems A Review of ANSI C and Considerations for Embedded C Programming Dr. Jeff Jackson Lecture 2-1 Review of ANSI C Topics Basic features of C C fundamentals Basic data types Expressions Selection

More information

IMPROVING PERFORMANCE OF RANDOMIZED SIGNATURE SORT USING HASHING AND BITWISE OPERATORS

IMPROVING PERFORMANCE OF RANDOMIZED SIGNATURE SORT USING HASHING AND BITWISE OPERATORS Volume 2, No. 3, March 2011 Journal of Global Research in Computer Science RESEARCH PAPER Available Online at www.jgrcs.info IMPROVING PERFORMANCE OF RANDOMIZED SIGNATURE SORT USING HASHING AND BITWISE

More information

Integer Computation of Image Orthorectification for High Speed Throughput

Integer Computation of Image Orthorectification for High Speed Throughput Integer Computation of Image Orthorectification for High Speed Throughput Paul Sundlie Joseph French Eric Balster Abstract This paper presents an integer-based approach to the orthorectification of aerial

More information

Precision & Performance: Floating Point and IEEE 754 Compliance for NVIDIA GPUs

Precision & Performance: Floating Point and IEEE 754 Compliance for NVIDIA GPUs Precision & Performance: Floating Point and IEEE 754 Compliance for NVIDIA GPUs Nathan Whitehead Alex Fit-Florea ABSTRACT A number of issues related to floating point accuracy and compliance are a frequent

More information

Direct Methods for Solving Linear Systems. Matrix Factorization

Direct Methods for Solving Linear Systems. Matrix Factorization Direct Methods for Solving Linear Systems Matrix Factorization Numerical Analysis (9th Edition) R L Burden & J D Faires Beamer Presentation Slides prepared by John Carroll Dublin City University c 2011

More information

Lattice QCD Performance. on Multi core Linux Servers

Lattice QCD Performance. on Multi core Linux Servers Lattice QCD Performance on Multi core Linux Servers Yang Suli * Department of Physics, Peking University, Beijing, 100871 Abstract At the moment, lattice quantum chromodynamics (lattice QCD) is the most

More information

Factoring Quadratic Expressions

Factoring Quadratic Expressions Factoring the trinomial ax 2 + bx + c when a = 1 A trinomial in the form x 2 + bx + c can be factored to equal (x + m)(x + n) when the product of m x n equals c and the sum of m + n equals b. (Note: the

More information

CS3220 Lecture Notes: QR factorization and orthogonal transformations

CS3220 Lecture Notes: QR factorization and orthogonal transformations CS3220 Lecture Notes: QR factorization and orthogonal transformations Steve Marschner Cornell University 11 March 2009 In this lecture I ll talk about orthogonal matrices and their properties, discuss

More information

Performance Evaluation of NAS Parallel Benchmarks on Intel Xeon Phi

Performance Evaluation of NAS Parallel Benchmarks on Intel Xeon Phi Performance Evaluation of NAS Parallel Benchmarks on Intel Xeon Phi ICPP 6 th International Workshop on Parallel Programming Models and Systems Software for High-End Computing October 1, 2013 Lyon, France

More information

Intel 64 and IA-32 Architectures Software Developer s Manual

Intel 64 and IA-32 Architectures Software Developer s Manual Intel 64 and IA-32 Architectures Software Developer s Manual Volume 1: Basic Architecture NOTE: The Intel 64 and IA-32 Architectures Software Developer's Manual consists of seven volumes: Basic Architecture,

More information

Review Jeopardy. Blue vs. Orange. Review Jeopardy

Review Jeopardy. Blue vs. Orange. Review Jeopardy Review Jeopardy Blue vs. Orange Review Jeopardy Jeopardy Round Lectures 0-3 Jeopardy Round $200 How could I measure how far apart (i.e. how different) two observations, y 1 and y 2, are from each other?

More information

Faster Set Intersection with SIMD instructions by Reducing Branch Mispredictions

Faster Set Intersection with SIMD instructions by Reducing Branch Mispredictions Faster Set Intersection with SIMD instructions by Reducing Branch Mispredictions Hiroshi Inoue, Moriyoshi Ohara, and Kenjiro Taura IBM Research Tokyo, University of Tokyo {inouehrs, ohara}@jp.ibm.com,

More information

Chapter One Introduction to Programming

Chapter One Introduction to Programming Chapter One Introduction to Programming 1-1 Algorithm and Flowchart Algorithm is a step-by-step procedure for calculation. More precisely, algorithm is an effective method expressed as a finite list of

More information

Elemental functions: Writing data-parallel code in C/C++ using Intel Cilk Plus

Elemental functions: Writing data-parallel code in C/C++ using Intel Cilk Plus Elemental functions: Writing data-parallel code in C/C++ using Intel Cilk Plus A simple C/C++ language extension construct for data parallel operations Robert Geva robert.geva@intel.com Introduction Intel

More information

Fault Tolerant Matrix-Matrix Multiplication: Correcting Soft Errors On-Line.

Fault Tolerant Matrix-Matrix Multiplication: Correcting Soft Errors On-Line. Fault Tolerant Matrix-Matrix Multiplication: Correcting Soft Errors On-Line Panruo Wu, Chong Ding, Longxiang Chen, Teresa Davies, Christer Karlsson, and Zizhong Chen Colorado School of Mines November 13,

More information

High-Performance Modular Multiplication on the Cell Processor

High-Performance Modular Multiplication on the Cell Processor High-Performance Modular Multiplication on the Cell Processor Joppe W. Bos Laboratory for Cryptologic Algorithms EPFL, Lausanne, Switzerland joppe.bos@epfl.ch 1 / 19 Outline Motivation and previous work

More information

FUZZY CLUSTERING ANALYSIS OF DATA MINING: APPLICATION TO AN ACCIDENT MINING SYSTEM

FUZZY CLUSTERING ANALYSIS OF DATA MINING: APPLICATION TO AN ACCIDENT MINING SYSTEM International Journal of Innovative Computing, Information and Control ICIC International c 0 ISSN 34-48 Volume 8, Number 8, August 0 pp. 4 FUZZY CLUSTERING ANALYSIS OF DATA MINING: APPLICATION TO AN ACCIDENT

More information

Contrôle dynamique de méthodes d approximation

Contrôle dynamique de méthodes d approximation Contrôle dynamique de méthodes d approximation Fabienne Jézéquel Laboratoire d Informatique de Paris 6 ARINEWS, ENS Lyon, 7-8 mars 2005 F. Jézéquel Dynamical control of approximation methods 7-8 Mar. 2005

More information

FAST INVERSE SQUARE ROOT

FAST INVERSE SQUARE ROOT FAST INVERSE SQUARE ROOT CHRIS LOMONT Abstract. Computing reciprocal square roots is necessary in many applications, such as vector normalization in video games. Often, some loss of precision is acceptable

More information

Fast Exponential Computation on SIMD Architectures

Fast Exponential Computation on SIMD Architectures , HiPEAC15 - WAPCO, Amsterdam (NL) Fast Exponential Computation on SIMD Architectures A. Cristiano I. MALOSSI, Yves INEICHEN, Costas BEKAS and Alessandro CURIONI @ IBM Research Zurich, Switzerland Motivations

More information

Introduction GPU Hardware GPU Computing Today GPU Computing Example Outlook Summary. GPU Computing. Numerical Simulation - from Models to Software

Introduction GPU Hardware GPU Computing Today GPU Computing Example Outlook Summary. GPU Computing. Numerical Simulation - from Models to Software GPU Computing Numerical Simulation - from Models to Software Andreas Barthels JASS 2009, Course 2, St. Petersburg, Russia Prof. Dr. Sergey Y. Slavyanov St. Petersburg State University Prof. Dr. Thomas

More information

DATA ANALYSIS II. Matrix Algorithms

DATA ANALYSIS II. Matrix Algorithms DATA ANALYSIS II Matrix Algorithms Similarity Matrix Given a dataset D = {x i }, i=1,..,n consisting of n points in R d, let A denote the n n symmetric similarity matrix between the points, given as where

More information

Phys4051: C Lecture 2 & 3. Comment Statements. C Data Types. Functions (Review) Comment Statements Variables & Operators Branching Instructions

Phys4051: C Lecture 2 & 3. Comment Statements. C Data Types. Functions (Review) Comment Statements Variables & Operators Branching Instructions Phys4051: C Lecture 2 & 3 Functions (Review) Comment Statements Variables & Operators Branching Instructions Comment Statements! Method 1: /* */! Method 2: // /* Single Line */ //Single Line /* This comment

More information

Lecture 3. Optimising OpenCL performance

Lecture 3. Optimising OpenCL performance Lecture 3 Optimising OpenCL performance Based on material by Benedict Gaster and Lee Howes (AMD), Tim Mattson (Intel) and several others. - Page 1 Agenda Heterogeneous computing and the origins of OpenCL

More information

Evaluation of CUDA Fortran for the CFD code Strukti

Evaluation of CUDA Fortran for the CFD code Strukti Evaluation of CUDA Fortran for the CFD code Strukti Practical term report from Stephan Soller High performance computing center Stuttgart 1 Stuttgart Media University 2 High performance computing center

More information

Intro to GPU computing. Spring 2015 Mark Silberstein, 048661, Technion 1

Intro to GPU computing. Spring 2015 Mark Silberstein, 048661, Technion 1 Intro to GPU computing Spring 2015 Mark Silberstein, 048661, Technion 1 Serial vs. parallel program One instruction at a time Multiple instructions in parallel Spring 2015 Mark Silberstein, 048661, Technion

More information

Notes on Factoring. MA 206 Kurt Bryan

Notes on Factoring. MA 206 Kurt Bryan The General Approach Notes on Factoring MA 26 Kurt Bryan Suppose I hand you n, a 2 digit integer and tell you that n is composite, with smallest prime factor around 5 digits. Finding a nontrivial factor

More information

7 Gaussian Elimination and LU Factorization

7 Gaussian Elimination and LU Factorization 7 Gaussian Elimination and LU Factorization In this final section on matrix factorization methods for solving Ax = b we want to take a closer look at Gaussian elimination (probably the best known method

More information

Pexip Speeds Videoconferencing with Intel Parallel Studio XE

Pexip Speeds Videoconferencing with Intel Parallel Studio XE 1 Pexip Speeds Videoconferencing with Intel Parallel Studio XE by Stephen Blair-Chappell, Technical Consulting Engineer, Intel Over the last 18 months, Pexip s software engineers have been optimizing Pexip

More information

FMA Implementations of the Compensated Horner Scheme

FMA Implementations of the Compensated Horner Scheme Reliable Implementation of Real Number Algorithms: Theory and Practice Dagstuhl Seminar 06021 January 08-13, 2006 FMA Implementations of the Compensated Horner Scheme Stef GRAILLAT, Philippe LANGLOIS and

More information

IMAGE SIGNAL PROCESSING PERFORMANCE ON 2 ND GENERATION INTEL CORE MICROARCHITECTURE PRESENTATION PETER CARLSTON, EMBEDDED & COMMUNICATIONS GROUP

IMAGE SIGNAL PROCESSING PERFORMANCE ON 2 ND GENERATION INTEL CORE MICROARCHITECTURE PRESENTATION PETER CARLSTON, EMBEDDED & COMMUNICATIONS GROUP IMAGE SIGNAL PROCESSING PERFORMANCE ON 2 ND GENERATION INTEL CORE MICROARCHITECTURE PRESENTATION PETER CARLSTON, EMBEDDED & COMMUNICATIONS GROUP Q3 2011 325877-001 1 Legal Notices and Disclaimers INFORMATION

More information

A new binary floating-point division algorithm and its software implementation on the ST231 processor

A new binary floating-point division algorithm and its software implementation on the ST231 processor 19th IEEE Symposium on Computer Arithmetic (ARITH 19) Portland, Oregon, USA, June 8-10, 2009 A new binary floating-point division algorithm and its software implementation on the ST231 processor Claude-Pierre

More information

Binary Division. Decimal Division. Hardware for Binary Division. Simple 16-bit Divider Circuit

Binary Division. Decimal Division. Hardware for Binary Division. Simple 16-bit Divider Circuit Decimal Division Remember 4th grade long division? 43 // quotient 12 521 // divisor dividend -480 41-36 5 // remainder Shift divisor left (multiply by 10) until MSB lines up with dividend s Repeat until

More information

General Framework for an Iterative Solution of Ax b. Jacobi s Method

General Framework for an Iterative Solution of Ax b. Jacobi s Method 2.6 Iterative Solutions of Linear Systems 143 2.6 Iterative Solutions of Linear Systems Consistent linear systems in real life are solved in one of two ways: by direct calculation (using a matrix factorization,

More information

Implementation of Canny Edge Detector of color images on CELL/B.E. Architecture.

Implementation of Canny Edge Detector of color images on CELL/B.E. Architecture. Implementation of Canny Edge Detector of color images on CELL/B.E. Architecture. Chirag Gupta,Sumod Mohan K cgupta@clemson.edu, sumodm@clemson.edu Abstract In this project we propose a method to improve

More information

Study of a neural network-based system for stability augmentation of an airplane

Study of a neural network-based system for stability augmentation of an airplane Study of a neural network-based system for stability augmentation of an airplane Author: Roger Isanta Navarro Annex 3 ANFIS Network Development Supervisors: Oriol Lizandra Dalmases Fatiha Nejjari Akhi-Elarab

More information

5 INTEGER LINEAR PROGRAMMING (ILP) E. Amaldi Fondamenti di R.O. Politecnico di Milano 1

5 INTEGER LINEAR PROGRAMMING (ILP) E. Amaldi Fondamenti di R.O. Politecnico di Milano 1 5 INTEGER LINEAR PROGRAMMING (ILP) E. Amaldi Fondamenti di R.O. Politecnico di Milano 1 General Integer Linear Program: (ILP) min c T x Ax b x 0 integer Assumption: A, b integer The integrality condition

More information

Determining the Optimal Combination of Trial Division and Fermat s Factorization Method

Determining the Optimal Combination of Trial Division and Fermat s Factorization Method Determining the Optimal Combination of Trial Division and Fermat s Factorization Method Joseph C. Woodson Home School P. O. Box 55005 Tulsa, OK 74155 Abstract The process of finding the prime factorization

More information

Vector and Matrix Norms

Vector and Matrix Norms Chapter 1 Vector and Matrix Norms 11 Vector Spaces Let F be a field (such as the real numbers, R, or complex numbers, C) with elements called scalars A Vector Space, V, over the field F is a non-empty

More information

GPU Hardware Performance. Fall 2015

GPU Hardware Performance. Fall 2015 Fall 2015 Atomic operations performs read-modify-write operations on shared or global memory no interference with other threads for 32-bit and 64-bit integers (c. c. 1.2), float addition (c. c. 2.0) using

More information

Floating Point Fused Add-Subtract and Fused Dot-Product Units

Floating Point Fused Add-Subtract and Fused Dot-Product Units Floating Point Fused Add-Subtract and Fused Dot-Product Units S. Kishor [1], S. P. Prakash [2] PG Scholar (VLSI DESIGN), Department of ECE Bannari Amman Institute of Technology, Sathyamangalam, Tamil Nadu,

More information

High-speed image processing algorithms using MMX hardware

High-speed image processing algorithms using MMX hardware High-speed image processing algorithms using MMX hardware J. W. V. Miller and J. Wood The University of Michigan-Dearborn ABSTRACT Low-cost PC-based machine vision systems have become more common due to

More information

OpenMP and Performance

OpenMP and Performance Dirk Schmidl IT Center, RWTH Aachen University Member of the HPC Group schmidl@itc.rwth-aachen.de IT Center der RWTH Aachen University Tuning Cycle Performance Tuning aims to improve the runtime of an

More information

1 The Brownian bridge construction

1 The Brownian bridge construction The Brownian bridge construction The Brownian bridge construction is a way to build a Brownian motion path by successively adding finer scale detail. This construction leads to a relatively easy proof

More information

Chapter 7D The Java Virtual Machine

Chapter 7D The Java Virtual Machine This sub chapter discusses another architecture, that of the JVM (Java Virtual Machine). In general, a VM (Virtual Machine) is a hypothetical machine (implemented in either hardware or software) that directly

More information

3. INNER PRODUCT SPACES

3. INNER PRODUCT SPACES . INNER PRODUCT SPACES.. Definition So far we have studied abstract vector spaces. These are a generalisation of the geometric spaces R and R. But these have more structure than just that of a vector space.

More information

APPM4720/5720: Fast algorithms for big data. Gunnar Martinsson The University of Colorado at Boulder

APPM4720/5720: Fast algorithms for big data. Gunnar Martinsson The University of Colorado at Boulder APPM4720/5720: Fast algorithms for big data Gunnar Martinsson The University of Colorado at Boulder Course objectives: The purpose of this course is to teach efficient algorithms for processing very large

More information

SmartArrays and Java Frequently Asked Questions

SmartArrays and Java Frequently Asked Questions SmartArrays and Java Frequently Asked Questions What are SmartArrays? A SmartArray is an intelligent multidimensional array of data. Intelligent means that it has built-in knowledge of how to perform operations

More information

Introducing PgOpenCL A New PostgreSQL Procedural Language Unlocking the Power of the GPU! By Tim Child

Introducing PgOpenCL A New PostgreSQL Procedural Language Unlocking the Power of the GPU! By Tim Child Introducing A New PostgreSQL Procedural Language Unlocking the Power of the GPU! By Tim Child Bio Tim Child 35 years experience of software development Formerly VP Oracle Corporation VP BEA Systems Inc.

More information

Integrating Benders decomposition within Constraint Programming

Integrating Benders decomposition within Constraint Programming Integrating Benders decomposition within Constraint Programming Hadrien Cambazard, Narendra Jussien email: {hcambaza,jussien}@emn.fr École des Mines de Nantes, LINA CNRS FRE 2729 4 rue Alfred Kastler BP

More information

Learn CUDA in an Afternoon: Hands-on Practical Exercises

Learn CUDA in an Afternoon: Hands-on Practical Exercises Learn CUDA in an Afternoon: Hands-on Practical Exercises Alan Gray and James Perry, EPCC, The University of Edinburgh Introduction This document forms the hands-on practical component of the Learn CUDA

More information

CSE 6040 Computing for Data Analytics: Methods and Tools

CSE 6040 Computing for Data Analytics: Methods and Tools CSE 6040 Computing for Data Analytics: Methods and Tools Lecture 12 Computer Architecture Overview and Why it Matters DA KUANG, POLO CHAU GEORGIA TECH FALL 2014 Fall 2014 CSE 6040 COMPUTING FOR DATA ANALYSIS

More information

Applications to Computational Financial and GPU Computing. May 16th. Dr. Daniel Egloff +41 44 520 01 17 +41 79 430 03 61

Applications to Computational Financial and GPU Computing. May 16th. Dr. Daniel Egloff +41 44 520 01 17 +41 79 430 03 61 F# Applications to Computational Financial and GPU Computing May 16th Dr. Daniel Egloff +41 44 520 01 17 +41 79 430 03 61 Today! Why care about F#? Just another fashion?! Three success stories! How Alea.cuBase

More information

Intro to scientific programming (with Python) Pietro Berkes, Brandeis University

Intro to scientific programming (with Python) Pietro Berkes, Brandeis University Intro to scientific programming (with Python) Pietro Berkes, Brandeis University Next 4 lessons: Outline Scientific programming: best practices Classical learning (Hoepfield network) Probabilistic learning

More information

A 0.9 0.9. Figure A: Maximum circle of compatibility for position A, related to B and C

A 0.9 0.9. Figure A: Maximum circle of compatibility for position A, related to B and C MEASURING IN WEIGHTED ENVIRONMENTS (Moving from Metric to Order Topology) Claudio Garuti Fulcrum Ingenieria Ltda. claudiogaruti@fulcrum.cl Abstract: This article addresses the problem of measuring closeness

More information

A Constraint Programming based Column Generation Approach to Nurse Rostering Problems

A Constraint Programming based Column Generation Approach to Nurse Rostering Problems Abstract A Constraint Programming based Column Generation Approach to Nurse Rostering Problems Fang He and Rong Qu The Automated Scheduling, Optimisation and Planning (ASAP) Group School of Computer Science,

More information

GPGPU accelerated Computational Fluid Dynamics

GPGPU accelerated Computational Fluid Dynamics t e c h n i s c h e u n i v e r s i t ä t b r a u n s c h w e i g Carl-Friedrich Gauß Faculty GPGPU accelerated Computational Fluid Dynamics 5th GACM Colloquium on Computational Mechanics Hamburg Institute

More information

Haswell Cryptographic Performance

Haswell Cryptographic Performance White Paper Sean Gulley Vinodh Gopal IA Architects Intel Corporation Haswell Cryptographic Performance July 2013 329282-001 Executive Summary The new Haswell microarchitecture featured in the 4 th generation

More information

Using EXCEL Solver October, 2000

Using EXCEL Solver October, 2000 Using EXCEL Solver October, 2000 2 The Solver option in EXCEL may be used to solve linear and nonlinear optimization problems. Integer restrictions may be placed on the decision variables. Solver may be

More information

GPU Accelerated Monte Carlo Simulations and Time Series Analysis

GPU Accelerated Monte Carlo Simulations and Time Series Analysis GPU Accelerated Monte Carlo Simulations and Time Series Analysis Institute of Physics, Johannes Gutenberg-University of Mainz Center for Polymer Studies, Department of Physics, Boston University Artemis

More information

Linear Programming. March 14, 2014

Linear Programming. March 14, 2014 Linear Programming March 1, 01 Parts of this introduction to linear programming were adapted from Chapter 9 of Introduction to Algorithms, Second Edition, by Cormen, Leiserson, Rivest and Stein [1]. 1

More information

1. Convert the following base 10 numbers into 8-bit 2 s complement notation 0, -1, -12

1. Convert the following base 10 numbers into 8-bit 2 s complement notation 0, -1, -12 C5 Solutions 1. Convert the following base 10 numbers into 8-bit 2 s complement notation 0, -1, -12 To Compute 0 0 = 00000000 To Compute 1 Step 1. Convert 1 to binary 00000001 Step 2. Flip the bits 11111110

More information

Solving Linear Systems of Equations. Gerald Recktenwald Portland State University Mechanical Engineering Department gerry@me.pdx.

Solving Linear Systems of Equations. Gerald Recktenwald Portland State University Mechanical Engineering Department gerry@me.pdx. Solving Linear Systems of Equations Gerald Recktenwald Portland State University Mechanical Engineering Department gerry@me.pdx.edu These slides are a supplement to the book Numerical Methods with Matlab:

More information

Virtual Landmarks for the Internet

Virtual Landmarks for the Internet Virtual Landmarks for the Internet Liying Tang Mark Crovella Boston University Computer Science Internet Distance Matters! Useful for configuring Content delivery networks Peer to peer applications Multiuser

More information

ANALYSIS OF RSA ALGORITHM USING GPU PROGRAMMING

ANALYSIS OF RSA ALGORITHM USING GPU PROGRAMMING ANALYSIS OF RSA ALGORITHM USING GPU PROGRAMMING Sonam Mahajan 1 and Maninder Singh 2 1 Department of Computer Science Engineering, Thapar University, Patiala, India 2 Department of Computer Science Engineering,

More information