Fault Tolerant Matrix-Matrix Multiplication: Correcting Soft Errors On-Line.


Fault Tolerant Matrix-Matrix Multiplication: Correcting Soft Errors On-Line

Panruo Wu, Chong Ding, Longxiang Chen, Teresa Davies, Christer Karlsson, and Zizhong Chen
Colorado School of Mines
November 13, 2011

Errors

Soft errors
- We focus on fail-continue soft errors
- Soft errors have already been reported in tera-scale systems

How to tolerate errors
- Triple Modular Redundancy (TMR)
- Algorithm Based Fault Tolerance (ABFT)

We designed a novel ABFT algorithm for DGEMM that is
- on-line: our algorithm checks the validity of partial results during the computation
- low-overhead: compared with high-performance non-fault-tolerant DGEMM

Related Work

Triple Modular Redundancy (TMR)
- General technique; requires a factor of 2 or 3 additional hardware; big overhead

Traditional ABFT
- Specific algorithm for every operation; very little overhead
- Offline: checks only at the end of the computation, thus cannot prevent errors from propagating

Online ABFT
- Extension of traditional ABFT; very little overhead
- Online: checks the validity of the computation in the middle of the computation; more flexible and reliable

Traditional ABFT mm (matrix multiplication)

In 1984 Huang & Abraham proposed the ABFT mm algorithm. The idea is: to compute $C = AB$,
- encode $A$, $B$ into checksum matrices $A^c$, $B^r$
- do mm on the encoded matrices: $C^f = A^c B^r$
- verify the full checksum relationship of $C^f$

$$A^c = \begin{pmatrix} a_{11} & \cdots & a_{1n} \\ \vdots & & \vdots \\ a_{n1} & \cdots & a_{nn} \\ \sum_{i=1}^{n} a_{i1} & \cdots & \sum_{i=1}^{n} a_{in} \end{pmatrix}, \qquad B^r = \begin{pmatrix} b_{11} & \cdots & b_{1n} & \sum_{j=1}^{n} b_{1j} \\ \vdots & & \vdots & \vdots \\ b_{n1} & \cdots & b_{nn} & \sum_{j=1}^{n} b_{nj} \end{pmatrix}$$
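
A minimal NumPy sketch of the encoding step (our illustration, not the paper's implementation, which targets high-performance DGEMM; the helper names are ours):

    import numpy as np

    def column_checksum(A):
        # Column checksum matrix A^c: append a row of column sums to A.
        return np.vstack([A, A.sum(axis=0)])

    def row_checksum(B):
        # Row checksum matrix B^r: append a column of row sums to B.
        return np.hstack([B, B.sum(axis=1, keepdims=True)])

    # C^f = column_checksum(A) @ row_checksum(B) is then a full checksum
    # matrix, whatever mm algorithm is used to form the product.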

It turns out that the result of $C^f = A^c B^r$ (whatever mm algorithm is used) is a full checksum matrix:

$$C^f = \begin{pmatrix} c_{11} & \cdots & c_{1n} & \sum_{j=1}^{n} c_{1j} \\ \vdots & & \vdots & \vdots \\ c_{m1} & \cdots & c_{mn} & \sum_{j=1}^{n} c_{mj} \\ \sum_{i=1}^{m} c_{i1} & \cdots & \sum_{i=1}^{m} c_{in} & \sum_{i=1}^{m} \sum_{j=1}^{n} c_{ij} \end{pmatrix}$$

If the full checksum relationship does not hold, then the mm procedure is faulty. If only one entry $c_{ij}$ is corrupted, it can be recovered from its row checksum:

$$c_{ij} = \underbrace{\sum_{j=1}^{n} c_{ij}}_{=\;C^f_{i,n+1}} \; - \sum_{k=1,\,k\neq j}^{n} c^f_{ik}$$
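
Detection and single-error correction can be sketched as follows (again our own illustration; verify_and_correct and the fixed scalar tolerance are assumptions, and the question of choosing the tolerance is taken up a few slides later):

    def verify_and_correct(Cf, tol=1e-8):
        # Cf is (m+1) x (n+1); the last row/column hold column/row sums.
        m, n = Cf.shape[0] - 1, Cf.shape[1] - 1
        row_res = Cf[:m, n] - Cf[:m, :n].sum(axis=1)   # row checksum residuals
        col_res = Cf[m, :n] - Cf[:m, :n].sum(axis=0)   # column checksum residuals
        bad_rows = np.where(np.abs(row_res) > tol)[0]
        bad_cols = np.where(np.abs(col_res) > tol)[0]
        if bad_rows.size == 0 and bad_cols.size == 0:
            return True                                # consistent, no error
        if bad_rows.size == 1 and bad_cols.size == 1:
            i, j = bad_rows[0], bad_cols[0]            # error located at (i, j)
            Cf[i, j] = Cf[i, n] - (Cf[i, :n].sum() - Cf[i, j])   # recover c_ij
            return True
        return False                                   # more corruption than one entry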

Online ABFT MM: Motivation

Traditional ABFT only checks the result at the end of the whole computation. What if we could check some partial results in the middle?
- Handle faults in time
- More reliable and flexible

The key problem: is the checksum relationship maintained in the middle of the mm computation?
- For most MM algorithms, NO
- For outer product MM, YES

Outer product mm algorithm (rank-1 update each iteration)

for s = 1 to k do
    C = C + A(:, s) B(s, :)
end for

$$AB = \begin{pmatrix} A_1 & A_2 & \cdots & A_n \end{pmatrix} \begin{pmatrix} B_1 \\ B_2 \\ \vdots \\ B_n \end{pmatrix} = A_1 B_1 + \cdots + A_n B_n$$

where $A_s$ denotes the $s$-th column of $A$ and $B_s$ the $s$-th row of $B$.
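
The same loop in NumPy, for concreteness (a straightforward sketch; a real DGEMM would use a blocked, vectorized implementation):

    def outer_product_mm(A, B):
        # C = A @ B accumulated as k rank-1 updates (outer products).
        k = A.shape[1]
        C = np.zeros((A.shape[0], B.shape[1]))
        for s in range(k):
            C += np.outer(A[:, s], B[s, :])   # C += A(:, s) * B(s, :)
        return C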

Theorem. If the outer product version of mm is used to compute $A^c B^r$, and we denote the partial result $C$ at the end of iteration $s$ by $C^f_s$, then $C^f_s$ ($s = 1, \ldots, k$) preserves the full checksum relationship.

Proof: Let $A_s := A(1:n, 1:s)$ and $B_s := B(1:s, 1:n)$. Then

$$C_s = A^c_s B^r_s = \begin{pmatrix} A_s \\ e^T A_s \end{pmatrix} \begin{pmatrix} B_s & B_s e \end{pmatrix} = (A_s B_s)^f$$

where $e$ is the vector of all ones.

We now have up to $n$ opportunities to tolerate faults during one mm execution. In practice, we can run the FT routine every several iterations to achieve high performance.
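
Putting the pieces together, an on-line FT multiply might look like the following sketch (our composition of the earlier helpers; check_every and the error handling are illustrative choices, not the paper's tuned implementation):

    def online_ft_mm(A, B, check_every=64, tol=1e-8):
        # Encode inputs, then run outer product mm, verifying partial results.
        Ac, Br = column_checksum(A), row_checksum(B)
        k = Ac.shape[1]
        Cf = np.zeros((Ac.shape[0], Br.shape[1]))
        for s in range(k):
            Cf += np.outer(Ac[:, s], Br[s, :])
            # By the theorem, every partial Cf is a full checksum matrix,
            # so it can be verified (and repaired) before errors propagate.
            if (s + 1) % check_every == 0 or s == k - 1:
                if not verify_and_correct(Cf, tol):
                    raise RuntimeError("uncorrectable fault near iteration %d" % (s + 1))
        return Cf[:-1, :-1]   # strip the checksum row and column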

Differentiating soft errors from roundoff errors

Computers do floating point calculations in finite precision, so the checksum relationship cannot hold exactly due to roundoff errors. We need a threshold to differentiate soft errors from roundoff errors.

How to choose the threshold?
- Too large a threshold may hide errors
- Too small a threshold may interrupt correct computation

A classic result from matrix analysis (from Matrix Computations)

Theorem. For outer product matrix multiplication $C = AB$, the floating point result $fl(AB)$ satisfies

$$|fl(AB) - AB| \le \gamma_k |A|\,|B| =: \lambda$$

where $\gamma_n = \frac{nu}{1 - nu}$ and $u$ is the unit roundoff error of the machine.

We begin by assuming the computations are correct, i.e. $C^f = A^c B^r$. As a convention, the floating point version of a variable has a hat over the corresponding name. Then we have

$$\left| \sum_{j=1}^{n} \hat{c}_{ij} - \hat{c}_{i,n+1} \right| \le \left| fl(C^f) - C^f \right| \le \gamma_n |A^c|\,|B^r| =: \lambda$$
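
The bound translates directly into an entrywise detection threshold. A sketch, under the assumption that any checksum residual above $\lambda$ is attributed to a soft error rather than roundoff (the helper name roundoff_threshold is ours):

    def roundoff_threshold(Ac, Br):
        # lambda = gamma_n * (|A^c| |B^r|), entrywise, for IEEE double precision.
        n = Ac.shape[1]                      # inner dimension of the product
        u = np.finfo(np.float64).eps / 2     # unit roundoff u
        gamma_n = n * u / (1 - n * u)
        return gamma_n * (np.abs(Ac) @ np.abs(Br))

In practice the fixed scalar tolerance used in the earlier sketches would be replaced by comparing each checksum residual against the corresponding entry of this matrix.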

Performance Analysis

The time to generate the row and column checksums for the input matrices (with $\gamma$ the time per floating point operation) is

$$T_{encode} = 4N^2\gamma \qquad (1)$$

Because the matrix size increases, the computation overhead is

$$T_{comp} = \left[2(N+1) \cdot N \cdot (N+1) - 2N^3\right]\gamma = (4N^2 + 2N)\gamma \approx 4N^2\gamma \qquad (2)$$

If the program is to tolerate $m$ errors, the overhead of detection is

$$T_{detect} = 2mN^2\gamma \qquad (3)$$
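
To put these costs in perspective: the multiplication itself costs $2N^3\gamma$, so by (1) and (2) the encoding and extra-computation overheads are each roughly a fraction $4N^2\gamma / 2N^3\gamma = 2/N$ of the total; for $N = 10{,}000$ that is $0.02\%$, and by (3) detection adds only about $m/N$ more. This is why the measured overhead in the following slides is negligible.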

Experimental Results

- Performance and overhead of on-line FTGEMM (Fault Tolerant General Matrix Multiply)
- Performance comparison between on-line FTGEMM, ABFT, and TMR
- Performance comparison between on-line FTGEMM and ATLAS

We show tests done on the Alamode machines provided by CCIT at the Colorado School of Mines. The CPU on Alamode is an Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz with a theoretical peak of 13.60 GFLOPS. For simplicity we test on square matrices of size 10,000 and 15,000.

Special note

The failure rate, or "number of failures/execution," on the x-axis of all figures and text in this section is a design parameter: the maximum number of errors our implementation is able to tolerate during one matrix multiplication. Be aware that this number does not indicate how many errors actually occur; this parameter should be set higher than the expected actual failure rate of the application to ensure reliability.

Approx. DGEMM (by ATLAS) runtime: 147s

[Figure: Overhead of on-line FTGEMM (matrix size: 10000) on ALAMODE. Stacked bars of runtime (seconds, 0-4) vs. number of failures/executions (2-20), broken down into overhead of computation, error recovery, error detection, and building the checksum matrices.]

Approx. DGEMM (by ATLAS) runtime: 147s

[Figure: Performance of different strategies (matrix size: 10000) on ALAMODE. Runtime (seconds, 0-500) vs. number of failures/executions (0-2) for on-line FTGEMM(D), ABFT, and TMR.]

Approx. DGEMM (by ATLAS) runtime: 147s

[Figure: Performance at different failure rates (matrix size: 10000) on ALAMODE. Performance (GFLOPS, 0-14) vs. number of failures/executions (2-20) for ATLAS DGEMM and on-line FTGEMM (double precision).]

Conclusion

In this paper we extended the traditional ABFT matrix multiplication technique from off-line to on-line. In our approach, soft errors are detected, located, and corrected in the middle of the computation in a timely manner, before they propagate. Experimental results demonstrate that the proposed technique can correct one error every ten seconds with negligible performance penalty over ATLAS DGEMM().

Thank you

That's all. Thank you!