Tolerating SEU Faults in the Raw Architecture




Karandeep Singh #, Adnan Agbaria +*, Dong-In Kang #, and Matthew French #
# USC Information Sciences Institute, Arlington VA, USA. Email: {karan, dkang, mfrench}@isi.edu
+ IBM Haifa Research Lab, Mount Carmel, Haifa 31905, Israel. Email: adnan@il.ibm.com

Abstract

This paper describes software fault tolerance techniques to mitigate SEU (single event upset) faults in the Raw architecture, a single-chip parallel tiled computing architecture. The techniques we use are efficient checkpointing and rollback of processor state, breakpointing, selective replication of code, and selective duplication of tiles. These techniques can be implemented entirely in software, without any changes to the architecture; they are transparent to the user and are designed to meet the run-time performance and throughput requirements of the system. We illustrate them by mitigating SEU faults in a matrix multiply kernel mapped onto Raw. The proposed techniques are also applicable to other tiled architectures, and to parallel systems in general.

1. Introduction

Multi-core tiled computing architectures are becoming increasingly popular because of their good performance, throughput, and power efficiency. Multiple cores also provide an inherent redundancy that enables better fault tolerance and recovery. Raw, developed at the Massachusetts Institute of Technology [Taylor02], is a general-purpose tiled parallel processing architecture with a compiler-programmable interconnection network. The current Raw chip has 16 processor tiles, and the design can scale to a larger number of tiles on future chips. The interconnection network extends off-chip, allowing large fabrics of Raw chips to be built.

To mitigate faults on Raw, we periodically checkpoint the processor state to off-chip stable storage. The state to be checkpointed includes the data caches, data in memory, registers, and the state of the on-chip networks.
Two-level asynchronous and synchronous checkpoints are taken to improve performance. Selective replication of code and selective duplication of tiles are used for fault/failure detection. Breakpoints are inserted in the code after every replication or duplication to compare the corresponding states and detect faults. Our analysis shows that these techniques mitigate a high percentage of SEU (single event upset) faults in Raw with no VLSI or architectural modifications.

2. Raw Architecture

The Raw architecture consists of 16 processing tiles connected in a mesh using two types of high-performance pipelined networks: static and dynamic. The Raw chip is implemented in IBM's 180 nm, 1.8 V, 6-layer CMOS 7SF (SA-27E) copper process and has a peak throughput of 6.8 GFLOPS at 425 MHz. Each tile contains two processors: a MIPS-style compute processor with 32 KB each of data and instruction cache, and a switch processor with 64 KB of instruction cache. The compute processor has an 8-stage, in-order, single-issue MIPS-style processing pipeline and a 4-stage single-precision pipelined FPU.

Two of the interconnecting networks are static 2-D point-to-point mesh networks, optimized to route single-word quantities of data (without any headers); these routes are determined at compile time. There are also two dynamically routed networks: the general dynamic network is used for data transfer among tiles, while the memory dynamic network is used to access off-chip memory [Taylor02]. A block diagram of the Raw architecture is shown in Figure 1. Raw's exposed ISA allows parallel applications to exploit all of the chip's resources, including gates, wires, and pins, and it performs very well across a large class of stream and embedded computing applications [Taylor04].
Compared to a 180 nm Pentium-III using commodity PC memory system components, Raw performs within a factor of 2x for sequential applications with a very low degree of ILP, about 2x to 9x better for higher levels of ILP, and 10x to 100x better when highly parallel applications are hand-coded and optimized [Taylor04].

* Work performed at USC/ISI.

Figure 1: Raw architecture (source: [Taylor04])

3. Fault Tolerance Techniques

In this work we assume that an SEU fault [NVSR02] can happen at any time. We would like to mitigate SEUs with minimal modification to the Raw architecture and, preferably, without any hardware modification. We therefore consider and adopt software fault tolerance techniques that are (1) transparent to the user and (2) efficient enough to meet performance requirements. We mainly use a combination of the following techniques for fault detection and tolerance: selective replication, selective duplication, checkpoint/restart, breakpoints, and TMR.

A breakpoint (BP) is used in Raw for error detection. The compiler (or user) inserts BPs into the program code so that errors can be detected. The location of the BPs depends on the fault detection technique being applied; for example, as shown later, we insert BPs after every code duplication and every selective replication.

In selective replication (SR) [GS96], the compiler and/or user selects code to be replicated. During execution, the selected code runs simultaneously on two different tiles in Raw. Because of the replication, some synchronization may be required to ensure total ordering of message delivery [Birman97]. Although replication is commonly used to provide fault tolerance, we use SR for failure detection in Raw: we trace the states of the two replicas, and BPs inserted after the SR detect any SEU that may have occurred in one of them.

In selective duplication (SD), the compiler and/or user selects instructions to be duplicated in the code. After each duplication, a BP is inserted to detect any SEU affecting the instruction. In both cases, the BPs check for a possible SEU after the duplicated or replicated code. SR is more expensive than SD in terms of resources and the cost of comparing the states of the replicas at a BP.
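The SD-plus-BP pattern can be sketched as follows (a hedged Python illustration; the function run_with_duplication and the simulated fault below are our own constructs, whereas the paper applies duplication at the instruction level via the compiler):

```python
def run_with_duplication(compute, *args):
    """Selective duplication (SD) sketch: execute the selected computation
    twice and compare the results at a breakpoint (BP)-style check."""
    first = compute(*args)    # original instruction(s)
    second = compute(*args)   # duplicated instruction(s)
    if first != second:       # BP: the two results disagree -> an SEU occurred
        raise RuntimeError("SEU detected by duplication breakpoint")
    return first

# Simulate a transient fault in the second of the two executions:
calls = {"n": 0}
def flaky_add(x, y):
    calls["n"] += 1
    bit_flip = 1 if calls["n"] == 2 else 0   # SEU hits only the 2nd run
    return x + y + bit_flip
```

Here a fault-free computation passes the BP unchanged, while a single upset in either execution makes the comparison fail and triggers recovery (in the paper, a rollback to the last checkpoint).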
We therefore use SD wherever it can detect SEUs without requiring synchronization with other tiles. Although SD cannot detect an SEU in the networks, it can detect one in a register. A heuristic function, based on a cost/effectiveness tradeoff, determines whether a given piece of code should be replicated or duplicated.

We use checkpoint/restart (C/R) to provide fault tolerance in Raw. C/R is a standard way to provide persistence and fault tolerance in both uniprocessor and distributed systems [EAWJ02]. Checkpointing is the act of saving an application's state to stable storage during its execution, while restart is the act of restarting the application from a checkpointed state. To recover a tile from a failure in the Raw architecture, we must checkpoint the state of the tile, which consists of the data caches, data in memory, registers, and the state on the networks. During a checkpoint, all of these states must be saved for each tile. Figure 2 summarizes what needs to be checkpointed.
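A minimal per-tile sketch of this C/R scheme, combined with the log-based incremental checkpointing introduced with Figure 3, might look as follows (a hedged model: the register/network fields, the dict-based log, and the pickle-based stand-in for off-chip stable storage are illustrative assumptions, not the paper's implementation):

```python
import pickle

class TileCheckpointer:
    """Per-tile checkpoint/restart sketch: the state saved (registers,
    network state, and a log of modified data structures) mirrors the
    paper's description; the data model is our own illustration."""

    def __init__(self):
        self.registers = {"pc": 0, "r1": 0}
        self.network_state = []   # in-flight words on the on-chip networks
        self.log = {}             # data structures recorded by Log calls
        self.stable_storage = []  # stand-in for off-chip stable storage

    def log_call(self, name, value):
        # Log(a): record data structure a after its last modification.
        self.log[name] = value

    def chkpt(self):
        # Chkpt(): save registers, network state, and the log; flush the log.
        state = pickle.dumps((self.registers, self.network_state, self.log))
        self.stable_storage.append(state)
        self.log = {}

    def restart(self):
        # Restore the most recent checkpointed state after a failure.
        self.registers, self.network_state, self.log = pickle.loads(
            self.stable_storage[-1])

tile = TileCheckpointer()
tile.registers["r1"] = 7
tile.log_call("a", [1, 2, 3])
tile.chkpt()
tile.registers["r1"] = 999   # an SEU corrupts state after the checkpoint
tile.restart()               # roll back to the checkpointed state
```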

Figure 2: Checkpointing technique in Raw (the tile state saved at a checkpoint: the network buffers, the registers, and the log file; logs are saved during execution)

To achieve an efficient C/R mechanism, we applied a new application-based incremental checkpointing approach. With this approach, during a checkpoint each tile saves only the following: all the registers, the state of the networks, and a log file. The log file is created during execution and is flushed at every checkpoint; it implements our incremental checkpointing technique. Instead of using page faults or copy-on-write as presented in [Plank97], we use a new technique in which the compiler identifies all the data structures modified between two consecutive checkpoints and inserts the corresponding Log calls. A Log call writes the modified data structure to the log file. Specifically, for every data structure a, we log a after its last modification and before the next checkpoint. Using compiler-based analysis, we can identify all the data structures that need to be logged between every two consecutive checkpoint calls [Muchnick97]. Figure 3 presents an example in which the compiler inserts Log calls for the modified data structures.

Chkpt()    // Checkpoint #i
a = f(..)
b = g(..)
a = ...    // New modification of a
a = ...    // Last modification of a
Log(a)
Chkpt()    // Checkpoint #i+1

Figure 3: An example of using Log and Chkpt calls in the code

The Raw architecture usually runs multi-task applications in which every tile runs a task, and the tasks communicate with each other over the on-chip networks. The problem of checkpoint and restart is more complicated in distributed settings such as Raw, where an application is composed of several processes, each possibly running on a different processor. To restart such an application, one has to choose a collection of checkpoints, one from each process, that corresponds to a consistent distributed application state.
A distributed application state is inconsistent if it represents a situation in which some message m has been received by some process, but the event of sending m is not in the checkpoint collection. A collection of checkpoints that corresponds to a consistent distributed state forms a recovery line. As illustrated in Figure 4, if a failure occurs when no collection of checkpoints taken during the execution forms a recovery line, the application must be restarted from the beginning, regardless of the number of checkpoints already taken. This domino effect was identified in [Randell75] as the reason a recovery line may not exist in an execution with checkpoints; a recovery line is not guaranteed unless special care is taken.

To guarantee recovery lines, in this work we adopt the d-bc (d-bounded cycle) distributed checkpointing protocol presented in [AAFV04], a two-level checkpointing protocol. In Raw, every tile may take a checkpoint independently; we denote such a local checkpoint by CL1. To avoid the domino effect, however, we also force all the tiles to coordinate their checkpoints to create a consistent global checkpoint, denoted CL2. Since CL1 is local and requires no synchronization, it is cheaper than CL2 in terms of size and overhead and is taken more frequently.
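The consistency condition behind a recovery line can be sketched as a simple check (our own illustration; the event-set model of checkpoints and messages is an assumption made for clarity):

```python
def forms_recovery_line(checkpoints, messages):
    """Check whether a collection of per-process checkpoints is consistent,
    i.e. forms a recovery line: no message may appear as received in a
    checkpoint unless its send event is covered by the sender's checkpoint.

    checkpoints: {process: set of event ids covered by its checkpoint}
    messages:    list of (send_event, sender, recv_event, receiver)
    """
    for send_ev, sender, recv_ev, receiver in messages:
        if recv_ev in checkpoints[receiver] and send_ev not in checkpoints[sender]:
            return False   # orphan message: received but never "sent"
    return True

msgs = [("s1", "T1", "r1", "T2")]
inconsistent = {"T1": set(), "T2": {"r1"}}   # receive saved, send is not
consistent = {"T1": {"s1"}, "T2": {"r1"}}    # both events are covered
```

In the inconsistent case, rolling back to these checkpoints would leave T2 holding a message that, from T1's restored point of view, was never sent, which is exactly the situation Figure 4 depicts.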

Figure 4: Inconsistent checkpoints due to message exchanges (timelines of tiles T1 and T2, showing messages, checkpoints, and a failure X)

Figure 5 presents a running example of our fault tolerance techniques in Raw, combining SR, SD, BPs, and C/R. In this example, tile T7 implements SR for T1, and similarly T5 implements SR for T2. After every SR there are BPs to check for possible errors. The BPs and checkpoints are inserted in the code during compilation of the application, so each tile is aware of the techniques it applies during execution.

Figure 5: An example of our fault tolerance techniques in Raw (tiles T1 and T2 with replicas T7 and T5, SD regions with joins, breakpoints, and global, consistent CL1 and CL2 checkpoints)

4. Analysis and a Sample Application

We define reliability as the percentage of time that an application can run without resetting the system on an SEU. We derive reliability analytically, using the area of each functional component of the processor and the possible effects of an SEU fault on that area. Here, we present an implementation of fault-tolerant (FT) matrix multiplication on the Raw processor and derive the reliability of the implementation.

The estimated areas of the functional components of a Raw tile are shown in Figure 6. A tile consists of a tile processor, estimated to occupy 60.1% of the tile's area, and a switch processor, estimated to occupy 39.9%.

Component        Area (%)   Error (%)   Result (%)
Icache           14.75      A*          3.042
Icache control    0.05      100         0.05
Dcache           14.75      0           0
Dcache control    0.05      50          0.025
FPU              15         0           0
ALU              15         0           0
Fetch Unit        0.05      100         0.05
GPR               0.20      0           0
SPR               0.20      50          0.1
Event Counters    0.05      0           0
Total            60.1                   3.267

(a) Tile processor

Component          Area (%)   Error (%)   Result (%)
Switch Icache      26         B**         6.5
Control             0.15      100         0.15
Switch Processor   12         0           0
SN                  0.20      100         0.2
Data SN             0.05      100         0.05
DN                  0.50      10          0.05
Data DN             0.25      25          0.063
MN                  0.50      0           0
Data MN             0.25      0           0
Total              39.9                   7.013

(b) Switch processor

Figure 6: Estimated area information of a Raw tile and its fault susceptibility when fault-tolerant matrix multiplication runs on it. (*: 50% of the cache is assumed prone to SEUs at any given time; of that, 33% of the area is assumed to be occupied by instructions and 66% by operands, with 5% of the instruction area and 25% of the operand area assumed critical. **: About 50% of the cache is assumed to be filled, and 50% of that is assumed to be critical.)

An implementation of FT matrix multiplication using our fault tolerance techniques is shown in Figure 7. The base algorithm multiplies input matrices A and B and produces the result matrix C in a streaming fashion. Columns of matrix B are read by the tiles in the upper row (tiles 1 and 2 in Figure 7); rows of matrix A are read by the tiles in the leftmost column (tiles 4, 8, and 12); the result matrix C is collected by the tiles in the rightmost column, except the one in the upper row (tiles 7, 11, and 15); and the computation is done by the remaining tiles (tiles 5, 6, 9, 10, 13, and 14).

A different FT technique is applied to each part of the algorithm and mapping. Temporal triple modular redundancy (TMR) is applied to the input and output parts, which makes single-SEU correction possible for the inputs; the overhead of temporal TMR at the input nodes is justified by the longer computation time of the computation nodes. Code duplication and local checkpoint-and-rollback are used for the computation nodes. Since the inputs are protected by software TMR, an SEU on a computation node is well confined within that node, and the local checkpoint and rollback technique can tolerate it.
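Temporal TMR at the input and output tiles can be sketched as a majority vote over three repeated executions of the same operation (a hedged illustration; read_input and the simulated upset are our own stand-ins for an input-tile operation):

```python
from collections import Counter

def temporal_tmr(read_input, *args):
    """Temporal TMR sketch: perform the same operation three times,
    at different points in time on the same tile, and majority-vote
    the results, so a single SEU during one execution is corrected."""
    results = [read_input(*args) for _ in range(3)]
    value, votes = Counter(results).most_common(1)[0]
    if votes < 2:
        raise RuntimeError("uncorrectable: all three results differ")
    return value

# A single upset in one of the three reads is outvoted:
readings = iter([5, 5, 7])   # the third read is hit by a simulated SEU
corrected = temporal_tmr(lambda: next(readings))
```

Unlike the detection-only SD/BP scheme, this corrects a single upset outright, which is why the paper uses it on the input and output streams that other tiles depend on.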
System monitoring processes, which are not used by the base algorithm, are replicated on tiles 0 and 3 for better protection.

Function            Tiles                  FT Technique
Column Input        1, 2                   Temporal TMR
Row Input           4, 8, 12               Temporal TMR
Computation         5, 6, 9, 10, 13, 14    Code Duplication
Result Output       7, 11, 15              Temporal TMR
System Monitoring   0, 3                   Code Replication

Figure 7: Mapping of FT matrix multiplication on the Raw processor: (a) the 4x4 grid of tiles 0-15; (b) the techniques used

The reliability of FT matrix multiplication on Raw is estimated using the hardware area information shown in Figure 6, and the effect of an SEU on each functional component is estimated conservatively. For example, an SEU in the instruction cache control logic is always (100%) assumed to lead to a system reset. The percentage of system resets caused by an SEU in each functional component is presented in the rows titled Error in Figures 6(a) and (b). The overall percentage of system resets due to an SEU in a functional component

is presented in the rows titled Result in Figures 6(a) and (b). Based on these assumptions and hardware estimates, summing the Result totals for the tile processor (3.267%) and the switch processor (7.013%) gives a 10.28% chance that an SEU forces a system reset, so the reliability of FT matrix multiplication is 100% - 10.28% = 89.72%.

The performance of FT matrix multiplication is determined by the performance of the computation nodes. Since code duplication is used for the computation algorithm, we expect the performance in terms of FLOPS to be slightly less than half that of the base algorithm.

5. Conclusion

We present software fault tolerance techniques to mitigate SEU faults in the Raw architecture. These fault detection and mitigation techniques can be implemented entirely in software and require no hardware changes to the architecture. We demonstrate them on a sample application (matrix multiplication) and present analytical evaluations of the performance and reliability of our techniques.

6. Acknowledgements

Effort sponsored through the Department of the Interior National Business Center under grant number NBCH1050022. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the U.S. Government.

References

[AAFV04] A. Agbaria, H. Attiya, R. Friedman, and R. Vitenberg. Quantifying Rollback Propagation in Distributed Checkpointing. Journal of Parallel and Distributed Computing, 64(3):370-384, March 2004.

[Birman97] K. P. Birman. Building Secure and Reliable Network Applications. Manning Publishing Company and Prentice Hall, December 1996.

[EAWJ02] E. N. Elnozahy, L. Alvisi, Y. M. Wang, and D. B. Johnson. A Survey of Rollback-Recovery Protocols in Message-Passing Systems. ACM Computing Surveys, 34(3):375-408, September 2002.

[GS96] R. Guerraoui and A. Schiper. Fault-Tolerance by Replication in Distributed Systems. In Reliable Software Technologies - Ada-Europe'96, pp. 38-57, 1996.

[Muchnick97] S. S. Muchnick. Advanced Compiler Design and Implementation. Morgan Kaufmann Publishers, 1997.

[NVSR02] B. Nicolescu, R. Velazco, M. Sonza Reorda, M. Rebaudengo, and M. Violante. A Software Fault Tolerance Method for Safety-Critical Systems: Effectiveness and Drawbacks. In Proceedings of the 15th IEEE Symposium on Integrated Circuits and Systems Design, pp. 101-106, Porto Alegre, Brazil, September 2002.

[Plank97] J. S. Plank. An Overview of Checkpointing in Uniprocessor and Distributed Systems, Focusing on Implementation and Performance. Technical Report UT-CS-97-372, Department of Computer Science, University of Tennessee, July 1997.

[Randell75] B. Randell. System Structure for Software Fault Tolerance. IEEE Transactions on Software Engineering, SE-1:220-232, June 1975.

[Taylor02] Michael B. Taylor, Jason Kim, Jason Miller, David Wentzlaff, Fae Ghodrat, Ben Greenwald, Henry Hoffmann, Paul Johnson, Jae-Wook Lee, Walter Lee, Albert Ma, Arvind Saraf, Mark Seneski, Nathan Shnidman, Volker Strumpen, Matt Frank, Saman Amarasinghe, and Anant Agarwal. The Raw Microprocessor: A Computational Fabric for Software Circuits and General Purpose Programs. IEEE Micro, March/April 2002.

[Taylor04] Michael Bedford Taylor, Walter Lee, Jason Miller, David Wentzlaff, Ian Bratt, Ben Greenwald, Henry Hoffmann, Paul Johnson, Jason Kim, James Psota, Arvind Saraf, Nathan Shnidman, Volker Strumpen, Matt Frank, Saman Amarasinghe, and Anant Agarwal. Evaluation of the Raw Microprocessor: An Exposed-Wire-Delay Architecture for ILP and Streams. In Proceedings of the International Symposium on Computer Architecture, June 2004.