Erasure Coding for Cloud Communication and Storage



Cheng Huang, Jin Li (Microsoft Research)
Tutorial at IEEE ICC (June, 2014)


Erasure coding has become a key technology in realizing the vision, with a focus on improving network performance and reducing cloud storage cost.


Erasure Coding 101


[Figure: encoding and decoding]


Replication: data a=2, b=3; stored fragments a=2, b=3, a=2, b=3.

Simple parity: data a=2, b=3; stored fragments a=2, b=3, a+b=5.
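The simple parity slide can be sketched in a few lines. This is a minimal illustration using ordinary integer addition as on the slide; the function names `encode_parity` and `recover` are hypothetical.

```python
def encode_parity(a, b):
    """Store both values plus their sum as a single parity fragment."""
    return {"a": a, "b": b, "parity": a + b}

def recover(fragments):
    """Recover (a, b) from any two of the three stored fragments."""
    a = fragments.get("a")
    b = fragments.get("b")
    p = fragments.get("parity")
    if a is None:
        a = p - b          # lost a: subtract b from the parity
    if b is None:
        b = p - a          # lost b: subtract a from the parity
    return a, b

stored = encode_parity(2, 3)
without_a = {k: v for k, v in stored.items() if k != "a"}
print(recover(without_a))  # (2, 3)
```

Three fragments protect two values against any single loss, compared with four fragments under replication.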


(n = 4, k = 2) code: data a=2, b=3; stored fragments a=2, b=3, a + b, a + 2×b (finite-field arithmetic).

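Addition table:
+ | 0 1 2 3
0 | 0 1 2 3
1 | 1 0 3 2
2 | 2 3 0 1
3 | 3 2 1 0

Multiplication table:
× | 0 1 2 3
0 | 0 0 0 0
1 | 0 1 2 3
2 | 0 2 3 1
3 | 0 3 1 2

With a=2, b=3: a + b = 1 and a + 2×b = 3.
Decoding examples: given a=2 and a + b = 1, solve for b; given a + b = 1 and a + 2×b = 3, solve for both a and b.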
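The two tables on this slide are the addition and multiplication tables of the 4-element field GF(4). The sketch below encodes (a, b) into the four fragments and decodes from any two of them by brute force over the 16 candidate pairs. The names `COEFFS`, `encode` and `decode` are illustrative, and a real decoder would solve the linear system rather than search.

```python
# GF(4) tables as shown on the slide: addition is XOR, multiplication is a lookup.
ADD = [[0, 1, 2, 3], [1, 0, 3, 2], [2, 3, 0, 1], [3, 2, 1, 0]]
MUL = [[0, 0, 0, 0], [0, 1, 2, 3], [0, 2, 3, 1], [0, 3, 1, 2]]

# Each stored fragment is c0*a + c1*b for one coefficient pair (c0, c1).
COEFFS = {"a": (1, 0), "b": (0, 1), "a+b": (1, 1), "a+2b": (1, 2)}

def encode(a, b):
    return {name: ADD[MUL[c0][a]][MUL[c1][b]]
            for name, (c0, c1) in COEFFS.items()}

def decode(known):
    """Find the unique (a, b) consistent with the surviving fragments."""
    matches = [(a, b) for a in range(4) for b in range(4)
               if all(encode(a, b)[name] == value
                      for name, value in known.items())]
    assert len(matches) == 1, "need two distinct fragments"
    return matches[0]

frags = encode(2, 3)
print(frags)                                 # {'a': 2, 'b': 3, 'a+b': 1, 'a+2b': 3}
print(decode({"a+b": 1, "a+2b": 3}))         # (2, 3)
```

Any two of the four fragments suffice because every pair of coefficient rows is linearly independent over GF(4).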

[Figure: the four stored fragments a, b, a + b and a + 2×b]


Erasure coding has been around for decades. The purpose of this tutorial is to cover innovative designs and applications of erasure coding in recent years that address needs from new application scenarios.

Erasure Coding in Cloud-Based Social Gaming

How to ensure a universally smooth gaming experience? Improve the tail!


[Figure: interaction gap]

Many messages arriving late

[Figure: latency (ms) at the 95%, 99% and 99.9% percentiles; US/CAN & Europe only]
Imagine what's next: open to all markets, launch on mobile.


[Figure: packet timeline with S and R counters at 0, RTT, 2RTT and 3RTT]



[Figure: latency (ms) of TCP vs. Pangolin at the 95%, 99% and 99.9% percentiles; annotation: 60%]
Pangolin overhead: only 6.1%!

Latency: end-to-end delay/ping (e.g. 100 ms)
Packet loss: burst or random
Limited bandwidth: e.g. <2 Mbps vs. 100 Mbps for LAN

Effect of Packet Loss on Real-Time Multimedia Communication
[Figure: transmission error over time; reconstructed video frame subjected to packet loss]

RemoteFX for WAN
[Figure: block diagram with Application, Rate Control, Transport Protocol, Congestion Control, Network, Feedback, Original Packets, Coded Packets and Transmission Strategy]
Key components:
Estimate the channel condition (packet loss probability) and use a cost function to determine whether to send i) an original packet, ii) an FEC packet or iii) a resent packet.
Use a random linear code (network coding) to mix packets.
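Mixing packets with a random linear code can be sketched as follows. The slide does not name the field or the packet format, so this sketch assumes GF(2) (plain XOR) and integer payloads; `rlc_encode` and `rlc_decode` are hypothetical names, not RemoteFX APIs.

```python
import random

def rlc_encode(packets, rng):
    """Return (coefficient bitmask, XOR of the packets selected by the mask)."""
    mask = 0
    while mask == 0:                       # skip the useless all-zero combination
        mask = rng.getrandbits(len(packets))
    payload = 0
    for i, p in enumerate(packets):
        if (mask >> i) & 1:
            payload ^= p
    return mask, payload

def rlc_decode(coded, k):
    """Gauss-Jordan elimination over GF(2); None until k innovative packets arrive."""
    pivots = {}                            # pivot bit -> (mask, payload)
    for mask, payload in coded:
        for bit, (pm, pp) in pivots.items():
            if (mask >> bit) & 1:          # remove known pivots from the new row
                mask, payload = mask ^ pm, payload ^ pp
        if mask == 0:
            continue                       # linearly dependent: not innovative
        bit = mask.bit_length() - 1
        for b, (pm, pp) in list(pivots.items()):
            if (pm >> bit) & 1:            # clear the new pivot from older rows
                pivots[b] = (pm ^ mask, pp ^ payload)
        pivots[bit] = (mask, payload)
    if len(pivots) < k:
        return None
    return [pivots[i][1] for i in range(k)]

originals = [0x11, 0x22, 0x33, 0x44]
rng = random.Random(7)
received, decoded = [], None
while decoded is None:
    received.append(rlc_encode(originals, rng))
    decoded = rlc_decode(received, len(originals))
print(decoded == originals, len(received))
```

The receiver does not care which coded packets arrive, only that enough linearly independent ones do, which is what makes such codes resilient to loss.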

Design Goal
Minimize sequential decodability delay: the time from when a packet is sent to the time when that packet and all previous packets are available.
It is a good indicator of user-perceived performance.
Example: packets A, B and C are sent and B is lost. The delay for B is much larger because of the retransmission (resend B). The delay for C is larger than the propagation delay alone because C has to wait for B.
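The definition above can be turned into a small calculation. The timings below are hypothetical (50 ms one-way delay, packets sent 10 ms apart, B's retransmission arriving at 160 ms) and the function name is illustrative.

```python
def sequential_decodability_delays(send_times, arrival_times):
    """delay[i] = time when packets 0..i are all available, minus send time of i."""
    delays = []
    all_available = float("-inf")
    for sent, arrived in zip(send_times, arrival_times):
        # A packet is usable only once it and every earlier packet have arrived.
        all_available = max(all_available, arrived)
        delays.append(all_available - sent)
    return delays

send = [0, 10, 20]        # A, B, C (ms)
arrive = [50, 160, 70]    # B is lost once and arrives after retransmission
print(sequential_decodability_delays(send, arrive))  # [50, 150, 140]
```

C itself arrives after 50 ms, but its sequential decodability delay is 140 ms because it waits for B.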

Delay
Sequential decodability delay is a function of:
Probability of sequential decodability, which is itself a function of the channel loss characteristics (loss increases with network congestion) and of the transmission strategy (which packets have been sent, original or FEC, and the coding structure of the FEC packets). Delay is caused by the coding structure as well as by retransmissions.
Propagation delay.
Network queuing delay, which increases with network congestion.

How to Minimize Delay: Loss Prevention and Loss Mitigation
1. Don't cause self-induced congestion: this minimizes packet loss and network queuing delay.
2. Use a flow control strategy that does not induce loss: e.g. TCP enters the congestion avoidance phase only when it encounters loss, and once loss has already happened, you suffer.
3. Maximize the transmission rate: aggressive ramp-up and state remembering after a burst.
4. Use FEC to proactively correct any remaining losses.

Components of Sequential Decodability Delay
[Equation: expected value of the delay for packet l, annotated with the terms below]
Propagation + network queuing delay: the time it takes a packet to travel from sender to receiver.
Probability of sequential decodability of packet l, based upon the transmitted packets up to k.
Time between transmission opportunities: 1/transmission rate.

Minimizing Components of Sequential Decodability Delay
Preventing self-induced congestion minimizes the network queuing delay term.
Minimizing loss minimizes the term that depends on the probability of sequential decodability. This is done by using a wise coding strategy and by backing off on rate increase prior to loss.
Maximizing the transmission rate minimizes the time between transmission opportunities: quick ramp-up, and state remembering for the initial point.

Packet Encoding Strategy
Assume these terms are constant: minimizing them is the job of flow/congestion control. Also assume congestion-based loss is at a minimum.
The term to minimize is taken over all packets being considered; for practical reasons, only those currently in the sender's queue are considered.
By considering all packets currently in the queue, balance the gain for packets already sent (obtained by sending FEC) against the delay of original packets waiting in the queue.
A fast algorithm determines the probability given a particular coding structure and probability of loss.

Erasure Coding in P2P Multiparty Conferencing

Multi-party Conferencing Scenario
Every user wants to view audio/video from all other users and is a source of its own audio/video stream.
Maximize Quality-of-Experience (QoE).
Challenges:
Network bandwidth is limited.
Low end-to-end delay is required.
Network conditions are time-varying.
Distributed solution not requiring global network knowledge.
Existing audio/video conferencing products: Apple iChat AV, Halo, TelePresence, Windows Live Messenger.

Comparison of Distribution Approaches
MCU-assisted multicast (e.g. Halo): high load on the MCU, expensive, not scalable with an increasing number of peers or groups.
Simulcast (e.g. Apple iChat AV): as group size and heterogeneity increase, video quality deteriorates due to the peer uplink bandwidth constraint.
Peer-assisted multicast: optimal utilization of each peer's uplink bandwidth; no MCU is required, but one can assist as a helper.

Celebrity (A P2P Multiparty Conferencing with Network Coding)
Objectives:
Stringent end-to-end delay requirement (<200 ms)
Unknown network topology
Limited and unknown network bandwidth
Time-varying network conditions

Celebrity: Overview
Data multicast via a hybrid tree and mixing solution:
The source sends one copy of the content via low-delay spanning trees; all depth-1 and depth-2 trees can be explored.
Each node outputs a mixture of network-coded packets on each link at a certain rate; redundancy is provided by network coding.
Distributed link rate adaptation that collectively maximizes the delay-limited capacities of the sessions:
Critical sessions and links get more resources.
Approach 1: driven by session and link innovation measurement.
Approach 2: driven by link states and critical cut computation.
Responds to link congestion signals (loss and delay), similar to TCP, TFRC and DCCP.

Hybrid Tree + Coding Approach
Distribution trees
+: Propagation delay is known from the tree structures.
-: Updates of the trees need to be done centrally.
How to deal with packet loss? How to find a good set of trees?
Network coding
+: Information will find its way to the sinks; built-in resilience to packet losses.
+: Updates of the link rates can potentially be done in a distributed manner.
How to reduce the decoding delay? How to provide a delay guarantee?
Hybrid tree + coding approach
Best of both worlds. (Tree packets are always sent immediately.)

Network Coding

Experimental Results

Erasure Coding in Video Streaming


coding in network

[Figure: network of peers]

[Figure: applications over time. Basic applications (downloading): file sharing, e.g. Napster, BitTorrent. Advanced applications (streaming): VoIP, e.g. Skype; video streaming.]


[Figure: blocks a, b, c; chunks]

Segment: 4 seconds, 180 KB
Block: 1 KB, 180 blocks/segment
350 kbps video
[Figure: delivered segments, priority region, outside priority region]


Erasure Coding in Cloud Storage

[Figure: trade-off among performance, storage cost and reliability. Goal: good performance at minimum cost.]

                 replication   Reed-Solomon coding
storage          2x            1.5x
reconstruction   1             2
Example: replication stores a=2, b=3 twice and reconstructs a lost fragment from its copy; Reed-Solomon coding stores a=2, b=3 and a+b and reconstructs a lost fragment from the other two.

Failure types: permanent failure; temporary unavailability (90+%), e.g. hot storage nodes, rolling update.
With Reed-Solomon coding (a=2, b=3, a+b), reconstruction is on the critical path and frequent enough.
                 replication   Reed-Solomon coding
storage          2x            1.5x
reconstruction   1             2

High reconstruction cost: inevitable price for erasure coding

[Figure: reconstruction cost vs. storage overhead for Reed-Solomon codes and replication]

Pyramid Codes

Reed-Solomon 12 + 3: 12 data nodes (d1, d2, ..., d12) and 3 parity nodes (C1, C2, C3).
Reconstruction cost: 12

Pyramid Codes
Construction: take an arbitrary Reed-Solomon (RS) code and split one RS parity into multiple local parities.
12 + 3 RS: 12 data nodes (d1, ..., d12) and 3 parity nodes (C1, C2, C3).
12 + 4 Pyramid: C1 is split into C1,1 and C1,2.

Reconstruction cost: 6
d1 d2 d3 d4 d5 d6 C1,1
d7 d8 d9 d10 d11 d12 C1,2
C2 C3

CASE I: recover d5 from C1,1; recover d8 and d12 from C2 and C3.

CASE II: combine C1,1 and C1,2 into C1, which converts the 12 + 4 Pyramid code back to the 12 + 3 RS code; then recover the 3 failures (d8, d11 and d12) in the RS code.
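The split of one parity into two local parities can be illustrated with XOR. This sketch assumes C1 is a plain XOR of the 12 data fragments, which holds for some Reed-Solomon constructions but is not stated on the slides; C2 and C3 need finite-field arithmetic and are left out, and the data values are arbitrary.

```python
from functools import reduce
from operator import xor

data = list(range(101, 113))             # d1..d12, arbitrary example values

c1 = reduce(xor, data)                   # assumed XOR-type parity C1
c1_1 = reduce(xor, data[:6])             # local parity over d1..d6
c1_2 = reduce(xor, data[6:])             # local parity over d7..d12

# CASE II direction: combining the two local parities gives back C1.
assert c1_1 ^ c1_2 == c1

# CASE I direction: rebuild d5 from C1,1 and the 5 surviving group members.
survivors = [d for i, d in enumerate(data[:6]) if i != 4]
d5 = reduce(xor, survivors, c1_1)
print(d5 == data[4], len(survivors) + 1)  # True 6
```

The single-failure repair reads 6 fragments instead of 12, at the cost of one extra parity node.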

[Figure: pyramid layout of d1 ... d12 with parities C1, C2, C3]
Reconstruction cost of d1: 3

Reconstruction cost of d1 and d2: 6

Decoding is analogous to climbing up the pyramid.

[Figure: reconstruction cost vs. storage overhead for Reed-Solomon codes, Pyramid Codes and replication]

Maximal Recoverability

d1 d2 d3 d4 d5 d6 C1,1
d7 d8 d9 d10 d11 d12 C1,2
C2 C3

Decoding Tanner graph
Left: failed data nodes (here d5, d6, d8, d12). Right: surviving parity nodes (here C1,1, C1,2, C2, C3).
Recoverability Theorem: a failure pattern is recoverable only if its decoding Tanner graph contains a full matching.
Here the decoding Tanner graph contains a full matching.

Failed data nodes d1, d2, d5, d6: the decoding Tanner graph contains no full matching.
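The full-matching condition on the two Tanner graph slides can be checked with a standard augmenting-path bipartite matching. The sketch below hard-codes the 12 + 4 Pyramid layout (C1,1 covers d1..d6, C1,2 covers d7..d12, C2 and C3 cover everything) and only considers failed data nodes; the function names are illustrative.

```python
def parity_neighbors(d):
    """Parities that cover data node d (1-based) in the 12 + 4 Pyramid code."""
    local = "C1,1" if d <= 6 else "C1,2"
    return [local, "C2", "C3"]

def has_full_matching(failed_data, surviving=("C1,1", "C1,2", "C2", "C3")):
    match = {}                            # parity -> failed data node using it

    def augment(d, seen):
        for p in parity_neighbors(d):
            if p in surviving and p not in seen:
                seen.add(p)
                # Take a free parity, or re-route the node currently using it.
                if p not in match or augment(match[p], seen):
                    match[p] = d
                    return True
        return False

    return all(augment(d, set()) for d in failed_data)

print(has_full_matching([5, 6, 8, 12]))   # True
print(has_full_matching([1, 2, 5, 6]))    # False
```

In the second pattern all four failures fall in the first group, which is covered by only three parities (C1,1, C2, C3), so no full matching exists.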

First class of MR codes
MR codes in cloud deployment (Windows Azure Storage)

LRC in Windows Azure Storage

Replication: three copies of each sealed extent (3 GB).
Reed-Solomon 6 + 3: d0 d1 d2 d3 d4 d5 p1 p2 p3
Storage overhead: 3x with replication, 1.5x with Reed-Solomon 6 + 3
Reconstruction cost: 6
Reed-Solomon 6 + 3 is used in Google GFS II (as of 2012).

Sealed extent (3 GB)
6 + 3: d0 ... d5, p0 p1 p2; overhead (6+3)/6 = 1.5x
12 + 4: d0 ... d11, p0 p1 p2 p3; overhead (12+4)/12 = 1.33x

12 + 4 (d0 ... d11, p0 ... p3): reconstruction is twice as expensive, requiring 12 fragments (12 disk I/Os, 12 net transfers).

Conventional Reed-Solomon Coding
Sealed extent (3 GB)         Storage Overhead   Reconstruction Cost
6 + 3 (d0 ... d5, p1 p2 p3)  1.5x               6 reads
12 + 4 (d0 ... d11, p1 ... p4)  1.33x           12 reads
LRC

Sealed extent (3 GB): x0 x1 x2 x3 x4 x5 and y0 y1 y2 y3 y4 y5
LRC 12+2+2: 12 data fragments, 2 local parities and 2 global parities
Storage overhead: (12 + 2 + 2) / 12 = 1.33x
Local parity: reconstruction requires only 6 fragments
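The overhead and reconstruction numbers quoted for these codes follow from simple counting, sketched below. The helper names `rs_cost` and `lrc_cost` are hypothetical, and the LRC cost assumes the data fragments are split evenly across the local groups.

```python
def rs_cost(k, m):
    """(storage overhead, fragments read to rebuild one data fragment)."""
    return (k + m) / k, k

def lrc_cost(k, local, global_):
    """LRC with k data fragments split evenly into `local` groups."""
    return (k + local + global_) / k, k // local

codes = {
    "RS 6+3": rs_cost(6, 3),
    "RS 12+4": rs_cost(12, 4),
    "LRC 12+2+2": lrc_cost(12, 2, 2),
}
for name, (overhead, reads) in codes.items():
    print(f"{name}: {overhead:.2f}x, {reads} reads")
# RS 6+3: 1.50x, 6 reads
# RS 12+4: 1.33x, 12 reads
# LRC 12+2+2: 1.33x, 6 reads
```

LRC 12+2+2 matches the storage overhead of RS 12+4 while keeping the single-failure read cost of RS 6+3.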

LRC 12+2+2 reliability: RS 12+4 > LRC 12+2+2 > RS 6+3


[Figure: reconstruction read cost vs. storage overhead for Reed-Solomon (RS 12+4, RS 10+4, RS 6+3) and LRC (LRC 12+2+2, LRC 12+4+2). Annotations: same cost with overhead reduced from 1.5x to 1.33x; same overhead with half the cost (6 to 3).]
RS 10+4: HDFS-RAID at Facebook
RS 6+3: GFS II (Colossus) at Google

RS (6 + 3): reconstruction cost = 6
RS (14 + 4): reconstruction cost = 14
LRC (14 + 2 + 2): reconstruction cost = 7
14% savings: millions of $ savings!
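The slide does not spell out where the 14% comes from. One plausible reading, stated here as an assumption, is the relative reduction in storage overhead when moving from RS (6 + 3) to LRC (14 + 2 + 2):

```latex
\[
\frac{\tfrac{6+3}{6} - \tfrac{14+2+2}{14}}{\tfrac{6+3}{6}}
  = 1 - \frac{18/14}{9/6}
  \approx 1 - \frac{1.286}{1.5}
  \approx 0.143
\]
```

Under this reading the reconstruction cost rises only from 6 to 7 reads, whereas RS (14 + 4) at the same overhead would need 14.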

LRC in Hadoop

10 + 4 Reed-Solomon: x0 ... x9, p0 p1 p2 p3
8% cold
Single failure reconstruction requires 10 fragments (10 disk I/Os, 10 net transfers).

x0 ... x9 with coefficients c0 ... c9; p0 ... p3 with coefficients c0 ... c3
s0: local parity; s1: local parity; s2: implied parity
Add 2 local parities to the existing 10 + 4 RS code.
Choose the c_i carefully so that the implied parity can be derived.
The 4 global parities and the 2 local parities sum to zero.
A single failure of any chunk can be reconstructed from 5 fragments.
14% extra storage for 50% savings in reconstruction.

LRC in Hierarchical Storage

Note: no need to tolerate 2 JBOD failures

New erasure codes designed for multi-level durability requirements can reduce storage space.

[Figure: JBOD enclosures holding x1 x2 x3 x4 px, y1 y2 y3 y4 py and z1 z2 z3 z4 pz, with local parities px, py, pz and global parity q]
Storage overhead: 1.33x (LRC 12+3+1) < 1.5x (RAID6 4+2)
But does LRC indeed tolerate failures of 1 JBOD + 1 HDD?

y3 and z3 are reconstructed using local parities py and pz.
x3 and x4 are then reconstructed using px and global parity q.
Shipped in Windows Server 2012 R2 and Windows 8.1.

PMDS and SD Codes

PMDS Codes (m = 4, n = 7, s = 2, r = 1)
d0  d1  d2  d3  d4  d5  p0
d6  d7  d8  d9  d10 d11 p1
d12 d13 d14 d15 d16 d17 p2
d18 d19 d20 d21 q0  q1  p3
m rows, n columns: n drives, m x n sectors
r row parities in each row
s global parities
Tolerate r failures per row and s additional failures anywhere.

Recoverable case I (n = 7, m = 4, s = 2, r = 1): r = 1 drive (column) failure plus s = 2 additional sector failures anywhere.

Recoverable case II (n = 7, m = 4, s = 2, r = 1): r = 1 failure per row plus s = 2 additional failures anywhere.

Recoverable case II, decoding: d11 and d19 are recoverable from their row parities, leaving 4 parities for the remaining 4 failures, similar to LRC.
PMDS codes are Maximally Recoverable (MR) codes.

[Figure: case I and case II failure patterns]
What if restricting to only case I?

SD Codes (m = 4, n = 7, s = 2, r = 1)
d0  d1  d2  d3  d4  d5  p0
d6  d7  d8  d9  d10 d11 p1
d12 d13 d14 d15 d16 d17 p2
d18 d19 d20 d21 q0  q1  p3
m rows, n columns: n drives, m x n sectors
r row parities in each row
s global parities
Tolerate r column failures and s additional failures anywhere.

SD codes handle case I, but not case II.
There are many constructions which are valid as SD codes, but not PMDS codes.
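The difference between the two tolerance guarantees can be made concrete with pattern checks derived from the definitions on the slides. These functions only test whether a failure pattern falls inside the promised class; they do not perform any decoding, and the names are hypothetical.

```python
from itertools import combinations

def pmds_tolerates(failures, m, r, s):
    """PMDS promise: r failures per row plus s additional failures anywhere."""
    per_row = [0] * m
    for row, _col in failures:
        per_row[row] += 1
    return sum(max(0, count - r) for count in per_row) <= s

def sd_tolerates(failures, n, r, s):
    """SD promise: r whole columns (drives) plus s additional failures anywhere."""
    for cols in combinations(range(n), r):
        outside = [f for f in failures if f[1] not in cols]
        if len(outside) <= s:
            return True
    return False

m, n, r, s = 4, 7, 1, 2
# Case I: drive 2 fails entirely, plus two extra sector failures.
case1 = [(row, 2) for row in range(m)] + [(0, 5), (3, 0)]
# Case II: one failure per row on different drives, plus two extra failures.
case2 = [(0, 0), (1, 1), (2, 2), (3, 3), (0, 4), (1, 5)]

print(pmds_tolerates(case1, m, r, s), sd_tolerates(case1, n, r, s))  # True True
print(pmds_tolerates(case2, m, r, s), sd_tolerates(case2, n, r, s))  # True False
```

Every pattern in the SD class is also in the PMDS class, which is why a code can be a valid SD code without being a PMDS code.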

Efficient Repair of MDS Codes

[Figure: repair example with fragments a1, a2, b1, b2]



Efficient Repair of Existing Codes


~20+% savings in general

Theoretical Bound on Efficient Repair

Efficient repair: 1.83x, 69% savings!

Single Failure Repair of 6 + 6 MDS Code
                                     Reed-Solomon Coding   Regenerating Coding
# of nodes participating in repair   6                     11
# of network transfers               6x                    1.83x
# of disk I/Os                       6x                    up to 11x
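The 1.83x and 69% figures are consistent with the standard cut-set bound for minimum-storage regenerating codes, which the slides do not state explicitly. As a sketch, with n = 12 nodes, k = 6, d = 11 helper nodes and the fragment size normalized to 1, each helper sends 1/(d - k + 1) of a fragment:

```latex
\[
\text{repair traffic} = \frac{d}{d-k+1} = \frac{11}{11-6+1} = \frac{11}{6} \approx 1.83,
\qquad
\text{savings} = 1 - \frac{11/6}{6} = 1 - \frac{11}{36} \approx 0.69
\]
```

Conventional Reed-Solomon repair transfers k = 6 whole fragments, which is the 6x baseline in the table.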

Network transfer: 3 (optimal), disk I/O: 4 (no saving)
[Figure: repair example with fragments a1, a2, b1, b2; one helper XORs two fragments before transmitting]
Regenerating Codes may require more disk I/Os than network transfers. Unfortunately, most RC papers do not discuss the difference!

Simple Regenerating Codes

(n=6, k=4, f=2)-SRC
MDS precode: two (6,4)-RS codes
Placement: node1, node2, node3, node4, node5, node6
Tolerates arbitrary two failures.
Any chunk is recoverable with 2 I/Os.
Overhead: 3/2 * 6/4 = 2.25x

A single failure is recovered efficiently:
2 I/Os for each chunk
6 I/Os in total for all three chunks
Disk I/O = network I/O in repair
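The repair count can be reproduced with a small sketch. The placement rule used here (node i holds x[i], y[i+1] and s[i+2], with s = x XOR y) follows the commonly described simple regenerating code construction and is an assumption, since the slide's placement figure did not survive. Random bytes stand in for the two (6,4)-RS codewords, so the MDS precode itself is not implemented.

```python
import random

N = 6
rng = random.Random(0)
# Stand-ins for the two (6,4)-RS codewords; the MDS precode is omitted.
x = [rng.getrandbits(8) for _ in range(N)]
y = [rng.getrandbits(8) for _ in range(N)]
s = [xi ^ yi for xi, yi in zip(x, y)]      # XOR parity of the two codewords

def node_contents(i):
    """Assumed placement: node i holds x[i], y[i+1], s[i+2] (indices mod N)."""
    return {("x", i): x[i],
            ("y", (i + 1) % N): y[(i + 1) % N],
            ("s", (i + 2) % N): s[(i + 2) % N]}

def repair(i):
    """Rebuild the three chunks of node i from the other nodes, counting reads."""
    surviving = {}
    for j in range(N):
        if j != i:
            surviving.update(node_contents(j))
    reads = 0

    def read(kind, idx):
        nonlocal reads
        reads += 1
        return surviving[(kind, idx % N)]

    rebuilt = {
        ("x", i): read("y", i) ^ read("s", i),
        ("y", (i + 1) % N): read("x", i + 1) ^ read("s", i + 1),
        ("s", (i + 2) % N): read("x", i + 2) ^ read("y", i + 2),
    }
    return rebuilt, reads

rebuilt, reads = repair(0)
print(rebuilt == node_contents(0), reads)  # True 6
```

Each chunk read is also a chunk transferred, which matches the slide's point that disk I/O equals network I/O in repair.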
