Coding Techniques for Efficient, Reliable Networked Distributed Storage in Data Centers
Anwitaman Datta
Joint work with Frédérique Oggier (SPMS)
School of Computer Engineering, Nanyang Technological University
Infocomm Professional Development Forum, 7th July 2011, Singapore

Me, myself & SANDS
SANDS: Self-* Aspects of Networked Distributed Systems
http://sands.sce.ntu.edu.sg/
http://www.ntu.edu.sg/home/anwitaman/

Outline
o Data Centers - the heart of the Cloud
o Replication, RAID & Erasure Codes
o Erasure codes tailor-made for distributed networked storage: Self-repairing Codes
o Wrap-up

What is the Cloud?
o At least, we can all agree the Cloud is something big and happening!
o It's all of these and some more: old wine, data center, SaaS, IaaS, PaaS, %*$aas

NIST Definition for Cloud Computing
o Cloud computing is a model for enabling ubiquitous, convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction.

Two Sides of the Cloud Coin
o Outside view
   A single/exclusive entity
   Access through a demilitarized zone
   API based
   Agnostic to multi-tenancy
   Infinite/elastic resources
   Pay per use, on-demand, ...
   Browser based access (often)
   Anytime, anywhere, any device

Two Sides of the Cloud Coin
o Inside view
   Pool of resources
   In flux: new compute units joining, old ones retiring
   Self-*: load-balancing, fault-tolerance, autoconfiguration, ...
   Multi-tenancy
   Virtualization, transparent migration, ...
   Distributed file system and data-management
      Google's GFS, Amazon's Dynamo, Yahoo!'s PNUTS
      MapReduce/Hadoop, Pig, Chubby, ...

The New Stack
o SQL implementations, e.g., Pig (relational algebra), Hive, ...
o Applications
o NoSQL, e.g., MapReduce, Hadoop, BigTable, HBase, Cassandra
o Distributed File System (e.g., Key Value Store)
o Reliable storage service
o Distributed Physical Infrastructure: Storage/Compute Nodes
Disclaimer: This is a personal view, and may not be standard/universal

Data Center
o Essentially a networked distributed storage system
Source of topology: http://www.cisco.com/en/us/docs/solutions/enterprise/data_center/vmware/vmware.html

Data Center Design Evolution
[Figure: data center generations Gen 1 (DC collocation), Gen 2, Gen 3 and Gen 4 (future, modular data center). Deployment scale unit: server, rack, containers, pre-assembled components. Themes: capacity, density and sustainability; scalability to thousands of servers; right time to market, lower TCO (PUE); scalable data centers.]
Slide courtesy Roger Barga (Microsoft), from his P2P 2009 keynote talk

Failure (of individual nodes) is inevitable
o But, failure of the system is not an option!
o Solution: Redundancy

Is the Danger Real? Yes
Cloud is NSFW

Is the Danger Real? Yes
o There is also the danger of data falling into the wrong hands, e.g., due to a security breach
o Security/privacy issues are outside the scope of this talk
   At SANDS we also work on those issues
   A*Star TSRP project pcloud: http://sands.sce.ntu.edu.sg/pcloud/

Data Center Fault-Tolerance
o Faults are omnipresent
   Hardware, network, software, human, misconfiguration, ...
o Cascade of failures in interdependent networks
   Power failure => network switches stop working
   Network failure => control system for power system ineffective

Redundancy Based Fault Tolerance
o Replicate data, e.g., 3 or more copies
   In nodes on different racks
   Can deal with switch failures
o Power back-up using battery between racks (Google)

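The rack-aware replication described above can be illustrated with a minimal sketch. The topology, the node names and the function `place_replicas` are hypothetical and chosen only for illustration; real placement policies also weigh load, capacity and network distance.

```python
import random

def place_replicas(racks, copies=3, rng=None):
    """Pick one node from each of `copies` distinct racks.

    `racks` maps a rack name to the list of nodes in that rack.
    Putting every copy in a different rack means a single rack-level
    fault (e.g. a switch failure) removes at most one copy.
    """
    rng = rng or random.Random()
    if copies > len(racks):
        raise ValueError("need at least as many racks as copies")
    chosen_racks = rng.sample(sorted(racks), copies)
    return [(rack, rng.choice(racks[rack])) for rack in chosen_racks]

# Hypothetical topology: four racks with two nodes each.
racks = {
    "rack-a": ["a1", "a2"],
    "rack-b": ["b1", "b2"],
    "rack-c": ["c1", "c2"],
    "rack-d": ["d1", "d2"],
}
placement = place_replicas(racks, copies=3, rng=random.Random(0))
print(placement)
```

Each entry of `placement` is a (rack, node) pair, and no rack appears twice.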
Redundancy Based Fault Tolerance
o Using independent physical infrastructure
   Over different availability zones (Amazon AZ)
      How independent are components in a complex network?
   Over multiple geographical regions

Amazon's AWS: Availability Zones
Note: The recent (April 2011) AWS outage was the first region-wide failure

Five Levels of Redundancy
o Physical
o Virtual resource
o Availability zone
o Region
o Cloud
From: http://broadcast.oreilly.com/2011/04/the-aws-outage-the-clouds-shining-moment.html

At What Cost?
o Failure is not an option, but are the overheads acceptable?

Reducing the Overheads of Redundancy
o Erasure codes
   Much lower storage overhead
   High level of fault-tolerance

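To make the overhead comparison concrete, here is a small arithmetic sketch. It assumes an MDS erasure code, i.e. one where any k of the n encoded blocks suffice, so that n - k losses are tolerated. The parameters (3 copies, n=12, k=8) and the function names are hypothetical examples, not figures from the talk.

```python
def replication_profile(copies):
    """Storage overhead and tolerated losses for plain replication."""
    return {"overhead": float(copies), "tolerated_losses": copies - 1}

def erasure_profile(n, k):
    """Storage overhead and tolerated losses for an (n, k) MDS erasure code.

    The object is split into k blocks and encoded into n blocks, so n/k
    bytes are stored per byte of data and any n - k blocks may be lost.
    """
    return {"overhead": n / k, "tolerated_losses": n - k}

rep = replication_profile(3)   # three full copies
ec = erasure_profile(12, 8)    # hypothetical (12, 8) code
print("replication:", rep)
print("erasure code:", ec)
```

With these example parameters the erasure code stores half as much as 3-way replication while tolerating twice as many lost blocks.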
Erasure Codes for Networked Storage
o Data = object, split into the original k blocks O_1, O_2, ..., O_k
o Encoding: the k blocks are encoded into n blocks B_1, B_2, ..., B_n (stored in storage devices in a network)
o Some encoded blocks may be lost
o Retrieve any k' (≥ k) blocks
o Decoding: reconstruct the data, i.e., the original k blocks O_1, O_2, ..., O_k

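The encode/lose/decode cycle above can be sketched with a tiny Reed-Solomon-style code. This is an illustrative choice, not necessarily the code family the talk has in mind: symbols live in the prime field GF(257), each block is a single symbol, and the helper names are hypothetical. Practical systems apply the same idea symbol by symbol over long blocks, typically in a field such as GF(2^8).

```python
P = 257  # small prime; field elements are the integers 0..256

def lagrange_eval(points, x):
    """Evaluate at x the unique polynomial of degree < len(points)
    passing through the given (xi, yi) pairs, with arithmetic mod P."""
    total = 0
    for i, (xi, yi) in enumerate(points):
        num, den = 1, 1
        for j, (xj, _) in enumerate(points):
            if i != j:
                num = num * (x - xj) % P
                den = den * (xi - xj) % P
        total = (total + yi * num * pow(den, -1, P)) % P  # needs Python 3.8+
    return total

def encode(data, n):
    """Systematic encoding: blocks 1..k hold the data, blocks k+1..n are parity."""
    k = len(data)
    points = list(enumerate(data, start=1))
    return {x: data[x - 1] if x <= k else lagrange_eval(points, x)
            for x in range(1, n + 1)}

def decode(available, k):
    """Rebuild the k data symbols from any k surviving {index: value} blocks."""
    if len(available) < k:
        raise ValueError("need at least k blocks")
    points = list(available.items())[:k]
    return [lagrange_eval(points, x) for x in range(1, k + 1)]

data = [72, 105, 33]                              # k = 3 original symbols
blocks = encode(data, n=6)                        # n = 6 encoded blocks
survivors = {x: blocks[x] for x in (2, 5, 6)}     # blocks 1, 3 and 4 are lost
recovered = decode(survivors, k=3)
print(recovered == data)
```

Any three of the six blocks determine the same degree-2 polynomial, which is why the choice of survivors does not matter.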
Replenishing Lost Redundancy for ECs
o Repairs are needed for long-term resilience
o Repair procedure
   Retrieve any k' (≥ k) of the surviving encoded blocks among B_1, B_2, ..., B_n
   Decoding: recover the original k blocks O_1, O_2, ..., O_k
   Encoding: recreate the lost blocks B_l
   Reinsert in (new) storage devices, so that there are (again) n encoded blocks
o Repairs are expensive!

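A back-of-the-envelope sketch of why such repairs are expensive: with the decode-then-re-encode procedure, the repairing node downloads k blocks, i.e. as much data as the whole object, to recreate a single block of size 1/k of the object. The object size, the value of k and the function name below are hypothetical.

```python
def ec_repair_cost(object_size, k):
    """Traffic for repairing one lost block by decoding and re-encoding.

    Assumes the repairing node fetches exactly k blocks, each of size
    object_size / k, and rebuilds the lost block locally.
    """
    block_size = object_size / k
    downloaded = k * block_size
    return {
        "block_size": block_size,
        "downloaded": downloaded,
        "amplification": downloaded / block_size,  # bytes moved per byte repaired
    }

# Hypothetical numbers: a 1200 MB object encoded with k = 8.
cost = ec_repair_cost(object_size=1200, k=8)
print(cost)
```

For comparison, re-creating a lost replica under replication moves one byte per byte repaired, so the amplification there is 1 rather than k.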
Can We Do Better?
o What is the best one can do (w.r.t. repairs)?
   Minimize bandwidth usage per repair
   Minimize number of live nodes used per repair
o Erasure codes have some other drawbacks
   Coding/decoding is expensive
      In contrast to replication or RAID/XOR based systems
   Systematic codes can help (with decoding/access)!
      Not adequate when load-balancing is also an issue!!
   More complex system design
   We do not attempt to address these explicitly
      But some of the solutions we will arrive at will be amenable!

Can We Do Better?
o Self-repairing Codes: erasure codes tailor-made for distributed networked storage

Self-repairing Codes: Blackbox View
o n encoded blocks B_1, B_2, ..., B_n (stored in storage devices in a network), some of which are lost
o Retrieve some k'' (< k) blocks (e.g., k'' = 2) to recreate a lost block B_l
o Reinsert in (new) storage devices, so that there are (again) n encoded blocks

Self-repairing Codes
o There is at least one pair to repair a node, for up to (n-1)/2 simultaneous failures
   Parallel & fast repair of multiple failures
o Example: data object split in four parts, PSRC(n=5, k=3)

Toy Example: PSRC(5,3) repair
o Repair using two nodes, say N_1 and N_3
   (o_1+o_2+o_4) + (o_1) => o_2+o_4
   (o_3) + (o_2+o_3) => o_2
   (o_1) + (o_2) => o_1+o_2
   Four pieces needed to regenerate two pieces
o Repair using three nodes, say N_2, N_3 and N_4
   (o_2) + (o_4) => o_2+o_4
   (o_1+o_2+o_4) + (o_4) => o_1+o_2
   Three pieces needed to regenerate two pieces

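The repair equations of this toy example can be checked with a few lines of XOR arithmetic. The sketch makes several assumptions that the slide does not spell out: "+" is bitwise XOR, each part o_1..o_4 is a single hypothetical byte, the lost node stores the two pieces o_1+o_2 and o_2+o_4, and the assignment of pieces to N_1..N_4 is inferred from the equations. Only the pieces that appear in the equations are modelled.

```python
from functools import reduce
from operator import xor

# Hypothetical one-byte values for the four parts of the object.
o = {1: 0x5A, 2: 0xC3, 3: 0x0F, 4: 0x99}

def piece(*parts):
    """XOR of the named parts, e.g. piece(1, 2, 4) is o_1+o_2+o_4."""
    return reduce(xor, (o[i] for i in parts))

# Pieces held by the live nodes (assignment inferred, see lead-in).
N1 = {"o1": piece(1), "o2+o3": piece(2, 3)}
N2 = {"o2": piece(2)}
N3 = {"o3": piece(3), "o1+o2+o4": piece(1, 2, 4)}
N4 = {"o4": piece(4)}

# The two pieces that were lost and must be regenerated.
lost = {"o1+o2": piece(1, 2), "o2+o4": piece(2, 4)}

# Repair using two nodes (N1 and N3): four pieces are downloaded.
o2 = N3["o3"] ^ N1["o2+o3"]
repaired_two = {
    "o1+o2": N1["o1"] ^ o2,
    "o2+o4": N3["o1+o2+o4"] ^ N1["o1"],
}

# Repair using three nodes (N2, N3 and N4): three pieces are downloaded.
repaired_three = {
    "o1+o2": N3["o1+o2+o4"] ^ N4["o4"],
    "o2+o4": N2["o2"] ^ N4["o4"],
}

print(repaired_two == lost, repaired_three == lost)
```

Both repair routes regenerate the same two pieces without ever reconstructing the full object.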
Toy Example: PSRC(5,3) reconstruction
o Reconstruction, say using N_3, N_4 and N_5
   o_3, o_4
   (o_3) + (o_1+o_3) => o_1
   (o_1) + (o_4) + (o_1+o_2+o_4) => o_2

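The reconstruction equations can likewise be replayed on byte blocks. This is a sketch under assumptions: "+" is bytewise XOR, the sample object and the helper `xor_bytes` are hypothetical, and the retrieved pieces are named by their content rather than by the node holding them, since the slide does not list the node contents.

```python
def xor_bytes(*blocks):
    """Bytewise XOR of equally sized blocks."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)

# Hypothetical 16-byte object split into four equal parts o1..o4.
obj = b"self-repairing!!"
size = len(obj) // 4
o1, o2, o3, o4 = (obj[i * size:(i + 1) * size] for i in range(4))

# The four pieces used in the reconstruction equations.
retrieved = {
    "o3": o3,
    "o4": o4,
    "o1+o3": xor_bytes(o1, o3),
    "o1+o2+o4": xor_bytes(o1, o2, o4),
}

r3 = retrieved["o3"]                                # stored as is
r4 = retrieved["o4"]                                # stored as is
r1 = xor_bytes(retrieved["o3"], retrieved["o1+o3"])  # (o3) + (o1+o3) => o1
r2 = xor_bytes(r1, r4, retrieved["o1+o2+o4"])        # (o1) + (o4) + (o1+o2+o4) => o2

rebuilt = r1 + r2 + r3 + r4
print(rebuilt == obj)
```

Only XOR operations are needed, which is what keeps coding and decoding cheap in this construction.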
Symmetry in SRCs
o All encoded blocks have a symmetric role
   Equivalent importance of all blocks for both data reconstruction & repair
o Symmetry is good
   Easy to analyze, understand and implement
   Simpler algorithm and system design

Maximum Distance Separable (MDS)?
o SRC is not MDS (and cannot be!)
   Does it matter? Not much
      In practice, access will be planned
   PSRC needs less bandwidth than `optimal' RGC!
      This is with random access
[Figure: PSRC(21,3)]

Practical properties
o (Current) SRCs are not systematic
   PSRC is like systematic
      Need to contact more nodes (than k) to obtain systematic `pieces'
      Same total bandwidth usage
   Parallel download for access can even be an `advantage'
   `Mixed' strategies for access, i.e., get some systematic pieces, and some others
   Power saving (by switching off nodes) strategies possible

Practical properties
o Self-repair implies somewhat locally decodable
   If access to only part of the whole object is desired
o Coding and decoding in PSRC both use XOR operations only

Outlook
o 2020: Self-repairing codes in a data center near you?
o Ongoing: Concepts/Implementation
   Prototype miniature data center
   Template for preassembled component of a modular 4G+ data center
o Interested?
   Follow: http://sands.sce.ntu.edu.sg/codingfornetworkedstorage/
   Get involved: {anwitaman,frederique}@ntu.edu.sg