Distributed systems Lecture 6: Elections, consensus, and distributed transactions. Dr Robert N. M. Watson

Last time
- Saw how we can build ordered multicast
  - Messages between processes in a group
  - Need to distinguish receipt and delivery
  - Several ordering options: FIFO, causal or total
- Considered distributed mutual exclusion:
  - Want to limit one process to a CS at a time
  - Central server OK; but bottleneck & SPoF
  - Token passing OK; but traffic, repair, token loss
  - Totally-Ordered Multicast: OK, but high number of messages and problems with failures

Slide from last lecture (clarification: the Lamport variant below is an example of using multicast for total ordering, rather than totally ordered multicast)
Additional details
- Completely unstructured, decentralized solution... but:
  - Lots of messages (1 multicast + N-1 unicasts)
  - OK for the most recent holder to re-enter the CS without any messages
- Variant scheme (due to Lamport):
  - To enter, process P_i multicasts request(P_i, T_i) [same as before]
  - On receipt of a message, P_j replies with an ack(P_j, T_j)
  - Processes keep all requests and acks in an ordered queue
  - If process P_i sees its request is earliest, it can enter the CS; when done, it multicasts a release(P_i, T_i) message
  - When P_j receives the release, it removes P_i's request from its queue
  - If P_j's request is now earliest in the queue, it can enter the CS
- Note that both Ricart & Agrawala's and Lamport's schemes have N points of failure: doomed if any process dies :-(
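To make the queue and ack bookkeeping concrete, here is a minimal Python sketch of one process's state in Lamport's variant. It is an illustration of the rules above, not the lecture's own code; the `multicast` and `send` callables are hypothetical placeholders assumed to deliver messages reliably and in FIFO order.

```python
# Sketch of the Lamport mutual-exclusion variant for one process in the group.
class LamportMutex:
    def __init__(self, pid, peers, multicast, send):
        self.pid = pid              # this process's ID
        self.peers = set(peers)     # IDs of all other processes
        self.multicast = multicast  # send a message to every other process
        self.send = send            # send a message to one process
        self.clock = 0              # Lamport logical clock
        self.queue = []             # pending requests as (timestamp, pid) pairs
        self.acks = set()           # processes that have acked our request
        self.my_request = None

    def request_cs(self):
        """Multicast request(P_i, T_i) and start collecting acks."""
        self.clock += 1
        self.my_request = (self.clock, self.pid)
        self.queue.append(self.my_request)
        self.acks = set()
        self.multicast(("request", self.clock, self.pid))

    def on_request(self, ts, sender):
        """Queue the remote request and reply with ack(P_j, T_j)."""
        self.clock = max(self.clock, ts) + 1
        self.queue.append((ts, sender))
        self.send(sender, ("ack", self.clock, self.pid))

    def on_ack(self, ts, sender):
        self.clock = max(self.clock, ts) + 1
        self.acks.add(sender)

    def may_enter_cs(self):
        """Enter only if our request is earliest and everyone has acked."""
        return (self.my_request is not None
                and min(self.queue) == self.my_request
                and self.acks >= self.peers)

    def release_cs(self):
        """Leave the CS: drop our request and tell everyone else."""
        self.queue.remove(self.my_request)
        self.my_request = None
        self.multicast(("release", self.pid))

    def on_release(self, sender):
        """Remove the releasing process's request from the queue."""
        self.queue = [(ts, p) for (ts, p) in self.queue if p != sender]
```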

Leader Election
- Many schemes are built on the notion of having a well-defined leader (master, coordinator)
  - Examples seen so far include the Berkeley time synchronization protocol, and the central lock server
- An election algorithm is a dynamic scheme to choose a unique process to play a certain role
  - Assume P_i contains a state variable elected_i
  - When a process first joins the group, elected_i = UNDEFINED
- By the end of the election, for every P_i, either elected_i = P_x, where P_x is the winner of the election, or elected_i = UNDEFINED, or P_i has crashed or otherwise left the system

Ring-Based Election
[Figure: a ring of processes P_1..P_5 plus coordinator C; P_2 notices that C has crashed and starts an election; the election message accumulates IDs as it is forwarded, e.g. (2), (2,3), (2,3,5); P_4 has also crashed, so P_3 routes around it; P_2 can multicast the winner (or can continue around to P_5)]
- System has a coordinator, who crashes
- Some process notices, and starts an election
- Find the node with the highest ID, who will be the new leader
- It puts its ID into a message, and sends it to its successor
- On receipt, a process acks to the sender (not shown), and then appends its own ID and forwards the election message
- Finished when a process receives a message containing its own ID
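The forwarding rule is small enough to sketch. This is a minimal, per-process illustration under assumptions: the hypothetical `send_to_successor` callable forwards a message to the next live process around the ring (routing around crashed nodes), and `multicast` announces the result to the group.

```python
# Sketch of ring-based election logic for one process.
class RingElection:
    def __init__(self, pid, send_to_successor, multicast):
        self.pid = pid
        self.send_to_successor = send_to_successor
        self.multicast = multicast
        self.elected = None  # UNDEFINED until the election completes

    def start_election(self):
        """Called when this process notices the coordinator has crashed."""
        self.send_to_successor(("election", [self.pid]))

    def on_election(self, ids):
        if self.pid in ids:
            # Message has gone all the way around: the highest ID wins.
            winner = max(ids)
            self.multicast(("coordinator", winner))
        else:
            # Append our ID and forward the election message to our successor.
            self.send_to_successor(("election", ids + [self.pid]))

    def on_coordinator(self, winner):
        self.elected = winner
```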

The Bully Algorithm
- Algorithm proceeds by attempting to elect the process still alive with the highest ID
  - Assume that we know the IDs of all processes
  - Assumes we can reliably detect failures by timeouts
- If process P_i sees the current leader has crashed, it sends an election message to all processes with higher IDs, and starts a timer
  - Concurrent election initiation by multiple processes is fine
- Processes receiving an election message reply OK to the sender, and start an election of their own (if not already in progress)
- If a process hears nothing back before the timeout, it declares itself the winner, and multicasts the result
- A dead process that recovers (or a new process that joins) also starts an election: can ensure the highest ID is always elected
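A per-process sketch of the bully algorithm, again under assumptions: `send`, `multicast` and `start_timer` are hypothetical helpers (the last fires its callback after the failure-detection timeout), and every process knows `all_ids` in advance.

```python
# Sketch of the bully algorithm from one process's point of view.
class BullyElection:
    def __init__(self, pid, all_ids, send, multicast, start_timer):
        self.pid = pid
        self.all_ids = set(all_ids)     # IDs of every process, known a priori
        self.send = send
        self.multicast = multicast
        self.start_timer = start_timer
        self.in_progress = False
        self.heard_ok = False
        self.elected = None

    def start_election(self):
        """Current leader looks dead: challenge every higher-ID process."""
        self.in_progress = True
        self.heard_ok = False
        higher = [p for p in self.all_ids if p > self.pid]
        if not higher:
            self.declare_victory()
            return
        for p in higher:
            self.send(p, ("election", self.pid))
        self.start_timer(self.on_timeout)   # wait for any OK reply

    def on_election(self, sender):
        """A lower-ID process challenged us: reply OK, run our own election."""
        self.send(sender, ("ok", self.pid))
        if not self.in_progress:
            self.start_election()

    def on_ok(self, sender):
        self.heard_ok = True    # someone higher is alive; await their result

    def on_timeout(self):
        if not self.heard_ok:
            self.declare_victory()

    def declare_victory(self):
        self.elected = self.pid
        self.in_progress = False
        self.multicast(("coordinator", self.pid))

    def on_coordinator(self, winner):
        self.elected = winner
        self.in_progress = False
```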

Problems with Elections
[Figure: processes P_0..P_8 split into separate groups by a network partition]
- Algorithms rely on timeouts to reliably detect failure
- However, it is possible for networks to fail: a network partition
  - Some processes can speak to others, but not all
- Can lead to split-brain syndrome: every partition independently elects a leader (too many bosses!)
- To fix, need some secondary (& tertiary?) communication scheme, e.g. secondary network, shared disk, serial cables, ...

Aside: Consensus
- Elections are a specific example of a more general problem: consensus
- Given a set of n processes in a distributed system, how can we get them all to agree on something?
- Classical treatment has every process P_i propose something (a value V_i)
  - Want to arrive at some deterministic function of the V_i's (e.g. majority or maximum will work for election; sketched below)
- A correct solution to consensus must satisfy:
  - Agreement: all nodes arrive at the same answer
  - Validity: the answer is one that was proposed by someone
  - Termination: all nodes eventually decide
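As a small illustration of "a deterministic function of the V_i's", here are the maximum and majority rules mentioned above. The proposal values and tie-breaking choice are arbitrary assumptions made for the example.

```python
from collections import Counter

def decide_maximum(proposals):
    """Suitable for leader election: pick the highest proposed ID."""
    return max(proposals.values())

def decide_majority(proposals):
    """Pick the most commonly proposed value (ties broken by larger value)."""
    counts = Counter(proposals.values())
    value, _count = max(counts.items(), key=lambda kv: (kv[1], kv[0]))
    return value

proposals = {"P1": 5, "P2": 3, "P3": 5}    # V_i proposed by each P_i
assert decide_maximum(proposals) == 5       # validity: 5 was proposed by someone
assert decide_majority(proposals) == 5
```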

Consensus is impossible
- Famous result due to Fischer, Lynch & Paterson (1985)
  - Focuses on an asynchronous network (unbounded delays) with at least one process failure
  - Shows that it is possible to get an infinite sequence of states, and hence never terminate
- Given that the Internet is an asynchronous network, this seems to have major consequences!
- Not really:
  - The result actually says we can't always guarantee consensus, not that we can never achieve consensus
  - And in practice, we can use tricks to mask failures (such as reboot, or replication), and to ignore asynchrony
  - Have seen solutions already, and will see more later

Transaction Processing Systems
- Last term looked at transactions: ACID properties
- Support for composite operations (i.e. a collection of reads and updates to a set of objects)
- A transaction is atomic ("all-or-nothing")
  - If it commits, all operations are applied
  - If it aborts, it's as if nothing ever happened
- A committed transaction moves the system from one consistent state to another
- Transaction processing systems also provide:
  - isolation (between concurrent transactions)
  - durability (committed transactions survive a crash)

Distributed Transactions
- The scheme described last term was client/server
  - E.g., a program (client) accessing a database (server)
- However, distributed transactions are those which span multiple transaction processing servers
- E.g. booking a complex trip from London to Vail, CO
  - Could fly LHR -> LAX -> EGE + hire a car
  - or fly LHR -> ORD -> DEN + take a public bus
  - Want a complete trip (i.e. atomicity): not get stuck in an airport with no onward transport!
- Must coordinate actions across multiple parties

A Model of Distributed Transactions
[Figure: clients C_1, C_2 and servers S_1 (x=5, y=0, z=3), S_2 (a=7, b=8, c=1), S_3 (i=2, j=4); C_1 runs T1: transaction { if (x < 2) abort; a := z; j := j + 1; }; C_2 runs T2: transaction { z := (i + j); }]
- Multiple servers (S_1, S_2, S_3, ...), each holding some objects which can be read and written within client transactions
- Multiple concurrent clients (C_1, C_2, ...) who perform transactions that interact with one or more servers
  - e.g. T1 reads x, z from S_1, writes a on S_2, and reads & writes j on S_3
  - e.g. T2 reads i, j from S_3, then writes z on S_1
- A successful commit implies agreement at all servers

Implementing distributed transactions
- Can build on top of the solution for a single server:
  - e.g. use locking or shadowing to provide isolation
  - e.g. use a write-ahead log for durability
- Need to coordinate to either commit or abort
  - Assume clients create a unique transaction ID: TXID
  - The client uses the TXID in every read or write request to a server S_i
  - The first time S_i sees a given TXID, it starts a tentative transaction associated with that transaction ID (sketched below)
  - When the client wants to commit, it must perform an atomic commit of all tentative transactions across all servers
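A minimal server-side sketch of tentative transactions keyed by TXID. It is illustrative only: locking and the write-ahead log are elided, and `objects` is just an in-memory dict standing in for the server's stored objects.

```python
class TransactionServer:
    def __init__(self, objects):
        self.objects = objects     # committed object values, e.g. {"x": 5}
        self.tentative = {}        # TXID -> buffered writes for that transaction

    def _ensure(self, txid):
        # First time we see a TXID, start a tentative transaction for it.
        if txid not in self.tentative:
            self.tentative[txid] = {}
        return self.tentative[txid]

    def read(self, txid, name):
        writes = self._ensure(txid)
        # Read our own tentative write if there is one, else the committed value.
        return writes.get(name, self.objects[name])

    def write(self, txid, name, value):
        self._ensure(txid)[name] = value

    def commit(self, txid):
        # Apply all buffered writes (atomically, locally), then forget the TXID.
        self.objects.update(self.tentative.pop(txid, {}))

    def abort(self, txid):
        # Discard the tentative transaction as if it never happened.
        self.tentative.pop(txid, None)
```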

Atomic Commit Protocols
- A naïve solution would have the client simply invoke commit(TXID) on each server in turn
  - Will work only if there are no concurrent conflicting clients, every server commits (or aborts), and no server crashes
- To handle concurrent clients, introduce a coordinator:
  - A designated machine (can be one of the servers)
  - Clients ask the coordinator to commit on their behalf, and hence the coordinator can serialize concurrent commits
- To handle inconsistency/crashes, the coordinator:
  - Asks all involved servers if they could commit TXID
  - Servers S_i reply with a vote V_i = { COMMIT, ABORT }
  - If all V_i = COMMIT, the coordinator multicasts doCommit(TXID)
  - Otherwise, the coordinator multicasts doAbort(TXID)

Two-Phase Commit (2PC)
[Figure: message sequence over physical time between coordinator C and servers S1, S2, S3: C sends canCommit(TXID)?, the servers vote, C sends doCommit(TXID), and the servers ACK]
- This scheme is called two-phase commit (2PC):
  - First phase is voting: collect votes from all parties
  - Second phase is completion: either abort or commit
- Doesn't require ordered multicast, but needs reliability
  - If a server fails to respond by the timeout, treat it as a vote to abort
- Once all ACKs are received, inform the client of successful commit
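A coordinator-side sketch of the two phases. The hypothetical `ask(server, msg)` helper is assumed to perform a request/response with a timeout, returning None on timeout; persistent logging is omitted here (it is the subject of the coordinator-crash slide below).

```python
COMMIT, ABORT = "COMMIT", "ABORT"

def two_phase_commit(txid, servers, ask):
    # Phase 1 (voting): ask every involved server whether it can commit.
    # A missing reply (timeout, returned as None) counts as a vote to abort.
    votes = [ask(s, ("canCommit", txid)) for s in servers]
    decision = COMMIT if all(v == COMMIT for v in votes) else ABORT

    # Phase 2 (completion): tell every server the outcome.
    message = "doCommit" if decision == COMMIT else "doAbort"
    for s in servers:
        while ask(s, (message, txid)) != "ACK":
            pass  # keep re-sending until this server acknowledges

    # Only once all ACKs are in do we report the outcome to the client.
    return decision
```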

2PC: Additional Details
- The client (or any server) can abort during execution: it simply multicasts doAbort(TXID) to all servers
  - E.g., if the client transaction throws an exception or a server fails
- If a server votes NO, it can immediately abort locally
- If a server votes YES, it must be able to commit if subsequently asked by the coordinator:
  - Before voting to commit, the server will prepare by writing entries into its log and flushing them to disk (sketched below)
  - It also records all requests from & responses to the coordinator
  - Hence even if it crashes after voting to commit, it will be able to recover on reboot
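A participant-side sketch of the "prepare before voting YES" rule. The JSON log format and explicit `os.fsync` flush are illustrative assumptions; a real server would reuse its existing write-ahead log machinery.

```python
import json, os

class Participant:
    def __init__(self, log_path, tentative):
        self.log = open(log_path, "a")
        self.tentative = tentative        # TXID -> buffered writes (as earlier)

    def _log(self, record):
        # Append a record and force it to disk before replying to anyone.
        self.log.write(json.dumps(record) + "\n")
        self.log.flush()
        os.fsync(self.log.fileno())

    def on_can_commit(self, txid):
        if txid not in self.tentative:
            return "ABORT"                 # nothing to commit: vote NO
        # Prepare: persist the tentative writes and our YES vote, so we can
        # still commit after a crash if the coordinator later says doCommit.
        self._log({"type": "prepare", "txid": txid,
                   "writes": self.tentative[txid]})
        return "COMMIT"

    def on_do_commit(self, txid):
        self._log({"type": "commit", "txid": txid})
        # ...apply the tentative writes to the committed state (omitted)...
        return "ACK"

    def on_do_abort(self, txid):
        self._log({"type": "abort", "txid": txid})
        self.tentative.pop(txid, None)
        return "ACK"
```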

2PC: Coordinator Crashes
- The coordinator must also persistently log events:
  - Including the initial message from the client, requesting votes, receiving replies, and the final decision made
  - Lets it reply if a (restarted) client or server asks for the outcome
  - Also lets the coordinator recover from reboot, e.g. re-send any vote requests without responses, or reply to the client
- One additional problem occurs if the coordinator crashes after phase 1, but before initiating phase 2:
  - Servers will be uncertain of the outcome
  - If they voted to commit, they will have to continue to hold locks, etc.
  - Other schemes (3PC, Paxos, ...) can deal with this
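One way to picture the recovery step: a restarted coordinator replays its log to rebuild per-transaction state, then resumes from wherever each transaction got to. The record types and log format below are assumptions made for the sketch.

```python
import json

def recover_coordinator(log_path):
    """Rebuild per-transaction state from the coordinator's persistent log."""
    state = {}   # txid -> {"votes": {server: vote}, "decision": ...}
    with open(log_path) as f:
        for line in f:
            rec = json.loads(line)
            tx = state.setdefault(rec["txid"], {"votes": {}, "decision": None})
            if rec["type"] == "vote":
                tx["votes"][rec["server"]] = rec["vote"]
            elif rec["type"] == "decision":
                tx["decision"] = rec["decision"]
    return state

# After recovery: for transactions with a logged decision, re-send
# doCommit/doAbort to any server that has not ACKed; for transactions with
# votes still outstanding, re-send canCommit requests (or decide to abort).
```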

Replication
- Many distributed systems involve replication
  - Multiple copies of some object stored at different servers
  - Multiple servers capable of providing some operation(s)
- Three key advantages:
  - Load-Balancing: if we have many replicas, we can spread out work from clients between them
  - Lower Latency: if we replicate an object/server close to a client, we will get better performance
  - Fault-Tolerance: can tolerate the failure of some replicas and still provide service
- Examples include DNS, web & file caching (& content-distribution networks), replicated databases, ...

Replication in a Single System
- A good single-system example is RAID:
  - RAID = Redundant Array of Inexpensive Disks
  - Disks are cheap, so use several instead of just one
  - If we replicate data across disks, we can tolerate a disk crash
  - If we don't replicate data, we get the appearance of a single larger disk
- A variety of different configurations ("levels"):
  - RAID 0: stripe data across disks, i.e. block 0 to disk 0, block 1 to disk 1, block 2 to disk 0, and so on
  - RAID 1: mirror (replicate) data across disks, i.e. block 0 written on disk 0 and disk 1
  - RAID 5: parity: write block 0 to disk 0, block 1 to disk 1, and (block 0 XOR block 1) to disk 2 (sketched below)
- Get improved performance since we can access disks in parallel
- With RAID 1, 5 we also get fault-tolerance
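A tiny illustration of the parity rule from the RAID 5 item above: storing block 0, block 1, and (block 0 XOR block 1) lets either data block be rebuilt after a single disk failure. The block contents are made up for the example.

```python
def xor_blocks(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

block0 = b"hello, s"                   # written to disk 0
block1 = b"tudents!"                   # written to disk 1
parity = xor_blocks(block0, block1)    # written to disk 2

# Suppose disk 1 fails: its block can be recomputed from the survivors,
# because XOR is its own inverse.
recovered = xor_blocks(block0, parity)
assert recovered == block1
```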

Distributed Data Replication
- Have some number of servers (S_1, S_2, S_3, ...)
  - Each holds a copy of all objects
- Each client C_i can access any replica (any S_i)
  - e.g. clients can choose the closest, or the least loaded
- If objects are read-only, then it's trivial:
  - Start with one primary server P holding all the data
  - If a client asks S_i for an object, S_i returns a copy (S_i fetches a copy from P if it doesn't already have a fresh one)
- Can easily extend to allow updates by P
  - When updating object O, send invalidate(O) to all S_i (sketched below)
  - In essence, this is how web caching / CDNs work today
- But what if clients can perform updates?
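A minimal sketch of the primary-plus-caching-replicas scheme just described. The `Primary` and `Replica` names, and the direct method calls standing in for network messages, are assumptions made for the illustration.

```python
class Replica:
    def __init__(self, primary):
        self.primary = primary
        self.cache = {}                     # locally cached copies of objects

    def read(self, name):
        if name not in self.cache:          # fetch a fresh copy if needed
            self.cache[name] = self.primary.fetch(name)
        return self.cache[name]

    def invalidate(self, name):
        self.cache.pop(name, None)          # drop the stale copy

class Primary:
    def __init__(self, objects):
        self.objects = objects
        self.replicas = []

    def fetch(self, name):
        return self.objects[name]

    def update(self, name, value):
        # Only the primary updates; it then invalidates every replica's copy.
        self.objects[name] = value
        for r in self.replicas:
            r.invalidate(name)

primary = Primary({"O": "v1"})
replica = Replica(primary)
primary.replicas.append(replica)
print(replica.read("O"))        # "v1": fetched from the primary and cached
primary.update("O", "v2")       # invalidate(O) is sent to every replica
print(replica.read("O"))        # "v2": re-fetched after the invalidation
```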

Replication and Consistency
- Gets more challenging if clients can perform updates
- For example, imagine x has value 3 (in all replicas)
  - C1 requests write(x, 5) from S4
  - C2 requests read(x) from S3
  - What should occur?
- With strong consistency, the distributed system behaves as if there is no replication present:
  - i.e. in the above, C2 should get the value 5
  - requires coordination between all servers
- With weak consistency, C2 may get 3 or 5 (or ...?)
  - Less satisfactory, but much easier to implement

Next time
- Replication and consistency
- Strong consistency
- Quorum-based systems
- Weaker consistency
- Consistency, availability and partitions
- Further replication models
- Google case studies