
Online Reliable State Persistence. Andrei Agapi, Rick Payne, Robert Broberg, Thilo Kielmann. FRIAR 2011

Outline: FTSS; the KVL data model; communication (& other optimizations); BGP; BGP/KVL; distributed inodes; snapshots & complex queries; evaluation

FTSS overview
- Fast, malleable, non-relational DB; can be used as an FT shared-storage substrate for dynamic app state
- In-memory, 1-hop, performance-optimized DHT
- Independently failing components participate; all state is replicated on several such components & accessible from all
- Components self-monitor; membership is maintained dynamically as they fail/(re)join -> malleable shmem
- Support for arbitrary structured data ~ navigational DB

FTSS overview [2]
- Performance-optimized, e.g. data bulking, optimized link/unlink ops; write-intensive
- API: Overwrite, Append, Delete, Fetch, Link, Unlink, Read links -> complex queries
- Each with bulk versions

Consistent Hashing-Based. As SEs join/fail, tuples are remapped (e.g. on the ring: A <- 6,4; B <- 2; C <- 3; D <- 5,1)

Consistent Hashing-Based
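The ring behavior sketched above can be illustrated with a toy consistent-hash ring (server and key names are illustrative, not from the implementation): each tuple maps to the first server clockwise from the hash of its key, so a single join/fail only remaps the tuples adjacent to the affected server.

```python
import bisect
import hashlib


def _hash(value: str) -> int:
    """Map a string to a stable point on the ring."""
    return int(hashlib.md5(value.encode()).hexdigest(), 16)


class Ring:
    """Toy consistent-hash ring: lookup returns the first server
    clockwise from the hash of the key."""

    def __init__(self):
        self._points = []   # sorted hash points
        self._servers = {}  # point -> server name

    def join(self, server: str) -> None:
        point = _hash(server)
        bisect.insort(self._points, point)
        self._servers[point] = server

    def fail(self, server: str) -> None:
        point = _hash(server)
        self._points.remove(point)
        del self._servers[point]

    def lookup(self, key: str) -> str:
        point = _hash(key)
        i = bisect.bisect_right(self._points, point) % len(self._points)
        return self._servers[self._points[i]]
```

Removing a server moves only the tuples that server was responsible for; every other tuple keeps its mapping, which is what makes membership churn cheap.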

Types of nodes: storage server; client (e.g. an app persisting its state); monitor app (global statistics, snapshotting etc)

The KVL model
- K Key: unique; used for distribution/replication over servers
- V Value: associated with the key; tuples are randomly spread based on hashed keys; can be updated (overwritable, appendable); all are symmetrically replicated; linked list at the server
- L Links: a unique set; building block for handling structured data, arbitrary pointers, recovery indices, sets etc; b-tree at the server, serialized at read/replication time
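A minimal sketch of one (K, V, L) tuple under the model above (method and key names are illustrative; the real store keeps L as a server-side b-tree, here a plain set sorted at read time):

```python
class KVLTuple:
    """One (K, V, L) tuple: an overwritable/appendable value (V)
    plus a unique set of links (L) keyed by K."""

    def __init__(self, key):
        self.key = key
        self.value = b""
        self.links = set()  # unique link set: the "L" section

    def overwrite(self, data: bytes):
        self.value = data

    def append(self, data: bytes):
        self.value += data

    def link(self, target_key):
        self.links.add(target_key)   # duplicates collapse: L is a set

    def unlink(self, target_key):
        self.links.discard(target_key)

    def read_links(self):
        # serialized in order at read/replication time
        return sorted(self.links)
```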

Relevance for next-gen routing: FT BGP, and implementation status. Used actively in FT BGP: the BGP daemon persists essential state as structured FTSS data (raw updates before ACK, processed state etc). FTSS handles fail-stop failures; Byzantine failures can be orthogonally handled on top via other techniques (e.g. equivalent state machines). 1 BGP update = several FTSS writes, links, possibly deletes.

Performance optimizations
- Copy reductions, malloc reductions along the stack, in-memory locking etc
- Microbenchmarking + rewriting for performance
- Delaying block serialization until read/re-replication time, as opposed to write time
- Tuning of chunk sizes, cache sizes, direct IO, etc
- Bulk ops & chunking

Communication: unicast (single-key) updates; multicast updates; multi-key (bulk) updates

Communication: Unicast (single key) updates. The app (BGP) issues an update for key K at server A; replication_factor == 3 => servers B, C, D. A looks up the responsible server list according to the hash of K, unicasts to each affected server, and waits for replication_factor replies. All of B, C, D: look up K, merge V (e.g. append), merge L (e.g. delete links if they exist), update local btrees+lists, and send ACKs. A then ACKs to the app.

Communication: Unicast (single key) updates [2]. Replication_factor unicast packets (+ACKs) per key update (or more if the KVL is chunked). UDP + chunking + per-chunk ACKing for reliability; per-chunk retransmits at timeouts.
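The chunking and per-chunk retransmit scheme can be sketched as follows (a simplification: `send` stands in for the UDP send plus ACK wait, and the round loop stands in for the timeout timer):

```python
def chunk(payload: bytes, size: int):
    """Split one (possibly large) KVL update into (seq, total, data)
    chunks so each can be ACKed and retransmitted independently."""
    total = max(1, -(-len(payload) // size))  # ceiling division
    return [(seq, total, payload[seq * size:(seq + 1) * size])
            for seq in range(total)]


def deliver(chunks, send, max_rounds=16):
    """Per-chunk reliability over an unreliable channel: at each
    timeout round, resend every chunk whose ACK has not arrived.
    `send` returns True when that chunk's ACK came back."""
    pending = dict(enumerate(chunks))
    for _ in range(max_rounds):
        if not pending:
            break
        for seq in list(pending):
            if send(pending[seq]):
                del pending[seq]
    return not pending
```

Only the lost chunks are retransmitted, so one dropped datagram does not force resending the whole key update.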

Communication: Multicast updates. The app (BGP) updates key K at A; replication_factor == 3 => B, C, D. A sends a single mcast packet to contact the N affected servers: the packet carries the mcast address plus, in its header, the unicast addresses of all affected nodes, used for quick destination filtering. All of A, E: ignore the mcast packet based on the header. All of B, C, D: look up K, merge V (e.g. append), merge L (e.g. delete links if they exist), update local btrees+lists, and send ACKs (ACKs use unicast; still N ACKs). Mcast + UDP + chunking + per-chunk ACKing for reliability; per-chunk retransmits at timeouts; duplicate ACKs ignored. A waits for replication_factor replies according to the hash of K, then ACKs to the app.

Multi-key bulk updates (unicast). The app (BGP) issues updates for K1, K2, K3, mapping e.g. K1 -> B,C,D; K2 -> E,A,B; K3 -> A,B,C. A looks up the responsible server list for each K, repacks the scheduled updates into one bulk update per affected server (A gets K2,K3; B gets K1,K2,K3; C gets K1,K3; D gets K1; E gets K2), unicasts the repacked updates to each affected server, and waits for replies from all affected servers. (Substantially) fewer UDP packets and ACKs; improved throughput. In the bulked version, chunking occurs independently of the # of updated keys; retransmits, timeouts etc function equivalently. All of A, B, C, D, E: look up their respective Ks, merge Vs, merge Ls; update local btrees+lists; send ACKs (for the full bulk update). Mcast + header-based filtering could similarly be used for bulk ops as for single-key ops.

Communication: Multi-key (bulk) updates. Most of the time, applications have more than one data item available to persist in parallel. E.g. a BGP packet updates attributes for several prefixes, withdraws more than one prefix, independent packets are persisted in parallel etc. We optimize such cases via bulk operations: allow more data+ACKs to be in flight in the DHT; less (de-)marshaling; more efficient chunking (independent of the # of keys); shorter code paths. FTSS has an optimization layer for bulk ops: any # of ops for different Ks (Value overwrites, Link set links/unlinks etc) can be given in parallel; we group KVL tuples based on destination server, and only then do the subsequent chunking.
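The group-by-destination step of the bulk-op optimizer can be sketched like this (names are illustrative; `replicas_of` stands in for the consistent-hash lookup of a key's replica set):

```python
def repack(updates, replicas_of):
    """Bulk-op optimizer sketch: group independent per-key ops by
    destination server first; chunking then happens once per bulked
    message, independent of the number of keys."""
    per_server = {}
    for key, op in updates:
        for server in replicas_of(key):
            per_server.setdefault(server, []).append((key, op))
    return per_server
```

With 100 key updates, 5 servers and 2 replicas this yields 5 bulk messages of ~40 ops each, instead of ~200 individual unicast packets plus their ACKs.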

Using KVL. A primitive allowing flexibility in optimizing priority ops / building scalable storage. Can be used as a building block for: a navigational / object DB; centralized or distributed indices over data; write- vs. read-(recovery-)optimized layouts; collocating related data together etc. Support for complex queries via indexing & distributed key searching.

Use case: BGP. The main inter-domain routing protocol of the Internet. ISPs announce to their neighbors, and further propagate, which prefixes they offer routes for (+ attributes). Based on the BGP information it collects, an ISP ultimately decides how to route its traffic (= the BGP Best Path Selection algorithm).

Use case: BGP [2]. Basically, changes in Internet routing state are disseminated via BGP peering between ISPs (eBGP) and within an ISP (iBGP). Info about a prefix: AS_PATH, ORIGIN, MED, NEXT_HOP, COMMUNITY, LOCAL_PREF, WEIGHT etc.

Use case: BGP [3]. Conceptually 2 types of updates (which can be bulked together): prefix withdrawals (== the peer doesn't route that prefix at all anymore); prefix announcements (== the only way to update attributes for a given path).

Use case: BGP [4]. Typically BGP speakers run on routers installed on the RP node, and resets are handled via e.g. BGP Graceful Restart (basically the full table is resynchronized with neighbors), ~10 mins. At peak (e.g. at some peer's reset), one can see very high update rates (full table: ~300K prefixes).

Use case: BGP [5] (architecture figure): external protocol speakers (e.g. BGP peers S1..Sn) send protocol updates through an FT frontend (e.g. TCP) layer into a SHIM; the SHIM does app state checkpoint/restore and update flow + protocol state summarization against FTSS, the fast, fault-tolerant, malleable online database for generic application state (fail-stop). Siblings 1..m are paravirtualized, codebase-independent protocol implementations (~ protocol "device drivers"; Byzantine).

BGP to KVL, a simple implementation: 1 KVL tuple per peer (PeerEntry); 1 per unique set of attributes (AttrEntry); prefixes held as entries in the L section of all corresponding AttrEntries; 1 recovery index (RootInode).

BGP to KVL, a simple implementation [2] (figure): Recovery Index -> Peer 1 .. Peer n -> AttrEntry 1 .. AttrEntry m -> Prefix 1 .. Prefix p

BGP to KVL, a simple implementation [3] (figure): as above, but each AttrEntry holds a Pfx Tree instead of individual prefix entries

A BGP KVL illustration. A BGP update (attributes + announced prefixes 1..n + withdrawn prefixes 1..m) maps to KVL tuples. Top inode KVL: holds links to the distinct attribute KVLs. Attribute KVL: holds in L unique links (or immediate data) to all KVLs of announced paths which have those attributes; the attribute data itself is in V. Prefix KVL: prefix data + empty links; alternatively, it can be missing altogether, with the data held as links in the respective ATTR KVL.

A BGP KVL illustration [2]. Given this structure, we can translate a BGP update into a number of separate (K,V,L) updates (links, unlinks, data overwrites etc). The bulk update optimizer then repacks the updates per destination. As we typically have more key updates than servers, this matters, e.g.: 100 updates, 5 servers, 2 replicas => ~40 updates to be bulked per server.
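A sketch of that translation step under the simple mapping above (key names and the serialized-attributes placeholder are illustrative): the update's attribute set lives in one AttrEntry tuple, each announced prefix is linked under it (and unlinked from the AttrEntry it previously hung off), and each withdrawn prefix is unlinked from its old entry.

```python
def bgp_update_to_kvl_ops(attr_key, announced, withdrawn, prev_attr_of):
    """Translate one BGP update into a flat list of KVL ops.
    prev_attr_of maps each prefix to the AttrEntry key it is
    currently linked under (mutated in place)."""
    ops = [("overwrite", attr_key, b"<serialized attributes>")]
    for pfx in announced:
        old = prev_attr_of.get(pfx)
        if old is not None and old != attr_key:
            ops.append(("unlink", old, pfx))  # path moved attr sets
        ops.append(("link", attr_key, pfx))
        prev_attr_of[pfx] = attr_key
    for pfx in withdrawn:
        old = prev_attr_of.pop(pfx, None)
        if old is not None:
            ops.append(("unlink", old, pfx))
    return ops
```

This is where "1 BGP update = several FTSS writes, links, possibly deletes" comes from; the resulting op list is what the bulk optimizer repacks per destination server.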

Load balancing tuples & links

Distributed indexes. Implemented as distributed B+trees (B-link trees). We leverage DHT topology maintenance for the server pool. We attempt to minimize locking/synchronization while: remaining malleable; remaining consistent; handling DHT routing inconsistencies.

B-tree

Bigger B-tree

B-tree Insertion / node split (figure: initial tree; +28 causes no node split; +70 causes a node split)

B-tree Rebalancing (figure: initial tree; an insert with no node split; rebalancing)

B-tree Root split (figure: initial tree; +70 then +95 cause a root split)

B-tree Delete + (cascade) node merge (figure: initial tree; -60 causes a node merge)

B-tree properties. Ops are logarithmic (with a large base, e.g. ~100 or more) in the # of keys stored. They preserve key order. Load is balanced among all nodes. There are concurrent versions, versions that minimize locking, distributed versions, top-down/single-pass splitting, bulk updating etc. The structure of choice for DB indexes etc.
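The split behavior shown in the figures above (including the root split that grows the tree) can be illustrated with a minimal in-memory B-tree; this is the textbook algorithm, not the distributed KVL variant discussed next:

```python
import bisect


class BNode:
    def __init__(self, leaf=True):
        self.keys = []
        self.children = []
        self.leaf = leaf


class BTree:
    """Minimal B-tree of minimum degree t: nodes hold at most
    2*t - 1 keys; a full node splits at its median, pushing the
    middle key into the parent; a full root split grows the tree."""

    def __init__(self, t=2):
        self.t = t
        self.root = BNode()

    def insert(self, key):
        root = self.root
        if len(root.keys) == 2 * self.t - 1:  # root split
            new_root = BNode(leaf=False)
            new_root.children.append(root)
            self.root = new_root
            self._split_child(new_root, 0)
        self._insert_nonfull(self.root, key)

    def _split_child(self, parent, i):
        t = self.t
        child = parent.children[i]
        right = BNode(leaf=child.leaf)
        parent.keys.insert(i, child.keys[t - 1])  # median moves up
        parent.children.insert(i + 1, right)
        right.keys = child.keys[t:]
        child.keys = child.keys[:t - 1]
        if not child.leaf:
            right.children = child.children[t:]
            child.children = child.children[:t]

    def _insert_nonfull(self, node, key):
        if node.leaf:
            bisect.insort(node.keys, key)
            return
        i = bisect.bisect_right(node.keys, key)
        if len(node.children[i].keys) == 2 * self.t - 1:
            self._split_child(node, i)
            if key > node.keys[i]:
                i += 1
        self._insert_nonfull(node.children[i], key)

    def inorder(self):
        out = []
        def walk(n):
            if n.leaf:
                out.extend(n.keys)
                return
            for i, c in enumerate(n.children):
                walk(c)
                if i < len(n.keys):
                    out.append(n.keys[i])
        walk(self.root)
        return out
```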

B-tree - DHT specifics / issues. B-tree nodes are regular KVL tuples maintained in the DHT => B-tree nodes are replicated. Pointers are DHT keys rather than explicit servers; we rely on DHT table lookups to resolve them. Catch: node views on membership may be inconsistent. Our B-tree supports reconfiguration while remaining consistent, with minimum locking, via BNode versioning. Our main use case is bulk updates from a single writer => not so much concurrency, but BTree replication + malleability issues.

Routing inconsistencies (figure, 4 frames): (1) A's view of K's replica set is C, E, F, while B's view of K is D, E, F. (2) D is late to answer an RPC, so A deems it dead. (3-4) A lets everyone know D died; between steps 3 and 4 there is an RPC.

KVL B-tree node splitting/naming (figure): inserting +5 into node K (at version v) over-fills it; it splits into nodes named K:v+1 (keys 1,2,3) and K:v+2 (keys 3,4,5).

KVL B-tree node splitting/naming [2] (figure): further inserts (+0, +6, +7) split again, producing further version-named nodes (e.g. K:v+5); node names thus encode the B-tree version at which each node was created.

Node fields:
- Node Name
- Node Version: last write which passed through the node (modifying it or not)
- Creation version: at which B-tree version this node was created
- Parent Key
- Sibl Key Left / Sibl Key Right
- Per child i: Chld Key i, Sep Val i, Chld Vsn i (last successful write of the child subtree)
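The field layout above can be written down as a plain record (field names are my reading of the slide; all pointers are DHT keys — node names — never server addresses):

```python
from dataclasses import dataclass, field


@dataclass
class BTreeNode:
    """One replicated B-tree node as stored in a KVL tuple."""
    node_name: str                # this node's own DHT key
    created_at_version: int       # B-tree version at node creation
    node_version: int = 0         # last write that passed through the node
    parent_key: str = ""
    sibling_key_left: str = ""
    sibling_key_right: str = ""
    separator_values: list = field(default_factory=list)  # Sep Val 1..k
    child_keys: list = field(default_factory=list)        # Chld Key 1..k+1
    child_versions: list = field(default_factory=list)    # last successful
                                                          # write per subtree
```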

Regular writes & reads (figures, 4 animation frames): a client at version V+N traverses the tree from the root K:N1 down through internal nodes K:N2, K:N3 to leaves K:N4, K:N5, checking the node versions (V+1, V+6, V+N, V+N+1) recorded along the path.

Internal node split (figures, 5 frames): a node needs to split; the replicas agree upon some Authoritative Topology (AT), with leader election based on it; the leader creates the new version-named node (K:V+N) and propagates the AT, the split value, [recurse] up the tree.

Snapshots. Some workloads are not optimized for the particular inode indexing structure; e.g. the BGP DB is indexed on AttrEntry, then Prefix, but RIS bview traces are organized on Prefix first (then Peer). The client library includes a snapshotting interface, whereby a snapshot can be built at the client, then bulk-written.

Complex queries: recover_all(root_set, inode_range_query, where_query)
- root_set = set of inodes to start from
- inode_range_query = range query on the inode, e.g. keys in [K1, K2)
- where_query = condition on retrieved nodes, e.g. contains_some(key_set)
E.g.:
recover_all(pfx_inode, [PFX_left, PFX_right], contains_some(AS507, AS100))
recover_all(pfx_inode, *, contains_some(AppID1, AppID2))
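A sketch of the recover_all primitive with that signature (the storage accessors `read_links`/`read_value` are stand-ins for KVL fetches, and the predicate helper is illustrative): start from the root inode set, range-filter the keys linked under each inode, then keep tuples passing the where-condition.

```python
def recover_all(root_set, inode_range, where, read_links, read_value):
    """root_set: inode keys to start from; inode_range: (lo, hi)
    half-open key range; where: predicate over a retrieved tuple."""
    lo, hi = inode_range
    hits = []
    for inode in root_set:
        for key in read_links(inode):        # keys linked under the inode
            if lo <= key < hi and where(read_value(key)):
                hits.append(key)
    return hits


def contains_some(*needles):
    """where_query helper: true if the retrieved tuple's link set
    contains any of the given keys."""
    def pred(tuple_links):
        return any(n in tuple_links for n in needles)
    return pred
```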

Evaluation: FTSS

Writes cost per key

Writes throughput

Writes cost per key

Writes scaling in replication factor

Writes scaling in replication factor (tpt)

Link ops cost per key

Link ops throughput

Link ops cost per key

Link ops scaling in rep. factor

Writes effect of bulking (& copy reduction)

Evaluation: BGP/FTSS

Prefixes per peer distribution

Prefixes per peer distribution [2]

Internet table growth

Processing speed. The current implementation handles ~431.00 BGP updates/second. 1 BGP update = several link/unlink ops, possibly writes/deletes. There is further space for improvement via further bulking and/or better load balancing (e.g. node allocation on the ring, a decentralized recovery index).

Load balancing Overall Bytecount

Load balancing tuples & links

CDF of individual tree load

Conclusions. FTSS is a fast & malleable persistence layer, suitable for app checkpointing. A non-relational, extensible, distributed DB, oriented to write-intensive workloads and targeted at persisting real-time processes, e.g. BGP.

Future work Evaluating various complex queries, aggregation, (partial) recovery, BTree indexes Further explore indexing wrt write/read tradeoff, various update patterns etc Models on top of KVL, e.g. object serialization, pointer (un-)swizzling, BigTable-like etc