Online Reliable State Persistence
Andrei Agapi, Rick Payne, Robert Broberg, Thilo Kielmann
FRIAR 2011
Outline
- FTSS; the KVL data model
- Communication (& other optimizations)
- BGP; BGP/KVL
- Distributed inodes
- Snapshots & complex queries
- Evaluation
FTSS overview
- Fast, malleable, non-relational DB; can be used as a fault-tolerant shared storage substrate for dynamic app state
- In-memory, 1-hop, performance-optimized DHT
- Independently failing components participate; all state is replicated on several of them and accessible from all
- Components self-monitor; membership is maintained dynamically as they fail/(re)join: malleable shared memory
- Support for arbitrary structured data, ~ a navigational DB
FTSS overview [2]
- Performance-optimized: e.g. data bulking, optimized link/unlink ops, write-intensive workloads
- API: Overwrite, Append, Delete, Fetch, Link, Unlink, Read links, Complex queries; each with bulk versions
Consistent Hashing-Based
As SEs join/fail, tuples are remapped, e.g.: A <- 6, 4; B <- 2; C <- 3; D <- 5, 1
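The remapping above follows from consistent hashing: keys and servers hash onto the same ring, and each key belongs to the first server clockwise from its hash, so when a server fails only that server's keys move (to its successor). A minimal sketch; the MD5-based ring and the server/key names are illustrative, not FTSS's actual scheme:

```python
import hashlib
from bisect import bisect_right

def _hash(s: str) -> int:
    # Map a string onto the ring [0, 2**32)
    return int(hashlib.md5(s.encode()).hexdigest(), 16) % (1 << 32)

class Ring:
    """Minimal consistent-hash ring: each key maps to the first server
    clockwise from its hash; servers may join/fail dynamically."""
    def __init__(self, servers):
        self._points = sorted((_hash(s), s) for s in servers)

    def owner(self, key: str) -> str:
        points = [p for p, _ in self._points]
        i = bisect_right(points, _hash(key)) % len(self._points)
        return self._points[i][1]

    def remove(self, server):
        # On failure, only that server's keys remap (to its successor)
        self._points = [(p, s) for p, s in self._points if s != server]

ring = Ring(["A", "B", "C", "D"])
before = {k: ring.owner(k) for k in map(str, range(1, 7))}
ring.remove("C")
after = {k: ring.owner(k) for k in before}
moved = [k for k in before if before[k] != after[k]]
# Only keys previously owned by C move; all others stay put
assert all(before[k] == "C" for k in moved)
```

The same property is what lets FTSS membership stay malleable: a join or failure disturbs only the keys adjacent to the affected ring position.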
Types of nodes
- Storage server
- Client (e.g. an app persisting its state)
- Monitor app (global statistics, snapshotting etc.)
The KVL model
- K: key, unique, used for distribution/replication over servers
- V: value associated with the key; tuples are randomly spread based on hashed keys; values can be updated (overwritable, appendable); all are symmetrically replicated; kept in a linked list at the server
- L: links, a unique set; the building block for handling structured data, arbitrary pointers, recovery indices, sets etc.; kept as a b-tree at the server, serialized at read/replication time
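A KVL tuple as described can be sketched as a small record; the method names mirror the API ops, while the Python set standing in for the server-side link b-tree is an illustration-only simplification:

```python
from dataclasses import dataclass, field

@dataclass
class KVLTuple:
    """One FTSS tuple: a unique key K, an updatable value V
    (overwritable or appendable), and a unique link set L."""
    key: str
    value: bytes = b""
    links: set = field(default_factory=set)  # b-tree at the server; a set here

    def overwrite(self, data: bytes):
        self.value = data

    def append(self, data: bytes):
        self.value += data

    def link(self, target_key: str):
        self.links.add(target_key)       # links are unique

    def unlink(self, target_key: str):
        self.links.discard(target_key)

t = KVLTuple("attr:1")
t.append(b"AS_PATH=100 200")
t.link("pfx:10.0.0.0/8")
t.link("pfx:10.0.0.0/8")   # duplicate: the link set stays unique
assert t.links == {"pfx:10.0.0.0/8"}
```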
Relevance for next-gen routing: FT BGP, and implementation status
- Used actively in FT BGP: the BGP daemon persists essential state as structured FTSS data: raw updates before ACK, processed state etc.
- FTSS addresses fail-stop failures; Byzantine failures can be handled orthogonally on top via other techniques (e.g. equivalent state machines)
- 1 BGP update = several FTSS writes, links, possibly deletes
Performance optimizations
- Copy reductions, malloc reductions along the stack, in-memory locking etc.
- Microbenchmarking + rewriting for performance
- Delaying block serialization to read/re-replication time, as opposed to write time
- Tuning of chunk sizes, cache sizes, direct I/O etc.
- Bulk ops & chunking
Communication Unicast (single key) updates Multicast updates Multi-key (bulk) updates
Communication: unicast (single-key) updates
- The App (BGP) sends an update for key K to node A
- A looks up the responsible server list according to the hash of K; replication_factor == 3 => B, C, D
- A unicasts the update to each affected server
- Each of B, C, D: looks up K; merges V (e.g. append); merges L (e.g. deletes links if they exist); updates its local b-trees and lists; ACKs
- A waits for replication_factor replies, then ACKs to the App
Communication: unicast (single-key) updates [2]
- replication_factor unicast packets (+ ACKs) per key update (or more if the KVL tuple is chunked)
- UDP + chunking + per-chunk ACKing for reliability
- Per-chunk retransmits at timeouts
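The chunking plus per-chunk ACK/retransmit scheme can be sketched as follows; the chunk size, the retry bound, and the callback-style transport are illustrative assumptions, not FTSS's actual wire format:

```python
CHUNK = 1024          # assumed chunk size (bytes)
MAX_TRIES = 5         # assumed retransmission bound

def chunk_update(key, payload):
    # Split one KVL update into independently ACKed chunks
    chunks = [payload[i:i + CHUNK] for i in range(0, len(payload), CHUNK)]
    return [(key, seq, len(chunks), c) for seq, c in enumerate(chunks)]

def send_reliably(send, chunks, collect_acks):
    """Per-chunk reliability over an unreliable (UDP-like) transport:
    each round retransmits only the chunks not yet ACKed."""
    pending = {seq for _, seq, _, _ in chunks}
    for _ in range(MAX_TRIES):
        for ch in chunks:
            if ch[1] in pending:      # retransmit un-ACKed chunks only
                send(ch)
        pending -= collect_acks()     # ACKs arrived since the last round
        if not pending:
            return True               # whole update delivered
    return False

# Loopback demo: every sent chunk is "received" and ACKed at once
received = set()
ok = send_reliably(lambda ch: received.add(ch[1]),
                   chunk_update("K", b"x" * 3000),
                   lambda: set(received))
assert ok and received == {0, 1, 2}
```

Retransmitting at chunk granularity means a single lost datagram never forces resending the whole (possibly large) KVL tuple.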
Communication: multicast updates
- The App (BGP) sends an update for key K to node A; A looks up the responsible server list according to the hash of K (replication_factor == 3 => B, C, D)
- A sends a single mcast packet to the N affected servers; the packet header includes all affected servers
- Destination filtering: the unicast addresses of affected nodes in the header let non-affected nodes (A, E) just ignore the mcast packet
- Each of B, C, D: looks up K; merges V (e.g. append); merges L (e.g. deletes links if they exist); updates its local b-trees and lists; ACKs (ACKs still use unicast)
- Mcast + UDP + chunking + per-chunk ACKing for reliability; per-chunk retransmits at timeouts; duplicate ACKs ignored
- A waits for replication_factor replies, then ACKs to the App
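The header-based destination filtering on the multicast path might look roughly like this; the packet layout and function names are hypothetical:

```python
def make_mcast_packet(key, payload, replicas):
    # One multicast packet; the header names every affected server
    return {"key": key, "affected": set(replicas), "payload": payload}

def on_mcast(server_id, packet, apply_update, send_unicast_ack):
    """Receiver-side destination filtering: servers not named in the
    header drop the packet; named ones apply it and ACK over unicast."""
    if server_id not in packet["affected"]:
        return False                      # e.g. A and E just ignore it
    apply_update(packet["key"], packet["payload"])
    send_unicast_ack(packet["key"], server_id)
    return True

# Replication factor 3 for key K: only B, C, D act on the packet
applied, acks = [], []
pkt = make_mcast_packet("K", b"v", ["B", "C", "D"])
for s in "ABCDE":
    on_mcast(s, pkt, lambda k, p: applied.append(k),
             lambda k, sid: acks.append(sid))
assert sorted(acks) == ["B", "C", "D"] and len(applied) == 3
```

A single mcast datagram thus replaces replication_factor unicasts, while the ACK path stays unicast so the sender can track delivery per replica.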
Multi-key bulk updates (unicast)
- The App (BGP) gives A updates for several keys at once, each a (K, V, L) tuple: K1, K2, K3
- A looks up the responsible server list for each K, e.g. K1: B, C, D; K2: E, A, B; K3: A, B, C
- A repacks the scheduled updates per destination server: A gets K2, K3; B gets K1, K2, K3; C gets K1, K3; D gets K1; E gets K2
- One unicast bulk update is sent to each affected server; A waits for replies from all affected servers
- Each of A, B, C, D, E: looks up its respective Ks; merges Vs; merges Ls; updates local b-trees and lists; sends one ACK for the full bulk update
- In the bulked version, chunking occurs independently of the # of updated keys; retransmits, timeouts etc. function equivalently
- (Substantially) fewer UDP packets and ACKs; improved throughput
- Mcast + header-based filtering could similarly be used for bulk ops as for single-key ops
Communication: multi-key (bulk) updates
- Most of the time, applications have more than one data item available to persist in parallel
- E.g. a BGP packet updates attributes for several prefixes, withdraws more than one prefix, independent packets are persisted in parallel etc.
- We optimize such cases via bulk operations: more data + ACKs in flight in the DHT; less (de-)marshaling; more efficient chunking (independent of the # of keys); shorter code paths
- FTSS has an optimization layer for bulk ops: any # of ops on different Ks (value overwrites, link-set links/unlinks etc.) can be given in parallel; KVL tuples are grouped by destination server, and only then chunked
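The grouping step of the bulk-op layer (repack KVL updates per destination server before chunking) can be sketched as follows; the toy placement mirrors the K1/K2/K3 example above, and the op names are illustrative:

```python
from collections import defaultdict

def repack_by_server(ops, replicas_of):
    """Group per-key ops by destination, so each affected server gets
    one bulk message instead of one message per key it replicates."""
    per_server = defaultdict(list)
    for key, op in ops:
        for server in replicas_of(key):
            per_server[server].append((key, op))
    return dict(per_server)

# Toy placement: K1 -> B,C,D; K2 -> E,A,B; K3 -> A,B,C
placement = {"K1": ["B", "C", "D"], "K2": ["E", "A", "B"], "K3": ["A", "B", "C"]}
bulks = repack_by_server(
    [("K1", "append"), ("K2", "link"), ("K3", "append")],
    placement.__getitem__)
assert sorted(k for k, _ in bulks["B"]) == ["K1", "K2", "K3"]
assert len(bulks) == 5    # 5 bulk messages replace 9 per-key unicasts
```

Chunking then runs per bulk message, so its cost is independent of how many keys each bulk carries.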
Using KVL
- A primitive allowing flexibility in optimizing priority ops / building scalable storage
- Can be used as a building block for: a navigational/object DB; centralized or distributed indices over data; write- vs. read-(recovery-)optimized layouts; collocating related data together etc.
- Support for complex queries via indexing & distributed key searching
Use case: BGP Main inter-domain routing protocol of the Internet ISPs announce to neighbors and further propagate which prefixes they offer routes for (+ attributes) Based on BGP information it collects, an ISP ultimately decides how to route its traffic (=BGP Best Path Selection algorithm)
Use case: BGP [2]
- Basically, changes in Internet routing state are disseminated via BGP peering between ISPs (eBGP) and within an ISP (iBGP)
- Info about a prefix: AS_PATH, ORIGIN, MED, NEXT_HOP, COMMUNITY, LOCAL_PREF, WEIGHT etc.
Use case: BGP [3]
Conceptually, 2 types of updates (which can be bulked together):
- Prefix withdrawals (== the peer doesn't route that prefix at all anymore)
- Prefix announcements (== the only way to update attributes for a given path)
Use case: BGP [4]
- Typically BGP speakers run on routers, installed on the RP node; resets are handled via e.g. BGP Graceful Restart (basically the full table is resynchronized with neighbors), ~10 mins
- At peak (e.g. at some peer's reset), one can see very high update rates (full table ~300K prefixes)
Use case: BGP [5]
- External protocol speakers (e.g. BGP peers S1, S2, ..., Sn) send protocol updates through an FT frontend (e.g. TCP) layer
- A SHIM handles app state checkpoint/restore, update flow, and protocol state summarization into FTSS
- FTSS: a fast, fault-tolerant, malleable online database for generic application state; it covers fail-stop failures
- Siblings 1..m: paravirtualized, codebase-independent protocol implementations (~ "protocol device drivers"); equivalent siblings address Byzantine failures
BGP to KVL: a simple implementation
- 1 KVL tuple per peer (PeerEntry)
- 1 per unique set of attributes (AttrEntry)
- Prefixes held as entries in the L section of all corresponding AttrEntries
- 1 recovery index (RootInode)
BGP to KVL: a simple implementation [2]
[Figure: the recovery index points to PeerEntries (Peer 1..n); each peer links to AttrEntries (AttrEntry 1..m); each AttrEntry links to its prefixes (Prefix 1..p)]
BGP to KVL: a simple implementation [3]
[Figure: as above, but each AttrEntry holds its prefixes in a prefix tree ("Pfx Tree") rather than a flat list]
A BGP-to-KVL illustration
- A BGP update (attributes + announced prefixes 1..n + withdrawn prefixes 1..m) maps onto a set of KVL tuples
- Top inode KVL: holds links to the distinct attribute KVLs
- Attribute KVL: the attribute data itself is in V; L holds unique links (or immediate data) to all KVLs of announced paths that have those attributes
- Prefix KVL: prefix data + empty links; alternatively it can be missing altogether, with the data held as links in the respective ATTR KVL
A BGP-to-KVL illustration [2]
- Given this structure, we can translate a BGP update into a number of separate (K, V, L) updates (links, unlinks, data overwrites etc.)
- The bulk update optimizer then repacks the updates per destination
- As we typically have more key updates than servers, this matters, e.g.: 100 updates, 5 servers, 2 replicas => ~40 updates to be bulked per server
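One possible shape for that translation step, under the RootInode/AttrEntry/prefix-link mapping above; the key naming scheme (`root`, `attr:`, `pfx:`, `peer:`) and the SHA1-derived AttrEntry key are invented for this sketch:

```python
import hashlib

def attr_key(attrs):
    # One AttrEntry per unique attribute set (hypothetical key scheme)
    digest = hashlib.sha1(repr(sorted(attrs.items())).encode()).hexdigest()[:8]
    return "attr:" + digest

def bgp_update_to_kvl(peer, announced, withdrawn, attrs):
    """Translate one BGP update into a flat list of (key, op, arg)
    KVL operations, ready for the bulk update optimizer to repack."""
    ops = []
    if announced:
        a = attr_key(attrs)
        ops.append((a, "overwrite", repr(attrs)))    # attribute data in V
        ops.append(("root", "link", a))              # reachable from the index
        for pfx in announced:
            ops.append((a, "link", "pfx:" + pfx))    # prefix held as a link
    for pfx in withdrawn:
        ops.append(("peer:" + peer, "unlink", "pfx:" + pfx))
    return ops

ops = bgp_update_to_kvl("peer1", ["10.0.0.0/8", "192.0.2.0/24"],
                        ["198.51.100.0/24"], {"AS_PATH": "100 200"})
assert len(ops) == 5      # 1 write + 1 index link + 2 prefix links + 1 unlink
```

The repacking arithmetic from the slide then follows directly: 100 such key updates at replication factor 2 yield 200 tuple updates which, spread over 5 servers, bulk to ~40 per server.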
Load balancing tuples & links
Distributed indexes Implemented as distributed B+trees (B-link trees) We leverage DHT topology maintenance for server pool Attempt to minimize locking/synchronization while: remaining malleable remaining consistent handling DHT routing inconsistencies
B-tree [Figure: a small B-tree]
Bigger B-tree [Figure: the same structure after more insertions]
B-tree insertion / node split [Figure: inserting +28 causes no node split; inserting +70 causes a node split]
B-tree rebalancing [Figure: an insertion handled by rebalancing instead of a node split]
B-tree root split [Figure: inserting +70 then +95 splits the root, growing the tree's height]
B-tree delete + (cascade) node merge [Figure: deleting -60 triggers a node merge]
B-tree properties
- Ops are logarithmic (with a large base, e.g. ~100 or more) in the # of keys stored
- Keys are kept in order
- Load is balanced among all nodes
- There are concurrent versions, versions that minimize locking, distributed versions, top-down/single-pass splitting, bulk updating etc.
- The structure of choice for DB indexes etc.
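The insertion and node-split behavior shown in the preceding figures can be sketched with a minimal in-memory B-tree; this is textbook single-pass (top-down) insertion with a tiny minimum degree for readability, not FTSS's distributed implementation:

```python
class BNode:
    def __init__(self, leaf=True):
        self.keys, self.children, self.leaf = [], [], leaf

class BTree:
    """Top-down/single-pass insertion: full nodes are split on the way
    down, so a second upward pass is never needed."""
    def __init__(self, t=2):                  # t = minimum degree (tiny here)
        self.t, self.root = t, BNode()

    def _split(self, parent, i):
        # Split the parent's full i-th child around its median key
        t, child = self.t, parent.children[i]
        right = BNode(child.leaf)
        right.keys = child.keys[t:]
        mid = child.keys[t - 1]
        child.keys = child.keys[:t - 1]
        if not child.leaf:
            right.children = child.children[t:]
            child.children = child.children[:t]
        parent.keys.insert(i, mid)
        parent.children.insert(i + 1, right)

    def insert(self, k):
        if len(self.root.keys) == 2 * self.t - 1:   # root split grows height
            old, self.root = self.root, BNode(leaf=False)
            self.root.children.append(old)
            self._split(self.root, 0)
        node = self.root
        while not node.leaf:
            i = sum(k > key for key in node.keys)   # child to descend into
            if len(node.children[i].keys) == 2 * self.t - 1:
                self._split(node, i)
                if k > node.keys[i]:
                    i += 1
            node = node.children[i]
        node.keys.insert(sum(k > key for key in node.keys), k)

def inorder(n):
    # Flatten the tree; comes out sorted if the structure is right
    if n.leaf:
        return list(n.keys)
    out = []
    for i, c in enumerate(n.children):
        out += inorder(c) + n.keys[i:i + 1]
    return out

bt = BTree(t=2)
for x in [28, 70, 5, 95, 60, 12, 34]:
    bt.insert(x)
assert inorder(bt.root) == sorted([28, 70, 5, 95, 60, 12, 34])
```

With a realistic t of ~100, each node spans hundreds of keys, which is what makes the logarithm's base large.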
B-tree: DHT specifics / issues
- B-tree nodes are regular KVL tuples maintained in the DHT => B-tree nodes are replicated
- Pointers are DHT keys rather than explicit servers; we rely on DHT table lookups to resolve them
- Catch: node views on membership may be inconsistent
- Our B-tree supports reconfiguration while remaining consistent, with minimal locking, via BNode versioning
- Our main use case is bulk updates from a single writer => not much concurrency, but B-tree replication + malleability issues remain
Routing inconsistencies [Figure, 4 steps]
- A's view of K's replicas: C, E, F; B's view: D, E, F — membership views may diverge
- (1, 2) A issues an RPC to K's replicas; D is late to answer, so A deems it dead
- (3) A lets everyone know D died
- (4) Between steps 3 and 4, requests routed on the stale view can reach an inconsistent replica set
KVL B-tree node splitting/naming
[Figure: a B-tree node named K at version v holds keys 1..4; inserting +5 splits it into fresh versioned names K:v+1 (keys 1, 2, 3) and K:v+2 (keys 3, 4, 5); further insertions (+0, +6, +7) split again, producing e.g. K:v+5 — node names embed the B-tree version at which the node was created]
Node fields
- Node Name (the node's DHT key)
- Node Version: the last write which passed through the node (modifying it or not)
- Creation version: at which B-tree version this node was created
- Parent Key; Sibling Key Left; Sibling Key Right
- Per child i: Child Key i, Child Version i (the last successful write of that child subtree), Separator Value i
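The node layout above might be expressed as a record like the following; the field names are paraphrased from the slide and the concrete types are assumptions:

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class ChildRef:
    child_key: str                    # DHT key of the child node's KVL tuple
    child_version: int                # last successful write of that subtree
    separator: Optional[int] = None   # separator value between children

@dataclass
class BTreeNode:
    node_name: str                # DHT key; gets a fresh versioned name on splits
    node_version: int             # last write that passed through this node
    created_at_version: int       # B-tree version at which the node was created
    parent_key: Optional[str] = None
    sibling_key_left: Optional[str] = None
    sibling_key_right: Optional[str] = None
    children: List[ChildRef] = field(default_factory=list)

root = BTreeNode("K:v", node_version=7, created_at_version=3)
root.children.append(ChildRef("K:v+1", child_version=7, separator=3))
root.children.append(ChildRef("K:v+2", child_version=5))
```

Carrying versions on both the node and each child reference is what lets a reader detect a stale path (e.g. a replica that missed a split) without locking the whole tree.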
Regular writes & reads
[Figure, 4 frames: the client walks the tree top-down via DHT lookups on node names K:N1..K:N5, comparing node versions (V+N, V+1, V+6) along the path and advancing the version (V+N+1) on the written path]
Internal node split
- A node needs to split
- The servers agree upon some Authoritative Topology (AT); leader election is based on it
- The leader performs the split, creating a fresh versioned node name (K:V+N)
- The AT and the split value are propagated ([recurse] up the tree if the parent must split too)
Snapshots
- Some workloads are not optimized for the particular inode indexing structure
- E.g. the BGP DB is indexed on (AttrEntry, Prefix), but RIS bview traces are organized on Prefix first (then Peer)
- The client library includes a snapshotting interface, whereby a snapshot can be built at the client, then bulk-written
Complex queries
recover_all(root_set, inode_range_query, where_query)
- root_set: set of inodes to start from
- inode_range_query: range query on the inode, e.g. keys in [K1, K2)
- where_query: condition on retrieved nodes, e.g. contains_some(key_set)
E.g.:
recover_all(pfx_inode, [PFX_left, PFX_right], contains_some(AS507, AS100))
recover_all(pfx_inode, *, contains_some(AppID1, AppID2))
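A minimal in-memory sketch of recover_all's semantics, assuming a dict-backed store of KVL tuples and string keys; the real version runs distributed over the DHT, and the half-open (None, None) range plays the role of '*':

```python
def contains_some(*needles):
    # where_query: true if the tuple's value mentions any of the needles
    return lambda tup: any(n in tup["value"] for n in needles)

def recover_all(store, root_set, inode_range=(None, None), where=lambda t: True):
    """Walk inodes starting from root_set, follow links whose keys fall
    inside the inode range, and keep keys whose tuples satisfy where."""
    lo, hi = inode_range
    out, seen, stack = [], set(), list(root_set)
    while stack:
        key = stack.pop()
        if key in seen:
            continue
        seen.add(key)
        tup = store.get(key)
        if tup is None:
            continue
        if where(tup):
            out.append(key)
        for link in tup["links"]:
            if lo is None or lo <= link < hi:   # range filter on inode keys
                stack.append(link)
    return out

store = {
    "root": {"value": "", "links": ["k1", "k2", "k5"]},
    "k1": {"value": "AS100 10.0.0.0/8", "links": []},
    "k2": {"value": "AS507 192.0.2.0/24", "links": []},
    "k5": {"value": "AS999 198.51.100.0/24", "links": []},
}
hits = recover_all(store, ["root"], ("k1", "k3"), contains_some("AS507", "AS100"))
assert sorted(hits) == ["k1", "k2"]
```

The range query prunes the link traversal, while the where predicate filters the tuples actually retrieved — mirroring the two-argument form on the slide.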
Evaluation: FTSS
Writes cost per key
Writes throughput
Writes cost per key
Writes scaling in replication factor
Writes scaling in replication factor (tpt)
Link ops cost per key
Link ops throughput
Link ops cost per key
Link ops scaling in rep. factor
Writes effect of bulking (& copy reduction)
Evaluation: BGP/FTSS
Prefixes per peer distribution
Prefixes per peer distribution [2]
Internet table growth
Processing speed
- The current implementation of the above achieves ~431.00 BGP updates/second
- 1 BGP update = several link/unlink ops, possibly writes/deletes
- There is further room for improvement via more bulking and/or better load balancing (e.g. node allocation on the ring, a decentralized recovery index)
Load balancing: overall bytecount
Load balancing tuples & links
CDF of individual tree load
Conclusions
- FTSS is a fast & malleable persistence layer, suitable for app checkpointing
- A non-relational, extensible, distributed, write-intensive DB, targeted at persisting real-time processes, e.g. BGP
Future work Evaluating various complex queries, aggregation, (partial) recovery, BTree indexes Further explore indexing wrt write/read tradeoff, various update patterns etc Models on top of KVL, e.g. object serialization, pointer (un-)swizzling, BigTable-like etc