Lessons from an Internet-Scale Notification System
Atul Adya
History
- End-client notification system Thialfi, presented at SOSP 2011
- Since then:
  - Scaled by several orders of magnitude
  - Used by many more products and in different ways
- Several unexpected lessons
The Case for Notifications
- Ensuring cached data is fresh across users and devices
- [Figure: a "Colin is online" update propagating to Bob's browser, Alice's notebook, and Phil's phones]
Common Pattern #1: Polling
- [Figure: clients repeatedly asking "Did it change yet?" and almost always hearing "No!"]
- Cost and speed issues at scale: 100M clients polling at 10-minute intervals => 166K QPS
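For reference, the arithmetic behind that figure: 100 million clients each polling once every 10 minutes (600 s) means roughly 100,000,000 / 600 ≈ 167,000 requests per second arriving at the backend, nearly all of which return "no change".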
Common Pattern #2: App pushes updates over point-to-point channels
- Complicated for every app to build
- Plumbing: fan out to endpoints, manage channels (HTTP, GCM, XMPP), ensure reliable delivery
- Bookkeeping: object ids, endpoints, registrations, cursors, ACLs, pending messages
- [Figure: a "Colin is online" update pushed by the app over per-endpoint channels]
Our Solution: Thialfi
- Scalable: handles hundreds of millions of clients and objects
- Fast: notifies clients in less than a second
- Reliable: even when entire data centers fail
- Easy to use and deploy: Chrome Sync (Desktop/Android), Google Plus, Contacts, Music, GDrive
Thialfi Programming Overview
- [Figure: clients C1 and C2 register for object X via the Thialfi client library; the application backend publishes Update X to the Thialfi service, which records X: C1, C2 and sends Notify X to both registered clients in their data centers]
Thialfi Architecture
- [Figure: clients reach the data center over HTTP/XMPP/GCM via the client library; a Bridge translates messages coming from the application backend]
- Registrar (Client Bigtable): Client ID -> registered objects, unacked messages; handles registrations, notifications, acknowledgments
- Matcher (Object Bigtable): Object -> registered clients, version
Thialfi Abstraction
- Objects have unique IDs and version numbers, monotonically increasing on every update
- Delivery guarantee: registered clients learn the latest version number
- Reliable signal only: "object X is at version Y", not the data itself (think cache invalidation; see the sketch below)
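To make the abstraction concrete, here is a minimal sketch of what an application-facing client interface could look like. The names (NotificationListener, register, acknowledge, etc.) are illustrative assumptions, not the actual Thialfi library API.

```java
// Hypothetical sketch of a Thialfi-style client interface (names are illustrative).
interface NotificationListener {
  // The service guarantees the client eventually learns the latest version of
  // each registered object; the data itself is fetched from the app backend.
  void onInvalidate(String objectId, long version);

  // Sent when the exact version is unknown (soft state was lost);
  // the client should refetch the object unconditionally.
  void onInvalidateUnknownVersion(String objectId);
}

interface NotificationClient {
  void register(String objectId);    // start receiving invalidations for objectId
  void unregister(String objectId);  // stop receiving them
  void acknowledge(String objectId, long version);  // confirm delivery so the server can drop state
}
```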
Thialfi Characteristics
- Built around soft state
  - Recover registration state from clients
  - Lost notification signal: InvalidateUnknownVersion
- Registration-Sync: exchange a hash of registrations between client and server (sketch below)
  - Helps in edge cases, with async storage, and on cluster switches
- Multi-platform
  - Libraries: C++, Java, JavaScript, Objective-C
  - OS: Windows/Mac/Linux, browsers, Android, iOS
  - Channels: HTTP, XMPP, GCM, internal RPC
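A minimal sketch of the Registration-Sync idea: both sides compute a digest over their view of the client's registrations and compare; a mismatch triggers a full exchange. The digest scheme here (SHA-256 over sorted object IDs) is an assumption for illustration, not the actual protocol.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.SortedSet;

final class RegistrationDigest {
  // Order-independent summary of a registration set: hash the sorted object IDs.
  static byte[] digest(SortedSet<String> objectIds) throws Exception {
    MessageDigest md = MessageDigest.getInstance("SHA-256");
    for (String id : objectIds) {
      md.update(id.getBytes(StandardCharsets.UTF_8));
      md.update((byte) 0);  // separator so {"ab","c"} differs from {"a","bc"}
    }
    return md.digest();
  }

  // Client and server periodically exchange digests; if they differ, the client
  // re-sends its full registration list and the server reconciles its soft state.
  static boolean inSync(SortedSet<String> clientRegs, SortedSet<String> serverRegs)
      throws Exception {
    return MessageDigest.isEqual(digest(clientRegs), digest(serverRegs));
  }
}
```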
Some Lesions...
Ouch! I mean, Lessons
Lesson 1: Is this thing on?
- You launch your system and no one is using it yet: how do I know it is working?
- People start using it: is it working now?
- Suppose you magically know it works 99.999% of the time: which 99.999%?
- How to distinguish among ephemeral, disconnected, and buggy clients?
- You can never know
Lesson 1: Is this thing on?
- What's the best you can do?
- Continuous testing in production
  - But you may not be able to get client-side monitoring
- Look at server graphs
  - End-to-end, e.g., latency
  - More detailed, e.g., Reg-Sync rate per client type
Lesson 1: Is this thing on?
- But graphs are not sufficient
  - Even when everything looks right, averages can be deceptive
  - How do you know if you are missing some traffic?
- Have other ways of getting reports: customer monitoring, real customers, Twitter, ...
Lesson 2: And you thought you could debug?
- Monitoring indicates that there is a problem
  - Server text logs: hard to correlate
  - Structured logging: may have to log selectively (e.g., cannot log the incoming stream multiple times)
  - Client logs: typically not available
  - Monitoring graphs: can be too many signals
- A specific user has a problem (needle in a haystack)
  - Structured logging, if available
  - Custom production code!
War Story: VIP Customer
- Customer unable to receive notifications; the whole team spent hours looking
- Early on, debugging support was poor
  - Text logs had rolled over
  - Structured logs were not there yet
  - Persistent state had no history
- Eventually got lucky
  - Version numbers were timestamps
  - Saw that the last notification version was very old
  - Deflected the bug
Opportunity: Monitoring & Debugging Tools
- Automated tools to detect anomalies (machine-learning based?)
- Tools for root-cause analysis: which signals to examine when a problem occurs
- Finding needles in a haystack: dynamically switch on debugging for a needle, e.g., trace a client's registrations and notifications (sketch below)
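One way the "switch on debugging for a needle" idea could look in practice: verbose tracing gated on a dynamically updatable set of client IDs, so per-client detail is available on demand without flooding production logs. The class, flag mechanism, and log fields are hypothetical, for illustration only.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.logging.Logger;

// Hypothetical per-client debug tracing: verbose logs are emitted only for
// client IDs an operator has added to a dynamically updatable set
// (e.g., pushed via a config/flag service).
final class DebugTracer {
  private static final Logger log = Logger.getLogger("thialfi.trace");
  private final Set<String> tracedClients = ConcurrentHashMap.newKeySet();

  void traceClient(String clientId) { tracedClients.add(clientId); }     // operator action
  void untraceClient(String clientId) { tracedClients.remove(clientId); }

  void onRegistration(String clientId, String objectId) {
    if (tracedClients.contains(clientId)) {
      log.info("client=" + clientId + " registered object=" + objectId);
    }
  }

  void onNotification(String clientId, String objectId, long version) {
    if (tracedClients.contains(clientId)) {
      log.info("client=" + clientId + " notified object=" + objectId + " v=" + version);
    }
  }
}
```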
Lesson 3: Clients considered harmful
- Started out believing: offloading work to clients is good
- But client code is painful
  - Maintenance burden across multiple platforms
  - Upgrades take days, weeks, months, years... or never happen
  - Hurts evolution and agility
War Story: Worldwide crash of Chrome on Android (alpha)
- Switched a flag to route message delivery through a different client code path
- Tested this path extensively
- Unfortunately, our Android code did network access from the main thread on this path
- OS versions newer than those in our tests crashed the application when this happened
War Story: Strange Reg-Sync Loops
- Discovered unnecessary registrations for a (small) customer
- Some JavaScript clients were stuck in a Reg-Sync loop
- Theories: races; a bug in the app, the library, or Closure; ...
- Theory: HTTP clients switching too often? Nope!
War Story: Buggy Platform
- Logged the platform of every Reg-Sync-looping client
- Found "6.0", which meant Safari
- Wrote a test but failed to find the bug
- Engineer searched for "safari javascript runtime bug"
- Ran the test in a loop: the SHA-1 hash was not the same across runs of the loop!
- Safari's JavaScript JIT sometimes miscompiled i++ as ++i
Future direction: Thin client
- Move complexity to where it can be maintained
- Remove most code from the client; make the library a thin wrapper around the API
- Plan to use Spanner (a synchronous store), but still keep the soft-state aspects of Thialfi
Lesson 4: Getting your foot (code) in the door
- Developers will use a system iff it obviously makes things better than doing it on their own
- Clean semantics and reliability are not the selling points you think they are
- Customers care about features, not properties
Lesson 4: Getting your foot (code) in the door
- May need "unclean" features to get customers
  - Best-effort data along with versions
  - Support for special object ids for users
  - Added a new server (Bridge) for translating messages
- Customers may not be able to meet your strong requirements
  - Version numbers are not feasible for many systems; allow timestamps instead of version numbers
Lesson 4: Getting your foot (code) in the door
- Understand their architecture and review their code for integrating with your system
  - Error path broken: InvalidateUnknownVersion
  - Naming matters: changed it to MustResync
- Know where your customer's code is, so that you can migrate them to newer infrastructure
- Debugging tools are also needed for bug deflection
War Story: "Thialfi is unreliable"
- A team used Thialfi as a reliable backup path to augment their unreliable fast path
- They experienced an outage when their fast path became really unreliable
- They informed us that Thialfi was dropping notifications!
- Investigation revealed: under stress, their backend dropped messages on its own path and gave up publishing into Thialfi after a few retries
Lesson 5: You are building your castle on sand
- You will do a reasonable job thinking through your own design, protocols, failures, etc.
- Your outage is likely to come from a violation of one of your assumptions, or from another system several levels of dependencies away
War Story: Delayed replication in Chrome Sync
- A Chrome backend dependency stopped sending notifications to Thialfi
- When it unwedged, traffic went up by more than 3x; we only had capacity for 2x
- [Figure: incoming feed QPS over time]
War Story: Delayed replication in Chrome Sync
- Good news: internal latency remained low and the system did not fall over
- Bad news: end-to-end latency spiked to minutes for all customers
- Isolation was not strong enough: not only Chrome Sync but all customers saw elevated latency
Opportunity: Resource Isolation
- Need the ability to isolate customers from each other
- A general problem for shared infrastructure services
War Story: Load balancer config change
- Thialfi needs clients to be stable w.r.t. clusters
  - Clients should not globally reshuffle during a single-cluster outage
- A change to the inter-cluster load balancer config removed ad hoc cluster stickiness
  - Previously discussed with the owning team
- The config change caused large-scale loss of cluster stickiness for clients
War Story: Load balancer config change
- [Figure: number of active clients over time]
- Client flapping between clusters caused an explosion in the number of active clients
- The same client was using resources many times over
Fix: Consistent hash routing
- Reverted the load balancer config change
- Use consistent hashing for cluster selection: route each client based on its client id (sketch below)
- Not geographically optimal
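A minimal sketch of consistent-hash cluster selection as described on this slide: a given client id always maps to the same cluster, and removing a cluster only moves that cluster's clients. The ring-with-virtual-nodes construction and the CRC32 hash are standard-technique assumptions for illustration, not necessarily what Thialfi's router does.

```java
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.SortedMap;
import java.util.TreeMap;
import java.util.zip.CRC32;

// Sketch: consistent hashing over clusters keyed by client id.
final class ConsistentClusterRouter {
  private static final int VIRTUAL_NODES = 100;  // assumed; smooths the distribution
  private final SortedMap<Long, String> ring = new TreeMap<>();

  ConsistentClusterRouter(List<String> clusters) {
    for (String cluster : clusters) {
      for (int i = 0; i < VIRTUAL_NODES; i++) {
        ring.put(hash(cluster + "#" + i), cluster);
      }
    }
  }

  // Same client id -> same cluster, for as long as the cluster set is unchanged.
  String clusterFor(String clientId) {
    long h = hash(clientId);
    SortedMap<Long, String> tail = ring.tailMap(h);
    Long key = tail.isEmpty() ? ring.firstKey() : tail.firstKey();  // wrap around the ring
    return ring.get(key);
  }

  private static long hash(String s) {
    CRC32 crc = new CRC32();  // placeholder hash; a real router would use something stronger
    crc.update(s.getBytes(StandardCharsets.UTF_8));
    return crc.getValue();
  }
}
```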
Opportunity: Geo-aware stable routing
- Stable: a client goes to the same cluster for long periods of time
- Geographically aware
- How to ensure clients are somewhat uniformly distributed?
- How to add new clusters or shut down clusters (e.g., for maintenance)?
Lesson 6: The customer is always right
- Customers will ask for anything and everything
- Tension between keeping the system pure and well-structured and responding to customers' needs
- Cf. "Getting your foot in the door"
Initial model: No payload support (the model we had at SOSP 2011)
- Developers want reliable, in-order data delivery
- But it adds complexity to Thialfi and the application
  - Hard state, arbitrary buffering
  - Offline applications are flooded with data on wakeup
- For most applications, a reliable signal is enough
  - Invoke the polling path on the signal: simplifies integration
War Story: No payloads hurt Chrome Sync
- Logistics: requires a cache to handle backend fetches
  - Backend writers wanted one team to build a cache
- Technical: lost updates with multi-master async stores
  - No monotonically increasing version
  - Modify an object in clusters A and B: conflict resolution needs both updates, but only the last update from one of them is delivered
Fix: Add payload support
- Expose a pub/sub-like API: all updates are sent to the client, no version numbers
- What about the data problems mentioned earlier?
  - The system can throw data away when there is too much of it and send a MustResync signal (sketch below)
  - Clients are required to fetch only on MustResync
- Still believe the reliable signal is the most important aspect of a notification system; data is just the icing on the cake
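A minimal sketch of the drop-to-MustResync idea: each client gets a bounded buffer of pending payloads; on overflow the server discards the buffered data and records that the client must resync (i.e., fetch from the backend). The class name, limit, and delivery interface are illustrative assumptions.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

// Sketch: bounded per-client buffer of pending payloads. On overflow the
// buffered data is dropped and replaced by a single MustResync marker, so the
// server never accumulates unbounded hard state for a slow or offline client.
final class PendingDeliveries {
  private static final int MAX_PENDING = 1000;  // assumed limit, for illustration
  private final Deque<byte[]> pending = new ArrayDeque<>();
  private boolean mustResync = false;

  synchronized void enqueue(byte[] payload) {
    if (mustResync) {
      return;  // already degraded to a resync signal; no point buffering more
    }
    if (pending.size() >= MAX_PENDING) {
      pending.clear();     // throw away the data...
      mustResync = true;   // ...and tell the client to refetch from the backend
      return;
    }
    pending.addLast(payload);
  }

  // Called when the client connects: either all buffered payloads, or null to
  // indicate the caller should send a MustResync signal with no data.
  synchronized List<byte[]> drainOrResync() {
    if (mustResync) {
      mustResync = false;
      return null;
    }
    List<byte[]> out = List.copyOf(pending);
    pending.clear();
    return out;
  }
}
```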
Lesson 6: Except when they are not
- Latency and SLAs: if you ask, customers will tell you they need <100 ms, 99.999% availability, and 5-minute response times when paged
- Lesson: don't ask your customers
- Thialfi averages 0.5-1 second: seems to be fine
War Story: Unused big feature
- An important customer wanted a large number of objects per client
- We wanted to scale in various dimensions
  - Optimized the architecture to never read all registrations together, never keep them all in memory, etc.
  - For Reg-Sync, added Merkle tree support, but never shipped it...
- Most apps use few (often one!) objects per client. Why? They migrated from polling!
- The same customer ended up with few objects per client!
Lesson 7: You cannot anticipate the hard parts
- The initial Thialfi design spent enormous energy on making the notification path efficient
- Once we got into production, we added hundreds of milliseconds of batching for efficiency
- No one cared...
Lesson 7: You cannot anticipate the hard parts
- The hard parts of Thialfi actually are:
- Registrations: getting the client and the data center to agree on registration state with asynchronous storage is tough
  - Reg-Sync solved a number of edge cases
- Wide-area routing: the earliest Thialfi design ignored this issue completely
  - Had to hack it in on the fly; took significant engineering effort to redo it properly
Lesson 7: You cannot anticipate the hard parts
- Client library and its protocol
  - Did not pay attention initially: it grew organically
  - Had to redesign and rebuild this part completely
- Handling overload (sketch below)
  - Admission control to protect a server
  - Push back to the previous server in the pipeline
  - Sometimes better to drop data and issue MustResync
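One way the admission-control-plus-pushback idea could be sketched: a server bounds its in-flight work and rejects excess requests so the previous stage in the pipeline sees the pushback and can retry, slow down, or degrade to MustResync. The semaphore-based gate and the names in the usage comment are assumptions for illustration.

```java
import java.util.concurrent.Semaphore;

// Sketch: simple admission control. Each stage admits only a bounded number of
// in-flight requests; when full it rejects immediately, and the upstream stage
// treats the rejection as pushback (retry later, shed load, or drop to MustResync).
final class AdmissionController {
  private final Semaphore slots;

  AdmissionController(int maxInFlight) {
    this.slots = new Semaphore(maxInFlight);
  }

  // Returns false instead of queueing: the caller (the previous server in the
  // pipeline) must handle the rejection rather than piling up work here.
  boolean tryAdmit() {
    return slots.tryAcquire();
  }

  void release() {
    slots.release();
  }
}

// Usage sketch at a pipeline stage (names hypothetical):
//   if (!admission.tryAdmit()) {
//     return Status.OVERLOADED;  // pushback to the previous server
//   }
//   try { process(request); } finally { admission.release(); }
```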
Lesson summary
1. Is this thing on?
2. And you thought you could debug
3. Clients considered harmful
4. Getting your foot (code) in the door
5. You are building your castle on sand
6. The customer is sometimes right
7. You cannot anticipate the hard parts
More Information
Thialfi: A Client Notification Service for Internet-Scale Applications.
Atul Adya, Gregory Cooper, Daniel Myers, Michael Piatek. SOSP 2011.
Acknowledgements
Engineers, interns, and alumni: Phil Bogle, James Chacon, Greg Cooper, Matthew Harris, Vishesh Khemani, Nick Kline, Colin Meek, Daniel Myers, Connor Brem, Xi Ge, Larry Kai, Michael Piatek, Naveen Sharma, Shao Liu, Kyle Marvin, Joy Zhang