SEDA: An Architecture for Robust, Well-Conditioned Internet Services
Matt Welsh
UC Berkeley Computer Science Division
mdw@cs.berkeley.edu
http://www.cs.berkeley.edu/~mdw/proj/seda
Motivation - Internet Services Today
- Massive concurrency demands
  - Yahoo: 1.2 billion+ pageviews/day; AOL web caches: 10 billion hits/day
- Load spikes are inevitable - the Slashdot Effect
  - Traffic on September 11, 2001 overloaded many news sites
  - Denial-of-service attacks increasingly common
- Peak load is orders of magnitude greater than average
  - In this regime, overprovisioning is infeasible
  - Load spikes occur exactly when the service is most valuable!
- Increasingly dynamic and complex
  - Not just static web pages anymore: e-commerce, stock trading, driving directions, etc.
  - New application domains, e.g., Web services
  - Service logic evolves rapidly
Massive Overload is Sudden and Unpredictable
- September 11 - unprecedented web traffic
  - CNN: 20x over expected peak - 30,000+ hits/sec
  - Grew server farm by 5x by scrounging machines, but still no service for 2.5 hours
- USGS site load after M7.1 earthquake
  - Three orders of magnitude increase in 10 minutes; disk log filled up
[Figure: USGS Web server load - hits per second over a 24-hour period]
Service building is a black art
- Supporting massive concurrency is hard
  - Threads/processes are designed for timesharing and don't scale
  - High overheads and memory footprint
- Existing systems do not provide graceful management of load
  - Standard OSs strive for maximum resource transparency
  - Static resource containment is inflexible: how to set a priori resource limits for widely fluctuating loads?
  - Load management demands a feedback loop
- Dynamics of services exaggerate these problems
- Much work on performance/robustness for specific services
  - e.g., fast, event-driven Web servers
  - As services become more dynamic, this engineering burden is excessive
  - Services increasingly hosted on general-purpose facilities
Proposal: The Staged Event-Driven Architecture
- SEDA: a new architecture for Internet services
  - A general-purpose framework for high concurrency and load conditioning
  - Decomposes applications into stages separated by queues
  - Adopts a structured approach to event-driven concurrency
- Overload management is explicit in the programming model
  - Event queues expose request streams to the application
  - Perform load conditioning by filtering or prioritizing queues
- Dynamic control for self-tuning resource management
  - System observes application performance and tunes runtime parameters
  - Apply control for graceful degradation: load shedding or degraded service under overload
- Simplify the task of building highly concurrent services
  - Decouple load management from service complexity
  - Use of stages supports modularity, code reuse, debugging
  - Dynamic control shields apps from the complexity of resource management
Outline
- Problems with Threads and Event-Driven Concurrency
- The Staged Event-Driven Architecture
- SEDA Implementation
- Application Study: High-Performance HTTP Server
- Using Control for Overload Prevention
- Future Work and Conclusions
Classic model: One thread per task
- Threads are designed for timesharing
  - High resource usage, context-switch overhead, contended locks
- Too many threads: throughput meltdown, response time explosion
- How to determine the right number of threads?
- Regardless of performance, threads are fundamentally the wrong interface
[Figure: Throughput (tasks/sec) and latency (msec) vs. number of threads (1 to 1024); throughput collapses and latency explodes beyond a few hundred threads, compared to the linear ideal]
Event-driven Concurrency
- Multiplex many requests (FSMs) over a small number of threads
- Ideal performance: flat throughput, linear response time penalty
- Many examples: Click router, Flash web server, TP monitors, etc.
- Difficult to engineer, modularize, and tune
  - Typically very brittle: application-specific event scheduling
  - FSM code can never block (but page faults, garbage collection force a block)
[Figure: Throughput (tasks/sec) vs. number of tasks in pipeline (1 to 1,048,576); throughput stays flat as the pipeline grows]
Staged Event-Driven Architecture (SEDA)
[Diagram: HTTP server as a network of stages connected by event queues - accept connection, read packet, parse packet, check cache (cache hit: send response; cache miss: handle miss, file I/O), write packet]
- Decompose service into stages separated by queues
  - Each stage performs a subset of request processing
- Stages use lightweight event-driven concurrency internally
  - Nonblocking I/O interfaces are essential
- Queues make load management explicit
- Stages contain a thread pool to drive execution
  - Small number of threads per stage
  - Dynamic control grows/shrinks thread pools with demand
- Applications implement a simple event handler interface
  - Apps don't allocate, schedule, or manage threads
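As a rough illustration of this programming model, here is a minimal Java sketch of a stage wrapping an application-supplied event handler. The names (Event, EventHandler, Stage) and the batching detail are assumptions for clarity, not Sandstorm's actual interfaces.

```java
// A minimal sketch of the stage/event-handler split; names are
// illustrative, not Sandstorm's real API.
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

interface Event {}

interface EventHandler {
    // The only method an application implements: process a batch of
    // events, possibly enqueueing new events onto downstream stages.
    void handleEvents(List<Event> batch);
}

final class Stage {
    private final BlockingQueue<Event> queue = new LinkedBlockingQueue<>();
    private final EventHandler handler;

    Stage(EventHandler handler) { this.handler = handler; }

    // Upstream stages enqueue events; a false return signals overload.
    boolean enqueue(Event e) { return queue.offer(e); }

    // Run by each thread in the stage's (dynamically sized) pool; the
    // application never creates or schedules threads itself.
    void threadLoop(int batchingFactor) throws InterruptedException {
        while (true) {
            List<Event> batch = new ArrayList<>();
            batch.add(queue.take());                   // block for the first event
            queue.drainTo(batch, batchingFactor - 1);  // then batch what's pending
            handler.handleEvents(batch);
        }
    }
}
```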
Queues for Control and Composition
- Queues subject to an admission control policy
  - Manage load by controlling queueing at each stage
  - e.g., thresholding, rate control, credit-based flow control
- Enqueue failure is an overload signal
  - Block on full queue: backpressure
  - Drop rejected events: load shedding
  - Can also degrade service, redirect the request, etc.
- Queues introduce an explicit execution boundary
  - Threads may only execute within a single stage
  - Performance isolation, modularity, independent load management
- Explicit event delivery supports inspection
  - Trace the flow of events through the application
  - Monitor queue lengths to detect bottlenecks
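A minimal sketch of threshold-based admission control on a stage queue, assuming an illustrative BoundedEventQueue class: a failed enqueue is the overload signal the upstream stage must react to (by shedding, redirecting, or applying backpressure).

```java
// Illustrative threshold-based admission control; not Sandstorm code.
import java.util.ArrayDeque;
import java.util.Deque;

final class BoundedEventQueue<E> {
    private final Deque<E> events = new ArrayDeque<>();
    private volatile int threshold;  // tuned by an admission controller

    BoundedEventQueue(int threshold) { this.threshold = threshold; }

    // Returns false rather than blocking: the rejection itself is the
    // overload signal exposed to the application.
    synchronized boolean enqueue(E e) {
        if (events.size() >= threshold) return false;  // shed or redirect upstream
        events.addLast(e);
        return true;
    }

    synchronized E dequeue() { return events.pollFirst(); }

    synchronized int length() { return events.size(); }  // bottleneck detection

    void setThreshold(int t) { threshold = t; }  // hook for dynamic control
}
```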
SEDA Thread Pool Controller
[Diagram: controller observes a stage's input queue length and throughput and adjusts the thread pool size. Figure: input queue length and thread pool size over time (1 sec intervals)]
- Goal: determine the ideal degree of concurrency for a stage
  - Dynamically adjust the number of threads allocated to each stage
  - Avoid wasting threads when they are unneeded
- Controller operation
  - Observes input queue length; adds threads when it exceeds a threshold
  - Idle threads are removed from the pool
  - A drop in stage throughput indicates the maximum thread pool size
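The controller loop might look roughly like the sketch below; StageStats, ResizablePool, and all constants are assumptions for illustration, not the prototype's actual code.

```java
// Illustrative feedback loop for per-stage thread pool sizing.
interface StageStats { int inputQueueLength(); double throughput(); }

interface ResizablePool {
    int size();
    void addThread();
    void removeThread();
    void reapIdleThreads();
}

final class ThreadPoolController implements Runnable {
    private static final int QUEUE_THRESHOLD = 100;  // hand-picked example values
    private static final int MAX_THREADS = 20;
    private final StageStats stage;
    private final ResizablePool pool;
    private double lastThroughput;

    ThreadPoolController(StageStats stage, ResizablePool pool) {
        this.stage = stage;
        this.pool = pool;
    }

    @Override public void run() {
        while (true) {
            // Input queue backing up: the stage needs more concurrency.
            if (stage.inputQueueLength() > QUEUE_THRESHOLD && pool.size() < MAX_THREADS) {
                pool.addThread();
            }
            // A drop in throughput marks the stage's natural thread limit.
            double tput = stage.throughput();
            if (tput < lastThroughput && pool.size() > 1) pool.removeThread();
            lastThroughput = tput;
            pool.reapIdleThreads();  // idle threads leave the pool
            try { Thread.sleep(1000); } catch (InterruptedException e) { return; }
        }
    }
}
```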
SEDA Batching Controller
[Diagram: controller observes a stage's output rate and adjusts its batching factor. Figure: stage output rate and batching factor over time (100 ms intervals)]
- Goal: schedule for low response time and high throughput
  - Batching factor: number of events consumed by each thread
  - Large batching factor: more locality, higher throughput
  - Small batching factor: more parallelism, lower response time
- Attempt to find the smallest batching factor with stable throughput
  - Reduces the batching factor when throughput is high; increases it when low
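A rough sketch of the batching controller's adjustment rule, using an assumed running-average comparison and illustrative constants:

```java
// Illustrative batching-factor controller; constants are guesses,
// not the prototype's tuned values.
final class BatchingController {
    private double avgRate = 0;        // exponentially weighted output rate
    private int batchingFactor = 100;  // events handed to a thread at once
    private static final double ALPHA = 0.7;  // smoothing weight
    private static final int MIN_BATCH = 1, MAX_BATCH = 1000;

    // Called periodically with the stage's measured output rate;
    // returns the batching factor threads should use next.
    int adjust(double observedRate) {
        avgRate = ALPHA * avgRate + (1 - ALPHA) * observedRate;
        if (observedRate >= avgRate) {
            // Throughput stable or rising: shrink the batch for more
            // parallelism and lower response time.
            batchingFactor = Math.max(MIN_BATCH, batchingFactor - batchingFactor / 5);
        } else {
            // Throughput dropped: batch more for cache locality.
            batchingFactor = Math.min(MAX_BATCH, batchingFactor * 2);
        }
        return batchingFactor;
    }
}
```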
SEDA Prototype: Sandstorm
[Diagram: Sandstorm architecture - application stages layered over platform services (asynchronous sockets, asynchronous file I/O, timers, HTTP/Gnutella/SSL-TLS protocol libraries, profiler, system manager), built on NBIO and JAIO atop the Java Virtual Machine and operating system]
- Implemented in Java with nonblocking I/O interfaces
  - Scalable network performance: up to 10,000 clients per node
  - Influenced design of the JDK 1.4 java.nio APIs
- Java is viable as a service construction language
  - Built-in threading, automatic memory management, cross-platform
  - Java-based SEDA Web server outperforms Apache and Flash
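For flavor, here is a short example using the java.nio APIs (which arrived in JDK 1.4) to do the kind of single-threaded nonblocking socket multiplexing that Sandstorm's NBIO layer provided. This is generic NIO usage, not Sandstorm code; the port number is arbitrary.

```java
// One thread multiplexing accepts and reads over many connections.
import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.*;
import java.util.Iterator;

public class SelectorSketch {
    public static void main(String[] args) throws IOException {
        Selector selector = Selector.open();
        ServerSocketChannel server = ServerSocketChannel.open();
        server.socket().bind(new InetSocketAddress(8080));
        server.configureBlocking(false);
        server.register(selector, SelectionKey.OP_ACCEPT);

        ByteBuffer buf = ByteBuffer.allocate(4096);
        while (true) {
            selector.select();  // one thread waits on all channels at once
            Iterator<SelectionKey> it = selector.selectedKeys().iterator();
            while (it.hasNext()) {
                SelectionKey key = it.next();
                it.remove();
                if (key.isAcceptable()) {
                    SocketChannel client = server.accept();
                    client.configureBlocking(false);
                    client.register(selector, SelectionKey.OP_READ);
                } else if (key.isReadable()) {
                    SocketChannel ch = (SocketChannel) key.channel();
                    buf.clear();
                    if (ch.read(buf) < 0) { ch.close(); continue; }
                    // In SEDA, the bytes read would become an event
                    // enqueued onto the next stage (e.g., "parse packet").
                }
            }
        }
    }
}
```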
Outline
- Problems with Threads and Event-Driven Concurrency
- The Staged Event-Driven Architecture
- SEDA Implementation
- Application Study: High-Performance HTTP Server
- Using Control for Overload Prevention
- Future Work and Conclusions
Haboob: A SEDA-Based Web Server
[Diagram: the Haboob stage graph - accept connection, read packet, parse packet, check cache (hit: send response; miss: handle miss, file I/O), write packet]
- SEDA-based HTTP server for static and dynamic pages
  - In-memory page cache (sized at 200 MB to force disk I/O)
  - Dynamic page generation with a new Python-based scripting language
  - Database access, asynchronous SSL/TLS library
- Measured static file load from the SPECweb99 benchmark
  - 1 to 1024 clients making repeated requests, think time 20 ms
  - Total fileset size is 3.31 GB; page sizes from 102 bytes to 940 KB
- Comparison with Apache and Flash
  - Apache: process-based concurrency, 150 processes
  - Flash (Vivek Pai, Princeton): event-driven, with 4 processes
  - Apache accepts only 150 connections, Flash only 506
SEDA Scales Well with Increasing Load
[Figure: Throughput (MBit/sec) vs. number of clients (1 to 1024) for Apache, Flash, and SEDA]
- Testbed: 4-way Pentium III 500 MHz, Gigabit Ethernet, 2 GB RAM, Linux 2.2.14, IBM JDK 1.3
- SEDA throughput 10% higher than Apache and Flash (which are written in C!)
  - But higher efficiency was not really the goal!
- Apache accepts only 150 clients at once - no overload despite its thread model
  - But as we will see, this penalizes many clients
Max Response Time vs. Apache and Flash
[Figure: Maximum response time (seconds) vs. number of clients; at 1024 clients, Apache reaches 1.7 minutes, Flash 37 sec, SEDA 3.8 sec]
- SEDA yields stable, predictable performance
- Apache and Flash are very unfair when overloaded
  - Long response times due to exponential backoff in TCP SYN retransmit
  - Not accepting connections is the wrong approach to overload management
Can't we just fix Apache?
- Real problem with Apache: uses the TCP listen queue as its request queue
  - Unfortunately, this incurs a terrible response time penalty
- How to fix Apache?
  - Add a couple of threads to rapidly accept and read requests
  - Internally queue requests to avoid TCP backoff
- But that's not all!
  - Replace the static process limit with dynamic control
  - Isolate potential bottlenecks (CGI scripts, SSL processing, long disk I/O) to avoid overload
  - You get the idea...
[Diagram: Apache restructured - accept connection and read packet stages feeding a thread pool with a tuned process limit, with separate stages for CGI scripts, SSL processing, and long disk I/O]
Outline
- Problems with Threads and Event-Driven Concurrency
- The Staged Event-Driven Architecture
- SEDA Implementation
- Application Study: High-Performance HTTP Server
- Using Control for Overload Prevention
- Future Work and Conclusions
Alternatives for Overload Control
- Fundamentally: apply admission control to each stage
  - Expensive stages are throttled more aggressively
- Degrade service (e.g., reduce image quality or service complexity)
- Redirect the request to another server (e.g., HTTP redirect)
  - Can be coupled with front-end load balancing across a server farm
- Reject the request (e.g., an error message or "Please wait...")
  - Social engineering is possible: a fake or confusing error message
- Deliver differentiated service
  - Give some users better service; don't reject users with a full shopping cart!
Dynamic Control for Overload Prevention
[Diagram: a token bucket throttles the event arrival rate λ into a stage; a controller compares the stage's response time distribution against the RT target and adjusts the rate]
- Adaptive admission control at each stage
  - Target metric: bound the 90th-percentile response time
  - Measure stage latency, throttle the incoming event rate
- Additive-increase/multiplicative-decrease (AIMD) controller design
  - A slight overshoot in input rate can lead to large response time spikes!
  - Clamp down quickly on the input rate when over target
  - Increase the incoming rate slowly when below target
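A minimal sketch of the AIMD rate controller feeding a token bucket; the step sizes and rate bounds here are illustrative assumptions, not the paper's tuned parameters.

```java
// Illustrative AIMD admission-rate controller for one stage.
final class AimdAdmissionController {
    private double rate;                     // admitted events/sec
    private final double targetRt;           // 90th-percentile target, ms
    private static final double ADD = 2.0;   // additive increase step
    private static final double MULT = 0.5;  // multiplicative decrease factor
    private static final double MIN_RATE = 1.0, MAX_RATE = 5000.0;

    AimdAdmissionController(double initialRate, double targetRtMillis) {
        this.rate = initialRate;
        this.targetRt = targetRtMillis;
    }

    // Called each measurement interval with the observed 90th-percentile
    // response time; the result becomes the token bucket's refill rate.
    double update(double observed90thRt) {
        if (observed90thRt > targetRt) {
            rate = Math.max(MIN_RATE, rate * MULT);  // clamp down quickly
        } else {
            rate = Math.min(MAX_RATE, rate + ADD);   // recover slowly
        }
        return rate;
    }
}
```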
Arashi: A Web-based e-mail service
- Yahoo! Mail clone - a real-world service
  - Dynamic page generation, SSL
  - New Python-based Web scripting language
  - Mail stored in a back-end MySQL database
- Realistic client load generator
  - Traces taken from a departmental IMAP server
  - Markov-chain model of user behavior
- Overload control applied to each request type separately
[Diagram: accept connection, read packet, parse packet, then per-request-type stages (list folders, show message, delete/refile message, ...), each with admission control; rejected requests can degrade service or receive an error page; database access, send response (some stages not shown)]
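To illustrate the per-request-type control described above, a small sketch that gives each request type its own controller, reusing the AimdAdmissionController sketched earlier; the class, its map, and the initial constants are hypothetical.

```java
// Illustrative: one AIMD controller per request type, so expensive
// types (e.g., search) are throttled independently of cheap ones.
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

final class PerTypeAdmission {
    private final Map<String, AimdAdmissionController> byType = new ConcurrentHashMap<>();

    AimdAdmissionController controllerFor(String requestType) {
        // Each request type (list, showmsg, search, ...) is created
        // lazily with an assumed initial rate and a 1-second RT target.
        return byType.computeIfAbsent(requestType,
                t -> new AimdAdmissionController(100.0, 1000.0));
    }
}
```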
Overload prevention during massive load spike
[Figure: 90th-percentile response time (sec) and reject rate over time during a sudden load spike of 1000 users, with and without overload control]
- Sudden load spike of 1000 users hitting the Arashi service
- 7 request types, handled by separate stages, each with an overload controller
- 90th-percentile response time target: 1 second
- Rejected requests cause clients to pause for 5 sec
- The overload controller has no knowledge of the service!
Related Work: Structured Event Queues
- Many examples: Click modular router, Scout OS, Work Crews, TSS/360, etc.
  - Often concerned with optimizing the fast path; less general processing within stages
  - StagedServer [Microsoft]: explored the use of batching for cache locality only
  - We focus on a network of queues for robustness to huge variations in load
- Overload prevention techniques in Internet services
  - Dynamic listen queue thresholding [Voigt, Cherkasova, Kant]
  - Specialized scheduling techniques [Crovella, Harchol-Balter]
  - Class-based service differentiation [Bhoj, Voigt, Reumann]
  - SEDA generalizes these techniques by introducing them into the service programming model, not as generic OS functions
- Control-theoretic approaches to resource management
  - PID control of Internet service parameters [Diao, Hellerstein]
  - Feedback-driven scheduling [Stankovic, Abdelzaher, Steere]
  - This work complements our approach to system design
  - Often limited by linear models, which fail to capture the full complexity of a real system
SEDA Impact and Retrospectives
- Adoption by other systems at Berkeley and elsewhere
  - Ninja vSpace: cluster-based SEDA for scalable servers; replicates stages across a cluster for scalability and fault tolerance
  - OceanStore (global, secure, archival filesystem) is based on SEDA
  - TerraLycos chat server, LimeWire Gnutella servers, SwiftMQ system, etc.
- SEDA requires a fundamental shift in thinking
  - The programming model is no longer a single request in isolation, but many requests being processed at each stage
  - Overload is exposed to applications through queue rejections
- Open questions and future work
  - Explore the impact of SEDA on OS design
  - Rejecting requests that have accumulated state requires a rollback mechanism backwards along the event path
  - Applying control theory to SEDA resource management: Internet services are highly nonlinear, making classic control theory difficult to apply
Summary
- Support for massive load requires new design techniques
- SEDA introduces service design as a network of stages
  - An architecture for services that are robust to huge variations in load
  - Combines the efficiency of event-driven concurrency with the programmability of threads
- Observation and control are key to service design
  - Expose request streams and overload to applications
  - Dynamic control keeps stages within their operating regime
  - Decouple load management from service complexity
  - Bring the body of work on control theory to bear on Internet services
- Opens up interesting new directions for Internet service design
  - What would a native SEDA operating system look like?
  - Language and tool support for event-driven computing
- For more information, software, and papers: http://www.cs.berkeley.edu/~mdw/
Backup Slides Follow
But isn't it difficult to program in this way?
- Some learning curve for SEDA, but mitigated by the design:
  - Easier to manage event-driven concurrency within each stage
  - Queues help with debugging, modularity, and performance isolation
  - Also, stages can block (for difficult code or lazy programmers)
  - Much better than monolithic, event-driven spaghetti-code designs
- Decomposition into stages helps greatly!
  - Each stage is a self-contained, well-conditioned module
  - Stage communication via a message protocol, not an API
- Some drawbacks:
  - Shared state across stages still requires synchronization
  - Subtle concurrency issues, e.g., inter-event dependencies and serialization
Future Directions
- Explore impact of SEDA on OS design
  - Rethinking the role of resource virtualization
  - Strive for robustness/stability rather than mere speed
  - Scheduling, I/O, VM, and the programming model are all affected
- Adaptivity and control as the basis for robust, large-scale systems
  - Feedback as a first-order systems construction primitive
  - Interesting ties with control theory, performance modeling, machine learning
  - Further development of formal techniques needed to apply them to real systems
- Longer term: resilient, massively scalable computing platforms
  - Macrocomputing - computing across many unreliable devices
  - e.g., wireless sensor networks with 10,000s of nodes
  - Language-based approach to robust computing under uncertainty
Related Work: High-performance Web servers
- Many systems realize the benefit of event-driven design [Flash, Harvest, Squid, JAWS, ...]
  - Specific applications - no general-purpose framework
  - Little work on load conditioning and event scheduling
- StagedServer (Microsoft Research)
  - Core design similar to SEDA
  - Primarily concerned with cache locality
  - Wavefront thread scheduler: last in, first out
- Click Modular Router, Scout OS, Utah Janos
  - Various systems making use of structured event queues
  - Packet processing decomposed into stages
  - Threads call through multiple stages
  - Major goal is latency reduction
Related Work 2: Resource Containers [Banga]
- Similar to Scout paths and Janos flows
  - Vertical resource management for data flows
  - SEDA applies resource management at the per-stage level
- Scalable I/O and event delivery [ASHs, IO-Lite, fbufs, /dev/poll, FreeBSD kqueue, NT completion ports]
  - Structure the I/O system to scale with the number of clients
  - We build on this work
- Large body of work on scheduling
  - Interesting thread/event/task scheduling results
  - e.g., use of SRPT and SCF scheduling in Web servers [Crovella, Harchol-Balter]
  - Alternate performance metrics [Bender]
  - We plan to investigate their use within SEDA
Apache and Flash Architectures
- Apache: classic thread/process-per-task model
  - Maintains a fixed-size pool of 150 processes
  - Accepts only 150 connections at a time; others queue in the network
  - Leads to extreme unfairness due to exponential TCP backoff: lucky clients are serviced quickly, unlucky clients must go out to lunch
- Flash [V. Pai et al., USENIX 1999]
  - Canonical example of event-driven server design
  - A single thread processes all concurrent requests, with lots of optimizations
  - Sub-processes used to perform disk I/O and DNS lookups
  - Limited to 506 simultaneous connections due to file descriptor limits
  - The original USENIX paper only measured up to this point
Response Time Distribution - 64 Clients
[Figure: CDF of response time (msec, log scale) for SEDA, Flash, and Apache at 64 clients]

            SEDA      Flash     Apache
  Mean RT   15 ms     18 ms     16 ms
  Max RT    1.8 sec   12 sec    2.2 sec

- All servers perform about the same; Apache is a little faster
Response Time Distribution - 1024 Clients
[Figure: CDF of response time (msec, log scale) for Apache (150 conns), Flash (506 conns), and SEDA (1024 conns); long tail in Apache and Flash due to TCP SYN backoff]

            SEDA      Flash     Apache
  Mean RT   547 ms    665 ms    475 ms
  Max RT    3.8 sec   37 sec    1.7 minutes

- Apache accepts only 150 connections: less resource contention, but very unfair treatment of clients
Response time CDF by request type
[Figure: CDF of response time (msec, log scale) by request type - showmsg, login, refile, deletemsg, list, folderlist, search - at 128 clients]
- Most request types meet the performance target
- Search is particularly bad, but not very common
  - Aggressiveness of the controller is proportional to request frequency
Overload prevention under increasing load
[Figure: 90th-percentile response time (seconds) and fraction of accepted requests vs. number of clients (4 to 1024), with and without admission control, against the 1-second response time target]
- 90th-percentile response time target set to 1 second
- Controller keeps response time below the target as load increases
- A greater proportion of expensive requests is rejected
- System administrators can tune the response time vs. rejected load tradeoff
Controller-based Overload Prevention
[Figure: CDF of response time (msec, log scale) for 1024 clients under overload, with and without overload control, against the performance target of 90th-percentile RT = 1 sec]
- 1024 clients under overload
- Adaptive load-shedding controller attempts to meet the performance target
  - Dynamically adjusts queue thresholds to maintain low response time
- Goal: 90th-percentile response time of 1 sec
- In this run, 49% of requests were rejected by the server