3/2/2011
SWISSBOX: REVISITING THE DATA PROCESSING SOFTWARE STACK
Systems Group, Dept. of Computer Science, ETH Zürich, Switzerland
SwissBox talk, Humboldt University, Dec. 2010
Systems Group: www.systems.ethz.ch
Enterprise Computing Center: www.ecc.ethz.ch
APPLIANCES: The world is changing

ORACLE EXADATA
- Intelligent storage manager
- Massive caching
- RAC-based architecture
- Fast network interconnect
ORACLE EXADATA
- Pushing SQL operators down to the storage manager

NETEZZA (IBM) TWINFIN
- No storage manager
- Distributed disks (per node)
- FPGA processing
- No indexing
SAP ACCELERATOR
- Main-memory database
- Column store
- No indexing (automatic)
SWISSBOX
Gustavo Alonso, Donald Kossmann, Timothy Roscoe: SwissBox: A Database Appliance. CIDR 2011
[Figure: ETH SwissBox architecture]
SwissBox main components
- Barrelfish: research operating system for multicore machines, designed to let the application control key system aspects
- Crescando: main-memory storage manager
- E-cast: distributed protocol for routing updates and reads to (large) pools of replicated nodes running Crescando
- FPGA layer: hardware accelerators for network-traffic optimization and operator offloading from the CPUs
- SharedDB: data-flow architecture for shared operator processing

CRESCANDO: the storage manager of SwissBox
Philipp Unterbrunner, Georgios Giannikis, Gustavo Alonso, Dietmar Fauser, Donald Kossmann: Predictable Performance for Unpredictable Workloads. PVLDB 2(1): 706-717 (2009)
Amadeus Workload
Passenger Booking Database
- ~600 GB of raw data (two years of bookings)
- single table, denormalized
- ~50 attributes: flight no, name, date, ..., many flags
Query Workload
- up to 4000 queries/second
- latency guarantee: 2 seconds
- today: only pre-canned queries allowed
Update Workload
- avg. 600 updates per second (1 update per GB per second)
- peak of 12000 updates per second
- data-freshness guarantee: 2 seconds

Amadeus Query Examples
Simple queries
- Print the passenger list of flight LH 4711
- Give me the LH HON Circle passengers from Frankfurt to Delhi
Complex queries
- Give me all Heathrow passengers that need special assistance (e.g., after a terror warning)

Problems with the State of the Art
- Simple queries work only because of materialized views (a multi-month project to implement each new query/process)
- Complex queries do not work at all
Why traditional DBMSs are a pain
[Figure: MySQL query latency in msec (50th, 90th, and 99th percentile) vs. update load in updates/sec, and vs. the synthetic workload parameter s]
- Performance depends on workload parameters: changes in the load (updates, columns accessed) cause huge variance
- Unpredictable performance, impossible to tune correctly

System requirements
- Predictable (= constant) performance, independent of updates, query types, ...
- Meet SLAs: latency, data freshness
- Affordable cost: ~1000 COTS machines are okay (compare to a mainframe)
- Meet consistency requirements: monotonic reads (ACID not needed)
- Respect hardware trends: main memory, NUMA, large data centers
Selected Related Work
- L. Qiao et al.: Main-memory scan sharing for multi-core CPUs. VLDB '08 (cooperative main-memory scans for ad-hoc OLAP queries, read-only)
- P. Boncz, M. Zukowski, N. Nes: MonetDB/X100: Hyper-pipelining query execution. CIDR '05 (cooperative scans over vertical partitions on disk)
- K. A. Ross: Selection conditions in main memory. ACM TODS 29(1), 2004
- S. Chandrasekaran, M. J. Franklin: Streaming queries over streaming data. VLDB '02 (query-data join)
- G. Candea, N. Polyzotis, R. Vingralek: A Scalable, Predictable Join Operator for Highly Concurrent Data Warehouses. VLDB '09 (an always-on join operator based on similar requirements and design principles)

What is Crescando?
A distributed (relational) table:
- main memory on NUMA
- horizontally partitioned, distributed within and across machines
Query/update interface:
- SELECT * FROM table WHERE <any predicate>
- UPDATE table SET <anything> WHERE <any predicate>
- monotonic reads/writes (snapshot isolation within a single partition)
Some nice properties:
- constant, predictable latency and data freshness
- solves the Amadeus use case
Design
Operate main memory like disk in a shared-nothing architecture:
- core ~ spindle (many cores per machine and per data center)
- all data kept in main memory (logged to disk for recovery)
- each core scans one partition of the data, all the time
Batch queries and updates (shared scans):
- do trivial MQO (at the scan level, on a system with a single table)
- control the read/update pattern, so there is no data contention
Index the queries, not the data (as in the stream-processing world):
- predictable and optimizable: rebuild the query indexes every second
Updates are processed before reads.

Clock Scan
[Figure: queries and updates arrive in batches; a query index is built for the next scan; a write cursor followed by a read cursor circles over the data in a circular buffer (wide table)]
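The clock-scan idea above can be sketched in a few lines. This is an illustrative Python sketch, not the actual (heavily optimized C++) implementation: one revolution over a partition applies the batched updates to each record first (write cursor) and then evaluates the batched queries (read cursor), so one pass over the data serves the whole batch. All names are invented for the example.

```python
def clock_scan(partition, update_batch, query_batch):
    """One revolution of the clock scan over one partition.

    partition:    list of dict records (the circular buffer's contents)
    update_batch: list of (predicate, set_fn) pairs, applied in arrival order
    query_batch:  list of (query_id, predicate) pairs, order irrelevant
    Returns {query_id: [matching record snapshots]} for this scan pass.
    """
    results = {qid: [] for qid, _ in query_batch}
    for record in partition:
        # Write cursor: apply the pending updates first, in order.
        for predicate, set_fn in update_batch:
            if predicate(record):
                set_fn(record)
        # Read cursor: every query in the batch sees the updated record.
        for qid, predicate in query_batch:
            if predicate(record):
                results[qid].append(dict(record))
    return results
```

A usage example on a toy passenger table: one update marks a flight's passengers as boarded, and a query in the same batch already observes that update, matching the "updates before reads" rule on the slide.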
Crescando on 1 Core
[Figure: a scan thread over one data partition; the active queries and updates are organized into predicate indexes plus a list of unindexed queries; the scan emits {record, {query ids}} result pairs]

Crescando on 1 Machine (N Cores)
[Figure: an input queue of operations is split across N scan threads, one per core and data partition; their result tuples are merged into a single output queue]
Crescando in a Data Center (N Machines)
[Figure: Crescando partitions replicated and distributed across machines]

Implementation Details
Optimization:
- decide, for each batch of queries, which indexes to build
- runs once every second (must be fast)
Query and update indexes:
- different indexes for different kinds of predicates (e.g., hash tables, R-trees, tries, ...)
- must fit in the L2 cache (better: the L1 cache)
Probe indexes:
- updates in the right order, queries in any order
Persistence and recovery:
- log updates and inserts to disk (not a bottleneck)
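To make the "index the queries, not the data" point concrete, here is a minimal sketch, under assumed names, of one such query index: for a batch of equality predicates on the same attribute, a hash table maps attribute values to the ids of the queries that ask for them, and the scan probes it once per record. Crescando also supports other index types (R-trees, tries) for other predicate classes; this shows only the simplest case.

```python
from collections import defaultdict

def build_query_index(queries, attr):
    """Index a batch of equality queries on one attribute.

    queries: list of (query_id, value) pairs meaning `attr == value`.
    Returns a hash table: value -> [query ids with that predicate].
    Rebuilt for every scan pass (once per second on the slides).
    """
    index = defaultdict(list)
    for qid, value in queries:
        index[value].append(qid)
    return index

def probe(index, attr, record):
    """Probe during the scan: ids of all batched queries this record satisfies."""
    return index.get(record[attr], [])
```

The point of the inversion: the cost per record is one hash probe regardless of how many queries are in the batch, which is what makes the scan latency independent of the query volume.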
Benchmark Environment
Crescando implementation:
- shared library for POSIX systems
- heavily optimized C++ with some inline assembly
Benchmark machines:
- 16-core Opteron machine with 32 GB DDR2 RAM
- 64-bit Linux SMP kernel, ver. 2.6.27, NUMA enabled
Benchmark database:
- the Amadeus Ticket view (one record per passenger per flight)
- ~350 bytes per record; 47 attributes, many of them flags
- benchmarks use 15 GB of net data
Query and update workload:
- current: Amadeus workload (from Amadeus traces)
- predicted: synthetic workload with varying predicate selectivity

Multi-core Scale-up
[Figure: queries/sec vs. number of threads (reported points include 1.9, 10.5, and 558.5 Q/s); round-robin partitioning, read-only Amadeus workload]
Latency vs. Query Volume
[Figure: latency stays near the base latency of a scan while the query indexes fit in the L1/L2 cache, then thrashing and queue overflows set in; hash partitioning, read-only Amadeus workload, varying queries/sec]

Latency vs. Concurrent Writes
[Figure: hash partitioning, Amadeus workload, 2000 queries/sec, varying update rate]
Crescando vs. MySQL: Latency
[Figure: in MySQL, updates plus big queries cause massive queuing; with skew s = 1.4, 1 in 3,000 queries does not hit an index; with s = 1.5, 1 in 10,000; 16 s is the time for a full table scan in MySQL. Left: Amadeus workload, 100 queries/sec, varying updates. Right: synthetic read-only workload, varying skew]

Crescando vs. MySQL: Throughput
[Figure: read-only workload. Left: Amadeus workload, varying updates. Right: synthetic read-only workload, varying skew]
An interesting storage layer
- Interface is SQL (not pages or blocks)
- High concurrent query and update throughput (Amadeus: ~4000 queries/sec plus ~1000 updates/sec); updates do not impact the latency of queries
- Predictable and guaranteed latency: depends on the size of the partition; not optimal, but good enough
- Cost and energy efficiency depend on the workload: great for hot data and heavy workloads
- Consistency: write monotonicity; snapshot isolation can be built on top
- Works great on NUMA: controls the read and write pattern; linear scale-up with the number of cores

Status & Outlook
Status:
- fully operational system
- extensive experiments at Amadeus
- production: summer 2011 (planned)
Outlook:
- column-store variant of Crescando
- compression
- E-cast: flexible partitioning and replication
- additional operators (group-by)
SWISSBOX: Additional components
[Figure: ETH SwissBox architecture]
SharedDB = processing layer
If we can share the scans (Crescando), then maybe we can share other operators (join, sort).
SharedDB is built on top of Crescando and implements shared operators capable of providing scalable, predictable performance for high volumes of concurrent queries.

Shared join
- Crescando runs selections and projections on one set of cores
- SharedDB runs joins on the streams coming out of Crescando, thousands of queries at a time
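The shared-join idea can be sketched as follows. This is an illustrative Python sketch of one plausible mechanism, not SharedDB's actual implementation: each input tuple carries the set of query ids it is relevant to (as produced by the shared scans), and a joined tuple belongs to the intersection of its two inputs' query-id sets, so one physical join pass serves many queries at once. All names are invented for the example.

```python
from collections import defaultdict

def shared_hash_join(left, right, key):
    """One shared hash join for a whole batch of queries.

    left, right: lists of (record_dict, query_id_set) pairs; the set says
    which of the batched queries each tuple is relevant to.
    Returns joined (record, query_id_set) pairs; a result belongs only to
    the queries that asked for both of its inputs.
    """
    # Build phase: hash the left input on the join key.
    table = defaultdict(list)
    for rec, qids in left:
        table[rec[key]].append((rec, qids))
    # Probe phase: join and intersect the query-id sets.
    out = []
    for rrec, rqids in right:
        for lrec, lqids in table.get(rrec[key], []):
            shared = lqids & rqids
            if shared:
                out.append(({**lrec, **rrec}, shared))
    return out
```

The design point this illustrates: the work of building and probing the hash table is paid once per batch rather than once per query, which is what makes the processing time predictable under high query volumes.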
Predictability at scale
- SharedDB can run complex joins (and sorts) in predictable time under large update loads
- Linear scalability with the number of processing units (cores)

SWISSBOX: A research platform
Key ideas around SwissBox
A new way to process queries:
- massively parallel, simple, predictable
- not always optimal, but always good enough
Ideal for operational BI:
- high query throughput
- concurrent updates with freshness guarantees
Great opportunity for research:
- rethink the database and storage-system architecture
- explore new possibilities