Driving MySQL to Big Data Scale
Thomas Hazel, Founder, Chief Scientist
thomas@deepis.com

Millions to Billions to Trillions
Agenda
Driving MySQL to Big Data Scale
Market Trends
Hardware Trends
Current Computer Science
Limitations with Current Science
Rethinking the Science of Databases
Introducing CASSI for MySQL Scaling
Benchmarking million, billion, trillion

Market Trends
Where are things heading?
Constant data acquisition
  o Streaming data feeds like IoT
More data being collected
  o Larger table sizes required
Desire for in-place analytics
  o More indexing to support complex queries

Hardware Trends
Resource/Capability
Storage Type and Size/Performance
  o HDD RPM (5.9K, 7.2K, 15K, etc.)
  o SSD Enterprise/Client grade
Memory Type and Size/Performance
  o DRAM (FPM, EDO, etc.)
  o SDRAM (DDR2, DDR3, etc.)
Processor Type and Count/Performance
  o Intel (i3, i5, i7, etc.)
  o AMD (X2, X3, FX, etc.)
[Figure: Yearly Trend, Speed / Cost and Size / Count]

Current Computer Science
Structures/Algorithms
Phase 1 - Log File (WAL)
  o Error Recovery
  o Merge/Optimization
Phase 2 - B-Tree/B+Tree
  o In-memory and on-disk via MMAP
  o Rows and Index/Key based representation
Phase 2 - Log Structured Merge (LSM) Tree
  o In-memory Write-Back Cache of Rows
  o On-disk Immutable Maps of Sorted Keys and Values
[Figure: Log File, B-Tree, LSM-Tree]

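The LSM-Tree bullets above (an in-memory write-back cache flushed into immutable sorted maps on disk) can be illustrated with a toy sketch. It is not how any particular engine implements the idea: `TinyLSM`, `memtable_limit`, and the method names are hypothetical, and the "on-disk" runs are simply tuples held in memory.

```python
import bisect


class TinyLSM:
    """Toy LSM tree: a mutable in-memory table plus immutable sorted runs."""

    def __init__(self, memtable_limit=4):
        self.memtable_limit = memtable_limit  # hypothetical flush threshold
        self.memtable = {}                    # in-memory write-back cache of rows
        self.runs = []                        # immutable sorted runs, newest last

    def put(self, key, value):
        # Write without reading first: just record the latest value.
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # Freeze the memtable into an immutable, key-sorted run.
        if self.memtable:
            self.runs.append(tuple(sorted(self.memtable.items())))
            self.memtable = {}

    def get(self, key):
        # Reads cost more: check the memtable, then each run, newest first.
        if key in self.memtable:
            return self.memtable[key]
        for run in reversed(self.runs):
            i = bisect.bisect_left(run, (key,))
            if i < len(run) and run[i][0] == key:
                return run[i][1]
        return None

    def compact(self):
        # Deferred merge: fold all runs into one, newer values winning.
        merged = {}
        for run in self.runs:
            merged.update(run)
        self.runs = [tuple(sorted(merged.items()))] if merged else []
```

A write only touches the dictionary, while a read may have to probe every run until `compact()` merges them. That is the write-optimized trade-off this slide attributes to LSM trees.
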
Limitations with Current Science
Fixed Structures/Algorithms
B-Tree (Read Optimized)
  o Read before Write
  o Fixed block orientation
  o Inline/sawtooth rebalancing
  o Scan vs. Write/Point-Read performance vs. Size
LSM-Tree (Write Optimized)
  o Write w/o Read, Slower Read
  o Fixed block append orientation
  o Background/deferred rebalance/merge
  o Write Performance vs. Point-Read vs. Scan vs. Size
[Figure: Read Optimized, Write Optimized]

Rethinking the Science of Databases
Maintenance-free Performance with Scale
How to maximize Writes without sacrificing Reads?
How to dynamically resize/redefine structures at run-time?
How to remove mathematical limits of memory and storage?
How to replace offline with online reconfiguration/optimization?
How to support all the classic/powerful database features at scale?

CASSI: Adaptive Structure/Algorithm
Continuous Adaptive Sequential Summarization of Information
Separate algorithm behavior from data structure
Split memory and storage into independent structures
Introduce kernel scheduling techniques to utilize hardware
Introduce layer to observe and adapt to workloads/resources
Machine learning to define structure and schedule resources
Dynamic and continuous online calibration (reorder, compress)
Metadata embedded in data (cardinality, counts, cost, etc.)

CASSI: Adaptive Structure/Algorithm
Fundamentals: Constructs
Infinite File Logging
  o Storing both rows and indexes (e.g. rowdata.vrt, indexdata.irt)
  o Continuous merge/optimization (inline memory, background storage)
Variable size Segments
  o Define/Size ranges of blocks based on data values, workload, resources
  o Allow Segments to be represented as all or part of the actual dataset
Memory/Storage Structure (Segments, Segments of Segments)
  o Memory: tree-oriented summation with physical/logical constructs
  o Storage: append-only, protocol based with physical/logical constructs

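As a rough illustration of variable-size segments, the sketch below keeps sorted keys in segments that each cover one key range and splits a segment once it grows past a threshold. This is a deliberate simplification: the slide says CASSI sizes segments from data values, workload, and resources, whereas this sketch uses a single fixed limit. `SegmentedIndex` and `max_segment_size` are hypothetical names.

```python
import bisect


class SegmentedIndex:
    """Sorted keys held in segments that split when they grow too large."""

    def __init__(self, max_segment_size=4):
        self.max_segment_size = max_segment_size  # assumed fixed split threshold
        self.segments = [[]]  # each segment is a sorted list covering one key range

    def _find_segment(self, key):
        # Choose the last segment whose first key is <= key (else the first one).
        firsts = [seg[0] for seg in self.segments[1:]]
        return bisect.bisect_right(firsts, key)

    def insert(self, key):
        i = self._find_segment(key)
        seg = self.segments[i]
        bisect.insort(seg, key)
        if len(seg) > self.max_segment_size:
            # Split only this segment; neighbouring ranges are left untouched.
            mid = len(seg) // 2
            self.segments[i:i + 1] = [seg[:mid], seg[mid:]]

    def keys(self):
        # Concatenating the segments in order yields all keys in sorted order.
        return [k for seg in self.segments for k in seg]
```

Because a split replaces one segment with two and leaves every other segment alone, an insert into one key range never reorganizes another.
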
CASSI: Adaptive Structure/Algorithm
Fundamentals: Behavior
Scheduling of Work
  o Task-based indexing, defragmentation, compression, memory/disk access
  o Orchestrate tasks based on hardware, workload, information modeling
Dynamic Structure/Algorithm
  o Model-based splitting, merging, purging, summation, etc. of segments
  o Range space independence: one segment does not affect another
Orchestrate/Optimize the three tenets of CASSI
  o Always append data to file (i.e. don't seek, use current position, support upsert)
  o Read data sequentially (i.e. don't seek, use current position)
  o Continually re-write and reorder such that the previous two principles are met

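The three tenets listed above can be sketched with a toy append-only log. This is only an illustration of the principles, not CASSI's actual file protocol: `AppendOnlyLog` and its methods are hypothetical, and a Python list stands in for the file.

```python
class AppendOnlyLog:
    """Toy append-only store with upsert semantics and periodic reordering."""

    def __init__(self):
        self.records = []  # stands in for the file; only ever appended to

    def upsert(self, key, value):
        # Tenet 1: always append at the current position, never seek back.
        self.records.append((key, value))

    def scan(self):
        # Tenet 2: read sequentially; later records override earlier ones.
        latest = {}
        for key, value in self.records:
            latest[key] = value
        return latest

    def reorder(self):
        # Tenet 3: rewrite the live records in key order into a fresh log,
        # so the next sequential read sees sorted, deduplicated data.
        fresh = AppendOnlyLog()
        for key, value in sorted(self.scan().items()):
            fresh.upsert(key, value)
        return fresh
```

The old log is never modified in place. `reorder()` produces a new log, which mirrors the idea that rewriting is itself an append.
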
CASSI: Adaptive Structure/Algorithm
Diagram: Write Flow
[Figure labels: Workload, Cache, CASSI Kernel, Order, Reorder, Compress, Value Log File, Index Log File]

CASSI: Adaptive Structure/Algorithm
Diagram: Read Flow
[Figure labels: Cache (P1), Cache (S2), Cache (S3), Key-only Scan, Value Log File with I and U entries from t0 to tn, Indexes (1, 2, 3) Log]

CASSI: Adaptive Structure/Algorithm
Diagram: Summarization
[Figure labels: Finalized Segments, Summarized Segment, Summation Range]

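One way to read the summarization diagram is that each finalized segment carries a small summary (for example its key range and row count), and summaries roll up over a summation range. The sketch below assumes exactly that and nothing more. The `Summary` fields and both function names are hypothetical, and real CASSI metadata may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Summary:
    min_key: int
    max_key: int
    count: int


def summarize_segment(keys):
    """Summarize one finalized segment (a non-empty sorted list of keys)."""
    return Summary(keys[0], keys[-1], len(keys))


def summarize_range(summaries):
    """Roll several segment summaries up into one summation range."""
    return Summary(
        min(s.min_key for s in summaries),
        max(s.max_key for s in summaries),
        sum(s.count for s in summaries),
    )
```

With summaries like these, a total row count can be read from the top-level summary without visiting any rows. That is one plausible reading of "metadata embedded in data (cardinality, counts, cost, etc.)".
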
CASSI: Adaptive Structure/Algorithm
Diagram: Concurrency
Lockless Access to Segments
Temp. User Space Lock to Storyline
Lockless Isolated Access to Views
[Figure labels: Readers (i, j, k), View 0, View 1, View 2, Writer 1 (Active)]

Benchmarking Configuration
CASSI vs. B-Tree
Random Keys (Small, Medium, Large, Extreme)
Schema/Specifications
  o 1 Primary Index
  o 3 Indexes containing 4 Columns
  o 4 Hosts at increasing Scale/Capability
Performance/Scale Testing
  o Small: 2 runs, 50 million rows
  o Medium: 2 runs, 100 million rows
  o Large: 2 runs, 1 billion rows
  o Extreme: 2 runs, 1 trillion rows (CASSI only, simple schema)

Id   Cust.  Prod.  Price  Time   Data
001  0001   0001   1.00   10/10  aaaa

[ AUTOINC PRIMARY KEY (`id`),
  KEY `index1` (`cust`,`prod`,`price`,`time`),
  KEY `index2` (`prod`,`price`,`time`,`cust`),
  KEY `index3` (`price`,`time`,`cust`,`prod`) ]

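For readers who want to experiment with a schema of this shape, here is a small sketch using Python's built-in `sqlite3` as a stand-in for MySQL. It is not the benchmark harness used for the results that follow. The table name `bench`, the column types, the value ranges, and the fixed `time`/`data` values are all assumptions, since the slide gives only the column names and index definitions.

```python
import random
import sqlite3


def build_benchmark_table(rows, seed=0):
    """Create a table shaped like the slide's schema and load random-key rows."""
    rng = random.Random(seed)
    db = sqlite3.connect(":memory:")
    # Column types are assumed; the slide lists only names and a sample row.
    db.execute(
        'CREATE TABLE bench ('
        ' "id" INTEGER PRIMARY KEY AUTOINCREMENT,'
        ' "cust" INTEGER, "prod" INTEGER, "price" REAL,'
        ' "time" TEXT, "data" TEXT)'
    )
    # Three secondary indexes, each covering the same four columns in rotated order.
    db.execute('CREATE INDEX index1 ON bench ("cust", "prod", "price", "time")')
    db.execute('CREATE INDEX index2 ON bench ("prod", "price", "time", "cust")')
    db.execute('CREATE INDEX index3 ON bench ("price", "time", "cust", "prod")')
    db.executemany(
        'INSERT INTO bench ("cust", "prod", "price", "time", "data")'
        ' VALUES (?, ?, ?, ?, ?)',
        (
            (rng.randrange(10_000), rng.randrange(10_000),
             round(rng.uniform(1, 100), 2), "10/10", "aaaa")
            for _ in range(rows)
        ),
    )
    db.commit()
    return db
```

Each inserted row updates the table and all three four-column indexes, which is what makes this schema index-heavy for ingestion tests.
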
50 million with complex indexing
8 CPU, 2G Cache on 7200 RPM HDD, 2x1G Log (InnoDB)
Ingestion Time: 10 clients
  o CASSI: 520 seconds, 8.5G Size
  o B-Tree: 9,285 seconds, 12G Size
  o Difference: Insert Speed ~18x, Size Efficiency 1.4x
Cold start: Index only Query
  o CASSI: 4,965 rows in set (0.42 sec)
  o B-Tree: 4,953 rows in set (0.83 sec)
  o Difference: ~2x improvement
Cold start: Index + Point Query
  o CASSI: 4,965 rows in set (33 sec), 0.02G Cache
  o B-Tree: 4,953 rows in set (151 sec), 1G Cache
  o Difference: Query Speed ~4.5x, Cache Efficiency 50x
[Chart: B-Tree vs. CASSI for Insert Time, Disk Size, Index Query, Point Query, Cache Eff.]

100 million with complex indexing
8 CPU, 10G Cache on Client Grade SSD, 2x1G Log (InnoDB)
Ingestion Time: 15 clients
  o CASSI: 901 seconds, 17G Size
  o B-Tree: 5,895 seconds, 27G Size
  o Difference: Insert Speed ~6.5x, Size Efficiency 1.5x
Cold start: Index only Query
  o CASSI: 10,414 rows in set (0.08 sec)
  o B-Tree: 10,435 rows in set (0.17 sec)
  o Difference: ~2x improvement
Cold start: Index + Point Query
  o CASSI: 10,414 rows in set (61 sec), 0.05G Cache
  o B-Tree: 10,414 rows in set (147 sec), 1.2G Cache
  o Difference: Query Speed ~2.4x, Cache Efficiency ~25x
[Chart: B-Tree vs. CASSI for Insert Time, Disk Size, Index Query, Point Query, Cache Eff.]

1 billion with complex indexing
16 CPU, 64G Cache on General Purpose SSD (3000 IOPS), 2x1G Log (InnoDB)
Ingestion Time: 15 clients
  o CASSI: 10,800 seconds (3 hours), 185G Size
  o B-Tree: 57,600 seconds (16 hours), 234G Size
  o Difference: Insert Speed ~5.3x, Size Efficiency 1.2x
Cold start: Index only Query
  o CASSI: 1,049,702 rows in set (151 sec)
  o B-Tree: 1,052,021 rows in set (159 sec)
  o Difference: ~1.05x improvement
Cold start: Index + Point Query
  o CASSI: 1,049,702 rows in set (318 sec), 0.06G Cache
  o B-Tree: 1,052,021 rows in set (787 sec), 20G Cache
  o Difference: Query Speed ~2.1x, Cache Efficiency ~333x
[Chart: B-Tree vs. CASSI for Insert Time, Disk Size, Index Query, Point Query, Cache Eff.]

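The ingestion "Insert Speed" figures on the three benchmark slides appear to be the B-Tree time divided by the CASSI time. A small sketch of that arithmetic, using only the ingestion timings reported above; the helper name `ratio` and the dictionary layout are illustrative.

```python
def ratio(baseline, candidate):
    """How many times larger the baseline measurement is than the candidate."""
    return baseline / candidate


# (B-Tree seconds, CASSI seconds) for ingestion, as reported on the slides.
ingestion = {
    "50 million": (9285, 520),
    "100 million": (5895, 901),
    "1 billion": (57600, 10800),
}

# Speedup of CASSI over B-Tree, rounded to one decimal place.
speedups = {name: round(ratio(b, c), 1) for name, (b, c) in ingestion.items()}
```

The same helper can be applied to the size, query-time, and cache figures to cross-check the other "Difference" lines.
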
1 trillion: CASSI Theory in Practice
Two machines tested: Virtual/Physical (primary with non-indexed column)
  o Amazon: 16 CPU, General Purpose SSD with 64G cache
  o Bare metal: 48 CPU, HDD 5900 RPM with 128G cache
Time to complete test: 15 local clients
  o Amazon: 2 weeks, 50 million inserts per minute, 24TB
  o Bare metal: 2.5 weeks, 37 million inserts per minute, 24TB
Cold start: Schema/Query Performance
  o Select count(*) from table: 0 seconds (0 seeks)
  o Any point query across 1 trillion rows: ~1 second (3 seeks)
What we learned from testing 1 trillion rows and 24TB
  o Disk errors and power failure (incremental backup and verify)
  o Verify algorithms and protocols at extreme scale (at testable time)

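The load times quoted above can be sanity-checked against the insert rates with a few lines of arithmetic. This sketch assumes the quoted rates are sustained averages and that 1 TB means 10**12 bytes. Neither assumption is stated on the slide, and `weeks_to_load` is a hypothetical helper.

```python
ROWS = 10**12                      # 1 trillion rows
MINUTES_PER_WEEK = 7 * 24 * 60     # 10,080 minutes


def weeks_to_load(rows, inserts_per_minute):
    """Weeks needed to insert `rows` at a constant per-minute rate."""
    return rows / inserts_per_minute / MINUTES_PER_WEEK


amazon_weeks = weeks_to_load(ROWS, 50_000_000)       # about 2.0 weeks
bare_metal_weeks = weeks_to_load(ROWS, 37_000_000)   # about 2.7 weeks
bytes_per_row = 24 * 10**12 / ROWS                   # 24.0 bytes on disk per row
```

Both results land close to the reported 2 and 2.5 weeks, so the rates and durations on the slide are roughly consistent with each other.
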
What's Next for Deep Engine
Foundational Technology
Structured - MySQL/Percona/MariaDB
Semi-Structured - MongoDB
Unstructured - Hadoop/HDFS
Deep Native?

Thank You!
Thomas Hazel, Founder, Chief Scientist
thomas@deepis.com
www.deepis.com
Follow us: @DeepInfoSci