How, What, and Where of Data Warehouses for MySQL
Robert Hodges, CEO, Continuent
Introducing Continuent
The leading provider of clustering and replication for open source DBMSs.
Our product, Continuent Tungsten:
- Clustering: commercial-grade HA, performance scaling, and data management for MySQL
- Replication: flexible, high-performance data movement
Why Do MySQL Applications Need a Data Warehouse?
Defining the Problem
"In Retail War, Prices on Web Change Hourly" (New York Times, Dec. 1, 2012)
Typical Schema for Sales Analytics
- Sales (fact table): customer, product, quantity, sale_type, location, discount, sale_amount, sale_time, period, payment_type, campaign...
- Product: sku, product_type...
- Period: hour, day_of_week, day_of_month, week, month...
- Customer: first_name, last_name, loyalty_rank, street...
- Location: city, county, state, country...
InnoDB = Row Store
- Rows clustered by primary key; row data stored together
- Secondary indexes (e.g., cust_id, prod_id on the Sales table) reference rows via the primary key
- Indexes slow writes
Row Store + MySQL Server = OLTP
- Fast updates of a small number of rows
- Limited indexing (few indexes, B-tree only)
- Minimal compression
- Nested-loop joins
- Single-threaded query execution
- Sharded data sets
OLTP != Analytics
- Parallel execution
- Time series
- Spatial queries
- Recursive queries
- Efficient search on any column
- Star schema organization
- Data cubes / pivot tables (OLAP)
- Business Intelligence (BI) tool integration
Solution: MySQL + Data Warehouse
- Sharded MySQL for high transaction throughput
- Near-real-time loading
- Data warehouse for fast analytics
Data Warehouse Options
Commercial DBMS: Oracle
- Parallel query (automatic in 11g)
- Hash and bitmap indexes
- Stable and well-known BI tools
- Wide variety of compression options
- Amazingly advanced query optimizer
- Star schemas with dimensions and hierarchies
- Excellent vertical performance scaling
Column Store Architecture
- Column data stored together; every column acts as an index
- Good compression
- Updates that touch an entire row are hideously slow
Column Stores: Vertica
- PostgreSQL syntax (but little/no shared code)
- Parallel query
- Built-in star schema support
- Time series support
- Multiple compression methods
- Built-in HA model
- Widely used, excellent scaling
Column Store: Calpont InfiniDB
- Looks like MySQL to applications (with minor differences)
- Distributed architecture with parallel query
- Columns compressed and fully indexed
- Automatic partitioning of data
- Built-in HA using distributed data copies
NoSQL/Hadoop
- Minimal SQL dialect (subset of SQL-92)
- Data access is non-transparent
- Hadoop is batch-oriented
- Excellent horizontal scaling in the cloud
- Parallel query using map/reduce
- HiveQL is getting better fast
- Handles failures by automatic job resubmission
Real-Time Data Loading
Options for Loading a Data Warehouse
1. Extract/Transform/Load (ETL) software: stable, with good GUI tools, but slow, resource-intensive, and affects the application
2. Do-it-yourself reads from the binlog: unstable and hard to maintain (ask me how I know)
3. Real-time replication with Tungsten Replicator: fast, with minimal application load or disruption
DEMO
MySQL-to-Vertica replication with some bells and a whistle: sysbench load against MySQL schemas db01, db02, and db03, with db02 renamed to renamed02 on the Vertica side.
Understanding Tungsten Replicator
- The master replicator extracts transactions from the DBMS logs into its THL (Transaction History Log: transactions plus metadata)
- The slave replicator downloads transactions via the network into its own THL
- The slave applies transactions to the target using JDBC
Pipelines with Parallel Apply
- Each replicator runs a pipeline made up of stages; every stage extracts, filters, and applies events
- Master side: DBMS → Transaction History Log
- Slave side: Transaction History Log → in-memory queue → multiple parallel apply stages → slave DBMS
Real-Time Heterogeneous Transfer (MySQL to Oracle)
- MySQL master with binlog_format=row; the Tungsten master replicator (service oracle) extracts from the binlog with MySQLExtractor
- Special filters on the extract side: transform enum to string
- Special filters on the apply side: ignore extra tables, map names to upper case, optimize updates to remove unchanged columns
- The Tungsten slave replicator (service oracle) applies to Oracle
Column Store: Real-Time Batches (MySQL to Vertica)
- MySQL with binlog_format=row; the Tungsten master replicator (service my2vr) extracts from the binlog with MySQLExtractor
- Special filters: pkey (fill in primary key info), colnames (fill in column names), replicate (ignore tables)
- The slave replicator (service my2vr) writes CSV files and loads large transaction batches to leverage load parallelization
Batch Loading: The Gory Details
- The replicator (service my2vr) receives transactions from the master and writes them to CSV files
- The CSV files are loaded via COPY into staging tables
- A merge script moves rows from staging tables into base tables with SELECT/INSERT and DELETE
- Alternatively, COPY directly into the base tables
Vertica Implementation Steps
0. Get Software and Documentation
- Software: http://code.google.com/p/tungsten-replicator
- Documentation: https://docs.continuent.com/wiki/display/tedoc
1. Best Practices for MySQL
- Single-column keys
- UTF-8 data
- GMT timezone (currently required by Tungsten)
- Row replication enabled
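As a sketch only, the practices above might translate into a my.cnf fragment like the following (the option values are illustrative assumptions, not settings taken from this presentation):

```ini
# Hypothetical my.cnf fragment reflecting the best practices above.
[mysqld]
log-bin              = mysql-bin   # binary logging must be on for extraction
server-id            = 1
binlog_format        = ROW         # row replication, required by the Vertica pipeline
character-set-server = utf8        # UTF-8 data
default-time-zone    = '+00:00'    # run the server in GMT
```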
2. Handle Availability
- What happens if MySQL fails?
- What happens if a replicator fails?
- What happens if Vertica fails?
3. Create Base Tables

/* MySQL table definition */
CREATE TABLE `sbtest` (
  `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `k` int(10) unsigned NOT NULL DEFAULT '0',
  `c` char(120) NOT NULL DEFAULT '',
  `pad` char(60) NOT NULL DEFAULT '',
  PRIMARY KEY (`id`),
  KEY `k` (`k`));

/* Vertica table definition */
create table db01.sbtest (
  id int,
  k int,
  c char(120),
  pad char(60));
4. Provision Initial Data

Option 1 (large data sets): CSV loading

mysql> SELECT * FROM foo INTO OUTFILE 'foo.csv';
... (fix up data if necessary) ...
vsql> COPY foo FROM 'foo.csv' DIRECT NULL 'null' DELIMITER ',' ENCLOSED BY '"';

Option 2 (small data sets):
- Run transactions through the replicator itself
- Dump, then restore
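The "fix up data if necessary" step above is where export formats get reconciled. As one hedged example: MySQL's SELECT ... INTO OUTFILE writes NULL values as \N, while the COPY command on this slide expects the literal word null (NULL 'null'). A naive sed pass (sample data invented here) could rewrite them:

```shell
# Illustrative only: rewrite MySQL's \N null marker to the literal "null"
# expected by COPY ... NULL 'null'. A blanket substitution like this can
# corrupt fields that legitimately contain the text \N, so real data may
# need a proper CSV-aware tool instead.
printf '1,\\N,"abc"\n2,42,\\N\n' > foo.csv      # stand-in for the MySQL export
sed 's/\\N/null/g' foo.csv > foo_fixed.csv
cat foo_fixed.csv
```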
5. Select Tungsten Filter Options
- Tables to ignore/include?
- Custom filters?
- Schema/table/column renaming?
- Map names to upper/lower case?

tungsten-installer --master-slave -a \
  --service-name=mysql2vertica \
  ... \
  --svc-extractor-filters=replicate \
  --svc-applier-filters=dbtransform \
  --property=replicator.filter.replicate.do=db01.*,db02.* \
  --property=replicator.filter.dbtransform.from_regex1=db02 \
  --property=replicator.filter.dbtransform.to_regex1=renamed02 \
  ...
5. Customize Merge Script

# Hacked load script for Vertica--deletes always precede inserts, so
# inserts can load directly.

# Extract deleted data keys and put them in a temp CSV file for deletes.
!egrep '^"D",' %%CSV_FILE%% | cut -d, -f4 > %%CSV_FILE%%.delete
COPY %%STAGE_TABLE_FQN%% FROM '%%CSV_FILE%%.delete' DIRECT NULL 'null' DELIMITER ',' ENCLOSED BY '"'

# Delete rows using an IN clause. You could also set a column value to
# mark deleted rows.
DELETE FROM %%BASE_TABLE%% WHERE %%BASE_PKEY%% IN (SELECT %%STAGE_PKEY%% FROM %%STAGE_TABLE_FQN%%)

# Load inserts directly into the base table from a separate CSV file.
!egrep '^"I",' %%CSV_FILE%% | cut -d, -f4- > %%CSV_FILE%%.insert
COPY %%BASE_TABLE%% FROM '%%CSV_FILE%%.insert' DIRECT NULL 'null' DELIMITER ',' ENCLOSED BY '"'
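A toy run of the egrep/cut split used by the merge script: the first three CSV columns mirror the staging table layout (tungsten_opcode, tungsten_seqno, tungsten_row_id), and the remaining columns are row data. The sample rows are invented for illustration.

```shell
# Split a Tungsten-style batch CSV into delete keys and insert rows,
# exactly as the merge script does. Sample data is made up.
cat > sample.csv <<'EOF'
"I","10","1","1","42","aaa","pad1"
"D","11","2","2"
"I","11","3","2","43","bbb","pad2"
EOF
egrep '^"D",' sample.csv | cut -d, -f4  > sample.csv.delete   # keys of deleted rows
egrep '^"I",' sample.csv | cut -d, -f4- > sample.csv.insert   # full data of inserted rows
cat sample.csv.delete
cat sample.csv.insert
```

Because deletes always precede inserts in a batch, the delete file can drive the staged DELETE while the insert file loads straight into the base table.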
6. Create Staging Tables

/* Full staging table */
create table db01.stage_xxx_sbtest (
  tungsten_opcode char(1),
  tungsten_seqno int,
  tungsten_row_id int,
  id int,
  k int,
  c char(120),
  pad char(60));

(OR)

/* Staging table with delete keys only. */
create table db01.stage_xxx_sbtest (id int);
7. Install Replicators
- Master/slave vs. direct replication
- Directory to hold CSV files
- How long to preserve logs
- Memory size (Java heap)
- Filter settings (and where to run them)
- Run the replicator locally or on separate host(s)
8. Test and Deploy!
- Typical test cycles for DW loading run to months, not weeks or days
- Use production data
- Monitoring/alerting
Advanced Replication Features
More Possibilities for Analytics...
Replicating OLTP data from the MySQL master also enables:
- Complex, near-real-time reporting
- Light-weight, real-time operational status
- Web-facing mini data marts for SaaS users
Adding Clustering to MySQL
- Master replication services at each site: nyc (New York), fra (Frankfurt), sfo (San Francisco)
- A replicator running slave services nyc, fra, and sfo consolidates all three sites into the data warehouse
Conclusion
- Data warehouses enable fast analytics on MySQL transactions
- Multiple data warehouse technologies are available
- Heterogeneous data replication solves the problem of real-time loading
One more thing: WE'RE HIRING!!!
560 S. Winchester Blvd., Suite 500, San Jose, CA 95128
Tel +1 (866) 998-3642 / Fax +1 (408) 668-1009
E-mail: sales@continuent.com
Our blogs:
- http://scale-out-blog.blogspot.com
- http://datacharmer.org/blog
- http://www.continuent.com/news/blogs
Continuent web page: http://www.continuent.com
Tungsten Replicator 2.0: http://code.google.com/p/tungsten-replicator