How, What, and Where of Data Warehouses for MySQL




Robert Hodges, CEO, Continuent

Introducing Continuent

The leading provider of clustering and replication for open source DBMS. Our product: Continuent Tungsten.
* Clustering - commercial-grade HA, performance scaling, and data management for MySQL
* Replication - flexible, high-performance data movement

Why Do MySQL Applications Need a Data Warehouse?

Defining the Problem

"In Retail War, Prices on Web Change Hourly" (New York Times, December 1, 2012)

Typical Schema for Sales Analytics

* Product: sku, product_type, ...
* Period: hour, day_of_week, day_of_month, week, month, ...
* Sales: customer, product, quantity, sale_type, location, discount, sale_amount, sale_time, period, payment_type, campaign, ...
* Customer: first_name, last_name, loyalty_rank, street, ...
* Location: city, county, state, country, ...

InnoDB = Row Store

* Clustered by primary key: row data is stored together
* Secondary indexes reference rows by primary key
* Indexes slow writes

Example Sales table rows (id, cust_id, prod_id, ...): (1, 335301, 532, ...), (2, 2378, 6235, ...)
Cust_ID index (cust_id -> id): 2378 -> 2, 335301 -> 1, ...
Prod_ID index (prod_id -> id): 532 -> 1, 6235 -> 2, ...
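The layout above can be sketched as DDL (a minimal sketch using the table and columns from the slide; the exact types are assumptions):

```sql
/* Illustrative only: InnoDB clusters row data by PRIMARY KEY, and each
   secondary index entry carries the indexed column plus the primary key,
   so every insert must also update both secondary B-Trees. */
CREATE TABLE sales (
  id      INT UNSIGNED NOT NULL AUTO_INCREMENT,
  cust_id INT UNSIGNED NOT NULL,
  prod_id INT UNSIGNED NOT NULL,
  PRIMARY KEY (id),        -- clustered: row data stored with this key
  KEY idx_cust (cust_id),  -- secondary: (cust_id, id) pairs
  KEY idx_prod (prod_id)   -- secondary: (prod_id, id) pairs
) ENGINE=InnoDB;
```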

Row Store + MySQL Server = OLTP

* Fast update of a small number of rows
* Limited indexing (few types, B-Tree only)
* Minimal compression
* Nested-loop joins
* Single-threaded query execution
* Sharded data sets

OLTP != Analytics

Analytics workloads need:
* Parallel execution
* Time series
* Spatial query
* Recursive query
* Efficient search on any column
* Star schema organization
* Data cubes/pivot tables (OLAP)
* Business Intelligence (BI) tool integration

Solution: MySQL + Data Warehouse

* Sharded MySQL for high transaction throughput
* Near-real-time loading
* Data warehouse for fast analytics

Data Warehouse Options

Commercial DBMS -- Oracle

* Parallel query (automatic in 11g)
* Hash and bitmap indexes
* Stable and well-known BI tools
* Wide variety of compression options
* Amazingly advanced query optimizer
* Star schemas with dimensions & hierarchies
* Excellent vertical performance scaling

Column Store Architecture

* Column data is stored together: every column is effectively an index
* Good compression
* Updates to an entire row are hideously slow

Example Sales table columns: cust_id (335301, 2378, ...), prod_id (532, 6235, ...), quantity (1, 3, ...)
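As a sketch of the kind of query a column store favors (table and column names assumed from the sales schema earlier in the deck), the engine scans and decompresses only the columns the query references:

```sql
/* Illustrative: only prod_id, quantity, and sale_time are read from
   disk; the remaining columns of the wide sales table are never
   touched, and each column compresses well on its own. */
SELECT prod_id, SUM(quantity) AS units_sold
FROM sales
WHERE sale_time >= '2012-01-01'
GROUP BY prod_id
ORDER BY units_sold DESC;
```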

Column Stores -- Vertica

* PostgreSQL syntax (but little/no shared code)
* Parallel query
* Built-in star schema support
* Time series support
* Multiple compression methods
* Built-in HA model
* Widely used, excellent scaling

Column Store -- Calpont InfiniDB

* Looks like MySQL to apps (with minor differences)
* Distributed architecture with parallel query
* Columns compressed and fully indexed
* Automatic partitioning of data
* Built-in HA using distributed data copies

NoSQL/Hadoop

* Minimal SQL dialect (subset of SQL-92)
* Data access is non-transparent
* Hadoop is batch-oriented
* Excellent horizontal scaling in the cloud
* Parallel query using map/reduce
* HiveQL is getting better fast
* Handles failures by automatic job resubmission
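A hedged HiveQL sketch of the points above (table name and HDFS path are hypothetical): the table is just a schema mapped over files in HDFS, and aggregates compile to batch map/reduce jobs rather than running interactively.

```sql
/* Illustrative HiveQL: the data stays in HDFS files; the table
   definition only overlays a schema on them. */
CREATE EXTERNAL TABLE sales (
  cust_id INT,
  prod_id INT,
  quantity INT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/warehouse/sales';

/* This aggregate runs as one or more map/reduce jobs: */
SELECT prod_id, SUM(quantity)
FROM sales
GROUP BY prod_id;
```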

Real-Time Data Loading

Options for Loading a Data Warehouse

1. Extract/Transform/Load (ETL) software - stable, with good GUI tools, but slow, resource-intensive, and intrusive to the application
2. Do-it-yourself reads from the binlog - unstable and hard to maintain (ask me how I know)
3. Real-time replication with Tungsten Replicator - fast, with minimal application load or disruption

DEMO

MySQL-to-Vertica replication with some bells and a whistle: sysbench runs against MySQL schemas db01, db02, and db03; one schema is filtered out, and another arrives in Vertica renamed to renamed02.

Understanding Tungsten Replicator

Master: the replicator extracts transactions from the DBMS logs into the THL (Transaction History Log: transactions + metadata).
Slave: the replicator downloads transactions from the master via the network into its own THL, then applies them using JDBC.

Pipelines with Parallel Apply

A pipeline is a series of stages, each an extract-filter-apply sequence: one stage reads the master DBMS into the Transaction History Log, another moves events into an in-memory queue, and multiple parallel extract-filter-apply stages drain the queue into the slave DBMS.

Real-Time Heterogeneous Transfer (MySQL to Oracle)

The Tungsten master replicator (service oracle) extracts from the MySQL binlog (binlog_format=row) via MySQLExtractor, with special filters to transform ENUM values to strings. The Tungsten slave replicator applies to Oracle, with special filters to ignore extra tables, map names to upper case, and optimize updates to remove unchanged columns.

Column Store -- Real-Time Batches (MySQL to Vertica)

The Tungsten master replicator (service my2vr) extracts from the MySQL binlog (binlog_format=row) via MySQLExtractor, with special filters: pkey (fill in primary key info), colnames (fill in column names), and replicate (ignore tables). The Tungsten slave replicator writes CSV files and loads them in large transaction batches to leverage the column store's load parallelization.

Batch Loading -- The Gory Details

The slave replicator (service my2vr) receives transactions from the master and writes them to CSV files. It then COPYs the CSV files into staging tables, and a merge script SELECTs from the staging tables into the base tables (or COPYs directly into the base tables).
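The staging-to-base merge cycle can be sketched in SQL (table names hypothetical; the tungsten_opcode column and the D/I opcodes match the staging table and merge script shown later in the deck):

```sql
/* Illustrative merge cycle: each batch of replicated changes is bulk
   loaded into a staging table, then folded into the base table. */
COPY stage_sbtest FROM '/tmp/batch.csv'
  DIRECT NULL 'null' DELIMITER ',' ENCLOSED BY '"';

/* Remove rows the batch deletes... */
DELETE FROM sbtest
 WHERE id IN (SELECT id FROM stage_sbtest WHERE tungsten_opcode = 'D');

/* ...then insert the new row images. */
INSERT INTO sbtest (id, k, c, pad)
SELECT id, k, c, pad FROM stage_sbtest WHERE tungsten_opcode = 'I';

TRUNCATE TABLE stage_sbtest;
```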

Vertica Implementation Steps

0. Get Software and Documentation

Get the software: http://code.google.com/p/tungsten-replicator
Get the documentation: https://docs.continuent.com/wiki/display/tedoc

1. Best Practices for MySQL

* Single-column keys
* UTF-8 data
* GMT time zone (currently required by Tungsten)
* Row replication enabled
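A sketch of the corresponding server settings, expressed as SQL (a my.cnf entry would be the durable form; the database name is hypothetical):

```sql
/* Illustrative: enable row-based binary logging and pin the server
   to GMT/UTC, matching the best practices above. */
SET GLOBAL binlog_format = 'ROW';
SET GLOBAL time_zone = '+00:00';

/* New tables should default to UTF-8: */
CREATE DATABASE db01 DEFAULT CHARACTER SET utf8;
```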

2. Handle Availability

* What happens if MySQL fails?
* What happens if a replicator fails?
* What happens if Vertica fails?

3. Create Base Tables

/* MySQL table definition */
CREATE TABLE `sbtest` (
  `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `k` int(10) unsigned NOT NULL DEFAULT '0',
  `c` char(120) NOT NULL DEFAULT '',
  `pad` char(60) NOT NULL DEFAULT '',
  PRIMARY KEY (`id`),
  KEY `k` (`k`));

/* Vertica table definition */
create table db01.sbtest(
  id int,
  k int,
  c char(120),
  pad char(60)
);

4. Provision Initial Data

Option 1 (large data sets): CSV loading

mysql> SELECT * FROM foo INTO OUTFILE 'foo.csv';
... (fix up data if necessary) ...
vsql> COPY foo FROM 'foo.csv' DIRECT NULL 'null' DELIMITER ',' ENCLOSED BY '"';

Option 2 (small data sets):
* Run transactions through the replicator itself
* Dump, then restore

5. Select Tungsten Filter Options

* Tables to ignore/include?
* Custom filters?
* Schema/table/column renaming?
* Map names to upper/lower case?

tungsten-installer --master-slave -a \
  --service-name=mysql2vertica \
  ... \
  --svc-extractor-filters=replicate \
  --svc-applier-filters=dbtransform \
  --property=replicator.filter.replicate.do=db01.*,db02.* \
  --property=replicator.filter.dbtransform.from_regex1=db02 \
  --property=replicator.filter.dbtransform.to_regex1=renamed02 \
  ...

6. Customize Merge Script

# Hacked load script for Vertica--deletes always precede inserts, so
# inserts can load directly.

# Extract deleted data keys and put in temp CSV file for deletes.
!egrep '^"D",' %%CSV_FILE%% | cut -d, -f4 > %%CSV_FILE%%.delete
COPY %%STAGE_TABLE_FQN%% FROM '%%CSV_FILE%%.delete' DIRECT NULL 'null' DELIMITER ',' ENCLOSED BY '"'

# Delete rows using an IN clause. You could also set a column value to
# mark deleted rows.
DELETE FROM %%BASE_TABLE%% WHERE %%BASE_PKEY%% IN (SELECT %%STAGE_PKEY%% FROM %%STAGE_TABLE_FQN%%)

# Load inserts directly into base table from a separate CSV file.
!egrep '^"I",' %%CSV_FILE%% | cut -d, -f4- > %%CSV_FILE%%.insert
COPY %%BASE_TABLE%% FROM '%%CSV_FILE%%.insert' DIRECT NULL 'null' DELIMITER ',' ENCLOSED BY '"'

7. Create Staging Tables

/* Full staging table */
create table db01.stage_xxx_sbtest(
  tungsten_opcode char(1),
  tungsten_seqno int,
  tungsten_row_id int,
  id int,
  k int,
  c char(120),
  pad char(60));

(OR)

/* Staging table with delete keys only. */
create table db01.stage_xxx_sbtest(id int);

8. Install Replicators

* Master/slave vs. direct replication
* Directory to hold CSV files
* How long to preserve logs
* Memory size (Java heap)
* Filter settings (and where to run them)
* Run the replicator locally or on separate host(s)

9. Test and Deploy!

* Typical test cycles for DW loading run to months, not weeks or days
* Use production data
* Monitoring/alerting

Advanced Replication Features

More Possibilities for Analytics...

From a MySQL master holding OLTP data, replicate out to:
* Complex, near-real-time reporting
* Light-weight, real-time operational status
* Web-facing mini data marts for SaaS users

Adding Clustering to MySQL

Masters in New York (service nyc), Frankfurt (fra), and San Francisco (sfo) each run their own replicator; a replicator at the data warehouse in San Francisco subscribes as a slave to all three services (nyc, fra, sfo) and applies them to the warehouse.

Conclusion

* Data warehouses enable fast analytics on MySQL transactions
* Multiple data warehouse technologies are available
* Heterogeneous data replication solves the problem of real-time loading

One more thing: WE'RE HIRING!!!

560 S. Winchester Blvd., Suite 500, San Jose, CA 95128
Tel +1 (866) 998-3642 / Fax +1 (408) 668-1009
E-mail: sales@continuent.com

Our blogs:
http://scale-out-blog.blogspot.com
http://datacharmer.org/blog
http://www.continuent.com/news/blogs

Continuent web page: http://www.continuent.com
Tungsten Replicator 2.0: http://code.google.com/p/tungsten-replicator