From Dolphins to Elephants: Real-Time MySQL to Hadoop Replication with Tungsten

MC Brown, Director of Documentation; Linas Virbalas, Senior Software Engineer

About Tungsten Replicator
Open source drop-in replacement for MySQL replication, providing:
- Global transaction ID
- Multiple masters
- Multiple sources
- Flexible topologies
- Parallel replication
- Heterogeneous replication

Tungsten Replicator
[Diagram: On the master, a replicator reads the DBMS logs and writes them to the THL (transactions + metadata). On the slave, a replicator downloads transactions via the network into its own THL (transactions + metadata) and applies them using JDBC.]
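
The master-to-slave flow on this slide can be sketched in a few lines of Python. This is an illustrative model only, not Tungsten's actual THL format or API: the `ThlEvent` and `Thl` classes, their fields, and the `sync` function are hypothetical names chosen for the example. It shows why a global transaction ID (a plain sequence number here) lets a slave resume from wherever it stopped.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterator, List, Tuple


@dataclass
class ThlEvent:
    """One transaction plus metadata (hypothetical layout, not the real THL format)."""
    seqno: int                                   # global transaction ID, strictly increasing
    source_id: str                               # which master produced the transaction
    rows: Tuple = ()                             # row changes carried by the transaction
    metadata: Dict[str, str] = field(default_factory=dict)


class Thl:
    """Append-only transaction history log kept by each replicator."""

    def __init__(self) -> None:
        self._events: List[ThlEvent] = []

    def append(self, event: ThlEvent) -> None:
        # Reject out-of-order events so the log stays sorted by seqno.
        if self._events and event.seqno <= self._events[-1].seqno:
            raise ValueError("seqno must increase")
        self._events.append(event)

    def last_seqno(self) -> int:
        # -1 means "nothing stored yet".
        return self._events[-1].seqno if self._events else -1

    def events_after(self, seqno: int) -> Iterator[ThlEvent]:
        return (e for e in self._events if e.seqno > seqno)


def sync(master: Thl, slave: Thl) -> int:
    """Copy the events the slave has not seen yet; return how many were copied."""
    copied = 0
    for event in master.events_after(slave.last_seqno()):
        slave.append(event)
        copied += 1
    return copied
```

Calling `sync` repeatedly only transfers events newer than the slave's last stored sequence number, which is the property that makes stopping and restarting replication safe in this model.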

How Tungsten Replicator Works
[Diagram: A pipeline made up of stages, each performing Extract, Filter, Apply. Data moves from the master DBMS to the Transaction History Log, then to an in-memory queue, then to the slave DBMS.]
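
A minimal sketch of the extract/filter/apply idea, assuming a simplified event shape and running the stages one after another rather than concurrently. The names `run_stage`, `skip_schema`, `drain`, and `replicate` are hypothetical and exist only for this example; they are not Tungsten classes or configuration keys.

```python
from queue import Queue
from typing import Callable, Iterable, Iterator, List, Optional

Event = dict  # assumed event shape: {"seqno": int, "schema": str, "sql": str}
Filter = Callable[[Event], Optional[Event]]


def run_stage(extract: Iterable[Event], filters: List[Filter],
              apply: Callable[[Event], None]) -> None:
    """Run one stage: pull each event, pass it through the filters, then apply it."""
    for event in extract:
        for f in filters:
            event = f(event)
            if event is None:          # a filter may drop the event entirely
                break
        else:
            apply(event)               # reached only when no filter dropped the event


def skip_schema(name: str) -> Filter:
    """Build a filter that drops events belonging to one schema."""
    return lambda e: None if e["schema"] == name else e


def drain(q: Queue) -> Iterator[Event]:
    """Yield queued events until the queue is empty."""
    while not q.empty():
        yield q.get()


def replicate(master_log: List[Event]) -> List[Event]:
    """Chain three stages: master log -> THL -> in-memory queue -> slave."""
    thl: List[Event] = []
    queue: Queue = Queue()
    slave: List[Event] = []
    run_stage(master_log, [skip_schema("test")], thl.append)
    run_stage(thl, [], queue.put)
    run_stage(drain(queue), [], slave.append)
    return slave
```

Each stage has the same shape, so adding a filter (for example, one that renames a schema) means adding a function to one stage's list without touching the others.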

Where we replicate
[Diagram: supported topologies]
- master-slave
- fan-in slave
- all-masters
- star-schema
- direct slave (regular MySQL)
- heterogeneous (MySQL to Oracle, Oracle to MySQL)

Why Hadoop
- Customer driven
- Change in the air
- Environments moving to heterogeneous
- NoSQL was the first
- We already support MongoDB
- Hadoop used for big analytics
- More frequently a live resource
- Big datasets require Map/Reduce

Tungsten Replicator and Hadoop
- Extract from MySQL or Oracle
- Compatible with base Hadoop and commercial distributions: Cloudera, Hortonworks, Amazon Elastic MapReduce, and IBM InfoSphere BigInsights
- Automatic replication of incremental changes
- Customizable formatting
- Hive schema generation
- Materialized views in Hive for carbon-copy tables
- Sqoop and parallel extractor compatibility for provisioning

Applying Data into Hadoop
[Diagram: A replicator extracts transactions from the DBMS logs into the THL; a second replicator reads the THL and writes CSV into Hadoop.]

Materialized Views
[Diagram: CSV (staging) data in Hadoop, with ID and Message columns, alongside the resulting Hive table.]

MySQL Configuration
- Use row-based replication
- Extracts to standard THL
- Every table must have a primary key
- Replicator configured with filters for metadata and primary key optimization
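
The two source-side requirements above (row-based replication and a primary key on every table) can be checked before replication starts. The sketch below is not part of Tungsten Replicator: `check_source` is a hypothetical helper, and it assumes `cursor` is a DB-API 2.0 cursor from a MySQL driver that uses `%s` placeholders (PyMySQL and mysqlclient both do). The queries use the standard `binlog_format` system variable and the `information_schema` tables.

```python
from typing import Any, List

# Base tables in one schema that have no PRIMARY KEY constraint.
MISSING_PK_SQL = """
SELECT t.table_schema, t.table_name
FROM information_schema.tables AS t
LEFT JOIN information_schema.table_constraints AS c
  ON c.table_schema = t.table_schema
 AND c.table_name = t.table_name
 AND c.constraint_type = 'PRIMARY KEY'
WHERE t.table_type = 'BASE TABLE'
  AND t.table_schema = %s
  AND c.constraint_name IS NULL
"""


def check_source(cursor: Any, schema: str) -> List[str]:
    """Return a list of human-readable problems; an empty list means the checks passed."""
    problems: List[str] = []

    # Row-based replication is required so that full row images reach the log.
    cursor.execute("SELECT @@global.binlog_format")
    (fmt,) = cursor.fetchone()
    if str(fmt).upper() != "ROW":
        problems.append(f"binlog_format is {fmt}, expected ROW")

    # Every replicated table needs a primary key.
    cursor.execute(MISSING_PK_SQL, (schema,))
    for table_schema, table_name in cursor.fetchall():
        problems.append(f"{table_schema}.{table_name} has no primary key")
    return problems
```

Returning a list of problems rather than raising on the first one makes it easy to report everything that needs fixing in a single pass.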

Configure Hadoop
- Data is stored in CSV format on HDFS
- Cloudera, Hortonworks, Amazon Elastic MapReduce (EMR), and IBM InfoSphere BigInsights compatible
- Compatible with Hive, HBase, and others
- Live table DDL can be automatically generated
- Staging DDL can be automatically generated

DDL Generation
- Built-in tool, part of Tungsten Replicator
- Handles staging and live table DDL generation
- Default mode performs a default migration to Hive types
- Customizable for your needs (for example, BigInts as Strings)
- Data transformations possible through filters
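
To make the idea concrete, here is a rough sketch of type mapping and DDL generation. It does not reproduce the built-in tool: the mapping table, the `hive_type` and `hive_ddl` functions, the `stage_` prefix, and the `op_code` and `seqno` staging columns are all assumptions made for the example. The Hive type names themselves (INT, BIGINT, STRING, TIMESTAMP, and so on) are real Hive types.

```python
import re
from typing import List, Tuple

# Assumed MySQL -> Hive type mapping, for illustration only.
HIVE_TYPES = {
    "tinyint": "TINYINT", "smallint": "SMALLINT", "mediumint": "INT",
    "int": "INT", "bigint": "BIGINT", "float": "FLOAT", "double": "DOUBLE",
    "date": "DATE", "datetime": "TIMESTAMP", "timestamp": "TIMESTAMP",
}


def hive_type(mysql_type: str, bigint_as_string: bool = False) -> str:
    """Map a MySQL column type such as 'int(11) unsigned' to a Hive type."""
    match = re.match(r"[a-z]+", mysql_type.strip().lower())
    name = match.group(0) if match else ""
    if name == "bigint" and bigint_as_string:
        return "STRING"                      # the "BigInts as Strings" customization
    return HIVE_TYPES.get(name, "STRING")    # anything unmapped falls back to STRING


def hive_ddl(table: str, columns: List[Tuple[str, str]],
             staging: bool = False, bigint_as_string: bool = False) -> str:
    """Build a CREATE TABLE statement for a live table or its staging table."""
    cols = [(name, hive_type(typ, bigint_as_string)) for name, typ in columns]
    if staging:
        # Staging tables carry change metadata ahead of the data columns
        # (hypothetical column names).
        cols = [("op_code", "STRING"), ("seqno", "BIGINT")] + cols
        table = f"stage_{table}"
    body = ",\n".join(f"  {name} {typ}" for name, typ in cols)
    return f"CREATE TABLE {table} (\n{body}\n)"
```

For example, `hive_ddl("msg", [("id", "int(11)"), ("message", "text")], staging=True)` produces a `stage_msg` table whose first two columns hold the change metadata.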

Replicator Hadoop Configuration
- Batch commit interval: by row count or by time interval
- CSV format: predefined formats, customizable by field and row characters
- Parallelization supported
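
The batching behaviour described here can be illustrated with a small buffer that flushes either when a row limit is reached or when a time window has elapsed. This is a sketch under stated assumptions, not the replicator's implementation: `CsvBatcher` and its parameter names are hypothetical, and the time limit is only checked when a new row arrives (a real system would more likely use a timer).

```python
import csv
import io
import time
from typing import Callable, List, Optional, Sequence


class CsvBatcher:
    """Buffer rows and flush them as one CSV chunk, by row count or elapsed time."""

    def __init__(self, sink: Callable[[str], None], max_rows: int = 1000,
                 max_seconds: float = 5.0, field_sep: str = ",",
                 row_sep: str = "\n",
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.sink = sink                  # receives each finished CSV chunk
        self.max_rows = max_rows
        self.max_seconds = max_seconds
        self.field_sep = field_sep        # must be a single character for csv.writer
        self.row_sep = row_sep
        self.clock = clock                # injectable so the time rule can be tested
        self.rows: List[Sequence] = []
        self.started: Optional[float] = None

    def add(self, row: Sequence) -> None:
        if not self.rows:
            self.started = self.clock()   # the time window opens with the first row
        self.rows.append(row)
        if (len(self.rows) >= self.max_rows
                or self.clock() - self.started >= self.max_seconds):
            self.flush()

    def flush(self) -> None:
        if not self.rows:
            return
        buf = io.StringIO()
        writer = csv.writer(buf, delimiter=self.field_sep,
                            lineterminator=self.row_sep)
        writer.writerows(self.rows)
        self.sink(buf.getvalue())
        self.rows = []
```

In practice `sink` would write the chunk to a file destined for HDFS; passing `list.append` is enough to experiment with the flush rules.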

Materialized Views
- Merges data from staging CSV into Hive tables
- Processing separate from the replicator
- Allows individual table views to be generated independently
- Allows for custom materialization intervals
- Views based on 'live' data, or by point-in-time from CSV staging
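
The merge step can be modelled in plain Python to show how staged changes become a table and how a point-in-time view falls out of the same logic. The row layout below (sequence number, operation code, ID, message) and the convention that an update arrives as a delete followed by an insert are assumptions made for this sketch; `materialize` is a hypothetical function, not the materialization job itself, which runs inside Hadoop.

```python
from typing import Dict, Iterable, Optional, Tuple

# Assumed staging row layout: (seqno, op_code, id, message).
# op_code "I" means insert, "D" means delete.
StagingRow = Tuple[int, str, int, Optional[str]]


def materialize(base: Dict[int, str], staging: Iterable[StagingRow],
                up_to_seqno: Optional[int] = None) -> Dict[int, str]:
    """Return a new table state with the staged changes applied in seqno order."""
    table = dict(base)
    # sorted() is stable, so a delete and insert sharing a seqno keep their order.
    for seqno, op, row_id, message in sorted(staging, key=lambda r: r[0]):
        if up_to_seqno is not None and seqno > up_to_seqno:
            break                        # point-in-time view: ignore later changes
        if op == "D":
            table.pop(row_id, None)      # deleting a missing row is a no-op
        elif op == "I":
            table[row_id] = message      # insert or overwrite
        else:
            raise ValueError(f"unknown op_code: {op!r}")
    return table
```

Because deletes tolerate missing rows and inserts overwrite, replaying the same staging data against an already materialized table gives the same result, and each table can be materialized on its own schedule.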

Demo

Provisioning Data
Sqoop:
- Start the replicator
- Sqoop the data
- Materialized views are idempotent
- DDL generation is Hive compatible
Parallel Extractor:
- Currently Oracle only
- Will extract data in parallel and insert into THL

Replication Management
- Replication can be stopped, started, restarted at any time
- Enables MySQL or Hadoop maintenance windows
- DDL customizable
- Views regenerated at any time
- Schema changes can be handled by re-Sqooping and dematerializing views

560 S. Winchester Blvd., Suite 500, San Jose, CA 95128
Tel +1 (866) 998-3642
Fax +1 (408) 668-1009
e-mail: sales@continuent.com

Our Blogs:
http://scale-out-blog.blogspot.com
http://mcslp.wordpress.com
http://flyingclusters.blogspot.com
http://www.continuent.com/news/blogs

Continuent Web Page: http://www.continuent.com

Tungsten Replicator 2.2 and 3.0 Preview: http://code.google.com/p/tungsten-replicator