MySQL and Hadoop
Percona Live 2014
Chris Schneider
About Me
- Chris Schneider, Database Architect @ Groupon
- Spent the last 10 years building MySQL architecture for multiple companies
- Worked with Hadoop for the past ~3 years
- chschneider@groupon.com
What we'll cover
- Apache Hadoop
- CDH (Cloudera's Distribution for Apache Hadoop)
- Use cases for Hadoop
- Simple MapReduce overview
- Sqoop
- Hive
- Impala
- Tungsten Replicator: MySQL -> HDFS
What is Hadoop?
- An open-source framework for storing and processing data on a cluster of servers
- Based on Google's whitepapers on the Google File System (GFS) and MapReduce
- Scales linearly
- Designed for batch processing
- Optimized for streaming reads
Where do I start?
[Image: Hadoop Ecosystem, by Aryan Nava]
Hadoop Distribution
- Cloudera provides a distribution for Apache Hadoop, along with many other components in the Hadoop ecosystem
- What is a distribution?
  - Repositories
  - Documentation
  - Bug fixes
  - Tested releases
  - Cloudera Manager
- There are other companies who provide distributions
Why Hadoop: Volume
- Use Hadoop when you cannot or should not use a traditional RDBMS
[Image source: www.grc.nasa.gov]
Why Hadoop: Velocity
- Can ingest terabytes of data per day
[Image source: www.physics4kids.com]
Why Hadoop: Variety
- You can have structured or unstructured data
[Image source: www.dataenthusiast.com]
Use cases for Hadoop
- Recommendation engines: Netflix recommends movies
- Ad targeting, log processing, search optimization: Glam Media, eBay, Orbitz
- Machine learning and classification: Yahoo Mail's spam detection
- Financial institutions: identity theft and credit risk
- Social graph: Facebook, LinkedIn and eHarmony connections
Some Details about Hadoop
Two main pieces of Hadoop:
- Hadoop Distributed File System (HDFS)
  - Distributed and redundant data storage across many nodes
  - Hardware will inevitably fail
- Read and process data with MapReduce
  - Processing is sent to the data
  - Many map tasks each work on a slice of the data
  - Failed tasks are automatically restarted on another node/replica
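To make the HDFS half concrete, here is a minimal shell sketch; the paths and file names are illustrative, not from the talk:

$ hadoop fs -mkdir -p /data/world                 # create a directory in HDFS
$ hadoop fs -put city.csv /data/world/            # copy a local file into HDFS; its blocks are replicated across datanodes
$ hadoop fs -ls /data/world                       # list the directory
$ hadoop fs -setrep -w 3 /data/world/city.csv     # set (and wait for) the replication factor of the file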
Hadoop Ecosystem
Management:
- Oozie (Workflow)
- Chukwa (Monitoring)
- Flume (Data Ingest)
- ZooKeeper (Management)
- Cloudera Manager (Management)
Data Access:
- Hive (SQL)
- Pig (Data Flow)
- Mahout (Machine Learning)
- Avro (RPC, Serialization)
- Sqoop (RDBMS Connector)
Data Processing:
- MapReduce
- MRv2 / YARN
Data Storage:
- HDFS (Distributed File System)
- HBase (Column Data Store)
Block Storage and Replication
[Diagram: master nodes (NameNode(s), JobTracker(s)) and slave nodes, each running a DataNode and a TaskTracker; HDFS blocks (e.g. blocks 1, 3, 5) are replicated across multiple DataNodes]
Map is used for searching
Input: "big data is totally cool and big"
For each word, MAP emits a (word, 1) pair.
Intermediate output (on local disk): big,1  data,1  is,1  totally,1  cool,1  and,1  big,1
Reduce is used to aggregate
- Hadoop groups the intermediate keys and calls a reduce for each unique key (e.g. GROUP BY, ORDER BY)
- reduce(key, list): sum the list, output (key, sum)
Input to reduce: big,(1,1)  data,(1)  is,(1)  totally,(1)  cool,(1)  and,(1)
Reduce output: big,2  data,1  is,1  totally,1  cool,1  and,1
Map/Reduce: summing revenue per company
Input split 1: CompA,2013,35.25  CompB,2013,5.25  CompA,2013,15.00  CompC,2013,25.00
  MAP -> CompA,2013,50.25  CompB,2013,5.25  CompC,2013,25.00
Input split 2: CompB,2013,20.75  CompC,2013,10.25  CompA,2013,10.00  CompB,2013,15.00
  MAP -> CompA,2013,10.00  CompB,2013,35.75  CompC,2013,10.25
REDUCE -> CompA,2013,60.25  CompB,2013,41.00  CompC,2013,35.25
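The same aggregation can be written declaratively in HiveQL (covered later in this deck); a minimal sketch, assuming the rows above are loaded into a hypothetical Hive table company_revenue(company, fiscal_year, amount):

SELECT company, fiscal_year, SUM(amount) AS total
FROM company_revenue
GROUP BY company, fiscal_year;
-- Hive compiles this GROUP BY into the same map (per-split partial output) and reduce (per-key sums) shown above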
Why do MR jobs take so long?
- The shuffle and sort phase: every map output is partitioned, sorted, copied across the network, and merged before any reduce can run, and this often dominates job runtime
[Source: Hadoop: The Definitive Guide, Figure 6-4, "Shuffle and Sort"]
Where does Hadoop fit in?
Think of Hadoop as an augmentation of your traditional RDBMS system:
- You want to store years of data, or you just have a lot of data
- You need to aggregate all of the data over many years' time
- You want ALL your data stored and accessible, not forgotten or deleted
- You need this to be free software running on commodity hardware
Where does Hadoop fit in?
[Diagram: HTTP front ends write to several MySQL instances; Oozie/Sqoop/ETL jobs move data into a Hadoop (CDH4) cluster (NameNode, NameNode2, JobTrackers, DataNodes); Hive, Impala and Pig sit on top of the cluster, feeding Tableau for business analytics]
Where does Tungsten fit in?
[Diagram: the same architecture, with Tungsten Replicator 3.0 providing a single loading path from the MySQL instances into the Hadoop (CDH4) cluster; Hive, Impala and Pig still feed Tableau on top]
Simple Data Flow
- MySQL is used for OLTP data processing
- An ETL process moves data from MySQL to Hadoop:
  - Oozie + Sqoop
  - Oozie + custom ETL
  - Tungsten Replicator 3.0+
- Use MapReduce to transform data, run batch analysis, join data, etc.
- Export the transformed results to OLAP or back to OLTP, for example a dashboard of aggregated data or a report
- You can also read this data out of Impala directly, to save the time of exporting it out of HDFS
MySQL vs. Hadoop
                                   MySQL                  Hadoop
Data capacity                      Depends, TB+           PB+
Data per query/MR                  Depends, MB -> GB      PB+
Read/write                         Random read/write      Sequential scans, append-only
Query language                     SQL                    MapReduce, scripted streaming, HiveQL, Pig Latin
Transactions                       Yes                    No
Indexes                            Yes                    No
Latency                            Sub-second             Minutes to hours
Data structure                     Relational             Both structured and unstructured
Enterprise and community support   Yes                    Yes
About Sqoop
- Open source; stands for SQL-to-Hadoop
- Parallel import and export between Hadoop and various RDBMSs
- Default implementation is JDBC
- Optimized for MySQL, but not for performance
- Integrates with connectors for Oracle, Netezza, Teradata, MicroStrategy, Tableau
Sqoop Features
- You can import specific tables or columns, and filter rows with the --where flag
- Controlled parallelism
  - Parallel mappers/connections (--num-mappers)
  - Specify the column to split on (--split-by)
- Incremental loads based on TIMESTAMP and AUTO_INCREMENT columns
- Integration with Hive (HDFS) and HBase (column store)
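A hedged sketch putting those flags together; the host, credentials, column list and cut-off values are illustrative, not from the talk:

$ sqoop import --connect jdbc:mysql://example.com/world \
    --username <database_username> \
    --table City \
    --columns "ID,Name,CountryCode,Population" \
    --where "Population > 100000" \
    --split-by ID \
    --num-mappers 8 \
    --incremental append \
    --check-column ID \
    --last-value 4079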
How Sqoop Import Works
1. The client calls sqoop import
2. Sqoop submits a map-only job to the JobTracker
3. The map tasks connect to the database server via JDBC and execute a SELECT (this happens on the TaskTrackers)
4. With --direct, mysqldump on the DataNodes connects to MySQL (in this case) or to other RDBMS systems
   - Each mapper selects a disjoint slice of the split column, e.g. WHERE PK >= 0 AND PK < N, WHERE PK >= N AND PK < 2N, and so on
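A hedged illustration of the queries Sqoop generates when splitting on an integer primary key (table name and boundary values are illustrative):

-- Sqoop first runs a bounding query on the split column:
SELECT MIN(ID), MAX(ID) FROM City;
-- ...then each map task issues its own slice, roughly:
SELECT * FROM City WHERE ID >= 1    AND ID < 1021;
SELECT * FROM City WHERE ID >= 1021 AND ID < 2041;
-- ...and so on, one range per mapper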
Sqoop Data Into Hadoop
$ sqoop import --connect jdbc:mysql://example.com/world \
    --username <database_username> \
    --table City \
    --fields-terminated-by '\t' \
    --lines-terminated-by '\n'
- This command submits a Hadoop job that queries your MySQL server and reads all the rows from world.City
- The resulting TSV file(s) are stored in HDFS
Sqoop Export
$ sqoop export --connect jdbc:mysql://example.com/world \
    --username <database_username> \
    --table City \
    --export-dir /path/in/hdfs
- The City table needs to exist within MySQL
- CSV-formatted by default
- Can use a staging table (--staging-table); see the sketch below
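A hedged variant using a staging table, so a failed export cannot leave the target table half-loaded; City_staging is an assumed, pre-created copy of City:

$ sqoop export --connect jdbc:mysql://example.com/world \
    --username <database_username> \
    --table City \
    --export-dir /path/in/hdfs \
    --staging-table City_staging \
    --clear-staging-table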
Sqoop Gotcha 1
- You NEED to make sure Connector/J is installed
- Latest version: http://dev.mysql.com/downloads/connector/j/
- Then you must place the .jar file in the Sqoop lib directory (default: /usr/lib/sqoop/lib)
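A minimal sketch of that installation step; the Connector/J version number is an assumption, use whatever you downloaded:

$ tar xzf mysql-connector-java-5.1.30.tar.gz
$ sudo cp mysql-connector-java-5.1.30/mysql-connector-java-5.1.30-bin.jar /usr/lib/sqoop/lib/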
Sqoop Gotcha 2
attempt_201302192229_0002_m_000000_0: log4j:WARN Please initialize the log4j system properly.
java.io.IOException: Cannot run program "mysqldump": java.io.IOException: error=2, No such file or directory
    at java.lang.ProcessBuilder.start(ProcessBuilder.java:460)
    at java.lang.Runtime.exec(Runtime.java:593)
    at java.lang.Runtime.exec(Runtime.java:466)
    at com.cloudera.sqoop.mapreduce.MySQLDumpMapper.map(MySQLDumpMapper.java:396)
    at com.cloudera.sqoop.mapreduce.MySQLDumpMapper.map(MySQLDumpMapper.java:49)
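This error shows up with --direct imports when the mysqldump binary is missing from the nodes running the map tasks; a hedged sketch of the fix (the package name varies by distribution):

# run on every TaskTracker/DataNode that executes map tasks
$ sudo yum install -y mysql     # RHEL/CentOS: provides the mysqldump client
$ which mysqldump               # verify it is now on the PATH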
Sqoop Gotcha 3: Full table scan
- Try not to sqoop from a master MySQL instance
- Try not to sqoop from a slave that is actively taking reads
- Sqoop's full-table selects can degrade performance by paging the scanned data into memory and pushing out the pages your workload expects to find there
About Hive
- Offers a way around the complexities of MapReduce/Java
- Hive is an open-source project managed by the Apache Software Foundation
- Non-Java users are able to access the data
- Language based on SQL (ANSI SQL-92)
- Easy to learn and use
- Data is available to many more people
More About Hive
- Hive is NOT a replacement for an RDBMS
  - Not all SQL works
- Hive is only an interpreter that converts HiveQL to MapReduce (more specifically, procedural Java code)
- HiveQL queries can take many seconds or minutes to produce a result set
- Hive has a metastore (held in MySQL) that contains details about how tables are represented on HDFS
RDBMS vs. Hive
              RDBMS                                            Hive
Language      SQL                                              Subset of SQL along with Hive extensions
Transactions  Yes                                              No
ACID          Yes                                              No
Latency       Sub-second (indexed data)                        Many seconds to minutes (non-indexed data)
Updates       Yes: INSERT [IGNORE], UPDATE, DELETE, REPLACE    INSERT OVERWRITE
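Because Hive has no UPDATE or DELETE here, changed data is rewritten wholesale with INSERT OVERWRITE; a minimal sketch, assuming the world.City table has been imported into Hive (as in the Sqoop examples) and a hypothetical summary table city_summary already exists:

INSERT OVERWRITE TABLE city_summary
SELECT CountryCode, COUNT(*) AS cities, SUM(Population) AS total_population
FROM City
GROUP BY CountryCode;
-- the entire contents of city_summary are replaced on every run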
Sqoop and Hive
$ sqoop import --connect jdbc:mysql://example.com/world \
    --table City \
    --hive-import
Alternatively, you can create the table(s) within the Hive CLI and run a hadoop fs -put of an exported CSV file from the local file system (sketched below)
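A hedged sketch of that manual alternative; the column definitions follow the standard world.City schema, and the file paths are illustrative:

$ hive -e "CREATE TABLE City (
             ID INT, Name STRING, CountryCode STRING,
             District STRING, Population INT)
           ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';"
$ hadoop fs -put /tmp/city.csv /user/hive/warehouse/city/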
Impala
- Brings processing to the data nodes and avoids network bottlenecks
- Uses a shared metastore
- Hue integration in CDH 4.2+
- Based on Google's Dremel: http://research.google.com/pubs/pub36632.html
Impala: High Level
[Image source: http://blog.cloudera.com/blog/2012/10/cloudera-impala-real-time-queries-in-apache-hadoop-for-real/]
Advantages of Impala
- 100% open source
- Local processing on the data nodes helps avoid network bottlenecks
- Saves time with minimal data movement
- A single, open, unified metadata store can be used
- Costly data format conversion is unnecessary, so no conversion overhead is incurred
- All data is immediately queryable, with no delays for ETL
- All hardware is utilized for Impala queries as well as for MapReduce; only a single machine pool is needed to scale
- ANSI-92 SQL supported, with UDFs
- Supports common Hadoop file formats: text, SequenceFile, Avro, RCFile, LZO and Parquet
Impala Info
- Factors that help make Impala faster:
  - Hardware configuration
  - Complexity of the query
  - Availability of main memory
- Does it replace MapReduce or Hive?
- Limitations? All joins are done in a memory space no larger than that of the smallest node
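A minimal impala-shell sketch against the City table imported earlier; the impalad hostname is an assumption:

$ impala-shell -i impalad-host.example.com -q \
    "SELECT CountryCode, SUM(Population) AS pop
     FROM City
     GROUP BY CountryCode
     ORDER BY pop DESC
     LIMIT 10;"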
Tungsten Replicator 3.0: MySQL to HDFS
Why this is needed:
- Sqoop issues: time, stale data, performance
- Real-time data load
- Idempotent failover if you have multiple slaves with log_slave_updates active
- All row changes are stored
Current Scope: 3.0 Release
- Extract from MySQL via the binlog (RBR)
- Move data into the current popular distributions: Cloudera, Hortonworks
- Initial provisioning with Sqoop or the parallel extractor (currently Oracle only, but MySQL coming soon)
- Asynchronous replication of changes
- Data transformation into preferred HDFS formats
- Schema generation for Hive tables
- Tools for generating materialized views
High Level Overview
[Diagram: MySQL RDBMS binlogs -> master replicator (THL: GTID + metadata) -> slave replicator (THL: GTID + metadata) -> HDFS]
Specific Overview
[Diagram: MySQL -> Tungsten master replicator -> Tungsten slave replicator (hadoop.js) -> CSV -> Hive/HDFS]
Master filtering:
- Drop/modify columns (PII)
- Fill in primary key
- Fill in column names
- Add DBMS source
- Select a subset of tables to replicate
Slave loader:
- CSV files are created (one per table)
- 100K transactions or ~1 minute of data collected per batch
- Parallel push into HDFS
Process into Hive (hadoop.js)
- Load CSV into Hadoop
- Staging table creation: contains all changed data along with Tungsten-specific row information
- Create base table: the equivalent of the MySQL table in Hive
- Create materialized view
- Check for a table match between MySQL and Hive
Automate the process
$ git clone https://github.com/continuent/continuent-tools-hadoop.git
$ cd continuent-tools-hadoop
$ bin/load-reduce-check \
    -U jdbc:mysql:thin://sourcemysqlserver:3306/database \
    -s database --verbose
DEMO
$ hive -e 'DROP TABLE City'
Sqoop import of the City table:
$ sqoop import --connect jdbc:mysql://localhost.localdomain/world \
    --username root \
    --create-hive-table --hive-table City --hive-import \
    --direct --table City \
    --warehouse-dir /user/hive/warehouse
- HiveQL: saved in Hue
- Impala: saved in Hue
References
Hive: http://infolab.stanford.edu/~ragho/hive-icde2010.pdf
Impala: https://ccp.cloudera.com/display/impala10betadoc/cloudera+impala+1.0+beta+Documentation
Cloudera:
- https://www.cloudera.com/content/support/en/documentation.html
- https://ccp.cloudera.com/display/support/downloads
Tungsten Replicator 3.0:
- Download: http://s3.amazonaws.com/files.continuent.com/builds/nightly/replicator-3.0.0-staging/index.html
- Documentation (MySQL -> HDFS): https://docs.continuent.com/tungsten-replicator-3.0/deployment-hadoop.html
- Tungsten Hadoop tools: https://github.com/continuent/continuent-tools-hadoop