MySQL and Hadoop. Percona Live 2014 Chris Schneider




About Me: Chris Schneider, Database Architect @ Groupon. Spent the last 10 years building MySQL architecture for multiple companies. Worked with Hadoop for the past ~3 years. chschneider@groupon.com

What we'll cover: Apache Hadoop; CDH (Cloudera's Distribution for Apache Hadoop); use cases for Hadoop; a simple MapReduce overview; Sqoop; Hive; Impala; Tungsten Replicator (MySQL -> HDFS)

What is Hadoop? An open-source framework for storing and processing data on a cluster of servers. Based on Google's whitepapers on the Google File System (GFS) and MapReduce. Scales linearly. Designed for batch processing. Optimized for streaming reads.

Where do I start? [Image: Hadoop Ecosystem diagram, by Aryan Nava]

Hadoop Distribution: Cloudera provides a distribution for Apache Hadoop, along with many other components in the Hadoop ecosystem. What is a distribution? Repositories, documentation, bug fixes, tested releases, and Cloudera Manager. There are other companies who provide distributions as well.

Why Hadoop: Volume. Use Hadoop when you cannot or should not use a traditional RDBMS. Image source: www.grc.nasa.gov

Why Hadoop: Velocity. Hadoop can ingest terabytes of data per day. Image source: www.physics4kids.com

Why Hadoop: Variety. You can have structured or unstructured data. Image source: www.dataenthusiast.com

Use cases for Hadoop: recommendation engines (Netflix recommends movies); ad targeting, log processing, and search optimization (Glam Media, eBay, Orbitz); machine learning and classification (Yahoo Mail's spam detection); financial institutions (identity theft and credit risk); social graphs (Facebook, LinkedIn, and eHarmony connections).

Some details about Hadoop. There are two main pieces: the Hadoop Distributed File System (HDFS), which provides distributed and redundant data storage across many nodes (hardware will inevitably fail), and MapReduce, which reads and processes the data. Processing is sent to the data: many map tasks each work on a slice of the data, and failed tasks are automatically restarted on another node/replica.
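
For a concrete taste of HDFS, here is a minimal shell sketch, assuming a configured Hadoop client; the /user/demo path and sales.csv file are illustrative:

$ hadoop fs -mkdir /user/demo
# copy a local file into HDFS; blocks are replicated across DataNodes automatically
$ hadoop fs -put sales.csv /user/demo/
# list the directory, then stream the file back out
$ hadoop fs -ls /user/demo
$ hadoop fs -cat /user/demo/sales.csv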

Hadoop Ecosystem
Management: Oozie (workflow), Chukwa (monitoring), Flume (data ingest), ZooKeeper (management), Cloudera Manager (management)
Data access: Hive (SQL), Pig (data flow), Mahout (machine learning), Avro (RPC, serialization), Sqoop (RDBMS connector)
Data processing: MapReduce, MRv2 (YARN)
Data storage: HDFS (distributed file system), HBase (column data store)

Block storage replication: [diagram] Master nodes run the NameNode(s) and JobTracker(s); each slave node runs a DataNode and a TaskTracker. In the example, blocks 1, 3, and 5 are each stored on three different DataNodes, so losing any single node loses no data.

Map is used for searching. Input record (key = byte offset 64): "big data is totally cool and big". For each word, map emits (word, 1). Intermediate output (on local disk): big, 1; data, 1; is, 1; totally, 1; cool, 1; and, 1; big, 1.

Reduce is used to aggregate. Hadoop groups the values by key and calls reduce once for each unique key (think GROUP BY, ORDER BY): reduce(key, list) sums the list and outputs (key, sum). Input: big, (1,1); data, (1); is, (1); totally, (1); cool, (1); and, (1). Output: big, 2; data, 1; is, 1; totally, 1; cool, 1; and, 1.
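
A runnable sketch of this exact word count using Hadoop Streaming, with plain shell commands as the map and reduce functions; the streaming jar path varies by distribution and is an assumption here:

$ echo "big data is totally cool and big" > words.txt
$ hadoop fs -put words.txt /tmp/words.txt
# map: split each line into one word per line (each word becomes a key)
# reduce: keys arrive sorted after the shuffle, so uniq -c counts each word
$ hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
    -input /tmp/words.txt -output /tmp/wordcount \
    -mapper "tr -s ' ' '\n'" -reducer "uniq -c"
$ hadoop fs -cat /tmp/wordcount/part-*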

Map/Reduce worked example: summing revenue per company and year.
Split 1 input: CompA,2013,35.25 / CompB,2013,5.25 / CompA,2013,15.00 / CompC,2013,25.00 -> MAP -> CompA,2013,50.25 / CompB,2013,5.25 / CompC,2013,25.00
Split 2 input: CompB,2013,20.75 / CompC,2013,10.25 / CompA,2013,10.00 / CompB,2013,15.00 -> MAP -> CompA,2013,10.00 / CompB,2013,35.75 / CompC,2013,10.25
REDUCE -> CompA,2013,60.25 / CompB,2013,41.00 / CompC,2013,35.25

Why do MR jobs take so long? The shuffle and sort phase. Source: Hadoop: The Definitive Guide, Figure 6-4 (Shuffle and Sort)

Where does Hadoop fit in? Think of Hadoop as an augmentation of your traditional RDBMS system: you want to store years of data, or you simply have a lot of data; you need to aggregate all of that data over many years' time; you want ALL your data stored and accessible, not forgotten or deleted; and you need this to be free software running on commodity hardware.

Where does Hadoop fit in? [architecture diagram] HTTP traffic writes to several MySQL servers (OLTP). Oozie-scheduled Sqoop/ETL jobs move data from MySQL into the Hadoop (CDH4) cluster: NameNode, NameNode2, JobTracker, and a set of DataNodes. Hive, Impala, and Pig query the cluster, feeding Tableau for business analytics.

Where does Tungsten fit in? [architecture diagram] Same topology, but Tungsten Replicator 3.0 provides a single loading path from the MySQL servers straight into Hadoop (CDH4), replacing the Oozie/Sqoop ETL hop.

Simple data flow:
- MySQL is used for OLTP data processing
- An ETL process moves data from MySQL to Hadoop: Oozie + Sqoop, Oozie + custom ETL, or Tungsten Replicator 3.0+
- Use MapReduce to transform data, run batch analysis, join data, etc.
- Export the transformed results to OLAP or back to OLTP, for example a dashboard of aggregated data or a report (see the HiveQL sketch below)
- You can also read this data directly through Impala, saving the time of exporting it out of HDFS
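
For the transform-then-export step, a minimal HiveQL sketch; the sales table, its columns, and the output path are hypothetical:

# aggregate in Hive, writing the result files to an HDFS directory
$ hive -e "INSERT OVERWRITE DIRECTORY '/tmp/revenue_by_year'
           SELECT company, yr, SUM(amount)
           FROM sales
           GROUP BY company, yr;"
# the files under /tmp/revenue_by_year can then be sqoop-exported back to MySQL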

MySQL vs Hadoop:
- Data capacity: MySQL: depends, TB+; Hadoop: PB+
- Data per query/MR: MySQL: depends, MB -> GB; Hadoop: PB+
- Read/write: MySQL: random read/write; Hadoop: sequential scans, append-only
- Query language: MySQL: SQL; Hadoop: MapReduce, scripted streaming, HiveQL, Pig Latin
- Transactions: MySQL: yes; Hadoop: no
- Indexes: MySQL: yes; Hadoop: no
- Latency: MySQL: sub-second; Hadoop: minutes to hours
- Data structure: MySQL: relational; Hadoop: both structured and unstructured
- Enterprise and community support: MySQL: yes; Hadoop: yes

About Sqoop: open source; the name stands for SQL-to-Hadoop. Performs parallel import and export between Hadoop and various RDBMSs. The default implementation is JDBC, which is portable but not tuned for performance; an optimized direct mode exists for MySQL. Integrates with connectors for Oracle, Netezza, Teradata, MicroStrategy, and Tableau.

Sqoop features:
- Choose specific rows to import with the --where flag (and specific columns with --columns)
- Controlled parallelism: set the number of parallel mappers/connections (--num-mappers) and specify the column to split on (--split-by)
- Incremental loads based on a TIMESTAMP or AUTO_INCREMENT column
- Integration with Hive (HDFS) and HBase (column store)
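
Putting those flags together, a sketch against the standard world.City sample table; the filter, mapper count, and last-value are illustrative:

$ sqoop import --connect jdbc:mysql://example.com/world \
    --username <database_username> \
    --table City \
    --columns "ID,Name,CountryCode,Population" \
    --where "Population > 100000" \
    --split-by ID \
    --num-mappers 8 \
    --incremental append --check-column ID --last-value 4000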

How a Sqoop import works:
1. The client calls sqoop import.
2. Sqoop submits a map-only job to the JobTracker.
3. The map tasks, running on the TaskTrackers, connect to the database server via JDBC and execute a SELECT.
4. In direct mode, mysqldump on the DataNodes connects to MySQL (in this case; other RDBMSs have their own tools), each map task covering one slice of the primary-key range:
   - SELECT ... WHERE pk >= 0 AND pk < N
   - SELECT ... WHERE pk >= N AND pk < 2N
   - SELECT ... WHERE pk >= 2N AND pk < 3N
   - ...

Sqoop data into Hadoop:

$ sqoop import --connect jdbc:mysql://example.com/world \
    --username <database_username> \
    --table City \
    --fields-terminated-by '\t' \
    --lines-terminated-by '\n'

This command submits a Hadoop job that queries your MySQL server and reads all the rows from world.City. The resulting TSV file(s) will be stored in HDFS.

Sqoop export:

$ sqoop export --connect jdbc:mysql://example.com/world \
    --username <database_username> \
    --table City \
    --export-dir /path/in/hdfs

The City table needs to exist within MySQL first. Input is CSV-formatted by default. You can use a staging table (--staging-table); see the sketch below.
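
A sketch of exporting through a staging table, so a failed job cannot leave a half-written City table; City_stage is a hypothetical, identically-defined table:

$ sqoop export --connect jdbc:mysql://example.com/world \
    --username <database_username> \
    --table City \
    --staging-table City_stage \
    --clear-staging-table \
    --export-dir /path/in/hdfs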

Sqoop gotcha 1: you NEED to install MySQL Connector/J. Latest version: http://dev.mysql.com/downloads/connector/j/. Then you must place the .jar file in the Sqoop lib directory (default: /usr/lib/sqoop/lib).
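
A sketch of the install, assuming a Connector/J 5.1.x tarball and the default Sqoop lib path; adjust the version and paths for your setup:

$ tar xzf mysql-connector-java-5.1.30.tar.gz
# drop the driver jar where Sqoop can load it
$ sudo cp mysql-connector-java-5.1.30/mysql-connector-java-5.1.30-bin.jar \
    /usr/lib/sqoop/lib/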

Sqoop gotcha 2:

attempt_201302192229_0002_m_000000_0: log4j:WARN Please initialize the log4j system properly.
java.io.IOException: Cannot run program "mysqldump": java.io.IOException: error=2, No such file or directory
    at java.lang.ProcessBuilder.start(ProcessBuilder.java:460)
    at java.lang.Runtime.exec(Runtime.java:593)
    at java.lang.Runtime.exec(Runtime.java:466)
    at com.cloudera.sqoop.mapreduce.MySQLDumpMapper.map(MySQLDumpMapper.java:396)
    at com.cloudera.sqoop.mapreduce.MySQLDumpMapper.map(MySQLDumpMapper.java:49)

This means a direct-mode import could not find the mysqldump binary; it must be installed on every node that runs map tasks.

Gotcha 3: full table scans. Try not to sqoop from a master MySQL instance, and try not to sqoop from a slave that is actively taking reads. The full-table scan can degrade performance by pulling the sqooped data pages into memory and pushing out the pages your application expects to find there.

About Hive: offers a way around the complexities of MapReduce/Java. Hive is an open-source project managed by the Apache Software Foundation. Non-Java developers are able to access the data through a language based on SQL (ANSI SQL-92) that is easy to learn and use, so the data is available to many more people.

More about Hive: Hive is NOT a replacement for an RDBMS, and not all SQL works. Hive is only an interpreter that converts HiveQL to MapReduce, more specifically to procedural Java code. HiveQL queries can take many seconds or minutes to produce a result set. Hive has a metastore, kept within MySQL, that contains the details of how tables are represented on HDFS.

RDBMS vs Hive:
- Language: RDBMS: SQL; Hive: subset of SQL along with Hive extensions
- Transactions: RDBMS: yes; Hive: no
- ACID: RDBMS: yes; Hive: no
- Latency: RDBMS: sub-second (indexed data); Hive: many seconds to minutes (non-indexed data)
- Updates: RDBMS: yes, INSERT [IGNORE], UPDATE, DELETE, REPLACE; Hive: INSERT OVERWRITE

Sqoop and Hive:

$ sqoop import --connect jdbc:mysql://example.com/world \
    --table City \
    --hive-import

Alternatively, you can create the table(s) within the Hive CLI and run a hadoop fs -put with a CSV file exported to the local file system, as sketched below.
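
A minimal sketch of that manual route for world.City; the column list matches the standard world sample database, and the warehouse path assumes Hive defaults:

$ hive -e "CREATE TABLE City (ID INT, Name STRING, CountryCode STRING,
           District STRING, Population INT)
           ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';"
# load a locally exported CSV straight into the table's HDFS directory
$ hadoop fs -put City.csv /user/hive/warehouse/city/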

Impala: brings processing to the data nodes and avoids network bottlenecks. Uses a shared metastore (the same one as Hive). Hue integration in CDH 4.2+. Based on Google's Dremel: http://research.google.com/pubs/pub36632.html

Impala high level. Image source: http://blog.cloudera.com/blog/2012/10/cloudera-impala-real-time-queries-in-apache-hadoop-for-real/

Advantages of Impala:
- 100% open source
- Local processing on the data nodes helps avoid network bottlenecks and saves time through minimal data movement
- A single, open, and unified metadata store can be utilized
- Costly data format conversion is unnecessary, so no overhead is incurred
- All data is immediately query-able, with no delays for ETL
- All hardware is utilized for Impala queries as well as for MapReduce; only a single machine pool is needed to scale
- ANSI SQL-92 supported, with UDFs
- Supports common Hadoop file formats: text, SequenceFiles, Avro, RCFile, LZO, and Parquet

Impala info. Factors that affect how fast Impala runs: hardware configuration, complexity of the query, and availability of main memory. Does it replace MapReduce or Hive? No, it sits alongside them. Limitations? All joins are done in a memory space no larger than that of the smallest node.
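
Querying through impala-shell looks like this; the query itself is illustrative, and -r refreshes Impala's view of the metastore after connecting so newly loaded tables are visible:

$ impala-shell -r -q \
    "SELECT CountryCode, COUNT(*) AS cities
     FROM City
     GROUP BY CountryCode
     ORDER BY cities DESC
     LIMIT 10;"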

Tungsten Replicator 3.0: MySQL to HDFS. Why this is needed: Sqoop issues (time, stale data, performance); real-time data load; idempotent failover if you have multiple slaves and log_slave_updates active; all row changes are stored.

Current scope, 3.0 release:
- Extracts from MySQL via the binlog (RBR)
- Moves data into the current popular distributions: Cloudera, Hortonworks
- Initial provisioning with Sqoop or the parallel extractor (currently Oracle only, but MySQL is coming soon)
- Asynchronous replicated changes
- Data transformation into preferred HDFS formats
- Schema generation of Hive tables
- Tools for generating materialized views

High-level overview: [diagram] MySQL RDBMS writes binlogs -> master replicator builds THL (GTID + metadata) -> slave replicator consumes THL (GTID + metadata) -> HDFS.

Specific overview: [diagram] MySQL -> Tungsten master replicator -> Tungsten slave replicator -> CSV -> hadoop.js -> Hive/HDFS.
Master filtering: drop/modify columns (PII), fill in the primary key, fill in column names, add the DBMS source, select the subset of tables to replicate.
Slave loader: CSV files are created (one per table); 100K transactions or ~1 minute of data are collected; parallel push into HDFS.

Processing into Hive (hadoop.js):
- Load the CSV into Hadoop
- Staging table creation: contains all changed data along with Tungsten-specific row information
- Create the base table: an equivalent table to the MySQL one, in Hive
- Create a materialized view and check that the tables match between MySQL and Hive

Automate the process:

$ git clone \
    https://github.com/continuent/continuent-tools-hadoop.git
$ cd continuent-tools-hadoop
$ bin/load-reduce-check \
    -U jdbc:mysql:thin://sourcemysqlserver:3306/database \
    -s database --verbose

DEMO:

$ hive -e 'DROP TABLE City;'

Sqoop import of the City table:

$ sqoop import --connect jdbc:mysql://localhost.localdomain/world \
    --username root \
    --create-hive-table --hive-table City --hive-import \
    --direct --table City \
    --warehouse-dir /user/hive/warehouse

HiveQL saved in Hue; Impala saved in Hue.
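
To sanity-check the demo load from both engines; Impala's -r flag refreshes its metadata so the freshly imported table is visible:

$ hive -e "SELECT COUNT(*) FROM City;"
$ impala-shell -r -q "SELECT COUNT(*) FROM City;"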

References:
- Hive: http://infolab.stanford.edu/~ragho/hive-icde2010.pdf
- Impala: https://ccp.cloudera.com/display/impala10betadoc/Cloudera+Impala+1.0+Beta+Documentation
- Cloudera: https://www.cloudera.com/content/support/en/documentation.html and https://ccp.cloudera.com/display/support/downloads
- Tungsten Replicator 3.0 download: http://s3.amazonaws.com/files.continuent.com/builds/nightly/replicator-3.0.0-staging/index.html
- Documentation, MySQL -> HDFS: https://docs.continuent.com/tungsten-replicator-3.0/deployment-hadoop.html
- Tungsten Hadoop tools: https://github.com/continuent/continuent-tools-hadoop