Alexander Rubin, Principal Architect, Percona. April 18, 2015. Using Hadoop Together with MySQL for Data Analysis


About Me
Alexander Rubin, Principal Consultant, Percona
Working with MySQL for over 10 years
Started at MySQL AB, then Sun Microsystems, then Oracle (MySQL Consulting)
Worked at Hortonworks (a Hadoop company)
Joined Percona in 2013

Agenda
Hadoop Use Cases
Big Data Analytics with Hadoop
Star Schema Benchmark
MySQL and Hadoop Integration

BIG DATA. Hadoop: when it makes sense

Big Data: the 3Vs of big data
Volume: petabytes of data
Variety: any type of data, usually unstructured/raw data, no normalization
Velocity: data is collected at a high rate
http://en.wikipedia.org/wiki/big_data#definition

Inside Hadoop

Where is my data? Data nodes: HDFS (Distributed File System). HDFS = data is spread between MANY machines. Amazon S3 is also supported.

Inside Hadoop: the stack
Hive (SQL), Impala (faster SQL), other components (Pig, HBase, etc.)
MapReduce (distributed programming framework)
HDFS (Distributed File System)
HDFS = data is spread between MANY machines; files are written in append-only mode
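To make the MapReduce layer concrete, here is a toy in-process sketch of the model (all names are hypothetical; real Hadoop distributes the map and reduce tasks across many nodes and shuffles data over the network):

```python
# Toy illustration of the MapReduce model (hypothetical in-process version;
# real Hadoop runs map and reduce tasks on separate machines).
from collections import defaultdict

def map_phase(records, map_fn):
    # Apply map_fn to every record; emit (key, value) pairs.
    for record in records:
        yield from map_fn(record)

def shuffle(pairs):
    # Group values by key, as the framework does between map and reduce.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups, reduce_fn):
    return {key: reduce_fn(key, values) for key, values in groups.items()}

# Word count, the canonical MapReduce example.
lines = ["hadoop and mysql", "mysql and hive"]
pairs = map_phase(lines, lambda line: [(w, 1) for w in line.split()])
counts = reduce_phase(shuffle(pairs), lambda key, values: sum(values))
# counts["mysql"] is 2
```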

Hive: SQL layer for Hadoop. Translates SQL to MapReduce jobs. Schema on read: does not check the data on load.
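"Schema on read" can be sketched in a few lines: the files stay raw text, and the declared column types are applied only when a query scans them (a minimal sketch; the column names, the '|' delimiter, and the sample values are assumptions, not from the slides):

```python
# Minimal schema-on-read sketch. The "load" step stores raw text untouched;
# types are enforced only at read time, which is how Hive treats its files.
raw_rows = [
    "1|155|7706|1|17.0|21168.23",   # assumed '|'-delimited sample rows
    "1|674|7311|2|36.0|45983.16",
]

schema = [("l_orderkey", int), ("l_partkey", int), ("l_suppkey", int),
          ("l_linenumber", int), ("l_quantity", float), ("l_extendedprice", float)]

def read_with_schema(line):
    # Casting happens here, at query time, not when the file was written.
    return {name: cast(field)
            for (name, cast), field in zip(schema, line.split("|"))}

rows = [read_with_schema(r) for r in raw_rows]
total_qty = sum(r["l_quantity"] for r in rows)
```

A malformed row would only fail when read, never at load time; that is the trade-off the slide's "does not check the data on load" refers to.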

Hive Example
hive> create table lineitem (
    l_orderkey int,
    l_partkey int,
    l_suppkey int,
    l_linenumber int,
    l_quantity double,
    l_extendedprice double,
    ...
    l_shipmode string,
    l_comment string)
  row format delimited fields terminated by ' ';

Hive Example
hive> create external table lineitem (...)
  row format delimited fields terminated by ' '
  location '/ssb/lineorder/';

Impala: Faster SQL. Not based on MapReduce; reads data directly from HDFS. http://www.cloudera.com/content/cloudera-content/cloudera-docs/impala/latest/Installing-and-Using-Impala/Installing-and-Using-Impala.html

Hadoop vs MySQL

Hadoop vs. MySQL for Big Data
MySQL: indexes, partitioning, sharding
Hadoop: full table scans, partitioning, MapReduce

Hadoop (vs. MySQL) No indexes All processing is full scan BUT: distributed and parallel No transactions High latency (usually)

Indexes (BTree) for Big Data: the challenge
Creating an index for petabytes of data?
Updating an index for petabytes of data?
Reading a terabyte index?
Random reads against a petabyte of data?
A full scan in parallel is better for big data
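The "full scan in parallel" point can be sketched: split the data into chunks (like HDFS blocks), scan each chunk independently, and merge the partial results. There is no shared index to build, update, or randomly read (a toy sketch; the predicate and chunk count are arbitrary):

```python
# Sketch: why full scans parallelize well. Each "node" scans its own chunk
# with no coordination; merging the partial counts is trivial.
from concurrent.futures import ThreadPoolExecutor

data = list(range(100_000))          # stand-in for rows spread across data nodes
chunks = [data[i::4] for i in range(4)]   # 4 "nodes", round-robin split

def scan(chunk):
    # A local full scan: count values matching a predicate.
    return sum(1 for v in chunk if v % 7 == 0)

with ThreadPoolExecutor(max_workers=4) as pool:
    partials = list(pool.map(scan, chunks))

matches = sum(partials)   # same answer as one sequential scan
```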

ETL vs ELT
ETL (MySQL): 1. Extract data from the external source. 2. Transform before loading. 3. Load data into MySQL.
ELT (Hadoop): 1. Extract data from the external source. 2. Load data into Hadoop. 3. Transform/analyze/visualize the data, with parallelism.
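The two pipelines above differ only in where the transform happens. A miniature contrast (the CSV sample and the list-based "stores" are hypothetical stand-ins for MySQL and HDFS):

```python
# ETL vs ELT in miniature. "Loading" here is just appending to a list.
source = ["2015-04-18,page_a,3", "2015-04-18,page_b,5"]

# ETL: transform first, then load typed rows into the "database".
etl_store = []
for line in source:
    day, page, n = line.split(",")
    etl_store.append((day, page, int(n)))        # transformed before loading

# ELT: load the raw lines untouched, transform later at analysis time.
elt_store = list(source)                          # loaded as-is (Hadoop style)
analyzed = [int(line.split(",")[2]) for line in elt_store]  # transform on read
total_requests = sum(analyzed)
```

In the ELT case the (often expensive) transform runs where the parallelism is, which is the point of the slide.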

Example: Loading wikistat into MySQL
http://dumps.wikimedia.org/other/pagecounts-raw/
1. Extract data from the external source (Wikipedia page counts download, >10TB)
2. Load data into MySQL and transform:
load data local infile '$file'
into table wikistats.wikistats_full
CHARACTER SET latin1
FIELDS TERMINATED BY ' '
(project_name, title, num_requests, content_size)
set request_date = STR_TO_DATE('$datestr', '%Y%m%d %H%i%S'),
    title_md5 = unhex(md5(title));

Load timing per hour of wikistat
InnoDB: 52.34 sec; MyISAM: 11.08 sec (+ indexes)
1 hour of wikistats loads in about 1 minute
1 year will load in ~6 days (8765.81 hours in 1 year)
6 years => more than 1 month to load
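The arithmetic behind those numbers checks out (assuming the measured rate of roughly 1 minute per 1 hour of wikistats data, as stated on the slide):

```python
# Back-of-envelope check for the load-time claim.
hours_per_year = 8765.81                       # figure used on the slide
minutes_to_load_one_year = hours_per_year * 1.0  # ~1 min per data-hour
days_to_load_one_year = minutes_to_load_one_year / 60 / 24
days_to_load_six_years = days_to_load_one_year * 6
# days_to_load_one_year is ~6.09, so six years is well over a month.
```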

Loading wikistat into Hadoop
Just copy the files to HDFS and create the Hive structure
How fast is the search? Depends on the number of nodes
1000-node Spark cluster: 4.5 TB, 104 billion records, exec time: 45 sec (scanning 4.5TB of data)
http://spark-summit.org/wp-content/uploads/2014/07/Building-1000-node-Spark-Cluster-on-EMR.pdf

Archiving to Hadoop
OLTP / web site: MySQL, goal: keep 100GB-1TB
ELT into a Hadoop cluster for BI / data analysis: can store petabytes for archiving

Star Schema Benchmark
A variation of TPC-H with a star-schema database design
http:///docs/wiki/benchmark:ssb:start

Cloudera Impala

Impala and Columnar Storage
Parquet: columnar storage for Hadoop
Column-oriented storage; supports compression
Star Schema Benchmark example: raw data 990GB, Parquet data 240GB
http://parquet.io/
http://www.cloudera.com/content/cloudera-content/cloudera-docs/Impala/latest/Installing-and-Using-Impala/ciiu_parquet.html
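One reason columnar formats compress so well (990GB down to 240GB above) is that values of one column are stored together, so repetitive columns encode cheaply. A toy run-length encoder over assumed SSB-like sample rows (Parquet's actual encodings are far more elaborate):

```python
# Why columnar storage compresses well: regroup rows by column, then
# run-length-encode each column. Hypothetical sample data.
rows = [(1993, "ASIA", 21168.23),
        (1993, "ASIA", 45983.16),
        (1994, "ASIA", 38301.50)]

columns = list(zip(*rows))   # (years...), (regions...), (prices...)

def run_length_encode(values):
    encoded = []
    for v in values:
        if encoded and encoded[-1][0] == v:
            encoded[-1] = (v, encoded[-1][1] + 1)   # extend the current run
        else:
            encoded.append((v, 1))                   # start a new run
    return encoded

year_rle = run_length_encode(columns[0])     # [(1993, 2), (1994, 1)]
region_rle = run_length_encode(columns[1])   # [("ASIA", 3)]
```

Row-oriented storage interleaves the columns, breaking up exactly these runs, which is why the same data compresses worse there.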

Impala Benchmark: Table
1. Load data into HDFS:
$ hdfs dfs -put /data/ssb/lineorder* /ssb/lineorder/
http:///docs/wiki/benchmark:ssb:start

Impala Benchmark: Table
2. Create external table:
CREATE EXTERNAL TABLE lineorder_text (
    LO_OrderKey bigint,
    LO_LineNumber tinyint,
    LO_CustKey int,
    LO_PartKey int,
    ...
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ' '
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION '/ssb/lineorder';
http:///docs/wiki/benchmark:ssb:start

Impala Benchmark: Table
3. Convert to Parquet:
impala-shell> create table lineorder like lineorder_text STORED AS PARQUETFILE;
impala-shell> insert into lineorder select * from lineorder_text;
http:///docs/wiki/benchmark:ssb:start

Impala Benchmark: SQLs, Q1.1
select sum(lo_extendedprice*lo_discount) as revenue
from lineorder, dates
where lo_orderdate = d_datekey
  and d_year = 1993
  and lo_discount between 1 and 3
  and lo_quantity < 25;
http:///docs/wiki/benchmark:ssb:start

Impala Benchmark: SQLs, Q3.1
select c_nation, s_nation, d_year, sum(lo_revenue) as revenue
from customer, lineorder, supplier, dates
where lo_custkey = c_custkey
  and lo_suppkey = s_suppkey
  and lo_orderdate = d_datekey
  and c_region = 'ASIA'
  and s_region = 'ASIA'
  and d_year >= 1992 and d_year <= 1997
group by c_nation, s_nation, d_year
order by d_year asc, revenue desc;

Partitioning for Impala / Hive
Partitioning by year:
CREATE TABLE lineorder (
    LO_OrderKey bigint,
    LO_LineNumber tinyint,
    LO_CustKey int,
    LO_PartKey int,
    ...
) partitioned by (year int);
This creates a virtual field on year.
select * from lineorder where year = '2013' will only read the 2013 partition (much less disk IO!)
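Partition pruning can be sketched in miniature: the data is laid out by year, so a predicate on year touches only the matching partitions (a toy in-memory layout; Hive and Impala do the equivalent with one HDFS directory per partition value, and the sample rows are invented):

```python
# Partition pruning in miniature: one "directory" of rows per year.
partitions = {
    2012: [("order-1", 120.0), ("order-2", 80.0)],
    2013: [("order-3", 200.0)],
}

def query(year_filter):
    # Pruning: decide which partitions to read BEFORE touching any rows.
    scanned = [y for y in partitions if y == year_filter]
    rows = [row for y in scanned for row in partitions[y]]
    return scanned, rows

scanned, rows = query(2013)
# Only the 2013 partition is read; the 2012 rows are never touched.
```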

Impala Benchmark
CDH Hadoop with Impala, 10 m1.xlarge Amazon EC2 instances
Raw data: 990GB; Parquet data: 240GB

Impala Benchmark. Scanning 1TB of raw data: 50-200 seconds on 10 xlarge nodes on Amazon EC2

Impala vs Hive Benchmark 10 xlarge nodes on Amazon EC2 Impala and Hive

Amazon Elastic MapReduce

Hive on Amazon S3
hive> create external table lineitem (
    l_orderkey int, l_partkey int, l_suppkey int, l_linenumber int,
    l_quantity double, l_extendedprice double, l_discount double, l_tax double,
    l_returnflag string, l_linestatus string, l_shipdate string,
    l_commitdate string, l_receiptdate string, l_shipinstruct string,
    l_shipmode string, l_comment string)
  row format delimited fields terminated by ' '
  location 's3n://data.s3ndemo.hive/tpch/lineitem';

Amazon Elastic MapReduce
Store data on S3
Prepare a SQL file (create table, select, etc.)
Run Elastic MapReduce: it will start N boxes and then stop them
Results are loaded to S3

Hadoop and MySQL Together Integrating MySQL and Hadoop

Integration: Hadoop -> MySQL Hadoop Cluster MySQL Server

MySQL -> Hadoop: Sqoop (MySQL server -> Hadoop cluster)
$ sqoop import --connect jdbc:mysql://mysql_host/db_name --table ORDERS --hive-import

MySQL to Hadoop: Hadoop Applier
Only inserts are supported
The Applier reads binlogs from the replication master and applies them to the Hadoop cluster, alongside the regular MySQL slaves

MySQL to Hadoop: Hadoop Applier
Download from: http://labs.mysql.com/
Alpha version right now
You need to write code to process the data
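Why "only inserts"? HDFS files are append-only, so only INSERT row events map cleanly onto them. A toy model of an insert-only applier (the event dicts and the dict-of-lists standing in for HDFS files are assumptions for illustration, not the Applier's real API):

```python
# Toy model of an insert-only binlog applier. Inserts append to a per-table
# "file"; updates and deletes cannot be applied append-only, so they are skipped.
binlog_events = [
    {"type": "insert", "table": "orders", "row": (1, "new")},
    {"type": "update", "table": "orders", "row": (1, "paid")},   # unsupported
    {"type": "insert", "table": "orders", "row": (2, "new")},
]

hdfs_files = {}   # table name -> appended rows (stand-in for HDFS datafiles)
skipped = []

for event in binlog_events:
    if event["type"] == "insert":
        hdfs_files.setdefault(event["table"], []).append(event["row"])
    else:
        skipped.append(event)
```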

Questions? Thank you! Blog: http://www.arubin.org