MySQL and Hadoop: Big Data Integration. Shubhangi Garg & Neha Kumari MySQL Engineering




Transcription:

MySQL and Hadoop: Big Data Integration. Shubhangi Garg & Neha Kumari, MySQL Engineering. Copyright 2013, Oracle and/or its affiliates. All rights reserved.

Agenda: Design Rationale; Implementation; Installation; Schema to Directory Mappings; Integration with Hive; Roadmap; Q&A.

Safe Harbor Statement. The following is intended to outline our general product direction. It is intended for information purposes only, and may not be incorporated into any contract. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. The development, release, and timing of any features or functionality described for Oracle's products remain at the sole discretion of Oracle.

Big Data: Strategic Transformation.

CIO & Business Priority. Pilot use cases (90% of organizations had pilot projects by the end of 2012): web recommendations, sentiment analysis, marketing campaign analysis, customer churn modeling, fraud detection, risk modeling, machine learning, research and development. Poor data costs 35% in annual revenues; a 10% improvement in data usability drives $2bn in revenue.

CIO & Business Priority. Sector estimates: US health care, increase industry value per year by $300 B; Europe public sector administration, increase industry value per year by 250 B; US retail, increase net margin by 60+%; manufacturing, decrease development and assembly costs by 50%; global personal location data, increase service provider revenue by $100 B. "In a big data world, a competitor that fails to sufficiently develop its capabilities will be left behind." (McKinsey Global Institute)

Analysts on Big Data. "The area of greatest interest to my clients is Big Data and its role in helping businesses understand customers better." (Michael Maoz, Gartner; Tony Baer, Ovum) "Big Data Will Help Shape Your Market's Next Big Winners." (Brian Hopkins, Forrester) "Almost half of IT departments in enterprises in North America, Europe and Asia-Pacific plan to invest in Big Data analytics in the near future. CIOs will need to be realistic about their approach to 'Big Data' analytics and focus on specific use cases where it will have the biggest business impact." (Philip Carter, IDC)

What Makes It Big Data? Problem: exceeds the limits of conventional systems. The four V's: Volume, Velocity, Variety, Variability, spanning sources such as social media, blogs, and smart meters.

What's Changed? Enablers: digitization (nearly everything has a digital heartbeat); it is now practical to store much larger data volumes (distributed file systems) and to process them (parallel processing). Why is this different from BI/DW? There, the business formulated the questions to ask upfront, which drove what data was collected, the data model, and the query design. Big Data complements traditional methods: it enables what-if analysis and real-time discovery.

Unlocking the Value of ALL Data. Traditional architecture: decisions based on database data. Big Data: decisions based on all your data, including video and images, documents, social data, machine-generated data, and transactions.

Why Hadoop? Scales to thousands of nodes and terabytes of structured and unstructured data. Combines data from multiple sources, schemaless. Runs queries against all of the data. Runs on commodity servers, handling both storage and processing. Data is replicated and self-healing. Initially just batch (Map/Reduce) processing, now extending to interactive querying via Apache Drill, Cloudera Impala, Stinger, etc.

MySQL + Hadoop: Unlocking the Power of Big Data. 50% of our users integrate with MySQL.* Download the MySQL Guide to Big Data: http://www.mysql.com/why-mysql/white-papers/mysql-and-hadoop-guide-to-big-data-integration/ (*Leading Hadoop vendor)

Leading Use Case: Online Retail. Inputs: user browsing profile and purchase history; social media updates, preferences, and brands liked; web logs (pages viewed, comments posted); telephony stream. Output: recommendations.

MySQL Applier for Hadoop: Big Data Lifecycle. ACQUIRE -> ORGANIZE -> ANALYZE -> DECIDE.

MySQL in the Big Data Lifecycle. ACQUIRE (Applier) -> ORGANIZE -> ANALYZE -> DECIDE (BI solutions).

Apache Sqoop. An Apache top-level project, part of the Hadoop ecosystem, originally developed by Cloudera. Bulk data import and export between Hadoop (HDFS) and external data stores over JDBC. The connector architecture supports plug-ins for specific functionality; a Fast Path connector was developed for MySQL.
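A typical Sqoop import of a MySQL table into HDFS looks like the sketch below. The host, database, table, and target directory names are hypothetical; `--direct` selects the MySQL fast-path connector mentioned above, and `-m 4` runs four parallel map tasks. This is a sketch that assumes a running MySQL server and Hadoop cluster.

```shell
# Sketch: bulk-import the (hypothetical) 'orders' table from MySQL into HDFS.
sqoop import \
  --connect jdbc:mysql://dbhost/shop \
  --username repl \
  --table orders \
  --target-dir /data/shop/orders \
  --direct \
  -m 4
```

The reverse direction uses `sqoop export` with an `--export-dir` pointing at the HDFS data to be written back into a MySQL table.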

Importing Data. Sqoop import flow: the Sqoop job gathers metadata from the source database, submits a map-only job to the Hadoop cluster, and the map tasks copy the transactional data into HDFS storage.

Ensure Proper Design. Performance impact: bulk transfers to and from operational systems. Complexity: configuration, usage, error reporting.

MySQL Applier for Hadoop: Real-Time Event Streaming from MySQL to HDFS.

New: MySQL Applier for Hadoop. Real-time streaming of events from MySQL to Hadoop, supporting the move towards speed-of-thought analytics. Connects to the binary log and writes events to HDFS via the libhdfs library. Each database table is mapped to a Hive data warehouse directory, enabling the ecosystem of Hadoop tools to integrate with MySQL data. See dev.mysql.com for articles. Available for download now at labs.mysql.com.

MySQL Applier for Hadoop: Basics. Replication Architecture.

MySQL Applier for Hadoop: Basics. What is the MySQL Applier for Hadoop? A utility that transfers data from MySQL to HDFS, reading the binary log from the server in real time. URI for connecting to HDFS: const char *uri = "hdfs://user@localhost:9000"; network transport: const char *uri = "mysql://root@127.0.0.1:3306". Decodes binary log events using a user-defined content handler, or the default one if none is specified; it cannot handle all events. Event-driven API.

MySQL Applier for Hadoop: Basics. Event-driven API: Content Handlers.

MySQL Applier for Hadoop: Implementation. Replicates rows inserted into a table in MySQL to the Hadoop Distributed File System. Uses the API provided by libhdfs, a C library for manipulating files in HDFS; the library comes precompiled with Hadoop distributions. Connects to the MySQL master (or reads the binary log generated by MySQL) to: fetch the row insert events occurring on the master; decode these events, extracting the data inserted into each field of the row; separate the data by the desired field and row delimiters; use content handlers to get it into the required format; and append it to a text file in HDFS.

Installation: Prerequisites. Hadoop Applier package from http://labs.mysql.com; Hadoop 1.0.4 or later; Java version 6 or later (since Hadoop is written in Java); libhdfs (comes precompiled with Hadoop distributions); CMake 2.6 or greater; libmysqlclient 5.6; OpenSSL; GCC 4.6.3; MySQL Server 5.6; FindHDFS.cmake (CMake module to find the libhdfs library while compiling); FindJNI.cmake (optional; check whether you already have one: $ locate FindJNI.cmake); Hive (optional).

Installation: Steps. Download a copy of the Hadoop Applier from http://labs.mysql.com. Download a Hadoop release and configure DFS to set the append property to true (flag dfs.support.append). Set the environment variable $HADOOP_HOME. Ensure that 'FindHDFS.cmake' is in the CMAKE_MODULE_PATH. Because libhdfs is JNI based, the JNI header files and libraries are also required. Build and install the library 'libreplication' using cmake. Set the paths: $ export PATH=$HADOOP_HOME/bin:$PATH and $ export CLASSPATH=$(hadoop classpath). Compile with the command make happlier in the terminal; this creates an executable in the mysql2hdfs directory in the repository.

Integration with Hive. Hive runs on top of Hadoop; install Hive on the Hadoop master node. Set the default data warehouse directory to the same base directory into which the Hadoop Applier writes. Create similar schemas on both MySQL and Hive. Timestamps are inserted as the first field in HDFS files. Data is stored in 'datafile1.txt' by default; the working directory is base_dir/db_name.db/tb_name. SQL query: CREATE TABLE t (i INT); corresponding Hive QL: CREATE TABLE t (time_stamp INT, i INT) [ROW FORMAT DELIMITED] STORED AS TEXTFILE;

Mapping Between MySQL and HDFS Schema.
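The mapping above follows the layout described on the Hive integration slide: each MySQL schema and table pair lands under base_dir/db_name.db/tb_name, in a file named datafile1.txt by default. A minimal sketch of that path mapping (the base directory, database, and table names below are hypothetical):

```shell
# Sketch of the MySQL-to-HDFS path mapping:
# schema db_name, table tb_name -> base_dir/db_name.db/tb_name/datafile1.txt
hdfs_path() {
  local base_dir=$1 db=$2 table=$3
  printf '%s/%s.db/%s/datafile1.txt\n' "$base_dir" "$db" "$table"
}

hdfs_path /user/hive/warehouse shop orders
# -> /user/hive/warehouse/shop.db/orders/datafile1.txt
```

Because this matches Hive's own warehouse layout, Hive can query the applied data without any further import step.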

Hadoop Applier in Action. Run hadoop dfs (namenode and datanode). Start a MySQL server as master with row-based replication, for example using mtr: MySQL-5.6/mysql-test$ ./mtr --start --suite=rpl --mysqld=--binlog_format='row' --mysqld=--binlog_checksum=none. Start Hive (optional). Run the executable: $ ./happlier [mysql-uri] [hdfs-uri]. The data being replicated can be controlled by command-line options ($ ./happlier --help). Options: -r, --field-delimiter=DELIM, string separating the fields in a row; -w, --row-delimiter=DELIM, string separating the rows of a table; -d, --databases=DB_LIST, import entries for some databases, optionally including only the specified tables (e.g. -d=db_name1-table1table2,dbname2table1,dbname3); -f, --fields=LIST, import entries for only some fields in a table; -h, --help, display help. Find the demo here: http://www.youtube.com/watch?feature=player_detailpage&v=mzratcu3m1g
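As an illustration of what those delimiter options control (this is not the applier binary itself, just a sketch of the text it appends): assuming ',' is passed as the field delimiter and the row delimiter is a newline, an insert of the value 42 at Unix time 1371214000 becomes one delimited line in the HDFS file, with the timestamp as the first field as described on the Hive slide.

```shell
# Hypothetical illustration of one applied row: timestamp first, then each
# column value, joined by the field delimiter and ended by the row delimiter.
ts=1371214000
field_delim=','
printf '%s%s%s\n' "$ts" "$field_delim" "42"
# -> 1371214000,42
```

The matching Hive table (CREATE TABLE t (time_stamp INT, i INT) ROW FORMAT DELIMITED ... STORED AS TEXTFILE) then parses each such line into a queryable row.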

Future Road Map: Under Consideration. Support all kinds of events occurring on the MySQL master. Support binlog checksums and GTIDs, the new features in MySQL 5.6 GA. Considering support for DDLs, and for updates and deletes. Leave comments on the blog! http://innovating-technology.blogspot.co.uk/2013/04/mysql-hadoop-applier-part-2.html

MySQL in the Big Data Lifecycle. ANALYZE -> Export Data -> DECIDE.

Analyze Big Data.

Summary. MySQL + Hadoop: a proven big data pipeline. Real-time integration with the MySQL Applier for Hadoop. A perfect complement for Hadoop interactive query tools. Integrate for Insight.

Next Steps.

Thank you!