
Integrating Hadoop and Parallel DBMS. Yu Xu, Pekka Kostamaa, Like Gao. Presented By: Sushma Ajjampur Jagadeesh

Introduction
Teradata's parallel DBMS can hold data sets ranging from a few terabytes to multiple petabytes. Due to the explosive increase in data volume in recent years, some data at customer sites, such as web logs and sensor data, are not managed by the Teradata EDW (Enterprise Data Warehouse): it is expensive to load such large volumes of data onto the EDW. Meanwhile, Google's MapReduce and its open-source implementation Hadoop are gaining momentum for large-scale data analysis. Teradata customers increasingly need to perform BI (Business Intelligence) over both data stored in Hadoop and data in the Teradata EDW.

Parallel DBMS v/s HDFS

                     Parallel DBMS                           HDFS (Hadoop)
    Data loading     Slow for very high data volumes         Reliable, with quick load times
    Query execution  Fast                                    2-3 times slower
    Programming      Easy to write SQL for complex BI        Difficult to write MapReduce programs
    Cost             Expensive                               Low cost

Solution
Efficiently transferring data between Hadoop and Teradata EDW is the important first step toward integrated BI over the two systems. A straightforward approach is to use Hadoop's and Teradata's existing load and export utilities. What Hadoop and Teradata EDW have in common is that data in both systems are partitioned across multiple nodes for parallel computing, which creates integration optimization opportunities not possible for a DBMS running on a single node. The paper presents three efforts toward a tight and efficient integration of Hadoop and Teradata EDW.

Methods of Integration
DirectLoad: load Hadoop data into the EDW.
TeradataInputFormat: retrieve EDW data from MapReduce programs.
Table UDF: access Hadoop data as a table from SQL.

Parallel Loading of Hadoop Data to Teradata EDW: the FastLoad Approach
The FastLoad utility/protocol is widely used in production for loading data into a Teradata EDW table. A FastLoad client connects to a Gateway process residing on one node of the Teradata EDW system and establishes many sessions. Each node in a Teradata EDW system is configured to run multiple virtual parallel units called AMPs (Access Module Processors); an AMP is responsible for scans, joins, and other data management tasks on the data it manages. The FastLoad client sends a batch of rows, in round-robin fashion over one session at a time, to the connected Gateway process. The Gateway forwards the rows to a receiving AMP, which computes the row-hash value of each row; that value determines which AMP should manage the row. The receiving AMP then sends each row it receives to the right final AMP, which stores it in Teradata EDW. The two hops in this flow are sketched below.
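
A minimal, purely illustrative simulation of that routing logic; the session count, AMP count, and hash function are placeholder assumptions, not Teradata internals:

    import java.util.List;

    public class FastLoadFlowSketch {
        // Stand-in for Teradata's row-hash: any deterministic hash
        // illustrates the routing decision.
        static int rowHashAmp(String row, int numAmps) {
            return Math.floorMod(row.hashCode(), numAmps);
        }

        public static void main(String[] args) {
            List<String> rows = List.of("r1", "r2", "r3", "r4", "r5");
            int sessions = 2;   // sessions established with the Gateway
            int numAmps = 4;    // AMPs in the EDW system

            for (int i = 0; i < rows.size(); i++) {
                int session = i % sessions;                      // hop 1: round-robin over sessions
                int finalAmp = rowHashAmp(rows.get(i), numAmps); // hop 2: row-hash to the owning AMP
                System.out.printf("row %s: session %d -> final AMP %d%n",
                                  rows.get(i), session, finalAmp);
            }
        }
    }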

DirectLoad Approach
DirectLoad removes the two hops of the current FastLoad approach. The Hadoop file is divided into many portions, and DirectLoad decides which portion of the file each AMP should receive. As many DirectLoad jobs are started as there are AMPs in the Teradata EDW. Each DirectLoad job connects to a Teradata Gateway process, reads its designated portion of the Hadoop file using Hadoop's API, and forwards the data to its connected Gateway, which sends the Hadoop data only to a unique local AMP on the same Teradata node. Each receiving AMP acts as the final AMP managing the rows it has received, so no row-hash computation is needed and the second hop of the FastLoad approach is removed. A sketch of the portion assignment follows.
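
A minimal sketch of that portion assignment, assuming the file is simply split into equal contiguous byte ranges, one per AMP; the file size and AMP count are made-up values for illustration:

    public class DirectLoadPortions {
        public static void main(String[] args) {
            long fileSize = 10_000_000_000L; // S: total size of the Hadoop file
            int numAmps = 8;                 // one DirectLoad job per AMP

            for (int amp = 0; amp < numAmps; amp++) {
                long start = amp * fileSize / numAmps;
                long end = (amp + 1) * fileSize / numAmps; // exclusive
                // Each job would open the file via the HDFS API, read [start, end),
                // and forward the rows to the Gateway co-located with its AMP.
                System.out.printf("AMP %d reads bytes [%d, %d)%n", amp, start, end);
            }
        }
    }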

Retrieving EDW Data from MapReduce Programs
The straightforward way for a MapReduce program to access relational data is to export the results of SQL queries to a local file and then load that file into Hadoop. It is more convenient and productive to access relational data directly from MapReduce programs, without these external export steps. Based on Hadoop's DBInputFormat, a new approach called TeradataInputFormat was developed, which enables MapReduce programs to read Teradata EDW data directly via JDBC drivers without any external steps.

DBInputFormat
The MapReduce programmer provides a SQL query via the DBInputFormat class. The DBInputFormat implementation first generates the query SELECT COUNT(*) FROM T WHERE C and sends it to the DBMS to obtain the number of rows R in table T. At runtime, the DBInputFormat implementation knows the number of Mappers M started by Hadoop, and each Mapper sends a query of the following form through a standard JDBC driver to the DBMS:

    SELECT P FROM T WHERE C ORDER BY O LIMIT L OFFSET X    (Q)
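
A minimal sketch of how each of the M Mappers could derive its LIMIT L and OFFSET X from the row count R; the class name and values are illustrative, not the actual Hadoop DBInputFormat code:

    public class LimitOffsetSplitSketch {
        public static void main(String[] args) {
            long r = 1_000_000L; // R: result of SELECT COUNT(*) FROM T WHERE C
            int m = 4;           // M: number of Mappers started by Hadoop

            long chunk = r / m;
            for (int i = 0; i < m; i++) {
                long offset = i * chunk;
                // The last Mapper also picks up any remainder rows.
                long limit = (i == m - 1) ? r - offset : chunk;
                System.out.println("Mapper " + i + ": SELECT P FROM T WHERE C"
                        + " ORDER BY O LIMIT " + limit + " OFFSET " + offset);
            }
        }
    }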

DBInputFormat (cont'd)
Drawbacks: every Mapper sends essentially the same SQL query to the DBMS, differing only in the LIMIT and OFFSET clauses, so the query is evaluated repeatedly. These performance issues are especially serious for a parallel DBMS, which typically faces higher numbers of concurrent queries and larger datasets.

TeradataInputFormat
The Teradata connector for Hadoop, named TeradataInputFormat, sends the SQL query only once to Teradata EDW. Based on the query Q provided by the MapReduce program, the TeradataInputFormat class sends the following query to Teradata EDW:

    CREATE TABLE T AS (Q) WITH DATA PRIMARY INDEX (c1) PARTITION BY (c2 MOD M) + 1    (P)

Q is executed only once, and its results are stored in a PPI (Partitioned Primary Index) table T. After Q is evaluated and T is created, each AMP holds M partitions numbered from 1 to M. Each Mapper i then sends a new query that simply asks for all rows in the i-th partition on every AMP:

    SELECT * FROM T WHERE PARTITION = i    (Q_i)

After all Mappers have retrieved their data, the table T is deleted. The whole scheme is sketched below.
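
An illustrative sketch of that staging scheme; the user query, column names, and staging-table name are placeholders, not the connector's real API:

    public class PartitionSplitSketch {
        public static void main(String[] args) {
            // Q: the query supplied by the MapReduce program (hypothetical example).
            String q = "SELECT c1, c2 FROM sales WHERE c2 > 100";
            int m = 4; // M: number of Mappers

            // Executed once: materialize Q into a PPI table partitioned 1..M.
            System.out.println("CREATE TABLE T AS (" + q + ") WITH DATA"
                    + " PRIMARY INDEX (c1) PARTITION BY (c2 MOD " + m + ") + 1");

            // Each Mapper i then pulls only its own partition from every AMP.
            for (int i = 1; i <= m; i++) {
                System.out.println("Mapper " + i + ": SELECT * FROM T WHERE PARTITION = " + i);
            }

            // Once all Mappers are done, the staging table is dropped.
            System.out.println("DROP TABLE T");
        }
    }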

TeradataInputFormat (cont'd)
Drawbacks: currently a PPI table in Teradata EDW must have a primary index column, and the data retrieved by a MapReduce program are not stored in Hadoop.

Accessing Hadoop Data from SQL via a Table UDF
A table UDF (User-Defined Function) named HDFSUDF pulls data from Hadoop into Teradata EDW through SQL queries:

    INSERT INTO Tab1 SELECT * FROM TABLE (HDFSUDF ('mydfsfile.txt')) AS T1;

Typically, an instance of HDFSUDF runs on every AMP in a Teradata system, each retrieving a portion of the Hadoop file. When a UDF instance is invoked on an AMP, it communicates with the Hadoop NameNode, which manages the metadata about mydfsfile.txt, and finds the file's total size S. The table UDF then asks Teradata EDW for its own numeric AMP identity and the total number of AMPs; from these, each UDF instance identifies its offset into mydfsfile.txt and starts reading data from Hadoop, as sketched below.
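
A minimal sketch of the slice computation each UDF instance could perform, assuming the file is divided into equal contiguous byte ranges per AMP; all names and sizes are illustrative, and a real reader would also have to align to row boundaries:

    public class HdfsUdfSliceSketch {
        public static void main(String[] args) {
            long s = 1_000_000L; // S: size of mydfsfile.txt reported by the NameNode
            int numAmps = 4;     // obtained from Teradata EDW, like the AMP id

            for (int ampId = 0; ampId < numAmps; ampId++) {
                long start = ampId * s / numAmps;
                long end = (ampId + 1) * s / numAmps;
                // A real instance would seek to 'start' and, unless start == 0,
                // skip forward to the next row delimiter so no row is read twice.
                System.out.printf("AMP %d scans bytes [%d, %d)%n", ampId, start, end);
            }
        }
    }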


Conclusion
Teradata customers increasingly see the need to perform integrated BI over both data stored in Hadoop and data in Teradata EDW. The DirectLoad approach provides fast parallel loading of Hadoop data into Teradata EDW. TeradataInputFormat gives MapReduce programs efficient, direct parallel access to Teradata EDW data without external steps. The table UDF approach lets SQL queries directly access and join Hadoop data with Teradata EDW data via user-defined table functions. Future work: push more computation from Hadoop to Teradata EDW, or from Teradata EDW to Hadoop.

Thank You