07/11/2014 Julien, Poorna, Andreas


Ad-hoc Query Brown Bag Session
07/11/2014
Julien, Poorna, Andreas

User Story
- Procedures are only developer friendly and not ad-hoc
- Open datasets to a broader audience of non-developers
- Introduce schema to datasets
- Use SQL as query paradigm
- Open to BI tools and analysts

Hive - Introduction
- Manage and query structured data
- SQL-like language: HiveQL
- Built on top of Hadoop
- Create Hive tables based on files from HDFS, HBase, or any custom file format
- Graph of MapReduce jobs for execution

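Hive - How it works
[Diagram: a Hive Client sends a query (SELECT a,b FROM mytable;) over Thrift to Hive Server 2. Hive Server 2 talks over Thrift to the Hive Metastore, which is backed by a DB, to get the mytable schema, and launches MR jobs on the YARN cluster.]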

Hive Integration with External Systems
- External tables vs. managed tables
  - table definition managed externally
  - data storage not handled by Hive
- Non-native tables vs. native tables
  - need an external component, a Storage Handler, to access data
  - set the hive.aux.jars.path configuration to include the component's jar

Hive Storage Handlers
- Access data stored and managed by other systems
- Need to implement:
  - Input format: get data splits for MR jobs, and records from splits
  - Output format: write records to the external storage system
  - SerDe: serialize / deserialize records
  - Object Inspector: analyze the internal structure of a record
- For Datasets: DatasetInputFormat, DatasetSerDe, reuse Hive ObjectInspector
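To make the roles above concrete, here is a minimal sketch in plain Java. It does not use the Hive or Reactor APIs: `StorageHandlerSketch`, `Purchase`, and both methods are hypothetical names invented for illustration. `getSplits` plays the input format role, and `deserialize` plays the combined SerDe and object inspector role by using reflection to look inside a record.

```java
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified stand-ins for the storage handler roles.
// None of these are the real Hive or Reactor types.
class StorageHandlerSketch {

  // An illustrative record type, as a dataset might expose it.
  static class Purchase {
    String customer;
    int quantity;

    Purchase(String customer, int quantity) {
      this.customer = customer;
      this.quantity = quantity;
    }
  }

  // "Input format" role: cut the data into splits that can be read independently.
  static <T> List<List<T>> getSplits(List<T> data, int splitSize) {
    List<List<T>> splits = new ArrayList<>();
    for (int i = 0; i < data.size(); i += splitSize) {
      splits.add(data.subList(i, Math.min(i + splitSize, data.size())));
    }
    return splits;
  }

  // "SerDe + object inspector" role: inspect a record through reflection
  // and expose its instance fields as named column values.
  static Map<String, Object> deserialize(Object record) {
    Map<String, Object> row = new LinkedHashMap<>();
    for (Field field : record.getClass().getDeclaredFields()) {
      if (Modifier.isStatic(field.getModifiers())) {
        continue; // static fields are not part of the record
      }
      try {
        field.setAccessible(true);
        row.put(field.getName(), field.get(record));
      } catch (IllegalAccessException e) {
        throw new IllegalStateException("cannot read field " + field.getName(), e);
      }
    }
    return row;
  }
}
```

With three records and a split size of 2, `getSplits` yields two splits, and `deserialize` on a `Purchase` yields a row with `customer` and `quantity` entries.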

From Datasets to Hive
[Diagram: how a Hive Server MR job reads a Dataset, going from Dataset data to Dataset Split to Dataset Record to Record Fields.]
- Hive interfaces: DatasetInputFormat.getSplits(), DatasetInputFormat.getRecordReader(), DatasetSerDe.deserialize(), DatasetSerDe.getObjectInspector()
- RecordScannable: getSplits(), createSplitRecordScanner(), getRecordType()

Create a RecordScannable
- Most Datasets already implement BatchReadable
- Shared by BatchReadable and RecordScannable
- Utility methods to use BatchReadable methods to implement RecordScannable ones
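The idea of reusing BatchReadable methods to implement RecordScannable ones can be sketched as an adapter. The two interfaces below are simplified, hypothetical versions written for this example (splits are plain strings, readers are plain iterators); the real Reactor signatures may differ, and `asRecordScannable` is an invented helper name.

```java
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

class ScannableAdapterSketch {

  // Simplified, hypothetical version of a key/value batch-readable dataset.
  interface BatchReadable<K, V> {
    List<String> getSplits();
    Iterator<Map.Entry<K, V>> createSplitReader(String split);
  }

  // Simplified, hypothetical version of a record-scannable dataset.
  interface RecordScannable<R> {
    List<String> getSplits();
    Iterator<R> createSplitRecordScanner(String split);
  }

  // Adapter: reuse the splits and split readers of a BatchReadable, and turn
  // each key/value pair into a record with the supplied function.
  static <K, V, R> RecordScannable<R> asRecordScannable(
      BatchReadable<K, V> readable, Function<Map.Entry<K, V>, R> toRecord) {
    return new RecordScannable<R>() {
      @Override
      public List<String> getSplits() {
        return readable.getSplits(); // splits are shared as-is
      }

      @Override
      public Iterator<R> createSplitRecordScanner(String split) {
        Iterator<Map.Entry<K, V>> entries = readable.createSplitReader(split);
        return new Iterator<R>() {
          @Override
          public boolean hasNext() {
            return entries.hasNext();
          }

          @Override
          public R next() {
            return toRecord.apply(entries.next());
          }
        };
      }
    };
  }
}
```

A dataset author would then only supply the function that maps a key/value pair to a record; split handling comes from the existing BatchReadable implementation.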

Explore Module
[Diagram components: Explore Executor, Async HTTP Service, Explore Service, Hive CLI Service.]
- Public endpoints to sendQuery, getStatus, getSchema, getResults, cancel, close
- Hive CLI Service, not HiveServer2: HiveServer2 only fully supports Thrift
- Wrap execution of a query in a long-running tx
- Run in a Twill container
- User code executed by Storage Handler
- Different implementations of Explore Service for different versions of Hive
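The asynchronous query lifecycle behind these endpoints can be sketched with a handle-based service. This is not the Explore Service code: `AsyncQuerySketch`, its `Status` values, and the method bodies are assumptions made for illustration, and a query is modeled as a plain `Callable` that returns rows. `getSchema` is left out, and `getResults` blocks here rather than being polled.

```java
import java.util.List;
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.Callable;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch of a handle-based asynchronous query service.
class AsyncQuerySketch {

  enum Status { RUNNING, FINISHED, FAILED, CANCELED }

  private final ExecutorService executor = Executors.newSingleThreadExecutor();
  private final Map<String, Future<List<String>>> queries = new ConcurrentHashMap<>();

  // Submit the query and return immediately with a handle.
  String sendQuery(Callable<List<String>> query) {
    String handle = UUID.randomUUID().toString();
    queries.put(handle, executor.submit(query));
    return handle;
  }

  // Report the state of a query without blocking on a running one.
  Status getStatus(String handle) {
    Future<List<String>> future = lookup(handle);
    if (future.isCancelled()) {
      return Status.CANCELED;
    }
    if (!future.isDone()) {
      return Status.RUNNING;
    }
    try {
      future.get(); // already done, so this returns or throws at once
      return Status.FINISHED;
    } catch (ExecutionException e) {
      return Status.FAILED;
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      return Status.RUNNING;
    }
  }

  // Wait for the query to finish and return its rows.
  List<String> getResults(String handle) {
    try {
      return lookup(handle).get();
    } catch (ExecutionException e) {
      throw new IllegalStateException("query failed", e.getCause());
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException("interrupted while waiting for results", e);
    }
  }

  void cancel(String handle) {
    lookup(handle).cancel(true);
  }

  // Forget the handle and release what the service holds for it.
  void close(String handle) {
    queries.remove(handle);
  }

  void shutdown() {
    executor.shutdownNow();
  }

  private Future<List<String>> lookup(String handle) {
    Future<List<String>> future = queries.get(handle);
    if (future == null) {
      throw new IllegalArgumentException("unknown handle: " + handle);
    }
    return future;
  }
}
```

A client calls `sendQuery`, checks `getStatus` until the query is no longer running, fetches rows with `getResults`, and finally calls `close` so the service can drop the handle.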

Starting Explore
- In the ReactorMaster startup script, create the Explore class path
- In ReactorMaster:
  - Check the reactor.explore.enabled setting
  - Use a separate class loader to check that Hive jars are present with a supported version: Hive12 / Hive13 / Hive distributed with CDH 4.3-4.7, 5.0
  - Create the Explore Twill container and ship Hive classes and related Reactor classes: HBase classes, DatasetFramework.class, DatasetStorageHandler.class
  - Ship Hive conf files as resources to the container
- In the container:
  - Set the hive.aux.jars.path configuration with related Reactor classes
  - CLI Service connects to the existing Metastore
  - Register the HTTP service with ZooKeeper
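The "separate class loader" check can be sketched as follows. This is an assumption about the general technique, not the ReactorMaster code: `HiveJarCheckSketch` and `isClassAvailable` are invented names, and the sketch only tests whether a class is present on a candidate class path, not whether its version is supported. Passing `null` as the parent loader keeps the check isolated from the application's own class path.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URL;
import java.net.URLClassLoader;

// Hypothetical helper: probe a candidate class path for a class without
// loading anything into the application's own class loader.
class HiveJarCheckSketch {

  static boolean isClassAvailable(URL[] classPath, String className) {
    // Parent is null, so only JDK bootstrap classes and the given URLs are visible.
    try (URLClassLoader loader = new URLClassLoader(classPath, null)) {
      // 'false' means the class is located and loaded but not initialized.
      Class.forName(className, false, loader);
      return true;
    } catch (ClassNotFoundException | LinkageError e) {
      return false;
    } catch (IOException e) {
      // Raised only if closing the loader fails.
      throw new UncheckedIOException(e);
    }
  }
}
```

For example, `isClassAvailable(hiveJarUrls, "org.apache.hadoop.hive.conf.HiveConf")` returns true only when the Hive jars in `hiveJarUrls` (a hypothetical array built from the Explore class path) contain that class.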

Deploying a Dataset
[Diagram. Components: REST Client, Router, DS Manager, Explore Executor, Hive CLI, Hive Meta Service, MDS, HDFS, MySQL. Arrow labels: DS.jar, enable, create table, add, copy, add schema.]

Executing a Query
[Diagram. Components: REST Client, Router, DS Manager, Explore Executor, Hive CLI, Storage Handler, Hive Meta Service, MDS, DS.jar, HDFS, MySQL. Arrow labels: SQL, start tx, query, start MR, getSplits, retrieve metadata, read, load, poll to get results.]

JDBC Connector
- Eventually open Reactor Datasets to external systems
- Right now, a limited version to use ad-hoc queries in unit tests

    // Requires java.sql.Connection, java.sql.DriverManager, java.sql.ResultSet
    Class.forName("com.continuuity.explore.jdbc.ExploreDriver");
    Connection con = DriverManager.getConnection("jdbc:reactor://localhost:10000");
    ResultSet res = con.prepareStatement("select * from continuuity_user_mytable").executeQuery();
    res.next();
    String firstColumn = res.getString(1);

Troubles Along the Way
- Class loading issues
  - Hive packages libraries with different versions than Reactor
- Dataset class loading issues
  - DatasetFramework does not cache class loaders used to instantiate datasets
  - Instantiating a system dataset that requires user types is not supported
- Different Hive components read Hive conf from different places
- Hive in-memory MapReduce logic is broken: patch for Hive-exec.jar
- Schemas with recursive data types make the object inspector go into an infinite loop
- CLIService implementation is slightly different for different versions of Hive
- Hive CLI Service does not have a startAndWait method
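The recursive-schema problem can be illustrated with a small reflection walk. This is not the Hive object inspector and not the fix that was applied in Reactor: `RecursiveSchemaSketch`, `TreeNode`, and `Flat` are hypothetical names. The sketch shows why a naive field-by-field walk never terminates on a self-referencing type, and how tracking the types on the current path detects the cycle. For brevity it treats JDK types, including collections, as leaves.

```java
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;
import java.util.HashSet;
import java.util.Set;

class RecursiveSchemaSketch {

  // Illustrative record type that refers to itself.
  static class TreeNode {
    int value;
    TreeNode left;
    TreeNode right;
  }

  // Illustrative record type without self-reference.
  static class Flat {
    int id;
    String name;
  }

  // Returns true if walking the fields of 'type' would revisit a type
  // that is already being expanded, i.e. the schema is recursive.
  static boolean isRecursive(Class<?> type) {
    return visit(type, new HashSet<>());
  }

  private static boolean visit(Class<?> type, Set<Class<?>> onPath) {
    // Primitives and JDK types are treated as leaves in this sketch.
    if (type.isPrimitive() || type.getName().startsWith("java.")) {
      return false;
    }
    // add() returns false when the type is already on the current path: a cycle.
    if (!onPath.add(type)) {
      return true;
    }
    for (Field field : type.getDeclaredFields()) {
      if (Modifier.isStatic(field.getModifiers())) {
        continue;
      }
      if (visit(field.getType(), onPath)) {
        return true;
      }
    }
    onPath.remove(type); // done expanding this type on this path
    return false;
  }
}
```

Here `isRecursive(TreeNode.class)` is true and `isRecursive(Flat.class)` is false; a schema generator could use such a check to reject or truncate recursive types before handing them to an inspector.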

Future Work
- JDBC connector with more functionality, to be used by BI tools
- Support secure Hive
- Support for UDFs
- Pick up record scannable datasets when explore turns to enabled
- UI
- Hive on Tez
- Hive on Spark?

Thank you