SQL on NoSQL (and all of the data) With Apache Drill

Similar documents

Self-service BI for big data applications using Apache Drill

Self-service BI for big data applications using Apache Drill

Hadoop Ecosystem B Y R A H I M A.

Why Spark on Hadoop Matters

Rethinking SQL for Big Data with Apache Drill

Information Builders Mission & Value Proposition

MapR: Best Solution for Customer Success

The Internet of Things and Big Data: Intro

MySQL and Hadoop. Percona Live 2014 Chris Schneider

Dominik Wagenknecht Accenture

Big Data Approaches. Making Sense of Big Data. Ian Crosland. Jan 2016

GAIN BETTER INSIGHT FROM BIG DATA USING JBOSS DATA VIRTUALIZATION

Programming Hadoop 5-day, instructor-led BD-106. MapReduce Overview. Hadoop Overview

Workshop on Hadoop with Big Data

Constructing a Data Lake: Hadoop and Oracle Database United!

Hadoop Ecosystem Overview. CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook

Native Connectivity to Big Data Sources in MSTR 10

Architectural patterns for building real time applications with Apache HBase. Andrew Purtell Committer and PMC, Apache HBase

Cloudera Impala: A Modern SQL Engine for Hadoop Headline Goes Here

Moving From Hadoop to Spark

Collaborative Big Data Analytics. Copyright 2012 EMC Corporation. All rights reserved.

HADOOP ADMINISTATION AND DEVELOPMENT TRAINING CURRICULUM

#TalendSandbox for Big Data

HADOOP VENDOR DISTRIBUTIONS THE WHY, THE WHO AND THE HOW? Guruprasad K.N. Enterprise Architect Wipro BOTWORKS

Big Data Course Highlights

The Future of Data Management with Hadoop and the Enterprise Data Hub

A Tour of the Zoo the Hadoop Ecosystem Prafulla Wani

Time-Series Databases and Machine Learning

Introduction to Big data. Why Big data? Case Studies. Introduction to Hadoop. Understanding Features of Hadoop. Hadoop Architecture.

Hadoop Job Oriented Training Agenda

SOLVING REAL AND BIG (DATA) PROBLEMS USING HADOOP. Eva Andreasson Cloudera

Apache Hadoop: The Pla/orm for Big Data. Amr Awadallah CTO, Founder, Cloudera, Inc.

Bringing Big Data to People

How to Hadoop Without the Worry: Protecting Big Data at Scale

ITG Software Engineering

Upcoming Announcements

Complete Java Classes Hadoop Syllabus Contact No:

HDP Hadoop From concept to deployment.

The Future of Data Management

Getting Started with Hadoop. Raanan Dagan Paul Tibaldi

HDFS. Hadoop Distributed File System

Hortonworks and ODP: Realizing the Future of Big Data, Now Manila, May 13, 2015

Data Security in Hadoop

Infomatics. Big-Data and Hadoop Developer Training with Oracle WDP

brief contents PART 1 BACKGROUND AND FUNDAMENTALS...1 PART 2 PART 3 BIG DATA PATTERNS PART 4 BEYOND MAPREDUCE...385

Introduction to Big Data Training

Implement Hadoop jobs to extract business value from large and varied data sets

The Digital Enterprise Demands a Modern Integration Approach. Nada daveiga, Sr. Dir. of Technical Sales Tony LaVasseur, Territory Leader

Cloudera Enterprise Reference Architecture for Google Cloud Platform Deployments

Apache Sentry. Prasad Mujumdar

Hadoop 101. Lars George. NoSQL- Ma4ers, Cologne April 26, 2013

SQL Server 2012 PDW. Ryan Simpson Technical Solution Professional PDW Microsoft. Microsoft SQL Server 2012 Parallel Data Warehouse

Hadoop & Spark Using Amazon EMR

How To Scale Out Of A Nosql Database

Parquet. Columnar storage for the people

Native Connectivity to Big Data Sources in MicroStrategy 10. Presented by: Raja Ganapathy

Spring,2015. Apache Hive BY NATIA MAMAIASHVILI, LASHA AMASHUKELI & ALEKO CHAKHVASHVILI SUPERVAIZOR: PROF. NODAR MOMTSELIDZE

HADOOP. Revised 10/19/2015

COURSE CONTENT Big Data and Hadoop Training

Chase Wu New Jersey Ins0tute of Technology

Cloudera Enterprise Reference Architecture for Google Cloud Platform Deployments

Practical Hadoop by Example

Hadoop implementation of MapReduce computational model. Ján Vaňo

Secure Your Hadoop Cluster With Apache Sentry (Incubating) Xuefu Zhang Software Engineer, Cloudera April 07, 2014

Data processing goes big

Unified Big Data Processing with Apache Spark. Matei

BIG DATA ANALYTICS REFERENCE ARCHITECTURES AND CASE STUDIES

Finding the Needle in a Big Data Haystack. Wolfgang Hoschek (@whoschek) JAX 2014

Hadoop Trends and Practical Use Cases. April 2014

A Scalable Data Transformation Framework using the Hadoop Ecosystem

A very short talk about Apache Kylin Business Intelligence meets Big Data. Fabian Wilckens EMEA Solutions Architect

BIG DATA & HADOOP DEVELOPER TRAINING & CERTIFICATION

Real-time Data Analytics mit Elasticsearch. Bernhard Pflugfelder inovex GmbH

Using RDBMS, NoSQL or Hadoop?

HareDB HBase Client Web Version USER MANUAL HAREDB TEAM

HDP Enabling the Modern Data Architecture

The Hadoop Eco System Shanghai Data Science Meetup

Batter Up! Advanced Sports Analytics with R and Storm

Oracle Big Data Fundamentals Ed 1 NEW

Introduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data

The Future of Big Data SAS Automotive Roundtable Los Angeles, CA 5 March 2015 Mike Olson Chief Strategy Officer,

Capitalize on Big Data for Competitive Advantage with Bedrock TM, an integrated Management Platform for Hadoop Data Lakes

ESS event: Big Data in Official Statistics. Antonino Virgillito, Istat

So What s the Big Deal?

Real-Time Data Analytics and Visualization

Saving Millions through Data Warehouse Offloading to Hadoop. Jack Norris, CMO MapR Technologies. MapR Technologies. All rights reserved.

Hadoop Introduction. Olivier Renault Solution Engineer - Hortonworks

Processing Big Data With SQL on Hadoop. Jens Albrecht

BITKOM& NIK - Big Data Wo liegen die Chancen für den Mittelstand?

Lecture 10: HBase! Claudia Hauff (Web Information Systems)!

Apache Hadoop Ecosystem

Hadoop: The Definitive Guide

Peers Techno log ies Pv t. L td. HADOOP

How to Install and Configure EBF15328 for MapR or with MapReduce v1

Overview. Big Data in Apache Hadoop. - HDFS - MapReduce in Hadoop - YARN. Big Data Management and Analytics

Transcription:

SQL on NoSQL (and all of the data) With Apache Drill Richard Shaw Solutions Architect @aggress

Who What Where NoSQL DB Very Nice People Open Source Distributed Storage & Compute Platform (up to 1000s of nodes) MapR Hadoop MapR-DB

BIG DATA MapR-DB

Hadoop World HiveQL to Map Reduce The image cannot be displayed. Your computer may not have enough memory to open the image, or the image may have been corrupted. Restart your computer, and then open the file again. If the red x still appears, you may have to delete the image and then insert it again. APACHE HADOOP ECOSYSTEM Management Oozie Mahout Sqoop Hive/Stinger/ Tez Flume Cascading Hue Solr Impala HBase Storm Drill Accumulo Sentry Zookeeper Whirr Pig YARN MapReduce Shark Spark MapR Data Platform T Enterprise-grade e Inter-operability Multi-tenancy Security Operational h

Low Latency SQL on NoSQL and Other Stuff

Real-World Data Modeling and Transformations

What s Drill? Apache open source project Scale-out execution engine for low-latency SQL queries Unified SQL-based API for zero day analytics & operational applications Flexible data sources Data agility for NoSQL, HBase, Hadoop Mentored by -MapR -Lucidworks -Elasticsearch -Academics Power to Users The Sexy Bit The Useful Bit

Drill and Google Dremel Google Tech SQL querying of Google data over GFS & BigTable In use production use since 2006-8 YEARS! Tens of thousand of concurrent users over PB of data Dremel paper released 2010

Dremel Architecture Designed for columnar format data (nested data) Analysis in-situ -> Ad hoc version of Map Reduce Multi-level execution trees Load balancing across Tablets Handles contended or failed queries - redirects

Drill Architecture

Drill Architecture Highlights Flexible - Dynamic Schema Discovery & Data Model (nested etc). Many data formats Extensible - Java API. Work with RDBMS and NoSQL DBs Performance - Distributed engine, columnar optimised

Self-Describing Data is Ubiquitous Flat files in a distributed file system Complex data (Thrift, Avro, protobuf) Columnar data (Parquet, ORC) Loosely defined (JSON) Traditional files (CSV, TSV) Data stored in NoSQL stores Relational-like (rows, columns) Sparse data (NoSQL maps) Embedded blobs (JSON) Document stores (nested objects) { name: { first: Michael, last: Smith }, hobbies: [ski, soccer], district: Los Altos } { name: { first: Jennifer, last: Gates }, hobbies: [sing], preschool: CCLC }

Drill s Data Model is Flexible Fixed schema Schema-less RDBMS/SQL-on-Hadoop table Name Gender Age Michael M 6 Jennifer F 3 Flat Complex CSV TSV Parquet Avro Flexibility HBase JSON BSON This Was The Sexy Bit Flexibility Drill Sees in JSON Apache Drill table { name: { first: Michael, last: Smith }, hobbies: [ski, soccer], district: Los Altos } { name: { first: Jennifer, last: Gates }, hobbies: [sing], preschool: CCLC }

Quick Tour Self-Service Data Exploration with Apache Drill 2014 2014 MapR MapR Technologies Technologies

Zero to Results in 2 Minutes (3 Commands) $ tar xzf apache-drill.tar.gz $ apache-drill/bin/sqlline -u jdbc:drill:zk=local 0: jdbc:drill:zk=local> SELECT count(*) AS incidents, columns[1] AS category FROM dfs.`/tmp/sfpd_incidents_-_previous_three_months.csv` GROUP BY columns[1] ORDER BY incidents DESC; +------------+------------+ incidents category +------------+------------+ 8372 LARCENY/THEFT 4247 OTHER OFFENSES 3765 NON-CRIMINAL 2502 ASSAULT... 35 rows selected (0.847 seconds) Install Launch shell (embedded mode) Query Results

Data Source is in the Query SELECT timestamp, message FROM dfs1.logs.`appserverlogs/2014/jan/p001.parquet` WHERE errorlevel > 2 A storage engine instance - DFS - HBase - Hive Metastore/HCatalog A workspace - Sub-directory - Hive database - HBase namespace A table - pathnames - HBase table - Hive table

Data Sources JSON CSV ORC (ie, all Hive types) Parquet HBase tables can combine them Select USERS.name, PROF.emails.work from dfs.logs.`/data/logs` LOGS, dfs.users.`/profiles.json` USERS, where LOGS.uid = USERS.uid and errorlevel > 5 order by count(*);

Query Directory Trees # Query file: How many errors per level in Jan 2014? SELECT errorlevel, count(*) FROM dfs.logs.`/appserverlogs/2014/jan/part0001.parquet` GROUP BY errorlevel; # Query directory sub-tree: How many errors per level? SELECT errorlevel, count(*) FROM dfs.logs.`/appserverlogs` GROUP BY errorlevel; # Query some partitions: How many errors per level by month from 2012? SELECT errorlevel, count(*) FROM dfs.logs.`/appserverlogs` WHERE dirs[1] >= 2012 GROUP BY errorlevel, dirs[2];

Works with HBase and Embedded Blobs # Query an HBase table directly (no schemas) SELECT cf1.month, cf1.year FROM hbase.table1; # Embedded JSON value inside column profileblob inside column family cf1 of the HBase table users SELECT profile.name, count(profile.children) FROM ( SELECT CONVERT_FROM(cf1.profileBlob, 'json') AS profile FROM hbase.users )

Combine Data Sources on the Fly # Join log directory with JSON file (user profiles) to identify the name and email address for anyone associated with an error message. SELECT DISTINCT users.name, users.emails.work FROM dfs.logs.`/data/logs` logs, dfs.users.`/profiles.json` users WHERE logs.uid = users.id AND logs.errorlevel > 5; # Join a Hive table and an HBase table (without Hive metadata) to determine the number of tweets per user SELECT users.name, count(*) as tweetcount FROM hive.social.tweets tweets, hbase.users users WHERE tweets.userid = convert_from(users.rowkey, 'UTF-8') GROUP BY tweets.userid;

Use ANSI SQL with no modifications # TPC-H standard query 4 SELECT o.o_orderpriority, count(*) AS order_count FROM orders o WHERE o.o_orderdate >= date '1996-10-01' AND o.o_orderdate < date '1996-10-01' + interval '3' month AND EXISTS( SELECT * FROM lineitem l WHERE l.l_orderkey = o.o_orderkey AND l.l_commitdate < l.l_receiptdate ) GROUP BY o.o_orderpriority ORDER BY o.o_orderpriority;

Seamless integration with Apache Hive Low latency queries on Hive tables Support for 100s of Hive file formats Ability to reuse Hive UDFs Support for multiple Hive Metastores in a single query

What s the Worst That Can Happen

Think of the use cases Querying against many different data sets JDBC & ODBC connection to BI & Analytical tools Power to the users - Think Google Dremel Supporting different formats and databases

The Present Drill 0.5.0 is out Go play with it https://incubator.apache.org/drill/ Very active development 10 minutes to get running Actively Supported by the community

The Future SQL for $NoSQL $RDBMS Full Beta this month Production grade 1.0 before end of 2014 Monthly releases Support for other NoSQL databases & data sources One SQL to rule them all

The Paper

The Summary AGILITY INSTANT INSIGHTS TO BIG DATA Direct queries on self describing data No schemas or ETL required FLEXIBILITY ONE INTERFACE FOR HADOOP & NOSQL Query HBase and other NoSQL stores Use SQL to natively operate on complex data types (such as JSON) FAMILIARITY EXISTING SKILLS & TECHNOLOGIES Leverage ANSI SQL skills and BI tools Plug-n-play with Hive schema, file formats, UDFs

The End, Thank You @mapr maprtech mapr-technologies MapRTechnologies rshaw@mapr.com maprtech