Rethinking SQL for Big Data with Apache Drill

Similar documents

SQL on NoSQL (and all of the data) With Apache Drill

MapR: Best Solution for Customer Success

Self-service BI for big data applications using Apache Drill

Self-service BI for big data applications using Apache Drill

Information Builders Mission & Value Proposition

Native Connectivity to Big Data Sources in MSTR 10

Trafodion Operational SQL-on-Hadoop

The Top 10 7 Hadoop Patterns and Anti-patterns. Alex

Architectural patterns for building real time applications with Apache HBase. Andrew Purtell Committer and PMC, Apache HBase

Big Data Approaches. Making Sense of Big Data. Ian Crosland. Jan 2016

Apache Kylin Introduction Dec 8,

HDP Hadoop From concept to deployment.

Cloudera Impala: A Modern SQL Engine for Hadoop Headline Goes Here

BIG DATA ANALYTICS REFERENCE ARCHITECTURES AND CASE STUDIES

Introduction to Big data. Why Big data? Case Studies. Introduction to Hadoop. Understanding Features of Hadoop. Hadoop Architecture.

A very short talk about Apache Kylin Business Intelligence meets Big Data. Fabian Wilckens EMEA Solutions Architect

The Internet of Things and Big Data: Intro

How Companies are! Using Spark

Moving From Hadoop to Spark

Collaborative Big Data Analytics. Copyright 2012 EMC Corporation. All rights reserved.

Hadoop Ecosystem Overview. CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook

Decoding the Big Data Deluge a Virtual Approach. Dan Luongo, Global Lead, Field Solution Engineering Data Virtualization Business Unit, Cisco

COSC 6397 Big Data Analytics. 2 nd homework assignment Pig and Hive. Edgar Gabriel Spring 2015

GAIN BETTER INSIGHT FROM BIG DATA USING JBOSS DATA VIRTUALIZATION

Big Data Architecture & Analytics A comprehensive approach to harness big data architecture and analytics for growth

Hadoop Ecosystem B Y R A H I M A.

BIG DATA CAN DRIVE THE BUSINESS AND IT TO EVOLVE AND ADAPT RALPH KIMBALL BUSSUM 2014

Data Governance in the Hadoop Data Lake. Michael Lang May 2015

AtScale Intelligence Platform

EMC Federation Big Data Solutions. Copyright 2015 EMC Corporation. All rights reserved.

BIG DATA: FROM HYPE TO REALITY. Leandro Ruiz Presales Partner for C&LA Teradata

Big Data Analytics Nokia

Enterprise Operational SQL on Hadoop Trafodion Overview

Luncheon Webinar Series May 13, 2013

Big Data Can Drive the Business and IT to Evolve and Adapt

Upcoming Announcements

HADOOP SOLUTION USING EMC ISILON AND CLOUDERA ENTERPRISE Efficient, Flexible In-Place Hadoop Analytics

Datenverwaltung im Wandel - Building an Enterprise Data Hub with

Safe Harbor Statement

Real-Time Data Analytics and Visualization

How To Handle Big Data With A Data Scientist

Practical Hadoop by Example

Unified Big Data Processing with Apache Spark. Matei

Oracle Big Data Discovery Unlock Potential in Big Data Reservoir

PLATFORA SOLUTION ARCHITECTURE

Sisense. Product Highlights.

Apache Hadoop in the Enterprise. Dr. Amr Awadallah,

Hadoop Job Oriented Training Agenda

Oracle Database 12c Plug In. Switch On. Get SMART.

Best Practices for Hadoop Data Analysis with Tableau

Using distributed technologies to analyze Big Data

ESS event: Big Data in Official Statistics. Antonino Virgillito, Istat

Databricks. A Primer

Ganzheitliches Datenmanagement

A Brief Introduction to Apache Tez

Oracle s Big Data solutions. Roger Wullschleger. <Insert Picture Here>

Big Data & QlikView. Democratizing Big Data Analytics. David Freriks Principal Solution Architect

QLIKVIEW DEPLOYMENT FOR BIG DATA ANALYTICS AT KING.COM

Using Tableau Software with Hortonworks Data Platform

Spring,2015. Apache Hive BY NATIA MAMAIASHVILI, LASHA AMASHUKELI & ALEKO CHAKHVASHVILI SUPERVAIZOR: PROF. NODAR MOMTSELIDZE

Getting Started with Hadoop. Raanan Dagan Paul Tibaldi

HareDB HBase Client Web Version USER MANUAL HAREDB TEAM

Addressing Risk Data Aggregation and Risk Reporting Ben Sharma, CEO. Big Data Everywhere Conference, NYC November 2015

Lecture 21: NoSQL III. Monday, April 20, 2015

Saving Millions through Data Warehouse Offloading to Hadoop. Jack Norris, CMO MapR Technologies. MapR Technologies. All rights reserved.

HDP Enabling the Modern Data Architecture

HAWQ Architecture. Alexey Grishchenko

11/18/15 CS q Hadoop was not designed to migrate data from traditional relational databases to its HDFS. q This is where Hive comes in.

The Future of Data Management

BIG DATA AND THE ENTERPRISE DATA WAREHOUSE WORKSHOP

Hadoop implementation of MapReduce computational model. Ján Vaňo

Time-Series Databases and Machine Learning

Real-time Data Analytics mit Elasticsearch. Bernhard Pflugfelder inovex GmbH

Dominik Wagenknecht Accenture

Programming Hadoop 5-day, instructor-led BD-106. MapReduce Overview. Hadoop Overview

Implement Hadoop jobs to extract business value from large and varied data sets

Cleveland State University

Databricks. A Primer

Apache Spark 11/10/15. Context. Reminder. Context. What is Spark? A GrowingStack

INTRODUCTION TO APACHE HADOOP MATTHIAS BRÄGER CERN GS-ASE

Ali Ghodsi Head of PM and Engineering Databricks

Constructing a Data Lake: Hadoop and Oracle Database United!

Integrating Hadoop. Into Business Intelligence & Data Warehousing. Philip Russom TDWI Research Director for Data Management, April

Actian Vortex Express 3.0

Chukwa, Hadoop subproject, 37, 131 Cloud enabled big data, 4 Codd s 12 rules, 1 Column-oriented databases, 18, 52 Compression pattern, 83 84

How to Install and Configure EBF15328 for MapR or with MapReduce v1

Schema Design Patterns for a Peta-Scale World. Aaron Kimball Chief Architect, WibiData

Creating a universe on Hive with Hortonworks HDP 2.0

#mstrworld. Tapping into Hadoop and NoSQL Data Sources in MicroStrategy. Presented by: Trishla Maru. #mstrworld

Big Data Course Highlights

How To Scale Out Of A Nosql Database

APACHE DRILL: Interactive Ad-Hoc Analysis at Scale

BIG DATA TECHNOLOGY. Hadoop Ecosystem

HADOOP ADMINISTATION AND DEVELOPMENT TRAINING CURRICULUM

White Paper: What You Need To Know About Hadoop

From Relational to Hadoop Part 2: Sqoop, Hive and Oozie. Gwen Shapira, Cloudera and Danil Zburivsky, Pythian

Hadoop & Spark Using Amazon EMR

Transcription:

Rethinking SQL for Big Data with Apache Drill Neeraja Rentachintala, Director of Product Management, MapR technologies 5/21/2015 1

Topics Motivation Apache Drill overview Product walkthrough Resources 2

Motivation 2015 2015 MapR MapR Technologies Technologies 3

Data Is Doubling Every Two Years Unstructured data will account for more than 80% of the data collected by organizations STRUCTURED DATA SEMI-STRUCTURED DATA Total Data Stored 1980 1990 2000 2010 2020 Source: Human-Computer Interaction & Knowledge Discovery in Complex Unstructured, Big Data 4

Data Increasingly Stored in Non-Relational Datastores Volume GBs-TBs TBs-PBs Structure Development Structured Planned (release cycle = months-years) Structured, semi-structured and unstructured Iterative (release cycle = days-weeks) Database RELATIONAL DATABASES Fixed schema DBA controls structure NON-RELATIONAL DATASTORES Dynamic / Flexible schema Application controls structure 1980 1990 2000 2010 2020 5

How To Bring SQL to Non-Relational data stores? Familiarity of SQL Agility of NoSQL ANSI SQL semantics BI (Tableau, MicroStrategy, etc.) Low latency No schema management HDFS (Parquet, JSON, etc.) HBase No transform or silos of data Ease of use 6

Industry's First Schema-free SQL engine for Big Data 7

Combining Agility with Performance Point-and-query vs. schema-first Access to any data source & type Industry standard APIs Performance at Scale Extreme Ease of Use 8

Enabling As-It-Happens Business with Instant Analytics Total time to insight: weeks to months Traditional approach Hadoop data Data modeling Transformation Data movement (optional) Users Source data evolution New Business questions Total time to insight: minutes Exploratory approach Hadoop data Users 9

Evolution Towards Self-Service Data Exploration Traditional BI w/ RDBMS Self-Service BI w/ RDBMS SQL-on-Hadoop Self-Service Data Exploration Data Modeling and Transformation IT-driven IT-driven IT-driven Optional Data Visualization IT-driven Self-service Self-service Self-service Zero-day analytics 10

Common Use Cases Raw Data Exploration JSON Analytics DWH offload {JSON}, Parquet Text Files Files Directories Hive HBase 11

How Drill achieves Agility & Performance 2015 2015 MapR MapR Technologies Technologies 12

Drill Supports Schema Discovery On-The-Fly Schema Declared In Advance Schema 2 Discovered On-The-Fly Fixed schema Leverage schema in centralized repository (Hive Metastore) Fixed schema, evolving schema or schema-less Leverage schema in centralized repository or self-describing data SCHEMA ON WRITE SCHEMA BEFORE READ SCHEMA ON THE FLY 13

Drill enables SQL on Everything (Omni-SQL) Workspace - Sub-directory - HBase namespace - Hive database Table - Pathnames - Hive table - HBase table SELECT * FROM dfs.yelp.`business.json`! Storage plugin instance - DFS (Text, Parquet, JSON) - HBase/MapR-DB - Hive Metastore/HCatalog - Easy API to go beyond Hadoop 14

Drill s Data Model is Flexible Complex Fixed schema Parquet Avro Dynamic schema JSON BSON Flexibility {! }! {! }! Apache Drill table name: {! first: Michael,! last: Smith! },! hobbies: [ski, soccer],! district: Los Altos! name: {! first: Jennifer,! last: Gates! },! hobbies: [sing],! preschool: CCLC! Flat CSV TSV HBase RDBMS/SQL-on-Hadoop table Name! Gender! Age! Michael! M! 6! Jennifer! F! 3! Flexibility 15

Drill is a Distributed SQL query engine drillbit drillbit drillbit DataNode/ RegionServer DataNode/ RegionServer DataNode/ RegionServer ZooKeeper ZooKeeper ZooKeeper Ø Scale-out (single node to 1000 s of nodes) Ø Columnar and Vectorized execution Ø Optimistic execution (no MR, Spark, Tez) Ø Extensible 16

Drill allows reuse of existing SQL Tools and Skills Leverage SQL-compatible tools (BI, query builders, etc.) via Drill s standard ODBC, JDBC and ANSI SQL support Enable business analysts, technical analysts and data scientists to explore and analyze large volumes of real-time data 17

Product Walkthrough 2015 2015 MapR MapR Technologies Technologies 18

Business dataset { } "business_id": "4bEjOyTaDG24SY5TxsaUNQ", "full_address": "3655 Las Vegas Blvd S\nThe Strip\nLas Vegas, NV 89109", "hours": { "Monday": {"close": "23:00", "open": "07:00"}, "Tuesday": {"close": "23:00", "open": "07:00"}, "Friday": {"close": "00:00", "open": "07:00"}, "Wednesday": {"close": "23:00", "open": "07:00"}, "Thursday": {"close": "23:00", "open": "07:00"}, "Sunday": {"close": "23:00", "open": "07:00"}, "Saturday": {"close": "00:00", "open": "07:00"} }, "open": true, "categories": ["Breakfast & Brunch", "Steakhouses", "French", "Restaurants"], "city": "Las Vegas", "review_count": 4084, "name": "Mon Ami Gabi", "neighborhoods": ["The Strip"], "longitude": - 115.172588519464, "state": "NV", "stars": 4.0, "attributes": { "Alcohol": "full_bar, "Noise Level": "average", "Has TV": false, "Attire": "casual", "Ambience": { "romantic": true, "intimate": false, "touristy": false, "hipster": false, "classy": true, "trendy": false, "casual": false }, "Good For": {"dessert": false, "latenight": false, "lunch": false, "dinner": true, "breakfast": false, "brunch": false}, } 19

Reviews dataset { "votes": {"funny": 0, "useful": 2, "cool": 1}, "user_id": "Xqd0DzHaiyRqVH3WRG7hzg", "review_id": "15SdjuK7DmYqUAj6rjGowg", "stars": 5, "date": "2007-05- 17", "text": "dr. goldberg offers everything...", "type": "review", "business_id": "vcnawilm4dr7d2nwwj7nca" } 20

Zero to Results in 2 minutes $ tar - xvzf apache- drill- 1.0.0.tar.gz $ bin/sqlline - u jdbc:drill:zk=local > SELECT state, city, count(*) AS businesses FROM dfs.yelp.`business.json` GROUP BY state, city ORDER BY businesses DESC LIMIT 10; Install Launch shell (embedded mode) Query files and directories +- - - - - - - - - - - - +- - - - - - - - - - - - +- - - - - - - - - - - - - + state city businesses +- - - - - - - - - - - - +- - - - - - - - - - - - +- - - - - - - - - - - - - + NV Las Vegas 12021 AZ Phoenix 7499 AZ Scottsdale 3605 EDH Edinburgh 2804 AZ Mesa 2041 AZ Tempe 2025 NV Henderson 1914 AZ Chandler 1637 WI Madison 1630 AZ Glendale 1196 +- - - - - - - - - - - - +- - - - - - - - - - - - +- - - - - - - - - - - - - + Results 21

Directories are implicit partitions sales 2014 q1 q2 q3 q4 2015 q1 SELECT dir0, SUM(amount) FROM sales GROUP BY dir1 IN (q1, q2) 22

Intuitive SQL access to complex data // It s Friday 10pm in Vegas and looking for Hummus > SELECT name, stars, b.hours.friday friday, categories FROM dfs.yelp.`business.json` b WHERE b.hours.friday.`open` < '22:00' AND b.hours.friday.`close` > '22:00' AND REPEATED_CONTAINS(categories, 'Mediterranean') AND city = 'Las Vegas' ORDER BY stars DESC LIMIT 2; Query data with any levels of nesting +- - - - - - - - - - - - +- - - - - - - - - - - - +- - - - - - - - - - - - +- - - - - - - - - - - - + name stars friday categories +- - - - - - - - - - - - +- - - - - - - - - - - - +- - - - - - - - - - - - +- - - - - - - - - - - - + Olives 4.0 {"close":"22:30","open":"11:00"} ["Mediterranean","Restaurants"] Marrakech Moroccan Restaurant 4.0 {"close":"23:00","open":"17:30"} ["Mediterranean","Middle Eastern","Moroccan","Restaurants"] +- - - - - - - - - - - - +- - - - - - - - - - - - +- - - - - - - - - - - - +- - - - - - - - - - - - + 23

ANSI SQL compatibility //Get top cool rated businesses Ø SELECT b.name from dfs.yelp.`business.json` b WHERE b.business_id IN (SELECT r.business_id FROM dfs.yelp.`review.json` r GROUP BY r.business_id HAVING SUM(r.votes.cool) > 2000 ORDER BY SUM(r.votes.cool) DESC); +- - - - - - - - - - - - + name +- - - - - - - - - - - - + Earl of Sandwich XS Nightclub The Cosmopolitan of Las Vegas Wicked Spoon +- - - - - - - - - - - - + Use familiar SQL functionality (Joins, Aggregations, Sorting, Sub- queries, SQL data types) 24

Logical views //Create a view combining business and reviews datasets > CREATE OR REPLACE VIEW dfs.tmp.businessreviews AS SELECT b.name, b.stars, r.votes.funny, r.votes.useful, r.votes.cool, r.`date` FROM dfs.yelp.`business.json` b, dfs.yelp.`review.json` r WHERE r.business_id = b.business_id; Lightweight file system based views for granular and de- centralized data management +- - - - - - - - - - - - +- - - - - - - - - - - - + ok summary +- - - - - - - - - - - - +- - - - - - - - - - - - + true View 'BusinessReviews' created successfully in 'dfs.tmp' schema +- - - - - - - - - - - - +- - - - - - - - - - - - + > SELECT COUNT(*) AS Total FROM dfs.tmp.businessreviews; +------------+ Total +------------+ 1125458 +------------+ 25

Materialized Views AKA Tables > ALTER SESSION SET `store.format` = 'parquet'; > CREATE TABLE dfs.yelp.businessreviewstbl AS SELECT b.name, b.stars, r.votes.funny funny, r.votes.useful useful, r.votes.cool cool, r.`date` FROM dfs.yelp.`business.json` b, dfs.yelp.`review.json` r WHERE r.business_id = b.business_id; Save analysis results as tables using familiar CTAS syntax +- - - - - - - - - - - - +- - - - - - - - - - - - - - - - - - - - - - - - - - - + Fragment Number of records written +- - - - - - - - - - - - +- - - - - - - - - - - - - - - - - - - - - - - - - - - + 1_0 176448 1_1 192439 1_2 198625 1_3 200863 1_4 181420 1_5 175663 +- - - - - - - - - - - - +- - - - - - - - - - - - - - - - - - - - - - - - - - - + 26

Extensions to ANSI SQL to work with repeated values // Flatten repeated categories > SELECT name, categories FROM dfs.yelp.`business.json` LIMIT 3; +- - - - - - - - - - - - +- - - - - - - - - - - - + name categories +- - - - - - - - - - - - +- - - - - - - - - - - - + Eric Goldberg, MD ["Doctors","Health & Medical"] Pine Cone Restaurant ["Restaurants"] Deforest Family Restaurant ["American (Traditional)","Restaurants"] +- - - - - - - - - - - - +- - - - - - - - - - - - + > SELECT name, FLATTEN(categories) AS categories FROM dfs.yelp.`business.json` LIMIT 5; +- - - - - - - - - - - - +- - - - - - - - - - - - + name categories +- - - - - - - - - - - - +- - - - - - - - - - - - + Eric Goldberg, MD Doctors Eric Goldberg, MD Health & Medical Pine Cone Restaurant Restaurants Deforest Family Restaurant American (Traditional) Deforest Family Restaurant Restaurants +- - - - - - - - - - - - +- - - - - - - - - - - - + Dynamically flatten repeated and nested data elements as part of SQL queries. No ETL necessary 27

Extensions to ANSI SQL to work with repeated values // Get most common business categories >SELECT category, count(*) AS categorycount FROM (SELECT name, FLATTEN(categories) AS category FROM dfs.yelp.`business.json`) c GROUP BY category ORDER BY categorycount DESC; +- - - - - - - - - - - - +- - - - - - - - - - - - + category categorycount +- - - - - - - - - - - - +- - - - - - - - - - - - + Restaurants 14303 Australian 1 Boat Dealers 1 Firewood 1 +- - - - - - - - - - - - +- - - - - - - - - - - - + 28

Extensions to ANSI SQL to work with embedded JSON - - embedded JSON value inside column donutjson inside column- family cf1 of an hbase table donuts SELECT d.name, COUNT(d.fillings) FROM (! SELECT convert_from(cf1.donutjson, JSON) as d FROM hbase.donuts); 29

Drill provides access control that scales User PAM Authentication + User Impersonation User Drill View 1 Drill View 2 U Files HBase Hive U U Fine-grained row and column level access control with Drill Views no centralized security repository required 30

Drill is Top-Ranked SQL-on-Hadoop Drill isn t just about SQL-on-Hadoop. It s about SQL-onpretty-muchanything, immediately, and without formality. Key: Number indicates companies relative strength across all vectors Size of ball indicates company s relative strength along individual vector Source: Gigaom Research, 2015 31

Drill project status Just released Jun 13 First release Drill 0.1 Sep 14 Beta Drill 0.5 Dec 14 + Apache Top Level Project Mar 15 Drill 0.7 Drill 0.8 May 15 Drill 1.0 Project incubation Sep 12 Dev Preview Drill 0.4 Aug 14 Drill 0.6 Nov 14 GigaOm Top ranked SQL On Hadoop Jan 15 Drill 0.9 Apr 15 Large community, growing rapidly Growing user adoption Highlights Apache Top Level Project Iterative Project cycles 50 contributors 1000 s downloads 7 releases < 9 months 32

Recommendations On Trying and Using Drill New to Drill? Get started with Free MapR On Demand training Test Drive Drill on cloud with AWS Learn how to use Drill with Hadoop using MapR sandbox Ready to play with your data? Try out Apache Drill in 10 mins guide on your desktop Download Drill for your cluster and start exploration Comprehensive tutorials and documentation available Ask questions user@drill.apache.org 33