HAWQ Architecture. Alexey Grishchenko




Who I am: Enterprise Architect @ Pivotal; 7 years in data processing, 5 years of experience with MPP, 4 years with Hadoop; using HAWQ since the first internal beta; responsible for designing most of the EMEA HAWQ and Greenplum implementations; Spark contributor. http://0x0fff.com

Agenda: What is HAWQ; Why you need it; HAWQ Components; HAWQ Design; Query execution example; Competitive solutions

What is it? An analytical SQL-on-Hadoop engine: HAWQ stands for HAdoop With Queries. Lineage: 2005, Greenplum forked from Postgres 8.0.2; 2009, rebased onto Postgres 8.2.15; 2011, HAWQ forked from GPDB 4.2.0.0; 2013, HAWQ 1.0.0.0; 2015, HAWQ 2.0.0.0 (Open Source).

HAWQ is: 1,500,000 lines of C and C++ code (200,000 of them in headers alone); 180,000 lines of Python; 60,000 lines of Java; 23,000 lines of Makefiles; 7,000 lines of shell scripts. More than 50 enterprise customers, more than 10 of them in EMEA.

Apache HAWQ: Apache HAWQ (incubating) since 09/2015. http://hawq.incubator.apache.org https://github.com/apache/incubator-hawq What's in open source: the sources of HAWQ 2.0 alpha. HAWQ 2.0 beta is planned for 2015 Q4, and HAWQ 2.0 GA is planned for 2016 Q1. The community is still young, so come and join!

Why do we need it? A SQL interface for BI solutions on top of Hadoop data, compliant with ANSI SQL-92, -99, and -2003. Example: a 5000-line query with a number of window functions, generated by Cognos. A universal tool for ad hoc analytics on top of Hadoop data. Example: parse a URL to extract the protocol, host name, port, and GET parameters. Good performance. How many times does the data hit the disk during a single Hive query?
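As a sketch of the kind of ad hoc query this enables (the weblogs table and its columns are hypothetical; split_part and window functions are standard Postgres-heritage features that HAWQ inherits):

```sql
-- Hypothetical table weblogs(url text, user_id int, ts timestamp).
-- Extract the protocol and host from each URL, then number each
-- user's requests in time order with a window function.
SELECT
    split_part(url, '://', 1)                     AS protocol,
    split_part(split_part(url, '://', 2), '/', 1) AS host,
    user_id,
    row_number() OVER (PARTITION BY user_id ORDER BY ts) AS visit_no
FROM weblogs;
```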

HAWQ Cluster [diagram]: Servers 1-4 (master hosts) run the YARN ResourceManager, the YARN App Timeline Server, ZooKeeper and JournalNode instances, the HDFS NameNode and Secondary NameNode, and the HAWQ Master and HAWQ Standby, joined by the interconnect. Servers 5 through N (worker hosts) each run a YARN NodeManager and an HDFS DataNode.

Master Servers [same cluster diagram, with the HAWQ Master and HAWQ Standby hosts highlighted].

Master Servers: the HAWQ Master and the HAWQ Standby Master each run a Query Parser, Query Optimizer, Distributed Transactions Manager, Global Resource Manager, Query Dispatch, and a Metadata Catalog; the catalogs are kept in sync via WAL replication.

Segments [same cluster diagram, with the segment hosts, Servers 5 through N, highlighted].

Segments: each segment host runs a Query Executor, libhdfs3, PXF, and the YARN NodeManager, plus a local filesystem with a temporary data directory and logs.

Metadata: the HAWQ metadata structure is similar to the Postgres catalog structure. Statistics include: the number of rows and pages in the table; the most common values for each field; a histogram of the value distribution for each field; the number of unique values in the field; the number of null values in the field; the average width of the field in bytes.

Statistics. With no statistics: how many rows would the join of two tables produce? Anywhere from 0 to infinity. With row counts: how many rows would the join of two 1000-row tables produce? From 0 to 1,000,000. With histograms and MCVs: how many rows would the join of two 1000-row tables produce, given known field cardinality, a value distribution histogram, the number of nulls, and the most common values? Roughly from 500 to 1,500.
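These statistics are gathered the Postgres way; as a sketch, assuming a table named sells (pg_stats is the standard Postgres catalog view that HAWQ inherits):

```sql
-- Collect row counts, MCVs, histograms and null fractions.
ANALYZE sells;

-- Inspect what the optimizer now knows about each column.
SELECT attname, n_distinct, null_frac, most_common_vals
FROM pg_stats
WHERE tablename = 'sells';
```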

Metadata. Table structure information: distribution fields, the number of hash buckets, partitioning (hash, list, range). Example table, distributed by hash(id):

ID | Name       | Num | Price
 1 | Apple      |  10 |    50
 2 | Pear       |  20 |    80
 3 | Banana     |  40 |    40
 4 | Orange     |  25 |    50
 5 | Kiwi       |   5 |   120
 6 | Watermelon |  20 |    30
 7 | Melon      |  40 |   100
 8 | Pineapple  |  35 |    90
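In DDL terms, the distribution field, bucket count, and partitioning are declared at table creation time. A sketch in Greenplum-heritage syntax; treat the exact option names (e.g. bucketnum) and values as assumptions:

```sql
CREATE TABLE fruits (
    id    int,
    name  text,
    num   int,
    price int
)
WITH (appendonly = true, bucketnum = 8)  -- number of hash buckets
DISTRIBUTED BY (id)                      -- distribution field
PARTITION BY RANGE (price)               -- range partitioning
    (START (0) END (200) EVERY (50));
```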

Metadata, continued. General metadata: users and groups; access privileges; stored procedures in PL/pgSQL, PL/Java, PL/Python, PL/Perl, and PL/R.

Query Optimizer: HAWQ uses cost-based query optimization, and you have two options: Planner, which evolved from the Postgres query optimizer, and ORCA (the Pivotal Query Optimizer), developed specifically for HAWQ. Optimizer hints work just like in Postgres: enable or disable specific operations, or change the cost estimates for basic actions.
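For illustration, these knobs are session settings in the Postgres style (optimizer is the setting that switches between ORCA and Planner in Pivotal engines; the enable_* and cost settings are standard Postgres GUCs that apply to Planner):

```sql
SET optimizer = on;         -- use ORCA for this session
SET optimizer = off;        -- fall back to Planner

-- Planner-style hints, inherited from Postgres:
SET enable_nestloop = off;  -- disable a specific operation
SET random_page_cost = 2.0; -- change a basic cost estimate
```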

Storage Formats. Which storage format is the most optimal? It depends on what you mean by optimal: minimal CPU usage for reading and writing the data; minimal disk space usage; minimal time to retrieve a record by key; minimal time to retrieve a subset of columns; and so on.

Storage Formats. Row-based storage format: similar to Postgres heap storage, but with no TOAST and no ctid, xmin, xmax, cmin, cmax system columns. Compression: none, Quicklz, or zlib levels 1-9.

Storage Formats Apache Parquet Mixed row-columnar table store, the data is split into row groups stored in columnar format

Storage Formats Apache Parquet Mixed row-columnar table store, the data is split into row groups stored in columnar format Compression No compression Snappy Gzip levels 1 9

Storage Formats. Apache Parquet: a mixed row-columnar table store; the data is split into row groups, which are stored in columnar format. Compression: none, Snappy, or gzip levels 1-9. The row group size and page size can be set for each table separately.
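A sketch of per-table Parquet settings in HAWQ DDL (the option names follow HAWQ's storage options; treat the exact names and values as assumptions):

```sql
CREATE TABLE sells_parquet (
    bar   text,
    beer  text,
    price int
)
WITH (
    appendonly   = true,
    orientation  = parquet,
    compresstype = snappy,
    rowgroupsize = 8388608,  -- 8 MB row groups
    pagesize     = 1048576   -- 1 MB pages
)
DISTRIBUTED BY (bar);
```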

Resource Management. Two main options: static resource split, where HAWQ and YARN do not know about each other; or YARN, where HAWQ asks the YARN Resource Manager for query execution resources. This gives flexible cluster utilization: a small query might run on a subset of nodes, a query might have many executors on each cluster node to make it run faster, and you can control the parallelism of each query.

Resource Management. A resource queue can be set with: a maximum number of parallel queries; CPU usage priority; memory usage limits; a CPU core usage limit; a MIN/MAX number of executors across the system; a MIN/MAX number of executors on each node. Queues can be set up per user or per group.
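A sketch of defining such a queue and attaching a user to it (the attribute names follow HAWQ 2.x resource queue DDL; treat the exact syntax and values as assumptions):

```sql
CREATE RESOURCE QUEUE reporting WITH (
    PARENT = 'pg_root',
    ACTIVE_STATEMENTS = 20,      -- max parallel queries
    MEMORY_LIMIT_CLUSTER = 25%,  -- share of cluster memory
    CORE_LIMIT_CLUSTER = 25%     -- share of cluster CPU cores
);

-- Attach a user to the queue:
ALTER ROLE analyst RESOURCE QUEUE reporting;
```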

External Data. PXF: a framework for external data access, easy to extend, with many public plugins available. Official plugins: CSV, SequenceFile, Avro, Hive, HBase. Open source plugins: JSON, Accumulo, Cassandra, JDBC, Redis, Pipe. HCatalog: HAWQ can query tables registered in HCatalog the same way as HAWQ native tables.
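A sketch of reading HDFS data through PXF as an external table (the host, port, path, and profile name are illustrative assumptions):

```sql
CREATE EXTERNAL TABLE ext_sells (
    bar   text,
    beer  text,
    price int
)
LOCATION ('pxf://namenode:51200/data/sells?PROFILE=HdfsTextSimple')
FORMAT 'TEXT' (DELIMITER ',');

-- Queried like any other table:
SELECT count(*) FROM ext_sells;
```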

Query Example [animated diagram; phases: Plan, Resource, Prepare, Execute, Result, Cleanup]. The HAWQ Master components involved are the Query Parser, Query Optimizer, Metadata catalog, Transaction Manager, Resource Manager, and Query Dispatch, alongside the YARN RM and the HDFS NameNode.

Plan: the master parses and optimizes the query into a plan tree: Motion Gather; Project s.beer, s.price; HashJoin b.name = s.bar, with one input scanning Sells (s) and the other applying Filter b.city = 'San Francisco' to a Scan of Bars (b) followed by Motion Redist(b.name).

Resource: the HAWQ Resource Manager asks the YARN RM: "I need 5 containers, each with 1 CPU core and 256 MB RAM". YARN replies: "Server 1: 2 containers, Server 2: 1 container, Server N: 2 containers".

Prepare: Query Executor (QE) processes are started in the allocated containers on the segment servers, and Query Dispatch sends each of them its slice of the plan.

Execute: the executors run the plan, exchanging data through the Motion operators and streaming the final result back to the master.

Result: the master returns the result set to the client.

Cleanup: the master asks YARN to free the query resources ("Free query resources: Server 1: 2 containers, Server 2: 1 container, Server N: 2 containers"; YARN: "OK"), and the executors shut down.
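The plan tree shown in the frames corresponds to a query along these lines (table and column names are taken from the plan; the schema itself is an assumption):

```sql
-- Bars(name, city), Sells(bar, beer, price): which beers are sold,
-- and at what price, in San Francisco bars?
SELECT s.beer, s.price
FROM   sells s
JOIN   bars  b ON b.name = s.bar
WHERE  b.city = 'San Francisco';

-- EXPLAIN shows the distributed plan with its Motion operators:
EXPLAIN
SELECT s.beer, s.price
FROM   sells s
JOIN   bars  b ON b.name = s.bar
WHERE  b.city = 'San Francisco';
```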

Query Performance. Data does not hit the disk unless that cannot be avoided. Data is not buffered on the segments unless that cannot be avoided. Data is transferred between the nodes over UDP. HAWQ has a good cost-based query optimizer. The C/C++ implementation is more efficient than the Java implementations of competing solutions. Query parallelism can be easily tuned.

Competitive Solutions [comparison table: Hive, SparkSQL, Impala, and HAWQ, rated on Query Optimizer, ANSI SQL, Built-in Languages, Disk IO, Parallelism, Distributions, Stability, and Community].

Roadmap: AWS and S3 integration; Mesos integration; better Ambari integration; native support for the Cloudera, MapR, and IBM Hadoop distributions; make it the best SQL-on-Hadoop engine ever!

Summary: a modern SQL-on-Hadoop engine for structured data processing and analysis. It combines the best techniques of competing solutions. It was just released to open source, and the community is very young. Join our community and contribute!

Questions. Apache HAWQ: http://hawq.incubator.apache.org dev@hawq.incubator.apache.org user@hawq.incubator.apache.org Reach me at http://0x0fff.com