INTRODUCING DRUID: FAST AD-HOC QUERIES ON BIG DATA MICHAEL DRISCOLL - CEO ERIC TSCHETTER - LEAD ARCHITECT @ METAMARKETS




MICHAEL E. DRISCOLL CEO @ METAMARKETS - @MEDRISCOLL

Metamarkets is the bridge from BI to Data Science, a new kind of real-time analytics tool for the Big Data generation.

OUR SOLUTION...
Visualization
Analytics
Data Store
ETL

OUR SOLUTION... LEVERAGES OPEN SOURCE
Visualization (node.js, D3)
Analytics (R, Scala)
Data Store (Druid*)
ETL (Hadoop, Kafka)

Data Store (Druid*)
1,000,000,000+ events daily
100,000+ queries daily
500 ms avg response

A TIDAL WAVE OF DATA
Twitter: 8 TB per day
Facebook: 100 TB per day
Boeing: 640 TB per flight
US Library of Congress: 235 TB
Wal-Mart: 2.5 PB stored
Global Businesses: 1.8 ZB in 2011

JUST IN CASE THE WIFI FAILS US...

ERIC TSCHETTER LEAD ARCHITECT @ METAMARKETS - @ZEDRUID

REAL-TIME ANALYTICS MEANS
Fast aggregations over O(billions), high-dimensional events
Flexible drill-downs with arbitrary depths
Sub-second queries that reflect changes immediately

WHAT WE TRIED
I. RDBMS - Relational Database

I. RDBMS - THE SETUP
Star Schema
Aggregate Tables
Query Caching

I. RDBMS - THE RESULTS
Queries that were cached: fast
Queries against aggregate tables: fast to acceptable
Queries against base fact table: generally unacceptable

I. RDBMS - PERFORMANCE
Naive benchmark scan rate: ~5.5M rows / second / core
1 day of summarized aggregates: 60M+ rows
1 query over 1 week, 16 cores: ~5 seconds
Page load with 20 queries over a week of data: long time
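As a rough sanity check on these figures, the sketch below redoes the arithmetic. It assumes that a one-week query scans seven days of the 60M-row daily aggregates, that the scan parallelizes perfectly across the 16 cores, and that the 20 page-load queries run one after another. None of these assumptions is stated in the talk.

```python
# Back-of-the-envelope check of the RDBMS numbers quoted on the slide.
ROWS_PER_DAY = 60_000_000        # "60M+ rows" per day of summarized aggregates
SCAN_RATE_PER_CORE = 5_500_000   # "~5.5M rows / second / core"
CORES = 16
DAYS = 7


def scan_seconds(rows, rate_per_core, cores):
    """Time to scan `rows`, assuming perfectly parallel scans (an idealization)."""
    return rows / (rate_per_core * cores)


one_query = scan_seconds(ROWS_PER_DAY * DAYS, SCAN_RATE_PER_CORE, CORES)
page_load = 20 * one_query  # assumption: the 20 queries run sequentially

print(f"one query: {one_query:.1f} s, page load: {page_load:.0f} s")
```

Under these assumptions a single query takes a little under 5 seconds, in line with the slide, and a 20-query page load takes on the order of a minute and a half.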

WHAT WE TRIED
I. RDBMS - Relational Database
II. NoSQL - Key/Value Store

II. NOSQL - THE SETUP
Pre-aggregate all dimensional combinations
Store results in a NoSQL store

II. NOSQL - THE RESULTS
Queries were fast: range scan on primary key
Inflexible: not aggregated, not available
Not continuously updated: aggregate first, then display
Processing scales exponentially

II. NOSQL - PERFORMANCE
Dimensional combinations => exponential increase
Tried limiting dimensional depth: still expands exponentially
Example: ~500k records
11 dimensions, 5-depth: 4.5 hours on a 15-node Hadoop cluster
14 dimensions, 5-depth: 9 hours on a 25-node Hadoop cluster
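To give a feel for why pre-aggregation blows up, the sketch below counts how many distinct groupings of dimensions have to be materialized when the depth is capped. It reads "5-depth" as "every combination of 1 to 5 dimensions", which is an assumption. The function name is hypothetical.

```python
from math import comb


def dimension_groupings(num_dimensions, max_depth):
    """Count the dimension combinations of size 1..max_depth."""
    return sum(comb(num_dimensions, k) for k in range(1, max_depth + 1))


for dims in (11, 14):
    capped = dimension_groupings(dims, 5)
    uncapped = dimension_groupings(dims, dims)  # every non-empty subset: 2**dims - 1
    print(f"{dims} dimensions: {capped} groupings at depth 5, {uncapped} uncapped")
```

This counts only the groupings themselves. Each grouping then produces one pre-aggregated row per distinct combination of values, so the stored result set is larger still.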

WHAT WE TRIED
I. RDBMS - Relational Database
II. NoSQL - Key/Value Store
III. ???

WHAT WE LEARNED
Problem with RDBMS: scans are slow
Problem with NoSQL: computationally intractable
Fixing RDBMS issue seems easier

INTRODUCING DRUID

DRUID FIVE KEY FEATURES
1. Distributed Column-Store
2. Real-time Ingestion
3. Fast Data Scans
4. Fast Filtering
5. Operationally Simple

FEATURE 1 DISTRIBUTED COLUMN-STORE
Data chunked into segments
MVCC swapping protocol
Converted to columnar format
Coordinator oversees cluster
Data replication and balance
Per-segment replication
Increase replication on hot spots
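A minimal sketch of what "data chunked into segments" could look like, assuming segments are cut along hourly time boundaries. The slide does not say how segments are delimited, and the event fields and function name here are hypothetical.

```python
from collections import defaultdict
from datetime import datetime


def chunk_into_segments(events):
    """Group events into one segment per hour, keyed by the start of that hour."""
    segments = defaultdict(list)
    for event in events:
        ts = datetime.fromisoformat(event["timestamp"])
        hour_start = ts.replace(minute=0, second=0, microsecond=0)
        segments[hour_start].append(event)
    return dict(segments)


events = [
    {"timestamp": "2012-10-01T00:15:00", "lang": "en", "tweets": 3},
    {"timestamp": "2012-10-01T00:45:00", "lang": "ja", "tweets": 1},
    {"timestamp": "2012-10-01T01:05:00", "lang": "en", "tweets": 2},
]

segments = chunk_into_segments(events)
for start, rows in sorted(segments.items()):
    print(start.isoformat(), len(rows), "events")
```

Because each segment is a self-contained unit, it can be replicated, moved, or swapped for a newer version independently of the others, which is the property the replication bullets rely on.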

FEATURE 2 REAL-TIME INGESTION
[Architecture diagram] Real-time Nodes hand off data to Historical Nodes. Broker Nodes perform query rewrite and scatter/gather. Real-time, Historical, and Broker Nodes each expose a Query API.
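The scatter/gather role of the broker can be sketched as follows. This is an illustration of the general pattern under simplifying assumptions (two in-process "nodes", a single sum aggregation), not Druid's actual implementation. All names and data are hypothetical.

```python
from collections import Counter

# Hypothetical per-node data: (lang, tweets) rows held by each node.
historical_node = [("en", 10), ("ja", 4), ("en", 3)]
realtime_node = [("en", 2), ("es", 1)]


def query_node(rows):
    """Each node answers the query over its own data with a partial aggregate."""
    partial = Counter()
    for lang, tweets in rows:
        partial[lang] += tweets
    return partial


def broker_query(nodes):
    """Scatter the query to every node, then gather and merge the partial sums."""
    merged = Counter()
    for rows in nodes:
        merged.update(query_node(rows))  # Counter.update adds counts together
    return dict(merged)


result = broker_query([historical_node, realtime_node])
print(result)
```

Merging works here because sums are associative: partial sums from real-time and historical nodes can be combined in any order, so the caller sees one answer regardless of where the data currently lives.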

FEATURE 3 FAST DATA SCANS
Column orientation: only load/scan what you need
Compression: decrease size of data in storage (RAM/disk)
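A small sketch of both ideas. The column layout lets a sum read a single list instead of whole rows. Dictionary encoding is shown as one common way to compress a string column; the talk does not say which compression scheme Druid uses, so treat that part as an assumption.

```python
# The same three hypothetical events, first as rows.
rows = [
    {"lang": "en", "country": "US", "tweets": 5},
    {"lang": "ja", "country": "JP", "tweets": 2},
    {"lang": "en", "country": "US", "tweets": 7},
]

# Column layout: one contiguous list per column.
columns = {name: [row[name] for row in rows] for name in rows[0]}

# Summing one metric only needs the "tweets" column; "lang" and "country"
# are never read.
total_tweets = sum(columns["tweets"])


def dictionary_encode(values):
    """Replace repeated strings with small integer ids plus a lookup table."""
    dictionary = sorted(set(values))
    ids = {value: i for i, value in enumerate(dictionary)}
    return dictionary, [ids[value] for value in values]


lang_dictionary, lang_ids = dictionary_encode(columns["lang"])
print(total_tweets, lang_dictionary, lang_ids)
```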

FEATURE 4 FAST FILTERING
Indexed: bitmap indices
Compressed bitmaps: CONCISE compression
Operate on compressed form
Resolve filters without looking at data
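The idea of resolving filters from bitmap indices alone can be sketched with plain Python integers used as bitsets. This leaves out CONCISE compression entirely, and the helper names and data are hypothetical.

```python
def build_bitmap_index(values):
    """Map each distinct value to an integer whose set bits mark matching rows."""
    index = {}
    for row_id, value in enumerate(values):
        index[value] = index.get(value, 0) | (1 << row_id)
    return index


def matching_rows(bitmap):
    """Expand a bitmap back into a sorted list of row ids."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]


country = ["US", "JP", "US", "DE", "US"]
lang = ["en", "ja", "es", "de", "en"]

country_index = build_bitmap_index(country)
lang_index = build_bitmap_index(lang)

# country = 'US' AND lang = 'en' is a bitwise AND of two bitmaps.
us_and_en = country_index["US"] & lang_index["en"]
# country = 'US' OR country = 'DE' is a bitwise OR.
us_or_de = country_index["US"] | country_index["DE"]

print(matching_rows(us_and_en), matching_rows(us_or_de))
```

The filter is answered entirely from the index. The row data is only touched afterwards, and only for the rows whose bits are set.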

FEATURE 5 OPERATIONALLY SIMPLE
Fault-tolerant
Replication
Rolling deployments/restarts
All node types have failover/replication
Grow == start processes
Shrink == kill processes

DRUID IN ACTION

DRUID'S QUERY INTERFACE
Resembles SQL...
SELECT hour(timestamp), lang, hashtag, sum(tweets) AS tweets
FROM "twitter-spritzer"
WHERE country = 'US' AND timestamp BETWEEN '2012-10-01' AND '2012-10-02'
GROUP BY hour(timestamp), lang, hashtag;

DRUID'S QUERY INTERFACE
...but in a JSON format
{
  "queryType": "groupBy",
  "dataSource": "twitter-spritzer",
  "intervals": ["2012-10-01/2012-10-02"],
  "granularity": "hour",
  "dimensions": ["lang", "hashtag"],
  "aggregations": [
    {"type": "longSum", "fieldName": "tweets", "name": "tweets"}
  ],
  "filter": {"type": "selector", "dimension": "country", "value": "US"}
}
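A minimal sketch of building the same query programmatically and serializing it to JSON. The field names use the camelCase spelling of Druid's query API. How the body is delivered (typically an HTTP POST to a broker node) depends on the deployment and is deliberately left out here.

```python
import json

# The groupBy query from the slide, expressed as a Python dict.
query = {
    "queryType": "groupBy",
    "dataSource": "twitter-spritzer",
    "intervals": ["2012-10-01/2012-10-02"],
    "granularity": "hour",
    "dimensions": ["lang", "hashtag"],
    "aggregations": [
        {"type": "longSum", "fieldName": "tweets", "name": "tweets"},
    ],
    "filter": {"type": "selector", "dimension": "country", "value": "US"},
}

# Serialize to the JSON text that would form the request body.
body = json.dumps(query, indent=2)
print(body)
```

Building the query as a dict and serializing it avoids the hand-written quoting mistakes that are easy to make when typing JSON directly.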

QUERY FEATURES
Group By
Time-series roll-ups
Arbitrary boolean filters
Aggregation functions: Sum, Min, Max, Avg, etc.
Dimensional Search: explore values in your dimensions

DRUID VERSUS OTHERS
vs Google Dremel: no indexing structure
vs Google PowerDrill: close analog, all in-memory
vs Hadoop+Avro+Hive(+Yarn): closer to Tenzing, back-office use case

DRUID BENCHMARKS
Scan speed: ~33M rows / second / core
Realtime ingestion rate: ~10k records / second / node
Cluster Examples: 100 nodes, 6 TB memory, 400 TB disk
6 TB in-memory query: 1.4 seconds

DRUID IS NOW OPEN SOURCE
Learn more:
METAMARKETS.COM/DRUID
GITHUB.COM/METAMX/DRUID
eric tschetter - cheddar@metamarkets.com - @zedruid