Apache Kylin Introduction Dec 8, 2014 @ApacheKylin



Similar documents
A very short talk about Apache Kylin Business Intelligence meets Big Data. Fabian Wilckens EMEA Solutions Architect

Using distributed technologies to analyze Big Data

Apache Kylin. Open Source Journey

Architectural patterns for building real time applications with Apache HBase. Andrew Purtell Committer and PMC, Apache HBase

Splice Machine: SQL-on-Hadoop Evaluation Guide

A Scalable Data Transformation Framework using the Hadoop Ecosystem

Federated SQL on Hadoop and Beyond: Leveraging Apache Geode to Build a Poor Man's SAP HANA. by Christian

Managing Big Data with Hadoop & Vertica. A look at integration between the Cloudera distribution for Hadoop and the Vertica Analytic Database

Trafodion Operational SQL-on-Hadoop

Integrating Apache Spark with an Enterprise Data Warehouse

Lambda Architecture. Near Real-Time Big Data Analytics Using Hadoop. January Website:

TE's Analytics on Hadoop and SAP HANA Using SAP Vora

INTRODUCING DRUID: FAST AD-HOC QUERIES ON BIG DATA MICHAEL DRISCOLL - CEO ERIC TSCHETTER - LEAD METAMARKETS

Pulsar Realtime Analytics At Scale. Tony Ng April 14, 2015

Integrating Hadoop. Into Business Intelligence & Data Warehousing. Philip Russom TDWI Research Director for Data Management, April

Data-cubing made-simple! with Spark, Algebird and HBase goo.gl/dbgr0h

Analytics on Spark &

Big Data Approaches. Making Sense of Big Data. Ian Crosland. Jan 2016

How to Enhance Traditional BI Architecture to Leverage Big Data

Oracle BI Suite Enterprise Edition

Tiber Solutions. Understanding the Current & Future Landscape of BI and Data Storage. Jim Hadley

THE DEVELOPER GUIDE TO BUILDING STREAMING DATA APPLICATIONS

Building a real-time, self-service data analytics ecosystem Greg Arnold, Sr. Director Engineering

Native Connectivity to Big Data Sources in MSTR 10

How To Scale Out Of A Nosql Database

1. OLAP is an acronym for a. Online Analytical Processing b. Online Analysis Process c. Online Arithmetic Processing d. Object Linking and Processing

Big Data Analytics Nokia

QLIKVIEW DEPLOYMENT FOR BIG DATA ANALYTICS AT KING.COM

In-Memory Data Management for Enterprise Applications

Inge Os Sales Consulting Manager Oracle Norway

When to consider OLAP?

Apache Hadoop: Past, Present, and Future

Apache Flink Next-gen data analysis. Kostas

the missing log collector Treasure Data, Inc. Muga Nishizawa

Zynga Analytics Leveraging Big Data to Make Games More Fun and Social

Moving From Hadoop to Spark

Crack Open Your Operational Database. Jamie Martin September 24th, 2013

Oracle Big Data, In-memory, and Exadata - One Database Engine to Rule Them All Dr.-Ing. Holger Friedrich

Business Intelligence in Microservice Architecture. Debarshi bol.com

INTRODUCTION TO APACHE HADOOP MATTHIAS BRÄGER CERN GS-ASE

<Insert Picture Here> Enhancing the Performance and Analytic Content of the Data Warehouse Using Oracle OLAP Option

Building Scalable Big Data Infrastructure Using Open Source Software. Sam William

Schema Design Patterns for a Peta-Scale World. Aaron Kimball Chief Architect, WibiData

Data Warehouse: Introduction

Introducing Oracle Exalytics In-Memory Machine

Native Connectivity to Big Data Sources in MicroStrategy 10. Presented by: Raja Ganapathy

How Companies are! Using Spark

Programming Hadoop 5-day, instructor-led BD-106. MapReduce Overview. Hadoop Overview

OBIEE 11g Data Modeling Best Practices

HBase A Comprehensive Introduction. James Chin, Zikai Wang Monday, March 14, 2011 CS 227 (Topics in Database Management) CIT 367

Big Data: Using ArcGIS with Apache Hadoop. Erik Hoel and Mike Park

Chapter 7. Using Hadoop Cluster and MapReduce

<Insert Picture Here> Extending Hyperion BI with the Oracle BI Server

Exploring the Synergistic Relationships Between BPC, BW and HANA

Lecture 10: HBase! Claudia Hauff (Web Information Systems)!

ROLAP with Column Store Index Deep Dive. Alexei Khalyako SQL CAT Program Manager

Cloudera Impala: A Modern SQL Engine for Hadoop Headline Goes Here

Big Data Analytics - Accelerated. stream-horizon.com

GAIN BETTER INSIGHT FROM BIG DATA USING JBOSS DATA VIRTUALIZATION

MapR: Best Solution for Customer Success

Upcoming Announcements

A Tour of the Zoo the Hadoop Ecosystem Prafulla Wani

How, What, and Where of Data Warehouses for MySQL

Lost in Space? Methodology for a Guided Drill-Through Analysis Out of the Wormhole

Hadoop2, Spark Big Data, real time, machine learning & use cases. Cédric Carbone Twitter

Monitoring Genebanks using Datamarts based in an Open Source Tool

LEARNING SOLUTIONS website milner.com/learning phone

Big Data Management and Security

Hadoop and Map-Reduce. Swati Gore

ORACLE OLAP. Oracle OLAP is embedded in the Oracle Database kernel and runs in the same database process

CS2032 Data warehousing and Data Mining Unit II Page 1

Petabyte Scale Data at Facebook. Dhruba Borthakur, Engineer at Facebook, SIGMOD, New York, June 2013

SAS BI Course Content; Introduction to DWH / BI Concepts

Distributed Computing and Big Data: Hadoop and MapReduce

Workshop on Hadoop with Big Data

Oracle Big Data SQL Technical Update

Dashboard Engine for Hadoop

BIG DATA ANALYTICS REFERENCE ARCHITECTURES AND CASE STUDIES

Oracle Database 12c for Data Warehousing and Big Data ORACLE WHITE PAPER SEPTEMBER 2014

Preview of Oracle Database 12c In-Memory Option. Copyright 2013, Oracle and/or its affiliates. All rights reserved.

Sentimental Analysis using Hadoop Phase 2: Week 2

OLAP & DATA MINING CS561-SPRING 2012 WPI, MOHAMED ELTABAKH

Impala: A Modern, Open-Source SQL Engine for Hadoop. Marcel Kornacker Cloudera, Inc.

Actian SQL in Hadoop Buyer s Guide

How To Handle Big Data With A Data Scientist

Evaluating NoSQL for Enterprise Applications. Dirk Bartels VP Strategy & Marketing

In-Memory Analytics: A comparison between Oracle TimesTen and Oracle Essbase

Optimize Oracle Business Intelligence Analytics with Oracle 12c In-Memory Database Option

Advanced Big Data Analytics with R and Hadoop

Presented by: Jose Chinchilla, MCITP

Hadoop Ecosystem B Y R A H I M A.

ESS event: Big Data in Official Statistics. Antonino Virgillito, Istat

Amazon Redshift & Amazon DynamoDB Michael Hanisch, Amazon Web Services Erez Hadas-Sonnenschein, clipkit GmbH Witali Stohler, clipkit GmbH

Transcription:

Apache Kylin Introduction Dec 8, 2014 @ApacheKylin Luke Han Sr. Product Manager lukhan@ebay.com @lukehq Yang Li Architect & Tech Leader yangli9@ebay.com

Agenda What s Apache Kylin? Tech Highlights Performance Open Source Q & A

What s Kylin kylin / ˈkiːˈlɪn / 麒 麟 - - n. (in Chinese art) a mythical animal of composite form Extreme OLAP Engine for Big Data Kylin is an open source Distributed Analytics Engine from ebay that provides SQL interface and multi-dimensional analysis (OLAP) on Hadoop supporting extremely large datasets Open Sourced on Oct 1st, 2014 Be accepted as Apache Incubator Project on Nov 25th, 2014

Big Data Era More and more data becoming available on Hadoop Limitations in existing Business Intelligence (BI) Tools Limited support for Hadoop Data size growing exponentially High latency of interactive queries Scale-Up architecture Challenges to adopt Hadoop as interactive analysis system Majority of analyst groups are SQL savvy No mature SQL interface on Hadoop OLAP capability on Hadoop ecosystem not ready yet

Business Needs for Big Data Analysis Sub-second query latency on billions of rows ANSI SQL for both analysts and engineers Full OLAP capability to offer advanced functionality Seamless Integration with BI Tools Support of high cardinality and high dimensions High concurrency thousands of end users Distributed and scale out architecture for large data volume Open source solution

Why not Build an engine from scratch? 6

Analytics Query Taxonomy Kylin is designed to accelerate 80+% analyncs queries performance on Hadoop High Level AggregaNon Very High Level, e.g GMV by site by verncal by weeks Strategy Analysis Query Middle level, e.g GMV by site by verncal, by category (level x) past 12 weeks OLAP Drill Down to Detail Detail Level (Summary Table) OperaNon Low Level AggregaNon First Level AggragaNon TransacNon TransacNon Level TransacNon Data OLTP

Technical Challenges Huge volume data Table scan Big table joins Data shuffling Analysis on different granularity Runtime aggregation expensive Map Reduce job Batch processing

OLAP Cube Balance between Space and Time Cuboid = one combination of dimensions Cube = all combination of dimensions (all cuboids) 0-D(apex) cuboid time, item time item location supplier time, location item, location Time, supplier item, supplier location, supplier 1-D cuboids 2-D cuboids time, item, location time, item, supplier time, location, supplier item, location, supplier time, item, location, supplier 3-D cuboids 4-D(base) cuboid Base vs. aggregate cells; ancestor vs. descendant cells; parent vs. child cells 1. (9/15, milk, Urbana, Dairy_land) - <time, item, location, supplier> 2. (9/15, milk, Urbana, *) - <time, item, location> 3. (*, milk, Urbana, *) - <item, location> 4. (*, milk, Chicago, *) - <item, location> 5. (*, milk, *, *) - <item> 9

From Relational to Key-Value

Kylin Architecture Overview 3rd Party App (Web App, Mobile ) REST API SQL SQL- Based Tool (BI Tools: Tableau ) JDBC/ODBC SQL Ø Online Analysis Data Flow Ø Offline Data Flow Ø Clients/Users interacnve with Kylin via SQL Ø OLAP Cube is transparent to users Mid Latency - Minutes REST Server Query Engine RouNng Low Latency - Seconds Hadoop Hive Star Schema Data Metadata Cube Build Engine (MapReduce ) Data OLAP Cube Cube (HBase) Key Value Data 11

Features Highlights Extremely Fast OLAP Engine at Scale Kylin is designed to reduce query latency on Hadoop for 10+ billions of rows of data ANSI SQL Interface on Hadoop Kylin offers ANSI SQL on Hadoop and supports most ANSI SQL query functions Seamless Integration with BI Tools Kylin currently offers integration capability with BI Tools like Tableau. Interactive Query Capability Users can interact with Hadoop data via Kylin at sub-second latency, better than Hive queries for the same dataset MOLAP Cube User can define a data model and pre-build in Kylin with more than 10+ billions of raw data records

Features Highlights Cons Compression and Encoding Support Incremental Refresh of Cubes Approximate Query Capability for distinct Count (HyperLogLog) Leverage HBase Coprocessor for query latency Job Management and Monitoring Easy Web interface to manage, build, monitor and query cubes Security capability to set ACL at Cube/Project Level Support LDAP Integration

How Does Kylin Utilize Hadoop Components? Hive Input source Pre-join star schema during cube building MapReduce Pre-aggregation metrics during cube building HDFS Store intermediated files during cube building. HBase Store data cube. Serve query on data cube. Coprocessor is used for query processing.

Why Kylin is Fast? Pre-built cube query result already be calculated Leveraging distributed computing infrastructure No runtime Hive table scan and MapReduce job Compression and encoding Put Computing to Data Cached

Agenda What s Kylin Tech Highlights Performance Open Source Q & A

Data Modeling How to Define Cube? End User Cube Modeler Admin Dim Fact Cube: Fact Table: Dimensions: Measures: Storage(HBase): Row Key row A row B row C Column Val 1 Val 2 Val 3 Dim Dim Column Family Source Star Schema Mapping Cube Metadata Target HBase Storage

Cube Metadata How to Define Cube? Dimension Normal Mandatory Hierarchy Derived Measure Sum Count Max Min Average Distinct Count (based on HyperLogLog)

Mandatory Dimension How to Define Cube? Dimension that must present on cuboid E.g. Date Normal A B C A B - - B C A - C A - - A is mandatory A B C A B - A - C A - - - B - - - C - - -

Hierarchy Dimension How to Define Cube? Dimensions that form a contains relationship where parent level is required for child level to make sense. E.g. Year -> Month -> Day; Country -> City Normal A B C A B - - B C A - C A - - A - > B - > C is hierarchy A B C A B - A - - - - - - B - - - C - - -

Derived Dimension How to Define Cube? Dimensions on lookup table that can be derived by PK E.g. User ID -> [Name, Age, Gender] Normal A B C A B - A, B, C is derived by ID ID - - B C A - C A - - - B - - - C - - -

Cube Build Job Flow How to Build Cube?

How to Build Cube? Cube Build Result

Query Engine Calcite How to Query Cube? Dynamic data management framework. Formerly known as OpNq, Calcite is an Apache incubator project, used by Apache Drill and Apache Hive, among others. hjp://opnq.incubator.apache.org

Calcite Plugins How to Query Cube? Metadata SPI Provide table schema from kylin metadata Optimize Rule Translate the logic operator into kylin operator Relational Operator Find right cube Translate SQL into storage engine api call Generate physical execute plan by linq4j java implementation Result Enumerator Translate storage engine result into java implementation result. SQL Function Add HyperLogLog for distinct count Implement date time related functions (i.e. Quarter)

Kylin Explain Plan How to Query Cube? SELECT test_cal_dt.week_beg_dt, test_category.category_name, test_category.lvl2_name, test_category.lvl3_name, test_kylin_fact.lstg_format_name, test_sites.site_name, SUM(test_kylin_fact.price) AS GMV, COUNT(*) AS TRANS_CNT FROM test_kylin_fact LEFT JOIN test_cal_dt ON test_kylin_fact.cal_dt = test_cal_dt.cal_dt LEFT JOIN test_category ON test_kylin_fact.leaf_categ_id = test_category.leaf_categ_id AND test_kylin_fact.lstg_site_id = test_category.site_id LEFT JOIN test_sites ON test_kylin_fact.lstg_site_id = test_sites.site_id WHERE test_kylin_fact.seller_id = 123456OR test_kylin_fact.lstg_format_name = New' GROUP BY test_cal_dt.week_beg_dt, test_category.category_name, test_category.lvl2_name, test_category.lvl3_name, test_kylin_fact.lstg_format_name,test_sites.site_name OLAPToEnumerableConverter OLAPProjectRel(WEEK_BEG_DT=[$0], category_name=[$1], CATEG_LVL2_NAME=[$2], CATEG_LVL3_NAME=[$3], LSTG_FORMAT_NAME=[$4], SITE_NAME=[$5], GMV=[CASE(=($7, 0), null, $6)], TRANS_CNT=[$8]) OLAPAggregateRel(group=[{0, 1, 2, 3, 4, 5}], agg#0=[$sum0($6)], agg#1=[count($6)], TRANS_CNT=[COUNT()]) OLAPProjectRel(WEEK_BEG_DT=[$13], category_name=[$21], CATEG_LVL2_NAME=[$15], CATEG_LVL3_NAME=[$14], LSTG_FORMAT_NAME=[$5], SITE_NAME=[$23], PRICE=[$0]) OLAPFilterRel(condiNon=[OR(=($3, 123456), =($5, New'))]) OLAPJoinRel(condiNon=[=($2, $25)], jointype=[lex]) OLAPJoinRel(condiNon=[AND(=($6, $22), =($2, $17))], jointype=[lex]) OLAPJoinRel(condiNon=[=($4, $12)], jointype=[lex]) OLAPTableScan(table=[[DEFAULT, TEST_KYLIN_FACT]], fields=[[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]]) OLAPTableScan(table=[[DEFAULT, TEST_CAL_DT]], fields=[[0, 1]]) OLAPTableScan(table=[[DEFAULT, test_category]], fields=[[0, 1, 2, 3, 4, 5, 6, 7, 8]]) OLAPTableScan(table=[[DEFAULT, TEST_SITES]], fields=[[0, 1, 2]])

Storage Engine How to Query Cube? Plugin-able query engine Common iterator interface for storage engine Isolate query engine from underline storage Translate cube query into HBase table scan Columns, Groups à Cuboid ID Filters -> Scan Range (Row Key) Aggregations -> Measure Columns (Row Values) Scan HBase table and translate HBase result into cube result HBase Result (key + value) -> Cube Result (dimensions + measures)

Cube Optimization How to Optimize Cube? Curse of dimensionality : N dimension cube has 2 N cuboid Full Cube vs. Partial Cube Hugh data volume Dictionary Encoding Incremental Building

Full Cube vs. Partial Cube How to Optimize Cube? Full Cube Pre-aggregate all dimension combinations Curse of dimensionality : N dimension cube has 2 N cuboid. Partial Cube To avoid dimension explosion, we divide the dimensions into different aggregation groups 2 N+M+L à 2 N + 2 M + 2 L For cube with 30 dimensions, if we divide these dimensions into 3 group, the cuboid number will reduce from 1 Billion to 3 Thousands 2 30 à 2 10 + 2 10 + 2 10 Tradeoff between online aggregation and offline pre-aggregation

Partial Cube How to Optimize Cube?

Dictionary Encoding How to Optimize Cube? Data cube has lost of duplicated dimension values Dictionary maps dimension values into IDs that will reduce the memory and storage footprint. Dictionary is based on Trie

Incremental Build How to Optimize Cube?

Inverted Index Challenge Has no raw data records Slow table scan on high cardinality dimensions Inverted Index Storage (an ongoing effort) Persist the raw table Bitmap inverted index Time range partition In-memory (block cache) Parallel scan (endpoint coprocessor)

Agenda What s Apache Kylin? Tech Highlights Performance Open Source Q & A

High Level AggregaNo n Analysis Query Kylin vs. Hive Drill Down to Detail Low Level AggregaNo n 200 150 100 50 0 Based on 12+B records case SQL #1 SQL #2 SQL #3 Hive Kylin TransacNon Level # Query Type Return Dataset Query On Kylin (s) Query On Hive (s) Comments 1 High Level AggregaNon 4 0.129 157.437 1,217 Nmes 2 Analysis Query 22,669 1.615 109.206 68 Nmes 3 Drill Down to Detail 4 Drill Down to Detail 325,029 12.058 113.123 9 Nmes 524,780 22.42 6383.21 278 Nmes 5 Data Dump 972,002 49.054 N/A

Performance -- Concurrency Linear scale out with more nodes

Performance - Query Latency 90%Nle queries <5s Green Line: 90%Nle queries Gray Line: 95%Nle queries

Agenda What s Apache Kylin? Tech Highlights Performance Open Source Q & A

Kylin Ecosystem Kylin Core Fundamental framework of Kylin OLAP Engine Extension Plugins to support for additional functions and features Integration Lifecycle Management Support to integrate with other applications Interface Driver Allows for third party users to build more features via userinterface atop Kylin core ODBC and JDBC Drivers Integration à ODBC Driver à ETL à Scheduling Kylin OLAP Core Interface à Web Console à Customized BI à Ambari/Hue Plugin Extension à Security à Redis Storage à Spark Engine

Kylin Evolution Roadmap 2013 2014 2015 Future Q1, 2015 TBD Sep, 2014 MOLAP Incremental Refresh HOLAP InvertedIndex JDBC Driver AutomaQon Next Gen In- Memory Analysis Real Time & Streaming (TBD) more Jan, 2014 Prototype for MOLAP ANSI SQL ODBC Driver Web GUI ACL Open Source Capacity Managment Support Spark more Sep, 2013 IniQal Basic end to end POC

Open Source Kylin Site: Twitter: @ApacheKylin Source Code Repo: https://github.com/kylinolap Google Group: Kylin OLAP

Thanks hjp://kylin.io