Data-cubing made simple! With Spark, Algebird and HBase
goo.gl/dbgr0h
Vidmantas Zemleris
Agenda
- Intro: Analytics at Vinted; what is data-cubing? why did we build it?
- Architecture: preliminaries, metric computation, storage & serving metrics
- Optimisations
- Conclusions
Analytics at Vinted
- P2P marketplace for lifestyle & clothing: 10M members, 2M active monthly, 8 countries, ~10TB of data
- a data-driven company
- SQL solutions were too slow or inflexible for analytics; we ended up using Spark for ETL
- developed our own little OLAP engine, to better understand user needs, make complex reporting possible, enable automatic insight discovery, and much more
OLAP Reporting
- explore metrics by all combinations of dimensions
- similar to pivot tables in Excel
P.S. the numbers are randomised
What is Data-cubing?
(Pre)compute metrics by all* combinations of dimensions.

Input (dimensions + measures):

  portal | platform | product_features | guid
  LT     | iphone   | ["A", "B", "C"]  | 1
  DE     | android  | ["A"]            | 1001

Output (dimensions + metrics):

  portal | product_feature | unique_visitors
  any    | any             | 2
  any    | A               | 2
  LT     | B               | 1
  DE     | A               | 1

Equivalent in SQL:

  SELECT COUNT(DISTINCT guid) AS unique_visitors, portal
  FROM sessions
  GROUP BY portal
  UNION ALL
  SELECT COUNT(DISTINCT guid) AS unique_visitors, portal, product_feature
  FROM sessions LATERAL VIEW EXPLODE(product_features) pf AS product_feature
  GROUP BY portal, product_feature
  UNION ALL
  ...
Problem definition
Given clean fact tables with:
- 10-20+ dimensions (things to filter on): country=lt, age=18..25, devices_used=array(iphone, android)
- measures (things to aggregate over): price=10.23, user_id=123, uuid=abc1def
- metric definitions, e.g. unique seller count = unique members who made a transaction
Requirements:
- query metrics at any* viewpoint within seconds
- be simple & integrate well with Hadoop & Spark
- support multivalued dimensions
Existing solutions
Apache Kylin (by eBay)
- pros: commodity infrastructure (Hadoop map-reduce)
- cons: complex; Spark support not ready yet
Druid (used by eBay, as much as Kylin)
- pros: scales to a high dimension count; batch + stream (lambda architecture); spatial indexing
- cons: complex custom cluster services; reads JSON only*; missing exact count-distinct
Also: LinkedIn's Cubert, Mondrian ROLAP, etc.
Why a custom cubing solution?
- less complexity
- easier integration with existing ETL
- uses shared/regular Hadoop infrastructure
- Spark is faster than map-reduce
- features the others were missing: multivalued dimensions, exact count-distinct
Architecture overview
Cube definitions

  class SessionsCube(hiveCtx: HiveContext) extends Cube {
    val maxGroupingSize: Int = 3
    val dimensionNames = Set("portal", "gender", "platform")
    override val multivaluedDimensionNames = Set("product_features")

    val metrics = MapAggregator(
      metric(name = "Session length sum",
             aggregator = sumAggregator(_.getAs[Int]("session_length"))),
      metric(name = "Unique visitors",
             aggregator = countDistinctAggregator(_.getAs[Int]("guid")))
    )

    def facts = SessionEnrichedFact.read(hiveCtx)
  }

Example facts:

  portal | gender | platform | product_features | guid | session_length
  LT     | M      | iphone   | ["A", "B", "C"]  | 1    | 100
  DE     | F      | android  | ["A"]            | 100  | 500
Adding ALL the things
Monoid: an adder; we use it for aggregations.
- the ordering and grouping of additions does not matter (associativity, commutativity): (1 + 2) + 3 = 6, 1 + (2 + 3) = 6
- aggregations are optimised by adding partial sums: 1 + 2 = 3 and 3 + 4 = 7 locally, send the partials over the network, then 3 + 7 = 10

  object MinMonoid {
    def zero = Int.MaxValue
    def plus(l: Int, r: Int) = Math.min(l, r)
  }
  List(3, 4, 2).reduce(MinMonoid.plus) // 2
Adding ALL the things with Monoid
Aggregators can ADD complex things:
- top-k values
- std-dev
- exact count-distinct
- approximate count-distinct, e.g. ~5KB for 0.5% error

  class TopKMonoid(k: Int) {
    def zero: List[Int] = List.empty
    def plus(l: List[Int], r: List[Int]): List[Int] = (l ++ r).sorted.take(k)
  }
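For example, merging partial top-k lists in any grouping order gives the same result ("top" here meaning the k smallest values, since the merged list is sorted ascending):

  val top3 = new TopKMonoid(3)
  top3.plus(List(5, 1), top3.plus(List(4, 2), List(3))) // List(1, 2, 3)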
Adders in the real world: exact count-distinct
Problem:
- accurate counts are often important
- naive grouping & counting is inefficient
Solution: use a Monoid over compact bit-sets (RoaringBitmap)
- set a bit on for each value present
- small memory footprint: runs of zeros are compressed
Adding bit sets:
  BitSetMonoid.plus(BitSet(1, 5), BitSet(3, 5)) === BitSet(1, 3, 5)
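A minimal sketch of such a monoid on top of the RoaringBitmap library (the BitSetMonoid name is from the slide; the exact integration is assumed):

  import org.roaringbitmap.RoaringBitmap

  // bitset monoid: plus is a bitwise OR; cardinality is the exact distinct count
  object BitSetMonoid {
    def zero: RoaringBitmap = new RoaringBitmap()
    def plus(l: RoaringBitmap, r: RoaringBitmap): RoaringBitmap = RoaringBitmap.or(l, r)
  }

  val merged = BitSetMonoid.plus(RoaringBitmap.bitmapOf(1, 5), RoaringBitmap.bitmapOf(3, 5))
  merged.getCardinality // 3, i.e. the distinct values {1, 3, 5}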
Twitter's Algebird
- a rich library of monoid-based aggregators: min, sum, std-dev, quantiles, approximate count-distinct (HLL)
- an abstraction layer which hides the complexity
- composable: multiple aggregations in one pass
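A minimal sketch of that composability (the Session type and the guid encoding are made up for illustration): two Algebird aggregators fused so that a single pass computes both a sum and an approximate distinct count:

  import com.twitter.algebird._
  import java.nio.ByteBuffer

  case class Session(guid: Int, length: Long)

  // sum of session lengths
  val lengthSum: MonoidAggregator[Session, Long, Long] =
    Aggregator.prepareMonoid { s: Session => s.length }

  // approximate count-distinct via HyperLogLog (precision: 14 bits)
  val uniqueVisitors =
    HyperLogLogAggregator.sizeAggregator(14).composePrepare { s: Session =>
      ByteBuffer.allocate(4).putInt(s.guid).array()
    }

  // both metrics in one pass over the data
  val both = MultiAggregator(lengthSum, uniqueVisitors)
  val (totalLength, approxUniques) =
    both(Seq(Session(1, 100), Session(1001, 500), Session(1, 50)))
  // totalLength == 650, approxUniques ~ 2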
Cube definitions (again)

  class SessionsCube(hiveCtx: HiveContext) extends Cube {
    val maxGroupingSize: Int = 3
    val dimensionNames = Set("portal", "gender", "platform")
    override val multivaluedDimensionNames = Set("product_features")

    val metrics = MapAggregator(
      metric(name = "Session length sum",
             aggregator = sumAggregator(_.getAs[Int]("session_length"))),
      metric(name = "Unique visitors",
             aggregator = countDistinctAggregator(_.getAs[Int]("guid")))
    )

    def facts = SessionEnrichedFact.read(hiveCtx)
  }

Example facts:

  portal | gender | platform | product_features | guid | session_length
  LT     | M      | iphone   | ["A", "B", "C"]  | 1    | 100
  DE     | F      | android  | ["A"]            | 100  | 500
Naive cubing algorithm (step 1/3)

  // #1. pre-aggregate the input: (dimensions -> additiveMetrics)
  input.reduceByKey(monoid.plus)
Naive cubing algorithm (step 2/3)

  // #1. pre-aggregate the input: (dimensions -> additiveMetrics)
  input.reduceByKey(monoid.plus)
    // #2. explode rows per each combination of dimensions
    .flatMapKeys(cubify)
    .reduceByKey(monoid.plus)

P.S. reduceByKey does local aggregation first.

  def cubify(dimensions) = {
    for {
      groupingSet <- groupingSets
      // if the grouping set contains multivalued dimensions, explode their values
      explodedMultivalued <- explodeMultivalued(dimensions, groupingSet)
      dimensionValues = dimensions.filterKeys(groupingSet)
    } yield dimensionValues ++ explodedMultivalued
  }
Naive cubing algorithm (step 3/3)

  // #1. pre-aggregate the input: (dimensions -> additiveMetrics)
  input.reduceByKey(monoid.plus)
    // #2. explode rows per each combination of dimensions
    .flatMapKeys(cubify)
    .reduceByKey(monoid.plus)
    // #3. transform the metric values for end-use
    .mapValues(aggr.present)
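To make the algorithm concrete, here is a self-contained toy version of steps #1-#2 on plain Scala collections (no Spark, no multivalued dimensions; all names are illustrative):

  val facts = Seq(
    (Map("portal" -> "LT", "platform" -> "iphone"), 1L),  // (dimensions, metric)
    (Map("portal" -> "DE", "platform" -> "android"), 1L)
  )

  // grouping sets = every subset of the dimension names
  val groupingSets = Set("portal", "platform").subsets.toList

  // explode one row into a row per grouping set, keeping only those dimensions
  def cubify(dims: Map[String, String]): List[Map[String, String]] =
    groupingSets.map(gs => dims.filter { case (name, _) => gs(name) })

  val cube = facts
    .flatMap { case (dims, m) => cubify(dims).map(_ -> m) }
    .groupBy(_._1)                                  // reduceByKey, locally
    .map { case (k, vs) => k -> vs.map(_._2).sum }  // monoid.plus = Long addition
  // cube(Map()) == 2, cube(Map("portal" -> "LT")) == 1, ...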
Writing metrics to HBase
- a distributed key-value store with fast scanning by sequential key
- HBase key layout: dimensionsHash:version:metricName:period:dimensionValues
  e.g. da7c31ac:v1:gmv:y2013:lt:android
- metrics are stored as string values, e.g. "12.0Eur"
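A hypothetical helper showing how such a row key could be assembled:

  def rowKey(dimensionsHash: String, version: String, metricName: String,
             period: String, dimensionValues: Seq[String]): String =
    (Seq(dimensionsHash, version, metricName, period) ++ dimensionValues).mkString(":")

  rowKey("da7c31ac", "v1", "gmv", "y2013", Seq("lt", "android"))
  // "da7c31ac:v1:gmv:y2013:lt:android"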
Serving the metrics to end-users
The analytics UI queries a REST service; the REST service:
- scans HBase: start_key = da7c31ac:v1:gmv:y2013, end_key = da7c31ac:v1:gmv:y2016
- decodes the HBase records
- applies simple transformations
- returns JSON
The UI then shows a pivot table and pretty graphs.
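A minimal sketch of that range scan using the HBase 2.x client API (table name and connection setup assumed):

  import org.apache.hadoop.hbase.client.Scan
  import org.apache.hadoop.hbase.util.Bytes

  // scan all 'gmv' rows between the two periods; keys are ordered,
  // so this is one sequential read
  val scan = new Scan()
    .withStartRow(Bytes.toBytes("da7c31ac:v1:gmv:y2013"))
    .withStopRow(Bytes.toBytes("da7c31ac:v1:gmv:y2016"))

  // val results = connection.getTable(TableName.valueOf("metrics")).getScanner(scan)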
Obtaining reasonable performance
- pre-aggregate the input: trivial with monoids
- limit the grouping sets: there are 2^n combinations in total, so only look at grouping sets of 3-4 out of the 17 dimensions (C(n, k) = n!/(k!(n-k)!) grouping sets of size k)
- split dimensions into groups: 2^10 + 2^10 + 2^10 << 2^30
- do increments by time: old metrics are mostly immutable; derived metrics (e.g. accumulation) live in the serving layer
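A quick back-of-the-envelope check of why the size limit matters, assuming n = 17 dimensions and grouping sets of size <= 3:

  // C(n, k), computed incrementally (each intermediate step divides exactly)
  def choose(n: Int, k: Int): Long =
    if (k == 0) 1L else choose(n, k - 1) * (n - k + 1) / k

  val limited = (0 to 3).map(choose(17, _)).sum // 1 + 17 + 136 + 680 = 834
  val all     = 1L << 17                        // 131072 grouping sets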
Conclusions
- a naive cubing algorithm is fine for querying a moderate number of dimensions within seconds; it scales with input size (only the dimension count is limited so far)
- pre-aggregation and limiting the grouping sets are key to performance; if that is not enough, a hybrid approach is needed
- monoids are an efficient abstraction for complex aggregations: they make pre-aggregation easy and offer even better scalability via hybrid offline/online aggregation (store the serialised monoid value)
Extra: more on related tools
Kylin
- saves monoids into HBase
- pre-computes predefined grouping sets; other views are computed on the fly from the existing partial aggregates
Druid
- pre-aggregates by time
- creates an inverted index and computes on the fly
- allows filtering by arbitrary dimensions
- tolerates a large number of dimensions
Resources
http://druid.io/
http://kylin.incubator.apache.org/
https://github.com/linkedin/cubert
https://github.com/twitter/algebird
http://www.infoq.com/presentations/abstract-algebra-analytics
About me
Data-warehouse developer at Vinted: Clojure, Scala, Spark, Hadoop
vidma
Advertisement: help the disabled to speak
Do you like open source? Do you code Android/Java? Join in and help the disabled to speak :)
- an app for children with speech disabilities that forms sentences from a list of clicked pictograms
- natural-language generation & TTS inside
- started as a semester project @ EPFL
https://github.com/vidma/aac-speech-android
Still very much in beta, but people already like it:
Thank you! https://goo.gl/dbgr0h