Case Study: Real-time Analytics With Druid. Salil Kalia, Tech Lead, TO THE NEW Digital





Agenda
  Understanding the use-case
    Ad workflow
    Our use case
  Experiments with technologies
    Redis
    Cassandra
  Introduction to Druid
    Architecture
    Druid in production
  Demo

Understanding the use-case

What Is Analytics?
Processing the HISTORICAL data to:
  Understand potential trends
  Analyze the effects of certain decisions or events
  Evaluate the performance of a system
  Make better business decisions

What Is Real-time Analytics?

Understanding The Ad Workflow
[Diagram: USER, PUBLISHER SERVER, AD EXCHANGE, AD AGENCY-1, AD AGENCY-2, AD AGENCY-3. Flows: Web Page Request, Ad Request, Ad-Content]

Examples From Our Use Case
How many times has a video been viewed
  in a particular time-span?
  in a particular time-span at a particular site?
  in a particular time-span at a particular site in a particular country?
  in a particular time-span at a particular site in a particular country on a particular device?
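The questions above all share one shape: a count over a time-span, narrowed by zero or more exact-match dimensions. A minimal Python sketch of that shape follows. The `ViewEvent` record, its field names, and the sample values are assumptions made for illustration; the slides do not show the real schema.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class ViewEvent:
    # Hypothetical record layout; the real event schema is not shown in the slides.
    timestamp: datetime
    video_id: str
    site: str
    country: str
    device: str


def count_views(events, video_id, start, end, **dimensions):
    """Count views of one video in [start, end), optionally narrowed by
    exact-match filters such as site=..., country=..., device=..."""
    total = 0
    for event in events:
        if event.video_id != video_id or not (start <= event.timestamp < end):
            continue
        if all(getattr(event, name) == value for name, value in dimensions.items()):
            total += 1
    return total


def _ts(minute):
    # Helper for building made-up timestamps on 2011-01-01 between 01:00 and 02:00 UTC.
    return datetime(2011, 1, 1, 1, minute, tzinfo=timezone.utc)


# Made-up sample data, only to exercise the function.
EVENTS = [
    ViewEvent(_ts(1), "123", "abc.com", "IN", "mobile"),
    ViewEvent(_ts(2), "123", "abc.com", "IN", "desktop"),
    ViewEvent(_ts(3), "123", "abc.com", "US", "mobile"),
    ViewEvent(_ts(4), "123", "abcd.com", "IN", "mobile"),
    ViewEvent(_ts(5), "234", "abc.com", "IN", "mobile"),
]
START, END = _ts(0), _ts(10)

print(count_views(EVENTS, "123", START, END))
print(count_views(EVENTS, "123", START, END, site="abc.com"))
print(count_views(EVENTS, "123", START, END, site="abc.com", country="IN"))
print(count_views(EVENTS, "123", START, END, site="abc.com", country="IN", device="mobile"))
```

Each extra keyword argument adds one more filter, mirroring how each question on the slide adds one more dimension. A linear scan like this only works for small data; at the volumes described later, the filtering and counting has to be done by the data store.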

Let's play a video ad

Video Events For The Analysis
  LOAD
  START
  PLAYING
  VIEW
  STOP / PAUSE
  FINISH

Event Data (Sample)
  TIMESTAMP             Ad   Site      Advertiser  Event   Action
  2011-01-01T01:01:27Z  123  abc.com   Brand X     Player  Load
  2011-01-01T01:01:33Z  234  abcd.com  Brand Y     Player  Load
  2011-01-01T01:01:40Z  123  abc.com   Brand X     Player  Start
  2011-01-01T01:01:45Z  123  abc.com   Brand X     Player  Playing
  2011-01-01T01:01:50Z  123  abc.com   Brand Y     Player  Playing
  2011-01-01T01:01:51Z  123  abc.com   Brand X     Player  Stop
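As a small illustration of how such rows feed an aggregate, the sketch below parses the six sample events and counts actions per ad inside a time window. The rows are copied from the sample; the function names and the choice of a one-minute window are assumptions for the example, not part of the original pipeline.

```python
from collections import Counter
from datetime import datetime, timezone

# Rows copied from the sample: timestamp, ad, site, advertiser, event, action.
ROWS = [
    ("2011-01-01T01:01:27Z", "123", "abc.com", "Brand X", "Player", "Load"),
    ("2011-01-01T01:01:33Z", "234", "abcd.com", "Brand Y", "Player", "Load"),
    ("2011-01-01T01:01:40Z", "123", "abc.com", "Brand X", "Player", "Start"),
    ("2011-01-01T01:01:45Z", "123", "abc.com", "Brand X", "Player", "Playing"),
    ("2011-01-01T01:01:50Z", "123", "abc.com", "Brand Y", "Player", "Playing"),
    ("2011-01-01T01:01:51Z", "123", "abc.com", "Brand X", "Player", "Stop"),
]


def parse_timestamp(text):
    """Parse a UTC timestamp in the sample's format, e.g. 2011-01-01T01:01:27Z."""
    return datetime.strptime(text, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def actions_per_ad(rows, start, end):
    """Count events per (ad, action) for timestamps in [start, end)."""
    counts = Counter()
    for timestamp, ad, _site, _advertiser, _event, action in rows:
        if start <= parse_timestamp(timestamp) < end:
            counts[(ad, action)] += 1
    return counts


WINDOW_START = parse_timestamp("2011-01-01T01:01:00Z")
WINDOW_END = parse_timestamp("2011-01-01T01:02:00Z")
COUNTS = actions_per_ad(ROWS, WINDOW_START, WINDOW_END)

for (ad, action), n in sorted(COUNTS.items()):
    print(ad, action, n)
```

The timestamp is the primary axis and the remaining columns act as dimensions to group or filter by, which is the access pattern the rest of the talk evaluates each technology against.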

Why Real-time Analytics?
  Understand the real-time performance
  Control the velocity
  Control the targeting
  Avoid over serving
  Avoid under serving

Recap: Things We Understood
  Our use-case
  How the ad-tech works (in general)
  Different video player events
  We are expecting a huge amount of data coming at a very high velocity.

Experiments with technologies

Experience From Redis
  There was a huge variety of keys all over the place
  Not a good fit for time-series (big) data
  Persistence is another issue: we can't afford to lose data
  Not the right match for our use-case

Conclusion From Redis
  Never blame Redis
  It was too early a decision
  We had misunderstood the real use-case
  Thanks to Redis for helping us understand our requirements very soon

Working With Cassandra
  Very good support for time-series data
  Extremely good for writing data at a very high speed
  Very easy to scale horizontally
  Supports aggregations through Counters

Writing into Cassandra
[Diagram: AD PLAYER, ANALYTICS SERVER, CASSANDRA]

Reading from Cassandra
[Diagram: ANALYTICS SERVER, CAMPAIGN MANAGER, CASSANDRA]

What didn't work with Cassandra
  Inconsistent results
  Unreliable counters
  No support for ad-hoc queries
  Nodes were crashing very frequently

Crossroads: What next?
  Third-party tools on top of Cassandra for better consistency
  DataStax Enterprise edition
  Taking a deeper dive into Cassandra to reconfigure the whole architecture and setup
  Switching to a different technology

Understanding Druid

About Druid
  An open-source analytics data store
  Supports streaming data ingestion
  Flexible filters for ad-hoc queries
  Fast aggregations, sub-second queries
  Distributed, shared-nothing architecture
  Highly scalable
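To make "flexible filters" and "fast aggregations" concrete, here is a hedged sketch that builds a Druid native timeseries query as a Python dictionary. The keys used (`queryType`, `dataSource`, `granularity`, `intervals`, `filter`, `aggregations`) and the `selector`, `and`, and `count` types come from Druid's native JSON query language; the data source name `video_events`, the dimension names, and the helper function are assumptions for illustration, since the slides do not show the production schema or queries.

```python
import json


def build_timeseries_query(data_source, interval, filters, granularity="hour"):
    """Build a Druid native timeseries query that counts rows per time bucket.

    `filters` maps dimension name -> required value; each entry becomes a
    selector filter, and multiple entries are combined with an "and" filter.
    """
    query = {
        "queryType": "timeseries",
        "dataSource": data_source,
        "granularity": granularity,
        "intervals": [interval],
        # "count" counts stored Druid rows. If events are rolled up at ingestion,
        # a longSum over an ingested count metric would be needed instead.
        "aggregations": [{"type": "count", "name": "events"}],
    }
    selectors = [
        {"type": "selector", "dimension": dimension, "value": value}
        for dimension, value in filters.items()
    ]
    if len(selectors) == 1:
        query["filter"] = selectors[0]
    elif selectors:
        query["filter"] = {"type": "and", "fields": selectors}
    return query


# Hypothetical data source and dimension names.
QUERY = build_timeseries_query(
    data_source="video_events",
    interval="2011-01-01T01:00:00Z/2011-01-01T02:00:00Z",
    filters={"site": "abc.com", "action": "View"},
)
print(json.dumps(QUERY, indent=2))
```

The resulting JSON would typically be sent in an HTTP POST to a broker node. The sketch stops at building the document so it can be run without a cluster.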

Setting Up Druid In Production
[Diagram: AD PLAYER, ANALYTICS SERVER, KAFKA (CLUSTER), DRUID CLUSTER, CASSANDRA]

Druid's Reliability Check
[Diagram: AD PLAYER, ANALYTICS SERVER, KAFKA (CLUSTER), DRUID CLUSTER, RAW FILE CONSUMER, RAW FILES, Job To Test Druid's Integrity]
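The diagram suggests an integrity job that compares what landed in raw files with what Druid reports. The slides do not describe how that job works, so the following is only a sketch of one plausible reconciliation step: count raw events per hour and ad, then list every bucket where the two sources disagree. The raw line format (`timestamp,ad,action`), the function names, and the sample numbers are all assumptions.

```python
from collections import Counter


def count_raw_events(lines):
    """Count raw events per (hour, ad).

    Assumes a hypothetical line format 'timestamp,ad,action' with ISO-8601
    timestamps, so the first 13 characters identify the hour bucket.
    """
    counts = Counter()
    for line in lines:
        timestamp, ad, _action = line.strip().split(",")
        counts[(timestamp[:13], ad)] += 1
    return counts


def find_mismatches(raw_counts, druid_counts):
    """Return {bucket: (raw, druid)} for every bucket where the counts differ."""
    mismatches = {}
    for bucket in sorted(set(raw_counts) | set(druid_counts)):
        raw = raw_counts.get(bucket, 0)
        druid = druid_counts.get(bucket, 0)
        if raw != druid:
            mismatches[bucket] = (raw, druid)
    return mismatches


# Made-up raw lines and made-up query results, only to exercise the check.
RAW_LINES = [
    "2011-01-01T01:01:27Z,123,Load",
    "2011-01-01T01:01:33Z,234,Load",
    "2011-01-01T01:01:40Z,123,Start",
    "2011-01-01T02:05:00Z,123,Load",
]
DRUID_COUNTS = {
    ("2011-01-01T01", "123"): 2,
    ("2011-01-01T01", "234"): 1,
}

MISMATCHES = find_mismatches(count_raw_events(RAW_LINES), DRUID_COUNTS)
print(MISMATCHES)
```

An empty result means the two sources agree for every bucket. Comparing in both directions matters: a bucket present only in Druid would indicate duplicated or misattributed events, while one present only in the raw files would indicate loss.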

A Quick Demo

Druid Architecture
[Diagram: Druid Nodes: REAL TIME NODES, COORDINATOR NODES, BROKER NODES, HISTORICAL NODES. External Dependencies: MYSQL, ZOOKEEPER, DEEP STORAGE. Inputs: Streaming Data, Client Queries. Legend: Queries, Data/Segments, MetaData]

Druid Data Ingestion
[Diagram: Druid Nodes: REAL TIME NODES, COORDINATOR NODES, BROKER NODES, HISTORICAL NODES. External Dependencies: MYSQL, ZOOKEEPER, DEEP STORAGE. Inputs: Streaming Data, Client Queries. Legend: Queries, Data/Segments, MetaData]

Druid Data Ingestion
[Diagram: AD PLAYER, ANALYTICS SERVER, KAFKA (CLUSTER), DRUID Realtime Node]

Druid Data Retrieval
[Diagram: Druid Nodes: REAL TIME NODES, COORDINATOR NODES, BROKER NODES, HISTORICAL NODES. External Dependencies: MYSQL, ZOOKEEPER, DEEP STORAGE. Inputs: Streaming Data, Client Queries. Legend: Queries, Data/Segments, MetaData]
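On the retrieval path, a broker node forwards a client query to the real-time and historical nodes that hold the relevant data and merges their partial results into one answer. The sketch below illustrates only the merge step for a simple count per time bucket. It is not Druid's implementation, and the bucket keys and numbers are made up.

```python
from collections import Counter


def merge_partial_results(*partials):
    """Merge per-node partial counts (bucket -> count) into one sorted result.

    Counts are additive, so the merged value for a bucket is the sum of the
    values reported by every node that holds data for that bucket.
    """
    merged = Counter()
    for partial in partials:
        merged.update(partial)  # Counter.update adds counts instead of replacing them
    return dict(sorted(merged.items()))


# Made-up partial results: historical nodes cover older buckets, while the
# real-time node holds the most recent, not yet handed-off data.
HISTORICAL_A = {"2011-01-01T00": 120, "2011-01-01T01": 80}
HISTORICAL_B = {"2011-01-01T01": 45}
REALTIME = {"2011-01-01T02": 17}

RESULT = merge_partial_results(HISTORICAL_A, HISTORICAL_B, REALTIME)
print(RESULT)
```

This works because counts and sums can be combined from partial results in any order, which is what lets such queries be spread across many nodes.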

Druid Data Coordination
[Diagram: Druid Nodes: REAL TIME NODES, COORDINATOR NODES, HISTORICAL NODES. External Dependencies: MYSQL, ZOOKEEPER, DEEP STORAGE. Inputs: Streaming Data. Legend: Queries, Data/Segments, MetaData]

COMPANIES USING DRUID

Questions?