Pulsar Realtime Analytics At Scale. Tony Ng April 14, 2015

Similar documents
Apache Ignite TM (Incubating) - In- Memory Data Fabric Fast Data Meets Open Source

Architectural patterns for building real time applications with Apache HBase. Andrew Purtell Committer and PMC, Apache HBase

From Spark to Ignition:

STREAM PROCESSING AT LINKEDIN: APACHE KAFKA & APACHE SAMZA. Processing billions of events every day

Developing Scalable Smart Grid Infrastructure to Enable Secure Transmission System Control

Real-time Big Data Analytics with Storm

A very short talk about Apache Kylin Business Intelligence meets Big Data. Fabian Wilckens EMEA Solutions Architect

Putting Apache Kafka to Use!

BIG DATA ANALYTICS For REAL TIME SYSTEM

Unified Batch & Stream Processing Platform

Real-time Analytics at Facebook: Data Freeway and Puma. Zheng Shao 12/2/2011

Building Scalable Big Data Infrastructure Using Open Source Software. Sam William

Hadoop & Spark Using Amazon EMR

The Big Data Ecosystem at LinkedIn. Presented by Zhongfang Zhuang

GigaSpaces Real-Time Analytics for Big Data

Simplifying Big Data Analytics: Unifying Batch and Stream Processing. John Fanelli,! VP Product! In-Memory Compute Summit! June 30, 2015!!

SAP and Hortonworks Reference Architecture

Lambda Architecture. Near Real-Time Big Data Analytics Using Hadoop. January Website:

Real Time Data Processing using Spark Streaming

Introducing Storm 1 Core Storm concepts Topology design

Apache Kylin Introduction Dec 8,

3 Reasons Enterprises Struggle with Storm & Spark Streaming and Adopt DataTorrent RTS

JVM Performance Study Comparing Oracle HotSpot and Azul Zing Using Apache Cassandra

70-467: Designing Business Intelligence Solutions with Microsoft SQL Server

Lambda Architecture for Batch and Real- Time Processing on AWS with Spark Streaming and Spark SQL. May 2015

Kafka & Redis for Big Data Solutions

Real Time Fraud Detection With Sequence Mining on Big Data Platform. Pranab Ghosh Big Data Consultant IEEE CNSV meeting, May Santa Clara, CA

Hybrid Solutions Combining In-Memory & SSD

Creating Big Data Applications with Spring XD

BIG DATA TRENDS AND TECHNOLOGIES

Big Data JAMES WARREN. Principles and best practices of NATHAN MARZ MANNING. scalable real-time data systems. Shelter Island

Predictive Analytics with Storm, Hadoop, R on AWS

Hadoop Evolution In Organizations. Mark Vervuurt Cluster Data Science & Analytics

Productionizing a 24/7 Spark Streaming Service on YARN

The Big Data Ecosystem at LinkedIn Roshan Sumbaly, Jay Kreps, and Sam Shah LinkedIn

Big Data Analytics - Accelerated. stream-horizon.com

Conjugating data mood and tenses: Simple past, infinite present, fast continuous, simpler imperative, conditional future perfect

Apache Ignite TM (Incubating) - In- Memory Data Fabric Fast Data Meets Open Source

THE DEVELOPER GUIDE TO BUILDING STREAMING DATA APPLICATIONS

Big Data Analytics Nokia

Comprehensive Analytics on the Hortonworks Data Platform

Cloudera Enterprise Reference Architecture for Google Cloud Platform Deployments

STREAM ANALYTIX. Industry s only Multi-Engine Streaming Analytics Platform

Azure Data Lake Analytics

HDP Hadoop From concept to deployment.

Beyond Lambda - how to get from logical to physical. Artur Borycki, Director International Technology & Innovations

CitusDB Architecture for Real-Time Big Data

Big Data Pipeline and Analytics Platform

IBM Boston Technical Exploration Center 404 Wyman Street, Boston MA IBM Corporation

Tracking a Soccer Game with Big Data

YARN Apache Hadoop Next Generation Compute Platform

Big Data Analytics - Accelerated. stream-horizon.com

Analytics on Spark &

Big Data Approaches. Making Sense of Big Data. Ian Crosland. Jan 2016

Embedded Analytics & Big Data Visualization in Any App

Embedded inside the database. No need for Hadoop or customcode. True real-time analytics done per transaction and in aggregate. On-the-fly linking IP

SAP HANA SAP s In-Memory Database. Dr. Martin Kittel, SAP HANA Development January 16, 2013

[Hadoop, Storm and Couchbase: Faster Big Data]

Stinger Initiative: Introduction

How To Handle Big Data With A Data Scientist

Hadoop Ecosystem Overview. CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook

Customer Case Study. Sharethrough

SQLstream Blaze and Apache Storm A BENCHMARK COMPARISON

Federated SQL on Hadoop and Beyond: Leveraging Apache Geode to Build a Poor Man's SAP HANA. by Christian

Big Data and Your Data Warehouse Philip Russom

Apache Kafka Your Event Stream Processing Solution

Rakam: Distributed Analytics API

Harnessing the Power of the Microsoft Cloud for Deep Data Analytics

Integrating a Big Data Platform into Government:

How To Make Sense Of Data With Altilia

The IBM Cognos Platform for Enterprise Business Intelligence

Near Real Time Indexing Kafka Message to Apache Blur using Spark Streaming. by Dibyendu Bhattacharya

Intelligent Business Operations and Big Data Software AG. All rights reserved.

Big Data Visualization and Dashboards

Using Kafka to Optimize Data Movement and System Integration. Alex

The Cloud to the rescue!

Accelerating Hadoop MapReduce Using an In-Memory Data Grid

SQLstream 4 Product Brief. CHANGING THE ECONOMICS OF BIG DATA SQLstream 4.0 product brief

CAPTURING & PROCESSING REAL-TIME DATA ON AWS

Big Data Analytics! Architectures, Algorithms and Applications! Part #3: Analytics Platform

Big Data and Industrial Internet

The Flink Big Data Analytics Platform. Marton Balassi, Gyula Fora" {mbalassi,

Solving big data problems in real-time with CEP and Dashboards - patterns and tips

Workshop on Hadoop with Big Data

Chukwa, Hadoop subproject, 37, 131 Cloud enabled big data, 4 Codd s 12 rules, 1 Column-oriented databases, 18, 52 Compression pattern, 83 84

Ad Hoc Analysis of Big Data Visualization

Real-Time Analytical Processing (RTAP) Using the Spark Stack. Jason Dai Intel Software and Services Group

Getting Real Real Time Data Integration Patterns and Architectures

Business Intelligence in Microservice Architecture. Debarshi bol.com

Building a Scalable News Feed Web Service in Clojure

Module 14: Scalability and High Availability

Big Data Research in the AMPLab: BDAS and Beyond

Reference Architecture

xpaaerns on Spark, Shark, Tachyon and Mesos

How To Use Big Data For Telco (For A Telco)

Exploring the Synergistic Relationships Between BPC, BW and HANA

G-Cloud Big Data Suite Powered by Pivotal. December G-Cloud. service definitions

Ikasan ESB Reference Architecture Review

Dashboard Engine for Hadoop

Transcription:

Pulsar Realtime Analytics At Scale Tony Ng April 14, 2015

Big Data Trends Bigger data volumes More data sources DBs, logs, behavioral & business event streams, sensors Faster analysis Next day to hours to minutes to seconds Newer processing models MR, in-memory, stream processing, Lambda 2

What is Pulsar Open-source real-time analytics platform and stream processing framework 3

Business Needs for Real-time Analytics Near real-time insights React to user activities or events within seconds Examples: Real-time reporting and dashboards Business activity monitoring Personalization Marketing and advertising Fraud and bot detection Analyze & Generate Insights Optimize App Experience Users Interact with Apps Collect Events 4

Systemic Quality Requirements Scalability Scale to millions of events / sec Latency <1 sec delivery of events Availability No downtime during upgrades Disaster recovery support across data centers Flexibility User driven complex processing rules Declarative definition of pipeline topology and event routing Data Accuracy Should deal with missing data 99.9% delivery guarantee 5

Pulsar Real-time Analytics In-memory compute cloud Marketing Personalization Behavioral Events Business Events Enrich Aggregate Filter Mutate Dashboards Machine Learning Security Queries Risk Complex Event Processing (CEP): SQL on stream data Custom sub-stream creation: Filtering and Mutation In Memory Aggregation: Multi Dimensional counting 6

Pulsar Framework Building Block (CEP Cell) Inbound Channel-1 Inbound Channel-2 Processor-1 Processor-2 Outbound Channel Spring Container JVM Event = Tuples (K,V) Mutable Channels: Message, File, REST, Kafka, Custom Event Processor: Esper, RateLimiter, RoundRobinLB, PartitionedLB, Custom 7

Pulsar Framework Flexibility Stream Processing Pipeline Consist of loosely coupled stages (cluster of CEP cells) CEP cells (channels and processors) configured as Spring beans Declarative wiring of CEP cells to define pipeline Each stage can adopt its own release and deployment cycles Support topology changes without pipeline restart Stream Processing Logic Two approaches: Java or SQL-like syntax through Esper integration SQL statements can be hot deployed without restarting applications 8

Pulsar Real-time Analytics Pipeline 9

Complex Event Processing in Real-time Analytics Pipeline Enrichment Filtering and mutation Analysis over windows of time (rolling vs. tumbling) Aggregation Grouping and ordering Stateful processing Integration with other systems 10

Event Filtering and Routing Example 11

Aggregate Computation Example 12

TopN Computation Example TopN computation can be expensive with high cardinality dimensions Consider approximate algorithms Implemented as aggregate functions e.g. select ApproxTopN(10, D1, D2, D3) 13

Pulsar Deployment Architecture 14

Availability And Scalability Self Healing Datacenter failovers State management Shutdown Orchestration Dynamic Partitioning Elastic Clusters Dynamic Flow Routing Dynamic Topology Changes 15

Pulsar Integration with Kafka Kafka Persistent messaging queue High availability, scalability and throughput Pulsar leveraging Kafka Supports pull and hybrid messaging model Loading of data from real-time pipeline into Hadoop and other metric stores 16

Messaging Models Producer Netty Push Model Consumer Producer Consumer (At most once delivery semantics) Kafka Queue Pull Model Producer Pause/Resume Consumer (At least once delivery semantics) Kafka Replayer Queue Hybrid Model

Pulsar Integration with Kylin Apache Kylin Distributed analytics engine Provide SQL interface and multi-dimensional analysis (OLAP) on Hadoop Support extremely large datasets Pulsar leveraging Kylin Build multi-dimensional OLAP cube over long time period Aggregate/drill-down on dimensions such as browser, OS, device, geo location Capture metrics such as session length, page views, event counts 18

Pulsar Integration with Druid Druid Real-time ROLAP engine for aggregation, drill-down and slice-n-dice Pulsar leveraging Druid Real-time analytics dashboard Near real-time metrics like number of visitors in the last 5 minutes, refreshing every 10 seconds Aggregate/drill-down on dimensions such as browser, OS, device, geo location 19

Key Takeaways Creating pipelines declaratively SQL driven processing logic with hot deployment of SQL Framework for custom SQL extensions Dynamic partitioning and flow control < 100 millisecond pipeline latency 99.99% Availability < 0.01% steady state data loss Cloud deployable 20

Future Development and Open Source Real-time reporting API and dashboard Integration with Druid and other metrics stores Session store scaling to 1 million insert/update per sec Rolling window aggregation over long time windows (hours or days) Dynamic Joins with graphs and RDBMS tables Hot deployment of Java source code 21

More Information GitHub: http://github.com/pulsario repos: pipeline, framework, docker files Website: http://gopulsar.io Technical whitepaper Getting started Documentation Google group: http://groups.google.com/d/forum/pulsar 22

Appendix

Twitter Storm/Spark Streaming vs Pulsar Key Differences Requirement Pulsar Storm/Trident Spark Streaming Declarative Pipeline Wiring Yes No No Pipeline stitching Run time Build time Build time Topology change requires reboot No Yes Yes SQL support Yes No Yes* Hot deployment of processing rules Yes No No Guaranteed Message Processing Yes (batching) Yes Yes Pipeline Flow Control Yes?? Stateful Processing Yes Yes Yes 24