Unified Batch & Stream Processing Platform

Similar documents
Simplifying Big Data Analytics: Unifying Batch and Stream Processing. John Fanelli,! VP Product! In-Memory Compute Summit! June 30, 2015!!

3 Reasons Enterprises Struggle with Storm & Spark Streaming and Adopt DataTorrent RTS

Ganzheitliches Datenmanagement

STREAM ANALYTIX. Industry s only Multi-Engine Streaming Analytics Platform

#TalendSandbox for Big Data

End to End Solution to Accelerate Data Warehouse Optimization. Franco Flore Alliance Sales Director - APJ

Automated Data Ingestion. Bernhard Disselhoff Enterprise Sales Engineer

HDP Hadoop From concept to deployment.

Moving From Hadoop to Spark

Upcoming Announcements

Architectural patterns for building real time applications with Apache HBase. Andrew Purtell Committer and PMC, Apache HBase

Pulsar Realtime Analytics At Scale. Tony Ng April 14, 2015

Lambda Architecture. Near Real-Time Big Data Analytics Using Hadoop. January Website:

Client Overview. Engagement Situation. Key Requirements

Developing Scalable Smart Grid Infrastructure to Enable Secure Transmission System Control

A Brief Introduction to Apache Tez

Native Connectivity to Big Data Sources in MicroStrategy 10. Presented by: Raja Ganapathy

WHITE PAPER DataTorrent RTS: Real-Time Streaming Analytics for Big Data

XpoLog Competitive Comparison Sheet

Luncheon Webinar Series May 13, 2013

How To Use Big Data For Telco (For A Telco)

Introduction to Big Data Training

White Paper. How Streaming Data Analytics Enables Real-Time Decisions

Data Integration Checklist

GROW WITH BIG DATA Third Eye Consulting Services & Solutions LLC.

Talend Real-Time Big Data Sandbox. Big Data Insights Cookbook

Comprehensive Analytics on the Hortonworks Data Platform

ESS event: Big Data in Official Statistics. Antonino Virgillito, Istat

From Spark to Ignition:

Roadmap Talend : découvrez les futures fonctionnalités de Talend

Real-Time Analytical Processing (RTAP) Using the Spark Stack. Jason Dai Intel Software and Services Group

Cisco Data Preparation

Solving big data problems in real-time with CEP and Dashboards - patterns and tips

Big Data Analytics Nokia

A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM

HDP Enabling the Modern Data Architecture

Aligning Your Strategic Initiatives with a Realistic Big Data Analytics Roadmap

XpoLog Center Suite Log Management & Analysis platform

Workshop on Hadoop with Big Data

Hortonworks and ODP: Realizing the Future of Big Data, Now Manila, May 13, 2015

Big data platform for IoT Cloud Analytics. Chen Admati, Advanced Analytics, Intel

Real-time Big Data Analytics with Storm

Towards Smart and Intelligent SDN Controller

Data Lake In Action: Real-time, Closed Looped Analytics On Hadoop

Real Time Data Processing using Spark Streaming

<Insert Picture Here> Oracle SQL Developer 3.0: Overview and New Features

The Big Data Ecosystem at LinkedIn. Presented by Zhongfang Zhuang

Hadoop Ecosystem Overview. CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook

Big Data JAMES WARREN. Principles and best practices of NATHAN MARZ MANNING. scalable real-time data systems. Shelter Island

Jitterbit Technical Overview : Microsoft Dynamics CRM

Descriptive to Predictive to Prescriptive Analytics: Move Up the Value Chain. Suren Nathan CTO

The Analytics Value Chain Key to Delivering Value in IoT

Capitalize on Big Data for Competitive Advantage with Bedrock TM, an integrated Management Platform for Hadoop Data Lakes

SAP and Hortonworks Reference Architecture

Databricks. A Primer

IBM Big Data Platform

Virtualizing Apache Hadoop. June, 2012

BIG DATA TECHNOLOGY. Hadoop Ecosystem

Big Data Approaches. Making Sense of Big Data. Ian Crosland. Jan 2016

Fast Innovation requires Fast IT

Databricks. A Primer

YARN Apache Hadoop Next Generation Compute Platform

Modern IT Operations Management. Why a New Approach is Required, and How Boundary Delivers

Processing and Analyzing Streams. CDRs in Real Time

Integrating a Big Data Platform into Government:

BigData. An Overview of Several Approaches. David Mera 16/12/2013. Masaryk University Brno, Czech Republic

Getting Real Real Time Data Integration Patterns and Architectures

Making big data simple with Databricks

SQLSaturday #399 Sacramento 25 July, Big Data Analytics with Excel

Big Data Use Case: Business Analytics

Big Data. White Paper. Big Data Executive Overview WP-BD Jafar Shunnar & Dan Raver. Page 1 Last Updated

Big Data on Microsoft Platform

How To Create A Data Visualization With Apache Spark And Zeppelin

PRODUCTIVITY IN FOCUS PERFORMANCE MANAGEMENT SOFTWARE FOR MAILROOM AND SCANNING OPERATIONS

Reference Architecture, Requirements, Gaps, Roles

How telcos can benefit from streaming big data analytics

Intel HPC Distribution for Apache Hadoop* Software including Intel Enterprise Edition for Lustre* Software. SC13, November, 2013

A Modern Data Architecture with Apache Hadoop

How To Write A Trusted Analytics Platform (Tap)

Spark use case at Telefonica CBS

Cloudera Enterprise Data Hub in Telecom:

Hadoop Ecosystem B Y R A H I M A.

The 4 Pillars of Technosoft s Big Data Practice

<Insert Picture Here> Big Data

Copyright 2013 Splunk Inc. Introducing Splunk 6

Hortonworks & SAS. Analytics everywhere. Page 1. Hortonworks Inc All Rights Reserved

Expanding Uniformance. Driving Digital Intelligence through Unified Data, Analytics, and Visualization

Business Application Services Testing

Apache Flink Next-gen data analysis. Kostas

Implement Hadoop jobs to extract business value from large and varied data sets

Using Data Mining and Machine Learning in Retail

Oracle Big Data Fundamentals Ed 1 NEW

The Future of Data Management

VIEWPOINT. High Performance Analytics. Industry Context and Trends

GoodData. Platform Overview

Apache Ignite TM (Incubating) - In- Memory Data Fabric Fast Data Meets Open Source

Conjugating data mood and tenses: Simple past, infinite present, fast continuous, simpler imperative, conditional future perfect

BIG DATA ANALYTICS For REAL TIME SYSTEM

Transcription:

Unified Batch & Stream Processing Platform Himanshu Bari Director Product Management

Most Big Data Use Cases Are About Improving/Re-write EXISTING solutions To KNOWN problems

Current Solutions Were Built On A. Imperfect information B. Expensive s/w & h/w infrastructure C. Relational data stores

Inevitable Course of the Re-write Specialized solutions Near perfect information Next gen data management platform In Memory Processing NoSQL Graph Hadoop Search EDW + RDBMS Commodity hardware Open source software

Data Processing Categories in Big Data Use Cases Data Velocity Streaming Stream Processing N/A Static Known Batch Processing Ad-hoc Query Unknown Questions known?

Every Batch Process *Could* Have Been A Stream Process Every Static data point was Streaming at some point We choose to wait and collect a bunch of data points and then process them at once in Batch Mode Move processing time closer to the data generation or Event occurrence time Reduces time to insight and allows you to be proactive with timely actions

But We Still Need Batch Historical data analysis What-if analysis Experimentation Data re-statement Transaction processing and re-conciliation Audit Machine learning model training..

Need A Unified Platform Stale Data Change Data Fresh Streaming Data Unified Batch & Stream Processing Engine to execute data pipelines Streaming only pipelines OR Batch processing only pipelines OR Dual Batch & Stream processing pipelines YARN HDFS Enterprise Repositories RDBMS Other EDW Search NoSQL Graph Any Notification Engine Any BPM Engine

Management & Monitoring DataTorrent RTS Provides A Unified Batch & Streaming Platform Transform Normalize Descriptive & Predictive Analytics Alert Action Unified Batch & Streaming Data Ingestion & Distribution Pipeline for Hadoop Graphical Application Assembly Real-Time Data Visualization Re-Usable Java Operator Library Apache 2.0 licensed Project Malhar Scalable, High Performance, Fault Tolerant In-Memory Data Processing Platform Apache 2.0 licensed - Project Apex Hadoop 2.0 YARN + HDFS Physical Virtual Cloud

DataTorrent Project Apex- Unified Batch & Stream Processing Engine 1 In-memory compute engine Stateful fault tolerance with NO data loss 6 2 Hadoop native (YARN + HDFS) architecture Automatic & incremental recovery 7 3 Sub-second event processing Rolling & tumbling windows 8 4 Event order guarantees Processing guarantees 9 5 Flexible stream partitioning mechanism Ability to update application without downtime 10

Unified Data Ingestion & Distribution Pipeline Input & Output Variety FTP, S3 etc. Kafka & JMS Change data capture Easily customizable Easily extend and insert operations for data preparation Tackle data size & speed fluctuations Overcome HDFS small file problem Skew management Run-time updates Parameters like filtering criteria, bandwidth utilization & polling interval should be updateable at runtime Hadoop Native Runs within the Hadoop cluster over YARN & HDFS Simple to build & Operate Graphical UI & API End to end metrics

Hadoop Data Ingestion & Distribution Application

Data Prep & Analytics Layer Requirements Truly scale horizontally across the Hadoop cluster Pre-built operators ᵒ Re-ordering ᵒ Normalization ᵒ Transpose ᵒ De-duplication ᵒ Tagging ᵒ Filtering ᵒ Enrichment Operators work seamlessly in both streaming & batch mode ᵒ Data local HDFS read & process ᵒ Ability to do computations on time window as well as file boundaries Ability to re-use existing business logic Simple workflow & scheduling capabilities ᵒ Built-in or integrations with Oozie or other schedulers

Development API Requirements Consistent between Streaming & Batch pipelines No mapreduce No exposure of low level processing engine concepts Easily extendible

Malhar Operator Library Overview Malhar Operator Library Input / Output Connectors File Systems RDBMS NoSQL Messaging Notification In Memory Databases Social Media Protocol read/write Pattern Matching Compute Operators Stats & Math Machine Learning & Algorithms Parsers Stream Manipulators Query & Scripting

Visual Application Assembly Easy to Use ᵒ Web based drag-n-drop development environment ᵒ Automatic port compatibility validation ᵒ Simple schema management ᵒ Generic property configurator Easy to Operate ᵒ No external component dependency - Runs natively in Hadoop ᵒ Integrated with DataTorrent management platform Simple to extend ᵒ Simple API to enable any existing DataTorrent operator ᵒ Ability to plug any business logic using a custom operator

Streaming or Batch Data Processing Visualization Intuitive user interface ᵒ Auto-generate or custom create ᵒ One dashboard for multiple apps ᵒ Supports bar, line, pie, area charts & data tables Easy to Operate ᵒ No external component dependency - Runs natively in Hadoop ᵒ Integrated with DataTorrent management platform Simple to extend ᵒ Any DAG operator can be made a real-time datasource

Management & Monitoring Summary Transform Normalize Graphical Application Assembly Descriptive & Predictive Analytics Real-Time Data Visualization Malhar Apache 2.0 license Re-Usable Java Operator Library Scalable, High Performance, Fault Tolerant In-Memory Data Processing Platform Apache 2.0 licensed - Project Apex Hadoop 2.0 YARN + HDFS Physical Virtual Cloud Alert Action Unified Batch & Streaming Data Ingestion & Distribution Pipeline for Hadoop Project Apex https://www.datatorrent.com/product/project -apex/ DataTorrent RTS Sandbox https://www.datatorrent.com/download/ World is moving from Batch to Streaming BUT both are required Need a Hadoop native in memory compute engine that is scalable & fault tolerant in BOTH batch & streaming modes With out -of-the box data prep & analytics operators Using a consistent & functional development API Operationalized through a common management & monitoring layer

Big Data. Big Actions. Now.

Some Verticals & Use Cases Ad-Tech Real-time customer facing dashboards on key performance indicators Click fraud detection Billing optimization Telco & Cable Call detail record (CDR) & extended data record (XDR) analysis for Service quality improvement Capacity planning/optimization Understanding customer behavior AND context Packaging and selling anonymous customer data Financial Services Fraud & risk monitoring Sentiment based trading strategies Usage based insurance Improved credit risk assessment Improving turn around time of trade settlement processes IoT Process optimization Proactive maintenance prediction Remote monitoring & diagnostics