3 Reasons Enterprises Struggle with Storm & Spark Streaming and Adopt DataTorrent RTS




Deliver fast, actionable business insights for data scientists, rapid application creation for developers and enterprise-grade operational excellence for IT.

Getting to fast, actionable insights means empowering analysts and data scientists to work easily with data from many sources (both in motion and at rest), gain insights in seconds, visualize those insights and take action automatically, without involving the entire IT department. At the same time, data center operations teams need to ensure that the solution is operational and meets business SLAs.

Given the buzz around Spark Streaming & Storm, they can seem like obvious choices for supporting streaming analytics. However, most of our customers have struggled to take either Spark Streaming or Storm beyond the proof-of-concept stage, because both address enterprise objectives too narrowly to offer a complete solution. Enterprises require an easy-to-use, visual, tools-based approach that works out of the box. The platform needs to serve data scientists, developers and data center operations teams without an extensive and expensive patchwork of custom code and third-party software that often fails.

DataTorrent RTS is the industry's first fully Hadoop-native streaming analytics solution. It provides an enterprise-grade streaming analytics platform, tools and pre-built analytics modules, and lights-out data center operational capabilities. This paper explores the top 3 reasons enterprises pass on Spark Streaming & Storm and deploy DataTorrent RTS.

1. Enterprise-grade streaming analytics platform

Your streaming analytics platform needs to meet the needs of your business. It's not sufficient to take open source code that might work for some large web-scale organizations with scores of platform-level developers and try to deploy it in an enterprise data center. Most enterprises don't have or want developers who code at the platform level. Imagine having your developers struggle with tuple-level acking, configuration and distributed state management! Enterprises have strict business requirements for SLAs (no data loss, performance/latency and availability), and they want their developers to focus on solving their core business problem.

With this goal in mind, DataTorrent RTS was built from day one as a Hadoop 2.x native application. DataTorrent RTS natively supports Hadoop YARN and HDFS on every commercial Hadoop platform. It also runs seamlessly in public or private cloud environments. IT organizations get the benefits of high performance, in-order processing, auto-scaling, dynamic updates, automatic fault tolerance for application state, engine state and raw data, and distributed in-memory analytics, without having to hand-code any of these capabilities.

An enormous amount of data is generated each day, in different varieties, at different sizes and at different rates. This fast big data is critical to an organization's ability to gain competitive advantage and achieve operational efficiencies. It's important that your streaming analytics solution not only handles the different data types but also provides appropriate processing guarantees. DataTorrent RTS is the only streaming analytics solution that can provide exactly-once, at-most-once and at-least-once event processing guarantees while still achieving the low latency of per-tuple processing and without resorting to micro-batching.

The decisions made on the basis of insights gained from fast big data are typically in an operational data path, so enterprise-grade fault tolerance is required for those insights to be operational. DataTorrent RTS provides fault tolerance for raw input data (even when the input source is not stateful), engine state and processed data (application state), all without human intervention in the event of an outage. Also, only DataTorrent RTS supports incremental recovery, which allows a failed node to recover its state and raw data stream from the previous node rather than requiring replay from the first step. This significantly reduces recovery time and ensures latency SLAs are maintained.

Where Storm & Spark Streaming fall short

The applicability of Apache Storm and Spark Streaming is limited by their core architecture. With Spark Streaming, the inherent RDD-based processing paradigm introduces overhead and latency to stream processing performance. The per-tuple acking in Storm is notoriously problematic in production environments and creates severe operational headaches when scaling a topology or troubleshooting bottlenecks and failures. Both Storm and Spark Streaming force users to micro-batch input to provide exactly-once processing guarantees, which introduces significant processing latency.

Also, the ability to maintain event order and to provide fault tolerance at the application state level is not part of the core platform in either Spark Streaming or Storm. These are critical components of a stream processing platform and a must-have for most use cases (e.g. imagine trying to do event-sequence-based pattern detection). Implementing them requires non-trivial programming and an intricate understanding of the underlying streaming platform and its concepts, plus constant maintenance and updates with each release of the platform. Finally, all the workarounds you have to build into your business logic create significant lock-in for your application.
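To make the three processing guarantees discussed in this section concrete, the following is a minimal Java simulation. It is an illustration only and does not show how DataTorrent RTS, Storm or Spark Streaming implement recovery. All names (GuaranteeSketch, Sink, simulate) and the failure scenario are assumptions made for this sketch: four streaming windows carry partial sums, the last checkpoint covers window 0, and a crash occurs while window 2 is in flight.

```java
/** Toy simulation; every name here is hypothetical and not part of any product API. */
final class GuaranteeSketch {

    enum Mode { AT_MOST_ONCE, AT_LEAST_ONCE, EXACTLY_ONCE }

    /** An external store that survives the crash and accumulates one sum per window. */
    static final class Sink {
        long total = 0;
        long lastAppliedWindow = -1;

        void apply(long windowId, long windowSum, Mode mode) {
            // Exactly-once effect: a replayed window that was already applied is ignored.
            if (mode == Mode.EXACTLY_ONCE && windowId <= lastAppliedWindow) {
                return;
            }
            total += windowSum;
            lastAppliedWindow = windowId;
        }
    }

    /** Runs the assumed failure scenario and returns the total seen by the sink. */
    static long simulate(Mode mode) {
        long[] windowSums = {10, 20, 30, 40}; // the correct total is 100
        Sink sink = new Sink();

        // Before the failure: windows 0 and 1 reach the sink, but the last
        // checkpoint only covers window 0. The crash happens during window 2.
        int checkpointedWindow = 0;
        int crashedWindow = 2;
        sink.apply(0, windowSums[0], mode);
        sink.apply(1, windowSums[1], mode);

        // Recovery: at-most-once skips ahead, so window 2 is lost.
        // The other two modes replay everything after the checkpoint.
        int resumeFrom = (mode == Mode.AT_MOST_ONCE)
                ? crashedWindow + 1
                : checkpointedWindow + 1;
        for (int w = resumeFrom; w < windowSums.length; w++) {
            sink.apply(w, windowSums[w], mode);
        }
        return sink.total;
    }

    public static void main(String[] args) {
        for (Mode mode : Mode.values()) {
            System.out.println(mode + " -> " + simulate(mode));
        }
        // AT_MOST_ONCE -> 70   (window 2 lost)
        // AT_LEAST_ONCE -> 120 (window 1 counted twice)
        // EXACTLY_ONCE -> 100  (replayed window 1 ignored)
    }
}
```

In this sketch the exactly-once result depends on the sink remembering the last window it applied, which is one common way to make replay idempotent. Whether a platform achieves this per tuple or per batch is the architectural question raised above.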

What to ask

To ensure an enterprise-grade solution that meets your organization's SLA requirements, ask the following questions of your proposed solution:

- If Hadoop is your core big data platform, does your streaming platform seamlessly use HDFS for raw data, application state checkpoints and engine state management, reducing dependence on external datastores such as relational databases that do not scale?
- Does your streaming platform run natively on YARN for scheduling, without your having to make the streaming platform's own scheduler work well with YARN, which can cause significant multi-tenancy and operational issues?
- Can the streaming analytics solution auto-scale and process increased data loads without manual programming and re-deployment?
- Does the streaming analytics platform guarantee the processing order of your events across all processing guarantees (at-most-once, at-least-once and exactly-once) without having to micro-batch the input data?
- Is the streaming analytics solution's fault tolerance complete (raw events, application state and engine state), abstracted from the developer and done natively in Hadoop using HDFS?
- Streaming analytics applications need to be able to handle events non-stop. Does your streaming analytics solution support dynamic updates to application properties and business logic with no application downtime?

2. Data scientist and application developer friendly

The path to a production-ready streaming analytics solution entails a lot of experimentation up front. Data scientists and developers should be able to use intuitive visual tools to quickly create streaming applications and iterate over their hypotheses. These iterations should not always involve cumbersome coding by developers. Developers should be able to simply create organization-specific business logic (e.g. custom parsers) for any data source and make it available for data scientists to visually assemble into the streaming application.
The DataTorrent RTS streaming analytics solution enables rapid time to market and time to value via pre-built modular analytics capabilities that are easily combined using a visual interface. Development is simple, with a single-threaded, Java-based development model that allows for arbitrary business logic (often re-using existing code!). To get your developers productive in no time, DataTorrent RTS provides over 450 pre-built Java operators that offer a raft of analytical capabilities. More than 75 input and output operators allow for data ingestion from and distribution to sources such as Kafka, Flume, message buses (JMS, MQ, etc.), databases (SQL, NoSQL), web sockets and more. All the platform processing guarantees, idempotency and state management are automatically extended to the input and output connectors and to all other operators, so no additional platform-level development work is needed from the application developer.

The Java operator programming model is simple yet powerful, as DataTorrent RTS provides key capabilities that are left up to the developer in open source streaming analytics platforms. Developers do not have to worry about multi-threading their code: the application is automatically partitioned and distributed across the Hadoop cluster for scalability. Another key capability is native support for application time-series windows that are both aggregate (per minute, per hour) and rolling (last 5 minutes, last 3 hours). As mentioned earlier, fault tolerance is a platform capability and is abstracted from the developer.
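The difference between aggregate and rolling windows described above can be sketched in plain Java. This is a minimal illustration under stated assumptions, not the DataTorrent RTS operator API: the class WindowSketch.RollingSum and its methods process, endWindow and rollingSum are hypothetical names, and the code assumes a single-threaded caller that signals each window boundary explicitly.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical sketch of per-window and rolling aggregation; not a product API. */
final class WindowSketch {

    /** Sums values per fixed window and keeps a rolling sum over the last `span` windows. */
    static final class RollingSum {
        private final int span;                          // number of windows in the rolling view
        private final Deque<Long> recent = new ArrayDeque<>();
        private long current = 0;                        // aggregate of the window in progress
        private long rolling = 0;                        // sum of the last `span` closed windows

        RollingSum(int span) {
            if (span < 1) {
                throw new IllegalArgumentException("span must be at least 1");
            }
            this.span = span;
        }

        /** Called once per event by a single thread. */
        void process(long value) {
            current += value;
        }

        /** Closes the window in progress and returns its aggregate (e.g. the per-minute sum). */
        long endWindow() {
            long closed = current;
            recent.addLast(closed);
            rolling += closed;
            if (recent.size() > span) {
                rolling -= recent.removeFirst();         // drop the window that fell out of range
            }
            current = 0;
            return closed;
        }

        /** Rolling aggregate (e.g. the sum over the last 5 one-minute windows). */
        long rollingSum() {
            return rolling;
        }
    }

    public static void main(String[] args) {
        RollingSum sum = new RollingSum(2);
        sum.process(1);
        sum.process(2);
        System.out.println(sum.endWindow() + " / " + sum.rollingSum()); // 3 / 3
        sum.process(5);
        System.out.println(sum.endWindow() + " / " + sum.rollingSum()); // 5 / 8
        sum.process(7);
        System.out.println(sum.endWindow() + " / " + sum.rollingSum()); // 7 / 12
    }
}
```

Keeping one aggregate per closed window, rather than every raw event, is what lets the rolling view be updated in constant time at each window boundary.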

Where Storm & Spark Streaming fall short

The Java APIs in Spark Streaming and Storm require a lot of hand coding, as there is no library of pre-built code, and data input and output connectors are few. The Java interface in Spark Streaming is notoriously hard to use, as there is a significant bias towards Scala. With Storm, even though Java is supported, developers have to deal with tuple-level acking in their application code. Besides the lack of a starting point, programming for both Spark Streaming and Storm is tedious: the developer must manually account for scalability, handle input data skew, hand-code fault tolerance for the application data and attempt to force event ordering or re-ordering. Spark Streaming and Storm do not have any visual development tools, so coding must be done by a developer, and a data scientist who is not familiar with streaming cannot create simple applications to quickly iterate over an analysis.

What to ask

To ensure that data scientists and developers can rapidly assemble applications, ask the following questions of your proposed solution:

- Does the streaming analytics solution have connectors that support fault-tolerant and auto-scaling data ingestion and distribution for all of your data sources and analytics destinations out of the box?
- Are common data analytics capabilities such as joins, aggregations and statistical analysis available out of the box? How about complex capabilities such as dimensional cube creation and integration with machine learning tools?
- Does the solution aggregate data over varying windows, both static and rolling, automatically, or does the developer have to implement this manually?
- Is the solution friendly to data scientists and business analysts, with visual application creation and data visualization tools?

3. Robust management and operational deployment

Fast big data doesn't stop, and neither can the insights and actions that your business takes. As a result, streaming analytics applications are designed to run 24x7 with no downtime. Data center operations teams need to ensure that the full lifecycle of application deployment, monitoring, updating and problem resolution meets the organization's business commitments. Management requirements extend not only to on-premise deployments but also to cloud and hybrid cloud/data center deployments.

Designed from day one with enterprise data center operations as a requirement, DataTorrent RTS fully embraces the application lifecycle. The DataTorrent solution is fully multi-tenant, allowing multiple applications to run on the same Hadoop cluster, optimizing operations and maximizing data center resources. DataTorrent RTS provides an application-packaging technology that is simple to implement and use, streamlining the handoff from dev to ops. Because the platform is designed for zero downtime, data center ops teams can change business logic, modify application window sizes (for example, from 1 hour to 30 minutes) and performance-tune a running application without stopping the data processing. The DataTorrent RTS UI console provides full visibility into the application at the Hadoop container level, including resource usage and performance/latency statistics, in addition to built-in monitoring alerts. Application issue resolution is simplified with application counters, console event alerts and cluster-wide log collection and consolidation.
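One way to picture a window-size change on a running application is a property that an operations thread may set at any time but that the processing thread only picks up at a window boundary. The Java sketch below illustrates that general idea only. It is not DataTorrent RTS code, the names DynamicUpdateSketch, WindowedCounter and setWindowSize are hypothetical, and it uses count-based windows (events per window) instead of time-based ones to stay short.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical sketch of a property update applied without stopping processing. */
final class DynamicUpdateSketch {

    static final class WindowedCounter {
        private final AtomicInteger requestedSize; // may be written by an operations thread
        private int activeSize;                    // size used by the window in progress
        private int seenInWindow = 0;
        private int windowsEmitted = 0;

        WindowedCounter(int initialSize) {
            if (initialSize < 1) {
                throw new IllegalArgumentException("window size must be at least 1");
            }
            this.requestedSize = new AtomicInteger(initialSize);
            this.activeSize = initialSize;
        }

        /** Requests a new window size; it takes effect at the next window boundary. */
        void setWindowSize(int events) {
            if (events < 1) {
                throw new IllegalArgumentException("window size must be at least 1");
            }
            requestedSize.set(events);
        }

        /** Called once per event by the single processing thread. */
        void process() {
            seenInWindow++;
            if (seenInWindow >= activeSize) {
                windowsEmitted++;
                seenInWindow = 0;
                activeSize = requestedSize.get();  // pick up pending changes only here
            }
        }

        int windowsEmitted() {
            return windowsEmitted;
        }
    }

    public static void main(String[] args) {
        WindowedCounter counter = new WindowedCounter(4);
        counter.process();
        counter.process();
        counter.setWindowSize(2);                  // change requested mid-window
        for (int i = 0; i < 4; i++) {
            counter.process();
        }
        System.out.println(counter.windowsEmitted()); // 2: one window of 4, then one of 2
    }
}
```

Deferring the change to a boundary keeps the window in progress consistent, so no events are dropped or counted against two different sizes.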

Where Storm & Spark Streaming fall short

Spark Streaming and Storm provide only rudimentary capabilities across the application lifecycle. Their management and monitoring tooling does not provide full visibility into all metrics of the streaming application and the infrastructure. Neither architecture makes any provision for dynamic application updates.

What to ask

- Does your organization require easy-to-use tools for the full application deployment and management operations cycle?
- Are visual, automated alerting and command line tools required for your data center operations team?
- Does the streaming analytics solution have built-in capabilities to make application modifications dynamically?

Conclusion

Enterprises are seeing greater opportunity to better serve their customers, drive greater revenues and reduce costs through operational efficiencies. To capitalize on this opportunity, organizations are looking for solutions that enable rapid insights and action on fast big data. An enterprise-grade solution is required that meets the needs of data scientists, developers and data center operations. The top 3 reasons that enterprises are deploying DataTorrent RTS over Spark Streaming & Storm are summarized below.

Enterprise-grade streaming analytics platform

- Industry's first Hadoop-native, fully multi-tenant YARN and HDFS based architecture
- No data loss, with automatic fault tolerance for raw event data, application state and engine state
- High-throughput, in-memory and low-latency event processing with no need to micro-batch
- At-most-once, at-least-once and exactly-once processing guarantees while guaranteeing event order
- Auto-scaling and auto-partitioning of event streams for skew management

Data scientist & application developer friendly

- Visual application creation tool that utilizes the 450+ open source Java operators
- Ability to ingest data from and distribute to any source with more than 75 pre-built adaptors
- Open source library of 450+ operators for a wide variety of real-time analytics and transformations

Robust operations & management platform

- Simple application packaging and deployment
- Intuitive UI for end-to-end management, monitoring, reporting and troubleshooting
- Dynamic application updates with no application downtime
- Light footprint (no need to deploy on every Hadoop node) for simple installation and upgrade
- REST API for easy integration with enterprise tools

Additional Resources

- DataTorrent RTS: Data sheet
- DataTorrent RTS Whitepaper
- DataTorrent download

DataTorrent Inc., 3200 Patrick Henry Drive, 2nd Floor, Santa Clara, CA 95054
+(1) 408-331-5034, ext #101