THE DEVELOPER GUIDE TO BUILDING STREAMING DATA APPLICATIONS

WHITE PAPER

Successfully writing Fast Data applications to manage data generated by mobile and smart devices, social interactions, and the Internet is development's next big challenge. The availability and abundance of fast, streaming data presents an enormous opportunity to write smart applications that extract intelligence, provide insight, and make everyday products, services, places, farms, cities, grids, buildings, and homes a source of intelligence.

Modern applications need to manage and drive value from fast-moving streams of data. But traditional tools such as conventional database systems are simply too slow to ingest data, analyze it in real time, and make decisions. They can't meet Fast Data's demands. Successfully interacting with Fast Data requires a new approach to handling these new data streams.

As Fast Data emerges as a required component of a complete data-at-scale stack, several technologies are being proposed as possible solution components. These fall into three categories: fast OLAP systems (the province of Business Intelligence applications), stream-processing systems, and OLTP (database) systems. Each of these solutions can be highly capable, but some are better suited to Fast Data than others. Organizing the Fast Data contenders by their core architecture types provides a way to evaluate their strengths and weaknesses against the requirements of Fast Data applications.

FAST OLAP

OLAP solutions enable fast queries against raw (structured) data. Eliminating the need to organize or accumulate data (by time window, session, or volume) before querying is a primary part of their value proposition: they organize data at query time, not at ingest, and mature OLAP systems are very good at doing so. OLAP vendors believe the most valuable part of Fast Data is reducing time-to-reporting and enabling real-time BI. Vendors talk about the benefits of using columnar compression to store large amounts of data (years of history vs. hours) in memory; they also emphasize query speed.

Fast OLAP systems organize data to enable efficient queries across multiple dimensions of terabytes to petabytes of stored data. Typically these systems organize data in a fashion that allows cheap, fast, effective compression (e.g., run-length encoding). Compression is usually critical: it allows the OLAP system to store more data in the same storage footprint, and it lets the OLAP system scan many entries while minimizing expensive disk accesses.

Where OLAP solutions fall short in Fast Data use cases is in transactional (multi-factor) decisions. OLAP offerings are analytics engines, not transaction processing engines. As noted, vendors highlight compression, the ability to store large amounts of data; they de-emphasize integration with competing OLAP systems (export/archive).

STREAM PROCESSING

Streaming systems' main purpose is to capture data. Unlike OLAP and OLTP systems, streaming systems are not optimized to store data, nor do they optimize for fast record lookup. And since they aren't storing data, they are not optimized for scans across different dimensions of the data set. Instead, streaming systems are optimized for running computations across a stream of arriving events. Almost all offerings in this category organize a set of functions and run those functions against moving windows.
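A minimal sketch of this model, applying user-supplied functions to tumbling windows of an event stream, looks like the following. This is an illustration, not any vendor's API; the window size, functions, and names are invented:

```python
from typing import Callable, Iterable, Iterator, List

def tumbling_windows(events: Iterable[int], size: int) -> Iterator[List[int]]:
    """Group an event stream into fixed-size (tumbling) windows."""
    window: List[int] = []
    for event in events:
        window.append(event)
        if len(window) == size:
            yield window
            window = []
    if window:  # flush the final, partially filled window
        yield window

def run_pipeline(events: Iterable[int], size: int,
                 f: Callable, g: Callable) -> list:
    """Serial composition g(f(window)) over each window, as in g(f(x))."""
    return [g(f(w)) for w in tumbling_windows(events, size)]

# Example: per-window sum (f), then a threshold alert (g).
alerts = run_pipeline(range(10), size=4, f=sum, g=lambda s: s > 10)
# windows [0..3], [4..7], [8, 9] sum to 6, 22, 17 -> [False, True, True]
```

Note that all state here lives only for the duration of one window, which is exactly the property the later sections identify as a limitation.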
Functions can be arranged in parallel (run input x through f(x) and g(x)) or in serial (run input x through f(x), then run g() against the output: g(f(x))). These systems are very good at scaling pre-defined real-time analytics against Fast Data sources. The ability to compose these computations also enables a range of real-time ETL applications. Some streaming proponents believe the most valuable part of Fast Data is scalable message processing and coordination between systems. Vendors and popular open source stream-processing projects promote strengths in data integration and message pipelines.

Complex Event Processing (CEP) and streaming solutions are less-than-ideal choices for operations that require state. They are primarily on-ramps to OLAP. While streaming/CEP solutions can accomplish streaming analytics, they are not sufficient, without a supporting back-end database, for applications that rely on decisions or ETL. As a result, CEP/streaming offerings are often configured with a back-end database to address the fast-decisions use case. However, bolting on a back-end database will result in lower performance and higher latency than a fast OLTP solution.

FAST OLTP

Fast OLTP vendors believe that per-event decision-making (requiring ACID semantics and database transactions), real-time data enrichment, and streaming analytics are critical when building smart, fast applications. OLTP systems organize data as a series of records, where records may be rows or documents. They are durable systems and can persist user data across failures. OLTP systems are designed for fast record lookups using indexes, and they provide a query/transaction model that allows applications to query and read/write record data coherently and consistently. OLTP systems are typically designed to make writes to a specific record (or field) efficient and safe, but they are not designed to scale multi-dimensional reads across large numbers of records like an OLAP offering. Within OLTP systems, there are two types of architectures: traditional SQL systems and the NewSQL/NoSQL systems.

Traditional SQL systems are disk-based systems that can be challenging to scale to the throughput demanded by today's Fast Data requirements; they are more general-purpose systems. NewSQL and NoSQL solutions can provide the speed and availability required by Fast Data applications, and each comes with its own specialty. NoSQL systems trade off query expressiveness (SQL) and schema for a flexible data model, low-latency lookup, and high availability. NewSQL solutions provide similar scalability but specialize this architecture, providing the expressiveness of SQL queries, strong consistency, and high availability, along with a strong schema contract.

USE CASE VS. CONTENDER: MAPPING THE LANDSCAPE

FAST OLAP:
- Cannot generate real-time responses and decisions.
- An evolution of OLAP capability to in-memory; not a sustainable strategic value-add vs. columnar incumbents (Vertica, Redshift, etc.).
- May have poor SQL support; not a substitute for incumbent column stores (MemSQL).

CEP/STREAMING:
- Good at data capture.
- Good at ingest to OLAP and pre-defined read-only analytics.
- Other operations introduce complexity and a lack of reliability.
- May require development work; needs a back-end database.

FAST OLTP:
- Fast Data is message/event oriented and databases are query/transaction oriented; put messages first and support with a DB where necessary.
- Suitable for analytics with pre-defined queries.
- No compression: too expensive to store large datasets.
- May require complex OLAP integrations.

THREE DRAWBACKS OF STREAMING SOLUTIONS

STREAMING SOLUTIONS LACK CONTEXT

Streaming solutions can ingest fast-moving data feeds, but they lack context and state, both of which are necessary to support decision-making. For example, filter, aggregate, and join operations (a.k.a. enrich) require state. Streaming systems thus must be complemented by back-end databases; as standalone offerings, their value is primarily focused on fast ingestion.
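The enrich (join) operation just mentioned can be sketched to show where the state requirement comes from. Here a plain dictionary stands in for the back-end store a streaming job must consult; the customer tiers and events are invented for illustration:

```python
from typing import Dict, Iterable, Iterator

# Stand-in for the back-end lookup table a stateless stream processor
# cannot hold itself; the customer data is invented.
CUSTOMERS: Dict[int, str] = {1: "gold", 2: "silver"}

def enrich(events: Iterable[dict], customers: Dict[int, str]) -> Iterator[dict]:
    """Join each raw event against stored state -- the step a stateless
    stream processor cannot perform without a supporting database."""
    for event in events:
        tier = customers.get(event["customer_id"], "unknown")
        yield {**event, "tier": tier}

raw = [{"customer_id": 1, "amount": 40.0},
       {"customer_id": 3, "amount": 7.5}]
enriched = list(enrich(raw, CUSTOMERS))
# the second event falls back to "unknown": there is no state for customer 3
```

Every call to `customers.get` is, in a real deployment, a network round trip to that back-end store, which is the latency cost described below.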

Their poor query support inhibits interactivity and introduces network I/O round trips to the back end, where a fast back-end DB is still required. Streaming solutions are thus a compromised choice for Fast Data applications that rely on context.

STREAMING SOLUTIONS ARE NOT ARCHITECTED FOR REAL-TIME DECISIONS

Decisions are fundamentally ACID workloads. Except for the scale (required performance) created by M2M/IoT/Web applications, these are OLTP applications. They carry all of the requirements of an ACID system: atomicity of side effects, consistency in evaluation and constraint checks, isolation from other concurrent transactions, and durability of results. Streaming systems are not designed to offer fast per-event responses to applications; their strength is algorithmic processing of windowed data. Streaming systems do not implement ACID transactions and are not designed to be durable systems of record. Lacking standard application interfaces (ODBC/JDBC) and broad ad hoc query capability, streaming solutions are less than ideal for real-time decision use cases. Apache Spark, an extension of the OLAP Hadoop warehouse, is an analytics tool that reduces time to analytics (vs. batch processing and periodic reporting). However, Spark does not support decisions.

STREAMING SOLUTIONS LACK OPERATIONAL TRANSPARENCY

Streaming solutions can only be queried for their statically configured topological results. Streaming solutions that maintain fixed aggregations need to store those aggregations in fault-tolerant (durable) back ends, introducing the complexity of another storage system.

FAST DATA ADOPTION IN AGILE DEVELOPMENT: INFLECTION POINTS

As more developers are tasked with building applications to handle fast streams of data, providers of streaming solutions are reaching an inflection point. Unable to meet the full-stack demands of application developers, they are turning to composed solutions, e.g., Kafka + Storm + Spark + a NoSQL database; the Lambda architecture, which splits the functions of batch, speed, and serving layers; or highly customized, purpose-built solutions cobbled together from open source projects and custom code. These approaches rely on the cool technology du jour,

sacrificing enterprise scale, proven high-quality code, and repeatability. In the rush to develop applications for Fast Data, developers focus on the stream of data, not on the desired business result.

SOLVING THE SHORTCOMINGS OF STREAMING SOLUTIONS

Understanding the promise and value of Fast Data requires an understanding of the nature, utility, and shortcomings of streaming solutions. While much can be accomplished by ingesting fast-moving streams of data, two of the three contenders for streaming solutions lack key functionality necessary to build applications that let businesses act in real time as data flows into the organization from millions of endpoints: sensors, mobile devices, connected systems, and the Internet of Things.

Yes, fast and big data have different requirements, and it is necessary to have a component on the front end of the data pipeline to manage streams. Yet how much more effective would it be to handle streams of data by ingesting and interacting on the data stream, performing real-time analytics on the data in motion, and making data-driven decisions on each event in the stream? In this model, applications can take action on streams of data, and processed data can be exported at high speed to the data warehouse for historical analytics, reporting, analysis, and more.

Streaming solutions, while appealing for fast ingest, do not provide the missing link between fast streams of data and Fast Data applications. Application developers must be free to write code that adds value to the organization, rather than being burdened by writing code to manage streams of data as they flow through the pipeline. An in-memory operational system that can decide, analyze, and serve results at Fast Data's speed is the answer to handling fast streams of data at enterprise scale.
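The ingest / decide / export loop described above can be sketched end to end. Everything here is a toy stand-in: the limit, the per-key aggregate, and the list acting as the downstream warehouse are all invented for illustration:

```python
from typing import Dict, List

class FastDataApp:
    """Toy ingest -> decide -> export loop. A list stands in for the
    downstream data warehouse that receives processed events."""

    def __init__(self, limit: float = 100.0) -> None:
        self.limit = limit
        self.totals: Dict[str, float] = {}   # live per-key aggregate (state)
        self.warehouse: List[dict] = []      # export target for history

    def ingest(self, event: dict) -> bool:
        key = event["key"]
        total = self.totals.get(key, 0.0) + event["value"]
        approved = total <= self.limit       # per-event decision on live state
        if approved:
            self.totals[key] = total         # update state only when approved
        # export every processed event, decision included, for analytics
        self.warehouse.append({**event, "approved": approved})
        return approved

app = FastDataApp(limit=100.0)
decisions = [app.ingest({"key": "sensor-1", "value": v})
             for v in (60.0, 30.0, 25.0)]
# running totals 60, 90, then 115 exceeds 100 -> [True, True, False]
```

The decision, the analytics state, and the export all act on the same event in one pass, which is the single-system model this section argues for.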
Fast Data applications give organizations the tools to process high-volume streams of data while enabling millions of complex decisions in real time. With Fast Data, things that were not possible before become achievable: instant decisions can be made on real-time data to drive sales, connect with customers, inform business processes, and create value.