STREAMING DATA AND THE FAST DATA STACK


WHITE PAPER

Software designers and architects build software stack reference architectures to solve common, recurring problems. A great example is the LAMP stack, which provides a framework for building web applications. Each tier of the LAMP stack has viable substitutes that might be chosen simply by preference, or perhaps for technical reasons. Additional layers or complementary tools can be mixed in to address specific needs. By explaining a resulting architecture in the context of the generalized LAMP pattern, much information can be conveyed succinctly to others familiar with the pattern.

Big Data is data at rest; Fast Data is streaming data, data in motion. A stack is emerging across verticals and industries alike for building applications that process these high-velocity streams of data, which quickly accumulate into the Big Data lake. This new stack, the Fast Data Stack, has a unique purpose: to grab real-time data and output recommendations, decisions, and analyses in milliseconds. Over the next several years this emerging Fast Data Stack will gain prominence and serve as a starting point for developers writing applications for streaming data.

As with any emerging reference software stack, it's important to step up a layer and consider the workflow that motivates the Fast Data Stack's requirements. The requirements for a desktop application stack (a graphical UI, likely an MVC-style layering, a configuration database, undo capability, and so on) are well understood, as are the generally accepted user activities that guide the design of a web application. In fact, these are so well understood that developers have internalized the knowledge and forgotten that it was once new!

When considering a Fast Data stack, we have to understand that few developers have created even one large-scale data application, and very few have created more than a handful. The developer community has not yet internalized the usage patterns and general requirements for applications that process high-speed inputs from sensors, M2M devices, or IoT platforms. With massive streams of data coming into the enterprise, and into applications, from hundreds to millions of sources and in different formats (e.g., structured and unstructured), agreement on a definition of the Fast Data Stack will benefit everyone.

This paper looks at the emerging Fast Data Stack through the lens of streaming data, to provide architects, CTOs, and developers with the fundamental architectural elements of the new Fast Data Stack: a LAMP stack for streaming data applications.

A CLOSE LOOK AT THE FAST DATA STACK

A Fast Data stack should Ingest, Analyze, Decide, and Store. This emerging stack addresses the core data (message) requirements of streaming applications: ingestion, analysis, decisions, and storage. Variations on the basic stack can be used for different use cases.

INGESTION

Ingestion is the first stage in the Fast Data Stack (some would argue this stage is similar to the ingestion layer of the Lambda Architecture). The job of ingestion is to interface to the streaming data sources, and to accept and transform or normalize incoming data. Ingestion marks the first point at which data can be transacted against, applying key functions and processes to produce value from the data: insight, intelligence, and action.

Developers have two choices for ingestion. The first is direct ingestion, where a straight-through code module hooks directly into the data-generating API, capturing the entire stream at the speed the API and the network will run, i.e., at wire speed. In this case the analytic/decision engine has a direct ingestion adapter, and with some amount of coding it can handle streams of data from an API pipeline without the need to stage or cache any of the data on disk.

If access to the data-generating API is not available, the usual alternative is that the data is served up by a message queue. In this case, an ingestion system is needed to process incoming data off the queue. Modern queuing systems handle partitioning, replication, and ordering of data, and can manage backpressure from slower downstream components.
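As a concrete illustration of the queue-based option, the following is a minimal sketch of an ingestion loop using Apache Kafka's Java consumer API. The broker address, topic name, and process() handler are illustrative assumptions, not part of the stack definition.

    // A minimal sketch of queue-based ingestion: pull events off a Kafka topic
    // and hand them to a downstream analytic/decision engine. Broker and topic
    // names are assumptions for this example.
    import java.time.Duration;
    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    public class QueueIngest {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "broker1:9092");    // assumed broker address
            props.put("group.id", "fast-data-ingest");         // consumer group for partition balancing
            props.put("key.deserializer",
                      "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer",
                      "org.apache.kafka.common.serialization.StringDeserializer");

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(Collections.singletonList("events")); // assumed topic name
                while (true) {
                    // Poll the queue; the broker handles partitioning, replication, and ordering.
                    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
                    for (ConsumerRecord<String, String> record : records) {
                        // Normalize/transform here, then pass to the analytic/decision engine.
                        process(record.key(), record.value());
                    }
                }
            }
        }

        static void process(String key, String value) {
            System.out.printf("ingested %s -> %s%n", key, value);
        }
    }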

STREAMING ANALYTICS AND REAL-TIME DECISIONS

As data is ingested, it is used by one or more analytic and decision engines to accomplish specific tasks on streaming data. At this point the data is Fast Data, and the challenge for the decision engines is to keep pace with the velocity of the data stream.

Real-time decisions are used to influence the next step of processing. Real-time decision engines do a lot of work: they consume a high-velocity data stream while processing complex logic, all in time to complete the real-time decision feedback loop. Real-time decisions are made possible by a Fast Data Stack that features:

- Continuous query mode
- OLTP analytics mode
- Programmatic request-response
- Stored procedures
- Export to longer-term analytics/storage

Streaming analytics need to consume high-velocity data while maintaining real-time analytics in the form of counters, aggregations, and leaderboards. Streaming analytics require a Fast Data Stack with the following attributes (a sketch of time-window counting follows the list):

- Time-window processing
- Counting and statistics
- Rules
- Triggers
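To make time-window processing and counting concrete, here is a minimal sketch of a tumbling-window, per-key counter of the kind a streaming analytics engine maintains for counters, aggregations, and leaderboards. All class and method names are illustrative.

    // A minimal sketch of time-window processing: tumbling windows that maintain
    // per-key counters and emit aggregates when a window closes.
    import java.util.HashMap;
    import java.util.Map;

    public class WindowedCounter {
        private final long windowMillis;
        private long windowStart = 0;
        private final Map<String, Long> counts = new HashMap<>();

        public WindowedCounter(long windowMillis) {
            this.windowMillis = windowMillis;
        }

        // Called for every event in the stream.
        public void onEvent(String key, long eventTimeMillis) {
            long start = eventTimeMillis - (eventTimeMillis % windowMillis);
            if (start != windowStart) {
                emit();              // window closed: publish and reset the counters
                windowStart = start;
            }
            counts.merge(key, 1L, Long::sum);
        }

        // A rule/trigger hook: here we just print the closed window's aggregates.
        private void emit() {
            if (!counts.isEmpty()) {
                System.out.println("window @" + windowStart + " -> " + counts);
                counts.clear();
            }
        }

        public static void main(String[] args) {
            WindowedCounter counter = new WindowedCounter(60_000); // one-minute windows
            long t = System.currentTimeMillis();
            counter.onEvent("sensor-a", t);
            counter.onEvent("sensor-a", t + 1_000);
            counter.onEvent("sensor-b", t + 61_000); // crosses into the next window
            counter.emit();
        }
    }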

DATA EXPORT

Once Fast Data analytics are completed, the data moves through the pipeline for later processing; data ingestion and export flow at the same rate. Streaming analytics usually rely on an ingestion queue, and similar queues are used for the export stage. For real-time decisions, which process Fast Data in a continuous query mode, an Export function is needed to transform and distribute data to the Big Data warehouse/storage (OLAP) engine(s) of choice.
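As a rough illustration of an Export function, here is a minimal sketch that batches analyzed rows into a downstream warehouse over plain JDBC. The connection URL, credentials, and table schema are assumptions; any OLAP engine with a JDBC driver could stand in.

    // A minimal sketch of an Export function: transform analyzed events and push
    // them in batches to a downstream warehouse over JDBC. URL/table are assumed.
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.SQLException;
    import java.util.List;

    public class WarehouseExporter {
        public static void export(List<long[]> rows) throws SQLException {
            String url = "jdbc:postgresql://warehouse:5432/analytics"; // assumed target
            try (Connection conn = DriverManager.getConnection(url, "etl", "secret");
                 PreparedStatement ps = conn.prepareStatement(
                     "INSERT INTO events_history (event_id, event_ts, value) VALUES (?, ?, ?)")) {
                for (long[] row : rows) {
                    ps.setLong(1, row[0]);
                    ps.setLong(2, row[1]);
                    ps.setLong(3, row[2]);
                    ps.addBatch();    // batching keeps export throughput close to the ingest rate
                }
                ps.executeBatch();    // one round trip per batch, not per row
            }
        }
    }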

DATA STORE AND HISTORICAL ANALYTICS

OLAP engines enable fast queries against raw (structured) data. They append streaming data to historical data and store it for later, further Big Data analysis on the entire data set, reducing time-to-reporting and enabling historical BI. Many OLAP systems use columnar compression to store large amounts of data in memory and to increase query speed. OLAP data stores and historical analytics have the following capabilities:

- Periodic query
- OLAP/data warehouse model
- Dashboards
- User interaction
- BI and reports

STREAMING DATA AND ITS DEMANDS

Let's return to a discussion of streaming data and review its demands on application developers. Data arriving continuously, and in massive quantity, is called a stream. Data in a stream may come in many types and formats. Most often the data provides information about the process that generated it; this information may be called messages or events. It includes data from new sources, such as sensor data, as well as clickstreams from web servers, machine data, and data from devices, events, transactions, and customer interactions.

Enterprises adopting Big Data strategies quickly realize that Big Data is created by Fast Data, i.e., high-speed streams of incoming data. High counts of small messages or events add up to massive amounts of streaming data. And the types of data streaming into the enterprise are increasing, seemingly daily. It has become a significant challenge for enterprises not only to ingest these relentless feeds of data, but also to capture value from them.

STREAM PROCESSING IS NOT NEW

Complex Event Processing (CEP) was an early form of stream processing. Time windows, state changes (events), and simple analysis techniques of correlation or classification were used to synthesize useful information from the stream in near real time. By limiting the type of analysis that could be performed on a stream in real time, the system could extract certain patterns or events. These techniques proved very useful in programmatic stock trading, financial fraud detection (money laundering or credit card fraud), manufacturing control, and military command, control, communications, and intelligence (C3I) systems.

Domain-specific forms of stream processing were also introduced. In the network security field, stream processing of network traffic proved to be an effective way to implement Intrusion Detection/Intrusion Prevention Systems (IDS/IPS): byte patterns of known malware could be detected in transit. To run fast enough, these systems were implemented in specialized hardware. Both of these examples were ultimately limited by processing speed.

STREAM PROCESSING FINDS NEW LIFE IN THE CLOUD

Cloud-based applications began generating demand for stream processing due to scale. Large web sites serving millions of subscribers, and cloud services for stocks, travel, marketplaces, and financial services, generated massive streams of data about user behavior, product sales, search terms, clickstreams, and much more. This formed the demand side for stream processing.

The same architecture that enabled applications to create massive streams of data also allowed designers to create high-capacity stream processing systems: computing clusters enabling parallel execution, with large pools of compute and memory, supplied the capabilities needed to build them. New stream processing solutions were implemented, taking advantage of the underlying commodity scale-out architecture of the cloud. This clustered, scale-out architecture opened the door for all-software, high-performance implementations of stream processing, and formed the supply side for stream processing.

This newfound ability to process streaming data with commodity resources, and inexpensively, has enabled new types of applications: applications that can process both Fast and Big Data. With the arrival of this new class of applications comes the definition of a Fast Data Stack.

FAST DATA (STREAMING) USE CASES

REAL-TIME TRADING DECISIONS

In this example use case, quote streams are copied to the Fast Data Stack input queue. A real-time decision engine combines client, regulatory, pricing, fee, and whatever other data sources the business requires; calculates a decision on whether and where to trade; and reports this decision up to the execution systems. In this use case there is no historical data processing; all effort is aimed at Fast Data.

AD SERVING: REAL-TIME DECISIONS AND HISTORICAL ANALYSIS

In this example use case, navigation and user information is sent to the Fast Data Stack via a direct API. The real-time decision engine combines data about recently served ads with other data, calculates a decision on which ad to serve, and reports this decision up to the ad placement systems. Additionally, the ad-serving result is exported to a data warehouse, where it is merged with longer-term historical ad-serving data and made ready for batch historical analysis. This use case includes both Fast Data processing and historical Big Data processing.

SMART METER SENSING, PREDICTING, AND REPORTING

In this use case, meter streams are copied to the Fast Data Stack input queue and accessed by the real-time decision engine, which immediately applies a predictive model and alerts the supply grid with immediate demand forecasts. The streaming analytics engine also accesses the data, calculating and recording values from the smart meter every 15 minutes and alerting on any exceptions (e.g., power failures). All data is saved for billing and historical analysis. This use case features both Fast Data processing and historical Big Data processing.
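To make the smart-meter analytics step concrete, here is a minimal sketch that records one value per meter per 15-minute interval and raises an alert on a power-failure exception. The reading format and all names are illustrative assumptions, not a product API.

    // A minimal sketch of the smart-meter streaming-analytics step: record a
    // value per meter every 15 minutes; alert immediately on an exception.
    import java.util.HashMap;
    import java.util.Map;

    public class MeterAnalytics {
        private static final long INTERVAL_MS = 15 * 60 * 1000; // 15-minute intervals
        private final Map<String, Long> lastRecorded = new HashMap<>();

        // Called for every meter reading in the stream.
        public void onReading(String meterId, long tsMillis, double kwh, boolean powerOk) {
            if (!powerOk) {
                alert(meterId, tsMillis); // exception path: notify the grid immediately
                return;
            }
            long interval = tsMillis - (tsMillis % INTERVAL_MS);
            // Record one value per meter per 15-minute interval for billing/history.
            if (!Long.valueOf(interval).equals(lastRecorded.get(meterId))) {
                lastRecorded.put(meterId, interval);
                store(meterId, interval, kwh);
            }
        }

        private void alert(String meterId, long ts) {
            System.out.printf("ALERT: power failure at meter %s (t=%d)%n", meterId, ts);
        }

        private void store(String meterId, long interval, double kwh) {
            System.out.printf("record %s interval=%d kwh=%.2f%n", meterId, interval, kwh);
        }
    }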

CONCLUSION

The growth of streaming data, and the capability of new distributed systems architectures to process that data in real time, has spurred the development of new applications that benefit from a new technology stack: the Fast Data Stack. These new applications process real-time data and output recommendations, decisions, and analyses in milliseconds.

When we look at the requirements of these applications, we find that in most cases we are looking at an OLTP problem. OLTP problems are often solved with databases, but streaming problems may seem ill-suited to such an approach because of the limitations of traditional database technologies. Because traditional databases cannot handle streaming volume, developers have resorted to coupling streaming offerings with data stores, often non-ACID, non-transactional data stores, to try to capture value from Fast Data. This approach has many limitations: it requires many components, as in the Lambda Architecture, introducing complexity and risk.

Unlike traditional databases, VoltDB, an in-memory scale-out relational database, can process streams of data and produce analyses and decisions in milliseconds. As a single integrated platform, VoltDB reduces the complexity of building Fast Data applications by eliminating the need to connect streaming systems to non-relational data stores. It also provides a familiar, proven interaction model (SQL), simplifying application development and enabling real-time analytics with industry-standard SQL-based tools.

Using VoltDB puts decision-making as close to the head of the data pipeline as possible. Because VoltDB can handle all traffic from ingestion, there is no need for the overhead of a separate ingest module or stream analytics module. These characteristics make VoltDB an ideal match for the demands of stream processing, real-time analysis, and decision-making, using a relatively simple-to-implement Fast Data Stack at scale.
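To illustrate the SQL-based interaction model described above, here is a minimal sketch of a VoltDB Java stored procedure that ingests an event and computes a per-device aggregate in the same transaction. The table and column names are assumptions for this example; VoltProcedure, voltQueueSQL, and voltExecuteSQL are VoltDB's standard procedure API.

    // A minimal sketch of transacting on ingest: insert the incoming event and
    // return a running per-device aggregate, all in one transaction.
    import org.voltdb.SQLStmt;
    import org.voltdb.VoltProcedure;
    import org.voltdb.VoltTable;

    public class RecordEvent extends VoltProcedure {
        // Insert the incoming event (assumed schema: events(device_id, event_ts, value)).
        public final SQLStmt insertEvent = new SQLStmt(
            "INSERT INTO events (device_id, event_ts, value) VALUES (?, ?, ?);");

        // Real-time analytic in the same transaction: a per-device running aggregate.
        public final SQLStmt deviceTotal = new SQLStmt(
            "SELECT COUNT(*), SUM(value) FROM events WHERE device_id = ?;");

        public VoltTable[] run(long deviceId, long eventTs, double value) {
            voltQueueSQL(insertEvent, deviceId, eventTs, value);
            voltQueueSQL(deviceTotal, deviceId);
            return voltExecuteSQL(true); // both statements commit atomically
        }
    }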