SQLstream Blaze and Apache Storm A BENCHMARK COMPARISON




The V of Big Data

"Velocity means both how fast data is being produced and how fast the data must be processed to meet demand." (Gartner)

The emergence of Big Data has defined one of the most exciting eras in the evolution of IT. Apache Hadoop has been at the forefront, driving cost-effective analysis of unstructured and semi-structured data. However, Hadoop was designed for processing stored data in batches. The continuous analysis of Big Data in real time requires a different approach.

Processing data as streams enables the real-time value of Big Data to be extracted before the data are stored. Automated actions and visualizations can be driven by real-time analytics over the data in motion, before the data are streamed through to Hadoop and other data storage platforms. Applications for stream processing have emerged across many industries, including telecommunications, transportation, cybersecurity, oil and gas, financial services and the Internet of Things. The growth of machine data (data generated by sensors, devices, networks and applications) is driving the need for real-time action and analysis on data arriving at rates exceeding many millions of records per second.

This paper documents a performance benchmark and ROI analysis carried out by a customer in the telecoms sector. The requirement was to detect time-based patterns in 4G/LTE network performance data that were predictors of potential QoS failures. The throughput requirement was to process 10 million records per second, generating results with latency in the low milliseconds.

SQLstream Blaze or Apache Storm

The customer was concerned with applying increasingly complex business logic across multiple sources of network element and radio tower data streams, with the goal of aggregating and analyzing the data payloads with near-zero latency.

Goal: to increase the quality of service in a dynamic and immediate manner, ensuring robust cellular service and eliminating service-impacting events.

The customer identified SQLstream Blaze and the open source Apache Storm framework as candidates for an in-depth evaluation. Although both can be considered stream processing platforms, there are also significant differences. As the benchmark comparison highlighted, although Apache Storm can be downloaded at no cost, factors such as hardware performance and requirements, development effort and expert support were critical considerations in determining total system cost and overall value.

The results demonstrated that the SQLstream Blaze real-time data hub, with its core stream processing platform s-server, performed 15x faster with a significantly lower Total Cost of Ownership (TCO) as projected over the lifetime of the system. The TCO savings were the result of significantly less hardware for the same performance, and of faster time-to-value with significantly less development effort.

Real-time Performance Monitoring with Streaming Analytics

The business objective was to increase the quality of service (QoS) in a dynamic and immediate manner, ensuring cellular transmissions could be made more robust and eliminating, as far as possible, negative events such as dropped calls and low-quality voice paths.

The customer was concerned with the increasingly complex business logic, coupled with the need for low-latency aggregation and analysis across multiple sources of network element and radio tower data streams. The traditional data management approach of storing the data before processing could not deliver the low-latency analytics required from the high-volume, high-velocity data streams: a call or data transmission would already have failed by the time the event was identified.

In addition, management network architectures for modern 4G/LTE radio towers require multiple regional data centers. The massive data volumes for this particular use case would have required systems to be implemented at each tower site, with each tower's pre-processed data then aggregated in the relevant regional data center. Deploying potentially numerous systems in diverse and often remote geographic locations was cost-prohibitive.

Any solution must be able to handle the constant high-volume data payload traffic, and scale out during large spikes in traffic volume. It must also handle different data structures and formats, as well as operational differences such as legacy equipment and differences in device firmware or software versions. Addressing the large number of system platform permutations and delivering a normalized flow of data at high volume with low latency was also a prime consideration. Flexibility in the field would be paramount.

The customer decided the most appropriate data management architecture was to build and deploy a stream processing application. The high-level architecture would require remote data collection agents to capture and stream performance data to a single central platform. The central platform must be able to scale dynamically up to the peak forecast load of 10 million records per second. Data must also be filtered, parsed and enhanced dynamically as part of the real-time pipeline flow. Aggregated and streaming intelligence feeds must be delivered continuously to existing non-real-time data warehouses and operational applications.
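The filter, parse and enrich stages described above can be pictured in ordinary Java. Everything here is a hypothetical illustration, not SQLstream or customer code: the `towerId,signal` CSV layout, the `QosRecord` type and the `TOWER_TO_REGION` lookup table are all invented for the sketch.

```java
import java.util.*;
import java.util.stream.*;

// Hypothetical sketch of a collection-agent pipeline: filter malformed
// input, parse raw lines, enrich with reference data, aggregate per region.
public class AgentPipeline {

    // A parsed performance record (names are illustrative only).
    public record QosRecord(String towerId, double signalDb, String region) {}

    static final Map<String, String> TOWER_TO_REGION =
        Map.of("T1", "west", "T2", "west", "T3", "east");

    // Parse a raw "towerId,signalDb" line, dropping malformed lines
    // and unknown towers, and enrich with the tower's region.
    public static Optional<QosRecord> parse(String rawLine) {
        String[] parts = rawLine.split(",");
        if (parts.length != 2) return Optional.empty();   // filter malformed input
        String region = TOWER_TO_REGION.get(parts[0]);
        if (region == null) return Optional.empty();      // filter unknown towers
        return Optional.of(new QosRecord(parts[0], Double.parseDouble(parts[1]), region));
    }

    // Aggregate average signal strength per region, as a regional stage
    // might before forwarding a reduced feed to the central platform.
    public static Map<String, Double> averageByRegion(List<String> rawLines) {
        return rawLines.stream()
            .map(AgentPipeline::parse)
            .flatMap(Optional::stream)
            .collect(Collectors.groupingBy(QosRecord::region,
                     Collectors.averagingDouble(QosRecord::signalDb)));
    }

    public static void main(String[] args) {
        List<String> raw = List.of("T1,-80", "T2,-90", "T3,-70", "garbage");
        System.out.println(averageByRegion(raw)); // prints the per-region averages
    }
}
```

The point of the sketch is the shape of the pipeline: each stage is a pure transformation over the stream, so stages can run at the agent, at a regional site, or centrally without changing the logic.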

Overview of the Apache Storm Implementation

Apache Storm is a distributed data stream processing framework available under the Apache open source license. The Storm data processing architecture is similar in concept to a Hadoop data storage cluster, replacing Hadoop's MapReduce infrastructure for processing static data with Storm topologies for processing data streams. A Storm topology consists of Spouts (data sources) and Bolts (nodes for insertion of stream processing logic). Bolts are commonly written in Java, but in principle can be defined in any coding language.

Apache Storm Solution Development

The Storm framework requires significant development effort to deliver a complete, operational application, in particular the Java-based stream processing libraries to address the analytics and data aggregation, and the integration adapters for external systems and data feeds. The resulting solution required a number of additional coding steps to produce an operationally viable solution based on the latest release of the project software, including:

- Integration of the Storm messaging middleware technology with the Java-based stream processing library.
- Development of the data aggregation and analytics as Java extensions to the core project framework.
- Development of data integration adapters.

Apache Storm Performance and Total Cost of Ownership

Delivering an operational solution therefore required considerable bespoke coding. Three further considerations also contributed to the higher overall TCO:

- Lower performance per server required a significantly higher number of servers to reach the target throughput, driving higher costs for hardware, power, cooling and solution support.
- New or updated analytics required the core engine to be stopped and restarted, impacting operational service level agreements and driving higher maintenance costs.
- Higher ongoing support and maintenance costs for custom code over the lifetime of the project.
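The Spout/Bolt model described above can be illustrated with a deliberately tiny, single-process sketch. This is not the Apache Storm API (real topologies are distributed, asynchronous, and wired up via Storm's topology builder); it only shows the shape of a spout feeding a chain of bolts, with all names invented for the example.

```java
import java.util.*;
import java.util.function.*;

// Toy analogy of a Storm topology using only the JDK: a "spout" is a
// tuple source, each "bolt" a processing stage applied to the stream.
public class MiniTopology {

    // Emits a batch of tuples (real spouts emit continuously).
    public interface Spout { List<String> nextBatch(); }

    // Transforms a batch of input tuples into output tuples.
    public interface Bolt extends Function<List<String>, List<String>> {}

    // Run the spout's output through a chain of bolts, as a topology would.
    public static List<String> run(Spout spout, List<Bolt> bolts) {
        List<String> tuples = spout.nextBatch();
        for (Bolt bolt : bolts) tuples = bolt.apply(tuples);
        return tuples;
    }

    public static void main(String[] args) {
        Spout source = () -> List.of("drop", "ok", "drop");
        Bolt filter  = in -> in.stream().filter(t -> !t.equals("drop")).toList();
        Bolt tag     = in -> in.stream().map(t -> t + ":checked").toList();
        System.out.println(run(source, List.of(filter, tag))); // [ok:checked]
    }
}
```

Even in this stripped-down form, the division of labor is visible: all analytics logic lives inside hand-written bolt code, which is exactly the bespoke Java development effort the benchmark team had to budget for.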

SQLstream Blaze and Apache Storm Compared

The code to handle all streaming pipelines consisted of only 350 lines of commented SQL, driving the lowest TCO and further easing the ongoing as-deployed maintenance and support of complex applications in the field.

SQLstream provided the customer with the SQLstream Blaze platform plus the supporting developer and user documentation. The customer team was able to quickly develop prototypes for several different business use cases. SQLstream's technical support team provided support when requested, along with suggestions for solution optimization, in particular guidance on the architectural differences between implementing a stream processing solution and a traditional store-then-process approach.

SQLstream's real-time machine data collection agents are lightweight Java agents that reside outside the central server. The data collection agents performed data filtering and optimized the transport of data streams using SQLstream's Streaming Data Protocol (SDP). SDP uses efficient data compression optimized for the transport of high-velocity, high-volume machine data streams.

SQLstream Blaze Best Throughput Performance

SQLstream Blaze performed at 1,350,000 records per second per 4-core Intel Xeon server, based on a record payload size of 1 KByte. This throughput per server was 15x faster than the equivalent Storm implementation. The customer's target of 10 million records per second required only 10 servers with the SQLstream solution (see Note 1). The equivalent Storm-based solution would require more than 110 servers.

SQLstream Blaze Lowest Total Cost of Ownership

SQLstream Blaze demonstrated significant cost savings, with a dramatically lower projected TCO: one third that of the alternative solution. The TCO savings came from a combination of reduced hardware and power consumption, but also from the power and simplicity of SQL over low-level Java development. The code to support the required use cases consisted of only 350 lines of commented SQL, in contrast to the significant volume of Java code required to deliver a viable operational solution on the Storm framework.

Note 1: The performance benchmark was carried out on the SQLstream Blaze 3 platform. Additional performance enhancements in the current release, SQLstream Blaze 4, will further reduce the overall server hardware requirement. Current measurements for SQLstream Blaze 4 indicate throughput in excess of 1 million records per second per CPU core, with overall latency under 10 milliseconds. SQLstream Blaze 4 has also been tested at data ingest rates into Hadoop in excess of 440 MB/second.
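The server counts quoted above can be sanity-checked with ceiling division: 10 million records per second at 1.35 million records per second per server gives a raw minimum of 8 servers (the benchmark quotes 10, which we assume includes provisioning headroom), while at one fifteenth the per-server rate the requirement exceeds 110 servers. A minimal sketch of that arithmetic:

```java
// Back-of-the-envelope check of the server counts quoted in the benchmark.
// Per-server figures come from the text: 1.35M rec/s for SQLstream Blaze,
// and 15x slower (i.e. 90K rec/s) for the Storm implementation.
public class ServerCount {

    // Ceiling division: round up to a whole number of servers.
    public static long serversNeeded(long targetRecPerSec, long perServerRecPerSec) {
        return (targetRecPerSec + perServerRecPerSec - 1) / perServerRecPerSec;
    }

    public static void main(String[] args) {
        long target = 10_000_000L;
        System.out.println(serversNeeded(target, 1_350_000L));      // 8
        System.out.println(serversNeeded(target, 1_350_000L / 15)); // 112
    }
}
```

The 112-server figure is consistent with the paper's "more than 110 servers" for the Storm-based solution.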

Conclusions

SQLstream Blaze is a real-time data hub for streaming analytics, real-time visualization and continuous integration of machine data. The SQLstream Blaze stream processing engine, s-server, is 100% SQL-compliant and can handle up to a million records per second per CPU core. Stream processing unlocks the value of high-velocity unstructured log file, sensor and other machine data, giving new levels of visibility and insight, and driving both manual and automated actions in real time.

Businesses are moving on from simple monitoring and search-based tools and are trying to understand the meaning and causes of business and system problems. This requires the ability to process high-velocity data on a massive scale. The results of this benchmark demonstrate that SQLstream Blaze scales for the most extreme high-velocity Big Data use cases while being the lowest-TCO option, even when compared with open source or freeware projects.

The advantages of SQLstream Blaze demonstrated in the performance benchmark project include:

- Scaling to a throughput of 1.35 million 1-KByte records per second per four-core server, each fed by twenty remote streaming agents.
- The expressiveness of the standards-based streaming SQL language, with support for enhanced streaming User Defined Operations (UDXes) in Java.
- Deploying new analytics on the fly, without having to stop and recompile or rebuild applications.
- Advanced pipeline operations, including data enrichment, sliding time windows, reads and writes to external data storage platforms, and other advanced time-series analytics.
- Advanced memory management, with query optimization and execution environments that utilize and recover memory efficiently.
- Higher throughput and performance per server, for lower hardware requirements, lower costs and simpler-to-maintain installations.
- A proven, mature, enterprise-grade product with a validated roadmap and controlled release schedule.

In summary, SQLstream Blaze excelled through a mature, industry-strength platform, support for standard SQL (SQL:2011) for streaming data analysis, and a flexible adapter and agent architecture. The result was class-leading performance with impressively low total cost of ownership. Using 20 remote agents feeding a single s-server instance running on a 4-core Intel Xeon server, SQLstream sustained a truly massive level of throughput: 1,350,000 records per second per 4-core server, each event having an initial payload of 1 KByte.
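The sliding time windows mentioned among the advantages above can be pictured as a rolling aggregate over the most recent events. The sketch below is plain Java, not SQLstream streaming SQL, and assumes an invented 5-second window over a generic QoS metric; it only shows the eviction-and-average mechanics that a streaming window clause performs for you.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Minimal sliding time-window aggregate: the rolling average of a metric
// over the last windowMs milliseconds of events (illustration only).
public class SlidingWindow {
    public record Event(long timestampMs, double value) {}

    private final long windowMs;
    private final Deque<Event> window = new ArrayDeque<>();
    private double sum = 0;

    public SlidingWindow(long windowMs) { this.windowMs = windowMs; }

    // Add one event and return the current windowed average.
    public double add(Event e) {
        window.addLast(e);
        sum += e.value();
        // Evict events older than the time window.
        while (window.peekFirst().timestampMs() < e.timestampMs() - windowMs) {
            sum -= window.removeFirst().value();
        }
        return sum / window.size();
    }

    // Worked example: three events, the third arriving after the first
    // has aged out of the 5-second window.
    public static double demoThirdAverage() {
        SlidingWindow w = new SlidingWindow(5_000);
        w.add(new Event(0, 10));
        w.add(new Event(1_000, 20));
        return w.add(new Event(6_000, 30)); // window now holds 20 and 30
    }

    public static void main(String[] args) {
        System.out.println(demoThirdAverage()); // 25.0
    }
}
```

In a streaming SQL engine the same computation is declared as a window over the stream rather than coded by hand, which is one source of the 350-line figure quoted earlier.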

SQLstream, Inc., 1540 Market Street, San Francisco, CA 94102, www.sqlstream.com

SQLstream powers real-time smart services for the Internet of Things. SQLstream's stream processing suite, Blaze, collects, analyzes and integrates sensor and other machine-generated data in real time, providing the real-time insight required to drive automated actions. SQLstream Blaze includes s-server, the world's fastest stream processor and the only stream processing platform built on standards-compliant streaming SQL. Blaze includes real-time visualization, industry-specific application libraries, and a full range of data collection agents and adapters for Hadoop and other enterprise data management platforms. SQLstream is the recipient of leading industry awards, including the Ventana Research Technology Innovation Award for IT Analytics and Performance. SQLstream is based in San Francisco, CA.