SQLstream Blaze and Apache Storm A BENCHMARK COMPARISON




The V of Big Data

"Velocity means both how fast data is being produced and how fast the data must be processed to meet demand." (Gartner)

The emergence of Big Data has defined one of the most exciting eras in the evolution of IT. Apache Hadoop has been at the forefront, driving cost-effective analysis of unstructured and semi-structured data. However, Hadoop was designed for processing stored data in batches. The continuous analysis of Big Data in real time requires a different approach.

Processing data as streams enables the real-time value of Big Data to be extracted before the data are stored. Automated actions and visualizations can be driven by real-time analytics over the data in motion, before the data are streamed through to Hadoop and other data storage platforms. Applications for stream processing have emerged across many industries, including telecommunications, transportation, cybersecurity, oil and gas, financial services and the Internet of Things. The growth of machine data (data generated by sensors, devices, networks and applications) is driving the need for real-time action and analysis on data arriving at rates exceeding many millions of records per second.

This paper documents a performance benchmark and ROI analysis carried out by a customer in the telecoms sector. The requirement was to detect time-based patterns in 4G/LTE network performance data that were predictors of potential QoS failures. The throughput requirement was to process 10 million records per second, generating results with latency in the low milliseconds.

SQLstream Blaze or Apache Storm

The customer was concerned with applying increasingly complex business logic across multiple sources of network element and radio tower data streams, with the goal of aggregating and analyzing the data payloads with near-zero latency.

Goal: to increase the quality of service in a dynamic and immediate manner, ensuring robust cellular service and eliminating service-impacting events.

The customer identified SQLstream Blaze and the open source Apache Storm framework as candidates for an in-depth evaluation. Although both can be considered stream processing platforms, there are also significant differences. As the benchmark comparison highlighted, although Apache Storm can be downloaded at no cost, factors such as hardware performance and requirements, development effort and expert support were critical considerations in determining total system cost and overall value.

The results demonstrated that the SQLstream Blaze real-time data hub, with its core stream processing platform s-server, performed 15x faster with a significantly lower Total Cost of Ownership (TCO) as projected over the lifetime of the system. The TCO savings were the result of significantly less hardware for the same performance, and of faster time-to-value with significantly less development effort.

Real-time Performance Monitoring with Streaming Analytics

The business objective was to increase the quality of service (QoS) in a dynamic and immediate manner, ensuring cellular transmissions could be made more robust and eliminating, as far as possible, negative events such as dropped calls and low-quality voice paths.

The customer was concerned with the increasingly complex business logic, coupled with the need for low-latency aggregation and analysis across multiple sources of network element and radio tower data streams. The traditional data management approach of storing the data before processing could not deliver the low-latency analytics required from the high-volume, high-velocity data streams: a call or data transmission would already have failed by the time the event was identified.

In addition, management network architectures for modern 4G/LTE radio towers require multiple regional data centers. The massive data volumes for this particular use case would have required systems to be implemented at each tower site, with each tower's pre-processed data then aggregated in the relevant regional data center. Deploying potentially numerous systems in diverse and often remote geographic locations was cost-prohibitive.

Any solution must be able to handle the constant high-volume data payload traffic, and scale out during large spikes in traffic volume. It must also handle different data structures and formats, as well as operational differences such as legacy equipment and differences in device firmware or software versions. Addressing the large number of system platform permutations and delivering a normalized flow of data at high volume with low latency was also a prime consideration. Flexibility in the field would be paramount.

The customer decided the most appropriate data management architecture was to build and deploy a stream processing application. The high-level architecture would require remote data collection agents to capture and stream performance data to a single central platform. The central platform must be able to scale dynamically up to the peak forecast load of 10 million records per second. Data must also be filtered, parsed and enhanced dynamically as part of the real-time pipeline flow. Aggregated and streaming intelligence feeds must be delivered continuously to existing non-real-time data warehouses and operational applications.
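The filter, parse and enrich stages described above can be pictured in ordinary Java. Everything here is a hypothetical illustration, not SQLstream or customer code: the `towerId,signal` CSV layout, the `QosRecord` type and the `TOWER_TO_REGION` lookup table are all invented for the sketch.

```java
import java.util.*;
import java.util.stream.*;

// Hypothetical sketch of a collection-agent pipeline: filter malformed
// input, parse raw lines, enrich with reference data, aggregate per region.
public class AgentPipeline {

    // A parsed performance record (names are illustrative only).
    public record QosRecord(String towerId, double signalDb, String region) {}

    static final Map<String, String> TOWER_TO_REGION =
        Map.of("T1", "west", "T2", "west", "T3", "east");

    // Parse a raw "towerId,signalDb" line, dropping malformed lines
    // and unknown towers, and enrich with the tower's region.
    public static Optional<QosRecord> parse(String rawLine) {
        String[] parts = rawLine.split(",");
        if (parts.length != 2) return Optional.empty();   // filter malformed input
        String region = TOWER_TO_REGION.get(parts[0]);
        if (region == null) return Optional.empty();      // filter unknown towers
        return Optional.of(new QosRecord(parts[0], Double.parseDouble(parts[1]), region));
    }

    // Aggregate average signal strength per region, as a regional stage
    // might before forwarding a reduced feed to the central platform.
    public static Map<String, Double> averageByRegion(List<String> rawLines) {
        return rawLines.stream()
            .map(AgentPipeline::parse)
            .flatMap(Optional::stream)
            .collect(Collectors.groupingBy(QosRecord::region,
                     Collectors.averagingDouble(QosRecord::signalDb)));
    }

    public static void main(String[] args) {
        List<String> raw = List.of("T1,-80", "T2,-90", "T3,-70", "garbage");
        System.out.println(averageByRegion(raw)); // prints the per-region averages
    }
}
```

The point of the sketch is the shape of the pipeline: each stage is a pure transformation over the stream, so stages can run at the agent, at a regional site, or centrally without changing the logic.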

Overview of the Apache Storm Implementation

Apache Storm is a distributed data stream processing framework available under the Apache open source license. The Storm data processing architecture is similar in concept to a Hadoop data storage cluster, replacing Hadoop's MapReduce infrastructure for processing static data with Storm topologies for processing data streams. A Storm topology consists of Spouts (data sources) and Bolts (nodes for insertion of stream processing logic). Bolts are commonly written in Java, but in principle can be defined in any coding language.

Apache Storm Solution Development

The Storm framework requires significant development effort to deliver a complete, operational application, in particular the Java-based stream processing libraries to address the analytics and data aggregation, and the integration adapters for external systems and data feeds. The resulting solution required a number of additional coding steps to produce an operationally viable solution based on the latest release of the project software, including:

- Integration of the Storm messaging middleware technology with the Java-based stream processing library.
- Development of the data aggregation and analytics as Java extensions to the core project framework.
- Development of data integration adapters.

Apache Storm Performance and Total Cost of Ownership

Delivering an operational solution therefore required considerable bespoke coding. Three further considerations also contributed to the higher overall TCO:

- Lower performance per server required a significantly higher number of servers to reach the target throughput, driving higher costs for hardware, power, cooling and solution support.
- New or updated analytics required the core engine to be stopped and restarted, impacting operational service level agreements and driving higher maintenance costs.
- Higher ongoing support and maintenance costs for custom code over the lifetime of the project.
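The Spout/Bolt model described above can be illustrated with a deliberately tiny, single-process sketch. This is not the Apache Storm API (real topologies are distributed, asynchronous, and wired up via Storm's topology builder); it only shows the shape of a spout feeding a chain of bolts, with all names invented for the example.

```java
import java.util.*;
import java.util.function.*;

// Toy analogy of a Storm topology using only the JDK: a "spout" is a
// tuple source, each "bolt" a processing stage applied to the stream.
public class MiniTopology {

    // Emits a batch of tuples (real spouts emit continuously).
    public interface Spout { List<String> nextBatch(); }

    // Transforms a batch of input tuples into output tuples.
    public interface Bolt extends Function<List<String>, List<String>> {}

    // Run the spout's output through a chain of bolts, as a topology would.
    public static List<String> run(Spout spout, List<Bolt> bolts) {
        List<String> tuples = spout.nextBatch();
        for (Bolt bolt : bolts) tuples = bolt.apply(tuples);
        return tuples;
    }

    public static void main(String[] args) {
        Spout source = () -> List.of("drop", "ok", "drop");
        Bolt filter  = in -> in.stream().filter(t -> !t.equals("drop")).toList();
        Bolt tag     = in -> in.stream().map(t -> t + ":checked").toList();
        System.out.println(run(source, List.of(filter, tag))); // [ok:checked]
    }
}
```

Even in this stripped-down form, the division of labor is visible: all analytics logic lives inside hand-written bolt code, which is exactly the bespoke Java development effort the benchmark team had to budget for.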

SQLstream Blaze and Apache Storm Compared

The code to handle all streaming pipelines consisted of only 350 lines of commented SQL, driving the lowest TCO and further easing the ongoing as-deployed maintenance and support of complex applications in the field.

SQLstream provided the customer with the SQLstream Blaze platform plus the supporting developer and user documentation. The customer team was able to quickly develop prototypes for several different business use cases. SQLstream's technical support team provided support when requested, along with suggestions for solution optimization, in particular guidance on the architectural differences between implementing a stream processing solution and a traditional store-then-process approach.

SQLstream's real-time machine data collection agents are lightweight Java agents that reside outside the central server. The data collection agents performed data filtering and optimized the transport of data streams using SQLstream's Streaming Data Protocol (SDP). SDP uses efficient data compression optimized for the transport of high-velocity, high-volume machine data streams.

SQLstream Blaze Best Throughput Performance

SQLstream Blaze performed at 1,350,000 records per second per 4-core Intel Xeon server, based on a record payload size of 1 KByte. This throughput per server was 15x faster than the equivalent Storm implementation. The customer's target of 10 million records per second required only 10 servers with the SQLstream solution (see Note 1). The equivalent Storm-based solution would require more than 110 servers.

SQLstream Blaze Lowest Total Cost of Ownership

SQLstream Blaze demonstrated significant cost savings, with a dramatically lower projected TCO: one third that of the alternative solution. The TCO savings came from a combination of reduced hardware and power consumption, but also from the power and simplicity of SQL over low-level Java development. The code to support the required use cases consisted of only 350 lines of commented SQL, in contrast to the significant volume of Java code required to deliver a viable operational solution on the Storm framework.

Note 1: The performance benchmark was carried out on the SQLstream Blaze 3 platform. Additional performance enhancements in the current release, SQLstream Blaze 4, will further reduce the overall server hardware requirement. Current measurements for SQLstream Blaze 4 indicate throughput in excess of 1 million records per second per CPU core, with overall latency under 10 milliseconds. SQLstream Blaze 4 has also been tested at data ingest rates into Hadoop in excess of 440 MB/second.
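The server counts quoted above can be sanity-checked with ceiling division: 10 million records per second at 1.35 million records per second per server gives a raw minimum of 8 servers (the benchmark quotes 10, which we assume includes provisioning headroom), while at one fifteenth the per-server rate the requirement exceeds 110 servers. A minimal sketch of that arithmetic:

```java
// Back-of-the-envelope check of the server counts quoted in the benchmark.
// Per-server figures come from the text: 1.35M rec/s for SQLstream Blaze,
// and 15x slower (i.e. 90K rec/s) for the Storm implementation.
public class ServerCount {

    // Ceiling division: round up to a whole number of servers.
    public static long serversNeeded(long targetRecPerSec, long perServerRecPerSec) {
        return (targetRecPerSec + perServerRecPerSec - 1) / perServerRecPerSec;
    }

    public static void main(String[] args) {
        long target = 10_000_000L;
        System.out.println(serversNeeded(target, 1_350_000L));      // 8
        System.out.println(serversNeeded(target, 1_350_000L / 15)); // 112
    }
}
```

The 112-server figure is consistent with the paper's "more than 110 servers" for the Storm-based solution.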

Conclusions

SQLstream Blaze is a real-time data hub for streaming analytics, real-time visualization and continuous integration of machine data. The SQLstream Blaze stream processing engine, s-server, is 100% SQL-compliant and can handle up to a million records per second per CPU core. Stream processing unlocks the value of high-velocity unstructured log file, sensor and other machine data, giving new levels of visibility and insight, and driving both manual and automated actions in real time.

Businesses are moving on from simple monitoring and search-based tools and are trying to understand the meaning and causes of business and system problems. This requires the ability to process high-velocity data on a massive scale. The results of this benchmark demonstrate that SQLstream Blaze scales for the most extreme high-velocity Big Data use cases while being the lowest-TCO option, even when compared with open source or freeware projects.

The advantages of SQLstream Blaze demonstrated in the performance benchmark project include:

- Scaling to a throughput of 1.35 million 1-KByte records per second per four-core server, each fed by twenty remote streaming agents.
- The expressiveness of the standards-based streaming SQL language, with support for enhanced streaming User Defined Operations (UDXes) in Java.
- Deploying new analytics on the fly, without having to stop and recompile or rebuild applications.
- Advanced pipeline operations, including data enrichment, sliding time windows, reads and writes to external data storage platforms, and other advanced time-series analytics.
- Advanced memory management, with query optimization and execution environments that utilize and recover memory efficiently.
- Higher throughput and performance per server, for lower hardware requirements, lower costs and simpler-to-maintain installations.
- A proven, mature, enterprise-grade product with a validated roadmap and controlled release schedule.

In summary, SQLstream Blaze excelled through a mature, industry-strength platform, support for standard SQL (SQL:2011) for streaming data analysis, and a flexible adapter and agent architecture. The result was class-leading performance with impressively low total cost of ownership. Using 20 remote agents feeding a single s-server instance running on a 4-core Intel Xeon server, SQLstream sustained a truly massive level of throughput: 1,350,000 records per second per 4-core server, each event having an initial payload of 1 KByte.
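The sliding time windows mentioned among the advantages above can be pictured as a rolling aggregate over the most recent events. The sketch below is plain Java, not SQLstream streaming SQL, and assumes an invented 5-second window over a generic QoS metric; it only shows the eviction-and-average mechanics that a streaming window clause performs for you.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Minimal sliding time-window aggregate: the rolling average of a metric
// over the last windowMs milliseconds of events (illustration only).
public class SlidingWindow {
    public record Event(long timestampMs, double value) {}

    private final long windowMs;
    private final Deque<Event> window = new ArrayDeque<>();
    private double sum = 0;

    public SlidingWindow(long windowMs) { this.windowMs = windowMs; }

    // Add one event and return the current windowed average.
    public double add(Event e) {
        window.addLast(e);
        sum += e.value();
        // Evict events older than the time window.
        while (window.peekFirst().timestampMs() < e.timestampMs() - windowMs) {
            sum -= window.removeFirst().value();
        }
        return sum / window.size();
    }

    // Worked example: three events, the third arriving after the first
    // has aged out of the 5-second window.
    public static double demoThirdAverage() {
        SlidingWindow w = new SlidingWindow(5_000);
        w.add(new Event(0, 10));
        w.add(new Event(1_000, 20));
        return w.add(new Event(6_000, 30)); // window now holds 20 and 30
    }

    public static void main(String[] args) {
        System.out.println(demoThirdAverage()); // 25.0
    }
}
```

In a streaming SQL engine the same computation is declared as a window over the stream rather than coded by hand, which is one source of the 350-line figure quoted earlier.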

SQLstream, Inc., 1540 Market Street, San Francisco, CA 94102, www.sqlstream.com

SQLstream powers real-time smart services for the Internet of Things. SQLstream's stream processing suite, Blaze, collects, analyzes and integrates sensor and other machine-generated data in real time, providing the real-time insight required to drive automated actions. SQLstream Blaze includes s-server, the world's fastest stream processor and the only stream processing platform built on standards-compliant streaming SQL. Blaze includes real-time visualization, industry-specific application libraries, and a full range of data collection agents and adapters for Hadoop and other enterprise data management platforms. SQLstream is the recipient of leading industry awards, including the Ventana Research Technology Innovation Award for IT Analytics and Performance. SQLstream is based in San Francisco, CA.