Stream Processing on Demand for Lambda Architectures



Similar documents
Modeling Big Data Systems by Extending the Palladio Component Model

Developing Scalable Smart Grid Infrastructure to Enable Secure Transmission System Control

Hadoop and Data Warehouse Friends, Enemies or Profiteers? What about Real Time?

Lambda Architecture. Near Real-Time Big Data Analytics Using Hadoop. January Website:

Big Data and Industrial Internet

Big Data and Data Science: Behind the Buzz Words

Real Time Fraud Detection With Sequence Mining on Big Data Platform. Pranab Ghosh Big Data Consultant IEEE CNSV meeting, May Santa Clara, CA

Hadoop Ecosystem Overview. CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook

Infomatics. Big-Data and Hadoop Developer Training with Oracle WDP

Kafka & Redis for Big Data Solutions

CA Technologies Big Data Infrastructure Management Unified Management and Visibility of Big Data

Forecast of Big Data Trends. Assoc. Prof. Dr. Thanachart Numnonda Executive Director IMC Institute 3 September 2014

Towards a Performance Model Management Repository for Component-based Enterprise Applications

Using Dynatrace Monitoring Data for Generating Performance Models of Java EE Applications

Contents. Pentaho Corporation. Version 5.1. Copyright Page. New Features in Pentaho Data Integration 5.1. PDI Version 5.1 Minor Functionality Changes

Big Data Technologies Compared June 2014

FAQs. This material is built based on. Lambda Architecture. Scaling with a queue. 8/27/2015 Sangmi Pallickara

Big Data & QlikView. Democratizing Big Data Analytics. David Freriks Principal Solution Architect

Hortonworks and ODP: Realizing the Future of Big Data, Now Manila, May 13, 2015

TABLE OF CONTENTS 1 Chapter 1: Introduction 2 Chapter 2: Big Data Technology & Business Case 3 Chapter 3: Key Investment Sectors for Big Data

Architectural patterns for building real time applications with Apache HBase. Andrew Purtell Committer and PMC, Apache HBase

Building Scalable Big Data Pipelines

Lambda Architecture for Batch and Real- Time Processing on AWS with Spark Streaming and Spark SQL. May 2015

A framework for easy development of Big Data applications

Big Data Market Size and Vendor Revenues

Big Data Analytics - Accelerated. stream-horizon.com

Big Data - Business, Math, Technology Best combination for big data 商 业 理 解, 数 据 科 学, 技 术 实 践 之 完 美 结 合

Intel HPC Distribution for Apache Hadoop* Software including Intel Enterprise Edition for Lustre* Software. SC13, November, 2013

Big Data Use Case: Business Analytics

Performance and Scalability Overview

Hadoop & Spark Using Amazon EMR

Dominik Wagenknecht Accenture

Big Data JAMES WARREN. Principles and best practices of NATHAN MARZ MANNING. scalable real-time data systems. Shelter Island

Openbus Documentation

NoSQL Data Base Basics

STeP-IN SUMMIT June 2014 at Bangalore, Hyderabad, Pune - INDIA. Performance testing Hadoop based big data analytics solutions

You should have a working knowledge of the Microsoft Windows platform. A basic knowledge of programming is helpful but not required.

Session 1: IT Infrastructure Security Vertica / Hadoop Integration and Analytic Capabilities for Federal Big Data Challenges

SPECjEnterprise2010 & Java Enterprise Edition (EE) PCM Model Generation DevOps Performance WG Meeting

CPS 516: Data-intensive Computing Systems. Instructor: Shivnath Babu TA: Zilong (Eric) Tan

How To Scale Out Of A Nosql Database

Performance and Scalability Overview

HDP Enabling the Modern Data Architecture

Streaming items through a cluster with Spark Streaming

CAPTURING & PROCESSING REAL-TIME DATA ON AWS

Enterprise Operational SQL on Hadoop Trafodion Overview

A Tour of the Zoo the Hadoop Ecosystem Prafulla Wani

Upcoming Announcements

Simplifying Big Data Analytics: Unifying Batch and Stream Processing. John Fanelli,! VP Product! In-Memory Compute Summit! June 30, 2015!!

Architecting for Big Data Analytics and Beyond: A New Framework for Business Intelligence and Data Warehousing

Talend Open Studio for Big Data. Release Notes 5.2.1

Addressing Open Source Big Data, Hadoop, and MapReduce limitations

Big Data and Apache Hadoop s MapReduce

Comprehensive Analytics on the Hortonworks Data Platform

The Internet of Things and Big Data: Intro

#mstrworld. Tapping into Hadoop and NoSQL Data Sources in MicroStrategy. Presented by: Trishla Maru. #mstrworld

Chukwa, Hadoop subproject, 37, 131 Cloud enabled big data, 4 Codd s 12 rules, 1 Column-oriented databases, 18, 52 Compression pattern, 83 84

Introduction to Hadoop

Real Time Big Data Processing

Big Data & Security. Aljosa Pasic 12/02/2015

GAIN BETTER INSIGHT FROM BIG DATA USING JBOSS DATA VIRTUALIZATION

Hurtownie Danych i Business Intelligence: Big Data

SQL + NOSQL + NEWSQL + REALTIME FOR INVESTMENT BANKS

The Top 10 7 Hadoop Patterns and Anti-patterns. Alex

BIG DATA. Using the Lambda Architecture on a Big Data Platform to Improve Mobile Campaign Management. Author: Sandesh Deshmane

Federated SQL on Hadoop and Beyond: Leveraging Apache Geode to Build a Poor Man's SAP HANA. by Christian

Mind Commerce. Commerce Publishing v3122/ Publisher Sample

YARN, the Apache Hadoop Platform for Streaming, Realtime and Batch Processing

The Future of Big Data SAS Automotive Roundtable Los Angeles, CA 5 March 2015 Mike Olson Chief Strategy Officer,

How To Understand The Business Case For Big Data

Big Data Success Step 1: Get the Technology Right

IBM InfoSphere Guardium Data Activity Monitor for Hadoop-based systems

Constructing a Data Lake: Hadoop and Oracle Database United!

Real World Big Data Architecture - Splunk, Hadoop, RDBMS

HADOOP ADMINISTATION AND DEVELOPMENT TRAINING CURRICULUM

Data Challenges in Telecommunications Networks and a Big Data Solution

An Industrial Perspective on the Hadoop Ecosystem. Eldar Khalilov Pavel Valov

A Brief Introduction to Apache Tez

Big Data Too Big To Ignore

HYPER-CONVERGED INFRASTRUCTURE STRATEGIES

STREAM PROCESSING AT LINKEDIN: APACHE KAFKA & APACHE SAMZA. Processing billions of events every day

Non-Stop Hadoop Paul Scott-Murphy VP Field Techincal Service, APJ. Cloudera World Japan November 2014

Reference Architecture, Requirements, Gaps, Roles

Data Lake In Action: Real-time, Closed Looped Analytics On Hadoop

Hadoop. MPDL-Frühstück 9. Dezember 2013 MPDL INTERN

Market Overview: Big Data Integration

Big Data Analytics for Cyber

How Companies are! Using Spark

Beyond Lambda - how to get from logical to physical. Artur Borycki, Director International Technology & Innovations

The Future of Data Management

MapReduce with Apache Hadoop Analysing Big Data

How In-Memory Data Grids Can Analyze Fast-Changing Data in Real Time

Programming Hadoop 5-day, instructor-led BD-106. MapReduce Overview. Hadoop Overview

Chase Wu New Jersey Ins0tute of Technology

Big Data Course Highlights

Ali Ghodsi Head of PM and Engineering Databricks

#TalendSandbox for Big Data

Transcription:

Madrid, 2015-09-01 Stream Processing on Demand for Lambda Architectures European Workshop on Performance Engineering (EPEW) 2015 Johannes Kroß 1, Andreas Brunnert 1, Christian Prehofer 1, Thomas A. Runkler 2, Helmut Krcmar 3 1 fortiss GmbH, 2 Siemens AG, 3 Technische Universität München fortiss GmbH An-Institut Technische Universität München

Agenda Motivation Stream Processing On Demand Experimental Validation Related Work Conclusion and Future Work 2

Agenda Motivation Stream Processing On Demand Experimental Validation Related Work Conclusion and Future Work 3

MapR MongoDB SAP VoltDB HP Vertica Apache Spark IBM Netezza splunk Voldemort Apache Flume S4 Cloudera Motivation Autonomy tableau Cassandra Hortonworks ElephantDB Apache HBase TIBCO Apache Storm Amazon Kinesis Apache Kafka Apache Hadoop EMC Greenplum Teradata Aster Pentaho Apache Samza Hana Various complementary big data technologies with different characteristics Development of complex system of systems Performance issues and high resource requirements (Brunnert et al. 2014) 4

Motivation Data Processing in the Lambda Architecture Batch layer Incoming data Apache Kafka Apache Flume Scribe *MQ Camus Double processing Kafka- Spout Data set HDFS Speed layer Apache Spark Hadoop MapReduce Apache Storm Apache Samza Apache Spark Streaming Amazon Kinesis View ElephantDB Voldemort Impala Apache HBase Cassandra Redis View Merge Query Query Query Apache HBase adapted from Marz and Warren (2015) Enable real-time queries on big data Design principles: + Data immutability - Resource requirements + Recomputation + Fault-tolerance 5

Agenda Motivation Stream Processing On Demand Experimental Validation Related Work Conclusion and Future Work 6

Stream Processing On Demand A Novel Approach 1 Lambda architecture Batch process 2 Decision-making model Stream process on demand Iterative Procedure: 1) Regular batch iteration (in parallel with stream process) 2) Decision-making model Decide if stream processing is additionally required in the next batch iteration 7

Stream Processing On Demand Chronological Sequence of Batch and Stream Processes Decision point whether batch process k will exceed time-constraint and stream processes j and k are demanded Batch process i time < x Batch process j time < y Batch process k time < z Stream process j time y Stream process k time z x y z time 8

Stream Processing On Demand Decision-making Model Start Batch process (and stream process) are completed Load intensity Time constraint Predict response time of second next batch process Will the batch process exceed time constraints? Yes or No Start new batch process Yes Start new stream process End 9

Agenda Motivation Stream Processing On Demand Experimental Validation Related Work Conclusion and Future Work 10

Experimental Validation Smart Energy Use Case Background: Batch processing... Sensor data Stream processing Q u e r i e s Wind turbines (WT) measure several thousand parameters Wind turbine availabilities (WTA) lie between 67.4% and 99% (Faulstich et al. 2011) Assumption: Dependent on the WTA, WT either produce a constant set of data or none variable amount of monitoring data Example use case: Data are analyzed for bargaining power at the European power exchange time-constraint of 15-minutes for the continuous intraday spot market Development of a data generator to produce random WT monitoring data: id, timestamp, power, param1,... paramn 12, 2015-04-01 08:23:04.125, 12.67, value1,... value1 15, 2015-04-01 08:23:03.973, 13.49, value2,... value2 13, 2015-04-01 08:23:04.096, 12.59, value3,... value3... 11

Experimental Validation Implementation of the Batch Layer Apache Hadoop Hadoop Distributed File System (HDFS) to store data sets Hadoop MapReduce for batch processing (single node cluster in pseudo-distributed mode) Sample analytic batch process: Simple moving average algorithm for a MapReduce job: Map function pseudo code: map ( Object key1, String value1 ): // key1 : file name // value1 : measurements of wind turbines of one farm for each line l in value : kv = parse (l) emit ({ kv.id, kv. timestamp }, {kv. timestamp, kv. power }) Reduce function pseudo code: reduce ( Object key, Iterator < object > values ): // key: an object containing id and timestamp // values : power values ordered by timestamp result = simplemovingaverage ( values ) emit (id, result ) 12

Experimental Validation Decision-Making Model & Performance Model Prototype Palladio Component Model (PCM) to predict the response time of batch processes Integrated measurements for CPU resource demands Repository model Service effect specification (SEFF) <processjob> 13

Experimental Validation Controlled Experiment Generation of monitoring data for 10 wind farms with 100 WT each Configuration of a sliding window of 24 hours for the MapReduce job Naïve forecast to forecast the workload of the next batch iteration Assumption of different WTA to simulate variable load intensities Experimental Results Scenario WTA Fluctuation PRT MRT RE 85 % ± 0 % 12.78 minutes 12.17 minutes 5.01 % 1 2 3 90 % ± 0 % 13.53 minutes 13.60 minutes 0.51 % 95 % ± 0 % 14.28 minutes 15.47 minutes 7.69 % 85 % + 5 % 12.78 minutes 13.82 minutes 7.53 % 90 % + 5 % 13.53 minutes 15.03 minutes 9.98 % 90 % - 5 % 13.53 minutes 12.58 minutes 7.55 % 95 % - 5 % 14.28 minutes 13.17 minutes 8.43 % 14

Agenda Motivation Stream Processing On Demand Experimental Validation Related Work Conclusion and Future Work 15

Related Work Lambda architecture Martinez-Prieto et al. (2015) adapt the lambda architecture for semantic data Casado and Younas (2015) give an extensive review about related technologies Aniello et al. (2013) & Rychlý et al. (2015) focus on scheduling stream processes Alrokayan et al. (2015) concentrate on scheduling batch processes Several approaches exist to simplify implementing the lambda architecture storm-yarn 1 and Nabi et al. (2014) integrate different stream processing technologies in the Apache Hadoop environment Summingbird 2 is an open source library to write algorithms that can be used for batch as well as stream processing Prediction of batch processes Barbierato et al. (2014), Verma et al. (2011), Vianna et al. (2013) provide modeling approaches to predict response times of single MapReduce jobs Castiglione et al. (2014) model the performance for big data applications in cloud infrastructures 1 https://github.com/yahoo/storm-yarn 2 https://github.com/twitter/summingbird 16

Agenda Motivation Stream Processing On Demand Experimental Validation Related Work Conclusion and Future Work 17

Conclusion and Future Work We introduced a novel approach to... use resources more efficiently and reduce hardware costs when implementing the lambda architecture We plan to... integrate stream processing fully automate our approach apply different workload forecasting techniques conduct a more comprehensive evaluation 18

References Alrokayan, M., Vahid Dastjerdi, A., Buyya, R.: Sla-aware provisioning and scheduling of cloud resources for big data analytics. In: Proceedings of the 2014 IEEE International Conference on Cloud Computing in Emerging Markets. pp. 1-8. IEEE (2014) Aniello, L., Baldoni, R., Querzoni, L.: Adaptive online scheduling in storm. In: Proceedings of the 7th ACM International Conference on Distributed Event-based Systems. pp. 207-218. ACM, New York, NY, USA (2013) Barbierato, E., Gribaudo, M., Iacono, M.: Performance evaluation of nosql big-data applications using multi-formalism models. Future Generation Computer Systems 37(0), 345-353 (2014) Brunnert, A., Vögele, C., Danciu, A., Pfaff, M., Mayer, M., Krcmar, H.: Performance management work. Business & Information Systems Engineering 6(3), 177-179 (2014) Casado, R., Younas, M.: Emerging trends and technologies in big data processing. Concurrency and Computation: Practice and Experience 27(8), 2078-2091 (2015) Castiglione, A., Gribaudo, M., Iacono, M., Palmieri, F.: Modeling performances of concurrent big data applications. Software: Practice and Experience (2014) Faulstich, S., Hahn, B., Tavner, P.J.: Wind turbine downtime and its importance for offshore deployment. Wind Energy 14(3), 327-337 (2011) Martnez-Prieto, M.A., Cuesta, C.E., Arias, M., Fernnde, J.D.: The solid architecture for real-time management of big semantic data. Future Generation Computer Systems 47, 62-79 (2015), special Section: Advanced Architectures for the Future Generation of Software-Intensive Systems Marz, N., Warren, J.: Big data: principles and best practices of scalable real-time data systems. Manning Publications Co. (2015) Nabi, Z., Wagle, R., Bouillet, E.: The best of two worlds: integrating ibm infosphere streams with apache yarn. In: Proceedings of the 2014 IEEE International Conference on Big Data. pp. 47-51. IEEE (2014) Rychly, M., Skoda, P., Smrz, P.: Heterogeneity-aware scheduler for stream processing frameworks. International Journal of Big Data Intelligence 2(2), 70-80 (2015) Verma, A., Cherkasova, L., Campbell, R.H.: Aria: automatic resource inference and allocation for mapreduce environments. In: Proceedings of the 8th ACM International Conference on Autonomic Computing. pp. 235-244. ACM, New York, NY, USA (2011) Vianna, E., Comarela, G., Pontes, T., Almeida, J., Almeida, V., Wilkinson, K., Kuno, H., Dayal, U.: Analytical performance models for mapreduce workloads. International Journal of Parallel Programming 41(4), 495-525 (2013) 19

Q&A Johannes Kroß kross@fortiss.org performancegroup@fortiss.org pmw.fortiss.org 20