Modeling Big Data Systems by Extending the Palladio Component Model
6th Symposium on Software Performance (SSP) 2015
München, 2015-11-06

Johannes Kroß¹, Andreas Brunnert¹, Helmut Krcmar²
¹ fortiss GmbH, ² Technische Universität München
fortiss GmbH, An-Institut Technische Universität München
Agenda
- Motivation
- Development Process and Characteristics of Big Data Systems
- Palladio Component Model (PCM) Meta-model Extension
- Related Work
- Conclusion and Future Work
Motivation
[Logo collage: a landscape of big data technologies, e.g., Apache Hadoop, Apache Spark, Apache Storm, Apache Kafka, Apache Samza, Apache Flume, Apache HBase, Cassandra, MongoDB, VoltDB, Voldemort, ElephantDB, S4, Amazon Kinesis, Cloudera, Hortonworks, MapR, EMC Greenplum, Teradata Aster, IBM Netezza, HP Vertica, SAP HANA, TIBCO, Pentaho, splunk, and tableau]
- Various big data technologies with different characteristics
- Casado and Younas (2015) list two main techniques that are common for big data systems, namely, batch and stream processing
Motivation
- The added value of big data systems for organizations depends on the performance of such systems (Barbierato et al. 2014)
- Performance models allow for proactive evaluations of these systems
- Existing performance meta-models for big data systems, however, focus either on
  - one processing paradigm, such as stream processing (e.g., Ginis and Strom 2013), or
  - one technology, such as Apache Hadoop MapReduce (e.g., Ge et al. 2013)
- We propose a general performance meta-model to specify shared characteristics of big data systems
Development Process of Big Data Systems
Component developers

Batch processing (e.g., using Apache MapReduce):

    public void map(Object key, Text value, ..) .. {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, one);
        }
    }

    public void reduce(Text key, Iterable<IntWritable> values, ..) .. {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        result.set(sum);
        context.write(key, result);
    }

Stream processing (e.g., using Apache Storm):

    public void execute(Tuple tuple, BasicOutputCollector collector) {
        String word = tuple.getString(0);
        Integer count = counts.get(word);
        if (count == null) count = 0;
        count++;
        counts.put(word, count);
        collector.emit(new Values(word, count));
    }
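To illustrate how such operations are distributed at runtime, the following is a minimal sketch of wiring stream operations into an Apache Storm topology (0.9.x-era API). SentenceSpout, SplitSentenceBolt, and WordCountBolt are hypothetical stand-ins for components like the bolt above; the numeric arguments are parallelism hints, i.e., how many executors Storm starts per operation:

    import backtype.storm.Config;
    import backtype.storm.LocalCluster;
    import backtype.storm.topology.TopologyBuilder;
    import backtype.storm.tuple.Fields;

    public class WordCountTopology {
        public static void main(String[] args) throws Exception {
            // SentenceSpout, SplitSentenceBolt, and WordCountBolt are
            // hypothetical components; the numeric arguments are
            // parallelism hints, i.e., the number of executors Storm
            // starts for each operation.
            TopologyBuilder builder = new TopologyBuilder();
            builder.setSpout("sentences", new SentenceSpout(), 2);
            builder.setBolt("split", new SplitSentenceBolt(), 4)
                   .shuffleGrouping("sentences");
            builder.setBolt("count", new WordCountBolt(), 8)
                   .fieldsGrouping("split", new Fields("word"));

            // Run locally for testing; a production deployment would
            // use StormSubmitter instead of LocalCluster.
            LocalCluster cluster = new LocalCluster();
            cluster.submitTopology("word-count", new Config(),
                    builder.createTopology());
        }
    }

The same component code can thus run with very different degrees of parallelism, which is exactly what a performance model of such a system has to capture explicitly.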
Development Process of Big Data Systems
System deployers

Resource environment (e.g., Apache YARN):
[Diagram: a Client submits an application to the Resource Manager; on each Node, a Node Manager hosts Containers running the Application Master, Map Tasks, and Reduce Tasks]
Characteristics of Big Data Systems
Based on the findings of previous work (Kroß et al. 2015), we derive the following requirements of big data systems that we propose to implement:

1. Distribution and parallelization of operations
   Component developers specify reusable software components consisting of operations using software frameworks like Apache Spark. In doing so, they may specify the definite number of simultaneous and/or total executions of an operation, but may also not know it in advance (see the sketch after this list).

2. Clustering of resource containers
   System deployers specify resource containers with resource roles (e.g., master or worker nodes), connect them via a shared network, and logically group them into a computer cluster.
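As an illustration of requirement 1, consider a word count in the Spark 1.x Java API: the component developer only declares the operations, while the number of parallel and total task executions is determined at runtime from the input partitions and the available executors. A minimal sketch; the application name and HDFS paths are placeholders:

    import java.util.Arrays;

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    import scala.Tuple2;

    public class SparkWordCount {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf().setAppName("word-count");
            JavaSparkContext sc = new JavaSparkContext(conf);

            // The developer declares the operations; Spark decides at
            // runtime how many tasks execute them in parallel, based on
            // the number of input partitions and available executors.
            JavaRDD<String> lines = sc.textFile("hdfs:///input/text");
            JavaPairRDD<String, Integer> counts = lines
                    .flatMap(line -> Arrays.asList(line.split(" ")))
                    .mapToPair(word -> new Tuple2<>(word, 1))
                    .reduceByKey((a, b) -> a + b);
            counts.saveAsTextFile("hdfs:///output/counts");

            sc.stop();
        }
    }

This runtime-dependent degree of parallelism is the uncertainty that the meta-model extension presented next makes explicit.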
PCM Meta-model Extension
Service effect specification (SEFF) actions (PCM version 3.4.1)
[Class diagram: the extension adds a DistributedCallAction with the attributes totalForkCount : Integer and simultaneousForkCount : Integer to the existing SEFF action hierarchy of AbstractAction, AbstractInternalControlFlowAction, CallAction, CallReturnAction, ExternalCallAction (retryCount : Integer), InternalCallAction, and SetVariableAction; call actions reference an OperationSignature and an OperationRequiredRole and carry VariableUsage elements]
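Since PCM is an EMF-based meta-model, the new action would surface in generated model code roughly as sketched below. This is an illustrative sketch only, assuming DistributedCallAction specializes CallAction as suggested by the diagram; it is not the actual PCM source:

    // Illustrative sketch of the generated EMF model interface for the
    // new SEFF action; assumed to specialize CallAction, not the
    // actual PCM source.
    public interface DistributedCallAction extends CallAction {

        // Total number of executions of the called operation,
        // e.g., one per data partition.
        int getTotalForkCount();
        void setTotalForkCount(int value);

        // Number of executions that run simultaneously,
        // e.g., bounded by the available worker resources.
        int getSimultaneousForkCount();
        void setSimultaneousForkCount(int value);
    }

Leaving both counts as model parameters covers the case from requirement 1 in which developers cannot know the definite number of executions in advance.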
PCM Meta-model Extension
Resource environment (PCM version 3.4.1)
[Class diagram: a ResourceEnvironment contains ResourceContainers and LinkingResources; ResourceContainers hold ProcessingResourceSpecifications, and each LinkingResource a CommunicationLinkResourceSpecification. The extension adds a ClusterResourceSpecification with the attributes resourceRole : ResourceRole and actionSchedulingPolicy : SchedulingPolicy; the enumeration ResourceRole comprises CLUSTER, MASTER, and WORKER, and the enumeration SchedulingPolicy comprises DELAY, PROCESSOR_SHARING, FCFS, and ROUND_ROBIN]
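Likewise, an illustrative sketch (again not the actual generated PCM code) of the enumerations and the new specification class from the diagram above:

    // Illustrative sketch of the resource environment extension; the
    // names follow the diagram, but this is not the actual PCM code.
    enum ResourceRole { CLUSTER, MASTER, WORKER }

    enum SchedulingPolicy { DELAY, PROCESSOR_SHARING, FCFS, ROUND_ROBIN }

    public interface ClusterResourceSpecification {

        // Role of the annotated resource container within the cluster.
        ResourceRole getResourceRole();
        void setResourceRole(ResourceRole value);

        // Policy used to schedule distributed actions across the
        // worker containers of the cluster.
        SchedulingPolicy getActionSchedulingPolicy();
        void setActionSchedulingPolicy(SchedulingPolicy value);
    }

Annotating resource containers with a role and a scheduling policy is what lets system deployers model a computer cluster (requirement 2) without enumerating its behavior per node.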
PCM Meta-model Extension
[Figure: service effect specification (SEFF) diagram]
PCM Meta-model Extension
[Figure: resource environment diagram]
Related Work
- Ginis and Strom (2013) present a method for predicting the response time of stream processes in distributed systems
- Verma et al. (2011) introduce the ARIA framework, which focuses on resource allocation and scheduling strategies for single Apache MapReduce jobs
- Vianna et al. (2013) propose an analytical performance model that focuses on the pipeline between map and reduce jobs
- Barbierato et al. (2013) and Ge et al. (2013) present modeling techniques for Apache MapReduce that only allow response times to be estimated
- Castiglione et al. (2014) use Markovian agents and mean field analysis to model big data batch applications and to provide information about the performance of cloud-based data processing architectures
Conclusion and Future Work
- We introduced a modeling approach that allows essential characteristics of data processing, as found in big data systems, to be modeled
- We presented two meta-model extensions for PCM:
  - to model a computer cluster and
  - to apply distributed and parallel operations on this cluster
- We plan to:
  - complete the extension of the simulation framework SimuCom
  - fully evaluate our extensions for up- and downscaling scenarios
  - automatically derive performance models based on measurement data
References
Barbierato, E., Gribaudo, M., Iacono, M.: Performance evaluation of NoSQL big-data applications using multi-formalism models. Future Generation Computer Systems 37, 345-353 (2014)
Casado, R., Younas, M.: Emerging trends and technologies in big data processing. Concurrency and Computation: Practice and Experience 27(8), 2078-2091 (2015)
Castiglione, A., Gribaudo, M., Iacono, M., Palmieri, F.: Modeling performances of concurrent big data applications. Software: Practice and Experience (2014)
Ge, S., Zide, M., Huet, F., Magoules, F., Lei, Y., Xuelian, L.: A Hadoop MapReduce performance prediction method. In: Proceedings of the IEEE 10th International Conference on High Performance Computing and Communications & 2013 IEEE International Conference on Embedded and Ubiquitous Computing, pp. 820-825 (2013)
Ginis, R., Strom, R.E.: Method for predicting performance of distributed stream processing systems. US Patent 8,499,069, https://www.google.com/patents/us8499069 (2013)
Kroß, J., Brunnert, A., Prehofer, C., Runkler, T., Krcmar, H.: Stream processing on demand for lambda architectures. In: Beltrán, M., Knottenbelt, W., Bradley, J. (eds.) Computer Performance Engineering, vol. 9272, pp. 243-257. Springer International Publishing (2015)
Verma, A., Cherkasova, L., Campbell, R.H.: ARIA: automatic resource inference and allocation for MapReduce environments. In: Proceedings of the 8th ACM International Conference on Autonomic Computing, pp. 235-244. ACM, New York, NY, USA (2011)
Vianna, E., Comarela, G., Pontes, T., Almeida, J., Almeida, V., Wilkinson, K., Kuno, H., Dayal, U.: Analytical performance models for MapReduce workloads. International Journal of Parallel Programming 41(4), 495-525 (2013)
Q&A
Johannes Kroß
kross@fortiss.org
performancegroup@fortiss.org
pmw.fortiss.org