Performance analysis model for big data applications in cloud computing




Bautista Villalpando et al. Journal of Cloud Computing: Advances, Systems and Applications 2014, 3:19 RESEARCH Open Access

Performance analysis model for big data applications in cloud computing

Luis Eduardo Bautista Villalpando 1,2, Alain April 2 and Alain Abran 2

Abstract

The foundation of Cloud Computing is sharing computing resources that are dynamically allocated and released per demand with minimal management effort. Most of the time, computing resources such as processors, memory and storage are allocated through commodity hardware virtualization, which distinguishes cloud computing from other technologies. One of the objectives of this technology is processing and storing very large amounts of data, which are also referred to as Big Data. Sometimes, anomalies and defects found in Cloud platforms affect the performance of Big Data Applications, resulting in degradation of Cloud performance. One of the challenges in Big Data is how to analyze the performance of Big Data Applications in order to determine the main factors that affect their quality. The performance analysis results are very important because they help to detect the source of the degradation of the applications as well as of the Cloud. Furthermore, such results can be used in future resource planning stages, at the time of designing Service Level Agreements, or simply to improve the applications. This paper proposes a performance analysis model for Big Data Applications, which integrates software quality concepts from ISO 25010. The main goal of this work is to fill the gap that exists between the quantitative (numerical) representation of quality concepts of software engineering and the measurement of the performance of Big Data Applications. To this end, the use of statistical methods is proposed to establish relationships between the performance measures extracted from Big Data Applications and Cloud Computing platforms and the software engineering quality concepts.
Keywords: Cloud computing; Big data; Analysis; Performance; Relief algorithm; Taguchi method; ISO 25010; Maintenance; Hadoop MapReduce

Introduction

According to ISO subcommittee 38, the CC study group, Cloud Computing (CC) is a paradigm for enabling ubiquitous, convenient, on-demand network access to a shared pool of configurable cloud resources accessed through services which can be rapidly provisioned and released with minimal management effort or service provider interaction [1]. One of the challenges in CC is how to process and store large amounts of data (also known as Big Data, BD) in an efficient and reliable way. ISO subcommittee 32, the Next Generation Analytics and Big Data study group, refers to Big Data as the transition from structured data and traditional analytics to the analysis of complex information of many types. Moreover, the group mentions that Big Data exploits cloud resources to manage the large data volumes extracted from multiple sources [2]. In December 2012, the International Data Corporation (IDC) stated that, by the end of 2012, the total data generated was 2.8 Zettabytes (ZB) (2.8 trillion Gigabytes). Furthermore, the IDC predicts that the total data generated by 2020 will be 40 ZB. This is roughly equivalent to 5.2 terabytes (TB) of data generated by every human being alive in that year [3]. Big Data Applications (BDA) are a way to process part of such large amounts of data by means of platforms, tools and mechanisms for parallel and distributed processing. ISO subcommittee 32 mentions that BD Analytics has become a major driving application for data warehousing, with the use of MapReduce outside and inside of database management systems, and the use of self-service data marts [2]. MapReduce is one of the programming models used to develop BDA; it was developed by Google for processing and generating large datasets.

Correspondence: ebautistav@yahoo.com
1 Department of Electronic Systems, Autonomous University of Aguascalientes, Av. Universidad 940, Ciudad Universitaria, Aguascalientes, Mexico
2 Department of Software Engineering and Information Technology ETS, University of Quebec, 1100 Notre-Dame St., Montreal, Canada

© 2014 Bautista Villalpando et al.; licensee Springer. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited.

Sometimes, anomalies and defects found in the platforms of Cloud Computing Systems (CCS) affect the performance of BDA, resulting in degradation of the whole system. Performance analysis models (PAM) for BDA in CC should propose a means to identify and quantify normal application behaviour, which can serve as a baseline for detecting and predicting possible anomalies in the software (i.e. applications in Big Data platforms) that may impact BDA itself. To be able to design such a PAM for BDA, methods are needed to collect the necessary base measures specific to performance, and a performance framework must be used to determine the relationships that exist among these measures. One of the challenges in designing a PAM for BDA is how to determine what type of relationship exists between the various base measures and the performance quality concepts defined in international standards such as ISO 25010 [4]. For example, what is the extent of the relationship between the amount of physical memory used by a BDA and software engineering performance quality concepts such as resource utilization or capacity? Thus, this work proposes the use of statistical methods to determine how closely performance parameters (base measures) are related to the performance concepts of software engineering.

This paper is structured as follows. The related work and background sections present the concepts related to the performance measurement of BDA and introduce the MapReduce programming model. In addition, the background section presents the Performance Measurement Framework for Cloud Computing (PMFCC), which describes the key performance concepts and sub concepts that best represent the performance of a CCS. The analysis model section presents the method for examining the relationships among the performance concepts identified in the PMFCC.
An experimental methodology based on the Taguchi method of experimental design is used; it offers a means for improving the quality of product performance. The experiment section presents the results of an experiment which analyzes the relationship between the performance factors of BDA and Cloud Computing Platforms (CCP) and the performance concepts identified in the PMFCC. Finally, the conclusion section presents a synthesis of the results of this research and suggests future work.

Related work

Researchers have analyzed the performance of BDA from various viewpoints. For example, Alexandru [5] analyzes the performance of Cloud Computing Services for Many-Task Computing (MTC) systems. According to Alexandru, scientific workloads often require High-Performance Computing capabilities, and the scientific computing community has started to focus on MTC, that is, the high-performance execution of loosely coupled applications comprising many tasks. By means of this approach it is possible for systems to operate at high utilizations, similar to current production grids. Alexandru analyzes performance based on the premise of whether current clouds can execute MTC-based scientific workloads with performance similar to, and at lower cost than, current scientific processing systems. For this, the author focuses on Infrastructure as a Service (IaaS), that is, providers of public clouds that are not restricted to use within an enterprise. In this research, Alexandru selected four public cloud providers, Amazon EC2, GoGrid, ElasticHosts and Mosso, on which traditional system benchmarking is performed in order to provide a first-order estimate of system performance. Alexandru mainly uses metrics related to disk, memory, network and CPU to determine performance through the analysis of MTC workloads which comprise tens of thousands to hundreds of thousands of tasks. The main finding of this research is that the compute performance of the tested clouds is low compared to traditional high-performance computing systems.

In addition, Alexandru found that while current cloud computing services are insufficient for scientific computing at large, they are a good solution for scientists who need resources instantly and temporarily. Other similar research was performed by Jackson [6], who analyzes high-performance computing applications on the Amazon Web Services cloud. The purpose of this work is to examine the performance of existing CC infrastructures and create a mechanism to quantitatively evaluate them. The work is focused on the performance of Amazon EC2, as representative of the current mainstream of commercial CC services, and its applicability to Cloud-based environments for scientific computing. To do so, Jackson quantitatively examines the performance of a set of benchmarks designed to represent a typical High Performance Computing (HPC) workload running on the Amazon EC2 platform. Timing results from different application benchmarks are used to compute the Sustained System Performance (SSP) metric, which measures the performance delivered by the workload of a computing system. According to the National Energy Research Scientific Computing Center (NERSC) [7], SSP provides a process for evaluating system performance across any time frame, and can be applied to any set of systems, any workload and/or benchmark suite, and for any time period. The SSP measures time to solution across different application areas and can be used to evaluate absolute performance and performance relative

to cost (in dollars, energy or other value propositions). The results show a strong correlation between the percentage of time an application spends communicating and its overall performance on EC2: the more communication there is, the worse the performance becomes. Jackson also concludes that the communication pattern of an application can have a significant impact on performance. Other researchers focus their work on the performance analysis of MapReduce applications. For example, Jin [8] proposes a stochastic model to predict the performance of MapReduce applications under failures. His work is used to quantify the robustness of MapReduce applications under different system parameters, such as the number of processes, the mean time between failures (MTBF) of each process, the failure recovery cost, etc. Authors like Jiang [9] perform an in-depth study of the factors that affect the performance of MapReduce applications. In particular, he identifies five factors that affect the performance of MapReduce applications: I/O mode, indexing, data parsing, grouping schemes and block-level scheduling. Moreover, Jiang concludes that by carefully tuning each factor, it is possible to eliminate the negative impact of these factors and improve the performance of MapReduce applications. Other authors, like Guo [10] and Cheng [11], focus their work on improving the performance of MapReduce applications. Guo exploits the freedom to control concurrency in MapReduce in order to improve resource utilization. For this, he proposes resource stealing, which dynamically expands and shrinks the resource usage of running tasks by means of benefit-aware speculative execution (BASE). BASE improves the fault-tolerance mechanisms by speculatively launching duplicate tasks for tasks deemed to be stragglers. Furthermore, Cheng [11] focuses his work on improving the performance of MapReduce applications through a strategy called maximum cost performance (MCP).

MCP improves the effectiveness of speculative execution by accurately and promptly identifying stragglers. For this he provides the following methods: 1) use both the progress rate and the process bandwidth within a phase to select slow tasks, 2) use an exponentially weighted moving average (EWMA) to predict process speed and calculate a task's remaining time, and 3) determine which task to back up based on the load of the cluster, using a cost-benefit model. Although these works present interesting methods for the performance analysis of CCS and the improvement of BD applications (MapReduce), their approach is from an infrastructure standpoint and does not consider performance from a software engineering perspective. This work focuses on the performance analysis of BDA developed by means of the Hadoop MapReduce model, integrating software quality concepts from ISO 25010.

Background

Hadoop MapReduce

Hadoop is the Apache Software Foundation's top-level project, and encompasses the various Hadoop sub projects. The Hadoop project provides and supports the development of open source software that supplies a framework for the development of highly scalable distributed computing applications designed to handle processing details, leaving developers free to focus on application logic [12]. Hadoop is divided into several sub projects that fall under the umbrella of infrastructures for distributed computing. One of these sub projects is MapReduce, which is a programming model with an associated implementation, both developed by Google for processing and generating large datasets. According to Dean [13], programs written in this functional style are automatically parallelized and executed on a large cluster of commodity machines. Authors like Lin [14] point out that today, the issue of tackling large amounts of data is addressed by a divide-and-conquer approach, the basic idea being to partition a large problem into smaller sub problems.

Those sub problems can be handled in parallel by different workers; for example, threads in a processor core, cores in a multi-core processor, multiple processors in a machine, or many machines in a cluster. In this way, the intermediate results of each individual worker are then combined to yield the final output. The Hadoop MapReduce model results are obtained in two main stages: 1) the Map stage, and 2) the Reduce stage. In the Map stage, also called the mapping phase, data elements from a list of such elements are inputted, one at a time, to a function called Mapper, which transforms each element individually into an output data element. Figure 1 presents the components of the Map stage process. The Reduce stage (also called the reducing phase) aggregates values. In this stage, a reducer function receives input values iteratively from an input list. This function combines these values, returning a single output value. The Reduce stage is often used to produce summary data, turning a large volume of data into a smaller summary of itself. Figure 2 presents the components of the Reduce stage.

Figure 1: The mapping phase, in which an output list is created.
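The two stages can be illustrated with a minimal in-memory sketch of the MapReduce flow (a toy word count, not the Hadoop implementation itself; the key-grouping shuffle that the Hadoop platform performs between the stages is simulated here with a dictionary):

```python
from collections import defaultdict

def mapper(line):
    # Map stage: transform each input element into (key, value) pairs.
    for word in line.split():
        yield word, 1

def reducer(key, values):
    # Reduce stage: combine all values sharing a key into a single output value.
    return key, sum(values)

def map_reduce(lines):
    # Shuffle step: group intermediate values by key, so that all values
    # with the same key reach a single reducer.
    groups = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):
            groups[key].append(value)
    return dict(reducer(k, v) for k, v in groups.items())

print(map_reduce(["big data", "big cloud"]))  # {'big': 2, 'data': 1, 'cloud': 1}
```

In Hadoop the mapper and reducer run on different nodes and the shuffle moves data across the network, but the contract of the two functions is the same.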

Figure 2: The components of the reducing phase.

According to Yahoo! [15], when the mapping phase begins, any mapper (node) can process any input file or part of an input file. In this way, each mapper loads a set of local files to be able to process them. When the mapping phase has been completed, an intermediate pair of values (consisting of a key and a value) must be exchanged between machines, so that all values with the same key are sent to a single reducer. Like Map tasks, Reduce tasks are spread across the same nodes in the cluster and do not exchange information with one another, nor are they aware of one another's existence. Thus, all data transfer is handled by the Hadoop MapReduce platform itself, guided implicitly by the various keys associated with the values.

Performance measurement framework for cloud computing

The Performance Measurement Framework for Cloud Computing (PMFCC) [16] is based on the scheme for performance analysis shown in Figure 3. This scheme establishes a set of performance criteria (or characteristics) to help carry out the analysis of system performance. In this scheme, if the system is performing a service correctly, its performance is typically analyzed using three sub concepts, 1) responsiveness, 2) productivity, and 3) utilization, and a measurement process is proposed for each. There are several possible outcomes for each service request made to a system, which can be classified into three categories. The system may: 1) perform the service correctly, 2) perform the service incorrectly, or 3) refuse to perform the service altogether. Moreover, the scheme defines three sub concepts associated with each of these possible outcomes, which affect system performance: 1) speed, 2) reliability, and 3) availability. Figure 3 presents this scheme, which shows the possible outcomes of a service request to a system and the sub concepts associated with them.

Figure 3: Scheme of performance analysis of a service request to a system.

Based on the above scheme, the PMFCC [16] maps the possible outcomes of a service request onto quality concepts extracted from the ISO 25010 standard. The ISO 25010 [4] standard defines software product and computer system quality from two distinct perspectives: 1) a quality in use model, and 2) a product quality model. The product quality model is applicable to both systems and software. According to ISO 25010, the properties of both determine the quality of the product in a particular context, based on user requirements. For example, performance efficiency and reliability can be specific concerns of users who specialize in areas of content delivery, management, or maintenance. The performance efficiency concept proposed in ISO 25010 has three sub concepts: 1) time behavior, 2) resource utilization, and 3) capacity, while the reliability concept has four sub concepts: 1) maturity, 2) availability, 3) fault tolerance, and 4) recoverability. The PMFCC selects performance efficiency and reliability as the concepts for determining the performance of a CCS. In addition, the PMFCC proposes the following definition of CCS performance analysis: The performance of a Cloud Computing system is determined by analysis of the characteristics involved in performing an efficient and reliable service that meets requirements under stated conditions and within the maximum limits of the system parameters. Once the performance analysis concepts and sub concepts are mapped onto the ISO 25010 quality concepts, the framework presents a model of relationships (Figure 4) that shows the logical sequence in which the concepts and sub concepts appear when a performance issue arises in a CCS. In Figure 4, system performance is determined by two main sub concepts: 1) performance efficiency, and 2) reliability. We have seen that when a CCS receives a service request, there are three possible outcomes (the service is performed correctly, the service is performed incorrectly, or the service cannot be performed). The outcome will determine the sub concepts that will be applied for performance analysis.

For example, suppose that the CCS performs a service correctly but, during its execution, the service failed and was later reinstated. Although the service was

ultimately performed successfully, it is clear that the system availability (part of the reliability sub concept) was compromised, and this affected CCS performance.

Figure 4: Model of the relationships between performance concepts and sub concepts.

Performance analysis model for big data applications

Relationship between performance measures of BDA, CCP and software engineering quality concepts

In order to determine the degree of relationship between the performance measures of BDA and the performance concepts and sub concepts defined in the PMFCC (Figure 4), it is first necessary to map performance measures from the BDA and CCP onto the performance quality concepts previously defined. For this, measures need to be collected by means of data extracted from MapReduce log files and system monitoring tools (see Table 1). This data is obtained from a Hadoop cluster, which is the cloud platform on which the CCS is running. Once the performance measures are collected, they are mapped onto the performance concepts defined in the PMFCC by means of the formulae defined in ISO 25023. ISO 25023, Measurement of system and software product quality, provides a set of quality measures for the characteristics of system/software products that can be used for specifying requirements, measuring and evaluating the system/software product quality [17]. It is important to mention that such formulae were adapted according to the different performance measures collected from the BDA and CCP in order to represent the different concepts in a coherent form. Table 2 presents the different BDA and CCP performance measures after being mapped onto the PMFCC concepts and sub concepts.

Selection of key PMFCC concepts to represent the performance of BDA

Once the performance measures extracted from the BDA and CCP are mapped onto the performance quality concepts (see Table 2), the next step is to select a set of key sub concepts of the PMFCC that best represent the performance of BDA.
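As a small sketch of this mapping, the time-behavior sub concepts can be derived directly from the MapReduce job timestamps listed in Table 1 (the values below are hypothetical, and response time is computed here as the positive duration between submission and launch):

```python
# Hypothetical job record built from the MapReduce log measures of Table 1
# (timestamps in milliseconds; values are made up for illustration).
job = {"submittime": 1_000, "launchtime": 4_000, "finishtime": 64_000}

# Time-behavior sub concepts of the PMFCC, per the adapted ISO 25023 formulae:
response_time = job["launchtime"] - job["submittime"]    # wait before launch
processing_time = job["finishtime"] - job["launchtime"]  # launch to completion
turnaround_time = job["finishtime"] - job["submittime"]  # submit to completion

print(response_time, processing_time, turnaround_time)   # 3000 60000 63000
```

Note that, by construction, turnaround time is always the sum of response time and processing time for a given job.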
For this, two feature selection techniques are used in order to determine the most relevant features (PMFCC sub concepts) from a data set. According to Kantardzic [18], feature selection is a set of techniques that select relevant features (PMFCC sub

concepts) for building robust learning models by removing the most irrelevant and redundant features from the data. Kantardzic establishes that feature selection algorithms typically fall into two categories: feature ranking and subset selection. Feature ranking ranks all features by a specific base measure and eliminates all features that do not achieve an adequate score, while subset selection searches the set of all features for the optimal subset, in which the selected features are not ranked.

Table 1: Extract of collected performance measures from the BDA and CCP

Measure | Source | Description
jobs:clustermapcapacity | Jobs of MapReduce | Maximum number of available maps to be created by job
jobs:clusterreducecapacity | Jobs of MapReduce | Maximum number of available reduces to be created by job
jobs:finishtime | Jobs of MapReduce | Time at which a job was completed
jobs:jobsetuptasklaunchtime | Jobs of MapReduce | Time at which a job is set up in the cluster for processing
jobs:jobid | Jobs of MapReduce | Job ID
jobs:launchtime | Jobs of MapReduce | Time at which a job is launched for processing
jobs:status | Jobs of MapReduce | Job status after processing (Successful or Failed)
jobs:submittime | Jobs of MapReduce | Time at which a job was submitted for processing
disk:readbytes | Virtual Machine System | Amount of HD bytes read by a job
disk:writebytes | Virtual Machine System | Amount of HD bytes written by a job
memory:free | Virtual Machine System | Amount of average free memory at a specific time
memory:used | Virtual Machine System | Amount of average memory used at a specific time
network:rxbytes | Virtual Machine System | Amount of network bytes received at a specific time
network:rxerrors | Virtual Machine System | Amount of network errors during reception at a specific time
network:txbytes | Virtual Machine System | Amount of network bytes transmitted at a specific time
network:txerrors | Virtual Machine System | Amount of network errors during transmission at a specific time
The next subsections present the two feature ranking techniques which are used in the PAM for BDA in order to determine the most relevant performance sub concepts (features) that best represent the performance of BDA.

Feature selection based on comparison of means and variances

Feature selection based on the comparison of means and variances relies on the distribution of values for a given feature, for which it is necessary to compute the mean value and the corresponding variance. In general, if one feature describes different classes of entities, samples of two different classes can be examined. The means of the feature values are normalized by their variances and then compared. If the means are far apart, interest in the feature increases: it has potential, in terms of its use in distinguishing between the two classes. If the means are indistinguishable, interest in that feature wanes. The mean of a feature is compared in both cases without taking into consideration its relationship to other features. The following equations formalize the test, where A and B are sets of feature values measured for two different classes, and n1 and n2 are the corresponding numbers of samples:

SE(A-B) = sqrt( var(A)/n1 + var(B)/n2 )                          (1)

TEST: |mean(A) - mean(B)| / SE(A-B) > threshold value            (2)

In this approach to feature selection, it is assumed that a given feature is independent of the others. A comparison of means is typically a natural fit to classification problems. For k classes, k pairwise comparisons can be made, comparing each class with its complement. A feature is retained if it is significant for any of the pairwise comparisons, as shown in formula (2).

Relief algorithm

Another important technique for feature selection is the Relief algorithm. The Relief algorithm is a feature weight-based algorithm which relies on the relevance evaluation of each feature given in a training data set in which the samples are labeled (classification problems).
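Before turning to Relief in detail, the mean-and-variance test of formulae (1) and (2) can be sketched in a few lines (the feature values and threshold below are illustrative; in the PAM the inputs would be performance measures grouped by class):

```python
import math

def mean_variance_score(a, b):
    """Normalized distance between the class means of one feature,
    following formulae (1) and (2) of the comparison-of-means test."""
    n1, n2 = len(a), len(b)
    mean_a, mean_b = sum(a) / n1, sum(b) / n2
    # Sample variances of the feature values in each class.
    var_a = sum((x - mean_a) ** 2 for x in a) / (n1 - 1)
    var_b = sum((x - mean_b) ** 2 for x in b) / (n2 - 1)
    se = math.sqrt(var_a / n1 + var_b / n2)   # formula (1)
    return abs(mean_a - mean_b) / se          # compare to threshold, formula (2)

# Feature values of one feature for two classes; the feature is retained
# only if the score exceeds a chosen threshold (0.5 here, illustrative).
score = mean_variance_score([0.3, 0.6, 0.5], [0.2, 0.7, 0.4])
print(score, score > 0.5)
```

Here the class means are nearly identical relative to the spread, so the score stays below the threshold and this feature would be discarded.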
The main concept of this algorithm is to compute a ranking score for every feature, indicating how well the feature separates neighboring samples. The authors of the Relief algorithm, Kira and Rendell [19], proved that the ranking

score becomes large for relevant features and small for irrelevant ones. The objective of the Relief algorithm is to estimate the quality of features according to how well their values distinguish between samples that are close to each other. Given training data S, the algorithm randomly selects a subset of samples of size m, where m is a user-defined parameter. The algorithm analyses each feature based on the selected subset of samples. For each randomly selected sample X from the training data set, it searches for its two nearest neighbors: one from the same class, called the nearest hit H, and the other from a different class, called the nearest miss M. The Relief algorithm updates the quality score W(A_i) for each feature A_i depending on the differences in its values for the samples X, M, and H, as shown in formula (3):

W_new(A_i) = W_old(A_i) - diff(X[A_i], H[A_i]) / m + diff(X[A_i], M[A_i]) / m        (3)

The process is repeated m times for randomly selected samples from the training data set, and the scores W(A_i) are accumulated for each sample.
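A compact sketch of the algorithm as described (with diff taken as the absolute difference and a squared-Euclidean neighbor search; the toy data set is illustrative, not the authors' data):

```python
import random

def relief(samples, labels, m=20, seed=0):
    """Sketch of the Relief feature ranking described above: per formula (3),
    a feature's score grows when it separates a sample from its nearest miss
    and shrinks when it separates the sample from its nearest hit."""
    rng = random.Random(seed)
    n_features = len(samples[0])
    w = [0.0] * n_features

    def dist(x, y):
        return sum((a - b) ** 2 for a, b in zip(x, y))

    for _ in range(m):
        i = rng.randrange(len(samples))
        x = samples[i]
        # Nearest hit H (same class) and nearest miss M (different class).
        hits = [s for j, s in enumerate(samples) if j != i and labels[j] == labels[i]]
        misses = [s for j, s in enumerate(samples) if labels[j] != labels[i]]
        h = min(hits, key=lambda s: dist(x, s))
        miss = min(misses, key=lambda s: dist(x, s))
        for k in range(n_features):
            # Formula (3): penalize hit differences, reward miss differences.
            w[k] += (abs(x[k] - miss[k]) - abs(x[k] - h[k])) / m
    return w

# Feature 0 determines the class, feature 1 is noise, so W[0] should end up
# well above W[1] for any reasonable relevancy threshold τ.
samples = [(0.0, 0.4), (0.1, 0.9), (0.9, 0.5), (1.0, 0.8)]
labels = [0, 0, 1, 1]
print(relief(samples, labels))
```

With a threshold τ between the two scores, only the class-determining feature would be declared statistically relevant.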
Finally, using a threshold of relevancy τ, the algorithm detects those features that are statistically relevant to the target classification, and

Table 2: BDA and CCP performance measures mapped onto PMFCC concepts and sub concepts

PMFCC concept | PMFCC sub concept | Measure | Description | Adapted formula
Performance efficiency | Time behavior | Response time | Duration from a submitted BDA Job until it is launched for processing | submittime - launchtime
Performance efficiency | Time behavior | Turnaround time | Duration from a submitted BDA Job until completion of the Job | finishtime - submittime
Performance efficiency | Time behavior | Processing time | Duration from a launched BDA Job until completion of the Job | finishtime - launchtime
Performance efficiency | Resource utilization | CPU utilization | How much CPU time is used per minute to process a BDA Job | 100 - cpuidlepercent (percent)
Performance efficiency | Resource utilization | Memory utilization | How much memory is used per minute to process a BDA Job | 100 - memoryfreepercent (percent)
Performance efficiency | Resource utilization | Hard disk bytes read | How many bytes are read per minute to process a BDA Job | Total of bytes read per minute
Performance efficiency | Resource utilization | Hard disk bytes written | How many bytes are written per minute to process a BDA Job | Total of bytes written per minute
Performance efficiency | Capacity | Load map tasks capacity | How many map tasks are processed in parallel for a specific BDA Job | Total of map tasks processed in parallel for a specific BDA Job
Performance efficiency | Capacity | Load reduce tasks capacity | How many reduce tasks are processed in parallel for a specific BDA Job | Total of reduce tasks processed in parallel for a specific BDA Job
Performance efficiency | Capacity | Network Tx bytes | How many bytes are transferred while a specific BDA Job is processed | Total of transferred bytes per minute
Performance efficiency | Capacity | Network Rx bytes | How many bytes are received while a specific BDA Job is processed | Total of received bytes per minute
Reliability | Maturity | Task mean time between failures | How frequently a task of a specific BDA Job fails in operation | Number of tasks failed per minute
Reliability | Maturity | Tx network errors | How many transmission errors in the network are detected while processing a specific BDA Job | Number of Tx network errors detected per minute
Reliability | Maturity | Rx network errors | How many reception errors in the network are detected while processing a specific BDA Job | Number of Rx network errors detected per minute
Reliability | Availability | Time of CC system up | Total time that the system has been in operation | Total minutes of the CC system operation
Reliability | Fault tolerance | Network Tx collisions | How many transmission collisions in the network occur while processing a specific BDA Job | Total of Tx network collisions per minute
Reliability | Fault tolerance | Network Rx dropped | How many reception bytes in the network are dropped while processing a specific BDA Job | Total of Rx network bytes dropped per minute
Reliability | Recoverability | Mean recovery time | What is the average time the CC system takes to complete recovery from a failure | Average recovery time of the CC system

These are the features with W(A_i) ≥ τ. The main steps of the Relief algorithm are formalized in Algorithm 1.

Choosing a methodology to analyze relationships between performance concepts

Once a subset of the most important features (key performance sub-concepts) has been selected, the next step is to determine the degree of relationship that exists between this subset of features and the rest of the performance sub-concepts defined by means of the PMFCC. For this, the use of Taguchi's experimental design method is proposed: it investigates how different features (performance measures) are related, and to what degree. Understanding these relationships will enable us to determine the influence each of them has on the resulting performance concepts. The PMFCC shows many of the relationships that exist between the base measures, which have a major influence on the collection functions. However, in BDA, and more specifically in the Hadoop MapReduce application experiment, there are over a hundred possible performance measures (including system measures) that could contribute to the analysis of BDA performance. A selection of these performance measures has to be included in the collection functions so that the respective performance concepts can be obtained and, from there, an indication of the performance of the applications. One key design problem is to establish which performance measures are interrelated and how much they contribute to each of the collection functions. In traditional statistical methods, thirty or more observations (or data points) are typically needed for each variable in order to gain meaningful insights and analyze the results. In addition, only a few independent variables are necessary to carry out experiments to uncover potential relationships, and this must be performed under certain predetermined and controlled test conditions. However, this approach is not appropriate here, owing to the large number of variables involved and the considerable time and effort required.
Consequently, an analysis method suited to our specific problem and study area is needed. A possible candidate method to address this problem is Taguchi's experimental design method, which investigates how different variables affect the mean and variance of a process performance characteristic, and helps in determining how well the process is functioning. The Taguchi method proposes a limited number of experiments, but is more efficient than a factorial design in its ability to identify relationships and dependencies. The next section presents the method used to find these relationships.

Taguchi method of experimental design

Taguchi's Quality Engineering Handbook [20] describes the Taguchi method of experimental design, which was developed by Dr. Genichi Taguchi, a researcher at the Electronic Control Laboratory in Japan. This method combines industrial and statistical experience, and offers a means for improving the quality of manufactured products. It is based on the robust design concept, according to which a well-designed product should cause no problem when used under specified conditions. According to Cheikhi [21], Taguchi's two-phase quality strategy is the following:

Phase 1: The online phase, which focuses on the techniques and methods used to control quality during production of the product.

Phase 2: The offline phase, which focuses on taking those techniques and methods into account before manufacturing the product, that is, during the design phase, the development phase, etc.

One of the most important activities in the offline phase of the strategy is parameter design. This is where the parameters are determined that make it possible to satisfy the set quality objectives (often called the objective function) through the use of experimental designs under set conditions. If the product does not work properly (does not fulfill the objective function), then the design constants (also called parameters) need to be adjusted so that it will perform better.
Cheikhi [21] explains that this activity includes five (5) steps, which are required to determine the parameters that satisfy the quality objectives:

1. Definition of the objective of the study, that is, identification of the quality characteristics to be observed in the output (results expected).
2. Identification of the study factors and their interactions, as well as the levels at which they will be set. There are two different types of factors: 1) control factors: factors that can be easily managed or adjusted; and 2) noise factors: factors that are difficult to control or manage.
3. Selection of the appropriate orthogonal arrays (OA) for the study, based on the number of factors, and their levels and interactions. The OA shows the

various experiments that will need to be conducted in order to verify the effect of the factors studied on the quality characteristic to be observed in the output.
4. Preparation and performance of the resulting OA experiments, including preparation of the data sheets for each OA experiment according to the combination of the levels and factors for the experiment. For each experiment, a number of trials are conducted and the quality characteristics of the output are observed.
5. Analysis and interpretation of the experimental results to determine the optimum settings for the control factors, and the influence of those factors on the quality characteristics observed in the output.

According to Taguchi's Quality Engineering Handbook [20], the OA organizes the parameters affecting the process and the levels at which they should vary. Taguchi's method tests pairs of combinations, instead of having to test all possible combinations (as in a factorial experimental design). This approach can determine which factors affect product quality the most with a minimum number of experiments. Taguchi's OA can be created manually or derived from deterministic algorithms. They are selected by the number of parameters (variables) and the number of levels (states). An OA is represented by Ln and Pn, where Ln corresponds to the number of experiments to be conducted, and Pn corresponds to the number of parameters to be analyzed. Table 3 presents an example of the Taguchi OA L12, meaning that 12 experiments are conducted to analyze 11 parameters. An OA cell contains the factor levels (1 and 2), which determine the type of parameter values for each experiment. Once the experimental design has

Table 3 Taguchi's Orthogonal Array L12
| Experiment (L) | P1 | P2 | P3 | P4 | P5 | P6 | P7 | P8 | P9 | P10 | P11 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 2 | 1 | 1 | 1 | 1 | 1 | 2 | 2 | 2 | 2 | 2 | 2 |
| 3 | 1 | 1 | 2 | 2 | 2 | 1 | 1 | 1 | 2 | 2 | 2 |
| 4 | 1 | 2 | 1 | 2 | 2 | 1 | 2 | 2 | 1 | 1 | 2 |
| 5 | 1 | 2 | 2 | 1 | 2 | 2 | 1 | 2 | 1 | 2 | 1 |
| 6 | 1 | 2 | 2 | 1 | 2 | 2 | 1 | 2 | 1 | 2 | 1 |
| 7 | 1 | 2 | 2 | 2 | 1 | 2 | 2 | 1 | 2 | 1 | 1 |
| 8 | 2 | 1 | 2 | 1 | 2 | 2 | 2 | 1 | 1 | 1 | 2 |
| 9 | 2 | 1 | 1 | 2 | 2 | 2 | 1 | 2 | 2 | 1 | 1 |
| 10 | 2 | 2 | 2 | 1 | 1 | 1 | 1 | 2 | 2 | 1 | 2 |
| 11 | 2 | 2 | 1 | 2 | 1 | 2 | 1 | 1 | 1 | 2 | 2 |
| 12 | 2 | 2 | 1 | 1 | 2 | 1 | 2 | 1 | 2 | 2 | 1 |

been determined and the trials have been carried out, the performance characteristic measurements from each trial can be used to analyze the relative effect of the various parameters. Taguchi's method is based on the use of the signal-to-noise ratio (SNR). The SNR is a measurement scale that has been used in the communications industry for nearly a century to determine the extent of the relationship between quality factors in a measurement model [20]. The SNR approach involves the analysis of data for variability, in which an input-to-output relationship is studied in the measurement system. Thus, to determine the effect each parameter has on the output, the SNR is calculated by the following formula:

SN_i = 10 log( ȳ_i² / S_i² )   (4)

where

ȳ_i = (1/N_i) Σ_{u=1..N_i} y_{i,u}
S_i² = (1/(N_i − 1)) Σ_{u=1..N_i} (y_{i,u} − ȳ_i)²

and i = experiment number, u = trial number, and N_i = number of trials for experiment i.

To minimize the performance characteristic (objective function), the following definition of the SNR should be calculated:

SN_i = −10 log( (1/N_i) Σ_{u=1..N_i} y_u² )   (5)

To maximize the performance characteristic (objective function), the following definition of the SNR should be calculated:

SN_i = −10 log( (1/N_i) Σ_{u=1..N_i} 1/y_u² )   (6)

Once the SNR values have been calculated for each factor and level, they are tabulated as shown in Table 4, and then the range R (R = high SN − low SN) of the SNR for each parameter is calculated and entered in Table 4. According to Taguchi's method, the larger the R value for a parameter, the greater its effect on the process.
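The two SNR definitions in formulas 5 and 6 translate directly into code; this is a minimal sketch, with function names of our own choosing:

```python
import math

def snr_smaller_is_better(trials):
    """Formula 5: SN = -10 * log10( (1/N) * sum of y_u^2 ) over N trial outputs."""
    return -10.0 * math.log10(sum(y * y for y in trials) / len(trials))

def snr_larger_is_better(trials):
    """Formula 6: SN = -10 * log10( (1/N) * sum of 1/y_u^2 ) over N trial outputs."""
    return -10.0 * math.log10(sum(1.0 / (y * y) for y in trials) / len(trials))
```

Under formula 5, smaller trial outputs yield a larger SNR, which is why the optimum level of a factor is always the one that maximizes the SNR.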

Table 4 Rank for SNR values

| Level | P1 | P2 | P3 | P4 | P5 | P6 | P7 | … | P11 |
|---|---|---|---|---|---|---|---|---|---|
| 1 | SN_1,1 | SN_2,1 | SN_3,1 | SN_4,1 | SN_5,1 | SN_6,1 | SN_7,1 | … | SN_11,1 |
| 2 | SN_1,2 | SN_2,2 | SN_3,2 | SN_4,2 | SN_5,2 | SN_6,2 | SN_7,2 | … | SN_11,2 |
| 3 | SN_1,3 | SN_2,3 | SN_3,3 | SN_4,3 | SN_5,3 | SN_6,3 | SN_7,3 | … | SN_11,3 |
| 4 | SN_1,4 | SN_2,4 | SN_3,4 | SN_4,4 | SN_5,4 | SN_6,4 | SN_7,4 | … | SN_11,4 |
| Range | R_P1 | R_P2 | R_P3 | R_P4 | R_P5 | R_P6 | R_P7 | … | R_P11 |
| Rank | Rank P1 | Rank P2 | Rank P3 | Rank P4 | Rank P5 | Rank P6 | Rank P7 | … | Rank P11 |

Corresponding values for parameters P8, P9, and P10 follow the same pattern.

Experiment

Experiment setup

The experiment was conducted on a DELL Studio Workstation XPS 9100 with an Intel Core i7 X980 processor (12 logical cores) at 3.3 GHz, 24 GB of DDR3 RAM, a Seagate 1.5 TB 7200 RPM SATA 3 Gb/s disk, and a 1 Gbps network connection. We used the Linux CentOS 6.4 64-bit distribution and Xen 4.2 as the hypervisor. This physical machine hosts five virtual machines (VMs), each with a dual-core Intel i7 configuration, 4 GB of RAM, 20 GB of virtual storage, and a virtual network interface. In addition, each VM runs the Apache Hadoop distribution version 1.0.4, which includes the Hadoop Distributed File System (HDFS) and MapReduce framework libraries, Apache Chukwa 0.5.0 as the performance measures collector, and Apache HBase 0.94.1 as the performance measures repository. One of these VMs is the master node, which executes the NameNode (HDFS) and JobTracker (MapReduce), and the rest of the VMs are slave nodes running DataNodes (HDFS) and TaskTrackers (MapReduce). Figure 5 presents the cluster configuration for the set of experiments.

Mapping of performance measures onto PMFCC concepts

A total of 103 MapReduce jobs (BDA) were executed in the virtual Hadoop cluster, and a set of performance measures was obtained from the MapReduce job logs and monitoring tools. One of the main problems that arose after the performance measures repository ingestion process was the cleanliness of the data. Cleanliness calls for the quality of the data to be verified prior to performing data analysis.
Among the most important data quality issues to consider during data cleaning in the model were corrupted records, inaccurate content, missing values, and

Figure 5 Cluster configuration for the experiment.

formatting inconsistencies, to name a few. Consequently, one of the main challenges at the preprocessing stage was how to structure the data in standard formats so that it can be analyzed more efficiently. For this, a data normalization process was carried out over the data set by means of the standard score technique (see formula 7):

Xnorm_i = (X_i − μ_i) / S_i   (7)

where

X_i = feature i
μ_i = average value of X_i in the data set
S_i = range of feature i (Max X_i − Min X_i)

The normalization process scaled the values into the range [−1, 1], since the collected performance measures are expressed in different units and dimensions. For example, the measure processing time is expressed in minutes, while the measure memory utilization is expressed in Mbytes. Table 5 presents an extract of the different collected performance measures after normalization.

Note: Table 5 shows that the values related to network measures are equal to zero because the experiment is performed in a Hadoop virtual cluster. This means that real transmission over a physical network does not exist, leaving out the possibility of errors. In addition, other measures, such as mean time between failure and mean recovery time, are also equal to zero because the Hadoop virtual cluster never failed during the experiment.

Selection of key measures to represent the performance of BDA

One of the challenges in the design of the PAM for BDA is how to determine the set of key sub-concepts which have more relevance to performance than the others. For this, feature selection is applied during the process of knowledge discovery. As previously mentioned, the two techniques used for feature selection are means and variances, and the Relief algorithm. The means and variances approach assumes that the given features are independent of one another. In the experiment, a total of 103 Hadoop MapReduce jobs were executed and their performance measures stored.
A MapReduce job may belong to one of two classes according to its status: failed or successful (0 or 1) (see Table 5). Thus, applying the means and variances technique to the data set, the feature Job Status classifies each job record into the two classes 0 and 1. First, it is necessary to compute the mean value and variance of both classes for each feature (PMFCC sub-concept measure). It is important to note that test values will be compared with the highest set

Table 5 Extract of collected performance measures after normalization

| Performance measure | 138367812000-job_201311051347_0021 | 1384366260-job_201311131253_0019 | 1384801260-job_201311181318_0419 |
|---|---|---|---|
| Time of CC System Up | 0.4534012681 | 0.4158208360 | 0.1921547093 |
| Load map tasks capacity | 0.0860196415 | 0.0770106325 | 0.0860196415 |
| Load reduce tasks capacity | 0.0334295334 | 0.0334295334 | 0.0334295334 |
| Network Rx bytes | 0.0647059274 | 0.4808087278 | 0.0055927073 |
| Network Tx bytes | 0.0779191010 | 0.3139488890 | 0.0613171507 |
| Network Rx dropped | 0.0 | 0.0 | 0.0 |
| Network Tx collisions | 0.0 | 0.0 | 0.0 |
| Rx network errors | 0.0 | 0.0 | 0.0 |
| Tx network errors | 0.0 | 0.0 | 0.0 |
| CPU utilization | 0.0950811052 | 0.5669416548 | 0.0869983066 |
| Hard disk bytes read | 0.0055644728 | 0.0196859057 | 0.0076297598 |
| Hard disk bytes written | 0.0386960610 | 0.2328110281 | 0.0253053155 |
| Memory utilization | 0.1956635952 | 0.4244033618 | 0.0341498692 |
| Processing time | 0.1838906682 | 0.8143236713 | 0.0156797304 |
| Response time | 0.0791592524 | 0.1221040377 | 0.1846444285 |
| Turnaround time | 0.1838786629 | 0.8143213555 | 0.0156595689 |
| Task MTBF | 0.0 | 0.0 | 0.0 |
| Mean recovery time | 0.0 | 0.0 | 0.0 |
| Job Status | 1.0 | 0.0 | 1.0 |
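The scaling behind the values in Table 5 follows formula 7; a minimal sketch (the helper name is ours):

```python
def normalize_feature(values):
    """Formula 7: x_norm = (x - mean) / (max - min) for one feature column.

    Because the mean always lies between the min and the max, every scaled
    value falls inside (-1, 1), regardless of the feature's original units.
    """
    mu = sum(values) / len(values)
    spread = max(values) - min(values)  # S_i, the range of the feature
    return [(v - mu) / spread for v in values]
```

Note that a constant feature has zero range and would divide by zero; constant columns (such as the all-zero network measures in Table 5) have to be handled separately.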

of values obtained after the ranking process (0.9), because this distinguishes them from the rest of the results. The results are shown in Table 6. The analysis shows that the measures job processing time and job turnaround have the potential to be distinguishing features between the two classes, because their means are far apart; their test values are greater than 0.9. In addition, it is important to mention that although there is a considerable difference between the second and third results (hard disk bytes written), the latter is also selected in order to analyze its relationship with the rest of the measures, because it also has the potential to stand out from the rest of the measures and give more certainty to the analysis of relationships. Thus, the measures job processing time, job turnaround, and hard disk bytes written are selected as candidates to represent the performance of the BDA in the Hadoop system. In order to give more certainty to the above results, the Relief algorithm technique was applied to the same data set. As previously mentioned, the core of the Relief algorithm estimates the quality of features according to how well their values distinguish between samples (performance measures of MapReduce job records) that are close to each other. The results of applying the Relief algorithm to the data set are presented in Table 7, where the algorithm detects those features that are statistically relevant to the target classification, which are the measures with the highest quality scores.
Table 6 Results of means and variances

| Performance measure | Test value |
|---|---|
| MapReduceJob_ProcessingTime | 9.214837 |
| MapReduceJob_TurnAround | 9.214828 |
| SystemHDWriteBytes_Utilization | 8.176328 |
| SystemUpTime | 7.923577 |
| SystemLoadMapCapacity | 6.613519 |
| SystemNetworkTxBytes | 6.165150 |
| SystemNetworkRxBytes | 5.930647 |
| SystemCPU_Utilization | 5.200704 |
| SystemLoadReduceCapacity | 5.163010 |
| MapReduceJob_ResponseTime | 5.129339 |
| SystemMemory_Utilization | 3.965617 |
| SystemHDReadBytes_Utilization | 0.075003 |
| NetworkRxDropped | 0.00 |
| NetworkTxCollisions | 0.00 |
| NetworkRxErrors | 0.00 |
| NetworkTxErrors | 0.00 |

Distinguishing features between the two classes are those with the highest set of values obtained after the ranking process.

Table 7 Relief algorithm results

| Performance measure | Quality score (W) |
|---|---|
| MapReduceJob_ProcessingTime | 0.74903 |
| MapReduceJob_TurnAround | 0.74802 |
| SystemHDWriteBytes_Utilization | 0.26229 |
| SystemUpTime | 0.25861 |
| SystemCPU_Utilization | 0.08189 |
| SystemLoadMapCapacity | 0.07878 |
| SystemMemory_Utilization | 0.06528 |
| SystemNetworkTxBytes | 0.05916 |
| MapReduceJob_ResponseTime | 0.03573 |
| SystemLoadReduceCapacity | 0.03051 |
| SystemNetworkRxBytes | 0.02674 |
| SystemHDReadBytes_Utilization | 0.00187 |
| NetworkRxDropped | 0.00 |
| NetworkTxCollisions | 0.00 |
| NetworkRxErrors | 0.00 |
| NetworkTxErrors | 0.00 |

Distinguishing features between the two classes are those with the highest quality scores obtained after applying the Relief algorithm.

The Relief results show that the performance measures job processing time and job turnaround have the highest quality scores (W) and also have the potential to be distinguishing features between the two classes. In this case, the performance measure hard disk bytes written is also selected, by the same reasoning as in the means and variances analysis: it has the potential to stand out from the rest of the measures and give more certainty to the analysis of relationships. Thus, the measures job processing time, job turnaround, and hard disk bytes written are again selected as candidates to represent the performance of BDA in the Hadoop system.
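The paper does not spell out the exact test statistic behind the means and variances scores in Table 6. A common choice for such a two-class separation score, assumed here for illustration, is the t-like statistic |μ0 − μ1| / sqrt(s0²/n0 + s1²/n1):

```python
import statistics

def mean_variance_score(values, labels):
    """A t-like separation score for one feature split by a binary class label.

    This is an assumed form of the 'means and variances' test: the distance
    between the class means, scaled by the pooled standard error.
    """
    g0 = [v for v, c in zip(values, labels) if c == 0]
    g1 = [v for v, c in zip(values, labels) if c == 1]
    se = (statistics.pvariance(g0) / len(g0)
          + statistics.pvariance(g1) / len(g1)) ** 0.5
    return abs(statistics.mean(g0) - statistics.mean(g1)) / se
```

A feature whose class means are far apart relative to its within-class spread (like processing time in Table 6) gets a large score; a feature that does not separate the classes scores near zero.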
The results show that time behavior and resource utilization (see Table 2) are the PMFCC concepts that best represent the performance of the BDA. The next step is to determine how the rest of the performance measures are related, and to what degree. Studying these relationships enables us to assess the influence each of them has on the concepts that best represent BDA performance in the experiment. For this, Taguchi's experimental design method is applied in order to determine how the different performance measures are related.

Analysis of relationships between selected performance measures

Once the set of performance measures has been selected to represent the performance of BDA, it is necessary to determine the relationships that exist between them and the rest of the performance measures. These key measures are defined as quality objectives (objective functions) according to

Table 8 Experiment factors and levels

| Factor number | Factor name | Level 1 | Level 2 |
|---|---|---|---|
| 1 | Time of CC system up | < 0.0 | ≥ 0.0 |
| 2 | Load map tasks capacity | < 0.0 | ≥ 0.0 |
| 3 | Load reduce tasks capacity | < 0.0 | ≥ 0.0 |
| 4 | Network Rx bytes | < 0.0 | ≥ 0.0 |
| 5 | Network Tx bytes | < 0.0 | ≥ 0.0 |
| 6 | CPU utilization | < 0.0 | ≥ 0.0 |
| 7 | Hard disk bytes read | < 0.0 | ≥ 0.0 |
| 8 | Memory utilization | < 0.0 | ≥ 0.0 |
| 9 | Response time | < 0.0 | ≥ 0.0 |

Taguchi's terminology. According to Taguchi [20], quality is often referred to as conformance to the operating specifications of a system. To him, the quality objective (or dependent variable) determines the ideal function of the output that the system should show. In our experiment, the observed dependent variables are the following: job processing time, job turnaround, and hard disk bytes written.

Each MapReduce job record (Table 5) is selected as an experiment in which different values for each performance measure are recorded. In addition, the different levels of each factor (see Table 3) are established as:

Values less than zero: level 1.
Values greater than or equal to zero: level 2.

Table 8 presents a summary of the factors, levels, and values for this experiment. Note: the factor set consists of the performance measures remaining after the key selection process. In addition, it is important to mention that values less than 0.0, that is, negative values, are feasible because the experiment is performed after the normalization process. Using Taguchi's experimental design method, the selection of the appropriate OA is determined by the number of factors and levels to be examined. The resulting OA for this case study is L12 (presented in Table 3). The assignment of the various factors and values of this OA is shown in Table 9. Table 9 shows the set of experiments to be carried out with different values for each selected parameter.
For example, experiment 3 involves values of time of system up less than 0, map task capacity less than 0, reduce task capacity greater than or equal to 0, network Rx bytes greater than or equal to 0, and so on. A total of approximately 1000 performance measures were extracted by selecting those that met the different combinations of parameter values after the normalization process for each experiment. Only a set of 40 measures met the experiment requirements presented in Table 9. The set of experiments was divided into three groups of twelve experiments each (called trials). An extract of the values and results of each experiment for the processing time output objective is presented in Table 10 (the same procedure is performed to develop the experiments for the job turnaround and hard disk bytes written output objectives). Taguchi's method defines the SNR used to measure robustness, which is the transformed form of the performance quality characteristic (output value) used to analyze the results. Since the objective of this experiment is to minimize the quality characteristic of the

Table 9 Matrix of experiments

| Experiment | Time of system up | Map tasks capacity | Reduce tasks capacity | Network Rx bytes | Network Tx bytes | CPU utilization | HD bytes read | Memory utilization | Response time |
|---|---|---|---|---|---|---|---|---|---|
| 1 | < 0 | < 0 | < 0 | < 0 | < 0 | < 0 | < 0 | < 0 | < 0 |
| 2 | < 0 | < 0 | < 0 | < 0 | < 0 | ≥ 0 | ≥ 0 | ≥ 0 | ≥ 0 |
| 3 | < 0 | < 0 | ≥ 0 | ≥ 0 | ≥ 0 | < 0 | < 0 | < 0 | ≥ 0 |
| 4 | < 0 | ≥ 0 | < 0 | ≥ 0 | ≥ 0 | < 0 | ≥ 0 | ≥ 0 | < 0 |
| 5 | < 0 | ≥ 0 | ≥ 0 | < 0 | ≥ 0 | ≥ 0 | < 0 | ≥ 0 | < 0 |
| 6 | < 0 | ≥ 0 | ≥ 0 | < 0 | ≥ 0 | ≥ 0 | < 0 | ≥ 0 | < 0 |
| 7 | < 0 | ≥ 0 | ≥ 0 | ≥ 0 | < 0 | ≥ 0 | ≥ 0 | < 0 | ≥ 0 |
| 8 | ≥ 0 | < 0 | ≥ 0 | < 0 | ≥ 0 | ≥ 0 | ≥ 0 | < 0 | < 0 |
| 9 | ≥ 0 | < 0 | < 0 | ≥ 0 | ≥ 0 | ≥ 0 | < 0 | ≥ 0 | ≥ 0 |
| 10 | ≥ 0 | ≥ 0 | ≥ 0 | < 0 | < 0 | < 0 | < 0 | ≥ 0 | ≥ 0 |
| 11 | ≥ 0 | ≥ 0 | < 0 | ≥ 0 | < 0 | ≥ 0 | < 0 | < 0 | < 0 |
| 12 | ≥ 0 | ≥ 0 | < 0 | < 0 | ≥ 0 | < 0 | ≥ 0 | < 0 | ≥ 0 |

Table 10 Trials, experiments, and resulting values for the job processing time output objective

| Trial | Experiment | Time of system up | Map tasks capacity | Reduce tasks capacity | Network Rx bytes | Network Tx bytes | CPU utilization | … (a) | Job processing time |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 0.44091 | 0.08601 | 0.03342 | 0.04170 | 0.08030 | 0.00762 | … | 0.183902878 |
| 1 | 2 | 0.34488 | 0.07100 | 0.03342 | 0.02022 | 0.18002 | 0.16864 | … | 0.170883497 |
| 1 | 3 | 0.49721 | 0.08601 | 0.79990 | 0.01329 | 0.02184 | 0.03221 | … | 0.171468597 |
| 1 | 4 | 0.39277 | 0.01307 | 0.03342 | 0.02418 | 0.08115 | 0.02227 | … | 0.13252447 |
| … (b) | | | | | | | | | |
| 2 | 1 | 0.03195 | 0.08601 | 0.03342 | 0.06311 | 0.09345 | 0.17198 | … | 0.015597229 |
| 2 | 2 | 0.01590 | 0.19624 | 0.03342 | 0.06880 | 0.01529 | 0.06993 | … | 0.730455521 |
| 2 | 3 | 0.11551 | 0.07701 | 0.79990 | 0.05635 | 0.09014 | 0.02999 | … | 0.269538778 |
| 2 | 4 | 0.04868 | 0.80375 | 0.20009 | 0.00585 | 0.01980 | 0.07713 | … | 0.13252447 |
| … (c) | | | | | | | | | |
| 3 | 1 | 0.06458 | 0.08601 | 0.03342 | 0.06053 | 0.08483 | 0.14726 | … | 0.015597229 |
| 3 | 2 | 0.04868 | 0.19624 | 0.03342 | 0.07017 | 0.01789 | 0.07074 | … | 0.730455521 |
| 3 | 3 | 0.29027 | 0.07100 | 0.79990 | 0.049182 | 0.06387 | 0.07363 | … | 0.264375632 |
| 3 | 4 | 0.06473 | 0.91398 | 0.03342 | 0.00892 | 0.02461 | 0.05465 | … | 0.13252447 |
| … (d) | | | | | | | | | |

a Corresponding values for HD bytes read and Memory utilization. b Corresponding values for experiments 5 to 12 of trial 1. c Corresponding values for experiments 5 to 12 of trial 2. d Corresponding values for experiments 5 to 12 of trial 3.

output (the amount of processing time used per MapReduce job), the SNR for the quality characteristic "the smaller the better" is given by formula 8, that is:

SN_i = −10 log( (1/N_i) Σ_{u=1..N_i} y_u² )   (8)

The SNR result for each experiment is shown in Table 11. Complete SNR tables for the job turnaround and hard disk bytes written experiments were developed in order to obtain their results. According to Taguchi's method, the factor effect is equal to the difference between the highest average SNR and the lowest average SNR for each factor (see Table 4).
This means that the larger the factor effect for a parameter, the larger the effect the variable has on the process; in other words, the more significant the effect of the factor. Table 12 shows the factor effect for each variable studied in the experiment. Similar

Table 11 Processing time SNR results

| Experiment | Time of system up | Map tasks capacity | Reduce tasks capacity | Network Rx bytes | … (a) | Processing time trial 1 | Processing time trial 2 | Processing time trial 3 | SNR |
|---|---|---|---|---|---|---|---|---|---|
| 1 | < 0 | < 0 | < 0 | < 0 | … | 0.1839028 | 0.5155972 | 0.4155972 | 0.999026 |
| 2 | < 0 | < 0 | < 0 | < 0 | … | 0.1708835 | 0.7304555 | 0.7304555 | 0.45658085 |
| 3 | < 0 | < 0 | ≥ 0 | ≥ 0 | … | 0.1714686 | 0.269538 | 0.2643756 | 1.25082414 |
| 4 | < 0 | ≥ 0 | < 0 | ≥ 0 | … | 0.1325244 | 0.132524 | 0.132524 | 15.7043319 |
| 5 | < 0 | ≥ 0 | ≥ 0 | < 0 | … | 0.1856763 | 0.267772 | 0.269537 | 1.39727504 |
| 6 | < 0 | ≥ 0 | ≥ 0 | < 0 | … | 0.2677778 | 0.269537 | 0.185676 | 1.39727504 |
| 7 | < 0 | ≥ 0 | ≥ 0 | ≥ 0 | … | 0.1714686 | 0.174542 | 0.174542 | 3.98029432 |
| 8 | ≥ 0 | < 0 | ≥ 0 | < 0 | … | 0.2688839 | 0.267712 | 0.268355 | 5.32068168 |
| 9 | ≥ 0 | < 0 | < 0 | ≥ 0 | … | 0.81432367 | 0.8143236 | 0.8143236 | 15.7761839 |
| 10 | ≥ 0 | ≥ 0 | ≥ 0 | < 0 | … | 0.1325244 | 0.132524 | 0.132524 | 15.7043319 |
| 11 | ≥ 0 | ≥ 0 | < 0 | ≥ 0 | … | 0.1837929 | 0.182090 | 0.269544 | 1.24567693 |
| 12 | ≥ 0 | ≥ 0 | < 0 | < 0 | … | 0.1714686 | 0.269538 | 0.269538 | 1.23463636 |

a Corresponding parameter configuration for Network Tx bytes, CPU utilization, HD bytes read, Memory utilization, and Response time.
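The factor-effect computation behind Tables 4 and 12 (average SNR per level, range R, and rank) can be sketched as follows; the data layout, a row of levels per experiment plus one SNR per experiment, is our assumption:

```python
def factor_effects(design, snr):
    """Per-factor range R of the level-average SNRs, and the resulting ranks.

    design: one row per experiment, each row a list of factor levels (1 or 2);
    snr: the SNR value of each experiment. R = highest average SN - lowest
    average SN per factor; rank 1 marks the factor with the largest effect.
    """
    n_factors = len(design[0])
    effects = []
    for f in range(n_factors):
        by_level = {}
        for row, s in zip(design, snr):
            by_level.setdefault(row[f], []).append(s)
        averages = [sum(v) / len(v) for v in by_level.values()]
        effects.append(max(averages) - min(averages))
    # rank factors from largest effect (rank 1) to smallest
    order = sorted(range(n_factors), key=lambda f: -effects[f])
    ranks = [0] * n_factors
    for r, f in enumerate(order, start=1):
        ranks[f] = r
    return effects, ranks
```

Feeding the level assignments of Table 9 and the SNR column of Table 11 into such a routine reproduces the range and rank rows of Table 12.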

Table 12 Factor effect rank on the job processing time output objective

| | Time of system up | Map tasks capacity | Reduce tasks capacity | Net. Rx bytes | Net. Tx bytes | CPU utilization | HD bytes read | Memory utilization | Response time |
|---|---|---|---|---|---|---|---|---|---|
| Average SNR at Level 1 | 3.18205 | 4.1784165 | 5.4175370 | 3.3712 | 3.8949 | 6.57901 | 5.11036 | 2.005514 | 4.011035 |
| Average SNR at Level 2 | 7.85630 | 5.8091173 | 4.8417803 | 7.5914 | 6.0116 | 3.58260 | 5.15667 | 8.253802 | 6.248281 |
| Factor effect (difference) | 4.67424 | 1.6307007 | 0.5757566 | 4.2202 | 2.1166 | 2.99641 | 0.04630 | 6.248288 | 2.237245 |
| Rank | 2 | 7 | 8 | 3 | 6 | 4 | 9 | 1 | 5 |

factor effect tables for the job turnaround time and hard disk bytes written output values were also developed to obtain their results.

Results

Analysis and interpretation of results

Based on the results presented in Table 12, it can be observed that memory utilization is the factor that has the most influence on the quality objective (processing time used per MapReduce job) of the output observed, at 6.248288, and hard disk bytes read is the least influential factor in this experiment, at 0.046390. Figure 6 presents a graphical representation of the factor results and their levels for the processing time output objective. To represent the optimal condition of the levels, also called the optimal solution of the levels, an analysis of SNR values is necessary in this experiment. Whether the aim is to minimize or maximize the quality characteristic (job processing time used per MapReduce job), it is always necessary to maximize the SNR parameter values. Consequently, the optimum level of a specific factor will be the one with the highest value of its SNR. It can be seen that the optimum level for each factor is represented by the highest point in the graph (as presented in Figure 6); that is, L2 for time of system up, L2 for map task capacity, L1 for reduce task capacity, etc.
Using the findings presented in Tables 11 and 12 and in Figure 6, it can be concluded that the optimum levels of the nine (9) factors for the processing time output objective in this experiment, based on our experimental cluster configuration, are those presented in Table 13.

Statistical data analysis of job processing time

The analysis of variance (ANOVA) is a statistical technique typically used in the design and analysis of experiments. According to Trivedi [22], the purpose of applying the ANOVA technique to an experimental situation is to compare the effect of several factors applied simultaneously to the response variable (quality characteristic). It allows the effects of the controllable factors to be separated from those of uncontrolled variations. Table 14 presents the results of the ANOVA analysis of the experimental factors.

Figure 6 Graphical representation of factors and their SNR levels.

Table 13 Optimum levels for factors of the processing time output

| Factor number | Performance measure | Optimum level |
|---|---|---|
| 1 | Time of CC system up | ≥ 0 (L2) |
| 2 | Load map tasks capacity | ≥ 0 (L2) |
| 3 | Load reduce tasks capacity | < 0 (L1) |
| 4 | Network Rx bytes | ≥ 0 (L2) |
| 5 | Network Tx bytes | ≥ 0 (L2) |
| 6 | CPU utilization | < 0 (L1) |
| 7 | Hard disk bytes read | ≥ 0 (L2) |
| 8 | Memory utilization | ≥ 0 (L2) |
| 9 | Response time | ≥ 0 (L2) |

As can be seen in the contribution column of Table 14, these results can be interpreted as follows (represented graphically in Figure 7):

Memory utilization is the factor that has the most influence (almost 39% of the contribution) on the processing time in this experiment. Time of CC system up is the factor that has the second greatest influence (21.814% of the contribution) on the processing time. Network Rx bytes is the factor that has the third greatest influence (17.782% of the contribution) on the processing time. Hard disk bytes read is the factor with the least influence (0.002% of the contribution) on the processing time in the cluster.

According to Taguchi's method, the factor with the smallest contribution is taken as the error estimate. So, the factor hard disk bytes read is taken as the error estimate, since it corresponds to the smallest sum of squares. The results of this case study show, based on both the graphical and statistical data analyses of the SNR, that the memory utilization required to process a MapReduce application in our cluster has the most influence, followed by the time of CC system up and, finally, network Rx bytes.

Statistical data analysis of job turnaround

The statistical data analysis of the job turnaround output objective is presented in Table 15. As can be seen in the contribution column of Table 15, these results can be interpreted as follows (represented graphically in Figure 8):

Load reduce task capacity is the factor that has the most influence (almost 50% of the contribution) on the job turnaround in this experiment.
Load map task capacity is the factor that has the second greatest influence (almost 21% of the contribution) on the job turnaround. Hard disk bytes read is the factor that has the third greatest influence (16.431% of the contribution) on the job turnaround. CPU utilization is the factor with the least influence (0.006% of the contribution) on the job turnaround in the cluster system.

In addition, based on the column for the variance ratio F shown in Table 14, it can be concluded that the factor memory utilization has the most dominant effect on the output variable. Likewise, based on the column for the variance ratio F shown in Table 15, it can be concluded that the factor time of CC system up has the most dominant effect on the output variable.

Table 14 Analysis of variance of job processing time output objective (ANOVA)

| Factors | Degrees of freedom | Sum of squares (SS) | Variance (MS) | Contribution (%) | Variance ratio (F) |
|---|---|---|---|---|---|
| Time of CC system up | 1 | 21.84857 | 21.84857 | 21.814 | 101.87 |
| Load map tasks capacity | 1 | 2.659185 | 2.659185 | 2.655 | 12.39 |
| Load reduce tasks capacity | 1 | 0.331495 | 0.331495 | 0.330 | 1.54 |
| Network Rx bytes | 1 | 17.81038 | 17.81038 | 17.782 | 83.04 |
| Network Tx bytes | 1 | 4.480257 | 4.480257 | 4.473 | 20.89 |
| CPU utilization | 1 | 8.978526 | 8.978526 | 8.964 | 41.86 |
| Hard disk bytes read | 1 | 0.002144 | 0.002144 | 0.002 | 0.001 |
| Memory utilization | 1 | 39.04110 | 39.04110 | 38.979 | 182.04 |
| Response time | 1 | 5.005269 | 5.005269 | 4.997 | 23.33 |
| Error | 0 | 0.0000 | 0.0000 | | |
| Total | 9 | 100.15 | | 100 | |
| Error estimate | 1 | 0.0021445 | | | |
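With one degree of freedom per factor, the percentage contribution column of Tables 14 and 15 is simply each factor's sum of squares over the total; a minimal sketch:

```python
def anova_contributions(ss_by_factor):
    """Percentage contribution per factor: 100 * SS_factor / SS_total.

    ss_by_factor: dict mapping a factor name to its sum of squares
    (each factor carries one degree of freedom, as in Table 14).
    """
    total = sum(ss_by_factor.values())
    return {name: 100.0 * ss / total for name, ss in ss_by_factor.items()}
```

The factor with the smallest share, as the text notes, is the one Taguchi's method then takes as the error estimate.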

Figure 7 Percentage contribution of factors for the processing time output objective.

According to Taguchi's method, the factor with the smallest contribution is taken as the error estimate. So, the factor CPU utilization is taken as the error estimate, since it corresponds to the smallest sum of squares. The results of this case study show, based on both the graphical and statistical data analyses of the SNR, that the load reduce task capacity used by a job in a MapReduce application in our cluster has the most influence on its job turnaround measure.

Statistical data analysis of hard disk bytes written

The statistical data analysis of the hard disk bytes written output objective is presented in Table 16. As can be seen in the contribution column of Table 16, these results can be interpreted as follows (represented graphically in Figure 9):

Time of CC system up is the factor that has the most influence (37.650% of the contribution) on the hard disk bytes written output objective in this experiment. Hard disk bytes read is the factor that has the second greatest influence (32.332% of the contribution) on the hard disk bytes written. CPU utilization is the factor that has the third greatest influence (18.711% of the contribution) on the hard disk bytes written. Memory utilization is the factor with the least influence (0.544% of the contribution) on the hard disk bytes written in the cluster system.

In addition, based on the column for the variance ratio F shown in Table 16, it can be concluded that the factor time of CC system up has the most dominant effect on the output variable. According to Taguchi's method, the factor with the smallest contribution is taken as the error estimate. So, the factor memory utilization is taken as the error estimate, since it corresponds to the smallest sum of squares.
The results of this experiment show, based on both the graphical and statistical data analysis of the SNR, that the Time of CC system up while a MapReduce job is executed in our cluster has the most influence on the hard disk bytes written.

Summary of the performance analysis model

To summarize, when an application is developed by means of the MapReduce framework and is executed in the experimental cluster, the factors job processing time, job turnaround, and hard disk bytes written must be taken into account in order to improve the performance of the BDA. Moreover, a summary of the performance concepts and measures which are affected by the contributing performance measures is shown in Figure 10.

Table 15 Analysis of variance (ANOVA) for the job turnaround output objective

| Factors | Degrees of freedom | Sum of squares (SS) | Variance (MS) | Contribution (%) | Variance ratio (F) |
|---|---|---|---|---|---|
| Time of CC system up | 1 | 1.6065797 | 1.6065797 | 11.002 | 174.7780 |
| Load map tasks capacity | 1 | 3.0528346 | 3.0528346 | 20.906 | 0.020906 |
| Load reduce tasks capacity | 1 | 7.2990585 | 7.2990585 | 49.984 | 0.049984 |
| Network Rx bytes | 1 | 0.0176696 | 0.0176696 | 0.121 | 0.000121 |
| Network Tx bytes | 1 | 0.1677504 | 0.1677504 | 1.148 | 0.001148 |
| CPU utilization | 1 | 0.0009192 | 0.0009192 | 0.006 | 0.62E-05 |
| Hard disk bytes read | 1 | 2.3993583 | 2.3993583 | 16.431 | 0.064308 |
| Memory utilization | 1 | 0.0521259 | 0.0521259 | 0.357 | 0.000356 |
| Response time | 1 | 0.0064437 | 0.0064437 | 0.044 | 0.000044 |
| Error | 0 | 0.0000 | 0.0000 | | |
| Total | 9 | 14.602740 | | 100 | |
| Error estimate | 1 | 0.0009192 | | | |
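The SNR analyses referred to above use Taguchi's "smaller-the-better" signal-to-noise ratio for responses that should be minimized, such as job turnaround. A minimal sketch of the formula follows; the measurement values are hypothetical and only illustrate the computation, they are not taken from the paper's experiments.

```python
import math

def snr_smaller_is_better(values):
    """Taguchi smaller-the-better SNR: -10 * log10(mean of squared responses)."""
    return -10.0 * math.log10(sum(v * v for v in values) / len(values))

# Hypothetical job-turnaround measurements (seconds) at two levels of a factor.
level_1 = [12.1, 11.8, 12.4]
level_2 = [9.7, 10.2, 9.9]

# The level with the higher SNR is preferred: smaller response, less variation.
print(snr_smaller_is_better(level_1), snr_smaller_is_better(level_2))
```

Comparing mean SNR per factor level across the orthogonal-array runs is what produces the "graphical analysis" plots such as Figures 7 to 9.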

Figure 8 Percentage contribution of factors for the job turnaround output objective.

Figure 10 shows that the performance in this experiment is determined by two sub-concepts: Time behavior and Resource utilization. The results of the performance analysis show that the main performance measures involved in these sub-concepts are Processing time, Job turnaround, and Hard disk bytes written. In addition, there are two sub-concepts which have a greater influence on the performance sub-concepts: Capacity and Availability. These concepts contribute to the performance by means of their specific performance measures, which contribute to the behavior of the performance measures; they are, respectively: Memory utilization, Load reduce task capacity, and Time of CC system up.

Conclusion

This paper presents the conclusions of our research, which proposes a performance analysis model for big data applications (PAM for BDA). This performance analysis model is based on a measurement framework for CC, which has been validated by researchers and practitioners. Such a framework defines the elements necessary to measure the performance of a CCS using software quality concepts. The design of the framework is based on the concepts of metrology, along with aspects of software quality directly related to the performance concept, which are addressed in the ISO 25010 international standard. It was found through the literature review that the performance efficiency and reliability concepts are closely associated with performance measurement. As a result, the performance analysis model for BDA which is proposed in this work integrates ISO 25010 concepts into a perspective of measurement for BDA, in which the associated terminology and vocabulary are aligned with the ISO 25010 international standard. In addition, this research proposes a methodology as part of the performance analysis model for determining the relationships between the CCP and BDA performance measures.
One of the challenges that this methodology addresses is how to determine the extent to which the performance measures are related, and their influence in the analysis of BDA performance. That is, the key design problem is to establish which performance measures are interrelated and how much they contribute to each of the performance concepts defined in the PMFCC. To address this challenge, we proposed the use of a methodology based on Taguchi's method of experimental design, combined with traditional statistical methods. Experiments were carried out to analyze the relationships between the performance measures of several MapReduce applications and the performance concepts that best represent the performance of CCP and BDA, for example CPU processing time and time behavior.

Table 16 Analysis of variance (ANOVA) for the hard disk bytes written output objective

| Factors | Degrees of freedom | Sum of squares (SS) | Variance (MS) | Contribution (%) | Variance ratio (F) |
|---|---|---|---|---|---|
| Time of CC system up | 1 | 2.6796517 | 2.6796517 | 37.650 | 69.14399 |
| Load map tasks capacity | 1 | 0.0661859 | 0.0661859 | 0.923 | 0.009299 |
| Load reduce tasks capacity | 1 | 0.0512883 | 0.0512883 | 0.720 | 0.007206 |
| Network Rx bytes | 1 | 0.1847394 | 0.1847394 | 2.595 | 0.025956 |
| Network Tx bytes | 1 | 0.4032297 | 0.4032297 | 5.665 | 0.056655 |
| CPU utilization | 1 | 1.3316970 | 1.3316970 | 18.711 | 0.187108 |
| Hard disk bytes read | 1 | 2.3011542 | 2.3011542 | 32.332 | 0.323321 |
| Memory utilization | 1 | 0.0387546 | 0.0387546 | 0.544 | 0.005445 |
| Response time | 1 | 0.0605369 | 0.0605369 | 0.850 | 0.008505 |
| Error | 0 | 0.0000 | 0.0000 | | |
| Total | 9 | 7.1172380 | | 100 | |
| Error estimate | 1 | 0.0387546 | | | |
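Under the pooling convention described above, the factor with the smallest sum of squares is taken as the error estimate, and a factor's variance ratio F is its mean square divided by the error mean square (with one degree of freedom per factor, MS equals SS). The sketch below applies this to Table 16's SS values and reproduces the dominant factor's F of about 69.14; it is an illustration of the convention, not the authors' exact computation.

```python
# Sum-of-squares values taken from Table 16 (hard disk bytes written objective).
ss = {
    "Time of CC system up": 2.6796517,
    "Load map tasks capacity": 0.0661859,
    "Load reduce tasks capacity": 0.0512883,
    "Network Rx bytes": 0.1847394,
    "Network Tx bytes": 0.4032297,
    "CPU utilization": 1.3316970,
    "Hard disk bytes read": 2.3011542,
    "Memory utilization": 0.0387546,
    "Response time": 0.0605369,
}

# Pool the smallest-SS factor as the error estimate (Memory utilization here).
error_factor = min(ss, key=ss.get)
ms_error = ss[error_factor]  # df = 1, so MS = SS

# Variance ratio F = MS_factor / MS_error for the remaining factors.
f_ratio = {f: s / ms_error for f, s in ss.items() if f != error_factor}
dominant = max(f_ratio, key=f_ratio.get)
print(error_factor, dominant, round(f_ratio[dominant], 2))
```

This recovers Time of CC system up as the dominant factor, matching the conclusion drawn from Table 16.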

Figure 9 Percentage contribution of factors for the hard disk bytes written output objective.

We found that when an application is developed in the MapReduce programming model to be executed in the experimental CCP, the performance in the experiment is determined by two main performance concepts: Time behavior and Resource utilization. The results of the performance analysis show that the main performance measures involved in these concepts are Processing time, Job turnaround, and Hard disk bytes written. Thus, these measures must be taken into account in order to improve the performance of the application. Finally, it is expected that it will be possible, based on this work, to propose a robust model in future research that will be able to analyze Hadoop cluster behavior in a production CC environment by means of the proposed analysis model. This would allow real-time detection of anomalies that affect CCP and BDA performance.

Figure 10 Summary of performance measurement analysis.
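The mapping summarized in Figure 10 can be restated as a small data structure. The concept and measure names come from the text; the grouping of measures under each sub-concept is our reading of the summary and should be treated as an illustrative sketch.

```python
# Performance sub-concepts and the main measures the analysis associated with
# them (grouping assumed from the summary discussion, not stated table-by-table).
performance_model = {
    "Time behavior": ["Processing time", "Job turnaround"],
    "Resource utilization": ["Hard disk bytes written"],
}

# Sub-concepts whose specific measures contributed most to those measures.
influencing_concepts = {
    "Capacity": ["Memory utilization", "Load reduce task capacity"],
    "Availability": ["Time of CC system up"],
}

for concept, measures in performance_model.items():
    print(concept, "->", ", ".join(measures))
```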

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

All the listed authors made substantive intellectual contributions to the research and manuscript. Specific details are as follows: LEBV: Responsible for the overall technical approach and model design, editing and preparation of the paper. AA: Contributed to requirements gathering and evaluation for designing the performance measurement framework for CC. Led the work on requirements gathering. AA: Contributed to requirements gathering and evaluation. Contributed to the design of the methodology for analysis of the relationships between performance measures. Contributed to the analysis and interpretation of the experiment results. All authors read and approved the final manuscript.

Received: 12 August 2014 Accepted: 31 October 2014

References
1. ISO/IEC (2012) ISO/IEC JTC 1 SC38: Cloud Computing Overview and Vocabulary. International Organization for Standardization, Geneva, Switzerland
2. ISO/IEC (2013) ISO/IEC JTC 1 International Organization for Standardization. ISO/IEC JTC 1 SC32: Next Generation Analytics and Big Data study group, Geneva, Switzerland
3. Gantz J, Reinsel D (2012) The Digital Universe in 2020: Big Data, Bigger Digital Shadows, and Biggest Growth in the Far East. IDC, Framingham, MA, USA
4. ISO/IEC (2011) ISO/IEC 25010: Systems and Software Engineering - Systems and Software Product Quality Requirements and Evaluation (SQuaRE) - System and Software Quality Models. International Organization for Standardization, Geneva, Switzerland
5. Alexandru I (2011) Performance analysis of cloud computing services for many-tasks scientific computing. IEEE Transactions on Parallel and Distributed Systems 22(6):931-945
6. Jackson KR, Ramakrishnan L, Muriki K, Canon S, Cholia S, Shalf J, Wasserman HJ, Wright NJ (2010) Performance Analysis of High Performance Computing Applications on the Amazon Web Services Cloud.
In: IEEE Second International Conference on Cloud Computing Technology and Science (CloudCom). IEEE Computer Society, Washington, DC, USA, pp 159-168, doi:10.1109/cloudcom.2010.69
7. Kramer W, Shalf J, Strohmaier E (2005) The NERSC Sustained System Performance (SSP) Metric. Lawrence Berkeley National Laboratory, California, USA
8. Jin H, Qiao K, Sun X-H, Li Y (2011) Performance under Failures of MapReduce Applications. Paper presented at the Proceedings of the 11th IEEE/ACM International Symposium on Cluster Computing, Cloud and Grid. IEEE Computer Society, Washington, DC, USA
9. Jiang D, Ooi BC, Shi L, Wu S (2010) The performance of MapReduce: an in-depth study. Proc VLDB Endow 3(1-2):472-483, doi:10.14778/1920841.1920903
10. Guo Z, Fox G (2012) Improving MapReduce Performance in Heterogeneous Network Environments and Resource Utilization. Paper presented at the Proceedings of the 2012 12th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid 2012). IEEE Computer Society, Washington, DC, USA
11. Cheng L (2014) Improving MapReduce performance using smart speculative execution strategy. IEEE Trans Comput 63(4):954-967
12. Hadoop AF (2014) What Is Apache Hadoop. http://hadoop.apache.org/
13. Dean J, Ghemawat S (2008) MapReduce: simplified data processing on large clusters. Commun ACM 51(1):107-113, doi:10.1145/1327452.1327492
14. Lin J, Dyer C (2010) Data-Intensive Text Processing with MapReduce. Manuscript of a book in the Morgan & Claypool Synthesis Lectures on Human Language Technologies. University of Maryland, College Park, Maryland
15. Yahoo! I (2012) Yahoo! Hadoop Tutorial. http://developer.yahoo.com/hadoop/tutorial/module7.html - configs. Accessed January 2012
16. Bautista L, Abran A, April A (2012) Design of a performance measurement framework for cloud computing. J Softw Eng Appl 5(2):69-75, doi:10.4236/jsea.2012.52011
17. ISO/IEC (2013) ISO/IEC 25023: Systems and software engineering - Systems and software Quality Requirements and Evaluation (SQuaRE) - Measurement of system and software product quality.
International Organization for Standardization, Geneva, Switzerland
18. Kantardzic M (2011) Data Mining: Concepts, Models, Methods, and Algorithms, 2nd edn. IEEE Press & John Wiley, Inc., Hoboken, New Jersey
19. Kira K, Rendell LA (1992) The Feature Selection Problem: Traditional Methods and a New Algorithm. In: The Tenth National Conference on Artificial Intelligence (AAAI). AAAI Press, San Jose, California, pp 129-134
20. Taguchi G, Chowdhury S, Wu Y (2005) Taguchi's Quality Engineering Handbook. John Wiley & Sons, New Jersey
21. Cheikhi L, Abran A (2012) Investigation of the Relationships between the Software Quality Models of ISO 9126 Standard: An Empirical Study using the Taguchi Method. Software Quality Professional Magazine, Milwaukee, Wisconsin, Vol. 14 Issue 2, p22
22. Trivedi KS (2002) Probability and Statistics with Reliability, Queuing and Computer Science Applications, 2nd edn. Wiley, New York, USA

doi:10.1186/s13677-014-0019-z
Cite this article as: Bautista Villalpando et al.: Performance analysis model for big data applications in cloud computing. Journal of Cloud Computing: Advances, Systems and Applications 2014, 3:19.