Large-Scale Network Traffic Monitoring with DBStream, a System for Rolling Big Data Analysis

Arian Bär, Alessandro Finamore, Pedro Casas, Lukasz Golab, Marco Mellia
FTW Vienna, Austria - email: {baer, casas}@ftw.at
Politecnico di Torino, Italy - email: {finamore, mellia}@tlc.polito.it
University of Waterloo, Canada - email: lgolab@uwaterloo.ca

Abstract

The complexity of the Internet has rapidly increased, making it more important and challenging to design scalable network monitoring tools. Network monitoring typically requires rolling data analysis, i.e., continuously and incrementally updating (rolling-over) various reports and statistics over high-volume data streams. In this paper, we describe DBStream, which is an SQL-based system that explicitly supports incremental queries for rolling data analysis. We also present a performance comparison of DBStream with a parallel data processing engine (Spark), showing that, in some scenarios, a single DBStream node can outperform a cluster of ten Spark nodes on rolling network monitoring workloads. Although our performance evaluation is based on network monitoring data, our results can be generalized to other big data problems with high volume and velocity.

Keywords: Big Data Analysis; Data Stream Processing; Network Data Analysis; System Performance.

I. INTRODUCTION

The complexity of large-scale, Internet-like networks is constantly increasing. With more services being offered, the massive adoption of Content Delivery Networks (CDNs) and Cloud services for traffic hosting and delivery, and the continuous growth of bandwidth-hungry video-streaming services, network and server infrastructures are becoming extremely difficult to monitor. In particular, the challenge faced by Network Traffic Monitoring and Analysis (NTMA) is to process big, heterogeneous and high-speed data. Network monitoring data are heterogeneous by nature, containing multiple types of measurements coming from different kinds of logging systems. In addition, network monitoring data come in the form of high-speed streams, which need to be continuously analyzed.
The notion of a data stream used in this paper is that of a continuous flow of measurements coming in the form of short time slices or batches, e.g., all the TCP flows captured in a backbone link in the last minute. These batches can contain a very large number of samples, given the high capacity of network links and the dynamics of Internet traffic. NTMA and other monitoring applications typically perform what we refer to as rolling data analysis: results are periodically and incrementally updated (rolled-over) as new data arrive.

In this paper, we describe DBStream, which is a system built upon the PostgreSQL database that explicitly supports incremental queries for rolling data analysis. DBStream, recently introduced in [2], ingests data streams coming in the form of short time-scale aggregated batches (i.e., one minute) from a wide variety of sources (e.g., passive network traffic data, active measurements, router logs and alerts, etc.) and performs complex continuous analysis, aggregation and filtering jobs. DBStream can store tens of terabytes of heterogeneous data, and allows both real-time queries on recent data as well as deep analysis of historical data.

The technical contributions of this paper are as follows. First, we present the Continuous Execution Language (CEL), which is a declarative SQL-based interface for specifying rolling data analysis in DBStream. CEL allows DBStream users to rapidly implement advanced data analytics which run in parallel and continuously over time using just a few lines of code, accelerating the development of new applications. Second, we compare the performance of DBStream with the popular Spark parallel processing engine using real network traffic data from an operational network. We show that rolling queries can be easily implemented in CEL, and a single DBStream node can, in some scenarios, execute them faster than a cluster of ten Spark nodes.

The remainder of the paper is organized as follows. Sec. II discusses the related work; Sec. III presents the rolling data analysis capabilities of DBStream; Sec. IV compares DBStream with Spark; and Sec. V concludes the paper.

The research leading to these results has received funding from the European Union under the FP7 Grant Agreement n. 318627 (Integrated Project mPlane).

II. RELATED WORK

There has been a great deal of effort to improve the performance and scalability of traditional database management systems by re-implementing the data processing engine, relaxing data consistency constraints and/or applying novel data processing paradigms. Still, a major limitation is the inability to cope with continuous/rolling analytics. Some relational database systems support materialized views, but incremental view maintenance over time is restricted to simple types of queries such as filters and joins, which is not sufficient for monitoring applications. Furthermore, NoSQL systems such as Hadoop [16] have been considered in the context of network monitoring [11], but they are suitable for off-line rather than rolling analytics. However, there has been some recent work on enabling real-time and/or incremental analytics in NoSQL systems, such as Incoop [4], Muppet [10], SCALLA [12] and Spark Streaming [18].

Additionally, Data Stream Management Systems (DSMSs) such as Borealis [1], Gigascope [6] and Streambase [15] support continuous processing, but they usually cannot support analytics over historical data. Recently, Data Stream Warehouses (DSWs) have been introduced, which extend traditional database systems with (nearly) continuous data ingest and processing. DataCell [13] and DataDepot [8] are two examples, as well as the DBStream system presented in this paper. The novelty of DBStream is that it enables users and applications to declaratively specify, using arbitrary SQL, exactly how to update a view when a new batch of data arrives at the system. These specifications may even refer to previously generated results that are stored in the same view, which, to the best of our knowledge, is not declaratively supported by any other system.

Finally, we note that there has been recent work in the networking community on extending SQL with additional functionalities required for network monitoring; examples include complex window expressions [5] and sequential patterns [9]. However, none of these proposals include the declarative rolling analytics that DBStream supports.

III. ROLLING ANALYTICS IN DBSTREAM

DBStream is a rolling analysis system implemented as a Data Stream Warehouse (DSW). Its main purpose is to process and combine data from multiple sources as they are produced, create aggregations, and store query results for further processing by external analysis modules or visualization. The system targets, but is not limited to, continuous network monitoring. For instance, smart grid, intelligent transportation systems, or any other use case that requires continuous processing of large amounts of data over time can take advantage of DBStream. In this paper, we focus on the following two important features of DBStream:

It supports incremental queries defined using a declarative interface based on SQL.
Incremental queries are those which update their results by combining newly arrived data with previously generated results rather than re-computing them from scratch (see Sec. III-A for more details). This enables efficient processing of two interesting groups of queries. First, aggregated variables can be kept for the elements of the monitored set, e.g., the number of bytes uploaded and downloaded by each client over a sliding window of time. Second, a set of items can be monitored over time by looking at the last state plus the new data, e.g., monitoring the set of all server IP addresses that are accessed within a sliding window of time such as in the last two weeks.

In contrast to many other systems, DBStream does not change the query processing engine. Instead, queries over data streams are evaluated as repeated invocations of a process that consumes a batch of newly arrived data and combines them with the previous result to come up with the new result. Therefore, DBStream is able to reuse the full functionality of the underlying DBMS, including its query processing engine and query optimizer.

A. Continuous Execution Language (CEL)

In this section we describe the user and application interface to DBStream, based on SQL, to define rolling analytics. We give a high-level overview of CEL using examples from the networking domain. Let us assume we have a stream of data coming from a router. It sends one row per minute and per TCP flow with information about the uploaded and downloaded bytes typical for NetFlow [14] compliant routers. The schema of input data is thus known. We are interested in how many bytes are uploaded and downloaded per hour on that link. In CEL, this can be expressed as the following job:

    <job inputs="A (window 60min)"
         output="B"
         schema="serial_time int4, total_download int8, total_upload int8">
    <query>
      select _STARTTS, sum(download), sum(upload)
      from A group by _STARTTS
    </query></job>

The inputs attribute defines the input window and the output attribute defines the destination for the result.
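As a rough illustration of the repeated-invocation model (plain Python, not DBStream code), the sketch below runs a query once per completed window and appends each result to an output list, mirroring the hourly job. The names RollingJob and sum_bytes, the row layout (ts, download, upload) with timestamps in seconds, and the rule that a window counts as complete once a row belonging to a later window has arrived are all assumptions made for this example.

```python
class RollingJob:
    """Runs `query` once per completed window and appends the result to `output`."""

    def __init__(self, window, query):
        self.window = window      # window length in seconds (assumed unit)
        self.query = query        # callable(window_start, rows) -> result row
        self.pending = []         # rows that arrived but were not consumed yet
        self.next_start = None    # start time of the next window to process
        self.output = []          # stands in for the output table B

    def ingest(self, batch):
        """Add a time-ordered batch of (ts, download, upload) rows."""
        self.pending.extend(batch)
        if self.next_start is None and self.pending:
            first_ts = self.pending[0][0]
            self.next_start = first_ts - first_ts % self.window
        # Assumption: a window is complete once a row of a later window arrived.
        while self.pending and self.pending[-1][0] >= self.next_start + self.window:
            end = self.next_start + self.window
            rows = [r for r in self.pending if r[0] < end]
            self.pending = [r for r in self.pending if r[0] >= end]
            self.output.append(self.query(self.next_start, rows))
            self.next_start = end


def sum_bytes(window_start, rows):
    """Counterpart of: select _STARTTS, sum(download), sum(upload) from A."""
    return (window_start, sum(r[1] for r in rows), sum(r[2] for r in rows))


job = RollingJob(window=3600, query=sum_bytes)
job.ingest([(0, 10, 1), (60, 20, 2)])       # first hour is still incomplete
job.ingest([(3540, 5, 5), (3600, 7, 7)])    # a row of the next hour arrives
print(job.output)                           # [(0, 35, 8)]
```

The query function only ever sees the rows of one window, which is the property that lets the underlying engine stay unchanged. Unlike the SQL version, this sketch emits a zero-valued row for a window without data.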
In the CEL job above, a 60-minute window over A is specified, meaning that for each new hour of data in A, the query specified in the query element will run and its results will be appended to table B. DBStream supports all SQL queries that are supported by the underlying DBMS, which is PostgreSQL at the moment. In this example, the query sums up the uploaded and downloaded bytes for each hour. The query includes a from A statement, which does not actually read all of A, only the window of A that was specified in the inputs statement (i.e., the most recent 60 minutes). The schema of the output stream B is defined using the schema statement, for which the first field must be a timestamp called serial_time. In the above example, the timestamp field is the start time of a window, denoted by _STARTTS.

Footnote 1: Although more complex definitions can be used, here a flow can be identified by the 5-tuple: source IP, destination IP, source port, destination port and IP protocol.

[Fig. 1. Multiple input window definitions possible in DBStream's Continuous Execution Language (CEL): A) single-window query; B) two-window query; C) sliding-window query; D) incremental query.]

Fig. 1 illustrates the supported window definitions. For each job, one window is defined to be the primary window and is marked with the primary keyword. After a job instance is done, the state of the job is shifted by the size of the primary window. As soon as there is a full new input window, the next instance of the job is executed. The other important keyword is delay, which shifts a window into the past by a given amount of time.

Part A) of Fig. 1 shows the simplest window definition, similar to the previous example. Only a single window exists, which is also the primary window. Therefore, the defined query is executed for every minute of the input stream. In Part B), we have two windows. Every three minutes (the length of the primary window), the query for this job reads data from each of the two input windows. Part C) shows how to independently define the window length and the frequency of query execution. The primary window is one minute long, meaning that the query is executed every minute. However, the query can access the last three minutes of the same stream A through the other window, enabling many interesting kinds of queries, such as a rolling average, sum or any other aggregation. Part D) explains the delay keyword. Here, the same input stream is referenced twice, but for the second window, a delay of one minute is specified. As a result, the query can read data from both the current minute (the primary window) and the previous minute (the delayed window) of stream A. This makes complex incremental queries possible, such as a rolling/moving set or median, by being able to reference the previous state of the data and compare it with the current state.

The main difference between DBStream's CEL and stream processing languages is the handling and definition of windows, and sliding windows in particular. For example, in StreamBase [15], windows are specified as [SIZE x ADVANCE y TIME], where x defines the length of the window and y the query execution frequency. In CEL, the primary keyword corresponds to ADVANCE, but is specified only once regardless of the number of inputs to make it clear how often to re-compute the query.

Although Fig. 1 shows several possible window types, it still covers only a small fraction of possible window definitions. Since data in DBStream are always stored on non-volatile storage, windows can reference past history. It is possible to reference data from one week or even one month ago, e.g., to compare the current state of the network with the past.

B. Examples of Rolling Analytics

We give two more complex incremental job examples, detailing how rolling analytics can be implemented in CEL. We start with a rolling window average shown below, in which every minute, we calculate the average uploaded and downloaded bytes over the last three minutes.

    <job inputs="A (window 1min primary) as A1, A (window 3min) as A2"
         output="B"
         schema="serial_time int4, avg_download float8, avg_upload float8">
    <query>
      select _STARTTS, avg(download), avg(upload)
      from A2
    </query></job>

The first window, A1, is the primary window that denotes the query execution frequency. The second window, A2, is used to run the actual average calculation. Fig. 2(a) illustrates how the windows over stream A correspond to results appended to B; the output of the above job is a sequence of new results generated every minute, all of which are stored in B and identified by their _STARTTS (window start time) timestamps. There is a simple performance optimization that can easily be expressed in CEL: we can pre-aggregate each minute of the data in A using one query, and then write a second query to add up the three most recently pre-aggregated windows and compute the three-minute aggregates.

In the next example, we compute the distinct set of IP addresses active in the last hour, updated every minute. A naive approach is to always scan the last hour of data from scratch whenever the result is to be updated. A more efficient approach is to keep an intermediate state of distinct IP addresses of the last hour in memory. Then, we can compute the distinct set of IP addresses for the current minute as the union of the set of IP addresses from the current minute and those from the last 59 minutes. However, since state is kept in memory, it must be re-built in case of a system crash. In CEL, we can implement the latter via a job that uses its own past output as input. This approach is not only more efficient, but also, as we show in Sec. IV-C, more fault-tolerant, since the state of the computation is actually stored in the output table. The corresponding CEL job definition is given at the end of this section.
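To make the comparison concrete, the following Python sketch (an illustration only, not DBStream code) contrasts the naive rescan with an update that reads just the previous result and the current minute. The function names, the use of integer minute indices as timestamps, and the dictionary from IP address to last-active minute are assumptions of this example.

```python
HORIZON = 60  # length of the sliding window in minutes


def naive_active(history, now):
    """Rescan the last HORIZON minutes of raw data from scratch.

    history maps a minute index to the set of IP addresses seen in it.
    """
    result = {}
    for minute in range(now - HORIZON + 1, now + 1):
        for ip in history.get(minute, ()):
            result[ip] = minute          # later minutes overwrite earlier ones
    return result


def update_active(previous, current_ips, now):
    """Combine the previous result with the current minute only.

    previous maps IP -> last-active minute, i.e. the result one minute ago.
    """
    # Keep addresses whose last activity is less than HORIZON minutes old.
    result = {ip: last for ip, last in previous.items() if last > now - HORIZON}
    # Addresses seen in the current minute get the current time as last activity.
    for ip in current_ips:
        result[ip] = now
    return result


history = {0: {"10.0.0.1", "10.0.0.2"}, 30: {"10.0.0.2"}, 60: {"10.0.0.3"}}
state = {}
for minute in range(0, 61):
    state = update_active(state, history.get(minute, ()), minute)
    assert state == naive_active(history, minute)   # both approaches agree
print(state)
```

Each call to update_active touches the previous result and one minute of new data, whereas naive_active touches a full hour of data on every update. Because the previous result is passed in explicitly, it can equally well be read back from stored output, which is the idea behind the feedback job.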
The input of the CEL job is a stream C, which contains, among other things, the IP addresses of active terminals. We want to transform stream C into a new stream D containing, for each minute, the distinct set of active IP addresses in the last hour. To achieve this, we first add a new timestamp last to D recording the time of the last activity of an IP address. Now, from the current minute of C, we produce a new tuple for each distinct IP address and we set the last activity to the start of the current window using the _STARTTS keyword. From the previous minute of D we select those IP addresses which were active less than one hour ago. We then combine those two results using the SQL UNION ALL operator and select, for each distinct IP address, the current time, the maximum value of the last activity timestamp, and the IP address itself. By using this feedback loop, we can efficiently compute the set of IP addresses active in the last hour per minute, without keeping any additional state information. The windows used in this computation are visualized in Fig. 2(b).

[Fig. 2. Data flow of two example incremental jobs; the windows of the current task are marked in black. (a) Rolling average over the last 3 minutes, updated every minute. (b) Complex data processing flow for an incremental query.]

    <job inputs="C (window 1min primary), D (window 1min delay 1min)"
         output="D"
         schema="serial_time int4, last int4, ip inet">
    <query>
      select _STARTTS, max(last), ip from (
        select _STARTTS as last, ip
        from C group by 1,2
        union all
        select last, ip from D
        where last > _STARTTS-60 group by 1,2)
      group by 1,3
    </query></job>

IV. PERFORMANCE ANALYSIS

We compare DBStream with the state-of-the-art Big Data framework Spark. Spark is an open-source MapReduce solution proposed by the UC Berkeley AMPLab. It exploits Resilient Distributed Datasets (RDDs), i.e., a distributed memory data abstraction which allows in-memory operations on large clusters in a fault-tolerant manner [17]. This approach has been demonstrated to be particularly efficient [3], enabling both iterative and interactive applications in Scala, Java or Python. Spark does not strictly require the presence of a Hadoop cluster to run. In fact, although the system is commonly used in combination with Hadoop and HDFS, it also offers a simple, standalone resource manager to coordinate the activities of different hosts and supports direct access to the Linux file system.

A recent evolution of Spark is Spark Streaming [18]. Unlike Spark, which is a pure batch processing solution, Spark Streaming enables real-time analysis through processing of short batches.
Of particular interest are the system primitives for defining sliding windows and developing incremental queries similarly to what was discussed in Sec. III-A. However, Spark Streaming targets mainly real-time analysis scenarios and offers limited support for processing historical data, which is also required by NTMA. Recent discussions on the Spark Streaming mailing list suggest that some workarounds may be possible (see footnote 2). However, we were unable to implement these and therefore we leave the evaluation of Spark Streaming for rolling analytics as future work.

A. System Setup and Datasets

We installed DBStream and Spark on a set of machines having the same hardware (6 core XEON E5 2640, 32 GB of RAM and 5 HDs of 3 TB each). One machine has been dedicated to DBStream, recombining 4 of the available HDs in a RAID0 and installing PostgreSQL v9.2.4 as the underlying Database Management System (DBMS). The remaining 10 machines compose a production Hadoop cluster that runs CDH 4.6 with MapReduce v1 JobTracker enabled. On the cluster we also installed Spark v..0, where we could only enable the standalone resource manager (see footnote 3). All machines are located within the same rack and connected through a 1 Gb/s switch. The rack also contains a 40 TB NAS used to collect historical data.

In particular, we use four 5-day-long datasets, each collected at a different network Vantage Point (VP) in a real ISP network between February 3 and February 7, 2014. Each VP is instrumented with Tstat [7] to produce per-flow text log files from monitoring the traffic of more than 20,000 households. For the purpose of this work we focus only on TCP traffic, for which Tstat reports more than 100 network indexes and generates a new log file each hour. Overall, each VP generated a dataset of about 160 GB of raw data (i.e., about 5 times the memory available on each node) for a total of about 640 GB (i.e., twice the memory available on the whole cluster).

Footnote 2: http://apache-spark-user-list.1001560.n3.nabble.com/window-analysis-with-spark-and-spark-streaming-d8806.html#a985

Footnote 3: Apparently, the implementation of Yarn provided in CDH 4.6 has some incompatibilities with Spark. These seem to be solved in CDH 5, which provides Yarn by default and a parcel for Spark v..0. Unfortunately, testing such a configuration requires an upgrade of the node operating systems, which was not possible in our production environment.

B. Benchmark Definition

We use a set of 7 jobs, representing daily operations performed on the production Hadoop cluster we are considering.

J1: for every 10 minutes, i) map each destination IP address to its organization name (orgname for short) through the Maxmind Orgname database (www.maxmind.com/en/geoip2-isp), and ii) for each orgname found, compute aggregated traffic statistics (min/max/average Round-Trip Time (RTT), number of distinct server IP addresses, total number of uploaded/downloaded bytes).
J2: for every hour, i) compute the orgname-IP mapping as in J1, ii) filter all orgnames related to the Akamai CDN, and iii) compute some aggregated statistics (min/max/average RTT).
J3: for every hour, i) compute the orgname-IP mapping as in J1, and ii) select the top 10 orgnames having the highest number of distinct IP addresses.
J4: for every hour, i) transform the destination IP address into a /24 subnet, and ii) select the top 10 /24 subnets having the highest number of flows.
J5: for every minute, for each source IP address, compute the total number of uploaded/downloaded bytes and flows.
J6: for every minute, i) find the set of distinct destination IP addresses, and ii) use it to update the set of IP addresses that were active over the past 60 minutes.
J7: for every minute, i) compute the total uploaded/downloaded bytes for each source IP address, and ii) compute the average over the past 60 minutes.

Overall, these jobs define performance indexes related to CDNs (J1 to J4), statistics related to the monitored households (J5), and two incremental queries (J6 and J7).

C. Benchmark implementation

Each analysis engine has different peculiarities, properties and tuning options. Different implementations are therefore possible for the defined benchmark. We define a possible implementation that we consider reasonable, discussing possible modifications that can affect performance.

DBStream benchmark: All queries are expressed in the Continuous Execution Language (CEL).
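Independently of the engine used, the semantics of a job such as J4 can be pinned down in a few lines of plain Python. The sketch below is only an illustration for one hour of flows, not the CEL or Spark code used in the benchmark; the function name top_subnets and the representation of flows as a list of destination IPv4 address strings are assumptions.

```python
from collections import Counter
import ipaddress


def top_subnets(dst_ips, n=10):
    """Return the n /24 subnets with the most flows as (subnet, flows) pairs."""
    counts = Counter(
        # strict=False masks the host bits, e.g. 192.0.2.77/24 -> 192.0.2.0/24
        str(ipaddress.ip_network(f"{ip}/24", strict=False))
        for ip in dst_ips
    )
    return counts.most_common(n)


# One destination address per flow (documentation address ranges).
flows = ["192.0.2.1", "192.0.2.200", "198.51.100.7"]
print(top_subnets(flows))
# [('192.0.2.0/24', 2), ('198.51.100.0/24', 1)]
```

Running the function once per hourly batch corresponds to the non-incremental style of jobs J1 to J5, where no result from a previous window is needed.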
The fact that the output of a job is stored on disk and can be used as input to another job is exploited to achieve better performance. Fig. 3 shows the resulting job dependencies, where the nodes represent the jobs and an arrow from, e.g., job J1 to J2 means that the output of J1 is used as input to J2. The number next to an arrow indicates the size of the input window in minutes. For instance, J4 and J5 are implemented in a single step using an input window of 60 minutes of imported data. Conversely, J6 is implemented using an intermediate step, J6 prepare, which pre-aggregates the set of active IP addresses per minute in windows of 10 minutes of imported data. Now, J6 can utilize the output of J6 prepare and combine it with its own past output, as indicated by the reflexive arrow starting from and going back into J6, to compute the final result. Please note that each minute of J6 contains the active IP addresses of the last 60 minutes along with a timestamp indicating when those IPs were last active. In each one-minute step of J6 this timestamp is checked, and IPs which were last active longer than 60 minutes ago are removed.

[Fig. 3. Job inter-dependencies for the DBStream implementation. Nodes represent jobs and arrows precedence constraints.]

Spark benchmark: Each job is implemented as a separate Spark application using Scala. Each application receives a list of files located on HDFS as input and processes them sequentially. The first 5 jobs have a straightforward implementation, since they do not present strong data dependencies and data are already split per hour. The two incremental queries, J6 and J7, are instead more complex to implement. In fact, we need to implement the logic to store and update the data in windows. We consider a simple approach, creating an RDD collecting per-minute data bins on which we then loop to compose 60-minute windows. Our implementation processes data in a stream of hourly batches, where the results are available after the processing of each batch has finished.

D. Results

Fig. 4 shows the results of running Spark on our cluster of 10 machines. The labels 1VP and 4VP correspond to the number of vantage points collecting data, i.e., 4VP corresponds to four times as much data as 1VP. For the jobs J1 to J5, Spark offers excellent performance and the whole cluster is perfectly able to parallelize processing, leading to very good results. However, jobs J6 and J7 do not scale well. J6 in particular cannot be parallelized very well, since data have to be synchronized and merged in one single location after each minute. We also tried different implementations of J6 using more complex strategies and a higher number of map/reduce tasks, aiming to utilize further cluster resources, but these turned out to perform even worse. Also for J7, the computation has to be synchronized for every minute, but here the amount of data is smaller since the output for every minute is only a single number. This might be the reason why J7 performs better than J6. While we cannot exclude the possibility of a better-performing implementation in Spark for J6 and J7, these results show that obtaining good performance with Spark in such scenarios is not at all straightforward. Typical optimizations used for such a problem, such as skip lists or complex tree structures, are hard to parallelize and would not be a fair comparison to a declarative language like CEL.

[Fig. 4. Performance numbers for different setups using Spark. Execution time in minutes for Import and J1 to J7; series: Spark, 10 nodes, 1 VP; Spark, 10 nodes, 2 VPs; Spark, 10 nodes, 4 VPs; Spark, 1 node, 1 VP.]

[Fig. 5. Scalability comparison of DBStream and Spark. Execution time in minutes for 1, 2 and 4 VPs; series: Spark, 10 nodes, J1 to J7; Spark, 10 nodes, Import + J1 to J7; DBStream, 1 node, Import + J1 to J7.]

In Fig. 5, we compare the performance of Spark and DBStream. In DBStream, the total execution time is measured from the start of the import of the first hour of data until all jobs have finished processing the last hour of data. For Spark, all jobs were started at the same time in parallel. We report the total execution time of the job finishing last, which was J6 in this experiment. Since for Spark data import and data processing are separated, we also report the sole job processing time without data import. For DBStream, the execution time increases nearly linearly with the number of VPs, indicating linear scalability, at least up to the number of VPs used. In contrast, for Spark the main bottleneck is the execution time of J6. The total execution time does not increase much with more VPs, since multiple instances of J6 run in parallel. Therefore, Spark is able to utilize its parallel nature better the more jobs are running, whereas DBStream shows better performance for incremental jobs. Notably, for the 1 VP case, Spark takes 2.6 times longer to finish importing and processing the data.

V. CONCLUSION

In this paper, we presented the DBStream system for rolling big data analysis.
We focused on the way in which DBStream allows a declarative specification of incremental queries, including those which access their previous results in order to compute new results. When tested with real network monitoring datasets and workloads, a single DBStream node performed as well as a cluster of ten Spark nodes due to the performance advantages of incremental processing.

There are several interesting directions for future work. One is to develop DBStream on top of a parallel database engine such as Greenplum so that it can scale out as well as or better than Spark on cluster implementations. Another option is to use Spark (in particular, its latest version that can directly execute SQL queries) as DBStream's processing engine, and compare the two architectures. Finally, since network monitoring (and other monitoring applications) often involves complex machine learning that cannot be easily expressed in SQL, we will investigate how to implement rolling machine learning operators in DBStream.

REFERENCES

[1] D. Abadi, D. Carney, U. Cetintemel, M. Cherniack, C. Convey, S. Lee, M. Stonebraker, N. Tatbul, S. Zdonik, Aurora: a new model and architecture for data stream management, The VLDB Journal 12(2):120-139 (2003).
[2] A. Bär, P. Casas, L. Golab, A. Finamore, DBStream: an Online Aggregation, Filtering and Processing System for Network Traffic Monitoring, in IWCMC 2014 - 5th TRAC Workshop, 2014.
[3] Berkeley AMPLab, Big Data Benchmark, https://amplab.cs.berkeley.edu/benchmark/, 2014.
[4] P. Bhatotia, A. Wieder, R. Rodrigues, U. Acar, R. Pasquin, Incoop: MapReduce for Incremental Computations, in SOCC 2011, 1-14.
[5] K. Borders, J. Springer, M. Burnside, Chimera: A Declarative Language for Streaming Network Traffic Analysis, in USENIX Security Symp., 2012.
[6] C. Cranor, T. Johnson, O. Spatscheck, V. Shkapenyuk, Gigascope: a stream database for network applications, in SIGMOD 2003, 647-651.
[7] A. Finamore, M. Mellia, M. Meo, M. Munafo, P. D. Torino, D. Rossi, Experiences of internet traffic monitoring with tstat, IEEE Network 25(3):8-14 (2011).
[8] L. Golab, T. Johnson, J. S. Seidel, V. Shkapenyuk, Stream Warehousing with DataDepot, in SIGMOD 2009, 847-854.
[9] L. Golab, T. Johnson, S. Sen, J. Yates, A sequence-oriented stream warehouse paradigm for network monitoring applications, in PAM 2012, 53-63.
[10] W. Lam, L. Liu, S. Prasad, A. Rajaraman, Z. Vacheri, A. Doan, Muppet: MapReduce-style processing of fast data, PVLDB 5(12):1814-1825, 2012.
[11] Y. Lee, Y. Lee, Toward Scalable Internet Traffic Measurement and Analysis with Hadoop, in SIGCOMM Comput. Commun. Rev. (CCR) 43(1):5-13, 2012.
[12] B. Li, E. Mazur, Y. Diao, A. McGregor, P. Shenoy, SCALLA: A platform for scalable one-pass analytics using MapReduce, ACM Transactions on Database Systems 37(4):1-43, 2012.
[13] E. Liarou, S. Idreos, S. Manegold, M. Kersten, MonetDB/DataCell: online analytics in a streaming column-store, PVLDB 5(12):1910-1913, 2012.
[14] RFC 3954 - Cisco Systems NetFlow Services Export Version 9, 2004.
[15] StreamBase. Streambase: Real-time, low latency data processing with a stream processing engine. http://www.streambase.com, 2014.
[16] T. White, Hadoop: the definitive guide, O'Reilly, 2012.
[17] M. Zaharia, M. Chowdhury, M. Franklin, S. Shenker, I. Stoica, Spark: Cluster Computing with Working Sets, in HotCloud workshop, 2010.
[18] M. Zaharia, T. Das, H. Li, S. Shenker, I. Stoica, Discretized Streams: An Efficient and Fault-Tolerant Model for Stream Processing on Large Clusters, in HotCloud workshop, 2012.