Mastering the Velocity Dimension of Big Data Emanuele Della Valle DEIB - Politecnico di Milano emanuele.dellavalle@polimi.it
It's a streaming world Agenda Mastering the velocity dimension with informaeon flow processing Hands- on on esper and the Event Processing Language Stream Reasoning: adding the variety dimension What about Veracity? 2
It's a streaming World! 1/2 Off- shore oil operaeons Smart CiEes Global Contact Center Social networks Generate data streams! 3
It's a streaming World! 2/2 What is the expected Eme to failure when that turbine's barring starts to vibrate as detected in the last 10 minutes? Is public transportaeon where the people are? Who are the best available agents to route all these unexpected contacts about the tariff plan launched yesterday? Who is driving the discussion about the top 10 emerging topics? E. Della Valle, S. Ceri, F. van Harmelen, D. Fensel It's a Streaming World! Reasoning upon Rapidly Changing InformaEon. IEEE Intelligent Systems 24(6): 83-89 (2009) 4
Requirements 1/8 A system able to answer those queries must be able to handle massive datasets A typical oil produceon plaborm is equipped with about 400.000 sensors Telecom data is the most pervasive data source in urban are, in Milano there are 1.8 million mobile users A global contact centre of a Telecom operator counts 500 millions of clients Facebook alone has 1.1 billion of aceve users 5
Requirements 2/8 A system able to answer those queries must be able to process data streams on the fly The sensors on typical oil produceon plaborm generates 10,000 observaeons per minute with peaks of 100,000 o/m The mobile users in Milano generates 20,000 call/sms/data conneceons per minute with peaks of 80,000 c/m A global contact centre receives 10,000 contacts per minute with peaks of 30,000 c/m Facebook, as of May 2013, observes 3 millions "I like" per minute 6
Requirements 3/8 A system able to answer those queries must be able to cope with heterogeneous dataset The sensors on typical oil produceon have been deployed over 10 years by 10s of different producers Tens of data sources are normally needed to make sense of an urban phenomena A global contact centre consists in 100s of offices owned by different subsidiary companies engaged yearly Each social network has its own data model, APIs, 7
Requirements 4/8 A system able to answer those queries must be able to cope with incomplete data 10s of sensors and networking links broke down daily Coverage is incomplete Only standard cases are covered by fully machine processable data records 100s of contacts per minute are manage ad- hoc ConversaKons happen outside the social networks, too! 8
Requirements 5/8 A system able to answer those queries must be able to cope with uncertain data Sensor out- of- operakng range Faulty sensors Agents misunderstand, get Kred, Irony, sarcasm, 9
Requirements 6/8 A system able to answer those queries must be able to provide reackve answers deteceon of dangerous situaeons must occur within minutes recommendaeons to ciezens must be performed in few seconds roueng a contact through each step of the decision tree must take less than a second Search autocompleeng may need to be updated every few minutes 10
Requirements 7/8 A system able to answer those queries must be able to support fine- grained informakon access IdenEfy a turbine among thousands Locate a bus among thousands Contact an agent among thousands IdenEfy an opinion maker among thousands of influencers for a topic 11
Requirements 8/8 A system able to answer those queries must be able to integrate complex domain models of operakonal and control process various city aspects contact management, contract types, agent skills, contactor profiles, topics, user profiles, 12
Requirements (wrap up) A system able to answer those queries must be able to handle massive datasets x process data streams on the fly cope with heterogeneous dataset cope with incomplete data cope with uncertain data provide reackve answers support fine- grained informakon access integrate complex domain models 13 x Volume' x x x Velocity' x x x x Variety' x Veracity'
Other domains Intrusion deteceon Fraud DetecEon Emergency Response Services TransportaEon and LogisEcs Supply Chain OpEmizaEon System monitoring Click inspeceon... 14
Mastering the Velocity dimension with InformaEon Flow Processing (IFP) solueons 15
Velocity: Say goodbye to DBMS The DBMS way query The IFP way data data reply 16 query reply
IFP - Gartner The Gartner hype cycle 17
IFP - Forrester Forrester s top 15 emerging tech to watch: Now to 2018 18
Is there a market beyond hype? Complex Event Processing Market Worth $3,322M by 2018 (2014 Report by MarketsandMarkets) Major players include: Microsom IBM Oracle SAP Tibco... 19
InformaEon Flow Processing The IFP engine processes incoming flows of informa1on according to a set of processing rules Processing is on line Sources produce the incoming informaeon flows, sinks consume the results of processing, rule managers add or remove rules InformaEon flows are composed of informa1on items Items part of the same flow are neither necessarily ordered nor of the same kind Processing involve filtering, combining, and aggregaeng flows, item by item as they enter the engine Sources Informa1on Flows 20 IFP Engine Informa1on Flows - - - - - - - - - - - - - - - Rules - - - - - - - - - - - - - - - - - - - - - - - - - Rule managers Sinks
IFP: A bit of history of two approaches Traditional DBMS Eventbased Systems Active DBMS DSMS 21 CEP
From Passive to AcEve DBMSs Standard DBMSs Purely passive: Human- ac1ve database- passive (HADP) ExecuEon happens only when asked by clients (through queries) AcEve DBMSs The reaceve behavior moves (in part) from the applicaeon to the DB layer which executes Event CondiEon AcEon (ECA) rules 22
As a DBMS extension AcEve DBMSs Rules may only refer to the internal state of the DB Closed DB applicaeons Rules may support the semanecs of the applicaeon, but external sources of events are not allowed But events may come from external sources 23
Data Stream Management Systems (DSMS) Data streams are (unbounded) sequences of Eme- varying data elements Represent: an (almost) conenuous flow of informaeon Eme with the recent informaeon being more relevant as it describes the current state of a dynamic system 24
Data Stream Management Systems (DSMS) The nature of streams requires a paradigmaec change* from persistent data one Eme semanecs to transient data conenuous * This paradigmaec change first arose in DB community in the late '90s 25
ConEnuous SemanEcs ConEnuous queries registered over streams that are observed trough windows window Dynamic System 26 input streams Registered ConEnuous Query streams of answer
Event- based systems Components collaborate by exchanging informaeon about occurrent events. In parecular: Components publish noeficaeons about the events they observe, or they subscribe to the events they are interested to be noefied about CommunicaEon is: Purely message based Asynchronous MulEcast Implicit Anonymous topic=fire* & place=* topic=* & place=1 st floor topic=fire alarm & place=* 27 fire training at 1 st floor fire alarm at 1 st fire floor training at 1 st floor fire alarm at 1 st fire floor training at 1 st floor fire alarm at 1 st floor fire alarm at 1 st floor
Complex Event Processing (CEP) CEP systems adds the ability to deploy rules that describe how composite events can be generated from primieve (or composite) ones Typical CEP rules search for sequences of events Raise C if A B Time is a key aspect in CEP 28 - - - - - - - - - - - - - - - - - - - - Rules
Transforming vs. DetecEng Systems Transforming rules Define an execueon plan combining primieve operators Each operator transforms one or more input flows into one or more output flows The execueon plan can be defined: explicitly (e.g., through graphical notaeon) Usually, deteceon is based on a implicitly (using a high level language) Omen used with homogeneous informaeon flows To take advantage of the predefined structure of input and output 29 DetecKng rules Present an explicit disenceon between a detec1on and produc1on phase Detect a situaeon of interest Produce a complex event to noefy that the situaeon has happened logical predicate that captures paferns of interest in the history of received items
DeclaraKve Transforming languages Specify the expected result rather than the desired execueon flow Usually derive from relaeonal languages ImperaKve Specify the desired execueon flow StarEng from primieve operators Can be user- defined RelaEonal algebra / SQL CQL/Stream: Select IStream(*) From F1[Rows 5], F2[Rows 10] Where F1.A = F2.A 30 Usually adopt a graphical notaeon Arvind Arasu, Shivnath Babu, Jennifer Widom: The CQL conenuous query language: semanec foundaeons and query execueon. VLDB J. 15(2): 121-142 (2006)
ImperaEve Languages Aurora (Boxes & Arrows Model) 31 Daniel J. Abadi, Donald Carney, Ugur ÇeEntemel, Mitch Cherniack, ChrisEan Convey, Sangdon Lee, Michael Stonebraker, Nesime Tatbul, Stanley B. Zdonik: Aurora: a new model and architecture for data stream management. VLDB J. 12(2): 120-139 (2003)
DeclaraEve languages Specify a firing condi1on as a pa?ern Select a poreon of incoming flows through Logic operators Content / Eming constraints The ac1on part uses the selected items to produce new knowledge TESLA / T-Rex Define Fire(area: string, measuredtemp: double) From Smoke(area=$a) and last Temp(area=$a and value>45) within 5 min. from Smoke Where area=smoke.area and measuredtemp=temp.value Gianpaolo Cugola, Alessandro Margara: TESLA: a formally defined event specificaeon language. DEBS 2010: 50-61 Gianpaolo Cugola, Alessandro Margara: Complex event processing with T- REX. Journal of Systems and Somware 85(8): 1709-1728 (2012) 32
Hands- on Esper and EPL See dedicated slides 33
Adding Variety to Velocity 34
Variety: Syntax changes, semanecs not Data comes from muleple sources in muleple formats The SemanEc Web is teaching us how to represent informaeon (knowledge) in a standard way RDF, OWL, Ontologies, SPARQL,... But the world in not staec... 35
DSMS/CEP + SemanEc Web = Stream Reasoning Requirement massive datasets data streams heterogeneous dataset DSMS CEP SemanKc Web incomplete data noisy data reaceve answers fine- grained informaeon access complex domain models The Stream Reasoning Era is coming 36
IFP: A bit of history (ConEnued) Traditional DBMS Eventbased Systems Active DBMS DSMS Semantic Web Stream Reasoning 37 CEP
Research QuesEon is it possible to make sense in real Kme of mulkple, heterogeneous, gigankc and inevitably noisy and incomplete data streams in order to support the decision process of extremely large numbers of concurrent user? E. Della Valle, S. Ceri, F. van Harmelen, D. Fensel It's a Streaming World! Reasoning upon Rapidly Changing InformaEon. IEEE Intelligent Systems 24(6): 83-89 (2009) 38
Stream Reasoning feasibility (intuieon) Many relevant reasoning methods are not able to deal with high frequency data streams However, trade- off exists between the complexity of the reasoning method and the frequency of the data stream NEXPTIME AbstracEon DL Reasoning 1 Hz Complexity PTIME AC 0 SelecEon InterpretaEon DL- Lite SemanEc Streams Raw Stream Processing Complexity vs. Dynamics Querying Re- wrieng 10 4 Hz Change Frequency Heiner Stuckenschmidt, Stefano Ceri, Emanuele Della Valle, Frank van Harmelen: Towards Expressive Stream Reasoning. Proceedings of the Dagstuhl Seminar on SemanEc Aspects of Sensor Networks, 2010. 39
PracEcal cases 10+ deployments in Sensor Networks, and Social media analyecs BOTTARI City Data Fusion Winner of SemanKc Web Challenge 2011 Winner of IBM faculty award 2013 M. Balduini, I. Celino, D. Dell Aglio, E. Della Valle, Y. Huang, T. Lee, S.- H. Kim, V. Tresp: BOTTARI: An augmented reality mobile applicaeon to deliver personalized and locaeon- based recommendaeons by conenuous analysis of social media streams. J. Web Sem. 16: 33-41 (2012) 40
Where to learn more about stream General Info reasoning h?p://streamreasoning.org W3C Community h?p://www.w3.org/community/rsp/ Papers h?p://scholar.google.com/scholar?hl=en&q=stream +reasoning 41
What about Veracity? 42
Support for Uncertainty Many applicakons deal with imprecise (or even incorrect) data from sources E.g., sensors, RFID, The ability to associate a degree of uncertainty to informaeon items increases expressiveness To the content of items Imprecise temperature reading To the presence of an item (occurrence of an event) Spurious RFID reading 43 Two orthogonal aspects Support for uncertain input Allows rules to deal with/ reason about uncertain input data Support for uncertain output Allows rules to associate a degree of uncertainty to the output produced Uncertain rules
Where to learn more on InformaEon Flow Processing PhD course on Stream And Complex Event Processing Lecturers Emanuele Della Valle Gianpaolo Cugola Alessandro Margara More info h?p://bit.ly/phd- course- IFP 44
Thank you! Any QuesEon? Emanuele Della Valle DEIB - Politecnico di Milano emanuele.dellavalle@polimi.it