Efficient Processing for Big Data Streams and their Context in Distributed Cyber-Physical Systems
Department of Computer Science and Engineering, Chalmers University of Technology & Gothenburg University, Gothenburg, Sweden
Prelude
  Assoc. prof., Chalmers University of Technology & Gothenburg University, Sweden
  Center for Mathematics & Computer Science, Netherlands
  Max Planck Institute for Computer Science, Germany
  Chalmers: forskarassistent (assistant professor)
  PhD (1996), University of Patras, Greece: Computer Science and Engineering, Distributed Computing
Roadmap
  Cyber-physical systems, big data, streams and distributed systems: how they belong together
  At our research team
  Concluding discussion
Examples: Cyber-Physical Systems (CPS)
  Adaptive Electricity Grids
  www.energy daily.com/images/
  http://www.kapsch.net/se/
Cyber-physical systems as layered systems
  [Figure labels: cyber system; physical system; communication link; sensing + computing + communicating device, aka Internet of Things (IoT)]
CPS/IoT => big numbers of devices and/or big data rates => big volumes of events/data!
Why this complexity?
  (Smart) adaptive use of resources
  Possibilities for improvement: e.g. energy consumption, traffic bandwidth, early warnings, improved system quality
[The 4th industrial (r)evolution, presentation by S. Jeschke, 2013]
Info needed in near real time
Is store & process (DB) a feasible option?
  High-rate sensors, high-speed networks, social media, financial records: up to millions of messages per second (Mmsg/sec)
  Decisions must be taken really fast, e.g. within fractions of a msec, even μsecs
  As of today, only 0.1% of the available sensor data is analyzed, mainly offline (i.e., afterwards, not in or close to real time) [Jonathan Ballon, Chief Strategy Officer, General Electric]
Data streaming:
  In memory, in network, distributed
  Locality, use of available resources
  Efficient one-pass analysis & filtering
(fig: V. Gulisano)
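As a small illustration of the "one-pass analysis & filter" idea above, the sketch below keeps a running mean and variance in constant memory (Welford's method) and forwards only readings above a threshold. The names `RunningStats`, `one_pass_filter` and the threshold rule are hypothetical, chosen for illustration; this is not taken from the work presented here.

```python
from dataclasses import dataclass


@dataclass
class RunningStats:
    """One-pass mean/variance (Welford's method), O(1) memory per stream."""
    count: int = 0
    mean: float = 0.0
    m2: float = 0.0  # sum of squared deviations from the current mean

    def update(self, x: float) -> None:
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self) -> float:
        # Population variance of everything seen so far.
        return self.m2 / self.count if self.count else 0.0


def one_pass_filter(stream, threshold):
    """Yield (value, running_mean) for readings above `threshold`.

    Each tuple is inspected exactly once and never stored.
    """
    stats = RunningStats()
    for value in stream:
        stats.update(value)
        if value > threshold:
            yield value, stats.mean
```

Because the input is consumed as an iterator, the same function works on an unbounded source; only the statistics object is kept as state.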
Data streaming components [state-of-the-art literature]
  Distributed input sources generating streams of data (unbounded sequences of tuples, time series)
  Continuous query(ies): a graph of data streaming operators/tasks. Can be used to:
    filter / modify tuples
    aggregate tuples, join streams
  Stateful operations are computed over windows
  Input/output & processing can involve multiple parallel threads
  Parallelization in operator implementations: but single-point bottlenecks can still persist
Challenges: throughput, latency, determinism, load balancing, fault tolerance
(fig: V. Gulisano)
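To make the notion of a continuous query concrete, here is a minimal single-threaded sketch that chains a stateless filter with a stateful per-key sum over tumbling windows. The tuple layout (`Reading`), the operator names and the choice of tumbling windows are assumptions for illustration; real streaming engines add parallelism, fault tolerance and out-of-order handling.

```python
from collections import defaultdict
from typing import NamedTuple


class Reading(NamedTuple):
    ts: int       # event timestamp (hypothetical unit, e.g. seconds)
    key: str      # source identifier (hypothetical field)
    value: float


def filter_op(stream, predicate):
    """Stateless operator: forward only tuples that satisfy the predicate."""
    for t in stream:
        if predicate(t):
            yield t


def tumbling_sum(stream, window_size):
    """Stateful operator: per-key sum over consecutive windows of `window_size`.

    Assumes tuples arrive in non-decreasing timestamp order. A window's
    result is emitted when the first tuple of a later window arrives.
    """
    current = None               # start of the window being filled
    sums = defaultdict(float)
    for t in stream:
        start = (t.ts // window_size) * window_size
        if current is not None and start != current:
            for key in sorted(sums):
                yield (current, key, sums[key])
            sums.clear()
        current = start
        sums[t.key] += t.value
    # Flush the last window; only reached for finite demo inputs.
    if current is not None:
        for key in sorted(sums):
            yield (current, key, sums[key])
```

Operators compose like a query graph: `tumbling_sum(filter_op(source, lambda t: t.value >= 0), 10)` drops negative readings and sums the rest per key in windows of 10 time units.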
Fine-grain parallelism: Parallel Data Streaming
  At CTH: enhanced parallelism by means of dedicated / semantic-aware concurrent data objects and their efficient algorithmic implementations with fine-grain synchronization
(fig: V. Gulisano, R. Rodriguez)
Examples of results with ScaleGate
Latency and throughput scaling (while keeping fault-tolerant and deterministic processing; aggregation and join operations)
Configurations compared:
  Baseline (Borealis, StreamCloud): FIFO queue
  Baseline: lock-free FIFO
  ScaleGate-based
Shifting the saturation point of the pipeline: possible to process heavier streams with the same computing capacity, many times faster, at Mtuples/sec
[CGNPT ACM SPAA 2014, GNPT IEEE BigData 2015]
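The slide does not show ScaleGate's interface, so the following is only a sequential sketch of the kind of deterministic, timestamp-ordered merging that such a shared object can provide. The class `ReadyMerge`, its method names and the readiness rule (a tuple is released once every source has moved strictly past its timestamp) are assumptions for illustration. It is neither lock-free nor concurrent, and it is not the authors' implementation.

```python
import heapq


class ReadyMerge:
    """Merge timestamp-sorted sources into one deterministic order.

    Hypothetical, single-threaded sketch: a tuple becomes "ready" once all
    sources have delivered a tuple with a strictly larger timestamp, so the
    output does not depend on how the sources' insertions interleave.
    """

    def __init__(self, num_sources, num_readers):
        self._latest = [None] * num_sources   # last timestamp per source
        self._pending = []                    # heap of (ts, source, seq, payload)
        self._ready = []                      # released tuples, in final order
        self._cursor = [0] * num_readers      # each reader sees every tuple
        self._seq = 0

    def add_tuple(self, ts, payload, source_id):
        last = self._latest[source_id]
        if last is not None and ts < last:
            raise ValueError("each source must be timestamp-sorted")
        self._latest[source_id] = ts
        heapq.heappush(self._pending, (ts, source_id, self._seq, payload))
        self._seq += 1
        if any(t is None for t in self._latest):
            return  # some source has not reported yet
        watermark = min(self._latest)
        while self._pending and self._pending[0][0] < watermark:
            ts_, src, _, data = heapq.heappop(self._pending)
            self._ready.append((ts_, src, data))

    def get_next_ready(self, reader_id):
        """Return the next ready (ts, source_id, payload), or None."""
        i = self._cursor[reader_id]
        if i >= len(self._ready):
            return None
        self._cursor[reader_id] = i + 1
        return self._ready[i]
```

Ties on the timestamp are broken by source id, which is what makes the output order reproducible. The cost is that the newest tuples stay pending until every source has advanced.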
Examples of use cases: Geospatial monitoring
Deterministic Real-Time Analytics of Geospatial Data Streams through ScaleGate Objects
Best Solution, Grand Challenge Award: 9th ACM SIGMOD SIGSOFT International Conference on Distributed Event-Based Systems, 2015
http://www.chalmers.se/en/departments/cse/news/pages/debs2015.aspx
  Top-k frequent routes, profitable cells (near-real-time, window-based streaming)
  Throughput > 110,000 tuples/sec, latency < 46 msec
[GNWPT ACM DEBS 2015]
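A window-based top-k query of the kind mentioned above can be sketched as follows. This is a plain sequential illustration, not the award-winning solution: the class name `TopKRoutes`, the route encoding as a string and the tie-breaking rule (higher count first, then route name) are assumptions.

```python
from collections import Counter, deque


class TopKRoutes:
    """Top-k most frequent routes over a sliding time window."""

    def __init__(self, window, k):
        self.window = window          # window length in timestamp units
        self.k = k
        self._events = deque()        # (ts, route) in arrival order
        self._counts = Counter()

    def add(self, ts, route):
        """Insert one trip (timestamps non-decreasing) and return the top-k."""
        self._events.append((ts, route))
        self._counts[route] += 1
        # Evict trips that fell out of the window (ts - window, ts].
        while self._events and self._events[0][0] <= ts - self.window:
            _, old = self._events.popleft()
            self._counts[old] -= 1
            if self._counts[old] == 0:
                del self._counts[old]
        return self.top_k()

    def top_k(self):
        # Deterministic ranking: count descending, then route name ascending.
        ranked = sorted(self._counts.items(), key=lambda kv: (-kv[1], kv[0]))
        return ranked[: self.k]
```

Re-sorting all counts on every tuple keeps the sketch short; a high-throughput implementation would maintain the ranking incrementally.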
Examples of use cases: Advanced Metering Infrastructure
Efficient temporal-spatial clustering for online identification of critical events (even when the communication is unreliable)
  Sliding time window, Grid-based Single-Linkage Clustering (G-SLC)
[FALP IEEE BigData 2014]
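The slide names grid-based single-linkage clustering without giving details, so the sketch below shows one common way to combine the two ideas, under stated assumptions: events inside the current window are 2-D points, two points are linked when they are at most `eps` apart, and a grid with cell side `eps` limits the distance checks to neighbouring cells. The function name and parameters are hypothetical, and this is not claimed to be the G-SLC algorithm from the cited paper.

```python
from collections import defaultdict
from math import floor, hypot


def grid_single_linkage(points, eps):
    """Return one cluster label per point.

    Points connected by a chain of hops of length <= eps share a label
    (the smallest point index in their cluster).
    """
    parent = list(range(len(points)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path halving
            i = parent[i]
        return i

    def union(i, j):
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[max(ri, rj)] = min(ri, rj)

    # Bin points into square cells of side eps.
    grid = defaultdict(list)
    for idx, (x, y) in enumerate(points):
        grid[(floor(x / eps), floor(y / eps))].append(idx)

    # Points within eps of each other lie in the same or adjacent cells.
    for (cx, cy), members in grid.items():
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                for j in grid.get((cx + dx, cy + dy), ()):
                    xj, yj = points[j]
                    for i in members:
                        xi, yi = points[i]
                        if i < j and hypot(xi - xj, yi - yj) <= eps:
                            union(i, j)

    return [find(i) for i in range(len(points))]
```

In a streaming setting this would be re-run, or updated incrementally, as the sliding window advances.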
Examples of use cases: Advanced Metering Infrastructure
Efficient data validation on the fly
  Noisy and lossy data: badly calibrated / faulty devices, lossy communication
  E.g. scaling to 25 million meters / hourly readings on a mainstream 6-core platform [GAP IEEE ISGT 2014]
  + differentially private aggregation [ongoing work]
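On-the-fly validation of meter readings could look roughly like the sketch below: each reading is checked once against a few simple rules, with only the last timestamp per meter kept as state. The rules (value range, duplicate or late timestamps, gaps longer than the reporting interval), the status labels and the function name are hypothetical examples, not the validation rules of the cited work.

```python
def validate(readings, interval, max_value):
    """Tag each (meter, ts, value) reading with a status in a single pass.

    State is one timestamp per meter, so memory grows with the number of
    meters and not with the number of readings.
    """
    last_ts = {}
    for meter, ts, value in readings:
        prev = last_ts.get(meter)
        if value < 0 or value > max_value:
            status = "out_of_range"        # e.g. badly calibrated device
        elif prev is not None and ts <= prev:
            status = "duplicate_or_late"   # e.g. retransmission
        elif prev is not None and ts - prev > interval:
            status = "ok_after_gap"        # readings were lost in between
        else:
            status = "ok"
        if prev is None or ts > prev:
            last_ts[meter] = ts
        yield meter, ts, value, status
```

Since meters are independent of each other, the input can be partitioned by meter id and validated in parallel.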
Summarizing & Concluding
DS^2: DataStreaming * DataStructures, i.e. efficient multicore stream processing
  Efficient algorithmic (in-memory) stream analysis
  Advancing the state of the art in big data stream analysis (context: IoT/CPS; relate with Cloud/Fog computing)
    important to design algorithms that communicate as little as possible
    efficient processing and data analysis need to be unified [J. Dongarra, D. Reed, CACM 2015]
In our ongoing/near-future research:
  Elastic parallel & distributed, in-network streaming (allowing e.g. embedded devices)
  More concurrent data structures & multicore algorithms for efficient in-memory stream processing
  Processing high-rate sensory data (e.g. LIDAR) & other use cases in CPS & IoT
Thank you
Contact: ptrianta@chalmers.se
Co-authors in work mentioned here (from left to right): M. Almgren, D. Cederman, Z. Fu, V. Gulisano, O. Landsiedel, Y. Nikolakopoulos, M.P., P. Tsigas
EXCESS
At our research team (approx. 30 persons): cyber-physical systems research
  Systems security
  Distributed systems, IoT
  Parallel & stream computing
  Demand response in energy
  Data
  Internet of Things
  Energy-efficient computation
  Cooperative vehicular systems
  Resource management, load shaping
  Microgrids demo/testbeds
  Data processing: validation, monitoring, prediction
  Security, privacy
  Streaming, parallel, multicore
  Energy efficiency: estimated savings 30-70%
  Communication & coordination, data-driven situation awareness (new postdoc SAFER)
  Virtual traffic lights / safer crossings
  Gulliver demo/testbed