DSMS Part 1 & 2 Data Stream Management Systems - Applications, Concepts, and Systems Vera Goebel Department of Informatics University of Oslo 1 Introduction: What are DSMS? (terms) Why do we need DSMS? (applications) Concepts and Issues: Architectures Data Modeling & Querying Application Examples: Network Monitoring with Esper (TelegraphCQ, STREAM, and Borealis) Medical Data Analysis with Esper Complex Event Processing (Home Care Scenario) 2 Handle Data Streams in DBS? Traditional DBS SQL Query Query Processing Main Memory Disk Result DSMS Register CQs Data Stream(s) Query Processing Main Memory Scratch store (main memory or disk) Result (stored) Data Stream(s) Archive Stored relations 3 Data Management: Comparison - DBS versus DSMS Database Systems (DBS) Persistent relations (relatively static, stored) One-time queries Random access Unbounded disk store Only current state matters No real-time services Relatively low update rate Data at any granularity Assume precise data Access plan determined by query processor, physical DB design DSMS Transient streams (on-line analysis) Continuous queries (CQs) Sequential access Bounded main memory Historical data is important Real-time requirements Possibly multi-gb arrival rate Data at fine granularity Data stale/imprecise Unpredictable/variable data arrival and characteristics 4 Adapted from [Motawani: PODS tutorial]
DSMS Applications Sensor Networks: Monitoring of sensor data from many sources, complex filtering, activation of alarms, aggregation and joins over single or multiple streams Network Traffic Analysis: Analyzing Internet traffic in near real-time to compute traffic statistics and detect critical conditions Financial Tickers: On-line analysis of stock prices, discover correlations, identify trends On-line auctions Transaction Log Analysis, e.g., Web, telephone calls, Motivation for DSMS Large amounts of interesting data: deploy transactional data observation points, e.g., AT&T long-distance: ~300M call tuples/day AT&T IP backbone: ~10B IP flows/day generate automated, highly detailed measurements NOAA: satellite-based measurement of earth geodetics Sensor networks: huge number of measurement points Near real-time queries/analyses ISPs: controlling the service level NOAA: tornado detection using weather radar data 5 6 VLDB 2003 Tutorial [Koudas & Srivastava 2003] Motivation for DSMS (cont.) Performance of disks: 1987 2004 Increase CPU Performance 1 MIPS 2,000,000 MIPS 2,000,000 x Memory Size 16 Kbytes 32 Gbytes 2,000,000 x Motivation for DSMS (cont.) Take-away points: Large amounts of raw data Analysis needed as fast as possible Data feed problem Memory Performance 100 usec 2 nsec 50,000 x Disc Drive Capacity 20 Mbytes 300 Gbytes 15,000 x Disc Drive Performance 60 msec 5.3 msec 11 x Source: Seagate Technology Paper: Economies of Capacity and Speed: Choosing the most cost-effective disc drive size and RPM to meet IT requirements 7 8
Application Requirements Generic DSMS Architecture Data model and query semantics: order- and time-based operations Selection Nested aggregation Multiplexing and demultiplexing Frequent item queries Joins Windowed queries Query processing: Streaming query plans must use non-blocking operators Only single-pass algorithms over data streams Data reduction: approximate summary structures Synopses, digests => no exact answers Real-time reactions for monitoring applications => active mechanisms Long-running queries: variable system conditions Scalability: shared execution of many continuous queries, monitoring multiple streams Stream Mining Streaming Inputs Input Monitor Working Storage Summary Storage Static Storage Updates to Static Data Query Repository User Queries Query Processor Output Buffer Streaming Outputs 9 10 [Golab & Özsu 2003] DSMS: 3-Level Architecture DBS Data feeds to database can also be treated as data streams Resource (memory, disk, per-tuple computation) rich Useful to audit query results of DSMS Supports sophisticated query processing, analyses DSMS DSMS at multiple observation points, (voluminous) streams-in, (data reduced) streams-out Resource (memory, per tuple computation) limited, esp. at low-level Reasonably complex, near real-time, query processing Identify what data to populate in DB Data Models Real-time data stream: sequence of data items that arrive in some order and may be seen only once. Stream items: like relational tuples - relation-based models, e.g., STREAM, TelegraphCQ; or instanciations of objects - object-based models, e.g., COUGAR, Tribeca Window models: Direction of movement of the endpoints: fixed window, sliding window, landmark window Physical / time-based windows versus logical / count-based windows Update interval: eager (update for each new arriving tuple) versus lazy (batch processing -> jumping window), nonoverlapping tumbling windows 11 12 VLDB 2003 Tutorial [Koudas & Srivastava 2003]
Timestamps Explicit Injected by data source Models real-world event represented by tuple Tuples may be out-of-order, but if near-ordered can reorder with small buffers Implicit Introduced as special field by DSMS Arrival time in system Enables order-based querying and sliding windows Issues Distributed streams? Composite tuples created by DSMS? Queries - I DBS: one-time (transient) queries DSMS: continuous (persistent) queries Support persistent and transient queries Predefined and ad hoc queries (CQs) Examples (persistent CQs): Tapestry: content-based email, news filtering OpenCQ, NiagaraCQ: monitor web sites Chronicle: incremental view maintenance Unbounded memory requirements Blocking operators: window techniques Queries referencing past data 13 14 Queries - II DBS: (mostly) exact query answer DSMS: (mostly) approximate query answer Approximate query answers have been studied: Synopsis construction: histograms, sampling, sketches Approximating query answers: using synopsis structures Approximate joins: using windows to limit scope Approximate aggregates: using synopsis structures Batch processing Data reduction: sampling, synopses, sketches, wavelets, histograms, One-pass Query Evaluation DBS: Arbitrary data access One/few pass algorithms have been studied: Limited memory selection/sorting: n-pass quantiles Tertiary memory databases: reordering execution Complex aggregates: bounding number of passes DSMS: Per-element processing: single pass to reduce drops Block processing: multiple passes to optimize I/O cost 15 16 VLDB 2003 Tutorial [Koudas & Srivastava 2003]
Query Languages 3 querying paradigms for streaming data: 1. Relation-based: SQL-like syntax and enhanced support for windows and ordering, e.g., Esper, CQL (STREAM), StreaQuel (TelegraphCQ), AQuery, GigaScope 2. Object-based: object-oriented stream modeling, classify stream elements according to type hierarchy, e.g., Tribeca, or model the sources as ADTs, e.g., COUGAR 3. Procedural: users specify the data flow, e.g., Aurora, users construct query plans via a graphical interface (1) and (2) are declarative query languages, currently, the relation-based paradigm is mostly used. 17 Approximate Query Answering Methods Sliding windows Only over sliding windows of recent stream data Approximation but often more desirable in applications Batched processing, sampling and synopses Batched if update is fast but computing is slow Compute periodically, not very timely Samplingif update is slow but computing is fast Compute using sample data, but not good for joins, etc. Synopsis data structures Maintain a small synopsis or sketch of data Good for querying historical data Blocking operators, e.g., sorting, avg, min, etc. Blocking if unable to produce the first output until seeing the entire input 18 [Han 2004] Outline DSMS Monitoring Morten Lindeberg Network Monitoring Introduction Network Monitoring with Esper Example Queries Results Medical Monitoring Sensors Variable sized windows Results 19 20
Network Monitoring: Motivation Network Monitoring - Traffic analysis: Which applications generate most traffic on the network (e.g., look at TCP port numbers), and what is the utilized bandwidth (traffic volumes during a day) - Provisioning of network resources (e.g., identify congested routers or links) - Study protocol impact (e.g., measure the overhead in terms of control messages of, e.g., OLSR) - Intrusion detection (e.g., SYN-flooding attacks) Basic principles: Passive vs. Active Online vs. Offline Our focus on passive online analysis Observe traffic flowing through a point in the network (e.g., a router or a computer) 21 22 Network Monitoring Tools Common tools: tcpdump (command line interface, packet sniffing) Wireshark (graphical interface, packet sniffing) DSMS related: CoMo (network monitoring with focus on load shedding => drop packets if system does not manage to handle incoming traffic) http://como.sourceforge.net Gigascope (AT&T Labs research project, DSMS specialized for monitoring) (project page does no longer exist) Endace (hardware network cards, hard to program) http://www.endace.com Why Use A DSMS? For Passive Online Analysis Huge streams, in-memory processing needed, disks are to slow React fast to changes in network traffic SQL interface allow dynamically changing queries to respond to network changes or characteristics Although: processing packets fast enough to handle all traffic on a link is still an issue 23 24
Esper Network Monitoring With Esper Esper is a component for CEP and ESP applications, available for Java as Esper, and for.net as Nesper CEP (Complex Event Processing) ESP (Event Stream Processing) Available at: http://esper.codehaus.org/ Experiment setup Application flow 25 26 Network Monitoring with Esper Network Monitoring Queries 1. Capture packets from a network interface card using Unix pcap library (e.g., TCP and IP headers) 2. Create instance of Esper Component for Query Engine 3. Register query 4. Register listener for query results 5. Capture and convert packets into tuples => POJO (Plain Old Java Objects) 6. Inject POJO s into the engine 7. Output query results at listener 27 Query 1 - Projection of the TCP/IP header fields (like tcpdump) select * from PacketEvent Query 2 - Packet count and network load per second (bandwidth consumption) select sum(totallength), count(*) from PacketEvent.win:time_batch(1 sec) 28
Network Monitoring Queries Network Monitoring Queries Query 3 - Packet count for specific ports during five minutes (account for packets destined for certain applications) insert into PortStream select destport, count(*) as portcount from PacketEvent.win:time_batch(1 sec) group by destport select Table.destport as port, sum(port.portcount) from PortStream.win:time(300 sec) as Port, sql:portdatabase [ select destport from Ports ] as Table where Port.destport = Table.destport group by Query 4 - SYN Flood attack identifier (denial of service (DoS) attack, a server is flooded with packets with the SYN bit set) insert into NormalView select count(*) as tuplecount, destip, destport from PacketEvent.win:time_batch(1 sec) where syn = 0 group by destip, destport insert into SynView select count(*) as tuplecount, destip, destport from PacketEvent.win:time_batch(1 sec) where syn = 1 group by destip, destport select (Normal.tupleCount / Syn.tupleCount) as Ratio, Normal.tupleCount as NormalTupleCount, Normal.destip as NormalDestip, Normal.destport as NormalDestport, Syn.tupleCount as SynTupleCount from NormalView.win:time(5 sec) as Normal, SynView.win:time(5 sec) as Syn Table.destport 29 where Normal.destip = Syn.destip 30 Network Monitoring Queries Query 5 - Filter out land attack (spoofed packets causing infinite loop. Destination IP address and the source IP address are identical ) Network Monitoring Queries Query 6 - Identify attempts of port scanning (port scanning is commonly the first step in an attack; a port scan can be identified if a series of packets are sent from one source IP address to a destination IP address with many different TCP destination ports) select * from Packets where sourceip = destip and sourceport = destport and syn = 1 31 SELECT COUNT(DISTINCT destport) as distinctcount, sourceip FROM Packets WINDOW SIZE 10 seconds GROUP BY sourceip, sourceport 32
Performance StreamBench Determine the amount of traffic that a DSMS can handle for specific network monitoring tasks (queries) Initial results form earlier work Master s theses and INF5090 assignments Benchmarked DSMSs: Esper, Borealis, TelegraphCQ, STREAM Performance Results Processing rate: Investigates the bounds of network load that the DSMS is able to process without loosing to many packets Results: The different systems have quite differently network load bounds Some systems perform significantly better, if allowing some packet loss The processing complexity of the different queries has a significant impact on the load bounds 33 34 Performance Results Performance Results CPU Consumption Query 5 Memory Consumption Query 6 35 36
Network Monitoring Results Outline Query 6 (port scan) results Network Monitoring Introduction Network Monitoring with Esper Example Queries Results Medical Monitoring Sensors Variable sized windows Results 37 38 Experiment Goal #1 Online Analysis of Myocardial Ischemia From Medical Sensor Data Streams with Esper Stig Støa 1, Morten Lindeberg 2 and Vera Goebel 2 1 The Interventional Centre (IVS), Rikshospitalet University Hospital, Oslo, Norway 2 Distributed Multimedia Systems (DMMS), Department of Informatics, University of Oslo, Norway Recreate an off-line technique of Elle et al. [1] conducted in MATLAB Early recognition of regional cardiac ischemia 3-way accelerometer placed on the left ventricle of the heart Single metric: Fast Fourier Transformation (FFT) is used to examine the accelerometer signal in the frequency-domain Euclidian distance vector (EDV(i)) between reference vector RV(0) and current vector CV(j), where j is the latest sample number CV(j) : FFT over sliding window (size 512 over y-axis) RV(0) : FFT over baseline window (first 512 samples) Data set from surgery performed on pigs at the Interventional Centre We can conduct experiments with the same data set (referred to as data set 1) 39 40
Experiment Goal #2 Improve results by adding beat to beat detection using a QRS detection algorithm on ECG signals Each ECG trace of a normal heartbeat typically contains a QRS event A good reference for separating heartbeats We need a to perform FFT over a sliding window of variable size! Cannot use the same data, use new data set that include ECG (data set 2) [ image from http://www.wikipedia.org ] 41 Challenges 1. Incorporate signal processing operations Problem: Not supported in the query language Fast Fourier Transformation of the accelerometer signals Euclidian distance vector from baseline window QRS detection for detecting the heartbeats from the ECG signals Solution: Custom aggregate functions 2. Static sized windows are not feasible for beat-to-beat detection Problem: Heartbeat duration is not a static pre-known size. DSMS window techniques only describe static time-based or tuple-based windows. Solution: Introduce variable length triggered tumbling windows 3. Synchronize the two streams Problem: QRS detection introduces variable delay (approx. 91 samples) Solution: Introduce variable buffer, that slows the accelerometer stream down 42 Signal processing operations Implement as custom aggregate functions Use defined Java interface and simply add to query engine Implemented methods: QRSD(v): QRS detection based upon algorithm from Hamilton et al. [2], source code is also public available edv(v): Euclidian distance from baseline Variable length triggered tumbling windows The ECG stream is aggregated into a stream consisting of QRS events S b This stream (S b ) triggers the flushing of the sliding window w(t) where the custom aggregation over the stream S a is performed The window is not supported by Esper => We implemented a workaround exploiting functionality of externally timed windows 43 44
Stream Synchronization The QRS detection algorithm over the ECG stream introduces a variable delay t Introduce the same delay to the accelerometer stream Accelerometer stream is sent through a FIFO queue, with dynamic size QRS detection function sets the dynamic size of the FIFO queue (also triggers the flushing of the aggregate window, in order to obtain dynamic windows, as already seen) Results #1 (data set 1) SELECT edv(y) Easier than MATLAB FROM Accelerometer WINDOW LENGTH(512) Occlusion occurs after 80 seconds Figure shows a perfect overlap, thus the technique by Elle et al. [1] can be recreated online using Esper Perfusion after 170 seconds 45 46 Results #2 (data set 2) Plot shows fixed sliding window (512 samples) SELECT edv(y) FROM Accelerometer TRIGGER and dynamic triggered window (based on QRS WINDOW BY QRSD(ECG.value) detection) => less variance! Results #3 (data set 2) SELECT edv(y), min(y) FROM Query with added local minimum Accelerometer TRIGGER WINDOW BY value => easy to change! QRSD(ECG.value) Occlusion Perfusion The bottom plot represents local minimum value for the accelerometer stream Sudden drop caused by ultra sound probe 47 48
Summary DSMSs can be used for: Network monitoring: Performance is an issue, systems behave/handles overload differently Medical monitoring of patients: Custom operators are helpful, dynamic window sizes to adapt to heart rate References [1] Elle, O. J., Halvorsen, S., Gulbrandsen, M. G., Aurdal, L., Bakken, A., Samset, E., Dugstad H. and Fosse, E. Early recognition of regional cardiac ischemia using a 3-axis accelerometer sensor. Institute of Physics Publishing. Physiological Measurement, Volume 26, Number 4, August 2005, pp. 429-440 [2] Hamilton, P. S. and Tompkins, W. J. Quantitive Investigation of QRS Detection Rules Using the MIT/BIH Arrhythmia Database. IEEE Transactions on Biomedical Engineering, Volume BME-33, Number 12, December 1986, pp. 1157-1165 Questions? Thank you for attending lecture! 49 50