DSMS Part 1 & 2. Data Management: Comparison - DBS versus DSMS. Handle Data Streams in DBS?




DSMS Part 1 & 2
Data Stream Management Systems - Applications, Concepts, and Systems
Vera Goebel, Department of Informatics, University of Oslo

- Introduction:
  - What are DSMS? (terms)
  - Why do we need DSMS? (applications)
- Concepts and Issues:
  - Architectures
  - Data Modeling & Querying
- Application Examples:
  - Network Monitoring with Esper (TelegraphCQ, STREAM, and Borealis)
  - Medical Data Analysis with Esper
  - Complex Event Processing (Home Care Scenario)

Handle Data Streams in DBS?
[Figure: two architectures side by side.
 Traditional DBS: SQL Query, Query Processing, Main Memory, Disk, Result.
 DSMS: Register CQs, Data Stream(s), Query Processing, Main Memory, Scratch store (main memory or disk), Archive, Stored relations, Result (stored), Data Stream(s).]

Data Management: Comparison - DBS versus DSMS

Database Systems (DBS)                            | DSMS
--------------------------------------------------+--------------------------------------------------
Persistent relations (relatively static, stored)  | Transient streams (on-line analysis)
One-time queries                                  | Continuous queries (CQs)
Random access                                     | Sequential access
Unbounded disk store                              | Bounded main memory
Only current state matters                        | Historical data is important
No real-time services                             | Real-time requirements
Relatively low update rate                        | Possibly multi-GB arrival rate
Data at any granularity                           | Data at fine granularity
Assume precise data                               | Data stale/imprecise
Access plan determined by query processor,        | Unpredictable/variable data arrival and
physical DB design                                | characteristics

Adapted from [Motwani: PODS tutorial]

DSMS Applications
- Sensor Networks: monitoring of sensor data from many sources, complex filtering, activation of alarms, aggregation and joins over single or multiple streams
- Network Traffic Analysis: analyzing Internet traffic in near real-time to compute traffic statistics and detect critical conditions
- Financial Tickers: on-line analysis of stock prices, discover correlations, identify trends
- On-line auctions
- Transaction Log Analysis, e.g., Web, telephone calls

Motivation for DSMS
- Large amounts of interesting data:
  - Deploy transactional data observation points, e.g.,
    - AT&T long-distance: ~300M call tuples/day
    - AT&T IP backbone: ~10B IP flows/day
  - Generate automated, highly detailed measurements
    - NOAA: satellite-based measurement of earth geodetics
    - Sensor networks: huge number of measurement points
- Near real-time queries/analyses
  - ISPs: controlling the service level
  - NOAA: tornado detection using weather radar data

VLDB 2003 Tutorial [Koudas & Srivastava 2003]

Motivation for DSMS (cont.)
Performance of disks:

                         1987        2004             Increase
CPU Performance          1 MIPS      2,000,000 MIPS   2,000,000 x
Memory Size              16 Kbytes   32 Gbytes        2,000,000 x
Memory Performance       100 usec    2 nsec           50,000 x
Disc Drive Capacity      20 Mbytes   300 Gbytes       15,000 x
Disc Drive Performance   60 msec     5.3 msec         11 x

Source: Seagate Technology Paper: Economies of Capacity and Speed: Choosing the most cost-effective disc drive size and RPM to meet IT requirements

Motivation for DSMS (cont.)
Take-away points:
- Large amounts of raw data
- Analysis needed as fast as possible
- Data feed problem

Application Requirements
- Data model and query semantics: order- and time-based operations
  - Selection
  - Nested aggregation
  - Multiplexing and demultiplexing
  - Frequent item queries
  - Joins
  - Windowed queries
- Query processing:
  - Streaming query plans must use non-blocking operators
  - Only single-pass algorithms over data streams
- Data reduction: approximate summary structures
  - Synopses, digests => no exact answers
- Real-time reactions for monitoring applications => active mechanisms
- Long-running queries: variable system conditions
- Scalability: shared execution of many continuous queries, monitoring multiple streams
- Stream Mining

Generic DSMS Architecture
[Figure, from Golab & Özsu 2003: Streaming Inputs, Input Monitor, Working Storage, Summary Storage, Static Storage, Updates to Static Data, Query Repository, User Queries, Query Processor, Output Buffer, Streaming Outputs.]

DSMS: 3-Level Architecture
- DBS
  - Data feeds to the database can also be treated as data streams
  - Resource (memory, disk, per-tuple computation) rich
  - Useful to audit query results of the DSMS
  - Supports sophisticated query processing and analyses
- DSMS
  - DSMS at multiple observation points: (voluminous) streams in, (data-reduced) streams out
  - Resource (memory, per-tuple computation) limited, especially at the low level
  - Reasonably complex, near real-time query processing
  - Identify what data to populate in the DB

Data Models
- Real-time data stream: a sequence of data items that arrive in some order and may be seen only once.
- Stream items are either like relational tuples (relation-based models, e.g., STREAM, TelegraphCQ) or instantiations of objects (object-based models, e.g., COUGAR, Tribeca).
- Window models:
  - Direction of movement of the endpoints: fixed window, sliding window, landmark window
  - Physical / time-based windows versus logical / count-based windows
  - Update interval: eager (update for each newly arriving tuple) versus lazy (batch processing -> jumping window); non-overlapping tumbling windows

VLDB 2003 Tutorial [Koudas & Srivastava 2003]
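The window models listed above can be illustrated with a minimal Python sketch. It covers only count-based (logical) windows and contrasts an eagerly updated sliding window with a non-overlapping tumbling window. The function names are hypothetical and are not taken from any of the DSMSs mentioned in the slides.

```python
from collections import deque


def sliding_count_window(stream, size):
    """Eager sliding window: yield the window contents after every new item."""
    window = deque(maxlen=size)  # the oldest item is dropped automatically
    for item in stream:
        window.append(item)
        yield list(window)


def tumbling_count_window(stream, size):
    """Tumbling window: yield non-overlapping batches of exactly `size` items.

    A trailing partial batch is not emitted (an assumption of this sketch).
    """
    batch = []
    for item in stream:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []


sliding = list(sliding_count_window(range(1, 6), 3))
tumbling = list(tumbling_count_window(range(1, 6), 2))
print(sliding)   # [[1], [1, 2], [1, 2, 3], [2, 3, 4], [3, 4, 5]]
print(tumbling)  # [[1, 2], [3, 4]]
```

A time-based (physical) window would evict items by timestamp instead of by count; the structure of the loop stays the same.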

Timestamps
- Explicit
  - Injected by the data source
  - Models the real-world event represented by the tuple
  - Tuples may be out of order, but if near-ordered they can be reordered with small buffers
- Implicit
  - Introduced as a special field by the DSMS
  - Arrival time in the system
  - Enables order-based querying and sliding windows
- Issues
  - Distributed streams?
  - Composite tuples created by the DSMS?

Queries - I
- DBS: one-time (transient) queries
- DSMS: continuous (persistent) queries
  - Support persistent and transient queries
  - Predefined and ad hoc queries (CQs)
  - Examples (persistent CQs):
    - Tapestry: content-based email, news filtering
    - OpenCQ, NiagaraCQ: monitor web sites
    - Chronicle: incremental view maintenance
- Unbounded memory requirements
- Blocking operators: window techniques
- Queries referencing past data

Queries - II
- DBS: (mostly) exact query answer
- DSMS: (mostly) approximate query answer
  - Approximate query answers have been studied:
    - Synopsis construction: histograms, sampling, sketches
    - Approximating query answers: using synopsis structures
    - Approximate joins: using windows to limit scope
    - Approximate aggregates: using synopsis structures
- Batch processing
- Data reduction: sampling, synopses, sketches, wavelets, histograms, ...

One-pass Query Evaluation
- DBS:
  - Arbitrary data access
  - One/few-pass algorithms have been studied:
    - Limited memory selection/sorting: n-pass quantiles
    - Tertiary memory databases: reordering execution
    - Complex aggregates: bounding number of passes
- DSMS:
  - Per-element processing: single pass to reduce drops
  - Block processing: multiple passes to optimize I/O cost

VLDB 2003 Tutorial [Koudas & Srivastava 2003]
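Sampling is named above as one way to build a synopsis in a single pass. As an illustration only, here is a sketch of reservoir sampling (the classic "Algorithm R"), which keeps a uniform random sample of fixed size k from a stream of unknown length. The slides do not say which sampling scheme is meant, so the choice of algorithm and the function name are assumptions.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Return a uniform random sample of up to k items using one pass.

    Memory use is O(k), independent of the stream length.
    """
    rng = rng or random.Random()
    sample = []
    for n, item in enumerate(stream, start=1):
        if n <= k:
            # Fill the reservoir with the first k items.
            sample.append(item)
        else:
            # Item n replaces a random slot with probability k / n.
            j = rng.randrange(n)  # uniform integer in [0, n)
            if j < k:
                sample[j] = item
    return sample


synopsis = reservoir_sample(range(100_000), 10, random.Random(42))
print(len(synopsis))  # 10
```

Aggregates computed over such a sample are approximate, which matches the "(mostly) approximate query answer" characterization of a DSMS.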

Query Languages
Three querying paradigms for streaming data:
1. Relation-based: SQL-like syntax and enhanced support for windows and ordering, e.g., Esper, CQL (STREAM), StreaQuel (TelegraphCQ), AQuery, GigaScope
2. Object-based: object-oriented stream modeling; classify stream elements according to a type hierarchy, e.g., Tribeca, or model the sources as ADTs, e.g., COUGAR
3. Procedural: users specify the data flow, e.g., Aurora, where users construct query plans via a graphical interface
(1) and (2) are declarative query languages; currently, the relation-based paradigm is mostly used.

Approximate Query Answering Methods
- Sliding windows
  - Query only over sliding windows of recent stream data
  - An approximation, but often more desirable in applications
- Batched processing, sampling, and synopses
  - Batched if update is fast but computing is slow
    - Compute periodically, not very timely
  - Sampling if update is slow but computing is fast
    - Compute using sample data, but not good for joins, etc.
  - Synopsis data structures
    - Maintain a small synopsis or sketch of the data
    - Good for querying historical data
- Blocking operators, e.g., sorting, avg, min, etc.
  - An operator is blocking if it is unable to produce its first output until it has seen the entire input
[Han 2004]

DSMS Monitoring
Morten Lindeberg

Outline
- Network Monitoring
  - Introduction
  - Network Monitoring with Esper
  - Example Queries
  - Results
- Medical Monitoring
  - Sensors
  - Variable sized windows
  - Results

Network Monitoring: Motivation
- Traffic analysis: which applications generate the most traffic on the network (e.g., look at TCP port numbers), and what is the utilized bandwidth (traffic volumes during a day)
- Provisioning of network resources (e.g., identify congested routers or links)
- Study protocol impact (e.g., measure the overhead in terms of control messages of, e.g., OLSR)
- Intrusion detection (e.g., SYN-flooding attacks)

Network Monitoring
- Basic principles:
  - Passive vs. active
  - Online vs. offline
- Our focus is on passive online analysis
  - Observe traffic flowing through a point in the network (e.g., a router or a computer)

Network Monitoring Tools
- Common tools:
  - tcpdump (command line interface, packet sniffing)
  - Wireshark (graphical interface, packet sniffing)
- DSMS related:
  - CoMo (network monitoring with focus on load shedding => drop packets if the system cannot handle the incoming traffic) http://como.sourceforge.net
  - Gigascope (AT&T Labs research project, DSMS specialized for monitoring) (project page no longer exists)
  - Endace (hardware network cards, hard to program) http://www.endace.com

Why Use A DSMS? For Passive Online Analysis
- Huge streams: in-memory processing is needed, disks are too slow
- React fast to changes in network traffic
- The SQL interface allows dynamically changing queries to respond to network changes or characteristics
- However, processing packets fast enough to handle all traffic on a link is still an issue

Esper
- Esper is a component for CEP and ESP applications, available for Java as Esper and for .NET as NEsper
  - CEP (Complex Event Processing)
  - ESP (Event Stream Processing)
- Available at: http://esper.codehaus.org/

Network Monitoring With Esper
[Figures: Experiment setup; Application flow]

Network Monitoring with Esper
1. Capture packets from a network interface card using the Unix pcap library (e.g., TCP and IP headers)
2. Create an instance of the Esper component for the query engine
3. Register query
4. Register listener for query results
5. Capture and convert packets into tuples => POJOs (Plain Old Java Objects)
6. Inject POJOs into the engine
7. Output query results at the listener

Network Monitoring Queries

Query 1 - Projection of the TCP/IP header fields (like tcpdump)

    select * from PacketEvent

Query 2 - Packet count and network load per second (bandwidth consumption)

    select sum(totallength), count(*)
    from PacketEvent.win:time_batch(1 sec)
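To make the semantics of Query 2 concrete, the following Python sketch computes the same two aggregates, the sum of totallength and the packet count, per one-second batch. It is an illustration rather than Esper code: packets are assumed to be (timestamp in seconds, totallength) pairs in time order, batches are aligned to whole seconds, and seconds without packets produce no output. These details are assumptions of the sketch and may differ from how time_batch behaves in Esper.

```python
def per_second_load(packets):
    """Yield (second, byte_sum, packet_count) for each 1-second batch.

    `packets` is an iterable of (timestamp_seconds, totallength) pairs
    in non-decreasing timestamp order.
    """
    current = None          # the second the open batch belongs to
    byte_sum = count = 0
    for ts, totallength in packets:
        second = int(ts)
        if current is not None and second != current:
            # The batch for the previous second is complete: emit and reset.
            yield current, byte_sum, count
            byte_sum = count = 0
        current = second
        byte_sum += totallength
        count += 1
    if current is not None:
        yield current, byte_sum, count  # flush the last open batch


trace = [(0.1, 100), (0.5, 200), (1.2, 50), (3.0, 10)]
load = list(per_second_load(trace))
print(load)  # [(0, 300, 2), (1, 50, 1), (3, 10, 1)]
```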

Network Monitoring Queries

Query 3 - Packet count for specific ports during five minutes (account for packets destined for certain applications)

    insert into PortStream
    select destport, count(*) as portcount
    from PacketEvent.win:time_batch(1 sec)
    group by destport

    select Table.destport as port, sum(Port.portcount)
    from PortStream.win:time(300 sec) as Port,
         sql:portdatabase [ select destport from Ports ] as Table
    where Port.destport = Table.destport
    group by Table.destport

Query 4 - SYN flood attack identifier (denial of service (DoS) attack in which a server is flooded with packets with the SYN bit set)

    insert into NormalView
    select count(*) as tuplecount, destip, destport
    from PacketEvent.win:time_batch(1 sec)
    where syn = 0
    group by destip, destport

    insert into SynView
    select count(*) as tuplecount, destip, destport
    from PacketEvent.win:time_batch(1 sec)
    where syn = 1
    group by destip, destport

    select (Normal.tuplecount / Syn.tuplecount) as Ratio,
           Normal.tuplecount as NormalTupleCount,
           Normal.destip as NormalDestip,
           Normal.destport as NormalDestport,
           Syn.tuplecount as SynTupleCount
    from NormalView.win:time(5 sec) as Normal,
         SynView.win:time(5 sec) as Syn
    where Normal.destip = Syn.destip

Query 5 - Filter out land attack (spoofed packets causing an infinite loop; the destination IP address and the source IP address are identical)

    select * from Packets
    where sourceip = destip and sourceport = destport and syn = 1

Query 6 - Identify attempts of port scanning (port scanning is commonly the first step in an attack; a port scan can be identified if a series of packets are sent from one source IP address to a destination IP address with many different TCP destination ports)

    SELECT COUNT(DISTINCT destport) as distinctcount, sourceip
    FROM Packets WINDOW SIZE 10 seconds
    GROUP BY sourceip, sourceport
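The idea behind Query 6 can be sketched outside a DSMS as follows: keep the packets of the last 10 seconds, count the distinct destination ports per source IP, and flag sources that exceed a threshold. The threshold value, the tuple layout (timestamp, sourceip, destport), and the function name are assumptions made for this illustration; the slides do not state how the distinct count is turned into an alarm.

```python
from collections import defaultdict, deque


def port_scan_suspects(packets, window=10.0, threshold=3):
    """Return source IPs that hit >= `threshold` distinct ports within `window` seconds.

    `packets` is an iterable of (timestamp, sourceip, destport) in time order.
    """
    recent = deque()   # packets inside the current sliding window
    suspects = set()
    for ts, src, port in packets:
        recent.append((ts, src, port))
        # Evict packets that have fallen out of the time window.
        while recent and recent[0][0] <= ts - window:
            recent.popleft()
        # Recompute distinct ports per source (simple, not optimized).
        ports = defaultdict(set)
        for _, s, p in recent:
            ports[s].add(p)
        if len(ports[src]) >= threshold:
            suspects.add(src)
    return suspects


trace = [
    (0.0, "10.0.0.2", 80),
    (1.0, "10.0.0.1", 21),
    (2.0, "10.0.0.1", 22),
    (3.0, "10.0.0.1", 23),   # third distinct port within 10 s
    (15.0, "10.0.0.2", 81),
    (30.0, "10.0.0.2", 82),  # three ports overall, but spread over 30 s
]
flagged = port_scan_suspects(trace)
print(flagged)  # {'10.0.0.1'}
```

Recomputing the per-source sets on every packet keeps the sketch short; a real monitor would maintain the counts incrementally.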

Performance
- StreamBench
  - Determine the amount of traffic that a DSMS can handle for specific network monitoring tasks (queries)
  - Initial results from earlier work: Master's theses and INF5090 assignments
  - Benchmarked DSMSs: Esper, Borealis, TelegraphCQ, STREAM

Performance Results
- Processing rate: investigates the bounds of network load that the DSMS is able to process without losing too many packets
- Results:
  - The different systems have quite different network load bounds
  - Some systems perform significantly better if some packet loss is allowed
  - The processing complexity of the different queries has a significant impact on the load bounds

Performance Results
[Figure: CPU consumption, Query 5]

Performance Results
[Figure: Memory consumption, Query 6]

Network Monitoring Results
[Figure: Query 6 (port scan) results]

Outline
- Network Monitoring
  - Introduction
  - Network Monitoring with Esper
  - Example Queries
  - Results
- Medical Monitoring
  - Sensors
  - Variable sized windows
  - Results

Online Analysis of Myocardial Ischemia From Medical Sensor Data Streams with Esper
Stig Støa (1), Morten Lindeberg (2) and Vera Goebel (2)
(1) The Interventional Centre (IVS), Rikshospitalet University Hospital, Oslo, Norway
(2) Distributed Multimedia Systems (DMMS), Department of Informatics, University of Oslo, Norway

Experiment Goal #1
- Recreate an off-line technique of Elle et al. [1], conducted in MATLAB
  - Early recognition of regional cardiac ischemia
  - 3-way accelerometer placed on the left ventricle of the heart
- Single metric:
  - The Fast Fourier Transform (FFT) is used to examine the accelerometer signal in the frequency domain
  - Euclidean distance vector (EDV(i)) between reference vector RV(0) and current vector CV(j), where j is the latest sample number
    - CV(j): FFT over sliding window (size 512 over y-axis)
    - RV(0): FFT over baseline window (first 512 samples)
- Data set from surgery performed on pigs at the Interventional Centre
  - We can conduct experiments with the same data set (referred to as data set 1)
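The metric of Experiment Goal #1 can be sketched in plain Python as follows: take the spectrum of the baseline window as the reference vector, take the spectrum of each sliding window as the current vector, and report the Euclidean distance between the two. Several details are assumptions of this sketch and are not confirmed by the slides: the distance is computed over spectral magnitudes, a naive DFT stands in for the FFT (it gives the same values, only more slowly), and the window size is a parameter so that the example can run with 8 samples instead of 512.

```python
import cmath
import math


def dft_magnitudes(samples):
    """Magnitude spectrum via a naive O(n^2) DFT (stand-in for an FFT)."""
    n = len(samples)
    return [
        abs(sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
                for t, x in enumerate(samples)))
        for k in range(n)
    ]


def euclidean_distance(reference, current):
    """Euclidean distance between two equally long vectors."""
    return math.sqrt(sum((r - c) ** 2 for r, c in zip(reference, current)))


def edv_stream(samples, size):
    """Yield the distance between baseline and sliding-window spectra."""
    baseline = dft_magnitudes(samples[:size])      # reference vector RV(0)
    for j in range(size, len(samples) + 1):
        current = dft_magnitudes(samples[j - size:j])  # current vector CV(j)
        yield euclidean_distance(baseline, current)


# Synthetic signal: the dominant frequency doubles halfway through.
signal = [math.sin(2 * math.pi * t / 8) for t in range(16)]
signal += [math.sin(2 * math.pi * t / 4) for t in range(16, 32)]
distances = list(edv_stream(signal, size=8))
print(round(distances[0], 6), round(distances[-1], 6))
```

The distance stays near zero while the signal resembles the baseline and grows once the frequency content changes, which is the effect the technique relies on.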

Experiment Goal #2
- Improve results by adding beat-to-beat detection using a QRS detection algorithm on ECG signals
  - Each ECG trace of a normal heartbeat typically contains a QRS event
  - A good reference for separating heartbeats
- We need to perform the FFT over a sliding window of variable size!
- Cannot use the same data; use a new data set that includes ECG (data set 2)
[ image from http://www.wikipedia.org ]

Challenges
1. Incorporate signal processing operations
   - Problem: not supported in the query language
     - Fast Fourier Transform of the accelerometer signals
     - Euclidean distance vector from baseline window
     - QRS detection for detecting the heartbeats from the ECG signals
   - Solution: custom aggregate functions
2. Static sized windows are not feasible for beat-to-beat detection
   - Problem: heartbeat duration is not a static, pre-known size. DSMS window techniques only describe static time-based or tuple-based windows.
   - Solution: introduce variable length triggered tumbling windows
3. Synchronize the two streams
   - Problem: QRS detection introduces a variable delay (approx. 91 samples)
   - Solution: introduce a variable buffer that slows the accelerometer stream down

Signal processing operations
- Implement as custom aggregate functions
  - Use the defined Java interface and simply add to the query engine
- Implemented methods:
  - QRSD(v): QRS detection based upon the algorithm from Hamilton et al. [2]; the source code is also publicly available
  - edv(v): Euclidean distance from baseline

Variable length triggered tumbling windows
- The ECG stream is aggregated into a stream S_b consisting of QRS events
- This stream (S_b) triggers the flushing of the sliding window w(t), where the custom aggregation over the stream S_a is performed
- The window is not supported by Esper => we implemented a workaround exploiting the functionality of externally timed windows
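A variable length triggered tumbling window, as described above, can be sketched in a few lines of Python: samples of one stream are collected until an event on a second stream fires, at which point the window is aggregated, emitted, and emptied. This is not the Esper workaround mentioned in the slides. The two streams are assumed to be already synchronized and sample-aligned, the trigger sample is assumed to open the new window, and the function name is hypothetical.

```python
def triggered_tumbling_windows(samples, triggers, aggregate):
    """Yield aggregate(window) each time a trigger fires.

    `samples`   : values of the aggregated stream (S_a in the slides).
    `triggers`  : parallel booleans, True where a trigger event (S_b) occurs.
    `aggregate` : function applied to the list of samples in a window.
    The final, unterminated window is not emitted.
    """
    window = []
    for value, fired in zip(samples, triggers):
        if fired and window:
            yield aggregate(window)  # flush the window on the trigger event
            window = []
        window.append(value)         # the trigger sample starts the next window


accel = [1, 2, 3, 4, 5, 6, 7]
qrs = [False, False, True, False, False, False, True]
sizes = list(triggered_tumbling_windows(accel, qrs, len))
maxima = list(triggered_tumbling_windows(accel, qrs, max))
print(sizes)   # [2, 4]  -> window length follows the trigger spacing
print(maxima)  # [2, 6]
```

Passing a different aggregate function changes what is computed per heartbeat without touching the windowing logic, which mirrors the role of custom aggregate functions in the slides.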

Stream Synchronization
- The QRS detection algorithm over the ECG stream introduces a variable delay t
- Introduce the same delay to the accelerometer stream
  - The accelerometer stream is sent through a FIFO queue with dynamic size
  - The QRS detection function sets the dynamic size of the FIFO queue (it also triggers the flushing of the aggregate window, in order to obtain dynamic windows, as already seen)

Results #1 (data set 1)

    SELECT edv(y)
    FROM Accelerometer WINDOW LENGTH(512)

- Easier than MATLAB
- Occlusion occurs after 80 seconds
- Perfusion after 170 seconds
- The figure shows a perfect overlap, thus the technique by Elle et al. [1] can be recreated online using Esper

Results #2 (data set 2)

    SELECT edv(y)
    FROM Accelerometer
    TRIGGER WINDOW BY QRSD(ECG.value)

- The plot shows the fixed sliding window (512 samples) and the dynamic triggered window (based on QRS detection) => less variance!

Results #3 (data set 2)

    SELECT edv(y), min(y)
    FROM Accelerometer
    TRIGGER WINDOW BY QRSD(ECG.value)

- Query with added local minimum value => easy to change!
- [Figure labels: Occlusion, Perfusion]
- The bottom plot represents the local minimum value for the accelerometer stream
- The sudden drop is caused by the ultrasound probe
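The stream synchronization step described in this part, a FIFO queue with dynamic size that delays the accelerometer stream, can be sketched as a small delay line in Python. The class and method names are hypothetical, and the policy of releasing all surplus samples at once when the delay shrinks is an assumption; the slides only say that the QRS detection function sets the queue size.

```python
from collections import deque


class DelayLine:
    """FIFO queue that holds back the newest `delay` samples of a stream."""

    def __init__(self, delay):
        self.delay = delay
        self.queue = deque()

    def set_delay(self, delay):
        """Change the delay, e.g. when the detector reports a new lag."""
        self.delay = delay

    def push(self, sample):
        """Add a sample and return the samples that are now released."""
        self.queue.append(sample)
        released = []
        while len(self.queue) > self.delay:
            released.append(self.queue.popleft())
        return released


line = DelayLine(delay=2)
out = [line.push(1), line.push(2), line.push(3)]
line.set_delay(1)           # shorter delay: surplus samples are released
out.append(line.push(4))
line.set_delay(3)           # longer delay: output pauses until the queue refills
out.extend([line.push(5), line.push(6), line.push(7)])
print(out)  # [[], [], [1], [2, 3], [], [], [4]]
```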

Summary
DSMSs can be used for:
- Network monitoring: performance is an issue; systems behave and handle overload differently
- Medical monitoring of patients: custom operators are helpful; dynamic window sizes adapt to the heart rate

References
[1] Elle, O. J., Halvorsen, S., Gulbrandsen, M. G., Aurdal, L., Bakken, A., Samset, E., Dugstad, H. and Fosse, E. Early recognition of regional cardiac ischemia using a 3-axis accelerometer sensor. Institute of Physics Publishing. Physiological Measurement, Volume 26, Number 4, August 2005, pp. 429-440
[2] Hamilton, P. S. and Tompkins, W. J. Quantitative Investigation of QRS Detection Rules Using the MIT/BIH Arrhythmia Database. IEEE Transactions on Biomedical Engineering, Volume BME-33, Number 12, December 1986, pp. 1157-1165

Questions?
Thank you for attending the lecture!