Data Mining for Knowledge Management. Mining Data Streams



Similar documents
Conjugating data mood and tenses: Simple past, infinite present, fast continuous, simpler imperative, conditional future perfect

Cloud and Big Data Summer School, Stockholm, Aug Jeffrey D. Ullman

Models and Issues in Data Stream Systems

NetFlow Tracker Overview. Mike McGrath x ccie CTO mike@crannog-software.com

Topics in basic DBMS course

Data Stream Management

Middleware support for the Internet of Things

Configuring Security for FTP Traffic

Understanding traffic flow

THE DEVELOPER GUIDE TO BUILDING STREAMING DATA APPLICATIONS

NCAS National Caller ID Authentication System

Web analytics: Data Collected via the Internet

Data Integrator Performance Optimization Guide

Case Study: Instrumenting a Network for NetFlow Security Visualization Tools

Processing and Analyzing Streams. CDRs in Real Time

Limitations of Packet Measurement


Big Data With Hadoop

Hierarchical Bloom Filters: Accelerating Flow Queries and Analysis

Extreme Computing. Big Data. Stratis Viglas. School of Informatics University of Edinburgh Stratis Viglas Extreme Computing 1

Lecture II : Communication Security Services

High-Volume Data Warehousing in Centerprise. Product Datasheet

Complexity and Scalability in Semantic Graph Analysis Semantic Days 2013

BIG DATA TECHNOLOGY. Hadoop Ecosystem

Transforming the Telecoms Business using Big Data and Analytics

Analysis of Network Beaconing Activity for Incident Response

1 Log visualization at CNES (Part II)

Building Scalable Big Data Infrastructure Using Open Source Software. Sam William

ANALYTICS IN BIG DATA ERA

INTRODUCTION TO APACHE HADOOP MATTHIAS BRÄGER CERN GS-ASE

Presented by: Aaron Bossert, Cray Inc. Network Security Analytics, HPC Platforms, Hadoop, and Graphs Oh, My

ETL Overview. Extract, Transform, Load (ETL) Refreshment Workflow. The ETL Process. General ETL issues. MS Integration Services

Data Stream Management Systems

CISCO INFORMATION TECHNOLOGY AT WORK CASE STUDY: CISCO IOS NETFLOW TECHNOLOGY

Cosmos. Big Data and Big Challenges. Pat Helland July 2011

Network Tomography and Internet Traffic Matrices

Web Analytics Understand your web visitors without web logs or page tags and keep all your data inside your firewall.

Bit Chat: A Peer-to-Peer Instant Messenger

Using Data Mining and Machine Learning in Retail

Web Traffic Capture Butler Street, Suite 200 Pittsburgh, PA (412)

CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop)

Cosmos. Big Data and Big Challenges. Ed Harris - Microsoft Online Services Division HTPS 2011

Methods for Lawful Interception in IP Telephony Networks Based on H.323

Event-based middleware services

B490 Mining the Big Data. 0 Introduction

CommuniGate Pro Real-Time Features. CommuniGate Pro Internet Communications VoIP, , Collaboration, IM

From GWS to MapReduce: Google s Cloud Technology in the Early Days

Solving big data problems in real-time with CEP and Dashboards - patterns and tips

Product Guide. Sawmill Analytics, Swindon SN4 9LZ UK tel:

Hadoop Distributed File System. Dhruba Borthakur Apache Hadoop Project Management Committee

An apparatus for P2P classification in Netflow traces

CHAPTER 3 PROBLEM STATEMENT AND RESEARCH METHODOLOGY

A Novel Cloud Based Elastic Framework for Big Data Preprocessing

Data-Intensive Programming. Timo Aaltonen Department of Pervasive Computing

Oracle SOA Suite: The Evaluation from 10g to 11g

Configuring Backup Settings. Copyright 2009, Oracle. All rights reserved.

Network Measurement. Why Measure the Network? Types of Measurement. Traffic Measurement. Packet Monitoring. Monitoring a LAN Link. ScienLfic discovery

Solution: B. Telecom

Log Audit Ensuring Behavior Compliance Secoway elog System

Associate Professor, Department of CSE, Shri Vishnu Engineering College for Women, Andhra Pradesh, India 2

Understanding Slow Start

Datagram-based network layer: forwarding; routing. Additional function of VCbased network layer: call setup.

Protocols. Packets. What's in an IP packet

Data Catalogs for Hadoop Achieving Shared Knowledge and Re-usable Data Prep. Neil Raden Hired Brains Research, LLC

Online and Scalable Data Validation in Advanced Metering Infrastructures

Network-Wide Capacity Planning with Route Analytics

McAfee Web Gateway 7.4.1

Ranch Networks for Hosted Data Centers

BIG DATA Alignment of Supply & Demand Nuria de Lama Representative of Atos Research &

Data Warehousing and Analytics Infrastructure at Facebook. Ashish Thusoo & Dhruba Borthakur athusoo,dhruba@facebook.com

Cisco NetFlow Reporting Instruction Manual Version 1.0

DSMS Part 1 & 2. Data Management: Comparison - DBS versus DSMS. Handle Data Streams in DBS?

A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM

Firewalls & Intrusion Detection

FIREWALL AND NAT Lecture 7a

Traffic Engineering & Network Planning Tool for MPLS Networks

Lawful Interception in P2Pbased

Per-Flow Queuing Allot's Approach to Bandwidth Management

The basic data mining algorithms introduced may be enhanced in a number of ways.

Secospace elog. Secospace elog

Monitor and Manage Your MicroStrategy BI Environment Using Enterprise Manager and Health Center

Hadoop Ecosystem B Y R A H I M A.

Availability Digest. Raima s High-Availability Embedded Database December 2011

Hadoop. Sunday, November 25, 12

Unity Error Message: Your voic box is almost full

Network Monitoring using MMT:

Transcription:

Data Mining for Knowledge Management Mining Data Streams Themis Palpanas University of Trento http://dit.unitn.it/~themis Spring 2007 Data Mining for Knowledge Management 1 Motivating Examples: Production Control System Spring 2007 Data Mining for Knowledge Management 6 1

Motivating Examples: Monitoring Vehicle Operation Spring 2007 Data Mining for Knowledge Management 8 Motivating Examples: Financial Applications Spring 2007 Data Mining for Knowledge Management 9 2

Motivating Examples: Web Data Streams Mining query streams. Google wants to know what queries are more frequent today than yesterday. Mining click streams. Yahoo wants to know which of its pages are getting an unusual number of hits in the past hour. Spring 2007 Data Mining for Knowledge Management 10 Motivating Examples: Network Monitoring SNMP/RMON, NetFlow records Peer Enterprise Networks FR, ATM, IP VPN Network Operations Center (NOC) Converged IP/MPLS Core DSL/Cable Networks Broadband Internet Access Source Destination Duration Bytes Protocol 10.1.0.2 16.2.3.7 12 20K http 18.6.7.1 12.4.0.3 16 24K http 13.9.4.3 11.6.8.2 15 20K http 15.2.2.9 17.1.2.1 19 40K http 12.4.3.8 14.8.7.4 26 58K http 10.5.1.3 13.0.0.1 27 100K ftp 11.1.0.6 10.3.4.5 32 300K ftp 19.7.1.2 16.5.5.8 18 80K ftp PSTN Voice over IP Example NetFlow IP Session Data 24x7 IP packet/flow data-streams at network elements Truly massive streams arriving at rapid rates AT&T collects 600-800 Gigabytes of NetFlow data each day. Often shipped off-site to data warehouse for off-line analysis Spring 2007 Data Mining for Knowledge Management 11 3

Back-end Data Warehouse Motivating Examples: Network Monitoring DBMS (Oracle, DB2) Off-line analysis slow, expensive What are the top (most frequent) 1000 (source, dest) pairs seen over the last month? Network Operations Center (NOC) How many distinct (source, dest) pairs have been seen by both R1 and R2 but not R3? Peer Set-Expression Query Enterprise Networks DSL/Cable Networks PSTN SELECT COUNT (R1.source, R2.dest) FROM R1, R2 WHERE R1.dest = R2.source SQL Join Query Spring 2007 Data Mining for Knowledge Management 12 Motivating Examples: Network Monitoring DSL/Cable Networks BGP Network Operations Center (NOC) IP Network PSTN Must process network streams in real-time and one pass Critical NM tasks: fraud, DoS attacks, SLA violations Real-time traffic engineering to improve utilization Tradeoff communication and computation to reduce load Make responses fast, minimize use of network resources Secondarily, minimize space and processing cost at nodes Spring 2007 Data Mining for Knowledge Management 13 4

Motivating Examples: Sensor Networks the sensors era ubiquitous, small, inexpensive sensors applications that bridge physical world to information technology Spring 2007 Data Mining for Knowledge Management 14 Motivating Examples: Sensor Networks the sensors era ubiquitous, small, inexpensive sensors applications that bridge physical world to information technology sensors unveil previously unobservable phenomena Spring 2007 Data Mining for Knowledge Management 20 5

Requirements develop efficient streaming algorithms need to process this data online allow approximate answers operate in a distributed fashion (network as distributed database) can also be used as one-pass algorithms for massive datasets Spring 2007 Data Mining for Knowledge Management 21 Requirements develop efficient streaming algorithms need to process this data online allow approximate answers operate in a distributed fashion (network as distributed database) can also be used as one-pass algorithms for massive datasets propose new data mining algorithms help in data analysis in the above setting Spring 2007 Data Mining for Knowledge Management 22 6

Data Stream Management System? Traditional DBMS data stored in finite, persistent data sets New Applications data input as continuous, ordered data streams Network monitoring and traffic engineering Telecom call records Network security Financial applications Sensor networks Manufacturing processes Web logs and clickstreams Massive data sets Spring 2007 Data Mining for Knowledge Management 24 Data Stream Management System! User/Application Register Query Results Scratch Space (Memory and/or Disk) Stream Query Processor Data Stream Management System (DSMS) Spring 2007 Data Mining for Knowledge Management 25 7

Killer-apps Meta-Questions Application stream rates exceed DBMS capacity? Can DSMS handle high rates anyway? Motivation Need for general-purpose DSMS? Not ad-hoc, application-specific systems? Non-Trivial DSMS = merely DBMS with enhanced support for triggers, temporal constructs, data rate mgmt? Spring 2007 Data Mining for Knowledge Management 26 DBMS versus DSMS Persistent relations One-time queries Random access Unbounded disk store Only current state matters Passive repository Relatively low update rate No real-time services Precise answers Access plan determined by query processor, physical DB design Transient streams Continuous queries Sequential access Bounded main memory History/arrival-order is critical Active stores Possibly multi-gb arrival rate Real-time requirements Imprecise/approximate answers Access plan dependent on variable data arrival and data characteristics Spring 2007 Data Mining for Knowledge Management 27 8

Making Things Concrete ALICE BOB Central Office Outgoing (call_id, caller, time, event) Central Office Incoming (call_id, callee, time, event) DSMS event = start or end Spring 2007 Data Mining for Knowledge Management 28 Query 1 (self-join) Find all outgoing calls longer than 2 minutes SELECT O1.call_ID, O1.caller FROM Outgoing O1, Outgoing O2 WHERE (O2.time O1.time > 2 AND O1.call_ID = O2.call_ID AND O1.event = start AND O2.event = end) Result requires unbounded storage Can provide result as data stream Can output after 2 min, without seeing end Spring 2007 Data Mining for Knowledge Management 29 9

Query 2 (join) Pair up callers and callees SELECT O.caller, I.callee FROM Outgoing O, Incoming I WHERE O.call_ID = I.call_ID Can still provide result as data stream Requires unbounded temporary storage unless streams are near-synchronized Spring 2007 Data Mining for Knowledge Management 30 Query 3 (group-by aggregation) Total connection time for each caller SELECT FROM WHERE GROUP BY O1.caller, sum(o2.time O1.time) Outgoing O1, Outgoing O2 (O1.call_ID = O2.call_ID AND O1.event = start AND O2.event = end) O1.caller Cannot provide result in (append-only) stream Output updates? Provide current value on demand? Memory? Spring 2007 Data Mining for Knowledge Management 31 10

Data Model Append-only Call records Updates Stock tickers Deletes Transactional data Meta-Data Control signals, punctuations System Internals probably need all above Spring 2007 Data Mining for Knowledge Management 32 Query Model Query Registration Predefined Ad-hoc Predefined, inactive until invoked User/Application Query Processor Answer Availability One-time Event/timer based Multiple-time, periodic Continuous (stored or streamed) Stream Access Arbitrary Weighted history Sliding window (special case: size = 1) DSMS Spring 2007 Data Mining for Knowledge Management 33 11

Related Database Technology DSMS must use ideas, but none is substitute Triggers, Materialized Views in Conventional DBMS Main-Memory Databases Distributed Databases Pub/Sub Systems Active Databases Sequence/Temporal/Timeseries Databases Realtime Databases Adaptive, Online, Partial Results Novelty in DSMS Semantics: input ordering, streaming output, State: cannot store unending streams, yet need history Performance: rate, variability, imprecision, Spring 2007 Data Mining for Knowledge Management 34 Stream Projects Amazon/Cougar (Cornell) sensors Borealis (Brown/MIT) sensor monitoring, dataflow Hancock (AT&T) telecom streams Niagara (OGI/Wisconsin) Internet XML databases OpenCQ (Georgia) triggers, incr. view maintenance Stream (Stanford) general-purpose DSMS Tapestry (Xerox) pub/sub content-based filtering Telegraph (Berkeley) adaptive engine for sensors Tribeca (Bellcore) network monitoring Spring 2007 Data Mining for Knowledge Management 35 12