TelegraphCQ: A Data Stream Management System



Similar documents
Design, Implementation, and Evaluation of Network Monitoring Tasks with the TelegraphCQ Data Stream Management System

Data Stream Management System

Inside the PostgreSQL Query Optimizer

Monitoring PostgreSQL database with Verax NMS

Data Mining for Knowledge Management. Mining Data Streams

Middleware support for the Internet of Things

Network File System (NFS) Pradipta De

1. Comments on reviews a. Need to avoid just summarizing web page asks you for:

Enhancing SQL Server Performance

BigData. An Overview of Several Approaches. David Mera 16/12/2013. Masaryk University Brno, Czech Republic

Objectives of Lecture. Network Architecture. Protocols. Contents

Spark in Action. Fast Big Data Analytics using Scala. Matei Zaharia. project.org. University of California, Berkeley UC BERKELEY

PostgreSQL Past, Present, and Future EDB All rights reserved.

Chapter 13 File and Database Systems

Chapter 13 File and Database Systems

Performance Counters. Microsoft SQL. Technical Data Sheet. Overview:

Big Data With Hadoop

Real-time Analytics at Facebook: Data Freeway and Puma. Zheng Shao 12/2/2011

Enterprise Applications

Integrating VoltDB with Hadoop

SAP HANA - Main Memory Technology: A Challenge for Development of Business Applications. Jürgen Primsch, SAP AG July 2011

White Paper. How Streaming Data Analytics Enables Real-Time Decisions

Performance rule violations usually result in increased CPU or I/O, time to fix the mistake, and ultimately, a cost to the business unit.

How To Handle Big Data With A Data Scientist

Distributed File Systems

Lecture 3: Scaling by Load Balancing 1. Comments on reviews i. 2. Topic 1: Scalability a. QUESTION: What are problems? i. These papers look at

Aurora: a new model and architecture for data stream management

D. SamKnows Methodology 20 Each deployed Whitebox performs the following tests: Primary measure(s)

Cosmos. Big Data and Big Challenges. Ed Harris - Microsoft Online Services Division HTPS 2011

The Sierra Clustered Database Engine, the technology at the heart of

Router Architectures

Accelerating Enterprise Applications and Reducing TCO with SanDisk ZetaScale Software

CASE STUDY: Oracle TimesTen In-Memory Database and Shared Disk HA Implementation at Instance level. -ORACLE TIMESTEN 11gR1

Client/Server Computing Distributed Processing, Client/Server, and Clusters

Postgres Plus Advanced Server

Object Oriented Database Management System for Decision Support System.

BIG DATA TECHNOLOGY. Hadoop Ecosystem

2 Associating Facts with Time

Optimizing Performance. Training Division New Delhi

CS2510 Computer Operating Systems

CS2510 Computer Operating Systems

Lambda Architecture. Near Real-Time Big Data Analytics Using Hadoop. January Website:

Parallel Databases. Parallel Architectures. Parallelism Terminology 1/4/2015. Increase performance by performing operations in parallel

Availability Digest. Raima s High-Availability Embedded Database December 2011

ICOM 6005 Database Management Systems Design. Dr. Manuel Rodríguez Martínez Electrical and Computer Engineering Department Lecture 2 August 23, 2001

Enabling Database-as-a-Service (DBaaS) within Enterprises or Cloud Offerings

Tushar Joshi Turtle Networks Ltd

Cloud Computing at Google. Architecture

CSE 544 Principles of Database Management Systems. Magdalena Balazinska Fall 2007 Lecture 5 - DBMS Architecture

One-Size-Fits-All: A DBMS Idea Whose Time has Come and Gone. Michael Stonebraker December, 2008

Lecture 5: GFS & HDFS! Claudia Hauff (Web Information Systems)! ti2736b-ewi@tudelft.nl

ELIXIR LOAD BALANCER 2

CitusDB Architecture for Real-Time Big Data

Web Server Software Architectures

Oracle EXAM - 1Z Oracle Database 11g Release 2: SQL Tuning. Buy Full Product.

Data Stream Management and Complex Event Processing in Esper. INF5100, Autumn 2010 Jarle Søberg

Accelerating Hadoop MapReduce Using an In-Memory Data Grid

Cloud Computing for Control Systems CERN Openlab Summer Student Program 9/9/2011 ARSALAAN AHMED SHAIKH

VirtualCenter Database Performance for Microsoft SQL Server 2005 VirtualCenter 2.5

Top 10 reasons your ecommerce site will fail during peak periods

Cosmos. Big Data and Big Challenges. Pat Helland July 2011

FileNet System Manager Dashboard Help

PaaS - Platform as a Service Google App Engine

Data Management for Portable Media Players

Configuration Management of Massively Scalable Systems

1Z0-117 Oracle Database 11g Release 2: SQL Tuning. Oracle

Implementing and Maintaining Microsoft SQL Server 2005 Reporting Services COURSE OVERVIEW AUDIENCE OUTLINE OBJECTIVES PREREQUISITES

Spark. Fast, Interactive, Language- Integrated Cluster Computing

Business Application Services Testing

StreamStorage: High-throughput and Scalable Storage Technology for Streaming Data

Configuring Apache Derby for Performance and Durability Olav Sandstå

SQL Server Query Tuning

Centralized Systems. A Centralized Computer System. Chapter 18: Database System Architectures

High-Speed Network Traffic Monitoring Using ntopng. Luca

Data Integrator Performance Optimization Guide

STREAM: The Stanford Data Stream Management System

The Hadoop Distributed File System

F1: A Distributed SQL Database That Scales. Presentation by: Alex Degtiar (adegtiar@cmu.edu) /21/2013

Scaling Web Applications in a Cloud Environment using Resin 4.0

Integrity Checking and Monitoring of Files on the CASTOR Disk Servers

Improved metrics collection and correlation for the CERN cloud storage test framework

In-Memory Databases MemSQL

Promise of Low-Latency Stable Storage for Enterprise Solutions

Business Value Reporting and Analytics

Affordable, Scalable, Reliable OLTP in a Cloud and Big Data World: IBM DB2 purescale

Transcription:

TelegraphCQ: A Data Stream Management System Neil Conway (neilc@samurai.com) TelegraphCQ: p. 1

Introduction What is data stream management? A query language for streams TelegraphCQ architecture Query execution TelegraphCQ: p. 2

Data Streams Most traditional database systems are optimized for one-time queries on mostly-static data Long-lived data, short-lived queries This is a poor fit for applications that want to manipulate real-time unbounded streams of data Long-lived queries, short-lived data Examples: real-time analysis of financial data (stock trades, fraud detection), sensor network data, network traffic analysis (clickstreams, intrusion-detection),... TelegraphCQ: p. 3

Data Stream Management Possible solution: insert incoming data into a DBMS, perform analysis, and then periodically trim/archive the data Inefficient: unnecessarily hits disk, and recomputes the analysis from scratch each time Better would be to incrementally update the result set of a continuous query to reflect latest stream tuples Popular alternative: roll your own from scratch Labour intensive Goal: simplify streaming applications by building a generic data stream management system (DSMS) TelegraphCQ: p. 4

TelegraphCQ There has been considerable interest in streams among academia One such project is TelegraphCQ from UC Berkeley Open-source DSMS prototype, based on the PostgreSQL database system Several other academic streaming projects: STREAM (Stanford), Aurora (MIT), Nile (Purdue), Medusa (Brown),... Several TelegraphCQ folks have started Amalgamated Insight Inc. There are several other stream-related startups, notably StreamBase (Stonebraker) TelegraphCQ: p. 5

What is a stream? A stream is an infinite bag of timestamp,tuple pairs Stream tuples can be externally timestamped, or timestamped by the system A DSMS manages both streams and (typically) conventional database objects Push stream: stream content is supplied by client applications that send it directly to the DSMS For example, by connecting via TCP In pull streams, the DSMS fetches content from a remote source and converts it into a stream of tuples A derived stream is a stream defined by a query on other database objects TelegraphCQ: p. 6

Queries on streams A DBMS typically handles ad hoc, one-time queries Client submits query, DBMS evaluates it and returns result set In streaming applications, we are more interested in continuous queries Long-running (months or years not uncommon) Continuous: result set changes over time Typically represents a condition of interest ( do x when condition y or z is satisfied ), or a running statistical summary of a stream ( show me the k most active stock symbols in the last t minutes, computed every n seconds ) TelegraphCQ: p. 7

Query Language A declarative query language is a Good Thing STREAM and TelegraphCQ implement variants of CQL, the Continuous Query Language CQL is a straightforward extension to SQL for manipulating streams Doesn t specify some important technical details Currently under development: standardized streaming query language, StreamSQL TelegraphCQ: p. 8

CQL Basics Pragmatism: we have a perfectly good relational query language, so reuse that Query optimization and execution for traditional relational query languages is also very well understood Two basic kinds of things: streams and relations SQL defines relation relation operators CQL defines operators for stream relation and relation stream TelegraphCQ: p. 9

Stream Relation Streams are unbounded. To deal with a finite portion of the stream, we apply a window clause to it to produce a time-varying relation Time-based window: the tuples that appear in a specified time interval in the stream RANGE 5 minutes SLIDE 30 seconds Tuple-based window: most recent n tuples in the stream ROWS 10 SLIDE 60 seconds Partitioned window: given a set of attributes, groups stream tuples on those attributes, then computes the union of a window clause applied to all the groups PARTITION BY stock.symbol ROWS 5 TelegraphCQ: p. 10

Relation Stream 3 types of R S operators: 1. The IStream of R contains a tuple s at time t when s is in R t R t 1 2. The DStream of R contains a tuple s at time t when s is in R t 1 R t 3. The RStream of R contains a tuple s at time t when s is in R t TelegraphCQ: p. 11

Joins Stream-relation joins are common in practice For each new stream tuple, do a table (index) lookup 1 stream, n relations Divides plan into streaming and non-streaming components Stream-stream joins can be useful Probably requires compatible window clauses TelegraphCQ: p. 12

CQL Example (Source: Stanford CQL Query Repository) Network traffic analysis: Every 5 minutes, sum the number of bytes and number of packets devoted to HTTP traffic for each distinct IP address. SELECT FROM RSTREAM(src_ip, SUM(len), COUNT(*)) packets [RANGE 5 Minutes WHERE dest_port = 80 GROUP BY src_ip SLIDE 5 Minute ] TelegraphCQ: p. 13

CQL Example 2 Online auction fraud detection: Every 90 seconds, compute the highest bid made in the last 10 minutes. SELECT FROM RSTREAM(item_id, bid_price) bid [RANGE 10 Minutes WHERE bid_price = (SELECT FROM SLIDE 90 Seconds ] MAX(bid_price) bid [RANGE 10 Minutes ] SLIDE 90 Seconds ]) TelegraphCQ: p. 14

TelegraphCQ Adaptivity: continuous queries might run for months. Static planning decisions will become invalid Therefore, don t use a static query plan Sharing: many typical systems will execute hundreds of continuous queries at a time. Sharing the work required to evaluate these queries is necessary Happily, long-running continuous queries are easier to share Concurrency is also simplified with streams Need to allow continuous queries to be easily added and removed from a running system Try to avoid hitting disk TelegraphCQ: p. 15

TelegraphCQ Architecture Context: PostgreSQL uses a fairly traditional Unix daemon architecture A single persistent parent process, the postmaster The postmaster forks a new child process called a backend to handle each new client connection A SysV IPC shared memory region is used to communicate between backends It contains various caches, notably the shared buffer pool Some additional server processes: autovacuum daemon, background writer, checkpoint process TelegraphCQ: p. 16

Process Architecture New TelegraphCQ processes: Wrapper Clearing House: manages stream I/O (push/pull) and format conversion TCQ Backend: single process that executes all the streaming queries as part of a single query plan Communication between processes via shared memory queues Client connects to normal Postgres backend; continuous queries are planned by the backend, then sent via shared memory to the TCQ backend Results returned via another shared memory queue TelegraphCQ: p. 17

Global Query Plan TCQ backend is responsible for evaluating all the continuous queries in the system Construct a single query plan containing all the operators in all the queries Continuous queries from a Postgres backend folded into the shared query plan Commonalities between queries can be exploited by using a single shared operator to implement parts of more than one query Determining how to walk the graph of operators for a given stream input tuple is called tuple routing TelegraphCQ: p. 18

Tuple Routing The optimal path might change over time: operator cost, operator selectivity, stream arrival rates,... are all variable Therefore, don t do any static planning: instead, per-tuple adaptive routing A stream tuple includes routing metadata, describing the operators it has visited, the queries it is still visible to, and its signature (underlying base tuples) We don t materialize join tuples, for more routing flexibility Once a tuple fails a predicate for a query, mark it as invisible to that query (but continue routing tuple!) TelegraphCQ: p. 19

Tuple Routing, cont. Split joins into two halves (STeM): build and probe Decide which operator to send a tuple to next based on runtime statistics about operator costs / selectivies, plus the tuple routing metadata Current implementation is not parallel, but the design should parallelize well TelegraphCQ: p. 20

Shared Evaluation This architecture naturally leads to implementing parts of multiple queries with a single operator Sharing predicates is fairly easy for,<,>,, =, Joins can be shared by splitting them into StEMs Aggregates can be shared pretty effectively Even aggregates with different predicates and window clauses can be shared Two-phase aggregation TelegraphCQ: p. 21

Interesting Streaming Problems Graceful degredation under load The rate of arrival of a given stream is often highly variable Sometimes necessary to provision hardware for average load, not peak load How to degrade gracefully? Options: spill excess tuples to disk, summarize excess tuples (e.g. via histograms), or discard them High-availability and clustering TelegraphCQ: p. 22

More Problems Hybrid queries (stream-table joins) How does this change query optimization/execution, especially in the non-streaming portion of the query? How do we avoid the downsides of static planning? Sharing? Streams and transactions When do rows in base tables become visible? Transaction-like semantics for streaming queries? Historical queries, archived streams TelegraphCQ: p. 23