Data Stream Management Systems
|
|
|
- Luke Henry
- 9 years ago
- Views:
Transcription
1 Data Stream Management Systems Principles of Modern Database Systems 2007 Tore Risch Dept. of information technology Uppsala University Sweden
2 Tore Risch Uppsala University, Sweden What is a Data Base Management System? DBMS Users and programmers SQL queries Software to process queries Software to access stored data Meta data Stored Data
3 New applications Data comes as large data streams, e.g. - Satellite data - Scientific instruments - Colliders - Patient monitoring - Stock data - Process industry - Traffic control Would like to query data in streams
4 Tore Risch Uppsala University, Sweden What is a Data Stream Management System? Users and programmers Continuous queries (CQs) DSMS Software to process queries Data streams Software to access streams and data Data streams Meta data Stored Data
5 DSMS Scenario set wd= PCC(2,"RRpart", "fft3","s-merge",0.1); set q= cq(wd,{s1},{s2}); Coordinator compile(q); CQ run(q); Client WN2 WN1 FFT3() RRPart(2,0) WN4 Radio Signal RRPart(2,1) WN3 FFT3() S-Merge(0.1) Visualization application Cluster/Grid Legend: Client request Control flow Data flow
6 Overview paper L. Golab and T. Özsu: Issues in Stream Data Management, SIGMOD Records, 32(2), June 2003, /1.golab-ozsu1.p
7 The LOFAR Instrument antennas -Distributed over 100 stations -Producing ~20Tbps raw data
8 Streams vs tables Streams potentially infinite in size - Regular DBs based on queries to finite tables Streams ordered, i.e. sequence data - Regular DBs are based on sets and bags Stop condition indicates when/if streams end Often very high stream data volume and rate - Regular DBs usually less demanding Real-time delivery, Quality of Service - Regular DBs weak here Active query model, continuous queries - Regular DB queries passive
9 Continuous queries CQs are turned on and run until stop condition true - Regular queries executed until finished by demand CQs return unbounded data (streams) as result - Regular queries bounded by size of tables CQs operators usually montone, i.e. cannot re-read stream - Reqular queries can access same table many times CQs specified over stream windows (i.e. bounded stream segments) - Regular queries specified over entire tables CQs often based on time stamps (logs) of stream elements (temporal) - Regular queries not temporal CQ join operators approximate - Regular join operators usually exactly match data
10 Stream windows Need monotone window operator to chop stream into segments Window size (sz) based on: - Number of elements E.g. last 10 elements -Time E.g. elements last second Landmark window: - Window from start of stream - Continously growing - Not bounded - Materialization Windows also have stride (str) - Rule for how they move forward
11 Window stride How fast the window moves forward Jumping window sz = str => Output data rate o = input data rate i => No overlap between windows => All data processed once => C.f. window rate wr=i/sz Sliding windows str < sz => o > i (o = i*sz/str ) => Overlaps between windows => Data processed more than once Sampling window str >sz => o < i => No overlaps => Some data not processed => a form of schredding
12 Joining streams Streams infinite => Monotone join operators needed => regular join impossible (not monotone) Instead streams are merged: 1. Split stream into segments by window operator 2. Join windows from each stream 3. Merge the result Stream merge is approximate join method - Window size determines quality of result Stream joins need to deal with rate differences, blocking => Time-out when data blocks => Load shredding skips stream elements => Can also do approximations (e.g. aggregation) => Need to deal with nulls (c.f. outer joins)
13 Stream joining methods Special join methods different from table joins Xjoin: T. Urhan and M. Franklin. Dynamic pipeline scheduling for improving interactive performance of online queries.proceedings of the VLDB Conference, Mjoin: S. Viglas, J. Naughton, and J. Burger. Maximizing the output rate of multi-join queries over streaming information sources. In Proc. of the VLDB Conference 2003 Hybride: Babu, Munagala, Widom, Motwani:Adaptive Caching for Continuous Queries, Proc. 21st International Conference on Data Engineering (ICDE 2005)
14 Punctuations Can be seen as corresponding to transactions Condition for a unit of work E.g. deal is done => new data about it ignored Add punctuation token in stream May improve performance Syncronization Punctuated joins: Ding, Mehta, Rundensteiner, Heineman: Joining Punctuated Streams, EDBT 2004
15 DSMS Systems Aurora (Brown,MIT,Brandeis): Carney et al: Monitoring Streams A New Class of Data Management Applications, VLDB 2003 TelegraphCQ (Berkeley): Chandrasekaran et al: TelegraphCQ: Continuous Dataflow Processing for an Uncertain World, CIDR 2003 Gigascope (AT & T): Cranor et al: Gigascope: High Performance Network Monitoring with an SQL Interface, SIGMOD 2002 STREAM (Stanford):StreaMon: Baby & Widom: An Adaptive Engine for Stream Query Processing, SIGMOD 2004 Borealis (Brown & Brandeis): Ahmad et al: StreaMon: An Adaptive Engine for Stream Query Processing, SIGMOD 2005 (distributed streams) Wavescope (MIT): Girod et al: The Case for a Signal-Oriented Data Stream Management System, CIDR 2007
16 Own related efforts SCSQ (Zeitler & Risch): Processing high-volume stream queries on a supercomputer, ICDE Ph.D. Workshop 2006 (distributed, numerical) GSDM (Ivanova & Risch): Customizable Parallel Execution of Scientific Stream Queries, VLDB 2005 (distributed, numerical) L.Lin, T. Risch: Querying Continuous Time Sequences, VLDB 1998 (numerical time series)
17 Aggregation over stream windows E.g. SCSQ: select avg(winagg(s,100,30)) from Stream s where id(source(s))=2; Lots of work on similarity search over time sequences Indexing time series Bulut and Singh: A Unified Framework for Monitoring Data Streams in Real Time, ICDE 2005 Zhu and Shasha: Warping Indexes with Envelope Transforms for Query by Humming, SIGMOD 2003
18 Scientific Databases Optimization of queries with numerical functions Wolniewicz and Graefe: Algebraic Optimization of Computations overscientific Databases, VLDB 1999 Function approximation and caching Panda, Riedewald, Pope, Gehrke, Chew: Indexing for Function Approximation, VLDB 2006 Denny & Franklin: Adaptive Execution of Variable-Accuracy Functions, VLDB 2006
19 Scientific Databases Scientific workflows Berkley et al: Incorporating Semantics in Scientific Workflow Authoring, SSDBM 2005 Tracking changes and sources Buneman et al: Provenance Management in Curated Databases, SIGMOD 2006 Spatial indexing (c.f. multimedia databases) Csabail et al: Spatial Indexing of Large Multidimensional Databases, CIDR 2007
DataCell: Building a Data Stream Engine on top of a Relational Database Kernel
DataCell: Building a Data Stream Engine on top of a Relational Database Kernel Erietta Liarou (supervised by Martin Kersten) CWI Amsterdam, The Netherlands [email protected] ABSTRACT Stream applications gained
Preemptive Rate-based Operator Scheduling in a Data Stream Management System
Preemptive Rate-based Operator Scheduling in a Data Stream Management System Mohamed A. Sharaf, Panos K. Chrysanthis, Alexandros Labrinidis Department of Computer Science University of Pittsburgh Pittsburgh,
Data Mining for Knowledge Management. Mining Data Streams
Data Mining for Knowledge Management Mining Data Streams Themis Palpanas University of Trento http://dit.unitn.it/~themis Spring 2007 Data Mining for Knowledge Management 1 Motivating Examples: Production
Topics in basic DBMS course
Topics in basic DBMS course Database design Transaction processing Relational query languages (SQL), calculus, and algebra DBMS APIs Database tuning (physical database design) Basic query processing (ch
Data Stream Management
Data Stream Management Synthesis Lectures on Data Management Editor M. Tamer Özsu, University of Waterloo Synthesis Lectures on Data Management is edited by Tamer Özsu of the University of Waterloo. The
Flexible Data Streaming In Stream Cloud
Flexible Data Streaming In Stream Cloud J.Rethna Virgil Jeny 1, Chetan Anil Joshi 2 Associate Professor, Dept. of IT, AVCOE, Sangamner,University of Pune, Maharashtra, India 1 Student of M.E.(IT), AVCOE,
Design, Implementation, and Evaluation of Network Monitoring Tasks with the TelegraphCQ Data Stream Management System
Design, Implementation, and Evaluation of Network Monitoring Tasks with the TelegraphCQ Data Stream Management System INF5100, Autumn 2006 Jarle Content Introduction Streaming Applications Data Stream
SWISSBOX REVISITING THE DATA PROCESSING SOFTWARE STACK
3/2/2011 SWISSBOX REVISITING THE DATA PROCESSING SOFTWARE STACK Systems Group Dept. of Computer Science ETH Zürich, Switzerland SwissBox Humboldt University Dec. 2010 Systems Group = www.systems.ethz.ch
Big Data and Big Analytics
Big Data and Big Analytics Introducing SciDB Open source, massively parallel DBMS and analytic platform Array data model (rather than SQL, Unstructured, XML, or triple-store) Extensible micro-kernel architecture
Aurora: a new model and architecture for data stream management
Aurora: a new model and architecture for data stream management Daniel J. Abadi 1, Don Carney 2, Ugur Cetintemel 2, Mitch Cherniack 1, Christian Convey 2, Sangdon Lee 2, Michael Stonebraker 3, Nesime Tatbul
RTSTREAM: Real-Time Query Processing for Data Streams
RTSTREAM: Real-Time Query Processing for Data Streams Yuan Wei Sang H Son John A Stankovic Department of Computer Science University of Virginia Charlottesville, Virginia, 2294-474 E-mail: {yw3f, son,
Conjugating data mood and tenses: Simple past, infinite present, fast continuous, simpler imperative, conditional future perfect
Matteo Migliavacca (mm53@kent) School of Computing Conjugating data mood and tenses: Simple past, infinite present, fast continuous, simpler imperative, conditional future perfect Simple past - Traditional
Identifying Frequent Items in Sliding Windows over On-Line Packet Streams
Identifying Frequent Items in Sliding Windows over On-Line Packet Streams (Extended Abstract) Lukasz Golab [email protected] David DeHaan [email protected] Erik D. Demaine Lab. for Comp. Sci. M.I.T.
DISTRIBUTED AND PARALLELL DATABASE
DISTRIBUTED AND PARALLELL DATABASE SYSTEMS Tore Risch Uppsala Database Laboratory Department of Information Technology Uppsala University Sweden http://user.it.uu.se/~torer PAGE 1 What is a Distributed
Load Distribution in Large Scale Network Monitoring Infrastructures
Load Distribution in Large Scale Network Monitoring Infrastructures Josep Sanjuàs-Cuxart, Pere Barlet-Ros, Gianluca Iannaccone, and Josep Solé-Pareta Universitat Politècnica de Catalunya (UPC) {jsanjuas,pbarlet,pareta}@ac.upc.edu
Introduction to NoSQL Databases. Tore Risch Information Technology Uppsala University 2013-03-05
Introduction to NoSQL Databases Tore Risch Information Technology Uppsala University 2013-03-05 UDBL Tore Risch Uppsala University, Sweden Evolution of DBMS technology Distributed databases SQL 1960 1970
STREAM: The Stanford Data Stream Management System
STREAM: The Stanford Data Stream Management System Arvind Arasu, Brian Babcock, Shivnath Babu, John Cieslewicz, Mayur Datar, Keith Ito, Rajeev Motwani, Utkarsh Srivastava, and Jennifer Widom Department
How To Use Data For Abstract Reasoning
Applying High-speed Pattern Recognition to Generate Queryable Semantics from Big Data Chuck Rehberg CTO Trigent Software Chief Scientist at Semantic Insights October 2014 copyright 2014 Semantic Insights
Data Stream Management System for Moving Sensor Object Data
SERBIAN JOURNAL OF ELECTRICAL ENGINEERING Vol. 12, No. 1, February 2015, 117-127 UDC: 004.422.635.5 DOI: 10.2298/SJEE1501117J Data Stream Management System for Moving Sensor Object Data Željko Jovanović
PLACE*: A Distributed Spatio-temporal Data Stream Management System for Moving Objects
PLACE*: A Distributed Spatio-temporal Data Stream Management System for Moving Objects Xiaopeng Xiong Hicham G. Elmongui Xiaoyong Chai Walid G. Aref Department of Computer Sciences, Purdue University,
Optimizing Timestamp Management in Data Stream Management Systems
Optimizing Timestamp Management in Data Stream Management Systems Yijian Bai [email protected] Hetal Thakkar [email protected] Haixun Wang IBM T. J. Watson [email protected] Carlo Zaniolo [email protected]
Scaling a Monitoring Infrastructure for the Akamai Network
Scaling a Monitoring Infrastructure for the Akamai Network Thomas Repantis Akamai Technologies [email protected] Scott Smith Formerly of Akamai Technologies [email protected] ABSTRACT We describe the
adaptive loading adaptive indexing dbtouch 3 Ideas for Big Data Exploration
adaptive loading adaptive indexing dbtouch Ideas for Big Data Exploration Stratos Idreos CWI, INS-, Amsterdam data is everywhere daily data years daily data years Eric Schmidt: Every two days we create
iservdb The database closest to you IDEAS Institute
iservdb The database closest to you IDEAS Institute 1 Overview 2 Long-term Anticipation iservdb is a relational database SQL compliance and a general purpose database Data is reliable and consistency iservdb
Systems Engineering II. Pramod Bhatotia TU Dresden pramod.bhatotia@tu- dresden.de
Systems Engineering II Pramod Bhatotia TU Dresden pramod.bhatotia@tu- dresden.de About me! Since May 2015 2015 2012 Research Group Leader cfaed, TU Dresden PhD Student MPI- SWS Research Intern Microsoft
An XML Framework for Integrating Continuous Queries, Composite Event Detection, and Database Condition Monitoring for Multiple Data Streams
An XML Framework for Integrating Continuous Queries, Composite Event Detection, and Database Condition Monitoring for Multiple Data Streams Susan D. Urban 1, Suzanne W. Dietrich 1, 2, and Yi Chen 1 Arizona
Gigascope: A Stream Database for Network Applications
Gigascope: A Stream Database for Network Applications Chuck Cranor, Theodore Johnson, Oliver Spataschek AT&T Labs Research chuck,johnsont,spatsch @research.att.com Vladislav Shkapenyuk Dept. of Computer
A Review of Window Query Processing for Data Streams
Invited Paper Journal of Computing Science and Engineering, Vol. 7, No. 4, December 2013, pp. 220-230 A Review of Window Query Processing for Data Streams Hyeon Gyu Kim* Division of Computer, Sahmyook
IEEE JAVA Project 2012
IEEE JAVA Project 2012 Powered by Cloud Computing Cloud Computing Security from Single to Multi-Clouds. Reliable Re-encryption in Unreliable Clouds. Cloud Data Production for Masses. Costing of Cloud Computing
Challenges for Data Driven Systems
Challenges for Data Driven Systems Eiko Yoneki University of Cambridge Computer Laboratory Quick History of Data Management 4000 B C Manual recording From tablets to papyrus to paper A. Payberah 2014 2
Processing Sliding Window Multi-Joins in Continuous Queries over Data Streams
Processing Sliding Window Multi-Joins in Continuous Queries over Data Streams Lukasz Golab M. Tamer Özsu School of Computer Science University of Waterloo Waterloo, Canada {lgolab, tozsu}@uwaterloo.ca
Case Study: Instrumenting a Network for NetFlow Security Visualization Tools
Case Study: Instrumenting a Network for NetFlow Security Visualization Tools William Yurcik* Yifan Li SIFT Research Group National Center for Supercomputing Applications (NCSA) University of Illinois at
Query Selectivity Estimation for Uncertain Data
Query Selectivity Estimation for Uncertain Data Sarvjeet Singh *, Chris Mayfield *, Rahul Shah #, Sunil Prabhakar * and Susanne Hambrusch * * Department of Computer Science, Purdue University # Department
Data-intensive HPC: opportunities and challenges. Patrick Valduriez
Data-intensive HPC: opportunities and challenges Patrick Valduriez Big Data Landscape Multi-$billion market! Big data = Hadoop = MapReduce? No one-size-fits-all solution: SQL, NoSQL, MapReduce, No standard,
12/10/2012. Real-Time Analytics & Attribution. Client Case Study: Staples. Noah Powers Principal Solutions Architect, Customer Intelligence, SAS
Real-Time Analytics & Attribution Noah Powers Principal Solutions Architect, Customer Intelligence, SAS Patty Hager Analytics Manager, Content/Communication/Entertainment, SAS Suneel Grover Solutions Architect,
Data Mining on Streams
Data Mining on Streams Using Decision Trees CS 536: Machine Learning Instructor: Michael Littman TA: Yihua Wu Outline Introduction to data streams Overview of traditional DT learning ALG DT learning ALGs
Complex Event Processing in Wireless Sensor Networks
Complex Event Processing in Wireless Sensor Networks Omran Saleh Faculty of Computer Science and Automation Ilmenau University of Technology Ilmenau, Germany [email protected] ABSTRACT Most of
2 Associating Facts with Time
TEMPORAL DATABASES Richard Thomas Snodgrass A temporal database (see Temporal Database) contains time-varying data. Time is an important aspect of all real-world phenomena. Events occur at specific points
LearnFromGuru Polish your knowledge
SQL SERVER 2008 R2 /2012 (TSQL/SSIS/ SSRS/ SSAS BI Developer TRAINING) Module: I T-SQL Programming and Database Design An Overview of SQL Server 2008 R2 / 2012 Available Features and Tools New Capabilities
Complex Event Processing (CEP) Why and How. Richard Hallgren BUGS 2013-05-30
Complex Event Processing (CEP) Why and How Richard Hallgren BUGS 2013-05-30 Objectives Understand why and how CEP is important for modern business processes Concepts within a CEP solution Overview of StreamInsight
ETL Overview. Extract, Transform, Load (ETL) Refreshment Workflow. The ETL Process. General ETL issues. MS Integration Services
ETL Overview Extract, Transform, Load (ETL) General ETL issues ETL/DW refreshment process Building dimensions Building fact tables Extract Transformations/cleansing Load MS Integration Services Original
High Performance Spatial Queries and Analytics for Spatial Big Data. Fusheng Wang. Department of Biomedical Informatics Emory University
High Performance Spatial Queries and Analytics for Spatial Big Data Fusheng Wang Department of Biomedical Informatics Emory University Introduction Spatial Big Data Geo-crowdsourcing:OpenStreetMap Remote
SURVEY ON SCIENTIFIC DATA MANAGEMENT USING HADOOP MAPREDUCE IN THE KEPLER SCIENTIFIC WORKFLOW SYSTEM
SURVEY ON SCIENTIFIC DATA MANAGEMENT USING HADOOP MAPREDUCE IN THE KEPLER SCIENTIFIC WORKFLOW SYSTEM 1 KONG XIANGSHENG 1 Department of Computer & Information, Xinxiang University, Xinxiang, China E-mail:
Efficiently Identifying Inclusion Dependencies in RDBMS
Efficiently Identifying Inclusion Dependencies in RDBMS Jana Bauckmann Department for Computer Science, Humboldt-Universität zu Berlin Rudower Chaussee 25, 12489 Berlin, Germany [email protected]
Abhinandan Das EDUCATION AREAS OF INTEREST REPRESENTATIVE RESEARCH
Abhinandan Das 4154 Upson Hall Department of Computer Science Cornell University Ithaca NY 14853 http://www.cs.cornell.edu/~asdas Office: (607) 255 4574 Home: (607) 227 5957 Cell: (607) 227 5957 Fax: (607)
Horizontal Aggregations In SQL To Generate Data Sets For Data Mining Analysis In An Optimized Manner
24 Horizontal Aggregations In SQL To Generate Data Sets For Data Mining Analysis In An Optimized Manner Rekha S. Nyaykhor M. Tech, Dept. Of CSE, Priyadarshini Bhagwati College of Engineering, Nagpur, India
Introduction to NoSQL Databases and MapReduce. Tore Risch Information Technology Uppsala University 2014-05-12
Introduction to NoSQL Databases and MapReduce Tore Risch Information Technology Uppsala University 2014-05-12 What is a NoSQL Database? 1. A key/value store Basic index manager, no complete query language
The STC for Event Analysis: Scalability Issues
The STC for Event Analysis: Scalability Issues Georg Fuchs Gennady Andrienko http://geoanalytics.net Events Something [significant] happened somewhere, sometime Analysis goal and domain dependent, e.g.
Data Stream Management Systems and Query Languages
Data Stream Management Systems and Query Languages Advanced School on Data Exchange, Integration, and Streams (DEIS'10) Dagstuhl Information Systems - Informatik 5 University 09.11.2010 New Applications
ADDING A NEW SITE IN AN EXISTING ORACLE MULTIMASTER REPLICATION WITHOUT QUIESCING THE REPLICATION
ADDING A NEW SITE IN AN EXISTING ORACLE MULTIMASTER REPLICATION WITHOUT QUIESCING THE REPLICATION Hakik Paci 1, Elinda Kajo 2, Igli Tafa 3 and Aleksander Xhuvani 4 1 Department of Computer Engineering,
Data Integrator Performance Optimization Guide
Data Integrator Performance Optimization Guide Data Integrator 11.7.2 for Windows and UNIX Patents Trademarks Copyright Third-party contributors Business Objects owns the following
In this session, we use the table ZZTELE with approx. 115,000 records for the examples. The primary key is defined on the columns NAME,VORNAME,STR
1 2 2 3 In this session, we use the table ZZTELE with approx. 115,000 records for the examples. The primary key is defined on the columns NAME,VORNAME,STR The uniqueness of the primary key ensures that
Big Data Research in the AMPLab: BDAS and Beyond
Big Data Research in the AMPLab: BDAS and Beyond Michael Franklin UC Berkeley 1 st Spark Summit December 2, 2013 UC BERKELEY AMPLab: Collaborative Big Data Research Launched: January 2011, 6 year planned
Exploratory Data Analysis for Ecological Modelling and Decision Support
Exploratory Data Analysis for Ecological Modelling and Decision Support Gennady Andrienko & Natalia Andrienko Fraunhofer Institute AIS Sankt Augustin Germany http://www.ais.fraunhofer.de/and 5th ECEM conference,
SSIS Scaling and Performance
SSIS Scaling and Performance Erik Veerman Atlanta MDF member SQL Server MVP, Microsoft MCT Mentor, Solid Quality Learning Agenda Buffers Transformation Types, Execution Trees General Optimization Techniques
Introduction to Data Mining
Introduction to Data Mining 1 Why Data Mining? Explosive Growth of Data Data collection and data availability Automated data collection tools, Internet, smartphones, Major sources of abundant data Business:
An Industrial Perspective on the Hadoop Ecosystem. Eldar Khalilov Pavel Valov
An Industrial Perspective on the Hadoop Ecosystem Eldar Khalilov Pavel Valov agenda 03.12.2015 2 agenda Introduction 03.12.2015 2 agenda Introduction Research goals 03.12.2015 2 agenda Introduction Research
Data Stream Management System
Case Study of CSG712 Data Stream Management System Jian Wen Spring 2008 Northeastern University Outline Traditional DBMS v.s. Data Stream Management System First-generation: Aurora Run-time architecture
Big Data Provenance: Challenges and Implications for Benchmarking
Big Data Provenance: Challenges and Implications for Benchmarking Boris Glavic Illinois Institute of Technology 10 W 31st Street, Chicago, IL 60615, USA [email protected] Abstract. Data Provenance is information
