COMP 5311: Topics in DB Research Comp 5311 Database Management Systems. 7. Topics in DB Research





DB Topics
- Data Warehouses and OLAP
- Data Streams
- Keyword Search in Databases
- Spatial/Spatio-temporal Databases
- Time Series
- Skylines and Top-k Queries
- Other Topics

DATA WAREHOUSES and OLAP
- On-Line Transaction Processing (OLTP) systems manipulate operational data, necessary for day-to-day operations. Most existing database systems are in this category.
- On-Line Analytical Processing (OLAP) systems support specific types of queries (based on group-bys and aggregation operators) useful for decision making.

Why OLTP is not sufficient for Decision Making
Let's say that Welcome supermarket uses a relational database to keep track of sales in all of its stores simultaneously.

SALES table
product id | store id | quantity sold | date/time of sale
567        | 17       | 1             | 1997-10-22 09:35:14
219        | 16       | 4             | 1997-10-22 09:35:14
219        | 17       | 1             | 1997-10-22 09:35:17
...

Example (cont.)

PRODUCTS table
product id | product name             | product category | manufact. id
567        | Colgate Gel Pump 6.4 oz. | toothpaste       | 68
219        | Diet Coke 12 oz. can     | soda             | 5
...

STORES table
store id | city id | store location  | phone number
16       | 34      | 510 Main Street | 415-555-1212
17       | 58      | 13 Maple Avenue | 914-555-1212

CITIES table
id | name          | state      | popul.
34 | San Francisco | California | 700,000
58 | East Fishkill | New York   | 30,000

Example (cont.)
An executive asks: "I noticed that there was a Colgate promotion recently, directed at people who live in towns with population < 100,000. How much Colgate toothpaste did we sell in those towns yesterday? And how much on the same day a month ago?"

    select sum(sales.quantity_sold)
    from sales, products, stores, cities
    where products.manufacturer_id = 68 -- restrict to Colgate
      and products.product_category = 'toothpaste'
      and cities.population < 100000
      and sales.datetime_of_sale::date = 'yesterday'::date
      and sales.product_id = products.product_id
      and sales.store_id = stores.store_id
      and stores.city_id = cities.city_id
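To make the toothpaste query concrete, the following sketch loads the sample rows from the slides into an in-memory SQLite database and runs the query. It is an illustration under assumptions, not the course's setup: the column names (`city_id`, `store_location`, and so on) are guessed from the table headers, the function name `toothpaste_sold` is hypothetical, and SQLite's `date()` function stands in for the PostgreSQL-style `::date` cast used on the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE cities   (city_id INTEGER PRIMARY KEY, name TEXT, state TEXT, population INTEGER);
CREATE TABLE stores   (store_id INTEGER PRIMARY KEY, city_id INTEGER, store_location TEXT, phone_number TEXT);
CREATE TABLE products (product_id INTEGER PRIMARY KEY, product_name TEXT, product_category TEXT, manufacturer_id INTEGER);
CREATE TABLE sales    (product_id INTEGER, store_id INTEGER, quantity_sold INTEGER, datetime_of_sale TEXT);

INSERT INTO cities VALUES (34, 'San Francisco', 'California', 700000);
INSERT INTO cities VALUES (58, 'East Fishkill', 'New York', 30000);
INSERT INTO stores VALUES (16, 34, '510 Main Street', '415-555-1212');
INSERT INTO stores VALUES (17, 58, '13 Maple Avenue', '914-555-1212');
INSERT INTO products VALUES (567, 'Colgate Gel Pump 6.4 oz.', 'toothpaste', 68);
INSERT INTO products VALUES (219, 'Diet Coke 12 oz. can', 'soda', 5);
INSERT INTO sales VALUES (567, 17, 1, '1997-10-22 09:35:14');
INSERT INTO sales VALUES (219, 16, 4, '1997-10-22 09:35:14');
INSERT INTO sales VALUES (219, 17, 1, '1997-10-22 09:35:17');
""")


def toothpaste_sold(conn, day):
    """Colgate toothpaste units sold on `day` (YYYY-MM-DD) in towns under 100,000 people."""
    row = conn.execute(
        """
        SELECT COALESCE(SUM(sales.quantity_sold), 0)
        FROM sales, products, stores, cities
        WHERE products.manufacturer_id = 68            -- restrict to Colgate
          AND products.product_category = 'toothpaste'
          AND cities.population < 100000
          AND date(sales.datetime_of_sale) = ?         -- SQLite stand-in for ::date
          AND sales.product_id = products.product_id
          AND sales.store_id = stores.store_id
          AND stores.city_id = cities.city_id
        """,
        (day,),
    ).fetchone()
    return row[0]


print(toothpaste_sold(conn, "1997-10-22"))
print(toothpaste_sold(conn, "1997-09-22"))
```

Even on this tiny data set the query touches all four tables, which is the cost the slides argue a warehouse should absorb instead of the OLTP system.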

Example (cont.)
- You have to do a 4-way JOIN of some large tables. Moreover, these tables are being updated as the query is executed.
- Need for a separate DB system (i.e., a Data Warehouse) to support queries like the previous one.
- The Warehouse can be tailor-made for specific types of queries: if you know that the toothpaste query will occur every day, then you can denormalize the data model.

Example (cont.)
- Suppose Welcome acquires ParknShop, which is using a different set of OLTP data models and a different brand of RDBMS to support them. But you want to run the toothpaste queries for both divisions.
- Solution: also copy data from the ParknShop database into the Welcome Data Warehouse (data integration).
- One of the more important functions of a data warehouse in a company that has disparate computing systems is to provide a view for management as though the company were in fact integrated.

Motivation
- In most organizations, data about specific parts of the business is there: lots and lots of data, somewhere, in some form.
- Data is available but not information, and not the right information at the right time.
- Goal: bring together information from multiple sources so as to provide a consistent database source for decision support queries.
- Goal: off-load decision support applications from the on-line transaction system.

Multitiered Architecture
[Figure: data sources (operational DBs and other sources) feed a monitor and integrator that extracts, transforms, loads, and refreshes the data warehouse and its data marts, described by metadata; an OLAP server serves the front-end tools for analysis, queries, reports, and data mining.]

Loading Data to the Warehouse
- The warehouse must clean data, since operational data from multiple sources are often dirty: inconsistent field lengths, missing entries, violation of integrity constraints.
- Loading the warehouse includes some other processing tasks: checking integrity constraints, sorting, summarizing, building indexes, etc.
- Refreshing a warehouse means propagating updates on source data to the data stored in the warehouse.

Conceptual Modeling - Star Schema
Fact table
- Dimension keys: Date, Product, Store, Customer
- Measures: unit_sales, dollar_sales, Yen_sales
Dimension tables
- Date: Date, Month, Year
- Store: StoreID, City, State, Country, Region
- Product: ProductNo, ProdName, ProdDesc, Category, QOH
- Cust: CustId, CustName, CustCity, CustCountry

Multidimensional View of Data
- Sales volume (measure) as a function of product, time, and geography (dimensions).
- Dimensions: Product, Region, Time
- Hierarchical summarization paths:
  - Product: Industry > Category > Product
  - Region: Country > Region > City > Office
  - Time: Year > Quarter > Month or Week > Day

Data Cube = Fact table + All Group-bys
- Aggregate (a single sum):
      SELECT SUM(Sales) FROM Sales
- Group By (with total), by Color:
      SELECT Color, SUM(Sales) FROM Sales GROUP BY Color
- Cross Tab, by Color and Model:
      SELECT Color, Model, SUM(Sales) FROM Sales GROUP BY Color, Model
- The Data Cube and the Sub-Space Aggregates: the total sum, plus the aggregates by Make, by Year, by Color, by Make & Year, by Make & Color, and by Color & Year.
[Figure: the sum, group-by, cross-tab, and cube views of car sales over colors RED/WHITE/BLUE, makes CHEVY/FORD, and years 1990 to 1993.]

Cube Operation

    SELECT Model, Year, Color, SUM(sales)
    FROM Sales
    GROUP BY CUBE(Model, Year, Color);

SALES
Model | Year | Color | Sales
Chevy | 1990 | red   | 5
Chevy | 1990 | white | 87
Chevy | 1990 | blue  | 62
Chevy | 1991 | red   | 54
Chevy | 1991 | white | 95
Chevy | 1991 | blue  | 49
Chevy | 1992 | red   | 31
Chevy | 1992 | white | 54
Chevy | 1992 | blue  | 71
Ford  | 1990 | red   | 64
Ford  | 1990 | white | 62
Ford  | 1990 | blue  | 63
Ford  | 1991 | red   | 52
Ford  | 1991 | white | 9
Ford  | 1991 | blue  | 55
Ford  | 1992 | red   | 27
Ford  | 1992 | white | 62
Ford  | 1992 | blue  | 39

DATA CUBE (output of the CUBE operator)
Model | Year | Color | Sales
Chevy | 1990 | blue  | 62
Chevy | 1990 | red   | 5
Chevy | 1990 | white | 87
Chevy | 1990 | ALL   | 154
Chevy | 1991 | blue  | 49
Chevy | 1991 | red   | 54
Chevy | 1991 | white | 95
Chevy | 1991 | ALL   | 198
Chevy | 1992 | blue  | 71
Chevy | 1992 | red   | 31
Chevy | 1992 | white | 54
Chevy | 1992 | ALL   | 156
Chevy | ALL  | blue  | 182
Chevy | ALL  | red   | 90
Chevy | ALL  | white | 236
Chevy | ALL  | ALL   | 508
Ford  | 1990 | blue  | 63
Ford  | 1990 | red   | 64
Ford  | 1990 | white | 62
Ford  | 1990 | ALL   | 189
Ford  | 1991 | blue  | 55
Ford  | 1991 | red   | 52
Ford  | 1991 | white | 9
Ford  | 1991 | ALL   | 116
Ford  | 1992 | blue  | 39
Ford  | 1992 | red   | 27
Ford  | 1992 | white | 62
Ford  | 1992 | ALL   | 128
Ford  | ALL  | blue  | 157
Ford  | ALL  | red   | 143
Ford  | ALL  | white | 133
Ford  | ALL  | ALL   | 433
ALL   | 1990 | blue  | 125
ALL   | 1990 | red   | 69
ALL   | 1990 | white | 149
ALL   | 1990 | ALL   | 343
ALL   | 1991 | blue  | 104
ALL   | 1991 | red   | 106
ALL   | 1991 | white | 104
ALL   | 1991 | ALL   | 314
ALL   | 1992 | blue  | 110
ALL   | 1992 | red   | 58
ALL   | 1992 | white | 116
ALL   | 1992 | ALL   | 284
ALL   | ALL  | blue  | 339
ALL   | ALL  | red   | 233
ALL   | ALL  | white | 369
ALL   | ALL  | ALL   | 941
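As a rough illustration of what GROUP BY CUBE computes, the sketch below builds every group-by of the SALES rows in plain Python. Each input row contributes to 2^3 cells, because each of the three dimensions is either kept or replaced by ALL. The function name `cube` and the dictionary representation are assumptions made for this example; a DBMS would use far more efficient algorithms.

```python
from collections import defaultdict
from itertools import product

# (Model, Year, Color, Sales) rows of the SALES table on the slide.
SALES = [
    ("Chevy", 1990, "red", 5), ("Chevy", 1990, "white", 87), ("Chevy", 1990, "blue", 62),
    ("Chevy", 1991, "red", 54), ("Chevy", 1991, "white", 95), ("Chevy", 1991, "blue", 49),
    ("Chevy", 1992, "red", 31), ("Chevy", 1992, "white", 54), ("Chevy", 1992, "blue", 71),
    ("Ford", 1990, "red", 64), ("Ford", 1990, "white", 62), ("Ford", 1990, "blue", 63),
    ("Ford", 1991, "red", 52), ("Ford", 1991, "white", 9), ("Ford", 1991, "blue", 55),
    ("Ford", 1992, "red", 27), ("Ford", 1992, "white", 62), ("Ford", 1992, "blue", 39),
]


def cube(rows, n_dims):
    """Sum the measure for every combination of kept / rolled-up dimensions."""
    totals = defaultdict(int)
    for row in rows:
        dims, measure = row[:n_dims], row[n_dims]
        # A mask entry of True means "roll this dimension up to ALL".
        for mask in product((False, True), repeat=n_dims):
            key = tuple("ALL" if hide else value for hide, value in zip(mask, dims))
            totals[key] += measure
    return dict(totals)


data_cube = cube(SALES, 3)
print(len(data_cube))                      # number of cells in the cube
print(data_cube[("Chevy", 1990, "ALL")])   # Chevy sales in 1990, all colors
print(data_cube[("ALL", "ALL", "ALL")])    # grand total
```

With 2 models, 3 years, and 3 colors the cube has (2+1) x (3+1) x (3+1) = 48 cells, which matches the number of rows in the DATA CUBE table.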

DATA STREAMS
Data streams differ from conventional DBMS:
- Records arrive online.
- The system has no control over the arrival order.
- Data streams are potentially unbounded in size.
- Once a record from a data stream has been processed, it is discarded or archived. It cannot be retrieved easily because memory is small relative to the size of data streams.
Continuous queries
- Snapshot queries in conventional databases: evaluated once over a point-in-time snapshot of the data set.
- Continuous queries in data streams: evaluated continuously as data streams continue to arrive.
- Answers may be stored and updated as new data arrives, or may produce data streams themselves.

Motivating Examples
- Financial systems receiving stock values: sell the stock when the value drops below $10.
- Modern security applications: detect potential attacks on the network.
- Clickstream monitoring to enable applications such as personalization and load-balancing (e.g., Yahoo).
- Sensor monitoring: identify traffic congestion in road networks using sensors that monitor traffic.

Finite Streams
- Finite streams are bounded (i.e., at some point all tuples have arrived).
- Unlike conventional databases, processing takes place in main memory, without all the data available in advance.
- Conventional join algorithms require one input (block nested loop, index nested loop) or both inputs (sort-merge and hash join) in advance.
- Adapted versions of the algorithms for streams:
  - must produce the first results immediately after the arrival of the first tuples
  - must keep a constant output rate
  - must utilize the available main memory

Infinite Streams - Sliding Windows
- Infinite streams: data are NOT bounded (they arrive forever).
- Evaluate the query over a sliding window of recent data from the streams.
- Attractive properties:
  - Well-defined and understood
  - Emphasizes recent data, which in many real-world applications is more important than old data
[Figure: a timeline divided into past data, the window of recent data, and future data.]

Sliding Windows - Joins
- Two tuples can be joined only if they fall in the same sliding window (i.e., their time difference is within the window length w).
- General framework for joining streams A and B:
  - Tuples arrive in chronological order.
  - The system maintains the lists S_A and S_B of tuples that have arrived and not yet expired.
  - An incoming tuple t from input stream A first purges the tuples from S_B whose timestamp is earlier than t.ts - w. Then it probes S_B and joins with its tuples. Finally, t is inserted into S_A.
- Once a join result is generated, it must also be assigned a timestamp, since it may constitute an input for a subsequent operator.
- Output tuples must be generated in the order of their timestamps.
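The purge, probe, insert steps above can be sketched in a few lines. This is a minimal illustration, assuming events arrive as `(stream, ts, value)` triples already in chronological order and that a join result takes the timestamp of the later of its two inputs; the slides do not fix that rule, and the names `window_join` and `match` are hypothetical.

```python
from collections import deque


def window_join(events, w, match):
    """Join streams 'A' and 'B' over a sliding window of length w.

    events: iterable of (stream, ts, value), in chronological order.
    match:  predicate match(a_value, b_value) deciding whether two tuples join.
    """
    state = {"A": deque(), "B": deque()}  # S_A and S_B: (ts, value), oldest first
    results = []
    for stream, ts, value in events:
        other = "B" if stream == "A" else "A"
        # 1. Purge expired tuples from the opposite list.
        while state[other] and state[other][0][0] < ts - w:
            state[other].popleft()
        # 2. Probe the opposite list.
        for _other_ts, other_value in state[other]:
            a, b = (value, other_value) if stream == "A" else (other_value, value)
            if match(a, b):
                # Assumption: the result carries the later input's timestamp,
                # so results are emitted in timestamp order.
                results.append((ts, a, b))
        # 3. Insert the new tuple into its own list.
        state[stream].append((ts, value))
    return results


events = [("A", 1, "x"), ("B", 2, "x"), ("B", 5, "y"), ("A", 6, "y"), ("A", 9, "x")]
print(window_join(events, w=3, match=lambda a, b: a == b))
```

In this run the last tuple `("A", 9, "x")` produces no result: the matching B tuple from timestamp 2 has already expired, which is exactly the effect of the purge step.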

Data Streams - Other Issues
- Approximate queries: due to the limited amount of memory, it may not be possible to produce exact answers
  - Sketches
  - Random sampling
  - Histograms
  - Wavelets
- Query optimization
  - How to optimize continuous queries
  - How to migrate plans

RELATIONAL KEYWORD SEARCH
KEYWORD SEARCH (KWS)
- Very easy: no language to learn
- Ubiquitous: web search, millions of users, millions of queries
- Now applied to databases

Example of KWS

Schema: Director, Movie, Plays, Actor, joined on movie.did = director.did, plays.mid = movie.mid, and actor.aid = plays.aid.

Director
tuple | did | name              | other attributes
t1    | 100 | Quentin Tarantino | ...

Movie
tuple | mid | title        | did | other attributes
t2    | 2   | Pulp Fiction | 100 | ...

Actor
tuple | aid | name              | other attributes
t3    | 120 | John Travolta     | ...
t4    | 145 | Quentin Tarantino | ...

Plays
tuple | mid | aid | other attributes
t5    | 2   | 120 | ...
t6    | 5   | 120 | ...
t7    | 5   | 145 | ...

What is the query {Tarantino, Travolta} supposed to compute?
- t1 JOIN t2 JOIN t5 JOIN t3: there is a movie (Pulp Fiction), which was directed by Tarantino and features Travolta.
- t3 JOIN t6 JOIN t7 JOIN t4: there is a movie (mid=5) that includes both Tarantino and Travolta as actors.

Equivalent SQL Expressions

    SELECT *
    FROM Director D, Movie M, Plays P, Actor A
    WHERE D.name LIKE '%Tarantino%' AND A.name LIKE '%Travolta%'
      AND D.did = M.did AND P.mid = M.mid AND A.aid = P.aid

    SELECT *
    FROM Actor A1, Actor A2, Plays P1, Plays P2
    WHERE A1.name LIKE '%Tarantino%' AND A2.name LIKE '%Travolta%'
      AND A1.aid = P1.aid AND A2.aid = P2.aid AND P1.mid = P2.mid

- These are only the statements that actually output results. Many more SQL queries have to be issued in order to cover every possible interaction, e.g. a movie starring Tarantino that was directed by Travolta.
- R-KWS allows querying for terms in unknown locations (tables/attributes).
- A query can be issued without knowledge of the tables, their attributes, or the join conditions.

Database as a Graph
Every database can be modeled as a graph:
- Nodes represent tuples
- Edges connect joining tuples
[Figure: a schema with relations S, T, U, V and the corresponding data graph, whose tuple nodes are labeled with the keywords (k1, k2, k3) they contain.]

Graph-Based Query Processing
- Graph-based systems such as BANKS and DBSurfer maintain the data graph in main memory.
- Given a query, an inverted index identifies all tuples that contain at least one keyword.
- Each such tuple initiates a graph traversal.
- Whenever a node is reached by all keywords, a result is constructed by following the reverse paths to the keyword occurrences.
- Duplicates are filtered in a second, post-processing step.
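A much-simplified sketch of this traversal, run on the Tarantino/Travolta tuples t1 to t7, may help. It is not the BANKS or DBSurfer implementation: it does one breadth-first expansion per keyword, treats the graph as undirected and unweighted, and uses a substring scan in place of an inverted index. The edge between t6 and t7 (two Plays tuples joined on mid=5) is an assumption taken from the join sequence in the example, and all function names are hypothetical.

```python
from collections import deque

# Data graph for the running example: edges connect joining tuples.
EDGES = [("t1", "t2"), ("t2", "t5"), ("t5", "t3"), ("t3", "t6"), ("t6", "t7"), ("t7", "t4")]
TEXT = {
    "t1": "Quentin Tarantino", "t2": "Pulp Fiction", "t3": "John Travolta",
    "t4": "Quentin Tarantino", "t5": "", "t6": "", "t7": "",
}


def build_adjacency(edges):
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def expand(adj, sources):
    """Breadth-first expansion from all tuples containing one keyword.

    Returns parent pointers that lead from any reached node back to a source.
    """
    parent = {s: None for s in sources}
    queue = deque(sorted(sources))
    while queue:
        node = queue.popleft()
        for nxt in sorted(adj.get(node, ())):
            if nxt not in parent:
                parent[nxt] = node
                queue.append(nxt)
    return parent


def keyword_search(adj, text, keywords):
    parents = []
    for kw in keywords:
        # Stand-in for the inverted index: scan for tuples containing the keyword.
        hits = {t for t, s in text.items() if kw.lower() in s.lower()}
        parents.append(expand(adj, hits))
    results = set()
    for node in text:
        if all(node in p for p in parents):      # node reached by all keywords
            tree = set()
            for p in parents:                    # follow the reverse paths
                cur = node
                while cur is not None:
                    tree.add(cur)
                    cur = p[cur]
            results.add(frozenset(tree))         # the set removes duplicates
    return sorted((sorted(r) for r in results), key=lambda r: (len(r), r))


adjacency = build_adjacency(EDGES)
print(keyword_search(adjacency, TEXT, ["Tarantino", "Travolta"]))
```

Every node of this small graph is reached by both keywords, so seven trees are constructed, but after duplicate elimination only the two tuple sets named in the example remain.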

Operator-Based Query Processing
Systems such as Discover, DBXplorer and Mragyati translate an R-KWS query into a series of SQL statements, which are executed directly on secondary storage, using the underlying DBMS.

    SELECT *
    FROM Director D, Movie M, Plays P, Actor A
    WHERE D.name LIKE '%Tarantino%' AND A.name LIKE '%Travolta%'
      AND D.did = M.did AND P.mid = M.mid AND A.aid = P.aid

Join plan: Director (name contains tarantino) joins Movie on Director.did = Movie.did, then Plays on Movie.mid = Plays.mid, then Actor (name contains travolta) on Plays.aid = Actor.aid.

    SELECT *
    FROM Actor A1, Actor A2, Plays P1, Plays P2
    WHERE A1.name LIKE '%Tarantino%' AND A2.name LIKE '%Travolta%'
      AND A1.aid = P1.aid AND A2.aid = P2.aid AND P1.mid = P2.mid

Join plan: Actor1 (name contains tarantino) joins Plays1 on Actor1.aid = Plays1.aid, Actor2 (name contains travolta) joins Plays2 on Actor2.aid = Plays2.aid, and the two sides join on Plays1.mid = Plays2.mid.

Database Keyword Search - Other Topics
- Ranking: how to retrieve the top-k most interesting results
- Keyword search in multiple databases: how to select the top-k databases with the most promising results
- Continuous keyword search in streams
- Spatial keyword search

SPATIAL AND SPATIOTEMPORAL DATABASES
- Spatial database systems manage large collections of static multidimensional objects with explicit knowledge about their extent and position in space (as opposed to image databases).
- A spatial object contains (at least) one spatial attribute that describes its geometry and location.
- A spatial relation is an organized collection of spatial objects of the same entity (e.g. rivers, cities, road segments).

A spatial relation: road segments from an area in CA
ID | Name   | Type    | Polyline
1  | Sunset | avenue  | (10023,1094), (9034,1567), (9020,1610)
2  | H5     | highway | (4240,5910), (4129,6012), (3813,6129), (3602,6129)

Common Spatial Queries
- Range query (spatial selection, window query, zoom-in)
  e.g. find all cities that intersect window W. Answer set: {c1, c2}
- Nearest neighbor query
  e.g. find the city closest to the F-spot. Answer: c2
- Spatial join
  e.g. find all pairs of cities and rivers that intersect. Answer set: {(r1,c1), (r2,c2), (r2,c5)}

Two-step spatial query processing
A spatial object is usually approximated by its minimum bounding rectangle (MBR). The spatial query is then processed in two steps:
1. Filter step: the MBR is tested against the query predicate.
2. Refinement step: the exact geometry of objects that pass the filter step is tested for qualification.
Examples (figure): a filtered pair; a non-qualifying pair that passes the filter step (false hit); a qualifying pair.
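The two steps can be illustrated with a window query over circular objects. This is a sketch under assumptions: circles are chosen only because their exact intersection test is short, the coordinates are made up, and a linear scan replaces the spatial index that would normally drive the filter step.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rect:
    xmin: float
    ymin: float
    xmax: float
    ymax: float

    def intersects(self, other):
        return (self.xmin <= other.xmax and other.xmin <= self.xmax
                and self.ymin <= other.ymax and other.ymin <= self.ymax)


@dataclass(frozen=True)
class Circle:
    name: str
    cx: float
    cy: float
    r: float

    def mbr(self):
        """Minimum bounding rectangle of the circle."""
        return Rect(self.cx - self.r, self.cy - self.r, self.cx + self.r, self.cy + self.r)

    def intersects_rect(self, rect):
        """Exact test: distance from the center to the closest point of rect."""
        nx = min(max(self.cx, rect.xmin), rect.xmax)
        ny = min(max(self.cy, rect.ymin), rect.ymax)
        return (self.cx - nx) ** 2 + (self.cy - ny) ** 2 <= self.r ** 2


def range_query(objects, window):
    answers, false_hits = [], []
    for obj in objects:
        if not obj.mbr().intersects(window):   # filter step (cheap)
            continue
        if obj.intersects_rect(window):        # refinement step (exact geometry)
            answers.append(obj.name)
        else:
            false_hits.append(obj.name)
    return answers, false_hits


window = Rect(0, 0, 4, 4)
objects = [
    Circle("c1", 2, 2, 1),      # inside the window
    Circle("c2", 5, 5, 1.2),    # MBR overlaps the window corner, circle does not
    Circle("c3", 8, 8, 1),      # far away, removed by the filter step
    Circle("c4", 5, 2, 1.5),    # overlaps the right edge of the window
]
print(range_query(objects, window))
```

The filter step can only produce false hits, never false dismissals, because an object that intersects the window must also have an MBR that intersects it.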

Example R-tree - Range (Window) Query
- Node capacity depends on the page size; in practice it is in the order of 100.
[Figure: data points a to m in a 10 x 10 space, grouped into leaf nodes E3 to E7, which are in turn grouped under the root entries E1 and E2. Each entry stores the minimum bounding rectangle (MBR) of its node, and a window query is drawn over the data space.]

Spatial Joins
- A spatial join returns intersecting pairs of objects (from two data sets).
- The RJ join algorithm traverses both R-trees simultaneously, visiting only those branches that can lead to qualifying pairs.
[Figure: R-tree R_A with nodes A1, A2 over objects a1 to a5, and R-tree R_B with nodes B1, B2 over objects b1 to b4.]

Nearest Neighbor (NN) search with R-trees
- Depth-first traversal
[Figure: data points a to i and R-tree nodes E1 to E9 in a 10 x 10 space, with a query point and its search region. In the tree view, entries are annotated with distance values relative to the query point. Note: some nodes have been omitted for simplicity.]

Reverse NN Queries
- Monochromatic: given a multi-dimensional dataset P and a point q, find all the points p ∈ P that have q as their nearest neighbor.
- Bichromatic: given a set Q of queries and a query point q, find the objects p ∈ P that are closer to q than to any other point of Q.
[Figure: points p1 to p4 and query q, with RNN(q) = {p1, p2} and NN(q) = p3.]
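A brute-force sketch of the monochromatic case shows why RNN(q) and NN(q) can differ. The coordinates below are invented to reproduce the situation in the figure (they are not taken from it), and the quadratic scan is for illustration only; practical methods rely on spatial indexes.

```python
from math import dist

# Hypothetical coordinates chosen so that the outcome mirrors the figure.
POINTS = {"p1": (0.0, 3.0), "p2": (0.0, -3.0), "p3": (1.0, 0.0), "p4": (1.5, 0.0)}


def nearest_neighbor(q, points):
    """Name of the data point closest to q."""
    return min(points, key=lambda name: dist(points[name], q))


def reverse_nearest_neighbors(q, points):
    """Names of the data points whose nearest neighbor (among q and P) is q."""
    result = []
    for name, p in points.items():
        d_q = dist(p, q)
        if all(d_q <= dist(p, o) for other, o in points.items() if other != name):
            result.append(name)
    return sorted(result)


q = (0.0, 0.0)
print(nearest_neighbor(q, POINTS))
print(reverse_nearest_neighbors(q, POINTS))
```

Here p3 is the nearest neighbor of q, yet it is not a reverse nearest neighbor, because p4 lies closer to p3 than q does. The relation is not symmetric.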

Spatial and Spatiotemporal DB - Other Issues
- Road networks
- Continuous monitoring of spatial queries
- Predictive indexing and query processing
- Indexing historical location data
- Spatiotemporal aggregation
- Alternative types of spatial queries
- Location privacy

TIME SERIES DATABASES
- A time series or data sequence R consists of a stream of numbers ordered by time: R = R[0], R[1], ..., where R[0] corresponds to the value at timestamp 0, R[1] to the value at timestamp 1, and so on.
- Time series are ubiquitous in several applications: stock market, image similarity, sensor networks, etc.
- Queries: similarity search (find all stocks whose values in the last year are similar to those of a given stock).

Similarity Definition
- Difficult to define: depends on the application domain and the user.
- A simple definition is based on Euclidean distances.
- Does not account for translation, rotation, etc.
[Figure: two time series R1 and R2 plotted over timestamps 0 to 19.]

Whole Sequence Matching
Given a set of stored time series with the same length d, a query sequence Q with length d, and a similarity threshold ε, a whole matching query returns the series that ε-match with Q.
3-step processing framework:
- Index building: apply a dimensionality reduction technique to convert the d-dimensional sequences to points in an f-dimensional space. The resulting f-dimensional points are indexed by an R-tree.
- Index searching: transform the query sequence Q to an f-dimensional point q. A range query centered at q with radius ε is performed on the R-tree to retrieve candidate results.
- Post-processing: the candidates are checked against Q in the original space to get the actual result.

Whole Sequence Matching - Assumptions
- All database sequences and the query sequence should have the same length.
- The dimensionality reduction technique should be distance-preserving (lower-bounding): i.e., the distance in the low-dimensional space should be smaller than or equal to the distance in the original high-dimensional space, so that the index search causes no false dismissals.
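The three-step framework and the lower-bounding requirement can be sketched as follows. Two substitutions are made for brevity and should be read as assumptions: the reduction is Piecewise Aggregate Approximation (segment means scaled by the square root of the segment length), which is not one of the techniques listed in these slides but does satisfy the lower-bounding property, and a linear scan over the reduced points replaces the R-tree. The sample series are invented.

```python
from math import sqrt


def euclidean(a, b):
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def paa(seq, f):
    """Reduce a sequence to f segment means (assumes len(seq) is divisible by f)."""
    seg = len(seq) // f
    return [sum(seq[i * seg:(i + 1) * seg]) / seg for i in range(f)]


def lower_bound(pa, pb, seg):
    """Distance in the reduced space; never exceeds the true Euclidean distance."""
    return sqrt(seg) * euclidean(pa, pb)


def whole_match(series, query, eps, f):
    seg = len(query) // f
    # Step 1, index building: reduce every stored series (a dict stands in for the R-tree).
    index = {name: paa(s, f) for name, s in series.items()}
    # Step 2, index searching: range query of radius eps around the reduced query.
    q = paa(query, f)
    candidates = [n for n, p in index.items() if lower_bound(p, q, seg) <= eps]
    # Step 3, post-processing: verify candidates in the original space.
    answers = [n for n in candidates if euclidean(series[n], query) <= eps]
    return candidates, answers


SERIES = {
    "s1": [1, 2, 3, 4, 5, 6, 7, 9],   # almost identical to the query
    "s2": [4, 3, 2, 1, 8, 7, 6, 5],   # same segment means, different shape
    "s3": [8, 7, 6, 5, 4, 3, 2, 1],   # very different
}
QUERY = [1, 2, 3, 4, 5, 6, 7, 8]
print(whole_match(SERIES, QUERY, eps=2, f=2))
```

Series s2 passes the index search because its segment means equal those of the query, but it is rejected in post-processing. The lower-bounding property guarantees the reverse error cannot happen: a true match is never pruned in step 2.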

Sub-Sequence Matching
Given a data sequence R = R[0], ..., R[m-1], a query sequence Q = Q[0], ..., Q[d-1] (m ≥ d), and a similarity threshold ε, a sub-sequence matching query retrieves all the subsequences R' = R[i : i+d-1] (0 ≤ i ≤ m-d) such that dist(Q, R') ≤ ε.

Index Building for Sub-Sequence Matching
[Figure: a sliding window of length 4 over a sequence R of 20 values (timestamps 0 to 19) yields the subsequences R[0:3], R[1:4], ..., R[16:19]; each subsequence is mapped to a point in the reduced (x1, x2) space.]

Query processing - Query length w (=4)
[Figure: the query Q mapped to a point in the (x1, x2) space, shown close to the points of R[3:6] and R[4:7].]

Time Series - Other Issues
- Distance definitions
  - Dynamic Time Warping
  - Application-dependent definitions
- Dimensionality reduction techniques
  - Discrete Fourier Transform
  - Wavelets
  - Linear Segments
- Alternative problems
  - Outlier detection
  - Streaming time series

SKYLINE AND TOP-K QUERIES Which buildings can we see? Higher or nearer

Skyline Example
Find a cheap hotel that is close to the beach.
- A dominates B if A(dist) ≤ B(dist) and A(price) ≤ B(price).
- The skyline is the set of objects not dominated by any other object.
[Figure: hotels plotted by distance and price; hotel A dominates hotel B.]
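The dominance test and the skyline definition translate directly into code. The sketch below is a brute-force comparison of all pairs, not the NN algorithm discussed in these slides. The hotel data are invented, and the extra requirement that a dominating point be strictly better in at least one attribute is the usual convention, assumed here so that identical points do not eliminate each other.

```python
# (distance to beach, price) per hotel; hypothetical values, smaller is better.
HOTELS = {"h1": (1, 90), "h2": (2, 60), "h3": (3, 70), "h4": (4, 30), "h5": (5, 50)}


def dominates(a, b):
    """a dominates b: no worse in every attribute, strictly better in at least one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def skyline(points):
    """Names of the points that no other point dominates (quadratic scan)."""
    return sorted(
        name
        for name, p in points.items()
        if not any(dominates(o, p) for other, o in points.items() if other != name)
    )


print(skyline(HOTELS))
```

Hotel h3 is dropped because h2 is both closer and cheaper, and h5 is dropped because of h4. The remaining hotels each represent a different trade-off between distance and price.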

NN algorithm
NN uses the results of nearest neighbor search to partition the data universe recursively.
[Figure: data points a to o in a 10 x 10 space, shown before and after the first nearest neighbor search partitions the space.]

NN algorithm (cont.)
[Figure: the same data points a to o after further recursive partitioning. Note: another query is necessary to confirm that this partition is empty.]

Top-k queries
Top-k query: given a scoring function f, report the k tuples in a dataset with the highest scores.

fund id   | 1   | 2   | 3   | 4   | 5   | 6   | 7   | 8   | 9   | 10  | 11  | 12
growth    | 0.2 | 0.1 | 0.3 | 0.2 | 0.3 | 0.5 | 0.4 | 0.6 | 0.7 | 0.6 | 0.7 | 0.7
stability | 0.2 | 0.5 | 0.3 | 0.9 | 0.8 | 0.7 | 0.3 | 0.1 | 0.2 | 0.5 | 0.6 | 0.5

- Preference function: f(t) = w1 * t.growth + w2 * t.stability, where w1 and w2 are specified by the user to indicate his or her priorities on the two attributes.
- If w1=0.1, w2=0.9 (stability is favored), the top 3 funds have ids 4, 5, 6, since their scores (0.83, 0.75, 0.68, respectively) are the highest.
- If w1=0.5, w2=0.5 (both attributes are equally important), the ids of the best 3 funds become 11, 6, 12.
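The two weight settings can be checked with a small script over the fund table. This is a plain sort over all tuples, so it illustrates the definition of a top-k query rather than any of the pre-processing or on-line techniques. The function name `top_k` and the tie-breaking rule (smaller fund id first) are assumptions.

```python
# fund id -> (growth, stability), copied from the table above.
FUNDS = {
    1: (0.2, 0.2), 2: (0.1, 0.5), 3: (0.3, 0.3), 4: (0.2, 0.9),
    5: (0.3, 0.8), 6: (0.5, 0.7), 7: (0.4, 0.3), 8: (0.6, 0.1),
    9: (0.7, 0.2), 10: (0.6, 0.5), 11: (0.7, 0.6), 12: (0.7, 0.5),
}


def score(fund, w1, w2):
    growth, stability = fund
    return w1 * growth + w2 * stability


def top_k(funds, w1, w2, k):
    """Ids of the k highest-scoring funds; ties are broken by smaller id."""
    ranked = sorted(funds, key=lambda fid: (-score(funds[fid], w1, w2), fid))
    return ranked[:k]


print(top_k(FUNDS, 0.1, 0.9, 3))   # stability favored
print(top_k(FUNDS, 0.5, 0.5, 3))   # both attributes equally important
```

With equal weights, funds 6 and 12 receive the same score, so their relative order depends on the tie-breaking rule.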

Top-k Query Processing
Query processing techniques:
- Based on pre-processing (i.e., generation of views in advance)
- On-line (no preprocessing)
[Figure: tuples t1 to t12 in the (A1, A2) space, grouped into index nodes N1 to N6; the line t.A1 + t.A2 = 1.3 is an identical-score curve that bounds the search region.]

DATABASE OUTSOURCING AND QUERY AUTHENTICATION
Setting: the data owner sends the initial data and data updates to the service provider; clients send queries to the service provider and receive the query results.
Advantages
- The data owner does not need the hardware / software / personnel to run a DBMS.
- The service provider achieves economies of scale.
- The client enjoys better quality of service.
A main challenge
- The service provider is not trusted, and may return incorrect query results (e.g., to favor the competition).

Framework
Setting: the data owner sends the initial data with signatures, and later data updates with signature updates, to the service provider; clients send queries to the service provider and receive the query results together with a VO.
- The owner signs its data with a digital signature scheme.
- Given a query, the service provider attaches a VO (Verification Object) to the results.
- The client verifies the query results with the VO and the owner's signature to ensure:
  - soundness
  - completeness

PRIVACY
Problem: how to publish data (e.g., for statistical purposes) without disclosing the identity of the records.
- k-anonymity
- Differential privacy
- Other anonymity concepts
- Anonymity in streams, etc.

MicroData
ID | QI1 | QI2 | SA
A  | 1   | 1   | v1
B  | 1   | 4   | v2
C  | 2   | 2   | v3
D  | 2   | 3   | v4
E  | 3   | 1   | v5
F  | 3   | 2   | v6
G  | 5   | 4   | v7

Generalized
QI1 | QI2 | SA
1-2 | 1-4 | v1
1-2 | 1-4 | v2
1-2 | 1-4 | v3
1-2 | 1-4 | v4
3-5 | 1-4 | v5
3-5 | 1-4 | v6
3-5 | 1-4 | v7

QI attributes: quasi-identifiers (e.g., age, address)
SA: sensitive attribute (e.g., disease, salary)
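As an illustration of k-anonymity on this example, the sketch below generalizes the two quasi-identifiers into the ranges shown in the Generalized table and reports the size of the smallest group of records that share the same quasi-identifier values. The function names and the range-based generalization interface are assumptions made for the example; choosing good ranges is the hard part and is not addressed here.

```python
from collections import Counter

# (ID, QI1, QI2, SA) rows of the MicroData table.
MICRODATA = [
    ("A", 1, 1, "v1"), ("B", 1, 4, "v2"), ("C", 2, 2, "v3"), ("D", 2, 3, "v4"),
    ("E", 3, 1, "v5"), ("F", 3, 2, "v6"), ("G", 5, 4, "v7"),
]


def generalize(value, ranges):
    """Replace a value by the label of the range that contains it."""
    for lo, hi in ranges:
        if lo <= value <= hi:
            return f"{lo}-{hi}"
    raise ValueError(f"no range covers {value}")


def anonymize(rows, qi1_ranges, qi2_ranges):
    """Drop the ID and generalize both quasi-identifiers."""
    return [
        (generalize(q1, qi1_ranges), generalize(q2, qi2_ranges), sa)
        for _id, q1, q2, sa in rows
    ]


def anonymity_level(rows):
    """Largest k for which the table is k-anonymous: the smallest QI group size."""
    groups = Counter((q1, q2) for q1, q2, _sa in rows)
    return min(groups.values())


generalized = anonymize(MICRODATA, qi1_ranges=[(1, 2), (3, 5)], qi2_ranges=[(1, 4)])
raw = [(q1, q2, sa) for _id, q1, q2, sa in MICRODATA]
print(anonymity_level(raw))          # every raw QI combination is unique
print(anonymity_level(generalized))  # groups of sizes 4 and 3
```

The generalized table groups records A to D and E to G, so each record is indistinguishable from at least two others on its quasi-identifiers.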

DATA MINING AND KNOWLEDGE DISCOVERY
- Association: discovering market basket rules
  - A significant number of customers who spent over $1000 on sports equipment also spent over $500 on designer clothes.
  - 98% of people who purchase diapers also buy beer.
- Classification: assigning instances to pre-defined classes
  - A bank classifies a client as a safe borrower based on his/her characteristics (e.g., college degree, marital status, age, etc.).
  - An insurance company determines the risk of individuals (and the fee) based on the class of the individual.
- Clustering: segmenting multidimensional data into groups based on similarity
  - Discovering sub-populations from marketing data, e.g. clusters of customers.

Other Mining Tasks
- Sequence analysis: extracting patterns over time
  - High-end real estate sales have tracked technology stocks for the past 5 years.
- Deviations: detection of outliers or anomalies
  - Strange behavior in credit card use.
- Text/multimedia mining: extraction of structured information from unstructured or semi-structured data
  - Find FAQs from customer support case documents.

OTHER TOPICS
- Peer-to-Peer and Distributed DB
  - The DB is distributed over several servers
  - How to efficiently store and find the data
  - Minimization of the communication overhead
- Sensor Data Management
  - Related to stream management
  - Minimization of energy consumption
  - In-network aggregation, approximation techniques

OTHER TOPICS
- Information Integration
  - How to combine information from several DBs
  - Schema matching, data cleaning
- Graph Databases
  - How to solve graph problems in huge graphs
  - Clique detection, triangle enumeration, reachability queries, etc.
- Social Networks
  - Data management issues and specialized query processing tasks (influence, clustering based on social connectivity)

OTHER TOPICS
- NoSQL Systems
  - Data maintained by means other than tables (key-value stores, column stores, document stores, etc.)
- Cloud Databases
  - Database as a service
  - Map-reduce, Hadoop
- Crowdsourcing
  - Use numerous people to perform difficult tasks, such as entity resolution, image matching, and clustering

OTHER TOPICS
Many more topics: databases touch most aspects of Computer Science.
- Query processing with alternative hardware (e.g., graphics processors)
- Storage in alternative hardware (e.g., flash storage)
- Algorithms and approximation techniques (e.g., sketches) for large volumes of data
- Transaction processing