Query Selectivity Estimation for Uncertain Data

Similar documents
Chapter 3 RANDOM VARIATE GENERATION

BNG 202 Biomechanics Lab. Descriptive statistics and probability distributions I

Similarity Search and Mining in Uncertain Spatial and Spatio Temporal Databases. Andreas Züfle

3. Data Analysis, Statistics, and Probability

Jitter Measurements in Serial Data Signals

Independence Diagrams: A Technique for Visual Data Mining

Chapter 4 - Lecture 1 Probability Density Functions and Cumul. Distribution Functions

Generating Random Numbers Variance Reduction Quasi-Monte Carlo. Simulation Methods. Leonid Kogan. MIT, Sloan , Fall 2010

Efficient Data Streams Processing in the Real Time Data Warehouse

Data Preprocessing. Week 2

Fuzzy Duplicate Detection on XML Data

Probability and Random Variables. Generation of random variables (r.v.)

CHAPTER 6: Continuous Uniform Distribution: 6.1. Definition: The density function of the continuous random variable X on the interval [A, B] is.

BASIC STATISTICAL METHODS FOR GENOMIC DATA ANALYSIS

Probabilistic XML: A Data Model for the Web

Lecture 7: Continuous Random Variables

Density Curve. A density curve is the graph of a continuous probability distribution. It must satisfy the following properties:

UNIT I: RANDOM VARIABLES PART- A -TWO MARKS

Representing Uncertainty by Probability and Possibility What s the Difference?

Lecture 8. Confidence intervals and the central limit theorem

DATABASE DESIGN - 1DL400

Analyses: Statistical Measures

Clustering. Danilo Croce Web Mining & Retrieval a.a. 2015/201 16/03/2016

Clustering & Visualization

Introduction to Mobile Robotics Bayes Filter Particle Filter and Monte Carlo Localization

Solving Quadratic & Higher Degree Inequalities

Creating Probabilistic Databases from Duplicated Data

Data, Measurements, Features

A Software Tool for. Automatically Veried Operations on. Intervals and Probability Distributions. Daniel Berleant and Hang Cheng

1 The Line vs Point Test

Package SHELF. February 5, 2016

An Algorithm to Evaluate Iceberg Queries for Improving The Query Performance

Veracity in Big Data Reliability of Routes

Ensemble Methods. Knowledge Discovery and Data Mining 2 (VU) ( ) Roman Kern. KTI, TU Graz

Least-Squares Intersection of Lines

Supplement to Call Centers with Delay Information: Models and Insights

STT315 Chapter 4 Random Variables & Probability Distributions KM. Chapter 4.5, 6, 8 Probability Distributions for Continuous Random Variables

Testing Random- Number Generators

Normality Testing in Excel

Modelling and Big Data. Leslie Smith ITNPBD4, October Updated 9 October 2015

Fast Contextual Preference Scoring of Database Tuples

CS 2750 Machine Learning. Lecture 1. Machine Learning. CS 2750 Machine Learning.

Notes on Continuous Random Variables

Pearson's Correlation Tests

IMPROVING DATA INTEGRATION FOR DATA WAREHOUSE: A DATA MINING APPROACH

HYPOTHESIS TESTING: POWER OF THE TEST

5. Continuous Random Variables

User Guide Thank you for purchasing the DX90

South Carolina College- and Career-Ready (SCCCR) Probability and Statistics

TEST 2 STUDY GUIDE. 1. Consider the data shown below.

More details on the inputs, functionality, and output can be found below.

Credibility and Pooling Applications to Group Life and Group Disability Insurance

Addendum A for TM-21-11: Projecting Long Term Lumen Maintenance of LED Packages

Universal hashing. In other words, the probability of a collision for two different keys x and y given a hash function randomly chosen from H is 1/m.

Mining various patterns in sequential data in an SQL-like manner *

This unit will lay the groundwork for later units where the students will extend this knowledge to quadratic and exponential functions.

Shroudbase Technical Overview

Example: Credit card default, we may be more interested in predicting the probabilty of a default than classifying individuals as default or not.

Classification of Fingerprints. Sarat C. Dass Department of Statistics & Probability

Dongfeng Li. Autumn 2010

Scoring the Data Using Association Rules

Big Data & Scripting Part II Streaming Algorithms

Differential privacy in health care analytics and medical research An interactive tutorial

Curriculum Map Statistics and Probability Honors (348) Saugus High School Saugus Public Schools

Calculator Notes for the TI-Nspire and TI-Nspire CAS

Math 151. Rumbos Spring Solutions to Assignment #22

Quantitative Inventory Uncertainty

Method To Solve Linear, Polynomial, or Absolute Value Inequalities:

AP Physics 1 and 2 Lab Investigations

WEEK #22: PDFs and CDFs, Measures of Center and Spread

A Generic Auto-Provisioning Framework for Cloud Databases

VISUALIZATION OF DENSITY FUNCTIONS WITH GEOGEBRA

Point Biserial Correlation Tests

Big Data Provenance: Challenges and Implications for Benchmarking

Basic Data Analysis. Stephen Turnbull Business Administration and Public Policy Lecture 12: June 22, Abstract. Review session.

Flood Risk Analysis considering 2 types of uncertainty

Introduction to Logistic Regression

Discrete Mathematics and Probability Theory Fall 2009 Satish Rao, David Tse Note 18. A Brief Introduction to Continuous Probability

2 Binomial, Poisson, Normal Distribution

Big Data & Scripting Part II Streaming Algorithms

ECON 459 Game Theory. Lecture Notes Auctions. Luca Anderlini Spring 2015

Transcription:

Query Selectivity Estimation for Uncertain Data Sarvjeet Singh *, Chris Mayfield *, Rahul Shah #, Sunil Prabhakar * and Susanne Hambrusch * * Department of Computer Science, Purdue University # Department of Computer Science, Louisiana State University Presented by: Reynold Cheng Department of Computer Science, University of Hong Kong ckcheng@cs.hku.hk

Motivation: Data Uncertainty Many applications need to handle uncertain data Example: Sensor networks, Location-based applications, Data integration, Biological databases Existing databases do not provide direct support for uncertain data. Two simple options: manage uncertain data outside DBMS, or remove uncertainty from data However, there is need for managing uncertain data at the database level 2

Motivation (cont.) DBMS have been proposed to handle uncertain data Examples: Orion, Trio, MystiQ, MayBMS Probabilistic queries: Queries over uncertain data return answers with probabilities Results with low probability of occurrence are often not desirable or meaningful Probabilistic Threshold Queries: Return only those answers that exceed a specified threshold 3

Motivation (cont.) Query optimization is important An essential ingredient is the ability to estimate cost of a given query plan New indexes have been proposed for uncertain data Their effective use needs a reasonable estimate of query selectivity Optimizer needs to know when to use the indexes Our Contribution Efficient algorithms for selectivity estimation of probabilistic threshold queries over uncertain data 4

Related Work Selectivity estimation for traditional relational databases is well studied [SIGMOD96] Models for uncertain data Attribute Uncertainty [SIGMOD03, ICDE08] Tuple Uncertainty [VLDB04b, VLDB06] Uncertainty management systems Orion [Orion], Trio [CIDR05], Mystiq [SIGMOD05], MayBMS [ICDE07], [ICDE07b] 5

Related Work Efficient evaluation of probabilistic queries Prob. range queries [VLDB04a,VLDB04b] Prob. threshold indexing [VLDB04a] Prob. NN queries [SIGMOD03, ICDE07c] Selectivity estimation for probabilistic threshold queries has not been addressed before 6

Outline Motivation Uncertainty Model Selectivity estimation using Histogram Unbounded Range Queries General Range Queries Selectivity estimation using Slabs Experiments Conclusion 7

Uncertainty Model f a (x) uncertainty PDF a [l a Uncertainty Interval r a ] 1 2 4 0.5 0.2 0.3 Continuous Discrete Attribute Uncertainty: An uncertain attribute a consists of an Uncertainty Interval [l a, r a ] and a pdf f a (x) (cdf F a (x)) over the interval Our techniques are also applicable to Tuple Uncertainty 8

Outline Motivation Uncertainty Model Selectivity estimation using Histogram Unbounded Range Queries General Range Queries Selectivity estimation using Slabs Experiments Conclusion 9

Selectivity of Unbounded Range Queries a < t x o : Pr(a < x o ) > t f a (x) dx > t F a (x o ) > t Probability p=1 t F a (x) F b (x) F c (x) x o Attribute value 10

General range queries General range query Pr (x 1 < a < x 2 ) > t F a (x 2 ) F a (x 1 ) > t Instead of a 2D cdf curve, we can now plot a 3D curve for each uncertain data item: G a (x 1,x 2 ) = f a (x) dx = F a (x 2 ) F a (x 1 ) The algorithm is similar to the unbounded case Optimizations reducing construction time are possible (see paper) 11

Outline Motivation Uncertainty Model Selectivity estimation using Histogram Unbounded Range Queries General Range Queries Selectivity estimation using Slabs Experiments Conclusion 12

General Range Queries Histogram approach for general range queries Provides very good selectivity estimate Initial construction time is quadratic in terms of range of input data General Range Queries using Slabs Provides a better space-time complexity than histogram technique Has a lower accuracy (in general) 13

General Range queries using Slabs Slabs of size 8δ Hierarchy level Slabs of size 4δ Slabs of size 2δ Slabs of size δ Attribute value A slab S(x 1,x 2,t) stores the selectivity of query Q(x 1,x 2,t) We define a hierarchy of slabs, with the size of slabs increasing exponentially Space and construction time complexity of this approach is linear in terms of range of input data 14

Selectivity estimation using Slabs Hierarchy level Q Attribute value Given a query Q(x 1,x 2,t), we find pairs of slabs that contains (over-estimate) and is contained (under-estimate) by the query We linearly interpolate the two estimates to get the final estimate 15

Outline Motivation Uncertainty Model Selectivity estimation using Histogram Unbounded Range Queries General Range Queries Selectivity estimation using Slabs Experiments Conclusion 16

Experiments We implemented our selectivity estimation techniques in Orion (probabilistic extension of PostgreSQL) Synthetic Datasets: Each dataset of random sensor readings with uniform distribution [CIKM06, VLDB04a] The intervals are distributed uniformly in [0,1000] Interval sizes are distributed normally Database size is 250,000 17

Effect on Query Plan PostgreSQL Query plan Without any selectivity estimate function, PostgreSQL assumes a default (fixed) selectivity. In practice, it favors the use of un-clustered indexes With our algorithms in place, PostgreSQL correctly picks the query plan with lower I/O cost 18

Accuracy at Varying Selectivities Selectivities (2D) Selectivities (3D) 19

Accuracy at Varying Precisions Precision (2D) Precision (3D) 20

Conclusion and Future work Developed efficient algorithms for selectivity estimation of probabilistic threshold queries The algorithms were implemented in a real database system Experiments show that the algorithms are efficient and provide good estimates for query selectivities The algorithms can be further improved by combining them with standard cost estimation techniques such as equi-depth binning and sampling 21

References [SIGMOD96] Poosala et al. Improved histograms for selectivity estimation of range predicates. In SIGMOD 1996. [SIGMOD03] R. Cheng et al. Evaluating probabilistic queries over imprecise data. In SIGMOD 2003. [ICDE08] Singh et al. Database Support for Probabilistic Attributes and Tuples. In ICDE 2008. [Orion] http://orion.cs.purdue.edu, 2006. [CIDR05] J. Widom. Trio: A system for integrated management of data, accuracy and lineage. In CIDR, 2005. [SIGMOD05] J. Boulos et al. MYSTIQ: A system for finding more answers by using probabilities, IN SIGMOD 2005 [ICDE07] Antova et al. 10^10^6 Worlds and beyond: Efficient representation and processing of incomplete information. In ICDE 2007 [ICDE07b] Sen et al. Representing correlated tuples in Probabilistic databases, ICDE 2007 [VLDB04a] R. Cheng et al. Efficient indexing methods for probabilistic threshold queries over uncertain data. In VLDB 2004 [VLDB04b] N. Dalvi et al. Efficient Query Evaluation on Probabilistic Databases. In VLDB 2004. [ICDE07c] Ljosa et al. APLA: Indexing arbitrary probability distributions. In ICDE 2007. [VLDB06] O. Benjelloun et al. ULDBs: Databases with uncertainty and lineage, In VLDB 2006. [CIKM06] Reynold et al. Effiecient Join processing over uncertain data. In CIKM 2006. Sarvjeet Singh Database support for Uncertain Data 22

Thank you Questions? Sarvjeet Singh sarvjeet@purdue.edu 23