Data Quality Aware Query System



by Naiem Khodabandehloo Yeganeh

in fulfilment of the Degree of Doctor of Philosophy
School of Information Technology and Electrical Engineering

April 2012

Examiner's Copy


Acknowledgements

I would like to thank these people, but...


List of Publications

[insert author list], [insert paper title]. (Submitted to [insert journal name].)

[insert author list], [insert paper title]. [insert journal name] [insert volume number], [insert article or page number] ([insert year]).


Abstract

The issue of data quality is gaining importance as individuals as well as corporations increasingly rely on multiple, often external, sources of data to make decisions. Traditional query systems do not factor data quality considerations into their responses. Studies into the diverse interpretations of data quality indicate that fitness for use is a fundamental criterion in the evaluation of data quality. In this report we address the issue of quality aware query systems by developing a query answering methodology that considers user data quality preferences over a proposed collaborative systems architecture. We present three major issues relating to quality aware queries, namely modelling of user data quality preferences, measuring source quality, and processing the query considering those preferences and measures. We then address each of these issues by introducing Quality Aware SQL, Quality Aware Metadata and Quality Aware Source Ranking methods respectively. The proposed contributions are evaluated using a .NET tool called Quality Aware Query Studio, which was developed to support our contributions and simulate the proposed multi-source architecture. Finally, a roadmap and timeline of future work is presented based on open challenges and remaining issues for the three mentioned challenges.

Contents

Acknowledgements
List of Publications
Abstract
List of Figures
List of Tables
1 Introduction
2 Framework for DQ Aware Query Systems
3 Related Work
  3.0.1 Data Quality Framework
  3.0.2 Profiling Data Quality
  3.0.3 User Preferences on Data Quality
  3.0.4 Query Planning and Data Integration
4 Conditional Data Quality Profiling
  4.1 DQ Profile Table Approach
    4.1.1 Conditional DQ Profile Generation
    4.1.2 Complexity of the Proposed Algorithms
    4.1.3 Querying Conditional DQ Profile to Estimate Query Result
    4.1.4 Evaluation
  4.2 DQ Sample Trie Approach
    4.2.1 Sample Trie
    4.2.2 Sample Trie Algorithm
    4.2.3 Evaluation and Analysis
  4.3 Summary
5 Quality Aware SQL
  5.1 Inconsistent Partial Order Graphs
  5.2 Inconsistency Detection
  5.3 Normalization of DQ Preferences for Query Planners
  5.4 Summary
6 Quality Aware Query Response
  6.1 Estimation of the DQ of a Query Plan
    6.1.1 Statistical Approach
    6.1.2 Sampling Based Approach
    6.1.3 Extension of Conditional DQ Profiling
  6.2 Summary
7 Proof of Concept
  7.1 Implementation of DQAQS
    7.1.1 Common Protocol Between Components
    7.1.2 Registering Data Sources
    7.1.3 Data Profiler
    7.1.4 Query Parser
    7.1.5 Query Planner
    7.1.6 Final Results
  7.2 Summary
8 Conclusion
References

List of Figures

2.1 Data Quality Aware Query System Architecture
2.2 A traditional data quality profile for a sample dataset
4.1 A sample conditional DQ profile's initial generation: (a) the source data set; (b) the expansion phase of the conditional DQ profile
4.2 Reduced conditional DQ profile of Figure 4.1 with thresholds τ = 2 and ε = 0.2
4.3 Comparison of the average estimation error rate for traditional DQ profile (DQP) and conditional DQ profile (CDQP) for different variations in distribution of dirty data d
4.4 (a) Effect of error distribution d on profile size; (b) effect of error distribution on profile generation time
4.5 (a) Effect of certainty threshold ε on the size of the DQ profile for different minimum set thresholds τ; (b) effect of τ on the estimation capability of the DQ profile
4.6 Scalability graphs: (a) generated profile size versus data set size; (b) profile generation time versus database size
4.7 Sample trie and a query workload
4.8 Normalizing popularity and cost for cost model calculation
4.9 One-dimensional illustration of the coverage of samples with different sampling rates d_0 > d_1 > d_2 > d_3
4.10 Effect of coverage d on the effectiveness of approaches when query popularity is predictable
4.11 Effect of coverage d on the effectiveness of approaches when query popularity is constantly changing
4.12 Effect of the uniform sampling rate selected on the effectiveness of approaches
5.1 (a) Circular inconsistency; (b) path inconsistency; (c) consistent graph
5.2 Graph G with a circular inconsistency and its relevant distance matrix
7.1 Component architecture of the DQAQS implementation
7.2 Registering a new data source in DQAQS
7.3 Generating an attribute based DQ profile in DQAQS
7.4 Generating a conditional DQ profile in DQAQS
7.5 Generating a sample trie based on a query workload in DQAQS
7.6 Capturing user preferences using DQ aware SQL in DQAQS
7.7 Capturing user preferences using a visual tool in DQAQS
7.8 Optimization of query plans in DQAQS
7.9 Query results using DQAQS


1 Introduction

User satisfaction from a query response is a complex problem encompassing various dimensions, including both the efficiency and the quality of the response. Quality in turn includes several dimensions such as completeness, currency, accuracy, relevance and many more [33]. In order to present the best possible query response to the user, understanding user preferences on data quality as well as understanding the quality of data sources are

critical. Consider for example a virtual store that integrates a comparative price list for a given product (such as Google Products) through a meta search (a search that queries the results of other search engines and selects the best possible results amongst them). The search engine obviously does not read all the millions of results for a search and does not return millions of records to the user. It normally selects top results from different search engines and compiles the answers from the shortlisted sources.

In the above scenario, when a user queries for a product, the virtual store searches through a variety of data sources for that item, ranks them and returns the results. For example, the user may query for Canon EOS. In turn the virtual store may query camera vendor sites and return the results. The value that the user associates with the query result is clearly subjective and related to the user's intended requirements, which go beyond the entered query term, namely Canon EOS (which currently returns 91,000 results on Google Products). For example, the user may be interested in comparing product prices, or in information on the latest models. More precisely, suppose that the various data sources can be accessed through a schema consisting of the attributes (Item Title, Item Description, Numbers Available, Price, Tax, User Comments). A user searching for Canon EOS may actually be interested in:

1. Browsing products: such a user may not care about the Numbers Available and Tax columns. Price is somewhat important to the user, although obsoleteness and inaccuracy in price values can be tolerated. However, consistency of Item Title and

completeness within the population of User Comments in the query results are of highest importance.

2. Comparing prices: where the user is sure about the item to purchase but is searching for the best price. Obviously the Price and Tax fields have the greatest importance in this case. They should be current and accurate. Numbers Available is also important, although slight inaccuracies in this column are acceptable, as any number greater than 1 will be sufficient.

The above examples indicate that getting a satisfactory query result is subject to three questions: how good is each data source? what does the term good mean to the user? and how to assimilate data from the sources to best meet user preferences? To answer the above questions, we face the following challenges.

The first challenge is to provide suitable means to measure the quality of data. In order to estimate the quality of data we should collect descriptive and/or statistical information about the data. These measures can in turn be used in query planning and query optimization. Descriptiveness of the collected information contributes to the effectiveness of the system, by making predictions of the quality of the source/result-set closer to reality. However, it comes with a trade-off due to increased storage and computational costs; with today's technology, though, data storage is rarely a problem, and user satisfaction with the query results can be deemed more important than storage. In many real world applications, measuring DQ is not cheap. For example, assurance of the correctness of a URL is only possible by calling the URL and parsing the returned page; a sketch of such a metric is given below. The cost of such DQ measurements for millions of records can be overwhelming. For such applications, efficient sampling techniques can be used for profiling data quality. Generation of a minimal sample for estimating the DQ profile in the presence of environmental constraints is a key challenge.
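As an illustration of this cost, the following is a minimal sketch of such a URL-checking metric; the use of Python, the urllib calls and the timeout are our own illustrative assumptions, not part of the proposed system.

    # A minimal sketch of an expensive DQ metric: a URL value counts as
    # accurate only if dereferencing it succeeds. The network round trip per
    # record is what makes profiling millions of rows overwhelming.
    from urllib.request import urlopen
    from urllib.error import URLError

    def url_accuracy_metric(url, timeout=5):
        """m_url(t): 1 if the URL can be fetched successfully, 0 otherwise."""
        if not url:
            return 0
        try:
            with urlopen(url, timeout=timeout) as response:
                return int(response.getcode() == 200)
        except (URLError, ValueError):
            return 0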

The second challenge relates to the capture of user preferences on DQ. Modelling user preferences is a challenging problem due to its inherent subjective nature [28]. Additionally, DQ preferences have a hierarchical nature, since there can be a list of different DQ requirements for each attribute in the query. Several models have been developed to model user preferences in decision making theory [28] and database research [20]. Models based on partial orders have been shown to be effective in many cases [29]. Different extensions to standard SQL have also been proposed to define a preference language [20]. A good system for capturing user preferences should assist users to specify consistent preferences, as there is evidence to indicate that specifications of user preferences on queries often contain irrational or inconsistent preferences [18].

The third challenge is to develop data quality aware query planning and data integration methods that allow for efficient data source selection and effective data integration in the presence of pre-specified user preferences. In particular, techniques for ranking data sources based on multi-criteria decision making, and for assembling the results from the selected data sources into one query result, are discussed in this paper.

In this paper we present a framework for DQ aware query systems in the presence of multiple overlapping data sources available for answering the same query. We define this framework around the three challenges mentioned earlier in this section, which leads us to three key components of the system: 1) profiling DQ, which is the process of extracting measures about the quality of each data source for a given query; 2) capturing user preferences

on DQ; and 3) planning queries and integrating results from multiple sources in consideration of user defined DQ preferences.

The rest of this paper is organized as follows. In Section 2 the general framework proposed for this work is presented. The remaining sections are dedicated to discussing solutions and options for the three challenges noted above; at the end of each section, the techniques discussed are evaluated and relevant future work and open questions are discussed where relevant. In Section 3, existing literature related to the three key requirements of the framework is reviewed. Finally, in Section 8 we conclude the paper.


2 Framework for DQ Aware Query Systems

The challenges addressed in this research are positioned within an overall framework, namely the Data Quality Aware Query System (DQAQS). Before starting the discussion of the framework, we assume the following architecture, which is an extension of a general data integration architecture with data quality components. The architecture consists of: data sources (S1, S2, ...), Data Quality Services (DQS), and

a Data Quality Aware Mediator (DQM).

[Figure 2.1: Data Quality Aware Query System Architecture]

Data sources should expose their schema to the mediators. In general data integration systems this happens through a wrapper, a component that wraps the database schema in order to unify the schema definition for all sources. In this paper we assume the wrapper is part of the interface to the data source. The Data Quality Services are merely containers for DQ metrics and their definitions. They know the business rules used to measure DQ metrics and implement the functionality for them. The notion of Data Quality Services is widely used in industrial products such as IBM QualityStage and the recent MS SQL Server Data Quality Services. DQSs query relevant data sources to generate and maintain DQ profiles. Query mediators are the most complex part of the architecture and orchestrate query planning and data integration for the end user.

Measurement of DQ is expensive; therefore, Data Quality Services may become the bottleneck for processing data. They are usually able to process a limited amount of data at a time and cannot process data at the same speed at which query responses are generated by the DQ Mediator. Therefore, the DQ Mediator should use some form of pre-processing if it is to provide DQ metrics for the user's query.
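To make the division of responsibilities concrete, here is a minimal sketch of the component roles; the class and method names (schema, measure, score) and the ranking cut-off are illustrative assumptions, not the thesis's actual .NET implementation.

    # A minimal sketch of the DQAQS component roles. All names are assumed.
    from abc import ABC, abstractmethod

    class DataSource(ABC):
        @abstractmethod
        def schema(self): ...              # exposed to mediators via the wrapper
        @abstractmethod
        def query(self, conditions): ...   # returns rows matching the conditions

    class DataQualityService(ABC):
        """Container for DQ metric definitions and their business rules."""
        @abstractmethod
        def measure(self, source, attribute, rows): ...  # 0/1 per row, computed offline

    class DQMediator:
        """Parses DQ aware queries, plans them and integrates results,
        consulting the local DQ profile dictionary built by the DQSs."""
        def __init__(self, sources, profile_dictionary):
            self.sources = sources
            self.profiles = profile_dictionary  # hypothetical object with a score() helper

        def answer(self, query, dq_preferences):
            ranked = sorted(self.sources, reverse=True,
                            key=lambda s: self.profiles.score(s, query, dq_preferences))
            return [row for s in ranked[:3] for row in s.query(query)]  # cut-off illustrative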

Data Quality Aware Mediators host three necessary functions: a service for parsing the DQ aware query language, a service for planning the query and integrating data while considering DQ preferences and requirements, and a local DQ Profile Dictionary, which potentially contains DQ profiles for all data sources. This profile dictionary can help the mediator to optimize its query plan and select only those data sources that best serve the user's DQ preferences. The effective and efficient deployment of the DQM as presented in the architecture above is a key objective of this paper. In the following sections we present the techniques and methods necessary to build such a technology. Note that the three parts of the DQM correspond to the three challenges outlined in the introduction, and represent the core elements of a data quality aware query system.

Measurement of the data quality of a data set is called DQ profiling. Overall statistical measurements for a data set, which we call a traditional DQ profile, as shown in Figure 2.2, are used in the literature and industry [34][24]. Figure 2.2 (b) represents a traditional DQ profile for the given data set of Figure 2.2 (a); it is the result of measuring the completeness of each attribute in the dataset of Figure 2.2 (a). In this example completeness is the fraction of non-null values over the number of records for each attribute. For example, the completeness of the Image attribute of the given data set is 50% since there are 3 null values over 6 records. The three columns in the profile table of Figure 2.2 (b) identify the object (an attribute from the relation) against which the metric Completeness is measured, as well as the result of this measurement. A sketch of this attribute-level profiling is shown below.

Traditional DQ profiles similar to the one in Figure 2.2 (b) are incapable of returning reliable estimates of the quality of the result set most of the time.
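The following is a minimal sketch of this attribute-level profiling, assuming completeness is the fraction of non-null values per attribute; the in-memory rows are illustrative.

    # Traditional (attribute-level) DQ profiling: one completeness value per
    # attribute over the whole dataset. Dataset values are illustrative.
    rows = [
        {"Brand": "Canon", "Model": "SLR",  "Price": "High", "Image": "img1.jpg"},
        {"Brand": "Canon", "Model": "Norm", "Price": "Low",  "Image": None},
        {"Brand": "Sony",  "Model": "SLR",  "Price": "High", "Image": "img2.jpg"},
        {"Brand": "Sony",  "Model": "Norm", "Price": "Low",  "Image": None},
    ]

    def traditional_profile(rows, attributes):
        """Return {attribute: completeness} over the whole dataset."""
        profile = {}
        for a in attributes:
            non_null = sum(1 for r in rows if r.get(a) is not None)
            profile[a] = non_null / len(rows)
        return profile

    print(traditional_profile(rows, ["Brand", "Model", "Price", "Image"]))
    # e.g. {'Brand': 1.0, 'Model': 1.0, 'Price': 1.0, 'Image': 0.5}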

[Figure 2.2: A traditional data quality profile for a sample dataset]

For example, even though the completeness of the Image attribute for data source S1 is 50%, the completeness of the Image attribute for query results (with conditions) from this data source can be anything between 0% and 100%: the completeness of the Image attribute for Canon cameras is 50%, while this value is 100% for Sony cameras and 0% for Panasonic cameras. In reality, a Canon shop may have records of other brands in its database but be particularly careful about its own items, which means the quality of Canon items in the database will differ significantly from the overall quality of the database. In this example, the selection conditions (brand is Canon or brand is Sony) are fundamental parts of the query. However, they have not been considered in previous work on data quality profiling. In the next sections we propose the concept of conditional DQ profiling, along with techniques for the generation and maintenance of the profile in two different environments. One technique takes a closed-world assumption and guarantees the accuracy of DQ estimation, which is more suitable for enterprise applications. We then extend the discussion on the creation and maintenance of conditional DQ profiles by proposing sampling techniques to estimate DQ for open world systems (e.g. the Web), where the environment is more restrictive.

Data Quality Aware Query Planning and Data Fusion. One major challenge for query

planning in data integration systems is selecting data sources. One method to rank data sources considering the quality of data sources and user preferences on DQ can be based on a multi-criteria decision making technique. The general idea of having a hierarchy in decision making, and the definition of a hierarchy as strict partial orders, is widely studied. In [28] a decision making approach called the Analytical Hierarchy Process (AHP) is proposed. The problem of source selection for quality-aware queries can be delineated as a decision making problem in which a source of information should be selected based on a hierarchy of user preferences that defines what a good source is. A minimal sketch of such a ranking is given below.
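The following sketch illustrates AHP-style source ranking; the pairwise-comparison matrix over (completeness, currency, accuracy), the per-source metric values and the weighted-sum scoring are illustrative assumptions rather than the exact method developed later.

    # A minimal sketch of AHP-style source ranking. All numbers hypothetical.
    def ahp_weights(pairwise, iters=100):
        """Approximate the principal eigenvector of a pairwise-comparison
        matrix by power iteration; normalized, it is the weight vector."""
        n = len(pairwise)
        w = [1.0 / n] * n
        for _ in range(iters):
            v = [sum(pairwise[i][j] * w[j] for j in range(n)) for i in range(n)]
            s = sum(v)
            w = [x / s for x in v]
        return w

    # "Completeness is 3x as important as currency, 2x as important as accuracy, ..."
    pairwise = [[1.0, 3.0, 2.0],
                [1 / 3.0, 1.0, 0.5],
                [0.5, 2.0, 1.0]]
    weights = ahp_weights(pairwise)

    sources = {"S1": [0.9, 0.4, 0.8],   # measured DQ metric values per source
               "S2": [0.6, 0.9, 0.7],
               "S3": [0.8, 0.7, 0.5]}
    ranked = sorted(sources, key=lambda s: -sum(w * m for w, m in zip(weights, sources[s])))
    print(ranked)  # source names ordered by weighted DQ score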


3 Related Work

The consequences of poor data quality have been experienced in almost all domains. From the research perspective, data quality has been addressed in different contexts, including statistics, management science, and computer science [30]. To understand the concept, various research works have defined a number of quality dimensions [30][32].

Data quality dimensions characterize data properties, e.g. accuracy, currency, completeness, etc. Many dimensions have been defined for the assessment of the quality of data that give us the means to measure it. Data quality dimensions can be very subjective (e.g. ease of use, expandability, objectivity, etc.). The approaches to address the problems that stem from the various data quality dimensions can be broadly classified into investigative, preventative and corrective. Investigative approaches essentially provide the ability to assess the level of data quality and are generally provided through data profiling tools. Many sophisticated commercial profiling tools exist [13]. There are several interpretations of the dimensions (which may vary for different use cases); e.g. completeness may represent missing tuples (open world assumption) [3], and accuracy may represent the distance from truth in the real world. Such interpretations are difficult if not impossible to measure through computational means, and hence in the subsequent discussion the interpretation of these dimensions is assumed to be given as a set of DQ rules.

A variety of solutions have also been proposed for the preventative and corrective aspects of data quality management. These solutions can be categorized into the following broad groups: semantic integrity constraints [7]; record linkage solutions, where record linkage has been addressed through approximate matching [16], de-duplication [17] and entity resolution techniques [4]; and data lineage or provenance solutions, classified as annotation and non-annotation based approaches, where back tracing is suggested to address auditing or reliability problems [31]. Data uncertainty and probabilistic databases are another important consideration in data quality [22]. In [19], the issue of data imputation for incomplete datasets is studied, whereas maximizing data currency has been addressed in [27].

Nevertheless, data quality problems cannot be completely corrected, and in the presence of errors further consideration is required to maximise user satisfaction with the quality of data received.

3.0.1 Data Quality Framework

A number of studies have focused on answering queries in information systems based on some aspects of data quality. Quality-aware querying performed by a data quality broker is explicitly addressed in some works. In [23] a service based framework for representing and exchanging data quality in cooperative information systems is described. A data quality broker monitors the communication within the system and manages a number of DQ metrics (e.g. trustworthiness, currency, etc.) of data sources. Similarly, the DaQuinCIS architecture was proposed in [3] to manage DQ in cooperative information systems. Both these architectures define certain data quality metrics and suggest techniques to manage them for different data sources using a distributed service called a data quality broker. In [24], a model for determining the completeness of a source or combination of sources is suggested, along with an algorithm for querying the most complete data, which selects the sources with the highest completeness to answer a query. In addition to the above works, [8] analyzes existing definitions and metrics for data freshness in the context of a data integration system.

We share with the above works the idea of data quality aware querying; however, the main difference is the semantics of our system: our aim is not only to query the quality of data sources or data, but to improve the user's satisfaction with the query results.

To this end, profiling data quality to estimate the DQ of the final query result with high accuracy should go to the extent of estimating the DQ of any given query with various selection conditions. Additionally, understanding user preferences on data quality is not studied in the above works ([23] proposes an XML based method to model DQ metric assignments, but does not cover the problem of capturing user preferences on data quality). One more difference between the semantics of our system and the above works is that we de-couple our system from the definition of the DQ metrics, and focus on the problems of DQ profiling and query answering in isolation from the DQ definition.

3.0.2 Profiling Data Quality

Measurements made on a dataset for DQ dimensions are called DQ metrics, and the act of generating DQ metrics for DQ dimensions is called DQ profiling. Profiling generally consists of collecting descriptive statistical information about data. These statistics can in turn be used in query planning and query optimization. A data quality profile is a form of metadata which can be made available to the query planning engine to predict and optimize the quality of query results. The literature reports on some works on DQ profiling. For example, in [24] DQ metrics are assigned to each data source in the form of a vector of DQ metrics and their values (source level). In [32] a vector of DQ metrics and their values is attached to the table's metadata to store additional DQ profiling, e.g. {(Completeness, 0.80), (Accuracy, 0.75), ...} (relation level). In [34] an additional DQ profile table is attached to the relation's metadata to store DQ metric measurements for the relation's attributes (attribute level).
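For contrast, the three granularities can be pictured as plain data structures; all values below are hypothetical.

    # Three DQ profile granularities surveyed above (values hypothetical).
    source_level = {"S1": {"Completeness": 0.80, "Accuracy": 0.75}}       # per source, as in [24]
    relation_level = {"Items": {"Completeness": 0.80, "Accuracy": 0.75}}  # per table, as in [32]
    attribute_level = {                                                   # per attribute, as in [34]
        "Items": {"Image": {"Completeness": 0.50},
                  "Price": {"Completeness": 1.00}}
    }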

We assume that a set of DQ metrics M is standardized between data sources; however, data sources may have different approaches (i.e. different rules) to calculate their DQ metrics (e.g. a UK based data source has a different set of rules from an Australian based data source for checking the accuracy of an address). The above profiling techniques provide an estimation of the quality of data for a data source, a table or a static set of data. Although these works estimate the quality of data, the difference between our work and the above works lies in the fact that we are interested in estimating the data quality of the query results based on the query requested by the user, which contains selection conditions.

3.0.3 User Preferences on Data Quality

The issue of user preferences in database queries dates back to 1987 [21]. Preference queries in deductive databases are studied in [15]. In [20] and [9] a logical framework for formulating preferences and its embedding into relational query languages are proposed. We leverage the ideas presented in these works in modeling user preferences; however, none of these works specifically addresses the issue of modeling DQ preferences. Additionally, several models have been developed to model user preferences by the decision making theory and database communities. Models based on partial orders have been shown to be effective in many cases [29]. Typically, models based on partial orders let users define inconsistent preferences. Current studies on user preferences in database systems assume that the existence of inconsistency is natural (and hard to avoid) for user preferences and that a preference model should be designed to function even when user preferences are inconsistent; hence, they deliberately opt to ignore it.

Nevertheless, not all studies agree with this assumption [18]. Human science and decision making studies show that people strive for internal consistency and will almost always avoid inconsistent preferences if given enough information about the state of their decision (e.g. visually). In fact, the existence of inconsistency in user preferences dims the information about user preferences captured by the query. All of the above works address the problem of understanding and modeling user preferences in different domains, from computer science to psychology. In regard to modeling user preferences on DQ, we share similar goals with the above works; however, we put together ideas from the different fields to present a practical model for capturing reliable user preferences on data quality.

3.0.4 Query Planning and Data Integration

From a query planning perspective, a data source is abstracted by its source description. These descriptions specify the properties of the sources that the system needs to know in order to use their data. In addition to the source schema, the source descriptions might also include information on the source: response time, coverage, completeness, timeliness, accuracy, reliability, etc.

When a user poses a data integration query, the system formulates that query into subqueries over the schemas of the data sources whose combination will form the answer to the original query. For applications with a large number of sources, the number of sub-query plans is typically very large and plan evaluation is costly, so executing all sub-query plans

is expensive and often infeasible. In practice, only a subset of those sub-queries is actually selected for execution [11][25][26]. Each executed sub-query is also optimized to produce a physical query execution plan with minimum cost. This plan specifies exactly the execution order of operations and the specific algorithm used for every operation (e.g., algorithms for joins). Existing techniques tend to separate the query selection process into two separate phases using two alternative approaches: 1) select-then-optimize, and 2) optimize-then-select. In the first approach, a subset of sub-queries is selected based on coverage, then each selected sub-query is optimized separately [11][25], whereas in the second approach, each sub-query is optimized first, then a subset of optimized sub-queries is selected to maximize coverage while minimizing cost [26]. However, in both approaches, the selection of query plans is primarily based on data coverage and/or query planning costs without further consideration of DQ. In contrast, in this paper we describe our approach for query plan ranking based on DQ, which allows for efficient data source selection in the presence of pre-specified user preferences over multi-dimensional DQ measures.


4 Conditional Data Quality Profiling

DQ profiling is the task of measuring DQ metrics for given data. In this paper we assume that services exist which can measure DQ metrics (i.e. the DQS described in Section 2). In particular, we do not need to know how the completeness or consistency of data is defined; instead we assume a service that calculates DQ metrics for every row of the dataset. A data quality profile that stores attribute-level DQ statistics for the whole dataset

requires very little storage for each data source; an attribute-level DQ profile is shown in Figure 2.2 (b). The DQ profile in Figure 2.2 is generated from the data set of Figure 2.2 (a): it is the result of measuring the completeness of each attribute in that dataset, where completeness is the fraction of non-null values over the number of records for each attribute. For example, the completeness of the Image attribute of the given data set is 50% since there are 3 null values over 6 records. The three columns in the profile table of Figure 2.2 (b) identify the object (an attribute from the relation) against which the metric Completeness is measured, and the result of this measurement. Attribute-level DQ profiling (we also call it traditional DQ profiling) is not adequate for all purposes.

The above limitation of traditional DQ profiling motivates us to propose the conditional DQ profile to estimate DQ profile metrics where the distribution of dirty data in the dataset is not uniformly random and queries are bound with conditions. In this section we propose techniques and methods that present the concept of conditional DQ profiling for a variety of needs and environments.

Measuring DQ metrics is expensive. It often involves querying some external master data or reference. A Data Quality Service is responsible for measuring DQ metrics for a given set of data, and it is limited in the amount of data it can process in a timely fashion. For example, a DQS might be able to provide accuracy measures for only a few hundred addresses a second, while source data sets have tens to hundreds of millions of records. Obviously the amount of data and the available preprocessing time will have a significant effect on the usefulness of the proposed DQ profiling technique.

Ideally, a DQ profile should be able to provide an estimate for every query issued by the

user, where the estimate is not further from the actual DQ than a limited, pre-known error rate. We call this guaranteed DQ estimation. Depending on the environmental restrictions and the amount of data that should be profiled, guaranteed DQ estimation may or may not be achievable. For example, in a corporate environment where the number of records in the data sets is limited and the DQ profile can be generated overnight, the user can have the luxury of guaranteed DQ estimation. In contrast, in a web or cloud environment where pre-processing all data sets is extremely costly and processing cannot happen overnight, an incremental collaborative technique that can create DQ profiles within enforced cost limits is required.

In this paper we first propose techniques to generate a DQ profile that can guarantee DQ estimation and is suitable for the corporate environment. We call this profile the DQ profile table, and we present a technique to generate and compress it by preprocessing all data in the data sets. This technique can guarantee the accuracy of DQ estimation in trade-off with the processing and storage cost of profile creation and DQ measurement. Afterwards, we propose a technique based on sampling data which works in the presence of cost constraints on DQ measurement and does not assume pre-processing of the data sets. We call this technique the sampling trie. In the sampling trie technique, we try to optimize the accuracy of DQ estimation within cost limits; however, we cannot guarantee DQ estimation. The two proposed techniques for DQ profile generation and maintenance cover the various environments in which DQAQS may be implemented and contribute to the generality of the proposed framework.

4.1 DQ Profile Table Approach

Given a data set S, let us assume that direct access to all data in S is possible and that generation of the DQ profile is an off-line process, i.e. resources and time for preprocessing the whole data set are available. In order to define the conditional DQ profile, we first need to formally define the DQ metric function.

Definition 1.1. Let $\{a_1, \ldots, a_m\}$ be the attributes of the relation R, T be a set of tuples representing R, and metric m be a set of rules. We define the metric function $m_{a_i}(t)$, $t \in T$, as 1 if the value of attribute $a_i$ in tuple t does not violate any rule in m, and 0 otherwise.

The DQ metric function $m_{a_i}$ is thus defined by a set of rules, where each rule is a boolean function that describes a DQ dimension. For example, an accuracy metric function for a given postcode can be defined by a set of rules containing a check that compares the data against some master data.

Assuming conditions in a finite domain consist of single comparisons combined with $\wedge$ operators, we define the conditional DQ profile as a table that maps conjunctive conditions of queries to DQ metric values.¹ For every metric function $m_{a_i}$, we create a conditional DQ profile table as a set of conditional DQ profile records, defined as follows.

Definition 1.2. Let $m_{a_i}$ be an arbitrary metric function for attribute $a_i$, and T be the set of tuples of relation R. We define the conditional DQ profile $Pr$ as a set of DQ profile records $t_{Pr}$.

Definition 1.3. Let $Pr$ be a conditional DQ profile table consisting of profile records $t_{Pr}$, and Condition be a set of conjunctive equality conditions for a given query that results in a subset $\zeta \subseteq T$ of the dataset. We define $t_{Pr}$ as $(Condition) \mapsto (\#_\zeta, Q_\zeta)$, where Condition is the query's conjunctive selection condition, $\zeta$ is the subset of T that satisfies Condition, $\#_\zeta = |\zeta|$, and $Q_\zeta = |\{t \in \zeta \mid m_{a_i}(t) = 1\}|$.

¹Calculation of the DQ profile is independent of the $\vee$ operator, since $|A \vee B| = |A| + |B| - |A \wedge B|$.
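A minimal sketch of Definition 1.1, assuming the single completeness rule from the running example ("Image must not be null"); the rule set and tuple are illustrative.

    # m_ai(t): 1 if attribute a_i of tuple t violates no rule in m, else 0.
    completeness_rules = [lambda value: value is not None]

    def metric_function(rules, t, attribute):
        return int(all(rule(t.get(attribute)) for rule in rules))

    t = {"Brand": "Canon", "Model": "SLR", "Price": "High", "Image": None}
    print(metric_function(completeness_rules, t, "Image"))  # 0: Image is null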

In relational databases, we store the profile table $Pr$ for metric $m_{a_i}$ (also referred to as $Pr_{m_{a_i}}$) as a relation $(a_1, \ldots, a_n, a_\#, a_Q)$ consisting of profile row tuples $t_{Pr} = (t_{1Pr}, \ldots, t_{nPr}, \#_{t_{Pr}}, Q_{t_{Pr}})$, where $t_{iPr} \in dom(a_i) \cup \{\ast\}$ ($\ast$ is the don't care value). $\#_{t_{Pr}}$ is the number of appearances of the tuple in T, considering that $\ast$ substitutes for anything, and $Q_{t_{Pr}}$ is the number of tuples among them that return 1 for $m_{a_i}$.

If we are given all possible conjunctive equality conditions that can occur for a given dataset, we can create the conditional DQ profile $Pr_{a_i}$ for any given metric function. However, the number of possible conjunctive equality conditions over a given dataset can be exhaustive, and the conditional DQ profile created from all possible queries can become much larger than the underlying dataset.

4.1.1 Conditional DQ Profile Generation

As described above, the complete search space for a conditional DQ profile may need to traverse all condition possibilities.

Brute-force Approach

Conditional DQ profile generation incurs the following challenge: the size of a complete conditional DQ profile, which has all the data required for estimating the quality of any query result, can be as big as the original dataset or even larger. Thus, storing and querying a conditional DQ profile may be too expensive and inappropriate. To illustrate this problem, consider the relation Items(Brand, Model, Price) in Figure 4.1

(a) (we also refer to the attributes as B, M, P for simplicity). The values of these attributes come from the following domains: dom(B) = {Sony, Canon} (abbreviated {S, C}), dom(M) = {SLR, Norm} ({S, N}), and dom(P) = {Low, High} ({L, H}). Each domain has 2 members; with the addition of the don't care value ($\ast$), there can be $3^3 = 27$ possible queries: {B=C or B=S or B=$\ast$} × {M=S or M=N or M=$\ast$} × {P=H or P=L or P=$\ast$}. The search space consists of all possible subsets of equality comparisons joined with the $\wedge$ operator (e.g. B=C $\wedge$ M=S $\wedge$ P=H). The number of possible comparisons for any attribute a is $|dom(a)|$, and the search space can be as large as $|dom(a_1)| \times \cdots \times |dom(a_n)|$, where n is the number of attributes in R.

The brute-force solution is to enumerate all possible conjunctive equality selection conditions over dataset R, compute the metric function for each conjunctive condition, and store the DQ of the results together with the conjunctive selection condition in the DQ profile $Pr$. $Pr$ can then be queried to estimate the quality of the result set for any given query with any selection condition. Brute-force browsing of all possible conjunctive conditions is inefficient for two reasons. First, there are many selection conditions for which no result in the dataset can be found; for example, if the data set Items does not include the tuple {Canon, SLR, Low} (or {C, S, L}), that condition does not need to be checked. Second, measuring the quality of the query result for each combination of conditions requires the execution of a query (first the query result set must be found, and then the data quality of that result can be calculated), and there is a cost associated with each query.

Efficient Conditional DQ Profiling

In this section we propose a technique for the creation of the conditional DQ profile and

further reduction of its size.

[Figure 4.1: A sample conditional DQ profile's initial generation: (a) the source data set; (b) the expansion phase of the conditional DQ profile]

We call the first phase of the technique the expansion phase, in which the initial conditional DQ profile is created. We call the second phase the reduction phase, in which we reduce the profile size while maintaining acceptable accuracy of DQ estimation. We further have a revert phase that is used to overcome any loss of information incurred during the reduction phase. The level of acceptable accuracy is ensured by letting the user define two thresholds, which are discussed further in this chapter.

Let $a_1 \ldots a_k$ be the list of attributes to be used for query conditions over R, and $m_{a_i}$ be the metric function m on attribute $a_i$ to be profiled. In the initial phase, we create the conditional DQ profile $Pr$ using the following query:

    select a_1, ..., a_k, count(a_i) as #, sum(m_ai) as Q from R group by a_1, ..., a_k
    union
    select a_1, ..., a_(k-1), '*' as a_k, count(a_i) as #, sum(m_ai) as Q from R group by a_1, ..., a_(k-1)
    ...
    union
    select a_1, '*' as a_2, ..., '*' as a_k, count(a_i) as #, sum(m_ai) as Q from R group by a_1
    union
    select '*' as a_1, ..., '*' as a_k, count(a_i) as #, sum(m_ai) as Q from R

    order by a_1, ..., a_k

The above query creates a conditional DQ profile similar to Figure 4.1 (b). We now explain the above concept in the context of our running example. In Figure 4.1 we use completeness as the metric function, defined by one simple rule: if Image is null then 0, else 1. Note that the same principle can be applied to other DQ metrics that can be defined through a rule or set of rules. Figure 4.1 (a) shows the sample data set Items(Brand, Model, Price, Image); all attributes and their values are abbreviated to their first letter. Figure 4.1 (b) illustrates the conditional DQ profile, before applying the ε threshold, for estimating the completeness of the attribute Image. A sketch of this expansion phase is given below. The profile created after the expansion phase is usually large; therefore, we take further steps to reduce its size.

Approximating DQ to Reduce Size

We reduce the size of the profile in two ways: first, by estimating the value of a query's DQ metric instead of providing the exact value, and second, by removing all the conditions that return few, statistically insignificant numbers of records. These reductions are controlled by two thresholds. First, the minimum set threshold τ defines the minimum size of a tuple set to be profiled. For example, if there are only four Panasonic cameras in a shop and τ = 10, the DQ profile row with condition Brand = Panasonic will not be created. This threshold reduces noise in the DQ profile; if it is ignored (τ = 1), any single tuple, which has a metric function value of exactly 0 or 1, appears in the conditional DQ profile. Second, the certainty threshold ε is the maximum uncertainty acceptable to the user. For example, if ε = 0.1, the DQ profile row with condition Brand = Canon should return a completeness between 40% and 60% for the data set in Figure 2.2.
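The following is a minimal sketch of the expansion phase over in-memory rows, mirroring the union of group-by queries above; the tiny dataset and the prefix-based enumeration of don't-care patterns are illustrative.

    # Expansion phase: one group-by per attribute prefix a_1..a_j, with '*'
    # standing for the don't-care value, as in the union query above.
    from collections import defaultdict

    STAR = "*"

    def expand_profile(rows, attrs, metric):
        counts = defaultdict(lambda: [0, 0])      # key -> [#, Q]
        for r in rows:
            for j in range(len(attrs) + 1):       # prefix length 0..k
                key = tuple(r[a] for a in attrs[:j]) + (STAR,) * (len(attrs) - j)
                counts[key][0] += 1               # tuples matching the condition
                counts[key][1] += metric(r)       # tuples passing the metric
        return dict(counts)

    rows = [
        {"Brand": "C", "Model": "S", "Price": "H", "Image": "x.jpg"},
        {"Brand": "C", "Model": "N", "Price": "L", "Image": None},
        {"Brand": "S", "Model": "S", "Price": "H", "Image": None},
    ]
    profile = expand_profile(rows, ["Brand", "Model", "Price"],
                             lambda r: int(r["Image"] is not None))
    print(profile[("*", "*", "*")])  # [3, 1]: 3 records, 1 complete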

Algorithm 1 receives the conditional DQ profile table $Pr$ and the certainty threshold ε as input, and returns a reduced conditional DQ profile table. It removes profile records whose metric function value is within ε of that of their parent profile record. For a given profile record $t_1$, a parent profile record $t_2$ is a record in which, for every attribute value in $t_1$, the same value or $\ast$ appears in $t_2$. For example, (C, S, $\ast$, 4, 2) is a parent of (C, S, H, 2, 2).

    Data: profile table Pr, certainty threshold ε
    create an empty stack S
    push(S, fetch first record from Pr)
    while not eof(Pr) do
        fetch x from Pr
        while S is not empty and top(S) is not a parent of x do
            pop(S)
        end while
        if S is not empty then
            let s := top(S)
            if (s.Q / s.#) - ε <= (x.Q / x.#) <= (s.Q / s.#) + ε then
                delete x from Pr
                continue                (x's children are compared to s instead)
            end if
        end if
        push(S, x)
    end while
    return Pr

Algorithm 1: Reducing the size of the conditional DQ profile

Algorithm 1 removes profile records where the quality of the record is within ε of the quality of the closest parent record that has not been deleted. For example, as in Figure 4.2, the record ⟨S, ∗, ∗⟩, whose quality (57%) is within ε = 0.2 (or 20%) of its parent ⟨∗, ∗, ∗⟩ (63%), is removed from the profile. Hence, the quality of a query with condition Brand = S will be estimated from its closest parent as 63%, which is not far from the actual quality of the result set (around 57%). Figure 4.2 (a) depicts the DQ profile from Figure 4.1 (b) considering τ = 2, and Figure 4.2 (b) shows the reduced profile table of Figure 4.2 (a) considering ε = 0.2.
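A runnable rendering of this ε-reduction is sketched below, assuming profile records arrive so that each record is preceded by its ancestors (the traversal order implied by Algorithm 1); the record format is our assumption.

    # ε-reduction over a profile list of (values tuple, #, Q), ancestors first.
    def is_parent(p, c):
        """p is a parent of c if it matches c on every attribute or has '*'."""
        return all(pv == cv or pv == "*" for pv, cv in zip(p, c))

    def reduce_profile(profile, eps):
        kept, stack = [], []
        for key, n, q in profile:
            while stack and not is_parent(stack[-1][0], key):
                stack.pop()
            if stack:
                s_key, s_n, s_q = stack[-1]
                if abs(s_q / s_n - q / n) <= eps:   # child's DQ within ε of parent
                    continue                        # drop the record
            kept.append((key, n, q))
            stack.append((key, n, q))
        return kept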

[Figure 4.2: Reduced conditional DQ profile of Figure 4.1 with thresholds τ = 2 and ε = 0.2]

Furthermore, reducing the size of the profile by removing profile records that reflect fewer than τ records is straightforward; it can be done with the command delete from Pr where # < τ.

Although Algorithm 1 is able to reduce the size of the profile, it has the side effect of losing some valuable data. For example, if the DQ profile record ⟨S, S, H⟩ is removed, the quality of a query with condition Price = H cannot be estimated. Using Figure 4.2 (a), the quality of ⟨∗, ∗, H⟩ can be estimated from the quality of the records ⟨C, S, H⟩, ⟨C, N, H⟩, ⟨S, S, H⟩ and ⟨S, N, H⟩ (i.e. (2 + 2 + 0 + 0)/(2 + 3 + 1 + 1) = 57%), but using Figure 4.2 (b) this value cannot be estimated correctly. This problem appears because we have reduced a record by comparing it to only one of its possible parents: ⟨S, S, H⟩ is not only a child of ⟨S, S, ∗⟩ in the search space; it can also be a child of ⟨S, ∗, H⟩ or ⟨∗, S, H⟩, which do not exist in the generated DQ profile. After reducing the DQ profile with the threshold ε, we reduce the profile with the minimum set threshold τ.

To resolve this problem, when removing records (nodes in the search space) from the profile, we should revert all possible parents of the removed nodes if the removal of a node affects the correctness of DQ estimation for its parents. For example, if we are removing ⟨S, S, H⟩, and the quality of its parent node ⟨∗, S, H⟩ is different, we should revert ⟨∗, S, H⟩ into the conditional DQ profile.
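The ⟨∗, ∗, H⟩ estimate above is simply the aggregate of its children's counts, e.g.:

    # Recomputing the <*, *, H> estimate from its child profile records
    # (values taken from the example above; record format: values, #, Q).
    children = [("C", "S", "H", 2, 2), ("C", "N", "H", 3, 2),
                ("S", "S", "H", 1, 0), ("S", "N", "H", 1, 0)]
    total = sum(c[3] for c in children)   # 7 matching records
    good = sum(c[4] for c in children)    # 4 of them pass the metric
    print(f"{good}/{total} = {good / total:.0%}")  # 4/7 = 57%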

We revert any useful deleted data as follows; we refer to this part of the algorithm as the revert phase. Given the profile table $Pr(a_1, \ldots, a_n, \#, Q)$, all eliminated tuples $C = (c_0, \ldots, c_n, \#, Q)$, and a prefix list consisting of attributes $A = (a_i, \ldots, a_k)$, let $A^+$ be the attributes of $Pr$ that do not appear in A. First, run the expansion and reduction phases for C, sorted by an order sequence of attributes that starts with the sequence A and continues in a random order over $A^+$. Revert the results to $Pr$ with the depth of the recursion attached to them, and keep the reduced records in $C'$. Then, unless $C' = C$, $A^+$ is empty or $C'$ is empty, for each attribute $a \in A^+$ recursively run the revert phase with parameters $A = A \cup \{a\}$, $C'$ and $Pr$.

For example, the deleted records from Figure 4.2 (a) will be sent to the revert phase in parallel, sorted as (P, M, B) and (M, P, B). The first recursion, for the attributes sorted as P, M, B, generates ⟨H, S, S⟩ (66%), ⟨H, S, ∗⟩ (66%), ⟨H, ∗, ∗⟩ (66%), etc., which is reduced to ⟨H, ∗, ∗⟩ (65%), etc. The revert phase recurses for another level (as long as the size of the result of the function is decreasing). Thus, generation of the conditional DQ profile now happens in three steps: first, run the expansion phase over the data; second, the reduction phase; and third, run the revert algorithm for each attribute in $Pr$ to update $Pr$.

4.1.2 Complexity of the Proposed Algorithms

The initial expansion phase of the algorithm consists of n group-by queries, where n is the number of attributes that may appear in conditions. Group-by queries run in $O(|R|)$, where $|R|$ is the size of the relation R. Pushes and pops to the stack in Algorithm 1 happen at most once per profile record,

so they do not introduce an extra complexity dimension. The complexity of the expansion and reduction phases of the algorithm is $O(|R| \cdot n)$ in the worst case. For the revert phase of the algorithm, we conduct various experiments comparing execution time against dataset size. In traditional DQ profiling, a full scan of the database is required to compute the average DQ profile of the whole dataset per metric, i.e. $O(|R| \cdot n)$; however, the expansion phase of our algorithm usually generates a much larger DQ profile, and the reduction and revert phases take more preparation time in exchange for a smaller conditional DQ profile.

4.1.3 Querying Conditional DQ Profile to Estimate Query Result

The conditional DQ profile can estimate the quality of the result set subject to the two factors τ and ε. There may be cases where a query's conditions do not exist in the profile; this happens for three reasons: 1) the query returns no data (we assume checking for this situation occurs before querying the profile, to avoid unnecessary references to datasets that are not relevant); 2) the profile record has been removed due to the ε threshold, in which case we return the estimate for the next available parent of the query in the profile; or 3) the profile record has been removed due to the threshold τ, in which case we act as in the previous case, but the result can be inaccurate.

In the previous section we proposed techniques to create a conditional DQ profile for a given data set. Estimating the DQ of a query result from a conditional DQ profile is done by finding the first row in the conditional DQ profile that covers all the conjunctive selection conditions in the query. In the worst case, a full scan of the DQ profile may be required; hence the speed of query answering from the conditional DQ profile is proportional to the size of the DQ profile.

Consider a relation R from source S, and a conditional DQ profile $Pr$ for an arbitrary metric and attribute, consisting of columns $(a_0, \ldots, a_k, \#, Q)$. To estimate the quality of the result set for a user query with selection criterion $\Phi: a_0 = \alpha_0 \wedge \ldots \wedge a_j = \alpha_j$, the same query should be run against the DQ profile $Pr$ (instead of R), but it should also consider the don't care values. We translate $\Phi$ to $\Phi': (a_0 = \alpha_0 \vee a_0 = \ast) \wedge \ldots$; if $a_i$ is not mentioned in the query conditions, it still appears in the translation as $a_i = \ast$. For example, the query

    SELECT * FROM D WHERE Brand = 'Canon' AND Model = 'SLR'

translates to

    SELECT TOP 1 #, Q FROM Pr
    WHERE (Brand = 'Canon' OR Brand = '*')
      AND (Model = 'SLR' OR Model = '*')
      AND (Price = '*')
    ORDER BY Brand, Model, Price

In the ORDER BY operation, the don't care value should appear after any other value in the domain; hence a general result will be selected only if a less general result does not exist.
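A minimal sketch of this lookup over an in-memory profile, assuming the rows are already sorted from least to most general (concrete values before the don't-care, as the ORDER BY above arranges):

    # Find the least general profile row matching the query's conditions.
    def estimate(profile_sorted, attrs, conditions):
        """Return (#, Q) of the first (least general) matching profile row."""
        for key, n, q in profile_sorted:
            if all(v == conditions.get(a, "*") or v == "*"
                   for a, v in zip(attrs, key)):
                return n, q
        return None

    # e.g. estimate(profile, ["Brand", "Model", "Price"],
    #               {"Brand": "C", "Model": "S"})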

In regards to the scalability of profile creation, the traditional DQ profile has a constant cost, which is not comparable to our techniques. However, we study the scalability of our techniques against the various affecting factors. We study our algorithms on the publicly available DBLP data set [10]. In this paper, we assume that the details of the calculation of the DQ metric are transparent to our technique. Hence we write a deterministic function that pre-generates a metric result value for every record of the dataset. This function stores the pre-generated results per record in a lookup table and returns the same result for every record of the dataset each time. The important characteristic of the data that affects the size of the conditional DQ profile is the distribution of dirty data. On one end, the distribution of dirty data can be totally uniform. Then the conditional DQ profile is reduced down to the traditional DQ profile and needs only one row to satisfy the DQ estimation requirements for all queries. On the other end, if every subset of the dataset has a distribution of dirty data different from every other subset, the conditional DQ profile should have one row for each subset. In order to simulate different distributions of dirty data in the data set, we pre-generate metric functions as follows: Let d be the variation in the distribution of dirty data. d = 0 means no variation in the distribution of dirty data, i.e. a uniform distribution of dirty data, and d = 1 is the maximum variation in the distribution of dirty data. We approximately simulate d by grouping the dataset by all attributes that can be used in the conditions. Then we generate count(groups)/d (or 1 if d = 0) random numbers and map these numbers to groups with the same pattern.
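A minimal sketch of this simulation (hypothetical names; the number of distinct dirty-data rates is taken as a parameter, since it is derived from d as described above):

import random

def simulate_dirty_data(records, group_attrs, num_rates, seed=42):
    """Pre-generate a deterministic boolean DQ metric per record.

    records     -- list of dicts representing the dataset
    group_attrs -- attributes that can be used in query conditions
    num_rates   -- how many distinct dirty-data rates to spread over the groups
    """
    rng = random.Random(seed)
    rates = [rng.random() for _ in range(num_rates)]
    # Collect the distinct groups and map each one to a dirty-data rate,
    # spreading the rates evenly over the groups ("with the same pattern").
    groups = sorted({tuple(r[a] for a in group_attrs) for r in records})
    group_rate = {g: rates[i * num_rates // len(groups)] for i, g in enumerate(groups)}
    # The lookup table makes the metric deterministic: repeated calls for the
    # same record always return the same result.
    lookup = {}
    for idx, r in enumerate(records):
        rate = group_rate[tuple(r[a] for a in group_attrs)]
        lookup[idx] = rng.random() >= rate  # True = good quality
    return lookup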

For example, there might be 100 different groups for {Brand, Model, Price}. If d = 0.5, two random numbers, e.g. 0.3 and 0.9, will be generated and mapped to the groups to introduce dirty data respectively. Therefore, the proportion of dirty data for the first 50 groups will be 0.3 and the proportion of dirty data for the other 50 groups will be 0.9. For evaluation of our techniques, we first generate the conditional DQ profile. To ensure we have covered different combinations, we run 10% of all possible queries with conjunctive equality selection conditions against the profile for each experiment. In order to better understand the effect of the two different thresholds on DQ estimation quality, when measuring the effects of ɛ, we exclude all queries that return fewer than τ results.

Effectiveness of DQ Estimation

[Figure 4.3: Comparison of the average estimation error rate for traditional DQ profile (DQP) and conditional DQ profile (CDQP) for different variations in distribution of dirty data d]

Figure 4.3 shows the average error rates of randomly generated queries, comparing the average estimation error rate for traditional DQ profiles (DQP) and conditional DQ profiles (CDQP) under different variations in the distribution of dirty data. The estimation error rate is the difference between the actual metric function value of the query result and the estimated metric function value of the query result. The conditional DQ profile is generated for the thresholds ɛ = 0.15, ɛ = 0.30, and ɛ = 0.45. It can be observed that the estimation error using the conditional DQ profile is always below ɛ; the DQ estimation error rate of a conditional DQ profile will not exceed the threshold ɛ. When d = 0, i.e. the distribution of dirty data is totally uniform, the traditional DQ profile

performs as well as the conditional DQ profile; however, as the variation in the distribution of dirty data increases, the actual metric function value for more queries differs from the traditional DQ metric estimate, which is basically the average DQ metric value over the whole dataset. Therefore, the estimation error increases. A higher ɛ also means that many more profile records are reduced away, so the DQ metric value for more queries is not estimated correctly.

Effect of Error Distribution

[Figure 4.4: (a) Effect of error distribution d on profile size (b) Effect of error distribution on profile generation time]

Figure 4.4 (a) and (b) show the effect of the variation in the distribution of dirty data on the profile size and on the cost of creating the conditional DQ profile. We conducted this experiment on a Windows 7, 32-bit virtual machine with 4GB of RAM on an Intel Core i3 2GHz machine running OS X as the host operating system. It can be observed that a uniform distribution of dirty data results in the minimum profile size. Indeed, if for example out of every 10 records in the dataset, 7 are of good quality, any arbitrary subset of the dataset will return 70% for the metric function. Hence, it may be enough to keep only the metric function result for the whole data set. However, when the variation in the distribution of dirty data increases, a bigger profile is required. Based on Figure 4.4 (b), although the

profile size increases with more variation in the distribution of dirty data, the profile generation time decreases as this variation increases. The reason for this behaviour is that most of the profile generation time is spent in the reduction phase.

Effect of ɛ and τ Thresholds

[Figure 4.5: (a) Effect of certainty threshold ɛ on the size of the DQ profile for different minimum set thresholds τ (b) Effect of τ on the estimation capability of the DQ profile]

Figure 4.5 studies the effect of both the minimum set threshold τ and the accuracy threshold ɛ on the size of the conditional DQ profile. It can be observed that by tolerating a degree of uncertainty, a significant reduction in the DQ profile size can be achieved. Thresholds τ and ɛ improve the size reduction when they are used together; e.g. a very low τ combined with a relatively high ɛ has a negative effect on the generated profile size. A higher ɛ means lower accuracy. Figure 4.5 (b) depicts the effect of the minimum set threshold τ on the estimation capability of the DQ profile. Since DQ is a statistical characteristic of data, τ = 0 means that even if a query returns only one or two records, the DQ profile row supporting that query is kept in the profile table; however, in many applications only a few records do not convey statistical significance. On the contrary, increasing τ to a big number will remove many possibly valuable profile records from the profile table, for which the DQ metric then cannot be estimated, and we may need to fall back to the traditional DQ metric estimate for those queries.

[Figure 4.6: Scalability graphs (a) generated profile size versus data set size (b) profile generation time versus database size]

Effect of Input Data Size

Figure 4.6 illustrates the scalability of the proposed algorithms. We used a smaller subset of the DBLP dataset for our experiment and increased the size by selecting a larger subset each time until reaching the whole data set; for each experiment we create a new set of queries. Figure 4.6 illustrates the effect of data set size on the size of the conditional DQ profile, as well as on the profile generation time. Experiments are run with thresholds ɛ = 0.2 and τ = 20.

4.2 DQ Sample Trie Approach

The previous approach for generating conditional DQ profiles is best suited when accuracy of DQ estimation is required for every query and enough resources and time are available to generate DQ measurements for every record of data. In environments where access to data and the measurement and generation of a DQ profile are costly (e.g. web or cloud applications) and some inaccuracy in the estimated DQ is acceptable, more flexible and more resource-conserving techniques are required to generate the conditional DQ profile. In the web environment, the amount of data that is available to the DQ Mediator (see Figure 2.1) is not pre-processable.

The amount of data that users query from the DQ Mediator at any moment can also be overwhelming and varies from time to time. However, due to the nature of DQ measurement, the DQS usually has limited power and cannot process DQ for all the data that the DQ Mediator is handling. For example, the DQS may be able to process the accuracy of a few hundred addresses, or the correctness of less than a hundred URLs, per second, but the amount of data that is processed by the DQ Mediator can easily be tens of times more than the DQS limitations. In such cases, where the DQS becomes a bottleneck, an incremental cost-effective approach is required for creating DQ profiles. The DQ Profile Table estimates the DQ of a given query by looking up the pre-processed estimate from the Profile Table. There is a very small overhead for every query to store the DQ measurement value. However, the DQ measurement value cannot be re-used for other queries. Storing a sample instead of the DQ measurement value lets us re-use the DQ measurements made for previous queries to estimate the DQ of future queries. Re-using samples to answer different queries can drop the cost of DQ measurement (unfortunately, samples cannot guarantee the accuracy of estimation, so if the accuracy of estimation must be guaranteed, the Profile Table approach is more suitable). In DQAQS, users and decision makers issue queries (to the DQ Mediator) and, if possible, they will be immediately informed about the estimated quality of the data they will receive, as well as the confidence level of the estimation. The user can then decide to continue with the query or take another action, such as issuing another query. The sampling technique should be able to take advantage of the fast local memory and storage of the mediator when available, but the limited resources should be utilized in a way that maximizes user satisfaction. Different

subsets of the database have different popularity among the system's users. For example, prestigious database conferences are queried more by the users of this particular system; hence there is more value in sampling the respective subsets of the database. Since external limitations define constraints on the cost, not all necessary samples can be generated. The system should decisively generate only the samples that contribute the most to the final user satisfaction (i.e. accuracy of DQ estimation). For example, if the only cost is the sample size, and 100 samples with a total size of 1000 rows are required to accurately estimate DQ for all the queries, but the cost restrictions do not allow for more than 100 sample rows, then a portion of the samples should be created such that they do not exceed 100 rows while the sum of popularity is maximized. Another aspect of a good sampling technique is the provision of an accurate response even for queries that return a small number of records, while the overall size of the samples is minimized. To achieve this, biased sampling techniques [1] are suggested in the literature. For DQ profiling, the cost of measuring DQ highly outweighs the cost of storage (which is usually the most important cost for biased sampling techniques). Measuring DQ is inherently expensive. For example, the correctness of an address requires a number of lookups in a database, the assurance of a web address requires calling the web address and parsing the returned page, etc. Therefore, we try to re-use parts of the sample to generate a more accurate DQ estimate with minimum extra cost. For example, if a Computer Science publication data set has 1000 records for JIS journal papers, and 10 records for JIS papers of year 2011, a uniform 1% sample of JIS papers will probably include only one or zero sample records from 2011 JIS papers. If users show specific interest in recent JIS papers (i.e. popularity of the query), it makes sense to create

a small sample for 2011 JIS papers in addition to the 1% sample of all JIS papers. Once a dense sample like the 2011 JIS papers sample is created, we will utilize it to enrich the DQ estimation for every query that may include data from this sample.

4.2.1 Sample Trie

We utilize a variation of the suffix trie data structure (which we call a sample trie) for accessing different samples from different subsets of the data set. Figure 4.7 illustrates a sample trie for a given query workload. The sample trie is a forest (a set of disjoint trees) where each tree represents all samples that relate to a suffix of conjunctive selection conditions on attributes. For example, in Figure 4.7, the attributes are year, venue, and university. The longest conjunctive selection condition on attributes consists of three selection conditions referring to all these three attributes. We assume a constant order in the presentation of attributes in conjunctive conditions (e.g. all selection conditions are assumed to be in the order of year, venue, university); hence, every single tree in the sample trie forms a suffix of this longest selection condition. The complete sample trie for this example will consist of one tree of all conjunctive selection conditions made of year, venue, and university; another tree for all conjunctive selection conditions made of venue and university; and a last tree for the selection conditions made of only university. Each node of the trie that is marked with a number represents a sample which is generated as a uniform sample from the query response. The density of every sample in the trie may be different from the others. For example, sample 3 should represent a denser sample than sample 2. Assume the sampling rate of both samples 3 and 2 is 0.1, which means for every 10

records in the query result, there is 1 record in the sample. If a new query, year=2010 and venue=JIS and uni=UQ, that returns 20 records is issued to the system, both samples 2 and 3 may contain only 2 records that satisfy the query conditions. In the sample trie, each sample is denser than its parents in the trie. E.g. if sample 2 is created with a uniform sampling rate of 0.1, sample 3 may be created with a uniform sampling rate of 0.4, which means it may contain 8 records for the above query. In the rest of this paper, we use the terms node and sample interchangeably for nodes of the trie and the sample they contain.

[Figure 4.7: Sample trie and a query workload. The workload consists of: 1- Year between 2008 and 2012 & Venue = JIS; 2- Year = 2010; 3- Year = 2010 & Venue = JIS; 4- Year = 2011; 5- Venue = JIS; 6- Venue = JIS & University = UQ; 7- University = UQ]

Nodes in the sample trie have the following characteristics: 1) Each sample S represents a query Q_S whose selection condition is the conjunction of the node's condition with its parents' conditions (recursively up to the root node). 2) Each node is a subset of its parent node; node S_1 is a subset of node S_2 if the query response for Q_{S_1} is a subset of the query response for Q_{S_2}. For illustration purposes, the sample trie of Figure 4.7 contains one node for every query; however, the techniques in the rest of this section use a much smaller sample trie (bound to cost limits) to estimate DQ accurately for as many queries as possible.
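As a rough illustration, a trie node can be represented as follows (a minimal sketch with hypothetical names; the confidence formula is the one defined below, max(1 - 1/|Q_S|, γ_S)):

from dataclasses import dataclass, field

@dataclass
class SampleNode:
    """One node of the sample trie."""
    condition: dict                 # the node's full conjunctive query condition Q_S
    sampling_rate: float            # gamma_S: fraction of matching records kept
    sample: list = field(default_factory=list)    # uniformly sampled records
    children: list = field(default_factory=list)  # denser sub-samples
    popularity: int = 0             # how often this node's query was issued

    def query(self, conditions):
        """Run a conjunctive equality query against the stored sample."""
        return [r for r in self.sample
                if all(r.get(a) == v for a, v in conditions.items())]

    def confidence(self, conditions):
        """Confidence of a DQ estimate from this sample: max(1 - 1/|Q_S|, gamma_S)."""
        hits = len(self.query(conditions))
        # with zero hits only the sampling rate can be reported
        return self.sampling_rate if hits == 0 else max(1 - 1 / hits, self.sampling_rate)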

The sample trie differs from a suffix trie in that not every suffix of a node's query necessarily exists in the trie. Samples are created only when needed (e.g. in Figure 4.7, query 7 is a suffix of query 6, but node 7 is only created when required; it is not created together with node 6). The sample trie can be used to estimate the DQ of a given query. For uniform samples, the denser the sample is, the more accurate the estimate it provides for the given query. The benefit of a sample trie is that if the lower density samples are not able to provide an accurate DQ estimation, there might be a denser sample that can help find a more accurate estimate. The sample trie can be generated so as to maximize the possibility of queries being accurately estimated. To estimate the DQ of a query result from a sample, we should first run the query against the sample. A query can be satisfied from a sample if running the query against the sample returns enough records to estimate the DQ of the query results with some confidence. We define the confidence of the DQ estimation as max(1 - 1/|Q_S|, γ_S), where |Q_S| is the size of the query result run against sample S and γ_S is the sampling rate. The confidence of the DQ estimation is related to the size of the result returned from the sample, but if the number of results is very low (e.g. 1) while the sampling rate is very high (e.g. 1), the confidence reflects the sampling rate. The user defines what confidence threshold is acceptable. Assuming the sample trie exists, for a given query Q, there are three possibilities where we might be able to use the sample trie to estimate the DQ of the result of Q: 1) There exists at least one sample S from which Q can be totally satisfied (i.e. Q is a subset of S and Q can be satisfied from S). In this case Q_S (the result of running Q against S) provides an accurate estimation of DQ. We may use other information in the sample to improve the

estimation accuracy even further. 2) There is no sample that satisfies Q, but there are a number of samples that satisfy a part of Q, and there exists a sample S that is a superset of the query (which is not dense enough to answer Q within the confidence threshold, otherwise it would totally satisfy Q). In this case, an estimate of the DQ of the query result cannot be calculated, because the sample trie does not have enough data to completely cover Q. For such situations, we provide a lower bound for the DQ estimation. 3) There is no sample that totally satisfies Q, nor is there a sample that is a superset of Q. In this case, even a lower bound of DQ cannot be estimated. Instead, we use all information available to provide a DQ estimate with unknown confidence. Generation and utilization of the sample trie are not two distinct phases. In fact, the purpose of using sampling is to re-use the DQ metrics processed for a part of the dataset for as many queries as possible. Therefore, we integrate the algorithm for creation of the sample trie with its utilization. We employ the popularity of the queries to identify their significance to the overall user satisfaction (i.e. DQ estimation accuracy). We consider two common scenarios. The first is when the query workload is available in advance, for example online systems whose user pattern is predictable (e.g. the users' query pattern for one week is a good representation of the query pattern for the next week). In this case we can provide a more cost-effective sample and we can optimize the total cost more accurately. The second scenario is when the query log cannot be assumed in advance (i.e. the users' querying pattern is totally unpredictable and varying). In this case, cost effectiveness cannot be optimal, as we cannot predict user behaviour. Hence, we try to improve the cost effectiveness based on a moving average of the user querying pattern.

To define a sample trie, let N represent sample S, and N.Q represent the query that generates the sample for node N. For any N′ that is a child of N, N′.Q ⊆ N.Q (we use existing view matching techniques [14] to identify whether a query is a subset of another query). Let A be the set of attributes a_i and D_{a_i} be the size of the value domain of each attribute a_i. Imagine the longest conjunctive selection condition that includes all a_i in the order of D_{a_i} (for example, attributes Year, Domain, and Venue are in order of D_{a_i}). The suffix trie for this longest conjunctive selection condition illustrates the sample trie that would result if all possible samples were generated. Creating the suffix trie in the order of D_{a_i} helps to reduce the bushiness of the trie, which makes it easier to maintain. Therefore, we organize each conjunctive selection condition in the order of D_{a_i} and we create the suffix trie using this order. Each node of the sample trie contains the sample for the query, a list of child samples, and other meta-data such as the density of the sample and the popularity of the sample's query, which are described later.

4.2.2 Sample Trie Algorithm

We assume that a constant or recurring limit for all different costs such as memory, storage, and DQS access is provided. The following steps define the general framework for creating the sample trie from the query workload; a sketch in code follows the step list. Inputs are: query Q with popularity ρ, and DQ sample trie T. We consider popularity as the number of times a query is issued by users. 1. Find the deepest node N in trie T where Q ⊆ N.Q. 2. If N satisfies estimation of DQ for Q, then return the estimation of DQ for Q. 3. If any of the cost limits is exceeded, return without creating any new sample.

4. If it is Worthy to create a new sample for Q, then create the new sample S, populate the DQ metrics by running it against the DQS, create a new node N′ assigned to sample S, query Q and popularity ρ, and add it to the trie as a child of N. All children of N should then be reorganized to keep the trie valid (e.g. any node that is a subset of N′ should be moved under N′).

The above algorithm provides a general framework for the definition of the various approaches in this paper. Every approach we define works in the context of the above algorithm, but each approach provides a different definition for at least one of the functions in the algorithm, presented in italics. There are three key functions in the above framework, namely Satisfaction, DQ Estimation, and Worthiness. We provide basic and cost-based approaches. The basic and naive approaches provide minimum functionality for all functions of the above algorithm, and the cost-based approach re-defines the worthiness and estimation functions to make the above algorithm cost-effective. Eventually we re-define the estimation function from the cost-based approach to maximize the accuracy of DQ estimation based on all data available in the sample trie. Below, we define the discussed approaches in more detail:

a- Naive Approach

The easiest approach to build a sample trie is to create one sample per distinct query as long as the cost limits are not exceeded. To implement this approach we define the algorithm functions as follows. Satisfaction: a query is only satisfied by a node if the node represents the same query. DQ Estimation: DQ can be estimated by running the query against the sample. Worthiness: every new query is worthy of creating a new sample (the worthiness function returns true regardless of the query).
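A minimal sketch of this framework, assuming a node structure like the SampleNode sketch above; is_subset and create_sample_node are hypothetical helpers standing in for the view-matching and sampling machinery, and the hook defaults shown implement the naive approach just described:

def handle_query(trie_root, Q, rho, satisfies, estimate, worthy, within_cost):
    """General framework: answer a DQ estimation request, growing the trie if Worthy.

    satisfies(N, Q) -- can node N's sample answer Q within the confidence threshold?
    estimate(N, Q)  -- DQ estimate for Q from node N's sample
    worthy(Q, rho)  -- is Q worth a new sample?
    within_cost()   -- are we still under all cost limits?
    """
    # Step 1: descend to the deepest node whose query is a superset of Q.
    N = trie_root
    while True:
        child = next((c for c in N.children if is_subset(Q, c.condition)), None)
        if child is None:
            break
        N = child
    # Step 2: answer from the existing sample if possible.
    if satisfies(N, Q):
        return estimate(N, Q)
    # Step 3: respect the cost limits.
    if not within_cost():
        return None
    # Step 4: create a new sample node and move any children it subsumes under it.
    if worthy(Q, rho):
        new_node = create_sample_node(Q, rho)  # draws a uniform sample, calls the DQS
        new_node.children = [c for c in N.children if is_subset(c.condition, Q)]
        kept = [c for c in N.children if not is_subset(c.condition, Q)]
        N.children = kept + [new_node]
        return estimate(new_node, Q)
    return None

# Naive-approach hooks: exact-match satisfaction, always-worthy.
naive_satisfies = lambda N, Q: N.condition == Q
naive_worthy = lambda Q, rho: True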

b- Basic Approach

The naive approach provides a baseline for comparison. It is hardly useful, as it is very inefficient and does not benefit from the data re-usability that samples provide. By creating a new sample only if there is no single sample that can satisfy the query, we can re-use the collected samples. By moving deep in the sample trie while the sample is still a superset of the given query Q, select... where year=2011 and venue=JIS, the smallest superset of the query can be found. For example, if a sample S already exists for the query select... where year=2011, it could be directly identified as the best candidate to estimate DQ for Q. Indeed, running the query against its smallest superset returns a result set of size |S_Q|. If |S_Q| > 0, the confidence of the estimation can be defined as 1 - 1/|S_Q|. If the confidence of the estimation is greater than a confidence threshold c defined by the user, there is no need to create a new sample for Q. Hence, by setting the Satisfaction function to return true if an estimation with appropriate confidence can be made, we won't generate unnecessary samples. We call this the basic approach.

c- Cost Based Approach

In a system where the cost of DQ measurement is subject to some limitation, an approach that does not differentiate queries may exhaust the cost limit with rarely used samples. We can create only a limited number of samples. If we create a number of samples that can satisfy only a few future queries, we have wasted the resources, since the overall accuracy of the system for estimating the DQ of user queries would be low. To improve the overall accuracy of DQ estimation, we should create samples that can satisfy the queries that are highly popular. Let us consider two general types of DQAQS. For the first type, the overall user behaviour is predictable. There is no need to predict queries for any individual user or any specific time. Here, predictability means that the popularity of a

query from the past can be applied to the future, e.g. recent science conferences are queried more, or publications of some authors are queried more frequently than others. For this type of system, we can utilize an existing query workload to optimize the sample trie generation. We assume there is a constant cost limit that cannot be exceeded. The second type are systems in which the user query pattern is totally unpredictable and changes all the time. For example, news popularity changes based on current events, and looking at the last week's or month's query workload is not a good representation of next week's queries. In this type of system, there is a recurring cost limit which resets over time (e.g. an amount of DQ measurement per hour). We consider popularity as the number of times a query is issued. More accurately, the popularity of a query consists of the popularity of the query itself plus that of all other queries that can be answered from the sample of the query result. To identify whether a query can be answered from a sample, let the cardinality of the query result against the sample be denoted as |S_Q|, the sampling rate be γ, and the confidence threshold be c. Query Q can be answered from the sample for S if |S_Q| ≥ 1/(1 - c).² The value 1/(1 - c) identifies the minimum number of rows required in the answer of the query to statistically estimate the DQ of the results within the confidence threshold; e.g. if the confidence threshold is 0.8, a minimum of 5 rows should be returned by the query. Cost can be bandwidth, storage, or most importantly the cost of communication with the DQS (i.e. DQ measurement). For simplicity, let us assume only the cost of DQS communication, which is equivalent to the size of the sample. If C_L is the sum of all costs for the queries in L, and C is the upper bound of the costs defined by the user, a maximum of α = C/C_L of the

² Since samples are generated uniformly, statistically |Q| · γ records from the result of Q should exist in any sample that is a superset of Q.
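The answerability test is a one-liner; a small sketch (hypothetical names):

import math

def min_rows(c):
    """Minimum result rows needed to estimate DQ within confidence threshold c."""
    # small epsilon guards against floating-point rounding (1/(1-0.8) = 5.000...1)
    return math.ceil(1 / (1 - c) - 1e-9)

def answerable(sample_hits, c):
    """Can a query that returned sample_hits rows from a sample be answered?"""
    return sample_hits >= min_rows(c)

# e.g. min_rows(0.8) == 5, matching the example above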

queries can be materialized as samples. The cost function should select the top α percent of queries with the highest popularities and lowest costs. If we replace the Worthiness function in the above framework with the following cost function, a dramatic improvement in system accuracy is achieved. We call this the cost based approach. We define the cost function for two types of DQAQS: 1) predictable popularity, where the query workload L is known in advance, and 2) constantly changing popularity, where L is received as a stream of queries (and is not known in advance). For simplicity we assume only one cost limitation C, i.e. the cost of DQ measurement, which is the dominant cost.

i- Predictable Popularity: If L is completely available in advance, samples can be selected as follows to maximize the popularity and minimize the cost. We first filter out all queries that can be answered from other queries, based on the cardinality of each query (as described before). Let the result be L′. We then shortlist all subsets of L′ where the sum of costs is less than or equal to C and select the subset with maximum popularity. Indeed, the Worthiness function will only consider queries in this set as Worthy. The popularity of the selected set of queries is calculated as the sum of the popularity of the queries in the set plus the sum of the popularities of queries which are subsets of any query in the selected set. Finding all subsets of L′ where the sum of costs is less than or equal to C grows exponentially with the size of L′. We use the following approximation in order to filter the most cost-effective queries. Let max_ρ be the maximum popularity of queries in L′. We sort queries Q in order of ((c_Q - min_c)/(max_c - min_c))² + ((max_ρ - ρ_Q)/(max_ρ - min_ρ))², where c_Q is the cost of query Q and ρ_Q is the popularity of query Q. We then select queries from the sorted list until the cost limit C is exceeded.
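A minimal sketch of this greedy approximation (hypothetical names; each query is a (id, cost, popularity) tuple, pre-filtered as L′):

def select_worthy(queries, cost_limit):
    """Greedy approximation: pick queries closest to (min cost, max popularity).

    queries    -- list of (query_id, cost, popularity) tuples
    cost_limit -- the upper bound C on total sampling cost
    Returns the set of query_ids considered Worthy of a sample.
    """
    costs = [c for _, c, _ in queries]
    pops = [p for _, _, p in queries]
    min_c, max_c = min(costs), max(costs)
    min_p, max_p = min(pops), max(pops)
    span_c = (max_c - min_c) or 1  # avoid division by zero for uniform values
    span_p = (max_p - min_p) or 1

    def distance(q):
        _, c, p = q
        # squared normalized distance to the ideal corner (min cost, max popularity)
        return ((c - min_c) / span_c) ** 2 + ((max_p - p) / span_p) ** 2

    worthy, spent = set(), 0
    for qid, c, _ in sorted(queries, key=distance):
        if spent + c > cost_limit:
            break
        worthy.add(qid)
        spent += c
    return worthy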

In fact, we normalize the cost and popularity values between 0 and 1; afterwards, we draw the queries in a two-axis system where the center point is the point of maximum popularity and minimum cost. We then sort all the queries in order of closeness to this center point.

[Figure 4.8: Normalizing popularity and cost for cost model calculation]

By selecting queries from the sorted order until the cost limit C is exceeded, we can approximate a set of queries that maximizes the accuracy of the system for the given cost limit C. The Worthiness function can then return Worthy only if the query is in this list.

ii- Constantly Changing Popularity: In some DQAQ Systems, query popularity is constantly changing. Looking back at the query workload does not provide a good estimate of future queries; instead, the popularity of future queries constantly changes based on current events. News or the stock market can be examples of such systems. We consider systems in which the popularity of queries constantly changes based on current events, but follows a trend (e.g. a new technology product is launched, and suddenly the popularity of queries about that specific technology increases). In such systems, monitoring a moving average of the popularity and cost of queries can be used to estimate the query pattern in the near future. We propose approaches to implement a cost based Worthiness function for systems with constantly changing popularity. In this approach we assume a dynamic cost limit that gets updated throughout time. For example, we might be able to request DQ measurement for a limited number of records from the DQS per minute. In this case, the quota available at any moment is assumed to be available as a rate function

(e.g. 1000 records per second). Let C be the cost in the form of a number of records per time unit (e.g. seconds), and let T be the cost of the queries that need a sample, in terms of records per time unit. T is calculated from the number of records handled by the DQ Mediator per time unit that are the result of queries for which the DQ Mediator has been unable to provide a high confidence DQ estimate, times the sampling rate. In short, T identifies the cost of sampling all queries that are handled at the DQ Mediator and need samples. We want to identify whether query Q is Worthy of creating a new sample. We consider the benefit that will be achieved from creating a sample for Q in contrast to the cost that will be incurred to do so. Obviously, if T is greater than C, the DQ of all records cannot be measured. In order to maximize the number of queries that can be estimated from the sample trie, we first figure out the percentage of queries that can be sampled based on the cost, which we call τ = C/T. Then we propose to generate a sample for a query only if it is in the top τ percent of queries in terms of cost and popularity. For example, if only 5% of the queries can be sampled, we sample a query only if it is in the top 5% of the queries in terms of high popularity and low cost, using the same method defined in Section 4.2.2-i. Let us assume the cost of creating the sample for Q is C_Q (i.e. the number of rows in the sample). We define max_c, min_c as the maximum and minimum of a moving window of query costs, and max_ρ and min_ρ as the maximum and minimum of the query popularities in the same moving window. We maintain a moving window, which we call W, of the received queries whose DQ cannot be estimated from the sample trie.

In order to use the available C for the queries with the highest popularity and lowest cost, we need to filter queries which lie within the radius of the top τ percent of queries (see Figure 4.8). We keep a sorted array of the queries in W where each element contains ((max_c - C_Q)/(max_c - min_c))² + ((ρ_Q - min_ρ)/(max_ρ - min_ρ))². We call this formula the normalized cost-popularity of Q. We then define the Worthiness function as: query Q is Worthy of creating a new sample only if its normalized cost-popularity is in the top τ percent of the queries in W. Obviously, with every query received, T, the fixed-size window W, and the minimum and maximum of ρ and C change; hence the worthiness function dynamically adapts to the trend of queries.
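A minimal sketch of this dynamic Worthiness function (hypothetical names; a fixed-size deque stands in for the moving window W):

from collections import deque

class DynamicWorthiness:
    """Worthiness for constantly changing popularity: top tau percent of W."""

    def __init__(self, window_size=1000):
        self.window = deque(maxlen=window_size)  # moving window W of (cost, popularity)

    def normalized_cost_popularity(self, cost, pop):
        costs = [c for c, _ in self.window]
        pops = [p for _, p in self.window]
        span_c = (max(costs) - min(costs)) or 1
        span_p = (max(pops) - min(pops)) or 1
        # higher score = lower cost and higher popularity
        return (((max(costs) - cost) / span_c) ** 2
                + ((pop - min(pops)) / span_p) ** 2)

    def worthy(self, cost, pop, C, T):
        """Sample only if Q is in the top tau = C/T fraction of the window."""
        self.window.append((cost, pop))
        tau = min(C / T, 1.0)
        scores = sorted(self.normalized_cost_popularity(c, p)
                        for c, p in self.window)
        cutoff = scores[int(len(scores) * (1 - tau))] if tau < 1 else scores[0]
        return self.normalized_cost_popularity(cost, pop) >= cutoff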

Optimized Accuracy DQ Estimation

We can improve the estimation function of the above approaches by utilizing all available information within the samples. The above approaches estimate DQ for a given query Q by finding one sample that satisfies the query and running the query against that sample. However, the sample trie may contain samples that can estimate the DQ of at least one subset of the query result more accurately. For example, if the user queries for JIS papers after 2010, and there exists a sample for all JIS papers and another sample for 2011 JIS papers, the first sample to totally satisfy Q will be the sample for all JIS papers. However, this sample estimates DQ with low accuracy, as it samples a large set of data and has a low sampling rate (e.g. 1%). The sample for 2011 JIS papers can estimate DQ with better accuracy, as it is a denser sample of a small set of data, but it can only answer a part of the query. Fortunately, since we know the sampling rate for each sample, we can estimate the cardinality of the intersection of Q and every other sample and use this information to calculate a more accurate DQ estimate. Considering query Q, we are able to estimate the DQ of the results of Q if there is at least one sample in the trie that has records that satisfy the selection condition of Q. We identify three cases based on the intersections of the samples and the query, which we discuss further: 1) There is a sample S_0 that totally satisfies Q (we call it a direct hit), and there might be other samples S_i that satisfy parts of Q with higher accuracy. In this case, we can estimate the DQ of the result of Q using S_0. Then, we further improve the accuracy of the estimation using the other samples that provide better estimates for subsets of Q. 2) There exists a sample that is a superset of Q but does not totally satisfy Q, there is no sample that totally satisfies Q, but there are samples that satisfy parts of Q (no hit with DQ lower bound). In this case we can only estimate a lower bound for DQ, and we can make the lower bound more accurate using the samples that satisfy parts of Q. 3) There is no sample that is a superset of Q, but there are samples that satisfy parts of Q. In this case we cannot provide any confidence for the DQ estimation; however, using the samples that satisfy parts of Q, the estimation can be made more accurate. We propose techniques to estimate DQ for the above cases. In the proposed techniques, we assume a list of samples S_i that can satisfy part of Q. Before continuing with the details of the accuracy improvement, it is worth noting that finding the S_i is trivial: if we store all sample records in one big table T_T where every record has a pointer to its containing sample in the trie, the S_i can be calculated by running Q against T_T and grouping the results by sample id (the result should also be filtered for groups that have more than 1/(1 - c) records to ensure the confidence threshold).

i- Direct Hit:

[Figure 4.9: One-dimensional illustration of the coverage of samples with different sampling rates, d_0 > d_1 > d_2 > d_3]

Let S_0 be the node in the suffix tree of samples that represents the deepest superset of Q. As mentioned in Section 4.2.1, each node keeps references to other samples that may satisfy part of Q. It also has a set of immediate children that are subsets of S_0 and may also satisfy parts of Q. Let S_i be all the children and referenced samples of S_0 that satisfy Q and have a higher sampling rate than S_0 (any sample with a lower sampling rate than S_0 is unable to improve the accuracy of the estimation). Let S_{Q_i} be the result of running Q against S_i and Er_{Q_i} be the number of records with bad DQ in S_{Q_i}. We consider d_i as the inverse of the sampling rate of S_i. For example, if sample S_i is generated with the rate 5%, d_i would be 100/5 = 20, which means every row in S_i represents 20 rows in the actual dataset. To improve the accuracy of the DQ estimation, we should substitute subsets of S_0 with the more accurate (denser) samples S_i. Figure 4.9 illustrates this concept for the query presented. Each line represents the extent of a query response. The line for S_{0Q} represents all the records from running Q against S_0; the line for S_{1Q} represents the result of running Q against S_1. If S_0 totally satisfies Q, we can observe from Figure 4.9 that S_1 to S_3 each satisfy a different subset of Q.

The samples S_i are ordered by their density; denser samples are on top. Each line represents the result of running Q against a sample S_i. Obviously, denser samples provide a higher quality estimation. To maximize the estimation quality, we should use the densest sample where possible. For example, to generate the final answer, we use S_3 to estimate DQ for the part of the query result that intersects with S_3. To estimate DQ for the remainder of the query result, we should similarly use S_2 when possible. S_1 can be used to estimate DQ for parts of the query result that do not intersect with S_3 and S_2. S_0, which has the lowest density, can be used to estimate DQ for the rest of the query results. The number of records with bad DQ in the result of Q in Figure 4.9 is calculated by adding up the numbers of records with bad DQ:

Er_Q = Er_{Q_{S_0} \ (Q_{S_1} ∪ Q_{S_2} ∪ Q_{S_3})} · d_0 + Er_{Q_{S_1} \ (Q_{S_2} ∪ Q_{S_3})} · d_1 + Er_{Q_{S_2} \ Q_{S_3}} · d_2 + Er_{Q_{S_3}} · d_3

Algorithm 2 implements the above equation, starting from the densest sample (e.g. S_3) and estimating Er_{Q_{S_3}} by running the query Q against S_3. In the next step we add to this an estimate of the number of records with bad DQ obtained by running Q against the query of S_2 and not the query of S_3. At each step the accumulative query AccQ is rewritten as the previous AccQ and not the query of the current S_i.

Data: samples S_i in descending order of density
AccQ ← Q; AccEr ← 0
for S in S_i do
    AccEr ← AccEr + Er_{AccQ_S} · d_S
    AccQ ← AccQ ∧ ¬(query of S)
end
return AccEr
Algorithm 2: Calculating Er_Q
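A minimal sketch of Algorithm 2 in code (hypothetical names; matches applies a conjunctive condition to a record, and each sample carries its records, its query condition, and its inverse sampling rate d):

def estimate_bad_records(Q, samples, is_bad, matches):
    """Algorithm 2: accumulate scaled error counts from the densest sample down.

    Q       -- the user query condition
    samples -- list of (records, condition, d) in descending order of density
    is_bad  -- predicate: does this record fail the DQ metric?
    matches -- predicate: does record r satisfy condition cond?
    """
    excluded = []  # conditions already accounted for (AccQ = Q and not these)
    acc_er = 0.0
    for records, cond, d in samples:
        # Records of this sample satisfying Q but none of the denser samples' queries.
        hits = [r for r in records
                if matches(r, Q) and not any(matches(r, e) for e in excluded)]
        acc_er += sum(1 for r in hits if is_bad(r)) * d
        excluded.append(cond)
    return acc_er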

ii- No Hit with DQ Lower Bound: Let S_i be all the samples that can satisfy a part of Q. If there is a sample S_0 in the sample trie that is a superset of Q but does not satisfy Q (i.e. the cardinality of Q is too small to show enough results in S_0), we can estimate DQ using the S_i, and we can provide a lower bound for the DQ estimate. Sample S_0 does not satisfy Q if |S_Q| < 1/(1 - c), where c is the confidence threshold; thus, we can estimate that |Q| < d_0/(1 - c). The lower bound of the number of records with bad DQ in the result of Q can be estimated using Algorithm 2. The reason that Er_Q calculated from the S_i estimates a lower bound on the number of records with bad DQ in the result of Q is that there might be parts of Q for which we have no information and which may contain any number of errors. But we have an estimate of the upper bound of the cardinality of Q. The actual rate of records with bad DQ in the result of Q would be greater than or equal to Er_Q · (1 - c)/d_0.

iii- No Hit: If there is no sample in the trie that is a superset of Q, but there are samples S_i that satisfy a part of Q, we do not know the confidence of the DQ estimation because there might be parts of the result of Q about which we have no information, and we cannot estimate an upper bound on the cardinality of Q. However, an estimation of DQ using Algorithm 2 can at least provide some insight to the user about the DQ he should be expecting. We can return the estimation as our best guess with low confidence.

4.2.3 Evaluation and Analysis

We evaluated the proposed algorithms of this section by implementing a prototype over Microsoft SQL Server. We used the DBLP [10] Computer Science bibliographical data set for our evaluation. In order to simulate realistic observations of various data quality metric results, we incorporated a synthetic DQ metric generation technique. Since DQ

metric measurements, provided by the Data Quality Service, are considered a black box in our techniques, we simulated the DQ metric function with a deterministic boolean function which returns true or false and always returns the same result for any given record from the dataset. The accuracy of the system is evaluated for both predictable and constantly changing popularities. For predictable popularity, we consider a constant cost limit and study the effectiveness of the proposed techniques for different cost limits. For constantly changing popularity, we consider a constant cost limit per minute C versus a query load per minute T, and we study the effectiveness of the proposed approaches for different C/T. If C/T is close to zero, the DQ Mediator's load is much higher than what the DQS can handle, and if C/T is greater than or equal to 1, the DQS can handle every query received from the DQ Mediator. Samples are generated with different base sampling rates, and the effect of the base sampling rate on the effectiveness of our approaches is studied. The base sampling rate is the sampling rate that will be used to sample a query that has no sample that is its superset. As mentioned before, the sampling rate for every sample is higher than its parent's (otherwise the sample would not be able to estimate any query better than its parent). The sampling rate of the children increases based on the confidence threshold. If a sample should estimate the DQ of queries that cannot be estimated with its parent, its sampling rate should be at least the sampling rate of the parent times 1/(1 - c). For example, consider a confidence threshold of c = 0.8 (which means a minimum of 5 records should be returned from a sample for Q_1). If query Q_1 returns 0 records from sample S_1, a new sample S_2 (a subset of S_1) that has less than 5 times the density of S_1 would not return 5 records for Q_1 either. The sampling rate is overridden when the sample size is less than the minimum sample size needed to answer the sample's query. In fact, if the confidence

threshold is 0.8, the minimum sample size will be 5 records, regardless of the sampling rate. In our experiments, for each query the DQ estimation error is calculated from the actual DQ of the query versus the DQ estimated from the samples. For example, if query Q returns 500 records where 400 of the records return true for the DQ metric, and the estimation from the samples returns 5 records where 2 of them are marked true for the DQ metric, the estimation error is 0.4 (i.e. 0.8 - 0.4). Samples are generated based on the query workload L. In order to be able to evaluate various possible query workloads, we define the coverage factor d of the query workload as the percentage of queries in the workload that are subsets of other queries. We then analyze different query workloads with different coverage factors. We prepared a large list of possible queries. To form a query workload with a given coverage factor d, we select a number of queries from this list that are not subsets of any other query, then we add more queries that are supersets of these queries, until the coverage factor d is reached. The accuracy of the system is calculated from the average estimation error for all queries in the workload. If the system is not able to estimate any DQ for a query, its error rate is considered 1; hence if the system is unable to estimate any query, the accuracy of the system will be zero, and if the system estimates every query without error, the accuracy will be one.

Effect of Cost Limit. In Figures 4.10 and 4.11 we evaluate all the different approaches undertaken in this paper. Figure 4.10 compares the system's accuracy for the approaches when the popularity of the queries is predictable. In this case, we initially create the samples from the query workload, and then run the query workload against the created samples. In Figure 4.11 we compare the system's accuracy for the approaches when the popularity of the queries

is not known in advance. In these systems, we run the query workload and the sample trie evolves as new queries are received. In this case, the naive approach is unable to utilize the sample trie, and the system's estimation accuracy is almost zero; hence we omit the naive approach from the graphs. The naive approach only uses a sample if the exact query is already present in the sample trie, so in this case its result reflects no information except the number of duplicate queries. The limitation on the sampling cost (which is usually enforced by the environment) has a direct effect on the accuracy of this method. The basic approach shows improvement in accuracy under lower cost limitations. The cost based approach is capable of accurate estimation even within very low cost limits. The optimized estimation approach improves the estimate for the queries using the available knowledge from the samples; it tends to increase the accuracy of the query results.

[Figure 4.10: Effect of coverage d on the effectiveness of approaches (naive, basic, cost-based, optimized) when query popularity is predictable; panels show estimation accuracy versus the number of records allowed to be sampled, for d=0.2, d=0.4, and d=0.8]

Effect of Coverage Factor in Queries. The coverage factor is the most significant factor affecting the effectiveness of our approaches. The coverage factor d is usually defined by the nature of the application. It defines the overlap of the queries issued by users. In some applications this overlap might be significant; for example, search queries usually have a lot of overlap,

as people tend to restrict the search result in various ways. In some systems there might be limited overlap between queries, for example where people only query the last hour of a stream of information. Figures 4.10 and 4.11 represent the behaviour of the different approaches when workloads with different coverage are utilized. The coverage factor also infers differences in the popularity of queries. The popularity of a query is defined as the number of times the query or any subset of the query is issued; hence, a low coverage factor means that the average popularity of queries is low, and a high coverage factor means that the average popularity of queries is high. If the coverage factor is small, very few queries can be answered from any existing sample. In such a case (d = 0.2), all approaches behave closely to the naive approach.

[Figure 4.11: Effect of coverage d on the effectiveness of approaches (basic, cost-based, optimized) when query popularity is constantly changing; panels show estimation accuracy versus C/T for d=0.2, d=0.4, and d=0.8]

[Figure 4.12: Effect of the uniform sampling rate on the effectiveness of approaches; panels show estimation accuracy versus the base sampling rate for the basic, cost based, and optimized approaches, each for d=0.2, d=0.4, and d=0.8]

If the query workload is large enough, there will always be queries which are subsets of other queries. With a slight increase in the coverage factor (d = 0.4), the approaches start to be effective.

Effect of Base Sampling Rate. The base sampling rate γ, which is the uniform sampling rate, affects the size of every individual sample. A bigger sample is more expensive to create, but is able to answer smaller sub-queries. A very small sampling rate means that no sub-query can be accurately estimated from its superset sample; it causes the DQ sample trie to behave similarly to the naive approach. A very high sampling rate will cause the sample to be similar to the whole query result. It means that any sub-query can be answered from its superset sample, but it makes sample creation too expensive, leading to a small number of samples being created under the sampling cost limit. Figure 4.12 displays the effect of different sampling rates for the basic, cost-based, and optimized approaches.

4.3 Summary

A data quality profile is meta data that can be used to estimate the DQ of a query for a given data source. In this section we first studied the necessity of some form of DQ profile that is able to estimate the DQ of queries accurately for any given query. Although DQ profiling occupies a considerable position in the literature and in industry, works on the estimation of generic DQ metrics for any query are limited. We suggested the concept of conditional DQ profiling, which is a generalization of existing attribute based DQ profiling and supports the estimation of DQ for queries with conjunctive selection conditions. Considering the complexity of the problem, there is no one solution

that fits all situations with different requirements. We divided the environments based on the cost limitation factor and discussed two solutions for these environments. One environment is when the cost of measuring DQ is negligible; such a situation arises when long overnight processing times are available for measuring the DQ of not-too-large datasets. The other environment is when the cost of measuring DQ is significant compared to the resources available. In such environments DQ measurement becomes a bottleneck and should be minimized. For both of these environments our goal is to maximize the accuracy of DQ estimation for a given query. For the first environment, where the cost of measuring DQ is not an issue, we suggest a big pre-processing step in which we measure DQ for every single record of the database and then process the DQ measurements for every possible conjunctive equality selection condition. Such a DQ profile can become too large; thus, we suggested algorithms to minimize the size of the generated DQ profile based on an accuracy threshold and a minimum set size threshold defined by the user. For the second environment, where the cost of DQ profiling is the bottleneck, we suggest a sampling technique called the sample trie that manages samples in a variation of a suffix tree of samples, and maximizes the accuracy of DQ estimations considering cost limitations and the popularity of the queries. We then suggested algorithms to use the information already collected in the sample trie in a way that maximizes the accuracy of DQ estimation in different situations, e.g. when a sample can be directly used to estimate DQ or when some samples can be used to only provide a lower bound for the estimation of DQ.

5 Quality Aware SQL

Modelling user preference on DQ is one of the key challenges in creating a DQ aware query system. Preference modelling is in general a difficult problem due to its inherently subjective nature. The preference model should thus enable natural communication with the end user to ensure an accurate and consistent capture of preferences. It is not reasonable to expect end users to directly weigh every metric of every attribute [29]. Typically, preferences are stated

in relative terms. The notion of preferences using partial orders is widely used in decision making theory. Accordingly, we propose a model based on partial order prioritization to define user preferences on DQ. We expect a set of binary subjective prioritizations of only two items (e.g. it is natural to say I prefer coffee much more over tea). A hierarchy of complex preferences can then be derived using the binary comparisons. Techniques are available to analyse such hierarchies [???? REF????], and such an analysis may reveal inconsistency in the user preferences. Inconsistency may be tolerated [???ref???], but detection and notification to the user about its presence can be provided regardless. In this section, we will first formally define the preference model based on partial order prioritisation as described above, including details on inconsistency detection. We then briefly propose an extension of the SQL syntax to provide a deployment blueprint for incorporating user preferences in query formulations. The notion of hierarchy in preferences is defined in the literature [9] as prioritized composition of preferences. For example, completeness of user comments may have priority over completeness of prices. We use the term hierarchy to define a prioritised composition of preferences which can form several levels of priority, i.e. a hierarchy. The hierarchy over the preference relations is quantifiable, such as: a is strongly more important than b, or a is moderately more important than b. [???? maybe this part can be summarised rather than fully presented as a prelude to the next section. We then suggest techniques to translate the prioritization quotes to numeric weights which can be directly used by query planning methods.???]

Definition 0.1 Let relation R(A) be a relation of attributes a ∈ A. Let M = m_1..m_k be the set of k data quality metrics. Let S = S_1...S_n be a set of possible sources for relation

R(A). A preference formula (pf) C(S_1, S_2) is a first order formula defining a preference relation denoted as ≻, namely S_1 ≻ S_2 iff C(S_1, S_2).

Consider the relation ShopItem(Title, Price, UserComments) from source S denoted as R_S(t, p, u); the quality metric completeness is denoted as m_1 and the completeness of the price is denoted as p.m_1. A preference relation can thus be defined as C: (t, p, u) ∈ R_S ≻ (t′, p′, u′) ∈ R_{S′} iff p.m_1 > p′.m_1, where p.m_1 is from the source S and p′.m_1 is from the source S′. A hierarchy (prioritized order) of preferences is defined as below [9].

Definition 0.2 Consider two preference relations ≻_1 and ≻_2 defined over the same schema. The prioritized composition ≻_{1,2} := ≻_1 ▷ ≻_2 of ≻_1 and ≻_2 is defined as: S_1 ≻_{1,2} S_2 ≡ S_1 ≻_1 S_2 ∨ (¬(S_2 ≻_1 S_1) ∧ S_1 ≻_2 S_2).

The formal notions above are translated below into pseudo-SQL in order to illustrate the quality aware query. A specialized HIERARCHY clause is defined to identify the hierarchy of preferences. It is assumed that sources with the maximal value of a data quality metric are preferred (i.e. source S_k where σ_m(a) is greater than or equal to σ_m(a) of all S_i, i: 1..n); thus, there is no need to explicitly define this preference in the query. Only the hierarchy (prioritized order) of those preferences is written in the query. The example hierarchy clause HIERARCHY(ShopItem.p) p.currency OVER (p.completeness) is defined as below. Let p be the Price attribute, m_1 and m_2 be the currency and completeness metrics, and p.m_1 and p′.m_1 be the currency of the column Price, from the quality-aware meta

[Figure 5.1: (a) Circular inconsistency (b) Path inconsistency (c) Consistent graph]

data related to sources S and S′:

(t, p, u) ≻_{0,1} (t′, p′, u′), where
(t, p, u) ∈ S ≻_0 (t′, p′, u′) ∈ S′ iff p.m_1 > p′.m_1, and
(t, p, u) ∈ S ≻_1 (t′, p′, u′) ∈ S′ iff p.m_2 > p′.m_2.

A HIERARCHY clause can be modelled as a directed weighted graph, in which nodes represent items, directed links represent preferences, and preference intensities represent link weights. The preference graph is defined as follows:

N := {a_i | 1 ≤ i ≤ n}
G := {(a, a′, d) | a, a′ ∈ N ∧ a ≠ a′ ∧ d ∈ {1..9}}

where n is the number of nodes or items in the preference, N is the set of all items, and G is the graph representing the preferences.

5.1 Inconsistent Partial Order Graphs

As discussed earlier, partial orders in DQAQs include a numeric intensity representing quantitative words like very strongly, strongly, slightly, etc., in [29] represented as numbers from 1 to 9.
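A minimal sketch of this graph representation (hypothetical names; edges carry the 1..9 intensity):

def add_preference(graph, a, b, d):
    """Record 'a is preferred OVER b with intensity d' as a weighted directed edge."""
    assert a != b and 1 <= d <= 9
    graph.setdefault(a, {})[b] = d

# e.g. the clause p.currency OVER (p.completeness) with intensity 3:
prefs = {}
add_preference(prefs, 'p.currency', 'p.completeness', 3)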

Accordingly, we define accumulative transitivity over preferences as below:

Definition 1.1 If A OVER B with intensity n, and B OVER C with intensity m hold, then A OVER C with intensity m+n holds, where m, n ∈ {1..9} and the accumulated intensity is capped at 9, i.e. if m + n > 9 then m + n is taken as 9.

Accumulative transitivity can be assumed for most real-world preferences. For example, if someone loves coffee, likes coke, and hates tea, he strongly prefers coffee to coke and strongly prefers coke to tea; thus he very strongly prefers coffee to tea. It does not make sense if he specifies that he prefers coffee to tea only slightly. In this paper, we assume accumulative transitivity holds in all cases. Two types of inconsistencies can be found in a preference graph:

Circular inconsistency: A graph has a circular inconsistency when a circular loop exists in the graph. In this situation, there would be at least two nodes n, n′ such that both clauses n is preferred to n′ and n′ is preferred to n can be inferred from the graph. Figure 5.1(a) depicts a set of preferences that make an inconsistent state because, taking any two nodes, e.g. A and B, both of the following clauses can be inferred from the graph: A is preferred to B and B is preferred to A.

Path inconsistency: A graph has a path inconsistency when there is a pair of nodes n and n′ that can be connected by at least two different paths with different accumulative weights. In this situation, it is inferred that n is preferred to n′ but with different intensities, which does not make sense. The graph in Fig. 5.1(b) is in an inconsistent state since, considering the two nodes A and C, both of the following clauses can be inferred from the graph: A is slightly preferred to C and A is highly preferred to C.

Data: G := {(a, a', d) | a, a' ∈ N ∧ a ≠ a' ∧ d ∈ {1..9}}, together with the distance function
    d(x, y) := 0 if x = y; n if (x, y, n) ∈ G; −n if (y, x, n) ∈ G; φ otherwise
Result: True (consistent) or False (inconsistent)

Define M[n, n], a matrix over {φ} ∪ {−9, ..., 9}
for k := 1 to n do
    i := k − 1
    while i > 0 ∧ (M[i, k] = φ ∨ M[i, k] = 0) do
        i := i − 1
    end
    for j := 1 to n do
        if i > 0 ∧ M[i, j] ≠ φ then
            M[k, j] := M[i, j] − M[i, k]
        else
            M[k, j] := d(a_k, a_j)
        end
    end
    for j := 1 to n do
        if d(a_k, a_j) ≠ φ ∧ M[k, j] ≠ d(a_k, a_j) then
            return False
        end
    end
end
return True

Algorithm 3: Inconsistency detection in quality-aware query systems

5.2 Inconsistency Detection

Definition 2.1 A directed weighted graph $G$ is inconsistent if either a loop exists in the graph, or there exists at least one pair of nodes $n, n' \in N$ connected by at least a pair of paths $P_k, P_{k'}$ with $w_k \ne w_{k'}$, where $P$ is the set of all paths $P_1, P_2, \ldots, P_k, P_{k'}, \ldots$ between the two nodes, with weights $w_1, w_2, \ldots, w_k, w_{k'}, \ldots$

We call the process of checking the consistency of the graph, by searching for the existence of a loop or of a pair of inconsistent paths, the inconsistency check. Brute-force searching for inconsistent paths in the graph is exhaustive: there are $n(n-1)/2$ combinations of node pairs, and all possible paths between each pair would have to be tested.

Figure 5.2: Graph G with a circular inconsistency and its relevant distance matrix

Alternatively, the consistency check can be described as follows. Given the graph G, check that: 1) no loop exists in G; and 2) for any arbitrary pair of nodes $n_k, n_{k'}$, the shortest and the longest connecting paths have the same weight. While the first part of the consistency check is the classic loop-detection problem in graphs, with complexity O(n), the second part is hard: previous studies prove that searching for longest paths in a graph is an NP-hard problem [5]. Since weights are only numeric representations of the subjective priority of preferences, accumulated weights have an upper limit and never pass it; thus, our problem reduces to an approximated version of shortest- and longest-path identification (on the accumulative weight, in this paper) that is bounded by a strict upper bound. Additionally, we do not really need to find both the shortest and the longest paths, or to exhaustively search all possible paths between any two nodes: as soon as we reach a point where inconsistency is inevitable, we can judge the graph inconsistent. Considering the above specifics, we propose an algorithm that detects inconsistencies in the preference graph in O(n²) time (as can be seen from the number of nested loops in the algorithm). Algorithm 3 takes the preference graph as input and determines whether the graph contains inconsistent preferences. It checks for both circular and path inconsistencies.

We now illustrate the execution of this algorithm on the example graph of Fig. 5.1(a). Consider $N := \{a_i \mid 1 \le i \le n\}$ to be the nodes of the preference graph $G$, where $n$ is the number of nodes. We define the matrix $M_{n \times n}$ as the accumulative distance matrix of graph $G$, where M[i, j] is the accumulative distance between nodes $i$ and $j$. The accumulative distance between two nodes in the graph represents the accumulated preference weight of one item over another in the actual preference query. Consider the graph $G$ of Fig. 5.1(a) with four nodes A, B, C, D; Fig. 5.2 illustrates its accumulative distance matrix. The main diagonal of $M$ consists of zeros, since the distance between each node and itself is zero. Starting from node A, Fig. 5.2(a) displays the distances from A to nodes B and D in matrix $M$; at this stage the distance from A to C is not known, since there is no direct link between A and C. The second row of matrix $M$ in Fig. 5.2(b) displays the distances between node B and the other nodes. It is generated by adding −3, the distance from node B to A, to the first row. The distances from node B to nodes A and C in the graph should be the same as those calculated in matrix $M$. The third row of matrix $M$ in Fig. 5.2(c) displays the distances between node C and the other nodes, calculated by adding −3, the distance from node C to B, to the second row. The third matrix row implies that the distance from C to D should be −9, but the real distance from C to D in graph $G$ is 3. This difference indicates that there is an inconsistency in the graph.
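To make the procedure concrete, here is a minimal Python sketch in the same spirit (our own structure, not a transcription of Algorithm 3; for clarity it compares exact accumulated weights and ignores the saturation at 9 from Definition 1.1). It assigns each node a potential so that every edge (a, b, d) satisfies potential(a) − potential(b) = d, and reports an inconsistency whenever a loop or a second path implies a different accumulated weight:

from collections import deque

def is_consistent(n_nodes, edges):
    # edges: iterable of (a, b, d), read as "a is preferred to b with
    # intensity d" (d in 1..9).  A conflict while propagating potentials
    # means two paths (or a loop) imply different accumulated weights,
    # i.e. the graph is inconsistent.
    adj = {v: [] for v in range(n_nodes)}
    for a, b, d in edges:
        adj[a].append((b, -d))  # following the edge lowers the potential by d
        adj[b].append((a, +d))  # traversing it backwards raises it by d
    potential = {}
    for start in range(n_nodes):
        if start in potential:
            continue
        potential[start] = 0
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v, delta in adj[u]:
                p = potential[u] + delta
                if v not in potential:
                    potential[v] = p
                    queue.append(v)
                elif potential[v] != p:  # two paths disagree on the weight
                    return False
    return True

# The cycle reconstructed in the walkthrough above (A>B, B>C, C>D, D>A,
# each with intensity 3) is detected as inconsistent:
print(is_consistent(4, [(0, 1, 3), (1, 2, 3), (2, 3, 3), (3, 0, 3)]))  # False
# A consistent graph: A>B = 3, B>C = 3, A>C = 6.
print(is_consistent(3, [(0, 1, 3), (1, 2, 3), (0, 2, 6)]))             # True

Like the matrix-based algorithm, this visits each node and edge a constant number of times, consistent with the complexity discussion above.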

5.3 Normalization of DQ Preferences for Query Planners

Although preferences defined as partial orders are suitable for user interaction, they cannot be easily utilized by query planners. Normalizing partial orders to numeric weights is essential if the mediator in Figure 2.1 is to make use of the defined preferences. Fortunately, there are widely accepted techniques in the decision-making field that we can employ to normalize user preferences [29]. In [28] a decision-making approach called the Analytic Hierarchy Process (AHP) is proposed. In AHP, the decision hierarchy is structured as a tree of objectives: at the top the goal (the query, in our case), then the higher-level objectives (attributes), and finally the lower-level objectives (DQ metrics). This is followed by a process of prioritization. Prioritization involves eliciting judgements in response to questions about the dominance of one element over another when compared with respect to a property. The prioritizations form a set of pairwise comparison matrices for each objective level. Within each objective level (query or attribute), the elements of a pairwise comparison matrix represent the intensity of the priority of each item (e.g. a DQ metric) over another on a scale of 1 to 9.

The Hierarchy clauses (Section 5) can be directly mapped to pairwise comparison matrices. Algorithm 4 presents the approach. It (i) maps a quality-aware query to an AHP decision tree, (ii) maps Hierarchy clauses to pairwise comparison matrices, and (iii) returns the weights assigned to each data quality metric by applying the AHP technique over the generated decision tree and comparison matrices.

Data: SELECT {a | a ∈ A} FROM R
      HIERARCHY(A) {a OVER a' | a, a' ∈ A ∧ a ≠ a'},
      {Hierarchy(a) {a.m OVER a.m' | m, m' ∈ M ∧ m ≠ m'} | a ∈ A}
      where A ⊆ attributes(R) and M is a set of DQ metrics
Result: W := {w_{a.m} | a ∈ A ∧ m ∈ M}

Define Q as an AHP tree
r := Q.RootElement
foreach a ∈ A do
    r.AddChild(a)
end
r.PairwiseComparison(a, a') := t from the Hierarchy clause "a OVER a'", for all a, a' ∈ A, a ≠ a'
foreach a ∈ A do
    foreach m ∈ M do
        a.AddChild(m)
    end
    a.PairwiseComparison(m, m') := t from the Hierarchy clause "a.m OVER a.m'", for all m, m' ∈ M, m ≠ m'
end
return AHP(Q)    // yields w_{a.m} for each a and m using the AHP technique

Algorithm 4: AHP tree from a quality-aware SQL query

Assume R is an arbitrary view and M is an arbitrary set of DQ metrics. If a hierarchy "a OVER a'" is not defined in the user query, it can be estimated as follows: if the reverse hierarchy "a' OVER a" with weight t is defined, use a hierarchy of "a OVER a'" with weight 1/t; otherwise use a hierarchy of "a OVER a'" with weight 1. Algorithm 4 initially generates the tree root node Q. It then adds all attributes in the query as child nodes. The pairwise comparison matrix for the root node is generated by substituting each matrix pair with the corresponding hierarchy clause weight. A tree node for each DQ metric in the query is added under each attribute node, and the pairwise comparison matrices are generated accordingly. Eventually, the AHP technique assigns a weight to each DQ metric of each attribute, which can be directly used by the query planner for measuring the effect of the various DQ metrics on the overall ranking of the query plans.
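As an illustration of step (iii), the following Python sketch derives weights for one objective level from its OVER clauses, using the widely used geometric-mean approximation of the AHP priority vector (the function names and the clause encoding are ours; [28] derives priorities from the principal eigenvector of the comparison matrix):

import math

def ahp_weights(items, over):
    # over maps (a, b) -> t for a clause "a OVER b" with intensity t (1..9).
    # A missing reverse pair defaults to the reciprocal 1/t, and a fully
    # unspecified pair to 1, mirroring the estimation rule stated above.
    def ratio(a, b):
        if a == b:
            return 1.0
        if (a, b) in over:
            return float(over[(a, b)])
        if (b, a) in over:
            return 1.0 / over[(b, a)]
        return 1.0  # no clause given: treat the pair as equally important
    # Geometric mean of each row of the comparison matrix, then normalize.
    gm = {a: math.prod(ratio(a, b) for b in items) ** (1.0 / len(items))
          for a in items}
    total = sum(gm.values())
    return {a: g / total for a, g in gm.items()}

# "Price OVER Tax" with intensity 7 and "Price OVER UserRating" with 3:
print(ahp_weights(["Price", "Tax", "UserRating"],
                  {("Price", "Tax"): 7, ("Price", "UserRating"): 3}))
# Price receives roughly 0.69 of the weight, Tax about 0.13,
# UserRating about 0.17; the weights sum to 1 as the planner expects.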

5.4 Summary

The goal of DQAQS is essentially to improve user satisfaction with the query result by providing higher-quality data in the query results. Understanding user preferences on data quality is an essential prerequisite for improving the DQ of the query results, and in this section we studied that problem. Modeling user preferences is an inherently complex problem and has been studied in various fields, from human psychology to computer science. We believe that modeling user preferences as a set of partial orders is natural to human beings. Hence, in this section we presented a technique for modelling user preferences on data quality as a set of partial-order prioritizations. We formally defined user preferences on DQ, and we suggested an extension to the SQL language that can be used to define them. Although prioritization as partial orders is natural for users, confusion factors in the implementation of the user interface can be a source of inconsistency in the preferences defined by the user. We therefore also proposed techniques to efficiently discover inconsistencies in user preferences. Using these techniques, UI implementations can provide online feedback to the user and assist them in providing consistent preferences.


6 Quality Aware Query Response

The last component of DQAQS to discuss is query planning. As mentioned in Section 2, query planning is one of the roles of the DQ-aware mediator. The mediator has access to a DQ profile (see Section 4) for every single source. It also receives DQ-aware queries from users (queries that include user preferences on DQ; see Section 5). The techniques provided in Section 4 empower the mediator to estimate the DQ of the result of running a query against any of the

data sources, and the techniques provided in Section 5 empower it to understand user preferences on DQ. In a generic data integration system, the final query answer consists of the integrated results of multiple query plans, where each query plan may span the user-defined query across multiple data sources. In this section, we first provide a brief introduction to query planning in data integration systems. We discuss the different components of a query planning system, and we argue that the definition of a DQ-aware utility function is enough to make existing query planning and optimization techniques data quality aware. Note that this section does not provide a new contribution; instead, it focuses on leveraging the existing literature for DQAQS.

The process of data integration in DQAQS is similar to the generic process of data integration. In a data integration system, the user issues the query. The mediator generates an ordered set of sound query plans on a mediated schema for a number of data sources [6]. The queries are run against the data sources, and the schema of the results is mapped to the mediated schema. The results from the various data sources are merged, and duplicate data is removed. In the first step of this process, the mediator translates the query into a set of query plans against different data sources. Several algorithms have been proposed to reformulate user queries [12]. Each plan is a query formulation over a number of data sources and specifies a way to access the sources and combine their data to answer the user query. When the number of sources is large, the mediator must consider many query plans [11]. Examining and executing all query plans is not possible except for trivial problems; hence, the mediator should generate the plans in decreasing order of utility. Optimization is a crucial part of this step [12]. Utility is a number (resulting from the utility function) that defines the relative goodness of a plan in relation to the other plans.

No single query plan is guaranteed to produce all the answers; hence, the answer to a user query is the union of the outputs of all query plans. If the number of plans is large, not all query plans can be executed, and the mediator should execute the query plans (ordered by their utility) in a way that maximizes coverage until the time or cost limit is reached. Many optimization approaches focus on minimizing the cost of obtaining a reasonable number of answers from the sources, because for applications with a large number of sources, query plans tend to vary significantly in their utility (e.g., coverage, execution time, cost, etc.) [11]. The most important optimization problem in query reformulation and optimization is to find query plans in decreasing order of their utility. This problem is well studied in the literature; in [11] the problem of plan optimization is studied for a generic utility, which is a good basis for our discussion. The query plans are executed in order, and the results are integrated by removing duplicates [6] until the time and cost limits are reached. In [6] a survey is given of techniques for merging query results for presentation to the user. In our work, this last phase of data integration is no different from the literature; therefore, to define quality-aware query planning, we narrow our focus to the query optimization phase. As mentioned before, the major challenge in query planning is to sort the query plans in order of utility. Works like [11] provide query planning techniques that achieve this goal based on a generic utility function, where the query reformulation and optimization technique is independent of the utility function. In DQAQS, the DQ of the query result is one of the factors that define the goodness of a query plan, and it can be one of the factors in the utility function.
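Before turning to the utility function itself, the ordered-execution loop described above can be sketched in a few lines (a minimal sketch; the function names, the callable parameters, and the time-only budget are our simplifying assumptions):

import time

def execute_in_utility_order(plans, utility, run_plan, budget_seconds):
    # Execute query plans in decreasing order of utility until the time
    # budget is exhausted; the answer returned to the user is the union
    # of the outputs of the plans that were actually run.
    results = []
    deadline = time.monotonic() + budget_seconds
    for plan in sorted(plans, key=utility, reverse=True):
        if time.monotonic() >= deadline:
            break  # budget reached: stop and return what we have
        results.extend(run_plan(plan))
    return results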

Therefore, we narrow the problem of DQ-aware query planning down to the problem of defining a DQ-aware utility function. A DQ-aware utility function has several factors. One major factor in DQAQS is the estimated DQ of the query plan; the other factors are generic utilities like coverage, execution time, server response time, etc. Some utilities have a positive effect on the final utility value, and some have a negative effect: DQ metrics have a positive effect (higher DQ is better), while server response time has a negative effect (less is better). The problem of query planning with different types of utilities is well explored. For example, [24] proposes plan optimization algorithms based on LAV where the utility factors are of both positive and negative nature. If we can estimate the DQ of a query plan, it can be fed to solutions such as [24], along with the other utilities (e.g. server response time), to calculate the final utility. Therefore, the problem of data quality aware query planning boils down further to the problem of estimating the DQ of a query plan. In the rest of this section we focus on the problem of estimating the DQ of a query plan for a quality-aware query. For simplicity, we consider the utility of a plan to be based only on the DQ of the query results; for brevity, we use the terms utility function and utility of a plan for just the part of the utility that reflects DQ. We discuss methods to estimate the DQ of a query plan; this number can be fed to the utility function to make generic query planning techniques for data integration environments DQ aware. Note that this section makes no new contribution; instead, it leverages existing query planning techniques to suggest data quality aware query planning.

6.1 Estimation of the DQ of a query plan

In DQAQS, our goal is to maximize data quality based on user preferences; therefore, the utility of a plan should reflect the data quality of the query result produced by the plan. The data quality of the query result consists of the measurements of the different DQ metrics on the query result, weighted by the user preferences on those metrics. The utility of a plan is commonly defined as a number that indicates the relative worthiness of the plan. Some works assume that the utility of a plan can be computed in isolation from other plans, while others adopt a more general notion of utility that is relative with respect to the other plans and the query. For DQAQS, we use the more general notion of utility $u$: for plan $p$ with respect to a set of plans $p_1 \ldots p_l$ and query $Q$, given that plans $p_1 \ldots p_l$ have been executed, the utility of $p$ is denoted $u(p \mid p_1 \ldots p_l, Q)$. As discussed above, in this section we consider utility to be the DQ of the query result.

Consider a query $Q$ and a list of $n$ data sources $S_1 \ldots S_n$. User query $Q$ can be expressed as a conjunctive query $Q \leftarrow R_1, \ldots, R_m$, where $R_i$ is the part of the schema that is selected from $S_i$. A conjunctive query plan $p$ (or plan $p$ when there is no ambiguity) has the form $p \leftarrow V_1, \ldots, V_m$, where each $V_i$ is a source relation corresponding to data source $S_i$. Plan $p$ accesses and combines data from the sources $V_1, \ldots, V_m$ to produce query results in response to query $Q$. We first discuss methods to define the utility function for DQAQS in the simplified situation where the plan $p$ is against only one data source $S_0$, that is, $p \leftarrow V_0$. We then generalize the discussion to a plan with multiple data sources.

In DQAQS, queries contain user preferences on data quality in addition to the normal relational query information (e.g. projection and selection conditions).

Such preferences are available to the mediator in the form of partial-order prioritizations over the DQ metrics and relational attributes of the query; for example, Price.Accuracy is highly prioritized over NoOfStocks.Accuracy. As discussed in Section 5 and in [34], such partial-order phrases can be translated into weights using the AHP technique. Using the technique presented in [34], a weight $w_\mu$ is calculated for every DQ metric-attribute $\mu_l$. A DQ metric-attribute $\mu_l$ is the short form of metric $m_l$ for an attribute $a_l$; a metric-attribute $\mu_l$ is considered part of $R_i$ iff $a_l \in R_i$. We use the notion of metric-attribute because at this stage we are not interested in the metric or the attribute alone. For query $Q$, there are $k$ metric-attributes $\mu_1, \ldots, \mu_k$, where $\sum_{i=1,\ldots,k} w_{\mu_i} = 1$.

In [34], the simple additive weighting technique is used to define the utility function when the plan consists of only one data source: if the plan $p$ for query $Q$ is against the single data source $S_0$, then $u_p = \sum_{i=1,\ldots,k} w_{\mu_i} \cdot dq_{\mu_i}$, where $dq_{\mu_i}$ is the estimated data quality for metric-attribute $\mu_i$ of source $S_0$. In fact, assuming a data quality profile exists for the metric-attribute $\mu_i$ of data source $S_0$ (see Section 4), the utility function is the user-preference-weighted average of the estimated data quality for query $Q$ against data source $S_0$.

If the query plan $p$ consists of more than one source, estimating the DQ of the plan result is not trivial. For example, a query plan $p$ over sources $S_1$ and $S_2$ runs $V_1$ against $S_1$ and $V_2$ against $S_2$, and then joins the results of these queries to form the query answers for $Q$. The DQ of running $V_1$ against $S_1$ and of running $V_2$ against $S_2$ can be estimated separately (using techniques from Section 4).

The challenge is to estimate the DQ of the result of plan $p$ for a metric-attribute $\mu_i$, given the DQ of $V_1, \ldots, V_l$ for that metric-attribute. Below we suggest three approaches to this challenge.

6.1.1 Statistical Approach

Plan $p$ rewrites the query $Q$ as a conjunction of $k$ queries $V_1, \ldots, V_k$ over data sources $S_1, \ldots, S_k$. Our goal is to estimate the DQ metric-attribute $\mu$ for the result of query $Q$, assuming estimates of the DQ of $V_1, \ldots, V_k$ are available. In [24] an algorithm for query plan optimization is proposed that orders the plans based on the completeness of the query results. Completeness is one of the common DQ metrics that can improve user satisfaction with the query results. Works like [24] require the DQ metric to possess a predictable statistical nature; they then define statistical methods to approximate the propagation of errors through a join. For example, [24] defines the completeness of $V_1$ joined to $V_2$ as the probability that either the left or the right side has a non-null value, calculated as $c_{V_1} + c_{V_2} - c_{V_1} \cdot c_{V_2}$, where $c_{V_i}$ is the completeness of the query results of $V_i$ against $S_i$. The propagation of DQ through the join operation is defined for every other metric in use. As another example, accuracy is defined as the probability that neither query result contains an error, calculated as the product of the accuracies of $V_1$ and $V_2$. If a simple statistical definition can be given for a DQ metric, such that a formula can be proposed to estimate the DQ of the result of a join operation, it can be used directly in the calculation of the utility function.
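As a concrete illustration, the following sketch (our own naming; the join formulas are those cited above, and the weighted sum is the simple additive weighting utility from [34] discussed earlier) estimates the DQ of a two-source plan and turns it into a utility value:

def join_completeness(c1, c2):
    # Probability that at least one side of the join is non-null [24].
    return c1 + c2 - c1 * c2

def join_accuracy(a1, a2):
    # Probability that neither side contributes an error [24].
    return a1 * a2

def plan_utility(weights, dq):
    # Simple additive weighting: u_p = sum of w_mu * dq_mu, weights sum to 1.
    return sum(w * dq[mu] for mu, w in weights.items())

# Per-subquery estimates for V1 and V2 (hypothetical numbers):
dq_plan = {
    "Price.Completeness": join_completeness(0.9, 0.8),  # 0.98
    "Price.Accuracy": join_accuracy(0.9, 0.8),          # 0.72
}
print(plan_utility({"Price.Accuracy": 0.7, "Price.Completeness": 0.3},
                   dq_plan))  # 0.7*0.72 + 0.3*0.98 = 0.798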

6.1.2 Sampling-Based Approach

For most DQ metrics, a simple statistical definition over the join operation cannot be found. Works like [2] provide more generic techniques that utilize samples of the data to estimate the DQ of query results. In [2] a sampling technique is used to estimate the quality of query results. It is defined in the context of a single relational database, but the technique can be applied to multiple sources: it assumes samples are created for every table and estimates DQ for queries consisting of the join of different tables, and there is no real requirement that the tables come from the same data source. In Section 4.2 we proposed sampling techniques to estimate the DQ of a query result against a single data source. We utilize the techniques from Section 4.2 to estimate the DQ of the query result for every sub-query $V_i$ against its single data source $S_i$. We can then feed these sample-based estimates into the methods proposed in [2] to estimate the conjunctive DQ of $Q$.

The method presented in [2] assumes that errors are evenly distributed within the data set: the probability that records in a single data source are of acceptable quality is constant across the table, and the acceptability of a data unit is independent of the acceptability of any other. In some real-world situations these assumptions do not hold. For example, if an address must have both city and post code fields to be considered accurate, the join of two inaccurate addresses (one without a city, the other without a post code) can be accurate. Thus, if the DQ metric value can be assumed to be independent of the join operation, the methods defined in [2] can be used to estimate the DQ of a plan.
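A minimal sketch of the idea under the same assumptions (our own simplification for illustration, not the estimator of [2]): nested-loop join two per-table samples and report the fraction of joined rows that pass a boolean metric-attribute test.

def estimate_join_metric(sample1, sample2, key, metric_ok):
    # sample1/sample2: lists of row dicts drawn from the two sources;
    # key: the join attribute; metric_ok: boolean metric-attribute test
    # applied to a joined row pair.  Assumes record acceptability is
    # evenly distributed and independent, as [2] does.
    joined = [(r, s) for r in sample1 for s in sample2 if r[key] == s[key]]
    if not joined:
        return None  # the samples do not overlap on the join key
    return sum(1 for r, s in joined if metric_ok(r, s)) / len(joined)

items = [{"item": 1, "price": 360}, {"item": 2, "price": None}]
ratings = [{"item": 1, "rating": 4}, {"item": 2, "rating": 5}]
print(estimate_join_metric(items, ratings, "item",
                           lambda r, s: r["price"] is not None))  # 0.5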

6.1.3 Extension of Conditional DQ Profiling

If conditional data quality profiles, as proposed in Section 4, exist, both the statistical and the sampling-based techniques can be utilized to define the utility function. The choice of technique depends on the definition of the DQ metrics and on the possibility of propagating DQ estimates through the join operator. In Section 4 we suggested the use of conditional DQ profiles, and in Section 4.2 we proposed cost- and popularity-based optimization of the conditional DQ profile. A conditional DQ profile is defined over a single data set or view over one data source; however, the concept can be generalized to multiple data sources. If we assume a global data source in which every data set is the union of all mediated data sets, and generalize the definition of the queries used to define samples to query plans, we can create and utilize a global sample trie in which each sample is a query plan and the popularity of each node is the popularity of that query plan. A number of challenges remain to be solved. For example, some query plans might become deserted and never get picked because their popularity is unknown, while other query plans keep being selected simply because their popularity is known.

6.2 Summary

In DQAQS, one query may be answered from multiple data sources. Each data source may contain a part of the schema. Typically the query is rewritten over the available sources, and each rewritten query is considered a query plan. The problem of query optimization is to order the query plans in decreasing order of utility. Query planning and optimization in data integration systems is widely studied in the literature.

In this section we did not suggest any new contribution; instead, we discussed how existing techniques from the literature can be used to optimize query plans with consideration of data quality. We argued that the problem of DQ-aware query planning boils down to the problem of defining a DQ-aware utility function, and we suggested how different techniques from the literature and from other sections of this paper can be used to define one. Three approaches were discussed in this section: a statistical approach, a sampling-based approach, and an extension to conditional DQ profiling. The statistical approach suits systems where statistical formulas can be defined for the DQ metrics over the join operator. Sampling-based techniques suit systems where the DQ metric can be considered independent of the join operation. As the last approach, we suggested an extension of the conditional DQ profiling concept of Section 4 that can be used for a quality-aware utility function.

7 Proof of Concept

In the previous sections we studied the three essential concepts of a DQAQS, namely profiling the quality of data, capturing user preferences on DQ, and planning queries with consideration of data quality. We suggested techniques to implement each of these concepts in a DQAQS. In addition to these three concepts, we also proposed a framework that describes how they can work together.

The aim of this section is to wrap up all the previous discussion in a tangible solution that can be directly tested by users. Therefore, we implement a working model of a data quality aware query system. DQAQS is based on a generic data integration architecture, in which multiple data sources exist over a mediated global schema. Each data source contains a part of the global schema, and different data sources may contain overlapping data.

7.1 Implementation of DQAQS

Although a data integration system consists of many individually hosted data sources, it can safely be simulated on a single host. The effect of the network on accessing the data sources (e.g. latency and availability) can be statistically simulated and applied to the query answering mechanism of each data source. We make two initial assumptions about the data sources and their schemas: first, a global schema is available that is the result of mediating between all the different data sources; and second, the schema of every single data source is a subset of the global schema, i.e. an invisible wrapper around each data source transforms its data to the mediated schema. The main reason for these assumptions is that the wrapper logic is not of interest to this paper. In this section we use a single database to implement all data sources. Assuming the global schema consists of tables $T_1, \ldots, T_n$, and each table $T_i$ has attributes $a_{i1}, \ldots, a_{il}$, every table is named $S_{name}\_T_i$, where $S_{name}$ is the name of the data source.

Figure 7.1: Component architecture of the DQAQS implementation

For example, if ShoppingItems is one of the tables in the global schema and DataSource1 is a data source identifier, the table DataSource1_ShoppingItems represents that table for the data source DataSource1. In addition to the tables that implement the schema of a data source, one extra table exists per data source that contains source-related metrics such as its average latency and availability.

Figure 7.1 illustrates the component-based design of the DQAQS implementation. The system consists of a number of data sources plus the three main components that together implement the three main concepts of DQAQS discussed in this paper. In our simulated environment, each component is a pluggable service that can be swapped for another on user request, to compare the effect of different methods and implementations on the results. The output of each component is not only used by the other components but is also returned to the user (if requested) for analysis of the intermediate results.

7.1.1 Common Protocol Between Components

In Section 5 we studied the problem of communicating the query requirements with the user. The methods provided there present a machine-readable model of the user preferences on data quality.

Understanding the rest of the query is trivial, as it is translatable to relational algebra using conventional libraries. A query containing user preferences on DQ is transferred (and possibly rewritten before being transferred) to the other components of DQAQS. Similarly, the data resulting from running queries against data sources, generating DQ profiles, etc., is transferred to the user and the other components. We have designed the components to be abstract and pluggable; hence, a protocol is required to facilitate communication between the components and the user. Below we suggest a protocol to transfer data and queries with consideration of data quality. Let us assume the following example. The global schema consists of the following items:

- ShoppingItems (Title, Price, Location, and Tax)
- UserRatings (Item, Rating, Photo)

Also assume the following column metrics are measured and queried in the system. Note that not every attribute supports a metric; hence, the definition of the supported metric-attributes is considered part of the global schema. The decision of which metric-attributes to support is an important business decision that leads to an implementation of the metric-attribute for the business need. A single metric is not necessarily implemented in the same way for different attributes; for example, measuring the accuracy of an address does not have much in common with measuring the accuracy of a price. In this example, the following metric-attributes are assumed:

- ShoppingItem.Title.Completeness, which defines whether a title is null or not
- ShoppingItem.Price.Completeness, which defines whether a price is null or not

- ShoppingItem.Price.Accuracy, which defines whether a price is accurate or not by testing it against master data; e.g. if an item is priced between $50 and $100, a price of $500 for that item would be considered inaccurate
- UserRatings.Rating.Completeness, which defines whether the item is rated
- UserRatings.Photo.Completeness, which defines whether a photo is null or not

Our implementation of DQAQS measures only the above metric-attributes. We define the communication protocol for queries between components as follows: every query is modeled as an XML document that has three types of nodes: projection nodes, which define the query's projection statement; selection condition nodes, which define the conjunctive selection conditions of the query; and attribute and metric-attribute relative weights. These weights, which should sum to 1, represent the normalized weight of every metric-attribute defined in the query. For example, a query may project ShoppingItem.Title and ShoppingItem.Price (projection items) for data with $200 < Price < $400 (selection conditions), where the user preferences on DQ are weighted as ShoppingItem.Price.Accuracy = 0.7 and ShoppingItem.Price.Completeness = 0.3. In our implementation of DQAQS we support a limited query definition (e.g. no aggregate queries) to keep the focus on the implementation of the DQ-aware components. However, complex query parsing and processing systems are available on the market that can be leveraged for commercial implementations of DQAQS. The actual data is transferred as normal tables between the different components and to the end user. Measurements of the metric-attributes are appended to the table as additional columns. Such data is frequently passed between the DQ profiler and the DQ planner components.
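For illustration, a minimal sketch of such a query document built with Python's standard library (the tag and attribute names here are hypothetical, since the protocol above fixes only the three node types, not a concrete tag vocabulary):

import xml.etree.ElementTree as ET

query = ET.Element("Query")
# Projection nodes: the query's projection statement.
for col in ("ShoppingItem.Title", "ShoppingItem.Price"):
    ET.SubElement(query, "Projection", attribute=col)
# Selection condition nodes: conjunctive conditions, here 200 < Price < 400.
ET.SubElement(query, "Selection",
              attribute="ShoppingItem.Price", min="200", max="400")
# Metric-attribute weights, normalized to sum to 1.
for mu, w in (("ShoppingItem.Price.Accuracy", "0.7"),
              ("ShoppingItem.Price.Completeness", "0.3")):
    ET.SubElement(query, "Weight", metricAttribute=mu, value=w)

print(ET.tostring(query, encoding="unicode"))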

7.1.2 Registering Data Sources

Figure 7.2 demonstrates the user interaction for registering a data source. Each data source supports a part of the mediated global schema. In Figure 7.2 the user checks boxes in the schema tree. For example, data source DataSource1 contains the tables ShoppingItems and UserRatings; from ShoppingItems it supports the fields Title and Price, and from UserRatings it supports the fields Item and Rating. There is a Browse button in front of every table where the user can upload data for that specific table. The data should be prepared in CSV format and must match the selected schema. Once the user has defined the data sources and uploaded the data, he can use them to examine the various algorithms and techniques. He can also test the final query results as compiled by DQAQS as the query processing system.

7.1.3 Data Profiler

The Data Quality Profiler provides the functionality to profile the quality of data for the data sources. The algorithms suggested in Section 4 are implemented as plug-ins to this component. We implemented four different data profilers. First, we implemented a source-level DQ score profiler, which keeps only a vector of scores for the different DQ metrics per data source. Second, we implemented an attribute-level DQ profiler, which maintains a vector of DQ scores for metric-attributes. These basic implementations help users to study and compare the effectiveness of the solutions proposed in Section 4.

Third, we implemented a conditional DQ profiler, and fourth, a sampling-based conditional DQ profiler.

Figure 7.2: Registering a new data source in DQAQS

The various DQ profiling techniques store their data differently. In our implementation, their data is stored in separate tables that are distinguished from each other by naming convention. Below we demonstrate the definition and utilization of the four implemented DQ profilers.

i - Source level DQ profiler.

The source level DQ profiler is the simplest implementation of a DQ profiler. It stores a vector of metrics for each data source. A source level DQ profile is automatically generated once the user generates a conditional DQ profile; hence, we do not provide a separate user interface for this profiler.

The functionality of the source level profiler is similar to the common technique of marking every data source with a score (e.g. 1 to 5 stars for completeness, accuracy, etc.). A sample source level DQ profile for the ShoppingItem table is: Completeness = 0.63 and Accuracy = 0.46.

Figure 7.3: Generating an attribute based DQ profile in DQAQS

ii - Attribute level DQ profiler.

The attribute level DQ profiler is very similar to the source level DQ profiler, with the difference that it stores a vector of metric-attributes for every table within the data source.

It can estimate the quality of the query result based on the projection of the query. An attribute level DQ profile usually results in a better DQ estimation because it can utilize the user preferences on DQ. Figure 7.3 shows a sample attribute level DQ profile for DataSource1. It contains a table where each column shows a metric and each row defines a table and attribute.

iii - Conditional DQ profiler.

Figure 7.4 shows the user interaction for generating a conditional DQ profile. A conditional DQ profiler generates a large DQ profile in which each row contains a condition in addition to the measurements of the DQ metrics. In the user interface presented in Figure 7.4, in each step the user creates a conditional DQ profile for one metric-attribute. Each metric-attribute has a separate table for storing its conditional DQ profile. The user selects the data source, table, column, and DQ metric, and then runs the profile generation. It is also possible to select all tables, all columns, and all metrics for DQ profile generation. Figure 7.4 shows a sample conditional DQ metric profile generated for DataSource1, table ShoppingItems, attribute Price, and metric Completeness. The first columns of each row of the generated DQ profile define the selection condition. For example, the DQ profile shows that a query with selection condition title = SonyCybershotS20 AND price = $360 will return 50 records (as given in the column #), of which 45 have a complete Price. The column # defines the number of records in the returned query result, and the column Q defines the number of records with a complete Price (as defined by the Price.Completeness metric-attribute).
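The estimate that a planner draws from such a profile row is simply the ratio Q/#; a minimal sketch (the condition encoding is our own, hypothetical choice):

# Hypothetical in-memory rows of the conditional DQ profile for
# ShoppingItems.Price.Completeness: selection condition -> (#, Q).
profile = {
    ("title=SonyCybershotS20", "price=$360"): (50, 45),
}

def estimated_completeness(condition):
    # Return the estimated DQ for a query matching a profiled condition.
    total, good = profile[condition]
    return good / total

# 45 of the 50 matching records have a complete Price:
print(estimated_completeness(("title=SonyCybershotS20", "price=$360")))  # 0.9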

iv - Sample trie DQ profiler.

Figure 7.4: Generating a conditional DQ profile in DQAQS

The implemented sample trie DQ profiler requires a query workload to initialize the sample trie. The maximum DQ profiling cost and the base sampling rate are two user-defined parameters that must be set for the system. The sample trie DQ profile will not exceed the cost, which is expressed as the number of records that are profiled. The user has to prepare a query workload and upload it to the system in CSV format. The CSV should include the selection condition part of each query: every column of the CSV defines an attribute, and its value defines the selection condition.

Figure 7.5: Generating a sample trie based on a query workload in DQAQS

For example, the column Price may have the value 500, which means that the selection condition contains Price = 500; or it may have the value [300, 700], which means that the query contains the selection conditions Price ≥ 300 and Price ≤ 700. Only one sample trie is generated for each data source. The sample trie consists of a forest of individual trees, where each tree is an attribute suffix of the tree above it. Every node of each tree represents a conjunctive selection condition. Every node contains a sample table, which is an actual sample of the data together with the DQ metric measurements for every single row of the sample.

Every DQ metric is a boolean that defines whether the record meets the metric's definition of good quality; for example, a record can have a complete title or an accurate price. The sample trie is stored in an XML document inside the simulated database. Every sample in the trie has a unique ID. The data for the samples is stored in one big table that covers the schema of the data source and holds a reference to the sample ID for each row, so that the data of a single sample can be queried from this table; conversely, given some data in this table, the samples it belongs to can be identified. Figure 7.5 shows a sample trie generated for DataSource1, with one node of the trie selected. This node represents a query that contains the following conjunctive selection conditions: ShoppingItem.Title = SonyCybershot AND ShoppingItem.Price between 500 and 1000 AND UserRating.Rating = 4. Note that this query is a product of the two tables ShoppingItems and UserRatings. The table shown in Figure 7.5 is a sample of the result of the above query. Note that the last columns of the data define the metric-attributes and their values; it can be seen that the first row shown in Figure 7.5 has an accurate and complete value for the attribute Price. This sample can be used to estimate the DQ of other queries using the techniques presented in Section 4. For example, a query for ShoppingItem.Title = SonyPowershot AND Price between 600 and 700 AND UserRating.Rating = 4 can be answered from the selected sample. Other samples may be able to answer other queries.
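As an illustration of the structure just described, the following is a hypothetical in-memory layout of one trie node (the actual implementation instead persists the trie as an XML document and keeps all sample rows in one shared table keyed by sample ID):

from dataclasses import dataclass, field

@dataclass
class SampleTrieNode:
    # One node of the sample trie: a conjunctive selection condition plus
    # a sample of matching rows, each carrying boolean metric-attribute
    # measurements alongside the data itself.
    condition: dict           # e.g. {"ShoppingItem.Price": (500, 1000)}
    sample_id: int            # unique ID referencing rows in the big table
    rows: list = field(default_factory=list)      # row data + DQ booleans
    children: list = field(default_factory=list)  # finer conditions below

node = SampleTrieNode(
    condition={"ShoppingItem.Title": "SonyCybershot",
               "ShoppingItem.Price": (500, 1000),
               "UserRating.Rating": 4},
    sample_id=42,
    rows=[{"Title": "SonyCybershot", "Price": 620,
           "Price.Completeness": True, "Price.Accuracy": True}],
)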

7.1.4 Query Parser

We have already proposed a standard protocol by which the different components of DQAQS transfer user queries together with the users' preferences on data quality. It consists of projections, selection conditions, and weighted metric-attributes. Although this protocol is suitable for implementation in a software system, it is not suitable for communicating with end users. The techniques presented in Section 5 can communicate user preferences on DQ with the user in a way that is natural to humans; however, they do not suggest any particular UI design. In this section we implement the techniques proposed in Section 5 with two different designs: first, a textual representation of the suggested hierarchical prioritization, and second, a more graphical implementation that replaces the hierarchical prioritization clauses with interactive circles whose sizes translate to the prioritization clauses.

Figure 7.6: Capturing user preferences using DQ aware SQL in DQAQS

Figure 7.6 shows an example of a user defining prioritization clauses as text. Each line of the text in the prioritizations part defines how the different attributes in the query weigh against each other, and how the different metrics of one attribute weigh against other metrics. It is not required to provide all possible prioritization clauses: the user enters only the prioritization clauses that are important to him, and the system automatically normalizes the weights of the attributes and metrics not defined by the user. The prioritization clauses defined in Figure 7.6 state that the data quality of the attribute Price is highly more important than the data quality of the attribute Tax, that the DQ of the attribute Price is slightly more important than the DQ of the attribute UserRating, and that, for the attribute Price, the metric Accuracy is slightly more important than the metric Completeness.

Figure 7.7: Capturing user preferences using visual tool in DQAQS

Figure 7.7 shows an example of the UI that lets the user define the prioritization clauses through graphical interaction. In Figure 7.7 the sizes of the circles define how important each attribute or metric is compared to the other attributes or metrics. The user can press the mouse button on a circle and drag to make the circle bigger or smaller. As the user changes the size of a circle, the other circles resize accordingly to reflect the correct pairwise prioritizations.

7.1.5 Query Planner

We implement a query planner using the statistical approach discussed in Section 6. The statistical approach to query planning uses statistical formulas, defined as part of each metric definition, to estimate the DQ of a query plan from the DQ of its individual sub-queries. Figure 7.8 shows the query plans ordered by their utility, which in this case is the estimated DQ of the results of running each query plan. Figure 7.8 also shows the user-defined query: the user is looking for the price of items with Title "Canon Powershot" and Price greater than 400, prefers query plans that maximize the Accuracy of Price, and cares little about the quality of Tax. The table in Figure 7.8 shows that a query plan joining results from DataSource1 and DataSource2 has the maximum data quality. The actual query plan is a rewrite of the query over different data sources; the list of detailed query plans is too large to display, hence Figure 7.8 shows a summary of each query plan. However, the utility is the important factor for the planning algorithm when selecting a plan.

Figure 7.8: Optimization of query plans in DQAQS

7.1.6 Final Results

Although the implementation of DQAQS allows the user to monitor the different steps of the query answering process discussed above, the actual goal of DQAQS is to improve user satisfaction by providing the highest-quality query results based on the user's preferences. Figure 7.9 shows an actual query result generated by DQAQS: the user enters the query with DQ preferences and receives the query result in the order that best satisfies his requirements. The query result in Figure 7.9 contains the attribute metrics as additional columns.

Figure 7.9: Query results using DQAQS

In Figure 7.9, the DQ metric measurements are added to the result after the generation of the query results, as a post-processing step, merely for evaluation purposes; in normal implementations there is no need to return the actual DQ metric measurements to the user.

7.2 Summary

In this section we discussed the implementation of a DQAQS and some of its technical details. We discussed how to simulate multiple data sources on one host. As in general data integration systems, there is a mediated global schema, and data