Fall 2011
Andrew U. Frank, October 23, 2011
TU Wien, Department of Geoinformation and Cartography, Gusshausstrasse 27-29/E127.1, A-1040 Vienna, Austria
frank@geoinfo.tuwien.ac.at

Part I
Introduction

1 The data producer's perspective

Current approaches to geo data quality are pushed by the producers of geo data, primarily the National Mapping Agencies (NMA), to communicate the standards they maintain [nsds, morrison] and to coordinate necessary specifications with other agencies which are potential users of their data [fcdic]. Some NMA make their data available against substantial fees and use data quality arguments to try to convince potential users that the data are of high quality and that the high prices are therefore justified [some german/austrian publication?].

Data producers, like producers of other goods, claim that their product is of high quality. Unlike material goods, which are produced for a particular use, data can be used for many purposes; this is what GIS, previously called the multipurpose cadastre [], is all about [wisconsin paper]. For a physical good, e.g. a pair of scissors for cutting hair, it is relatively clear what requirements a user has and what 'high quality' means; but note that a pair of scissors for cutting paper is not of high quality if one intends to cut hair (or the reverse!). I would interpret high quality as: fulfills the user's requirements for the intended purpose. But what does 'high quality' mean for a good with a not yet defined and thus not yet known use? High quality for what? Here the problem of geo data quality starts!

The producer knows the production method, the instruments used, etc., and understands their effect on the quality of the data collected. A surveyor can describe the statistical deviations from true values for the coordinates of the points determined, assuming a normal distribution and indicating the mean and standard deviation of the results. But precision of location is not the only aspect of geo data quality: geo data quality has multiple aspects. Besides the precision of locations, it obviously matters when the data was collected, which themes are included, how they are coded, etc. In the mid 1980s, tentative lists of the data quality aspects to be included were published [chrisman, frank 1986]. They differentiated between precision and resolution for the three components of geo data: geometry, thematic data, and time [sinton]. As a cop-out aspect, lineage was added, under which data producers should describe where the data originated and how it was produced and processed. These views, more or less reworked, are the state of the art in today's data quality standards [refs].

<<insert a matrix: precision, resolution // geometry, time, theme>>

Only later was it discovered that these aspects are not orthogonal to each other [frank?]. For example, spatial and temporal precision are hard to separate completely: an uncertain location with a sharp time stamp cannot be differentiated from a certain location with an uncertain time stamp (fig.). This is an application of Sinton's generic description of geographic data with three aspects (location, time, theme), of which one is fixed, one varies as the independent variable, and the last is the dependent variable.

Practical progress in reporting data quality for data sets is slow, despite the publication of standards. Hunter started a systematic investigation [ref], which revealed not only many missing indications but often uninformative values. What should a user do with a description of geometric precision as 'varying'? Not much more is learned from 'precision between 2 m and 10 km'. I take this as an indication that the practitioners among the data producers know that users hardly ever consult the metadata, and that the data quality values in the metadata hardly ever help the user decide whether to use the dataset or not. This is confirmed by studies of user behavior [ann boin], which reveal the other information users rely on when deciding whether to use a dataset.

The separation of the producer's point of view from the perspective of the user, introduced by Timpf [], helped the research out of the impasse and stagnation. It posed a number of new questions for research: how to describe the user's requirements, and how to connect the producer's descriptions of data quality with the user's requirements. These questions are the major driving force and provide the guideline for the presentation in this course. The practical goal of geo data quality research should be to achieve an operational connection between the data quality description from the producer's perspective, which we know how to produce, and the user's decision whether a geo data set is useful for him and should be acquired and used.

2 The user's perspective

If we consider the user's perspective on data quality, we have to ask why a user would acquire a dataset and how data quality affects this decision. It is obvious that data which is not useful to the user will not be acquired; but what does 'not useful' mean in this context? To answer the question why a potential user would acquire some data, we have to look into the user's situation.

2.1 Data serve only in decision situations

When does a user need data? The only use of data is to improve decisions; this is the only use of data! A user will therefore consider the acquisition of data only when he needs it to make a decision, i.e. in a specific situation, not out of some generic need to know. The modern, highly distributed methods of decision making in corporations and public administration produce many situations where potential decision-makers ask for data, but the request is always related to some possible decision situation. The decision not to act is a decision as well; decision-makers typically ask for information to help them first decide whether an action by them is necessary, and often no further action is observable, meaning that the decision was not to act.

2.2 Model of decision making

A model of a decision is required for a formal analysis: a decision is a choice between different alternative actions, represented as $a_1, a_2, \dots, a_n$. A person decides between the alternatives such that the outcome of the selected action promises to be the best, the most advantageous outcome for him. In his mind, the outcome of each action $a_i$ is the transformation of the current state $s_0$ into a new state $s_i$; the states $s_i$ are evaluated by a valuation function $v$, which produces for each state $s_i$ the corresponding value $v_i$. The action which corresponds to the highest value $v_i$ is the most advantageous and is therefore selected.

<<figure>>

Note that we do not assume that the user knows exactly what state follows from an action and what his valuation of this state will be after execution. The concept of bounded rationality introduced by Herb Simon [] posits only that the decision maker has some idea of what the outcome will be and how he imagines the value of this outcome. From experience, we all know that we are sometimes very limited in what we know and select actions because we erroneously imagine an outcome which never materializes; we are disappointed when we recognize our error in expecting a specific outcome, or the error in valuation of an outcome we imagined much nicer than what is actually achieved.
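To make the model concrete, the following small sketch in Python selects the action with the highest valued imagined outcome. It is a minimal illustration of the model above, not from the original text; the names choose_action, outcome, and valuation are assumptions made for this example.

from typing import Callable, Sequence, Tuple

def choose_action(
    s0: object,
    actions: Sequence[str],
    outcome: Callable[[object, str], object],  # s_i = imagined result of action a_i in state s_0
    valuation: Callable[[object], float],      # v_i = v(s_i)
) -> Tuple[str, float]:
    # Evaluate every alternative a_i: imagine the new state s_i, value it,
    # and select the action with the highest value v_i.
    best = max(actions, key=lambda a: valuation(outcome(s0, a)))
    return best, valuation(outcome(s0, best))

Note that outcome and valuation encode the decision maker's possibly erroneous imagination, not reality; this is exactly the point of bounded rationality.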

Figure 1: Model for decision making; without information, the maximum of $v_i$ ($i = 1, 2,$ or $3$) determines the selected action. Information is acquired if $v_c$, the expected value achieved with information, is larger than the maximum of the $v_i$.

2.3 Role of information in decision making

Assume a decision maker with the alternatives $a_1, a_2, \dots, a_n$ as before, but with the additional choice to acquire some data $d$ which contains information relevant to the decision (Fig. 2). When should the data $d$ be acquired? Let us label the alternatives, when executed after acquiring the data $d$, with primes: $a'_1, a'_2, \dots, a'_n$, to which outcomes $s'_i$ with valuations $v'_i$ belong (Fig. 2). Given the additional information the user has, neither the outcomes nor the valuations are necessarily the same as the ones he would expect without the acquired information. A rational decision maker will again select the action among $a_1, \dots, a_n$ and $a'_1, \dots, a'_n$ which gives the best value.

The apparent value of the information is its contribution to improving the decision, i.e. the difference between the maximum of the values $v'_i$ and the maximum of the values $v_i$. The acquisition of the data was worthwhile if the maximum of the values $v'_i$, say $v'_m$, is larger than the maximum of the values $v_i$, say $v_m$; a rational user should be ready to pay the difference between $v'_m$ and $v_m$.

With the assumption of bounded rationality, one must actually include an additional compound decision $a_c$, which is the action of acquiring the data and then selecting the best decision; the value $v_c$ expected initially, before acquiring the data, enters the assessment of the willingness to pay for acquiring data as $v_c - v_m$. The real value of the data is only revealed after the fact, when the actions are carried out and the real outcome of the decisions becomes apparent. The effect of acquiring data is often (only) a reduction of risk in a decision, which must also be counted as a positive contribution.
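In this model, the value of the acquired data can be computed by comparing the best achievable value with and without it. The following Python sketch, with illustrative names not from the original text, expresses the willingness to pay:

from typing import Sequence

def value_of_information(values_without: Sequence[float],
                         values_with: Sequence[float]) -> float:
    # v_m  = best value achievable without the data (actions a_1 .. a_n)
    # v'_m = best value achievable with the data    (actions a'_1 .. a'_n)
    v_m = max(values_without)
    v_m_prime = max(values_with)
    # A rational user should be ready to pay up to this difference for the data.
    return v_m_prime - v_m

For example, value_of_information([5.0, 8.0], [7.0, 12.0]) returns 4.0: with the data the best decision is worth 12 instead of 8, so acquiring the data is worthwhile at any price below 4. Under bounded rationality the comparison uses the expected value $v_c$ of the compound action, estimated before acquisition, not the (unknown) realized values.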

Figure 2: Model for decision making; after acquisition of information, the improvement of the decision through the information can be evaluated.

3 Model of data quality from a user perspective

3.1 When is data correct (from a user perspective)?

Correctness of data is the pinnacle of data quality. When is data correct? Much has been discussed by data producers, and standards state what deviation from values obtained by re-measurement of higher quality is acceptable - often quite arbitrary rules, dictated by practicability, the available resources of an agency, etc. If we take the perspective of the user, the answer is relatively easy: data is correct if it leads to decisions for actions which can be carried out and have the expected results. Some simplistic examples: a railway timetable entry is correct if it leads us to catch the desired train, i.e. if we arrive at the station before the indicated time, we are able to catch the respective train; navigation instructions are correct if they can be followed (i.e. they do not lead to actions prohibited by the driving rules established by law) and lead to the desired goal, i.e. we reach the destination.

This definition of correctness of data from a user perspective does not require that the data give a true description of reality, as is sometimes demanded, but only that the effect of deviations from a true description does not influence the decision substantially - meaning that another decision would be better if the data were better. This leads to an understanding of the value of data, and indirectly of the quality of the data, always related to a specific decision situation. It hints at a reduced need for quality in the data: lack of correctness in the data affects a decision only if another decision would be better than the one selected based on the erroneous data; given that for a decision we seldom have many options, only data good enough to help us avoid selecting the wrong alternative is necessary. This means that approximate data and heuristic methods for decision making are sufficient to select among the few alternatives one has in reality. It is meaningless to ask for data quality from a user perspective without considering a specific decision situation.
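This decision-relative notion of correctness can be stated as a predicate: data is correct enough for a decision if deciding on it selects the same action that would be selected on a true description. A minimal sketch in Python, assuming a decide function as in the decision model above (the names are illustrative, not from the original text):

from typing import Callable, Tuple

def correct_for_decision(decide: Callable[[object], Tuple[str, float]],
                         data: object,
                         true_data: object) -> bool:
    # decide maps a dataset to (selected action, its value), cf. Section 2.2.
    action_from_data, _ = decide(data)
    action_from_truth, _ = decide(true_data)
    # The data is correct for this decision if its deviations do not
    # change which action is selected.
    return action_from_data == action_from_truth

A timetable rounded to the minute is 'correct' under this predicate for the decision of which train to take, even though it is not a true description to the second.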

3.2 Quality of a decision

Assume a decision situation where the optimal decision is $\tilde{a}_m$ and the decision made with the available information is $a_m$; the value of the information is the improvement of the decision, and the degraded available information is thus just the difference between the corresponding values, $\tilde{v}_m - v_m$, less valuable than the perfect one.

Consider decision making as a function $d$ from some input data values $d_i$ to a decision:

$(a_i, v_i) = d(d_i)$

Applying ideas from adjustment computations to this decision function, one posits that the optimal decision $\tilde{a}_m$ results from the correct values $\tilde{d}_i$ for each input data element. In consequence, the contribution of the deviation of each data element from its correct value can be computed; assuming that the deviations are not large, linearization of the function $d$ is permitted.

The data quality of a data element is then derived from the contribution it makes to the correctness of the decision. We can compare the decision made with the information $d_i$ to the decision we would make with no particular information $d_0$ (the absence of additional information is just a particular case of erroneous information). Comparing the corresponding values indicates what contribution this data makes to the decision, and says what a rational decision maker would be willing to pay for it. [my paper]
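Written out, this is the standard linearization step from adjustment computations. The following LaTeX sketch, under the assumption of small, independent, normally distributed deviations (an assumption made explicit here, not stated as a formula in the original), connects it to Gauss' law of error propagation used in the summary below:

\begin{align}
(a_m, v_m) &= d(d_1, \dots, d_k) \\
v_m &\approx \tilde{v}_m + \sum_{i=1}^{k} \frac{\partial v}{\partial d_i}\,(d_i - \tilde{d}_i) \\
\sigma_{v_m}^{2} &\approx \sum_{i=1}^{k} \left(\frac{\partial v}{\partial d_i}\right)^{2} \sigma_{d_i}^{2}
\end{align}

Here $\tilde{v}_m$ is the value of the optimal decision computed from the correct values $\tilde{d}_i$. The partial derivatives measure how sensitive the value of the decision is to each input data element, so the contribution of each $d_i$ to decision quality is weighted by this sensitivity.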

4 Summary

Data quality is not unlike the quality of other products: producers claim 'high quality', meaning that the data are produced from high quality inputs with carefully arranged operations under permanent control, and are finally checked against exacting standards. What sounds very similar to material production is complicated by the fact that the definition of the dimensions on which to measure data - quantity as well as quality - is considerably more complex than for material goods. Measuring the quantity of data received from a source is far more complicated, and no widely accepted consensus on how to do it exists - it is definitely not as easy as weighing a bag of potatoes. Measuring the quality is equally difficult and not comparable to the non-trivial but standardized measurement of the starch content of said bag of potatoes (some industries pay for potatoes by their starch content, which I consider here a quality attribute of potatoes). We have also seen differences between material goods and data - e.g. data is non-rival, multipurpose, and an experience good - which affect how quality for data differs from quality descriptions for material goods.

Considering decision making as a function from data to outcomes shows how the effect of data and data quality on a decision can be analyzed; given that the deviations from correct values are small, linearization of the function is possible. The quality of the decision can then be calculated from the quality of the input data by applying Gauss' law of error propagation. This formula for deriving decision quality is, in principle, the desired method to translate the data quality descriptions of the producer into the data quality of the user. The restriction 'in principle' indicates that the assumption of a normal distribution of the deviations, i.e. that the deviations from perfect quality can be described statistically with standard deviations, is not justified for all data quality aspects. Completeness, for example - technically described by omission and commission rates - needs other statistical methods. To gain some insight, we next turn to an ontological approach.