Fall 2011
Andrew U. Frank, October 23, 2011
TU Wien, Department of Geoinformation and Cartography,
Gusshausstrasse 27-29/E127.1, A-1040 Vienna, Austria
frank@geoinfo.tuwien.ac.at

Part I
Introduction

1 The data producer's perspective

Current approaches to geo data quality are driven by the producers of geo data, primarily the National Mapping Agencies (NMA), to communicate the standards they maintain [nsds, morrison] and to coordinate necessary specifications with other agencies that are potential users of their data [fcdic]. Some NMA make their data available against substantial fees and use data quality arguments to try to convince potential users that the data are of high quality and that the high prices are therefore justified [some german/austrian publication?].

Data producers, similar to producers of other goods, claim that their product is of high quality. Unlike material goods, which are produced for a particular use, data can be used for many purposes - this is what GIS, previously called the multipurpose cadastre [], is all about [wisconsin paper]. For a physical good, e.g. a pair of scissors for cutting hair, it is relatively clear what requirements a user has and what 'high quality' means; but note that a pair of scissors for cutting paper is not of high quality if one intends to cut hair (or the reverse!). I would interpret 'high quality' as fulfilling the user's requirements for the intended purpose. But what does 'high quality' mean for a good whose use is not yet defined and thus not yet known? High quality for what? Here starts the problem of geo data quality!

The producer knows the production method, the instruments used, etc., and understands their effect on the quality of the data collected. A surveyor can describe the statistical deviations from the true values for the coordinates of the points determined, assuming a normal distribution and indicating the mean and standard deviation of the results.
But precision of location is not the only aspect of geo data quality: geo data quality has multiple aspects. Besides the precision of locations, it obviously matters when the data was collected, which themes are included, how they are coded, etc. In the mid-1980s, tentative lists of the data quality aspects that need to be included were published [chrisman, frank 1986]. They differentiated between precision and resolution for the three components of geo data: geometry, thematic data, and time [sinton]. As a cop-out aspect, lineage was added, under which data producers should describe where the data originated and how it was produced and processed. These views, more or less reworked, are the state of the art in today's data quality standards [refs].

<<insert a matrix: precision, resolution // geometry, time, theme>>

Only later was it discovered that these aspects are not orthogonal to each other [frank?]. For example, spatial and temporal precision are hard to separate completely - an uncertain location with a sharp time stamp cannot be differentiated from a certain location with an uncertain time stamp (fig.). This is only an application of Sinton's generic description of geographic data with three aspects (location, time, theme), of which one is fixed, one varies as the independent variable, and the last is the dependent variable.

Practical progress in reporting data quality for datasets is slow, despite the publication of standards. Hunter started systematic investigations [ref], which revealed not only many missing indications but also frequent uninformative values. What should a user do with a description of geometric precision as 'varying'? Not much more is learned from 'precision between 2 m and 10 km'. I take this as indicating that the practitioners among the data producers know that users hardly ever consult the metadata, and that the data quality values in the metadata hardly ever help the user decide whether to use the dataset. This is confirmed by studies of user behavior [ann boin], which reveal which other information users rely on when deciding whether to use a dataset.

The separation of the producer's point of view from the perspective of the user, introduced by Timpf [], helped the research out of the impasse and stagnation. It posed a number of new questions for research: how to describe the user's requirements, and how to connect the producer's descriptions of data quality with the user's requirements. These questions are the major driving force and provide the guideline for the presentation in this course. The practical goal of geo data quality research should be to achieve an operational connection between the data quality description from the producer's perspective (which we know how to produce) and the user's decision whether a geo dataset is useful for him, so that he should acquire and use it.
2 The user's perspective

If we consider the user's perspective on data quality, we have to ask why a user would acquire a dataset and how data quality affects this decision. It is obvious that data which is not useful for the user will not be acquired - but what does 'not useful' mean in this context? To answer the question why a potential user would acquire some data, we have to look at the user's situation.

2.1 Data serve only in decision situations

When does a user need data? The only use of data is to improve decisions - this is the only use of data! Therefore, a user will consider the acquisition of data only when he needs it to make a decision, i.e. in a specific situation, not out of some generic need to know. The modern, highly distributed methods of decision making in corporations and public administration produce many situations where potential decision-makers ask for data; this demand is, however, always related to some possible decision situation. The decision not to act is a decision as well; decision-makers typically ask for information to help them first decide whether an action by them is necessary, and often no further action is observable - meaning that the decision was not to act.

2.2 Model of decision making

A model of a decision is required for a formal analysis: a decision is a choice between different alternative actions, represented as a_1, a_2, ..., a_n. A person decides between the alternatives such that the outcome of the action he selects promises to be the best, the most advantageous outcome for him. In his mind, the outcome of each action a_i is the transformation of the current state s_0 into a new state s_i; the states s_i are evaluated by a valuation function v, which produces for each state s_i the corresponding value v_i. The action which corresponds to the highest value v_i is the most advantageous and is therefore selected, as sketched in the code below. <<figure>>

Note that we do not assume that the user knows exactly what state follows from an action or what his valuation of this state will be after execution. The concept of bounded rationality introduced by Herb Simon [] posits only that the decision maker has some idea of what the outcome will be and how he imagines the value of this outcome. From experience we all know that we are sometimes very limited in what we know: we select actions because we erroneously imagine an outcome which never materializes, and we are disappointed when we recognize our error in expecting a specific outcome, or our error in the valuation of an outcome we imagined much nicer than what is actually achieved.
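To make the selection rule concrete, here is a minimal sketch in Python; the actions, the outcome function, the valuation function, and all numbers are invented for illustration and are not part of any standard.

# Minimal sketch of the decision model of section 2.2 (illustrative only):
# each action a_i transforms the current state s_0 into an expected state
# s_i, a valuation function v assigns a value v_i to each state, and the
# decision maker selects the action with the highest value v_i.

def decide(actions, outcome, value):
    """Return the action a_i whose expected outcome s_i = outcome(a_i)
    receives the highest valuation v_i = value(s_i)."""
    return max(actions, key=lambda a: value(outcome(a)))

# Hypothetical example: three ways to get to work, valued by travel time.
travel_time = {"walk": 45, "bike": 20, "tram": 25}  # minutes (invented)
best = decide(travel_time,
              outcome=lambda a: travel_time[a],     # s_i: expected duration
              value=lambda s: -s)                   # v_i: shorter is better
print(best)  # -> 'bike', the action with the highest value v_i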
Figure 1: Model for decision making; without information, the decision maker selects the maximum of the values v_i (i = 1, 2, or 3). Information is acquired if v_c, the expected value achieved with information, is larger than this maximum.

2.3 Role of information in decision making

Assume a decision maker with the alternatives a_1, a_2, ..., a_n as before, but with the additional choice to acquire some data d, which contains information of relevance for the decision (Fig. 2). When should the data d be acquired? Let us label the alternatives, when executed after acquiring the data d, with primes: a'_1, a'_2, ..., a'_n, to which outcomes s'_i with valuations v'_i belong (Fig. 2). Given the additional information the user has, neither the outcomes nor the valuations are necessarily the same as the ones he would expect without the acquired information. A rational decision maker will again select the action among a_1, a_2, ..., a_n and a'_1, a'_2, ..., a'_n which promises the best value.

The apparent value of the information is its contribution to improving the decision, i.e. the difference between the maximum of the values v'_i and the maximum of the values v_i. The acquisition of the data was worthwhile if the maximum of the values v'_i, say v'_m, is larger than the maximum of the values v_i, say v_m; a rational user should be ready to pay the difference between v'_m and v_m. With the assumption of bounded rationality, one must actually include an additional compound decision a_c, which is the action of acquiring the data and then selecting the best alternative; the value v_c expected for this compound action initially, before acquiring the data, enters into the assessment of the willingness to pay for acquiring data as v_c - v_m. The real value of the data is only revealed after the fact, when the actions are carried out and the actual outcomes of the decisions become known. The effect of acquiring data is often (only) a reduction of risk in a decision, which must be counted as a positive contribution.
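Continuing the sketch from section 2.2, the following Python fragment computes the apparent value of information as the difference v'_m - v_m; the two valuation tables are invented for illustration and stand in for the expected values a decision maker would estimate.

# Sketch of the value of information (section 2.3, illustrative only):
# v_m  is the best value achievable without the data d,
# v'_m is the best value achievable after acquiring d,
# and the apparent value of d is the difference v'_m - v_m.

def value_of_information(values_without, values_with):
    """Given maps from actions to expected values, return the best value
    without the data, the best value with it, and their difference."""
    v_m = max(values_without.values())
    v_m_prime = max(values_with.values())
    return v_m, v_m_prime, v_m_prime - v_m

# Invented numbers: with the data d, the expected values shift and a
# different action becomes the best one.
v = {"a1": 10.0, "a2": 12.0, "a3": 8.0}        # v_i without the data
v_prime = {"a1": 10.0, "a2": 9.0, "a3": 15.0}  # v'_i after acquiring d
v_m, v_m_prime, worth = value_of_information(v, v_prime)
print(worth)  # -> 3.0: a rational user would pay up to v'_m - v_m for d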
Figure 2: Model for decision making; after acquisition of information, the improvement of the decision through the information can be evaluated.

3 Model of data quality from a user perspective

3.1 When is data correct (from a user perspective)?

Correctness of data is the pinnacle of data quality. When is data correct? Much has been discussed by data producers, and standards state what deviation from the values of re-measurements of higher quality is acceptable - often quite arbitrary rules, dictated by practicability, the available resources of an agency, etc. If we take the perspective of the user, the answer is relatively easy: data is correct if it leads to decisions for actions which can be carried out and have the expected results. Some simplistic examples: a railway timetable entry is correct if it leads us to catch the desired train, i.e. if we arrive at the station before the indicated time, we are able to catch the respective train; navigation instructions are correct if they can be followed (i.e. they do not lead to actions prohibited by the driving rules established by law) and lead to the desired goal, i.e. we reach the destination.

This definition of correctness of data from a user perspective does not require that the data give a true description of reality, as is sometimes demanded, but only that the effect of deviations from a true description does not influence the decision substantially - meaning that another decision would be better if the data were better. This leads to an understanding of the value of data, and indirectly of the quality of the data, as always related to a specific decision situation. It hints at a reduced need for quality in the data: lack of correctness in the data only affects a decision if another decision would be better than the one selected based on the erroneous data; given that for a decision we seldom have many options, only data good enough to help us avoid selecting the wrong alternative is necessary. This means that approximate data and heuristic methods for decision making are sufficient to select among the few alternatives one has in reality. It is meaningless to ask for data quality from a user perspective without considering a specific decision situation.
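The point that an error in the data only matters when it flips the decision can be illustrated with a small Python sketch; the travel times and their errors are invented for illustration.

# Sketch for section 3.1 (illustrative only): erroneous data is 'correct
# enough' from the user's perspective as long as the decision based on it
# is the same as the decision based on the true values.

def decide(actions, value):
    """Select the action with the highest value."""
    return max(actions, key=value)

true_time = {"motorway": 30, "city": 45}  # true travel times in minutes
data_time = {"motorway": 34, "city": 44}  # dataset wrong by a few minutes

actions = ["motorway", "city"]
with_true_data = decide(actions, lambda a: -true_time[a])
with_bad_data = decide(actions, lambda a: -data_time[a])
print(with_true_data == with_bad_data)  # -> True: despite the errors the
# same (best) alternative is selected, so for this decision situation the
# erroneous data is as good as perfect data.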
3.2 Quality of a decision

Assume a decision situation where the optimal decision is ã_m and the decision with the available information is a_m; the value of the information is the improvement of the decision, and the degraded available information is thus exactly the difference between the values of ã_m and a_m less valuable than the perfect one. Consider the decision making as a function d from input data values d_i to a decision (a_i, v_i):

(a_i, v_i) = d(d_i)

Applying ideas from adjustment computations to this decision function, one posits that the optimal decision ã_m results from the correct values of each input data element. In consequence, the contribution of the deviation of each data element from its correct value can be computed - assuming that the deviations are not large, linearization of the function d is permitted.

The data quality of a data element is then derived from the contribution it makes to the correctness of the decision. We can compare the decision made with the information d_i to the decision we would make with no particular information d_0 (the absence of additional information is just a particular case of erroneous information). Comparing the corresponding values indicates what contribution this data makes to the decision and says what a rational decision maker would be willing to pay for it. [my paper]
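Section 3.2 and the summary below appeal to linearization and Gauss' law of error propagation: for small deviations, sigma_v^2 = sum_i (dv/dd_i)^2 * sigma_i^2, where sigma_i is the standard deviation of input d_i and sigma_v the resulting uncertainty of the decision value. The following Python sketch carries out this computation numerically; the decision function and all numbers are invented for illustration.

# Sketch of Gauss' law of error propagation applied to a decision function
# (sections 3.2 and 4, illustrative only). The function is linearized by
# numerical differentiation, and the input standard deviations are
# propagated to the decision value:
#     sigma_v^2 = sum_i (dv/dd_i)^2 * sigma_i^2

def propagate(f, inputs, sigmas, h=1e-6):
    """Propagate the standard deviations 'sigmas' of 'inputs' through f,
    assuming the deviations are small enough for linearization."""
    var = 0.0
    for i, (x, sigma) in enumerate(zip(inputs, sigmas)):
        shifted = list(inputs)
        shifted[i] = x + h
        slope = (f(shifted) - f(inputs)) / h  # partial derivative dv/dd_i
        var += (slope * sigma) ** 2
    return var ** 0.5

# Invented decision value: travel time from distance [km] and speed [km/h].
travel_minutes = lambda d: d[0] / d[1] * 60.0
sigma_v = propagate(travel_minutes, [10.0, 30.0], [0.5, 3.0])
print(round(sigma_v, 1))  # -> 2.2 minutes of uncertainty in the decision value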
4 Summary

Data quality is not unlike the quality of other products: producers claim 'high quality', meaning that the data are produced from high quality inputs in carefully arranged operations under permanent control, and finally checked against exacting standards. What sounds very similar to material production is complicated by the fact that defining the dimensions on which to measure data - quantity as well as quality - is considerably more complex than for material goods. Measuring the quantity of data received from a source is far more complicated, and no widely accepted consensus on how to do it exists - it is definitely not as easy as weighing a bag of potatoes. Measuring the quality is equally difficult and not comparable to the non-trivial but standardized measurement of the starch content of said bag of potatoes (some industries pay for potatoes by their starch content, which I consider here a quality attribute of potatoes). We have also seen differences between material goods and data - e.g. data is non-rival, multipurpose, and an experience good - which affect how quality for data differs from quality descriptions for material goods.

Considering decision making as a function from data to outcomes shows how the effect of data and data quality on a decision can be analyzed; given that the deviations from the correct values are small, linearization of the function is possible. The quality of the decision can then be calculated from the quality of the input data by applying Gauss' law of error propagation. This formula for deriving decision quality is, in principle, the desired method to translate the data quality descriptions of the producer into the data quality relevant to the user. The restriction 'in principle' indicates that the assumption of normally distributed deviations, i.e. that the deviations from perfect quality can be described statistically with standard deviations, is not justified for all data quality aspects. Completeness, for example - technically described by omission and commission rates - needs other statistical methods. To gain some insight, we start with an ontological approach next.