DATA QUALITY IN GIS

When using a GIS to analyse spatial data, there is sometimes a tendency to assume that all data, both locational and attribute, are completely accurate. This of course is never the case in reality. Whilst some steps can be taken to reduce the impact of certain types of error, errors can never be completely eliminated. Generally speaking, the greater the degree of error in the data, the less reliable are the results of analyses based upon those data. This is sometimes referred to as GIGO (Garbage In, Garbage Out). There is obviously a need to be aware of the limitations of the data and the implications these may have for subsequent analyses.

We will begin by looking at some of the terminology used to describe data quality in GIS. We will then look at some sources of error in GIS data, before looking at how errors can be modelled. Following that we will look at the role of metadata.

TERMINOLOGY

A specialised vocabulary is used to describe data quality in GIS. We will begin with a quick review of some of the more important concepts. The terms data quality and error are used in a fairly loose, but common sense, sort of way. Data quality refers to how good the data are. An error is a departure from the correct data value. Data containing a lot of errors are obviously poor in quality.

A distinction is usually made between accuracy and precision. Accuracy is the extent to which a measured data value approaches its true value. No dataset is 100 per cent accurate. Accuracy can be quantified using tolerance bands - i.e. the distance between two points might be given as 173 metres plus or minus 2 metres. These bands are generally expressed in probabilistic terms (e.g. 173 metres plus or minus 2 metres with 95 per cent confidence). Precision refers to the recorded level of detail. A distance recorded as 173.345 metres is more precise than one recorded as 173 metres. 
However, it is quite possible for data to be accurate (within a certain tolerance) without being precise. It is also possible for data to be precise without being accurate. Indeed, data recorded with a high degree of precision may give a misleading impression of accuracy. Data should not be recorded with a higher degree of precision than their known accuracy.[1]

The term bias is used to refer to a consistent error. For example, if a map was accidentally moved during digitising, all points digitised after the move will be displaced relative to those digitised before the move in a systematic manner (i.e. by a fixed amount in a certain direction). As another example, all data values may be truncated by the software, resulting in a lower degree of precision.

The above terms apply to both attribute and locational data. The terms resolution and generalisation refer only to locational data. Resolution refers to the size of the smallest features captured in the data. In raster mode this is a function of the size of the raster cells. For example, if each cell covers an area of 20 metres by 20 metres on the ground, then features smaller than this (e.g. free-standing trees) will not be captured. If digitising in vector mode, the resolution will be a function of the scale of the source map.

Generalisation refers to the degree of simplification when drawing a map. Maps are models of the real world rather than miniaturisations - i.e. in order to display certain features clearly, cartographers have to eliminate various details which would only tend to clutter the map. For example, lines with many twists and turns may be straightened out; features which would be difficult to see at small scale if represented by polygons are represented as point features; features which might be difficult to see if drawn at their true scale are exaggerated (e.g. the width of roads); etc.

[1] Precision is sometimes defined in a different manner to refer to the repeatability of measurements. Burrough and McDonnell suggest accuracy defines the relationship of the measured data value to the true data value and can be expressed statistically using the standard error, whereas precision defines the spread of values around the mean and can be expressed as a standard deviation. This concept of precision is sometimes referred to as observational variance.
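The distinction drawn above between accuracy (closeness to the true value) and precision in the sense of observational variance (spread around the mean) can be sketched as follows. The repeated distance measurements are hypothetical, and the root mean square error is used here as a simple stand-in for a statistical summary of accuracy.

```python
import statistics

# Hypothetical repeated measurements (metres) of a distance whose true
# value is taken to be 173.0 m - the values are illustrative only.
measurements = [173.2, 172.8, 173.4, 172.9, 173.1, 173.3, 172.7, 173.0]
true_value = 173.0

mean = statistics.mean(measurements)

# Accuracy: how close the measured values are to the true value,
# summarised here by the root mean square error.
rmse = (sum((m - true_value) ** 2 for m in measurements) / len(measurements)) ** 0.5

# Precision (observational variance): the spread of values around their
# own mean, summarised by the standard deviation.
precision = statistics.stdev(measurements)

print(f"mean = {mean:.3f} m, RMSE = {rmse:.3f} m, std dev = {precision:.3f} m")
```

A biased instrument could produce a small standard deviation (high precision) alongside a large RMSE (low accuracy), which is exactly the "precise but not accurate" case described above.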
Currency introduces a time dimension and refers to the extent to which the data have gone past their 'sell by' date. Administrative boundaries tend to exhibit a high degree of geographical inertia, but they are revised from time to time. Other features may change location on a more frequent basis: rivers may follow a different channel after flooding; roads may be straightened; the boundaries between different types of vegetation cover may change as a result of deforestation or natural ecological processes; and so forth. The attribute data associated with spatial features will also change with time. The metadata (see below) associated with a particular set of data should specify the date of data capture.

Other data quality considerations include completeness, compatibility, consistency and applicability. Completeness refers to the degree to which data are missing - i.e. a complete set of data covers the study area and time period in their entirety. Sample data are by definition incomplete, so the main issue is the extent to which they provide a reliable indication of the complete set of data. The term compatibility indicates that it is reasonable to use two data sets together. Maps digitised from sources at different scales may be incompatible. For example, although GIS provides the technology for overlaying coverages digitised from maps at 1:10,000 and 1:250,000 scales, this would not be a very useful exercise due to differences in accuracy, precision and generalisation. To ensure compatibility, data sets should be developed using the same methods of data capture, storage, manipulation and editing (collectively referred to as consistency). Inconsistencies may occur within a data set if the data were digitised by different people or from different sources (e.g. different map sheets, possibly surveyed at different times). The term applicability refers to the suitability of a particular data set for a particular purpose. 
For example, attribute data may become outdated and therefore unsuitable for modelling that particular attribute a few years later, especially if the attribute is likely to have changed in the interim.

SOURCES OF ERROR

Data errors may originate from a large number of different sources. Identifying possible sources of error and taking steps to reduce errors is largely a matter of common sense. The following is therefore only intended to provide an indication of possible error sources rather than a comprehensive list of all possible errors.

Inaccuracies may arise with regard to space, time or attribute. Spatial inaccuracies arise if the co-ordinates used to identify the location of an entity (i.e. point, line or polygon) or a data point used to interpolate field data are measured or recorded incorrectly. Attribute errors arise if the attribute data for objects or the data values for sample points used to interpolate a field are measured or recorded incorrectly. As noted above, attribute data values and locational characteristics are likely to change over time, so it is good practice to record the date and time when the data were collected. Inaccuracies may also arise in the recorded time, although for most types of GIS application (except possibly for systems in which there is rapid change - e.g. weather systems) temporal errors are less critical than the other types of error. They are therefore not considered further here.

Inaccuracies may occur at all stages in a GIS analysis. The following identifies some of the sources of error at each stage.

Data Input Errors

The data for entry into a GIS may contain measurement inaccuracies. These may be primary or secondary. Primary data acquisition errors occur during data capture or measurement. For example, if digitising data from a printed map, the printed map may contain errors which will naturally be retained after conversion to a digital format. 
Attribute data sources may also contain errors arising from problems with measurement instruments, sample bias, errors in recording, coding errors, etc. Some measurement methods (e.g. surveying) are obviously more likely to be accurate than others (e.g. interpretation of an air photo). Further errors, referred to as secondary data acquisition errors, may be introduced subsequently during the process of entering the data into the GIS - e.g. digitising errors, typing errors, etc.
Locational Data

The capture of locational data for entities (e.g. by digitising from a paper map) can result in numerous errors. ESRI suggests a useful checklist of objectives when capturing data in vector mode:

1. All entities that should have been entered are present.
2. No extra entities have been digitised.
3. The entities are in the right place and are of the correct shape and size.
4. All entities that are supposed to be connected to each other are.
5. All polygons have only a single label point to identify them.
6. All entities are within the outside boundary identified with registration marks.

This provides a good indication of the types of problem that might arise. Entities (i.e. points, lines, polygons) may simply be overlooked when digitising, or may be entered more than once. An arc missing between two nodes may result in two polygons being captured as a single polygon. An arc inadvertently digitised twice may result in slivers where the two versions of the line diverge slightly. Vertices inaccurately digitised may result in lines having the wrong shape or, if the vertex in question is a node, may result in a dangling node. The dangling node may either undershoot its correct location, resulting in a gap, or it may overshoot its intended location, resulting in a cul-de-sac (and an intersection not identified as a node). Vertices digitised in the wrong sequence may result in weird polygons or a polygonal knot.

If digitising polygons it is obviously important to have the correct number of label points in the correct locations. Too few label points will result in some polygons not having associated attribute data, whilst too many label points may result in a polygon having the wrong attribute data associated with it.

Digitising errors do not necessarily indicate a lack of accuracy when digitising points. They may also arise if the snapping tolerance is incorrectly set. For example, dangling nodes frequently arise if the snapping tolerance is set too low. 
However, if the snapping tolerance is set too high, nodes may be snapped to the wrong points. Apart from causing lines to have the wrong shape, this could result in topological inconsistencies.

If the data are topologically encoded, then the digitising software can run a number of checks to identify potential problems. For example, the software can check how many line segments enter each node. If only one line segment enters a node then it can be identified as a dangling node. If two line segments enter a node then it is referred to as a pseudo node. Both situations can be flagged as potential problems. However, dangling nodes may reflect genuine cul-de-sacs in a road system, or the sources of tributaries in a river system; whilst a pseudo node may identify a polygon completely enclosed within another polygon (e.g. a lake) or a change in attribute along a line (e.g. single lane road to dual carriageway). The first situation is sometimes referred to as a spatial pseudo node and the second as an attribute pseudo node.

Attribute Data

Errors in the attribute data may be caused either by primary or secondary data acquisition errors. Primary data acquisition errors occur during measurement. Most secondary data acquisition errors are simply a result of typing mistakes. For example, numbers may be entered wrongly or names may be spelt wrongly. Spelling mistakes in a field used to join the attribute table to the spatial features may result in those features not being associated with the correct attribute data. Missing attribute data (for whatever reason) will also cause fairly obvious problems.

Data Processing Errors

Further errors may be introduced during data processing. For example, if converting data from raster to vector mode, lines which should be straight may take on a stepped appearance. 
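The node-degree checks described under Locational Data above can be sketched as follows. The arc list and node identifiers are hypothetical; the idea is simply to count how many arcs meet at each node and flag candidates for inspection.

```python
from collections import Counter

# Each arc is stored as (from_node, to_node); the identifiers are hypothetical.
arcs = [
    ("A", "B"), ("B", "C"), ("C", "A"),  # a closed triangle of arcs
    ("B", "D"),                          # D has only one arc
    ("C", "E"), ("E", "A"),              # E has exactly two arcs
]

# Count how many arc ends meet at each node.
degree = Counter()
for start, end in arcs:
    degree[start] += 1
    degree[end] += 1

# One arc entering a node = dangling node; two arcs = pseudo node.
dangling = sorted(n for n, d in degree.items() if d == 1)
pseudo = sorted(n for n, d in degree.items() if d == 2)

print("dangling nodes:", dangling)  # may be errors, or genuine cul-de-sacs
print("pseudo nodes:", pseudo)      # may mark an attribute change along a line
```

As noted above, flagged nodes are only potential problems: each one still has to be checked against reality before being "corrected".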
There are various smoothing algorithms which may be used to smooth out angular lines, but there is no way of knowing whether the smoothed lines are actually any more accurate - the net effect of smoothing the lines may be to introduce further errors by making them artificially smooth. Vector to raster conversions may result in topological errors being introduced or even in the creation or loss of small polygons. Raster coverages created from the same vector coverage will tend to vary depending upon relatively arbitrary decisions about cell size, the orientation of the raster and the location of the origin.

Interpolation of data values in a continuous field from sample points will result in different values depending upon the choice of method of interpolation and other decisions made with regard to the parameters used. The number of sample points will also have a fairly obvious influence upon the reliability of the resulting estimates. When analysing field data it is therefore necessary to bear in mind that the estimated data values are not necessarily accurate.
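The sensitivity of interpolated values to parameter choices can be sketched with inverse distance weighting (IDW), one common interpolation method. The sample points are hypothetical; the point of the example is that the same location interpolated from the same samples gives different estimates when the power parameter changes.

```python
# Inverse distance weighting (IDW) at a single location, using hypothetical
# sample points, to show that the estimate depends on the power parameter.

samples = [  # (x, y, value) - illustrative elevation readings
    (0.0, 0.0, 10.0),
    (10.0, 0.0, 30.0),
    (0.0, 10.0, 20.0),
    (10.0, 10.0, 60.0),
]

def idw(x, y, points, power):
    """Estimate the value at (x, y) as a distance-weighted average."""
    num = den = 0.0
    for px, py, v in points:
        d = ((x - px) ** 2 + (y - py) ** 2) ** 0.5
        if d == 0:
            return v  # exactly on a sample point
        w = 1.0 / d ** power
        num += w * v
        den += w
    return num / den

# The same location interpolated with two different power parameters.
at = (2.0, 2.0)
est1 = idw(*at, samples, power=1)
est2 = idw(*at, samples, power=2)
print(f"power=1: {est1:.2f}, power=2: {est2:.2f}")
```

A higher power gives more weight to the nearest samples, so the estimate drifts towards the nearest value; neither answer is "the" true value, which is exactly the caution expressed above.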
It is important to realise that computers may introduce errors when processing data due to limitations placed upon the precision of numbers arising from the way in which they are stored in a computer. When working with numbers requiring a large number of significant digits, calculations done by computers may result in a high degree of inaccuracy. This problem is becoming less serious with the availability of 32-bit and 64-bit machines, provided that the software has been programmed to take advantage of the extra precision. If you are working with high precision numbers you should confirm that both the computer and the software can support the degree of precision required, because the implications may be more serious than simply rounding numbers to a smaller number of significant digits.

Finally, use errors may arise from simply using inappropriate tools for a particular type of analysis.

Data Display Errors

The display of data may also introduce errors. For example, the display of raster data on a vector mode device (e.g. a plotter) or the display of vector data on a raster device (e.g. a monitor or a printer) will generally introduce other small inaccuracies due to the need to round off during the conversion from one mode to the other. These errors can probably be ignored for practical purposes, but they serve as a reminder that errors can creep in at all stages in a GIS analysis.

MODELLING DATA ERRORS

Apart from recognising that errors are likely and then taking whatever steps one can to minimise them, what else can be done? The treatment of errors in GIS has received relatively little attention, especially in commercial applications software, but there have been some tentative steps towards using quantitative measures of error to provide some indication of the reliability of data in a GIS.

Attribute Errors

Measurement errors in the attribute data can be modelled using conventional statistical techniques. 
For example, if the measurement errors can be assumed to be normally distributed with a mean of zero, then one can calculate the standard error and use it to place confidence limits on the data values. If the attribute data refer to sample points, then it may be necessary to interpolate the values of intervening points. Kriging provides an estimate of the variance of the interpolated values.

If the attribute data are non-numerical categorical data (e.g. landuse types) then it may be possible to calculate a misclassification matrix (also known as a confusion matrix or an error matrix). The rows in this matrix represent the various categories as measured and the columns represent the correct categories. For example, the rows may represent the categories in a landuse classification based on an analysis of satellite images, while the columns may correspond to the categories in a classification based on ground truthing. The data values in the matrix would indicate the number of cells in a raster image falling into each combination of categories. Once the table is constructed, it is a simple matter to calculate what percentage of cells in each landuse category would be correctly classified from the satellite imagery. These calculations may then be applied to other images.

Positional Error Models

Positional error models represent an attempt to place confidence bands around locational features. It is assumed that if the x co-ordinate of a point (or vertex) was measured repeatedly then the observed x co-ordinates would have a Normal (i.e. Gaussian) distribution with an expected value (or mean) corresponding to the true value. 68 per cent of the observed co-ordinates would be within one standard error of the mean, and 90 per cent would be within 1.65 standard errors. Having established the standard error for one point by experiment, the expected error associated with other points could be expressed using probabilistic confidence bands. 
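The coverage figures quoted above can be checked by simulation. This is a minimal sketch assuming a hypothetical true co-ordinate of 500 metres and a standard error of 2 metres, with unbiased Normal measurement errors.

```python
import random

# Simulate repeated measurement of a single x co-ordinate whose true value
# is 500.0 m, with unbiased Normal measurement errors (standard error 2.0 m).
random.seed(42)
true_x, se = 500.0, 2.0
observed = [random.gauss(true_x, se) for _ in range(100_000)]

# Fraction of observations falling within 1 and 1.65 standard errors.
within_1se = sum(abs(x - true_x) <= se for x in observed) / len(observed)
within_165se = sum(abs(x - true_x) <= 1.65 * se for x in observed) / len(observed)

print(f"within 1 SE: {within_1se:.1%}")       # close to 68 per cent
print(f"within 1.65 SE: {within_165se:.1%}")  # close to 90 per cent
```

The simulated proportions converge on the theoretical 68 and 90 per cent as the number of simulated measurements grows.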
There are a number of assumptions implicit in the choice of a Normal distribution to model measurement errors (e.g. it is assumed that the errors are unbiased; it is assumed that the probabilities associated with errors of differing magnitude form a continuum; etc.). If these assumptions are unrealistic, then other statistical distributions may be preferable. However, the basic approach is much the same.
Point Data

When recording the spatial location of a point we record two co-ordinates (i.e. x and y). If the errors associated with each are assumed to be normally distributed, then the probability distribution will be a bell-shaped surface, declining at the same rate in all directions.[2] The standard error of this surface is called the circular standard error (CSE). 39.35 per cent of all points can be expected to lie within a circle with a radius of 1 CSE centred on the mean, and 90 per cent of the points should be within 2.146 CSE. One way to define the accuracy of a map is to specify a circular map accuracy standard (e.g. 2.146 CSE, meaning that 90 per cent of all the observed data points will be within this distance of their true locations).

[2] This can be thought of as a bell-shaped curve rotated through 360 degrees around its vertical axis.

Line Data

The true location of each point on a line can be thought of as lying within a band on either side of the digitised line, where the width of the band reflects the standard error. These bands are sometimes referred to as epsilon bands. The original epsilon band was hypothesised as rectangular in cross-section, but a Normal distribution is more frequently assumed. However, given that each point in a line is not independent of the points which precede and follow it, it seems plausible that the digitised points forming a sequence will tend to be displaced in the same direction (i.e. there will be a bias, and the true distribution of the errors will be skewed). Some investigators have suggested the cross-sectional probability distribution should be bimodal.

Polygon Data

Similar principles apply to polygons. Information on the width of the epsilon bands can be used to place confidence limits on point-in-polygon tests. As can be seen from the diagram, instead of points being classified as either inside or outside a polygon, they can now be classed as being definitely in, definitely out, possibly in, possibly out or ambiguous.

A Monte Carlo approach can be used to model errors. This basically involves adding a random noise factor (which can be positive or negative) to each co-ordinate for each point before performing whatever GIS operation you need to do. The results are saved, and the whole process is repeated a large number of times (e.g. 100 times). The accumulated results can be used to calculate confidence limits for numerical answers or to draw confidence bands around features on output maps. The main problem with a Monte Carlo approach is that it requires a lot of computer resources.

Burrough and McDonnell (Chapter 10) discuss a similar approach for evaluating the effects of measurement errors in numerical models. They also discuss the statistical theory of error propagation.[3] Whilst mathematically more challenging, this provides a computationally more efficient means of achieving the same objectives. The main conclusion from their review of several case studies is that even relatively small measurement errors can have a much greater impact than one might imagine. There is therefore an obvious need to develop better methods for assessing data quality and its implications.

METADATA

The reliability of a particular set of data is dependent upon the uses to which it is put. Data which are completely inappropriate in one context may be totally adequate in a different context (or vice versa). Data quality is therefore to some extent a relative concept dependent upon the context. The emphasis has therefore tended to switch away from simply trying to make the data as error free as possible to providing potential users with the information which they require to make an informed decision about the adequacy of the data for a particular purpose. This information is referred to as metadata. 
Data summary: Data sources, areal coverage, classification used, date collected, scale, etc.
Lineage: Agency of origin, method of data collection, primary survey techniques, digitising method. Dates updated. Processing history: co-ordinate transformations, data model translations, attribute transformations.
Co-ordinate system: Type of co-ordinate system. Map projection parameters.
Spatial data model: Specification of primitive spatial objects. Topological data stored.
Feature coding system: Definition of feature codes and classification system.
Classification completeness: Documentation on the extent of usage of classification system.
Geographical coverage: Overall extent. Detailed specification of coverage if not complete.
Positional accuracy: Statistics on co-ordinate errors.
Attribute accuracy: Statistics on attribute errors.
Topological accuracy: Methods of topology validation employed.
Graphical representation: Graphical symbolism for each feature class. Text fonts for annotation.
Data exchange format: Data storage format.

Metadata is data about data. In the GIS context, each set of data should be accompanied by metadata explaining not only what it contains but how and when it was collected, together with details relating to its quality. The table above (from Jones) indicates the type of information the metadata might include. There are now a number of international standards for metadata (e.g. OGC, ISO).

[3] The term error propagation refers to the cumulative effect of errors upon the final results of the analysis.
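Returning to the Monte Carlo approach outlined under MODELLING DATA ERRORS, the procedure can be sketched as follows. The polygon (a 100 m square), the assumed standard error and the area calculation are all hypothetical choices made for illustration; the structure - add Normal noise to every co-ordinate, perform the operation, repeat many times, summarise - is the approach described above.

```python
import random
import statistics

def area(polygon):
    """Polygon area by the shoelace formula (vertices given in order)."""
    total = 0.0
    for (x1, y1), (x2, y2) in zip(polygon, polygon[1:] + polygon[:1]):
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0

square = [(0.0, 0.0), (100.0, 0.0), (100.0, 100.0), (0.0, 100.0)]
se = 1.0  # assumed positional standard error in metres

# Repeat the GIS operation (here, an area calculation) on perturbed
# copies of the polygon and accumulate the results.
random.seed(1)
areas = []
for _ in range(1000):
    noisy = [(x + random.gauss(0, se), y + random.gauss(0, se)) for x, y in square]
    areas.append(area(noisy))

mean_area = statistics.mean(areas)
sd_area = statistics.stdev(areas)
print(f"area = {mean_area:.0f} +/- {1.65 * sd_area:.0f} sq m (90% confidence)")
```

Even a modest 1 metre standard error on each vertex produces a noticeable spread in the computed area, echoing the conclusion above that small measurement errors can have a greater impact than one might imagine.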