SPATIAL DATA MODELS AND SPATIAL DATA


TABLE OF CONTENTS

Spatial data models: an introduction
Geometric entities
    Problems with the entity definition process
Spatial data models and structures
The raster approach
    Raster data structures
        The simple raster
        The complex raster
    Data compaction methods
        Run length encoding
        The quadtree
The vector approach
    Data structures without topology
        The neighbourhood problem
        The island or hole problem
    Data structures with topology
        Point
        Lines and networks
        Areas
Vector and raster spatial data models: advantages and disadvantages
    Data volume
    Topology queries
    Generality
    Analytical capabilities
    Accuracy and precision
What have you learned?

SPATIAL DATA MODELS AND SPATIAL DATA STRUCTURES

This section focuses on the methods available for implementing geographic models within GIS. It reviews the geometric primitives (spatial entities: points, lines, areas, networks and surfaces) and the different approaches (data models and data structures) used to implement these representations of geographic space in a computerised GIS environment.

1 Spatial data models: an introduction

"Human perception of space is frequently not the most efficient way to structure a computer database and does not account for the physical requirements of storing and repeatedly using digital information. Computers for handling geographical data therefore need to be programmed to represent the phenomenological structures in an appropriate manner" (Burrough and McDonnell, 1998, p.36).

This pertinent statement is a reminder that computers require unambiguous instructions on how to perform specific tasks. In the same way, a computer needs to be told how to build spatial models for implementation within a GIS. Using Peuquet's (1984) abstraction schema as a framework, this section examines in detail the different levels of abstraction (levels 1 to 3) and the associated processes used in the design and implementation of a GIS data model.

The successful design of spatial data models within a GIS must consider the following:

1. Which landscape elements (phenomenological structures or spatial entities) are necessary to appropriately represent the system under investigation (level 1 abstraction)?
2. What approach (spatial data model) should be used to handle and display these spatial entities (level 2 abstraction)?
3. What particular set of instructions and information (data structure) will the computer require to reconstruct the spatial data model in digital form (level 3 abstraction)?
2 Geometric entities

In performing the first stage of data abstraction we need to identify the features or objects to be represented within our GIS. The design and construction of our GIS is dependent upon the successful identification of a series of geometric primitives. These are the basic units of our spatial data models and form the building blocks of GIS. All geographical phenomena can, in two dimensions at least, be represented by one of three entity types (Figure 1).

In completing the e-tutorial to construct your own map, you produced a series of spatial entities of your own to represent the different landscape elements. These entities (or primitives) were made up of a series of shapes (geometries), including points to represent the location of trees, a series of lines to illustrate the road centre-lines, and areas to represent recreational space and residential and/or commercial buildings. Let's quickly remind ourselves of some of the geometric properties of point, line and area entity types.

What is a point? A point is a spatial entity that has no length or area.
What is a line? A line is a feature that has length but no area.
What is an area? An area is a spatial entity that has perimeter and area.

Figure 1. Point, line and area spatial entities.

The three entity types above, however, are not always the most appropriate representational forms for landscape elements to be included in a GIS. The concepts of area and line can be extended to produce two other spatial entities, namely surfaces and networks (Figure 2).

Figure 2. Surface and network entities.

Surfaces can be used, for example, to represent population density, elevation values and temperature. As well as its fundamental two-dimensional attributes, the surface entity is also capable of three-dimensional representation in GIS, and this is revisited later in the section on dimensionality. Examples of networks include traffic (road) and hydrological systems. In many cases, the range of surface and network entity types that you identified will have been limited only by your knowledge of the area. Some suggestions include:

Surface: rainfall, elevation, pollution
Network: electricity, rivers, sewage pipes, telephone cables, footpaths

2.1 Problems with the entity definition process

As one might expect, there are a number of problems associated with simplifying the complexities of the real world into five basic, two-dimensional building blocks. These include the dynamic nature of the real world, the scale at which a particular problem needs addressing and the identification of discrete features.

Real world dynamics

The real world is not static: forests grow, rivers flood, cities expand. This poses two particular problems for the entity definition phase of a GIS project. The first problem is how to select the entity type that provides the most appropriate representation for the feature being modelled. For example, is it best to represent a forest as a collection of points (representing the location of individual trees), or as an area (the boundary of which defines the territory covered by a forest)? The second problem is one of temporal change. For example, a forest that was originally represented as an area may decline to the extent that in reality it is only a dispersed group of trees that are, perhaps, better represented using point features.

Scale

Scale is also an important concept in the entity definition process. For example, if a GIS database is to be constructed at a scale of 1:1M (for example, the Digital Chart of the World) it may be appropriate to represent the city of Manchester as a point feature. However, at larger scales you would need to employ different entity types to provide a more appropriate representation of Manchester. For example, at 1:250K an area feature would be most suitable; at any larger scale, it is likely that a compound or collection of entities would be more practical. Ideally, a truly scale-independent GIS would be most effective, able to operate at any scale, modifying and selecting the appropriate entity representation as the user zooms in and out of the database.

Definition

The selection of appropriate entity types is further plagued by the fact that many real world features simply do not fit comfortably into the character of the models available. Feature boundaries, for instance, are a particular problem in the definition of spatial entity type. In reality, the boundaries of natural phenomena are rarely discrete, but instead are more readily characterised by a continuum or transition zone. For example, where should you place the boundary of an area feature used to represent a stand of forest?
Do forests have edges, or do transitional zones from full to zero forest cover better define their boundary properties? Very often in GIS we make use of paper maps (secondary data sources) for data input, and these are characterised by the clear and distinct marking of any boundaries. While the discretisation of feature boundaries may be very useful and allow us to generate quantifiable measurements more easily, we must recognise the problems associated with the choice of entity type and its boundary characteristics, particularly with regard to natural (and therefore potentially fuzzy) phenomena.

Level 1 of the abstraction process as defined by Peuquet (1984) is characterised by the five geometric primitives (or spatial entities) used as the basic building blocks of GIS. Unfortunately, as our discussion has shown, the process of defining what entity type should reflect which real world feature is far from simple. The decision is vitally important for the successful design of a GIS, as it controls GIS functionality and the potential for further spatial operations (an issue addressed later in this unit).

The data layer concept

So far we have just considered the five primitive spatial entity types. However, the complexity of the real world is such that for most GIS applications it is necessary to construct more complex models of reality, typically consisting of compound features (several entity types). The most common method used in GIS to handle this problem is to adopt a layered approach. Individual data layers are constructed using the various entity types to represent different spatial elements. Each data layer is stored independently in the GIS using either the raster or the vector approach, as described next. These data layers can then be used either independently (single layer operations) or together (as multiple layers) depending upon the application. The use of multiple layers, as discussed later, can cause additional problems dependent upon the choice of spatial data model.

3 Spatial data models and structures

Levels 2 and 3, the next stages in the abstraction process, concern the design, representation and implementation of the defined spatial entities in the GIS. If you revisit the quotation from Burrough and McDonnell (1998) you will recognise that as well as defining our entities we must also instruct the computer on how to turn specific entity data into (digital) graphical representations. In GIS we mostly employ one of two methods to handle and display our chosen spatial entities. These are commonly referred to as the raster and vector approaches. The history of GIS software is such that some software was originally designed for a raster spatial data model (e.g. Idrisi) and other software is based on the vector spatial data model (e.g. GeoMedia Professional, ArcGIS).
Raster data sets are characterised by their grid cell structure, whereas the vector approach uses co-ordinate geometry in an attempt to represent the features or objects of interest as exactly as possible. As well as employing the raster and vector spatial data models [1] we are required to provide the computer with further information to reconstruct these models in digital format. There are a great number of spatial data structures, specific to either the raster or vector spatial data models, used by commercial GIS. The great diversity of spatial data structures is one of the reasons why exchanging spatial data between GIS is problematic. Different GIS may contain information of value to the other, but will be unable to share that information if the data structures used to store the information are incompatible. In the case of the UNIGIS supported software, each GIS has its own data format and structure. This means that we cannot simply transfer a raster file straight into a vector system or vice versa. The following sections explore the raster and vector spatial data models and examine, using a range of examples, the diversity of their associated spatial data structures.

[1] The term data model is often used to describe these two approaches. This can become confusing since the term data modelling is used to describe the entire process of representing reality in a computer. For our own purposes we will use the term spatial data model in association with the terms raster and vector, and the terms data model and data modelling to refer to the overall modelling process.

4 The raster approach

Raster systems are a result of developments in computer graphics technology over the last forty years. They are widely used in computing and digital television graphics and work by repeatedly sweeping an electron beam across the computer or television screen, from side to side and top to bottom. The image attributes (brightness and colour) at each point on the screen are determined by computer-generated data (Coll, 1991). Each point on the screen is in fact a small cell or pixel (picture element). In the raster world, individual (typically square) cells are used to represent the different geometric entities (points, lines, areas, networks and surfaces) used to build the GIS image. In a raster file geographic space is divided into a tessellation of regular sized grid cells. Single cells represent point features, whereas lines and areas are identified by groups or clumps of pixels (Figure 3).

Figure 3. Raster representations of points, lines and areas.

Importantly, raster space can be composed of different tessellation patterns, including the triangle and hexagon (Figure 4). Peuquet (1990) notes that triangular tessellations are useful for terrain representation: triangles do not all have the same orientation, which makes them more effective for picking out bumps and undulations on the land surface.

Figure 4. Regular tessellations for a raster field-based representation: hexagonal, triangular and square tessellations.

As well as selecting an appropriate tessellation you will also have to decide on the resolution of the cells. Too small and the data volumes will be prohibitive; too large and your data will look coarse and lack precision.

4.1 Raster data structures

There is a wide range of data structures for representing an entity in a computer using the raster data model. In the next part of this section we will examine some of these data structures in detail, including a discussion of the data compaction methods used to minimise data storage requirements for large or complex raster data sets.

4.1.1 The simple raster

At the most elementary level there is the basic or simple raster data structure, where information is stored for each cell in the image. This information informs the computer of the presence (or absence) of a feature within a given cell. Figure 5 illustrates what a raster representation of a simple map may look like at a range of different cell (pixel) sizes. This usefully illustrates how data quality changes with cell size in a raster image.

Figure 5. Raster views of a simple map. a: The simple map. b: Fine resolution grid cells. c: Intermediate size grid cells. d: Coarse resolution grid cells.

Cell occupancy and mixed pixels

As well as affecting the quality of definition, the pixel size can lead to problems when dealing with phenomena (entities) that only partially occupy a raster grid cell. Typically, this is solved by the application of one of two cell occupancy rules: the present or absent rule, and the 50% rule. The present or absent rule states that even if an entity only minimally occupies a raster cell it is considered to be present, and the cell will record an entity feature. As its name suggests, the 50% rule states that the entity is recorded as present only if more than 50% of a pixel is occupied by it. Figure 6 shows how these two different procedures can affect the shape and character of an entity in a raster spatial data model.

Figure 6. The present or absent rule and the 50% occupancy rule: a circular phenomenon, its raster representation using the present or absent rule, and its raster representation using the majority (50%) rule.

Another problem with the simple raster model is its inability to record the nature of an entity feature (i.e. point, line or area). This reflects the binary codes used in raster technology to store image information. Using binary coding, cells in which an entity is present are recorded with a value of 1, and unoccupied cells with a value of 0. The computer therefore sees the raster image presented in Figure 5 as a series of 0s and 1s, and not as a housing estate, river or trees, since the file does not contain any information to distinguish between the three. Using a simple raster approach, the computer requires a separate layer of information for each class. Figure 5 would then require 3 separate raster files: one for the tree map, one for the river map, and one for the housing estate map.

Figure 7. House figure for simple raster data structure exercise.

4.1.2 The complex raster

One of the more obvious problems with the simple raster data structure is the volume of information that has to be recorded to represent even the simplest map. Complex raster data structures reduce the volume of information by assigning coded labels to grid cells that not only tell the computer that a feature is present but also identify its character. Using our earlier example from Figure 5, the cells representing trees might be assigned a value of 1. The table below indicates how other entity types could be represented. Note that a column indicating colour has been included to illustrate how a complex raster image may appear on screen.

Phenomena        Entity type   Code   Colour
Tree             Point         1      Green
River            Line          2      Blue
Housing estate   Area          3      Red

Figure 8. House figure for complex raster data structure exercise.

4.2 Data compaction methods

File size is one of the major problems with raster data sets: space occupancy raster structures require a value to be recorded and stored for each grid cell. This means that a complex soil map of, say, 100 x 100 pixels containing 20 or more distinct soil classes requires the same storage space as a simple road map of the same area, even though most of the cells in the raster road map record a value of 0. Raster data storage requirements have received considerable attention, and a range of data compression (compaction) methods have been developed. These compaction methods can reduce the size of a raster data set quite considerably. In the following subsections we will examine some of the more commonly used methods in detail.
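As a rough illustration of why compaction pays off, the following Python sketch compares the storage needed by a simple raster with that needed by run length encoding, the first method covered in the next subsection. The grid is expanded from the run length encoded clay example given there; the function names are illustrative assumptions, and the one-byte-per-value costing follows the simplification used in the text rather than any real file format.

```python
def rle_encode(row):
    """Encode one raster row as a list of (run_length, value) pairs."""
    runs = []
    for value in row:
        if runs and runs[-1][1] == value:
            # Same value as the previous cell: extend the current run.
            runs[-1] = (runs[-1][0] + 1, value)
        else:
            # Value changed: start a new run of length 1.
            runs.append((1, value))
    return runs


def rle_decode(runs):
    """Rebuild the original row from (run_length, value) pairs."""
    row = []
    for count, value in runs:
        row.extend([value] * count)
    return row


# Presence (1) or absence (0) of clay, 5 rows x 14 columns.
CLAY = [
    [1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0],
    [1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1],
    [1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1],
    [1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1],
]

encoded = [rle_encode(row) for row in CLAY]

# Assumption from the text: every stored number costs one byte.
simple_bytes = sum(len(row) for row in CLAY)          # one value per cell
rle_bytes = sum(2 * len(runs) for runs in encoded)    # two values per run

print(encoded[0])
print(simple_bytes, rle_bytes)
```

Because the encoder compares values rather than testing for 0 and 1, the same sketch would work unchanged on a complex raster whose cells hold class codes.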

4.2.1 Run length encoding

One of the most common and simplest techniques for reducing the data volume associated with a raster image is known as run length encoding. This technique reduces the information stored for each line in a raster matrix by storing a single value for each run of consecutive cells of a given type, rather than storing a value for each cell. Consider the following simple raster in Figure 9 showing the presence or absence of clay.

Figure 9. Simple raster file structure.

Row 1   1 1 1 0 0 0 0 1 1 1 0 0 0 0
Row 2   1 1 1 1 1 0 0 0 1 1 1 0 0 0
Row 3   1 1 1 0 0 0 0 0 1 1 1 0 0 1
Row 4   1 1 0 0 0 0 0 0 0 0 1 1 1 1
Row 5   1 1 0 0 0 0 0 0 0 0 1 1 1 1

A run length encoded version of the file would be represented as follows:

Row 1   31, 40, 31, 40
Row 2   51, 30, 31, 30
Row 3   31, 50, 31, 20, 11
Row 4   21, 80, 41
Row 5   21, 80, 41

If we take a closer look at the first row of the run length encoded file we can see how the method works:

31, 40, 31, 40

The first number (3) represents the number of consecutive cells with the same coding. The second number (1) is that coding: in this case 1 = soil type A (clay soil). The third number (4) indicates the number of unoccupied cells moving from left to right, and the fourth number (0) represents the absence of clay soil. The fifth number (3) denotes that the next 3 consecutive cells are occupied by clay soil (code = 1 again). Finally, the numbers 4 and 0 indicate the absence of clay soil in the 4 grid cells that complete the row. Note that the commas have been added to make the file easier to read; they would be absent in a real run length encoded file, e.g. 31403140.

If we assume one numeric value uses one byte of storage (1 byte = 8 bits) on the computer, then row one of our run length encoded (RLE) file takes up 8 bytes compared with the 14 bytes required to store the same information using the simple raster data structure. The equivalent file sizes and savings are given below:

BYTE storage requirements

         Simple raster   Run length raster
Row 1    14              8
Row 2    14              8
Row 3    14              10
Row 4    14              6
Row 5    14              6
Total    70 bytes        38 bytes

Saving: 32 bytes, a 46% saving in storage space.

The figure below shows how the data volume associated with the storage of a complex raster using the RLE method could be reduced in a similar way. Note that the presence or absence values of 0 and 1 have been extended to include the codes used to identify 3 different soil types present in the grid.

Figure 10. Complex raster file structure.

An RLE version of the complex raster file would be represented as follows:

Row 1   31, 40, 32, 40
Row 2   51, 30, 32, 32
Row 3   31, 50, 32, 20, 12
Row 4   21, 80, 42
Row 5   21, 80,

4.2.2 The quadtree

As Peuquet (1990) points out, an advantage of the raster spatial data model that employs a square tessellation is that each cell can be subdivided into smaller cells of the same shape and orientation. This unique feature of the grid or raster data model has produced a range of innovative data storage and data reduction methods that are based on regularly subdividing geographical space. The most widely implemented is the quadtree (Samet 1989), based on the recursive decomposition of a grid. There is a range of quadtree types (see Peuquet (1990) and Samet (1989) for further reading) and the most common approach is the area quadtree.

The area or region quadtree

The area quadtree works on the principle of recursively subdividing the grid into quads (or quarters) until each block is homogeneous and no more subdivision can take place (Bonham-Carter 1994). At the end of the subdivision process each cell in the grid matrix may be classed as having an entity present or absent. The number of subdivisions in this process depends upon the complexity of the map layer and upon what is acceptable as the finest division to represent an object. The smallest quad cell size is determined by pixel resolution. Figure 11 illustrates the quadtree principle.

Figure 11. The quadtree process: original feature, 1st subdivision, 2nd subdivision, 3rd subdivision, 4th subdivision.

Figure 11 shows the process of hierarchical subdivision. The image is first divided into four quadrants, of which none can be wholly classified as not containing the entity. Therefore a further stage of subdivision is required in each quadrant, where it is possible to identify ten quadrants that do not contain the entity and six quadrants that do. Of these, one quadrant wholly contains the entity and five quadrants only partially contain the entity (3rd subdivision). Further subdivision is therefore only necessary in these five partially occupied quadrants. This process of subdivision continues until every cell either contains or does not contain the entity (4th subdivision). Looking closely at Figure 11 we can identify four hierarchical levels. The hierarchical nature of the quadtree becomes more apparent when we see it represented diagrammatically as a binary image tree, also known as a bintree (Figure 12).

Figure 12. Binary image tree.

In the bintree we can clearly distinguish the four levels, and by binary coding of each node in the tree with a 1 or 0 we can see whether a quadrant contains part of the entity or not. To examine how a quadtree is coded and information retrieved for display by the computer look at Figure 13.
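The recursive subdivision described above can be sketched in a few lines of Python. This is a minimal illustration and not the coding scheme of Figure 13: the 8 x 8 grid is a made-up example, the quadrant order (NW, NE, SW, SE) is an assumption, and the grid is assumed to be square with a side length that is a power of two.

```python
def build_quadtree(grid, row=0, col=0, size=None):
    """Return 0 or 1 for a homogeneous block, otherwise a list of four
    subtrees in the (assumed) order NW, NE, SW, SE."""
    if size is None:
        size = len(grid)
    first = grid[row][col]
    homogeneous = all(
        grid[r][c] == first
        for r in range(row, row + size)
        for c in range(col, col + size)
    )
    if homogeneous:
        return first  # leaf: entity wholly present (1) or wholly absent (0)
    half = size // 2
    return [
        build_quadtree(grid, row, col, half),                # NW
        build_quadtree(grid, row, col + half, half),         # NE
        build_quadtree(grid, row + half, col, half),         # SW
        build_quadtree(grid, row + half, col + half, half),  # SE
    ]


def count_leaves(tree):
    """Number of stored blocks (leaves) in the quadtree."""
    if isinstance(tree, list):
        return sum(count_leaves(child) for child in tree)
    return 1


# Hypothetical 8 x 8 presence/absence raster (not taken from Figure 11).
GRID = [
    [0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 1, 0, 0],
    [0, 0, 0, 0, 1, 1, 0, 0],
    [0, 0, 1, 1, 1, 1, 0, 0],
    [0, 0, 1, 1, 1, 1, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0],
]

tree = build_quadtree(GRID)
print(tree)
print(count_leaves(tree), "blocks instead of", len(GRID) ** 2, "cells")
```

For this grid the subdivision stops after two levels, because every 2 x 2 block is already homogeneous; a more ragged feature would be subdivided down to single cells.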

Figure 13. Coding a simple quadtree.

Raster scan order

Another method of referencing a quadtree is to use the Morton Matrix indexing scheme. This method, named after its developer (Morton 1966), is based on the Peano scan method, which generates a track through space that exploits the property that areas close together in the real world will be close together in a sequential digital file. These scan orders are best illustrated with reference to a diagram. Figure 14 shows how a Peano scan would be generated for a raster matrix with 8 columns and 8 rows (the dimensions of the raster in Figure 13). The coding system must be adapted to follow the Peano curve, and each cell has a unique identifier, known as a Morton number (Figure 15).

Figure 14. Peano curve.

Figure 15. Morton ordering scheme.
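One common way to compute a Morton number is to interleave the bits of a cell's row and column indices. The Python sketch below assumes that convention, with the column supplying the low bit of each pair and the origin at cell (0, 0); the orientation and origin used in Figures 14 and 15 may differ, so treat the numbering as illustrative only.

```python
def morton_number(row, col, bits=3):
    """Interleave the bits of col (even positions) and row (odd positions).
    bits=3 covers an 8 x 8 raster, giving Morton numbers 0 to 63."""
    code = 0
    for i in range(bits):
        code |= ((col >> i) & 1) << (2 * i)
        code |= ((row >> i) & 1) << (2 * i + 1)
    return code


def morton_decode(code, bits=3):
    """Recover (row, col) from a Morton number built by morton_number."""
    row = col = 0
    for i in range(bits):
        col |= ((code >> (2 * i)) & 1) << i
        row |= ((code >> (2 * i + 1)) & 1) << i
    return row, col


# Visiting cells in Morton order keeps each 2 x 2 block together.
first_block = [morton_decode(n) for n in range(4)]
print(first_block)
print(morton_number(2, 3))
```

Sorting cells by this number places the four cells of every quadrant next to each other in the file, which is the locality property the scan order is meant to exploit.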

Figure 16. House figure for quadtree data structure exercise.

5 The vector approach

The five entity primitives identified at the start of this section can most easily be represented using the vector spatial data model. The vector approach attempts to represent the features of interest as exactly as possible, employing Cartesian co-ordinate geometry. Points are represented by co-ordinate pairs, lines by arcs linking a series or string of points, areas by lines enclosing homogeneous areas (a string of co-ordinate pairs with the same origin and end point), networks by connected lines and surfaces by areas linking points and lines.

Those of you who use and work with topographic paper maps on a regular basis will be very familiar with the vector approach. Cartographic maps are composed entirely of a series of representative points, lines, areas and surfaces. The topographic map or vector model provides a scaled model of reality. For example, a river feature will be represented using a line of appropriate thickness rather than a series of contiguous, inappropriately shaped cells (as with the raster model). The high quality of geometric representation makes interpretation of objects a relatively easy task (take another look at the map provided in Section 1 to remind yourself).
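The vector primitives described above can be sketched directly as co-ordinate lists. The Python example below is a minimal illustration with made-up co-ordinates and hypothetical helper names: a point is a single pair, a line is a string of pairs with a length, and an area is a closed string of pairs with a perimeter and an area (computed here with the standard shoelace formula).

```python
import math


def line_length(coords):
    """Sum of straight-line distances between consecutive co-ordinate pairs."""
    return sum(
        math.hypot(x2 - x1, y2 - y1)
        for (x1, y1), (x2, y2) in zip(coords, coords[1:])
    )


def is_closed(coords):
    """An area boundary is closed when its first and last pairs coincide."""
    return len(coords) >= 4 and coords[0] == coords[-1]


def ring_area(coords):
    """Planar area of a closed boundary using the shoelace formula."""
    if not is_closed(coords):
        raise ValueError("boundary is not closed")
    twice_area = sum(
        x1 * y2 - x2 * y1
        for (x1, y1), (x2, y2) in zip(coords, coords[1:])
    )
    return abs(twice_area) / 2.0


# Hypothetical entities, not taken from the figures in this unit.
tree = (10.0, 10.0)                                    # point: no length, no area
river = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]           # line: length, no area
field = [(0.0, 0.0), (4.0, 0.0), (4.0, 3.0),
         (0.0, 3.0), (0.0, 0.0)]                       # area: perimeter and area

print(line_length(river))
print(line_length(field), ring_area(field))
```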

This precision of representation is very useful. There is, however, further important information that is needed to develop a vector data structure to store information about the entities in a vector spatial data model. This information is about the geographical relationships between the points and lines that are used to represent an entity. These spatial relationships are expressed as topology. As with the raster spatial data model there are many potential data structures that can be used to represent an entity in the vector world. However, they can be categorised into two groups:

1. Data structures without topology
2. Data structures with topology

5.1 Data structures without topology

The simplest form of vector data structure that can be used to reproduce a geographical image in the computer is a set of x and y co-ordinates. Figure 17 shows how the simple map (introduced earlier in the section on raster data) might be represented in a vector view of the world.

Figure 17. A simple vector map.

A simple data structure without topology for this model could be constructed as follows:

Area (housing estate)
H1 (40, 50), H2 (40, 50), H3 (50, 45), H4 (50, 35), H5 (70, 35), H6 (70, 30), H7 (90, 30), H8 (90, 50), H1 (40, 50).
Note: the first and last co-ordinate pair are the same. This ensures that the area or polygon is closed.

Line (river)
R1 (0, 25), R2 (10, 23), R3 (20, 20), R4 (40, 20), R5 (50, 20), R6 (70, 20), R7 (80, 20), R8 (90, 15).

Points (trees)
T1 (10, 10), T2 (20, 5), T3 (25, 15), T4 (35, 10), T5 (55, 10), T6 (60, 15), T7 (65, 5), T8 (75, 10), T9 (75, 5).

Here x, y represents the co-ordinates used to identify the location of the points, which must be connected to make the entity. The descriptor in brackets, (trees) etc., is added to the file so that the computer knows what the data represents. A new feature is recorded by a code such as carriage return <cr>.

The limitations of simple vector data structures start to emerge when we look at more complex spatial entities. Consider for example the group of areas and lines represented in Figure 18.

Figure 18.

Figure 18a, at its simplest level, could be represented by the following data structure:

Area 1   xa,ya xb,yb xc,yc xk,yk xl,yl xm,ym
Area 2   xu,yu xv,yv xw,yw xk,yk xl,yl xm,ym

If you were to reconstruct the image in Figure 18a from the data structure given above, you would find that the co-ordinates that define the boundary line shared by the two polygons would be stored twice. While this may not appear too much of a problem for our small example, consider the implications for a map of the soil series of the United Kingdom or the state boundaries in the US. The amount of duplicated data stored would be a large proportion of the total data.

In Figure 18b we have a slightly different problem. The network could quite easily be stored using the following simple file structure.

Line 1   x1,y1 x2,y2 x3,y3 etc
Line 2   x4,y4 x5,y5 x3,y3 etc
Line 3   x3,y3 x6,y6 x7,y7 etc
Line 4   x8,y8 x9,y9 x7,y7 etc
Line 5   x7,y7 x10,y10 x11,y11 etc

The computer would be able to reproduce the image, but a problem would arise as soon as we tried to use this information to ask questions about the network. This is because the computer has not been provided with any information to tell it that line 1 is connected to line 2, which is connected to line 3, and so on. These spatial linkages are only inferred in the viewer's mind when the lines are displayed on the screen and are not contained explicitly within our data file. This situation has led to simple vector data structures of this type, without topology, being referred to as 'spaghetti', because what is actually on the screen is merely a jumble of linear features as far as the computer is concerned. This spaghetti approach is used by many design and drawing packages. A true GIS must use a topological data structure to represent spatial entities if it is to be of any practical use other than for displaying features. Note that the absence of topology is one factor which distinguishes a CAD/CAM system from one designed to store, manipulate and analyse spatial data (GIS).

There are two specific problems of 'spaghetti' data structures that illustrate why topological information is important. First, spaghetti data contains no neighbourhood information, and second, the data structure is unable to cope with what are termed hole or island polygons.

5.1.1 The neighbourhood problem

The neighbourhood problem has already been alluded to when we discussed the problem of storing a simple network as a series of lines (Figure 18b). The problem is that while the lines give the appearance of a network when displayed on the screen, the actual file that is used to create the image contains no information about which line is connected to the next.
In the same way, a series of polygons created using the simple data structure may appear to be connected, but in fact they are discrete entities which are unaware of the presence of neighbouring polygons. Even giving each polygon a label or unique identifier would not solve the problem. What is required is a set of instructions which informs the computer where one polygon is with respect to its neighbours. Polygon data structures that contain such information would be termed topologically correct. How a data structure can be designed to include full topology will be discussed later.
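To see what the absence of neighbourhood information means in practice, consider the five-line network of Figure 18b stored as spaghetti. In the Python sketch below the co-ordinate values are invented stand-ins for x1,y1 to x11,y11, and the function name is hypothetical. Connectivity is not stored anywhere; it has to be worked out by comparing end-point co-ordinates pair by pair, and the shared points are held more than once.

```python
from itertools import combinations

# Spaghetti storage: each line is just a string of co-ordinate pairs.
# The values are made up; the comments give the labels used in the text.
SPAGHETTI = {
    "Line 1": [(0, 8), (2, 7), (4, 6)],     # x1,y1 x2,y2 x3,y3
    "Line 2": [(0, 3), (2, 4), (4, 6)],     # x4,y4 x5,y5 x3,y3
    "Line 3": [(4, 6), (6, 6), (8, 5)],     # x3,y3 x6,y6 x7,y7
    "Line 4": [(6, 1), (7, 3), (8, 5)],     # x8,y8 x9,y9 x7,y7
    "Line 5": [(8, 5), (10, 5), (12, 4)],   # x7,y7 x10,y10 x11,y11
}


def inferred_connections(lines):
    """Pairs of lines that share an end point, found by brute-force
    comparison of co-ordinates because no linkage is stored."""
    pairs = []
    for (name_a, a), (name_b, b) in combinations(lines.items(), 2):
        if {a[0], a[-1]} & {b[0], b[-1]}:
            pairs.append((name_a, name_b))
    return pairs


connections = inferred_connections(SPAGHETTI)
stored = sum(len(coords) for coords in SPAGHETTI.values())
distinct = len({pt for coords in SPAGHETTI.values() for pt in coords})

print(connections)
print(stored, "pairs stored for", distinct, "distinct locations")
```

The comparison relies on exact co-ordinate matches, which works for these integer values but is fragile with real digitised data; this is one practical reason for storing the linkages explicitly.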

5.1.2 The island or hole problem

Figure 19 illustrates the island or hole problem. From the figure you can see that one polygon classified as containing the soil type clay is wholly contained within a polygon classified as soil type loam. The problem is frequently referred to as one of nested or hierarchical polygons. While a simple file structure would be able to recreate the image in Figure 19 from a series of x, y co-ordinates, it would not be able to inform the computer that the island (clay) polygon lies wholly within the larger loam polygon. Dealing with islands or holes also requires a fully topological data structure.

Figure 19. The island or hole problem.

5.2 Data structures with topology

5.2.1 Point

A point is the simplest spatial entity that can be represented in the vector world with full topology, because all that is required for a point to be topologically correct is a pointer or geographical reference which locates its position with respect to other spatial entities in the real world. This is performed by tagging the point with a geographical reference.

5.2.2 Lines and networks

Simple lines carry no inherent spatial information about their connectivity. Lines only need to have topological information attached to them when they become part of a network, area or surface feature. Topological information is added to line features through the use of 'pointers' which flag where links occur in the data structure. The most frequently used pointer in the vector data model is the node. Figure 20 shows the type of information required to identify connectivity in a line network.

The first stage in turning a series of lines into an intelligent network is to identify the start, end and junction points. These pointers or nodes are then used to record information about the connectivity of the network as well as hold information that regulates the nature and direction of the information flow. Figure 20a illustrates six nodes, four of which represent the start and end points of the network (B, D, E, F) and two of which (A, C) represent junctions. The second stage is to identify the lines or arcs that connect the nodes. This information is presented in Figure 20b. In many cases, direction is also an important network feature, and Figure 20c shows how the direction at which an arc joins a node can be recorded.

Figure 20. Network connectivity.
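A node-and-arc table of this kind can be sketched as follows in Python. The node labels follow the text (A and C as junctions; B, D, E and F as start or end points), but the arc numbering, the arc directions and the function names are assumptions, since the layout of Figure 20 is not reproduced here.

```python
from collections import defaultdict

# Hypothetical arc table: arc id -> (from-node, to-node).
ARCS = {
    1: ("B", "A"),
    2: ("D", "A"),
    3: ("A", "C"),
    4: ("C", "E"),
    5: ("C", "F"),
}


def classify_nodes(arcs):
    """Label each node by the number of arcs that meet at it."""
    degree = defaultdict(int)
    for start, end in arcs.values():
        degree[start] += 1
        degree[end] += 1
    labels = {}
    for node, d in degree.items():
        if d == 1:
            labels[node] = "end"
        elif d >= 3:
            labels[node] = "junction"
        else:
            labels[node] = "intermediate"
    return labels


def reachable(arcs, start):
    """Nodes that can be reached from start when arcs may only be
    followed in their recorded direction (from-node to to-node)."""
    outgoing = defaultdict(list)
    for a, b in arcs.values():
        outgoing[a].append(b)
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for nxt in outgoing[node]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


labels = classify_nodes(ARCS)
print(labels)
print(sorted(reachable(ARCS, "B")))
```

Once the from-node and to-node of each arc are stored, questions such as "what lies downstream of B?" become simple table look-ups rather than co-ordinate comparisons.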

5.2.3 Areas

Topology for a set of area entities is built in a series of stages. These stages have been described by Burrough and McDonnell (1998) and are only summarised here. The order in which a vector GIS carries out the stages is product specific and may not follow the order described here. However, the principles remain the same, with the process consisting of four stages:

Stage 1 Generating a boundary network
Stage 2 Linking arcs into polygons
Stage 3 Checking polygons for closure
Stage 4 Providing a unique identifier for each polygon

Stage 1: Generating a boundary network. Figure 21 shows a set of simple polygons. The first step in generating full topology for the entities in Figure 21 is to identify those arcs that intersect one another. Arcs that cross are automatically split at the intersection into separate arcs, and a node is added at the junction (Figure 22b).

Figure 21. A set of simple polygons.

Figure 22. Generating a boundary network.
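As a minimal sketch of this first step, the Python function below splits two crossing arcs and creates a node at the crossing. It assumes that each arc is a single straight segment between two x, y co-ordinates, which is a simplification because real arcs usually have many vertices. The function name and the tuple layout are hypothetical and are not taken from any GIS product.

```python
def split_at_crossing(arc_a, arc_b):
    """Split two crossing straight arcs at their intersection point.

    Each arc is ((x1, y1), (x2, y2)). Returns (node, new_arcs). When the
    arcs do not cross strictly inside both segments, node is None and the
    two arcs are returned unchanged.
    """
    (x1, y1), (x2, y2) = arc_a
    (x3, y3), (x4, y4) = arc_b

    # Cross product of the two direction vectors; zero means parallel arcs.
    denom = (x2 - x1) * (y4 - y3) - (y2 - y1) * (x4 - x3)
    if denom == 0:
        return None, [arc_a, arc_b]

    # Position of the crossing along each arc (0 = start, 1 = end).
    t = ((x3 - x1) * (y4 - y3) - (y3 - y1) * (x4 - x3)) / denom
    u = ((x3 - x1) * (y2 - y1) - (y3 - y1) * (x2 - x1)) / denom
    if not (0 < t < 1 and 0 < u < 1):
        return None, [arc_a, arc_b]

    node = (x1 + t * (x2 - x1), y1 + t * (y2 - y1))
    new_arcs = [
        (arc_a[0], node), (node, arc_a[1]),
        (arc_b[0], node), (node, arc_b[1]),
    ]
    return node, new_arcs
```

Arcs that only touch at an end point are deliberately left alone here, on the assumption that a node already exists at that location.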

The second step involves sorting arcs according to their x and y location so that arcs that are topologically close to each other are also in close proximity in the data file. This helps speed up retrieval times when searching for adjacent chains. It is then possible to generate an outer envelope or boundary network that contains all other polygons. The outer polygon is only used to build topology for the arc network. The envelope polygon is built by identifying the arcs that make up the outer boundary of the area (Figure 22c). A flag should be set to indicate that each of the arcs that make up the outer envelope has been traversed once. The following information should be stored for the envelope polygon:

A unique identifier or polygon ID
A code that identifies it as an envelope polygon
A direction pointer indicating the order and direction in which arcs should be linked together to form the boundary
A list of arcs in the boundary
Its x, y extent

Stage 2: Linking arcs into polygons. Once the outer envelope has been created, topology can be constructed for each of the other polygons in turn. The same starting point should be used as was employed in the construction of the outer envelope, and if the outer envelope was constructed in a clockwise direction then the other polygons should be constructed in the same direction. Once a pointer or node is reached, the arc to the right should be followed. Arcs should be dropped from the search once they have been traversed twice. This process should be repeated until all polygons have been constructed (Figure 22d).

Stage 3: Checking polygons for closure. In building topology for a set of areas it is essential to be able to identify whether each area is closed. If a polygon is left open it is not topologically correct. Polygon closure can be checked quite easily by consulting the arc table, because every arc must be linked to a node that points to the next arc. If any arcs in the table are found without the proper node pointers, then either the arcs are mistakes or unclosed polygons exist. Arcs of this type may be flagged for correction or deletion, depending upon the nature of the error.

Stage 4: Providing a unique identifier for each polygon. The final stage in building full topology for a set of polygons is to ensure that a unique label or identifier is attached to each polygon. This is important if non-spatial (attribute) data are to be linked with the polygons that have been created. It is also important for locating (geographically) one polygon in relation to another. Figure 23 summarises the information that must be stored in order to reconstruct polygon topology.

Figure 23. Vector polygon with topological data structure.
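The closure check in Stage 3 and the record kept for each polygon might look something like the following Python sketch. The field names, node labels and arc table are hypothetical. The record simply mirrors the items listed for the envelope polygon, and the check assumes that each polygon stores its arcs in traversal order with a flag giving the direction of travel along each arc.

```python
from dataclasses import dataclass


@dataclass
class PolygonRecord:
    polygon_id: int        # unique identifier (Stage 4)
    is_envelope: bool      # code marking the outer envelope polygon
    arcs: list             # ordered (arc_id, forward) pairs; forward=False reverses the arc
    extent: tuple = None   # (xmin, ymin, xmax, ymax); not used by the closure check


# Hypothetical arc table: arc id -> (from_node, to_node).
ARC_TABLE = {
    1: ("N1", "N2"),
    2: ("N2", "N3"),
    3: ("N1", "N3"),
    4: ("N3", "N4"),
}


def is_closed(polygon, arc_table):
    """Return True when the polygon's arcs chain node to node back to the start."""
    if not polygon.arcs:
        return False
    ends = []
    for arc_id, forward in polygon.arcs:
        from_node, to_node = arc_table[arc_id]
        ends.append((from_node, to_node) if forward else (to_node, from_node))
    # Each arc must finish at the node where the next arc starts;
    # the last arc wraps round to the first.
    return all(
        ends[i][1] == ends[(i + 1) % len(ends)][0]
        for i in range(len(ends))
    )


triangle = PolygonRecord(1, False, [(1, True), (2, True), (3, False)])
dangling = PolygonRecord(2, False, [(1, True), (2, True), (4, True)])
```

Here triangle closes because arc 3 is traversed in reverse from N3 back to N1, whereas dangling ends at N4 and would be flagged for correction or deletion.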

Figure 24. Vector representation exercise.

6 Vector and raster spatial data models: advantages and disadvantages

Maffini (1987) states that the raster-vector approaches are two alternate methods for storing and representing spatial phenomena. As models they have relative strengths and weaknesses for describing conditions in the real world. In this section we will explore some of the merits and weaknesses of each model. Throughout the course there will be many instances where you will need to determine whether one data model is more appropriate than another. IGISE (1991) groups the advantages and disadvantages identified into five generic themes:

Data volume
Topology queries
Generality
Analytical capability
Accuracy and precision

These themes are used here to start the debate about the choice of spatial data model as part of the design process.

6.1 Data volume

Data volume is one of the most frequently discussed areas in the raster-vector debate, a debate that was at its height during the 1970s and the first half of the 1980s, when the technological limitations on computer power and storage were most marked. The difficulty is that the answer is not simply that raster data sets are larger than vector data sets. It depends upon the character and complexity of the spatial entities you are trying to record. A simple or complex raster, for example, can require as much storage space to record a simple spatial entity with few polygon boundaries as it would take to record a complex spatial entity with many polygon boundaries. In the same way, an unstructured vector file without topology for a series of 50 complex polygons can be much smaller than a fully topological vector data structure for the same area. The more complex a spatial entity becomes, the closer the data volume requirements of the different data storage techniques become. As a general rule, however, raster spatial data models are more demanding of data storage than their vector counterparts.
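A rough calculation helps to show why the answer depends on the data. The Python sketch below uses hypothetical byte sizes and counts; none of the figures or function names are taken from IGISE (1991) or from any GIS product. It treats a simple raster as one value per cell, and a vector file as two co-ordinates per vertex plus an optional fixed allowance of topological pointers per arc.

```python
def raster_bytes(rows, cols, bytes_per_cell=1):
    """Simple raster: one value per cell, whatever the map looks like."""
    return rows * cols * bytes_per_cell


def vector_bytes(vertex_count, bytes_per_coord=8,
                 arc_count=0, topology_bytes_per_arc=0):
    """Vector file: two co-ordinates per vertex plus optional pointers per arc."""
    return vertex_count * 2 * bytes_per_coord + arc_count * topology_bytes_per_arc


# Assumed example values, for illustration only.
grid = raster_bytes(1000, 1000)
simple_outline = vector_bytes(500)
with_topology = vector_bytes(500, arc_count=40, topology_bytes_per_arc=16)
detailed_outline = vector_bytes(60000)
```

Under these assumptions a 1000 by 1000 raster at one byte per cell needs 1,000,000 bytes whether it shows one polygon or fifty, while the vector figure grows with the number of vertices and with the topological information attached to each arc.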

6.2 Topology queries

As IGISE (1991) stated, an important prerequisite of a GIS project is the ability to ask questions such as:

Where is something?
What is next to something?
What is contained within something?

The ability of the different data models to answer these questions is of vital importance to the designer of a GIS project. Both the raster and vector spatial data models have strengths and weaknesses in answering different types of spatial question. These are explored in detail in the following sections on single and multiple layer spatial operations. For the moment it is enough to note that, traditionally, vector data models have been considered more appropriate for answering topological questions about containment, adjacency and connectivity. However, with the advent of more intelligent raster data structures such as the quadtree, which contain information about the relationship between cells in the image, two particular spatial queries can now be performed efficiently using a raster data model. One of these is identifying the area in which a point is located. In general, however, where topological queries are likely to constitute the major application of the GIS, a vector data model is required.

6.3 Generality

In a GIS project it is frequently necessary to change the scale and thematic resolution of operation. This often makes it essential to be able to generalise the complexity of spatial features. For example, it might be necessary to dissolve 500 enumeration districts into 14 wards, or 200 detailed soil polygons into 15 general soil units. Depending upon the type of generalisation required, the vector and raster data models possess different relative advantages. The vector data model, for example, handles changes in scale much more easily than its raster counterpart with regard to the visual representation of entities. This is because of the precise way in which information is recorded as a set of x, y co-ordinates. Changes of scale pose a problem in the grid world if a resolution is requested that is finer than the cell size specified at the project outset. Increases in scale in the raster world are typified by the appearance of a blockier image. On the other hand, in generalising the actual form of area and surface entities the raster model comes into its own, because aggregating a complex soil class map into a more general one only requires the value of each cell to be reclassified to construct the new image. While this is possible in the vector world it requires complex calculations of

the intersection and adjacency of polygons with similar attributes. Therefore, if many calculations of this nature are required a raster data model may be more appropriate.

6.4 Analytical capabilities

There is a clear distinction between the analytical capabilities of raster and vector GIS. This is a major component of the sections of this unit dealing with spatial operations.

6.5 Accuracy and precision

In days gone by, you may often have heard a vendor of a vector GIS product announcing to a client that the product is more accurate at representing spatial features because it uses the vector spatial data model. This statement provides a useful starting point for examining accuracy and precision because it highlights one of the most common mistakes made when comparing raster and vector data: the confusion of the terms accuracy and precision. Before we proceed it is necessary to define what we mean by the terms. Accuracy is the faithfulness with which our spatial entity is represented in our computer view of the real world, including its location (positional or spatial accuracy) and character (attribute accuracy). Precision is independent of accuracy and is the degree of exactness used to record the location and character of our spatial entity. For example, a typical vector based GIS allocates 8 decimal digits of precision to each of its coordinates and many allocate 16 (Goodchild and Gopal 1990). This level of precision is much higher than the accuracy of typical GIS data. Therefore, what the vendor really meant in the above statement is that the vector data model is more precise at reproducing the shapes and lines that we are used to seeing on our traditional paper map model of the world. This is because the paper map uses a vector data model to represent spatial entities. The locations of small entities are shown as points, roads and rivers as lines, and areas of forest as polygons with a distinct boundary.
Naturally, therefore, if we use a vector GIS to capture these data they will be more precisely reproduced than in the raster world, where the points would appear as cells, the roads and rivers as jagged and stepped linear features, and the areas as blocky irregular entities rather than with the smooth boundaries formed by the arcs of a polygon. At first sight it might appear that our vendor is indeed right that the vector GIS is much more accurate than an equivalent grid based system. To understand why this is not the case we need to revisit our definition of accuracy and look closely at the concept of faithfulness of representation. The first important point to recognise is that all spatial data

are of limited accuracy. IGISE (1991) and Goodchild and Gopal (1990) provide passages to illustrate the point.

Consider for example two air photo-interpreters who are evaluating the boundary for a wooded area. They are likely to produce two different boundaries. Branches of trees, for example, overlap one another. The overhang and overlap can easily be several metres. Depending on the season in which the photograph was taken (i.e. whether or not the trees are in leaf), the boundary line may be drawn differently. IGISE (1991)

The area labelled soil type A on a map of soils is not in reality all type A, and its boundaries are not sharp breaks but transition zones. Similarly, the area labelled population density 1000-2000/sq.km does not in fact have between 1000 and 2000 in every square km, or between 10 and 20 in every hectare, since the spatial distribution of population is punctiform and can only be approximated by a smooth surface. Goodchild and Gopal (1990)

Now if we return to our vendor, what s/he should have said was that a vector GIS is more precise in representing a spatial entity as it appears on a map. It is not necessarily more accurate than a raster GIS at representing the location and character of the true real world feature. It is not correct to assume that a database is inaccurate simply because an entity takes on a jagged, blocky or stepped appearance. In addition, many users of GIS consider the blocky irregular boundary produced between areas when a raster spatial data model is used to be more appropriate for representing real world features where distinct boundaries between spatial phenomena are not present.

7 What have you learned?

This section has introduced the two field-based models for representing geographic space in GIS: raster and vector. We have examined a range of storage methods employed by the cell-based raster spatial data model, from the simple raster to the area quadtree.
In considering the vector spatial data model with its precise geometric representation, we have seen the significance of topology as well as the variety of structures in vector models.