Data storage and data structures (lecture 4)
Main points in today's lecture: quantification; digital storage; structuring devices; data structures; and data models.
Quantification Information to data (discretization); data to data structures; data structures to data models; analysis. Start with the problem of continuous data and discretization. The process of discretization is a fundamental requirement of using GIS data.
Question What is discretization?
Problems with discretization Difficult/impossible to determine where boundaries should be drawn. Especially with respect to natural phenomena. Human-made entities such as buildings, roads, bridges, dams, and sports fields are easier to define using crisp boundaries.
Selection After discretization, need to select objects that will be included in the database. Selection is a necessary step in acquiring and storing data. Including all features would require an infinitely large database.
Context Context and goals of analysis determine what is included. If you are analyzing the spread of measles among pre-schoolers, then the ratio of conifers to deciduous trees in the area doesn't concern you.
After discretization and selection Georeferencing provides us with a method to encode discrete objects. Georeferencing systems include addresses and postal codes; administrative zones (town of Burnaby); grids and map sheets. Georeferencing involves quantification.
Quantification Once data is selected and referenced, need to quantify it in order to use it in a computer. Quantification uses numerical representation. Sometimes, it is easy to quantify. The width of a highway is a case of simple quantification. Likewise, determining the square mileage of a lake is straightforward.
Points about quantification A computer system stores unique or discrete values. These may or may not faithfully represent the continuum of values that exist in the real world. The nature of the data is important, as different types of mathematical operations can be performed on different data. Numerical values can be defined with respect to nominal, ordinal, interval or ratio scales of measurement.
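The link between scale of measurement and permitted operations can be sketched in Python. This is only an illustration: the sample values and variable names below are hypothetical and not taken from any real dataset.

```python
from statistics import mean, median, mode

# Hypothetical sample values, one list per scale of measurement.
land_use = ["forest", "urban", "forest", "water"]  # nominal: labels only
avalanche_risk = [1, 3, 2, 3, 1]                   # ordinal: 1=low, 2=medium, 3=high
temperature_c = [10.0, 20.0, 15.0]                 # interval: zero point is arbitrary
road_width_m = [6.0, 12.0, 9.0]                    # ratio: zero means "no width"

# Nominal data can only be counted, so the mode is a sensible summary.
most_common_land_use = mode(land_use)

# Ordinal data can be ranked, so the median is meaningful.
middle_risk = median(avalanche_risk)

# Interval data support differences and therefore means.
mean_temperature = mean(temperature_c)

# Only ratio data support statements such as "twice as wide".
width_ratio = road_width_m[1] / road_width_m[0]

print(most_common_land_use, middle_risk, mean_temperature, width_ratio)
```

Taking the mean of the avalanche codes would run without error, but the result would have no clear meaning, which is why the nature of the data matters.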
Simple vs complex quantification Quantification may be simple, or require considerable abstraction. Example 1 (Simple): The maximum height of a mountain can easily be included in a GIS. Example 2 (Complex): There are many options for coding the characteristics of a forest in a GIS, including: numeric codes for forest categories such as rainforest or woodland; the canopy closure expressed as a percentage; or a numeric code for the dominant species of tree in the forest.
Complex quantification Spatial science is riddled with complex instances of quantification. How do you quantify the zoning areas of the city of Vancouver? How do you encode areas with high, low or medium risk of avalanches? QUESTION: Which of these problems first requires discretization of continuous data?
Summary: to convert info to data 1. select data; 2. classify it (make categories); 3. discretize; 4. georeference; 5. quantify.
Structuring digital data Once a numerical designation has been determined, we have to input the data in a way which is acceptable to the computer. Need to review how digital data is stored inside a computer.
Bits and bytes The basic unit of storage is the bit, which is short for binary digit. A bit can only have two states: on or off. Eight bits make up a byte, and groups of bytes make up words.
Bytes Bits are rarely seen alone in computers. They are almost always bundled together into 8-bit collections, and these collections are called bytes. The 8-bit byte is something that people settled on through trial and error over the past 50 years; it is as accidental as 12 eggs in a dozen. With 8 bits in a byte, you can represent 256 values ranging from 0 to 255, as shown here: 0 = 00000000, 1 = 00000001, 2 = 00000010, ..., 254 = 11111110, 255 = 11111111.
Words The number of bits that the computer uses as the basic unit to store data is called the word size. For example, the following sizes are commonly used: 16-bit (2 bytes), "personal computers" (previous generation); 32-bit (4 bytes), "personal computers" (current generation); 64-bit (8 bytes), mainframes.
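The byte table can be reproduced with a short Python sketch. The helper name to_byte_string is made up for this illustration.

```python
def to_byte_string(value: int) -> str:
    """Return the 8-bit binary pattern for an integer between 0 and 255."""
    if not 0 <= value <= 255:
        raise ValueError("a single byte can only hold values from 0 to 255")
    # "08b" means: binary digits, padded with leading zeros to width 8.
    return format(value, "08b")

# Eight bits, each with two states, give 2**8 distinct values.
NUM_VALUES = 2 ** 8

for v in (0, 1, 2, 254, 255):
    print(v, "=", to_byte_string(v))
```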
Save as .txt Establishment of ASCII as a standard revolutionized data transfer as it allows us to use the same semantic coding between systems. ASCII stands for American Standard Code for Information Interchange. It assigns each letter and symbol on the keyboard a standardized numerical code. Note that when preparing files for exchange, you will often be asked to store them as ASCII or in .txt format. In practice these usually mean the same thing.
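A small Python sketch of the character-to-code mapping, using the string "GIS" as an arbitrary example:

```python
text = "GIS"

# ord() returns the numeric code of a character; for these letters it is the ASCII code.
codes = [ord(ch) for ch in text]

# Encoding as ASCII stores one byte per character.
encoded = text.encode("ascii")

# Decoding reverses the mapping, which is what makes exchange between systems possible.
decoded = encoded.decode("ascii")

print(codes, list(encoded), decoded)
```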
Data storage Rely on structuring principles which are themselves based on computer architecture. Structuring principles include: arrays, matrices, lists, stacks, queues and deques, records, sets, trees, tables and networks.
Structuring devices Structuring devices are ways of storing information that directly conform to and reflect computer architecture. The lowest order of structuring devices are lists, stacks, arrays, queues and deques. Records, sets, trees, tables and networks are higher order structuring devices, and are dependent on lower order devices.
Lists Lists are a slightly lower level of structuring devices but closely related to arrays. A list or linear list is a dynamic data structure (meaning it can shrink or grow depending on how many items it includes). A list is quite literally a list and it usually contains like data such as integers or real numbers or text strings rather than a mix. However, this is not a strict rule. One characteristic of lists is that they are ordered.
Ordering of lists Each element or data item is in a specific order whether alphabetical or numerical or other. Lists can be implemented by using arrays. In such a case, the list is "held" by the array.
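A rough Python sketch of an ordered list held by an array. Python's built-in list is backed by a dynamic array, and the standard bisect module keeps it in order. The town names other than Burnaby are just sample values, and the helper contains is made up for this example.

```python
from bisect import bisect_left, insort

towns = []  # grows as items are added (a dynamic structure)
for name in ["Vancouver", "Burnaby", "Surrey"]:
    insort(towns, name)  # insert while keeping alphabetical order

def contains(sorted_items, item):
    """Binary search: only works because the list is kept in order."""
    i = bisect_left(sorted_items, item)
    return i < len(sorted_items) and sorted_items[i] == item

print(towns)
print(contains(towns, "Surrey"), contains(towns, "Richmond"))
```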
Stacks, queues, and deques Stacks, queues and deques are all instances of the linear list. They are transitory data structures, as items leave the structure as soon as they are retrieved. In a stack, all the additions and deletions are made at one end, the top of the stack (LIFO: last in, first out). In a queue, input is at one end and output is at the other end of the list (FIFO: first in, first out). The much more flexible deque (double-ended queue) allows insertions and deletions at either end.
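The three behaviours can be sketched in Python with a plain list and the standard collections.deque. The letters are placeholder items.

```python
from collections import deque

# Stack: add and remove at the same end (last in, first out).
stack = []
for item in ("a", "b", "c"):
    stack.append(item)
last_in = stack.pop()

# Queue: add at one end, remove at the other (first in, first out).
queue = deque()
for item in ("a", "b", "c"):
    queue.append(item)
first_in = queue.popleft()

# Deque: insertions and deletions allowed at either end.
dq = deque(["b"])
dq.appendleft("a")
dq.append("c")
front = dq.popleft()
back = dq.pop()

print(last_in, first_in, front, back, list(dq))
```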
Registers The final step in memory is the registers. These are memory cells built right into the CPU that contain specific data needed by the CPU, particularly the arithmetic and logic unit (ALU). An integral part of the CPU itself, registers are controlled directly by the compiler that sends information for the CPU to process.
Relations of registers to lists etc. Lists and arrays may seem esoteric concepts but they refer directly to the computer architecture. If you think of the registers of a computer, lists and arrays directly address positions in the register. They constitute the base map for how information is stored in the computer. Their terminology must be precise because the computer meaning is computationally precise. Computers store data items in literal addresses. The entire system has a unique architecture.
Arrays An array is a structure which accommodates the inherent row and column nature of much data. It comprises a block of contiguous memory in the computer in which data elements are stored. It can have one or many dimensions, and programming languages will allow the user to DIMension arrays. In BASIC, the syntax for dimensioning an array is: dim array_1(20) which translates to make space for a one dimensional array with 20 elements.
Matrices A matrix is like an array but it is not necessarily computer compatible. A matrix is a good way to imagine an array. Once a matrix is encoded in the computer, it becomes an array. A typical matrix looks like this. 55 65 73 93 34 98 23 87 225 9 12 65 94 356 7 983 * How many dimensions is this matrix?
Difference between arrays and matrices A matrix is a higher level data structure (like an array) but one which could be expressed on paper. An array is, by contrast, a computer data structure. Arrays specify how the table information is stored and accessed by the computer, while a matrix is just a table of numbers.
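One way to picture a matrix becoming an array is the Python sketch below. It assumes the 16 values of the example matrix are laid out as 4 rows by 4 columns, which the slide does not state, and it stores them row by row in one contiguous block. The helper element_at is hypothetical.

```python
from array import array

N_ROWS, N_COLS = 4, 4  # assumed layout of the example matrix

# The matrix as it would be written on paper.
matrix = [
    [55, 65, 73, 93],
    [34, 98, 23, 87],
    [225, 9, 12, 65],
    [94, 356, 7, 983],
]

# The array as stored by the computer: one contiguous run of integers.
flat = array("i", [value for row in matrix for value in row])

def element_at(row: int, col: int) -> int:
    """Translate a (row, column) position into an offset in the flat block."""
    return flat[row * N_COLS + col]

print(len(flat), element_at(1, 1), element_at(2, 0), element_at(3, 3))
```

The same numbers are present in both forms; only the array says where each value sits in memory.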
Records (of a database) A record is a common organizing concept for grouping data items together. Records are organized in arrays. If you think of the rows in ArcGIS data tables, each row constitutes one record. In precise computer terminology, a record is a "linear sequence of variable items which have a collective identity" (Bracken and Webster, 1990, 159). In many computing environments, records constitute a built-in data structure.
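A minimal Python sketch of records held in an array, using a dataclass as the built-in record type. The LakeRecord type, its fields and the sample rows are invented for illustration.

```python
from dataclasses import dataclass

@dataclass
class LakeRecord:
    """One row of a hypothetical attribute table."""
    record_id: int
    name: str
    area_sq_mi: float

# The table is an ordered collection of records, one per row.
table = [
    LakeRecord(1, "Lake A", 12.5),
    LakeRecord(2, "Lake B", 3.2),
]

# A record is retrieved by position; its items are retrieved by field name.
second = table[1]
print(second.name, second.area_sq_mi)
```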
What is a database? A database is a collection of persistent data which is formally defined and centrally controlled for use in a computer.
Advantages to storing data in a database There are several advantages to using databases to store information: data is easily shared; data in a database is permanent and usually remains in a database for long periods; data is easily accessible through search, intersect and overlap functions; databases can easily be used by the computer.
Data Structures The Flat File data structure is just a simple list. The Index File data structure finds objects based on their attributes. The conventional data structures in GIS are the relational, network and hierarchical. Relational data structures are organized by records which resemble tables. Hierarchical data structures are based on the tree structure with parent-child relationships. Network data structures are classified according to record types with pointers linking associated records. Increasingly the Object-oriented data structure is emerging as an alternative in GIS.
Simple data structures Since the days of early computers, computer scientists have evolved more sophisticated ways to keep track of the 0s and 1s that represent information in the computer. The simplest way to order information in the computer is to put it in files, like a Rolodex file on a desk.
Flat file data structure (simple lists) The simplest data structure is a simple list of all items. Each new item is inserted at the end of the list in no particular order. Easy to add data, but hard to retrieve it. Looking for something in an unstructured list is like looking for a needle in a haystack, especially when the list is large.
Indexed data structure Sorting was difficult using the flat-file data structure (e.g. bubble sorts). Indexed data structures allow a search on the attributes of an entity rather than on the entity itself. For example, we might search for all census districts containing high-income populations, or look for postal codes containing people with the attribute "home owner". The attributes act like an index in a book; they point to the real thing.
March of progress Computer Science continued to develop more sophisticated data structures. Today there are three main data structures used in GIS: 1. hierarchical data structures; 2. network systems; and 3. relational database structures.
Hierarchical data structures Hierarchical Data Structures are a familiar concept in that they use a family-tree type structure to organize data. This will also be familiar to those of you who used DOS on PCs before Windows came out. There was a root directory, with sub-directories and then files within those. The hierarchical data structure is basically a tree structure with parent-child relationships. It is also the basis of biological taxonomy, with species, genus, phylum etc.
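The contrast between scanning a flat file and consulting an index can be sketched in Python. The census district records and the helper names scan and build_index are hypothetical.

```python
from collections import defaultdict

# Flat file: records appended in no particular order.
districts = [
    {"id": 101, "income": "high"},
    {"id": 102, "income": "low"},
    {"id": 103, "income": "high"},
]

def scan(records, attribute, value):
    """Flat-file search: every record must be examined."""
    return [r["id"] for r in records if r[attribute] == value]

def build_index(records, attribute):
    """Index: map each attribute value to the positions of matching records."""
    index = defaultdict(list)
    for position, record in enumerate(records):
        index[record[attribute]].append(position)
    return index

income_index = build_index(districts, "income")

# The index points to the real records, like page numbers in a book index.
high_income_ids = [districts[p]["id"] for p in income_index["high"]]

print(scan(districts, "income", "high"), high_income_ids)
```

Both searches return the same districts; the index pays a one-time building cost so that later lookups avoid reading the whole list.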
Trees Trees represent data relationships that are hierarchical. For example, if a database stores data related to a genus, then at the top or root, we might have the genus, followed by species nodes, followed in turn by subspecies.
Trees in the database concept Example of a tree: different levels of government with federal at the root, followed by state, followed in turn by county and municipal governments. The terminal nodes are called leaves, while the points that connect one level to the next are called nodes. Each of these structures and principles is involved to some extent in the database concept.
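The government example can be sketched as a tree in Python using nested dictionaries. The state, county and municipality names are placeholders, and the helpers leaves and path_to are written for this illustration only.

```python
# Each key is a node; its value holds that node's children.
tree = {
    "federal": {
        "state A": {
            "county 1": {"municipality X": {}, "municipality Y": {}},
            "county 2": {},
        },
        "state B": {},
    }
}

def leaves(node):
    """Collect the names of all nodes that have no children."""
    found = []
    for name, children in node.items():
        if children:
            found.extend(leaves(children))
        else:
            found.append(name)
    return found

def path_to(node, target, trail=()):
    """Walk down from the root, returning the chain of parents to the target."""
    for name, children in node.items():
        here = trail + (name,)
        if name == target:
            return list(here)
        deeper = path_to(children, target, here)
        if deeper:
            return deeper
    return None

print(leaves(tree))
print(path_to(tree, "municipality Y"))
```

Every lookup starts at the root and works its way down through parent-child links.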
Navigating hierarchies The problem with the HDS is that it is cumbersome to navigate up and down it when looking for information.
Problems with hierarchies Its rigid structure makes it less than perfect for GIS. Your textbook offers an example of how the HDS's inflexible format makes it difficult to use at times. The director of the Royal Botanical Gardens in London wanted to query the botanical database before a trip to Mexico to find all plants that were native to the area he was going to visit.
But despite all inventory having been computerized using an HDS, the geographical location of the plants had not been included in the database, so it was impossible to select those indigenous to Mexico. It is also difficult to add new detail later (i.e. to update).
Many to many relationships HDS uses one to many relationships. In GIS, we often have many to one or many to many relationships. For instance, if we have an urban database, one polygon might have many point locations (intersections, for instance) with several types of convenience stores and gas stations associated with them. But the same brand of convenience stores and gas stations will be located in other polygons.
Network data structures In a network database structure, entities have pointers which point to related entities. So any piece of data can point to any other piece of data in the database. The pointers indicate relationships between data. This is a much less rigid system than HDS. It is used a lot in transportation databases where the really important relationships are between routes and nodes.
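A rough Python sketch of entities linked by pointers, loosely modelled on a transportation network. The Entity class, the route and intersection names, and the related helper are all hypothetical.

```python
class Entity:
    """A piece of data that holds pointers to related entities."""

    def __init__(self, name):
        self.name = name
        self.links = []  # explicit pointers to other entities

    def link(self, other):
        # Each relationship is stored explicitly, once in each direction.
        self.links.append(other)
        other.links.append(self)

node_a = Entity("intersection A")
node_b = Entity("intersection B")
route_1 = Entity("route 1")
route_2 = Entity("route 2")

route_1.link(node_a)
route_1.link(node_b)
route_2.link(node_b)

def related(entity):
    """Follow the pointers of one entity and list what it is connected to."""
    return sorted(e.name for e in entity.links)

# Every relationship costs storage: three links need six stored pointers here.
pointer_count = sum(len(e.links) for e in (node_a, node_b, route_1, route_2))

print(related(node_b), related(route_1), pointer_count)
```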
Problems with networks A drawback, however, is that the number of pointers (and relationships) can get out of hand and require too much storage space. Each relationship needs to be explicitly defined with the use of pointers. These numerous relationships can become a tangled web, and the system can lead to incorrect linkages and general confusion. The network database structure is only appropriate for certain types of GIS.
Relational data structures The relationships between data tables are based on primary keys. Most common GIS database.
Benefits of relational systems Relational systems are useful because: (i) they're simple; (ii) most accounting and other non-spatial databases are relational, so it is easy to transfer such data to GIS; (iii) there is a well-established query language developed for relational database management systems (RDBMS) called SQL (Structured Query Language).
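Tables linked by primary keys and queried with SQL can be sketched using Python's built-in sqlite3 module. The table names, columns and rows are invented for illustration, loosely following the polygons and convenience stores of the urban database example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # temporary in-memory database
conn.executescript("""
    CREATE TABLE polygon (
        polygon_id INTEGER PRIMARY KEY,
        zoning     TEXT
    );
    CREATE TABLE store (
        store_id   INTEGER PRIMARY KEY,
        brand      TEXT,
        polygon_id INTEGER REFERENCES polygon(polygon_id)
    );
    INSERT INTO polygon VALUES (1, 'commercial'), (2, 'residential');
    INSERT INTO store VALUES (10, 'Brand A', 1), (11, 'Brand B', 1), (12, 'Brand A', 2);
""")

# The join follows the key shared by the two tables; no pointers are stored.
rows = conn.execute("""
    SELECT store.brand, polygon.zoning
    FROM store
    JOIN polygon ON store.polygon_id = polygon.polygon_id
    WHERE store.brand = 'Brand A'
    ORDER BY polygon.polygon_id
""").fetchall()
conn.close()

print(rows)
```

The same brand appears in more than one polygon without any restructuring, which is the kind of relationship that was awkward in the hierarchical structure.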
Next week Data models and the resurrection of the HDS for portraying objects.