Module 3: File and database organization


Overview

This module introduces the basic concepts of files and databases, their components, and organization. Database characteristics, advantages, and disadvantages will be reviewed, followed by a comparison of hierarchical, network, and relational databases. You will also study database management systems and new developments.

Test your knowledge

Begin your work on this module with a set of test-your-knowledge questions designed to help you gauge the depth of study required.

Topic outline and learning objectives

3.1 Data organization and information
Describe how fields, records, files, and databases are organized within a data hierarchy. (Level 1)

3.2 Database organization methods
Describe database organization and database components. (Level 1)

3.3 Database management systems
Describe a database management system and explain why it is needed. (Level 1)

3.4 Database storage and analysis
Describe database storage techniques. (Level 2)

3.5 Database developments
Describe database developments, including data warehousing, data marts, and data mining. (Level 2)

Module summary

Module 3: Test your knowledge

1. Multiple choice (see Solution 1)

a. Which of the following is the lowest level in the hierarchy of data?
   1. Entity
   2. Field
   3. File
   4. Record

b. What is a data definition language?
   1. A language used by Java to define data on the Web
   2. A language used in data communication to route data packets
   3. A language used to define and describe data and data relationships in a database
   4. A language used in decision support systems to define data

c. What is the most important characteristic of a primary key field?
   1. It is short.
   2. It is a file.
   3. It represents an entity.
   4. It is unique.

d. Which of the following is a problem of the traditional file environment?
   1. Requires an administrator for data maintenance
   2. Requires a fourth-generation language to program
   3. Program data dependence
   4. Storage capacity

e. Which of the following file or database models has a many-to-many relationship?
   1. Indexed
   2. Flat file
   3. Hierarchical
   4. Relational

2. Chapter 5, Review question 2, page 210 (see Solution 2)

3. Chapter 5, Review question 13, page 210 (see Solution 3)

3.1 Data organization and information

Learning objective

Describe how fields, records, files, and databases are organized within a data hierarchy. (Level 1)

Required reading

LEVEL 1
Chapter 5, pages

Databases can be used for business intelligence purposes such as obtaining product profitability, customer profiles, and targeting promotions. In the opening case of Chapter 5, the Valero Energy company uses a fully-integrated enterprise business intelligence system, called WebFocus, to make meaningful data available and accessible throughout the organization.

Basic terms

Data is generally organized in a hierarchy that starts with a character and progresses into a database. For illustrative purposes, let's look at the components of a student database that holds the students' names, courses enrolled, and the students' grades.

A character may be alphabetic, numeric, or a symbol, and each character occupies a single position in a field. Each letter in a student's name is a character.

A field is a group of related characters and it is the smallest piece of information in a record. For example, in a student file, one field could hold the first name of each student; in an accounts receivable file, one field could hold the invoice number. A field can also hold graphical, video, or sound images. More than one field makes up a record.

A record is a collection of related data fields. It holds all the information about an entity in the file. All the records in a file must have the same fields.

A file is a collection of related records. Each file has a unique structure. For example, a paper-based file is identified by a folder and all the pages it holds, organized in some fashion, perhaps with a table of contents. An electronic file on a computer is identified by a filename, and holds all the records stored under the filename.

An entity is a generalized class of people, places, or things (objects) for which data is collected, stored, and maintained.
For example, in a student database, one entity could be a student. In general, each entity has at least one record associated with it.

An attribute is a characteristic of an entity. In the above example, the student has a student number, name, date of birth, and so on. Attributes are contained in the fields that are grouped by entities. Not only must each record in a file contain the same fields, but each field must also hold the same type of information and have the same attributes. An example of an attribute defined for the NAME field of a personnel file could be:

Field description: NAME
Field type: Character field
Field width: 30 characters
Field structure: Last name, followed by a blank, then first name, followed by a blank, then initial. The first character in the last and first names must be in upper case, subsequent characters to be in lower case, unless specified otherwise. The initial is always in upper case. If a name contains more than one initial, use the first initial only. If the name is too long to fit in the field, drop the initial, then truncate (shorten) the first name as needed.

Each record can be seen as a row in a table and each field can be seen as a column. A database is an organized collection of records in one or multiple tables. All databases require that every record contain at least one key. A key is a field or set of fields that identifies the record. A primary key is a field or set of fields that uniquely identifies each record in the table. When the key used to look up or sort records is not unique, a secondary key can be used alongside it. For example, in a file containing a student directory, the key field could be the name, and the secondary key field could be the address, so that in case of identical names, the secondary key can be used for sorting.
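The NAME field rules and the key concepts above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the student numbers, and the addresses are made up for the example, and the simple capitalization used here does not cover the "unless specified otherwise" cases (for example, names such as McDonald).

```python
def format_name_field(last, first, initials="", width=30):
    """Build a fixed-width NAME field: last name, blank, first name, blank, initial."""
    # First character upper case, the rest lower case (simplifying assumption).
    last = last.strip().capitalize()
    first = first.strip().capitalize()
    # If there is more than one initial, use the first one only.
    initial = initials.strip()[:1].upper()

    value = " ".join(part for part in (last, first, initial) if part)
    if len(value) > width:
        # Too long: drop the initial first.
        value = f"{last} {first}"
    if len(value) > width:
        # Still too long: truncate the first name as needed.
        room = max(width - len(last) - 1, 0)
        value = f"{last} {first[:room]}".rstrip()
    # Pad with blanks so every record's field has the same width.
    return value[:width].ljust(width)


def is_unique_key(records, field):
    """Return True if no two records share the same value in the given field."""
    values = [record[field] for record in records]
    return len(values) == len(set(values))


# Hypothetical student directory: each dict is a record, each entry a field.
students = [
    {"student_no": "S001", "name": format_name_field("ong", "francine", "mk"), "address": "12 Elm St"},
    {"student_no": "S002", "name": format_name_field("lee", "alexander"), "address": "4 Oak Ave"},
    {"student_no": "S003", "name": format_name_field("lee", "alexander"), "address": "9 Pine Rd"},
]

# The name is not unique, so the address acts as a secondary key when sorting.
directory = sorted(students, key=lambda record: (record["name"], record["address"]))
```

Here the student number can serve as a primary key because it is unique, while the name cannot; sorting on the pair (name, address) mirrors the secondary key example in the text.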

3.2 Database organization methods

Learning objective

Describe database organization and database components. (Level 1)

Required reading

LEVEL 1
Chapter 5, pages

Database approach

As computer applications became more complex and required the use of several related files, database techniques were developed to meet these needs. The Data Base Task Group of the Conference on Data Systems Languages (CODASYL) published the first formal documentation of the key features of databases. This publication, which has been updated several times, has become the model that many software developers use to develop databases.

Unlike the file approach, the database approach allows different applications (for example, accounting, personnel, and payroll) to access the same database. Instead of organizing the data to meet the needs of a particular application (for example, payroll), the database approach requires the organization to analyze its overall information requirements, and then design a common database to meet the needs of multiple applications.

Database systems provide a centralized repository of information that is not application-specific. Data integrity, primary and secondary key management, and indexing are all managed centrally. Various applications access the database to update information. Because the information is no longer organized in application-specific files, it is much easier to update or change software applications as long as the information is used as structured in the database. The database approach requires the use of database management systems (DBMS).

Data modelling

Logical design describes logical relationships among data and groups them in a logical order, whereas physical design takes the logical design and structures it for efficiency and effectiveness.
For example, it might be more effective to create summary totals as data are entered, rather than calculate them each time they are required, or some data attributes could be carried in more than one entity. These are examples of planned data redundancy, with the goal of improving system performance to meet user needs.

An important tool for database designers is a data model, which is used to show relationships between entities. If this is done at the highest level for the organization, it is known as enterprise modelling. A commonly used tool for modellers is an entity-relationship (ER) diagram. By using these tools, designers can ensure that relationships are logically structured so that when databases and application programs are developed, they will in fact meet the needs of the system's users.

Database models

The data in a database can be interrelated in many ways. Historically, databases were organized in a hierarchical or network structure. Today, the most popular structure is a relational database. Do not be overly concerned with the mechanics of these structures. Instead, focus on the essential differences between the database types, and the general organization of the data.

Hierarchical database

A hierarchical database organizes information in a tree-like structure in which data elements are related to each other in a parent (superior) to child (subordinate) relationship. A data element can be a data field, a record, or a database file. The hierarchical database provides a one-to-many relationship, in a top-down manner. For example, where department is the parent of employee, you must specify the department to access any of its employees. If you have no information on the parent, it is impossible to retrieve the item because you must access the item through its parent.

A hierarchical structure is particularly useful for databases containing structured information where access to information is keyed to the structure, that is, the logical access is in the same hierarchy as the physical layout of the database. The rigid structure of a hierarchical database enables it to be updated efficiently. Typically, it is used in applications such as inventory management, where a large number (hundreds of thousands or millions) of records are in the system.

Network database

A network database is similar to a hierarchical database except that a child in the system can have more than one parent. Thus, because more than one path to a particular data element exists, the database structure is many-to-many. Network databases are particularly efficient for looking up information because they permit access from more than one starting point. Unlike a hierarchical database, the process of querying a network database is less restrictive. A network database is appropriate for situations where queries of the database may not follow a predetermined pattern. An example is a database of students and their course enrolment, where a student can be enrolled in multiple courses.
The relationship between students and courses is thus many-to-many, and a hierarchical database is inappropriate.

Relational database

A relational database uses two-dimensional tables called relations to store data. In the relational model, each row of a table represents an entity, with the columns representing attributes. Each attribute can have only certain predefined values, and these allowable values are called the domain. This provides automatic error-checking features to all applications using the table.

The relational database is particularly easy to manage for answering user questions and producing reports. Basic data manipulation includes selecting (eliminates rows), projecting (eliminates columns), and joining and linking (creates a new table).

One distinctive feature of a relational database is that you can combine any number of tables as long as they share at least one common field. For example, you can combine (join) two tables to form a third, provided there is a common column. What is especially important is that data from multiple tables can be linked in this way to answer queries or produce reports. Using a relational database, you can answer a complex query with a few simple commands, whereas the traditional file-based approach would require several programs to be written and run against the various files containing the required data, and a new file to be created after several operations.

A relational database has properties beyond two-dimensional tables. For example, there is no need for order or sequence in a table, and the relation is a logical structure, thus users need not be concerned with physical storage details.

Because of the flexibility provided by relational databases, they are becoming the design of choice for computer professionals. Relational databases reduce data redundancy (facilitated by the database joining capability) and allow data tables to be added with relative ease.
With relational databases, it is relatively easy to perform queries on the data without being constrained by the actual structure of the data. Microsoft Access is a relational database program.

Example 3.1
Choosing a database model

Francine Ong has been assigned to design a database for a new inventory control system. The following is a partial description of the data items and their relationship:

- Product items are organized by product lines, and each product can only belong to one product line.
- Each salesperson is assigned one or more product lines. A product line can have more than one salesperson assigned.
- Each salesperson is assigned a sales territory. For large territories, more than one salesperson can be assigned.

Q: Of the three database models (hierarchical, network, and relational), which model is suitable for the information described?

Solution (the written answer appears at the end of this module)

Exhibit 3-1 graphically represents the network model for the inventory database, while Exhibit 3-2 depicts the relational model. Exhibit 3-3 is a short-form notation of the relational model.

Exhibit 3-1
Network model

Exhibit 3-2
Relational model

Sales territory (partial table)
Territory code   Territory name            Territory manager   G/L profit centre
1001             Northern B.C.             J. Chrieten
                 North Vancouver Island    B. Beverly
                 South Vancouver Island    C. Cleverly
                 Lower Mainland            K.C. Leung

Salesperson (partial table)
Salesperson ID   Salesperson name    Territory code   Quota for the year
810              Kelvin Longile
                 Rowanda Dhaliwal
                 James Jones
                 Mathew Mah
                 Lucien Chong

Product line (partial list)
Product line code   Description
E01                 Electrical parts
E02                 Plumbing supplies

Product assignment (partial list)
Product line code   Salesperson ID
E
E
E
E
E

Product (partial table)
Product code   Product name             Manufacturer                Product line code   Unit cost
C1023          Centronics plug          Acme Manufacturing          E
C0143          Triplex plug             Acme Manufacturing          E
C1045          Universal plug           TDK Manufacturing           E
P4106          Peerless faucet          LEW Piping Supplies         E
P4107          Kitchen Sink Stainless   Kitchen Aid Manufacturing   E

Instead of drawing the tables as shown in Exhibit 3-2, another common practice is to list the contents of each table in short-form notation, marking the key field with an asterisk, as shown in Exhibit 3-3.

Exhibit 3-3
Short-form notation

Tables:
Sales territory:      Territory code*, Territory name, Territory manager, G/L profit centre
Salesperson:          Salesperson ID*, Salesperson name, Territory code, Quota for the year
Product line:         Product line code*, Description
Product assignment:   Product line code*, Salesperson ID*
Product:              Product code*, Product name, Manufacturer, Product line code, Unit cost

Many computer database programs, such as Access, FileMaker Pro, and Paradox, provide relational capabilities. Oracle and Microsoft SQL Server are examples of fully relational databases.

Any database is only as useful as the data it contains. Data should be accurate, complete, economical, flexible, reliable, relevant, simple, timely, verifiable, accessible, and secure. The purpose of data cleanup is to develop processes to ensure those characteristics. Data cleanup is particularly important when moving from a file-based system to a database or migrating from one database to another.
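As a rough sketch of how the short-form notation in Exhibit 3-3 might translate into relational tables, the following Python example uses the standard sqlite3 module with an in-memory database. The table and column names are adapted from the exhibit. The rows are made-up sample data, and the assignment of the plug to product line E01 and the faucet to E02 is an assumption, because those values are not legible in the exhibit.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# One table per line of the short-form notation; asterisked fields become keys.
conn.executescript("""
CREATE TABLE sales_territory (
    territory_code    TEXT PRIMARY KEY,
    territory_name    TEXT,
    territory_manager TEXT,
    gl_profit_centre  TEXT
);
CREATE TABLE salesperson (
    salesperson_id   INTEGER PRIMARY KEY,
    salesperson_name TEXT,
    territory_code   TEXT REFERENCES sales_territory (territory_code),
    quota_for_year   INTEGER
);
CREATE TABLE product_line (
    product_line_code TEXT PRIMARY KEY,
    description       TEXT
);
-- Resolves the many-to-many link between product lines and salespeople.
CREATE TABLE product_assignment (
    product_line_code TEXT REFERENCES product_line (product_line_code),
    salesperson_id    INTEGER REFERENCES salesperson (salesperson_id),
    PRIMARY KEY (product_line_code, salesperson_id)
);
CREATE TABLE product (
    product_code      TEXT PRIMARY KEY,
    product_name      TEXT,
    manufacturer      TEXT,
    product_line_code TEXT REFERENCES product_line (product_line_code),
    unit_cost         REAL
);
""")

# Hypothetical sample rows, for illustration only.
conn.execute("INSERT INTO sales_territory VALUES ('T1', 'Sample territory', 'Sample manager', 'PC1')")
conn.executemany("INSERT INTO salesperson VALUES (?, ?, ?, ?)",
                 [(1, "Sample A", "T1", 1000), (2, "Sample B", "T1", 2000)])
conn.executemany("INSERT INTO product_line VALUES (?, ?)",
                 [("E01", "Electrical parts"), ("E02", "Plumbing supplies")])
conn.executemany("INSERT INTO product_assignment VALUES (?, ?)",
                 [("E01", 1), ("E01", 2), ("E02", 2)])
conn.executemany("INSERT INTO product VALUES (?, ?, ?, ?, ?)",
                 [("C1023", "Centronics plug", "Acme Manufacturing", "E01", 1.0),
                  ("P4106", "Peerless faucet", "LEW Piping Supplies", "E02", 2.0)])

# Select (WHERE), project (column list), and join (common columns) in one query:
# which salespeople can sell each product in the electrical parts line?
rows = conn.execute("""
    SELECT p.product_name, s.salesperson_name
    FROM product AS p
    JOIN product_assignment AS pa ON pa.product_line_code = p.product_line_code
    JOIN salesperson AS s ON s.salesperson_id = pa.salesperson_id
    WHERE p.product_line_code = 'E01'
    ORDER BY p.product_name, s.salesperson_name
""").fetchall()
```

The product_assignment table has a key made of two fields, which is how the relational model represents the many-to-many relationship between salespeople and product lines without a parent and child structure.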

3.3 Database management systems

Learning objective

Describe a database management system and explain why it is needed. (Level 1)

Required reading

LEVEL 1
Chapter 5, pages

The goals and activities of a business should be supported by the appropriate database structure. To create, implement, and use a database, a database management system (DBMS) is required. A DBMS is a group of programs used as an interface between the database and either the application programs or a user. Users include end-users, who use the information from the database or enter data into the database; programmers, who develop applications for the database; and database administrators (DBA), who create and manage the database. All DBMSs have certain common functions, but are classified by the type of database they support.

Providing a user view

The first step in creating a database is to define the business objective or goal of the database in a measurable manner. The next step is providing the DBMS with information about the physical structure and logical relationships among the data to be contained in the database. This description is called a schema or schematic. Subschemas, which define the set of data that users can view or modify, or both, give users access to only the portion of the database that they need, based on business rules and their role in the organization. For example, the subschema for the accounts payable clerks should only allow them to have access to the accounts payable-related information and not payroll information. The use of subschemas is not only efficient but also ensures data security.

Creating and modifying the database

A data definition language (DDL) is used to define and describe data and data relationships in a database. The schema and subschema are applied using a DDL.
When creating or modifying a database, it is also critical to establish a data dictionary that contains a complete description of all data in the database, including nomenclature, attributes, users, and applications. Typical uses of a data dictionary are to
- provide a standard definition of terms and data elements
- assist programmers in designing and writing programs
- simplify database modification

A data dictionary helps achieve the advantages of the database approach by
- reducing data redundancy
- increasing data reliability
- speeding up program development
- facilitating modification of data and information

Storing and retrieving data

Potential problems arise if more than one user or program attempts to access the same record in the same database at the same time, and so there is a need for concurrency control. Data access control functions within the DBMS ensure that two users cannot modify the same field at the same time.

Manipulating data and generating reports

When the DBMS is operational, a variety of programming languages can be used by different users to create applications that will access the data from the database. A data manipulation language (DML) is a set of commands that is part of the DBMS package. Structured query language (SQL) is a popular DML tool that can be used across a wide range of hardware platforms.

The personal computer environment is significantly different from the corporate mainframe or networked environment. Typical database software for personal computers, such as Microsoft Access, MySQL, and FileMaker Pro, allows the user to interact directly with the database without needing to know or understand the different components such as the DDL and DML.
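The roles of the DDL, the DML, and a subschema described in this topic can be illustrated with a small sketch using Python's standard sqlite3 module. The employee table, the employee_directory view, and all rows are hypothetical. A view is used here only to approximate a subschema: it limits which columns are visible, but SQLite itself does not enforce per-user access rights, so this shows the idea rather than a complete security mechanism.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# DDL: define the data and its structure (part of the schema).
conn.execute("""
    CREATE TABLE employee (
        employee_no INTEGER PRIMARY KEY,
        name        TEXT NOT NULL,
        department  TEXT NOT NULL,
        salary      REAL NOT NULL
    )
""")

# DDL: a view that exposes only non-payroll columns, approximating a subschema.
conn.execute("""
    CREATE VIEW employee_directory AS
    SELECT employee_no, name, department FROM employee
""")

# DML: insert and update rows.
conn.executemany(
    "INSERT INTO employee VALUES (?, ?, ?, ?)",
    [(1, "Sample Clerk", "Accounts payable", 100.0),
     (2, "Sample Analyst", "Payroll", 200.0)],
)
conn.execute("UPDATE employee SET department = ? WHERE employee_no = ?", ("Payroll", 1))

# DML: query through the restricted user view.
cursor = conn.execute("SELECT * FROM employee_directory ORDER BY employee_no")
view_columns = [column[0] for column in cursor.description]
view_rows = cursor.fetchall()

# The DBMS, not the application, rejects a duplicate primary key.
try:
    conn.execute("INSERT INTO employee VALUES (1, 'Duplicate', 'Payroll', 0.0)")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True
```

The CREATE statements play the part of the DDL, while INSERT, UPDATE, and SELECT are DML commands. The salary column never appears in results read through the view.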

3.4 Database storage and analysis

Learning objective

Describe database storage techniques. (Level 2)

Required reading

LEVEL 2
Chapter 5, pages

Database storage techniques

A number of storage techniques can be used to store and manage a database. Most databases are stored in a central location. Mainframe computers and personal computers use a centralized database storage technique. However, distributed database storage is growing in popularity.

Distributed databases

Distributed database storage involves storing an organization's data in several different servers that are connected via telecommunication equipment. Distributed databases are technically quite complicated to implement and administer. It is sufficient to know that such a technology exists, and that one form of implementation is a replicated database. For the purpose of this course, the description in the text on page 203 is adequate.

3.5 Database developments

Learning objective

Describe database developments, including data warehousing, data marts, and data mining. (Level 2)

Required reading

LEVEL 2
Chapter 5, pages (up to "Distributed databases"), pages

Data warehouses, data marts, and data mining

The value of data ultimately lies in the decisions it enables. Companies have started developing data warehouses and data marts to collect business information from the multiple sources within an organization with the objective of making better business decisions. Data mining and online analytical processing (OLAP) are information-analysis tools that help automate the identification of patterns, trends, or relationships in a data warehouse to support decision making.

A data warehouse enables an organization to consolidate massive amounts of information extracted from operational and production systems for analysis. Data warehousing techniques are becoming increasingly popular with large organizations that have amassed trillions of bytes of data. Ordinary database analysis techniques do not work well with such massive amounts of data.

A well-designed and properly built data warehouse
- delivers a good return on investment
- improves the company's competitive advantage by linking both internal and external information
- stores data extracted from the production databases and conventional files in one place
- has directories that show users what is in the database and how to access it
- provides information that meets the organization's need for business intelligence

Building a data warehouse is a very time-consuming process because it is difficult to define what data are necessary and what level of consolidation is desired. Many organizations now start with a smaller version of a data warehouse, called a data mart, for departmental use. Data marts are also used by small and medium-sized businesses.
Departmental data marts can be used for online analytical processing (OLAP) within departments and form the basis of the data warehouse for the organization.

Data mining is an information analysis tool that involves the automated discovery of patterns and relationships in a data warehouse. Business intelligence has stimulated the interest in and the use of data mining because of the enormous amounts of data being collected. Because of the rapid growth and potential for data mining, the traditional DBMS vendors are incorporating data mining tools into their products.

While both online analytical processing (OLAP) and data mining support data analysis and decision making, a data-mining tool generally does the work for the user and presents results, while OLAP requires the user to be more knowledgeable about the data and their business context to gain insight from the data. OLAP is now being used to store and deliver vast amounts of data warehouse information efficiently.

Business intelligence

Business intelligence (BI) is the process of getting enough of the right information in a timely manner and usable form to support the business strategy, tactics, or operations.

Competitive intelligence is the continuous legal and ethical collection and analysis of information about competitors for comparison purposes. Counterintelligence is what a firm does to protect its information from the competition.

Knowledge management is a collection of techniques that captures and manages structured and unstructured information to improve the ability of the organization to make timely and good business decisions.

Open database connectivity (ODBC) is a set of standards that supports database integration and the sharing of information between databases. Software developed according to these standards can be used with any ODBC-compliant database. This is extremely important to organizations that use a variety of levels of database applications. ODBC is frequently a standard requirement when organizations select software.

Object-oriented and object-relational database management systems

Instead of storing individual records, an object-oriented database management system (OODBMS) stores objects which, unlike records, may not be uniform in shape and size and may exist in a variety of forms including audio, video, and graphical data.

An object-relational database management system (ORDBMS) allows third parties to add new data types and operations to the database. The growth of e-commerce, web-based applications, and web servers has created increasing demands for ORDBMS.

Virtual or hypermedia databases contain linked nodes of data. A web page containing hypertext links can be viewed as a form of hypermedia database of information. On a web page, a user does not need to navigate through the information in a sequential manner. Instead, hypertext links can be used to explore other parts of the database.
The advantage of hypertext is that, unlike traditional database manipulation languages, it lets users search for and manipulate alphanumeric data in an unstructured form. Hypermedia databases are an extension of hypertext that store and access graphics, sound, and video, as well as alphanumeric data.

One other database system of increasing importance is spatial data technology, also known as geographic information systems (GIS). The global positioning system (GPS) is one of the applications that provide data input to the GIS. GIS databases store spatial location data. In the case of NASA and Canadian satellites, over a terabyte of data is stored every day. The cumulative data is nearing the petabyte (1,000 terabytes) mark. For such large databases, special tools are being developed to handle the data.
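The OLAP idea described earlier in this topic, in which a knowledgeable user summarizes warehouse data along the dimensions of interest, can be sketched with plain Python. The fact rows, the dimension names, and the roll_up function below are all hypothetical, and real OLAP tools work on far larger data with dedicated storage; this only shows the kind of roll-up a user might request.

```python
from collections import defaultdict

# Hypothetical sales facts: three dimensions and one numeric measure.
sales = [
    {"product_line": "E01", "salesperson": "A", "quarter": "Q1", "amount": 100},
    {"product_line": "E01", "salesperson": "B", "quarter": "Q1", "amount": 150},
    {"product_line": "E02", "salesperson": "B", "quarter": "Q2", "amount": 50},
    {"product_line": "E01", "salesperson": "A", "quarter": "Q2", "amount": 70},
]


def roll_up(facts, dimensions, measure="amount"):
    """Total the measure for each combination of values of the chosen dimensions."""
    totals = defaultdict(int)
    for fact in facts:
        key = tuple(fact[dimension] for dimension in dimensions)
        totals[key] += fact[measure]
    return dict(totals)


# The user decides which view of the data is meaningful.
by_product_line = roll_up(sales, ["product_line"])
by_quarter = roll_up(sales, ["quarter"])
by_salesperson_and_quarter = roll_up(sales, ["salesperson", "quarter"])
```

In this sketch the user chooses the dimensions, which reflects the point that OLAP relies on the user's knowledge of the data, whereas a data-mining tool would search for patterns on its own.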

Module 3 summary

File and database organization

This module introduces the basic concepts of files and databases, their components, and organization. Database characteristics, advantages, and disadvantages will be reviewed, followed by a comparison of hierarchical, network, and relational databases.

Describe how fields, records, files, and databases are organized within a data hierarchy.

Data must be organized and structured so that they can be used effectively.

Data hierarchy (from largest to smallest element):
1. Database: a group of files holding related information
2. File: a collection of related information called records
3. Record: a collection of attributes of an entity in a file. For example, in a personnel file, an employee is an entity. Attributes of an employee include employee number, date of birth, and start date.
4. Field: the smallest piece of information in a record, corresponding to one attribute of an entity. A primary key field is a field that uniquely identifies a record in a file for quicker access of data and sorting. A secondary key field is sometimes used for access and sorting but it does not uniquely identify a record.
5. Entity: people, places, or objects for which data is collected, stored, and maintained
6. Attribute: a characteristic of an entity
7. Character: a letter, number, or symbol

Describe database organization and database components.

A database is a collection of data organized so that they can be accessed and used by many different applications. Data is stored and managed centrally.
Logical and physical view of data:
- The logical view presents what end-users see.
- The physical view reflects the way data is actually organized and structured on physical storage media.

Some advantages of using a database approach:
- data independent of application programs
- reduction of data redundancy and inconsistency
- elimination of data confusion
- consolidation of data management
- ease of information access and use

Disadvantages of the database approach:
- The organization is more vulnerable in the event of system failures because data is centralized.
- Software and hardware requirements are higher.
- Because data is centralized, errors that do enter the database may have a widespread effect.
- A database administrator (DBA) is required to manage the DBMS.

Three principal database models are:
- hierarchical model: organizes information in a tree-like structure
- network model: the database structure is many-to-many
- relational model: uses two-dimensional tables called relations to store data

Which model to use depends on
- the nature of the data relationships
- the need for flexibility
- the volume of requests or changes to the database to be processed
- the ease of use for end-users

Describe a database management system and explain why it is needed.

A database management system (DBMS) is the software that serves as an interface between a common database and various application programs.

Three components of a DBMS are:
- data definition language
- data manipulation language
- data dictionary

A schema describes the physical structure and logical relationships of data. A subschema provides a specific user view.

A data definition language (DDL) is used to define and describe data and data relationships in a database.

A data dictionary contains a complete description of all data in the database. A data dictionary reduces data redundancy, increases data reliability, and facilitates development and modification of the database.

Data manipulation language (DML) commands are part of a DBMS package, and are used to manipulate the data and generate reports. Structured query language (SQL) is a DML tool that can be used across a wide range of hardware platforms.

Describe database storage techniques and services.

Most databases are stored in a central location. Mainframe computers, personal computers, as well as LANs, use a centralized database storage technique.

Distributed databases are technically quite complicated to implement and administer. A replicated database holds a duplicate set of frequently-used data at different locations and is one type of distributed database.

Describe database developments, including data warehousing, data marts, and data mining.
Data warehouse: A data warehouse consolidates data from various operational systems and external data. It enables online analytical processing (OLAP) to provide information that meets the organization's information needs. It is difficult and costly to build; however, it provides a good return on investment if properly designed.

Data marts: Smaller versions of a data warehouse, called data marts, may be built first. These data marts can be used for departmental OLAP and form the basis of the data warehouse for the organization.

Data mining: Data mining is an information analysis tool that involves the automated discovery of patterns and relationships in a data warehouse.

Solution 1

a. 2) Text, page 177
b. 3) Module Notes, Topic 3.3
c. 4) Text, page 179
d. 3) Text, page 179
e. 4) Text, page 182

Solution 2

A database is a collection of integrated and related files. A database management system is the software used to manipulate the database and provide an interface between the database and the user or application programs.

A database management system is systems software that helps organize data for effective access and storage by multiple applications. A DBMS provides different users with different views of the data (subschemas), avoids redundancy, encourages program independence, offers flexible access, and provides centralized control.

Solution 3

Data mining is the automated discovery of patterns and relationships in data warehouses. OLAP tools can tell users what happened in their business. Data mining searches the data for statistical "whys" by seeking patterns in the data and then developing hypotheses to predict future behaviour.

Online analytical processing (OLAP) programs are used to store and deliver data warehouse information. OLAP allows users to explore corporate data in new and innovative ways using multiple dimensions such as products, salespeople, or time. OLAP programs include spreadsheets, reporting and analysis tools, and custom applications.

Solution (Example 3.1)

The hierarchical model is inappropriate in this case because of the many-to-many relationships between salespersons, product lines, sales territories, and inventory items. The network and relational models, however, are both suitable. The preferred model is a relational database because of its flexibility in associating or linking different types of data.