Database Management


Subject: DATABASE MANAGEMENT
Credits: 4

SYLLABUS

Introduction to database management systems
Data versus information; record, file; data dictionary; database administrator, functions and responsibilities; file-oriented system versus database system.

Database system architecture
Introduction; schemas, sub-schemas and instances; database architecture; data independence; mapping; data models; types of database systems.

Database security
Threats and security issues; firewalls and database recovery; techniques of database security; distributed databases.

Data warehousing and data mining
Emerging database technologies: internet databases, digital libraries, multimedia databases, mobile databases, spatial databases.

Lab: Working with Microsoft Access

Suggested Readings:
1. A. Silberschatz, H. Korth, S. Sudarshan, Database System Concepts, Fifth Edition, McGraw-Hill.
2. Rob, Coronel, Database Systems, Seventh Edition, Cengage Learning.

DATABASE MANAGEMENT SYSTEM

COURSE OVERVIEW

This course provides immediately usable tools and techniques for database management: requirements analysis, definition, specification and design. It gives participants the details of the tools, techniques and methods needed to lead or participate in the front-end phases of database work. Database systems are designed to manage large bodies of information. Management of data involves both defining structures for the storage of information and providing ways to manipulate that data. In addition, the database system must ensure the safety of the data. A DBMS is a collection of programs that enables you to store, modify, and extract important information from a database. There are many different types of DBMS, ranging from small systems that run on personal computers to huge systems that run on mainframes.

On completion of the course, students should have developed working familiarity with the following concepts: database, DBMS, database system application, file system, data inconsistency.

Objectives

To help you learn DBMS concepts and design techniques: what a DBMS is and how one goes about designing a database. The primary goal of a DBMS is to provide an environment that is both convenient and efficient for people to use in retrieving and storing information. Database systems are designed to store large bodies of information. By the end of this material, you will be equipped with the technical knowledge needed to develop and understand a DBMS.

DATABASE MANAGEMENT

CONTENTS

INTRODUCTION TO DBMS
Lesson 1   Introduction to Database I
Lesson 2   Introduction to Database II
Lesson 3   Tutorial
Lesson 4   Database Concepts I
Lesson 5   Database Concepts II

DATA MODELS IN DATABASES
Lesson 6   Data Models

RELATIONAL DATABASE MANAGEMENT SYSTEM
Lesson 7   Relational Database Management System I
Lesson 8   Relational Database Management System II
Lesson 9   E-R Model I
Lesson 10  E-R Model II

STRUCTURED QUERY LANGUAGES
Lesson 11  Structured Query Language (SQL) I
Lesson 12  Lab
Lesson 13  Lab
Lesson 14  SQL II
Lesson 15  Lab
Lesson 16  Lab
Lesson 17  SQL III
Lesson 18  Lab
Lesson 19  Lab
Lesson 20  SQL IV
Lesson 21  Lab
Lesson 22  Lab
Lesson 23  Integrity and Security
Lesson 24  Lab
Lesson 25  Lab
Lesson 26  PL/SQL
Lesson 27  Lab
Lesson 28  Lab
Lesson 29  Database Triggers
Lesson 30  Lab
Lesson 31  Lab
Lesson 32  Database Cursors
Lesson 33  Lab
Lesson 34  Lab

NORMALIZATION
Lesson 35  Normalisation I
Lesson 36  Normalisation II
Lesson 37  Normalisation III

FILE ORGANIZATION METHODS
Lesson 38  File Organization Method I
Lesson 39  File Organization Method II

DATABASE OPERATIONAL MAINTENANCE
Lesson 40  Transactions Management
Lesson 41  Concurrency Control I
Lesson 42  Concurrency Control II
Lesson 43  Concurrency Control III
Lesson 44  Database Recovery

LESSON 1: INTRODUCTION TO DATABASE I

Lesson Objectives
Database
Database management system
Essentials of data
Benefits of a DBMS
Database system applications
Purpose of database systems

1.1 What is a Database Management System (DBMS)?

A database can be termed a repository of data. A collection of actual data constituting the information about an organisation is stored in a database. For example, if there are 1,000 students in a college and we have to store their personal details, marks details and so on, those details are recorded in a database. A collection of programs that enables you to store, modify, and extract information from a database is known as a DBMS. The primary goal of a DBMS is to provide a way to store and retrieve database information that is both convenient and efficient.

Database systems are designed to manage large bodies of information. Management of data involves both defining structures for the storage of information and providing ways to manipulate the data. In addition, the database system must ensure the safety of the data. There are many different types of DBMS, ranging from small systems that run on personal computers to huge systems that run on mainframes.

Good data management is an essential prerequisite to corporate success. Data becomes information, information becomes knowledge, knowledge informs judgment, judgment drives decisions, and decisions lead to success, provided that the data is:
complete
accurate
timely
easily available

1.2 Database System Applications

Databases are applied in a wide range of fields.
Following are some examples:
Banking: customer information, accounts, loans and other banking transactions.
Airlines: reservation and schedule information.
Universities: student information, course registration, grades.
Credit card transactions: purchases and generation of monthly statements.
Telecommunication: records of calls made, generation of monthly bills.
Finance: information about holdings, sales and purchases of financial instruments.
Sales: customer, product and purchase information.
Manufacturing: management of the supply chain.
Human resources: information about employees, salaries, tax, benefits.
We can say that whenever we need a computerised system, we need a database system.

1.3 Purpose of Database Systems

A file system is one in which we keep information in operating-system files. Before the evolution of the DBMS, organisations stored information in file systems. A typical file-processing system is supported by a conventional operating system: the system stores permanent records in various files, and it needs application programs to extract records or to add and delete records. We will compare the two approaches with the help of an example.

Consider a savings-bank enterprise that keeps information about all customers and savings accounts. The following manipulations have to be supported by the system:
A program to debit or credit an account.
A program to add a new account.
A program to find the balance of an account.
A program to generate monthly statements.
As needs arise, new applications can be added; for example, checking accounts can be added alongside savings accounts. Using a file system for storing data has the following disadvantages:

1. Data Redundancy and Inconsistency
Different programmers work on a single project, so various files are created by different programmers over time, in different formats, and different programs are written in different programming languages.

The same information may be repeated in several files; for example, a customer's name and address may appear in the savings-account file as well as in the checking-account file. This redundancy results in higher storage and access costs. It also leads to data inconsistency: if we change a record in one place, the change may not be reflected everywhere. For example, a changed customer address may be updated in the savings record but nowhere else.

2. Difficulty in Accessing Data
Retrieving data conveniently is also a difficulty in a file system. Suppose we want to see the records of all customers with a balance of less than $10,000: we can either scan the list and find the names manually or write an application program. If at some later time we need the records of customers with a balance of less than $20,000, a new program has to be written. File-processing systems thus do not allow data to be accessed in a convenient manner.

3. Data Isolation
Because the data is scattered across various files, and those files may be stored in different formats, writing application programs to retrieve the data is difficult.

4. Integrity Problems
Sometimes we need the stored data to satisfy certain constraints; for example, a bank may require a minimum deposit of $100. Developers enforce these constraints by writing appropriate program code, but if a new constraint has to be added later, it is difficult to change all the programs to enforce it.

5. Atomicity Problems
Any mechanical or electrical device is subject to failure, and so is a computer system. After a failure we have to ensure that the data is restored to a consistent state. For example, suppose $50 is to be transferred from account A to account B, and a failure occurs after the amount has been debited from account A but before it has been credited to account B. This leaves the data in an inconsistent state. So we must adopt a mechanism which ensures that either the full transaction is executed or none of it is, i.e. the fund transfer must be atomic.

6. Concurrent-Access Problems
Many systems allow multiple users to update the data simultaneously, which can also leave the data in an inconsistent state. Suppose a bank account holds a balance of $500 and two customers withdraw $100 and $50 at the same time. If both transactions read the old balance and subtract from it, the final balance will be $400 or $450 (whichever transaction writes last) instead of the correct $350.

7. Security Problems
Not every user of the database should be able to access all of its data. For example, payroll personnel need to access only the part of the data that describes employees, and should not see information about customer accounts.

Points to Ponder
A DBMS contains a collection of interrelated data and a collection of programs to access that data.
The primary goal of a DBMS is to provide an environment that is both convenient and efficient for people to use in retrieving and storing information.
DBMSs are ubiquitous today, and most people interact directly or indirectly with a database many times every day.
Database systems are designed to store large bodies of information.
A major purpose of a DBMS is to provide users with an abstract view of data, i.e. the system hides how the data is stored and maintained.

Review Terms
Database, DBMS, database system application, file system, data inconsistency, consistency constraints, atomicity, redundancy, data isolation, data security.

Students Activity
1. What is a database? Explain with an example.
2. What is a DBMS? Explain with an example.
3. List four significant differences between a file system and a DBMS.
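A minimal sketch of the atomicity requirement from point 5, using Python's built-in sqlite3 module (the account names and amounts are illustrative, not from the text's bank): grouping the debit and credit into one transaction means that a failure between them rolls both updates back, so the transfer happens entirely or not at all.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO account VALUES (?, ?)", [("A", 500), ("B", 100)])
conn.commit()

def transfer(conn, src, dst, amount, fail_midway=False):
    """Debit src and credit dst as a single transaction: all or nothing."""
    try:
        conn.execute("UPDATE account SET balance = balance - ? WHERE name = ?",
                     (amount, src))
        if fail_midway:
            raise RuntimeError("simulated crash between debit and credit")
        conn.execute("UPDATE account SET balance = balance + ? WHERE name = ?",
                     (amount, dst))
        conn.commit()      # make both updates permanent together
    except RuntimeError:
        conn.rollback()    # undo the partial (debit-only) update

transfer(conn, "A", "B", 50, fail_midway=True)
print(conn.execute("SELECT name, balance FROM account ORDER BY name").fetchall())
# A failed transfer leaves both balances untouched: [('A', 500), ('B', 100)]
```

Without the rollback, the simulated crash would leave account A debited and account B uncredited, which is exactly the inconsistent state described above.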

4. What are the advantages of a DBMS?
5. Explain various applications of databases.
6. Explain data inconsistency with an example.
7. Explain data security. Why is it needed? Explain with an example.
8. Explain the isolation and atomicity properties of a database.
9. Explain why redundancy should be avoided in a database.
10. Explain consistency constraints in a database.


LESSON 2: INTRODUCTION TO DATABASE II

Lesson Objectives
Data abstraction
View of data
Levels of data: physical level, logical level, view level
Database languages: DDL, DML

View of Data
A database contains a number of files and certain programs to access and modify those files, but the actual storage details are not shown to the user: the system hides how the data is stored and maintained.

Data Abstraction
Data abstraction is the process of distilling data down to its essentials. Data should be retrieved efficiently when needed, and since not all details are of use to all users, the actual (complex) details are hidden from them. Several levels of abstraction are provided:

Physical level
The lowest level of abstraction; it specifies how the data is actually stored and describes the complex data structures in detail.

Logical level
The next level of abstraction; it describes what data are stored in the database and what relationships exist among those data. It is less complex than the physical level and specifies simple structures. Though the complexity of the physical level underlies the logical level, users of the logical level need not know those complexities.

View level
The highest level of abstraction; it contains the actual data shown to users, who need not know the underlying details of data storage.

Database Languages
Just as a language is required to communicate anything, a language is needed to create or manipulate a database. Database languages are divided into two main parts:
1. DDL (data definition language)
2. DML (data manipulation language)

Data Definition Language (DDL)
A DDL is used to specify a database scheme as a set of definitions.
1. DDL statements are compiled, resulting in a set of tables stored in a special file called a data dictionary or data directory.
2. The data directory contains metadata (data about data).
3. The storage structure and access methods used by the database system are specified by a set of definitions in a special type of DDL called a data storage and definition language.
4. The basic idea is to hide the implementation details of the database schemes from the users.

Data Manipulation Language (DML)
1. Data manipulation is:
   retrieval of information from the database
   insertion of new information into the database
   deletion of information from the database
   modification of information in the database
2. A DML is a language which enables users to access and manipulate data. The goal is to provide efficient human interaction with the system.
3. There are two types of DML:
   procedural: the user specifies what data is needed and how to get it
   nonprocedural: the user specifies only what data is needed. This is easier for the user, but may not generate code as efficient as that produced by procedural languages.
4. A query language is the portion of a DML involving information retrieval only. The terms DML and query language are often used synonymously.

Points to Ponder
DBMSs are ubiquitous today, and most people interact directly or indirectly with a database many times every day.
Database systems are designed to store large bodies of information.
A major purpose of a DBMS is to provide users with an abstract view of data, i.e. the system hides how the data is stored and maintained.
The structure of a database is defined through DDL and manipulated through DML.
DDL statements are compiled, resulting in a set of tables stored in a special file called a data dictionary or data directory.
A query language is the portion of a DML involving information retrieval only; the terms DML and query language are often used synonymously.

Review Terms
Data security, data views, data abstraction, physical level, logical level, view level, database language, DDL, DML, query language.

Students Activity
1. Define data abstraction.
2. How many levels of data abstraction are there? Explain in detail.
3. Explain database languages. Differentiate between DDL and DML.
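The DDL/DML split can be made concrete with Python's built-in sqlite3 module. The student table and its columns below are invented for the example: the CREATE TABLE statement is DDL (its definition ends up in the catalog), while the INSERT, UPDATE and SELECT statements are DML.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# DDL: define the scheme; the definition is stored in the data dictionary.
conn.execute("""
    CREATE TABLE student (
        roll_no INTEGER PRIMARY KEY,
        name    TEXT NOT NULL,
        marks   INTEGER
    )
""")

# DML: insert, modify and retrieve data.
conn.execute("INSERT INTO student VALUES (1, 'Asha', 82)")
conn.execute("INSERT INTO student VALUES (2, 'Ravi', 67)")
conn.execute("UPDATE student SET marks = 70 WHERE roll_no = 2")

# A query (nonprocedural DML): we state *what* we want, not *how* to get it.
rows = conn.execute(
    "SELECT name, marks FROM student WHERE marks >= 70 ORDER BY roll_no"
).fetchall()
print(rows)  # [('Asha', 82), ('Ravi', 70)]
```

The SELECT is nonprocedural in exactly the sense described above: it names the desired rows, and the database system decides how to find them.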


LESSON 3: TUTORIAL

LESSON 4: DATABASE CONCEPTS I

Lesson Objectives
Data dictionary
Meta-data
Database schema
Database instance
Data independence

Data Dictionary
English-language dictionaries define data as "known facts or things used as a basis for inference or reckoning, typically (in modern usage) operated upon or manipulated by computers", or "factual information used as a basis for discussion, reasoning, or calculation". A data dictionary may cover the whole organisation, a part of the organisation, or a single database. In its simplest form, the data dictionary is only a collection of data-element definitions. A more advanced data dictionary contains the database schema with reference keys; a still more advanced one contains an entity-relationship model of the data elements or objects.

Parts of a Data Dictionary

1. Data Element Definitions
Data element definitions may be independent of table definitions or part of each table definition.
Data element number: used in the technical documents.
Data element name (caption): a commonly agreed, unique data element name from the application domain; the real-life name of the element.
Short description: description of the element in the application domain.
Security classification of the data element: organisation-specific security classification level or possible restrictions on use; may contain technical links to security systems.
Related data elements: list of closely related data element names when the relation is important.
Field name(s): the names used for this element in computer programs and database schemas; these are the technical names, often limited by the programming languages and systems.
Code format: data type (character, numeric, etc.), size and, if needed, special representation. Common programming-language notation, input masks, etc. can be used.
Null value allowed: whether a null or non-existing value may be stored for the element. An element with possible null values needs special consideration in reports and may cause problems if used as a key.
Default value: a data element may have a default value; the default may be a variable, like the current date and time of day.
Element coding (allowed values) and intra-element validation details, or a reference to other documents: explanation of coding (code tables, etc.) and rules for validating this element alone in the application domain.
Inter-element validation details, or a reference to other documents: validation rules between this element and other elements in the data dictionary.
Database table references: the tables the element is used in and its role in each table, with a special indication when the element is the key of the table or part of the key.
Definitions and references needed to understand the meaning of the element: short application-domain definitions and references to other documents.
Source of the data in the element: short description in application-domain terms of where the data comes from; rules used in calculations producing the element's values are usually written here.
Validity dates for the data element definition: start and possible end dates when the element is or was used; there may be several such periods.
History references: date when the element was defined in its present form, references to superseded elements, etc.
External references: references to books, other documents, laws, etc.
Version of the data element document: version number or other indicator; may include formal version control or configuration-management references, though such references may be hidden, depending on the system used.
Date of the data element document: the writing date of this version of the data element document.
Quality control references: organisation-specific quality-control endorsements, dates, etc.
Data element notes: short notes not included in the parts above.

2. Table Definitions
A table definition is usually available with the SQL command HELP TABLE tablename.
Table name
Table owner or database name
List of data element (column) names and details
Key order for all the elements which are possible keys
Possible information on indexes
Possible information on table organisation: the technical organisation, like hash, heap, B+-tree, AVL-tree, ISAM, etc., may be in the table definition
Whether duplicate rows are allowed
Possibly a detailed data-element list with complete data-element definitions
Possibly data on the current contents of the table: the size of the table and similar site-specific information may be kept with the table definition
Security classification of the table: usually the same as or higher than that of its elements; however, there may be views accessing parts of the table with lower security.

Database Schema
The overall structure of a database is called the database schema. A database schema is usually a graphical presentation of the whole database, with tables connected through external keys and key columns. When accessing data from several tables, the schema is needed in order to find the joining data elements and, in complex cases, the proper intermediate tables. Some database products use the schema to join the tables automatically. A database system has several schemas according to the level of abstraction: the physical schema describes the database design at the physical level, the logical schema describes it at the logical level, and a database can also have sub-schemas (view level) that describe different views of the database.

Database Instance
Databases change over time. The information in a database at a particular point in time is called an instance of the database. An analogy with programming languages: a data-type definition corresponds to the schema, and the value of a variable corresponds to an instance.

Meta-Data
Meta-data is definitional data that provides information about, or documentation of, other data managed within an application or environment. For example, meta-data would document data elements or attributes (name, size, data type, etc.), records or data structures (length, fields, columns, etc.), and data about the data itself (where it is located, how it is associated, ownership, etc.). Meta-data may include descriptive information about the context, quality, condition, or characteristics of the data.

Data Independence
The ability to modify a scheme definition at one level without affecting the scheme definition at a higher level is called data independence. There are two kinds:
Physical data independence: the ability to modify the physical scheme without causing application programs to be rewritten. Modifications at this level are usually made to improve performance.
Logical data independence: the ability to modify the conceptual scheme without causing application programs to be rewritten, usually done when the logical structure of the database is altered.
Logical data independence is harder to achieve, as application programs are usually heavily dependent on the logical structure of the data. An analogy can be made to abstract data types in programming languages.

Points to Ponder
A data dictionary is a collection of data elements and their definitions.
The database schema is the overall structure of a database.
A database instance is the content of a database at a particular point in time.
Meta-data is data about data.
The ability to modify a scheme definition at one level without affecting a higher level is called data independence.

Review Terms
Database instance, schema, database schema, physical schema, logical schema, physical data independence, database language, DDL, DML, query language, data dictionary, metadata.
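SQLite's catalog offers a small, concrete data dictionary: its sqlite_master table records metadata about every schema object, and PRAGMA table_info exposes element-level details (field name, code format, null-value-allowed flag, default value, key role) much like the data-element definitions listed above. The account table and its columns below are invented for the sketch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE account (
        acc_no  INTEGER PRIMARY KEY,
        holder  TEXT NOT NULL,
        balance INTEGER DEFAULT 0
    )
""")

# sqlite_master is the catalog: metadata (data about data)
# for every schema object, including the defining DDL text.
tables = conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'"
).fetchall()
print(tables)  # [('account',)]

# PRAGMA table_info lists element-level metadata: column name,
# data type, null-value-allowed flag, default value and key role.
for cid, name, dtype, notnull, default, pk in conn.execute(
        "PRAGMA table_info(account)"):
    print(name, dtype, "NOT NULL" if notnull else "NULL OK", default, pk)
```

Querying the catalog with ordinary SELECT statements is itself an illustration of the point: the dictionary is stored as data, described by the same machinery it describes.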

Student Activity
1. What is the difference between a database schema and a database instance?
2. What do you understand by the structure of a database?
3. Define physical schema and logical schema.
4. Define data independence. Explain the types of data independence.
5. Define data dictionary and meta-data.
6. Describe the various elements of a data dictionary.
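Logical data independence can be sketched with a view in Python's built-in sqlite3 module. The names below (customer_v and the tables it hides) are invented for the example: the application query runs against the view, the base tables are then restructured using the widely supported ALTER TABLE ... RENAME form, and the view is redefined so that the application query does not have to be rewritten.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT, city TEXT);
    INSERT INTO customer VALUES (1, 'Asha', 'Delhi');
    -- The application program is written against this view.
    CREATE VIEW customer_v AS SELECT id, name, city FROM customer;
""")
app_query = "SELECT name, city FROM customer_v"
print(conn.execute(app_query).fetchall())  # [('Asha', 'Delhi')]

# The logical scheme changes: city data moves into its own table.
conn.executescript("""
    DROP VIEW customer_v;
    CREATE TABLE city (id INTEGER PRIMARY KEY, city TEXT);
    INSERT INTO city SELECT id, city FROM customer;
    ALTER TABLE customer RENAME TO customer_old;
    CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT);
    INSERT INTO customer SELECT id, name FROM customer_old;
    DROP TABLE customer_old;
    -- The view is redefined to hide the new structure.
    CREATE VIEW customer_v AS
        SELECT customer.id, customer.name, city.city
        FROM customer JOIN city ON customer.id = city.id;
""")
# The application query is unchanged: it did not need to be rewritten.
print(conn.execute(app_query).fetchall())  # [('Asha', 'Delhi')]
```

This is the sense in which views provide logical data independence: the conceptual scheme was modified, but the program that depends only on the view kept working.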


LESSON 5: DATABASE CONCEPTS II

Lesson Objectives

• Database manager
• Database user
• Database administrator
• Role of the database administrator
• Role of the database user
• Database architecture

Database Manager

The database manager is a program module that provides the interface between the low-level data stored in the database and the application programs and queries submitted to the system.

1. Databases typically require large amounts of storage space (gigabytes). This must be stored on disks, and data is moved between disk and main memory (MM) as needed.
2. The goal of the database system is to simplify and facilitate access to data. Performance is important, and views provide simplification.
3. The database manager module is therefore responsible for:
   • Interaction with the file manager: storing raw data on disk using the file system usually provided by a conventional operating system. The database manager must translate DML statements into low-level file-system commands for storing, retrieving and updating data in the database.
   • Integrity enforcement: checking that updates to the database do not violate consistency constraints (e.g. no bank account balance below $25).
   • Security enforcement: ensuring that users only have access to information they are permitted to see.
   • Backup and recovery: detecting failures due to power failure, disk crash, software errors, etc., and restoring the database to its state before the failure.
   • Concurrency control: preserving data consistency when there are concurrent users.
4. Some small database systems may omit some of these features, resulting in simpler database managers. (For example, no concurrency control is required on a PC running MS-DOS.) These features are necessary on larger systems.

Database Administrator

The database administrator is a person having central control over the data and the programs accessing that data. Duties of the database administrator include:

• Scheme definition: the creation of the original database scheme. This involves writing a set of definitions in a DDL (data definition language), compiled by the DDL compiler into a set of tables stored in the data dictionary.
• Storage structure and access method definition: writing a set of definitions translated by the data storage and definition language compiler.
• Scheme and physical organization modification: writing a set of definitions used by the DDL compiler to generate modifications to the appropriate internal system tables (e.g. the data dictionary). This is done rarely, but sometimes the database scheme or physical organization must be modified.
• Granting of authorization for data access: granting different types of authorization for data access to various users.
• Integrity constraint specification: specifying integrity constraints. These are consulted by the database manager module whenever updates occur.

Database Users

Database users fall into several categories:

• Application programmers are computer professionals who interact with the system through DML calls embedded in a program written in a host language (e.g. C, PL/1, Pascal). These programs are called application programs. The DML precompiler converts DML calls (prefaced by a special character such as $ or #) into normal procedure calls in the host language; the host-language compiler then generates the object code. Some special types of programming languages combine Pascal-like control structures with control structures for the manipulation of a database. These are sometimes called fourth-generation languages, and often include features to help generate forms and display data.
• Sophisticated users interact with the system without writing programs. They form requests by writing queries in a database query language. These are submitted to a query processor that breaks a DML statement down into instructions for the database manager module.
• Specialized users are sophisticated users who write special database application programs. These may be CADD systems, knowledge-based and expert systems, complex-data systems (audio/video), etc.
• Naive users are unsophisticated users who interact with the system by using permanent application programs (e.g. an automated teller machine).

Database System Architecture

Database systems are partitioned into modules for different functions. Some functions (e.g. file systems) may be provided by the operating system. Components include:

• File manager: manages the allocation of disk space and the data structures used to represent information on disk.
• Database manager: the interface between low-level data and application programs and queries.
• Query processor: translates statements in a query language into low-level instructions the database manager understands. (It may also attempt to find an equivalent but more efficient form.)
• DML precompiler: converts DML statements embedded in an application program into normal procedure calls in the host language. The precompiler interacts with the query processor.
• DDL compiler: converts DDL statements into a set of tables containing metadata, stored in the data dictionary.

In addition, several data structures are required for the physical implementation of the system:

• Data files: store the database itself.
• Data dictionary: stores information about the structure of the database. It is used heavily, so great emphasis should be placed on developing a good design and an efficient implementation of the dictionary.
• Indices: provide fast access to data items holding particular values.

[Figure: Database system architecture. Naive users, application programmers, sophisticated users and the database administrator interact, via application interfaces, application programs (object code produced through the DML precompiler), queries (through the query processor) and the database schema (through the DDL compiler), with the database manager, which works with the file manager over the data files, data storage and data dictionary.]

Points to Ponder

• The database manager is a program module that provides the interface between the low-level data stored in the database and the application programs.
• The database administrator is a person having central control over the data.
• A database user is a person who accesses the database at various levels.
• Data files store the database itself.
• The data dictionary stores information about the structure of the database.
• The DML precompiler converts DML statements embedded in an application program into normal procedure calls in the host language.
• The file manager manages the allocation of disk space and the data structures used to represent information on disk.

Review Terms

• Database
• Instance
• Schema
• Database schema
• Physical schema
• Logical schema
• Physical data independence

• Database language
• DDL
• DML
• Query language
• Data dictionary
• Metadata
• Database administrator
• Database user

Student Activity

1. What are the various kinds of database users?
2. What do you understand by the structure of a database?
3. Define physical schema and logical schema.
4. Define file manager, DML precompiler and data files.

Student Notes

LESSON 6: DATA MODELS

Lesson Objectives

• Understanding data models
• Different types of data models
• Hierarchical data model
• Network data model
• Relational model

Data models are a collection of conceptual tools for describing data, data relationships, data semantics and data constraints. A data model is a description of both a container for data and a methodology for storing and retrieving data from that container. Strictly speaking, there isn't really a data model "thing": data models are abstractions, often mathematical algorithms and concepts. You cannot really touch a data model, but they are nevertheless very useful. The analysis and design of data models has been the cornerstone of the evolution of databases; as models have advanced, so has database efficiency.

There are various kinds of data models, i.e. the records in a database can be arranged in various ways. The various ways in which data can be represented are:

1. Hierarchical data model
2. Network data model
3. Relational model
4. E-R model

The Hierarchical Model

The records are organized as a collection of trees. As its name implies, the hierarchical database model defines hierarchically arranged data. Perhaps the most intuitive way to visualize this type of relationship is as an upside-down tree of data: a single table acts as the root of the database, from which other tables branch out. You will be instantly familiar with this relationship, because it is how windows-based directory management systems (like Windows Explorer) work. Relationships in such a system are thought of in terms of children and parents, such that a child may only have one parent but a parent can have multiple children. Parents and children are tied together by links called pointers (perhaps physical addresses inside the file system); a parent holds a list of pointers to each of its children.
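The single-parent restriction just described can be sketched in Python. This is a minimal illustration, not part of the original text: the class name and the sample course/student data are made up for the example.

```python
# Sketch of the hierarchical model's single-parent rule: each node keeps
# one parent "pointer" and a list of pointers to its children.

class Node:
    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent      # a child may have at most ONE parent
        self.children = []        # a parent may have MANY children
        if parent is not None:
            parent.children.append(self)

# Root table and its branches: a course with enrolled students.
course = Node("CP302 Database Management")
student1 = Node("John Smith", parent=course)
student2 = Node("Arun Kumar", parent=course)

# Walking up from a child yields exactly one path to the root, which is
# why many-to-many relationships cannot be modelled directly.
def path_to_root(node):
    path = [node.name]
    while node.parent is not None:
        node = node.parent
        path.append(node.name)
    return path

print(path_to_root(student1))  # ['John Smith', 'CP302 Database Management']
```

Because each node stores a single `parent`, a student enrolled in two courses would need to be duplicated under both course trees, which is exactly the redundancy problem discussed below.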
Suppose we want to create a structure in which a course has various students, and these students are given certain marks in an assignment. As you can imagine, however, the hierarchical database model has some serious problems. For one, you cannot add a record to a child table until it has already been incorporated into the parent table. This might be troublesome if, for example, you wanted to add a student who had not yet signed up for any courses. Worse yet, the hierarchical database model still creates repetition of data within the database. In the database system just described, there might be a higher level that includes multiple courses. In that case there could be redundancy, because students would be enrolled in several courses and each course tree would hold redundant student information. Redundancy occurs because hierarchical databases handle one-to-many relationships well but do not handle many-to-many relationships well. This is because a child may only have one parent. In many cases, however, you will want a child to be related to more than one parent. For instance, the relationship between student and class is many-to-many: not only can a student take many subjects, but a subject may also be taken by many students. How would you model this relationship simply and efficiently using a hierarchical database? The answer is that you wouldn't. Though this problem can be solved with multiple databases creating logical links between children, the fix is very kludgy and awkward. Faced with these serious problems, the database designers of the world got together and came up with the network model.

Network Databases

In many ways, the network database model was designed to solve some of the more serious problems with the hierarchical database model. Specifically, the network model solves the problem of data redundancy by representing relationships in terms of sets rather than a hierarchy. The model had its origins in

the Conference on Data Systems Languages (CODASYL), which had created the Data Base Task Group to explore and design a method to replace the hierarchical model.

The network model is actually very similar to the hierarchical model; in fact, the hierarchical model is a subset of the network model. However, instead of using a single-parent tree hierarchy, the network model uses set theory to provide a tree-like hierarchy, with the exception that child tables are allowed to have more than one parent. This allows the network model to support many-to-many relationships. Visually, a network database looks like a hierarchical database in that you can see it as a type of tree; however, the look is more like several trees that share branches. Thus, children can have multiple parents and parents can have multiple children.

Nevertheless, though it was a dramatic improvement, the network model was far from perfect. Most profoundly, the model was difficult to implement and maintain. Most implementations of the network model were used by computer programmers rather than real users. What was needed was a simple model that could be used by real end users to solve real problems.

Relational Model

The relational model was formally introduced by Dr. E. F. Codd in 1970 and has evolved since then through a series of writings. The model provides a simple, yet rigorously defined, concept of how users perceive data. A relational database is a collection of two-dimensional tables. The organization of data into relational tables is known as the logical view of the database; that is, the form in which a relational database presents data to the user and the programmer. The way the database software physically stores the data on a computer disk system is called the internal view. The internal view differs from product to product and does not concern us here.

A relational database allows the definition of data structures, storage and retrieval operations and integrity constraints. In such a database the data and the relations between them are organized in tables. A table is a collection of records, and each record in a table contains the same fields.

Properties of Relational Tables

• Values are atomic.
• Each row is unique.
• Column values are of the same kind.
• The sequence of columns is insignificant.
• The sequence of rows is insignificant.
• Each column has a unique name.

Certain fields may be designated as keys, which means that searches for specific values of those fields will use indexing to speed them up. Where fields in two different tables take values from the same set, a join operation can be performed to select related records in the two tables by matching values in those fields. Often, but not always, the fields will have the same name in both tables. For example, an orders table might contain (customer-id, product-code) pairs and a products table might contain (product-code, price) pairs, so to calculate a given customer's bill you would sum the prices of all products ordered by that customer by joining on the product-code fields of the two tables. This can be extended to joining multiple tables on multiple fields. Because these relationships are only specified at retrieval time, relational databases are classed as dynamic database management systems. The relational database model is based on the relational algebra. A basic understanding of the relational model is necessary to effectively use relational database software such as Oracle, Microsoft SQL Server, or even personal database systems such as Access or FoxPro, which are based on the relational model.

Points to Ponder

• Data models are a collection of conceptual tools for describing data, data relationships, data semantics and data constraints.
• The types of data models are: 1. hierarchical data model, 2. network data model, 3. relational model, 4. E-R model.
• The hierarchical database model defines hierarchically arranged data.
• The network model solves the problem of data redundancy by representing relationships in terms of sets.

Review Terms

• Data models
• Hierarchical data model
• Network data model
• Relational data model
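The customer-bill calculation described in this lesson (joining orders and products on product-code, then summing prices) can be sketched in Python. The data values are made up for illustration.

```python
# Sketch of joining orders and products on product-code to total a bill.

orders = [            # (customer-id, product-code) pairs
    ("c1", "p10"),
    ("c1", "p20"),
    ("c2", "p10"),
]
products = [          # (product-code, price) pairs
    ("p10", 5.0),
    ("p20", 12.5),
]

def customer_bill(customer_id):
    price = dict(products)            # index products by product-code
    return sum(price[code]            # join on product-code, then sum
               for cust, code in orders
               if cust == customer_id)

print(customer_bill("c1"))  # 17.5
```

The join happens only at retrieval time, through the matching values of product-code, which is the "dynamic" character of relational systems noted above.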

Student Activity

1. Define data models.
2. Define the hierarchical data model.
3. Define the network data model.
4. Define the relational data model.

Student Notes

LESSON 7: RELATIONAL DATABASE MANAGEMENT SYSTEM I

Lesson Objectives

• Understanding RDBMS
• Understanding data structures
• Understanding data manipulation
• Understanding various relational algebra operations
• Understanding data integrity

The relational model was proposed by E. F. Codd in 1970. It deals with database management from an abstract point of view, providing the specification of an abstract database management system. To use database management systems based on the relational model, however, users do not need to master the theoretical foundations. Codd defined the model as consisting of the following three components:

1. Data structure: a collection of data structure types for building the database.
2. Data manipulation: a collection of operators that may be used to retrieve, derive or modify data stored in the data structures.
3. Data integrity: a collection of rules that implicitly or explicitly define a consistent database state or changes of state.

Data Structure

Often the information that an organisation wishes to store in a computer and process is complex and unstructured. For example, we may know that a department in a university has 200 students, most are full-time with an average age of 22 years, and most are female. Since natural language is not a good language for machine processing, the information must be structured for efficient processing. In the relational model the information is structured in a very simple way.

We consider the following database to illustrate the basic concepts of the relational data model. The database can be mapped into the following relational schema, which consists of three relation schemes. Each relation scheme presents the structure of a relation by specifying its name and the names of its attributes enclosed in parentheses. Often the primary key of a relation is marked by underlining.
student(student_id, student_name, address)
enrolment(student_id, subject_id)
subject(subject_id, subject_name, department)

An example of a database based on the above relational schema is:

The relation student:

student_id   student_name     address
8656789      Peta Williams    9, Davis Hall
8700074      John Smith       9, Davis Hall
8900020      Arun Kumar       90, Second Hall
8801234      Peter Chew       88, Long Hall
8654321      Reena Rani       88, Long Hall
8712374      Kathy Garcia     88, Long Hall
8612345      Chris Watanabe   11, Main Street

The relation enrolment:

student_id   subject_id
8700074      CP302
8900020      CP302
8900020      CP304
8700074      MA111
8801234      CP302
8801234      CH001

The relation subject:

subject_id   subject_name                department
CP302        Database Management         Comp. Science
CP304        Software Engineering        Comp. Science
CH001        Introduction to Chemistry   Chemistry
PH101        Physics                     Physics
MA111        Pure Mathematics            Mathematics

We list a number of properties of relations:

1. Each relation contains only one record type.
2. Each relation has a fixed number of columns that are explicitly named. Each attribute name within a relation is unique.
3. No two rows in a relation are the same.
4. Each item or element in the relation is atomic; that is, in each row, every attribute has only one value that cannot be decomposed, and therefore no repeating groups are allowed.
5. Rows have no ordering associated with them.
6. Columns have no ordering associated with them (although most commercially available systems display them in a fixed order).

The above properties are simple and based on practical considerations. The first property ensures that only one type of information is stored in each relation. The second property involves naming each column uniquely. This has several benefits: the names can be chosen to convey what each column is, the names enable one to distinguish between a column and its domain, and the names are much easier to remember than the position of each column when the number of columns is large.
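Properties 3 and 5 (no duplicate rows, no row ordering) can be sketched by representing a relation as a Python set of tuples. This is a small illustration using rows from the student relation above; it is not part of the original text.

```python
# A relation as a set of tuples: duplicates vanish and rows are unordered.

student = {
    (8700074, "John Smith", "9, Davis Hall"),
    (8900020, "Arun Kumar", "90, Second Hall"),
    (8700074, "John Smith", "9, Davis Hall"),   # duplicate row: ignored
}

# The number of tuples is the cardinality; the number of attributes
# in each tuple is the degree.
print(len(student))                  # 2: the duplicate was eliminated
print(len(next(iter(student))))      # 3: three attributes per tuple
```

Each tuple is itself atomic in the sense of property 4: every attribute position holds a single indivisible value.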
The third property, of not having duplicate rows, appears obvious but is not always accepted by all users and designers of DBMSs. The property is essential, since no sensible context-free meaning can be assigned to a number of rows that are exactly the same. The next property requires that each element in each relation be atomic, that is, not decomposable into smaller pieces. In the relational model, the only composite or compound type (data that can be decomposed into smaller pieces) is a relation. This simplicity of structure leads to relatively simple query and manipulation languages. A relation is a set of tuples and is closely related to the concept of relation in mathematics. Each row in a relation may be viewed as an assertion. For example, the relation student asserts that a student by the name of Reena Rani has student_id 8654321 and lives at 88, Long Hall. Similarly, the relation subject asserts that one of the subjects offered by the Department of Computer Science is CP302 Database Management. In the relational model, a relation is the only compound data structure, since relations do not allow repeating groups or pointers.

We now define the relational terminology:

• Relation: essentially a table.
• Tuple: a row in the relation.
• Attribute: a column in the relation.
• Degree of a relation: the number of attributes in the relation.
• Cardinality of a relation: the number of tuples in the relation.
• Domain: a set of values that an attribute is permitted to take. The same domain may be used by a number of different attributes.
• Primary key: as discussed in the last chapter, each relation must have an attribute (or a set of attributes) that uniquely identifies each tuple. Each such attribute (or set of attributes) is called a candidate key of the relation if it satisfies the following properties: (a) the attribute or the set of attributes uniquely identifies each tuple in the relation (uniqueness), and (b) if the key is a set of attributes, then no subset of these attributes has property (a) (minimality). There may be several distinct sets of attributes that may serve as candidate keys.
One of the candidate keys is arbitrarily chosen as the primary key of the relation. The three relations above, student, enrolment and subject, have degree 3, 2 and 3 respectively, and cardinality 7, 6 and 5 respectively. The primary key of the relation student is student_id, that of the relation enrolment is (student_id, subject_id), and the primary key of the relation subject is subject_id. The relation student probably has another candidate key. If we can assume the names to be unique, then student_name is a candidate key. If the names are not unique but the names and addresses together are unique, then the two attributes (student_name, address) form a candidate key. Note that student_name and (student_name, address) cannot both be candidate keys; only one can, because a candidate key must be minimal. Similarly, for the relation subject, subject_name would be a candidate key if the subject names are unique.

The relational model is the most popular data model for commercial data processing applications. It is very simple, which reduces the programmer's work.

Data Manipulation

The manipulative part of the relational model makes set processing (or relational processing) facilities available to the user. Since relational operators are able to manipulate whole relations, the user does not need to use loops in application programs. Avoiding loops can result in a significant increase in the productivity of application programmers. The primary purpose of a database in an enterprise is to be able to provide information to the various users in the enterprise. The process of querying a relational database is in essence a way of manipulating the relations that form the database. For example, one may wish to know:

1. the names of all students enrolled in CP302, or
2. the names of all subjects taken by John Smith.

The Relational Algebra

The relational algebra is a procedural query language. It consists of a set of operations that take one or two relations as input and produce a new relation as their result.
The fundamental operations in the relational algebra are select, project, union, set difference, Cartesian product, and rename. In addition to the fundamental operations, there are several other operations, namely set intersection, natural join, division, and assignment. We will define these operations in terms of the fundamental operations.

Fundamental Operations

The select, project, and rename operations are called unary operations, because they operate on one relation. The other three operations operate on pairs of relations and are therefore called binary operations. The various operations and their symbols are shown as follows:

Operation           Symbol
Projection          Π
Selection           σ
Renaming            ρ
Union               ∪
Intersection        ∩
Assignment          ←
Cartesian product   ×
Join                ⋈
Left outer join     ⟕
Right outer join    ⟖
Full outer join     ⟗
Semijoin            ⋉

The Select Operation

The select operation selects tuples that satisfy a given condition. The argument relation is given in parentheses after the σ. Thus, to select those tuples of the loan relation where the branch is Perryridge, we write

σ branch-name = Perryridge (loan)

If the loan relation is as shown later, the result of the preceding query is the relation containing only the Perryridge tuples. We can find all tuples in which the amount lent is more than $1200 by writing

σ amount > 1200 (loan)

In general, we allow comparisons using =, ≠, <, ≤, >, ≥ in the selection predicate. Furthermore, we can combine several predicates into a larger predicate by using the connectives and (∧), or (∨), and not (¬). Thus, to find those tuples pertaining to loans of more than $1200 made by the Perryridge branch, we write

σ branch-name = Perryridge ∧ amount > 1200 (loan)

Example: the table E (for Employee):

enr   ename   dept
1     Bill    A
2     Sarah   C
3     John    A

Example: the table D (for Department):

dnr   dname
A     Marketing
B     Sales
C     Legal

The selection predicate may include comparisons between two attributes. To illustrate, consider the relation loan-officer, which consists of three attributes, customer-name, banker-name, and loan-number, and which specifies that a particular banker is the loan officer for a loan that belongs to some customer. To find all customers who have the same name as their loan officer, we can write

σ customer-name = banker-name (loan-officer)

Projection

Projection is the operation of selecting certain attributes from a relation R to form a new relation S. For example, one may only be interested in the list of names from a relation that has a number of other attributes; the projection operator may then be used. Like selection, projection is a unary operator. To list each loan number and the amount of the loan, we write

Π loan-number, amount (loan)

Composition of Relational Operations

The fact that the result of a relational operation is itself a relation is important. Consider the more complicated query "Find the names of those customers who live in Harrison." We write:

Π customer-name (σ customer-city = Harrison (customer))

Notice that, instead of giving the name of a relation as the argument of the projection operation, we give an expression that evaluates to a relation.
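The select and project operations, and their composition, can be sketched in Python using the table E above. This is an illustrative sketch; the helper names `select` and `project` are my own, and a relation is modelled as a list of dicts.

```python
# Sketch of sigma (select), pi (project) and their composition on E.

E = [
    {"enr": 1, "ename": "Bill",  "dept": "A"},
    {"enr": 2, "ename": "Sarah", "dept": "C"},
    {"enr": 3, "ename": "John",  "dept": "A"},
]

def select(rel, pred):
    # sigma: keep only rows satisfying the predicate
    return [row for row in rel if pred(row)]

def project(rel, attrs):
    # pi: keep only the named attributes, eliminating duplicate rows
    result = []
    for row in rel:
        r = {a: row[a] for a in attrs}
        if r not in result:
            result.append(r)
    return result

# Composition mirrors  Pi ename (sigma dept = 'A' (E))
print(project(select(E, lambda r: r["dept"] == "A"), ["ename"]))
# [{'ename': 'Bill'}, {'ename': 'John'}]
```

As in the algebra, the output of `select` is a relation of the same shape as its input, which is exactly what makes the composition possible.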
In general, since the result of a relational-algebra operation is of the same type (a relation) as its inputs, relational-algebra operations can be composed together into relational-algebra expressions. Composing relational-algebra operations into relational-algebra expressions is just like composing arithmetic operations (such as +, −, * and ÷) into arithmetic expressions.

The loan relation (loan numbers and the amounts of the loans):

loan-number   amount
L-11          900
L-14          1500
L-15          1500
L-16          1300
L-17          1000
L-23          2000
L-93          500

Cartesian Product

The Cartesian product of two tables combines each row in one table with each row in the other table.

Result of E × D:

enr   ename   dept   dnr   dname
1     Bill    A      A     Marketing
1     Bill    A      B     Sales
1     Bill    A      C     Legal
2     Sarah   C      A     Marketing
2     Sarah   C      B     Sales
2     Sarah   C      C     Legal
3     John    A      A     Marketing
3     John    A      B     Sales
3     John    A      C     Legal

The Cartesian product is seldom useful in practice on its own, and it can give a huge result.

The Union Operation

Consider a query to find the names of all bank customers who have either an account or a loan or both. Note that the customer relation alone does not contain the information, since a customer does not need to have either an account or a loan at the bank. To answer this query, we need the information in the depositor relation and in the borrower relation. We know how to find the names of all customers with a loan in the bank:

Π customer-name (borrower)

We also know how to find the names of all customers with an account in the bank:

Π customer-name (depositor)

To answer the query, we need the union of these two sets; that is, we need all customer names that appear in either or both of the two relations. We find these data by the binary operation union, denoted, as in set theory, by ∪. So the expression needed is

Π customer-name (borrower) ∪ Π customer-name (depositor)

There are 10 tuples in the result, even though there are seven distinct borrowers and six depositors. This apparent discrepancy occurs because Smith, Jones, and Hayes are borrowers as well as depositors.
Since relations are sets, duplicate values are eliminated. The result (the names of all customers who have either a loan or an account):

customer-name
Adams
Curry
Hayes
Jackson
Jones
Smith
Williams
Lindsay
Johnson
Turner
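The union just shown can be sketched with Python sets, which eliminate duplicates exactly as relations do. The borrower and depositor name sets below reuse the names from the example above.

```python
# Sketch of  Pi customer-name(borrower) ∪ Pi customer-name(depositor).

borrower  = {"Adams", "Curry", "Hayes", "Jackson", "Jones", "Smith", "Williams"}
depositor = {"Hayes", "Johnson", "Jones", "Lindsay", "Smith", "Turner"}

both = borrower | depositor      # union: duplicates eliminated
print(len(both))                 # 10, not 7 + 6 = 13

# The rewrite of intersection in terms of set difference also holds:
assert borrower & depositor == borrower - (borrower - depositor)
```

The assertion at the end previews the set-intersection identity discussed in the next section: intersection adds no power, since it can always be expressed with two set differences.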

For a union operation r ∪ s to be valid, we require that two conditions hold:

1. The relations r and s must have the same number of attributes.
2. The domains of the ith attribute of r and the ith attribute of s must be the same, for all i.

Note that r and s can, in general, be temporary relations that are the result of relational-algebra expressions.

The Set-Intersection Operation

The first additional relational-algebra operation that we shall define is set intersection (∩). Suppose that we wish to find all customers who have both a loan and an account. Using set intersection, we can write

Π customer-name (borrower) ∩ Π customer-name (depositor)

Note that we can rewrite any relational-algebra expression that uses set intersection by replacing the intersection operation with a pair of set-difference operations:

r ∩ s = r − (r − s)

Thus, set intersection is not a fundamental operation and does not add any power to the relational algebra. It is simply more convenient to write r ∩ s than to write r − (r − s).

The Set-Difference Operation

The set-difference operation, denoted by −, allows us to find tuples that are in one relation but not in another. The expression r − s produces a relation containing those tuples in r but not in s. We can find all customers of the bank who have an account but not a loan by writing

Π customer-name (depositor) − Π customer-name (borrower)

As with the union operation, we must ensure that set differences are taken between compatible relations. Therefore, for a set-difference operation r − s to be valid, we require that the relations r and s be of the same arity, and that the domains of the ith attribute of r and the ith attribute of s be the same.

The Assignment Operation

It is convenient at times to write a relational-algebra expression by assigning parts of it to temporary relation variables. The assignment operation, denoted by ←, works like assignment in a programming language.
For example, a composed query can be broken into steps:

temp1 ← σ amount > 1200 (loan)
result ← Π loan-number, amount (temp1)

The evaluation of an assignment does not result in any relation being displayed to the user. Rather, the result of the expression to the right of the ← is assigned to the relation variable on the left of the ←. This relation variable may be used in subsequent expressions. With the assignment operation, a query can be written as a sequential program consisting of a series of assignments followed by an expression whose value is displayed as the result of the query. For relational-algebra queries, assignment must always be made to a temporary relation variable. Note that the assignment operation does not provide any additional power to the algebra. It is, however, a convenient way to express complex queries.

Points to Ponder

• The relational model was proposed by E. F. Codd in 1970. It provides the specification of an abstract database management system.
• The model consists of the following three components:
  1. Data structure: a collection of data structure types for building the database.
  2. Data manipulation: a collection of operators that may be used to retrieve, derive or modify data stored in the data structures.
  3. Data integrity: a collection of rules that implicitly or explicitly define a consistent database state or changes of state.
• The relational algebra describes a set of algebraic operations that operate on tables and output a table as a result.

Review Terms

• Table/relation
• Tuple
• Domain
• Database schema
• Database instance
• Keys
• Primary key
• Foreign key
• Relational algebra

Student Activity

1. Why do we use an RDBMS?
2. Define relation, tuple, domain, and keys.

3. What is the difference between intersection, union and Cartesian product?

Student Notes

LESSON 8: RELATIONAL DATABASE MANAGEMENT SYSTEM II

Lesson Objectives

• Elaborating various other features of relational algebra
• Understanding aggregate functions
• Understanding joins
• Understanding natural, outer and inner joins

Aggregate Functions

Aggregate functions take a collection of values and return a single value as a result. For example, the aggregate function sum takes a collection of values and returns the sum of the values. Thus, the function sum applied on the collection {1, 1, 3, 4, 4, 11} returns the value 24. The aggregate function avg returns the average of the values; when applied to the preceding collection, it returns the value 4. The aggregate function count returns the number of elements in the collection, and returns 6 on the preceding collection. Other common aggregate functions include min and max, which return the minimum and maximum values in a collection; they return 1 and 11, respectively, on the preceding collection.

The collections on which aggregate functions operate can have multiple occurrences of a value; the order in which the values appear is not relevant. Such collections are called multisets. Sets are a special case of multisets where there is only one copy of each element.

To illustrate the concept of aggregation, we use the relation pt-works (part-time employees):

employee-name   branch-name   salary
Adams           Perryridge    1500
Brown           Perryridge    1300
Gopal           Perryridge    5300
Johnson         Downtown      1500
Loreena         Downtown      1300
Peterson        Downtown      2500
Rao             Austin        1500
Sato            Austin        1600

To find the total salary of all part-time employees, we write

G sum(salary) (pt-works)

The symbol G is the letter G in calligraphic font; read it as "calligraphic G". The relational-algebra operation G signifies that aggregation is to be applied, and its subscript specifies the aggregate operation to be applied. The result of the expression above is a relation with a single attribute, containing a single row with a numerical value corresponding to the sum of the salaries of all employees working part-time in the bank.
There are cases where we must eliminate multiple occurrences of a value before computing an aggregate function. If we want to eliminate duplicates, we use the same function names as before, with the hyphenated string "distinct" appended to the end of the function name (for example, count-distinct). An example arises in the query "Find the number of branches appearing in the pt-works relation." In this case, a branch name counts only once, regardless of the number of employees working at that branch. We write this query as follows:

G count-distinct(branch-name) (pt-works)

For the pt-works relation above, the result of this query is a single row containing the value 3.

Suppose we want to find the total salary of all part-time employees at each branch of the bank separately, rather than the sum for the entire bank. To do so, we need to partition the relation pt-works into groups based on the branch, and to apply the aggregate function to each group. The following expression using the aggregation operator G achieves the desired result:

branch-name G sum(salary) (pt-works)

In the expression, the attribute branch-name in the left-hand subscript of G indicates that the input relation pt-works must be divided into groups based on the value of branch-name. The resulting groups are:

employee-name   branch-name   salary
Rao             Austin        1500
Sato            Austin        1600
Johnson         Downtown      1500
Loreena         Downtown      1300
Peterson        Downtown      2500
Adams           Perryridge    1500
Brown           Perryridge    1300
Gopal           Perryridge    5300

The expression sum(salary) in the right-hand subscript of G indicates that for each group of tuples (that is, each branch), the aggregation function sum must be applied to the collection of values of the salary attribute.
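Grouping followed by aggregation can be sketched directly in Python (an illustrative sketch, using the pt-works data from the lesson):

```python
from collections import defaultdict

# pt-works rows (employee-name, branch-name, salary), as in the lesson.
pt_works = [
    ("Adams",    "Perryridge", 1500),
    ("Brown",    "Perryridge", 1300),
    ("Gopal",    "Perryridge", 5300),
    ("Johnson",  "Downtown",   1500),
    ("Loreena",  "Downtown",   1300),
    ("Peterson", "Downtown",   2500),
    ("Rao",      "Austin",     1500),
    ("Sato",     "Austin",     1600),
]

# branch-name G sum(salary) (pt-works): partition by branch, then sum.
totals = defaultdict(int)
for _name, branch, salary in pt_works:
    totals[branch] += salary

print(dict(totals))  # {'Perryridge': 8100, 'Downtown': 5300, 'Austin': 3100}

# G count-distinct(branch-name) (pt-works): duplicates removed first.
distinct_branches = len({branch for _n, branch, _s in pt_works})
print(distinct_branches)  # 3
```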
The output relation consists of tuples with the branch name and the sum of the salaries for that branch:

branch-name   sum of salary
Austin        3100
Downtown      5300
Perryridge    8100

The general form of the aggregation operation G is as follows:

G1, G2, ..., Gn G F1(A1), F2(A2), ..., Fm(Am) (E)

where E is any relational-algebra expression; G1, G2, ..., Gn constitute a list of attributes on which to group; each Fi is an aggregate function; and each Ai is an attribute.

Join

The Natural-Join Operation
It is often desirable to simplify certain queries that require a Cartesian product. Usually, a query that involves a Cartesian product includes a selection operation on the result of the Cartesian product. Consider the query "Find the names of all customers who have a loan at the bank, along with the loan number and the loan amount." We first form the Cartesian product of the borrower and loan relations. Then, we select those tuples that pertain to the same loan-number, followed by the projection of the resulting customer-name, loan-number, and amount:

Π customer-name, loan.loan-number, amount (σ borrower.loan-number = loan.loan-number (borrower x loan))

The natural join is a binary operation that allows us to combine certain selections and a Cartesian product into one operation. It is denoted by the join symbol ⋈. The natural-join operation forms a Cartesian product of its two arguments, performs a selection forcing equality on those attributes that appear in both relation schemas, and finally removes duplicate attributes.

Outer Join
The outer-join operation is an extension of the join operation to deal with missing information. Suppose that we have relations with the following schemas, which contain data on full-time employees:

employee (employee-name, street, city)
ft-works (employee-name, branch-name, salary)

Suppose that we want to generate a single relation with all the information (street, city, branch name, and salary) about full-time employees. A possible approach would be to use the natural-join operation as follows:

employee ⋈ ft-works

However, an employee who appears in only one of the two relations is then lost from the result. We can use the outer-join operation to avoid this loss of information. There are actually three forms of the operation: left outer join, denoted ⟕; right outer join, denoted ⟖; and full outer join, denoted ⟗. All three forms of outer join compute the join, and add extra tuples to the result of the join.
The left outer join (⟕) takes all tuples in the left relation that did not match any tuple in the right relation, pads each such tuple with null values for all the attributes from the right relation, and adds them to the result of the natural join. The tuple (Smith, Revolver, Death Valley, null, null) is such a tuple. All information from the left relation is present in the result of the left outer join.

The right outer join (⟖) is symmetric with the left outer join: it pads tuples from the right relation that did not match any from the left relation with nulls and adds them to the result of the natural join. The tuple (Gates, null, null, Redmond, 5300) is such a tuple. Thus, all information from the right relation is present in the result of the right outer join.

The full outer join (⟗) does both of those operations, padding tuples from the left relation that did not match any from the right relation, as well as tuples from the right relation that did not match any from the left relation, and adding them to the result of the join.

Renaming Tables and Columns
Example: the table E (for Employee) and the table D (for Department):

E:  nr   name    dept        D:  nr   name
    1    Bill    A               A    Marketing
    2    Sarah   C               B    Sales
    3    John    A               C    Legal

We want to join these tables, but several columns in the result will have the same name (nr and name). How do we express the join condition, when there are two columns called nr?

Solutions:
Rename the attributes, using the rename operator.
Keep the names, and prefix them with the table name, as is done in SQL. (This is somewhat unorthodox.)

Relational algebra:

(ρ (enr, ename, dept) (E)) ⋈ dept = dnr (ρ (dnr, dname) (D))

Result:

enr   ename   dept   dnr   dname
1     Bill    A      A     Marketing
2     Sarah   C      C     Legal
3     John    A      A     Marketing

You can use another variant of the renaming operator to change the name of a table, for example to change the name of E to R.
This is necessary when joining a table with itself:

ρ R (E)

A third variant lets you rename both the table and the columns:

ρ R(enr, ename, dept) (E)
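The natural-join and left-outer-join operations described above can be sketched in plain Python. The rows below are illustrative (the original figures did not survive extraction), chosen to produce the Smith tuple the text mentions:

```python
# employee (employee-name, street, city)
employee = [
    ("Coyote", "Toon St.", "Hollywood"),
    ("Smith",  "Revolver", "Death Valley"),
]
# ft-works (employee-name, branch-name, salary)
ft_works = [
    ("Coyote", "Mesa",    1500),
    ("Gates",  "Redmond", 5300),
]

# Natural join: keep only pairs that agree on the common attribute
# (employee-name), and emit that attribute only once.
natural = [(n, st, c, b, s)
           for (n, st, c) in employee
           for (n2, b, s) in ft_works if n == n2]

# Left outer join: additionally pad unmatched left tuples with nulls.
matched = {row[0] for row in natural}
left_outer = natural + [(n, st, c, None, None)
                        for (n, st, c) in employee if n not in matched]

print(natural)     # only Coyote appears in both relations
print(left_outer)  # adds (Smith, Revolver, Death Valley, None, None)
```

A right outer join would instead pad the unmatched ft-works tuple for Gates; a full outer join would pad both.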

Points to Ponder
Aggregate functions take a collection of values and return a single value as a result.
Usually, a query that involves a Cartesian product includes a selection operation on the result of the Cartesian product.

Review Terms
Aggregate functions
Joins
Natural join
Outer join
Right outer join
Left outer join

Student Notes
1. Define aggregate functions with example?
2. Define joins? What is natural join?
3. Differentiate between inner join & outer join?
4. Differentiate between left outer join & right outer join with the help of example?
5. Define rename operators?


LESSON 9: E-R MODEL - I

Lesson Objectives
Understanding entity
Understanding relationship
Understanding attribute, domain, entity set
Understanding simple & composite attributes
Understanding derived attributes
Understanding relationship set
Components of E-R diagrams
Designing E-R diagrams

The E-R model considers the real world to consist of entities and relationships among them. An Entity is a thing which can be distinctly identified, for example a person, a car, a subroutine, a wire, an event. A Relationship is an association among entities; e.g.

person OWNS car

is an association between a person and a car, and

person EATS dish IN place

is an association among a person, a dish and a place.

Attribute, Value, Domain, Entity Set
The information about one entity is expressed by a set of (attribute, value) pairs; e.g. a car model could be:

Name = R1222
Power = 7.3
Nseats = 5

Values of attributes belong to different value-sets or domains; for example, for a car, Nseats is an integer between 1 and 12. Entities defined by the same set of attributes can be grouped into an Entity Set (abbreviated as ESet), as shown in

-------------------------------
ESET : CarModel
-------------------------------
Name       Power      Nseats
---------- ---------- --------
R1222      7.3        5
HZ893      6.8        5
R1293      5.4        4
-------------------------------
An Entity Set

A given set of attributes may be referred to as an entity type. All entities in a given ESet are of the same type, but sometimes there can be more than one set of the same type. The set of all persons who are customers at a given bank can be defined as an entity set customer. The individual entities that constitute a set are said to be the extension of the entity set. So all the individual bank customers are the extension of the entity set customer. Each entity has a value for each of its attributes. For each attribute, there is a set of permitted values called the domain or value set.
Simple and Composite Attributes
A simple attribute has a single atomic value, while a composite attribute is one which can be divided into sub-parts. For example, an attribute name can be divided into first name, middle name and last name.

Single and Multivalued Attributes
An attribute which has only one value for a given entity is known as a single-valued attribute. For example, the loan_no attribute of a loan has only one loan number. There are cases where an attribute has a set of values for a specific entity. For example, an attribute phone_no may have zero, one or several values. This is known as a multivalued attribute.

Derived Attribute
Its value is derived from the values of other related attributes or entities. For example, an attribute age can be calculated from another attribute date_of_birth.

Relationship, Relationship Set
A relationship is an association among several entities; for example, a customer A is associated with loan number L1. A relationship set is a subset of the Cartesian product of entity sets. For example, a relationship set (abbreviated as RSet) on the relationship Person HAS_EATEN Dish IN Place could be as shown in

---------------------------------------
RSet 'Person HAS_EATEN Dish IN Place'
---------------------------------------
Person    Dish     Place
--------- -------- --------------------
Steve     Duck     Kuala Lumpur
Weiren    Duck     Beijing
Paolo     Noodles  Naples
Mike      Fondue   Geneva
Paolo     Duck     Beijing
---------------------------------------

Notice that an RSet is an ESet having ESets as attributes.

Components of E-R Diagrams
The overall logical structure of a database can be expressed graphically by an E-R diagram. The various components of an E-R diagram are as follows:
1. Rectangles, which represent entity sets.
2. Ellipses, which represent attributes.
3. Diamonds, which represent relationship sets.
4. Lines, which link attributes to entity sets and entity sets to relationship sets.
5. Double ellipses, which represent multivalued attributes.
6. Dashed ellipses, which represent derived attributes.
7. Double lines, which indicate total participation of an entity in a relationship set.
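The claim that a relationship set is a subset of the Cartesian product of the participating entity sets can be checked directly. This sketch uses the HAS_EATEN RSet data from the lesson:

```python
from itertools import product

# Entity sets participating in Person HAS_EATEN Dish IN Place.
persons = {"Steve", "Weiren", "Paolo", "Mike"}
dishes  = {"Duck", "Noodles", "Fondue"}
places  = {"Kuala Lumpur", "Beijing", "Naples", "Geneva"}

# The relationship set: a set of (person, dish, place) triples.
has_eaten = {
    ("Steve",  "Duck",    "Kuala Lumpur"),
    ("Weiren", "Duck",    "Beijing"),
    ("Paolo",  "Noodles", "Naples"),
    ("Mike",   "Fondue",  "Geneva"),
    ("Paolo",  "Duck",    "Beijing"),
}

# Every triple in the RSet must appear in the Cartesian product.
cartesian = set(product(persons, dishes, places))
print(has_eaten <= cartesian)          # True: the RSet is a subset
print(len(cartesian), len(has_eaten))  # 48 possible triples, 5 actual
```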

Entity Relationship Diagram Notations
Peter Chen developed ERDs in 1976. Since then, Charles Bachman and James Martin have added some slight refinements to the basic ERD principles.

Entity
An entity is an object or concept about which you want to store information.

Weak Entity
A weak entity is dependent on another entity to exist.

Attributes
Attributes are the properties or characteristics of an entity.

Key attribute
A key attribute is the unique, distinguishing characteristic of the entity. For example, an employee's social security number might be the employee's key attribute.

Multivalued attribute
A multivalued attribute can have more than one value. For example, an employee entity can have multiple skill values.

Derived attribute
A derived attribute is based on another attribute. For example, an employee's monthly salary is based on the employee's annual salary.

Relationships
Relationships illustrate how two entities share information in the database structure.
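A derived attribute is computed on demand rather than stored. A minimal sketch of the lesson's age/date_of_birth example:

```python
from datetime import date

# age is derived from the stored attribute date_of_birth; it is not
# stored itself, since it would go stale.
def age(date_of_birth: date, today: date) -> int:
    years = today.year - date_of_birth.year
    # Subtract one if this year's birthday has not happened yet.
    if (today.month, today.day) < (date_of_birth.month, date_of_birth.day):
        years -= 1
    return years

print(age(date(1990, 6, 15), today=date(2004, 6, 14)))  # 13
print(age(date(1990, 6, 15), today=date(2004, 6, 15)))  # 14
```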

Weak relationship
To connect a weak entity with others, you should use a weak relationship notation.

Cardinality
Cardinality specifies how many instances of an entity relate to one instance of another entity. Ordinality is closely linked to cardinality. While cardinality specifies the occurrences of a relationship, ordinality describes the relationship as either mandatory or optional. In other words, cardinality specifies the maximum number of relationships and ordinality specifies the absolute minimum number of relationships.

Recursive relationship
In some cases, entities can be self-linked. For example, employees can supervise other employees.

Mapping Cardinalities
These express the number of entities to which another entity can be associated via a relationship set. For two entity sets A and B, the mapping cardinality can be one of the following:
1. One to one: An entity in A is associated with at most one entity in B, and an entity in B is associated with at most one entity in A.
2. One to many: An entity in A is associated with any number (zero or more) of entities in B. An entity in B, however, can be associated with at most one entity in A.
3. Many to one: An entity in A is associated with at most one entity in B. An entity in B, however, can be associated with any number (zero or more) of entities in A.
4. Many to many: An entity in A is associated with any number (zero or more) of entities in B, and an entity in B is associated with any number (zero or more) of entities in A.
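A one-to-many mapping can be sketched with plain Python structures. The branch and account names here are illustrative, not from the text:

```python
# One-to-many: a branch (A) relates to many accounts (B), but each
# account maps back to at most one branch.
accounts_of = {                       # A -> zero or more B
    "Downtown":   ["A-101", "A-215"],
    "Perryridge": ["A-102"],
}

# Inverting the mapping shows each B has exactly one A.
branch_of = {acct: br for br, accts in accounts_of.items() for acct in accts}

print(accounts_of["Downtown"])  # ['A-101', 'A-215']: one branch, two accounts
print(branch_of["A-101"])       # Downtown: each account has one branch
```

A many-to-many mapping would need a list of branches on the inverted side as well, which is why relational schemas represent it with a separate linking relation.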

(Figures: one to one relationship; one to many relationship; many to one relationship.)

Points to Ponder
An Entity is a thing which can be distinctly identified, for example a person, a car, a subroutine, a wire, an event.
A Relationship is an association among entities, e.g. person OWNS car.
A given set of attributes may be referred to as an entity type.
A simple attribute has a single value, while a composite attribute is one which can be divided into sub-parts.
An attribute whose value is derived from the values of other related attributes or entities is known as a derived attribute.
A relationship is an association among several entities.
A relationship set is a subset of the Cartesian product of entity sets.

Review Terms
Entity
Entity set
Attribute
Domain
Value
Relationship
Relationship set
Cardinality
Association

Student Notes
1. Define entity, domain, value?
2. Define relationship, relationship set?
3. Differentiate between simple & composite attribute?
4. Define derived attribute?
5. Differentiate between single & multi-valued attribute?
6. Define cardinality? Explain various kinds of cardinality?
7. Define various components of E-R diagram?


LESSON 10: E-R MODEL - II

Lesson Objectives
Designing E-R diagrams
Understanding keys
Super key, primary key, composite key
Entity relationship diagram methodology

Developing Entity Relationship Diagrams (ERDs)

Why
Entity Relationship Diagrams are a major data modelling tool and will help organize the data in your project into entities and define the relationships between the entities. This process has been proven to enable the analyst to produce a good database structure so that the data can be stored and retrieved in the most efficient manner.

Information

Entity
A data entity is anything real or abstract about which we want to store data. Entity types fall into five classes: roles, events, locations, tangible things or concepts. E.g. employee, payment, campus, book. Specific examples of an entity are called instances. E.g. the employee John Jones, Mary Smith's payment, etc.

Relationship
A data relationship is a natural association that exists between one or more entities. E.g. employees process payments. Cardinality defines the number of occurrences of one entity for a single occurrence of the related entity. E.g. an employee may process many payments but might not process any payments, depending on the nature of her job.

Attribute
A data attribute is a characteristic common to all or most instances of a particular entity. Synonyms include property, data element, field. E.g. name, address, employee number, and pay rate are all attributes of the entity employee. An attribute or combination of attributes that uniquely identifies one and only one instance of an entity is called a primary key or identifier. E.g. employee number is a primary key for employee.

Keys
Differences between entities must be expressed in terms of attributes. A superkey is a set of one or more attributes which, taken collectively, allow us to identify uniquely an entity in the entity set. For example, in the entity set customer, {customer-name, S.I.N.} is a superkey. Note that customer-name alone is not, as two customers could have the same name.

A superkey may contain extraneous attributes, and we are often interested in the smallest superkey. A superkey for which no proper subset is a superkey is called a candidate key. In the example above, S.I.N. is a candidate key, as it is minimal and uniquely identifies a customer entity. A primary key is a candidate key (there may be more than one) chosen by the DB designer to identify entities in an entity set.

An entity set that does not possess sufficient attributes to form a primary key is called a weak entity set. One that does have a primary key is called a strong entity set. For example, the entity set transaction has attributes transaction-number, date and amount. Different transactions on different accounts could share the same number, so these attributes are not sufficient to form a primary key (uniquely identify a transaction). Thus transaction is a weak entity set.

For a weak entity set to be meaningful, it must be part of a one-to-many relationship set. This relationship set should have no descriptive attributes, since any such attributes can instead be attached to the weak entity set itself. The idea of strong and weak entity sets is related to the existence dependencies seen earlier: a member of a strong entity set is a dominant entity, and a member of a weak entity set is a subordinate entity.

A weak entity set does not have a primary key, but we nevertheless need a means of distinguishing among its entities. The discriminator of a weak entity set is a set of attributes that allows this distinction to be made. The primary key of a weak entity set is formed by taking the primary key of the strong entity set on which its existence depends (see Mapping Constraints) plus its discriminator. To illustrate: transaction is a weak entity. It is existence-dependent on account. The primary key of account is account-number, and transaction-number distinguishes transaction entities within the same account (and is thus the discriminator).
So the primary key for transaction would be (account-number, transaction-number).

Just Remember: The primary key of a weak entity is found by taking the primary key of the strong entity on which it is existence-dependent, plus the discriminator of the weak entity set.
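A quick sketch (with made-up transaction rows) shows why the discriminator alone is not a key, while the strong entity's key plus the discriminator is:

```python
# Illustrative transaction rows: (account-number, transaction-number,
# date, amount). Transaction number 1 repeats across accounts.
transactions = [
    ("A-101", 1, "2004-01-05", 500),
    ("A-101", 2, "2004-01-09", -75),
    ("A-102", 1, "2004-01-07", 200),
]

def is_unique(rows, key_positions):
    """Check whether the given attribute positions form a key."""
    keys = [tuple(row[i] for i in key_positions) for row in rows]
    return len(keys) == len(set(keys))

print(is_unique(transactions, [1]))     # False: transaction-number alone
print(is_unique(transactions, [0, 1]))  # True: (account-number, transaction-number)
```

The same helper can check any candidate key: a set of positions is a superkey exactly when is_unique returns True for every legal instance of the relation.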

An Entity Relationship Diagram Methodology
1. Identify Entities: Identify the roles, events, locations, tangible things or concepts about which the end-users want to store data.
2. Find Relationships: Find the natural associations between pairs of entities using a relationship matrix.
3. Draw Rough ERD: Put entities in rectangles and relationships on line segments connecting the entities.
4. Fill in Cardinality: Determine the number of occurrences of one entity for a single occurrence of the related entity.
5. Define Primary Keys: Identify the data attribute(s) that uniquely identify one and only one occurrence of each entity.
6. Draw Key-Based ERD: Eliminate many-to-many relationships and include primary and foreign keys in each entity.
7. Identify Attributes: Name the information details (fields) which are essential to the system under development.
8. Map Attributes: For each attribute, match it with exactly one entity that it describes.
9. Draw Fully Attributed ERD: Adjust the ERD from step 6 to account for entities or relationships discovered in step 8.
10. Check Results: Does the final Entity Relationship Diagram accurately depict the system data?

A Simple Example
A company has several departments. Each department has a supervisor and at least one employee. Employees must be assigned to at least one, but possibly more, departments. At least one employee is assigned to a project, but an employee may be on vacation and not assigned to any projects. The important data fields are the names of the departments, projects, supervisors and employees, as well as the supervisor and employee numbers and a unique project number.

1. Identify Entities
The entities in this system are Department, Employee, Supervisor and Project. One is tempted to make Company an entity, but it is a false entity because it has only one instance in this problem. True entities must have more than one instance.

2. Find Relationships
We construct the following entity relationship matrix:

              Department    Employee      Supervisor
Department                  is assigned   run by
Employee      belongs to                               works on
Supervisor    runs
Project                     uses

3. Draw Rough ERD
We connect the entities whenever a relationship is shown in the entity relationship matrix.

If a relationship set also has some attributes associated with it, then we link these attributes to that relationship set. For example, an access-date descriptive attribute attached to the relationship set depositor specifies the most recent date on which a customer accessed an account. (Figure: E-R diagram with an attribute attached to a relationship set.)

Tips for Effective ER Diagrams
1. Make sure that each entity only appears once per diagram.
2. Name every entity, relationship, and attribute on your diagram.
3. Examine relationships between entities closely. Are they necessary? Are there any relationships missing? Eliminate any redundant relationships. Don't connect relationships to each other.
4. Use colors to highlight important portions of your diagram.

E-R diagram with composite, multivalued, and derived attributes

The figure shows how composite attributes can be represented in the E-R notation. Here, a composite attribute name, with component attributes first-name, middle-initial, and last-name, replaces the simple attribute customer-name of customer. Also, a composite attribute address, whose component attributes are street, city, state, and zip-code, replaces the attributes customer-street and customer-city of customer. The attribute street is itself a composite attribute whose component attributes are street-number, street-name, and apartment-number. The figure also illustrates a multivalued attribute phone-number, depicted by a double ellipse, and a derived attribute age, depicted by a dashed ellipse.

Points to Ponder
The E-R model considers the real world to consist of entities and relationships among them.
An Entity is a thing which can be distinctly identified.
A Relationship is an association among entities.
The information about one entity is expressed by a set of (attribute, value) pairs.
A given set of attributes may be referred to as an entity type.
Attributes are of various types: single & multi-valued, simple & composite, derived.
E-R diagrams are expressed through various symbols.
Cardinality expresses the number of entities to which another entity can be associated via a relationship set.

Review Terms
Entity
Relationship
Attribute
Domain
E-R-D symbols
Various kinds of keys
Types of attributes
Types of cardinality

Student Activity
1. Explain the difference between primary key, candidate key & super key?
2. Why is redundancy a bad practice?
3. Construct an E-R diagram for a car insurance company whose customers own one or more cars each. Each car has associated with it zero to any number of recorded accidents?
4. Design an E-R diagram for keeping track of the exploits of your favourite sports team. You should store the matches played, scores in each match, players in each match & individual player statistics for each match. Summary statistics should be modelled as derived attributes?


LESSON 11: STRUCTURED QUERY LANGUAGE (SQL) I

Lesson Objectives
Understanding SQL
Understanding DDL, DML
Creating tables
Selecting data
Constraints
Drop
Insert
Select with conditions

SQL (pronounced "ess-que-el") stands for Structured Query Language. SQL is used to communicate with a database. According to ANSI (American National Standards Institute), it is the standard language for relational database management systems. SQL statements are used to perform tasks such as updating data in a database or retrieving data from a database. Some common relational database management systems that use SQL are: Oracle, Sybase, Microsoft SQL Server, Access, Ingres, etc. Although most database systems use SQL, most of them also have their own additional proprietary extensions that are usually only used on their system. However, the standard SQL commands such as Select, Insert, Update, Delete, Create, and Drop can be used to accomplish almost everything that one needs to do with a database.

The SQL language has several parts:

Data-definition language (DDL). The SQL DDL provides commands for defining relation schemas, deleting relations, and modifying relation schemas.

Interactive data-manipulation language (DML). The SQL DML includes a query language based on both the relational algebra and the tuple relational calculus. It also includes commands to insert tuples into, delete tuples from, and modify tuples in the database.

View definition. The SQL DDL includes commands for defining views.

Transaction control. SQL includes commands for specifying the beginning and ending of transactions.

Embedded SQL and dynamic SQL. Embedded and dynamic SQL define how SQL statements can be embedded within general-purpose programming languages, such as C, C++, Java, PL/I, Cobol, Pascal, and Fortran.

Integrity. The SQL DDL includes commands for specifying integrity constraints that the data stored in the database must satisfy. Updates that violate integrity constraints are disallowed.
Authorization. The SQL DDL includes commands for specifying access rights to relations and views.

A relational database system contains one or more objects called tables. The data or information for the database is stored in these tables. Tables are uniquely identified by their names and are comprised of columns and rows. Columns contain the column name, data type, and any other attributes for the column. Rows contain the records or data for the columns. Here is a sample table called weather; city, state, high, and low are the columns, and the rows contain the data for this table:

weather
city          state        high   low
Phoenix       Arizona      105    90
Tucson        Arizona      101    92
Flagstaff     Arizona      88     69
San Diego     California   77     60
Albuquerque   New Mexico   80     72

Data-Definition Language
The set of relations in a database must be specified to the system by means of a data-definition language (DDL). The SQL DDL allows specification of not only a set of relations, but also information about each relation, including:
The schema for each relation
The domain of values associated with each attribute
The integrity constraints
The set of indices to be maintained for each relation
The security and authorization information for each relation
The physical storage structure of each relation on disk

Creating Tables
The create table statement is used to create a new table. Here is the format of a simple create table statement:

create table tablename
(column1 data type,
 column2 data type,
 column3 data type);

Format of create table if you were to use optional constraints:

create table tablename
(column1 data type [constraint],
 column2 data type [constraint],
 column3 data type [constraint]);

[ ] = optional

Note: You may have as many columns as you'd like, and the constraints are optional.

Example

create table employee
(first   varchar(15),
 last    varchar(20),
 age     number(3),
 address varchar(30),
 city    varchar(20),
 state   varchar(20));

To create a new table, enter the keywords create table followed by the table name, an open parenthesis, the first column name, the data type for that column, any optional constraints, and so on, finishing with a closing parenthesis. It is important to make sure you use an open parenthesis before the first column definition and a closing parenthesis after the end of the last column definition. Make sure you separate each column definition with a comma. All SQL statements should end with a ";".

The table and column names must start with a letter and can be followed by letters, numbers, or underscores, not to exceed a total of 30 characters in length. Do not use any SQL reserved keywords as names for tables or columns (such as select, create, insert, etc.).

Data types specify what type of data a particular column can hold. If a column called last_name is to be used to hold names, then that particular column should have a varchar (variable-length character) data type. Here are the most common data types:

char(size)       Fixed-length character string. Size is specified in parenthesis. Max 255 bytes.
varchar(size)    Variable-length character string. Max size is specified in parenthesis.
number(size)     Number value with a max number of digits specified in parenthesis.
date             Date value.
number(size,d)   Number value with a maximum number of digits of "size" total, with a maximum number of "d" digits to the right of the decimal.
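The create table example can be tried directly from Python using the built-in sqlite3 module (a sketch; note that SQLite's type names differ from the Oracle-style char/varchar/number shown above, so TEXT and INTEGER are used here):

```python
import sqlite3

# An in-memory database, so the sketch leaves nothing behind.
conn = sqlite3.connect(":memory:")
conn.execute("""
    create table employee (
        first   text not null,
        last    text not null,
        age     integer,
        address text,
        city    text,
        state   text
    );
""")

# The table now exists and is empty.
count = conn.execute("select count(*) from employee").fetchone()[0]
print(count)  # 0
```

The not null constraints on first and last illustrate the optional [constraint] slot from the format above.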
Constraints
When tables are created, it is common for one or more columns to have constraints associated with them. A constraint is basically a rule associated with a column that the data entered into that column must follow. For example, a unique constraint specifies that no two records can have the same value in a particular column; they must all be unique. The other two most popular constraints are not null, which specifies that a column can't be left blank, and primary key. A primary key constraint defines a unique identification for each record (or row) in a table.

It's now time for you to design and create your own table. If you decide to change or redesign the table, you can either drop it and recreate it or you can create a completely different one.

Students Activity
You have just started a new company. It is time to hire some employees. You will need to create a table that will contain the following information about your new employees: firstname, lastname, title, age, and salary.

Drop a Table
The drop table command is used to delete a table and all rows in the table. To delete an entire table including all of its rows, issue the drop table command followed by the table name. drop table is different from deleting all of the records in the table: deleting all of the records leaves the table itself, including its column and constraint information, while dropping the table removes the table definition as well as all of its rows.

drop table tablename

Example

drop table myemployees;

Inserting into a Table
The insert statement is used to insert or add a row of data into the table. To insert records into a table, enter the keywords insert into followed by the table name, an open parenthesis, a list of column names separated by commas, a closing parenthesis, the keyword values, and the list of values enclosed in parentheses. The values that you enter will be held in the rows, and they will match up with the column names that you specify. Strings should be enclosed in single quotes; numbers should not.
insert into tablename
(first_column, ..., last_column)
values (first_value, ..., last_value);

In the example below, the column name first will match up with the value 'Luke', and the column name state will match up with the value 'Georgia'.

Example

insert into employee
(first, last, age, address, city, state)
values ('Luke', 'Duke', 45, '2130 Boars Nest', 'Hazard Co', 'Georgia');

Note: All strings should be enclosed between single quotes: 'string'

Students Activity
1. Insert data into your new employee table.

2. Your first three employees are the following:
Jonie Weber, Secretary, 28, 19500.00
Potsy Weber, Programmer, 32, 45300.00
Dirk Smith, Programmer II, 45, 75020.00
3. Enter these employees into your table first, and then insert at least 5 more employees of your own into the table.

Selecting Data
The select statement is used to query the database and retrieve selected data that match the criteria that you specify. Here is the format of a simple select statement:

select column1 [, column2, etc.] from tablename [where condition];
[] = optional

The column names that follow the select keyword determine which columns will be returned in the results. You can select as many column names as you'd like, or you can use a * to select all columns. The table name that follows the keyword from specifies the table that will be queried to retrieve the desired results. The where clause (optional) specifies which data values or rows will be returned or displayed, based on the criteria described after the keyword where.

Conditional operators used in the where clause:
=     Equal
>     Greater than
<     Less than
>=    Greater than or equal
<=    Less than or equal
<>    Not equal to
LIKE  *See note below

The LIKE pattern matching operator can also be used in the conditional selection of the where clause. LIKE is a very powerful operator that allows you to select only rows that are like what you specify. The percent sign % can be used as a wildcard to match any possible characters that might appear before or after the characters specified. For example:

select first, last, city from empinfo where first LIKE 'Er%';

This SQL statement will match any first names that start with 'Er'. Strings must be in single quotes. Or you can specify:

select first, last from empinfo where last LIKE '%s';

This statement will match any last names that end in an 's'.

select * from empinfo where first = 'Eric';

This will only select rows where the first name equals 'Eric' exactly.
Sample Table: empinfo

first      last      id     age  city        state
John       Jones     99980  45   Payson      Arizona
Mary       Jones     99982  25   Payson      Arizona
Eric       Edwards   88232  32   San Diego   California
Mary Ann   Edwards   88233  32   Phoenix     Arizona
Ginger     Howell    98002  42   Cottonwood  Arizona
Sebastian  Smith     92001  23   Gila Bend   Arizona
Gus        Gray      22322  35   Bagdad      Arizona
Mary Ann   May       32326  52   Tucson      Arizona
Erica      Williams  32327  60   Show Low    Arizona
Leroy      Brown     32380  22   Pinetop     Arizona
Elroy      Cleaver   32382  22   Globe       Arizona

Some more examples:
select first, last, city from empinfo;
select last, city, age from empinfo where age > 30;
select first, last, city, state from empinfo where first LIKE 'J%';
select * from empinfo;
select first, last from empinfo where last LIKE '%s';
select first, last, age from empinfo where last LIKE '%illia%';
select * from empinfo where first = 'Eric';
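The select/where/like examples above can be tried end to end with sqlite3. This sketch loads a few rows of the sample empinfo table (column types are omitted, which SQLite allows) and runs a wildcard match and an exact match.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("create table empinfo (first, last, id, age, city, state)")
cur.executemany("insert into empinfo values (?, ?, ?, ?, ?, ?)", [
    ("John",  "Jones",    99980, 45, "Payson",    "Arizona"),
    ("Eric",  "Edwards",  88232, 32, "San Diego", "California"),
    ("Erica", "Williams", 32327, 60, "Show Low",  "Arizona"),
])

# Wildcard match: first names starting with 'Er'
cur.execute("select first, last from empinfo where first like 'Er%'")
like_rows = cur.fetchall()
print(like_rows)   # [('Eric', 'Edwards'), ('Erica', 'Williams')]

# Exact match
cur.execute("select first, last from empinfo where first = 'Eric'")
exact_rows = cur.fetchall()
print(exact_rows)  # [('Eric', 'Edwards')]
```

One caveat: in SQLite, LIKE is case-insensitive for ASCII characters by default, whereas some other database systems treat patterns as case sensitive.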

Student Activity
1. Display the first name and age for everyone that's in the table.
2. Display the first name, last name, and city for everyone that's not from Payson.
3. Display all columns for everyone that is over 40 years old.
4. Display the first and last names for everyone whose last name ends in 'ay'.
5. Display all columns for everyone whose first name equals 'Mary'.
6. Display all columns for everyone whose first name contains 'Mary'.
7. Select all columns for everyone whose last name ends in 'ith'.

Points to Ponder
SQL is used to communicate with a database.
The set of relations in a database must be specified to the system by means of a data definition language (DDL).
The create table statement is used to create a new table.
A constraint is basically a rule, associated with a column, that the data entered into that column must follow.
The drop table command is used to delete a table and all rows in the table.
The select statement is used to query the database and retrieve selected data that match the criteria that you specify.
The insert statement is used to insert or add a row of data into the table.
The LIKE pattern matching operator can also be used in the conditional selection of the where clause.

Review Terms
SQL
DDL
DML
Constraints
Insert
Create
Select
Drop
Like

Student Notes

LESSON 12: LAB

LESSON 13: LAB

LESSON 14: SQL-II

Lesson Objective
Elaborating the Select statement
Rename operator
Aggregate functions
Having
Order-by
Group-by
IN & Between

The SELECT statement is the core of SQL, and it is likely that the vast majority of your SQL commands will be SELECT statements. There are an enormous number of options available for the SELECT statement. When constructing SQL queries (with the SELECT statement), it is very useful to know all of the possible options and the best or most efficient way to do things.

The Rename Operation
SQL provides a mechanism for renaming both relations and attributes. It uses the as clause, taking the form:

old-name as new-name

Example
select first as name, last, city as emp_city from empinfo;

Select Statement
The SELECT statement is used to query the database and retrieve selected data that match the criteria that you specify. The SELECT statement has five main clauses to choose from, although FROM is the only required clause. Each of the clauses has a vast selection of options, parameters, etc. Here is the format of the SELECT statement:

Select [All | Distinct] column1 [, column2]
From table1 [, table2]
[Where conditions]
[Group By column-list]
[Having conditions]
[Order By column-list [Asc | Desc]];

Select and From Clause Review
Select first_column_name, second_column_name
From table_name
Where first_column_name > 1000;

* The column names that follow the SELECT keyword determine which columns will be returned in the results. You can select as many column names as you'd like, or you can use a * to select all columns. The order they are specified in will be the order they are returned in your query results.
* The table name that follows the keyword FROM specifies the table that will be queried to retrieve the desired results.
* The WHERE clause (optional) specifies which data values or rows will be returned or displayed, based on the criteria described after the keyword WHERE.
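The rename (as) example above can be verified with sqlite3: the aliases name and emp_city become the column headings of the result, which the cursor exposes through its description attribute. The empinfo row used here is from the sample table of the previous lesson.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("create table empinfo (first, last, city)")
cur.execute("insert into empinfo values ('John', 'Jones', 'Payson')")

# The as clause renames the attributes of the result.
cur.execute("select first as name, city as emp_city from empinfo")
headings = [col[0] for col in cur.description]
print(headings)  # ['name', 'emp_city']
```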
Example
SELECT name, age, salary FROM employee WHERE age > 50;

The above statement will select all of the values in the name, age, and salary columns from the employee table whose age is greater than 50.

Comparison Operators
=         Equal
>         Greater than
<         Less than
>=        Greater than or equal to
<=        Less than or equal to
<> or !=  Not equal to
LIKE      String comparison test

*Note about LIKE
Example
Select name, title, dept From employee Where title Like 'Pro%';

The above statement will select all of the rows/values in the name, title, and dept columns from the employee table whose title starts with 'Pro'. This may return job titles including Programmer or Pro-wrestler.

All and Distinct are keywords used to select either All (the default) or only the distinct, or unique, records in your query results. If you would like to retrieve just the unique records in specified columns, you can use the Distinct keyword. Distinct will discard the duplicate records for the columns you specified after the SELECT statement. For example:

Select Distinct age FROM employee_info;

This statement will return all of the unique ages in the employee_info table. ALL will display all of the specified columns including all of the duplicates. The ALL keyword is the default if nothing is specified.

Note: The following two tables will be used in the exercises below.

Items_ordered

customerid  order_date   item                 quantity    price
10330       30-Jun-1999  Pogo stick           1           28.00
10101       30-Jun-1999  Raft                 1           58.00
10298       01-Jul-1999  Skateboard           1           33.00
10101       01-Jul-1999  Life Vest            4          125.00
10299       06-Jul-1999  Parachute            1         1250.00
10339       27-Jul-1999  Umbrella             1            4.50
10449       13-Aug-1999  Unicycle             1          180.79
10439       14-Aug-1999  Ski Poles            2           25.50
10101       18-Aug-1999  Rain Coat            1           18.30
10449       01-Sep-1999  Snow Shoes           1           45.00
10439       18-Sep-1999  Tent                 1           88.00
10298       19-Sep-1999  Lantern              2           29.00
10410       28-Oct-1999  Sleeping Bag         1           89.22
10438       01-Nov-1999  Umbrella             1            6.75
10438       02-Nov-1999  Pillow               1            8.50
10298       01-Dec-1999  Helmet               1           22.00
10449       15-Dec-1999  Bicycle              1          380.50
10449       22-Dec-1999  Canoe                1          280.00
10101       30-Dec-1999  Hoola Hoop           3           14.75
10330       01-Jan-2000  Flashlight           4           28.00
10101       02-Jan-2000  Lantern              1           16.00
10299       18-Jan-2000  Inflatable Mattress  1           38.00
10438       18-Jan-2000  Tent                 1           79.99
10413       19-Jan-2000  Lawnchair            4           32.00
10410       30-Jan-2000  Unicycle             1          192.50

Customers

customerid  firstname  lastname  city          state
10101       John       Gray      Lynden        Washington
10298       Leroy      Brown     Pinetop       Arizona
10299       Elroy      Keller    Snoqualmie    Washington
10315       Lisa       Jones     Oshkosh       Wisconsin
10325       Ginger     Schultz   Pocatello     Idaho
10329       Kelly      Mendoza   Kailua        Hawaii
10330       Shawn      Dalton    Cannon Beach  Oregon
10338       Michael    Howell    Tillamook     Oregon
10339       Anthony    Sanchez   Winslow       Arizona
10408       Elroy      Cleaver   Globe         Arizona
10410       Mary Ann   Howell    Charleston    South Carolina
10413       Donald     Davids    Gila Bend     Arizona
10419       Linda      Sakahara  Nogales       Arizona
10429       Sarah      Graham    Greensboro    North Carolina
10438       Kevin      Smith     Durango       Colorado
10439       Conrad     Giles     Telluride     Colorado
10449       Isabela    Moore     Yuma          Arizona

Students Activity
1. From the items_ordered table, select a list of all items purchased for customerid 10449. Display the customerid, item, and price for this customer.
2. Select all columns from the items_ordered table for whoever purchased a Tent.
3.
Select the customerid, order_date, and item values from the items_ordered table for any items in the item column that start with the letter 'S'.
4. Select the distinct items in the items_ordered table. In other words, display a listing of each of the unique items from the items_ordered table.

Aggregate Functions
MIN       returns the smallest value in a given column
MAX       returns the largest value in a given column
SUM       returns the sum of the numeric values in a given column
AVG       returns the average value of a given column
COUNT     returns the total number of values in a given column
COUNT(*)  returns the number of rows in a table

Aggregate functions are used to compute against a returned column of numeric data from your SELECT statement. They basically summarize the results of a particular column of selected data. We are covering these here because they are required by the next topic, GROUP BY. Although they are required for the GROUP BY clause, these functions can also be used without it. For example:

SELECT AVG(salary) FROM employee;

This statement will return a single result, which contains the average value of everything returned in the salary column from the employee table.

Another Example
SELECT AVG(salary) FROM employee WHERE title = 'Programmer';

This statement will return the average salary for all employees whose title is equal to 'Programmer'.
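The aggregate examples above can be run as-is with sqlite3. The salaries below are the made-up values from the earlier student activity; the sketch computes the overall average, the average restricted by a WHERE clause, and a COUNT(*).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("create table employee (name, title, salary)")
cur.executemany("insert into employee values (?, ?, ?)", [
    ("Jonie", "Secretary",  19500.00),
    ("Potsy", "Programmer", 45300.00),
    ("Dirk",  "Programmer", 75020.00),
])

# Average over every row in the table
cur.execute("select avg(salary) from employee")
overall = cur.fetchone()[0]

# Average restricted by a where clause
cur.execute("select avg(salary) from employee where title = 'Programmer'")
programmers = cur.fetchone()[0]
print(programmers)  # 60160.0

# count(*) needs no column: it counts rows
cur.execute("select count(*) from employee")
rows = cur.fetchone()[0]
print(rows)  # 3
```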

Example
SELECT Count(*) FROM employees;

This particular statement is slightly different from the other aggregate functions, since there isn't a column supplied to the count function. This statement will return the number of rows in the employees table.

Students Activity
1. Select the maximum price of any item ordered in the items_ordered table. Hint: Select the maximum price only.
2. Select the average price of all of the items ordered that were purchased in the month of Dec.
3. What is the total number of rows in the items_ordered table?
4. For all of the tents that were ordered in the items_ordered table, what is the price of the lowest tent? Hint: Your query should return the price only.

Group by Clause
The Group By clause will gather all of the rows together that contain data in the specified column(s) and will allow aggregate functions to be performed on the one or more columns. This can best be explained by an example.

GROUP BY clause syntax:
SELECT column1, SUM(column2)
FROM list-of-tables
GROUP BY column-list;

Let's say you would like to retrieve a list of the highest paid salaries in each dept:

SELECT max(salary), dept FROM employee GROUP BY dept;

This statement will select the maximum salary for the people in each unique department. Basically, the salary for the person who makes the most in each department will be displayed. Their salary and their department will be returned.

For another example, take a look at the items_ordered table. Let's say you want to group everything of quantity 1 together, everything of quantity 2 together, everything of quantity 3 together, etc. If you would like to determine what the largest cost item is for each grouped quantity (all quantity 1's, all quantity 2's, all quantity 3's, etc.), you would enter:

SELECT quantity, max(price) FROM items_ordered GROUP BY quantity;

Enter the statement above, and take a look at the results to see if it returned what you were expecting. Verify that the maximum price in each quantity group is really the maximum price.

GROUP BY - Multiple Grouping Columns - What if?
What if you ALSO want to display the lastname for the query below:

SELECT max(salary), dept FROM employee GROUP BY dept;

What you'll need to do is:

SELECT lastname, max(salary), dept FROM employee GROUP BY dept, lastname;

This is called using multiple grouping columns.

Students Activity
1. How many people are in each unique state in the customers table? Select the state and display the number of people in each. Hint: count is used to count rows in a column; sum works on numeric data only.
2. From the items_ordered table, select the item, maximum price, and minimum price for each specific item in the table. Hint: The items will need to be broken up into separate groups.
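Here is the quantity / max(price) GROUP BY query from the text, run over a few rows taken from the items_ordered sample table. Since SQL does not guarantee the order of groups in the output, the result is sorted in Python before printing.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("create table items_ordered (customerid, item, quantity, price)")
cur.executemany("insert into items_ordered values (?, ?, ?, ?)", [
    (10330, "Pogo stick", 1,  28.00),
    (10101, "Life Vest",  4, 125.00),
    (10449, "Bicycle",    1, 380.50),
    (10330, "Flashlight", 4,  28.00),
])

# One output row per distinct quantity, carrying the group's maximum price.
cur.execute("""select quantity, max(price)
               from items_ordered
               group by quantity""")
result = sorted(cur.fetchall())  # sort for a deterministic display order
print(result)  # [(1, 380.5), (4, 125.0)]
```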

3. How many orders did each customer make? Use the items_ordered table. Select the customerid, the number of orders they made, and the sum of their orders.

Having Clause
The HAVING clause allows you to specify conditions on the rows for each group - in other words, which rows should be selected will be based on the conditions you specify. The HAVING clause should follow the GROUP BY clause if you are going to use it.

HAVING clause syntax:
SELECT column1, SUM(column2)
FROM list-of-tables
GROUP BY column-list
HAVING condition;

HAVING can best be described by example. Let's say you have an employee table containing the employee's name, department, salary, and age. If you would like to select the average salary for the employees in each department, you could enter:

SELECT dept, avg(salary) FROM employee GROUP BY dept;

But let's say that you want to ONLY calculate and display the average if the average salary is over 20000:

SELECT dept, avg(salary) FROM employee GROUP BY dept HAVING avg(salary) > 20000;

Students Activity
1. How many people are in each unique state in the customers table that have more than one person in the state? Select the state and display the number of people in each if it's greater than 1.
2. From the items_ordered table, select the item, maximum price, and minimum price for each specific item in the table. Only display the results if the maximum price for one of the items is greater than 190.00.
3. How many orders did each customer make? Use the items_ordered table. Select the customerid, the number of orders they made, and the sum of their orders if they purchased more than 1 item.

Order by Clause
ORDER BY is an optional clause which will allow you to display the results of your query in a sorted order (either ascending or descending) based on the columns that you specify to order by.
ORDER BY clause syntax:
SELECT column1, SUM(column2)
FROM list-of-tables
ORDER BY column-list [ASC | DESC];
[] = optional

ASC = Ascending order (the default)
DESC = Descending order

For example:
SELECT employee_id, dept, name, age, salary
FROM employee_info
WHERE dept = 'Sales'
ORDER BY salary;

This statement will select the employee_id, dept, name, age, and salary from the employee_info table where the dept equals 'Sales', and will list the results in ascending (default) order based on salary.

If you would like to order based on multiple columns, you must separate the columns with commas. For example:

SELECT employee_id, dept, name, age, salary
FROM employee_info
WHERE dept = 'Sales'

ORDER BY salary, age DESC;

Students Activity
1. Select the lastname, firstname, and city for all customers in the customers table. Display the results in ascending order based on the lastname.
2. Same thing as exercise #1, but display the results in descending order.
3. Select the item and price for all of the items in the items_ordered table whose price is greater than 10.00. Display the results in ascending order based on the price.

Combining Conditions and Boolean Operators
The AND operator can be used to join two or more conditions in the WHERE clause. Both sides of the AND condition must be true in order for the condition to be met and for those rows to be displayed.

SELECT column1, SUM(column2)
FROM list-of-tables
WHERE condition1 AND condition2;

The OR operator can also be used to join two or more conditions in the WHERE clause. However, either side of the OR operator can be true and the condition will be met - hence, the rows will be displayed. With the OR operator, either side can be true, or both sides can be true.

For Example
SELECT employeeid, firstname, lastname, title, salary
FROM employee_info
WHERE salary >= 50000.00 AND title = 'Programmer';

This statement will select the employeeid, firstname, lastname, title, and salary from the employee_info table where the salary is greater than or equal to 50000.00 AND the title is equal to 'Programmer'. Both of these conditions must be true in order for the rows to be returned in the query. If either is false, the row will not be displayed.

Although they are not required, you can use parentheses around your conditional expressions to make them easier to read:

SELECT employeeid, firstname, lastname, title, salary
FROM employee_info
WHERE (salary >= 50000.00) AND (title = 'Programmer');

Another Example
SELECT firstname, lastname, title, salary
FROM employee_info
WHERE (title = 'Sales') OR (title = 'Programmer');

This statement will select the firstname, lastname, title, and salary from the employee_info table where the title is either equal to 'Sales' OR the title is equal to 'Programmer'.

Students Activity
1. Select the customerid, order_date, and item from the items_ordered table for all items unless they are Snow Shoes or Ear Muffs. Display the rows as long as they are not either of these two items.
2. Select the item and price of all items that start with the letters S, P, or F.

In and Between Conditional Operators
SELECT col1, SUM(col2)
FROM list-of-tables
WHERE col3 IN (list-of-values);

SELECT col1, SUM(col2)
FROM list-of-tables
WHERE col3 BETWEEN value1 AND value2;

The IN conditional operator is really a set membership test operator. That is, it is used to test whether or not a value (stated before the keyword IN) is in the list of values provided after the keyword IN.

For Example
SELECT employeeid, lastname, salary

FROM employee_info
WHERE lastname IN ('Hernandez', 'Jones', 'Roberts', 'Ruiz');

This statement will select the employeeid, lastname, and salary from the employee_info table where the lastname is equal to either Hernandez, Jones, Roberts, or Ruiz. It will return the rows if it is ANY of these values.

The IN conditional operator can be rewritten by using compound conditions with the equals operator combined with OR - with exactly the same results:

SELECT employeeid, lastname, salary
FROM employee_info
WHERE lastname = 'Hernandez' OR lastname = 'Jones' OR lastname = 'Roberts' OR lastname = 'Ruiz';

As you can see, the IN operator is much shorter and easier to read when you are testing for more than two or three values. You can also use NOT IN to exclude the rows in your list.

The BETWEEN conditional operator is used to test whether or not a value (stated before the keyword BETWEEN) is between the two values stated after the keyword BETWEEN.

For example
SELECT employeeid, age, lastname, salary
FROM employee_info
WHERE age BETWEEN 30 AND 40;

This statement will select the employeeid, age, lastname, and salary from the employee_info table where the age is between 30 and 40 (including 30 and 40).

This statement can also be rewritten without the BETWEEN operator:

SELECT employeeid, age, lastname, salary
FROM employee_info
WHERE age >= 30 AND age <= 40;

You can also use NOT BETWEEN to exclude the values between your range.

Students Activity
1. Select the date, item, and price from the items_ordered table for all of the rows that have a price value ranging from 10.00 to 80.00.

Mathematical Operators
Standard ANSI SQL-92 supports the first four of the following basic arithmetic operators:
+  addition
-  subtraction
*  multiplication
/  division
%  modulo

The modulo operator determines the integer remainder of the division. This operator is not ANSI SQL supported; however, most databases support it.
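The IN and BETWEEN tests above can be demonstrated with sqlite3 over a small made-up employee_info table (the ids, ages, and salaries below are invented for illustration).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("create table employee_info (employeeid, lastname, age, salary)")
cur.executemany("insert into employee_info values (?, ?, ?, ?)", [
    (1, "Hernandez", 35, 40000),
    (2, "Smith",     52, 60000),
    (3, "Ruiz",      30, 45000),
])

# Set membership: any lastname in the list passes
cur.execute("""select lastname from employee_info
               where lastname in ('Hernandez', 'Jones', 'Roberts', 'Ruiz')""")
in_rows = [r[0] for r in cur.fetchall()]
print(in_rows)       # ['Hernandez', 'Ruiz']

# between is inclusive at both ends, so age 30 passes
cur.execute("select lastname from employee_info where age between 30 and 40")
between_rows = [r[0] for r in cur.fetchall()]
print(between_rows)  # ['Hernandez', 'Ruiz']
```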
The following are some more useful mathematical functions to be aware of, since you might need them. These functions are not standard in the ANSI SQL-92 specs, so they may or may not be available on the specific RDBMS that you are using; however, they are available on several major database systems.

ABS(x)                 returns the absolute value of x
SIGN(x)                returns the sign of x as -1, 0, or 1 (negative, zero, or positive respectively)
MOD(x,y)               modulo - returns the integer remainder of x divided by y (same as x%y)
FLOOR(x)               returns the largest integer value that is less than or equal to x
CEILING(x) or CEIL(x)  returns the smallest integer value that is greater than or equal to x
POWER(x,y)             returns the value of x raised to the power of y
ROUND(x)               returns the value of x rounded to the nearest whole integer
ROUND(x,d)             returns the value of x rounded to the number of decimal places specified by the value d
SQRT(x)                returns the square root of x

For example
SELECT round(salary), firstname FROM employee_info;

This statement will select the salary rounded to the nearest whole value, and the firstname, from the employee_info table.

Students Activity
1. Select the item and per-unit price for each item in the items_ordered table. Hint: Divide the price by the quantity.
2. Select the firstname, city, and state from the customers table for all of the rows where the state value is either: Arizona, Washington, Oklahoma, Colorado, or Hawaii.
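Availability of these functions really does vary by system, as the text warns. For instance, SQLite ships round() and abs() in every build, while functions like power() and sqrt() are only present in newer builds with math functions enabled. A quick sketch using only the portable pair:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# round to whole value, round to 2 decimal places, absolute value
cur.execute("select round(45.67), round(45.678, 2), abs(-5)")
result = cur.fetchone()
print(result)  # (46.0, 45.68, 5)
```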

Points to Ponder
SQL provides a rename operator for both relations and attributes.
Aggregate functions are used to compute against a returned column of numeric data from your SELECT statement.
The GROUP BY clause will gather all of the rows together that contain data in the specified column(s) and will allow aggregate functions to be performed on the one or more columns.
The HAVING clause allows you to specify conditions on the rows for each group.
ORDER BY is an optional clause which will allow you to display the results of your query in a sorted order.

Review Terms
Rename operator
Aggregate function
Having
Order-by
Group-by

Student Notes

LESSON 15: LAB

LESSON 16: LAB

Lesson Objective
Table joins
Tuple variables
String operators
Set operators
Views
Update
Delete

Table Joins
All of the queries up until this point have been useful, with the exception of one major limitation - that is, you've been selecting from only one table at a time with your SELECT statement. It is time to introduce you to one of the most beneficial features of SQL and relational database systems - the Join. To put it simply, the Join makes relational database systems relational.

Joins allow you to link data from two or more tables together into a single query result from one single SELECT statement. A Join can be recognized in a SQL SELECT statement if it has more than one table after the FROM keyword. For example:

SELECT list-of-columns
FROM table1, table2
WHERE search-condition(s)

Joins can be explained more easily by demonstrating what would happen if you worked with one table only and didn't have the ability to use joins. This single-table database is also sometimes referred to as a flat table. Let's say you have a one-table database that is used to keep track of all of your customers and what they purchase from your store. Every time a new row is inserted into the table, all columns will be updated, resulting in unnecessary redundant data. For example, every time Wolfgang Schultz purchases something, the following rows will be inserted into the table:

id     first     last     address         city  state  zip    date    item       price
10982  Wolfgang  Schultz  300 N. 1st Ave  Yuma  AZ     85002  032299  snowboard  45.00
10982  Wolfgang  Schultz  300 N. 1st Ave  Yuma  AZ     85002  091199  gloves     15.00
10982  Wolfgang  Schultz  300 N. 1st Ave  Yuma  AZ     85002  100999  lantern    35.00
10982  Wolfgang  Schultz  300 N. 1st Ave  Yuma  AZ     85002  022900  tent       85.00

An ideal database would have two tables:
1. One for keeping track of your customers
2.
And the other to keep track of what they purchase:

Customer_info table:
customer_number  firstname  lastname  address  city  state  zip

Purchases table:
customer_number  date  item  price

LESSON 17: SQL-III

Now, whenever a purchase is made by a repeating customer, only the second table, Purchases, needs to be updated! We've just eliminated useless redundant data; that is, we've just normalized this database!

Notice how each of the tables has a common customer_number column. This column, which contains the unique customer number, will be used to JOIN the two tables. Using the two new tables, let's say you would like to select the customer's name and the items they've purchased. Here is an example of a join statement to accomplish this:

SELECT customer_info.firstname, customer_info.lastname, purchases.item
FROM customer_info, purchases
WHERE customer_info.customer_number = purchases.customer_number;

This particular Join is known as an Inner Join or Equijoin. This is the most common type of Join that you will see or use. Notice that each of the columns is always preceded with the table name and a period. This isn't always required; however, it is good practice so that you won't confuse which columns go with which tables. It is required if the column names are the same between the two tables. I recommend preceding all of your columns with the table names when using joins.

Note: The syntax described above will work with most database systems. However, in the event that it doesn't work with yours, please check your specific database documentation.
Although the above will probably work, here is the ANSI SQL-92 syntax specification for an Inner Join, using the preceding statement, that you might want to try:

SELECT customer_info.firstname, customer_info.lastname, purchases.item
FROM customer_info INNER JOIN purchases
ON customer_info.customer_number = purchases.customer_number;

Another Example
SELECT employee_info.employeeid, employee_info.lastname, employee_sales.comission
FROM employee_info, employee_sales
WHERE employee_info.employeeid = employee_sales.employeeid;

This statement will select the employeeid, lastname (from the employee_info table), and the comission value (from the employee_sales table) for all of the rows where the employeeid in the employee_info table matches the employeeid in the employee_sales table.
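The customer_info / purchases inner join from the text can be run end to end with sqlite3. The data values are invented for illustration; note that the purchase with no matching customer is excluded by the join, which is exactly the inner-join behavior. Join output order is not guaranteed by SQL, so the rows are sorted before printing.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("create table customer_info (customer_number, firstname, lastname)")
cur.execute("create table purchases (customer_number, item, price)")
cur.execute("insert into customer_info values (10982, 'Wolfgang', 'Schultz')")
cur.executemany("insert into purchases values (?, ?, ?)", [
    (10982, "snowboard", 45.00),
    (10982, "gloves",    15.00),
    (99999, "tent",      85.00),   # no matching customer -> excluded by the join
])

# Columns are prefixed with their table name, as recommended above.
cur.execute("""select customer_info.firstname, customer_info.lastname,
                      purchases.item
               from customer_info, purchases
               where customer_info.customer_number = purchases.customer_number""")
result = sorted(cur.fetchall())  # sort for a deterministic display order
print(result)
# [('Wolfgang', 'Schultz', 'gloves'), ('Wolfgang', 'Schultz', 'snowboard')]
```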

Students Activity
1. Write a query using a join to determine which items were ordered by each of the customers in the customers table. Select the customerid, firstname, lastname, order_date, item, and price for everything each customer purchased in the items_ordered table.
2. Repeat exercise #1, but display the results sorted by state in descending order.

Tuple Variables
The as clause is particularly useful in defining the notion of tuple variables, as is done in the tuple relational calculus. A tuple variable in SQL must be associated with a particular relation. Tuple variables are defined in the from clause by way of the as clause. To illustrate, we rewrite the query "For all customers who have a loan from the bank, find their names, loan numbers, and loan amounts" as

select customer-name, T.loan-number, S.amount
from borrower as T, loan as S
where T.loan-number = S.loan-number

Note that we define a tuple variable in the from clause by placing it after the name of the relation with which it is associated, with the keyword as in between (the keyword as is optional). When we write expressions of the form relation-name.attribute-name, the relation name is, in effect, an implicitly defined tuple variable.

Tuple variables are most useful for comparing two tuples in the same relation. Recall that, in such cases, we could use the rename operation in the relational algebra. Suppose that we want the query "Find the names of all branches that have assets greater than at least one branch located in Brooklyn." We can write the SQL expression

select distinct T.branch-name
from branch as T, branch as S
where T.assets > S.assets and S.branch-city = 'Brooklyn'

Observe that we could not use the notation branch.assets, since it would not be clear which reference to branch is intended.

String Operations
SQL specifies strings by enclosing them in single quotes, for example 'Perryridge', as we saw earlier.
A single quote character that is part of a string can be specified by using two single quote characters; for example, the string "It's right" can be specified as 'It''s right'.

The most commonly used operation on strings is pattern matching using the operator like. We describe patterns by using two special characters:

Percent (%): The % character matches any substring.
Underscore (_): The _ character matches any character.

Patterns are case sensitive; that is, uppercase characters do not match lowercase characters, or vice versa. To illustrate pattern matching, we consider the following examples:

'Perry%' matches any string beginning with "Perry".
'%idge%' matches any string containing "idge" as a substring, for example, Perryridge, Rock Ridge, Mianus Bridge, and Ridgeway.
'___' matches any string of exactly three characters.
'___%' matches any string of at least three characters.

SQL expresses patterns by using the like comparison operator. Consider the query "Find the names of all customers whose street address includes the substring 'Main'." This query can be written as

select customer-name
from customer
where customer-street like '%Main%'

Joined Relations
SQL provides not only the basic Cartesian-product mechanism for joining tuples of relations found in its earlier versions, but also various other mechanisms for joining relations, including condition joins and natural joins, as well as various forms of outer joins. These additional operations are typically used as subquery expressions in the from clause.

The loan and borrower relations:

loan-number  branch-name  amount
L-170        Downtown     3000
L-230        Redwood      4000
L-260        Perryridge   1700

customer-name  loan-number
Jones          L-170
Smith          L-230
Hayes          L-155

Examples
We illustrate the various join operations by using the relations loan and borrower above. We start with a simple example of inner joins:

loan inner join borrower on loan.loan-number = borrower.loan-number

The expression computes the theta join of the loan and the borrower relations, with the join condition being loan.loan-number = borrower.loan-number. The attributes of the result

consist of the attributes of the left-hand-side relation followed by the attributes of the right-hand-side relation. Note that the attribute loan-number appears twice in the result: the first occurrence is from loan, and the second is from borrower.

The result of loan inner join borrower on loan.loan-number = borrower.loan-number:

loan-number  branch-name  amount  customer-name  loan-number
L-170        Downtown     3000    Jones          L-170
L-230        Redwood      4000    Smith          L-230

The SQL standard does not require attribute names in such results to be unique. An as clause should be used to assign unique names to attributes in query and subquery results. We rename the result relation of a join, and the attributes of the result relation, by using an as clause, as illustrated here:

loan inner join borrower on loan.loan-number = borrower.loan-number
as lb(loan-number, branch, amount, cust, cust-loan-num)

We rename the second occurrence of loan-number to cust-loan-num. The ordering of the attributes in the result of the join is important for the renaming.

Next, we consider an example of the left outer join operation:

loan left outer join borrower on loan.loan-number = borrower.loan-number

Although outer-join expressions are typically used in the from clause, they can be used anywhere that a relation can be used. Each of the variants of the join operations in SQL consists of a join type and a join condition. The join condition defines which tuples in the two relations match and what attributes are present in the result of the join. The join type defines how tuples in each relation that do not match any tuple in the other relation are treated.

Join types:
Inner join
Left outer join
Right outer join
Full outer join

The result of loan natural inner join borrower:

loan-number  branch-name  amount  customer-name
L-170        Downtown     3000    Jones
L-230        Redwood      4000    Smith
The result of loan left outer join borrower on loan.loan-number = borrower.loan-number:

loan-number  branch-name  amount  customer-name  loan-number
L-170        Downtown     3000    Jones          L-170
L-230        Redwood      4000    Smith          L-230
L-260        Perryridge   1700    null           null

We can compute the left outer join operation logically as follows. First, compute the result of the inner join as before. Then, for every tuple t in the left-hand-side relation loan that does not match any tuple in the right-hand-side relation borrower in the inner join, add a tuple r to the result of the join: the attributes of tuple r that are derived from the left-hand-side relation are filled in with the values from tuple t, and the remaining attributes of r are filled with null values. The tuples (L-170, Downtown, 3000) and (L-230, Redwood, 4000) join with tuples from borrower and appear in the result of the inner join, and hence in the result of the left outer join. On the other hand, the tuple (L-260, Perryridge, 1700) did not match any tuple from borrower in the inner join, and hence the tuple (L-260, Perryridge, 1700, null, null) is present in the result of the left outer join.

Finally, we consider an example of the natural join operation:

loan natural inner join borrower

This expression computes the natural join of the two relations. The only attribute name common to loan and borrower is loan-number. However, the attribute loan-number appears only once in the result of the natural join, whereas it appears twice in the result of the join with the on condition.

Join Types and Conditions
We saw examples of the join operations permitted in SQL. Join operations take two relations and return another relation as the result. The first join type is the inner join, and the other three are outer joins. Of the three join conditions, we have seen the natural join and the on condition before; we shall discuss the using condition later in this section.
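The left outer join computation described above, including the null-padded tuple for L-260, can be observed directly. Again, SQLite and Python are illustration choices, not the text's platform; Python renders SQL null as None:

```python
import sqlite3

# Same bank schema as in the text, hyphens replaced by underscores.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE loan(loan_number TEXT, branch_name TEXT, amount INTEGER);
    CREATE TABLE borrower(customer_name TEXT, loan_number TEXT);
    INSERT INTO loan VALUES
        ('L-170','Downtown',3000), ('L-230','Redwood',4000),
        ('L-260','Perryridge',1700);
    INSERT INTO borrower VALUES
        ('Jones','L-170'), ('Smith','L-230'), ('Hayes','L-155');
""")

# L-260 has no matching borrower, so its borrower-side attributes
# come back as nulls in the left outer join.
rows = con.execute("""
    SELECT loan.loan_number, branch_name, amount,
           customer_name, borrower.loan_number
    FROM loan LEFT OUTER JOIN borrower
         ON loan.loan_number = borrower.loan_number
    ORDER BY loan.loan_number
""").fetchall()
print(rows)
```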
The use of a join condition is mandatory for outer joins, but is optional for inner joins (if it is omitted, a Cartesian product results). Syntactically, the keyword natural appears before the join type, as illustrated earlier, whereas the on and using conditions appear at the end of the join expression. The keywords inner and outer are optional, since the rest of the join type enables us to deduce whether the join is an inner join or an outer join.

The meaning of the join condition natural, in terms of which tuples from the two relations match, is straightforward. The ordering of the attributes in the result of a natural join is as follows. The join attributes (that is, the attributes common to both relations) appear first, in the order in which they appear in the left-hand-side relation. Next come all nonjoin attributes of the left-hand-side relation, and finally all nonjoin attributes of the right-hand-side relation.

The right outer join is symmetric to the left outer join. Tuples from the right-hand-side relation that do not match any tuple in the left-hand-side relation are padded with nulls and are added to the result of the right outer join. Here is an example of combining the natural join condition with the right outer join type:

loan natural right outer join borrower

The attributes of the result are defined by the join type, which is a natural join; hence, loan-number appears only once. The first two tuples in the result are from the inner natural join of loan and borrower. The tuple (Hayes, L-155) from the right-hand-side relation does not match any tuple from the left-hand-side relation loan in the natural inner join. Hence, the tuple (L-155, null, null, Hayes) appears in the join result.
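Because the right outer join is symmetric to the left outer join, it can always be rewritten as a left outer join with the operands swapped. The sketch below uses that rewrite (SQLite only gained a native RIGHT JOIN in version 3.39, so the swapped form is the portable one) to reproduce the natural right outer join result, with loan-number listed once:

```python
import sqlite3

# Same bank schema as in the text.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE loan(loan_number TEXT, branch_name TEXT, amount INTEGER);
    CREATE TABLE borrower(customer_name TEXT, loan_number TEXT);
    INSERT INTO loan VALUES
        ('L-170','Downtown',3000), ('L-230','Redwood',4000),
        ('L-260','Perryridge',1700);
    INSERT INTO borrower VALUES
        ('Jones','L-170'), ('Smith','L-230'), ('Hayes','L-155');
""")

# "loan right outer join borrower" expressed as
# "borrower left outer join loan": every borrower tuple is preserved,
# and Hayes/L-155 picks up nulls for the loan-side attributes.
rows = con.execute("""
    SELECT b.loan_number, l.branch_name, l.amount, b.customer_name
    FROM borrower b LEFT OUTER JOIN loan l
         ON l.loan_number = b.loan_number
    ORDER BY b.loan_number
""").fetchall()
print(rows)
```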

The result of loan natural right outer join borrower:

loan-number  branch-name  amount  customer-name
L-170        Downtown     3000    Jones
L-230        Redwood      4000    Smith
L-155        null         null    Hayes

The join condition using (A1, A2, ..., An) is similar to the natural join condition, except that the join attributes are the attributes A1, A2, ..., An, rather than all attributes that are common to both relations. The attributes A1, A2, ..., An must consist of only attributes that are common to both relations, and they appear only once in the result of the join.

The full outer join is a combination of the left and right outer-join types. After the operation computes the result of the inner join, it extends with nulls those tuples from the left-hand-side relation that did not match any tuple from the right-hand-side relation, and adds them to the result. Similarly, it extends with nulls those tuples from the right-hand-side relation that did not match any tuple from the left-hand-side relation, and adds them to the result.

loan full outer join borrower using (loan-number)

The result of loan full outer join borrower using (loan-number):

loan-number  branch-name  amount  customer-name
L-170        Downtown     3000    Jones
L-230        Redwood      4000    Smith
L-260        Perryridge   1700    null
L-155        null         null    Hayes

Points to Ponder
Joins allow you to link data from two or more tables together into a single query result, from one single SELECT statement.
Tuple variables are defined in the from clause by way of the as clause.
We describe patterns by using two special characters: percent (%), which matches any substring, and underscore (_), which matches any single character.
The various kinds of joins are: inner join, left outer join, right outer join, and full outer join.

Review Terms
Table joins
Tuple variable
String operators
Set operators
Views
Update
Delete
Join types: inner join, left outer join, right outer join, full outer join
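The full outer join shown in this lesson preserves non-matching tuples from both sides. On engines without a native FULL OUTER JOIN (SQLite before 3.39, for instance), it can be emulated as a left outer join combined with the unmatched right-hand-side tuples; this hedged sketch does exactly that:

```python
import sqlite3

# Same bank schema as in the text.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE loan(loan_number TEXT, branch_name TEXT, amount INTEGER);
    CREATE TABLE borrower(customer_name TEXT, loan_number TEXT);
    INSERT INTO loan VALUES
        ('L-170','Downtown',3000), ('L-230','Redwood',4000),
        ('L-260','Perryridge',1700);
    INSERT INTO borrower VALUES
        ('Jones','L-170'), ('Smith','L-230'), ('Hayes','L-155');
""")

# Full outer join emulation: the left outer join keeps all loan tuples,
# and the second branch appends borrower tuples with no matching loan.
rows = con.execute("""
    SELECT l.loan_number, branch_name, amount, customer_name
    FROM loan l LEFT OUTER JOIN borrower b
         ON l.loan_number = b.loan_number
    UNION ALL
    SELECT b.loan_number, NULL, NULL, b.customer_name
    FROM borrower b
    WHERE b.loan_number NOT IN (SELECT loan_number FROM loan)
    ORDER BY 1
""").fetchall()
print(rows)
```

All four tuples of the text's figure appear: two matched, L-260 padded on the right, and L-155 padded on the left.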

Student Notes

LESSON 18: LAB

LESSON 19: LAB

LESSON 20: SQL-IV

Lesson Objectives
Set operators
Union
Intersect
Except
View
Update
Delete

Set Operations
The SQL operations union, intersect, and except operate on relations and correspond to the relational-algebra operations ∪, ∩, and −. Like union, intersection, and set difference in relational algebra, the relations participating in the operations must be compatible; that is, they must have the same set of attributes. Consider the set of customers who have an account at the bank, which can be derived by

select customer-name
from depositor

and the set of customers who have a loan at the bank, which can be derived by

select customer-name
from borrower

We shall refer to the relations obtained as the result of the preceding queries as d and b, respectively.

The Union Operation
To find all customers having a loan, an account, or both at the bank, we write

(select customer-name from depositor)
union
(select customer-name from borrower)

The union operation automatically eliminates duplicates, unlike the select clause. Thus, in the preceding query, if a customer (say, Jones) has several accounts or loans (or both) at the bank, then Jones will appear only once in the result. If we want to retain all duplicates, we must write union all in place of union:

(select customer-name from depositor)
union all
(select customer-name from borrower)

The number of duplicate tuples in the result is equal to the total number of duplicates that appear in both d and b. Thus, if Jones has three accounts and two loans at the bank, then there will be five tuples with the name Jones in the result.

The Intersect Operation
To find all customers who have both a loan and an account at the bank, we write

(select distinct customer-name from depositor)
intersect
(select distinct customer-name from borrower)

The intersect operation automatically eliminates duplicates. Thus, in the preceding query, if a customer (say, Jones) has several accounts and loans at the bank, then Jones will appear only once in the result.
If we want to retain all duplicates, we must write intersect all in place of intersect:

(select customer-name from depositor)
intersect all
(select customer-name from borrower)

The number of duplicate tuples that appear in the result is equal to the minimum number of duplicates in both d and b. Thus, if Jones has three accounts and two loans at the bank, then there will be two tuples with the name Jones in the result.

The Except Operation
To find all customers who have an account but no loan at the bank, we write

(select distinct customer-name from depositor)
except
(select customer-name from borrower)

The except operation automatically eliminates duplicates. Thus, in the preceding query, a tuple with customer name Jones will appear (exactly once) in the result only if Jones has an account at the bank, but has no loan at the bank. If we want to retain all duplicates, we must write except all in place of except:

(select customer-name from depositor)
except all
(select customer-name from borrower)

The number of duplicate copies of a tuple in the result is equal to the number of duplicate copies of the tuple in d minus the number of duplicate copies of the tuple in b, provided that the difference is positive. Thus, if Jones has three accounts and one loan at the bank, then there will be two tuples with the name Jones in the result. If, instead, this customer has two accounts and three loans at the bank, there will be no tuple with the name Jones in the result.
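The duplicate-handling rules of the set operations can be verified with a short sqlite3 session. Note that SQLite supports union, union all, intersect, and except, but not the intersect all and except all variants; the example below sticks to what SQLite accepts, with Jones holding three accounts and two loans as in the text:

```python
import sqlite3

# Depositor and borrower relations matching the text's running example.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE depositor(customer_name TEXT, account_number TEXT);
    CREATE TABLE borrower(customer_name TEXT, loan_number TEXT);
    INSERT INTO depositor VALUES
        ('Jones','A-101'), ('Jones','A-102'), ('Jones','A-215'),
        ('Smith','A-217');
    INSERT INTO borrower VALUES
        ('Jones','L-170'), ('Jones','L-230'), ('Hayes','L-155');
""")

# union eliminates duplicates: Jones appears once.
union_names = [r[0] for r in con.execute("""
    SELECT customer_name FROM depositor
    UNION
    SELECT customer_name FROM borrower
    ORDER BY 1""")]

# union all retains them: 3 accounts + 2 loans = 5 Jones tuples.
union_all_names = [r[0] for r in con.execute("""
    SELECT customer_name FROM depositor
    UNION ALL
    SELECT customer_name FROM borrower""")]

# intersect: customers with both an account and a loan.
intersect_names = [r[0] for r in con.execute("""
    SELECT DISTINCT customer_name FROM depositor
    INTERSECT
    SELECT DISTINCT customer_name FROM borrower""")]

# except: customers with an account but no loan.
except_names = [r[0] for r in con.execute("""
    SELECT DISTINCT customer_name FROM depositor
    EXCEPT
    SELECT customer_name FROM borrower""")]

print(union_names, union_all_names.count('Jones'),
      intersect_names, except_names)
```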

Views
We define a view in SQL by using the create view command. To define a view, we must give the view a name and must state the query that computes the view. The form of the create view command is

create view v as <query expression>

where <query expression> is any legal query expression. The view name is represented by v. As an example, consider the view consisting of branch names and the names of customers who have either an account or a loan at that branch. Assume that we want this view to be called all-customer. We define this view as follows:

create view all-customer as
(select branch-name, customer-name
from depositor, account
where depositor.account-number = account.account-number)
union
(select branch-name, customer-name
from borrower, loan
where borrower.loan-number = loan.loan-number)

The attribute names of a view can be specified explicitly as follows:

create view branch-total-loan(branch-name, total-loan) as
select branch-name, sum(amount)
from loan
group by branch-name

The preceding view gives, for each branch, the sum of the amounts of all the loans at the branch. Since the expression sum(amount) does not have a name, the attribute name is specified explicitly in the view definition. View names may appear in any place that a relation name may appear. Using the view all-customer, we can find all customers of the Perryridge branch by writing

select customer-name
from all-customer
where branch-name = 'Perryridge'

Modification of the Database

Update
The update statement is used to update or change records that match a specified criterion. This is accomplished by carefully constructing a where clause.

update tablename
set columnname = newvalue
[, nextcolumn = newvalue2 ...]
where columnname OPERATOR value
[and|or columnname OPERATOR value];

([] = optional)

Examples

update phone_book
set area_code = 623
where prefix = 979;

update phone_book
set last_name = 'Smith', prefix = 555, suffix = 9292
where last_name = 'Jones';

update employee
set age = age + 1
where first_name = 'Mary' and last_name = 'Williams';

Students Activity
1. Jonie Weber just got married to Bob Williams. She has requested that her last name be updated to Weber-Williams.
2. Dirk Smith's birthday is today; add 1 to his age.
3. All secretaries are now called Administrative Assistant. Update all titles accordingly.
4. Everyone that is making under 30000 is to receive a 3500 a year raise.

5. Everyone that is making over 33500 is to receive a 4500 a year raise.
6. All Programmer II titles are now promoted to Programmer III.
7. All Programmer titles are now promoted to Programmer II.

Students Activity
1. Jonie Weber-Williams just quit; remove her record from the table.
2. It's time for budget cuts. Remove all employees who are making over 70000 dollars.

Deleting Records
The delete statement is used to delete records or rows from the table.

delete from tablename
where columnname OPERATOR value
[and|or columnname OPERATOR value];

([] = optional)

Examples

delete from employee;

Note: if you leave off the where clause, all records will be deleted!

delete from employee
where lastname = 'May';

delete from employee
where firstname = 'Mike' or firstname = 'Eric';

To delete an entire record/row from a table, enter delete from followed by the table name, followed by the where clause, which contains the conditions for the delete. If you leave off the where clause, all records will be deleted.

Points to Ponder
The SQL operations union, intersect, and except operate on relations.
The union operation automatically eliminates duplicates, unlike the select clause.
The intersect operation automatically eliminates duplicates.
To define a view, we must give the view a name and must state the query that computes the view.
The update statement is used to update or change records that match a specified criterion.
The delete statement is used to delete records or rows from the table.

Review Terms
Set operators
Union
Intersect
Except
View
Update
Delete
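The create view, update, and delete statements of this lesson can be exercised together in one sqlite3 session. The employee table, names, and salaries below are invented for illustration, loosely modeled on the activity exercises:

```python
import sqlite3

# A small employee table plus a view summarizing titles.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE employee(
        first_name TEXT, last_name TEXT,
        age INTEGER, title TEXT, salary INTEGER);
    INSERT INTO employee VALUES
        ('Dirk',  'Smith',    25, 'Programmer II', 45000),
        ('Jonie', 'Weber',    28, 'Secretary',     32000),
        ('Mary',  'Williams', 40, 'Manager',       80000);
    CREATE VIEW title_count(title, n) AS
        SELECT title, COUNT(*) FROM employee GROUP BY title;
""")

# update: it's Dirk Smith's birthday, add 1 to his age.
con.execute("""UPDATE employee SET age = age + 1
               WHERE first_name = 'Dirk' AND last_name = 'Smith'""")
# update: all secretaries are now Administrative Assistants.
con.execute("""UPDATE employee SET title = 'Administrative Assistant'
               WHERE title = 'Secretary'""")
# delete: budget cuts -- remove everyone making over 70000.
con.execute("DELETE FROM employee WHERE salary > 70000")

dirk_age = con.execute(
    "SELECT age FROM employee WHERE last_name = 'Smith'").fetchone()[0]
titles = [t[0] for t in con.execute(
    "SELECT title FROM title_count ORDER BY title")]
print(dirk_age, titles)
```

The view reflects the updates and the delete automatically, since it is recomputed from the base table on each query.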

LESSON 21: LAB

LESSON 22: LAB

LESSON 23: INTEGRITY AND SECURITY

Lesson Objectives
Data constraints
Column level constraints
Table level constraints
NULL value
Primary key
Unique key
Default value
Foreign key
NOT NULL constraints
Check constraints

Data Constraints
Integrity constraints ensure that changes made to the database by authorized users do not result in a loss of data consistency. Thus, integrity constraints guard against accidental damage to the database. Besides the cell name, cell length, and cell data type, there are other parameters, i.e. other data constraints, that can be passed to the DBMS at cell creation time. These data constraints will be connected to a cell by the DBMS as flags. Whenever a user attempts to load a cell with data, the DBMS will check the data being loaded into the cell against the data constraints defined at the time the cell was created. If the data being loaded fails any of the data constraint checks fired by the DBMS, the DBMS will not load the data into the cell, will reject the entered record, and will flash an error message to the user. These constraints are given a constraint name, and the DBMS stores the constraints with their names and instructions internally, along with the cell itself. A constraint can be placed either at the column level or at the table level.

Column Level Constraints
If a constraint is defined along with the column definition, it is called a column level constraint. A column level constraint can be applied to only one column at a time, i.e. it is local to a specific column. If the constraint spans multiple columns, the user will have to use table level constraints.

Table Level Constraints
If the data constraint attached to a specific cell in a table references the contents of another cell in the table, then the user will have to use table level constraints. Table level constraints are stored as part of the global table definition.
Examples of the different constraints that can be applied on a table are as follows.

Null Value Concepts
While creating tables, if a row lacks a data value for a particular column, that value is said to be null. Columns of any data type may contain null values unless the column was defined as not null when the table was created.

Principles of null values:
Setting a null value is appropriate when the actual value is unknown, or when a value would not be meaningful.
A null value is not equivalent to a value of zero.
A null value will evaluate to null in any expression, e.g. null multiplied by 10 is null.
When a column is defined as not null, that column becomes a mandatory column: the user is forced to enter data into it.

Example: Create table client_master with a not null constraint on columns client_no, name, address1, and address2.

NOT NULL as a column constraint:

CREATE TABLE client_master
(client_no varchar2(6) NOT NULL,
name varchar2(20) NOT NULL,
address1 varchar2(30) NOT NULL,
address2 varchar2(30) NOT NULL,
city varchar2(15),
state varchar2(15),
pincode number(6),
remarks varchar2(60),
bal_due number(10,2));

Primary Key Concepts
A primary key is one or more columns in a table used to uniquely identify each row in the table. Primary key values must not be null and must be unique across the column. A multicolumn primary key is called a composite primary key. The only function that a primary key performs is to uniquely identify a row; thus, if one column is used, it is just as good as if multiple columns are used. Multiple columns (composite keys) are used only when the system design requires a primary key that cannot be contained in a single column.

Examples

Primary key as a column constraint: Create client_master where client_no is the primary key.
CREATE TABLE client_master
(client_no varchar2(6) PRIMARY KEY,
name varchar2(20),
address1 varchar2(30),
address2 varchar2(30),
city varchar2(15),
state varchar2(15),
pincode number(6),
remarks varchar2(60),
bal_due number(10,2));

Primary key as a table constraint: Create a sales_order_details table where

Column Name   Data Type  Size  Attributes
s_order_no    varchar2   6     Primary Key
product_no    varchar2   6     Primary Key
qty_ordered   number     8
qty_disp      number     8
product_rate  number     8,2

CREATE TABLE sales_order_details
(s_order_no varchar2(6),
product_no varchar2(6),
qty_ordered number(8),
qty_disp number(8),
product_rate number(8,2),
PRIMARY KEY (s_order_no, product_no));

Unique Key Concepts
A unique key is similar to a primary key, except that the purpose of a unique key is to ensure that the information in the column for each record is unique, as with telephone or driver's license numbers. A table may have many unique keys.

Example: Create table client_master with a unique constraint on column client_no.

UNIQUE as a column constraint:

CREATE TABLE client_master
(client_no varchar2(6) CONSTRAINT cnmn_ukey UNIQUE,
name varchar2(20),
address1 varchar2(30),
address2 varchar2(30),
city varchar2(15),
state varchar2(15),
pincode number(6),
remarks varchar2(60),
bal_due number(10,2));

UNIQUE as a table constraint:

CREATE TABLE client_master
(client_no varchar2(6),
name varchar2(20),
address1 varchar2(30),
address2 varchar2(30),
city varchar2(15),
state varchar2(15),
pincode number(6),
remarks varchar2(60),
bal_due number(10,2),
CONSTRAINT cnmn_ukey UNIQUE (client_no));

Default Value Concepts
At the time of cell creation, a default value can be assigned to a cell. When the user loads a record with values and leaves this cell empty, the DBMS will automatically load this cell with the default value specified. The data type of the default value should match the data type of the column. You can use the default clause to specify any default value you want.
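The constraint kinds introduced so far (not null, primary key including composite keys, unique, and default) can all be exercised with SQLite through Python. This is an illustrative substitute for the Oracle syntax above: the varchar2/number types become TEXT/INTEGER, and a violated constraint surfaces as sqlite3.IntegrityError:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE client_master(
        client_no TEXT NOT NULL UNIQUE,
        name      TEXT NOT NULL,
        bal_due   REAL DEFAULT 0.0)
""")
con.execute("""
    CREATE TABLE sales_order_details(
        s_order_no  TEXT,
        product_no  TEXT,
        qty_ordered INTEGER,
        PRIMARY KEY (s_order_no, product_no))  -- composite primary key
""")

# NOT NULL: a record missing a mandatory value is rejected.
try:
    con.execute("INSERT INTO client_master (client_no, name) VALUES (NULL, 'Ivan')")
    null_rejected = False
except sqlite3.IntegrityError:
    null_rejected = True

# UNIQUE: a duplicate client_no is rejected.
con.execute("INSERT INTO client_master (client_no, name) VALUES ('C-1', 'Ivan')")
try:
    con.execute("INSERT INTO client_master (client_no, name) VALUES ('C-1', 'Asha')")
    dup_rejected = False
except sqlite3.IntegrityError:
    dup_rejected = True

# DEFAULT: an omitted bal_due is filled with the declared default.
default_bal = con.execute(
    "SELECT bal_due FROM client_master WHERE client_no = 'C-1'").fetchone()[0]

# Composite PRIMARY KEY: the pair must be unique, not each column alone.
con.execute("INSERT INTO sales_order_details VALUES ('O-1', 'P-1', 5)")
con.execute("INSERT INTO sales_order_details VALUES ('O-1', 'P-2', 3)")  # same order, new product: ok
try:
    con.execute("INSERT INTO sales_order_details VALUES ('O-1', 'P-1', 9)")
    pk_rejected = False
except sqlite3.IntegrityError:
    pk_rejected = True

print(null_rejected, dup_rejected, default_bal, pk_rejected)
```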
Create the sales_order table where:

Column Name   Data Type  Size  Attributes
s_order_no    varchar2   6     Primary Key
s_order_date  date
client_no     varchar2   6
dely_addr     varchar2   25
salesman_no   varchar2   6
dely_type     char       1     Delivery: part (P) / full (F); Default 'F'
billed_yn     char       1
dely_date     date
order_status  varchar2   10

CREATE TABLE sales_order
(s_order_no varchar2(6) PRIMARY KEY,
s_order_date date,
client_no varchar2(6),
dely_addr varchar2(25),
salesman_no varchar2(6),
dely_type char(1) DEFAULT 'F',
billed_yn char(1),
dely_date date,
order_status varchar2(10));

Foreign Key Concepts
Foreign keys represent relationships between tables. A foreign key is a column (or a group of columns) whose values are derived from the primary key of the same or some other table. The existence of a foreign key implies that the table with the foreign key is related to the primary key table from which the foreign key is derived. A foreign key must have a corresponding primary key value in the primary key table to have a meaning. For example, the s_order_no column is the primary key of table sales_order. In table sales_order_details, s_order_no is a foreign key that references the s_order_no values in table sales_order.
The foreign key references constraint:
rejects an INSERT or UPDATE of a value if a corresponding value does not currently exist in the primary key table;
rejects a DELETE if it would invalidate a REFERENCES constraint;
must reference a PRIMARY KEY or UNIQUE column(s) in the primary key table;
will reference the PRIMARY KEY of the primary key table if no column or group of columns is specified in the constraint;
must reference a table, not a view or cluster;
requires that you own the primary key table, have REFERENCES privilege on it, or have column-level REFERENCES privilege on the referenced columns in the primary key table;
does not restrict how other constraints may reference the same tables;
requires that the FOREIGN KEY column(s) and the CONSTRAINT column(s) have matching data types;
may reference the same table named in the CREATE TABLE statement;
must not reference the same column more than once (in a single constraint).

Example: Create table sales_order_details with primary key as s_order_no and product_no, and foreign key as s_order_no referencing column s_order_no in the sales_order table.

FOREIGN KEY as a column constraint:

CREATE TABLE sales_order_details
(s_order_no varchar2(6) REFERENCES sales_order,
product_no varchar2(6),
qty_ordered number(8),
qty_disp number(8),
product_rate number(8,2),
PRIMARY KEY (s_order_no, product_no));
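The first two rules of the references constraint, rejecting an insert with no matching parent value and rejecting a delete that would strand child rows, can be demonstrated in SQLite. One caveat of this illustrative substitute: SQLite only enforces foreign keys after PRAGMA foreign_keys is switched on for the connection:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")  # SQLite-specific: enable FK enforcement

con.execute("CREATE TABLE sales_order(s_order_no TEXT PRIMARY KEY)")
con.execute("""
    CREATE TABLE sales_order_details(
        s_order_no TEXT REFERENCES sales_order,  -- column-level foreign key
        product_no TEXT,
        PRIMARY KEY (s_order_no, product_no))
""")
con.execute("INSERT INTO sales_order VALUES ('O-1')")
con.execute("INSERT INTO sales_order_details VALUES ('O-1', 'P-1')")  # parent exists: ok

# An insert whose parent value does not exist is rejected.
try:
    con.execute("INSERT INTO sales_order_details VALUES ('O-9', 'P-2')")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True

# A delete that would invalidate the references constraint is rejected too.
try:
    con.execute("DELETE FROM sales_order WHERE s_order_no = 'O-1'")
    delete_rejected = False
except sqlite3.IntegrityError:
    delete_rejected = True

print(orphan_rejected, delete_rejected)
```

Because no column list follows REFERENCES sales_order, the constraint references that table's primary key, matching the rule stated above.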

FOREIGN KEY as a table constraint:

CREATE TABLE sales_order_details
(s_order_no varchar2(6),
product_no varchar2(6),
qty_ordered number(8),
qty_disp number(8),
product_rate number(8,2),
PRIMARY KEY (s_order_no, product_no),
FOREIGN KEY (s_order_no) REFERENCES sales_order);

Check Integrity Constraints
Use the check constraint when you need to enforce integrity rules that can be evaluated based on a logical expression. Never use check constraints if the constraint can be defined using the not null, primary key, or foreign key constraint. Following are a few examples of appropriate CHECK constraints:
a check constraint on the client_no column of client_master so that every client_no value starts with 'C';
a check constraint on the name column of client_master so that the name is entered in upper case;
a check constraint on the city column of client_master so that only the cities Bombay, New Delhi, Madras, and Calcutta are allowed.

CREATE TABLE client_master
(client_no varchar2(6) CONSTRAINT ck_clientno CHECK (client_no LIKE 'C%'),
name varchar2(20) CONSTRAINT ck_cname CHECK (name = upper(name)),
address1 varchar2(30),
address2 varchar2(30),
city varchar2(15) CONSTRAINT ck_city CHECK (city IN ('New Delhi', 'Bombay', 'Calcutta', 'Madras')),
state varchar2(15),
pincode number(6),
remarks varchar2(60),
bal_due number(10,2));

When using CHECK constraints, consider the ANSI/ISO standard, which states that a CHECK constraint is violated only if the condition evaluates to false; true and unknown values do not violate a check condition. Therefore, make sure that a CHECK constraint you define actually enforces the rule you need to enforce. For example, consider the following CHECK constraint for the emp table:

CHECK (sal > 0 or comm >= 0)

At first glance, this rule may be interpreted as "do not allow a row in the emp table unless the employee's salary is greater than 0 or the employee's commission is greater than or equal to 0."
However, note that if a row is inserted with a null salary and a negative commission, the row does not violate the CHECK constraint, because the entire check condition is evaluated as unknown. In this particular case, you can account for such violations by placing a not null integrity constraint on both the sal and comm columns.

Check with Not Null Integrity Constraints
According to the ANSI/ISO standard, a not null integrity constraint is an example of a CHECK integrity constraint, where the condition is CHECK (column_name IS NOT NULL). Therefore, the not null integrity constraint for a single column can, in practice, be written in two forms: by using the not null constraint or by using a CHECK constraint. For ease of use, you should always choose to define the not null integrity constraint instead of a CHECK constraint with the IS NOT NULL condition.

Here we have looked at the method via which data constraints can be attached to a cell so that data validation can be done at the table level itself, using the power of the DBMS. A constraint clause restricts the range of valid values for one column (a column constraint) or for a group of columns (a table constraint). Any INSERT, UPDATE, or DELETE statement evaluates a relevant constraint; the constraint must be satisfied for the statement to succeed. Constraints can be connected to a table by the CREATE TABLE or ALTER TABLE command. Use ALTER TABLE to add or drop a constraint from a table. Constraints are recorded in the data dictionary. If you don't name a constraint, it is assigned the name SYS_Cn, where n is an integer that makes the name unique in the database.

Restrictions on Check Constraints
A check integrity constraint requires that a condition be true or unknown for every row of the table. If a statement causes the condition to evaluate to false, the statement is rolled back.
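The "true or unknown" rule just stated, including the null-salary pitfall described above, can be observed in SQLite, whose CHECK semantics match the ANSI/ISO behavior: only a condition that evaluates to false rejects the row. This is an illustrative sketch, not Oracle syntax:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE emp(
        sal  INTEGER,
        comm INTEGER,
        CHECK (sal > 0 OR comm >= 0))
""")

# true OR unknown -> true: the row is accepted.
con.execute("INSERT INTO emp VALUES (1000, NULL)")

# unknown OR false -> unknown: NOT a violation, so this row slips in
# even though its commission is negative -- the pitfall from the text.
con.execute("INSERT INTO emp VALUES (NULL, -5)")

# false OR false -> false: the statement is rejected.
try:
    con.execute("INSERT INTO emp VALUES (0, -1)")
    check_rejected = False
except sqlite3.IntegrityError:
    check_rejected = True

n_rows = con.execute("SELECT COUNT(*) FROM emp").fetchone()[0]
print(n_rows, check_rejected)
```

Adding NOT NULL to sal and comm, as the text recommends, would close the loophole by making the unknown case impossible.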
The condition of a CHECK constraint has the following limitations:
The condition must be a Boolean expression that can be evaluated using the values in the row being inserted or updated.
The condition cannot contain subqueries or sequences.
The condition cannot include the SYSDATE, UID, USER, or USERENV SQL functions.

Defining Different Constraints on the Table
Create a sales_order_details table where

Column Name   Data Type  Size  Attributes
s_order_no    varchar2   6     Primary Key; Foreign Key references s_order_no of the sales_order table
product_no    varchar2   6     Primary Key; Foreign Key references product_no of the product_master table
qty_ordered   number     8     not null
qty_disp      number     8
product_rate  number     8,2   not null

CREATE TABLE sales_order_details
(s_order_no varchar2(6) CONSTRAINT order_fkey REFERENCES sales_order,
product_no varchar2(6) CONSTRAINT product_fkey REFERENCES product_master,
qty_ordered number(8) NOT NULL,

qty_disp number(8),
product_rate number(8,2) NOT NULL,
PRIMARY KEY (s_order_no, product_no));

Defining Integrity Constraints in the ALTER TABLE Command
You can also define integrity constraints using the constraint clause in the ALTER TABLE command. The following examples show the definitions of several integrity constraints.

1. Add a primary key constraint on column supplier_no in table supplier_master:

ALTER TABLE supplier_master
ADD PRIMARY KEY (supplier_no);

2. Add a foreign key constraint on column s_order_no in table sales_order_details referencing table sales_order, and modify column qty_ordered to include a not null constraint:

ALTER TABLE sales_order_details
ADD CONSTRAINT order_fkey FOREIGN KEY (s_order_no) REFERENCES sales_order
MODIFY (qty_ordered number(8) NOT NULL);

Dropping Integrity Constraints in the ALTER TABLE Command
You can drop an integrity constraint if the rule that it enforces is no longer true or if the constraint is no longer needed. Drop the constraint using the ALTER TABLE command with the DROP clause. The following examples illustrate the dropping of integrity constraints.

1. Drop the primary key constraint from supplier_master:

ALTER TABLE supplier_master
DROP PRIMARY KEY;

2. Drop the foreign key constraint on column product_no in table sales_order_details:

ALTER TABLE sales_order_details
DROP CONSTRAINT product_fkey;

Note: Dropping unique and primary key constraints drops the associated indexes.

Points to Ponder
Integrity constraints ensure that changes made to the database by authorized users do not result in a loss of data consistency.
If a constraint is defined along with the column definition, it is called a column level constraint.
If the data constraint attached to a specific cell in a table references the contents of another cell in the table, then the user will have to use table level constraints.
While creating tables, if a row lacks a data value for a particular column, that value is said to be null.
A primary key is one or more columns in a table used to uniquely identify each row.
A unique key is similar to a primary key, except that the purpose of a unique key is to ensure that the information in the column for each record is unique.
When the user loads a record with values and leaves a cell empty, the DBMS will automatically load that cell with the default value specified.
Foreign keys represent relationships between tables. A foreign key is a column (or a group of columns) whose values are derived from the primary key of the same or some other table.
Use the CHECK constraint when you need to enforce integrity rules that can be evaluated based on a logical expression.
PL/SQL is Oracle's procedural language extension to SQL. PL/SQL enables you to mix SQL statements with procedural constructs.
Procedures, functions, and packages are all examples of PL/SQL program units.
Database triggers are procedures that are stored in the database and are implicitly executed (fired) when the contents of a table are changed.

Review Terms
Integrity constraints
Table level constraints
Column level constraints
Check constraints
NULL constraints
Primary key
Foreign key
Unique key

Students Activity
1. What is a constraint? How many kinds of constraints are there?
2. Define primary key, unique key, and foreign key.
3. Define check integrity constraints.

4. Differentiate between table level and column level constraints.
5. Define the NOT NULL constraint with the help of an example.

Student Notes

LESSON 24: LAB

LESSON 25: LAB

LESSON 26: PL/SQL

Lesson Objectives
Difference between SQL and PL/SQL
Stored procedures
Packages
Functions

PL/SQL is Oracle's procedural language extension to SQL. PL/SQL enables you to mix SQL statements with procedural constructs. With PL/SQL, you can define and execute PL/SQL program units such as procedures, functions, and packages. PL/SQL program units generally are categorized as anonymous blocks and stored procedures.

An anonymous block is a PL/SQL block that appears within your application and is not named or stored in the database. In many applications, PL/SQL blocks can appear wherever SQL statements can appear.

A stored procedure is a PL/SQL block that Oracle stores in the database and that can be called by name from an application. When you create a stored procedure, Oracle parses the procedure and stores its parsed representation in the database. Oracle also allows you to create and store functions (which are similar to procedures) and packages (which are groups of procedures and functions).

An Introduction to Stored Procedures and Packages
Oracle allows you to access and manipulate database information using procedural schema objects called PL/SQL program units. Procedures, functions, and packages are all examples of PL/SQL program units.

PL/SQL is Oracle's procedural language extension to SQL. It extends SQL with flow control and other statements that make it possible to write complex programs in it. The PL/SQL engine is the tool you use to define, compile, and execute PL/SQL program units. This engine is a special component of many Oracle products, including Oracle Server. While many Oracle products have PL/SQL components, this chapter specifically covers the procedures and packages that can be stored in an Oracle database and processed using the Oracle Server PL/SQL engine. The PL/SQL capabilities of each Oracle tool are described in the appropriate tool's documentation.
Stored Procedures and Functions
Procedures and functions are schema objects that logically group a set of SQL and other PL/SQL programming language statements together to perform a specific task. Procedures and functions are created in a user's schema and stored in a database for continued use. You can execute a procedure or function interactively using an Oracle tool, such as SQL*Plus, or call it explicitly in the code of a database application, such as an Oracle Forms or Precompiler application, or in the code of another procedure or trigger.

Packages
A package is a group of related procedures and functions, together with the cursors and variables they use, stored together in the database for continued use as a unit. Similar to standalone procedures and functions, packaged procedures and functions can be called explicitly by applications or users.

Procedures and Functions
A procedure or function is a schema object that consists of a set of SQL statements and other PL/SQL constructs, grouped together, stored in the database, and executed as a unit to solve a specific problem or perform a set of related tasks. Procedures and functions permit the caller to provide parameters that can be input only, output only, or input and output values. Procedures and functions allow you to combine the ease and flexibility of SQL with the procedural functionality of a structured programming language. For example, the following statement creates the Credit_Account procedure, which credits money to a bank account:

CREATE PROCEDURE credit_account
  (acct NUMBER, credit NUMBER) AS
/* This procedure accepts two arguments: an account
   number and an amount of money to credit to the
   specified account. If the specified account does
   not exist, a new account is created. */
*/
  old_balance NUMBER;
  new_balance NUMBER;
BEGIN
  SELECT balance INTO old_balance FROM accounts
    WHERE acct_id = acct
    FOR UPDATE OF balance;
  new_balance := old_balance + credit;
  UPDATE accounts SET balance = new_balance
    WHERE acct_id = acct;
  COMMIT;
EXCEPTION
  WHEN NO_DATA_FOUND THEN
    INSERT INTO accounts (acct_id, balance)
      VALUES (acct, credit);
  WHEN OTHERS THEN
    ROLLBACK;
END credit_account;

Notice that the Credit_Account procedure includes both SQL and PL/SQL statements.
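Once created, a stored procedure such as Credit_Account can be invoked by name, as the text above describes. As a brief sketch (the account number and amount are illustrative, not from the lesson):

```sql
-- From SQL*Plus, interactively:
EXECUTE credit_account(7010, 500);

-- From another PL/SQL block, procedure, or trigger:
BEGIN
  credit_account(acct => 7010, credit => 500);
END;
/
```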

Procedure Guidelines
Use the following guidelines to design and use all stored procedures:
Define procedures to complete a single, focused task. Do not define long procedures with several distinct subtasks, because subtasks common to many procedures might be duplicated unnecessarily in the code of several procedures.
Do not define procedures that duplicate the functionality already provided by other features of Oracle. For example, do not define procedures to enforce simple data integrity rules that you could easily enforce using declarative integrity constraints.

Benefits of Procedures
Procedures provide advantages in the following areas.

Security
Stored procedures can help enforce data security. You can restrict the database operations that users can perform by allowing them to access data only through procedures and functions. For example, you can grant users access to a procedure that updates a table, but not grant them access to the table itself. When a user invokes the procedure, the procedure executes with the privileges of the procedure's owner. Users who have only the privilege to execute the procedure (but not the privileges to query, update, or delete from the underlying tables) can invoke the procedure, but they cannot manipulate table data in any other way.

Performance
Stored procedures can improve database performance in several ways:
The amount of information that must be sent over a network is small compared with issuing individual SQL statements or sending the text of an entire PL/SQL block to Oracle, because the information is sent only once and thereafter invoked when it is used.
A procedure's compiled form is readily available in the database, so no compilation is required at execution time.
If the procedure is already present in the shared pool of the SGA, retrieval from disk is not required, and execution can begin immediately.
Memory Allocation
Because stored procedures take advantage of the shared memory capabilities of Oracle, only a single copy of the procedure needs to be loaded into memory for execution by multiple users. Sharing the same code among many users results in a substantial reduction in Oracle memory requirements for applications.

Productivity
Stored procedures increase development productivity. By designing applications around a common set of procedures, you can avoid redundant coding and increase your productivity. For example, procedures can be written to insert, update, or delete rows from the EMP table. These procedures can then be called by any application without rewriting the SQL statements necessary to accomplish these tasks. If the methods of data management change, only the procedures need to be modified, not all of the applications that use them.

Integrity
Stored procedures improve the integrity and consistency of your applications. By developing all of your applications around a common group of procedures, you can reduce the likelihood of committing coding errors. For example, you can test a procedure or function to guarantee that it returns an accurate result and, once it is verified, reuse it in any number of applications without testing it again. If the data structures referenced by the procedure are altered in any way, only the procedure needs to be recompiled; applications that call the procedure do not necessarily require any modifications.

Anonymous PL/SQL Blocks vs. Stored Procedures
A stored procedure is created and stored in the database as a schema object. Once created and compiled, it is a named object that can be executed without recompiling. Additionally, dependency information is stored in the data dictionary to guarantee the validity of each stored procedure. As an alternative to a stored procedure, you can create an anonymous PL/SQL block by sending an unnamed PL/SQL block to the Oracle Server from an Oracle tool or an application.
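For contrast with the stored Credit_Account procedure, the same crediting logic could be sent to the server as an anonymous block: it has no name and is not stored in the database. A minimal sketch, reusing the ACCOUNTS table from the earlier example (the account number and amount are illustrative):

```sql
DECLARE
  acct   NUMBER := 7010;  -- illustrative account number
  credit NUMBER := 500;   -- illustrative amount
BEGIN
  -- Same update as in credit_account, but compiled anew each time
  -- the block is submitted (unless found in the shared pool).
  UPDATE accounts SET balance = balance + credit
    WHERE acct_id = acct;
  COMMIT;
END;
/
```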
Oracle compiles the PL/SQL block and places the compiled version in the shared pool of the SGA, but does not store the source code or compiled version in the database for reuse beyond the current instance. Shared SQL allows anonymous PL/SQL blocks in the shared pool to be reused and shared until they are flushed out of the shared pool. In either case, by moving PL/SQL blocks out of a database application and into database procedures stored either in the database or in memory, you avoid unnecessary procedure recompilations by Oracle at runtime, improving the overall performance of the application and Oracle.

External Procedures
A PL/SQL procedure executing on an Oracle8 Server can call an external procedure or function that is written in the C programming language and stored in a shared library. The C routine executes in a separate address space from that of the Oracle Server.

Packages
Packages encapsulate related procedures, functions, and associated cursors and variables together as a unit in the database. You create a package in two parts: the specification and the body. A package's specification declares all public constructs of the package, and the body defines all constructs (public and private) of the package. This separation of the two parts provides the following advantages:
The developer has more flexibility in the development cycle. You can create specifications and reference public procedures without actually creating the package body.
You can alter procedure bodies contained within the package body separately from their publicly declared specifications in the package specification. As long as the procedure specification does not change, objects that reference the

altered procedures of the package are never marked invalid; that is, they are never marked as needing recompilation. The following example creates the specification and body for a package that contains several procedures and functions that process banking transactions.

CREATE PACKAGE bank_transactions AS
  minimum_balance CONSTANT NUMBER := 100.00;
  PROCEDURE apply_transactions;
  PROCEDURE enter_transaction (acct NUMBER, kind CHAR, amount NUMBER);
END bank_transactions;

CREATE PACKAGE BODY bank_transactions AS
/* Package to input bank transactions */
  new_status CHAR(20);  /* Global variable to record status of
                           transaction being applied. Used for update
                           in Apply_Transactions. */

  PROCEDURE do_journal_entry (acct NUMBER, kind CHAR) IS
  /* Records a journal entry for each bank transaction applied
     by the Apply_Transactions procedure. */
  BEGIN
    INSERT INTO journal VALUES (acct, kind, sysdate);
    IF kind = 'D' THEN
      new_status := 'Debit applied';
    ELSIF kind = 'C' THEN
      new_status := 'Credit applied';
    ELSE
      new_status := 'New account';
    END IF;
  END do_journal_entry;

  PROCEDURE credit_account (acct NUMBER, credit NUMBER) IS
  /* Credits a bank account the specified amount. If the account
     does not exist, the procedure creates a new account first. */
    old_balance NUMBER;
    new_balance NUMBER;
  BEGIN
    SELECT balance INTO old_balance FROM accounts
      WHERE acct_id = acct
      FOR UPDATE OF balance;  /* Locks account for credit update */
    new_balance := old_balance + credit;
    UPDATE accounts SET balance = new_balance
      WHERE acct_id = acct;
    do_journal_entry(acct, 'C');
  EXCEPTION
    WHEN NO_DATA_FOUND THEN  /* Create new account if not found */
      INSERT INTO accounts (acct_id, balance)
        VALUES (acct, credit);
      do_journal_entry(acct, 'N');
    WHEN OTHERS THEN  /* Return other errors to application */
      new_status := 'Error: ' || SQLERRM(SQLCODE);
  END credit_account;

  PROCEDURE debit_account (acct NUMBER, debit NUMBER) IS
  /* Debits an existing account if result is greater than the
     allowed minimum balance.
  */
    old_balance NUMBER;
    new_balance NUMBER;
    insufficient_funds EXCEPTION;
  BEGIN
    SELECT balance INTO old_balance FROM accounts
      WHERE acct_id = acct
      FOR UPDATE OF balance;
    new_balance := old_balance - debit;
    IF new_balance >= minimum_balance THEN
      UPDATE accounts SET balance = new_balance
        WHERE acct_id = acct;
      do_journal_entry(acct, 'D');
    ELSE
      RAISE insufficient_funds;
    END IF;
  EXCEPTION
    WHEN NO_DATA_FOUND THEN
      new_status := 'Nonexistent account';
    WHEN insufficient_funds THEN
      new_status := 'Insufficient funds';
    WHEN OTHERS THEN  /* Returns other errors to application */
      new_status := 'Error: ' || SQLERRM(SQLCODE);
  END debit_account;

  PROCEDURE apply_transactions IS
  /* Applies pending transactions in the table TRANSACTIONS to the
     ACCOUNTS table. Used at regular intervals to update bank
     accounts without interfering with input of new transactions. */

    /* Cursor fetches and locks all rows from the TRANSACTIONS
       table with a status of 'Pending'. Locks released after all
       pending transactions have been applied. */
    CURSOR trans_cursor IS
      SELECT acct_id, kind, amount FROM transactions
        WHERE status = 'Pending'
        ORDER BY time_tag
        FOR UPDATE OF status;
  BEGIN
    FOR trans IN trans_cursor LOOP  /* implicit open and fetch */
      IF trans.kind = 'D' THEN
        debit_account(trans.acct_id, trans.amount);
      ELSIF trans.kind = 'C' THEN
        credit_account(trans.acct_id, trans.amount);
      ELSE
        new_status := 'Rejected';
      END IF;
      /* Update TRANSACTIONS table to return result of applying
         this transaction. */
      UPDATE transactions SET status = new_status
        WHERE CURRENT OF trans_cursor;
    END LOOP;
    COMMIT;  /* Release row locks in TRANSACTIONS table. */
  END apply_transactions;

  PROCEDURE enter_transaction (acct NUMBER, kind CHAR, amount NUMBER) IS
  /* Enters a bank transaction into the TRANSACTIONS table. A new
     transaction is always put into this queue before being

     applied to the specified account by the Apply_Transactions
     procedure. Therefore, many transactions can be simultaneously
     input without interference. */
  BEGIN
    INSERT INTO transactions
      VALUES (acct, kind, amount, 'Pending', sysdate);
    COMMIT;
  END enter_transaction;
END bank_transactions;

Packages allow the database administrator or application developer to organize similar routines. They also offer increased functionality and database performance.

Benefits of Packages
Packages are used to define related procedures, variables, and cursors and are often implemented to provide advantages in the following areas:
encapsulation of related procedures and variables
declaration of public and private procedures, variables, constants, and cursors
better performance

Encapsulation
Stored packages allow you to encapsulate (group) related stored procedures, variables, datatypes, and so forth in a single named, stored unit in the database. This provides for better organization during the development process. Encapsulation of procedural constructs in a package also makes privilege management easier. Granting the privilege to use a package makes all constructs of the package accessible to the grantee.

Public and Private Data and Procedures
The methods of package definition allow you to specify which variables, cursors, and procedures are public (directly accessible to the user of a package) or private (hidden from the user of a package). For example, a package might contain ten procedures. You can define the package so that only three procedures are public and therefore available for execution by a user of the package; the remainder of the procedures are private and can only be accessed by the procedures within the package.

Performance Improvement
An entire package is loaded into memory when a procedure within the package is called for the first time. This load is completed in one operation, as opposed to the separate loads required for standalone procedures. Therefore, when calls to related packaged procedures occur, no disk I/O is necessary to execute the compiled code already in memory. A package body can be replaced and recompiled without affecting the specification. As a result, objects that reference a package's constructs (always via the specification) need not be recompiled unless the package specification is also replaced. By using packages, unnecessary recompilations can be minimized, resulting in less impact on overall database performance.

How Oracle Stores Procedures and Packages
When you create a procedure or package, Oracle:
compiles the procedure or package
stores the compiled code in memory
stores the procedure or package in the database

Points to Ponder
PL/SQL enables you to mix SQL statements with procedural constructs.
Procedures and functions are schema objects that logically group a set of SQL and other PL/SQL programming language statements together to perform a specific task.
A package is a group of related procedures and functions.

Review Terms
Difference between SQL & PL/SQL
Stored procedures
Functions
Packages

Students Activity
1. Differentiate between SQL & PL/SQL?
2. Define stored procedures & functions?
3. Define packages?
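A minimal sketch of the public/private distinction discussed in this lesson (the package, table, and column names here are illustrative, not from the lesson): only the procedure declared in the specification is callable from outside the package.

```sql
CREATE PACKAGE payroll AS
  PROCEDURE give_raise (emp_id NUMBER, pct NUMBER);  -- public
END payroll;
/
CREATE PACKAGE BODY payroll AS
  /* Private: declared only in the body, so callable only from
     other subprograms inside this package. */
  PROCEDURE log_raise (emp_id NUMBER, pct NUMBER) IS
  BEGIN
    INSERT INTO raise_log VALUES (emp_id, pct, sysdate);
  END log_raise;

  /* Public: its specification appears in the package spec. */
  PROCEDURE give_raise (emp_id NUMBER, pct NUMBER) IS
  BEGIN
    UPDATE emp SET sal = sal * (1 + pct / 100)
      WHERE empno = emp_id;
    log_raise(emp_id, pct);
  END give_raise;
END payroll;
/
```

A caller may execute payroll.give_raise, but a call to payroll.log_raise from outside the package fails, because log_raise is not declared in the specification.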

Student Notes

LESSON 27: LAB

LESSON 28: LAB

LESSON 29: DATABASE TRIGGERS

Lesson Objectives
Database triggers
Parts of triggers
Types of triggers
Executing triggers

Database triggers are procedures that are stored in the database and are implicitly executed (fired) when the contents of a table are changed.

Introduction
Oracle allows the user to define procedures that are implicitly executed (i.e., executed by Oracle itself) when an insert, update, or delete is issued against a table from SQL*Plus or through an application. These procedures are called database triggers. The major point that makes triggers stand apart is that they are fired implicitly (i.e., internally) by Oracle itself and not explicitly called by the user, as is done with normal procedures. A trigger can include SQL and PL/SQL statements to execute as a unit and can invoke stored procedures. However, procedures and triggers differ in the way that they are invoked. A procedure is explicitly executed by a user, application, or trigger. Triggers (one or more) are implicitly fired (executed) by Oracle when a triggering INSERT, UPDATE, or DELETE statement is issued, no matter which user is connected or which application is being used.

Use of Database Triggers
Triggers can supplement the standard capabilities of Oracle to provide a highly customized database management system. For example, a trigger can restrict DML operations against a table to those issued during regular business hours. A trigger could also restrict DML operations to occur only at certain times during weekdays.
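The business-hours restriction just mentioned might be written, as a sketch, with a statement trigger that raises an application error outside working hours (the table name and error number are illustrative, not from the lesson):

```sql
CREATE TRIGGER business_hours_only
  BEFORE INSERT OR UPDATE OR DELETE ON orders
BEGIN
  -- Reject the triggering statement outside 09:00-17:59 or on
  -- weekends; RAISE_APPLICATION_ERROR rolls the statement back.
  IF TO_CHAR(SYSDATE, 'HH24') NOT BETWEEN '09' AND '17'
     OR TO_CHAR(SYSDATE, 'DY') IN ('SAT', 'SUN') THEN
    RAISE_APPLICATION_ERROR(-20500,
      'Table may be modified only during business hours on weekdays.');
  END IF;
END;
/
```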
Other uses for triggers are to:
automatically generate derived column values
prevent invalid transactions
enforce complex security authorizations
enforce referential integrity across nodes in a distributed database
enforce complex business rules
provide transparent event logging
provide sophisticated auditing
maintain synchronous table replicates
gather statistics on table access

Parts of a Trigger
A trigger has three basic parts:
a triggering event or statement
a trigger restriction
a trigger action

Triggering Event or Statement
A triggering event or statement is the SQL statement that causes a trigger to be fired. A triggering event can be an INSERT, UPDATE, or DELETE statement on a table. For example:
... UPDATE OF parts_on_hand ON inventory ...
which means: when the Parts_On_Hand column of a row in the Inventory table is updated, fire the trigger. Note that when the triggering event is an UPDATE statement, you can include a column list to identify which columns must be updated to fire the trigger. You cannot specify a column list for INSERT and DELETE statements, because they affect entire rows of information. A triggering event can specify multiple DML statements, as in:
... INSERT OR UPDATE OR DELETE OF inventory ...
which means: when an INSERT, UPDATE, or DELETE statement is issued against the Inventory table, fire the trigger. When multiple types of DML statements can fire a trigger, you can use conditional predicates to detect the type of triggering statement. In this way, you can create a single trigger that executes different code based on the type of statement that fires the trigger.

Trigger Restriction
A trigger restriction specifies a Boolean (logical) expression that must be TRUE for the trigger to fire. The trigger action is not executed if the trigger restriction evaluates to FALSE or UNKNOWN.
In the example, the trigger restriction is:
new.parts_on_hand < new.reorder_point

Trigger Action
A trigger action is the procedure (PL/SQL block) that contains the SQL statements and PL/SQL code to be executed when a triggering statement is issued and the trigger restriction evaluates to TRUE. Like stored procedures, a trigger action can contain SQL and PL/SQL statements, define PL/SQL language constructs (variables, constants, cursors, exceptions, and so on), and call stored procedures. Additionally, for row triggers (described in the next section), the statements in a trigger action have access to column values (new and old) of the current row being processed by the trigger. Two correlation names provide access to the old and new values for each column.

Types of Triggers
This section describes the different types of triggers: row and statement triggers, and Before, After, and Instead-of triggers.

Row vs. Statement Triggers
When you define a trigger, you can specify the number of times the trigger action is to be executed: once for every row affected by the triggering statement (such as might be fired by an UPDATE statement that updates many rows), or once for the triggering statement, no matter how many rows it affects.
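The three parts described above (event, restriction, action) can be combined in one sketch, here as a row-level trigger on the Inventory table from the earlier fragments. The PENDING_ORDERS table and its columns are illustrative assumptions, not from the lesson:

```sql
CREATE TRIGGER reorder_trigger
  AFTER UPDATE OF parts_on_hand ON inventory      -- triggering event
  FOR EACH ROW
  WHEN (new.parts_on_hand < new.reorder_point)    -- trigger restriction
BEGIN                                             -- trigger action
  -- :new gives the row's column values after the update.
  INSERT INTO pending_orders (part_no, order_qty, order_date)
    VALUES (:new.part_no, :new.reorder_qty, sysdate);
END;
/
```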

Row Triggers
A row trigger is fired once for each row of the table affected by the triggering statement. For example, if an UPDATE statement updates multiple rows of a table, a row trigger is fired once for each row affected by the UPDATE statement. If a triggering statement affects no rows, a row trigger is not executed at all. Row triggers are useful if the code in the trigger action depends on data provided by the triggering statement or on the rows that are affected.

Statement Triggers
A statement trigger is fired once on behalf of the triggering statement, regardless of the number of rows in the table that the triggering statement affects (even if no rows are affected). For example, if a DELETE statement deletes several rows from a table, a statement-level DELETE trigger is fired only once, regardless of how many rows are deleted from the table. Statement triggers are useful if the code in the trigger action does not depend on the data provided by the triggering statement or the rows affected. For example, if a trigger makes a complex security check on the current time or user, or if a trigger generates a single audit record based on the type of triggering statement, a statement trigger is used.

Before vs. After Triggers
When defining a trigger, you can specify the trigger timing: whether the trigger action is to be executed before or after the triggering statement. BEFORE and AFTER apply to both statement and row triggers.

Before Triggers
BEFORE triggers execute the trigger action before the triggering statement is executed. This type of trigger is commonly used in the following situations:
When the trigger action should determine whether the triggering statement should be allowed to complete. Using a BEFORE trigger for this purpose, you can eliminate unnecessary processing of the triggering statement and its eventual rollback in cases where an exception is raised in the trigger action.
To derive specific column values before completing a triggering INSERT or UPDATE statement.
After Triggers
AFTER triggers execute the trigger action after the triggering statement is executed. AFTER triggers are used in the following situations:
When you want the triggering statement to complete before executing the trigger action.
If a BEFORE trigger is already present, an AFTER trigger can perform different actions on the same triggering statement.

Combinations
Using the options listed above, you can create four types of triggers:
BEFORE statement trigger: Before executing the triggering statement, the trigger action is executed.
BEFORE row trigger: Before modifying each row affected by the triggering statement and before checking appropriate integrity constraints, the trigger action is executed, provided that the trigger restriction was not violated.
AFTER statement trigger: After executing the triggering statement and applying any deferred integrity constraints, the trigger action is executed.
AFTER row trigger: After modifying each row affected by the triggering statement and possibly applying appropriate integrity constraints, the trigger action is executed for the current row, provided the trigger restriction was not violated. Unlike BEFORE row triggers, AFTER row triggers lock rows.
You can have multiple triggers of the same type for the same statement for any given table. For example, you may have two BEFORE statement triggers for UPDATE statements on the EMP table. Multiple triggers of the same type permit modular installation of applications that have triggers on the same tables. Also, Oracle snapshot logs use AFTER row triggers, so you can design your own AFTER row trigger in addition to the Oracle-defined AFTER row trigger. You can create as many triggers of the preceding different types as you need for each type of DML statement (INSERT, UPDATE, or DELETE). For example, suppose you have a table, SAL, and you want to know when the table is being accessed and the types of queries being issued.
The example below contains a sample package and trigger that tracks this information by hour and type of action (for example, UPDATE, DELETE, or INSERT) on table SAL. A global session variable, Stat.Rowcnt, is initialized to zero by a BEFORE statement trigger. Then it is increased each time the row trigger is executed. Finally, the statistical information is saved in the table Stat_Tab by the AFTER statement trigger.

Sample Package and Trigger for SAL Table
DROP TABLE stat_tab;
CREATE TABLE stat_tab (utype CHAR(8), rowcnt INTEGER, uhour INTEGER);

CREATE OR REPLACE PACKAGE stat IS
  rowcnt INTEGER;
END;
/
CREATE TRIGGER bt BEFORE UPDATE OR DELETE OR INSERT ON sal
BEGIN
  stat.rowcnt := 0;
END;
/
CREATE TRIGGER rt BEFORE UPDATE OR DELETE OR INSERT ON sal
FOR EACH ROW
BEGIN
  stat.rowcnt := stat.rowcnt + 1;
END;
/
CREATE TRIGGER at AFTER UPDATE OR DELETE OR INSERT ON sal
DECLARE
  typ  CHAR(8);
  hour NUMBER;
BEGIN
  IF updating

  THEN
    typ := 'update';
  END IF;
  IF deleting THEN
    typ := 'delete';
  END IF;
  IF inserting THEN
    typ := 'insert';
  END IF;
  hour := TRUNC((SYSDATE - TRUNC(SYSDATE)) * 24);
  UPDATE stat_tab SET rowcnt = rowcnt + stat.rowcnt
    WHERE utype = typ AND uhour = hour;
  IF SQL%ROWCOUNT = 0 THEN
    INSERT INTO stat_tab VALUES (typ, stat.rowcnt, hour);
  END IF;
EXCEPTION
  WHEN dup_val_on_index THEN
    UPDATE stat_tab SET rowcnt = rowcnt + stat.rowcnt
      WHERE utype = typ AND uhour = hour;
END;
/

Instead of Triggers
Instead-of triggers provide a transparent way of modifying views that cannot be modified directly through SQL DML statements (INSERT, UPDATE, and DELETE). These triggers are called instead-of triggers because, unlike other types of triggers, Oracle fires the trigger instead of executing the triggering statement. The trigger performs update, insert, or delete operations directly on the underlying tables. Users write normal INSERT, DELETE, and UPDATE statements against the view, and the INSTEAD OF trigger works invisibly in the background to make the right actions take place. By default, INSTEAD OF triggers are activated for each row.

Example of an Instead of Trigger
The following example shows an instead-of trigger for inserting rows into the Manager_Info view.
CREATE VIEW manager_info AS
  SELECT e.name, e.empno, d.dept_type, d.deptno, p.level, p.projno
  FROM emp e, dept d, project p
  WHERE e.empno = d.mgr_no
    AND d.deptno = p.resp_dept;

CREATE TRIGGER manager_info_insert
INSTEAD OF INSERT ON manager_info
REFERENCING NEW AS n  /* new manager information */
FOR EACH ROW
DECLARE
  cnt NUMBER;
BEGIN
  SELECT COUNT(*) INTO cnt FROM emp WHERE emp.empno = :n.empno;
  IF cnt = 0 THEN
    INSERT INTO emp VALUES (:n.empno, :n.name);
  ELSE
    UPDATE emp SET emp.name = :n.name
      WHERE emp.empno = :n.empno;
  END IF;
  SELECT COUNT(*) INTO cnt FROM dept WHERE dept.deptno = :n.deptno;
  IF cnt = 0 THEN
    INSERT INTO dept VALUES (:n.deptno, :n.dept_type);
  ELSE
    UPDATE dept SET dept.dept_type = :n.dept_type
      WHERE dept.deptno = :n.deptno;
  END IF;
  SELECT COUNT(*) INTO cnt FROM project WHERE project.projno = :n.projno;
  IF cnt = 0 THEN
    INSERT INTO project VALUES (:n.projno, :n.level);
  ELSE
    UPDATE project SET project.level = :n.level
      WHERE project.projno = :n.projno;
  END IF;
END;

The actions shown for rows being inserted into the Manager_Info view first test to see if appropriate rows already exist in the base tables from which Manager_Info is derived. The actions then insert new rows or update existing rows, as appropriate. Similar triggers can specify appropriate actions for UPDATE and DELETE.

Trigger Execution
A trigger can be in either of two distinct modes:
enabled
disabled
An enabled trigger executes its trigger action if a triggering statement is issued and the trigger restriction (if any) evaluates to TRUE. A disabled trigger does not execute its trigger action, even if a triggering statement is issued and the trigger restriction (if any) would evaluate to TRUE.
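Switching a trigger between the two modes is done with DDL statements. A brief sketch, using the trigger and table names from the SAL example earlier in this lesson:

```sql
-- Disable a single trigger:
ALTER TRIGGER bt DISABLE;
-- Re-enable it:
ALTER TRIGGER bt ENABLE;

-- Disable or enable every trigger defined on a table at once:
ALTER TABLE sal DISABLE ALL TRIGGERS;
ALTER TABLE sal ENABLE ALL TRIGGERS;
```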
For enabled triggers, Oracle automatically:
executes triggers of each type in a planned firing sequence when more than one trigger is fired by a single SQL statement
performs integrity constraint checking at a set point in time with respect to the different types of triggers, and guarantees that triggers cannot compromise integrity constraints
provides read-consistent views for queries and constraints
manages the dependencies among triggers and objects referenced in the code of the trigger action
uses two-phase commit if a trigger updates remote tables in a distributed database
fires multiple triggers in an unspecified order, if more than one trigger of the same type exists for a given statement

Points to Ponder
Procedures that the user defines and that Oracle executes implicitly are known as triggers.
Triggers can supplement the standard capabilities of Oracle to provide a highly customized database management system.
A trigger restriction specifies a Boolean (logical) expression that must be TRUE for the trigger to fire.

A trigger has three basic parts:
1. a triggering event or statement
2. a trigger restriction
3. a trigger action

Review Terms
Database triggers
Parts of triggers
Types of triggers
Executing triggers

Students Activity
1. Explain the use of database triggers? Explain their need?
2. What are the various parts of a database trigger?
3. Differentiate between before & after triggers?
4. Differentiate between row & statement triggers?
5. Define INSTEAD OF triggers with the help of an example?
6. Define various restrictions of a trigger?

Student Notes

LESSON 30: LAB

LESSON 31: LAB

LESSON 32: DATABASE CURSORS

Lesson Objectives
Defining cursor
Use of cursor
Explicit cursor
Implicit cursor
Parameterized cursor

Whenever an SQL statement is executed, the Oracle engine performs the following tasks:
Reserves an area in memory called the private SQL area.
Populates this area with the appropriate data.
Processes the data in the memory area.
Frees the memory area when the execution is complete.

What is a Cursor?
The Oracle engine uses a work area for its internal processing. This work area is private to SQL's operation and is called a cursor. The data that is stored in it is called the Active Data Set. The size of the cursor in memory is the size required to hold the number of rows in the Active Data Set.

Example
When a user fires a select statement such as:
SELECT empno, ename, job, salary FROM employee WHERE dept_no = 20;
the resultant data set is as follows:

Active Data Set
3456  IVAN     MANAGER  10000
3459  PRADEEP  ANALYST   7000
3446  MITA     PROGRMR   4000
3463  VIJAY    CLERK     2000
3450  ALDRIN   ACCTANT   3000

Contents of a Cursor
When a query returns multiple rows, in addition to the data held in the cursor, Oracle will also open and maintain a row pointer. Depending on user requests to view data, the row pointer will be relocated within the cursor's Active Data Set. Additionally, Oracle also maintains cursor variables loaded with the value of the total number of rows fetched from the Active Data Set.

Use of Cursors in PL/SQL
While SQL is the natural language of the DBA, it does not have any procedural capabilities such as condition checking, looping and branching. For this, Oracle provides PL/SQL. Programmers can use it to create programs for validation and manipulation of table data. PL/SQL adds to the power of SQL and provides the user with all the functionality of a programming environment. A PL/SQL block of code includes the procedural code for looping and branching along with the SQL statement.
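Every DML statement in a PL/SQL block runs in an implicit cursor that Oracle opens and closes automatically, and whose attributes can be checked afterwards. A minimal sketch using the EMPLOYEE table from the example above (SQL%FOUND and SQL%ROWCOUNT are standard implicit cursor attributes):

```sql
BEGIN
  UPDATE employee
     SET salary = salary * 1.05
   WHERE dept_no = 20;
  -- SQL%ROWCOUNT holds the number of rows affected by the most
  -- recent DML statement; SQL%FOUND is TRUE if at least one row
  -- was affected.
  IF SQL%FOUND THEN
    DBMS_OUTPUT.PUT_LINE(SQL%ROWCOUNT || ' salaries updated.');
  ELSE
    DBMS_OUTPUT.PUT_LINE('No employees in department 20.');
  END IF;
END;
/
```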
If records from a record set created using a select statement are to be evaluated and processed one at a time, then the only method available is by using explicit cursors. Cursors can also be used to evaluate the success of updates and deletes and the number of rows affected (implicit cursors).

Explicit Cursors
You can explicitly declare a cursor to process the rows individually. A cursor declared by the user is called an explicit cursor. For queries that return more than one row, you must declare a cursor explicitly.

Why Use an Explicit Cursor?
Cursors can be used when the user wants to process data one row at a time.

Example
Update an Acctmast table and set a value in its balance amount column depending upon whether the account has an amount debited or credited in the Accttran table. The records from the Accttran table will be fetched one at a time and updated in the Acctmast table depending upon whether the account is debited or credited.

PL/SQL raises an error if an embedded select statement retrieves more than one row. Such an error forces an abnormal termination of the PL/SQL block. Such an error can be eliminated by using a cursor.

Explicit Cursor Management
The steps involved in declaring a cursor and manipulating data in the active set are:
Declare a cursor that specifies the SQL select statement that you want to process.
Open the cursor.
Fetch data from the cursor one row at a time.
Close the cursor.

A cursor is defined in the declarative part of a PL/SQL block by naming it and specifying a query. Then, three commands are used to control the cursor: open, fetch and close. First, initialize the cursor with the open statement; this:
defines a private SQL area
executes the query associated with the cursor
populates the Active Data Set
sets the Active Data Set's row pointer to the first record
The fetch statement retrieves the current row and advances the cursor to the next row.
You can execute fetch repeatedly until all rows have been retrieved. When the last row has been processed, close the cursor with the close statement. This will release the memory occupied by the cursor and its data set.

Focus: The HRD manager has decided to raise the salary of all the employees in department no. 20 by 5% (0.05). Whenever any such raise is given to the employees, a record of the same is

maintained in the emp_raise table. It includes the employee number, the date when the raise was given and the actual raise. Write a PL/SQL block to update the salary of each employee and insert a record in the emp_raise table.

The table definitions are as follows:

Table name: employee
Column name  Data Type  Size  Attributes
emp_code     varchar    10    Primary key, via which we shall seek data in the table.
ename        varchar    20    The first name of the candidate.
deptno       number     5     The department number.
job          varchar    20    Employee job details.
salary       number     8,2   The current salary of the employee.

Table name: emp_raise
Column name  Data Type  Size  Attributes
emp_code     varchar    10    Part of a composite key via which we shall seek data in the table.
raise_date   date             The date on which the raise was given.
raise_amt    number     8,2   The raise given to the employee.

Emp_code and raise_date together form a composite primary key.

Declaring a Cursor
To do the above via a PL/SQL block, it is necessary to declare a cursor and associate it with a query before referencing it in any statement within the PL/SQL block. This is because forward references to objects are not allowed in PL/SQL.

Syntax:
CURSOR cursorname IS SQL statement;

Example:
DECLARE
  /* Declaration of the cursor named c_emp. The active data set will
     include the employee codes and salaries of all the employees
     belonging to department 20. */
  CURSOR c_emp IS
    SELECT emp_code, salary FROM employee WHERE deptno = 20;

The cursor name is not a PL/SQL variable; it is used only to reference the query. It cannot be assigned any values or be used in an expression.

Opening a Cursor
Opening the cursor executes the query and identifies the active set, which contains all the rows that meet the query's search criteria.
Syntax: OPEN cursorname;
Example:
DECLARE
  cursor c_emp is select emp_code, salary from employee where deptno = 20;
BEGIN
  /* Opening cursor c_emp */
  open c_emp;
END;
The open statement retrieves the records from the database and places them in the cursor (the private SQL area).

Fetching a Record from the Cursor
The fetch statement retrieves the rows from the active set into variables, one at a time. Each time a fetch is executed, the cursor advances to the next row in the active set. One can make use of any loop structure (Loop-End Loop along with While, For, If-End If) to fetch the records from the cursor into variables one row at a time.
Syntax: FETCH cursorname INTO variable1, variable2, ...;
For each column value returned by the query associated with the cursor, there must be a corresponding variable in the INTO list, and their datatypes must match. These variables are declared in the DECLARE section of the PL/SQL block.
Example:
DECLARE
  cursor c_emp is select emp_code, salary from employee where deptno = 20;
  /* Declaration of memory variables that hold data fetched from the cursor */
  str_emp_code employee.emp_code%type;
  num_salary employee.salary%type;
BEGIN
  open c_emp;
  /* Infinite loop to fetch data from cursor c_emp one row at a time */
  loop
    fetch c_emp into str_emp_code, num_salary;
    /* Updating the salary in the employee table as current salary + raise */
    update employee set salary = num_salary + (num_salary * 0.05)
      where emp_code = str_emp_code;
    /* Insert a record in the emp_raise table */
    insert into emp_raise values (str_emp_code, sysdate, num_salary * 0.05);
  end loop;
  commit;
END;
Note that this program results in an infinite loop, as no exit from the loop is provided. The exit can be provided using the cursor attributes explained below under Explicit Cursor Attributes.
Also note that if you execute a fetch and there are no more rows left in the active data set, the values of the explicit cursor variables are indeterminate.

Closing a Cursor
The close statement disables the cursor and the active set becomes undefined. This releases the memory occupied by the cursor and its data set. Once a cursor is closed, the user can reopen it using the open statement.
Syntax: CLOSE cursorname;
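The declare-open-fetch-close lifecycle described above is not specific to PL/SQL; most database APIs expose the same pattern. As a rough analogy only (not the text's Oracle code), the sketch below processes the same 5% raise row by row with Python's sqlite3 module; the in-memory schema and sample rows are invented for the demo.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employee (emp_code TEXT PRIMARY KEY, deptno INTEGER, salary REAL);
    CREATE TABLE emp_raise (emp_code TEXT, raise_date TEXT, raise_amt REAL);
    INSERT INTO employee VALUES ('E1', 20, 1000), ('E2', 20, 2000), ('E3', 10, 1500);
""")

# "Declare and open the cursor": executing the query identifies the active set.
cur = conn.execute("SELECT emp_code, salary FROM employee WHERE deptno = 20")

while True:
    row = cur.fetchone()          # "fetch": advance to the next row of the active set
    if row is None:               # analogue of %NOTFOUND: no more rows available
        break
    emp_code, salary = row
    conn.execute("UPDATE employee SET salary = salary + salary * 0.05 "
                 "WHERE emp_code = ?", (emp_code,))
    conn.execute("INSERT INTO emp_raise VALUES (?, date('now'), ?)",
                 (emp_code, salary * 0.05))

cur.close()                       # "close": release the cursor's resources
conn.commit()
```

The explicit `fetchone()` loop mirrors the PL/SQL FETCH statement; the `None` return plays the role of the %NOTFOUND attribute.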

Example:
DECLARE
  cursor c_emp is select emp_code, salary from employee where deptno = 20;
  str_emp_code employee.emp_code%type;
  num_salary employee.salary%type;
BEGIN
  open c_emp;
  loop
    fetch c_emp into str_emp_code, num_salary;
    update employee set salary = num_salary + (num_salary * 0.05)
      where emp_code = str_emp_code;
    insert into emp_raise values (str_emp_code, sysdate, num_salary * 0.05);
  end loop;
  commit;
  /* Close cursor c_emp */
  close c_emp;
END;

Explicit Cursor Attributes
Oracle provides certain attributes / cursor variables to control the execution of the cursor. Whenever any cursor (explicit or implicit) is opened and used, Oracle creates a set of four system variables via which it keeps track of the current status of the cursor. You can access these cursor variables. They are described below.
%NOTFOUND: evaluates to true if the last fetch failed because no more rows were available, or to false if the last fetch returned a row.
Syntax: cursorname%notfound
Example:
DECLARE
  cursor c_emp is select emp_code, salary from employee where deptno = 20;
  str_emp_code employee.emp_code%type;
  num_salary employee.salary%type;
BEGIN
  open c_emp;
  loop
    fetch c_emp into str_emp_code, num_salary;
    /* If the number of records retrieved is 0, or if all the records
       have been fetched, exit the loop. */
    exit when c_emp%notfound;
    update employee set salary = num_salary + (num_salary * 0.05)
      where emp_code = str_emp_code;
    insert into emp_raise values (str_emp_code, sysdate, num_salary * 0.05);
  end loop;
  commit;
  close c_emp;
END;
%FOUND: is the logical opposite of %notfound. It evaluates to true if the last fetch succeeded because a row was available, or to false if the last fetch failed because no more rows were available.
Syntax: cursorname%found
Example: The PL/SQL block will be as follows:
DECLARE
  cursor c_emp is select emp_code, salary from employee where deptno = 20;
  str_emp_code employee.emp_code%type;
  num_salary employee.salary%type;
BEGIN
  open c_emp;
  loop
    fetch c_emp into str_emp_code, num_salary;
    /* If the number of records retrieved > 0 then process the data,
       else exit the loop. */
    if c_emp%found then
      update employee set salary = num_salary + (num_salary * 0.05)
        where emp_code = str_emp_code;
      insert into emp_raise values (str_emp_code, sysdate, num_salary * 0.05);
    else
      exit;
    end if;
  end loop;
  commit;
  close c_emp;
END;
%ISOPEN: evaluates to true if an explicit cursor is open, or to false if it is closed.
Syntax: cursorname%isopen
Example:
DECLARE
  cursor c_emp is select emp_code, salary from employee where deptno = 20;
  str_emp_code employee.emp_code%type;
  num_salary employee.salary%type;
BEGIN
  open c_emp;
  /* If the cursor is open, continue with the data processing,
     else display an appropriate error message */

  if c_emp%isopen then
    loop
      fetch c_emp into str_emp_code, num_salary;
      exit when c_emp%notfound;
      update employee set salary = num_salary + (num_salary * 0.05)
        where emp_code = str_emp_code;
      insert into emp_raise values (str_emp_code, sysdate, num_salary * 0.05);
    end loop;
    commit;
    close c_emp;
  else
    dbms_output.put_line('Unable to open Cursor');
  end if;
END;
%ROWCOUNT: returns the number of rows fetched from the active set. It is set to zero when the cursor is opened.
Syntax: cursorname%rowcount
Example: Display the name, department number and salary of the first 10 employees getting the highest salary.
DECLARE
  cursor c_emp is select ename, deptno, salary from employee, deptmaster
    where deptmaster.deptno = employee.deptno order by salary desc;
  str_ename employee.ename%type;
  num_deptno employee.deptno%type;
  num_salary employee.salary%type;
BEGIN
  open c_emp;
  dbms_output.put_line('Name  Department  Salary');
  dbms_output.put_line('------------------------');
  loop
    fetch c_emp into str_ename, num_deptno, num_salary;
    dbms_output.put_line(str_ename || '  ' || num_deptno || '  ' || num_salary);
    exit when c_emp%rowcount = 10;
  end loop;
END;

Cursor For Loops
In most situations that require an explicit cursor, you can simplify coding by using a cursor for loop instead of the open, fetch and close statements. A cursor for loop implicitly declares its loop index as a %rowtype record, opens a cursor, repeatedly fetches rows of values from the active set into fields in the record, and closes the cursor when all rows have been processed.
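Python's DB-API offers a close analogue of the cursor for loop: a cursor object is itself iterable, so a plain for loop does the open-fetch-exhaust bookkeeping implicitly. A rough sketch with sqlite3 (the schema and rows are invented for the demo, mirroring the employee example; `list()` materializes the active set up front so the same table can be updated safely inside the loop):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employee (emp_code TEXT PRIMARY KEY, deptno INTEGER, salary REAL);
    CREATE TABLE emp_raise (emp_code TEXT, raise_date TEXT, raise_amt REAL);
    INSERT INTO employee VALUES ('E1', 20, 1000), ('E2', 20, 2000), ('E3', 10, 1500);
""")

# Like a cursor FOR loop: iterating the cursor fetches one row per iteration,
# and the cursor is exhausted (and can be discarded) when the loop ends.
for emp_code, salary in list(conn.execute(
        "SELECT emp_code, salary FROM employee WHERE deptno = 20")):
    conn.execute("UPDATE employee SET salary = ? WHERE emp_code = ?",
                 (salary * 1.05, emp_code))
    conn.execute("INSERT INTO emp_raise VALUES (?, date('now'), ?)",
                 (emp_code, salary * 0.05))
conn.commit()
```

As in the PL/SQL cursor for loop, no explicit open, fetch or close appears in the loop body.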
Example: The PL/SQL block can be rewritten as follows:
DECLARE
  cursor c_emp is select emp_code, salary from employee where deptno = 20;
BEGIN
  for emp_rec in c_emp
  loop
    update employee set salary = emp_rec.salary + (emp_rec.salary * 0.05)
      where emp_code = emp_rec.emp_code;
    insert into emp_raise values (emp_rec.emp_code, sysdate, emp_rec.salary * 0.05);
  end loop;
  commit;
END;
When you use a cursor for loop, it implicitly declares emp_rec as belonging to type c_emp%rowtype and retrieves the records as declared in the cursor c_emp. The sequence of statements inside the loop is executed once for every row that satisfies the query associated with the cursor. With each iteration, a record is fetched from c_emp into emp_rec. Dot notation is used to reference individual fields, e.g. emp_rec.emp_code, where emp_rec is a rowtype variable and emp_code is the name of the field. When you leave the loop, the cursor is closed automatically. This is true even if you use an exit or goto statement to leave the loop prematurely, or if an exception is raised inside the loop. Thus, when you exit the loop, the cursor c_emp is closed.
Note: The record is defined only inside the loop; you cannot refer to its fields outside the loop. The reference in the following example is illegal:
BEGIN
  for c1rec in c1
  loop
    null;
  end loop;
  result := c1rec.n2 + 3; /* referencing c1rec outside the for loop is illegal */
END;
Focus: A bank has an acctmast table where it holds the current status of a client's bank account (i.e. what the client currently has in the savings bank account). Another table, accttran, holds each transaction as it occurs at the bank, i.e. deposits and withdrawals of clients. A client can deposit money, which must then be ADDED to the amount held against that specific client's name in the acctmast table. This is referred to as a CREDIT type transaction.

A client may withdraw money from his account. This must be SUBTRACTED from the amount held against that specific client's name in the acctmast table. This is referred to as a DEBIT type transaction. The accttran table must therefore hold a flag that indicates whether the transaction type was CREDIT or DEBIT. Based on this flag, define a cursor which will update the acctmast balance field. Write a PL/SQL block that updates the acctmast table and sets the balance depending upon whether the account is debited or credited. The update should be done only for those records that are not yet processed, i.e. the processed flag is 'N' in the accttran table.
1. Create the following tables

Table name: acctmast
Column name   Data Type   Size   Attributes
acctno        varchar2    4      Primary key.
name          varchar2    20     Account name.
balance       number      8      The balance in the account.

Table name: accttran
Column name   Data Type   Size   Attributes
acctno        varchar2    4      Foreign key referencing table acctmast.
trn_date      date               The transaction date.
deb_crd       char        1      The Dr / Cr flag.
amount        number      7,2
processed     char        1      A flag indicating whether the record has been processed.

The following PL/SQL code updates the acctmast table depending upon the daily transactions entered in the accttran table.
Declare
  cursor acc_updt is select acctno, deb_crd, amount from accttran
    where processed = 'N';
  acctnum char(4);
  db_cd char(1);
  amt number(7,2);
Begin
  open acc_updt;
  /* Perform the update for all the records retrieved by the cursor */
  loop
    fetch acc_updt into acctnum, db_cd, amt;
    exit when acc_updt%notfound;
    /* If the account is debited then update the acctmast table
       as balance = balance - amt */
    if db_cd = 'D' then
      update acctmast set balance = (balance - amt) where acctno = acctnum;
    else
      /* If the account is credited then update the acctmast table
         as balance = balance + amt */
      update acctmast set balance = (balance + amt) where acctno = acctnum;
    end if;
    update accttran set processed = 'Y' where acctno = acctnum;
  end loop;
  close acc_updt;
  commit;
End;

Implicit Cursor
Oracle implicitly opens a cursor to process each SQL statement not associated with an explicitly declared cursor. PL/SQL lets you refer to the most recent implicit cursor as the SQL cursor. So, although you cannot use the open, fetch and close statements to control an implicit cursor, you can still use cursor attributes to access information about the most recently executed SQL statement.

Implicit Cursor Attributes
The SQL cursor has four attributes, as described below. When appended to the cursor name (i.e. SQL), these attributes let you access information about the execution of insert, update, delete and single-row select statements. Implicit cursor attributes return the boolean value null until they are set by a cursor operation. The values of the cursor attributes always refer to the most recently executed SQL statement, wherever the statement appears; it might be in a different scope (in a sub-block). So, if you want to save an attribute value for later use, assign it to a boolean variable immediately.
%NOTFOUND: evaluates to true if an insert, update or delete affected no rows, or a single-row select returned no rows. Otherwise, it evaluates to false.
Syntax: SQL%NOTFOUND
Example: The HRD manager has decided to raise the salary of an employee by 15%. Write a PL/SQL block to accept the employee number and update the salary of that employee. Display an appropriate message based on the existence of the record in the employee table.
Begin
  update employee set salary = salary * 1.15 where emp_code = &emp_code;
  if sql%notfound then
    dbms_output.put_line('Employee No. Does Not Exist');
  else
    dbms_output.put_line('Employee Record Modified Successfully');

  end if;
END;
%FOUND: is the logical opposite of %notfound. Note, however, that both attributes evaluate to null until they are set by an implicit or explicit cursor operation. %found evaluates to true if an insert, update or delete affected one or more rows, or a single-row select returned one or more rows. Otherwise, it evaluates to false.
Syntax: SQL%FOUND
Example: The example given for sql%notfound can be written as follows:
Begin
  update employee set salary = salary * 1.15 where emp_code = &emp_code;
  if sql%found then
    dbms_output.put_line('Employee Record Modified Successfully');
  else
    dbms_output.put_line('Employee No. Does Not Exist');
  end if;
END;
%ROWCOUNT: returns the number of rows affected by an insert, update, delete or select into statement.
Example: The HRD manager has decided to raise the salary of employees working as Programmers by 15%. Write a PL/SQL block to update their salaries and display an appropriate message based on the existence of such records in the employee table.
Declare
  rows_affected char(4);
Begin
  update employee set salary = salary * 1.15 where job = 'Programmers';
  rows_affected := to_char(sql%rowcount);
  if sql%rowcount > 0 then
    dbms_output.put_line(rows_affected || ' Employee Records Modified Successfully');
  else
    dbms_output.put_line('There are no Employees working as Programmers');
  end if;
END;
%ISOPEN: Oracle automatically closes the SQL cursor after executing its associated SQL statement. As a result, sql%isopen always evaluates to false.

Parameterized Cursors
So far we have used cursors that query all the records from a table. Sometimes records must be brought into memory selectively. While declaring a cursor, the select statement can include a where clause to retrieve data conditionally. We should be able to pass a value to the cursor when it is being opened.
For that, the cursor must be declared in such a way that it recognizes that it will receive the required value(s) at the time of opening. Such a cursor is known as a parameterized cursor.
Syntax: CURSOR cursor_name (variable_name datatype) IS select statement ...;
The scope of cursor parameters is local to that cursor, which means that they can be referenced only within the query declared in the cursor declaration. The values of cursor parameters are used by the associated query when the cursor is opened. For example:
cursor c_emp (num_deptno number) is
  select job, ename from emp where deptno > num_deptno;
The parameters to a cursor are passed in the open statement. They can either be constant values or the contents of a memory variable. For example:
OPEN c_emp (30);
OPEN c_emp (num_deptno);
Note: The memory variable should be declared in the DECLARE section and a value should be assigned to it. Each parameter in the declaration must have a corresponding value in the open statement. Remember that the parameters of a cursor cannot return values.
Example: Allow insert, update and delete on the table itemmast on the basis of the table itemtran.
1. Create the following tables

Table name: itemmast
Column name   Data Type   Size   Attributes
itemid        number      4      Primary key.
description   varchar     20     The item description.
bal_stock     number      3      The balance stock for an item.

Table name: itemtran
Column name   Data Type   Size   Attributes
itemid        number      4      Foreign key via which we shall seek data in the table.
description   varchar     30     Item description.
operation     char        1      The kind of operation on the itemmast table, i.e. Insert, Update, Delete (I, U, D).
qty           number      3      The qty sold.
status        varchar     30     The status of the operation.

Based on the value in the operation column of table itemtran, the corresponding record of table itemmast is either inserted, updated or deleted.
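The idea of binding a value into a cursor's query at open time has a direct analogue in parameterized queries in other database APIs. As a rough sketch only (not Oracle code; the emp table and sample rows here are invented for the demo), using Python's sqlite3:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE emp (ename TEXT, job TEXT, deptno INTEGER)")
conn.executemany("INSERT INTO emp VALUES (?, ?, ?)",
                 [('Smith', 'Clerk', 10), ('Jones', 'Analyst', 30),
                  ('Adams', 'Clerk', 40)])

def open_emp_cursor(num_deptno):
    # The parameter is bound when the cursor is opened, just as the value in
    # OPEN c_emp(30) is substituted into the cursor's query at open time.
    return conn.execute("SELECT job, ename FROM emp WHERE deptno > ?",
                        (num_deptno,))

rows = open_emp_cursor(20).fetchall()
```

Each call to `open_emp_cursor` corresponds to one OPEN with a different parameter value; the parameter is visible only inside the query, matching the scoping rule described above.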
On the basis of the success or failure of the insert, update or delete operation, the status column in the table itemtran is updated with appropriate text indicating success or the reason for failure. The following three cases are to be taken care of:
1. If operation = 'I', then the itemid, along with the description and qty, is inserted into the required columns of

the table itemmast. If the insert is successful, the status column of the itemtran table is updated to 'Successful', else it is updated to 'Item Already Exists'.
2. If operation = 'U', then the qty against this operation is added to the bal_stock column of the table itemmast where the itemid of table itemmast is the same as that of itemtran. If the update is successful, the status column of the itemtran table is updated to 'Successful', else it is updated to 'Item Does Not Exist'.
3. If operation = 'D', then the row of itemmast is deleted whose itemid is equal to the itemid in the table itemtran with the operation column having the value 'D'. If the delete is successful, the status column of the itemtran table is updated to 'Successful', else it is updated to 'Item Does Not Exist'.
The following PL/SQL code takes care of the above three cases.
Declare
  /* Cursor scantable retrieves all the records of table itemtran */
  cursor scantable is select itemid, operation, qty, description from itemtran;
  /* Cursor itemchk accepts the value of itemid from the current row
     of cursor scantable */
  cursor itemchk(mastitemid number) is
    select itemid from itemmast where itemid = mastitemid;
  /* Variables that hold data from the cursor scantable */
  itemidno number(4);
  descrip varchar2(30);
  oper char(1);
  quantity number(3);
  /* Variable that holds data from the cursor itemchk */
  dummyitem number(4);
Begin
  /* Open the scantable cursor */
  open scantable;
  loop
    /* Fetch the records from the scantable cursor */
    fetch scantable into itemidno, oper, quantity, descrip;
    exit when scantable%notfound;
    /* Open the itemchk cursor. Note that the value passed to the itemchk
       cursor is the itemid in the current row of cursor scantable */
    open itemchk(itemidno);
    fetch itemchk into dummyitem;
    /* If the record is not found and the operation is insert, then insert
       the new record and set the status to 'Successful' */
    if itemchk%notfound then
      if oper = 'I' then
        insert into itemmast(itemid, bal_stock, description) values(itemidno, quantity,
          descrip);
        update itemtran set status = 'Successful' where itemid = itemidno;
      /* If the record is not found and the operation is update/delete,
         then set the status to 'Item Does Not Exist' */
      elsif oper = 'U' or oper = 'D' then
        update itemtran set status = 'Item Does Not Exist' where itemid = itemidno;
      end if;
    else
      /* If the record is found and the operation is insert, then set
         the status to 'Item Already Exists' */
      if oper = 'I' then
        update itemtran set status = 'Item Already Exists' where itemid = itemidno;
      /* If the record is found and the operation is update/delete, then
         perform the update or delete and set the status to 'Successful' */
      elsif oper = 'D' then
        delete from itemmast where itemmast.itemid = itemidno;
        update itemtran set status = 'Successful' where itemid = itemidno;
      elsif oper = 'U' then
        update itemmast set bal_stock = bal_stock + quantity where itemid = itemidno;
        update itemtran set status = 'Successful' where itemid = itemidno;
      end if;
    end if;
    close itemchk;
    exit when scantable%notfound;
  end loop;
  close scantable;
  commit;
END;

Points to Ponder
Oracle uses a work area for its internal processing.
You can explicitly declare a cursor to process the rows individually. A cursor declared by the user is called an explicit cursor.
Opening the cursor executes the query and identifies the active set, which contains all the rows that meet the query's search criteria.

The fetch statement retrieves the rows from the active set into variables, one at a time.
The close statement disables the cursor and the active set becomes undefined.
Oracle implicitly opens a cursor to process each SQL statement not associated with an explicitly declared cursor.

Review Terms
Defining a cursor
Use of a cursor
Explicit cursor
Implicit cursor
Parameterized cursor

Students Activity
1. Define cursors and explain their usage.
2. Differentiate between explicit and implicit cursors with the help of an example.

Student Notes

LESSON 33: LAB

LESSON 34: LAB

LESSON 35: NORMALISATION - I

Lesson Objectives
Normalisation
First Normal Form
Functional dependencies
Closure of a set of functional dependencies

When deciding upon the structure of data to be stored in a file(s) or a database, the two main issues to be considered are:
1. removing data duplication from the files/database
2. avoiding data confusion, where different versions of the same information are located in different places in the same file
Data normalisation is the process of determining the correct structure for data in files or databases so that these problems cannot occur. Data is structured by following a series of steps. Each step removes the potential for a particular problem to occur in the data, e.g. duplication, and each step builds upon the previous steps.

First Normal Form
The first of the normal forms that we study, first normal form, imposes a very basic requirement on relations. A domain is atomic if elements of the domain are considered to be indivisible units. We say that a relational schema R is in first normal form (1NF) if the domains of all attributes of R are atomic. A set of names is an example of a non-atomic value. For example, if the schema of a relation employee included an attribute children whose domain elements are sets of names, the schema would not be in first normal form. Composite attributes, such as an attribute address with component attributes street and city, also have non-atomic domains. Integers are assumed to be atomic, so the set of integers is an atomic domain; the set of all sets of integers is a non-atomic domain. The distinction is that we do not normally consider integers to have subparts, but we consider sets of integers to have subparts, namely the integers making up the set. The important issue is not what the domain itself is, but rather how we use domain elements in our database.
The domain of all integers would be non-atomic if we considered each integer to be an ordered list of digits. As a practical illustration of the above point, consider an organization that assigns employees identification numbers of the following form: the first two letters specify the department and the remaining four digits are a unique number within the department for the employee. Examples of such numbers would be CS0012 and EE1127. Such identification numbers can be divided into smaller units, and are therefore non-atomic. If a relation schema had an attribute whose domain consists of identification numbers encoded as above, the schema would not be in first normal form.
First normal form (1NF) sets the very basic rules for an organized database:
Eliminate duplicative columns from the same table.
Create separate tables for each group of related data and identify each row with a unique column or set of columns (the primary key).

Functional Dependencies
Functional dependencies play a key role in designing a good database. A functional dependency is a type of constraint that is a generalization of the notion of a key. Functional dependencies are constraints on the set of legal relations. They allow us to express facts about the enterprise that we are modeling with our database. We define the notion of a superkey as follows. Let R be a relation schema. A subset K of R is a superkey of R if, in any legal relation r(R), for all pairs t1 and t2 of tuples in r such that t1 ≠ t2, t1[K] ≠ t2[K]. That is, no two tuples in any legal relation r(R) may have the same value on attribute set K. The notion of functional dependency generalizes the notion of superkey. Consider a relation schema R, and let α ⊆ R and β ⊆ R. The functional dependency α → β holds on schema R if, in any legal relation r(R), for all pairs of tuples t1 and t2 in r such that t1[α] = t2[α], it is also the case that t1[β] = t2[β]. Using the functional-dependency notation, we say that K is a superkey of R if K → R.
That is, K is a superkey if, whenever t1[K] = t2[K], it is also the case that t1[R] = t2[R] (that is, t1 = t2). Functional dependencies allow us to express constraints that we cannot express with superkeys. Consider the schema
Loan-info-schema = (loan-number, branch-name, customer-name, amount)
which is a simplification of the Lending-schema that we saw earlier. The set of functional dependencies that we expect to hold on this relation schema is
loan-number → amount
loan-number → branch-name
We would not, however, expect the functional dependency loan-number → customer-name to hold, since, in general, a given loan can be made to more than one customer (for example, to both members of a husband-wife pair).

We shall use functional dependencies in two ways:
1. To test relations to see whether they are legal under a given set of functional dependencies. If a relation r is legal under a set F of functional dependencies, we say that r satisfies F.
2. To specify constraints on the set of legal relations. We shall thus concern ourselves with only those relations that satisfy a given set of functional dependencies. If we wish to constrain ourselves to relations on schema R that satisfy a set F of functional dependencies, we say that F holds on R.
Consider the sample relation r:

A    B    C    D
a1   b1   c1   d1
a1   b2   c1   d2
a2   b2   c2   d2
a2   b3   c2   d3
a3   b3   c2   d4

To see which functional dependencies are satisfied, observe that A → C is satisfied. There are two tuples that have an A value of a1. These tuples have the same C value, namely c1. Similarly, the two tuples with an A value of a2 have the same C value, c2. There are no other pairs of distinct tuples that have the same A value. The functional dependency C → A is not satisfied, however. To see that it is not, consider the tuples t1 = (a2, b3, c2, d3) and t2 = (a3, b3, c2, d4). These two tuples have the same C value, c2, but they have different A values, a2 and a3, respectively. Thus, we have found a pair of tuples t1 and t2 such that t1[C] = t2[C], but t1[A] ≠ t2[A]. Many other functional dependencies are satisfied by r, including, for example, the functional dependency AB → D.

The customer relation:

Customer-name   Customer-street   Customer-city
Jones           Main              Harrison
Smith           North             Rye
Hayes           Main              Harrison
Curry           North             Rye
Lindsay         Park              Pittsfield
Turner          Putnam            Stamford
Williams        Nassau            Princeton
Adams           Spring            Pittsfield
Johnson         Alma              Palo Alto
Glenn           Sand Hill         Woodside
Brooks          Senator           Brooklyn
Green           Walnut            Stamford

The loan relation:

Loan-number   Branch-name   Amount
L-17          Downtown      1000
L-23          Redwood       2000
L-15          Perryridge    1500
L-14          Downtown      1500
L-93          Mianus        500
L-11          Round Hill    900
L-29          Pownal        1200
L-16          North Town    1300
L-18          Downtown      2000
L-25          Perryridge    2500
L-10          Brighton      2200
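The check just performed on r (any two tuples that agree on α must also agree on β) is mechanical, so it can be expressed as a short program. A sketch in Python; the relation below is the text's sample relation r, with the fourth tuple (a2, b3, c2, d3) as used in the text's prose:

```python
from itertools import combinations

def satisfies(relation, attrs, alpha, beta):
    """True iff relation satisfies the functional dependency alpha -> beta."""
    pos = {a: i for i, a in enumerate(attrs)}
    project = lambda t, xs: tuple(t[pos[x]] for x in xs)
    # For every pair of tuples agreeing on alpha, they must agree on beta.
    return all(project(t1, beta) == project(t2, beta)
               for t1, t2 in combinations(relation, 2)
               if project(t1, alpha) == project(t2, alpha))

# The sample relation r from the text.
r = [('a1', 'b1', 'c1', 'd1'),
     ('a1', 'b2', 'c1', 'd2'),
     ('a2', 'b2', 'c2', 'd2'),
     ('a2', 'b3', 'c2', 'd3'),
     ('a3', 'b3', 'c2', 'd4')]
attrs = ['A', 'B', 'C', 'D']
```

Running the checker confirms the claims in the text: A → C and AB → D are satisfied by r, while C → A is not; trivial dependencies such as A → A always hold.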
Note that we use AB as shorthand for {A, B}, to conform with standard practice. Observe that there is no pair of distinct tuples t1 and t2 such that t1[AB] = t2[AB]. Therefore, if t1[AB] = t2[AB], it must be that t1 = t2 and, thus, t1[D] = t2[D]. So, r satisfies AB → D.
Some functional dependencies are said to be trivial because they are satisfied by all relations. For example, A → A is satisfied by all relations involving attribute A. Reading the definition of functional dependency literally, we see that, for all tuples t1 and t2 such that t1[A] = t2[A], it is the case that t1[A] = t2[A]. Similarly, AB → A is satisfied by all relations involving attribute A. In general, a functional dependency of the form α → β is trivial if β ⊆ α.
To distinguish between the concepts of a relation satisfying a dependency and a dependency holding on a schema, we return to the banking example. If we consider the customer relation (on Customer-schema) shown above, we see that customer-street → customer-city is satisfied. However, we believe that, in the real world, two cities can have streets with the same name. Thus, it is possible, at some time, to have an instance of the customer relation in which customer-street → customer-city is not satisfied. So, we would not include customer-street → customer-city in the set of functional dependencies that hold on Customer-schema. In the loan relation (on Loan-schema), we see that the dependency loan-number → amount is satisfied. In contrast to the case of customer-city and customer-street in Customer-schema, we do believe that the real-world enterprise that we are modeling requires each loan to have only one amount. Therefore, we want to require that loan-number → amount be satisfied by the loan relation at all times. In other words, we require that the constraint loan-number → amount hold on Loan-schema. In the branch relation, we see that branch-name → assets is satisfied, as is assets → branch-name.
We want to require that branch-name → assets hold on Branch-schema. However, we do not wish to require that assets → branch-name hold, since it is possible to have several branches that have the same asset value. In what follows, we assume that, when we design a relational database, we first list those functional dependencies that must always hold.

The branch relation:

Branch-name   Branch-city   Assets
Downtown      Brooklyn      9000000
Redwood       Palo Alto     2100000
Perryridge    Horseneck     1700000
Mianus        Horseneck     400000
Round Hill    Horseneck     8000000
Pownal        Bennington    300000
North Town    Rye           3700000
Brighton      Brooklyn      7100000

In the banking example, our list of dependencies includes the following:

On Branch-schema:
branch-name → branch-city
branch-name → assets
On Customer-schema:
customer-name → customer-city
customer-name → customer-street
On Loan-schema:
loan-number → amount
loan-number → branch-name
On Borrower-schema: no functional dependencies
On Account-schema:
account-number → branch-name
account-number → balance
On Depositor-schema: no functional dependencies

Closure of a Set of Functional Dependencies
It is not sufficient to consider the given set of functional dependencies. Rather, we need to consider all functional dependencies that hold. We shall see that, given a set F of functional dependencies, we can prove that certain other functional dependencies hold. We say that such functional dependencies are logically implied by F. More formally, given a relational schema R, a functional dependency f on R is logically implied by a set of functional dependencies F on R if every relation instance r(R) that satisfies F also satisfies f. Suppose we are given a relation schema R = (A, B, C, G, H, I) and the set of functional dependencies
A → B
A → C
CG → H
CG → I
B → H
The functional dependency A → H is logically implied. That is, we can show that, whenever our given set of functional dependencies holds on a relation, A → H must also hold on the relation. Suppose that t1 and t2 are tuples such that t1[A] = t2[A]. Since we are given that A → B, it follows from the definition of functional dependency that t1[B] = t2[B]. Then, since we are given that B → H, it follows that t1[H] = t2[H]. Therefore, we have shown that, whenever t1 and t2 are tuples such that t1[A] = t2[A], it must be that t1[H] = t2[H]. But that is exactly the definition of A → H. Let F be a set of functional dependencies. The closure of F, denoted by F+, is the set of all functional dependencies logically implied by F. Given F, we can compute F+ directly from the formal definition of functional dependency.
If F were large, this process would be lengthy and difficult. Such a computation of F+ requires arguments of the type just used to show that A → H is in the closure of our example set of dependencies. Axioms, or rules of inference, provide a simpler technique for reasoning about functional dependencies. In the rules that follow, we use Greek letters (α, β, γ, ...) for sets of attributes, and uppercase Roman letters from the beginning of the alphabet for individual attributes. We use αβ to denote α ∪ β. We can use the following three rules to find logically implied functional dependencies. By applying these rules repeatedly, we can find all of F+ given F. This collection of rules is called Armstrong's axioms, in honor of the person who first proposed it.
Reflexivity rule: if α is a set of attributes and β ⊆ α, then α → β holds.
Augmentation rule: if α → β holds and γ is a set of attributes, then γα → γβ holds.
Transitivity rule: if α → β holds and β → γ holds, then α → γ holds.
Armstrong's axioms are sound, because they do not generate any incorrect functional dependencies. They are complete, because, for a given set F of functional dependencies, they allow us to generate all of F+. Although Armstrong's axioms are complete, it is tiresome to use them directly for the computation of F+. To simplify matters further, we list additional rules. It is possible to use Armstrong's axioms to prove that these rules are correct.
Union rule: if α → β holds and α → γ holds, then α → βγ holds.
Decomposition rule: if α → βγ holds, then α → β holds and α → γ holds.
Pseudotransitivity rule: if α → β holds and γβ → δ holds, then αγ → δ holds.
Let us apply our rules to the example of schema R = (A, B, C, G, H, I) and the set F of functional dependencies {A → B, A → C, CG → H, CG → I, B → H}. We list several members of F+ here:
A → H: since A → B and B → H hold, we apply the transitivity rule.
Observe that it was much easier to use Armstrong's axioms to show that A → H holds than it was to argue directly from the definitions, as we did earlier in this section.

CG → HI. Since CG → H and CG → I hold, the union rule implies that CG → HI.

AG → I. Since A → C and CG → I hold, the pseudotransitivity rule implies that AG → I holds. Another way of finding that AG → I holds is as follows: we use the augmentation rule on A → C to infer AG → CG.

Applying the transitivity rule to this dependency and CG → I, we infer AG → I.

Points To Ponder

Normalisation is the process of determining the correct structure for data in files or databases.
A relation schema R is in first normal form (1NF) if the domains of all attributes of R are atomic.
Functional dependencies play a key role in designing a good database. A functional dependency is a type of constraint that is a generalization of the notion of a key.
A bad database design suggests that we should decompose a relation schema that has many attributes into several schemas with fewer attributes.

Review Terms

Normalisation
First Normal Form
Functional dependencies
Closure of a set of functional dependencies

Students Activity

1. Define Normalisation?
2. How can normalisation reduce the redundancy of data?
3. Define 1NF?
4. Define functional dependency?
5. Define the closure of a set of functional dependencies?


LESSON 36: NORMALISATION - II

Lesson Objectives

Decomposition
Properties of decomposition
Lossless join decomposition
2NF

Decomposition

A bad database design suggests that we should decompose a relation schema that has many attributes into several schemas with fewer attributes. Careless decomposition, however, may lead to another form of bad design. Consider an alternative design in which we decompose Lending-schema into the following two schemas:

Branch-customer-schema = (branch-name, branch-city, assets, customer-name)
Customer-loan-schema = (customer-name, loan-number, amount)

Using the lending relation of Figure 7.1, we construct our new relations branch-customer (Branch-customer-schema) and customer-loan (Customer-loan-schema):

branch-customer = Π branch-name, branch-city, assets, customer-name (lending)
customer-loan = Π customer-name, loan-number, amount (lending)

Of course, there are cases in which we need to reconstruct the loan relation. For example, suppose that we wish to find all branches that have loans with amounts less than $1000. No relation in our alternative database contains these data. We need to reconstruct the lending relation. It appears that we can do so by writing

branch-customer ⋈ customer-loan

Branch-name   Branch-city   Assets    Customer-name
Downtown      Brooklyn      9000000   Jones
Redwood       Palo Alto     2100000   Smith
Perryridge    Horseneck     1700000   Hayes
Downtown      Brooklyn      9000000   Jackson
Mianus        Horseneck      400000   Jones
Round Hill    Horseneck     8000000   Turner
Pownal        Bennington     300000   Williams
North Town    Rye           3700000   Hayes
Downtown      Brooklyn      9000000   Johnson
Perryridge    Horseneck     1700000   Glenn
Brighton      Brooklyn      7100000   Brooks

The relation branch-customer.

Customer-name   Loan-number   Amount
Jones           L-17          1000
Smith           L-23          2000
Hayes           L-15          1500
Jackson         L-14          1500
Jones           L-93           500
Turner          L-11           900
Williams        L-29          1200
Hayes           L-16          1300
Johnson         L-18          2000
Glenn           L-25          2500
Brooks          L-10          2200

The relation customer-loan.

Consider the result of computing branch-customer ⋈ customer-loan.
When we compare this relation and the lending relation with which we started, we notice a difference: although every tuple that appears in the lending relation appears in branch-customer ⋈ customer-loan, there are tuples in branch-customer ⋈ customer-loan that are not in lending. In our example, branch-customer ⋈ customer-loan has the following additional tuples:

(Downtown, Brooklyn, 9000000, Jones, L-93, 500)
(Perryridge, Horseneck, 1700000, Hayes, L-16, 1300)
(Mianus, Horseneck, 400000, Jones, L-17, 1000)
(North Town, Rye, 3700000, Hayes, L-15, 1500)

Consider the query "Find all bank branches that have made a loan in an amount less than $1000." Looking at the lending relation, we see that the only branches with loan amounts less than $1000 are Mianus and Round Hill. However, when we apply the expression

Π branch-name (σ amount < 1000 (branch-customer ⋈ customer-loan))

we obtain three branch names: Mianus, Round Hill, and Downtown.

A closer examination of this example shows why. If a customer happens to have several loans from different branches, we cannot tell which loan belongs to which branch. Thus, when we join branch-customer and customer-loan, we obtain not only the tuples we had originally in lending, but also several additional tuples. Although we have more tuples in branch-customer ⋈ customer-loan, we actually have less information. We are no longer able, in general, to represent in the database information about which customers are borrowers from which branch. Because of this loss of information, we call the decomposition of Lending-schema into Branch-customer-schema and Customer-loan-schema a lossy decomposition, or a lossy-join decomposition. A decomposition that is not a lossy-join decomposition is a lossless-join decomposition.
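The loss of information can be reproduced directly. Below is a small Python sketch (not from the text) using a three-tuple subset of the lending relation: we project onto the two decomposed schemas, natural-join back on customer-name, and look for spurious tuples.

```python
# Reproduce the lossy join: Jones has loans at two different branches,
# so joining the projections mixes his loans across branches.

lending = [
    # (branch-name, branch-city, assets, customer-name, loan-number, amount)
    ("Downtown",   "Brooklyn",  9000000, "Jones",  "L-17", 1000),
    ("Mianus",     "Horseneck",  400000, "Jones",  "L-93",  500),
    ("Round Hill", "Horseneck", 8000000, "Turner", "L-11",  900),
]

# The two projections of the lossy decomposition.
branch_customer = {(b, c, a, n) for (b, c, a, n, _, _) in lending}
customer_loan = {(n, l, amt) for (_, _, _, n, l, amt) in lending}

# Natural join on the single shared attribute, customer-name.
joined = {(b, c, a, n, l, amt)
          for (b, c, a, n) in branch_customer
          for (n2, l, amt) in customer_loan if n == n2}

spurious = joined - set(lending)
print(sorted(spurious))
# Jones's two loans at two branches yield two spurious tuples, including
# (Downtown, Brooklyn, 9000000, Jones, L-93, 500), as in the text.
```

With all eleven tuples of the lending relation, the same computation produces exactly the four additional tuples listed above.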

Branch-name   Branch-city   Assets    Customer-name   Loan-number   Amount
Downtown      Brooklyn      9000000   Jones           L-17          1000
Downtown      Brooklyn      9000000   Jones           L-93           500
Redwood       Palo Alto     2100000   Smith           L-23          2000
Perryridge    Horseneck     1700000   Hayes           L-15          1500
Perryridge    Horseneck     1700000   Hayes           L-16          1300
Downtown      Brooklyn      9000000   Jackson         L-14          1500
Mianus        Horseneck      400000   Jones           L-17          1000
Mianus        Horseneck      400000   Jones           L-93           500
Round Hill    Horseneck     8000000   Turner          L-11           900
Pownal        Bennington     300000   Williams        L-29          1200
North Town    Rye           3700000   Hayes           L-15          1500
North Town    Rye           3700000   Hayes           L-16          1300
Downtown      Brooklyn      9000000   Johnson         L-18          2000
Perryridge    Horseneck     1700000   Glenn           L-25          2500
Brighton      Brooklyn      7100000   Brooks          L-10          2200

The relation branch-customer ⋈ customer-loan.

It should be clear from our example that a lossy-join decomposition is, in general, a bad database design. Why is the decomposition lossy? There is one attribute in common between Branch-customer-schema and Customer-loan-schema:

Branch-customer-schema ∩ Customer-loan-schema = {customer-name}

The only way that we can represent a relationship between, for example, loan-number and branch-name is through customer-name. This representation is not adequate because a customer may have several loans, yet these loans are not necessarily obtained from the same branch.

Let us consider another alternative design, in which we decompose Lending-schema into the following two schemas:

Branch-schema = (branch-name, branch-city, assets)
Loan-info-schema = (branch-name, customer-name, loan-number, amount)

There is one attribute in common between these two schemas:

Branch-schema ∩ Loan-info-schema = {branch-name}

Thus, the only way that we can represent a relationship between, for example, customer-name and assets is through branch-name.
The difference between this example and the preceding one is that the assets of a branch are the same, regardless of the customer to whom we are referring, whereas the lending branch associated with a certain loan amount does depend on the customer to whom we are referring. For a given branch-name, there is exactly one assets value and exactly one branch-city, whereas a similar statement cannot be made for customer-name. That is, the functional dependency

branch-name → assets branch-city

holds, but customer-name does not functionally determine loan-number.

The notion of lossless joins is central to much of relational-database design. Therefore, we restate the preceding examples more concisely and more formally. Let R be a relation schema. A set of relation schemas {R1, R2, ..., Rn} is a decomposition of R if

R = R1 ∪ R2 ∪ ... ∪ Rn

That is, {R1, R2, ..., Rn} is a decomposition of R if, for i = 1, 2, ..., n, each Ri is a subset of R, and every attribute in R appears in at least one Ri.

Let r be a relation on schema R, and let ri = Π Ri (r) for i = 1, 2, ..., n. That is, {r1, r2, ..., rn} is the database that results from decomposing R into {R1, R2, ..., Rn}. It is always the case that

r ⊆ r1 ⋈ r2 ⋈ ... ⋈ rn

To see that this assertion is true, consider a tuple t in relation r. When we compute the relations r1, r2, ..., rn, the tuple t gives rise to one tuple ti in each ri, i = 1, 2, ..., n. These n tuples combine to regenerate t when we compute r1 ⋈ r2 ⋈ ... ⋈ rn. The details are left for you to complete as an exercise. Therefore, every tuple in r appears in r1 ⋈ r2 ⋈ ... ⋈ rn.

In general, r ≠ r1 ⋈ r2 ⋈ ... ⋈ rn. As an illustration, consider our earlier example, in which

n = 2.
R = Lending-schema.
R1 = Branch-customer-schema.
R2 = Customer-loan-schema.
r = the relation shown in Figure 7.1.
r1 = the relation shown in Figure 7.2.
r2 = the relation shown in Figure 7.10.
r1 ⋈ r2 = the relation shown in Figure 7.11.
To have a lossless-join decomposition, we need to impose constraints on the set of possible relations. We found that the decomposition of Lending-schema into Branch-schema and Loan-info-schema is lossless because the functional dependency

branch-name → branch-city assets

holds on Branch-schema. We shall show how to test whether a decomposition is a lossless-join decomposition in the next few sections. A major part of this chapter deals with the questions of how to specify constraints on the database, and how to obtain lossless-join decompositions that avoid the pitfalls represented by the examples of bad database designs that we have seen in this section.

Desirable Properties of Decomposition

We can use a given set of functional dependencies in designing a relational database in which most of the undesirable properties do not occur. When we design such systems, it may become necessary to decompose a relation into several smaller relations. In this section, we outline the desirable properties of a decomposition of a relational schema. In later sections, we outline specific ways of decomposing a relational schema to get the properties we desire. We illustrate our concepts with the Lending-schema:

Lending-schema = (branch-name, branch-city, assets, customer-name, loan-number, amount)

The set F of functional dependencies that we require to hold on Lending-schema is

branch-name → branch-city assets
loan-number → amount branch-name

Lending-schema is an example of a bad database design. Assume that we decompose it into the following three relation schemas:

Branch-schema = (branch-name, branch-city, assets)
Loan-schema = (loan-number, branch-name, amount)
Borrower-schema = (customer-name, loan-number)

We claim that this decomposition has several desirable properties, which we discuss next.

Lossless-join Decomposition

When we decompose a relation into a number of smaller relations, it is crucial that the decomposition be lossless. We must first present a criterion for determining whether a decomposition is lossy. Let R be a relation schema, and let F be a set of functional dependencies on R. Let R1 and R2 form a decomposition of R. This decomposition is a lossless-join decomposition of R if at least one of the following functional dependencies is in F+:

R1 ∩ R2 → R1
R1 ∩ R2 → R2

In other words, if R1 ∩ R2 forms a superkey of either R1 or R2, the decomposition of R is a lossless-join decomposition. We can use attribute closure to efficiently test for superkeys, as we have seen earlier.

We now demonstrate that our decomposition of Lending-schema is a lossless-join decomposition by showing a sequence of steps that generate the decomposition. We begin by decomposing Lending-schema into two schemas:

Branch-schema = (branch-name, branch-city, assets)
Loan-info-schema = (branch-name, customer-name, loan-number, amount)

Since branch-name → branch-city assets, the augmentation rule for functional dependencies implies that

branch-name → branch-name branch-city assets

Since Branch-schema ∩ Loan-info-schema = {branch-name}, it follows that our initial decomposition is a lossless-join decomposition.

Next, we decompose Loan-info-schema into

Loan-schema = (loan-number, branch-name, amount)
Borrower-schema = (customer-name, loan-number)

This step results in a lossless-join decomposition, since loan-number is a common attribute and loan-number → amount branch-name.
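The binary test above is mechanical once attribute closure is available. The following is a hedged Python sketch (not from the text; names are my own): the decomposition {R1, R2} is lossless if the closure of R1 ∩ R2 contains all of R1 or all of R2.

```python
def closure(attrs, fds):
    """Attribute closure of `attrs` under `fds` (pairs of attribute sets)."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result

def lossless_binary(r1, r2, fds):
    """Binary lossless-join test: is r1 & r2 a superkey of r1 or of r2?"""
    common_closure = closure(r1 & r2, fds)
    return r1 <= common_closure or r2 <= common_closure

F = [({'branch-name'}, {'branch-city', 'assets'}),
     ({'loan-number'}, {'amount', 'branch-name'})]

branch_schema = {'branch-name', 'branch-city', 'assets'}
loan_info_schema = {'branch-name', 'customer-name', 'loan-number', 'amount'}
print(lossless_binary(branch_schema, loan_info_schema, F))  # True

loan_schema = {'loan-number', 'branch-name', 'amount'}
borrower_schema = {'customer-name', 'loan-number'}
print(lossless_binary(loan_schema, borrower_schema, F))     # True

# The earlier lossy decomposition fails: customer-name determines nothing,
# so its closure covers neither side of the decomposition.
branch_customer = {'branch-name', 'branch-city', 'assets', 'customer-name'}
customer_loan = {'customer-name', 'loan-number', 'amount'}
print(lossless_binary(branch_customer, customer_loan, F))   # False
```

Note that the last call confirms why the Branch-customer / Customer-loan decomposition lost information: the shared attribute customer-name is not a superkey of either schema.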
For the general case of decomposition of a relation into multiple parts at once, the test for lossless-join decomposition is more complicated. See the bibliographical notes for references on the topic.

While the test for binary decomposition is clearly a sufficient condition for lossless join, it is a necessary condition only if all constraints are functional dependencies. We shall see other types of constraints later (in particular, a type of constraint called a multivalued dependency) that can ensure that a decomposition is lossless join even if no functional dependencies are present.

Second Normal Form

The general requirements of 2NF are:

Remove subsets of data that apply to multiple rows of a table and place them in separate tables.
Create relationships between these new tables and their predecessors through the use of foreign keys.

These rules can be summarized in a simple statement: 2NF attempts to reduce the amount of redundant data in a table by extracting it, placing it in new table(s), and creating relationships between those tables.

Let's look at an example. Imagine an online store that maintains customer information in a database. Its Customers table might look something like this:

CustNum   FirstName   LastName   Address           City         State   ZIP
1         John        Doe        12 Main Street    Sea Cliff    NY      11579
2         Alan        Johnson    82 Evergreen Tr   Sea Cliff    NY      11579
3         Beth        Thompson   1912 NE 1st St    Miami        FL      33157
4         Jacob       Smith      142 Irish Way     South Bend   IN      46637
5         Sue         Ryan       412 NE 1st St     Miami        FL      33157

A brief look at this table reveals a small amount of redundant data. We're storing the "Sea Cliff, NY 11579" and "Miami, FL 33157" entries twice each. Now, that might not seem like too much added storage in our simple example, but imagine the wasted space if we had thousands of rows in our table. Additionally, if the ZIP code for Sea Cliff were to change, we'd need to make that change in many places throughout the database.
In a 2NF-compliant database structure, this redundant information is extracted and stored in a separate table. Our new table (let's call it ZIPs) might look like this:

ZIP     City         State
11579   Sea Cliff    NY
33157   Miami        FL
46637   South Bend   IN

If we want to be super-efficient, we can even fill this table in advance: the post office provides a directory of all valid ZIP codes and their city/state relationships. Surely, you've encountered a situation where this type of database was utilized. Someone taking an order might have asked you for your ZIP code first and then known the city and state you were calling from. This type of arrangement reduces operator error and increases efficiency.

Now that we've removed the duplicative data from the Customers table, we've satisfied the first rule of second normal form. We still need to use a foreign key to tie the two tables together. We'll use the ZIP code (the primary key from the ZIPs table) to create that relationship. Here's our new Customers table:

CustNum   FirstName   LastName   Address           ZIP
1         John        Doe        12 Main Street    11579
2         Alan        Johnson    82 Evergreen Tr   11579
3         Beth        Thompson   1912 NE 1st St    33157
4         Jacob       Smith      142 Irish Way     46637
5         Sue         Ryan       412 NE 1st St     33157
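The extraction step just shown can be sketched in a few lines. This is a small Python illustration (not from the text; table and variable names are my own): we pull the ZIP → (City, State) facts out of Customers into a ZIPs table, leaving ZIP behind as a foreign key.

```python
# Split the denormalized Customers table into ZIPs + slim Customers.

customers = [
    (1, "John",  "Doe",      "12 Main Street",  "Sea Cliff",  "NY", "11579"),
    (2, "Alan",  "Johnson",  "82 Evergreen Tr", "Sea Cliff",  "NY", "11579"),
    (3, "Beth",  "Thompson", "1912 NE 1st St",  "Miami",      "FL", "33157"),
    (4, "Jacob", "Smith",    "142 Irish Way",   "South Bend", "IN", "46637"),
    (5, "Sue",   "Ryan",     "412 NE 1st St",   "Miami",      "FL", "33157"),
]

# ZIPs table: one row per distinct ZIP code.
zips = {z: (city, state) for (_, _, _, _, city, state, z) in customers}

# Slim Customers table: ZIP stays behind as the foreign key.
customers_2nf = [(num, first, last, addr, z)
                 for (num, first, last, addr, _, _, z) in customers]

print(len(zips), "ZIP rows;", len(customers_2nf), "customer rows")

# Updating Sea Cliff's city/state data is now a single update, not two.
zips["11579"] = ("Sea Cliff", "NY")
```

The five duplicated city/state pairs collapse to three ZIP rows, which is exactly the redundancy the 2NF extraction removes.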

We've now minimized the amount of redundant information stored within the database, and our structure is in second normal form.

Points to Ponder

A bad database design suggests that we should decompose a relation schema that has many attributes into several schemas with fewer attributes.
When we decompose a relation into a number of smaller relations, it is crucial that the decomposition be lossless.
We can use a given set of functional dependencies in designing a relational database in which most of the undesirable properties do not occur.
2NF attempts to reduce the amount of redundant data in a table by extracting it, placing it in new table(s), and creating relationships between those tables.
2NF removes subsets of data that apply to multiple rows of a table and places them in separate tables.

Review Terms

Decomposition
Properties of decomposition
Lossless join decomposition
2NF

Students Activity

1. Define decomposition? Why is it required?
2. Define the desirable properties of decomposition?
3. Define lossless-join decomposition?
4. Define 2NF with the help of an example?


LESSON 37: NORMALISATION - III

Lesson Objectives

BCNF
3NF
Comparison of BCNF and 3NF
4NF

Boyce-Codd Normal Form

Using functional dependencies, we can define several normal forms that represent good database designs.

Definition

One of the more desirable normal forms that we can obtain is Boyce-Codd normal form (BCNF). A relation schema R is in BCNF with respect to a set F of functional dependencies if, for all functional dependencies in F+ of the form α → β, where α ⊆ R and β ⊆ R, at least one of the following holds:

α → β is a trivial functional dependency (that is, β ⊆ α).
α is a superkey for schema R.

A database design is in BCNF if each member of the set of relation schemas that constitutes the design is in BCNF.

As an illustration, consider the following relation schemas and their respective functional dependencies:

Customer-schema = (customer-name, customer-street, customer-city)
customer-name → customer-street customer-city

Branch-schema = (branch-name, assets, branch-city)
branch-name → assets branch-city

Loan-info-schema = (branch-name, customer-name, loan-number, amount)
loan-number → amount branch-name

We claim that Customer-schema is in BCNF. We note that a candidate key for the schema is customer-name. The only nontrivial functional dependencies that hold on Customer-schema have customer-name on the left side of the arrow. Since customer-name is a candidate key, functional dependencies with customer-name on the left side do not violate the definition of BCNF. Similarly, it can be shown easily that the relation schema Branch-schema is in BCNF.

The schema Loan-info-schema, however, is not in BCNF. First, note that loan-number is not a superkey for Loan-info-schema, since we could have a pair of tuples representing a single loan made to two people; for example,

(Downtown, John Bell, L-44, 1000)
(Downtown, Jane Bell, L-44, 1000)

Because we did not list functional dependencies that rule out the preceding case, loan-number is not a candidate key.
However, the functional dependency loan-number → amount is nontrivial. Therefore, Loan-info-schema does not satisfy the definition of BCNF.

We claim that Loan-info-schema is not in a desirable form, since it suffers from the problem of repetition of information. We observe that, if there are several customer names associated with a loan, in a relation on Loan-info-schema, then we are forced to repeat the branch name and the amount once for each customer. We can eliminate this redundancy by redesigning our database such that all schemas are in BCNF. One approach to this problem is to take the existing non-BCNF design as a starting point, and to decompose those schemas that are not in BCNF. Consider the decomposition of Loan-info-schema into two schemas:

Loan-schema = (loan-number, branch-name, amount)
Borrower-schema = (customer-name, loan-number)

This decomposition is a lossless-join decomposition. To determine whether these schemas are in BCNF, we need to determine what functional dependencies apply to them. In this example, it is easy to see that

loan-number → amount branch-name

applies to Loan-schema, and that only trivial functional dependencies apply to Borrower-schema. Although loan-number is not a superkey for Loan-info-schema, it is a candidate key for Loan-schema. Thus, both schemas of our decomposition are in BCNF.

It is now possible to avoid redundancy in the case where there are several customers associated with a loan. There is exactly one tuple for each loan in the relation on Loan-schema, and one tuple for each customer of each loan in the relation on Borrower-schema. Thus, we do not have to repeat the branch name and the amount once for each customer associated with a loan.
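The BCNF checks made by hand above can be automated with attribute closure. The following is a hedged Python sketch (not from the text; names are my own). It examines only the given dependencies whose attributes fall within a schema, which suffices for the original relation but, as this lesson warns further on, can miss violations in decomposed relations, where dependencies in F+ must also be considered.

```python
def closure(attrs, fds):
    """Attribute closure of `attrs` under `fds` (pairs of attribute sets)."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result

def is_bcnf(schema, fds):
    for lhs, rhs in fds:
        if not (lhs | rhs) <= schema:
            continue                 # dependency does not apply to this schema
        if rhs <= lhs:
            continue                 # trivial functional dependency
        if not schema <= closure(lhs, fds):
            return False             # left side is not a superkey: violation
    return True

F = [({'customer-name'}, {'customer-street', 'customer-city'}),
     ({'loan-number'}, {'amount', 'branch-name'})]

customer_schema = {'customer-name', 'customer-street', 'customer-city'}
loan_info_schema = {'branch-name', 'customer-name', 'loan-number', 'amount'}
loan_schema = {'loan-number', 'branch-name', 'amount'}
borrower_schema = {'customer-name', 'loan-number'}

print(is_bcnf(customer_schema, F))   # True
print(is_bcnf(loan_info_schema, F))  # False: loan-number is not a superkey
print(is_bcnf(loan_schema, F))       # True
print(is_bcnf(borrower_schema, F))   # True: only trivial FDs apply
```

The results match the discussion: Loan-info-schema fails, and both schemas of its decomposition pass.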
Often, testing of a relation to see if it satisfies BCNF can be simplified:

To check if a nontrivial dependency α → β causes a violation of BCNF, compute α+ (the attribute closure of α), and verify that it includes all attributes of R; that is, verify that it is a superkey of R.

To check if a relation schema R is in BCNF, it suffices to check only the dependencies in the given set F for violation of BCNF, rather than check all dependencies in F+. We can show that if none of the dependencies in F causes a violation of BCNF, then none of the dependencies in F+ will cause a violation of BCNF either.

Unfortunately, the latter procedure does not work when a relation is decomposed. That is, it does not suffice to use F when we test a relation Ri, in a decomposition of R, for violation of BCNF. For example, consider relation schema R = (A, B, C, D, E), with functional dependencies F containing A → B and BC → D. Suppose this were decomposed into R1 = (A, B) and R2 = (A, C, D, E). Now, neither of the dependencies in F contains only attributes from (A, C, D, E), so we might be

misled into thinking that R2 satisfies BCNF. In fact, there is a dependency AC → D in F+ (which can be inferred using the pseudotransitivity rule from the two dependencies in F), which shows that R2 is not in BCNF. Thus, we may need a dependency that is in F+, but is not in F, to show that a decomposed relation is not in BCNF.

An alternative BCNF test is sometimes easier than computing every dependency in F+. To check if a relation Ri in a decomposition of R is in BCNF, we apply this test: for every subset α of attributes in Ri, check that α+ (the attribute closure of α under F) either includes no attribute of Ri − α, or includes all attributes of Ri. If the condition is violated by some set of attributes α in Ri, consider the following functional dependency, which can be shown to be present in F+:

α → (α+ − α) ∩ Ri

The above dependency shows that Ri violates BCNF, and is a witness for the violation. The BCNF decomposition algorithm, which we shall see in Section 7.6.2, makes use of the witness.

Third Normal Form

As we saw earlier, there are relational schemas where a BCNF decomposition cannot be dependency preserving. For such schemas, we have two alternatives if we wish to check if an update violates any functional dependencies:

Pay the extra cost of computing joins to test for violations.
Use an alternative decomposition, third normal form (3NF), which we present below, which makes testing of updates cheaper. Unlike BCNF, 3NF decompositions may contain some redundancy in the decomposed schema.

We shall see that it is always possible to find a lossless-join, dependency-preserving decomposition that is in 3NF. Which of the two alternatives to choose is a design decision to be made by the database designer on the basis of the application requirements.

Definition

BCNF requires that all nontrivial dependencies be of the form α → β, where α is a superkey. 3NF relaxes this constraint slightly by allowing nontrivial functional dependencies whose left side is not a superkey.
A relation schema R is in third normal form (3NF) with respect to a set F of functional dependencies if, for all functional dependencies in F+ of the form α → β, where α ⊆ R and β ⊆ R, at least one of the following holds:

α → β is a trivial functional dependency.
α is a superkey for R.
Each attribute A in β − α is contained in a candidate key for R.

Note that the third condition above does not say that a single candidate key should contain all the attributes in β − α; each attribute A in β − α may be contained in a different candidate key.

The first two alternatives are the same as the two alternatives in the definition of BCNF. The third alternative of the 3NF definition seems rather unintuitive, and it is not obvious why it is useful. It represents, in some sense, a minimal relaxation of the BCNF conditions that helps ensure that every schema has a dependency-preserving decomposition into 3NF. Its purpose will become clearer later, when we study decomposition into 3NF.

Observe that any schema that satisfies BCNF also satisfies 3NF, since each of its functional dependencies would satisfy one of the first two alternatives. BCNF is therefore a more restrictive constraint than 3NF. The definition of 3NF allows certain functional dependencies that are not allowed in BCNF. A dependency α → β that satisfies only the third alternative of the 3NF definition is not allowed in BCNF, but is allowed in 3NF.

Let us return to our Banker-schema example (Section 7.6). We have shown that this relation schema does not have a dependency-preserving, lossless-join decomposition into BCNF. This schema, however, turns out to be in 3NF. To see that it is, we note that {customer-name, branch-name} is a candidate key for Banker-schema, so the only attribute not contained in a candidate key for Banker-schema is banker-name. The only nontrivial functional dependencies of the form α → banker-name include {customer-name, branch-name} as part of α.
Since {customer-name, branch-name} is a candidate key, these dependencies do not violate the definition of 3NF.

As an optimization when testing for 3NF, we can consider only functional dependencies in the given set F, rather than in F+. Also, we can decompose the dependencies in F so that their right-hand sides consist of only single attributes, and use the resultant set in place of F.

Given a dependency α → β, we can use the same attribute-closure-based technique that we used for BCNF to check if α is a superkey. If α is not a superkey, we have to verify whether each attribute in β is contained in a candidate key of R; this test is rather more expensive, since it involves finding candidate keys. In fact, testing for 3NF has been shown to be NP-hard; thus, it is very unlikely that there is a polynomial-time algorithm for the task.

Comparison of BCNF and 3NF

Of the two normal forms for relational-database schemas, 3NF and BCNF, there are advantages to 3NF in that we know that it is always possible to obtain a 3NF design without sacrificing a lossless join or dependency preservation. Nevertheless, there are disadvantages to 3NF: if we do not eliminate all transitive schema dependencies, we may have to use null values to represent some of the possible meaningful relationships among data items, and there is the problem of repetition of information.

As an illustration of the null value problem, consider again the Banker-schema and its associated functional dependencies. Since banker-name → branch-name, we may want to represent relationships between values for banker-name and values for branch-name in our database. If we are to do so, however, either there must be a corresponding value for customer-name, or we must use a null value for the attribute customer-name.
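Although 3NF testing is NP-hard in general, a tiny schema like Banker-schema can be checked by brute force. The following is a hedged Python sketch (not from the text; names are my own, and the candidate-key enumeration is exponential, feasible only for small schemas): enumerate the minimal candidate keys, then apply the three-part 3NF definition.

```python
from itertools import combinations

def closure(attrs, fds):
    """Attribute closure of `attrs` under `fds` (pairs of attribute sets)."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result

def candidate_keys(schema, fds):
    """All minimal attribute sets whose closure is the whole schema."""
    keys = []
    for size in range(1, len(schema) + 1):
        for combo in combinations(sorted(schema), size):
            s = set(combo)
            # Keep s only if it is a superkey and no smaller key is inside it.
            if schema <= closure(s, fds) and not any(k < s for k in keys):
                keys.append(s)
    return keys

def is_3nf(schema, fds):
    keys = candidate_keys(schema, fds)
    prime = set().union(*keys)       # attributes appearing in a candidate key
    for lhs, rhs in fds:
        if rhs <= lhs:
            continue                 # trivial dependency
        if schema <= closure(lhs, fds):
            continue                 # left side is a superkey
        if not (rhs - lhs) <= prime:
            return False             # third 3NF alternative fails
    return True

banker_schema = {'branch-name', 'customer-name', 'banker-name'}
F = [({'banker-name'}, {'branch-name'}),
     ({'customer-name', 'branch-name'}, {'banker-name'})]

print(is_3nf(banker_schema, F))  # True, although the schema is not in BCNF
```

As in the text, banker-name → branch-name has a non-superkey left side, but branch-name lies in the candidate key {customer-name, branch-name}, so the third 3NF alternative is satisfied.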

Customer-name   Banker-name   Branch-name
Jones           Johnson       Perryridge
Smith           Johnson       Perryridge
Hayes           Johnson       Perryridge
Jackson         Johnson       Perryridge
Curry           Johnson       Perryridge
Turner          Johnson       Perryridge

An instance of Banker-schema.

As an illustration of the repetition-of-information problem, consider the instance of Banker-schema above. Notice that the information indicating that Johnson is working at the Perryridge branch is repeated.

Recall that our goals of database design with functional dependencies are:

1. BCNF
2. Lossless join
3. Dependency preservation

Since it is not always possible to satisfy all three, we may be forced to choose between BCNF and dependency preservation with 3NF.

It is worth noting that SQL does not provide a way of specifying functional dependencies, except for the special case of declaring superkeys by using the primary key or unique constraints. It is possible, although a little complicated, to write assertions that enforce a functional dependency; unfortunately, testing the assertions would be very expensive in most database systems. Thus, even if we had a dependency-preserving decomposition, if we use standard SQL we would not be able to efficiently test a functional dependency whose left-hand side is not a key.

Although testing functional dependencies may involve a join if the decomposition is not dependency preserving, we can reduce the cost by using materialized views, which many database systems support. Given a BCNF decomposition that is not dependency preserving, we consider each dependency in a minimum cover Fc that is not preserved in the decomposition. For each such dependency α → β, we define a materialized view that computes a join of all relations in the decomposition, and projects the result on αβ. The functional dependency can easily be tested on the materialized view, by means of a constraint unique (α).
On the negative side, there is a space and time overhead due to the materialized view, but on the positive side, the application programmer need not worry about writing code to keep redundant data consistent on updates; it is the job of the database system to maintain the materialized view, that is, to keep it up to date when the database is updated. Thus, in case we are not able to get a dependency-preserving BCNF decomposition, it is generally preferable to opt for BCNF, and to use techniques such as materialized views to reduce the cost of checking functional dependencies.

Fourth Normal Form

Some relation schemas, even though they are in BCNF, do not seem to be sufficiently normalized, in the sense that they still suffer from the problem of repetition of information. Consider again our banking example. Assume that, in an alternative design for the bank database schema, we have the schema

BC-schema = (loan-number, customer-name, customer-street, customer-city)

The astute reader will recognize this schema as a non-BCNF schema because of the functional dependency

customer-name → customer-street customer-city

that we asserted earlier, and because customer-name is not a key for BC-schema. However, assume that our bank is attracting wealthy customers who have several addresses (say, a winter home and a summer home). Then, we no longer wish to enforce the functional dependency customer-name → customer-street customer-city. If we remove this functional dependency, we find BC-schema to be in BCNF with respect to our modified set of functional dependencies. Yet, even though BC-schema is now in BCNF, we still have the problem of repetition of information that we had earlier.

To deal with this problem, we must define a new form of constraint, called a multivalued dependency. As we did for functional dependencies, we shall use multivalued dependencies to define a normal form for relation schemas. This normal form, called fourth normal form (4NF), is more restrictive than BCNF.
We shall see that every 4NF schema is also in BCNF, but there are BCNF schemas that are not in 4NF.

Multivalued Dependencies

Functional dependencies rule out certain tuples from being in a relation. If A → B, then we cannot have two tuples with the same A value but different B values. Multivalued dependencies, on the other hand, do not rule out the existence of certain tuples. Instead, they require that other tuples of a certain form be present in the relation. For this reason, functional dependencies are sometimes referred to as equality-generating dependencies, and multivalued dependencies are referred to as tuple-generating dependencies.

Let R be a relation schema and let α ⊆ R and β ⊆ R. The multivalued dependency α →→ β holds on R if, in any legal relation r(R), for all pairs of tuples t1 and t2 in r such that t1[α] = t2[α], there exist tuples t3 and t4 in r such that

t1[α] = t2[α] = t3[α] = t4[α]
t3[β] = t1[β]
t3[R − β] = t2[R − β]
t4[β] = t2[β]
t4[R − β] = t1[R − β]

Tuple   α             β             R − α − β
t1      a1 ... ai     ai+1 ... aj   aj+1 ... an
t2      a1 ... ai     bi+1 ... bj   bj+1 ... bn
t3      a1 ... ai     ai+1 ... aj   bj+1 ... bn
t4      a1 ... ai     bi+1 ... bj   aj+1 ... an

Tabular representation of α →→ β.

This definition is less complicated than it appears to be. Figure 7.16 gives a tabular picture of t1, t2, t3 and t4. Intuitively, the multivalued dependency α →→ β says that the relationship between α and β is independent of the relationship between α and R − β. If the multivalued dependency α →→ β is satisfied by all relations on schema R, then α →→ β is a trivial multivalued dependency on schema R. Thus, α →→ β is trivial if β ⊆ α or α ∪ β = R. To illustrate the difference between functional and multivalued dependencies, we consider the BC-schema again, and the relation bc(BC-schema). We must repeat the loan number once for each address a customer has, and we must repeat the address for each loan a customer has. This repetition is unnecessary, since the relationship between a customer and his address is independent of the relationship between that customer and a loan. If a customer (say, Smith) has a loan (say, loan number L-23), we want that loan to be associated with all of Smith's addresses. Thus, the relation of Figure 7.18 is illegal. To make this relation legal, we need to add the tuples (L-23, Smith, Main, Manchester) and (L-27, Smith, North, Rye) to the bc relation. Comparing the preceding example with our definition of multivalued dependency, we see that we want the multivalued dependency customer-name →→ customer-street customer-city to hold. (The multivalued dependency customer-name →→ loan-number will do as well. We shall soon see that they are equivalent.) As with functional dependencies, we shall use multivalued dependencies in two ways: 1. To test relations to determine whether they are legal under a given set of functional and multivalued dependencies 2.
To specify constraints on the set of legal relations; we shall thus concern ourselves with only those relations that satisfy a given set of functional and multivalued dependencies

Relation bc: an example of redundancy in a BCNF relation:

    loan-number   customer-name   customer-street   customer-city
    L-23          Smith           North             Rye
    L-23          Smith           Main              Manchester
    L-93          Curry           Lake              Horseneck

An illegal bc relation:

    loan-number   customer-name   customer-street   customer-city
    L-23          Smith           North             Rye
    L-27          Smith           Main              Manchester

Note that, if a relation r fails to satisfy a given multivalued dependency, we can construct a relation r′ that does satisfy the multivalued dependency by adding tuples to r. Let D denote a set of functional and multivalued dependencies. The closure D+ of D is the set of all functional and multivalued dependencies logically implied by D. As we did for functional dependencies, we can compute D+ from D, using the formal definitions of functional dependencies and multivalued dependencies. We can manage with such reasoning for very simple multivalued dependencies. Luckily, multivalued dependencies that occur in practice appear to be quite simple. For complex dependencies, it is better to reason about sets of dependencies by using a system of inference rules. From the definition of multivalued dependency, we can derive the following rule: if α → β, then α →→ β. In other words, every functional dependency is also a multivalued dependency. Definition of Fourth Normal Form Consider again our BC-schema example in which the multivalued dependency customer-name →→ customer-street customer-city holds, but no nontrivial functional dependencies hold. We saw in the opening paragraphs of Section 7.8 that, although BC-schema is in BCNF, the design is not ideal, since we must repeat a customer's address information for each loan. We shall see that we can use the given multivalued dependency to improve the database design, by decomposing BC-schema into a fourth normal form decomposition.
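The intended decomposition splits BC-schema into an address schema (customer-name, customer-street, customer-city) and a loan schema (customer-name, loan-number). A quick Python sketch on the toy bc data (the helper functions are illustrative, not from the text) shows that the projections are smaller and join back losslessly:

```python
def project(r, attrs):
    """Project a relation (list of dicts) onto attrs, dropping duplicates."""
    out = []
    for t in r:
        p = {a: t[a] for a in attrs}
        if p not in out:
            out.append(p)
    return out

def natural_join(r1, r2):
    """Natural join on the attributes the two relations share."""
    common = set(r1[0]) & set(r2[0])
    return [{**t1, **t2} for t1 in r1 for t2 in r2
            if all(t1[a] == t2[a] for a in common)]

# Legal bc relation: Smith has two addresses and two loans (4 tuples).
bc = [{"loan": l, "name": "Smith", "street": s, "city": c}
      for l in ("L-23", "L-27")
      for (s, c) in (("North", "Rye"), ("Main", "Manchester"))]

addresses = project(bc, ["name", "street", "city"])   # 2 tuples, not 4
loans = project(bc, ["name", "loan"])                 # 2 tuples, not 4
rejoined = natural_join(addresses, loans)

assert len(addresses) == 2 and len(loans) == 2
assert len(rejoined) == 4 and all(t in bc for t in rejoined)
```

The redundancy disappears: each address and each loan is stored once, yet the join regenerates exactly the legal bc relation.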
A relation schema R is in fourth normal form (4NF) with respect to a set D of functional and multivalued dependencies if, for all multivalued dependencies in D+ of the form α →→ β, where α ⊆ R and β ⊆ R, at least one of the following holds: α →→ β is a trivial multivalued dependency; α is a superkey for schema R. A database design is in 4NF if each member of the set of relation schemas that constitutes the design is in 4NF. Note that the definition of 4NF differs from the definition of BCNF in only the use of multivalued dependencies instead of functional dependencies. Every 4NF schema is in BCNF. To see this fact, we note that, if a schema R is not in BCNF, then there is a nontrivial functional dependency α → β holding on R, where α is not a superkey. Since α → β implies α →→ β, R cannot be in 4NF.

4NF decomposition algorithm:

    result := {R}; done := false;
    compute D+; given schema Ri, let Di denote the restriction of D+ to Ri
    while (not done) do
        if (there is a schema Ri in result that is not in 4NF w.r.t. Di)
            then begin
                let α →→ β be a nontrivial multivalued dependency that holds
                    on Ri such that α → Ri is not in Di, and α ∩ β = ∅;
                result := (result − Ri) ∪ (Ri − β) ∪ (α, β);
            end
        else done := true;

Let R be a relation schema, and let R1, R2, ..., Rn be a decomposition of R. To check if each relation schema Ri in the decomposition is in 4NF, we need to find what multivalued dependencies hold on each Ri. Recall that, for a set F of functional dependencies, the restriction Fi of F to Ri is all functional dependencies in F+ that include only attributes of Ri. Now consider a set D of both functional and multivalued

dependencies. The restriction of D to Ri is the set Di consisting of 1. All functional dependencies in D+ that include only attributes of Ri 2. All multivalued dependencies of the form α →→ β ∩ Ri, where α ⊆ Ri and α →→ β is in D+. Points to Ponder A relation schema R is in BCNF with respect to a set F of functional dependencies if, for all functional dependencies in F+ of the form α → β, either α → β is trivial or α is a superkey for R. Third normal form (3NF) is a slightly weaker normal form, which makes testing of updates cheaper. Multivalued dependencies do not rule out the existence of certain tuples. Instead, they require that other tuples of a certain form be present in the relation. Every 4NF schema is also in BCNF, but there are BCNF schemas that are not in 4NF. Review Terms BCNF 3NF Comparison of BCNF & 3NF 4NF Students Activity 1. Define BCNF? 2. Define 3NF? 3. Differentiate between BCNF & 3NF? 4. Define Multivalued dependencies? 5. Define 4NF?

Student Notes

DATABASE MANAGEMENT LESSON 38: FILE ORGANIZATION METHOD - I Lesson objectives Physical Storage Media Performance Measures of Disks Optimization of Disk-Block Access Fixed-Length Records Variable-Length Records Sequential File Organization Clustering File Organization Data Dictionary Storage Overview of Physical Storage Media 1. Several types of data storage exist in most computer systems. They vary in speed of access, cost per unit of data, and reliability. Cache: the most costly and fastest form of storage. Usually very small, and managed by the operating system. Main memory (MM): the storage area for data available to be operated on. General-purpose machine instructions operate on main memory. Contents of main memory are usually lost in a power failure or crash. Usually too small (even with megabytes) and too expensive to store the entire database. Flash memory: EEPROM (electrically erasable programmable read-only memory). Data in flash memory survives power failures. Reading data from flash memory takes about 10 nanoseconds (roughly as fast as from main memory), while writing is more complicated: a write takes about 4-10 microseconds, and to overwrite what has been written, one has to first erase the entire bank of the memory. It may support only a limited number of erase cycles (10^4 to 10^6). It has found popularity as a replacement for disks for storing small volumes of data (5-10 megabytes). Magnetic-disk storage: the primary medium for long-term storage. Typically the entire database is stored on disk. Data must be moved from disk to main memory in order for the data to be operated on. After operations are performed, data must be copied back to disk if any changes were made. Disk storage is called direct-access storage, as it is possible to read data on the disk in any order (unlike sequential access). Disk storage usually survives power failures and system crashes.
Optical storage: CD-ROM (compact-disk read-only memory), WORM (write-once read-many) disks (for archival storage of data), and jukeboxes (containing a few drives and numerous disks loaded on demand). Tape storage: used primarily for backup and archival data. Cheaper, but much slower access, since a tape must be read sequentially from the beginning. Used as protection from disk failures! 2. The storage-device hierarchy is presented, where the higher levels are expensive (cost per bit) and fast (access time), but the capacity is smaller. Storage-device hierarchy 3. Another classification: primary, secondary, and tertiary storage. 1. Primary storage: the fastest storage media, such as cache and main memory. 2. Secondary (or on-line) storage: the next level of the hierarchy, e.g., magnetic disks. 3. Tertiary (or off-line) storage: magnetic tapes and optical-disk jukeboxes. 4. Volatility of storage: Volatile storage loses its contents when the power is removed. Without power backup, data in volatile storage (the part of the hierarchy from main memory up) must be written to nonvolatile storage for safekeeping. Performance Measures of Disks The main measures of the qualities of a disk are capacity, access time, data transfer rate, and reliability.

1. Access time: the time from when a read or write request is issued to when data transfer begins. To access data on a given sector of a disk, the arm first must move so that it is positioned over the correct track, and then must wait for the sector to appear under it as the disk rotates. The time for repositioning the arm is called seek time, and it increases with the distance the arm must move. Typical seek times range from 2 to 30 milliseconds. Average seek time is the average of the seek times, measured over a sequence of (uniformly distributed) random requests, and it is about one third of the worst-case seek time. Once the seek has occurred, the time spent waiting for the sector to be accessed to appear under the head is called rotational latency time. Average rotational latency time is about half of the time for a full rotation of the disk. (Typical rotational speeds of disks range from 60 to 120 rotations per second.) The access time is then the sum of the seek time and the latency, and ranges from 10 to 40 milliseconds. 2. Data transfer rate: the rate at which data can be retrieved from or stored to the disk. Current disk systems support transfer rates from 1 to 5 megabytes per second. 3. Reliability: measured by the mean time to failure. The typical mean time to failure of disks today ranges from 30,000 to 800,000 hours (about 3.4 to 91 years). Optimization of Disk-Block Access 1. Data is transferred between disk and main memory in units called blocks. 2. A block is a contiguous sequence of bytes from a single track of one platter. 3. Block sizes range from 512 bytes to several thousand. 4. The lower levels of the file-system manager convert block addresses into the hardware-level cylinder, surface, and sector numbers. 5. Access to data on disk is several orders of magnitude slower than access to data in main memory. There are several optimization techniques, besides buffering of blocks in main memory:
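The access-time arithmetic above can be checked with a quick computation (the specific numbers are illustrative values picked from the quoted ranges, not measurements of any particular disk):

```python
# Illustrative figures from the ranges quoted above (assumptions, not specs).
worst_case_seek_ms = 30.0
avg_seek_ms = worst_case_seek_ms / 3                 # about one third of worst case
rotations_per_sec = 120
avg_latency_ms = (1000.0 / rotations_per_sec) / 2    # half a full rotation

# Access time = seek time + rotational latency.
access_time_ms = avg_seek_ms + avg_latency_ms
assert 10 <= access_time_ms <= 40                    # within the quoted 10-40 ms range
print(f"average access time ~ {access_time_ms:.2f} ms")
```

At 120 rotations per second a full rotation takes about 8.33 ms, so the average latency (half a rotation) is about 4.17 ms, and the total lands comfortably inside the 10-40 ms range quoted.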
Scheduling: If several blocks from a cylinder need to be transferred, we may save time by requesting them in the order in which they pass under the heads. A commonly used disk-arm scheduling algorithm is the elevator algorithm. File organization: Organize blocks on disk in a way that corresponds closely to the manner in which we expect data to be accessed. For example, store related information on the same track, on physically close tracks, or on adjacent cylinders, in order to minimize seek time. IBM mainframe OSs give programmers fine control over the placement of files, but increase the programmer's burden. UNIX and PC OSs hide disk organization from users. Over time, a sequential file may become fragmented. To reduce fragmentation, the system can make a backup copy of the data on disk and restore the entire disk. The restore operation writes back the blocks of each file contiguously (or nearly so). Some systems, such as MS-DOS, have utilities that scan the disk and then move blocks to decrease the fragmentation. Nonvolatile write buffers: Use nonvolatile RAM (such as battery-backed-up RAM) to speed up disk writes drastically (the write is first made to the nonvolatile RAM buffer, and the OS is informed that the write has completed). Log disk: Another approach to reducing write latency is to use a log disk, a disk devoted to writing a sequential log. All access to the log disk is sequential, essentially eliminating seek time, and several consecutive blocks can be written at once, making writes to the log disk several times faster than random writes. File Organization 1. A file is organized logically as a sequence of records. 2. Records are mapped onto disk blocks. 3. Files are provided as a basic construct in operating systems, so we assume the existence of an underlying file system. 4. Blocks are of a fixed size determined by the operating system. 5. Record sizes vary. 6. In a relational database, tuples of distinct relations may be of different sizes. 7.
One approach to mapping the database to files is to store records of one length in a given file. 8. An alternative is to structure files to accommodate variable-length records. (Fixed-length is easier to implement.) Fixed-Length Records 1. Consider a file of deposit records of the form:

    type deposit = record
        bname    : char(22);
        account# : char(10);
        balance  : real;
    end

If we assume that each character occupies one byte, an integer occupies 4 bytes, and a real 8 bytes, our deposit record is 40 bytes long. The simplest approach is to use the first 40 bytes for the first record, the next 40 bytes for the second, and so on. However, there are two problems with this approach. It is difficult to delete a record from this structure: the space occupied must somehow be reused, or we need to mark deleted records so that they can be ignored. Unless the block size is a multiple of 40, some records will cross block boundaries; it would then require two block accesses to read or write such a record. 2. When a record is deleted, we could move all successive records up one, which may require moving a lot of records.
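The 40-byte layout above can be reproduced with Python's struct module (field widths taken from the record definition; the "=" prefix disables native alignment padding so the size comes out exactly as computed):

```python
import struct

# char(22) + char(10) + 8-byte real = 40 bytes; "=" means no alignment padding.
deposit = struct.Struct("=22s10sd")
assert deposit.size == 40

# Shorter byte strings are automatically null-padded to the field width.
rec = deposit.pack(b"Perryridge", b"A-102", 400.0)
bname, account, balance = deposit.unpack(rec)
assert bname.rstrip(b"\x00") == b"Perryridge" and balance == 400.0
```

Packing every record with the same Struct is exactly the "first 40 bytes, next 40 bytes, ..." scheme: record n lives at byte offset 40*n in the file.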

We could instead move the last record into the hole created by the deleted record. This changes the order the records are in. It turns out to be undesirable to move records to occupy freed space, as moving requires block accesses. Also, insertions tend to be more frequent than deletions, so it is acceptable to leave the space open and wait for a subsequent insertion. This leads to a need for additional structure in our file design. 3. So one solution is: At the beginning of the file, allocate some bytes as a file header. For now, this header need only store the address of the first record whose contents are deleted. This first record can then store the address of the second available record, and so on. To insert a new record, we use the record pointed to by the header, and change the header pointer to the next available record. If no deleted records exist, we add our new record to the end of the file. 4. Note: Use of pointers requires careful programming. If a record pointed to is moved or deleted, and that pointer is not corrected, the pointer becomes a dangling pointer. Records pointed to are called pinned. 5. Fixed-length file insertions and deletions are relatively simple because one size fits all. For variable-length records, this is not the case. Variable-Length Records Variable-length records arise in a database in several ways: Storage of multiple record types in a file Record types allowing variable field sizes Record types allowing repeating fields Organization of Records in Files There are several ways of organizing records in files. Heap file organization: Any record can be placed anywhere in the file where there is space for the record. There is no ordering of records. Sequential file organization: Records are stored in sequential order, based on the value of the search key of each record. Hashing file organization: A hash function is computed on some attribute of each record.
The result of the function specifies in which block of the file the record should be placed (to be discussed in Chapter 11, since it is closely related to the indexing structure). Clustering file organization: Records of several different relations can be stored in the same file. Related records of the different relations are stored on the same block, so that one I/O operation fetches related records from all the relations. Sequential File Organization 1. A sequential file is designed for efficient processing of records in sorted order on some search key. Records are chained together by pointers to permit fast retrieval in search-key order; each pointer points to the next record in order. Records are stored physically in search-key order (or as close to this as possible). This minimizes the number of block accesses. 2. It is difficult to maintain physical sequential order as records are inserted and deleted. Deletion can be managed with the pointer chains. Insertion poses problems if there is no space where the new record should go. If there is space, use it; else put the new record in an overflow block and adjust the pointers accordingly. Problem: we now have some records out of physical sequential order. If very few records are in overflow blocks, this will work well. If order is lost, reorganize the file. Reorganizations are expensive and are done when the system load is low. 3. If insertions rarely occur, we could keep the file in physically sorted order and reorganize when an insertion occurs. In this case, the pointer fields are no longer required. Clustering File Organization 1. One relation per file, with fixed-length records, is good for small databases, and also reduces code size. 2. Many large-scale DB systems do not rely directly on the underlying operating system for file management. One large OS file is allocated to the DB system, and all relations are stored in one file. 3.
To efficiently execute queries involving a join of customer and depositor, one may store the depositor tuples for each cname near the customer tuple for the corresponding cname. 4. This structure mixes together tuples from two relations, but allows for efficient processing of the join. 5. If a customer has many accounts which cannot fit in one block, the remaining records appear on nearby blocks. This file structure, called clustering, allows us to read many of the required records using one block read. 6. Our use of clustering enhances the processing of a particular join, but may result in slow processing of other types of queries, such as selection on customer. For example, the query

    select * from customer

now requires more block accesses, as our customer relation is now interspersed with the deposit relation.

7. Thus it is a trade-off, depending on the types of query that the database designer believes to be most frequent. Careful use of clustering may produce significant performance gains. Data Dictionary Storage 1. The database also needs to store information about the relations, known as the data dictionary. This includes: Names of relations. Names of attributes of relations. Domains and lengths of attributes. Names and definitions of views. Integrity constraints (e.g., key constraints). plus data on the system users: Names of authorized users. Accounting information about users. plus (possibly) statistical and descriptive data: Number of tuples in each relation. Method of storage used for each relation (e.g., clustered or non-clustered). 2. When we look at indices, we'll also see a need to store information about each index on each relation: Name of the index. Name of the relation being indexed. Attributes the index is on. Type of index. 3. This information is, in itself, a miniature database. We can use the database to store data about itself, simplifying the overall structure of the system and allowing the full power of the database to be used to permit fast access to system data. Points to Ponder Several types of data storage exist in most computer systems: main memory, cache, flash memory, magnetic-disk storage, optical storage, and tape storage. The main measures of the qualities of a disk are capacity, access time, data transfer rate, and reliability. A sequential file is designed for efficient processing of records in sorted order on some search key. A clustering structure mixes together tuples from two relations, but allows for efficient processing of the join. The database also needs to store information about the relations, known as the data dictionary. Review Terms Storage device Sequential file storage Clustering file storage Data dictionary storage Search key Students Activity 1. Define various types of storage media? 2. Define qualities of a disk? 3.
Define sequential file storage? 4. Define clustering file storage? 5. Define data dictionary storage?

Student Notes

LESSON 39: FILE ORGANISATION METHOD - II Lesson objectives Indexing & Hashing Ordered Indices Dense and Sparse Indices Multi-Level Indices Index Update Secondary Indices Static Hashing Hash functions Bucket overflow Dynamic hashing Indexing and Hashing 1. Many queries reference only a small proportion of the records in a file. For example, finding all records at the Perryridge branch only returns records where bname = Perryridge. 2. We should be able to locate these records directly, rather than having to read every record and check its branch-name. We then need extra file structuring. Basic Concepts 1. An index for a file works like a catalogue in a library. Cards in alphabetic order tell us where to find books by a particular author. 2. In real-world databases, indices like this might be too large to be efficient. We'll look at more sophisticated indexing techniques. 3. There are two kinds of indices. Ordered indices: indices are based on a sorted ordering of the values. Hash indices: indices are based on the values being distributed uniformly across a range of buckets. The bucket to which a value is assigned is determined by a function, called a hash function. 4. We will consider several indexing techniques. No one technique is the best. Each technique is best suited for a particular database application. 5. Methods will be evaluated on: 1. Access Types: types of access that are supported efficiently, e.g., value-based search or range search. 2. Access Time: time to find a particular data item or set of items. 3. Insertion Time: time taken to insert a new data item (includes time to find the right place to insert). 4. Deletion Time: time to delete an item (includes time taken to find the item, as well as to update the index structure). 5. Space Overhead: additional space occupied by an index structure. 6. We may have more than one index or hash function for a file. (The library may have card catalogues by author, subject or title.) 7.
The attribute or set of attributes used to look up records in a file is called the search key (not to be confused with primary key, etc.). Ordered Indices 1. In order to allow fast random access, an index structure may be used. 2. A file may have several indices, on different search keys. 3. If the file containing the records is sequentially ordered, the index whose search key specifies the sequential order of the file is the primary index, or clustering index. Note: The search key of a primary index is usually the primary key, but it is not necessarily so. 4. Indices whose search key specifies an order different from the sequential order of the file are called secondary indices, or nonclustering indices. Dense and Sparse Indices 1. There are two types of ordered indices: Dense index: An index record appears for every search-key value in the file. This record contains the search-key value and a pointer to the actual record. Sparse index: Index records are created only for some of the records. To locate a record, we find the index record with the largest search-key value less than or equal to the search-key value we are looking for. We start at the record pointed to by the index record, and proceed along the pointers in the file (that is, sequentially) until we find the desired record.
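The sparse-index lookup rule just described (largest index entry less than or equal to the search key, then a sequential scan) can be sketched in Python. The branch names are the book's examples; the block layout itself is a toy assumption:

```python
from bisect import bisect_right

# Toy sorted file: each block holds records in search-key (bname) order.
blocks = [["Brighton", "Downtown"],      # block 0
          ["Mianus", "Perryridge"],      # block 1
          ["Redwood", "Round Hill"]]     # block 2

# Sparse index: one entry per block -- the block's first search-key value.
index_keys = [blk[0] for blk in blocks]

def sparse_lookup(key):
    """Largest index entry <= key, then scan that block sequentially."""
    pos = bisect_right(index_keys, key) - 1
    if pos < 0:
        return None                      # key precedes every index entry
    return key if key in blocks[pos] else None

assert sparse_lookup("Perryridge") == "Perryridge"
assert sparse_lookup("Zurich") is None
```

A dense index would instead hold one (key, pointer) entry per record, trading the extra space for a direct hit without the final scan.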

Dense index. 2. Notice how we would find records for the Perryridge branch using both methods. (Do it!) Sparse index. 3. Dense indices are faster in general, but sparse indices require less space and impose less maintenance for insertions and deletions. (Why?) 4. A good compromise: have a sparse index with one entry per block. Why is this good? The biggest cost is in bringing a block into main memory. We are guaranteed to have the correct block with this method, unless the record is on an overflow block (actually could be several blocks). The index size is still small. Multi-Level Indices 1. Even with a sparse index, the index size may still grow too large. For 100,000 records, 10 per block, at one index record per block, that's 10,000 index records! Even if we can fit 100 index records per block, this is 100 blocks. 2. If the index is too large to be kept in main memory, a search results in several disk reads. If there are no overflow blocks in the index, we can use binary search. This will read as many as ⌈log2(b)⌉ blocks (as many as 7 for our 100 blocks). If the index has overflow blocks, then sequential search is typically used, reading all b index blocks. 3. Solution: Construct a sparse index on the index. 4. Two-level sparse index. 5. Use binary search on the outer index. Scan the index block found until the correct index record is found. Use the index record as before: scan the block pointed to for the desired record. 6. For very large files, additional levels of indexing may be required. 7. Indices must be updated at all levels when insertions or deletions require it. 8. Frequently, each level of index corresponds to a unit of physical storage (e.g., indices at the level of track, cylinder and disk). Index Update Regardless of what form of index is used, every index must be updated whenever a record is either inserted into or deleted from the file. 1. Deletion: Find (look up) the record. If it is the last record with a particular search-key value, delete that search-key value from the index.
For dense indices, this is like deleting a record in a file. For sparse indices, delete a key value by replacing the key value's entry in the index by the next search-key value. If that value already has an index entry, delete the entry. 2. Insertion: Find the place to insert. Dense index: insert the search-key value if not present. Sparse index: no change unless a new block is created. (In this case, the first search-key value appearing in the new block is inserted into the index.) Secondary Indices 1. If the search key of a secondary index is not a candidate key, it is not enough to point to just the first record with each search-key value, because the remaining records with the same search-key value could be anywhere in the file. Therefore, a secondary index must contain pointers to all the records.
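Because matching records can be anywhere in the file, a secondary index on a non-candidate key keeps one bucket of record pointers per value. A minimal Python sketch (hypothetical records, using record numbers as pointers):

```python
# Toy file of records; cname is not a candidate key, so duplicates can sit
# anywhere in the file, in any physical order.
records = [{"cname": "Smith", "account": "A-101"},
           {"cname": "Peterson", "account": "A-102"},
           {"cname": "Smith", "account": "A-201"}]

# Dense secondary index: every cname value maps to a bucket of pointers.
secondary = {}
for ptr, rec in enumerate(records):
    secondary.setdefault(rec["cname"], []).append(ptr)

assert secondary["Smith"] == [0, 2]      # both Smith records, wherever they are
assert [records[p]["account"] for p in secondary["Peterson"]] == ["A-102"]
```

Each lookup follows every pointer in the value's bucket, which is why a query on a non-primary key may touch one block per pointer.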

Sparse secondary index on cname. 2. We can use an extra level of indirection to implement secondary indices on search keys that are not candidate keys. A pointer does not point directly to the file but to a bucket that contains pointers to the file. To perform a lookup on Peterson, we must read all three records pointed to by entries in bucket 2. Only one entry points to a Peterson record, but three records need to be read. As the file is not ordered physically by cname, this may take 3 block accesses. 3. Secondary indices must be dense, with an index entry for every search-key value, and a pointer to every record in the file. 4. Secondary indices improve the performance of queries on non-primary keys. 5. They also impose serious overhead on database modification: whenever a file is updated, every index must be updated. The designer must decide whether to use secondary indices or not. Static Hashing Index schemes force us to traverse an index structure. Hashing avoids this. Hash File Organization 1. Hashing involves computing the address of a data item by computing a function on the search-key value. 2. A hash function h is a function from the set of all search-key values K to the set of all bucket addresses B. We choose a number of buckets to correspond to the number of search-key values we will have stored in the database. To perform a lookup on a search-key value Ki, we compute h(Ki), and search the bucket with that address. If two search keys Ki and Kj map to the same address, because h(Ki) = h(Kj), then the bucket at the address obtained will contain records with both search-key values. In this case we will have to check the search-key value of every record in the bucket to get the ones we want. Insertion and deletion are simple. Hash Functions 1. A good hash function gives an average-case lookup that is a small constant, independent of the number of search keys. 2. We hope records are distributed uniformly among the buckets. 3.
The worst hash function maps all keys to the same bucket. 4. The best hash function maps all keys to distinct addresses. 5. Ideally, the distribution of keys to addresses is uniform and random. 6. Suppose we have 26 buckets, and map names beginning with the ith letter of the alphabet to the ith bucket. Problem: this does not give a uniform distribution. Many more names will be mapped to A than to X. Typical hash functions perform some operation on the internal binary machine representations of the characters in a key. For example, compute the sum, modulo the number of buckets, of the binary representations of the characters of the search key (using this method for 10 buckets, and assuming the ith character in the alphabet is represented by the integer i). Handling of Bucket Overflows 1. In closed hashing, a record that hashes to a full bucket is placed in an overflow bucket chained to it; to find a record, we compute the hash function and search the corresponding bucket and its overflow chain. 2. In open hashing, the set of buckets is fixed and there are no overflow chains; a record that hashes to a full bucket is placed in some other bucket within that fixed set. (Deletions are difficult.) Open hashing is not used much in database applications. 3. Drawback to our approach: The hash function must be chosen at implementation time, so the number of buckets is fixed, but the database may grow. If the number is too large, we waste space. If the number is too small, we get too many collisions, resulting in records of many search-key values being in the same bucket. Choosing the number to be twice the number of search-key values in the file gives a good space/performance tradeoff. Hash Indices 1. A hash index organizes the search keys with their associated pointers into a hash file structure. 2. We apply a hash function on a search key to identify a bucket, and store the key and its associated pointers in the bucket (or in overflow buckets). 3. Strictly speaking, hash indices are only secondary index structures, since if a file itself is organized using hashing, there is no need for a separate hash index structure on it.
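The character-sum hash function described above is a one-liner in Python (branch names from the book's examples; the bucket count of 10 follows the text):

```python
NBUCKETS = 10

def h(key):
    """Sum of the character codes of the key, modulo the number of buckets."""
    return sum(ord(c) for c in key) % NBUCKETS

buckets = [[] for _ in range(NBUCKETS)]
for bname in ["Perryridge", "Round Hill", "Downtown", "Redwood", "Brighton"]:
    buckets[h(bname)].append(bname)

assert all(0 <= h(b) < NBUCKETS for row in buckets for b in row)
assert sum(len(row) for row in buckets) == 5     # every record landed somewhere
```

Summing character codes is only a teaching device: it distributes short keys tolerably but is far from uniform and random, which is exactly the weakness the surrounding discussion warns about.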

Dynamic Hashing 1. As the database grows over time, we have three options: Choose a hash function based on the current file size. We get performance degradation as the file grows. Choose a hash function based on the anticipated file size. Space is wasted initially. Periodically reorganize the hash structure as the file grows. This requires selecting a new hash function, recomputing all addresses and generating new bucket assignments. Costly, and it shuts down the database. 2. Some hashing techniques allow the hash function to be modified dynamically to accommodate the growth or shrinking of the database. These are called dynamic hash functions. Extendable hashing is one form of dynamic hashing. Extendable hashing splits and coalesces buckets as the database size changes. This imposes some performance overhead, but space efficiency is maintained. As reorganization is on one bucket at a time, the overhead is acceptably low. 3. How does it work? General extendable hash structure. We choose a hash function that is uniform and random and that generates values over a relatively large range. The range is the b-bit binary integers (typically b = 32). 2^32 is over 4 billion, so we don't generate that many buckets! Instead we create buckets on demand, and do not use all b bits of the hash initially. At any point we use i bits, where 0 ≤ i ≤ b. The i bits are used as an offset into a table of bucket addresses. The value of i grows and shrinks with the database. Note that the i appearing over the bucket address table tells how many bits are required to determine the correct bucket. It may be the case that several entries point to the same bucket. All such entries will have a common hash prefix, but the length of this prefix may be less than i. So we give each bucket j an integer ij giving the length of its common hash prefix. The number of bucket-address-table entries pointing to bucket j is then 2^(i − ij). 4. To find the bucket containing search-key value Ki: Compute h(Ki). Take the first i high-order bits of h(Ki).
- Look at the corresponding table entry for this i-bit string.
- Follow the bucket pointer in the table entry.
5. We now look at insertions in an extendable hashing scheme. Follow the same procedure as for lookup, ending up in some bucket j. If there is room in the bucket, insert the key and pointer, and insert the record in the file. If the bucket is full, we must split the bucket and redistribute its records. If a bucket is split, we may need to increase the number of bits we use from the hash.
6. Two cases exist:
1. If i = i_j, then only one entry in the bucket address table points to bucket j. Then we need to increase the size of the bucket address table so that we can include pointers to the two buckets that result from splitting bucket j. We increment i by one, thus considering more of the hash, and doubling the size of the bucket address table. Each entry is replaced by two entries, each containing the same pointer as the original entry. Now two entries in the bucket address table point to bucket j. We allocate a new bucket z, and set the second pointer to point to z. Set i_j and i_z to i. Rehash all records in bucket j, which are put in either j or z. Now insert the new record. It is remotely possible, but unlikely, that the new hash will still put all of the records in one bucket. If so, split again and increment i again.
2. If i > i_j, then more than one entry in the bucket address table points to bucket j. Then we can split bucket j without increasing the size of the bucket address table (why?). Note that all entries that point to bucket j correspond to hash prefixes that have the same value on the leftmost i_j bits.

We allocate a new bucket z, and set i_j and i_z to the original i_j value plus 1. Now adjust the entries in the bucket address table that previously pointed to bucket j: leave the first half pointing to bucket j, and make the rest point to bucket z. Rehash each record in bucket j as before. Reattempt the new insert.
7. Note that in both cases we only need to rehash records in bucket j.
8. Deletion of records is similar. Buckets may have to be coalesced, and the bucket address table may have to be halved.
9. Insertion is illustrated for the example deposit file; 32-bit hash values on bname are shown, together with an initial empty hash structure. We insert records one by one. We (unrealistically) assume that a bucket can hold only 2 records, in order to illustrate both situations described. As we insert the Perryridge and Round Hill records, this first bucket becomes full. When we insert the next record (Downtown), we must split the bucket. Since i = i_0, we need to increase the number of bits we use from the hash. We now use 1 bit, allowing us 2^1 = 2 buckets. This makes us double the size of the bucket address table to two entries. We split the bucket, placing the records whose search-key hash begins with 1 in the new bucket, and those with a 0 in the old bucket. Next we attempt to insert the Redwood record, and find it hashes to 1. That bucket is full, and i = i_1. So we must split that bucket, increasing the number of bits we use to 2. This necessitates doubling the bucket address table again, to four entries. We rehash the entries in the old bucket. We continue on for the deposit records, obtaining the extendable hash structure.
Advantages:
- Extendable hashing provides performance that does not degrade as the file grows.
- Minimal space overhead: no buckets need be reserved for future use. The bucket address table contains only one pointer for each hash value of the current prefix length.
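The split-and-double machinery just described can be sketched as a small in-memory structure. This is a toy sketch, not the on-disk organization: it assumes a bucket capacity of 2 (as in the deposit-file example), uses Python's built-in hash folded to 32 bits as a stand-in for the uniform random hash h, and indexes the bucket address table by the first i high-order bits:

```python
BUCKET_SIZE = 2     # unrealistically small, as in the deposit-file example
HASH_BITS = 32

def h(key):
    """Stand-in for the uniform, random 32-bit hash function."""
    return hash(key) & 0xFFFFFFFF

class Bucket:
    def __init__(self, depth):
        self.depth = depth   # i_j: length of the common hash prefix
        self.items = {}      # search key -> pointer (here, just a value)

class ExtendableHash:
    def __init__(self):
        self.i = 0                 # bits of the hash currently in use
        self.table = [Bucket(0)]   # bucket address table, 2**i entries

    def _index(self, key):
        # first i high-order bits of the 32-bit hash
        return h(key) >> (HASH_BITS - self.i)

    def lookup(self, key):
        return self.table[self._index(key)].items.get(key)

    def insert(self, key, value):
        while True:
            bucket = self.table[self._index(key)]
            if key in bucket.items or len(bucket.items) < BUCKET_SIZE:
                bucket.items[key] = value
                return
            self._split(bucket)    # bucket j is full: split it, then retry

    def _split(self, bucket):
        if bucket.depth == self.i:
            # case i = i_j: double the bucket address table,
            # each entry replaced by two copies of the original pointer
            self.table = [p for p in self.table for _ in (0, 1)]
            self.i += 1
        # case i > i_j (or after doubling): split without growing the table
        bucket.depth += 1
        z = Bucket(bucket.depth)
        slots = [k for k, p in enumerate(self.table) if p is bucket]
        for k in slots[len(slots) // 2:]:   # second half now points to z
            self.table[k] = z
        old, bucket.items = bucket.items, {}
        for k, v in old.items():            # rehash records of bucket j only
            self.table[self._index(k)].items[k] = v

eh = ExtendableHash()
for name in ["Perryridge", "Round Hill", "Downtown", "Redwood", "Brighton"]:
    eh.insert(name, name)
```

Note the invariant the code maintains: the table always has 2^i entries, and every bucket holds at most BUCKET_SIZE records, with splitting (and, if needed, table doubling) confined to the one overflowing bucket.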
Disadvantages:
- Extra level of indirection in the bucket address table.
- Added complexity.

Points to Ponder
- An index for a file works like a catalogue in a library.
- In order to allow fast random access, an index structure may be used.
- Indices whose search key specifies an order different from the sequential order of the file are called secondary indices.
- Dense index: an index record appears for every search-key value in the file. Each index record contains the search-key value and a pointer to the actual record.
- Sparse index: index records are created only for some of the search-key values.
- Regardless of what form of index is used, every index must be updated whenever a record is inserted into or deleted from the file.
- Hashing involves computing the address of a data item by computing a function on the search-key value.

Review Terms
- Index
- Ordered index
- Dense index
- Sparse index
- Hashing
- Hash function
- Static hashing
- Dynamic hashing

Students Activity
1. Define index and hashing.
2. Define ordered indices.
3. Differentiate between sparse index and dense index.

4. Define secondary index and multilevel index.
5. Define hashing. What is a hash function?
6. Differentiate between static hashing and dynamic hashing.

Student Notes

LESSON 40: TRANSACTION MANAGEMENT

Lesson Objectives
- Transactions
- ACID properties
- Transaction state
- Implementation of ACID properties

Collections of operations that form a single logical unit of work are called transactions. A database system must ensure proper execution of transactions despite failures: either the entire transaction executes, or none of it does. For example, a transfer of funds from a checking account to a savings account is a single operation from the customer's standpoint; within the database system, however, it consists of several operations. Clearly, it is essential that all these operations occur, or that, in case of a failure, none occur. Furthermore, the system must manage concurrent execution of transactions in a way that avoids the introduction of inconsistency. In our funds-transfer example, a transaction computing the customer's total money might see the checking-account balance before it is debited by the funds-transfer transaction, but see the savings balance after it is credited. As a result, it would obtain an incorrect result.

Transaction

A transaction is a unit of program execution that accesses and possibly updates various data items. Usually, a transaction is initiated by a user program written in a high-level data-manipulation language or programming language (for example, SQL, COBOL, C, C++, or Java), where it is delimited by statements (or function calls) of the form begin transaction and end transaction. The transaction consists of all operations executed between the begin transaction and the end transaction. To ensure integrity of the data, we require that the database system maintain the following properties of transactions. These are often called the ACID properties; the acronym is derived from the first letter of each of the four properties.

Atomicity. Either all operations of the transaction are reflected properly in the database, or none are.

Consistency.
Execution of a transaction in isolation (that is, with no other transaction executing concurrently) preserves the consistency of the database.

Isolation. Even though multiple transactions may execute concurrently, the system guarantees that, for every pair of transactions Ti and Tj, it appears to Ti that either Tj finished execution before Ti started, or Tj started execution after Ti finished. Thus, each transaction is unaware of other transactions executing concurrently in the system.

Durability. After a transaction completes successfully, the changes it has made to the database persist, even if there are system failures.

To gain a better understanding of the ACID properties and the need for them, consider a simplified banking system consisting of several accounts and a set of transactions that access and update those accounts. For the time being, we assume that the database permanently resides on disk, but that some portion of it is temporarily residing in main memory. Transactions access data using two operations:

read(X), which transfers the data item X from the database to a local buffer belonging to the transaction that executed the read operation.

write(X), which transfers the data item X from the local buffer of the transaction that executed the write back to the database.

In a real database system, the write operation does not necessarily result in the immediate update of the data on the disk; the write operation may be temporarily stored in memory and executed on the disk later. For now, however, we shall assume that the write operation updates the database immediately.

Let Ti be a transaction that transfers $50 from account A to account B. This transaction can be defined as

Ti: read(A);
    A := A - 50;
    write(A);
    read(B);
    B := B + 50;
    write(B).

Let us now consider each of the ACID requirements. (For ease of presentation, we consider them in an order different from the order A-C-I-D.)
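The all-or-nothing behavior of a funds transfer like Ti can be demonstrated with any transactional store. The sketch below uses Python's sqlite3 purely as an illustration (the account table and its columns are invented for the demo): both UPDATE statements commit together, or a failure rolls both back.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO account VALUES (?, ?)",
                 [("A", 1000), ("B", 2000)])
conn.commit()

def transfer(conn, src, dst, amount):
    """All-or-nothing transfer of `amount` from src to dst: both
    UPDATEs commit together, or the partial work is rolled back."""
    try:
        conn.execute("UPDATE account SET balance = balance - ? WHERE name = ?",
                     (amount, src))
        conn.execute("UPDATE account SET balance = balance + ? WHERE name = ?",
                     (amount, dst))
        conn.commit()        # end transaction: both writes become durable
    except Exception:
        conn.rollback()      # undo the partial transaction (atomicity)
        raise

transfer(conn, "A", "B", 50)
print(dict(conn.execute("SELECT name, balance FROM account")))
# the sum of the two balances is unchanged: consistency is preserved
```

The commit/rollback pair is exactly the end-transaction boundary described above: before commit, the two UPDATEs are an invisible intermediate state.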
Consistency: The consistency requirement here is that the sum of A and B be unchanged by the execution of the transaction. Without the consistency requirement, money could be created or destroyed by the transaction. It can be verified easily that, if the database is consistent before an execution of the transaction, the database remains consistent after the execution of the transaction. Ensuring consistency for an individual transaction is the responsibility of the application programmer who codes the transaction. This task may be facilitated by automatic testing of integrity constraints.

Atomicity: Suppose that, just before the execution of transaction Ti, the values of accounts A and B are $1000 and $2000, respectively. Now suppose that, during the execution of transaction Ti, a failure occurs that prevents Ti from completing its execution successfully. Examples of such failures include power failures, hardware failures, and software errors. Further, suppose that the failure happened after the write(A) operation but before the write(B)

operation. In this case, the values of accounts A and B reflected in the database are $950 and $2000. The system destroyed $50 as a result of this failure. In particular, we note that the sum A + B is no longer preserved. Thus, because of the failure, the state of the system no longer reflects a real state of the world that the database is supposed to capture. We term such a state an inconsistent state. We must ensure that such inconsistencies are not visible in a database system. Note, however, that the system must at some point be in an inconsistent state. Even if transaction Ti is executed to completion, there exists a point at which the value of account A is $950 and the value of account B is $2000, which is clearly an inconsistent state. This state, however, is eventually replaced by the consistent state where the value of account A is $950, and the value of account B is $2050. Thus, if the transaction never started or was guaranteed to complete, such an inconsistent state would not be visible except during the execution of the transaction. That is the reason for the atomicity requirement: if the atomicity property is present, all actions of the transaction are reflected in the database, or none are. The basic idea behind ensuring atomicity is this: the database system keeps track (on disk) of the old values of any data on which a transaction performs a write, and, if the transaction does not complete its execution, the database system restores the old values to make it appear as though the transaction never executed.

Durability: Once the execution of the transaction completes successfully, and the user who initiated the transaction has been notified that the transfer of funds has taken place, it must be the case that no system failure will result in a loss of data corresponding to this transfer of funds.
The durability property guarantees that, once a transaction completes successfully, all the updates that it carried out on the database persist, even if there is a system failure after the transaction completes execution. We assume for now that a failure of the computer system may result in loss of data in main memory, but data written to disk are never lost. We can guarantee durability by ensuring that either
1. The updates carried out by the transaction have been written to disk before the transaction completes, or
2. Information about the updates carried out by the transaction and written to disk is sufficient to enable the database to reconstruct the updates when the database system is restarted after the failure.
Ensuring durability is the responsibility of a component of the database system called the recovery-management component. The transaction-management component and the recovery-management component are closely related.

Isolation: Even if the consistency and atomicity properties are ensured for each transaction, if several transactions are executed concurrently, their operations may interleave in some undesirable way, resulting in an inconsistent state. For example, as we saw earlier, the database is temporarily inconsistent while the transaction to transfer funds from A to B is executing, with the deducted total written to A and the increased total yet to be written to B. If a second concurrently running transaction reads A and B at this intermediate point and computes A + B, it will observe an inconsistent value. Furthermore, if this second transaction then performs updates on A and B based on the inconsistent values that it read, the database may be left in an inconsistent state even after both transactions have completed. A way to avoid the problem of concurrently executing transactions is to execute transactions serially, that is, one after the other.
However, concurrent execution of transactions provides significant performance benefits. Other solutions have therefore been developed; they allow multiple transactions to execute concurrently. The isolation property of a transaction ensures that the concurrent execution of transactions results in a system state that is equivalent to a state that could have been obtained had these transactions executed one at a time in some order. Ensuring the isolation property is the responsibility of a component of the database system called the concurrency-control component.

Transaction State

In the absence of failures, all transactions complete successfully. However, as we noted earlier, a transaction may not always complete its execution successfully. Such a transaction is termed aborted. If we are to ensure the atomicity property, an aborted transaction must have no effect on the state of the database. Thus, any changes that the aborted transaction made to the database must be undone. Once the changes caused by an aborted transaction have been undone, we say that the transaction has been rolled back. It is part of the responsibility of the recovery scheme to manage transaction aborts. A transaction that completes its execution successfully is said to be committed. A committed transaction that has performed updates transforms the database into a new consistent state, which must persist even if there is a system failure. Once a transaction has committed, we cannot undo its effects by aborting it. The only way to undo the effects of a committed transaction is to execute a compensating transaction. For instance, if a transaction added $20 to an account, the compensating transaction would subtract $20 from the account. However, it is not always possible to create such a compensating transaction. Therefore, the responsibility of writing and executing a compensating transaction is left to the user, and is not handled by the database system.
We need to be more precise about what we mean by successful completion of a transaction. We therefore establish a simple abstract transaction model. A transaction must be in one of the following states:
- Active, the initial state; the transaction stays in this state while it is executing.
- Partially committed, after the final statement has been executed.
- Failed, after the discovery that normal execution can no longer proceed.
- Aborted, after the transaction has been rolled back and the database has been restored to its state prior to the start of the transaction.

- Committed, after successful completion.

The state diagram corresponding to a transaction appears in Figure 23.1. We say that a transaction has committed only if it has entered the committed state. Similarly, we say that a transaction has aborted only if it has entered the aborted state. A transaction is said to have terminated if it has either committed or aborted. A transaction starts in the active state. When it finishes its final statement, it enters the partially committed state. At this point, the transaction has completed its execution, but it is still possible that it may have to be aborted, since the actual output may still be temporarily residing in main memory, and thus a hardware failure may preclude its successful completion. The database system then writes out enough information to disk that, even in the event of a failure, the updates performed by the transaction can be re-created when the system restarts after the failure. When the last of this information is written out, the transaction enters the committed state. As mentioned earlier, we assume for now that failures do not result in loss of data on disk. A transaction enters the failed state after the system determines that the transaction can no longer proceed with its normal execution (for example, because of hardware or logical errors). Such a transaction must be rolled back. Then, it enters the aborted state.

Figure 23.1: State diagram of a transaction (states: active, partially committed, failed, committed, aborted).

At this point, the system has two options:
- It can restart the transaction, but only if the transaction was aborted as a result of some hardware or software error that was not created through the internal logic of the transaction. A restarted transaction is considered to be a new transaction.
- It can kill the transaction.
It usually does so because of some internal logical error that can be corrected only by rewriting the application program, or because the input was bad, or because the desired data were not found in the database. We must be cautious when dealing with observable external writes, such as writes to a terminal or printer. Once such a write has occurred, it cannot be erased, since it may have been seen external to the database system. Most systems allow such writes to take place only after the transaction has entered the committed state. One way to implement such a scheme is for the database system to store any value associated with such external writes temporarily in nonvolatile storage, and to perform the actual writes only after the transaction enters the committed state. If the system should fail after the transaction has entered the committed state, but before it could complete the external writes, the database system will carry out the external writes (using the data in nonvolatile storage) when the system is restarted. Handling external writes can be more complicated in some situations. For example, suppose the external action is that of dispensing cash at an automated teller machine, and the system fails just before the cash is actually dispensed (we assume that cash can be dispensed atomically). It makes no sense to dispense cash when the system is restarted, since the user may have left the machine. In such a case, a compensating transaction, such as depositing the cash back in the user's account, needs to be executed when the system is restarted. For certain applications, it may be desirable to allow active transactions to display data to users, particularly for long-duration transactions that run for minutes or hours. Unfortunately, we cannot allow such output of observable data unless we are willing to compromise transaction atomicity. Most current transaction systems ensure atomicity and, therefore, forbid this form of interaction with users.
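The legal transitions of the abstract transaction model can be sketched as a small table-driven state machine (the names and structure below are illustrative, not part of any DBMS API):

```python
# Allowed transitions of the abstract transaction model (cf. Figure 23.1).
TRANSITIONS = {
    "active":              {"partially committed", "failed"},
    "partially committed": {"committed", "failed"},
    "failed":              {"aborted"},
    "committed":           set(),   # terminated
    "aborted":             set(),   # terminated; a restart is a NEW transaction
}

class Transaction:
    def __init__(self):
        self.state = "active"       # the initial state

    def advance(self, new_state):
        if new_state not in TRANSITIONS[self.state]:
            raise ValueError(f"illegal transition: {self.state} -> {new_state}")
        self.state = new_state

t = Transaction()
t.advance("partially committed")    # final statement has been executed
t.advance("committed")              # log information safely on disk
```

Trying to commit a failed transaction, or to leave a terminated state, raises an error, mirroring the rule that a committed transaction can only be undone by a compensating transaction, not by a state change.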
Implementation of Atomicity and Durability

The recovery-management component of a database system can support atomicity and durability by a variety of schemes. We first consider a simple, but extremely inefficient, scheme called the shadow copy scheme. This scheme, which is based on making copies of the database, called shadow copies, assumes that only one transaction is active at a time. The scheme also assumes that the database is simply a file on disk. A pointer called db-pointer is maintained on disk; it points to the current copy of the database. In the shadow-copy scheme, a transaction that wants to update the database first creates a complete copy of the database. All updates are done on the new database copy, leaving the original copy, the shadow copy, untouched. If at any point the transaction has to be aborted, the system merely deletes the new copy. The old copy of the database has not been affected. If the transaction completes, it is committed as follows. First, the operating system is asked to make sure that all pages of the new copy of the database have been written out to disk. (Unix systems use the fsync command for this purpose.) After the operating system has written all the pages to disk, the database system updates the pointer db-pointer to point to the new copy of the database; the new copy then becomes the current copy of the database. The old copy of the database is then deleted. Figure 23.2 depicts the scheme, showing the database state before and after the update.

Figure 23.2: Shadow-copy technique for atomicity and durability.

The transaction is said to have been committed at the point where the updated db-pointer is written to disk. We now consider how the technique handles transaction and system failures. First, consider transaction failure. If the transaction fails at any time before db-pointer is updated, the old contents of the database are not affected. We can abort the transaction by just deleting the new copy of the database. Once the transaction has been committed, all the updates that it performed are in the database pointed to by db-pointer. Thus, either all updates of the transaction are reflected, or none of the effects are reflected, regardless of transaction failure. Now consider the issue of system failure. Suppose that the system fails at any time before the updated db-pointer is written to disk. Then, when the system restarts, it will read db-pointer and will thus see the original contents of the database, and none of the effects of the transaction will be visible in the database. Next, suppose that the system fails after db-pointer has been updated on disk. Before the pointer is updated, all updated pages of the new copy of the database were written to disk. Again, we assume that, once a file is written to disk, its contents will not be damaged even if there is a system failure. Therefore, when the system restarts, it will read db-pointer and will thus see the contents of the database after all the updates performed by the transaction. The implementation actually depends on the write to db-pointer being atomic; that is, either all its bytes are written or none of its bytes are written. If some of the bytes of the pointer were updated by the write, but others were not, the pointer is meaningless, and neither old nor new versions of the database may be found when the system restarts. Luckily, disk systems provide atomic updates to entire blocks, or at least to a disk sector.
In other words, the disk system guarantees that it will update db-pointer atomically, as long as we make sure that db-pointer lies entirely in a single sector, which we can ensure by storing db-pointer at the beginning of a block. Thus, the atomicity and durability properties of transactions are ensured by the shadow-copy implementation of the recovery-management component. As a simple example of a transaction outside the database domain, consider a text editing session. An entire editing session can be modeled as a transaction. The actions executed by the transaction are reading and updating the file. Saving the file at the end of editing corresponds to a commit of the editing transaction; quitting the editing session without saving the file corresponds to an abort of the editing transaction. Many text editors use essentially the implementation just described, to ensure that an editing session is transactional. A new file is used to store the updated file. At the end of the editing session, if the updated file is to be saved, the text editor uses a file rename command to rename the new file to have the actual file name. The rename, assumed to be implemented as an atomic operation by the underlying file system, deletes the old file as well. Unfortunately, this implementation is extremely inefficient in the context of large databases, since executing a single transaction requires copying the entire database. Furthermore, the implementation does not allow transactions to execute concurrently with one another. There are practical ways of implementing atomicity and durability that are much less expensive and more powerful.
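The text-editor analogy above can be sketched directly: write a complete new copy, force it to disk, then atomically rename it over the old file, the rename playing the role of the db-pointer update. This is a minimal sketch assuming a file system where os.replace is atomic (true on POSIX systems, and for same-volume renames on Windows):

```python
import os
import tempfile

def save_atomically(path, text):
    """Shadow-copy save: write a new copy, force it to disk, then
    atomically rename it over the old file. A crash leaves either
    the old file or the new one, never a half-written mixture."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        f.write(text)
        f.flush()
        os.fsync(f.fileno())   # like flushing all pages of the new copy first
    os.replace(tmp, path)      # the atomic "commit" (db-pointer update)

workdir = tempfile.mkdtemp()
path = os.path.join(workdir, "notes.txt")
save_atomically(path, "session 1")   # commit of the first editing session
save_atomically(path, "session 2")   # commit of the second
```

The fsync-before-rename ordering mirrors the scheme in the text: all pages of the new copy must be on disk before the pointer is switched, otherwise a crash could leave the pointer referring to an incomplete copy.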
Points to Ponder
- Collections of operations that form a single logical unit of work are called transactions.
- Usually, a transaction is initiated by a user program written in a high-level data-manipulation language or programming language.
- Transactions access data using two operations: read(X) and write(X).
- A transaction may not always complete its execution successfully. Such a transaction is termed aborted.
- A transaction must be in one of the following states: active, partially committed, failed, aborted, or committed.
- The recovery-management component of a database system can support atomicity and durability by a variety of schemes.

Review Terms
- Transactions
- ACID properties
- Transaction state
- Implementation of ACID properties

Students Activity
1. Define transaction with the help of an example.
2. Define the ACID properties of a database.
3. Define the various states of a transaction.

4. Describe the implementation of the atomicity property of a database.
5. Describe the implementation of durability.
6. Explain the importance of consistency in a database transaction.
7. Define isolation of a database.

Student Notes

LESSON 41: CONCURRENCY CONTROL - I

Lesson Objectives
- Concurrent executions
- Advantages of concurrency
- Problems with concurrency
- Serializability

Concurrent Executions

Transaction-processing systems usually allow multiple transactions to run concurrently. Allowing multiple transactions to update data concurrently causes several complications with consistency of the data, as we saw earlier. Ensuring consistency in spite of concurrent execution of transactions requires extra work; it is far easier to insist that transactions run serially, that is, one at a time, each starting only after the previous one has completed. However, there are two good reasons for allowing concurrency:

Improved throughput and resource utilization. A transaction consists of many steps. Some involve I/O activity; others involve CPU activity. The CPU and the disks in a computer system can operate in parallel. Therefore, I/O activity can be done in parallel with processing at the CPU. The parallelism of the CPU and the I/O system can therefore be exploited to run multiple transactions in parallel. While a read or write on behalf of one transaction is in progress on one disk, another transaction can be running in the CPU, while another disk may be executing a read or write on behalf of a third transaction. All of this increases the throughput of the system, that is, the number of transactions executed in a given amount of time. Correspondingly, the processor and disk utilization also increase; in other words, the processor and disk spend less time idle, or not performing any useful work.

Reduced waiting time. There may be a mix of transactions running on a system, some short and some long. If transactions run serially, a short transaction may have to wait for a preceding long transaction to complete, which can lead to unpredictable delays in running a transaction.
If the transactions are operating on different parts of the database, it is better to let them run concurrently, sharing the CPU cycles and disk accesses among them. Concurrent execution reduces the unpredictable delays in running transactions. Moreover, it also reduces the average response time: the average time for a transaction to be completed after it has been submitted. The motivation for using concurrent execution in a database is essentially the same as the motivation for using multiprogramming in an operating system.

When several transactions run concurrently, database consistency can be destroyed despite the correctness of each individual transaction. In this section, we present the concept of schedules to help identify those executions that are guaranteed to ensure consistency. The database system must control the interaction among the concurrent transactions to prevent them from destroying the consistency of the database. It does so through a variety of mechanisms called concurrency-control schemes.

Consider again the simplified banking system of Section 23.1, which has several accounts, and a set of transactions that access and update those accounts. Let T1 and T2 be two transactions that transfer funds from one account to another. Transaction T1 transfers $50 from account A to account B. It is defined as

T1: read(A);
    A := A - 50;
    write(A);
    read(B);
    B := B + 50;
    write(B).

Transaction T2 transfers 10 percent of the balance from account A to account B. It is defined as

T2: read(A);
    temp := A * 0.1;
    A := A - temp;
    write(A);
    read(B);
    B := B + temp;
    write(B).

Suppose the current values of accounts A and B are $1000 and $2000, respectively. Suppose also that the two transactions are executed one at a time in the order T1 followed by T2. This execution sequence appears in Figure 41.1.
In the figure, the sequence of instruction steps is in chronological order from top to bottom, with instructions of T1 appearing in the left column and instructions of T2 appearing in the right column. The final values of accounts A and B, after the execution in Figure 41.1 takes place, are $855 and $2145, respectively. Thus, the total amount of money in

T1                 T2
-----------------  -----------------
read(A)
A := A - 50
write(A)
read(B)
B := B + 50
write(B)
                   read(A)
                   temp := A * 0.1
                   A := A - temp
                   write(A)
                   read(B)
                   B := B + temp
                   write(B)

Figure 41.1: Schedule 1, a serial schedule in which T1 is followed by T2.

accounts A and B, that is, the sum A + B, is preserved after the execution of both transactions. Similarly, if the transactions are executed one at a time in the order T2 followed by T1, then the corresponding execution sequence is that of Figure 41.2. Again, as expected, the sum A + B is preserved, and the final values of accounts A and B are $850 and $2150, respectively.

The execution sequences just described are called schedules. They represent the chronological order in which instructions are executed in the system. Clearly, a schedule for a set of transactions must consist of all instructions of those transactions, and must preserve the order in which the instructions appear in each individual transaction. For example, in transaction T1, the instruction write(A) must appear before the instruction read(B), in any valid schedule. In the following discussion, we shall refer to the first execution sequence (T1 followed by T2) as schedule 1, and to the second execution sequence (T2 followed by T1) as schedule 2.

These schedules are serial: each serial schedule consists of a sequence of instructions from various transactions, where the instructions belonging to one single transaction appear together in that schedule. Thus, for a set of n transactions, there exist n! different valid serial schedules.

When the database system executes several transactions concurrently, the corresponding schedule no longer needs to be serial. If two transactions are running concurrently, the operating system may execute one transaction for a little while, then perform a context switch, execute the second transaction for some time, and then switch back to the first transaction for some time, and so on. With multiple transactions, the CPU time is shared among all the transactions. Several execution sequences are possible, since the various instructions from both transactions may now be interleaved.
In general, it is not possible to predict exactly how many instructions of a transaction will be executed before the CPU switches to another transaction. Thus, the number of possible schedules for a set of n transactions is much larger than n!.

    T1                  T2
                        read(A)
                        temp := A * 0.1
                        A := A - temp
                        write(A)
                        read(B)
                        B := B + temp
                        write(B)
    read(A)
    A := A - 50
    write(A)
    read(B)
    B := B + 50
    write(B)

Figure 41.2 Schedule 2, a serial schedule in which T2 is followed by T1.

    T1                  T2
    read(A)
    A := A - 50
    write(A)
                        read(A)
                        temp := A * 0.1
                        A := A - temp
                        write(A)
    read(B)
    B := B + 50
    write(B)
                        read(B)
                        B := B + temp
                        write(B)

Figure 41.3 Schedule 3, a concurrent schedule equivalent to schedule 1.

Returning to our previous example, suppose that the two transactions are executed concurrently. One possible schedule appears in Figure 41.3. After this execution takes place, we arrive at the same state as the one in which the transactions are executed serially in the order T1 followed by T2. The sum A + B is indeed preserved.

Not all concurrent executions result in a correct state. To illustrate, consider the schedule of Figure 41.4. After the execution of this schedule, we arrive at a state where the final values of accounts A and B are $950 and $2100, respectively. This final state is an inconsistent state, since we have gained $50 in the process of the concurrent execution. Indeed, the sum A + B is not preserved by the execution of the two transactions.

If control of concurrent execution is left entirely to the operating system, many possible schedules, including ones that leave the database in an inconsistent state, such as the one just described, are possible. It is the job of the database system to ensure that any schedule that gets executed will leave the database in a consistent state. The concurrency-control component of the database system carries out this task.
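The serial schedule of Figure 41.1 and the lost-update interleaving just described can be replayed in a few lines of code. This is a minimal sketch (function and variable names are ours, not the text's): each transaction computes on its own local buffer, so a write can be overwritten exactly as the text describes.

```python
def run_schedule(steps):
    db = {"A": 1000, "B": 2000}            # initial balances
    local = {1: {}, 2: {}}                 # per-transaction buffers
    for txn, op, arg in steps:
        buf = local[txn]
        if op == "read":
            buf[arg] = db[arg]             # copy item into the local buffer
        elif op == "write":
            db[arg] = buf[arg]             # copy local value back to the db
        else:                              # op == "compute": update the buffer
            arg(buf)
    return db

deduct_50 = lambda b: b.update(A=b["A"] - 50)                      # T1: A := A - 50
add_50    = lambda b: b.update(B=b["B"] + 50)                      # T1: B := B + 50
skim      = lambda b: b.update(temp=b["A"] * 0.1, A=b["A"] * 0.9)  # T2: A := A - temp
add_skim  = lambda b: b.update(B=b["B"] + b["temp"])               # T2: B := B + temp

t1 = [(1, "read", "A"), (1, "compute", deduct_50), (1, "write", "A"),
      (1, "read", "B"), (1, "compute", add_50), (1, "write", "B")]
t2 = [(2, "read", "A"), (2, "compute", skim), (2, "write", "A"),
      (2, "read", "B"), (2, "compute", add_skim), (2, "write", "B")]

schedule_1 = t1 + t2                       # serial: T1 then T2
# schedule 4: T1 reads A and deducts; T2 reads A, skims, writes A, reads B;
# T1 overwrites A and updates B; T2 writes B from its stale buffer
schedule_4 = t1[:2] + t2[:4] + t1[2:] + t2[4:]

serial = run_schedule(schedule_1)          # A = 855.0, B = 2145.0: sum preserved
lost   = run_schedule(schedule_4)          # A = 950,   B = 2100.0: $50 appears
```

Running both schedules makes the anomaly concrete: the serial schedule preserves A + B = 3000, while the interleaved one yields 3050.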
We can ensure consistency of the database under concurrent execution by making sure that any schedule that is executed has the same effect as a schedule that could have occurred without any concurrent execution. That is, the schedule should, in some sense, be equivalent to a serial schedule. We examine this idea below.

Serializability

The database system must control concurrent execution of transactions, to ensure that the database state remains consistent. Before we examine how the database system can carry out this task, we must first understand which schedules will ensure consistency, and which schedules will not.

    T1                  T2
    read(A)
    A := A - 50
                        read(A)
                        temp := A * 0.1
                        A := A - temp
                        write(A)
                        read(B)
    write(A)
    read(B)
    B := B + 50
    write(B)
                        B := B + temp
                        write(B)

Figure 41.4 Schedule 4, a concurrent schedule.

Since transactions are programs, it is computationally difficult to determine exactly what operations a transaction performs and how operations of various transactions interact. For this reason, we shall not interpret the type of operations that a transaction can perform on a data item. Instead, we consider only two operations: read and write. We thus assume that, between a read(Q) instruction and a write(Q) instruction on a data item Q, a transaction may perform an arbitrary sequence of operations on the copy of Q that is residing in the local buffer of the transaction. Thus, the only significant operations of a transaction, from a scheduling point of view, are its read and write instructions. We shall therefore usually show only read and write instructions in schedules, as we do for schedule 3 in Figure 41.5.

In this section, we discuss different forms of schedule equivalence; they lead to the notions of conflict serializability and view serializability. Concurrent execution reduces the unpredictable delays in running transactions. Moreover, it also reduces the average response time.

Review Terms
Concurrent executions
Advantages of concurrency
Problems with concurrency
Serializability

Students Activity
1. Define concurrency.
2. What are the advantages and disadvantages of concurrency control?
    T1                  T2
    read(A)
    write(A)
                        read(A)
                        write(A)
    read(B)
    write(B)
                        read(B)
                        write(B)

Figure 41.5 Schedule 3, showing only the read and write instructions.

Points to Ponder
Transaction-processing systems usually allow multiple transactions to run concurrently
Allowing multiple transactions to update data concurrently causes several complications with consistency of the data
The parallelism of the CPU and the I/O system can therefore be exploited to run multiple transactions in parallel

3. Define serializability.
4. When can a database enter an inconsistent state?

5. How can we avoid the inconsistency problem?

Student Notes

LESSON 42: CONCURRENCY CONTROL - II

Lesson objectives
Types of serializability
Conflict Serializability
View Serializability
Implementation of isolation
Transaction definition in SQL

Conflict Serializability
Let us consider a schedule S in which there are two consecutive instructions Ii and Ij, of transactions Ti and Tj, respectively (i ≠ j). If Ii and Ij refer to different data items, then we can swap Ii and Ij without affecting the results of any instruction in the schedule. However, if Ii and Ij refer to the same data item Q, then the order of the two steps may matter. Since we are dealing with only read and write instructions, there are four cases that we need to consider:

1. Ii = read(Q), Ij = read(Q). The order of Ii and Ij does not matter, since the same value of Q is read by Ti and Tj, regardless of the order.
2. Ii = read(Q), Ij = write(Q). If Ii comes before Ij, then Ti does not read the value of Q that is written by Tj in instruction Ij. If Ij comes before Ii, then Ti reads the value of Q that is written by Tj. Thus, the order of Ii and Ij matters.
3. Ii = write(Q), Ij = read(Q). The order of Ii and Ij matters for reasons similar to those of the previous case.
4. Ii = write(Q), Ij = write(Q). Since both instructions are write operations, the order of these instructions does not affect either Ti or Tj. However, the value obtained by the next read(Q) instruction of S is affected, since the result of only the latter of the two write instructions is preserved in the database. If there is no other write(Q) instruction after Ii and Ij in S, then the order of Ii and Ij directly affects the final value of Q in the database state that results from schedule S.

Thus, only in the case where both Ii and Ij are read instructions does the relative order of their execution not matter. We say that Ii and Ij conflict if they are operations by different transactions on the same data item, and at least one of these instructions is a write operation.
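The four cases reduce to a single rule, which can be written down directly. In this small sketch (the representation is ours), an instruction is a (transaction, operation, item) triple with operation "r" or "w":

```python
def conflicts(ii, ij):
    """True iff instructions ii and ij conflict: different transactions,
    same data item, and at least one of the two is a write."""
    ti, op_i, qi = ii
    tj, op_j, qj = ij
    return ti != tj and qi == qj and "w" in (op_i, op_j)
```

For example, `conflicts((1, "w", "A"), (2, "r", "A"))` is true, while two reads of A, or operations on different items, do not conflict.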
To illustrate the concept of conflicting instructions, we consider schedule 3, in Figure 41.3. The write(A) instruction of T1 conflicts with the read(A) instruction of T2. However, the write(A) instruction of T2 does not conflict with the read(B) instruction of T1, because the two instructions access different data items.

Let Ii and Ij be consecutive instructions of a schedule S. If Ii and Ij are instructions of different transactions and Ii and Ij do not conflict, then we can swap the order of Ii and Ij to produce a new schedule S'. We expect S to be equivalent to S', since all instructions appear in the same order in both schedules except for Ii and Ij, whose order does not matter.

Since the write(A) instruction of T2 in schedule 3 of Figure 41.3 does not conflict with the read(B) instruction of T1, we can swap these instructions to generate an equivalent schedule, schedule 5, in Figure 42.1. Regardless of the initial system state, schedules 3 and 5 both produce the same final system state.

We continue to swap non-conflicting instructions:
Swap the read(B) instruction of T1 with the read(A) instruction of T2.
Swap the write(B) instruction of T1 with the write(A) instruction of T2.
Swap the write(B) instruction of T1 with the read(A) instruction of T2.

    T1                  T2
    read(A)
    write(A)
                        read(A)
    read(B)
                        write(A)
    write(B)
                        read(B)
                        write(B)

Figure 42.1 Schedule 5, schedule 3 after swapping of a pair of instructions.

The final result of these swaps, schedule 6 of Figure 42.2, is a serial schedule. Thus, we have shown that schedule 3 is equivalent to a serial schedule. This equivalence implies that, regardless of the initial system state, schedule 3 will produce the same final state as will some serial schedule. If a schedule S can be transformed into a schedule S' by a series of swaps of non-conflicting instructions, we say that S and S' are conflict equivalent.
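Swapping non-conflicting instructions by hand is tedious. The standard mechanical test, which this lesson does not spell out, builds a precedence graph with an edge Ti → Tj whenever an instruction of Ti conflicts with a later instruction of Tj; the schedule is conflict serializable exactly when that graph is acyclic. A sketch (names ours), with schedules as (transaction, operation, item) triples:

```python
def conflict_serializable(schedule):
    """schedule: list of (txn, op, item) with op "r" or "w"."""
    txns = {t for t, _, _ in schedule}
    edges = set()
    for a, (ta, opa, qa) in enumerate(schedule):
        for tb, opb, qb in schedule[a + 1:]:
            if ta != tb and qa == qb and "w" in (opa, opb):
                edges.add((ta, tb))       # ta's instruction precedes tb's
    # acyclic iff we can repeatedly peel off nodes with no incoming edge
    remaining = set(txns)
    while remaining:
        sources = [t for t in remaining
                   if not any((u, t) in edges for u in remaining)]
        if not sources:
            return False                  # a cycle remains: not serializable
        remaining.difference_update(sources)
    return True

schedule_3 = [(1, "r", "A"), (1, "w", "A"), (2, "r", "A"), (2, "w", "A"),
              (1, "r", "B"), (1, "w", "B"), (2, "r", "B"), (2, "w", "B")]
schedule_7 = [(3, "r", "Q"), (4, "w", "Q"), (3, "w", "Q")]
```

Schedule 3 yields only the edge T1 → T2, so it passes; schedule 7 yields edges in both directions between T3 and T4, so it fails, matching the discussion that follows.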
In our previous examples, schedule 1 is not conflict equivalent to schedule 2. However, schedule 1 is conflict equivalent to schedule 3, because the read(B) and write(B) instructions of T1 can be swapped with the read(A) and write(A) instructions of T2.

The concept of conflict equivalence leads to the concept of conflict serializability. We say that a schedule S is conflict serializable if it is conflict equivalent to a serial schedule. Thus, schedule 3 is conflict serializable, since it is conflict equivalent to the serial schedule 1. Finally, consider schedule 7 of Figure 42.3; it consists of only the significant operations (that is, the read and write) of transactions T3 and T4. This schedule is not conflict serializable,

since it is not equivalent to either the serial schedule <T3, T4> or the serial schedule <T4, T3>.

    T1                  T2
    read(A)
    write(A)
    read(B)
    write(B)
                        read(A)
                        write(A)
                        read(B)
                        write(B)

Figure 42.2 Schedule 6, a serial schedule that is equivalent to schedule 3.

    T3                  T4
    read(Q)
                        write(Q)
    write(Q)

Figure 42.3 Schedule 7.

It is possible to have two schedules that produce the same outcome, but that are not conflict equivalent. For example, consider transaction T5, which transfers $10 from account B to account A. Let schedule 8 be as defined in Figure 42.4. We claim that schedule 8 is not conflict equivalent to the serial schedule <T1, T5>, since, in schedule 8, the write(B) instruction of T5 conflicts with the read(B) instruction of T1. Thus, we cannot move all the instructions of T1 before those of T5 by swapping consecutive non-conflicting instructions. However, the final values of accounts A and B after the execution of either schedule 8 or the serial schedule <T1, T5> are the same: $960 and $2040, respectively.

We can see from this example that there are less stringent definitions of schedule equivalence than conflict equivalence. For the system to determine that schedule 8 produces the same outcome as the serial schedule <T1, T5>, it must analyze the computation performed by T1 and T5, rather than just the read and write operations. In general, such analysis is hard to implement and is computationally expensive. However, there are other definitions of schedule equivalence based purely on the read and write operations. We will consider one such definition in the next section.

View Serializability
In this section, we consider a form of equivalence that is less stringent than conflict equivalence, but that, like conflict equivalence, is based on only the read and write operations of transactions.
    T1                  T5
    read(A)
    A := A - 50
    write(A)
                        read(B)
                        B := B - 10
                        write(B)
    read(B)
    B := B + 50
    write(B)
                        read(A)
                        A := A + 10
                        write(A)

Figure 42.4 Schedule 8.

Consider two schedules S and S', where the same set of transactions participates in both schedules. The schedules S and S' are said to be view equivalent if three conditions are met:

1. For each data item Q, if transaction Ti reads the initial value of Q in schedule S, then transaction Ti must, in schedule S', also read the initial value of Q.
2. For each data item Q, if transaction Ti executes read(Q) in schedule S, and if that value was produced by a write(Q) operation executed by transaction Tj, then the read(Q) operation of transaction Ti must, in schedule S', also read the value of Q that was produced by the same write(Q) operation of transaction Tj.
3. For each data item Q, the transaction (if any) that performs the final write(Q) operation in schedule S must perform the final write(Q) operation in schedule S'.

Conditions 1 and 2 ensure that each transaction reads the same values in both schedules and, therefore, performs the same computation. Condition 3, coupled with conditions 1 and 2, ensures that both schedules result in the same final system state.

In our previous examples, schedule 1 is not view equivalent to schedule 2, since, in schedule 1, the value of account A read by transaction T2 was produced by T1, whereas this case does not hold in schedule 2. However, schedule 1 is view equivalent to schedule 3, because the values of accounts A and B read by transaction T2 were produced by T1 in both schedules.

The concept of view equivalence leads to the concept of view serializability. We say that a schedule S is view serializable if it is view equivalent to a serial schedule. As an illustration, suppose that we augment schedule 7 with transaction T6, and obtain schedule 9 in Figure 42.5. Schedule 9 is view serializable.
Indeed, it is view equivalent to the serial schedule <T3, T4, T6>, since the one read(Q) instruction reads the initial value of Q in both schedules, and T6 performs the final write of Q in both schedules.

Every conflict-serializable schedule is also view serializable, but there are view-serializable schedules that are not conflict serializable. Indeed, schedule 9 is not conflict serializable, since every pair of consecutive instructions conflicts, and, thus, no swapping of instructions is possible.

Observe that, in schedule 9, transactions T4 and T6 perform write(Q) operations without having performed a read(Q) operation. Writes of this sort are called blind writes. Blind writes appear in any view-serializable schedule that is not conflict serializable.
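The three view-equivalence conditions can be checked mechanically from two pieces of bookkeeping: which write (or initial value) each read observes, and which transaction writes each item last. A sketch (names ours; it assumes each transaction reads a given item at most once, as in the lesson's examples):

```python
def view_profile(schedule):
    """schedule: list of (txn, op, item) with op "r" or "w"."""
    last_writer, reads_from, final_writer = {}, set(), {}
    for txn, op, item in schedule:
        if op == "r":
            # conditions 1 and 2: record the producing write (None = initial value)
            reads_from.add((txn, item, last_writer.get(item)))
        else:
            last_writer[item] = txn
            final_writer[item] = txn      # condition 3: final write per item
    return reads_from, final_writer

def view_equivalent(s1, s2):
    return view_profile(s1) == view_profile(s2)

schedule_9 = [(3, "r", "Q"), (4, "w", "Q"), (3, "w", "Q"), (6, "w", "Q")]
serial_346 = [(3, "r", "Q"), (3, "w", "Q"), (4, "w", "Q"), (6, "w", "Q")]
```

The checker confirms that schedule 9 is view equivalent to the serial schedule <T3, T4, T6>, while schedule 7 is view equivalent to neither serial order of T3 and T4.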

    T3                  T4                  T6
    read(Q)
                        write(Q)
    write(Q)
                                            write(Q)

Figure 42.5 Schedule 9, a view-serializable schedule.

Implementation of Isolation
So far, we have seen what properties a schedule must have if it is to leave the database in a consistent state and allow transaction failures to be handled in a safe manner. Specifically, schedules that are conflict or view serializable and cascadeless satisfy these requirements.

There are various concurrency-control schemes that we can use to ensure that, even when multiple transactions are executed concurrently, only acceptable schedules are generated, regardless of how the operating system time-shares resources (such as CPU time) among the transactions.

As a trivial example of a concurrency-control scheme, consider this scheme: a transaction acquires a lock on the entire database before it starts and releases the lock after it has committed. While a transaction holds a lock, no other transaction is allowed to acquire the lock, and all must therefore wait for the lock to be released. As a result of the locking policy, only one transaction can execute at a time. Therefore, only serial schedules are generated. These are trivially serializable, and it is easy to verify that they are cascadeless as well.

A concurrency-control scheme such as this one leads to poor performance, since it forces transactions to wait for preceding transactions to finish before they can start. In other words, it provides a poor degree of concurrency. As explained earlier, concurrent execution has several performance benefits.

The goal of concurrency-control schemes is to provide a high degree of concurrency, while ensuring that all schedules that can be generated are conflict or view serializable, and are cascadeless. The schemes have different trade-offs in terms of the amount of concurrency they allow and the amount of overhead that they incur.
Some of them allow only conflict-serializable schedules to be generated; others allow certain view-serializable schedules that are not conflict serializable to be generated.

Transaction Definition in SQL
A data-manipulation language must include a construct for specifying the set of actions that constitute a transaction. The SQL standard specifies that a transaction begins implicitly. Transactions are ended by one of these SQL statements:

Commit work commits the current transaction and begins a new one.
Rollback work causes the current transaction to abort.

The keyword work is optional in both statements. If a program terminates without either of these commands, the updates are either committed or rolled back; which of the two happens is not specified by the standard and depends on the implementation.

The standard also specifies that the system must ensure both serializability and freedom from cascading rollback. The definition of serializability used by the standard is that a schedule must have the same effect as would some serial schedule. Thus, conflict and view serializability are both acceptable. The SQL-92 standard also allows a transaction to specify that it may be executed in a manner that causes it to become nonserializable with respect to other transactions. We study such weaker levels of consistency later.

Points to Ponder
A transaction is a unit of program execution that accesses and possibly updates various data items.
To ensure integrity of the data, we require that the database system maintain the following properties of the transactions. These properties are often called the ACID properties.
Allowing multiple transactions to update data concurrently causes several complications with consistency.
The database system must control concurrent execution of transactions, to ensure that the database state remains consistent.

Review Terms
Inconsistent state
Transaction state
Active
Partially committed
Failed
Aborted
Committed
Terminated
Transaction
Restart
Kill
Observable external writes
Shadow copy scheme
Concurrent executions
Serial execution
Schedules
Conflict of operations
Conflict equivalence
Conflict serializability
View equivalence
View serializability

Students Activity
1. Define transaction.

2. Define conflict serializability.
3. Define view serializability.
4. Define commit and rollback of a transaction.
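Commit and rollback (question 4 above) can be tried directly from any SQL interface. Here is a small sketch using Python's built-in sqlite3 module; the table and values are purely illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (name TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO account VALUES ('A', 1000)")
conn.commit()        # like COMMIT WORK: the insert becomes permanent

conn.execute("UPDATE account SET balance = balance - 50 WHERE name = 'A'")
conn.rollback()      # like ROLLBACK WORK: the update is undone

balance = conn.execute(
    "SELECT balance FROM account WHERE name = 'A'").fetchone()[0]
```

After the rollback the committed state is untouched, so the balance is still 1000.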

Student Notes

DATABASE MANAGEMENT Lesson objectives Locking mechanism Graph based protocol Timestamps Timestamp based protocol Timestamp ordering protocols Validation protocol When the lock manager receives an unlock message from a transaction, it deletes the record for that data item in the linked list corresponding to that transaction. It tests the record that follows, if any to see if that request can now be granted. If it can, the lock man-ager grants that request, and processes the record following it, if any, similarly, and so on. If a transaction aborts, the lock manager deletes any waiting request made by the transaction. Once the database system has taken appropriate actions to undo the transaction it releases all locks held by the aborted transaction. This algorithm guarantees freedom from starvation for lock requests, since a re-quest can never be granted while a request received earlier is waiting to be granted. Graph-based Protocols But, if we wish to develop protocols that are not two phase, we need additional information on how each transaction will access the database. There are various models that can give us the additional information, each differing in the amount of information provided. The simplest model requires that we have prior knowledge about the order in which the database items will be accessed. Given such information, it is possible to construct locking protocols that are not two phase, but that, nevertheless, ensure conflict serializability. To acquire such prior knowledge, we impose a partial ordering on the set D = {d 1, d 2,..., d h } of all data items. If d i d j, then any transaction accessing both d i and d j must access d i before accessing d j. This partial ordering may be the result of either the logical or the physical organization of the data, or it may be imposed solely for the purpose of concurrency control. The partial ordering implies that the set D may now be viewed as a directed acyclic graph, called a database graph. 
In this section, for the sake of simplicity, we will restrict our attention to only those graphs that are rooted trees. We will present a simple protocol, called the tree protocol, which is restricted to employ only exclusive locks. References to other, more complex, graph-based locking protocols are in the bibliographical notes.

In the tree protocol, the only lock instruction allowed is lock-x. Each transaction Ti can lock a data item at most once, and must observe the following rules:
1. The first lock by Ti may be on any data item.
2. Subsequently, a data item Q can be locked by Ti only if the parent of Q is currently locked by Ti.
3. Data items may be unlocked at any time.
4. A data item that has been locked and unlocked by Ti cannot subsequently be relocked by Ti.

All schedules that are legal under the tree protocol are conflict serializable. To illustrate this protocol, consider the database graph of Figure 43.1. The following four transactions follow the tree protocol on this graph. We show only the lock and unlock instructions:

T10: lock-x(B); lock-x(E); lock-x(D); unlock(B); unlock(E); lock-x(G); unlock(D); unlock(G).
T11: lock-x(D); lock-x(H); unlock(D); unlock(H).
T12: lock-x(B); lock-x(E); unlock(E); unlock(B).
T13: lock-x(D); lock-x(H); unlock(D); unlock(H).

One possible schedule in which these four transactions participated appears in Figure 43.2. Note that, during its execution, transaction T10 holds locks on two disjoint subtrees. Observe that the schedule of Figure 43.2 is conflict serializable. It can be shown not only that the tree protocol ensures conflict serializability, but also that this protocol ensures freedom from deadlock.

The tree protocol in Figure 43.2 does not ensure recoverability and cascadelessness. To ensure recoverability and cascadelessness, the protocol can be modified to not permit release of exclusive locks until the end of the transaction.
Holding exclusive locks until the end of the transaction reduces concurrency. Here is an alternative that improves concurrency, but ensures only recoverability: for each data item with an uncommitted write, we record which transaction performed the last write to the data item. Whenever a transaction Ti performs a read of an uncommitted data item, we record a commit dependency of Ti on the transaction that performed the last write to the data item. Transaction Ti is then not permitted to commit until the commit of all transactions on which it has a commit dependency. If any of these transactions aborts, Ti must also be aborted.
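Rules 1 through 4 of the tree protocol can be checked for a single transaction's lock sequence. The exact shape of Figure 43.1 is not reproduced in the text, so the parent map below is an assumed tree consistent with transactions T10 through T13: A is the root, B and C its children, D and E under B, and G and H under D.

```python
PARENT = {"B": "A", "C": "A", "D": "B", "E": "B", "G": "D", "H": "D"}  # assumed tree

def follows_tree_protocol(ops):
    """ops: list of ("lock-x", item) or ("unlock", item) pairs."""
    held, ever_locked = set(), set()
    for action, item in ops:
        if action == "lock-x":
            if item in ever_locked:
                return False              # rule 4: no relocking
            if ever_locked and PARENT.get(item) not in held:
                return False              # rule 2: parent must be held
            held.add(item)                # rule 1: first lock may be anything
            ever_locked.add(item)
        else:
            held.discard(item)            # rule 3: unlock at any time
    return True

t10 = [("lock-x", "B"), ("lock-x", "E"), ("lock-x", "D"), ("unlock", "B"),
       ("unlock", "E"), ("lock-x", "G"), ("unlock", "D"), ("unlock", "G")]
```

T10's sequence is legal, while locking H after D has already been released violates rule 2.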

Figure 43.1 Tree-structured database graph.

    T10             T11             T12             T13
    lock-x(B)
                    lock-x(D)
                    lock-x(H)
                    unlock(D)
    lock-x(E)
    lock-x(D)
    unlock(B)
    unlock(E)
    lock-x(G)
    unlock(D)
    unlock(G)
                    unlock(H)
                                    lock-x(B)
                                    lock-x(E)
                                    unlock(E)
                                    unlock(B)
                                                    lock-x(D)
                                                    lock-x(H)
                                                    unlock(D)
                                                    unlock(H)

Figure 43.2 Serializable schedule under the tree protocol.

The tree-locking protocol has an advantage over the two-phase locking protocol in that, unlike two-phase locking, it is deadlock-free, so no rollbacks are required. The tree-locking protocol has another advantage over the two-phase locking protocol in that unlocking may occur earlier. Earlier unlocking may lead to shorter waiting times, and to an increase in concurrency.

However, the protocol has the disadvantage that, in some cases, a transaction may have to lock data items that it does not access. For example, a transaction that needs to access data items A and J in the database graph of Figure 43.1 must lock not only A and J, but also data items B, D, and H. This additional locking results in increased locking overhead, the possibility of additional waiting time, and a potential decrease in concurrency. Further, without prior knowledge of what data items will need to be locked, transactions will have to lock the root of the tree, and that can reduce concurrency greatly.

For a set of transactions, there may be conflict-serializable schedules that cannot be obtained through the tree protocol. Indeed, there are schedules possible under the two-phase locking protocol that are not possible under the tree protocol, and vice versa. Examples of such schedules are explored in the exercises.

Timestamp-based Protocols
The locking protocols that we have described thus far determine the order between every pair of conflicting transactions at execution time by the first lock that both members of the pair request that involves incompatible modes.
Another method for determining the serializability order is to select an ordering among transactions in advance. The most common method for doing so is to use a timestamp-ordering scheme.

Timestamps
With each transaction Ti in the system, we associate a unique fixed timestamp, denoted by TS(Ti). This timestamp is assigned by the database system before the transaction Ti starts execution. If a transaction Ti has been assigned timestamp TS(Ti), and a new transaction Tj enters the system, then TS(Ti) < TS(Tj). There are two simple methods for implementing this scheme:
1. Use the value of the system clock as the timestamp; that is, a transaction's timestamp is equal to the value of the clock when the transaction enters the system.
2. Use a logical counter that is incremented after a new timestamp has been assigned; that is, a transaction's timestamp is equal to the value of the counter when the transaction enters the system.

The timestamps of the transactions determine the serializability order. Thus, if TS(Ti) < TS(Tj), then the system must ensure that the produced schedule is equivalent to a serial schedule in which transaction Ti appears before transaction Tj.

To implement this scheme, we associate with each data item Q two timestamp values:
W-timestamp(Q) denotes the largest timestamp of any transaction that executed write(Q) successfully.
R-timestamp(Q) denotes the largest timestamp of any transaction that executed read(Q) successfully.
These timestamps are updated whenever a new read(Q) or write(Q) instruction is executed.

The Timestamp-ordering Protocol
The timestamp-ordering protocol ensures that any conflicting read and write operations are executed in timestamp order. This protocol operates as follows:
1. Suppose that transaction Ti issues read(Q).
a. If TS(Ti) < W-timestamp(Q), then Ti needs to read a value of Q that was already overwritten. Hence, the read operation is rejected, and Ti is rolled back.
b. If TS(Ti) ≥ W-timestamp(Q), then the read operation is executed, and R-timestamp(Q) is set to the maximum of R-timestamp(Q) and TS(Ti).
2. Suppose that transaction Ti issues write(Q).
a. If TS(Ti) < R-timestamp(Q), then the value of Q that Ti is producing was needed previously, and the system assumed that that value would never be produced. Hence, the system rejects the write operation and rolls Ti back.
b. If TS(Ti) < W-timestamp(Q), then Ti is attempting to write an obsolete value of Q. Hence, the system rejects this write operation and rolls Ti back.
c. Otherwise, the system executes the write operation and sets W-timestamp(Q) to TS(Ti).

If a transaction Ti is rolled back by the concurrency-control scheme as a result of issuance of either a read or write operation, the system assigns it a new timestamp and restarts it. To illustrate this protocol, we consider transactions T14 and T15. Transaction T14 displays the contents of accounts A and B:

T14: read(B); read(A); display(A + B).
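The read and write rules above translate almost line for line into code. In this sketch (class and method names are ours; timestamps are assumed to be positive numbers, with 0 standing for "no read or write yet"), each operation either proceeds or forces the issuing transaction to roll back:

```python
class TimestampOrdering:
    def __init__(self):
        self.r_ts = {}                    # item -> largest read timestamp
        self.w_ts = {}                    # item -> largest write timestamp

    def read(self, ts, item):
        if ts < self.w_ts.get(item, 0):
            return "rollback"             # rule 1a: value already overwritten
        self.r_ts[item] = max(self.r_ts.get(item, 0), ts)   # rule 1b
        return "ok"

    def write(self, ts, item):
        if ts < self.r_ts.get(item, 0):
            return "rollback"             # rule 2a: a later read needed this value
        if ts < self.w_ts.get(item, 0):
            return "rollback"             # rule 2b: obsolete write
        self.w_ts[item] = ts              # rule 2c
        return "ok"
```

With TS(T14) = 1 and TS(T15) = 2, both reads of B succeed and T15's write of B succeeds, but any ts-1 operation on B issued afterward is rejected, since B's timestamps have moved past it.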

Transaction T15 transfers $50 from account A to account B, and then displays the contents of both:

T15: read(B); B := B - 50; write(B); read(A); A := A + 50; write(A); display(A + B).

In presenting schedules under the timestamp protocol, we shall assume that a transaction is assigned a timestamp immediately before its first instruction. Thus, in schedule 3 of Figure 43.3, TS(T14) < TS(T15), and the schedule is possible under the timestamp protocol.

    T14                 T15
    read(B)
                        read(B)
                        B := B - 50
                        write(B)
    read(A)
                        read(A)
    display(A + B)
                        A := A + 50
                        write(A)
                        display(A + B)

Figure 43.3 Schedule 3.

We note that the preceding execution can also be produced by the two-phase locking protocol. There are, however, schedules that are possible under the two-phase locking protocol but are not possible under the timestamp protocol, and vice versa.

The timestamp-ordering protocol ensures conflict serializability. This is because conflicting operations are processed in timestamp order. The protocol ensures freedom from deadlock, since no transaction ever waits. However, there is a possibility of starvation of long transactions if a sequence of conflicting short transactions causes repeated restarting of the long transaction. If a transaction is found to be getting restarted repeatedly, conflicting transactions need to be temporarily blocked to enable the transaction to finish.

The protocol can generate schedules that are not recoverable. However, it can be extended to make the schedules recoverable, in one of several ways:

Recoverability and cascadelessness can be ensured by performing all writes together at the end of the transaction. The writes must be atomic in the following sense: while the writes are in progress, no transaction is permitted to access any of the data items that have been written.
Recoverability and cascadelessness can also be guaranteed by using a limited form of locking, whereby reads of uncommitted items are postponed until the transaction that updated the item commits.

Recoverability alone can be ensured by tracking uncommitted writes, and allowing a transaction Ti to commit only after the commit of any transaction that wrote a value that Ti read (commit dependencies).

Validation-based Protocols
In cases where a majority of transactions are read-only transactions, the rate of conflicts among transactions may be low. Thus, many of these transactions, if executed without the supervision of a concurrency-control scheme, would nevertheless leave the system in a consistent state. A concurrency-control scheme imposes overhead of code execution and possible delay of transactions. It may be better to use an alternative scheme that imposes less overhead. A difficulty in reducing the overhead is that we do not know in advance which transactions will be involved in a conflict. To gain that knowledge, we need a scheme for monitoring the system.

We assume that each transaction Ti executes in two or three different phases in its lifetime, depending on whether it is a read-only or an update transaction. The phases are, in order:
1. Read phase. During this phase, the system executes transaction Ti. It reads the values of the various data items and stores them in variables local to Ti. It performs all write operations on temporary local variables, without updates of the actual database.
2. Validation phase. Transaction Ti performs a validation test to determine whether it can copy to the database the temporary local variables that hold the results of write operations without causing a violation of serializability.
3. Write phase. If transaction Ti succeeds in validation (step 2), then the system applies the actual updates to the database. Otherwise, the system rolls back Ti.

Each transaction must go through the three phases in the order shown.
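The validation test that follows, stated in terms of the Start, Validation, and Finish timestamps of the three phases, can be sketched directly. The names are ours; read and write sets are plain Python sets and timestamps are numbers:

```python
def validates(ti, tj):
    """May Tj commit, given an earlier transaction Ti with TS(Ti) < TS(Tj)?
    ti and tj are dicts with "start", "validation", and "finish" times
    plus "read_set" and "write_set"."""
    if ti["finish"] < tj["start"]:
        return True                       # condition 1: Ti finished before Tj began
    if (not (ti["write_set"] & tj["read_set"])
            and ti["finish"] < tj["validation"]):
        return True                       # condition 2: Ti's writes cannot affect Tj's reads
    return False
```

For instance, a read-only transaction whose writes are empty always passes condition 2 against a later transaction that validates after it finishes, while overlapping read and write sets with no temporal separation force a rollback.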
However, all three phases of concurrently executing transactions can be interleaved.

To perform the validation test, we need to know when the various phases of transaction Ti took place. We shall, therefore, associate three different timestamps with transaction Ti:
1. Start(Ti), the time when Ti started its execution.
2. Validation(Ti), the time when Ti finished its read phase and started its validation phase.
3. Finish(Ti), the time when Ti finished its write phase.

We determine the serializability order by the timestamp-ordering technique, using the value of the timestamp Validation(Ti). Thus, the value TS(Ti) = Validation(Ti) and, if TS(Tj) < TS(Tk), then any produced schedule must be equivalent to a serial schedule in which transaction Tj appears before transaction Tk. The reason we have chosen Validation(Ti), rather than Start(Ti), as the timestamp of transaction Ti is that we can expect faster response time, provided that conflict rates among transactions are indeed low.

The validation test for transaction Tj requires that, for all transactions Ti with TS(Ti) < TS(Tj), one of the following two conditions must hold:

1. Finish(Ti) < Start(Tj). Since Ti completes its execution before Tj starts, the serializability order is indeed maintained.

2. The set of data items written by Ti does not intersect with the set of data items read by Tj, and Ti completes its write phase before Tj starts its validation phase (Start(Tj) < Finish(Ti) < Validation(Tj)). This condition ensures that the writes of Ti and Tj do not overlap. Since the writes of Ti do not affect the read of Tj, and since Tj cannot affect the read of Ti, the serializability order is indeed maintained.

As an illustration, consider again transactions T14 and T15, and the schedule of Figure 43.5, a schedule produced by using validation:

T14                  T15
read(B)
                     read(B)
                     B := B - 50
                     read(A)
                     A := A + 50
read(A)
(validate)
display(A + B)
                     (validate)
                     write(B)
                     write(A)

Figure 43.5: Schedule 5, a schedule produced by using validation.

Suppose that TS(T14) < TS(T15). Then the validation phase succeeds in schedule 5 of Figure 43.5. Note that the writes to the actual variables are performed only after the validation phase of T15. Thus, T14 reads the old values of B and A, and this schedule is serializable.

The validation scheme automatically guards against cascading rollbacks, since the actual writes take place only after the transaction issuing the write has committed. However, there is a possibility of starvation of long transactions, due to a sequence of conflicting short transactions that cause repeated restarts of the long transaction. To avoid starvation, conflicting transactions must be temporarily blocked, to enable the long transaction to finish.

This validation scheme is called the optimistic concurrency-control scheme, since transactions execute optimistically, assuming they will be able to finish execution and validate at the end. In contrast, locking and timestamp ordering are pessimistic in that they force a wait or a rollback whenever a conflict is detected, even though there is a chance that the schedule may be conflict serializable.

Points to Ponder

With each transaction Ti in the system, we associate a unique fixed timestamp. This timestamp is assigned by the database system before the transaction Ti starts execution.
The timestamp-ordering protocol ensures that any conflicting read and write operations are executed in timestamp order.
When the lock manager receives an unlock message from a transaction, it deletes the record for that data item in the linked list corresponding to that transaction.
If we wish to develop protocols that are not two phase, we need additional information on how each transaction will access the database.
One method for determining the serializability order is to select an ordering among transactions in advance.

Review Terms

Locking mechanism
Graph-based protocol
Timestamps
Timestamp-based protocol
Timestamp-ordering protocols
Validation protocol

Students Activity

1. Define locks in a database system. Why are they advisable?
2. Define graph-based protocol.
3. Define timestamps.
4. Define timestamp-based protocol.

5. Define timestamp-ordering protocol.
6. Define validation-based protocol.

Student Notes

Lesson 44: Database Recovery

Lesson objectives

Failure
Types of failures
Storage structure
Storage type
Stable-storage implementation
Data access
Recovery and atomicity
Log-based recovery
Checkpoints

A computer system, like any other device, is subject to failure from a variety of causes: disk crash, power outage, software error, a fire in the machine room, even sabotage. In any failure, information may be lost. The database system must therefore take actions in advance to ensure that the atomicity and durability properties of transactions are preserved. An integral part of a database system is a recovery scheme that can restore the database to the consistent state that existed before the failure. The recovery scheme must also provide high availability; that is, it must minimize the time for which the database is not usable after a crash.

Failure Classification

There are various types of failure that may occur in a system, each of which needs to be dealt with in a different manner. The simplest type of failure is one that does not result in the loss of information in the system. The failures that are more difficult to deal with are those that result in loss of information. The various types of failure are:

Transaction failure. There are two types of errors that may cause a transaction to fail:

Logical error. The transaction can no longer continue with its normal execution because of some internal condition, such as bad input, data not found, overflow, or resource limit exceeded.

System error. The system has entered an undesirable state (for example, deadlock), as a result of which a transaction cannot continue with its normal execution. The transaction, however, can be executed at a later time.

System crash. There is a hardware malfunction, or a bug in the database software or the operating system, that causes the loss of the content of volatile storage and brings transaction processing to a halt. The content of nonvolatile storage remains intact and is not corrupted.
The assumption that hardware errors and bugs in the software bring the system to a halt, but do not corrupt the nonvolatile storage contents, is known as the fail-stop assumption. Well-designed systems have numerous internal checks, at the hardware and the software level, that bring the system to a halt when there is an error. Hence, the fail-stop assumption is a reasonable one.

Disk failure. A disk block loses its content as a result of either a head crash or a failure during a data-transfer operation. Copies of the data on other disks, or archival backups on tertiary media such as tapes, are used to recover from the failure.

To determine how the system should recover from failures, we need to identify the failure modes of those devices used for storing data. Next, we must consider how these failure modes affect the contents of the database. We can then propose algorithms to ensure database consistency and transaction atomicity despite failures. These algorithms, known as recovery algorithms, have two parts:

1. Actions taken during normal transaction processing to ensure that enough information exists to allow recovery from failures.

2. Actions taken after a failure to recover the database contents to a state that ensures database consistency, transaction atomicity, and durability.

Storage Structure

The various data items in the database may be stored and accessed in a number of different storage media. To understand how to ensure the atomicity and durability properties of a transaction, we must gain a better understanding of these storage media and their access methods.

Storage Types

Storage media can be distinguished by their relative speed, capacity, and resilience to failure, and classified as volatile storage or nonvolatile storage. We review these terms, and introduce another class of storage, called stable storage.

Volatile storage. Information residing in volatile storage does not usually survive system crashes.
Examples of such storage are main memory and cache memory. Access to volatile storage is extremely fast, both because of the speed of the memory access itself, and because it is possible to access any data item in volatile storage directly.

Nonvolatile storage. Information residing in nonvolatile storage survives system crashes. Examples of such storage are disk and magnetic tapes. Disks are used for online storage, whereas tapes are used for archival storage. Both, however, are subject to failure (for example, head crash), which may result in loss of information. At the current state of technology, nonvolatile storage is slower than volatile storage by several orders of magnitude. This is because disk and tape devices are electromechanical, rather than based entirely on chips, as is volatile storage. In database systems, disks are used for most nonvolatile storage. Other nonvolatile media are normally used only for backup data.

Flash storage, though nonvolatile, has insufficient capacity for most database systems.

Stable storage. Information residing in stable storage is never lost ("never" should be taken with a grain of salt, since theoretically "never" cannot be guaranteed; for example, it is possible, although extremely unlikely, that a black hole may envelop the earth and permanently destroy all data!). Although stable storage is theoretically impossible to obtain, it can be closely approximated by techniques that make data loss extremely unlikely.

The distinctions among the various storage types are often less clear in practice than in our presentation. Certain systems provide battery backup, so that some main memory can survive system crashes and power failures. Alternative forms of nonvolatile storage, such as optical media, provide an even higher degree of reliability than do disks.

Stable-Storage Implementation

To implement stable storage, we need to replicate the needed information in several nonvolatile storage media (usually disk) with independent failure modes, and to update the information in a controlled manner to ensure that a failure during data transfer does not damage the needed information.

RAID systems guarantee that the failure of a single disk (even during data transfer) will not result in loss of data. The simplest and fastest form of RAID is the mirrored disk, which keeps two copies of each block on separate disks. Other forms of RAID offer lower costs, but at the expense of lower performance. RAID systems, however, cannot guard against data loss due to disasters such as fires or flooding. Many systems store archival backups on tapes off-site to guard against such disasters. However, since tapes cannot be carried off-site continually, updates since the most recent time that tapes were carried off-site could be lost in such a disaster.
More secure systems keep a copy of each block of stable storage at a remote site, writing it out over a computer network, in addition to storing the block on a local disk system. Since the blocks are output to a remote system as and when they are output to local storage, once an output operation is complete, the output is not lost, even in the event of a disaster such as a fire or flood. Such remote backup systems are studied elsewhere.

In the remainder of this section, we discuss how storage media can be protected from failure during data transfer. Block transfer between memory and disk storage can result in:

Successful completion. The transferred information arrived safely at its destination.

Partial failure. A failure occurred in the midst of transfer, and the destination block has incorrect information.

Total failure. The failure occurred sufficiently early during the transfer that the destination block remains intact.

We require that, if a data-transfer failure occurs, the system detects it and invokes a recovery procedure to restore the block to a consistent state. To do so, the system must maintain two physical blocks for each logical database block; in the case of mirrored disks, both blocks are at the same location; in the case of remote backup, one of the blocks is local, whereas the other is at a remote site. An output operation is executed as follows:

1. Write the information onto the first physical block.
2. When the first write completes successfully, write the same information onto the second physical block.
3. The output is completed only after the second write completes successfully.

During recovery, the system examines each pair of physical blocks. If both are the same and no detectable error exists, then no further actions are necessary. (Recall that errors in a disk block, such as a partial write to the block, are detected by storing a checksum with each block.)
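The output protocol above can be sketched as follows. Two Python dicts stand in for two disks with independent failure modes (illustrative only, not a real disk API), and the sketch models only the mismatch-without-detectable-error case, resolved in favor of the second copy as the text describes.

```python
# Sketch of the stable-storage output protocol: write the first physical
# copy, then the second; on recovery, compare the pair and resolve any
# content mismatch by overwriting the first copy with the second.

disk1, disk2 = {}, {}

def stable_output(block_id, data):
    disk1[block_id] = data   # step 1: write the first physical block
    # a crash between these two lines leaves the copies different
    disk2[block_id] = data   # step 2: write the second block afterwards

def recover_block(block_id):
    """Make a block pair consistent again after a crash."""
    if disk1.get(block_id) != disk2.get(block_id):
        disk1[block_id] = disk2.get(block_id)   # second copy wins a mismatch
    return disk1[block_id]
```

Because writes go first-then-second, the second copy can only hold the old value mid-write, so restoring from it rolls the interrupted output back; either the write completes on all copies or it has no effect, which is exactly the stable-storage guarantee.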
If the system detects an error in one block, then it replaces its content with the content of the other block. If both blocks contain no detectable error, but they differ in content, then the system replaces the content of the first block with the value of the second. This recovery procedure ensures that a write to stable storage either succeeds completely (that is, updates all copies) or results in no change.

The requirement of comparing every corresponding pair of blocks during recovery is expensive to meet. We can reduce the cost greatly by keeping track of block writes that are in progress, using a small amount of nonvolatile RAM. On recovery, only blocks for which writes were in progress need to be compared. The protocols for writing out a block to a remote site are similar to the protocols for writing blocks to a mirrored disk system.

We can extend this procedure easily to allow the use of an arbitrarily large number of copies of each block of stable storage. Although a large number of copies reduces the probability of a failure even further than two copies do, it is usually reasonable to simulate stable storage with only two copies.

Data Access

The database system resides permanently on nonvolatile storage (usually disks), and is partitioned into fixed-length storage units called blocks. Blocks are the units of data transfer to and from disk, and may contain several data items. We shall assume that no data item spans two or more blocks. This assumption is realistic for most data-processing applications, such as our banking example.

Transactions input information from the disk to main memory, and then output the information back onto the disk. The input and output operations are done in block units. The blocks residing on the disk are referred to as physical blocks; the blocks residing temporarily in main memory are referred to as buffer blocks. The area of memory where blocks reside temporarily is called the disk buffer.
Block movements between disk and main memory are initiated through the following two operations:

1. input(B) transfers the physical block B to main memory.
2. output(B) transfers the buffer block B to the disk, and replaces the appropriate physical block there.

Each transaction Ti has a private work area in which copies of all the data items accessed and updated by Ti are kept. The system creates this work area when the transaction is initiated; the system removes it when the transaction either commits or

aborts. Each data item X kept in the work area of transaction Ti is denoted by Xi. Transaction Ti interacts with the database system by transferring data to and from its work area to the system buffer. We transfer data by these two operations:

1. read(X) assigns the value of data item X to the local variable Xi. It executes this operation as follows:
a. If block Bx on which X resides is not in main memory, it issues input(Bx).
b. It assigns to Xi the value of X from the buffer block.

2. write(X) assigns the value of local variable Xi to data item X in the buffer block. It executes this operation as follows:
a. If block Bx on which X resides is not in main memory, it issues input(Bx).
b. It assigns the value of Xi to X in buffer Bx.

Note that both operations may require the transfer of a block from disk to main memory. They do not, however, specifically require the transfer of a block from main memory to disk.

A buffer block is eventually written out to the disk either because the buffer manager needs the memory space for other purposes or because the database system wishes to reflect the change to B on the disk. We shall say that the database system performs a force-output of buffer B if it issues an output(B).

When a transaction needs to access a data item X for the first time, it must execute read(X). The system then performs all updates to X on Xi. After the transaction accesses X for the final time, it must execute write(X) to reflect the change to X in the database itself. The output(Bx) operation for the buffer block Bx on which X resides does not need to take effect immediately after write(X) is executed, since the block Bx may contain other data items that are still being accessed. Thus, the actual output may take place later. Notice that, if the system crashes after the write(X) operation was executed but before output(Bx) was executed, the new value of X is never written to disk and, thus, is lost.
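The data-access model above can be sketched as follows; the dict-based disk, buffer, and work area are illustrative stand-ins, not a real buffer-manager API.

```python
# Sketch of the data-access model: physical blocks on disk, buffer blocks
# in main memory, and a per-transaction private work area.

disk = {"Bx": {"X": 100}}    # physical blocks, each holding data items
buffer = {}                  # buffer blocks currently in main memory

def input_block(b):          # input(B): bring a physical block into memory
    buffer[b] = dict(disk[b])

def output_block(b):         # output(B): write a buffer block back to disk
    disk[b] = dict(buffer[b])

class Transaction:
    def __init__(self):
        self.work = {}       # private work area (local copies Xi)

    def read(self, item, block):       # read(X)
        if block not in buffer:
            input_block(block)
        self.work[item] = buffer[block][item]

    def write(self, item, block):      # write(X): updates the buffer only
        if block not in buffer:
            input_block(block)
        buffer[block][item] = self.work[item]   # not yet on disk!
```

The crucial point from the text is visible here: after `write`, the new value lives only in the buffer block, and a crash before `output_block` would lose it.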
Recovery and Atomicity

Consider again our simplified banking system and transaction Ti that transfers $50 from account A to account B, with initial values of A and B being $1000 and $2000, respectively. Suppose that a system crash has occurred during the execution of Ti, after output(BA) has taken place, but before output(BB) was executed, where BA and BB denote the buffer blocks on which A and B reside. Since the memory contents were lost, we do not know the fate of the transaction; thus, we could invoke one of two possible recovery procedures:

Reexecute Ti. This procedure will result in the value of A becoming $900, rather than $950. Thus, the system enters an inconsistent state.

Do not reexecute Ti. The current system state has values of $950 and $2000 for A and B, respectively. Thus, the system enters an inconsistent state.

In either case, the database is left in an inconsistent state, and thus this simple recovery scheme does not work. The reason for this difficulty is that we have modified the database without having assurance that the transaction will indeed commit. Our goal is to perform either all or no database modifications made by Ti. However, if Ti performed multiple database modifications, several output operations may be required, and a failure may occur after some of these modifications have been made, but before all of them are made.

To achieve our goal of atomicity, we must first output information describing the modifications to stable storage, without modifying the database itself. As we shall see, this procedure will allow us to output all the modifications made by a committed transaction, despite failures. There are two ways to perform such outputs; we study them in Sections 44.4 and 44.5. In these two sections, we shall assume that transactions are executed serially; in other words, only a single transaction is active at a time. We shall describe how to handle concurrently executing transactions later, in Section 44.6.
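The failure of both naive options can be checked numerically. This sketch replays Ti's effect without any log, starting from the on-disk state at the moment of the crash (values taken from the example above).

```python
# Sketch of why naive recovery fails for Ti (transfer $50 from A to B).
# The crash happened after output(BA) but before output(BB), so on disk
# A already holds $950 while B still holds $2000.

disk = {"A": 950, "B": 2000}   # on-disk state at the moment of the crash

def transfer(db, amount=50):   # Ti's effect, replayed with no log to consult
    db["A"] -= amount
    db["B"] += amount

transfer(disk)                 # option 1: blindly reexecute Ti
# A is now $900 (debited twice). Not reexecuting would instead leave
# A = $950, B = $2000. In both cases the total no longer matches the
# original $3000, so the database is inconsistent either way.
```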
Log-Based Recovery

The most widely used structure for recording database modifications is the log. The log is a sequence of log records, recording all the update activities in the database. There are several types of log records. An update log record describes a single database write. It has these fields:

Transaction identifier: the unique identifier of the transaction that performed the write operation.
Data-item identifier: the unique identifier of the data item written. Typically, it is the location on disk of the data item.
Old value: the value of the data item prior to the write.
New value: the value that the data item will have after the write.

Other special log records exist to record significant events during transaction processing, such as the start of a transaction and the commit or abort of a transaction. We denote the various types of log records as:

<Ti start>. Transaction Ti has started.
<Ti, Xj, V1, V2>. Transaction Ti has performed a write on data item Xj. Xj had value V1 before the write, and will have value V2 after the write.
<Ti commit>. Transaction Ti has committed.
<Ti abort>. Transaction Ti has aborted.

Whenever a transaction performs a write, it is essential that the log record for that write be created before the database is modified. Once a log record exists, we can output the modification to the database if that is desirable. Also, we have the ability to undo a modification that has already been output to the database. We undo it by using the old-value field in log records.

For log records to be useful for recovery from system and disk failures, the log must reside in stable storage. For now, we assume that every log record is written to the end of the log on stable storage as soon as it is created. In Section 44.7, we shall see when it is safe to relax this requirement so as to reduce the overhead imposed by logging. We now describe two techniques for using the log to ensure transaction atomicity despite failures.
Observe that the log contains a complete record of all database activity. As a result, the volume of data stored in the log may become unreasonably large.

Deferred Database Modification

The deferred-modification technique ensures transaction atomicity by recording all database modifications in the log, but deferring the execution of all write operations of a transaction until the transaction partially commits. Recall that a transaction is said to be partially committed once the final action of the transaction has been executed. The version of the deferred-modification technique that we describe in this section assumes that transactions are executed serially.

When a transaction partially commits, the information on the log associated with the transaction is used in executing the deferred writes. If the system crashes before the transaction completes its execution, or if the transaction aborts, then the information on the log is simply ignored.

The execution of transaction Ti proceeds as follows. Before Ti starts its execution, a record <Ti start> is written to the log. A write(X) operation by Ti results in the writing of a new record to the log. Finally, when Ti partially commits, a record <Ti commit> is written to the log.

When transaction Ti partially commits, the records associated with it in the log are used in executing the deferred writes. Since a failure may occur while this updating is taking place, we must ensure that, before the start of these updates, all the log records are written out to stable storage. Once they have been written, the actual updating takes place, and the transaction enters the committed state.

Observe that only the new value of the data item is required by the deferred-modification technique. Thus, we can simplify the general update-log record structure that we saw in the previous section, by omitting the old-value field.

To illustrate, reconsider our simplified banking system.
Let T0 be a transaction that transfers $50 from account A to account B:

T0: read(A);
    A := A - 50;
    write(A);
    read(B);
    B := B + 50;
    write(B).

Let T1 be a transaction that withdraws $100 from account C:

T1: read(C);
    C := C - 100;
    write(C).

Suppose that these transactions are executed serially, in the order T0 followed by T1, and that the values of accounts A, B, and C before the execution took place were $1000, $2000, and $700, respectively. There are various orders in which the actual outputs can take place to both the database system and the log as a result of the execution of T0 and T1; one such order appears in Figure 44.3. The portion of the database log corresponding to T0 and T1 appears in Figure 44.2:

<T0 start>
<T0, A, 950>
<T0, B, 2050>
<T0 commit>
<T1 start>
<T1, C, 600>
<T1 commit>

Figure 44.2: Portion of the database log corresponding to T0 and T1.

Note that the value of A is changed in the database only after the record <T0, A, 950> has been placed in the log.

Using the log, the system can handle any failure that results in the loss of information on volatile storage. The recovery scheme uses the following recovery procedure:

redo(Ti) sets the value of all data items updated by transaction Ti to the new values. The set of data items updated by Ti and their respective new values can be found in the log.

The redo operation must be idempotent; that is, executing it several times must be equivalent to executing it once. This characteristic is required if we are to guarantee correct behavior even if a failure occurs during the recovery process.

After a failure, the recovery subsystem consults the log to determine which transactions need to be redone. Transaction Ti needs to be redone if and only if the log contains both the record <Ti start> and the record <Ti commit>. Thus, if the system crashes after the transaction completes its execution, the recovery scheme uses the information in the log to restore the system to a previous consistent state after the transaction had completed.
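The redo rule can be sketched directly from the log of the example above. Log records are modeled as plain tuples; a real system keeps them on stable storage, but a list suffices to illustrate the algorithm.

```python
# Sketch of redo-only recovery under deferred modification, using the
# example log. A transaction is redone iff the log holds both its
# <start> and <commit> records.

db = {"A": 1000, "B": 2000, "C": 700}

# Log as of a crash just after the write(C) record of T1 was logged:
log = [
    ("start", "T0"),
    ("write", "T0", "A", 950),
    ("write", "T0", "B", 2050),
    ("commit", "T0"),
    ("start", "T1"),
    ("write", "T1", "C", 600),   # T1 never committed
]

def recover(log, db):
    """Redo, in log order, every write of a committed transaction."""
    started = {r[1] for r in log if r[0] == "start"}
    committed = {r[1] for r in log if r[0] == "commit"}
    for rec in log:
        if rec[0] == "write" and rec[1] in started and rec[1] in committed:
            _, _, item, new_value = rec
            db[item] = new_value   # idempotent: repeating this is harmless
    return db
```

Because `recover` only copies new values from the log, running it a second time (say, after a crash during recovery) leaves the database unchanged, which is exactly the idempotence property the text requires.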
As an illustration, let us return to our banking example with transactions T0 and T1 executed one after the other in the order T0 followed by T1. Figure 44.3 shows the state of the log and database resulting from the complete execution of T0 and T1:

Log                 Database
<T0 start>
<T0, A, 950>
<T0, B, 2050>
<T0 commit>
                    A = 950
                    B = 2050
<T1 start>
<T1, C, 600>
<T1 commit>
                    C = 600

Figure 44.3: State of the log and database corresponding to T0 and T1.

Let us suppose that the system crashes before the completion of the transactions, so that we can see how the recovery technique restores the database to a consistent state. Figure 44.4 shows the log at three different times of crash:

(a)               (b)               (c)
<T0 start>        <T0 start>        <T0 start>
<T0, A, 950>      <T0, A, 950>      <T0, A, 950>
<T0, B, 2050>     <T0, B, 2050>     <T0, B, 2050>
                  <T0 commit>       <T0 commit>
                  <T1 start>        <T1 start>
                  <T1, C, 600>      <T1, C, 600>
                                    <T1 commit>

Figure 44.4: The same log, shown at three different times.

First, assume that the crash occurs just after the log record for the step write(B) of transaction T0 has been written to stable storage. The log at the time of the crash appears in Figure 44.4(a). When the system comes back up, no redo actions need to be taken, since no commit record appears in the log. The values of accounts A and B remain $1000 and $2000, respectively. The log records of the incomplete transaction T0 can be deleted from the log.

Now, let us assume the crash comes just after the log record for the step write(C) of transaction T1 has been written to stable storage. In this case, the log at the time of the crash is as in Figure 44.4(b). When the system comes back up, the operation redo(T0) is performed, since the record <T0 commit> appears in the log on the disk. After this operation is executed, the values of accounts A and B are $950 and $2050, respectively. The value of account C remains $700. As before, the log records of the incomplete transaction T1 can be deleted from the log.

Finally, assume that a crash occurs just after the log record <T1 commit> is written to stable storage. The log at the time of this crash is as in Figure 44.4(c). When the system comes back up, two commit records are in the log: one for T0 and one for T1. Therefore, the system must perform operations redo(T0) and redo(T1), in the order in which their commit records appear in the log. After the system executes these operations, the values of accounts A, B, and C are $950, $2050, and $600, respectively.

Finally, let us consider a case in which a second system crash occurs during recovery from the first crash.
Some changes may have been made to the database as a result of the redo operations, but not all changes may have been made. When the system comes up after the second crash, recovery proceeds exactly as in the preceding examples. For each commit record <Ti commit> found in the log, the system performs the operation redo(Ti); in other words, it restarts the recovery action from the beginning. Since redo writes values to the database independent of the values currently in the database, the result of a successful second attempt at redo is the same as though redo had succeeded the first time.

Checkpoints

When a system failure occurs, we must consult the log to determine those transactions that need to be redone and those that need to be undone. In principle, we need to search the entire log to determine this information. There are two major difficulties with this approach:

1. The search process is time consuming.
2. Most of the transactions that need to be redone have already written their updates into the database.

To reduce these types of overhead, we introduce checkpoints. During execution, the system maintains the log. In addition, the system periodically performs checkpoints, which require the following sequence of actions to take place:

1. Output onto stable storage all log records currently residing in main memory.
2. Output to the disk all modified buffer blocks.
3. Output onto stable storage a log record <checkpoint>.

Transactions are not allowed to perform any update action, such as writing to a buffer block or writing a log record, while a checkpoint is in progress.
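The three checkpoint actions above can be sketched as one function. The in-memory log, stable log, dirty-buffer table, and disk are modeled as simple Python structures with illustrative names; a real system would also block updaters for the duration, which is omitted here.

```python
# Sketch of a checkpoint: force the in-memory log records and the modified
# buffer blocks out, then append a <checkpoint> record to the stable log.

def checkpoint(log_memory, log_stable, dirty_blocks, disk):
    log_stable.extend(log_memory)       # 1. force all in-memory log records
    log_memory.clear()
    for block_id, data in dirty_blocks.items():
        disk[block_id] = dict(data)     # 2. force all modified buffer blocks
    dirty_blocks.clear()
    log_stable.append(("checkpoint",))  # 3. append the <checkpoint> record
```

After a checkpoint, recovery need only examine log records written after the last `("checkpoint",)` record, since every earlier committed update has already reached the disk in step 2.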
Points to Ponder

There are various types of failure that may occur in a system, each of which needs to be dealt with in a different manner: transaction failure, system crash, and disk failure.
Storage media can be distinguished by their relative speed, capacity, and resilience to failure, and classified as volatile storage or nonvolatile storage.
RAID systems guarantee that the failure of a single disk (even during data transfer) will not result in loss of data.
When a system failure occurs, we must consult the log to determine those transactions that need to be redone and those that need to be undone.

Review Terms

Failure
Types of failures
Storage structure
Storage type
Stable-storage implementation
Data access
Recovery and atomicity
Log-based recovery
Checkpoints

Students Activity

1. Define database recovery.
2. What are the various kinds of failures?
3. Define log-based recovery.
4. Define checkpoints.

Student Notes

The lesson content has been compiled from various sources in public domain including but not limited to the internet for the convenience of the users. The university has no proprietary right on the same.

Rai Technology University Campus
Dhodballapur Nelmangala Road, SH-74, Off Highway 207, Dhodballapur Taluk, Bangalore - 561204
E-mail: info@raitechuniversity.in
Web: www.raitechuniversity.in