Part 2 - The Database Environment

Relational Database 2

The purpose of an RDBMS is to provide users with an abstract view of the data, hiding certain details of how data are stored and manipulated. Therefore, the starting point of the design of data is also abstract and generalized to the needs of the organization.

The first thing to do is to model the data for our needs. All RDBMS capture the real world in something called an entity. In the soccer example, the entities include the Team, Field, Players, etc. Each entity has certain properties or qualities (the Player, for example, has a last name, first name, phone, date of birth and a team). These qualities are called attributes. Next we want to know the relationships between entities. For instance, the Team plays a Match; the Referee is assigned a Match.

The physical data level of the RDBMS is the actual computer hardware and software holding the data. In the image we see that some fields are shown to some users (called the data view); the data are actually drawn from one database table; then we see the structure of that table and finally a suggestion of its physical disks.

[Figure (levels1.ai): the levels of the database environment]

EXTERNAL LEVEL: two views (each a user's view of the data), drawing on the fields idno, lname, fname, salary, staffno, deptno, lname, phone

CONCEPTUAL LEVEL: lname, fname, salary, staffno, deptno, phone

INTERNAL LEVEL:

    struct STAFF {
        int staffno;
        int deptno;
        char phone[12];
        char fname[15];
        char lname[25];
        struct date DateOfBirth;
        float salary;
        struct STAFF *next;          /* pointer to next Staff record */
    };
    index staffno, index deptno;     /* define index for staff */

PHYSICAL LEVEL: Physical Data Organization

One of the main purposes of dividing the logical from the physical is to preserve the independence of the data: independent, that is, from specific user needs and from particular physical implementation issues. Logical data independence means the external views are immune to changes at the conceptual level. In other words, if the database administrator updates the structure of the database (adds a field or a table, say), the end-user is still served the appropriate data. Physical data independence means the conceptual level is, in turn, immune to changes at the internal and physical levels. If the data are moved to new disks, stored in a different file organization, or indexed differently, the tables and the users' views stay the same. Notice that all this is abstract: we've not mentioned any particular need or any particular equipment.

Databases are commanded the same way: the commands are actually representations of the relational algebra in the model for the relational database scheme (as opposed to the network model, hierarchical model, and others). Because the commands are abstract, their physical implementation (how the command is processed by the computer) is independent of the command itself, and vice versa. For example, we could issue the SQL statement SELECT * FROM table1, table2. On the one hand, this reflects the Cartesian product of table1 x table2. Our results (the view) would show us every combination of rows from the two tables! On the other hand, perhaps someone will create a (human-sounding) language where we query the database with great courtesy, "Oh, Mighty Computer, please show me all the data!", and have the command return the same results. Or perhaps we might want to tailor the commands to another human language. Instead of SELECT * FROM mytable we want to allow the user to type in ВЫБЕРИТЕ * ОТ моейtаблицы. One part of this abstract way of communicating with the database is the data definition language, or DDL.
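The Cartesian product mentioned above is easy to check on a toy database. The following is only a sketch: it uses Python's built-in sqlite3 module with an in-memory database, and the contents of table1 and table2 are made up for illustration. With 2 rows in one table and 3 in the other, the query should return 2 x 3 = 6 combined rows.

```python
import sqlite3


def cartesian_product_demo():
    """Build two tiny tables and return every row of their Cartesian product."""
    conn = sqlite3.connect(":memory:")  # throwaway in-memory database
    conn.execute("CREATE TABLE table1 (letter TEXT)")
    conn.execute("CREATE TABLE table2 (digit INTEGER)")
    # Hypothetical sample data: 2 rows and 3 rows.
    conn.executemany("INSERT INTO table1 VALUES (?)", [("a",), ("b",)])
    conn.executemany("INSERT INTO table2 VALUES (?)", [(1,), (2,), (3,)])
    # Listing two tables with no join condition pairs every row of the
    # first table with every row of the second.
    rows = conn.execute(
        "SELECT * FROM table1, table2 ORDER BY letter, digit"
    ).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    for row in cartesian_product_demo():
        print(row)
```

In practice a join condition is almost always added, because the size of the result grows as the product of the table sizes.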
More precisely, a DDL allows the user to describe and name the entities required for the application and the relationships that may exist between the different entities. The database management system (the actual software) maintains for itself a repository, called the system catalog, that holds the database's metadata, or data about data. We can query the metadata. We might want to ask how many rows (records) there are in a table - the answer is in the metadata.

In SQL, the DDL includes the commands to create databases and tables. A data manipulation language (DML) defines the commands for retrieving, inserting, updating, and deleting data. In SQL, CREATE DATABASE mydb; is an example of DDL; SELECT * FROM mytable; is an example of DML.

Part of the RDBMS tools is a directory of who has access to what databases and tables as well as who can issue what commands. These are called rights (or privileges), and they are granted by the system administrator or database administrator (DBA). The SQL commands that grant and take away rights, GRANT and REVOKE, are usually called the data control language (DCL). When you access a website that in turn queries a database (as when you log in to some systems), you are logging in as an anonymous user. As such you're given rights to SELECT data. Except where input is allowed, the user cannot do much with the database and tables. In an office setting, the circulation staff might see (view) only the patron's name, any fines, phone, and items that are checked out. The circulation manager might be able to see those fields as well as the entire borrowing history, home address, and other data about this client. The manager may have INSERT, UPDATE, and DELETE rights while the other staff may have only SELECT rights. This introduces the whole field of security that we'll return to later.

The entity relationship model is at times replaced or complemented by the object oriented model, where entities, attributes, and relationships are all integrated into a single class (or encapsulation).

The main services of an RDBMS:
1. Data storage, retrieval, and update
2. User-accessible catalog (the descriptions of the data maintained by the RDBMS)
3. Transaction support (either all the updates that make up a transaction are made or none are)
4. Concurrency control (make sure the data are updated correctly when several users change them at the same time)
5. Recovery (if something happens, there must be a way to recover the data)
6. Authorization service (only authorized users view and have access to certain data)
7. Support for data communication (data can be shared with other computers, independent of the techniques used to send them, e.g., TCP/IP, Z39, CORBA, other)
8. Integrity services (the data in the database and changes to them follow certain rules)
9. Support for data independence
10. Utility services (tools for exporting and importing data from other formats and systems)

Today, RDBMS are increasingly applied to the Internet - hence web-enabled databases that serve data to remote computers and to mobile devices. Below is an example of a system architecture. In real practice, there will be firewalls and VPN issues to contend with. We review these when we discuss security issues. We will also discuss delivering data to mobile devices and the use of SQLite on the iPad and iPhone.
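Two of the services listed above, the user-accessible catalog (2) and transaction support (3), can be tried out in a few lines. This is a sketch under stated assumptions: it uses SQLite through Python's built-in sqlite3 module, where the catalog is the sqlite_master table (other products name their catalogs differently), and the accounts table with its balances is hypothetical.

```python
import sqlite3


def catalog_and_transaction_demo():
    """Return (table names found in the catalog, balances after a failed transaction)."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical table; the CHECK rule forbids negative balances.
    conn.execute(
        "CREATE TABLE accounts ("
        " name TEXT PRIMARY KEY,"
        " balance INTEGER NOT NULL CHECK (balance >= 0))"
    )
    conn.executemany(
        "INSERT INTO accounts VALUES (?, ?)", [("ann", 100), ("bob", 50)]
    )
    conn.commit()

    # Service 2: ask the catalog which tables exist.
    tables = [
        row[0]
        for row in conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table'"
        )
    ]

    # Service 3: both updates belong to one transaction. The second one
    # breaks the CHECK rule, so the first one must be undone as well.
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE accounts SET balance = balance + 500 WHERE name = 'bob'"
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - 500 WHERE name = 'ann'"
            )
    except sqlite3.IntegrityError:
        pass  # the transaction was rolled back as a whole

    balances = dict(conn.execute("SELECT name, balance FROM accounts"))
    conn.close()
    return tables, balances


if __name__ == "__main__":
    print(catalog_and_transaction_demo())
```

After the failed transfer both balances are unchanged, which is the "all or none" behavior the list describes.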

[Figure (typicaldbarch.ai): a typical database architecture. CLIENTS: a local computer on the LAN, plus remote computers and mobile devices via the Internet. SERVER: web server software (e.g., Apache) running servlets, JSP, PHP, and other programs; flat files (e.g., xslt, html, css, txt, pdf); and a connection to the database (and its tables and indices) in MySQL, Oracle, or others. The program submits a query over the connection, and the hits are returned in an object, the ResultSet.]

The above figure represents a typical architecture for using an RDBMS. In this example, there are 3 clients: one local and two devices via the Internet. Naturally, we're not limited to three users - this is just an image. Whether connected locally or via the Internet, the human end-user (or a computer) sends a request for data (called a query).

Let's use the example of a web form for someone logging in to the library's website. In that form, there is the name-value pair of method and post, along with action and the name of the program to be run on the server. The end-user sends his id and password to the web server. The web server software (about half the time it is the open source project Apache) captures the data from the form and then streams the data to the program named in the action element.

Typically, the program running on the server will take the data passed from the form and use them to construct an SQL statement - exactly as if a human typed it! The program then creates a Connection using software specific to the operating system and DB product. For example, the Driver (an object used to complete the bridge between the program calling the data and the database itself) varies by product and operating system. To connect to MS Access using Windows and Macintosh, check the system preferences for the ODBC icon; click on that icon and select what file types and sources are to be used by your program, such as making your MS Access database visible via the Internet.
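The server-side step described above, taking the id and password from the form and turning them into an SQL statement, might look roughly like the sketch below. Everything specific in it is an assumption made for illustration: the patrons table, its columns, the sample patron, and the choice to store a salted PBKDF2 hash instead of the password itself. The form is represented as a plain dictionary of name-value pairs. The statement uses ? placeholders rather than pasting the form values into the SQL text, so that whatever the user types is treated as data and never as part of the command.

```python
import hashlib
import hmac
import sqlite3


def hash_password(password, salt):
    """Derive a hash from the password so the password itself is never stored."""
    return hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, 100_000)


def make_demo_db():
    """Create a hypothetical patrons table holding one sample patron."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE patrons ("
        " idno TEXT PRIMARY KEY, lname TEXT, salt BLOB, pw_hash BLOB)"
    )
    salt = b"demo-salt"  # fixed only for this demo; use os.urandom(16) in real code
    conn.execute(
        "INSERT INTO patrons VALUES (?, ?, ?, ?)",
        ("031", "Smith", salt, hash_password("s3cret", salt)),
    )
    conn.commit()
    return conn


def login(conn, form):
    """Return the patron's last name if id and password match, else None."""
    row = conn.execute(
        "SELECT lname, salt, pw_hash FROM patrons WHERE idno = ?",
        (form["id"],),  # bound as data, not spliced into the SQL string
    ).fetchone()
    if row is None:
        return None
    lname, salt, pw_hash = row
    if hmac.compare_digest(hash_password(form["password"], salt), pw_hash):
        return lname
    return None


if __name__ == "__main__":
    db = make_demo_db()
    print(login(db, {"id": "031", "password": "s3cret"}))
    print(login(db, {"id": "' OR '1'='1", "password": "anything"}))
```

The second call shows why the placeholder matters: the odd-looking id is simply looked up as an id that does not exist.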

But most folk write their own programs using Java, PHP, and the like. In these situations, the program must identify which Driver to use and the database, commonly MySQL. [For the connection to MySQL running on Linux, use the org.gjt.mm.mysql.Driver software; for Mac and Unix, it's com.mysql.jdbc.Driver; other versions exist for Windows. Both names refer to the MySQL JDBC driver; org.gjt.mm.mysql.Driver is the older name.] In real life, you might include references to both drivers in the code and then use some environmental variables or data to determine which OS your database is running on. For example, in a computer program we might prepare for both:

    String driver1 = "org.gjt.mm.mysql.Driver";   // LINUX
    String driver2 = "com.mysql.jdbc.Driver";     // MAC
    String os = System.getProperty("os.name");    // e.g., "Linux" or "Mac OS X"
    String drivertouse;
    if (os.equals("Linux")) {
        drivertouse = driver1;
    } else {
        drivertouse = driver2;
    }

Once the connection is created, the program creates a special object called a Statement. The Statement object contains the query - it is a kind of wrapper to send the SQL command to the database. The command is then executed by the SQL software. In response to the query, the SQL software sends data back in an object called a ResultSet. The programmer can extract metadata from the ResultSet, such as how many columns there are in the set and what they are called, or other data about the data. The programmer can then select only the fields s/he wants to show the user. For example, when the Circulation Manager logs in, the program is written to send him all the fields. When the staff log in, the program sends only the patron's name, phone, and a list of items checked out.

Note that the raw data cannot be used as they are. The program must convert the data into something that is appropriate for the output device. If the data are being sent back to the user's browser, then the output is often HTML, XML, or PDF. If the data are sent to another computer, or the end-user wants to import the data into a spreadsheet, the data may be exported first as tab-delimited data.
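The flow just described (run the query, read the metadata, keep only the fields a given user may see, then convert the raw rows for the output device) can be sketched as follows. This is an illustration in Python with the built-in sqlite3 module rather than the Java objects named above; the cursor plays the part of the ResultSet. The patrons table, its sample rows, and the field lists per role are all hypothetical.

```python
import html
import sqlite3

# Hypothetical rule: which fields each kind of user is allowed to see.
FIELDS_BY_ROLE = {
    "manager": ["idno", "lname", "fname", "phone", "address"],
    "staff": ["lname", "fname", "phone"],
}


def make_demo_db():
    """Create a hypothetical patrons table with two sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE patrons (idno TEXT, lname TEXT, fname TEXT,"
        " phone TEXT, address TEXT)"
    )
    conn.executemany(
        "INSERT INTO patrons VALUES (?, ?, ?, ?, ?)",
        [
            ("031", "Smith", "Jane", "555-0101", "1 Elm St"),
            ("393", "Gomes", "John", "555-0102", "2 Oak Ave"),
        ],
    )
    conn.commit()
    return conn


def fetch_for_role(conn, role):
    """Query only the columns the role may see; return (column names, rows)."""
    # Column names come from the fixed table above, never from user input.
    column_list = ", ".join(FIELDS_BY_ROLE[role])
    cur = conn.execute(f"SELECT {column_list} FROM patrons ORDER BY idno")
    names = [col[0] for col in cur.description]  # metadata about the result
    return names, cur.fetchall()


def to_tab_delimited(names, rows):
    """Convert rows to tab-delimited text, e.g. for import into a spreadsheet."""
    lines = ["\t".join(names)]
    lines += ["\t".join(str(value) for value in row) for row in rows]
    return "\n".join(lines)


def to_html_table(names, rows):
    """Convert rows to an HTML table, escaping the data for the browser."""
    head = "".join(f"<th>{html.escape(name)}</th>" for name in names)
    body = "".join(
        "<tr>"
        + "".join(f"<td>{html.escape(str(value))}</td>" for value in row)
        + "</tr>"
        for row in rows
    )
    return f"<table><tr>{head}</tr>{body}</table>"


if __name__ == "__main__":
    db = make_demo_db()
    names, rows = fetch_for_role(db, "staff")
    print(to_tab_delimited(names, rows))
    print(to_html_table(names, rows))
```

The same rows feed both converters; only the last step depends on where the data are going.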
If we use the terminal window to connect to SQL, we could issue the command SELECT * FROM staff; and might see the following on the terminal:

    +------+-------+-------+-----+
    | idno | lname | fname | age |
    +------+-------+-------+-----+
    | 031  | Smith | Jane  | 27  |
    | 393  | Gomes | John  | 31  |
    and so on...

But to display Ms. Smith's record in the browser, we must first make changes to the data while inside the program running on the web server. In this example, we're using a Java servlet. The command to send the data back is println (for "print line"). We'd like to have the lname field contents appear in red in the browser. So we combine HTML and data from the ResultSet, instantiated as rs in this example:

    out.println("Welcome, <font color=\"red\">" + rs.getString("lname") + "</font>");

Here out is the PrintWriter the servlet obtains from its response object. Notice how plain text ("Welcome") is combined in the same String as the HTML tags. Notice, too, that the ResultSet object rs is being asked to extract only data from the lname field, and notice that we are looking for the String, or alphanumeric, data type (getString()).

Part of our documentation, the Data Dictionary, clarifies for all users, programmers, and web designers exactly what data we want and where. When we get further along in our demonstration distance education case we'll see how the documentation you create is used to design input screens, reports and output screens, and in creating the databases and tables themselves.

Section 2

You can see, then, that there's a lot of work and thought involved in creating an RDBMS. Most of the effort is in the logical phase of the project. This phase defines the tables, fields, and keys (primary and foreign), and establishes table relationships and levels of data integrity. The implementation of this logical design includes using a DBMS (such as MySQL) to create the tables and their relationships in the computer and using or creating tools to implement levels of data integrity. Many people create these tables and issue commands to extract data at will; this is very important for on-the-job data you'll need, but since most RDBMS are used to fulfill others' data needs, we review the Conceptual phase (which details the organization's data and work behaviors), the Logical phase, and then the Physical phase, where we build and test the product.

In practice, you need data as part of the running of your institution and as part of the management of those data [data to manage data]. Therefore, some databases are operational [they store data that are collected, maintained, and modified (dynamic data), such as sales transactions, circulation services, cataloguing, etc.]. Other types of databases are analytical. These track historic (static) data that are often used in transaction analyses and in creating other statistical reports. For example, to review circulation trends or web use, researchers and practitioners use access logs from web servers to extract data about queries, what databases and other servers were used, etc.

Other data models have difficulties inserting data efficiently (adding a new record) and in handling redundant data (for instance, someone's id appears in many places). The relational part of RDBMS is an application of the mathematics of set theory and first order predicate logic by E. F. Codd in the 1970s. He defined a relational algebra that demonstrated how an individual datum could be identified uniquely; his mathematical model was implemented in software as the relational database model (RDM). In RDM data are stored in a relation or table. [The two terms are used interchangeably so you should know them both.] See Figure 1.

[Figure 1: the Branch No table and the Staff table, linked by the Bno field.]

Each table (relation) has rows or records. In Figure 1, you see a single row [B5, 22 Deer Rd, Beacon Hill, Boston, 02139, 617-555-1212]. This constitutes a single record in the table. The single row is called a tuple [pronounced like "two-pull"]. [The real-world things that interest us and that we called entities above have to be converted into something the computer understands, and in the terminology of RDBMS the entities are stored in tables. The tuples hold the specific attributes of the entities.] In the tuple we see B5, 22 Deer Rd, etc. Each of these data (e.g., "B5") sits in a field (or column). The columns represent attributes of the row. In a different example, let's say you have a table of student employee data:

    ID   Last name   First Name   Hourly
    53   Smith       Betty        10.50

This student employee has attributes of an ID (53), a Last name (Smith), and so on. Each row or record is identified by a unique field, known as the Primary key. It is common when creating a database to include a record number field that is a unique number for the record. Typically we ask the SQL software to determine the number and store it automatically in the field. For example, in the above example, we might actually design the table to include a field called recno (for Record Number) and ask the system to auto increment that field:

    recno   ID   Last name   First Name   Hourly
    983     53   Smith       Betty        10.50

RDBMS cluster data that are related by some theme into tables that make sense to the people who use the data. If you created a payroll system, you might not want everyone to know everyone else's salary. So you might collect all the contact data (the staff's name, address, telephone, email, etc.) in one table; you might have another table that organizes staff by departments (circulation staff, technical services staff, etc.). Now we can link a staff member's data (stored in the contact table) to the department in which he or she works (departments table). We have, then, created a relationship.

There are three relationships in RDBMS: one-to-one (1:1), one-to-many (1:M), and many-to-many (M:N). In this example, one staff member works for one department (1:1). It could be possible, too, that the same staffer works in two departments (1:M). [We try to break down M:N relationships into 1:1 or 1:M relationships because it becomes very difficult to maintain data integrity if data appear in multiple ways in multiple tables. Breaking down these M:N relationships is at times rather artificial and usually cannot be done successfully without knowing how your organization works. We review this again in detail later.]

If a relationship is known between tables, we identify the type of relationship. Notice in Figure 1 that the Branch No table and the Staff table share data (Bno, or "branch number"). In our example, the field names are the same but they don't have to be. The Branch No and Staff tables are linked via the Bno field. This means when we search for a Staff person named Nancy Smithy, we see her ID is S3 and her Branch Number is B3. Following the link we see Bno B3 is located at 4 Smith St. From the point of view of the Branch Number table, the link has two ends: one end is located in the Branch Number table and it is the primary key; its other end is in a different (or foreign) table, the Staff table, so it is called a foreign key.
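The auto-incremented record number and the Bno link described above can be tried on a small scale. The sketch below assumes SQLite through Python's built-in sqlite3 module; the table and column names are adapted from the text, the second staff member is invented to fill out the table, and the exact auto-increment keyword differs between products (AUTOINCREMENT here, AUTO_INCREMENT in MySQL).

```python
import sqlite3


def branch_staff_demo():
    """Follow the Bno link from a staff member to her branch.

    Returns (the joined row for Nancy Smithy, whether a bad Bno was rejected).
    """
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite checks foreign keys only on request
    conn.execute("CREATE TABLE branch (bno TEXT PRIMARY KEY, street TEXT NOT NULL)")
    conn.execute(
        "CREATE TABLE staff ("
        " recno INTEGER PRIMARY KEY AUTOINCREMENT,"   # numbered by the system
        " sno TEXT NOT NULL UNIQUE,"
        " name TEXT NOT NULL,"
        " bno TEXT NOT NULL REFERENCES branch (bno))"  # foreign key into branch
    )
    conn.executemany(
        "INSERT INTO branch VALUES (?, ?)",
        [("B3", "4 Smith St"), ("B5", "22 Deer Rd")],
    )
    conn.executemany(
        "INSERT INTO staff (sno, name, bno) VALUES (?, ?, ?)",
        [("S3", "Nancy Smithy", "B3"), ("S7", "Example Person", "B5")],  # S7 is made up
    )
    # Follow the link: staff.bno (foreign key) -> branch.bno (primary key).
    row = conn.execute(
        "SELECT staff.recno, staff.sno, staff.bno, branch.street"
        " FROM staff JOIN branch ON staff.bno = branch.bno"
        " WHERE staff.name = ?",
        ("Nancy Smithy",),
    ).fetchone()
    # A staff record pointing at a branch that does not exist is refused.
    try:
        conn.execute(
            "INSERT INTO staff (sno, name, bno) VALUES ('S9', 'Nobody', 'B9')"
        )
        rejected = False
    except sqlite3.IntegrityError:
        rejected = True
    conn.close()
    return row, rejected


if __name__ == "__main__":
    print(branch_staff_demo())
```

Note that recno was never supplied in the INSERT statements; the system assigned it.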
[If we turn things around, from the point of view of the Staff table, Bno is the foreign key: it points to the Branch No table, where Bno is the primary key. The Staff table has its own primary key, the staff ID.]

Recap of Terminology:

Four categories of terms are described in this chapter: value-related, structure-related, relationship-related, and integrity-related.

Value Related Terms

Data are the values that are stored in the database. They are static in the sense that they remain in the same state until they are modified.

Information is data that has been processed in a way that makes it meaningful. It can be shown as the result of a query, either displayed on-screen or printed in a report.

Null is a value that is either missing or unknown. A null value represents neither zero nor blank, as they are actual values and can be meaningful in certain circumstances. A drawback to null values is that they cannot be evaluated by mathematical expressions: a calculation that includes a null yields null.

Structure Related Terms

A table is the main structure in a relational database. It is composed of fields and records, the order of which is completely unimportant. It always represents a single, specific subject, which can be an object or an event.

A field (also known as an attribute) is the smallest structure in a relational database. It represents a characteristic of the subject of the table. A field may be multipart, multi-valued, or the result of a calculation (concatenated).

A record (also known as a tuple) is a structure within a table that represents a unique instance of the subject of the table.

A View is a virtual table that is composed of the fields of one or more tables. It draws its data from the tables on which it is based. Views are commonly implemented as saved queries.

    STUDENTS TABLE
    Student ID   Student First Name   Student Last Name   Student Phone
    1001         Jeff                 Jones               5-1234
    1009         Luella               Crabtree            999-3029
    1191         Suky                 Cat                 6-5440

    INSTRUMENTS TABLE
    Instrument ID   Student ID   Instrument Type   Instrument Desc
    401             1001         Guitar            Stratocaster
    444             1009         Marimba           Mbasa
    423             1191         Drums             Zydek

    STUDENT INSTRUMENTS (VIEW)
    Student ID   Student Last Name   Instrument Desc
    1001         Jones               Stratocaster
    1009         Crabtree            Mbasa
    1191         Cat                 Zydek

In this example, the STUDENT INSTRUMENTS View is composed of fields taken from both the STUDENTS table and the INSTRUMENTS table. Data in the View is drawn from both tables simultaneously, based on matching values between the Student ID field in the STUDENTS table and the Student ID field in the INSTRUMENTS table.

Keys are special fields that serve a specific purpose within a table. A Primary key is a field that uniquely identifies a record within a table. A Foreign key is the field that is used to establish a relationship between a pair of tables. In the following example, Agent ID is the Primary key of AGENTS because it uniquely identifies each record in that table. Similarly, Client ID is the Primary key of CLIENTS because it also uniquely identifies each of the table's records. Agent ID in the CLIENTS table is a Foreign key because it is used to establish a relationship between the CLIENTS and the AGENTS tables.
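The STUDENT INSTRUMENTS view described above can be built as a saved query. The following sketch uses SQLite through Python's built-in sqlite3 module; the lower-case table and column names are adaptations of the headings in the text, and the replacement description used in the update ("Telecaster") is invented to show that a view holds no data of its own.

```python
import sqlite3


def student_instruments_view():
    """Create the two base tables and the view; return (view rows, value after an update)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE students ("
        " student_id INTEGER PRIMARY KEY, first_name TEXT,"
        " last_name TEXT, phone TEXT)"
    )
    conn.execute(
        "CREATE TABLE instruments ("
        " instrument_id INTEGER PRIMARY KEY,"
        " student_id INTEGER REFERENCES students (student_id),"
        " instrument_type TEXT, instrument_desc TEXT)"
    )
    conn.executemany(
        "INSERT INTO students VALUES (?, ?, ?, ?)",
        [
            (1001, "Jeff", "Jones", "5-1234"),
            (1009, "Luella", "Crabtree", "999-3029"),
            (1191, "Suky", "Cat", "6-5440"),
        ],
    )
    conn.executemany(
        "INSERT INTO instruments VALUES (?, ?, ?, ?)",
        [
            (401, 1001, "Guitar", "Stratocaster"),
            (444, 1009, "Marimba", "Mbasa"),
            (423, 1191, "Drums", "Zydek"),
        ],
    )
    # The view is a saved query that matches rows on the Student ID field.
    conn.execute(
        "CREATE VIEW student_instruments AS"
        " SELECT s.student_id, s.last_name, i.instrument_desc"
        " FROM students AS s"
        " JOIN instruments AS i ON i.student_id = s.student_id"
    )
    rows = conn.execute(
        "SELECT * FROM student_instruments ORDER BY student_id"
    ).fetchall()
    # Change a base table; the view reflects the change the next time it is read.
    conn.execute(
        "UPDATE instruments SET instrument_desc = 'Telecaster'"
        " WHERE instrument_id = 401"
    )
    updated = conn.execute(
        "SELECT instrument_desc FROM student_instruments WHERE student_id = 1001"
    ).fetchone()[0]
    conn.close()
    return rows, updated


if __name__ == "__main__":
    print(student_instruments_view())
```

Because the view is only a stored SELECT, nothing has to be copied or kept in step when the base tables change.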

    AGENTS
    Agent ID   Agent First Name   Agent Last Name   Date of Hire   Agent Phone
    100        Chris              Girard            06/01/06       555-1212
    101        Chip               Pinky             11/23/06       909-0392
    102        Bunny              Rabbit            09/07/08       641-1234

    CLIENTS
    Client ID   Agent ID   Client First Name   Client Last Name   Client Phone
    432         100        Bix                 Benoit             867-5309
    9883        100        Felicia             Finklestein        909-0932
    871         101        Skip                Neant              4-222-1234

Relationship-related Terms

Relationships establish a connection between a pair of tables. This relationship exists when a pair of tables is connected by a Primary and Foreign key.

Types of Relationships

One-to-One: Exists between a pair of tables if a single record in the first table is related to one and only one record in the second table.

    EMPLOYEES
    Employee ID   Emp First Name   Emp Last Name   Phone
    101           Skip             Neant           909-0932
    100           Felicia          Finkelstein     444-1234

    COMPENSATION
    Employee ID   Hourly Rate   Commission Rate
    100           25.00         50%
    101           10.59         27%

One-to-Many: Exists between a pair of tables if a single record in the first table is related to one or more records in the second table, but a single record in the second table can be related to only one record in the first table. This is the most common type of relationship.

    STUDENTS
    Student ID   Student First Name   Student Last Name   Phone

    INSTRUMENTS
    Instrument ID   Student ID   Instrument Type   Instrument Desc

Many-to-Many: Exists between a pair of tables if a single record in the first table can be related to one or more records in the second table, and a single record in the second table can be related to one or more records in the first table. Establishing a direct connection between these two tables is difficult because it will produce a large amount of redundant data in one of the tables.

    STUDENTS
    Student ID   Student First Name   Student Last Name   Student Phone

    CLASSES
    Class ID   Class Name   Instructor ID

Types of Participation

There are two types of participation that a table can have within a relationship: mandatory and optional. If records in Table A must exist before any records can be entered into Table B, then Table A's participation within the relationship is mandatory. If not, it is considered optional.

Each table in a relationship has a degree of participation, which is the minimum and maximum number of records in one table that can be related to a single record in the other table. Consider the Agents and Clients tables. If we say that an agent should have at least one client but no more than eight, then the degree of participation for the Clients table is 1,8.

Integrity-related Terms

A field specification (also known as a domain) represents all the elements of a field. Each field specification has three types of elements: general, physical and logical. A field's general elements include such items as field name, description and source table. Physical elements include items such as data type, length, and display format. Logical elements describe the values stored in a field, such as required value, range of values and default values.

Data integrity refers to the validity, consistency and accuracy of the data in a database. The four types of data integrity are:

Table-level integrity ensures that the field that identifies each record within the table is unique and is never missing its value.

Field-level integrity ensures that the structure of every field is sound and that the values in each field are valid, consistent and accurate.

Relationship-level integrity ensures that the relationship between a pair of tables is sound and that there is synchronization between the two tables whenever data is entered, updated or deleted.

Business rules impose restrictions or limitations on certain aspects of a database based on the ways an organization perceives and uses its data.

Before we get too much further into the complexities of data modeling, let us focus on the conceptual phase to see how we can apply RDBMS in a real-world situation.
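As a small preview, the many-to-many case and three of the integrity levels recapped above can be tried in one sketch. It assumes SQLite through Python's built-in sqlite3 module. The linking table enrollments, the seats column with its 1 to 40 range, and all the sample rows are hypothetical; they are there only so that each rule has something to reject. A linking table is one common way to break a many-to-many relationship into two one-to-many relationships.

```python
import sqlite3


def rejected(conn, sql):
    """Run one statement; report True if the database refuses it on integrity grounds."""
    try:
        with conn:  # rolls back automatically if the statement fails
            conn.execute(sql)
    except sqlite3.IntegrityError:
        return True
    return False


def integrity_demo():
    """Return a dict describing which bad rows were refused, plus two M:N counts."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite checks foreign keys only on request
    conn.executescript(
        """
        CREATE TABLE students (
            student_id INTEGER PRIMARY KEY,              -- table-level integrity
            last_name  TEXT NOT NULL
        );
        CREATE TABLE classes (
            class_id   INTEGER PRIMARY KEY,
            class_name TEXT NOT NULL,
            seats      INTEGER NOT NULL
                       CHECK (seats BETWEEN 1 AND 40)    -- field-level: range of values
        );
        -- Linking table: one row per (student, class) pair.
        CREATE TABLE enrollments (
            student_id INTEGER NOT NULL REFERENCES students (student_id),
            class_id   INTEGER NOT NULL REFERENCES classes (class_id),
            PRIMARY KEY (student_id, class_id)
        );
        INSERT INTO students VALUES (1001, 'Jones'), (1009, 'Crabtree');
        INSERT INTO classes VALUES (1, 'Guitar I', 10), (2, 'Percussion', 12);
        INSERT INTO enrollments VALUES (1001, 1), (1001, 2), (1009, 1);
        """
    )
    result = {
        # Table-level: a second record with the same key value.
        "duplicate_key": rejected(conn, "INSERT INTO students VALUES (1001, 'Smith')"),
        # Field-level: a value outside the allowed range.
        "bad_range": rejected(conn, "INSERT INTO classes VALUES (3, 'Choir', 0)"),
        # Relationship-level: an enrollment for a student who does not exist.
        "orphan_row": rejected(conn, "INSERT INTO enrollments VALUES (9999, 1)"),
    }
    # Many-to-many: one student in several classes, one class with several students.
    result["classes_for_1001"] = conn.execute(
        "SELECT COUNT(*) FROM enrollments WHERE student_id = 1001"
    ).fetchone()[0]
    result["students_in_class_1"] = conn.execute(
        "SELECT COUNT(*) FROM enrollments WHERE class_id = 1"
    ).fetchone()[0]
    conn.close()
    return result


if __name__ == "__main__":
    print(integrity_demo())
```

Business rules that go beyond a single field, such as a limit on how many clients an agent may have, usually need triggers or application code and are not shown here.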