Foundations of Information Management - WS 2009/10 Juniorprofessor Alexander Markowetz Bonn Aachen International Center for Information Technology (B-IT)
Alexander Markowetz: Born 1976 in Brussels, Belgium. Raised in Marburg, Germany. Research visits: University of California, Riverside (2000); Polytechnic Institute of New York University (2004 and 2005). 2004: Diplom Informatik, University of Marburg. 2008: PhD Computer Science, The Hong Kong University of Science and Technology. 2009: Assistant Professor in Bonn.
My Research Database Systems Information Retrieval (Search Engines) At the moment: Searching in Databases Searching in Code Repositories Architectures for Online Games
My Other Life Scuba Diving Yoga Hiking A lot of time in Asia
Contacts Room A225 Römerstr. 164 Tel.: 0228 73-7409 alex@iai.uni-bonn.de http://www.iai.uni-bonn.de/~alex I will NOT be available directly after class But usually before Else, send an email
Schedule 20. October 27. October 3. November 10. November 17. November 24. November 1. December 8. December 15. December 22. December 29. December 5. January 12. January 19. January 26. January 2. February Only 13 lectures this semester! Little time for a lot of things Consequences: Work efficiently Concentrate on key topics Little time for practical exercise You need to practice independently In every spare minute
Attendance You need to attend at least 80% = 10 of the 13 lectures You will catch a cold, at some time during the winter Hence, never skip classes intentionally Save the days for when you are really sick
Homework: The class will contain a certain amount of homework and interactive exercises. What and how much will be determined throughout the semester. We need to see what works best. Homework and class participation are mandatory to qualify for the final exam.
Exams One Final Exam Early February Exact date to be announced! We aim at the ECTS grading scheme: A - 10% B - 25% C - 30% D - 25% E - 10%
Background Dilemma: Broad spectrum of students' backgrounds: life sciences to computer science (and beyond)! Some people know (nearly) nothing about information management using computers. Others know something, or even a lot, about databases and information systems. Be patient! Assist others!
Two very different classes
Bio Databases (Prof. Hofmann-Apitius): Application oriented. Real examples. Few general techniques. From LSI perspective.
Foundations of Information Management: Method oriented. General techniques. From CS perspective.
Information and Resources
Quick Books (Schaum's Outlines):
R.A. Mata-Toledo, P.K. Cushman: Fundamentals of Relational Databases, 2000.
R.A. Mata-Toledo, P.K. Cushman: Fundamentals of SQL Programming, 2000.
Real Database Textbooks:
A. Silberschatz, H. F. Korth and S. Sudarshan: Database System Concepts, 2006.
R. Ramakrishnan, J. Gehrke: Database Management Systems, 2003.
Conferences and Journals This may all be a bit early for you But, if you do read papers, read good ones: Only top-10 publications You find most of the papers online: DBLP: http://www.informatik.uni-trier.de/~ley/db/ Citeseer: http://citeseer.ist.psu.edu/ The ACM Digital Library http://portal.acm.org/dl.cfm
DBLP
OK, so let's get started Foundations of Data Management Really: Database Management Systems
Data & Databases Data: Simple information Database: Collection of interrelated data Examples Banking: all transactions Airlines: reservations, schedules Universities: registration, grades Sales: customers, products, purchases
Database (Management) Systems: Software to access data. Convenient and efficient to use. [Diagram: users & application programs access the DB through the DBMS]
Amazon: A really big DBMS. [Diagram: Purchasing, Customers, Web Shop, Warehouse & Shipping, External Vendors, and Advertising all connect to the DBMS, which manages the DB.] Plus many more (external) connections.
Commercial DBMS
The Big Three: Oracle, IBM DB2, MS SQL Server
Open Source: PostgreSQL, MySQL
Others: Sybase, Informix (now IBM), Ingres
Office Toys: MS Access
Databases in Life Science
Most databases in the life sciences do not use a DBMS! Hundreds of databases in biology, chemistry, pharmacy, or medicine are based on dedicated (system-specific) text file formats, which come with very limited software support (if any).
This lecture familiarizes you with the ideal of a database + DBMS, so that you can properly judge how much DBMS you need. There are cases where using a full DBMS would be overkill; sometimes a less powerful system is more appropriate.
There is a big turn towards moving LS databases to a stable and powerful general-purpose DBMS, so you ought to know the basic principles of database technology. At the end of the lecture, we will look at alternatives to (real) database systems, though.
Before Database Systems
Binary Files: 0100 1001 0101 0001 0101 0001 0101 0101 0101 1100 1111 1100 0110
Text Files:
01, Alexander, Markowetz, Professor
02, Bob, Benson, Truck Driver
03, Janice, Watson, Nurse
Drawbacks of storing data in files (1)
Data redundancy and inconsistency: multiple file formats; duplication of information in different files.
Difficulty in accessing data: need to write a new program to carry out each new task.
Integrity problems: integrity constraints (e.g. account balance > 0) become part of program code; hard to add new constraints or change existing ones.
Drawbacks of storing data in files (2)
Atomicity of updates: failures may leave the database in an inconsistent state, e.g. a transfer of funds between accounts should either complete or not happen at all.
Concurrent access by multiple users: concurrent accesses are needed for performance; uncontrolled concurrent accesses can lead to inconsistencies, e.g. two people reading a balance and updating it at the same time.
Security problems
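To make the integrity and atomicity points concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table name account, the helper functions transfer and balances, and the sample balances are assumptions made for illustration. The constraint "balance > 0" is declared once in the schema instead of in program code, and a failed transfer leaves no partial update behind.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# The integrity constraint lives in the schema, not in application code.
conn.execute(
    "CREATE TABLE account ("
    " id INTEGER PRIMARY KEY,"
    " balance INTEGER NOT NULL CHECK (balance > 0))"
)
conn.executemany("INSERT INTO account VALUES (?, ?)", [(1, 100), (2, 50)])
conn.commit()


def transfer(conn, src, dst, amount):
    """Move amount from src to dst; either both updates happen or neither."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE account SET balance = balance + ? WHERE id = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE account SET balance = balance - ? WHERE id = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        # The CHECK constraint rejected the debit; the credit was rolled back too.
        return False


def balances(conn):
    """Return the current balances as {id: balance}."""
    return dict(conn.execute("SELECT id, balance FROM account ORDER BY id"))
```

Calling transfer(conn, 1, 2, 500) on these sample balances violates the constraint on account 1, so the credit to account 2 that was already executed is undone as well.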
Example (1)
Alex writes a program to manage the addresses of all students at this university. He uses a text file to store the addresses: Name, Address, Program of Study.
He has to write code to parse the text lines. He has to write code to ensure that the name of a student cannot become null.
When he wants to add another data field, Age, he has to change all of the above code.
Two separate departments need access to this data. Each keeps its own copy. Over time, the two databases will drift apart and become inconsistent.
Example (2)
Whenever Alex introduces any change in the data format, he has to change all the above code, yet again, at both departments.
When he implements another project for the university, he has to write all the above again.
Still, his code is full of errors, does not allow two users to access data at the same time, and lacks many other features.
DBMS solve all the above problems.
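A rough sketch of the kind of code Alex ends up writing by hand, in Python. The function name parse_student and the exact file layout (three comma-separated fields per line) are assumptions for illustration, not taken from a real system. Note how the format and the "name must not be empty" rule are both hard-wired into the program.

```python
def parse_student(line):
    """Parse one 'Name, Address, Program of Study' line of the assumed file format."""
    fields = [f.strip() for f in line.split(",")]
    if len(fields) != 3:
        # Adding a fourth field such as Age breaks every program that contains this check.
        raise ValueError(f"expected 3 fields, got {len(fields)}")
    name, address, program = fields
    if not name:
        # The integrity constraint is buried in application code.
        raise ValueError("name must not be empty")
    return {"name": name, "address": address, "program": program}
```

Even this small parser is fragile: an address that itself contains a comma is rejected, and every department that keeps its own copy of the file needs its own copy of this code.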
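Data Independence
Application program is isolated from the way that data is stored in the DBMS. DBMS is isolated from hardware.
Achieved in a 3-layer architecture: application view, logical level, physical level.
Logical independence separates the application view from the logical level; physical independence separates the logical level from the physical level.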
Parts of a Database Database Schema Metadata, data about data Describes the structure of the data What sets (tables) of data there are Which data-fields (attributes) they contain Database Instance The actual data stored in the database At this moment!!!
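The difference between schema and instance can be seen directly in a small sqlite3 session. This is only a sketch: the table student and the helper functions schema and instance are made up for illustration. Inserting a row changes the instance but leaves the schema untouched.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE student (name TEXT NOT NULL, address TEXT, program TEXT)")


def schema(conn, table):
    """Return the metadata: a list of (attribute name, declared type)."""
    # PRAGMA table_info yields one row per column:
    # (cid, name, type, notnull, dflt_value, pk)
    return [(row[1], row[2]) for row in conn.execute(f"PRAGMA table_info({table})")]


def instance(conn, table):
    """Return the actual data stored in the table at this moment."""
    return conn.execute(f"SELECT * FROM {table}").fetchall()


schema_before = schema(conn, "student")
conn.execute("INSERT INTO student VALUES ('Janice Watson', 'Bonn', 'LSI')")
schema_after = schema(conn, "student")
```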
Database Design (1) 1) Requirements Analysis Analyze real world, user needs & requirements Informal process, client interviews, etc. 2) Conceptual Design High level description of data to be stored Results in an ER-model 3) Logical Design Convert conceptual design into a relational database schema
Database Design (2) 4) Schema Refinement Analyze and refine logical schema Guided by powerful and elegant theory 5) Physical Design Address database performance Create Indexes 6) Application and Security Design
Interacting with a Database
Data Definition Language (DDL): describes the schema.
Data Manipulation Language (DML): insert, delete and update data objects; retrieve data (query the database).
There are graphical tools as well; their functions fall into the same two categories.
SQL comprises both a DDL and a DML.
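A short sketch of both language parts in SQL, run through Python's sqlite3 module. The table person and the job title 'Head Nurse' are assumptions for illustration; the three sample rows reuse the text file example from earlier in the lecture.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# DDL: describe the schema.
conn.execute(
    "CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT NOT NULL, job TEXT)"
)

# DML: insert, update and delete data objects ...
conn.executemany(
    "INSERT INTO person (id, name, job) VALUES (?, ?, ?)",
    [
        (1, "Alexander Markowetz", "Professor"),
        (2, "Bob Benson", "Truck Driver"),
        (3, "Janice Watson", "Nurse"),
    ],
)
conn.execute("UPDATE person SET job = 'Head Nurse' WHERE id = 3")
conn.execute("DELETE FROM person WHERE id = 2")

# ... and retrieve data (query the database).
rows = conn.execute("SELECT id, name, job FROM person ORDER BY id").fetchall()
```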
Thinking Databases As seen above, there are many benefits to using DBMS However, there is one more: Entity Relationship Diagrams A formal way to design data Relational Algebra A formal way to query data
Basic Concepts of ER
A database can be modeled as a collection of entities and relationships among entities.
Entity: an object that exists independently and is distinguishable from other objects, e.g. an employee, a company, a car, a student, a class. Color, age, etc. are not entities.
Entity set: entities of the same type (also called entity types). E.g., a set of employees, a set of departments.
Entity type Employee: a general specification. Entity set {e1, e2, e3}: the actual employees.
Attributes Properties of an entity name, address, weight, height are properties of a Person entity Properties of relationships date of marriage is a property of the relationship Marriage
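Types of Attributes
Simple attribute: contains a single value. Example: Employee with attributes EmpNo, Name, Address.
Composite attribute: made up of several parts. Example: Employee with attributes EmpNo, Name, Address, where Address consists of Street, City, Country.
Multivalued attribute: more than one value. Example: Employee with attributes Phone, Email.
Derived attribute: computed from others. Example: the Age of an Employee is derived from Date of birth.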
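One way to picture the attribute types is as fields of a record. The following Python sketch is only an analogy, not an ER notation; the class layout and the age method are assumptions for illustration. The composite attribute becomes a nested record, the multivalued attributes become lists, and the derived attribute is computed rather than stored.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class Address:
    """Composite attribute: one attribute made of several parts."""
    street: str
    city: str
    country: str


@dataclass
class Employee:
    emp_no: int            # simple attribute
    name: str              # simple attribute
    address: Address       # composite attribute
    date_of_birth: date    # stored attribute used for the derived one
    phones: list = field(default_factory=list)   # multivalued attribute
    emails: list = field(default_factory=list)   # multivalued attribute

    def age(self, today):
        """Derived attribute: computed from date_of_birth, never stored."""
        had_birthday = (today.month, today.day) >= (
            self.date_of_birth.month,
            self.date_of_birth.day,
        )
        return today.year - self.date_of_birth.year - (0 if had_birthday else 1)
```

Storing the age directly would force an update on every birthday, which is why it is derived from the date of birth instead.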
Key Attributes
A set of attributes that can uniquely identify an entity
ERD: Employee with attributes EmpNo (key) and Name
Tabular:
EmpNo    Name          ...
123456   John Wong     ...
456789   Mary Cheung   ...
146777   John Wong     ...
Key Attributes
Composite key: Name or Address alone cannot uniquely identify a student, but together they can! Example: Student with attributes Name and Address.
Key Attributes
An entity may have more than one key.
Candidate key: a minimal set of attributes that uniquely identifies an entity.
Primary key: one candidate key is selected to be the primary key.
Sometimes artificial keys may be created. E.g. we can enumerate all employees in a company.
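The definition of a candidate key (unique and minimal) can be checked mechanically on sample data. The sketch below is in Python; the function names is_superkey and candidate_keys are made up for illustration, and the rows are the three employees from the tabular example. Keep in mind that a key is a statement about all possible data, so a check on one sample can only rule keys out, not prove them.

```python
from itertools import combinations


def is_superkey(rows, attrs):
    """True if no two rows agree on all attributes in attrs."""
    seen = set()
    for row in rows:
        value = tuple(row[a] for a in attrs)
        if value in seen:
            return False
        seen.add(value)
    return True


def candidate_keys(rows, attributes):
    """Return all minimal attribute sets that are unique in the given rows."""
    keys = []
    for size in range(1, len(attributes) + 1):
        for attrs in combinations(attributes, size):
            # Skip supersets of a key found earlier: they are not minimal.
            if any(set(k) <= set(attrs) for k in keys):
                continue
            if is_superkey(rows, attrs):
                keys.append(attrs)
    return keys


employees = [
    {"EmpNo": 123456, "Name": "John Wong"},
    {"EmpNo": 456789, "Name": "Mary Cheung"},
    {"EmpNo": 146777, "Name": "John Wong"},
]
```

On this sample, Name is ruled out because two employees are called John Wong, while EmpNo remains as the only candidate key.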
Example Entity (Customer)
Relationship: A relationship is an association among several entities. The degree refers to the number of entity sets that participate in a relationship set. Binary: two entity sets. More than two entity sets: very rare.
Example of (Binary) Relationship: Borrower is a relationship between Customers and Loans. A customer can be associated with one or more loans, and vice versa.
Relationship Sets with Attributes Depositor is a relationship between Customers and Accounts Access-date is an attribute of Depositor
Cardinality Constraints: We express cardinality constraints by drawing either a directed line (→), signifying "one", or an undirected line (no arrowhead), signifying "many", between the relationship set and the entity set. E.g.: One-to-one relationship: A customer is associated with at most one loan via the relationship borrower. A loan is associated with at most one customer via borrower.
One-To-Many Relationship In the one-to-many relationship a loan is associated with at most one customer via borrower, a customer is associated with several (including 0) loans via borrower
Many-To-One Relationships In a many-to-one relationship a loan is associated with several (including 0) customers via borrower, a customer is associated with at most one loan via borrower
Many-To-Many Relationship A customer is associated with several (possibly 0) loans via borrower A loan is associated with several (possibly 0) customers via borrower
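The four cases can be told apart by counting partners on each side of the relationship. The following Python sketch classifies a set of (customer, loan) pairs using the lecture's naming, where one-to-many means one customer and many loans; the function name mapping_cardinality is an assumption for illustration. It reports the tightest class that the given pairs are consistent with; the real constraint is a design decision about all possible data.

```python
from collections import Counter


def mapping_cardinality(pairs):
    """Classify borrower as a set of (customer, loan) pairs."""
    pairs = set(pairs)
    loans_per_customer = Counter(customer for customer, _ in pairs)
    customers_per_loan = Counter(loan for _, loan in pairs)
    # "One" on the customer side: every loan has at most one customer.
    customer_side_one = all(n <= 1 for n in customers_per_loan.values())
    # "One" on the loan side: every customer has at most one loan.
    loan_side_one = all(n <= 1 for n in loans_per_customer.values())
    if customer_side_one and loan_side_one:
        return "one-to-one"
    if customer_side_one:
        return "one-to-many"
    if loan_side_one:
        return "many-to-one"
    return "many-to-many"
```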
Participation of an Entity Set in a Relationship Set
Total participation (indicated by double line): every entity in the entity set participates in at least one relationship in the relationship set. E.g. participation of loan in borrower is total: every loan must have a customer associated with it via borrower.
Partial participation: some entities may not participate in any relationship in the relationship set. E.g. participation of customer in borrower is partial.
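Total versus partial participation amounts to a simple set comparison. A minimal Python sketch, with the function name participation and the sample data made up for illustration:

```python
def participation(entities, pairs, position):
    """Return 'total' if every entity occurs at the given position of some pair."""
    participating = {pair[position] for pair in pairs}
    return "total" if set(entities) <= participating else "partial"


customers = {"c1", "c2", "c3"}
loans = {"l1", "l2"}
borrower = [("c1", "l1"), ("c2", "l2")]  # (customer, loan) pairs
```

Here every loan has a customer, so loan participates totally, while customer c3 has no loan, so customer participates only partially.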
Alternative Cardinality Notation Cardinality limits can also express participation constraints
Roles Entity sets of a relationship need not be distinct The labels manager and worker are called roles; they specify how employee entities interact via the works-for relationship set. Roles are indicated in E-R diagrams by labeling the lines that connect diamonds to rectangles. Role labels are optional, and are used to clarify semantics of the relationship
Keys for Relationship Sets The combination of primary keys of the participating entity sets forms a super key of a relationship set. (customer-id, account-number) is the super key of depositor This means that a pair of entities can have at most one relationship in a particular relationship set. E.g. if we wish to track all access-dates to each account by each customer, we cannot assume a relationship for each access. Solution: use a multivalued attribute for access dates. Must consider the mapping cardinality of the relationship set when deciding the candidate keys
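The consequence of the super key (customer-id, account-number) can be sketched with a Python dictionary: each pair can occur only once, so repeated accesses have to be collected in a multivalued attribute. The function name record_access and the sample values are assumptions for illustration.

```python
# One entry per (customer_id, account_number) pair; the value is the
# multivalued attribute holding all access dates for that pair.
depositor = {}


def record_access(customer_id, account_number, access_date):
    """Add an access date to the single relationship for this pair."""
    depositor.setdefault((customer_id, account_number), []).append(access_date)


record_access("C-1", "A-101", "2009-10-20")
record_access("C-1", "A-101", "2009-10-27")
record_access("C-2", "A-101", "2009-10-27")
```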
Ternary Relationships Suppose employees of a bank may have jobs (responsibilities) at multiple branches, with different jobs at different branches. Then there is a ternary relationship set between entity sets employee, job and branch
Binary Vs. Non-Binary Relationships Some relationships that appear to be nonbinary may be better represented using binary relationships E.g. A ternary relationship parents, relating a child to his/her father and mother, is best replaced by two binary relationships, father and mother Using two binary relationships allows partial information (e.g. only mother being known) But there are some relationships that are naturally non-binary E.g. works-on
Weak Entity Sets
An entity set that does not have a primary key is referred to as a weak entity set.
The existence of a weak entity set depends on the existence of an identifying entity set. It must relate to the identifying entity set via a total, one-to-many relationship set from the identifying to the weak entity set. The identifying relationship is depicted using a double diamond.
The discriminator (or partial key) of a weak entity set is the set of attributes that distinguishes among all the entities of a weak entity set.
The primary key of a weak entity set is formed by the primary key of the strong entity set on which the weak entity set is existence dependent, plus the weak entity set's discriminator.
Weak Entity Sets (Cont.)
We depict a weak entity set by double rectangles. We underline the discriminator of a weak entity set with a dashed line.
payment-number: discriminator of the payment entity set.
Primary key for payment: (loan-number, payment-number).
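In a relational database, the loan/payment example could look roughly like the following sqlite3 sketch. The column names, sample rows and the helper try_insert_payment are assumptions for illustration. The primary key of payment combines the loan's key with the discriminator, and deleting a loan removes its payments, which mirrors existence dependence.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite checks foreign keys only when enabled
conn.execute("CREATE TABLE loan (loan_number TEXT PRIMARY KEY, amount INTEGER)")
conn.execute(
    "CREATE TABLE payment ("
    " loan_number TEXT NOT NULL REFERENCES loan(loan_number) ON DELETE CASCADE,"
    " payment_number INTEGER NOT NULL,"  # discriminator: unique only within one loan
    " payment_amount INTEGER,"
    " PRIMARY KEY (loan_number, payment_number))"
)
conn.executemany("INSERT INTO loan VALUES (?, ?)", [("L-1", 1000), ("L-2", 500)])
conn.executemany(
    "INSERT INTO payment VALUES (?, ?, ?)",
    [("L-1", 1, 100), ("L-1", 2, 100), ("L-2", 1, 50)],
)
conn.commit()


def try_insert_payment(loan_number, payment_number, amount):
    """Insert a payment; return False if a key or foreign key constraint rejects it."""
    try:
        with conn:
            conn.execute(
                "INSERT INTO payment VALUES (?, ?, ?)",
                (loan_number, payment_number, amount),
            )
        return True
    except sqlite3.IntegrityError:
        return False
```

Payment number 1 exists for both L-1 and L-2, which shows that the discriminator alone does not identify a payment.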
Another example of weak entity type
[Diagram: Employee and Dependent connected by the relationship Emp_Dep; attributes shown: EmpNo, Name, Age]
A child may not be old enough to have a passport number. Even if he/she has a passport number, the company may not be interested in keeping it in the database.
Summary of Symbols (Cont.)
Design Decisions - Attribute vs Entity
For each employee we want to store the office number, location of the office (e.g., Building A, floor 6), and telephone. Several employees share the same office.
Office as attribute: Employee (Employee_id, Name, Office_number, Office_location, Office_phone)
Office as entity: Employee (Employee_id, Name) related to Office (Office_number, Office_location, Office_phone)
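The practical difference between the two designs shows up when a shared office changes its phone number. A small Python sketch with made-up names and sample data: with office as attribute, every employee row carries its own copy; with office as entity, the data is stored once.

```python
# Office as attribute: office data is repeated in every employee row.
employees_flat = [
    {"employee_id": 1, "name": "Ann", "office_number": "A225",
     "office_location": "Building A, floor 6", "office_phone": "7409"},
    {"employee_id": 2, "name": "Ben", "office_number": "A225",
     "office_location": "Building A, floor 6", "office_phone": "7409"},
]

# Office as entity: office data is stored once and referenced by its number.
offices = {"A225": {"office_location": "Building A, floor 6", "office_phone": "7409"}}
employees = [
    {"employee_id": 1, "name": "Ann", "office_number": "A225"},
    {"employee_id": 2, "name": "Ben", "office_number": "A225"},
]


def change_phone_flat(rows, office_number, phone):
    """Update every copy of the phone number; return the number of rows touched."""
    touched = 0
    for row in rows:
        if row["office_number"] == office_number:
            row["office_phone"] = phone
            touched += 1
    return touched


def change_phone_entity(offices, office_number, phone):
    """Update the single office record; return the number of records touched."""
    offices[office_number]["office_phone"] = phone
    return 1
```

Missing one of the copies in the flat design is exactly the kind of inconsistency that redundant storage invites.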
ER Design Decisions - Entity vs Relationship
Account example. Can you see some differences? (E.g., can you have accounts without a customer?)
Account as an entity: entity sets Customer, Account, and Branch.
Account as a relationship: Account relates the entity sets Customer and Branch.
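ER Design Decisions - Entity vs Relationship
You want to record the period that an employee works for some department.
Works_In2: relationship between Employees (ssn, name, lot) and Departments (did, dname, budget), with from and to as attributes of the relationship.
Works_In3: relationship among Employees (ssn, name, lot), Departments (did, dname, budget), and an entity set Duration (from, to).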
ER Design Decisions - Strong vs. Weak Entity
Example: What if, in the accounts example, an account must be associated with exactly one branch, and two different branches are allowed to have accounts with the same number?
[Diagram: Account (Number) and Branch (Branch_id)]