Foundations of Databases
Βασίλης Βασσάλος (Vasilis Vassalos)

What do we need to produce good software?
- Good programmers, software engineers, project managers, business experts, etc. (People)
- Good software development methodologies and practices (Processes)
- Good programming languages, compilers, etc. (Tools)

What Is a Database?
- A very large, integrated collection of data
- It models a real-world enterprise
  - Entities (e.g., customers, orders)
  - Relationships (e.g., Joe Smith bought a Corvette)
- A Database Management System (DBMS) is a software package designed to store and manage databases
  - Difficult software package (needs tending - DBA)
  - Expensive software package

Databases and Database Management
- Database: a structured collection of data and information about entities (things) of interest
- Database Management System: a software application with which you can create, store, organize and retrieve data from one or many databases
  - E.g. Oracle, Sybase, Informix, DB2, Access
- Database Administrator: a person responsible for the development and management of an organization's databases

Why we need a DBMS
- A DBMS stores, manages and manipulates large amounts of data effectively and efficiently
- Most business processes and functions in big organizations generate, depend on, and use large amounts of data

A view of an e-bookseller's customer-oriented IT infrastructure
[Figure: web clients reach the site over the Internet through ISPs (T3), telco DSL/ISDN, a cable company and a wireless provider, using http, HTML and XML. On the server side: web server, application server, personalization, search application, community server, shopping bot, transaction monitor and Zshops, connected via ODBC/SQL, CORBA and XML to the database server (DBMS) holding books and customer & order info. On the intranet: data warehouse, OLAP server, data mining, marketing, intelligent agent checking inventory levels, order execution and procurement system, distributors and automated warehouse.]

Two broad business uses for DBMSs
- Run the operational aspects of the business
  - Order entry, payroll, inventory management, etc.
  - Online transaction processing
- Help with decision-making
  - Measure the effectiveness of marketing campaigns, or find out the most profitable products
  - Online decision support
- Uses are converging
  - Supply chain execution

DBMS provides levels of abstraction
- Many views, a single conceptual (logical) schema and a physical schema
  - Views describe how users see the data
  - Conceptual schema defines logical structure
  - Physical schema describes the files and indexes used
- Schemata are defined using DDL - Data Definition Language
- Data are modified/queried using DML - Data Manipulation Language
[Figure: View 1, View 2 and View 3 sit on top of the Conceptual Schema, which sits on top of the Physical Schema.]

Example: Bookstore Database
- Conceptual schema:
    Customers(cid: string, name: string, address: string, sex: string, category: integer)
    Books(isbn: string, title: string, price: float)
    Sales(cid: string, isbn: string, pdate: date)
- Physical schema:
  - Relations stored as unordered files
  - Index on first column of Customers
- External schema (view):
    TSalesPerBook(isbn: string, total_sales: float)
    Goodcustomers(cid: string, name: string)

Data Independence
- Applications are insulated from how data are structured and stored
- Logical data independence: protection from changes in the logical structure of data
- Physical data independence: protection from changes in the physical structure of data
- One of the most important benefits of using a DBMS!
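
The three schema levels and physical data independence can be tried out with a small sketch. It uses Python's built-in sqlite3 module as a stand-in DBMS. The sample rows, the index name customers_cid_idx, and the definition of total_sales as the sum of book prices are assumptions made for illustration; the slides do not define the view.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Conceptual schema, defined with DDL
    CREATE TABLE Customers (cid TEXT, name TEXT, address TEXT, sex TEXT, category INTEGER);
    CREATE TABLE Books (isbn TEXT, title TEXT, price REAL);
    CREATE TABLE Sales (cid TEXT, isbn TEXT, pdate TEXT);

    -- Physical-level choice: an index on the first column of Customers
    CREATE INDEX customers_cid_idx ON Customers(cid);

    -- External schema: a view (total_sales = sum of prices is an assumption)
    CREATE VIEW TSalesPerBook AS
        SELECT B.isbn AS isbn, SUM(B.price) AS total_sales
        FROM Sales S, Books B
        WHERE S.isbn = B.isbn
        GROUP BY B.isbn;

    -- Made-up sample data, inserted with DML
    INSERT INTO Books VALUES ('111', 'Book A', 10.0), ('222', 'Book B', 25.5);
    INSERT INTO Sales VALUES ('c1', '111', '2002-10-03'),
                             ('c2', '111', '2002-10-04'),
                             ('c1', '222', '2002-10-04');
""")

view_query = "SELECT isbn, total_sales FROM TSalesPerBook ORDER BY isbn"
before = conn.execute(view_query).fetchall()

# Change the physical schema only: the application's query is untouched.
conn.execute("DROP INDEX customers_cid_idx")
after = conn.execute(view_query).fetchall()

print(before)
print(after)
```

Both queries return the same rows, which is the point of physical data independence: dropping or adding an index may change performance, not answers.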

Database Development Process (Abstraction Revisited)
- Conceptual Data Modeling
  - Diagram, preliminary model
  - Technology independent
  - Example: Entity-Relationship diagrams
  - Decide what entities should be part of the database and what the relationships between them are
- Logical Database Design
  - Abstract model of the database
  - Relational: tables
- Physical Database Design
  - How the database will be arranged
  - Technology dependent: DBMS (Database Management System)
- Database Implementation
- Database Maintenance

Why we need a DBMS
- Concurrent access (focus of class)
- Recovery from crashes (focus of class)
- Data integrity and security
  - Preserve constraints on data, avoid corruption, control access
- Data independence
  - Conceptual modeling should be independent of physical modeling
- Declarative language
- Efficient access (focus of class)
- Uniform data administration
  - Manage the individual accounts, the customer info, the business loans
- Reduced application development time

Example: Bookstore Database
- Conceptual schema:
    Customers(cid: string, name: string, address: string, sex: string, category: integer)
    Books(isbn: string, title: string, price: float)
    Sales(cid: string, isbn: string, pdate: date)
- Physical schema:
  - Relations stored as unordered files
  - Index on first column of Customers
- External schema (view):
    TSalesPerBook(isbn: string, total_sales: float)
    Goodcustomers(cid: string, name: string)

Data Models
- A data model is a vocabulary of primitives for describing data
- A schema is a description of a particular collection of data, using a given data model
- The relational model of data is the most widely used model today
  - Main primitive: the relation, basically a table with rows and columns
  - Every relation has a schema, which describes the columns, or fields
  - Proposed by E.F. Codd in 1970

Querying a Database
- Find how many customers have bought "A random walk down Wall Street" on 10/3/2002
- S(tructured) Q(uery) L(anguage):
    select COUNT(P.cid)
    from Purchase P, Books B
    where P.isbn = B.isbn and P.date = '10/3/2002'
      and B.title = 'A random walk down Wall Street'
- User asks for what they need
- Query processor figures out how to answer the query efficiently
- Declarative language

E-R Diagram Example: EverFail
[Figure: E-R diagram. Entities: Car (VIN, Model, Make), Customer (Cust ID, Name), Work (WorkID, Workdescr, LaborCharge), Parts (PartID, Part Descr, Part Price). Relationships: Car "Owned by" Customer, Customer "Approves" Work, Work "Includes" Parts.]

The Complete Relations for EverFail
    Car        VIN, Model, Make, CustID
    Customer   CustID, Name, Address, Phone
    Work       WorkID, Workdescr, LaborCharge, CustID
    Parts      PartsID, PartDescr, PartPrice
    PartsUsed  WorkID, PartsID, Qty

Querying a relational database
- (S)tructured (Q)uery (L)anguage: "intergalactic dataspeak"
- Basic SQL query:
    SELECT [DISTINCT] target-list
    FROM relation-list
    WHERE qualification
- relation-list: a list of relation names (possibly with a range-variable after each name)
- target-list: a list of attributes of relations in relation-list
- qualification: comparisons (Attr op const or Attr1 op Attr2, where op is one of <, >, =, ≤, ≥, ≠) combined using AND, OR and NOT
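
As a hedged illustration of the basic query form, the sketch below loads two of the EverFail relations into an in-memory SQLite database and runs the same query with and without DISTINCT. The sample rows and the range-variable names C and K are made up for this example; ORDER BY is added only to make the output order predictable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Customer (CustID INTEGER, Name TEXT, Address TEXT, Phone TEXT);
    CREATE TABLE Car (VIN TEXT, Model TEXT, Make TEXT, CustID INTEGER);
    -- Made-up sample rows
    INSERT INTO Customer VALUES (1, 'Ann', 'Athens', '555-0101'),
                                (2, 'Bob', 'Patras', '555-0102');
    INSERT INTO Car VALUES ('V1', 'Corolla', 'Toyota', 1),
                           ('V2', 'Yaris', 'Toyota', 1),
                           ('V3', 'Golf', 'VW', 2);
""")

# target-list: C.Name
# relation-list: Customer C, Car K (C and K are range variables)
# qualification: two comparisons combined with AND
query = """
    SELECT {distinct} C.Name
    FROM Customer C, Car K
    WHERE C.CustID = K.CustID AND K.Make = 'Toyota'
    ORDER BY C.Name
"""

with_duplicates = conn.execute(query.format(distinct="")).fetchall()
without_duplicates = conn.execute(query.format(distinct="DISTINCT")).fetchall()

print(with_duplicates)     # Ann appears once per Toyota she owns
print(without_duplicates)  # DISTINCT collapses the duplicate rows
```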

What can we do?
- Retrieve a subset of rows ("selection")
- Retrieve a subset of columns ("projection")
- Connect relations (a "join")
- Union, intersect relations

Example
- We will use these instances of the Sailors and Reserves relations in our examples.
- bid is the boat id.
- If the key for the Reserves relation contained only the attributes sid and day, how would the semantics differ?

R1
    sid  bid  day
    22   101  10/10/96
    58   103  11/12/96

S1
    sid  sname   rating  age
    22   dustin  7       45.0
    31   lubber  8       55.5
    58   rusty   10      35.0

S2
    sid  sname   rating  age
    28   yuppy   9       35.0
    31   lubber  8       55.5
    44   guppy   5       35.0
    58   rusty   10      35.0
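
The operations listed above can be tried on the S1, S2 and R1 instances. The following is a minimal sketch using Python's sqlite3 module; the specific predicates (rating > 8 and so on) and the helper name run are chosen for illustration only, and ORDER BY is added so the output order is predictable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE S1 (sid INTEGER, sname TEXT, rating INTEGER, age REAL);
    CREATE TABLE S2 (sid INTEGER, sname TEXT, rating INTEGER, age REAL);
    CREATE TABLE R1 (sid INTEGER, bid INTEGER, day TEXT);
    INSERT INTO S1 VALUES (22, 'dustin', 7, 45.0), (31, 'lubber', 8, 55.5),
                          (58, 'rusty', 10, 35.0);
    INSERT INTO S2 VALUES (28, 'yuppy', 9, 35.0), (31, 'lubber', 8, 55.5),
                          (44, 'guppy', 5, 35.0), (58, 'rusty', 10, 35.0);
    INSERT INTO R1 VALUES (22, 101, '10/10/96'), (58, 103, '11/12/96');
""")


def run(sql):
    """Hypothetical helper: execute a query and return all rows."""
    return conn.execute(sql).fetchall()


# Selection: a subset of rows
selection = run("SELECT * FROM S2 WHERE rating > 8 ORDER BY sid")
# Projection: a subset of columns
projection = run("SELECT sname, rating FROM S1 ORDER BY sid")
# Join: connect sailors with their reservations
join = run("SELECT S.sname, R.bid FROM S1 S, R1 R WHERE S.sid = R.sid ORDER BY S.sid")
# Union and intersection of the two Sailors instances (on sid)
union = run("SELECT sid FROM S1 UNION SELECT sid FROM S2 ORDER BY sid")
intersection = run("SELECT sid FROM S1 INTERSECT SELECT sid FROM S2 ORDER BY sid")

for name, rows in [("selection", selection), ("projection", projection),
                   ("join", join), ("union", union), ("intersection", intersection)]:
    print(name, rows)
```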

Example Query
    SELECT S.sname
    FROM Sailors S, Reserves R
    WHERE S.sid = R.sid AND R.bid = 103

Cross-product of Sailors and Reserves:
    (sid)  sname   rating  age   (sid)  bid  day
    22     dustin  7       45.0  22     101  10/10/96
    22     dustin  7       45.0  58     103  11/12/96
    31     lubber  8       55.5  22     101  10/10/96
    31     lubber  8       55.5  58     103  11/12/96
    58     rusty   10      35.0  22     101  10/10/96
    58     rusty   10      35.0  58     103  11/12/96

Conceptual Evaluation Strategy
- Compute the cross-product of relation-list
  - All the ways to combine the tuples in the relations
- Discard resulting tuples if they fail qualifications
- Delete attributes that are not in target-list
- If DISTINCT is specified, eliminate duplicate rows
- This strategy is probably the least efficient way to compute a query! An optimizer will find more efficient strategies to compute the same answers
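
The four steps of the conceptual evaluation strategy can be written out directly. This is a deliberately naive sketch for the example query on the Sailors and Reserves instances, not how a real query processor works; the function name conceptual_eval and the tuple layout are assumptions for illustration.

```python
from itertools import product

# Instances from the slides: (sid, sname, rating, age) and (sid, bid, day)
sailors = [(22, "dustin", 7, 45.0), (31, "lubber", 8, 55.5), (58, "rusty", 10, 35.0)]
reserves = [(22, 101, "10/10/96"), (58, 103, "11/12/96")]


def conceptual_eval(sailors, reserves, distinct=False):
    """Evaluate SELECT S.sname FROM Sailors S, Reserves R
    WHERE S.sid = R.sid AND R.bid = 103, step by step."""
    # 1. Cross-product of the relation-list: every (sailor, reservation) pair.
    combined = list(product(sailors, reserves))
    # 2. Discard tuples that fail the qualification.
    qualifying = [(s, r) for s, r in combined if s[0] == r[0] and r[1] == 103]
    # 3. Delete attributes that are not in the target-list (keep sname only).
    result = [(s[1],) for s, r in qualifying]
    # 4. If DISTINCT is specified, eliminate duplicate rows (order preserved).
    if distinct:
        result = list(dict.fromkeys(result))
    return result


answer = conceptual_eval(sailors, reserves)
print(answer)
```

Step 1 builds all 6 combined tuples even though only one survives step 2, which is why an optimizer looks for cheaper strategies that give the same answer.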

Find sailors who've reserved at least one boat
    SELECT S.sid
    FROM Sailors S, Reserves R
    WHERE S.sid = R.sid
- What is the effect of replacing S.sid by S.sname in the SELECT clause?
- Would adding DISTINCT to this variant of the query make a difference?

Find the age of the youngest sailor for each rating with age ≥ 18
    SELECT S.rating, MIN(S.age)
    FROM Sailors S
    WHERE S.age >= 18
    GROUP BY S.rating
- Only S.rating and S.age are mentioned in the SELECT and GROUP BY clauses; other attributes are "unnecessary"
- The 2nd column of the result is unnamed

Sailors instance:
    sid  sname    rating  age
    22   dustin   7       45.0
    31   lubber   8       55.5
    71   zorba    10      16.0
    64   horatio  7       35.0
    29   brutus   1       33.0
    58   rusty    10      35.0

After applying the qualification and dropping unnecessary fields:
    rating  age
    1       33.0
    7       45.0
    7       35.0
    8       55.5
    10      35.0

Answer relation:
    rating  (unnamed)
    1       33.0
    7       35.0
    8       55.5
    10      35.0
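
The grouping query can be run against the Sailors instance shown above. This sketch uses Python's sqlite3 module; the alias min_age and the ORDER BY clause are additions for readability and are not part of the query on the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Sailors (sid INTEGER, sname TEXT, rating INTEGER, age REAL)")
conn.executemany(
    "INSERT INTO Sailors VALUES (?, ?, ?, ?)",
    [
        (22, "dustin", 7, 45.0),
        (31, "lubber", 8, 55.5),
        (71, "zorba", 10, 16.0),   # filtered out by the WHERE clause
        (64, "horatio", 7, 35.0),
        (29, "brutus", 1, 33.0),
        (58, "rusty", 10, 35.0),
    ],
)

# AS min_age names the otherwise unnamed second column (an addition).
rows = conn.execute("""
    SELECT S.rating, MIN(S.age) AS min_age
    FROM Sailors S
    WHERE S.age >= 18
    GROUP BY S.rating
    ORDER BY S.rating
""").fetchall()

for rating, min_age in rows:
    print(rating, min_age)
```

The rows match the answer relation above: zorba is removed before grouping, so rating 10 reports rusty's age.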

Conceptual Evaluation
- The cross-product of relation-list is computed, tuples that fail qualification are discarded, "unnecessary" fields are deleted, and the remaining tuples are partitioned into groups by the value of attributes in grouping-list
- The group-qualification is then applied to eliminate some groups. Expressions in group-qualification must have a single value per group!
- One answer tuple is generated per qualifying group.

Triggers
- Trigger: a procedure that starts automatically if specified changes occur to the DBMS
- Three parts:
  - Event (activates the trigger)
  - Condition (tests whether the trigger should run)
  - Action (what happens if the trigger runs)
- Turns the database into an active component
  - E.g., alert (or act!) when inventory is low
- Help with decision support
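
The low-inventory alert mentioned above can be sketched as an event-condition-action trigger. The example below uses SQLite's trigger syntax through Python's sqlite3 module; the tables Inventory and Alerts, the trigger name low_stock, and the threshold of 5 are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical tables for the illustration
    CREATE TABLE Inventory (isbn TEXT PRIMARY KEY, qty INTEGER);
    CREATE TABLE Alerts (isbn TEXT, qty INTEGER);

    CREATE TRIGGER low_stock
    AFTER UPDATE OF qty ON Inventory          -- Event
    FOR EACH ROW
    WHEN NEW.qty < 5                          -- Condition (threshold is made up)
    BEGIN
        INSERT INTO Alerts VALUES (NEW.isbn, NEW.qty);   -- Action
    END;

    INSERT INTO Inventory VALUES ('i1', 10);
""")

# The application only updates stock levels; it never writes to Alerts.
conn.execute("UPDATE Inventory SET qty = 7 WHERE isbn = 'i1'")  # condition false
conn.execute("UPDATE Inventory SET qty = 3 WHERE isbn = 'i1'")  # condition true

alerts = conn.execute("SELECT isbn, qty FROM Alerts").fetchall()
print(alerts)
```

Only the second update fires the action, so the alert logic lives in the database rather than in every application that changes stock.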

SQL examples
- Find sailors who have reserved boat 103:
    SELECT S.sname
    FROM Sailors S
    WHERE EXISTS (SELECT *
                  FROM Reserves R
                  WHERE R.bid = 103 AND S.sid = R.sid)
- Find sids of sailors who've reserved both a red and a green boat:
    SELECT S.sid
    FROM Sailors S, Boats B, Reserves R
    WHERE S.sid = R.sid AND R.bid = B.bid AND B.color = 'red'
      AND S.sid IN (SELECT S2.sid
                    FROM Sailors S2, Boats B2, Reserves R2
                    WHERE S2.sid = R2.sid AND R2.bid = B2.bid
                      AND B2.color = 'green')

Isn't Implementing a Database System Simple?
[Figure: Relations and Statements go in, Results come out.]

Introducing the Megatron 3000 Database Management System
- The latest from Megatron Labs
- Incorporates latest relational technology
- UNIX compatible

Megatron 3000 Implementation Details
- Relations stored in files (ASCII)
- e.g., relation R is in /usr/db/r:
    Smith # 123 # CS
    Jones # 522 # EE
    ...

Megatron 3000 Implementation Details
- Directory file (ASCII) in /usr/db/directory:
    R1 # A # INT # B # STR
    R2 # C # STR # A # INT
    ...

Megatron 3000 Sample Sessions
    % MEGATRON3000
    Welcome to MEGATRON 3000!
    &
    ...
    & quit
    %

Megatron 3000 Sample Sessions
    & select * from R #
    Relation R
    A      B    C
    SMITH  123  CS
    &

Megatron 3000 Sample Sessions
    & select A,B from R,S where R.A = S.A and S.C > 100 #
    A    B
    123  CAR
    522  CAT
    &

Megatron 3000 Sample Sessions
    & select * from R | LPR #
    Result sent to LPR (printer).
    &

Megatron 3000 Sample Sessions
    & select * from R where R.A < 100 | T #
    New relation T created.
    &

Megatron 3000
To execute "select * from R where condition":
(1) Read dictionary to get R attributes
(2) Read R file, for each line:
    (a) Check condition
    (b) If OK, display

Megatron 3000
To execute "select * from R where condition | T":
(1) Process select as before
(2) Write results to new file T
(3) Append new line to dictionary
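
The two procedures above can be sketched in a few lines of Python. This is only an illustration of the described steps, not the Megatron 3000 code: the function names read_schema and select are hypothetical, the condition is passed as a Python function instead of being parsed from query text, the database lives in a temporary directory instead of /usr/db, and the type of attribute A (STR here) is chosen to fit the Smith/Jones rows.

```python
import os
import tempfile


def read_schema(db_dir, relation):
    """Step (1): scan the directory file for the relation's attributes."""
    with open(os.path.join(db_dir, "directory")) as f:
        for line in f:
            parts = [p.strip() for p in line.split("#")]
            if parts[0] == relation:
                # After the relation name, fields alternate: name, type, ...
                return list(zip(parts[1::2], parts[2::2]))
    raise KeyError(relation)


def select(db_dir, relation, condition, target=None):
    """Run 'select * from relation where condition', optionally into target."""
    schema = read_schema(db_dir, relation)
    results = []
    with open(os.path.join(db_dir, relation)) as f:
        for line in f:                       # step (2): read every line
            if not line.strip():
                continue
            values = [v.strip() for v in line.split("#")]
            row = {name: (int(v) if typ == "INT" else v)
                   for (name, typ), v in zip(schema, values)}
            if condition(row):               # (a) check condition
                results.append(row)          # (b) keep (display) if OK
    if target is not None:
        # Write results to new file T, then append a line to the dictionary.
        with open(os.path.join(db_dir, target), "w") as out:
            for row in results:
                out.write(" # ".join(str(row[n]) for n, _ in schema) + "\n")
        with open(os.path.join(db_dir, "directory"), "a") as d:
            d.write(" # ".join([target] + [x for pair in schema for x in pair]) + "\n")
    return results


# Set up a tiny database in a temporary directory (stand-in for /usr/db).
db_dir = tempfile.mkdtemp()
with open(os.path.join(db_dir, "directory"), "w") as f:
    f.write("R # A # STR # B # INT # C # STR\n")
with open(os.path.join(db_dir, "R"), "w") as f:
    f.write("Smith # 123 # CS\nJones # 522 # EE\n")

matches = select(db_dir, "R", lambda row: row["B"] < 200, target="T")
print(matches)
print(read_schema(db_dir, "T"))
```

Note that every query reads the whole file for R, whatever the condition is.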

Megatron 3000
To execute "select A,B from R,S where condition":
(1) Read dictionary to get R,S attributes
(2) Read R file, for each line:
    (a) Read S file, for each line:
        (i) Create join tuple
        (ii) Check condition
        (iii) Display if OK

What's wrong with the Megatron 3000 DBMS?
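
One way to start answering that question is to write down the join procedure and count the work it does. The sketch below keeps R and S as in-memory lists rather than files; the function name nested_loop_join, the examined counter, and the sample rows are assumptions for illustration, chosen so the result matches the sample session for select A,B from R,S where R.A = S.A and S.C > 100.

```python
def nested_loop_join(r_rows, s_rows, condition):
    """Brute-force join: pair every R tuple with every S tuple."""
    results = []
    examined = 0
    for r in r_rows:                 # (2) for each line of R
        for s in s_rows:             # (a) read all of S again
            examined += 1
            # (i) create the join tuple, prefixing attributes by relation
            joined = {**{"R." + k: v for k, v in r.items()},
                      **{"S." + k: v for k, v in s.items()}}
            # (ii) check condition, (iii) keep (display) A and B if OK
            if condition(joined):
                results.append((joined["R.A"], joined["S.B"]))
    return results, examined


# Made-up instances of R(A) and S(A, B, C)
R = [{"A": 123}, {"A": 522}, {"A": 777}]
S = [{"A": 123, "B": "CAR", "C": 150},
     {"A": 522, "B": "CAT", "C": 200},
     {"A": 777, "B": "COW", "C": 50}]

rows, examined = nested_loop_join(
    R, S, lambda t: t["R.A"] == t["S.A"] and t["S.C"] > 100)
print(rows)
print(examined)
```

With 3 tuples in each relation the loop builds 9 join tuples to produce 2 answers; the count grows as the product of the two relation sizes, and S is re-read once per tuple of R.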

What's wrong with the Megatron 3000 DBMS?
Tuple layout on disk, e.g.,
- Change string from "Cat" to "Cats" and we have to rewrite the file
- ASCII storage is expensive
- Deletions are expensive

What's wrong with the Megatron 3000 DBMS?
Search expensive; no indexes, e.g.,
- Cannot find tuple with given key quickly
- Always have to read full relation
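
To make the missing-index problem concrete, here is a minimal sketch of what even a very simple index buys: a dictionary from key to byte offset, so that a lookup seeks straight to one line instead of reading the full relation. The function names build_index and lookup are hypothetical, keys are assumed to be unique, and the file uses the "#"-separated layout shown earlier.

```python
import os
import tempfile


def build_index(path, key_field=0):
    """One full scan: map each key to the byte offset of its line."""
    index = {}
    with open(path, "rb") as f:
        while True:
            offset = f.tell()
            line = f.readline()
            if not line:
                break
            key = line.decode("ascii").split("#")[key_field].strip()
            index[key] = offset      # assumes keys are unique
    return index


def lookup(path, index, key):
    """Find one tuple by key without scanning the whole file."""
    offset = index.get(key)
    if offset is None:
        return None
    with open(path, "rb") as f:
        f.seek(offset)               # jump directly to the tuple
        return [v.strip() for v in f.readline().decode("ascii").split("#")]


# A tiny relation file in a temporary directory
path = os.path.join(tempfile.mkdtemp(), "R")
with open(path, "w") as f:
    f.write("Smith # 123 # CS\nJones # 522 # EE\n")

index = build_index(path)
found = lookup(path, index, "Jones")
missing = lookup(path, index, "Nobody")
print(found, missing)
```

The sketch also shows how the problems interact: with variable-length ASCII lines, changing "Cat" to "Cats" shifts every later tuple, so all stored offsets after it become stale and the index has to be rebuilt.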

What's wrong with the Megatron 3000 DBMS?
Brute force query processing, e.g.,
    select * from R,S where R.A = S.A and S.B > 1000
- Do select first?
- More efficient join?

What's wrong with the Megatron 3000 DBMS?
- No buffer manager, e.g.,
  - Need caching
- No concurrency control
- No reliability, e.g.,
  - Can lose data
  - Can leave operations half done
- No security, e.g.,
  - File system insecure
  - File system security is coarse
- No application program interface (API), e.g.,
  - How can a payroll program get at the data?
- No interoperability with other databases
- Poor dictionary facilities
- No GUI

System Structure
[Figure: system architecture. Users submit queries and transactions. Components: Query Parser, Strategy Selector, Transaction Manager, Concurrency Control (with Lock Table), Recovery Manager (with Log), Buffer Manager (with main-memory buffer), File Manager. Stored data: Statistical Data, Indexes, User Data, System Data.]

Some Terms
- Database system
- Transaction processing system
- File access system
- Information retrieval system

Logistics
- Class: Thursday 12-3, room 606
  - Break: 15 minutes, at about 1:40
- "Office": Artificial Intelligence Lab, 4th floor, Antoniadou wing, tel. 160
  - Office hours: Tuesday/Thursday 3:30-4:30, other times by appointment
- Teaching assistant: Magda Eirinaki
  - "Office": 3rd floor, Kodrigktonos 12
  - Office hours: Wednesday 1-3

Course structure
- Two assignments: 15% of the grade
  - An online system is used for the assignments
- Final exam: 85% of the grade
- Optional literature survey paper: 10% bonus
- Course slides available at http://www.aueb.gr/lessons/grad/dbtheory
  - Mostly in English

Course syllabus
- Introduction
- Indexes and hashing
- Query processing and optimization
- Crash recovery
- Concurrency control theory
- Transaction processing
- Deductive and logic databases
- Active databases
- Data warehouses

Acknowledgement
- Slides mostly based on the ones provided by Hector Garcia-Molina
- Thanks!