Database Theory, Fall Semester MMXV: Week #3

Any questions on anything from last week?
Which relational operators did I call basic?
Why did I distinguish between basic and defined operators?
If you haven't handed in homework 1 yet, bring it to the front of the room now!

More defined operators: Joins.

A. Natural Joins (operator written ⋈):

1. Consider again
      Section:       Dept  Crs#  Sect  Smstr  Yr  Instr
      Registration:  Stdnt#  Dept  Crs#  Sect  Smstr  Yr  Grade

2. Registration × Section is a relation with 13 attributes, listing every possible pair of registration data and section data. Estimate the number of tuples for UC with 10 years of data.

3. Often we don't want tuples for unmatched courses! The common attributes in the 2 relations are Dept, Crs#, Sect, Smstr, and Yr. The natural join of the 2 relations is like the Cartesian product, but it picks only the tuples where the values of those attributes are the same in both relations, and it stores only one copy of those attributes in the result.

4. So Registration ⋈ Section has just 8 attributes:
      Stdnt#  Dept  Crs#  Sect  Smstr  Yr  Grade  Instr
It lists all the courses all students are registered for, together with their grades and the instructors of those courses.

5. How do you define R ⋈ S using only basic operators?

6. Why do we choose to use natural joins in SQL?
   (a) Easier to program: you don't need to do all that renaming and selection!
   (b) Easier for someone else to understand the program!
   (c) SQL implementations typically have special logic to compute joins without doing the renaming and forming the full Cartesian products, which avoids creating some huge intermediate tables.

B. Theta Joins: Read in book. We won't discuss them or use them in class.

There are other variants of joins too. We won't discuss them here either, but if you do a lot of SQL programming you'll want to explore them.
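The definition of the natural join in item 3 can be sketched in a few lines of Python. This is only an illustration, not part of the course's clingo material: relations are modeled as lists of dicts, and the function name natural_join and the sample tuples are assumptions made up for the example. It computes the join the long way (Cartesian product, then selection on the common attributes, then one copy of each common attribute), which is exactly the work a real SQL engine tries to avoid.

```python
from itertools import product

def natural_join(r, s):
    """Natural join of two relations given as lists of dicts (attribute -> value)."""
    if not r or not s:
        return []
    common = set(r[0]) & set(s[0])               # attributes shared by both schemas
    result = []
    for t, u in product(r, s):                   # Cartesian product
        if all(t[a] == u[a] for a in common):    # selection: common attributes agree
            result.append({**t, **u})            # keep one copy of each common attribute
    return result

# Hypothetical sample data following the Registration and Section schemas.
registration = [
    {"Stdnt#": "M1", "Dept": "cs", "Crs#": 6051, "Sect": 1,
     "Smstr": "fall", "Yr": 2015, "Grade": "N"},
    {"Stdnt#": "M2", "Dept": "engl", "Crs#": 2027, "Sect": 67,
     "Smstr": "spring", "Yr": 2015, "Grade": "B-"},
]
section = [
    {"Dept": "cs", "Crs#": 6051, "Sect": 1,
     "Smstr": "fall", "Yr": 2015, "Instr": "Schlipf"},
]

joined = natural_join(registration, section)
print(len(joined), sorted(joined[0]))
```

The unmatched engl registration drops out, and the one surviving tuple has 8 attributes rather than 13.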
More Examples: Write queries to answer the following questions, using natural joins (and perhaps other operators), for the schema

   Student(LastName, FirstName, Student#, Class, Major)
   Course(CrsName, Dept, Crs#, CredHrs)
   Section(Dept, Crs#, Sect, Smstr, Yr, Instr)
   Registration(Student#, Dept, Crs#, Sect, Smstr, Yr, Grade)

1. Give names (first and last) of all students who have ever taken a class not taught by Professor Schlipf.

2. Give names (first and last) of all pairs of students who have ever been in a class together.

3. Give names (first and last) of all (lucky?) students who have never taken a class taught by Professor Schlipf.
Nastier examples on enlarged Schema:

   Student:       LastName  FirstName  Student#  Class  Major
   Course:        CrsName  Dept  Crs#  CredHrs
   Section:       Dept  Crs#  Sect  Smstr  Yr  Instr
   Registration:  Student#  Dept  Crs#  Sect  Smstr  Yr  Grade
   Prerequisite:  Dept  Crs#  PrereqDept  PrereqCrs#

1. List all pairs of students who have taken classes from exactly the same professors.

2. List those students by name.

3. List all tuples (Stdnt#, Dept, Crs#) where the student is currently taking the course but has not passed ( ) one of its prerequisites.

4. List those students and courses by name rather than by number.
Aside: Monotonicity.

Setting:
   A relational database schema D.
   A set of constraints C.
   All possible relational databases R for D and C: all possible relations for those relation schemas obeying the constraints.

Notation: For Q a query in relational database schema D and R a relational database for D, Q(R) is the answer to Q over R, i.e. (in relational algebra terms), the value of expression Q evaluated over database R.

Definition: A query Q is monotonic if, whenever R1 ⊆ R2, Q(R1) ⊆ Q(R2); i.e., we cannot delete any tuples from the answer by simply adding tuples to the database.

Theorem: Every relational algebra query whose only operators are ∪, ×, π, σ, ⋈, and ρ is monotonic.

Practical Consequence: You're going to have to use − (difference), probably more often than you'd at first expect, when you write queries. This is especially frequent when the English statement involves words like "every" and "not". In particular, putting a NOT into the condition for selection often won't get you what you want.
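A small Python sketch may help make the definition concrete. It uses Python sets as relations; the relation contents and the function names taught_by and never_taught_by are made up for the illustration. The select-and-project query can only grow when tuples are added, while the "never taught by" query, which needs a difference, loses an answer.

```python
def taught_by(registrations, instructor):
    """Students with some class from instructor: select + project (monotonic)."""
    return {student for (student, instr) in registrations if instr == instructor}

def never_taught_by(students, taught):
    """Students with no class from the instructor: a set difference (not monotonic)."""
    return students - taught

students = {"palooka", "warbucks"}
r1 = {("palooka", "Schlipf")}
r2 = r1 | {("warbucks", "Schlipf")}      # r1 is a subset of r2

pos1, pos2 = taught_by(r1, "Schlipf"), taught_by(r2, "Schlipf")
neg1 = never_taught_by(students, pos1)
neg2 = never_taught_by(students, pos2)

print(pos1 <= pos2)    # the positive query only gains tuples
print(neg1 <= neg2)    # the difference query lost a tuple
```

Adding one registration tuple removed warbucks from the second answer, which no query built only from the operators in the theorem could do.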
DATALOG translation of all this:

1. I shall do all the syntax in terms of the system clingo.

2. To make the programming easier, we'll concatenate, into 1 big file,
      the original database ("facts", or "extensional database", or "EDB"),
      the queries ("rules", or "intensional database", or "IDB").

3. Running (in UNIX/LINUX/MAC): clingo file.lp. Output is to the screen. Try it as soon as you can. Output SATISFIABLE says the queries can all be answered. Later you'll see ways to get UNSATISFIABLE.

4. The Extensional Database (EDB): List all tuples in all relations, i.e., the facts. (Names are abbreviated to fit more on one line.)

      stdnt("Palooka", "Joe", "MD1234", 3, english).
      stdnt("Warbucks", "Annie", "MD7654", 1, mngmnt).
      course("Database Theory", cs, 6051, 3).
      course("History of Comics", engl, 227, 3).
      sect(cs, 6051, 1, fall, 2015, "Schlipf").
      regstr("MD1234", cs, 6051, 1, fall, 2015, "N").

   Notes:
   (a) Identify attributes by position, not by attribute name.
   (b) Prolog format: Variable symbols start in upper case. Everything else (except constants inside quotation marks) starts in lower case.
   (c) Period (.) at the end of every fact.
   (d) As with Prolog, if you also write stdnt("Palooka", "Joe", english, 3), with 4 argument places instead of 5, this stdnt is treated as just a different relation! Don't do that in this class!
5. The Intensional Database (IDB): Rules defining additional relations or giving constraints.

      hastaken(FName, LName, StdntNo, Dept, Crs) :-
          regstr(StdntNo, Dept, Crs, Sect, Smstr, Yr, Gr),
          stdnt(LName, FName, StdntNo, ProgYr, Major).

      mystery(FName, LName, StdntNo, Dept, Crs) :-
          regstr(StdntNo, Dept, Crs, Sect, Smstr, Yr, Gr),
          stdnt(LName, FName, StdntNo, ProgYr, Major),
          LName < Major, ProgYr > 2.

      mystery2(FName, LName, StdntNo, Dept, Crs) :-
          regstr(StdntNo, Dept, Crs, Sect, Smstr, Yr, Gr),
          stdnt(LName, FName, StdntNo, ProgYr, Major),
          not stdnt(Smstr, Yr, Gr, ProgYr, Major).

   Notes:
   (a) Definition: Rule: A statement with a :- symbol. (But some people use the word "rule" to include facts too.)
   (b) Period at the end of each rule.
   (c) The comma at the end of line 2 means "and".
   (d) Vocabulary: in
          s(X,Y) :- r(X,Y,Z), q(Z,a,X), Z>17.
       s(X,Y) is the head and everything to the right of :- is the body; r(X,Y,Z) and q(Z,a,X) are (positive) relational subgoals.
   (e) Interpretation of :- : if the body is true, the head is too.
       Semantics: define hastaken to be the set of all tuples (FName, LName, StdntNo, Dept, Crs) where, for some rule with hastaken in its head, and for some choice of attribute values for the variables, the rule body is true in the database. Tuples that cannot be derived this way are taken not to be in hastaken ("negation by failure").
   (f) Look at the first rule:

          hastaken(FName, LName, StdntNo, Dept, Crs) :-
              regstr(StdntNo, Dept, Crs, Sect, Smstr, Yr, Gr),
              stdnt(LName, FName, StdntNo, ProgYr, Major).

       Several variable names, e.g., Gr and Major, appear in the body but not in the head. The result is like a projection.
   (g) Also in that rule: I use the same variable symbol StdntNo in lines 2 and 3. That means that I'm looking for tuples in regstr and stdnt where the first attribute value from regstr is the same as the third attribute value from stdnt. Thus it's similar to a natural join, but regulated by the common variable name in the rule.
   (h) The scope of a variable symbol is one rule. So, for example, within each rule, each occurrence of FName must be interpreted by the same string. The implication is to consider this inference rule for each possible choice of values for the variables. The choices for these values are anything explicitly mentioned in the program, e.g., "Palooka", 3, or cs.
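Notes (f) and (g) can be read off a direct Python rendering of the hastaken rule. This is a rough sketch, not how clingo evaluates rules internally: each relation is a list of tuples modeled on the sample facts, the shared variable StdntNo becomes an equality test (the join), and listing only the head variables in the result plays the part of the projection.

```python
# Facts modeled on the sample EDB (attribute positions as in the slides).
stdnt = [
    ("Palooka", "Joe", "MD1234", 3, "english"),
    ("Warbucks", "Annie", "MD7654", 1, "mngmnt"),
]
regstr = [
    ("MD1234", "cs", 6051, 1, "fall", 2015, "N"),
]

# hastaken(FName, LName, StdntNo, Dept, Crs) :-
#     regstr(StdntNo, Dept, Crs, Sect, Smstr, Yr, Gr),
#     stdnt(LName, FName, StdntNo, ProgYr, Major).
hastaken = {
    (fname, lname, stdnt_no, dept, crs)          # head: keep only these variables
    for (stdnt_no, dept, crs, _sect, _smstr, _yr, _gr) in regstr
    for (lname, fname, stdnt_no2, _prog_yr, _major) in stdnt
    if stdnt_no == stdnt_no2                     # same variable StdntNo in both subgoals
}

print(hastaken)
```

Annie Warbucks has no matching regstr tuple, so no choice of variable values makes the body true for her and she does not appear in the result.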
   (i) DATALOG¬: not can be applied to subgoals in the body.

          tuitionproblycvrd(Stdnt, Dpt, Crs, Sec, Smstr, Yr) :-
              regstr(Stdnt, Dpt, Crs, Sec, Smstr, Yr, Gr),
              onugs(Stdnt, Smstr, Yr),
              Crs >= 6000,
              not Smstr == "summer",
              not oweslibraryfine(Stdnt).

   (j) Safety Rule: Every variable symbol appearing anywhere in a rule must appear in one of the positive relational subgoals.
       Why that restriction? Related: why relational algebra has a difference operator (−) but not a complement.
       Consider an unsafe rule:

          prblycvrtuit(Stdnt, Dpt, Crs, Sec, Smstr, Yr) :-
              regstr(Stdnt, Dpt, Crs, Sec, Smstr, Yr, Gr),
              onugs(Stdnt, Smstr, Yr),
              Crs >= 6000,
              not Smstr == "summer",
              not oweslibraryfine(Stdnt2).

       Who is Stdnt2? It looks as if I'm asking clingo to consider infinitely many things.
       Restricting where Stdnt2 came from also helps us ensure that adding irrelevant facts (and constants, thus new objects) doesn't change the meaning of the program.
       Challenge: find what other anomalies could occur if, say, Stdnt2 also appeared in the head of the rule.
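The safety rule in (j) is mechanical enough to check by program. The sketch below is a simplified checker written for this illustration, not clingo's actual algorithm (clingo's notion of safety has further cases this ignores). A rule is passed in as plain Python lists, variables are recognized by their upper-case first letter, and the function name unsafe_variables and the rule encoding are assumptions of the sketch. The two tuition rules are encoded with Smstr used consistently in head and body.

```python
def is_variable(term):
    """Prolog/clingo convention: a variable starts with an upper-case letter."""
    return isinstance(term, str) and term[:1].isupper()

def unsafe_variables(head_args, positive, negative, comparisons):
    """Variables used in the rule that occur in no positive relational subgoal.

    positive, negative: lists of (relation_name, args); comparisons: lists of terms.
    """
    bound = {a for _, args in positive for a in args if is_variable(a)}
    used = {a for a in head_args if is_variable(a)}
    used |= {a for _, args in negative for a in args if is_variable(a)}
    used |= {a for terms in comparisons for a in terms if is_variable(a)}
    return used - bound

head = ["Stdnt", "Dpt", "Crs", "Sec", "Smstr", "Yr"]
positive = [
    ("regstr", ["Stdnt", "Dpt", "Crs", "Sec", "Smstr", "Yr", "Gr"]),
    ("onugs", ["Stdnt", "Smstr", "Yr"]),
]
comparisons = [["Crs", 6000], ["Smstr", '"summer"']]

safe_rule = unsafe_variables(head, positive, [("oweslibraryfine", ["Stdnt"])], comparisons)
unsafe_rule = unsafe_variables(head, positive, [("oweslibraryfine", ["Stdnt2"])], comparisons)

print(safe_rule, unsafe_rule)
```

The first rule yields no unsafe variables; the second reports Stdnt2, the variable nothing in the positive subgoals ties down.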
   (k) Recursive programs:

       Direct positive recursion:
          r(X,Y) :- s(X), s(Y), X<Y.
          r(X,Y) :- t(X,Z), r(Y,Z).

       Indirect positive recursion:
          r(X,Y) :- s(X), s(Y), X<Y.
          r(X,Y) :- a(X,Y), s(X), s(Y), X<Y.
          a(X,Y) :- t(X,Z), r(Y,Z), not q(Z).

       Negative recursion:
          r(X,Y) :- s(X), s(Y), X<Y.
          r(X,Y) :- a(X,Y), s(X), s(Y), X<Y.
          a(X,Y) :- t(X,Z), r(Y,Z), not q(Z).
          q(Z) :- r(X,Z).
          q(Z) :- a(Z,X).

       i. For now, we'll be concerned with non-recursive programs.
       ii. Adding recursion lets us define relations that are not definable in relational algebra. Standard Example: Transitive closures.
              ancestorof(X,Y) :- fatherof(X,Y).
              ancestorof(X,Y) :- motherof(X,Y).
              ancestorof(X,Z) :- ancestorof(X,Y), ancestorof(Y,Z).
       iii. Recursion where some of the recursion is through negative subgoals (negative recursion) is more complicated than positive recursion.
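To see what the recursive ancestorof rules compute, here is a minimal Python sketch of naive bottom-up evaluation: start from the facts the two non-recursive rules give, then keep applying the recursive rule until nothing new appears. The function name ancestor_of and the family facts are invented for the illustration, and clingo's own evaluation strategy is more sophisticated than this loop.

```python
def ancestor_of(father_of, mother_of):
    """Least fixpoint of the three ancestorof rules (naive bottom-up evaluation)."""
    ancestors = set(father_of) | set(mother_of)      # the two non-recursive rules
    while True:
        derived = {
            (x, z)
            for (x, y) in ancestors
            for (y2, z) in ancestors
            if y == y2                               # ancestorof(X,Y), ancestorof(Y,Z)
        }
        if derived <= ancestors:                     # nothing new: fixpoint reached
            return ancestors
        ancestors |= derived

# Hypothetical facts: al is bo's father, bo is cy's mother, cy is di's mother.
father_of = {("al", "bo")}
mother_of = {("bo", "cy"), ("cy", "di")}

result = ancestor_of(father_of, mother_of)
print(sorted(result))
```

The number of passes depends on the length of the longest chain in the data, which is why no single fixed relational algebra expression can express the same query.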
Upcoming Theorem: The queries we can state in relational algebra are exactly the same as the queries we can state in non-recursive DATALOG.

Work Through Example:

Facts:
   stdnt("Palooka", "Joe", md1234567, 3, english).
   stdnt("Warbucks", "Annie", md7654321, 1, management).
   stdnt("Twist", "Oliver", md0000000, 4, cs).
   stdnt("Heap", "Uriah", md9999999, 2, cs).
   stdnt("Jahan", "Shah", md1111111, 6, history).
   stdnt("Mahal", "Mumtaz", md2222222, 6, cs).

   course("Database Theory", cs, 6051, 3).
   course("History of Comics", engl, 2027, 3).
   course("Data Structures", cs, 2028, 4).
   course("Software Engineering", eece, 4095, 4).
   course("AI", cs, 6033, 3).
   course("Info Retrieval", cs, 6054, 3).
   course("Adv. Algorithms I", cs, 7081, 3).
   course("Russian Empire", hist, 7021, 3).

   sched(cs, 6051, 001, fall, 2015, "Schlipf").
   sched(engl, 2027, 067, spring, 2015, "Capp").
   sched(cs, 6051, 001, fall, 2015, "Hamad").
   sched(cs, 6054, 001, fall, 2015, "Cheng").
   sched(cs, 7081, 001, fall, 2015, "Berman").

   regist(md1234567, cs, 6051, 001, fall, 2015, "N").
   regist(md0000000, cs, 6051, 001, fall, 2015, "N").
   regist(md7654321, engl, 2027, 067, spring, 2015, "B-").
   regist(md2222222, hist, 7021, 001, spring, 2015, "A").

   prereq4(hist, 7021, cs, 1021).
   prereq4(cs, 7081, cs, 6051).
Query: Names of students currently taking both cs 7890 and engl 7890

   weird(LName, FName) :-
       stdnt(LName, FName, Mnmbr, Year, Major),
       regist(Mnmbr, cs, 7890, SectNo1, fall, 2015, Grd1),
       regist(Mnmbr, engl, 7890, SectNo2, fall, 2015, Grd2).

Query: Names of students currently taking cs 7890 or engl 7890

   ambitious(LName, FName) :-
       stdnt(LName, FName, Mnmbr, Year, Major),
       regist(Mnmbr, cs, 7890, SectNo, fall, 2015, Grd).
   ambitious(LName, FName) :-
       stdnt(LName, FName, Mnmbr, Year, Major),
       regist(Mnmbr, engl, 7890, SectNo, fall, 2015, Grd).

Query: Names of students who have taken both cs 7890 and engl 7890 since 2010

   semiweird(LName, FName) :-
       stdnt(LName, FName, Mnmbr, Year, Major),
       regist(Mnmbr, cs, 7890, SectNo1, Smstr1, Yr1, Grd1), Yr1 >= 2010,
       regist(Mnmbr, engl, 7890, SectNo2, Smstr2, Yr2, Grd2), Yr2 >= 2010.

   (The two regist subgoals use different variables for section and grade; sharing them would also require the same section number and the same grade in both courses.)

Query: Order all the semesters in the registration history by time. (I pretend all UC classes were in fall, spring, or summer semesters and assume all 3 semesters were offered each year UC existed.)

   earliersmstr(Smstr1, Yr1, Smstr2, Yr2) :-
       regist(Mn1, D1, CN1, SN1, Smstr1, Yr1, G1),
       regist(Mn2, D2, CN2, SN2, Smstr2, Yr2, G2),
       Yr1 < Yr2.
   earliersmstr(spring, Yr, summer, Yr) :-
       regist(Mn, D, CN, SN, Smstr, Yr, G1).
   earliersmstr(summer, Yr, fall, Yr) :-
       regist(Mn, D, CN, SN, Smstr, Yr, G1).
   earliersmstr(spring, Yr, fall, Yr) :-
       regist(Mn, D, CN, SN, Smstr, Yr, G1).
Query: All pairs of students who have taken exactly the same courses (but possibly not in the same semester or the same section).

   hasevertaken(Stdnt, Dpt, Crs) :-
       regist(Stdnt, Dpt, Crs, _, _, _, _).
   mnmbrused(Stdnt) :-
       stdnt(_, _, Stdnt, _, _).
   hastakenmore(Stdnt1, Stdnt2) :-
       hasevertaken(Stdnt1, Dpt, Crs),
       mnmbrused(Stdnt2),
       not hasevertaken(Stdnt2, Dpt, Crs).
   havetakensamecourses(Stdnt1, Stdnt2) :-
       mnmbrused(Stdnt1), mnmbrused(Stdnt2),
       not hastakenmore(Stdnt1, Stdnt2),
       not hastakenmore(Stdnt2, Stdnt1).

   (The positive subgoal mnmbrused(Stdnt2) in hastakenmore is needed to make that rule safe: otherwise Stdnt2 would appear only in a negated subgoal.)

A Simple Constraint: We may not have 2 students with the same number. The only way we can interpret this is to say that if 2 students have the same student number, they must be identical in everything else except, possibly, major.

A blank to the left of the :- symbol means "infer false", i.e., the rule body must be false.

   :- stdnt(LName1, _, Mnmbr, _, _),
      stdnt(LName2, _, Mnmbr, _, _),
      LName1 != LName2.
   :- stdnt(_, FName1, Mnmbr, _, _),
      stdnt(_, FName2, Mnmbr, _, _),
      FName1 != FName2.
   :- stdnt(_, _, Mnmbr, ClssYr1, _),
      stdnt(_, _, Mnmbr, ClssYr2, _),
      ClssYr1 != ClssYr2.
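The "exactly the same courses" query is a typical case where an "every" in the English turns into two negations. A rough Python rendering of the same idea follows; the function name have_taken_same_courses and the three sample students are invented for the sketch, and each set comprehension stands in for one Datalog rule.

```python
def have_taken_same_courses(students, has_ever_taken):
    """Pairs of students whose sets of (dept, course) pairs are equal.

    students: set of student numbers; has_ever_taken: set of (student, dept, course).
    """
    # s1 has taken some course that s2 has not (the first negation).
    taken_more = {
        (s1, s2)
        for (s1, dept, crs) in has_ever_taken
        for s2 in students                       # keeps s2 tied to a known student
        if (s2, dept, crs) not in has_ever_taken
    }
    # Neither student has taken more than the other (the second negation).
    return {
        (s1, s2)
        for s1 in students
        for s2 in students
        if (s1, s2) not in taken_more and (s2, s1) not in taken_more
    }

students = {"m1", "m2", "m3"}
has_ever_taken = {("m1", "cs", 6051), ("m2", "cs", 6051), ("m3", "engl", 2027)}

pairs = have_taken_same_courses(students, has_ever_taken)
print(sorted(pairs))
```

As with the Datalog version, every student is paired with himself or herself, and each matching pair shows up in both orders.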
Homework for 1 week from today:

Redo Homework 1, but in clingo instead of relational algebra. Email your solution to me before 4:00 that day. I'll write a test data file (to try to trick your programs, of course) and run your programs together with my test data file.

Of course, change relation and constant names in homework 1 to start with lower-case letters.

Store your answers in relations answer1 (a unary relation), answer2 (binary), answer3 (ternary), and answer4 (binary).

Give any additional relations names that I can easily figure out.

Note that clingo requires all rules to be safe.

You don't need recursion, so try to avoid using it.
Database Anomalies:

Basic idea of a problem I saw in my programming days: The company stored prices for items sold. But how were the prices set?
   By item type?
   By item type and date of order?
   By individual negotiation with purchasers, item type by item type?
   By individual negotiation with purchasers, individual item by individual item?
Their answer was "Yes."

A natural solution might have 2 tables:
   Item#  PeriodStart  PeriodEnd  Price
   Item#  Cust#  PeriodStart  PeriodEnd  BargainedPrice

Unfortunately, ..., in their solution, when one customer bargained a new (higher or lower) price, that price was also reported for other items. (Government accounting?)

Anomalies: Errors in the data caused by updating other data not apparently related to it.

Incidentally, if the customer is charged the regular price, what would you do with the BargainedPrice attribute value?
Insertion Anomaly: Inconsistency in data created by inserting new data.
Deletion Anomaly: Loss of data created by deleting basically irrelevant data.
Modification Anomaly: Inconsistency in or loss of data created by changing data.

   Emp#   FName     LName      Hours  Prod#  ProdName   ProdLoc
   12345  John      Smith      32.5   1      Dingbat    Bellaire
   12345  John      Smith       7.5   2      Gizmo      Sugarland
   12345  John      Smith      15.0   3      DooDad     Houston
   01999  Rex       Bhatnagar  23.9   1      Dingbat    Bellaire
   33445  Franklin  Wong       10.0   3      DooDad     Houston
   33445  Franklin  Wong       10.0   10     Thresh80   Stafford
   98877  Alicia    Zelaya     30.0   30     Magic007   Stafford
   98877  Alicia    Zelaya     10.0   10     Thresh80   Stafford
   98765  Jennifer  Wallace    15.0   20     Reorg.     Houston
   98765  Jennifer  Wallace    20.0   30     Progress!  Stafford
   56444  Ramesh    Narayan    40.0   3      DooDad     Houston
   45345  Joyce     English    20.0   1      Dingbat    Bellaire
   45345  Joyce     English    20.0   2      Gizmo      Sugarland
   98700  Ahmad     Jabbar     35.0   10     Thresh80   Stafford
   98700  Ahmad     Jabbar      5.0   30     Magic007   Stafford
   88888  James     Borg        0.0   20     Reorg.     Houston
   80808  John      Smith       0.0   1      Dingbat    Bellaire

1. What's a likely key to the above relation?
2. What could go wrong if I added another labor record for Alicia Zelaya?
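One way to hunt for this kind of inconsistency is to check whether an attribute that ought to determine another actually does. The Python sketch below assumes (my assumption, not something stated on the slide) that each Prod# is meant to have a single ProdName and a single ProdLoc. The helper name fd_violations is made up, and the three rows are copied from the table above with some columns left out.

```python
from collections import defaultdict

def fd_violations(rows, lhs, rhs):
    """Values of attribute lhs that are paired with more than one value of rhs."""
    seen = defaultdict(set)
    for row in rows:
        seen[row[lhs]].add(row[rhs])
    return {key: values for key, values in seen.items() if len(values) > 1}

# Three rows taken from the table (Hours and FName omitted for brevity).
rows = [
    {"Emp#": "98877", "LName": "Zelaya", "Prod#": 30,
     "ProdName": "Magic007", "ProdLoc": "Stafford"},
    {"Emp#": "98765", "LName": "Wallace", "Prod#": 30,
     "ProdName": "Progress!", "ProdLoc": "Stafford"},
    {"Emp#": "98765", "LName": "Wallace", "Prod#": 20,
     "ProdName": "Reorg.", "ProdLoc": "Houston"},
]

name_conflicts = fd_violations(rows, "Prod#", "ProdName")
loc_conflicts = fd_violations(rows, "Prod#", "ProdLoc")
print(name_conflicts, loc_conflicts)
```

Because the product name is repeated in every labor record, each new record is another chance to introduce such a disagreement; storing product data once, in its own table, removes that chance.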
Fortunately, the cure for anomalies generally also saves space, but it does normally increase the access time needed to answer queries.

What is a good relational schema?
   Captures semantics of the attributes
   Reduces redundancy and update anomalies
   Reduces null values
   Disallows spurious tuples