Semistructured data, XML, DTDs
Introduction to databases
CSCC43 Winter 2012
Ryan Johnson
Thanks to Manos Papagelis, John Mylopoulos, Arnold Rosenbloom, and Renee Miller for material in these slides

Structured vs. unstructured data
- Databases are highly structured
  - Well-known data format: relations and tuples
  - Every tuple conforms to a known schema
  - Data independence? Woe unto you if you lose the schema
- Plain text is unstructured
  - Cannot assume any predefined format
  - Apparent organization makes no guarantees
  - Self-describing: little external knowledge needed... but have to infer what the data means
- Irony: database cannot stand alone

Motivation for self-describing data
- Consider a C struct
      struct {
          int id;
          int type;
          char name[8];
          struct {
              double x;
              double y;
          } location;
      } shape;
- Data at code level: {1, 101, "square", {1.5, 5.0}}
- Data at byte level:
      0x0000000100000065
      0x7371756172650000
      0x3FF8000000000000
      0x4014000000000000
- Variable-length fields? Pointers? Endianness?
- *Very* easy to embed [parts of] schema in logic

Enter semistructured data
- Observation: most data has some structure
  - Text: sentences, paragraphs, sections, ...
  - Books: chapters
  - Web pages: HTML
- Idea of semistructured data:
  - Enforce well-formatted data => always know how to read/parse/manipulate it
  - Optionally, enforce well-structured data also => might help us interpret the data, too
- Pro: highly portable
- Con: verbose/redundant
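To make the byte-level view concrete, here is a minimal Python sketch that packs the same values and reads them back. The format string `LAYOUT` is an assumption made for illustration (big-endian, no padding, 32-bit ints); a real C compiler may lay the struct out differently, which is exactly the portability problem raised above.

```python
import struct

# Assumed layout mirroring the C struct: two 32-bit ints, an 8-byte name,
# and two doubles. ">" selects big-endian byte order with no padding.
LAYOUT = ">ii8sdd"

packed = struct.pack(LAYOUT, 1, 101, b"square", 1.5, 5.0)
print(packed.hex())

# Reading the bytes back requires knowing LAYOUT: the bytes carry no schema.
shape_id, shape_type, name, x, y = struct.unpack(LAYOUT, packed)
name = name.rstrip(b"\x00").decode("ascii")
print(shape_id, shape_type, name, x, y)

# The same values written little-endian produce different bytes.
little = struct.pack("<ii8sdd", 1, 101, b"square", 1.5, 5.0)
print(little.hex())
```

A reader who guesses the wrong byte order or field widths gets plausible-looking garbage rather than an error, which is the motivation for formats that describe themselves.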
Why not use... HTML?
      <dl>
      <dt style="color:red">id
      <dd>1
      <dt>type</dt>
      <dd>101</dd>
      <dt>name
      <dd>square
      <dt>location
      <dd><dl>
        <dt>x
        <dd>1.5
        <dt>y</dt>
        <dd>5</dd>
      </dl>
- Pro: popular
- Con: inconsistent, buggy
  - Closing tags often missing
  - div, table, ul instead of dl?
  - Parsing is *hard*
- Con: data+presentation
  - Describes presentation and structure, but not content
  - More like a query result
  - Fixed meaning for all tags

Why not use... JSON? (JavaScript Object Notation)
      {
        "id" : 1,
        "type" : 101,
        "name" : "square",
        "location" : { "x" : 1.5, "y" : 5 }
      }
- Pros:
  - simple/intuitive
  - portable
- Cons:
  - No support for any kind of metadata
  - Underspecified (e.g. can't constrain types)
  - Data processing tools missing/immature
- Growing popularity due to its simplicity

XML

XML: designed for data interchange
      <books search-terms="database+design">
        <book>
          <title>Database Design for Mere Mortals</title>
          <author>Michael J. Hernandez</author>
          <date>13/03/2003</date>
        </book>
        <book id="B2">
          <title>Beginning Database Design</title>
          <subtitle>From Novice to Professional</subtitle>
          <author>Clare Churcher</author>
        </book>
      </books>
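As a rough illustration of why a strictly formatted, self-describing document is easy to consume, the sketch below reads the books example with Python's standard `xml.etree.ElementTree` module. The dictionary keys in `records` are made up for this example; note that the two books do not have the same fields, and the code has to allow for that.

```python
import xml.etree.ElementTree as ET

BOOKS_XML = """<books search-terms="database+design">
  <book>
    <title>Database Design for Mere Mortals</title>
    <author>Michael J. Hernandez</author>
    <date>13/03/2003</date>
  </book>
  <book id="B2">
    <title>Beginning Database Design</title>
    <subtitle>From Novice to Professional</subtitle>
    <author>Clare Churcher</author>
  </book>
</books>"""

root = ET.fromstring(BOOKS_XML)

records = []
for book in root.findall("book"):
    records.append({
        "id": book.get("id"),                    # None when the attribute is absent
        "title": book.findtext("title"),
        "subtitle": book.findtext("subtitle"),   # None when the element is absent
        "date": book.findtext("date"),
    })

for record in records:
    print(record)
```

Nothing in the document says which fields are optional; the program discovers that only by finding them missing. That gap is what a schema language such as a DTD is meant to close.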
Features of XML
- Intentionally similar syntax to HTML
  - Tree-structured (hierarchical) format
  - Elements surrounded by opening and closing tags
  - Attributes embedded in opening tags
    => <tag-name attr-name="attr-value">data</tag-name>
- But with important differences
  - Strictly well-formed (must close all tags, etc.)
  - Tag/attribute names carry no semantic meaning
  - Data-only format: no implied presentation
- Descendant of SGML (as is HTML)

XML terminology
      <?xml version="1.0"?>
      <PersonList Type="Student" Date="2002-02-02">
        <Title Value="Student List"/>
        <Person> ... </Person>
        <Person> ... </Person>
      </PersonList>
- PersonList, Title, Person: element (or tag) names
- Type, Date, Value: attributes
- PersonList: root element
- Title: empty element
- Elements are nested
- Root element contains all others

XML terminology (cont.)
      <Person Name="John" Id="s111111111">
        John is a nice fellow
        <Address>
          <Number>21</Number>
          <Street>Main St.</Street>
        </Address>
      </Person>
- <Person ...>: opening tag
- </Person>: closing tag (what is open must be closed)
- Everything between them: content of Person
- "John is a nice fellow": standalone text, not very useful as data, non-uniform
- Address: nested element, child of Person
- Number: child of Address, descendant of Person
- Person: parent of Address, ancestor of Number

Example XML Document
      <?xml version="1.0"?>
      <!-- Some comment -->
      <Students>
        <Student StudId="111111111">
          <Name><First>John</First><Last>Doe</Last></Name>
          <Status>U2</Status>
          <CrsTaken CrsCode="CS308" Semester="F1997"/>
          <CrsTaken CrsCode="MAT123" Semester="F1997"/>
        </Student>
        <Student StudId="987654321">
          <Name><First>Bart</First><Last>Simpson</Last></Name>
          <Status>U4</Status>
          <CrsTaken CrsCode="CS308" Semester="F1994"/>
        </Student>
      </Students>
      <!-- Some other comment -->
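The parent, child, ancestor and descendant vocabulary maps directly onto tree operations. The following sketch uses Python's standard `xml.etree.ElementTree` on the Person example; the variable names are illustrative only.

```python
import xml.etree.ElementTree as ET

PERSON_XML = """<Person Name="John" Id="s111111111">
  John is a nice fellow
  <Address>
    <Number>21</Number>
    <Street>Main St.</Street>
  </Address>
</Person>"""

person = ET.fromstring(PERSON_XML)

# Attributes live on the opening tag.
attributes = dict(person.attrib)

# Children are the directly nested elements.
children = [child.tag for child in person]

# Descendants are everything below; iter() yields the element itself first.
descendants = [el.tag for el in person.iter()][1:]

# ElementTree keeps no parent pointers, so build a child -> parent map.
parent_of = {child: parent for parent in person.iter() for child in parent}
number = person.find("Address/Number")

print(attributes, children, descendants, parent_of[number].tag)
```

The standalone text "John is a nice fellow" ends up in `person.text` together with its surrounding whitespace, which hints at why mixed text and elements is awkward to treat as data.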
XML Document is a Tree

Two kinds of XML Documents
- Well-Formed XML
  - Just need to use proper nesting
  - Can invent your own tags
  - Any tag can go anywhere
- Validated XML
  - Can invent tags, but have to declare them and specify where they can go
  - A DTD (document type definition) specifies these rules

Rules for well-formed XML
- Must have a root element
- Every opening tag must have a matching closing tag
- Elements must be properly nested
  - <foo><bar></foo></bar> is a no-no
- An attribute name can occur at most once in an opening tag. If it occurs:
  - It must have an explicitly specified value (Boolean attrs, like in HTML, are not allowed)
  - The value must be quoted (with ' or ")
- Parsers are not allowed to tolerate ill-formed XML

Valid names in XML
- Simple rules for element/attribute names
  - may include letters (case sensitive!)
  - may include (but not start with) digits and some punctuation
  - no reserved words or keywords
- But lots of gotchas
  - Names must not start with xml (case insensitive)
  - Non-ASCII/Latin letters: legal but not all parsers support them
  - Punctuation is iffy business (one exception: -)
    - Entity characters always forbidden: < > &
    - Spec recommends _ instead of - (real life: the opposite is true)
    - : is reserved for namespaces (not enforced)
    - . officially discouraged (real life: very rare)
    - $ is not legal in names; it is often used for parameter substitution by XML processors (XQuery, etc.)
    - Other punctuation (@ # % ...) is not legal in names either
  - Upper case letters legal but fairly rare
    - All caps very rare (just like rest of Internet)
    - Often see book-list instead of camel case BookList
- Rule of thumb: lower case and - usually best
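The well-formedness rules can be tried out with any conforming parser. The sketch below uses Python's standard `xml.etree.ElementTree`; the helper `is_well_formed` and the sample labels are made up for illustration.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if the parser accepts text as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

samples = {
    "proper nesting": "<foo><bar></bar></foo>",
    "bad nesting": "<foo><bar></foo></bar>",
    "unclosed tag": "<foo><bar></foo>",
    "duplicate attribute": '<foo a="1" a="2"/>',
    "unquoted value": "<foo a=1/>",
    "boolean attribute": "<foo checked/>",
    "two roots": "<foo/><bar/>",
}

results = {label: is_well_formed(text) for label, text in samples.items()}
for label, ok in results.items():
    print(label, "->", "well-formed" if ok else "rejected")
```

Only the first sample is accepted. Unlike an HTML browser, the parser makes no attempt to repair the others.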
XML, text, and whitespace
- Adjacent non-tag chars parsed as text nodes
- Parser never ignores whitespace
  - Leading and trailing space left with its text node
  - Whitespace between tags produces whitespace-only text nodes
- Example:
      <foo> hi<bar> ho </bar> </foo>
  (Figure: parse tree rooted at foo, with a text node "hi", a child element bar, and bar's text node "\n ho \n")

Example: Well-Formed XML
      <?xml version="1.0" standalone="yes"?>
      <platforms>
        <platform>
          <name>X-Box</name>
          <game><title>Halo</title>
                <price>59.99</price></game>
          <game><title>Crash Bandicoot</title>
                <price>49.99</price></game>
        </platform>
        <platform> ... </platform>
      </platforms>
- <platforms>: root tag
- <platform> ... </platform>: tags surrounding a platform element
- <name>: a name subelement
- <game>: a game subelement
- Nesting rule for tags must be obeyed

Checking your XML
- http://validator.w3.org
- xmllint command on cdf
  - By default, checks if well-formed
  - --debug: outputs an annotated tree of the parsed document

Problems with well-formed XML
- If a program will process XML, good to know things like:
  - What tags are allowed
  - What order, nesting
  - What attributes for each tag
  - What's mandatory or optional
- A DTD specifies exactly this
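To see the whitespace rules in action, this sketch parses the one-line foo example with Python's standard `xml.dom.minidom`, which exposes text nodes explicitly. The helper `describe` is hypothetical and exists only to make the node lists readable.

```python
from xml.dom import minidom

doc = minidom.parseString("<foo> hi<bar> ho </bar> </foo>")
foo = doc.documentElement

def describe(node):
    """Summarise a DOM node as a (kind, value) pair."""
    if node.nodeType == node.TEXT_NODE:
        return ("text", node.data)
    return ("element", node.tagName)

foo_children = [describe(n) for n in foo.childNodes]
bar = foo.getElementsByTagName("bar")[0]
bar_children = [describe(n) for n in bar.childNodes]

print(foo_children)
print(bar_children)
```

The leading space stays attached to "hi", the spaces around "ho" stay inside bar's text node, and the single space between `</bar>` and `</foo>` becomes a text node of its own. Code that walks child nodes usually has to skip such whitespace-only nodes explicitly.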
DOCUMENT TYPE DEFINITION (DTD)

Document type definition (DTD)
- Enforces more than well-formedness
  - Which elements may (or must) appear where
  - Attributes elements may (or must) have
  - Types attributes and data must adhere to
- DTD is separate from the XML it constrains
  - May be embedded in a separate section
  - Most often referenced externally
- Validation: checking XML against its DTD(s)
  - Important for interpreting/validating data
  - Not necessary for parsing

DTD building blocks
- Elements (<an-element>...</an-element>)
  - Must always close tags
  - If no contents: <empty-element/>
- Attributes (<... an-attr="..." ...>)
- Entities (special tokens)
  - e.g. &lt; &gt; &amp; &quot; &apos;
  - HTML defines lots of others (e.g. &nbsp;)
  - More on this later
- PCDATA (parsed character data)
  - Mixed text and markup
  - Use entities to escape >, etc. which should not be parsed
- CDATA ([non-parsed] character data)
  - Plain text data
  - Tags not parsed, entities not expanded

DTD elements
- <!ELEMENT $e ...>
  - $e is the element name
- "..." may contain any of:
  - Nothing: <!ELEMENT $e EMPTY>
  - Anything: <!ELEMENT $e ANY>
  - Text data: <!ELEMENT $e (#PCDATA)>
    - Always parsed (#CDATA not allowed here)
  - Child elements: <!ELEMENT $e (...)>
    - Any child referenced must also be declared
    - Child elements may themselves have children
  - Mixed content: <!ELEMENT $e (#PCDATA | ...)*>
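The difference between parsed and non-parsed character data can be sketched in Python with the standard library. The sample string `raw` is made up; `escape` is the standard `xml.sax.saxutils.escape`, and the second variant wraps the text in a CDATA section, which is one common way to carry non-parsed text inside a document.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

raw = 'if (a < b && c > d) print "ok"'

# Parsed character data: markup characters must be written as entities.
escaped = escape(raw)
parsed = ET.fromstring("<code>" + escaped + "</code>")

# Non-parsed character data: a CDATA section is taken literally.
cdata = ET.fromstring("<code><![CDATA[" + raw + "]]></code>")

print(escaped)
print(parsed.text)
print(cdata.text)
```

Both elements yield the original string once parsed; only the serialized form differs. Putting `raw` between the tags without either treatment would make the document ill-formed.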
DTD elements: children
- Base construct: sequence (,)
  - <!ELEMENT $e (a)>
  - <!ELEMENT $e (a, b, c, ...)>
  - Comma "," defines the order in which children must appear in the XML
- Either-or content (|)
  - <!ELEMENT $e (a | b ...)>
  - Exactly one of the options must appear in the XML
- Constraining child cardinality
  - <!ELEMENT $e (a, b+, c*, d?)>
  - not followed by any of +, *, ?: exactly one (e.g., a)
  - +: at least one (e.g., b)
  - *: zero or more (e.g., c)
  - ?: at most one (e.g., d)
- Sequences and either-or can both nest

DTD elements: Example
      <!ELEMENT resume (bio, interests, education,
                        experience, awards, service)>
      <!ELEMENT bio (name, addr, phone, email?, fax?, url?)>
      <!ELEMENT interests (interest+)>
      <!ELEMENT education (degree*)>
      <!ELEMENT awards ((award | honor)*)>
      ...

DTD elements: Another Example
      <!DOCTYPE platforms [
        <!ELEMENT platforms (platform*)>
        <!ELEMENT platform (name, game+)>
        <!ELEMENT name (#PCDATA)>
        <!ELEMENT game (name, price)>
        <!ELEMENT price (#PCDATA)>
      ]>
- A PLATFORMS element has zero or more PLATFORM elements nested within
- A PLATFORM has one NAME and one or more GAME elements
- A GAME has a NAME and a PRICE
- NAME and PRICE are text

DTD Attributes
- <!ATTLIST $e $a $type $required>
  - Declares an attribute $a on element $e
- $type may be any of
  - character data: CDATA
  - one of a set of values: (v1 | v2 ...)
  - unique identifier: ID
  - reference(s) to the ID value(s) of other elements: IDREF[S]
  - valid XML name token (or list of tokens): NMTOKEN[S]
  - entity (or entities): ENTITY/ENTITIES
- $required may be
  - required (not required): #REQUIRED (#IMPLIED)
  - fixed value (always the same): #FIXED $value
  - default value (used if none given): $value
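DTD content models are essentially regular expressions over child element names. The sketch below is not a validator; it only shows that correspondence for the model (a, b+, c*, d?) using Python's `re` module. The encoding of each child as "name," and the helper `children_match` are assumptions made for this example.

```python
import re
import xml.etree.ElementTree as ET

# Each child tag is written as "name," so the DTD operators +, * and ?
# carry over to regular-expression syntax unchanged.
MODEL = re.compile(r"(a,)(b,)+(c,)*(d,)?")   # mirrors (a, b+, c*, d?)

def children_match(xml_text):
    """Check the root element's children against MODEL."""
    root = ET.fromstring(xml_text)
    child_names = "".join(child.tag + "," for child in root)
    return MODEL.fullmatch(child_names) is not None

checks = {
    "<e><a/><b/></e>": children_match("<e><a/><b/></e>"),
    "<e><a/><b/><b/><c/><c/><d/></e>": children_match("<e><a/><b/><b/><c/><c/><d/></e>"),
    "<e><a/></e>": children_match("<e><a/></e>"),
    "<e><b/><a/></e>": children_match("<e><b/><a/></e>"),
    "<e><a/><b/><d/><d/></e>": children_match("<e><a/><b/><d/><d/></e>"),
}
for doc, ok in checks.items():
    print(doc, "->", ok)
```

The first two documents match; the others fail because b is missing, the order is wrong, or d appears twice.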
DTD attributes: examples
      <!ATTLIST person
        sin      ID          #REQUIRED
        spouse   IDREF       #IMPLIED
        name     CDATA       "John Doe"
        trusted  (yes | no)  "no"
        species  CDATA       #FIXED "homo sapiens"
        alive    (yes | no)  #IMPLIED
      >

DTD attributes: ID[REF][S]
- ID attribute type
  - Uniquely identifies an element in the document (like keys)
  - Error to have two elements with the same ID value
  - Like HTML id attribute, but can have any name
- IDREF
  - Refers to another element by ID (like foreign keys)
  - Error if corresponding ID does not exist
  - Like HTML href attribute, but no # needed
- IDREFS
  - List of IDREF values, space separated
- #IMPLIED unless specified otherwise
- Problem: only one global set of IDs

Example: a DTD
      <!DOCTYPE PLATFORMS [
        <!ELEMENT PLATFORMS (PLATFORM*, GAME*)>
        <!ELEMENT PLATFORM (SELLS+)>
        <!ATTLIST PLATFORM name ID #REQUIRED>
        <!ELEMENT SELLS (#PCDATA)>
        <!ATTLIST SELLS thegame IDREF #REQUIRED>
        <!ELEMENT GAME EMPTY>
        <!ATTLIST GAME name ID #REQUIRED>
        <!ATTLIST GAME soldby IDREFS #IMPLIED>
      ]>

Example: The XML Document
      <PLATFORMS>
        <PLATFORM name="X-Box">
          <SELLS thegame="Halo">59.99</SELLS>
          <SELLS thegame="CrashBandicoot">49.99</SELLS>
        </PLATFORM>
        ...
        <GAME name="Halo" soldby="X-Box X-Box-360"/>
        ...
      </PLATFORMS>
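The ID and IDREF rules amount to a uniqueness check plus a lookup in one document-wide set. The sketch below hand-codes that check in Python; it does not read the DTD, so the attribute roles in `ID_ATTRS`, `IDREF_ATTRS` and `IDREFS_ATTRS` are copied manually from the declarations above, and the sample documents `GOOD` and `BAD` are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Attribute roles copied by hand from the DTD; ElementTree does not validate.
ID_ATTRS = {("PLATFORM", "name"), ("GAME", "name")}
IDREF_ATTRS = {("SELLS", "thegame")}
IDREFS_ATTRS = {("GAME", "soldby")}

def check_ids(xml_text):
    """Return a list of ID/IDREF problems found in xml_text."""
    root = ET.fromstring(xml_text)
    seen, refs, problems = set(), [], []
    for el in root.iter():
        for attr, value in el.attrib.items():
            key = (el.tag, attr)
            if key in ID_ATTRS:
                if value in seen:          # one global ID space for all elements
                    problems.append("duplicate ID: " + value)
                seen.add(value)
            elif key in IDREF_ATTRS:
                refs.append(value)
            elif key in IDREFS_ATTRS:
                refs.extend(value.split())  # space-separated list of IDs
    problems += ["dangling reference: " + r for r in refs if r not in seen]
    return problems

GOOD = """<PLATFORMS>
  <PLATFORM name="X-Box"><SELLS thegame="Halo">59.99</SELLS></PLATFORM>
  <GAME name="Halo" soldby="X-Box"/>
</PLATFORMS>"""
BAD = GOOD.replace('soldby="X-Box"', 'soldby="X-Box X-Box-360"')

print(check_ids(GOOD))
print(check_ids(BAD))
```

Because `seen` is a single set, a platform and a game could not share the same name, and a `thegame` reference would be satisfied by any ID at all, including a platform's. That is the weakness behind "only one global set of IDs".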
DTD entities
- The XML equivalent of #define
  - <!ENTITY $name "$substituted-value">
  - Can't take parameters, though
- Used just like other entities
      <politician-speak>
        I vow to lead the fight to stamp out &buzz-word;
        by instituting powerful new programs that will...
      </politician-speak>
- Pick your favorite substitution:
      <!ENTITY buzz-word "communism">
      <!ENTITY buzz-word "racism">
      <!ENTITY buzz-word "terrorism">
      <!ENTITY buzz-word "illegal file sharing">
- Not heavily used: better templating methods exist

Embedded vs. External DTD
- Specified as part of a document
      <?xml version="1.0"?>
      <!DOCTYPE Book [ ... ]>
      <Book> ... </Book>
- Reference to external (stand-alone) DTD
      <?xml version="1.0"?>
      <!DOCTYPE Book SYSTEM "http://csc343.com/book.dtd">
      <Book> ... </Book>

EXAMPLE: Embedded DTD
      <?xml version="1.0" standalone="no"?>
      <!DOCTYPE PLATFORMS [
        <!ELEMENT PLATFORMS (PLATFORM*)>
        <!ELEMENT PLATFORM (NAME, GAME+)>
        <!ELEMENT NAME (#PCDATA)>
        <!ELEMENT GAME (NAME, PRICE)>
        <!ELEMENT PRICE (#PCDATA)>
      ]>
      <PLATFORMS>
        <PLATFORM><NAME>X-Box</NAME>
          <GAME><NAME>Halo</NAME>
                <PRICE>59.99</PRICE></GAME>
          <GAME><NAME>Crash Bandicoot</NAME>
                <PRICE>49.99</PRICE></GAME>
        </PLATFORM>
        <PLATFORM> ... </PLATFORM>
      </PLATFORMS>
- The DTD: everything between <!DOCTYPE PLATFORMS [ and ]>
- The document: the PLATFORMS element that follows

EXAMPLE: External DTD
      <?xml version="1.0" standalone="no"?>
      <!DOCTYPE platforms SYSTEM "PLATFORM.dtd">
      <platforms>
        <platform><name>X-Box</name>
          <game><title>Halo</title>
                <price>59.99</price></game>
          <game><title>Crash Bandicoot</title>
                <price>49.99</price></game>
        </platform>
        <platform> ... </platform>
      </platforms>
- Get the DTD from the file PLATFORM.dtd
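Entity substitution can be observed with a parser that expands entities declared in an embedded DTD. The sketch below relies on Python's standard `xml.etree.ElementTree`, which does this for internal entities; the `TEMPLATE` string and the `speech` helper are made up for illustration, and the substituted word is assumed to contain no markup characters.

```python
import xml.etree.ElementTree as ET

# A document with an embedded DTD that declares one entity.
TEMPLATE = """<!DOCTYPE politician-speak [
  <!ENTITY buzz-word "{word}">
]>
<politician-speak>I vow to lead the fight to stamp out &buzz-word;.</politician-speak>"""

def speech(word):
    """Parse the document with buzz-word defined as word; return the text."""
    return ET.fromstring(TEMPLATE.format(word=word)).text

for word in ["communism", "illegal file sharing"]:
    print(speech(word))
```

The application never sees `&buzz-word;`: by the time the text reaches it, the parser has already replaced the reference with the declared value.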
Limitations of DTDs
- Don't understand namespaces
- Very limited typing (just strings and XML names)
- Very weak referential integrity
  - All ID / IDREF / IDREFS share a single ID space
- Can't express unordered contents conveniently
  - How to specify that a, b, c must all appear, but in any order?
- All element names are global
  - Is <name> for people or companies?
  - Can't declare both in the same DTD

XML Schema
- Designed to improve on DTDs
- Advantages:
  - Integrated with namespaces
  - Many built-in types
  - User-defined types
  - Has local element names
  - Powerful key and referential constraints
- Disadvantages:
  - Unwieldy, much more complex than DTDs
- We won't cover XML Schema in class

What is Next?
- XML Query Languages
  - XPath
  - XQuery
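One way to see why unordered content is inconvenient in a DTD is to generate the content model that would be needed. The sketch below is a hypothetical helper, not part of any XML tool: it builds a model accepting each name exactly once in any order by nesting sequences and either-or choices, so that each choice is decided by the next element name.

```python
def any_order_model(names):
    """Build a DTD content model accepting each name once, in any order."""
    if len(names) == 1:
        return names[0]
    branches = []
    for i, first in enumerate(names):
        rest = names[:i] + names[i + 1:]
        branches.append("(" + first + ", " + any_order_model(rest) + ")")
    return "(" + " | ".join(branches) + ")"

model = any_order_model(["a", "b", "c"])
declaration = "<!ELEMENT e " + model + ">"
print(declaration)

# The size of the model grows factorially with the number of children.
for n in range(2, 7):
    names = ["c" + str(i) for i in range(n)]
    print(n, "children ->", len(any_order_model(names)), "characters")
```

Three children already need six orderings spelled out by hand, and each extra child multiplies the work, which is why this limitation is usually cited as a reason to move beyond DTDs.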