Semistructured data and XML
Institutt for Informatikk
Unstructured, structured and semistructured data
- Unstructured data: e.g., text documents
- Structured data: data with a rigid and fixed data format, e.g., tables in relational databases
- Semistructured data: no predefined schema; the data is self-describing and mixed in with schema information (schemaless, self-describing data), e.g., email, iCal etc.
Unstructured data
- data can be of any type
- not necessarily following any format or sequence
- does not follow any rules
- is not predictable
- examples include: text, video, sound, images
Structured data
- data is organized in semantic chunks (entities)
- similar entities are grouped together (classes)
- entities in the same group have the same descriptions (attributes)
- descriptions for all entities in a group (schema):
  - have the same defined format
  - have a predefined length
  - are all present
  - follow the same order
Semistructured data
- organized in semantic entities
- similar entities are grouped together
- entities in the same group may not have the same attributes
- order of attributes is not necessarily important
- not all attributes may be required
- size of the same attributes in a group may differ
- type of the same attributes in a group may differ
Semistructured data: Why semistructured data?
- Integration of databases: similar data with different schemas
- Information sharing on the Web, e.g., XML, JSON etc.
- Flexible: irregular structure, evolves rapidly
  - add new attributes freely
  - empty values
  - new relationships
  - without needing to change a schema
Semistructured data: Example
    name: Peter Wood
    email: ptw@dcs.bbk.ac.uk, p.wood@bbk.ac.uk

    name:
        first name: Mark
        last name: Levene
    email: mark@dcs.bbk.ac.uk

    name: Alex Poulovassilis
    affiliation: Birkbeck
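As a rough illustration of why such records are called self-describing, the three entries above can be written as plain Python dictionaries. This is only a sketch: the variable `people` and the helper `display_name` are hypothetical names introduced here, not part of the slides.

```python
# Each record carries its own attribute names, so no shared schema is enforced.
people = [
    {"name": "Peter Wood", "email": ["ptw@dcs.bbk.ac.uk", "p.wood@bbk.ac.uk"]},
    {"name": {"first name": "Mark", "last name": "Levene"}, "email": "mark@dcs.bbk.ac.uk"},
    {"name": "Alex Poulovassilis", "affiliation": "Birkbeck"},
]


def attribute_names(records):
    """Return the sorted attribute names used by each record."""
    return [sorted(record.keys()) for record in records]


def display_name(record):
    """Return a printable name whether 'name' is atomic or nested (hypothetical helper)."""
    name = record["name"]
    if isinstance(name, dict):
        return f"{name['first name']} {name['last name']}"
    return name
```

Code that consumes such data has to cope with missing attributes and with the same attribute having different types, which is what `display_name` does for `name`.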
Semistructured data: Representation
- Labelled directed graph; nodes are leaf or interior nodes
- schema information is in the edge labels
- data is stored at the leaves
[Figure: labelled directed graph rooted at StarMovieData. Two Star edges lead to Carrie Fisher (Name; two Address nodes with Street Maple, City Hollywood and Street Locust, City Malibu) and Mark Hamill (Name; Street Oak; City Redwood). A Movie edge leads to a movie node (Title Star Wars, Year 1977). StarsIn and StarOf edges link the stars and the movie.]
Semistructured data: Information integration
- No common schema; legacy-database problem
- Approach: semistructured data with wrappers
[Figure: two databases and two groups of other applications connected through an interface]
Semistructured data: Markup languages
- Allow marking up documents by representing structural, presentational, and semantic information alongside content
- Markup languages play a key role: notably XML
- XML is derived from SGML (Standard Generalized Markup Language)
- SGML is an ISO standard technology for defining markup languages
- HTML is another example of a markup language originally derived from SGML
XML: Extensible Markup Language
- Follows a tag-based notation, similar to HTML
- HTML tags talk about the presentation, while XML tags talk about the meaning
HTML:
    <html>
      <body>
        <i>this is italic</i>
        <p>this is a paragraph.</p>
      </body>
    </html>
XML:
    <note>
      <to>tove</to>
      <from>jani</from>
      <subject />
      <heading>reminder</heading>
      <body>call me!</body>
    </note>
XML: With and without schema
XML can be used in different modes:
- Well-formed XML
  - no predefined schema
  - invent your own tags
  - nesting rules have to be obeyed (syntactically correct), i.e., the document has to be well-formed
- Valid XML
  - involves a schema definition
  - allowable tags and grammar are specified
  - sits between strict-schema and schemaless models
Well-formed XML
- Begins with a declaration of the document type (i.e., XML)
- It has a root element that is the entire body

    <?xml version="1.0" encoding="utf-8" standalone="yes"?>
    <sometag>
      ...
    </sometag>

- encoding: character encoding
- standalone: well-formed or valid
- <sometag>: root element
Well-formed XML: example
    <?xml version="1.0" encoding="utf-8"?>
    <StarMovieData>
      <Star>
        <Name>Carrie Fisher</Name>
        <Address>
          <Street>123 Maple Street</Street>
          <City>Hollywood</City>
        </Address>
      </Star>
      <Movie>
        <Title>Star Wars</Title>
        <Year>1977</Year>
      </Movie>
    </StarMovieData>
[Figure: the same data as a tree. StarMovieData has a Star child (Name Carrie Fisher; Address with Street Maple and City Hollywood) and a Movie child (Title Star Wars; Year 1977).]
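Well-formedness can be checked mechanically by any XML parser. The sketch below uses Python's standard `xml.etree.ElementTree`; the function name `is_well_formed` and the two sample strings are assumptions made for this illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if the parser accepts the text, i.e. tags nest and close properly."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# Properly nested tags under a single root element.
GOOD = "<StarMovieData><Star><Name>Carrie Fisher</Name></Star></StarMovieData>"
# Closing tags in the wrong order: violates the nesting rules.
BAD = "<StarMovieData><Star><Name>Carrie Fisher</Star></Name></StarMovieData>"
```

Note that this only tests syntax. Whether the tags themselves are allowed is a question of validity against a schema, which this parser does not check.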
Well-formed XML: Attributes
- XML elements can have attributes within opening tags
- An alternative way to represent a leaf node
- Attributes can represent labeled arcs

    <Movie year="1977"><Title>Star Wars</Title></Movie>
    <Movie title="Star Wars" year="1977"></Movie>
    <Movie title="Star Wars" year="1977" />
Well-formed XML: Attributes
Attributes can also represent relationships:
    <?xml version="1.0" encoding="utf-8"?>
    <StarMovieData>
      <Star starid="cf" starredin="sw">
        <Name>Carrie Fisher</Name>
        <Address>
          <Street>123 Maple Street</Street>
          <City>Hollywood</City>
        </Address>
      </Star>
      <Movie movieid="sw" starof="cf">
        <Title>Star Wars</Title>
        <Year>1977</Year>
      </Movie>
    </StarMovieData>
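To see how attribute values can act as references, the following sketch follows a star's `starredin` attribute to the movie whose `movieid` matches. It is a minimal illustration with Python's standard library; `movies_by_id` and `titles_for_star` are hypothetical names, and the document is a shortened copy of the slide's example.

```python
import xml.etree.ElementTree as ET

DOC = """<StarMovieData>
  <Star starid="cf" starredin="sw"><Name>Carrie Fisher</Name></Star>
  <Movie movieid="sw" starof="cf"><Title>Star Wars</Title><Year>1977</Year></Movie>
</StarMovieData>"""

root = ET.fromstring(DOC)

# Index the movies by their identifying attribute.
movies_by_id = {movie.get("movieid"): movie for movie in root.findall("Movie")}


def titles_for_star(star):
    """Resolve the space-separated IDs in 'starredin' to movie titles."""
    ids = star.get("starredin", "").split()
    return [movies_by_id[i].findtext("Title") for i in ids if i in movies_by_id]


star = root.find("Star")
```

A plain well-formed document does not enforce that the referenced ID exists; that guarantee only comes with a schema (ID/IDREF in a DTD, or keys in XML Schema).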
Well-formed XML: Namespaces
- Can qualify the tags in the XML document
- Facilitate reuse of vocabularies
- Allow several vocabularies in the same XML document without name conflicts
- A namespace is specified by a URI, typically a URL that refers to a document describing the interpretation of the tags in the namespace
- This document can be an XML document, an informal document (HTML), ... or nothing
Well-formed XML: Namespaces
HTML table:
    <table>
      <tr>
        <td>apples</td>
        <td>bananas</td>
      </tr>
    </table>
A real table:
    <table>
      <name>African Coffee Table</name>
      <width>80</width>
      <length>120</length>
    </table>
Both in one document, distinguished by namespaces:
    <root>
      <h:table xmlns:h="http://www.w3.org/">
        <h:tr>
          <h:td>apples</h:td>
          <h:td>bananas</h:td>
        </h:tr>
      </h:table>
      <f:table xmlns:f="http://www.furniture.com">
        <f:name>African Coffee Table</f:name>
        <f:width>80</f:width>
        <f:length>120</f:length>
      </f:table>
    </root>
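A parser keeps the two `table` vocabularies apart by expanding each prefix to its namespace URI. The sketch below shows this with Python's `xml.etree.ElementTree`; the prefix map `NS` and the variable names are assumptions for this example, while the URIs are the ones used on the slide.

```python
import xml.etree.ElementTree as ET

DOC = """<root>
  <h:table xmlns:h="http://www.w3.org/">
    <h:tr><h:td>apples</h:td><h:td>bananas</h:td></h:tr>
  </h:table>
  <f:table xmlns:f="http://www.furniture.com">
    <f:name>African Coffee Table</f:name>
    <f:width>80</f:width>
    <f:length>120</f:length>
  </f:table>
</root>"""

# Prefixes are local shorthand; only the URIs identify the vocabularies.
NS = {"h": "http://www.w3.org/", "f": "http://www.furniture.com"}

root = ET.fromstring(DOC)
html_cells = [td.text for td in root.findall("h:table/h:tr/h:td", NS)]
furniture_name = root.findtext("f:table/f:name", namespaces=NS)
expanded_tags = [child.tag for child in root]
```

Internally the parser stores each tag as `{namespace-uri}localname`, so the two `table` elements have different names even though their local parts are identical.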
Well-formed XML: XML and Databases
- It is common for computers to share data across the internet by passing messages in the form of XML
- It is increasingly common for XML to be used for data storage, similar to relational databases
- How do we achieve efficient data access with XML?
  - Store XML data in parsed form and navigate it with tools such as SAX (Simple API for XML) and DOM (Document Object Model)
  - Represent documents and their elements as relations and store them in conventional databases
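The two access styles named above differ in how the parsed document is exposed: SAX streams parse events to a handler, while DOM builds the whole tree in memory. The following sketch extracts the same titles both ways using Python's standard `xml.sax` and `xml.dom.minidom`; the class `TitleCollector` and the sample document are assumptions for this illustration.

```python
import xml.sax
from xml.dom import minidom

DOC = "<Movies><Movie><Title>Star Wars</Title><Year>1977</Year></Movie></Movies>"


class TitleCollector(xml.sax.ContentHandler):
    """SAX handler: reacts to start/end events and keeps only Title text."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._in_title = False
        self._buffer = []

    def startElement(self, name, attrs):
        if name == "Title":
            self._in_title = True
            self._buffer = []

    def characters(self, content):
        if self._in_title:
            self._buffer.append(content)

    def endElement(self, name):
        if name == "Title":
            self.titles.append("".join(self._buffer))
            self._in_title = False


# SAX: event stream, nothing is kept unless the handler stores it.
handler = TitleCollector()
xml.sax.parseString(DOC.encode("utf-8"), handler)
sax_titles = handler.titles

# DOM: the full tree is built first and then navigated.
dom = minidom.parseString(DOC)
dom_titles = [node.firstChild.data for node in dom.getElementsByTagName("Title")]
```

SAX tends to suit large documents read once, whereas DOM is more convenient when the document must be navigated repeatedly.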
Well-formed XML: XML and Databases
A possible relational schema for storing XML is:
- DocRoot(docID, rootElementID): relates document IDs to the IDs of their root element
- SubElement(parentID, childID, position): connects an element to each of its immediate subelements
- ElementAttribute(elementID, name, value): relates elements to their attributes
- ElementValue(elementID, value): relates leaf elements to their values
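One way to populate these four relations is to walk the parsed document and emit a tuple per relationship. The sketch below is an assumption-laden illustration, not a prescribed algorithm: the function `shred`, the pre-order numbering of element IDs, and the use of Python lists instead of database tables are all choices made here.

```python
import itertools
import xml.etree.ElementTree as ET


def shred(doc_id, xml_text):
    """Decompose one XML document into tuples for the four relations."""
    next_id = itertools.count(1)  # element IDs assigned in document (pre-)order
    tables = {"DocRoot": [], "SubElement": [], "ElementAttribute": [], "ElementValue": []}

    def visit(elem):
        elem_id = next(next_id)
        for name, value in elem.attrib.items():
            tables["ElementAttribute"].append((elem_id, name, value))
        children = list(elem)
        if not children:
            # Leaf element: store its text value.
            tables["ElementValue"].append((elem_id, (elem.text or "").strip()))
        for position, child in enumerate(children, start=1):
            child_id = visit(child)
            tables["SubElement"].append((elem_id, child_id, position))
        return elem_id

    root_id = visit(ET.fromstring(xml_text))
    tables["DocRoot"].append((doc_id, root_id))
    return tables


DOC = '<Movie movieid="sw"><Title>Star Wars</Title><Year>1977</Year></Movie>'
tables = shred(1, DOC)
```

The `position` column preserves sibling order, which a relation would otherwise lose.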
Valid XML
- Valid: well-formed and follows a particular schema
- A schema is a definition of the syntax of an XML-based language (i.e., it defines a class of XML documents)
- Allows automatically interpreting the meaning or semantics of the elements
- Two prominent alternatives: XML DTD (document type definition) and XML Schema
Valid XML: XML DTD
    <!DOCTYPE StarMovieData [
      <!ELEMENT StarMovieData (Star*, Movie*)>
      <!ELEMENT Star (Name, Address+)>
      <!ATTLIST Star
          starid ID #REQUIRED
          starredin IDREFS #IMPLIED
      >
      <!ELEMENT Name (#PCDATA)>
      <!ELEMENT Address (Street, (City | Zip))>
      <!ELEMENT Street (#PCDATA)>
      <!ELEMENT City (#PCDATA)>
      <!ELEMENT Movie (Title, Year, Genre)>
      <!ATTLIST Movie
          movieid ID #REQUIRED
          starsof IDREFS #IMPLIED
      >
      <!ELEMENT Title (#PCDATA)>
      <!ELEMENT Year (#PCDATA)>
      <!ELEMENT Genre (Comedy | Drama | SciFi)>
    ]>
- ELEMENT: element declaration
- ATTLIST: attribute declarations
- #PCDATA: data should be parsed
- CDATA: data should not be parsed
- #REQUIRED: attribute must be present
- #IMPLIED: attribute is optional
- ID: defines an identifier
- IDREF, IDREFS: references to other elements
- *: element may occur any number of times
- +: element may occur 1 or more times
- ?: element may occur 0 or 1 time
- |: exactly 1 option appears
Valid XML: XML Schema
- It is more powerful than DTD
- provides far more control for the developer over what is legal, and a detailed way to define what the data can and cannot contain
- allows arbitrary restrictions on the number of occurrences of subelements
- allows declaring types such as integer, float, ...
- gives the ability to declare keys and foreign keys
- XML schemas themselves are XML documents
XML Schema
    <?xml version="1.0" encoding="utf-8"?>
    <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
    </xs:schema>
XML Schema: Elements and simple types
    <?xml version="1.0" encoding="utf-8"?>
    <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
      <xs:element name="Title" type="xs:string" />
      <xs:element name="Year" type="xs:integer" />
    </xs:schema>
XML Schema: Complex types
    <?xml version="1.0" encoding="utf-8"?>
    <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
      <xs:complexType name="movieType">
        <xs:sequence>
          <xs:element name="Title" type="xs:string" />
          <xs:element name="Year" type="xs:integer" />
        </xs:sequence>
      </xs:complexType>
      <xs:element name="Movies">
        <xs:complexType>
          <xs:sequence>
            <xs:element name="Movie" type="movieType" minOccurs="0" maxOccurs="unbounded" />
          </xs:sequence>
        </xs:complexType>
      </xs:element>
    </xs:schema>
XML Schema: Example XML document
    <?xml version="1.0" encoding="utf-8"?>
    <Movies xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
            xsi:noNamespaceSchemaLocation="movies.xsd">
      <Movie>
        <Title>Star Wars</Title>
        <Year>1977</Year>
      </Movie>
      <Movie>
        ...
      </Movie>
    </Movies>
XML Schema: Attributes
    <?xml version="1.0" encoding="utf-8"?>
    <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
      <xs:complexType name="movieType">
        <xs:sequence>
          <xs:element name="Title" type="xs:string" />
          <xs:element name="Year" type="xs:integer" />
        </xs:sequence>
        <!-- attribute declarations follow the content model -->
        <xs:attribute name="movieid" type="xs:string" use="required" />
        <xs:attribute name="starof" type="xs:string" />
      </xs:complexType>
      <xs:element name="Movies">
        <xs:complexType>
          <xs:sequence>
            <xs:element name="Movie" type="movieType" minOccurs="0" maxOccurs="unbounded" />
          </xs:sequence>
        </xs:complexType>
      </xs:element>
    </xs:schema>
XML Schema: Example XML document
    <?xml version="1.0" encoding="utf-8"?>
    <Movies xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
            xsi:noNamespaceSchemaLocation="movies.xsd">
      <Movie movieid="sw">
        <Title>Star Wars</Title>
        <Year>1977</Year>
      </Movie>
      <Movie movieid="rj">
        ...
      </Movie>
    </Movies>
XML Schema: Restricted simple types
Restrict numerical values with minInclusive and maxInclusive:
    <xs:simpleType name="movieYearType">
      <xs:restriction base="xs:integer">
        <xs:minInclusive value="1915" />
      </xs:restriction>
    </xs:simpleType>
Restrict values to an enumerated type:
    <xs:simpleType name="genreType">
      <xs:restriction base="xs:string">
        <xs:enumeration value="comedy" />
        <xs:enumeration value="drama" />
        <xs:enumeration value="scifi" />
      </xs:restriction>
    </xs:simpleType>
XML Schema: Keys
    <?xml version="1.0" encoding="utf-8"?>
    <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
      ...
      <xs:element name="Movies">
        <xs:complexType>
          <xs:sequence>
            <xs:element name="Movie" type="movieType" minOccurs="0" maxOccurs="unbounded" />
          </xs:sequence>
        </xs:complexType>
        <xs:key name="movieKey">
          <xs:selector xpath="Movie" />
          <xs:field xpath="Title" />
          <xs:field xpath="Year" />
        </xs:key>
      </xs:element>
    </xs:schema>
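The key declaration says that, among the Movie elements selected under Movies, the combination of Title and Year must be present and unique. Python's standard library has no XML Schema validator, so the sketch below only imitates that one rule by hand; `key_violations` and the sample document are assumptions for this illustration.

```python
import xml.etree.ElementTree as ET


def key_violations(movies_root):
    """Return (Title, Year) pairs that are incomplete or occur more than once."""
    seen = set()
    violations = []
    for movie in movies_root.findall("Movie"):  # plays the role of the selector
        key = (movie.findtext("Title"), movie.findtext("Year"))  # the two fields
        if None in key or key in seen:
            violations.append(key)
        seen.add(key)
    return violations


DOC = """<Movies>
  <Movie><Title>Star Wars</Title><Year>1977</Year></Movie>
  <Movie><Title>King Kong</Title><Year>1933</Year></Movie>
  <Movie><Title>King Kong</Title><Year>2005</Year></Movie>
  <Movie><Title>Star Wars</Title><Year>1977</Year></Movie>
</Movies>"""

violations = key_violations(ET.fromstring(DOC))
```

The two King Kong entries do not clash because the key is the pair of fields, not the title alone.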
XML Schema: Foreign keys
    <xs:element name="Stars">
      <xs:complexType>
        <xs:sequence>
          <xs:element name="StarredIn" minOccurs="0" maxOccurs="unbounded">
            <xs:complexType>
              <xs:sequence>
                <xs:element name="Title" type="xs:string" />
                <xs:element name="Year" type="xs:integer" />
              </xs:sequence>
            </xs:complexType>
          </xs:element>
        </xs:sequence>
      </xs:complexType>
      <xs:keyref name="movieRef" refer="movieKey">
        <xs:selector xpath="Star/StarredIn" />
        <xs:field xpath="Title" />
        <xs:field xpath="Year" />
      </xs:keyref>
    </xs:element>
XML Programming Languages
- XPath uses path expressions to navigate in XML documents
- XQuery is the language for querying XML data and is built on XPath expressions (like SQL for DBs)
- XSLT transforms an XML document into another XML document
XPath
- XPath expressions generally return a sequence of items that satisfy certain patterns
- A sequence of elements can be specified using an absolute or relative path
  - /Movies : the root element and all its content
  - /Movies/Movie : all Movie elements inside (direct children of) the Movies element
  - /Movies//Title : all Title elements inside (at any level of) the Movies element
  - * : any element
  - /Movies/Movie[Year="1980"] : all Movie elements with Year value 1980
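These path expressions can be tried out with the limited XPath subset in Python's `xml.etree.ElementTree`. In this sketch the paths are written relative to the parsed root element (Movies), so `/Movies/Movie` becomes `Movie`. The sample document and variable names are assumptions for this illustration.

```python
import xml.etree.ElementTree as ET

DOC = """<Movies>
  <Movie><Title>Star Wars</Title><Year>1977</Year></Movie>
  <Movie><Title>The Empire Strikes Back</Title><Year>1980</Year></Movie>
</Movies>"""

root = ET.fromstring(DOC)  # root is the Movies element

all_movies = root.findall("Movie")                       # /Movies/Movie
all_titles = [t.text for t in root.findall(".//Title")]  # /Movies//Title
any_children = [c.tag for c in root.findall("*")]        # /Movies/*
movies_1980 = root.findall("Movie[Year='1980']")         # /Movies/Movie[Year="1980"]
titles_1980 = [m.findtext("Title") for m in movies_1980]
```

ElementTree implements only part of XPath 1.0; a full XPath processor would accept the absolute paths exactly as written on the slide.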
XQuery
- Allows specification of more complex queries on one or more documents
- The typical form of an XQuery query is known as a FLWR expression:

    FOR <variable bindings to individual nodes>
    LET <variable bindings to collections of nodes>
    WHERE <qualifier conditions>
    RETURN <query result specification>
XQuery: Example XML document
    <?xml version="1.0" encoding="utf-8"?>
    <Movies>
      <Movie genre="comedy">
        <Title>Bruce Almighty</Title>
        <Star><Name>Jim Carrey</Name></Star>
      </Movie>
      <Movie genre="comedy">
        <Title>Dumb &amp; Dumber</Title>
        <Star><Name>Jim Carrey</Name></Star>
      </Movie>
      <Movie genre="drama">
        <Title>The Truman Show</Title>
        <Star><Name>Jim Carrey</Name></Star>
      </Movie>
      <Movie genre="comedy">
        <Title>Nine Months</Title>
        <Star><Name>Hugh Grant</Name></Star>
      </Movie>
    </Movies>
XQuery: Example XQuery
Find all comedy movies in which Jim Carrey is an actor:
    let $movies := doc("movies.xml")
    for $movie in $movies//Movie[@genre="comedy"]
    where $movie/Star[Name="Jim Carrey"]
    return $movie/Title
Find the cities in which stars are mentioned:
    let $movies := doc("movies.xml")
    let $stars := doc("stars.xml")
    for $s1 in $movies/Movies/Movie/Version/Star,
        $s2 in $stars/Stars/Star
    where data($s1) = data($s2/Name)
    return $s2/Address/City
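For readers without an XQuery processor at hand, the first query can be mimicked in Python to make the roles of the for, where and return clauses concrete. This is a sketch only: it runs against an inline copy of the movie document from the previous slide rather than a `movies.xml` file, and the variable names are assumptions.

```python
import xml.etree.ElementTree as ET

DOC = """<Movies>
  <Movie genre="comedy"><Title>Bruce Almighty</Title><Star><Name>Jim Carrey</Name></Star></Movie>
  <Movie genre="comedy"><Title>Dumb &amp; Dumber</Title><Star><Name>Jim Carrey</Name></Star></Movie>
  <Movie genre="drama"><Title>The Truman Show</Title><Star><Name>Jim Carrey</Name></Star></Movie>
  <Movie genre="comedy"><Title>Nine Months</Title><Star><Name>Hugh Grant</Name></Star></Movie>
</Movies>"""

movies = ET.fromstring(DOC)  # let $movies := doc(...)

carrey_comedies = [
    movie.findtext("Title")                                  # return $movie/Title
    for movie in movies.iter("Movie")                        # for $movie in $movies//Movie
    if movie.get("genre") == "comedy"                        # [@genre="comedy"]
    and movie.find("Star[Name='Jim Carrey']") is not None    # where $movie/Star[Name="Jim Carrey"]
]
```

Unlike the XQuery version, this returns the title strings rather than the Title elements.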
XQuery: Other features
- Eliminating duplicates
    let $s := distinct-values(...)
- Quantifiers
    every $s in ... satisfies ...
    some $s in ... satisfies ...
- Aggregation (count, sum, max, ...)
- Branching
    if (...) then ... else ...
XSLT: Extensible Stylesheet Language for Transformations
- original purpose is to transform XML documents to other document forms (XML, HTML etc.)
- in practice it is another query language
- uses XPath for navigating in XML documents
XSLT: Example
XML document (input):
    <?xml version="1.0" encoding="utf-8"?>
    <Movies>
      <Movie genre="comedy">
        <Title>Bruce Almighty</Title>
        <Star><Name>Jim Carrey</Name></Star>
      </Movie>
      ...
XSLT stylesheet:
    <?xml version="1.0" encoding="utf-8"?>
    <xsl:stylesheet xmlns:xsl="http:...XSL/Transform" version="1.0">
      <xsl:output method="xml" indent="yes" />
      <xsl:template match="/Movies">
        <ComedyMovies>
          <xsl:apply-templates />
        </ComedyMovies>
      ...
The XSLT processor takes the XML document and the stylesheet and produces a new XML document:
    <?xml version="1.0" encoding="utf-8"?>
    <ComedyMovies>
      <Comedy title="Bruce Almighty" />
      <Comedy title="Dumb &amp; Dumber" />
      <Comedy title="Nine Months" />
    </ComedyMovies>
XSLT: Example
    <?xml version="1.0" encoding="utf-8"?>
    <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
      <xsl:output method="xml" indent="yes" />
      <xsl:template match="/Movies">
        <ComedyMovies>
          <xsl:apply-templates />
        </ComedyMovies>
      </xsl:template>
      <xsl:template match="Movie[@genre='comedy']">
        <xsl:apply-templates select="Title" />
      </xsl:template>
      <!-- movies of other genres produce no output -->
      <xsl:template match="Movie" />
      <xsl:template match="Title">
        <Comedy title="{.}" />
      </xsl:template>
    </xsl:stylesheet>
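Python's standard library does not include an XSLT processor, so the sketch below does not run the stylesheet. It rebuilds the intended transformation by hand (one Comedy element per comedy movie, with the title as an attribute) to show what the templates are meant to produce. The function name `comedy_movies` and the inline input document are assumptions for this illustration.

```python
import xml.etree.ElementTree as ET

DOC = """<Movies>
  <Movie genre="comedy"><Title>Bruce Almighty</Title><Star><Name>Jim Carrey</Name></Star></Movie>
  <Movie genre="comedy"><Title>Dumb &amp; Dumber</Title><Star><Name>Jim Carrey</Name></Star></Movie>
  <Movie genre="drama"><Title>The Truman Show</Title><Star><Name>Jim Carrey</Name></Star></Movie>
  <Movie genre="comedy"><Title>Nine Months</Title><Star><Name>Hugh Grant</Name></Star></Movie>
</Movies>"""


def comedy_movies(movies_root):
    """Build a ComedyMovies element with one Comedy child per comedy movie."""
    result = ET.Element("ComedyMovies")          # template for /Movies
    for movie in movies_root.findall("Movie"):
        if movie.get("genre") == "comedy":       # template for Movie[@genre='comedy']
            ET.SubElement(result, "Comedy", title=movie.findtext("Title"))
    return result


output = comedy_movies(ET.fromstring(DOC))
output_titles = [comedy.get("title") for comedy in output]
serialized = ET.tostring(output, encoding="unicode")
```

The contrast with XSLT is that the stylesheet is declarative: each template states what to emit for a matching node, and the processor decides the traversal.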
Some online resources
- XML: http://www.w3schools.com/xml/
- XPath: www.w3schools.com/xpath/
- XPath tester: http://www.xpathtester.com/test
- XQuery: www.w3schools.com/xquery/
- XQuery tester: http://www.zorba-xquery.com/html/demo
- XSLT: www.w3schools.com/xsl/
- XSLT tester: http://www.w3.org/2005/08/online_xslt/