XML and Data Integration
Week 11-12
MIE253-Consens

Schedule

Week  Date    Lecture Topic
1     Jan 9   Introduction to Data Management
2     Jan 16  The Relational Model
3     Jan 23  Constraints and SQL DDL
4     Jan 30  SQL DML, DB Applications, JDBC
5     Feb 6   JDBC, DDL (Views, Access Control)
6     Feb 13  Relational Algebra, Advanced SQL
-     Feb 20  [Reading Week]
7     Feb 27  Review and Midterm (Mar 1)
8     Mar 5   OLAP
9     Mar 12  ER Conceptual Modelling
10    Mar 19  Normalization
11    Mar 26  XML and Data Integration
12    Apr 2   Transactions and the Internet, Query Processing
13    Apr 9   Final Review

This week's reading: Chapter 15

Semistructured Data

A typical piece of data on the Web:

    <dt>name: John Doe
      <dd>student Id: 111111111
      <dd>address:
        <ul>
          <li>number: 123
          <li>street: Main
        </ul>
    </dt>
    <dt>name: Joe Public
      <dd>student Id: 222222222
    </dt>

Semistructured Data (contd.)

To make the previous student list suitable for machine consumption on the Web, it should have these characteristics:

- Be object-like
- Be schemaless (it isn't guaranteed to conform exactly to any schema, but different objects have some commonality among themselves)
- Be self-describing (some schema-like information, such as attribute names, is part of the data itself)

Why XML?

- XML is a standard format for data exchange
- Plenty of industry-specific standards
    - Take a look at http://xml.coverpages.org
- Extensive software support
    - All major relational database products have been retrofitted with facilities to store and construct XML documents
    - Web browser and operating system support

Health Data Exchange

- The HL7 Patient Record Architecture is a framework for the exchange of clinical documents

Sample HL7 Exam Report

    <LevelOne>
      <header>...</header>
      <body>
        <section>
          <section.title>admitting PHYSICAL EXAMINATION</section.title>
          <section>
            <section.title>general</section.title>
            <paragraph>the blood pressure is 170/88, pulse 80 and regular, and
              <healthcare.code identifier="9279-1" preferred.name="respiratory RATE"
                  name.of.coding.system="ln" local.coding.system="N">
                respirations
              </healthcare.code> 18. She weighs 240 pounds.
            </paragraph>
          </section>
          <section>
            <section.title>heent</section.title>
            <paragraph>examination of the head is normocephalic. The patient has bilateral
              <healthcare.code identifier="f-f5480" preferred.name="carotid bruit"
                  name.of.coding.system="sn3" local.coding.system="N">
                carotid bruit
              </healthcare.code>. There is no jugular venous distention or lymphadenopathy.
            </paragraph>
          </section>
        </section>
      </body>
    </LevelOne>

Sample HL7 Header

    <header>
      <document>...</document>
      <event>
        <event.id><id.value>1009</id.value></event.id>
        <event.date>19990212</event.date>
      </event>
      <patient>
        <patient.id><id.value>p001</id.value></patient.id>
        <patient.name>
          <family.name>lantry</family.name>
          <given.name>connie</given.name>
        </patient.name>
        <patient.date.of.birth>19630613</patient.date.of.birth>
        <patient.sex value="female"/>
      </patient>
      <practitioner>
        <practitioner.id><id.value>24680</id.value></practitioner.id>
        <practitioner.role>
          <text>attending PHYSICIAN</text>
          <name.of.coding.system>hl70133</name.of.coding.system>
        </practitioner.role>
      </practitioner>
    </header>

Sample HL7 Document Origin

    <document>
      <document.creation.date>19990212</document.creation.date>
      <document.id>
        <id.value>1009</id.value>
      </document.id>
      <document.originating.system>
        <id.value>systemx</id.value>
        <organization.name>global Healthcare, INC</organization.name>
      </document.originating.system>
      <document.originator.id>
        <id.value>24680</id.value>
        <family.name>levin</family.name>
        <given.name>henry</given.name>
        <suffix>the 7th</suffix>
        <degree>md</degree>
      </document.originator.id>
      <document.state value="original"/>
      <document.type>
        <identifier>11492-6</identifier>
        <text>history AND PHYSICAL</text>
        <name.of.coding.system>ln</name.of.coding.system>
      </document.type>
    </document>

Summary: XML

- XML and semi-structured data
    - Schema-less
    - Self-describing
- XML for data exchange

Additional Material: XML

- Well-formed XML
- Valid XML (DTD, XML Schema)
- XPath basics
- Further material and applications in MIE354H1F Business Process Engineering

Example XML Document

    <?xml version="1.0"?>
    <PersonList Type="Student" Date="2002-02-02">
      <Title Value="Student List"/>
      <Person> </Person>
      <Person> </Person>
    </PersonList>

- <?xml version="1.0"?> is the declaration
- PersonList, Title and Person are element (or tag) names
- Type, Date and Value are attributes
- Title is an empty element
- PersonList is the root element
- Elements are nested
- The root element contains all others

More Terminology

    <Person Name="John" Id="111111111">
      John is a nice fellow
      <Address>
        <Number>21</Number>
        <Street>Main St.</Street>
      </Address>
    </Person>

- <Person Name="John" Id="111111111"> is the opening tag
- </Person> is the closing tag: what is open must be closed
- Everything between the opening and closing tags is the content of Person
- "John is a nice fellow" is standalone text, not useful as data
- Address is a nested element, a child of Person
- Number is a child of Address and a descendant of Person
- Person is the parent of Address and an ancestor of Number

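The terminology above can be explored programmatically. The following is a minimal sketch, assuming Python's standard xml.etree.ElementTree module (not part of the original slides); the variable names are illustrative only.

```python
import xml.etree.ElementTree as ET

# The Person element from the slide, with attribute quotes written out.
doc = """<Person Name="John" Id="111111111">
  John is a nice fellow
  <Address>
    <Number>21</Number>
    <Street>Main St.</Street>
  </Address>
</Person>"""

person = ET.fromstring(doc)

# Attributes of the opening tag are exposed as a dictionary.
attributes = dict(person.attrib)

# Standalone text that appears before the first child element.
standalone_text = person.text.strip()

# Children are the directly nested elements.
children = [child.tag for child in person]

# Descendants include nested elements at every depth (iter() also yields
# the element itself, so it is filtered out here).
descendants = [elem.tag for elem in person.iter() if elem is not person]

print(attributes)
print(standalone_text)
print(children)
print(descendants)
```

Address is the only child of Person, while Number and Street show up only among its descendants.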
Well-formed XML Documents

- Must have a root element
- Every opening tag must have a matching closing tag
- Elements must be properly nested
    - <a><b></a></b> is not well-formed
    - <a><b></b></a> is well-formed; <a></a><b></b> is properly nested, but needs an enclosing root element to form a well-formed document
- An attribute name can occur at most once in an opening tag. If it occurs:
    - It must have a value
    - The value must be quoted (with ' or ")
- XML processors are not supposed to try to fix ill-formed documents (unlike HTML browsers)

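A quick way to see these rules in action is to hand small fragments to a parser and observe which ones it rejects. This is a sketch assuming Python's standard xml.etree.ElementTree module; the helper name is_well_formed and the sample strings are hypothetical, chosen to mirror the rules on this slide.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if text parses as a well-formed XML document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        # The parser reports the problem instead of trying to repair it.
        return False


samples = [
    "<a><b></b></a>",          # properly nested, single root
    "<r><a></a><b></b></r>",   # siblings inside one root element
    "<a><b></a></b>",          # improperly nested
    "<a></a><b></b>",          # no single root element
    '<a x="1" x="2"/>',        # attribute name occurs twice
    "<a x=1/>",                # attribute value not quoted
]

for sample in samples:
    print(sample, "->", is_well_formed(sample))
```

Only the first two samples parse; the parser rejects the rest rather than guessing at a repair, which is the behaviour the last bullet describes.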
XML Document Tree

[Figure: tree representation of an XML document]

Valid XML Documents

- Two mechanisms to describe the schema of an XML document:
    - DTD (Document Type Definition)
    - XML Schema
- A document that satisfies the constraints in an XML DTD/Schema is valid
- XML documents must always be well-formed; validity is an additional property
- Multiple schema languages for XML exist for historical reasons; tools translate among them and from conceptual models (ER, UML)

DTD Elements and Attributes

    <!DOCTYPE Report [
      <!ELEMENT Report (Students, Classes, Courses)>
      <!ELEMENT Students (Student*)>
      <!ELEMENT Classes (Class*)>
      <!ELEMENT Courses (Course*)>
      <!ELEMENT Student (Name, Status, CrsTaken*)>
      <!ELEMENT Name (First,Last)>
      <!ELEMENT First (#PCDATA)>
      <!ELEMENT CrsTaken EMPTY>
      <!ELEMENT Class (CrsCode,Semester,ClassRoster)>
      <!ELEMENT Course (CrsName)>
      <!ATTLIST Report Date CDATA #IMPLIED>
      <!ATTLIST Student StudId ID #REQUIRED>
      <!ATTLIST Course CrsCode ID #REQUIRED>
      <!ATTLIST CrsTaken CrsCode IDREF #REQUIRED>
      <!ATTLIST ClassRoster Members IDREFS #IMPLIED>
    ]>

- #PCDATA: text
- *: zero or more
- EMPTY: empty element
- CrsCode: same attribute in different elements (Course and CrsTaken)

XPath Document Tree

[Figure: XPath document tree]

Document Corresponding to the Tree

    <?xml version="1.0"?>
    <!-- Some comment -->
    <Students>
      <Student StudId="111111111">
        <Name><First>John</First><Last>Doe</Last></Name>
        <Status>U2</Status>
        <CrsTaken CrsCode="CS308" Semester="F1997"/>
        <CrsTaken CrsCode="MAT123" Semester="F1997"/>
      </Student>
      <Student StudId="987654321">
        <Name><First>Bart</First><Last>Simpson</Last></Name>
        <Status>U4</Status>
        <CrsTaken CrsCode="CS308" Semester="F1994"/>
      </Student>
    </Students>
    <!-- Some other comment -->

XML Query Languages

- XPath: core query language used in XML Schema, XSLT, XQuery, and many other XML standards
- XSLT: a functional-style document transformation language. Very powerful, very complicated
- XQuery: a W3C standard (Recommendation since 2007). Very powerful, fairly intuitive, SQL-style
- Also SQL extensions supporting XML

XPath Basics

- The expression / returns the root node
- /Students/Student returns all Student elements that are children of Students elements, which in turn must be children of the root
- /Student returns the empty set (the root has no Student child)
- //Student returns all Student elements below the root
- Students who have taken CS532: //Student[CrsTaken/@CrsCode="CS532"]
- Last course taken by the first student in the list: /Students/Student[1]/CrsTaken[last()]

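These expressions can be tried against the student document from the earlier slide. The sketch below assumes Python's standard xml.etree.ElementTree module, which implements only a subset of XPath and evaluates paths relative to an element rather than the document root. The paths are therefore written relative to the Students element, and the nested attribute predicate is expressed as a plain Python filter. The helper name took is hypothetical.

```python
import xml.etree.ElementTree as ET

DOC = """<Students>
  <Student StudId="111111111">
    <Name><First>John</First><Last>Doe</Last></Name>
    <Status>U2</Status>
    <CrsTaken CrsCode="CS308" Semester="F1997"/>
    <CrsTaken CrsCode="MAT123" Semester="F1997"/>
  </Student>
  <Student StudId="987654321">
    <Name><First>Bart</First><Last>Simpson</Last></Name>
    <Status>U4</Status>
    <CrsTaken CrsCode="CS308" Semester="F1994"/>
  </Student>
</Students>"""

students = ET.fromstring(DOC)  # the Students element

# Counterpart of /Students/Student: Student children of Students.
all_students = students.findall("./Student")

# Counterpart of //Student: Student elements at any depth below Students.
below_root = students.findall(".//Student")


def took(course):
    """Counterpart of //Student[CrsTaken/@CrsCode="..."], as a Python filter."""
    return [
        s.get("StudId")
        for s in students.findall(".//Student")
        if any(c.get("CrsCode") == course for c in s.findall("CrsTaken"))
    ]


# Counterpart of /Students/Student[1]/CrsTaken[last()].
last_course = students.find("./Student[1]/CrsTaken[last()]")

print([s.get("StudId") for s in all_students])
print(took("CS308"), took("CS532"))
print(last_course.get("CrsCode"), last_course.get("Semester"))
```

Nobody in this small document has taken CS532, so that query returns an empty list, while CS308 matches both students.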
XPath Semantics

- locationstep1/locationstep2/... means:
    - Find all nodes specified by locationstep1
    - For each such node N, find all nodes specified by locationstep2 using N as the current node
    - Take the union of the results
    - For each node returned by locationstep2, do the same with the next location step
- locationstep = axis::node[predicate]
    - Find all nodes specified by axis::node
    - Select only those that satisfy predicate

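The evaluation procedure above can be sketched as a small interpreter. This is an illustrative simplification, not a full XPath implementation. It assumes Python's standard xml.etree.ElementTree module, supports only the child and descendant axes, and takes each predicate as a Python function of the node, so positional predicates such as [1] are not covered. The names step, evaluate and doc_node are hypothetical; doc_node is a synthetic stand-in for the XPath root node.

```python
import xml.etree.ElementTree as ET


def step(node, axis, name, predicate=None):
    """Apply one location step axis::name[predicate] with node as current node."""
    if axis == "child":
        candidates = list(node)
    elif axis == "descendant":
        candidates = [e for e in node.iter() if e is not node]
    else:
        raise ValueError(f"unsupported axis: {axis}")
    # Find all nodes specified by axis::name ...
    selected = [e for e in candidates if e.tag == name]
    # ... then keep only those that satisfy the predicate.
    if predicate is not None:
        selected = [e for e in selected if predicate(e)]
    return selected


def evaluate(start_nodes, steps):
    """Evaluate step1/step2/... starting from the given current nodes."""
    current = list(start_nodes)
    for axis, name, predicate in steps:
        result = []
        for n in current:  # for each node N returned by the previous step
            for m in step(n, axis, name, predicate):
                if m not in result:  # take the union, without duplicates
                    result.append(m)
        current = result
    return current


students = ET.fromstring(
    '<Students>'
    '<Student StudId="111111111"><CrsTaken CrsCode="CS308"/>'
    '<CrsTaken CrsCode="MAT123"/></Student>'
    '<Student StudId="987654321"><CrsTaken CrsCode="CS308"/></Student>'
    '</Students>'
)
# Synthetic root node that has the Students element as its only child.
doc_node = ET.Element("document-root")
doc_node.append(students)


def took_mat123(s):
    return any(c.get("CrsCode") == "MAT123" for c in s.findall("CrsTaken"))


# /Students/Student
path_a = evaluate([doc_node], [("child", "Students", None), ("child", "Student", None)])
# /Student
path_b = evaluate([doc_node], [("child", "Student", None)])
# //Student[CrsTaken/@CrsCode="MAT123"], treating // as the descendant axis
path_c = evaluate([doc_node], [("descendant", "Student", took_mat123)])

print([s.get("StudId") for s in path_a])
print(path_b)
print([s.get("StudId") for s in path_c])
```

Starting from the synthetic root node makes it visible why /Student is empty: the only child of the root is Students.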