Structured vs. unstructured data. Motivation for self-describing data. Enter semistructured data. Databases are highly structured




Semistructured data, XML, DTDs
Introduction to databases, CSCC43 Winter 2012
Ryan Johnson
Thanks to Manos Papagelis, John Mylopoulos, Arnold Rosenbloom, and Renee Miller for material in these slides.

Structured vs. unstructured data
  Databases are highly structured
    Well-known data format: relations and tuples
    Every tuple conforms to a known schema
    Data independence? Woe unto you if you lose the schema
  Plain text is unstructured
    Cannot assume any predefined format
    Apparent organization makes no guarantees
    Self-describing: little external knowledge needed... but have to infer what the data means
  Irony: database cannot stand alone

Motivation for self-describing data
  Consider a C struct:
    struct {
        int id;
        int type;
        char name[8];
        struct { double x; double y; } location;
    } shape;
  Data at code level: {1, 101, "square", {1.5, 5.0}}
  Data at byte level:
    0x0000000100000065 0x7371756172650000
    0x3FF8000000000000 0x4014000000000000
  Variable-length fields? Pointers? Endianness?
  *Very* easy to embed [parts of] schema in logic

Enter semistructured data
  Observation: most data has some structure
    Text: sentences, paragraphs, sections, ...
    Books: chapters
    Web pages: HTML
  Idea of semistructured data:
    Enforce well-formatted data
    => Always know how to read/parse/manipulate it
    Optionally, enforce well-structured data also
    => Might help us interpret the data, too
  Pro: highly portable
  Con: verbose/redundant
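The byte-level view of the struct can be reproduced with a short Python sketch. It assumes a big-endian layout with no padding, which is what the hex dump on the slide implies; the format string and variable names are illustrative and not part of the original slides.

```python
import struct

# Assumed layout: big-endian, no padding -> int, int, char[8], double, double
LAYOUT = ">ii8sdd"

packed = struct.pack(LAYOUT, 1, 101, b"square", 1.5, 5.0)

# Split the 32 bytes into 8-byte words, as on the slide
words = [packed[i:i + 8].hex().upper() for i in range(0, len(packed), 8)]
print(words)

# Without the layout (the "schema") the bytes are easy to misread.
# Here the same bytes are decoded as if they were little-endian:
wrong = struct.unpack("<ii8sdd", packed)
print(wrong[0])
```

The bytes carry no field names, types, or byte order, so a reader who lacks the layout string cannot recover the record. That is the gap self-describing formats try to close.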

Why not use... HTML?
    <dl>
      <dt style="color:red">id
      <dd>1
      <dt>type</dt>
      <dd>101</dd>
      <dt>name
      <dd>square
      <dt>location
      <dd><dl>
        <dt>x
        <dd>1.5
        <dt>y</dt>
        <dd>5</dd>
      </dl>
  Pro: popular
  Con: inconsistent, buggy
    Closing tags often missing
    div, table, ul instead of dl?
    Parsing is *hard*
  Con: data+presentation
    Describes presentation and structure, but not content
    More like a query result
    Fixed meaning for all tags

Why not use... JSON? (JavaScript Object Notation)
    {
      "id" : 1,
      "type" : 101,
      "name" : "square",
      "location" : { "x" : 1.5, "y" : 5 }
    }
  Pros:
    simple/intuitive
    portable
  Cons:
    No support for any kind of metadata
    Underspecified (e.g. can't constrain types)
    Data processing tools missing/immature
  Growing popularity due to its simplicity

XML

XML: designed for data interchange
    <books search-terms="database+design">
      <book>
        <title>Database Design for Mere Mortals</title>
        <author>Michael J. Hernandez</author>
        <date>13/03/2003</date>
      </book>
      <book id="B2">
        <title>Beginning Database Design</title>
        <subtitle>From Novice to Professional</subtitle>
        <author>Clare Churcher</author>
      </book>
    </books>
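As a rough illustration of what "self-describing" buys, the sketch below loads the books example with Python's standard xml.etree.ElementTree module and turns each book into a dictionary. The attribute name search-terms and the capitalisation of the titles are reconstructions of the slide text, so treat them as assumptions.

```python
import xml.etree.ElementTree as ET

BOOKS = """
<books search-terms="database+design">
  <book>
    <title>Database Design for Mere Mortals</title>
    <author>Michael J. Hernandez</author>
    <date>13/03/2003</date>
  </book>
  <book id="B2">
    <title>Beginning Database Design</title>
    <subtitle>From Novice to Professional</subtitle>
    <author>Clare Churcher</author>
  </book>
</books>
"""

root = ET.fromstring(BOOKS)

# Field names travel with the data, and the two records
# do not have to share the same set of fields.
records = [{child.tag: child.text for child in book} for book in root]
fields = [sorted(record) for record in records]
print(root.get("search-terms"), fields)
```

No schema was supplied to the parser, yet the program can still tell which value is a title and which is an author. A relational tuple would need its schema for that.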

Features of XML
  Descendant of SGML (as is HTML)
  Intentionally similar syntax to HTML
    Tree-structured (hierarchical) format
    Elements surrounded by opening and closing tags
    Attributes embedded in opening tags
    => <tag-name attr-name="attr-value">data</tag-name>
  But with important differences
    Strictly well-formed (must close all tags, etc.)
    Tag/attribute names carry no semantic meaning
    Data-only format: no implied presentation

XML terminology
    <?xml version="1.0"?>
    <PersonList Type="Student" Date="2002-02-02">
      <Title Value="Student List"/>
      <Person> </Person>
      <Person> </Person>
    </PersonList>
  Root element: PersonList; the root element contains all others
  Elements are nested
  Element (or tag) names: PersonList, Title, Person
  Attributes: Type, Date, Value
  Empty element: <Title Value="Student List"/>

XML terminology (cont.)
    <Person Name="John" Id="s111111111">
      John is a nice fellow
      <Address>
        <Number>21</Number>
        <Street>Main St.</Street>
      </Address>
    </Person>
  Opening tag: <Person ...>
  Closing tag: </Person> (what is open must be closed)
  Content of Person: everything between its opening and closing tags
  "John is a nice fellow": standalone text, not very useful as data, non-uniform
  Address: nested element, child of Person
  Number: child of Address, descendant of Person
  Person: parent of Address, ancestor of Number

Example XML Document
    <?xml version="1.0"?>
    <!-- Some comment -->
    <Students>
      <Student StudId="111111111">
        <Name><First>John</First><Last>Doe</Last></Name>
        <Status>U2</Status>
        <CrsTaken CrsCode="CS308" Semester="F1997"/>
        <CrsTaken CrsCode="MAT123" Semester="F1997"/>
      </Student>
      <Student StudId="987654321">
        <Name><First>Bart</First><Last>Simpson</Last></Name>
        <Status>U4</Status>
        <CrsTaken CrsCode="CS308" Semester="F1994"/>
      </Student>
    </Students>
    <!-- Some other comment -->
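The parent/child/descendant vocabulary maps directly onto a parsed tree. The following is a minimal sketch using Python's standard xml.etree.ElementTree on the Person example; the variable names are only illustrative.

```python
import xml.etree.ElementTree as ET

PERSON = """
<Person Name="John" Id="s111111111">
  John is a nice fellow
  <Address>
    <Number>21</Number>
    <Street>Main St.</Street>
  </Address>
</Person>
"""

person = ET.fromstring(PERSON)

children = [child.tag for child in person]           # direct children only
descendants = [el.tag for el in person.iter()][1:]   # iter() yields person itself first
attributes = dict(person.attrib)                     # attributes of the opening tag
standalone_text = person.text.strip()                # text before the first child

print(children, descendants, attributes, standalone_text)
```

Address is the only child of Person, while Number and Street are reached only by descending further, which matches the terminology on the slide.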

XML Document is a Tree

Two kinds of XML Documents
  Well-formed XML
    Just need to use proper nesting
    Can invent your own tags
    Any tag can go anywhere
  Validated XML
    Can invent tags, but have to declare them and specify where they can go
    A DTD (document type definition) specifies these rules

Rules for well-formed XML
  Must have a root element
  Every opening tag must have a matching closing tag
  Elements must be properly nested
    <foo><bar></foo></bar> is a no-no
  An attribute name can occur at most once in an opening tag. If it occurs:
    It must have an explicitly specified value (Boolean attrs, like in HTML, are not allowed)
    The value must be quoted (with " or ')
  Parsers are not allowed to tolerate ill-formed XML

Valid names in XML
  Simple rules for element/attribute names
    may include letters (case sensitive!)
    may include (but not start with) digits, "-" and "."
    no reserved words or keywords
  But lots of gotchas
    Names must not start with xml (case insensitive)
    Non-ASCII/Latin letters: legal, but not all parsers support them
    Punctuation is iffy business (one exception: -)
      Entity characters always forbidden: < > &
      Spec recommends _ instead of - (real life: the opposite is true)
      : is reserved for namespaces (not enforced)
      . officially discouraged (real life: very rare)
      $ is not a legal name character; XML processors (XQuery, etc.) often use it for parameter substitution
      Other punctuation (@ # % ...) is not legal in names
    Upper-case letters legal but fairly rare
      All caps very rare (just like rest of Internet)
      Often see book-list instead of camel case BookList
  Rule of thumb: lower case and - usually best
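The well-formedness rules can be tried out against a real parser. This sketch uses Python's standard xml.etree.ElementTree, which rejects ill-formed input with a ParseError; the helper name is_well_formed is made up for the example.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if the parser accepts text as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

print(is_well_formed("<foo><bar></bar></foo>"))   # properly nested
print(is_well_formed("<foo><bar></foo></bar>"))   # improperly nested
print(is_well_formed("<a checked/>"))             # attribute without a value
print(is_well_formed("<a b=1/>"))                 # unquoted attribute value
print(is_well_formed('<a b="1" b="2"/>'))         # attribute name repeated
print(is_well_formed("<a/><b/>"))                 # more than one root element
```

Only the first call should report True. Note that this checks well-formedness only; the standard library parser does not validate against a DTD.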

XML, text, and whitespace
  Adjacent non-tag chars are parsed as text nodes
  Parser never ignores whitespace
    Leading and trailing space is left with its text node
    Whitespace between tags produces whitespace-only text nodes
  Example:
    <foo> hi<bar>
     ho
    </bar> </foo>
  Resulting tree:
    foo
      " hi"
      bar
        "\n ho \n"
      " "

Example: Well-Formed XML
    <?xml version="1.0" standalone="yes"?>
    <platforms>
      <platform><name>X Box</name>
        <game><title>Halo</title>
          <price>59.99</price></game>
        <game><title>Crash Bandicoot</title>
          <price>49.99</price></game>
      </platform>
      <platform> </platform>
    </platforms>
  Root tag: platforms
  <platform> ... </platform>: tags surrounding a platform element
  <name>: a name subelement
  <game>: a game subelement
  Nesting rule for tags must be obeyed

Checking your XML
  http://validator.w3.org
  xmllint command on cdf
    By default, checks if well-formed
    --debug: outputs an annotated tree of the parsed document

Problems with well-formed XML
  If a program will process XML, good to know things like:
    What tags are allowed
    What order, nesting
    What attributes for each tag
    What's mandatory or optional
  A DTD specifies exactly this
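The whitespace behaviour is easy to observe with a parser. The sketch below uses Python's standard xml.etree.ElementTree, which keeps text in the .text and .tail attributes of elements rather than in separate text nodes; the exact line breaks inside the example string are an assumption based on the "\n ho \n" node shown on the slide.

```python
import xml.etree.ElementTree as ET

foo = ET.fromstring("<foo> hi<bar>\n ho \n</bar> </foo>")
bar = foo[0]

print(repr(foo.text))   # text inside foo, before <bar>
print(repr(bar.text))   # text inside bar, newlines and spaces kept
print(repr(bar.tail))   # whitespace between </bar> and </foo>
```

None of the whitespace is dropped: the single space after the closing bar tag survives as its own piece of text, which is why pretty-printed XML carries many whitespace-only text nodes.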

DOCUMENT TYPE DEFINITION (DTD)

Document type definition (DTD)
  Enforces more than well-formedness
    Which elements may (or must) appear where
    Attributes elements may (or must) have
    Types attributes and data must adhere to
  DTD is separate from the XML it constrains
    May be embedded in a separate section
    Most often referenced externally
  Validation: checking XML against its DTD(s)
    Important for interpreting/validating data
    Not necessary for parsing

DTD building blocks
  Elements (<an-element>...</an-element>)
    Must always close tags
    If no contents: <empty-element/>
  Attributes (<... an-attr="..." ...>)
  Entities ("special" tokens)
    e.g. &lt; &gt; &amp; &quot; &apos;
    HTML defines lots of others (e.g. &nbsp;)
    More on this later
  PCDATA (parsed character data)
    Mixed text and markup
    Use entities to escape >, etc. which should not be parsed
  CDATA ([non-parsed] character data)
    Plain text data
    Tags not parsed, entities not expanded

DTD elements
  <!ELEMENT $e ...>
    $e is the element name
  "..." may contain any of:
    Nothing: <!ELEMENT $e EMPTY>
    Anything: <!ELEMENT $e ANY>
    Text data: <!ELEMENT $e (#PCDATA)>
      Always parsed (#CDATA not allowed here)
    Child elements: <!ELEMENT $e (...)>
      Any child referenced must also be declared
      Child elements may themselves have children
    Mixed content: <!ELEMENT $e (#PCDATA...
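The difference between parsed and non-parsed character data can be seen by feeding the same characters to a parser twice, once as ordinary content and once wrapped in a CDATA section. This is a small sketch with Python's standard xml.etree.ElementTree; the note element is invented for the example.

```python
import xml.etree.ElementTree as ET

# Parsed character data: entities are expanded and <b> becomes a child element
parsed = ET.fromstring("<note>1 &lt; 2 &amp; <b>bold</b></note>")

# CDATA section: the same characters are kept verbatim as plain text
raw = ET.fromstring("<note><![CDATA[1 &lt; 2 &amp; <b>bold</b>]]></note>")

print(repr(parsed.text), len(parsed))
print(repr(raw.text), len(raw))
```

In the first document the note element ends up with one child; in the second it has none, because nothing inside the CDATA section was treated as markup.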

DTD elements: children
  Base construct: sequence (,)
    <!ELEMENT $e (a)>
    <!ELEMENT $e (a, b, c, ...)>
    Comma "," defines the order in which children must appear in the XML
  Either-or content (|)
    <!ELEMENT $e (a | b | ...)>
    Exactly one of the options must appear in the XML
  Constraining child cardinality
    <!ELEMENT $e (a, b+, c*, d?)>
    not followed by any of +, *, ?: exactly one (e.g., a)
    +: at least one (e.g., b)
    *: zero or more (e.g., c)
    ?: at most one (e.g., d)
  Sequences and either-or can both nest

DTD elements: Example
    <!ELEMENT resume (bio, interests, education, experience, awards, service)>
    <!ELEMENT bio (name, addr, phone, email?, fax?, url?)>
    <!ELEMENT interests (interest+)>
    <!ELEMENT education (degree*)>
    <!ELEMENT awards ((award | honor)*)>
    ...

DTD elements: Another Example
    <!DOCTYPE platforms [
      <!ELEMENT platforms (platform*)>
      <!ELEMENT platform (name, game+)>
      <!ELEMENT name (#PCDATA)>
      <!ELEMENT game (name, price)>
      <!ELEMENT price (#PCDATA)>
    ]>
  A platforms element has zero or more platform elements nested within
  A platform has one name and one or more game elements
  A game has a name and a price
  name and price are text

DTD Attributes
  <!ATTLIST $e $a $type $required>
    Declares an attribute $a on element $e
  $type may be any of
    character data: CDATA
    one of a set of values: (v1 | v2 | ...)
    unique identifier: ID
    references to one/many ID token(s) of other attributes: IDREF[S]
    XML name token (or list of name tokens): NMTOKEN[S]
    entity (or entities): ENTITY/ENTITIES
  $required may be
    required (not required): #REQUIRED (#IMPLIED)
    fixed value (always the same): #FIXED $value
    default value (used if none given): $value
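DTD content models behave like regular expressions over the sequence of child tags. The sketch below makes that concrete for the platforms DTD by translating each model to a Python regex by hand. It is not a DTD validator: the MODELS table and the children_match helper are invented for illustration, and text content is ignored.

```python
import re
import xml.etree.ElementTree as ET

# Hand-translated content models from the platforms DTD (hypothetical table).
# Every child tag is followed by one space so tag names cannot run together.
MODELS = {
    "platforms": r"(platform )*",   # (platform*)
    "platform": r"name (game )+",   # (name, game+)
    "game": r"name price ",         # (name, price)
}

def children_match(element):
    """Check one element's direct children against its content model."""
    sequence = "".join(child.tag + " " for child in element)
    return re.fullmatch(MODELS[element.tag], sequence) is not None

good = ET.fromstring(
    "<platform><name>X Box</name>"
    "<game><name>Halo</name><price>59.99</price></game></platform>"
)
bad = ET.fromstring("<platform><name>X Box</name></platform>")  # game+ needs one game

print(children_match(good), children_match(good[1]), children_match(bad))
```

The comma becomes concatenation, and +, * and ? keep their usual regex meaning, so a platform with a name but no game fails the check.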

DTD attributes: examples
    <!ATTLIST person
      sin     ID       #REQUIRED
      spouse  IDREF    #IMPLIED
      name    CDATA    "John Doe"
      trusted (yes | no) "no"
      species CDATA    #FIXED "homo sapiens"
      alive   (yes | no) #IMPLIED
    >

DTD attributes: ID[REF][S]
  ID attribute type
    Uniquely identifies an element in the document (like keys)
    Error if two elements have the same ID value
    Like HTML id attribute, but can have any name
  IDREF
    Refers to another element by ID (like foreign keys)
    Error if corresponding ID does not exist
    Like HTML href attribute, but no # needed
  IDREFS
    List of IDREF attributes, space separated
  #IMPLIED unless specified otherwise
  Problem: only one global set of IDs

Example: a DTD
    <!DOCTYPE PLATFORMS [
      <!ELEMENT PLATFORMS (PLATFORM*, GAME*)>
      <!ELEMENT PLATFORM (SELLS+)>
      <!ATTLIST PLATFORM name ID #REQUIRED>
      <!ELEMENT SELLS (#PCDATA)>
      <!ATTLIST SELLS thegame IDREF #REQUIRED>
      <!ELEMENT GAME EMPTY>
      <!ATTLIST GAME name ID #REQUIRED>
      <!ATTLIST GAME soldby IDREFS #IMPLIED>
    ]>

Example: The XML Document
    <PLATFORMS>
      <PLATFORM name="X-Box">
        <SELLS thegame="Halo">59.99</SELLS>
        <SELLS thegame="Crash-Bandicoot">49.99</SELLS>
      </PLATFORM>
      <GAME name="Halo" soldby="X-Box X-Box-360"/>
    </PLATFORMS>
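The ID/IDREF rules amount to a uniqueness check plus a lookup in one shared set. The following is a hand-rolled sketch of that check, not a DTD validator. The tables ID_ATTRS and IDREF_ATTRS are copied by hand from the DTD's ATTLIST declarations, the function name check_ids is made up, and the sample document is a variant written with the DTD's element and attribute names and hyphenated ID values, which is an assumption about the intended example.

```python
import xml.etree.ElementTree as ET

# (element, attribute) pairs declared as ID or IDREF/IDREFS in the DTD
ID_ATTRS = {("PLATFORM", "name"), ("GAME", "name")}
IDREF_ATTRS = {("SELLS", "thegame"), ("GAME", "soldby")}

def check_ids(root):
    """Return a list of ID/IDREF problems found in the tree."""
    seen, refs, problems = set(), [], []
    for el in root.iter():
        for attr, value in el.attrib.items():
            if (el.tag, attr) in ID_ATTRS:
                if value in seen:            # one global ID space for all elements
                    problems.append("duplicate ID: " + value)
                seen.add(value)
            elif (el.tag, attr) in IDREF_ATTRS:
                refs.extend(value.split())   # IDREFS is space separated
    problems += ["dangling IDREF: " + ref for ref in refs if ref not in seen]
    return problems

DOC = """
<PLATFORMS>
  <PLATFORM name="X-Box">
    <SELLS thegame="Halo">59.99</SELLS>
    <SELLS thegame="Crash-Bandicoot">49.99</SELLS>
  </PLATFORM>
  <GAME name="Halo" soldby="X-Box X-Box-360"/>
</PLATFORMS>
"""

problems = check_ids(ET.fromstring(DOC))
print(problems)
```

Because platforms and games draw their IDs from the same set, nothing stops a SELLS element from pointing at a platform instead of a game. That is the weak referential integrity the slides warn about.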

DTD entities
  The XML equivalent of #define
    <!ENTITY $name "$substituted-value">
    Can't take parameters, though
  Used just like other entities
    <politician-speak>
      I vow to lead the fight to stamp out &buzz-word; by
      instituting powerful new programs that will...
    </politician-speak>
  Pick your favorite substitution:
    <!ENTITY buzz-word "communism">
    <!ENTITY buzz-word "racism">
    <!ENTITY buzz-word "terrorism">
    <!ENTITY buzz-word "illegal file sharing">
  Not heavily used: better templating methods exist

Embedded vs. External DTD
  Specified as part of a document
    <?xml version="1.0"?>
    <!DOCTYPE Book [ ]>
    <Book> </Book>
  Reference to external (stand-alone) DTD
    <?xml version="1.0"?>
    <!DOCTYPE Book SYSTEM "http://csc343.com/book.dtd">
    <Book> </Book>

EXAMPLE: Embedded DTD
  The DTD:
    <?xml version="1.0" standalone="no"?>
    <!DOCTYPE PLATFORMS [
      <!ELEMENT PLATFORMS (PLATFORM*)>
      <!ELEMENT PLATFORM (NAME, GAME+)>
      <!ELEMENT NAME (#PCDATA)>
      <!ELEMENT GAME (NAME, PRICE)>
      <!ELEMENT PRICE (#PCDATA)>
    ]>
  The document:
    <PLATFORMS>
      <PLATFORM><NAME>X Box</NAME>
        <GAME><NAME>Halo</NAME>
          <PRICE>59.99</PRICE></GAME>
        <GAME><NAME>Crash Bandicoot</NAME>
          <PRICE>49.99</PRICE></GAME>
      </PLATFORM>
      <PLATFORM> </PLATFORM>
    </PLATFORMS>

EXAMPLE: External DTD
    <?xml version="1.0" standalone="no"?>
    <!DOCTYPE platforms SYSTEM "PLATFORM.dtd">
    <platforms>
      <platform><name>X Box</name>
        <game><title>Halo</title>
          <price>59.99</price></game>
        <game><title>Crash Bandicoot</title>
          <price>49.99</price></game>
      </platform>
      <platform> </platform>
    </platforms>
  Get the DTD from the file PLATFORM.dtd
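Entity substitution can be watched in action with an embedded DTD. This sketch relies on the fact that Python's standard xml.etree.ElementTree (via expat) expands entities declared in the internal DTD subset; the hyphenated names politician-speak and buzz-word are reconstructions of the slide text, and the sentence is shortened.

```python
import xml.etree.ElementTree as ET

SPEECH = """<!DOCTYPE politician-speak [
  <!ENTITY buzz-word "illegal file sharing">
]>
<politician-speak>stamp out &buzz-word; by instituting new programs</politician-speak>"""

speech = ET.fromstring(SPEECH)
print(speech.text)
```

The parser replaces the entity reference before the application sees the text, much as a C preprocessor expands a #define before compilation. Changing the single ENTITY declaration changes every use.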

Limitations of DTDs
  Don't understand namespaces
  Very limited typing (just strings and XML names)
  Very weak referential integrity
    All ID / IDREF / IDREFS share a single ID space
  Can't express unordered contents conveniently
    How to specify that a, b, c must all appear, but in any order?
  All element names are global
    Is <name> for people or companies?
    Can't declare both in the same DTD

XML Schema
  Designed to improve on DTDs
  Advantages:
    Integrated with namespaces
    Many built-in types
    User-defined types
    Has local element names
    Powerful key and referential constraints
  Disadvantages:
    Unwieldy, much more complex than DTDs
  We won't cover XML Schema in class

What is Next?
  XML Query Languages
    XPATH
    XQUERY
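To see why unordered content is inconvenient in a DTD, the sketch below generates the naive content model that lists every ordering explicitly. The function name any_order_model and the element name e are invented for the example.

```python
from itertools import permutations

def any_order_model(names):
    """Build a DTD content model that accepts every name once, in any order."""
    alternatives = ["(" + ", ".join(order) + ")" for order in permutations(names)]
    return "(" + " | ".join(alternatives) + ")"

model = any_order_model(["a", "b", "c"])
print("<!ELEMENT e " + model + ">")

# The number of alternatives is n!, so the model grows very quickly
sizes = [len(list(permutations(range(n)))) for n in range(1, 6)]
print(sizes)
```

Three children already need six alternatives and five need 120. In addition, the XML specification asks for deterministic content models, so a strict validator may reject this naive form and require the alternatives to be factored, for example as (a, ((b, c) | (c, b))) and so on.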