Structured vs. unstructured data. Semistructured data, XML, DTDs. Motivation for self-describing data




Semistructured data, XML, DTDs
Introduction to databases, CSCC43 Winter 2011, Ryan Johnson
Thanks to Ramona Truta and Renee Miller for material in these slides

Structured vs. unstructured data
- Databases are highly structured
  - Well-known data format: relations and tuples
  - Every tuple conforms to a known schema
  - Data independence? Woe unto you if you lose the schema
  - Irony: database cannot stand alone
- Plain text is unstructured
  - Cannot assume any predefined format
  - Apparent organization makes no guarantees
  - Self-describing: little external knowledge needed... but have to infer what the data means

Motivation for self-describing data
- Consider a C struct
  - Description:
        struct {
            int id;
            int type;
            char name[8];
            struct {
                double x;
                double y;
            } location;
        } shape;
  - Data at code level: {1, 101, "square", {1.5, 5.0}}
  - Data at byte-level: 0x0000000100000065 0x7371756172650000 0x3FF8000000000000 0x4014000000000000
  - Pointers? Nightmare!
- *Very* hard not to embed schema in logic

Enter semistructured data
- Observation: most data has some structure
  - Text: sentences, paragraphs, sections, ...
  - Books: chapters
  - Web pages: HTML
- Idea of semistructured data:
  - Enforce well-formatted data => Always know how to read/parse/manipulate it
  - Optionally, enforce well-structured data also => Might help us interpret the data, too
- Pro: highly portable
- Con: verbose/redundant
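The contrast between the byte-level struct and self-describing data can be sketched in Python. This is a minimal illustration, assuming a big-endian layout with no padding (which is what the byte-level dump above suggests); the layout string `SHAPE_LAYOUT` and the `<shape>` XML encoding are assumptions made for this example, not part of the slides.

```python
import struct
import xml.etree.ElementTree as ET

# Assumed layout mirroring the C struct, big-endian and unpadded:
# int id; int type; char name[8]; double x; double y;
SHAPE_LAYOUT = ">ii8sdd"

# Byte-level form: compact, but meaningless without SHAPE_LAYOUT.
packed = struct.pack(SHAPE_LAYOUT, 1, 101, b"square", 1.5, 5.0)
words = [packed[i:i + 8].hex().upper() for i in range(0, len(packed), 8)]
print(words)

# Decoding requires the schema (the layout string) to live in program logic.
ident, kind, raw_name, x, y = struct.unpack(SHAPE_LAYOUT, packed)
name = raw_name.rstrip(b"\x00").decode("ascii")

# Self-describing form (hypothetical encoding): field names travel with the data.
SHAPE_XML = (
    "<shape><id>1</id><type>101</type><name>square</name>"
    "<location><x>1.5</x><y>5.0</y></location></shape>"
)
root = ET.fromstring(SHAPE_XML)

# Collect the leaf fields without knowing the structure in advance.
leaf_fields = {child.tag: child.text for child in root if len(child) == 0}
print(leaf_fields)
```

Note that the XML version can be walked with no prior knowledge of its fields, but every value comes back as a string: what the fields mean, and what type they have, still has to be inferred or supplied separately.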

Why not use...
- HTML?
        <dt style="color:red">id<dd>1
        <dt>type<dd>101</dd>
        <dt>name<dd>square
        <dt>location<dd><dl>
        <dt>x<dd>1.5
        <dt>y</dt><dd>5</dd>
        </dl>
  - Pro: popular
  - Con: inconsistent, buggy
    - div, table, ul instead of dl?
    - Parsing is *hard*
  - Con: data+presentation
    - More like a query result
    - Fixed meaning for all tags
- JSON?
        {"id":1, "type":101, "name":"square", "location":{"x":1.5, "y":5}}
  - Pro: portable
  - Con: underspecified
    - e.g. can't constrain types
  - Con: schema still partly embedded
    - what do field names mean?

XML: designed for data interchange
    <books search-terms="database+design">
        <book>
            <title>Database Design for Mere Mortals</title>
            <author>Michael J. Hernandez</author>
            <date>13/03/2003</date>
        </book>
        <book id="B2">
            <title>Beginning Database Design</title>
            <subtitle>From Novice to Professional</subtitle>
            <author>Clare Churcher</author>
        </book>
    </books>

Features of XML
- Intentionally similar syntax to HTML (both descendants of SGML)
  - Tree-structured (hierarchical) format
  - Elements surrounded by opening and closing tags
  - Attributes embedded in opening tags
    => <tag-name attr-name="attribute">data</tag-name>
- But with important differences
  - Strictly well-formed (must close all tags, etc.)
  - Tag/attribute names carry no semantic meaning
  - Data-only format: no implied presentation
- Valid identifiers (for elements and attributes): [a-z][a-z]

XML terminology
    <?xml version="1.0"?>
    <PersonList Type="Student" Date="2002-02-02">
        <Title Value="Student List"/>
        <Person> </Person>
        <Person> </Person>
    </PersonList>
- Elements: PersonList, Title, Person
  - Elements are nested
  - Root element (PersonList) contains all others
  - Empty element: <Title Value="Student List"/>
- Element (or tag) names: PersonList, Title, Person
- Attributes: Type, Date, Value
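As a rough illustration of the tree-structured format, the book list above can be read with Python's standard `xml.etree.ElementTree` module. The variable names here are chosen for the example only; the XML is the slide's sample with its quotes and capitalization restored.

```python
import xml.etree.ElementTree as ET

BOOKS_XML = """\
<books search-terms="database+design">
  <book>
    <title>Database Design for Mere Mortals</title>
    <author>Michael J. Hernandez</author>
    <date>13/03/2003</date>
  </book>
  <book id="B2">
    <title>Beginning Database Design</title>
    <subtitle>From Novice to Professional</subtitle>
    <author>Clare Churcher</author>
  </book>
</books>
"""

root = ET.fromstring(BOOKS_XML)          # root element: <books>

# Attributes live on the opening tag and are read by name.
search_terms = root.get("search-terms")

# Child elements are found by tag name; their text is the data.
titles = [book.findtext("title") for book in root.findall("book")]

# Nothing forces every <book> to have the same shape: the id attribute
# and the <subtitle> element each appear on only one of the two books.
ids = [book.get("id") for book in root.findall("book")]
subtitles = [book.findtext("subtitle", default="") for book in root.findall("book")]

print(search_terms, titles, ids, subtitles)
```

The irregular shape of the two `<book>` elements is exactly what "semistructured" means here: the parser is happy with either, and it is up to the application (or a DTD) to decide whether a missing `id` is acceptable.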

XML terminology (cont.)
    <Person Name="John" Id="s111111111">
        John is a nice fellow
        <Address>
            <Number>21</Number>
            <Street>Main St.</Street>
        </Address>
    </Person>
- <Person ...>: opening tag; Person is the parent of Address and an ancestor of Number
- Content of Person: the text and nested elements between its tags
- "John is a nice fellow": standalone text, not very useful as data, non-uniform
- <Address>: nested element, child of Person
- <Number>: child of Address, descendant of Person
- </Person>: closing tag; what is open must be closed

Rules for well-formed XML
- Must have a root element
- Every opening tag must have a matching closing tag
- Elements must be properly nested
  - <foo><bar></foo></bar> is a no-no
- An attribute name can occur at most once in an opening tag. If it occurs:
  - It must have an explicitly specified value (boolean attrs, like in HTML, are not allowed)
  - The value must be quoted (with " or ')
- XML processors are not allowed to fix ill-formed documents (HTML browsers are expected to!)

Valid names in XML
- Simple rules for element/attribute names
  - may include letters (case sensitive!)
  - may include (but not start with) digits and most punctuation (_ may start a name)
  - no reserved words or keywords
- But lots of gotchas
  - Names must not start with xml (case insensitive)
  - Non-ASCII/latin letters: legal but not all parsers support them
  - Punctuation is iffy business (one exception: "-")
    - Entity characters always verboten: < > &
    - Spec recommends "_" instead of "-" (real life: the opposite is true)
    - ":" is reserved for namespaces (not enforced)
    - "." officially discouraged (real life: very rare)
    - "$" is not legal in names; often used for parameter substitution by XML processors (XQuery, etc.)
    - Other punctuation (@ # % ...) is not legal in names under XML 1.0
  - Upper case letters legal but fairly rare
    - All caps very rare (just like rest of Internet)
    - Often see book-list instead of camel case (e.g. BookList)
- Rule of thumb: lower case and "-" usually best

XML, text, and whitespace
- Adjacent non-tag chars parsed as text nodes
- Parser never ignores whitespace
  - Leading and trailing space left with its text node
  - Whitespace between tags produces whitespace-only text nodes
- Example:
        <foo> hi<bar>
         ho
        </bar> </foo>
  parses to: foo, text " hi", bar, text "\n ho\n"
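The well-formedness rules and the whitespace behaviour can be checked with a short Python sketch using the standard `xml.etree.ElementTree` parser. The helper name `is_well_formed` and the sample documents are made up for this example; the parser's refusal to repair bad input is the behaviour the rules above describe.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if the parser accepts text as a well-formed document."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


# One sample per rule (hypothetical documents).
checks = {
    "properly nested": "<foo><bar></bar></foo>",
    "crossed tags": "<foo><bar></foo></bar>",
    "unclosed tag": "<foo><bar></foo>",
    "two roots": "<foo/><bar/>",
    "duplicate attribute": '<foo a="1" a="2"/>',
    "unquoted value": "<foo a=1/>",
    "boolean attribute": "<foo checked/>",
}
results = {label: is_well_formed(doc) for label, doc in checks.items()}
for label, ok in results.items():
    print(f"{label}: {'ok' if ok else 'rejected'}")

# Whitespace is never dropped: it ends up in text nodes.
# ElementTree exposes text before the first child as .text and text
# following an element's closing tag as that element's .tail.
root = ET.fromstring("<foo> hi<bar>\n ho\n</bar>\n</foo>")
bar = root.find("bar")
text_nodes = [root.text, bar.text, bar.tail]
print(text_nodes)
```

Only the first sample is accepted; every other one raises a parse error rather than being silently fixed up, in contrast to how HTML browsers treat sloppy markup.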

Document type definition (DTD)
- Enforces more than "well-formed"-ness
  - Which elements may (or must) appear where
  - Attributes elements may (or must) have
  - Types attributes and data must adhere to
- DTD separate from the XML it constrains
  - May be embedded in a separate section
  - Most often referenced externally
- Validation: checking XML against its DTD(s)
  - Important for interpreting/validating data
  - Not necessary for parsing

Embedded vs. external DTD
- Specified as part of a document
        <?xml version="1.0"?>
        <!DOCTYPE Book [ ]>
        <Book> </Book>
- Reference to external (stand-alone) DTD
        <?xml version="1.0"?>
        <!DOCTYPE Book SYSTEM "http://csc343.com/book.dtd">
        <Book> </Book>

DTD building blocks
- Elements (<an-element>...</an-element>)
  - Must always close tags
  - If no contents: <empty-element/>
- Attributes (<... an-attr="..." ...>)
- Entities ("special" tokens)
  - e.g. &lt; &gt; &amp; &quot; &apos;
  - HTML defines lots of others (e.g. &nbsp;)
  - More on this later
- PCDATA (parsed character data)
  - Mixed text and markup
  - Use entities to escape >, etc. which should not be parsed
- CDATA ([non-parsed] character data)
  - Plain text data
  - Tags not parsed, entities not expanded

DTD elements
- <!ELEMENT $e ...>
- May contain any of
  - Nothing: <!ELEMENT $e EMPTY>
  - Anything: <!ELEMENT $e ANY>
  - Text data: <!ELEMENT $e (#PCDATA)>
    - Always parsed (#CDATA not allowed here)
  - Child elements: <!ELEMENT $e (...)>
    - Any child referenced must also be declared
    - Child elements may themselves have children
  - Mixed content: <!ELEMENT $e (#PCDATA...
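The point that validation is not necessary for parsing, and the difference between parsed and non-parsed character data, can be illustrated with Python's standard `xml.etree.ElementTree`, which is a non-validating parser. The document below is a hypothetical example written for this sketch: its embedded DTD says `Book` must contain exactly one `Title`, yet the body contains `Note` elements instead.

```python
import xml.etree.ElementTree as ET

# Hypothetical document that is well-formed but violates its own DTD.
DOC = """\
<?xml version="1.0"?>
<!DOCTYPE Book [
  <!ELEMENT Book (Title)>
  <!ELEMENT Title (#PCDATA)>
]>
<Book>
  <Note>if a &lt; b &amp;&amp; b &gt; c</Note>
  <Note><![CDATA[if a < b && b > c, see <Title>]]></Note>
</Book>
"""

# A non-validating parser reads the DTD but does not enforce it,
# so this succeeds even though <Note> is never declared.
root = ET.fromstring(DOC)
child_tags = [child.tag for child in root]

# First note: parsed character data, entities are expanded.
# Second note: a CDATA section, taken verbatim (the <Title> inside
# it is plain text, not a tag).
escaped, cdata = (note.text for note in root.findall("Note"))

print(child_tags)
print(escaped)
print(cdata)
```

Enforcing the DTD requires a validating parser, which the Python standard library does not provide; third-party libraries fill that gap.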

DTD elements: children
- Base construct: sequence
  - <!ELEMENT $e (a)>
  - <!ELEMENT $e (a, b, c, ...)>
  - Children in XML must appear in DTD declaration order
- Either-or content
  - <!ELEMENT $e (a | b | ...)>
  - Exactly one of the options must appear in the XML
- Constraining child cardinality
  - <!ELEMENT $e (a, b+, c*, d?)>
  - exactly one (a), at least one (b), zero or more (c), at most one (d)
- Sequences and either-or can both nest

DTD elements: examples
    <!ELEMENT resume (bio, interests, education, experience, awards, service)>
    <!ELEMENT bio (name, addr, phone, email?, fax?, url?)>
    <!ELEMENT interests (interest+)>
    <!ELEMENT education (degree*)>
    <!ELEMENT awards ((award | honor)*)>
    ...

DTD attributes
- <!ATTLIST $e $a $type $required>
  - Declares an attribute $a on element $e
- $type may be any of
  - character data: CDATA
  - one of a set of values: (v1 | v2 | ...)
  - unique identifier (or reference to one/many): ID[REF][S]
  - valid xml name (or list of names): NMTOKEN[S]
  - entity (or entities): ENTITY/ENTITIES
- $required may be
  - required (not required): #REQUIRED/#IMPLIED
  - fixed value (always the same): #FIXED "$value"
  - default value (used if none given): "$value"
  - #IMPLIED unless specified otherwise

DTD attributes: examples
    <!ATTLIST person
        sin     ID        #REQUIRED
        spouse  IDREF     #IMPLIED
        name    CDATA     "John Doe"
        trusted (yes | no) "no"
        species CDATA     #FIXED "homo sapiens"
        alive   (yes | no) #IMPLIED
    >
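One way to read a DTD content model is as a regular expression over the sequence of child element names. The Python sketch below translates models like the ones above into regexes; the helper names `content_model_to_regex` and `children_match` are invented for this example. It is an illustration of the idea rather than a validator: it ignores #PCDATA, EMPTY, ANY and attribute declarations.

```python
import re


def content_model_to_regex(model):
    """Translate a DTD content model such as '(a, b+, c*, d?)' to a regex.

    Each element name becomes a group matching 'name ' (name plus a
    space), so a list of child tags can be joined into one string.
    """
    parts = []
    for token in re.findall(r"[A-Za-z_][\w.-]*|[(),|*+?]", model):
        if token == "(":
            parts.append("(?:")
        elif token == ",":
            continue  # a sequence is plain concatenation in a regex
        elif token in {")", "|", "*", "+", "?"}:
            parts.append(token)  # same meaning in DTDs and regexes
        else:
            parts.append("(?:" + re.escape(token) + " )")
    return re.compile("".join(parts))


def children_match(model, child_tags):
    """Check whether a list of child tag names satisfies the model."""
    sequence = "".join(tag + " " for tag in child_tags)
    return content_model_to_regex(model).fullmatch(sequence) is not None


BIO = "(name, addr, phone, email?, fax?, url?)"
AWARDS = "((award | honor)*)"

print(children_match(BIO, ["name", "addr", "phone", "url"]))   # optional parts skipped
print(children_match(BIO, ["addr", "name", "phone"]))          # wrong order
print(children_match(AWARDS, ["honor", "award", "honor"]))     # any mix, any count
```

The correspondence is direct because `,` is sequencing, `|` is alternation, and `?`, `*`, `+` carry the same meaning in both notations. It also makes the ordering restriction visible: nothing in this notation says "a, b and c in any order" without spelling out every permutation.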

DTD attributes: ID[REF][S]
- ID attribute type
  - Uniquely identifies an element in the document
  - Error to have two
  - Like HTML id attribute, but can have any name
- IDREF
  - Refers to another element by ID
  - Error if corresponding ID does not exist
  - Like HTML href attribute, but no # needed
- IDREFS
  - List of IDREF attributes, separated by whitespace
- Problem: only one global set of IDs

DTD entities
- The XML equivalent of #define
  - <!ENTITY $name "$substituted-value">
  - Can't take parameters, though
- Used just like other entities
        <politician-speak>
        I vow to lead the fight to stamp out &buzz-word; by instituting
        powerful new programs that will...
        </politician-speak>
- Pick your favorite substitution:
        <!ENTITY buzz-word "communism">
        <!ENTITY buzz-word "racism">
        <!ENTITY buzz-word "terrorism">
        <!ENTITY buzz-word "illegal file sharing">
- Not heavily used: better templating methods exist

Limitations of DTDs
- Don't understand namespaces
- Very limited typing (just strings and xml names)
- Very weak referential integrity
  - All ID / IDREF / IDREFS use global ID space
- Can't express unordered contents conveniently
  - How to specify that a, b, c must all appear, but in any order?
- All element names are global
  - Is <name> for people or companies?
  - can't declare both in the same DTD

XML Schema
- Designed to improve on DTDs
- Advantages:
  - Integrated with namespaces
  - Many built-in types
  - User-defined types
  - Has local element names
  - Powerful key and referential constraints
- Disadvantages:
  - Unwieldy, much more complex than DTDs
- We won't cover XML schema in class
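The ID/IDREF rules amount to two checks over a single, document-wide set of identifiers. The Python sketch below performs them by hand with the standard `xml.etree.ElementTree` module; because that parser does not validate, the caller has to say which attributes play the ID and IDREF roles. The function name `check_references` and the `<people>` documents are hypothetical, with `sin` and `spouse` borrowed from the attribute example earlier.

```python
import xml.etree.ElementTree as ET


def check_references(root, id_attr, idref_attrs):
    """Return a list of ID/IDREF problems found under root.

    id_attr names the attribute declared as ID; idref_attrs names the
    attributes declared as IDREF or IDREFS (whitespace-separated lists).
    """
    problems = []
    seen = set()  # one global ID space for the whole document

    # Rule 1: no two elements may carry the same ID value.
    for element in root.iter():
        value = element.get(id_attr)
        if value is None:
            continue
        if value in seen:
            problems.append(f"duplicate ID {value!r}")
        seen.add(value)

    # Rule 2: every reference must point at an existing ID.
    for element in root.iter():
        for attr in idref_attrs:
            for ref in element.get(attr, "").split():
                if ref not in seen:
                    problems.append(f"dangling {attr} {ref!r}")
    return problems


GOOD = '<people><person sin="p1" spouse="p2"/><person sin="p2" spouse="p1"/></people>'
BAD = '<people><person sin="p1" spouse="p9"/><person sin="p1"/></people>'

good_problems = check_references(ET.fromstring(GOOD), "sin", ["spouse"])
bad_problems = check_references(ET.fromstring(BAD), "sin", ["spouse"])
print(good_problems)
print(bad_problems)
```

The single `seen` set is the weakness noted above: a `spouse` reference is satisfied by any element with a matching ID, whether or not that element is a person, which is why the referential integrity DTDs offer is described as very weak.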